Literature Review
Reimagined
Highlight passages and ask questions as you read. LitReviewer helps you understand papers the way researchers actually do: through discussion tied directly to the page.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Neural Information Processing Systems (NeurIPS), 2017
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.
1 Introduction
Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems.
Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures. Recurrent models typically factor computation along the symbol positions of the input and output sequences.
Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_(t-1) and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization.
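The operation the paper builds on is scaled dot-product attention, softmax(QKᵀ / √d_k) V. A minimal single-head NumPy sketch of that formula (illustrative only: toy shapes, no masking, none of the paper's multi-head projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to every key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V                              # weighted sum of value vectors

# Toy shapes: a 4-token sequence with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)         # shape (4, 8)
```

Stacking several such heads with learned projections, plus positional encodings, is what lets the Transformer drop recurrence while still modeling long-range dependencies.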
Annotation
Opening setup — mark the baseline assumptions before the paper introduces the Transformer.
Discussion
Can I reproduce this? (2 replies)
Technically yes, the paper's results are reproducible, but you would need a lot of compute power, at least 4 H100 GPUs.
Guide
The Transformer Architecture
The guide breaks down the Transformer step by step.
Trusted by Authors At
Before & After
No more copying quotes into ChatGPT. Highlight anything, ask the AI, and keep reading — without leaving the page.
The old way
Download the PDF
Open ChatGPT in another tab
Copy-paste the confusing paragraph
Ask your question
Copy the answer somewhere
Switch back to the PDF
Lose your place
Open a new chat when context drifts
With LitReviewer
Highlight any passage
Ask the AI — it responds in the margin
Keep reading
Your Research, Mapped
Watch your field resolve into patterns, clusters, and momentum.
Map the progression of your field.
Follow the arrows to trace your way through your papers, chronologically.
Context in the palm of your hand.
Status is built into the view, so your queue and your understanding live together.
Because subcategories always emerge.
Group related papers by meaning, not just by who cited whom, so clusters reveal the shape of a field.
Transparent Pricing
The visible number stays simple. The glass reveals the exact formula behind it.
Estimate Ledger
Typical reading costs
Drag the glass to inspect a card
Computational Resource
| Item | Description | Subtotal |
|---|---|---|
| arXiv import | LaTeX source — fast | ~$0.05 |
| PDF upload | PDF parsing — slightly slower | ~$0.06 |
Provider Tokens
| Length | Description | Subtotal |
|---|---|---|
| ~500 tokens | Quick passage answer | $0.003 |
| ~3K tokens | Deep multi-section reply | $0.015 |
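To see how a ledger adds up, here is a back-of-the-envelope estimate built only from the card figures above. The per-item rates are the illustrative numbers shown on this page, not a quote:

```python
# Rough cost of reading one paper, using the figures from the cards above.
ARXIV_IMPORT = 0.05    # one-time import of the LaTeX source
QUICK_ANSWER = 0.003   # ~500-token passage answer
DEEP_REPLY   = 0.015   # ~3K-token multi-section reply

def reading_cost(imports=1, quick=6, deep=2):
    return imports * ARXIV_IMPORT + quick * QUICK_ANSWER + deep * DEEP_REPLY

print(f"${reading_cost():.3f}")  # 0.05 + 6*0.003 + 2*0.015 = $0.098
```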
Why it works
We don't hide the bill inside the product.
LitReviewer treats billing as pass-through infrastructure. The estimate is the user-facing number. The formula underneath it is the proof.
Stripe cut
2.9% + $0.30
Deducted once when you top up. After that, the balance is yours to use. We do not take a second cut on top of it.
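A quick worked example of that formula, assuming the fee is deducted from the amount you top up (the billing page has the exact methodology):

```python
def stripe_fee(top_up):
    """Stripe's card fee: 2.9% of the charge plus $0.30, taken once per top-up."""
    return round(0.029 * top_up + 0.30, 2)

top_up = 20.00
credited = top_up - stripe_fee(top_up)  # assumption: fee comes out of the top-up
print(credited)                          # 20.00 - 0.88 = 19.12 available to spend
```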
Exact provider rates and billing methodology live on the billing page.