Literature Review
Reimagined
Highlight passages and ask questions as you read. LitReviewer helps you understand papers the way researchers actually do: through discussion tied directly to the page.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Neural Information Processing Systems (NeurIPS), 2017
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.
1 Introduction
Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems.
Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures. Recurrent models typically factor computation along the symbol positions of the input and output sequences.
Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_(t-1) and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization.
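The operation the paper builds on is scaled dot-product attention, softmax(QKᵀ / √d_k) V. A minimal single-head NumPy sketch of that formula (illustrative only: toy shapes, no masking, none of the paper's multi-head projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to every key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V                              # weighted sum of value vectors

# Toy shapes: a 4-token sequence with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)         # shape (4, 8)
```

Stacking several such heads with learned projections, plus positional encodings, is what lets the Transformer drop recurrence while still modeling long-range dependencies.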
Annotation
Opening setup — mark the baseline assumptions before the paper introduces the Transformer.
Discussion
Can I reproduce this? (2 replies)
Technically yes, the paper's results are reproducible, but you would need a lot of compute power, at least 4 H100 GPUs.
Guide
The Transformer Architecture
The guide breaks down the Transformer step by step.
Trusted by Authors At
Before & After
No more copying quotes into ChatGPT. Highlight anything, ask the AI, and keep reading — without leaving the page.
The old way
Download the PDF
Open ChatGPT in another tab
Copy-paste the confusing paragraph
Ask your question
Copy the answer somewhere
Switch back to the PDF
Lose your place
Open a new chat when context drifts
With LitReviewer
Highlight any passage
Ask the AI — it responds in the margin
Keep reading
Your Research, Mapped
Watch your field resolve into patterns, clusters, and momentum.
Map the progression of your field.
Follow the arrows to trace your way through your papers, chronologically.
Context in the palm of your hand.
Status is built into the view, so your queue and your understanding live together.
Because subcategories always emerge.
Group related papers by meaning, not just by who cited whom, so clusters reveal the shape of a field.
Transparent Pricing
The visible number stays simple. The glass reveals the exact formula behind it.
Estimate Ledger
Typical reading costs
Drag the glass to inspect a card
Computational Resource
| Item | Description | Subtotal |
|---|---|---|
| arXiv import | LaTeX source — fast | ~$0.05 |
| PDF upload | PDF parsing — slightly slower | ~$0.06 |
Provider Tokens
| Length | Description | Subtotal |
|---|---|---|
| ~500 tokens | Quick passage answer | $0.003 |
| ~3K tokens | Deep multi-section reply | $0.015 |
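To see how a ledger adds up, here is a back-of-the-envelope estimate built only from the card figures above. The per-item rates are the illustrative numbers shown on this page, not a quote:

```python
# Rough cost of reading one paper, using the figures from the cards above.
ARXIV_IMPORT = 0.05    # one-time import of the LaTeX source
QUICK_ANSWER = 0.003   # ~500-token passage answer
DEEP_REPLY   = 0.015   # ~3K-token multi-section reply

def reading_cost(imports=1, quick=6, deep=2):
    return imports * ARXIV_IMPORT + quick * QUICK_ANSWER + deep * DEEP_REPLY

print(f"${reading_cost():.3f}")  # 0.05 + 6*0.003 + 2*0.015 = $0.098
```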
Why it works
We don't hide the bill inside the product.
LitReviewer treats billing as pass-through infrastructure. The estimate is the user-facing number. The formula underneath it is the proof.
Stripe cut
2.9% + $0.30
Deducted once when you top up. After that, the balance is yours to use. We do not take a second cut on top of it.
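A quick worked example of that formula, assuming the fee is deducted from the amount you top up (the billing page has the exact methodology):

```python
def stripe_fee(top_up):
    """Stripe's card fee: 2.9% of the charge plus $0.30, taken once per top-up."""
    return round(0.029 * top_up + 0.30, 2)

top_up = 20.00
credited = top_up - stripe_fee(top_up)  # assumption: fee comes out of the top-up
print(credited)                          # 20.00 - 0.88 = 19.12 available to spend
```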
Exact provider rates and billing methodology live on the billing page.