Attention Mechanism
The Bottleneck Problem
In the previous chapters, we built sequence-to-sequence models using encoder-decoder LSTMs. But there’s a fundamental problem: the entire source sequence is compressed into a single fixed-size vector. This is like asking someone to summarize a 300-page novel in a single tweet, and then expecting another person to reconstruct the entire plot from that tweet.
Evidence of the Problem
The Attention Solution
Intuition: Looking Back at the Source
Instead of forcing the decoder to use only the final encoder state, let it look at all encoder states and focus on the relevant ones. When translating “dog” to “chien”:
- Look at all encoder states
- Focus attention on the state corresponding to “dog”
- Use that information to generate “chien”
Key Insight: Attention computes a weighted combination of all encoder states, where weights indicate relevance to the current decoding step.
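The weighted combination described above can be sketched in a few lines of plain Python. The encoder states and relevance scores below are made-up toy values, and the helper names `softmax` and `context_vector` are illustrative rather than taken from any library.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_states):
    """Weighted sum of encoder states: sum_i weights[i] * encoder_states[i]."""
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_states))
            for d in range(dim)]

# Toy encoder states for the source tokens "the", "dog", "barks" (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Hypothetical relevance scores while generating "chien": "dog" scores highest.
scores = [0.1, 3.0, 0.2]

weights = softmax(scores)
context = context_vector(weights, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Because "dog" receives most of the weight, the context vector ends up close to the second encoder state.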
Attention Mechanisms in Detail
Dot-Product Attention
The simplest form of attention scores each query against each key with a dot product, then uses the softmax of those scores to average the values:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Why divide by $\sqrt{d_k}$? This is not an arbitrary choice: it is essential for numerical stability. When $d_k$ is large (e.g., 512), each dot product is the sum of 512 terms. If the entries of $q$ and $k$ are independent with zero mean and unit variance, the dot product has variance $d_k$ (variances of independent terms add, so the sum of 512 unit-variance terms has variance 512). So the raw scores have standard deviation $\sqrt{d_k} \approx 22.6$. Softmax applied to values this large pushes almost all the probability mass onto a single key: the attention becomes nearly a hard argmax, and its gradient effectively vanishes.
Dividing by $\sqrt{d_k}$ rescales the scores to unit variance, keeping the softmax in its “useful” regime where gradients flow to multiple keys. Without this scaling, training larger models would be dramatically harder. This is one of those small details that separates a paper implementation from a working one.
Visualizing Attention Weights
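A minimal sketch of scaled dot-product attention in plain Python, with a crude text view of the attention weights. It assumes random Gaussian queries and keys with 512 dimensions; the `scale` flag and the `show` helper are illustrative additions so the scaled and unscaled cases can be compared side by side.

```python
import math
import random

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, scale=True):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    divisor = math.sqrt(d_k) if scale else 1.0
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / divisor for k in keys]
        weights.append(softmax(scores))
    outputs = [[sum(w * v[d] for w, v in zip(row, values))
                for d in range(len(values[0]))] for row in weights]
    return outputs, weights

def show(weights):
    """Plain-text heatmap: one row per query, denser glyph = larger weight."""
    shades = " .:*#"
    for row in weights:
        print("".join(shades[min(int(w * len(shades)), len(shades) - 1)]
                      for w in row))

rng = random.Random(0)
d_k = 512
queries = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(4)]
keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(6)]
values = keys  # reuse the keys as values to keep the toy example small

_, scaled = attention(queries, keys, values, scale=True)
_, unscaled = attention(queries, keys, values, scale=False)
print("largest weight per query, scaled:  ", [round(max(r), 2) for r in scaled])
print("largest weight per query, unscaled:", [round(max(r), 2) for r in unscaled])
show(scaled)
```

Dividing the scores by a constant greater than one can only flatten the softmax, so the largest weight in each unscaled row is never smaller than in the matching scaled row. With 512-dimensional inputs the unscaled rows typically collapse onto a single key.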
Types of Attention
Additive (Bahdanau) Attention
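A minimal sketch of the additive score, assuming the usual form $v^\top \tanh(W_s s + W_h h)$ with a decoder state $s$ and an encoder state $h$. The weights here are random rather than trained, and the names `W_s`, `W_h`, `v`, and `additive_score` are illustrative.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_score(decoder_state, encoder_state, W_s, W_h, v):
    """score = v^T tanh(W_s s + W_h h): a one-hidden-layer feed-forward scorer."""
    projected = [a + b for a, b in zip(matvec(W_s, decoder_state),
                                       matvec(W_h, encoder_state))]
    return sum(vi * math.tanh(p) for vi, p in zip(v, projected))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
state_dim, attn_dim = 4, 3

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Untrained, random parameters: in a real model these are learned.
W_s = rand_matrix(attn_dim, state_dim)
W_h = rand_matrix(attn_dim, state_dim)
v = [rng.uniform(-1, 1) for _ in range(attn_dim)]

decoder_state = [rng.gauss(0, 1) for _ in range(state_dim)]
encoder_states = [[rng.gauss(0, 1) for _ in range(state_dim)] for _ in range(5)]

scores = [additive_score(decoder_state, h, W_s, W_h, v) for h in encoder_states]
weights = softmax(scores)
print([round(w, 3) for w in weights])
```

Because `tanh` is bounded, each score stays within the sum of the absolute values of `v`, which is one reason this form needs no extra scaling factor.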
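The original attention mechanism from the 2014 paper (Bahdanau et al.) scores each encoder state $h_j$ against the previous decoder state $s_{i-1}$ with a small feed-forward network: $e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$.
Multiplicative (Luong) Attention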
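A minimal sketch of two multiplicative score functions, assuming the commonly cited "dot" and "general" forms: $s^\top h$ and $s^\top W h$. The function names and toy vectors are illustrative.

```python
def dot_score(decoder_state, encoder_state):
    """Dot score: s^T h. Needs both states to have the same dimension."""
    return sum(a * b for a, b in zip(decoder_state, encoder_state))

def general_score(decoder_state, encoder_state, W):
    """General score: s^T W h, with a learned matrix W between the states."""
    projected = [sum(w * x for w, x in zip(row, encoder_state)) for row in W]
    return dot_score(decoder_state, projected)

s = [0.5, -1.0, 2.0]   # toy decoder state
h = [1.0, 0.0, 0.5]    # toy encoder state
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

print(dot_score(s, h))
print(general_score(s, h, identity))
```

With `W` set to the identity the general score reduces to the dot score. Both need only multiplications and additions, with no `tanh` and no extra hidden layer, which is where the speed advantage comes from.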
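Simpler and faster variants score with a dot product instead of a feed-forward network: the dot score $s^\top h$ and the general score $s^\top W h$, where $W$ is a learned matrix.
Self-Attention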
Attending to the Same Sequence
Self-attention allows each position in a sequence to attend to all other positions: the queries, keys, and values are all projections of the same sequence.
Why Self-Attention Matters
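A minimal sketch of self-attention in plain Python, assuming random (untrained) projection matrices `W_q`, `W_k`, and `W_v`. The helper names are illustrative. The point to notice is that queries, keys, and values are all computed from the same input `x`, and that the weight matrix connects every position to every other position in a single step.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, W_q, W_k, W_v):
    """Queries, keys, and values are all projections of the same sequence x."""
    q, k, v = matmul(x, W_q), matmul(x, W_k), matmul(x, W_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scores = matmul(q, k_t)                       # shape: seq_len x seq_len
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

rng = random.Random(0)
seq_len, d_model = 5, 4

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(seq_len, d_model)                 # toy token representations
W_q, W_k, W_v = (rand_matrix(d_model, d_model) for _ in range(3))

output, weights = self_attention(x, W_q, W_k, W_v)
for row in weights:
    print([round(w, 2) for w in row])
```

Every entry of `weights` is strictly positive, so the first token can draw directly on the last one without passing through the positions in between, unlike an RNN.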
Multi-Head Attention
Multiple Attention “Perspectives”
Instead of one attention function, use multiple “heads” that each learn different relationships.
Why multiple heads instead of one big attention? A single attention head can only compute one weighted average per query position: it assigns a single scalar relevance to each key. But language has multiple simultaneous relationships: “it” relates to “animal” syntactically (coreference), to “tired” semantically (attribute), and to “cross” structurally (subject-verb). Multiple heads let the model maintain all these relationships simultaneously, with each head free to specialize in a different type of dependency.
The extra cost is negligible: if the model dimension is $d_{\text{model}}$ and you use 8 heads, each head operates in a $d_{\text{model}}/8$-dimensional subspace. The total computation is roughly the same as a single head with $d_k = d_{\text{model}}$, but you get 8 independent attention patterns instead of 1. The output projection then learns how to combine these perspectives. This is one of the best “free lunches” in deep learning architecture design.
Visualizing Multi-Head Attention
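A minimal sketch of multi-head attention that also prints each head's weight matrix. It assumes toy sizes (`d_model = 8`, 2 heads) and random, untrained projections; `multi_head_attention` and the matrix names are illustrative.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    q, k, v = matmul(x, W_q), matmul(x, W_k), matmul(x, W_v)
    head_outputs, head_weights = [], []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)   # this head's subspace
        q_h = [row[cols] for row in q]
        k_h = [row[cols] for row in k]
        v_h = [row[cols] for row in v]
        scores = matmul(q_h, [list(c) for c in zip(*k_h)])
        weights = [softmax([s / math.sqrt(d_head) for s in row])
                   for row in scores]
        head_weights.append(weights)
        head_outputs.append(matmul(weights, v_h))
    # Concatenate the heads position by position, then mix with W_o.
    concat = [[val for head in head_outputs for val in head[i]]
              for i in range(len(x))]
    return matmul(concat, W_o), head_weights

rng = random.Random(0)
seq_len, d_model, num_heads = 3, 8, 2

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(seq_len, d_model)
W_q, W_k, W_v, W_o = (rand_matrix(d_model, d_model) for _ in range(4))

output, head_weights = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
for h, weights in enumerate(head_weights):
    print("head", h)
    for row in weights:
        print("  ", [round(w, 2) for w in row])
```

The projections are the same size as in the single-head case. The only change is that the softmax is applied separately within each slice of columns, which is why each head can produce its own attention pattern at roughly no extra cost.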
Positional Encoding
The Problem: Attention is Permutation-Invariant
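A minimal sketch that makes the problem concrete, assuming self-attention with identity projections (queries, keys, and values are the inputs themselves). Shuffling the tokens produces exactly the same output vectors, just in the shuffled order, so nothing in the result depends on where a token sat in the sequence. Strictly speaking this property is called permutation equivariance.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Self-attention with identity projections: queries = keys = values = x."""
    d = len(x[0])
    out = []
    for q in x:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                           for k in x])
        out.append([sum(w * v[j] for w, v in zip(weights, x))
                    for j in range(d)])
    return out

rng = random.Random(0)
tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
order = [3, 0, 4, 1, 2]                      # an arbitrary shuffle
shuffled = [tokens[i] for i in order]

original_out = self_attention(tokens)
shuffled_out = self_attention(shuffled)

# Position i of the shuffled run holds token order[i] of the original run.
same_up_to_order = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, src in enumerate(order)
    for a, b in zip(shuffled_out[i], original_out[src])
)
print(same_up_to_order)
```

The comparison uses a small tolerance because the sums are accumulated in a different order after shuffling.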
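Unlike RNNs, self-attention has no inherent notion of position: shuffling the input tokens just shuffles the outputs in the same way (strictly speaking, the layer is permutation-equivariant), so word order is invisible to the model unless position information is added.
Sinusoidal Positional Encoding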
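A minimal sketch of the sinusoidal encoding, assuming the formulation from the original Transformer paper: sine on even dimensions, cosine on odd dimensions, and wavelengths set by a base of 10000. The function name `sinusoidal_encoding` and the toy sizes are illustrative.

```python
import math

def sinusoidal_encoding(max_len, d_model):
    """Return a max_len x d_model table of position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, d_model, 2):        # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))       # dimension i
            row.append(math.cos(angle))       # dimension i + 1
        table.append(row[:d_model])           # trim if d_model is odd
    return table

pe = sinusoidal_encoding(max_len=50, d_model=8)
print([round(v, 3) for v in pe[0]])
print([round(v, 3) for v in pe[1]])

# The encoding is added to the token embeddings, element by element.
token_embedding = [0.1] * 8                   # toy embedding for position 1
with_position = [t + p for t, p in zip(token_embedding, pe[1])]
```

Position 0 alternates between `sin(0) = 0` and `cos(0) = 1`. The table is computed rather than learned, so it can be generated for any sequence length.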
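The original Transformer uses sinusoidal functions of the position $pos$ and the dimension index $i$: $PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$ and $PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$.
Learned Positional Embeddings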
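A minimal sketch of a learned position table, assuming one vector per position that is added to the token embedding. The class `LearnedPositions` is a hypothetical helper, and training is not shown: in a real model the table entries would be updated by gradient descent like any other embedding.

```python
import random

class LearnedPositions:
    """A lookup table with one trainable vector per position (hypothetical)."""

    def __init__(self, max_len, d_model, seed=0):
        rng = random.Random(seed)
        # Small random initial values; training would adjust these.
        self.table = [[rng.gauss(0, 0.02) for _ in range(d_model)]
                      for _ in range(max_len)]

    def add_to(self, token_embeddings):
        """Add the vector for position i to the i-th token embedding."""
        if len(token_embeddings) > len(self.table):
            raise ValueError("sequence is longer than the learned table")
        return [[t + p for t, p in zip(tok, self.table[pos])]
                for pos, tok in enumerate(token_embeddings)]

positions = LearnedPositions(max_len=16, d_model=4)
tokens = [[0.5, 0.5, 0.5, 0.5] for _ in range(3)]   # three identical tokens
encoded = positions.add_to(tokens)
print(encoded[0] != encoded[1])   # identical tokens now differ by position
```

The `ValueError` highlights the main trade-off: a learned table has no entry for positions beyond `max_len`, so it cannot handle sequences longer than those it was sized for.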
An alternative is to learn position embeddings the same way as word embeddings: one trainable vector per position, added to the token embedding.
Attention with Masking
Causal (Autoregressive) Mask
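A minimal sketch of a causal mask in plain Python, assuming the common approach of setting disallowed scores to negative infinity before the softmax. The score matrix holds made-up values, and `causal_mask` and `masked_softmax` are illustrative names.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; masked positions get weight 0."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    peak = max(masked)                    # finite if any position is allowed
    exps = [math.exp(s - peak) for s in masked]   # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """Row i may attend to columns 0..i (itself and earlier positions)."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Made-up raw attention scores for a 4-token sequence.
scores = [[0.5, 1.0, -0.3, 2.0],
          [1.2, 0.1, 0.7, -1.0],
          [0.0, 0.4, 0.9, 0.3],
          [2.0, -0.5, 0.6, 1.1]]

mask = causal_mask(4)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
for row in weights:
    print([round(w, 2) for w in row])
```

The result is lower-triangular: the first position can only attend to itself, and each later position sees one more token than the one before it.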
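For language models, we need to prevent each position from attending to future positions, so the scores for keys at later positions are set to $-\infty$ before the softmax.
Padding Mask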
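A minimal sketch of a padding mask, assuming a batch of sequences padded on the right to a common length and a list of true lengths (each at least 1). The scores are made-up values, and `padding_mask` and `masked_softmax` are illustrative names.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; masked positions get weight 0."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    peak = max(masked)                    # assumes at least one allowed key
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

def padding_mask(lengths, max_len):
    """True for real tokens, False for padding, one row per sequence."""
    return [[pos < length for pos in range(max_len)] for length in lengths]

lengths = [4, 2, 3]          # true lengths of three sequences padded to 4
mask = padding_mask(lengths, max_len=4)

# Made-up scores of one query per sequence against its 4 key positions.
scores = [[0.2, 1.5, -0.4, 0.9],
          [1.0, 0.3, 5.0, 5.0],   # large scores on padding must be ignored
          [0.1, 0.1, 0.1, 9.0]]

weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
for row in weights:
    print([round(w, 2) for w in row])
```

Even though the padded positions of the second and third sequences carry the largest raw scores, they receive zero weight once the mask is applied.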
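For variable-length sequences batched with padding, the scores for padding positions are masked out so that no attention weight falls on them.
Complete Attention-Based Seq2Seq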
Exercises
Exercise 1: Implement Attention from Scratch
Implement scaled dot-product attention without using any PyTorch attention functions:
- Implement the forward pass with proper scaling
- Add masking support (padding and causal)
- Verify gradients flow correctly
- Test on a simple sequence copying task
Exercise 2: Visualize Real Attention Patterns
Using a pre-trained model (or train your own):
- Extract attention weights for various inputs
- Create visualizations showing what each head attends to
- Identify heads with interpretable patterns
- Compare attention patterns for different input types
Exercise 3: Compare Attention Variants
Implement and compare:
- Dot-product attention
- Additive (Bahdanau) attention
- Multiplicative (Luong) attention
Compare them on:
- Training speed
- Final BLEU score
- Attention patterns
Exercise 4: Relative Position Encoding
Implement relative positional encoding:
- Instead of absolute positions, encode relative distances
- Modify attention scores to include position bias
- Compare with sinusoidal on long sequences
- Test extrapolation to longer sequences
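As a possible starting point for the second item, here is a minimal sketch of adding a position bias to the attention scores. It assumes one scalar bias per clipped relative distance `j - i`, which is only one of several ways to encode relative position. The name `add_relative_bias` and the bias values are made up for illustration.

```python
def add_relative_bias(scores, bias, max_distance):
    """Add bias[clip(j - i)] to scores[i][j].

    bias has 2 * max_distance + 1 entries, indexed from -max_distance
    to +max_distance. Distances beyond the range share the end entries.
    """
    n = len(scores)
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            offset = max(-max_distance, min(max_distance, j - i))
            row.append(scores[i][j] + bias[offset + max_distance])
        out.append(row)
    return out

max_distance = 2
# Made-up bias values; in a model these would be learned parameters.
bias = [-0.4, -0.1, 0.5, -0.1, -0.4]

zero_scores = [[0.0] * 5 for _ in range(5)]
biased = add_relative_bias(zero_scores, bias, max_distance)
for row in biased:
    print(row)
```

Because the bias depends only on the distance between positions, the same table applies to sequences of any length, which is what the extrapolation test in the last item probes.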
Exercise 5: Efficient Attention
Implement a more efficient attention mechanism:
- Implement local attention (attend only to nearby positions)
- Implement sparse attention patterns
- Compare memory usage and speed with full attention
- Evaluate impact on model quality
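As a possible starting point for the first item, here is a minimal sketch of a local attention mask, assuming a symmetric window of `window` positions on each side. The name `local_mask` is illustrative. The mask can be applied the same way as a causal or padding mask, by setting disallowed scores to negative infinity before the softmax.

```python
def local_mask(n, window):
    """Position i may attend to position j only if |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

n, window = 8, 1
mask = local_mask(n, window)
for row in mask:
    print("".join("x" if ok else "." for ok in row))

allowed = sum(ok for row in mask for ok in row)
print(allowed, "of", n * n, "score entries are kept")
```

The number of kept entries grows linearly with sequence length for a fixed window, compared with quadratic growth for full attention. A real implementation would avoid computing the masked entries at all rather than masking them afterwards.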
Training Pitfalls and Debugging Hints
Key Takeaways
| Concept | Key Insight |
|---|---|
| Attention | Weighted combination of values based on query-key similarity |
| Dot-Product | $\text{softmax}(QK^\top / \sqrt{d_k})\,V$: simple and efficient |
| Multi-Head | Multiple attention functions for different relationship types |
| Self-Attention | Sequence attends to itself — O(1) dependency paths |
| Positional Encoding | Add position information since attention is permutation-invariant |
| Masking | Causal for autoregressive, padding for variable lengths |
What’s Next
Module 11: Transformers
Build the complete Transformer architecture — combine attention with feed-forward networks, layer normalization, and residual connections to create the model that revolutionized NLP.