LSTMs & GRUs: Gated Recurrent Networks
The Memory Problem Revisited
Vanilla RNNs suffer from a fundamental flaw: they can’t remember things for long. Because of the vanishing gradient problem, information from early time steps gets “washed out” as it passes through many time steps of repeated tanh activations. Real-world consequence: an RNN reading a book can’t remember what happened in Chapter 1 when it reaches Chapter 10.
Long Short-Term Memory (LSTM)
The Big Idea: A Memory Cell with Gates
An LSTM maintains two types of state:
- Cell State ($C_t$): The “long-term memory”, a conveyor belt for information that flows through with minimal modification
- Hidden State ($h_t$): The “working memory”, what the network is currently thinking about

Three gates control what is written to and read from the cell state:
- Forget Gate: What to erase from long-term memory (“the subject changed, forget the old topic”)
- Input Gate: What new information to write to long-term memory (“this is a new character, store their name”)
- Output Gate: What to surface from long-term memory for the current decision (“for predicting the next word, I need the subject, not the setting”)
LSTM Equations
Where:
- $\sigma$ is the sigmoid function (outputs between 0 and 1 for gating)
- $\odot$ is element-wise multiplication
- $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input
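For reference, the sigmoid, element-wise product, and concatenation described above appear in the standard LSTM formulation, sketched here in one common notation. The weight and bias names ($W_f$, $b_f$, and so on) and the use of $C_t$ for the cell state and $\tilde{C}_t$ for the candidate values are notational assumptions, not definitions taken from this page.

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) && \text{(candidate values)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(hidden state)}
\end{aligned}
```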
Complete LSTM Layer
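A minimal sketch of an LSTM layer, written in dependency-free Python rather than PyTorch so that every gate is visible. The function names (`lstm_step`, `lstm_forward`, `init_params`), the random weight range, and the forget-gate bias of 1.0 (a common heuristic) are all assumptions made for illustration; this is not the implementation behind `nn.LSTM`.

```python
import math
import random


def sigmoid(x):
    """Logistic function; squashes x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_params(input_size, hidden_size, seed=0):
    """Hypothetical initializer: one (W, bias) pair per gate block."""
    rng = random.Random(seed)
    cols = hidden_size + input_size  # weights act on [h_prev, x_t]
    params = {}
    for name in ("f", "i", "o", "g"):
        W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
             for _ in range(hidden_size)]
        # Assumption: start the forget gate biased toward "keep".
        bias = [1.0 if name == "f" else 0.0] * hidden_size
        params[name] = (W, bias)
    return params


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step: compute the gates, update the cell, emit the hidden state."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    pre = {name: [a + b for a, b in zip(matvec(W, z), bias)]
           for name, (W, bias) in params.items()}
    f = [sigmoid(v) for v in pre["f"]]    # forget gate
    i = [sigmoid(v) for v in pre["i"]]    # input gate
    o = [sigmoid(v) for v in pre["o"]]    # output gate
    g = [math.tanh(v) for v in pre["g"]]  # candidate values
    # Additive cell update: keep part of the old state, write part of the new.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def lstm_forward(xs, params, hidden_size):
    """Run the cell over a sequence, starting from zero states."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        outputs.append(h)
    return outputs, (h, c)


params = init_params(input_size=3, hidden_size=4)
sequence = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4], [1.0, 0.0, 0.0]]
outputs, (h_last, c_last) = lstm_forward(sequence, params, hidden_size=4)
print(len(outputs), len(h_last))  # 3 4
```

The layer returns one hidden vector per time step plus the final `(h, c)` pair, the same shape of result that recurrent layers in deep learning libraries usually expose. A practical version would batch the inputs and fuse the four gate matrices into a single multiplication.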
Understanding the Gates
The Forget Gate: Learning What to Ignore
The Input Gate: Learning What to Remember
The Output Gate: Learning What to Reveal
Why LSTM Solves Vanishing Gradients
The Gradient Highway
Gated Recurrent Unit (GRU)
A Simpler Alternative
GRU simplifies LSTM by:
- Combining forget and input gates into an “update gate”
- Merging cell state and hidden state
GRU Equations
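One common way to write the GRU update is sketched below. The symbol names ($z_t$ for the update gate, $r_t$ for the reset gate, $\tilde{h}_t$ for the candidate state) and the weight names are notational assumptions. Conventions also differ between references: some swap the roles of $z_t$ and $1 - z_t$ in the last line, which changes nothing about what the model can represent.

```latex
\begin{aligned}
z_t &= \sigma(W_z [h_{t-1}, x_t] + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r [h_{t-1}, x_t] + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(hidden state)}
\end{aligned}
```

The single gate $z_t$ plays both the forget and input roles: whatever fraction of the old state is dropped is replaced by the same fraction of the candidate.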
LSTM vs GRU Comparison
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| States | 2 (hidden + cell) | 1 (hidden only) |
| Parameters (per layer, ignoring biases) | 4 × hidden × (hidden + input) | 3 × hidden × (hidden + input) |
| Training | Slightly slower | Faster |
| Performance | Often slightly better on complex tasks | Often comparable |
| When to use | Long sequences, complex dependencies | Faster training needed, simpler tasks |
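The parameter row of the table can be checked by counting weights directly. The sketch below assumes the textbook formulation with one weight matrix over the concatenated `[hidden, input]` vector and one bias vector per gate block; the helper names `lstm_params` and `gru_params` are hypothetical. Library implementations can differ slightly (PyTorch's `nn.LSTM`, for example, keeps two bias vectors per gate), but the 4:3 ratio is unchanged.

```python
def lstm_params(input_size, hidden_size):
    """Four blocks (forget, input, output, candidate), each with W and a bias."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return 4 * per_block


def gru_params(input_size, hidden_size):
    """Three blocks (update, reset, candidate), each with W and a bias."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return 3 * per_block


for size in (64, 128, 256):
    lstm = lstm_params(size, size)
    gru = gru_params(size, size)
    print(f"hidden={size}: LSTM={lstm}, GRU={gru}, GRU/LSTM={gru / lstm:.2f}")
```

For any input and hidden size, the GRU count is exactly three quarters of the LSTM count under these assumptions, which is where the "roughly 25% fewer parameters" figure comes from.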
Practical Applications
Sentiment Analysis with LSTM
Language Modeling with LSTM
Sequence-to-Sequence Translation
Advanced LSTM Variants
Peephole Connections
Layer Normalization in LSTM
Best Practices and Tips
Exercises
Exercise 1: Implement LSTM from Scratch
Implement an LSTM from scratch and check it against `nn.LSTM`:
- Implement `LSTMCell` with all gates
- Stack cells into an `LSTM` layer
- Add bidirectional support
- Verify outputs match `nn.LSTM`
Exercise 2: Gradient Flow Analysis
- Process sequences of length 10, 50, 100, 200
- Track gradient magnitude at each time step
- Compare with vanilla RNN
- Plot the results
Exercise 3: Named Entity Recognition
- Load CoNLL-2003 dataset
- Implement BiLSTM-CRF model
- Train with proper evaluation (F1 score)
- Analyze errors by entity type
Exercise 4: Music Generation
- Download MIDI files and convert to sequences
- Train character-level LSTM on ABC notation
- Generate new melodies
- Convert back to MIDI and listen
Exercise 5: Time Series Forecasting
- Use a dataset like air quality or stock prices
- Implement encoder-decoder with LSTM
- Add attention (preview of next chapter!)
- Compare with simple baselines
Key Takeaways
| Concept | Key Insight |
|---|---|
| Cell State | Long-term memory highway - gradients flow with little attenuation while the forget gate stays near 1 |
| Forget Gate | Learns what to erase - enables “forgetting” of irrelevant info |
| Input Gate | Learns what to remember - filters new information |
| Output Gate | Learns what to reveal - controls hidden state exposure |
| GRU | Simpler, fewer parameters, often similar performance |
| Gradient Highway | Cell state update is additive → gradients vanish far more slowly than in a vanilla RNN |
What’s Next
Module 10: Attention Mechanism
Interview Deep-Dive
Walk through the LSTM cell equations. For each gate, explain what it does and why that specific activation function was chosen.
- Forget gate: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$. Sigmoid outputs in (0, 1) act as a dimmer switch on each cell state element. Values near 1 keep information, values near 0 erase it. Sigmoid is chosen because we need a smooth, differentiable gate bounded between 0 and 1.
- Input gate + candidate: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ controls how much new information to write. Candidate values $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$ use tanh because its (-1, 1) range allows both additive and subtractive modifications to the cell state.
- Cell state update: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. This additive update is the key: $f_t$ can stay near 1, preserving gradients along the cell state across hundreds of steps. Compare to vanilla RNNs, where gradients decay exponentially.
- Output gate: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$, $h_t = o_t \odot \tanh(C_t)$. The cell state may store information not needed for the current prediction. The output gate selectively exposes the relevant parts.
- The architectural insight: cell state is the memory bus, gates are learned read/write controllers. This separation of storage from computation enables long-term memory.
Compare LSTM and GRU architecturally. When would you choose one over the other?
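- GRU merges the forget and input gates into a single update gate and combines the cell state with the hidden state, reducing parameters by roughly 25% (three weight blocks instead of four) and often training about 15-20% faster, though the exact speedup depends on implementation and hardware.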
- LSTM has a separate cell state providing a cleaner gradient highway, and separate forget/input gates give more fine-grained memory control.
- Choose GRU: small datasets (fewer parameters reduce overfitting), speed-critical applications, moderate-length dependencies (under 200 steps). GRU performs comparably to LSTM on most benchmarks with shorter sequences.
- Choose LSTM: very long dependencies (500+ steps), ample data to support extra parameters, or when you need the explicit cell state for inspection/interpretability.
- In practice, the performance gap is usually 1-2%, and both have been largely superseded by transformers. The choice between them matters less than the choice between recurrence and attention.
Why is the LSTM cell state update additive rather than multiplicative? Connect this to gradient flow and ResNet skip connections.
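- The cell state update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ is additive: new state = weighted old state + weighted new candidate. Vanilla RNNs use $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$, a nonlinear (multiplicative) transformation.
- Gradient consequence: along the direct cell-state path, $\partial C_t / \partial C_{t-k} \approx \prod_{j=t-k+1}^{t} f_j$. With forget gates near 1, this product shrinks slowly, so a usable gradient survives across hundreds of steps. Vanilla RNNs have $\partial h_t / \partial h_{t-1} = \mathrm{diag}(1 - h_t^2)\, W_{hh}$, and the repeated product of these factors typically decays exponentially.
- This is the same principle as ResNet: $y = x + F(x)$ gives gradient $\partial y / \partial x = I + \partial F / \partial x$. The additive identity provides a gradient highway. LSTM’s forget gate modulates this highway ($f_t$ instead of a fixed 1), but when $f_t \approx 1$, the effect is nearly identical.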
- Both LSTM (1997) and ResNet (2015) independently discovered that additive shortcuts solve the gradient degradation problem in deep/long computation chains. The underlying math is the same: addition distributes gradients without attenuation.
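The gradient argument above can be made concrete with a scalar toy calculation. The sketch below multiplies the per-step factors along each path for 100 steps. The forget gate value (0.99), recurrent weight (0.9), and hidden activation (0.5) are illustrative assumptions, not measurements from a trained model, and the function names are hypothetical.

```python
def lstm_cell_path_gradient(forget_gates):
    """Gradient along the direct cell-state path: the product of forget gates."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad


def rnn_gradient(w_hh, hidden_values):
    """Scalar vanilla RNN: each step contributes (1 - h_t^2) * w_hh."""
    grad = 1.0
    for h in hidden_values:
        grad *= (1.0 - h * h) * w_hh
    return grad


steps = 100
lstm_grad = lstm_cell_path_gradient([0.99] * steps)
rnn_grad = rnn_gradient(w_hh=0.9, hidden_values=[0.5] * steps)

print(f"LSTM cell path after {steps} steps: {lstm_grad:.3f}")
print(f"Vanilla RNN after {steps} steps:    {rnn_grad:.3e}")
```

Under these assumed values the cell-state path keeps a gradient of order 0.1 to 1, while the vanilla RNN gradient falls many orders of magnitude below that. The LSTM figure depends entirely on the forget gate staying near 1: with gates around 0.5, the cell-state path would vanish just as quickly.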