Sequence-to-Sequence: Transforming Sequences
The Seq2Seq Paradigm
Seq2Seq models transform one sequence into another. The core insight is to decompose the problem into two stages: first understand the input (encoding), then generate the output (decoding). This mirrors how a human translator works: you read the entire French sentence, build a mental model of its meaning, then produce the English equivalent word by word. This paradigm is remarkably general. Any problem where you have a variable-length input and need a variable-length output can be cast as seq2seq:
- Machine Translation: “Hello world” -> “Bonjour monde”
- Summarization: Long document -> Short summary
- Question Answering: Context + Question -> Answer
- Code Generation: Natural language description -> Working code
- Speech Recognition: Audio waveform -> Text transcript
Basic Encoder-Decoder Architecture
The Encoder
The encoder reads the input sequence and compresses it into a context vector, a fixed-size summary of the entire input. This is like reading an entire book and then trying to capture its essence in a single paragraph. A bidirectional LSTM encoder reads the input both forward and backward, giving it a complete picture of the context around each token.
The Decoder
The decoder generates the output sequence one token at a time.
Complete Seq2Seq Model
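The encoder and decoder roles described above can be sketched end to end in a few lines. This is a minimal sketch, not the page's original model: it assumes a plain tanh RNN with random, untrained weights in place of the bidirectional LSTM, and every name (`encode`, `decode_greedy`, `HIDDEN`, `VOCAB`, the token ids) is hypothetical. Its only purpose is to show that the context vector has the same size no matter how long the input is.

```python
import math
import random

random.seed(0)
HIDDEN, VOCAB = 4, 6          # toy sizes, chosen only for illustration
BOS, EOS = 0, 1               # assumed ids for start/end-of-sequence tokens

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

EMBED = rand_matrix(VOCAB, HIDDEN)     # one embedding row per token id
W_ENC = rand_matrix(HIDDEN, HIDDEN)    # encoder recurrence weights
W_DEC = rand_matrix(HIDDEN, HIDDEN)    # decoder recurrence weights
W_OUT = rand_matrix(VOCAB, HIDDEN)     # projects a hidden state to vocabulary logits

def rnn_step(weights, hidden, token_id):
    """One tanh RNN step: h' = tanh(W h + embedding(token))."""
    pre = matvec(weights, hidden)
    return [math.tanh(p + e) for p, e in zip(pre, EMBED[token_id])]

def encode(source_ids):
    """Read the whole source; the final hidden state is the context vector."""
    hidden = [0.0] * HIDDEN
    for token_id in source_ids:
        hidden = rnn_step(W_ENC, hidden, token_id)
    return hidden

def decode_greedy(context, max_len=5):
    """Generate one token at a time, starting from the context vector."""
    hidden, token, output = context, BOS, []
    for _ in range(max_len):
        hidden = rnn_step(W_DEC, hidden, token)
        logits = matvec(W_OUT, hidden)
        token = max(range(VOCAB), key=lambda i: logits[i])  # greedy choice
        if token == EOS:
            break
        output.append(token)
    return output

short_ctx = encode([2, 3])
long_ctx = encode([2, 3, 4, 5, 2, 3, 4, 5])
translation = decode_greedy(long_ctx)
print(len(short_ctx), len(long_ctx), translation)
```

Both context vectors have `HIDDEN` entries even though one input is four times longer than the other, which is exactly the constraint that attention later relaxes.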
Attention Mechanisms
The Information Bottleneck Problem
The basic encoder-decoder has a fatal flaw: it forces ALL information about the source sentence through a single fixed-size vector. Imagine trying to translate a 500-word paragraph, but you can only pass information through a single 512-dimensional vector. Long sequences inevitably lose information.
Attention solves this elegantly by allowing the decoder to “look back” at every position in the source sequence at every step of generation. Instead of one summary vector, the decoder gets direct access to all encoder hidden states and learns to focus on the most relevant ones. When translating “le chat” to “the cat,” the decoder learns to attend strongly to “chat” when generating “cat.” This innovation, introduced by Bahdanau et al. in 2014, was so fundamental that it eventually evolved into the self-attention mechanism that powers modern Transformers.
Bahdanau Attention (Additive)
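Additive attention scores each encoder state against the current decoder state with a small feed-forward network, score(s, h_i) = v · tanh(W_dec s + W_enc h_i), then normalizes the scores with a softmax. The following is a minimal pure-Python sketch under stated assumptions: the weights are random rather than learned, and the names `additive_attention`, `w_dec`, `w_enc`, and `v` are illustrative, not taken from any library.

```python
import math
import random

def softmax(scores):
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def additive_attention(decoder_state, encoder_states, w_dec, w_enc, v):
    """Return (context, weights) using score(s, h_i) = v . tanh(W_dec s + W_enc h_i)."""
    proj_dec = matvec(w_dec, decoder_state)  # shared across all source positions
    scores = []
    for h in encoder_states:
        proj_enc = matvec(w_enc, h)
        energy = [math.tanh(a + b) for a, b in zip(proj_dec, proj_enc)]
        scores.append(sum(vi * ei for vi, ei in zip(v, energy)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return context, weights

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
ENC_DIM, DEC_DIM, ATT_DIM, SRC_LEN = 4, 3, 5, 6   # toy sizes for illustration
encoder_states = rand_matrix(SRC_LEN, ENC_DIM, rng)
decoder_state = [rng.uniform(-1, 1) for _ in range(DEC_DIM)]
context, weights = additive_attention(
    decoder_state,
    encoder_states,
    w_dec=rand_matrix(ATT_DIM, DEC_DIM, rng),
    w_enc=rand_matrix(ATT_DIM, ENC_DIM, rng),
    v=[rng.uniform(-1, 1) for _ in range(ATT_DIM)],
)
print(weights)
```

Because the two projections map into a shared attention dimension, the encoder and decoder states do not need to have the same size.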
Luong Attention (Multiplicative)
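Multiplicative attention replaces the feed-forward scorer with a dot product, either directly (score = s · h_i) or through a weight matrix (the "general" form, score = s · W h_i). Here is a minimal sketch in pure Python; the function name `luong_attention` and the toy vectors are assumptions made for illustration, and the dot form requires the decoder and encoder states to have the same dimension.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def luong_attention(decoder_state, encoder_states, w=None):
    """Dot score s.h_i when w is None, otherwise the 'general' score s.(W h_i)."""
    if w is None:
        keys = encoder_states
    else:
        keys = [[dot(row, h) for row in w] for h in encoder_states]
    weights = softmax([dot(decoder_state, k) for k in keys])
    dim = len(encoder_states[0])
    context = [sum(wt * h[d] for wt, h in zip(weights, encoder_states)) for d in range(dim)]
    return context, weights

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

# Dot-product scoring: positions aligned with the decoder state get more weight.
context, weights = luong_attention(decoder_state, encoder_states)

# General scoring with an identity matrix reduces to the dot-product form.
identity = [[1.0, 0.0], [0.0, 1.0]]
general_context, general_weights = luong_attention(decoder_state, encoder_states, w=identity)
print(weights, context)
```

The dot form has no extra parameters and is cheap to compute as a matrix product, which is one reason the same idea carries over to Transformer attention.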
Attention Decoder
Beam Search Decoding
Greedy decoding can miss better sequences. Beam search explores multiple hypotheses in parallel and keeps the highest-scoring ones.
Diverse Beam Search
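One common way to diversify beam search is to split the beams into groups and penalize a group for choosing tokens that earlier groups already picked at the same time step. The sketch below is a simplified illustration of that idea, not a reference implementation: the toy model in `TABLE`, the function names, and the penalty form (a fixed amount per reuse) are all assumptions. With `num_groups=1` it reduces to ordinary beam search, and with `group_size=1` as well it reduces to greedy decoding.

```python
import math

# Hypothetical toy model: next-token probabilities keyed by the prefix so far.
TABLE = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"c": 0.55, "d": 0.45},
    ("<s>", "b"): {"c": 0.9, "d": 0.1},
}

def next_log_probs(prefix):
    """Return {token: log-probability}; unknown prefixes always end the sequence."""
    probs = TABLE.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def diverse_beam_search(step_fn, num_groups=1, group_size=2, diversity=0.0,
                        max_len=5, bos="<s>", eos="</s>"):
    """Return one list of (tokens, log_prob) per group, best hypothesis first."""
    groups = [[([bos], 0.0)] for _ in range(num_groups)]
    for _ in range(max_len):
        picked = {}  # token -> number of beams in earlier groups ending in it this step
        for g, beams in enumerate(groups):
            candidates = []
            for tokens, score in beams:
                if tokens[-1] == eos:            # finished hypotheses are carried over
                    candidates.append((score, score, tokens))
                    continue
                for tok, lp in step_fn(tokens).items():
                    total = score + lp
                    penalized = total - diversity * picked.get(tok, 0)
                    candidates.append((penalized, total, tokens + [tok]))
            # Rank by the penalized score but store the true log-probability.
            candidates.sort(key=lambda c: c[0], reverse=True)
            groups[g] = [(tokens, total) for _, total, tokens in candidates[:group_size]]
            for tokens, _ in groups[g]:
                picked[tokens[-1]] = picked.get(tokens[-1], 0) + 1
        if all(tokens[-1] == eos for beams in groups for tokens, _ in beams):
            break
    return groups

greedy = diverse_beam_search(next_log_probs, group_size=1)[0][0]
beam = diverse_beam_search(next_log_probs, group_size=2)[0][0]
diverse = diverse_beam_search(next_log_probs, num_groups=2, group_size=1, diversity=1.0)
print(greedy, beam, diverse)
```

In this toy model greedy decoding commits to "a" because it is the most likely first token, while a beam of width 2 keeps "b" alive long enough to find the more probable full sequence. With the diversity penalty, the second group is pushed toward a different first token than the first group.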
Training Techniques
Label Smoothing
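Label smoothing replaces the one-hot training target with a slightly softened distribution before computing cross-entropy. A minimal pure-Python sketch follows, assuming the formulation in which the correct token keeps 1 - epsilon and the remaining epsilon is shared equally among the other tokens; the function names are hypothetical. Some libraries spread epsilon over all classes including the correct one (PyTorch's `nn.CrossEntropyLoss(label_smoothing=...)` does this), so exact values can differ slightly between implementations.

```python
import math

def log_softmax(logits):
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def smoothed_targets(target, vocab_size, epsilon=0.1):
    """1 - epsilon on the correct token, epsilon shared by the other tokens."""
    off_value = epsilon / (vocab_size - 1)
    return [1.0 - epsilon if i == target else off_value for i in range(vocab_size)]

def label_smoothing_loss(logits, target, epsilon=0.1):
    """Cross-entropy between the smoothed target distribution and the model output."""
    log_probs = log_softmax(logits)
    targets = smoothed_targets(target, len(logits), epsilon)
    return -sum(q * lp for q, lp in zip(targets, log_probs))

logits = [2.0, 0.5, -1.0, 0.0]      # toy decoder output over a 4-token vocabulary
hard_loss = label_smoothing_loss(logits, target=0, epsilon=0.0)   # plain cross-entropy
soft_loss = label_smoothing_loss(logits, target=0, epsilon=0.1)
print(hard_loss, soft_loss)
```

Even when the model already ranks the correct token first, the smoothed loss stays higher than the hard loss, which discourages the logits from growing without bound.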
Scheduled Sampling
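Scheduled sampling can be sketched as two small pieces: a schedule for the teacher-forcing probability, and a per-step coin flip that decides whether the decoder's next input is the ground-truth token or the model's own previous prediction. The code below is an illustrative sketch only; the linear schedule from 1.0 down to 0.5 and the helper names are assumptions, and real training would apply the choice inside the decoder loop.

```python
import random

def teacher_forcing_ratio(step, total_steps, start=1.0, end=0.5):
    """Linearly decay the probability of feeding the ground-truth token."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress

def choose_decoder_inputs(targets, predictions, ratio, rng):
    """For each position, use the gold token with probability `ratio`,
    otherwise the model's own prediction for that position."""
    inputs = []
    for gold, predicted in zip(targets, predictions):
        inputs.append(gold if rng.random() < ratio else predicted)
    return inputs

targets = ["the", "cat", "sat"]          # ground-truth previous tokens
predictions = ["a", "cat", "sits"]       # hypothetical model predictions
rng = random.Random(0)

early = choose_decoder_inputs(targets, predictions, teacher_forcing_ratio(0, 100), rng)
late = choose_decoder_inputs(targets, predictions, teacher_forcing_ratio(100, 100), rng)
print(early, late)
```

At step 0 the ratio is 1.0, so the decoder sees only gold tokens; by the end of the schedule roughly half of its inputs are its own predictions.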
Gradually reduce teacher forcing during training, so the decoder increasingly sees its own predictions instead of the ground truth.
Practical Applications
Machine Translation
Text Summarization
Evaluation Metrics
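BLEU is the metric most often quoted for translation quality. To make the idea concrete, here is a deliberately simplified sentence-level variant: clipped n-gram precision up to bigrams, a single reference, a brevity penalty, and no smoothing. The function names are hypothetical, and scores from this sketch are not comparable with those reported by standard evaluation tools.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def simple_bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0                      # no smoothing in this sketch
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: candidates shorter than the reference are scaled down.
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * geo_mean

reference = "the cat sat on the mat".split()
candidate = "the cat sat on mat".split()
score = simple_bleu(candidate, reference)
print(round(score, 4))
```

Clipping is what stops a degenerate output such as "the the the" from earning full unigram precision: each candidate n-gram can be credited at most as many times as it occurs in the reference.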
Exercises
Exercise 1: Implement Copy Mechanism
Add a pointer-generator network to copy words from the source.
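As a possible starting point, the core of a pointer-generator is the final mixing step: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (sum of attention on source positions holding w). The sketch below shows only that step, with hypothetical names and hand-picked numbers; in a full model `p_gen` would be produced by a learned sigmoid layer rather than passed in as a constant.

```python
def final_distribution(vocab_probs, attention, source_tokens, p_gen):
    """Mix the generator's vocabulary distribution with a copy distribution.

    vocab_probs: {word: probability} from the decoder softmax
    attention: one attention weight per source token (sums to 1)
    p_gen: probability of generating from the vocabulary instead of copying
    """
    mixed = {word: p_gen * prob for word, prob in vocab_probs.items()}
    for token, weight in zip(source_tokens, attention):
        # Source words outside the vocabulary get an entry purely from copying.
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed

vocab_probs = {"the": 0.5, "sat": 0.3, "<unk>": 0.2}
source_tokens = ["Felix", "sat", "down"]      # "Felix" is out of vocabulary
attention = [0.7, 0.2, 0.1]
dist = final_distribution(vocab_probs, attention, source_tokens, p_gen=0.6)
print(dist)
```

The out-of-vocabulary name receives probability mass only through the copy path, which is what lets the model reproduce rare words it could never generate from its fixed vocabulary.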
Exercise 2: Add Coverage Mechanism
Prevent repetition by tracking the attention history.
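One way to approach this exercise is to keep a coverage vector, the running sum of all previous attention distributions, and add a penalty of sum_i min(a_i, c_i) at each step so the model is discouraged from attending again to positions it has already covered. The following is a sketch of just that bookkeeping, with hypothetical function names and made-up attention values; feeding the coverage vector back into the attention scorer is left as the rest of the exercise.

```python
def coverage_step(coverage, attention):
    """Return (coverage loss for this step, updated coverage vector)."""
    loss = sum(min(a, c) for a, c in zip(attention, coverage))
    updated = [c + a for c, a in zip(coverage, attention)]
    return loss, updated

def total_coverage_loss(attention_history):
    """Accumulate the coverage loss over a whole decoded sequence."""
    coverage = [0.0] * len(attention_history[0])
    total = 0.0
    for attention in attention_history:
        loss, coverage = coverage_step(coverage, attention)
        total += loss
    return total, coverage

# Decoder that keeps staring at the first source token (likely to repeat itself).
repetitive = [[0.9, 0.1, 0.0], [0.9, 0.1, 0.0], [0.9, 0.1, 0.0]]
# Decoder whose attention moves across the source as it generates.
spread = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]

repetitive_loss, _ = total_coverage_loss(repetitive)
spread_loss, final_coverage = total_coverage_loss(spread)
print(repetitive_loss, spread_loss)
```

The first step always has zero loss because nothing has been covered yet; after that, repeated attention to the same position is penalized in proportion to how much attention it already received.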
Exercise 3: Implement Nucleus Sampling
Implement top-p (nucleus) sampling for more diverse generation.
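A possible starting point for this exercise, sketched in pure Python over a dictionary of token probabilities: sort tokens by probability, keep the smallest prefix whose cumulative mass reaches p, renormalize, and sample from what is left. The function names and the toy distribution are assumptions for illustration; a real implementation would operate on logits tensors.

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1 again.
    return {token: prob / cumulative for token, prob in kept}

def nucleus_sample(probs, p=0.9, rng=random):
    """Sample one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]

probs = {"cat": 0.5, "dog": 0.3, "bird": 0.15, "zebra": 0.04, "xylophone": 0.01}
nucleus = top_p_filter(probs, p=0.9)
rng = random.Random(0)
samples = [nucleus_sample(probs, p=0.9, rng=rng) for _ in range(100)]
print(sorted(nucleus), samples[:5])
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution: a confident model keeps only a few candidates, while a flat distribution keeps many.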
What’s Next?
Self-Supervised Learning
Learn representations without labels
Reinforcement Learning for DL
RLHF, PPO, and reward optimization
Interview Deep-Dive
What is the information bottleneck problem in basic seq2seq models, and how does attention solve it?
Strong Answer:
In a basic encoder-decoder, the encoder compresses the entire input sequence into a single fixed-size context vector, typically the last hidden state. This is like asking someone to summarize a 500-page book in a single sentence and then asking another person to reconstruct the book from that sentence. No matter how large you make that context vector, there is a fundamental information bottleneck: long sequences contain more information than any fixed-size vector can faithfully represent.
The symptoms are clear in practice: translation quality degrades sharply for sentences beyond 20-30 tokens, and the model struggles to capture long-range dependencies. The first and last few words of the input tend to be represented well (primacy and recency effects in the RNN), but middle portions are often lost.
Attention (Bahdanau et al., 2014) eliminates the bottleneck by allowing the decoder to “look back” at every encoder hidden state at every decoding step. Instead of one context vector, the decoder gets a weighted combination of all encoder hidden states, with the weights computed dynamically based on what the decoder needs at each step. When generating the verb in a translation, the attention weights concentrate on the source verb. When generating a noun, they shift to the source noun.
Mathematically, at each decoding step t, attention computes context_t = sum_i(alpha_{t,i} * h_i), where alpha_{t,i} = softmax_i(score(s_t, h_i)), s_t is the decoder state at step t, and h_i is the i-th encoder hidden state. The score function can be additive (Bahdanau) or multiplicative (Luong), with the plain dot product as the simplest multiplicative form. The context vector is now different at every step, carrying the information most relevant to that step.
The scaling implication: this is also the foundation for the Transformer’s self-attention, which generalizes the idea by allowing every position to attend to every other position, not just decoder-to-encoder. The conceptual leap from seq2seq attention to Transformer self-attention is smaller than it appears: it is the same attention mechanism applied bidirectionally within a single sequence.
Follow-up: Why did the Transformer architecture completely replace LSTM-based seq2seq models? What did LSTMs do better?
The Transformer won on two fronts: parallelism and long-range dependencies. An LSTM processes tokens sequentially, so token t+1 depends on the hidden state from token t. This means you cannot parallelize across time steps during training. Transformers compute all positions simultaneously via matrix multiplications, achieving massive speedups on GPUs. For a sequence of length N, LSTM training takes O(N) sequential steps; Transformer training takes O(1) sequential steps per layer (though each step is O(N^2) in compute).
For long-range dependencies, LSTMs must propagate information through N sequential cells, each with a gating mechanism that can leak or distort the signal. Transformers connect any two positions with a single attention step, regardless of distance.
What LSTMs did better: memory efficiency for long sequences. Standard Transformer self-attention is O(N^2) in memory, while LSTMs are O(N). For very long sequences (100K+ tokens), this matters. Also, LSTMs have an inductive bias toward sequential processing that makes them data-efficient for small sequence modeling tasks. You rarely hear about this advantage because most modern tasks have enough data that the Transformer’s flexibility outweighs the LSTM’s inductive bias.
Compare greedy decoding, beam search, and nucleus sampling. When would you use each in production?
Strong Answer:
Greedy decoding selects the highest-probability token at each step. It is fast (one forward pass per token) and deterministic, but it often produces suboptimal sequences because it cannot recover from a locally good but globally poor choice. It is like navigating a maze by always turning toward the exit: you get stuck in dead ends.
Beam search maintains the top-k (beam width) partial sequences at each step, expanding all of them and keeping the best k. It approximates a search over the full output space without the exponential cost. A beam width of 4-5 is standard for machine translation, where there is usually one “correct” output. The trade-off is roughly k-fold more compute per token.
Nucleus (top-p) sampling restricts random sampling to the smallest set of tokens whose cumulative probability exceeds p (typically 0.9-0.95), then samples from that set. This balances diversity with quality by excluding the long tail of unlikely tokens.
When to use each in production: beam search for tasks with “correct” answers where quality matters (translation, summarization, code generation), because it reliably finds high-probability outputs. Nucleus sampling for creative or conversational tasks (chatbots, story generation) where diversity and naturalness matter more than finding the single best sequence. Greedy decoding for latency-critical applications where quality can be slightly sacrificed, or as a baseline to validate that more expensive decoding actually helps.
A production nuance people miss: beam search can produce degenerate outputs (repetitive text, empty sequences) because high-probability sequences are often boring or repetitive. Production systems add length normalization (divide sequence log-probability by length to avoid favoring short outputs), repetition penalties (reduce the probability of recently generated tokens), and sometimes a minimum length constraint.
Follow-up: You are building a code completion system. You notice beam search produces correct but boring completions while sampling produces creative but often incorrect code. How do you get the best of both?
A common approach is sample-and-rerank (best-of-N sampling). Generate N candidate completions using nucleus sampling (for diversity), then re-rank them using the model’s own log-probability (or a separate verifier model) and return the highest-scoring candidate. This gives you the diversity of sampling with the quality filtering of beam search.
More sophisticated approaches add an execution step that filters out candidates that do not compile or fail basic tests. Production code assistants can generate multiple candidates in parallel, score them with a verifier, and present the best one. The key insight is that generation and verification are separate problems, and using different strategies for each outperforms using a single strategy for both.
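The generate-then-verify idea can be sketched as follows: sample several candidates, drop those that fail a verifier, and return the survivor with the best length-normalized log-probability. Everything here is a hypothetical illustration: `toy_model` stands in for a real language model, `balanced` stands in for a compile or test check, and the function names are not from any library.

```python
import math
import random

def toy_model(prefix):
    """Hypothetical next-token distribution; forces an end token after 3 tokens."""
    if len(prefix) >= 3:
        return {"</s>": 1.0}
    return {"(": 0.3, ")": 0.3, "x": 0.3, "</s>": 0.1}

def balanced(tokens):
    """Stand-in verifier: parentheses must be balanced."""
    depth = 0
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def sample_sequence(step_fn, rng, max_len=5, eos="</s>"):
    """Sample one completion; return (tokens, total log-probability)."""
    tokens, log_prob = [], 0.0
    for _ in range(max_len):
        probs = step_fn(tokens)
        choices = list(probs)
        token = rng.choices(choices, weights=[probs[c] for c in choices], k=1)[0]
        log_prob += math.log(probs[token])
        tokens.append(token)
        if token == eos:
            break
    return tokens, log_prob

def rerank(candidates, verifier):
    """Prefer verified candidates, then pick the best length-normalized score."""
    valid = [c for c in candidates if verifier(c[0])]
    pool = valid or candidates          # fall back if nothing passes verification
    return max(pool, key=lambda c: c[1] / len(c[0]))

def best_of_n(step_fn, verifier, n=8, seed=0):
    rng = random.Random(seed)
    return rerank([sample_sequence(step_fn, rng) for _ in range(n)], verifier)

best_tokens, best_log_prob = best_of_n(toy_model, balanced, n=20)
print(best_tokens, best_log_prob)
```

Keeping `rerank` separate from sampling reflects the point made above: the generator and the verifier can be swapped independently, for example replacing `balanced` with a function that actually runs unit tests.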
What is teacher forcing in seq2seq training, and what problem does it create at inference time? How do you mitigate it?
Strong Answer:
Teacher forcing feeds the ground-truth previous token as input to the decoder at each training step, rather than the decoder’s own prediction. If the correct output is “the cat sat,” at step 3 the decoder receives “the cat” as input regardless of what it predicted at steps 1 and 2. This makes training faster and more stable because the decoder always sees correct context, which provides clear gradient signals.
The problem is exposure bias: during inference, there is no ground truth, so the decoder feeds its own predictions back as input. If it makes an error at step 2, that error compounds at step 3 and beyond because the model was never trained on its own erroneous outputs. It is like a student who practices piano only by reading sheet music one note at a time but never plays a whole piece: they fall apart when they have to maintain coherence across measures.
Mitigation strategies:
(1) Scheduled sampling (Bengio et al., 2015): during training, gradually increase the probability of using the model’s own prediction instead of the ground truth. For example, start with 100% teacher forcing and linearly decrease to 50% over training. This exposes the model to its own errors during training.
(2) Sequence-level training with REINFORCE: optimize the actual sequence-level metric (BLEU, ROUGE) using policy gradient methods, which naturally trains the model on its own outputs.
(3) For modern Transformer models, the problem is often less severe in practice because self-attention gives each token direct access to all previous tokens, making the model more robust to individual token errors.
In practice, scheduled sampling provides the best effort-to-improvement ratio for LSTM-based seq2seq. For Transformer-based models, teacher forcing with proper regularization (label smoothing, dropout) works well enough that scheduled sampling is rarely used; direct attention over the full prefix appears to make these models less sensitive to exposure bias.
Follow-up: How does label smoothing help with exposure bias in seq2seq models?
Label smoothing replaces the hard target distribution (1.0 on the correct token, 0.0 on everything else) with a soft distribution (for a smoothing value of 0.1: 0.9 on the correct token, 0.1 distributed across all other tokens). This has two effects that reduce exposure bias. First, the model never learns to be 100% confident in any single token, which means at inference time its probability distribution is wider and more recovery-friendly: if it makes a wrong prediction, the context is less “broken” because the model was always somewhat uncertain. Second, it regularizes the model against overfitting to the exact training sequences, encouraging it to learn more general patterns that transfer better to the auto-regressive inference setting.
Empirically, label smoothing of 0.1 is commonly reported to give roughly a 0.5-1.0 BLEU improvement on machine translation tasks. It is essentially free compute-wise and is a sensible default for any seq2seq model.