Recurrent Neural Networks
The Problem with Sequences
CNNs transformed computer vision. But they have a fundamental limitation: they assume fixed-size inputs. What about:
- Text: “The cat sat on the mat” (6 words)
- Time series: Stock prices over months (varying length)
- Audio: Speech of different durations
- Video: Frames over time
The Core Insight: Sequences have temporal dependencies. The word “sat” depends on knowing “cat” came before it. Today’s stock price depends on yesterday’s. We need networks that can remember.
From Feedforward to Recurrent
The Feedforward Limitation
```python
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
# A feedforward network processes each input independently
class FeedforwardClassifier(nn.Module):
def __init__(self, input_size, hidden_size, num_classes):
super().__init__()
self.fc1 = nn.Linear(input_size, hidden_size)
self.fc2 = nn.Linear(hidden_size, num_classes)
def forward(self, x):
# Each input processed independently
# No memory of previous inputs!
return self.fc2(torch.relu(self.fc1(x)))
# Problem: Processing words one at a time loses context
words = ["The", "cat", "sat", "on", "the", "mat"]
# When processing "mat", the network has no memory of "cat"!
```
The Recurrent Solution
An RNN maintains a hidden state that carries information across time steps:

$$h_t = f(h_{t-1}, x_t)$$

```python
class SimpleRNNCell(nn.Module):
"""
A single RNN cell - the building block of RNNs.
At each time step:
1. Combine current input with previous hidden state
2. Apply non-linearity
3. Output new hidden state
"""
def __init__(self, input_size, hidden_size):
super().__init__()
self.hidden_size = hidden_size
# Weights for input
self.W_xh = nn.Linear(input_size, hidden_size, bias=False)
# Weights for hidden state
self.W_hh = nn.Linear(hidden_size, hidden_size, bias=True)
def forward(self, x, h_prev):
"""
Args:
x: Current input (batch, input_size)
h_prev: Previous hidden state (batch, hidden_size)
Returns:
h_new: Updated hidden state
"""
h_new = torch.tanh(self.W_xh(x) + self.W_hh(h_prev))
return h_new
# Demonstrate the recurrent connection
cell = SimpleRNNCell(input_size=10, hidden_size=20)
# Process a sequence of length 5
sequence = torch.randn(3, 5, 10) # (batch=3, seq_len=5, input=10)
batch_size = sequence.size(0)
seq_len = sequence.size(1)
# Initialize hidden state
h = torch.zeros(batch_size, 20)
# Process each time step
hidden_states = []
for t in range(seq_len):
x_t = sequence[:, t, :] # Get input at time t
h = cell(x_t, h) # Update hidden state
hidden_states.append(h)
print(f"Time {t}: h shape = {h.shape}")
print(f"\nFinal hidden state captures the entire sequence!")
The RNN Equations
Mathematical Formulation
At each time step t:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = W_{hy} h_t + b_y$$

Where:
- $x_t \in \mathbb{R}^d$ is the input at time t
- $h_t \in \mathbb{R}^h$ is the hidden state
- $y_t \in \mathbb{R}^o$ is the output
- $W_{xh} \in \mathbb{R}^{h \times d}$ transforms the input to the hidden state
- $W_{hh} \in \mathbb{R}^{h \times h}$ transforms the previous hidden state to the current one
- $W_{hy} \in \mathbb{R}^{o \times h}$ transforms the hidden state to the output
```python
class VanillaRNN(nn.Module):
"""
Vanilla (Elman) RNN implemented from scratch.
This is the simplest form of RNN, directly implementing
the recurrence relation with tanh non-linearity.
"""
def __init__(self, input_size, hidden_size, output_size, num_layers=1):
super().__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
# Input to hidden weights
self.W_xh = nn.ModuleList([
nn.Linear(input_size if i == 0 else hidden_size, hidden_size)
for i in range(num_layers)
])
# Hidden to hidden weights
self.W_hh = nn.ModuleList([
nn.Linear(hidden_size, hidden_size)
for _ in range(num_layers)
])
# Hidden to output
self.W_hy = nn.Linear(hidden_size, output_size)
def forward(self, x, h_0=None):
"""
Process entire sequence.
Args:
x: Input sequence (batch, seq_len, input_size)
h_0: Initial hidden state (num_layers, batch, hidden_size)
Returns:
outputs: Output at each time step (batch, seq_len, output_size)
h_n: Final hidden state (num_layers, batch, hidden_size)
"""
batch_size, seq_len, _ = x.size()
# Initialize hidden states
if h_0 is None:
h = [torch.zeros(batch_size, self.hidden_size, device=x.device)
for _ in range(self.num_layers)]
else:
h = [h_0[i] for i in range(self.num_layers)]
outputs = []
for t in range(seq_len):
x_t = x[:, t, :]
# Process through each layer
for layer in range(self.num_layers):
h[layer] = torch.tanh(
self.W_xh[layer](x_t) + self.W_hh[layer](h[layer])
)
x_t = h[layer] # Output of this layer is input to next
# Compute output
y_t = self.W_hy(h[-1])
outputs.append(y_t)
outputs = torch.stack(outputs, dim=1)
h_n = torch.stack(h, dim=0)
return outputs, h_n
# Test our implementation
rnn = VanillaRNN(input_size=10, hidden_size=32, output_size=5, num_layers=2)
x = torch.randn(4, 20, 10) # batch=4, seq_len=20, input=10
outputs, h_n = rnn(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {outputs.shape}") # (4, 20, 5)
print(f"Final hidden shape: {h_n.shape}") # (2, 4, 32)
RNN Architectures
Many-to-One (Sequence Classification)
Use the final hidden state to classify the entire sequence:
```python
class SentimentClassifier(nn.Module):
"""
Many-to-One: Classify entire sequence (e.g., sentiment analysis)
Input: "This movie was absolutely terrible"
Output: "Negative" (single label)
"""
def __init__(self, vocab_size, embed_dim, hidden_size, num_classes):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim)
self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
self.fc = nn.Linear(hidden_size, num_classes)
def forward(self, x):
# x: (batch, seq_len) - token indices
embedded = self.embedding(x) # (batch, seq_len, embed_dim)
# Process through RNN
output, h_n = self.rnn(embedded) # h_n: (1, batch, hidden)
# Use final hidden state for classification
final_hidden = h_n.squeeze(0) # (batch, hidden)
logits = self.fc(final_hidden) # (batch, num_classes)
return logits
# Example
classifier = SentimentClassifier(vocab_size=10000, embed_dim=128,
hidden_size=256, num_classes=2)
# Simulated tokenized sentence (batch of 8 sentences, max length 50)
sentences = torch.randint(0, 10000, (8, 50))
predictions = classifier(sentences)
print(f"Predictions shape: {predictions.shape}") # (8, 2)
One-to-Many (Sequence Generation)
Generate a sequence from a single input:
```python
class CaptionGenerator(nn.Module):
"""
One-to-Many: Generate sequence from single input (e.g., image captioning)
Input: Image features
Output: "A cat sitting on a couch"
"""
def __init__(self, image_dim, hidden_size, vocab_size, max_length=20):
super().__init__()
self.max_length = max_length
# Project image features to hidden state
self.image_proj = nn.Linear(image_dim, hidden_size)
# Word embedding
self.embedding = nn.Embedding(vocab_size, hidden_size)
# RNN cell for generation
self.rnn_cell = nn.RNNCell(hidden_size, hidden_size)
# Output projection
self.fc = nn.Linear(hidden_size, vocab_size)
def forward(self, image_features, max_length=None):
"""
Generate caption from image features.
Args:
image_features: (batch, image_dim)
Returns:
Generated token indices
"""
if max_length is None:
max_length = self.max_length
batch_size = image_features.size(0)
# Initialize hidden state from image
h = torch.tanh(self.image_proj(image_features))
# Start with <START> token (assume index 0)
current_token = torch.zeros(batch_size, dtype=torch.long,
device=image_features.device)
generated = []
for t in range(max_length):
# Embed current token
x = self.embedding(current_token)
# RNN step
h = self.rnn_cell(x, h)
# Predict next token
logits = self.fc(h)
current_token = logits.argmax(dim=-1)
generated.append(current_token)
return torch.stack(generated, dim=1)
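```

The generator is never exercised above, so here is a minimal usage sketch. The 512-dimensional image features and the 5,000-word vocabulary are illustrative values, not taken from the text:

```python
# Hypothetical sizes, for illustration only
generator = CaptionGenerator(image_dim=512, hidden_size=256, vocab_size=5000)
image_features = torch.randn(4, 512)        # e.g., pooled CNN features for 4 images
caption_tokens = generator(image_features)  # greedy decoding, no <END> handling
print(f"Generated tokens shape: {caption_tokens.shape}")  # (4, 20)
```

Note that this forward pass always feeds back its own argmax prediction; a training loop would normally use teacher forcing and feed in the ground-truth caption instead.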
Many-to-Many (Sequence-to-Sequence)
Transform one sequence into another:
```python
class Seq2SeqTranslator(nn.Module):
"""
Many-to-Many: Transform sequence to sequence (e.g., translation)
Input: "Hello, how are you?"
Output: "Bonjour, comment allez-vous?"
"""
def __init__(self, src_vocab_size, tgt_vocab_size, embed_dim, hidden_size):
super().__init__()
# Encoder
self.src_embedding = nn.Embedding(src_vocab_size, embed_dim)
self.encoder = nn.RNN(embed_dim, hidden_size, batch_first=True)
# Decoder
self.tgt_embedding = nn.Embedding(tgt_vocab_size, embed_dim)
self.decoder = nn.RNN(embed_dim, hidden_size, batch_first=True)
self.fc = nn.Linear(hidden_size, tgt_vocab_size)
def forward(self, src, tgt):
"""
Args:
src: Source sequence (batch, src_len)
tgt: Target sequence for teacher forcing (batch, tgt_len)
"""
# Encode source
src_embedded = self.src_embedding(src)
_, encoder_hidden = self.encoder(src_embedded)
# Decode target (teacher forcing during training)
tgt_embedded = self.tgt_embedding(tgt)
decoder_output, _ = self.decoder(tgt_embedded, encoder_hidden)
# Project to vocabulary
logits = self.fc(decoder_output)
return logits
translator = Seq2SeqTranslator(
src_vocab_size=10000, tgt_vocab_size=8000,
embed_dim=256, hidden_size=512
)
src = torch.randint(0, 10000, (16, 30)) # Batch of 16, max length 30
tgt = torch.randint(0, 8000, (16, 25)) # Target length 25
output = translator(src, tgt)
print(f"Output shape: {output.shape}") # (16, 25, 8000)
Backpropagation Through Time (BPTT)
The Challenge of Temporal Gradients
RNNs are trained using BPTT - unrolling the network through time and applying backpropagation:
```python
def bptt_visualization():
"""
Demonstrate how gradients flow backward through time.
"""
# Consider a 4-step sequence
# Loss at final step L_4
#
# Forward: x_1 → h_1 → h_2 → h_3 → h_4 → y_4 → L_4
# ↗ ↗ ↗ ↗
# x_2 x_3 x_4
#
# Backward gradient for W_hh:
# ∂L/∂W_hh = ∂L/∂h_4 · ∂h_4/∂W_hh
# + ∂L/∂h_4 · ∂h_4/∂h_3 · ∂h_3/∂W_hh
# + ∂L/∂h_4 · ∂h_4/∂h_3 · ∂h_3/∂h_2 · ∂h_2/∂W_hh
# + ...
print("BPTT Gradient Flow:")
print("=" * 50)
print("∂L/∂W_hh = Σ_t ∂L/∂h_T · (∏_{k=t+1}^{T} ∂h_k/∂h_{k-1}) · ∂h_t/∂W_hh")
print()
print("The product of Jacobians can explode or vanish!")
bptt_visualization()
class RNNWithGradientTracking(nn.Module):
"""RNN that tracks gradient magnitudes through time."""
def __init__(self, input_size, hidden_size):
super().__init__()
self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
self.gradient_norms = []
def forward(self, x):
output, h_n = self.rnn(x)
return output, h_n
def track_gradients(self, x, target):
"""Track how gradients change through time."""
x.requires_grad_(True)
output, _ = self.forward(x)
# Compute gradients for output at each time step
seq_len = output.size(1)
gradient_norms = []
for t in range(seq_len):
if x.grad is not None:
x.grad.zero_()
# Gradient of output at time t w.r.t. input at time 0
output[:, t, 0].sum().backward(retain_graph=True)
# Gradient norm from first input
grad_norm = x.grad[:, 0, :].norm().item()
gradient_norms.append(grad_norm)
return gradient_norms
# Demonstrate gradient flow
torch.manual_seed(42)
model = RNNWithGradientTracking(10, 32)
x = torch.randn(1, 50, 10)
gradient_norms = model.track_gradients(x, None)
plt.figure(figsize=(10, 4))
plt.plot(gradient_norms)
plt.xlabel('Time step (distance from input)')
plt.ylabel('Gradient norm')
plt.title('Gradient Flow in Vanilla RNN')
plt.yscale('log')
plt.grid(True)
plt.show()
print(f"Gradient ratio (step 49 / step 1): {gradient_norms[-1]/gradient_norms[0]:.6f}")
The Vanishing/Exploding Gradient Problem
Why Gradients Vanish
The gradient through time involves repeated products of the recurrent weight matrix:

$$\frac{\partial h_T}{\partial h_1} = \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}} = \prod_{t=2}^{T} W_{hh}^{\top} \, \mathrm{diag}\!\left(\tanh'(h_{t-1})\right)$$

```python
def analyze_gradient_flow():
"""
Analyze why gradients vanish or explode in RNNs.
"""
hidden_size = 100
seq_length = 100
# Different initialization scales
scales = [0.5, 1.0, 1.5]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, scale in zip(axes, scales):
# Initialize W_hh with different scales
W = torch.randn(hidden_size, hidden_size) * scale / np.sqrt(hidden_size)
# Simulate gradient flow (simplified - ignoring tanh derivative)
gradient = torch.eye(hidden_size)
gradient_norms = [gradient.norm().item()]
for t in range(seq_length):
gradient = gradient @ W
gradient_norms.append(gradient.norm().item())
ax.semilogy(gradient_norms)
ax.set_xlabel('Time steps back')
ax.set_ylabel('Gradient norm (log scale)')
ax.set_title(f'Scale = {scale}')
ax.axhline(y=1, color='r', linestyle='--', alpha=0.5)
ax.set_ylim([1e-20, 1e20])
plt.tight_layout()
plt.show()
analyze_gradient_flow()
```
The Impact on Learning
```python
def demonstrate_vanishing_gradient_impact():
"""
Show how vanishing gradients affect learning long-range dependencies.
"""
# Task: Remember the first element of a sequence
# Output at the end should equal input at the beginning
class CopyFirstTask(nn.Module):
def __init__(self, hidden_size):
super().__init__()
self.rnn = nn.RNN(1, hidden_size, batch_first=True)
self.fc = nn.Linear(hidden_size, 1)
def forward(self, x):
output, h_n = self.rnn(x)
# Use final hidden state
return self.fc(h_n.squeeze(0))
def train_copy_task(seq_length, hidden_size=32, num_epochs=1000):
model = CopyFirstTask(hidden_size)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
losses = []
for epoch in range(num_epochs):
# Generate data: first value is target
batch_size = 32
first_value = torch.randn(batch_size, 1)
# Sequence: [first_value, noise, noise, ..., noise]
sequence = torch.randn(batch_size, seq_length, 1) * 0.1
sequence[:, 0, :] = first_value
# Forward pass
prediction = model(sequence)
loss = criterion(prediction, first_value)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
losses.append(loss.item())
return losses
# Test with different sequence lengths
seq_lengths = [10, 30, 50, 100]
plt.figure(figsize=(12, 4))
for seq_len in seq_lengths:
losses = train_copy_task(seq_len)
plt.plot(losses, label=f'Length = {seq_len}')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.title('Copy First Task: Longer Sequences are Harder')
plt.legend()
plt.yscale('log')
plt.grid(True)
plt.show()
demonstrate_vanishing_gradient_impact()
```
Solutions to Vanishing Gradients
| Solution | Description | Effectiveness |
|---|---|---|
| Gradient Clipping | Limit gradient magnitude | Helps exploding, not vanishing |
| Better Initialization | Orthogonal initialization (sketch after the table) | Modest improvement |
| LSTM/GRU | Gated architectures | Very effective |
| Skip Connections | Connect distant time steps | Effective |
| Attention | Direct access to all states | Very effective |
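The "better initialization" row deserves a quick illustration. Orthogonal matrices have all singular values equal to 1, so repeated products of $W_{hh}$ neither shrink nor blow up as quickly. A minimal sketch using PyTorch's built-in initializers (the `weight_hh_l{k}` / `weight_ih_l{k}` names follow nn.RNN's parameter naming convention):

```python
def init_rnn_orthogonal(rnn):
    """Orthogonal init for recurrent weights, Xavier for input weights, zero biases."""
    for name, param in rnn.named_parameters():
        if "weight_hh" in name:
            nn.init.orthogonal_(param)       # all singular values equal to 1
        elif "weight_ih" in name:
            nn.init.xavier_uniform_(param)
        elif "bias" in name:
            nn.init.zeros_(param)

rnn_ortho = nn.RNN(10, 32, num_layers=2, batch_first=True)
init_rnn_orthogonal(rnn_ortho)
```

As the table notes, this only delays the problem; gated architectures are the more robust fix.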
```python
def gradient_clipping_demo():
"""Demonstrate gradient clipping to prevent explosion."""
model = nn.RNN(10, 32, batch_first=True)
x = torch.randn(1, 100, 10, requires_grad=True)
output, _ = model(x)
loss = output.sum()
loss.backward()
# Check gradient norms before clipping
total_norm = 0
for p in model.parameters():
if p.grad is not None:
total_norm += p.grad.norm().item() ** 2
total_norm = total_norm ** 0.5
print(f"Gradient norm before clipping: {total_norm:.4f}")
# Apply gradient clipping
max_norm = 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
# Check after clipping
total_norm_after = 0
for p in model.parameters():
if p.grad is not None:
total_norm_after += p.grad.norm().item() ** 2
total_norm_after = total_norm_after ** 0.5
print(f"Gradient norm after clipping: {total_norm_after:.4f}")
gradient_clipping_demo()
```
Bidirectional RNNs
Why Look Both Ways?
In many tasks, future context is as important as past context:
- “The bank of the river” vs “I went to the bank”
- In translation, the end of a sentence can clarify the beginning
```python
class BidirectionalRNN(nn.Module):
"""
Bidirectional RNN - process sequence in both directions.
Forward: x_1 → x_2 → x_3 → x_4 →
Backward: ← x_1 ← x_2 ← x_3 ← x_4
At each position, we have context from both directions.
"""
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.hidden_size = hidden_size
# Two separate RNNs
self.rnn_forward = nn.RNN(input_size, hidden_size, batch_first=True)
self.rnn_backward = nn.RNN(input_size, hidden_size, batch_first=True)
# Output layer (hidden_size * 2 because we concatenate)
self.fc = nn.Linear(hidden_size * 2, output_size)
def forward(self, x):
"""
Args:
x: (batch, seq_len, input_size)
Returns:
output: (batch, seq_len, output_size)
"""
# Forward pass
forward_out, _ = self.rnn_forward(x)
# Backward pass (flip, process, flip back)
x_reversed = torch.flip(x, dims=[1])
backward_out, _ = self.rnn_backward(x_reversed)
backward_out = torch.flip(backward_out, dims=[1])
# Concatenate
combined = torch.cat([forward_out, backward_out], dim=-1)
# Output
output = self.fc(combined)
return output
# PyTorch's built-in bidirectional RNN
birnn = nn.RNN(input_size=10, hidden_size=32,
batch_first=True, bidirectional=True)
x = torch.randn(4, 20, 10)
output, h_n = birnn(x)
print(f"Input: {x.shape}")
print(f"Output: {output.shape}") # (4, 20, 64) - doubled hidden size
print(f"Hidden: {h_n.shape}") # (2, 4, 32) - 2 directions
Deep RNNs
Stacking RNN Layers
```python
class DeepRNN(nn.Module):
"""
Deep RNN - stack multiple RNN layers.
Layer 1: Processes input sequence
Layer 2: Processes Layer 1's hidden states
Layer 3: Processes Layer 2's hidden states
...
Deeper = more abstract representations
"""
def __init__(self, input_size, hidden_size, num_layers, output_size):
super().__init__()
# Stack of RNN layers
self.rnn = nn.RNN(
input_size=input_size,
hidden_size=hidden_size,
num_layers=num_layers,
batch_first=True,
dropout=0.2 if num_layers > 1 else 0 # Dropout between layers
)
self.fc = nn.Linear(hidden_size, output_size)
def forward(self, x, h_0=None):
output, h_n = self.rnn(x, h_0)
return self.fc(output), h_n
# Compare shallow vs deep
shallow = nn.RNN(10, 128, num_layers=1, batch_first=True)
deep = nn.RNN(10, 64, num_layers=4, batch_first=True)
# Two different architectures - compare their parameter counts
shallow_params = sum(p.numel() for p in shallow.parameters())
deep_params = sum(p.numel() for p in deep.parameters())
print(f"Shallow RNN (1 layer, 128 hidden): {shallow_params:,} params")
print(f"Deep RNN (4 layers, 64 hidden): {deep_params:,} params")
x = torch.randn(8, 50, 10)
print(f"\nShallow output: {shallow(x)[0].shape}")
print(f"Deep output: {deep(x)[0].shape}")
Practical Example: Character-Level Language Model
```python
class CharRNN(nn.Module):
"""
Character-level language model.
Given: "Hello Worl"
Predict: "ello World" (next character at each position)
"""
def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
super().__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
self.embedding = nn.Embedding(vocab_size, embed_size)
self.rnn = nn.RNN(embed_size, hidden_size, num_layers,
batch_first=True, dropout=0.2)
self.fc = nn.Linear(hidden_size, vocab_size)
def forward(self, x, hidden=None):
# x: (batch, seq_len) character indices
embedded = self.embedding(x) # (batch, seq_len, embed_size)
output, hidden = self.rnn(embedded, hidden) # (batch, seq_len, hidden)
logits = self.fc(output) # (batch, seq_len, vocab_size)
return logits, hidden
def generate(self, start_char, char_to_idx, idx_to_char,
length=100, temperature=1.0):
"""Generate text character by character."""
self.eval()
# Start with initial character
current = torch.tensor([[char_to_idx[start_char]]])
hidden = None
generated = [start_char]
with torch.no_grad():
for _ in range(length):
logits, hidden = self.forward(current, hidden)
# Apply temperature
probs = torch.softmax(logits[0, -1] / temperature, dim=0)
# Sample from distribution
next_idx = torch.multinomial(probs, 1).item()
next_char = idx_to_char[next_idx]
generated.append(next_char)
current = torch.tensor([[next_idx]])
return ''.join(generated)
def train_char_rnn():
"""Train a character-level RNN on sample text."""
# Sample text (in practice, use a large corpus)
text = """
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them.
""" * 100 # Repeat for more training data
# Build vocabulary
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size}")
print(f"Characters: {''.join(chars)}")
# Prepare data
seq_length = 50
def create_sequences(text, seq_length):
inputs = []
targets = []
for i in range(0, len(text) - seq_length):
inputs.append([char_to_idx[c] for c in text[i:i+seq_length]])
targets.append([char_to_idx[c] for c in text[i+1:i+seq_length+1]])
return torch.tensor(inputs), torch.tensor(targets)
inputs, targets = create_sequences(text, seq_length)
# Create model
model = CharRNN(vocab_size, embed_size=64, hidden_size=128, num_layers=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
# Training loop
batch_size = 64
num_epochs = 50
for epoch in range(num_epochs):
# Random batch
idx = torch.randint(0, len(inputs), (batch_size,))
x_batch = inputs[idx]
y_batch = targets[idx]
# Forward
logits, _ = model(x_batch)
loss = criterion(logits.view(-1, vocab_size), y_batch.view(-1))
# Backward
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
optimizer.step()
if (epoch + 1) % 10 == 0:
print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
# Generate sample
sample = model.generate('T', char_to_idx, idx_to_char,
length=100, temperature=0.8)
print(f"Sample: {sample[:80]}...")
print()
return model, char_to_idx, idx_to_char
# Uncomment to train
# model, c2i, i2c = train_char_rnn()
```
RNN Applications
| Application | Architecture | Input | Output |
|---|---|---|---|
| Sentiment Analysis | Many-to-One | Text | Positive/Negative |
| Language Modeling | Many-to-Many | Characters/Words | Next char/word |
| Machine Translation | Encoder-Decoder | Source sentence | Target sentence |
| Speech Recognition | Many-to-Many | Audio frames | Text |
| Music Generation | Many-to-Many | Notes | Next notes |
| Time Series Forecasting (example after the table) | Many-to-One | Past values | Future value |
| Named Entity Recognition | Many-to-Many | Words | Entity labels |
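To make the time series forecasting row concrete, here is a minimal many-to-one sketch. The model, window length, and noisy sine-wave data are illustrative and untrained:

```python
class NextValuePredictor(nn.Module):
    """Many-to-one forecaster: read a window of past values, predict the next one."""
    def __init__(self, hidden_size=32):
        super().__init__()
        self.rnn = nn.RNN(1, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):                  # x: (batch, window, 1)
        _, h_n = self.rnn(x)
        return self.fc(h_n.squeeze(0))     # (batch, 1)

# Toy usage: predict the next point of a noisy sine wave from a 30-step window
t = torch.linspace(0, 20, 400)
series = torch.sin(t) + 0.05 * torch.randn_like(t)
window = series[:30].view(1, 30, 1)
forecaster = NextValuePredictor()
print(forecaster(window).shape)            # torch.Size([1, 1]) - untrained output
```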
Exercises
Exercise 1: Implement RNN from Scratch
Implement a complete RNN without using nn.RNN. Verify it gives similar results to nn.RNN:
```python
class MyRNN(nn.Module):
def __init__(self, input_size, hidden_size):
super().__init__()
# Initialize W_xh, W_hh, bias
# Implement forward pass with proper state management
pass
def forward(self, x, h_0=None):
# Process sequence step by step
# Return outputs and final hidden state
pass
```
Exercise 2: Adding Problem
Implement the adding problem to test long-range dependencies (a data-generation sketch follows this list):
- Input: Two sequences - numbers and a mask
- Output: Sum of numbers where mask is 1
- Example: Numbers [0.3, 0.1, 0.8, 0.2], Mask [1, 0, 0, 1] → Output: 0.5
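A data-generation sketch for this exercise. One common variant of the adding problem marks exactly two positions, one in each half of the sequence; the exact layout is up to you:

```python
def make_adding_batch(batch_size=32, seq_length=50):
    """Generate one batch: inputs are (value, mask) pairs, target is the masked sum."""
    numbers = torch.rand(batch_size, seq_length, 1)      # values in [0, 1)
    mask = torch.zeros(batch_size, seq_length, 1)
    for b in range(batch_size):
        # One marked position in each half of the sequence
        i = torch.randint(0, seq_length // 2, (1,))
        j = torch.randint(seq_length // 2, seq_length, (1,))
        mask[b, i] = 1.0
        mask[b, j] = 1.0
    inputs = torch.cat([numbers, mask], dim=-1)          # (batch, seq_len, 2)
    targets = (numbers * mask).sum(dim=1)                # (batch, 1)
    return inputs, targets

x, y = make_adding_batch()
print(x.shape, y.shape)   # torch.Size([32, 50, 2]) torch.Size([32, 1])
```

Train an RNN on increasingly long sequences and note where the loss stops improving.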
Exercise 3: Name Classification
Build a model to classify names by nationality:
- Input: Character sequence (name)
- Output: Nationality (English, French, German, etc.)
Exercise 4: Sequence-to-Sequence
Build a simple date format converter:
- Input: “January 15, 2023”
- Output: “2023-01-15”
Exercise 5: Visualize Hidden States
For a trained sentiment analysis model:
- Process sentences with known sentiment
- Extract hidden states at each position
- Visualize with t-SNE or PCA
- Color by sentiment and position
Key Takeaways
| Concept | Key Insight |
|---|---|
| Hidden State | Memory that carries information across time steps |
| BPTT | Backprop through unrolled network - gradients flow through time |
| Vanishing Gradient | Long sequences → gradients shrink → early inputs forgotten |
| Exploding Gradient | Gradients grow → training unstable → use clipping |
| Bidirectional | Context from both past and future |
| Deep RNNs | Stack layers for hierarchical representations |
Vanilla RNNs are rarely used in practice! The vanishing gradient problem makes them unable to learn long-range dependencies. In the next chapter, we’ll learn about LSTMs and GRUs - architectures specifically designed to solve this problem.