Recurrent Neural Networks

The Problem with Sequences

CNNs transformed computer vision. But they have a fundamental limitation: they assume fixed-size inputs. What about:

Text: “The cat sat on the mat” (6 words)
Time series: Stock prices over months (varying length)
Audio: Speech of different durations
Video: Frames over time

These are sequences - data where order matters and length varies.

The Core Insight: Sequences have temporal dependencies. The word “sat” depends on knowing “cat” came before it. Today’s stock price depends on yesterday’s. We need networks that can remember.

From Feedforward to Recurrent

The Feedforward Limitation

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# A feedforward network processes each input independently
class FeedforwardClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        # Each input processed independently
        # No memory of previous inputs!
        return self.fc2(torch.relu(self.fc1(x)))


# Problem: Processing words one at a time loses context
words = ["The", "cat", "sat", "on", "the", "mat"]
# When processing "mat", the network has no memory of "cat"!

The Recurrent Solution

An RNN maintains a hidden state that carries information across time steps:

h_t = f(h_{t-1}, x_t)

class SimpleRNNCell(nn.Module):
    """
    A single RNN cell - the building block of RNNs.
    
    At each time step:
    1. Combine current input with previous hidden state
    2. Apply non-linearity
    3. Output new hidden state
    """
    
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        # Weights for input
        self.W_xh = nn.Linear(input_size, hidden_size, bias=False)
        # Weights for hidden state
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=True)
    
    def forward(self, x, h_prev):
        """
        Args:
            x: Current input (batch, input_size)
            h_prev: Previous hidden state (batch, hidden_size)
        
        Returns:
            h_new: Updated hidden state
        """
        h_new = torch.tanh(self.W_xh(x) + self.W_hh(h_prev))
        return h_new


# Demonstrate the recurrent connection
cell = SimpleRNNCell(input_size=10, hidden_size=20)

# Process a sequence of length 5
sequence = torch.randn(3, 5, 10)  # (batch=3, seq_len=5, input=10)
batch_size = sequence.size(0)
seq_len = sequence.size(1)

# Initialize hidden state
h = torch.zeros(batch_size, 20)

# Process each time step
hidden_states = []
for t in range(seq_len):
    x_t = sequence[:, t, :]  # Get input at time t
    h = cell(x_t, h)         # Update hidden state
    hidden_states.append(h)
    print(f"Time {t}: h shape = {h.shape}")

print(f"\nFinal hidden state shape: {h.shape} (summarizes all {seq_len} time steps)")

The RNN Equations

Mathematical Formulation

At each time step $t$:

h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)

y_t = W_{hy} h_t + b_y

Where:

$x_t \in \mathbb{R}^{d}$ is the input at time $t$
$h_t \in \mathbb{R}^{h}$ is the hidden state
$y_t \in \mathbb{R}^{o}$ is the output
$W_{xh} \in \mathbb{R}^{h \times d}$ transforms input to hidden
$W_{hh} \in \mathbb{R}^{h \times h}$ transforms previous hidden to current
$W_{hy} \in \mathbb{R}^{o \times h}$ transforms hidden to output
$b_h \in \mathbb{R}^{h}$ and $b_y \in \mathbb{R}^{o}$ are the bias vectors
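To make the shapes concrete, here is a minimal sketch of the two equations in plain Python, with no PyTorch. The sizes (d = 2, h = 3, o = 1) and all weight values are made up for illustration only.

```python
import math

# Hypothetical toy sizes: d = 2 inputs, h = 3 hidden units, o = 1 output.
W_xh = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]                 # h x d
W_hh = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]  # h x h
b_h = [0.0, 0.1, -0.1]                                        # h
W_hy = [[1.0, -1.0, 0.5]]                                     # o x h
b_y = [0.2]                                                   # o


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_t, h_prev):
    """One application of h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), y_t = W_hy h_t + b_y."""
    pre = [a + b + c for a, b, c in zip(matvec(W_xh, x_t), matvec(W_hh, h_prev), b_h)]
    h_t = [math.tanh(p) for p in pre]
    y_t = [a + b for a, b in zip(matvec(W_hy, h_t), b_y)]
    return h_t, y_t


# Run three time steps, carrying the hidden state forward each time.
h = [0.0, 0.0, 0.0]
for t, x_t in enumerate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], start=1):
    h, y = rnn_step(x_t, h)
    print(f"t={t}: h={[round(v, 3) for v in h]}, y={[round(v, 3) for v in y]}")
```

The same weights are reused at every step; only the hidden state changes.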

class VanillaRNN(nn.Module):
    """
    Vanilla (Elman) RNN implemented from scratch.
    
    This is the simplest form of RNN, directly implementing
    the recurrence relation with tanh non-linearity.
    """
    
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Input to hidden weights
        self.W_xh = nn.ModuleList([
            nn.Linear(input_size if i == 0 else hidden_size, hidden_size)
            for i in range(num_layers)
        ])
        
        # Hidden to hidden weights
        self.W_hh = nn.ModuleList([
            nn.Linear(hidden_size, hidden_size)
            for _ in range(num_layers)
        ])
        
        # Hidden to output
        self.W_hy = nn.Linear(hidden_size, output_size)
    
    def forward(self, x, h_0=None):
        """
        Process entire sequence.
        
        Args:
            x: Input sequence (batch, seq_len, input_size)
            h_0: Initial hidden state (num_layers, batch, hidden_size)
        
        Returns:
            outputs: Output at each time step (batch, seq_len, output_size)
            h_n: Final hidden state (num_layers, batch, hidden_size)
        """
        batch_size, seq_len, _ = x.size()
        
        # Initialize hidden states
        if h_0 is None:
            h = [torch.zeros(batch_size, self.hidden_size, device=x.device)
                 for _ in range(self.num_layers)]
        else:
            h = [h_0[i] for i in range(self.num_layers)]
        
        outputs = []
        
        for t in range(seq_len):
            x_t = x[:, t, :]
            
            # Process through each layer
            for layer in range(self.num_layers):
                h[layer] = torch.tanh(
                    self.W_xh[layer](x_t) + self.W_hh[layer](h[layer])
                )
                x_t = h[layer]  # Output of this layer is input to next
            
            # Compute output
            y_t = self.W_hy(h[-1])
            outputs.append(y_t)
        
        outputs = torch.stack(outputs, dim=1)
        h_n = torch.stack(h, dim=0)
        
        return outputs, h_n


# Test our implementation
rnn = VanillaRNN(input_size=10, hidden_size=32, output_size=5, num_layers=2)
x = torch.randn(4, 20, 10)  # batch=4, seq_len=20, input=10

outputs, h_n = rnn(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {outputs.shape}")  # (4, 20, 5)
print(f"Final hidden shape: {h_n.shape}")  # (2, 4, 32)

RNN Architectures

Many-to-One (Sequence Classification)

Use the final hidden state to classify the entire sequence:

class SentimentClassifier(nn.Module):
    """
    Many-to-One: Classify entire sequence (e.g., sentiment analysis)
    
    Input: "This movie was absolutely terrible"
    Output: "Negative" (single label)
    """
    
    def __init__(self, vocab_size, embed_dim, hidden_size, num_classes):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        # x: (batch, seq_len) - token indices
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        
        # Process through RNN
        output, h_n = self.rnn(embedded)  # h_n: (1, batch, hidden)
        
        # Use final hidden state for classification
        final_hidden = h_n.squeeze(0)  # (batch, hidden)
        logits = self.fc(final_hidden)  # (batch, num_classes)
        
        return logits


# Example
classifier = SentimentClassifier(vocab_size=10000, embed_dim=128, 
                                  hidden_size=256, num_classes=2)

# Simulated tokenized sentence (batch of 8 sentences, max length 50)
sentences = torch.randint(0, 10000, (8, 50))
predictions = classifier(sentences)
print(f"Predictions shape: {predictions.shape}")  # (8, 2)
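The simulated batch above has a fixed length of 50. Real sentences vary in length and are usually padded, so the final hidden state is computed after the padding steps. One common way to handle this, which the classifier above does not do, is `pack_padded_sequence`. The sketch below uses made-up sizes and shows that packing recovers the hidden state at each sequence's last real step.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)

# Two sequences with true lengths 5 and 3, zero-padded to length 5
x = torch.randn(2, 5, 4)
lengths = torch.tensor([5, 3])
x[1, 3:] = 0.0  # padding positions of the shorter sequence

# Without packing: h_n is read after the padding steps
_, h_padded = rnn(x)

# With packing: h_n is read at each sequence's last real step
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, h_packed = rnn(packed)

# Reference: run the short sequence on its own, without any padding
_, h_ref = rnn(x[1:2, :3])

packed_ok = torch.allclose(h_packed[0, 1], h_ref[0, 0], atol=1e-5)
padded_ok = torch.allclose(h_padded[0, 1], h_ref[0, 0], atol=1e-5)
print(f"Packed matches unpadded run: {packed_ok}")
print(f"Padded matches unpadded run: {padded_ok}")
```

In a classifier like the one above, the packed input would replace `embedded` in the call to `self.rnn`, with `lengths` passed in alongside the token indices.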

One-to-Many (Sequence Generation)

Generate a sequence from a single input:

class CaptionGenerator(nn.Module):
    """
    One-to-Many: Generate sequence from single input (e.g., image captioning)
    
    Input: Image features
    Output: "A cat sitting on a couch"
    """
    
    def __init__(self, image_dim, hidden_size, vocab_size, max_length=20):
        super().__init__()
        
        self.max_length = max_length
        
        # Project image features to hidden state
        self.image_proj = nn.Linear(image_dim, hidden_size)
        
        # Word embedding
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        
        # RNN cell for generation
        self.rnn_cell = nn.RNNCell(hidden_size, hidden_size)
        
        # Output projection
        self.fc = nn.Linear(hidden_size, vocab_size)
    
    def forward(self, image_features, max_length=None):
        """
        Generate caption from image features.
        
        Args:
            image_features: (batch, image_dim)
        
        Returns:
            Generated token indices
        """
        if max_length is None:
            max_length = self.max_length
        
        batch_size = image_features.size(0)
        
        # Initialize hidden state from image
        h = torch.tanh(self.image_proj(image_features))
        
        # Start with <START> token (assume index 0)
        current_token = torch.zeros(batch_size, dtype=torch.long, 
                                    device=image_features.device)
        
        generated = []
        
        for t in range(max_length):
            # Embed current token
            x = self.embedding(current_token)
            
            # RNN step
            h = self.rnn_cell(x, h)
            
            # Predict next token
            logits = self.fc(h)
            current_token = logits.argmax(dim=-1)
            generated.append(current_token)
        
        return torch.stack(generated, dim=1)

Many-to-Many (Sequence-to-Sequence)

Transform one sequence into another:

class Seq2SeqTranslator(nn.Module):
    """
    Many-to-Many: Transform sequence to sequence (e.g., translation)
    
    Input: "Hello, how are you?"
    Output: "Bonjour, comment allez-vous?"
    """
    
    def __init__(self, src_vocab_size, tgt_vocab_size, embed_dim, hidden_size):
        super().__init__()
        
        # Encoder
        self.src_embedding = nn.Embedding(src_vocab_size, embed_dim)
        self.encoder = nn.RNN(embed_dim, hidden_size, batch_first=True)
        
        # Decoder
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, embed_dim)
        self.decoder = nn.RNN(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, tgt_vocab_size)
    
    def forward(self, src, tgt):
        """
        Args:
            src: Source sequence (batch, src_len)
            tgt: Target sequence for teacher forcing (batch, tgt_len)
        """
        # Encode source
        src_embedded = self.src_embedding(src)
        _, encoder_hidden = self.encoder(src_embedded)
        
        # Decode target (teacher forcing during training)
        tgt_embedded = self.tgt_embedding(tgt)
        decoder_output, _ = self.decoder(tgt_embedded, encoder_hidden)
        
        # Project to vocabulary
        logits = self.fc(decoder_output)
        
        return logits


translator = Seq2SeqTranslator(
    src_vocab_size=10000, tgt_vocab_size=8000,
    embed_dim=256, hidden_size=512
)

src = torch.randint(0, 10000, (16, 30))  # Batch of 16, max length 30
tgt = torch.randint(0, 8000, (16, 25))   # Target length 25

output = translator(src, tgt)
print(f"Output shape: {output.shape}")  # (16, 25, 8000)
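The forward pass above takes the target sequence as decoder input, but the section does not show how the loss is aligned with it. A common convention, assumed here rather than taken from the original code, is to feed the decoder the target shifted right (starting with a start token) and to score its predictions against the target shifted left (ending with an end token). The token ids `SOS` and `EOS` below are hypothetical.

```python
# Hypothetical special-token ids; real vocabularies define their own.
SOS, EOS = 1, 2


def make_decoder_io(target_ids):
    """Split one target sentence into decoder inputs and prediction labels."""
    full = [SOS] + target_ids + [EOS]
    decoder_input = full[:-1]  # what the decoder reads at each step
    labels = full[1:]          # what it should predict at that step
    return decoder_input, labels


dec_in, labels = make_decoder_io([17, 42, 8])
for step, (inp, lab) in enumerate(zip(dec_in, labels)):
    print(f"step {step}: decoder reads {inp:>2} -> should predict {lab}")
```

With this alignment, the decoder never sees the token it is asked to predict at the same step.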

Backpropagation Through Time (BPTT)

The Challenge of Temporal Gradients

RNNs are trained using BPTT - unrolling the network through time and applying backpropagation:

def bptt_visualization():
    """
    Demonstrate how gradients flow backward through time.
    """
    # Consider a 4-step sequence
    # Loss at final step L_4
    # 
    # Forward:  x_1 → h_1 → h_2 → h_3 → h_4 → y_4 → L_4
    #              ↗      ↗      ↗      ↗
    #            x_2    x_3    x_4
    #
    # Backward gradient for W_hh:
    # ∂L/∂W_hh = ∂L/∂h_4 · ∂h_4/∂W_hh 
    #          + ∂L/∂h_4 · ∂h_4/∂h_3 · ∂h_3/∂W_hh
    #          + ∂L/∂h_4 · ∂h_4/∂h_3 · ∂h_3/∂h_2 · ∂h_2/∂W_hh
    #          + ...
    
    print("BPTT Gradient Flow:")
    print("=" * 50)
    print("∂L/∂W_hh = Σ_t ∂L/∂h_T · (∏_{k=t+1}^{T} ∂h_k/∂h_{k-1}) · ∂h_t/∂W_hh")
    print()
    print("The product of Jacobians can explode or vanish!")

bptt_visualization()


class RNNWithGradientTracking(nn.Module):
    """RNN that tracks gradient magnitudes through time."""
    
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.gradient_norms = []
    
    def forward(self, x):
        output, h_n = self.rnn(x)
        return output, h_n
    
    def track_gradients(self, x, target):
        """Track how gradients change through time."""
        x.requires_grad_(True)
        output, _ = self.forward(x)
        
        # Compute gradients for output at each time step
        seq_len = output.size(1)
        gradient_norms = []
        
        for t in range(seq_len):
            if x.grad is not None:
                x.grad.zero_()
            
            # Gradient of output at time t w.r.t. input at time 0
            output[:, t, 0].sum().backward(retain_graph=True)
            
            # Gradient norm from first input
            grad_norm = x.grad[:, 0, :].norm().item()
            gradient_norms.append(grad_norm)
        
        return gradient_norms


# Demonstrate gradient flow
torch.manual_seed(42)
model = RNNWithGradientTracking(10, 32)
x = torch.randn(1, 50, 10)

gradient_norms = model.track_gradients(x, None)

plt.figure(figsize=(10, 4))
plt.plot(gradient_norms)
plt.xlabel('Time step (distance from input)')
plt.ylabel('Gradient norm')
plt.title('Gradient Flow in Vanilla RNN')
plt.yscale('log')
plt.grid(True)
plt.show()

print(f"Gradient ratio (step 49 / step 0): {gradient_norms[-1]/gradient_norms[0]:.6f}")
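The BPTT sum can be checked by hand on a scalar RNN. The sketch below assumes a toy model h_t = tanh(w * h_{t-1} + u * x_t) with a squared-error loss on the final state only. The values of `w`, `u`, `xs` and `target` are arbitrary. It accumulates dL/dw by walking backward through the steps and compares the result with a finite-difference estimate.

```python
import math


def forward(w, u, xs):
    """Return [h_0, h_1, ..., h_T] for h_t = tanh(w * h_{t-1} + u * x_t)."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs


def loss(w, u, xs, target):
    """Squared error on the final hidden state only."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2


def bptt_grad_w(w, u, xs, target):
    """dL/dw accumulated by walking backward through the unrolled steps."""
    hs = forward(w, u, xs)
    dL_dh = hs[-1] - target                # dL/dh_T
    grad = 0.0
    for t in range(len(xs), 0, -1):
        local = 1.0 - hs[t] ** 2           # tanh'(a_t) = 1 - h_t^2
        grad += dL_dh * local * hs[t - 1]  # direct effect of w at step t
        dL_dh *= local * w                 # carry the gradient back to h_{t-1}
    return grad


w, u, target = 0.8, 0.5, 0.2
xs = [1.0, -0.5, 0.3, 0.7]

analytic = bptt_grad_w(w, u, xs, target)
eps = 1e-6
numeric = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)

print(f"BPTT gradient:     {analytic:.8f}")
print(f"Finite difference: {numeric:.8f}")
```

Each pass through the loop multiplies the running gradient by one more factor of `local * w`, which is the scalar version of the Jacobian product in the formula.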

The Vanishing/Exploding Gradient Problem

Why Gradients Vanish

The gradient through time involves products of the recurrent weight matrix:

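\frac{\partial h_T}{\partial h_1} = \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}} = \prod_{t=2}^{T} \text{diag}(\tanh'(a_t)) \, W_{hh}, \quad a_t = W_{xh} x_t + W_{hh} h_{t-1} + b_h

Here $a_t$ is the pre-activation at step $t$, and the factors are multiplied in order from $t = T$ down to $t = 2$. Because $\tanh'(a_t) = 1 - h_t^2 \le 1$, the non-linearity can only shrink each factor.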

def analyze_gradient_flow():
    """
    Analyze why gradients vanish or explode in RNNs.
    """
    hidden_size = 100
    seq_length = 100
    
    # Different initialization scales
    scales = [0.5, 1.0, 1.5]
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    for ax, scale in zip(axes, scales):
        # Initialize W_hh with different scales
        W = torch.randn(hidden_size, hidden_size) * scale / np.sqrt(hidden_size)
        
        # Simulate gradient flow (simplified - ignoring tanh derivative)
        gradient = torch.eye(hidden_size)
        gradient_norms = [gradient.norm().item()]
        
        for t in range(seq_length):
            gradient = gradient @ W
            gradient_norms.append(gradient.norm().item())
        
        ax.semilogy(gradient_norms)
        ax.set_xlabel('Time steps back')
        ax.set_ylabel('Gradient norm (log scale)')
        ax.set_title(f'Scale = {scale}')
        ax.axhline(y=1, color='r', linestyle='--', alpha=0.5)
        ax.set_ylim([1e-20, 1e20])
    
    plt.tight_layout()
    plt.show()

analyze_gradient_flow()
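The simulation above ignores the tanh derivative. A scalar sketch shows what happens when it is included. It assumes the toy recurrence h_t = tanh(w * h_{t-1} + u * x) with a constant input; `u` and `x` are arbitrary choices. Each Jacobian factor is (1 - h_t^2) * w, so the product can never exceed |w|^T.

```python
import math


def jacobian_product(w, steps, u=0.5, x=1.0):
    """|dh_T/dh_0| for the scalar RNN h_t = tanh(w * h_{t-1} + u * x)."""
    h, prod = 0.0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h + u * x)
        prod *= (1.0 - h * h) * w  # one Jacobian factor: tanh'(a_t) * w
    return abs(prod)


for w in (0.5, 1.0, 1.5):
    linear_only = w ** 50
    with_tanh = jacobian_product(w, 50)
    print(f"w={w}: |w|^50 = {linear_only:.3e}, with tanh' = {with_tanh:.3e}")
```

In this toy setting the hidden state saturates, so the product shrinks even for w = 1.5, where the linear-only simulation would explode. Exploding gradients need the weights to outweigh that saturation.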

The Impact on Learning

def demonstrate_vanishing_gradient_impact():
    """
    Show how vanishing gradients affect learning long-range dependencies.
    """
    
    # Task: Remember the first element of a sequence
    # Output at the end should equal input at the beginning
    
    class CopyFirstTask(nn.Module):
        def __init__(self, hidden_size):
            super().__init__()
            self.rnn = nn.RNN(1, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, 1)
        
        def forward(self, x):
            output, h_n = self.rnn(x)
            # Use final hidden state
            return self.fc(h_n.squeeze(0))
    
    def train_copy_task(seq_length, hidden_size=32, num_epochs=1000):
        model = CopyFirstTask(hidden_size)
        optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
        criterion = nn.MSELoss()
        
        losses = []
        
        for epoch in range(num_epochs):
            # Generate data: first value is target
            batch_size = 32
            first_value = torch.randn(batch_size, 1)
            
            # Sequence: [first_value, noise, noise, ..., noise]
            sequence = torch.randn(batch_size, seq_length, 1) * 0.1
            sequence[:, 0, :] = first_value
            
            # Forward pass
            prediction = model(sequence)
            loss = criterion(prediction, first_value)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            losses.append(loss.item())
        
        return losses
    
    # Test with different sequence lengths
    seq_lengths = [10, 30, 50, 100]
    
    plt.figure(figsize=(12, 4))
    
    for seq_len in seq_lengths:
        losses = train_copy_task(seq_len)
        plt.plot(losses, label=f'Length = {seq_len}')
    
    plt.xlabel('Epoch')
    plt.ylabel('MSE Loss')
    plt.title('Copy First Task: Longer Sequences are Harder')
    plt.legend()
    plt.yscale('log')
    plt.grid(True)
    plt.show()

demonstrate_vanishing_gradient_impact()

Solutions to Vanishing Gradients

Solution	Description	Effectiveness
Gradient Clipping	Limit gradient magnitude	Helps exploding, not vanishing
Better Initialization	Orthogonal initialization	Modest improvement
LSTM/GRU	Gated architectures	Very effective
Skip Connections	Connect distant time steps	Effective
Attention	Direct access to all states	Very effective

def gradient_clipping_demo():
    """Demonstrate gradient clipping to prevent explosion."""
    
    model = nn.RNN(10, 32, batch_first=True)
    
    x = torch.randn(1, 100, 10, requires_grad=True)
    output, _ = model(x)
    loss = output.sum()
    loss.backward()
    
    # Check gradient norms before clipping
    total_norm = 0
    for p in model.parameters():
        if p.grad is not None:
            total_norm += p.grad.norm().item() ** 2
    total_norm = total_norm ** 0.5
    print(f"Gradient norm before clipping: {total_norm:.4f}")
    
    # Apply gradient clipping
    max_norm = 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    
    # Check after clipping
    total_norm_after = 0
    for p in model.parameters():
        if p.grad is not None:
            total_norm_after += p.grad.norm().item() ** 2
    total_norm_after = total_norm_after ** 0.5
    print(f"Gradient norm after clipping:  {total_norm_after:.4f}")

gradient_clipping_demo()
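The rescaling rule behind norm-based clipping is short enough to write out. The plain-Python sketch below mirrors the idea of `clip_grad_norm_`: compute one global norm over all gradients, then scale every gradient by the same factor if that norm exceeds `max_norm`. The `eps` term and the toy gradient values are assumptions of this sketch.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one shared factor so their global norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


# Two "parameter tensors" with global norm sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(f"Norm before: {norm_before:.4f}")
print(f"Norm after:  {norm_after:.4f}")
print(f"Clipped gradients: {clipped}")
```

Because every gradient is multiplied by the same factor, clipping changes the length of the update but not its direction. This is why it helps with exploding gradients and does nothing for vanishing ones.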

Bidirectional RNNs

Why Look Both Ways?

In many tasks, future context is as important as past context:

In “The bank of the river”, the meaning of “bank” is only settled by the words that follow it (compare “I went to the bank”)
In translation, the end of a sentence can clarify the beginning

class BidirectionalRNN(nn.Module):
    """
    Bidirectional RNN - process sequence in both directions.
    
    Forward:  x_1 → x_2 → x_3 → x_4 →
    Backward: ← x_1 ← x_2 ← x_3 ← x_4
    
    At each position, we have context from both directions.
    """
    
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        
        self.hidden_size = hidden_size
        
        # Two separate RNNs
        self.rnn_forward = nn.RNN(input_size, hidden_size, batch_first=True)
        self.rnn_backward = nn.RNN(input_size, hidden_size, batch_first=True)
        
        # Output layer (hidden_size * 2 because we concatenate)
        self.fc = nn.Linear(hidden_size * 2, output_size)
    
    def forward(self, x):
        """
        Args:
            x: (batch, seq_len, input_size)
        
        Returns:
            output: (batch, seq_len, output_size)
        """
        # Forward pass
        forward_out, _ = self.rnn_forward(x)
        
        # Backward pass (flip, process, flip back)
        x_reversed = torch.flip(x, dims=[1])
        backward_out, _ = self.rnn_backward(x_reversed)
        backward_out = torch.flip(backward_out, dims=[1])
        
        # Concatenate
        combined = torch.cat([forward_out, backward_out], dim=-1)
        
        # Output
        output = self.fc(combined)
        
        return output


# PyTorch's built-in bidirectional RNN
birnn = nn.RNN(input_size=10, hidden_size=32, 
               batch_first=True, bidirectional=True)

x = torch.randn(4, 20, 10)
output, h_n = birnn(x)

print(f"Input: {x.shape}")
print(f"Output: {output.shape}")  # (4, 20, 64) - doubled hidden size
print(f"Hidden: {h_n.shape}")     # (2, 4, 32) - 2 directions
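With the built-in bidirectional RNN it helps to know how `output` and `h_n` are laid out. The sketch below splits `output` into its two halves and checks where each direction's final state sits. The sizes match the example above, and the seed is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 32
birnn = nn.RNN(input_size=10, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)

x = torch.randn(4, 20, 10)
output, h_n = birnn(x)

# The last dimension of output is [forward states | backward states]
forward_out = output[:, :, :hidden_size]
backward_out = output[:, :, hidden_size:]

# The forward direction finishes at the last time step;
# the backward direction finishes at the first time step.
fwd_matches = torch.allclose(forward_out[:, -1], h_n[0], atol=1e-6)
bwd_matches = torch.allclose(backward_out[:, 0], h_n[1], atol=1e-6)

print(f"Forward final state is output[:, -1, :H]: {fwd_matches}")
print(f"Backward final state is output[:, 0, H:]: {bwd_matches}")
```

One consequence: for sequence classification, `output[:, -1]` contains a backward state that has seen only one token. Concatenating `h_n[0]` and `h_n[1]` gives a summary in which both directions have read the whole sequence.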

Deep RNNs

Stacking RNN Layers

class DeepRNN(nn.Module):
    """
    Deep RNN - stack multiple RNN layers.
    
    Layer 1: Processes input sequence
    Layer 2: Processes Layer 1's hidden states
    Layer 3: Processes Layer 2's hidden states
    ...
    
    Deeper = more abstract representations
    """
    
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        
        # Stack of RNN layers
        self.rnn = nn.RNN(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.2 if num_layers > 1 else 0  # Dropout between layers
        )
        
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x, h_0=None):
        output, h_n = self.rnn(x, h_0)
        return self.fc(output), h_n


# Compare shallow vs deep
shallow = nn.RNN(10, 128, num_layers=1, batch_first=True)
deep = nn.RNN(10, 64, num_layers=4, batch_first=True)

# The deep network ends up with more parameters despite its narrower layers
shallow_params = sum(p.numel() for p in shallow.parameters())
deep_params = sum(p.numel() for p in deep.parameters())

print(f"Shallow RNN (1 layer, 128 hidden): {shallow_params:,} params")
print(f"Deep RNN (4 layers, 64 hidden):    {deep_params:,} params")

x = torch.randn(8, 50, 10)
print(f"\nShallow output: {shallow(x)[0].shape}")
print(f"Deep output:    {deep(x)[0].shape}")
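The parameter counts printed above can be worked out by hand. The sketch below assumes a unidirectional `nn.RNN` with biases enabled, where each layer holds an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors. The helper name `rnn_param_count` is made up for this example.

```python
def rnn_param_count(input_size, hidden_size, num_layers):
    """Count the parameters of a stacked, unidirectional RNN with biases."""
    total = 0
    for layer in range(num_layers):
        # Layers after the first read the previous layer's hidden states
        in_features = input_size if layer == 0 else hidden_size
        total += hidden_size * in_features  # input-to-hidden weights
        total += hidden_size * hidden_size  # hidden-to-hidden weights
        total += 2 * hidden_size            # two bias vectors
    return total


shallow = rnn_param_count(input_size=10, hidden_size=128, num_layers=1)
deep = rnn_param_count(input_size=10, hidden_size=64, num_layers=4)

print(f"Shallow (1 x 128): {shallow:,}")  # 17,920
print(f"Deep    (4 x 64):  {deep:,}")     # 29,824
```

The hidden-to-hidden term grows with the square of the hidden size, so halving the width saves a lot per layer, but three extra layers more than make up for it here.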

Practical Example: Character-Level Language Model

class CharRNN(nn.Module):
    """
    Character-level language model.
    
    Given: "Hello Worl"
    Predict: "ello World" (next character at each position)
    """
    
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, num_layers, 
                          batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, vocab_size)
    
    def forward(self, x, hidden=None):
        # x: (batch, seq_len) character indices
        embedded = self.embedding(x)  # (batch, seq_len, embed_size)
        output, hidden = self.rnn(embedded, hidden)  # (batch, seq_len, hidden)
        logits = self.fc(output)  # (batch, seq_len, vocab_size)
        return logits, hidden
    
    def generate(self, start_char, char_to_idx, idx_to_char, 
                 length=100, temperature=1.0):
        """Generate text character by character."""
        self.eval()
        
        # Start with initial character
        current = torch.tensor([[char_to_idx[start_char]]])
        hidden = None
        
        generated = [start_char]
        
        with torch.no_grad():
            for _ in range(length):
                logits, hidden = self.forward(current, hidden)
                
                # Apply temperature
                probs = torch.softmax(logits[0, -1] / temperature, dim=0)
                
                # Sample from distribution
                next_idx = torch.multinomial(probs, 1).item()
                next_char = idx_to_char[next_idx]
                
                generated.append(next_char)
                current = torch.tensor([[next_idx]])
        
        return ''.join(generated)


def train_char_rnn():
    """Train a character-level RNN on sample text."""
    
    # Sample text (in practice, use a large corpus)
    text = """
    To be, or not to be, that is the question:
    Whether 'tis nobler in the mind to suffer
    The slings and arrows of outrageous fortune,
    Or to take arms against a sea of troubles
    And by opposing end them.
    """ * 100  # Repeat for more training data
    
    # Build vocabulary
    chars = sorted(set(text))
    char_to_idx = {c: i for i, c in enumerate(chars)}
    idx_to_char = {i: c for c, i in char_to_idx.items()}
    vocab_size = len(chars)
    
    print(f"Vocabulary size: {vocab_size}")
    print(f"Characters: {''.join(chars)}")
    
    # Prepare data
    seq_length = 50
    
    def create_sequences(text, seq_length):
        inputs = []
        targets = []
        
        for i in range(0, len(text) - seq_length):
            inputs.append([char_to_idx[c] for c in text[i:i+seq_length]])
            targets.append([char_to_idx[c] for c in text[i+1:i+seq_length+1]])
        
        return torch.tensor(inputs), torch.tensor(targets)
    
    inputs, targets = create_sequences(text, seq_length)
    
    # Create model
    model = CharRNN(vocab_size, embed_size=64, hidden_size=128, num_layers=2)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
    
    # Training loop
    batch_size = 64
    num_epochs = 50
    
    for epoch in range(num_epochs):
        # Random batch
        idx = torch.randint(0, len(inputs), (batch_size,))
        x_batch = inputs[idx]
        y_batch = targets[idx]
        
        # Forward
        logits, _ = model(x_batch)
        loss = criterion(logits.view(-1, vocab_size), y_batch.view(-1))
        
        # Backward
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
        optimizer.step()
        
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
            
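            # Generate sample (generate() switches the model to eval mode)
            sample = model.generate('T', char_to_idx, idx_to_char, 
                                    length=100, temperature=0.8)
            print(f"Sample: {sample[:80]}...")
            print()
            model.train()  # Re-enable dropout before training continues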
    
    return model, char_to_idx, idx_to_char

# Uncomment to train
# model, c2i, i2c = train_char_rnn()
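The `temperature` argument in `generate` controls how sharply the model's distribution is peaked before sampling. The plain-Python sketch below applies the same divide-then-softmax step to three made-up logits.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
top_prob = {}
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    top_prob[temperature] = probs[0]
    print(f"T={temperature}: {[round(p, 3) for p in probs]}")
```

Low temperatures concentrate probability on the most likely character, giving safer and more repetitive text. High temperatures flatten the distribution, giving more varied text with more errors.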

RNN Applications

Application	Architecture	Input	Output
Sentiment Analysis	Many-to-One	Text	Positive/Negative
Language Modeling	Many-to-Many	Characters/Words	Next char/word
Machine Translation	Encoder-Decoder	Source sentence	Target sentence
Speech Recognition	Many-to-Many	Audio frames	Text
Music Generation	Many-to-Many	Notes	Next notes
Time Series Forecasting	Many-to-One	Past values	Future value
Named Entity Recognition	Many-to-Many	Words	Entity labels

Exercises

Exercise 1: Implement RNN from Scratch

Implement a complete RNN without using nn.RNN:

class MyRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Initialize W_xh, W_hh, bias
        # Implement forward pass with proper state management
        pass
    
    def forward(self, x, h_0=None):
        # Process sequence step by step
        # Return outputs and final hidden state
        pass

Verify it gives similar results to nn.RNN.

Exercise 2: Adding Problem

Implement the adding problem to test long-range dependencies:

Input: Two sequences - numbers and a mask
Output: Sum of numbers where mask is 1
Example: Numbers [0.3, 0.1, 0.8, 0.2], Mask [1, 0, 0, 1] → Output: 0.5

Test with sequence lengths 10, 50, 100, 200.

Exercise 3: Name Classification

Build a model to classify names by nationality:

Input: Character sequence (name)
Output: Nationality (English, French, German, etc.)

Use the names dataset from PyTorch tutorials.

Exercise 4: Sequence-to-Sequence

Build a simple date format converter:

Input: “January 15, 2023”
Output: “2023-01-15”

Generate synthetic training data and train an encoder-decoder model.

Exercise 5: Visualize Hidden States

For a trained sentiment analysis model:

Process sentences with known sentiment
Extract hidden states at each position
Visualize with t-SNE or PCA
Color by sentiment and position

What patterns do you observe?

Key Takeaways

Concept	Key Insight
Hidden State	Memory that carries information across time steps
BPTT	Backprop through unrolled network - gradients flow through time
Vanishing Gradient	Long sequences → gradients shrink → early inputs forgotten
Exploding Gradient	Gradients grow → training unstable → use clipping
Bidirectional	Context from both past and future
Deep RNNs	Stack layers for hierarchical representations

Vanilla RNNs are rarely used in practice. The vanishing gradient problem makes it very hard for them to learn long-range dependencies. In the next module, we’ll cover LSTMs and GRUs, architectures designed to address this problem.

What’s Next

Module 9: LSTMs & GRUs

Solve the vanishing gradient problem with gated architectures — learn how LSTM and GRU cells maintain long-term memory through gates.
