# Neural Networks

> From biology to math - understand how artificial neurons learn

# Neural Networks: The Foundation of Deep Learning

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/neural-networks-concept.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=d137fe4a27146cdaefd7b5762e8000bb" alt="Neural Network Architecture" width="1080" height="1080" data-path="images/courses/ml-mastery/neural-networks-concept.svg" />
</Frame>

## From Brains to Math

Your brain has about 86 billion neurons, each connected to thousands of others.

A single neuron:

1. **Receives inputs** from other neurons
2. **Weighs** how important each input is
3. **Sums** them up
4. **Activates** if the sum exceeds a threshold
5. **Sends output** to other neurons

**That's literally what an artificial neuron does!**

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/neural-networks-real-world.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=295a79bd2defbcdbd8fce23b714980d4" alt="Tesla Autopilot Neural Network" width="1080" height="1080" data-path="images/courses/ml-mastery/neural-networks-real-world.svg" />
</Frame>

***

## The Perceptron: One Artificial Neuron

### How It Works

```
Input 1 ─── weight 1 ───┐
                        │
Input 2 ─── weight 2 ───┼──► [Sum] ──► [Activation] ──► Output
                        │
Input 3 ─── weight 3 ───┘
```

**Math version:**

$$
output = activation\left(\sum_{i=1}^{n} w_i x_i + b\right) = activation(w \cdot x + b)
$$

Where:

* $x_i$ = inputs
* $w_i$ = weights (learnable)
* $b$ = bias (also learnable)
* $activation$ = a function that decides to "fire" or not
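
As a quick check of the formula, the sketch below evaluates one neuron by hand. The weights `w = [1, 1]` and bias `b = -1.5` are illustrative values chosen for this example, not the result of training; with a step activation they happen to behave like an AND gate.

```python
import numpy as np

def step(z):
    """Step activation: 1 if the weighted sum is positive, else 0."""
    return int(z > 0)

# Illustrative (assumed) parameters, picked by hand rather than learned
w = np.array([1.0, 1.0])
b = -1.5

inputs = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
outputs = []
for x in inputs:
    z = np.dot(w, x) + b      # weighted sum plus bias
    outputs.append(step(z))   # activation decides whether to "fire"
    print(f"x={x}, z={z:+.1f}, output={outputs[-1]}")
```

Only the input `[1, 1]` pushes the weighted sum above zero, so it is the only one that fires.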

***

## Building a Perceptron from Scratch

```python theme={null}
import numpy as np

class Perceptron:
    """
    A single artificial neuron -- the simplest possible neural network.
    
    Think of it as a tiny decision-maker: it looks at evidence (inputs),
    weighs how important each piece is (weights), adds it up, and makes
    a yes/no call (activation). Like a hiring manager who scores candidates
    on different criteria and hires if the total score exceeds a threshold.
    """
    
    def __init__(self, n_inputs):
        # Initialize weights to small random values. A single perceptron would also
        # train fine from zeros; random init matters once a layer has several neurons,
        # because identical starting weights would make them all learn the same thing.
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0  # The "default tendency" -- like a judge's prior disposition
    
    def forward(self, x):
        """Compute output for given inputs."""
        # Weighted sum + bias: each input contributes proportionally to its weight
        z = np.dot(x, self.weights) + self.bias
        # Step activation: fire (1) if evidence exceeds threshold, stay silent (0) otherwise
        return 1 if z > 0 else 0
    
    def train(self, X, y, learning_rate=0.1, epochs=100):
        """
        Train using the perceptron learning rule.
        
        The learning rule is beautifully simple: if the prediction is correct,
        do nothing. If wrong, nudge each weight toward the correct answer.
        The learning_rate controls how big each nudge is -- too large and the
        model oscillates, too small and it takes forever.
        """
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                prediction = self.forward(xi)
                error = yi - prediction  # +1 if we should have fired, -1 if we shouldn't have
                
                # Update rule: move weights in the direction of the error
                # If error > 0 (should have fired), increase weights for active inputs
                # If error < 0 (should not have fired), decrease weights for active inputs
                self.weights += learning_rate * error * xi
                self.bias += learning_rate * error
                
                errors += abs(error)
            
            if epoch % 20 == 0:
                print(f"Epoch {epoch}: {errors} errors")
            
            if errors == 0:
                print(f"Converged at epoch {epoch}")
                break

# Test on AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

perceptron = Perceptron(n_inputs=2)
perceptron.train(X, y_and)

print("\nAND Gate Results:")
for xi in X:
    print(f"  {xi} -> {perceptron.forward(xi)}")
```

***

## The XOR Problem: Why We Need More Layers

```python theme={null}
# XOR: outputs 1 if inputs are different
y_xor = np.array([0, 1, 1, 0])

perceptron_xor = Perceptron(n_inputs=2)
perceptron_xor.train(X, y_xor, epochs=1000)

print("\nXOR Gate Results (FAILS!):")
for xi in X:
    print(f"  {xi} -> {perceptron_xor.forward(xi)}")
```

**A single perceptron can only learn linearly separable patterns!**

XOR is not linearly separable -- you cannot draw a single straight line to separate the 0s from the 1s. Think of it like a bouncer at a club who can only apply one rule: "everyone taller than 6 feet gets in" works fine, but "people get in if they have an ID **or** they are on the list, but not both" requires understanding two conditions simultaneously. A single perceptron is that one-rule bouncer.
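
To make the "no single line works" claim concrete, the sketch below brute-forces a grid of candidate weights and biases for one step-activation neuron and counts how many reproduce each truth table. The grid range and resolution are arbitrary assumptions for illustration; the search is a sanity check, not a proof.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {
    "AND": np.array([0, 0, 0, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

# Candidate values for w1, w2 and b (an assumed, coarse grid)
grid = np.linspace(-2, 2, 21)

def count_solutions(y):
    """Count grid points (w1, w2, b) whose step neuron matches y on all 4 inputs."""
    count = 0
    for w1, w2, b in itertools.product(grid, repeat=3):
        preds = (X @ np.array([w1, w2]) + b > 0).astype(int)
        count += int(np.array_equal(preds, y))
    return count

solutions = {name: count_solutions(y) for name, y in targets.items()}
for name, n in solutions.items():
    print(f"{name}: {n} single-neuron solutions found on the grid")
```

Many grid points solve AND, while none solve XOR. The underlying reason is algebraic: firing on `[0, 1]` and `[1, 0]` but not on `[0, 0]` forces the weighted sum at `[1, 1]` to be positive as well.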

**Solution**: Stack multiple layers of neurons = **Multi-Layer Perceptron (MLP)**. The first layer learns simple patterns, the second layer combines those patterns into more complex ones -- just like how the visual cortex processes edges first, then shapes, then objects.
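
One way to see why a second layer helps is to wire one up by hand. In this sketch the weights are hand-picked assumptions, not learned values: the two hidden neurons compute OR and AND, and the output neuron fires when OR is on but AND is off, which is XOR.

```python
import numpy as np

def step(z):
    """Element-wise step activation."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 acts as OR, column 1 acts as AND (hand-picked values)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
hidden = step(X @ W1 + b1)

# Output neuron: fires for "OR and not AND"
w2 = np.array([1.0, -1.0])
b2 = -0.5
xor_out = step(hidden @ w2 + b2)

for xi, h, o in zip(X, hidden, xor_out):
    print(f"{xi} -> hidden {h} -> output {o}")
```

Each individual neuron still draws a single straight line; it is the combination of two lines in the hidden layer that carves out the XOR region.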

***

## Activation Functions

The step function (0 or 1) has a problem: its gradient is 0 everywhere except at the threshold, where it is undefined. This means gradient descent has no signal to work with -- it is like trying to roll a ball downhill on a perfectly flat surface with a single cliff edge.

We need activation functions that are **differentiable** (at least almost everywhere; ReLU has a single kink at 0) and that provide a usable gradient -- a slope the optimization can follow:

```python theme={null}
import matplotlib.pyplot as plt

def sigmoid(x):
    """S-curve from 0 to 1"""
    return 1 / (1 + np.exp(-x))

def tanh(x):
    """S-curve from -1 to 1"""
    return np.tanh(x)

def relu(x):
    """0 if negative, x if positive"""
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    """Small slope for negative values"""
    return np.where(x > 0, x, alpha * x)

# Visualize
x = np.linspace(-5, 5, 100)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
activations = [
    (sigmoid, 'Sigmoid', 'Output between 0 and 1'),
    (tanh, 'Tanh', 'Output between -1 and 1'),
    (relu, 'ReLU', 'Most popular, fast to compute'),
    (leaky_relu, 'Leaky ReLU', 'Fixes "dying ReLU" problem')
]

for ax, (func, name, desc) in zip(axes.flat, activations):
    ax.plot(x, func(x), linewidth=2)
    ax.set_title(f'{name}: {desc}')
    ax.grid(True)
    ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
    ax.axvline(x=0, color='k', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.show()
```

| Activation | Range             | Use Case                               | Gotcha                                                                                 |
| ---------- | ----------------- | -------------------------------------- | -------------------------------------------------------------------------------------- |
| Sigmoid    | (0, 1)            | Output layer for binary classification | Vanishing gradient in deep networks -- gradients shrink toward zero in early layers    |
| Tanh       | (-1, 1)           | Hidden layers (centered at 0)          | Same vanishing gradient problem as sigmoid, but centered output helps convergence      |
| ReLU       | \[0, ∞)           | Hidden layers (most common, fast)      | "Dying ReLU": if a neuron's pre-activation is negative for every input, its gradient is 0 and it stops learning |
| Softmax    | (0, 1), sums to 1 | Output for multi-class classification  | Only for the output layer -- it normalizes across all outputs to produce probabilities |
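
The vanishing-gradient gotcha in the table can be seen with a little arithmetic. The sigmoid derivative is `s * (1 - s)`, which never exceeds 0.25, and backpropagation multiplies one such factor per sigmoid layer. The sketch below ignores the weight terms for simplicity, so treat the numbers as an upper bound on the activation factors only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid with respect to its input z."""
    s = sigmoid(z)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)         # the derivative peaks at z = 0
saturated_grad = sigmoid_grad(6.0)   # far from 0 the curve is nearly flat

print(f"max sigmoid gradient: {max_grad}")
print(f"gradient at z = 6:    {saturated_grad:.5f}")

# One factor of at most max_grad per sigmoid layer (weights ignored here)
for depth in (1, 5, 10):
    print(f"{depth:2d} layers: activation factor <= {max_grad ** depth:.2e}")
```

Ten sigmoid layers already shrink the activation factor below one in a million, which is why early layers in deep sigmoid networks learn so slowly.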

<Tip>
  **Practical default**: Use ReLU for hidden layers and sigmoid/softmax for the output layer. This covers most common use cases. Only switch to Leaky ReLU or GELU if you observe dying neurons (training loss plateaus while many neurons output zero).
</Tip>
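
Softmax appears in the table above but not in the plots, so here is a minimal NumPy sketch of it. The logits are made-up scores for three classes; subtracting the maximum before exponentiating is a common stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn a 1-D vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # avoids overflow in exp for large scores
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])      # assumed example scores
probs = softmax(logits)
print("probabilities:", np.round(probs, 3), "sum:", probs.sum())

# Adding the same constant to every logit leaves the probabilities unchanged
big = softmax(np.array([1000.0, 1001.0, 1002.0]))
small = softmax(np.array([0.0, 1.0, 2.0]))
print("shift-invariant:", np.allclose(big, small))
```

The largest score gets the largest probability, and every output stays strictly between 0 and 1.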

***

## Multi-Layer Perceptron: The Universal Approximator

By stacking layers, we can approximate a remarkably broad class of functions. This is not hand-waving -- the **Universal Approximation Theorem** (Cybenko, 1989) proves that a neural network with just one hidden layer of sigmoid-like units and enough neurons can approximate any continuous function on a compact (closed and bounded) input region to arbitrary accuracy. The catch: "enough neurons" might mean millions, and the theorem says nothing about how to find the right weights, which is the hard part. In practice, deeper networks with fewer neurons per layer often learn hierarchical features more efficiently than one massive wide layer.

```python theme={null}
class NeuralNetwork:
    """
    Simple 2-layer neural network built from scratch.
    
    Architecture: Input -> Hidden Layer (sigmoid) -> Output Layer (sigmoid)
    This is the minimum viable network that can solve non-linear problems like XOR.
    """
    
    def __init__(self, input_size, hidden_size, output_size):
        # Layer 1: input -> hidden (learns basic patterns/features)
        # Weight initialization with * 0.5 keeps initial values moderate;
        # too large and the sigmoids saturate (gradients vanish), too small and learning is glacially slow
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros(hidden_size)
        
        # Layer 2: hidden -> output (combines hidden features into final prediction)
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros(output_size)
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def forward(self, X):
        """
        Forward pass: data flows input -> hidden -> output.
        We store intermediate values (z1, a1, z2, a2) because
        backpropagation needs them to compute gradients.
        """
        # Layer 1: compute weighted sum, then apply activation
        self.z1 = X @ self.W1 + self.b1       # Linear transformation
        self.a1 = self.sigmoid(self.z1)         # Non-linear activation (this is what lets us learn curves, not just lines)
        
        # Layer 2: hidden layer output becomes input to the output layer
        self.z2 = self.a1 @ self.W2 + self.b2  # Linear transformation
        self.a2 = self.sigmoid(self.z2)          # Final prediction (0-1 for binary classification)
        
        return self.a2
    
    def backward(self, X, y, learning_rate=0.5):
        """
        Backward pass: compute how much each weight contributed to the error,
        then nudge weights in the direction that reduces error.
        
        This is backpropagation -- the chain rule applied layer by layer,
        working backwards from output to input. Think of it like tracing
        blame: "The output was wrong because the hidden layer sent the wrong
        signal, which happened because the input weights were off."
        """
        m = len(X)  # Number of samples (for averaging gradients)
        
        # Output layer error: how far off were our predictions?
        dz2 = self.a2 - y.reshape(-1, 1)       # Derivative of loss w.r.t. z2
        dW2 = (self.a1.T @ dz2) / m            # How much each W2 weight contributed to error
        db2 = np.mean(dz2, axis=0)              # How much the bias contributed
        
        # Hidden layer error: chain rule propagates error backward through W2
        # The sigmoid derivative a*(1-a) is what makes this differentiable
        dz1 = (dz2 @ self.W2.T) * self.a1 * (1 - self.a1)
        dW1 = (X.T @ dz1) / m
        db1 = np.mean(dz1, axis=0)
        
        # Update weights: step in the direction that reduces error
        # learning_rate controls step size -- the fundamental tradeoff of optimization
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
    
    def train(self, X, y, epochs=1000, learning_rate=0.5):
        losses = []
        for epoch in range(epochs):
            # Forward
            output = self.forward(X)
            
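            # Loss (binary cross-entropy). Reshape y to a column so it lines up with
            # the (n_samples, 1) output instead of broadcasting to (n_samples, n_samples)
            y_col = y.reshape(-1, 1)
            loss = -np.mean(y_col * np.log(output + 1e-8) + (1 - y_col) * np.log(1 - output + 1e-8))
            losses.append(loss)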
            
            # Backward
            self.backward(X, y, learning_rate)
            
            if epoch % 200 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}")
        
        return losses

# NOW we can learn XOR!
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
losses = nn.train(X, y_xor, epochs=2000)

print("\nXOR Results (SUCCESS!):")
predictions = nn.forward(X)
for xi, pred in zip(X, predictions):
    print(f"  {xi} -> {pred[0]:.3f} (rounded: {int(pred[0] > 0.5)})")
```

***

## Backpropagation: How Networks Learn

Backpropagation uses the **chain rule** from calculus to compute gradients efficiently.

<Note>
  **Math Connection**: Backpropagation is just repeated application of the chain rule. See [Chain Rule](/courses/math-for-ml-calculus/03-chain-rule) for the mathematical foundation.
</Note>

The key insight:

1. Compute error at output
2. Propagate error backward through layers
3. Update each weight proportionally to how much it contributed to the error

$$
\frac{\partial Loss}{\partial w} = \frac{\partial Loss}{\partial output} \cdot \frac{\partial output}{\partial hidden} \cdot \frac{\partial hidden}{\partial w}
$$
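
The three factors in the formula can be checked numerically on a tiny chain `w -> hidden -> output -> loss`. This is a sketch under assumed values: the scalar network, the squared-error loss, and the numbers for `x`, `y`, `w`, and `v` are all chosen for illustration. The analytic gradient from the chain rule should match a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, v, x):
    hidden = sigmoid(w * x)        # first layer (one weight, no bias)
    output = sigmoid(v * hidden)   # second layer
    return hidden, output

def loss_fn(w, v, x, y):
    _, output = forward(w, v, x)
    return 0.5 * (output - y) ** 2

# Assumed example values
x, y = 1.5, 1.0
w, v = 0.4, -0.7

hidden, output = forward(w, v, x)
dloss_doutput = output - y                      # dLoss/d(output)
doutput_dhidden = output * (1 - output) * v     # d(output)/d(hidden)
dhidden_dw = hidden * (1 - hidden) * x          # d(hidden)/dw
analytic_grad = dloss_doutput * doutput_dhidden * dhidden_dw

# Central finite difference as an independent check
eps = 1e-6
numeric_grad = (loss_fn(w + eps, v, x, y) - loss_fn(w - eps, v, x, y)) / (2 * eps)

print(f"analytic: {analytic_grad:.8f}")
print(f"numeric:  {numeric_grad:.8f}")
```

This kind of gradient check is a handy debugging habit when writing backpropagation by hand.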

***

## Using PyTorch (The Professional Way)

```python theme={null}
import torch
import torch.nn as nn
import torch.optim as optim

# Define the network
class XORNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)
        self.layer2 = nn.Linear(4, 1)
        self.activation = nn.Sigmoid()
    
    def forward(self, x):
        x = self.activation(self.layer1(x))
        x = self.activation(self.layer2(x))
        return x

# Create network
model = XORNet()

# Loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

# Data
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y_xor, dtype=torch.float32).reshape(-1, 1)

# Training loop
for epoch in range(1000):
    # Forward pass
    output = model(X_tensor)
    loss = criterion(output, y_tensor)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 200 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}")

# Test
with torch.no_grad():
    predictions = model(X_tensor)
    print("\nPyTorch XOR Results:")
    for xi, pred in zip(X, predictions):
        print(f"  {xi} -> {pred.item():.3f}")
```

***

## Using scikit-learn

```python theme={null}
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load digit recognition dataset
digits = load_digits()
X, y = digits.data, digits.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create neural network
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Two hidden layers
    activation='relu',
    max_iter=500,
    random_state=42
)

# Train
mlp.fit(X_train, y_train)

# Evaluate
y_pred = mlp.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")

# Visualize some predictions
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(8, 8), cmap='gray')
    ax.set_title(f"Pred: {y_pred[i]}, True: {y_test[i]}")
    ax.axis('off')
plt.tight_layout()
plt.show()
```

***

## Network Architectures

| Architecture  | Layers                     | Use Case                      |
| ------------- | -------------------------- | ----------------------------- |
| Shallow       | 1-2 hidden                 | Simple patterns, tabular data |
| Deep          | 3+ hidden                  | Complex patterns              |
| Wide          | Many neurons               | More capacity per layer       |
| Deep & Narrow | Many layers, fewer neurons | Hierarchical features         |

**Rule of thumb for tabular data:**

* Start with 2 hidden layers
* Hidden size: between input and output size
* Use ReLU activation
* Use dropout for regularization
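
When comparing the architectures above, it helps to count parameters, since a fully connected layer with `n_in` inputs and `n_out` neurons has `n_in * n_out` weights plus `n_out` biases. The helper name and the two layer layouts below are hypothetical examples, not recommendations.

```python
def count_parameters(layer_sizes):
    """Total weights plus biases for a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + one bias per neuron
    return total

# Two assumed layouts for a 64-feature, 10-class problem
wide = [64, 256, 10]
deep_narrow = [64, 64, 64, 64, 10]

wide_params = count_parameters(wide)
deep_params = count_parameters(deep_narrow)

print(f"Wide {wide}: {wide_params} parameters")
print(f"Deep & narrow {deep_narrow}: {deep_params} parameters")
```

Here the deeper network has three hidden layers yet fewer parameters than the single wide layer, which is one reason depth can be an economical way to add capacity.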

***

## Regularization for Neural Networks

### Dropout

Randomly "turn off" neurons during training. Think of it like a team where you randomly bench different players in each practice session. No single player can carry the team alone, so every player has to be competent. This forces the network to build redundant representations rather than relying on a few "star" neurons -- which means it generalizes better to new data.

```python theme={null}
import torch.nn as nn  # Needed if this snippet is run on its own

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(0.3),  # 30% of neurons randomly zeroed each forward pass
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(0.3),  # Dropout is only active during training, NOT during inference
    nn.Linear(64, 10)  # No dropout on the output layer
)
```

<Tip>
  **Practical tip**: Start with dropout rate of 0.2-0.3 for hidden layers. If the model still overfits, increase toward 0.5. Never apply dropout to the output layer. Remember to call `model.eval()` during inference -- dropout must be disabled for predictions.
</Tip>
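
To show what a dropout layer does numerically, here is a minimal NumPy sketch of "inverted" dropout: surviving activations are scaled by `1 / (1 - rate)` during training so the expected value stays the same, and the layer becomes a no-op at inference. The function name and signature are assumptions made for this example.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations              # inference: pass values through unchanged
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))

train_out = dropout(activations, rate=0.3, training=True, rng=rng)
eval_out = dropout(activations, rate=0.3, training=False, rng=rng)

print(f"fraction zeroed in training: {(train_out == 0).mean():.3f}")
print(f"mean activation in training: {train_out.mean():.3f}")
print(f"unchanged at inference:      {np.array_equal(eval_out, activations)}")
```

The rescaling is why no extra correction is needed at prediction time: the average signal reaching the next layer is roughly the same in both modes.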

### Early Stopping

Stop training when validation loss stops improving -- one of the simplest and most effective regularization techniques. Training too long is like studying for an exam past the point of understanding into the territory of memorizing typos in the textbook.

```python theme={null}
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100,),
    early_stopping=True,       # Enable early stopping
    validation_fraction=0.1,   # Use 10% for validation
    n_iter_no_change=10,       # Stop after 10 epochs without improvement
    max_iter=1000
)
```
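
The patience logic behind early stopping is easy to write by hand. The sketch below is a simplified illustration, not scikit-learn's implementation: the function name, the `patience` parameter, and the validation losses are all made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch    # give up: no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch  # ran out of epochs without triggering

# Assumed validation curve: improves, then starts to overfit
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.60, 0.65]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"Stopped at epoch {stop_epoch}; best validation loss was at epoch {best_epoch}")
```

In practice you would also keep a copy of the weights from the best epoch and restore them after stopping.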

***

## Key Hyperparameters

| Hyperparameter    | Effect                                                                                  | Practical Starting Point                           |
| ----------------- | --------------------------------------------------------------------------------------- | -------------------------------------------------- |
| Learning rate     | Too high = unstable/diverges, too low = painfully slow convergence                      | 0.001 for Adam, 0.01 for SGD                       |
| Hidden layers     | More = more complex patterns, but harder to train and more overfitting risk             | 2 layers for tabular, 3+ for images/text           |
| Neurons per layer | More = more capacity per layer                                                          | Start between input size and output size           |
| Batch size        | Smaller = noisier gradients (can help escape local minima), larger = more stable/faster | 32-128 for most tasks                              |
| Activation        | Determines what non-linearities the network can learn                                   | ReLU for hidden layers, sigmoid/softmax for output |
| Dropout rate      | Higher = more regularization, lower = more capacity                                     | 0.2-0.3 as starting point                          |
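
The learning-rate row is the easiest one to demonstrate. The sketch below runs plain gradient descent on the toy function `f(w) = w**2`, an assumed stand-in for a real loss surface, with three step sizes.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) starting from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w   # step against the gradient
    return w

results = {lr: gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}
for lr, w_final in results.items():
    print(f"lr={lr}: w after 20 steps = {w_final:.4f}")
```

The smallest rate has barely moved from the starting point, the middle one is close to the minimum at 0, and the largest one overshoots further on every step and diverges.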

***

## When to Use Neural Networks

**Good for:**

* Image data (use CNNs)
* Text data (use Transformers)
* Sequential data (use RNNs/LSTMs)
* Very large datasets
* Complex non-linear patterns

**Not great for:**

* Small datasets (overfits easily -- neural nets are data-hungry by nature)
* When interpretability matters (explaining why a 10-layer network made a decision is much harder than explaining a decision tree)
* Small tabular datasets, roughly under 10,000 rows (tree-based models like XGBoost or Random Forest are usually as good or better here and need far less tuning)

<Tip>
  **Industry reality**: For tabular data in production, gradient boosted trees (XGBoost, LightGBM) frequently match or beat neural networks, both in Kaggle competitions and in real-world deployments. Neural networks shine on unstructured data: images, text, audio, and video. If someone suggests a neural network for a 5,000-row CSV, push back.
</Tip>

***

## 🚀 Mini Projects

<CardGroup cols={2}>
  <Card title="Project 1: Digit Recognizer" icon="1">
    Build a neural network to recognize handwritten digits
  </Card>

  <Card title="Project 2: Neural Network from Scratch" icon="code">
    Implement a neural network without libraries
  </Card>

  <Card title="Project 3: Activation Function Explorer" icon="wave-square">
    Compare different activation functions
  </Card>

  <Card title="Project 4: Hyperparameter Tuner" icon="sliders">
    Find optimal architecture through experimentation
  </Card>
</CardGroup>

### Project 1: Digit Recognizer

Build a neural network to recognize handwritten digits from scikit-learn's built-in digits dataset (8x8 grayscale images, a small MNIST-style benchmark).

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.datasets import load_digits
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import StandardScaler
  from sklearn.neural_network import MLPClassifier
  from sklearn.metrics import classification_report, confusion_matrix

  # Step 1: Load and explore data
  digits = load_digits()
  X, y = digits.data, digits.target

  print("="*60)
  print("🔢 DIGIT RECOGNITION WITH NEURAL NETWORKS")
  print("="*60)
  print(f"Dataset shape: {X.shape}")
  print(f"Number of classes: {len(np.unique(y))}")
  print(f"Image size: {digits.images.shape[1]}x{digits.images.shape[2]}")

  # Visualize some digits
  fig, axes = plt.subplots(2, 5, figsize=(12, 5))
  for i, ax in enumerate(axes.flat):
      ax.imshow(digits.images[i], cmap='gray')
      ax.set_title(f'Label: {y[i]}')
      ax.axis('off')
  plt.suptitle('Sample Digits')
  plt.savefig('sample_digits.png', dpi=150)

  # Step 2: Prepare data
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=42, stratify=y
  )

  # Scale features
  scaler = StandardScaler()
  X_train = scaler.fit_transform(X_train)
  X_test = scaler.transform(X_test)

  print(f"\nTraining samples: {len(X_train)}")
  print(f"Test samples: {len(X_test)}")

  # Step 3: Train neural network
  print("\n1️⃣ TRAINING NEURAL NETWORK")
  print("-"*40)

  mlp = MLPClassifier(
      hidden_layer_sizes=(128, 64),
      activation='relu',
      solver='adam',
      max_iter=500,
      random_state=42,
      verbose=True
  )

  mlp.fit(X_train, y_train)

  # Step 4: Evaluate
  print("\n2️⃣ EVALUATION")
  print("-"*40)

  train_acc = mlp.score(X_train, y_train)
  test_acc = mlp.score(X_test, y_test)

  print(f"Training accuracy: {train_acc:.4f}")
  print(f"Test accuracy: {test_acc:.4f}")

  y_pred = mlp.predict(X_test)
  print("\nClassification Report:")
  print(classification_report(y_test, y_pred))

  # Step 5: Visualize results
  fig, axes = plt.subplots(2, 2, figsize=(12, 10))

  # Confusion matrix
  ax1 = axes[0, 0]
  cm = confusion_matrix(y_test, y_pred)
  im = ax1.imshow(cm, cmap='Blues')
  ax1.set_xticks(range(10))
  ax1.set_yticks(range(10))
  ax1.set_xlabel('Predicted')
  ax1.set_ylabel('Actual')
  ax1.set_title('Confusion Matrix')
  plt.colorbar(im, ax=ax1)

  # Learning curve
  ax2 = axes[0, 1]
  ax2.plot(mlp.loss_curve_)
  ax2.set_xlabel('Iteration')
  ax2.set_ylabel('Loss')
  ax2.set_title('Training Loss Curve')

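  # Show some predictions as a 4x4 mosaic inside the bottom-left panel
  # (calling plt.subplot here would lay a new grid over the existing 2x2 figure)
  ax3 = axes[1, 0]
  idxs = np.random.choice(len(X_test), size=16, replace=False)
  tiles = scaler.inverse_transform(X_test[idxs]).reshape(4, 4, 8, 8)
  mosaic = tiles.transpose(0, 2, 1, 3).reshape(32, 32)  # stitch 16 images into one
  ax3.imshow(mosaic, cmap='gray')
  for k, idx in enumerate(idxs):
      row, col = divmod(k, 4)
      color = 'green' if y_pred[idx] == y_test[idx] else 'red'
      ax3.text(col * 8, row * 8, f'P:{y_pred[idx]} A:{y_test[idx]}',
               color=color, fontsize=7, va='top')
  ax3.set_title('Sample Predictions (P = predicted, A = actual)')
  ax3.axis('off')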

  # Architecture visualization
  ax4 = axes[1, 1]
  layer_sizes = [64] + list(mlp.hidden_layer_sizes) + [10]
  ax4.barh(range(len(layer_sizes)), layer_sizes, color='steelblue')
  ax4.set_yticks(range(len(layer_sizes)))
  ax4.set_yticklabels(['Input (64)'] + 
                      [f'Hidden {i+1} ({s})' for i, s in enumerate(mlp.hidden_layer_sizes)] + 
                      ['Output (10)'])
  ax4.set_xlabel('Number of Neurons')
  ax4.set_title('Network Architecture')

  plt.tight_layout()
  plt.savefig('digit_recognition.png', dpi=150)

  # Step 6: Analyze errors
  print("\n3️⃣ ERROR ANALYSIS")
  print("-"*40)

  errors = y_test != y_pred
  print(f"Total errors: {errors.sum()} out of {len(y_test)}")

  # Most confused pairs
  from collections import Counter
  error_pairs = [(y_test[i], y_pred[i]) for i in range(len(y_test)) if y_test[i] != y_pred[i]]
  most_common = Counter(error_pairs).most_common(5)
  print("\nMost confused digit pairs:")
  for (actual, predicted), count in most_common:
      print(f"  {actual} misclassified as {predicted}: {count} times")

  print("\n✅ Digit recognition complete!")
  ```

  **What you learned:**

  * Neural networks excel at image classification
  * Hidden layers learn hierarchical features
  * Error analysis helps understand model weaknesses
</details>

### Project 2: Neural Network from Scratch

Implement a simple neural network using only NumPy.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt

  # Step 1: Define neural network class
  class NeuralNetwork:
      """A simple 2-layer neural network from scratch"""
      
      def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
          self.learning_rate = learning_rate
          
          # Initialize weights with He initialization (scale sqrt(2 / fan_in)), which suits ReLU
          self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)
          self.b1 = np.zeros((1, hidden_size))
          self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)
          self.b2 = np.zeros((1, output_size))
          
          self.losses = []
      
      def sigmoid(self, x):
          """Sigmoid activation"""
          return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
      
      def sigmoid_derivative(self, x):
          """Derivative of sigmoid, where x is the sigmoid output a = sigmoid(z)"""
          return x * (1 - x)
      
      def relu(self, x):
          """ReLU activation"""
          return np.maximum(0, x)
      
      def relu_derivative(self, x):
          """Derivative of ReLU"""
          return (x > 0).astype(float)
      
      def softmax(self, x):
          """Softmax for output layer"""
          exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
          return exp_x / np.sum(exp_x, axis=1, keepdims=True)
      
      def forward(self, X):
          """Forward pass"""
          # Hidden layer
          self.z1 = np.dot(X, self.W1) + self.b1
          self.a1 = self.relu(self.z1)
          
          # Output layer
          self.z2 = np.dot(self.a1, self.W2) + self.b2
          self.a2 = self.softmax(self.z2)
          
          return self.a2
      
      def backward(self, X, y):
          """Backward pass (backpropagation)"""
          m = X.shape[0]
          
          # One-hot encode y
          y_onehot = np.zeros((m, self.a2.shape[1]))
          y_onehot[np.arange(m), y] = 1
          
          # Output layer gradients
          dz2 = self.a2 - y_onehot  # Cross-entropy + softmax derivative
          dW2 = np.dot(self.a1.T, dz2) / m
          db2 = np.sum(dz2, axis=0, keepdims=True) / m
          
          # Hidden layer gradients
          dz1 = np.dot(dz2, self.W2.T) * self.relu_derivative(self.a1)
          dW1 = np.dot(X.T, dz1) / m
          db1 = np.sum(dz1, axis=0, keepdims=True) / m
          
          # Update weights
          self.W2 -= self.learning_rate * dW2
          self.b2 -= self.learning_rate * db2
          self.W1 -= self.learning_rate * dW1
          self.b1 -= self.learning_rate * db1
      
      def compute_loss(self, y_pred, y_true):
          """Cross-entropy loss"""
          m = len(y_true)
          log_probs = -np.log(y_pred[np.arange(m), y_true] + 1e-8)
          return np.mean(log_probs)
      
      def fit(self, X, y, epochs=1000, verbose=True):
          """Train the network"""
          for epoch in range(epochs):
              # Forward pass
              y_pred = self.forward(X)
              
              # Compute loss
              loss = self.compute_loss(y_pred, y)
              self.losses.append(loss)
              
              # Backward pass
              self.backward(X, y)
              
              if verbose and epoch % 100 == 0:
                  acc = self.accuracy(X, y)
                  print(f"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {acc:.4f}")
          
          return self
      
      def predict(self, X):
          """Make predictions"""
          probs = self.forward(X)
          return np.argmax(probs, axis=1)
      
      def accuracy(self, X, y):
          """Calculate accuracy"""
          predictions = self.predict(X)
          return np.mean(predictions == y)

  # Step 2: Generate a simple dataset
  print("="*60)
  print("🧠 NEURAL NETWORK FROM SCRATCH")
  print("="*60)

  from sklearn.datasets import make_moons
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import StandardScaler

  # Create dataset
  X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

  # Scale features
  scaler = StandardScaler()
  X = scaler.fit_transform(X)

  # Split data
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  print(f"Training samples: {len(X_train)}")
  print(f"Test samples: {len(X_test)}")
  print(f"Features: {X_train.shape[1]}")
  print(f"Classes: {len(np.unique(y))}")

  # Step 3: Train neural network
  print("\n1️⃣ TRAINING FROM SCRATCH")
  print("-"*40)

  nn = NeuralNetwork(
      input_size=2,
      hidden_size=16,
      output_size=2,
      learning_rate=0.5
  )

  nn.fit(X_train, y_train, epochs=1000, verbose=True)

  # Step 4: Evaluate
  print("\n2️⃣ EVALUATION")
  print("-"*40)

  train_acc = nn.accuracy(X_train, y_train)
  test_acc = nn.accuracy(X_test, y_test)

  print(f"Training accuracy: {train_acc:.4f}")
  print(f"Test accuracy: {test_acc:.4f}")

  # Step 5: Visualize
  fig, axes = plt.subplots(1, 3, figsize=(15, 4))

  # Loss curve
  ax1 = axes[0]
  ax1.plot(nn.losses)
  ax1.set_xlabel('Epoch')
  ax1.set_ylabel('Loss')
  ax1.set_title('Training Loss')

  # Decision boundary
  ax2 = axes[1]
  h = 0.02
  x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
  y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
  xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
  Z = nn.predict(np.c_[xx.ravel(), yy.ravel()])
  Z = Z.reshape(xx.shape)
  ax2.contourf(xx, yy, Z, alpha=0.4, cmap='coolwarm')
  ax2.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='black')
  ax2.set_title('Decision Boundary')

  # Network weights visualization
  ax3 = axes[2]
  ax3.text(0.5, 0.9, f'Input Layer: {nn.W1.shape[0]} neurons', ha='center', transform=ax3.transAxes)
  ax3.text(0.5, 0.5, f'Hidden Layer: {nn.W1.shape[1]} neurons', ha='center', transform=ax3.transAxes)
  ax3.text(0.5, 0.1, f'Output Layer: {nn.W2.shape[1]} neurons', ha='center', transform=ax3.transAxes)
  ax3.set_title('Network Architecture')
  ax3.axis('off')

  plt.tight_layout()
  plt.savefig('nn_from_scratch.png', dpi=150)

  print("\n✅ Neural network from scratch complete!")
  print(f"\nNetwork learned {nn.W1.size + nn.W2.size + nn.b1.size + nn.b2.size} parameters")
  ```

  **What you learned:**

  * Forward pass: multiply inputs by weights, add biases, apply activations
  * Backward pass: compute gradients using the chain rule, then update weights and biases
  * The whole network reduces to matrix multiplications, element-wise activations, and derivatives you can write out by hand
</details>
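
One way to check that a hand-written backward pass really follows the chain rule is to compare it against numerical gradients. The sketch below is not the `NeuralNetwork` class from the solution: it assumes a tiny 2-3-2 network with a tanh hidden layer and a softmax output, and all names (`forward`, `backward`, `loss_fn`, `numerical_grad_W1`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 2 inputs -> 3 hidden (tanh) -> 2 outputs (softmax).
# Sizes, activations, and data are illustrative assumptions.
X = rng.normal(size=(5, 2))            # 5 samples, 2 features
y = np.array([0, 1, 1, 0, 1])          # integer class labels
W1 = rng.normal(scale=0.5, size=(2, 3))
b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 2))
b2 = np.zeros(2)

def forward(X, W1, b1, W2, b2):
    """Forward pass: multiply by weights, add biases, apply activations."""
    a1 = np.tanh(X @ W1 + b1)
    z2 = a1 @ W2 + b2
    z2 = z2 - z2.max(axis=1, keepdims=True)   # shift for numerical stability
    probs = np.exp(z2) / np.exp(z2).sum(axis=1, keepdims=True)
    return a1, probs

def loss_fn(W1, b1, W2, b2):
    """Mean cross-entropy over the batch."""
    _, probs = forward(X, W1, b1, W2, b2)
    return -np.log(probs[np.arange(len(y)), y]).mean()

def backward(X, y, W1, b1, W2, b2):
    """Backward pass: apply the chain rule from the loss back to each parameter."""
    n = len(y)
    a1, probs = forward(X, W1, b1, W2, b2)
    dz2 = probs.copy()
    dz2[np.arange(n), y] -= 1          # gradient of softmax + cross-entropy w.r.t. z2
    dz2 /= n
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0)
    da1 = dz2 @ W2.T
    dz1 = da1 * (1 - a1 ** 2)          # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    return dW1, db1, dW2, db2

def numerical_grad_W1(eps=1e-6):
    """Central finite differences on every entry of W1."""
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        W_plus = W1.copy()
        W_plus[idx] += eps
        W_minus = W1.copy()
        W_minus[idx] -= eps
        grad[idx] = (loss_fn(W_plus, b1, W2, b2) - loss_fn(W_minus, b1, W2, b2)) / (2 * eps)
    return grad

dW1, db1, dW2, db2 = backward(X, y, W1, b1, W2, b2)
max_diff = np.abs(dW1 - numerical_grad_W1()).max()
print(f"Max difference between backprop and numerical gradient: {max_diff:.2e}")

# A single small gradient-descent step should lower the loss
lr = 0.1
loss_before = loss_fn(W1, b1, W2, b2)
loss_after = loss_fn(W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2)
print(f"Loss before: {loss_before:.4f}, after one step: {loss_after:.4f}")
```

If the two gradients disagree by more than a tiny amount, the usual suspects are a missing activation derivative or a transposed matrix in the backward pass.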

### Project 3: Activation Function Explorer

Compare different activation functions and their effects on learning.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.datasets import make_circles
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import StandardScaler
  from sklearn.neural_network import MLPClassifier

  # Step 1: Visualize activation functions
  print("="*60)
  print("⚡ ACTIVATION FUNCTION EXPLORER")
  print("="*60)

  x = np.linspace(-5, 5, 100)

  def sigmoid(x):
      return 1 / (1 + np.exp(-x))

  def tanh(x):
      return np.tanh(x)

  def relu(x):
      return np.maximum(0, x)

  def leaky_relu(x, alpha=0.1):
      return np.where(x > 0, x, alpha * x)

  activations = {
      'Sigmoid': (sigmoid, 'Squashes to (0, 1), can cause vanishing gradients'),
      'Tanh': (tanh, 'Squashes to (-1, 1), zero-centered'),
      'ReLU': (relu, 'Simple, fast, can have dead neurons'),
      'Leaky ReLU': (leaky_relu, 'Allows small negative values, prevents dead neurons')
  }

  fig, axes = plt.subplots(2, 2, figsize=(12, 10))

  for i, (name, (func, desc)) in enumerate(activations.items()):
      ax = axes[i // 2, i % 2]
      y = func(x)
      
      ax.plot(x, y, 'b-', linewidth=2, label=name)
      ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
      ax.axvline(x=0, color='k', linestyle='-', linewidth=0.5)
      ax.set_xlim(-5, 5)
      ax.set_ylim(-2, 2)
      ax.set_title(f'{name}\n{desc}', fontsize=10)
      ax.set_xlabel('Input')
      ax.set_ylabel('Output')
      ax.grid(True, alpha=0.3)
      ax.legend()

  plt.tight_layout()
  plt.savefig('activation_functions.png', dpi=150)

  # Step 2: Compare on real data
  print("\n1️⃣ GENERATING DATASET")
  print("-"*40)

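  X, y = make_circles(n_samples=1000, noise=0.1, factor=0.5, random_state=42)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  # Fit the scaler on the training split only so test statistics do not leak into training
  scaler = StandardScaler()
  X_train = scaler.fit_transform(X_train)
  X_test = scaler.transform(X_test)
  X = scaler.transform(X)  # scaled copy of the full set, used later for plot limits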
  print(f"Training samples: {len(X_train)}")

  # Step 3: Train with different activations
  print("\n2️⃣ TRAINING WITH DIFFERENT ACTIVATIONS")
  print("-"*40)

  activation_results = {}
  activation_names = ['relu', 'tanh', 'logistic']  # sklearn names

  for activation in activation_names:
      print(f"\nTraining with {activation}...")
      
      mlp = MLPClassifier(
          hidden_layer_sizes=(32, 16),
          activation=activation,
          solver='adam',
          max_iter=500,
          random_state=42
      )
      
      mlp.fit(X_train, y_train)
      
      train_acc = mlp.score(X_train, y_train)
      test_acc = mlp.score(X_test, y_test)
      
      activation_results[activation] = {
          'model': mlp,
          'train_acc': train_acc,
          'test_acc': test_acc,
          'loss_curve': mlp.loss_curve_,
          'n_iter': mlp.n_iter_
      }
      
      print(f"  Iterations: {mlp.n_iter_}")
      print(f"  Train accuracy: {train_acc:.4f}")
      print(f"  Test accuracy: {test_acc:.4f}")

  # Step 4: Compare results
  print("\n3️⃣ COMPARISON")
  print("-"*40)

  fig, axes = plt.subplots(2, 3, figsize=(15, 10))

  # Loss curves
  ax1 = axes[0, 0]
  for name, results in activation_results.items():
      ax1.plot(results['loss_curve'], label=f"{name} ({results['n_iter']} iters)")
  ax1.set_xlabel('Epoch')
  ax1.set_ylabel('Loss')
  ax1.set_title('Training Loss Curves')
  ax1.legend()

  # Accuracy comparison
  ax2 = axes[0, 1]
  names = list(activation_results.keys())
  train_accs = [r['train_acc'] for r in activation_results.values()]
  test_accs = [r['test_acc'] for r in activation_results.values()]
  x_pos = np.arange(len(names))
  width = 0.35
  ax2.bar(x_pos - width/2, train_accs, width, label='Train', color='steelblue')
  ax2.bar(x_pos + width/2, test_accs, width, label='Test', color='coral')
  ax2.set_xticks(x_pos)
  ax2.set_xticklabels(names)
  ax2.set_ylabel('Accuracy')
  ax2.set_title('Accuracy Comparison')
  ax2.legend()

  # Convergence speed
  ax3 = axes[0, 2]
  iters = [r['n_iter'] for r in activation_results.values()]
  ax3.bar(names, iters, color='green')
  ax3.set_ylabel('Iterations to Converge')
  ax3.set_title('Convergence Speed')

  # Decision boundaries
  for i, (name, results) in enumerate(activation_results.items()):
      ax = axes[1, i]
      h = 0.02
      x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
      y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
      xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
      Z = results['model'].predict(np.c_[xx.ravel(), yy.ravel()])
      Z = Z.reshape(xx.shape)
      ax.contourf(xx, yy, Z, alpha=0.4, cmap='coolwarm')
      ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='coolwarm', edgecolors='black')
      ax.set_title(f'{name} Decision Boundary\nTest acc: {results["test_acc"]:.3f}')

  plt.tight_layout()
  plt.savefig('activation_comparison.png', dpi=150)

  # Step 5: Summary
  print("\n📊 SUMMARY")
  print("-"*40)
  print(f"{'Activation':<15} {'Train Acc':<12} {'Test Acc':<12} {'Iterations':<10}")
  print("-"*50)
  for name, results in activation_results.items():
      print(f"{name:<15} {results['train_acc']:.4f}       {results['test_acc']:.4f}       {results['n_iter']}")

  print("\n💡 Key Insights:")
  print("  - ReLU typically converges faster")
  print("  - Tanh provides zero-centered outputs")
  print("  - Sigmoid (logistic) can cause vanishing gradients")

  print("\n✅ Activation function exploration complete!")
  ```

  **What you learned:**

  * Different activations have different convergence speeds
  * ReLU is often fastest but can have dead neurons
  * The choice of activation depends on the problem
</details>
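
The vanishing-gradient and dead-neuron remarks above are easier to see from the derivatives than from the activation curves themselves. The following is a minimal sketch; the `*_grad` helper names are illustrative, and `alpha=0.1` simply mirrors the `leaky_relu` default used in the solution.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)             # peaks at 0.25 when x = 0

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2     # peaks at 1.0 when x = 0

def relu_grad(x):
    return (x > 0).astype(float)   # exactly 0 for negative inputs (the "dead" region)

def leaky_relu_grad(x, alpha=0.1):
    return np.where(x > 0, 1.0, alpha)   # small but non-zero slope for negative inputs

x = np.array([-4.0, 0.0, 4.0])
grads = [
    ("sigmoid", sigmoid_grad),
    ("tanh", tanh_grad),
    ("relu", relu_grad),
    ("leaky_relu", leaky_relu_grad),
]
for name, grad in grads:
    print(f"{name:<11} gradient at x = -4, 0, 4: {np.round(grad(x), 4)}")

# Backprop multiplies one such factor per layer. Even in the best case,
# a sigmoid layer scales the gradient by at most 0.25.
depth = 10
print(f"Best-case sigmoid gradient factor after {depth} layers: {0.25 ** depth:.2e}")
```

Sigmoid and tanh gradients shrink toward zero for large positive or negative inputs, while ReLU passes the gradient through unchanged for positive inputs and blocks it completely for negative ones.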

### Project 4: Hyperparameter Tuner

Systematically find the best neural network architecture.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split, cross_val_score
  from sklearn.preprocessing import StandardScaler
  from sklearn.neural_network import MLPClassifier
  import time
  import itertools

  # Step 1: Load data
  print("="*60)
  print("🔧 NEURAL NETWORK HYPERPARAMETER TUNING")
  print("="*60)

  cancer = load_breast_cancer()
  X, y = cancer.data, cancer.target

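  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  # Fit the scaler on the training split only so test statistics do not leak into training
  scaler = StandardScaler()
  X_train = scaler.fit_transform(X_train)
  X_test = scaler.transform(X_test)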

  print(f"Features: {X.shape[1]}")
  print(f"Training samples: {len(X_train)}")

  # Step 2: Define hyperparameter search space
  architectures = [
      (32,),
      (64,),
      (128,),
      (32, 16),
      (64, 32),
      (128, 64),
      (64, 32, 16),
      (128, 64, 32),
  ]

  learning_rates = [0.001, 0.01, 0.1]
  activations = ['relu', 'tanh']

  # Step 3: Grid search
  print("\n1️⃣ HYPERPARAMETER SEARCH")
  print("-"*40)

  results = []
  total_combinations = len(architectures) * len(learning_rates) * len(activations)
  print(f"Total combinations to try: {total_combinations}")

  for i, (arch, lr, act) in enumerate(itertools.product(architectures, learning_rates, activations)):
      start_time = time.time()
      
      mlp = MLPClassifier(
          hidden_layer_sizes=arch,
          learning_rate_init=lr,
          activation=act,
          solver='adam',
          max_iter=200,
          random_state=42,
          early_stopping=True,
          validation_fraction=0.1
      )
      
      try:
          cv_scores = cross_val_score(mlp, X_train, y_train, cv=3, scoring='accuracy')
          
          mlp.fit(X_train, y_train)
          test_score = mlp.score(X_test, y_test)
          
          result = {
              'architecture': str(arch),
              'learning_rate': lr,
              'activation': act,
              'cv_mean': cv_scores.mean(),
              'cv_std': cv_scores.std(),
              'test_score': test_score,
              'n_iter': mlp.n_iter_,
              'time': time.time() - start_time
          }
          results.append(result)
          
          if (i + 1) % 5 == 0:
              print(f"Progress: {i+1}/{total_combinations} combinations tested")
              
      except Exception as e:
          print(f"Error with {arch}, {lr}, {act}: {e}")

  # Step 4: Analyze results
  df_results = pd.DataFrame(results)
  df_results = df_results.sort_values('cv_mean', ascending=False)

  print("\n2️⃣ TOP 10 CONFIGURATIONS")
  print("-"*40)
  print(df_results[['architecture', 'learning_rate', 'activation', 'cv_mean', 'test_score']].head(10).to_string())

  # Step 5: Best configuration
  best = df_results.iloc[0]
  print("\n3️⃣ BEST CONFIGURATION")
  print("-"*40)
  print(f"Architecture: {best['architecture']}")
  print(f"Learning rate: {best['learning_rate']}")
  print(f"Activation: {best['activation']}")
  print(f"CV Score: {best['cv_mean']:.4f} (+/- {best['cv_std']:.4f})")
  print(f"Test Score: {best['test_score']:.4f}")

  # Step 6: Analyze patterns
  print("\n4️⃣ PATTERN ANALYSIS")
  print("-"*40)

  # Best by architecture
  print("\nBest score by architecture:")
  arch_best = df_results.groupby('architecture')['cv_mean'].max().sort_values(ascending=False)
  for arch, score in arch_best.head(5).items():
      print(f"  {arch}: {score:.4f}")

  # Best by learning rate
  print("\nBest score by learning rate:")
  lr_best = df_results.groupby('learning_rate')['cv_mean'].max().sort_values(ascending=False)
  for lr, score in lr_best.items():
      print(f"  {lr}: {score:.4f}")

  # Best by activation
  print("\nBest score by activation:")
  act_best = df_results.groupby('activation')['cv_mean'].max().sort_values(ascending=False)
  for act, score in act_best.items():
      print(f"  {act}: {score:.4f}")

  # Step 7: Visualize
  fig, axes = plt.subplots(2, 2, figsize=(14, 10))

  # Top configurations
  ax1 = axes[0, 0]
  top10 = df_results.head(10)
  y_pos = np.arange(len(top10))
  ax1.barh(y_pos, top10['cv_mean'], xerr=top10['cv_std'], capsize=3)
  ax1.set_yticks(y_pos)
  ax1.set_yticklabels([f"{r['architecture']}\nlr={r['learning_rate']}" 
                      for _, r in top10.iterrows()], fontsize=8)
  ax1.set_xlabel('CV Score')
  ax1.set_title('Top 10 Configurations')

  # Score by architecture depth
  ax2 = axes[0, 1]
  # Map each architecture string back to its layer count without calling eval()
  df_results['depth'] = df_results['architecture'].map({str(a): len(a) for a in architectures})
  depth_scores = df_results.groupby('depth')['cv_mean'].mean()
  ax2.bar(depth_scores.index, depth_scores.values, color='steelblue')
  ax2.set_xlabel('Number of Hidden Layers')
  ax2.set_ylabel('Mean CV Score')
  ax2.set_title('Score vs Network Depth')

  # Score by learning rate
  ax3 = axes[1, 0]
  for act in activations:
      subset = df_results[df_results['activation'] == act]
      lr_means = subset.groupby('learning_rate')['cv_mean'].mean()
      ax3.plot(lr_means.index, lr_means.values, 'o-', label=act)
  ax3.set_xlabel('Learning Rate')
  ax3.set_ylabel('Mean CV Score')
  ax3.set_title('Learning Rate Effect by Activation')
  ax3.set_xscale('log')
  ax3.legend()

  # Training time vs accuracy
  ax4 = axes[1, 1]
  ax4.scatter(df_results['time'], df_results['cv_mean'], 
             c=df_results['depth'], cmap='viridis', alpha=0.7)
  ax4.set_xlabel('Training Time (s)')
  ax4.set_ylabel('CV Score')
  ax4.set_title('Time vs Accuracy (color = depth)')
  cbar = plt.colorbar(ax4.collections[0], ax=ax4)
  cbar.set_label('Depth')

  plt.tight_layout()
  plt.savefig('nn_tuning.png', dpi=150)

  # Step 8: Final model
  print("\n5️⃣ TRAINING FINAL MODEL")
  print("-"*40)

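  # Recover the architecture tuple from its string form without calling eval()
  best_arch = next(a for a in architectures if str(a) == best['architecture'])

  final_mlp = MLPClassifier(
      hidden_layer_sizes=best_arch,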
      learning_rate_init=best['learning_rate'],
      activation=best['activation'],
      solver='adam',
      max_iter=500,
      random_state=42
  )

  final_mlp.fit(X_train, y_train)
  final_test_score = final_mlp.score(X_test, y_test)

  print(f"Final model test accuracy: {final_test_score:.4f}")
  print("\n✅ Hyperparameter tuning complete!")
  ```

  **What you learned:**

  * A systematic search compares configurations on equal footing instead of relying on intuition
  * Deeper networks aren't always better
  * Learning rate is often the most important hyperparameter
</details>
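
The manual loop in the solution can also be expressed with scikit-learn's built-in tools. The sketch below is one possible variant, not the method used above: it assumes a deliberately small grid (a subset of the search space in the solution) and wraps the scaler in a `Pipeline` so it is refit on the training part of every cross-validation fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold scales with its own training data
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPClassifier(solver="adam", max_iter=500, random_state=42)),
])

# Small illustrative grid; the "mlp__" prefix routes each value to the MLP step
param_grid = {
    "mlp__hidden_layer_sizes": [(32,), (64, 32)],
    "mlp__learning_rate_init": [0.001, 0.01],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")
# The test set is used once, after the configuration has been chosen
print(f"Test accuracy: {search.score(X_test, y_test):.4f}")
```

`GridSearchCV` refits the best configuration on the full training set by default, so `search` can be used directly as the final model.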

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Neurons = Weighted Sums" icon="plus">
    Input × weights + bias → activation → output
  </Card>

  <Card title="Layers = Power" icon="layer-group">
    Stacked layers can represent more complex patterns, though deeper is not always better
  </Card>

  <Card title="Backprop = Chain Rule" icon="link">
    Gradients flow backward to update weights
  </Card>

  <Card title="Regularize!" icon="shield-halved">
    Dropout and early stopping help reduce overfitting
  </Card>
</CardGroup>
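
As a quick refresher on the dropout takeaway, here is a minimal sketch of "inverted" dropout in NumPy. The `dropout` function and its parameters are illustrative assumptions, not code from this module: units are zeroed at random during training and the survivors are rescaled so the expected activation stays the same.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero random units, rescale the rest by 1 / keep_prob."""
    if not training or drop_prob == 0.0:
        return activations                      # no-op at inference time
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 16))                    # stand-in for hidden-layer activations

dropped = dropout(hidden, drop_prob=0.5, rng=rng)
print(f"Fraction zeroed: {(dropped == 0).mean():.3f}")
print(f"Mean activation after dropout: {dropped.mean():.3f}")

same = np.array_equal(dropout(hidden, 0.5, rng, training=False), hidden)
print(f"Unchanged at inference: {same}")
```

Because of the rescaling, no extra adjustment is needed at inference time: the layer is simply used without the mask.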

***

## What's Next?

Now that you understand neural networks, let's look at regularization in more depth. It is one of the main tools for preventing overfitting in any model!

<Card title="Continue to Module 13: Regularization" icon="arrow-right" href="/courses/ml-mastery/13-regularization">
  Learn L1, L2 regularization and other techniques to prevent overfitting
</Card>