Build a Neural Network from Scratch

Final Project: The Architect

Your Graduation Exam

You have learned the theory.
  • Derivatives: The rate of change.
  • Gradients: The direction of steepest ascent.
  • Chain Rule: How to propagate blame.
  • Gradient Descent: How to learn.
Now, you must prove your mastery. You will not use PyTorch. You will not use TensorFlow. You will build a brain using nothing but raw math (NumPy).
Estimated Time: 4-6 hours (take your time!)
Difficulty: Intermediate
Prerequisites: Completed all previous modules
What You’ll Build: A fully functional 2-layer neural network that learns XOR
Don’t Rush! This project is where everything clicks together. If you find yourself copy-pasting code without understanding it, STOP. Go back to the relevant module and review. The goal isn’t to finish fast—it’s to deeply understand every line.

🎯 What You’re Actually Building (And Why It Matters)

Before we write code, let’s understand what we’re creating and why each piece exists.

The XOR Problem: Why Neural Networks?

We’re solving the XOR problem - a classification task that stumped AI researchers for decades:
Input 1 | Input 2 | Output
   0    |    0    |   0
   0    |    1    |   1
   1    |    0    |   1
   1    |    1    |   0
Why XOR is special: You cannot draw a single straight line to separate the 0s from the 1s. This is called a non-linearly separable problem.
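To make that precise, suppose a single linear unit $\hat{y} = \text{step}(w_1 x_1 + w_2 x_2 + b)$ could compute XOR. The four rows of the truth table would require

$$\begin{aligned}
(0,0)&: \quad b < 0 \\
(0,1)&: \quad w_2 + b > 0 \\
(1,0)&: \quad w_1 + b > 0 \\
(1,1)&: \quad w_1 + w_2 + b < 0
\end{aligned}$$

Adding the middle two inequalities gives $w_1 + w_2 + 2b > 0$, while adding the first and last gives $w_1 + w_2 + 2b < 0$. No choice of weights satisfies both, so no single linear boundary works.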
import matplotlib.pyplot as plt
import numpy as np

# Visualize XOR - notice you can't draw ONE line to separate blue from red
X = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = np.array([0, 1, 1, 0])

plt.figure(figsize=(8, 6))
colors = ['red' if y == 0 else 'blue' for y in Y]
plt.scatter(X[:, 0], X[:, 1], c=colors, s=300, edgecolors='black', linewidths=2)

for i, (x, y, label) in enumerate(zip(X[:, 0], X[:, 1], Y)):
    plt.annotate(f'XOR={label}', (x, y), textcoords="offset points", xytext=(10,10), fontsize=12)

plt.xlabel('Input 1', fontsize=14)
plt.ylabel('Input 2', fontsize=14)
plt.title('XOR Problem: No single line can separate red from blue!', fontsize=14)
plt.grid(True, alpha=0.3)
plt.xlim(-0.5, 1.5)
plt.ylim(-0.5, 1.5)
plt.show()
The breakthrough: A neural network with a hidden layer can learn to “bend” the space and separate these points. This is exactly what you’ll build!

The Complete Training Loop

Before diving into code, study this diagram. Every neural network ever built follows this exact pattern:
[Diagram: Neural Network Training Loop]
The loop is:
  1. Forward Pass: Data flows through → we get a prediction
  2. Loss: How wrong are we?
  3. Backward Pass: Compute gradients (blame assignment)
  4. Update: Adjust weights to reduce loss
  5. Repeat thousands of times

The Blueprint

You are building a neural network to solve a classification problem. Architecture:
  • Input Layer: 2 neurons ($x_1, x_2$) — the XOR inputs
  • Hidden Layer: 3 neurons ($h_1, h_2, h_3$) — the “secret sauce” that enables non-linear learning
  • Output Layer: 1 neuron ($y$) — the predicted XOR output
[Diagram: Neural Network Blueprint]

Understanding the Dimensions

This is critical for debugging. Let’s trace the shapes:
Variable | Shape  | Description
X        | (m, 2) | m examples, 2 input features
W1       | (2, 3) | Connects 2 inputs → 3 hidden neurons
b1       | (1, 3) | One bias per hidden neuron
Z1       | (m, 3) | Pre-activation for the hidden layer
A1       | (m, 3) | Post-activation (ReLU applied)
W2       | (3, 1) | Connects 3 hidden neurons → 1 output
b2       | (1, 1) | One bias for the output
Z2       | (m, 1) | Pre-activation for the output
Ŷ        | (m, 1) | Final predictions (sigmoid applied)
Pro Debugging Tip: When something breaks, print the shapes! 90% of neural network bugs are shape mismatches.
print(f"X shape: {X.shape}")  # Should be (4, 2) for XOR
print(f"W1 shape: {W1.shape}")  # Should be (2, 3)
# ... and so on
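If you want to automate that check, here is a small helper you could use (a sketch; check_shapes is not part of the project code, just a debugging aid):
def check_shapes(arrays, expected):
    """Compare actual array shapes against the expected ones and flag mismatches."""
    for name, shape in expected.items():
        actual = arrays[name].shape
        status = "OK" if actual == shape else f"MISMATCH (expected {shape})"
        print(f"{name}: {actual} {status}")

# Example (once X and params exist):
# check_shapes({"X": X, "W1": params["W1"]}, {"X": (4, 2), "W1": (2, 3)})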

Step 1: The Bricks (Initialization)

A neural network is just a collection of matrices (weights) and vectors (biases). But how you initialize them matters enormously!

Why Small Random Numbers?

import numpy as np

def init_params(input_size, hidden_size, output_size):
    """
    Initialize neural network parameters.
    
    Why random? If all weights are the same, all neurons learn the same thing!
    This is called the "symmetry problem."
    
    Why small (×0.01)? Large weights → large outputs → saturated sigmoid
    → tiny gradients → no learning (vanishing gradients)
    """
    # Weights: Random small numbers (break symmetry)
    W1 = np.random.randn(input_size, hidden_size) * 0.01
    b1 = np.zeros((1, hidden_size))  # Biases can start at zero
    
    W2 = np.random.randn(hidden_size, output_size) * 0.01
    b2 = np.zeros((1, output_size))
    
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

# Let's see what this produces
np.random.seed(42)  # For reproducibility
params = init_params(2, 3, 1)
print("W1 (2 inputs → 3 hidden):")
print(params["W1"])
print(f"\nW1 shape: {params['W1'].shape}")
print(f"W2 shape: {params['W2'].shape}")
Output:
W1 (2 inputs → 3 hidden):
[[ 0.00496714 -0.00138264  0.00647689]
 [ 0.01523030 -0.00234153 -0.00234137]]

W1 shape: (2, 3)
W2 shape: (3, 1)
Advanced: Xavier/He Initialization. For deeper networks, * 0.01 isn’t optimal. Better initializations:
# Xavier initialization (for sigmoid/tanh)
W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1 / input_size)

# He initialization (for ReLU) - recommended!
W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2 / input_size)
These keep the variance of activations stable across layers.
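A quick side experiment (a sketch, not part of the main project) shows why the scale matters: push random data through a few ReLU layers and compare the spread of the activations under the two schemes.
import numpy as np

np.random.seed(0)
x = np.random.randn(1000, 256)  # a fake batch with 256 features

for name, scale in [("* 0.01", 0.01), ("He init", np.sqrt(2 / 256))]:
    a = x
    for _ in range(5):                       # five hidden layers
        W = np.random.randn(256, 256) * scale
        a = np.maximum(0, a @ W)             # linear layer + ReLU
    print(f"{name}: std of activations after 5 layers = {a.std():.2e}")
With * 0.01 the activations collapse toward zero after a few layers, while the He-scaled weights keep them at a healthy size.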

Step 2: The Mortar (Activation Functions)

Neurons need to be non-linear. Without non-linearity, stacking layers is useless—you’d just get another linear function!
[Diagram: Why We Need Activation Functions]
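Here is a quick numerical check of that claim (a sketch): two stacked linear layers with no activation in between collapse into a single linear layer.
import numpy as np

np.random.seed(1)
x = np.random.randn(5, 2)                     # 5 toy examples, 2 features
Wa, ba = np.random.randn(2, 3), np.random.randn(1, 3)
Wb, bb = np.random.randn(3, 1), np.random.randn(1, 1)

# Two linear layers with no activation in between
two_layers = (x @ Wa + ba) @ Wb + bb

# One equivalent linear layer: W = Wa @ Wb, b = ba @ Wb + bb
one_layer = x @ (Wa @ Wb) + (ba @ Wb + bb)

print(np.allclose(two_layers, one_layer))     # True: the extra layer added nothing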

ReLU: The Modern Workhorse

def relu(z):
    """
    ReLU: Rectified Linear Unit
    
    f(z) = max(0, z)
    
    Why ReLU is popular:
    1. Simple and fast to compute
    2. Doesn't saturate for positive values (no vanishing gradients)
    3. Creates sparse activations (many zeros = efficient)
    
    The "kink" at z=0 is what creates non-linearity!
    """
    return np.maximum(0, z)

def relu_derivative(z):
    """
    Derivative of ReLU:
    f'(z) = 1 if z > 0, else 0
    
    Note: Technically undefined at z=0, but we use 0.
    """
    return (z > 0).astype(float)

# Visualize ReLU
z = np.linspace(-3, 3, 100)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(z, relu(z), linewidth=2, color='#22c55e')
plt.title('ReLU: f(z) = max(0, z)')
plt.xlabel('z')
plt.ylabel('ReLU(z)')
plt.grid(True, alpha=0.3)
plt.axhline(0, color='black', linewidth=0.5)
plt.axvline(0, color='black', linewidth=0.5)

plt.subplot(1, 2, 2)
plt.plot(z, relu_derivative(z), linewidth=2, color='#3b82f6')
plt.title("ReLU Derivative: f'(z)")
plt.xlabel('z')
plt.ylabel("f'(z)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Sigmoid: For Probabilities

def sigmoid(z):
    """
    Sigmoid squashes any value to (0, 1) range.
    
    f(z) = 1 / (1 + e^(-z))
    
    Perfect for binary classification output!
    - Output near 0 = confident "class 0"
    - Output near 1 = confident "class 1"
    - Output ~0.5 = uncertain
    """
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    """
    Derivative of sigmoid (using output a, not input z):
    f'(z) = f(z) × (1 - f(z)) = a × (1 - a)
    
    This is why we pass 'a' (the sigmoid output), not 'z'.
    """
    return a * (1 - a)

# Visualize sigmoid
z = np.linspace(-6, 6, 100)
a = sigmoid(z)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(z, a, linewidth=2, color='#a855f7')
plt.title('Sigmoid: σ(z) = 1/(1+e⁻ᶻ)')
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.axhline(0.5, color='gray', linestyle='--', alpha=0.5)
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(z, sigmoid_derivative(a), linewidth=2, color='#06b6d4')
plt.title("Sigmoid Derivative: σ'(z) = σ(z)(1-σ(z))")
plt.xlabel('z')
plt.ylabel("σ'(z)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
The Vanishing Gradient Problem: Notice that the sigmoid derivative peaks at 0.25 and approaches 0 for large |z|. This means:
  • For very confident predictions, gradients are tiny
  • In deep networks, these tiny gradients multiply → vanishing gradients
  • This is why we use ReLU for hidden layers (derivative is 1 for positive values)
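A tiny sketch makes the multiplication effect concrete (not part of the project code): the best case for sigmoid is a derivative of 0.25 per layer, while active ReLU units pass a derivative of 1.
# Best-case gradient factor per layer: 0.25 for sigmoid, 1.0 for an active ReLU unit
for depth in [2, 5, 10, 20]:
    sigmoid_path = 0.25 ** depth
    relu_path = 1.0 ** depth
    print(f"depth {depth:2d}: sigmoid path {sigmoid_path:.2e}, relu path {relu_path:.0f}")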

Step 3: The Construction (Forward Pass)

The forward pass is where data flows through the network to produce a prediction. Let’s trace through every computation step by step.

The Math

$Z_1 = X \cdot W_1 + b_1$ (linear transformation)
$A_1 = \text{ReLU}(Z_1)$ (non-linear activation)
$Z_2 = A_1 \cdot W_2 + b_2$ (linear transformation)
$\hat{Y} = \sigma(Z_2)$ (sigmoid for probability)

The Code (With Detailed Commentary)

def forward(X, params):
    """
    Forward propagation: Input → Prediction
    
    Args:
        X: Input data, shape (m, 2) where m = number of examples
        params: Dictionary with W1, b1, W2, b2
    
    Returns:
        A2: Predictions, shape (m, 1)
        cache: All intermediate values (needed for backprop!)
    """
    # Unpack parameters
    W1, b1 = params["W1"], params["b1"]
    W2, b2 = params["W2"], params["b2"]
    
    # === LAYER 1 ===
    # Linear: Z1 = X @ W1 + b1
    # Shape: (m, 2) @ (2, 3) + (1, 3) = (m, 3)
    Z1 = np.dot(X, W1) + b1
    
    # Activation: A1 = ReLU(Z1)
    # Shape: (m, 3)
    A1 = relu(Z1)
    
    # === LAYER 2 ===
    # Linear: Z2 = A1 @ W2 + b2
    # Shape: (m, 3) @ (3, 1) + (1, 1) = (m, 1)
    Z2 = np.dot(A1, W2) + b2
    
    # Activation: A2 = sigmoid(Z2)
    # Shape: (m, 1)
    A2 = sigmoid(Z2)
    
    # IMPORTANT: Cache everything for backpropagation!
    # We need Z1 for ReLU derivative, A1 for dW2, etc.
    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}
    
    return A2, cache

# Let's trace through with actual numbers
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # XOR inputs
np.random.seed(42)
params = init_params(2, 3, 1)

print("=== FORWARD PASS TRACE ===\n")
print(f"Input X (4 examples, 2 features):\n{X}\n")

# Manual trace through layer 1
Z1 = np.dot(X, params["W1"]) + params["b1"]
print(f"Z1 = X @ W1 + b1, shape {Z1.shape}:")
print(Z1.round(4))
print()

A1 = relu(Z1)
print(f"A1 = ReLU(Z1) - negative values become 0:")
print(A1.round(4))
print()

# Layer 2
Z2 = np.dot(A1, params["W2"]) + params["b2"]
print(f"Z2 = A1 @ W2 + b2, shape {Z2.shape}:")
print(Z2.round(4))
print()

A2 = sigmoid(Z2)
print(f"Predictions = sigmoid(Z2) - squashed to (0, 1):")
print(A2.round(4))
print()

print(f"True labels: {np.array([0, 1, 1, 0])}")
print("Before training: predictions are basically random (~0.5)")

Step 4: The Inspection (Loss Function)

The loss function measures “how wrong” our predictions are. It gives us a single number to minimize.

Mean Squared Error (MSE)

$L = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$
Why MSE?
  • Squared errors penalize large mistakes more than small ones
  • The $\frac{1}{2}$ factor makes the derivative cleaner (the 2 cancels out)
  • Works well for regression; decent for binary classification
def compute_loss(Y_pred, Y_true):
    """
    Mean Squared Error loss.
    
    Args:
        Y_pred: Predictions, shape (m, 1)
        Y_true: True labels, shape (m, 1)
    
    Returns:
        loss: A single number (lower = better)
    """
    m = Y_true.shape[0]
    
    # Squared difference for each example
    squared_errors = (Y_pred - Y_true) ** 2
    
    # Average across all examples, divide by 2
    loss = (1 / (2 * m)) * np.sum(squared_errors)
    
    return loss

# Example: How wrong is our random network?
Y_true = np.array([[0], [1], [1], [0]])  # XOR labels
Y_pred, cache = forward(X, params)

loss = compute_loss(Y_pred, Y_true)
print(f"Initial loss: {loss:.4f}")
print(f"This is basically random guessing (predictions ~0.5)")
Alternative: Binary Cross-Entropy (BCE). For binary classification, BCE is often preferred:
$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]$
def cross_entropy_loss(Y_pred, Y_true):
    m = Y_true.shape[0]
    epsilon = 1e-8  # Prevent log(0)
    loss = -(1/m) * np.sum(
        Y_true * np.log(Y_pred + epsilon) + 
        (1 - Y_true) * np.log(1 - Y_pred + epsilon)
    )
    return loss
BCE has nice properties: it’s derived from maximum likelihood and penalizes confident wrong predictions heavily.
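You can see the “penalizes confident wrong predictions” point with a single confident mistake, using the two loss functions defined above (a quick sketch):
y_true = np.array([[1.0]])    # the true class is 1
y_pred = np.array([[0.01]])   # the model is confidently wrong

mse = compute_loss(y_pred, y_true)         # (1/2) * (0.01 - 1)^2 ≈ 0.49
bce = cross_entropy_loss(y_pred, y_true)   # -log(0.01) ≈ 4.6

print(f"MSE: {mse:.2f} | BCE: {bce:.2f}  (BCE punishes the confident mistake far harder)")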

Step 5: The Renovation (Backward Pass) ⭐ THE HEART OF DEEP LEARNING

This is the most important part. Backpropagation computes how much each weight contributed to the error, so we know how to fix them.

The Big Picture: Blame Assignment

Imagine your network made a wrong prediction. Who’s to blame?
  • The output layer weights (W2W_2)?
  • The hidden layer weights (W1W_1)?
  • Both, but how much each?
Backpropagation answers this using the chain rule. We propagate the error signal backward through the network.

Deriving the Gradients (Step by Step)

Let’s derive every gradient from scratch. This is the math that powers all of deep learning.

Step 5.1: Output Layer Gradients

We want $\frac{\partial L}{\partial W_2}$: how does changing $W_2$ affect the loss? Chain rule path: $L \rightarrow \hat{Y} \rightarrow Z_2 \rightarrow W_2$, so $\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{Y}} \cdot \frac{\partial \hat{Y}}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial W_2}$. Let’s compute each piece:
  1. $\frac{\partial L}{\partial \hat{Y}}$ — how the loss changes with the prediction: $L = \frac{1}{2m}\sum(\hat{Y} - Y)^2 \implies \frac{\partial L}{\partial \hat{Y}} = \frac{1}{m}(\hat{Y} - Y)$
  2. $\frac{\partial \hat{Y}}{\partial Z_2}$ — the sigmoid derivative: $\hat{Y} = \sigma(Z_2) \implies \frac{\partial \hat{Y}}{\partial Z_2} = \hat{Y}(1 - \hat{Y})$
  3. $\frac{\partial Z_2}{\partial W_2}$ — the linear layer: $Z_2 = A_1 W_2 + b_2 \implies \frac{\partial Z_2}{\partial W_2} = A_1$
Putting it together (absorbing the $\frac{1}{m}$ into $dW_2$ and $db_2$): $dZ_2 = (\hat{Y} - Y) \odot \hat{Y}(1 - \hat{Y})$, then $dW_2 = \frac{1}{m} A_1^T \cdot dZ_2$ and $db_2 = \frac{1}{m} \sum dZ_2$. (If you use binary cross-entropy instead of MSE, the $\hat{Y}(1 - \hat{Y})$ factor cancels and $dZ_2$ simplifies to $\hat{Y} - Y$.)

Step 5.2: Hidden Layer Gradients

Now we backpropagate to the first layer. The error signal must flow back through $W_2$! Chain rule path: $L \rightarrow Z_2 \rightarrow A_1 \rightarrow Z_1 \rightarrow W_1$
  1. $\frac{\partial L}{\partial A_1}$ — the error flowing back through $W_2$: $dA_1 = dZ_2 \cdot W_2^T$
  2. $\frac{\partial A_1}{\partial Z_1}$ — the ReLU derivative: $dZ_1 = dA_1 \odot \mathbb{1}_{Z_1 > 0}$ (element-wise multiply by 1 where $Z_1 > 0$, else 0)
  3. Final gradients: $dW_1 = \frac{1}{m} X^T \cdot dZ_1$ and $db_1 = \frac{1}{m} \sum dZ_1$

The Code (With Detailed Commentary)

def backward(X, Y_true, params, cache):
    """
    Backward propagation: Compute all gradients.
    
    This is the chain rule in action!
    
    Args:
        X: Input data, shape (m, 2)
        Y_true: True labels, shape (m, 1)
        params: Dictionary with W1, b1, W2, b2
        cache: From forward pass - Z1, A1, Z2, A2
    
    Returns:
        grads: Dictionary with dW1, db1, dW2, db2
    """
    m = X.shape[0]  # Number of examples
    
    # Retrieve cached values
    W2 = params["W2"]
    A1, A2 = cache["A1"], cache["A2"]
    Z1 = cache["Z1"]
    
    # ========================================
    # OUTPUT LAYER GRADIENTS
    # ========================================
    
    # dZ2 = dL/dZ2 = (A2 - Y) * sigmoid'(Z2) for MSE loss with a sigmoid output,
    # where sigmoid'(Z2) = A2 * (1 - A2). (With binary cross-entropy this factor
    # cancels and dZ2 reduces to A2 - Y.) This is the "error signal" at the output.
    dZ2 = (A2 - Y_true) * sigmoid_derivative(A2)  # Shape: (m, 1)
    
    # dW2 = dL/dW2 = A1.T @ dZ2 / m
    # Each column of A1 "votes" on how to change corresponding W2 weight
    dW2 = (1/m) * np.dot(A1.T, dZ2)  # Shape: (3, 1)
    
    # db2 = dL/db2 = mean of dZ2 across examples
    # Bias affects all examples equally
    db2 = (1/m) * np.sum(dZ2, axis=0, keepdims=True)  # Shape: (1, 1)
    
    # ========================================
    # HIDDEN LAYER GRADIENTS
    # ========================================
    
    # First, propagate error back through W2
    # dA1 = dZ2 @ W2.T
    # "How much did each hidden neuron contribute to the error?"
    dA1 = np.dot(dZ2, W2.T)  # Shape: (m, 3)
    
    # Then, propagate through ReLU
    # dZ1 = dA1 * relu_derivative(Z1)
    # ReLU derivative: 1 where Z1 > 0, else 0
    # If a neuron was "off" (Z1 <= 0), its gradient is 0
    dZ1 = dA1 * relu_derivative(Z1)  # Shape: (m, 3)
    
    # dW1 = dL/dW1 = X.T @ dZ1 / m
    dW1 = (1/m) * np.dot(X.T, dZ1)  # Shape: (2, 3)
    
    # db1 = mean of dZ1
    db1 = (1/m) * np.sum(dZ1, axis=0, keepdims=True)  # Shape: (1, 3)
    
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}

# Let's verify with numerical gradients (gradient checking!)
def numerical_gradient(X, Y, params, param_name, epsilon=1e-7):
    """Compute gradient numerically for verification."""
    param = params[param_name]
    grad = np.zeros_like(param)
    
    for i in range(param.shape[0]):
        for j in range(param.shape[1]):
            # f(x + epsilon)
            param[i, j] += epsilon
            Y_pred, _ = forward(X, params)
            loss_plus = compute_loss(Y_pred, Y)
            
            # f(x - epsilon)
            param[i, j] -= 2 * epsilon
            Y_pred, _ = forward(X, params)
            loss_minus = compute_loss(Y_pred, Y)
            
            # Restore
            param[i, j] += epsilon
            
            # Numerical gradient
            grad[i, j] = (loss_plus - loss_minus) / (2 * epsilon)
    
    return grad

# Verify our backward pass is correct!
print("=== GRADIENT CHECK ===")
Y_pred, cache = forward(X, params)
grads = backward(X, Y_true, params, cache)
numerical_grads = numerical_gradient(X, Y_true, params, "W1")

print(f"Analytical dW1:\n{grads['dW1'].round(6)}")
print(f"\nNumerical dW1:\n{numerical_grads.round(6)}")
print(f"\nDifference: {np.max(np.abs(grads['dW1'] - numerical_grads)):.2e}")
print("(Should be < 1e-5 if correct!)")
Gradient Checking: Always verify your backprop with numerical gradients when implementing from scratch! This catches bugs that are otherwise invisible.
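To be thorough, you can run the same check over every parameter and use a scale-free relative error (a sketch built on the functions above; a relative error around 1e-7 or smaller is a common rule of thumb for “correct”):
for name in ["W1", "b1", "W2", "b2"]:
    analytical = grads["d" + name]
    numerical = numerical_gradient(X, Y_true, params, name)
    # Relative error: ||a - n|| / (||a|| + ||n||), robust to the overall gradient scale
    rel_error = np.linalg.norm(analytical - numerical) / (
        np.linalg.norm(analytical) + np.linalg.norm(numerical) + 1e-12
    )
    print(f"{name}: relative error = {rel_error:.2e}")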

Step 6: The Training Loop

Now we put everything together into the full training algorithm.
def train(X, Y, epochs=1000, lr=0.1, verbose=True):
    """
    Train a neural network using gradient descent.
    
    Args:
        X: Training inputs, shape (m, 2)
        Y: Training labels, shape (m, 1)
        epochs: Number of training iterations
        lr: Learning rate (step size)
        verbose: Print progress
    
    Returns:
        params: Trained parameters
        history: Loss at each epoch (for plotting)
    """
    # 1. Initialize parameters
    np.random.seed(42)  # For reproducibility
    params = init_params(2, 3, 1)
    history = []
    
    for i in range(epochs):
        # 2. Forward pass: compute predictions
        Y_pred, cache = forward(X, params)
        
        # 3. Compute loss
        loss = compute_loss(Y_pred, Y)
        history.append(loss)
        
        if verbose and i % 500 == 0:
            accuracy = np.mean((Y_pred > 0.5) == Y) * 100
            print(f"Epoch {i:4d} | Loss: {loss:.6f} | Accuracy: {accuracy:.1f}%")
        
        # 4. Backward pass: compute gradients
        grads = backward(X, Y, params, cache)
        
        # 5. Update parameters (gradient descent step)
        # Move each weight in the direction that reduces loss
        params["W1"] -= lr * grads["dW1"]
        params["b1"] -= lr * grads["db1"]
        params["W2"] -= lr * grads["dW2"]
        params["b2"] -= lr * grads["db2"]
    
    return params, history

# Train on XOR!
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [1], [1], [0]])

print("=" * 50)
print("TRAINING NEURAL NETWORK ON XOR")
print("=" * 50)
trained_params, history = train(X, Y, epochs=5000, lr=0.5)

# Plot the learning curve
plt.figure(figsize=(10, 4))
plt.plot(history, linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Training Progress: Loss Over Time', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()

# Final evaluation
print("\n" + "=" * 50)
print("FINAL RESULTS")
print("=" * 50)
final_preds, _ = forward(X, trained_params)
print("\nInputs → Predictions → Actual")
for i in range(4):
    pred = final_preds[i, 0]
    actual = Y[i, 0]
    correct = "✓" if (pred > 0.5) == actual else "✗"
    print(f"({X[i, 0]}, {X[i, 1]}) → {pred:.4f}{actual} {correct}")

accuracy = np.mean((final_preds > 0.5) == Y) * 100
print(f"\nFinal Accuracy: {accuracy:.1f}%")
Expected Output (your exact numbers may differ, but the pattern should look like this):
==================================================
TRAINING NEURAL NETWORK ON XOR
==================================================
Epoch    0 | Loss: 0.249876 | Accuracy: 50.0%
Epoch  500 | Loss: 0.124234 | Accuracy: 75.0%
Epoch 1000 | Loss: 0.045678 | Accuracy: 100.0%
...

==================================================
FINAL RESULTS
==================================================

Inputs → Predictions → Actual
(0, 0) → 0.0234 → 0 ✓
(0, 1) → 0.9812 → 1 ✓
(1, 0) → 0.9756 → 1 ✓
(1, 1) → 0.0312 → 0 ✓

Final Accuracy: 100.0%

🎓 Understanding What You’ve Built

Let’s visualize what the network actually learned:
def visualize_decision_boundary(params):
    """Visualize how the network separates the input space."""
    # Create a grid of points
    xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 200),
                          np.linspace(-0.5, 1.5, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    
    # Get predictions for each point
    preds, _ = forward(grid, params)
    preds = preds.reshape(xx.shape)
    
    # Plot
    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, preds, levels=50, cmap='RdYlBu', alpha=0.8)
    plt.colorbar(label='P(y=1)')
    
    # Plot training points
    colors = ['red' if y == 0 else 'blue' for y in Y.flatten()]
    plt.scatter(X[:, 0], X[:, 1], c=colors, s=300, edgecolors='black', linewidths=2)
    
    plt.xlabel('Input 1', fontsize=12)
    plt.ylabel('Input 2', fontsize=12)
    plt.title('Neural Network Decision Boundary for XOR', fontsize=14)
    plt.show()

visualize_decision_boundary(trained_params)
What you’ll see: The network has learned to create a non-linear boundary that correctly separates the XOR classes - something impossible with a single line!
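Another way to peek inside the trained network (a sketch using trained_params from above): print what each hidden neuron outputs for the four XOR inputs. The hidden layer has re-encoded the inputs so that the output layer can separate them with a single linear boundary.
# Inspect the hidden-layer activations for the four XOR inputs
_, cache = forward(X, trained_params)
hidden = cache["A1"]   # shape (4, 3): one row per input, one column per hidden neuron

print("Input    ->  hidden activations (h1, h2, h3)")
for row, h in zip(X, hidden):
    print(f"({row[0]}, {row[1]})  ->  {np.round(h, 3)}")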

⚠️ Common Training Problems

Understanding these issues is crucial for debugging real ML systems:

Problem 1: Vanishing Gradients

Symptoms: Loss decreases very slowly; early layers barely change.
Fix: Use ReLU instead of sigmoid for hidden layers; batch normalization.

Problem 2: Exploding Gradients

Symptoms: Loss becomes NaN; weights explode to infinity.
Fix: Gradient clipping (see the sketch below), proper initialization, a lower learning rate.
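A minimal gradient-clipping sketch (not part of the project code; you would call it on grads just before the update step in train):
def clip_gradients(grads, max_norm=1.0):
    """Rescale all gradients if their combined (global) L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        grads = {name: g * (max_norm / (total_norm + 1e-8)) for name, g in grads.items()}
    return grads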

Problem 3: Dead ReLU

Symptoms: Some neurons output 0 for all inputs.
Fix: Use LeakyReLU (sketched below), careful initialization, a lower learning rate.
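A drop-in LeakyReLU sketch (swap it for relu / relu_derivative in the forward and backward passes if you want to experiment; the 0.01 slope is a common default):
def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of going fully dead."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    """Gradient is 1 for positive inputs and alpha (not 0) for negative inputs."""
    return np.where(z > 0, 1.0, alpha)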

Problem 4: Learning Rate Issues

Symptoms: Loss bounces around (learning rate too high) or barely moves (too low).
Fix: Learning rate schedules (see the sketch below), adaptive optimizers such as Adam.
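One simple schedule to try is step decay; here is a sketch (the constants are arbitrary examples, not tuned values):
def decayed_lr(initial_lr, epoch, decay_rate=0.5, decay_every=1000):
    """Halve the learning rate every `decay_every` epochs (step decay)."""
    return initial_lr * (decay_rate ** (epoch // decay_every))

# e.g. inside the training loop: current_lr = decayed_lr(0.5, i)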

Extension Challenges 🏆

Ready to push further? Try these advanced challenges:

Challenge 1: Momentum Optimizer

Implement momentum to accelerate training:
def train_with_momentum(X, Y, params, lr=0.5, epochs=5000, beta=0.9):
    velocity = {
        "dW1": np.zeros_like(params["W1"]),
        "db1": np.zeros_like(params["b1"]),
        "dW2": np.zeros_like(params["W2"]),
        "db2": np.zeros_like(params["b2"])
    }
    
    for i in range(epochs):
        Y_pred, cache = forward(X, params)
        grads = backward(X, Y, params, cache)
        
        # Update with momentum
        for key in ["W1", "b1", "W2", "b2"]:
            # v = β * v + (1-β) * gradient
            velocity["d" + key] = beta * velocity["d" + key] + (1 - beta) * grads["d" + key]
            # Update parameter using velocity
            params[key] -= lr * velocity["d" + key]
        
        if i % 500 == 0:
            loss = compute_loss(Y_pred, Y)
            print(f"Epoch {i}: Loss {loss:.6f}")
    
    return params
Goal: Compare training curves with and without momentum.

Challenge 2: Multi-Class Classification

Extend your network to classify more than 2 classes using softmax:
$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$
def softmax(z):
    """Numerically stable softmax"""
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def cross_entropy_loss(y_pred, y_true):
    """Cross-entropy for multi-class"""
    m = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred + 1e-8)) / m
Goal: Classify handwritten digits 0-9 (use a small subset of MNIST).

Challenge 3: Regularization

Add L2 regularization to prevent overfitting:
$L_{\text{reg}} = L + \frac{\lambda}{2m}\sum_{l}\|W^{[l]}\|^2$
def compute_loss_with_reg(Y_pred, Y_true, params, lambd=0.01):
    m = Y_true.shape[0]
    
    # Original loss
    base_loss = (1 / (2*m)) * np.sum((Y_pred - Y_true)**2)
    
    # L2 regularization term
    reg_loss = (lambd / (2*m)) * (
        np.sum(np.square(params["W1"])) + 
        np.sum(np.square(params["W2"]))
    )
    
    return base_loss + reg_loss
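Note: if you train with this loss, the gradients change too. Each weight gradient gains a lambd / m * W term in the backward pass (for example, dW1 += (lambd / m) * params["W1"]); the bias gradients are unchanged.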
Goal: Train on a larger dataset and observe how regularization affects the weights.

Mathematical Summary: Connecting All Concepts

Here’s how everything you learned fits together:

The Deep Learning Pipeline

Step                 | Math Concept          | What It Does
1. Data as vectors   | Linear algebra        | Images → pixel vectors, text → embeddings
2. Linear transform  | Matrix multiplication | $z = Wx + b$
3. Non-linearity     | Activation functions  | $a = \sigma(z)$
4. Measure error     | Loss function         | $L = (y - \hat{y})^2$
5. Compute gradients | Chain rule            | $\nabla L$ = direction of steepest ascent
6. Update weights    | Gradient descent      | $W := W - \alpha \nabla L$
7. Repeat            | Optimization          | Converge to a minimum

Key Formulas Reference

Forward Pass: $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \quad a^{[l]} = g(z^{[l]})$
Backpropagation:
$dz^{[l]} = da^{[l]} \odot g'(z^{[l]})$
$dW^{[l]} = \frac{1}{m} \, dz^{[l]} (a^{[l-1]})^T$
$db^{[l]} = \frac{1}{m} \sum dz^{[l]}$
$da^{[l-1]} = (W^{[l]})^T dz^{[l]}$
Gradient Descent: $W^{[l]} := W^{[l]} - \alpha \cdot dW^{[l]}$
(These use the column-vector convention $z = Wa + b$. The code in this project stores examples as rows, $Z = A W + b$, which puts the transposes on the other side, e.g. $dW_1 = \frac{1}{m} X^T dZ_1$.)

What You’ve Mastered

  • Derivatives: Rate of change, finding optima
  • Gradients: Multi-variable optimization
  • Chain Rule: Backpropagation through layers
  • Gradient Descent: Iterative optimization
  • Neural Networks: Putting it all together

Congratulations!

Course Complete!

You’ve built a neural network from scratch—not using a “black box,” but by building the box yourself. You understand every gear and lever inside. This is the power of Calculus. It turns “magic” into math.
Your Calculus for ML Toolkit:
  • Derivatives - How fast things change; foundation of learning
  • Gradients - Multi-dimensional derivatives; direction of steepest change
  • Chain Rule - Compositions of functions; enables backpropagation
  • Gradient Descent - Iterative optimization; how models learn
  • Loss Functions - What to optimize; MSE, cross-entropy, etc.
  • Neural Networks - Functions composed of differentiable layers
Career Impact: Understanding these foundations separates ML engineers who can debug and innovate from those who just call APIs. When training fails, when gradients explode, when models don’t converge—this knowledge is what helps you fix it.

What’s Next?