Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Build a Neural Network from Scratch

Final Project: The Architect

Your Graduation Exam

You have learned the theory.
  • Derivatives: The rate of change.
  • Gradients: The direction of steepest ascent.
  • Chain Rule: How to propagate blame.
  • Gradient Descent: How to learn.
Now, you must prove your mastery. You will not use PyTorch. You will not use TensorFlow. You will build a brain using nothing but raw math (NumPy).
Estimated Time: 4-6 hours (take your time!)
Difficulty: Intermediate
Prerequisites: Completed all previous modules
What You’ll Build: A fully functional 2-layer neural network that learns XOR
Don’t Rush! This project is where everything clicks together. If you find yourself copy-pasting code without understanding it, STOP. Go back to the relevant module and review. The goal isn’t to finish fast—it’s to deeply understand every line.

🎯 What You’re Actually Building (And Why It Matters)

Before we write code, let’s understand what we’re creating and why each piece exists.

The XOR Problem: Why Neural Networks?

We’re solving the XOR problem - a classification task that stumped AI researchers for decades:
Input 1Input 2Output
000
011
101
110
Why XOR is special: You cannot draw a single straight line to separate the 0s from the 1s. This is called a non-linearly separable problem.
import matplotlib.pyplot as plt
import numpy as np

# Visualize XOR - notice you can't draw ONE line to separate blue from red
X = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = np.array([0, 1, 1, 0])

plt.figure(figsize=(8, 6))
colors = ['red' if y == 0 else 'blue' for y in Y]
plt.scatter(X[:, 0], X[:, 1], c=colors, s=300, edgecolors='black', linewidths=2)

for i, (x, y, label) in enumerate(zip(X[:, 0], X[:, 1], Y)):
    plt.annotate(f'XOR={label}', (x, y), textcoords="offset points", xytext=(10,10), fontsize=12)

plt.xlabel('Input 1', fontsize=14)
plt.ylabel('Input 2', fontsize=14)
plt.title('XOR Problem: No single line can separate red from blue!', fontsize=14)
plt.grid(True, alpha=0.3)
plt.xlim(-0.5, 1.5)
plt.ylim(-0.5, 1.5)
plt.show()
The breakthrough: A neural network with a hidden layer can learn to “bend” the space and separate these points. This is exactly what you’ll build!

The Complete Training Loop

Before diving into code, study this diagram. Every neural network ever built follows this exact pattern:
Neural Network Training Loop
The loop is:
  1. Forward Pass: Data flows through → we get a prediction
  2. Loss: How wrong are we?
  3. Backward Pass: Compute gradients (blame assignment)
  4. Update: Adjust weights to reduce loss
  5. Repeat thousands of times

The Blueprint

You are building a neural network to solve a classification problem. Architecture:
  • Input Layer: 2 neurons (x1,x2x_1, x_2) — the XOR inputs
  • Hidden Layer: 3 neurons (h1,h2,h3h_1, h_2, h_3) — the “secret sauce” that enables non-linear learning
  • Output Layer: 1 neuron (yy) — the predicted XOR output
Neural Network Blueprint

Understanding the Dimensions

This is critical for debugging. Let’s trace the shapes:
VariableShapeDescription
XX(m,2)(m, 2)mm examples, 2 input features
W1W_1(2,3)(2, 3)Connects 2 inputs → 3 hidden neurons
b1b_1(1,3)(1, 3)One bias per hidden neuron
Z1Z_1(m,3)(m, 3)Pre-activation for hidden layer
A1A_1(m,3)(m, 3)Post-activation (ReLU applied)
W2W_2(3,1)(3, 1)Connects 3 hidden → 1 output
b2b_2(1,1)(1, 1)One bias for output
Z2Z_2(m,1)(m, 1)Pre-activation for output
Y^\hat{Y}(m,1)(m, 1)Final predictions (sigmoid applied)
Pro Debugging Tip: When something breaks, print the shapes! 90% of neural network bugs are shape mismatches.
print(f"X shape: {X.shape}")  # Should be (4, 2) for XOR
print(f"W1 shape: {W1.shape}")  # Should be (2, 3)
# ... and so on

Step 1: The Bricks (Initialization)

A neural network is just a collection of matrices (weights) and vectors (biases). But how you initialize them matters enormously!

Why Small Random Numbers?

Initialization might seem like a minor detail, but it is one of the most common reasons neural networks fail to train. There are two mistakes to avoid, and they pull in opposite directions: Mistake 1: All zeros (or all the same value). If every weight starts at the same value, every neuron computes the same thing during the forward pass, receives the same gradient during the backward pass, and makes the same update. They remain identical forever — you effectively have a single neuron, no matter how wide your layer is. This is called the symmetry problem. Random initialization breaks this symmetry. Mistake 2: Too large. If weights are large, the inputs to sigmoid or tanh saturate at extreme values where the derivative is nearly zero. Gradients vanish and learning stalls. Imagine turning all the knobs on a mixing board to maximum — the sound is clipped and distorted, and small adjustments make no difference. The sweet spot is small random numbers. The scale matters and depends on the layer size.
import numpy as np

def init_params(input_size, hidden_size, output_size):
    """
    Initialize neural network parameters.
    
    Why random? If all weights are the same, all neurons learn the same thing!
    This is called the "symmetry problem."
    
    Why small (x0.01)? Large weights lead to large outputs, which saturate
    sigmoid, which produces tiny gradients, which prevents learning.
    """
    # Weights: Random small numbers (break symmetry)
    W1 = np.random.randn(input_size, hidden_size) * 0.01
    b1 = np.zeros((1, hidden_size))  # Biases can start at zero
    
    W2 = np.random.randn(hidden_size, output_size) * 0.01
    b2 = np.zeros((1, output_size))
    
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

# Let's see what this produces
np.random.seed(42)  # For reproducibility
params = init_params(2, 3, 1)
print("W1 (2 inputs → 3 hidden):")
print(params["W1"])
print(f"\nW1 shape: {params['W1'].shape}")
print(f"W2 shape: {params['W2'].shape}")
Output:
W1 (2 inputs → 3 hidden):
[[ 0.00496714 -0.00138264  0.00647689]
 [ 0.01523030 -0.00234153 -0.00234137]]

W1 shape: (2, 3)
W2 shape: (3, 1)
Advanced: Xavier/He InitializationFor deeper networks, * 0.01 isn’t optimal. Better initializations:
# Xavier initialization (for sigmoid/tanh)
W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1 / input_size)

# He initialization (for ReLU) - recommended!
W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2 / input_size)
These keep the variance of activations stable across layers.

Step 2: The Mortar (Activation Functions)

Neurons need to be non-linear. Without non-linearity, stacking layers is useless—you’d just get another linear function!
Why We Need Activation Functions

ReLU: The Modern Workhorse

def relu(z):
    """
    ReLU: Rectified Linear Unit
    
    f(z) = max(0, z)
    
    Why ReLU is popular:
    1. Simple and fast to compute
    2. Doesn't saturate for positive values (no vanishing gradients)
    3. Creates sparse activations (many zeros = efficient)
    
    The "kink" at z=0 is what creates non-linearity!
    """
    return np.maximum(0, z)

def relu_derivative(z):
    """
    Derivative of ReLU:
    f'(z) = 1 if z > 0, else 0
    
    Note: Technically undefined at z=0, but we use 0.
    """
    return (z > 0).astype(float)

# Visualize ReLU
z = np.linspace(-3, 3, 100)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(z, relu(z), linewidth=2, color='#22c55e')
plt.title('ReLU: f(z) = max(0, z)')
plt.xlabel('z')
plt.ylabel('ReLU(z)')
plt.grid(True, alpha=0.3)
plt.axhline(0, color='black', linewidth=0.5)
plt.axvline(0, color='black', linewidth=0.5)

plt.subplot(1, 2, 2)
plt.plot(z, relu_derivative(z), linewidth=2, color='#3b82f6')
plt.title("ReLU Derivative: f'(z)")
plt.xlabel('z')
plt.ylabel("f'(z)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Sigmoid: For Probabilities

def sigmoid(z):
    """
    Sigmoid squashes any value to (0, 1) range.
    
    f(z) = 1 / (1 + e^(-z))
    
    Perfect for binary classification output!
    - Output near 0 = confident "class 0"
    - Output near 1 = confident "class 1"
    - Output ~0.5 = uncertain
    """
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    """
    Derivative of sigmoid (using output a, not input z):
    f'(z) = f(z) × (1 - f(z)) = a × (1 - a)
    
    This is why we pass 'a' (the sigmoid output), not 'z'.
    """
    return a * (1 - a)

# Visualize sigmoid
z = np.linspace(-6, 6, 100)
a = sigmoid(z)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(z, a, linewidth=2, color='#a855f7')
plt.title('Sigmoid: σ(z) = 1/(1+e⁻ᶻ)')
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.axhline(0.5, color='gray', linestyle='--', alpha=0.5)
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(z, sigmoid_derivative(a), linewidth=2, color='#06b6d4')
plt.title("Sigmoid Derivative: σ'(z) = σ(z)(1-σ(z))")
plt.xlabel('z')
plt.ylabel("σ'(z)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
The Vanishing Gradient ProblemNotice the sigmoid derivative peaks at 0.25 and approaches 0 for large |z|. This means:
  • For very confident predictions, gradients are tiny
  • In deep networks, these tiny gradients multiply through the chain rule, causing vanishing gradients
  • This is why we use ReLU for hidden layers (derivative is 1 for positive values)
Think of it like a game of telephone through 50 people, where each person whispers at 25% volume. By the time the message reaches the end, it is inaudible. ReLU fixes this by passing the message at full volume (derivative = 1) for positive inputs.

Step 3: The Construction (Forward Pass)

The forward pass is where data flows through the network to produce a prediction. Let’s trace through every computation step by step.

The Math

Z1=XW1+b1(Linear transformation)Z_1 = X \cdot W_1 + b_1 \quad \text{(Linear transformation)} A1=ReLU(Z1)(Non-linear activation)A_1 = \text{ReLU}(Z_1) \quad \text{(Non-linear activation)} Z2=A1W2+b2(Linear transformation)Z_2 = A_1 \cdot W_2 + b_2 \quad \text{(Linear transformation)} Y^=σ(Z2)(Sigmoid for probability)\hat{Y} = \sigma(Z_2) \quad \text{(Sigmoid for probability)}

The Code (With Detailed Commentary)

def forward(X, params):
    """
    Forward propagation: Input → Prediction
    
    Args:
        X: Input data, shape (m, 2) where m = number of examples
        params: Dictionary with W1, b1, W2, b2
    
    Returns:
        A2: Predictions, shape (m, 1)
        cache: All intermediate values (needed for backprop!)
    """
    # Unpack parameters
    W1, b1 = params["W1"], params["b1"]
    W2, b2 = params["W2"], params["b2"]
    
    # === LAYER 1 ===
    # Linear: Z1 = X @ W1 + b1
    # Shape: (m, 2) @ (2, 3) + (1, 3) = (m, 3)
    Z1 = np.dot(X, W1) + b1
    
    # Activation: A1 = ReLU(Z1)
    # Shape: (m, 3)
    A1 = relu(Z1)
    
    # === LAYER 2 ===
    # Linear: Z2 = A1 @ W2 + b2
    # Shape: (m, 3) @ (3, 1) + (1, 1) = (m, 1)
    Z2 = np.dot(A1, W2) + b2
    
    # Activation: A2 = sigmoid(Z2)
    # Shape: (m, 1)
    A2 = sigmoid(Z2)
    
    # IMPORTANT: Cache everything for backpropagation!
    # We need Z1 for ReLU derivative, A1 for dW2, etc.
    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}
    
    return A2, cache

# Let's trace through with actual numbers
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # XOR inputs
np.random.seed(42)
params = init_params(2, 3, 1)

print("=== FORWARD PASS TRACE ===\n")
print(f"Input X (4 examples, 2 features):\n{X}\n")

# Manual trace through layer 1
Z1 = np.dot(X, params["W1"]) + params["b1"]
print(f"Z1 = X @ W1 + b1, shape {Z1.shape}:")
print(Z1.round(4))
print()

A1 = relu(Z1)
print(f"A1 = ReLU(Z1) - negative values become 0:")
print(A1.round(4))
print()

# Layer 2
Z2 = np.dot(A1, params["W2"]) + params["b2"]
print(f"Z2 = A1 @ W2 + b2, shape {Z2.shape}:")
print(Z2.round(4))
print()

A2 = sigmoid(Z2)
print(f"Predictions = sigmoid(Z2) - squashed to (0, 1):")
print(A2.round(4))
print()

print(f"True labels: {np.array([0, 1, 1, 0])}")
print("Before training: predictions are basically random (~0.5)")

Step 4: The Inspection (Loss Function)

The loss function measures “how wrong” our predictions are. It gives us a single number to minimize.

Mean Squared Error (MSE)

L=12mi=1m(y^iyi)2L = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2 Why MSE?
  • Squared errors penalize large mistakes more than small ones
  • The 12\frac{1}{2} makes the derivative cleaner (the 2 cancels out)
  • Works well for regression; decent for binary classification
def compute_loss(Y_pred, Y_true):
    """
    Mean Squared Error loss.
    
    Args:
        Y_pred: Predictions, shape (m, 1)
        Y_true: True labels, shape (m, 1)
    
    Returns:
        loss: A single number (lower = better)
    """
    m = Y_true.shape[0]
    
    # Squared difference for each example
    squared_errors = (Y_pred - Y_true) ** 2
    
    # Average across all examples, divide by 2
    loss = (1 / (2 * m)) * np.sum(squared_errors)
    
    return loss

# Example: How wrong is our random network?
Y_true = np.array([[0], [1], [1], [0]])  # XOR labels
Y_pred, cache = forward(X, params)

loss = compute_loss(Y_pred, Y_true)
print(f"Initial loss: {loss:.4f}")
print(f"This is basically random guessing (predictions ~0.5)")
Alternative: Binary Cross-Entropy (BCE)For binary classification, BCE is often preferred: L=1mi=1m[yilog(y^i)+(1yi)log(1y^i)]L = -\frac{1}{m} \sum_{i=1}^{m} [y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]
def cross_entropy_loss(Y_pred, Y_true):
    m = Y_true.shape[0]
    epsilon = 1e-8  # Prevent log(0)
    loss = -(1/m) * np.sum(
        Y_true * np.log(Y_pred + epsilon) + 
        (1 - Y_true) * np.log(1 - Y_pred + epsilon)
    )
    return loss
BCE has nice properties: it is derived from maximum likelihood and penalizes confident wrong predictions heavily. Note the epsilon = 1e-8 — this prevents log(0) which would produce negative infinity. In production code, you will also see np.clip(Y_pred, epsilon, 1 - epsilon) used to guard both ends of the range.

Step 5: The Renovation (Backward Pass) ⭐ THE HEART OF DEEP LEARNING

This is the most important part. Backpropagation computes how much each weight contributed to the error, so we know how to fix them.

The Big Picture: Blame Assignment

Imagine your network made a wrong prediction. Who’s to blame?
  • The output layer weights (W2W_2)?
  • The hidden layer weights (W1W_1)?
  • Both, but how much each?
Backpropagation answers this using the chain rule. We propagate the error signal backward through the network.

Deriving the Gradients (Step by Step)

Let’s derive every gradient from scratch. This is the math that powers all of deep learning.

Step 5.1: Output Layer Gradients

We want LW2\frac{\partial L}{\partial W_2} - how does changing W2W_2 affect the loss? Chain rule path: LY^Z2W2L \rightarrow \hat{Y} \rightarrow Z_2 \rightarrow W_2 LW2=LY^Y^Z2Z2W2\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{Y}} \cdot \frac{\partial \hat{Y}}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial W_2} Let’s compute each piece:
  1. LY^\frac{\partial L}{\partial \hat{Y}} - How loss changes with prediction: L=12m(Y^Y)2    LY^=1m(Y^Y)L = \frac{1}{2m}(\hat{Y} - Y)^2 \implies \frac{\partial L}{\partial \hat{Y}} = \frac{1}{m}(\hat{Y} - Y)
  2. Y^Z2\frac{\partial \hat{Y}}{\partial Z_2} - Sigmoid derivative: Y^=σ(Z2)    Y^Z2=Y^(1Y^)\hat{Y} = \sigma(Z_2) \implies \frac{\partial \hat{Y}}{\partial Z_2} = \hat{Y}(1 - \hat{Y})
  3. Z2W2\frac{\partial Z_2}{\partial W_2} - Linear layer: Z2=A1W2+b2    Z2W2=A1Z_2 = A_1 W_2 + b_2 \implies \frac{\partial Z_2}{\partial W_2} = A_1
Putting it together (with some simplification for MSE+Sigmoid): dZ2=Y^YdZ_2 = \hat{Y} - Y dW2=1mA1TdZ2dW_2 = \frac{1}{m} A_1^T \cdot dZ_2 db2=1mdZ2db_2 = \frac{1}{m} \sum dZ_2

Step 5.2: Hidden Layer Gradients

Now we backpropagate to the first layer. The error signal must flow through W2W_2! Chain rule path: LZ2A1Z1W1L \rightarrow Z_2 \rightarrow A_1 \rightarrow Z_1 \rightarrow W_1
  1. LA1\frac{\partial L}{\partial A_1} - Error flowing back through W2W_2: dA1=dZ2W2TdA_1 = dZ_2 \cdot W_2^T
  2. A1Z1\frac{\partial A_1}{\partial Z_1} - ReLU derivative: dZ1=dA11Z1>0dZ_1 = dA_1 \odot \mathbb{1}_{Z_1 > 0} (Element-wise multiply by 1 where Z1>0Z_1 > 0, else 0)
  3. Final gradients: dW1=1mXTdZ1dW_1 = \frac{1}{m} X^T \cdot dZ_1 db1=1mdZ1db_1 = \frac{1}{m} \sum dZ_1

The Code (With Detailed Commentary)

def backward(X, Y_true, params, cache):
    """
    Backward propagation: Compute all gradients.
    
    This is the chain rule in action!
    
    Args:
        X: Input data, shape (m, 2)
        Y_true: True labels, shape (m, 1)
        params: Dictionary with W1, b1, W2, b2
        cache: From forward pass - Z1, A1, Z2, A2
    
    Returns:
        grads: Dictionary with dW1, db1, dW2, db2
    """
    m = X.shape[0]  # Number of examples
    
    # Retrieve cached values
    W2 = params["W2"]
    A1, A2 = cache["A1"], cache["A2"]
    Z1 = cache["Z1"]
    
    # ========================================
    # OUTPUT LAYER GRADIENTS
    # ========================================
    
    # dZ2 = dL/dZ2 = (A2 - Y) for MSE loss with sigmoid
    # This is the "error signal" at the output
    dZ2 = A2 - Y_true  # Shape: (m, 1)
    
    # dW2 = dL/dW2 = A1.T @ dZ2 / m
    # Each column of A1 "votes" on how to change corresponding W2 weight
    dW2 = (1/m) * np.dot(A1.T, dZ2)  # Shape: (3, 1)
    
    # db2 = dL/db2 = mean of dZ2 across examples
    # Bias affects all examples equally
    db2 = (1/m) * np.sum(dZ2, axis=0, keepdims=True)  # Shape: (1, 1)
    
    # ========================================
    # HIDDEN LAYER GRADIENTS
    # ========================================
    
    # First, propagate error back through W2
    # dA1 = dZ2 @ W2.T
    # "How much did each hidden neuron contribute to the error?"
    dA1 = np.dot(dZ2, W2.T)  # Shape: (m, 3)
    
    # Then, propagate through ReLU
    # dZ1 = dA1 * relu_derivative(Z1)
    # ReLU derivative: 1 where Z1 > 0, else 0
    # If a neuron was "off" (Z1 <= 0), its gradient is 0
    dZ1 = dA1 * relu_derivative(Z1)  # Shape: (m, 3)
    
    # dW1 = dL/dW1 = X.T @ dZ1 / m
    dW1 = (1/m) * np.dot(X.T, dZ1)  # Shape: (2, 3)
    
    # db1 = mean of dZ1
    db1 = (1/m) * np.sum(dZ1, axis=0, keepdims=True)  # Shape: (1, 3)
    
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}

# Let's verify with numerical gradients (gradient checking!)
def numerical_gradient(X, Y, params, param_name, epsilon=1e-7):
    """Compute gradient numerically for verification."""
    param = params[param_name]
    grad = np.zeros_like(param)
    
    for i in range(param.shape[0]):
        for j in range(param.shape[1]):
            # f(x + epsilon)
            param[i, j] += epsilon
            Y_pred, _ = forward(X, params)
            loss_plus = compute_loss(Y_pred, Y)
            
            # f(x - epsilon)
            param[i, j] -= 2 * epsilon
            Y_pred, _ = forward(X, params)
            loss_minus = compute_loss(Y_pred, Y)
            
            # Restore
            param[i, j] += epsilon
            
            # Numerical gradient
            grad[i, j] = (loss_plus - loss_minus) / (2 * epsilon)
    
    return grad

# Verify our backward pass is correct!
print("=== GRADIENT CHECK ===")
Y_pred, cache = forward(X, params)
grads = backward(X, Y_true, params, cache)
numerical_grads = numerical_gradient(X, Y_true, params, "W1")

print(f"Analytical dW1:\n{grads['dW1'].round(6)}")
print(f"\nNumerical dW1:\n{numerical_grads.round(6)}")
print(f"\nDifference: {np.max(np.abs(grads['dW1'] - numerical_grads)):.2e}")
print("(Should be < 1e-5 if correct!)")
Gradient Checking: Always verify your backprop with numerical gradients when implementing from scratch! This catches bugs that are otherwise invisible.

Step 6: The Training Loop

Now we put everything together into the full training algorithm.
def train(X, Y, epochs=1000, lr=0.1, verbose=True):
    """
    Train a neural network using gradient descent.
    
    Args:
        X: Training inputs, shape (m, 2)
        Y: Training labels, shape (m, 1)
        epochs: Number of training iterations
        lr: Learning rate (step size)
        verbose: Print progress
    
    Returns:
        params: Trained parameters
        history: Loss at each epoch (for plotting)
    """
    # 1. Initialize parameters
    np.random.seed(42)  # For reproducibility
    params = init_params(2, 3, 1)
    history = []
    
    for i in range(epochs):
        # 2. Forward pass: compute predictions
        Y_pred, cache = forward(X, params)
        
        # 3. Compute loss
        loss = compute_loss(Y_pred, Y)
        history.append(loss)
        
        if verbose and i % 500 == 0:
            accuracy = np.mean((Y_pred > 0.5) == Y) * 100
            print(f"Epoch {i:4d} | Loss: {loss:.6f} | Accuracy: {accuracy:.1f}%")
        
        # 4. Backward pass: compute gradients
        grads = backward(X, Y, params, cache)
        
        # 5. Update parameters (gradient descent step)
        # Move each weight in the direction that reduces loss
        params["W1"] -= lr * grads["dW1"]
        params["b1"] -= lr * grads["db1"]
        params["W2"] -= lr * grads["dW2"]
        params["b2"] -= lr * grads["db2"]
    
    return params, history

# Train on XOR!
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [1], [1], [0]])

print("=" * 50)
print("TRAINING NEURAL NETWORK ON XOR")
print("=" * 50)
trained_params, history = train(X, Y, epochs=5000, lr=0.5)

# Plot the learning curve
plt.figure(figsize=(10, 4))
plt.plot(history, linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Training Progress: Loss Over Time', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()

# Final evaluation
print("\n" + "=" * 50)
print("FINAL RESULTS")
print("=" * 50)
final_preds, _ = forward(X, trained_params)
print("\nInputs → Predictions → Actual")
for i in range(4):
    pred = final_preds[i, 0]
    actual = Y[i, 0]
    correct = "✓" if (pred > 0.5) == actual else "✗"
    print(f"({X[i, 0]}, {X[i, 1]}) → {pred:.4f}{actual} {correct}")

accuracy = np.mean((final_preds > 0.5) == Y) * 100
print(f"\nFinal Accuracy: {accuracy:.1f}%")
Expected Output:
==================================================
TRAINING NEURAL NETWORK ON XOR
==================================================
Epoch    0 | Loss: 0.249876 | Accuracy: 50.0%
Epoch  500 | Loss: 0.124234 | Accuracy: 75.0%
Epoch 1000 | Loss: 0.045678 | Accuracy: 100.0%
...

==================================================
FINAL RESULTS
==================================================

Inputs → Predictions → Actual
(0, 0) → 0.0234 → 0 ✓
(0, 1) → 0.9812 → 1 ✓
(1, 0) → 0.9756 → 1 ✓
(1, 1) → 0.0312 → 0 ✓

Final Accuracy: 100.0%

🎓 Understanding What You’ve Built

Let’s visualize what the network actually learned:
def visualize_decision_boundary(params):
    """Visualize how the network separates the input space."""
    # Create a grid of points
    xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 200),
                          np.linspace(-0.5, 1.5, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    
    # Get predictions for each point
    preds, _ = forward(grid, params)
    preds = preds.reshape(xx.shape)
    
    # Plot
    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, preds, levels=50, cmap='RdYlBu', alpha=0.8)
    plt.colorbar(label='P(y=1)')
    
    # Plot training points
    colors = ['red' if y == 0 else 'blue' for y in Y.flatten()]
    plt.scatter(X[:, 0], X[:, 1], c=colors, s=300, edgecolors='black', linewidths=2)
    
    plt.xlabel('Input 1', fontsize=12)
    plt.ylabel('Input 2', fontsize=12)
    plt.title('Neural Network Decision Boundary for XOR', fontsize=14)
    plt.show()

visualize_decision_boundary(trained_params)
What you’ll see: The network has learned to create a non-linear boundary that correctly separates the XOR classes - something impossible with a single line!

⚠️ Common Training Problems

Common Training Problems
Understanding these issues is crucial for debugging real ML systems:

Problem 1: Vanishing Gradients

Symptoms: Loss decreases very slowly, early layers barely change Fix: Use ReLU instead of sigmoid for hidden layers, batch normalization

Problem 2: Exploding Gradients

Symptoms: Loss becomes NaN, weights explode to infinity Fix: Gradient clipping, proper initialization, lower learning rate

Problem 3: Dead ReLU

Symptoms: Some neurons output 0 for all inputs, and their gradients are also 0, so they never recover
Why it happens: If a large gradient update pushes a neuron’s bias so negative that its input is always negative, ReLU outputs 0 forever. It is “dead” because gradient of ReLU at 0 is 0, so no update can revive it.
Fix: Use LeakyReLU (max(0.01*z, z) — allows a small gradient even for negative inputs), careful He initialization, or a more conservative learning rate

Problem 4: Learning Rate Issues

Symptoms: Loss bounces around (too high) or barely moves (too low)
Fix: Learning rate schedulers, adaptive optimizers (Adam), or use the learning rate finder technique
Debugging Checklist for Your Neural NetworkWhen your network is not learning, run through this checklist in order:
  1. Check shapes first. Print every matrix shape in the forward and backward pass. Most bugs are shape mismatches that silently broadcast.
  2. Run gradient checking. Compare your analytical gradients to numerical gradients (centered difference). The relative error should be below 10510^{-5}. If not, you have a bug in backprop.
  3. Overfit on a tiny dataset. Before training on all your data, try to get 100% accuracy on just 4-8 examples. If the network cannot memorize a tiny dataset, something is fundamentally broken.
  4. Check gradient magnitudes per layer. Print np.linalg.norm(grads["dW1"]) and np.linalg.norm(grads["dW2"]). If one is orders of magnitude different from the other, you likely have a vanishing or exploding gradient issue.
  5. Visualize the loss curve. A healthy loss curve decreases smoothly (possibly with noise). A flat line means gradients are zero. Spikes mean the learning rate is too high. NaN means overflow.
  6. Try a known-working configuration first. For XOR: hidden_size=4, lr=1.0, epochs=5000 should work. If it does not, the bug is in your code, not your hyperparameters.

Extension Challenges 🏆

Ready to push further? Try these advanced challenges:

Challenge 1: Momentum Optimizer

Implement momentum to accelerate training:
def train_with_momentum(X, Y, params, lr=0.5, epochs=5000, beta=0.9):
    velocity = {
        "dW1": np.zeros_like(params["W1"]),
        "db1": np.zeros_like(params["b1"]),
        "dW2": np.zeros_like(params["W2"]),
        "db2": np.zeros_like(params["b2"])
    }
    
    for i in range(epochs):
        Y_pred, cache = forward(X, params)
        grads = backward(X, Y, params, cache)
        
        # Update with momentum
        for key in ["W1", "b1", "W2", "b2"]:
            # v = β * v + (1-β) * gradient
            velocity["d" + key] = beta * velocity["d" + key] + (1 - beta) * grads["d" + key]
            # Update parameter using velocity
            params[key] -= lr * velocity["d" + key]
        
        if i % 500 == 0:
            loss = compute_loss(Y_pred, Y)
            print(f"Epoch {i}: Loss {loss:.6f}")
    
    return params
Goal: Compare training curves with and without momentum.

Challenge 2: Multi-Class Classification

Extend your network to classify more than 2 classes using softmax: softmax(zi)=ezijezj\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
def softmax(z):
    """
    Numerically stable softmax.
    
    The subtraction of np.max(z) prevents overflow in np.exp().
    Without it, exp(1000) = inf, and inf/inf = NaN.
    With it, the largest exponent is exp(0) = 1, which is safe.
    Mathematically, this produces identical results because the constant
    cancels in the numerator/denominator ratio.
    """
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def cross_entropy_loss(y_pred, y_true):
    """Cross-entropy for multi-class"""
    m = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred + 1e-8)) / m
Goal: Classify handwritten digits 0-9 (use a small subset of MNIST).

Challenge 3: Regularization

Add L2 regularization to prevent overfitting: Lreg=L+λ2mlW[l]2L_{reg} = L + \frac{\lambda}{2m}\sum_{l}\|W^{[l]}\|^2
def compute_loss_with_reg(Y_pred, Y_true, params, lambd=0.01):
    m = Y_true.shape[0]
    
    # Original loss
    base_loss = (1 / (2*m)) * np.sum((Y_pred - Y_true)**2)
    
    # L2 regularization term
    reg_loss = (lambd / (2*m)) * (
        np.sum(np.square(params["W1"])) + 
        np.sum(np.square(params["W2"]))
    )
    
    return base_loss + reg_loss
Goal: Train on a larger dataset and observe how regularization affects the weights.

Mathematical Summary: Connecting All Concepts

Here’s how everything you learned fits together:

The Deep Learning Pipeline

StepMath ConceptWhat It Does
1. Data as VectorsLinear AlgebraImages → pixel vectors, text → embeddings
2. Linear TransformMatrix multiplicationz=Wx+bz = Wx + b
3. Non-linearityActivation functionsa=σ(z)a = \sigma(z)
4. Measure ErrorLoss functionL=(yy^)2L = (y - \hat{y})^2
5. Compute GradientsChain RuleL\nabla L = direction of steepest ascent
6. Update WeightsGradient DescentW:=WαLW := W - \alpha \nabla L
7. RepeatOptimizationConverge to minimum

Key Formulas Reference

Forward Pass: z[l]=W[l]a[l1]+b[l],a[l]=g(z[l])z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \quad a^{[l]} = g(z^{[l]}) Backpropagation: dz[l]=da[l]g(z[l])dz^{[l]} = da^{[l]} * g'(z^{[l]}) dW[l]=1mdz[l](a[l1])TdW^{[l]} = \frac{1}{m} dz^{[l]} (a^{[l-1]})^T db[l]=1mdz[l]db^{[l]} = \frac{1}{m} \sum dz^{[l]} da[l1]=(W[l])Tdz[l]da^{[l-1]} = (W^{[l]})^T dz^{[l]} Gradient Descent: W[l]:=W[l]αdW[l]W^{[l]} := W^{[l]} - \alpha \cdot dW^{[l]}

What You’ve Mastered

Derivatives: Rate of change, finding optima
Gradients: Multi-variable optimization
Chain Rule: Backpropagation through layers
Gradient Descent: Iterative optimization
Neural Networks: Putting it all together

Congratulations!

Course Complete!

You’ve built a neural network from scratch—not using a “black box,” but by building the box yourself. You understand every gear and lever inside.This is the power of Calculus. It turns “magic” into math.
Your Calculus for ML Toolkit:
  • Derivatives - How fast things change; foundation of learning
  • Gradients - Multi-dimensional derivatives; direction of steepest change
  • Chain Rule - Compositions of functions; enables backpropagation
  • Gradient Descent - Iterative optimization; how models learn
  • Loss Functions - What to optimize; MSE, cross-entropy, etc.
  • Neural Networks - Functions composed of differentiable layers
Career Impact: Understanding these foundations separates ML engineers who can debug and innovate from those who just call APIs. When training fails, when gradients explode, when models don’t converge—this knowledge is what helps you fix it.

What’s Next?

Linear Algebra for ML

Master vectors, matrices, eigenvalues, and SVD - the language of data.

Statistics for ML

Learn probability, distributions, and statistical inference for ML.

Interview Deep-Dive

Strong Answer:
  • The biggest thing is understanding the backward pass at the level of individual matrix operations. In PyTorch, you call loss.backward() and gradients appear. When you build from scratch, you have to manually derive and implement dL/dW2 = hidden_activations^T @ dL/dz2 and dL/dW1 = input^T @ dL/dz1. This forces you to understand exactly what information flows where and why the shapes of matrices in the backward pass are the transposes of the forward pass.
  • Second, numerical stability becomes visceral. When I implemented sigmoid, I immediately hit overflow for large inputs. I had to learn the hard way that you need to clip inputs or use the numerically stable version: sigmoid(x) = 1/(1+exp(-abs(x))) with special handling for the sign. PyTorch hides all of this behind a single call to torch.sigmoid().
  • Third, weight initialization matters in ways you only appreciate when your from-scratch network fails to learn. I initialized weights to random values in [-1, 1] and training diverged. I tried all zeros and the network learned nothing because all neurons compute identical gradients (symmetry breaking problem). I eventually learned why Xavier and He initialization formulas exist — they keep the variance of activations and gradients constant across layers, and the specific scaling factors (1/sqrt(n_in) for Xavier, sqrt(2/n_in) for He with ReLU) are derived from propagating variance through the network equations.
  • Fourth, the training loop structure. The order of operations matters: forward, loss, backward, update, zero gradients. Getting the order wrong (like updating weights before computing all gradients, or forgetting to zero accumulated gradients) causes subtle bugs that are hard to diagnose.
  • In interviews, this kind of from-scratch experience is gold because it shows you are not just an API user. You understand why the abstractions exist and what they are doing for you.
Follow-up: You mentioned the symmetry breaking problem with all-zero initialization. Why does this not happen with bias terms — can biases be initialized to zero?Yes, biases can and typically are initialized to zero, and it does not cause the symmetry problem. The reason: the symmetry issue with weights is that if all weights in a layer are identical, every neuron in that layer computes the exact same output for any input, receives the exact same gradient, and updates identically — they are permanently “locked” together. Biases do not break or preserve this symmetry because biases shift the activation but do not create inter-neuron coupling. The weight matrix is what connects neurons to their inputs, and the symmetry there means neurons cannot specialize. Random weight initialization gives each neuron a slightly different “perspective” on the input, so they produce different outputs, receive different gradients, and learn different features. Once symmetry is broken by even tiny random differences, the diversity grows during training. This is also why techniques like dropout work: they randomly disable neurons, forcing the remaining ones to develop independent, useful representations rather than co-adapting.
Strong Answer:
  • A single-layer network computes y = sigma(w1x1 + w2x2 + b), which is a sigmoid applied to a linear combination of inputs. The decision boundary (where the output is 0.5) is the line w1x1 + w2x2 + b = 0. This is a single straight line in 2D space.
  • For XOR, the four points are: (0,0)=0, (0,1)=1, (1,0)=1, (1,1)=0. You need to separate from . These two classes are interleaved — no single straight line can separate them. Try any line orientation: it will always misclassify at least one point. This is the formal meaning of “not linearly separable.”
  • The hidden layer solves this by warping the input space. Each hidden neuron defines its own linear decision boundary. With two hidden neurons, you get two lines that carve the input space into regions. The first hidden neuron might activate for “x1 + x2 > 0.5” (separating (0,0) from the rest) and the second for “x1 + x2 < 1.5” (separating (1,1) from the rest). In the hidden representation space (h1, h2), the four XOR points are remapped to positions that ARE linearly separable.
  • Geometrically, the hidden layer performs a nonlinear coordinate transformation. It “folds” or “bends” the 2D input space so that the formerly interleaved classes end up on the same side. The output layer then draws a single linear boundary in this transformed space to make the final classification. This is the fundamental power of depth in neural networks: each layer adds another nonlinear transformation, allowing increasingly complex decision boundaries. A network with one hidden layer and enough neurons can approximate any continuous function (universal approximation theorem), but in practice, deeper networks with fewer neurons per layer often learn better representations than wide shallow networks.
Follow-up: The universal approximation theorem says a single hidden layer with enough neurons can approximate any function. So why do we use deep networks instead of wide shallow ones?The universal approximation theorem is an existence result — it says a wide enough single-layer network CAN represent any function, but it says nothing about whether gradient descent can FIND that representation, or how many neurons you would need. In practice, depth provides exponential efficiency gains. Certain functions that require exponentially many neurons in a shallow network can be represented with polynomially many neurons in a deep network. For example, representing a nested composition like f(g(h(x))) directly matches a 3-layer architecture but would need a huge single layer to approximate. Deep networks also learn hierarchical representations: early layers learn simple features (edges), middle layers compose them (textures, parts), and late layers combine those into high-level concepts (objects, scenes). This hierarchical structure mirrors the structure of real-world data. Finally, the optimization landscape of deep networks, while non-convex, has been empirically shown to have favorable properties: many local minima with similar loss values, connected by flat valleys. Shallow wide networks can have more pathological loss landscapes despite their theoretical expressiveness.
Strong Answer:
  • The single most common bug is shape mismatches in matrix operations during the backward pass. The forward pass computes z = W @ x + b (shapes: W is [out, in], x is [in, batch], z is [out, batch]). The backward pass needs dL/dW which has the same shape as W: [out, in]. The formula is dL/dW = dL/dz @ x^T. If you accidentally compute x^T @ dL/dz instead of dL/dz @ x^T, you get either a shape error or, worse, a matrix of the wrong shape that silently produces garbage gradients.
  • The second most common bug is forgetting to apply the activation derivative. People write dL/dh_prev = W^T @ dL/dz instead of first computing dL/dz = dL/da * activation_derivative(z), then dL/dh_prev = W^T @ dL/dz. The missing activation derivative means gradients propagate as if the network were linear, which completely changes the optimization dynamics.
  • Systematic verification uses gradient checking. For each parameter tensor W, perturb one element W[i,j] by +epsilon and -epsilon, compute the loss for each, and approximate the gradient as (L_plus - L_minus) / (2 * epsilon). Compare this to your analytical gradient. The relative error should be below 1e-5 for float64. Do this for EVERY parameter in the network, not just one layer.
  • Additional verification strategies: (1) Train on a single data point and verify the loss goes to near-zero. If it does not, there is a bug in either forward or backward. (2) Check that gradient norms are in a reasonable range — not zero, not millions. (3) Verify that gradients are zero for parameters that should not be affected by a particular input (this catches accidental gradient flow through wrong paths). (4) Compare your implementation’s output against PyTorch on the same inputs and weights — the forward pass values, the loss, and every gradient should match to floating-point precision.
Follow-up: Gradient checking uses epsilon=1e-5 typically. Why not use a smaller epsilon for better accuracy?This is a common trap. The numerical gradient approximation has two competing error sources. Truncation error (from the Taylor series approximation) scales as O(epsilon^2) for central differences — smaller epsilon means less truncation error. But floating-point cancellation error scales as O(machine_epsilon / epsilon) — when epsilon is very small, f(x+epsilon) and f(x-epsilon) are nearly identical floating-point numbers, and subtracting them loses significant digits. For float64 with machine epsilon around 1e-16, the optimal epsilon balances these at roughly epsilon = (machine_epsilon)^(1/3) which is about 1e-5 to 1e-6. For float32 with machine epsilon around 1e-7, the optimal epsilon is around 1e-3 to 1e-4. Using epsilon=1e-7 with float32 will give unreliable gradient checks because cancellation error dominates. I always run gradient checks in float64 for this reason, even if the actual training uses float32 or mixed precision.