Final Project: The Architect
Your Graduation Exam
You have learned the theory.
- Derivatives: The rate of change.
- Gradients: The direction of steepest ascent.
- Chain Rule: How to propagate blame.
- Gradient Descent: How to learn.
Estimated Time: 4-6 hours (take your time!)
Difficulty: Intermediate
Prerequisites: Completed all previous modules
What You’ll Build: A fully functional 2-layer neural network that learns XOR
🎯 What You’re Actually Building (And Why It Matters)
Before we write code, let’s understand what we’re creating and why each piece exists.

The XOR Problem: Why Neural Networks?
We’re solving the XOR problem - a classification task that stumped AI researchers for decades, because no single linear boundary can separate its classes:

| Input 1 | Input 2 | Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
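Encoded as NumPy arrays, the truth table above becomes our entire training set (a minimal sketch; the names `X` and `Y` are our choice and match the shapes used later):

```python
import numpy as np

# Each row is one training example: (input 1, input 2)
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]], dtype=float)   # shape (4, 2)

# The XOR label for each row
Y = np.array([[0], [1], [1], [0]], dtype=float)  # shape (4, 1)
```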
The Complete Training Loop
Before diving into code, study this loop. Every neural network ever built follows this exact pattern:
- Forward Pass: Data flows through → we get a prediction
- Loss: How wrong are we?
- Backward Pass: Compute gradients (blame assignment)
- Update: Adjust weights to reduce loss
- Repeat thousands of times
The Blueprint
You are building a neural network to solve a classification problem.

Architecture:
- Input Layer: 2 neurons ($x_1, x_2$) — the XOR inputs
- Hidden Layer: 3 neurons ($h_1, h_2, h_3$) — the “secret sauce” that enables non-linear learning
- Output Layer: 1 neuron ($\hat{y}$) — the predicted XOR output
Understanding the Dimensions
This is critical for debugging. Let’s trace the shapes:

| Variable | Shape | Description |
|---|---|---|
| $X$ | $(4, 2)$ | 4 examples, 2 input features |
| $W_1$ | $(2, 3)$ | Connects 2 inputs → 3 hidden neurons |
| $b_1$ | $(1, 3)$ | One bias per hidden neuron |
| $Z_1$ | $(4, 3)$ | Pre-activation for hidden layer |
| $A_1$ | $(4, 3)$ | Post-activation (ReLU applied) |
| $W_2$ | $(3, 1)$ | Connects 3 hidden → 1 output |
| $b_2$ | $(1, 1)$ | One bias for output |
| $Z_2$ | $(4, 1)$ | Pre-activation for output |
| $\hat{Y}$ | $(4, 1)$ | Final predictions (sigmoid applied) |
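You can sanity-check these shapes in NumPy before writing any training code (a small sketch using zero placeholders, just to confirm how the dimensions combine):

```python
import numpy as np

X = np.zeros((4, 2))                  # 4 XOR examples, 2 features each
W1, b1 = np.zeros((2, 3)), np.zeros((1, 3))
W2, b2 = np.zeros((3, 1)), np.zeros((1, 1))

Z1 = X @ W1 + b1                      # (4, 2) @ (2, 3) + (1, 3) -> (4, 3)
A1 = np.maximum(0, Z1)                # ReLU keeps the shape: (4, 3)
Z2 = A1 @ W2 + b2                     # (4, 3) @ (3, 1) + (1, 1) -> (4, 1)
Y_hat = 1 / (1 + np.exp(-Z2))         # sigmoid keeps the shape: (4, 1)
print(X.shape, Z1.shape, A1.shape, Z2.shape, Y_hat.shape)
```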
Step 1: The Bricks (Initialization)
A neural network is just a collection of matrices (weights) and vectors (biases). But how you initialize them matters enormously!

Why Small Random Numbers?
If every weight started at zero, all hidden neurons would compute the same thing and receive the same gradient, so they could never learn different features. Small random values break that symmetry while keeping activations in a range where gradients flow well.
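A minimal initialization sketch in NumPy (the shapes match the table above; the 0.01 scale and the seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)           # fixed seed so results are reproducible

# Small random weights break symmetry; zero biases are a safe default.
W1 = rng.standard_normal((2, 3)) * 0.01   # input (2)  -> hidden (3)
b1 = np.zeros((1, 3))
W2 = rng.standard_normal((3, 1)) * 0.01   # hidden (3) -> output (1)
b2 = np.zeros((1, 1))
```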
Advanced: Xavier/He Initialization
Multiplying by 0.01 isn’t optimal for deeper networks. Better initializations scale the random weights by the layer’s fan-in: Xavier uses $\sqrt{1/n_{\text{in}}}$ and He (designed for ReLU) uses $\sqrt{2/n_{\text{in}}}$. These keep the variance of activations stable across layers.

Step 2: The Mortar (Activation Functions)
Neurons need to be non-linear. Without non-linearity, stacking layers is useless—you’d just get another linear function!

ReLU: The Modern Workhorse
$\mathrm{ReLU}(x) = \max(0, x)$: positive inputs pass through unchanged, negative inputs become 0. Its derivative is 1 for $x > 0$ and 0 otherwise, which keeps gradients from shrinking in deep networks.
Sigmoid: For Probabilities
$\sigma(x) = \frac{1}{1 + e^{-x}}$ squashes any real number into $(0, 1)$, making it a natural fit for the output of a binary classifier. Its derivative is $\sigma(x)\,(1 - \sigma(x))$.
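One possible NumPy implementation of both activations and their derivatives (the derivative helpers are included here because the backward pass in Step 5 will need them):

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z), applied element-wise."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Slope of ReLU: 1 where z > 0, else 0."""
    return (z > 0).astype(float)

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """sigma(z) * (1 - sigma(z)) -- handy for backpropagation."""
    s = sigmoid(z)
    return s * (1.0 - s)
```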
Step 3: The Construction (Forward Pass)
The forward pass is where data flows through the network to produce a prediction. Let’s trace through every computation step by step.

The Math
$$Z_1 = XW_1 + b_1,\qquad A_1 = \mathrm{ReLU}(Z_1),\qquad Z_2 = A_1 W_2 + b_2,\qquad \hat{Y} = \sigma(Z_2)$$
The Code (With Detailed Commentary)
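One way to write this in NumPy (a sketch that assumes the `relu` and `sigmoid` helpers from Step 2 and the parameters `W1, b1, W2, b2` from Step 1):

```python
def forward(X, W1, b1, W2, b2):
    """One forward pass; returns predictions plus the intermediates backprop needs."""
    Z1 = X @ W1 + b1        # (4, 2) @ (2, 3) + (1, 3) -> (4, 3)  hidden pre-activation
    A1 = relu(Z1)           # (4, 3)  hidden activations
    Z2 = A1 @ W2 + b2       # (4, 3) @ (3, 1) + (1, 1) -> (4, 1)  output pre-activation
    Y_hat = sigmoid(Z2)     # (4, 1)  predicted probabilities
    cache = (Z1, A1)        # keep intermediates for the backward pass
    return Y_hat, cache
```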
Step 4: The Inspection (Loss Function)
The loss function measures “how wrong” our predictions are. It gives us a single number to minimize.

Mean Squared Error (MSE)
$$L = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2$$
Why MSE?
- Squared errors penalize large mistakes more than small ones
- The factor of $\frac{1}{2}$ makes the derivative cleaner (the 2 cancels out)
- Works well for regression; decent for binary classification
Alternative: Binary Cross-Entropy (BCE)
For binary classification, BCE is often preferred:
$$L = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i\log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
BCE has nice properties: it’s derived from maximum likelihood and penalizes confident wrong predictions heavily.
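Both losses in NumPy (a sketch; the small `eps` clamp in BCE is our addition, used only to keep `log()` away from zero):

```python
import numpy as np

def mse_loss(Y_hat, Y):
    """Mean squared error with the 1/2 factor, so the 2 cancels in the gradient."""
    m = Y.shape[0]
    return float(np.sum((Y_hat - Y) ** 2) / (2 * m))

def bce_loss(Y_hat, Y, eps=1e-12):
    """Binary cross-entropy; eps keeps the logs away from log(0)."""
    m = Y.shape[0]
    Y_hat = np.clip(Y_hat, eps, 1 - eps)
    return float(-np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m)
```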
Step 5: The Renovation (Backward Pass) ⭐ THE HEART OF DEEP LEARNING
This is the most important part. Backpropagation computes how much each weight contributed to the error, so we know how to fix them.

The Big Picture: Blame Assignment
Imagine your network made a wrong prediction. Who’s to blame?
- The output layer weights ($W_2$)?
- The hidden layer weights ($W_1$)?
- Both, but how much each?
Deriving the Gradients (Step by Step)
Let’s derive every gradient from scratch. This is the math that powers all of deep learning.

Step 5.1: Output Layer Gradients
We want $\frac{\partial L}{\partial W_2}$ - how does changing $W_2$ affect the loss?

Chain rule path: $\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{Y}} \cdot \frac{\partial \hat{Y}}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial W_2}$

Let’s compute each piece:
- $\frac{\partial L}{\partial \hat{Y}}$ - How loss changes with prediction: $\frac{1}{m}(\hat{Y} - Y)$
- $\frac{\partial \hat{Y}}{\partial Z_2}$ - Sigmoid derivative: $\hat{Y}\odot(1 - \hat{Y})$
- $\frac{\partial Z_2}{\partial W_2}$ - Linear layer: $A_1$

Putting the pieces together with $dZ_2 = \frac{1}{m}(\hat{Y} - Y)\odot\hat{Y}\odot(1 - \hat{Y})$:
$$\frac{\partial L}{\partial W_2} = A_1^\top dZ_2,\qquad \frac{\partial L}{\partial b_2} = \textstyle\sum_i dZ_2$$
Step 5.2: Hidden Layer Gradients
Now we backpropagate to the first layer. The error signal must flow through ! Chain rule path:- - Error flowing back through :
- - ReLU derivative: (Element-wise multiply by 1 where , else 0)
- Final gradients:
The Code (With Detailed Commentary)
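A sketch that mirrors the derivation above, assuming the MSE loss with the $\frac{1}{2m}$ factor, the `(Z1, A1)` cache returned by the `forward` sketch in Step 3, and NumPy imported as `np`:

```python
def backward(X, Y, Y_hat, cache, W2):
    """Gradients of the MSE(+sigmoid) loss with respect to every parameter."""
    Z1, A1 = cache
    m = X.shape[0]

    # Output layer: dZ2 = dL/dY_hat * sigmoid'(Z2) = (1/m)(Y_hat - Y) * Y_hat(1 - Y_hat)
    dZ2 = (Y_hat - Y) * Y_hat * (1.0 - Y_hat) / m        # (4, 1)
    dW2 = A1.T @ dZ2                                     # (3, 1)
    db2 = np.sum(dZ2, axis=0, keepdims=True)             # (1, 1)

    # Hidden layer: push the error back through W2, then gate it with ReLU's derivative
    dA1 = dZ2 @ W2.T                                     # (4, 3)
    dZ1 = dA1 * (Z1 > 0)                                 # zero where ReLU was inactive
    dW1 = X.T @ dZ1                                      # (2, 3)
    db1 = np.sum(dZ1, axis=0, keepdims=True)             # (1, 3)

    return dW1, db1, dW2, db2
```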
Step 6: The Training Loop
Now we put everything together into the full training algorithm.
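Here is one self-contained version of the loop, with the earlier helpers inlined so it runs on its own. The learning rate, epoch count, initialization scale, and seed are illustrative choices; if the loss plateaus, try a different seed or learning rate:

```python
import numpy as np

# XOR data: the whole training set
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Parameter initialization (a larger scale than 0.01 helps this tiny network)
rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 3)) * 0.5
b1 = np.zeros((1, 3))
W2 = rng.standard_normal((3, 1)) * 0.5
b2 = np.zeros((1, 1))

lr, epochs = 0.5, 20000
m = X.shape[0]

for epoch in range(epochs):
    # 1. Forward pass
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)
    Z2 = A1 @ W2 + b2
    Y_hat = 1.0 / (1.0 + np.exp(-Z2))

    # 2. Loss (MSE with the 1/2 factor)
    loss = np.sum((Y_hat - Y) ** 2) / (2 * m)

    # 3. Backward pass (blame assignment)
    dZ2 = (Y_hat - Y) * Y_hat * (1.0 - Y_hat) / m
    dW2 = A1.T @ dZ2
    db2 = np.sum(dZ2, axis=0, keepdims=True)
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * (Z1 > 0)
    dW1 = X.T @ dZ1
    db1 = np.sum(dZ1, axis=0, keepdims=True)

    # 4. Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if epoch % 2000 == 0:
        print(f"epoch {epoch:5d}   loss {loss:.4f}")

print("predictions:", Y_hat.round(3).ravel())   # ideally close to [0, 1, 1, 0]
```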
🎓 Understanding What You’ve Built
Let’s look at what the network actually learned: feed all four XOR inputs through the trained network and compare its outputs to the targets. They should land close to 0 or 1 in the right places, something no purely linear model could ever do.

⚠️ Common Training Problems
Problem 1: Vanishing Gradients
Symptoms: Loss decreases very slowly; early layers barely change.
Fix: Use ReLU instead of sigmoid for hidden layers, batch normalization.

Problem 2: Exploding Gradients
Symptoms: Loss becomes NaN; weights explode to infinity.
Fix: Gradient clipping, proper initialization, lower learning rate.

Problem 3: Dead ReLU
Symptoms: Some neurons output 0 for all inputs.
Fix: Use LeakyReLU, careful initialization, lower learning rate.

Problem 4: Learning Rate Issues
Symptoms: Loss bounces around (too high) or barely moves (too low).
Fix: Learning rate schedulers, adaptive optimizers (Adam).

Extension Challenges 🏆
Ready to push further? Try these advanced challenges:

Challenge 1: Momentum Optimizer
Implement momentum to accelerate training: keep a running velocity that accumulates past gradients, $v \leftarrow \beta v - \eta\,\frac{\partial L}{\partial W}$, then step $W \leftarrow W + v$ (a common choice is $\beta = 0.9$). A small sketch follows.
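One possible shape for that update as a helper function (the name `momentum_update` and the default hyperparameters are ours, not a fixed API):

```python
import numpy as np

def momentum_update(param, grad, velocity, lr=0.5, beta=0.9):
    """One momentum step: the velocity accumulates past gradients, then moves the weights."""
    velocity = beta * velocity - lr * grad
    return param + velocity, velocity

# Usage inside the training loop (one velocity buffer per parameter, starting at zeros):
#   vW1 = np.zeros_like(W1)
#   ...
#   W1, vW1 = momentum_update(W1, dW1, vW1)
```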
Challenge 2: Multi-Class Classification
Extend your network to classify more than 2 classes using softmax, $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$, typically paired with a cross-entropy loss.

Challenge 3: Regularization
Add L2 regularization to prevent overfitting: add $\frac{\lambda}{2}\sum \lVert W \rVert^2$ to the loss, which simply adds $\lambda W$ to each weight gradient.

Mathematical Summary: Connecting All Concepts
Here’s how everything you learned fits together:

The Deep Learning Pipeline
| Step | Math Concept | What It Does |
|---|---|---|
| 1. Data as Vectors | Linear Algebra | Images → pixel vectors, text → embeddings |
| 2. Linear Transform | Matrix multiplication | $Z = XW + b$ |
| 3. Non-linearity | Activation functions | $A = \mathrm{ReLU}(Z)$ or $\sigma(Z)$ |
| 4. Measure Error | Loss function | $L = \frac{1}{2m}\sum(\hat{y} - y)^2$ |
| 5. Compute Gradients | Chain Rule | $\nabla L$ = direction of steepest ascent |
| 6. Update Weights | Gradient Descent | $W \leftarrow W - \eta\,\nabla L$ |
| 7. Repeat | Optimization | Converge to minimum |
Key Formulas Reference
Forward Pass:
$$Z_1 = XW_1 + b_1,\qquad A_1 = \mathrm{ReLU}(Z_1),\qquad Z_2 = A_1 W_2 + b_2,\qquad \hat{Y} = \sigma(Z_2)$$

Backpropagation:
$$dZ_2 = \tfrac{1}{m}(\hat{Y} - Y)\odot \hat{Y}\odot(1-\hat{Y}),\qquad dW_2 = A_1^\top dZ_2,\qquad db_2 = \textstyle\sum dZ_2$$
$$dA_1 = dZ_2 W_2^\top,\qquad dZ_1 = dA_1 \odot \mathbf{1}[Z_1 > 0],\qquad dW_1 = X^\top dZ_1,\qquad db_1 = \textstyle\sum dZ_1$$

Gradient Descent:
$$W \leftarrow W - \eta\,\frac{\partial L}{\partial W},\qquad b \leftarrow b - \eta\,\frac{\partial L}{\partial b}$$

What You’ve Mastered
✅ Derivatives: Rate of change, finding optima
✅ Gradients: Multi-variable optimization
✅ Chain Rule: Backpropagation through layers
✅ Gradient Descent: Iterative optimization
✅ Neural Networks: Putting it all together
Congratulations!
Course Complete!
You’ve built a neural network from scratch—not using a “black box,” but by building the box yourself. You understand every gear and lever inside.

This is the power of Calculus. It turns “magic” into math.
Your Calculus for ML Toolkit:
- ✅ Derivatives - How fast things change; foundation of learning
- ✅ Gradients - Multi-dimensional derivatives; direction of steepest change
- ✅ Chain Rule - Compositions of functions; enables backpropagation
- ✅ Gradient Descent - Iterative optimization; how models learn
- ✅ Loss Functions - What to optimize; MSE, cross-entropy, etc.
- ✅ Neural Networks - Functions composed of differentiable layers