Learning From Mistakes
The Problem We Left Off With
In the last module, we had:
- A model (see the code sketch after this list): price = base + bed_weight * bedrooms + bath_weight * bathrooms + sqft_weight * sqft
- Training data: Houses with known prices
- A loss function: Sum of squared errors
- The question: How do we find the best weights?
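Here is that setup as a minimal Python sketch; the three houses and the weight values below are made up purely for illustration:

```python
# Our model: predict a price from a house's features.
def predict(house, base, bed_weight, bath_weight, sqft_weight):
    return (base
            + bed_weight * house["bedrooms"]
            + bath_weight * house["bathrooms"]
            + sqft_weight * house["sqft"])

# Training data: houses with known prices (made-up numbers).
houses = [
    {"bedrooms": 3, "bathrooms": 2, "sqft": 1500, "price": 300_000},
    {"bedrooms": 2, "bathrooms": 1, "sqft": 900,  "price": 180_000},
    {"bedrooms": 4, "bathrooms": 3, "sqft": 2200, "price": 450_000},
]

# Loss function: sum of squared errors over all houses.
def loss(base, bed_weight, bath_weight, sqft_weight):
    return sum(
        (predict(h, base, bed_weight, bath_weight, sqft_weight) - h["price"]) ** 2
        for h in houses
    )

# The question: which weights make this number as small as possible?
print(loss(base=10_000, bed_weight=20_000, bath_weight=10_000, sqft_weight=150))
```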
A Different Perspective: You’re Lost on a Hill
Imagine you’re blindfolded, dropped somewhere on a hilly landscape. Your goal: find the lowest point (the valley). You can’t see anything, but you can feel the slope under your feet. What do you do?
This is gradient descent. And it’s how almost all machine learning works.
Connecting to Our Problem
| Hill Analogy | Machine Learning |
|---|---|
| Your position on the hill | Current values of weights |
| Height at your position | The error (loss) |
| The valley (lowest point) | Minimum error = best weights |
| Slope of the ground | The gradient |
| Taking a step downhill | Updating weights to reduce error |
Let’s See It With One Weight
To make this concrete, let’s simplify. Imagine we only have one weight to figure out: the price per square foot.

The Key Insight: Slope Tells Us Direction
At any point on the curve:
- If slope is negative (going down to the right) -> increase weight
- If slope is positive (going up to the right) -> decrease weight
- If slope is zero -> we’re at the bottom!
Math Connection: The slope of a function is its derivative. If you want to understand this deeply, check out our Derivatives module. For now, just know: slope tells us direction.
Computing the Slope (Gradient)
For our squared error loss, the derivative with respect to the weight is:

slope = 2 * sum((w * sqft_i - price_i) * sqft_i) over all houses i

In words: how wrong we are on each house, times that house’s square footage, summed up.

The Gradient Descent Algorithm
Now we can write the actual learning algorithm:
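A minimal sketch of what that loop could look like for the one-weight model (price = w * sqft); the data, starting point, and learning rate are placeholder values:

```python
# Made-up training data: square footage and sale price.
sqft   = [1500, 900, 2200, 1200]
prices = [300_000, 180_000, 450_000, 250_000]

w = 0.0               # initial guess for price per square foot
learning_rate = 5e-8  # step size (tiny because sqft values are large)

for step in range(1000):
    # Slope of the loss: 2 * sum((prediction - actual) * sqft) over all houses.
    slope = 2 * sum((w * x - y) * x for x, y in zip(sqft, prices))
    # Step downhill: move opposite to the slope.
    w -= learning_rate * slope

print(f"Learned price per square foot: {w:.2f}")
```

Note the single update line: subtracting learning_rate * slope handles all three cases from the slope-direction rule, since a negative slope increases w, a positive slope decreases it, and a zero slope leaves it alone.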
The Learning Rate: Too Big vs Too Small

The learning rate controls step size: too small and learning crawls, needing a huge number of steps to reach the bottom; too large and you overshoot the valley, bouncing back and forth or even climbing higher with every step.
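To feel the difference, you could rerun the one-weight loop with a few different rates (the toy data is repeated so the snippet stands alone; the three rates are arbitrary):

```python
# Same made-up data as the one-weight example above.
sqft   = [1500, 900, 2200, 1200]
prices = [300_000, 180_000, 450_000, 250_000]

for lr in (1e-10, 5e-8, 2.5e-7):
    w = 0.0
    for _ in range(100):
        slope = 2 * sum((w * x - y) * x for x, y in zip(sqft, prices))
        w -= lr * slope
    print(f"learning rate {lr:.1e} -> w after 100 steps: {w:.4g}")
```

With these made-up numbers, the smallest rate has barely moved after 100 steps, the middle one settles near the best value, and the largest one overshoots so badly that w explodes.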
Scaling Up: Multiple Weights

Our house model has 4 weights: base, bedroom, bathroom, and sqft. The gradient is now a vector of slopes, one for each weight:
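One way that could look in code (a sketch; the houses, learning rate, and step count are placeholders):

```python
# Made-up training data: (bedrooms, bathrooms, sqft, price) per house.
houses = [
    (3, 2, 1500, 300_000),
    (2, 2, 900,  180_000),
    (4, 3, 2200, 450_000),
    (3, 2, 1200, 250_000),
]

base, bed_w, bath_w, sqft_w = 0.0, 0.0, 0.0, 0.0
lr = 1e-8

for step in range(5000):
    # Four weights, four slopes: the gradient is a vector.
    g_base = g_bed = g_bath = g_sqft = 0.0
    for beds, baths, sqft, price in houses:
        error = (base + bed_w * beds + bath_w * baths + sqft_w * sqft) - price
        g_base += 2 * error          # partial derivative w.r.t. base
        g_bed  += 2 * error * beds   # ... w.r.t. bed_weight
        g_bath += 2 * error * baths  # ... w.r.t. bath_weight
        g_sqft += 2 * error * sqft   # ... w.r.t. sqft_weight

    # Step every weight downhill along its own slope.
    base   -= lr * g_base
    bed_w  -= lr * g_bed
    bath_w -= lr * g_bath
    sqft_w -= lr * g_sqft

print(base, bed_w, bath_w, sqft_w)
```

Because the features have very different scales, the sqft weight moves much faster than the others here.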
The Math Behind It

What we just did has a formal name: Gradient Descent. The update rule is:

θ_new = θ_old - α · ∇L(θ_old)

Where:
- θ = weights (parameters)
- α = learning rate
- ∇L(θ) = gradient of the loss function
Deep Dive Available: For the full mathematical treatment, see our Gradient Descent module in the Calculus course. Key concepts covered there:
- Why the gradient points “uphill”
- The chain rule for computing gradients
- Advanced optimizers like Adam and SGD with momentum
Why This Matters
Gradient descent is the algorithm that powers:
- Linear regression
- Logistic regression
- Neural networks
- Deep learning
- GPT, DALL-E, and all modern AI
Visualizing the Journey
Imagine the loss function as a bowl-shaped surface: your current weights put you somewhere on the side of the bowl, and each gradient descent step slides you a little closer to the bottom.
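If you want to draw the bowl yourself, here is a rough sketch (it assumes numpy and matplotlib are installed, and uses a generic two-weight bowl rather than our exact housing loss):

```python
import numpy as np
import matplotlib.pyplot as plt

# A generic bowl-shaped loss over two weights (not our exact housing loss).
w1, w2 = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
loss = w1**2 + w2**2

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(w1, w2, loss, cmap="viridis", alpha=0.8)
ax.set_xlabel("weight 1")
ax.set_ylabel("weight 2")
ax.set_zlabel("loss")
ax.set_title("Gradient descent rolls down this bowl to the minimum")
plt.show()
```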
Key Takeaways

- Loss = Height on a Hill: we want to find the lowest point
- Gradient = Slope: tells us which direction is downhill
- Learning Rate = Step Size: how far to move each iteration
- Iterate Until Converged: keep stepping until you reach the bottom
🚀 Mini Projects
- Project 1: Implement gradient descent from scratch
- Project 2: Learning rate explorer with visualization
- Project 3: Predict student exam scores
What’s Next?
Now that you understand how learning works, we can explore:
- Linear Regression - The complete algorithm for fitting lines to data
- Classification - What if we’re predicting categories, not numbers?
- Regularization - How to prevent overfitting
Continue to Module 3: Linear Regression
Build a complete, production-ready regression model
🔗 Math → ML Connection
What you learned in this module powers ALL of modern AI:
| Concept Here | Formal Term | Where It’s Used |
|---|---|---|
| Slope of the loss | Gradient | Every neural network, every ML model |
| Step downhill | Gradient descent step | Training GPT-4, DALL-E, AlphaFold |
| Step size | Learning rate | Critical hyperparameter in all training |
| Reaching the bottom | Convergence | When training is “done” |
| Too-big steps (overshooting) | Divergence | Learning rate too high |
The Takeaway: You just learned the exact algorithm that trains:
- ChatGPT (gradient descent on 175B parameters)
- Stable Diffusion (gradient descent on image-text pairs)
- AlphaFold (gradient descent on protein structures)
- Self-driving cars (gradient descent on sensor data)
🚀 Going Deeper (Optional)
Advanced Optimization Concepts
Why Not Just Use the Formula?
For linear regression, there’s a closed-form solution, the normal equation: weights = (XᵀX)⁻¹ Xᵀ y (sketched in code after this list). So why use gradient descent?
- Scalability: Matrix inversion is O(n³). Gradient descent is O(n).
- Generality: Works for ANY differentiable loss (not just squared error)
- Deep learning: Neural nets have no closed-form solution
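For comparison, here is what the closed-form route looks like with NumPy; the houses are made up, and in practice you would usually reach for np.linalg.lstsq or a library like scikit-learn instead:

```python
import numpy as np

# Made-up design matrix: one row per house, columns [1, bedrooms, bathrooms, sqft].
X = np.array([
    [1, 3, 2, 1500],
    [1, 2, 2, 900],
    [1, 4, 3, 2200],
    [1, 3, 2, 1200],
], dtype=float)
y = np.array([300_000, 180_000, 450_000, 250_000], dtype=float)

# Normal equation: solve (X^T X) w = X^T y in one shot -- no iteration needed.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # [base, bed_weight, bath_weight, sqft_weight]
```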
Stochastic Gradient Descent (SGD)
Instead of computing the gradient on ALL the data each step, SGD estimates it from a small random mini-batch. Each update is much cheaper, so you can take far more steps in the same time, at the cost of a noisier estimate of the slope.
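A minimal sketch of one SGD pass, assuming the data is a plain Python list of (feature_vector, target) pairs and the model is a simple weighted sum (both are stand-ins for this example):

```python
import random

def sgd_epoch(data, weights, lr, batch_size=32):
    """One pass over the data using mini-batch stochastic gradient descent.

    `data` is a list of (features, target) pairs and `weights` a list of
    floats -- hypothetical placeholders for this sketch.
    """
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient estimated from the mini-batch only, not the whole dataset.
        grad = [0.0] * len(weights)
        for features, target in batch:
            error = sum(w * x for w, x in zip(weights, features)) - target
            for j, x in enumerate(features):
                grad[j] += 2 * error * x / len(batch)
        # Same downhill step as before, just with a noisier slope estimate.
        for j in range(len(weights)):
            weights[j] -= lr * grad[j]
    return weights

# Hypothetical usage: weights = sgd_epoch(data, [0.0] * num_features, lr=0.01)
```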
Modern Optimizers

Gradient descent has been improved (a usage sketch follows the table):

| Optimizer | Key Idea | When to Use |
|---|---|---|
| SGD + Momentum | Build up speed in consistent directions | Simple baseline |
| Adam | Adaptive learning rate per parameter | Default choice for most |
| AdamW | Adam with proper weight decay | Transformers, LLMs |
| LAMB | Adam scaled for huge batches | Very large models |
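You rarely hand-roll these; with a framework like PyTorch (assumed installed here), switching optimizers is usually a one-line change. A rough sketch with stand-in data:

```python
import torch

# A stand-in model: 4 input features -> 1 predicted price.
model = torch.nn.Linear(4, 1)

# Swapping optimizers is usually a one-line change:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

X = torch.randn(8, 4)  # fake batch of 8 houses
y = torch.randn(8, 1)  # fake prices

for step in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()      # autograd computes the gradient for us
    optimizer.step()     # the optimizer decides how far to step
```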
Convergence Theory
For convex functions (like linear regression’s MSE):
- Gradient descent ALWAYS finds the global minimum
- Convergence rate depends on learning rate and function curvature
For non-convex functions (like neural network losses):
- Many local minima exist
- We find “good enough” solutions, not guaranteed global minimum
- This works surprisingly well in practice!
Practice Challenge
Implement gradient descent from scratch for this dataset:
Solution
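Since the dataset isn’t reproduced here, this is a minimal sketch against made-up (hours studied, exam score) pairs; swap in the real data and the same loop should carry over:

```python
# Made-up (hours studied, exam score) pairs -- replace with the real dataset.
data = [(1, 52), (2, 58), (3, 65), (4, 70), (5, 78), (6, 83)]

w, b = 0.0, 0.0   # model: score = w * hours + b
lr = 0.01

for step in range(10_000):
    # Average gradients of the squared error over the dataset.
    grad_w = sum(2 * ((w * x + b) - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * ((w * x + b) - y) for x, y in data) / len(data)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"score ≈ {w:.2f} * hours + {b:.2f}")
```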