Learning From Mistakes
The Problem We Left Off With
In the last module, we had:
- A model: price = base + bed_weight * bedrooms + bath_weight * bathrooms + sqft_weight * sqft
- Training data: Houses with known prices
- A loss function: Sum of squared errors
- The question: How do we find the best weights?
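To make the setup concrete, here's a minimal Python sketch of the model and loss above. The three example houses are toy numbers invented for illustration, not data from the course:

```python
# Minimal sketch of the setup: the model and the sum-of-squared-errors loss.

def predict(base, bed_weight, bath_weight, sqft_weight, bedrooms, bathrooms, sqft):
    """The model: price = base + bed_weight*bedrooms + bath_weight*bathrooms + sqft_weight*sqft."""
    return base + bed_weight * bedrooms + bath_weight * bathrooms + sqft_weight * sqft

def sum_squared_error(weights, houses):
    """The loss: sum of squared differences between predicted and actual prices."""
    base, bed_w, bath_w, sqft_w = weights
    total = 0.0
    for bedrooms, bathrooms, sqft, actual_price in houses:
        predicted = predict(base, bed_w, bath_w, sqft_w, bedrooms, bathrooms, sqft)
        total += (predicted - actual_price) ** 2
    return total

# Toy training data: (bedrooms, bathrooms, sqft, price) -- illustrative values only.
houses = [(3, 2, 1500, 300_000), (4, 3, 2200, 420_000), (2, 1, 900, 180_000)]
print(sum_squared_error((50_000, 10_000, 5_000, 100), houses))
```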
A Different Perspective: You’re Lost on a Hill
Imagine you’re blindfolded, dropped somewhere on a hilly landscape. Your goal: find the lowest point (the valley). You can’t see anything, but you can feel the slope under your feet. What do you do?
1. Feel the ground: which direction is “downhill”?
2. Take a step downhill: move in that direction.
3. Repeat: keep stepping downhill until the ground is flat (you’ve reached the bottom).
Connecting to Our Problem
| Hill Analogy | Machine Learning |
|---|---|
| Your position on the hill | Current values of weights |
| Height at your position | The error (loss) |
| The valley (lowest point) | Minimum error = best weights |
| Slope of the ground | The gradient |
| Taking a step downhill | Updating weights to reduce error |
Let’s See It With One Weight
To make this concrete, let’s simplify. Imagine we only have one weight to figure out: the price per square foot.
The Key Insight: Slope Tells Us Direction
At any point on the curve:
- If slope is negative (going down to the right) -> increase weight
- If slope is positive (going up to the right) -> decrease weight
- If slope is zero -> we’re at the bottom!
Math Connection: The slope of a function is its derivative. If you want to understand this deeply, check out our Derivatives module. For now, just know: slope tells us direction.
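To see “slope tells us direction” in action before doing any calculus, here's a small sketch that estimates the slope numerically with a finite difference. The one-weight model and the toy numbers are illustrative assumptions:

```python
# Estimate the slope of the loss at a given weight using a finite difference.
# Toy one-weight model: predicted_price = weight * sqft.

sqft_list = [1500, 2200, 900]             # hypothetical square footages
price_list = [300_000, 420_000, 180_000]  # hypothetical prices

def loss(weight):
    return sum((weight * x - y) ** 2 for x, y in zip(sqft_list, price_list))

def slope(weight, eps=1e-4):
    # Central difference: approximates the derivative dL/dw at this weight.
    return (loss(weight + eps) - loss(weight - eps)) / (2 * eps)

w = 50.0         # a guess for price per square foot
print(slope(w))  # negative slope -> increasing w would reduce the loss
```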
Computing the Slope (Gradient)
For our squared error loss (with prediction = weight * sqft), the derivative of the loss with respect to the weight is:

$$\frac{dL}{dw} = \sum_i 2\, x_i \,(w\, x_i - y_i)$$

where $x_i$ is the square footage of house $i$ and $y_i$ is its actual price.
The Gradient Descent Algorithm
Now we can write the actual learning algorithm: start with a guess for the weight, compute the slope of the loss at that guess, step a small amount in the opposite (downhill) direction, and repeat until the slope is nearly zero.
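Here's a minimal sketch of that loop for the one-weight case, using the derivative above. The toy data, starting guess, and learning rate are illustrative choices, not values from the original lesson:

```python
# Gradient descent with a single weight: price ≈ weight * sqft.
sqft_list = [1500, 2200, 900]             # hypothetical square footages
price_list = [300_000, 420_000, 180_000]  # hypothetical prices

def gradient(weight):
    # dL/dw = sum of 2 * x * (w*x - y) over all houses
    return sum(2 * x * (weight * x - y) for x, y in zip(sqft_list, price_list))

weight = 0.0           # start with a guess
learning_rate = 1e-8   # small, because sqft values (and hence gradients) are large

for step in range(1000):
    weight -= learning_rate * gradient(weight)   # step downhill

print(weight)  # settles near the best price per square foot for this toy data
```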
The Learning Rate: Too Big vs Too Small
The learning rate controls the step size: too small and learning crawls, taking thousands of steps to reach the valley; too big and you overshoot the bottom, bouncing back and forth or even diverging.
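One way to feel the difference is to run the same loop with a few different learning rates and compare the final loss. The specific rates below are arbitrary illustrative picks on the toy data from the previous sketch:

```python
# Compare learning rates on the one-weight toy problem.
sqft_list = [1500, 2200, 900]
price_list = [300_000, 420_000, 180_000]

def loss(w):
    return sum((w * x - y) ** 2 for x, y in zip(sqft_list, price_list))

def gradient(w):
    return sum(2 * x * (w * x - y) for x, y in zip(sqft_list, price_list))

for lr in (1e-10, 1e-8, 2e-7):   # too small, about right, too big
    w = 0.0
    for _ in range(200):
        w -= lr * gradient(w)
    print(f"lr={lr:g}  final loss={loss(w):.3g}")
# Tiny steps barely move; a well-chosen rate converges; too-big steps blow up.
```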
Scaling Up: Multiple Weights
Our house model has 4 weights: base, bedroom, bathroom, and sqft. The gradient is now a vector of slopes, one for each weight (sketched in code below).
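Here's a sketch of that gradient vector for the four-weight house model using NumPy; the feature matrix and prices are the same kind of toy values as before:

```python
import numpy as np

# Toy data: columns are [1 (for base), bedrooms, bathrooms, sqft]
X = np.array([[1, 3, 2, 1500],
              [1, 4, 3, 2200],
              [1, 2, 1, 900]], dtype=float)
y = np.array([300_000, 420_000, 180_000], dtype=float)

weights = np.zeros(4)          # [base, bed_weight, bath_weight, sqft_weight]

predictions = X @ weights
errors = predictions - y
gradient = 2 * X.T @ errors    # one slope per weight: shape (4,)

print(gradient)                # e.g. a large negative slope for the sqft weight
learning_rate = 1e-8           # illustrative; one gradient descent step:
weights -= learning_rate * gradient
```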
The Math Behind It
What we just did has a formal name: Gradient Descent. The update rule is:

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \, \nabla L(\theta)$$

Where:
- $\theta$ = weights (parameters)
- $\alpha$ = learning rate
- $\nabla L(\theta)$ = gradient of the loss function
Deep Dive Available
For the full mathematical treatment, see our Gradient Descent module in the Calculus course. Key concepts covered there:
- Why the gradient points “uphill”
- The chain rule for computing gradients
- Advanced optimizers like Adam and SGD with momentum
Why This Matters
Gradient descent is the algorithm that powers:
- Linear regression
- Logistic regression
- Neural networks
- Deep learning
- GPT, DALL-E, and all modern AI
Visualizing the Journey
Imagine the loss function as a bowl-shaped surface: the current weights sit somewhere on the side of the bowl, and each gradient descent step slides them a little closer to the bottom.
Key Takeaways
- Loss = Height on a Hill: We want to find the lowest point
- Gradient = Slope: Tells us which direction is downhill
- Learning Rate = Step Size: How far to move each iteration
- Iterate Until Converged: Keep stepping until you reach the bottom
🚀 Mini Projects
- Project 1: Implement gradient descent from scratch
- Project 2: Learning rate explorer with visualization
- Project 3: Predict student exam scores
What’s Next?
Now that you understand how learning works, we can explore:
- Linear Regression - The complete algorithm for fitting lines to data
- Classification - What if we’re predicting categories, not numbers?
- Regularization - How to prevent overfitting
Continue to Module 3: Linear Regression
Build a complete, production-ready regression model
🔗 Math → ML Connection
What you learned in this module powers ALL of modern AI:
| Concept Here | Formal Term | Where It’s Used |
|---|---|---|
| Slope of the loss | Gradient | Every neural network, every ML model |
| Step downhill | Gradient descent step | Training GPT-4, DALL-E, AlphaFold |
| Step size | Learning rate | Critical hyperparameter in all training |
| Reaching the bottom | Convergence | When training is “done” |
| Too-big steps (overshooting) | Divergence | Learning rate too high |
The Takeaway: You just learned the exact algorithm that trains:
- ChatGPT (gradient descent on 175B parameters)
- Stable Diffusion (gradient descent on image-text pairs)
- AlphaFold (gradient descent on protein structures)
- Self-driving cars (gradient descent on sensor data)
🚀 Going Deeper (Optional)
Advanced Optimization Concepts
Why Not Just Use the Formula?
For linear regression, there’s a closed-form solution, the normal equation:

$$w = (X^\top X)^{-1} X^\top y$$

So why use gradient descent?
- Scalability: Inverting $X^\top X$ (a d × d matrix, where d is the number of features) costs O(d³), while each gradient descent step is only O(n·d).
- Generality: Works for ANY differentiable loss (not just squared error)
- Deep learning: Neural nets have no closed-form solution
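As a sanity check on the claim above, here's a sketch that computes both the normal-equation solution and a gradient descent solution on synthetic data and shows they land in the same place. The data generation and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: 200 examples, 3 features plus a bias column.
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
true_w = np.array([4.0, 2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

# Closed-form (normal equation): w = (X^T X)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on mean squared error
# (mean rather than sum just rescales the learning rate).
w = np.zeros(d + 1)
lr = 0.1
for _ in range(2000):
    grad = 2 / n * X.T @ (X @ w - y)
    w -= lr * grad

print(np.round(w_closed, 3))
print(np.round(w, 3))   # should be very close to the closed-form answer
```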
Stochastic Gradient Descent (SGD)
Instead of computing the gradient on ALL the data at every step, SGD uses a small random mini-batch. Each step is noisier but far cheaper, so you can take many more of them.
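Here's a minimal mini-batch SGD sketch on synthetic data; the batch size, learning rate, and epoch count are illustrative assumptions rather than recommended settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data for illustration.
n, d = 1000, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
true_w = rng.normal(size=d + 1)
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(d + 1)
lr, batch_size = 0.05, 32

for epoch in range(20):
    order = rng.permutation(n)                       # shuffle once per epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 / len(idx) * Xb.T @ (Xb @ w - yb)   # gradient on the mini-batch only
        w -= lr * grad

print(np.round(w - true_w, 3))   # residual error should be small
```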
Modern Optimizers
Gradient descent has been improved:
| Optimizer | Key Idea | When to Use |
|---|---|---|
| SGD + Momentum | Build up speed in consistent directions | Simple baseline |
| Adam | Adaptive learning rate per parameter | Default choice for most |
| AdamW | Adam with proper weight decay | Transformers, LLMs |
| LAMB | Adam scaled for huge batches | Very large models |
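To make one row of the table concrete, here's a minimal sketch of the SGD + momentum update rule in plain NumPy (not a library implementation); the momentum value 0.9 is a common but illustrative choice:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One SGD + momentum update: velocity accumulates past gradients,
    so steps speed up in directions that stay consistent."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# Usage sketch: keep `velocity` between calls, same shape as the weights.
w = np.zeros(3)
velocity = np.zeros_like(w)
grad = np.array([1.0, -2.0, 0.5])   # pretend gradient, for illustration
w, velocity = momentum_step(w, grad, velocity)
print(w)
```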
Convergence Theory
For convex functions (like linear regression’s MSE):
- Gradient descent ALWAYS finds the global minimum (given a small enough learning rate)
- Convergence rate depends on learning rate and function curvature

For non-convex functions (like neural network losses):
- Many local minima exist
- We find “good enough” solutions, not a guaranteed global minimum
- This works surprisingly well in practice!
Practice Challenge
Implement gradient descent from scratch for this dataset:
Solution