
Learning From Mistakes

Gradient Descent - Finding the Minimum

The Problem We Left Off With

In the last module, we had:
  • A model: price = base + bed_weight * bedrooms + bath_weight * bathrooms + sqft_weight * sqft
  • Training data: Houses with known prices
  • A loss function: Sum of squared errors
  • The question: How do we find the best weights?
We tried random guessing, but that’s slow and unreliable. There’s a smarter way.

A Different Perspective: You’re Lost on a Hill

Imagine you’re blindfolded, dropped somewhere on a hilly landscape. Your goal: find the lowest point (the valley). You can’t see anything, but you can feel the slope under your feet. What do you do?
  1. Feel the ground: Which direction is "downhill"?
  2. Take a step downhill: Move in that direction.
  3. Repeat: Keep stepping downhill until the ground is flat (you've reached the bottom).
This is gradient descent. And it’s how almost all machine learning works.
[Video: GPS Navigation - Real-World Gradient Descent]

Connecting to Our Problem

Hill Analogy | Machine Learning
Your position on the hill | Current values of the weights
Height at your position | The error (loss)
The valley (lowest point) | Minimum error = best weights
Slope of the ground | The gradient
Taking a step downhill | Updating weights to reduce error

Let’s See It With One Weight

To make this concrete, let’s simplify. Imagine we only have one weight to figure out: the price per square foot.
# Simplified: price = base + sqft_weight * sqft
# Let's assume base = 50000 (fixed)

houses = [
    {"sqft": 1000, "price": 250000},
    {"sqft": 1500, "price": 350000},
    {"sqft": 2000, "price": 450000},
    {"sqft": 2500, "price": 550000},
]

def calculate_loss(sqft_weight):
    """Calculate total squared error for a given weight."""
    total_error = 0
    base = 50000
    
    for house in houses:
        predicted = base + sqft_weight * house["sqft"]
        actual = house["price"]
        error = (predicted - actual) ** 2
        total_error += error
    
    return total_error

# Try different weights and see the error
for weight in [50, 100, 150, 200, 250, 300]:
    loss = calculate_loss(weight)
    print(f"Weight: ${weight}/sqft -> Loss: {loss:,.0f}")
Output:
Weight: $50/sqft -> Loss: 303,750,000,000
Weight: $100/sqft -> Loss: 135,000,000,000
Weight: $150/sqft -> Loss: 33,750,000,000
Weight: $200/sqft -> Loss: 0                 <- Perfect!
Weight: $250/sqft -> Loss: 33,750,000,000
Weight: $300/sqft -> Loss: 135,000,000,000
If we plot this, we get a parabola (U-shape) - the bottom is where error is lowest!
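If you have matplotlib available, you can draw the curve yourself. A minimal sketch, assuming the calculate_loss function above is already defined:

import matplotlib.pyplot as plt

# Sweep a range of weights and record the loss at each one
weights = list(range(0, 401, 10))
losses = [calculate_loss(w) for w in weights]

plt.plot(weights, losses)
plt.xlabel("Price per sqft ($)")
plt.ylabel("Total squared error")
plt.title("The loss curve bottoms out at $200/sqft")
plt.show()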

The Key Insight: Slope Tells Us Direction

At any point on the curve:
  • If slope is negative (going down to the right) -> increase weight
  • If slope is positive (going up to the right) -> decrease weight
  • If slope is zero -> we’re at the bottom!
The slope of the loss function tells us which way to go.
Math Connection: The slope of a function is its derivative. If you want to understand this deeply, check out our Derivatives module. For now, just know: slope tells us direction.

Computing the Slope (Gradient)

For our squared error loss, we can compute the slope directly. By the chain rule, the derivative of (predicted - actual)^2 with respect to the weight is 2 * (predicted - actual) * sqft; summing over all houses gives the gradient:
def calculate_gradient(sqft_weight):
    """Calculate the slope of the loss function."""
    gradient = 0
    base = 50000
    
    for house in houses:
        predicted = base + sqft_weight * house["sqft"]
        actual = house["price"]
        error = predicted - actual
        
        # Derivative of (predicted - actual)^2 with respect to weight
        # = 2 * (predicted - actual) * sqft
        gradient += 2 * error * house["sqft"]
    
    return gradient

# Check the gradient at different weights
for weight in [100, 150, 200, 250]:
    grad = calculate_gradient(weight)
    print(f"Weight: ${weight}, Gradient: {grad:,.0f}")
Output:
Weight: $100, Gradient: -2,700,000,000  (negative -> increase the weight)
Weight: $150, Gradient: -1,350,000,000  (negative -> increase the weight)
Weight: $200, Gradient: 0               (zero -> we're at the minimum!)
Weight: $250, Gradient: 1,350,000,000   (positive -> decrease the weight)
Notice the scale: sqft values in the thousands produce gradients in the billions. That's why the learning rate in the next section has to be tiny.
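Before trusting an analytic gradient, it's worth checking it numerically: nudge the weight slightly and measure how the loss changes. A small sketch of that check (the numerical_gradient helper is our own, not part of the module's code):

def numerical_gradient(sqft_weight, h=0.001):
    """Estimate the slope with a centered finite difference."""
    return (calculate_loss(sqft_weight + h) - calculate_loss(sqft_weight - h)) / (2 * h)

# The estimate should match calculate_gradient almost exactly
for weight in [100, 150, 200, 250]:
    print(f"Weight: ${weight}, "
          f"numerical: {numerical_gradient(weight):,.0f}, "
          f"analytic: {calculate_gradient(weight):,.0f}")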

The Gradient Descent Algorithm

Now we can write the actual learning algorithm:
def gradient_descent(initial_weight, learning_rate, num_steps):
    """
    Find the best weight by following the slope downhill.
    
    Args:
        initial_weight: Where to start
        learning_rate: How big of a step to take
        num_steps: How many steps to take
    """
    weight = initial_weight
    
    for step in range(num_steps):
        # 1. Calculate current loss
        loss = calculate_loss(weight)
        
        # 2. Calculate the gradient (slope)
        gradient = calculate_gradient(weight)
        
        # Report progress (weight and loss describe the same moment)
        if step % 10 == 0:
            print(f"Step {step}: weight = ${weight:.2f}, loss = {loss:,.0f}")
        
        # 3. Update weight: move opposite to gradient
        weight = weight - learning_rate * gradient
    
    return weight

# Start at $100/sqft, take small steps
final_weight = gradient_descent(
    initial_weight=100,
    learning_rate=0.00000001,  # Very small! The sqft gradients are in the billions
    num_steps=100
)

print(f"\nFinal weight: ${final_weight:.2f}/sqft")
Output:
Step 0: weight = $100.00, loss = 135,000,000,000
Step 10: weight = $195.70, loss = 249,339,435
Step 20: weight = $199.82, loss = 460,520
Step 30: weight = $199.99, loss = 851
Step 40: weight = $200.00, loss = 2
Step 50: weight = $200.00, loss = 0
Step 60: weight = $200.00, loss = 0
Step 70: weight = $200.00, loss = 0
Step 80: weight = $200.00, loss = 0
Step 90: weight = $200.00, loss = 0

Final weight: $200.00/sqft
It found the optimal weight automatically!
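In practice you rarely know the right number of steps ahead of time. A common variant keeps stepping until the ground is nearly flat, i.e. the gradient is close to zero. A minimal sketch (the tolerance of 1.0 is an arbitrary choice for this problem):

def gradient_descent_until_flat(initial_weight, learning_rate, tolerance=1.0):
    """Step downhill until the slope is nearly zero."""
    weight = initial_weight
    while True:
        gradient = calculate_gradient(weight)
        if abs(gradient) <= tolerance:   # nearly flat: we've converged
            return weight
        weight = weight - learning_rate * gradient

print(f"${gradient_descent_until_flat(100, 0.00000001):.2f}/sqft")  # $200.00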

The Learning Rate: Too Big vs Too Small

The learning rate controls step size:
# Too small: takes forever
gradient_descent(100, 0.0000000001, 100)  # Barely moves past $100

# Just right: converges smoothly  
gradient_descent(100, 0.00000001, 100)    # Gets to $200

# Too big: overshoots and explodes!
gradient_descent(100, 0.0000001, 100)     # Each step lands farther from $200
Learning Rate is Crucial
  • Too small: Training takes forever
  • Too big: The model never converges, bounces around
  • Just right: Smooth convergence to the optimal answer
Finding the right learning rate is part science, part art. Common starting points are 0.01, 0.001, and 0.0001, but those assume features on a roughly unit scale; our raw sqft values forced a much smaller rate. The rescaling sketch below shows the fix.
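Why did our example need such a tiny rate? Because sqft values are in the thousands, the gradients land in the billions. A common fix is to rescale the feature first. Here's a sketch of our own (measuring size in thousands of square feet, reusing the houses list from above) where an ordinary rate like 0.01 works:

def scaled_gradient(weight_per_ksqft):
    """Gradient when sqft is measured in thousands."""
    base = 50000
    gradient = 0
    for house in houses:
        ksqft = house["sqft"] / 1000   # 1000 sqft -> 1.0
        error = (base + weight_per_ksqft * ksqft) - house["price"]
        gradient += 2 * error * ksqft
    return gradient

weight = 100000  # starting guess: $100,000 per 1000 sqft
for _ in range(100):
    weight = weight - 0.01 * scaled_gradient(weight)  # an ordinary rate now works

print(f"${weight:,.0f} per 1000 sqft")  # ~$200,000, i.e. the same $200/sqft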

Scaling Up: Multiple Weights

Our house model has 4 weights: base, bedroom, bathroom, and sqft. The gradient is now a vector of slopes, one for each weight:
import numpy as np

# Training data
X = np.array([
    [1, 2, 1, 1000],  # [bias, bedrooms, bathrooms, sqft]
    [1, 3, 2, 1500],
    [1, 4, 2, 1800],
    [1, 3, 3, 2000],
    [1, 5, 4, 3000],
])
y = np.array([250000, 380000, 450000, 520000, 750000])

def predict(X, weights):
    """Predict prices for all houses."""
    return X @ weights  # Matrix multiplication!

def calculate_loss(X, y, weights):
    """Mean squared error."""
    predictions = predict(X, weights)
    errors = predictions - y
    return np.mean(errors ** 2)

def calculate_gradient(X, y, weights):
    """Gradient of MSE with respect to each weight."""
    predictions = predict(X, weights)
    errors = predictions - y
    # Gradient is: (2/n) * X.T @ errors
    return (2 / len(y)) * X.T @ errors

def gradient_descent_multi(X, y, learning_rate=0.0000001, num_steps=1000):
    # Initialize all weights to zero
    weights = np.zeros(X.shape[1])
    
    for step in range(num_steps):
        loss = calculate_loss(X, y, weights)
        gradient = calculate_gradient(X, y, weights)
        weights = weights - learning_rate * gradient
        
        if step % 100 == 0:
            print(f"Step {step}: Loss = {loss:,.0f}")
    
    return weights

# Train!
final_weights = gradient_descent_multi(X, y)
print("\nLearned weights:")
print(f"  Base:     ${final_weights[0]:,.0f}")
print(f"  Bedroom:  ${final_weights[1]:,.0f}")
print(f"  Bathroom: ${final_weights[2]:,.0f}")
print(f"  Sqft:     ${final_weights[3]:.2f}")

The Math Behind It

What we just did has a formal name: Gradient Descent. The update rule is:

$w_{\text{new}} = w_{\text{old}} - \alpha \cdot \nabla L(w)$

Where:
  • $w$ = the weights (parameters)
  • $\alpha$ = the learning rate
  • $\nabla L(w)$ = the gradient of the loss function
Deep Dive Available: For the full mathematical treatment, see our Gradient Descent module in the Calculus course. Key concepts covered there:
  • Why the gradient points “uphill”
  • The chain rule for computing gradients
  • Advanced optimizers like Adam and SGD with momentum

Why This Matters

Gradient descent is the algorithm that powers:
  • Linear regression
  • Logistic regression
  • Neural networks
  • Deep learning
  • GPT, DALL-E, and all modern AI
Every time a model “learns,” it’s doing some variant of gradient descent.

Visualizing the Journey

Imagine the loss function as a bowl-shaped surface:
Loss (height)
     ^
     |    *  <- Start here (high error)
     |   /
     |  /
     | /
     |/
     *-------> Weight value
     ^
     Minimum (optimal weight)
Gradient descent follows the curve down to the bottom.

Key Takeaways

  • Loss = Height on a Hill: We want to find the lowest point
  • Gradient = Slope: Tells us which direction is downhill
  • Learning Rate = Step Size: How far to move each iteration
  • Iterate Until Converged: Keep stepping until you reach the bottom

🚀 Mini Projects

  • Project 1: Implement gradient descent from scratch
  • Project 2: Learning rate explorer with visualization
  • Project 3: Predict student exam scores

What’s Next?

Now that you understand how learning works, we can explore:
  1. Linear Regression - The complete algorithm for fitting lines to data
  2. Classification - What if we’re predicting categories, not numbers?
  3. Regularization - How to prevent overfitting

Continue to Module 3: Linear Regression, where you'll build a complete, production-ready regression model.

🔗 Math → ML Connection

What you learned in this module powers ALL of modern AI:
Concept Here | Formal Term | Where It's Used
Slope of the loss | Gradient | Every neural network, every ML model
Step downhill | Gradient descent step | Training GPT-4, DALL-E, AlphaFold
Step size | Learning rate | Critical hyperparameter in all training
Reaching the bottom | Convergence | When training is "done"
Too-big steps (overshooting) | Divergence | Learning rate too high
The Takeaway: You just learned the exact algorithm that trains:
  • ChatGPT (gradient descent on billions of parameters)
  • Stable Diffusion (gradient descent on image-text pairs)
  • AlphaFold (gradient descent on protein structures)
  • Self-driving cars (gradient descent on sensor data)
The only difference? They have more weights and fancier loss functions.

🎮 Interactive Visualization

Try this gradient descent visualizer to build intuition: Gradient Descent Visualizer. Experiment with:
  1. Different starting points → all converge to the same minimum!
  2. Learning rate too high → watch it overshoot and diverge
  3. Learning rate too low → watch it crawl slowly
  4. Different loss surfaces → some have multiple minima!
This visual intuition will help you debug real training issues.

🚀 Going Deeper (Optional)

Why Not Just Use the Formula?

For linear regression, there's a closed-form solution (the normal equations):

$w = (X^T X)^{-1} X^T y$

So why use gradient descent?
  1. Scalability: Inverting $X^T X$ costs O(d³) in the number of features; a gradient descent step costs only O(nd)
  2. Generality: Works for ANY differentiable loss (not just squared error)
  3. Deep learning: Neural nets have no closed-form solution
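For our tiny dataset you can check the closed form directly. A quick sketch using NumPy, reusing the X and y arrays from the multi-weight example (np.linalg.solve is preferred to forming the inverse explicitly):

import numpy as np

# Solve the normal equations (X^T X) w = X^T y for w
w_closed_form = np.linalg.solve(X.T @ X, X.T @ y)
print("Closed-form weights:", w_closed_form)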

Stochastic Gradient Descent (SGD)

Instead of computing the gradient on ALL the data, SGD uses a random subset at each step:

import random

# Full batch (what we learned): use every sample
gradient = compute_gradient_on_all_data(X, y, w)

# Stochastic: pick one random sample
i = random.randint(0, len(X) - 1)
gradient = compute_gradient_on_sample(X[i], y[i], w)

# Mini-batch: pick a small random subset (here, 32 random indices)
batch = random.sample(range(len(X)), 32)
gradient = compute_gradient_on_batch(batch, w)

Mini-batch is the industry standard: fast and stable.
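Here's a runnable sketch of the mini-batch idea, reusing X, y, and the NumPy calculate_gradient from the multi-weight example (the batch size of 2 and the step count are arbitrary choices for our 5-house dataset):

import numpy as np

rng = np.random.default_rng(seed=0)
weights = np.zeros(X.shape[1])

for step in range(1000):
    idx = rng.choice(len(y), size=2, replace=False)         # pick a random mini-batch
    gradient = calculate_gradient(X[idx], y[idx], weights)  # gradient on the batch only
    weights = weights - 0.0000001 * gradient

print("Mini-batch SGD weights:", weights)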

Modern Optimizers

Gradient descent has been improved:
Optimizer | Key Idea | When to Use
SGD + Momentum | Build up speed in consistent directions | Simple baseline
Adam | Adaptive learning rate per parameter | Default choice for most
AdamW | Adam with proper weight decay | Transformers, LLMs
LAMB | Adam scaled for huge batches | Very large models
Our Calculus course covers these in depth; a minimal momentum sketch follows.
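To make "build up speed" concrete, momentum adds just two lines to plain gradient descent. A sketch on our single-weight problem (beta = 0.9 is the conventional default; the gradient formula is inlined so it doesn't clash with the NumPy calculate_gradient defined above):

weight = 100.0          # price per sqft, as in the single-weight example
velocity = 0.0
beta = 0.9              # how much of the previous velocity to keep
learning_rate = 0.00000001

for step in range(200):
    # Gradient of the single-weight loss (same formula as the earlier calculate_gradient)
    gradient = sum(2 * ((50000 + weight * h["sqft"]) - h["price"]) * h["sqft"] for h in houses)
    velocity = beta * velocity + gradient   # speed builds when gradients keep agreeing
    weight = weight - learning_rate * velocity

print(f"${weight:.2f}/sqft")  # ~$200.00, reached with some overshoot along the way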

Convergence Theory

For convex functions (like linear regression’s MSE):
  • Gradient descent ALWAYS finds the global minimum (given a small enough learning rate)
  • Convergence rate depends on learning rate and function curvature
For non-convex functions (like neural networks):
  • Many local minima exist
  • We find “good enough” solutions, not guaranteed global minimum
  • This works surprisingly well in practice! (The toy example below shows two starting points ending in two different valleys.)
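To see the local-minimum issue concretely, here is a toy non-convex loss of our own (not from the course) with two valleys; where gradient descent ends up depends on where it starts:

def f(w):
    """A loss with two valleys: one deep, one shallow."""
    return w**4 - 4 * w**2 + w

def f_prime(w):
    """Its derivative (the gradient)."""
    return 4 * w**3 - 8 * w + 1

for start in (-2.0, 2.0):
    w = start
    for _ in range(1000):
        w = w - 0.01 * f_prime(w)
    # start -2 ends near w = -1.47 (the deep valley);
    # start +2 ends near w = 1.35 (the shallow one)
    print(f"start {start:+.0f} -> w = {w:.2f}, loss = {f(w):.2f}")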

Practice Challenge

Implement gradient descent from scratch for this dataset:
# Predict exam score from hours studied
study_data = [
    {"hours": 1, "score": 45},
    {"hours": 2, "score": 55},
    {"hours": 3, "score": 60},
    {"hours": 4, "score": 70},
    {"hours": 5, "score": 80},
    {"hours": 6, "score": 85},
    {"hours": 7, "score": 88},
    {"hours": 8, "score": 92},
]

# Model: score = base + hours * weight
# Find optimal base and weight using gradient descent
def gradient_descent_study():
    # Initialize
    base = 30
    weight = 5
    learning_rate = 0.01
    
    for step in range(1000):
        # Calculate gradients
        base_grad = 0
        weight_grad = 0
        
        for data in study_data:
            predicted = base + weight * data["hours"]
            error = predicted - data["score"]
            base_grad += 2 * error
            weight_grad += 2 * error * data["hours"]
        
        base_grad /= len(study_data)
        weight_grad /= len(study_data)
        
        # Update
        base = base - learning_rate * base_grad
        weight = weight - learning_rate * weight_grad
        
    print(f"Score = {base:.1f} + {weight:.1f} * hours")
    # Prints roughly: Score = 40.7 + 6.9 * hours (the least-squares fit is 40.9 + 6.9)

gradient_descent_study()