# Learning From Mistakes

> Gradient descent - how machines learn by following the slope downhill

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/gradient-descent-concept.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=6624c85dc8629ee6c2316dad9d8d96f7" alt="Gradient Descent - Finding Minimum" width="1080" height="1080" data-path="images/courses/ml-mastery/gradient-descent-concept.svg" />
</Frame>

## The Problem We Left Off With

In the last module, we had:

* A model: `price = base + bed_weight * bedrooms + bath_weight * bathrooms + sqft_weight * sqft`
* Training data: Houses with known prices
* A loss function: Sum of squared errors
* **The question**: How do we find the best weights?

We tried random guessing, but that's slow and unreliable.

**There's a smarter way.**

***

## A Different Perspective: You're Lost on a Hill

Imagine you're blindfolded, dropped somewhere on a hilly landscape. Your goal: find the lowest point (the valley).

You can't see anything, but you **can feel the slope under your feet**.

What do you do?

<Steps>
  <Step title="Feel the ground">
    Which direction is "downhill"?
  </Step>

  <Step title="Take a step downhill">
    Move in that direction
  </Step>

  <Step title="Repeat">
    Keep stepping downhill until the ground is flat (you've reached the bottom)
  </Step>
</Steps>

**This is gradient descent.** And it's how almost all machine learning works.

Here's the key intuition: you don't need to see the whole landscape. You only need to feel the slope *right where you're standing*. That local information is enough to find the global valley -- at least for the "bowl-shaped" loss functions we'll use in this course.

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/gradient-descent-real-world.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=23fa3b40b338ab1eb637f2e56c8b91df" alt="GPS Navigation - Real World Gradient Descent" width="1080" height="1080" data-path="images/courses/ml-mastery/gradient-descent-real-world.svg" />
</Frame>

***

## Connecting to Our Problem

| Hill Analogy              | Machine Learning                 |
| ------------------------- | -------------------------------- |
| Your position on the hill | Current values of weights        |
| Height at your position   | The error (loss)                 |
| The valley (lowest point) | Minimum error = best weights     |
| Slope of the ground       | The **gradient**                 |
| Taking a step downhill    | Updating weights to reduce error |

***

## Let's See It With One Weight

To make this concrete, let's simplify. Imagine we only have one weight to figure out: the price per square foot.

```python theme={null}
# Simplified: price = base + sqft_weight * sqft
# Let's assume base = 50000 (fixed)

houses = [
    {"sqft": 1000, "price": 250000},
    {"sqft": 1500, "price": 350000},
    {"sqft": 2000, "price": 450000},
    {"sqft": 2500, "price": 550000},
]

def calculate_loss(sqft_weight):
    """Calculate total squared error for a given weight."""
    total_error = 0
    base = 50000
    
    for house in houses:
        predicted = base + sqft_weight * house["sqft"]
        actual = house["price"]
        error = (predicted - actual) ** 2
        total_error += error
    
    return total_error

# Try different weights and see the error
for weight in [50, 100, 150, 200, 250, 300]:
    loss = calculate_loss(weight)
    print(f"Weight: ${weight}/sqft -> Loss: {loss:,.0f}")
```

**Output:**

```
Weight: $50/sqft -> Loss: 303,750,000,000
Weight: $100/sqft -> Loss: 135,000,000,000
Weight: $150/sqft -> Loss: 33,750,000,000
Weight: $200/sqft -> Loss: 0                 <- Perfect!
Weight: $250/sqft -> Loss: 33,750,000,000
Weight: $300/sqft -> Loss: 135,000,000,000
```

If we plot this, we get a **parabola** (U-shape) - the bottom is where error is lowest!
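
The U-shape can be checked by hand. Every house in this toy dataset fits `price = 50000 + 200 * sqft` exactly, so each error is `(w - 200) * sqft` and the total loss collapses to a single quadratic in the weight $w$. The symbols $s_i$ and $p_i$ (square footage and price of house $i$) are introduced here only for this derivation:

$$
L(w) = \sum_i (50000 + w s_i - p_i)^2 = (w - 200)^2 \sum_i s_i^2 = 13{,}500{,}000 \, (w - 200)^2
$$

The loss is zero at $w = 200$ and grows with the square of the distance from it. For example, $L(150) = 13{,}500{,}000 \cdot 50^2 = 33{,}750{,}000{,}000$.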

***

## The Key Insight: Slope Tells Us Direction

At any point on the curve:

* If slope is **negative** (going down to the right) -> increase weight
* If slope is **positive** (going up to the right) -> decrease weight
* If slope is **zero** -> we're at the bottom!

The **slope of the loss function** tells us which way to go.

<Note>
  **Math Connection**: The slope of a function is its **derivative**.

  If you want to understand this deeply, check out our [Derivatives module](/courses/math-for-ml-calculus/01-derivatives).

  For now, just know: **slope tells us direction**.
</Note>

***

## Computing the Slope (Gradient)

For our squared error loss, the derivative with respect to the weight is:

```python theme={null}
def calculate_gradient(sqft_weight):
    """
    Calculate the slope of the loss function at the current weight.
    
    The gradient answers: "If I increase the weight by a tiny amount,
    does the loss go up or down, and by how much?"
    
    Mathematically, for loss = (predicted - actual)^2:
      d(loss)/d(weight) = 2 * (predicted - actual) * sqft
    
    The chain rule gives us this: the outer derivative (2 * error)
    times the inner derivative (sqft, because predicted = base + weight * sqft).
    """
    gradient = 0
    base = 50000
    
    for house in houses:
        predicted = base + sqft_weight * house["sqft"]
        actual = house["price"]
        error = predicted - actual
        
        # Derivative of (predicted - actual)^2 with respect to weight
        # = 2 * (predicted - actual) * sqft
        # The "sqft" term is crucial -- it means features with larger values
        # produce larger gradients, which is why we'll need feature scaling later.
        gradient += 2 * error * house["sqft"]
    
    return gradient

# Check the gradient at different weights
for weight in [100, 150, 200, 250]:
    grad = calculate_gradient(weight)
    print(f"Weight: ${weight}, Gradient: {grad:,.0f}")
```

**Output:**

```
Weight: $100, Gradient: -2,700,000,000  (negative -> go UP/increase weight)
Weight: $150, Gradient: -1,350,000,000  (negative -> go UP/increase weight)
Weight: $200, Gradient: 0               (zero -> we're at the minimum!)
Weight: $250, Gradient: 1,350,000,000   (positive -> go DOWN/decrease weight)
```
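
One way to gain confidence in a hand-derived gradient is to compare it with a numerical slope estimate. The sketch below is self-contained (it repeats the data and formulas from above in compact form) and uses a central difference. The helper name `numerical_gradient` and the step size `h` are illustrative choices, not part of the course code.

```python
houses = [
    {"sqft": 1000, "price": 250000},
    {"sqft": 1500, "price": 350000},
    {"sqft": 2000, "price": 450000},
    {"sqft": 2500, "price": 550000},
]
BASE = 50000

def calculate_loss(sqft_weight):
    """Total squared error for a given weight."""
    return sum(
        (BASE + sqft_weight * house["sqft"] - house["price"]) ** 2
        for house in houses
    )

def calculate_gradient(sqft_weight):
    """Analytic slope: sum of 2 * error * sqft."""
    return sum(
        2 * (BASE + sqft_weight * house["sqft"] - house["price"]) * house["sqft"]
        for house in houses
    )

def numerical_gradient(sqft_weight, h=1e-3):
    """Central-difference estimate of the slope (hypothetical helper)."""
    return (calculate_loss(sqft_weight + h) - calculate_loss(sqft_weight - h)) / (2 * h)

for weight in [100, 150, 200, 250]:
    analytic = calculate_gradient(weight)
    numeric = numerical_gradient(weight)
    print(f"Weight ${weight}: analytic = {analytic:,.0f}, numeric = {numeric:,.0f}")
```

If the two columns disagree by more than a tiny rounding error, the derivative formula (or its code) has a bug.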

***

## The Gradient Descent Algorithm

Now we can write the actual learning algorithm:

```python theme={null}
def gradient_descent(initial_weight, learning_rate, num_steps):
    """
    Find the best weight by following the slope downhill.
    
    Args:
        initial_weight: Where to start
        learning_rate: How big of a step to take
        num_steps: How many steps to take
    """
    weight = initial_weight
    
    for step in range(num_steps):
        # 1. Calculate current loss
        loss = calculate_loss(weight)
        
        # 2. Calculate the gradient (slope)
        gradient = calculate_gradient(weight)
        
        # Report the current weight together with the loss it produces
        if step % 10 == 0:
            print(f"Step {step}: weight = ${weight:.2f}, loss = {loss:.2e}")
        
        # 3. Update weight: move opposite to gradient
        weight = weight - learning_rate * gradient
    
    return weight

# Start at $100/sqft, take small steps.
# The gradients here are in the billions, so the learning rate must be tiny.
final_weight = gradient_descent(
    initial_weight=100,
    learning_rate=0.00000001,  # 1e-8: very small!
    num_steps=50
)

print(f"\nFinal weight: ${final_weight:.2f}/sqft")
```

**Output:**

```
Step 0: weight = $100.00, loss = 1.35e+11
Step 10: weight = $195.70, loss = 2.49e+08
Step 20: weight = $199.82, loss = 4.61e+05
Step 30: weight = $199.99, loss = 8.51e+02
Step 40: weight = $200.00, loss = 1.57e+00

Final weight: $200.00/sqft
```

**It found the optimal weight automatically!**
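
The "repeat until the ground is flat" idea can also be coded directly: instead of a fixed step count, stop once the gradient is close to zero. This is a minimal sketch under assumptions. The function name `descend_until_flat`, the tolerance `tol`, and the learning rate `1e-8` are illustrative choices for this toy dataset, not values prescribed by the course.

```python
# (sqft, price) pairs for the four houses used above
houses = [(1000, 250000), (1500, 350000), (2000, 450000), (2500, 550000)]
BASE = 50000

def calculate_gradient(sqft_weight):
    """Slope of the total squared error with respect to the weight."""
    return sum(2 * (BASE + sqft_weight * sqft - price) * sqft for sqft, price in houses)

def descend_until_flat(initial_weight, learning_rate=1e-8, tol=1.0, max_steps=10_000):
    """Step downhill until |gradient| < tol or max_steps is reached."""
    weight = initial_weight
    for step in range(max_steps):
        gradient = calculate_gradient(weight)
        if abs(gradient) < tol:  # the ground is (almost) flat: stop
            return weight, step
        weight = weight - learning_rate * gradient
    return weight, max_steps

weight, steps = descend_until_flat(initial_weight=100)
print(f"Stopped after {steps} steps at ${weight:.2f}/sqft")
```

The `max_steps` cap is a safety net: if the learning rate is too large and the gradient never shrinks, the loop still ends.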

***

## The Learning Rate: Too Big vs Too Small

The learning rate controls step size:

```python theme={null}
# Too small: takes forever
gradient_descent(100, 0.0000000001, 100)  # 1e-10: only reaches about $124 after 100 steps

# Just right: converges smoothly
gradient_descent(100, 0.00000001, 100)    # 1e-8: gets to $200

# Too big: overshoots and explodes!
gradient_descent(100, 0.0000001, 100)     # 1e-7: every step lands farther from $200
```

<Warning>
  **Learning Rate is Crucial**

  Think of it like walking down a hill in the fog:

  * **Too small** (baby steps): You'll reach the bottom eventually, but it'll take all day. In ML, this means wasted compute time and patience.
  * **Too big** (giant leaps): You overshoot the valley, land on the other side, bounce back, and never settle down. In ML, your loss function oscillates wildly or even explodes to infinity.
  * **Just right**: Smooth convergence to the optimal answer.

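  **Practical tip**: With features scaled to similar ranges, 0.01 is a reasonable first guess. If the loss explodes, divide by 10. If it barely moves, multiply by 10. The house example above needs a far smaller rate only because raw square footage makes the gradients enormous. Modern optimizers like Adam (Module 12) adapt the step size automatically, which is why they've become the default in deep learning.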
</Warning>
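
For the one-weight house problem, the "too big" boundary can be worked out rather than guessed. The loss is a parabola, so each update multiplies the distance from the optimum by `1 - learning_rate * curvature`, where the curvature is `2 * sum(sqft**2)`. The sketch below computes that multiplier for a few rates. The variable names and the rates tried are illustrative assumptions.

```python
sqft_values = [1000, 1500, 2000, 2500]

# Second derivative of the total squared error with respect to the weight
curvature = 2 * sum(s ** 2 for s in sqft_values)  # 27,000,000

# Updates shrink the distance to the optimum only if |1 - lr * curvature| < 1,
# which means the learning rate must stay below 2 / curvature.
max_stable_lr = 2 / curvature
print(f"Curvature: {curvature:,}")
print(f"Largest stable learning rate: {max_stable_lr:.2e}")

for lr in [1e-10, 1e-8, 1e-7]:
    multiplier = 1 - lr * curvature
    verdict = "converges" if abs(multiplier) < 1 else "diverges"
    print(f"lr = {lr:.0e}: distance multiplied by {multiplier:+.4f} each step -> {verdict}")
```

A multiplier between 0 and 1 approaches the minimum smoothly, one between -1 and 0 overshoots but still settles, and anything below -1 bounces farther away on every step.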

***

## Scaling Up: Multiple Weights

Our house model has 4 weights: base, bedroom, bathroom, and sqft.

The gradient is now a **vector** of slopes, one for each weight:

```python theme={null}
import numpy as np

# Training data
X = np.array([
    [1, 2, 1, 1000],  # [bias, bedrooms, bathrooms, sqft]
    [1, 3, 2, 1500],
    [1, 4, 2, 1800],
    [1, 3, 3, 2000],
    [1, 5, 4, 3000],
])
y = np.array([250000, 380000, 450000, 520000, 750000])

def predict(X, weights):
    """Predict prices for all houses."""
    return X @ weights  # Matrix multiplication!

def calculate_loss(X, y, weights):
    """Mean squared error."""
    predictions = predict(X, weights)
    errors = predictions - y
    return np.mean(errors ** 2)

def calculate_gradient(X, y, weights):
    """Gradient of MSE with respect to each weight."""
    predictions = predict(X, weights)
    errors = predictions - y
    # Gradient is: (2/n) * X.T @ errors
    return (2 / len(y)) * X.T @ errors

def gradient_descent_multi(X, y, learning_rate=0.0000001, num_steps=1000):
    # Initialize weights to zero -- a common starting point.
    # Why zero and not random? For linear models, zero works fine.
    # For neural networks (Module 12), random initialization matters a lot.
    weights = np.zeros(X.shape[1])
    
    for step in range(num_steps):
        # Forward pass: how wrong are we right now?
        loss = calculate_loss(X, y, weights)
        
        # Backward pass: which direction should we adjust each weight?
        gradient = calculate_gradient(X, y, weights)
        
        # Update: take a step "downhill" in weight space.
        # The minus sign is key -- we go OPPOSITE to the gradient
        # because the gradient points uphill (toward increasing loss).
        weights = weights - learning_rate * gradient
        
        if step % 100 == 0:
            print(f"Step {step}: Loss = {loss:,.0f}")
    
    return weights

# Train!
final_weights = gradient_descent_multi(X, y)
print("\nLearned weights:")
print(f"  Base:     ${final_weights[0]:,.0f}")
print(f"  Bedroom:  ${final_weights[1]:,.0f}")
print(f"  Bathroom: ${final_weights[2]:,.0f}")
print(f"  Sqft:     ${final_weights[3]:.2f}")
```
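
With raw features, the square-footage column dominates the gradient, so a tiny learning rate mostly tunes the sqft weight while the other weights barely move. A common remedy, previewed here as a sketch, is to standardize each feature (subtract its mean, divide by its standard deviation) before training. The names `X_raw`, `X_scaled`, `mse`, and `train`, and the learning rate `0.1`, are illustrative assumptions. Note that the learned weights then apply to the scaled features, not to the raw units.

```python
import numpy as np

X_raw = np.array([
    [2, 1, 1000],   # [bedrooms, bathrooms, sqft]
    [3, 2, 1500],
    [4, 2, 1800],
    [3, 3, 2000],
    [5, 4, 3000],
], dtype=float)
y = np.array([250000, 380000, 450000, 520000, 750000], dtype=float)

# Standardize each column, then prepend the bias column of ones
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X_scaled = np.column_stack([np.ones(len(y)), X_std])

def mse(X, y, weights):
    """Mean squared error."""
    return np.mean((X @ weights - y) ** 2)

def train(X, y, learning_rate=0.1, num_steps=1000):
    """Plain batch gradient descent on the MSE."""
    weights = np.zeros(X.shape[1])
    for _ in range(num_steps):
        gradient = (2 / len(y)) * X.T @ (X @ weights - y)
        weights = weights - learning_rate * gradient
    return weights

weights = train(X_scaled, y)
initial_loss = mse(X_scaled, y, np.zeros(X_scaled.shape[1]))
final_loss = mse(X_scaled, y, weights)
print(f"Loss before training: {initial_loss:.3e}")
print(f"Loss after training:  {final_loss:.3e}")
```

After scaling, every column has a similar range, so a single learning rate suits all the weights at once.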

***

## The Math Behind It

What we just did has a formal name: **Gradient Descent**.

The update rule is:

$$
w_{new} = w_{old} - \alpha \cdot \nabla L(w)
$$

Where:

* $w$ = weights (parameters)
* $\alpha$ = learning rate
* $\nabla L(w)$ = gradient of the loss function
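
For the mean squared error used in the NumPy code above, the gradient can be written out explicitly. As a short derivation, with $n$ the number of houses, $X$ the feature matrix, and $y$ the vector of prices (symbols borrowed from the code):

$$
L(w) = \frac{1}{n} \lVert Xw - y \rVert^2
\quad\Longrightarrow\quad
\nabla L(w) = \frac{2}{n} X^T (Xw - y)
$$

One update step is therefore $w_{new} = w_{old} - \alpha \cdot \frac{2}{n} X^T (X w_{old} - y)$, which corresponds to the line `weights = weights - learning_rate * gradient` in `gradient_descent_multi`.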

<Note>
  **Deep Dive Available**

  For the full mathematical treatment, see our [Gradient Descent module](/courses/math-for-ml-calculus/04-gradient-descent) in the Calculus course.

  Key concepts covered there:

  * Why the gradient points "uphill"
  * The chain rule for computing gradients
  * Advanced optimizers like Adam and SGD with momentum
</Note>

***

## Why This Matters

Gradient descent is **the** algorithm that powers:

* Linear regression
* Logistic regression
* Neural networks
* Deep learning
* GPT, DALL-E, and all modern AI

Whenever one of these models "learns," it's running some variant of gradient descent.

***

## Visualizing the Journey

Imagine the loss function as a bowl-shaped surface:

```
Loss (height)
     ^
     |    *  <- Start here (high error)
     |   /
     |  /
     | /
     |/
     *-------> Weight value
     ^
     Minimum (optimal weight)
```

Gradient descent follows the curve down to the bottom.

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Loss = Height on a Hill" icon="mountain">
    We want to find the lowest point
  </Card>

  <Card title="Gradient = Slope" icon="chart-line">
    Tells us which direction is downhill
  </Card>

  <Card title="Learning Rate = Step Size" icon="shoe-prints">
    How far to move each iteration
  </Card>

  <Card title="Iterate Until Converged" icon="repeat">
    Keep stepping until you reach the bottom
  </Card>
</CardGroup>

***

## 🚀 Mini Projects

<CardGroup cols={2}>
  <Card title="Project 1" icon="stairs" color="#3B82F6">
    Implement gradient descent from scratch
  </Card>

  <Card title="Project 2" icon="gauge-high" color="#10B981">
    Learning rate explorer with visualization
  </Card>

  <Card title="Project 3" icon="graduation-cap" color="#8B5CF6">
    Predict student exam scores
  </Card>
</CardGroup>

<details>
  <summary>**Project 1: Gradient Descent from Scratch** - Implement the core ML algorithm</summary>

  **Objective**: Build gradient descent to find the minimum of a function.

  ```python theme={null}
  import numpy as np

  # Minimize f(x) = (x - 3)^2 + 5 (minimum at x=3)
  def f(x):
      return (x - 3)**2 + 5

  def df(x):
      return 2 * (x - 3)

  def gradient_descent(start_x, learning_rate, num_iterations):
      x = start_x
      history = [x]
      
      for i in range(num_iterations):
          gradient = df(x)
          x = x - learning_rate * gradient
          history.append(x)
          
          if i % 10 == 0:
              print(f"Iteration {i:3d}: x = {x:.4f}, f(x) = {f(x):.4f}")
      
      return x, history

  # Test with different learning rates
  print("=== Learning Rate: 0.1 ===")
  final_x, history = gradient_descent(start_x=10, learning_rate=0.1, num_iterations=50)
  print(f"Final x: {final_x:.4f} (optimal is 3.0)")

  print("\n=== Learning Rate: 0.01 (too slow) ===")
  final_x, _ = gradient_descent(start_x=10, learning_rate=0.01, num_iterations=50)
  print(f"Final x: {final_x:.4f}")

  print("\n=== Learning Rate: 0.9 (oscillates) ===")
  final_x, _ = gradient_descent(start_x=10, learning_rate=0.9, num_iterations=50)
  print(f"Final x: {final_x:.4f}")
  ```
</details>

<details>
  <summary>**Project 2: Learning Rate Explorer** - Visualize convergence</summary>

  **Objective**: See how learning rate affects convergence on fitting a line.

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt

  # Fit y = w * x to data
  data = [(1, 2.1), (2, 4.0), (3, 6.2), (4, 7.9), (5, 10.1)]

  def compute_loss(w):
      return sum((w * x - y) ** 2 for x, y in data) / len(data)

  def compute_gradient(w):
      return sum(2 * (w * x - y) * x for x, y in data) / len(data)

  # Compare learning rates
  learning_rates = [0.001, 0.01, 0.05, 0.1]
  plt.figure(figsize=(12, 4))

  for i, lr in enumerate(learning_rates, 1):
      w = 0.1
      losses = [compute_loss(w)]
      
      for _ in range(100):
          w = w - lr * compute_gradient(w)
          losses.append(compute_loss(w))
      
      plt.subplot(1, 4, i)
      plt.plot(losses)
      plt.title(f'LR = {lr}')
      plt.xlabel('Iteration')
      plt.ylabel('Loss')
      plt.yscale('log')

  plt.tight_layout()
  plt.savefig('lr_comparison.png')
  print("Saved lr_comparison.png")
  ```
</details>

<details>
  <summary>**Project 3: Student Score Predictor** - Complete regression</summary>

  **Objective**: Predict exam scores from study hours.

  ```python theme={null}
  import numpy as np

  students = [
      {"hours": 1, "score": 45}, {"hours": 2, "score": 55},
      {"hours": 4, "score": 65}, {"hours": 6, "score": 78},
      {"hours": 8, "score": 88}, {"hours": 10, "score": 95},
  ]

  def train(learning_rate=0.01, epochs=1000):
      intercept, slope = 30, 5
      
      for epoch in range(epochs):
          grad_i, grad_s = 0, 0
          for s in students:
              pred = intercept + slope * s["hours"]
              error = pred - s["score"]
              grad_i += 2 * error / len(students)
              grad_s += 2 * error * s["hours"] / len(students)
          
          intercept -= learning_rate * grad_i
          slope -= learning_rate * grad_s
          
          if epoch % 200 == 0:
              loss = sum((intercept + slope * s["hours"] - s["score"])**2 for s in students)
              print(f"Epoch {epoch}: score = {intercept:.1f} + {slope:.2f} × hours, loss = {loss:.0f}")
      
      return intercept, slope

  intercept, slope = train()
  print(f"\nFinal: score = {intercept:.1f} + {slope:.2f} × hours")
  print(f"Study 5 hours → {intercept + slope * 5:.0f} points")
  print(f"Study 12 hours → {intercept + slope * 12:.0f} points")
  ```
</details>

***

## What's Next?

Now that you understand how learning works, we can explore:

1. **Linear Regression** - The complete algorithm for fitting lines to data
2. **Classification** - What if we're predicting categories, not numbers?
3. **Regularization** - How to prevent overfitting

<Card title="Continue to Module 3: Linear Regression" icon="arrow-right" href="/courses/ml-mastery/03-linear-regression">
  Build a complete, production-ready regression model
</Card>

***

## 🔗 Math → ML Connection

<Note>
  **What you learned in this module powers ALL of modern AI:**

  | Concept Here                 | Formal Term               | Where It's Used                         |
  | ---------------------------- | ------------------------- | --------------------------------------- |
  | Slope of the loss            | **Gradient**              | Every neural network, every ML model    |
  | Step downhill                | **Gradient descent step** | Training GPT-4, DALL-E, AlphaFold       |
  | Step size                    | **Learning rate**         | Critical hyperparameter in all training |
  | Reaching the bottom          | **Convergence**           | When training is "done"                 |
  | Too-big steps (overshooting) | **Divergence**            | Learning rate too high                  |

  **The Takeaway**: You just learned the exact algorithm that trains:

  * GPT-3 (gradient descent on 175B parameters)
  * Stable Diffusion (gradient descent on image-text pairs)
  * AlphaFold (gradient descent on protein structures)
  * Self-driving cars (gradient descent on sensor data)

  The only difference? They have more weights and fancier loss functions.
</Note>

***

## 🎮 Interactive Visualization

<Tip>
  **Try this gradient descent visualizer to build intuition:**

  **[Gradient Descent Visualizer](https://uclaacm.github.io/gradient-descent-visualiser/)**

  Experiment with:

  1. Different starting points on a bowl-shaped surface → all converge to the same minimum!
  2. Learning rate too high → watch it overshoot and diverge
  3. Learning rate too low → watch it crawl slowly
  4. Different loss surfaces → some have multiple minima!

  This visual intuition will help you debug real training issues.
</Tip>

***

## 🚀 Going Deeper (Optional)

<Accordion title="Advanced Optimization Concepts" icon="graduation-cap">
  ### Why Not Just Use the Formula?

  For linear regression, there's a closed-form solution: $w = (X^TX)^{-1}X^Ty$

  So why use gradient descent?

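  1. **Scalability**: Solving the formula costs roughly O(d³) in the number of features d, plus O(nd²) to build $X^TX$ from n examples. One gradient descent step costs only O(nd).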
  2. **Generality**: Works for ANY differentiable loss (not just squared error)
  3. **Deep learning**: Neural nets have no closed-form solution
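
As a small illustration of the formula, the sketch below solves the normal equations for the one-feature house data from earlier in this module, this time learning the base as well. It uses `np.linalg.solve` rather than an explicit inverse, which is a common choice for numerical stability. The variable names are illustrative.

```python
import numpy as np

# [bias, sqft] for the four houses used earlier
X = np.array([[1, 1000], [1, 1500], [1, 2000], [1, 2500]], dtype=float)
y = np.array([250000, 350000, 450000, 550000], dtype=float)

# Solve (X^T X) w = X^T y instead of forming the inverse explicitly
w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"base = {w[0]:,.2f}, sqft_weight = {w[1]:.2f}")
```

For four houses and two weights this is instant. The cost only becomes a concern when the number of features grows large.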

  ### Stochastic Gradient Descent (SGD)

  Instead of computing gradient on ALL data:

  ```python theme={null}
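  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 3))  # toy data, for illustration only
  y = X @ np.array([1.0, -2.0, 0.5])
  w = np.zeros(3)

  def compute_gradient(X, y, w):
      return (2 / len(y)) * X.T @ (X @ w - y)  # MSE gradient on the rows given

  # Full batch (what we learned): every sample
  gradient = compute_gradient(X, y, w)

  # Stochastic: one random sample
  i = rng.integers(len(X))
  gradient = compute_gradient(X[i:i + 1], y[i:i + 1], w)

  # Mini-batch: a small random subset
  batch = rng.choice(len(X), size=32, replace=False)
  gradient = compute_gradient(X[batch], y[batch], w)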
  ```

  **Mini-batch is the industry standard** - fast + stable.

  ### Modern Optimizers

  Gradient descent has been improved:

  | Optimizer          | Key Idea                                | When to Use             |
  | ------------------ | --------------------------------------- | ----------------------- |
  | **SGD + Momentum** | Build up speed in consistent directions | Simple baseline         |
  | **Adam**           | Adaptive learning rate per parameter    | Default choice for most |
  | **AdamW**          | Adam with proper weight decay           | Transformers, LLMs      |
  | **LAMB**           | Adam scaled for huge batches            | Very large models       |

  **Our [Calculus course](/courses/math-for-ml-calculus/05-optimization)** covers these in depth.
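
To make the "build up speed" idea concrete, here is a hedged sketch of gradient descent with classical momentum on the same toy function as Project 1, $f(x) = (x - 3)^2 + 5$. The velocity update shown is one common formulation. The coefficient `beta = 0.9`, the function name `momentum_descent`, and the other settings are illustrative assumptions.

```python
def df(x):
    """Derivative of f(x) = (x - 3)^2 + 5."""
    return 2 * (x - 3)

def momentum_descent(start_x, learning_rate=0.1, beta=0.9, num_iterations=300):
    x = start_x
    velocity = 0.0
    for _ in range(num_iterations):
        # Keep a decaying memory of past gradients, then move along it
        velocity = beta * velocity - learning_rate * df(x)
        x = x + velocity
    return x

final_x = momentum_descent(start_x=10)
print(f"Final x: {final_x:.4f} (optimal is 3.0)")
```

Setting `beta = 0` recovers plain gradient descent, which makes it easy to compare the two on the same problem.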

  ### Convergence Theory

  For **convex** functions (like linear regression's MSE):

  * With a small enough learning rate, gradient descent converges to the global minimum
  * Convergence rate depends on learning rate and function curvature

  For **non-convex** functions (like neural networks):

  * Many local minima exist
  * We find "good enough" solutions, not guaranteed global minimum
  * This works surprisingly well in practice!
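
The effect of multiple minima is easy to see on a small example. The sketch below runs plain gradient descent on a hypothetical non-convex function, $f(x) = x^4 - 3x^2 + x$ (chosen only for illustration), from two starting points. Each run settles in a different valley.

```python
def df(x):
    """Derivative of f(x) = x^4 - 3x^2 + x."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(start_x, learning_rate=0.01, num_iterations=1000):
    x = start_x
    for _ in range(num_iterations):
        x = x - learning_rate * df(x)
    return x

left = gradient_descent(start_x=-2.0)   # ends near x = -1.30 (the deeper valley)
right = gradient_descent(start_x=2.0)   # ends near x = 1.13 (a shallower valley)
print(f"Start -2.0 -> x = {left:.2f}")
print(f"Start  2.0 -> x = {right:.2f}")
```

Neither run "knows" about the other valley: gradient descent only ever sees the local slope, so the starting point decides which minimum it finds.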
</Accordion>

***

## Practice Challenge

Implement gradient descent from scratch for this dataset:

```python theme={null}
# Predict exam score from hours studied
study_data = [
    {"hours": 1, "score": 45},
    {"hours": 2, "score": 55},
    {"hours": 3, "score": 60},
    {"hours": 4, "score": 70},
    {"hours": 5, "score": 80},
    {"hours": 6, "score": 85},
    {"hours": 7, "score": 88},
    {"hours": 8, "score": 92},
]

# Model: score = base + hours * weight
# Find optimal base and weight using gradient descent
```

<Accordion title="Solution">
  ```python theme={null}
  def gradient_descent_study():
      # Initialize
      base = 30
      weight = 5
      learning_rate = 0.01
      
      for step in range(1000):
          # Calculate gradients
          base_grad = 0
          weight_grad = 0
          
          for data in study_data:
              predicted = base + weight * data["hours"]
              error = predicted - data["score"]
              base_grad += 2 * error
              weight_grad += 2 * error * data["hours"]
          
          base_grad /= len(study_data)
          weight_grad /= len(study_data)
          
          # Update
          base = base - learning_rate * base_grad
          weight = weight - learning_rate * weight_grad
          
      print(f"Score = {base:.1f} + {weight:.1f} * hours")
      # Expected (approximately): Score = 40.7 + 6.9 * hours

  gradient_descent_study()
  ```
</Accordion>