Chain Rule & Backpropagation

Your Challenge: The Butterfly Effect

In complex systems, a small change here can cause a massive result there. Imagine you run a global manufacturing company.
  1. Raw Material Price goes up by $0.10.
  2. Production Cost increases by $1.00.
  3. Product Price increases by $5.00.
  4. Sales Volume drops by 1,000 units.
  5. Total Revenue crashes by $50,000.
Your Question: “How did a 10-cent change cause a $50,000 crash?” To understand this, you need to trace the impact through every link in the chain. This is exactly what the Chain Rule does. And it’s how neural networks “blame” a specific weight in Layer 1 for an error in Layer 50.

The Domino Effect

[Figure: Chain Rule Domino Effect]
Think of it as a row of dominoes.
  • You push the first domino (Input).
  • It hits the second (Hidden Layer).
  • Which hits the third (Output).
The Chain Rule says:
“The total impact is the product of all the individual impacts along the chain.”
If Domino A hits Domino B with force 2, and Domino B hits Domino C with force 3… Then Domino A hits Domino C with force 2 × 3 = 6.

The Intuition: Composition of Functions

The Problem

You have a function inside another function: $y = f(g(x))$
  • $x$ = Input (Raw Material Price)
  • $u = g(x)$ = Intermediate (Production Cost)
  • $y = f(u)$ = Output (Revenue)
Question: If I change $x$, how much does $y$ change?

The Solution

You multiply the rates of change!
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
  • $\frac{du}{dx}$: How much Cost changes when Material Price changes.
  • $\frac{dy}{du}$: How much Revenue changes when Cost changes.

Let’s Code It

# 1. Material Price → Production Cost
def cost(material_price):
    return material_price * 10 + 50  # 10 units per product

# 2. Production Cost → Revenue
def revenue(production_cost):
    return -5 * production_cost + 2000 # Higher cost = Lower revenue

# Composed: Material Price → Revenue
def total_revenue(material_price):
    c = cost(material_price)
    return revenue(c)

# The Derivatives (Impacts)
d_cost_d_material = 10    # Every $1 material increase adds $10 to cost
d_revenue_d_cost = -5     # Every $1 cost increase reduces revenue by $5

# The Chain Rule
d_revenue_d_material = d_revenue_d_cost * d_cost_d_material
# -5 * 10 = -50

print(f"Impact of material price on revenue: {d_revenue_d_material}")
print("Interpretation: A $1 increase in material price kills revenue by $50.")
Key Insight: You can break a complex system into small, simple links. Multiply them together to get the total effect.
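One way to sanity-check the multiplied links is to nudge the input and measure the output directly. The sketch below redefines the same `cost` and `revenue` functions so it runs on its own. The helper name `finite_difference`, the step size, and the test price of 3.0 are illustrative assumptions, not part of the original example.

```python
def cost(material_price):
    return material_price * 10 + 50

def revenue(production_cost):
    return -5 * production_cost + 2000

def total_revenue(material_price):
    return revenue(cost(material_price))

def finite_difference(f, x, step=1e-6):
    """Central-difference estimate of f'(x) (hypothetical helper)."""
    return (f(x + step) - f(x - step)) / (2 * step)

# Chain rule: d(revenue)/d(cost) * d(cost)/d(material)
chain_rule_result = -5 * 10

# Direct measurement on the composed function at an arbitrary price
measured = finite_difference(total_revenue, 3.0)

print(f"Chain rule: {chain_rule_result}")
print(f"Measured:   {measured:.4f}")

# The same slope scales to a 10-cent change in material price
print(f"Revenue change for +$0.10: {chain_rule_result * 0.10:.2f}")
```

Because both links are straight lines here, the measured slope agrees with the chain rule at every price, not just at the one tested.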

Mathematical Definition

For composed functions $f(g(x))$:
$$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$$
In words:
  1. Take derivative of outer function (evaluated at inner function)
  2. Multiply by derivative of inner function

Simple Example

$h(x) = (x^2 + 1)^3$
Think of it as: $h(x) = f(g(x))$ where:
  • Inner function: $g(x) = x^2 + 1$
  • Outer function: $f(u) = u^3$
Derivatives:
  • $g'(x) = 2x$
  • $f'(u) = 3u^2$
Chain rule: $h'(x) = f'(g(x)) \cdot g'(x) = 3(x^2 + 1)^2 \cdot 2x = 6x(x^2 + 1)^2$
def h(x):
    return (x**2 + 1)**3

def h_derivative(x):
    # Chain rule
    inner = x**2 + 1
    outer_derivative = 3 * inner**2
    inner_derivative = 2 * x
    return outer_derivative * inner_derivative

x = 2
print(f"h'({x}) = {h_derivative(x)}")  # 300, since 3 * (2**2 + 1)**2 * (2 * 2) = 3 * 25 * 4

# Verify numerically
h_val = 0.0001
numerical = (h(x + h_val) - h(x)) / h_val
print(f"Numerical: {numerical:.2f}")  # ≈ 300

Example 1: Multi-Stage Business Process

The Scenario

Supply Chain: Raw materials → Manufacturing → Sales → Revenue
Material Cost → Production Quantity → Sales Volume → Total Revenue
Functions:
$$\begin{aligned}
\text{Production}(c) &= 1000 - 10c && \text{(higher cost, less production)} \\
\text{Sales}(p) &= 0.8p && \text{(80\% of production sells)} \\
\text{Revenue}(s) &= 50s && \text{(\$50 per unit sold)}
\end{aligned}$$
Question: If material cost increases by $1, how much does revenue change?

Solution Using Chain Rule

def production(cost):
    return 1000 - 10 * cost

def sales(production_qty):
    return 0.8 * production_qty

def revenue(sales_qty):
    return 50 * sales_qty

# Composed function
def total_revenue(cost):
    prod = production(cost)
    sale = sales(prod)
    return revenue(sale)

# Derivatives
def d_production_d_cost(cost):
    return -10

def d_sales_d_production(prod):
    return 0.8

def d_revenue_d_sales(sales):
    return 50

# Chain rule: multiply all derivatives
cost = 20
prod = production(cost)
sale = sales(prod)

chain_derivative = (d_production_d_cost(cost) * 
                   d_sales_d_production(prod) * 
                   d_revenue_d_sales(sale))

print(f"At cost=${cost}:")
print(f"  Production: {prod} units")
print(f"  Sales: {sale} units")
print(f"  Revenue: ${total_revenue(cost)}")
print(f"\nIf cost increases by $1:")
print(f"  Revenue changes by: ${chain_derivative}")
# -10 × 0.8 × 50 = -$400
Interpretation:
  • $1 cost increase → 10 fewer units produced
  • 10 fewer units → 8 fewer sales (80% sell rate)
  • 8 fewer sales → $400 less revenue
Real Application: Large retailers and logistics teams use this kind of link-by-link sensitivity analysis to see how a cost change upstream propagates through the supply chain, from warehouses to delivery.

Example 2: Your Learning Chain

The Scenario

Let’s model your own learning process as a chain of functions:
  1. Study Time ($x$) → Understanding ($u$)
  2. Understanding ($u$) → Test Score ($t$)
  3. Test Score ($t$) → Final Grade ($G$)
Your Goal: Find out how much 1 extra hour of study improves your Final Grade.

The Functions

import numpy as np

def understanding(study_hours):
    """More study → better understanding (diminishing returns)"""
    return 10 * np.sqrt(study_hours)

def test_score(understanding_level):
    """Understanding → test score"""
    return 5 * understanding_level + 20

def final_grade(test_avg):
    """Test average → final grade (curved)"""
    return 0.9 * test_avg + 5

Applying the Chain Rule

You want to find $\frac{dG}{dx}$ (Change in Grade per Hour).
$$\frac{dG}{dx} = \frac{dG}{dt} \cdot \frac{dt}{du} \cdot \frac{du}{dx}$$
# 1. Derivative of Understanding w.r.t Hours
def d_understanding_d_hours(h):
    return 5 / np.sqrt(h)  # Derivative of 10√h

# 2. Derivative of Test Score w.r.t Understanding
def d_test_d_understanding(u):
    return 5

# 3. Derivative of Grade w.r.t Test Score
def d_grade_d_test(t):
    return 0.9

# Calculate for 9 hours of study
hours = 9
u = understanding(hours)
t = test_score(u)

# Multiply the links!
grade_improvement = (d_grade_d_test(t) * 
                    d_test_d_understanding(u) * 
                    d_understanding_d_hours(hours))

print(f"At 9 hours/week:")
print(f"1 extra hour adds +{grade_improvement:.2f} points to your Final Grade")
Output:
At 9 hours/week:
1 extra hour adds +7.50 points to your Final Grade
Insight: The chain rule lets you connect your input (effort) directly to your output (grade), even through multiple steps!
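The +7.50 figure is a rate at exactly 9 hours, so it is a linear estimate of what the next hour buys. Because understanding grows like a square root, the real gain from hour 9 to hour 10 should come out a little smaller. A minimal sketch comparing the two, with the three functions folded into one hypothetical helper `final_grade_from_hours` so the block runs on its own:

```python
import math

def final_grade_from_hours(hours):
    """Study hours -> understanding -> test score -> final grade."""
    understanding = 10 * math.sqrt(hours)
    test = 5 * understanding + 20
    return 0.9 * test + 5

def slope(hours):
    """dG/dx from the chain rule: 0.9 * 5 * (5 / sqrt(hours))."""
    return 0.9 * 5 * 5 / math.sqrt(hours)

hours = 9
predicted_gain = slope(hours) * 1  # derivative times a 1-hour step
actual_gain = final_grade_from_hours(hours + 1) - final_grade_from_hours(hours)

print(f"Chain-rule estimate: +{predicted_gain:.2f} points")
print(f"Actual gain (9 -> 10 hours): +{actual_gain:.2f} points")
```

The estimate is most accurate for small steps. For a full extra hour it slightly overstates the gain, which is the diminishing-returns curve showing up.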

Example 3: Your First Neural Network

The Computational Graph

This is how deep learning actually works. We represent the network as a graph of nodes.
[Figure: Computational Graph for Backpropagation]
The Flow:
  1. Forward Pass (Blue): You calculate the prediction and the error.
  2. Backward Pass (Red): You calculate who is to blame for the error.

Neural Network Backpropagation: Layer by Layer

This is the heart of deep learning. Let’s work through a complete 2-layer network step by step.

Network Architecture

$$\text{Input } x \xrightarrow{W_1, b_1} z_1 \xrightarrow{\text{ReLU}} a_1 \xrightarrow{W_2, b_2} z_2 \xrightarrow{\text{sigmoid}} \hat{y} \xrightarrow{\text{MSE}} L$$

Forward Pass Equations

| Step | Formula | Description |
|------|---------|-------------|
| 1 | $z_1 = W_1 x + b_1$ | Linear combination |
| 2 | $a_1 = \text{ReLU}(z_1)$ | Activation |
| 3 | $z_2 = W_2 a_1 + b_2$ | Second linear |
| 4 | $\hat{y} = \sigma(z_2)$ | Output (sigmoid) |
| 5 | $L = (\hat{y} - y)^2$ | Loss |

Backward Pass: Compute ALL Gradients

We want: $\frac{\partial L}{\partial W_1}$, $\frac{\partial L}{\partial b_1}$, $\frac{\partial L}{\partial W_2}$, $\frac{\partial L}{\partial b_2}$

Starting from the end (Layer 2):
$$\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)$$
$$\frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial \hat{y}} \cdot \sigma'(z_2) = 2(\hat{y} - y) \cdot \hat{y}(1-\hat{y})$$
$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial z_2} \cdot a_1^T$$
$$\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial z_2}$$

Propagating to Layer 1:
$$\frac{\partial L}{\partial a_1} = W_2^T \cdot \frac{\partial L}{\partial z_2}$$
$$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \cdot \text{ReLU}'(z_1)$$
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_1} \cdot x^T$$
$$\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1}$$
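These equations treat $x$ as a column vector, so it can help to run them literally once and compare against a finite-difference estimate. The sketch below does that for a single example. The layer sizes (2 inputs, 3 hidden units, 1 output), the random seed, and the helper name `loss_fn` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Column-vector convention: x is (2, 1), W1 is (3, 2), W2 is (1, 3)
x = rng.normal(size=(2, 1))
y = np.array([[1.0]])
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(3, 1))
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(W1, b1, W2, b2):
    """Forward pass only: returns the scalar loss (hypothetical helper)."""
    a1 = np.maximum(0, W1 @ x + b1)
    y_hat = sigmoid(W2 @ a1 + b2)
    return ((y_hat - y) ** 2).item()

# Forward pass, keeping the intermediates the backward pass needs
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)

# Backward pass: one line per equation above
dL_dy_hat = 2 * (y_hat - y)
dL_dz2 = dL_dy_hat * y_hat * (1 - y_hat)
dL_dW2 = dL_dz2 @ a1.T
dL_db2 = dL_dz2
dL_da1 = W2.T @ dL_dz2
dL_dz1 = dL_da1 * (z1 > 0)   # ReLU'(z1), applied elementwise
dL_dW1 = dL_dz1 @ x.T
dL_db1 = dL_dz1

# Finite-difference estimate of dL/dW1, one entry at a time
eps = 1e-6
numeric_dW1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        step = np.zeros_like(W1)
        step[i, j] = eps
        numeric_dW1[i, j] = (loss_fn(W1 + step, b1, W2, b2)
                             - loss_fn(W1 - step, b1, W2, b2)) / (2 * eps)

max_diff = np.max(np.abs(numeric_dW1 - dL_dW1))
print(f"dL/dW1 shape: {dL_dW1.shape}, max difference vs numerical: {max_diff:.2e}")
```

Each gradient comes out with the same shape as the parameter it belongs to, which is a quick way to catch a misplaced transpose.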

Complete Python Implementation

import numpy as np

class TwoLayerNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights (small random values)
        self.W1 = np.random.randn(input_size, hidden_size) * 0.1
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.1
        self.b2 = np.zeros((1, output_size))
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def relu(self, z):
        return np.maximum(0, z)
    
    def relu_derivative(self, z):
        return (z > 0).astype(float)
    
    def forward(self, X):
        """Forward pass - compute predictions"""
        # Layer 1
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.relu(self.z1)
        
        # Layer 2
        self.z2 = self.a1 @ self.W2 + self.b2
        self.y_hat = self.sigmoid(self.z2)
        
        return self.y_hat
    
    def backward(self, X, y):
        """Backward pass - compute gradients using chain rule"""
        m = X.shape[0]
        
        # Output layer gradients
        dL_dy_hat = 2 * (self.y_hat - y)                  # dL/dŷ
        dL_dz2 = dL_dy_hat * self.y_hat * (1 - self.y_hat) # Chain: dL/dz2
        
        self.dW2 = (self.a1.T @ dL_dz2) / m               # dL/dW2
        self.db2 = np.sum(dL_dz2, axis=0, keepdims=True) / m
        
        # Hidden layer gradients (propagate backward!)
        dL_da1 = dL_dz2 @ self.W2.T                       # dL/da1
        dL_dz1 = dL_da1 * self.relu_derivative(self.z1)   # Chain: dL/dz1
        
        self.dW1 = (X.T @ dL_dz1) / m                     # dL/dW1
        self.db1 = np.sum(dL_dz1, axis=0, keepdims=True) / m
    
    def update(self, learning_rate):
        """Gradient descent update"""
        self.W1 -= learning_rate * self.dW1
        self.b1 -= learning_rate * self.db1
        self.W2 -= learning_rate * self.dW2
        self.b2 -= learning_rate * self.db2

# Train on XOR problem (the classic neural network test!)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

net = TwoLayerNetwork(input_size=2, hidden_size=4, output_size=1)

print("Training Neural Network on XOR...")
print("="*50)

for epoch in range(5000):
    # Forward
    predictions = net.forward(X)
    
    # Backward (chain rule!)
    net.backward(X, y)
    
    # Update
    net.update(learning_rate=1.0)
    
    if epoch % 1000 == 0:
        loss = np.mean((predictions - y)**2)
        print(f"Epoch {epoch}: Loss = {loss:.6f}")

print("\nFinal Predictions:")
print(net.forward(X).round(2))
print("\nTrue Values:", y.flatten())
Key Insight: Each weight’s gradient is computed by multiplying all the derivatives along the path from that weight to the loss. This is the chain rule in action!

The Code (Backpropagation)

Let’s implement the graph above for a single neuron: $x \to z \to a \to L$
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_neuron(x, w, b, y_true):
    # --- 1. Forward Pass ---
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y_true)**2
    
    print(f"Prediction: {a:.4f}, Error: {loss:.4f}")
    
    # --- 2. Backward Pass (Chain Rule) ---
    # We want dL/dw (How much is w to blame?)
    
    # Link 1: How Loss changes with Activation
    dL_da = 2 * (a - y_true)
    
    # Link 2: How Activation changes with z
    da_dz = a * (1 - a)
    
    # Link 3: How z changes with weight w
    dz_dw = x
    
    # Total Chain: Multiply them all!
    dL_dw = dL_da * da_dz * dz_dw
    
    return dL_dw

# Test it
x = 2.0      # Input
w = 0.5      # Current Weight
b = 0.1      # Bias
y_true = 1.0 # Target (We want output to be 1)

gradient = train_neuron(x, w, b, y_true)
print(f"\nGradient (dL/dw): {gradient:.4f}")
print("Interpretation: The gradient is negative, so increasing w will reduce the error!")
This is Backpropagation. It’s just the Chain Rule applied to a graph!
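To see the gradient doing its job, you can take a few update steps and watch the loss fall. This is a minimal sketch using the same inputs as the test above. The helper `loss_and_grad`, the learning rate of 0.5, and the five steps are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss_and_grad(x, w, b, y_true):
    """Forward pass plus dL/dw via the three chain-rule links."""
    a = sigmoid(w * x + b)
    loss = (a - y_true) ** 2
    dL_dw = 2 * (a - y_true) * a * (1 - a) * x
    return loss, dL_dw

x, b, y_true = 2.0, 0.1, 1.0
w = 0.5
learning_rate = 0.5  # illustrative value

losses = []
for step in range(5):
    loss, grad = loss_and_grad(x, w, b, y_true)
    losses.append(loss)
    w -= learning_rate * grad  # gradient is negative here, so w increases
    print(f"Step {step}: loss = {loss:.4f}, dL/dw = {grad:.4f}, new w = {w:.4f}")
```

Each step moves `w` against the gradient, and the loss shrinks a little every time.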

Multi-Layer Networks

For a deep network with 100 layers, you just have a longer chain:
$$\frac{dL}{dw_1} = \frac{dL}{da_{100}} \cdot \frac{da_{100}}{dz_{100}} \cdots \frac{dz_1}{dw_1}$$
The computer simply multiplies these numbers backward from the end to the start.
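The "multiply backward" loop can be sketched for a toy chain of scalar layers, where each layer is $a_k = \sigma(w_k \cdot a_{k-1})$ with no bias. This simplified architecture and the function names `forward` and `grad_first_weight` are assumptions made for illustration, and the result is checked against a finite-difference estimate on a 4-layer chain.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(weights, x):
    """Return activations a_0..a_n for a chain of scalar sigmoid layers."""
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))
    return activations

def loss(weights, x, y_true):
    return (forward(weights, x)[-1] - y_true) ** 2

def grad_first_weight(weights, x, y_true):
    """dL/dw_1 for L = (a_n - y)^2, walking from the last layer to the first."""
    a = forward(weights, x)
    grad = 2 * (a[-1] - y_true)            # dL/da_n
    for k in range(len(weights), 0, -1):
        grad *= a[k] * (1 - a[k])          # da_k/dz_k
        if k > 1:
            grad *= weights[k - 1]         # dz_k/da_{k-1} = w_k
    return grad * a[0]                     # dz_1/dw_1 = a_0

weights = [0.8, -1.2, 0.5, 1.5]
x, y_true = 1.0, 1.0
analytic = grad_first_weight(weights, x, y_true)

eps = 1e-6
numeric = (loss([weights[0] + eps] + weights[1:], x, y_true)
           - loss([weights[0] - eps] + weights[1:], x, y_true)) / (2 * eps)
print(f"4 layers: backprop = {analytic:.8f}, numerical = {numeric:.8f}")

# The same loop handles a 100-layer chain without any changes
deep_grad = grad_first_weight([1.0] * 100, x, y_true)
print(f"100 layers: dL/dw_1 = {deep_grad:.3e}")
```

The backward loop touches every layer exactly once, which is why the cost of backpropagation grows linearly with depth.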

Practice Exercises

Exercise 1: Chain Rule Practice

# Given: h(x) = sin(x²)
# Find: h'(x) using chain rule

# TODO:
# 1. Identify inner and outer functions
# 2. Find their derivatives
# 3. Apply chain rule
# 4. Verify numerically at x=2

🎯 Practice Exercises & Real-World Applications

Challenge yourself! The chain rule is the engine of backpropagation - master it here!

Exercise 1: Supply Chain Impact Analysis 🏭

A semiconductor shortage affects the entire supply chain. Trace the impact:
# Supply chain model:
# 1. Chip shortage reduces chip supply: C(s) = 1000 - 50*s (s = shortage severity 0-10)
# 2. Fewer chips means fewer phones: P(c) = 0.1 * c (c = chips available)
# 3. Fewer phones means less revenue: R(p) = 800 * p - 0.01 * p² (p = phones made)

# TODO:
# 1. Compute dR/ds using the chain rule (how does revenue change with shortage severity?)
# 2. At s=5, how much revenue is lost per unit increase in shortage?
# 3. Which link in the chain has the biggest multiplier effect?
import numpy as np

def chips(s):
    """Chip supply based on shortage severity"""
    return 1000 - 50 * s

def phones(c):
    """Phones produced based on chips available"""
    return 0.1 * c

def revenue(p):
    """Revenue from phones sold"""
    return 800 * p - 0.01 * p**2

# Derivatives of each link
def dC_ds(s):
    """d(chips)/d(shortage) = -50"""
    return -50

def dP_dC(c):
    """d(phones)/d(chips) = 0.1"""
    return 0.1

def dR_dP(p):
    """d(revenue)/d(phones) = 800 - 0.02*p"""
    return 800 - 0.02 * p

def chain_derivative(s):
    """
    Chain rule: dR/ds = dR/dP × dP/dC × dC/ds
    """
    c = chips(s)
    p = phones(c)
    
    return dR_dP(p) * dP_dC(c) * dC_ds(s)

print("🏭 Supply Chain Impact Analysis")
print("=" * 55)

s = 5  # Moderate shortage
c = chips(s)
p = phones(c)
r = revenue(p)

print(f"\n📊 At shortage severity s = {s}:")
print(f"   Chips available: {c}")
print(f"   Phones produced: {p}")
print(f"   Revenue: ${r:,.2f}")

# Chain rule calculation
print("\n🔗 Chain Rule Breakdown:")
print(f"   dC/ds (chip sensitivity): {dC_ds(s)} chips per severity unit")
print(f"   dP/dC (production rate): {dP_dC(c)} phones per chip")
print(f"   dR/dP (revenue per phone): ${dR_dP(p):.2f}")

total_impact = chain_derivative(s)
print(f"\n📉 Total Impact (dR/ds):")
print(f"   = {dR_dP(p):.2f} × {dP_dC(c)} × {dC_ds(s)}")
print(f"   = ${total_impact:,.2f} revenue per unit shortage increase")

# Sensitivity analysis
print("\n📈 Sensitivity Analysis:")
print("   Shortage | Revenue    | Marginal Impact")
print("   ---------|------------|----------------")
for sev in range(0, 11, 2):
    c = chips(sev)
    p = phones(c)
    r = revenue(p)
    impact = chain_derivative(sev)
    print(f"   {sev:8} | ${r:9,.0f} | ${impact:,.0f}/unit")

print("\n💡 Key Insight:")
print("   Even a small chip shortage has a MULTIPLIED effect on revenue!")
print("   This is why supply chain disruptions are so devastating.")
Real-World Insight: The 2020-2021 global chip shortage followed this pattern: constrained semiconductor supply cascaded through the auto, phone, and appliance industries, causing billions in lost revenue.

Exercise 2: Backpropagation by Hand ✋

Compute gradients manually for a 2-layer neural network:
import numpy as np

# Network: Input(1) → Hidden(2) → Output(1)
# x → [w1, w2] → h1, h2 → ReLU → [v1, v2] → y
# Loss = (y - target)²

# Forward pass:
x = 2.0
w = np.array([0.5, -0.3])  # Input to hidden
v = np.array([0.8, 0.4])   # Hidden to output
target = 1.0

# h = w * x
# a = ReLU(h)
# y = v · a

# TODO:
# 1. Compute forward pass
# 2. Compute dL/dv (output layer gradients)
# 3. Compute dL/dw (input layer gradients using chain rule)
# 4. Update weights with lr=0.1
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

# Setup
x = 2.0
w = np.array([0.5, -0.3])
v = np.array([0.8, 0.4])
target = 1.0
lr = 0.1

print("✋ Manual Backpropagation")
print("=" * 55)

# Forward pass
print("\n➡️ Forward Pass:")
h = w * x                    # Pre-activation [1.0, -0.6]
print(f"   h = w × x = {w} × {x} = {h}")

a = relu(h)                  # Activation [1.0, 0.0]
print(f"   a = ReLU(h) = {a}")

y = np.dot(v, a)             # Output 0.8
print(f"   y = v · a = {v} · {a} = {y}")

loss = (y - target) ** 2     # Loss 0.04
print(f"   L = (y - target)² = ({y} - {target})² = {loss}")

# Backward pass
print("\n⬅️ Backward Pass (Chain Rule):")

# Step 1: dL/dy
dL_dy = 2 * (y - target)     # -0.4
print(f"   dL/dy = 2(y - target) = {dL_dy}")

# Step 2: dL/dv (output layer)
dL_dv = dL_dy * a            # [-0.4, 0]
print(f"   dL/dv = dL/dy × a = {dL_dy} × {a} = {dL_dv}")

# Step 3: dL/da
dL_da = dL_dy * v            # [-0.32, -0.16]
print(f"   dL/da = dL/dy × v = {dL_dy} × {v} = {dL_da}")

# Step 4: dL/dh (through ReLU)
dL_dh = dL_da * relu_derivative(h)  # [-0.32, 0] (ReLU kills the negative)
print(f"   dL/dh = dL/da × ReLU'(h) = {dL_da} × {relu_derivative(h)} = {dL_dh}")

# Step 5: dL/dw (input layer)
dL_dw = dL_dh * x            # [-0.64, 0]
print(f"   dL/dw = dL/dh × x = {dL_dh} × {x} = {dL_dw}")

# Weight update
print("\n📝 Weight Updates:")
w_new = w - lr * dL_dw
v_new = v - lr * dL_dv
print(f"   w: {w} → {w_new}")
print(f"   v: {v} → {v_new}")

# Verify improvement
h_new = w_new * x
a_new = relu(h_new)
y_new = np.dot(v_new, a_new)
loss_new = (y_new - target) ** 2

print(f"\n✅ Verification:")
print(f"   Old prediction: {y:.4f}, Loss: {loss:.4f}")
print(f"   New prediction: {y_new:.4f}, Loss: {loss_new:.4f}")
print(f"   Improvement: {(1 - loss_new/loss)*100:.1f}%")
Real-World Insight: This is the same computation that PyTorch’s loss.backward() and TensorFlow’s tf.GradientTape perform. They just do it for millions of weights automatically using computational graphs.

Exercise 3: Weather Cascade Model 🌧️

Predict how atmospheric conditions cascade to affect temperature:
# Cascade: Pressure → Humidity → Cloud Cover → Temperature
# 
# humidity(p) = 100 - 0.1*(p - 1013)²  (p = pressure in hPa)
# clouds(h) = 0.8 * h - 10             (h = humidity %)
# temp(c) = 30 - 0.2 * c               (c = cloud cover %)
#
# Current pressure: p = 1020 hPa

# TODO:
# 1. Compute dT/dp using chain rule
# 2. If pressure increases by 5 hPa, how much does temperature change?
# 3. Visualize the cascade effect
import numpy as np

def humidity(p):
    """Humidity decreases as pressure deviates from ideal"""
    return 100 - 0.1 * (p - 1013)**2

def clouds(h):
    """Cloud cover increases with humidity"""
    return max(0, 0.8 * h - 10)

def temperature(c):
    """Temperature decreases with cloud cover"""
    return 30 - 0.2 * c

# Derivatives
def dH_dP(p):
    """d(humidity)/d(pressure) = -0.2*(p - 1013)"""
    return -0.2 * (p - 1013)

def dC_dH(h):
    """d(clouds)/d(humidity) = 0.8 while cloud cover is positive, 0 once clamped at 0"""
    return 0.8 if 0.8 * h - 10 > 0 else 0.0

def dT_dC(c):
    """d(temp)/d(clouds) = -0.2"""
    return -0.2

def full_chain(p):
    """dT/dP = dT/dC × dC/dH × dH/dP"""
    h = humidity(p)
    c = clouds(h)
    return dT_dC(c) * dC_dH(h) * dH_dP(p)

print("🌧️ Weather Cascade Analysis")
print("=" * 55)

p = 1020  # Current pressure

# Forward calculation
h = humidity(p)
c = clouds(h)
t = temperature(c)

print(f"\n📊 Current Conditions (P = {p} hPa):")
print(f"   Humidity: {h:.1f}%")
print(f"   Cloud Cover: {c:.1f}%")
print(f"   Temperature: {t:.1f}°C")

# Chain rule calculation
print("\n🔗 Chain Rule Breakdown:")
print(f"   dH/dP = -0.2 × (P - 1013) = {dH_dP(p):.3f}")
print(f"   dC/dH = 0.8")
print(f"   dT/dC = -0.2")
print(f"\n   dT/dP = dT/dC × dC/dH × dH/dP")
print(f"        = -0.2 × 0.8 × {dH_dP(p):.3f}")
print(f"        = {full_chain(p):.4f}°C per hPa")

# Prediction for pressure change
delta_p = 5
delta_t = full_chain(p) * delta_p
t_new_predicted = t + delta_t

# Actual calculation
p_new = p + delta_p
h_new = humidity(p_new)
c_new = clouds(h_new)
t_new_actual = temperature(c_new)

print(f"\n🌡️ Effect of {delta_p} hPa Pressure Increase:")
print(f"   Predicted temp change: {delta_t:.2f}°C")
print(f"   New temp (predicted): {t_new_predicted:.2f}°C")
print(f"   New temp (actual): {t_new_actual:.2f}°C")
print(f"   Prediction error: {abs(t_new_actual - t_new_predicted):.4f}°C")

# Cascade visualization
print("\n📈 Cascade Effect Visualization:")
print("   Pressure → Humidity → Clouds → Temperature")
for p_val in range(1005, 1030, 5):
    h_val = humidity(p_val)
    c_val = clouds(h_val)
    t_val = temperature(c_val)
    bars = int(t_val)
    print(f"   P={p_val}: {'█' * bars}{'░' * (30-bars)} {t_val:.1f}°C")
Real-World Insight: Weather models use similar cascade calculations with hundreds of interacting variables. The chain rule helps meteorologists understand how small changes in one variable propagate through the entire system!

Exercise 4: Viral Growth Model 📱

Model how a social media post goes viral:
# Viral cascade model:
# 1. Quality → Initial shares: S(q) = 10 * q²
# 2. Initial shares → Network reach: R(s) = 100 * ln(s + 1)
# 3. Network reach → New followers: F(r) = 0.05 * r * (1 - r/10000)

# A post has quality score q = 8

# TODO:
# 1. Compute the full derivative dF/dq
# 2. What's the marginal value of improving quality by 1 point?
# 3. At what quality level does adding more quality have diminishing returns?
import numpy as np

def shares(q):
    """Initial shares based on content quality"""
    return 10 * q**2

def reach(s):
    """Network reach from initial shares (logarithmic growth)"""
    return 100 * np.log(s + 1)

def followers(r):
    """New followers (logistic-like, saturates at high reach)"""
    return 0.05 * r * (1 - r/10000)

# Derivatives
def dS_dQ(q):
    """d(shares)/d(quality) = 20*q"""
    return 20 * q

def dR_dS(s):
    """d(reach)/d(shares) = 100/(s+1)"""
    return 100 / (s + 1)

def dF_dR(r):
    """d(followers)/d(reach) = 0.05*(1 - 2r/10000)"""
    return 0.05 * (1 - 2*r/10000)

def full_chain(q):
    """dF/dQ using chain rule"""
    s = shares(q)
    r = reach(s)
    return dF_dR(r) * dR_dS(s) * dS_dQ(q)

print("📱 Viral Growth Analysis")
print("=" * 55)

q = 8  # Current quality

# Forward calculation
s = shares(q)
r = reach(s)
f = followers(r)

print(f"\n📊 Current Post (Quality = {q}):")
print(f"   Initial shares: {s:.0f}")
print(f"   Network reach: {r:.0f}")
print(f"   New followers: {f:.1f}")

# Chain rule breakdown
print("\n🔗 Chain Rule at q = 8:")
print(f"   dS/dQ = 20q = {dS_dQ(q)}")
print(f"   dR/dS = 100/(s+1) = {dR_dS(s):.4f}")
print(f"   dF/dR = 0.05(1 - 2r/10000) = {dF_dR(r):.4f}")
print(f"\n   dF/dQ = {dF_dR(r):.4f} × {dR_dS(s):.4f} × {dS_dQ(q)}")
print(f"        = {full_chain(q):.4f} followers per quality point")

# Marginal analysis
print("\n📈 Marginal Analysis:")
print("   Quality | Shares | Reach  | Followers | dF/dQ")
print("   --------|--------|--------|-----------|-------")
for quality in range(1, 15, 2):
    s_val = shares(quality)
    r_val = reach(s_val)
    f_val = followers(r_val)
    deriv = full_chain(quality)
    print(f"   {quality:7} | {s_val:6.0f} | {r_val:6.0f} | {f_val:9.2f} | {deriv:.4f}")

# Find diminishing returns point
print("\n🎯 Diminishing Returns Analysis:")
qualities = np.linspace(1, 15, 100)
derivatives = [full_chain(q) for q in qualities]
max_deriv_idx = np.argmax(derivatives)
optimal_q = qualities[max_deriv_idx]

print(f"   Largest marginal impact in the tested range at quality ≈ {optimal_q:.1f}")
print(f"   Beyond this point, each quality improvement adds fewer followers")
print(f"\n💡 Insight: It's not always worth perfecting content!")
print(f"   The first quality improvements buy the most followers per unit of effort")
Real-World Insight: Engagement models for social platforms can be analyzed the same way. The chain rule helps identify which factor to improve for maximum impact, which is one reason “good enough, post fast” often beats perfection!

Key Takeaways

Chain rule = multiply derivatives along the chain
Backpropagation = chain rule applied backward
Deep learning = chain rule through many layers
Gradients flow backward = from output to input
Every framework uses this = PyTorch, TensorFlow, JAX

Automatic Differentiation: How PyTorch Does It

You just learned what happens inside loss.backward()!
Modern frameworks use Automatic Differentiation (AutoDiff): they build a computational graph during the forward pass and automatically apply the chain rule during the backward pass.
import torch

# PyTorch builds the chain automatically!
x = torch.tensor([2.0], requires_grad=True)

# Forward pass (builds computational graph)
z = x ** 2       # Node 1
a = torch.sin(z) # Node 2  
loss = a.sum()   # Node 3

# Backward pass (applies chain rule automatically)
loss.backward()

# x.grad contains dL/dx (computed via chain rule!)
print(f"Gradient: {x.grad.item():.4f}")

# Manual verification:
# dL/dx = dL/da × da/dz × dz/dx
#       = 1 × cos(4) × 4
#       = 4 × cos(4) = -2.6146
print(f"Manual:   {4 * torch.cos(torch.tensor(4.0)).item():.4f}")

Two Types of AutoDiff

| Type | Description | Best for |
|------|-------------|----------|
| Forward Mode | Compute derivatives alongside forward pass | Few inputs, many outputs |
| Reverse Mode | Build graph forward, compute derivatives backward | Many inputs, few outputs (used by PyTorch and TensorFlow) |
Neural networks have millions of inputs (weights) and one output (loss), so reverse mode (backprop) is optimal!
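Forward mode is easiest to picture with "dual numbers": every value carries its own derivative along with it, and each operation updates both. The tiny `Dual` class and `dual_sin` helper below are hypothetical teaching constructs, not part of any framework, and they only support the two operations needed to differentiate $\sin(x^2)$.

```python
import math

class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def dual_sin(d):
    # Chain rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)

x = Dual(2.0, 1.0)   # seed: dx/dx = 1
z = x * x            # value 4, derivative 2x = 4
a = dual_sin(z)      # derivative cos(4) * 4

print(f"Forward-mode derivative: {a.deriv:.4f}")
print(f"Manual 4*cos(4):         {4 * math.cos(4.0):.4f}")
```

One forward sweep gives the derivative with respect to one input. With millions of weights you would need millions of sweeps, which is the practical reason reverse mode wins for neural networks.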

Common Debugging Issues & Solutions

Debugging gradient problems is one of the most common challenges in ML. Here’s what to watch for:
Symptom: Early layers don’t learn, gradients near zero
Cause: Chain rule multiplies small numbers. If each layer’s derivative < 1, the product → 0.
Example: Sigmoid derivatives max at 0.25. After 10 layers: $0.25^{10} \approx 0.00000095$
Solutions:
  • Use ReLU instead of sigmoid (derivative = 0 or 1)
  • Batch normalization
  • Skip connections (ResNet)
  • LSTM/GRU for RNNs
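The arithmetic behind this symptom fits in a few lines. The sketch below compares the best case for sigmoid (derivative 0.25 at $z = 0$) with a ReLU unit on its active side, over an assumed depth of 10 layers and ignoring the weights for simplicity.

```python
import math

def sigmoid_derivative(z):
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

depth = 10  # illustrative depth

# Best case for sigmoid: every layer sits at z = 0, where the derivative peaks
sigmoid_product = sigmoid_derivative(0.0) ** depth

# ReLU on its active side passes the gradient through unchanged
relu_product = relu_derivative(1.0) ** depth

print(f"Sigmoid, {depth} layers: {sigmoid_product:.2e}")
print(f"ReLU,    {depth} layers: {relu_product:.2e}")
```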
Symptom: Loss becomes NaN, weights blow up
Cause: Chain rule multiplies large numbers. If each layer’s derivative > 1, the product → ∞.
Solutions:
  • Gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  • Proper weight initialization (Xavier, He)
  • Lower learning rate
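Gradient clipping by norm is simple enough to write by hand. The sketch below shows the idea on a flat list of gradient values. The function `clip_by_global_norm` is a hypothetical helper that illustrates the principle, and it is not a copy of PyTorch's implementation.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

grads = [30.0, -40.0]  # an "exploded" gradient with norm 50
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)

print(f"Norm before: {norm_before:.1f}")
print(f"Clipped:     {clipped}")
```

Clipping keeps the direction of the update and only shortens its length, so the step still points downhill.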
Symptom: Some neurons never activate, gradient = 0 forever
Cause: ReLU outputs 0 for negative inputs, gradient = 0, never recovers.
Solutions:
  • Leaky ReLU: small slope for negative inputs
  • Lower learning rate
  • Better initialization
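A Leaky ReLU keeps a small gradient alive on the negative side, so a neuron that drifts negative can still recover. A minimal scalar sketch, with the slope of 0.01 as an assumed default:

```python
def leaky_relu(z, slope=0.01):
    """Identity for positive inputs, a small linear slope for negative ones."""
    return z if z > 0 else slope * z

def leaky_relu_derivative(z, slope=0.01):
    """Never exactly zero, so the gradient signal is not cut off."""
    return 1.0 if z > 0 else slope

for z in (-2.0, 3.0):
    print(f"z = {z:+.1f}: output = {leaky_relu(z):+.2f}, "
          f"gradient = {leaky_relu_derivative(z)}")
```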
Symptom: Loss oscillates wildly or becomes inf/nan
Solutions:
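import torch
import torch.nn.functional as F

# x is assumed to be an existing tensor

# Instead of: torch.log(x)
torch.log(x + 1e-8)  # Add small epsilon so log never sees exactly 0

# Instead of: torch.exp(x)
torch.exp(torch.clamp(x, max=50))  # Clip before exp to avoid overflow

# Use numerically stable implementations
F.log_softmax(x, dim=-1)  # Instead of torch.log(torch.softmax(x, dim=-1))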
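The reason a fused log-softmax is safer comes down to the log-sum-exp trick: subtract the largest value before exponentiating. The pure-Python sketch below illustrates the trick on a plain list. The `log_softmax` function here is written for illustration and is not the framework's own code.

```python
import math

def log_softmax(logits):
    """Log-softmax via log-sum-exp: shift by the max so exp never overflows."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

# math.exp(1000.0) on its own would raise OverflowError
logits = [1000.0, 1001.0, 1002.0]
stable = log_softmax(logits)

print([round(v, 4) for v in stable])
print(f"Probabilities sum to {sum(math.exp(v) for v in stable):.6f}")
```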

Gradient Checking: Verify Your Implementation

def gradient_check(f, x, analytical_grad, eps=1e-5):
    """Verify analytical gradient with numerical approximation"""
    numerical_grad = (f(x + eps) - f(x - eps)) / (2 * eps)
    
    relative_error = abs(numerical_grad - analytical_grad) / max(abs(numerical_grad), abs(analytical_grad), 1e-8)
    
    if relative_error < 1e-5:
        print(f"✅ Gradient check PASSED (error: {relative_error:.2e})")
    else:
        print(f"❌ Gradient check FAILED (error: {relative_error:.2e})")
        print(f"   Numerical: {numerical_grad:.6f}")
        print(f"   Analytical: {analytical_grad:.6f}")
    
    return relative_error

# Example usage
f = lambda x: x**3 + 2*x**2 + x
analytical = lambda x: 3*x**2 + 4*x + 1  # f'(x)

x = 2.0
gradient_check(f, x, analytical(x))

What’s Next?

You now understand how gradients flow through compositions. But how do we USE these gradients to actually train models? That’s Gradient Descent - the optimization algorithm that powers all of machine learning!
