Calculus for Machine Learning

The Question That Unlocks AI

You train a neural network. It starts out completely random, no better than guessing. You feed it 10,000 images of cats and dogs. An hour later, it’s 95% accurate.
What happened in that hour? The network adjusted millions of numbers (weights) until they were “right.” But how did it know which direction to adjust each number? How did it know how much?
The answer is calculus. Specifically: derivatives tell the network “if I change this weight by a tiny amount, how much will my error change?” Then it adjusts every weight to reduce the error, step by step, millions of times.
Real Talk: You probably remember calculus as “finding the derivative of x³” and plugging numbers into formulas. That’s not what this course is about. We’re going to show you what derivatives actually mean, why neural networks need them, and how to use them to make things learn.
Estimated Time: 14-18 hours
Difficulty: Beginner-friendly (we start from scratch)
Prerequisites: Basic Python, Linear Algebra course (or willingness to learn alongside)
What You’ll Build: A neural network that learns, built from scratch with only NumPy (no deep learning frameworks)
Before starting, make sure you can:
Python Basics
  • Work with NumPy arrays: np.array([1, 2, 3])
  • Write functions with multiple parameters
  • Create simple plots with matplotlib
  • Understand list comprehensions
Linear Algebra Concepts (from our course or elsewhere)
  • Vectors: what they are and how to add them
  • Dot product: np.dot(a, b) and what it means
  • Basic matrix operations (helpful but we’ll review)
Math Comfort
  • Comfortable with basic graphing (x-y plots)
  • Understand slope of a line (rise/run)
  • Know that functions take inputs and produce outputs
You DON’T need:
  • Previous calculus experience
  • To remember derivative rules from school
  • Physics or engineering background
Recommended Path: Linear Algebra for ML → This Course → Statistics for ML
Try these checks to gauge your readiness:
Slope Check (can you answer this?): A line goes through points (1, 3) and (4, 9). What is its slope?
Vector Check (do you know this?): What does np.dot([1, 2, 3], [4, 5, 6]) return?
Remediation Paths:
| Gap Identified | Recommended Action |
| --- | --- |
| Slope concept unclear | Khan Academy “Slope of a line” - 20 min |
| Vector/dot product unfamiliar | Vectors Module - 3 hours |
| NumPy basics | Python Crash Course - NumPy section |
| Graphing concepts | YouTube “Reading function graphs” - 30 min |
Career Impact: Engineers who understand gradients tend to debug training problems faster and can build more sophisticated architectures. This is part of the math that separates senior ML engineers from juniors.

The Core Insight: Learning = Finding the Bottom of a Hill

Imagine you’re blindfolded, dropped somewhere on a hilly landscape. Your goal: find the lowest point (the valley). You can’t see anything. But you can feel the slope under your feet.
  • If the ground slopes down to your left, step left
  • If it slopes down forward, step forward
  • Keep stepping downhill until the ground is flat
That’s gradient descent. And the “slope” is the derivative. In machine learning:
  • The “landscape” is your error function (how wrong your model is)
  • The “position” is your current weights
  • The “slope” (derivative) tells you which direction reduces error
  • You keep stepping until error is minimized
[Figure: Learning Hill Analogy]
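The blindfolded descent described above can be sketched in a few lines. This is only an illustration under stated assumptions: the error curve `error(w) = (w - 3)^2` is hypothetical (chosen so the valley sits at `w = 3`), and the names `error`, `slope`, and `step_size` are not from the course.

```python
def error(w):
    # Hypothetical error curve: a bowl whose lowest point is at w = 3
    return (w - 3.0) ** 2

def slope(w, h=1e-6):
    # "Feel the ground": nudge w slightly left and right and compare heights
    return (error(w + h) - error(w - h)) / (2 * h)

w = 0.0           # where we were dropped on the landscape
step_size = 0.1   # how far we move per step

for _ in range(100):
    # Step against the slope, i.e. downhill
    w -= step_size * slope(w)

print(f"final position: {w:.4f}, error there: {error(w):.8f}")
```

Each step shrinks the distance to the valley floor, so the loop ends very close to `w = 3` where the ground is flat.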
🔗 ML Connection: This “hill descent” is literally how every major AI system learns:
| AI System | What It’s Optimizing | The “Slope” |
| --- | --- | --- |
| ChatGPT | Predict next word probability | Cross-entropy gradient |
| DALL-E | Match image to text description | Diffusion loss gradient |
| AlphaFold | Protein structure accuracy | Distance & angle gradients |
| Tesla Autopilot | Object detection accuracy | Multi-task loss gradient |
| Spotify Recommendations | User engagement prediction | Ranking loss gradient |
Every module connects to these real systems!

Who Uses This (Companies & Roles)

OpenAI

GPT-3 was trained with gradient-based optimization over 175 billion parameters (OpenAI has not published GPT-4’s parameter count). Understanding calculus = understanding how ChatGPT learns.

Tesla Autopilot

Self-driving AI optimizes millions of weights to detect pedestrians, lanes, and obstacles in real-time.

DeepMind AlphaFold

Made a breakthrough on the 50-year-old protein structure prediction problem using neural networks trained with the same math you’ll learn here.
| Role | How They Use Calculus | Salary Impact |
| --- | --- | --- |
| ML Engineer | Debug training, implement custom layers, optimize performance | +$30-50K over non-ML roles |
| Research Scientist | Develop new architectures, publish papers, prove convergence | +$50-80K, often PhD required |
| ML Ops Engineer | Optimize training pipelines, reduce compute costs | +$20-40K |
| Data Scientist | Understand why models work, explain to stakeholders | +$15-30K |

What You’ll Actually Learn

Module 1: Derivatives — “Which way is downhill?”

The Real Question: If I change this weight by 0.001, how much does my error change?
What You’ll Understand:
  • Derivatives measure sensitivity (how much output changes when input changes)
  • Finding the minimum means finding where the derivative is zero
  • Every weight in a neural network has a derivative
What You’ll Build: A price optimizer that finds the profit-maximizing price automatically.
# By the end of this module, you'll understand:
# "The derivative of error with respect to weight is 0.05"
# Meaning: nudge the weight up slightly → error rises by about 0.05 per unit of weight change
# So we should DECREASE the weight to reduce error!
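To make that "0.05" concrete, here is a small sketch. It assumes a hypothetical error curve `error(weight) = 0.025 * weight**2`, picked only because its slope at `weight = 1.0` is exactly 0.05; the helper name `derivative` is illustrative. It estimates the derivative by nudging the weight, then checks that the derivative predicts how much the error moves for a 0.001 change.

```python
def error(weight):
    # Hypothetical error curve, chosen so the slope at weight = 1.0 is 0.05
    return 0.025 * weight ** 2

def derivative(f, x, h=1e-5):
    # Central difference: average slope over a tiny interval around x
    return (f(x + h) - f(x - h)) / (2 * h)

weight = 1.0
d = derivative(error, weight)

nudge = 0.001
predicted_change = d * nudge                           # what the derivative promises
actual_change = error(weight + nudge) - error(weight)  # what really happens

print(f"derivative:        {d:.4f}")
print(f"predicted change:  {predicted_change:.8f}")
print(f"actual change:     {actual_change:.8f}")
```

The prediction and the actual change agree closely because the nudge is small; for large changes the derivative is only an approximation.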

Module 2: Gradients — “Which way is MOST downhill?”

The Real Question: I have 1,000 weights. Which combination of changes reduces error the fastest?
What You’ll Understand:
  • A gradient is just a list of derivatives (one per weight)
  • It points in the direction of steepest increase
  • We go the OPPOSITE direction to decrease error
What You’ll Build: A multi-variable optimizer for a business with price AND ad spend.
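The three points above can be checked numerically. This is a sketch with a made-up two-weight error surface (minimum at `[1, -2]`); `numerical_gradient` is a hypothetical helper, not a course API. It builds the gradient one derivative at a time, then compares stepping along the gradient with stepping against it.

```python
import numpy as np

def error(weights):
    # Hypothetical error surface with its minimum at weights = [1, -2]
    w0, w1 = weights
    return (w0 - 1.0) ** 2 + 3.0 * (w1 + 2.0) ** 2

def numerical_gradient(f, weights, h=1e-6):
    # One partial derivative per weight, collected into a list (the gradient)
    grad = np.zeros_like(weights)
    for i in range(len(weights)):
        bump = np.zeros_like(weights)
        bump[i] = h
        grad[i] = (f(weights + bump) - f(weights - bump)) / (2 * h)
    return grad

weights = np.array([0.0, 0.0])
grad = numerical_gradient(error, weights)

step = 0.01
error_here = error(weights)
error_along = error(weights + step * grad)    # along the gradient: uphill
error_against = error(weights - step * grad)  # opposite direction: downhill

print("gradient:", grad)
print("error here / along / against:", error_here, error_along, error_against)
```

Stepping along the gradient raises the error and stepping against it lowers the error, which is why training subtracts the gradient.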

Module 3: Chain Rule — “How do changes propagate through layers?”

The Real Question: In a 50-layer neural network, how does changing a weight in layer 1 affect the final output?
What You’ll Understand:
  • Nested functions: the output of layer 1 becomes input to layer 2, etc.
  • Chain rule: multiply the derivatives along the chain
  • Backpropagation: computing all derivatives efficiently, from output back to input
What You’ll Build: Backpropagation from scratch - the algorithm that made deep learning possible.
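A minimal sketch of "multiply the derivatives along the chain", assuming three made-up scalar "layers" (`layer1`, `layer2`, `layer3` are illustrative, not real network layers). The product of the local derivatives is compared with a direct numerical estimate of how the input affects the final output.

```python
import math

# Three hypothetical "layers", each a simple function of one number
def layer1(x):
    return 3.0 * x + 1.0      # local derivative: 3

def layer2(u):
    return u ** 2             # local derivative: 2u

def layer3(v):
    return math.sin(v)        # local derivative: cos(v)

x = 0.5
u = layer1(x)   # output of layer 1 feeds layer 2
v = layer2(u)   # output of layer 2 feeds layer 3
y = layer3(v)

# Chain rule: multiply local derivatives from the output back to the input
dy_dx_chain = math.cos(v) * (2.0 * u) * 3.0

# Independent check: nudge x and watch the final output
h = 1e-6
dy_dx_numeric = (layer3(layer2(layer1(x + h))) - layer3(layer2(layer1(x - h)))) / (2 * h)

print(f"chain rule: {dy_dx_chain:.6f}")
print(f"numeric:    {dy_dx_numeric:.6f}")
```

Backpropagation applies this same multiplication, but reuses the intermediate values (`u`, `v`) saved during the forward pass.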

Module 4: Gradient Descent — “Taking steps downhill”

The Real Question: How big should each step be? When should we stop?
What You’ll Understand:
  • Learning rate: step too big = overshoot, step too small = takes forever
  • Convergence: knowing when you’ve reached the bottom
  • Local minima: getting stuck in small valleys instead of the deepest one
What You’ll Build: A complete training loop that learns from data.
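The learning-rate trade-off is easy to see on a toy problem. The sketch below assumes a hypothetical error curve `error(w) = w**2` (minimum at 0) and an illustrative helper `descend`; the three learning rates are arbitrary values picked to show each regime.

```python
def gradient(w):
    # Derivative of the hypothetical error curve error(w) = w**2
    return 2.0 * w

def descend(learning_rate, start=5.0, steps=50):
    # Plain gradient descent for a fixed number of steps
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

too_small = descend(0.001)   # barely moves from the start
reasonable = descend(0.1)    # lands very close to the minimum at 0
too_big = descend(1.1)       # every step overshoots further: divergence

print(f"lr=0.001 -> w = {too_small:.4f}")
print(f"lr=0.1   -> w = {reasonable:.6f}")
print(f"lr=1.1   -> w = {too_big:.1f}")
```

On this curve each step multiplies `w` by `1 - 2 * learning_rate`, so any rate above 1.0 makes the magnitude grow instead of shrink.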

Module 5: Optimization — “Getting there faster”

The Real Question: Gradient descent is slow. How do we speed it up?
What You’ll Understand:
  • Momentum: build up speed when going in a consistent direction
  • Adam: adapt the step size for each weight individually
  • Why Adam is the default choice for most deep learning
What You’ll Build: Compare optimizers head-to-head on the same problem.
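As a rough preview of a head-to-head comparison, here is a sketch of plain gradient descent versus momentum only (Adam is left out to keep it short). It assumes a hypothetical elongated bowl where one direction is 25 times steeper than the other; `CURVATURE`, `run`, and the hyperparameter values are illustrative choices, not course settings.

```python
import numpy as np

# Hypothetical loss: a long narrow valley (shallow in w[0], steep in w[1])
CURVATURE = np.array([1.0, 25.0])

def loss(w):
    return 0.5 * np.sum(CURVATURE * w ** 2)

def grad(w):
    return CURVATURE * w

def run(beta, lr=0.03, steps=200):
    # beta = 0.0 gives plain gradient descent; beta > 0 adds momentum
    w = np.array([1.0, 1.0])
    velocity = np.zeros_like(w)
    for _ in range(steps):
        velocity = beta * velocity - lr * grad(w)  # keep part of the old velocity
        w = w + velocity
    return loss(w)

loss_plain = run(beta=0.0)
loss_momentum = run(beta=0.9)

print(f"plain GD loss after 200 steps: {loss_plain:.3e}")
print(f"momentum loss after 200 steps: {loss_momentum:.3e}")
```

With the same learning rate, momentum makes much faster progress along the shallow direction, which is where plain gradient descent spends most of its time.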

Your Learning Journey

Week 1: Derivatives

Understand what derivatives really mean. Build a price optimizer.

Week 2: Gradients

Handle multiple variables at once. Optimize price AND marketing spend together.

Week 3: Chain Rule

Understand how changes propagate through layers. Implement backpropagation.

Week 4: Gradient Descent

Build a complete training loop. Watch your model learn.

Week 5: Final Project

Build a neural network from scratch using ONLY NumPy. No TensorFlow. No PyTorch.

Prerequisites

What You Need:
  • Basic Python (variables, functions, loops)
  • Linear Algebra course (or take it alongside - they complement each other)
  • Curiosity about how AI actually works
What You Don’t Need:
  • Previous calculus knowledge (we start from zero)
  • Memorized derivative formulas (we focus on understanding)
  • Mathematical proofs (we focus on intuition and code)

Setup

pip install numpy matplotlib jupyter plotly ipywidgets

jupyter notebook
That’s all you need. We build everything from scratch.
🎮 Interactive Visualizations: This course includes interactive gradient descent visualizers where you can:
  • Watch the optimization path unfold step-by-step
  • Adjust learning rate with sliders and see the effect immediately
  • Visualize loss landscapes in 3D
  • See backpropagation flow through network layers
Look for the 🎮 symbol throughout the course!

🎮 Interactive Visualization Tools

Calculus comes alive when you can see it. Use these tools alongside the course:

3Blue1Brown: Essence of Calculus

Beautiful visualizations of derivatives, integrals, and why they matter. Watch the first 3 videos before Module 1.

Desmos Graphing Calculator

Plot functions, visualize derivatives as tangent lines, see how slope changes. Use throughout the course.

Gradient Descent Visualizer

Watch gradient descent optimize in real-time on different loss surfaces. Perfect for Module 4.

TensorFlow Playground

See neural networks learn live. Adjust architecture, watch loss decrease. Great after Module 5.
🔗 When to Use These Tools:
  • Module 1 (Derivatives): Desmos - plot f(x), add tangent lines, see slopes
  • Module 2 (Gradients): 3D surface plots in our notebooks
  • Module 3 (Chain Rule): Our interactive backprop visualizer
  • Module 4 (Gradient Descent): Gradient Descent Visualizer website
  • Module 5 (Optimization) and the Final Project: TensorFlow Playground, after you build your own network!
Want more mathematical rigor? Each module includes optional “Going Deeper” sections:
| Module | Advanced Topic | Why It Matters |
| --- | --- | --- |
| Derivatives | Limits, continuity, formal definition | Understand convergence proofs in ML papers |
| Gradients | Jacobian matrices, Hessians | Understand second-order optimization methods |
| Chain Rule | Computational graphs, automatic differentiation | How PyTorch/JAX actually work |
| Optimization | Convexity, convergence rates, saddle points | Why certain architectures train better |
These sections are OPTIONAL. You can build neural networks and understand gradient descent without them. They’re for learners who:
  • Want to read ML research papers
  • Are curious about optimization theory
  • Plan to implement custom autograd systems
Recommended Resources for Deep Dives:
  • Calculus Made Easy by Silvanus Thompson (classic, intuitive)
  • Convex Optimization by Boyd & Vandenberghe (free online)
  • Fast.ai’s “Practical Deep Learning” course (connects calculus to real training)

What You’ll Build

Price Optimizer

Given a profit function, automatically find the price that maximizes profit using derivatives.

Multi-Variable Optimizer

Optimize both price and ad spend simultaneously using gradients.

Backpropagation Engine

Implement the chain rule to compute gradients through multiple layers.

Neural Network (From Scratch)

Build a complete neural network that learns XOR - using only NumPy.

Interview Preparation: What Companies Ask

Google/Meta/Amazon commonly ask:
  • “Explain how backpropagation works” (Chain Rule module)
  • “Why might training get stuck? How do you fix it?” (Optimization module)
  • “What happens if learning rate is too high/low?” (Gradient Descent module)
  • “Derive the gradient for a simple loss function” (Derivatives module)
Fast-growing startups focus on:
  • “Walk me through training a neural network from scratch”
  • “How would you debug a model that’s not learning?”
  • “Why do we use Adam over SGD?”
  • “Explain vanishing/exploding gradients”
Research-focused roles ask:
  • “Prove that gradient descent converges for convex functions”
  • “What are second-order optimization methods?”
  • “Explain the mathematical foundations of attention mechanisms”
  • “Derive backprop for a custom activation function”

Why This Course Exists

Most calculus courses teach you to solve problems like: “Find the derivative of $f(x) = 3x^4 - 2x^2 + 5$.”
And you learn: “Use the power rule: $f'(x) = 12x^3 - 4x$.”
But nobody tells you WHY. Why do neural networks need derivatives? How does PyTorch compute gradients automatically? Why does “learning rate = 0.01” work better than “learning rate = 1.0”?
This course answers those questions. By the end, you won’t just know formulas - you’ll understand the engine that makes AI learn.
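For readers who want to confirm the power-rule answer above, here is a short check using NumPy's polynomial helpers. The coefficient-list encoding (highest power first) is how `np.polyder` and `np.polyval` represent polynomials; the variable names are illustrative.

```python
import numpy as np

# f(x) = 3x^4 - 2x^2 + 5, written as coefficients from x^4 down to the constant
f_coeffs = [3, 0, -2, 0, 5]

# np.polyder applies the power rule term by term
df_coeffs = np.polyder(f_coeffs)
print("derivative coefficients:", df_coeffs)   # x^3, x^2, x^1, constant

# Evaluate f'(2) and compare with 12x^3 - 4x computed by hand
x = 2.0
slope_at_x = np.polyval(df_coeffs, x)
by_hand = 12 * x ** 3 - 4 * x
print("f'(2) from NumPy:", slope_at_x, "| by hand:", by_hand)
```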

By The End of This Course

You will:
  • Understand why every ML framework computes gradients
  • Build a neural network that actually learns (from scratch!)
  • Debug training problems because you understand what’s happening
  • Read ML papers and understand the math notation
  • Choose the right optimizer for your problem
When you see this equation: $\theta_{t+1} = \theta_t - \alpha \nabla_\theta J(\theta_t)$
You’ll think: “Oh, that’s just saying: update the weights by stepping opposite to the gradient, scaled by the learning rate.”
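One application of that update rule, as a sketch. It assumes a tiny made-up dataset where `y = 2x`, and takes `J` to be the mean squared error of a linear model; `X`, `y`, `J`, `grad_J`, and `alpha = 0.05` are all illustrative choices, not course data.

```python
import numpy as np

# Hypothetical data: a bias column plus one feature, with y = 2 * feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

def J(theta):
    # Mean squared error of the linear model X @ theta
    return np.mean((X @ theta - y) ** 2)

def grad_J(theta):
    # Gradient of the mean squared error with respect to theta
    return (2.0 / len(y)) * X.T @ (X @ theta - y)

theta = np.array([0.0, 0.0])   # theta_t
alpha = 0.05                   # learning rate

theta_next = theta - alpha * grad_J(theta)   # theta_{t+1}

print("theta_t:    ", theta, " J =", J(theta))
print("theta_{t+1}:", theta_next, " J =", J(theta_next))
```

Repeating the last line in a loop is the whole of gradient descent; everything else in training is about computing `grad_J` efficiently.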

Let’s Begin

The next module starts with a simple question: “You own a business. What price should you charge to maximize profit?” The answer will teach you what derivatives really mean.

Next: Derivatives

Learn what derivatives actually measure and why neural networks need them

Interview Deep-Dive

Strong Answer:
  • The way I frame this is simple: every ML model learns by asking one question over and over — “if I adjust this knob slightly, does my prediction get better or worse?” Calculus is the math that answers that question precisely and efficiently. Without it, you are guessing.
  • In production, this matters directly. An engineer who understands gradients can diagnose why a model stopped improving in 30 minutes. An engineer who treats the framework as a black box might spend three days trying random hyperparameters. At a company processing 10M predictions per day, a single day of a broken model can cost six figures in revenue.
  • The real business case: understanding the math lets you choose the right optimizer, set learning rates that converge rather than diverge, and design loss functions tailored to your business metric. These are not academic exercises — they translate to faster model iteration, lower compute costs, and better product outcomes.
Follow-up: You mentioned that engineers who understand gradients debug faster. Can you give a concrete example of a training failure where calculus knowledge was the difference?
A classic one is vanishing gradients in deep networks. An engineer without calculus knowledge sees the loss plateau and tries bigger learning rates, more data, different architectures: shotgun debugging. An engineer who understands the chain rule immediately checks gradient norms per layer, recognizes that multiplying many small derivatives through 50 layers drives gradients to zero, and knows the fix: switch from sigmoid to ReLU activations or add residual connections. I have seen this exact diagnosis cut debugging time from a week to an afternoon on a production NLP pipeline. The sigmoid derivative maxes out at 0.25, so after 10 layers the activation derivatives alone contribute a factor of at most 0.25^10, which is roughly 1e-6. That is the chain rule making or breaking your training.
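The arithmetic in that answer can be reproduced directly. This sketch only looks at the activation-derivative factor (it ignores the weight matrices, which also enter the product in a real network); `sigmoid_grad` and the depths 10 and 50 are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Scan a range of inputs to find the largest possible slope (it occurs at z = 0)
z = np.linspace(-10.0, 10.0, 2001)
max_slope = sigmoid_grad(z).max()

# Best case for the chain-rule product: every layer sits at its maximum slope
bound_10_layers = max_slope ** 10
bound_50_layers = max_slope ** 50

print(f"max sigmoid slope:        {max_slope:.4f}")
print(f"upper bound after 10:     {bound_10_layers:.3e}")
print(f"upper bound after 50:     {bound_50_layers:.3e}")
```

ReLU avoids this particular shrinkage because its derivative is exactly 1 for active units.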
Strong Answer:
  • That approach is called random search or evolutionary strategies, and it actually does work for small problems. The issue is scale. A modern language model has billions of parameters. If you tried random perturbations, you would need to evaluate the loss for each random change — the search space is astronomically large. Gradient descent uses calculus to compute the exact direction of improvement for ALL parameters simultaneously in a single backward pass.
  • To put numbers on it: for a model with 1 billion parameters, estimating the gradient by perturbing one parameter at a time would need on the order of a billion loss evaluations per step. Backpropagation computes the gradient for all 1 billion parameters in roughly 2-3x the cost of a single forward pass. That is the power of the chain rule applied systematically.
  • The deeper insight is that the gradient is not just “which direction is downhill” — it tells you the exact rate of change for every single parameter. Some weights need big adjustments, some need tiny ones. The gradient gives you that precision. Random search treats all parameters equally, which is wasteful.
  • That said, gradient-free methods like evolutionary strategies do have a niche. They work well when the loss landscape is extremely noisy or non-differentiable, or when you need to optimize discrete structures. OpenAI published a paper showing evolutionary strategies can train RL policies competitively. But for standard supervised learning with differentiable losses, calculus-based optimization wins by orders of magnitude.
Follow-up: At what scale does this efficiency difference become truly prohibitive? Is there a parameter count threshold where gradient-free methods become impractical?
In my experience, once you cross roughly 10,000 parameters, gradient-free methods start becoming painfully slow. At 1 million parameters, they are essentially unusable for standard training. The key insight is that the cost of gradient-free methods scales linearly (or worse) with parameter count per update step, while backpropagation’s cost is essentially constant relative to the forward pass regardless of parameter count. That is why nobody trains GPT-scale models with evolutionary strategies for the main optimization loop, though they might use them for hyperparameter search over a small discrete space.
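A small sketch of the scaling argument. It swaps in one-at-a-time finite differences as the simplest gradient-free estimator (not random search or evolution strategies themselves) and counts loss evaluations. The loss `0.5 * sum(w**2)` and the size `n_params = 1000` are assumptions chosen so the exact gradient is known to be `w`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params = 1000
w = rng.normal(size=n_params)

calls = {"count": 0}   # how many times the loss has been evaluated

def loss(weights):
    calls["count"] += 1
    return 0.5 * np.sum(weights ** 2)

# Gradient-free estimate: perturb one parameter at a time
h = 1e-6
base = loss(w)
fd_grad = np.empty(n_params)
for i in range(n_params):
    bumped = w.copy()
    bumped[i] += h
    fd_grad[i] = (loss(bumped) - base) / h
fd_evaluations = calls["count"]

# Calculus-based gradient: for this loss the derivative is simply w,
# available for all parameters at once with no extra loss evaluations
analytic_grad = w

print("loss evaluations for finite differences:", fd_evaluations)
print("max difference between the two gradients:", np.abs(fd_grad - analytic_grad).max())
```

The evaluation count grows with the number of parameters, which is the linear scaling described in the answer.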
Strong Answer:
  • The biggest way it breaks down is dimensionality. We picture a 2D or 3D landscape, but real neural networks operate in millions of dimensions. In high dimensions, local minima are actually quite rare — what dominates are saddle points. A saddle point is a place where the gradient is zero but you are at a minimum in some directions and a maximum in others. Research from Dauphin et al. (2014) showed that in high-dimensional spaces, almost all critical points are saddle points, not local minima.
  • The second breakdown: in real landscapes, you can see the terrain and judge distances. In optimization, you only have local gradient information. You cannot tell if you are near the global minimum or in a shallow plateau. Worse, the curvature can vary dramatically — some directions are steep, others are nearly flat. This is why adaptive optimizers like Adam exist: they handle different curvatures per parameter.
  • Third, real landscapes are static. Neural network loss landscapes change with every mini-batch of data. The “terrain” literally shifts under your feet every step. This stochastic noise is actually beneficial — it helps escape sharp minima and find flatter regions that generalize better. Keskar et al. (2017) showed that small-batch SGD finds wider minima that generalize better than large-batch training.
  • The practical implication: do not over-index on “finding the global minimum.” For neural networks, many local minima have similar loss values (the loss landscape has many equally good valleys), and flatter minima often generalize better than sharp global minima. The goal is not the absolute lowest point — it is a low point that performs well on unseen data.
Follow-up: If saddle points are the real problem in high dimensions rather than local minima, how do modern optimizers actually escape them?
Saddle points have zero gradient, so vanilla gradient descent stalls. But SGD with mini-batches adds noise that naturally perturbs you away from saddle points, because the stochastic gradient is almost never exactly zero even at a saddle. Momentum helps even more: it accumulates velocity, so even if the gradient is momentarily small, the optimizer keeps moving based on past gradients. Adam combines this with per-parameter adaptive rates, so parameters stuck in flat directions of a saddle get larger effective learning rates. There is also research on adding explicit noise (like Langevin dynamics) or using second-order information from the Hessian to detect saddle points: if the Hessian at a zero-gradient point has both positive and negative eigenvalues, you know you are at a saddle and can move along a negative curvature direction to escape.
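The textbook saddle `f(x, y) = x^2 - y^2` shows both the stall and the escape. This is a sketch on that hypothetical function, with its Hessian written out by hand; the perturbation size `1e-3` stands in for the noise a mini-batch would supply.

```python
import numpy as np

def f(p):
    x, y = p
    return x ** 2 - y ** 2          # minimum along x, maximum along y

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

# Hessian of f (constant for this function), written out by hand
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

origin = np.array([0.0, 0.0])

# Mixed-sign eigenvalues at a zero-gradient point mean it is a saddle
eigenvalues = np.linalg.eigvalsh(hessian)   # returned in ascending order
is_saddle = eigenvalues.min() < 0 < eigenvalues.max()

# Exactly at the saddle the gradient is zero, so a plain step goes nowhere
stalled = origin - 0.1 * grad(origin)

# A tiny nudge (standing in for mini-batch noise) lets descent slide off
p = np.array([0.0, 1e-3])
for _ in range(50):
    p = p - 0.1 * grad(p)

print("eigenvalues:", eigenvalues, "| saddle:", is_saddle)
print("after stalled step:", stalled)
print("after perturbed descent:", p, "f =", f(p))
```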
Strong Answer:
  • Start with the forward pass. Given input x, we compute layer-by-layer: z1 = W1x + b1, then a1 = activation(z1), then z2 = W2a1 + b2, and so on until we get the final prediction y_hat. Each layer is a composition of an affine transformation and a nonlinear activation.
  • Next, compute the loss. For a single example with true label y, we might use MSE: L = (y - y_hat)^2. This is a scalar that measures how wrong we are.
  • Now the calculus begins. We need dL/dW for every weight matrix W. The chain rule is the entire engine here. For the last layer, dL/dy_hat = -2(y - y_hat). Then dL/dz_last = dL/dy_hat * dy_hat/dz_last, where that second term is the derivative of the activation function. For sigmoid, that is sigma(z)(1 - sigma(z)). For ReLU, it is 1 if z > 0, else 0.
  • To get the gradient for the weights: dL/dW_last = dL/dz_last * a_prev^T. This is an outer product — the upstream gradient times the activations from the previous layer. For the bias: dL/db_last = dL/dz_last directly.
  • To propagate backward to earlier layers: dL/da_prev = W_last^T * dL/dz_last. Then apply the chain rule again through that layer’s activation. This recursive process is backpropagation — it is literally the chain rule applied from output to input, reusing intermediate computations.
  • Finally, the weight update: W_new = W_old - learning_rate * dL/dW. This is one step of gradient descent. The learning rate scales how far we step in the negative gradient direction.
  • The key numerical concern: if activation derivatives are consistently less than 1 (like sigmoid’s max of 0.25), multiplying many of them through deep networks drives gradients toward zero. That is the vanishing gradient problem, and it is a direct consequence of the chain rule’s multiplicative structure.
Follow-up: You mentioned the gradient for the weights is an outer product. Why does that specific structure matter for understanding what the network is learning?
The outer product dL/dz * a_prev^T means that a weight update is large only when BOTH the upstream error signal is large AND the input activation to that weight is large. A weight connecting an inactive neuron (a_prev near zero) gets almost no update regardless of the error, because the outer product zeroes it out. This is why dead ReLU neurons are problematic: once a ReLU neuron outputs zero for all inputs, its incoming weights stop receiving gradient signal entirely, and it never recovers. It also explains why input normalization matters: if some features are much larger than others, their corresponding weights get disproportionately large gradients, leading to uneven learning dynamics.
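The forward pass, backward pass, and update described in the answer above can be sketched in NumPy. Assumptions: a tiny hypothetical network (3 inputs, 4 sigmoid hidden units, 1 linear output), a single random example, and the MSE loss from the answer. The names and sizes are illustrative, and the finite-difference check at the end is only there to confirm the chain-rule result.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single training example and randomly initialized weights
x = rng.normal(size=(3, 1))
y = np.array([[1.0]])
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    y_hat = W2 @ a1 + b2                  # linear output, so dy_hat/dz2 = 1
    loss = ((y - y_hat) ** 2).item()      # L = (y - y_hat)^2
    return a1, y_hat, loss

a1, y_hat, loss = forward(W1, b1, W2, b2)

# Backward pass: chain rule from the output back to the input
dL_dyhat = -2.0 * (y - y_hat)
dL_dz2 = dL_dyhat                         # output activation is the identity
dL_dW2 = dL_dz2 @ a1.T                    # outer product with previous activations
dL_db2 = dL_dz2
dL_da1 = W2.T @ dL_dz2                    # propagate to the hidden layer
dL_dz1 = dL_da1 * a1 * (1.0 - a1)         # sigmoid derivative: a1 * (1 - a1)
dL_dW1 = dL_dz1 @ x.T
dL_db1 = dL_dz1

# Sanity check one entry of dL/dW1 against a finite-difference estimate
h = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += h
W1_minus[0, 0] -= h
numeric = (forward(W1_plus, b1, W2, b2)[2] - forward(W1_minus, b1, W2, b2)[2]) / (2 * h)

# One gradient descent step: W_new = W_old - learning_rate * dL/dW
lr = 0.01
new_loss = forward(W1 - lr * dL_dW1, b1 - lr * dL_db1,
                   W2 - lr * dL_dW2, b2 - lr * dL_db2)[2]

print(f"backprop dL/dW1[0,0]: {dL_dW1[0, 0]:.6f} | numeric: {numeric:.6f}")
print(f"loss before: {loss:.6f} | after one step: {new_loss:.6f}")
```

Checking backprop against finite differences like this is a common way to catch sign and transpose mistakes before training anything.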