Calculus for Machine Learning

The Question That Unlocks AI

You train a neural network. It starts completely random - worse than guessing. You feed it 10,000 images of cats and dogs. An hour later, it’s 95% accurate. What happened in that hour? The network adjusted millions of numbers (weights) until they were “right.” But how did it know which direction to adjust each number? How did it know how much? The answer is calculus. Specifically: derivatives tell the network “if I change this weight by a tiny amount, how much will my error change?” Then it adjusts every weight to reduce the error, step by step, millions of times.
Real Talk: You probably remember calculus as “finding the derivative of x³” and plugging numbers into formulas. That’s not what this course is about. We’re going to show you what derivatives actually mean, why neural networks need them, and how to use them to make things learn.
Estimated Time: 14-18 hours
Difficulty: Beginner-friendly (we start from scratch)
Prerequisites: Basic Python, Linear Algebra course (or willingness to learn alongside)
What You’ll Build: A neural network that learns - from scratch, no libraries
Before starting, make sure you can:
Python Basics
  • Work with NumPy arrays: np.array([1, 2, 3])
  • Write functions with multiple parameters
  • Create simple plots with matplotlib
  • Understand list comprehensions
Linear Algebra Concepts (from our course or elsewhere)
  • Vectors: what they are and how to add them
  • Dot product: np.dot(a, b) and what it means
  • Basic matrix operations (helpful but we’ll review)
Math Comfort
  • Comfortable with basic graphing (x-y plots)
  • Understand slope of a line (rise/run)
  • Know that functions take inputs and produce outputs
You DON’T need:
  • Previous calculus experience
  • To remember derivative rules from school
  • Physics or engineering background
Recommended Path: Linear Algebra for ML → This Course → Statistics for ML
Try these checks to gauge your readiness:
  • Slope Check: A line goes through the points (1, 3) and (4, 9). What is its slope?
  • Vector Check: What does np.dot([1, 2, 3], [4, 5, 6]) return?
Remediation Paths:
Gap Identified                 Recommended Action
Slope concept unclear          Khan Academy “Slope of a line” (20 min)
Vector/dot product unfamiliar  Vectors Module (3 hours)
NumPy basics                   Python Crash Course, NumPy section
Graphing concepts              YouTube “Reading function graphs” (30 min)
Career Impact: Calculus knowledge directly translates to higher salaries. ML engineers who understand gradients debug models faster and build more sophisticated architectures. This is the math that separates senior engineers from juniors.

The Core Insight: Learning = Finding the Bottom of a Hill

Imagine you’re blindfolded, dropped somewhere on a hilly landscape. Your goal: find the lowest point (the valley). You can’t see anything. But you can feel the slope under your feet.
  • If the ground slopes down to your left, step left
  • If it slopes down forward, step forward
  • Keep stepping downhill until the ground is flat
That’s gradient descent. And the “slope” is the derivative. In machine learning:
  • The “landscape” is your error function (how wrong your model is)
  • The “position” is your current weights
  • The “slope” (derivative) tells you which direction reduces error
  • You keep stepping until error is minimized
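Before Module 1 makes this rigorous, here is the whole idea as a minimal Python sketch. The one-variable “landscape” and every constant in it are invented for illustration:
# A made-up one-variable "landscape": the valley (lowest point) is at w = 3.
def error(w):
    return (w - 3) ** 2

# "Feel the slope under your feet": estimate it numerically with a tiny nudge.
def slope(w, h=1e-5):
    return (error(w + h) - error(w - h)) / (2 * h)

w = 0.0           # dropped blindfolded at w = 0
step_size = 0.1   # how far we move per step

for _ in range(50):
    w = w - step_size * slope(w)  # step opposite the slope = downhill

print(round(w, 4))  # ~3.0: we found the bottom without ever "seeing" the hill
Every module in this course refines a piece of that loop.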
[Figure: Learning Hill Analogy]
🔗 ML Connection: This “hill descent” is literally how every major AI system learns:
AI System                What It’s Optimizing              The “Slope”
ChatGPT                  Next-word prediction probability  Cross-entropy gradient
DALL-E                   Match image to text description   Diffusion loss gradient
AlphaFold                Protein structure accuracy        Distance & angle gradients
Tesla Autopilot          Object detection accuracy         Multi-task loss gradient
Spotify Recommendations  User engagement prediction        Ranking loss gradient
Every module connects to these real systems!

Who Uses This (Companies & Roles)

OpenAI

GPT-3 was trained with gradient descent over 175 billion parameters, and its successors learn the same way. Understanding calculus = understanding how ChatGPT learns.

Tesla Autopilot

Self-driving AI optimizes millions of weights to detect pedestrians, lanes, and obstacles in real-time.

DeepMind AlphaFold

Cracked the 50-year-old protein folding problem using neural networks trained with the exact math you’ll learn here.
Role                How They Use Calculus                                           Salary Impact
ML Engineer         Debug training, implement custom layers, optimize performance  +$30-50K over non-ML roles
Research Scientist  Develop new architectures, publish papers, prove convergence   +$50-80K, often PhD required
ML Ops Engineer     Optimize training pipelines, reduce compute costs              +$20-40K
Data Scientist      Understand why models work, explain to stakeholders            +$15-30K

What You’ll Actually Learn

Module 1: Derivatives — “Which way is downhill?”

The Real Question: If I change this weight by 0.001, how much does my error change?
What You’ll Understand:
  • Derivatives measure sensitivity (how much output changes when input changes)
  • Finding the minimum means finding where the derivative is zero
  • Every weight in a neural network has a derivative
What You’ll Build: A price optimizer that finds the profit-maximizing price automatically.
# By the end of this module, you'll understand:
# "The derivative of error with respect to weight is 0.05"
# Meaning: nudge the weight up → error rises at a rate of 0.05 per unit
# So we should DECREASE the weight to reduce error!
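As a rough preview (the error function below is invented, not the course’s actual example), here is that sensitivity measured numerically:
# Hypothetical one-weight "model": error as a function of a single weight.
def error(weight):
    return (2.0 * weight - 1.4) ** 2

weight = 0.72
h = 1e-5  # the "tiny amount"

# Sensitivity: how much does the error change per unit change in the weight?
derivative = (error(weight + h) - error(weight - h)) / (2 * h)

print(derivative)  # ~0.16: positive, so DECREASING the weight reduces error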

Module 2: Gradients — “Which way is MOST downhill?”

The Real Question: I have 1,000 weights. Which combination of changes reduces error the fastest?
What You’ll Understand:
  • A gradient is just a list of derivatives (one per weight)
  • It points in the direction of steepest increase
  • We go the OPPOSITE direction to decrease error
What You’ll Build: A multi-variable optimizer for a business with price AND ad spend.
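A hedged preview of what that optimizer rests on, using a made-up profit function of price and ad spend; the gradient is just the list of both partial derivatives:
import numpy as np

# Made-up profit function: it peaks at price = 5, ad_spend = 20.
def profit(price, ad_spend):
    return -(price - 5) ** 2 - 0.5 * (ad_spend - 20) ** 2

def gradient(price, ad_spend, h=1e-5):
    # One partial derivative per input -- the gradient is just this list.
    d_price = (profit(price + h, ad_spend) - profit(price - h, ad_spend)) / (2 * h)
    d_ad = (profit(price, ad_spend + h) - profit(price, ad_spend - h)) / (2 * h)
    return np.array([d_price, d_ad])

g = gradient(3.0, 25.0)
print(g)  # [ 4. -5.]: steepest way UP in profit (raise price, cut ad spend).
          # To minimize an error function, you'd step the OPPOSITE way.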

Module 3: Chain Rule — “How do changes propagate through layers?”

The Real Question: In a 50-layer neural network, how does changing a weight in layer 1 affect the final output?
What You’ll Understand:
  • Nested functions: the output of layer 1 becomes input to layer 2, etc.
  • Chain rule: multiply the derivatives along the chain
  • Backpropagation: computing all derivatives efficiently, from output back to input
What You’ll Build: Backpropagation from scratch - the algorithm that made deep learning possible.
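The core move fits in a few lines. A minimal sketch with two invented “layers”:
# Two made-up "layers": layer1(x) = x**2, layer2(u) = 3*u + 1.
x = 2.0
u = x ** 2             # layer 1 output becomes layer 2 input
y = 3 * u + 1          # final output: y = 3*x**2 + 1

du_dx = 2 * x          # local derivative of layer 1
dy_du = 3.0            # local derivative of layer 2
dy_dx = dy_du * du_dx  # chain rule: multiply derivatives along the chain

print(dy_dx)           # 12.0 -- matches d/dx of (3*x**2 + 1) = 6*x at x = 2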

Module 4: Gradient Descent — “Taking steps downhill”

The Real Question: How big should each step be? When should we stop?
What You’ll Understand:
  • Learning rate: step too big = overshoot, step too small = takes forever
  • Convergence: knowing when you’ve reached the bottom
  • Local minima: getting stuck in small valleys instead of the deepest one
What You’ll Build: A complete training loop that learns from data.
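A minimal sketch of the step-size trade-off, using an invented quadratic loss:
# loss(w) = w**2, whose gradient is 2*w; the minimum is at w = 0.
def grad(w):
    return 2 * w

for learning_rate in (0.01, 0.1, 1.1):
    w = 5.0
    for _ in range(20):
        w = w - learning_rate * grad(w)
    print(learning_rate, round(w, 4))
# 0.01 -> ~3.34  (too small: barely moved after 20 steps)
# 0.1  -> ~0.058 (about right: near the minimum)
# 1.1  -> ~191.7 (too big: each step overshoots, and w EXPLODES)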

Module 5: Optimization — “Getting there faster”

The Real Question: Gradient descent is slow. How do we speed it up?
What You’ll Understand:
  • Momentum: build up speed when going in a consistent direction
  • Adam: adapt the step size for each weight individually
  • Why Adam is the default choice for most deep learning
What You’ll Build: Compare optimizers head-to-head on the same problem.
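As a hedged preview (invented quadratic loss, hand-picked constants), here is plain gradient descent next to the same loop with momentum:
# loss(w) = w**2 with gradient 2*w; both loops start at w = 5.0.
def grad(w):
    return 2 * w

learning_rate = 0.01

# Plain gradient descent: each step uses only the current slope.
w = 5.0
for _ in range(50):
    w = w - learning_rate * grad(w)
print("plain:   ", round(w, 4))  # ~1.82 -- still far from 0

# Momentum: a velocity term accumulates consistently-pointing gradients.
w, velocity = 5.0, 0.0
for _ in range(50):
    velocity = 0.9 * velocity + grad(w)
    w = w - learning_rate * velocity
print("momentum:", round(w, 4))  # much closer to 0 in the same 50 steps
                                 # (it may overshoot and oscillate on the way)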

Your Learning Journey

Week 1: Derivatives

Understand what derivatives really mean. Build a price optimizer.

Week 2: Gradients

Handle multiple variables at once. Optimize price AND marketing spend together.

Week 3: Chain Rule

Understand how changes propagate through layers. Implement backpropagation.

Week 4: Gradient Descent

Build a complete training loop. Watch your model learn.

Week 5: Final Project

Build a neural network from scratch using ONLY NumPy. No TensorFlow. No PyTorch.

Prerequisites

What You Need:
  • Basic Python (variables, functions, loops)
  • Linear Algebra course (or take it alongside - they complement each other)
  • Curiosity about how AI actually works
What You Don’t Need:
  • Previous calculus knowledge (we start from zero)
  • Memorized derivative formulas (we focus on understanding)
  • Mathematical proofs (we focus on intuition and code)

Setup

pip install numpy matplotlib jupyter plotly ipywidgets

jupyter notebook
That’s all you need. We build everything from scratch.
🎮 Interactive Visualizations: This course includes interactive gradient descent visualizers where you can:
  • Watch the optimization path unfold step-by-step
  • Adjust learning rate with sliders and see the effect immediately
  • Visualize loss landscapes in 3D
  • See backpropagation flow through network layers
Look for the 🎮 symbol throughout the course!

🎮 Interactive Visualization Tools

Calculus comes alive when you can see it. Use these tools alongside the course:
🔗 When to Use These Tools:
  • Module 1 (Derivatives): Desmos - plot f(x), add tangent lines, see slopes
  • Module 2 (Gradients): 3D surface plots in our notebooks
  • Module 3 (Chain Rule): Our interactive backprop visualizer
  • Module 4 (Gradient Descent): Gradient Descent Visualizer website
  • Module 5 (Final Project): TensorFlow Playground after you build your own!
Want more mathematical rigor? Each module includes optional “Going Deeper” sections:
Module        Advanced Topic                                   Why It Matters
Derivatives   Limits, continuity, formal definition            Understand convergence proofs in ML papers
Gradients     Jacobian matrices, Hessians                      Understand second-order optimization methods
Chain Rule    Computational graphs, automatic differentiation  How PyTorch/JAX actually work
Optimization  Convexity, convergence rates, saddle points      Why certain architectures train better
These sections are OPTIONAL. You can build neural networks and understand gradient descent without them. They’re for learners who:
  • Want to read ML research papers
  • Are curious about optimization theory
  • Plan to implement custom autograd systems
Recommended Resources for Deep Dives:
  • Calculus Made Easy by Silvanus Thompson (classic, intuitive)
  • Convex Optimization by Boyd & Vandenberghe (free online)
  • Fast.ai’s “Practical Deep Learning” course (connects calculus to real training)

What You’ll Build

Price Optimizer

Given a profit function, automatically find the price that maximizes profit using derivatives.

Multi-Variable Optimizer

Optimize both price and ad spend simultaneously using gradients.

Backpropagation Engine

Implement the chain rule to compute gradients through multiple layers.

Neural Network (From Scratch)

Build a complete neural network that learns XOR - using only NumPy.

Interview Preparation: What Companies Ask

Google/Meta/Amazon commonly ask:
  • “Explain how backpropagation works” (Chain Rule module)
  • “Why might training get stuck? How do you fix it?” (Optimization module)
  • “What happens if learning rate is too high/low?” (Gradient Descent module)
  • “Derive the gradient for a simple loss function” (Derivatives module)
Fast-growing startups focus on:
  • “Walk me through training a neural network from scratch”
  • “How would you debug a model that’s not learning?”
  • “Why do we use Adam over SGD?”
  • “Explain vanishing/exploding gradients”
Research-focused roles ask:
  • “Prove that gradient descent converges for convex functions”
  • “What are second-order optimization methods?”
  • “Explain the mathematical foundations of attention mechanisms”
  • “Derive backprop for a custom activation function”

Why This Course Exists

Most calculus courses teach you to solve problems like: “Find the derivative of $f(x) = 3x^4 - 2x^2 + 5$.” And you learn: “Use the power rule: $f'(x) = 12x^3 - 4x$.” But nobody tells you WHY. Why do neural networks need derivatives? How does PyTorch compute gradients automatically? Why does “learning rate = 0.01” work better than “learning rate = 1.0”? This course answers those questions. By the end, you won’t just know formulas - you’ll understand the engine that makes AI learn.

By The End of This Course

You will:
  • Understand why every ML framework computes gradients
  • Build a neural network that actually learns (from scratch!)
  • Debug training problems because you understand what’s happening
  • Read ML papers and understand the math notation
  • Choose the right optimizer for your problem
When you see this equation: $\theta_{t+1} = \theta_t - \alpha \nabla_\theta J(\theta)$, you’ll think: “Oh, that’s just saying: update the weights by stepping opposite to the gradient, scaled by the learning rate.”
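In NumPy, that update is one line of code. The arrays below are placeholder values, not from any course example:
import numpy as np

theta = np.array([0.5, -1.2, 3.0])   # current weights (placeholder values)
grad_J = np.array([0.1, -0.4, 0.8])  # gradient of the loss at theta (placeholder)
alpha = 0.01                         # learning rate

theta = theta - alpha * grad_J       # step opposite the gradient, scaled by alpha
print(theta)                         # [ 0.499 -1.196  2.992]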

Let’s Begin

The next module starts with a simple question: “You own a business. What price should you charge to maximize profit?” The answer will teach you what derivatives really mean.

Next: Derivatives

Learn what derivatives actually measure and why neural networks need them