Linear Regression
From Intuition to Implementation
In the previous modules, you learned:
ML is about finding patterns in data
We measure “wrongness” with a loss function
Gradient descent minimizes the loss
Now let’s put it all together into a complete, professional algorithm.
The Real-World Setup
Your boss asks: “Can you predict how much revenue we’ll make based on our advertising spend?”
You have historical data:
import numpy as np
import matplotlib.pyplot as plt
# Historical data: [TV ads ($k), Radio ads ($k), Newspaper ads ($k)] -> Revenue ($k)
advertising = np.array([
    [230, 37, 69],
    [ 44, 39, 45],
    [ 17, 45, 69],
    [151, 41, 58],
    [180, 10, 58],
    [  8, 49, 75],
    [ 57, 32, 23],
    [120, 19, 11],
    [  8, 35, 65],
    [199,  3, 70],
])
revenue = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6])
Question: If we spend $100k on TV, $40k on Radio, and $30k on Newspaper, what revenue can we expect?
The Linear Regression Model
We assume revenue is a weighted combination of ad spends:
$$\text{revenue} = w_0 + w_1 \cdot \text{TV} + w_2 \cdot \text{Radio} + w_3 \cdot \text{Newspaper}$$
Or in matrix notation (see Matrix Operations):
$$\hat{y} = X \cdot w$$
Where:
$X$ is our feature matrix (with a column of 1s for the bias term)
$w$ is our weight vector
$\hat{y}$ is our predictions
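To make this concrete, here is a tiny sketch that plugs made-up weights into the matrix form for the first sample (the weight values are purely illustrative, not the trained ones):

# One sample with a leading 1 for the bias: [1, TV, Radio, Newspaper]
x = np.array([1.0, 230.0, 37.0, 69.0])
w_guess = np.array([3.0, 0.05, 0.2, 0.0])   # [w0, w1, w2, w3], made up for illustration

y_hat = x @ w_guess   # 3.0 + 0.05*230 + 0.2*37 + 0.0*69 = 21.9
print(y_hat)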
Step-by-Step Implementation
Step 1: Prepare the Data
# Add a column of 1s for the bias (intercept) term
X = np.column_stack([np.ones(len(advertising)), advertising])
y = revenue

print("X shape:", X.shape)  # (10, 4) - 10 samples, 4 columns (bias + 3 ad features)
print("y shape:", y.shape)  # (10,) - 10 target values
Step 2: Define the Model
def predict(X, w):
    """Linear model: y_hat = X @ w"""
    return X @ w
Step 3: Define the Loss Function
def mean_squared_error(y_true, y_pred):
    """Average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def compute_loss(X, y, w):
    """Compute MSE for the given weights."""
    predictions = predict(X, w)
    return mean_squared_error(y, predictions)
Step 4: Compute the Gradient
def compute_gradient(X, y, w):
    """
    Gradient of MSE with respect to the weights:
    d(MSE)/dw = (2/n) * X.T @ (X @ w - y)
    """
    n = len(y)
    predictions = predict(X, w)
    errors = predictions - y
    gradient = (2 / n) * X.T @ errors
    return gradient
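A common sanity check for a hand-derived gradient is to compare it against finite differences. Here is a minimal sketch (the helper numerical_gradient is just for illustration, reusing compute_loss and the X, y defined above):

def numerical_gradient(X, y, w, eps=1e-6):
    """Approximate d(MSE)/dw with central finite differences."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (compute_loss(X, y, w_plus) - compute_loss(X, y, w_minus)) / (2 * eps)
    return grad

# The analytic and numerical gradients should agree closely
w_test = np.random.randn(X.shape[1])
print(np.allclose(compute_gradient(X, y, w_test), numerical_gradient(X, y, w_test), rtol=1e-4))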
Step 5: Gradient Descent
def train_linear_regression(X, y, learning_rate=0.0001, num_epochs=1000):
    """
    Train a linear regression model using gradient descent.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target values (n_samples,)
        learning_rate: Step size for gradient descent
        num_epochs: Number of training iterations

    Returns:
        Trained weights and the loss history
    """
    # Initialize weights to zero
    w = np.zeros(X.shape[1])

    # Track loss history for plotting
    losses = []

    for epoch in range(num_epochs):
        # Compute current loss
        loss = compute_loss(X, y, w)
        losses.append(loss)

        # Compute gradient
        gradient = compute_gradient(X, y, w)

        # Update weights
        w = w - learning_rate * gradient

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss = {loss:.4f}")

    return w, losses
Step 6: Train the Model
# Normalize features for better convergence
X_normalized = X.copy()
X_normalized[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

# Train
weights, loss_history = train_linear_regression(X_normalized, y, learning_rate=0.1, num_epochs=500)

print("\nFinal weights:")
print(f"  Bias:      {weights[0]:.4f}")
print(f"  TV:        {weights[1]:.4f}")
print(f"  Radio:     {weights[2]:.4f}")
print(f"  Newspaper: {weights[3]:.4f}")
Using scikit-learn (The Professional Way)
In practice, we use libraries that handle all the details:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    advertising, revenue, test_size=0.2, random_state=42
)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared Score: {r2:.4f}")

print("\nLearned Coefficients:")
print(f"  Intercept: {model.intercept_:.4f}")
print(f"  TV:        {model.coef_[0]:.4f}")
print(f"  Radio:     {model.coef_[1]:.4f}")
print(f"  Newspaper: {model.coef_[2]:.4f}")
Interpreting the Results
The coefficients tell a story:
TV coefficient: 0.046
Radio coefficient: 0.188
Newspaper coefficient: 0.003
Insights:
Every $1k spent on Radio returns ~$188 in revenue (best ROI!)
Every $1k spent on TV returns ~$46
Newspaper ads have almost no impact
Business decision : Shift budget from newspaper to radio!
For linear regression, there’s actually a formula that gives the optimal weights directly, without gradient descent:
$$w = (X^T X)^{-1} X^T y$$
This is called the Normal Equation . It comes from calculus - setting the gradient to zero and solving.
def linear_regression_closed_form(X, y):
    """Compute optimal weights using the normal equation."""
    # w = (X^T X)^(-1) X^T y
    XtX = X.T @ X
    XtX_inv = np.linalg.inv(XtX)
    Xty = X.T @ y
    w = XtX_inv @ Xty
    return w

# On the same normalized features, this gives the same answer as gradient descent!
w_closed = linear_regression_closed_form(X_normalized, y)
print("Closed-form solution:", w_closed)
When to use which?
Normal Equation: fast for small datasets; exact solution
Gradient Descent: better for large datasets; more flexible
The matrix inversion in the normal equation is O(n³) in the number of features n, which becomes slow for millions of features.
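In practice, libraries rarely form $(X^T X)^{-1}$ explicitly; a least-squares solver or the pseudoinverse is more numerically stable. A minimal sketch with NumPy, reusing the X and y from Step 1:

# Solves the same least-squares problem without an explicit matrix inverse
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("Least-squares solution:", w_lstsq)

# The pseudoinverse also copes with a singular or ill-conditioned X^T X
w_pinv = np.linalg.pinv(X) @ y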
Real-World Example: House Price Prediction
Let’s build a proper house price predictor using real data:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
# Load California housing data
housing = fetch_california_housing()
X, y = housing.data, housing.target

print("Features:", housing.feature_names)
print("X shape:", X.shape)  # (20640, 8) - 20k houses, 8 features

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features (important for many algorithms!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate (the target is in units of $100,000, hence the conversion)
y_pred = model.predict(X_test_scaled)
print(f"\nR-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)) * 100000:.2f}")

# Feature importance
print("\nFeature Importance (coefficients):")
for name, coef in zip(housing.feature_names, model.coef_):
    print(f"  {name}: {coef:.4f}")
Common Pitfalls and Solutions
Pitfall 1: Features on Different Scales
# BAD: Sqft ranges 500-5000, bedrooms 1-6
# The model will be biased toward sqft
# GOOD: Standardize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Pitfall 2: Multicollinearity
When features are highly correlated, coefficients become unstable.
# Check correlations
import pandas as pd

df = pd.DataFrame(X, columns=feature_names)  # feature_names: your list of column names
print(df.corr())

# Solution: Use Ridge regression or drop correlated features
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)  # Adds regularization
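To see the instability for yourself, here is a small self-contained sketch with synthetic data where two features are nearly identical copies (all numbers are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # almost a duplicate of x1
X_corr = np.column_stack([x1, x2])
y_corr = 3 * x1 + rng.normal(scale=0.1, size=200)

# Plain OLS tends to split the shared effect erratically between the near-duplicates;
# Ridge shares it more evenly (roughly 1.5 each) and is far less sensitive
print("OLS coefficients:  ", LinearRegression().fit(X_corr, y_corr).coef_)
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X_corr, y_corr).coef_)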
Pitfall 3: Overfitting
When the model memorizes training data but fails on new data.
# Always evaluate on held-out test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)

print("Train R2:", model.score(X_train, y_train))
print("Test R2:", model.score(X_test, y_test))
# If train >> test, you're overfitting!
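A single train/test split can get lucky or unlucky; cross-validation averages over several splits and gives a more reliable read. A short sketch:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold CV: train on 4 folds, score on the held-out fold, rotate 5 times
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"CV R2: {scores.mean():.3f} +/- {scores.std():.3f}")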
The Complete Linear Regression Workflow
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
def linear_regression_pipeline(X, y, feature_names=None):
    """
    Complete linear regression workflow: split, scale, train, evaluate.
    """
    # 1. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # 2. Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # 3. Train model
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)

    # 4. Evaluate
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)

    print("=== Model Evaluation ===")
    print(f"Train R2: {r2_score(y_train, y_pred_train):.4f}")
    print(f"Test R2:  {r2_score(y_test, y_pred_test):.4f}")
    print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):.4f}")

    # 5. Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    print(f"\nCross-validation R2: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

    # 6. Feature importance
    if feature_names is not None:
        print("\n=== Feature Importance ===")
        importance = pd.DataFrame({
            'feature': feature_names,
            'coefficient': model.coef_
        }).sort_values('coefficient', key=abs, ascending=False)
        print(importance.to_string(index=False))

    return model, scaler
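For example, you could run it on the California housing data from earlier (a usage sketch; it assumes fetch_california_housing is available):

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
model, scaler = linear_regression_pipeline(
    housing.data, housing.target, feature_names=list(housing.feature_names)
)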
Key Takeaways
Linear = Weighted Sum: y = w0 + w1⋅x1 + w2⋅x2 + …
MSE Loss: measures the average squared error
Scale Your Features: normalize for better training
Evaluate on Test Data: always hold out some data
🚀 Mini Projects
Project 1: Build a salary prediction model
Project 2: Real estate price predictor with feature engineering
Project 3: Model comparison and selection pipeline
What’s Next?
Linear regression predicts continuous numbers. But what if you want to predict categories ?
Is this email spam or not spam?
Will this customer churn or stay?
Is this tumor malignant or benign?
That’s classification - the subject of our next module!
Continue to Module 4: Classification, where you'll learn to predict categories with logistic regression and beyond.
🔗 Math → ML Connection Summary
Where the math you learned powers linear regression:

| Math Concept | Where It Appears | What It Does |
|---|---|---|
| Vectors (Module ) | Feature vector $\mathbf{x}$ | Represents each data point |
| Dot Product (Module ) | $\mathbf{w} \cdot \mathbf{x}$ | Core of the linear model prediction |
| Matrix Multiplication (Module ) | $X\mathbf{w} = \hat{y}$ | Batch predictions for all samples |
| Gradient (Module ) | $\nabla_w L$ | Direction to update weights |
| Chain Rule (Module ) | Backpropagation | Compute gradients efficiently |
| Mean/Variance (Module ) | Feature scaling | Standardization for better training |
| Normal Distribution (Module ) | Error distribution | Justifies using MSE loss |
Bottom line : Linear regression is the intersection of all three math courses. Master this, and neural networks become “just deeper linear regression with nonlinearities.”
🚀 Going Deeper: The Mathematics of Linear Regression
Want to understand the theory? Here's what's happening under the hood.

Why Gradient Descent Works
The MSE loss is convex (bowl-shaped), meaning:
There’s exactly one minimum
Gradient descent is guaranteed to find it
Step size (learning rate) affects speed but not destination
$$L(w) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2$$

Taking the gradient:

$$\nabla_w L = -\frac{2}{n}X^T(y - X\mathbf{w})$$

Setting it to zero gives the normal equation:

$$\mathbf{w}^* = (X^TX)^{-1}X^Ty$$

The Statistical Interpretation
Under the assumption that the errors are normally distributed,

$$y = \mathbf{w}^T\mathbf{x} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2),$$

Maximum Likelihood Estimation (MLE) of $\mathbf{w}$ is equivalent to minimizing the MSE!
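You can check the "gradient is zero at the optimum" claim numerically, reusing the X and y defined earlier in this module (a quick sketch; np.linalg.pinv plays the role of the normal equation here):

# At the least-squares optimum, the MSE gradient should essentially vanish
w_star = np.linalg.pinv(X) @ y
gradient_at_optimum = (2 / len(y)) * X.T @ (X @ w_star - y)
print("Max |gradient| at optimum:", np.abs(gradient_at_optimum).max())   # ~0 up to rounding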
Regularization from a Bayesian View
Ridge Regression (L2): assumes the weights have a Gaussian prior $\mathbf{w} \sim \mathcal{N}(0, \tau^2 I)$
Lasso (L1): assumes the weights have a Laplacian prior → promotes sparsity
This connection between regularization and Bayesian priors is why regularization prevents overfitting!

Recommended Deep-Dive Resources