
Linear Regression

[Figure: Linear regression best-fit line]

From Intuition to Implementation

In the previous modules, you learned:
  1. ML is about finding patterns in data
  2. We measure “wrongness” with a loss function
  3. Gradient descent minimizes the loss
Now let’s put it all together into a complete, professional algorithm.

The Real-World Setup

Your boss asks: “Can you predict how much revenue we’ll make based on our advertising spend?” You have historical data:
import numpy as np
import matplotlib.pyplot as plt

# Historical data: [TV ads ($k), Radio ads ($k), Newspaper ads ($k)] -> Revenue ($k)
advertising = np.array([
    [230, 37, 69],
    [44, 39, 45],
    [17, 45, 69],
    [151, 41, 58],
    [180, 10, 58],
    [8, 49, 75],
    [57, 32, 23],
    [120, 19, 11],
    [8, 35, 65],
    [199, 3, 70],
])

revenue = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6])
Question: If we spend $100k on TV, $40k on Radio, and $30k on Newspaper, what revenue can we expect?

The Linear Regression Model

We assume revenue is a weighted combination of ad spends:

$$\text{revenue} = w_0 + w_1 \cdot \text{TV} + w_2 \cdot \text{Radio} + w_3 \cdot \text{Newspaper}$$

Or in matrix notation (see Matrix Operations):

$$\hat{y} = X \cdot w$$

Where:
  • $X$ is our feature matrix (with a column of 1s for the bias term)
  • $w$ is our weight vector
  • $\hat{y}$ is our vector of predictions
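
To make the weighted sum concrete, here is a minimal sketch that computes a single prediction by hand. The weights below are made-up placeholder values for illustration only, not the trained result:

import numpy as np

# Hypothetical weights, purely for illustration: [bias, TV, Radio, Newspaper]
w = np.array([3.0, 0.05, 0.19, 0.001])

# One sample: $230k TV, $37k Radio, $69k Newspaper (the leading 1 multiplies the bias)
x = np.array([1.0, 230, 37, 69])

prediction = x @ w  # w0*1 + w1*TV + w2*Radio + w3*Newspaper
print(f"Predicted revenue: ${prediction:.1f}k")  # roughly $21.6k with these made-up weights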

Step-by-Step Implementation

Step 1: Prepare the Data

# Add a column of 1s for the bias (intercept) term
X = np.column_stack([np.ones(len(advertising)), advertising])
y = revenue

print("X shape:", X.shape)  # (10, 4) - 10 samples, 4 features (including bias)
print("y shape:", y.shape)  # (10,) - 10 target values

Step 2: Define the Model

def predict(X, w):
    """Linear model: y_hat = X @ w"""
    return X @ w

Step 3: Define the Loss Function

def mean_squared_error(y_true, y_pred):
    """Average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def compute_loss(X, y, w):
    """Compute MSE for given weights."""
    predictions = predict(X, w)
    return mean_squared_error(y, predictions)

Step 4: Compute the Gradient

def compute_gradient(X, y, w):
    """
    Gradient of MSE with respect to weights.
    
    d(MSE)/dw = (2/n) * X.T @ (X @ w - y)
    """
    n = len(y)
    predictions = predict(X, w)
    errors = predictions - y
    gradient = (2 / n) * X.T @ errors
    return gradient
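
As a quick sanity check, you can compare the analytic gradient to a finite-difference approximation; the numerical_gradient helper below is just an illustration (assuming X, y, compute_loss, and compute_gradient from the steps above), and the two should agree closely:

# Optional sanity check: finite-difference estimate of d(MSE)/dw
def numerical_gradient(X, y, w, eps=1e-6):
    """Approximate the gradient by perturbing one weight at a time."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (compute_loss(X, y, w_plus) - compute_loss(X, y, w_minus)) / (2 * eps)
    return grad

w_test = np.zeros(X.shape[1])
print(np.allclose(compute_gradient(X, y, w_test), numerical_gradient(X, y, w_test)))  # True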

Step 5: Gradient Descent

def train_linear_regression(X, y, learning_rate=0.0001, num_epochs=1000):
    """
    Train a linear regression model using gradient descent.
    
    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target values (n_samples,)
        learning_rate: Step size for gradient descent
        num_epochs: Number of training iterations
    
    Returns:
        Tuple of (trained weights, list of per-epoch losses)
    """
    # Initialize weights to zero
    w = np.zeros(X.shape[1])
    
    # Track loss history for plotting
    losses = []
    
    for epoch in range(num_epochs):
        # Compute current loss
        loss = compute_loss(X, y, w)
        losses.append(loss)
        
        # Compute gradient
        gradient = compute_gradient(X, y, w)
        
        # Update weights
        w = w - learning_rate * gradient
        
        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss = {loss:.4f}")
    
    return w, losses

Step 6: Train the Model

# Normalize features for better convergence
X_normalized = X.copy()
X_normalized[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

# Train
weights, loss_history = train_linear_regression(X_normalized, y, learning_rate=0.1, num_epochs=500)

print("\nFinal weights:")
print(f"  Bias:      {weights[0]:.4f}")
print(f"  TV:        {weights[1]:.4f}")
print(f"  Radio:     {weights[2]:.4f}")
print(f"  Newspaper: {weights[3]:.4f}")

Using scikit-learn (The Professional Way)

In practice, we use libraries that handle all the details:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    advertising, revenue, test_size=0.2, random_state=42
)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared Score: {r2:.4f}")

print("\nLearned Coefficients:")
print(f"  Intercept:  {model.intercept_:.4f}")
print(f"  TV:         {model.coef_[0]:.4f}")
print(f"  Radio:      {model.coef_[1]:.4f}")
print(f"  Newspaper:  {model.coef_[2]:.4f}")

Interpreting the Results

The coefficients tell a story:
TV coefficient:        0.046
Radio coefficient:     0.188
Newspaper coefficient: 0.003
Insights:
  • Every $1k spent on **Radio** returns ~$188 in revenue (best ROI!)
  • Every $1k spent on **TV** returns ~$46
  • Newspaper ads have almost no impact
Business decision: Shift budget from newspaper to radio!

The Closed-Form Solution

For linear regression, there’s actually a formula that gives the optimal weights directly, without gradient descent:

$$w = (X^T X)^{-1} X^T y$$

This is called the Normal Equation. It comes from calculus: setting the gradient to zero and solving.
def linear_regression_closed_form(X, y):
    """Compute optimal weights using the normal equation."""
    # w = (X^T X)^(-1) X^T y
    XtX = X.T @ X
    XtX_inv = np.linalg.inv(XtX)
    Xty = X.T @ y
    w = XtX_inv @ Xty
    return w

# This gives the same answer as gradient descent!
w_closed = linear_regression_closed_form(X, y)
print("Closed-form solution:", w_closed)
When to use which?
  • Normal Equation: fast for small datasets, gives an exact solution
  • Gradient Descent: better for large datasets, more flexible
Inverting the d × d matrix X^T X is O(d^3) in the number of features d, so the normal equation becomes slow once you have millions of features.
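
One practical note: explicitly inverting X^T X can be numerically fragile when features are nearly collinear. A minimal alternative sketch, assuming the X and y built in Step 1, hands the problem to a least-squares solver instead:

# Numerically more stable closed-form fit via a least-squares solver
def linear_regression_lstsq(X, y):
    """Solve min ||X w - y||^2 without forming an explicit inverse."""
    w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    return w

print("Least-squares solution:", linear_regression_lstsq(X, y))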

Real-World Example: House Price Prediction

Let’s build a proper house price predictor using real data:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Load California housing data
housing = fetch_california_housing()
X, y = housing.data, housing.target

print("Features:", housing.feature_names)
print("X shape:", X.shape)  # (20640, 8) - 20k houses, 8 features

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features (important for many algorithms!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
print(f"\nR-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)) * 100000:.2f}")

# Feature importance
print("\nFeature Importance (coefficients):")
for name, coef in zip(housing.feature_names, model.coef_):
    print(f"  {name}: {coef:.4f}")

Common Pitfalls and Solutions

Pitfall 1: Features on Different Scales

# BAD: Sqft ranges 500-5000, bedrooms 1-6
# The model will be biased toward sqft

# GOOD: Standardize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Pitfall 2: Multicollinearity

When features are highly correlated, coefficients become unstable.
# Check correlations
import pandas as pd
df = pd.DataFrame(X, columns=feature_names)  # feature_names: your list of column names (e.g., housing.feature_names)
print(df.corr())

# Solution: Use Ridge regression or drop correlated features
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)  # Adds regularization

Pitfall 3: Overfitting

When the model memorizes training data but fails on new data.
# Always evaluate on held-out test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
print("Train R2:", model.score(X_train, y_train))
print("Test R2:", model.score(X_test, y_test))
# If train >> test, you're overfitting!

The Complete Linear Regression Workflow

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score

def linear_regression_pipeline(X, y, feature_names=None):
    """
    Complete linear regression workflow.
    """
    # 1. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # 2. Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # 3. Train model
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)
    
    # 4. Evaluate
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)
    
    print("=== Model Evaluation ===")
    print(f"Train R2: {r2_score(y_train, y_pred_train):.4f}")
    print(f"Test R2:  {r2_score(y_test, y_pred_test):.4f}")
    print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):.4f}")
    
    # 5. Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    print(f"\nCross-validation R2: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
    
    # 6. Feature importance
    if feature_names:
        print("\n=== Feature Importance ===")
        importance = pd.DataFrame({
            'feature': feature_names,
            'coefficient': model.coef_
        }).sort_values('coefficient', key=abs, ascending=False)
        print(importance.to_string(index=False))
    
    return model, scaler
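
A possible usage sketch, assuming the California housing variables (housing, X, y) from the earlier section are still in scope:

# Example usage with the California housing data loaded earlier
model, scaler = linear_regression_pipeline(X, y, feature_names=list(housing.feature_names))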

Key Takeaways

Linear = Weighted Sum

y = w0 + w1x1 + w2x2 + …

MSE Loss

Measures average squared error

Scale Your Features

Normalize for better training

Evaluate on Test Data

Always hold out some data

🚀 Mini Projects

Project 1

Build a salary prediction model

Project 2

Real estate price predictor with feature engineering

Project 3

Model comparison and selection pipeline

What’s Next?

Linear regression predicts continuous numbers. But what if you want to predict categories?
  • Is this email spam or not spam?
  • Will this customer churn or stay?
  • Is this tumor malignant or benign?
That’s classification - the subject of our next module!

Continue to Module 4: Classification

Learn to predict categories with logistic regression and beyond

🔗 Math → ML Connection Summary

Where the math you learned powers linear regression:
| Math Concept | Where It Appears | What It Does |
| --- | --- | --- |
| Vectors | Feature vector $\mathbf{x}$ | Represents each data point |
| Dot Product | $\mathbf{w} \cdot \mathbf{x}$ | Core of the linear model prediction |
| Matrix Multiplication | $X\mathbf{w} = \hat{y}$ | Batch predictions for all samples |
| Gradient | $\nabla_w L$ | Direction to update weights |
| Chain Rule | Backpropagation | Compute gradients efficiently |
| Mean/Variance | Feature scaling | Standardization for better training |
| Normal Distribution | Error distribution | Justifies using MSE loss |
Bottom line: Linear regression is the intersection of all three math courses. Master this, and neural networks become “just deeper linear regression with nonlinearities.”
Want to understand the theory? Here’s what’s happening under the hood:

Why Gradient Descent Works

The MSE loss is convex (bowl-shaped), meaning:
  • There’s exactly one minimum
  • Gradient descent is guaranteed to find it
  • Step size (learning rate) affects speed but not destination
$$L(w) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2$$

Taking the gradient:

$$\nabla_w L = -\frac{2}{n}X^T(y - X\mathbf{w})$$

Setting it to zero gives the normal equation:

$$\mathbf{w}^* = (X^TX)^{-1}X^Ty$$

The Statistical Interpretation

Under the assumption that errors are normally distributed,

$$y = \mathbf{w}^T\mathbf{x} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2),$$

maximum likelihood estimation (MLE) of $\mathbf{w}$ is equivalent to minimizing MSE!
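
To fill in the step: under this model the log-likelihood is, up to terms that do not depend on $\mathbf{w}$, a negative multiple of the sum of squared errors, so maximizing likelihood and minimizing MSE pick the same weights:

$$\log p(y \mid X, \mathbf{w}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2 - \frac{n}{2}\log(2\pi\sigma^2)$$

$$\Rightarrow \quad \arg\max_{\mathbf{w}} \log p(y \mid X, \mathbf{w}) = \arg\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2$$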

Regularization from a Bayesian View

  • Ridge Regression (L2): assumes weights have a Gaussian prior, $\mathbf{w} \sim \mathcal{N}(0, \tau^2 I)$
  • Lasso (L1): assumes weights have a Laplacian prior → promotes sparsity
This Bayesian view is what connects regularization to overfitting: the prior encodes a preference for small (or sparse) weights, which keeps the model from fitting noise.
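
A minimal sketch of the effect in practice, using synthetic data purely for illustration: Ridge shrinks correlated coefficients, while Lasso tends to drive some of them exactly to zero.

# Sketch: Ridge shrinks coefficients, Lasso zeroes some out (synthetic data for illustration)
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 4] = X_demo[:, 3] + 0.01 * rng.normal(size=200)  # two highly correlated columns
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 3] + rng.normal(size=200)

print("Ridge coefficients:", Ridge(alpha=1.0).fit(X_demo, y_demo).coef_.round(3))
print("Lasso coefficients:", Lasso(alpha=0.1).fit(X_demo, y_demo).coef_.round(3))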