Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Linear Regression

Linear Regression - Best Fit Line

From Intuition to Implementation

In the previous modules, you learned:
  1. ML is about finding patterns in data
  2. We measure “wrongness” with a loss function
  3. Gradient descent minimizes the loss
Now let’s put it all together into a complete, professional algorithm.

The Real-World Setup

Your boss asks: “Can you predict how much revenue we’ll make based on our advertising spend?” You have historical data:
import numpy as np
import matplotlib.pyplot as plt

# Historical data: [TV ads ($k), Radio ads ($k), Newspaper ads ($k)] -> Revenue ($k)
advertising = np.array([
    [230, 37, 69],
    [44, 39, 45],
    [17, 45, 69],
    [151, 41, 58],
    [180, 10, 58],
    [8, 49, 75],
    [57, 32, 23],
    [120, 19, 11],
    [8, 35, 65],
    [199, 3, 70],
])

revenue = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6])
Question: If we spend 100konTV,100k on TV, 40k on Radio, and $30k on Newspaper, what revenue can we expect?
House Price Prediction with Linear Regression

The Linear Regression Model

We assume revenue is a weighted combination of ad spends: revenue=w0+w1TV+w2Radio+w3Newspaper\text{revenue} = w_0 + w_1 \cdot \text{TV} + w_2 \cdot \text{Radio} + w_3 \cdot \text{Newspaper} Or in matrix notation (see Matrix Operations): y^=Xw\hat{y} = X \cdot w Where:
  • XX is our feature matrix (with a column of 1s for the bias term)
  • ww is our weight vector
  • y^\hat{y} is our predictions

Step-by-Step Implementation

Step 1: Prepare the Data

# Add a column of 1s for the bias (intercept) term
X = np.column_stack([np.ones(len(advertising)), advertising])
y = revenue

print("X shape:", X.shape)  # (10, 4) - 10 samples, 4 features (including bias)
print("y shape:", y.shape)  # (10,) - 10 target values

Step 2: Define the Model

def predict(X, w):
    """Linear model: y_hat = X @ w"""
    return X @ w

Step 3: Define the Loss Function

def mean_squared_error(y_true, y_pred):
    """Average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def compute_loss(X, y, w):
    """Compute MSE for given weights."""
    predictions = predict(X, w)
    return mean_squared_error(y, predictions)

Step 4: Compute the Gradient

def compute_gradient(X, y, w):
    """
    Gradient of MSE with respect to weights.
    
    d(MSE)/dw = (2/n) * X.T @ (X @ w - y)
    """
    n = len(y)
    predictions = predict(X, w)
    errors = predictions - y
    gradient = (2 / n) * X.T @ errors
    return gradient

Step 5: Gradient Descent

def train_linear_regression(X, y, learning_rate=0.0001, num_epochs=1000):
    """
    Train a linear regression model using gradient descent.
    
    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target values (n_samples,)
        learning_rate: Step size for gradient descent
        num_epochs: Number of training iterations
    
    Returns:
        Trained weights
    """
    # Initialize weights to zero
    w = np.zeros(X.shape[1])
    
    # Track loss history for plotting
    losses = []
    
    for epoch in range(num_epochs):
        # Compute current loss
        loss = compute_loss(X, y, w)
        losses.append(loss)
        
        # Compute gradient
        gradient = compute_gradient(X, y, w)
        
        # Update weights
        w = w - learning_rate * gradient
        
        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss = {loss:.4f}")
    
    return w, losses

Step 6: Train the Model

# Normalize features for better convergence
X_normalized = X.copy()
X_normalized[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

# Train
weights, loss_history = train_linear_regression(X_normalized, y, learning_rate=0.1, num_epochs=500)

print("\nFinal weights:")
print(f"  Bias:      {weights[0]:.4f}")
print(f"  TV:        {weights[1]:.4f}")
print(f"  Radio:     {weights[2]:.4f}")
print(f"  Newspaper: {weights[3]:.4f}")

Using scikit-learn (The Professional Way)

In practice, we use libraries that handle all the details — the normal equation, numerical stability, edge cases. Understanding the math (above) is for your brain; scikit-learn is for your production code.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Split data into training and test sets.
# Why 80/20? It's a common default. With very little data, consider
# cross-validation instead (Module 7) to use every row for both
# training and evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    advertising, revenue, test_size=0.2, random_state=42
)

# Create and train the model.
# Under the hood, sklearn uses the normal equation (not gradient descent)
# for LinearRegression -- it's faster for small datasets.
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on data the model has NEVER seen
y_pred = model.predict(X_test)

# Evaluate: R-squared tells you what fraction of the variance
# in revenue is explained by ad spend. 0.8 means "80% explained."
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared Score: {r2:.4f}")

# The coefficients are the "weights" we've been learning about.
# Each one tells you: "holding everything else constant,
# how much does a $1K increase in this ad channel change revenue?"
print("\nLearned Coefficients:")
print(f"  Intercept:  {model.intercept_:.4f}")
print(f"  TV:         {model.coef_[0]:.4f}")
print(f"  Radio:      {model.coef_[1]:.4f}")
print(f"  Newspaper:  {model.coef_[2]:.4f}")

Interpreting the Results

The coefficients tell a story:
TV coefficient:        0.046
Radio coefficient:     0.188
Newspaper coefficient: 0.003
Insights:
  • Every 1kspentonRadioreturns 1k spent on **Radio** returns ~188 in revenue (best ROI!)
  • Every 1konTVreturns 1k on **TV** returns ~46
  • Newspaper ads have almost no impact
Business decision: Shift budget from newspaper to radio!

The Closed-Form Solution

For linear regression, there’s actually a formula that gives the optimal weights directly, without gradient descent: w=(XTX)1XTyw = (X^T X)^{-1} X^T y This is called the Normal Equation. It comes from calculus - setting the gradient to zero and solving. Think of it as the “just give me the answer” approach versus gradient descent’s “let me walk there step by step.”
def linear_regression_closed_form(X, y):
    """
    Compute optimal weights using the normal equation.
    
    This solves the system of equations directly -- no iterations,
    no learning rate to tune. The trade-off? It requires computing
    a matrix inverse, which is expensive for many features.
    """
    # w = (X^T X)^(-1) X^T y
    XtX = X.T @ X           # Shape: (features, features) -- square matrix
    XtX_inv = np.linalg.inv(XtX)  # O(n^3) -- the expensive step
    Xty = X.T @ y           # Shape: (features,) -- project targets onto features
    w = XtX_inv @ Xty
    return w

# This gives the same answer as gradient descent!
w_closed = linear_regression_closed_form(X, y)
print("Closed-form solution:", w_closed)
When to use which?
ApproachBest WhenWhy
Normal Equation< 10,000 featuresExact solution, no hyperparameters to tune
Gradient Descent> 10,000 features or very large datasetsO(n) per step vs O(n^3) for matrix inversion
In practice, scikit-learn’s LinearRegression automatically picks the best solver for your data size. You rarely need to worry about this choice — but understanding it helps you debug slow training times.

Real-World Example: House Price Prediction

Let’s build a proper house price predictor using real data:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Load California housing data
housing = fetch_california_housing()
X, y = housing.data, housing.target

print("Features:", housing.feature_names)
print("X shape:", X.shape)  # (20640, 8) - 20k houses, 8 features

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features (important for many algorithms!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
print(f"\nR-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)) * 100000:.2f}")

# Feature importance
print("\nFeature Importance (coefficients):")
for name, coef in zip(housing.feature_names, model.coef_):
    print(f"  {name}: {coef:.4f}")

Common Pitfalls and Solutions

Pitfall 1: Features on Different Scales

# BAD: Sqft ranges 500-5000, bedrooms 1-6
# The model will be biased toward sqft because the
# coefficient for sqft will be tiny (e.g., 0.15) while
# the bedroom coefficient will be huge (e.g., 25000).
# This doesn't affect predictions, but it makes coefficients
# impossible to compare for feature importance.

# GOOD: Standardize features so each has mean=0, std=1.
# Now a 1-unit change in any feature means "1 standard deviation,"
# making coefficients directly comparable.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Pitfall 2: Multicollinearity

When features are highly correlated, coefficients become unstable. Imagine trying to figure out whether it’s the coffee or the sugar making your drink sweet — when they always appear together, it’s hard to tell who deserves the credit.
# Check correlations -- look for values above 0.8 or below -0.8
import pandas as pd
df = pd.DataFrame(X, columns=feature_names)
print(df.corr())

# Solution: Use Ridge regression (L2 regularization) to stabilize
# coefficients, or drop one of the correlated features.
# Ridge adds a penalty that discourages any single weight from
# getting too large, which naturally handles collinearity.
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)  # alpha controls regularization strength

Pitfall 3: Overfitting

When the model memorizes training data but fails on new data. This is like studying only the practice exam and then bombing the real test because the questions are slightly different.
# Always evaluate on held-out test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
print("Train R2:", model.score(X_train, y_train))
print("Test R2:", model.score(X_test, y_test))
# If train >> test, you're overfitting!
# A 5-10% gap is normal. A 30%+ gap means trouble.
Practical rule of thumb: Linear regression rarely overfits unless you have far more features than samples. If you have 50 features and 100 rows, consider Ridge or Lasso regression (Module 13) to keep things under control.

The Complete Linear Regression Workflow

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score

def linear_regression_pipeline(X, y, feature_names=None):
    """
    Complete linear regression workflow.
    """
    # 1. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # 2. Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # 3. Train model
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)
    
    # 4. Evaluate
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)
    
    print("=== Model Evaluation ===")
    print(f"Train R2: {r2_score(y_train, y_pred_train):.4f}")
    print(f"Test R2:  {r2_score(y_test, y_pred_test):.4f}")
    print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):.4f}")
    
    # 5. Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    print(f"\nCross-validation R2: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
    
    # 6. Feature importance
    if feature_names:
        print("\n=== Feature Importance ===")
        importance = pd.DataFrame({
            'feature': feature_names,
            'coefficient': model.coef_
        }).sort_values('coefficient', key=abs, ascending=False)
        print(importance.to_string(index=False))
    
    return model, scaler

Key Takeaways

Linear = Weighted Sum

y = w0 + w1x1 + w2x2 + …

MSE Loss

Measures average squared error

Scale Your Features

Normalize for better training

Evaluate on Test Data

Always hold out some data

🚀 Mini Projects

Project 1

Build a salary prediction model

Project 2

Real estate price predictor with feature engineering

Project 3

Model comparison and selection pipeline

What’s Next?

Linear regression predicts continuous numbers. But what if you want to predict categories?
  • Is this email spam or not spam?
  • Will this customer churn or stay?
  • Is this tumor malignant or benign?
That’s classification - the subject of our next module!

Continue to Module 4: Classification

Learn to predict categories with logistic regression and beyond

🔗 Math → ML Connection Summary

Where the math you learned powers linear regression:
Math ConceptWhere It AppearsWhat It Does
Vectors (Module)Feature vector x\mathbf{x}Represents each data point
Dot Product (Module)wx\mathbf{w} \cdot \mathbf{x}Core of the linear model prediction
Matrix Multiplication (Module)Xw=y^X\mathbf{w} = \hat{y}Batch predictions for all samples
Gradient (Module)wL\nabla_w LDirection to update weights
Chain Rule (Module)BackpropagationCompute gradients efficiently
Mean/Variance (Module)Feature scalingStandardization for better training
Normal Distribution (Module)Error distributionJustifies using MSE loss
Bottom line: Linear regression is the intersection of all three math courses. Master this, and neural networks become “just deeper linear regression with nonlinearities.”
Want to understand the theory? Here’s what’s happening under the hood:

Why Gradient Descent Works

The MSE loss is convex (bowl-shaped), meaning:
  • There’s exactly one minimum
  • Gradient descent is guaranteed to find it
  • Step size (learning rate) affects speed but not destination
L(w)=1ni=1n(yiwTxi)2L(w) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2Taking the gradient: wL=2nXT(yXw)\nabla_w L = -\frac{2}{n}X^T(y - X\mathbf{w})Setting to zero gives the normal equation: w=(XTX)1XTy\mathbf{w}^* = (X^TX)^{-1}X^Ty

The Statistical Interpretation

Under the assumption that errors are normally distributed: y=wTx+ϵ,ϵN(0,σ2)y = \mathbf{w}^T\mathbf{x} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)Maximum Likelihood Estimation (MLE) of w\mathbf{w} is equivalent to minimizing MSE!

Regularization from a Bayesian View

  • Ridge Regression (L2): Assumes weights have Gaussian prior wN(0,τ2I)\mathbf{w} \sim \mathcal{N}(0, \tau^2I)
  • Lasso (L1): Assumes weights have Laplacian prior → promotes sparsity
This connection between regularization and Bayesian priors is why regularization prevents overfitting!