Linear Regression
From Intuition to Implementation
In the previous modules, you learned:
ML is about finding patterns in data
We measure “wrongness” with a loss function
Gradient descent minimizes the loss
Now let’s put it all together into a complete, professional algorithm.
The Real-World Setup
Your boss asks: “Can you predict how much revenue we’ll make based on our advertising spend?”
You have historical data:
import numpy as np
import matplotlib.pyplot as plt
# Historical data: [TV ads ($k), Radio ads ($k), Newspaper ads ($k)] -> Revenue ($k)
advertising = np.array([
    [230, 37, 69],
    [ 44, 39, 45],
    [ 17, 45, 69],
    [151, 41, 58],
    [180, 10, 58],
    [  8, 49, 75],
    [ 57, 32, 23],
    [120, 19, 11],
    [  8, 35, 65],
    [199,  3, 70],
])
revenue = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6])
Question: If we spend $100k on TV, $40k on Radio, and $30k on Newspaper, what revenue can we expect?
The Linear Regression Model
We assume revenue is a weighted combination of ad spends:
$$\text{revenue} = w_0 + w_1 \cdot \text{TV} + w_2 \cdot \text{Radio} + w_3 \cdot \text{Newspaper}$$
Or in matrix notation (see Matrix Operations):
$$\hat{y} = X \cdot w$$
Where:
$X$ is our feature matrix (with a column of 1s for the bias term)
$w$ is our weight vector
$\hat{y}$ is our predictions
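To make this concrete, here is a tiny sketch that plugs made-up weights into the matrix form for the first sample (the weight values are purely illustrative, not the trained ones):

# One sample with a leading 1 for the bias: [1, TV, Radio, Newspaper]
x = np.array([1.0, 230.0, 37.0, 69.0])
w_guess = np.array([3.0, 0.05, 0.2, 0.0])   # [w0, w1, w2, w3], made up for illustration

y_hat = x @ w_guess   # 3.0 + 0.05*230 + 0.2*37 + 0.0*69 = 21.9
print(y_hat)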
Step-by-Step Implementation
Step 1: Prepare the Data
# Add a column of 1s for the bias (intercept) term
X = np.column_stack([np.ones(len(advertising)), advertising])
y = revenue

print("X shape:", X.shape)  # (10, 4) - 10 samples, 4 columns (bias + 3 ad features)
print("y shape:", y.shape)  # (10,) - 10 target values
Step 2: Define the Model
def predict(X, w):
    """Linear model: y_hat = X @ w"""
    return X @ w
Step 3: Define the Loss Function
def mean_squared_error(y_true, y_pred):
    """Average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def compute_loss(X, y, w):
    """Compute MSE for the given weights."""
    predictions = predict(X, w)
    return mean_squared_error(y, predictions)
Step 4: Compute the Gradient
def compute_gradient(X, y, w):
    """
    Gradient of MSE with respect to the weights:
    d(MSE)/dw = (2/n) * X.T @ (X @ w - y)
    """
    n = len(y)
    predictions = predict(X, w)
    errors = predictions - y
    gradient = (2 / n) * X.T @ errors
    return gradient
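A common sanity check for a hand-derived gradient is to compare it against finite differences. Here is a minimal sketch (the helper numerical_gradient is just for illustration, reusing compute_loss and the X, y defined above):

def numerical_gradient(X, y, w, eps=1e-6):
    """Approximate d(MSE)/dw with central finite differences."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (compute_loss(X, y, w_plus) - compute_loss(X, y, w_minus)) / (2 * eps)
    return grad

# The analytic and numerical gradients should agree closely
w_test = np.random.randn(X.shape[1])
print(np.allclose(compute_gradient(X, y, w_test), numerical_gradient(X, y, w_test), rtol=1e-4))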
Step 5: Gradient Descent
def train_linear_regression(X, y, learning_rate=0.0001, num_epochs=1000):
    """
    Train a linear regression model using gradient descent.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target values (n_samples,)
        learning_rate: Step size for gradient descent
        num_epochs: Number of training iterations

    Returns:
        Trained weights and the loss history
    """
    # Initialize weights to zero
    w = np.zeros(X.shape[1])

    # Track loss history for plotting
    losses = []

    for epoch in range(num_epochs):
        # Compute current loss
        loss = compute_loss(X, y, w)
        losses.append(loss)

        # Compute gradient
        gradient = compute_gradient(X, y, w)

        # Update weights
        w = w - learning_rate * gradient

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss = {loss:.4f}")

    return w, losses
Step 6: Train the Model
# Normalize features for better convergence
X_normalized = X.copy()
X_normalized[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

# Train
weights, loss_history = train_linear_regression(X_normalized, y, learning_rate=0.1, num_epochs=500)

print("\nFinal weights:")
print(f"  Bias:      {weights[0]:.4f}")
print(f"  TV:        {weights[1]:.4f}")
print(f"  Radio:     {weights[2]:.4f}")
print(f"  Newspaper: {weights[3]:.4f}")
Using scikit-learn (The Professional Way)
In practice, we use libraries that handle all the details:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    advertising, revenue, test_size=0.2, random_state=42
)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared Score: {r2:.4f}")

print("\nLearned Coefficients:")
print(f"  Intercept: {model.intercept_:.4f}")
print(f"  TV:        {model.coef_[0]:.4f}")
print(f"  Radio:     {model.coef_[1]:.4f}")
print(f"  Newspaper: {model.coef_[2]:.4f}")
Interpreting the Results
The coefficients tell a story:
TV coefficient: 0.046
Radio coefficient: 0.188
Newspaper coefficient: 0.003
Insights:
Every $1k spent on Radio returns ~$188 in revenue (best ROI!)
Every $1k spent on TV returns ~$46
Newspaper ads have almost no impact
Business decision : Shift budget from newspaper to radio!
For linear regression, there’s actually a formula that gives the optimal weights directly, without gradient descent:
$$w = (X^T X)^{-1} X^T y$$
This is called the Normal Equation . It comes from calculus - setting the gradient to zero and solving.
def linear_regression_closed_form(X, y):
    """Compute optimal weights using the normal equation."""
    # w = (X^T X)^(-1) X^T y
    XtX = X.T @ X
    XtX_inv = np.linalg.inv(XtX)
    Xty = X.T @ y
    w = XtX_inv @ Xty
    return w

# On the same normalized features, this gives the same answer as gradient descent!
w_closed = linear_regression_closed_form(X_normalized, y)
print("Closed-form solution:", w_closed)
When to use which?
Normal Equation: fast for small datasets; exact solution
Gradient Descent: better for large datasets; more flexible
The matrix inversion in the normal equation is O(n³) in the number of features n, which becomes slow for millions of features.
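In practice, libraries rarely form $(X^T X)^{-1}$ explicitly; a least-squares solver or the pseudoinverse is more numerically stable. A minimal sketch with NumPy, reusing the X and y from Step 1:

# Solves the same least-squares problem without an explicit matrix inverse
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("Least-squares solution:", w_lstsq)

# The pseudoinverse also copes with a singular or ill-conditioned X^T X
w_pinv = np.linalg.pinv(X) @ y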
Real-World Example: House Price Prediction
Let’s build a proper house price predictor using real data:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
# Load California housing data
housing = fetch_california_housing()
X, y = housing.data, housing.target

print("Features:", housing.feature_names)
print("X shape:", X.shape)  # (20640, 8) - 20k houses, 8 features

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features (important for many algorithms!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate (the target is in units of $100,000, hence the conversion)
y_pred = model.predict(X_test_scaled)
print(f"\nR-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)) * 100000:.2f}")

# Feature importance
print("\nFeature Importance (coefficients):")
for name, coef in zip(housing.feature_names, model.coef_):
    print(f"  {name}: {coef:.4f}")
Common Pitfalls and Solutions
Pitfall 1: Features on Different Scales
# BAD: Sqft ranges 500-5000, bedrooms 1-6
# The model will be biased toward sqft
# GOOD: Standardize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Pitfall 2: Multicollinearity
When features are highly correlated, coefficients become unstable.
# Check correlations
import pandas as pd

df = pd.DataFrame(X, columns=feature_names)  # feature_names: your list of column names
print(df.corr())

# Solution: Use Ridge regression or drop correlated features
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)  # Adds regularization
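To see the instability for yourself, here is a small self-contained sketch with synthetic data where two features are nearly identical copies (all numbers are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # almost a duplicate of x1
X_corr = np.column_stack([x1, x2])
y_corr = 3 * x1 + rng.normal(scale=0.1, size=200)

# Plain OLS tends to split the shared effect erratically between the near-duplicates;
# Ridge shares it more evenly (roughly 1.5 each) and is far less sensitive
print("OLS coefficients:  ", LinearRegression().fit(X_corr, y_corr).coef_)
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X_corr, y_corr).coef_)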
Pitfall 3: Overfitting
When the model memorizes training data but fails on new data.
# Always evaluate on held-out test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)

print("Train R2:", model.score(X_train, y_train))
print("Test R2:", model.score(X_test, y_test))
# If train >> test, you're overfitting!
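A single train/test split can get lucky or unlucky; cross-validation averages over several splits and gives a more reliable read. A short sketch:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold CV: train on 4 folds, score on the held-out fold, rotate 5 times
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"CV R2: {scores.mean():.3f} +/- {scores.std():.3f}")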
The Complete Linear Regression Workflow
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
def linear_regression_pipeline(X, y, feature_names=None):
    """
    Complete linear regression workflow: split, scale, train, evaluate.
    """
    # 1. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # 2. Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # 3. Train model
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)

    # 4. Evaluate
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)

    print("=== Model Evaluation ===")
    print(f"Train R2: {r2_score(y_train, y_pred_train):.4f}")
    print(f"Test R2:  {r2_score(y_test, y_pred_test):.4f}")
    print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):.4f}")

    # 5. Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    print(f"\nCross-validation R2: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

    # 6. Feature importance
    if feature_names is not None:
        print("\n=== Feature Importance ===")
        importance = pd.DataFrame({
            'feature': feature_names,
            'coefficient': model.coef_
        }).sort_values('coefficient', key=abs, ascending=False)
        print(importance.to_string(index=False))

    return model, scaler
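For example, you could run it on the California housing data from earlier (a usage sketch; it assumes fetch_california_housing is available):

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
model, scaler = linear_regression_pipeline(
    housing.data, housing.target, feature_names=list(housing.feature_names)
)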
Key Takeaways
Linear = Weighted Sum: y = w0 + w1⋅x1 + w2⋅x2 + …
MSE Loss: measures the average squared error
Scale Your Features: normalize for better training
Evaluate on Test Data: always hold out some data
🚀 Mini Projects
Project 1: Build a salary prediction model
Project 2: Real estate price predictor with feature engineering
Project 3: Model comparison and selection pipeline
What’s Next?
Linear regression predicts continuous numbers. But what if you want to predict categories ?
Is this email spam or not spam?
Will this customer churn or stay?
Is this tumor malignant or benign?
That’s classification - the subject of our next module!
Continue to Module 4: Classification, where you'll learn to predict categories with logistic regression and beyond.
🔗 Math → ML Connection Summary
Where the math you learned powers linear regression:

| Math Concept | Where It Appears | What It Does |
|---|---|---|
| Vectors (Module ) | Feature vector $\mathbf{x}$ | Represents each data point |
| Dot Product (Module ) | $\mathbf{w} \cdot \mathbf{x}$ | Core of the linear model prediction |
| Matrix Multiplication (Module ) | $X\mathbf{w} = \hat{y}$ | Batch predictions for all samples |
| Gradient (Module ) | $\nabla_w L$ | Direction to update weights |
| Chain Rule (Module ) | Backpropagation | Compute gradients efficiently |
| Mean/Variance (Module ) | Feature scaling | Standardization for better training |
| Normal Distribution (Module ) | Error distribution | Justifies using MSE loss |
Bottom line : Linear regression is the intersection of all three math courses. Master this, and neural networks become “just deeper linear regression with nonlinearities.”
🚀 Going Deeper: The Mathematics of Linear Regression
Want to understand the theory? Here's what's happening under the hood.

Why Gradient Descent Works
The MSE loss is convex (bowl-shaped), meaning:
There’s exactly one minimum
Gradient descent is guaranteed to find it
Step size (learning rate) affects speed but not destination
$$L(w) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2$$

Taking the gradient:

$$\nabla_w L = -\frac{2}{n}X^T(y - X\mathbf{w})$$

Setting it to zero gives the normal equation:

$$\mathbf{w}^* = (X^TX)^{-1}X^Ty$$

The Statistical Interpretation
Under the assumption that the errors are normally distributed,

$$y = \mathbf{w}^T\mathbf{x} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2),$$

Maximum Likelihood Estimation (MLE) of $\mathbf{w}$ is equivalent to minimizing the MSE!
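You can check the "gradient is zero at the optimum" claim numerically, reusing the X and y defined earlier in this module (a quick sketch; np.linalg.pinv plays the role of the normal equation here):

# At the least-squares optimum, the MSE gradient should essentially vanish
w_star = np.linalg.pinv(X) @ y
gradient_at_optimum = (2 / len(y)) * X.T @ (X @ w_star - y)
print("Max |gradient| at optimum:", np.abs(gradient_at_optimum).max())   # ~0 up to rounding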
Regularization from a Bayesian View
Ridge Regression (L2): assumes the weights have a Gaussian prior $\mathbf{w} \sim \mathcal{N}(0, \tau^2 I)$
Lasso (L1): assumes the weights have a Laplacian prior → promotes sparsity
This connection between regularization and Bayesian priors is why regularization prevents overfitting!

Recommended Deep-Dive Resources