We assume revenue is a weighted combination of ad spends:

$$\text{revenue} = w_0 + w_1 \cdot \text{TV} + w_2 \cdot \text{Radio} + w_3 \cdot \text{Newspaper}$$

Or in matrix notation (see Matrix Operations):

$$\hat{y} = X \cdot w$$

Where:
X is our feature matrix (with a column of 1s for the bias term)
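To make the matrix form concrete, here is a minimal sketch. The spend figures and the weights are made up for illustration; they are not taken from the advertising dataset, and a real model would learn the weights.

```python
import numpy as np

# Hypothetical ad spends (TV, Radio, Newspaper) for three campaigns.
spend = np.array([
    [100.0, 20.0, 10.0],
    [200.0, 10.0, 30.0],
    [150.0, 40.0, 0.0],
])

# Prepend a column of 1s so that w[0] acts as the bias term w0.
X = np.column_stack([np.ones(len(spend)), spend])

# Illustrative weights [w0, w1, w2, w3]; a trained model would supply these.
w = np.array([5.0, 0.05, 0.2, 0.01])

# One matrix-vector product yields the prediction for every row at once.
y_hat = X @ w
print(y_hat)
```

Each entry of `y_hat` is the bias plus the weighted sum of that row's three spends, which is exactly the scalar formula applied row by row.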
import numpy as np

def train_linear_regression(X, y, learning_rate=0.0001, num_epochs=1000):
    """
    Train a linear regression model using gradient descent.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target values (n_samples,)
        learning_rate: Step size for gradient descent
        num_epochs: Number of training iterations

    Returns:
        Tuple of (trained weights, list of per-epoch losses)
    """
    # Initialize weights to zero
    w = np.zeros(X.shape[1])

    # Track loss history for plotting
    losses = []

    for epoch in range(num_epochs):
        # Compute current loss
        loss = compute_loss(X, y, w)
        losses.append(loss)

        # Compute gradient
        gradient = compute_gradient(X, y, w)

        # Update weights: step against the gradient
        w = w - learning_rate * gradient

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss = {loss:.4f}")

    return w, losses
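The training loop calls `compute_loss` and `compute_gradient`, which are not shown in this section. One plausible pair, assuming the loss is mean squared error (an assumption; the definitions used elsewhere in the course may differ by a constant factor), could look like this:

```python
import numpy as np

def compute_loss(X, y, w):
    """Mean squared error of the predictions X @ w (assumed loss)."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

def compute_gradient(X, y, w):
    """Gradient of the mean squared error with respect to w."""
    residuals = X @ w - y
    return (2.0 / len(y)) * (X.T @ residuals)

# Tiny sanity check on data that fits y = 1 + 2x exactly.
X_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_demo = np.array([1.0, 3.0, 5.0])
print(compute_loss(X_demo, y_demo, np.array([1.0, 2.0])))
print(compute_gradient(X_demo, y_demo, np.zeros(2)))
```

At the exact weights the loss is zero and the gradient vanishes; at zero weights the gradient points away from the solution, so stepping against it moves the weights toward the fit.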
In practice, we use libraries that handle all the details — the normal equation, numerical stability, edge cases. Understanding the math (above) is for your brain; scikit-learn is for your production code.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Split data into training and test sets.
# Why 80/20? It's a common default. With very little data, consider
# cross-validation instead (Module 7) to use every row for both
# training and evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    advertising, revenue, test_size=0.2, random_state=42
)

# Create and train the model.
# Under the hood, sklearn's LinearRegression solves the least-squares
# problem directly (via scipy.linalg.lstsq) rather than running
# gradient descent -- it's fast for small datasets.
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on data the model has NEVER seen
y_pred = model.predict(X_test)

# Evaluate: R-squared tells you what fraction of the variance
# in revenue is explained by ad spend. 0.8 means "80% explained."
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared Score: {r2:.4f}")

# The coefficients are the "weights" we've been learning about.
# Each one tells you: "holding everything else constant,
# how much does a $1K increase in this ad channel change revenue?"
print("\nLearned Coefficients:")
print(f"  Intercept: {model.intercept_:.4f}")
print(f"  TV: {model.coef_[0]:.4f}")
print(f"  Radio: {model.coef_[1]:.4f}")
print(f"  Newspaper: {model.coef_[2]:.4f}")
For linear regression, there's actually a formula that gives the optimal weights directly, without gradient descent:

$$w = (X^T X)^{-1} X^T y$$

This is called the Normal Equation. It comes from calculus: setting the gradient to zero and solving. Think of it as the "just give me the answer" approach versus gradient descent's "let me walk there step by step."
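The "set the gradient to zero and solve" step can be written out as follows. This is a sketch that assumes the loss is mean squared error over n samples and that the matrix X^T X is invertible:

```latex
\begin{aligned}
L(w) &= \frac{1}{n}\,\lVert Xw - y \rVert^2 \\
\nabla_w L &= \frac{2}{n}\, X^T (Xw - y) = 0 \\
X^T X\, w &= X^T y \\
w &= (X^T X)^{-1} X^T y
\end{aligned}
```

The constant factor 2/n drops out once the gradient is set to zero, which is why it does not appear in the final formula.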
import numpy as np

def linear_regression_closed_form(X, y):
    """
    Compute optimal weights using the normal equation.

    This solves the system of equations directly -- no iterations,
    no learning rate to tune. The trade-off? It requires computing
    a matrix inverse, which is expensive for many features.
    """
    # w = (X^T X)^(-1) X^T y
    XtX = X.T @ X                  # Shape: (features, features) -- square matrix
    XtX_inv = np.linalg.inv(XtX)   # O(d^3) in the number of features d -- the expensive step
    Xty = X.T @ y                  # Shape: (features,) -- project targets onto features
    w = XtX_inv @ Xty
    return w

# This matches gradient descent once gradient descent has converged.
w_closed = linear_regression_closed_form(X, y)
print("Closed-form solution:", w_closed)
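Forming the inverse explicitly mirrors the formula, but it is not the only way to get the same weights. As a hedged sketch on synthetic data (the data and the variable names are invented for illustration), NumPy's `np.linalg.solve` and `np.linalg.lstsq` reach the same answer without computing the inverse, which is generally the more numerically robust route:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a bias column plus two random features, known weights, light noise.
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + 0.01 * rng.normal(size=50)

# Explicit inverse, exactly as the normal equation is written.
w_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Solve the linear system X^T X w = X^T y without forming the inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver working on X directly; it also copes with rank-deficient X.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_inv)
print(w_solve)
print(w_lstsq)
```

On well-conditioned data like this, all three agree to floating-point precision. The differences show up when features are nearly collinear and X^T X is close to singular.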
When to use which?
| Approach | Best When | Why |
|---|---|---|
| Normal Equation | < 10,000 features | Exact solution, no hyperparameters to tune |
| Gradient Descent | > 10,000 features or very large datasets | O(n·d) per step for n samples and d features, vs O(d^3) for matrix inversion |
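In practice, scikit-learn's LinearRegression always solves the least-squares problem directly; it does not switch to gradient descent for large inputs. When the data is too large for a direct solve, a gradient-based estimator such as SGDRegressor is the usual alternative. You rarely need to worry about this choice, but understanding it helps you debug slow training times.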
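If you do want the gradient-based route inside scikit-learn, `SGDRegressor` is one option. The sketch below uses synthetic data with made-up weights and compares it against the direct solver. The hyperparameters shown are simply the library defaults written out, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data with known weights. Features are standardized because
# stochastic gradient descent is sensitive to feature scale.
X = StandardScaler().fit_transform(rng.normal(size=(500, 2)))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + 0.1 * rng.normal(size=500)

# Direct least-squares solve.
ols = LinearRegression().fit(X, y)

# Stochastic gradient descent on the squared-error loss.
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0).fit(X, y)

print("Direct solver:", ols.intercept_, ols.coef_)
print("SGD:          ", sgd.intercept_, sgd.coef_)
```

The two sets of coefficients should land close to each other but not be identical: SGD is iterative, stops at a tolerance, and applies a small L2 penalty by default.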
# BAD: Sqft ranges 500-5000, bedrooms 1-6
# The raw coefficients end up on very different scales: the
# coefficient for sqft will be tiny (e.g., 0.15) while
# the bedroom coefficient will be huge (e.g., 25000).
# This doesn't affect predictions, but it makes coefficients
# impossible to compare for feature importance.

# GOOD: Standardize features so each has mean=0, std=1.
# Now a 1-unit change in any feature means "1 standard deviation,"
# making coefficients directly comparable.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
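The claim that scaling changes the coefficients but not the predictions can be checked directly. This is a small sketch on invented housing data (the price formula and noise level are assumptions made up for the demonstration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical housing data: square footage and bedroom count.
sqft = rng.uniform(500, 5000, size=200)
bedrooms = rng.integers(1, 7, size=200).astype(float)
X = np.column_stack([sqft, bedrooms])
price = 150.0 * sqft + 25000.0 * bedrooms + rng.normal(0, 10000, size=200)

# Fit once on raw features and once on standardized features.
raw_model = LinearRegression().fit(X, price)
X_scaled = StandardScaler().fit_transform(X)
scaled_model = LinearRegression().fit(X_scaled, price)

print("Raw coefficients:   ", raw_model.coef_)
print("Scaled coefficients:", scaled_model.coef_)

# The two models make the same predictions up to floating-point error.
same_predictions = np.allclose(raw_model.predict(X), scaled_model.predict(X_scaled))
print("Same predictions:", same_predictions)
```

Each scaled coefficient equals the raw coefficient multiplied by that feature's standard deviation, so it reads as "change in price per one standard deviation of the feature."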
When features are highly correlated, coefficients become unstable. Imagine trying to figure out whether it’s the coffee or the sugar making your drink sweet — when they always appear together, it’s hard to tell who deserves the credit.
# Check correlations -- look for values above 0.8 or below -0.8
import pandas as pd

df = pd.DataFrame(X, columns=feature_names)
print(df.corr())

# Solution: Use Ridge regression (L2 regularization) to stabilize
# coefficients, or drop one of the correlated features.
# Ridge adds a penalty that discourages any single weight from
# getting too large, which naturally handles collinearity.
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)  # alpha controls regularization strength
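To see the instability and the Ridge fix side by side, here is a sketch with two almost identical synthetic features. The data is invented; the exact coefficients you get depend on the random seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two nearly identical features: x2 is x1 plus a tiny amount of noise.
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)
X = np.column_stack([x1, x2])

# The target depends only on the signal the two features share.
y = 3.0 * x1 + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

corr = np.corrcoef(x1, x2)[0, 1]
print("Correlation:", corr)
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

Ordinary least squares is free to split the total effect of about 3 between the two features in any way, including large values of opposite sign. Ridge tends to share it roughly evenly, and its coefficient vector is never larger in norm than the unregularized one.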
Overfitting happens when the model memorizes training data but fails on new data. This is like studying only the practice exam and then bombing the real test because the questions are slightly different.
# Always evaluate on held-out test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)

print("Train R2:", model.score(X_train, y_train))
print("Test R2:", model.score(X_test, y_test))

# If train >> test, you're overfitting!
# A 5-10% gap is normal. A 30%+ gap means trouble.
Practical rule of thumb: Linear regression rarely overfits unless you have far more features than samples. If you have 50 features and 100 rows, consider Ridge or Lasso regression (Module 13) to keep things under control.
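The 50-features, 100-rows case is easy to simulate. In this sketch the data is synthetic and only 3 of the 50 features carry any signal; `alpha=10.0` is an arbitrary illustrative value, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 100 rows, 50 features, but only the first 3 carry signal.
X = rng.normal(size=(100, 50))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)

# Compare the train/test gap for the unregularized and regularized models.
for name, model in [("OLS", ols), ("Ridge", ridge)]:
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    print(f"{name}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

With 80 training rows and 50 features, plain least squares has enough freedom to fit noise, so its training score sits well above its test score. The Ridge penalty usually narrows that gap, though how much depends on the data and on `alpha`.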
Bottom line: Linear regression is the intersection of all three math courses. Master this, and neural networks become “just deeper linear regression with nonlinearities.”
🚀 Going Deeper: The Mathematics of Linear Regression
Want to understand the theory? Here’s what’s happening under the hood: