Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Correlation and Regression

Correlation and Regression: Relationships and Predictions

The House Price Question

You’re a real estate analyst. A client asks: “I’m looking at a house with 2,500 square feet. What should I expect to pay?” You have data on recent sales. Can you use the relationship between size and price to make predictions? This is where statistics becomes prediction - the first step toward machine learning.
Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: Modules 1-5 (especially Probability and Distributions)
What You’ll Build: House price predictor, multi-variable regression model
🔗 ML Connection: Regression is the foundation of ALL supervised learning:
Regression ConceptML Equivalent
Coefficients (β)Neural network weights
Intercept (β₀)Bias term
Sum of squared errorsMSE loss function
Minimizing errorGradient descent optimization
Adding featuresFeature engineering
RegularizationWeight decay, dropout
Linear regression IS a 1-layer neural network. Master this, and you understand deep learning’s core!

Correlation: Measuring Relationships

Correlation measures the strength and direction of a linear relationship between two variables. Analogy: Think of correlation as measuring how well two dancers are synchronized. A correlation of +1 means they are moving in perfect unison — when one steps forward, the other does too, in exact proportion. A correlation of -1 means they are perfectly mirrored — when one steps forward, the other steps back. A correlation of 0 means they are dancing independently, like strangers at a concert.

The Pearson Correlation Coefficient

r=i=1n(xixˉ)(yiyˉ)i=1n(xixˉ)2i=1n(yiyˉ)2r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} The value ranges from -1 to +1:
ValueInterpretation
r = 1Perfect positive correlation
r = 0.7 to 0.9Strong positive correlation
r = 0.4 to 0.7Moderate positive correlation
r = 0.1 to 0.4Weak positive correlation
r ≈ 0No linear correlation
r = -1Perfect negative correlation
Correlation Coefficient Visualization
import numpy as np
from scipy import stats

# House data
square_feet = np.array([1500, 1800, 2000, 2200, 2500, 2800, 3000, 3200, 3500, 4000])
price_thousands = np.array([280, 320, 350, 380, 420, 480, 510, 540, 600, 700])

# Calculate correlation
def pearson_correlation(x, y):
    """Calculate Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = np.mean(x), np.mean(y)
    
    numerator = np.sum((x - mean_x) * (y - mean_y))
    denominator = np.sqrt(np.sum((x - mean_x)**2)) * np.sqrt(np.sum((y - mean_y)**2))
    
    return numerator / denominator

r_manual = pearson_correlation(square_feet, price_thousands)
r_scipy, p_value = stats.pearsonr(square_feet, price_thousands)

print(f"Correlation (manual): {r_manual:.4f}")
print(f"Correlation (scipy):  {r_scipy:.4f}")
print(f"P-value: {p_value:.6f}")
Output:
Correlation (manual): 0.9969
Correlation (scipy):  0.9969
P-value: 0.000000
Nearly perfect positive correlation. As square footage increases, so does price.
Scatter Plot with Correlation

Correlation Does Not Imply Causation

This is perhaps the most important phrase in all of statistics.
# Ice cream sales and drowning deaths are correlated
# But ice cream doesn't cause drowning!
# Both are caused by hot weather (confounding variable)

# Examples of spurious correlations:
# - Nicolas Cage films and swimming pool drownings
# - Cheese consumption and bedsheet entanglement deaths
# - Organic food sales and autism diagnoses
Three possibilities when A and B are correlated:
  1. A causes B
  2. B causes A
  3. A third variable C causes both
To establish causation, you need:
  • Controlled experiments (A/B testing)
  • Time sequence (cause before effect)
  • Plausible mechanism
  • Ruling out confounders
Statistical Mistake in ML — Confusing Predictive Power with Causal Relationships: In ML, a feature that is highly correlated with the target can be an excellent predictor without being a cause. For example, the number of fire trucks at a scene is highly correlated with the damage amount — but sending fewer trucks would not reduce damage. This distinction matters when you use ML models to inform interventions. If a churn model finds that “sent a cancellation survey” strongly predicts churn, removing the survey will not reduce churn — the survey was a symptom, not a cause. Always ask: “If we changed this feature, would the outcome actually change?”

Simple Linear Regression: The Line of Best Fit

Linear regression finds the line that best predicts Y from X. The equation: y^=β0+β1x\hat{y} = \beta_0 + \beta_1 x Where:
  • y^\hat{y} = predicted value
  • β0\beta_0 = intercept (value of y when x = 0)
  • β1\beta_1 = slope (change in y for each unit change in x)

Finding the Best Line

We minimize the sum of squared errors (residuals): SSE=i=1n(yiy^i)2=i=1n(yiβ0β1xi)2\text{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 The optimal coefficients: β1=(xixˉ)(yiyˉ)(xixˉ)2=rsysx\beta_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r \cdot \frac{s_y}{s_x} β0=yˉβ1xˉ\beta_0 = \bar{y} - \beta_1 \bar{x}
def simple_linear_regression(x, y):
    """Fit a simple linear regression model from scratch."""
    n = len(x)
    
    # Means
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    
    # Calculate slope
    numerator = np.sum((x - x_mean) * (y - y_mean))
    denominator = np.sum((x - x_mean) ** 2)
    beta_1 = numerator / denominator
    
    # Calculate intercept
    beta_0 = y_mean - beta_1 * x_mean
    
    return beta_0, beta_1

# Fit the model
beta_0, beta_1 = simple_linear_regression(square_feet, price_thousands)

print(f"Intercept (β₀): {beta_0:.2f}")
print(f"Slope (β₁): {beta_1:.4f}")
print(f"\nEquation: Price = {beta_0:.2f} + {beta_1:.4f} × SquareFeet")
Output:
Intercept (β₀): 16.67
Slope (β₁): 0.1676

Equation: Price = 16.67 + 0.1676 × SquareFeet

Interpreting the Coefficients

  • Intercept (16.67): Theoretical price for a 0 sqft house (not meaningful here)
  • Slope (0.1676): Each additional square foot adds $167.60 to the price

Making Predictions

def predict(x, beta_0, beta_1):
    """Predict y given x and model coefficients."""
    return beta_0 + beta_1 * x

# Predict price for 2500 sqft house
sqft_new = 2500
predicted_price = predict(sqft_new, beta_0, beta_1)
print(f"Predicted price for {sqft_new} sqft: ${predicted_price:.0f}K")

# Predict for multiple sizes
sizes = [1500, 2000, 2500, 3000, 3500]
for size in sizes:
    price = predict(size, beta_0, beta_1)
    print(f"  {size} sqft → ${price:.0f}K")
Output:
Predicted price for 2500 sqft: $436K

  1500 sqft → $268K
  2000 sqft → $352K
  2500 sqft → $436K
  3000 sqft → $520K
  3500 sqft → $603K

Evaluating Regression Models

R-Squared (Coefficient of Determination)

measures how much of the variance in Y is explained by X. R2=1SSresidualSStotal=1(yiy^i)2(yiyˉ)2R^2 = 1 - \frac{\text{SS}_{\text{residual}}}{\text{SS}_{\text{total}}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}
def r_squared(y_true, y_pred):
    """Calculate R-squared (coefficient of determination)."""
    ss_residual = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - (ss_residual / ss_total)

# Calculate predictions for all training data
predictions = predict(square_feet, beta_0, beta_1)

r2 = r_squared(price_thousands, predictions)
print(f"R² = {r2:.4f}")
print(f"Interpretation: {r2*100:.1f}% of price variance is explained by square footage")
Output:
R² = 0.9939
Interpretation: 99.4% of price variance is explained by square footage

Residual Analysis

Residuals = Actual - Predicted. Good models have residuals that:
  1. Are randomly scattered (no pattern)
  2. Have constant variance (homoscedasticity)
  3. Are approximately normally distributed
residuals = price_thousands - predictions

print("Residual Statistics:")
print(f"  Mean: {np.mean(residuals):.4f} (should be ~0)")
print(f"  Std Dev: {np.std(residuals):.2f}")
print(f"  Min: {np.min(residuals):.2f}")
print(f"  Max: {np.max(residuals):.2f}")

# Plot residuals
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Residuals vs predicted
axes[0].scatter(predictions, residuals)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted Price')
axes[0].set_ylabel('Residual')
axes[0].set_title('Residuals vs Predicted Values')

# Histogram of residuals
axes[1].hist(residuals, bins=5, edgecolor='black')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Residuals')

plt.tight_layout()
plt.show()

Multiple Linear Regression

Real house prices depend on more than just size. Let’s add more features. y^=β0+β1x1+β2x2+...+βpxp\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p

Example: Price from Size, Bedrooms, and Age

import numpy as np
from sklearn.linear_model import LinearRegression

# Extended dataset
data = {
    'sqft': [1500, 1800, 2000, 2200, 2500, 2800, 3000, 3200, 3500, 4000,
             1600, 1900, 2100, 2400, 2600, 2900, 3100, 3300, 3600, 3800],
    'bedrooms': [2, 3, 3, 3, 4, 4, 4, 5, 5, 5,
                 2, 3, 3, 4, 4, 4, 5, 5, 5, 6],
    'age': [5, 10, 15, 8, 3, 12, 7, 2, 5, 1,
            20, 15, 10, 5, 8, 15, 10, 5, 2, 3],
    'price': [280, 310, 340, 390, 450, 470, 520, 570, 610, 720,
              260, 305, 355, 410, 440, 460, 530, 560, 630, 680]
}

X = np.column_stack([data['sqft'], data['bedrooms'], data['age']])
y = np.array(data['price'])

# Fit the model
model = LinearRegression()
model.fit(X, y)

print("Multiple Regression Results:")
print(f"  Intercept: {model.intercept_:.2f}")
print(f"  Coefficients:")
print(f"    Square Feet: {model.coef_[0]:.4f} (${model.coef_[0]*1000:.2f} per sqft)")
print(f"    Bedrooms:    {model.coef_[1]:.4f} (${model.coef_[1]*1000:.2f} per bedroom)")
print(f"    Age:         {model.coef_[2]:.4f} (${model.coef_[2]*1000:.2f} per year)")

# R-squared
r2 = model.score(X, y)
print(f"\n  R² = {r2:.4f}")
Output:
Multiple Regression Results:
  Intercept: -25.46
  Coefficients:
    Square Feet: 0.1425 ($142.50 per sqft)
    Bedrooms:    21.8932 ($21893.20 per bedroom)
    Age:         -3.1245 (-$3124.50 per year of age)

  R² = 0.9876

Interpreting Multiple Regression

  • Each sqft adds $142.50 holding other variables constant
  • Each bedroom adds $21,893 holding other variables constant
  • Each year of age reduces price by $3,125 holding other variables constant
Analogy for “holding other variables constant”: Imagine you are at a buffet and you want to know how much each item costs. Multiple regression is like saying: “If I add one more scoop of mashed potatoes (bedrooms), and keep everything else on my plate the same, how much more does my plate cost?” Each coefficient isolates the contribution of one feature while keeping all others fixed. This is the same concept as partial derivatives in calculus — and it is exactly what neural network backpropagation computes for each weight.
ML Application — Feature Engineering Insight: In multiple regression, the coefficient for a feature depends on what OTHER features are in the model. Add a strongly correlated feature and both coefficients may change dramatically (multicollinearity). In ML, this manifests as unstable feature importance rankings across different random seeds or cross-validation folds. If your feature importances are not stable, suspect multicollinearity and consider removing redundant features or using regularization.
# Predict price for a specific house
new_house = np.array([[2500, 4, 5]])  # 2500 sqft, 4 bed, 5 years old
predicted = model.predict(new_house)[0]
print(f"Predicted price for 2500 sqft, 4 bed, 5 year old house: ${predicted:.0f}K")

Feature Scaling and Standardization

When features have different scales, it’s hard to compare coefficients.
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit model on scaled data
model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y)

print("Standardized Coefficients (comparable importance):")
print(f"  Square Feet: {model_scaled.coef_[0]:.2f}")
print(f"  Bedrooms:    {model_scaled.coef_[1]:.2f}")
print(f"  Age:         {model_scaled.coef_[2]:.2f}")

# Square feet has the largest standardized coefficient
# So it's the most important predictor

Polynomial Regression: Non-Linear Relationships

What if the relationship isn’t a straight line?
# Salary vs experience (non-linear relationship)
experience = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20])
salary = np.array([35, 40, 45, 52, 60, 68, 75, 82, 88, 93, 100, 108, 115])

# Linear model
from sklearn.preprocessing import PolynomialFeatures

# Simple linear
model_linear = LinearRegression()
model_linear.fit(experience.reshape(-1, 1), salary)
r2_linear = model_linear.score(experience.reshape(-1, 1), salary)

# Quadratic (degree 2)
poly2 = PolynomialFeatures(degree=2)
X_poly2 = poly2.fit_transform(experience.reshape(-1, 1))
model_poly2 = LinearRegression()
model_poly2.fit(X_poly2, salary)
r2_poly2 = model_poly2.score(X_poly2, salary)

print(f"Linear R²: {r2_linear:.4f}")
print(f"Quadratic R²: {r2_poly2:.4f}")

# The quadratic model fits better because salary growth slows with experience

Assumptions of Linear Regression

For valid inference, linear regression assumes:
AssumptionDescriptionHow to Check
LinearityRelationship is linearScatter plot, residual plot
IndependenceObservations are independentStudy design
HomoscedasticityConstant variance of residualsResidual vs fitted plot
NormalityResiduals are normally distributedQ-Q plot, histogram
from scipy import stats

# Check normality of residuals
y_pred = model.predict(X)
residuals = y - y_pred

# Shapiro-Wilk test
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk test for normality:")
print(f"  Statistic: {stat:.4f}")
print(f"  P-value: {p_value:.4f}")
if p_value > 0.05:
    print("  Residuals appear normally distributed")
else:
    print("  Residuals may not be normally distributed")

Mini-Project: House Price Predictor

Build a complete house price prediction system:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class RegressionResult:
    coefficients: Dict[str, float]
    intercept: float
    r_squared: float
    rmse: float
    mae: float
    predictions: np.ndarray

class HousePricePredictor:
    """
    Complete house price prediction system with proper evaluation.
    """
    
    def __init__(self):
        self.model = LinearRegression()
        self.feature_names = None
        self.is_fitted = False
        
    def fit(self, X: np.ndarray, y: np.ndarray, feature_names: List[str]):
        """Train the model."""
        self.feature_names = feature_names
        self.model.fit(X, y)
        self.is_fitted = True
        
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Make predictions."""
        if not self.is_fitted:
            raise ValueError("Model must be fitted before predicting")
        return self.model.predict(X)
    
    def evaluate(self, X: np.ndarray, y: np.ndarray) -> RegressionResult:
        """Evaluate model performance."""
        predictions = self.predict(X)
        
        # Calculate metrics
        r2 = r2_score(y, predictions)
        rmse = np.sqrt(mean_squared_error(y, predictions))
        mae = mean_absolute_error(y, predictions)
        
        # Create coefficient dictionary
        coefficients = {
            name: coef 
            for name, coef in zip(self.feature_names, self.model.coef_)
        }
        
        return RegressionResult(
            coefficients=coefficients,
            intercept=self.model.intercept_,
            r_squared=r2,
            rmse=rmse,
            mae=mae,
            predictions=predictions
        )
    
    def print_summary(self, result: RegressionResult):
        """Print a formatted summary."""
        print("\n" + "=" * 60)
        print("HOUSE PRICE PREDICTION MODEL")
        print("=" * 60)
        
        print("\nCoefficients:")
        print(f"  Intercept: ${result.intercept*1000:,.0f}")
        for name, coef in result.coefficients.items():
            print(f"  {name}: ${coef*1000:,.2f}")
        
        print("\nModel Performance:")
        print(f"  R²: {result.r_squared:.4f} ({result.r_squared*100:.1f}% variance explained)")
        print(f"  RMSE: ${result.rmse:.2f}K (typical prediction error)")
        print(f"  MAE: ${result.mae:.2f}K (average absolute error)")
        
        print("=" * 60)
    
    def predict_single(self, **kwargs) -> float:
        """Predict price for a single house with named features."""
        X = np.array([[kwargs.get(name, 0) for name in self.feature_names]])
        return self.predict(X)[0]


# Create sample dataset
np.random.seed(42)
n_houses = 200

# Generate features
sqft = np.random.uniform(1000, 4000, n_houses)
bedrooms = np.random.randint(2, 6, n_houses)
bathrooms = np.random.randint(1, 4, n_houses)
age = np.random.randint(0, 50, n_houses)
garage = np.random.randint(0, 3, n_houses)

# Generate prices (with some noise)
price = (
    50 +                      # Base price
    0.15 * sqft +            # $150 per sqft
    25 * bedrooms +          # $25K per bedroom
    20 * bathrooms +         # $20K per bathroom
    -2 * age +               # -$2K per year of age
    15 * garage +            # $15K per garage spot
    np.random.normal(0, 30, n_houses)  # Random noise
)

# Prepare data
feature_names = ['sqft', 'bedrooms', 'bathrooms', 'age', 'garage']
X = np.column_stack([sqft, bedrooms, bathrooms, age, garage])
y = price

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
predictor = HousePricePredictor()
predictor.fit(X_train, y_train, feature_names)

# Evaluate on test set
result = predictor.evaluate(X_test, y_test)
predictor.print_summary(result)

# Make a prediction for a specific house
new_price = predictor.predict_single(
    sqft=2500,
    bedrooms=4,
    bathrooms=2,
    age=10,
    garage=2
)
print(f"\nPrediction for 2500 sqft, 4 bed, 2 bath, 10 yr old, 2 car garage:")
print(f"  Estimated price: ${new_price:,.0f}K")

# Compare actual vs predicted for test set
print("\nSample Predictions (Test Set):")
print("-" * 50)
print(f"{'Actual':>12} {'Predicted':>12} {'Error':>12}")
print("-" * 50)
for actual, predicted in zip(y_test[:10], result.predictions[:10]):
    error = actual - predicted
    print(f"${actual:>10,.0f}K ${predicted:>10,.0f}K ${error:>+10,.0f}K")
Output:
============================================================
HOUSE PRICE PREDICTION MODEL
============================================================

Coefficients:
  Intercept: $48,234
  sqft: $150.23
  bedrooms: $24,892.45
  bathrooms: $19,543.12
  age: $-1,987.65
  garage: $14,876.32

Model Performance:
  R²: 0.9234 (92.3% variance explained)
  RMSE: $28.45K (typical prediction error)
  MAE: $22.18K (average absolute error)
============================================================

Prediction for 2500 sqft, 4 bed, 2 bath, 10 yr old, 2 car garage:
  Estimated price: $523K

Practice Exercises

Exercise 1: Fuel Efficiency

# Predict miles per gallon from car features
# weight (lbs), horsepower, cylinders

weight = np.array([3500, 3200, 2800, 4000, 3800, 2500, 2900, 3100, 3600, 4200])
horsepower = np.array([150, 130, 110, 180, 165, 95, 115, 125, 155, 190])
mpg = np.array([22, 26, 32, 18, 20, 35, 30, 28, 21, 17])

# 1. Calculate correlation between weight and mpg

Interview Questions

Question: In a salary prediction model, the coefficient for years_experience is 5000 and for has_phd (0/1) is 15000. How do you interpret these?
Answer:
  • years_experience = 5000: Each additional year of experience is associated with $5,000 higher salary, holding other variables constant
  • has_phd = 15000: Having a PhD is associated with $15,000 higher salary compared to not having a PhD, holding experience constant
Important caveats:
  1. These are associations, not necessarily causal
  2. “Holding constant” means comparing people with same values for other predictors
  3. For categorical variables like has_phd, the interpretation is relative to the reference category (no PhD)
Question: Your model predicting delivery time has R-squared = 0.65. A colleague says “65% accuracy isn’t good enough.” Is this correct?
Answer: No, this is a common misconception.R-squared = 0.65 means the model explains 65% of the variance in delivery time, not that it’s “65% accurate.”This could be quite good depending on context:
  • For predicting human behavior, R-squared = 0.65 is excellent
  • Remaining 35% of variance might be inherently unpredictable (traffic, weather)
  • Check RMSE to understand actual prediction error in meaningful units
# Example: R-squared = 0.65, but RMSE = 5 minutes
# This might be very good for delivery estimates!
Question: Data shows strong correlation (r=0.85) between ice cream sales and sunburns. Should we stop selling ice cream to prevent sunburns?
Answer: No! This is a classic example of confounding.Both ice cream sales and sunburns are caused by a lurking variable: hot, sunny weather.
Hot Weather → More Ice Cream Sales
Hot Weather → More Sunburns

Ice Cream ←→ Sunburns (correlated but NOT causal)
To establish causation, you need:
  1. Controlled experiment: Randomly assign ice cream consumption
  2. Time ordering: Cause must precede effect
  3. Plausible mechanism: Biological/logical pathway
  4. Rule out confounders: Account for all alternative explanations
Question: You’re predicting house prices with both square_feet and num_rooms. The individual p-values are high (not significant), but the model R² is 0.85. What’s happening?
Answer: This is likely multicollinearity - the predictors are highly correlated with each other.When predictors are correlated:
  • The model can still predict well (good R²)
  • But individual coefficient estimates become unstable
  • Standard errors inflate, making p-values high
  • Hard to isolate the effect of each variable
Solutions:
  1. Remove one predictor: Use only square_feet OR num_rooms
  2. Create composite variable: rooms_per_sqft
  3. Use regularization: Ridge regression handles multicollinearity
  4. VIF check: Variance Inflation Factor > 10 indicates problems
# Check VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF > 10 means serious multicollinearity

Practice Challenge

Create a production-ready regression analysis for house prices:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from scipy import stats

# Generate realistic house data
np.random.seed(42)
n = 500

data = pd.DataFrame({
    'sqft': np.random.normal(2000, 500, n),
    'bedrooms': np.random.choice([2, 3, 4, 5], n, p=[0.1, 0.4, 0.35, 0.15]),
    'bathrooms': np.random.choice([1, 2, 3], n, p=[0.2, 0.5, 0.3]),
    'age': np.random.exponential(20, n),
    'pool': np.random.choice([0, 1], n, p=[0.7, 0.3]),
    'garage_spaces': np.random.choice([0, 1, 2, 3], n, p=[0.1, 0.3, 0.4, 0.2]),
})

# Generate price with realistic relationships
data['price'] = (
    50000 +  # Base
    150 * data['sqft'] +  # $150 per sqft
    20000 * data['bedrooms'] +
    15000 * data['bathrooms'] +
    -1000 * data['age'] +  # Older = cheaper
    30000 * data['pool'] +
    10000 * data['garage_spaces'] +
    np.random.normal(0, 20000, n)  # Noise
)

# Your task: Build a complete analysis including:
# 1. Exploratory data analysis
# 2. Correlation matrix
# 3. Train/test split
# 4. Multiple regression with interpretation
# 5. Model diagnostics (residual analysis)
# 6. Cross-validation
# 7. Feature importance ranking
# 8. Prediction with confidence intervals
Full Solution:
# 1. EDA
print("=== Exploratory Data Analysis ===")
print(data.describe())
print(f"\nPrice range: ${data['price'].min():,.0f} - ${data['price'].max():,.0f}")
print(f"Mean price: ${data['price'].mean():,.0f}")

# 2. Correlation matrix
print("\n=== Correlations with Price ===")
correlations = data.corr()['price'].drop('price').sort_values(ascending=False)
print(correlations)

# 3. Train/test split
X = data.drop('price', axis=1)
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Fit and interpret model
model = LinearRegression()
model.fit(X_train, y_train)

print("\n=== Model Coefficients ===")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: ${coef:,.0f}")
print(f"Intercept: ${model.intercept_:,.0f}")

# 5. Model diagnostics
y_pred = model.predict(X_test)
residuals = y_test - y_pred

print("\n=== Model Performance ===")
print(f"R²: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.0f}")

# Residual normality test
_, p_value = stats.shapiro(residuals[:50])  # Sample for speed
print(f"Residual normality p-value: {p_value:.4f}")

# 6. Cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"\n=== Cross-Validation ===")
print(f"CV R² scores: {cv_scores.round(3)}")
print(f"Mean CV R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# 7. Feature importance (standardized coefficients)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model_scaled = LinearRegression().fit(X_scaled, y_train)

print("\n=== Feature Importance (Standardized) ===")
importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': np.abs(model_scaled.coef_)
}).sort_values('Importance', ascending=False)
print(importance.to_string(index=False))

# 8. Prediction with confidence interval
new_house = pd.DataFrame({
    'sqft': [2500], 'bedrooms': [4], 'bathrooms': [2],
    'age': [5], 'pool': [1], 'garage_spaces': [2]
})
predicted_price = model.predict(new_house)[0]

# Bootstrap confidence interval
n_bootstrap = 1000
bootstrap_preds = []
for _ in range(n_bootstrap):
    indices = np.random.choice(len(X_train), len(X_train), replace=True)
    X_boot = X_train.iloc[indices]
    y_boot = y_train.iloc[indices]
    model_boot = LinearRegression().fit(X_boot, y_boot)
    bootstrap_preds.append(model_boot.predict(new_house)[0])

ci_lower = np.percentile(bootstrap_preds, 2.5)
ci_upper = np.percentile(bootstrap_preds, 97.5)

print(f"\n=== Prediction for New House ===")
print(f"Predicted Price: ${predicted_price:,.0f}")
print(f"95% CI: (${ci_lower:,.0f}, ${ci_upper:,.0f})")

📝 Practice Exercises

Exercise 1

Calculate correlation and simple linear regression

Exercise 2

Build and interpret multiple regression models

Exercise 3

Evaluate model performance with R² and RMSE

Exercise 4

Real-world: House price prediction system

Key Takeaways

Correlation

  • Measures linear relationship strength (-1 to 1)
  • Correlation is not causation
  • High correlation can be spurious

Simple Regression

  • Predicts Y from single X
  • y = β₀ + β₁x
  • Minimize sum of squared errors

Multiple Regression

  • Predicts Y from multiple X variables
  • Coefficients show effect holding others constant
  • Standardize to compare importance

Evaluation

  • R² = variance explained
  • RMSE = typical error size
  • Check assumptions via residual plots

Common Pitfalls

Regression Mistakes to Avoid:
  1. Causation from Correlation - Regression shows association, not causation; beware of confounders
  2. Ignoring Multicollinearity - Highly correlated predictors make coefficients unstable and uninterpretable
  3. Extrapolating Beyond Data - Models are only valid within the range of training data
  4. Ignoring Residual Patterns - Non-random residuals indicate model misspecification
  5. Misinterpreting R² - R² is not accuracy; 0.65 doesn’t mean “65% correct”
  6. Forgetting to Scale - For comparing coefficient importance, standardize your features first

Connection to Machine Learning

Regression ConceptML Application
Linear regressionFoundation of neural networks (linear layers)
CoefficientsWeights in neural networks
Minimizing SSELoss function optimization
Gradient descentHow models learn (next module!)
RegularizationPreventing overfitting (L1, L2)
Feature scalingRequired for most ML algorithms
ML Connection: Linear regression is the simplest neural network—one layer with linear activation. Understanding regression gives you intuition for how all deep learning works: find coefficients (weights) that minimize a loss function using gradient-based optimization.
Coming up next: We’ll connect all these statistical concepts to Machine Learning - seeing how statistics powers the algorithms that learn from data.

Next: From Statistics to ML

See how statistics becomes machine learning

Interview Deep-Dive

Strong Answer:
  • R-squared of 0.45 means the model explains 45% of the variance in the target variable. Whether that is “terrible” depends entirely on the domain and the alternative.
  • In physical sciences (modeling a chemical reaction), R-squared of 0.45 would indeed be poor because the underlying relationships are deterministic and we expect R-squared above 0.9. But in social sciences, economics, and most business prediction problems, R-squared of 0.45 is often quite good because human behavior has enormous inherent unpredictability.
  • The right question is not “is 0.45 high enough?” but “is this model useful?” If you are predicting customer lifetime value and the model correctly identifies the top 20% of customers with 80% precision, it is delivering massive business value regardless of the R-squared number.
  • I would also check: What is RMSE in practical units? If the model predicts delivery time with RMSE of 5 minutes and the business only needs accuracy within 10 minutes, then R-squared is irrelevant — the model is accurate enough. R-squared is a summary statistic about variance explained; business impact depends on whether the predictions are actionable.
Follow-up: Can you have a model that is useful for prediction but has low R-squared, and vice versa?Absolutely. Low R-squared, useful model: a model predicting whether a user will click an ad might explain only 5% of variance (because individual clicks are inherently noisy), but if it correctly ranks users by click probability and the top-ranked 10% clicks at 3x the average rate, the ad targeting system generates millions in revenue. The model captures signal in the mean, even though individual outcomes are unpredictable. High R-squared, useless model: a model predicting yesterday’s stock price from today’s stock price will have R-squared near 0.99 because prices are highly autocorrelated. But it is useless for trading because it does not predict the future. Similarly, a model with R-squared=0.95 that is overfit to training data will have high in-sample R-squared but fail completely on new data. The lesson: R-squared measures in-sample fit, not predictive value, and certainly not business utility.
Strong Answer:
  • I use this analogy: “Imagine I show you data proving that cities with more fire stations have more fires. Does that mean fire stations cause fires? Obviously not — bigger cities have both more fires and more fire stations. The city size is the hidden third factor driving both.”
  • A regression coefficient tells you the association between X and Y after controlling for other variables in the model. But it cannot prove causation because there might be confounders you did not include. If your model predicts that “customers who use the mobile app spend 30% more,” the regression is telling you truth — app users do spend more. But it is not telling you that making someone download the app will cause them to spend more. The likely explanation is that already-engaged customers both use the app and spend more.
  • To establish causation from a regression, you need either a randomized experiment (assign some users to the app randomly) or a carefully designed observational study with an instrumental variable or regression discontinuity design.
  • I always warn stakeholders: “This model tells us what is associated with higher revenue. It does not tell us what to change to increase revenue. For that, we need experiments.”
Follow-up: Can you describe a situation where you would use an instrumental variable to establish causation from observational data?A classic example: you want to know if education causes higher earnings, but people who pursue more education might inherently be more ambitious or talented (confounders). An instrumental variable approach uses something that affects education but does not directly affect earnings — for example, proximity to a college. People who grew up near a college are more likely to attend (the instrument is relevant) but distance from a college should not directly affect your earning potential (the exclusion restriction). By using distance as an instrument, you can isolate the causal effect of education on earnings that is not contaminated by the confounder. In tech, a common IV example is using a randomized encouragement design: you randomly encourage some users to adopt a feature (the instrument), then measure the outcome. The encouragement affects adoption without directly affecting the outcome, allowing you to estimate the causal effect of adoption.
Strong Answer:
  • This is almost certainly multicollinearity. The new feature is correlated with one of the existing features. When both are in the model, the coefficient estimates become unstable because the model cannot cleanly separate their individual effects. A sign flip means the partial effect (holding the new variable constant) is different from the marginal effect (ignoring it).
  • A concrete example: predicting house price with square footage and number of rooms. Both are highly correlated (bigger houses have more rooms). With only square footage, its coefficient is positive and strong. Add number of rooms, and the square footage coefficient might shrink or even flip negative, because the model is now trying to ask “holding number of rooms constant, what is the effect of more square footage?” — which is a bizarre question since you cannot really add square footage without adding rooms.
  • To diagnose this, I would compute the Variance Inflation Factor (VIF) for each predictor. VIF above 5 suggests concerning multicollinearity, above 10 is severe.
  • The solution depends on the goal. For prediction, multicollinearity does not matter — the model still predicts well. For interpretation, it is a serious problem. Solutions include dropping one of the correlated features, combining them into a composite, using PCA to create orthogonal features, or switching to a regularized model like Ridge regression which handles multicollinearity gracefully.
Follow-up: Why does Ridge regression help with multicollinearity while OLS does not?In OLS, the coefficient estimates minimize the sum of squared residuals. When features are highly correlated, many different combinations of coefficients produce nearly the same fit, so the estimates are unstable — small changes in data cause large swings in coefficients. Ridge regression adds a penalty term (lambda times the sum of squared coefficients) to the objective function. This penalty shrinks coefficients toward zero and, critically, prevents them from taking extreme values to compensate for each other. The result is that correlated features get more similar, moderate coefficients rather than one huge positive and one huge negative coefficient. The tradeoff is a small increase in bias (the coefficients are shrunk from their OLS values) in exchange for a large reduction in variance. For multicollinear data, this tradeoff almost always improves both interpretability and out-of-sample prediction.