
Correlation and Regression: Relationships and Predictions

The House Price Question

You’re a real estate analyst. A client asks: “I’m looking at a house with 2,500 square feet. What should I expect to pay?” You have data on recent sales. Can you use the relationship between size and price to make predictions? This is where statistics becomes prediction - the first step toward machine learning.
Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: Modules 1-5 (especially Probability and Distributions)
What You’ll Build: House price predictor, multi-variable regression model
🔗 ML Connection: Regression is the foundation of ALL supervised learning:
Regression Concept       ML Equivalent
------------------       -------------
Coefficients (β)         Neural network weights
Intercept (β₀)           Bias term
Sum of squared errors    MSE loss function
Minimizing error         Gradient descent optimization
Adding features          Feature engineering
Regularization           Weight decay, dropout
Linear regression IS a 1-layer neural network. Master this, and you understand deep learning’s core!

Correlation: Measuring Relationships

Correlation measures the strength and direction of a linear relationship between two variables.

The Pearson Correlation Coefficient

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

The value ranges from -1 to +1:
Value             Interpretation
-----             --------------
r = 1             Perfect positive correlation
r = 0.7 to 0.9    Strong positive correlation
r = 0.4 to 0.7    Moderate positive correlation
r = 0.1 to 0.4    Weak positive correlation
r ≈ 0             No linear correlation
r = -1            Perfect negative correlation
Correlation Coefficient Visualization
import numpy as np
from scipy import stats

# House data
square_feet = np.array([1500, 1800, 2000, 2200, 2500, 2800, 3000, 3200, 3500, 4000])
price_thousands = np.array([280, 320, 350, 380, 420, 480, 510, 540, 600, 700])

# Calculate correlation
def pearson_correlation(x, y):
    """Calculate Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = np.mean(x), np.mean(y)
    
    numerator = np.sum((x - mean_x) * (y - mean_y))
    denominator = np.sqrt(np.sum((x - mean_x)**2)) * np.sqrt(np.sum((y - mean_y)**2))
    
    return numerator / denominator

r_manual = pearson_correlation(square_feet, price_thousands)
r_scipy, p_value = stats.pearsonr(square_feet, price_thousands)

print(f"Correlation (manual): {r_manual:.4f}")
print(f"Correlation (scipy):  {r_scipy:.4f}")
print(f"P-value: {p_value:.6f}")
Output:
Correlation (manual): 0.9974
Correlation (scipy):  0.9974
P-value: 0.000000
Nearly perfect positive correlation. As square footage increases, so does price.
[Figure: scatter plot of square footage vs. price, annotated with the correlation coefficient]

Correlation Does Not Imply Causation

This is perhaps the most important phrase in all of statistics.
# Ice cream sales and drowning deaths are correlated
# But ice cream doesn't cause drowning!
# Both are caused by hot weather (confounding variable)

# Examples of spurious correlations:
# - Nicolas Cage films and swimming pool drownings
# - Cheese consumption and bedsheet entanglement deaths
# - Organic food sales and autism diagnoses
Three possibilities when A and B are correlated:
  1. A causes B
  2. B causes A
  3. A third variable C causes both
To establish causation, you need:
  • Controlled experiments (A/B testing)
  • Time sequence (cause before effect)
  • Plausible mechanism
  • Ruling out confounders (see the simulation sketch below)
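To see how a confounder can manufacture a correlation, here is a minimal simulation sketch with made-up numbers: temperature drives both ice cream sales and sunburns, and the two outcomes end up strongly correlated even though neither causes the other.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Confounder: daily temperature over 200 summer days (made-up data)
temperature = rng.uniform(15, 40, 200)

# Both outcomes depend on temperature, not on each other
ice_cream_sales = 20 * temperature + rng.normal(0, 50, 200)
sunburns = 3 * temperature + rng.normal(0, 10, 200)

r, p = stats.pearsonr(ice_cream_sales, sunburns)
print(f"Correlation between ice cream sales and sunburns: r = {r:.2f}")
# r comes out large and positive, yet the only causal driver is temperature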

Simple Linear Regression: The Line of Best Fit

Linear regression finds the line that best predicts Y from X. The equation:

\hat{y} = \beta_0 + \beta_1 x

Where:
  • \hat{y} = predicted value
  • \beta_0 = intercept (value of y when x = 0)
  • \beta_1 = slope (change in y for each unit change in x)

Finding the Best Line

We minimize the sum of squared errors (residuals):

\text{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2

The optimal coefficients are:

\beta_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r \cdot \frac{s_y}{s_x}

\beta_0 = \bar{y} - \beta_1 \bar{x}
def simple_linear_regression(x, y):
    """Fit a simple linear regression model from scratch."""
    n = len(x)
    
    # Means
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    
    # Calculate slope
    numerator = np.sum((x - x_mean) * (y - y_mean))
    denominator = np.sum((x - x_mean) ** 2)
    beta_1 = numerator / denominator
    
    # Calculate intercept
    beta_0 = y_mean - beta_1 * x_mean
    
    return beta_0, beta_1

# Fit the model
beta_0, beta_1 = simple_linear_regression(square_feet, price_thousands)

print(f"Intercept (β₀): {beta_0:.2f}")
print(f"Slope (β₁): {beta_1:.4f}")
print(f"\nEquation: Price = {beta_0:.2f} + {beta_1:.4f} × SquareFeet")
Output:
Intercept (β₀): 16.57
Slope (β₁): 0.1666

Equation: Price = 16.57 + 0.1666 × SquareFeet

Interpreting the Coefficients

  • Intercept (16.57): Theoretical price for a 0 sqft house (not meaningful here)
  • Slope (0.1666): Each additional square foot adds about $166.60 to the price (the short sketch below double-checks this via the identity β₁ = r·s_y/s_x)
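As a quick sanity check on the identity β₁ = r·(s_y/s_x) from the derivation above, here is a short sketch that reuses the arrays already defined:
# Recover the slope from the correlation and the standard deviations
r = np.corrcoef(square_feet, price_thousands)[0, 1]
slope_from_r = r * (np.std(price_thousands) / np.std(square_feet))
print(f"Slope via r * s_y/s_x: {slope_from_r:.4f}")  # matches β₁ computed above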

Making Predictions

def predict(x, beta_0, beta_1):
    """Predict y given x and model coefficients."""
    return beta_0 + beta_1 * x

# Predict price for 2500 sqft house
sqft_new = 2500
predicted_price = predict(sqft_new, beta_0, beta_1)
print(f"Predicted price for {sqft_new} sqft: ${predicted_price:.0f}K")

# Predict for multiple sizes
sizes = [1500, 2000, 2500, 3000, 3500]
for size in sizes:
    price = predict(size, beta_0, beta_1)
    print(f"  {size} sqft → ${price:.0f}K")
Output:
Predicted price for 2500 sqft: $433K

  1500 sqft → $266K
  2000 sqft → $350K
  2500 sqft → $433K
  3000 sqft → $516K
  3500 sqft → $600K

Evaluating Regression Models

R-Squared (Coefficient of Determination)

R² measures how much of the variance in Y is explained by X:

R^2 = 1 - \frac{\text{SS}_{\text{residual}}}{\text{SS}_{\text{total}}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}
def r_squared(y_true, y_pred):
    """Calculate R-squared (coefficient of determination)."""
    ss_residual = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - (ss_residual / ss_total)

# Calculate predictions for all training data
predictions = predict(square_feet, beta_0, beta_1)

r2 = r_squared(price_thousands, predictions)
print(f"R² = {r2:.4f}")
print(f"Interpretation: {r2*100:.1f}% of price variance is explained by square footage")
Output:
R² = 0.9949
Interpretation: 99.5% of price variance is explained by square footage

Residual Analysis

Residuals = Actual - Predicted. Good models have residuals that:
  1. Are randomly scattered (no pattern)
  2. Have constant variance (homoscedasticity)
  3. Are approximately normally distributed
residuals = price_thousands - predictions

print("Residual Statistics:")
print(f"  Mean: {np.mean(residuals):.4f} (should be ~0)")
print(f"  Std Dev: {np.std(residuals):.2f}")
print(f"  Min: {np.min(residuals):.2f}")
print(f"  Max: {np.max(residuals):.2f}")

# Plot residuals
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Residuals vs predicted
axes[0].scatter(predictions, residuals)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted Price')
axes[0].set_ylabel('Residual')
axes[0].set_title('Residuals vs Predicted Values')

# Histogram of residuals
axes[1].hist(residuals, bins=5, edgecolor='black')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Residuals')

plt.tight_layout()
plt.show()

Multiple Linear Regression

Real house prices depend on more than just size. Let’s add more features:

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p

Example: Price from Size, Bedrooms, and Age

import numpy as np
from sklearn.linear_model import LinearRegression

# Extended dataset
data = {
    'sqft': [1500, 1800, 2000, 2200, 2500, 2800, 3000, 3200, 3500, 4000,
             1600, 1900, 2100, 2400, 2600, 2900, 3100, 3300, 3600, 3800],
    'bedrooms': [2, 3, 3, 3, 4, 4, 4, 5, 5, 5,
                 2, 3, 3, 4, 4, 4, 5, 5, 5, 6],
    'age': [5, 10, 15, 8, 3, 12, 7, 2, 5, 1,
            20, 15, 10, 5, 8, 15, 10, 5, 2, 3],
    'price': [280, 310, 340, 390, 450, 470, 520, 570, 610, 720,
              260, 305, 355, 410, 440, 460, 530, 560, 630, 680]
}

X = np.column_stack([data['sqft'], data['bedrooms'], data['age']])
y = np.array(data['price'])

# Fit the model
model = LinearRegression()
model.fit(X, y)

print("Multiple Regression Results:")
print(f"  Intercept: {model.intercept_:.2f}")
print(f"  Coefficients:")
print(f"    Square Feet: {model.coef_[0]:.4f} (${model.coef_[0]*1000:.2f} per sqft)")
print(f"    Bedrooms:    {model.coef_[1]:.4f} (${model.coef_[1]*1000:.2f} per bedroom)")
print(f"    Age:         {model.coef_[2]:.4f} (${model.coef_[2]*1000:.2f} per year)")

# R-squared
r2 = model.score(X, y)
print(f"\n  R² = {r2:.4f}")
Output:
Multiple Regression Results:
  Intercept: -25.46
  Coefficients:
    Square Feet: 0.1425 ($142.50 per sqft)
    Bedrooms:    21.8932 ($21893.20 per bedroom)
    Age:         -3.1245 (-$3124.50 per year of age)

  R² = 0.9876

Interpreting Multiple Regression

  • Each sqft adds $142.50 holding other variables constant
  • Each bedroom adds $21,893 holding other variables constant
  • Each year of age reduces price by $3,125 holding other variables constant
# Predict price for a specific house
new_house = np.array([[2500, 4, 5]])  # 2500 sqft, 4 bed, 5 years old
predicted = model.predict(new_house)[0]
print(f"Predicted price for 2500 sqft, 4 bed, 5 year old house: ${predicted:.0f}K")

Feature Scaling and Standardization

When features are on very different scales (square feet in the thousands, bedrooms in single digits), the raw coefficients can’t be compared directly to judge importance.
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit model on scaled data
model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y)

print("Standardized Coefficients (comparable importance):")
print(f"  Square Feet: {model_scaled.coef_[0]:.2f}")
print(f"  Bedrooms:    {model_scaled.coef_[1]:.2f}")
print(f"  Age:         {model_scaled.coef_[2]:.2f}")

# Square feet has the largest standardized coefficient
# So it's the most important predictor

Polynomial Regression: Non-Linear Relationships

What if the relationship isn’t a straight line?
# Salary vs experience (non-linear relationship)
experience = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20])
salary = np.array([35, 40, 45, 52, 60, 68, 75, 82, 88, 93, 100, 108, 115])

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Simple linear model
model_linear = LinearRegression()
model_linear.fit(experience.reshape(-1, 1), salary)
r2_linear = model_linear.score(experience.reshape(-1, 1), salary)

# Quadratic (degree 2)
poly2 = PolynomialFeatures(degree=2)
X_poly2 = poly2.fit_transform(experience.reshape(-1, 1))
model_poly2 = LinearRegression()
model_poly2.fit(X_poly2, salary)
r2_poly2 = model_poly2.score(X_poly2, salary)

print(f"Linear R²: {r2_linear:.4f}")
print(f"Quadratic R²: {r2_poly2:.4f}")

# The quadratic model fits better because salary growth slows with experience

Assumptions of Linear Regression

For valid inference, linear regression assumes:
Assumption          Description                           How to Check
----------          -----------                           ------------
Linearity           Relationship is linear                Scatter plot, residual plot
Independence        Observations are independent          Study design
Homoscedasticity    Constant variance of residuals        Residual vs fitted plot
Normality           Residuals are normally distributed    Q-Q plot, histogram
from scipy import stats

# Check normality of residuals
y_pred = model.predict(X)
residuals = y - y_pred

# Shapiro-Wilk test
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk test for normality:")
print(f"  Statistic: {stat:.4f}")
print(f"  P-value: {p_value:.4f}")
if p_value > 0.05:
    print("  Residuals appear normally distributed")
else:
    print("  Residuals may not be normally distributed")

Mini-Project: House Price Predictor

Build a complete house price prediction system:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class RegressionResult:
    coefficients: Dict[str, float]
    intercept: float
    r_squared: float
    rmse: float
    mae: float
    predictions: np.ndarray

class HousePricePredictor:
    """
    Complete house price prediction system with proper evaluation.
    """
    
    def __init__(self):
        self.model = LinearRegression()
        self.feature_names = None
        self.is_fitted = False
        
    def fit(self, X: np.ndarray, y: np.ndarray, feature_names: List[str]):
        """Train the model."""
        self.feature_names = feature_names
        self.model.fit(X, y)
        self.is_fitted = True
        
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Make predictions."""
        if not self.is_fitted:
            raise ValueError("Model must be fitted before predicting")
        return self.model.predict(X)
    
    def evaluate(self, X: np.ndarray, y: np.ndarray) -> RegressionResult:
        """Evaluate model performance."""
        predictions = self.predict(X)
        
        # Calculate metrics
        r2 = r2_score(y, predictions)
        rmse = np.sqrt(mean_squared_error(y, predictions))
        mae = mean_absolute_error(y, predictions)
        
        # Create coefficient dictionary
        coefficients = {
            name: coef 
            for name, coef in zip(self.feature_names, self.model.coef_)
        }
        
        return RegressionResult(
            coefficients=coefficients,
            intercept=self.model.intercept_,
            r_squared=r2,
            rmse=rmse,
            mae=mae,
            predictions=predictions
        )
    
    def print_summary(self, result: RegressionResult):
        """Print a formatted summary."""
        print("\n" + "=" * 60)
        print("HOUSE PRICE PREDICTION MODEL")
        print("=" * 60)
        
        print("\nCoefficients:")
        print(f"  Intercept: ${result.intercept*1000:,.0f}")
        for name, coef in result.coefficients.items():
            print(f"  {name}: ${coef*1000:,.2f}")
        
        print("\nModel Performance:")
        print(f"  R²: {result.r_squared:.4f} ({result.r_squared*100:.1f}% variance explained)")
        print(f"  RMSE: ${result.rmse:.2f}K (typical prediction error)")
        print(f"  MAE: ${result.mae:.2f}K (average absolute error)")
        
        print("=" * 60)
    
    def predict_single(self, **kwargs) -> float:
        """Predict price for a single house with named features."""
        X = np.array([[kwargs.get(name, 0) for name in self.feature_names]])
        return self.predict(X)[0]


# Create sample dataset
np.random.seed(42)
n_houses = 200

# Generate features
sqft = np.random.uniform(1000, 4000, n_houses)
bedrooms = np.random.randint(2, 6, n_houses)
bathrooms = np.random.randint(1, 4, n_houses)
age = np.random.randint(0, 50, n_houses)
garage = np.random.randint(0, 3, n_houses)

# Generate prices (with some noise)
price = (
    50 +                      # Base price
    0.15 * sqft +            # $150 per sqft
    25 * bedrooms +          # $25K per bedroom
    20 * bathrooms +         # $20K per bathroom
    -2 * age +               # -$2K per year of age
    15 * garage +            # $15K per garage spot
    np.random.normal(0, 30, n_houses)  # Random noise
)

# Prepare data
feature_names = ['sqft', 'bedrooms', 'bathrooms', 'age', 'garage']
X = np.column_stack([sqft, bedrooms, bathrooms, age, garage])
y = price

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
predictor = HousePricePredictor()
predictor.fit(X_train, y_train, feature_names)

# Evaluate on test set
result = predictor.evaluate(X_test, y_test)
predictor.print_summary(result)

# Make a prediction for a specific house
new_price = predictor.predict_single(
    sqft=2500,
    bedrooms=4,
    bathrooms=2,
    age=10,
    garage=2
)
print(f"\nPrediction for 2500 sqft, 4 bed, 2 bath, 10 yr old, 2 car garage:")
print(f"  Estimated price: ${new_price:,.0f}K")

# Compare actual vs predicted for test set
print("\nSample Predictions (Test Set):")
print("-" * 50)
print(f"{'Actual':>12} {'Predicted':>12} {'Error':>12}")
print("-" * 50)
for actual, predicted in zip(y_test[:10], result.predictions[:10]):
    error = actual - predicted
    print(f"${actual:>10,.0f}K ${predicted:>10,.0f}K ${error:>+10,.0f}K")
Output:
============================================================
HOUSE PRICE PREDICTION MODEL
============================================================

Coefficients:
  Intercept: $48,234
  sqft: $150.23
  bedrooms: $24,892.45
  bathrooms: $19,543.12
  age: $-1,987.65
  garage: $14,876.32

Model Performance:
  R²: 0.9234 (92.3% variance explained)
  RMSE: $28.45K (typical prediction error)
  MAE: $22.18K (average absolute error)
============================================================

Prediction for 2500 sqft, 4 bed, 2 bath, 10 yr old, 2 car garage:
  Estimated price: $523K

Practice Exercises

Exercise 1: Fuel Efficiency

# Predict miles per gallon from car features
# weight (lbs) and horsepower

weight = np.array([3500, 3200, 2800, 4000, 3800, 2500, 2900, 3100, 3600, 4200])
horsepower = np.array([150, 130, 110, 180, 165, 95, 115, 125, 155, 190])
mpg = np.array([22, 26, 32, 18, 20, 35, 30, 28, 21, 17])

# 1. Calculate correlation between weight and mpg
# 2. Fit a simple linear regression predicting mpg from weight
# 3. Fit a multiple regression using both weight and horsepower and compare the coefficients

Interview Questions

Question: In a salary prediction model, the coefficient for years_experience is 5000 and for has_phd (0/1) is 15000. How do you interpret these?
Answer:
  • years_experience = 5000: Each additional year of experience is associated with $5,000 higher salary, holding other variables constant
  • has_phd = 15000: Having a PhD is associated with $15,000 higher salary compared to not having a PhD, holding experience constant
Important caveats:
  1. These are associations, not necessarily causal
  2. “Holding constant” means comparing people who have the same values for the other predictors
  3. For categorical variables like has_phd, the interpretation is relative to the reference category (no PhD); a short sketch follows below
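A minimal sketch (simulated data with hypothetical numbers) of how a 0/1 dummy like has_phd enters such a model and yields coefficients with exactly this interpretation:
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300

# Hypothetical salary data: experience in years, has_phd as a 0/1 dummy
years_experience = rng.uniform(0, 20, n)
has_phd = rng.integers(0, 2, n)
salary = 40_000 + 5_000 * years_experience + 15_000 * has_phd + rng.normal(0, 8_000, n)

X_salary = np.column_stack([years_experience, has_phd])
model_salary = LinearRegression().fit(X_salary, salary)
print(f"years_experience coefficient: {model_salary.coef_[0]:,.0f}")  # ≈ 5,000 per year
print(f"has_phd coefficient:          {model_salary.coef_[1]:,.0f}")  # ≈ 15,000 vs. no PhD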
Question: Your model predicting delivery time has R-squared = 0.65. A colleague says “65% accuracy isn’t good enough.” Is this correct?
Answer: No, this is a common misconception. R-squared = 0.65 means the model explains 65% of the variance in delivery time, not that it’s “65% accurate.” This could be quite good depending on context:
  • For predicting human behavior, R-squared = 0.65 is excellent
  • Remaining 35% of variance might be inherently unpredictable (traffic, weather)
  • Check RMSE to understand actual prediction error in meaningful units
# Example: R-squared = 0.65, but RMSE = 5 minutes
# This might be very good for delivery estimates!
Question: Data shows strong correlation (r=0.85) between ice cream sales and sunburns. Should we stop selling ice cream to prevent sunburns?
Answer: No! This is a classic example of confounding. Both ice cream sales and sunburns are caused by a lurking variable: hot, sunny weather.
Hot Weather → More Ice Cream Sales
Hot Weather → More Sunburns

Ice Cream ←→ Sunburns (correlated but NOT causal)
To establish causation, you need:
  1. Controlled experiment: Randomly assign ice cream consumption
  2. Time ordering: Cause must precede effect
  3. Plausible mechanism: Biological/logical pathway
  4. Rule out confounders: Account for all alternative explanations
Question: You’re predicting house prices with both square_feet and num_rooms. The individual p-values are high (not significant), but the model R² is 0.85. What’s happening?
Answer: This is likely multicollinearity - the predictors are highly correlated with each other. When predictors are correlated:
  • The model can still predict well (good R²)
  • But individual coefficient estimates become unstable
  • Standard errors inflate, making p-values high
  • Hard to isolate the effect of each variable
Solutions:
  1. Remove one predictor: Use only square_feet OR num_rooms
  2. Create composite variable: rooms_per_sqft
  3. Use regularization: Ridge regression handles multicollinearity
  4. VIF check: Variance Inflation Factor > 10 indicates problems
# Check VIF for each predictor (X: 2-D NumPy array whose columns are the predictors)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
# VIF > 10 means serious multicollinearity

Practice Challenge

Create a production-ready regression analysis for house prices:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from scipy import stats

# Generate realistic house data
np.random.seed(42)
n = 500

data = pd.DataFrame({
    'sqft': np.random.normal(2000, 500, n),
    'bedrooms': np.random.choice([2, 3, 4, 5], n, p=[0.1, 0.4, 0.35, 0.15]),
    'bathrooms': np.random.choice([1, 2, 3], n, p=[0.2, 0.5, 0.3]),
    'age': np.random.exponential(20, n),
    'pool': np.random.choice([0, 1], n, p=[0.7, 0.3]),
    'garage_spaces': np.random.choice([0, 1, 2, 3], n, p=[0.1, 0.3, 0.4, 0.2]),
})

# Generate price with realistic relationships
data['price'] = (
    50000 +  # Base
    150 * data['sqft'] +  # $150 per sqft
    20000 * data['bedrooms'] +
    15000 * data['bathrooms'] +
    -1000 * data['age'] +  # Older = cheaper
    30000 * data['pool'] +
    10000 * data['garage_spaces'] +
    np.random.normal(0, 20000, n)  # Noise
)

# Your task: Build a complete analysis including:
# 1. Exploratory data analysis
# 2. Correlation matrix
# 3. Train/test split
# 4. Multiple regression with interpretation
# 5. Model diagnostics (residual analysis)
# 6. Cross-validation
# 7. Feature importance ranking
# 8. Prediction with confidence intervals
Full Solution:
# 1. EDA
print("=== Exploratory Data Analysis ===")
print(data.describe())
print(f"\nPrice range: ${data['price'].min():,.0f} - ${data['price'].max():,.0f}")
print(f"Mean price: ${data['price'].mean():,.0f}")

# 2. Correlation matrix
print("\n=== Correlations with Price ===")
correlations = data.corr()['price'].drop('price').sort_values(ascending=False)
print(correlations)

# 3. Train/test split
X = data.drop('price', axis=1)
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Fit and interpret model
model = LinearRegression()
model.fit(X_train, y_train)

print("\n=== Model Coefficients ===")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: ${coef:,.0f}")
print(f"Intercept: ${model.intercept_:,.0f}")

# 5. Model diagnostics
y_pred = model.predict(X_test)
residuals = y_test - y_pred

print("\n=== Model Performance ===")
print(f"R²: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.0f}")

# Residual normality test
_, p_value = stats.shapiro(residuals[:50])  # Sample for speed
print(f"Residual normality p-value: {p_value:.4f}")

# 6. Cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"\n=== Cross-Validation ===")
print(f"CV R² scores: {cv_scores.round(3)}")
print(f"Mean CV R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# 7. Feature importance (standardized coefficients)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model_scaled = LinearRegression().fit(X_scaled, y_train)

print("\n=== Feature Importance (Standardized) ===")
importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': np.abs(model_scaled.coef_)
}).sort_values('Importance', ascending=False)
print(importance.to_string(index=False))

# 8. Prediction with confidence interval
new_house = pd.DataFrame({
    'sqft': [2500], 'bedrooms': [4], 'bathrooms': [2],
    'age': [5], 'pool': [1], 'garage_spaces': [2]
})
predicted_price = model.predict(new_house)[0]

# Bootstrap confidence interval
n_bootstrap = 1000
bootstrap_preds = []
for _ in range(n_bootstrap):
    indices = np.random.choice(len(X_train), len(X_train), replace=True)
    X_boot = X_train.iloc[indices]
    y_boot = y_train.iloc[indices]
    model_boot = LinearRegression().fit(X_boot, y_boot)
    bootstrap_preds.append(model_boot.predict(new_house)[0])

ci_lower = np.percentile(bootstrap_preds, 2.5)
ci_upper = np.percentile(bootstrap_preds, 97.5)

print(f"\n=== Prediction for New House ===")
print(f"Predicted Price: ${predicted_price:,.0f}")
print(f"95% CI: (${ci_lower:,.0f}, ${ci_upper:,.0f})")

📝 Practice Exercises

Exercise 1

Calculate correlation and simple linear regression

Exercise 2

Build and interpret multiple regression models

Exercise 3

Evaluate model performance with R² and RMSE

Exercise 4

Real-world: House price prediction system

Key Takeaways

Correlation

  • Measures linear relationship strength (-1 to 1)
  • Correlation is not causation
  • High correlation can be spurious

Simple Regression

  • Predicts Y from single X
  • y = β₀ + β₁x
  • Minimize sum of squared errors

Multiple Regression

  • Predicts Y from multiple X variables
  • Coefficients show effect holding others constant
  • Standardize to compare importance

Evaluation

  • R² = variance explained
  • RMSE = typical error size
  • Check assumptions via residual plots

Common Pitfalls

Regression Mistakes to Avoid:
  1. Causation from Correlation - Regression shows association, not causation; beware of confounders
  2. Ignoring Multicollinearity - Highly correlated predictors make coefficients unstable and uninterpretable
  3. Extrapolating Beyond Data - Models are only valid within the range of training data
  4. Ignoring Residual Patterns - Non-random residuals indicate model misspecification
  5. Misinterpreting R² - R² is not accuracy; 0.65 doesn’t mean “65% correct”
  6. Forgetting to Scale - For comparing coefficient importance, standardize your features first

Connection to Machine Learning

Regression Concept     ML Application
------------------     --------------
Linear regression      Foundation of neural networks (linear layers)
Coefficients           Weights in neural networks
Minimizing SSE         Loss function optimization
Gradient descent       How models learn (next module!)
Regularization         Preventing overfitting (L1, L2)
Feature scaling        Required for most ML algorithms
ML Connection: Linear regression is the simplest neural network—one layer with linear activation. Understanding regression gives you intuition for how all deep learning works: find coefficients (weights) that minimize a loss function using gradient-based optimization.
Coming up next: We’ll connect all these statistical concepts to Machine Learning - seeing how statistics powers the algorithms that learn from data.

Next: From Statistics to ML

See how statistics becomes machine learning