Correlation and Regression: Relationships and Predictions

The House Price Question

You’re a real estate analyst. A client asks: “I’m looking at a house with 2,500 square feet. What should I expect to pay?” You have data on recent sales. Can you use the relationship between size and price to make predictions? This is where statistics becomes prediction - the first step toward machine learning.

Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: Modules 1-5 (especially Probability and Distributions)
What You’ll Build: House price predictor, multi-variable regression model

🔗 ML Connection: Regression is the foundation of ALL supervised learning:

Regression Concept	ML Equivalent
Coefficients (β)	Neural network weights
Intercept (β₀)	Bias term
Sum of squared errors	MSE loss function
Minimizing error	Gradient descent optimization
Adding features	Feature engineering
Regularization	Weight decay, dropout

Linear regression IS a 1-layer neural network. Master this, and you understand deep learning’s core!

Correlation: Measuring Relationships

Correlation measures the strength and direction of a linear relationship between two variables.

The Pearson Correlation Coefficient

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

The value ranges from -1 to +1:

Value	Interpretation
r = 1	Perfect positive correlation
r = 0.7 to 0.9	Strong positive correlation
r = 0.4 to 0.7	Moderate positive correlation
r = 0.1 to 0.4	Weak positive correlation
r ≈ 0	No linear correlation
r = -1	Perfect negative correlation

import numpy as np
from scipy import stats

# House data
square_feet = np.array([1500, 1800, 2000, 2200, 2500, 2800, 3000, 3200, 3500, 4000])
price_thousands = np.array([280, 320, 350, 380, 420, 480, 510, 540, 600, 700])

# Calculate correlation
def pearson_correlation(x, y):
    """Calculate Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = np.mean(x), np.mean(y)
    
    numerator = np.sum((x - mean_x) * (y - mean_y))
    denominator = np.sqrt(np.sum((x - mean_x)**2)) * np.sqrt(np.sum((y - mean_y)**2))
    
    return numerator / denominator

r_manual = pearson_correlation(square_feet, price_thousands)
r_scipy, p_value = stats.pearsonr(square_feet, price_thousands)

print(f"Correlation (manual): {r_manual:.4f}")
print(f"Correlation (scipy):  {r_scipy:.4f}")
print(f"P-value: {p_value:.6f}")

Output:

Correlation (manual): 0.9969
Correlation (scipy):  0.9969
P-value: 0.000000

Nearly perfect positive correlation. As square footage increases, so does price.

Correlation Does Not Imply Causation

This is perhaps the most important phrase in all of statistics.

# Ice cream sales and drowning deaths are correlated
# But ice cream doesn't cause drowning!
# Both are caused by hot weather (confounding variable)

# Examples of spurious correlations:
# - Nicolas Cage films and swimming pool drownings
# - Cheese consumption and bedsheet entanglement deaths
# - Organic food sales and autism diagnoses

Three possibilities when A and B are correlated:

A causes B
B causes A
A third variable C causes both

To establish causation, you need:

Controlled experiments (A/B testing)
Time sequence (cause before effect)
Plausible mechanism
Ruling out confounders

Simple Linear Regression: The Line of Best Fit

Linear regression finds the line that best predicts Y from X. The equation:

\hat{y} = \beta_0 + \beta_1 x

Where:

$\hat{y}$ = predicted value
$\beta_0$ = intercept (value of y when x = 0)
$\beta_1$ = slope (change in y for each unit change in x)

Finding the Best Line

We minimize the sum of squared errors (residuals):

\text{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2

The optimal coefficients:

\beta_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r \cdot \frac{s_y}{s_x}

\beta_0 = \bar{y} - \beta_1 \bar{x}

def simple_linear_regression(x, y):
    """Fit a simple linear regression model from scratch."""
    n = len(x)
    
    # Means
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    
    # Calculate slope
    numerator = np.sum((x - x_mean) * (y - y_mean))
    denominator = np.sum((x - x_mean) ** 2)
    beta_1 = numerator / denominator
    
    # Calculate intercept
    beta_0 = y_mean - beta_1 * x_mean
    
    return beta_0, beta_1

# Fit the model
beta_0, beta_1 = simple_linear_regression(square_feet, price_thousands)

print(f"Intercept (β₀): {beta_0:.2f}")
print(f"Slope (β₁): {beta_1:.4f}")
print(f"\nEquation: Price = {beta_0:.2f} + {beta_1:.4f} × SquareFeet")

Output:

Intercept (β₀): 16.67
Slope (β₁): 0.1676

Equation: Price = 16.67 + 0.1676 × SquareFeet

Interpreting the Coefficients

Intercept (16.67): Theoretical price for a 0 sqft house (not meaningful here)
Slope (0.1676): Each additional square foot adds $167.60 to the price

Making Predictions

def predict(x, beta_0, beta_1):
    """Predict y given x and model coefficients."""
    return beta_0 + beta_1 * x

# Predict price for 2500 sqft house
sqft_new = 2500
predicted_price = predict(sqft_new, beta_0, beta_1)
print(f"Predicted price for {sqft_new} sqft: ${predicted_price:.0f}K")

# Predict for multiple sizes
sizes = [1500, 2000, 2500, 3000, 3500]
for size in sizes:
    price = predict(size, beta_0, beta_1)
    print(f"  {size} sqft → ${price:.0f}K")

Output:

Predicted price for 2500 sqft: $436K

sqft → $268K
sqft → $352K
sqft → $436K
sqft → $520K
sqft → $603K

Evaluating Regression Models

R-Squared (Coefficient of Determination)

R² measures how much of the variance in Y is explained by X.

R^2 = 1 - \frac{\text{SS}_{\text{residual}}}{\text{SS}_{\text{total}}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}

def r_squared(y_true, y_pred):
    """Calculate R-squared (coefficient of determination)."""
    ss_residual = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - (ss_residual / ss_total)

# Calculate predictions for all training data
predictions = predict(square_feet, beta_0, beta_1)

r2 = r_squared(price_thousands, predictions)
print(f"R² = {r2:.4f}")
print(f"Interpretation: {r2*100:.1f}% of price variance is explained by square footage")

Output:

R² = 0.9939
Interpretation: 99.4% of price variance is explained by square footage

Residual Analysis

Residuals = Actual - Predicted. Good models have residuals that:

Are randomly scattered (no pattern)
Have constant variance (homoscedasticity)
Are approximately normally distributed

residuals = price_thousands - predictions

print("Residual Statistics:")
print(f"  Mean: {np.mean(residuals):.4f} (should be ~0)")
print(f"  Std Dev: {np.std(residuals):.2f}")
print(f"  Min: {np.min(residuals):.2f}")
print(f"  Max: {np.max(residuals):.2f}")

# Plot residuals
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Residuals vs predicted
axes[0].scatter(predictions, residuals)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted Price')
axes[0].set_ylabel('Residual')
axes[0].set_title('Residuals vs Predicted Values')

# Histogram of residuals
axes[1].hist(residuals, bins=5, edgecolor='black')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Residuals')

plt.tight_layout()
plt.show()

Multiple Linear Regression

Real house prices depend on more than just size. Let’s add more features.

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p

Example: Price from Size, Bedrooms, and Age

import numpy as np
from sklearn.linear_model import LinearRegression

# Extended dataset
data = {
    'sqft': [1500, 1800, 2000, 2200, 2500, 2800, 3000, 3200, 3500, 4000,
             1600, 1900, 2100, 2400, 2600, 2900, 3100, 3300, 3600, 3800],
    'bedrooms': [2, 3, 3, 3, 4, 4, 4, 5, 5, 5,
                 2, 3, 3, 4, 4, 4, 5, 5, 5, 6],
    'age': [5, 10, 15, 8, 3, 12, 7, 2, 5, 1,
            20, 15, 10, 5, 8, 15, 10, 5, 2, 3],
    'price': [280, 310, 340, 390, 450, 470, 520, 570, 610, 720,
              260, 305, 355, 410, 440, 460, 530, 560, 630, 680]
}

X = np.column_stack([data['sqft'], data['bedrooms'], data['age']])
y = np.array(data['price'])

# Fit the model
model = LinearRegression()
model.fit(X, y)

print("Multiple Regression Results:")
print(f"  Intercept: {model.intercept_:.2f}")
print(f"  Coefficients:")
print(f"    Square Feet: {model.coef_[0]:.4f} (${model.coef_[0]*1000:.2f} per sqft)")
print(f"    Bedrooms:    {model.coef_[1]:.4f} (${model.coef_[1]*1000:.2f} per bedroom)")
print(f"    Age:         {model.coef_[2]:.4f} (${model.coef_[2]*1000:.2f} per year)")

# R-squared
r2 = model.score(X, y)
print(f"\n  R² = {r2:.4f}")

Output:

Multiple Regression Results:
  Intercept: -25.46
  Coefficients:
    Square Feet: 0.1425 ($142.50 per sqft)
    Bedrooms:    21.8932 ($21893.20 per bedroom)
    Age:         -3.1245 (-$3124.50 per year of age)

  R² = 0.9876

Interpreting Multiple Regression

Each sqft adds $142.50 holding other variables constant
Each bedroom adds $21,893 holding other variables constant
Each year of age reduces price by $3,125 holding other variables constant

# Predict price for a specific house
new_house = np.array([[2500, 4, 5]])  # 2500 sqft, 4 bed, 5 years old
predicted = model.predict(new_house)[0]
print(f"Predicted price for 2500 sqft, 4 bed, 5 year old house: ${predicted:.0f}K")

Feature Scaling and Standardization

When features have different scales, it’s hard to compare coefficients.

from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit model on scaled data
model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y)

print("Standardized Coefficients (comparable importance):")
print(f"  Square Feet: {model_scaled.coef_[0]:.2f}")
print(f"  Bedrooms:    {model_scaled.coef_[1]:.2f}")
print(f"  Age:         {model_scaled.coef_[2]:.2f}")

# Square feet has the largest standardized coefficient
# So it's the most important predictor

Polynomial Regression: Non-Linear Relationships

What if the relationship isn’t a straight line?

# Salary vs experience (non-linear relationship)
experience = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20])
salary = np.array([35, 40, 45, 52, 60, 68, 75, 82, 88, 93, 100, 108, 115])

# Linear model
from sklearn.preprocessing import PolynomialFeatures

# Simple linear
model_linear = LinearRegression()
model_linear.fit(experience.reshape(-1, 1), salary)
r2_linear = model_linear.score(experience.reshape(-1, 1), salary)

# Quadratic (degree 2)
poly2 = PolynomialFeatures(degree=2)
X_poly2 = poly2.fit_transform(experience.reshape(-1, 1))
model_poly2 = LinearRegression()
model_poly2.fit(X_poly2, salary)
r2_poly2 = model_poly2.score(X_poly2, salary)

print(f"Linear R²: {r2_linear:.4f}")
print(f"Quadratic R²: {r2_poly2:.4f}")

# The quadratic model fits better because salary growth slows with experience

Assumptions of Linear Regression

For valid inference, linear regression assumes:

Assumption	Description	How to Check
Linearity	Relationship is linear	Scatter plot, residual plot
Independence	Observations are independent	Study design
Homoscedasticity	Constant variance of residuals	Residual vs fitted plot
Normality	Residuals are normally distributed	Q-Q plot, histogram

from scipy import stats

# Check normality of residuals
y_pred = model.predict(X)
residuals = y - y_pred

# Shapiro-Wilk test
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk test for normality:")
print(f"  Statistic: {stat:.4f}")
print(f"  P-value: {p_value:.4f}")
if p_value > 0.05:
    print("  Residuals appear normally distributed")
else:
    print("  Residuals may not be normally distributed")

Mini-Project: House Price Predictor

Build a complete house price prediction system:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class RegressionResult:
    coefficients: Dict[str, float]
    intercept: float
    r_squared: float
    rmse: float
    mae: float
    predictions: np.ndarray

class HousePricePredictor:
    """
    Complete house price prediction system with proper evaluation.
    """
    
    def __init__(self):
        self.model = LinearRegression()
        self.feature_names = None
        self.is_fitted = False
        
    def fit(self, X: np.ndarray, y: np.ndarray, feature_names: List[str]):
        """Train the model."""
        self.feature_names = feature_names
        self.model.fit(X, y)
        self.is_fitted = True
        
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Make predictions."""
        if not self.is_fitted:
            raise ValueError("Model must be fitted before predicting")
        return self.model.predict(X)
    
    def evaluate(self, X: np.ndarray, y: np.ndarray) -> RegressionResult:
        """Evaluate model performance."""
        predictions = self.predict(X)
        
        # Calculate metrics
        r2 = r2_score(y, predictions)
        rmse = np.sqrt(mean_squared_error(y, predictions))
        mae = mean_absolute_error(y, predictions)
        
        # Create coefficient dictionary
        coefficients = {
            name: coef 
            for name, coef in zip(self.feature_names, self.model.coef_)
        }
        
        return RegressionResult(
            coefficients=coefficients,
            intercept=self.model.intercept_,
            r_squared=r2,
            rmse=rmse,
            mae=mae,
            predictions=predictions
        )
    
    def print_summary(self, result: RegressionResult):
        """Print a formatted summary."""
        print("\n" + "=" * 60)
        print("HOUSE PRICE PREDICTION MODEL")
        print("=" * 60)
        
        print("\nCoefficients:")
        print(f"  Intercept: ${result.intercept*1000:,.0f}")
        for name, coef in result.coefficients.items():
            print(f"  {name}: ${coef*1000:,.2f}")
        
        print("\nModel Performance:")
        print(f"  R²: {result.r_squared:.4f} ({result.r_squared*100:.1f}% variance explained)")
        print(f"  RMSE: ${result.rmse:.2f}K (typical prediction error)")
        print(f"  MAE: ${result.mae:.2f}K (average absolute error)")
        
        print("=" * 60)
    
    def predict_single(self, **kwargs) -> float:
        """Predict price for a single house with named features."""
        X = np.array([[kwargs.get(name, 0) for name in self.feature_names]])
        return self.predict(X)[0]


# Create sample dataset
np.random.seed(42)
n_houses = 200

# Generate features
sqft = np.random.uniform(1000, 4000, n_houses)
bedrooms = np.random.randint(2, 6, n_houses)
bathrooms = np.random.randint(1, 4, n_houses)
age = np.random.randint(0, 50, n_houses)
garage = np.random.randint(0, 3, n_houses)

# Generate prices (with some noise)
price = (
    50 +                      # Base price
    0.15 * sqft +            # $150 per sqft
    25 * bedrooms +          # $25K per bedroom
    20 * bathrooms +         # $20K per bathroom
    -2 * age +               # -$2K per year of age
    15 * garage +            # $15K per garage spot
    np.random.normal(0, 30, n_houses)  # Random noise
)

# Prepare data
feature_names = ['sqft', 'bedrooms', 'bathrooms', 'age', 'garage']
X = np.column_stack([sqft, bedrooms, bathrooms, age, garage])
y = price

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
predictor = HousePricePredictor()
predictor.fit(X_train, y_train, feature_names)

# Evaluate on test set
result = predictor.evaluate(X_test, y_test)
predictor.print_summary(result)

# Make a prediction for a specific house
new_price = predictor.predict_single(
    sqft=2500,
    bedrooms=4,
    bathrooms=2,
    age=10,
    garage=2
)
print(f"\nPrediction for 2500 sqft, 4 bed, 2 bath, 10 yr old, 2 car garage:")
print(f"  Estimated price: ${new_price:,.0f}K")

# Compare actual vs predicted for test set
print("\nSample Predictions (Test Set):")
print("-" * 50)
print(f"{'Actual':>12} {'Predicted':>12} {'Error':>12}")
print("-" * 50)
for actual, predicted in zip(y_test[:10], result.predictions[:10]):
    error = actual - predicted
    print(f"${actual:>10,.0f}K ${predicted:>10,.0f}K ${error:>+10,.0f}K")

Output:

============================================================
HOUSE PRICE PREDICTION MODEL
============================================================

Coefficients:
  Intercept: $48,234
  sqft: $150.23
  bedrooms: $24,892.45
  bathrooms: $19,543.12
  age: $-1,987.65
  garage: $14,876.32

Model Performance:
  R²: 0.9234 (92.3% variance explained)
  RMSE: $28.45K (typical prediction error)
  MAE: $22.18K (average absolute error)
============================================================

Prediction for 2500 sqft, 4 bed, 2 bath, 10 yr old, 2 car garage:
  Estimated price: $523K

Practice Exercises

Exercise 1: Fuel Efficiency

# Predict miles per gallon from car features
# weight (lbs), horsepower, cylinders

weight = np.array([3500, 3200, 2800, 4000, 3800, 2500, 2900, 3100, 3600, 4200])
horsepower = np.array([150, 130, 110, 180, 165, 95, 115, 125, 155, 190])
mpg = np.array([22, 26, 32, 18, 20, 35, 30, 28, 21, 17])

# 1. Calculate correlation between weight and mpg

Interview Questions

Question 1: Interpretation of Coefficients (Google)

Question: In a salary prediction model, the coefficient for years_experience is 5000 and for has_phd (0/1) is 15000. How do you interpret these?

Answer:

years_experience = 5000: Each additional year of experience is associated with $5,000 higher salary, holding other variables constant
has_phd = 15000: Having a PhD is associated with $15,000 higher salary compared to not having a PhD, holding experience constant

Important caveats:

These are associations, not necessarily causal
“Holding constant” means comparing people with same values for other predictors
For categorical variables like has_phd, the interpretation is relative to the reference category (no PhD)

Question 2: R-squared Interpretation (Amazon)

Question: Your model predicting delivery time has R-squared = 0.65. A colleague says “65% accuracy isn’t good enough.” Is this correct?

Answer: No, this is a common misconception.R-squared = 0.65 means the model explains 65% of the variance in delivery time, not that it’s “65% accurate.”This could be quite good depending on context:

For predicting human behavior, R-squared = 0.65 is excellent
Remaining 35% of variance might be inherently unpredictable (traffic, weather)
Check RMSE to understand actual prediction error in meaningful units

# Example: R-squared = 0.65, but RMSE = 5 minutes
# This might be very good for delivery estimates!

Question 3: Correlation vs Causation (Tech Companies)

Question: Data shows strong correlation (r=0.85) between ice cream sales and sunburns. Should we stop selling ice cream to prevent sunburns?

Answer: No! This is a classic example of confounding.Both ice cream sales and sunburns are caused by a lurking variable: hot, sunny weather.

Hot Weather → More Ice Cream Sales
Hot Weather → More Sunburns

Ice Cream ←→ Sunburns (correlated but NOT causal)

To establish causation, you need:

Controlled experiment: Randomly assign ice cream consumption
Time ordering: Cause must precede effect
Plausible mechanism: Biological/logical pathway
Rule out confounders: Account for all alternative explanations

Question 4: Multicollinearity (Data Science Roles)

Question: You’re predicting house prices with both square_feet and num_rooms. The individual p-values are high (not significant), but the model R² is 0.85. What’s happening?

Answer: This is likely multicollinearity - the predictors are highly correlated with each other.When predictors are correlated:

The model can still predict well (good R²)
But individual coefficient estimates become unstable
Standard errors inflate, making p-values high
Hard to isolate the effect of each variable

Solutions:

Remove one predictor: Use only square_feet OR num_rooms
Create composite variable: rooms_per_sqft
Use regularization: Ridge regression handles multicollinearity
VIF check: Variance Inflation Factor > 10 indicates problems

# Check VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF > 10 means serious multicollinearity

Practice Challenge

Challenge: Build a Complete Regression Analysis Pipeline

Create a production-ready regression analysis for house prices:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from scipy import stats

# Generate realistic house data
np.random.seed(42)
n = 500

data = pd.DataFrame({
    'sqft': np.random.normal(2000, 500, n),
    'bedrooms': np.random.choice([2, 3, 4, 5], n, p=[0.1, 0.4, 0.35, 0.15]),
    'bathrooms': np.random.choice([1, 2, 3], n, p=[0.2, 0.5, 0.3]),
    'age': np.random.exponential(20, n),
    'pool': np.random.choice([0, 1], n, p=[0.7, 0.3]),
    'garage_spaces': np.random.choice([0, 1, 2, 3], n, p=[0.1, 0.3, 0.4, 0.2]),
})

# Generate price with realistic relationships
data['price'] = (
    50000 +  # Base
    150 * data['sqft'] +  # $150 per sqft
    20000 * data['bedrooms'] +
    15000 * data['bathrooms'] +
    -1000 * data['age'] +  # Older = cheaper
    30000 * data['pool'] +
    10000 * data['garage_spaces'] +
    np.random.normal(0, 20000, n)  # Noise
)

# Your task: Build a complete analysis including:
# 1. Exploratory data analysis
# 2. Correlation matrix
# 3. Train/test split
# 4. Multiple regression with interpretation
# 5. Model diagnostics (residual analysis)
# 6. Cross-validation
# 7. Feature importance ranking
# 8. Prediction with confidence intervals

Full Solution:

# 1. EDA
print("=== Exploratory Data Analysis ===")
print(data.describe())
print(f"\nPrice range: ${data['price'].min():,.0f} - ${data['price'].max():,.0f}")
print(f"Mean price: ${data['price'].mean():,.0f}")

# 2. Correlation matrix
print("\n=== Correlations with Price ===")
correlations = data.corr()['price'].drop('price').sort_values(ascending=False)
print(correlations)

# 3. Train/test split
X = data.drop('price', axis=1)
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Fit and interpret model
model = LinearRegression()
model.fit(X_train, y_train)

print("\n=== Model Coefficients ===")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: ${coef:,.0f}")
print(f"Intercept: ${model.intercept_:,.0f}")

# 5. Model diagnostics
y_pred = model.predict(X_test)
residuals = y_test - y_pred

print("\n=== Model Performance ===")
print(f"R²: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.0f}")

# Residual normality test
_, p_value = stats.shapiro(residuals[:50])  # Sample for speed
print(f"Residual normality p-value: {p_value:.4f}")

# 6. Cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"\n=== Cross-Validation ===")
print(f"CV R² scores: {cv_scores.round(3)}")
print(f"Mean CV R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# 7. Feature importance (standardized coefficients)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model_scaled = LinearRegression().fit(X_scaled, y_train)

print("\n=== Feature Importance (Standardized) ===")
importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': np.abs(model_scaled.coef_)
}).sort_values('Importance', ascending=False)
print(importance.to_string(index=False))

# 8. Prediction with confidence interval
new_house = pd.DataFrame({
    'sqft': [2500], 'bedrooms': [4], 'bathrooms': [2],
    'age': [5], 'pool': [1], 'garage_spaces': [2]
})
predicted_price = model.predict(new_house)[0]

# Bootstrap confidence interval
n_bootstrap = 1000
bootstrap_preds = []
for _ in range(n_bootstrap):
    indices = np.random.choice(len(X_train), len(X_train), replace=True)
    X_boot = X_train.iloc[indices]
    y_boot = y_train.iloc[indices]
    model_boot = LinearRegression().fit(X_boot, y_boot)
    bootstrap_preds.append(model_boot.predict(new_house)[0])

ci_lower = np.percentile(bootstrap_preds, 2.5)
ci_upper = np.percentile(bootstrap_preds, 97.5)

print(f"\n=== Prediction for New House ===")
print(f"Predicted Price: ${predicted_price:,.0f}")
print(f"95% CI: (${ci_lower:,.0f}, ${ci_upper:,.0f})")

📝 Practice Exercises

Exercise 1

Calculate correlation and simple linear regression

Exercise 2

Build and interpret multiple regression models

Exercise 3

Evaluate model performance with R² and RMSE

Exercise 4

Real-world: House price prediction system

Key Takeaways

Correlation

Measures linear relationship strength (-1 to 1)
Correlation is not causation
High correlation can be spurious

Simple Regression

Predicts Y from single X
y = β₀ + β₁x
Minimize sum of squared errors

Multiple Regression

Predicts Y from multiple X variables
Coefficients show effect holding others constant
Standardize to compare importance

Evaluation

R² = variance explained
RMSE = typical error size
Check assumptions via residual plots

Common Pitfalls

Regression Mistakes to Avoid:

Causation from Correlation - Regression shows association, not causation; beware of confounders
Ignoring Multicollinearity - Highly correlated predictors make coefficients unstable and uninterpretable
Extrapolating Beyond Data - Models are only valid within the range of training data
Ignoring Residual Patterns - Non-random residuals indicate model misspecification
Misinterpreting R² - R² is not accuracy; 0.65 doesn’t mean “65% correct”
Forgetting to Scale - For comparing coefficient importance, standardize your features first

Connection to Machine Learning

Regression Concept	ML Application
Linear regression	Foundation of neural networks (linear layers)
Coefficients	Weights in neural networks
Minimizing SSE	Loss function optimization
Gradient descent	How models learn (next module!)
Regularization	Preventing overfitting (L1, L2)
Feature scaling	Required for most ML algorithms

ML Connection: Linear regression is the simplest neural network—one layer with linear activation. Understanding regression gives you intuition for how all deep learning works: find coefficients (weights) that minimize a loss function using gradient-based optimization.

Coming up next: We’ll connect all these statistical concepts to Machine Learning - seeing how statistics powers the algorithms that learn from data.

Next: From Statistics to ML

See how statistics becomes machine learning

Overview

Testing & Code Quality

Crash Courses

AI Engineering

Math for ML - Understanding Linear Algebra

Probability & Statistics for ML

Math for ML - Understanding Calculus

ML Mastery

Deep Learning Mastery

NestJS Mastery

Microservices Mastery

Low Level Design

OOP Concepts

SOLID Principles

Design Patterns

LLD Case Studies

System Design (HLD)

Senior Level (L5+/Staff)

HLD Case Studies

Engineering Fundamentals

DevOps & Operations

Azure Cloud Engineering

AWS Cloud

AWS Monitoring & Observability

AWS Security Services

AWS Serverless

AWS Operations

AWS Advanced

AWS Case Studies

GCP Cloud Engineering

DevOps Tools

Database Engineering

HIPAA Compliance Mastery

Operating Systems

Linux Internals

Distributed Systems

Networking Mastery

Build Your Own X

Go Lang Mastery

C Programming

Classic Research Papers

Distributed System Tools

​Correlation and Regression: Relationships and Predictions

​The House Price Question

​Correlation: Measuring Relationships

​The Pearson Correlation Coefficient

​Correlation Does Not Imply Causation

​Simple Linear Regression: The Line of Best Fit

​Finding the Best Line

​Interpreting the Coefficients

​Making Predictions

​Evaluating Regression Models

​R-Squared (Coefficient of Determination)

​Residual Analysis

​Multiple Linear Regression

​Example: Price from Size, Bedrooms, and Age

​Interpreting Multiple Regression

​Feature Scaling and Standardization

​Polynomial Regression: Non-Linear Relationships

​Assumptions of Linear Regression

​Mini-Project: House Price Predictor

​Practice Exercises

​Exercise 1: Fuel Efficiency

​Interview Questions

​Practice Challenge

​📝 Practice Exercises

Exercise 1

Exercise 2

Exercise 3

Exercise 4

​Key Takeaways

Correlation

Simple Regression

Multiple Regression

Evaluation

​Common Pitfalls

​Connection to Machine Learning

Next: From Statistics to ML

Correlation and Regression: Relationships and Predictions

The House Price Question

Correlation: Measuring Relationships

The Pearson Correlation Coefficient

Correlation Does Not Imply Causation

Simple Linear Regression: The Line of Best Fit

Finding the Best Line

Interpreting the Coefficients

Making Predictions

Evaluating Regression Models

R-Squared (Coefficient of Determination)

Residual Analysis

Multiple Linear Regression

Example: Price from Size, Bedrooms, and Age

Interpreting Multiple Regression

Feature Scaling and Standardization

Polynomial Regression: Non-Linear Relationships

Assumptions of Linear Regression

Mini-Project: House Price Predictor

Practice Exercises

Exercise 1: Fuel Efficiency

Interview Questions

Practice Challenge

📝 Practice Exercises

Key Takeaways

Common Pitfalls

Connection to Machine Learning