Correlation and Regression: Relationships and Predictions
The House Price Question
You’re a real estate analyst. A client asks: “I’m looking at a house with 2,500 square feet. What should I expect to pay?” You have data on recent sales. Can you use the relationship between size and price to make predictions? This is where statistics becomes prediction, the first step toward machine learning.

Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: Modules 1-5 (especially Probability and Distributions)
What You’ll Build: House price predictor, multi-variable regression model
🔗 ML Connection: Regression is the foundation of ALL supervised learning:
Linear regression IS a 1-layer neural network. Master this, and you understand deep learning’s core!
| Regression Concept | ML Equivalent |
|---|---|
| Coefficients (β) | Neural network weights |
| Intercept (β₀) | Bias term |
| Sum of squared errors | MSE loss function |
| Minimizing error | Gradient descent optimization |
| Adding features | Feature engineering |
| Regularization | Weight decay, dropout |
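To make this mapping concrete, here is a minimal sketch, on made-up synthetic data, of fitting a line by gradient descent on an MSE loss; this is the same loop a 1-layer neural network runs, with w as the weight and b as the bias:

```python
import numpy as np

# Synthetic data: true slope 3, true intercept 5, plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(0, 1, 100)

# "1-layer network": y_hat = w*x + b, trained by gradient descent on MSE
w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    error = (w * x + b) - y
    w -= lr * 2 * (error * x).mean()  # gradient of MSE w.r.t. the weight
    b -= lr * 2 * error.mean()        # gradient of MSE w.r.t. the bias

print(f"w = {w:.2f}, b = {b:.2f}")    # converges near (3, 5)
```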
Correlation: Measuring Relationships
Correlation measures the strength and direction of a linear relationship between two variables.

The Pearson Correlation Coefficient

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]

The value ranges from -1 to +1:

| Value | Interpretation |
|---|---|
| r = 1 | Perfect positive correlation |
| r = 0.7 to 0.9 | Strong positive correlation |
| r = 0.4 to 0.7 | Moderate positive correlation |
| r = 0.1 to 0.4 | Weak positive correlation |
| r ≈ 0 | No linear correlation |
| r = -1 | Perfect negative correlation |
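As a quick sketch, here is how you might compute r straight from the definition with NumPy; the size and price arrays are made-up illustration data, not the module’s dataset:

```python
import numpy as np

# Hypothetical sample: house size (sqft) and sale price ($1000s)
sqft = np.array([1400, 1600, 1700, 1875, 2100, 2350, 2450, 2600])
price = np.array([245, 280, 300, 308, 365, 400, 425, 450])

# Pearson r from the formula above
x_dev = sqft - sqft.mean()
y_dev = price - price.mean()
r = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

print(f"r = {r:.3f}")
print(np.corrcoef(sqft, price)[0, 1])  # built-in check: same value
```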
Correlation Does Not Imply Causation
This is perhaps the most important phrase in all of statistics. When A and B are strongly correlated, any of the following could explain it:

- A causes B
- B causes A
- A third variable C causes both

To establish causation, you need:

- Controlled experiments (A/B testing)
- Time sequence (cause before effect)
- Plausible mechanism
- Ruling out confounders
Simple Linear Regression: The Line of Best Fit
Linear regression finds the line that best predicts Y from X. The equation:

ŷ = β₀ + β₁x

Where:

- ŷ = predicted value
- β₀ = intercept (value of y when x = 0)
- β₁ = slope (change in y for each unit change in x)
Finding the Best Line
We minimize the sum of squared errors (residuals):

SSE = Σ(yᵢ − ŷᵢ)² = Σ(yᵢ − β₀ − β₁xᵢ)²

The optimal coefficients:

β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
β₀ = ȳ − β₁x̄

Interpreting the Coefficients

- Intercept (16.67): the theoretical price for a 0 sqft house (not meaningful here)
- Slope (0.1676): each additional square foot adds $167.60 to the price (prices here are in $1000s, so 0.1676 × $1,000 = $167.60)
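A minimal sketch of these closed-form estimates in NumPy, reusing the toy arrays from the correlation example. Note that the 16.67 and 0.1676 quoted above come from the module’s own dataset, so the toy data below produces different numbers:

```python
import numpy as np

sqft = np.array([1400, 1600, 1700, 1875, 2100, 2350, 2450, 2600])
price = np.array([245, 280, 300, 308, 365, 400, 425, 450])  # $1000s

# Closed-form least-squares estimates from the formulas above
x_dev = sqft - sqft.mean()
beta1 = (x_dev * (price - price.mean())).sum() / (x_dev ** 2).sum()
beta0 = price.mean() - beta1 * sqft.mean()

print(f"intercept β₀ = {beta0:.2f}")
print(f"slope     β₁ = {beta1:.4f}")
```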
Making Predictions
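Using the coefficients quoted above (intercept 16.67, slope 0.1676, with prices in $1000s), answering the client’s question from the introduction is a one-line plug-in:

```python
def predict_price(sqft: float) -> float:
    """Predicted price in $1000s, using the coefficients quoted in this module."""
    return 16.67 + 0.1676 * sqft

# The client's 2,500 sqft house:
# 16.67 + 0.1676 * 2500 = 435.67 ($1000s)
print(f"${predict_price(2500) * 1000:,.0f}")  # -> $435,670
```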
Evaluating Regression Models
R-Squared (Coefficient of Determination)
R² measures how much of the variance in Y is explained by X:

R² = 1 − SSE/SST = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²

Residual Analysis
Residuals = Actual − Predicted. Good models have residuals that:

- Are randomly scattered (no pattern)
- Have constant variance (homoscedasticity)
- Are approximately normally distributed
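A short sketch computing R², RMSE, and the residuals for the toy data above, using the module’s quoted coefficients as the model:

```python
import numpy as np

sqft = np.array([1400, 1600, 1700, 1875, 2100, 2350, 2450, 2600])
actual = np.array([245, 280, 300, 308, 365, 400, 425, 450])  # $1000s
predicted = 16.67 + 0.1676 * sqft

residuals = actual - predicted
sse = (residuals ** 2).sum()                 # sum of squared errors
sst = ((actual - actual.mean()) ** 2).sum()  # total sum of squares

r_squared = 1 - sse / sst                    # variance explained
rmse = np.sqrt((residuals ** 2).mean())      # typical error, in $1000s

print(f"R²   = {r_squared:.3f}")
print(f"RMSE = {rmse:.1f} ($1000s)")
```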
Multiple Linear Regression
Real house prices depend on more than just size. Let’s add more features.

Example: Price from Size, Bedrooms, and Age
Interpreting Multiple Regression
- Each sqft adds $142.50 holding other variables constant
- Each bedroom adds $21,893 holding other variables constant
- Each year of age reduces price by $3,125 holding other variables constant
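A sketch with scikit-learn on made-up data. The coefficients in the bullets above ($142.50, $21,893, −$3,125) come from the module’s dataset, so this toy fit gives different values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: [sqft, bedrooms, age_years] and prices in dollars
X = np.array([
    [1400, 3, 20], [1600, 3, 15], [1700, 4, 30], [1875, 3, 10],
    [2100, 4, 8],  [2350, 4, 25], [2450, 5, 5],  [2600, 4, 12],
])
y = np.array([245000, 280000, 300000, 308000, 365000, 400000, 425000, 450000])

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficients [sqft, bedrooms, age]:", model.coef_)

# Predict a 2500 sqft, 4-bedroom, 10-year-old house
print("prediction:", model.predict([[2500, 4, 10]])[0])
```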
Feature Scaling and Standardization
When features have different scales, it’s hard to compare coefficients. Standardizing each feature (subtract its mean, divide by its standard deviation) puts everything on a common scale, so coefficient magnitudes become directly comparable.
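A minimal standardization sketch in NumPy (scikit-learn’s StandardScaler does the same thing), reusing the toy feature matrix from the multiple-regression example:

```python
import numpy as np

# Columns: sqft, bedrooms, age (very different scales)
X = np.array([
    [1400, 3, 20], [1600, 3, 15], [1700, 4, 30], [1875, 3, 10],
    [2100, 4, 8],  [2350, 4, 25], [2450, 5, 5],  [2600, 4, 12],
], dtype=float)

# z-score each column: (x - mean) / std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0).round(12))  # ~0 for every feature
print(X_std.std(axis=0))             # 1 for every feature
```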
Polynomial Regression: Non-Linear Relationships

What if the relationship isn’t a straight line? Adding polynomial terms (x², x³, …) as extra features lets ordinary linear regression fit curves, because the model remains linear in its coefficients.
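A quick sketch on synthetic quadratic data: np.polyfit estimates the polynomial coefficients by least squares, which is exactly linear regression on the features x and x²:

```python
import numpy as np

# Synthetic curved data: y = 2 + 0.5x + 0.3x² + noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 0.5 * x + 0.3 * x ** 2 + rng.normal(0, 1, x.size)

coeffs = np.polyfit(x, y, deg=2)   # highest-degree coefficient first
print(coeffs)                      # roughly [0.3, 0.5, 2.0]

y_hat = np.polyval(coeffs, x)      # predictions from the fitted curve
```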
Assumptions of Linear Regression

For valid inference, linear regression assumes:

| Assumption | Description | How to Check |
|---|---|---|
| Linearity | Relationship is linear | Scatter plot, residual plot |
| Independence | Observations are independent | Study design |
| Homoscedasticity | Constant variance of residuals | Residual vs fitted plot |
| Normality | Residuals are normally distributed | Q-Q plot, histogram |
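A plotting sketch for the last two checks in the table, reusing the toy residuals from the evaluation example; eyeball the left panel for patterns or funnel shapes and the right panel for rough normality:

```python
import numpy as np
import matplotlib.pyplot as plt

sqft = np.array([1400, 1600, 1700, 1875, 2100, 2350, 2450, 2600])
actual = np.array([245, 280, 300, 308, 365, 400, 425, 450])
fitted = 16.67 + 0.1676 * sqft
residuals = actual - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(fitted, residuals)                # want: random scatter, even spread
ax1.axhline(0, color="gray", linestyle="--")
ax1.set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs fitted")
ax2.hist(residuals, bins=5)                   # want: roughly bell-shaped
ax2.set(xlabel="Residual", title="Residual distribution")
plt.show()
```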
Mini-Project: House Price Predictor
Build a complete house price prediction system:

Practice Exercises
Exercise 1: Fuel Efficiency
Interview Questions
Question 1: Interpretation of Coefficients (Google)
Question: In a salary prediction model, the coefficient for years_experience is 5000 and for has_phd (0/1) is 15000. How do you interpret these?
Question 2: R-squared Interpretation (Amazon)
Question: Your model predicting delivery time has R-squared = 0.65. A colleague says “65% accuracy isn’t good enough.” Is this correct?
Question 3: Correlation vs Causation (Tech Companies)
Question: Data shows strong correlation (r=0.85) between ice cream sales and sunburns. Should we stop selling ice cream to prevent sunburns?
Question 4: Multicollinearity (Data Science Roles)
Question: You’re predicting house prices with both square_feet and num_rooms. The individual p-values are high (not significant), but the model R² is 0.85. What’s happening?
Practice Challenge
Challenge: Build a Complete Regression Analysis Pipeline
Create a production-ready regression analysis for house prices.
📝 Practice Exercises
Exercise 1: Calculate correlation and simple linear regression
Exercise 2: Build and interpret multiple regression models
Exercise 3: Evaluate model performance with R² and RMSE
Exercise 4: Real-world house price prediction system
Key Takeaways
Correlation
- Measures linear relationship strength (-1 to 1)
- Correlation is not causation
- High correlation can be spurious
Simple Regression
- Predicts Y from single X
- y = β₀ + β₁x
- Minimize sum of squared errors
Multiple Regression
- Predicts Y from multiple X variables
- Coefficients show effect holding others constant
- Standardize to compare importance
Evaluation
- R² = variance explained
- RMSE = typical error size
- Check assumptions via residual plots
Common Pitfalls

- Treating correlation as causation
- Extrapolating beyond the range of the training data
- Ignoring multicollinearity between correlated features
- Trusting a high R² without checking residual plots
Connection to Machine Learning
| Regression Concept | ML Application |
|---|---|
| Linear regression | Foundation of neural networks (linear layers) |
| Coefficients | Weights in neural networks |
| Minimizing SSE | Loss function optimization |
| Gradient descent | How models learn (next module!) |
| Regularization | Preventing overfitting (L1, L2) |
| Feature scaling | Required for most ML algorithms |
Next: From Statistics to ML
See how statistics becomes machine learning