Correlation and Regression: Relationships and Predictions
The House Price Question
You’re a real estate analyst. A client asks: “I’m looking at a house with 2,500 square feet. What should I expect to pay?” You have data on recent sales. Can you use the relationship between size and price to make predictions? This is where statistics becomes prediction: the first step toward machine learning.
Difficulty: Intermediate
Prerequisites: Modules 1-5 (especially Probability and Distributions)
What You’ll Build: House price predictor, multi-variable regression model
| Regression Concept | ML Equivalent |
|---|---|
| Coefficients (β) | Neural network weights |
| Intercept (β₀) | Bias term |
| Sum of squared errors | MSE loss function |
| Minimizing error | Gradient descent optimization |
| Adding features | Feature engineering |
| Regularization | Weight decay, dropout |
Correlation: Measuring Relationships
Correlation measures the strength and direction of a linear relationship between two variables.
Analogy: Think of correlation as measuring how well two dancers are synchronized. A correlation of +1 means they are moving in perfect unison: when one steps forward, the other does too, in exact proportion. A correlation of -1 means they are perfectly mirrored: when one steps forward, the other steps back. A correlation of 0 means they are dancing independently, like strangers at a concert.
The Pearson Correlation Coefficient
The value of r ranges from -1 to +1:
| Value | Interpretation |
|---|---|
| r = 1 | Perfect positive correlation |
| r = 0.7 to 0.9 | Strong positive correlation |
| r = 0.4 to 0.7 | Moderate positive correlation |
| r = 0.1 to 0.4 | Weak positive correlation |
| r ≈ 0 | No linear correlation |
| r = -1 | Perfect negative correlation |
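To make the coefficient concrete, here is a minimal sketch of Pearson's r: the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations. The `sqft` and `price` values are hypothetical illustration data, not figures from this module, and `pearson_r` is a helper name chosen for this example.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Hypothetical illustration data: house size (sqft) and sale price (thousands of dollars)
sqft = [1100, 1400, 1650, 1900, 2300, 2600, 3000]
price = [199, 245, 310, 330, 405, 450, 520]

r = pearson_r(sqft, price)
r_numpy = float(np.corrcoef(sqft, price)[0, 1])  # NumPy's built-in, as a cross-check
print(f"r = {r:.3f} (np.corrcoef gives {r_numpy:.3f})")
```

Because the deviations are divided by their own spread, r has no units: rescaling price from thousands of dollars to dollars leaves it unchanged.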
Correlation Does Not Imply Causation
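This is perhaps the most important phrase in all of statistics. When A and B are correlated, any of the following could be true:
- A causes B
- B causes A
- A third variable C causes both
Establishing causation takes more than a correlation. Supporting evidence includes:
- Controlled experiments (A/B testing)
- Time sequence (cause before effect)
- Plausible mechanism
- Ruling out confounders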
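The "third variable" case can be simulated. In the sketch below, all data is synthetic and the effect sizes are arbitrary assumptions: C drives both A and B, and A and B never influence each other. They still correlate strongly until C is accounted for. `residual_after` is a helper defined for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data (not real): C drives both A and B.
c = rng.normal(size=n)
a = 2.0 * c + rng.normal(size=n)
b = 1.5 * c + rng.normal(size=n)

def residual_after(y, x):
    """Part of y left over after removing its best straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

raw_r = float(np.corrcoef(a, b)[0, 1])
# Correlate what is left of A and B once C's contribution is removed
partial_r = float(np.corrcoef(residual_after(a, c), residual_after(b, c))[0, 1])

print(f"corr(A, B)                = {raw_r:.2f}")
print(f"corr(A, B) with C removed = {partial_r:.2f}")
```

This only works because C was measured. An unobserved confounder cannot be removed this way, which is why controlled experiments appear on the list above.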
Simple Linear Regression: The Line of Best Fit
Linear regression finds the line that best predicts Y from X. The equation:
$$\hat{y} = \beta_0 + \beta_1 x$$
Where:
- $\hat{y}$ = predicted value
- $\beta_0$ = intercept (value of y when x = 0)
- $\beta_1$ = slope (change in y for each unit change in x)
Finding the Best Line
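We minimize the sum of squared errors (residuals):
$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
The optimal coefficients:
$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
Interpreting the Coefficients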
- Intercept (16.67): Theoretical price, in thousands of dollars, for a 0 sqft house (not meaningful here)
- Slope (0.1676): With price measured in thousands of dollars, each additional square foot adds about $167.60 to the price
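For reference, a least-squares fit of this kind can be sketched in a few lines. The slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept makes the line pass through the point of means. The data below is hypothetical, so it will not reproduce the 16.67 and 0.1676 quoted above. `fit_line` is a name chosen for this example.

```python
import numpy as np

# Hypothetical illustration data, not the dataset behind the coefficients quoted above
sqft = np.array([1100, 1400, 1650, 1900, 2300, 2600, 3000], dtype=float)
price = np.array([199, 245, 310, 330, 405, 450, 520], dtype=float)  # thousands of dollars

def fit_line(x, y):
    """Closed-form least squares for y ~ b0 + b1 * x."""
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar  # forces the line through (x_bar, y_bar)
    return float(b0), float(b1)

b0, b1 = fit_line(sqft, price)
print(f"intercept = {b0:.2f}, slope = {b1:.4f}")

# Cross-check against NumPy's degree-1 polynomial fit (returns slope first)
slope_np, intercept_np = np.polyfit(sqft, price, 1)
```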
Making Predictions
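A prediction is the fitted line evaluated at a new x. The sketch below answers the client's 2,500 sqft question using the intercept and slope quoted above. It assumes price is in thousands of dollars, which is what makes a slope of 0.1676 correspond to $167.60 per square foot. The function name is chosen for this example.

```python
# Coefficients as quoted in the interpretation above.
# Assumption: price is measured in thousands of dollars.
INTERCEPT = 16.67
SLOPE = 0.1676

def predict_price_thousands(sqft):
    """Predicted price (thousands of dollars) from the fitted line."""
    return INTERCEPT + SLOPE * sqft

pred = predict_price_thousands(2500)
print(f"Predicted price for 2,500 sqft: ${pred * 1000:,.0f}")
# → Predicted price for 2,500 sqft: $435,670
```

The estimate is only as trustworthy as the range of house sizes the line was fitted on. Predicting far outside that range is extrapolation.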
Evaluating Regression Models
R-Squared (Coefficient of Determination)
R² measures the proportion of the variance in Y that is explained by X.
Residual Analysis
Residuals = Actual - Predicted. Good models have residuals that:
- Are randomly scattered (no pattern)
- Have constant variance (homoscedasticity)
- Are approximately normally distributed
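The evaluation quantities in this section can be computed directly from actual and predicted values. This is a minimal sketch on hypothetical data, with `evaluate` as a helper name chosen here. It returns the residuals, R² (one minus unexplained variation over total variation), and RMSE, the typical error size in the units of y.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return residuals, R-squared, and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred                     # actual - predicted
    ss_res = np.sum(residuals ** 2)                 # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(residuals ** 2))
    return residuals, float(r2), float(rmse)

# Hypothetical illustration data (price in thousands of dollars)
sqft = np.array([1100, 1400, 1650, 1900, 2300, 2600, 3000], dtype=float)
price = np.array([199, 245, 310, 330, 405, 450, 520], dtype=float)

slope, intercept = np.polyfit(sqft, price, 1)
residuals, r2, rmse = evaluate(price, intercept + slope * sqft)

print(f"R^2  = {r2:.3f}")
print(f"RMSE = {rmse:.1f} (thousands of dollars)")
print("residuals:", residuals.round(1))
```

For a least-squares line with an intercept, the residuals always sum to zero, so a zero mean says nothing about fit quality. The pattern of residuals against fitted values is what reveals curvature or changing variance.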
Multiple Linear Regression
Real house prices depend on more than just size. Let’s add more features.
Example: Price from Size, Bedrooms, and Age
Interpreting Multiple Regression
- Each sqft adds $142.50 holding other variables constant
- Each bedroom adds $21,893 holding other variables constant
- Each year of age reduces price by $3,125 holding other variables constant
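One way to see what "holding other variables constant" means is to fit a model on data whose true coefficients are known. The sketch below generates synthetic sales from the three coefficients quoted above and recovers them with ordinary least squares. The intercept of 50,000, the noise level, and the feature ranges are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic features (not real sales data)
sqft = rng.uniform(800, 4000, size=n)
bedrooms = rng.integers(1, 6, size=n).astype(float)  # 1 to 5 bedrooms
age = rng.uniform(0, 60, size=n)

# Prices generated from the coefficients quoted above; the intercept and
# noise level are assumptions for this example.
true_coefs = np.array([50_000.0, 142.50, 21_893.0, -3_125.0])
X = np.column_stack([np.ones(n), sqft, bedrooms, age])  # column of 1s = intercept
price = X @ true_coefs + rng.normal(0, 5_000, size=n)

# Ordinary least squares: coefficients that minimise the sum of squared errors
coefs, *_ = np.linalg.lstsq(X, price, rcond=None)
for name, c in zip(["intercept", "sqft", "bedrooms", "age"], coefs):
    print(f"{name:>9}: {c:,.2f}")

# Two houses identical except for one extra bedroom
house = np.array([1.0, 2000.0, 3.0, 10.0])
one_more_bedroom = np.array([1.0, 2000.0, 4.0, 10.0])
diff = float((one_more_bedroom - house) @ coefs)
print(f"Predicted difference for one extra bedroom: ${diff:,.0f}")
```

The predicted difference between the two houses equals the bedroom coefficient exactly, because every other input is the same.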
Feature Scaling and Standardization
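A minimal sketch of z-score standardization, assuming synthetic data with made-up true effects: each feature is rescaled to mean 0 and standard deviation 1 before fitting, so every coefficient is expressed per standard deviation of its feature. `standardize` and `fit` are helper names chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300

# Synthetic data (assumed for illustration): two features on very different scales
sqft = rng.uniform(800, 4000, size=n)
bedrooms = rng.integers(1, 6, size=n).astype(float)
price = 150.0 * sqft + 20_000.0 * bedrooms + rng.normal(0, 5_000, size=n)

def standardize(col):
    """Z-score: subtract the mean, divide by the standard deviation."""
    return (col - col.mean()) / col.std()

def fit(features, target):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(target))] + features)
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coefs

raw = fit([sqft, bedrooms], price)
scaled = fit([standardize(sqft), standardize(bedrooms)], price)

print("raw coefficients (per sqft, per bedroom):", raw[1:].round(1))
print("scaled coefficients (per 1 std dev each): ", scaled[1:].round(0))
```

On the raw scale the bedroom coefficient is far larger, yet after standardization square footage has the bigger effect per standard deviation. The predictions are identical in both fits, and only the units of the coefficients change.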
When features have different scales, it’s hard to compare coefficients.
Polynomial Regression: Non-Linear Relationships
What if the relationship isn’t a straight line?
Assumptions of Linear Regression
For valid inference, linear regression assumes:
| Assumption | Description | How to Check |
|---|---|---|
| Linearity | Relationship is linear | Scatter plot, residual plot |
| Independence | Observations are independent | Study design |
| Homoscedasticity | Constant variance of residuals | Residual vs fitted plot |
| Normality | Residuals are normally distributed | Q-Q plot, histogram |
Mini-Project: House Price Predictor
Build a complete house price prediction system:
Practice Exercises
Exercise 1: Fuel Efficiency
Interview Questions
Question 1: Interpretation of Coefficients (Google)
Question 2: R-squared Interpretation (Amazon)
Question 3: Correlation vs Causation (Tech Companies)
Question 4: Multicollinearity (Data Science Roles)
Practice Challenge
Challenge: Build a Complete Regression Analysis Pipeline
📝 Practice Exercises
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Key Takeaways
Correlation
- Measures linear relationship strength (-1 to 1)
- Correlation is not causation
- High correlation can be spurious
Simple Regression
- Predicts Y from single X
- y = β₀ + β₁x
- Minimize sum of squared errors
Multiple Regression
- Predicts Y from multiple X variables
- Coefficients show effect holding others constant
- Standardize to compare importance
Evaluation
- R² = proportion of variance explained
- RMSE = typical error size
- Check assumptions via residual plots
Common Pitfalls
Connection to Machine Learning
| Regression Concept | ML Application |
|---|---|
| Linear regression | Foundation of neural networks (linear layers) |
| Coefficients | Weights in neural networks |
| Minimizing SSE | Loss function optimization |
| Gradient descent | How models learn (next module!) |
| Regularization | Preventing overfitting (L1, L2) |
| Feature scaling | Required for most ML algorithms |
Next: From Statistics to ML
Interview Deep-Dive
Your regression model has R-squared of 0.45. A manager says that is terrible. How do you respond?
- R-squared of 0.45 means the model explains 45% of the variance in the target variable. Whether that is “terrible” depends entirely on the domain and the alternative.
- In physical sciences (modeling a chemical reaction), R-squared of 0.45 would indeed be poor because the underlying relationships are deterministic and we expect R-squared above 0.9. But in social sciences, economics, and most business prediction problems, R-squared of 0.45 is often quite good because human behavior has enormous inherent unpredictability.
- The right question is not “is 0.45 high enough?” but “is this model useful?” If you are predicting customer lifetime value and the model correctly identifies the top 20% of customers with 80% precision, it is delivering massive business value regardless of the R-squared number.
- I would also check: What is RMSE in practical units? If the model predicts delivery time with RMSE of 5 minutes and the business only needs accuracy within 10 minutes, then R-squared is irrelevant — the model is accurate enough. R-squared is a summary statistic about variance explained; business impact depends on whether the predictions are actionable.
How do you explain the difference between correlation and causation in the context of a regression model to a non-technical stakeholder?
- I use this analogy: “Imagine I show you data proving that cities with more fire stations have more fires. Does that mean fire stations cause fires? Obviously not — bigger cities have both more fires and more fire stations. The city size is the hidden third factor driving both.”
- A regression coefficient tells you the association between X and Y after controlling for the other variables in the model. It cannot prove causation, because there might be confounders you did not include. If your model finds that “customers who use the mobile app spend 30% more,” the regression is telling you the truth: app users do spend more. But it is not telling you that making someone download the app will cause them to spend more. A likely explanation is that already-engaged customers both use the app and spend more.
- To establish causation from a regression, you need either a randomized experiment (assign some users to the app randomly) or a carefully designed observational study with an instrumental variable or regression discontinuity design.
- I always warn stakeholders: “This model tells us what is associated with higher revenue. It does not tell us what to change to increase revenue. For that, we need experiments.”
You find that adding a feature to your regression model increases R-squared from 0.72 to 0.73 but makes the coefficient on another feature flip sign. What is happening?
- This is almost certainly multicollinearity. The new feature is correlated with one of the existing features. When both are in the model, the coefficient estimates become unstable because the model cannot cleanly separate their individual effects. A sign flip means the partial effect (holding the new variable constant) is different from the marginal effect (ignoring it).
- A concrete example: predicting house price with square footage and number of rooms. The two are highly correlated (bigger houses have more rooms). With only square footage, its coefficient is positive and strong. Add number of rooms, and the square footage coefficient might shrink or even flip negative, because the model is now asking “holding number of rooms constant, what is the effect of more square footage?” The data can barely answer that question, since few houses differ much in square footage without also differing in room count.
- To diagnose this, I would compute the Variance Inflation Factor (VIF) for each predictor. VIF above 5 suggests concerning multicollinearity, above 10 is severe.
- The solution depends on the goal. For prediction, multicollinearity matters much less: the model can still predict well as long as new data shows the same correlation between the features. For interpretation, it is a serious problem. Solutions include dropping one of the correlated features, combining them into a composite, using PCA to create orthogonal features, or switching to a regularized model like Ridge regression, which stabilizes the coefficient estimates when predictors are correlated.
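The VIF diagnostic mentioned in this answer can be computed by regressing each predictor on all the others and taking 1 / (1 - R²) of that regression. The sketch below uses only NumPy and synthetic data: `rooms` is deliberately built to track `sqft`, and `age` is independent. The function name `vif` and all the numbers are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (X has no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coefs, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coefs
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
n = 500
sqft = rng.uniform(800, 4000, size=n)
rooms = sqft / 400 + rng.normal(0, 0.5, size=n)  # simulated to track sqft closely
age = rng.uniform(0, 60, size=n)                 # unrelated to the other two

vifs = vif(np.column_stack([sqft, rooms, age]))
for name, v in zip(["sqft", "rooms", "age"], vifs):
    print(f"{name:>5}: VIF = {v:.1f}")
```

In this setup `sqft` and `rooms` land well above the threshold of 5 while `age` stays near 1, matching the rule of thumb stated above.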