Describing Data: What’s Normal?
The House Hunting Problem
You’re moving to Austin, Texas. You have a budget of $500,000 and want to know: Is that enough for a decent 3-bedroom house? You could look at one listing, but that’s just one data point. You need to understand the whole picture. Let’s load some real data:
Measures of Central Tendency: “What’s Typical?”
The Mean (Average)
The mean is what most people think of as “average”: add everything up and divide by the count.
Analogy: Think of the mean as the balance point of a seesaw. If you placed each data point as a weight along a beam, the mean is where you would put the fulcrum to make it balance perfectly. One very heavy weight far from center (an outlier) can shift the balance point dramatically.
The Median (Middle Value)
The median is the middle value when you sort the data. Half the values are above, half below.
| Use Mean When | Use Median When |
|---|---|
| Data is symmetric | Data has outliers |
| No extreme values | Income/wealth data |
| You want total divided by count | You want “typical” value |
| Example: Test scores | Example: House prices |
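To see the table's advice in action, here is a minimal sketch using Python's standard `statistics` module. The prices are made up for illustration (in thousands of dollars), with one luxury listing added as an outlier; they are not the Austin data.

```python
import statistics

# Hypothetical listing prices in $K (illustrative only); the last one is an outlier.
prices_k = [350, 380, 400, 420, 450, 480, 2500]

mean_price = statistics.mean(prices_k)      # pulled upward by the 2500 listing
median_price = statistics.median(prices_k)  # middle value of the sorted list

print(f"Mean:   ${mean_price:.0f}K")
print(f"Median: ${median_price:.0f}K")
```

The median stays at the middle listing while the mean lands above every ordinary house in the list, which is why the median is usually the safer "typical" figure for prices.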
The Mode (Most Common Value)
The mode is the value that appears most frequently. Less useful for continuous data, but great for categories.
- Most popular shirt size at a store
- Most common customer complaint
- Peak traffic hour
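As a small sketch of the shirt-size case, the standard library's `statistics.mode` returns the most frequent value, and `statistics.multimode` returns every value tied for first place. The sales list below is invented for illustration.

```python
import statistics

# Hypothetical shirt sizes sold in one day (illustrative only).
sizes_sold = ["M", "L", "M", "S", "M", "L", "XL"]
most_common = statistics.mode(sizes_sold)

# multimode lists every value tied for most frequent, in first-seen order.
tied = statistics.multimode(["S", "S", "L", "L", "M"])

print("Most popular size:", most_common)
print("Tied modes:", tied)
```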
Measures of Spread: “How Different Are Things?”
Knowing the center isn’t enough. Consider these two neighborhoods:
- Neighborhood A: House prices sit in a narrow band, topping out at $462K (consistent)
- Neighborhood B: House prices are spread widely, reaching up to $550K (huge variation)
Range (Simplest Measure)
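The range is simply the largest value minus the smallest. A minimal sketch follows, assuming two made-up price lists (in $K) that stand in for the two neighborhoods; the individual prices are hypothetical and only meant to show one tight cluster and one wide spread.

```python
# Hypothetical prices in $K (illustrative only, not the original data).
neighborhood_a = [450, 455, 458, 460, 462]
neighborhood_b = [300, 380, 450, 500, 550]

def price_range(prices):
    """Range = largest value minus smallest value."""
    return max(prices) - min(prices)

print("Range A:", price_range(neighborhood_a), "K")
print("Range B:", price_range(neighborhood_b), "K")
```

Because the range looks only at the two extreme values, a single unusual house can change it completely, which is why the measures below are usually more informative.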
Variance: Average Squared Distance from Mean
Variance measures how far values typically are from the mean.
- Find the mean
- For each value, calculate distance from mean
- Square each distance (makes all positive, penalizes big deviations)
- Average the squared distances
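The four steps above can be sketched directly in Python. The prices are hypothetical (in $K), and the final step divides by n, which corresponds to the population variance; the standard library's `statistics.pvariance` is used only as a cross-check.

```python
import statistics

prices_k = [450, 455, 458, 460, 462]  # hypothetical prices in $K

# Step 1: find the mean
mean_price = sum(prices_k) / len(prices_k)
# Step 2: distance of each value from the mean
deviations = [p - mean_price for p in prices_k]
# Step 3: square each distance
squared = [d ** 2 for d in deviations]
# Step 4: average the squared distances (divide by n: population variance)
variance = sum(squared) / len(squared)

print("Variance by hand:    ", variance)
print("statistics.pvariance:", statistics.pvariance(prices_k))
```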
Standard Deviation: Variance in Original Units
Variance is in “squared dollars” which is hard to interpret. Standard deviation brings us back to dollars.
- Neighborhood A: Houses are typically within plus or minus $5.4K of the mean
- Neighborhood B: Houses are typically within plus or minus $71K of the mean
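Standard deviation is the square root of the variance, so it comes out in the same units as the prices. The sketch below uses two hypothetical price lists (in $K); they are stand-ins and are not expected to reproduce the figures quoted above.

```python
import math
import statistics

# Hypothetical prices in $K (illustrative only).
neighborhood_a = [450, 455, 458, 460, 462]
neighborhood_b = [300, 380, 450, 500, 550]

def std_dev(values):
    """Population standard deviation: square root of the average squared deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance)

std_a = std_dev(neighborhood_a)
std_b = std_dev(neighborhood_b)

print(f"Neighborhood A: +/- {std_a:.1f}K")
print(f"Neighborhood B: +/- {std_b:.1f}K")
# statistics.pstdev computes the same quantity
print(f"Check A with pstdev: {statistics.pstdev(neighborhood_a):.1f}K")
```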
Percentiles and Quartiles: “Where Does This Value Rank?”
Going back to our Austin house prices: is $500K expensive or affordable? Percentiles tell you what percentage of values fall below a given number.
The Interquartile Range (IQR)
The IQR is the range of the middle 50% of data:
Visualizing Data: See the Distribution
Numbers are great, but our brains understand pictures better.
Box Plot (Box-and-Whisker)
Histogram
Complete Summary Statistics
Here’s a function that gives you the full picture:
🎯 Practice Exercises
Exercise 1: Salary Analysis
Solution
Exercise 2: Test Score Comparison
Solution
🏠 Mini-Project: House Price Analyzer
Build a complete house price analysis tool!
Complete Solution
Key Takeaways
Central Tendency
- Mean: Add and divide. Sensitive to outliers.
- Median: Middle value. Robust to outliers.
- Mode: Most common. Great for categories.
Spread
- Range: Max - Min. Simple but limited.
- Variance: Average squared distance from mean.
- Std Dev: Square root of variance. Same units as data.
Position
- Percentiles: What % of values fall below this?
- Quartiles: 25th, 50th, 75th percentiles.
- IQR: Range of middle 50%. Good for outlier detection.
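The position measures can be sketched with `statistics.quantiles`. The prices (in $K) are hypothetical, and the `"inclusive"` method is one of several quartile conventions, so other tools may report slightly different cut points for the same data. The 1.5 × IQR fences are the usual rule of thumb for flagging outliers.

```python
import statistics

# Hypothetical sorted listing prices in $K (illustrative only).
prices_k = [320, 350, 380, 410, 440, 470, 500, 540, 600, 750]
budget_k = 500

# Percentile rank: what % of listings are priced below the budget?
pct_below = 100 * sum(p < budget_k for p in prices_k) / len(prices_k)

# Quartiles are the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(prices_k, n=4, method="inclusive")
iqr = q3 - q1

# Rule-of-thumb outlier fences: 1.5 x IQR beyond Q1 and Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices_k if p < lower_fence or p > upper_fence]

print(f"{pct_below:.0f}% of listings are below ${budget_k}K")
print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers)
```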
When to Use What
- Symmetric data: Mean + Std Dev
- Skewed data: Median + IQR
- Outliers present: Always check both!
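One way to "check both" is to compute the mean-based and median-based summaries side by side. The helper below is a sketch, not a canonical implementation: `describe` is a hypothetical name, the prices are invented, and quartiles use the `"inclusive"` convention.

```python
import statistics

def describe(values):
    """Return center, spread, and position measures for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

# Hypothetical prices in $K with one extreme listing.
prices_k = [350, 380, 400, 420, 450, 480, 2500]
summary = describe(prices_k)

for name, value in summary.items():
    print(f"{name:>8}: {value:.1f}")
```

When the mean sits far above the median and the standard deviation dwarfs the IQR, as it does for this list, that is a hint that an outlier or skew is present and the median and IQR are the more trustworthy pair.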
Common Mistakes to Avoid
Interview Questions
Question 1: Mean vs Median (Facebook/Meta)
Question 2: Outlier Detection (Google)
Question 3: Variance Application (Amazon)
Question 4: Percentiles in Practice (Netflix)
Practice Challenge
Challenge: Analyze This Real Dataset
📝 Practice Exercises
Exercise 1
Exercise 2
Exercise 3
Exercise 4
How This Connects to Machine Learning
Everything you just learned is foundational to ML:
| Descriptive Stat | ML Application |
|---|---|
| Mean | Used in normalization, calculating errors |
| Variance | Feature scaling, understanding data spread |
| Standard deviation | Standardization (z-scores), batch normalization |
| Percentiles | Handling outliers, creating features |
| Distribution shape | Choosing the right model and loss function |
Interview Prep: Common Questions
Frequently Asked Interview Questions
Q: When should you use the median instead of the mean?
Use the median when data has outliers or is heavily skewed. Classic examples: income data (billionaires skew the mean), house prices, response times (occasional timeouts).
Q: How do you detect outliers?
Common methods: IQR method (1.5 × IQR beyond Q1/Q3), z-score method (beyond ±2 or ±3 standard deviations), visual inspection with box plots.
Q: What’s the difference between population and sample variance?
Population variance divides by n, sample variance divides by (n-1). Use n-1 for samples because it provides an unbiased estimate of population variance (Bessel’s correction).
Q: A dataset has mean = median. What does this tell you?
The distribution is likely symmetric (not skewed), although equal mean and median do not by themselves guarantee symmetry. In a perfectly symmetric, unimodal distribution, mean = median = mode.
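For the population-versus-sample variance question above, the only difference is the divisor. A minimal sketch with made-up values, cross-checked against `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n-1):

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values; the mean is 5
n = len(sample)
mean = sum(sample) / n
sum_sq_dev = sum((x - mean) ** 2 for x in sample)

population_var = sum_sq_dev / n    # treat the data as the whole population
sample_var = sum_sq_dev / (n - 1)  # Bessel's correction for a sample

print("Population variance:", population_var)
print("Sample variance:    ", sample_var)
print("Library check:", statistics.pvariance(sample), statistics.variance(sample))
```

The sample variance is always the larger of the two, and the gap shrinks as n grows.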
Common Pitfalls
Key Takeaways
- ✅ Mean - Sum divided by count; sensitive to outliers
- ✅ Median - Middle value; robust to outliers; use for skewed data
- ✅ Mode - Most frequent value; useful for categorical data
- ✅ Variance & Std Dev - Measure spread around the mean
- ✅ Percentiles & IQR - Divide data into portions; detect outliers
- ✅ Z-scores - Standardize values across different scales
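A z-score is a value's distance from the mean, measured in standard deviations: z = (x - mean) / std. The sketch below standardizes two hypothetical features on very different scales; `z_scores` is an illustrative helper, and it assumes the values are not all identical (a zero standard deviation would cause a division by zero).

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Hypothetical features on different scales (illustrative only).
square_feet = [1200, 1500, 1800, 2100, 2400]
bedrooms = [2, 3, 3, 4, 5]

z_sqft = z_scores(square_feet)
z_beds = z_scores(bedrooms)

print([round(z, 2) for z in z_sqft])
print([round(z, 2) for z in z_beds])
```

After standardizing, both features are expressed in the same unit (standard deviations from their own mean), so they can be compared or fed to a model on equal footing.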
Next: Probability Foundations
Interview Deep-Dive
You are analyzing daily revenue at an e-commerce company. The mean is $85K but the median is $42K. What is happening and what do you report to the VP?
- The large gap between the mean ($85K) and the median ($42K) tells me the distribution is heavily right-skewed. A small number of very high-revenue days (likely driven by flash sales, holiday events, or a few massive B2B orders) are pulling the mean upward. On a “typical” day, the company makes closer to $42K.
- I would report both numbers to the VP, but frame it carefully: “On a normal day, we generate about $42K, but our average is $85K because we have occasional spike days that significantly boost the total. If you are planning staffing and operations around daily expectations, use the median. If you are forecasting monthly totals, the mean times 30 gives a better estimate.”
- I would also present the P90 and P99 to show how extreme the spike days are, and potentially a histogram showing the bimodal or long-tail shape. Stakeholders make better decisions when they understand the shape, not just a single number.
- The key risk: if someone uses the $85K mean for daily budgeting, they will overspend on most days and then scramble during the rare high days. Conversely, if they use only the median, they will underestimate total monthly revenue.
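A small simulation makes the mean-versus-median tradeoff concrete. The 30-day series below is invented so that its median is 42 and its mean is 85 (in $K): 27 ordinary days plus 3 spike days. It illustrates the reasoning and is not real company data.

```python
import statistics

# Hypothetical month of daily revenue in $K: 27 typical days, 3 spike days.
daily_revenue_k = [42] * 27 + [472] * 3

mean_day = statistics.mean(daily_revenue_k)
median_day = statistics.median(daily_revenue_k)
actual_month = sum(daily_revenue_k)

print(f"Mean day:   ${mean_day:.0f}K")
print(f"Median day: ${median_day:.0f}K")
print(f"Actual monthly total:      ${actual_month}K")
print(f"Forecast from mean x 30:   ${mean_day * 30:.0f}K")
print(f"Forecast from median x 30: ${median_day * 30:.0f}K")
```

The mean recovers the monthly total exactly, while the median describes what most individual days look like; each answers a different question.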
When would you use standard deviation versus interquartile range, and why does it matter in practice?
- Standard deviation assumes your data is roughly symmetric and does not have extreme outliers. It uses every data point including the tails, so a single extreme value can inflate it dramatically. It is the right choice when your data is approximately normal — test scores, manufacturing measurements, or sensor readings from a calibrated instrument.
- IQR (the range between the 25th and 75th percentiles) is robust to outliers because it only looks at the middle 50% of the data. It is the right choice for skewed or contaminated data — income distributions, transaction amounts, page load times, or any dataset where you suspect data quality issues at the extremes.
- In practice, the choice has real consequences. If you use standard deviation for fraud detection thresholds on transaction amounts (which are heavily right-skewed), the outliers inflate the std dev so much that your “anomaly threshold” becomes absurdly high and you miss actual fraud. Using IQR-based thresholds (like the 1.5 x IQR rule) gives much more practical detection boundaries.
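- A concrete example: at a payments company, a few very large transactions can inflate the standard deviation of transaction amounts to around $500, which puts a mean-plus-three-standard-deviations threshold near $1,550 and misses the moderately fraudulent transactions of around $300. An IQR-based approach would instead flag amounts above roughly $125 as worth investigating.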
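To make the contrast between the two thresholds concrete, here is a sketch on a made-up list of transaction amounts with one moderately suspicious value (300) and one extreme value (50000). The data, and the choice of a three-standard-deviation cutoff and `statistics.quantiles` with its default method, are assumptions for illustration only.

```python
import statistics

# Hypothetical transaction amounts in dollars (illustrative only).
amounts = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 300, 50000]

# Threshold 1: mean + 3 standard deviations (inflated by the extreme value).
mean = statistics.mean(amounts)
std = statistics.pstdev(amounts)
sigma_threshold = mean + 3 * std
sigma_flags = [a for a in amounts if a > sigma_threshold]

# Threshold 2: Q3 + 1.5 x IQR (depends only on the middle of the data).
q1, _, q3 = statistics.quantiles(amounts, n=4)
iqr_threshold = q3 + 1.5 * (q3 - q1)
iqr_flags = [a for a in amounts if a > iqr_threshold]

print(f"3-sigma threshold: {sigma_threshold:,.0f} -> flags {sigma_flags}")
print(f"IQR threshold:     {iqr_threshold:,.0f} -> flags {iqr_flags}")
```

On this list the extreme value drags the sigma-based threshold so high that the 300 transaction slips through, while the IQR-based threshold still catches it.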
Explain Anscombe's quartet and what it teaches us about relying solely on summary statistics.
- Anscombe’s quartet is a set of four datasets that have nearly identical summary statistics — same mean of x, same mean of y, same variance, same correlation coefficient (r approximately 0.816), and the same regression line — yet look completely different when plotted. One is a normal linear relationship, one is a perfect curve, one is a perfect line with one outlier, and one has all points at one x-value except for a single extreme point.
- The lesson is devastating for anyone who relies on summary statistics alone: the numbers can lie. Two datasets with identical means, variances, and correlations can have fundamentally different structures, and any model or decision built on those numbers without visual inspection could be wildly wrong.
- In practice, this means every analysis should start with visualization. Before computing a single correlation or fitting a regression, plot a scatterplot, a histogram, or a residual plot. At companies with large data pipelines, I have seen teams ship models based on correlation matrices without ever looking at the actual data shape — and later discover non-linear relationships, clustering, or data artifacts that invalidated their approach.
- The modern extension is the Datasaurus Dozen, which shows the same idea with 13 datasets (including one shaped like a dinosaur) that all share the same summary statistics. It reinforces that summary stats are a lossy compression of your data — useful for communication, but dangerous as the sole basis for decisions.
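The claim about Anscombe's quartet can be checked numerically. The sketch below retypes the quartet's values as commonly published (worth cross-checking against a trusted copy before relying on them) and computes the mean of x, the mean of y, and the Pearson correlation for each dataset; `pearson_r` is a small helper written here, not a library call.

```python
import math

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def mean(values):
    return sum(values) / len(values)

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

results = {
    name: (mean(xs), mean(ys), pearson_r(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, r) in results.items():
    print(f"Dataset {name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The printed summaries are nearly indistinguishable across the four datasets, which is exactly why a scatterplot of each is needed to see how different they are.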