Describing Data: What’s Normal?
The House Hunting Problem
You’re moving to Austin, Texas. You have a budget of $500,000 and want to know: Is that enough for a decent 3-bedroom house? You could look at one listing, but that’s just one data point. You need to understand the whole picture. Let’s load some real data:
Measures of Central Tendency: “What’s Typical?”
The Mean (Average)
The mean is what most people think of as “average”: add everything up and divide by the count.
The Median (Middle Value)
The median is the middle value when you sort the data. Half the values are above, half below.
When to Use Mean vs Median?
| Use Mean When | Use Median When |
|---|---|
| Data is symmetric | Data has outliers |
| No extreme values | Income/wealth data |
| You want total divided by count | You want “typical” value |
| Example: Test scores | Example: House prices |
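To make the table concrete, here is a minimal sketch of how one outlier moves the mean but barely touches the median. The listing prices are invented for illustration and are not the Austin data referred to earlier.

```python
from statistics import mean, median

# Hypothetical listing prices in dollars (invented for illustration).
prices = [420_000, 450_000, 465_000, 480_000, 510_000]
typical_mean = mean(prices)
typical_median = median(prices)

# Add a single luxury listing and recompute both measures.
with_mansion = prices + [3_500_000]
skewed_mean = mean(with_mansion)
skewed_median = median(with_mansion)

print(f"Without outlier: mean=${typical_mean:,.0f}, median=${typical_median:,.0f}")
print(f"With outlier:    mean=${skewed_mean:,.0f}, median=${skewed_median:,.0f}")
```

With these made-up numbers the mean more than doubles (from $465,000 to about $970,833) while the median moves by only $7,500, which is why the median is the safer summary for house prices.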
The Mode (Most Common Value)
The mode is the value that appears most frequently. It is less useful for continuous data, but great for categories:
- Most popular shirt size at a store
- Most common customer complaint
- Peak traffic hour
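For categorical examples like the ones above, the mode can be read straight off a frequency count. The sketch below uses Python's standard library; the shirt sizes and colors are hypothetical sample data.

```python
from collections import Counter
from statistics import mode, multimode

# Hypothetical shirt sizes sold in one day.
shirt_sizes = ["M", "L", "M", "S", "XL", "M", "L"]

most_common_size = mode(shirt_sizes)          # single most frequent value
size_counts = Counter(shirt_sizes).most_common(2)  # top two sizes with counts

# When several values tie for most frequent, multimode returns all of them.
colors = ["red", "blue", "red", "blue", "green"]
tied_modes = multimode(colors)

print(most_common_size, size_counts, tied_modes)
```

`multimode` is worth knowing because a dataset can have more than one mode, and reporting only one of them can hide that.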
Measures of Spread: “How Different Are Things?”
Knowing the center isn’t enough. Consider these two neighborhoods:
- Neighborhood A: All houses fall in a narrow band topping out at $462K (consistent)
- Neighborhood B: Houses span a wide range, up to $550K (huge variation)
Range (Simplest Measure)
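The range is the largest value minus the smallest. A minimal sketch follows, using two hypothetical neighborhoods with prices in thousands of dollars; the individual prices are assumptions made up for illustration.

```python
# Hypothetical prices in $K for two neighborhoods (invented for illustration).
neighborhood_a = [450, 455, 458, 460, 462]
neighborhood_b = [350, 400, 450, 500, 550]


def value_range(values):
    """Return max minus min, the simplest measure of spread."""
    return max(values) - min(values)


range_a = value_range(neighborhood_a)
range_b = value_range(neighborhood_b)
print(f"A: {range_a}K spread, B: {range_b}K spread")
```

Because the range depends on only two values, a single extreme listing can change it completely, which is one reason to look at variance and standard deviation as well.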
Variance: Average Squared Distance from Mean
Variance measures how far values typically are from the mean. To compute it:
- Find the mean
- For each value, calculate distance from mean
- Square each distance (makes all positive, penalizes big deviations)
- Average the squared distances
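The four steps above can be sketched directly in Python. This version divides by the number of values (population variance), and the neighborhood prices are hypothetical figures in thousands of dollars; `statistics.pvariance` from the standard library is used as a cross-check.

```python
from statistics import pvariance

# Hypothetical prices in $K (invented for illustration).
neighborhood_a = [450, 455, 458, 460, 462]
neighborhood_b = [350, 400, 450, 500, 550]


def population_variance(values):
    """Follow the four steps: mean, distances, squares, average."""
    mean_value = sum(values) / len(values)               # step 1
    distances = [v - mean_value for v in values]         # step 2
    squared = [d ** 2 for d in distances]                # step 3
    return sum(squared) / len(squared)                   # step 4


var_a = population_variance(neighborhood_a)
var_b = population_variance(neighborhood_b)
print(var_a, var_b)

# The hand-rolled version should agree with the standard library.
assert abs(var_b - pvariance(neighborhood_b)) < 1e-9
```

Note that the results are in squared units (here, thousands of dollars squared), so they are hard to read on their own.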
Standard Deviation: Variance in Original Units
Variance is in “squared dollars”, which is hard to interpret. Standard deviation, the square root of the variance, brings us back to dollars.
- Neighborhood A: Houses are typically within ±$5.4K of the mean
- Neighborhood B: Houses are typically within ±$71K of the mean
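As a rough illustration, standard deviation is just the square root of the variance. The prices below are hypothetical (in thousands of dollars) and are not the data behind the figures quoted above, so the outputs are not expected to match them exactly.

```python
import math
from statistics import pstdev, pvariance

# Hypothetical prices in $K (invented for illustration).
neighborhood_a = [450, 455, 458, 460, 462]
neighborhood_b = [350, 400, 450, 500, 550]

# Square root of the population variance, computed two equivalent ways.
std_a = math.sqrt(pvariance(neighborhood_a))
std_b = pstdev(neighborhood_b)

print(f"A: about {std_a:.1f}K from the mean")
print(f"B: about {std_b:.1f}K from the mean")
```

Both neighborhoods have means within a few thousand dollars of each other, yet the spread differs by more than a factor of ten, which the mean alone would never show.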
Percentiles and Quartiles: “Where Does This Value Rank?”
Going back to our Austin house prices: is $500K expensive or affordable? Percentiles tell you what percentage of values fall below a given number.
The Interquartile Range (IQR)
The IQR is the range of the middle 50% of the data: Q3 minus Q1 (the 75th percentile minus the 25th).
Visualizing Data: See the Distribution
Numbers are great, but our brains understand pictures better.
Box Plot (Box-and-Whisker)
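A box plot is a drawing of a handful of numbers: the quartiles form the box, and points beyond the whiskers are shown as outliers. The sketch below computes those numbers with the standard library instead of plotting them. It assumes the common convention of placing the whisker limits 1.5 × IQR beyond Q1 and Q3, and the prices are hypothetical. Quartile conventions differ slightly between tools, so other libraries may report slightly different Q1 and Q3 values.

```python
from statistics import quantiles

# Hypothetical prices in $K, including one very expensive listing.
prices_k = [310, 380, 420, 450, 465, 480, 510, 560, 620, 1500]

# Quartiles: the box spans q1..q3 and the line inside it is the median (q2).
q1, q2, q3 = quantiles(prices_k, n=4, method="inclusive")
iqr = q3 - q1

# Assumed whisker rule: 1.5 * IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices_k if p < lower_fence or p > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"Fences: {lower_fence} to {upper_fence}, outliers: {outliers}")
```

Here only the $1,500K listing falls outside the fences, so it is the one point a box plot would draw separately.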
Histogram
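A histogram groups values into equal-width bins and counts how many fall in each. The sketch below does the counting by hand and prints a text version, so it runs without a plotting library. The helper `histogram_counts`, the bin width, and the prices are all hypothetical choices for illustration.

```python
# Hypothetical prices in $K (invented for illustration).
prices_k = [310, 380, 420, 450, 465, 480, 510, 560, 620, 1500]


def histogram_counts(values, bin_width):
    """Return the first bin's lower edge and a list of counts per bin."""
    low = (min(values) // bin_width) * bin_width
    n_bins = int((max(values) - low) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - low) // bin_width)] += 1
    return low, counts


low, counts = histogram_counts(prices_k, bin_width=100)

# Print one row per bin; empty bins are kept so gaps stay visible.
for i, count in enumerate(counts):
    start = low + i * 100
    print(f"{start:>5}-{start + 99:<5} {'#' * count}")
```

Keeping the empty bins matters: the long gap before the last bar is what reveals the isolated high-priced listing.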
Complete Summary Statistics
Here’s a function that gives you the full picture:
🎯 Practice Exercises
Exercise 1: Salary Analysis
Solution
Exercise 2: Test Score Comparison
Solution
🏠 Mini-Project: House Price Analyzer
Build a complete house price analysis tool!
Complete Solution
Key Takeaways
Central Tendency
- Mean: Add and divide. Sensitive to outliers.
- Median: Middle value. Robust to outliers.
- Mode: Most common. Great for categories.
Spread
- Range: Max - Min. Simple but limited.
- Variance: Average squared distance from mean.
- Std Dev: Square root of variance. Same units as data.
Position
- Percentiles: What % of values fall below this?
- Quartiles: 25th, 50th, 75th percentiles.
- IQR: Range of middle 50%. Good for outlier detection.
When to Use What
- Symmetric data: Mean + Std Dev
- Skewed data: Median + IQR
- Outliers present: Always check both!
Common Mistakes to Avoid
Interview Questions
Question 1: Mean vs Median (Facebook/Meta)
Question: You’re analyzing user session lengths. The mean is 45 minutes, but the median is only 8 minutes. What does this tell you about the distribution?
Question 2: Outlier Detection (Google)
Question: You have a dataset of daily ad revenue. How would you identify outliers?
Question 3: Variance Application (Amazon)
Question: You’re comparing two delivery drivers. Driver A has mean delivery time of 30 min (std dev 2 min). Driver B has mean 32 min (std dev 8 min). Which driver would you prefer?
Question 4: Percentiles in Practice (Netflix)
Question: We measure page load times. The mean is 2.5 seconds, but the 99th percentile is 15 seconds. What action might you take?
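One way to explore a question like this is to compute the mean next to a few percentiles. The sketch below uses the nearest-rank definition of a percentile (one of several conventions) and a made-up set of load times; `percentile_nearest_rank` is a hypothetical helper, not a library function.

```python
import math
from statistics import mean

# Hypothetical load times in seconds: most are fast, a few are very slow.
load_times = [1.5] * 80 + [3.0] * 15 + [15.0] * 5


def percentile_nearest_rank(values, p):
    """Smallest value such that at least p percent of values are <= it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


average = mean(load_times)
p50 = percentile_nearest_rank(load_times, 50)
p95 = percentile_nearest_rank(load_times, 95)
p99 = percentile_nearest_rank(load_times, 99)

print(f"mean={average}s p50={p50}s p95={p95}s p99={p99}s")
```

In this invented sample the mean looks healthy, but the 99th percentile shows that some users wait many times longer, which is the kind of tail that a mean hides.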
Practice Challenge
Challenge: Analyze This Real Dataset
You’re given website session data. Analyze it completely:
Solution:
📝 Practice Exercises
Exercise 1
Calculate descriptive statistics for employee salaries
Exercise 2
Analyze website load times for performance optimization
Exercise 3
Detect outliers in e-commerce transaction data
Exercise 4
Real-world: Analyze housing market price distributions
How This Connects to Machine Learning
Everything you just learned is foundational to ML:
| Descriptive Stat | ML Application |
|---|---|
| Mean | Used in normalization, calculating errors |
| Variance | Feature scaling, understanding data spread |
| Standard deviation | Standardization (z-scores), batch normalization |
| Percentiles | Handling outliers, creating features |
| Distribution shape | Choosing the right model and loss function |
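As one example from the table, standardization turns each value into a z-score: its distance from the mean measured in standard deviations. The sketch below assumes the population standard deviation is used and works on hypothetical prices; `standardize` is an illustrative helper, not a library function.

```python
from statistics import mean, pstdev

# Hypothetical feature values: prices in $K (invented for illustration).
prices_k = [350, 400, 450, 500, 550]


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize values with zero spread")
    return [(v - mu) / sigma for v in values]


z_scores = standardize(prices_k)
print([round(z, 2) for z in z_scores])
```

After standardization the values have mean 0 and standard deviation 1, so features measured in different units can be compared on the same scale.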
Interview Prep: Common Questions
Frequently Asked Interview Questions
Q: When would you use median instead of mean?
Use median when data has outliers or is heavily skewed. Classic examples: income data (billionaires skew the mean), house prices, response times (occasional timeouts).
Q: How do you detect outliers?
Common methods: IQR method (1.5 × IQR beyond Q1/Q3), z-score method (beyond ±2 or ±3 standard deviations), visual inspection with box plots.
Q: What’s the difference between population and sample variance?
Population variance divides by n, sample variance divides by (n-1). Use n-1 for samples because it provides an unbiased estimate of population variance (Bessel’s correction).
Q: A dataset has mean = median. What does this tell you?
The distribution is likely symmetric (not skewed), although equality alone does not prove it. In a perfectly symmetric, unimodal distribution, mean = median = mode.
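The population-versus-sample distinction from the questions above is easy to check numerically. In Python's standard library, `statistics.pvariance` divides by n and `statistics.variance` divides by n-1; the data here is a small made-up sample.

```python
from statistics import pvariance, variance

# Small made-up sample: mean is 5, squared deviations sum to 32.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

population_var = pvariance(sample)  # divides by n = 8
sample_var = variance(sample)       # divides by n - 1 = 7 (Bessel's correction)

print(f"population variance: {population_var}")
print(f"sample variance:     {sample_var:.3f}")
```

The sample variance is always the larger of the two, and the gap shrinks as n grows, so the choice matters most for small samples.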
Common Pitfalls
Key Takeaways
What You Learned:
- ✅ Mean - Sum divided by count; sensitive to outliers
- ✅ Median - Middle value; robust to outliers; use for skewed data
- ✅ Mode - Most frequent value; useful for categorical data
- ✅ Variance & Std Dev - Measure spread around the mean
- ✅ Percentiles & IQR - Divide data into portions; detect outliers
- ✅ Z-scores - Standardize values across different scales
Next: Probability Foundations
Learn to quantify uncertainty and make predictions