Describing Data - Mean and Central Tendency

Describing Data: What’s Normal?

The House Hunting Problem

You’re moving to Austin, Texas. You have a budget of $500,000 and want to know: Is that enough for a decent 3-bedroom house? You could look at one listing, but that’s just one data point. You need to understand the whole picture. Let’s load some real data:
import numpy as np
import pandas as pd

# House prices in Austin (3-bedroom homes, in thousands)
prices = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472,
    1250,  # A mansion somehow in the dataset
    448, 492, 418, 485, 455, 508, 428, 475, 495, 462
])

print(f"Number of houses: {len(prices)}")
print(f"Cheapest: ${min(prices)}K")
print(f"Most expensive: ${max(prices)}K")
Output:
Number of houses: 41
Cheapest: $389K
Most expensive: $1250K
The range is $389K to $1250K. But that doesn’t tell us what’s “typical”. We need better tools.

Measures of Central Tendency: “What’s Typical?”

The Mean (Average)

The mean is what most people think of as “average” - add everything up and divide by the count. Analogy: Think of the mean as the balance point of a seesaw. If you placed each data point as a weight along a beam, the mean is where you would put the fulcrum to make it balance perfectly. One very heavy weight far from center (an outlier) can shift the balance point dramatically.
[Figure: Mean Formula Visualization]
The mathematical formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}$
def calculate_mean(data):
    """Calculate the arithmetic mean."""
    return sum(data) / len(data)

mean_price = calculate_mean(prices)
print(f"Mean price: ${mean_price:.1f}K")

# Or use NumPy
print(f"Mean (NumPy): ${np.mean(prices):.1f}K")
Output:
Mean price: $483.6K
Mean (NumPy): $483.6K
Wait… $483.6K? Most houses cluster around $470K, but the mean is higher. What’s happening?
[Figure: Mean Real World - Mansion Pulling Average]
The Problem with the Mean: That $1.25M mansion is pulling the average up! The mean is sensitive to outliers.
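The seesaw picture can be checked numerically. The sketch below uses a small made-up list (`prices_k` is illustrative, not the Austin dataset) to show that deviations from the mean always cancel out, so a single extreme value has to be offset by every other point sitting below the mean.

```python
from statistics import fmean

# Hypothetical prices in $K; the last value plays the role of the mansion
prices_k = [440, 450, 460, 470, 480, 1250]

mean = fmean(prices_k)
deviations = [p - mean for p in prices_k]

# Deviations always cancel out: that is what "balance point" means
print(f"Mean: {mean:.1f}")
print(f"Sum of deviations: {abs(sum(deviations)):.6f}")

# The outlier's pull is offset by every other house sitting below the mean
print(f"Outlier deviation: {deviations[-1]:+.1f}")
print(f"Combined deviation of the rest: {sum(deviations[:-1]):+.1f}")
```

Every "ordinary" house ends up below the mean here, which is exactly why the mean stops describing a typical listing.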

The Median (Middle Value)

The median is the middle value when you sort the data. Half the values are above, half below.
def calculate_median(data):
    """Calculate the median."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    mid = n // 2
    
    if n % 2 == 0:  # Even number of elements
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    else:  # Odd number of elements
        return sorted_data[mid]

median_price = calculate_median(prices)
print(f"Median price: ${median_price:.1f}K")

# Or use NumPy
print(f"Median (NumPy): ${np.median(prices):.1f}K")
Output:
Median price: $472.0K
The median is $472K - more representative of a “typical” house than the mean. The mansion barely moves it because it counts as just one value above the middle, no matter how extreme its price.
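To see that robustness in isolation, here is a minimal sketch with hypothetical numbers (the `base` list and the three top prices are made up for illustration). Only the most expensive listing changes between runs; the mean follows it while the median stays put.

```python
from statistics import fmean, median

# Hypothetical prices in $K; only the most expensive listing changes
base = [440, 450, 460, 470, 480, 490]

results = []
for top_price in (500, 1250, 5000):
    sample = base + [top_price]
    results.append((top_price, fmean(sample), median(sample)))

for top_price, mean, mid in results:
    print(f"top listing ${top_price}K: mean=${mean:.1f}K, median=${mid:.1f}K")
```

The median only cares that the top listing is above the middle, not how far above it is.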
When to use Mean vs Median?
| Use Mean When | Use Median When |
| --- | --- |
| Data is symmetric | Data has outliers |
| No extreme values | Income/wealth data |
| You want total divided by count | You want “typical” value |
| Example: Test scores | Example: House prices |

The Mode (Most Common Value)

The mode is the value that appears most frequently. Less useful for continuous data, but great for categories.
from collections import Counter

def calculate_mode(data):
    """Find the most common value(s)."""
    counts = Counter(data)
    max_count = max(counts.values())
    modes = [val for val, count in counts.items() if count == max_count]
    return modes

# For house prices, mode isn't very useful (few values repeat)
# But for bedrooms:
bedrooms = [3, 3, 4, 3, 2, 3, 4, 3, 3, 2, 3, 4, 3, 5, 3, 3, 2, 4, 3, 3]
mode_bedrooms = calculate_mode(bedrooms)
print(f"Most common bedroom count: {mode_bedrooms}")  # [3]
Real-World Usage:
  • Most popular shirt size at a store
  • Most common customer complaint
  • Peak traffic hour
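For categorical data like the examples above, the standard library already covers ties. This is a small sketch using `statistics.multimode`; the `sizes` list is invented for illustration.

```python
from statistics import multimode

# Hypothetical shirt sizes sold in one day
sizes = ["M", "L", "M", "S", "L", "XL", "M", "L"]

# multimode returns every value tied for the highest count,
# in the order each was first seen
print(multimode(sizes))
```

Here "M" and "L" are tied, so both come back, which matches the list-returning behavior of `calculate_mode` above.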

Measures of Spread: “How Different Are Things?”

Knowing the center isn’t enough. Consider these two neighborhoods:
neighborhood_A = [450, 455, 448, 460, 452, 445, 458, 447, 453, 462]
neighborhood_B = [350, 550, 400, 500, 380, 520, 410, 490, 360, 540]

print(f"Neighborhood A - Mean: ${np.mean(neighborhood_A):.1f}K")
print(f"Neighborhood B - Mean: ${np.mean(neighborhood_B):.1f}K")
Output:
Neighborhood A - Mean: $453.0K
Neighborhood B - Mean: $450.0K
Almost the same mean! But look at the actual houses:
  • Neighborhood A: All houses are between $445K and $462K (consistent)
  • Neighborhood B: Houses range from $350K to $550K (huge variation)
We need to measure spread.

Range (Simplest Measure)

def calculate_range(data):
    return max(data) - min(data)

print(f"Range A: ${calculate_range(neighborhood_A)}K")
print(f"Range B: ${calculate_range(neighborhood_B)}K")
Output:
Range A: $17K
Range B: $200K
The range shows the difference, but it only uses two values and is sensitive to outliers.

Variance: Average Squared Distance from Mean

Variance measures how far values typically are from the mean.
[Figure: Variance Formula Visualization]
The Formula: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Step by step:
  1. Find the mean
  2. For each value, calculate distance from mean
  3. Square each distance (makes all positive, penalizes big deviations)
  4. Average the squared distances
def calculate_variance(data):
    """Calculate population variance."""
    mean = sum(data) / len(data)
    squared_diffs = [(x - mean) ** 2 for x in data]
    return sum(squared_diffs) / len(data)

var_A = calculate_variance(neighborhood_A)
var_B = calculate_variance(neighborhood_B)

print(f"Variance A: {var_A:.1f}")
print(f"Variance B: {var_B:.1f}")

# Using NumPy
print(f"Variance A (NumPy): {np.var(neighborhood_A):.1f}")
print(f"Variance B (NumPy): {np.var(neighborhood_B):.1f}")
Output:
Variance A: 29.4
Variance B: 5420.0
Neighborhood B has about 184x the variance of A!
[Figure: Variance Real World - Neighborhood Comparison]
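A hand-rolled function is easy to get subtly wrong, so it can help to cross-check it against an independent implementation. The sketch below recomputes both neighborhoods with `statistics.pvariance` from the standard library, which also divides by n.

```python
from statistics import pvariance

# Same neighborhood data as above (prices in $K)
neighborhood_A = [450, 455, 448, 460, 452, 445, 458, 447, 453, 462]
neighborhood_B = [350, 550, 400, 500, 380, 520, 410, 490, 360, 540]

var_A = pvariance(neighborhood_A)  # population variance: divides by n
var_B = pvariance(neighborhood_B)

print(f"Variance A: {var_A:.1f}")
print(f"Variance B: {var_B:.1f}")
print(f"Ratio B/A:  {var_B / var_A:.0f}x")
```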

Standard Deviation: Variance in Original Units

Variance is in “squared dollars” which is hard to interpret. Standard deviation brings us back to dollars. $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
def calculate_std(data):
    """Calculate population standard deviation."""
    return np.sqrt(calculate_variance(data))

std_A = calculate_std(neighborhood_A)
std_B = calculate_std(neighborhood_B)

print(f"Std Dev A: ${std_A:.1f}K")
print(f"Std Dev B: ${std_B:.1f}K")
Output:
Std Dev A: $5.4K
Std Dev B: $73.6K
Interpretation:
  • Neighborhood A: Houses are typically within plus or minus $5.4K of the mean
  • Neighborhood B: Houses are typically within plus or minus $73.6K of the mean
Analogy: Standard deviation is like the “typical commute distance” of data points from their home (the mean). In Neighborhood A, every data point lives close to the mean — a short commute. In Neighborhood B, data points are scattered far and wide.
ML Application - Feature Scaling: Standard deviation is the foundation of standardization (z-score normalization), one of the most common preprocessing steps in ML. When you run StandardScaler() in scikit-learn, it subtracts each feature’s mean and divides by its standard deviation, so all features end up on comparable scales. Skip this step with scale-sensitive methods such as gradient-descent-trained models or SVMs, and the features with larger scales will dominate learning: a classic beginner mistake that produces mysteriously poor models.
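As a rough sketch of what standardization does to a single column, the snippet below computes z-scores with the standard library only. The `standardize` helper and the two feature lists are hypothetical; it uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores: subtract the column mean, divide by its std dev."""
    mu = fmean(column)
    sigma = pstdev(column)  # population std dev (divides by n)
    return [(x - mu) / sigma for x in column]

# Hypothetical features on very different scales
sqft = [1600, 1800, 2000, 2200, 2400]
bedrooms = [2, 3, 3, 4, 5]

scaled = {"sqft": standardize(sqft), "bedrooms": standardize(bedrooms)}
for name, z in scaled.items():
    print(name, [round(v, 2) for v in z])
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.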
Sample vs Population: When working with a sample (not the entire population), we divide by (n-1) instead of n for variance. This is called Bessel’s correction.
# Population variance (you have ALL data)
np.var(data, ddof=0)

# Sample variance (you have a sample from larger population)
np.var(data, ddof=1)  # NumPy defaults to ddof=0; pandas .var() and .std() default to ddof=1
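The two conventions differ only by the divisor. This sketch, using a made-up five-value `sample`, compares the standard library's `pvariance` (divide by n) with `variance` (divide by n-1) and confirms the n/(n-1) relationship between them.

```python
from statistics import pvariance, variance

# Hypothetical sample of five prices in $K
sample = [440, 455, 470, 485, 500]
n = len(sample)

pop_var = pvariance(sample)   # divides by n      (like ddof=0)
samp_var = variance(sample)   # divides by n - 1  (like ddof=1)

print(f"Population variance: {pop_var:.1f}")
print(f"Sample variance:     {samp_var:.1f}")
print(f"Ratio: {samp_var / pop_var:.2f} (n / (n - 1) = {n / (n - 1):.2f})")
```

The gap shrinks as n grows, which is why the distinction matters most for small samples.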

Percentiles and Quartiles: “Where Does This Value Rank?”

Going back to our Austin house prices - is $500K expensive or affordable? Percentiles tell you what percentage of values fall below a given number.
# Remove the mansion outlier for cleaner analysis
prices_clean = prices[prices < 1000]

# Calculate percentiles
p25 = np.percentile(prices_clean, 25)
p50 = np.percentile(prices_clean, 50)  # Same as median!
p75 = np.percentile(prices_clean, 75)
p90 = np.percentile(prices_clean, 90)

print(f"25th percentile: ${p25:.1f}K")
print(f"50th percentile (median): ${p50:.1f}K")
print(f"75th percentile: ${p75:.1f}K")
print(f"90th percentile: ${p90:.1f}K")
Output:
25th percentile: $440.2K
50th percentile (median): $470.0K
75th percentile: $495.0K
90th percentile: $510.2K
Your $500K budget covers 32 of the 40 houses in the cleaned data, which puts you at about the 80th percentile - you can afford roughly 80% of houses in this area!
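Going the other way, from a value to its rank, is just a counting exercise. The helper below (`percent_at_or_below`, a hypothetical name) is a minimal sketch on made-up listings rather than the Austin data.

```python
def percent_at_or_below(values, threshold):
    """Share of values (0-100) that are less than or equal to threshold."""
    return 100 * sum(1 for v in values if v <= threshold) / len(values)

# Hypothetical listing prices in $K
listings = [410, 435, 450, 465, 470, 480, 495, 505, 520, 540]
budget = 500

share = percent_at_or_below(listings, budget)
print(f"${budget}K covers {share:.0f}% of listings")
```

Whether to count ties with `<=` or `<` is a convention choice; for a budget question, "at or below" is the natural reading.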

The Interquartile Range (IQR)

The IQR is the range of the middle 50% of data: $IQR = Q_3 - Q_1 = P_{75} - P_{25}$
iqr = p75 - p25
print(f"IQR: ${iqr:.1f}K")

# Houses outside 1.5*IQR from quartiles are often considered outliers
lower_fence = p25 - 1.5 * iqr
upper_fence = p75 + 1.5 * iqr
print(f"Outlier thresholds: ${lower_fence:.1f}K - ${upper_fence:.1f}K")
Output:
IQR: $54.8K
Outlier thresholds: $358.1K - $577.1K
That $1.25M mansion is definitely an outlier!
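The fence calculation is worth wrapping in a reusable function. This is one possible sketch using only the standard library; `iqr_fences` and the `sample` list are illustrative, and `method="inclusive"` is chosen because it interpolates the same way as NumPy's default percentile method.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences for the given multiplier k."""
    # method="inclusive" interpolates like NumPy's default percentile method
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical prices in $K with one extreme listing
sample = [400, 420, 440, 460, 480, 500, 520, 540, 1250]
lower, upper = iqr_fences(sample)
flagged = [v for v in sample if v < lower or v > upper]

print(f"Fences: {lower:.1f} to {upper:.1f}")
print(f"Flagged: {flagged}")
```

Because the quartiles ignore how extreme the top value is, the fences stay sensible even with the outlier still in the data.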

Visualizing Data: See the Distribution

Numbers are great, but our brains understand pictures better.

Box Plot (Box-and-Whisker)

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# With outlier
axes[0].boxplot(prices, vert=True)
axes[0].set_title('House Prices (with $1.25M mansion)')
axes[0].set_ylabel('Price ($K)')

# Without outlier
axes[1].boxplot(prices_clean, vert=True)
axes[1].set_title('House Prices (mansion removed)')
axes[1].set_ylabel('Price ($K)')

plt.tight_layout()
plt.show()

Histogram

plt.figure(figsize=(10, 5))
plt.hist(prices_clean, bins=15, edgecolor='black', alpha=0.7)
plt.axvline(np.mean(prices_clean), color='red', linestyle='--', label=f'Mean: ${np.mean(prices_clean):.0f}K')
plt.axvline(np.median(prices_clean), color='green', linestyle='--', label=f'Median: ${np.median(prices_clean):.0f}K')
plt.xlabel('Price ($K)')
plt.ylabel('Number of Houses')
plt.title('Distribution of House Prices in Austin')
plt.legend()
plt.show()

Complete Summary Statistics

Here’s a function that gives you the full picture:
def describe_data(data, name="Data"):
    """Generate comprehensive summary statistics."""
    stats = {
        'Count': len(data),
        'Mean': np.mean(data),
        'Median': np.median(data),
        'Std Dev': np.std(data),
        'Variance': np.var(data),
        'Min': np.min(data),
        '25%': np.percentile(data, 25),
        '50%': np.percentile(data, 50),
        '75%': np.percentile(data, 75),
        'Max': np.max(data),
        'Range': np.max(data) - np.min(data),
        'IQR': np.percentile(data, 75) - np.percentile(data, 25)
    }
    
    print(f"\n{'='*40}")
    print(f"Summary Statistics: {name}")
    print(f"{'='*40}")
    for key, value in stats.items():
        print(f"{key:12}: {value:>12.2f}")
    
    return stats

# Use it!
describe_data(prices_clean, "Austin House Prices ($K)")
Output:
========================================
Summary Statistics: Austin House Prices ($K)
========================================
Count       :        40.00
Mean        :       464.48
Median      :       470.00
Std Dev     :        38.57
Variance    :      1487.35
Min         :       389.00
25%         :       440.25
50%         :       470.00
75%         :       495.00
Max         :       528.00
Range       :       139.00
IQR         :        54.75

🎯 Practice Exercises

Exercise 1: Salary Analysis

# Tech company salaries (in thousands)
salaries = np.array([
    75, 82, 78, 95, 88, 72, 105, 92, 85, 79,
    110, 125, 88, 95, 82, 450,  # CEO salary!
    78, 92, 85, 102, 88, 95, 82, 79, 105
])

# TODO: Calculate mean and median
# TODO: Which one better represents "typical" salary?
# TODO: Calculate standard deviation
# TODO: Identify outliers using IQR method
Solution:
mean_salary = np.mean(salaries)
median_salary = np.median(salaries)

print(f"Mean: ${mean_salary:.1f}K")
print(f"Median: ${median_salary:.1f}K")

# Median is better - CEO salary inflates mean
# Typical employee makes ~$88K, not ~$104K

std_salary = np.std(salaries)
print(f"Std Dev: ${std_salary:.1f}K")

# IQR outlier detection
q1 = np.percentile(salaries, 25)
q3 = np.percentile(salaries, 75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(f"Outliers: {outliers}")  # [125 450] - the CEO, plus a $125K salary just above the upper fence ($114.5K)

Exercise 2: Test Score Comparison

# Two classes took the same test
class_A = [72, 75, 78, 80, 82, 85, 88, 90, 92, 95]
class_B = [65, 70, 78, 82, 83, 84, 85, 88, 95, 100]

# TODO: Which class performed better on average?
# TODO: Which class was more consistent?
# TODO: If you had to bet on a random student getting 80+, which class?
Solution:
print(f"Class A - Mean: {np.mean(class_A):.1f}, Std: {np.std(class_A):.1f}")
print(f"Class B - Mean: {np.mean(class_B):.1f}, Std: {np.std(class_B):.1f}")

# Class A: Mean 83.7, Std 7.2
# Class B: Mean 83.0, Std 9.9

# Class A performed slightly better on average
# Class A was more consistent (lower std dev)

# For 80+ bet:
a_above_80 = sum(1 for x in class_A if x >= 80) / len(class_A)
b_above_80 = sum(1 for x in class_B if x >= 80) / len(class_B)
print(f"Class A: {a_above_80:.0%} got 80+")
print(f"Class B: {b_above_80:.0%} got 80+")

# Both 70%, so the bet is even on this data; Class A's lower spread just makes it the more predictable class

🏠 Mini-Project: House Price Analyzer

Build a complete house price analysis tool!
import numpy as np
import pandas as pd

# Extended Austin dataset
data = {
    'price': [425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
              512, 445, 468, 502, 389, 475, 498, 415, 528, 459],
    'sqft': [1800, 1600, 1950, 2400, 2100, 1700, 2300, 2000, 1750, 2150,
             2350, 1900, 2050, 2250, 1650, 2100, 2200, 1850, 2450, 2000],
    'bedrooms': [3, 3, 4, 4, 3, 3, 4, 3, 3, 4,
                 4, 3, 3, 4, 3, 3, 4, 3, 5, 3],
    'neighborhood': ['North', 'South', 'North', 'West', 'North', 'South', 
                     'West', 'North', 'South', 'West', 'West', 'North',
                     'South', 'West', 'South', 'North', 'West', 'South',
                     'West', 'North']
}

houses = pd.DataFrame(data)

# YOUR TASKS:
# 1. Calculate summary statistics for price by neighborhood
# 2. Find price per square foot for each house
# 3. Which neighborhood has the most consistent prices?
# 4. Is there a relationship between bedrooms and price?
# 5. Your budget is $475K. What percentage of houses can you afford in each neighborhood?
Solution:
import numpy as np
import pandas as pd

# ... (data from above)

# 1. Summary statistics by neighborhood
print("="*50)
print("PRICE STATISTICS BY NEIGHBORHOOD")
print("="*50)
for hood in houses['neighborhood'].unique():
    subset = houses[houses['neighborhood'] == hood]['price']
    print(f"\n{hood}:")
    print(f"  Mean:   ${subset.mean():.0f}K")
    print(f"  Median: ${subset.median():.0f}K")
    print(f"  Std:    ${subset.std():.1f}K")
    print(f"  Count:  {len(subset)}")

# 2. Price per square foot
houses['price_per_sqft'] = houses['price'] * 1000 / houses['sqft']
print("\n" + "="*50)
print("PRICE PER SQUARE FOOT")
print("="*50)
print(f"Mean: ${houses['price_per_sqft'].mean():.0f}/sqft")
print(f"Range: ${houses['price_per_sqft'].min():.0f} - ${houses['price_per_sqft'].max():.0f}/sqft")

# 3. Most consistent neighborhood (lowest std dev)
consistency = houses.groupby('neighborhood')['price'].std()
print("\n" + "="*50)
print("PRICE CONSISTENCY (Std Dev)")
print("="*50)
print(consistency.sort_values())
print(f"\nMost consistent: {consistency.idxmin()} (${consistency.min():.1f}K std)")

# 4. Bedrooms vs Price
bedroom_analysis = houses.groupby('bedrooms')['price'].agg(['mean', 'count'])
print("\n" + "="*50)
print("BEDROOMS VS PRICE")
print("="*50)
print(bedroom_analysis)

# 5. Affordability with $475K budget
budget = 475
print("\n" + "="*50)
print(f"AFFORDABILITY WITH ${budget}K BUDGET")
print("="*50)
for hood in houses['neighborhood'].unique():
    subset = houses[houses['neighborhood'] == hood]['price']
    affordable = (subset <= budget).sum() / len(subset) * 100
    print(f"{hood}: {affordable:.0f}% of houses affordable")
Output:
PRICE STATISTICS BY NEIGHBORHOOD
==================================================

North:
  Mean:   $456K
  Median: $459K
  Std:    $18.7K
  Count:  7

South:
  Mean:   $409K
  Median: $396K
  Std:    $30.4K
  Count:  6

West:
  Mean:   $508K
  Median: $510K
  Std:    $14.3K
  Count:  7

PRICE CONSISTENCY (Std Dev)
==================================================
neighborhood
West     14.31
North    18.67
South    30.44
Name: price, dtype: float64

Most consistent: West ($14.3K std)

AFFORDABILITY WITH $475K BUDGET
==================================================
North: 86% of houses affordable
South: 100% of houses affordable
West: 0% of houses affordable

Key Takeaways

Central Tendency

  • Mean: Add and divide. Sensitive to outliers.
  • Median: Middle value. Robust to outliers.
  • Mode: Most common. Great for categories.

Spread

  • Range: Max - Min. Simple but limited.
  • Variance: Average squared distance from mean.
  • Std Dev: Square root of variance. Same units as data.

Position

  • Percentiles: What % of values fall below this?
  • Quartiles: 25th, 50th, 75th percentiles.
  • IQR: Range of middle 50%. Good for outlier detection.

When to Use What

  • Symmetric data: Mean + Std Dev
  • Skewed data: Median + IQR
  • Outliers present: Always check both!

Common Mistakes to Avoid

Mistake 1: Always Using the Mean
The mean can be heavily influenced by outliers. For salary data, housing prices, or any skewed distribution, the median is often more representative.
Example: In a company where 9 employees earn $50K and the CEO earns $5M, the mean salary is $545K - wildly misleading!
Mistake 2: Ignoring Units
Variance is in squared units, which can be hard to interpret. Standard deviation is in the original units, making it much more practical.
Example: A variance of 10,000 dollars² is hard to understand. A std dev of $100 is clear.
Mistake 3: Comparing Std Devs Across Different Scales
A std dev of $10K for house prices and a std dev of $10 for groceries aren’t comparable. Use the coefficient of variation (CV = std/mean) to compare relative variability.
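The coefficient of variation is a one-liner once mean and standard deviation are available. The sketch below uses two hypothetical datasets on very different scales to show why the raw standard deviations would mislead.

```python
from statistics import fmean, pstdev

def coefficient_of_variation(values):
    """Std dev divided by mean: a unitless measure of relative spread."""
    return pstdev(values) / fmean(values)

# Hypothetical data on very different scales (both in dollars)
house_prices = [440_000, 450_000, 460_000, 470_000, 480_000]
grocery_bills = [40, 70, 100, 130, 160]

cv_houses = coefficient_of_variation(house_prices)
cv_groceries = coefficient_of_variation(grocery_bills)

print(f"Houses:    std=${pstdev(house_prices):,.0f}, CV={cv_houses:.3f}")
print(f"Groceries: std=${pstdev(grocery_bills):,.0f}, CV={cv_groceries:.3f}")
```

The house prices have a far larger standard deviation in dollars, yet relative to their mean they vary much less than the grocery bills. CV only makes sense for data measured on a ratio scale with a positive mean.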

Interview Questions

Question: You’re analyzing user session lengths. The mean is 45 minutes, but the median is only 8 minutes. What does this tell you about the distribution?
Answer: This indicates a heavily right-skewed distribution with outliers. Most users have short sessions (around 8 minutes), but some power users have very long sessions that pull the mean way up. The median is more representative of the “typical” user experience.
Question: You have a dataset of daily ad revenue. How would you identify outliers?
Answer: Use the IQR method:
  1. Calculate Q1 (25th percentile) and Q3 (75th percentile)
  2. Calculate IQR = Q3 - Q1
  3. Outliers are values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
Alternatively, use z-scores: values more than 3 standard deviations from the mean are typically considered outliers.
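The z-score alternative can be sketched in a few lines. The `revenue` values and the `zscore_outliers` helper are hypothetical; note that an extreme value inflates the standard deviation it is measured against, so on very small samples this rule can fail to flag anything.

```python
from statistics import fmean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return values whose distance from the mean exceeds threshold std devs."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily ad revenue in dollars, with one spike day
revenue = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99,
           100, 102, 98, 101, 99, 100, 103, 97, 100, 500]

spikes = zscore_outliers(revenue)
print(spikes)
```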
Question: You’re comparing two delivery drivers. Driver A has mean delivery time of 30 min (std dev 2 min). Driver B has mean 32 min (std dev 8 min). Which driver would you prefer?
Answer: Driver A. Driver A is slightly faster on average, but the low variance is the key differentiator. Driver A is consistently fast (roughly 28-32 min within one standard deviation), while Driver B is unpredictable (24-40 min over the same band). For customer satisfaction and logistics planning, consistency often matters more than a slightly faster mean.
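One way to put a number on "unpredictable" is to estimate how often each driver would miss a promised window. The sketch below assumes delivery times are roughly normal, which the question does not state, and uses a hypothetical 40-minute `deadline`.

```python
from statistics import NormalDist

# Assumption for illustration: delivery times are roughly normal
driver_a = NormalDist(mu=30, sigma=2)
driver_b = NormalDist(mu=32, sigma=8)

deadline = 40  # hypothetical promised delivery window, in minutes

late = {}
for name, dist in (("A", driver_a), ("B", driver_b)):
    late[name] = 1 - dist.cdf(deadline)
    print(f"Driver {name}: P(late) = {late[name]:.4f}")
```

Under these assumptions Driver B misses the window about one delivery in six, while Driver A almost never does, even though their means differ by only two minutes.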
Question: We measure page load times. The mean is 2.5 seconds, but the 99th percentile is 15 seconds. What action might you take?
Answer: The 99th percentile (P99) being 6x the mean suggests there’s a “long tail” of slow experiences. Even though 99% of users have decent load times, 1% are having a terrible experience. For a company with millions of users, that’s a lot of frustrated customers. Focus on identifying what causes these edge cases - geographic regions, specific devices, or server issues.

Practice Challenge

You’re given website session data. Analyze it completely:
import numpy as np
np.random.seed(42)

# Simulate session durations (in seconds)
# Mix of quick bouncers and engaged users
short_sessions = np.random.exponential(30, size=800)  # Most users leave quickly
long_sessions = np.random.normal(600, 120, size=200)  # Engaged users stay ~10 min
sessions = np.concatenate([short_sessions, long_sessions])

# Your tasks:
# 1. Calculate mean, median, std dev
# 2. Identify which measure best represents "typical" session
# 3. Find the 10th, 50th, and 90th percentiles
# 4. Identify outliers using the IQR method
# 5. What story does this data tell about user behavior?

# Write your analysis here:
Solution:
# 1. Basic statistics
print(f"Mean: {np.mean(sessions):.1f} seconds")
print(f"Median: {np.median(sessions):.1f} seconds")
print(f"Std Dev: {np.std(sessions):.1f} seconds")

# 2. The median (roughly 30 sec) is more representative of a typical visit:
#    most sessions are short, and the engaged-user segment pulls the mean far above it

# 3. Percentiles
print(f"P10: {np.percentile(sessions, 10):.1f} seconds")
print(f"P50: {np.percentile(sessions, 50):.1f} seconds")
print(f"P90: {np.percentile(sessions, 90):.1f} seconds")

# 4. Outliers using IQR
Q1 = np.percentile(sessions, 25)
Q3 = np.percentile(sessions, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = sessions[(sessions < lower_bound) | (sessions > upper_bound)]
print(f"Outliers: {len(outliers)} values")

# 5. Story: Two distinct user groups!
#    - 80% are "bouncers" with very short sessions
#    - 20% are "engaged users" with ~10 minute sessions
#    - This bimodal distribution suggests we should analyze 
#      these groups separately

📝 Practice Exercises

Exercise 1

Calculate descriptive statistics for employee salaries

Exercise 2

Analyze website load times for performance optimization

Exercise 3

Detect outliers in e-commerce transaction data

Exercise 4

Real-world: Analyze housing market price distributions

How This Connects to Machine Learning

Everything you just learned is foundational to ML:
| Descriptive Stat | ML Application |
| --- | --- |
| Mean | Used in normalization, calculating errors |
| Variance | Feature scaling, understanding data spread |
| Standard deviation | Standardization (z-scores), batch normalization |
| Percentiles | Handling outliers, creating features |
| Distribution shape | Choosing the right model and loss function |
Statistical Mistake in ML - Using Mean on Skewed Targets: If your target variable (the thing you are predicting) is right-skewed, like house prices, income, or time-to-event data, training a regression model on the raw values with a squared-error loss lets a few expensive outliers dominate the fit. The fix: check skewness first. If mean and median diverge significantly, try a log transform on the target before training. This often improves error metrics such as RMSE, though the size of the gain depends on the dataset, so compare validation error in the original units with and without the transform.
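The skewness check and the transform can be sketched as follows. The `target` values and the `mean_median_gap` helper are hypothetical; `log1p`/`expm1` are used so that zero values would not break the transform.

```python
import math
from statistics import fmean, median

# Hypothetical right-skewed target (prices in $K)
target = [200, 220, 250, 270, 300, 320, 350, 400, 2500]

def mean_median_gap(values):
    """Relative gap between mean and median; near 0 suggests symmetry."""
    return (fmean(values) - median(values)) / median(values)

log_target = [math.log1p(v) for v in target]  # train on this instead

gap_raw = mean_median_gap(target)
gap_log = mean_median_gap(log_target)
print(f"Gap before log: {gap_raw:.2f}")
print(f"Gap after log:  {gap_log:.2f}")

# Predictions made in log space must be mapped back with expm1
restored = [math.expm1(v) for v in log_target]
```

The mean-median gap shrinks sharply in log space, and `expm1` recovers the original units for reporting.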

Interview Prep: Common Questions

Q: When would you use median instead of mean?
Use median when data has outliers or is heavily skewed. Classic examples: income data (billionaires skew the mean), house prices, response times (occasional timeouts).
Q: How do you detect outliers?
Common methods: IQR method (1.5 × IQR beyond Q1/Q3), z-score method (beyond ±2 or ±3 standard deviations), visual inspection with box plots.
Q: What’s the difference between population and sample variance?
Population variance divides by n, sample variance divides by (n-1). Use n-1 for samples because it provides an unbiased estimate of population variance (Bessel’s correction).
Q: A dataset has mean = median. What does this tell you?
The distribution is likely symmetric (not skewed), though equal mean and median alone do not guarantee it. In a perfectly symmetric, unimodal distribution, mean = median = mode.

Common Pitfalls

Mistakes to Avoid:
  1. Using mean for skewed data - Always check for outliers first; median is often more representative
  2. Ignoring the spread - Two datasets can have identical means but completely different distributions
  3. Confusing variance units - Variance is in squared units; use standard deviation for interpretable scale
  4. Forgetting to visualize - Statistics alone can be misleading (Anscombe’s quartet is the classic example)

Key Takeaways

What You Learned:
  • Mean - Sum divided by count; sensitive to outliers
  • Median - Middle value; robust to outliers; use for skewed data
  • Mode - Most frequent value; useful for categorical data
  • Variance & Std Dev - Measure spread around the mean
  • Percentiles & IQR - Divide data into portions; detect outliers
  • Z-scores - Standardize values across different scales
Coming up next: We’ll learn about probability - how to quantify uncertainty and make predictions. This is where statistics becomes truly powerful!


Interview Deep-Dive

Strong Answer:
  • The large gap between mean ($85K) and median ($42K) tells me the distribution is heavily right-skewed. A small number of very high-revenue days (likely driven by flash sales, holiday events, or a few massive B2B orders) are pulling the mean upward. On a “typical” day, the company makes closer to $42K.
  • I would report both numbers to the VP, but frame it carefully: “On a normal day, we generate about $42K in revenue. However, our average is higher at $85K because we have occasional spike days that significantly boost the total. If you are planning staffing and operations around daily expectations, use the median. If you are forecasting monthly totals, the mean times 30 gives a better estimate.”
  • I would also present the P90 and P99 to show how extreme the spike days are, and potentially a histogram showing the bimodal or long-tail shape. Stakeholders make better decisions when they understand the shape, not just a single number.
  • The key risk: if someone uses the $85K mean for daily budgeting, they will overspend on most days and then scramble during the rare high days. Conversely, if they use only the median, they will underestimate total monthly revenue.
Follow-up: How would you detect whether the spike days are periodic (like weekends or holidays) versus random?
I would decompose the time series by day of week and month to check for seasonal patterns. A simple groupby on day-of-week showing that Saturday revenue is 3x the weekday median would confirm a weekly cycle. For holiday effects, I would flag known retail events (Black Friday, Prime Day) and compare flagged versus unflagged days. If the spikes are periodic, you can model them with seasonal adjustments. If they are random (driven by viral social media posts or unpredictable B2B orders), then you need a different forecasting approach that accounts for heavy-tailed distributions, perhaps a log-normal model rather than a normal one.
Strong Answer:
  • Standard deviation assumes your data is roughly symmetric and does not have extreme outliers. It uses every data point including the tails, so a single extreme value can inflate it dramatically. It is the right choice when your data is approximately normal — test scores, manufacturing measurements, or sensor readings from a calibrated instrument.
  • IQR (the range between the 25th and 75th percentiles) is robust to outliers because it only looks at the middle 50% of the data. It is the right choice for skewed or contaminated data — income distributions, transaction amounts, page load times, or any dataset where you suspect data quality issues at the extremes.
  • In practice, the choice has real consequences. If you use standard deviation for fraud detection thresholds on transaction amounts (which are heavily right-skewed), the outliers inflate the std dev so much that your “anomaly threshold” becomes absurdly high and you miss actual fraud. Using IQR-based thresholds (like the 1.5 x IQR rule) gives much more practical detection boundaries.
  • A concrete example: at a payments company, transaction amounts might have mean $50, std dev $500 (because of a few $10K wire transfers). A "mean plus 3 sigma" threshold would be $1,550, which misses all the moderately fraudulent $200-$300 transactions. An IQR-based approach with Q3 around $80 would flag anything above roughly $125 as worth investigating.
Follow-up: You mentioned the 1.5 x IQR rule. Where does that 1.5 come from, and when would you adjust it?
The 1.5 multiplier was introduced by John Tukey for box plots and corresponds roughly to the boundaries that would capture about 99.3% of a normal distribution. For normal data, Q1 minus 1.5 x IQR and Q3 plus 1.5 x IQR align approximately with mean plus or minus 2.7 standard deviations. You would adjust the multiplier based on your tolerance for false positives: use 3.0 x IQR for “extreme outliers” when you want very high confidence, or drop to 1.0 if you want a more aggressive filter. In fraud detection, you often tune this multiplier empirically against labeled fraud data to optimize the precision-recall tradeoff for your specific domain.
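The 2.7 standard deviations and 99.3% figures can be reproduced directly from the standard normal distribution. This short check uses `statistics.NormalDist` and makes no assumptions beyond normality.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, std dev 1

q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr
coverage = std_normal.cdf(upper_fence) - std_normal.cdf(lower_fence)

print(f"Fences at +/-{upper_fence:.3f} standard deviations")
print(f"Share of a normal distribution inside the fences: {coverage:.1%}")
```

Swapping 1.5 for another multiplier shows how quickly the expected false-positive rate changes on well-behaved data.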
Strong Answer:
  • Anscombe’s quartet is a set of four datasets that have nearly identical summary statistics — same mean of x, same mean of y, same variance, same correlation coefficient (r approximately 0.816), and the same regression line — yet look completely different when plotted. One is a normal linear relationship, one is a perfect curve, one is a perfect line with one outlier, and one has all points at one x-value except for a single extreme point.
  • The lesson is devastating for anyone who relies on summary statistics alone: the numbers can lie. Two datasets with identical means, variances, and correlations can have fundamentally different structures, and any model or decision built on those numbers without visual inspection could be wildly wrong.
  • In practice, this means every analysis should start with visualization. Before computing a single correlation or fitting a regression, plot a scatterplot, a histogram, or a residual plot. At companies with large data pipelines, I have seen teams ship models based on correlation matrices without ever looking at the actual data shape — and later discover non-linear relationships, clustering, or data artifacts that invalidated their approach.
  • The modern extension is the Datasaurus Dozen, which shows the same idea with 13 datasets (including one shaped like a dinosaur) that all share the same summary statistics. It reinforces that summary stats are a lossy compression of your data — useful for communication, but dangerous as the sole basis for decisions.
Follow-up: In a large-scale ML pipeline where you cannot manually visualize every feature pair, how do you catch these kinds of issues?
You build automated data profiling into the pipeline. Tools like pandas-profiling, Great Expectations, or custom checks can flag non-linearity (by comparing Pearson versus Spearman correlations; if they diverge, the relationship is non-linear), detect bimodality (using the dip test or kernel density estimation), and identify influential outliers (using Cook’s distance). You also set up distribution dashboards that sample and plot key feature pairs on a rotating basis. The goal is not to inspect every combination manually but to have automated red flags that trigger human review when something looks off.
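The Pearson-versus-Spearman check mentioned above could look something like this. It is a minimal standard-library sketch with hypothetical helper names (`pearson`, `ranks`, `spearman`); the ranking step assumes no tied values, which a production check would need to handle.

```python
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """Rank positions starting at 1 (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson computed on ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical feature pair with a monotonic but non-linear link
x = list(range(1, 11))
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson={r_pearson:.3f}, Spearman={r_spearman:.3f}, "
      f"gap={r_spearman - r_pearson:.3f}")
```

A pipeline check might flag any feature pair whose gap exceeds a tuned threshold for human review. A gap like this points to a monotonic but curved relationship; non-monotonic patterns need other checks.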