Describing Data - Mean and Central Tendency

Describing Data: What’s Normal?

The House Hunting Problem

You’re moving to Austin, Texas. You have a budget of $500,000 and want to know: Is that enough for a decent 3-bedroom house? You could look at one listing, but that’s just one data point. You need to understand the whole picture. Let’s load some real data:
import numpy as np
import pandas as pd

# House prices in Austin (3-bedroom homes, in thousands)
prices = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472,
    1250,  # A mansion somehow in the dataset
    448, 492, 418, 485, 455, 508, 428, 475, 495, 462
])

print(f"Number of houses: {len(prices)}")
print(f"Cheapest: ${min(prices)}K")
print(f"Most expensive: ${max(prices)}K")
Output:
Number of houses: 41
Cheapest: $389K
Most expensive: $1250K
The range is $389K to $1250K. But that doesn’t tell us what’s “typical”. We need better tools.

Measures of Central Tendency: “What’s Typical?”

The Mean (Average)

The mean is what most people think of as “average” - add everything up and divide by the count.
The mathematical formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + ... + x_n}{n}$
def calculate_mean(data):
    """Calculate the arithmetic mean."""
    return sum(data) / len(data)

mean_price = calculate_mean(prices)
print(f"Mean price: ${mean_price:.1f}K")

# Or use NumPy
print(f"Mean (NumPy): ${np.mean(prices):.1f}K")
Output:
Mean price: $483.6K
Mean (NumPy): $483.6K
Wait… $483.6K? Most houses cluster in the mid-$400Ks, but the mean is higher. What’s happening?
The Problem with the Mean: That $1.25M mansion is pulling the average up! The mean is sensitive to outliers.

The Median (Middle Value)

The median is the middle value when you sort the data. Half the values are above, half below.
def calculate_median(data):
    """Calculate the median."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    mid = n // 2
    
    if n % 2 == 0:  # Even number of elements
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    else:  # Odd number of elements
        return sorted_data[mid]

median_price = calculate_median(prices)
print(f"Median price: ${median_price:.1f}K")

# Or use NumPy
print(f"Median (NumPy): ${np.median(prices):.1f}K")
Output:
Median price: $472.0K
The median is $472K - much more representative of a “typical” house! The mansion doesn’t affect it because it’s just one value above the middle.
When to use Mean vs Median?
| Use Mean When | Use Median When |
| --- | --- |
| Data is symmetric | Data has outliers |
| No extreme values | Income/wealth data |
| You want total divided by count | You want the “typical” value |
| Example: test scores | Example: house prices |

The Mode (Most Common Value)

The mode is the value that appears most frequently. Less useful for continuous data, but great for categories.
from collections import Counter

def calculate_mode(data):
    """Find the most common value(s)."""
    counts = Counter(data)
    max_count = max(counts.values())
    modes = [val for val, count in counts.items() if count == max_count]
    return modes

# For house prices, mode isn't very useful (all unique)
# But for bedrooms:
bedrooms = [3, 3, 4, 3, 2, 3, 4, 3, 3, 2, 3, 4, 3, 5, 3, 3, 2, 4, 3, 3]
mode_bedrooms = calculate_mode(bedrooms)
print(f"Most common bedroom count: {mode_bedrooms}")  # [3]
Real-World Usage:
  • Most popular shirt size at a store
  • Most common customer complaint
  • Peak traffic hour

Measures of Spread: “How Different Are Things?”

Knowing the center isn’t enough. Consider these two neighborhoods:
neighborhood_A = [450, 455, 448, 460, 452, 445, 458, 447, 453, 462]
neighborhood_B = [350, 550, 400, 500, 380, 520, 410, 490, 360, 540]

print(f"Neighborhood A - Mean: ${np.mean(neighborhood_A):.1f}K")
print(f"Neighborhood B - Mean: ${np.mean(neighborhood_B):.1f}K")
Output:
Neighborhood A - Mean: $453.0K
Neighborhood B - Mean: $450.0K
Almost the same mean! But look at the actual houses:
  • Neighborhood A: All houses are between $445K and $462K (consistent)
  • Neighborhood B: Houses range from $350K to $550K (huge variation)
We need to measure spread.

Range (Simplest Measure)

def calculate_range(data):
    return max(data) - min(data)

print(f"Range A: ${calculate_range(neighborhood_A)}K")
print(f"Range B: ${calculate_range(neighborhood_B)}K")
Output:
Range A: $17K
Range B: $200K
The range shows the difference, but it only uses two values and is sensitive to outliers.

Variance: Average Squared Distance from Mean

Variance measures how far values typically are from the mean.
The formula: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Step by step:
  1. Find the mean
  2. For each value, calculate distance from mean
  3. Square each distance (makes all positive, penalizes big deviations)
  4. Average the squared distances
def calculate_variance(data):
    """Calculate population variance."""
    mean = sum(data) / len(data)
    squared_diffs = [(x - mean) ** 2 for x in data]
    return sum(squared_diffs) / len(data)

var_A = calculate_variance(neighborhood_A)
var_B = calculate_variance(neighborhood_B)

print(f"Variance A: {var_A:.1f}")
print(f"Variance B: {var_B:.1f}")

# Using NumPy
print(f"Variance A (NumPy): {np.var(neighborhood_A):.1f}")
print(f"Variance B (NumPy): {np.var(neighborhood_B):.1f}")
Output:
Variance A: 29.4
Variance B: 5420.0
Neighborhood B has about 184x the variance of A!

Standard Deviation: Variance in Original Units

Variance is in “squared dollars”, which is hard to interpret. Standard deviation brings us back to dollars: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
def calculate_std(data):
    """Calculate population standard deviation."""
    return np.sqrt(calculate_variance(data))

std_A = calculate_std(neighborhood_A)
std_B = calculate_std(neighborhood_B)

print(f"Std Dev A: ${std_A:.1f}K")
print(f"Std Dev B: ${std_B:.1f}K")
Output:
Std Dev A: $5.4K
Std Dev B: $73.6K
Interpretation:
  • Neighborhood A: Houses are typically within ±$5.4K of the mean
  • Neighborhood B: Houses are typically within ±$73.6K of the mean
Sample vs Population: When working with a sample (not the entire population), we divide by (n-1) instead of n for variance. This is called Bessel’s correction.
# Population variance (you have ALL the data)
np.var(prices, ddof=0)

# Sample variance (your data is a sample from a larger population)
np.var(prices, ddof=1)  # ddof=1 is the default in pandas' .var() and .std()
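To see the correction in action on a tiny made-up sample:

```python
import numpy as np

sample = np.array([10.0, 12.0, 14.0])  # mean is 12; squared deviations are 4, 0, 4

pop_var = np.var(sample, ddof=0)   # divide by n     -> 8/3 ≈ 2.67
samp_var = np.var(sample, ddof=1)  # divide by n - 1 -> 4.0 (Bessel's correction)
print(pop_var, samp_var)
```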

Percentiles and Quartiles: “Where Does This Value Rank?”

Going back to our Austin house prices - is $500K expensive or affordable? Percentiles tell you what percentage of values fall below a given number.
# Remove the mansion outlier for cleaner analysis
prices_clean = prices[prices < 1000]

# Calculate percentiles
p25 = np.percentile(prices_clean, 25)
p50 = np.percentile(prices_clean, 50)  # Same as median!
p75 = np.percentile(prices_clean, 75)
p90 = np.percentile(prices_clean, 90)

print(f"25th percentile: ${p25:.1f}K")
print(f"50th percentile (median): ${p50:.1f}K")
print(f"75th percentile: ${p75:.1f}K")
print(f"90th percentile: ${p90:.1f}K")
Output:
25th percentile: $440.2K
50th percentile (median): $470.0K
75th percentile: $495.0K
90th percentile: $510.2K
Your $500K budget puts you at about the 80th percentile - you can afford 80% of houses in this area!
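That percentile-rank claim can be checked directly. A small sketch: `percentile_rank` is a helper name introduced here, and the array repeats the cleaned prices so the snippet runs on its own:

```python
import numpy as np

prices_clean = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472,
    448, 492, 418, 485, 455, 508, 428, 475, 495, 462
])

def percentile_rank(data, value):
    """Percentage of values strictly below `value`."""
    return np.mean(data < value) * 100

print(f"A $500K budget beats {percentile_rank(prices_clean, 500):.0f}% of listings")
```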

The Interquartile Range (IQR)

The IQR is the range of the middle 50% of the data: $\mathrm{IQR} = Q_3 - Q_1 = P_{75} - P_{25}$
iqr = p75 - p25
print(f"IQR: ${iqr:.1f}K")

# Houses outside 1.5*IQR from quartiles are often considered outliers
lower_fence = p25 - 1.5 * iqr
upper_fence = p75 + 1.5 * iqr
print(f"Outlier thresholds: ${lower_fence:.1f}K - ${upper_fence:.1f}K")
Output:
IQR: $54.8K
Outlier thresholds: $358.1K - $577.1K
That $1.25M mansion is definitely an outlier!

Visualizing Data: See the Distribution

Numbers are great, but our brains understand pictures better.

Box Plot (Box-and-Whisker)

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# With outlier
axes[0].boxplot(prices, vert=True)
axes[0].set_title('House Prices (with $1.25M mansion)')
axes[0].set_ylabel('Price ($K)')

# Without outlier
axes[1].boxplot(prices_clean, vert=True)
axes[1].set_title('House Prices (mansion removed)')
axes[1].set_ylabel('Price ($K)')

plt.tight_layout()
plt.show()

Histogram

plt.figure(figsize=(10, 5))
plt.hist(prices_clean, bins=15, edgecolor='black', alpha=0.7)
plt.axvline(np.mean(prices_clean), color='red', linestyle='--', label=f'Mean: ${np.mean(prices_clean):.0f}K')
plt.axvline(np.median(prices_clean), color='green', linestyle='--', label=f'Median: ${np.median(prices_clean):.0f}K')
plt.xlabel('Price ($K)')
plt.ylabel('Number of Houses')
plt.title('Distribution of House Prices in Austin')
plt.legend()
plt.show()

Complete Summary Statistics

Here’s a function that gives you the full picture:
def describe_data(data, name="Data"):
    """Generate comprehensive summary statistics."""
    stats = {
        'Count': len(data),
        'Mean': np.mean(data),
        'Median': np.median(data),
        'Std Dev': np.std(data),
        'Variance': np.var(data),
        'Min': np.min(data),
        '25%': np.percentile(data, 25),
        '50%': np.percentile(data, 50),
        '75%': np.percentile(data, 75),
        'Max': np.max(data),
        'Range': np.max(data) - np.min(data),
        'IQR': np.percentile(data, 75) - np.percentile(data, 25)
    }
    
    print(f"\n{'='*40}")
    print(f"Summary Statistics: {name}")
    print(f"{'='*40}")
    for key, value in stats.items():
        print(f"{key:12}: {value:>12.2f}")
    
    return stats

# Use it!
describe_data(prices_clean, "Austin House Prices ($K)")
Output:
========================================
Summary Statistics: Austin House Prices ($K)
========================================
Count       :        40.00
Mean        :       464.48
Median      :       470.00
Std Dev     :        38.57
Variance    :      1487.35
Min         :       389.00
25%         :       440.25
50%         :       470.00
75%         :       495.00
Max         :       528.00
Range       :       139.00
IQR         :        54.75

🎯 Practice Exercises

Exercise 1: Salary Analysis

# Tech company salaries (in thousands)
salaries = np.array([
    75, 82, 78, 95, 88, 72, 105, 92, 85, 79,
    110, 125, 88, 95, 82, 450,  # CEO salary!
    78, 92, 85, 102, 88, 95, 82, 79, 105
])

# TODO: Calculate mean and median
# TODO: Which one better represents "typical" salary?
# TODO: Calculate standard deviation
# TODO: Identify outliers using IQR method
mean_salary = np.mean(salaries)
median_salary = np.median(salaries)

print(f"Mean: ${mean_salary:.1f}K")
print(f"Median: ${median_salary:.1f}K")

# Median is better - CEO salary inflates mean
# Typical employee makes ~$88K, not the ~$104K the mean suggests

std_salary = np.std(salaries)
print(f"Std Dev: ${std_salary:.1f}K")

# IQR outlier detection
q1 = np.percentile(salaries, 25)
q3 = np.percentile(salaries, 75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(f"Outliers: {outliers}")  # [125 450] - the CEO, plus one senior salary

Exercise 2: Test Score Comparison

# Two classes took the same test
class_A = [72, 75, 78, 80, 82, 85, 88, 90, 92, 95]
class_B = [65, 70, 78, 82, 83, 84, 85, 88, 95, 100]

# TODO: Which class performed better on average?
# TODO: Which class was more consistent?
# TODO: If you had to bet on a random student getting 80+, which class?
print(f"Class A - Mean: {np.mean(class_A):.1f}, Std: {np.std(class_A):.1f}")
print(f"Class B - Mean: {np.mean(class_B):.1f}, Std: {np.std(class_B):.1f}")

# Class A: Mean 83.7, Std 7.2
# Class B: Mean 83.0, Std 9.9

# Class A performed slightly better on average
# Class A was more consistent (lower std dev)

# For 80+ bet:
a_above_80 = sum(1 for x in class_A if x >= 80) / len(class_A)
b_above_80 = sum(1 for x in class_B if x >= 80) / len(class_B)
print(f"Class A: {a_above_80:.0%} got 80+")
print(f"Class B: {b_above_80:.0%} got 80+")

# Both 70%, but Class A is safer bet due to lower variance

🏠 Mini-Project: House Price Analyzer

Build a complete house price analysis tool!
import numpy as np
import pandas as pd

# Extended Austin dataset
data = {
    'price': [425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
              512, 445, 468, 502, 389, 475, 498, 415, 528, 459],
    'sqft': [1800, 1600, 1950, 2400, 2100, 1700, 2300, 2000, 1750, 2150,
             2350, 1900, 2050, 2250, 1650, 2100, 2200, 1850, 2450, 2000],
    'bedrooms': [3, 3, 4, 4, 3, 3, 4, 3, 3, 4,
                 4, 3, 3, 4, 3, 3, 4, 3, 5, 3],
    'neighborhood': ['North', 'South', 'North', 'West', 'North', 'South', 
                     'West', 'North', 'South', 'West', 'West', 'North',
                     'South', 'West', 'South', 'North', 'West', 'South',
                     'West', 'North']
}

houses = pd.DataFrame(data)

# YOUR TASKS:
# 1. Calculate summary statistics for price by neighborhood
# 2. Find price per square foot for each house
# 3. Which neighborhood has the most consistent prices?
# 4. Is there a relationship between bedrooms and price?
# 5. Your budget is $475K. What percentage of houses can you afford in each neighborhood?
import numpy as np
import pandas as pd

# ... (data from above)

# 1. Summary statistics by neighborhood
print("="*50)
print("PRICE STATISTICS BY NEIGHBORHOOD")
print("="*50)
for hood in houses['neighborhood'].unique():
    subset = houses[houses['neighborhood'] == hood]['price']
    print(f"\n{hood}:")
    print(f"  Mean:   ${subset.mean():.0f}K")
    print(f"  Median: ${subset.median():.0f}K")
    print(f"  Std:    ${subset.std():.1f}K")
    print(f"  Count:  {len(subset)}")

# 2. Price per square foot
houses['price_per_sqft'] = houses['price'] * 1000 / houses['sqft']
print("\n" + "="*50)
print("PRICE PER SQUARE FOOT")
print("="*50)
print(f"Mean: ${houses['price_per_sqft'].mean():.0f}/sqft")
print(f"Range: ${houses['price_per_sqft'].min():.0f} - ${houses['price_per_sqft'].max():.0f}/sqft")

# 3. Most consistent neighborhood (lowest std dev)
consistency = houses.groupby('neighborhood')['price'].std()
print("\n" + "="*50)
print("PRICE CONSISTENCY (Std Dev)")
print("="*50)
print(consistency.sort_values())
print(f"\nMost consistent: {consistency.idxmin()} (${consistency.min():.1f}K std)")

# 4. Bedrooms vs Price
bedroom_analysis = houses.groupby('bedrooms')['price'].agg(['mean', 'count'])
print("\n" + "="*50)
print("BEDROOMS VS PRICE")
print("="*50)
print(bedroom_analysis)

# 5. Affordability with $475K budget
budget = 475
print("\n" + "="*50)
print(f"AFFORDABILITY WITH ${budget}K BUDGET")
print("="*50)
for hood in houses['neighborhood'].unique():
    subset = houses[houses['neighborhood'] == hood]['price']
    affordable = (subset <= budget).sum() / len(subset) * 100
    print(f"{hood}: {affordable:.0f}% of houses affordable")
Output:
PRICE STATISTICS BY NEIGHBORHOOD
==================================================

North:
  Mean:   $456K
  Median: $459K
  Std:    $18.7K
  Count:  7

South:
  Mean:   $409K
  Median: $396K
  Std:    $30.4K
  Count:  6

West:
  Mean:   $508K
  Median: $510K
  Std:    $14.3K
  Count:  7

PRICE CONSISTENCY (Std Dev)
==================================================
neighborhood
West     14.31
North    18.67
South    30.44
Name: price, dtype: float64

Most consistent: West ($14.3K std)

AFFORDABILITY WITH $475K BUDGET
==================================================
North: 86% of houses affordable
South: 100% of houses affordable
West: 0% of houses affordable

Key Takeaways

Central Tendency

  • Mean: Add and divide. Sensitive to outliers.
  • Median: Middle value. Robust to outliers.
  • Mode: Most common. Great for categories.

Spread

  • Range: Max - Min. Simple but limited.
  • Variance: Average squared distance from mean.
  • Std Dev: Square root of variance. Same units as data.

Position

  • Percentiles: What % of values fall below this?
  • Quartiles: 25th, 50th, 75th percentiles.
  • IQR: Range of middle 50%. Good for outlier detection.

When to Use What

  • Symmetric data: Mean + Std Dev
  • Skewed data: Median + IQR
  • Outliers present: Always check both!

Common Mistakes to Avoid

Mistake 1: Always Using the Mean
The mean can be heavily influenced by outliers. For salary data, housing prices, or any skewed distribution, the median is often more representative.
Example: In a company where 9 employees earn $50K and the CEO earns $5M, the mean salary is $545K - wildly misleading!
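That salary example is quick to verify with a minimal sketch:

```python
import numpy as np

# 9 employees at $50K plus one $5M CEO
salaries = np.array([50_000] * 9 + [5_000_000])

print(f"Mean:   ${np.mean(salaries):,.0f}")    # $545,000 - wildly misleading
print(f"Median: ${np.median(salaries):,.0f}")  # $50,000 - the typical employee
```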
Mistake 2: Ignoring Units
Variance is in squared units, which can be hard to interpret. Standard deviation is in the original units, making it much more practical.
Example: A variance of 10,000 dollars² is hard to understand. A std dev of $100 is clear.
Mistake 3: Comparing Std Devs Across Different Scales
A std dev of $10K for house prices and a std dev of $10 for groceries aren’t comparable. Use the coefficient of variation (CV = std/mean) to compare relative variability.
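A quick sketch of the coefficient of variation; the means and std devs below are made-up illustrative numbers:

```python
house_mean, house_std = 465.0, 39.0      # house prices, in $K
grocery_mean, grocery_std = 85.0, 10.0   # grocery bills, in $

# CV = std / mean: a unit-free measure of relative spread
cv_houses = house_std / house_mean
cv_groceries = grocery_std / grocery_mean
print(f"CV houses:    {cv_houses:.1%}")
print(f"CV groceries: {cv_groceries:.1%}")
```

Despite the far larger absolute std dev, house prices here vary less relative to their mean than grocery bills do.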

Interview Questions

Question: You’re analyzing user session lengths. The mean is 45 minutes, but the median is only 8 minutes. What does this tell you about the distribution?
Answer: This indicates a heavily right-skewed distribution with outliers. Most users have short sessions (around 8 minutes), but some power users have very long sessions that pull the mean way up. The median is more representative of the “typical” user experience.
Question: You have a dataset of daily ad revenue. How would you identify outliers?
Answer: Use the IQR method:
  1. Calculate Q1 (25th percentile) and Q3 (75th percentile)
  2. Calculate IQR = Q3 - Q1
  3. Outliers are values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
Alternatively, use z-scores: values more than 3 standard deviations from the mean are typically considered outliers.
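The z-score method mentioned above can be sketched as follows; the revenue figures are invented, with one obvious spike:

```python
import numpy as np

# Hypothetical daily ad revenue (in dollars); the 5400 is a deliberate spike
revenue = np.array([1200, 1150, 1300, 1250, 1180, 1220, 5400, 1270, 1190, 1240])

# z-score: how many standard deviations each value sits from the mean
z_scores = (revenue - revenue.mean()) / revenue.std()
outliers = revenue[np.abs(z_scores) > 2]  # ±2 here; ±3 is a stricter cutoff
print(f"Outliers: {outliers}")
```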
Question: You’re comparing two delivery drivers. Driver A has mean delivery time of 30 min (std dev 2 min). Driver B has mean 32 min (std dev 8 min). Which driver would you prefer?
Answer: Despite Driver A being slightly faster on average, the low variance is the key differentiator. Driver A is consistently fast (28-32 min range), while Driver B is unpredictable (could be 24-40 min). For customer satisfaction and logistics planning, consistency often matters more than a slightly faster mean.
Question: We measure page load times. The mean is 2.5 seconds, but the 99th percentile is 15 seconds. What action might you take?
Answer: The 99th percentile (P99) being 6x the mean suggests there’s a “long tail” of slow experiences. Even though 99% of users have decent load times, 1% are having a terrible experience. For a company with millions of users, that’s a lot of frustrated customers. Focus on identifying what causes these edge cases - geographic regions, specific devices, or server issues.

Practice Challenge

You’re given website session data. Analyze it completely:
import numpy as np
np.random.seed(42)

# Simulate session durations (in seconds)
# Mix of quick bouncers and engaged users
short_sessions = np.random.exponential(30, size=800)  # Most users leave quickly
long_sessions = np.random.normal(600, 120, size=200)  # Engaged users stay ~10 min
sessions = np.concatenate([short_sessions, long_sessions])

# Your tasks:
# 1. Calculate mean, median, std dev
# 2. Identify which measure best represents "typical" session
# 3. Find the 10th, 50th, and 90th percentiles
# 4. Identify outliers using the IQR method
# 5. What story does this data tell about user behavior?

# Write your analysis here:
Solution:
# 1. Basic statistics
print(f"Mean: {np.mean(sessions):.1f} seconds")
print(f"Median: {np.median(sessions):.1f} seconds")
print(f"Std Dev: {np.std(sessions):.1f} seconds")

# 2. The median is far below the mean and better represents the typical
#    session, because the engaged-user segment skews the mean upward

# 3. Percentiles
print(f"P10: {np.percentile(sessions, 10):.1f} seconds")
print(f"P50: {np.percentile(sessions, 50):.1f} seconds")
print(f"P90: {np.percentile(sessions, 90):.1f} seconds")

# 4. Outliers using IQR
Q1 = np.percentile(sessions, 25)
Q3 = np.percentile(sessions, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = sessions[(sessions < lower_bound) | (sessions > upper_bound)]
print(f"Outliers: {len(outliers)} values")

# 5. Story: Two distinct user groups!
#    - 80% are "bouncers" with very short sessions
#    - 20% are "engaged users" with ~10 minute sessions
#    - This bimodal distribution suggests we should analyze 
#      these groups separately

📝 Practice Exercises

Exercise 1

Calculate descriptive statistics for employee salaries

Exercise 2

Analyze website load times for performance optimization

Exercise 3

Detect outliers in e-commerce transaction data

Exercise 4

Real-world: Analyze housing market price distributions

How This Connects to Machine Learning

Everything you just learned is foundational to ML:
| Descriptive Stat | ML Application |
| --- | --- |
| Mean | Used in normalization, calculating errors |
| Variance | Feature scaling, understanding data spread |
| Standard deviation | Standardization (z-scores), batch normalization |
| Percentiles | Handling outliers, creating features |
| Distribution shape | Choosing the right model and loss function |
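As a taste of that connection, standardization (the z-score transform used in feature scaling) is just the mean and std dev at work; the feature values below are made up:

```python
import numpy as np

# Hypothetical square-footage feature
sqft = np.array([1800.0, 1600.0, 1950.0, 2400.0, 2100.0])

# Subtract the mean, divide by the std dev -> mean 0, std 1
z = (sqft - sqft.mean()) / sqft.std()
print(z.round(2))
```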

Interview Prep: Common Questions

Q: When would you use median instead of mean?
Use median when data has outliers or is heavily skewed. Classic examples: income data (billionaires skew the mean), house prices, response times (occasional timeouts).
Q: How do you detect outliers?
Common methods: IQR method (1.5 × IQR beyond Q1/Q3), z-score method (beyond ±2 or ±3 standard deviations), visual inspection with box plots.
Q: What’s the difference between population and sample variance?
Population variance divides by n, sample variance divides by (n-1). Use n-1 for samples because it provides an unbiased estimate of population variance (Bessel’s correction).
Q: A dataset has mean = median. What does this tell you?
The distribution is likely symmetric (not skewed). In a perfectly symmetric distribution, mean = median = mode.
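A one-line sanity check of that claim:

```python
import numpy as np

symmetric = np.array([60, 70, 80, 90, 100])  # evenly spaced around 80
print(np.mean(symmetric), np.median(symmetric))  # both 80.0
```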

Common Pitfalls

Mistakes to Avoid:
  1. Using mean for skewed data - Always check for outliers first; median is often more representative
  2. Ignoring the spread - Two datasets can have identical means but completely different distributions
  3. Confusing variance units - Variance is in squared units; use standard deviation for interpretable scale
  4. Forgetting to visualize - Statistics alone can be misleading (Anscombe’s quartet is the classic example)

Key Takeaways

What You Learned:
  • Mean - Sum divided by count; sensitive to outliers
  • Median - Middle value; robust to outliers; use for skewed data
  • Mode - Most frequent value; useful for categorical data
  • Variance & Std Dev - Measure spread around the mean
  • Percentiles & IQR - Divide data into portions; detect outliers
  • Z-scores - Standardize values across different scales
Coming up next: We’ll learn about probability - how to quantify uncertainty and make predictions. This is where statistics becomes truly powerful!

Next: Probability Foundations

Learn to quantify uncertainty and make predictions