Describing Data - Mean and Central Tendency

Describing Data: What’s Normal?

The House Hunting Problem

You’re moving to Austin, Texas. You have a budget of $500,000 and want to know: Is that enough for a decent 3-bedroom house? You could look at one listing, but that’s just one data point. You need to understand the whole picture. Let’s load some real data:

import numpy as np
import pandas as pd

# House prices in Austin (3-bedroom homes, in thousands)
prices = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472,
    1250,  # A mansion somehow in the dataset
    448, 492, 418, 485, 455, 508, 428, 475, 495, 462
])

print(f"Number of houses: {len(prices)}")
print(f"Cheapest: ${min(prices)}K")
print(f"Most expensive: ${max(prices)}K")

Output:

Number of houses: 41
Cheapest: $389K
Most expensive: $1250K

The range is $389K to $1250K. But that doesn’t tell us what’s “typical”. We need better tools.

Measures of Central Tendency: “What’s Typical?”

The Mean (Average)

The mean is what most people think of as “average” - add everything up and divide by the count.

The mathematical formula:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + ... + x_n}{n}

def calculate_mean(data):
    """Calculate the arithmetic mean."""
    return sum(data) / len(data)

mean_price = calculate_mean(prices)
print(f"Mean price: ${mean_price:.1f}K")

# Or use NumPy
print(f"Mean (NumPy): ${np.mean(prices):.1f}K")

Output:

Mean price: $483.6K
Mean (NumPy): $483.6K

Wait… $483.6K? Most houses are around $465K, but the mean is higher. What’s happening?

Figure: Mean Real World - Mansion Pulling Average

The Problem with the Mean: That $1.25M mansion is pulling the average up! The mean is sensitive to outliers.
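To put a number on that pull, one option is to compute the mean twice, once with the mansion and once without it. This is a minimal sketch that reuses the same 41 prices; the `< 1000` cutoff is only a convenient way to drop the single $1.25M listing, not a general outlier rule.

```python
import numpy as np

# Same 41 prices as above (in $K), including the $1.25M mansion
prices = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472,
    1250,
    448, 492, 418, 485, 455, 508, 428, 475, 495, 462
])

mean_with = prices.mean()
mean_without = prices[prices < 1000].mean()  # drop the mansion only

print(f"Mean with mansion:    ${mean_with:.1f}K")
print(f"Mean without mansion: ${mean_without:.1f}K")
print(f"Shift caused by one house: ${mean_with - mean_without:.1f}K")
```

On this data, a single listing out of 41 lifts the mean by roughly $19K.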

The Median (Middle Value)

The median is the middle value when you sort the data. Half the values are above, half below.

def calculate_median(data):
    """Calculate the median."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    mid = n // 2
    
    if n % 2 == 0:  # Even number of elements
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    else:  # Odd number of elements
        return sorted_data[mid]

median_price = calculate_median(prices)
print(f"Median price: ${median_price:.1f}K")

# Or use NumPy
print(f"Median (NumPy): ${np.median(prices):.1f}K")

Output:

Median price: $472.0K
Median (NumPy): $472.0K

The median is $472K - much more representative of a “typical” house! The mansion barely affects it because it counts as just one value above the middle, no matter how high its price.
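A quick way to see this robustness is to make the extreme value even more extreme and watch what moves. The sketch below uses a small made-up sample (not the Austin data) and the standard library's `statistics` module.

```python
import statistics

# Small made-up sample (in $K) with one extreme value
sample = [425, 445, 462, 478, 1250]
# Same sample, but the extreme value is ten times larger
more_extreme = [425, 445, 462, 478, 12500]

for label, data in [("original", sample), ("10x outlier", more_extreme)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label:12} mean={mean:8.1f}  median={median}")
```

The mean jumps from 612 to 2862, while the median stays at 462 in both cases.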

When to use Mean vs Median?

Use Mean When	Use Median When
Data is symmetric	Data has outliers
No extreme values	Income/wealth data
You want total divided by count	You want “typical” value
Example: Test scores	Example: House prices

The Mode (Most Common Value)

The mode is the value that appears most frequently. Less useful for continuous data, but great for categories.

from collections import Counter

def calculate_mode(data):
    """Find the most common value(s)."""
    counts = Counter(data)
    max_count = max(counts.values())
    modes = [val for val, count in counts.items() if count == max_count]
    return modes

# For house prices, mode isn't very useful (values rarely repeat)
# But for bedrooms:
bedrooms = [3, 3, 4, 3, 2, 3, 4, 3, 3, 2, 3, 4, 3, 5, 3, 3, 2, 4, 3, 3]
mode_bedrooms = calculate_mode(bedrooms)
print(f"Most common bedroom count: {mode_bedrooms}")  # [3]

Real-World Usage:

Most popular shirt size at a store
Most common customer complaint
Peak traffic hour
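For categorical data like shirt sizes, the standard library already covers this case. The sketch below uses `statistics.multimode` (available since Python 3.8), which returns every tied value, much like `calculate_mode` above. The sales list is made up for illustration.

```python
from statistics import multimode

# Hypothetical shirt sizes sold in one day
sizes = ["M", "L", "M", "S", "L", "XL", "M", "L"]

# M and L both appear three times, so both are returned
modes = multimode(sizes)
print(f"Most popular size(s): {modes}")
```

Returning a list matters here: with a tie, reporting only one "most popular" size would hide half the answer.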

Measures of Spread: “How Different Are Things?”

Knowing the center isn’t enough. Consider these two neighborhoods:

neighborhood_A = [450, 455, 448, 460, 452, 445, 458, 447, 453, 462]
neighborhood_B = [350, 550, 400, 500, 380, 520, 410, 490, 360, 540]

print(f"Neighborhood A - Mean: ${np.mean(neighborhood_A):.1f}K")
print(f"Neighborhood B - Mean: ${np.mean(neighborhood_B):.1f}K")

Output:

Neighborhood A - Mean: $453.0K
Neighborhood B - Mean: $450.0K

Almost the same mean! But look at the actual houses:

Neighborhood A: All houses are between $445K and $462K (consistent)
Neighborhood B: Houses range from $350K to $550K (huge variation)

We need to measure spread.

Range (Simplest Measure)

def calculate_range(data):
    return max(data) - min(data)

print(f"Range A: ${calculate_range(neighborhood_A)}K")
print(f"Range B: ${calculate_range(neighborhood_B)}K")

Output:

Range A: $17K
Range B: $200K

The range shows the difference, but it only uses two values and is sensitive to outliers.
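To illustrate that sensitivity, here is a small sketch that adds one hypothetical $900K listing to Neighborhood A and recomputes the range. The extra listing is invented for the example.

```python
def calculate_range(data):
    return max(data) - min(data)

neighborhood_A = [450, 455, 448, 460, 452, 445, 458, 447, 453, 462]
with_outlier = neighborhood_A + [900]  # hypothetical extra listing

range_before = calculate_range(neighborhood_A)
range_after = calculate_range(with_outlier)

print(f"Range without the extra listing: ${range_before}K")
print(f"Range with the extra listing:    ${range_after}K")
```

Ten houses barely differ from each other, yet one new listing moves the range from $17K to $455K.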

Variance: Average Squared Distance from Mean

Variance measures how far values typically are from the mean.

The Formula:

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

Step by step:

Find the mean
For each value, calculate distance from mean
Square each distance (makes all positive, penalizes big deviations)
Average the squared distances

def calculate_variance(data):
    """Calculate population variance."""
    mean = sum(data) / len(data)
    squared_diffs = [(x - mean) ** 2 for x in data]
    return sum(squared_diffs) / len(data)

var_A = calculate_variance(neighborhood_A)
var_B = calculate_variance(neighborhood_B)

print(f"Variance A: {var_A:.1f}")
print(f"Variance B: {var_B:.1f}")

# Using NumPy
print(f"Variance A (NumPy): {np.var(neighborhood_A):.1f}")
print(f"Variance B (NumPy): {np.var(neighborhood_B):.1f}")

Output:

Variance A: 29.4
Variance B: 5420.0
Variance A (NumPy): 29.4
Variance B (NumPy): 5420.0

Neighborhood B has about 184x more variance than A!
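To make the four steps concrete, the sketch below walks through them for Neighborhood B and prints each house's distance from the mean next to its square. It uses plain Python so every intermediate value is visible.

```python
neighborhood_B = [350, 550, 400, 500, 380, 520, 410, 490, 360, 540]

mean_B = sum(neighborhood_B) / len(neighborhood_B)   # step 1: mean
deviations = [x - mean_B for x in neighborhood_B]    # step 2: distance from mean
squared = [d ** 2 for d in deviations]               # step 3: square each distance
variance_B = sum(squared) / len(squared)             # step 4: average the squares

print("price  deviation    squared")
for x, d, s in zip(neighborhood_B, deviations, squared):
    print(f"{x:>5}  {d:>+9.1f}  {s:>9.1f}")

print(f"mean = {mean_B:.1f}")
print(f"sum of squared deviations = {sum(squared):.1f}")
print(f"variance = {variance_B:.1f}")
```

The squared deviations add up to 54,200, and dividing by the 10 houses gives a population variance of 5,420.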

Figure: Variance Real World - Neighborhood Comparison

Standard Deviation: Variance in Original Units

Variance is in “squared dollars” which is hard to interpret. Standard deviation brings us back to dollars.

\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}

def calculate_std(data):
    """Calculate population standard deviation."""
    return np.sqrt(calculate_variance(data))

std_A = calculate_std(neighborhood_A)
std_B = calculate_std(neighborhood_B)

print(f"Std Dev A: ${std_A:.1f}K")
print(f"Std Dev B: ${std_B:.1f}K")

Output:

Std Dev A: $5.4K
Std Dev B: $73.6K

Interpretation:

Neighborhood A: Houses are typically within ±$5.4K of the mean
Neighborhood B: Houses are typically within ±$73.6K of the mean
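"Typically within" is worth checking rather than taking on faith. The sketch below counts what share of houses actually falls within one population standard deviation of the mean in each neighborhood. The helper name `share_within_one_std` is made up for this example.

```python
import numpy as np

neighborhood_A = np.array([450, 455, 448, 460, 452, 445, 458, 447, 453, 462])
neighborhood_B = np.array([350, 550, 400, 500, 380, 520, 410, 490, 360, 540])

def share_within_one_std(data):
    """Fraction of values within one population std dev of the mean."""
    mean, std = data.mean(), data.std()
    return np.mean(np.abs(data - mean) <= std)

share_A = share_within_one_std(neighborhood_A)
share_B = share_within_one_std(neighborhood_B)

print(f"Neighborhood A: {share_A:.0%} of houses within one std dev")
print(f"Neighborhood B: {share_B:.0%} of houses within one std dev")
```

In both neighborhoods 6 of the 10 houses land inside that band, so "typically" here means "most, not all".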

Sample vs Population: When working with a sample (not the entire population), we divide by (n-1) instead of n for variance. This is called Bessel’s correction.

# Population variance (you have ALL data)
np.var(data, ddof=0)

# Sample variance (you have a sample from larger population)
np.var(data, ddof=1)  # pandas .var() and .std() use ddof=1 by default; NumPy defaults to ddof=0
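The difference between the two divisors is easy to see on a small dataset. This sketch compares both versions on Neighborhood A, and cross-checks them against the standard library, where `statistics.pvariance` divides by n and `statistics.variance` divides by n-1.

```python
import statistics
import numpy as np

neighborhood_A = [450, 455, 448, 460, 452, 445, 458, 447, 453, 462]

pop_var = np.var(neighborhood_A, ddof=0)     # divide by n
sample_var = np.var(neighborhood_A, ddof=1)  # divide by n - 1

print(f"Population variance (ddof=0): {pop_var:.2f}")
print(f"Sample variance (ddof=1):     {sample_var:.2f}")

# The standard library makes the same distinction with two functions
print(f"statistics.pvariance: {statistics.pvariance(neighborhood_A):.2f}")
print(f"statistics.variance:  {statistics.variance(neighborhood_A):.2f}")
```

With only 10 values the gap is noticeable (29.40 vs about 32.67); as n grows, the two estimates converge.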

Percentiles and Quartiles: “Where Does This Value Rank?”

Going back to our Austin house prices - is $500K expensive or affordable? Percentiles tell you what percentage of values fall below a given number.

# Remove the mansion outlier for cleaner analysis
prices_clean = prices[prices < 1000]

# Calculate percentiles
p25 = np.percentile(prices_clean, 25)
p50 = np.percentile(prices_clean, 50)  # Same as median!
p75 = np.percentile(prices_clean, 75)
p90 = np.percentile(prices_clean, 90)

print(f"25th percentile: ${p25:.1f}K")
print(f"50th percentile (median): ${p50:.1f}K")
print(f"75th percentile: ${p75:.1f}K")
print(f"90th percentile: ${p90:.1f}K")

Output:

25th percentile: $440.2K
50th percentile (median): $470.0K
75th percentile: $495.0K
90th percentile: $510.2K

Your $500K budget sits at the 80th percentile: 32 of the 40 houses (80%) cost $500K or less!
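`np.percentile` answers "which price sits at this percentage?". The budget question runs the other way: "what percentage sits at or below this price?". A minimal sketch of that reverse lookup is just a count, shown here on the same prices with the mansion removed.

```python
import numpy as np

prices = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472,
    1250,
    448, 492, 418, 485, 455, 508, 428, 475, 495, 462
])
prices_clean = prices[prices < 1000]  # same filter as above

budget = 500  # in $K
n_affordable = int(np.sum(prices_clean <= budget))
share_affordable = n_affordable / len(prices_clean) * 100

print(f"{n_affordable} of {len(prices_clean)} houses cost ${budget}K or less")
print(f"That is {share_affordable:.0f}% of the market")
```

Counting with `<=` treats a house priced exactly at the budget as affordable; use `<` if you want strictly cheaper houses.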

The Interquartile Range (IQR)

The IQR is the range of the middle 50% of data:

IQR = Q3 - Q1 = P_{75} - P_{25}

iqr = p75 - p25
print(f"IQR: ${iqr:.1f}K")

# Houses outside 1.5*IQR from quartiles are often considered outliers
lower_fence = p25 - 1.5 * iqr
upper_fence = p75 + 1.5 * iqr
print(f"Outlier thresholds: ${lower_fence:.1f}K - ${upper_fence:.1f}K")

Output:

IQR: $54.8K
Outlier thresholds: $358.1K - $577.1K

That $1.25M mansion is definitely an outlier!
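The fence rule can be wrapped in a small reusable helper. The sketch below applies it to the full dataset, mansion included, so the fences differ slightly from the ones computed on the cleaned prices. The function name `iqr_outliers` and its `k` parameter are made up for this example.

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    data = np.asarray(data)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return data[(data < lower) | (data > upper)], (lower, upper)

prices = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472,
    1250,
    448, 492, 418, 485, 455, 508, 428, 475, 495, 462
])

outliers, (lower, upper) = iqr_outliers(prices)
print(f"Fences: ${lower:.1f}K - ${upper:.1f}K")
print(f"Outliers: {outliers}")
```

Because quartiles are themselves robust, the rule still isolates the mansion even though the mansion was part of the data used to build the fences.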

Visualizing Data: See the Distribution

Numbers are great, but our brains understand pictures better.

Box Plot (Box-and-Whisker)

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# With outlier
axes[0].boxplot(prices, vert=True)
axes[0].set_title('House Prices (with $1.25M mansion)')
axes[0].set_ylabel('Price ($K)')

# Without outlier
axes[1].boxplot(prices_clean, vert=True)
axes[1].set_title('House Prices (mansion removed)')
axes[1].set_ylabel('Price ($K)')

plt.tight_layout()
plt.show()

Histogram

plt.figure(figsize=(10, 5))
plt.hist(prices_clean, bins=15, edgecolor='black', alpha=0.7)
plt.axvline(np.mean(prices_clean), color='red', linestyle='--', label=f'Mean: ${np.mean(prices_clean):.0f}K')
plt.axvline(np.median(prices_clean), color='green', linestyle='--', label=f'Median: ${np.median(prices_clean):.0f}K')
plt.xlabel('Price ($K)')
plt.ylabel('Number of Houses')
plt.title('Distribution of House Prices in Austin')
plt.legend()
plt.show()

Complete Summary Statistics

Here’s a function that gives you the full picture:

def describe_data(data, name="Data"):
    """Generate comprehensive summary statistics."""
    stats = {
        'Count': len(data),
        'Mean': np.mean(data),
        'Median': np.median(data),
        'Std Dev': np.std(data),
        'Variance': np.var(data),
        'Min': np.min(data),
        '25%': np.percentile(data, 25),
        '50%': np.percentile(data, 50),
        '75%': np.percentile(data, 75),
        'Max': np.max(data),
        'Range': np.max(data) - np.min(data),
        'IQR': np.percentile(data, 75) - np.percentile(data, 25)
    }
    
    print(f"\n{'='*40}")
    print(f"Summary Statistics: {name}")
    print(f"{'='*40}")
    for key, value in stats.items():
        print(f"{key:12}: {value:>12.2f}")
    
    return stats

# Use it!
describe_data(prices_clean, "Austin House Prices ($K)")

Output:

========================================
Summary Statistics: Austin House Prices ($K)
========================================
Count       :        40.00
Mean        :       464.48
Median      :       470.00
Std Dev     :        38.57
Variance    :      1487.35
Min         :       389.00
25%         :       440.25
50%         :       470.00
75%         :       495.00
Max         :       528.00
Range       :       139.00
IQR         :        54.75

🎯 Practice Exercises

Exercise 1: Salary Analysis

# Tech company salaries (in thousands)
salaries = np.array([
    75, 82, 78, 95, 88, 72, 105, 92, 85, 79,
    110, 125, 88, 95, 82, 450,  # CEO salary!
    78, 92, 85, 102, 88, 95, 82, 79, 105
])

# TODO: Calculate mean and median
# TODO: Which one better represents "typical" salary?
# TODO: Calculate standard deviation
# TODO: Identify outliers using IQR method

Solution

mean_salary = np.mean(salaries)
median_salary = np.median(salaries)

print(f"Mean: ${mean_salary:.1f}K")
print(f"Median: ${median_salary:.1f}K")

# Median is better - CEO salary inflates mean
# Typical employee makes ~$88K, not ~$104K

std_salary = np.std(salaries)
print(f"Std Dev: ${std_salary:.1f}K")

# IQR outlier detection
q1 = np.percentile(salaries, 25)
q3 = np.percentile(salaries, 75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(f"Outliers: {outliers}")  # [125 450] - the CEO, plus one $125K salary above the upper fence (114.5)

Exercise 2: Test Score Comparison

# Two classes took the same test
class_A = [72, 75, 78, 80, 82, 85, 88, 90, 92, 95]
class_B = [65, 70, 78, 82, 83, 84, 85, 88, 95, 100]

# TODO: Which class performed better on average?
# TODO: Which class was more consistent?
# TODO: If you had to bet on a random student getting 80+, which class?

Solution

print(f"Class A - Mean: {np.mean(class_A):.1f}, Std: {np.std(class_A):.1f}")
print(f"Class B - Mean: {np.mean(class_B):.1f}, Std: {np.std(class_B):.1f}")

# Class A: Mean 83.7, Std 7.2
# Class B: Mean 83.0, Std 9.9

# Class A performed slightly better on average
# Class A was more consistent (lower std dev)

# For 80+ bet:
a_above_80 = sum(1 for x in class_A if x >= 80) / len(class_A)
b_above_80 = sum(1 for x in class_B if x >= 80) / len(class_B)
print(f"Class A: {a_above_80:.0%} got 80+")
print(f"Class B: {b_above_80:.0%} got 80+")

# Both 70%, so on this data the 80+ bet is a tie; Class A is "safer" only in the sense that its scores vary less

🏠 Mini-Project: House Price Analyzer

Build a complete house price analysis tool!

import numpy as np
import pandas as pd

# Extended Austin dataset
data = {
    'price': [425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
              512, 445, 468, 502, 389, 475, 498, 415, 528, 459],
    'sqft': [1800, 1600, 1950, 2400, 2100, 1700, 2300, 2000, 1750, 2150,
             2350, 1900, 2050, 2250, 1650, 2100, 2200, 1850, 2450, 2000],
    'bedrooms': [3, 3, 4, 4, 3, 3, 4, 3, 3, 4,
                 4, 3, 3, 4, 3, 3, 4, 3, 5, 3],
    'neighborhood': ['North', 'South', 'North', 'West', 'North', 'South', 
                     'West', 'North', 'South', 'West', 'West', 'North',
                     'South', 'West', 'South', 'North', 'West', 'South',
                     'West', 'North']
}

houses = pd.DataFrame(data)

# YOUR TASKS:
# 1. Calculate summary statistics for price by neighborhood
# 2. Find price per square foot for each house
# 3. Which neighborhood has the most consistent prices?
# 4. Is there a relationship between bedrooms and price?
# 5. Your budget is $475K. What percentage of houses can you afford in each neighborhood?

Complete Solution

import numpy as np
import pandas as pd

# ... (data from above)

# 1. Summary statistics by neighborhood
print("="*50)
print("PRICE STATISTICS BY NEIGHBORHOOD")
print("="*50)
for hood in houses['neighborhood'].unique():
    subset = houses[houses['neighborhood'] == hood]['price']
    print(f"\n{hood}:")
    print(f"  Mean:   ${subset.mean():.0f}K")
    print(f"  Median: ${subset.median():.0f}K")
    print(f"  Std:    ${subset.std():.1f}K")
    print(f"  Count:  {len(subset)}")

# 2. Price per square foot
houses['price_per_sqft'] = houses['price'] * 1000 / houses['sqft']
print("\n" + "="*50)
print("PRICE PER SQUARE FOOT")
print("="*50)
print(f"Mean: ${houses['price_per_sqft'].mean():.0f}/sqft")
print(f"Range: ${houses['price_per_sqft'].min():.0f} - ${houses['price_per_sqft'].max():.0f}/sqft")

# 3. Most consistent neighborhood (lowest std dev)
consistency = houses.groupby('neighborhood')['price'].std()
print("\n" + "="*50)
print("PRICE CONSISTENCY (Std Dev)")
print("="*50)
print(consistency.sort_values().round(2))
print(f"\nMost consistent: {consistency.idxmin()} (${consistency.min():.1f}K std)")

# 4. Bedrooms vs Price
bedroom_analysis = houses.groupby('bedrooms')['price'].agg(['mean', 'count'])
print("\n" + "="*50)
print("BEDROOMS VS PRICE")
print("="*50)
print(bedroom_analysis)

# 5. Affordability with $475K budget
budget = 475
print("\n" + "="*50)
print(f"AFFORDABILITY WITH ${budget}K BUDGET")
print("="*50)
for hood in houses['neighborhood'].unique():
    subset = houses[houses['neighborhood'] == hood]['price']
    affordable = (subset <= budget).sum() / len(subset) * 100
    print(f"{hood}: {affordable:.0f}% of houses affordable")

Output (abridged; the price-per-square-foot and bedroom sections are omitted):

==================================================
PRICE STATISTICS BY NEIGHBORHOOD
==================================================

North:
  Mean:   $456K
  Median: $459K
  Std:    $18.7K
  Count:  7

South:
  Mean:   $409K
  Median: $396K
  Std:    $30.4K
  Count:  6

West:
  Mean:   $508K
  Median: $510K
  Std:    $14.3K
  Count:  7

==================================================
PRICE CONSISTENCY (Std Dev)
==================================================
neighborhood
West     14.31
North    18.67
South    30.44
Name: price, dtype: float64

Most consistent: West ($14.3K std)

==================================================
AFFORDABILITY WITH $475K BUDGET
==================================================
North: 86% of houses affordable
South: 100% of houses affordable
West: 0% of houses affordable

Key Takeaways

Central Tendency

Mean: Add and divide. Sensitive to outliers.
Median: Middle value. Robust to outliers.
Mode: Most common. Great for categories.

Spread

Range: Max - Min. Simple but limited.
Variance: Average squared distance from mean.
Std Dev: Square root of variance. Same units as data.

Position

Percentiles: What % of values fall below this?
Quartiles: 25th, 50th, 75th percentiles.
IQR: Range of middle 50%. Good for outlier detection.

When to Use What

Symmetric data: Mean + Std Dev
Skewed data: Median + IQR
Outliers present: Always check both!

Common Mistakes to Avoid

Mistake 1: Always Using the Mean

The mean can be heavily influenced by outliers. For salary data, housing prices, or any skewed distribution, the median is often more representative.

Example: In a company where 9 employees earn $50K and the CEO earns $5M, the mean salary is $545K - wildly misleading!

Mistake 2: Ignoring Units

Variance is in squared units, which can be hard to interpret. Standard deviation is in the original units, making it much more practical.

Example: A variance of 10,000 dollars² is hard to understand. A std dev of $100 is clear.

Mistake 3: Comparing Std Devs Across Different Scales

A std dev of $10K for house prices and a std dev of $10 for groceries aren’t directly comparable. Use the coefficient of variation (CV = std/mean) to compare relative variability.
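As a rough illustration of the coefficient of variation, the sketch below compares two made-up datasets on very different scales. Both the numbers and the helper name `coefficient_of_variation` are assumptions for this example; it uses the population standard deviation.

```python
import statistics

def coefficient_of_variation(data):
    """Population std dev divided by the mean (a unitless ratio)."""
    return statistics.pstdev(data) / statistics.mean(data)

house_prices = [440_000, 450_000, 460_000]  # made-up values, in dollars
grocery_bills = [40, 50, 60]                # made-up values, in dollars

cv_houses = coefficient_of_variation(house_prices)
cv_groceries = coefficient_of_variation(grocery_bills)

print(f"Houses:    std ${statistics.pstdev(house_prices):,.0f}, CV {cv_houses:.1%}")
print(f"Groceries: std ${statistics.pstdev(grocery_bills):,.2f}, CV {cv_groceries:.1%}")
```

The house prices have a std dev a thousand times larger in dollars, yet relative to their mean they vary far less than the grocery bills. CV only makes sense for data measured on a ratio scale with a positive mean.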

Interview Questions

Question 1: Mean vs Median (Facebook/Meta)

Question: You’re analyzing user session lengths. The mean is 45 minutes, but the median is only 8 minutes. What does this tell you about the distribution?

Answer: This indicates a heavily right-skewed distribution with outliers. Most users have short sessions (around 8 minutes), but some power users have very long sessions that pull the mean way up. The median is more representative of the “typical” user experience.

Question 2: Outlier Detection (Google)

Question: You have a dataset of daily ad revenue. How would you identify outliers?

Answer: Use the IQR method:

Calculate Q1 (25th percentile) and Q3 (75th percentile)
Calculate IQR = Q3 - Q1
Outliers are values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR

Alternatively, use z-scores: values more than 3 standard deviations from the mean are typically considered outliers.
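The two rules do not always agree. As a sketch, here they are side by side on the salary data from Exercise 1, using the population standard deviation for the z-scores and a cutoff of 3.

```python
import numpy as np

# Salaries from Exercise 1 (in $K)
salaries = np.array([
    75, 82, 78, 95, 88, 72, 105, 92, 85, 79,
    110, 125, 88, 95, 82, 450,
    78, 92, 85, 102, 88, 95, 82, 79, 105
])

# z-score: how many std devs each value sits from the mean
z_scores = (salaries - salaries.mean()) / salaries.std()
z_flagged = salaries[np.abs(z_scores) > 3]

# 1.5 x IQR rule on the same data, for comparison
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
iqr_flagged = salaries[(salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)]

print(f"z-score rule flags: {z_flagged}")
print(f"IQR rule flags:     {iqr_flagged}")
```

On this data the z-score rule flags only the 450 salary, while the IQR rule also flags 125. The extreme value inflates the mean and standard deviation that the z-score depends on, which makes that rule more lenient when outliers are present.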

Question 3: Variance Application (Amazon)

Question: You’re comparing two delivery drivers. Driver A has mean delivery time of 30 min (std dev 2 min). Driver B has mean 32 min (std dev 8 min). Which driver would you prefer?

Answer: Driver A. Driver A is slightly faster on average, but the low variance is the key differentiator. Driver A is consistently fast (about 28-32 min within one standard deviation), while Driver B is unpredictable (24-40 min over the same one-standard-deviation band). For customer satisfaction and logistics planning, consistency often matters more than a slightly faster mean.

Question 4: Percentiles in Practice (Netflix)

Question: We measure page load times. The mean is 2.5 seconds, but the 99th percentile is 15 seconds. What action might you take?

Answer: The 99th percentile (P99) being 6x the mean suggests there’s a “long tail” of slow experiences. Even if most users have decent load times, 1% of page loads take 15 seconds or longer, which is a terrible experience. For a company with millions of users, that’s a lot of frustrated customers. Focus on identifying what causes these edge cases - geographic regions, specific devices, or server issues.

Practice Challenge

Challenge: Analyze This Real Dataset

You’re given website session data. Analyze it completely:

import numpy as np
np.random.seed(42)

# Simulate session durations (in seconds)
# Mix of quick bouncers and engaged users
short_sessions = np.random.exponential(30, size=800)  # Most users leave quickly
long_sessions = np.random.normal(600, 120, size=200)  # Engaged users stay ~10 min
sessions = np.concatenate([short_sessions, long_sessions])

# Your tasks:
# 1. Calculate mean, median, std dev
# 2. Identify which measure best represents "typical" session
# 3. Find the 10th, 50th, and 90th percentiles
# 4. Identify outliers using the IQR method
# 5. What story does this data tell about user behavior?

# Write your analysis here:

Solution:

# 1. Basic statistics
print(f"Mean: {np.mean(sessions):.1f} seconds")
print(f"Median: {np.median(sessions):.1f} seconds")
print(f"Std Dev: {np.std(sessions):.1f} seconds")

# 2. The median (roughly 30 sec for this 80/20 mix) is more representative because the
#    mean is pulled far upward by the engaged user segment

# 3. Percentiles
print(f"P10: {np.percentile(sessions, 10):.1f} seconds")
print(f"P50: {np.percentile(sessions, 50):.1f} seconds")
print(f"P90: {np.percentile(sessions, 90):.1f} seconds")

# 4. Outliers using IQR
Q1 = np.percentile(sessions, 25)
Q3 = np.percentile(sessions, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = sessions[(sessions < lower_bound) | (sessions > upper_bound)]
print(f"Outliers: {len(outliers)} values")

# 5. Story: Two distinct user groups!
#    - 80% are "bouncers" with very short sessions
#    - 20% are "engaged users" with ~10 minute sessions
#    - This bimodal distribution suggests we should analyze 
#      these groups separately

📝 Practice Exercises

Exercise 1

Calculate descriptive statistics for employee salaries

Exercise 2

Analyze website load times for performance optimization

Exercise 3

Detect outliers in e-commerce transaction data

Exercise 4

Real-world: Analyze housing market price distributions

How This Connects to Machine Learning

Everything you just learned is foundational to ML:

Descriptive Stat	ML Application
Mean	Used in normalization, calculating errors
Variance	Feature scaling, understanding data spread
Standard deviation	Standardization (z-scores), batch normalization
Percentiles	Handling outliers, creating features
Distribution shape	Choosing the right model and loss function
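As one small example from the table, standardization is just the mean and standard deviation applied to every value. The sketch below standardizes the Neighborhood B prices by hand with NumPy, using the population standard deviation; ML libraries typically provide an equivalent scaler.

```python
import numpy as np

# Neighborhood B prices (in $K) used as a single feature column
feature = np.array([350, 550, 400, 500, 380, 520, 410, 490, 360, 540], dtype=float)

# z-score standardization: subtract the mean, divide by the std dev
standardized = (feature - feature.mean()) / feature.std()

print(standardized.round(2))
print(f"New mean: {standardized.mean():.2f}")
print(f"New std:  {standardized.std():.2f}")
```

After the transform the feature has mean 0 and standard deviation 1, so features measured in different units end up on a comparable scale.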

Interview Prep: Common Questions

Frequently Asked Interview Questions

Q: When would you use median instead of mean?

Use median when data has outliers or is heavily skewed. Classic examples: income data (billionaires skew the mean), house prices, response times (occasional timeouts).

Q: How do you detect outliers?

Common methods: IQR method (1.5 × IQR beyond Q1/Q3), z-score method (beyond ±2 or ±3 standard deviations), visual inspection with box plots.

Q: What’s the difference between population and sample variance?

Population variance divides by n, sample variance divides by (n-1). Use n-1 for samples because it provides an unbiased estimate of population variance (Bessel’s correction).

Q: A dataset has mean = median. What does this tell you?

The distribution is likely symmetric (not skewed), although equal mean and median do not guarantee symmetry. In a perfectly symmetric, single-peaked distribution, mean = median = mode.
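A tiny made-up counterexample shows why "likely" is the right word: the dataset below has identical mean and median, yet its values are not mirrored around the center.

```python
import statistics

data = [1, 2, 3, 3, 6]  # made-up values

mean = statistics.mean(data)
median = statistics.median(data)
below = sorted(median - x for x in data if x < median)  # distances below the center
above = sorted(x - median for x in data if x > median)  # distances above the center

print(f"mean = {mean}, median = {median}")
print(f"distances below the median: {below}")
print(f"distances above the median: {above}")
```

The mean and median are both 3, but the values below sit 1 and 2 units away while the only value above sits 3 units away, so a histogram is still worth a look.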

Common Pitfalls

Mistakes to Avoid:

Using mean for skewed data - Always check for outliers first; median is often more representative
Ignoring the spread - Two datasets can have identical means but completely different distributions
Confusing variance units - Variance is in squared units; use standard deviation for interpretable scale
Forgetting to visualize - Statistics alone can be misleading (Anscombe’s quartet is the classic example)

Key Takeaways

What You Learned:

✅ Mean - Sum divided by count; sensitive to outliers
✅ Median - Middle value; robust to outliers; use for skewed data
✅ Mode - Most frequent value; useful for categorical data
✅ Variance & Std Dev - Measure spread around the mean
✅ Percentiles & IQR - Divide data into portions; detect outliers
✅ Z-scores - Standardize values across different scales

Coming up next: We’ll learn about probability - how to quantify uncertainty and make predictions. This is where statistics becomes truly powerful!

Next: Probability Foundations

Learn to quantify uncertainty and make predictions
