# Describing Data: What's Normal?

> Learn to summarize any dataset with mean, median, variance, and more

<Frame>
  <img src="https://mintcdn.com/devweeekends/X0Fp4X8lMl-ZftoO/images/courses/statistics-for-ml/mean-real-world.svg?fit=max&auto=format&n=X0Fp4X8lMl-ZftoO&q=85&s=44f2277214eb7e54278533df8fe84a44" alt="Describing Data - Mean and Central Tendency" width="1080" height="1080" data-path="images/courses/statistics-for-ml/mean-real-world.svg" />
</Frame>

## The House Hunting Problem

You're moving to Austin, Texas. You have a budget of \$500,000 and want to know: **Is that enough for a decent 3-bedroom house?**

You could look at one listing, but that's just one data point. You need to understand the **whole picture**.

Let's load some real data:

```python theme={null}
import numpy as np
import pandas as pd

# House prices in Austin (3-bedroom homes, in thousands)
prices = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472,
    1250,  # A mansion somehow in the dataset
    448, 492, 418, 485, 455, 508, 428, 475, 495, 462
])

print(f"Number of houses: {len(prices)}")
print(f"Cheapest: ${min(prices)}K")
print(f"Most expensive: ${max(prices)}K")
```

**Output:**

```
Number of houses: 41
Cheapest: $389K
Most expensive: $1250K
```

The range is \$389K to \$1250K. But that doesn't tell us what's "typical". We need better tools.

***

## Measures of Central Tendency: "What's Typical?"

### The Mean (Average)

The **mean** is what most people think of as "average" - add everything up and divide by the count.

**Analogy**: Think of the mean as the balance point of a seesaw. If you placed each data point as a weight along a beam, the mean is where you would put the fulcrum to make it balance perfectly. One very heavy weight far from center (an outlier) can shift the balance point dramatically.
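
One way to make the balance-point idea concrete: the deviations from the mean always cancel out. The sketch below checks this on three made-up values (they are an assumption for illustration, not part of the Austin data).

```python
import numpy as np

# Hypothetical toy values, chosen so the arithmetic is easy to follow
data = np.array([2.0, 4.0, 9.0])

mean = data.mean()          # (2 + 4 + 9) / 3 = 5.0
deviations = data - mean    # [-3., -1.,  4.]

# Weights left of the fulcrum exactly offset the weights to the right
print(f"Mean: {mean}")
print(f"Deviations: {deviations}")
print(f"Sum of deviations: {deviations.sum()}")
```

The single value far to the right (9) is offset by two values closer in on the left, which is why one distant point can drag the fulcrum toward itself.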

<Frame>
  <img src="https://mintcdn.com/devweeekends/X0Fp4X8lMl-ZftoO/images/courses/statistics-for-ml/mean-math.svg?fit=max&auto=format&n=X0Fp4X8lMl-ZftoO&q=85&s=d65e16de56a5d4f63ede4121a69ec4bd" alt="Mean Formula Visualization" width="1080" height="1080" data-path="images/courses/statistics-for-ml/mean-math.svg" />
</Frame>

The mathematical formula:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}
$$

```python theme={null}
def calculate_mean(data):
    """Calculate the arithmetic mean."""
    return sum(data) / len(data)

mean_price = calculate_mean(prices)
print(f"Mean price: ${mean_price:.1f}K")

# Or use NumPy
print(f"Mean (NumPy): ${np.mean(prices):.1f}K")
```

**Output:**

```
Mean price: $483.6K
Mean (NumPy): $483.6K
```

Wait... \$483.6K? That's above what most of these houses cost: 25 of the 41 listings are cheaper than the mean. What's happening?

<Frame>
  <img src="https://mintcdn.com/devweeekends/X0Fp4X8lMl-ZftoO/images/courses/statistics-for-ml/mean-real-world.svg?fit=max&auto=format&n=X0Fp4X8lMl-ZftoO&q=85&s=44f2277214eb7e54278533df8fe84a44" alt="Mean Real World - Mansion Pulling Average" width="1080" height="1080" data-path="images/courses/statistics-for-ml/mean-real-world.svg" />
</Frame>

**The Problem with the Mean**: That \$1.25M mansion is pulling the average up! The mean is **sensitive to outliers**.

***

### The Median (Middle Value)

The **median** is the middle value when you sort the data. Half the values are above, half below.

```python theme={null}
def calculate_median(data):
    """Calculate the median."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    mid = n // 2
    
    if n % 2 == 0:  # Even number of elements
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    else:  # Odd number of elements
        return sorted_data[mid]

median_price = calculate_median(prices)
print(f"Median price: ${median_price:.1f}K")

# Or use NumPy
print(f"Median (NumPy): ${np.median(prices):.1f}K")
```

**Output:**

```
Median price: $472.0K
Median (NumPy): $472.0K
```

**The median is \$472K** - much more representative of a "typical" house! The mansion barely affects it because it counts as just one value above the middle, no matter how expensive it is.
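
To see the difference in sensitivity side by side, here is a minimal sketch on a hypothetical six-house sample (the values are invented for illustration). It measures how far the mean and the median each move when a single mansion is added.

```python
import numpy as np

# Hypothetical mini-sample (in $K): five ordinary homes, then one mansion added
typical = np.array([440, 455, 460, 470, 475])
with_mansion = np.append(typical, 1250)

mean_shift = np.mean(with_mansion) - np.mean(typical)
median_shift = np.median(with_mansion) - np.median(typical)

print(f"Mean moves by:   ${mean_shift:.1f}K")
print(f"Median moves by: ${median_shift:.1f}K")
```

In this toy sample the mean jumps by more than \$130K while the median moves by \$5K, because the median depends only on the ordering of the values.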

<Info>
  **When to use Mean vs Median?**

  | Use Mean When                   | Use Median When          |
  | ------------------------------- | ------------------------ |
  | Data is symmetric               | Data has outliers        |
  | No extreme values               | Income/wealth data       |
  | You want total divided by count | You want "typical" value |
  | Example: Test scores            | Example: House prices    |
</Info>

***

### The Mode (Most Common Value)

The **mode** is the value that appears most frequently. Less useful for continuous data, but great for categories.

```python theme={null}
from collections import Counter

def calculate_mode(data):
    """Find the most common value(s)."""
    counts = Counter(data)
    max_count = max(counts.values())
    modes = [val for val, count in counts.items() if count == max_count]
    return modes

# For house prices, mode isn't very useful (few values repeat)
# But for bedrooms:
bedrooms = [3, 3, 4, 3, 2, 3, 4, 3, 3, 2, 3, 4, 3, 5, 3, 3, 2, 4, 3, 3]
mode_bedrooms = calculate_mode(bedrooms)
print(f"Most common bedroom count: {mode_bedrooms}")  # [3]
```

**Real-World Usage**:

* Most popular shirt size at a store
* Most common customer complaint
* Peak traffic hour
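
For categorical data like the examples above, pandas can report the mode directly. The sketch below assumes a made-up list of shirt sizes; `value_counts()` and `mode()` are standard pandas `Series` methods.

```python
import pandas as pd

# Hypothetical shirt sizes sold in one day
sizes = pd.Series(["M", "L", "M", "S", "M", "XL", "L", "M"])

counts = sizes.value_counts()    # frequency of each category, most common first
modes = sizes.mode().tolist()    # mode() returns every value tied for the top count

print(counts)
print(f"Most popular size: {modes}")
```

Because `mode()` returns all tied values, the result is a list even when there is a single winner, matching the behavior of `calculate_mode` above.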

***

## Measures of Spread: "How Different Are Things?"

Knowing the center isn't enough. Consider these two neighborhoods:

```python theme={null}
neighborhood_A = [450, 455, 448, 460, 452, 445, 458, 447, 453, 462]
neighborhood_B = [350, 550, 400, 500, 380, 520, 410, 490, 360, 540]

print(f"Neighborhood A - Mean: ${np.mean(neighborhood_A):.1f}K")
print(f"Neighborhood B - Mean: ${np.mean(neighborhood_B):.1f}K")
```

**Output:**

```
Neighborhood A - Mean: $453.0K
Neighborhood B - Mean: $450.0K
```

Almost the same mean! But look at the actual houses:

* **Neighborhood A**: All houses are between \$445K and \$462K (consistent)
* **Neighborhood B**: Houses range from \$350K to \$550K (huge variation)

We need to measure **spread**.

***

### Range (Simplest Measure)

```python theme={null}
def calculate_range(data):
    return max(data) - min(data)

print(f"Range A: ${calculate_range(neighborhood_A)}K")
print(f"Range B: ${calculate_range(neighborhood_B)}K")
```

**Output:**

```
Range A: $17K
Range B: $200K
```

The range shows the difference, but it only uses two values and is sensitive to outliers.

***

### Variance: Average Squared Distance from Mean

**Variance** measures how far values typically are from the mean.

<Frame>
  <img src="https://mintcdn.com/devweeekends/CHfRzoAmD5TGW2ch/images/courses/statistics-for-ml/variance-math.svg?fit=max&auto=format&n=CHfRzoAmD5TGW2ch&q=85&s=21dc5972e75bfec9a5a014eb64ae6efd" alt="Variance Formula Visualization" width="1080" height="1080" data-path="images/courses/statistics-for-ml/variance-math.svg" />
</Frame>

**The Formula:**

$$
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

**Step by step:**

1. Find the mean
2. For each value, calculate distance from mean
3. Square each distance (makes all positive, penalizes big deviations)
4. Average the squared distances

```python theme={null}
def calculate_variance(data):
    """Calculate population variance."""
    mean = sum(data) / len(data)
    squared_diffs = [(x - mean) ** 2 for x in data]
    return sum(squared_diffs) / len(data)

var_A = calculate_variance(neighborhood_A)
var_B = calculate_variance(neighborhood_B)

print(f"Variance A: {var_A:.1f}")
print(f"Variance B: {var_B:.1f}")

# Using NumPy
print(f"Variance A (NumPy): {np.var(neighborhood_A):.1f}")
print(f"Variance B (NumPy): {np.var(neighborhood_B):.1f}")
```

**Output:**

```
Variance A: 29.4
Variance B: 5420.0
Variance A (NumPy): 29.4
Variance B (NumPy): 5420.0
```

Neighborhood B has roughly **184x more variance** than A!

<Frame>
  <img src="https://mintcdn.com/devweeekends/CHfRzoAmD5TGW2ch/images/courses/statistics-for-ml/variance-real-world.svg?fit=max&auto=format&n=CHfRzoAmD5TGW2ch&q=85&s=438dbaae0ba866f0c58124bbddc922d7" alt="Variance Real World - Neighborhood Comparison" width="1080" height="1080" data-path="images/courses/statistics-for-ml/variance-real-world.svg" />
</Frame>

***

### Standard Deviation: Variance in Original Units

Variance is in "squared dollars" which is hard to interpret. **Standard deviation** brings us back to dollars.

$$
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

```python theme={null}
def calculate_std(data):
    """Calculate population standard deviation."""
    return np.sqrt(calculate_variance(data))

std_A = calculate_std(neighborhood_A)
std_B = calculate_std(neighborhood_B)

print(f"Std Dev A: ${std_A:.1f}K")
print(f"Std Dev B: ${std_B:.1f}K")
```

**Output:**

```
Std Dev A: $5.4K
Std Dev B: $73.6K
```

**Interpretation**:

* Neighborhood A: Houses are typically within plus or minus \$5.4K of the mean
* Neighborhood B: Houses are typically within plus or minus \$73.6K of the mean

**Analogy**: Standard deviation is like the "typical commute distance" of data points from their home (the mean). In Neighborhood A, every data point lives close to the mean -- a short commute. In Neighborhood B, data points are scattered far and wide.
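
"Typically within one standard deviation" can be checked directly. The sketch below counts what share of each neighborhood's houses falls within one population standard deviation of the mean; `share_within_one_std` is a hypothetical helper name introduced for this example.

```python
import numpy as np

neighborhood_A = np.array([450, 455, 448, 460, 452, 445, 458, 447, 453, 462])
neighborhood_B = np.array([350, 550, 400, 500, 380, 520, 410, 490, 360, 540])

def share_within_one_std(data):
    """Fraction of values lying within one population std dev of the mean."""
    mean, std = data.mean(), data.std()
    return np.mean(np.abs(data - mean) <= std)

share_A = share_within_one_std(neighborhood_A)
share_B = share_within_one_std(neighborhood_B)

print(f"A: {share_A:.0%} of houses within one std dev of the mean")
print(f"B: {share_B:.0%} of houses within one std dev of the mean")
```

Both neighborhoods have 6 of 10 houses inside the band. The share is the same; what differs is the width of the band, which is exactly what the standard deviation reports.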

<Tip>
  **ML Application -- Feature Scaling**: Standard deviation is the foundation of **standardization** (z-score normalization), one of the most critical preprocessing steps in ML. When you run `StandardScaler()` in scikit-learn, it subtracts each feature's mean and divides by its standard deviation so all features have comparable scales. Skip this step with models trained by gradient descent or with SVMs, and the features with larger scales will dominate learning -- a classic beginner mistake that produces mysteriously poor models.
</Tip>
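
As a rough illustration of what standardization does, here is a NumPy-only sketch on a made-up feature matrix (the values and column meanings are assumptions for the example). It centers each column and divides by the population standard deviation (`ddof=0`), which is the estimator scikit-learn documents for `StandardScaler`.

```python
import numpy as np

# Hypothetical feature matrix: columns are [square feet, bedrooms]
X = np.array([
    [1800.0, 3.0],
    [1600.0, 3.0],
    [2400.0, 4.0],
    [2100.0, 3.0],
    [2450.0, 5.0],
])

# z-score: subtract each column's mean, divide by its population std dev
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print("Column means after scaling:   ", X_scaled.mean(axis=0).round(6))
print("Column std devs after scaling:", X_scaled.std(axis=0).round(6))
```

After scaling, square footage (in the thousands) and bedroom count (single digits) both have mean 0 and standard deviation 1, so neither dominates purely because of its units.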

<Warning>
  **Sample vs Population**: When working with a sample (not the entire population), we divide by (n-1) instead of n for variance. This is called **Bessel's correction**.

  ```python theme={null}
  # Population variance (you have ALL data)
  np.var(data, ddof=0)

  # Sample variance (you have a sample from larger population)
  np.var(data, ddof=1)  # Default in pandas
  ```
</Warning>
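
The effect of `ddof` is easy to verify on a small example. The sketch below uses a hypothetical eight-value sample and compares NumPy's default, NumPy with Bessel's correction, and the pandas default.

```python
import numpy as np
import pandas as pd

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample, mean = 5

pop_var = np.var(data)               # NumPy default ddof=0: divide by n
sample_var = np.var(data, ddof=1)    # Bessel's correction: divide by n - 1
pandas_var = pd.Series(data).var()   # pandas default is ddof=1

print(f"Population variance (n):  {pop_var:.3f}")
print(f"Sample variance (n - 1):  {sample_var:.3f}")
print(f"pandas Series.var():      {pandas_var:.3f}")
```

The two libraries disagree by default, so when a NumPy result and a pandas result differ slightly, the `ddof` setting is the first thing to check.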

***

## Percentiles and Quartiles: "Where Does This Value Rank?"

Going back to our Austin house prices - is \$500K expensive or affordable?

**Percentiles** tell you what percentage of values fall below a given number.

```python theme={null}
# Remove the mansion outlier for cleaner analysis
prices_clean = prices[prices < 1000]

# Calculate percentiles
p25 = np.percentile(prices_clean, 25)
p50 = np.percentile(prices_clean, 50)  # Same as median!
p75 = np.percentile(prices_clean, 75)
p90 = np.percentile(prices_clean, 90)

print(f"25th percentile: ${p25:.1f}K")
print(f"50th percentile (median): ${p50:.1f}K")
print(f"75th percentile: ${p75:.1f}K")
print(f"90th percentile: ${p90:.1f}K")
```

**Output:**

```
25th percentile: $440.2K
50th percentile (median): $470.0K
75th percentile: $495.0K
90th percentile: $510.2K
```

**Your \$500K budget puts you at the 80th percentile** - you can afford 80% of the houses in this sample (32 of 40)!
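
Going the other way, from a value to its percentile rank, takes one line. The sketch below is one simple way to do it, counting the share of prices at or below the budget; it repeats the 40 cleaned prices so that it runs on its own.

```python
import numpy as np

# The 40 Austin prices from above (in $K), with the mansion already removed
prices_clean = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472,
    448, 492, 418, 485, 455, 508, 428, 475, 495, 462,
])

budget = 500
affordable = prices_clean <= budget          # True where the price fits the budget
percentile_rank = affordable.mean() * 100    # share of True values, as a percentage

print(f"{affordable.sum()} of {len(prices_clean)} houses cost ${budget}K or less")
print(f"Percentile rank of ${budget}K: {percentile_rank:.0f}")
```

Note that this uses "at or below" as the definition; other conventions (strictly below, or averaging the two) give slightly different ranks when values tie with the budget.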

### The Interquartile Range (IQR)

The **IQR** is the range of the middle 50% of data:

$$
IQR = Q3 - Q1 = P_{75} - P_{25}
$$

```python theme={null}
iqr = p75 - p25
print(f"IQR: ${iqr:.1f}K")

# Houses outside 1.5*IQR from quartiles are often considered outliers
lower_fence = p25 - 1.5 * iqr
upper_fence = p75 + 1.5 * iqr
print(f"Outlier thresholds: ${lower_fence:.1f}K - ${upper_fence:.1f}K")
```

**Output:**

```
IQR: $54.8K
Outlier thresholds: $358.1K - $577.1K
```

That \$1.25M mansion is definitely an outlier!
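
The fence rule can be wrapped in a small reusable function. This is a minimal sketch; `iqr_outliers` and its `k` parameter are hypothetical names introduced here, and the fences are computed on the full 41-price list (mansion included) rather than on the cleaned data.

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    data = np.asarray(data)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return data[(data < lower) | (data > upper)]

# The full price list from the start of the page (in $K)
prices = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472,
    1250,
    448, 492, 418, 485, 455, 508, 428, 475, 495, 462,
])

print(f"Outliers: {iqr_outliers(prices)}")
```

Because quartiles are themselves robust, the mansion is flagged even though it was part of the data used to compute the fences.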

***

## Visualizing Data: See the Distribution

Numbers are great, but our brains understand pictures better.

### Box Plot (Box-and-Whisker)

```python theme={null}
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# With outlier
axes[0].boxplot(prices, vert=True)
axes[0].set_title('House Prices (with $1.25M mansion)')
axes[0].set_ylabel('Price ($K)')

# Without outlier
axes[1].boxplot(prices_clean, vert=True)
axes[1].set_title('House Prices (mansion removed)')
axes[1].set_ylabel('Price ($K)')

plt.tight_layout()
plt.show()
```

### Histogram

```python theme={null}
plt.figure(figsize=(10, 5))
plt.hist(prices_clean, bins=15, edgecolor='black', alpha=0.7)
plt.axvline(np.mean(prices_clean), color='red', linestyle='--', label=f'Mean: ${np.mean(prices_clean):.0f}K')
plt.axvline(np.median(prices_clean), color='green', linestyle='--', label=f'Median: ${np.median(prices_clean):.0f}K')
plt.xlabel('Price ($K)')
plt.ylabel('Number of Houses')
plt.title('Distribution of House Prices in Austin')
plt.legend()
plt.show()
```

***

## Complete Summary Statistics

Here's a function that gives you the full picture:

```python theme={null}
def describe_data(data, name="Data"):
    """Generate comprehensive summary statistics."""
    stats = {
        'Count': len(data),
        'Mean': np.mean(data),
        'Median': np.median(data),
        'Std Dev': np.std(data),
        'Variance': np.var(data),
        'Min': np.min(data),
        '25%': np.percentile(data, 25),
        '50%': np.percentile(data, 50),
        '75%': np.percentile(data, 75),
        'Max': np.max(data),
        'Range': np.max(data) - np.min(data),
        'IQR': np.percentile(data, 75) - np.percentile(data, 25)
    }
    
    print(f"\n{'='*40}")
    print(f"Summary Statistics: {name}")
    print(f"{'='*40}")
    for key, value in stats.items():
        print(f"{key:12}: {value:>12.2f}")
    
    return stats

# Use it!
describe_data(prices_clean, "Austin House Prices ($K)")
```

**Output:**

```
========================================
Summary Statistics: Austin House Prices ($K)
========================================
Count       :        40.00
Mean        :       464.48
Median      :       470.00
Std Dev     :        38.57
Variance    :      1487.35
Min         :       389.00
25%         :       440.25
50%         :       470.00
75%         :       495.00
Max         :       528.00
Range       :       139.00
IQR         :        54.75
```
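
pandas offers a built-in counterpart, `Series.describe()`. The sketch below runs it on the Neighborhood A prices from earlier and highlights one difference worth knowing: `describe()` reports the sample standard deviation (`ddof=1`), while `np.std` defaults to the population version (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Neighborhood A prices from earlier (in $K)
prices_a = pd.Series([450, 455, 448, 460, 452, 445, 458, 447, 453, 462])

summary = prices_a.describe()   # count, mean, std, min, 25%, 50%, 75%, max
print(summary)

# describe() uses ddof=1, so its std is slightly larger than NumPy's default
numpy_std = np.std(prices_a.to_numpy())
print(f"pandas std (ddof=1): {summary['std']:.2f}")
print(f"NumPy std (ddof=0):  {numpy_std:.2f}")
```

The gap between the two shrinks as the sample grows, but on ten values it is already visible in the second decimal place.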

***

## 🎯 Practice Exercises

### Exercise 1: Salary Analysis

```python theme={null}
# Tech company salaries (in thousands)
salaries = np.array([
    75, 82, 78, 95, 88, 72, 105, 92, 85, 79,
    110, 125, 88, 95, 82, 450,  # CEO salary!
    78, 92, 85, 102, 88, 95, 82, 79, 105
])

# TODO: Calculate mean and median
# TODO: Which one better represents "typical" salary?
# TODO: Calculate standard deviation
# TODO: Identify outliers using IQR method
```

<Accordion title="Solution">
  ```python theme={null}
  mean_salary = np.mean(salaries)
  median_salary = np.median(salaries)

  print(f"Mean: ${mean_salary:.1f}K")
  print(f"Median: ${median_salary:.1f}K")

  # Median is better - CEO salary inflates mean
  # Typical employee makes ~$88K, not $104K

  std_salary = np.std(salaries)
  print(f"Std Dev: ${std_salary:.1f}K")

  # IQR outlier detection
  q1 = np.percentile(salaries, 25)
  q3 = np.percentile(salaries, 75)
  iqr = q3 - q1
  lower = q1 - 1.5 * iqr
  upper = q3 + 1.5 * iqr

  outliers = salaries[(salaries < lower) | (salaries > upper)]
  print(f"Outliers: {outliers}")  # [125 450] - the CEO, plus one $125K salary just above the upper fence ($114.5K)
  ```
</Accordion>

### Exercise 2: Test Score Comparison

```python theme={null}
# Two classes took the same test
class_A = [72, 75, 78, 80, 82, 85, 88, 90, 92, 95]
class_B = [65, 70, 78, 82, 83, 84, 85, 88, 95, 100]

# TODO: Which class performed better on average?
# TODO: Which class was more consistent?
# TODO: If you had to bet on a random student getting 80+, which class?
```

<Accordion title="Solution">
  ```python theme={null}
  print(f"Class A - Mean: {np.mean(class_A):.1f}, Std: {np.std(class_A):.1f}")
  print(f"Class B - Mean: {np.mean(class_B):.1f}, Std: {np.std(class_B):.1f}")

  # Class A: Mean 83.7, Std 7.2
  # Class B: Mean 83.0, Std 9.9

  # Class A performed slightly better on average
  # Class A was more consistent (lower std dev)

  # For 80+ bet:
  a_above_80 = sum(1 for x in class_A if x >= 80) / len(class_A)
  b_above_80 = sum(1 for x in class_B if x >= 80) / len(class_B)
  print(f"Class A: {a_above_80:.0%} got 80+")
  print(f"Class B: {b_above_80:.0%} got 80+")

  # Both 70%, so on this data the 80+ bet is even; Class A's scores are simply less spread out
  ```
</Accordion>

***

## 🏠 Mini-Project: House Price Analyzer

Build a complete house price analysis tool!

```python theme={null}
import numpy as np
import pandas as pd

# Extended Austin dataset
data = {
    'price': [425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
              512, 445, 468, 502, 389, 475, 498, 415, 528, 459],
    'sqft': [1800, 1600, 1950, 2400, 2100, 1700, 2300, 2000, 1750, 2150,
             2350, 1900, 2050, 2250, 1650, 2100, 2200, 1850, 2450, 2000],
    'bedrooms': [3, 3, 4, 4, 3, 3, 4, 3, 3, 4,
                 4, 3, 3, 4, 3, 3, 4, 3, 5, 3],
    'neighborhood': ['North', 'South', 'North', 'West', 'North', 'South', 
                     'West', 'North', 'South', 'West', 'West', 'North',
                     'South', 'West', 'South', 'North', 'West', 'South',
                     'West', 'North']
}

houses = pd.DataFrame(data)

# YOUR TASKS:
# 1. Calculate summary statistics for price by neighborhood
# 2. Find price per square foot for each house
# 3. Which neighborhood has the most consistent prices?
# 4. Is there a relationship between bedrooms and price?
# 5. Your budget is $475K. What percentage of houses can you afford in each neighborhood?
```

<Accordion title="Complete Solution">
  ```python theme={null}
  import numpy as np
  import pandas as pd

  # ... (data from above)

  # 1. Summary statistics by neighborhood
  print("="*50)
  print("PRICE STATISTICS BY NEIGHBORHOOD")
  print("="*50)
  for hood in houses['neighborhood'].unique():
      subset = houses[houses['neighborhood'] == hood]['price']
      print(f"\n{hood}:")
      print(f"  Mean:   ${subset.mean():.0f}K")
      print(f"  Median: ${subset.median():.0f}K")
      print(f"  Std:    ${subset.std():.1f}K")
      print(f"  Count:  {len(subset)}")

  # 2. Price per square foot
  houses['price_per_sqft'] = houses['price'] * 1000 / houses['sqft']
  print("\n" + "="*50)
  print("PRICE PER SQUARE FOOT")
  print("="*50)
  print(f"Mean: ${houses['price_per_sqft'].mean():.0f}/sqft")
  print(f"Range: ${houses['price_per_sqft'].min():.0f} - ${houses['price_per_sqft'].max():.0f}/sqft")

  # 3. Most consistent neighborhood (lowest std dev)
  consistency = houses.groupby('neighborhood')['price'].std()
  print("\n" + "="*50)
  print("PRICE CONSISTENCY (Std Dev)")
  print("="*50)
  print(consistency.sort_values())
  print(f"\nMost consistent: {consistency.idxmin()} (${consistency.min():.1f}K std)")

  # 4. Bedrooms vs Price
  bedroom_analysis = houses.groupby('bedrooms')['price'].agg(['mean', 'count'])
  print("\n" + "="*50)
  print("BEDROOMS VS PRICE")
  print("="*50)
  print(bedroom_analysis)

  # 5. Affordability with $475K budget
  budget = 475
  print("\n" + "="*50)
  print(f"AFFORDABILITY WITH ${budget}K BUDGET")
  print("="*50)
  for hood in houses['neighborhood'].unique():
      subset = houses[houses['neighborhood'] == hood]['price']
      affordable = (subset <= budget).sum() / len(subset) * 100
      print(f"{hood}: {affordable:.0f}% of houses affordable")
  ```

  **Output (abridged, values rounded):**

  ```
  PRICE STATISTICS BY NEIGHBORHOOD
  ==================================================

  North:
    Mean:   $456K
    Median: $459K
    Std:    $18.7K
    Count:  7

  South:
    Mean:   $409K
    Median: $396K
    Std:    $30.4K
    Count:  6

  West:
    Mean:   $508K
    Median: $510K
    Std:    $14.3K
    Count:  7

  PRICE CONSISTENCY (Std Dev)
  ==================================================
  neighborhood
  West     14.31
  North    18.67
  South    30.44
  Name: price, dtype: float64

  Most consistent: West ($14.3K std)

  AFFORDABILITY WITH $475K BUDGET
  ==================================================
  North: 86% of houses affordable
  South: 100% of houses affordable
  West: 0% of houses affordable
  ```
</Accordion>

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Central Tendency" icon="bullseye">
    * **Mean**: Add and divide. Sensitive to outliers.
    * **Median**: Middle value. Robust to outliers.
    * **Mode**: Most common. Great for categories.
  </Card>

  <Card title="Spread" icon="arrows-left-right">
    * **Range**: Max - Min. Simple but limited.
    * **Variance**: Average squared distance from mean.
    * **Std Dev**: Square root of variance. Same units as data.
  </Card>

  <Card title="Position" icon="ranking-star">
    * **Percentiles**: What % of values fall below this?
    * **Quartiles**: 25th, 50th, 75th percentiles.
    * **IQR**: Range of middle 50%. Good for outlier detection.
  </Card>

  <Card title="When to Use What" icon="lightbulb">
    * **Symmetric data**: Mean + Std Dev
    * **Skewed data**: Median + IQR
    * **Outliers present**: Always check both!
  </Card>
</CardGroup>

***

## Common Mistakes to Avoid

<Warning>
  **Mistake 1: Always Using the Mean**

  The mean can be heavily influenced by outliers. For salary data, housing prices, or any skewed distribution, the median is often more representative.

  **Example**: In a company where 9 employees earn \$50K and the CEO earns \$5M, the mean salary is \$545K - wildly misleading!
</Warning>

<Warning>
  **Mistake 2: Ignoring Units**

  Variance is in squared units, which can be hard to interpret. Standard deviation is in the original units, making it much more practical.

  **Example**: A variance of 10,000 dollars² is hard to understand. A std dev of \$100 is clear.
</Warning>

<Warning>
  **Mistake 3: Comparing Std Devs Across Different Scales**

  A std dev of \$10K for house prices and a std dev of \$10 for groceries aren't directly comparable. Use the coefficient of variation (CV = std/mean) to compare relative variability.
</Warning>
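
A short sketch of the coefficient of variation in practice. The dollar amounts are hypothetical and `coefficient_of_variation` is a helper name introduced for this example; the measure only makes sense for positive data measured on a ratio scale.

```python
import numpy as np

def coefficient_of_variation(data):
    """Std dev divided by mean: a unitless measure of relative spread."""
    data = np.asarray(data, dtype=float)
    return data.std() / data.mean()

# Hypothetical values, both in dollars
house_prices = [440_000, 450_000, 460_000]
grocery_bills = [40, 50, 60]

cv_houses = coefficient_of_variation(house_prices)
cv_groceries = coefficient_of_variation(grocery_bills)

print(f"House prices CV:  {cv_houses:.1%}")
print(f"Grocery bills CV: {cv_groceries:.1%}")
```

The house prices have a standard deviation about a thousand times larger in absolute terms, yet relative to their mean they vary far less than the grocery bills do.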

***

## Interview Questions

<Accordion title="Question 1: Mean vs Median (Facebook/Meta)">
  **Question**: You're analyzing user session lengths. The mean is 45 minutes, but the median is only 8 minutes. What does this tell you about the distribution?

  <Tip>
    **Answer**: This indicates a heavily right-skewed distribution with outliers. Most users have short sessions (around 8 minutes), but some power users have very long sessions that pull the mean way up. The median is more representative of the "typical" user experience.
  </Tip>
</Accordion>
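
A quick simulation shows how a right-skewed distribution produces this kind of gap. The sketch below assumes session lengths follow a log-normal distribution, which is one common modeling choice for durations and not something the question specifies; the parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical session lengths in minutes, drawn from a log-normal distribution
sessions = rng.lognormal(mean=2.0, sigma=1.5, size=10_000)

mean_len = sessions.mean()
median_len = np.median(sessions)
share_below_mean = np.mean(sessions < mean_len)

print(f"Mean:   {mean_len:.1f} min")
print(f"Median: {median_len:.1f} min")
print(f"Share of sessions shorter than the mean: {share_below_mean:.0%}")
```

In a distribution like this, well over half of all sessions are shorter than the "average" session, which is why the median is the better description of a typical user.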

<Accordion title="Question 2: Outlier Detection (Google)">
  **Question**: You have a dataset of daily ad revenue. How would you identify outliers?

  <Tip>
    **Answer**: Use the IQR method:

    1. Calculate Q1 (25th percentile) and Q3 (75th percentile)
    2. Calculate IQR = Q3 - Q1
    3. Outliers are values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR

    Alternatively, use z-scores: values more than 3 standard deviations from the mean are typically considered outliers.
  </Tip>
</Accordion>
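
The z-score alternative mentioned in the answer can be sketched as follows. The revenue figures are simulated (200 ordinary days plus two injected anomalies), and the threshold of 3 is the conventional choice from the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily ad revenue: 200 ordinary days plus two injected anomalies
revenue = np.concatenate([rng.normal(1000, 50, size=200), [2000.0, 100.0]])

# z-score: how many standard deviations each day sits from the mean
z_scores = (revenue - revenue.mean()) / revenue.std()
flagged = revenue[np.abs(z_scores) > 3]

print(f"Flagged values: {np.sort(flagged)}")
```

One caveat: the outliers themselves inflate the mean and standard deviation used in the z-score, so on small or heavily contaminated samples the IQR method tends to be the more reliable of the two.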

<Accordion title="Question 3: Variance Application (Amazon)">
  **Question**: You're comparing two delivery drivers. Driver A has mean delivery time of 30 min (std dev 2 min). Driver B has mean 32 min (std dev 8 min). Which driver would you prefer?

  <Tip>
    **Answer**: Driver A. Besides being slightly faster on average, Driver A has far lower variance, and that is the key differentiator. Driver A is consistently fast (roughly 28-32 min within one standard deviation), while Driver B is unpredictable (roughly 24-40 min). For customer satisfaction and logistics planning, consistency often matters more than a slightly faster mean.
  </Tip>
</Accordion>

<Accordion title="Question 4: Percentiles in Practice (Netflix)">
  **Question**: We measure page load times. The mean is 2.5 seconds, but the 99th percentile is 15 seconds. What action might you take?

  <Tip>
    **Answer**: The 99th percentile (P99) being 6x the mean suggests there's a "long tail" of slow experiences. Even though 99% of users have decent load times, 1% are having a terrible experience. For a company with millions of users, that's a lot of frustrated customers. Focus on identifying what causes these edge cases - geographic regions, specific devices, or server issues.
  </Tip>
</Accordion>

***

## Practice Challenge

<Accordion title="Challenge: Analyze This Real Dataset">
  You're given website session data. Analyze it completely:

  ```python theme={null}
  import numpy as np
  np.random.seed(42)

  # Simulate session durations (in seconds)
  # Mix of quick bouncers and engaged users
  short_sessions = np.random.exponential(30, size=800)  # Most users leave quickly
  long_sessions = np.random.normal(600, 120, size=200)  # Engaged users stay ~10 min
  sessions = np.concatenate([short_sessions, long_sessions])

  # Your tasks:
  # 1. Calculate mean, median, std dev
  # 2. Identify which measure best represents "typical" session
  # 3. Find the 10th, 50th, and 90th percentiles
  # 4. Identify outliers using the IQR method
  # 5. What story does this data tell about user behavior?

  # Write your analysis here:
  ```

  **Solution**:

  ```python theme={null}
  # 1. Basic statistics
  print(f"Mean: {np.mean(sessions):.1f} seconds")
  print(f"Median: {np.median(sessions):.1f} seconds")
  print(f"Std Dev: {np.std(sessions):.1f} seconds")

  # 2. The median (roughly 30 sec; the exact value depends on the random draw)
  #    is more representative because the distribution is heavily skewed
  #    by the engaged user segment

  # 3. Percentiles
  print(f"P10: {np.percentile(sessions, 10):.1f} seconds")
  print(f"P50: {np.percentile(sessions, 50):.1f} seconds")
  print(f"P90: {np.percentile(sessions, 90):.1f} seconds")

  # 4. Outliers using IQR
  Q1 = np.percentile(sessions, 25)
  Q3 = np.percentile(sessions, 75)
  IQR = Q3 - Q1
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR
  outliers = sessions[(sessions < lower_bound) | (sessions > upper_bound)]
  print(f"Outliers: {len(outliers)} values")

  # 5. Story: Two distinct user groups!
  #    - 80% are "bouncers" with very short sessions
  #    - 20% are "engaged users" with ~10 minute sessions
  #    - This bimodal distribution suggests we should analyze 
  #      these groups separately
  ```
</Accordion>

***

## 📝 Practice Exercises

<CardGroup cols={2}>
  <Card title="Exercise 1" icon="calculator" color="#3B82F6">
    Calculate descriptive statistics for employee salaries
  </Card>

  <Card title="Exercise 2" icon="chart-bar" color="#10B981">
    Analyze website load times for performance optimization
  </Card>

  <Card title="Exercise 3" icon="magnifying-glass" color="#8B5CF6">
    Detect outliers in e-commerce transaction data
  </Card>

  <Card title="Exercise 4" icon="building" color="#F59E0B">
    Real-world: Analyze housing market price distributions
  </Card>
</CardGroup>

<details>
  <summary>**Exercise 1: Employee Salary Analysis** - Calculate mean, median, and standard deviation</summary>

  **Problem**: A company has the following employee salaries (in thousands):
  `[45, 52, 48, 95, 51, 49, 47, 250, 53, 50]`

  1. Calculate the mean salary
  2. Calculate the median salary
  3. Calculate the standard deviation
  4. Which measure (mean or median) better represents the "typical" salary? Why?

  **Solution**:

  ```python theme={null}
  import numpy as np

  salaries = [45, 52, 48, 95, 51, 49, 47, 250, 53, 50]

  mean_salary = np.mean(salaries)
  median_salary = np.median(salaries)
  std_salary = np.std(salaries)

  print(f"Mean salary: ${mean_salary:.2f}K")      # $74.00K
  print(f"Median salary: ${median_salary:.2f}K")  # $50.50K
  print(f"Standard deviation: ${std_salary:.2f}K") # $60.26K

  # The median ($50.5K) better represents the typical salary because 
  # the mean is heavily influenced by the single $250K salary.
  # Most employees earn around $50K, not $74K.
  ```
</details>

<details>
  <summary>**Exercise 2: Website Performance Analysis** - Analyze load time percentiles</summary>

  **Problem**: You have 1000 page load times (in seconds). The data has:

  * Mean: 2.3 seconds
  * P50 (median): 1.8 seconds
  * P90: 4.5 seconds
  * P99: 12.0 seconds

  1. What does the difference between mean and median tell you?
  2. If you set an SLA at P90, what percentage of users experience worse performance?
  3. Calculate the ratio of P99 to P50. What does this indicate?

  **Solution**:

  ```python theme={null}
  import numpy as np

  # Given statistics
  mean = 2.3
  p50 = 1.8
  p90 = 4.5
  p99 = 12.0

  # 1. Mean > Median indicates right-skewed distribution
  # There are some very slow page loads pulling the mean up
  print("Mean > Median: Right-skewed distribution (long tail of slow requests)")

  # 2. P90 SLA
  print(f"With P90 SLA at {p90}s, 10% of users (100 out of 1000) experience worse performance")

  # 3. P99/P50 ratio
  ratio = p99 / p50
  print(f"P99/P50 ratio: {ratio:.1f}x")
  # A ratio of 6.67x means the slowest 1% of requests are nearly 7x slower than typical
  # This indicates a "long tail" problem requiring investigation

  # Simulate similar data
  np.random.seed(42)
  load_times = np.concatenate([
      np.random.exponential(1.5, 900),  # Normal requests
      np.random.exponential(8, 100)      # Slow requests
  ])
  print(f"\nSimulated P50: {np.percentile(load_times, 50):.2f}s")
  print(f"Simulated P99: {np.percentile(load_times, 99):.2f}s")
  ```
</details>

<details>
  <summary>**Exercise 3: Outlier Detection in Transactions** - Use IQR method</summary>

  **Problem**: An e-commerce platform has the following transaction amounts:
  `[25, 30, 28, 35, 32, 29, 500, 27, 31, 33, 28, 750, 26, 30]`

  1. Calculate Q1, Q3, and IQR
  2. Determine the outlier boundaries using the 1.5×IQR rule
  3. Identify which transactions are outliers
  4. What might these outliers represent in real life?

  **Solution**:

  ```python theme={null}
  import numpy as np

  transactions = [25, 30, 28, 35, 32, 29, 500, 27, 31, 33, 28, 750, 26, 30]

  # 1. Calculate quartiles
  Q1 = np.percentile(transactions, 25)
  Q3 = np.percentile(transactions, 75)
  IQR = Q3 - Q1

  print(f"Q1: ${Q1:.2f}")   # $28.00
  print(f"Q3: ${Q3:.2f}")   # $32.75
  print(f"IQR: ${IQR:.2f}")  # $4.75

  # 2. Outlier boundaries
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR

  print(f"\nLower boundary: ${lower_bound:.2f}")  # $20.88
  print(f"Upper boundary: ${upper_bound:.2f}")   # $39.88

  # 3. Identify outliers
  outliers = [x for x in transactions if x < lower_bound or x > upper_bound]
  print(f"\nOutliers: {outliers}")  # [$500, $750]

  # 4. Real-world interpretation
  print("\nPossible explanations for outliers:")
  print("- Bulk/wholesale purchases")
  print("- Potential fraud requiring investigation")
  print("- High-value customers (VIP segment)")
  print("- Data entry errors (extra zeros)")
  ```
</details>

<details>
  <summary>**Exercise 4: Housing Market Analysis** - Real-world comprehensive statistics</summary>

  **Problem**: You're analyzing home prices in two neighborhoods:

  **Neighborhood A**: `[320, 335, 340, 328, 345, 332, 338, 342, 325, 330]` (in thousands)
  **Neighborhood B**: `[280, 450, 310, 520, 290, 380, 300, 480, 275, 420]` (in thousands)

  1. Calculate mean and standard deviation for both
  2. Calculate the coefficient of variation (CV = std/mean × 100%)
  3. Which neighborhood has more consistent pricing?
  4. A buyer with \$350K budget - which neighborhood should they focus on?

  **Solution**:

  ```python theme={null}
  import numpy as np

  neighborhood_a = [320, 335, 340, 328, 345, 332, 338, 342, 325, 330]
  neighborhood_b = [280, 450, 310, 520, 290, 380, 300, 480, 275, 420]

  # 1. Calculate statistics
  mean_a, std_a = np.mean(neighborhood_a), np.std(neighborhood_a)
  mean_b, std_b = np.mean(neighborhood_b), np.std(neighborhood_b)

  print("Neighborhood A:")
  print(f"  Mean: ${mean_a:.1f}K, Std: ${std_a:.1f}K")  # Mean: $333.5K, Std: $7.5K

  print("\nNeighborhood B:")
  print(f"  Mean: ${mean_b:.1f}K, Std: ${std_b:.1f}K")  # Mean: $370.5K, Std: $87.0K

  # 2. Coefficient of Variation
  cv_a = (std_a / mean_a) * 100
  cv_b = (std_b / mean_b) * 100

  print(f"\nCoefficient of Variation:")
  print(f"  Neighborhood A: {cv_a:.1f}%")  # 2.2%
  print(f"  Neighborhood B: {cv_b:.1f}%")  # 23.0%

  # 3. Neighborhood A has much more consistent pricing (CV = 2.2% vs 23.0%)

  # 4. Budget analysis
  budget = 350

  affordable_a = [p for p in neighborhood_a if p <= budget]
  affordable_b = [p for p in neighborhood_b if p <= budget]

  print(f"\nWith ${budget}K budget:")
  print(f"  Neighborhood A: {len(affordable_a)}/{len(neighborhood_a)} homes affordable")
  print(f"  Neighborhood B: {len(affordable_b)}/{len(neighborhood_b)} homes affordable")
  # A: 10/10 homes, B: 5/10 homes

  # Recommendation: Focus on Neighborhood A - more options within budget
  # and predictable pricing makes comparison shopping easier
  ```
</details>
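
The Exercise 4 solution calls `np.std` with its default arguments, which applies the population formula (divide by n). If the ten listings are treated as a sample from a larger market, the sample formula (divide by n - 1) may be the better fit. A minimal sketch of the difference, reusing the Neighborhood A data:

```python
import numpy as np

neighborhood_a = [320, 335, 340, 328, 345, 332, 338, 342, 325, 330]

# Population formula: divide by n (NumPy's default, ddof=0)
pop_std = np.std(neighborhood_a)

# Sample formula: divide by n - 1 (Bessel's correction, ddof=1)
sample_std = np.std(neighborhood_a, ddof=1)

print(f"Population std: ${pop_std:.2f}K")   # $7.54K
print(f"Sample std:     ${sample_std:.2f}K")  # $7.95K
```

With only ten values the two results differ by about 5%; the gap shrinks as the dataset grows.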

***

## How This Connects to Machine Learning

Everything you just learned is foundational to ML:

| Descriptive Stat   | ML Application                                  |
| ------------------ | ----------------------------------------------- |
| Mean               | Used in normalization, calculating errors       |
| Variance           | Feature scaling, understanding data spread      |
| Standard deviation | Standardization (z-scores), batch normalization |
| Percentiles        | Handling outliers, creating features            |
| Distribution shape | Choosing the right model and loss function      |
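
Two rows of the table can be sketched in a few lines of NumPy. The example below is illustrative only: the `sizes` array is hypothetical data, and the 5th/95th percentile cutoffs are an arbitrary choice rather than a recommended default.

```python
import numpy as np

# Hypothetical feature values (house sizes in square feet)
sizes = np.array([1400.0, 1600.0, 1700.0, 1875.0, 2350.0])

# Standardization (z-scores): subtract the mean, divide by the standard deviation
z_scores = (sizes - sizes.mean()) / sizes.std()
print("Z-scores:", np.round(z_scores, 2))

# Percentile clipping: cap values that fall outside the 5th-95th percentile range
low, high = np.percentile(sizes, [5, 95])
clipped = np.clip(sizes, low, high)
print(f"Clip range: {low:.0f} to {high:.0f}")  # 1440 to 2255
print("Clipped:", clipped)
```

After standardization the feature has mean 0 and standard deviation 1, which keeps features measured in different units on a comparable scale.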

<Warning>
  **Statistical Mistake in ML -- Using Mean on Skewed Targets**: If your target variable (the thing you are predicting) is right-skewed -- like house prices, income, or time-to-event data -- training a regression model on the raw values with a squared-error loss lets a few expensive outliers dominate the fit. **The fix**: check skewness first. If the mean sits well above the median, try a log transform on the target before training, and convert predictions back to the original scale afterward. This often reduces error noticeably, though the size of the gain depends on the dataset.
</Warning>
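
A rough sketch of that check, under stated assumptions: the target here is synthetic log-normal data with a fixed seed (not real prices), and `skewness` is a small helper written for this example rather than a library function.

```python
import numpy as np

def skewness(values):
    """Moment-based skewness: about 0 for symmetric data, > 0 for a right tail."""
    centered = values - values.mean()
    return np.mean(centered**3) / values.std() ** 3

# Hypothetical right-skewed target (synthetic log-normal "prices"), fixed seed
rng = np.random.default_rng(seed=0)
target = rng.lognormal(mean=6.0, sigma=0.8, size=1000)

print(f"Raw: mean={target.mean():.0f}, median={np.median(target):.0f}, "
      f"skew={skewness(target):.2f}")

# log1p computes log(1 + x), so it also tolerates zeros in the target
log_target = np.log1p(target)
print(f"Log: mean={log_target.mean():.2f}, median={np.median(log_target):.2f}, "
      f"skew={skewness(log_target):.2f}")

# A model trained on log_target predicts in log space; map back with expm1
print(np.allclose(np.expm1(log_target), target))  # True
```

On data like this, the raw target shows a mean well above the median and a clearly positive skew, while the log-transformed target is close to symmetric.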

***

## Interview Prep: Common Questions

<Accordion title="Frequently Asked Interview Questions">
  **Q: When would you use median instead of mean?**

  > Use median when data has outliers or is heavily skewed. Classic examples: income data (billionaires skew the mean), house prices, response times (occasional timeouts).

  **Q: How do you detect outliers?**

  > Common methods: IQR method (1.5 × IQR beyond Q1/Q3), z-score method (beyond ±2 or ±3 standard deviations), visual inspection with box plots.

  **Q: What's the difference between population and sample variance?**

  > Population variance divides by n, sample variance divides by (n-1). Use n-1 for samples because it provides an unbiased estimate of population variance (Bessel's correction).

  **Q: A dataset has mean = median. What does this tell you?**

  > The distribution is likely symmetric (not skewed), though equal mean and median alone do not prove it. In a perfectly symmetric, unimodal distribution, mean = median = mode.
</Accordion>
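
The two outlier-detection methods from the answers above do not always agree. As an illustrative sketch, the code below applies both to the transaction amounts from the outlier exercise earlier on this page; the z-score cutoff of 3 is one common choice, not a fixed rule.

```python
import numpy as np

# Transaction amounts from the outlier exercise earlier on this page
transactions = np.array([25, 30, 28, 35, 32, 29, 500, 27, 31, 33, 28, 750, 26, 30])

# IQR method: flag values more than 1.5 x IQR outside the quartiles
q1, q3 = np.percentile(transactions, [25, 75])
iqr = q3 - q1
iqr_mask = (transactions < q1 - 1.5 * iqr) | (transactions > q3 + 1.5 * iqr)

# Z-score method: flag values more than 3 standard deviations from the mean
z = (transactions - transactions.mean()) / transactions.std()
z_mask = np.abs(z) > 3

print("IQR outliers:    ", transactions[iqr_mask].tolist())  # [500, 750]
print("Z-score outliers:", transactions[z_mask].tolist())    # []
print(f"Largest |z|: {np.abs(z).max():.2f}")                  # 2.97
```

The z-score method misses both values because the outliers themselves inflate the standard deviation to roughly \$214, so even the \$750 transaction sits just under three standard deviations from the mean. This masking effect is worth mentioning in an interview answer.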

***

## Common Pitfalls

<Warning>
  **Mistakes to Avoid**:

  1. **Using mean for skewed data** - Always check for outliers first; median is often more representative
  2. **Ignoring the spread** - Two datasets can have identical means but completely different distributions
  3. **Confusing variance units** - Variance is in squared units; use standard deviation for interpretable scale
  4. **Forgetting to visualize** - Statistics alone can be misleading (Anscombe's quartet is the classic example)
</Warning>
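
Pitfalls 2 and 3 are easy to see with numbers. A minimal sketch, using two hypothetical sets of delivery times invented for this example:

```python
import numpy as np

# Two hypothetical sets of delivery times in minutes, both with mean 50
steady = np.array([48, 49, 50, 51, 52])
erratic = np.array([10, 30, 50, 70, 90])

for name, data in [("steady", steady), ("erratic", erratic)]:
    print(f"{name:>8}: mean={data.mean():.1f} min, "
          f"variance={data.var():.1f} min^2, std={data.std():.2f} min")

# Output:
#   steady: mean=50.0 min, variance=2.0 min^2, std=1.41 min
#  erratic: mean=50.0 min, variance=800.0 min^2, std=28.28 min
```

The means are identical, but the spread differs by a factor of 20. The variance is reported in squared minutes, which is why the standard deviation is the easier number to interpret.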

***

## Key Takeaways

<Note>
  **What You Learned**:

  * ✅ **Mean** - Sum divided by count; sensitive to outliers
  * ✅ **Median** - Middle value; robust to outliers; use for skewed data
  * ✅ **Mode** - Most frequent value; useful for categorical data
  * ✅ **Variance & Std Dev** - Measure spread around the mean
  * ✅ **Percentiles & IQR** - Divide data into portions; detect outliers
  * ✅ **Z-scores** - Standardize values across different scales
</Note>

**Coming up next**: We'll learn about **probability** - how to quantify uncertainty and make predictions. This is where statistics becomes truly powerful!

<Card title="Next: Probability Foundations" icon="arrow-right" href="/courses/statistics-for-ml/03-probability">
  Learn to quantify uncertainty and make predictions
</Card>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="You are analyzing daily revenue at an e-commerce company. The mean is $85K but the median is $42K. What is happening and what do you report to the VP?">
    **Strong Answer:**

    * The large gap between mean (\$85K) and median (\$42K) tells me the distribution is heavily right-skewed. A small number of very high-revenue days -- likely driven by flash sales, holiday events, or a few massive B2B orders -- are pulling the mean upward. On a "typical" day, the company makes closer to \$42K.
    * I would report both numbers to the VP, but frame it carefully: "On a normal day, we generate about \$42K in revenue. However, our average is higher at \$85K because we have occasional spike days that significantly boost the total. If you are planning staffing and operations around daily expectations, use the median. If you are forecasting monthly totals, the mean times 30 gives a better estimate."
    * I would also present the P90 and P99 to show how extreme the spike days are, and potentially a histogram showing the bimodal or long-tail shape. Stakeholders make better decisions when they understand the shape, not just a single number.
    * The key risk: if someone uses the \$85K mean for daily budgeting, they will overspend on most days and then scramble during the rare high days. Conversely, if they use only the median, they will underestimate total monthly revenue.

    **Follow-up: How would you detect whether the spike days are periodic (like weekends or holidays) versus random?**

    I would decompose the time series by day of week and month to check for seasonal patterns. A simple groupby on day-of-week showing that Saturday revenue is 3x the weekday median would confirm a weekly cycle. For holiday effects, I would flag known retail events (Black Friday, Prime Day) and compare flagged versus unflagged days. If the spikes are periodic, you can model them with seasonal adjustments. If they are random (driven by viral social media posts or unpredictable B2B orders), then you need a different forecasting approach that accounts for heavy-tailed distributions -- perhaps a log-normal model rather than a normal one.
  </Accordion>

  <Accordion title="When would you use standard deviation versus interquartile range, and why does it matter in practice?">
    **Strong Answer:**

    * Standard deviation is most informative when your data is roughly symmetric and free of extreme outliers. It uses every data point including the tails, so a single extreme value can inflate it dramatically. It is the right choice when your data is approximately normal -- test scores, manufacturing measurements, or sensor readings from a calibrated instrument.
    * IQR (the range between the 25th and 75th percentiles) is robust to outliers because it only looks at the middle 50% of the data. It is the right choice for skewed or contaminated data -- income distributions, transaction amounts, page load times, or any dataset where you suspect data quality issues at the extremes.
    * In practice, the choice has real consequences. If you use standard deviation for fraud detection thresholds on transaction amounts (which are heavily right-skewed), the outliers inflate the std dev so much that your "anomaly threshold" becomes absurdly high and you miss actual fraud. Using IQR-based thresholds (like the 1.5 x IQR rule) gives much more practical detection boundaries.
    * A concrete example: at a payments company, transaction amounts might have mean \$50, std dev \$500 (because of a few \$10K wire transfers). A "mean plus 3 sigma" threshold would be \$1,550, which misses all the moderately fraudulent \$200-\$300 transactions. An IQR-based approach with Q3 around \$80 and an IQR around \$30 would put the upper fence at \$80 + 1.5 x \$30 = \$125 and flag anything above it as worth investigating.

    **Follow-up: You mentioned the 1.5 x IQR rule. Where does that 1.5 come from, and when would you adjust it?**

    The 1.5 multiplier was introduced by John Tukey for box plots and corresponds roughly to the boundaries that would capture about 99.3% of a normal distribution. For normal data, Q1 minus 1.5 x IQR and Q3 plus 1.5 x IQR align approximately with mean plus or minus 2.7 standard deviations. You would adjust the multiplier based on your tolerance for false positives: use 3.0 x IQR for "extreme outliers" when you want very high confidence, or drop to 1.0 if you want a more aggressive filter. In fraud detection, you often tune this multiplier empirically against labeled fraud data to optimize the precision-recall tradeoff for your specific domain.
  </Accordion>

  <Accordion title="Explain Anscombe's quartet and what it teaches us about relying solely on summary statistics.">
    **Strong Answer:**

    * Anscombe's quartet is a set of four datasets that have nearly identical summary statistics -- same mean of x, same mean of y, same variance, same correlation coefficient (r approximately 0.816), and the same regression line -- yet look completely different when plotted. One is a normal linear relationship, one is a perfect curve, one is a perfect line with one outlier, and one has all points at one x-value except for a single extreme point.
    * The lesson is devastating for anyone who relies on summary statistics alone: the numbers can lie. Two datasets with identical means, variances, and correlations can have fundamentally different structures, and any model or decision built on those numbers without visual inspection could be wildly wrong.
    * In practice, this means every analysis should start with visualization. Before computing a single correlation or fitting a regression, plot a scatterplot, a histogram, or a residual plot. At companies with large data pipelines, I have seen teams ship models based on correlation matrices without ever looking at the actual data shape -- and later discover non-linear relationships, clustering, or data artifacts that invalidated their approach.
    * The modern extension is the Datasaurus Dozen, which shows the same idea with 13 datasets (including one shaped like a dinosaur) that all share the same summary statistics. It reinforces that summary stats are a lossy compression of your data -- useful for communication, but dangerous as the sole basis for decisions.

    **Follow-up: In a large-scale ML pipeline where you cannot manually visualize every feature pair, how do you catch these kinds of issues?**

    You build automated data profiling into the pipeline. Tools like ydata-profiling (formerly pandas-profiling), Great Expectations, or custom checks can flag possible non-linearity (by comparing Pearson versus Spearman correlations -- a large gap suggests a monotonic but non-linear relationship or the influence of outliers), detect bimodality (using Hartigan's dip test or kernel density estimation), and identify influential outliers (using Cook's distance). You also set up distribution dashboards that sample and plot key feature pairs on a rotating basis. The goal is not to inspect every combination manually but to have automated red flags that trigger human review when something looks off.
  </Accordion>
</AccordionGroup>
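
The figures quoted in the Tukey-fence follow-up above (about 2.7 standard deviations and about 99.3% coverage) can be reproduced with Python's standard library. This is a quick check for a standard normal distribution only; real data will rarely match it exactly.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Quartiles of the standard normal are symmetric around 0
q3 = standard_normal.inv_cdf(0.75)  # about 0.674
q1 = -q3
iqr = q3 - q1

# Tukey fences, expressed in standard deviations from the mean
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Share of a normal distribution that falls between the two fences
coverage = standard_normal.cdf(upper_fence) - standard_normal.cdf(lower_fence)

print(f"Fences: +/- {upper_fence:.2f} standard deviations")  # 2.70
print(f"Coverage: {coverage:.1%}")                            # 99.3%
```

In other words, on perfectly normal data the 1.5 x IQR rule would flag roughly 0.7% of points as outliers even though nothing is wrong with them.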
