
# Describing Data: What's Normal?

## The House Hunting Problem

You're moving to Austin, Texas. You have a budget of \$500,000 and want to know: **Is that enough for a decent 3-bedroom house?**

You could look at one listing, but that's just one data point. You need to understand the **whole picture**.

Let's load some real data:

```python
import numpy as np
import pandas as pd

# House prices in Austin (3-bedroom homes, in thousands)
prices = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472,
    1250,  # A mansion somehow in the dataset
    448, 492, 418, 485, 455, 508, 428, 475, 495, 462
])

print(f"Number of houses: {len(prices)}")
print(f"Cheapest: ${min(prices)}K")
print(f"Most expensive: ${max(prices)}K")
```

**Output:**

```
Number of houses: 41
Cheapest: $389K
Most expensive: $1250K
```

The range is \$389K to \$1250K. But that doesn't tell us what's "typical". We need better tools.

***

## Measures of Central Tendency: "What's Typical?"

### The Mean (Average)

The **mean** is what most people think of as "average" - add everything up and divide by the count.

**Analogy**: Think of the mean as the balance point of a seesaw. If you placed each data point as a weight along a beam, the mean is where you would put the fulcrum to make it balance perfectly. One very heavy weight far from center (an outlier) can shift the balance point dramatically.

*[Figure: Mean formula visualization]*

The mathematical formula:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}
$$

```python
def calculate_mean(data):
    """Calculate the arithmetic mean."""
    return sum(data) / len(data)

mean_price = calculate_mean(prices)
print(f"Mean price: ${mean_price:.1f}K")

# Or use NumPy
print(f"Mean (NumPy): ${np.mean(prices):.1f}K")
```

**Output:**

```
Mean price: $483.6K
Mean (NumPy): $483.6K
```

Wait... \$483.6K? Most houses cluster around \$465K-\$470K, but the mean is noticeably higher. What's happening?
*[Figure: Mean in the real world - the mansion pulling the average up]*

**The Problem with the Mean**: That \$1.25M mansion is pulling the average up! The mean is **sensitive to outliers**.

***

### The Median (Middle Value)

The **median** is the middle value when you sort the data. Half the values are above, half below.

```python
def calculate_median(data):
    """Calculate the median."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    mid = n // 2

    if n % 2 == 0:
        # Even number of elements: average the two middle values
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    else:
        # Odd number of elements: take the single middle value
        return sorted_data[mid]

median_price = calculate_median(prices)
print(f"Median price: ${median_price:.1f}K")

# Or use NumPy
print(f"Median (NumPy): ${np.median(prices):.1f}K")
```

**Output:**

```
Median price: $472.0K
Median (NumPy): $472.0K
```

**The median is \$472K** - much more representative of a "typical" house! The mansion barely affects it because it counts as just one value above the middle, no matter how expensive it is.

**When to use Mean vs Median?**

| Use Mean When                   | Use Median When          |
| ------------------------------- | ------------------------ |
| Data is symmetric               | Data has outliers        |
| No extreme values               | Income/wealth data       |
| You want total divided by count | You want "typical" value |
| Example: Test scores            | Example: House prices    |

***

### The Mode (Most Common Value)

The **mode** is the value that appears most frequently. Less useful for continuous data, but great for categories.
```python
from collections import Counter

def calculate_mode(data):
    """Find the most common value(s)."""
    counts = Counter(data)
    max_count = max(counts.values())
    modes = [val for val, count in counts.items() if count == max_count]
    return modes

# For house prices, mode isn't very useful (almost every value is unique)
# But for bedrooms:
bedrooms = [3, 3, 4, 3, 2, 3, 4, 3, 3, 2, 3, 4, 3, 5, 3, 3, 2, 4, 3, 3]
mode_bedrooms = calculate_mode(bedrooms)
print(f"Most common bedroom count: {mode_bedrooms}")  # [3]
```

**Real-World Usage**:

* Most popular shirt size at a store
* Most common customer complaint
* Peak traffic hour

***

## Measures of Spread: "How Different Are Things?"

Knowing the center isn't enough. Consider these two neighborhoods:

```python
neighborhood_A = [450, 455, 448, 460, 452, 445, 458, 447, 453, 462]
neighborhood_B = [350, 550, 400, 500, 380, 520, 410, 490, 360, 540]

print(f"Neighborhood A - Mean: ${np.mean(neighborhood_A):.1f}K")
print(f"Neighborhood B - Mean: ${np.mean(neighborhood_B):.1f}K")
```

**Output:**

```
Neighborhood A - Mean: $453.0K
Neighborhood B - Mean: $450.0K
```

Almost the same mean! But look at the actual houses:

* **Neighborhood A**: All houses are between \$445K and \$462K (consistent)
* **Neighborhood B**: Houses range from \$350K to \$550K (huge variation)

We need to measure **spread**.

***

### Range (Simplest Measure)

```python
def calculate_range(data):
    return max(data) - min(data)

print(f"Range A: ${calculate_range(neighborhood_A)}K")
print(f"Range B: ${calculate_range(neighborhood_B)}K")
```

**Output:**

```
Range A: $17K
Range B: $200K
```

The range shows the difference, but it only uses two values and is sensitive to outliers.

***

### Variance: Average Squared Distance from Mean

**Variance** measures how far values typically are from the mean.

*[Figure: Variance formula visualization]*

**The Formula:**

$$
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

**Step by step:**

1. Find the mean
2. For each value, calculate its distance from the mean
3. Square each distance (makes every term positive and penalizes big deviations)
4. Average the squared distances

```python
def calculate_variance(data):
    """Calculate population variance."""
    mean = sum(data) / len(data)
    squared_diffs = [(x - mean) ** 2 for x in data]
    return sum(squared_diffs) / len(data)

var_A = calculate_variance(neighborhood_A)
var_B = calculate_variance(neighborhood_B)

print(f"Variance A: {var_A:.1f}")
print(f"Variance B: {var_B:.1f}")

# Using NumPy
print(f"Variance A (NumPy): {np.var(neighborhood_A):.1f}")
print(f"Variance B (NumPy): {np.var(neighborhood_B):.1f}")
```

**Output:**

```
Variance A: 29.4
Variance B: 5420.0
Variance A (NumPy): 29.4
Variance B (NumPy): 5420.0
```

Neighborhood B has **about 184x more variance** than A!

*[Figure: Variance in the real world - neighborhood comparison]*
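To make the four variance steps concrete, here is a small sketch that prints each intermediate quantity for Neighborhood A. It reuses the `neighborhood_A` values from above; the other variable names are only illustrative. It also shows why step 3 squares the distances: the raw deviations always sum to zero, so averaging them directly would tell you nothing.

```python
neighborhood_A = [450, 455, 448, 460, 452, 445, 458, 447, 453, 462]

# Step 1: find the mean
mean_a = sum(neighborhood_A) / len(neighborhood_A)

# Step 2: distance of each value from the mean (can be negative)
deviations = [x - mean_a for x in neighborhood_A]

# Step 3: square each distance so negatives cannot cancel positives
squared = [d ** 2 for d in deviations]

# Step 4: average the squared distances (population variance, divide by n)
variance_a = sum(squared) / len(squared)

print(f"Mean: {mean_a:.1f}")
print(f"Deviations: {deviations}")
print(f"Sum of deviations: {sum(deviations):.1f}")
print(f"Sum of squared deviations: {sum(squared):.1f}")
print(f"Variance: {variance_a:.1f}")
```

The squared deviations add up to 294, and dividing by the 10 houses gives the 29.4 reported above.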

***

### Standard Deviation: Variance in Original Units

Variance is in "squared dollars" which is hard to interpret. **Standard deviation** brings us back to dollars.

$$
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

```python
def calculate_std(data):
    """Calculate population standard deviation."""
    return np.sqrt(calculate_variance(data))

std_A = calculate_std(neighborhood_A)
std_B = calculate_std(neighborhood_B)

print(f"Std Dev A: ${std_A:.1f}K")
print(f"Std Dev B: ${std_B:.1f}K")
```

**Output:**

```
Std Dev A: $5.4K
Std Dev B: $73.6K
```

**Interpretation**:

* Neighborhood A: Houses are typically within plus or minus \$5.4K of the mean
* Neighborhood B: Houses are typically within plus or minus \$73.6K of the mean

**Analogy**: Standard deviation is like the "typical commute distance" of data points from their home (the mean). In Neighborhood A, every data point lives close to the mean -- a short commute. In Neighborhood B, data points are scattered far and wide.

**ML Application -- Feature Scaling**: Standard deviation is the foundation of **standardization** (z-score normalization), one of the most critical preprocessing steps in ML. When you run `StandardScaler()` in scikit-learn, it subtracts each feature's mean and divides by its standard deviation so all features have comparable scales. Skip this step with models trained by gradient descent or with SVMs, and the features with larger scales will dominate learning -- a classic beginner mistake that produces mysteriously poor models.

**Sample vs Population**: When working with a sample (not the entire population), we divide by (n-1) instead of n for variance. This is called **Bessel's correction**.

```python
# Population variance (you have ALL the data)
np.var(data, ddof=0)  # ddof=0 is NumPy's default

# Sample variance (you have a sample from a larger population)
np.var(data, ddof=1)  # ddof=1 is the default for pandas' .var() and .std()
```

***

## Percentiles and Quartiles: "Where Does This Value Rank?"
Going back to our Austin house prices - is \$500K expensive or affordable?

**Percentiles** tell you what percentage of values fall below a given number.

```python
# Remove the mansion outlier for cleaner analysis
prices_clean = prices[prices < 1000]

# Calculate percentiles
p25 = np.percentile(prices_clean, 25)
p50 = np.percentile(prices_clean, 50)  # Same as median!
p75 = np.percentile(prices_clean, 75)
p90 = np.percentile(prices_clean, 90)

print(f"25th percentile: ${p25:.1f}K")
print(f"50th percentile (median): ${p50:.1f}K")
print(f"75th percentile: ${p75:.1f}K")
print(f"90th percentile: ${p90:.1f}K")
```

**Output:**

```
25th percentile: $440.2K
50th percentile (median): $470.0K
75th percentile: $495.0K
90th percentile: $510.2K
```

(The exact 25th percentile is \$440.25K; NumPy interpolates linearly between neighboring values by default.)

**Your \$500K budget puts you at about the 80th percentile** - you can afford 32 of the 40 houses in this area!

### The Interquartile Range (IQR)

The **IQR** is the range of the middle 50% of data:

$$
IQR = Q3 - Q1 = P_{75} - P_{25}
$$

```python
iqr = p75 - p25
print(f"IQR: ${iqr:.1f}K")

# Houses outside 1.5*IQR from quartiles are often considered outliers
lower_fence = p25 - 1.5 * iqr
upper_fence = p75 + 1.5 * iqr
print(f"Outlier thresholds: ${lower_fence:.1f}K - ${upper_fence:.1f}K")
```

**Output:**

```
IQR: $54.8K
Outlier thresholds: $358.1K - $577.1K
```

That \$1.25M mansion is definitely an outlier!

***

## Visualizing Data: See the Distribution

Numbers are great, but our brains understand pictures better.
### Box Plot (Box-and-Whisker)

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# With outlier
axes[0].boxplot(prices, vert=True)
axes[0].set_title('House Prices (with $1.25M mansion)')
axes[0].set_ylabel('Price ($K)')

# Without outlier
axes[1].boxplot(prices_clean, vert=True)
axes[1].set_title('House Prices (mansion removed)')
axes[1].set_ylabel('Price ($K)')

plt.tight_layout()
plt.show()
```

### Histogram

```python
plt.figure(figsize=(10, 5))
plt.hist(prices_clean, bins=15, edgecolor='black', alpha=0.7)
plt.axvline(np.mean(prices_clean), color='red', linestyle='--',
            label=f'Mean: ${np.mean(prices_clean):.0f}K')
plt.axvline(np.median(prices_clean), color='green', linestyle='--',
            label=f'Median: ${np.median(prices_clean):.0f}K')
plt.xlabel('Price ($K)')
plt.ylabel('Number of Houses')
plt.title('Distribution of House Prices in Austin')
plt.legend()
plt.show()
```

***

## Complete Summary Statistics

Here's a function that gives you the full picture:

```python
def describe_data(data, name="Data"):
    """Generate comprehensive summary statistics."""
    stats = {
        'Count': len(data),
        'Mean': np.mean(data),
        'Median': np.median(data),
        'Std Dev': np.std(data),
        'Variance': np.var(data),
        'Min': np.min(data),
        '25%': np.percentile(data, 25),
        '50%': np.percentile(data, 50),
        '75%': np.percentile(data, 75),
        'Max': np.max(data),
        'Range': np.max(data) - np.min(data),
        'IQR': np.percentile(data, 75) - np.percentile(data, 25)
    }

    print(f"\n{'='*40}")
    print(f"Summary Statistics: {name}")
    print(f"{'='*40}")
    for key, value in stats.items():
        print(f"{key:12}: {value:>12.2f}")

    return stats

# Use it!
describe_data(prices_clean, "Austin House Prices ($K)")
```

**Output:**

```
========================================
Summary Statistics: Austin House Prices ($K)
========================================
Count       :        40.00
Mean        :       464.48
Median      :       470.00
Std Dev     :        38.57
Variance    :      1487.35
Min         :       389.00
25%         :       440.25
50%         :       470.00
75%         :       495.00
Max         :       528.00
Range       :       139.00
IQR         :        54.75
```

***

## 🎯 Practice Exercises

### Exercise 1: Salary Analysis

```python
# Tech company salaries (in thousands)
salaries = np.array([
    75, 82, 78, 95, 88, 72, 105, 92, 85, 79,
    110, 125, 88, 95, 82, 450,  # CEO salary!
    78, 92, 85, 102, 88, 95, 82, 79, 105
])

# TODO: Calculate mean and median
# TODO: Which one better represents "typical" salary?
# TODO: Calculate standard deviation
# TODO: Identify outliers using IQR method
```

**Solution**:

```python
mean_salary = np.mean(salaries)
median_salary = np.median(salaries)

print(f"Mean: ${mean_salary:.1f}K")
print(f"Median: ${median_salary:.1f}K")

# Median is better - CEO salary inflates mean
# Typical employee makes ~$88K, not $104K

std_salary = np.std(salaries)
print(f"Std Dev: ${std_salary:.1f}K")

# IQR outlier detection
q1 = np.percentile(salaries, 25)
q3 = np.percentile(salaries, 75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(f"Outliers: {outliers}")  # [125 450] - a senior salary and the CEO
```

### Exercise 2: Test Score Comparison

```python
# Two classes took the same test
class_A = [72, 75, 78, 80, 82, 85, 88, 90, 92, 95]
class_B = [65, 70, 78, 82, 83, 84, 85, 88, 95, 100]

# TODO: Which class performed better on average?
# TODO: Which class was more consistent?
# TODO: If you had to bet on a random student getting 80+, which class?
```

**Solution**:

```python
print(f"Class A - Mean: {np.mean(class_A):.1f}, Std: {np.std(class_A):.1f}")
print(f"Class B - Mean: {np.mean(class_B):.1f}, Std: {np.std(class_B):.1f}")

# Class A: Mean 83.7, Std 7.2
# Class B: Mean 83.0, Std 9.9

# Class A performed slightly better on average
# Class A was more consistent (lower std dev)

# For 80+ bet:
a_above_80 = sum(1 for x in class_A if x >= 80) / len(class_A)
b_above_80 = sum(1 for x in class_B if x >= 80) / len(class_B)
print(f"Class A: {a_above_80:.0%} got 80+")
print(f"Class B: {b_above_80:.0%} got 80+")
# Both 70%, but Class A is a safer bet due to lower variance
```

***

## 🏠 Mini-Project: House Price Analyzer

Build a complete house price analysis tool!

```python
import numpy as np
import pandas as pd

# Extended Austin dataset
data = {
    'price': [425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
              512, 445, 468, 502, 389, 475, 498, 415, 528, 459],
    'sqft': [1800, 1600, 1950, 2400, 2100, 1700, 2300, 2000, 1750, 2150,
             2350, 1900, 2050, 2250, 1650, 2100, 2200, 1850, 2450, 2000],
    'bedrooms': [3, 3, 4, 4, 3, 3, 4, 3, 3, 4,
                 4, 3, 3, 4, 3, 3, 4, 3, 5, 3],
    'neighborhood': ['North', 'South', 'North', 'West', 'North',
                     'South', 'West', 'North', 'South', 'West',
                     'West', 'North', 'South', 'West', 'South',
                     'North', 'West', 'South', 'West', 'North']
}

houses = pd.DataFrame(data)

# YOUR TASKS:
# 1. Calculate summary statistics for price by neighborhood
# 2. Find price per square foot for each house
# 3. Which neighborhood has the most consistent prices?
# 4. Is there a relationship between bedrooms and price?
# 5. Your budget is $475K. What percentage of houses can you afford in each neighborhood?
```

**Solution**:

```python
import numpy as np
import pandas as pd

# ... (data from above)

# 1. Summary statistics by neighborhood
print("="*50)
print("PRICE STATISTICS BY NEIGHBORHOOD")
print("="*50)

for hood in houses['neighborhood'].unique():
    subset = houses[houses['neighborhood'] == hood]['price']
    print(f"\n{hood}:")
    print(f"  Mean: ${subset.mean():.0f}K")
    print(f"  Median: ${subset.median():.0f}K")
    print(f"  Std: ${subset.std():.1f}K")  # pandas uses the sample std (ddof=1)
    print(f"  Count: {len(subset)}")

# 2. Price per square foot
houses['price_per_sqft'] = houses['price'] * 1000 / houses['sqft']
print("\n" + "="*50)
print("PRICE PER SQUARE FOOT")
print("="*50)
print(f"Mean: ${houses['price_per_sqft'].mean():.0f}/sqft")
print(f"Range: ${houses['price_per_sqft'].min():.0f} - ${houses['price_per_sqft'].max():.0f}/sqft")

# 3. Most consistent neighborhood (lowest std dev)
consistency = houses.groupby('neighborhood')['price'].std()
print("\n" + "="*50)
print("PRICE CONSISTENCY (Std Dev)")
print("="*50)
print(consistency.sort_values())
print(f"\nMost consistent: {consistency.idxmin()} (${consistency.min():.1f}K std)")

# 4. Bedrooms vs Price
bedroom_analysis = houses.groupby('bedrooms')['price'].agg(['mean', 'count'])
print("\n" + "="*50)
print("BEDROOMS VS PRICE")
print("="*50)
print(bedroom_analysis)

# 5. Affordability with $475K budget
budget = 475
print("\n" + "="*50)
print(f"AFFORDABILITY WITH ${budget}K BUDGET")
print("="*50)

for hood in houses['neighborhood'].unique():
    subset = houses[houses['neighborhood'] == hood]['price']
    affordable = (subset <= budget).sum() / len(subset) * 100
    print(f"{hood}: {affordable:.0f}% of houses affordable")
```

**Output** (abridged: sections 2 and 4 omitted, standard deviations rounded to two decimals):

```
==================================================
PRICE STATISTICS BY NEIGHBORHOOD
==================================================

North:
  Mean: $456K
  Median: $459K
  Std: $18.7K
  Count: 7

South:
  Mean: $409K
  Median: $396K
  Std: $30.4K
  Count: 6

West:
  Mean: $508K
  Median: $510K
  Std: $14.3K
  Count: 7

==================================================
PRICE CONSISTENCY (Std Dev)
==================================================
neighborhood
West     14.31
North    18.67
South    30.44
Name: price, dtype: float64

Most consistent: West ($14.3K std)

==================================================
AFFORDABILITY WITH $475K BUDGET
==================================================
North: 86% of houses affordable
South: 100% of houses affordable
West: 0% of houses affordable
```

***

## Key Takeaways

* **Mean**: Add and divide. Sensitive to outliers.
* **Median**: Middle value. Robust to outliers.
* **Mode**: Most common. Great for categories.
* **Range**: Max - Min. Simple but limited.
* **Variance**: Average squared distance from mean.
* **Std Dev**: Square root of variance. Same units as data.
* **Percentiles**: What % of values fall below this?
* **Quartiles**: 25th, 50th, 75th percentiles.
* **IQR**: Range of middle 50%. Good for outlier detection.

Which summary to report:

* **Symmetric data**: Mean + Std Dev
* **Skewed data**: Median + IQR
* **Outliers present**: Always check both!

***

## Common Mistakes to Avoid

**Mistake 1: Always Using the Mean**

The mean can be heavily influenced by outliers. For salary data, housing prices, or any skewed distribution, the median is often more representative.

**Example**: In a company where 9 employees earn \$50K and the CEO earns \$5M, the mean salary is \$545K - wildly misleading!
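The arithmetic in that example is quick to check. This is a minimal sketch using Python's standard `statistics` module; the list name `salaries_k` is just an illustrative choice.

```python
from statistics import mean, median

# 9 employees at $50K plus a CEO at $5M (values in thousands of dollars)
salaries_k = [50] * 9 + [5000]

print(f"Mean:   ${mean(salaries_k):.0f}K")    # pulled up by a single salary
print(f"Median: ${median(salaries_k):.0f}K")  # what a typical employee earns
```

The mean comes out at \$545K while the median stays at \$50K, which is the figure nine out of ten employees actually earn.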
**Mistake 2: Ignoring Units**

Variance is in squared units, which can be hard to interpret. Standard deviation is in the original units, making it much more practical.

**Example**: A variance of 10,000 dollars² is hard to understand. A std dev of \$100 is clear.

**Mistake 3: Comparing Std Devs Across Different Scales**

A std dev of \$10K for house prices and a std dev of \$10 for groceries aren't directly comparable. Use the coefficient of variation (CV = std/mean) to compare relative variability.

***

## Interview Questions

**Question**: You're analyzing user session lengths. The mean is 45 minutes, but the median is only 8 minutes. What does this tell you about the distribution?

**Answer**: This indicates a heavily right-skewed distribution with outliers. Most users have short sessions (around 8 minutes), but some power users have very long sessions that pull the mean way up. The median is more representative of the "typical" user experience.

**Question**: You have a dataset of daily ad revenue. How would you identify outliers?

**Answer**: Use the IQR method:

1. Calculate Q1 (25th percentile) and Q3 (75th percentile)
2. Calculate IQR = Q3 - Q1
3. Outliers are values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR

Alternatively, use z-scores: values more than 3 standard deviations from the mean are typically considered outliers.

**Question**: You're comparing two delivery drivers. Driver A has mean delivery time of 30 min (std dev 2 min). Driver B has mean 32 min (std dev 8 min). Which driver would you prefer?

**Answer**: Driver A. They are slightly faster on average, but the low variance is the key differentiator. Driver A is consistently fast (roughly 28-32 min, within one standard deviation of the mean), while Driver B is unpredictable (roughly 24-40 min). For customer satisfaction and logistics planning, consistency often matters more than a slightly faster mean.

**Question**: We measure page load times. The mean is 2.5 seconds, but the 99th percentile is 15 seconds. What action might you take?
**Answer**: The 99th percentile (P99) being 6x the mean suggests there's a "long tail" of slow experiences. Even though 99% of users have decent load times, 1% are having a terrible experience. For a company with millions of users, that's a lot of frustrated customers. Focus on identifying what causes these edge cases - geographic regions, specific devices, or server issues.

***

## Practice Challenge

You're given website session data. Analyze it completely:

```python
import numpy as np

np.random.seed(42)

# Simulate session durations (in seconds)
# Mix of quick bouncers and engaged users
short_sessions = np.random.exponential(30, size=800)   # Most users leave quickly
long_sessions = np.random.normal(600, 120, size=200)   # Engaged users stay ~10 min
sessions = np.concatenate([short_sessions, long_sessions])

# Your tasks:
# 1. Calculate mean, median, std dev
# 2. Identify which measure best represents "typical" session
# 3. Find the 10th, 50th, and 90th percentiles
# 4. Identify outliers using the IQR method
# 5. What story does this data tell about user behavior?

# Write your analysis here:
```

**Solution**:

```python
# 1. Basic statistics
print(f"Mean: {np.mean(sessions):.1f} seconds")
print(f"Median: {np.median(sessions):.1f} seconds")
print(f"Std Dev: {np.std(sessions):.1f} seconds")

# 2. The median (roughly 30 sec in this simulation) is more representative
#    because the mean is pulled far upward by the engaged user segment

# 3. Percentiles
print(f"P10: {np.percentile(sessions, 10):.1f} seconds")
print(f"P50: {np.percentile(sessions, 50):.1f} seconds")
print(f"P90: {np.percentile(sessions, 90):.1f} seconds")

# 4. Outliers using IQR
Q1 = np.percentile(sessions, 25)
Q3 = np.percentile(sessions, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = sessions[(sessions < lower_bound) | (sessions > upper_bound)]
print(f"Outliers: {len(outliers)} values")

# 5. Story: Two distinct user groups!
#    - 80% are "bouncers" with very short sessions
#    - 20% are "engaged users" with ~10 minute sessions
#    - This bimodal distribution suggests we should analyze
#      these groups separately
```

***

## 📝 Practice Exercises

**Exercise 1: Employee Salary Analysis** - Calculate mean, median, and standard deviation

**Problem**: A company has the following employee salaries (in thousands): `[45, 52, 48, 95, 51, 49, 47, 250, 53, 50]`

1. Calculate the mean salary
2. Calculate the median salary
3. Calculate the standard deviation
4. Which measure (mean or median) better represents the "typical" salary? Why?

**Solution**:

```python
import numpy as np

salaries = [45, 52, 48, 95, 51, 49, 47, 250, 53, 50]

mean_salary = np.mean(salaries)
median_salary = np.median(salaries)
std_salary = np.std(salaries)  # population std (ddof=0)

print(f"Mean salary: ${mean_salary:.2f}K")          # $74.00K
print(f"Median salary: ${median_salary:.2f}K")      # $50.50K
print(f"Standard deviation: ${std_salary:.2f}K")    # $60.26K

# The median ($50.5K) better represents the typical salary because
# the mean is heavily influenced by the highest salary ($250K).
# Most employees earn around $50K, not $74K.
```
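The solution above uses `np.std`, which divides by n by default. If these ten salaries were a sample drawn from a larger company, Bessel's correction (dividing by n-1) would apply. The sketch below compares the two on the same data; the variable names are illustrative.

```python
import numpy as np

salaries = [45, 52, 48, 95, 51, 49, 47, 250, 53, 50]

# Population standard deviation: divide by n (NumPy's default, ddof=0)
pop_std = np.std(salaries)

# Sample standard deviation: divide by n - 1 (Bessel's correction, ddof=1)
sample_std = np.std(salaries, ddof=1)

print(f"Population std (ddof=0): ${pop_std:.2f}K")
print(f"Sample std (ddof=1):     ${sample_std:.2f}K")
print(f"Ratio sample/population: {sample_std / pop_std:.3f}")
```

With only 10 values the sample figure is about 5% larger (roughly \$63.5K versus \$60.3K); the gap shrinks as n grows. Keep in mind that pandas' `.std()` uses `ddof=1` by default, so the same data can give slightly different answers in NumPy and pandas.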

**Exercise 2: Website Performance Analysis** - Analyze load time percentiles

**Problem**: You have 1000 page load times (in seconds). The data has:

* Mean: 2.3 seconds
* P50 (median): 1.8 seconds
* P90: 4.5 seconds
* P99: 12.0 seconds

1. What does the difference between mean and median tell you?
2. If you set an SLA at P90, what percentage of users experience worse performance?
3. Calculate the ratio of P99 to P50. What does this indicate?

**Solution**:

```python
import numpy as np

# Given statistics
mean = 2.3
p50 = 1.8
p90 = 4.5
p99 = 12.0

# 1. Mean > Median indicates right-skewed distribution
# There are some very slow page loads pulling the mean up
print("Mean > Median: Right-skewed distribution (long tail of slow requests)")

# 2. P90 SLA
print(f"With P90 SLA at {p90}s, 10% of users (100 out of 1000) experience worse performance")

# 3. P99/P50 ratio
ratio = p99 / p50
print(f"P99/P50 ratio: {ratio:.1f}x")
# A ratio of 6.67x means the slowest 1% of requests are nearly 7x slower than typical
# This indicates a "long tail" problem requiring investigation

# Simulate similar data
np.random.seed(42)
load_times = np.concatenate([
    np.random.exponential(1.5, 900),  # Normal requests
    np.random.exponential(8, 100)     # Slow requests
])
print(f"\nSimulated P50: {np.percentile(load_times, 50):.2f}s")
print(f"Simulated P99: {np.percentile(load_times, 99):.2f}s")
```

**Exercise 3: Outlier Detection in Transactions** - Use IQR method

**Problem**: An e-commerce platform has the following transaction amounts: `[25, 30, 28, 35, 32, 29, 500, 27, 31, 33, 28, 750, 26, 30]`

1. Calculate Q1, Q3, and IQR
2. Determine the outlier boundaries using the 1.5×IQR rule
3. Identify which transactions are outliers
4. What might these outliers represent in real life?

**Solution**:

```python
import numpy as np

transactions = [25, 30, 28, 35, 32, 29, 500, 27, 31, 33, 28, 750, 26, 30]

# 1. Calculate quartiles
Q1 = np.percentile(transactions, 25)
Q3 = np.percentile(transactions, 75)
IQR = Q3 - Q1

print(f"Q1: ${Q1:.2f}")    # $28.00
print(f"Q3: ${Q3:.2f}")    # $32.75
print(f"IQR: ${IQR:.2f}")  # $4.75

# 2. Outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"\nLower boundary: ${lower_bound:.2f}")  # $20.88
print(f"Upper boundary: ${upper_bound:.2f}")    # $39.88

# 3. Identify outliers
outliers = [x for x in transactions if x < lower_bound or x > upper_bound]
print(f"\nOutliers: {outliers}")  # [500, 750]

# 4. Real-world interpretation
print("\nPossible explanations for outliers:")
print("- Bulk/wholesale purchases")
print("- Potential fraud requiring investigation")
print("- High-value customers (VIP segment)")
print("- Data entry errors (extra zeros)")
```
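It is instructive to run the z-score method mentioned earlier on the same transactions. The sketch below applies both rules side by side; the threshold of 3 standard deviations is the conventional choice and the variable names are illustrative.

```python
import numpy as np

transactions = np.array([25, 30, 28, 35, 32, 29, 500, 27, 31, 33, 28, 750, 26, 30])

# IQR rule: fences are built from the middle 50% of the data only
q1, q3 = np.percentile(transactions, [25, 75])
iqr = q3 - q1
iqr_mask = (transactions < q1 - 1.5 * iqr) | (transactions > q3 + 1.5 * iqr)
iqr_outliers = transactions[iqr_mask]

# Z-score rule: mean and std are computed from ALL values, outliers included
z_scores = (transactions - transactions.mean()) / transactions.std()
z_outliers = transactions[np.abs(z_scores) > 3]

print(f"IQR outliers:     {iqr_outliers}")
print(f"Z-score outliers: {z_outliers}")
print(f"Largest z-score:  {z_scores.max():.2f}")
```

On this data the IQR rule flags both 500 and 750, while the z-score rule flags nothing: the two large transactions inflate the standard deviation to over \$200, so even the \$750 purchase sits just under 3 standard deviations from the mean. This masking effect is why robust measures are usually preferred for small or heavily skewed datasets.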

**Exercise 4: Housing Market Analysis** - Real-world comprehensive statistics

**Problem**: You're analyzing home prices in two neighborhoods:

**Neighborhood A**: `[320, 335, 340, 328, 345, 332, 338, 342, 325, 330]` (in thousands)

**Neighborhood B**: `[280, 450, 310, 520, 290, 380, 300, 480, 275, 420]` (in thousands)

1. Calculate mean and standard deviation for both
2. Calculate the coefficient of variation (CV = std/mean × 100%)
3. Which neighborhood has more consistent pricing?
4. A buyer with \$350K budget - which neighborhood should they focus on?

**Solution**:

```python
import numpy as np

neighborhood_a = [320, 335, 340, 328, 345, 332, 338, 342, 325, 330]
neighborhood_b = [280, 450, 310, 520, 290, 380, 300, 480, 275, 420]

# 1. Calculate statistics
mean_a, std_a = np.mean(neighborhood_a), np.std(neighborhood_a)
mean_b, std_b = np.mean(neighborhood_b), np.std(neighborhood_b)

print("Neighborhood A:")
print(f"  Mean: ${mean_a:.1f}K, Std: ${std_a:.1f}K")
# Mean: $333.5K, Std: $7.5K

print("\nNeighborhood B:")
print(f"  Mean: ${mean_b:.1f}K, Std: ${std_b:.1f}K")
# Mean: $370.5K, Std: $87.0K

# 2. Coefficient of Variation
cv_a = (std_a / mean_a) * 100
cv_b = (std_b / mean_b) * 100

print(f"\nCoefficient of Variation:")
print(f"  Neighborhood A: {cv_a:.1f}%")  # 2.3%
print(f"  Neighborhood B: {cv_b:.1f}%")  # 23.5%

# 3. Neighborhood A has much more consistent pricing (CV = 2.3% vs 23.5%)

# 4. Budget analysis
budget = 350
affordable_a = [p for p in neighborhood_a if p <= budget]
affordable_b = [p for p in neighborhood_b if p <= budget]

print(f"\nWith ${budget}K budget:")
print(f"  Neighborhood A: {len(affordable_a)}/{len(neighborhood_a)} homes affordable")
print(f"  Neighborhood B: {len(affordable_b)}/{len(neighborhood_b)} homes affordable")
# A: 10/10 homes, B: 5/10 homes

# Recommendation: Focus on Neighborhood A - every listing is within budget
# and predictable pricing makes comparison shopping easier
```

***

## How This Connects to Machine Learning

Everything you just learned is foundational to ML:

| Descriptive Stat   | ML Application                                  |
| ------------------ | ----------------------------------------------- |
| Mean               | Used in normalization, calculating errors       |
| Variance           | Feature scaling, understanding data spread      |
| Standard deviation | Standardization (z-scores), batch normalization |
| Percentiles        | Handling outliers, creating features            |
| Distribution shape | Choosing the right model and loss function      |

**Statistical Mistake in ML -- Using Mean on Skewed Targets**: If your target variable (the thing you are predicting) is right-skewed -- like house prices, income, or time-to-event data -- training a regression model on the raw values causes the model to overweight expensive outliers. **The fix**: check skewness first. If mean and median diverge significantly, apply a log transform to the target before training. This single step can improve RMSE by 10-30% on some real datasets, though the gain depends on how skewed the target is.

***

## Interview Prep: Common Questions

**Q: When would you use median instead of mean?**

> Use median when data has outliers or is heavily skewed. Classic examples: income data (billionaires skew the mean), house prices, response times (occasional timeouts).

**Q: How do you detect outliers?**

> Common methods: IQR method (1.5 × IQR beyond Q1/Q3), z-score method (beyond ±2 or ±3 standard deviations), visual inspection with box plots.

**Q: What's the difference between population and sample variance?**

> Population variance divides by n, sample variance divides by (n-1). Use n-1 for samples because it provides an unbiased estimate of population variance (Bessel's correction).

**Q: A dataset has mean = median. What does this tell you?**

> The distribution is likely symmetric (not skewed). In a perfectly symmetric, single-peaked distribution, mean = median = mode.

***

## Common Pitfalls

**Mistakes to Avoid**:

1. **Using mean for skewed data** - Always check for outliers first; median is often more representative
2. **Ignoring the spread** - Two datasets can have identical means but completely different distributions
3. **Confusing variance units** - Variance is in squared units; use standard deviation for interpretable scale
4. **Forgetting to visualize** - Statistics alone can be misleading (Anscombe's quartet is the classic example)

***

## Key Takeaways

**What You Learned**:

* ✅ **Mean** - Sum divided by count; sensitive to outliers
* ✅ **Median** - Middle value; robust to outliers; use for skewed data
* ✅ **Mode** - Most frequent value; useful for categorical data
* ✅ **Variance & Std Dev** - Measure spread around the mean
* ✅ **Percentiles & IQR** - Divide data into portions; detect outliers
* ✅ **Z-scores** - Standardize values across different scales

**Coming up next**: We'll learn about **probability** - how to quantify uncertainty and make predictions. This is where statistics becomes truly powerful!

***

## Interview Deep-Dive

**Strong Answer:**

* The large gap between mean (\$85K) and median (\$42K) tells me the distribution is heavily right-skewed. A small number of very high-revenue days -- likely driven by flash sales, holiday events, or a few massive B2B orders -- are pulling the mean upward. On a "typical" day, the company makes closer to \$42K.
* I would report both numbers to the VP, but frame it carefully: "On a normal day, we generate about \$42K in revenue. However, our average is higher at \$85K because we have occasional spike days that significantly boost the total. If you are planning staffing and operations around daily expectations, use the median. If you are forecasting monthly totals, the mean times 30 gives a better estimate."
* I would also present the P90 and P99 to show how extreme the spike days are, and potentially a histogram showing the bimodal or long-tail shape.
Stakeholders make better decisions when they understand the shape, not just a single number.
* The key risk: if someone uses the \$85K mean for daily budgeting, they will overspend on most days and then scramble during the rare high days. Conversely, if they use only the median, they will underestimate total monthly revenue.

**Follow-up: How would you detect whether the spike days are periodic (like weekends or holidays) versus random?**

I would decompose the time series by day of week and month to check for seasonal patterns. A simple groupby on day-of-week showing that Saturday revenue is 3x the weekday median would confirm a weekly cycle. For holiday effects, I would flag known retail events (Black Friday, Prime Day) and compare flagged versus unflagged days. If the spikes are periodic, you can model them with seasonal adjustments. If they are random (driven by viral social media posts or unpredictable B2B orders), then you need a different forecasting approach that accounts for heavy-tailed distributions -- perhaps a log-normal model rather than a normal one.

**Strong Answer:**

* Standard deviation assumes your data is roughly symmetric and does not have extreme outliers. It uses every data point including the tails, so a single extreme value can inflate it dramatically. It is the right choice when your data is approximately normal -- test scores, manufacturing measurements, or sensor readings from a calibrated instrument.
* IQR (the range between the 25th and 75th percentiles) is robust to outliers because it only looks at the middle 50% of the data. It is the right choice for skewed or contaminated data -- income distributions, transaction amounts, page load times, or any dataset where you suspect data quality issues at the extremes.
* In practice, the choice has real consequences.
If you use standard deviation for fraud detection thresholds on transaction amounts (which are heavily right-skewed), the outliers inflate the std dev so much that your "anomaly threshold" becomes absurdly high and you miss actual fraud. Using IQR-based thresholds (like the 1.5 × IQR rule) gives much more practical detection boundaries.
* A concrete example: at a payments company, transaction amounts might have mean $50, std dev $500 (because of a few $10K wire transfers). A "mean plus 3 sigma" threshold would be $1,550, which misses all the moderately fraudulent $200-$300 transactions. An IQR-based approach with Q3 around $80 would flag anything above roughly $125 as worth investigating.

**Follow-up: You mentioned the 1.5 × IQR rule. Where does that 1.5 come from, and when would you adjust it?**

The 1.5 multiplier was introduced by John Tukey for box plots and corresponds roughly to the boundaries that would capture about 99.3% of a normal distribution. For normal data, Q1 minus 1.5 × IQR and Q3 plus 1.5 × IQR align approximately with mean plus or minus 2.7 standard deviations. You would adjust the multiplier based on your tolerance for false positives: use 3.0 × IQR for "extreme outliers" when you want very high confidence, or drop to 1.0 if you want a more aggressive filter. In fraud detection, you often tune this multiplier empirically against labeled fraud data to optimize the precision-recall tradeoff for your specific domain.

**Strong Answer:**

* Anscombe's quartet is a set of four datasets that have nearly identical summary statistics -- same mean of x, same mean of y, same variance, same correlation coefficient (r approximately 0.816), and the same regression line -- yet look completely different when plotted. One is a normal linear relationship, one is a perfect curve, one is a perfect line with one outlier, and one has all points at one x-value except for a single extreme point.
* The lesson is devastating for anyone who relies on summary statistics alone: the numbers can lie. Two datasets with identical means, variances, and correlations can have fundamentally different structures, and any model or decision built on those numbers without visual inspection could be wildly wrong.
* In practice, this means every analysis should start with visualization. Before computing a single correlation or fitting a regression, plot a scatterplot, a histogram, or a residual plot. At companies with large data pipelines, I have seen teams ship models based on correlation matrices without ever looking at the actual data shape -- and later discover non-linear relationships, clustering, or data artifacts that invalidated their approach.
* The modern extension is the Datasaurus Dozen, which shows the same idea with 13 datasets (including one shaped like a dinosaur) that all share the same summary statistics. It reinforces that summary stats are a lossy compression of your data -- useful for communication, but dangerous as the sole basis for decisions.

**Follow-up: In a large-scale ML pipeline where you cannot manually visualize every feature pair, how do you catch these kinds of issues?**

You build automated data profiling into the pipeline. Tools like pandas-profiling (since renamed ydata-profiling), Great Expectations, or custom checks can flag non-linearity (by comparing Pearson versus Spearman correlations -- a large gap between them suggests a non-linear relationship or influential outliers), detect bimodality (using the dip test or kernel density estimation), and identify influential outliers (using Cook's distance). You also set up distribution dashboards that sample and plot key feature pairs on a rotating basis. The goal is not to inspect every combination manually but to have automated red flags that trigger human review when something looks off.
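
The Pearson-versus-Spearman check mentioned above can be sketched in a few lines. This is a minimal illustration, not a production profiler: the function names (`rank_average`, `flag_nonlinear`) and the `gap=0.1` threshold are assumptions made up for this example, and Spearman is computed by hand with NumPy as the Pearson correlation of average ranks.

```python
import numpy as np


def rank_average(values):
    """Return 1-based ranks; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Average the ranks inside each group of equal values.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return (rank_sums / counts)[inverse]


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank_average(x), rank_average(y))


def flag_nonlinear(x, y, gap=0.1):
    """Hypothetical red-flag check: Pearson and Spearman differ by more than `gap`."""
    r, rho = pearson(x, y), spearman(x, y)
    return {"pearson": r, "spearman": rho, "flag": abs(rho - r) > gap}


# Illustrative synthetic features (not from the Austin dataset).
x = np.linspace(0, 5, 50)
linear_check = flag_nonlinear(x, 2 * x + 1)       # straight line
curved_check = flag_nonlinear(x, np.exp(2 * x))   # monotonic but strongly curved

print(linear_check)
print(curved_check)
```

The straight line should pass quietly, while the exponential feature should be flagged because its ranks agree perfectly even though a straight line fits it poorly. In a real pipeline the threshold would likely need tuning per dataset, and a flag would trigger a plot for human review rather than an automatic decision.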