It’s election night. With only 2% of votes counted, news networks are already predicting the winner with 95% confidence. How is that possible? They’ve only seen a tiny fraction of the votes! This is the power of statistical inference: the science of drawing conclusions about a large group (the population) by studying a small part of it (a sample).
The fundamental challenge: We want to know the parameter (the true value in the population), but we can only calculate the statistic (the value computed from our sample).
```python
import numpy as np

# Imagine this is the TRUE population (we normally don't know this!)
np.random.seed(42)
population = np.random.choice(['A', 'B'], size=10_000_000, p=[0.52, 0.48])
true_proportion = np.mean(population == 'A')
print(f"TRUE population proportion for A: {true_proportion:.4f}")  # ~0.52

# But we can only survey 1000 people
sample = np.random.choice(population, size=1000, replace=False)
sample_proportion = np.mean(sample == 'A')
print(f"Sample proportion for A: {sample_proportion:.4f}")  # Varies!
```
The standard error measures how much sample statistics vary from sample to sample.

For a proportion: SE = √(p(1 - p) / n)

For a mean: SE = σ / √n
```python
def standard_error_proportion(p, n):
    """Standard error for a sample proportion."""
    return np.sqrt(p * (1 - p) / n)

def standard_error_mean(std_dev, n):
    """Standard error for a sample mean."""
    return std_dev / np.sqrt(n)

# Example: Poll with 52% for candidate A, n=1000
se = standard_error_proportion(0.52, 1000)
print(f"Standard error: {se:.4f}")  # 0.0158, or about 1.58%

# With larger sample
se_large = standard_error_proportion(0.52, 4000)
print(f"SE with n=4000: {se_large:.4f}")  # 0.0079, or about 0.79%
```
Key Insight: Standard error decreases with square root of sample size. To halve the error, you need 4x the sample size.
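To see the square-root law directly, here is a minimal sketch that recomputes the standard error for a few sample sizes (equivalent to calling standard_error_proportion above):

```python
import numpy as np

p = 0.52
for n in [1000, 2000, 4000, 16000]:
    se = np.sqrt(p * (1 - p) / n)
    print(f"n={n:>6}: SE = {se:.4f}")

# Doubling n shrinks the SE by sqrt(2) ≈ 1.41x;
# quadrupling n (1000 -> 4000) is what it takes to halve it.
```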
A 95% confidence interval does NOT mean “there is a 95% probability the true value is in this interval.” It means: if we repeated this process many times, 95% of the intervals we construct would contain the true value.
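The demo below calls a confidence_interval_proportion helper. If you don't already have it from earlier in the tutorial, a minimal version consistent with how it is used here (it returns the (low, high) interval and the margin of error) might look like this:

```python
import numpy as np
from scipy import stats

def confidence_interval_proportion(p_hat, n, confidence=0.95):
    """Normal-approximation CI for a proportion.

    Returns ((low, high), margin_of_error).
    """
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # ~1.96 for 95%
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se
    return (p_hat - margin, p_hat + margin), margin
```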
```python
# Demonstrate: Create 100 confidence intervals
intervals_containing_truth = 0
true_p = 0.52  # Known true value (in real life, unknown)

for _ in range(100):
    # Take a sample
    sample = np.random.choice(['A', 'B'], size=1000, p=[true_p, 1 - true_p])
    p_hat = np.mean(sample == 'A')

    # Calculate CI
    ci, _ = confidence_interval_proportion(p_hat, 1000)

    # Check if CI contains true value
    if ci[0] <= true_p <= ci[1]:
        intervals_containing_truth += 1

print(f"{intervals_containing_truth}% of intervals contained the true value")
# Should be close to 95!
```
When the sample size is small (n < 30), we use the t-distribution instead of the normal distribution. Why? With a small sample, the sample standard deviation is itself an uncertain estimate of the true standard deviation, so there is extra uncertainty to account for.
```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Compare t and normal distributions
x = np.linspace(-4, 4, 1000)
normal = stats.norm.pdf(x)
t_5 = stats.t.pdf(x, df=5)    # 5 degrees of freedom
t_30 = stats.t.pdf(x, df=30)  # 30 degrees of freedom

plt.figure(figsize=(10, 5))
plt.plot(x, normal, label='Normal', linewidth=2)
plt.plot(x, t_5, label='t (df=5)', linestyle='--', linewidth=2)
plt.plot(x, t_30, label='t (df=30)', linestyle=':', linewidth=2)
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Normal vs t-Distribution')
plt.legend()
plt.show()
```
The t-distribution has heavier tails, meaning it accounts for more uncertainty. As sample size increases, it approaches the normal distribution.
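In practice this just means swapping the z critical value for a t critical value when building a confidence interval from a small sample. A short sketch (the data below is made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical small sample: response times (ms) from 12 test runs
data = np.array([212, 198, 230, 205, 221, 189, 240, 210, 199, 225, 218, 207])
n = len(data)

mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)      # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)   # ~2.20 for df=11 (vs 1.96 for normal)

ci = (mean - t_crit * se, mean + t_crit * se)
print(f"Mean: {mean:.1f} ms, 95% t-based CI: ({ci[0]:.1f}, {ci[1]:.1f})")
```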
```python
# BAD: Survey only people who answer phones during business hours
# This systematically excludes working people!

# GOOD: Random sampling from entire population
```
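To make this concrete, here is an illustrative simulation. The population and the preference gap are invented for the example: people who are home during business hours happen to prefer A more often, so a daytime phone survey overestimates support for A.

```python
import numpy as np

np.random.seed(0)
N = 1_000_000

# Hypothetical population: 40% are home during business hours.
home_daytime = np.random.rand(N) < 0.40

# Invented preference gap: 60% support A among those home, 45% among the rest.
supports_A = np.where(home_daytime,
                      np.random.rand(N) < 0.60,
                      np.random.rand(N) < 0.45)

print(f"True support for A: {supports_A.mean():.3f}")  # ~0.51

# BAD: a daytime phone survey only reaches people who are home
biased_idx = np.random.choice(np.where(home_daytime)[0], size=1000, replace=False)
print(f"Biased sample estimate: {supports_A[biased_idx].mean():.3f}")  # ~0.60

# GOOD: a simple random sample of the whole population
random_idx = np.random.choice(N, size=1000, replace=False)
print(f"Random sample estimate: {supports_A[random_idx].mean():.3f}")  # ~0.51
```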
Mistake 1: Misleading Margin of Error

A headline saying “Poll shows 52% support (±3%)” means the 95% CI is 49-55%. But if the race is 52% vs 48%, the other candidate’s CI is 45-51%: the two intervals overlap, and the race is actually too close to call!
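A quick check of that claim, assuming the ±3% margin comes from a poll of roughly 1,000 respondents (a hypothetical but typical size):

```python
import numpy as np

n = 1000            # assumed poll size (roughly what a ±3% margin implies)
p_a, p_b = 0.52, 0.48

for name, p in [("A", p_a), ("B", p_b)]:
    moe = 1.96 * np.sqrt(p * (1 - p) / n)
    print(f"Candidate {name}: {p:.0%} ± {moe:.1%} -> ({p - moe:.1%}, {p + moe:.1%})")

# In a two-candidate race, A's lead is 2*p_a - 1, so its margin of error is
# roughly twice that of p_a: a 4-point lead ± ~6 points easily includes zero.
lead = p_a - p_b
lead_moe = 2 * 1.96 * np.sqrt(p_a * (1 - p_a) / n)
print(f"Lead: {lead:.0%} ± {lead_moe:.1%}")
```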
Mistake 2: Small Sample, Big Claims
```python
# Survey of 30 people shows 60% prefer product A
ci, moe = confidence_interval_proportion(0.60, 30)
print(f"95% CI: ({ci[0]:.1%}, {ci[1]:.1%})")
# CI: (42.5%, 77.5%) - way too wide to claim victory!
```
Mistake 3: Confusing Confidence Level with Probability
```python
# WRONG: "There's a 95% chance the true value is between 48.9% and 55.1%"
# RIGHT: "We're 95% confident our method produces intervals containing the true value"
```
Question 1: Interpreting A/B Test Results
Question: Your A/B test shows the new feature increased click-through rate from 2.0% to 2.3%, with a 95% CI of [2.1%, 2.5%] for the new version. What can you conclude?
Answer:
We’re 95% confident the true CTR for the new version is between 2.1% and 2.5%
Since the entire CI is above 2.0% (the control), we have evidence of a real improvement
The lower bound of the CI corresponds to an improvement of about 0.1 percentage points (2.1% - 2.0%)
For business decisions, consider if a 0.1-0.5 pp improvement justifies the change
Note: If the CI included 2.0%, we couldn’t conclude there’s a real difference.
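For a rough sense of the sample size behind an interval like [2.1%, 2.5%]: a half-width of about 0.2 percentage points around a 2.3% CTR corresponds to roughly 20,000 impressions for the variant. The sample size below is an assumption, just to make the numbers concrete:

```python
import numpy as np

p_hat = 0.023   # observed CTR for the new version
n = 20_000      # assumed number of impressions (hypothetical)

se = np.sqrt(p_hat * (1 - p_hat) / n)
moe = 1.96 * se
print(f"95% CI: ({p_hat - moe:.3%}, {p_hat + moe:.3%})")  # roughly (2.1%, 2.5%)
print(f"Entire interval above the 2.0% control rate: {p_hat - moe > 0.02}")
```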
Question 2: Sample Size Planning (Amazon)
Question: You want to estimate the proportion of customers who will buy a new product. You need ±5% precision with 95% confidence. How many customers should you survey?
Answer:
```python
# Use p = 0.5 for maximum sample size (conservative)
# Formula: n = (z² × p × (1 - p)) / E²
z = 1.96
p = 0.5   # Conservative assumption
E = 0.05  # Desired margin of error

n = (z**2 * p * (1 - p)) / (E**2)
print(f"Sample size needed: {int(np.ceil(n))}")  # 385
```
Key insight: Using p = 0.5 gives the largest sample size because that’s where variance is maximized. If you know approximately what p will be, you can use that value for a smaller required sample.
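For example, if prior data suggested only about 10% of customers would buy, the same formula would call for far fewer respondents. A quick sketch:

```python
import numpy as np

z, E = 1.96, 0.05
for p in [0.5, 0.3, 0.1]:
    n = (z**2 * p * (1 - p)) / (E**2)
    print(f"Assumed p = {p:.0%}: need n = {int(np.ceil(n))}")

# p = 50% -> 385, p = 30% -> 323, p = 10% -> 139
```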
Question 3: Standard Error Application (Facebook/Meta)
Question: Daily active users (DAU) over the past 100 days had mean 50M with standard deviation 5M. What’s the 95% CI for the true mean DAU?
Answer: The standard error of the mean is 5M / √100 = 0.5M, so the 95% CI is 50M ± 1.96 × 0.5M ≈ 50M ± 0.98M. The true average DAU is likely between roughly 49M and 51M with 95% confidence.
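The same calculation in code, a minimal sketch using the numbers from the question:

```python
import numpy as np
from scipy import stats

mean_dau = 50_000_000
std_dau = 5_000_000
n_days = 100

se = std_dau / np.sqrt(n_days)     # 500,000
moe = stats.norm.ppf(0.975) * se   # ~980,000
print(f"95% CI: ({(mean_dau - moe) / 1e6:.2f}M, {(mean_dau + moe) / 1e6:.2f}M)")
# -> roughly (49.02M, 50.98M)
```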
Question 4: Bias in Sampling (Tech Companies)
Question: You survey users who contacted customer support about a new feature. 80% say they dislike it. Is this valid for all users?
Answer: No! This is selection bias. Users who contact support are more likely to have problems, so the sample is not representative of all users. You’re measuring “satisfaction among users with issues,” not “overall satisfaction.” To get a valid estimate, you need:
Random sampling from all users
Or stratified sampling to ensure representation
To account for the fact that satisfied users rarely reach out
This kind of selection bias (often called “sampling bias,” and closely related to survivorship bias) is a major concern in ML training data as well.
You’re planning an A/B test. The current conversion rate is 5%. You want to detect a 20% relative improvement (5% → 6%). How many users do you need per group?
```python
# This is called a power analysis
# We need to balance:
# - Significance level (α): probability of false positive
# - Power (1-β): probability of detecting a real effect
# - Effect size: the difference we want to detect
# - Sample size: what we're solving for
from scipy import stats
import numpy as np

def required_sample_size_ab(p1, p2, alpha=0.05, power=0.8):
    """
    Calculate required sample size per group for A/B test.

    p1: baseline conversion rate
    p2: expected conversion rate after improvement
    alpha: significance level (typically 0.05)
    power: probability of detecting effect if real (typically 0.8)
    """
    # Effect size (Cohen's h)
    h = 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(p1)))

    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # Two-tailed
    z_beta = stats.norm.ppf(power)

    # Sample size per group
    n = 2 * ((z_alpha + z_beta) / h) ** 2
    return int(np.ceil(n))

# Your task: Calculate and interpret
p_baseline = 0.05
p_improved = 0.06
n_per_group = required_sample_size_ab(p_baseline, p_improved)
print(f"Need {n_per_group} users per group")
print(f"Total users: {2 * n_per_group}")

# Also calculate for different scenarios:
# 1. What if you want to detect a 10% improvement (5% → 5.5%)?
# 2. What if power needs to be 90% instead of 80%?
```
Solution:
```python
# Base case: 5% → 6% (20% relative improvement)
n_base = required_sample_size_ab(0.05, 0.06)
print(f"Base case: {n_base:,} per group")  # ~8,143

# Scenario 1: 5% → 5.5% (10% relative improvement)
n_smaller = required_sample_size_ab(0.05, 0.055)
print(f"Smaller effect: {n_smaller:,} per group")  # ~31,218
# Nearly 4x more users for half the effect size!

# Scenario 2: 90% power
n_high_power = required_sample_size_ab(0.05, 0.06, power=0.9)
print(f"Higher power: {n_high_power:,} per group")  # ~10,901
# ~34% more users to raise power from 80% to 90%

# Key insight: required sample size grows with the INVERSE SQUARE of the
# effect size - halving the detectable effect roughly quadruples n.
# This is why A/B testing small improvements is expensive!
```
ML Connection: When you report model accuracy as “92% ± 2%”, you’re using confidence intervals! Cross-validation provides multiple samples, and the standard error tells you how much your accuracy estimate might vary.
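As a rough sketch of that idea (the cross-validation scores below are made up, and the normal-approximation interval is only one of several ways to summarize CV variability):

```python
import numpy as np

# Hypothetical accuracy scores from 5-fold cross-validation
cv_scores = np.array([0.91, 0.93, 0.92, 0.90, 0.94])

mean_acc = cv_scores.mean()
se = cv_scores.std(ddof=1) / np.sqrt(len(cv_scores))
moe = 1.96 * se

print(f"Accuracy: {mean_acc:.1%} ± {moe:.1%}")  # roughly 92% ± 1.4%
```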
Coming up next: We’ll learn about Hypothesis Testing - how to determine if a difference is real or just random noise. This is the foundation of A/B testing and scientific validation of ML models.