# Statistical Inference: Conclusions from Samples

> How to draw reliable conclusions about millions from just hundreds of observations

## The Election Polling Problem

It's election night. With only 2% of votes counted, news networks are already predicting the winner with 95% confidence.

How is that possible? They've only seen a tiny fraction of the votes!

This is the power of **statistical inference** - the science of drawing conclusions about a large group (population) by studying a small part of it (sample).

**Estimated Time**: 3-4 hours\
**Difficulty**: Intermediate\
**Prerequisites**: Modules 1-3 (Describing Data, Probability, Distributions)\
**What You'll Build**: Election predictor, survey analyzer

***

## Population vs Sample

| Term           | Definition                         | Example                |
| -------------- | ---------------------------------- | ---------------------- |
| **Population** | The entire group you want to study | All 150 million voters |
| **Sample**     | A subset you actually observe      | 1,500 surveyed voters  |
| **Parameter**  | True value for the population      | Actual vote percentage |
| **Statistic**  | Calculated from the sample         | Survey percentage      |

The fundamental challenge: We want to know the **parameter**, but we can only calculate the **statistic**.

**Analogy**: Imagine you are trying to figure out the average temperature of an entire ocean. You cannot measure every single water molecule -- the true average over all of them is the population parameter. Instead, you dip thermometers at various points -- those readings are your sample. The art of inference is taking those thermometer readings and making reliable statements about the whole ocean, along with an honest assessment of how wrong you might be.

**ML Application -- Training vs. Test Performance**: Statistical inference is exactly the problem ML faces. Your training set is a sample, and you want to know how the model performs on the population (all future data it will ever see). This is why cross-validation exists -- it simulates drawing multiple samples to estimate how much your performance metric varies. When you report "accuracy = 92% plus or minus 2%", you are doing inference: using sample statistics to estimate a population parameter.

```python theme={null}
import numpy as np

# Imagine this is the TRUE population (we normally don't know this!)
np.random.seed(42)
population = np.random.choice(['A', 'B'], size=10_000_000, p=[0.52, 0.48])
true_proportion = np.mean(population == 'A')
print(f"TRUE population proportion for A: {true_proportion:.4f}")  # ~0.52

# But we can only survey 1000 people
sample = np.random.choice(population, size=1000, replace=False)
sample_proportion = np.mean(sample == 'A')
print(f"Sample proportion for A: {sample_proportion:.4f}")  # Varies!
```

[Figure: Population vs Sample Visualization]

***

## Sampling Distributions: The Key Insight

Here's the crucial question: If we took many different samples, how would our estimates vary?

```python theme={null}
# Take 1000 different samples of 500 people each
# (sampling without replacement from 10 million rows is slow; expect a wait)
sample_proportions = []
for _ in range(1000):
    sample = np.random.choice(population, size=500, replace=False)
    prop = np.mean(sample == 'A')
    sample_proportions.append(prop)

sample_proportions = np.array(sample_proportions)

print(f"Mean of sample proportions: {np.mean(sample_proportions):.4f}")
print(f"Std of sample proportions: {np.std(sample_proportions):.4f}")
print(f"True proportion: {true_proportion:.4f}")
```

**Output:**

```
Mean of sample proportions: 0.5198
Std of sample proportions: 0.0223
True proportion: 0.5200
```

The sample proportions cluster around the true value, and their distribution is **approximately normal** (a consequence of the Central Limit Theorem).

[Figure: Sampling Distribution of Poll Results]
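The spread seen in that simulation is not arbitrary: for a proportion it should sit close to the binomial value sqrt(p(1-p)/n). The sketch below is a faster stand-in for the loop above, not the lesson's own code. It assumes binomial draws (sampling with replacement), which is a reasonable approximation when the population is far larger than the sample, and the names `props`, `empirical_sd`, and `within` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true, n, n_polls = 0.52, 500, 10_000  # assumed true share, poll size, number of polls

# Each simulated poll: count of 'A' answers among n respondents, as a proportion
props = rng.binomial(n, p_true, size=n_polls) / n

empirical_sd = props.std()
theoretical_sd = np.sqrt(p_true * (1 - p_true) / n)

# Share of polls landing within 1.96 theoretical SDs of the true value
within = np.mean(np.abs(props - p_true) <= 1.96 * theoretical_sd)

print(f"Empirical SD of sample proportions: {empirical_sd:.4f}")
print(f"Theoretical SD sqrt(p(1-p)/n):      {theoretical_sd:.4f}")
print(f"Polls within 1.96 SDs of the truth: {within:.1%}")
```

If the normal approximation holds, the two standard deviations should nearly agree and roughly 95% of polls should fall inside the band.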

***

## Standard Error: Quantifying Uncertainty

The **standard error** measures how much sample statistics vary from sample to sample.

For a proportion:

$$
SE = \sqrt{\frac{p(1-p)}{n}}
$$

For a mean:

$$
SE = \frac{\sigma}{\sqrt{n}}
$$

```python theme={null}
import numpy as np

def standard_error_proportion(p, n):
    """Standard error for a sample proportion."""
    return np.sqrt(p * (1 - p) / n)

def standard_error_mean(std_dev, n):
    """Standard error for a sample mean."""
    return std_dev / np.sqrt(n)

# Example: Poll with 52% for candidate A, n=1000
se = standard_error_proportion(0.52, 1000)
print(f"Standard error: {se:.4f}")  # 0.0158 or about 1.58%

# With larger sample
se_large = standard_error_proportion(0.52, 4000)
print(f"SE with n=4000: {se_large:.4f}")  # 0.0079 or about 0.79%
```

**Key Insight**: Standard error shrinks in proportion to 1/sqrt(n). To halve the error, you need 4x the sample size.

**Analogy**: Think of standard error as the "blurriness" of a photograph. More data is like more megapixels -- the picture gets sharper. But the improvement is diminishing: going from 100 to 400 data points (4x) only cuts the blur in half. Going from 400 to 1,600 (another 4x) cuts it in half again. This is why data scientists obsess over whether more data is actually worth the collection cost.

**Statistical Mistake in ML -- Trusting Small Validation Sets**: If you evaluate your model on a test set of only 50 samples and report "95% accuracy," the standard error is roughly sqrt(0.95 times 0.05 / 50) = 3.1%. Your true accuracy could easily be anywhere from 89% to 100%. With 500 samples, that error drops to about 1%. Always compute confidence intervals around your ML metrics, especially when comparing models. A "2% improvement" on a small test set is often statistical noise.

***

## Confidence Intervals: Expressing Uncertainty

A **confidence interval** gives a range of plausible values for the true parameter.
### The Formula

For a proportion with 95% confidence:

$$
\hat{p} \pm z^* \cdot SE = \hat{p} \pm 1.96 \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
$$

```python theme={null}
import numpy as np
from scipy import stats

def confidence_interval_proportion(p_hat, n, confidence=0.95):
    """Calculate confidence interval for a proportion."""
    # Z-score for desired confidence level
    z = stats.norm.ppf((1 + confidence) / 2)

    # Standard error
    se = np.sqrt(p_hat * (1 - p_hat) / n)

    # Margin of error
    moe = z * se

    return (p_hat - moe, p_hat + moe), moe

# Poll result: 52% with n=1000
p_hat = 0.52
n = 1000

ci, moe = confidence_interval_proportion(p_hat, n)
print(f"Point estimate: {p_hat:.1%}")
print(f"Margin of error: ±{moe:.1%}")
print(f"95% CI: ({ci[0]:.1%}, {ci[1]:.1%})")
```

**Output:**

```
Point estimate: 52.0%
Margin of error: ±3.1%
95% CI: (48.9%, 55.1%)
```

### What Does "95% Confidence" Mean?

It does NOT mean "95% probability the true value is in this interval."

It means: If we repeated this process many times, 95% of the intervals we construct would contain the true value.

```python theme={null}
# Demonstrate: Create 100 confidence intervals
intervals_containing_truth = 0
true_p = 0.52  # Known true value (in real life, unknown)

for _ in range(100):
    # Take a sample
    sample = np.random.choice(['A', 'B'], size=1000, p=[true_p, 1-true_p])
    p_hat = np.mean(sample == 'A')

    # Calculate CI
    ci, _ = confidence_interval_proportion(p_hat, 1000)

    # Check if CI contains true value
    if ci[0] <= true_p <= ci[1]:
        intervals_containing_truth += 1

print(f"{intervals_containing_truth}% of intervals contained the true value")
# Should be close to 95!
```

***

## Confidence Intervals for Means

When estimating an average (like average house price):

```python theme={null}
def confidence_interval_mean(data, confidence=0.95):
    """Calculate confidence interval for a mean using t-distribution."""
    n = len(data)
    mean = np.mean(data)
    std_err = stats.sem(data)  # Standard error of the mean

    # Use t-distribution because the standard deviation is estimated from the sample
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n-1)
    moe = t_crit * std_err

    return (mean - moe, mean + moe), moe

# Example: House prices (in thousands)
house_prices = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472
])

ci, moe = confidence_interval_mean(house_prices)
print(f"Sample mean: ${np.mean(house_prices):.1f}K")
print(f"Margin of error: ±${moe:.1f}K")
print(f"95% CI: (${ci[0]:.1f}K, ${ci[1]:.1f}K)")
```

**Output:**

```
Sample mean: $463.8K
Margin of error: ±$15.7K
95% CI: ($448.0K, $479.5K)
```

***

## The t-Distribution: For Small Samples

When the population standard deviation is unknown and has to be estimated from the sample, we use the **t-distribution** instead of the normal distribution. The difference matters most when the sample is small (roughly n \< 30).

Why? Small samples have more uncertainty about the true standard deviation.

```python theme={null}
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Compare t and normal distributions
x = np.linspace(-4, 4, 1000)

normal = stats.norm.pdf(x)
t_5 = stats.t.pdf(x, df=5)    # 5 degrees of freedom
t_30 = stats.t.pdf(x, df=30)  # 30 degrees of freedom

plt.figure(figsize=(10, 5))
plt.plot(x, normal, label='Normal', linewidth=2)
plt.plot(x, t_5, label='t (df=5)', linestyle='--', linewidth=2)
plt.plot(x, t_30, label='t (df=30)', linestyle=':', linewidth=2)
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Normal vs t-Distribution')
plt.legend()
plt.show()
```

The t-distribution has heavier tails, meaning it accounts for more uncertainty. As sample size increases, it approaches the normal distribution.

***

## Sample Size Planning

How large a sample do you need? It depends on:

1. Desired margin of error
2. Desired confidence level
3. Expected variability

### For Proportions

$$
n = \left(\frac{z^*}{MOE}\right)^2 \cdot p(1-p)
$$

```python theme={null}
def sample_size_proportion(moe, p=0.5, confidence=0.95):
    """Calculate required sample size for a proportion."""
    z = stats.norm.ppf((1 + confidence) / 2)
    n = (z / moe) ** 2 * p * (1 - p)
    return int(np.ceil(n))

# Want ±3% margin of error at 95% confidence
n_needed = sample_size_proportion(moe=0.03)
print(f"Need {n_needed} respondents for ±3% MOE")  # 1068

# For ±1% (much more precise)
n_needed_1pct = sample_size_proportion(moe=0.01)
print(f"Need {n_needed_1pct} respondents for ±1% MOE")  # 9604

# Notice: 3x more precision requires 9x more sample!
```

### For Means

$$
n = \left(\frac{z^* \cdot \sigma}{MOE}\right)^2
$$

```python theme={null}
def sample_size_mean(moe, std_dev, confidence=0.95):
    """Calculate required sample size for a mean."""
    z = stats.norm.ppf((1 + confidence) / 2)
    n = (z * std_dev / moe) ** 2
    return int(np.ceil(n))

# House prices: want ±$10K, estimated std dev is $50K
n_houses = sample_size_mean(moe=10, std_dev=50)
print(f"Need {n_houses} houses for ±$10K MOE")  # 97
```

***

## Mini-Project: Election Poll Analyzer

Build a complete polling analysis system:

```python theme={null}
import numpy as np
from scipy import stats

class PollAnalyzer:
    """
    Analyze election poll results with proper uncertainty quantification.
    """

    def __init__(self, candidate_a_votes, total_votes, confidence=0.95):
        self.n = total_votes
        self.a_votes = candidate_a_votes
        self.b_votes = total_votes - candidate_a_votes
        self.p_a = candidate_a_votes / total_votes
        self.p_b = 1 - self.p_a
        self.confidence = confidence

    def margin_of_error(self):
        """Calculate margin of error for candidate A's proportion."""
        z = stats.norm.ppf((1 + self.confidence) / 2)
        se = np.sqrt(self.p_a * self.p_b / self.n)
        return z * se

    def confidence_interval(self):
        """Calculate confidence interval for candidate A."""
        moe = self.margin_of_error()
        return (self.p_a - moe, self.p_a + moe)

    def probability_a_wins(self):
        """
        Estimate how plausible it is that candidate A is truly ahead.

        This uses the normal approximation. Strictly, the result is one minus
        a one-sided p-value (or a flat-prior Bayesian posterior), not a
        frequentist probability about the fixed true proportion.
        """
        # How many standard errors is the observed share above 50%?
        diff = self.p_a - 0.5
        se = np.sqrt(self.p_a * self.p_b / self.n)

        # Z-score for the difference from 0.5
        z = diff / se

        # Normal-approximation plausibility that the true proportion > 0.5
        return stats.norm.cdf(z)

    def required_sample_for_call(self, min_confidence=0.95):
        """
        Calculate sample size needed to call the race at given confidence.
        Returns float('inf') if there is no lead at all.
        """
        # We need the CI to not cross 50%
        # This happens when |p_a - 0.5| > MOE
        lead = abs(self.p_a - 0.5)
        z = stats.norm.ppf((1 + min_confidence) / 2)

        # Solve: z * sqrt(p*q/n) = lead
        # n = (z^2 * p * q) / lead^2
        if lead == 0:
            return float('inf')

        n = (z ** 2 * self.p_a * self.p_b) / (lead ** 2)
        return int(np.ceil(n))

    def report(self):
        """Generate comprehensive poll report."""
        ci = self.confidence_interval()

        print("\n" + "=" * 60)
        print("ELECTION POLL ANALYSIS")
        print("=" * 60)
        print(f"Sample Size: {self.n:,} voters")
        print(f"Confidence Level: {self.confidence:.0%}")
        print("-" * 60)
        print(f"Candidate A: {self.p_a:.1%} ({self.a_votes:,} votes)")
        print(f"Candidate B: {self.p_b:.1%} ({self.b_votes:,} votes)")
        print("-" * 60)
        print(f"Margin of Error: ±{self.margin_of_error():.1%}")
        print(f"{self.confidence:.0%} CI for A: ({ci[0]:.1%}, {ci[1]:.1%})")
        print("-" * 60)

        p_wins = self.probability_a_wins()
        if p_wins > 0.99:
            call = "PROJECTED WINNER: Candidate A"
        elif p_wins < 0.01:
            call = "PROJECTED WINNER: Candidate B"
        elif p_wins > 0.95:
            call = "LIKELY WINNER: Candidate A"
        elif p_wins < 0.05:
            call = "LIKELY WINNER: Candidate B"
        else:
            call = "TOO CLOSE TO CALL"

        print(f"P(A is truly ahead): {p_wins:.1%}")
        print(f"Status: {call}")

        if 0.05 < p_wins < 0.95:
            n_needed = self.required_sample_for_call()
            if n_needed < float('inf'):
                print(f"Need ~{n_needed:,} votes to call at 95%")
        print("=" * 60)


# Example 1: Early results (close race)
print("\n--- EARLY RESULTS ---")
poll1 = PollAnalyzer(candidate_a_votes=520, total_votes=1000)
poll1.report()

# Example 2: More data (still close)
print("\n--- UPDATED RESULTS ---")
poll2 = PollAnalyzer(candidate_a_votes=5200, total_votes=10000)
poll2.report()

# Example 3: Clear lead
print("\n--- CLEAR LEAD ---")
poll3 = PollAnalyzer(candidate_a_votes=5500, total_votes=10000)
poll3.report()
```

**Output:**

```
--- EARLY RESULTS ---

============================================================
ELECTION POLL ANALYSIS
============================================================
Sample Size: 1,000 voters
Confidence Level: 95%
------------------------------------------------------------
Candidate A: 52.0% (520 votes)
Candidate B: 48.0% (480 votes)
------------------------------------------------------------
Margin of Error: ±3.1%
95% CI for A: (48.9%, 55.1%)
------------------------------------------------------------
P(A is truly ahead): 89.7%
Status: TOO CLOSE TO CALL
Need ~2,398 votes to call at 95%
============================================================

--- UPDATED RESULTS ---

============================================================
ELECTION POLL ANALYSIS
============================================================
Sample Size: 10,000 voters
Confidence Level: 95%
------------------------------------------------------------
Candidate A: 52.0% (5,200 votes)
Candidate B: 48.0% (4,800 votes)
------------------------------------------------------------
Margin of Error: ±1.0%
95% CI for A: (51.0%, 53.0%)
------------------------------------------------------------
P(A is truly ahead): 100.0%
Status: PROJECTED WINNER: Candidate A
============================================================

--- CLEAR LEAD ---

============================================================
ELECTION POLL ANALYSIS
============================================================
Sample Size: 10,000 voters
Confidence Level: 95%
------------------------------------------------------------
Candidate A: 55.0% (5,500 votes)
Candidate B: 45.0% (4,500 votes)
------------------------------------------------------------
Margin of Error: ±1.0%
95% CI for A: (54.0%, 56.0%)
------------------------------------------------------------
P(A is truly ahead): 100.0%
Status: PROJECTED WINNER: Candidate A
============================================================
```

***

## Common Mistakes in Inference

**Mistake 1: Ignoring Sample Bias**

```python theme={null}
# BAD: Survey only people who answer phones during business hours
# This systematically excludes working people!

# GOOD: Random sampling from entire population
```

**Mistake 2: Misleading Margin of Error**

A headline saying "Poll shows 52% support (±3%)" means the 95% CI is 49-55%. But if the race is 52% vs 48%, the two candidates' intervals overlap and the race is actually too close to call! Because both shares come from the same sample, the margin of error on the 4-point lead is roughly double the stated one, about ±6 points.

**Mistake 3: Small Sample, Big Claims**

```python theme={null}
# Survey of 30 people shows 60% prefer product A
ci, moe = confidence_interval_proportion(0.60, 30)
print(f"95% CI: ({ci[0]:.1%}, {ci[1]:.1%})")
# CI: (42.5%, 77.5%) - way too wide to claim victory!
```

**Mistake 4: Confusing Confidence Level with Probability**

```python theme={null}
# WRONG: "There's a 95% chance the true value is between 48.9% and 55.1%"
# RIGHT: "Our method produces intervals that contain the true value 95% of the time"
```

***

## Interview Questions

**Question**: Your A/B test shows the new feature increased click-through rate from 2.0% to 2.3%, with a 95% CI of \[2.1%, 2.5%] for the new version. What can you conclude?

**Answer**:

* We're 95% confident the true CTR for the new version is between 2.1% and 2.5%
* Since the entire CI is above 2.0% (the control), we have evidence of a real improvement
* The lower end of the plausible improvement is 0.1 percentage points (2.1% - 2.0%)
* For business decisions, consider if a 0.1-0.5 pp improvement justifies the change

Note: If the CI included 2.0%, we couldn't conclude there's a real difference. This comparison also treats the control's 2.0% as exact; since it is itself an estimate, a confidence interval on the difference between the two rates is the more rigorous check.

**Question**: You want to estimate the proportion of customers who will buy a new product. You need ±5% precision with 95% confidence. How many customers should you survey?

**Answer**:

```python theme={null}
import numpy as np

# Use p = 0.5 for maximum sample size (conservative)
# Formula: n = (z²×p×(1-p)) / E²

z = 1.96
p = 0.5   # Conservative assumption
E = 0.05  # Desired margin of error

n = (z**2 * p * (1-p)) / (E**2)
print(f"Sample size needed: {int(np.ceil(n))}")  # 385
```

Key insight: Using p = 0.5 gives the largest sample size because that's where variance is maximized. If you know approximately what p will be, you can use that value for a smaller required sample.

**Question**: Daily active users (DAU) over the past 100 days had mean 50M with standard deviation 5M. What's the 95% CI for the true mean DAU?
**Answer**:

```python theme={null}
import numpy as np

mean_dau = 50  # millions
std_dau = 5    # millions
n = 100

# Standard error
se = std_dau / np.sqrt(n)  # 0.5 million

# 95% CI
z = 1.96
ci = (mean_dau - z*se, mean_dau + z*se)
print(f"95% CI: ({ci[0]:.1f}M, {ci[1]:.1f}M)")  # (49.0M, 51.0M)
```

The true average DAU is likely between 49M and 51M with 95% confidence. This assumes the daily values are roughly independent; strong day-to-day autocorrelation would make the real interval wider.

**Question**: You survey users who contacted customer support about a new feature. 80% say they dislike it. Is this valid for all users?

**Answer**: No! This is selection bias. Users who contact support are more likely to have problems. The sample is not representative of all users. You're measuring "satisfaction among users with issues" not "overall satisfaction."

To get a valid estimate, you need:

* Random sampling from all users
* Or stratified sampling to ensure representation
* Consider that satisfied users rarely reach out

This is called "selection bias" (a form of sampling bias) and is a major concern in ML training data as well.

***

## Practice Challenge

You're planning an A/B test. The current conversion rate is 5%. You want to detect a 20% relative improvement (5% → 6%). How many users do you need per group?

```python theme={null}
# This is called a power analysis
# We need to balance:
# - Significance level (α): probability of false positive
# - Power (1-β): probability of detecting a real effect
# - Effect size: the difference we want to detect
# - Sample size: what we're solving for

from scipy import stats
import numpy as np

def required_sample_size_ab(p1, p2, alpha=0.05, power=0.8):
    """
    Calculate required sample size per group for A/B test.

    p1: baseline conversion rate
    p2: expected conversion rate after improvement
    alpha: significance level (typically 0.05)
    power: probability of detecting effect if real (typically 0.8)
    """
    # Effect size (Cohen's h)
    h = 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(p1)))

    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha/2)  # Two-tailed
    z_beta = stats.norm.ppf(power)

    # Sample size per group
    n = 2 * ((z_alpha + z_beta) / h) ** 2

    return int(np.ceil(n))

# Your task: Calculate and interpret
p_baseline = 0.05
p_improved = 0.06

n_per_group = required_sample_size_ab(p_baseline, p_improved)
print(f"Need {n_per_group} users per group")
print(f"Total users: {2 * n_per_group}")

# Also calculate for different scenarios:
# 1. What if you want to detect a 10% improvement (5% → 5.5%)?
# 2. What if power needs to be 90% instead of 80%?
```

**Solution**:

```python theme={null}
# Base case: 5% → 6% (20% relative improvement)
n_base = required_sample_size_ab(0.05, 0.06)
print(f"Base case: {n_base:,} per group")  # ~8,100

# Scenario 1: 5% → 5.5% (10% relative improvement)
n_smaller = required_sample_size_ab(0.05, 0.055)
print(f"Smaller effect: {n_smaller:,} per group")  # ~31,200
# Almost 4x more users for half the effect size!

# Scenario 2: 90% power
n_high_power = required_sample_size_ab(0.05, 0.06, power=0.9)
print(f"Higher power: {n_high_power:,} per group")  # ~10,900
# ~34% more users for 10 percentage points more power

# Key insight: Sample size grows with 1 / (effect size)^2
# This is why A/B testing small improvements is expensive!
```

***

## 📝 Practice Exercises

* Calculate confidence intervals for proportions
* Determine required sample sizes for precision
* Construct confidence intervals for means
* Real-world: Election polling analysis

**Exercise 1: Website Conversion Rate** - Confidence interval for proportion

**Problem**: Your e-commerce site had 2,500 visitors last week, and 125 made a purchase.

1. What's the point estimate for conversion rate?
2. Calculate the 95% confidence interval
3. Calculate the 99% confidence interval
4. Can you claim the true rate is at least 4%?

**Solution**:

```python theme={null}
import numpy as np
from scipy import stats

n = 2500  # sample size
x = 125   # successes

# 1. Point estimate
p_hat = x / n
print(f"Point estimate: {p_hat:.2%}")  # 5.00%

# 2. 95% CI
z_95 = stats.norm.ppf(0.975)  # 1.96
se = np.sqrt(p_hat * (1 - p_hat) / n)
moe_95 = z_95 * se

ci_95_lower = p_hat - moe_95
ci_95_upper = p_hat + moe_95
print(f"\n95% CI: ({ci_95_lower:.2%}, {ci_95_upper:.2%})")
print(f"Margin of error: ±{moe_95:.2%}")

# 3. 99% CI
z_99 = stats.norm.ppf(0.995)  # 2.576
moe_99 = z_99 * se

ci_99_lower = p_hat - moe_99
ci_99_upper = p_hat + moe_99
print(f"\n99% CI: ({ci_99_lower:.2%}, {ci_99_upper:.2%})")
print(f"Margin of error: ±{moe_99:.2%}")

# 4. Can we claim rate ≥ 4%?
# Check if 4% is below the lower bound of 95% CI
claim_threshold = 0.04
if ci_95_lower > claim_threshold:
    print(f"\n✓ Yes! The 95% CI lower bound ({ci_95_lower:.2%}) > 4%")
    print("  We can confidently claim the rate is at least 4%")
else:
    print(f"\n✗ No. The 95% CI lower bound ({ci_95_lower:.2%}) ≤ 4%")
```
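Question 4 only asks about one direction ("at least 4%"), so a one-sided lower confidence bound is arguably the more direct tool than the two-sided interval. The sketch below is an optional variation, not part of the original solution: it puts the full 5% error rate in the lower tail, so the critical value is about 1.645 instead of 1.96. The names `z_one_sided` and `lower_bound` are illustrative.

```python
import numpy as np
from scipy import stats

n, x = 2500, 125          # visitors and purchases from the exercise
p_hat = x / n
se = np.sqrt(p_hat * (1 - p_hat) / n)

# One-sided 95% bound: all 5% of the error goes into the lower tail
z_one_sided = stats.norm.ppf(0.95)   # about 1.645
lower_bound = p_hat - z_one_sided * se

print(f"One-sided 95% lower bound: {lower_bound:.2%}")
print(f"Rate is at least 4%? {lower_bound > 0.04}")
```

The one-sided bound sits slightly higher than the two-sided lower limit, which is expected: it answers a narrower question with the same data.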

**Exercise 2: Sample Size Planning** - How many responses do you need?

**Problem**: You're planning a customer satisfaction survey. You want to estimate the proportion of satisfied customers with:

* 95% confidence
* Margin of error of ±3%

1. What sample size is needed if you have no prior estimate?
2. What if you estimate 70% are satisfied based on past data?
3. How does doubling precision affect sample size?

**Solution**:

```python theme={null}
import numpy as np
from scipy import stats

def sample_size_proportion(confidence, moe, p=0.5):
    """Calculate required sample size for proportion."""
    z = stats.norm.ppf((1 + confidence) / 2)
    n = (z ** 2) * p * (1 - p) / (moe ** 2)
    return int(np.ceil(n))

# Given parameters
confidence = 0.95
moe = 0.03  # 3%

# 1. No prior estimate (use p = 0.5 for maximum variability)
n_no_prior = sample_size_proportion(confidence, moe, p=0.5)
print(f"Sample size (no prior, p=0.5): {n_no_prior:,}")  # 1,068

# 2. With prior estimate of 70%
n_with_prior = sample_size_proportion(confidence, moe, p=0.70)
print(f"Sample size (p=0.70): {n_with_prior:,}")  # 897

# 3. Double precision (moe = 1.5%)
moe_double = 0.015
n_double_precision = sample_size_proportion(confidence, moe_double, p=0.5)
print(f"\nSample size for ±1.5%: {n_double_precision:,}")  # 4,269
print(f"Factor increase: {n_double_precision / n_no_prior:.1f}x")  # 4x

# Key insight: Halving margin of error requires 4x sample size!

# Bonus: Compare different confidence levels
print("\n--- Confidence Level Comparison (±3% MOE) ---")
for conf in [0.90, 0.95, 0.99]:
    n = sample_size_proportion(conf, moe)
    print(f"{conf:.0%} confidence: {n:,} samples")
```
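In practice the budget often fixes the sample size first, and the question becomes what precision that buys. The same formula can be inverted to give the margin of error for a given n. This is a small sketch under the exercise's assumptions (normal approximation, worst-case p = 0.5 by default); `achievable_moe` is a hypothetical helper name, not something defined earlier in the lesson.

```python
import numpy as np
from scipy import stats

def achievable_moe(n, p=0.5, confidence=0.95):
    """Margin of error you can expect from n responses (normal approximation)."""
    z = stats.norm.ppf((1 + confidence) / 2)
    return z * np.sqrt(p * (1 - p) / n)

# Check the sample sizes from the solution against their targets
for n in [500, 1068, 4269]:
    print(f"n = {n:>5,}: MOE = ±{achievable_moe(n):.2%}")
```

Running the solution's sample sizes back through this function is a quick sanity check: 1,068 responses should land just under ±3%, and 4,269 just under ±1.5%.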

**Exercise 3: Customer Spending Analysis** - Confidence interval for mean

**Problem**: A sample of 50 customer transactions shows:

* Sample mean: \$85.40
* Sample standard deviation: \$32.50

1. Calculate the 95% confidence interval for mean spending
2. What if sample size was only 15? (Use t-distribution)
3. How wide would the CI be with n = 200?

**Solution**:

```python theme={null}
import numpy as np
from scipy import stats

# Sample statistics
x_bar = 85.40
s = 32.50

# 1. CI with n = 50
n_50 = 50
se_50 = s / np.sqrt(n_50)

# Use t-distribution (technically correct for sample std dev)
t_50 = stats.t.ppf(0.975, df=n_50-1)
moe_50 = t_50 * se_50

print("n = 50:")
print(f"  Standard Error: ${se_50:.2f}")
print(f"  t-value (df=49): {t_50:.3f}")
print(f"  95% CI: (${x_bar - moe_50:.2f}, ${x_bar + moe_50:.2f})")
print(f"  Width: ${2*moe_50:.2f}")

# 2. Small sample (n = 15)
n_15 = 15
se_15 = s / np.sqrt(n_15)
t_15 = stats.t.ppf(0.975, df=n_15-1)  # Wider t-value for small n
moe_15 = t_15 * se_15

print("\nn = 15:")
print(f"  Standard Error: ${se_15:.2f}")
print(f"  t-value (df=14): {t_15:.3f}")  # Notice t > z for small n
print(f"  95% CI: (${x_bar - moe_15:.2f}, ${x_bar + moe_15:.2f})")
print(f"  Width: ${2*moe_15:.2f}")

# 3. Large sample (n = 200)
n_200 = 200
se_200 = s / np.sqrt(n_200)
t_200 = stats.t.ppf(0.975, df=n_200-1)  # Close to z=1.96
moe_200 = t_200 * se_200

print("\nn = 200:")
print(f"  Standard Error: ${se_200:.2f}")
print(f"  t-value (df=199): {t_200:.3f}")  # ≈ 1.96
print(f"  95% CI: (${x_bar - moe_200:.2f}, ${x_bar + moe_200:.2f})")
print(f"  Width: ${2*moe_200:.2f}")

# Summary
print("\n--- CI Width Comparison ---")
print(f"n=15:  ${2*moe_15:.2f}")
print(f"n=50:  ${2*moe_50:.2f}")
print(f"n=200: ${2*moe_200:.2f}")
```

**Exercise 4: Election Poll Analysis** - Real-world inference

**Problem**: A political poll surveys 1,200 likely voters:

* Candidate A: 52%
* Candidate B: 48%

1. Calculate 95% CI for Candidate A's support
2. Can you predict a winner with 95% confidence?
3. What sample size needed to predict winner if true split is 51-49?
4. How does "likely voter" sampling affect reliability?

**Solution**:

```python theme={null}
import numpy as np
from scipy import stats

n = 1200
p_a = 0.52
p_b = 0.48

# 1. 95% CI for Candidate A
z = stats.norm.ppf(0.975)
se = np.sqrt(p_a * (1 - p_a) / n)
moe = z * se

ci_lower = p_a - moe
ci_upper = p_a + moe

print(f"Candidate A: {p_a:.1%}")
print(f"95% CI: ({ci_lower:.1%}, {ci_upper:.1%})")
print(f"Margin of error: ±{moe:.1%}")

# 2. Can we predict a winner?
# Winner clear if CI doesn't include 50%
if ci_lower > 0.50:
    print("\n✓ Candidate A predicted to win (lower bound > 50%)")
elif ci_upper < 0.50:
    print("\n✓ Candidate B predicted to win (upper bound < 50%)")
else:
    print("\n✗ Too close to call (CI includes 50%)")
    print("  The race is within the margin of error")

# 3. Sample size for 51-49 split
def sample_size_for_significance(p_true, p_null=0.50, alpha=0.05, power=0.80):
    """Sample size for one poll to distinguish p_true from p_null."""
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    effect = abs(p_true - p_null)

    # Both shares come from the same sample, so this is a one-sample
    # question: is Candidate A's share different from 50%?
    n = p_true * (1 - p_true) * ((z_alpha + z_beta) / effect) ** 2
    return int(np.ceil(n))

n_needed = sample_size_for_significance(0.51)
print("\nFor 51-49 split with 80% power:")
print(f"  Sample size needed: {n_needed:,}")

# 4. Sampling bias discussion
print("\n--- Sampling Considerations ---")
print("'Likely voter' screens can introduce bias:")
print("  - Definition varies by pollster")
print("  - May miss new/infrequent voters")
print("  - Response rates differ by demographics")
print("\nPolls can also miss by a few points for reasons beyond sampling error")
print("Account for this when interpreting close races!")

# Bonus: Probability Candidate A wins (simplified)
# Assuming normal approximation to sampling distribution
p_a_wins = 1 - stats.norm.cdf(0.50, loc=p_a, scale=se)
print(f"\nP(Candidate A actually leads): {p_a_wins:.1%}")
```
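To get a feel for why a 52-48 poll of 1,200 people is "too close to call", it can help to simulate the situation. The sketch below assumes the true support really is 52% and draws many hypothetical polls of the same size using binomial sampling; the seed, the number of simulated polls, and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_p, n, n_polls = 0.52, 1200, 20_000  # assumed true support, poll size, simulations
z = stats.norm.ppf(0.975)

# Simulate many independent polls and build a 95% CI lower bound for each
p_hats = rng.binomial(n, true_p, size=n_polls) / n
se = np.sqrt(p_hats * (1 - p_hats) / n)
lower = p_hats - z * se

# A poll can "call" the race for A only when its whole interval sits above 50%
call_rate = np.mean(lower > 0.50)
print(f"Polls that could call the race for A: {call_rate:.1%}")
```

Even though Candidate A is genuinely ahead in this setup, only a minority of polls this size produce an interval that clears 50%, which is the power problem that part 3 of the exercise addresses with a larger sample.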

***

## Key Takeaways

**Population vs Sample**

* Population: entire group of interest
* Sample: subset we actually observe
* Statistics estimate parameters

**Standard Error**

* Measures variability of sample statistics
* Shrinks in proportion to 1/sqrt(n)
* Foundation of confidence intervals

**Confidence Intervals**

* Range of plausible values for parameter
* Width = 2 x margin of error
* Higher confidence = wider interval

**Sample Size**

* More precision requires more data
* Quadratic relationship (2x precision = 4x data)
* Plan before collecting data

***

## Common Pitfalls

**Inference Mistakes to Avoid**:

1. **Misinterpreting Confidence Intervals** - "95% confident the true value is in this range" NOT "95% chance the parameter is here"
2. **Ignoring Sample Bias** - Non-random samples lead to biased estimates regardless of sample size
3. **Confusing Confidence Level with Precision** - 99% confidence is wider, not more precise
4. **Forgetting Standard Error Shrinks with √n** - To halve SE, you need 4x the data, not 2x
5. **Using z when t is appropriate** - For small samples (n \< 30), t-distribution accounts for extra uncertainty

***

## Connection to Machine Learning

| Inference Concept    | ML Application                                        |
| -------------------- | ----------------------------------------------------- |
| Confidence intervals | Uncertainty in predictions                            |
| Standard error       | Error bars on reported metrics                        |
| Sample size          | Training set size planning                            |
| t-distribution       | Small data regimes (e.g., few cross-validation folds) |
| Bias in sampling     | Training/test split, data collection                  |

**ML Connection**: When you report model accuracy as "92% ± 2%", you're using confidence intervals! Cross-validation provides multiple samples, and the standard error tells you how much your accuracy estimate might vary.

**Coming up next**: We'll learn about **Hypothesis Testing** - how to determine if a difference is real or just random noise. This is the foundation of A/B testing and scientific validation of ML models.
***

## Interview Deep-Dive

**Strong Answer:**

* This is one of the most common misconceptions in statistics, and it is wrong. The true population parameter is a fixed number -- it either is or is not in that interval. There is no probability about it. The 95% refers to the procedure, not to any single interval.
* The correct interpretation: if we repeated this experiment many times and constructed a 95% confidence interval each time, approximately 95% of those intervals would contain the true parameter. For any single interval, we do not know if it is one of the 95% that captured the truth or one of the 5% that missed.
* In practice, most people (including many data scientists) use the Bayesian-sounding interpretation because it is more intuitive for decision-making. And with a flat prior, the Bayesian credible interval actually gives you the statement your colleague was trying to make. But in a frequentist framework, the distinction matters because it affects how you communicate uncertainty.
* The practical consequence: when reporting results to stakeholders, I usually say "we are 95% confident that the true rate lies between 4.2% and 5.8%" which is technically correct, rather than "there is a 95% chance" which is technically not. But I also explain that the interval gives us a range of plausible values, and values outside the interval are implausible given our data.

**Follow-up: When would you recommend switching from a frequentist confidence interval to a Bayesian credible interval?**

I would switch to Bayesian methods when prior information is genuinely available and important. For example, if we are estimating conversion rates for a new checkout flow, we have strong prior knowledge that conversion rates for e-commerce sites typically fall between 1% and 10%. A Bayesian approach with an informative prior can produce tighter, more useful intervals, especially with small sample sizes. The Bayesian credible interval also answers the question people actually want to ask: "What is the probability the parameter is in this range?"

Another scenario is online A/B testing where you want to continuously update your beliefs as data arrives. Bayesian methods allow natural sequential updating without the multiple testing penalties that plague frequentist approaches when you peek at results.

**Strong Answer:**

* With n=12, we are firmly in small-sample territory, so I would use a t-distribution rather than a normal distribution for the confidence interval. The t-distribution has heavier tails than the normal, which accounts for the extra uncertainty in estimating the population standard deviation from such a small sample.
* Specifically, I would compute the sample mean and sample standard deviation, then build a t-based confidence interval: x-bar plus or minus t-critical times (s / sqrt(12)), where t-critical comes from the t-distribution with 11 degrees of freedom. For 95% confidence, t-critical is approximately 2.201, compared to 1.96 for the normal -- about 12% wider.
* Before trusting this interval, I would check the normality assumption. With only 12 observations, I cannot reliably test normality with formal tests (they have very low power at n=12), but I would look at a Q-Q plot and check for obvious outliers or extreme skewness. If the data is heavily skewed (which order values often are due to a long right tail), I would consider either a log transformation or a bootstrap confidence interval instead.
* I would also be transparent about the limitations. A 95% CI from 12 observations will be very wide -- potentially too wide to be useful for decision-making. I would tell stakeholders: "Here is our best estimate and its range, but we need at least 50-100 orders before this estimate stabilizes enough to base pricing or inventory decisions on it."
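The two intervals mentioned in this answer can be sketched side by side. The 12 order values below are made up purely for illustration (deliberately right-skewed, with two large orders), and the bootstrap shown is the simple percentile variant rather than BCa, so treat this as a rough sketch rather than a recommended production recipe.

```python
import numpy as np
from scipy import stats

# Hypothetical order values in dollars (invented for this example)
orders = np.array([23.5, 41.0, 18.2, 35.9, 27.4, 120.0,
                   31.1, 22.8, 45.6, 29.9, 38.3, 210.0])
n = len(orders)
mean = orders.mean()
se = orders.std(ddof=1) / np.sqrt(n)      # sample SD divided by sqrt(n)

# t-based 95% interval with n - 1 = 11 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
t_ci = (mean - t_crit * se, mean + t_crit * se)

# Percentile bootstrap: resample the 12 values with replacement many times
rng = np.random.default_rng(0)
boot_means = rng.choice(orders, size=(10_000, n), replace=True).mean(axis=1)
boot_lo, boot_hi = np.percentile(boot_means, [2.5, 97.5])

print(f"Sample mean: {mean:.2f}, t-critical: {t_crit:.3f}")
print(f"t interval:         ({t_ci[0]:.2f}, {t_ci[1]:.2f})")
print(f"Bootstrap interval: ({boot_lo:.2f}, {boot_hi:.2f})")
```

The t interval is symmetric around the mean by construction, while the bootstrap interval is free to stretch further toward the long right tail, which is the behavior the follow-up below discusses.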
**Follow-up: How does the bootstrap approach work for this problem, and why might it be better than the t-interval?**

Bootstrap resamples the 12 observations with replacement thousands of times, computes the mean of each resample, and uses the distribution of those means to form a confidence interval. The key advantage is that it makes no assumption about the underlying distribution -- no normality required. For order values, which are often right-skewed with occasional large orders, this matters because the t-interval assumes approximate normality of the sampling distribution, which may not hold well at n=12 with skewed data. The bootstrap percentile interval (2.5th and 97.5th percentiles of the bootstrap means) will naturally be asymmetric if the data is skewed, giving a more honest representation of uncertainty. The downside is that with only 12 original data points, the bootstrap has limited "material" to work with, so its coverage properties are not guaranteed. With very small n, the bias-corrected and accelerated (BCa) bootstrap is preferred over the simple percentile method.

**Strong Answer:**

* Standard deviation (SD) measures the spread of individual observations in your data. If heights have SD = 3 inches, that tells you individual people vary by about 3 inches from the average height.
* Standard error (SE) measures the precision of a sample statistic -- how much that statistic would vary if you repeated the experiment. The standard error of the mean is SD / sqrt(n). With n=100 and SD=3, the SE is 0.3 inches. This tells you your sample mean is precise to about 0.3 inches.
* Candidates confuse them because both involve similar-looking formulas and both measure "variability." The critical distinction is what is varying: SD describes variability of the data, SE describes variability of the estimate. Doubling the sample size barely changes the SD but cuts the SE by a factor of sqrt(2).
* In practice, the confusion causes real errors.
If someone reports "mean response time is 200ms with standard deviation 50ms" and someone else interprets that 50ms as the standard error, they would think the mean is very imprecisely estimated (CI of roughly 200 plus or minus 100ms) when actually the standard error might be 5ms (CI of roughly 200 plus or minus 10ms) -- a very different statement about confidence.

**Follow-up: In a research paper, the error bars on a chart could represent SD, SE, or a 95% CI. How do you tell which, and why does it matter?**

You check the figure legend or methods section -- responsible papers specify which measure is used. If they do not, that is already a red flag. The visual impact is dramatically different: SE bars are sqrt(n) times smaller than SD bars, and 95% CI bars are about 2x wider than SE bars. A chart with SD bars looks like the data is noisy and the groups overlap heavily. The same data with SE bars looks like the groups are cleanly separated. This is why some researchers have been criticized for cherry-picking: showing SE bars when they want to emphasize differences and SD bars when they want to show the "spread." The gold standard is to show 95% CI bars, because they directly communicate the precision of the estimate and let the reader roughly gauge significance by eye. Overlapping 95% CIs do not by themselves prove the difference is non-significant, so the visual check is a guide rather than a substitute for a test.
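
To see how different the three kinds of error bar are for the same data, the sketch below computes their half-widths for the response-time example. It assumes n = 100, which is the sample size that would turn SD = 50ms into SE = 5ms (the text does not state n). `error_bar_half_widths` is a hypothetical helper, and the CI uses the normal approximation rather than a t critical value.

```python
import math
import statistics


def error_bar_half_widths(sd, n, confidence=0.95):
    """Half-widths of SD, SE, and normal-approximation CI error bars for a mean."""
    se = sd / math.sqrt(n)  # standard error of the mean
    # Two-sided critical value from the standard normal (about 1.96 for 95%).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return {"sd": sd, "se": se, "ci": z * se}


# Response-time example from the text: mean 200 ms, SD = 50 ms.
# n = 100 is an assumption chosen so that SE = 5 ms.
bars = error_bar_half_widths(sd=50.0, n=100)
doubled = error_bar_half_widths(sd=50.0, n=200)

for name, half_width in bars.items():
    print(f"{name.upper()} bars: 200 +/- {half_width:.1f} ms")

# SD bars are sqrt(n) times longer than SE bars.
print(f"SD / SE at n=100: {bars['sd'] / bars['se']:.1f}")
# Doubling n leaves SD unchanged but shrinks SE by sqrt(2).
print(f"SE ratio after doubling n: {bars['se'] / doubled['se']:.3f}")
```

Plotting the same group means with each of the three half-widths in turn is a quick way to check which one an unlabeled chart is most likely showing.
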