# Hypothesis Testing: Real Effect or Random Noise?

> The scientific method for determining if differences are meaningful

## The A/B Testing Problem

You work at an e-commerce company. The design team created a new checkout button - green instead of blue. After running both versions for a week:

| Version        | Visitors | Purchases | Conversion Rate |
| -------------- | -------- | --------- | --------------- |
| Blue (Control) | 10,000   | 320       | 3.20%           |
| Green (New)    | 10,000   | 355       | 3.55%           |

The green button has a higher conversion rate. But is this a **real improvement** or just **random chance**?

This is the fundamental question of hypothesis testing.

**Analogy**: Hypothesis testing is like a courtroom trial for data. The null hypothesis is the "defendant" -- innocent until proven guilty. The data is the "evidence." The p-value is how surprising the evidence would be if the defendant were truly innocent. And just like in court, you can make two kinds of mistakes: convicting an innocent person (false positive) or letting a guilty one go free (false negative). The entire framework is about managing those risks.

**Estimated Time**: 4-5 hours\
**Difficulty**: Intermediate\
**Prerequisites**: Modules 1-4 (especially Distributions and Inference)\
**What You'll Build**: Complete A/B testing framework

***

## The Framework: Innocent Until Proven Guilty

Hypothesis testing borrows from the legal system:

| Legal System                                         | Hypothesis Testing                          |
| ---------------------------------------------------- | ------------------------------------------- |
| Defendant is innocent until proven guilty            | No effect until proven otherwise            |
| Prosecution must prove guilt beyond reasonable doubt | Data must prove effect with high confidence |
| Jury verdict: guilty or not guilty                   | Decision: reject or fail to reject null     |
| "Not guilty" ≠ "innocent"                            | "Fail to reject" ≠ "effect doesn't exist"   |

*Figure: Hypothesis Testing Framework*

### The Two Hypotheses

**Null Hypothesis (H₀)**: The default assumption. Nothing special is happening.
* "The new button has the same conversion rate as the old one"
* "The drug has no effect"
* "The two groups are the same"

**Alternative Hypothesis (H₁ or Hₐ)**: What we're trying to prove.

* "The new button has a different conversion rate"
* "The drug has an effect"
* "The groups are different"

```python theme={null}
# Our A/B test hypotheses:
# H₀: p_green = p_blue (no difference)
# H₁: p_green ≠ p_blue (there is a difference)
```

***

## The P-Value: Quantifying Surprise

The **p-value** answers: "If there really were no effect, how likely would we be to see data this extreme?"

*Figure: P-Value Intuition*
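One way to make this definition concrete is to replay the experiment many times in a world where the null hypothesis is true and count how often chance alone produces a gap as large as the observed one. The block below is a minimal sketch of that idea, not the formal test used later; the seed, the number of simulations, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

n_per_group = 10_000
blue_purchases, green_purchases = 320, 355
observed_gap = abs(green_purchases - blue_purchases)  # 35 purchases

# Under H0 both buttons share one conversion rate; estimate it by pooling.
pooled_rate = (blue_purchases + green_purchases) / (2 * n_per_group)

# Replay the experiment many times in a world where H0 is true.
n_sims = 100_000
sim_blue = rng.binomial(n_per_group, pooled_rate, size=n_sims)
sim_green = rng.binomial(n_per_group, pooled_rate, size=n_sims)

# Two-sided p-value: share of null-world experiments whose gap is at
# least as large as the observed one, in either direction.
simulated_p_value = np.mean(np.abs(sim_green - sim_blue) >= observed_gap)

print(f"Simulated p-value: {simulated_p_value:.3f}")
```

The simulated value is an estimate, so it will differ slightly from the analytic tests, but it should lead to the same decision.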

### Interpreting P-Values

| P-Value   | Interpretation                                |
| --------- | --------------------------------------------- |
| p \< 0.01 | Strong evidence against null hypothesis       |
| p \< 0.05 | Moderate evidence against null hypothesis     |
| p \< 0.10 | Weak evidence against null hypothesis         |
| p ≥ 0.10  | Little to no evidence against null hypothesis |

**Common threshold (α)**: 0.05 (5%)

* If p \< 0.05, we reject the null hypothesis
* If p ≥ 0.05, we fail to reject the null hypothesis

**Critical Misconception**: The p-value is NOT the probability that the null hypothesis is true. It's the probability of seeing data this extreme IF the null hypothesis were true.

***

## Testing Our A/B Example

Let's test whether the green button is actually better:

```python theme={null}
import numpy as np
from scipy import stats

# Data
blue_visitors = 10000
blue_purchases = 320
blue_rate = blue_purchases / blue_visitors

green_visitors = 10000
green_purchases = 355
green_rate = green_purchases / green_visitors

print(f"Blue conversion rate: {blue_rate:.2%}")
print(f"Green conversion rate: {green_rate:.2%}")
print(f"Observed difference: {green_rate - blue_rate:.2%}")
```

### Method 1: Two-Proportion Z-Test

```python theme={null}
def two_proportion_z_test(x1, n1, x2, n2):
    """
    Test if two proportions are significantly different.

    x1, x2: number of successes
    n1, n2: number of trials

    Returns: z-statistic, p-value (two-tailed)
    """
    # Sample proportions
    p1 = x1 / n1
    p2 = x2 / n2

    # Pooled proportion (under null hypothesis)
    p_pool = (x1 + x2) / (n1 + n2)

    # Standard error under null
    se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

    # Z-statistic
    z = (p1 - p2) / se

    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    return z, p_value

# Run the test
z_stat, p_value = two_proportion_z_test(
    green_purchases, green_visitors,
    blue_purchases, blue_visitors
)

print(f"\nZ-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nResult: Reject null hypothesis")
    print("The difference is statistically significant at α=0.05")
else:
    print("\nResult: Fail to reject null hypothesis")
    print("The difference is NOT statistically significant at α=0.05")
```

**Output:**

```
Blue conversion rate: 3.20%
Green conversion rate: 3.55%
Observed difference: 0.35%

Z-statistic: 1.370
P-value: 0.1705

Result: Fail to reject null hypothesis
The difference is NOT statistically significant at α=0.05
```

Despite the 0.35 percentage point improvement, the p-value of 0.17 means we cannot rule out random chance.

**Step-by-step reasoning for interpreting this result**:

1. **What we observed**: Green button had 0.35 percentage points higher conversion.
2. **What we asked**: If there were truly no difference, how often would we see a gap this large by luck alone?
3. **What we found**: About 17% of the time (p = 0.17). That is not particularly rare.
4. **Our decision**: Since 17% is above our 5% threshold, we cannot confidently say the green button is better. The observed difference is plausible under random noise.
5. **What this does NOT mean**: It does not mean the green button is NOT better. It means we do not have enough evidence to conclude either way. A larger sample might reveal a real difference.

**ML Application -- Model Comparison**: This same logic applies when you compare two ML models.
If Model A gets 94.2% accuracy and Model B gets 93.8%, is A really better? Without a statistical test (like a paired t-test over cross-validation folds), you cannot know. Many ML practitioners deploy "better" models that are actually within the noise margin. Always test whether the difference is statistically significant before making deployment decisions.

### Method 2: Chi-Square Test

```python theme={null}
from scipy.stats import chi2_contingency

# Contingency table
#          Purchased   Not Purchased
# Blue        320          9680
# Green       355          9645
contingency_table = np.array([
    [blue_purchases, blue_visitors - blue_purchases],
    [green_purchases, green_visitors - green_purchases]
])

# correction=False disables Yates' continuity correction, which SciPy
# applies to 2x2 tables by default and which would change the p-value.
chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
```

**Output:**

```
Chi-square statistic: 1.878
P-value: 0.1705
Degrees of freedom: 1
```

Same p-value, same conclusion. For a 2×2 table, the uncorrected chi-square statistic is the square of the z-statistic (1.370² ≈ 1.878), so the two methods are equivalent. With SciPy's default continuity correction, the chi-square p-value would come out slightly larger.

***

## Types of Errors

We can make two types of mistakes:

|                       | H₀ is True (No Effect)        | H₀ is False (Real Effect)      |
| --------------------- | ----------------------------- | ------------------------------ |
| **Reject H₀**         | Type I Error (False Positive) | Correct Decision               |
| **Fail to Reject H₀** | Correct Decision              | Type II Error (False Negative) |

### Type I Error (α): False Positive

We claim there's an effect when there isn't one.

* Probability = α (typically 0.05)
* "The boy who cried wolf"
* Example: Launching a feature that doesn't actually help

### Type II Error (β): False Negative

We miss a real effect.
* Probability = β (varies, often 0.20)
* Power = 1 - β (typically 0.80)
* Example: Abandoning a feature that would have helped

```python theme={null}
# Visualize the tradeoff
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Type I Error
x = np.linspace(-4, 4, 1000)
null_dist = stats.norm.pdf(x, 0, 1)

axes[0].plot(x, null_dist, 'b-', linewidth=2, label='Null Distribution')
axes[0].fill_between(x[x > 1.96], null_dist[x > 1.96], alpha=0.3, color='red',
                     label='Type I Error Region (α/2 per tail)')
axes[0].fill_between(x[x < -1.96], null_dist[x < -1.96], alpha=0.3, color='red')
axes[0].axvline(1.96, color='red', linestyle='--')
axes[0].axvline(-1.96, color='red', linestyle='--')
axes[0].set_title('Type I Error (False Positive)')
axes[0].legend()

# Type II Error (with alternative distribution)
alt_dist = stats.norm.pdf(x, 2, 1)  # Effect exists, shifted right

axes[1].plot(x, null_dist, 'b-', linewidth=2, label='Null (No Effect)')
axes[1].plot(x, alt_dist, 'g-', linewidth=2, label='Alternative (Real Effect)')
# Under the alternative, any statistic landing between the critical
# values fails to reject H0: that area is β.
inside = (x > -1.96) & (x < 1.96)
axes[1].fill_between(x[inside], alt_dist[inside], alpha=0.3, color='orange',
                     label='Type II Error Region (β)')
axes[1].axvline(1.96, color='red', linestyle='--')
axes[1].axvline(-1.96, color='red', linestyle='--')
axes[1].set_title('Type II Error (False Negative)')
axes[1].legend()

plt.tight_layout()
plt.show()
```

***

## Statistical Power: Ability to Detect Real Effects

**Power** = Probability of detecting an effect when it exists = 1 - β

Higher power means:

* Less likely to miss real effects
* Requires larger sample sizes
* More confidence in negative results

```python theme={null}
def power_proportion_test(p1, p2, n, alpha=0.05):
    """
    Calculate power for a two-proportion test.

    p1: control proportion
    p2: treatment proportion
    n: sample size per group
    alpha: significance level
    """
    # Effect size
    effect = abs(p2 - p1)

    # Pooled standard error under null
    p_pool = (p1 + p2) / 2
    se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)

    # Standard error under alternative
    se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)

    # Critical value
    z_crit = stats.norm.ppf(1 - alpha / 2)

    # Power
    z_power = (effect - z_crit * se_null) / se_alt
    power = stats.norm.cdf(z_power)

    return power

# Our A/B test: 3.20% vs 3.55%, n=10,000 per group
power = power_proportion_test(0.032, 0.0355, 10000)
print(f"Power of our test: {power:.1%}")

# What if we had 50,000 per group?
power_large = power_proportion_test(0.032, 0.0355, 50000)
print(f"Power with n=50,000: {power_large:.1%}")
```

**Output:**

```
Power of our test: 27.8%
Power with n=50,000: 86.5%
```

With only 10,000 per group, we had only about a 28% chance of detecting that 0.35 percentage point difference. No wonder we failed to find significance.

**Analogy**: Running an underpowered test is like using a metal detector at the beach with the sensitivity turned way down. Even if there is buried treasure, you are unlikely to find it. Power analysis tells you how sensitive your detector needs to be (how large your sample must be) to find treasure of a given size (your minimum detectable effect).

**Statistical Mistake in ML -- Underpowered Hyperparameter Searches**: This same power problem plagues ML practitioners who do hyperparameter tuning with small validation sets. You try 50 configurations, pick the "best" one, but the differences are smaller than the noise. You have essentially picked a random configuration and convinced yourself it is optimal. Use cross-validation with enough folds and enough data per fold to ensure the differences you are selecting on are real.
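Power can also be checked empirically: simulate many experiments in which the effect really exists and count how often the z-test rejects the null. The sketch below assumes equal group sizes and treats 3.20% and 3.55% as the true rates; `simulated_power`, its defaults, and the seed are illustrative choices, not part of the framework built later.

```python
import numpy as np
from scipy import stats

def simulated_power(p1, p2, n, alpha=0.05, n_sims=20_000, seed=0):
    """Estimate power by replaying the A/B test with known true rates."""
    rng = np.random.default_rng(seed)

    # Simulate purchase counts for both groups when the effect is real.
    x1 = rng.binomial(n, p1, size=n_sims)
    x2 = rng.binomial(n, p2, size=n_sims)

    # Two-proportion z-test on every simulated experiment (equal group sizes).
    p_pool = (x1 + x2) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (x2 - x1) / n / se
    p_values = 2 * stats.norm.sf(np.abs(z))

    # Power = fraction of simulated experiments that reject H0.
    return np.mean(p_values < alpha)

power_10k = simulated_power(0.032, 0.0355, n=10_000)
power_50k = simulated_power(0.032, 0.0355, n=50_000)

print(f"Simulated power, n=10,000: {power_10k:.1%}")
print(f"Simulated power, n=50,000: {power_50k:.1%}")
```

Because the estimate comes from random draws, it wobbles by a fraction of a percentage point between seeds, but it should land close to what the analytic formula computes for the same inputs.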
***

## Sample Size Calculation for Desired Power

```python theme={null}
def sample_size_proportion_test(p1, p2, power=0.80, alpha=0.05):
    """
    Calculate required sample size per group.

    p1: expected control proportion
    p2: expected treatment proportion
    power: desired power (typically 0.80)
    alpha: significance level (typically 0.05)
    """
    # Effect size
    effect = abs(p2 - p1)

    # Pooled proportion
    p_pool = (p1 + p2) / 2

    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    # Variance terms
    var_null = 2 * p_pool * (1 - p_pool)
    var_alt = p1 * (1 - p1) + p2 * (1 - p2)

    # Sample size formula
    n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2

    return int(np.ceil(n))

# How many visitors do we need to detect a 0.35% difference?
n_needed = sample_size_proportion_test(0.032, 0.0355)
print(f"Need {n_needed:,} visitors per group to detect 0.35% difference with 80% power")

# For a larger effect (1% improvement)
n_1pct = sample_size_proportion_test(0.032, 0.042)
print(f"Need {n_1pct:,} visitors per group to detect 1.0% difference with 80% power")
```

**Output:**

```
Need 41,789 visitors per group to detect 0.35% difference with 80% power
Need 5,593 visitors per group to detect 1.0% difference with 80% power
```

***

## Common Statistical Tests

### 1. One-Sample t-Test

Is this sample mean different from a known value?

```python theme={null}
# Are our website load times different from the 3-second industry standard?
load_times = np.array([2.8, 3.2, 2.9, 3.5, 2.7, 3.1, 2.6, 3.0, 2.9, 3.3])

t_stat, p_value = stats.ttest_1samp(load_times, 3.0)

print(f"Sample mean: {np.mean(load_times):.2f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```

### 2. Two-Sample t-Test

Are the means of two groups different?

```python theme={null}
# Do users spend more time on new homepage design?
old_design_time = np.array([45, 52, 38, 61, 42, 55, 48, 50, 44, 58])
new_design_time = np.array([58, 62, 55, 70, 65, 60, 68, 72, 63, 59])

t_stat, p_value = stats.ttest_ind(old_design_time, new_design_time)

print(f"Old design mean: {np.mean(old_design_time):.1f}s")
print(f"New design mean: {np.mean(new_design_time):.1f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```

### 3. Paired t-Test

Before/after comparisons on the same subjects:

```python theme={null}
# Does a training program improve test scores?
before = np.array([65, 72, 58, 80, 75, 62, 70, 68, 74, 78])
after = np.array([70, 78, 62, 85, 82, 68, 75, 72, 80, 82])

# Pass (after, before) so a positive t-statistic means scores improved,
# matching the sign of the printed mean improvement.
t_stat, p_value = stats.ttest_rel(after, before)

print(f"Mean improvement: {np.mean(after - before):.1f} points")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```

### 4. ANOVA

Are three or more groups different?

```python theme={null}
# Do three different ad campaigns have different click rates?
campaign_a = np.array([2.1, 2.3, 2.0, 2.4, 2.2])
campaign_b = np.array([2.8, 3.0, 2.9, 3.1, 2.7])
campaign_c = np.array([2.3, 2.5, 2.4, 2.6, 2.2])

f_stat, p_value = stats.f_oneway(campaign_a, campaign_b, campaign_c)

print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```

***

## Complete A/B Testing Framework

```python theme={null}
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ABTestResult:
    """Results of an A/B test."""
    control_rate: float
    treatment_rate: float
    relative_lift: float
    absolute_lift: float
    z_statistic: float
    p_value: float
    confidence_interval: Tuple[float, float]
    is_significant: bool
    power: float

class ABTestAnalyzer:
    """
    Complete A/B testing framework with proper statistical methodology.
    """

    def __init__(self, alpha: float = 0.05, power_threshold: float = 0.80):
        self.alpha = alpha
        self.power_threshold = power_threshold

    def run_test(
        self,
        control_successes: int,
        control_total: int,
        treatment_successes: int,
        treatment_total: int
    ) -> ABTestResult:
        """Run a two-proportion z-test."""
        # Calculate rates
        p_control = control_successes / control_total
        p_treatment = treatment_successes / treatment_total

        # Lifts
        absolute_lift = p_treatment - p_control
        relative_lift = (p_treatment - p_control) / p_control if p_control > 0 else 0

        # Pooled proportion
        p_pool = (control_successes + treatment_successes) / (control_total + treatment_total)

        # Standard error
        se = np.sqrt(p_pool * (1 - p_pool) * (1/control_total + 1/treatment_total))

        # Z-statistic
        z = absolute_lift / se if se > 0 else 0

        # P-value (two-tailed)
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))

        # Confidence interval for the difference
        se_diff = np.sqrt(
            p_control * (1 - p_control) / control_total +
            p_treatment * (1 - p_treatment) / treatment_total
        )
        z_crit = stats.norm.ppf(1 - self.alpha / 2)
        ci = (absolute_lift - z_crit * se_diff, absolute_lift + z_crit * se_diff)

        # Power (approximate)
        power = self._calculate_power(p_control, p_treatment,
                                      min(control_total, treatment_total))

        return ABTestResult(
            control_rate=p_control,
            treatment_rate=p_treatment,
            relative_lift=relative_lift,
            absolute_lift=absolute_lift,
            z_statistic=z,
            p_value=p_value,
            confidence_interval=ci,
            is_significant=p_value < self.alpha,
            power=power
        )

    def _calculate_power(self, p1: float, p2: float, n: int) -> float:
        """Calculate statistical power."""
        effect = abs(p2 - p1)
        if effect == 0:
            return 0

        p_pool = (p1 + p2) / 2
        se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)

        z_crit = stats.norm.ppf(1 - self.alpha / 2)
        z_power = (effect - z_crit * se_null) / se_alt

        return stats.norm.cdf(z_power)

    def required_sample_size(
        self,
        baseline_rate: float,
        minimum_detectable_effect: float,
        power: float = 0.80
    ) -> int:
        """Calculate required sample size per group."""
        p1 = baseline_rate
        p2 = baseline_rate * (1 + minimum_detectable_effect)

        effect = abs(p2 - p1)
        p_pool = (p1 + p2) / 2

        z_alpha = stats.norm.ppf(1 - self.alpha / 2)
        z_beta = stats.norm.ppf(power)

        var_null = 2 * p_pool * (1 - p_pool)
        var_alt = p1 * (1 - p1) + p2 * (1 - p2)

        n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2

        return int(np.ceil(n))

    def print_report(self, result: ABTestResult, test_name: str = "A/B Test"):
        """Print a formatted test report."""
        print("\n" + "=" * 60)
        print(f"A/B TEST REPORT: {test_name}")
        print("=" * 60)

        print(f"\nConversion Rates:")
        print(f"  Control:   {result.control_rate:.2%}")
        print(f"  Treatment: {result.treatment_rate:.2%}")

        print(f"\nLift:")
        print(f"  Absolute: {result.absolute_lift:+.2%}")
        print(f"  Relative: {result.relative_lift:+.1%}")

        print(f"\nStatistical Analysis:")
        print(f"  Z-statistic: {result.z_statistic:.3f}")
        print(f"  P-value: {result.p_value:.4f}")
        print(f"  95% CI for difference: ({result.confidence_interval[0]:+.2%}, {result.confidence_interval[1]:+.2%})")

        print(f"\nTest Quality:")
        print(f"  Power: {result.power:.1%}")
        if result.power < self.power_threshold:
            print(f"  Warning: Low power. Consider larger sample size.")

        print(f"\nConclusion (α = {self.alpha}):")
        if result.is_significant:
            if result.absolute_lift > 0:
                print("  SIGNIFICANT: Treatment performs BETTER than control")
            else:
                print("  SIGNIFICANT: Treatment performs WORSE than control")
        else:
            print("  NOT SIGNIFICANT: Cannot conclude a difference exists")
            if result.power < self.power_threshold:
                print("  Note: Low power means we might be missing a real effect")

        print("=" * 60)

# Usage example
analyzer = ABTestAnalyzer(alpha=0.05)

# Test 1: Original example (not significant)
result1 = analyzer.run_test(
    control_successes=320,
    control_total=10000,
    treatment_successes=355,
    treatment_total=10000
)
analyzer.print_report(result1, "Checkout Button Color")

# Test 2: Larger sample (now significant!)
result2 = analyzer.run_test(
    control_successes=3200,
    control_total=100000,
    treatment_successes=3550,
    treatment_total=100000
)
analyzer.print_report(result2, "Checkout Button Color (Large Sample)")

# Calculate required sample size
n_required = analyzer.required_sample_size(
    baseline_rate=0.032,
    minimum_detectable_effect=0.10  # 10% relative improvement
)
print(f"\nTo detect 10% relative improvement with 80% power:")
print(f"Need {n_required:,} visitors per group")
```

***

## Common Mistakes to Avoid

### 1. Peeking and Early Stopping

```python theme={null}
# BAD: Stopping as soon as p < 0.05
# Checking repeatedly and stopping at the first "significant" result
# pushes the false positive rate well above the nominal 5%.

# GOOD: Pre-specify sample size and run to completion
```

### 2. Multiple Testing Without Correction

```python theme={null}
# Testing 20 variations? Some will be "significant" by chance!

# Bonferroni correction
alpha_corrected = 0.05 / 20  # = 0.0025

# Or use False Discovery Rate (FDR) correction
```

***

## Interview Questions

**Question**: Your A/B test shows the treatment group has 3.5% conversion vs 3.2% control. The p-value is 0.08. What do you tell stakeholders?

**Answer**: "At our standard α=0.05 threshold, this result is not statistically significant (p=0.08). However, there are a few considerations:

1. **The observed difference is 0.3 percentage points** - if real, this could be meaningful at scale
2. **The p-value of 0.08 suggests weak evidence** against the null hypothesis, not proof the treatment doesn't work
3. **Consider power analysis** - we may have been underpowered to detect this effect size
4. **Practical significance** - if the change is low-risk and low-cost, you might still consider implementing

Recommendation: If resources allow, run a larger test to get more conclusive results."

**Question**: You test 20 different variations of a product page. Three show p-values under 0.05. What's the problem?

**Answer**: With 20 tests at α=0.05, we expect about 1 false positive even if nothing is different!
**Expected false positives** = 20 × 0.05 = 1

Solutions:

1. **Bonferroni correction**: Use α = 0.05/20 = 0.0025 as threshold
2. **Benjamini-Hochberg (FDR)**: Control the expected proportion of false discoveries
3. **Holdout validation**: Test the "winners" on fresh data

```python theme={null}
# Bonferroni
alpha_corrected = 0.05 / 20  # 0.0025
# Only results with p < 0.0025 are significant
```

**Question**: An A/B test with 10,000 users per group shows no significant difference. Does this prove the treatment has no effect?

**Answer**: No! "Fail to reject null" ≠ "null is true"

We need to consider statistical power:

1. **What effect size could we detect?** With n=10,000, we might only detect large effects
2. **What was our power?** If power was 50%, we had a coin flip's chance of detecting a real effect
3. **What's the confidence interval?** With p > 0.05 the CI typically includes zero, but if it is wide it also includes effects large enough to matter, so those have not been ruled out

```python theme={null}
# For a conversion test at 5% baseline, n=10,000/group
# (α = 0.05, 80% power) we can reliably detect differences
# of roughly 0.9 percentage points.
# Smaller effects would require larger samples
```

The correct conclusion: "We failed to find evidence of an effect of size > X"

**Question**: You're running an A/B test. After 2 days (of planned 7), you peek at results and see p=0.03. Should you stop the test?

**Answer**: No! This is "optional stopping", a form of p-hacking, and it inflates false positive rates.

The p-value assumes you only look once at the end. If you peek repeatedly at equally spaced points and stop at the first p \< 0.05:

* With 5 peeks, your actual false positive rate is \~14%, not 5%
* With 10 peeks, it's \~19%

Proper approaches:

1. **Sequential testing** with adjusted thresholds (O'Brien-Fleming, Pocock)
2. **Bayesian methods** that allow continuous monitoring
3. **Pre-commit** to analysis plan and stick to it

```python theme={null}
# O'Brien-Fleming boundaries for 5 interim analyses
# (approximate nominal thresholds):
# Look 1: α = 0.00001
# Look 2: α = 0.001
# Look 3: α = 0.01
# Look 4: α = 0.02
# Look 5: α = 0.04
```

***

## Practice Challenge

Create a production-ready A/B test analysis tool:

```python theme={null}
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ABTestResult:
    control_rate: float
    treatment_rate: float
    relative_lift: float
    absolute_lift: float
    z_statistic: float
    p_value: float
    confidence_interval: Tuple[float, float]
    power: float
    is_significant: bool
    recommendation: str

class ProductionABTest:
    """
    Production-ready A/B test analyzer with:
    - Power analysis
    - Effect size estimation
    - Confidence intervals
    - Clear recommendations
    """

    def __init__(self, alpha: float = 0.05, min_detectable_effect: float = 0.1):
        self.alpha = alpha
        self.mde = min_detectable_effect

    def analyze(
        self,
        control_conversions: int,
        control_visitors: int,
        treatment_conversions: int,
        treatment_visitors: int,
        test_name: str = "A/B Test"
    ) -> ABTestResult:
        """Analyze A/B test results."""
        # Your implementation here
        pass

    def recommend_sample_size(
        self,
        baseline_rate: float,
        min_detectable_effect: float,
        power: float = 0.8
    ) -> int:
        """Calculate required sample size per group."""
        # Your implementation here
        pass

    def generate_report(self, result: ABTestResult) -> str:
        """Generate human-readable report."""
        # Your implementation here
        pass

# Test your implementation:
test = ProductionABTest()

# Scenario 1: Clear winner
result1 = test.analyze(
    control_conversions=500,
    control_visitors=10000,
    treatment_conversions=600,
    treatment_visitors=10000
)

# Scenario 2: Inconclusive
result2 = test.analyze(
    control_conversions=510,
    control_visitors=10000,
    treatment_conversions=530,
    treatment_visitors=10000
)

# Scenario 3: Treatment is worse
result3 = test.analyze(
    control_conversions=500,
    control_visitors=10000,
    treatment_conversions=420,
    treatment_visitors=10000
)
```

**Full Solution**:

```python theme={null}
class ProductionABTest:
    def __init__(self, alpha: float = 0.05, min_detectable_effect: float = 0.1):
        self.alpha = alpha
        self.mde = min_detectable_effect

    def analyze(
        self,
        control_conversions: int,
        control_visitors: int,
        treatment_conversions: int,
        treatment_visitors: int,
        test_name: str = "A/B Test"
    ) -> ABTestResult:
        # Calculate rates
        p_c = control_conversions / control_visitors
        p_t = treatment_conversions / treatment_visitors

        # Effect sizes
        absolute_lift = p_t - p_c
        relative_lift = (p_t - p_c) / p_c if p_c > 0 else 0

        # Pooled proportion and standard error
        p_pool = (control_conversions + treatment_conversions) / \
                 (control_visitors + treatment_visitors)
        se = np.sqrt(p_pool * (1 - p_pool) *
                     (1/control_visitors + 1/treatment_visitors))

        # Z-test
        z = absolute_lift / se if se > 0 else 0
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))

        # Confidence interval for difference
        se_diff = np.sqrt(
            p_c * (1 - p_c) / control_visitors +
            p_t * (1 - p_t) / treatment_visitors
        )
        z_crit = stats.norm.ppf(1 - self.alpha/2)
        ci = (absolute_lift - z_crit * se_diff,
              absolute_lift + z_crit * se_diff)

        # Power calculation
        power = self._calculate_power(
            control_visitors, treatment_visitors, p_c, p_t
        )

        # Significance check
        is_significant = p_value < self.alpha

        # Generate recommendation
        recommendation = self._generate_recommendation(
            p_c, p_t, p_value, power, is_significant
        )

        return ABTestResult(
            control_rate=p_c,
            treatment_rate=p_t,
            relative_lift=relative_lift,
            absolute_lift=absolute_lift,
            z_statistic=z,
            p_value=p_value,
            confidence_interval=ci,
            power=power,
            is_significant=is_significant,
            recommendation=recommendation
        )

    def _calculate_power(self, n1, n2, p1, p2):
        """Calculate achieved power."""
        effect = abs(p2 - p1)
        pooled_p = (p1 + p2) / 2
        se = np.sqrt(pooled_p * (1 - pooled_p) * (1/n1 + 1/n2))
        z_crit = stats.norm.ppf(1 - self.alpha/2)
        z_power = (effect / se) - z_crit
        return stats.norm.cdf(z_power)

    def _generate_recommendation(self, p_c, p_t, p_value, power, sig):
        if sig and p_t > p_c:
            return "SHIP IT: Treatment significantly outperforms control"
        elif sig and p_t < p_c:
            return "STOP: Treatment significantly underperforms control"
        elif not sig and power < 0.5:
            return "INCONCLUSIVE: Test underpowered, consider running longer"
        elif not sig and power >= 0.8:
            return "NO EFFECT: High-powered test found no significant difference"
        else:
            return "BORDERLINE: Consider practical significance and run longer"

    def recommend_sample_size(self, baseline_rate, min_detectable_effect, power=0.8):
        # Parameter names match the skeleton above so keyword calls work.
        target_rate = baseline_rate * (1 + min_detectable_effect)
        effect = target_rate - baseline_rate
        pooled_p = (baseline_rate + target_rate) / 2

        z_alpha = stats.norm.ppf(1 - self.alpha/2)
        z_beta = stats.norm.ppf(power)

        n = 2 * pooled_p * (1 - pooled_p) * ((z_alpha + z_beta) / effect) ** 2
        return int(np.ceil(n))
```

***

## 📝 Practice Exercises

* Conduct a one-sample hypothesis test
* Analyze an A/B test for conversion rates
* Calculate statistical power and sample size
* Real-world: Drug trial effectiveness testing

**Exercise 1: Website Load Time Test** - One-sample t-test

**Problem**: Your website claims an average load time of 2.0 seconds. A sample of 30 page loads shows:

* Sample mean: 2.3 seconds
* Sample std dev: 0.6 seconds

1. State the null and alternative hypotheses
2. Calculate the t-statistic
3. Find the p-value (two-tailed)
4. At α = 0.05, do you reject the claim?

**Solution**:

```python theme={null}
import numpy as np
from scipy import stats

# Sample data
x_bar = 2.3   # sample mean
s = 0.6       # sample std dev
n = 30        # sample size
mu_0 = 2.0    # claimed value

# 1. Hypotheses
print("H₀: μ = 2.0 seconds (load time equals claim)")
print("H₁: μ ≠ 2.0 seconds (load time differs from claim)")

# 2. Calculate t-statistic
se = s / np.sqrt(n)  # standard error
t_stat = (x_bar - mu_0) / se
print(f"\nStandard Error: {se:.4f}")
print(f"t-statistic: {t_stat:.4f}")

# 3. P-value (two-tailed)
df = n - 1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
print(f"Degrees of freedom: {df}")
print(f"P-value (two-tailed): {p_value:.4f}")

# 4. Decision at α = 0.05
alpha = 0.05
critical_t = stats.t.ppf(1 - alpha/2, df)
print(f"\n--- Decision at α = {alpha} ---")
print(f"Critical t-value: ±{critical_t:.4f}")

if p_value < alpha:
    print(f"P-value ({p_value:.4f}) < α ({alpha})")
    print("✗ REJECT H₀: Website is NOT meeting the 2.0s claim")
    print(f"\nConclusion: Load time ({x_bar}s) is significantly higher than claimed ({mu_0}s)")
else:
    print(f"P-value ({p_value:.4f}) ≥ α ({alpha})")
    print("✓ FAIL TO REJECT H₀: No evidence against the claim")

# Note: stats.ttest_1samp(data, mu_0) runs the same test directly,
# but it needs the raw measurements, not just the summary statistics.
```
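When raw measurements are available, `scipy.stats.ttest_1samp` performs this test in one call. The exercise provides only summary statistics, so the sketch below fabricates a purely hypothetical sample and rescales it to have exactly the stated mean and standard deviation; the data are an illustration device, not real load times.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical raw load times: random draws rescaled so the sample has
# exactly the exercise's summary statistics (mean 2.3 s, std 0.6 s, n = 30).
raw = rng.normal(size=30)
standardized = (raw - raw.mean()) / raw.std(ddof=1)
load_times = 2.3 + 0.6 * standardized

# One-sample t-test against the claimed mean of 2.0 s.
result = stats.ttest_1samp(load_times, popmean=2.0)

# Manual calculation from the summary statistics, for comparison.
manual_t = (2.3 - 2.0) / (0.6 / np.sqrt(30))

print(f"scipy t-statistic:    {result.statistic:.4f}")
print(f"manual t-statistic:   {manual_t:.4f}")
print(f"p-value (two-tailed): {result.pvalue:.4f}")
```

The t-statistic depends on the data only through the sample mean, standard deviation, and size, which is why any sample with those summary values gives the same answer as the manual calculation.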

**Exercise 2: A/B Test Analysis** - Two-proportion z-test

**Problem**: You're testing a new checkout flow:

* Control: 5,000 visitors, 200 purchases (4.0%)
* Treatment: 5,000 visitors, 240 purchases (4.8%)

1. Is the 0.8 percentage point improvement statistically significant at α = 0.05?
2. Calculate the 95% confidence interval for the difference
3. What's the relative lift?
4. Should you ship the new checkout flow?

**Solution**:

```python theme={null}
import numpy as np
from scipy import stats

# Data
n_c, x_c = 5000, 200  # Control
n_t, x_t = 5000, 240  # Treatment

p_c = x_c / n_c  # 4.0%
p_t = x_t / n_t  # 4.8%
diff = p_t - p_c  # 0.8 percentage points

print(f"Control rate: {p_c:.2%}")
print(f"Treatment rate: {p_t:.2%}")
print(f"Absolute difference: {diff:.2%}")

# 1. Two-proportion z-test
# Pooled proportion under null hypothesis
p_pool = (x_c + x_t) / (n_c + n_t)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1/n_c + 1/n_t))
z_stat = diff / se_pool
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

print(f"\nPooled proportion: {p_pool:.4f}")
print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"\n✓ Significant at α = {alpha}")
else:
    print(f"\n✗ NOT significant at α = {alpha}")

# 2. 95% CI for difference
# Use unpooled SE for CI
se_unpooled = np.sqrt(p_c*(1-p_c)/n_c + p_t*(1-p_t)/n_t)
z_crit = stats.norm.ppf(0.975)
moe = z_crit * se_unpooled
ci_lower = diff - moe
ci_upper = diff + moe

print(f"\n95% CI for difference: ({ci_lower:.2%}, {ci_upper:.2%})")

# 3. Relative lift
relative_lift = (p_t - p_c) / p_c
print(f"\nRelative lift: {relative_lift:.1%}")  # 20%

# 4. Business decision
print("\n--- Business Decision ---")
if p_value < alpha and ci_lower > 0:
    print("SHIP IT: Statistically significant improvement!")
    # Calculate expected impact
    monthly_visitors = 100000
    additional_conversions = monthly_visitors * diff
    print(f"Expected additional conversions/month: {additional_conversions:.0f}")
else:
    print("HOLD: Need more data or re-evaluate the change")
```
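The p-value in this exercise sits close to the 0.05 threshold, so it is worth confirming with a second method. As a sketch, the block below reruns the same data through `chi2_contingency` with `correction=False` (SciPy applies Yates' continuity correction to 2×2 tables by default) and checks that the chi-square statistic equals the squared z-statistic; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

x_c, n_c = 200, 5000  # control purchases, visitors
x_t, n_t = 240, 5000  # treatment purchases, visitors

# Two-proportion z-test with the pooled standard error.
p_pool = (x_c + x_t) / (n_c + n_t)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z_stat = (x_t / n_t - x_c / n_c) / se_pool
p_z = 2 * stats.norm.sf(abs(z_stat))

# Chi-square test on the 2x2 table, without the continuity correction.
table = np.array([[x_c, n_c - x_c],
                  [x_t, n_t - x_t]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)

print(f"z-statistic: {z_stat:.3f}   z squared: {z_stat**2:.3f}")
print(f"chi-square:  {chi2:.3f}")
print(f"p-value (z-test):     {p_z:.4f}")
print(f"p-value (chi-square): {p_chi2:.4f}")
```

Both methods agree, and the result stays just above 0.05, which is why the solution's decision logic lands on the "HOLD" branch despite the 20% relative lift.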

**Exercise 3: Power Analysis** - Sample size calculation

**Problem**: You're planning an A/B test. Your current conversion rate is 3%, and you want to detect a 10% relative lift (3.0% → 3.3%).

1. What sample size is needed per group for 80% power at α = 0.05?
2. What about 90% power?
3. If you only have 10,000 users, what's the minimum detectable effect?
4. How long will the test take at 1,000 visitors/day?

**Solution**:

```python theme={null}
import numpy as np
from scipy import stats

def sample_size_ab(p1, p2, alpha=0.05, power=0.80):
    """Calculate sample size per group for A/B test."""
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    # Pooled variance estimate
    p_pool = (p1 + p2) / 2
    effect = abs(p2 - p1)

    # Sample size per group
    n = 2 * p_pool * (1 - p_pool) * ((z_alpha + z_beta) / effect) ** 2
    return int(np.ceil(n))

def minimum_detectable_effect(n, p1, alpha=0.05, power=0.80):
    """Calculate MDE given sample size."""
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    # Approximation: uses the baseline variance for both groups
    se = np.sqrt(2 * p1 * (1 - p1) / n)
    mde = (z_alpha + z_beta) * se
    return mde

# Given parameters
p_baseline = 0.03     # 3%
relative_lift = 0.10  # 10%
p_target = p_baseline * (1 + relative_lift)  # 3.3%

print(f"Baseline rate: {p_baseline:.1%}")
print(f"Target rate: {p_target:.1%}")
print(f"Absolute difference: {p_target - p_baseline:.2%}")

# 1. Sample size for 80% power
n_80 = sample_size_ab(p_baseline, p_target, power=0.80)
print(f"\n1. Sample size for 80% power: {n_80:,} per group")
print(f"   Total users needed: {2*n_80:,}")

# 2. Sample size for 90% power
n_90 = sample_size_ab(p_baseline, p_target, power=0.90)
print(f"\n2. Sample size for 90% power: {n_90:,} per group")
print(f"   Increase from 80%: {(n_90-n_80)/n_80:.0%}")

# 3. MDE with 10,000 users
n_available = 5000  # per group
mde = minimum_detectable_effect(n_available, p_baseline)
relative_mde = mde / p_baseline
print(f"\n3. With 10,000 total users (5,000 per group):")
print(f"   Minimum Detectable Effect: {mde:.2%} absolute")
print(f"   Relative MDE: {relative_mde:.1%}")

# 4. Test duration (the daily visitors are split across both groups)
visitors_per_day = 1000
days_needed_80 = (2 * n_80) / visitors_per_day
days_needed_90 = (2 * n_90) / visitors_per_day
print(f"\n4. Test duration at {visitors_per_day:,} visitors/day:")
print(f"   For 80% power: {days_needed_80:.0f} days ({days_needed_80/7:.1f} weeks)")
print(f"   For 90% power: {days_needed_90:.0f} days ({days_needed_90/7:.1f} weeks)")
```
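
To see how quickly the required sample grows as the target effect shrinks, the sketch below reuses the same normal-approximation formula as the exercise. The helper name `sample_size_per_group` and the chosen lifts (20%, 10%, 5%) are illustrative assumptions, not part of the exercise. Because the effect enters the formula squared, each halving of the lift should cost roughly four times the traffic.

```python
from scipy import stats

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Same normal-approximation formula as the exercise (hypothetical helper name)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p_bar = (p1 + p2) / 2
    return 2 * p_bar * (1 - p_bar) * ((z_alpha + z_beta) / abs(p2 - p1)) ** 2

baseline = 0.03
sizes = {}
for lift in (0.20, 0.10, 0.05):  # each step halves the relative lift
    sizes[lift] = sample_size_per_group(baseline, baseline * (1 + lift))
    print(f"{lift:.0%} relative lift: {sizes[lift]:,.0f} users per group")

ratio_10_vs_20 = sizes[0.10] / sizes[0.20]
ratio_05_vs_10 = sizes[0.05] / sizes[0.10]
print(f"Halving 20% -> 10% multiplies n by {ratio_10_vs_20:.2f}")
print(f"Halving 10% -> 5% multiplies n by {ratio_05_vs_10:.2f}")
```

The ratios come out slightly below 4 because the pooled variance term also shifts a little with the target rate.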

**Exercise 4: Drug Trial Analysis** - Real-world hypothesis testing

**Problem**: A clinical trial tests a new blood pressure medication:

* Control (placebo): n=150, mean BP reduction = 2 mmHg, std = 8 mmHg
* Treatment (drug): n=150, mean BP reduction = 6 mmHg, std = 10 mmHg

1. Conduct a two-sample t-test
2. Calculate Cohen's d (effect size)
3. Is this clinically significant, not just statistically significant?
4. What are Type I and Type II error implications in this context?

**Solution**:

```python theme={null}
import numpy as np
from scipy import stats

# Control group
n_c = 150
mean_c = 2  # mmHg reduction
std_c = 8

# Treatment group
n_t = 150
mean_t = 6  # mmHg reduction
std_t = 10

# 1. Two-sample t-test
# Pooled standard error (assuming equal variance)
sp = np.sqrt(((n_c-1)*std_c**2 + (n_t-1)*std_t**2) / (n_c + n_t - 2))
se = sp * np.sqrt(1/n_c + 1/n_t)
t_stat = (mean_t - mean_c) / se
df = n_c + n_t - 2
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))

print("=== Two-Sample T-Test ===")
print(f"Control: {mean_c} mmHg reduction (std={std_c})")
print(f"Treatment: {mean_t} mmHg reduction (std={std_t})")
print(f"\nPooled std: {sp:.2f}")
print(f"t-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.6f}")

# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\n✓ Statistically significant (p < {alpha})")
else:
    print(f"\n✗ Not statistically significant")

# 2. Cohen's d (effect size)
# d = (mean_t - mean_c) / pooled_std
cohens_d = (mean_t - mean_c) / sp
print(f"\n=== Effect Size ===")
print(f"Cohen's d: {cohens_d:.3f}")

# Interpret effect size
if abs(cohens_d) < 0.2:
    effect_label = "negligible"
elif abs(cohens_d) < 0.5:
    effect_label = "small"
elif abs(cohens_d) < 0.8:
    effect_label = "medium"
else:
    effect_label = "large"
print(f"Interpretation: {effect_label} effect")

# 3. Clinical significance
print("\n=== Clinical Significance ===")
mean_diff = mean_t - mean_c
print(f"Mean difference: {mean_diff} mmHg")

# Clinically meaningful threshold often 5+ mmHg
clinical_threshold = 5
if mean_diff >= clinical_threshold:
    print(f"✓ Clinically significant (≥{clinical_threshold} mmHg)")
else:
    print(f"✗ May not be clinically meaningful (<{clinical_threshold} mmHg)")

# 95% CI for difference
ci_lower = mean_diff - stats.t.ppf(0.975, df) * se
ci_upper = mean_diff + stats.t.ppf(0.975, df) * se
print(f"95% CI: ({ci_lower:.1f}, {ci_upper:.1f}) mmHg")

# 4. Error implications
print("\n=== Error Implications ===")
print("Type I Error (False Positive):")
print("  - Approving an ineffective drug")
print("  - Patients take unnecessary medication with side effects")
print("  - Cost burden on healthcare system")
print("\nType II Error (False Negative):")
print("  - Rejecting an effective drug")
print("  - Patients denied beneficial treatment")
print("  - Continued suffering from high blood pressure")
print("\nIn drug trials, Type I error is often considered worse")
print("(hence the stringent α = 0.05 or even 0.01 threshold)")
```
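
The solution above pools the variances even though the two groups have different standard deviations (8 vs 10 mmHg). As a hedged cross-check, the sketch below runs both the pooled test and Welch's unequal-variance test from the summary statistics using SciPy's `ttest_ind_from_stats`. Because the group sizes are equal here, the two standard errors coincide, so the t-statistics should match and only the degrees of freedom (and hence the p-value, very slightly) differ.

```python
from scipy import stats

# Summary statistics from Exercise 4 (treatment first, then control)
summary = dict(mean1=6, std1=10, nobs1=150,
               mean2=2, std2=8, nobs2=150)

# Pooled-variance t-test (what the manual calculation does)
pooled = stats.ttest_ind_from_stats(**summary, equal_var=True)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind_from_stats(**summary, equal_var=False)

print(f"Pooled: t = {pooled.statistic:.3f}, p = {pooled.pvalue:.6f}")
print(f"Welch:  t = {welch.statistic:.3f}, p = {welch.pvalue:.6f}")
```

With unequal group sizes the two tests can diverge noticeably, which is why Welch's version is a common default when the variances look different.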

***

## Key Takeaways

* Null hypothesis: no effect (innocent)
* Alternative: there is an effect
* P-value: how surprising is the data if the null is true?
* Decision threshold: typically α = 0.05
* Type I (α): False positive, claiming an effect that doesn't exist
* Type II (β): False negative, missing a real effect
* Power = 1 - β: Ability to detect real effects
* Small samples = low power = missed effects
* Calculate sample size BEFORE running the test
* More precision requires quadratically more data: halving the detectable effect takes roughly 4× the sample
* Two proportions: Chi-square or z-test
* Two means: t-test
* Multiple groups: ANOVA
* Non-normal: Mann-Whitney U

***

## Common Pitfalls

**A/B Testing Mistakes to Avoid**:

1. **Peeking & Early Stopping** - Checking daily inflates false positives; use sequential testing methods instead
2. **Underpowered Tests** - Running with too few samples misses real effects; calculate sample size first
3. **Multiple Comparisons** - Testing 20 variants without correction gives roughly a 64% chance of at least one false positive (1 - 0.95^20, assuming independent tests)
4. **Ignoring Practical Significance** - A p \< 0.05 with 0.01% improvement isn't worth shipping
5. **One-Tailed When Uncertain** - Only use one-tailed tests when an effect in the opposite direction would genuinely not change your decision
6. **P-value Misinterpretation** - P-value is NOT the probability the null is true!

***

## Connection to Machine Learning

| Hypothesis Testing          | ML Application                               |
| --------------------------- | -------------------------------------------- |
| A/B testing                 | Model comparison, feature evaluation         |
| Power analysis              | Training set size planning                   |
| Multiple testing correction | Hyperparameter search, feature selection     |
| Type I/II errors            | Precision/Recall tradeoff                    |
| Significance testing        | Statistical validation of model improvements |

**ML Connection**: Every time you compare "Model A accuracy = 0.92 vs Model B accuracy = 0.89", you're implicitly doing hypothesis testing. The question is: is this 3-point difference real or just noise from your test set?
Proper statistical testing (like paired t-tests on cross-validation folds) gives you the answer.

**Coming up next**: We'll learn about **Correlation and Regression** - how to understand relationships between variables and make predictions. This is where statistics directly becomes machine learning.

***

## Interview Deep-Dive

**Strong Answer:**

* Neither is automatically right -- the answer depends on context that a p-value alone does not provide. A p=0.04 means the result is technically significant at alpha=0.05, but there are several things I would check before making a recommendation.
* First, what is the effect size? If the new variant improved conversion by 0.02 percentage points (from 3.00% to 3.02%), the result may be statistically significant with a large enough sample but practically meaningless. Shipping a code change, increasing technical debt, and potentially confusing users for a 0.02pp improvement is not worth it. I would compute the expected annual revenue impact and compare it to the implementation cost.
* Second, what is the power of the test? If we planned for 80% power to detect a 10% relative lift but only ran enough traffic for 50% power, then the effect estimate behind p=0.04 is probably inflated (the "winner's curse" -- significant results from underpowered tests tend to overestimate effect sizes).
* Third, did anyone peek at the data before the test concluded? If the team checked results daily, the effective false positive rate is much higher than 0.05 because of the repeated looks, and p=0.04 may not clear a threshold that is properly corrected for peeking.
* My recommendation: if the effect size is meaningful, the test was pre-registered, and nobody peeked, ship it. If any of those conditions are not met, run a confirmation test or extend the current one.

**Follow-up: Walk me through the "winner's curse" in A/B testing.
How does it inflate effect size estimates?**

The winner's curse occurs because we only ship results that cross the significance threshold. Imagine the true effect is a 2% lift. Due to sampling noise, your measured lift will be randomly higher or lower than 2%. If you only ship when p is less than 0.05, you are selecting for experiments where random noise happened to push the measured lift above 2%. The shipped estimate is biased upward -- it is the true effect plus a positive noise component. In underpowered tests, this bias is especially large because you need a bigger "lucky" noise realization to cross the significance threshold. The practical consequence: the business case you built on "5% measured lift" might only deliver 2% lift in production, because the extra 3% was noise that happened to be in your favor. The fix is to either run adequately powered tests (where the noise is small relative to the true effect) or use shrinkage estimators that adjust for the selection bias.

**Strong Answer:**

* Statistical significance means the observed difference would be unlikely if there were no true effect. Practical significance means the difference is large enough to actually matter for the business.
* Concrete example: an e-commerce company runs an A/B test on 2 million users and finds the new homepage increases average order value from $47.20 to $47.35. With that sample size, the p-value is 0.001 -- highly statistically significant. But the actual improvement is $0.15 per order, which translates to maybe $150K annually. If the homepage redesign cost \$500K in engineering time and introduced new technical debt, the statistically significant result is practically worthless.
* Conversely, a startup tests a new pricing page on 500 users and sees a conversion lift from 3% to 5%. The p-value is 0.08 -- not statistically significant at alpha=0.05. But the 67% relative lift, if real, would raise the revenue from that page by about two-thirds.
The practical significance is enormous; the test was just underpowered. The right action is to run longer, not to conclude "no effect."
* The way I think about it: p-values tell you whether the signal is distinguishable from noise. Effect size and business context tell you whether the signal matters. You need both.

**Follow-up: How would you set the minimum detectable effect for an A/B test before running it?**

I work backwards from the business case. First, I ask: "What is the smallest improvement that would justify the cost of implementing this change?" If the change requires 2 weeks of engineering time ($30K cost), and we get 1 million orders per year at $50 average, then a $0.03 improvement per order generates $30K annually -- barely break-even. So the MDE should be at least $0.05-$0.10 per order to provide a comfortable return. Then I compute the sample size needed to detect that MDE with 80% power. If the required traffic exceeds what we can collect in a reasonable timeframe (say 4 weeks), I would either accept a larger MDE, relax alpha from 0.05 to 0.10 (accepting more false positives), or reconsider whether the test is worth running at all. This upfront planning prevents the common trap of running underpowered tests that waste weeks of traffic and produce inconclusive results.

**Strong Answer:**

* P-hacking is the practice of manipulating data analysis to find statistically significant results. Common forms include: checking results daily and stopping when p drops below 0.05, testing multiple metrics and only reporting the one that is significant, segmenting data after the fact to find a subgroup where the effect is significant, or adding and removing covariates until significance appears.
* Each of these inflates the false positive rate well beyond the nominal 5%. A team that checks daily for 14 days and stops at the first significant result has effectively run 14 correlated tests, pushing the real false positive rate to roughly 20% or more. A team that tests 20 independent metrics has about a 64% chance of finding at least one "significant" result by chance alone.
* To prevent it at the platform level, I would design the system with these guardrails:
    1. Pre-registration: require teams to specify the primary metric, sample size, and analysis plan before the test launches. Lock these parameters.
    2. Sequential testing: use methods like always-valid p-values or group sequential designs that allow continuous monitoring without inflating the error rate.
    3. Automated correction: when multiple metrics are tracked, automatically apply Benjamini-Hochberg correction and highlight the distinction between primary and exploratory metrics.
    4. Mandatory effect size reporting: always show the confidence interval for the effect size alongside the p-value.
    5. Cool-off period: require a minimum test duration covering at least one full business cycle before results can be acted upon.
* The cultural piece is equally important: incentivize teams for running well-designed experiments regardless of outcome, not just for finding "winners."

**Follow-up: How do sequential testing methods (like always-valid p-values) allow peeking without inflating false positives?**

Traditional p-values assume you look at the data exactly once. Sequential testing methods use a different mathematical framework -- typically based on martingale theory or spending functions -- that accounts for continuous monitoring. The always-valid p-value is constructed so that it keeps its Type I error guarantee no matter how many times you look at the data. The tradeoff is that at any fixed sample size, the sequential method requires slightly more evidence to declare significance compared to a fixed-sample test. Think of it as paying an "insurance premium" for the right to peek continuously. In practice, the cost is modest (roughly 20-30% more traffic) and the benefit is enormous: teams can monitor experiments in real-time, stop harmful experiments early, and make shipping decisions whenever the evidence is clear, rather than waiting for a pre-specified end date.
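
The inflation from daily peeking described above can be checked with a small A/A simulation. This is a minimal sketch under assumed parameters (a 5% conversion rate, 500 visitors per arm per day, 14 daily looks, 2,000 simulated experiments), none of which come from the text. Both arms share the same true rate, so every "significant" result is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative assumptions: A/A test, so the null is true by construction
n_sims, n_days, daily_n, rate = 2000, 14, 500, 0.05

# Cumulative conversions per arm after each day, shape (n_sims, n_days)
conv_a = rng.binomial(daily_n, rate, size=(n_sims, n_days)).cumsum(axis=1)
conv_b = rng.binomial(daily_n, rate, size=(n_sims, n_days)).cumsum(axis=1)
n = daily_n * np.arange(1, n_days + 1)  # cumulative visitors per arm

# Pooled two-proportion z-test evaluated at every daily look
p_pool = (conv_a + conv_b) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (conv_b - conv_a) / n / se
p_values = 2 * stats.norm.sf(np.abs(z))

# Peeking: declare a winner if ANY daily look is significant
fpr_peeking = (p_values < 0.05).any(axis=1).mean()
# Fixed horizon: look only once, on the final day
fpr_final = (p_values[:, -1] < 0.05).mean()

print(f"False positive rate, single final look: {fpr_final:.1%}")
print(f"False positive rate, stop at first significant peek: {fpr_peeking:.1%}")
```

The single-look rate should sit near the nominal 5%, while the peeking rate should be several times higher; sequential methods exist to close exactly that gap.
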
**Strong Answer:**

* Under the null hypothesis (no effect for any test), the expected number of false positives from 10 tests at alpha=0.05 is 0.5. So getting 3 "significant" results when you run 10 tests is suspicious -- at least some are likely false positives.
* However, it is unrealistic to assume all 10 nulls are true. If you are testing reasonable product changes, maybe 3-4 of them actually have real effects. In that case, 3 significant results might include 2-3 real effects and 0-1 false positive.
* The standard corrections are Bonferroni (divide alpha by the number of tests, requiring p less than 0.005) and Benjamini-Hochberg (FDR control, which is less conservative). Bonferroni controls the family-wise error rate but is very strict -- you might miss real effects. BH controls the false discovery rate, saying "of the results I call significant, I expect at most X% to be false on average."
* The best practice is to flag all 3 as candidates, apply BH correction to see which survive, and then run a focused confirmation test on the 1-2 that survive correction. The confirmation test uses fresh data and a single pre-specified hypothesis, eliminating the multiple testing problem entirely.

**Follow-up: What is the False Discovery Rate, and why is it often more useful than the Family-Wise Error Rate in practice?**

The Family-Wise Error Rate (FWER) is the probability of making at least one false positive among all tests. Bonferroni controls this by making each individual test extremely stringent. The problem is that as the number of tests grows, each test becomes so conservative that you lose power to detect real effects. With 100 tests, each requires p less than 0.0005 -- virtually nothing passes. The False Discovery Rate (FDR) controls a different quantity: the expected proportion of false positives among the rejected hypotheses. If you control FDR at 5% and get 20 significant results, you expect at most about 1 to be a false discovery.
This is much more practical for exploratory analysis because you maintain reasonable power while keeping the false discovery proportion bounded. In genomics, where researchers test thousands of genes simultaneously, FDR is the standard approach. In tech, it is increasingly used for feature experimentation platforms that run many simultaneous tests.
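
To make the Bonferroni vs Benjamini-Hochberg contrast concrete, here is a minimal hand-rolled sketch of the BH step-up procedure. The function name `benjamini_hochberg` and the ten p-values are hypothetical, chosen to mirror the "3 significant out of 10 tests" scenario; a production system would more likely rely on a tested library implementation.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q (BH step-up)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices that sort p ascending
    thresholds = q * np.arange(1, m + 1) / m   # q * rank / m for ranks 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest rank meeting its threshold
        reject[order[:k + 1]] = True           # reject everything up to that rank
    return reject

# Hypothetical p-values from 10 tests; three fall below 0.05
p_values = [0.20, 0.001, 0.62, 0.014, 0.35, 0.008, 0.91, 0.41, 0.78, 0.55]

bonferroni = np.asarray(p_values) < 0.05 / len(p_values)
bh = benjamini_hochberg(p_values, q=0.05)

print("Bonferroni rejects tests:", np.nonzero(bonferroni)[0].tolist())
print("BH rejects tests:        ", np.nonzero(bh)[0].tolist())
```

On these made-up numbers Bonferroni keeps only the smallest p-value, while BH keeps all three candidates, which illustrates the power difference described above.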