You work at an e-commerce company. The design team created a new checkout button - green instead of blue. After running both versions for a week:
| Version | Visitors | Purchases | Conversion Rate |
|---|---|---|---|
| Blue (Control) | 10,000 | 320 | 3.20% |
| Green (New) | 10,000 | 355 | 3.55% |
The green button has a higher conversion rate. But is this a real improvement or just random chance?
This is the fundamental question of hypothesis testing.
Analogy: Hypothesis testing is like a courtroom trial for data. The null hypothesis is the “defendant”: innocent until proven guilty. The data is the “evidence.” The p-value is how surprising the evidence would be if the defendant were truly innocent. And just like in court, you can make two kinds of mistakes: convicting an innocent person (false positive) or letting a guilty one go free (false negative). The entire framework is about managing those risks.
Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: Modules 1-4 (especially Distributions and Inference)
What You’ll Build: Complete A/B testing framework
Decision rule at α = 0.05: if p < 0.05, we reject the null hypothesis; if p ≥ 0.05, we fail to reject the null hypothesis.
Critical Misconception: The p-value is NOT the probability that the null hypothesis is true. It’s the probability of seeing data at least this extreme IF the null hypothesis were true.
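One way to make that definition concrete is to simulate it. The sketch below is an illustration rather than part of the original module: it assumes the null hypothesis is true (both buttons share the pooled conversion rate from the table above), replays the experiment many times, and counts how often chance alone produces a gap of at least 35 purchases. The seed and number of simulations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n = 10_000                       # visitors per arm, from the table
observed_gap = 355 - 320         # observed difference in purchases
p_null = (320 + 355) / (2 * n)   # shared conversion rate if the null is true

# Replay the experiment many times in a world where the buttons are identical.
n_sims = 100_000
blue = rng.binomial(n, p_null, size=n_sims)
green = rng.binomial(n, p_null, size=n_sims)

# Share of null-world experiments with a gap at least as large as ours
# (in either direction, matching a two-sided test).
simulated_p = np.mean(np.abs(green - blue) >= observed_gap)
print(f"Simulated two-sided p-value: {simulated_p:.3f}")
```

The simulated fraction should land close to the analytic p-value from a z-test. It answers "how often would noise alone do this?", not "how likely is it that the null is true?".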
    import numpy as np
    from scipy import stats

    # Observed data from the table above
    blue_visitors, blue_purchases = 10_000, 320
    green_visitors, green_purchases = 10_000, 355

    blue_rate = blue_purchases / blue_visitors
    green_rate = green_purchases / green_visitors
    print(f"Blue conversion rate: {blue_rate:.2%}")
    print(f"Green conversion rate: {green_rate:.2%}")
    print(f"Observed difference: {green_rate - blue_rate:.2%}")

    def two_proportion_z_test(x1, n1, x2, n2):
        """
        Test if two proportions are significantly different.

        x1, x2: number of successes
        n1, n2: number of trials
        Returns: z-statistic, p-value (two-tailed)
        """
        # Sample proportions
        p1 = x1 / n1
        p2 = x2 / n2

        # Pooled proportion (under null hypothesis)
        p_pool = (x1 + x2) / (n1 + n2)

        # Standard error under null
        se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

        # Z-statistic
        z = (p1 - p2) / se

        # Two-tailed p-value
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        return z, p_value

    # Run the test (green is group 1, so a positive z favors green)
    z_stat, p_value = two_proportion_z_test(
        green_purchases, green_visitors,
        blue_purchases, blue_visitors
    )
    print(f"\nZ-statistic: {z_stat:.3f}")
    print(f"P-value: {p_value:.4f}")

    if p_value < 0.05:
        print("\nResult: Reject null hypothesis")
        print("The difference is statistically significant at α=0.05")
    else:
        print("\nResult: Fail to reject null hypothesis")
        print("The difference is NOT statistically significant at α=0.05")
Output:
    Blue conversion rate: 3.20%
    Green conversion rate: 3.55%
    Observed difference: 0.35%

    Z-statistic: 1.370
    P-value: 0.1705

    Result: Fail to reject null hypothesis
    The difference is NOT statistically significant at α=0.05

Despite the 0.35 percentage point improvement, the p-value of 0.17 means we cannot rule out random chance.
Step-by-step reasoning for interpreting this result:
What we observed: Green button had 0.35 percentage points higher conversion.
What we asked: If there were truly no difference, how often would we see a gap this large by luck alone?
What we found: About 17% of the time (p = 0.17). That is not particularly rare.
Our decision: Since 17% is above our 5% threshold, we cannot confidently say the green button is better. The observed difference is plausible under random noise.
What this does NOT mean: It does not mean the green button is NOT better. It means we do not have enough evidence to conclude either way. A larger sample might reveal a real difference.
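A confidence interval for the difference shows the "not enough evidence either way" point directly. The sketch below is an addition, not part of the original walkthrough. It assumes a normal-approximation (Wald) interval with an unpooled standard error, which is a common choice but not the only one.

```python
import numpy as np
from scipy import stats

blue_purchases, blue_visitors = 320, 10_000
green_purchases, green_visitors = 355, 10_000

p_blue = blue_purchases / blue_visitors
p_green = green_purchases / green_visitors
diff = p_green - p_blue

# Unpooled standard error: for an interval we do not assume the rates are equal.
se = np.sqrt(p_blue * (1 - p_blue) / blue_visitors
             + p_green * (1 - p_green) / green_visitors)

z_crit = stats.norm.ppf(0.975)  # about 1.96 for a 95% interval
ci_low, ci_high = diff - z_crit * se, diff + z_crit * se
print(f"Difference: {diff:.2%}, 95% CI: [{ci_low:.2%}, {ci_high:.2%}]")
```

With these counts the interval runs from roughly -0.15 to +0.85 percentage points. It includes zero (no effect) but also includes lifts that would matter commercially, which is why the result is inconclusive rather than negative.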
ML Application — Model Comparison: This same logic applies when you compare two ML models. If Model A gets 94.2% accuracy and Model B gets 93.8%, is A really better? Without a statistical test (like a paired t-test over cross-validation folds), you cannot know. Many ML practitioners deploy “better” models that are actually within the noise margin. Always test whether the difference is statistically significant before making deployment decisions.
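As a rough illustration of the model-comparison test mentioned above, the sketch below runs a paired t-test with `scipy.stats.ttest_rel`. The per-fold accuracies are hypothetical numbers invented for this example, chosen so the means match 94.2% and 93.8%. Cross-validation folds share training data, so the test is only approximate.

```python
import numpy as np
from scipy import stats

# HYPOTHETICAL per-fold accuracies from the same 10 cross-validation splits.
model_a = np.array([0.945, 0.938, 0.950, 0.936, 0.944,
                    0.941, 0.947, 0.939, 0.943, 0.937])
model_b = np.array([0.940, 0.939, 0.941, 0.935, 0.942,
                    0.934, 0.945, 0.936, 0.938, 0.930])

# Paired test: each fold is scored by both models, so compare fold by fold.
t_stat, p_value = stats.ttest_rel(model_a, model_b)
mean_gap = np.mean(model_a - model_b)

print(f"Mean accuracy gap (A - B): {mean_gap:.4f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Pairing matters here: the fold-to-fold variation is shared by both models, so the test looks only at the per-fold differences, which are far less noisy than the raw accuracies.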
Power = Probability of detecting an effect when it exists = 1 - β
Higher power means:
Less likely to miss real effects
Requires larger sample sizes
More confidence in negative results
    import numpy as np
    from scipy import stats

    def power_proportion_test(p1, p2, n, alpha=0.05):
        """
        Calculate approximate power for a two-sided two-proportion test.

        p1: control proportion
        p2: treatment proportion
        n: sample size per group
        alpha: significance level
        """
        # Effect size (absolute difference in proportions)
        effect = abs(p2 - p1)

        # Pooled standard error under null
        p_pool = (p1 + p2) / 2
        se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)

        # Standard error under alternative
        se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)

        # Critical value for a two-sided test
        z_crit = stats.norm.ppf(1 - alpha / 2)

        # Power (ignores the negligible chance of rejecting in the wrong direction)
        z_power = (effect - z_crit * se_null) / se_alt
        power = stats.norm.cdf(z_power)
        return power

    # Our A/B test: 3.20% vs 3.55%, n=10,000 per group
    power = power_proportion_test(0.032, 0.0355, 10000)
    print(f"Power of our test: {power:.1%}")

    # What if we had 50,000 per group?
    power_large = power_proportion_test(0.032, 0.0355, 50000)
    print(f"Power with n=50,000: {power_large:.1%}")
Output:
    Power of our test: 27.8%
    Power with n=50,000: 86.5%

With only 10,000 per group, we had only about a 28% chance of detecting that 0.35 percentage point difference. No wonder we failed to find significance.
Analogy: Running an underpowered test is like using a metal detector at the beach with the sensitivity turned way down. Even if there is buried treasure, you are unlikely to find it. Power analysis tells you how sensitive your detector needs to be (how large your sample must be) to find treasure of a given size (your minimum detectable effect).
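Power analysis is usually run in the other direction: fix the target power and solve for the sample size. The helper below is a sketch added for illustration. Its name is hypothetical, and it uses the standard normal-approximation formula for two proportions with an assumed 80% power target.

```python
import math
from scipy import stats

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for the test
    z_beta = stats.norm.ppf(power)            # quantile for the target power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

n_needed = sample_size_two_proportions(0.032, 0.0355)
print(f"Visitors needed per group for 80% power: {n_needed:,}")
```

For the button example this lands in the low 40,000s per group, roughly four times the traffic the one-week test collected.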
Statistical Mistake in ML — Underpowered Hyperparameter Searches: This same power problem plagues ML practitioners who do hyperparameter tuning with small validation sets. You try 50 configurations, pick the “best” one, but the differences are smaller than the noise. You have essentially picked a random configuration and convinced yourself it is optimal. Use cross-validation with enough folds and enough data per fold to ensure the differences you are selecting on are real.
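The selection effect described above is easy to reproduce. In the sketch below (an illustration with assumed numbers, not a result from the module), all 50 configurations have exactly the same true accuracy of 0.80, and each is scored on a validation set of only 200 examples.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

true_accuracy = 0.80   # ASSUMPTION: every configuration is equally good
n_configs = 50
n_val = 200            # ASSUMPTION: a small validation set

# Each validation score is binomial noise around the same true accuracy.
val_acc = rng.binomial(n_val, true_accuracy, size=n_configs) / n_val

best = val_acc.max()
print(f"True accuracy of every config: {true_accuracy:.3f}")
print(f"Best observed validation accuracy: {best:.3f}")
print(f"Spread across configs: {val_acc.min():.3f} to {best:.3f}")
```

The "winning" configuration looks several points better than it really is, even though no configuration is better than any other. Any selection on differences smaller than this spread is selection on noise.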
    # Testing 20 variations? Some will be "significant" by chance!

    # Bonferroni correction
    alpha_corrected = 0.05 / 20  # = 0.0025

    # Or use False Discovery Rate (FDR) correction
Question: Your A/B test shows the treatment group has 3.5% conversion vs 3.2% control. The p-value is 0.08. What do you tell stakeholders?
Answer:
“At our standard α=0.05 threshold, this result is not statistically significant (p=0.08). However, there are a few considerations:
The observed difference is 0.3 percentage points - if real, this could be meaningful at scale
The p-value of 0.08 suggests weak evidence against the null hypothesis, not proof the treatment doesn’t work
Consider power analysis - we may have been underpowered to detect this effect size
Practical significance - if the change is low-risk and low-cost, you might still consider implementing
Recommendation: If resources allow, run a larger test to get more conclusive results.”
Question 2: Multiple Testing (Amazon)
Question: You test 20 different variations of a product page. Three show p-values under 0.05. What’s the problem?
Answer: With 20 tests at α=0.05, we expect about 1 false positive even if nothing is different!
Expected false positives = 20 × 0.05 = 1
Solutions:
Bonferroni correction: Use α = 0.05/20 = 0.0025 as threshold
Benjamini-Hochberg (FDR): Control the expected proportion of false discoveries
Holdout validation: Test the “winners” on fresh data
    # Bonferroni
    alpha_corrected = 0.05 / 20  # 0.0025
    # Only results with p < 0.0025 are significant
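To compare the two corrections side by side, here is a minimal sketch of Bonferroni and the Benjamini-Hochberg step-up procedure. The 20 p-values are hypothetical, chosen so that three fall under 0.05 as in the question, and `benjamini_hochberg` is an illustrative helper rather than a library function.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                        # indices of p-values, ascending
    thresholds = q * np.arange(1, m + 1) / m     # rank-dependent cutoffs i*q/m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()           # largest rank meeting its cutoff
        rejected[order[:k + 1]] = True           # reject everything up to that rank
    return rejected

# HYPOTHETICAL p-values for 20 page variations (three are below 0.05)
p_values = np.array([0.001, 0.004, 0.030, 0.08, 0.12, 0.15, 0.21, 0.26, 0.31, 0.38,
                     0.42, 0.47, 0.53, 0.58, 0.64, 0.70, 0.77, 0.83, 0.90, 0.96])

uncorrected = p_values < 0.05
bonferroni = p_values < 0.05 / len(p_values)
bh = benjamini_hochberg(p_values, q=0.05)

print(f"Uncorrected: {uncorrected.sum()} significant")
print(f"Bonferroni:  {bonferroni.sum()} significant")
print(f"BH (FDR 5%): {bh.sum()} significant")
```

With these made-up values, three results pass uncorrected, one survives Bonferroni, and two survive Benjamini-Hochberg, which shows how FDR control sits between no correction and the strictest one.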
Question 3: Power and Sample Size (Facebook/Meta)
Question: An A/B test with 10,000 users per group shows no significant difference. Does this prove the treatment has no effect?
Answer: No! “Fail to reject null” ≠ “null is true”
We need to consider statistical power:
What effect size could we detect? With n=10,000, we might only detect large effects
What was our power? If power was 50%, we had a coin flip’s chance of detecting a real effect
What’s the confidence interval? Even with p > 0.05, the CI may still include effects large enough to matter, so it cannot rule them out
    # For a conversion test at 5% baseline, n=10,000/group
    # We can reliably detect (80% power, α=0.05) differences of roughly 0.9 percentage points
    # Smaller effects would require larger samples
The correct conclusion: “We failed to find evidence of an effect of size > X”
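The "X" in that conclusion can be estimated before or after the test. The helper below is a sketch under stated assumptions: a hypothetical function name, the normal approximation, 80% power, and the baseline variance used for both arms (reasonable when the lift is small).

```python
import math
from scipy import stats

def minimum_detectable_effect(baseline, n_per_group, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a two-sided two-proportion test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    # Standard error of the difference, using the baseline rate for both arms
    se = math.sqrt(2 * baseline * (1 - baseline) / n_per_group)
    return (z_alpha + z_beta) * se

mde = minimum_detectable_effect(baseline=0.05, n_per_group=10_000)
print(f"Approximate MDE: {mde * 100:.2f} percentage points")
```

Any true effect smaller than this value would more often than not go undetected at this sample size, so a non-significant result says little about such effects.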
Question 4: Early Stopping (Tech Companies)
Question: You’re running an A/B test. After 2 days (of planned 7), you peek at results and see p=0.03. Should you stop the test?
Answer: No! Stopping as soon as the p-value dips below 0.05 is called “optional stopping” (a form of p-hacking), and it inflates the false positive rate.
The p-value assumes you only look once at the end. If you peek repeatedly and stop at the first significant result:
With 5 equally spaced peeks, your actual false positive rate is about 14%, not 5%
With 10 peeks, it’s about 19%
Proper approaches:
Sequential testing with adjusted thresholds (O’Brien-Fleming, Pocock)
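The inflation from peeking, and the effect of an adjusted threshold, can be checked with a small A/A simulation. This sketch is an illustration with assumed settings (5 equally spaced looks, a 3% conversion rate in both arms, an arbitrary seed). The value 2.413 is the standard Pocock critical value for 5 looks at an overall two-sided α of 0.05.

```python
import numpy as np

rng = np.random.default_rng(7)

n_sims, n_looks, batch = 5_000, 5, 2_000   # ASSUMED design: 5 equally spaced looks
p = 0.03                                    # same true rate in both arms (A/A test)

# Cumulative conversions per arm at each look, shape (n_sims, n_looks)
a = rng.binomial(batch, p, size=(n_sims, n_looks)).cumsum(axis=1)
b = rng.binomial(batch, p, size=(n_sims, n_looks)).cumsum(axis=1)
n = batch * np.arange(1, n_looks + 1)       # visitors per arm at each look

# Two-proportion z-statistic at every look
p_pool = (a + b) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (a / n - b / n) / se

# An experiment is a false positive if ANY look crosses the threshold
naive = np.mean((np.abs(z) > 1.96).any(axis=1))
pocock = np.mean((np.abs(z) > 2.413).any(axis=1))

print(f"False positive rate, stop at first |z| > 1.96:  {naive:.1%}")
print(f"False positive rate, Pocock boundary (2.413):   {pocock:.1%}")
```

Because there is no real difference in this simulation, every rejection is a false positive. The naive rule rejects far more often than 5%, while the stricter per-look boundary brings the overall rate back near the nominal level.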
ML Connection: Every time you compare “Model A accuracy = 0.92 vs Model B accuracy = 0.89”, you’re implicitly doing hypothesis testing. The question is: is this 3 percentage point difference real or just noise from your test set? Proper statistical testing (like paired t-tests on cross-validation folds) helps you answer it.
Coming up next: We’ll learn about Correlation and Regression - how to understand relationships between variables and make predictions. This is where statistics directly becomes machine learning.
Your A/B test has p=0.04. Product wants to ship. Engineering wants more data. Who is right?
Strong Answer:
Neither is automatically right — the answer depends on context that a p-value alone does not provide. A p=0.04 means the result is technically significant at alpha=0.05, but there are several things I would check before making a recommendation.
First, what is the effect size? If the new variant improved conversion by 0.02 percentage points (from 3.00% to 3.02%), the result may be statistically significant with a large enough sample but practically meaningless. Shipping a code change, increasing technical debt, and potentially confusing users for a 0.02pp improvement is not worth it. I would compute the expected annual revenue impact and compare it to the implementation cost.
Second, what is the power of the test? If we planned for 80% power to detect a 10% relative lift but only ran enough traffic for 50% power, then the lift measured alongside p=0.04 is probably an inflated estimate (the “winner’s curse”: significant results from underpowered tests tend to overestimate effect sizes).
Third, did anyone peek at the data before the test concluded? If the team checked results daily and would have stopped at the first significant reading, the effective false positive rate is much higher than 0.05, and p=0.04 may not clear a threshold that properly accounts for the repeated looks.
My recommendation: if the effect size is meaningful, the test was pre-registered, and nobody peeked, ship it. If any of those conditions are not met, run a confirmation test or extend the current one.
Follow-up: Walk me through the “winner’s curse” in A/B testing. How does it inflate effect size estimates?
The winner’s curse occurs because we only ship results that cross the significance threshold. Imagine the true effect is a 2% lift. Due to sampling noise, your measured lift will be randomly higher or lower than 2%. If you only ship when p is less than 0.05, you are selecting for experiments where random noise happened to push the measured lift above 2%. The shipped estimate is biased upward: it is the true effect plus a positive noise component. In underpowered tests, this bias is especially large because you need a bigger “lucky” noise realization to cross the significance threshold.
The practical consequence: the business case you built on “5% measured lift” might only deliver 2% lift in production, because the extra 3% was noise that happened to be in your favor. The fix is to either run adequately powered tests (where the noise is small relative to the true effect) or use shrinkage estimators that adjust for the selection bias.
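A short simulation makes the selection bias visible. The sketch below is illustrative only. It assumes true conversion rates of 3.20% and 3.55% with 10,000 visitors per arm (an underpowered design), runs many such tests, and compares the average measured lift across all tests with the average among only the significant "winners".

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

n = 10_000
p_control, p_treatment = 0.032, 0.0355   # ASSUMED true rates
true_diff = p_treatment - p_control      # true lift: 0.35 percentage points
n_sims = 50_000

# Observed conversion rates in each simulated experiment
c = rng.binomial(n, p_control, size=n_sims) / n
t = rng.binomial(n, p_treatment, size=n_sims) / n
diff = t - c

# Two-proportion z-test for every simulated experiment
p_pool = (c + t) / 2
se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
significant_win = diff / se > 1.96       # treatment "wins" at two-sided alpha = 0.05

mean_all = diff.mean()
mean_winners = diff[significant_win].mean()

print(f"True lift:                      {true_diff:.4f}")
print(f"Mean measured lift, all tests:  {mean_all:.4f}")
print(f"Mean measured lift, winners:    {mean_winners:.4f}")
print(f"Share of tests that 'win':      {significant_win.mean():.1%}")
```

Across all experiments the measured lift is unbiased, but the subset that reaches significance reports a noticeably larger lift than the truth, because only the lucky draws clear the bar.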
Explain the difference between statistical significance and practical significance with a concrete example.
Strong Answer:
Statistical significance means the observed difference is unlikely to have occurred by chance alone. Practical significance means the difference is large enough to actually matter for the business.
Concrete example: an e-commerce company runs an A/B test on 2 million users and finds the new homepage increases average order value from $47.20 to $47.35. With that sample size, the p-value is 0.001, which is highly statistically significant. But the actual improvement is $0.15 per order, which translates to maybe $150K annually. If the homepage redesign cost $500K in engineering time and introduced new technical debt, the statistically significant result is practically worthless.
Conversely, a startup tests a new pricing page on 500 users and sees a conversion lift from 3% to 5%. The p-value is 0.08, which is not statistically significant at alpha=0.05. But the 67% relative lift, if real, would raise revenue from that page by about two-thirds. The practical significance is enormous; the test was just underpowered. The right action is to run longer, not to conclude “no effect.”
The way I think about it: p-values tell you whether the signal is distinguishable from noise. Effect size and business context tell you whether the signal matters. You need both.
Follow-up: How would you set the minimum detectable effect for an A/B test before running it?
I work backwards from the business case. First, I ask: “What is the smallest improvement that would justify the cost of implementing this change?” If the change requires 2 weeks of engineering time ($30K cost), and we get 1 million orders per year at a $50 average, then a $0.03 improvement per order generates $30K annually, which is barely break-even. So the MDE should be at least $0.05 to $0.10 per order to provide a comfortable return. Then I compute the sample size needed to detect that MDE with 80% power. If the required traffic exceeds what we can collect in a reasonable timeframe (say 4 weeks), I would either accept a larger MDE, relax alpha from 0.05 to 0.10, or reconsider whether the test is worth running at all. This upfront planning prevents the common trap of running underpowered tests that waste weeks of traffic and produce inconclusive results.
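The planning steps in this answer can be sketched in a few lines. The cost, order volume, and MDE come from the example above. The standard deviation of order value (`sigma`) is NOT given in the text and is an assumption made purely for illustration. The sample-size formula is the usual normal approximation for comparing two means.

```python
import math
from scipy import stats

implementation_cost = 30_000      # from the example: two weeks of engineering
orders_per_year = 1_000_000
break_even_lift = implementation_cost / orders_per_year   # dollars per order

mde = 0.05                        # chosen above break-even, per the answer
sigma = 30.0                      # ASSUMPTION: std dev of order value in dollars
alpha, power = 0.05, 0.80

z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)

# Per-group sample size to detect a difference in means of size `mde`
n_per_group = math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

print(f"Break-even lift per order: ${break_even_lift:.2f}")
print(f"Orders needed per group for MDE ${mde:.2f}: {n_per_group:,}")
```

Under the assumed variance, detecting a five-cent lift needs millions of orders per group, far more than a year of traffic. That is exactly the situation where the answer suggests accepting a larger MDE or reconsidering the test.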
What is p-hacking, and how would you design an experimentation platform to prevent it?
Strong Answer:
P-hacking is the practice of manipulating data analysis to find statistically significant results. Common forms include: checking results daily and stopping when p drops below 0.05, testing multiple metrics and only reporting the one that is significant, segmenting data after the fact to find a subgroup where the effect is significant, or adding and removing covariates until significance appears.
Each of these inflates the false positive rate well beyond the nominal 5%. A team that checks daily for 14 days and stops at the first significant result has effectively taken 14 correlated looks at the data, pushing the real false positive rate to roughly 20%. A team that tests 20 independent metrics with no true effects has about a 64% chance of finding at least one “significant” result by chance alone.
To prevent it at the platform level, I would design the system with these guardrails: (1) Pre-registration: require teams to specify the primary metric, sample size, and analysis plan before the test launches. Lock these parameters. (2) Sequential testing: use methods like always-valid p-values or group sequential designs that allow continuous monitoring without inflating the error rate. (3) Automated correction: when multiple metrics are tracked, automatically apply Benjamini-Hochberg correction and highlight the distinction between primary and exploratory metrics. (4) Mandatory effect size reporting: always show the confidence interval for the effect size alongside the p-value. (5) Cool-off period: require a minimum test duration covering at least one full business cycle before results can be acted upon.
The cultural piece is equally important: incentivize teams for running well-designed experiments regardless of outcome, not just for finding “winners.”
Follow-up: How do sequential testing methods (like always-valid p-values) allow peeking without inflating false positives?
Traditional p-values assume you look at the data exactly once. Sequential testing methods use a different mathematical framework, typically based on martingale theory or alpha-spending functions, that accounts for continuous monitoring. The always-valid p-value is constructed so that it keeps its Type I error guarantee no matter how many times you look at the data. The tradeoff is that at any fixed sample size, the sequential method requires somewhat more evidence to declare significance compared to a fixed-sample test. Think of it as paying an “insurance premium” for the right to peek continuously.
In practice, the cost is often modest (on the order of 20-30% more traffic, depending on the method) and the benefit is enormous: teams can monitor experiments in real-time, stop harmful experiments early, and make shipping decisions whenever the evidence is clear, rather than waiting for a pre-specified end date.
You are running 10 A/B tests simultaneously. Three come back significant at p less than 0.05. How many of those are likely real?
Strong Answer:
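Under the global null hypothesis (no effect for any test), the expected number of false positives from 10 tests at alpha=0.05 is 0.5. Seeing 3 “significant” results is well above that: if all 10 nulls were true and the tests were independent, three or more would turn up only about 1% of the time. So at least some of the three are probably real, but the raw p-values alone cannot tell us which, and any individual one could still be a false positive.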
However, it is unrealistic to assume all 10 nulls are true. If you are testing reasonable product changes, maybe 3-4 of them actually have real effects. In that case, 3 significant results might include 2-3 real effects and 0-1 false positive.
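A quick binomial calculation shows how often three or more of ten tests would come up significant if every null were true. This sketch assumes the ten tests are independent, which simultaneous tests on shared traffic may not strictly satisfy.

```python
from scipy import stats

n_tests, alpha = 10, 0.05

# Expected number of false positives if all nulls are true
expected_fp = n_tests * alpha

# P(X >= 3) for X ~ Binomial(10, 0.05); sf(2) gives P(X > 2)
p_three_or_more = stats.binom.sf(2, n_tests, alpha)

print(f"Expected false positives: {expected_fp:.1f}")
print(f"P(3 or more significant | all nulls true): {p_three_or_more:.4f}")
```

The probability comes out at roughly 1.2%, so three significant results are hard to explain by chance alone. That does not say which of the three are real.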
The standard corrections are Bonferroni (divide alpha by the number of tests, requiring p less than 0.005) and Benjamini-Hochberg (FDR control, which is less conservative). Bonferroni controls the family-wise error rate but is very strict, so you might miss real effects. BH controls the false discovery rate, saying “of the results I call significant, I expect at most X% to be false on average.”
The best practice is to flag all 3 as candidates, apply BH correction to see which survive, and then run a focused confirmation test on the 1-2 that survive correction. The confirmation test uses fresh data and a single pre-specified hypothesis, eliminating the multiple testing problem entirely.
Follow-up: What is the False Discovery Rate, and why is it often more useful than the Family-Wise Error Rate in practice?
The Family-Wise Error Rate (FWER) is the probability of making at least one false positive among all tests. Bonferroni controls this by making each individual test extremely stringent. The problem is that as the number of tests grows, each test becomes so conservative that you lose power to detect real effects. With 100 tests, each requires p less than 0.0005, so virtually nothing passes.
The False Discovery Rate (FDR) controls a different quantity: the expected proportion of false positives among the rejected hypotheses. If you control FDR at 5% and get 20 significant results, you expect about 1 to be a false discovery. This is much more practical for exploratory analysis because you maintain reasonable power while keeping the false discovery proportion bounded. In genomics, where researchers test thousands of genes simultaneously, FDR is the standard approach. In tech, it is increasingly used for feature experimentation platforms that run many simultaneous tests.