Hypothesis Testing

Hypothesis Testing: Real Effect or Random Noise?

The A/B Testing Problem

You work at an e-commerce company. The design team created a new checkout button - green instead of blue. After running both versions for a week:
| Version | Visitors | Purchases | Conversion Rate |
|---|---|---|---|
| Blue (Control) | 10,000 | 320 | 3.20% |
| Green (New) | 10,000 | 355 | 3.55% |
The green button has a higher conversion rate. But is this a real improvement or just random chance? This is the fundamental question of hypothesis testing. Analogy: Hypothesis testing is like a courtroom trial for data. The null hypothesis is the “defendant” — innocent until proven guilty. The data is the “evidence.” The p-value is how surprising the evidence would be if the defendant were truly innocent. And just like in court, you can make two kinds of mistakes: convicting an innocent person (false positive) or letting a guilty one go free (false negative). The entire framework is about managing those risks.
Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: Modules 1-4 (especially Distributions and Inference)
What You’ll Build: Complete A/B testing framework

The Framework: Innocent Until Proven Guilty

Hypothesis testing borrows from the legal system:
| Legal System | Hypothesis Testing |
|---|---|
| Defendant is innocent until proven guilty | No effect until proven otherwise |
| Prosecution must prove guilt beyond reasonable doubt | Data must prove effect with high confidence |
| Jury verdict: guilty or not guilty | Decision: reject or fail to reject null |
| “Not guilty” ≠ “innocent” | “Fail to reject” ≠ “effect doesn’t exist” |
[Figure: Hypothesis Testing Framework]

The Two Hypotheses

Null Hypothesis (H₀): The default assumption. Nothing special is happening.
  • “The new button has the same conversion rate as the old one”
  • “The drug has no effect”
  • “The two groups are the same”
Alternative Hypothesis (H₁ or Hₐ): What we’re trying to prove.
  • “The new button has a different conversion rate”
  • “The drug has an effect”
  • “The groups are different”
# Our A/B test hypotheses:
# H₀: p_green = p_blue (no difference)
# H₁: p_green ≠ p_blue (there is a difference)

The P-Value: Quantifying Surprise

The p-value answers: “If there really were no effect, how likely would we be to see data this extreme?”
[Figure: P-Value Intuition]

Interpreting P-Values

| P-Value | Interpretation |
|---|---|
| p < 0.01 | Strong evidence against null hypothesis |
| p < 0.05 | Moderate evidence against null hypothesis |
| p < 0.10 | Weak evidence against null hypothesis |
| p ≥ 0.10 | Little to no evidence against null hypothesis |
Common threshold (α): 0.05 (5%)
  • If p < 0.05, we reject the null hypothesis
  • If p ≥ 0.05, we fail to reject the null hypothesis
Critical Misconception: The p-value is NOT the probability that the null hypothesis is true. It is the probability of seeing data at least this extreme IF the null hypothesis were true.
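One way to make this definition concrete is to simulate it. The sketch below assumes the null hypothesis is true (both buttons convert at the pooled rate from the table above), replays the 10,000-visitor experiment many times, and counts how often chance alone produces a gap of at least 35 purchases in either direction. The seed, the number of simulations, and the variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

n = 10_000                            # visitors per group
observed_gap = 355 - 320              # observed difference in purchases
pooled_rate = (320 + 355) / (2 * n)   # shared conversion rate assumed under H0

# Replay the experiment many times in a world where H0 is true
n_sims = 200_000
blue = rng.binomial(n, pooled_rate, size=n_sims)
green = rng.binomial(n, pooled_rate, size=n_sims)

# Two-sided: a gap in either direction counts as "at least this extreme"
simulated_p = np.mean(np.abs(green - blue) >= observed_gap)
print(f"Simulated p-value: {simulated_p:.3f}")
```

The printed fraction is a simulation-based estimate of the two-sided p-value. It says nothing about how likely H₀ itself is, only how often data like ours would appear if H₀ held.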

Testing Our A/B Example

Let’s test whether the green button is actually better:
import numpy as np
from scipy import stats

# Data
blue_visitors = 10000
blue_purchases = 320
blue_rate = blue_purchases / blue_visitors

green_visitors = 10000
green_purchases = 355
green_rate = green_purchases / green_visitors

print(f"Blue conversion rate: {blue_rate:.2%}")
print(f"Green conversion rate: {green_rate:.2%}")
print(f"Observed difference: {green_rate - blue_rate:.2%}")

Method 1: Two-Proportion Z-Test

def two_proportion_z_test(x1, n1, x2, n2):
    """
    Test if two proportions are significantly different.
    
    x1, x2: number of successes
    n1, n2: number of trials
    
    Returns: z-statistic, p-value (two-tailed)
    """
    # Sample proportions
    p1 = x1 / n1
    p2 = x2 / n2
    
    # Pooled proportion (under null hypothesis)
    p_pool = (x1 + x2) / (n1 + n2)
    
    # Standard error under null
    se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    
    # Z-statistic
    z = (p1 - p2) / se
    
    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    
    return z, p_value

# Run the test
z_stat, p_value = two_proportion_z_test(
    green_purchases, green_visitors,
    blue_purchases, blue_visitors
)

print(f"\nZ-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nResult: Reject null hypothesis")
    print("The difference is statistically significant at α=0.05")
else:
    print("\nResult: Fail to reject null hypothesis")
    print("The difference is NOT statistically significant at α=0.05")
Output:
Blue conversion rate: 3.20%
Green conversion rate: 3.55%
Observed difference: 0.35%

Z-statistic: 1.370
P-value: 0.1705

Result: Fail to reject null hypothesis
The difference is NOT statistically significant at α=0.05
Despite the 0.35 percentage point improvement, the p-value of 0.17 means we cannot rule out random chance. Step-by-step reasoning for interpreting this result:
  1. What we observed: Green button had 0.35 percentage points higher conversion.
  2. What we asked: If there were truly no difference, how often would we see a gap this large by luck alone?
  3. What we found: About 17% of the time (p = 0.17). That is not particularly rare.
  4. Our decision: Since 17% is above our 5% threshold, we cannot confidently say the green button is better. The observed difference is plausible under random noise.
  5. What this does NOT mean: It does not mean the green button is NOT better. It means we do not have enough evidence to conclude either way. A larger sample might reveal a real difference.
ML Application - Model Comparison: This same logic applies when you compare two ML models. If Model A gets 94.2% accuracy and Model B gets 93.8%, is A really better? Without a statistical test (like a paired t-test over cross-validation folds), you cannot know. Many ML practitioners deploy “better” models that are actually within the noise margin. Always test whether the difference is statistically significant before making deployment decisions.

Method 2: Chi-Square Test

from scipy.stats import chi2_contingency

# Contingency table
#                  Purchased    Not Purchased
# Blue               320          9680
# Green              355          9645

contingency_table = np.array([
    [blue_purchases, blue_visitors - blue_purchases],
    [green_purchases, green_visitors - green_purchases]
])

# correction=False turns off the Yates continuity correction that SciPy
# applies to 2x2 tables by default, so the result matches the z-test above
chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
Output:
Chi-square statistic: 1.878
P-value: 0.1705
Degrees of freedom: 1
Same p-value, same conclusion: without the continuity correction, the chi-square statistic for a 2×2 table is the square of the z-statistic.

Types of Errors

We can make two types of mistakes:
| | H₀ is True (No Effect) | H₀ is False (Real Effect) |
|---|---|---|
| Reject H₀ | Type I Error (False Positive) | Correct Decision |
| Fail to Reject H₀ | Correct Decision | Type II Error (False Negative) |

Type I Error (α): False Positive

We claim there’s an effect when there isn’t one.
  • Probability = α (typically 0.05)
  • “The boy who cried wolf”
  • Example: Launching a feature that doesn’t actually help

Type II Error (β): False Negative

We miss a real effect.
  • Probability = β (varies, often 0.20)
  • Power = 1 - β (typically 0.80)
  • Example: Abandoning a feature that would have helped
# Visualize the tradeoff
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Type I Error
x = np.linspace(-4, 4, 1000)
null_dist = stats.norm.pdf(x, 0, 1)

axes[0].plot(x, null_dist, 'b-', linewidth=2, label='Null Distribution')
axes[0].fill_between(x[x > 1.96], null_dist[x > 1.96], alpha=0.3, color='red', 
                      label=f'Type I Error Region (α/2)')
axes[0].fill_between(x[x < -1.96], null_dist[x < -1.96], alpha=0.3, color='red')
axes[0].axvline(1.96, color='red', linestyle='--')
axes[0].axvline(-1.96, color='red', linestyle='--')
axes[0].set_title('Type I Error (False Positive)')
axes[0].legend()

# Type II Error (with alternative distribution)
alt_dist = stats.norm.pdf(x, 2, 1)  # Effect exists, shifted right

axes[1].plot(x, null_dist, 'b-', linewidth=2, label='Null (No Effect)')
axes[1].plot(x, alt_dist, 'g-', linewidth=2, label='Alternative (Real Effect)')
axes[1].fill_between(x[(x > -1.96) & (x < 1.96)], alt_dist[(x > -1.96) & (x < 1.96)], 
                      alpha=0.3, color='orange', label='Type II Error Region (β)')
axes[1].axvline(1.96, color='red', linestyle='--')
axes[1].axvline(-1.96, color='red', linestyle='--')
axes[1].set_title('Type II Error (False Negative)')
axes[1].legend()

plt.tight_layout()
plt.show()
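The claim that the Type I error rate equals α can be checked empirically. The sketch below simulates many A/B tests in which both arms share the same true conversion rate (so every "significant" result is a false positive) and measures how often the two-proportion z-test rejects. The 3.2% rate, the seed, and the simulation count are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # arbitrary seed
n, true_rate, alpha = 10_000, 0.032, 0.05
n_sims = 20_000

# Both arms have the same true rate, so H0 is true in every simulated test
a = rng.binomial(n, true_rate, size=n_sims)
b = rng.binomial(n, true_rate, size=n_sims)

# Vectorized two-proportion z-test with a pooled standard error
p_pool = (a + b) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (a - b) / n / se
p_values = 2 * stats.norm.sf(np.abs(z))

false_positive_rate = np.mean(p_values < alpha)
print(f"False positive rate: {false_positive_rate:.3f}")
```

The rate should hover near 0.05: with a single pre-planned look, α is exactly the share of no-effect experiments that get flagged as wins.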

Statistical Power: Ability to Detect Real Effects

Power = Probability of detecting an effect when it exists = 1 - β
Higher power means:
  • Less likely to miss real effects
  • Requires larger sample sizes
  • More confidence in negative results
def power_proportion_test(p1, p2, n, alpha=0.05):
    """
    Calculate power for a two-proportion test.
    
    p1: control proportion
    p2: treatment proportion
    n: sample size per group
    alpha: significance level
    """
    # Effect size
    effect = abs(p2 - p1)
    
    # Pooled standard error under null
    p_pool = (p1 + p2) / 2
    se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    
    # Standard error under alternative
    se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    
    # Critical value
    z_crit = stats.norm.ppf(1 - alpha / 2)
    
    # Power
    z_power = (effect - z_crit * se_null) / se_alt
    power = stats.norm.cdf(z_power)
    
    return power

# Our A/B test: 3.20% vs 3.55%, n=10,000 per group
power = power_proportion_test(0.032, 0.0355, 10000)
print(f"Power of our test: {power:.1%}")

# What if we had 50,000 per group?
power_large = power_proportion_test(0.032, 0.0355, 50000)
print(f"Power with n=50,000: {power_large:.1%}")
Output:
Power of our test: 27.8%
Power with n=50,000: 86.5%
With only 10,000 per group, we had only about a 28% chance of detecting that 0.35 percentage point difference. No wonder we failed to find significance. Analogy: Running an underpowered test is like using a metal detector at the beach with the sensitivity turned way down. Even if there is buried treasure, you are unlikely to find it. Power analysis tells you how sensitive your detector needs to be (how large your sample must be) to find treasure of a given size (your minimum detectable effect).
Statistical Mistake in ML — Underpowered Hyperparameter Searches: This same power problem plagues ML practitioners who do hyperparameter tuning with small validation sets. You try 50 configurations, pick the “best” one, but the differences are smaller than the noise. You have essentially picked a random configuration and convinced yourself it is optimal. Use cross-validation with enough folds and enough data per fold to ensure the differences you are selecting on are real.
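The analytic power figure for the button test can be sanity-checked by simulation. The sketch below assumes the true rates really are 3.20% and 3.55%, reruns the 10,000-per-group experiment many times, and counts how often the z-test reaches p < 0.05. The seed and simulation count are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)   # arbitrary seed
n, p_blue, p_green, alpha = 10_000, 0.032, 0.0355, 0.05
n_sims = 20_000

# Here the alternative is true: green really converts better
blue = rng.binomial(n, p_blue, size=n_sims)
green = rng.binomial(n, p_green, size=n_sims)

# Same pooled two-proportion z-test as in Method 1, vectorized
p_pool = (blue + green) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (green - blue) / n / se
p_values = 2 * stats.norm.sf(np.abs(z))

# Power = share of experiments that detect the (real) effect
simulated_power = np.mean(p_values < alpha)
print(f"Simulated power: {simulated_power:.1%}")
```

The simulated value should land close to what power_proportion_test returns for n=10,000, which is a useful cross-check whenever a closed-form power formula is in doubt.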

Sample Size Calculation for Desired Power

def sample_size_proportion_test(p1, p2, power=0.80, alpha=0.05):
    """
    Calculate required sample size per group.
    
    p1: expected control proportion
    p2: expected treatment proportion
    power: desired power (typically 0.80)
    alpha: significance level (typically 0.05)
    """
    # Effect size
    effect = abs(p2 - p1)
    
    # Pooled proportion
    p_pool = (p1 + p2) / 2
    
    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    
    # Variance terms
    var_null = 2 * p_pool * (1 - p_pool)
    var_alt = p1 * (1 - p1) + p2 * (1 - p2)
    
    # Sample size formula
    n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2
    
    return int(np.ceil(n))

# How many visitors do we need to detect a 0.35% difference?
n_needed = sample_size_proportion_test(0.032, 0.0355)
print(f"Need {n_needed:,} visitors per group to detect 0.35% difference with 80% power")

# For a larger effect (1% improvement)
n_1pct = sample_size_proportion_test(0.032, 0.042)
print(f"Need {n_1pct:,} visitors per group to detect 1.0% difference with 80% power")
Output:
Need 41,789 visitors per group to detect 0.35% difference with 80% power
Need 5,593 visitors per group to detect 1.0% difference with 80% power

Common Statistical Tests

1. One-Sample t-Test

Is this sample mean different from a known value?
# Are our website load times different from the 3-second industry standard?
load_times = np.array([2.8, 3.2, 2.9, 3.5, 2.7, 3.1, 2.6, 3.0, 2.9, 3.3])

t_stat, p_value = stats.ttest_1samp(load_times, 3.0)
print(f"Sample mean: {np.mean(load_times):.2f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

2. Two-Sample t-Test

Are the means of two groups different?
# Do users spend more time on new homepage design?
old_design_time = np.array([45, 52, 38, 61, 42, 55, 48, 50, 44, 58])
new_design_time = np.array([58, 62, 55, 70, 65, 60, 68, 72, 63, 59])

t_stat, p_value = stats.ttest_ind(old_design_time, new_design_time)
print(f"Old design mean: {np.mean(old_design_time):.1f}s")
print(f"New design mean: {np.mean(new_design_time):.1f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

3. Paired t-Test

Before/after comparisons on the same subjects:
# Does a training program improve test scores?
before = np.array([65, 72, 58, 80, 75, 62, 70, 68, 74, 78])
after = np.array([70, 78, 62, 85, 82, 68, 75, 72, 80, 82])

t_stat, p_value = stats.ttest_rel(before, after)
print(f"Mean improvement: {np.mean(after - before):.1f} points")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

4. ANOVA

Are three or more groups different?
# Do three different ad campaigns have different click rates?
campaign_a = np.array([2.1, 2.3, 2.0, 2.4, 2.2])
campaign_b = np.array([2.8, 3.0, 2.9, 3.1, 2.7])
campaign_c = np.array([2.3, 2.5, 2.4, 2.6, 2.2])

f_stat, p_value = stats.f_oneway(campaign_a, campaign_b, campaign_c)
print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.4f}")

Complete A/B Testing Framework

import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Tuple, Optional

@dataclass
class ABTestResult:
    """Results of an A/B test."""
    control_rate: float
    treatment_rate: float
    relative_lift: float
    absolute_lift: float
    z_statistic: float
    p_value: float
    confidence_interval: Tuple[float, float]
    is_significant: bool
    power: float
    
class ABTestAnalyzer:
    """
    Complete A/B testing framework with proper statistical methodology.
    """
    
    def __init__(self, alpha: float = 0.05, power_threshold: float = 0.80):
        self.alpha = alpha
        self.power_threshold = power_threshold
    
    def run_test(
        self, 
        control_successes: int, 
        control_total: int,
        treatment_successes: int, 
        treatment_total: int
    ) -> ABTestResult:
        """Run a two-proportion z-test."""
        
        # Calculate rates
        p_control = control_successes / control_total
        p_treatment = treatment_successes / treatment_total
        
        # Lifts
        absolute_lift = p_treatment - p_control
        relative_lift = (p_treatment - p_control) / p_control if p_control > 0 else 0
        
        # Pooled proportion
        p_pool = (control_successes + treatment_successes) / (control_total + treatment_total)
        
        # Standard error
        se = np.sqrt(p_pool * (1 - p_pool) * (1/control_total + 1/treatment_total))
        
        # Z-statistic
        z = absolute_lift / se if se > 0 else 0
        
        # P-value (two-tailed)
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        
        # Confidence interval for the difference
        se_diff = np.sqrt(
            p_control * (1 - p_control) / control_total +
            p_treatment * (1 - p_treatment) / treatment_total
        )
        z_crit = stats.norm.ppf(1 - self.alpha / 2)
        ci = (absolute_lift - z_crit * se_diff, absolute_lift + z_crit * se_diff)
        
        # Power (approximate)
        power = self._calculate_power(p_control, p_treatment, min(control_total, treatment_total))
        
        return ABTestResult(
            control_rate=p_control,
            treatment_rate=p_treatment,
            relative_lift=relative_lift,
            absolute_lift=absolute_lift,
            z_statistic=z,
            p_value=p_value,
            confidence_interval=ci,
            is_significant=p_value < self.alpha,
            power=power
        )
    
    def _calculate_power(self, p1: float, p2: float, n: int) -> float:
        """Calculate statistical power."""
        effect = abs(p2 - p1)
        if effect == 0:
            return 0
        
        p_pool = (p1 + p2) / 2
        se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
        
        z_crit = stats.norm.ppf(1 - self.alpha / 2)
        z_power = (effect - z_crit * se_null) / se_alt
        
        return stats.norm.cdf(z_power)
    
    def required_sample_size(
        self, 
        baseline_rate: float, 
        minimum_detectable_effect: float,
        power: float = 0.80
    ) -> int:
        """Calculate required sample size per group."""
        
        p1 = baseline_rate
        p2 = baseline_rate * (1 + minimum_detectable_effect)
        effect = abs(p2 - p1)
        
        p_pool = (p1 + p2) / 2
        z_alpha = stats.norm.ppf(1 - self.alpha / 2)
        z_beta = stats.norm.ppf(power)
        
        var_null = 2 * p_pool * (1 - p_pool)
        var_alt = p1 * (1 - p1) + p2 * (1 - p2)
        
        n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2
        
        return int(np.ceil(n))
    
    def print_report(self, result: ABTestResult, test_name: str = "A/B Test"):
        """Print a formatted test report."""
        
        print("\n" + "=" * 60)
        print(f"A/B TEST REPORT: {test_name}")
        print("=" * 60)
        
        print(f"\nConversion Rates:")
        print(f"  Control:   {result.control_rate:.2%}")
        print(f"  Treatment: {result.treatment_rate:.2%}")
        
        print(f"\nLift:")
        print(f"  Absolute: {result.absolute_lift:+.2%}")
        print(f"  Relative: {result.relative_lift:+.1%}")
        
        print(f"\nStatistical Analysis:")
        print(f"  Z-statistic: {result.z_statistic:.3f}")
        print(f"  P-value: {result.p_value:.4f}")
        print(f"  {1 - self.alpha:.0%} CI for difference: ({result.confidence_interval[0]:+.2%}, {result.confidence_interval[1]:+.2%})")
        
        print(f"\nTest Quality:")
        print(f"  Power: {result.power:.1%}")
        if result.power < self.power_threshold:
            print(f"  Warning: Low power. Consider larger sample size.")
        
        print(f"\nConclusion (α = {self.alpha}):")
        if result.is_significant:
            if result.absolute_lift > 0:
                print("  SIGNIFICANT: Treatment performs BETTER than control")
            else:
                print("  SIGNIFICANT: Treatment performs WORSE than control")
        else:
            print("  NOT SIGNIFICANT: Cannot conclude a difference exists")
            if result.power < self.power_threshold:
                print("  Note: Low power means we might be missing a real effect")
        
        print("=" * 60)


# Usage example
analyzer = ABTestAnalyzer(alpha=0.05)

# Test 1: Original example (not significant)
result1 = analyzer.run_test(
    control_successes=320, control_total=10000,
    treatment_successes=355, treatment_total=10000
)
analyzer.print_report(result1, "Checkout Button Color")

# Test 2: Larger sample (now significant!)
result2 = analyzer.run_test(
    control_successes=3200, control_total=100000,
    treatment_successes=3550, treatment_total=100000
)
analyzer.print_report(result2, "Checkout Button Color (Large Sample)")

# Calculate required sample size
n_required = analyzer.required_sample_size(
    baseline_rate=0.032,
    minimum_detectable_effect=0.10  # 10% relative improvement
)
print(f"\nTo detect 10% relative improvement with 80% power:")
print(f"Need {n_required:,} visitors per group")

Common Mistakes to Avoid

1. Peeking and Early Stopping

# BAD: Stopping as soon as p < 0.05
# Repeated looks push the false positive rate well above 5%
# (roughly 14% with 5 looks, roughly 25% with 20 looks)

# GOOD: Pre-specify sample size and run to completion
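A small simulation shows how much peeking costs. The sketch below assumes there is no real difference between the arms, checks the z-test at 5 equally spaced looks, and compares "stop at the first p < 0.05" against a single look at the end. The 3.2% rate, the 2,000 visitors per look, and the seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)   # arbitrary seed
rate, alpha = 0.032, 0.05
n_per_look, n_looks, n_sims = 2_000, 5, 20_000

# Cumulative conversions at each look; both arms share one true rate (H0 true)
a = rng.binomial(n_per_look, rate, size=(n_sims, n_looks)).cumsum(axis=1)
b = rng.binomial(n_per_look, rate, size=(n_sims, n_looks)).cumsum(axis=1)
n_seen = n_per_look * np.arange(1, n_looks + 1)   # visitors per arm at each look

# Two-proportion z-test evaluated at every look
p_pool = (a + b) / (2 * n_seen)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_seen)
z = (a - b) / n_seen / se
p_values = 2 * stats.norm.sf(np.abs(z))

fpr_peeking = np.mean((p_values < alpha).any(axis=1))   # stop at first "win"
fpr_final_only = np.mean(p_values[:, -1] < alpha)       # one look at the end
print(f"Peeking at every look: {fpr_peeking:.1%}")
print(f"Final look only:       {fpr_final_only:.1%}")
```

The final-look rate stays near 5%, while stopping at the first significant look should be flagged far more often even though nothing changed between the arms.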

2. Multiple Testing Without Correction

# Testing 20 variations? Some will be "significant" by chance!

# Bonferroni correction
alpha_corrected = 0.05 / 20  # = 0.0025

# Or use False Discovery Rate (FDR) correction
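For the FDR route, here is a minimal sketch of the Benjamini-Hochberg step-up procedure written with plain NumPy. The function name benjamini_hochberg and the six p-values are made up for illustration; production code would more likely rely on a tested library implementation.

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a boolean mask of hypotheses rejected at the given FDR level."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                        # indices of p-values, ascending
    thresholds = fdr * np.arange(1, m + 1) / m   # rank-dependent cutoffs i/m * fdr
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest rank meeting its cutoff
        rejected[order[:k + 1]] = True           # reject everything up to that rank
    return rejected

# Hypothetical p-values from six variants
p_vals = [0.001, 0.012, 0.021, 0.041, 0.20, 0.74]

rejected_bh = benjamini_hochberg(p_vals)
rejected_bonferroni = np.array(p_vals) < 0.05 / len(p_vals)

print("Benjamini-Hochberg:", rejected_bh)
print("Bonferroni:        ", rejected_bonferroni)
```

On these numbers Benjamini-Hochberg keeps the three smallest p-values while Bonferroni keeps only the first, which illustrates the trade: FDR control tolerates a small share of false discoveries in exchange for more power.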

Interview Questions

Question: Your A/B test shows the treatment group has 3.5% conversion vs 3.2% control. The p-value is 0.08. What do you tell stakeholders?
Answer: “At our standard α=0.05 threshold, this result is not statistically significant (p=0.08). However, there are a few considerations:
  1. The observed difference is 0.3 percentage points - if real, this could be meaningful at scale
  2. The p-value of 0.08 suggests weak evidence against the null hypothesis, not proof the treatment doesn’t work
  3. Consider power analysis - we may have been underpowered to detect this effect size
  4. Practical significance - if the change is low-risk and low-cost, you might still consider implementing
Recommendation: If resources allow, run a larger test to get more conclusive results.”
Question: You test 20 different variations of a product page. Three show p-values under 0.05. What’s the problem?
Answer: With 20 tests at α=0.05, we expect about 1 false positive even if nothing is different!
Expected false positives = 20 × 0.05 = 1
Solutions:
  1. Bonferroni correction: Use α = 0.05/20 = 0.0025 as threshold
  2. Benjamini-Hochberg (FDR): Control the expected proportion of false discoveries
  3. Holdout validation: Test the “winners” on fresh data
# Bonferroni
alpha_corrected = 0.05 / 20  # 0.0025
# Only results with p < 0.0025 are significant
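The expected count of one false positive can be complemented by the chance of getting at least one. The short sketch below assumes the 20 tests are independent, which real product-page variants may not be, so treat it as an approximation.

```python
alpha, m = 0.05, 20

# Probability of at least one false positive across m independent tests
fwer_uncorrected = 1 - (1 - alpha) ** m

# Same quantity when each test uses the Bonferroni threshold alpha / m
fwer_bonferroni = 1 - (1 - alpha / m) ** m

print(f"Uncorrected: {fwer_uncorrected:.1%}")
print(f"Bonferroni:  {fwer_bonferroni:.1%}")
```

Without correction the chance of at least one spurious winner is well over half; with Bonferroni it drops back to just under 5%.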
Question: An A/B test with 10,000 users per group shows no significant difference. Does this prove the treatment has no effect?
Answer: No! “Fail to reject null” ≠ “null is true”. We need to consider statistical power:
  1. What effect size could we detect? With n=10,000, we might only detect large effects
  2. What was our power? If power was 50%, we had a coin flip’s chance of detecting a real effect
  3. What’s the confidence interval? With p > 0.05 the 95% CI will generally include zero, but its width shows which effect sizes the data cannot rule out
# For a conversion test at 5% baseline, n=10,000/group
# We can reliably detect (80% power) only differences of roughly 0.9 percentage points
# Smaller effects would require larger samples
The correct conclusion: “We failed to find evidence of an effect of size > X”
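The "X" in that conclusion can be estimated with a back-of-the-envelope minimum detectable effect. The sketch below uses the common normal approximation with the baseline variance in both arms. The helper name approx_mde is hypothetical, and the result is a rough guide, not an exact power calculation.

```python
import numpy as np
from scipy import stats

def approx_mde(baseline, n_per_group, alpha=0.05, power=0.80):
    """Approximate absolute difference detectable with the given power."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = stats.norm.ppf(power)
    # Standard error of the difference, assuming both arms sit near the baseline
    se = np.sqrt(2 * baseline * (1 - baseline) / n_per_group)
    return (z_alpha + z_beta) * se

mde = approx_mde(baseline=0.05, n_per_group=10_000)
print(f"Approximate MDE: {mde * 100:.2f} percentage points")
```

For a 5% baseline and 10,000 users per group this comes out a little under 0.9 percentage points, so a null result here says little about effects smaller than that.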
Question: You’re running an A/B test. After 2 days (of planned 7), you peek at results and see p=0.03. Should you stop the test?
Answer: No! This is called “optional stopping” (a form of p-hacking) and inflates false positive rates. The p-value assumes you only look once at the end. If you peek repeatedly and stop at the first p < 0.05:
  • With 5 peeks, your actual false positive rate is ~14%, not 5%
  • With 10 peeks, it’s ~19%
Proper approaches:
  1. Sequential testing with adjusted thresholds (O’Brien-Fleming, Pocock)
  2. Bayesian methods that allow continuous monitoring
  3. Pre-commit to analysis plan and stick to it
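# Approximate O'Brien-Fleming nominal thresholds for 5 equally spaced looks
# (4 interim analyses plus the final one), overall two-sided α = 0.05:
# Look 1: α ≈ 0.000005
# Look 2: α ≈ 0.0013
# Look 3: α ≈ 0.0084
# Look 4: α ≈ 0.0225
# Look 5: α ≈ 0.0413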

Practice Challenge

Create a production-ready A/B test analysis tool:
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ABTestResult:
    control_rate: float
    treatment_rate: float
    relative_lift: float
    absolute_lift: float
    z_statistic: float
    p_value: float
    confidence_interval: Tuple[float, float]
    power: float
    is_significant: bool
    recommendation: str

class ProductionABTest:
    """
    Production-ready A/B test analyzer with:
    - Power analysis
    - Effect size estimation
    - Confidence intervals
    - Clear recommendations
    """
    
    def __init__(self, alpha: float = 0.05, min_detectable_effect: float = 0.1):
        self.alpha = alpha
        self.mde = min_detectable_effect
    
    def analyze(
        self,
        control_conversions: int,
        control_visitors: int,
        treatment_conversions: int,
        treatment_visitors: int,
        test_name: str = "A/B Test"
    ) -> ABTestResult:
        """Analyze A/B test results."""
        # Your implementation here
        pass
    
    def recommend_sample_size(
        self,
        baseline_rate: float,
        min_detectable_effect: float,
        power: float = 0.8
    ) -> int:
        """Calculate required sample size per group."""
        # Your implementation here
        pass
    
    def generate_report(self, result: ABTestResult) -> str:
        """Generate human-readable report."""
        # Your implementation here
        pass

# Test your implementation:
test = ProductionABTest()

# Scenario 1: Clear winner
result1 = test.analyze(
    control_conversions=500, control_visitors=10000,
    treatment_conversions=600, treatment_visitors=10000
)

# Scenario 2: Inconclusive
result2 = test.analyze(
    control_conversions=510, control_visitors=10000,
    treatment_conversions=530, treatment_visitors=10000
)

# Scenario 3: Treatment is worse
result3 = test.analyze(
    control_conversions=500, control_visitors=10000,
    treatment_conversions=420, treatment_visitors=10000
)
Full Solution:
class ProductionABTest:
    def __init__(self, alpha: float = 0.05, min_detectable_effect: float = 0.1):
        self.alpha = alpha
        self.mde = min_detectable_effect
    
    def analyze(
        self,
        control_conversions: int,
        control_visitors: int,
        treatment_conversions: int,
        treatment_visitors: int,
        test_name: str = "A/B Test"
    ) -> ABTestResult:
        # Calculate rates
        p_c = control_conversions / control_visitors
        p_t = treatment_conversions / treatment_visitors
        
        # Effect sizes
        absolute_lift = p_t - p_c
        relative_lift = (p_t - p_c) / p_c if p_c > 0 else 0
        
        # Pooled proportion and standard error
        p_pool = (control_conversions + treatment_conversions) / \
                 (control_visitors + treatment_visitors)
        se = np.sqrt(p_pool * (1 - p_pool) * 
                     (1/control_visitors + 1/treatment_visitors))
        
        # Z-test
        z = absolute_lift / se if se > 0 else 0
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        
        # Confidence interval for difference
        se_diff = np.sqrt(
            p_c * (1 - p_c) / control_visitors +
            p_t * (1 - p_t) / treatment_visitors
        )
        z_crit = stats.norm.ppf(1 - self.alpha/2)
        ci = (absolute_lift - z_crit * se_diff,
              absolute_lift + z_crit * se_diff)
        
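        # Power to detect the pre-specified minimum detectable effect (relative
        # lift self.mde) at these sample sizes. Power computed from the observed
        # effect would essentially restate the p-value, so a non-significant
        # result could never be reported as a high-powered null.
        power = self._calculate_power(
            control_visitors, treatment_visitors,
            p_c, p_c * (1 + self.mde)
        )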
        
        # Significance check
        is_significant = p_value < self.alpha
        
        # Generate recommendation
        recommendation = self._generate_recommendation(
            p_c, p_t, p_value, power, is_significant
        )
        
        return ABTestResult(
            control_rate=p_c,
            treatment_rate=p_t,
            relative_lift=relative_lift,
            absolute_lift=absolute_lift,
            z_statistic=z,
            p_value=p_value,
            confidence_interval=ci,
            power=power,
            is_significant=is_significant,
            recommendation=recommendation
        )
    
    def _calculate_power(self, n1, n2, p1, p2):
        """Calculate achieved power."""
        effect = abs(p2 - p1)
        pooled_p = (p1 + p2) / 2
        se = np.sqrt(pooled_p * (1 - pooled_p) * (1/n1 + 1/n2))
        z_crit = stats.norm.ppf(1 - self.alpha/2)
        z_power = (effect / se) - z_crit
        return stats.norm.cdf(z_power)
    
    def _generate_recommendation(self, p_c, p_t, p_value, power, sig):
        if sig and p_t > p_c:
            return "SHIP IT: Treatment significantly outperforms control"
        elif sig and p_t < p_c:
            return "STOP: Treatment significantly underperforms control"
        elif not sig and power < 0.5:
            return "INCONCLUSIVE: Test underpowered, consider running longer"
        elif not sig and power >= 0.8:
            return "NO EFFECT: High-powered test found no significant difference"
        else:
            return "BORDERLINE: Consider practical significance and run longer"
    
    def recommend_sample_size(self, baseline_rate, mde, power=0.8):
        target_rate = baseline_rate * (1 + mde)
        effect = target_rate - baseline_rate
        pooled_p = (baseline_rate + target_rate) / 2
        
        z_alpha = stats.norm.ppf(1 - self.alpha/2)
        z_beta = stats.norm.ppf(power)
        
        n = 2 * pooled_p * (1 - pooled_p) * ((z_alpha + z_beta) / effect) ** 2
        return int(np.ceil(n))

📝 Practice Exercises

Exercise 1

Conduct a one-sample hypothesis test

Exercise 2

Analyze an A/B test for conversion rates

Exercise 3

Calculate statistical power and sample size

Exercise 4

Real-world: Drug trial effectiveness testing

Key Takeaways

The Framework

  • Null hypothesis: no effect (innocent)
  • Alternative: there is an effect
  • P-value: how surprising is the data?
  • Decision threshold: typically α = 0.05

Types of Errors

  • Type I (α): False positive, claiming effect that doesn’t exist
  • Type II (β): False negative, missing real effect
  • Power = 1 - β: Ability to detect real effects

Sample Size Matters

  • Small samples = low power = missed effects
  • Calculate sample size BEFORE running test
  • Halving the detectable effect requires roughly four times the data (sample size grows with 1/effect²)
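How quickly the sample size requirement grows can be seen with the simplified formula n ≈ 2·p(1−p)·((z_α + z_β)/effect)². The sketch below uses a hypothetical helper approx_n and the 3.2% baseline from the button example. It is an approximation, not the exact formula used earlier.

```python
from scipy import stats

def approx_n(baseline, abs_effect, alpha=0.05, power=0.80):
    """Approximate per-group sample size for an absolute difference."""
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    return 2 * baseline * (1 - baseline) * (z / abs_effect) ** 2

n_full = approx_n(0.032, 0.0070)   # detect a 0.70 percentage point difference
n_half = approx_n(0.032, 0.0035)   # detect a 0.35 percentage point difference
ratio = n_half / n_full

print(f"0.70 pp effect: {n_full:,.0f} per group")
print(f"0.35 pp effect: {n_half:,.0f} per group")
print(f"Ratio: {ratio:.1f}x")
```

Halving the effect multiplies the required sample by four, because the effect enters the formula squared in the denominator.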

Test Selection

  • Two proportions: Chi-square or z-test
  • Two means: t-test
  • Multiple groups: ANOVA
  • Non-normal: Mann-Whitney U
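The Mann-Whitney U option is not demonstrated elsewhere in this module, so here is a minimal sketch using scipy.stats.mannwhitneyu. The session lengths are made-up numbers with one extreme value in each group, the kind of skew that can distort a t-test.

```python
import numpy as np
from scipy import stats

# Hypothetical session lengths in seconds, each group with one outlier
group_a = np.array([12, 15, 14, 10, 13, 95, 11, 16, 14, 12])
group_b = np.array([18, 22, 19, 25, 21, 20, 140, 23, 24, 19])

# Rank-based test: compares orderings, so outliers carry little weight
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"U-statistic: {u_stat:.1f}")
print(f"p-value: {p_value:.4f}")
```

Because the test works on ranks, the 95-second and 140-second sessions count only as "largest in their group" and do not dominate the result.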

Common Pitfalls

A/B Testing Mistakes to Avoid:
  1. Peeking & Early Stopping - Checking daily inflates false positives; use sequential testing methods instead
  2. Underpowered Tests - Running with too few samples misses real effects; calculate sample size first
  3. Multiple Comparisons - Testing 20 variants without correction makes false positives likely (about a 64% chance of at least one at α = 0.05)
  4. Ignoring Practical Significance - A p < 0.05 with 0.01% improvement isn’t worth shipping
  5. One-Tailed When Uncertain - Only use one-tailed tests when an effect in the opposite direction truly would not matter
  6. P-value Misinterpretation - P-value is NOT the probability the null is true!

Connection to Machine Learning

| Hypothesis Testing | ML Application |
|---|---|
| A/B testing | Model comparison, feature evaluation |
| Power analysis | Training set size planning |
| Multiple testing correction | Hyperparameter search, feature selection |
| Type I/II errors | Precision/Recall tradeoff |
| Significance testing | Statistical validation of model improvements |
ML Connection: Every time you compare “Model A accuracy = 0.92 vs Model B accuracy = 0.89”, you’re implicitly doing hypothesis testing. The question is: is this 3% difference real or just noise from your test set? Proper statistical testing (like paired t-tests on cross-validation folds) gives you the answer.
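A paired t-test over cross-validation folds might look like the sketch below. The per-fold accuracies are hypothetical numbers, and both models are assumed to have been evaluated on the same five splits, which is what makes the pairing valid.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies from the same 5 cross-validation splits
model_a = np.array([0.942, 0.938, 0.945, 0.940, 0.944])
model_b = np.array([0.938, 0.939, 0.940, 0.936, 0.941])

# Paired test: each fold is scored by both models, so compare fold by fold
t_stat, p_value = stats.ttest_rel(model_a, model_b)
mean_diff = np.mean(model_a - model_b)

print(f"Mean accuracy difference: {mean_diff:.4f}")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```

Fold scores share training data and are therefore not fully independent, so the p-value from this test is best read as a rough guide and not as an exact error rate.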
Coming up next: We’ll learn about Correlation and Regression - how to understand relationships between variables and make predictions. This is where statistics directly becomes machine learning.

Interview Deep-Dive

Strong Answer:
  • Neither is automatically right — the answer depends on context that a p-value alone does not provide. A p=0.04 means the result is technically significant at alpha=0.05, but there are several things I would check before making a recommendation.
  • First, what is the effect size? If the new variant improved conversion by 0.02 percentage points (from 3.00% to 3.02%), the result may be statistically significant with a large enough sample but practically meaningless. Shipping a code change, increasing technical debt, and potentially confusing users for a 0.02pp improvement is not worth it. I would compute the expected annual revenue impact and compare it to the implementation cost.
  • Second, what is the power of the test? If we planned for 80% power to detect a 10% relative lift but only ran enough traffic for 50% power, then the lift behind p=0.04 is probably an inflated estimate (the “winner’s curse”: significant results from underpowered tests tend to overestimate effect sizes).
  • Third, did anyone peek at the data before the test concluded? If the team checked results daily, the effective alpha is much higher than 0.05 due to multiple comparisons, and p=0.04 may not actually be significant under the true (inflated) alpha.
  • My recommendation: if the effect size is meaningful, the test was pre-registered, and nobody peeked, ship it. If any of those conditions are not met, run a confirmation test or extend the current one.
Follow-up: Walk me through the “winner’s curse” in A/B testing. How does it inflate effect size estimates?
The winner’s curse occurs because we only ship results that cross the significance threshold. Imagine the true effect is a 2% lift. Due to sampling noise, your measured lift will be randomly higher or lower than 2%. If you only ship when p is less than 0.05, you are selecting for experiments where random noise happened to push the measured lift above 2%. The shipped estimate is biased upward: it is the true effect plus a positive noise component. In underpowered tests, this bias is especially large because you need a bigger “lucky” noise realization to cross the significance threshold. The practical consequence: the business case you built on “5% measured lift” might only deliver 2% lift in production, because the extra 3% was noise that happened to be in your favor. The fix is to either run adequately powered tests (where the noise is small relative to the true effect) or use shrinkage estimators that adjust for the selection bias.
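A small simulation can show the selection effect described above. This is a sketch under simplifying assumptions: each experiment's measured lift is drawn from a normal distribution around a hypothetical true lift of 0.02 with a hypothetical standard error of 0.015 (an underpowered setup), and an experiment "ships" only when it is significant in the positive direction (z > 1.96).

```python
import random
import statistics


def simulate_winners_curse(true_lift=0.02, standard_error=0.015,
                           num_experiments=20000, seed=0):
    """Return (mean measured lift among shipped tests, share of tests shipped)."""
    rng = random.Random(seed)
    z_crit = 1.96
    shipped = []
    for _ in range(num_experiments):
        measured = rng.gauss(true_lift, standard_error)
        # Ship only when the measured lift is significantly positive.
        if measured / standard_error > z_crit:
            shipped.append(measured)
    return statistics.mean(shipped), len(shipped) / num_experiments


mean_shipped, share_shipped = simulate_winners_curse()
print(f"true lift: 0.020, mean shipped lift: {mean_shipped:.3f}, "
      f"share shipped: {share_shipped:.0%}")
```

Every shipped estimate must exceed 1.96 standard errors, so the average shipped lift necessarily sits above the true lift whenever the true lift is smaller than that cutoff.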
Strong Answer:
  • Statistical significance means the observed difference is unlikely to have occurred by chance alone. Practical significance means the difference is large enough to actually matter for the business.
  • Concrete example: an e-commerce company runs an A/B test on 2 million users and finds the new homepage increases average order value from $47.20 to $47.35. With that sample size, the p-value is 0.001, which is highly statistically significant. But the actual improvement is $0.15 per order, which translates to maybe $150K annually. If the homepage redesign cost $500K in engineering time and introduced new technical debt, the statistically significant result is practically worthless.
  • Conversely, a startup tests a new pricing page on 500 users and sees a conversion lift from 3% to 5%. The p-value is 0.08, which is not statistically significant at alpha=0.05. But the 67% relative lift, if real, would increase the company’s revenue by roughly two-thirds. The practical significance is enormous; the test was just underpowered. The right action is to run longer, not to conclude “no effect.”
  • The way I think about it: p-values tell you whether the signal is distinguishable from noise. Effect size and business context tell you whether the signal matters. You need both.
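The two checks can be reported side by side. The sketch below uses a two-proportion z-test with a normal approximation. The counts (3.00% vs 3.02% conversion on 10 million users per arm) and the 0.1-percentage-point practical threshold are hypothetical, and `compare_conversion` is an illustrative helper rather than a library function.

```python
import math
from statistics import NormalDist


def compare_conversion(conv_a, n_a, conv_b, n_b, min_practical_lift, alpha=0.05):
    """Two-proportion z-test plus a practical-significance flag."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = p_b - p_a
    # Pooled standard error for the test of "no difference".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(lift / se_null)))
    # Unpooled standard error for the confidence interval of the lift.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "lift": lift,
        "p_value": p_value,
        "ci": (lift - z_crit * se, lift + z_crit * se),
        "statistically_significant": p_value < alpha,
        # Assumption: "practical" means the point estimate clears the threshold.
        "practically_significant": lift >= min_practical_lift,
    }


result = compare_conversion(300_000, 10_000_000, 302_000, 10_000_000,
                            min_practical_lift=0.001)
print(result)
```

With these hypothetical counts the test is statistically significant while the lift stays far below the practical threshold, which is the first scenario in the answer above.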
Follow-up: How would you set the minimum detectable effect for an A/B test before running it?
I work backwards from the business case. First, I ask: “What is the smallest improvement that would justify the cost of implementing this change?” If the change requires 2 weeks of engineering time ($30K cost), and we get 1 million orders per year at $50 average, then a $0.03 improvement per order generates $30K annually, which is barely break-even. So the MDE should be at least $0.05-$0.10 per order to provide a comfortable return. Then I compute the sample size needed to detect that MDE with 80% power. If the required traffic exceeds what we can collect in a reasonable timeframe (say 4 weeks), I would either accept a larger MDE, relax alpha from 0.05 to 0.10, or reconsider whether the test is worth running at all. This upfront planning prevents the common trap of running underpowered tests that waste weeks of traffic and produce inconclusive results.
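The planning steps above can be sketched with the standard two-sample formula for comparing means, n = 2((z_alpha/2 + z_beta) * sigma / MDE)^2 per group. The $30 standard deviation of order value is an assumption made up for illustration; the cost and order volume come from the example above.

```python
import math
from statistics import NormalDist


def break_even_lift_per_order(cost, orders_per_year):
    """Smallest per-order improvement that pays back the cost within a year."""
    return cost / orders_per_year


def sample_size_per_group(mde, std_dev, alpha=0.05, power=0.80):
    """Users per group needed to detect a difference in means of size `mde`
    with a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * std_dev / mde) ** 2)


break_even = break_even_lift_per_order(cost=30_000, orders_per_year=1_000_000)
ORDER_VALUE_STD = 30.0  # hypothetical standard deviation of order value, in dollars

n_for_10_cents = sample_size_per_group(mde=0.10, std_dev=ORDER_VALUE_STD)
n_for_5_cents = sample_size_per_group(mde=0.05, std_dev=ORDER_VALUE_STD)
print(f"break-even lift per order: ${break_even:.2f}")
print(f"n per group for $0.10 MDE: {n_for_10_cents:,}")
print(f"n per group for $0.05 MDE: {n_for_5_cents:,}")
```

Halving the MDE roughly quadruples the required sample, which is why the MDE is the first lever to revisit when the traffic budget is too small.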
Strong Answer:
  • P-hacking is the practice of manipulating data analysis to find statistically significant results. Common forms include: checking results daily and stopping when p drops below 0.05, testing multiple metrics and only reporting the one that is significant, segmenting data after the fact to find a subgroup where the effect is significant, or adding and removing covariates until significance appears.
  • Each of these inflates the false positive rate well beyond the nominal 5%. A team that checks daily for 14 days has effectively run 14 (correlated) tests, pushing the real false positive rate to roughly 20-25%. A team that tests 20 independent metrics has about a 64% chance of finding at least one “significant” result by chance alone.
  • To prevent it at the platform level, I would design the system with these guardrails: (1) Pre-registration: require teams to specify the primary metric, sample size, and analysis plan before the test launches. Lock these parameters. (2) Sequential testing: use methods like always-valid p-values or group sequential designs that allow continuous monitoring without inflating the error rate. (3) Automated correction: when multiple metrics are tracked, automatically apply Benjamini-Hochberg correction and highlight the distinction between primary and exploratory metrics. (4) Mandatory effect size reporting: always show the confidence interval for the effect size alongside the p-value. (5) Cool-off period: require a minimum test duration covering at least one full business cycle before results can be acted upon.
  • The cultural piece is equally important: incentivize teams for running well-designed experiments regardless of outcome, not just for finding “winners.”
Follow-up: How do sequential testing methods (like always-valid p-values) allow peeking without inflating false positives?
Traditional p-values assume you look at the data exactly once. Sequential testing methods use a different mathematical framework, typically based on martingale theory or spending functions, that accounts for continuous monitoring. The always-valid p-value is constructed so that it keeps its false positive guarantee no matter how many times you look at the data. The tradeoff is that at any fixed sample size, the sequential method requires slightly more evidence to declare significance compared to a fixed-sample test. Think of it as paying an “insurance premium” for the right to peek continuously. In practice, the cost is modest (roughly 20-30% more traffic) and the benefit is enormous: teams can monitor experiments in real-time, stop harmful experiments early, and make shipping decisions whenever the evidence is clear, rather than waiting for a pre-specified end date.
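To see why a single-look p-value breaks under daily peeking, the sketch below simulates A/A experiments (no true effect) and stops at the first day on which the cumulative z-score crosses 1.96. It assumes equal traffic each day and models each day's standardized difference as an independent standard normal draw; it illustrates the problem only and does not implement a sequential testing method.

```python
import math
import random


def false_positive_rate_with_peeking(num_looks, num_experiments=5000, seed=0):
    """Share of null experiments declared 'significant' at any of the looks."""
    rng = random.Random(seed)
    z_crit = 1.96
    false_positives = 0
    for _ in range(num_experiments):
        cumulative = 0.0
        for look in range(1, num_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # one day's evidence under the null
            z = cumulative / math.sqrt(look)   # z-score of all data so far
            if abs(z) > z_crit:
                false_positives += 1
                break  # the team stops and declares a winner
    return false_positives / num_experiments


single_look = false_positive_rate_with_peeking(num_looks=1)
daily_peeks = false_positive_rate_with_peeking(num_looks=14)
print(f"one look: {single_look:.1%}, 14 daily looks: {daily_peeks:.1%}")
```

The single-look rate stays near the nominal 5%, while stopping at the first significant daily check pushes it several times higher.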
Strong Answer:
  • Under the global null hypothesis (no effect for any test), the expected number of false positives from 10 tests at alpha=0.05 is 0.5, and the chance of at least one is about 40%. So 3 “significant” results out of 10 cannot be taken at face value: any one of them could be a false positive.
  • However, it is unrealistic to assume all 10 nulls are true. If you are testing reasonable product changes, maybe 3-4 of them actually have real effects. In that case, 3 significant results might include 2-3 real effects and 0-1 false positive.
  • The standard corrections are Bonferroni (divide alpha by the number of tests, requiring p less than 0.005) and Benjamini-Hochberg (FDR control, which is less conservative). Bonferroni controls the family-wise error rate but is very strict, so you might miss real effects. BH controls the false discovery rate, saying “of the results I call significant, on average at most X% are false.”
  • The best practice is to flag all 3 as candidates, apply BH correction to see which survive, and then run a focused confirmation test on the 1-2 that survive correction. The confirmation test uses fresh data and a single pre-specified hypothesis, eliminating the multiple testing problem entirely.
Follow-up: What is the False Discovery Rate, and why is it often more useful than the Family-Wise Error Rate in practice?
The Family-Wise Error Rate (FWER) is the probability of making at least one false positive among all tests. Bonferroni controls this by making each individual test extremely stringent. The problem is that as the number of tests grows, each test becomes so conservative that you lose power to detect real effects. With 100 tests, each requires p less than 0.0005, so virtually nothing passes. The False Discovery Rate (FDR) controls a different quantity: the expected proportion of false positives among the rejected hypotheses. If you control FDR at 5% and get 20 significant results, you expect about 1 to be a false discovery. This is much more practical for exploratory analysis because you maintain reasonable power while keeping the false discovery proportion bounded. In genomics, where researchers test thousands of genes simultaneously, FDR is the standard approach. In tech, it is increasingly used for feature experimentation platforms that run many simultaneous tests.
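The two corrections can be compared on a small example. This is a minimal standard-library sketch of Bonferroni and the Benjamini-Hochberg step-up procedure; the ten p-values are hypothetical and chosen so that three fall below 0.05 before correction. In practice a maintained implementation such as `statsmodels.stats.multitest.multipletests` is the usual choice.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / m (controls the family-wise error rate)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, fdr=0.05):
    """Step-up procedure: find the largest rank k with p_(k) <= (k / m) * fdr,
    then reject the k smallest p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values from 10 tests; 3 are below 0.05 before correction.
p_values = [0.001, 0.008, 0.039, 0.06, 0.074, 0.205, 0.212, 0.35, 0.51, 0.8]

uncorrected = sum(p < 0.05 for p in p_values)
bonferroni = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print(f"uncorrected: {uncorrected}, Bonferroni: {sum(bonferroni)}, BH: {sum(bh)}")
```

On these numbers Bonferroni keeps one result and BH keeps two, matching the pattern in the answer above: BH is less conservative, and the survivors are the candidates for a focused confirmation test.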