
Hypothesis Testing: Real Effect or Random Noise?

The A/B Testing Problem

You work at an e-commerce company. The design team created a new checkout button - green instead of blue. After running both versions for a week:
| Version | Visitors | Purchases | Conversion Rate |
| --- | --- | --- | --- |
| Blue (Control) | 10,000 | 320 | 3.20% |
| Green (New) | 10,000 | 355 | 3.55% |
The green button has a higher conversion rate. But is this a real improvement or just random chance? This is the fundamental question of hypothesis testing.
Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: Modules 1-4 (especially Distributions and Inference)
What You’ll Build: Complete A/B testing framework

The Framework: Innocent Until Proven Guilty

Hypothesis testing borrows from the legal system:
| Legal System | Hypothesis Testing |
| --- | --- |
| Defendant is innocent until proven guilty | No effect until proven otherwise |
| Prosecution must prove guilt beyond reasonable doubt | Data must prove effect with high confidence |
| Jury verdict: guilty or not guilty | Decision: reject or fail to reject null |
| “Not guilty” ≠ “innocent” | “Fail to reject” ≠ “effect doesn’t exist” |
(Figure: Hypothesis Testing Framework)

The Two Hypotheses

Null Hypothesis (H₀): The default assumption. Nothing special is happening.
  • “The new button has the same conversion rate as the old one”
  • “The drug has no effect”
  • “The two groups are the same”
Alternative Hypothesis (H₁ or Hₐ): What we’re trying to prove.
  • “The new button has a different conversion rate”
  • “The drug has an effect”
  • “The groups are different”
# Our A/B test hypotheses:
# H₀: p_green = p_blue (no difference)
# H₁: p_green ≠ p_blue (there is a difference)

The P-Value: Quantifying Surprise

The p-value answers: “If there really were no effect, how likely would we be to see data this extreme?”
(Figure: P-Value Intuition)

Interpreting P-Values

| P-Value | Interpretation |
| --- | --- |
| p < 0.01 | Strong evidence against null hypothesis |
| p < 0.05 | Moderate evidence against null hypothesis |
| p < 0.10 | Weak evidence against null hypothesis |
| p ≥ 0.10 | Little to no evidence against null hypothesis |
Common threshold (α): 0.05 (5%)
  • If p < 0.05, we reject the null hypothesis
  • If p ≥ 0.05, we fail to reject the null hypothesis
Critical Misconception: The p-value is NOT the probability that the null hypothesis is true. It’s the probability of seeing data this extreme IF the null hypothesis were true.
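To make the definition concrete, here is a minimal simulation sketch using the button data from the example above (the number of simulations and the random seed are arbitrary choices): if both buttons truly shared one conversion rate, how often would sampling noise alone produce a gap at least as large as the 0.35% we observed?
import numpy as np

# Simulate a world where H₀ is true: both buttons share the pooled conversion rate
rng = np.random.default_rng(42)              # arbitrary seed for reproducibility
n_sims = 100_000
p_pooled = (320 + 355) / 20_000              # pooled rate under the null

# Draw purchase counts for two groups of 10,000 visitors under the null
sim_blue = rng.binomial(10_000, p_pooled, n_sims) / 10_000
sim_green = rng.binomial(10_000, p_pooled, n_sims) / 10_000

# How often is the simulated gap at least as extreme as the observed 0.35%?
observed_diff = 0.0355 - 0.0320
sim_p_value = np.mean(np.abs(sim_green - sim_blue) >= observed_diff)
print(f"Simulated p-value: {sim_p_value:.3f}")
This simulated p-value approximates the same quantity the formal tests in the next section compute analytically.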

Testing Our A/B Example

Let’s test whether the green button is actually better:
import numpy as np
from scipy import stats

# Data
blue_visitors = 10000
blue_purchases = 320
blue_rate = blue_purchases / blue_visitors

green_visitors = 10000
green_purchases = 355
green_rate = green_purchases / green_visitors

print(f"Blue conversion rate: {blue_rate:.2%}")
print(f"Green conversion rate: {green_rate:.2%}")
print(f"Observed difference: {green_rate - blue_rate:.2%}")

Method 1: Two-Proportion Z-Test

def two_proportion_z_test(x1, n1, x2, n2):
    """
    Test if two proportions are significantly different.
    
    x1, x2: number of successes
    n1, n2: number of trials
    
    Returns: z-statistic, p-value (two-tailed)
    """
    # Sample proportions
    p1 = x1 / n1
    p2 = x2 / n2
    
    # Pooled proportion (under null hypothesis)
    p_pool = (x1 + x2) / (n1 + n2)
    
    # Standard error under null
    se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    
    # Z-statistic
    z = (p1 - p2) / se
    
    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    
    return z, p_value

# Run the test
z_stat, p_value = two_proportion_z_test(
    green_purchases, green_visitors,
    blue_purchases, blue_visitors
)

print(f"\nZ-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nResult: Reject null hypothesis")
    print("The difference is statistically significant at α=0.05")
else:
    print("\nResult: Fail to reject null hypothesis")
    print("The difference is NOT statistically significant at α=0.05")
Output:
Blue conversion rate: 3.20%
Green conversion rate: 3.55%
Observed difference: 0.35%

Z-statistic: 1.404
P-value: 0.1603

Result: Fail to reject null hypothesis
The difference is NOT statistically significant at α=0.05
Despite the 0.35% improvement, the p-value of 0.16 means we cannot rule out random chance.

Method 2: Chi-Square Test

from scipy.stats import chi2_contingency

# Contingency table
#                  Purchased    Not Purchased
# Blue               320          9680
# Green              355          9645

contingency_table = np.array([
    [blue_purchases, blue_visitors - blue_purchases],
    [green_purchases, green_visitors - green_purchases]
])

# correction=False: skip Yates' continuity correction so the result
# is directly comparable to the two-proportion z-test above
chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
Output:
Chi-square statistic: 1.972
P-value: 0.1603
Degrees of freedom: 1
Same p-value, same conclusion.
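The agreement is not a coincidence: for a 2×2 table analyzed without a continuity correction, the chi-square statistic is exactly the square of the two-proportion z-statistic. A quick check, reusing z_stat and chi2 from the snippets above:
# For a 2x2 table without Yates' correction, chi-square = z²
print(f"z² = {z_stat**2:.3f}")
print(f"χ² = {chi2:.3f}")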

Types of Errors

We can make two types of mistakes:
|  | H₀ is True (No Effect) | H₀ is False (Real Effect) |
| --- | --- | --- |
| Reject H₀ | Type I Error (False Positive) | Correct Decision |
| Fail to Reject H₀ | Correct Decision | Type II Error (False Negative) |

Type I Error (α): False Positive

We claim there’s an effect when there isn’t one.
  • Probability = α (typically 0.05)
  • “The boy who cried wolf”
  • Example: Launching a feature that doesn’t actually help

Type II Error (β): False Negative

We miss a real effect.
  • Probability = β (varies, often 0.20)
  • Power = 1 - β (typically 0.80)
  • Example: Abandoning a feature that would have helped
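One way to internalize α is to simulate it. The sketch below is illustrative only (the conversion rate, group size, number of experiments, and seed are arbitrary choices): it runs many A/A tests in which both groups share the same true rate and counts how often the two-proportion z-test declares a significant difference anyway. The long-run rate should sit close to 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)                      # arbitrary seed
n, true_rate, n_experiments = 10_000, 0.032, 5_000  # illustrative values
false_positives = 0

for _ in range(n_experiments):
    a = rng.binomial(n, true_rate)                  # "control" purchases
    b = rng.binomial(n, true_rate)                  # "treatment" purchases, same true rate
    p_pool = (a + b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (b / n - a / n) / se
    p = 2 * (1 - stats.norm.cdf(abs(z)))
    false_positives += p < 0.05

print(f"Empirical false positive rate: {false_positives / n_experiments:.1%}")  # ≈ 5%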
# Visualize the tradeoff
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Type I Error
x = np.linspace(-4, 4, 1000)
null_dist = stats.norm.pdf(x, 0, 1)

axes[0].plot(x, null_dist, 'b-', linewidth=2, label='Null Distribution')
axes[0].fill_between(x[x > 1.96], null_dist[x > 1.96], alpha=0.3, color='red', 
                      label=f'Type I Error Region (α/2)')
axes[0].fill_between(x[x < -1.96], null_dist[x < -1.96], alpha=0.3, color='red')
axes[0].axvline(1.96, color='red', linestyle='--')
axes[0].axvline(-1.96, color='red', linestyle='--')
axes[0].set_title('Type I Error (False Positive)')
axes[0].legend()

# Type II Error (with alternative distribution)
alt_dist = stats.norm.pdf(x, 2, 1)  # Effect exists, shifted right

axes[1].plot(x, null_dist, 'b-', linewidth=2, label='Null (No Effect)')
axes[1].plot(x, alt_dist, 'g-', linewidth=2, label='Alternative (Real Effect)')
axes[1].fill_between(x[(x > -1.96) & (x < 1.96)], alt_dist[(x > -1.96) & (x < 1.96)], 
                      alpha=0.3, color='orange', label='Type II Error Region (β)')
axes[1].axvline(1.96, color='red', linestyle='--')
axes[1].axvline(-1.96, color='red', linestyle='--')
axes[1].set_title('Type II Error (False Negative)')
axes[1].legend()

plt.tight_layout()
plt.show()

Statistical Power: Ability to Detect Real Effects

Power = Probability of detecting an effect when it exists = 1 - β.
Higher power means:
  • Less likely to miss real effects
  • Requires larger sample sizes
  • More confidence in negative results
def power_proportion_test(p1, p2, n, alpha=0.05):
    """
    Calculate power for a two-proportion test.
    
    p1: control proportion
    p2: treatment proportion
    n: sample size per group
    alpha: significance level
    """
    # Effect size
    effect = abs(p2 - p1)
    
    # Pooled standard error under null
    p_pool = (p1 + p2) / 2
    se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    
    # Standard error under alternative
    se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    
    # Critical value
    z_crit = stats.norm.ppf(1 - alpha / 2)
    
    # Power
    z_power = (effect - z_crit * se_null) / se_alt
    power = stats.norm.cdf(z_power)
    
    return power

# Our A/B test: 3.20% vs 3.55%, n=10,000 per group
power = power_proportion_test(0.032, 0.0355, 10000)
print(f"Power of our test: {power:.1%}")

# What if we had 50,000 per group?
power_large = power_proportion_test(0.032, 0.0355, 50000)
print(f"Power with n=50,000: {power_large:.1%}")
Output:
Power of our test: 27.3%
Power with n=50,000: 75.8%
With only 10,000 per group, we had only a 27% chance of detecting that 0.35% difference. No wonder we failed to find significance.

Sample Size Calculation for Desired Power

def sample_size_proportion_test(p1, p2, power=0.80, alpha=0.05):
    """
    Calculate required sample size per group.
    
    p1: expected control proportion
    p2: expected treatment proportion
    power: desired power (typically 0.80)
    alpha: significance level (typically 0.05)
    """
    # Effect size
    effect = abs(p2 - p1)
    
    # Pooled proportion
    p_pool = (p1 + p2) / 2
    
    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    
    # Variance terms
    var_null = 2 * p_pool * (1 - p_pool)
    var_alt = p1 * (1 - p1) + p2 * (1 - p2)
    
    # Sample size formula
    n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2
    
    return int(np.ceil(n))

# How many visitors do we need to detect a 0.35% difference?
n_needed = sample_size_proportion_test(0.032, 0.0355)
print(f"Need {n_needed:,} visitors per group to detect 0.35% difference with 80% power")

# For a larger effect (1% improvement)
n_1pct = sample_size_proportion_test(0.032, 0.042)
print(f"Need {n_1pct:,} visitors per group to detect 1.0% difference with 80% power")
Output:
Need 48,614 visitors per group to detect 0.35% difference with 80% power
Need 6,038 visitors per group to detect 1.0% difference with 80% power

Common Statistical Tests

1. One-Sample t-Test

Is this sample mean different from a known value?
# Are our website load times different from the 3-second industry standard?
load_times = np.array([2.8, 3.2, 2.9, 3.5, 2.7, 3.1, 2.6, 3.0, 2.9, 3.3])

t_stat, p_value = stats.ttest_1samp(load_times, 3.0)
print(f"Sample mean: {np.mean(load_times):.2f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

2. Two-Sample t-Test

Are the means of two groups different?
# Do users spend more time on new homepage design?
old_design_time = np.array([45, 52, 38, 61, 42, 55, 48, 50, 44, 58])
new_design_time = np.array([58, 62, 55, 70, 65, 60, 68, 72, 63, 59])

t_stat, p_value = stats.ttest_ind(old_design_time, new_design_time)
print(f"Old design mean: {np.mean(old_design_time):.1f}s")
print(f"New design mean: {np.mean(new_design_time):.1f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

3. Paired t-Test

Before/after comparisons on the same subjects:
# Does a training program improve test scores?
before = np.array([65, 72, 58, 80, 75, 62, 70, 68, 74, 78])
after = np.array([70, 78, 62, 85, 82, 68, 75, 72, 80, 82])

t_stat, p_value = stats.ttest_rel(before, after)
print(f"Mean improvement: {np.mean(after - before):.1f} points")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

4. ANOVA

Are three or more groups different?
# Do three different ad campaigns have different click rates?
campaign_a = np.array([2.1, 2.3, 2.0, 2.4, 2.2])
campaign_b = np.array([2.8, 3.0, 2.9, 3.1, 2.7])
campaign_c = np.array([2.3, 2.5, 2.4, 2.6, 2.2])

f_stat, p_value = stats.f_oneway(campaign_a, campaign_b, campaign_c)
print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.4f}")

Complete A/B Testing Framework

import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Tuple, Optional

@dataclass
class ABTestResult:
    """Results of an A/B test."""
    control_rate: float
    treatment_rate: float
    relative_lift: float
    absolute_lift: float
    z_statistic: float
    p_value: float
    confidence_interval: Tuple[float, float]
    is_significant: bool
    power: float
    
class ABTestAnalyzer:
    """
    Complete A/B testing framework with proper statistical methodology.
    """
    
    def __init__(self, alpha: float = 0.05, power_threshold: float = 0.80):
        self.alpha = alpha
        self.power_threshold = power_threshold
    
    def run_test(
        self, 
        control_successes: int, 
        control_total: int,
        treatment_successes: int, 
        treatment_total: int
    ) -> ABTestResult:
        """Run a two-proportion z-test."""
        
        # Calculate rates
        p_control = control_successes / control_total
        p_treatment = treatment_successes / treatment_total
        
        # Lifts
        absolute_lift = p_treatment - p_control
        relative_lift = (p_treatment - p_control) / p_control if p_control > 0 else 0
        
        # Pooled proportion
        p_pool = (control_successes + treatment_successes) / (control_total + treatment_total)
        
        # Standard error
        se = np.sqrt(p_pool * (1 - p_pool) * (1/control_total + 1/treatment_total))
        
        # Z-statistic
        z = absolute_lift / se if se > 0 else 0
        
        # P-value (two-tailed)
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        
        # Confidence interval for the difference
        se_diff = np.sqrt(
            p_control * (1 - p_control) / control_total +
            p_treatment * (1 - p_treatment) / treatment_total
        )
        z_crit = stats.norm.ppf(1 - self.alpha / 2)
        ci = (absolute_lift - z_crit * se_diff, absolute_lift + z_crit * se_diff)
        
        # Power (approximate)
        power = self._calculate_power(p_control, p_treatment, min(control_total, treatment_total))
        
        return ABTestResult(
            control_rate=p_control,
            treatment_rate=p_treatment,
            relative_lift=relative_lift,
            absolute_lift=absolute_lift,
            z_statistic=z,
            p_value=p_value,
            confidence_interval=ci,
            is_significant=p_value < self.alpha,
            power=power
        )
    
    def _calculate_power(self, p1: float, p2: float, n: int) -> float:
        """Calculate statistical power."""
        effect = abs(p2 - p1)
        if effect == 0:
            return 0
        
        p_pool = (p1 + p2) / 2
        se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
        
        z_crit = stats.norm.ppf(1 - self.alpha / 2)
        z_power = (effect - z_crit * se_null) / se_alt
        
        return stats.norm.cdf(z_power)
    
    def required_sample_size(
        self, 
        baseline_rate: float, 
        minimum_detectable_effect: float,
        power: float = 0.80
    ) -> int:
        """Calculate required sample size per group."""
        
        p1 = baseline_rate
        p2 = baseline_rate * (1 + minimum_detectable_effect)
        effect = abs(p2 - p1)
        
        p_pool = (p1 + p2) / 2
        z_alpha = stats.norm.ppf(1 - self.alpha / 2)
        z_beta = stats.norm.ppf(power)
        
        var_null = 2 * p_pool * (1 - p_pool)
        var_alt = p1 * (1 - p1) + p2 * (1 - p2)
        
        n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2
        
        return int(np.ceil(n))
    
    def print_report(self, result: ABTestResult, test_name: str = "A/B Test"):
        """Print a formatted test report."""
        
        print("\n" + "=" * 60)
        print(f"A/B TEST REPORT: {test_name}")
        print("=" * 60)
        
        print(f"\nConversion Rates:")
        print(f"  Control:   {result.control_rate:.2%}")
        print(f"  Treatment: {result.treatment_rate:.2%}")
        
        print(f"\nLift:")
        print(f"  Absolute: {result.absolute_lift:+.2%}")
        print(f"  Relative: {result.relative_lift:+.1%}")
        
        print(f"\nStatistical Analysis:")
        print(f"  Z-statistic: {result.z_statistic:.3f}")
        print(f"  P-value: {result.p_value:.4f}")
        print(f"  95% CI for difference: ({result.confidence_interval[0]:+.2%}, {result.confidence_interval[1]:+.2%})")
        
        print(f"\nTest Quality:")
        print(f"  Power: {result.power:.1%}")
        if result.power < self.power_threshold:
            print(f"  Warning: Low power. Consider larger sample size.")
        
        print(f"\nConclusion (α = {self.alpha}):")
        if result.is_significant:
            if result.absolute_lift > 0:
                print("  SIGNIFICANT: Treatment performs BETTER than control")
            else:
                print("  SIGNIFICANT: Treatment performs WORSE than control")
        else:
            print("  NOT SIGNIFICANT: Cannot conclude a difference exists")
            if result.power < self.power_threshold:
                print("  Note: Low power means we might be missing a real effect")
        
        print("=" * 60)


# Usage example
analyzer = ABTestAnalyzer(alpha=0.05)

# Test 1: Original example (not significant)
result1 = analyzer.run_test(
    control_successes=320, control_total=10000,
    treatment_successes=355, treatment_total=10000
)
analyzer.print_report(result1, "Checkout Button Color")

# Test 2: Larger sample (now significant!)
result2 = analyzer.run_test(
    control_successes=3200, control_total=100000,
    treatment_successes=3550, treatment_total=100000
)
analyzer.print_report(result2, "Checkout Button Color (Large Sample)")

# Calculate required sample size
n_required = analyzer.required_sample_size(
    baseline_rate=0.032,
    minimum_detectable_effect=0.10  # 10% relative improvement
)
print(f"\nTo detect 10% relative improvement with 80% power:")
print(f"Need {n_required:,} visitors per group")

Common Mistakes to Avoid

1. Peeking and Early Stopping

# BAD: Stopping as soon as p < 0.05
# This inflates false positive rate to ~30%!

# GOOD: Pre-specify sample size and run to completion
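A short simulation makes the danger concrete. The sketch below is illustrative only (the conversion rate, group size, number of looks, number of simulated experiments, and seed are arbitrary choices): it runs A/A tests with no real effect, peeks at five evenly spaced interim points, and stops at the first p < 0.05. The realized false positive rate comes out well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)                                  # arbitrary seed
true_rate, n_per_group, n_looks, n_sims = 0.05, 10_000, 5, 2_000
look_points = np.linspace(n_per_group // n_looks, n_per_group, n_looks).astype(int)

stopped_early = 0
for _ in range(n_sims):
    a = rng.random(n_per_group) < true_rate                     # control conversions (no real effect)
    b = rng.random(n_per_group) < true_rate                     # treatment conversions (no real effect)
    for n_seen in look_points:
        x1, x2 = a[:n_seen].sum(), b[:n_seen].sum()
        p_pool = (x1 + x2) / (2 * n_seen)
        se = np.sqrt(p_pool * (1 - p_pool) * (2 / n_seen))
        if se > 0:
            z = (x2 - x1) / (n_seen * se)
            if 2 * (1 - stats.norm.cdf(abs(z))) < 0.05:         # "significant" at this peek
                stopped_early += 1
                break

print(f"False positive rate with 5 peeks: {stopped_early / n_sims:.1%}")  # well above 5%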

2. Multiple Testing Without Correction

# Testing 20 variations? Some will be "significant" by chance!

# Bonferroni correction
alpha_corrected = 0.05 / 20  # = 0.0025

# Or use False Discovery Rate (FDR) correction
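The scale of the problem follows directly from the multiplication rule: with 20 independent null tests at α = 0.05, the chance of at least one false positive is 1 - 0.95^20 ≈ 64%. A quick sketch of that arithmetic and the Bonferroni fix:
# Chance of at least one false positive across m independent null tests
m, alpha = 20, 0.05
p_any_false_positive = 1 - (1 - alpha) ** m
print(f"P(at least one false positive) at alpha=0.05:   {p_any_false_positive:.1%}")  # ~64%

# Bonferroni keeps the family-wise error rate near alpha
alpha_bonferroni = alpha / m
p_any_fp_bonferroni = 1 - (1 - alpha_bonferroni) ** m
print(f"P(at least one false positive) with Bonferroni: {p_any_fp_bonferroni:.1%}")   # ~5%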

Interview Questions

Question: Your A/B test shows the treatment group has 3.5% conversion vs 3.2% control. The p-value is 0.08. What do you tell stakeholders?
Answer: “At our standard α=0.05 threshold, this result is not statistically significant (p=0.08). However, there are a few considerations:
  1. The observed difference is 0.3 percentage points - if real, this could be meaningful at scale
  2. The p-value of 0.08 suggests weak evidence against the null hypothesis, not proof the treatment doesn’t work
  3. Consider power analysis - we may have been underpowered to detect this effect size
  4. Practical significance - if the change is low-risk and low-cost, you might still consider implementing
Recommendation: If resources allow, run a larger test to get more conclusive results.”
Question: You test 20 different variations of a product page. Three show p-values under 0.05. What’s the problem?
Answer: With 20 tests at α = 0.05, we expect about 1 false positive even if nothing is different: expected false positives = 20 × 0.05 = 1. Solutions:
  1. Bonferroni correction: Use α = 0.05/20 = 0.0025 as threshold
  2. Benjamini-Hochberg (FDR): Control the expected proportion of false discoveries
  3. Holdout validation: Test the “winners” on fresh data
# Bonferroni
alpha_corrected = 0.05 / 20  # 0.0025
# Only results with p < 0.0025 are significant
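For the FDR approach, here is a minimal sketch of the Benjamini-Hochberg step-up procedure (the p-values below are made up purely for illustration); in practice, statsmodels' multipletests with method='fdr_bh' does the same job:
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of discoveries at FDR level q (BH step-up)."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                              # indices of p-values, smallest first
    thresholds = q * np.arange(1, m + 1) / m           # i/m * q for the i-th smallest p-value
    passed = p[order] <= thresholds
    discoveries = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()                # largest i with p_(i) <= i/m * q
        discoveries[order[:k + 1]] = True
    return discoveries

# Hypothetical p-values from 20 page variations (illustrative only)
p_vals = [0.001, 0.012, 0.034, 0.21, 0.43, 0.05, 0.67, 0.08, 0.51, 0.90,
          0.02, 0.31, 0.74, 0.09, 0.62, 0.18, 0.27, 0.88, 0.04, 0.55]
print("Variations kept after BH correction:", np.nonzero(benjamini_hochberg(p_vals))[0])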
Question: An A/B test with 10,000 users per group shows no significant difference. Does this prove the treatment has no effect?
Answer: No! “Fail to reject null” ≠ “null is true”. We need to consider statistical power:
  1. What effect size could we detect? With n=10,000, we might only detect large effects
  2. What was our power? If power was 50%, we had a coin flip’s chance of detecting a real effect
  3. What’s the confidence interval? Even with p > 0.05, the CI may still include effect sizes large enough to matter in practice
# For a conversion test at 5% baseline, n=10,000/group
# At 80% power, the smallest reliably detectable difference is close to 1 percentage point
# Smaller effects would require larger samples
The correct conclusion: “We failed to find evidence of an effect of size > X”
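A small illustration of that last point, using made-up numbers (5.0% vs 5.2% conversion, 10,000 users per group): the test is far from significant, yet the confidence interval shows the data are still compatible with a practically meaningful lift.
import numpy as np
from scipy import stats

# Illustrative numbers only: 5.0% vs 5.2% conversion, n = 10,000 per group
x1, n1, x2, n2 = 500, 10_000, 520, 10_000
p1, p2 = x1 / n1, x2 / n2
diff = p2 - p1

# 95% CI for the difference in proportions (unpooled standard error)
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_crit = stats.norm.ppf(0.975)
ci = (diff - z_crit * se, diff + z_crit * se)
print(f"Observed difference: {diff:+.2%}")
print(f"95% CI: ({ci[0]:+.2%}, {ci[1]:+.2%})")   # spans 0, but also lifts approaching +0.8 points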
Question: You’re running an A/B test. After 2 days (of planned 7), you peek at results and see p=0.03. Should you stop the test?
Answer: No! This is called “p-hacking” or “optional stopping” and inflates false positive rates. The p-value assumes you only look once, at the end. If you peek repeatedly:
  • With 5 peeks, your actual false positive rate is ~19%, not 5%
  • With 10 peeks, it’s ~25%
Proper approaches:
  1. Sequential testing with adjusted thresholds (O’Brien-Fleming, Pocock)
  2. Bayesian methods that allow continuous monitoring
  3. Pre-commit to analysis plan and stick to it
# O'Brien-Fleming boundaries for 5 interim analyses:
# Look 1: α = 0.00001
# Look 2: α = 0.001
# Look 3: α = 0.01
# Look 4: α = 0.02
# Look 5: α = 0.04

Practice Challenge

Create a production-ready A/B test analysis tool:
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ABTestResult:
    control_rate: float
    treatment_rate: float
    relative_lift: float
    absolute_lift: float
    z_statistic: float
    p_value: float
    confidence_interval: Tuple[float, float]
    power: float
    is_significant: bool
    recommendation: str

class ProductionABTest:
    """
    Production-ready A/B test analyzer with:
    - Power analysis
    - Effect size estimation
    - Confidence intervals
    - Clear recommendations
    """
    
    def __init__(self, alpha: float = 0.05, min_detectable_effect: float = 0.1):
        self.alpha = alpha
        self.mde = min_detectable_effect
    
    def analyze(
        self,
        control_conversions: int,
        control_visitors: int,
        treatment_conversions: int,
        treatment_visitors: int,
        test_name: str = "A/B Test"
    ) -> ABTestResult:
        """Analyze A/B test results."""
        # Your implementation here
        pass
    
    def recommend_sample_size(
        self,
        baseline_rate: float,
        min_detectable_effect: float,
        power: float = 0.8
    ) -> int:
        """Calculate required sample size per group."""
        # Your implementation here
        pass
    
    def generate_report(self, result: ABTestResult) -> str:
        """Generate human-readable report."""
        # Your implementation here
        pass

# Test your implementation:
test = ProductionABTest()

# Scenario 1: Clear winner
result1 = test.analyze(
    control_conversions=500, control_visitors=10000,
    treatment_conversions=600, treatment_visitors=10000
)

# Scenario 2: Inconclusive
result2 = test.analyze(
    control_conversions=510, control_visitors=10000,
    treatment_conversions=530, treatment_visitors=10000
)

# Scenario 3: Treatment is worse
result3 = test.analyze(
    control_conversions=500, control_visitors=10000,
    treatment_conversions=420, treatment_visitors=10000
)
Full Solution:
class ProductionABTest:
    def __init__(self, alpha: float = 0.05, min_detectable_effect: float = 0.1):
        self.alpha = alpha
        self.mde = min_detectable_effect
    
    def analyze(
        self,
        control_conversions: int,
        control_visitors: int,
        treatment_conversions: int,
        treatment_visitors: int,
        test_name: str = "A/B Test"
    ) -> ABTestResult:
        # Calculate rates
        p_c = control_conversions / control_visitors
        p_t = treatment_conversions / treatment_visitors
        
        # Effect sizes
        absolute_lift = p_t - p_c
        relative_lift = (p_t - p_c) / p_c if p_c > 0 else 0
        
        # Pooled proportion and standard error
        p_pool = (control_conversions + treatment_conversions) / \
                 (control_visitors + treatment_visitors)
        se = np.sqrt(p_pool * (1 - p_pool) * 
                     (1/control_visitors + 1/treatment_visitors))
        
        # Z-test
        z = absolute_lift / se if se > 0 else 0
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        
        # Confidence interval for difference
        se_diff = np.sqrt(
            p_c * (1 - p_c) / control_visitors +
            p_t * (1 - p_t) / treatment_visitors
        )
        z_crit = stats.norm.ppf(1 - self.alpha/2)
        ci = (absolute_lift - z_crit * se_diff,
              absolute_lift + z_crit * se_diff)
        
        # Power calculation
        effect_size = abs(p_t - p_c) / np.sqrt(p_c * (1 - p_c))
        power = self._calculate_power(
            control_visitors, treatment_visitors,
            p_c, p_t
        )
        
        # Significance check
        is_significant = p_value < self.alpha
        
        # Generate recommendation
        recommendation = self._generate_recommendation(
            p_c, p_t, p_value, power, is_significant
        )
        
        return ABTestResult(
            control_rate=p_c,
            treatment_rate=p_t,
            relative_lift=relative_lift,
            absolute_lift=absolute_lift,
            z_statistic=z,
            p_value=p_value,
            confidence_interval=ci,
            power=power,
            is_significant=is_significant,
            recommendation=recommendation
        )
    
    def _calculate_power(self, n1, n2, p1, p2):
        """Calculate achieved power."""
        effect = abs(p2 - p1)
        pooled_p = (p1 + p2) / 2
        se = np.sqrt(pooled_p * (1 - pooled_p) * (1/n1 + 1/n2))
        z_crit = stats.norm.ppf(1 - self.alpha/2)
        z_power = (effect / se) - z_crit
        return stats.norm.cdf(z_power)
    
    def _generate_recommendation(self, p_c, p_t, p_value, power, sig):
        if sig and p_t > p_c:
            return "SHIP IT: Treatment significantly outperforms control"
        elif sig and p_t < p_c:
            return "STOP: Treatment significantly underperforms control"
        elif not sig and power < 0.5:
            return "INCONCLUSIVE: Test underpowered, consider running longer"
        elif not sig and power >= 0.8:
            return "NO EFFECT: High-powered test found no significant difference"
        else:
            return "BORDERLINE: Consider practical significance and run longer"
    
    def recommend_sample_size(self, baseline_rate, mde, power=0.8):
        target_rate = baseline_rate * (1 + mde)
        effect = target_rate - baseline_rate
        pooled_p = (baseline_rate + target_rate) / 2
        
        z_alpha = stats.norm.ppf(1 - self.alpha/2)
        z_beta = stats.norm.ppf(power)
        
        n = 2 * pooled_p * (1 - pooled_p) * ((z_alpha + z_beta) / effect) ** 2
        return int(np.ceil(n))

📝 Practice Exercises

Exercise 1

Conduct a one-sample hypothesis test

Exercise 2

Analyze an A/B test for conversion rates

Exercise 3

Calculate statistical power and sample size

Exercise 4

Real-world: Drug trial effectiveness testing

Key Takeaways

The Framework

  • Null hypothesis: no effect (innocent)
  • Alternative: there is an effect
  • P-value: how surprising is the data?
  • Decision threshold: typically α = 0.05

Types of Errors

  • Type I (α): False positive, claiming effect that doesn’t exist
  • Type II (β): False negative, missing real effect
  • Power = 1 - β: Ability to detect real effects

Sample Size Matters

  • Small samples = low power = missed effects
  • Calculate sample size BEFORE running test
  • Detecting smaller effects requires much more data: halving the detectable effect size roughly quadruples the required sample

Test Selection

  • Two proportions: Chi-square or z-test
  • Two means: t-test
  • Multiple groups: ANOVA
  • Non-normal: Mann-Whitney U

Common Pitfalls

A/B Testing Mistakes to Avoid:
  1. Peeking & Early Stopping - Checking daily inflates false positives; use sequential testing methods instead
  2. Underpowered Tests - Running with too few samples misses real effects; calculate sample size first
  3. Multiple Comparisons - Testing 20 variants without correction guarantees false positives
  4. Ignoring Practical Significance - A p < 0.05 with 0.01% improvement isn’t worth shipping
  5. One-Tailed When Uncertain - Only use one-tailed tests when an effect in the opposite direction genuinely wouldn’t matter to your decision
  6. P-value Misinterpretation - P-value is NOT the probability the null is true!

Connection to Machine Learning

| Hypothesis Testing | ML Application |
| --- | --- |
| A/B testing | Model comparison, feature evaluation |
| Power analysis | Training set size planning |
| Multiple testing correction | Hyperparameter search, feature selection |
| Type I/II errors | Precision/Recall tradeoff |
| Significance testing | Statistical validation of model improvements |
ML Connection: Every time you compare “Model A accuracy = 0.92 vs Model B accuracy = 0.89”, you’re implicitly doing hypothesis testing. The question is: is this 3% difference real or just noise from your test set? Proper statistical testing (like paired t-tests on cross-validation folds) gives you the answer.
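As a sketch of that last idea, a paired t-test on per-fold scores asks whether Model B's edge shows up consistently across folds rather than in one lucky split (the fold accuracies below are made-up numbers for illustration, and since CV folds share training data the p-value should be treated as approximate):
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies from the same 5-fold CV split (illustrative only)
model_a_folds = np.array([0.91, 0.93, 0.92, 0.90, 0.94])
model_b_folds = np.array([0.93, 0.94, 0.92, 0.93, 0.95])

t_stat, p_value = stats.ttest_rel(model_a_folds, model_b_folds)
print(f"Mean accuracy difference (B - A): {np.mean(model_b_folds - model_a_folds):+.3f}")
print(f"Paired t-test p-value: {p_value:.3f}")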
Coming up next: We’ll learn about Correlation and Regression - how to understand relationships between variables and make predictions. This is where statistics directly becomes machine learning.

Next: Correlation and Regression

Understand relationships and make predictions