Hypothesis Testing: Real Effect or Random Noise?

The A/B Testing Problem

You work at an e-commerce company. The design team created a new checkout button - green instead of blue. After running both versions for a week:

Version	Visitors	Purchases	Conversion Rate
Blue (Control)	10,000	320	3.20%
Green (New)	10,000	355	3.55%

The green button has a higher conversion rate. But is this a real improvement or just random chance? This is the fundamental question of hypothesis testing.

Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: Modules 1-4 (especially Distributions and Inference)
What You’ll Build: Complete A/B testing framework

The Framework: Innocent Until Proven Guilty

Hypothesis testing borrows from the legal system:

Legal System	Hypothesis Testing
Defendant is innocent until proven guilty	No effect until proven otherwise
Prosecution must prove guilt beyond reasonable doubt	Data must prove effect with high confidence
Jury verdict: guilty or not guilty	Decision: reject or fail to reject null
“Not guilty” ≠ “innocent”	“Fail to reject” ≠ “effect doesn’t exist”

The Two Hypotheses

Null Hypothesis (H₀): The default assumption. Nothing special is happening.

“The new button has the same conversion rate as the old one”
“The drug has no effect”
“The two groups are the same”

Alternative Hypothesis (H₁ or Hₐ): What we’re trying to prove.

“The new button has a different conversion rate”
“The drug has an effect”
“The groups are different”

# Our A/B test hypotheses:
# H₀: p_green = p_blue (no difference)
# H₁: p_green ≠ p_blue (there is a difference)

The P-Value: Quantifying Surprise

The p-value answers: “If there really were no effect, how likely would we be to see data this extreme?”

Interpreting P-Values

P-Value	Interpretation
p < 0.01	Strong evidence against null hypothesis
p < 0.05	Moderate evidence against null hypothesis
p < 0.10	Weak evidence against null hypothesis
p ≥ 0.10	Little to no evidence against null hypothesis

Common threshold (α): 0.05 (5%)

If p < 0.05, we reject the null hypothesis
If p ≥ 0.05, we fail to reject the null hypothesis

Critical Misconception: The p-value is NOT the probability that the null hypothesis is true. It’s the probability of seeing data at least this extreme IF the null hypothesis were true.
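One way to make this definition concrete is to simulate it. The sketch below assumes the null hypothesis is true (both buttons share the pooled conversion rate from the table above), replays the experiment many times, and counts how often chance alone produces a gap at least as large as the observed 35 purchases. The seed and the number of simulations are arbitrary choices, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

n = 10_000                             # visitors per group (from the table above)
observed_gap = 355 - 320               # observed difference in purchases
pooled_rate = (320 + 355) / (2 * n)    # shared conversion rate assumed under H0

n_sims = 20_000                        # simulated "no effect" experiments (arbitrary)
blue = rng.binomial(n, pooled_rate, size=n_sims)
green = rng.binomial(n, pooled_rate, size=n_sims)

# Two-sided: count simulated gaps at least as large as the observed one,
# in either direction
p_sim = np.mean(np.abs(green - blue) >= observed_gap)
print(f"Simulated p-value: {p_sim:.3f}")
```

The simulated fraction is an estimate of the p-value: it describes how often the null world produces data this extreme, not how likely the null hypothesis is.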

Testing Our A/B Example

Let’s test whether the green button is actually better:

import numpy as np
from scipy import stats

# Data
blue_visitors = 10000
blue_purchases = 320
blue_rate = blue_purchases / blue_visitors

green_visitors = 10000
green_purchases = 355
green_rate = green_purchases / green_visitors

print(f"Blue conversion rate: {blue_rate:.2%}")
print(f"Green conversion rate: {green_rate:.2%}")
print(f"Observed difference: {green_rate - blue_rate:.2%}")

Method 1: Two-Proportion Z-Test

def two_proportion_z_test(x1, n1, x2, n2):
    """
    Test if two proportions are significantly different.
    
    x1, x2: number of successes
    n1, n2: number of trials
    
    Returns: z-statistic, p-value (two-tailed)
    """
    # Sample proportions
    p1 = x1 / n1
    p2 = x2 / n2
    
    # Pooled proportion (under null hypothesis)
    p_pool = (x1 + x2) / (n1 + n2)
    
    # Standard error under null
    se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    
    # Z-statistic
    z = (p1 - p2) / se
    
    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    
    return z, p_value

# Run the test
z_stat, p_value = two_proportion_z_test(
    green_purchases, green_visitors,
    blue_purchases, blue_visitors
)

print(f"\nZ-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nResult: Reject null hypothesis")
    print("The difference is statistically significant at α=0.05")
else:
    print("\nResult: Fail to reject null hypothesis")
    print("The difference is NOT statistically significant at α=0.05")

Output:

Blue conversion rate: 3.20%
Green conversion rate: 3.55%
Observed difference: 0.35%

Z-statistic: 1.370
P-value: 0.1705

Result: Fail to reject null hypothesis
The difference is NOT statistically significant at α=0.05

Despite the 0.35 percentage point improvement, the p-value of 0.17 means we cannot rule out random chance.
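A confidence interval for the difference tells the same story in the units stakeholders care about. The following is a minimal sketch, assuming the usual normal approximation with an unpooled standard error; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

x_blue, n_blue = 320, 10_000
x_green, n_green = 355, 10_000

p_blue = x_blue / n_blue
p_green = x_green / n_green
diff = p_green - p_blue

# Unpooled standard error: each group keeps its own variance estimate
se_diff = np.sqrt(
    p_blue * (1 - p_blue) / n_blue + p_green * (1 - p_green) / n_green
)

z_crit = stats.norm.ppf(0.975)  # critical value for a two-sided 95% interval
ci_low = diff - z_crit * se_diff
ci_high = diff + z_crit * se_diff
print(f"Difference: {diff:+.2%}, 95% CI: ({ci_low:+.2%}, {ci_high:+.2%})")
```

The interval spans zero, so the data are compatible with no effect, a small loss, or a gain noticeably larger than the one observed.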

Method 2: Chi-Square Test

from scipy.stats import chi2_contingency

# Contingency table
#                  Purchased    Not Purchased
# Blue               320          9680
# Green              355          9645

contingency_table = np.array([
    [blue_purchases, blue_visitors - blue_purchases],
    [green_purchases, green_visitors - green_purchases]
])

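# correction=False disables Yates' continuity correction, which SciPy applies
# by default to 2x2 tables; without it the result matches the z-test exactly
chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")

Output:

Chi-square statistic: 1.878
P-value: 0.1705
Degrees of freedom: 1

Same p-value, same conclusion: for a 2×2 table without continuity correction, the chi-square statistic is the square of the z-statistic. With SciPy's default correction the p-value comes out slightly larger (about 0.18).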

Types of Errors

We can make two types of mistakes:

	H₀ is True (No Effect)	H₀ is False (Real Effect)
Reject H₀	Type I Error (False Positive)	Correct Decision
Fail to Reject H₀	Correct Decision	Type II Error (False Negative)

Type I Error (α): False Positive

We claim there’s an effect when there isn’t one.

Probability = α (typically 0.05)
“The boy who cried wolf”
Example: Launching a feature that doesn’t actually help

Type II Error (β): False Negative

We miss a real effect.

Probability = β (varies, often 0.20)
Power = 1 - β (typically 0.80)
Example: Abandoning a feature that would have helped

# Visualize the tradeoff
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Type I Error
x = np.linspace(-4, 4, 1000)
null_dist = stats.norm.pdf(x, 0, 1)

axes[0].plot(x, null_dist, 'b-', linewidth=2, label='Null Distribution')
axes[0].fill_between(x[x > 1.96], null_dist[x > 1.96], alpha=0.3, color='red', 
                      label='Type I Error Region (α/2 per tail)')
axes[0].fill_between(x[x < -1.96], null_dist[x < -1.96], alpha=0.3, color='red')
axes[0].axvline(1.96, color='red', linestyle='--')
axes[0].axvline(-1.96, color='red', linestyle='--')
axes[0].set_title('Type I Error (False Positive)')
axes[0].legend()

# Type II Error (with alternative distribution)
alt_dist = stats.norm.pdf(x, 2, 1)  # Effect exists, shifted right

axes[1].plot(x, null_dist, 'b-', linewidth=2, label='Null (No Effect)')
axes[1].plot(x, alt_dist, 'g-', linewidth=2, label='Alternative (Real Effect)')
axes[1].fill_between(x[(x > -1.96) & (x < 1.96)], alt_dist[(x > -1.96) & (x < 1.96)], 
                      alpha=0.3, color='orange', label='Type II Error Region (β)')
axes[1].axvline(1.96, color='red', linestyle='--')
axes[1].axvline(-1.96, color='red', linestyle='--')
axes[1].set_title('Type II Error (False Negative)')
axes[1].legend()

plt.tight_layout()
plt.show()
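The claim that the Type I error rate equals α can be checked empirically with A/A tests, where both groups share the same true rate and every rejection is a false positive by construction. This is a rough sketch; the true rate, seed, and simulation count are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 10_000          # visitors per group
true_rate = 0.032   # identical in both groups, so H0 holds by construction
n_sims = 10_000     # number of simulated A/A tests (arbitrary)
alpha = 0.05

x1 = rng.binomial(n, true_rate, size=n_sims)
x2 = rng.binomial(n, true_rate, size=n_sims)

# Vectorized two-proportion z-test with a pooled standard error
p_pool = (x1 + x2) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (x1 / n - x2 / n) / se
p_values = 2 * stats.norm.sf(np.abs(z))

false_positive_rate = np.mean(p_values < alpha)
print(f"False positive rate across A/A tests: {false_positive_rate:.3f}")
```

The printed rate should sit close to 0.05, which is what "Probability = α" means in practice.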

Statistical Power: Ability to Detect Real Effects

Power = Probability of detecting an effect when it exists = 1 - β

Higher power means:

Less likely to miss real effects
Requires larger sample sizes
More confidence in negative results

def power_proportion_test(p1, p2, n, alpha=0.05):
    """
    Calculate power for a two-proportion test.
    
    p1: control proportion
    p2: treatment proportion
    n: sample size per group
    alpha: significance level
    """
    # Effect size
    effect = abs(p2 - p1)
    
    # Pooled standard error under null
    p_pool = (p1 + p2) / 2
    se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    
    # Standard error under alternative
    se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    
    # Critical value
    z_crit = stats.norm.ppf(1 - alpha / 2)
    
    # Power
    z_power = (effect - z_crit * se_null) / se_alt
    power = stats.norm.cdf(z_power)
    
    return power

# Our A/B test: 3.20% vs 3.55%, n=10,000 per group
power = power_proportion_test(0.032, 0.0355, 10000)
print(f"Power of our test: {power:.1%}")

# What if we had 50,000 per group?
power_large = power_proportion_test(0.032, 0.0355, 50000)
print(f"Power with n=50,000: {power_large:.1%}")

Output:

Power of our test: 27.8%
Power with n=50,000: 86.5%

With only 10,000 per group, we had only about a 28% chance of detecting that 0.35 percentage point difference. No wonder we failed to find significance.
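A simulation offers a cross-check on the analytic approximation. The sketch below treats the observed rates as if they were the true rates (an assumption) and counts how often a two-proportion z-test rejects the null; `simulated_power` is a hypothetical helper, not part of the code above.

```python
import numpy as np
from scipy import stats

def simulated_power(p1, p2, n, alpha=0.05, n_sims=10_000, seed=1):
    """Fraction of simulated experiments in which the z-test rejects H0."""
    rng = np.random.default_rng(seed)
    x1 = rng.binomial(n, p1, size=n_sims)
    x2 = rng.binomial(n, p2, size=n_sims)

    # Pooled two-proportion z-test, evaluated for every simulated experiment
    p_pool = (x1 + x2) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (x2 / n - x1 / n) / se
    p_values = 2 * stats.norm.sf(np.abs(z))
    return np.mean(p_values < alpha)

power_10k = simulated_power(0.032, 0.0355, 10_000)
power_50k = simulated_power(0.032, 0.0355, 50_000)
print(f"Simulated power, n=10,000: {power_10k:.1%}")
print(f"Simulated power, n=50,000: {power_50k:.1%}")
```

The simulated values should agree with the formula to within simulation noise, which is a useful sanity check whenever a power formula is adapted to a new test.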

Sample Size Calculation for Desired Power

def sample_size_proportion_test(p1, p2, power=0.80, alpha=0.05):
    """
    Calculate required sample size per group.
    
    p1: expected control proportion
    p2: expected treatment proportion
    power: desired power (typically 0.80)
    alpha: significance level (typically 0.05)
    """
    # Effect size
    effect = abs(p2 - p1)
    
    # Pooled proportion
    p_pool = (p1 + p2) / 2
    
    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    
    # Variance terms
    var_null = 2 * p_pool * (1 - p_pool)
    var_alt = p1 * (1 - p1) + p2 * (1 - p2)
    
    # Sample size formula
    n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2
    
    return int(np.ceil(n))

# How many visitors do we need to detect a 0.35% difference?
n_needed = sample_size_proportion_test(0.032, 0.0355)
print(f"Need {n_needed:,} visitors per group to detect 0.35% difference with 80% power")

# For a larger effect (1% improvement)
n_1pct = sample_size_proportion_test(0.032, 0.042)
print(f"Need {n_1pct:,} visitors per group to detect 1.0% difference with 80% power")

Output:

Need 41,789 visitors per group to detect 0.35% difference with 80% power
Need 5,593 visitors per group to detect 1.0% difference with 80% power

Common Statistical Tests

1. One-Sample t-Test

Is this sample mean different from a known value?

# Are our website load times different from the 3-second industry standard?
load_times = np.array([2.8, 3.2, 2.9, 3.5, 2.7, 3.1, 2.6, 3.0, 2.9, 3.3])

t_stat, p_value = stats.ttest_1samp(load_times, 3.0)
print(f"Sample mean: {np.mean(load_times):.2f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

2. Two-Sample t-Test

Are the means of two groups different?

# Do users spend more time on new homepage design?
old_design_time = np.array([45, 52, 38, 61, 42, 55, 48, 50, 44, 58])
new_design_time = np.array([58, 62, 55, 70, 65, 60, 68, 72, 63, 59])

t_stat, p_value = stats.ttest_ind(old_design_time, new_design_time)
print(f"Old design mean: {np.mean(old_design_time):.1f}s")
print(f"New design mean: {np.mean(new_design_time):.1f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
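By default `stats.ttest_ind` assumes both groups have equal variances. When that assumption is doubtful, Welch's t-test is a common alternative, selected with `equal_var=False`. The sketch below reuses the homepage data from this section.

```python
import numpy as np
from scipy import stats

old_design_time = np.array([45, 52, 38, 61, 42, 55, 48, 50, 44, 58])
new_design_time = np.array([58, 62, 55, 70, 65, 60, 68, 72, 63, 59])

# equal_var=False switches ttest_ind from Student's to Welch's t-test,
# which does not pool the two sample variances
t_stat, p_value = stats.ttest_ind(
    old_design_time, new_design_time, equal_var=False
)
print(f"Welch t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```

With equal group sizes the two versions give similar answers; they diverge when group sizes and variances both differ.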

3. Paired t-Test

Before/after comparisons on the same subjects:

# Does a training program improve test scores?
before = np.array([65, 72, 58, 80, 75, 62, 70, 68, 74, 78])
after = np.array([70, 78, 62, 85, 82, 68, 75, 72, 80, 82])

t_stat, p_value = stats.ttest_rel(before, after)
print(f"Mean improvement: {np.mean(after - before):.1f} points")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

4. ANOVA

Are three or more groups different?

# Do three different ad campaigns have different click rates?
campaign_a = np.array([2.1, 2.3, 2.0, 2.4, 2.2])
campaign_b = np.array([2.8, 3.0, 2.9, 3.1, 2.7])
campaign_c = np.array([2.3, 2.5, 2.4, 2.6, 2.2])

f_stat, p_value = stats.f_oneway(campaign_a, campaign_b, campaign_c)
print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.4f}")

Complete A/B Testing Framework

import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Tuple, Optional

@dataclass
class ABTestResult:
    """Results of an A/B test."""
    control_rate: float
    treatment_rate: float
    relative_lift: float
    absolute_lift: float
    z_statistic: float
    p_value: float
    confidence_interval: Tuple[float, float]
    is_significant: bool
    power: float
    
class ABTestAnalyzer:
    """
    Complete A/B testing framework with proper statistical methodology.
    """
    
    def __init__(self, alpha: float = 0.05, power_threshold: float = 0.80):
        self.alpha = alpha
        self.power_threshold = power_threshold
    
    def run_test(
        self, 
        control_successes: int, 
        control_total: int,
        treatment_successes: int, 
        treatment_total: int
    ) -> ABTestResult:
        """Run a two-proportion z-test."""
        
        # Calculate rates
        p_control = control_successes / control_total
        p_treatment = treatment_successes / treatment_total
        
        # Lifts
        absolute_lift = p_treatment - p_control
        relative_lift = (p_treatment - p_control) / p_control if p_control > 0 else 0
        
        # Pooled proportion
        p_pool = (control_successes + treatment_successes) / (control_total + treatment_total)
        
        # Standard error
        se = np.sqrt(p_pool * (1 - p_pool) * (1/control_total + 1/treatment_total))
        
        # Z-statistic
        z = absolute_lift / se if se > 0 else 0
        
        # P-value (two-tailed)
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        
        # Confidence interval for the difference
        se_diff = np.sqrt(
            p_control * (1 - p_control) / control_total +
            p_treatment * (1 - p_treatment) / treatment_total
        )
        z_crit = stats.norm.ppf(1 - self.alpha / 2)
        ci = (absolute_lift - z_crit * se_diff, absolute_lift + z_crit * se_diff)
        
        # Power (approximate)
        power = self._calculate_power(p_control, p_treatment, min(control_total, treatment_total))
        
        return ABTestResult(
            control_rate=p_control,
            treatment_rate=p_treatment,
            relative_lift=relative_lift,
            absolute_lift=absolute_lift,
            z_statistic=z,
            p_value=p_value,
            confidence_interval=ci,
            is_significant=p_value < self.alpha,
            power=power
        )
    
    def _calculate_power(self, p1: float, p2: float, n: int) -> float:
        """Calculate statistical power."""
        effect = abs(p2 - p1)
        if effect == 0:
            return 0
        
        p_pool = (p1 + p2) / 2
        se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
        
        z_crit = stats.norm.ppf(1 - self.alpha / 2)
        z_power = (effect - z_crit * se_null) / se_alt
        
        return stats.norm.cdf(z_power)
    
    def required_sample_size(
        self, 
        baseline_rate: float, 
        minimum_detectable_effect: float,
        power: float = 0.80
    ) -> int:
        """Calculate required sample size per group."""
        
        p1 = baseline_rate
        p2 = baseline_rate * (1 + minimum_detectable_effect)
        effect = abs(p2 - p1)
        
        p_pool = (p1 + p2) / 2
        z_alpha = stats.norm.ppf(1 - self.alpha / 2)
        z_beta = stats.norm.ppf(power)
        
        var_null = 2 * p_pool * (1 - p_pool)
        var_alt = p1 * (1 - p1) + p2 * (1 - p2)
        
        n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2
        
        return int(np.ceil(n))
    
    def print_report(self, result: ABTestResult, test_name: str = "A/B Test"):
        """Print a formatted test report."""
        
        print("\n" + "=" * 60)
        print(f"A/B TEST REPORT: {test_name}")
        print("=" * 60)
        
        print(f"\nConversion Rates:")
        print(f"  Control:   {result.control_rate:.2%}")
        print(f"  Treatment: {result.treatment_rate:.2%}")
        
        print(f"\nLift:")
        print(f"  Absolute: {result.absolute_lift:+.2%}")
        print(f"  Relative: {result.relative_lift:+.1%}")
        
        print(f"\nStatistical Analysis:")
        print(f"  Z-statistic: {result.z_statistic:.3f}")
        print(f"  P-value: {result.p_value:.4f}")
        print(f"  {1 - self.alpha:.0%} CI for difference: ({result.confidence_interval[0]:+.2%}, {result.confidence_interval[1]:+.2%})")
        
        print(f"\nTest Quality:")
        print(f"  Power: {result.power:.1%}")
        if result.power < self.power_threshold:
            print(f"  Warning: Low power. Consider larger sample size.")
        
        print(f"\nConclusion (α = {self.alpha}):")
        if result.is_significant:
            if result.absolute_lift > 0:
                print("  SIGNIFICANT: Treatment performs BETTER than control")
            else:
                print("  SIGNIFICANT: Treatment performs WORSE than control")
        else:
            print("  NOT SIGNIFICANT: Cannot conclude a difference exists")
            if result.power < self.power_threshold:
                print("  Note: Low power means we might be missing a real effect")
        
        print("=" * 60)


# Usage example
analyzer = ABTestAnalyzer(alpha=0.05)

# Test 1: Original example (not significant)
result1 = analyzer.run_test(
    control_successes=320, control_total=10000,
    treatment_successes=355, treatment_total=10000
)
analyzer.print_report(result1, "Checkout Button Color")

# Test 2: Larger sample (now significant!)
result2 = analyzer.run_test(
    control_successes=3200, control_total=100000,
    treatment_successes=3550, treatment_total=100000
)
analyzer.print_report(result2, "Checkout Button Color (Large Sample)")

# Calculate required sample size
n_required = analyzer.required_sample_size(
    baseline_rate=0.032,
    minimum_detectable_effect=0.10  # 10% relative improvement
)
print(f"\nTo detect 10% relative improvement with 80% power:")
print(f"Need {n_required:,} visitors per group")

Common Mistakes to Avoid

1. Peeking and Early Stopping

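# BAD: Checking repeatedly and stopping as soon as p < 0.05
# Each extra look is another chance for a false positive: about 14% with
# 5 looks, and around 30% with dozens of looks, instead of the nominal 5%

# GOOD: Pre-specify sample size and run to completion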
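The inflation from peeking is easy to reproduce. The sketch below simulates A/A experiments (no real effect) with five equally spaced looks and compares "stop at the first p < 0.05" against a single analysis at the end. The traffic numbers, baseline rate, and seed are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n_sims = 4_000       # simulated A/A experiments (arbitrary)
n_looks = 5          # equally spaced peeks; the last is the planned end
n_per_look = 2_000   # new visitors per group between peeks (hypothetical)
true_rate = 0.05     # identical in both groups: any rejection is a false positive
z_crit = stats.norm.ppf(0.975)

# Cumulative conversions per group at each look, shape (n_sims, n_looks)
x1 = rng.binomial(n_per_look, true_rate, size=(n_sims, n_looks)).cumsum(axis=1)
x2 = rng.binomial(n_per_look, true_rate, size=(n_sims, n_looks)).cumsum(axis=1)
n_cum = n_per_look * np.arange(1, n_looks + 1)  # visitors per group at each look

# Pooled two-proportion z-statistic at every look
p_pool = (x1 + x2) / (2 * n_cum)
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n_cum))
z = (x1 - x2) / n_cum / se

stop_early = np.any(np.abs(z) > z_crit, axis=1).mean()  # reject at ANY look
single_look = (np.abs(z[:, -1]) > z_crit).mean()        # reject only at the end
print(f"False positive rate with peeking: {stop_early:.1%}")
print(f"False positive rate, one final look: {single_look:.1%}")
```

The single-look rate stays near 5%, while the peeking rate lands well above it even though nothing about the data changed.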

2. Multiple Testing Without Correction

# Testing 20 variations? Some will be "significant" by chance!

# Bonferroni correction
alpha_corrected = 0.05 / 20  # = 0.0025

# Or use False Discovery Rate (FDR) correction
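For the FDR route, the Benjamini-Hochberg procedure is short enough to write by hand. This is a minimal sketch: `benjamini_hochberg` is a hypothetical helper, and the p-values are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at false discovery rate q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices that sort p ascending
    thresholds = q * np.arange(1, m + 1) / m   # (k / m) * q for rank k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()         # largest rank meeting its threshold
        reject[order[:k + 1]] = True           # reject it and every smaller p-value
    return reject

# Hypothetical p-values from 10 variant tests (made up for illustration)
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

bh_reject = benjamini_hochberg(p_values, q=0.05)
bonferroni_reject = np.array(p_values) < 0.05 / len(p_values)

print(f"Uncorrected (p < 0.05): {sum(p < 0.05 for p in p_values)} significant")
print(f"Bonferroni:             {bonferroni_reject.sum()} significant")
print(f"Benjamini-Hochberg:     {bh_reject.sum()} significant")
```

Bonferroni controls the chance of any false positive and is the strictest of the three; Benjamini-Hochberg controls the expected share of false discoveries and usually keeps more results.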

Interview Questions

Question 1: A/B Test Interpretation (Google)

Question: Your A/B test shows the treatment group has 3.5% conversion vs 3.2% control. The p-value is 0.08. What do you tell stakeholders?

Answer: “At our standard α=0.05 threshold, this result is not statistically significant (p=0.08). However, there are a few considerations:

The observed difference is 0.3 percentage points - if real, this could be meaningful at scale
The p-value of 0.08 suggests weak evidence against the null hypothesis, not proof the treatment doesn’t work
Consider power analysis - we may have been underpowered to detect this effect size
Practical significance - if the change is low-risk and low-cost, you might still consider implementing

Recommendation: If resources allow, run a larger test to get more conclusive results.”

Question 2: Multiple Testing (Amazon)

Question: You test 20 different variations of a product page. Three show p-values under 0.05. What’s the problem?

Answer: With 20 tests at α=0.05, we expect about 1 false positive even if nothing is different!

Expected false positives = 20 × 0.05 = 1

Solutions:

Bonferroni correction: Use α = 0.05/20 = 0.0025 as threshold
Benjamini-Hochberg (FDR): Control the expected proportion of false discoveries
Holdout validation: Test the “winners” on fresh data

# Bonferroni
alpha_corrected = 0.05 / 20  # 0.0025
# Only results with p < 0.0025 are significant

Question 3: Power and Sample Size (Facebook/Meta)

Question: An A/B test with 10,000 users per group shows no significant difference. Does this prove the treatment has no effect?

Answer: No! “Fail to reject null” ≠ “null is true”

We need to consider statistical power:

What effect size could we detect? With n=10,000, we might only detect large effects
What was our power? If power was 50%, we had a coin flip’s chance of detecting a real effect
What’s the confidence interval? It includes zero when p > 0.05, but it may also include effects large enough to matter

# For a conversion test at 5% baseline, n=10,000/group
# We can reliably (80% power) detect ~0.9 percentage point differences
# Smaller effects would require larger samples

The correct conclusion: “We failed to find evidence of an effect of size > X”
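The "X" in that conclusion can be estimated as a minimum detectable effect. The sketch below uses a common approximation that applies the baseline variance to both groups, which is reasonable when the effect is small relative to the baseline; `approx_mde` is a hypothetical helper.

```python
import numpy as np
from scipy import stats

def approx_mde(baseline_rate, n_per_group, alpha=0.05, power=0.80):
    """Approximate minimum detectable absolute difference (two-sided z-test)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    # Assumption: both groups share the baseline variance
    se = np.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    return (z_alpha + z_beta) * se

mde = approx_mde(0.05, 10_000)
print(f"Approximate MDE at 5% baseline, n=10,000/group: {mde:.2%}")
```

Reporting this number alongside a null result makes clear which effect sizes the test could and could not have caught.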

Question 4: Early Stopping (Tech Companies)

Question: You’re running an A/B test. After 2 days (of planned 7), you peek at results and see p=0.03. Should you stop the test?

Answer: No! This is called “optional stopping”, a form of p-hacking, and it inflates false positive rates.

The p-value assumes you only look once at the end. If you peek repeatedly and stop at the first p < 0.05:

With 5 equally spaced peeks, your actual false positive rate is ~14%, not 5%
With 10 peeks, it’s ~19%

Proper approaches:

Sequential testing with adjusted thresholds (O’Brien-Fleming, Pocock)
Bayesian methods that allow continuous monitoring
Pre-commit to analysis plan and stick to it

# Approximate O'Brien-Fleming nominal p-value thresholds for 5 equally
# spaced looks (4 interim analyses plus the final one), overall α = 0.05:
# Look 1: α ≈ 0.000005
# Look 2: α ≈ 0.0013
# Look 3: α ≈ 0.0084
# Look 4: α ≈ 0.023
# Look 5: α ≈ 0.041

Practice Challenge

Challenge: Build a Complete A/B Testing Framework

Create a production-ready A/B test analysis tool:

import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ABTestResult:
    control_rate: float
    treatment_rate: float
    relative_lift: float
    absolute_lift: float
    z_statistic: float
    p_value: float
    confidence_interval: Tuple[float, float]
    power: float
    is_significant: bool
    recommendation: str

class ProductionABTest:
    """
    Production-ready A/B test analyzer with:
    - Power analysis
    - Effect size estimation
    - Confidence intervals
    - Clear recommendations
    """
    
    def __init__(self, alpha: float = 0.05, min_detectable_effect: float = 0.1):
        self.alpha = alpha
        self.mde = min_detectable_effect
    
    def analyze(
        self,
        control_conversions: int,
        control_visitors: int,
        treatment_conversions: int,
        treatment_visitors: int,
        test_name: str = "A/B Test"
    ) -> ABTestResult:
        """Analyze A/B test results."""
        # Your implementation here
        pass
    
    def recommend_sample_size(
        self,
        baseline_rate: float,
        min_detectable_effect: float,
        power: float = 0.8
    ) -> int:
        """Calculate required sample size per group."""
        # Your implementation here
        pass
    
    def generate_report(self, result: ABTestResult) -> str:
        """Generate human-readable report."""
        # Your implementation here
        pass

# Test your implementation:
test = ProductionABTest()

# Scenario 1: Clear winner
result1 = test.analyze(
    control_conversions=500, control_visitors=10000,
    treatment_conversions=600, treatment_visitors=10000
)

# Scenario 2: Inconclusive
result2 = test.analyze(
    control_conversions=510, control_visitors=10000,
    treatment_conversions=530, treatment_visitors=10000
)

# Scenario 3: Treatment is worse
result3 = test.analyze(
    control_conversions=500, control_visitors=10000,
    treatment_conversions=420, treatment_visitors=10000
)

Full Solution:

class ProductionABTest:
    def __init__(self, alpha: float = 0.05, min_detectable_effect: float = 0.1):
        self.alpha = alpha
        self.mde = min_detectable_effect
    
    def analyze(
        self,
        control_conversions: int,
        control_visitors: int,
        treatment_conversions: int,
        treatment_visitors: int,
        test_name: str = "A/B Test"
    ) -> ABTestResult:
        # Calculate rates
        p_c = control_conversions / control_visitors
        p_t = treatment_conversions / treatment_visitors
        
        # Effect sizes
        absolute_lift = p_t - p_c
        relative_lift = (p_t - p_c) / p_c if p_c > 0 else 0
        
        # Pooled proportion and standard error
        p_pool = (control_conversions + treatment_conversions) / \
                 (control_visitors + treatment_visitors)
        se = np.sqrt(p_pool * (1 - p_pool) * 
                     (1/control_visitors + 1/treatment_visitors))
        
        # Z-test
        z = absolute_lift / se if se > 0 else 0
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        
        # Confidence interval for difference
        se_diff = np.sqrt(
            p_c * (1 - p_c) / control_visitors +
            p_t * (1 - p_t) / treatment_visitors
        )
        z_crit = stats.norm.ppf(1 - self.alpha/2)
        ci = (absolute_lift - z_crit * se_diff,
              absolute_lift + z_crit * se_diff)
        
        # Power calculation (post hoc: based on the observed rates, so it
        # will be low whenever the observed difference is small)
        power = self._calculate_power(
            control_visitors, treatment_visitors,
            p_c, p_t
        )
        
        # Significance check
        is_significant = p_value < self.alpha
        
        # Generate recommendation
        recommendation = self._generate_recommendation(
            p_c, p_t, p_value, power, is_significant
        )
        
        return ABTestResult(
            control_rate=p_c,
            treatment_rate=p_t,
            relative_lift=relative_lift,
            absolute_lift=absolute_lift,
            z_statistic=z,
            p_value=p_value,
            confidence_interval=ci,
            power=power,
            is_significant=is_significant,
            recommendation=recommendation
        )
    
    def _calculate_power(self, n1, n2, p1, p2):
        """Calculate achieved power."""
        effect = abs(p2 - p1)
        pooled_p = (p1 + p2) / 2
        se = np.sqrt(pooled_p * (1 - pooled_p) * (1/n1 + 1/n2))
        z_crit = stats.norm.ppf(1 - self.alpha/2)
        z_power = (effect / se) - z_crit
        return stats.norm.cdf(z_power)
    
    def _generate_recommendation(self, p_c, p_t, p_value, power, sig):
        if sig and p_t > p_c:
            return "SHIP IT: Treatment significantly outperforms control"
        elif sig and p_t < p_c:
            return "STOP: Treatment significantly underperforms control"
        elif not sig and power < 0.5:
            return "INCONCLUSIVE: Test underpowered, consider running longer"
        elif not sig and power >= 0.8:
            return "NO EFFECT: High-powered test found no significant difference"
        else:
            return "BORDERLINE: Consider practical significance and run longer"
    
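    def recommend_sample_size(self, baseline_rate, min_detectable_effect, power=0.8):
        """Required sample size per group for a relative lift of min_detectable_effect."""
        target_rate = baseline_rate * (1 + min_detectable_effect)
        effect = target_rate - baseline_rate
        pooled_p = (baseline_rate + target_rate) / 2
        
        z_alpha = stats.norm.ppf(1 - self.alpha/2)
        z_beta = stats.norm.ppf(power)
        
        # Simplified formula: uses the pooled variance under both hypotheses
        n = 2 * pooled_p * (1 - pooled_p) * ((z_alpha + z_beta) / effect) ** 2
        return int(np.ceil(n))
    
    def generate_report(self, result: ABTestResult) -> str:
        """Generate human-readable report."""
        ci_low, ci_high = result.confidence_interval
        lines = [
            f"Control: {result.control_rate:.2%} | Treatment: {result.treatment_rate:.2%}",
            f"Lift: {result.absolute_lift:+.2%} absolute, {result.relative_lift:+.1%} relative",
            f"Z = {result.z_statistic:.3f}, p = {result.p_value:.4f}",
            f"{1 - self.alpha:.0%} CI: ({ci_low:+.2%}, {ci_high:+.2%})",
            f"Power: {result.power:.1%}",
            f"Recommendation: {result.recommendation}",
        ]
        return "\n".join(lines)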

📝 Practice Exercises

Exercise 1

Conduct a one-sample hypothesis test

Exercise 2

Analyze an A/B test for conversion rates

Exercise 3

Calculate statistical power and sample size

Exercise 4

Real-world: Drug trial effectiveness testing

Key Takeaways

The Framework

Null hypothesis: no effect (innocent)
Alternative: there is an effect
P-value: how surprising is the data?
Decision threshold: typically α = 0.05

Types of Errors

Type I (α): False positive, claiming effect that doesn’t exist
Type II (β): False negative, missing real effect
Power = 1 - β: Ability to detect real effects

Sample Size Matters

Small samples = low power = missed effects
Calculate sample size BEFORE running test
More precision requires quadratically more data: halving the detectable effect takes about four times the sample

Test Selection

Two proportions: Chi-square or z-test
Two means: t-test
Multiple groups: ANOVA
Non-normal: Mann-Whitney U
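The Mann-Whitney U test is the only entry in this list without an example earlier in the module. A minimal sketch, reusing the homepage time-on-page data from the two-sample t-test section:

```python
import numpy as np
from scipy import stats

old_design_time = np.array([45, 52, 38, 61, 42, 55, 48, 50, 44, 58])
new_design_time = np.array([58, 62, 55, 70, 65, 60, 68, 72, 63, 59])

# Rank-based test: makes no normality assumption about either group
u_stat, p_value = stats.mannwhitneyu(
    old_design_time, new_design_time, alternative="two-sided"
)
print(f"U-statistic: {u_stat:.1f}")
print(f"p-value: {p_value:.4f}")
```

Because it works on ranks rather than raw values, the test is less sensitive to outliers and skew than the t-test, at the cost of some power when the data really are normal.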

Common Pitfalls

A/B Testing Mistakes to Avoid:

Peeking & Early Stopping - Checking daily inflates false positives; use sequential testing methods instead
Underpowered Tests - Running with too few samples misses real effects; calculate sample size first
Multiple Comparisons - Testing 20 variants without correction makes at least one false positive likely (about 64% when no variant has a real effect)
Ignoring Practical Significance - A p < 0.05 with a 0.01% improvement may not be worth shipping
One-Tailed When Uncertain - Only use one-tailed tests when an effect in the opposite direction truly would not matter
P-value Misinterpretation - P-value is NOT the probability the null is true!

Connection to Machine Learning

Hypothesis Testing	ML Application
A/B testing	Model comparison, feature evaluation
Power analysis	Training set size planning
Multiple testing correction	Hyperparameter search, feature selection
Type I/II errors	Precision/Recall tradeoff
Significance testing	Statistical validation of model improvements

ML Connection: Every time you compare “Model A accuracy = 0.92 vs Model B accuracy = 0.89”, you’re implicitly doing hypothesis testing. The question is: is this 3 percentage point difference real or just noise from your test set? Proper statistical testing (like paired t-tests on cross-validation folds) gives you the answer.
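As a rough illustration of the paired approach, the sketch below compares two models evaluated on the same five cross-validation folds. The per-fold accuracies are hypothetical numbers invented for this example, and folds are not fully independent, so the resulting p-value is best read as approximate.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two models on the SAME 5 folds
model_a = np.array([0.91, 0.93, 0.92, 0.90, 0.94])
model_b = np.array([0.89, 0.90, 0.88, 0.89, 0.90])

# Paired test: each fold acts as its own control, which removes
# fold-to-fold difficulty from the comparison
t_stat, p_value = stats.ttest_rel(model_a, model_b)
mean_gain = np.mean(model_a - model_b)

print(f"Mean accuracy gain: {mean_gain:.3f}")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```

Pairing matters here: an unpaired test would treat fold difficulty as noise and could miss a consistent per-fold advantage.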

Coming up next: We’ll learn about Correlation and Regression - how to understand relationships between variables and make predictions. This is where statistics directly becomes machine learning.
