Hypothesis Testing: Real Effect or Random Noise?
The A/B Testing Problem
You work at an e-commerce company. The design team created a new checkout button - green instead of blue. After running both versions for a week:
Version          Visitors   Purchases   Conversion Rate
Blue (Control)   10,000     320         3.20%
Green (New)      10,000     355         3.55%
The green button has a higher conversion rate. But is this a real improvement or just random chance?
This is the fundamental question of hypothesis testing.
Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: Modules 1-4 (especially Distributions and Inference)
What You'll Build: Complete A/B testing framework
The Framework: Innocent Until Proven Guilty
Hypothesis testing borrows from the legal system:
Legal System                                           Hypothesis Testing
Defendant is innocent until proven guilty              No effect until proven otherwise
Prosecution must prove guilt beyond reasonable doubt   Data must prove effect with high confidence
Jury verdict: guilty or not guilty                     Decision: reject or fail to reject null
"Not guilty" ≠ "innocent"                              "Fail to reject" ≠ "effect doesn't exist"
The Two Hypotheses
Null Hypothesis (H₀) : The default assumption. Nothing special is happening.
“The new button has the same conversion rate as the old one”
“The drug has no effect”
“The two groups are the same”
Alternative Hypothesis (H₁ or Hₐ) : What we’re trying to prove.
“The new button has a different conversion rate”
“The drug has an effect”
“The groups are different”
# Our A/B test hypotheses:
# H₀: p_green = p_blue (no difference)
# H₁: p_green ≠ p_blue (there is a difference)
The P-Value: Quantifying Surprise
The p-value answers: “If there really were no effect, how likely would we be to see data this extreme?”
Interpreting P-Values
P-Value     Interpretation
p < 0.01    Strong evidence against the null hypothesis
p < 0.05    Moderate evidence against the null hypothesis
p < 0.10    Weak evidence against the null hypothesis
p ≥ 0.10    Little to no evidence against the null hypothesis
Common threshold (α) : 0.05 (5%)
If p < 0.05, we reject the null hypothesis
If p ≥ 0.05, we fail to reject the null hypothesis
Critical Misconception: The p-value is NOT the probability that the null hypothesis is true. It's the probability of seeing data this extreme IF the null hypothesis were true.
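To make that distinction concrete, here is a minimal simulation sketch (the 3.2% rate and 10,000 visitors mirror the button example; the seed and simulation count are arbitrary): generate many experiments in a world where the null is true, and count how often chance alone produces a gap as large as the one we observed.
import numpy as np

rng = np.random.default_rng(42)
n, p_null, n_sims = 10_000, 0.032, 20_000  # visitors per variant, shared "no effect" rate, simulated tests

# A world where H₀ is true: both buttons convert at exactly the same rate
blue_sim = rng.binomial(n, p_null, n_sims) / n
green_sim = rng.binomial(n, p_null, n_sims) / n

# How often does pure chance produce a gap at least as large as the observed 0.35%?
observed_diff = 0.0035
print(f"Share of no-effect worlds with a gap this extreme: {np.mean(np.abs(green_sim - blue_sim) >= observed_diff):.3f}")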
Testing Our A/B Example
Let’s test whether the green button is actually better:
import numpy as np
from scipy import stats

# Data
blue_visitors = 10000
blue_purchases = 320
blue_rate = blue_purchases / blue_visitors

green_visitors = 10000
green_purchases = 355
green_rate = green_purchases / green_visitors

print(f"Blue conversion rate: {blue_rate:.2%}")
print(f"Green conversion rate: {green_rate:.2%}")
print(f"Observed difference: {green_rate - blue_rate:.2%}")
Method 1: Two-Proportion Z-Test
def two_proportion_z_test(x1, n1, x2, n2):
    """
    Test if two proportions are significantly different.
    x1, x2: number of successes
    n1, n2: number of trials
    Returns: z-statistic, p-value (two-tailed)
    """
    # Sample proportions
    p1 = x1 / n1
    p2 = x2 / n2

    # Pooled proportion (under null hypothesis)
    p_pool = (x1 + x2) / (n1 + n2)

    # Standard error under null
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

    # Z-statistic
    z = (p1 - p2) / se

    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    return z, p_value

# Run the test
z_stat, p_value = two_proportion_z_test(
    green_purchases, green_visitors,
    blue_purchases, blue_visitors
)

print(f"\nZ-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nResult: Reject null hypothesis")
    print("The difference is statistically significant at α=0.05")
else:
    print("\nResult: Fail to reject null hypothesis")
    print("The difference is NOT statistically significant at α=0.05")
Output:
Blue conversion rate: 3.20%
Green conversion rate: 3.55%
Observed difference: 0.35%
Z-statistic: 1.370
P-value: 0.1705
Result: Fail to reject null hypothesis
The difference is NOT statistically significant at α=0.05
Despite the 0.35% improvement, the p-value of roughly 0.17 means we cannot rule out random chance.
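If you have statsmodels installed, you can cross-check the hand-rolled test against its built-in two-proportion z-test. This is an optional sketch; the result should closely match the calculation above.
# Optional cross-check (pip install statsmodels)
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

successes = np.array([green_purchases, blue_purchases])
trials = np.array([green_visitors, blue_visitors])

z_sm, p_sm = proportions_ztest(successes, trials, alternative='two-sided')
print(f"statsmodels z: {z_sm:.3f}, p-value: {p_sm:.4f}")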
Method 2: Chi-Square Test
from scipy.stats import chi2_contingency

# Contingency table
#           Purchased   Not Purchased
# Blue         320          9680
# Green        355          9645
contingency_table = np.array([
    [blue_purchases, blue_visitors - blue_purchases],
    [green_purchases, green_visitors - green_purchases]
])

# correction=False disables Yates' continuity correction so the result
# matches the two-proportion z-test exactly
chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
Output:
Chi-square statistic: 1.878
P-value: 0.1705
Degrees of freedom: 1
Same p-value, same conclusion.
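That agreement is not a coincidence: for a 2×2 table, the chi-square statistic (without continuity correction) is exactly the square of the two-proportion z-statistic. A quick check with the values computed above:
# For a 2x2 table, chi-square (no continuity correction) equals z squared
print(f"z²   = {z_stat ** 2:.3f}")
print(f"chi² = {chi2:.3f}")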
Types of Errors
We can make two types of mistakes:
                     H₀ is True (No Effect)           H₀ is False (Real Effect)
Reject H₀            Type I Error (False Positive)    Correct Decision
Fail to Reject H₀    Correct Decision                 Type II Error (False Negative)
Type I Error (α): False Positive
We claim there’s an effect when there isn’t one.
Probability = α (typically 0.05)
“The boy who cried wolf”
Example: Launching a feature that doesn’t actually help
Type II Error (β): False Negative
We miss a real effect.
Probability = β (varies, often 0.20)
Power = 1 - β (typically 0.80)
Example: Abandoning a feature that would have helped
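One way to internalize both error types is to simulate them. The sketch below is illustrative only: it reuses the 3.20% and 3.55% rates and the 10,000-visitor groups from our example, and the seed, simulation count, and helper name are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sims, alpha = 10_000, 5_000, 0.05

def rejection_rate(p_control, p_treatment):
    """Fraction of simulated A/B tests that reject H₀ at the given alpha."""
    c = rng.binomial(n, p_control, n_sims) / n
    t = rng.binomial(n, p_treatment, n_sims) / n
    p_pool = (c + t) / 2
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (t - c) / se
    p_values = 2 * (1 - stats.norm.cdf(np.abs(z)))
    return np.mean(p_values < alpha)

print(f"Type I error rate (no real effect):  {rejection_rate(0.032, 0.032):.1%}")   # should be close to alpha
print(f"Power (real 3.20% -> 3.55% effect):  {rejection_rate(0.032, 0.0355):.1%}")
print(f"Type II error rate (beta):           {1 - rejection_rate(0.032, 0.0355):.1%}")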
# Visualize the tradeoff
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Type I Error
x = np.linspace(-4, 4, 1000)
null_dist = stats.norm.pdf(x, 0, 1)
axes[0].plot(x, null_dist, 'b-', linewidth=2, label='Null Distribution')
axes[0].fill_between(x[x > 1.96], null_dist[x > 1.96], alpha=0.3, color='red',
                     label='Type I Error Region (α/2)')
axes[0].fill_between(x[x < -1.96], null_dist[x < -1.96], alpha=0.3, color='red')
axes[0].axvline(1.96, color='red', linestyle='--')
axes[0].axvline(-1.96, color='red', linestyle='--')
axes[0].set_title('Type I Error (False Positive)')
axes[0].legend()

# Type II Error (with alternative distribution)
alt_dist = stats.norm.pdf(x, 2, 1)  # Effect exists, shifted right
axes[1].plot(x, null_dist, 'b-', linewidth=2, label='Null (No Effect)')
axes[1].plot(x, alt_dist, 'g-', linewidth=2, label='Alternative (Real Effect)')
axes[1].fill_between(x[(x > -1.96) & (x < 1.96)], alt_dist[(x > -1.96) & (x < 1.96)],
                     alpha=0.3, color='orange', label='Type II Error Region (β)')
axes[1].axvline(1.96, color='red', linestyle='--')
axes[1].axvline(-1.96, color='red', linestyle='--')
axes[1].set_title('Type II Error (False Negative)')
axes[1].legend()

plt.tight_layout()
plt.show()
Statistical Power: Ability to Detect Real Effects
Power = Probability of detecting an effect when it exists = 1 - β
Higher power means:
Less likely to miss real effects
More confidence in negative ("no effect") results
The tradeoff: achieving higher power requires larger sample sizes
def power_proportion_test(p1, p2, n, alpha=0.05):
    """
    Calculate power for a two-proportion test.
    p1: control proportion
    p2: treatment proportion
    n: sample size per group
    alpha: significance level
    """
    # Effect size
    effect = abs(p2 - p1)

    # Pooled standard error under null
    p_pool = (p1 + p2) / 2
    se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)

    # Standard error under alternative
    se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)

    # Critical value
    z_crit = stats.norm.ppf(1 - alpha / 2)

    # Power
    z_power = (effect - z_crit * se_null) / se_alt
    power = stats.norm.cdf(z_power)

    return power

# Our A/B test: 3.20% vs 3.55%, n=10,000 per group
power = power_proportion_test(0.032, 0.0355, 10000)
print(f"Power of our test: {power:.1%}")

# What if we had 50,000 per group?
power_large = power_proportion_test(0.032, 0.0355, 50000)
print(f"Power with n=50,000: {power_large:.1%}")
Output:
Power of our test: 27.8%
Power with n=50,000: 86.5%
With only 10,000 visitors per group, we had roughly a 28% chance of detecting that 0.35% difference. No wonder we failed to find significance.
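To see how power scales with sample size, you can sweep n through the function we just defined (a quick sketch; the exact percentages depend on the approximation used):
# Power as a function of per-group sample size for the 3.20% vs 3.55% scenario
for n in [5_000, 10_000, 25_000, 50_000, 100_000]:
    print(f"n = {n:>7,}: power = {power_proportion_test(0.032, 0.0355, n):.1%}")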
Sample Size Calculation for Desired Power
def sample_size_proportion_test(p1, p2, power=0.80, alpha=0.05):
    """
    Calculate required sample size per group.
    p1: expected control proportion
    p2: expected treatment proportion
    power: desired power (typically 0.80)
    alpha: significance level (typically 0.05)
    """
    # Effect size
    effect = abs(p2 - p1)

    # Pooled proportion
    p_pool = (p1 + p2) / 2

    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    # Variance terms
    var_null = 2 * p_pool * (1 - p_pool)
    var_alt = p1 * (1 - p1) + p2 * (1 - p2)

    # Sample size formula
    n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2

    return int(np.ceil(n))

# How many visitors do we need to detect a 0.35% difference?
n_needed = sample_size_proportion_test(0.032, 0.0355)
print(f"Need {n_needed:,} visitors per group to detect 0.35% difference with 80% power")

# For a larger effect (1% improvement)
n_1pct = sample_size_proportion_test(0.032, 0.042)
print(f"Need {n_1pct:,} visitors per group to detect 1.0% difference with 80% power")
Output:
Need 41,789 visitors per group to detect 0.35% difference with 80% power
Need 5,593 visitors per group to detect 1.0% difference with 80% power
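As a sanity check, statsmodels can solve for the same sample size using Cohen's effect size for proportions. This optional sketch assumes statsmodels is installed; expect an answer close to, but not identical with, the formula above because the approximation differs.
# Optional cross-check (pip install statsmodels)
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect_size = proportion_effectsize(0.0355, 0.032)  # Cohen's h
n_per_group = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                           power=0.80, ratio=1.0, alternative='two-sided')
print(f"statsmodels estimate: {int(np.ceil(n_per_group)):,} visitors per group")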
Common Statistical Tests
1. One-Sample t-Test
Is this sample mean different from a known value?
# Are our website load times different from the 3-second industry standard?
load_times = np.array([2.8, 3.2, 2.9, 3.5, 2.7, 3.1, 2.6, 3.0, 2.9, 3.3])

t_stat, p_value = stats.ttest_1samp(load_times, 3.0)

print(f"Sample mean: {np.mean(load_times):.2f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
2. Two-Sample t-Test
Are the means of two groups different?
# Do users spend more time on the new homepage design?
old_design_time = np.array([45, 52, 38, 61, 42, 55, 48, 50, 44, 58])
new_design_time = np.array([58, 62, 55, 70, 65, 60, 68, 72, 63, 59])

t_stat, p_value = stats.ttest_ind(old_design_time, new_design_time)

print(f"Old design mean: {np.mean(old_design_time):.1f}s")
print(f"New design mean: {np.mean(new_design_time):.1f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
3. Paired t-Test
Before/after comparisons on the same subjects:
# Does a training program improve test scores?
before = np.array([65, 72, 58, 80, 75, 62, 70, 68, 74, 78])
after = np.array([70, 78, 62, 85, 82, 68, 75, 72, 80, 82])

t_stat, p_value = stats.ttest_rel(before, after)

print(f"Mean improvement: {np.mean(after - before):.1f} points")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
4. ANOVA
Are three or more groups different?
# Do three different ad campaigns have different click rates?
campaign_a = np.array([2.1, 2.3, 2.0, 2.4, 2.2])
campaign_b = np.array([2.8, 3.0, 2.9, 3.1, 2.7])
campaign_c = np.array([2.3, 2.5, 2.4, 2.6, 2.2])

f_stat, p_value = stats.f_oneway(campaign_a, campaign_b, campaign_c)

print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.4f}")
Complete A/B Testing Framework
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Tuple, Optional
@dataclass
class ABTestResult :
"""Results of an A/B test."""
control_rate: float
treatment_rate: float
relative_lift: float
absolute_lift: float
z_statistic: float
p_value: float
confidence_interval: Tuple[ float , float ]
is_significant: bool
power: float
class ABTestAnalyzer :
"""
Complete A/B testing framework with proper statistical methodology.
"""
def __init__ ( self , alpha : float = 0.05 , power_threshold : float = 0.80 ):
self .alpha = alpha
self .power_threshold = power_threshold
def run_test (
self ,
control_successes : int ,
control_total : int ,
treatment_successes : int ,
treatment_total : int
) -> ABTestResult:
"""Run a two-proportion z-test."""
# Calculate rates
p_control = control_successes / control_total
p_treatment = treatment_successes / treatment_total
# Lifts
absolute_lift = p_treatment - p_control
relative_lift = (p_treatment - p_control) / p_control if p_control > 0 else 0
# Pooled proportion
p_pool = (control_successes + treatment_successes) / (control_total + treatment_total)
# Standard error
se = np.sqrt(p_pool * ( 1 - p_pool) * ( 1 / control_total + 1 / treatment_total))
# Z-statistic
z = absolute_lift / se if se > 0 else 0
# P-value (two-tailed)
p_value = 2 * ( 1 - stats.norm.cdf( abs (z)))
# Confidence interval for the difference
se_diff = np.sqrt(
p_control * ( 1 - p_control) / control_total +
p_treatment * ( 1 - p_treatment) / treatment_total
)
z_crit = stats.norm.ppf( 1 - self .alpha / 2 )
ci = (absolute_lift - z_crit * se_diff, absolute_lift + z_crit * se_diff)
# Power (approximate)
power = self ._calculate_power(p_control, p_treatment, min (control_total, treatment_total))
return ABTestResult(
control_rate = p_control,
treatment_rate = p_treatment,
relative_lift = relative_lift,
absolute_lift = absolute_lift,
z_statistic = z,
p_value = p_value,
confidence_interval = ci,
is_significant = p_value < self .alpha,
power = power
)
def _calculate_power ( self , p1 : float , p2 : float , n : int ) -> float :
"""Calculate statistical power."""
effect = abs (p2 - p1)
if effect == 0 :
return 0
p_pool = (p1 + p2) / 2
se_null = np.sqrt( 2 * p_pool * ( 1 - p_pool) / n)
se_alt = np.sqrt((p1 * ( 1 - p1) + p2 * ( 1 - p2)) / n)
z_crit = stats.norm.ppf( 1 - self .alpha / 2 )
z_power = (effect - z_crit * se_null) / se_alt
return stats.norm.cdf(z_power)
def required_sample_size (
self ,
baseline_rate : float ,
minimum_detectable_effect : float ,
power : float = 0.80
) -> int :
"""Calculate required sample size per group."""
p1 = baseline_rate
p2 = baseline_rate * ( 1 + minimum_detectable_effect)
effect = abs (p2 - p1)
p_pool = (p1 + p2) / 2
z_alpha = stats.norm.ppf( 1 - self .alpha / 2 )
z_beta = stats.norm.ppf(power)
var_null = 2 * p_pool * ( 1 - p_pool)
var_alt = p1 * ( 1 - p1) + p2 * ( 1 - p2)
n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2
return int (np.ceil(n))
    def print_report(self, result: ABTestResult, test_name: str = "A/B Test"):
        """Print a formatted test report."""
        print("\n" + "=" * 60)
        print(f"A/B TEST REPORT: {test_name}")
        print("=" * 60)
        print("\nConversion Rates:")
        print(f"  Control:   {result.control_rate:.2%}")
        print(f"  Treatment: {result.treatment_rate:.2%}")
        print("\nLift:")
        print(f"  Absolute: {result.absolute_lift:+.2%}")
        print(f"  Relative: {result.relative_lift:+.1%}")
        print("\nStatistical Analysis:")
        print(f"  Z-statistic: {result.z_statistic:.3f}")
        print(f"  P-value: {result.p_value:.4f}")
        print(f"  95% CI for difference: ({result.confidence_interval[0]:+.2%}, {result.confidence_interval[1]:+.2%})")
        print("\nTest Quality:")
        print(f"  Power: {result.power:.1%}")
        if result.power < self.power_threshold:
            print("  Warning: Low power. Consider a larger sample size.")
        print(f"\nConclusion (α = {self.alpha}):")
        if result.is_significant:
            if result.absolute_lift > 0:
                print("  SIGNIFICANT: Treatment performs BETTER than control")
            else:
                print("  SIGNIFICANT: Treatment performs WORSE than control")
        else:
            print("  NOT SIGNIFICANT: Cannot conclude a difference exists")
            if result.power < self.power_threshold:
                print("  Note: Low power means we might be missing a real effect")
        print("=" * 60)
# Usage example
analyzer = ABTestAnalyzer( alpha = 0.05 )
# Test 1: Original example (not significant)
result1 = analyzer.run_test(
control_successes = 320 , control_total = 10000 ,
treatment_successes = 355 , treatment_total = 10000
)
analyzer.print_report(result1, "Checkout Button Color" )
# Test 2: Larger sample (now significant!)
result2 = analyzer.run_test(
control_successes = 3200 , control_total = 100000 ,
treatment_successes = 3550 , treatment_total = 100000
)
analyzer.print_report(result2, "Checkout Button Color (Large Sample)" )
# Calculate required sample size
n_required = analyzer.required_sample_size(
    baseline_rate=0.032,
    minimum_detectable_effect=0.10  # 10% relative improvement
)
print("\nTo detect a 10% relative improvement with 80% power:")
print(f"Need {n_required:,} visitors per group")
Common Mistakes to Avoid
1. Peeking and Early Stopping
# BAD: Stopping as soon as p < 0.05
# This inflates false positive rate to ~30%!
# GOOD: Pre-specify sample size and run to completion
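The inflation from peeking is easy to demonstrate with a simulation. The sketch below is illustrative only (batch size, number of looks, and simulation count are assumed values): it runs many A/A tests with no true difference, stops at the first interim look with p < 0.05, and reports how often that happens.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_look, n_looks, n_sims, alpha = 2_000, 5, 2_000, 0.05  # assumed parameters

false_positives = 0
for _ in range(n_sims):
    c_success = t_success = total = 0
    for _ in range(n_looks):
        # Another batch of visitors per arm; both arms truly convert at 3.2%
        c_success += rng.binomial(n_per_look, 0.032)
        t_success += rng.binomial(n_per_look, 0.032)
        total += n_per_look
        p_c, p_t = c_success / total, t_success / total
        p_pool = (c_success + t_success) / (2 * total)
        se = np.sqrt(p_pool * (1 - p_pool) * (2 / total))
        if se > 0 and 2 * (1 - stats.norm.cdf(abs(p_t - p_c) / se)) < alpha:
            false_positives += 1  # stopped early on a "significant" peek
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.1%} (nominal: {alpha:.0%})")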
2. Multiple Testing Without Correction
# Testing 20 variations? Some will be "significant" by chance!
# Bonferroni correction
alpha_corrected = 0.05 / 20 # = 0.0025
# Or use False Discovery Rate (FDR) correction
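statsmodels bundles both corrections in a single helper, which is handy when you have a whole batch of p-values. This optional sketch assumes statsmodels is installed and uses made-up p-values purely for illustration.
# Adjusting a batch of p-values (hypothetical values)
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.008, 0.020, 0.049, 0.150, 0.600])

reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
reject_fdr, p_fdr, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')

print("Bonferroni significant:            ", reject_bonf)
print("Benjamini-Hochberg FDR significant:", reject_fdr)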
Interview Questions
Question 1: A/B Test Interpretation (Google)
Question: Your A/B test shows the treatment group has 3.5% conversion vs 3.2% control. The p-value is 0.08. What do you tell stakeholders?
Answer:
“At our standard α=0.05 threshold, this result is not statistically significant (p=0.08). However, there are a few considerations:
The observed difference is 0.3 percentage points - if real, this could be meaningful at scale
The p-value of 0.08 suggests weak evidence against the null hypothesis, not proof the treatment doesn’t work
Consider power analysis - we may have been underpowered to detect this effect size
Practical significance - if the change is low-risk and low-cost, you might still consider implementing
Recommendation: If resources allow, run a larger test to get more conclusive results.”
Question 2: Multiple Testing (Amazon)
Question: You test 20 different variations of a product page. Three show p-values under 0.05. What's the problem?
Answer: With 20 tests at α = 0.05, we expect about one false positive even if nothing is different.
Expected false positives = 20 × 0.05 = 1
Solutions:
Bonferroni correction : Use α = 0.05/20 = 0.0025 as threshold
Benjamini-Hochberg (FDR) : Control the expected proportion of false discoveries
Holdout validation : Test the “winners” on fresh data
# Bonferroni
alpha_corrected = 0.05 / 20 # 0.0025
# Only results with p < 0.0025 are significant
Question 3: Power and Sample Size (Facebook/Meta)
Question 4: Early Stopping (Tech Companies)
Question: You're running an A/B test. After 2 days (of a planned 7), you peek at the results and see p=0.03. Should you stop the test?
Answer: No! This is called "p-hacking" or "optional stopping" and inflates false positive rates.
The p-value assumes you only look once, at the end. If you peek repeatedly:
With 5 peeks, your actual false positive rate is ~19%, not 5%
With 10 peeks, it’s ~25%
Proper approaches:
Sequential testing with adjusted thresholds (O’Brien-Fleming, Pocock)
Bayesian methods that allow continuous monitoring
Pre-commit to analysis plan and stick to it
# O'Brien-Fleming boundaries for 5 interim analyses (approximate nominal p-value thresholds):
# Look 1: α = 0.00001
# Look 2: α = 0.001
# Look 3: α = 0.01
# Look 4: α = 0.02
# Look 5: α = 0.04
Practice Challenge
Challenge: Build a Complete A/B Testing Framework
Create a production-ready A/B test analysis tool:
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Optional, Tuple
@dataclass
class ABTestResult :
control_rate: float
treatment_rate: float
relative_lift: float
absolute_lift: float
z_statistic: float
p_value: float
confidence_interval: Tuple[ float , float ]
power: float
is_significant: bool
recommendation: str
class ProductionABTest :
"""
Production-ready A/B test analyzer with:
- Power analysis
- Effect size estimation
- Confidence intervals
- Clear recommendations
"""
def __init__ ( self , alpha : float = 0.05 , min_detectable_effect : float = 0.1 ):
self .alpha = alpha
self .mde = min_detectable_effect
def analyze (
self ,
control_conversions : int ,
control_visitors : int ,
treatment_conversions : int ,
treatment_visitors : int ,
test_name : str = "A/B Test"
) -> ABTestResult:
"""Analyze A/B test results."""
# Your implementation here
pass
def recommend_sample_size (
self ,
baseline_rate : float ,
min_detectable_effect : float ,
power : float = 0.8
) -> int :
"""Calculate required sample size per group."""
# Your implementation here
pass
def generate_report ( self , result : ABTestResult) -> str :
"""Generate human-readable report."""
# Your implementation here
pass
# Test your implementation:
test = ProductionABTest()
# Scenario 1: Clear winner
result1 = test.analyze(
control_conversions = 500 , control_visitors = 10000 ,
treatment_conversions = 600 , treatment_visitors = 10000
)
# Scenario 2: Inconclusive
result2 = test.analyze(
control_conversions = 510 , control_visitors = 10000 ,
treatment_conversions = 530 , treatment_visitors = 10000
)
# Scenario 3: Treatment is worse
result3 = test.analyze(
control_conversions = 500 , control_visitors = 10000 ,
treatment_conversions = 420 , treatment_visitors = 10000
)
Full Solution:
class ProductionABTest:
def __init__ ( self , alpha : float = 0.05 , min_detectable_effect : float = 0.1 ):
self .alpha = alpha
self .mde = min_detectable_effect
def analyze (
self ,
control_conversions : int ,
control_visitors : int ,
treatment_conversions : int ,
treatment_visitors : int ,
test_name : str = "A/B Test"
) -> ABTestResult:
# Calculate rates
p_c = control_conversions / control_visitors
p_t = treatment_conversions / treatment_visitors
# Effect sizes
absolute_lift = p_t - p_c
relative_lift = (p_t - p_c) / p_c if p_c > 0 else 0
# Pooled proportion and standard error
p_pool = (control_conversions + treatment_conversions) / \
(control_visitors + treatment_visitors)
se = np.sqrt(p_pool * ( 1 - p_pool) *
( 1 / control_visitors + 1 / treatment_visitors))
# Z-test
z = absolute_lift / se if se > 0 else 0
p_value = 2 * ( 1 - stats.norm.cdf( abs (z)))
# Confidence interval for difference
se_diff = np.sqrt(
p_c * ( 1 - p_c) / control_visitors +
p_t * ( 1 - p_t) / treatment_visitors
)
z_crit = stats.norm.ppf( 1 - self .alpha / 2 )
ci = (absolute_lift - z_crit * se_diff,
absolute_lift + z_crit * se_diff)
# Power calculation (approximate)
power = self ._calculate_power(
control_visitors, treatment_visitors,
p_c, p_t
)
# Significance check
is_significant = p_value < self .alpha
# Generate recommendation
recommendation = self ._generate_recommendation(
p_c, p_t, p_value, power, is_significant
)
return ABTestResult(
control_rate = p_c,
treatment_rate = p_t,
relative_lift = relative_lift,
absolute_lift = absolute_lift,
z_statistic = z,
p_value = p_value,
confidence_interval = ci,
power = power,
is_significant = is_significant,
recommendation = recommendation
)
def _calculate_power ( self , n1 , n2 , p1 , p2 ):
"""Calculate achieved power."""
effect = abs (p2 - p1)
pooled_p = (p1 + p2) / 2
se = np.sqrt(pooled_p * ( 1 - pooled_p) * ( 1 / n1 + 1 / n2))
z_crit = stats.norm.ppf( 1 - self .alpha / 2 )
z_power = (effect / se) - z_crit
return stats.norm.cdf(z_power)
def _generate_recommendation ( self , p_c , p_t , p_value , power , sig ):
if sig and p_t > p_c:
return "SHIP IT: Treatment significantly outperforms control"
elif sig and p_t < p_c:
return "STOP: Treatment significantly underperforms control"
elif not sig and power < 0.5 :
return "INCONCLUSIVE: Test underpowered, consider running longer"
elif not sig and power >= 0.8 :
return "NO EFFECT: High-powered test found no significant difference"
else :
return "BORDERLINE: Consider practical significance and run longer"
def recommend_sample_size ( self , baseline_rate , mde , power = 0.8 ):
target_rate = baseline_rate * ( 1 + mde)
effect = target_rate - baseline_rate
pooled_p = (baseline_rate + target_rate) / 2
z_alpha = stats.norm.ppf( 1 - self .alpha / 2 )
z_beta = stats.norm.ppf(power)
n = 2 * pooled_p * ( 1 - pooled_p) * ((z_alpha + z_beta) / effect) ** 2
return int (np.ceil(n))
📝 Practice Exercises
Exercise 1: Conduct a one-sample hypothesis test
Exercise 2: Analyze an A/B test for conversion rates
Exercise 3: Calculate statistical power and sample size
Exercise 4: Real-world drug trial effectiveness testing
Key Takeaways
The Framework
Null hypothesis: no effect (innocent)
Alternative: there is an effect
P-value: how surprising is the data?
Decision threshold: typically α = 0.05
Types of Errors
Type I (α): False positive, claiming effect that doesn’t exist
Type II (β): False negative, missing real effect
Power = 1 - β: Ability to detect real effects
Sample Size Matters
Small samples = low power = missed effects
Calculate sample size BEFORE running test
Detecting smaller effects requires much more data (halving the detectable effect roughly quadruples the required sample size)
Test Selection
Two proportions: Chi-square or z-test
Two means: t-test
Multiple groups: ANOVA
Non-normal: Mann-Whitney U
Common Pitfalls
A/B Testing Mistakes to Avoid:
Peeking & Early Stopping - Checking daily inflates false positives; use sequential testing methods instead
Underpowered Tests - Running with too few samples misses real effects; calculate sample size first
Multiple Comparisons - Testing 20 variants without correction guarantees false positives
Ignoring Practical Significance - A p < 0.05 with 0.01% improvement isn’t worth shipping
One-Tailed When Uncertain - Only use a one-tailed test when you genuinely don't care about effects in the opposite direction
P-value Misinterpretation - P-value is NOT the probability the null is true!
Connection to Machine Learning
Hypothesis Testing             ML Application
A/B testing                    Model comparison, feature evaluation
Power analysis                 Training set size planning
Multiple testing correction    Hyperparameter search, feature selection
Type I/II errors               Precision/Recall tradeoff
Significance testing           Statistical validation of model improvements
ML Connection : Every time you compare “Model A accuracy = 0.92 vs Model B accuracy = 0.89”, you’re implicitly doing hypothesis testing. The question is: is this 3% difference real or just noise from your test set? Proper statistical testing (like paired t-tests on cross-validation folds) gives you the answer.
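A minimal sketch of that idea, using hypothetical per-fold accuracy scores for two models (the numbers below are made up for illustration):
import numpy as np
from scipy import stats

# Hypothetical accuracy of two models on the same 10 cross-validation folds
model_a_scores = np.array([0.91, 0.93, 0.92, 0.90, 0.94, 0.92, 0.91, 0.93, 0.92, 0.90])
model_b_scores = np.array([0.89, 0.91, 0.90, 0.88, 0.92, 0.90, 0.89, 0.90, 0.91, 0.88])

# Paired test: each fold is scored by both models, so the samples are matched
t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)
print(f"Mean difference: {np.mean(model_a_scores - model_b_scores):.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")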
Coming up next: We'll learn about Correlation and Regression - how to understand relationships between variables and make predictions. This is where statistics directly becomes machine learning.
Next: Correlation and Regression - Understand relationships and make predictions