# Hypothesis Testing: Real Effect or Random Noise?

> The scientific method for determining if differences are meaningful

<Frame>
  <img src="https://mintcdn.com/devweeekends/X0Fp4X8lMl-ZftoO/images/courses/statistics-for-ml/hypothesis-real-world.svg?fit=max&auto=format&n=X0Fp4X8lMl-ZftoO&q=85&s=708f025b2ab6cda22245e9d70dcc0e66" alt="Hypothesis Testing" width="1080" height="1080" data-path="images/courses/statistics-for-ml/hypothesis-real-world.svg" />
</Frame>

## The A/B Testing Problem

You work at an e-commerce company. The design team created a new checkout button - green instead of blue. After running both versions for a week:

| Version        | Visitors | Purchases | Conversion Rate |
| -------------- | -------- | --------- | --------------- |
| Blue (Control) | 10,000   | 320       | 3.20%           |
| Green (New)    | 10,000   | 355       | 3.55%           |

The green button has a higher conversion rate. But is this a **real improvement** or just **random chance**?

This is the fundamental question of hypothesis testing.

**Analogy**: Hypothesis testing is like a courtroom trial for data. The null hypothesis is the "defendant" -- innocent until proven guilty. The data is the "evidence." The p-value is how surprising the evidence would be if the defendant were truly innocent. And just like in court, you can make two kinds of mistakes: convicting an innocent person (false positive) or letting a guilty one go free (false negative). The entire framework is about managing those risks.

<Info>
  **Estimated Time**: 4-5 hours\
  **Difficulty**: Intermediate\
  **Prerequisites**: Modules 1-4 (especially Distributions and Inference)\
  **What You'll Build**: Complete A/B testing framework
</Info>

***

## The Framework: Innocent Until Proven Guilty

Hypothesis testing borrows from the legal system:

| Legal System                                         | Hypothesis Testing                          |
| ---------------------------------------------------- | ------------------------------------------- |
| Defendant is innocent until proven guilty            | No effect until proven otherwise            |
| Prosecution must prove guilt beyond reasonable doubt | Data must prove effect with high confidence |
| Jury verdict: guilty or not guilty                   | Decision: reject or fail to reject null     |
| "Not guilty" ≠ "innocent"                            | "Fail to reject" ≠ "effect doesn't exist"   |

<Frame>
  <img src="https://mintcdn.com/devweeekends/X0Fp4X8lMl-ZftoO/images/courses/statistics-for-ml/hypothesis-testing-math.svg?fit=max&auto=format&n=X0Fp4X8lMl-ZftoO&q=85&s=dd3ab6eeff15d06e85334948610f62c4" alt="Hypothesis Testing Framework" width="1080" height="1080" data-path="images/courses/statistics-for-ml/hypothesis-testing-math.svg" />
</Frame>

### The Two Hypotheses

**Null Hypothesis (H₀)**: The default assumption. Nothing special is happening.

* "The new button has the same conversion rate as the old one"
* "The drug has no effect"
* "The two groups are the same"

**Alternative Hypothesis (H₁ or Hₐ)**: What we're trying to prove.

* "The new button has a different conversion rate"
* "The drug has an effect"
* "The groups are different"

```python theme={null}
# Our A/B test hypotheses:
# H₀: p_green = p_blue (no difference)
# H₁: p_green ≠ p_blue (there is a difference)
```

***

## The P-Value: Quantifying Surprise

The **p-value** answers: "If there really were no effect, how likely would we be to see data this extreme?"

<Frame>
  <img src="https://mintcdn.com/devweeekends/X0Fp4X8lMl-ZftoO/images/courses/statistics-for-ml/pvalue-real-world.svg?fit=max&auto=format&n=X0Fp4X8lMl-ZftoO&q=85&s=0fcc14980f440d80b62784033368062d" alt="P-Value Intuition" width="1080" height="1080" data-path="images/courses/statistics-for-ml/pvalue-real-world.svg" />
</Frame>

### Interpreting P-Values

| P-Value   | Interpretation                                |
| --------- | --------------------------------------------- |
| p \< 0.01 | Strong evidence against null hypothesis       |
| p \< 0.05 | Moderate evidence against null hypothesis     |
| p \< 0.10 | Weak evidence against null hypothesis         |
| p ≥ 0.10  | Little to no evidence against null hypothesis |

**Common threshold (α)**: 0.05 (5%)

* If p \< 0.05, we reject the null hypothesis
* If p ≥ 0.05, we fail to reject the null hypothesis

<Warning>
  **Critical Misconception**: The p-value is NOT the probability that the null hypothesis is true.

  It's the probability of seeing data this extreme IF the null hypothesis were true.
</Warning>
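
One way to build intuition for that definition is to simulate it. The sketch below assumes both buttons share the pooled conversion rate from the A/B table (that is, the null hypothesis is true), generates many imaginary experiments, and counts how often the gap in purchases is at least as large as the observed 35. The seed and the number of simulations are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# Observed data from the A/B table
n_per_group = 10_000
blue_purchases, green_purchases = 320, 355
observed_gap = abs(green_purchases - blue_purchases)  # 35 purchases

# Under H0 both buttons share one conversion rate; estimate it by pooling
pooled_rate = (blue_purchases + green_purchases) / (2 * n_per_group)

# Simulate many experiments in a world where H0 is true
n_sims = 100_000
sim_blue = rng.binomial(n_per_group, pooled_rate, size=n_sims)
sim_green = rng.binomial(n_per_group, pooled_rate, size=n_sims)

# Two-sided p-value: share of simulated gaps at least as extreme as observed
simulated_p = np.mean(np.abs(sim_green - sim_blue) >= observed_gap)
print(f"Simulated p-value: {simulated_p:.3f}")
```

The simulated share is an estimate of the same quantity that the analytic tests compute, so the two should roughly agree.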

***

## Testing Our A/B Example

Let's test whether the green button is actually better:

```python theme={null}
import numpy as np
from scipy import stats

# Data
blue_visitors = 10000
blue_purchases = 320
blue_rate = blue_purchases / blue_visitors

green_visitors = 10000
green_purchases = 355
green_rate = green_purchases / green_visitors

print(f"Blue conversion rate: {blue_rate:.2%}")
print(f"Green conversion rate: {green_rate:.2%}")
print(f"Observed difference: {green_rate - blue_rate:.2%}")
```

### Method 1: Two-Proportion Z-Test

```python theme={null}
def two_proportion_z_test(x1, n1, x2, n2):
    """
    Test if two proportions are significantly different.
    
    x1, x2: number of successes
    n1, n2: number of trials
    
    Returns: z-statistic, p-value (two-tailed)
    """
    # Sample proportions
    p1 = x1 / n1
    p2 = x2 / n2
    
    # Pooled proportion (under null hypothesis)
    p_pool = (x1 + x2) / (n1 + n2)
    
    # Standard error under null
    se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    
    # Z-statistic
    z = (p1 - p2) / se
    
    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    
    return z, p_value

# Run the test
z_stat, p_value = two_proportion_z_test(
    green_purchases, green_visitors,
    blue_purchases, blue_visitors
)

print(f"\nZ-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nResult: Reject null hypothesis")
    print("The difference is statistically significant at α=0.05")
else:
    print("\nResult: Fail to reject null hypothesis")
    print("The difference is NOT statistically significant at α=0.05")
```

**Output:**

```
Blue conversion rate: 3.20%
Green conversion rate: 3.55%
Observed difference: 0.35%

Z-statistic: 1.370
P-value: 0.1705

Result: Fail to reject null hypothesis
The difference is NOT statistically significant at α=0.05
```

Despite the 0.35 percentage point improvement, the p-value of 0.17 means we cannot rule out random chance.

**Step-by-step reasoning for interpreting this result**:

1. **What we observed**: Green button had 0.35 percentage points higher conversion.
2. **What we asked**: If there were truly no difference, how often would we see a gap this large by luck alone?
3. **What we found**: About 17% of the time (p = 0.17). That is not particularly rare.
4. **Our decision**: Since 17% is above our 5% threshold, we cannot confidently say the green button is better. The observed difference is plausible under random noise.
5. **What this does NOT mean**: It does not mean the green button is NOT better. It means we do not have enough evidence to conclude either way. A larger sample might reveal a real difference.

<Tip>
  **ML Application -- Model Comparison**: This same logic applies when you compare two ML models. If Model A gets 94.2% accuracy and Model B gets 93.8%, is A really better? Without a statistical test (like a paired t-test over cross-validation folds), you cannot know. Many ML practitioners deploy "better" models that are actually within the noise margin. Always test whether the difference is statistically significant before making deployment decisions.
</Tip>
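
As a rough illustration of that model comparison, the sketch below runs a paired t-test on per-fold accuracies. The fold scores are hypothetical numbers invented for this example, and the test assumes both models were evaluated on the same five splits. Because cross-validation folds share training data, the resulting p-value tends to be optimistic, so treat it as a sanity check rather than a definitive verdict.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies from the SAME 5 cross-validation splits
model_a_acc = np.array([0.942, 0.938, 0.945, 0.940, 0.944])
model_b_acc = np.array([0.938, 0.939, 0.940, 0.936, 0.941])

# Paired test: each fold yields one (A, B) pair scored on identical data
t_stat, p_value = stats.ttest_rel(model_a_acc, model_b_acc)

mean_gap = np.mean(model_a_acc - model_b_acc)
print(f"Mean accuracy gap (A - B): {mean_gap:.4f}")
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
```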

### Method 2: Chi-Square Test

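```python theme={null}
from scipy.stats import chi2_contingency

# Contingency table
#                  Purchased    Not Purchased
# Blue               320          9680
# Green              355          9645

contingency_table = np.array([
    [blue_purchases, blue_visitors - blue_purchases],
    [green_purchases, green_visitors - green_purchases]
])

# correction=False turns off Yates' continuity correction, which SciPy applies
# by default to 2x2 tables; without it, chi2 equals the squared z-statistic
chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
```

**Output:**

```
Chi-square statistic: 1.878
P-value: 0.1705
Degrees of freedom: 1
```

Same p-value, same conclusion. For a 2×2 table without the continuity correction, the chi-square statistic is the square of the two-proportion z-statistic, so the two tests are equivalent. With SciPy's default Yates correction the p-value would be slightly larger (about 0.18).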

***

## Types of Errors

We can make two types of mistakes:

|                       | H₀ is True (No Effect)        | H₀ is False (Real Effect)      |
| --------------------- | ----------------------------- | ------------------------------ |
| **Reject H₀**         | Type I Error (False Positive) | Correct Decision               |
| **Fail to Reject H₀** | Correct Decision              | Type II Error (False Negative) |

### Type I Error (α): False Positive

We claim there's an effect when there isn't one.

* Probability = α (typically 0.05)
* "The boy who cried wolf"
* Example: Launching a feature that doesn't actually help

### Type II Error (β): False Negative

We miss a real effect.

* Probability = β (varies, often 0.20)
* Power = 1 - β (typically 0.80)
* Example: Abandoning a feature that would have helped

```python theme={null}
# Visualize the tradeoff
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Type I Error
x = np.linspace(-4, 4, 1000)
null_dist = stats.norm.pdf(x, 0, 1)

axes[0].plot(x, null_dist, 'b-', linewidth=2, label='Null Distribution')
axes[0].fill_between(x[x > 1.96], null_dist[x > 1.96], alpha=0.3, color='red',
                      label='Type I Error Region (α/2 per tail)')
axes[0].fill_between(x[x < -1.96], null_dist[x < -1.96], alpha=0.3, color='red')
axes[0].axvline(1.96, color='red', linestyle='--')
axes[0].axvline(-1.96, color='red', linestyle='--')
axes[0].set_title('Type I Error (False Positive)')
axes[0].legend()

# Type II Error (with alternative distribution)
alt_dist = stats.norm.pdf(x, 2, 1)  # Effect exists, shifted right

axes[1].plot(x, null_dist, 'b-', linewidth=2, label='Null (No Effect)')
axes[1].plot(x, alt_dist, 'g-', linewidth=2, label='Alternative (Real Effect)')
axes[1].fill_between(x[(x > -1.96) & (x < 1.96)], alt_dist[(x > -1.96) & (x < 1.96)], 
                      alpha=0.3, color='orange', label='Type II Error Region (β)')
axes[1].axvline(1.96, color='red', linestyle='--')
axes[1].axvline(-1.96, color='red', linestyle='--')
axes[1].set_title('Type II Error (False Negative)')
axes[1].legend()

plt.tight_layout()
plt.show()
```
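
The claim that the Type I error rate equals α can be checked empirically. The sketch below is a simulation under assumed settings (a shared true rate of 3.2%, 10,000 visitors per arm, an arbitrary seed): it runs many A/A experiments in which no real difference exists and counts how often a two-proportion z-test still reports significance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed

true_rate = 0.032   # both arms share this rate, so H0 is true by construction
n = 10_000          # visitors per arm
n_sims = 20_000
alpha = 0.05

# Simulate purchases for both arms in every experiment
x_a = rng.binomial(n, true_rate, size=n_sims)
x_b = rng.binomial(n, true_rate, size=n_sims)

# Vectorized two-proportion z-test (pooled standard error)
p_pool = (x_a + x_b) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (x_a / n - x_b / n) / se
p_values = 2 * stats.norm.sf(np.abs(z))

# Every rejection here is a false positive, because no effect exists
false_positive_rate = np.mean(p_values < alpha)
print(f"False positive rate: {false_positive_rate:.3f}")
```

The printed rate should sit close to the chosen α, which is exactly what "controlling the Type I error at 5%" means.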

***

## Statistical Power: Ability to Detect Real Effects

**Power** = Probability of detecting an effect when it exists = 1 - β

Higher power means:

* Less likely to miss real effects
* Requires larger sample sizes
* More confidence in negative results

```python theme={null}
def power_proportion_test(p1, p2, n, alpha=0.05):
    """
    Calculate power for a two-proportion test.
    
    p1: control proportion
    p2: treatment proportion
    n: sample size per group
    alpha: significance level
    """
    # Effect size
    effect = abs(p2 - p1)
    
    # Pooled standard error under null
    p_pool = (p1 + p2) / 2
    se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    
    # Standard error under alternative
    se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    
    # Critical value
    z_crit = stats.norm.ppf(1 - alpha / 2)
    
    # Power
    z_power = (effect - z_crit * se_null) / se_alt
    power = stats.norm.cdf(z_power)
    
    return power

# Our A/B test: 3.20% vs 3.55%, n=10,000 per group
power = power_proportion_test(0.032, 0.0355, 10000)
print(f"Power of our test: {power:.1%}")

# What if we had 50,000 per group?
power_large = power_proportion_test(0.032, 0.0355, 50000)
print(f"Power with n=50,000: {power_large:.1%}")
```

**Output:**

```
Power of our test: 27.8%
Power with n=50,000: 86.5%
```

With only 10,000 per group, we had roughly a 28% chance of detecting a true 0.35 percentage point difference. No wonder we failed to find significance.

**Analogy**: Running an underpowered test is like using a metal detector at the beach with the sensitivity turned way down. Even if there is buried treasure, you are unlikely to find it. Power analysis tells you how sensitive your detector needs to be (how large your sample must be) to find treasure of a given size (your minimum detectable effect).
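
Power can also be estimated by brute force, which is a useful cross-check on the formula. The sketch below assumes the true rates really are 3.20% and 3.55% (an assumption, since in practice the true rates are unknown), simulates many experiments of 10,000 visitors per group, and counts how often the z-test reaches significance. The seed and simulation count are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed

p_control, p_treatment = 0.032, 0.0355   # assumed TRUE conversion rates
n = 10_000                               # visitors per group
n_sims = 20_000
alpha = 0.05

# Simulate experiments in a world where the effect is real
x_c = rng.binomial(n, p_control, size=n_sims)
x_t = rng.binomial(n, p_treatment, size=n_sims)

# Two-proportion z-test on every simulated experiment
p_pool = (x_c + x_t) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (x_t / n - x_c / n) / se
p_values = 2 * stats.norm.sf(np.abs(z))

# Power = share of experiments that detect the (real) effect
simulated_power = np.mean(p_values < alpha)
print(f"Simulated power: {simulated_power:.1%}")
```

The simulated figure should land near the analytic power for n=10,000, confirming that most experiments of this size would miss the effect.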

<Warning>
  **Statistical Mistake in ML -- Underpowered Hyperparameter Searches**: This same power problem plagues ML practitioners who do hyperparameter tuning with small validation sets. You try 50 configurations, pick the "best" one, but the differences are smaller than the noise. You have essentially picked a random configuration and convinced yourself it is optimal. Use cross-validation with enough folds and enough data per fold to ensure the differences you are selecting on are real.
</Warning>

***

## Sample Size Calculation for Desired Power

```python theme={null}
def sample_size_proportion_test(p1, p2, power=0.80, alpha=0.05):
    """
    Calculate required sample size per group.
    
    p1: expected control proportion
    p2: expected treatment proportion
    power: desired power (typically 0.80)
    alpha: significance level (typically 0.05)
    """
    # Effect size
    effect = abs(p2 - p1)
    
    # Pooled proportion
    p_pool = (p1 + p2) / 2
    
    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    
    # Variance terms
    var_null = 2 * p_pool * (1 - p_pool)
    var_alt = p1 * (1 - p1) + p2 * (1 - p2)
    
    # Sample size formula
    n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2
    
    return int(np.ceil(n))

# How many visitors do we need to detect a 0.35% difference?
n_needed = sample_size_proportion_test(0.032, 0.0355)
print(f"Need {n_needed:,} visitors per group to detect 0.35% difference with 80% power")

# For a larger effect (1% improvement)
n_1pct = sample_size_proportion_test(0.032, 0.042)
print(f"Need {n_1pct:,} visitors per group to detect 1.0% difference with 80% power")
```

**Output:**

```
Need 41,789 visitors per group to detect 0.35% difference with 80% power
Need 5,593 visitors per group to detect 1.0% difference with 80% power
```
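
A practical consequence of this formula is that the required sample size grows with roughly the inverse square of the effect. The sketch below restates the same normal-approximation formula in a self-contained helper (`n_per_group` is a name chosen for this example) and compares three hypothetical absolute lifts, each double the previous one.

```python
import numpy as np
from scipy import stats

def n_per_group(p1, p2, power=0.80, alpha=0.05):
    """Visitors per group, using the same normal-approximation formula as above."""
    effect = abs(p2 - p1)
    p_pool = (p1 + p2) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    sd_null = np.sqrt(2 * p_pool * (1 - p_pool))
    sd_alt = np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return int(np.ceil(((z_alpha * sd_null + z_beta * sd_alt) / effect) ** 2))

baseline = 0.032
# Hypothetical absolute lifts: each one is double the previous
lifts = [0.0035, 0.0070, 0.0140]
required = {lift: n_per_group(baseline, baseline + lift) for lift in lifts}

for lift, n in required.items():
    print(f"Lift of {lift:.2%} -> {n:,} visitors per group")

ratio = required[0.0035] / required[0.0070]
print(f"Halving the lift multiplies n by about {ratio:.1f}")
```

The ratio is a little under 4 rather than exactly 4 because the variance terms also shift slightly as the treatment rate changes.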

***

## Common Statistical Tests

### 1. One-Sample t-Test

Is this sample mean different from a known value?

```python theme={null}
# Are our website load times different from the 3-second industry standard?
load_times = np.array([2.8, 3.2, 2.9, 3.5, 2.7, 3.1, 2.6, 3.0, 2.9, 3.3])

t_stat, p_value = stats.ttest_1samp(load_times, 3.0)
print(f"Sample mean: {np.mean(load_times):.2f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```

### 2. Two-Sample t-Test

Are the means of two groups different?

```python theme={null}
# Do users spend more time on new homepage design?
old_design_time = np.array([45, 52, 38, 61, 42, 55, 48, 50, 44, 58])
new_design_time = np.array([58, 62, 55, 70, 65, 60, 68, 72, 63, 59])

t_stat, p_value = stats.ttest_ind(old_design_time, new_design_time)
print(f"Old design mean: {np.mean(old_design_time):.1f}s")
print(f"New design mean: {np.mean(new_design_time):.1f}s")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```
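
By default, `stats.ttest_ind` assumes both groups have equal variance. When that assumption is doubtful, Welch's t-test is a common alternative, and SciPy exposes it through the `equal_var=False` argument. A minimal sketch on the same data:

```python
import numpy as np
from scipy import stats

old_design_time = np.array([45, 52, 38, 61, 42, 55, 48, 50, 44, 58])
new_design_time = np.array([58, 62, 55, 70, 65, 60, 68, 72, 63, 59])

# equal_var=False switches SciPy from Student's t-test to Welch's t-test,
# which does not assume the two groups share the same variance
t_welch, p_welch = stats.ttest_ind(old_design_time, new_design_time, equal_var=False)

print(f"Sample variances: {old_design_time.var(ddof=1):.1f} vs {new_design_time.var(ddof=1):.1f}")
print(f"Welch t-statistic: {t_welch:.3f}")
print(f"Welch p-value: {p_welch:.4f}")
```

The t-statistic is negative because the old design is passed first and has the smaller mean.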

### 3. Paired t-Test

Before/after comparisons on the same subjects:

```python theme={null}
# Does a training program improve test scores?
before = np.array([65, 72, 58, 80, 75, 62, 70, 68, 74, 78])
after = np.array([70, 78, 62, 85, 82, 68, 75, 72, 80, 82])

t_stat, p_value = stats.ttest_rel(after, before)  # after first, so a positive t means improvement
print(f"Mean improvement: {np.mean(after - before):.1f} points")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```
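
It can help to see that a paired t-test is simply a one-sample t-test on the per-subject differences. The sketch below checks this on the same scores, and also computes the statistic by hand from the mean and standard deviation of the differences.

```python
import numpy as np
from scipy import stats

before = np.array([65, 72, 58, 80, 75, 62, 70, 68, 74, 78])
after = np.array([70, 78, 62, 85, 82, 68, 75, 72, 80, 82])

# Per-subject improvement
diffs = after - before

# Paired t-test versus one-sample t-test on the differences against 0
t_paired, p_paired = stats.ttest_rel(after, before)
t_one, p_one = stats.ttest_1samp(diffs, 0.0)

# Manual formula: mean difference divided by its standard error
t_manual = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(len(diffs)))

print(f"Paired:     t = {t_paired:.3f}, p = {p_paired:.6f}")
print(f"One-sample: t = {t_one:.3f}, p = {p_one:.6f}")
print(f"Manual:     t = {t_manual:.3f}")
```

All three t-statistics agree, which is why pairing is powerful: it removes between-subject variation and tests only the within-subject change.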

### 4. ANOVA

Are three or more groups different?

```python theme={null}
# Do three different ad campaigns have different click rates?
campaign_a = np.array([2.1, 2.3, 2.0, 2.4, 2.2])
campaign_b = np.array([2.8, 3.0, 2.9, 3.1, 2.7])
campaign_c = np.array([2.3, 2.5, 2.4, 2.6, 2.2])

f_stat, p_value = stats.f_oneway(campaign_a, campaign_b, campaign_c)
print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```

***

## Complete A/B Testing Framework

```python theme={null}
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Tuple, Optional

@dataclass
class ABTestResult:
    """Results of an A/B test."""
    control_rate: float
    treatment_rate: float
    relative_lift: float
    absolute_lift: float
    z_statistic: float
    p_value: float
    confidence_interval: Tuple[float, float]
    is_significant: bool
    power: float
    
class ABTestAnalyzer:
    """
    Complete A/B testing framework with proper statistical methodology.
    """
    
    def __init__(self, alpha: float = 0.05, power_threshold: float = 0.80):
        self.alpha = alpha
        self.power_threshold = power_threshold
    
    def run_test(
        self, 
        control_successes: int, 
        control_total: int,
        treatment_successes: int, 
        treatment_total: int
    ) -> ABTestResult:
        """Run a two-proportion z-test."""
        
        # Calculate rates
        p_control = control_successes / control_total
        p_treatment = treatment_successes / treatment_total
        
        # Lifts
        absolute_lift = p_treatment - p_control
        relative_lift = (p_treatment - p_control) / p_control if p_control > 0 else 0
        
        # Pooled proportion
        p_pool = (control_successes + treatment_successes) / (control_total + treatment_total)
        
        # Standard error
        se = np.sqrt(p_pool * (1 - p_pool) * (1/control_total + 1/treatment_total))
        
        # Z-statistic
        z = absolute_lift / se if se > 0 else 0
        
        # P-value (two-tailed)
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        
        # Confidence interval for the difference
        se_diff = np.sqrt(
            p_control * (1 - p_control) / control_total +
            p_treatment * (1 - p_treatment) / treatment_total
        )
        z_crit = stats.norm.ppf(1 - self.alpha / 2)
        ci = (absolute_lift - z_crit * se_diff, absolute_lift + z_crit * se_diff)
        
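        # Power (approximate). This plugs in the OBSERVED rates, so it is a post-hoc
        # estimate that moves with the p-value; plan the sample size before the test.
        power = self._calculate_power(p_control, p_treatment, min(control_total, treatment_total))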
        
        return ABTestResult(
            control_rate=p_control,
            treatment_rate=p_treatment,
            relative_lift=relative_lift,
            absolute_lift=absolute_lift,
            z_statistic=z,
            p_value=p_value,
            confidence_interval=ci,
            is_significant=p_value < self.alpha,
            power=power
        )
    
    def _calculate_power(self, p1: float, p2: float, n: int) -> float:
        """Calculate statistical power."""
        effect = abs(p2 - p1)
        if effect == 0:
            return 0
        
        p_pool = (p1 + p2) / 2
        se_null = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
        
        z_crit = stats.norm.ppf(1 - self.alpha / 2)
        z_power = (effect - z_crit * se_null) / se_alt
        
        return stats.norm.cdf(z_power)
    
    def required_sample_size(
        self, 
        baseline_rate: float, 
        minimum_detectable_effect: float,
        power: float = 0.80
    ) -> int:
        """Calculate required sample size per group."""
        
        p1 = baseline_rate
        p2 = baseline_rate * (1 + minimum_detectable_effect)
        effect = abs(p2 - p1)
        
        p_pool = (p1 + p2) / 2
        z_alpha = stats.norm.ppf(1 - self.alpha / 2)
        z_beta = stats.norm.ppf(power)
        
        var_null = 2 * p_pool * (1 - p_pool)
        var_alt = p1 * (1 - p1) + p2 * (1 - p2)
        
        n = ((z_alpha * np.sqrt(var_null) + z_beta * np.sqrt(var_alt)) / effect) ** 2
        
        return int(np.ceil(n))
    
    def print_report(self, result: ABTestResult, test_name: str = "A/B Test"):
        """Print a formatted test report."""
        
        print("\n" + "=" * 60)
        print(f"A/B TEST REPORT: {test_name}")
        print("=" * 60)
        
        print(f"\nConversion Rates:")
        print(f"  Control:   {result.control_rate:.2%}")
        print(f"  Treatment: {result.treatment_rate:.2%}")
        
        print(f"\nLift:")
        print(f"  Absolute: {result.absolute_lift:+.2%}")
        print(f"  Relative: {result.relative_lift:+.1%}")
        
        print(f"\nStatistical Analysis:")
        print(f"  Z-statistic: {result.z_statistic:.3f}")
        print(f"  P-value: {result.p_value:.4f}")
        print(f"  95% CI for difference: ({result.confidence_interval[0]:+.2%}, {result.confidence_interval[1]:+.2%})")
        
        print(f"\nTest Quality:")
        print(f"  Power: {result.power:.1%}")
        if result.power < self.power_threshold:
            print(f"  Warning: Low power. Consider larger sample size.")
        
        print(f"\nConclusion (α = {self.alpha}):")
        if result.is_significant:
            if result.absolute_lift > 0:
                print("  SIGNIFICANT: Treatment performs BETTER than control")
            else:
                print("  SIGNIFICANT: Treatment performs WORSE than control")
        else:
            print("  NOT SIGNIFICANT: Cannot conclude a difference exists")
            if result.power < self.power_threshold:
                print("  Note: Low power means we might be missing a real effect")
        
        print("=" * 60)


# Usage example
analyzer = ABTestAnalyzer(alpha=0.05)

# Test 1: Original example (not significant)
result1 = analyzer.run_test(
    control_successes=320, control_total=10000,
    treatment_successes=355, treatment_total=10000
)
analyzer.print_report(result1, "Checkout Button Color")

# Test 2: Larger sample (now significant!)
result2 = analyzer.run_test(
    control_successes=3200, control_total=100000,
    treatment_successes=3550, treatment_total=100000
)
analyzer.print_report(result2, "Checkout Button Color (Large Sample)")

# Calculate required sample size
n_required = analyzer.required_sample_size(
    baseline_rate=0.032,
    minimum_detectable_effect=0.10  # 10% relative improvement
)
print(f"\nTo detect 10% relative improvement with 80% power:")
print(f"Need {n_required:,} visitors per group")
```

***

## Common Mistakes to Avoid

### 1. Peeking and Early Stopping

```python theme={null}
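# BAD: Checking repeatedly and stopping as soon as p < 0.05
# Every extra look is another chance of a false positive: about 14% with
# 5 looks instead of 5%, and the rate keeps climbing with more looks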

# GOOD: Pre-specify sample size and run to completion
```
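
The inflation caused by peeking is easy to demonstrate with a simulation. The sketch below uses assumed settings (a 5% conversion rate in both arms, five equally spaced looks, 2,000 new visitors per arm between looks, an arbitrary seed). Because the arms are identical, every "significant" result is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed

true_rate = 0.05   # identical in both arms: any "win" is a false positive
n_looks = 5        # number of equally spaced peeks
batch = 2_000      # new visitors per arm between peeks
n_sims = 5_000
z_crit = stats.norm.ppf(1 - 0.05 / 2)

# Purchases per batch, then running totals at each peek
x_a = rng.binomial(batch, true_rate, size=(n_sims, n_looks)).cumsum(axis=1)
x_b = rng.binomial(batch, true_rate, size=(n_sims, n_looks)).cumsum(axis=1)
n_seen = batch * np.arange(1, n_looks + 1)  # visitors per arm at each peek

# Two-proportion z-statistic at every peek of every simulated experiment
p_pool = (x_a + x_b) / (2 * n_seen)
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n_seen))
z = (x_a - x_b) / n_seen / se

significant = np.abs(z) > z_crit
fpr_peeking = significant.any(axis=1).mean()   # stop at the first "win"
fpr_single_look = significant[:, -1].mean()    # test only once, at the end

print(f"False positive rate with {n_looks} peeks: {fpr_peeking:.1%}")
print(f"False positive rate with one final look: {fpr_single_look:.1%}")
```

The single-look rate stays near 5%, while stopping at the first significant peek pushes it well above that.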

### 2. Multiple Testing Without Correction

```python theme={null}
# Testing 20 variations? Some will be "significant" by chance!

# Bonferroni correction
alpha_corrected = 0.05 / 20  # = 0.0025

# Or use False Discovery Rate (FDR) correction
```
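
To make the two corrections concrete, here is a minimal sketch that applies Bonferroni and a hand-written Benjamini-Hochberg procedure to a set of hypothetical p-values (invented for illustration). The helper name `benjamini_hochberg` is this example's own; libraries such as statsmodels offer equivalent routines.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BH procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ranks from smallest p upward
    thresholds = alpha * np.arange(1, m + 1) / m   # k/m * alpha for rank k
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank passing its threshold
        rejected[order[: k + 1]] = True            # reject everything up to that rank
    return rejected

# Hypothetical p-values from 10 variations
p_values = np.array([0.001, 0.008, 0.039, 0.041, 0.042,
                     0.060, 0.074, 0.205, 0.212, 0.216])
alpha = 0.05

uncorrected = p_values < alpha
bonferroni_rejected = p_values < alpha / p_values.size
bh_rejected = benjamini_hochberg(p_values, alpha)

print(f"Uncorrected discoveries: {uncorrected.sum()}")
print(f"Bonferroni discoveries:  {bonferroni_rejected.sum()}")
print(f"BH (FDR) discoveries:    {bh_rejected.sum()}")

# Chance of at least one false positive across 20 independent tests at alpha=0.05
family_wise_error = 1 - (1 - 0.05) ** 20
print(f"P(at least one false positive in 20 tests): {family_wise_error:.1%}")
```

Bonferroni is the stricter of the two because it controls the chance of any false positive, while Benjamini-Hochberg controls the expected share of false positives among the discoveries.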

***

## Interview Questions

<Accordion title="Question 1: A/B Test Interpretation (Google)">
  **Question**: Your A/B test shows the treatment group has 3.5% conversion vs 3.2% control. The p-value is 0.08. What do you tell stakeholders?

  <Tip>
    **Answer**:
    "At our standard α=0.05 threshold, this result is not statistically significant (p=0.08). However, there are a few considerations:

    1. **The observed difference is 0.3 percentage points** - if real, this could be meaningful at scale
    2. **The p-value of 0.08 suggests weak evidence** against the null hypothesis, not proof the treatment doesn't work
    3. **Consider power analysis** - we may have been underpowered to detect this effect size
    4. **Practical significance** - if the change is low-risk and low-cost, you might still consider implementing

    Recommendation: If resources allow, run a larger test to get more conclusive results."
  </Tip>
</Accordion>

<Accordion title="Question 2: Multiple Testing (Amazon)">
  **Question**: You test 20 different variations of a product page. Three show p-values under 0.05. What's the problem?

  <Tip>
    **Answer**: With 20 tests at α=0.05, we expect about 1 false positive even if nothing is different!

    **Expected false positives** = 20 × 0.05 = 1

    Solutions:

    1. **Bonferroni correction**: Use α = 0.05/20 = 0.0025 as threshold
    2. **Benjamini-Hochberg (FDR)**: Control the expected proportion of false discoveries
    3. **Holdout validation**: Test the "winners" on fresh data

    ```python theme={null}
    # Bonferroni
    alpha_corrected = 0.05 / 20  # 0.0025
    # Only results with p < 0.0025 are significant
    ```
  </Tip>
</Accordion>

<Accordion title="Question 3: Power and Sample Size (Facebook/Meta)">
  **Question**: An A/B test with 10,000 users per group shows no significant difference. Does this prove the treatment has no effect?

  <Tip>
    **Answer**: No! "Fail to reject null" ≠ "null is true"

    We need to consider statistical power:

    1. **What effect size could we detect?** With n=10,000, we might only detect large effects
    2. **What was our power?** If power was 50%, we had a coin flip's chance of detecting a real effect
    3. **What's the confidence interval?** With p > 0.05 the 95% CI will typically include zero, but its width shows which effect sizes the data cannot rule out

    ```python theme={null}
    # For a conversion test at 5% baseline, n=10,000/group
    # We can reliably (80% power) detect differences of roughly 0.9 percentage points
    # Smaller effects would require larger samples
    ```

    The correct conclusion: "We failed to find evidence of an effect of size > X"
  </Tip>
</Accordion>

<Accordion title="Question 4: Early Stopping (Tech Companies)">
  **Question**: You're running an A/B test. After 2 days (of planned 7), you peek at results and see p=0.03. Should you stop the test?

  <Tip>
    **Answer**: No! This is called "p-hacking" or "optional stopping" and inflates false positive rates.

    The p-value assumes you only look once at the end. If you peek repeatedly:

    * With 5 equally spaced peeks, your actual false positive rate is \~14%, not 5%
    * With 10 peeks, it's \~19%

    Proper approaches:

    1. **Sequential testing** with adjusted thresholds (O'Brien-Fleming, Pocock)
    2. **Bayesian methods** that allow continuous monitoring
    3. **Pre-commit** to analysis plan and stick to it

    ```python theme={null}
    # Approximate O'Brien-Fleming thresholds for 5 equally spaced looks (4 interim + final):
    # Look 1: α = 0.00001
    # Look 2: α = 0.001
    # Look 3: α = 0.01
    # Look 4: α = 0.02
    # Look 5: α = 0.04
    ```
  </Tip>
</Accordion>

***

## Practice Challenge

<Accordion title="Challenge: Build a Complete A/B Testing Framework">
  Create a production-ready A/B test analysis tool:

  ```python theme={null}
  import numpy as np
  from scipy import stats
  from dataclasses import dataclass
  from typing import Optional, Tuple

  @dataclass
  class ABTestResult:
      control_rate: float
      treatment_rate: float
      relative_lift: float
      absolute_lift: float
      z_statistic: float
      p_value: float
      confidence_interval: Tuple[float, float]
      power: float
      is_significant: bool
      recommendation: str

  class ProductionABTest:
      """
      Production-ready A/B test analyzer with:
      - Power analysis
      - Effect size estimation
      - Confidence intervals
      - Clear recommendations
      """
      
      def __init__(self, alpha: float = 0.05, min_detectable_effect: float = 0.1):
          self.alpha = alpha
          self.mde = min_detectable_effect
      
      def analyze(
          self,
          control_conversions: int,
          control_visitors: int,
          treatment_conversions: int,
          treatment_visitors: int,
          test_name: str = "A/B Test"
      ) -> ABTestResult:
          """Analyze A/B test results."""
          # Your implementation here
          pass
      
      def recommend_sample_size(
          self,
          baseline_rate: float,
          min_detectable_effect: float,
          power: float = 0.8
      ) -> int:
          """Calculate required sample size per group."""
          # Your implementation here
          pass
      
      def generate_report(self, result: ABTestResult) -> str:
          """Generate human-readable report."""
          # Your implementation here
          pass

  # Test your implementation:
  test = ProductionABTest()

  # Scenario 1: Clear winner
  result1 = test.analyze(
      control_conversions=500, control_visitors=10000,
      treatment_conversions=600, treatment_visitors=10000
  )

  # Scenario 2: Inconclusive
  result2 = test.analyze(
      control_conversions=510, control_visitors=10000,
      treatment_conversions=530, treatment_visitors=10000
  )

  # Scenario 3: Treatment is worse
  result3 = test.analyze(
      control_conversions=500, control_visitors=10000,
      treatment_conversions=420, treatment_visitors=10000
  )
  ```

  **Full Solution**:

  ```python theme={null}
  class ProductionABTest:
      def __init__(self, alpha: float = 0.05, min_detectable_effect: float = 0.1):
          self.alpha = alpha
          self.mde = min_detectable_effect
      
      def analyze(
          self,
          control_conversions: int,
          control_visitors: int,
          treatment_conversions: int,
          treatment_visitors: int,
          test_name: str = "A/B Test"
      ) -> ABTestResult:
          # Calculate rates
          p_c = control_conversions / control_visitors
          p_t = treatment_conversions / treatment_visitors
          
          # Effect sizes
          absolute_lift = p_t - p_c
          relative_lift = (p_t - p_c) / p_c if p_c > 0 else 0
          
          # Pooled proportion and standard error
          p_pool = (control_conversions + treatment_conversions) / \
                   (control_visitors + treatment_visitors)
          se = np.sqrt(p_pool * (1 - p_pool) * 
                       (1/control_visitors + 1/treatment_visitors))
          
          # Z-test
          z = absolute_lift / se if se > 0 else 0
          p_value = 2 * (1 - stats.norm.cdf(abs(z)))
          
          # Confidence interval for difference
          se_diff = np.sqrt(
              p_c * (1 - p_c) / control_visitors +
              p_t * (1 - p_t) / treatment_visitors
          )
          z_crit = stats.norm.ppf(1 - self.alpha/2)
          ci = (absolute_lift - z_crit * se_diff,
                absolute_lift + z_crit * se_diff)
          
          # Power calculation (approximate, based on the observed rates)
          power = self._calculate_power(
              control_visitors, treatment_visitors,
              p_c, p_t
          )
          
          # Significance check
          is_significant = p_value < self.alpha
          
          # Generate recommendation
          recommendation = self._generate_recommendation(
              p_c, p_t, p_value, power, is_significant
          )
          
          return ABTestResult(
              control_rate=p_c,
              treatment_rate=p_t,
              relative_lift=relative_lift,
              absolute_lift=absolute_lift,
              z_statistic=z,
              p_value=p_value,
              confidence_interval=ci,
              power=power,
              is_significant=is_significant,
              recommendation=recommendation
          )
      
      def _calculate_power(self, n1, n2, p1, p2):
          """Calculate achieved power."""
          effect = abs(p2 - p1)
          pooled_p = (p1 + p2) / 2
          se = np.sqrt(pooled_p * (1 - pooled_p) * (1/n1 + 1/n2))
          z_crit = stats.norm.ppf(1 - self.alpha/2)
          z_power = (effect / se) - z_crit
          return stats.norm.cdf(z_power)
      
      def _generate_recommendation(self, p_c, p_t, p_value, power, sig):
          if sig and p_t > p_c:
              return "SHIP IT: Treatment significantly outperforms control"
          elif sig and p_t < p_c:
              return "STOP: Treatment significantly underperforms control"
          elif not sig and power < 0.5:
              return "INCONCLUSIVE: Test underpowered, consider running longer"
          elif not sig and power >= 0.8:
              return "NO EFFECT: High-powered test found no significant difference"
          else:
              return "BORDERLINE: Consider practical significance and run longer"
      
      def recommend_sample_size(self, baseline_rate, mde, power=0.8):
          target_rate = baseline_rate * (1 + mde)
          effect = target_rate - baseline_rate
          pooled_p = (baseline_rate + target_rate) / 2
          
          z_alpha = stats.norm.ppf(1 - self.alpha/2)
          z_beta = stats.norm.ppf(power)
          
          n = 2 * pooled_p * (1 - pooled_p) * ((z_alpha + z_beta) / effect) ** 2
          return int(np.ceil(n))
  ```
</Accordion>
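
As a quick sanity check on the two formulas in the framework, the sketch below re-implements `_calculate_power` and `recommend_sample_size` as standalone functions and confirms that they are inverses of each other: feeding the recommended sample size back into the power formula should return approximately the target power. The function names `required_n` and `achieved_power` are illustrative and not part of the class. The power formula ignores the far tail, so treat it as an approximation.

```python
import numpy as np
from scipy import stats

def required_n(baseline_rate, mde, alpha=0.05, power=0.8):
    """Per-group sample size; same formula as recommend_sample_size."""
    target_rate = baseline_rate * (1 + mde)
    effect = target_rate - baseline_rate
    pooled_p = (baseline_rate + target_rate) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = 2 * pooled_p * (1 - pooled_p) * ((z_alpha + z_beta) / effect) ** 2
    return int(np.ceil(n))

def achieved_power(n1, n2, p1, p2, alpha=0.05):
    """Approximate two-sided power; same formula as _calculate_power."""
    effect = abs(p2 - p1)
    pooled_p = (p1 + p2) / 2
    se = np.sqrt(pooled_p * (1 - pooled_p) * (1 / n1 + 1 / n2))
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.cdf(effect / se - z_crit)

baseline, mde = 0.03, 0.10  # 3% baseline, 10% relative lift
n = required_n(baseline, mde)
round_trip_power = achieved_power(n, n, baseline, baseline * (1 + mde))

print(f"Required n per group: {n:,}")
print(f"Power at that n: {round_trip_power:.3f}")
```

If the two numbers disagree noticeably, one of the formulas has drifted from the other, which is worth catching before trusting a recommendation.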

***

## 📝 Practice Exercises

<CardGroup cols={2}>
  <Card title="Exercise 1" icon="vial" color="#3B82F6">
    Conduct a one-sample hypothesis test
  </Card>

  <Card title="Exercise 2" icon="code-compare" color="#10B981">
    Analyze an A/B test for conversion rates
  </Card>

  <Card title="Exercise 3" icon="calculator" color="#8B5CF6">
    Calculate statistical power and sample size
  </Card>

  <Card title="Exercise 4" icon="flask" color="#F59E0B">
    Real-world: Drug trial effectiveness testing
  </Card>
</CardGroup>

<details>
  <summary>**Exercise 1: Website Load Time Test** - One-sample t-test</summary>

  **Problem**: Your website claims an average load time of 2.0 seconds. A sample of 30 page loads shows:

  * Sample mean: 2.3 seconds
  * Sample std dev: 0.6 seconds

  1. State the null and alternative hypotheses
  2. Calculate the t-statistic
  3. Find the p-value (two-tailed)
  4. At α = 0.05, do you reject the claim?

  **Solution**:

  ```python theme={null}
  import numpy as np
  from scipy import stats

  # Sample data
  x_bar = 2.3  # sample mean
  s = 0.6      # sample std dev
  n = 30       # sample size
  mu_0 = 2.0   # claimed value

  # 1. Hypotheses
  print("H₀: μ = 2.0 seconds (load time equals claim)")
  print("H₁: μ ≠ 2.0 seconds (load time differs from claim)")

  # 2. Calculate t-statistic
  se = s / np.sqrt(n)  # standard error
  t_stat = (x_bar - mu_0) / se

  print(f"\nStandard Error: {se:.4f}")
  print(f"t-statistic: {t_stat:.4f}")

  # 3. P-value (two-tailed)
  df = n - 1
  p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))

  print(f"Degrees of freedom: {df}")
  print(f"P-value (two-tailed): {p_value:.4f}")

  # 4. Decision at α = 0.05
  alpha = 0.05
  critical_t = stats.t.ppf(1 - alpha/2, df)

  print(f"\n--- Decision at α = {alpha} ---")
  print(f"Critical t-value: ±{critical_t:.4f}")

  if p_value < alpha:
      print(f"P-value ({p_value:.4f}) < α ({alpha})")
      print("✗ REJECT H₀: Website is NOT meeting the 2.0s claim")
  else:
      print(f"P-value ({p_value:.4f}) ≥ α ({alpha})")
      print("✓ FAIL TO REJECT H₀: No evidence against the claim")

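  # With raw observations (not just summary stats), SciPy runs the same test directly:
  #   stats.ttest_1samp(raw_load_times, popmean=mu_0)   # raw_load_times is hypothetical
  direction = "higher" if x_bar > mu_0 else "lower"
  if p_value < alpha:
      print(f"\nConclusion: Load time ({x_bar}s) is significantly {direction} than claimed ({mu_0}s)")
  else:
      print(f"\nConclusion: No significant difference from the claimed {mu_0}s")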
  ```
</details>
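
Exercise 1 works from summary statistics only. When raw observations are available, `scipy.stats.ttest_1samp` performs the same test in one call. The sketch below uses simulated load times (hypothetical data, generated only for illustration) and checks that the manual calculation and SciPy agree.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical raw data: 30 simulated page load times in seconds
load_times = rng.normal(loc=2.3, scale=0.6, size=30)
mu_0 = 2.0  # claimed average load time

# Manual calculation, following the same steps as the exercise
n = len(load_times)
se = load_times.std(ddof=1) / np.sqrt(n)  # ddof=1 gives the sample std dev
t_manual = (load_times.mean() - mu_0) / se
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's one-sample t-test on the raw observations (two-sided by default)
result = stats.ttest_1samp(load_times, popmean=mu_0)
t_scipy, p_scipy = result.statistic, result.pvalue

print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```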

<details>
  <summary>**Exercise 2: A/B Test Analysis** - Two-proportion z-test</summary>

  **Problem**: You're testing a new checkout flow:

  * Control: 5,000 visitors, 200 purchases (4.0%)
  * Treatment: 5,000 visitors, 240 purchases (4.8%)

  1. Is the 0.8 percentage point improvement statistically significant at α = 0.05?
  2. Calculate the 95% confidence interval for the difference
  3. What's the relative lift?
  4. Should you ship the new checkout flow?

  **Solution**:

  ```python theme={null}
  import numpy as np
  from scipy import stats

  # Data
  n_c, x_c = 5000, 200  # Control
  n_t, x_t = 5000, 240  # Treatment

  p_c = x_c / n_c  # 4.0%
  p_t = x_t / n_t  # 4.8%
  diff = p_t - p_c  # 0.8 percentage points

  print(f"Control rate: {p_c:.2%}")
  print(f"Treatment rate: {p_t:.2%}")
  print(f"Absolute difference: {diff:.2%}")

  # 1. Two-proportion z-test
  # Pooled proportion under null hypothesis
  p_pool = (x_c + x_t) / (n_c + n_t)
  se_pool = np.sqrt(p_pool * (1 - p_pool) * (1/n_c + 1/n_t))

  z_stat = diff / se_pool
  p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

  print(f"\nPooled proportion: {p_pool:.4f}")
  print(f"Z-statistic: {z_stat:.3f}")
  print(f"P-value: {p_value:.4f}")

  alpha = 0.05
  if p_value < alpha:
      print(f"\n✓ Significant at α = {alpha}")
  else:
      print(f"\n✗ NOT significant at α = {alpha}")

  # 2. 95% CI for difference
  # Use unpooled SE for CI
  se_unpooled = np.sqrt(p_c*(1-p_c)/n_c + p_t*(1-p_t)/n_t)
  z_crit = stats.norm.ppf(0.975)
  moe = z_crit * se_unpooled

  ci_lower = diff - moe
  ci_upper = diff + moe

  print(f"\n95% CI for difference: ({ci_lower:.2%}, {ci_upper:.2%})")

  # 3. Relative lift
  relative_lift = (p_t - p_c) / p_c
  print(f"\nRelative lift: {relative_lift:.1%}")  # 20%

  # 4. Business decision
  print("\n--- Business Decision ---")
  if p_value < alpha and ci_lower > 0:
      print("SHIP IT: Statistically significant improvement!")
      # Calculate expected impact
      monthly_visitors = 100000
      additional_conversions = monthly_visitors * diff
      print(f"Expected additional conversions/month: {additional_conversions:.0f}")
  else:
      print("HOLD: Need more data or re-evaluate the change")
  ```
</details>
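
For two proportions, the pooled z-test from Exercise 2 and Pearson's chi-square test on the 2x2 table are the same test in different clothing: without a continuity correction, the chi-square statistic equals z squared and the p-values match. The sketch below checks this on the Exercise 2 data using `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy import stats

# Exercise 2 data
n_c, x_c = 5000, 200  # Control
n_t, x_t = 5000, 240  # Treatment

# Pooled two-proportion z-test
p_pool = (x_c + x_t) / (n_c + n_t)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z_stat = (x_t / n_t - x_c / n_c) / se_pool
p_z = 2 * stats.norm.sf(abs(z_stat))

# Chi-square test on the 2x2 table of (purchased, did not purchase)
table = np.array([[x_c, n_c - x_c],
                  [x_t, n_t - x_t]])
# correction=False disables Yates' continuity correction so the results match the z-test
chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)

print(f"z^2 = {z_stat**2:.4f}, chi2 = {chi2:.4f}")
print(f"p (z-test) = {p_z:.4f}, p (chi-square) = {p_chi2:.4f}")
```

Note that SciPy applies the continuity correction by default for 2x2 tables, which yields a slightly larger p-value than the z-test.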

<details>
  <summary>**Exercise 3: Power Analysis** - Sample size calculation</summary>

  **Problem**: You're planning an A/B test. Your current conversion rate is 3%, and you want to detect a 10% relative lift (3.0% → 3.3%).

  1. What sample size is needed per group for 80% power at α = 0.05?
  2. What about 90% power?
  3. If you only have 10,000 users, what's the minimum detectable effect?
  4. How long will the test take at 1,000 visitors/day?

  **Solution**:

  ```python theme={null}
  import numpy as np
  from scipy import stats

  def sample_size_ab(p1, p2, alpha=0.05, power=0.80):
      """Calculate sample size per group for A/B test."""
      z_alpha = stats.norm.ppf(1 - alpha/2)
      z_beta = stats.norm.ppf(power)
      
      # Pooled variance estimate
      p_pool = (p1 + p2) / 2
      effect = abs(p2 - p1)
      
      # Sample size per group
      n = 2 * p_pool * (1 - p_pool) * ((z_alpha + z_beta) / effect) ** 2
      return int(np.ceil(n))

  def minimum_detectable_effect(n, p1, alpha=0.05, power=0.80):
      """Calculate MDE given sample size."""
      z_alpha = stats.norm.ppf(1 - alpha/2)
      z_beta = stats.norm.ppf(power)
      
      se = np.sqrt(2 * p1 * (1 - p1) / n)
      mde = (z_alpha + z_beta) * se
      return mde

  # Given parameters
  p_baseline = 0.03   # 3%
  relative_lift = 0.10  # 10%
  p_target = p_baseline * (1 + relative_lift)  # 3.3%

  print(f"Baseline rate: {p_baseline:.1%}")
  print(f"Target rate: {p_target:.1%}")
  print(f"Absolute difference: {p_target - p_baseline:.2%}")

  # 1. Sample size for 80% power
  n_80 = sample_size_ab(p_baseline, p_target, power=0.80)
  print(f"\n1. Sample size for 80% power: {n_80:,} per group")
  print(f"   Total users needed: {2*n_80:,}")

  # 2. Sample size for 90% power
  n_90 = sample_size_ab(p_baseline, p_target, power=0.90)
  print(f"\n2. Sample size for 90% power: {n_90:,} per group")
  print(f"   Increase from 80%: {(n_90-n_80)/n_80:.0%}")

  # 3. MDE with 10,000 users
  n_available = 5000  # per group
  mde = minimum_detectable_effect(n_available, p_baseline)
  relative_mde = mde / p_baseline

  print(f"\n3. With 10,000 total users (5,000 per group):")
  print(f"   Minimum Detectable Effect: {mde:.2%} absolute")
  print(f"   Relative MDE: {relative_mde:.1%}")

  # 4. Test duration
  visitors_per_day = 1000
  days_needed_80 = (2 * n_80) / visitors_per_day
  days_needed_90 = (2 * n_90) / visitors_per_day

  print(f"\n4. Test duration at {visitors_per_day:,} visitors/day:")
  print(f"   For 80% power: {days_needed_80:.0f} days ({days_needed_80/7:.1f} weeks)")
  print(f"   For 90% power: {days_needed_90:.0f} days ({days_needed_90/7:.1f} weeks)")
  ```
</details>

<details>
  <summary>**Exercise 4: Drug Trial Analysis** - Real-world hypothesis testing</summary>

  **Problem**: A clinical trial tests a new blood pressure medication:

  * Control (placebo): n=150, mean BP reduction = 2 mmHg, std = 8 mmHg
  * Treatment (drug): n=150, mean BP reduction = 6 mmHg, std = 10 mmHg

  1. Conduct a two-sample t-test
  2. Calculate Cohen's d (effect size)
  3. Is this clinically significant, not just statistically significant?
  4. What are Type I and Type II error implications in this context?

  **Solution**:

  ```python theme={null}
  import numpy as np
  from scipy import stats

  # Control group
  n_c = 150
  mean_c = 2  # mmHg reduction
  std_c = 8

  # Treatment group
  n_t = 150
  mean_t = 6  # mmHg reduction
  std_t = 10

  # 1. Two-sample t-test
  # Pooled standard error (assuming equal variance)
  sp = np.sqrt(((n_c-1)*std_c**2 + (n_t-1)*std_t**2) / (n_c + n_t - 2))
  se = sp * np.sqrt(1/n_c + 1/n_t)

  t_stat = (mean_t - mean_c) / se
  df = n_c + n_t - 2
  p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))

  print("=== Two-Sample T-Test ===")
  print(f"Control: {mean_c} mmHg reduction (std={std_c})")
  print(f"Treatment: {mean_t} mmHg reduction (std={std_t})")
  print(f"\nPooled std: {sp:.2f}")
  print(f"t-statistic: {t_stat:.3f}")
  print(f"P-value: {p_value:.6f}")

  # Decision
  alpha = 0.05
  if p_value < alpha:
      print(f"\n✓ Statistically significant (p < {alpha})")
  else:
      print(f"\n✗ Not statistically significant")

  # 2. Cohen's d (effect size)
  # d = (mean_t - mean_c) / pooled_std
  cohens_d = (mean_t - mean_c) / sp
  print(f"\n=== Effect Size ===")
  print(f"Cohen's d: {cohens_d:.3f}")

  # Interpret effect size
  if abs(cohens_d) < 0.2:
      effect_label = "negligible"
  elif abs(cohens_d) < 0.5:
      effect_label = "small"
  elif abs(cohens_d) < 0.8:
      effect_label = "medium"
  else:
      effect_label = "large"
  print(f"Interpretation: {effect_label} effect")

  # 3. Clinical significance
  print("\n=== Clinical Significance ===")
  mean_diff = mean_t - mean_c
  print(f"Mean difference: {mean_diff} mmHg")

  # Illustrative assumption: treat a reduction of 5+ mmHg as clinically meaningful
  clinical_threshold = 5
  if mean_diff >= clinical_threshold:
      print(f"✓ Clinically significant (≥{clinical_threshold} mmHg)")
  else:
      print(f"✗ May not be clinically meaningful (<{clinical_threshold} mmHg)")

  # 95% CI for difference
  ci_lower = mean_diff - stats.t.ppf(0.975, df) * se
  ci_upper = mean_diff + stats.t.ppf(0.975, df) * se
  print(f"95% CI: ({ci_lower:.1f}, {ci_upper:.1f}) mmHg")

  # 4. Error implications
  print("\n=== Error Implications ===")
  print("Type I Error (False Positive):")
  print("  - Approving an ineffective drug")
  print("  - Patients take unnecessary medication with side effects")
  print("  - Cost burden on healthcare system")
  print("\nType II Error (False Negative):")
  print("  - Rejecting an effective drug")
  print("  - Patients denied beneficial treatment")
  print("  - Continued suffering from high blood pressure")
  print("\nIn drug trials, Type I error is often considered worse")
  print("(hence the stringent α = 0.05 or even 0.01 threshold)")
  ```
</details>
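
Exercise 4 pools the variances even though the two groups have different standard deviations (8 vs 10 mmHg). A reasonable cross-check is Welch's t-test, which drops the equal-variance assumption. The sketch below runs both versions from the summary statistics using `scipy.stats.ttest_ind_from_stats`. With equal group sizes the t-statistic is identical and only the degrees of freedom change.

```python
from scipy import stats

# Summary statistics from Exercise 4
mean_t, std_t, n_t = 6, 10, 150  # Treatment (drug)
mean_c, std_c, n_c = 2, 8, 150   # Control (placebo)

# Student's t-test: assumes equal variances (matches the exercise solution)
pooled = stats.ttest_ind_from_stats(mean_t, std_t, n_t,
                                    mean_c, std_c, n_c, equal_var=True)

# Welch's t-test: no equal-variance assumption
welch = stats.ttest_ind_from_stats(mean_t, std_t, n_t,
                                   mean_c, std_c, n_c, equal_var=False)

print(f"Pooled: t = {pooled.statistic:.3f}, p = {pooled.pvalue:.6f}")
print(f"Welch:  t = {welch.statistic:.3f}, p = {welch.pvalue:.6f}")
```

When group sizes differ and the variances are unequal, the two tests can disagree more, and Welch's version is usually the safer default.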

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="The Framework" icon="scale-balanced">
    * Null hypothesis: no effect (innocent)
    * Alternative: there is an effect
    * P-value: how surprising is the data?
    * Decision threshold: typically α = 0.05
  </Card>

  <Card title="Types of Errors" icon="triangle-exclamation">
    * Type I (α): False positive, claiming effect that doesn't exist
    * Type II (β): False negative, missing real effect
    * Power = 1 - β: Ability to detect real effects
  </Card>

  <Card title="Sample Size Matters" icon="users">
    * Small samples = low power = missed effects
    * Calculate sample size BEFORE running test
    * Halving the detectable effect requires roughly 4x the data (n scales with 1/effect²)
  </Card>

  <Card title="Test Selection" icon="list-check">
    * Two proportions: Chi-square or z-test
    * Two means: t-test
    * Multiple groups: ANOVA
    * Non-normal: Mann-Whitney U
  </Card>
</CardGroup>
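
To make the sample size point concrete: the required sample size grows with the inverse square of the effect you want to detect. The sketch below reuses the sample size formula from Exercise 3 with a 3% baseline (an illustrative choice) and compares the requirement for 20%, 10%, and 5% relative lifts.

```python
import numpy as np
from scipy import stats

def sample_size_ab(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-proportion test (formula from Exercise 3)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p_pool = (p1 + p2) / 2
    effect = abs(p2 - p1)
    n = 2 * p_pool * (1 - p_pool) * ((z_alpha + z_beta) / effect) ** 2
    return int(np.ceil(n))

baseline = 0.03  # illustrative baseline conversion rate
sizes = {}
for lift in (0.20, 0.10, 0.05):
    sizes[lift] = sample_size_ab(baseline, baseline * (1 + lift))
    print(f"Relative lift {lift:.0%}: {sizes[lift]:,} per group")

# Each halving of the lift multiplies the requirement by roughly 4
ratio_20_to_10 = sizes[0.10] / sizes[0.20]
ratio_10_to_5 = sizes[0.05] / sizes[0.10]
print(f"10% vs 20% lift: {ratio_20_to_10:.2f}x the data")
print(f"5% vs 10% lift:  {ratio_10_to_5:.2f}x the data")
```

The ratios are not exactly 4 because the pooled variance term shifts slightly as the target rate changes.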

***

## Common Pitfalls

<Warning>
  **A/B Testing Mistakes to Avoid**:

  1. **Peeking & Early Stopping** - Checking daily inflates false positives; use sequential testing methods instead
  2. **Underpowered Tests** - Running with too few samples misses real effects; calculate sample size first
  3. **Multiple Comparisons** - Testing 20 variants at α = 0.05 without correction gives roughly a 64% chance of at least one false positive (assuming independent tests)
  4. **Ignoring Practical Significance** - A p \< 0.05 with 0.01% improvement is rarely worth shipping
  5. **One-Tailed When Uncertain** - Only use one-tailed tests when an effect in the opposite direction genuinely would not change your decision
  6. **P-value Misinterpretation** - P-value is NOT the probability the null is true!
</Warning>
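
The peeking problem is easy to demonstrate by simulation. The sketch below runs A/A tests, where both groups share the same true conversion rate, so every "significant" result is a false positive. It compares analyzing once at the end against checking every day and stopping at the first p below 0.05. All settings (14 days, 500 visitors per group per day, 3% conversion) are hypothetical choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_days, daily_n, p = 5000, 14, 500, 0.03  # hypothetical A/A test settings

# Cumulative conversions per day; both groups have the same true rate
conv_c = rng.binomial(daily_n, p, size=(n_sims, n_days)).cumsum(axis=1)
conv_t = rng.binomial(daily_n, p, size=(n_sims, n_days)).cumsum(axis=1)
n_cum = daily_n * np.arange(1, n_days + 1)  # cumulative visitors per group

# Pooled two-proportion z-test evaluated at the end of every day
p_pool = (conv_c + conv_t) / (2 * n_cum)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_cum)
z = (conv_t - conv_c) / n_cum / se
p_values = 2 * stats.norm.sf(np.abs(z))

fpr_final_only = np.mean(p_values[:, -1] < 0.05)      # look once, on day 14
fpr_peeking = np.mean((p_values < 0.05).any(axis=1))  # stop at first "significant" day

print(f"False positive rate, single look: {fpr_final_only:.3f}")
print(f"False positive rate, daily peeks: {fpr_peeking:.3f}")
```

The single-look rate should sit near the nominal 5%, while the daily-peek rate typically lands several times higher.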

***

## Connection to Machine Learning

| Hypothesis Testing          | ML Application                               |
| --------------------------- | -------------------------------------------- |
| A/B testing                 | Model comparison, feature evaluation         |
| Power analysis              | Training set size planning                   |
| Multiple testing correction | Hyperparameter search, feature selection     |
| Type I/II errors            | Precision/Recall tradeoff                    |
| Significance testing        | Statistical validation of model improvements |

<Tip>
  **ML Connection**: Every time you compare "Model A accuracy = 0.92 vs Model B accuracy = 0.89", you're implicitly doing hypothesis testing. The question is: is this 3-point difference real or just noise from your test set? Proper statistical testing (like paired t-tests on cross-validation folds) gives you the answer.
</Tip>
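
A minimal sketch of the paired t-test mentioned in the tip, using `scipy.stats.ttest_rel`. The per-fold accuracies are made-up numbers for illustration, and both models are assumed to have been scored on the same 10 cross-validation splits so the scores can be paired.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies from the same 10 cross-validation splits
model_a = np.array([0.92, 0.91, 0.93, 0.90, 0.92, 0.94, 0.91, 0.92, 0.93, 0.92])
model_b = np.array([0.89, 0.90, 0.91, 0.88, 0.90, 0.90, 0.89, 0.91, 0.90, 0.89])

# Paired t-test: tests whether the mean per-fold difference is zero
result = stats.ttest_rel(model_a, model_b)

# Equivalent manual calculation on the per-fold differences
diffs = model_a - model_b
t_manual = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(len(diffs)))

print(f"Mean difference: {diffs.mean():.3f}")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.6f}")
```

One caveat: cross-validation folds share training data, so the fold scores are not fully independent and the resulting p-value tends to be optimistic. Treat it as approximate evidence, not a precise error rate.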

**Coming up next**: We'll learn about **Correlation and Regression** - how to understand relationships between variables and make predictions. This is where statistics directly becomes machine learning.

<Card title="Next: Correlation and Regression" icon="arrow-right" href="/courses/statistics-for-ml/07-regression">
  Understand relationships and make predictions
</Card>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="Your A/B test has p=0.04. Product wants to ship. Engineering wants more data. Who is right?">
    **Strong Answer:**

    * Neither is automatically right -- the answer depends on context that a p-value alone does not provide. A p=0.04 means the result is technically significant at alpha=0.05, but there are several things I would check before making a recommendation.
    * First, what is the effect size? If the new variant improved conversion by 0.02 percentage points (from 3.00% to 3.02%), the result may be statistically significant with a large enough sample but practically meaningless. Shipping a code change, increasing technical debt, and potentially confusing users for a 0.02pp improvement is not worth it. I would compute the expected annual revenue impact and compare it to the implementation cost.
    * Second, what is the power of the test? If we planned for 80% power to detect a 10% relative lift but only ran enough traffic for 50% power, then the measured lift behind p=0.04 is probably an overestimate (the "winner's curse" -- significant results from underpowered tests tend to overestimate effect sizes).
    * Third, did anyone peek at the data before the test concluded? If the team checked results daily, the effective alpha is much higher than 0.05 due to multiple comparisons, and p=0.04 may not actually be significant under the true (inflated) alpha.
    * My recommendation: if the effect size is meaningful, the test was pre-registered, and nobody peeked, ship it. If any of those conditions are not met, run a confirmation test or extend the current one.

    **Follow-up: Walk me through the "winner's curse" in A/B testing. How does it inflate effect size estimates?**

    The winner's curse occurs because we only ship results that cross the significance threshold. Imagine the true effect is a 2% lift. Due to sampling noise, your measured lift will be randomly higher or lower than 2%. If you only ship when p is less than 0.05, you are selecting for experiments where random noise happened to push the measured lift above 2%. The shipped estimate is biased upward -- it is the true effect plus a positive noise component. In underpowered tests, this bias is especially large because you need a bigger "lucky" noise realization to cross the significance threshold. The practical consequence: the business case you built on "5% measured lift" might only deliver 2% lift in production, because the extra 3% was noise that happened to be in your favor. The fix is to either run adequately powered tests (where the noise is small relative to the true effect) or use shrinkage estimators that adjust for the selection bias.
  </Accordion>

  <Accordion title="Explain the difference between statistical significance and practical significance with a concrete example.">
    **Strong Answer:**

    * Statistical significance means the observed difference is unlikely to have occurred by chance alone. Practical significance means the difference is large enough to actually matter for the business.
    * Concrete example: an e-commerce company runs an A/B test on 2 million users and finds the new homepage increases average order value from $47.20 to $47.35. With that sample size, the p-value is 0.001 -- highly statistically significant. But the actual improvement is $0.15 per order, which translates to maybe $150K annually. If the homepage redesign cost \$500K in engineering time and introduced new technical debt, the statistically significant result is practically worthless.
    * Conversely, a startup tests a new pricing page on 500 users and sees a conversion lift from 3% to 5%. The p-value is 0.08 -- not statistically significant at alpha=0.05. But the 67% relative lift, if real, would increase the company's revenue by roughly two-thirds. The practical significance is enormous; the test was just underpowered. The right action is to run longer, not to conclude "no effect."
    * The way I think about it: p-values tell you whether the signal is distinguishable from noise. Effect size and business context tell you whether the signal matters. You need both.

    **Follow-up: How would you set the minimum detectable effect for an A/B test before running it?**

    I work backwards from the business case. First, I ask: "What is the smallest improvement that would justify the cost of implementing this change?" If the change requires 2 weeks of engineering time ($30K cost), and we get 1 million orders per year at $50 average, then a $0.03 improvement per order generates $30K annually -- barely break-even. So the MDE should be at least $0.05-$0.10 per order to provide a comfortable return. Then I compute the sample size needed to detect that MDE with 80% power. If the required traffic exceeds what we can collect in a reasonable timeframe (say 4 weeks), I would either accept a larger MDE, relax alpha from 0.05 to 0.10, or reconsider whether the test is worth running at all. This upfront planning prevents the common trap of running underpowered tests that waste weeks of traffic and produce inconclusive results.
  </Accordion>

  <Accordion title="What is p-hacking, and how would you design an experimentation platform to prevent it?">
    **Strong Answer:**

    * P-hacking is the practice of manipulating data analysis to find statistically significant results. Common forms include: checking results daily and stopping when p drops below 0.05, testing multiple metrics and only reporting the one that is significant, segmenting data after the fact to find a subgroup where the effect is significant, or adding and removing covariates until significance appears.
    * Each of these inflates the false positive rate well beyond the nominal 5%. A team that checks daily for 14 days and stops at the first p below 0.05 has given itself 14 chances to cross the threshold, pushing the real false positive rate to roughly 20-25%. A team that tests 20 independent metrics has about a 64% chance of finding at least one "significant" result by chance alone.
    * To prevent it at the platform level, I would design the system with these guardrails: (1) Pre-registration: require teams to specify the primary metric, sample size, and analysis plan before the test launches. Lock these parameters. (2) Sequential testing: use methods like always-valid p-values or group sequential designs that allow continuous monitoring without inflating the error rate. (3) Automated correction: when multiple metrics are tracked, automatically apply Benjamini-Hochberg correction and highlight the distinction between primary and exploratory metrics. (4) Mandatory effect size reporting: always show the confidence interval for the effect size alongside the p-value. (5) Cool-off period: require a minimum test duration covering at least one full business cycle before results can be acted upon.
    * The cultural piece is equally important: incentivize teams for running well-designed experiments regardless of outcome, not just for finding "winners."

    **Follow-up: How do sequential testing methods (like always-valid p-values) allow peeking without inflating false positives?**

    Traditional p-values assume you look at the data exactly once. Sequential testing methods use a different mathematical framework -- typically based on martingale theory or spending functions -- that accounts for continuous monitoring. The always-valid p-value is constructed so that it maintains its Type I error guarantee no matter how many times you look at the data. The tradeoff is that at any fixed sample size, the sequential method requires slightly more evidence to declare significance compared to a fixed-sample test. Think of it as paying an "insurance premium" for the right to peek continuously. In practice, the cost is modest (roughly 20-30% more traffic) and the benefit is enormous: teams can monitor experiments in real-time, stop harmful experiments early, and make shipping decisions whenever the evidence is clear, rather than waiting for a pre-specified end date.
  </Accordion>

  <Accordion title="You are running 10 A/B tests simultaneously. Three come back significant at p less than 0.05. How many of those are likely real?">
    **Strong Answer:**

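    * Under the global null hypothesis (no effect for any test), the expected number of false positives from 10 tests at alpha=0.05 is 0.5, and the chance of seeing 3 or more is only about 1% (assuming independent tests). So 3 "significant" results are more than chance alone would typically produce, which suggests at least some are real -- but the p-values alone cannot tell us which ones, and any individual result could still be a false positive.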
    * However, it is unrealistic to assume all 10 nulls are true. If you are testing reasonable product changes, maybe 3-4 of them actually have real effects. In that case, 3 significant results might include 2-3 real effects and 0-1 false positive.
    * The standard corrections are Bonferroni (divide alpha by the number of tests, requiring p less than 0.005) and Benjamini-Hochberg (FDR control, which is less conservative). Bonferroni controls the family-wise error rate but is very strict -- you might miss real effects. BH controls the false discovery rate, saying "of the results I call significant, on average at most X% are false."
    * The best practice is to flag all 3 as candidates, apply BH correction to see which survive, and then run a focused confirmation test on the 1-2 that survive correction. The confirmation test uses fresh data and a single pre-specified hypothesis, eliminating the multiple testing problem entirely.

    **Follow-up: What is the False Discovery Rate, and why is it often more useful than the Family-Wise Error Rate in practice?**

    The Family-Wise Error Rate (FWER) is the probability of making at least one false positive among all tests. Bonferroni controls this by making each individual test extremely stringent. The problem is that as the number of tests grows, each test becomes so conservative that you lose power to detect real effects. With 100 tests, each requires p less than 0.0005 -- virtually nothing passes. The False Discovery Rate (FDR) controls a different quantity: the expected proportion of false positives among the rejected hypotheses. If you control FDR at 5% and get 20 significant results, you expect about 1 to be a false discovery. This is much more practical for exploratory analysis because you maintain reasonable power while keeping the false discovery proportion bounded. In genomics, where researchers test thousands of genes simultaneously, FDR is the standard approach. In tech, it is increasingly used for feature experimentation platforms that run many simultaneous tests.
  </Accordion>
</AccordionGroup>
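
The Bonferroni and Benjamini-Hochberg corrections discussed in the last question can be sketched in a few lines of NumPy. The p-values below are hypothetical, chosen so that three of ten tests are nominally significant at 0.05, and the helper names `bonferroni` and `benjamini_hochberg` are illustrative rather than taken from any library.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m. Controls the family-wise error rate."""
    p = np.asarray(p_values)
    return p <= alpha / len(p)

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up BH procedure. Controls the false discovery rate."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                          # indices of p-values, smallest first
    thresholds = alpha * np.arange(1, m + 1) / m   # rank-dependent cutoffs k/m * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank that meets its cutoff
        reject[order[:k + 1]] = True               # reject that rank and all smaller ones
    return reject

# Hypothetical p-values from 10 simultaneous A/B tests
p_vals = [0.001, 0.008, 0.04, 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.87]

naive_mask = np.asarray(p_vals) < 0.05
bonf_mask = bonferroni(p_vals)
bh_mask = benjamini_hochberg(p_vals)

print(f"Uncorrected (p < 0.05): {naive_mask.sum()} significant")
print(f"Bonferroni:             {bonf_mask.sum()} significant")
print(f"Benjamini-Hochberg:     {bh_mask.sum()} significant")
```

On these numbers, three tests pass uncorrected, two survive Benjamini-Hochberg, and one survives Bonferroni, which illustrates why BH is described as the less conservative option.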
