
Statistical Inference: Conclusions from Samples

The Election Polling Problem

It’s election night. With only 2% of votes counted, news networks are already predicting the winner with 95% confidence. How is that possible? They’ve only seen a tiny fraction of the votes! This is the power of statistical inference - the science of drawing conclusions about a large group (population) by studying a small part of it (sample).
Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Modules 1-3 (Describing Data, Probability, Distributions)
What You’ll Build: Election predictor, survey analyzer

Population vs Sample

Term         Definition                            Example
-----------  ------------------------------------  ------------------------
Population   The entire group you want to study    All 150 million voters
Sample       A subset you actually observe         1,500 surveyed voters
Parameter    True value for the population         Actual vote percentage
Statistic    Calculated from the sample            Survey percentage
The fundamental challenge: We want to know the parameter, but we can only calculate the statistic.
import numpy as np

# Imagine this is the TRUE population (we normally don't know this!)
np.random.seed(42)
population = np.random.choice(['A', 'B'], size=10_000_000, p=[0.52, 0.48])
true_proportion = np.mean(population == 'A')
print(f"TRUE population proportion for A: {true_proportion:.4f}")  # ~0.52

# But we can only survey 1000 people
sample = np.random.choice(population, size=1000, replace=False)
sample_proportion = np.mean(sample == 'A')
print(f"Sample proportion for A: {sample_proportion:.4f}")  # Varies!
[Figure: Population vs Sample visualization]

Sampling Distributions: The Key Insight

Here’s the crucial question: If we took many different samples, how would our estimates vary?
# Take 1000 different samples of 500 people each
sample_proportions = []
for _ in range(1000):
    sample = np.random.choice(population, size=500, replace=False)
    prop = np.mean(sample == 'A')
    sample_proportions.append(prop)

sample_proportions = np.array(sample_proportions)

print(f"Mean of sample proportions: {np.mean(sample_proportions):.4f}")
print(f"Std of sample proportions: {np.std(sample_proportions):.4f}")
print(f"True proportion: {true_proportion:.4f}")
Output:
Mean of sample proportions: 0.5198
Std of sample proportions: 0.0223
True proportion: 0.5200
The samples cluster around the true value, and they form a normal distribution.
[Figure: Sampling distribution of poll results]
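As a quick, optional sanity check, the spread of those simulated proportions should match the theoretical standard error formula introduced in the next section. This snippet reuses true_proportion and sample_proportions from the code above:
# The empirical spread of the 1000 sample proportions (n=500 each) should be
# close to the theoretical standard error sqrt(p * (1 - p) / n)
theoretical_se = np.sqrt(true_proportion * (1 - true_proportion) / 500)
print(f"Empirical std of sample proportions: {np.std(sample_proportions):.4f}")
print(f"Theoretical standard error (n=500):  {theoretical_se:.4f}")
# Both come out around 0.022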

Standard Error: Quantifying Uncertainty

The standard error measures how much sample statistics vary from sample to sample.

For a proportion: $SE = \sqrt{\frac{p(1-p)}{n}}$

For a mean: $SE = \frac{\sigma}{\sqrt{n}}$
def standard_error_proportion(p, n):
    """Standard error for a sample proportion."""
    return np.sqrt(p * (1 - p) / n)

def standard_error_mean(std_dev, n):
    """Standard error for a sample mean."""
    return std_dev / np.sqrt(n)

# Example: Poll with 52% for candidate A, n=1000
se = standard_error_proportion(0.52, 1000)
print(f"Standard error: {se:.4f}")  # 0.0158 or about 1.58%

# With larger sample
se_large = standard_error_proportion(0.52, 4000)
print(f"SE with n=4000: {se_large:.4f}")  # 0.0079 or about 0.79%
Key Insight: Standard error decreases with square root of sample size. To halve the error, you need 4x the sample size.
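A quick numerical check of that square-root relationship, reusing the standard_error_proportion helper defined above:
# Quadrupling n (1000 -> 4000) halves the standard error; doubling it does not
for n in [1000, 2000, 4000]:
    se = standard_error_proportion(0.52, n)
    print(f"n = {n:>4}  SE = {se:.4f}")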

Confidence Intervals: Expressing Uncertainty

A confidence interval gives a range of plausible values for the true parameter.

The Formula

For a proportion with 95% confidence: $\hat{p} \pm z^* \cdot SE = \hat{p} \pm 1.96 \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
from scipy import stats

def confidence_interval_proportion(p_hat, n, confidence=0.95):
    """Calculate confidence interval for a proportion."""
    # Z-score for desired confidence level
    z = stats.norm.ppf((1 + confidence) / 2)
    
    # Standard error
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    
    # Margin of error
    moe = z * se
    
    return (p_hat - moe, p_hat + moe), moe

# Poll result: 52% with n=1000
p_hat = 0.52
n = 1000

ci, moe = confidence_interval_proportion(p_hat, n)
print(f"Point estimate: {p_hat:.1%}")
print(f"Margin of error: ±{moe:.1%}")
print(f"95% CI: ({ci[0]:.1%}, {ci[1]:.1%})")
Output:
Point estimate: 52.0%
Margin of error: ±3.1%
95% CI: (48.9%, 55.1%)

What Does “95% Confidence” Mean?

It does NOT mean “95% probability the true value is in this interval.” It means: If we repeated this process many times, 95% of the intervals we construct would contain the true value.
# Demonstrate: Create 100 confidence intervals
intervals_containing_truth = 0
true_p = 0.52  # Known true value (in real life, unknown)

for _ in range(100):
    # Take a sample
    sample = np.random.choice(['A', 'B'], size=1000, p=[true_p, 1-true_p])
    p_hat = np.mean(sample == 'A')
    
    # Calculate CI
    ci, _ = confidence_interval_proportion(p_hat, 1000)
    
    # Check if CI contains true value
    if ci[0] <= true_p <= ci[1]:
        intervals_containing_truth += 1

print(f"{intervals_containing_truth}% of intervals contained the true value")
# Should be close to 95!

Confidence Intervals for Means

When estimating an average (like average house price):
def confidence_interval_mean(data, confidence=0.95):
    """Calculate confidence interval for a mean using t-distribution."""
    n = len(data)
    mean = np.mean(data)
    std_err = stats.sem(data)  # Standard error of the mean
    
    # Use t-distribution for small samples
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n-1)
    
    moe = t_crit * std_err
    return (mean - moe, mean + moe), moe

# Example: House prices (in thousands)
house_prices = np.array([
    425, 389, 445, 520, 478, 395, 510, 462, 398, 485,
    512, 445, 468, 502, 389, 475, 498, 415, 528, 459,
    442, 495, 478, 410, 525, 465, 488, 435, 505, 472
])

ci, moe = confidence_interval_mean(house_prices)
print(f"Sample mean: ${np.mean(house_prices):.1f}K")
print(f"Margin of error: ±${moe:.1f}K")
print(f"95% CI: (${ci[0]:.1f}K, ${ci[1]:.1f}K)")
Output:
Sample mean: $463.5K
Margin of error: ±$15.1K
95% CI: ($448.4K, $478.6K)

The t-Distribution: For Small Samples

When sample size is small (n < 30), we use the t-distribution instead of the normal distribution. Why? Small samples have more uncertainty about the true standard deviation.
# Compare t and normal distributions
x = np.linspace(-4, 4, 1000)
normal = stats.norm.pdf(x)
t_5 = stats.t.pdf(x, df=5)    # 5 degrees of freedom
t_30 = stats.t.pdf(x, df=30)  # 30 degrees of freedom

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(x, normal, label='Normal', linewidth=2)
plt.plot(x, t_5, label='t (df=5)', linestyle='--', linewidth=2)
plt.plot(x, t_30, label='t (df=30)', linestyle=':', linewidth=2)
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Normal vs t-Distribution')
plt.legend()
plt.show()
The t-distribution has heavier tails, meaning it accounts for more uncertainty. As sample size increases, it approaches the normal distribution.
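One way to see that convergence numerically is to compare the critical values used for a 95% interval (a small check using scipy.stats, already imported above):
# t critical values shrink toward the normal's 1.96 as degrees of freedom grow
for df in [5, 10, 30, 100]:
    print(f"df = {df:>3}  t* = {stats.t.ppf(0.975, df):.3f}")
print(f"normal   z* = {stats.norm.ppf(0.975):.3f}")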

Sample Size Planning

How large a sample do you need? It depends on:
  1. Desired margin of error
  2. Desired confidence level
  3. Expected variability

For Proportions

$n = \left(\frac{z^*}{MOE}\right)^2 \cdot p(1-p)$
def sample_size_proportion(moe, p=0.5, confidence=0.95):
    """Calculate required sample size for a proportion."""
    z = stats.norm.ppf((1 + confidence) / 2)
    n = (z / moe) ** 2 * p * (1 - p)
    return int(np.ceil(n))

# Want ±3% margin of error at 95% confidence
n_needed = sample_size_proportion(moe=0.03)
print(f"Need {n_needed} respondents for ±3% MOE")  # 1068

# For ±1% (much more precise)
n_needed_1pct = sample_size_proportion(moe=0.01)
print(f"Need {n_needed_1pct} respondents for ±1% MOE")  # 9604

# Notice: 3x more precision requires 9x more sample!

For Means

$n = \left(\frac{z^* \cdot \sigma}{MOE}\right)^2$
def sample_size_mean(moe, std_dev, confidence=0.95):
    """Calculate required sample size for a mean."""
    z = stats.norm.ppf((1 + confidence) / 2)
    n = (z * std_dev / moe) ** 2
    return int(np.ceil(n))

# House prices: want ±$10K, estimated std dev is $50K
n_houses = sample_size_mean(moe=10, std_dev=50)
print(f"Need {n_houses} houses for ±$10K MOE")  # 97

Mini-Project: Election Poll Analyzer

Build a complete polling analysis system:
import numpy as np
from scipy import stats

class PollAnalyzer:
    """
    Analyze election poll results with proper uncertainty quantification.
    """
    
    def __init__(self, candidate_a_votes, total_votes, confidence=0.95):
        self.n = total_votes
        self.a_votes = candidate_a_votes
        self.b_votes = total_votes - candidate_a_votes
        self.p_a = candidate_a_votes / total_votes
        self.p_b = 1 - self.p_a
        self.confidence = confidence
        
    def margin_of_error(self):
        """Calculate margin of error for candidate A's proportion."""
        z = stats.norm.ppf((1 + self.confidence) / 2)
        se = np.sqrt(self.p_a * self.p_b / self.n)
        return z * se
    
    def confidence_interval(self):
        """Calculate confidence interval for candidate A."""
        moe = self.margin_of_error()
        return (self.p_a - moe, self.p_a + moe)
    
    def probability_a_wins(self):
        """
        Estimate probability that candidate A is truly ahead.
        This uses the normal approximation.
        """
        # We want P(true_p_a > 0.5)
        # The difference (p_a - 0.5) follows approximately normal
        diff = self.p_a - 0.5
        se = np.sqrt(self.p_a * self.p_b / self.n)
        
        # Z-score for the difference from 0.5
        z = diff / se
        
        # Probability that true proportion > 0.5
        return stats.norm.cdf(z)
    
    def required_sample_for_call(self, min_confidence=0.95):
        """
        Calculate sample size needed to call the race at given confidence.
        Returns None if current lead is too small.
        """
        # We need the CI to not cross 50%
        # This happens when |p_a - 0.5| > MOE
        
        lead = abs(self.p_a - 0.5)
        z = stats.norm.ppf((1 + min_confidence) / 2)
        
        # Solve: z * sqrt(p*q/n) = lead
        # n = (z^2 * p * q) / lead^2
        
        if lead == 0:
            return float('inf')
        
        n = (z ** 2 * self.p_a * self.p_b) / (lead ** 2)
        return int(np.ceil(n))
    
    def report(self):
        """Generate comprehensive poll report."""
        ci = self.confidence_interval()
        
        print("\n" + "=" * 60)
        print("ELECTION POLL ANALYSIS")
        print("=" * 60)
        print(f"Sample Size: {self.n:,} voters")
        print(f"Confidence Level: {self.confidence:.0%}")
        print("-" * 60)
        print(f"Candidate A: {self.p_a:.1%} ({self.a_votes:,} votes)")
        print(f"Candidate B: {self.p_b:.1%} ({self.b_votes:,} votes)")
        print("-" * 60)
        print(f"Margin of Error: ±{self.margin_of_error():.1%}")
        print(f"{self.confidence:.0%} CI for A: ({ci[0]:.1%}, {ci[1]:.1%})")
        print("-" * 60)
        
        p_wins = self.probability_a_wins()
        if p_wins > 0.99:
            call = "PROJECTED WINNER: Candidate A"
        elif p_wins < 0.01:
            call = "PROJECTED WINNER: Candidate B"
        elif p_wins > 0.95:
            call = "LIKELY WINNER: Candidate A"
        elif p_wins < 0.05:
            call = "LIKELY WINNER: Candidate B"
        else:
            call = "TOO CLOSE TO CALL"
        
        print(f"P(A is truly ahead): {p_wins:.1%}")
        print(f"Status: {call}")
        
        if 0.05 < p_wins < 0.95:
            n_needed = self.required_sample_for_call()
            if n_needed < float('inf'):
                print(f"Need ~{n_needed:,} votes to call at 95%")
        
        print("=" * 60)


# Example 1: Early results (close race)
print("\n--- EARLY RESULTS ---")
poll1 = PollAnalyzer(candidate_a_votes=520, total_votes=1000)
poll1.report()

# Example 2: More data (still close)
print("\n--- UPDATED RESULTS ---")
poll2 = PollAnalyzer(candidate_a_votes=5200, total_votes=10000)
poll2.report()

# Example 3: Clear lead
print("\n--- CLEAR LEAD ---")
poll3 = PollAnalyzer(candidate_a_votes=5500, total_votes=10000)
poll3.report()
Output:
--- EARLY RESULTS ---

============================================================
ELECTION POLL ANALYSIS
============================================================
Sample Size: 1,000 voters
Confidence Level: 95%
------------------------------------------------------------
Candidate A: 52.0% (520 votes)
Candidate B: 48.0% (480 votes)
------------------------------------------------------------
Margin of Error: ±3.1%
95% CI for A: (48.9%, 55.1%)
------------------------------------------------------------
P(A is truly ahead): 89.7%
Status: TOO CLOSE TO CALL
Need ~2,397 votes to call at 95%
============================================================

--- UPDATED RESULTS ---

============================================================
ELECTION POLL ANALYSIS
============================================================
Sample Size: 10,000 voters
Confidence Level: 95%
------------------------------------------------------------
Candidate A: 52.0% (5,200 votes)
Candidate B: 48.0% (4,800 votes)
------------------------------------------------------------
Margin of Error: ±1.0%
95% CI for A: (51.0%, 53.0%)
------------------------------------------------------------
P(A is truly ahead): 100.0%
Status: PROJECTED WINNER: Candidate A
============================================================

Common Mistakes in Inference

Mistake 1: Ignoring Sample Bias

# BAD: Survey only people who answer phones during business hours
# This systematically excludes working people!

# GOOD: Random sampling from entire population

Mistake 2: Misleading Margin of Error

A headline saying "Poll shows 52% support (±3%)" means the 95% CI is 49-55%. But if the race is 52% vs 48%, the two candidates' intervals overlap, so the race is actually too close to call!
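A quick way to see that overlap, reusing the confidence_interval_proportion helper defined earlier (a sketch, assuming a simple two-candidate poll of 1,000 people):
# 52% vs 48% with n = 1000: each interval is about ±3.1 points wide
ci_a, _ = confidence_interval_proportion(0.52, 1000)
ci_b, _ = confidence_interval_proportion(0.48, 1000)
print(f"Candidate A 95% CI: ({ci_a[0]:.1%}, {ci_a[1]:.1%})")  # ~(48.9%, 55.1%)
print(f"Candidate B 95% CI: ({ci_b[0]:.1%}, {ci_b[1]:.1%})")  # ~(44.9%, 51.1%)
# The intervals overlap and both cross 50%, so the lead is not conclusive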
Mistake 3: Small Sample, Big Claims
# Survey of 30 people shows 60% prefer product A
ci, moe = confidence_interval_proportion(0.60, 30)
print(f"95% CI: ({ci[0]:.1%}, {ci[1]:.1%})")
# CI: (42.5%, 77.5%) - way too wide to claim victory!
Mistake 4: Confusing Confidence Level with Probability
# WRONG: "There's a 95% chance the true value is between 48.9% and 55.1%"
# RIGHT: "We're 95% confident our method produces intervals containing the true value"

Interview Questions

Question: Your A/B test shows the new feature increased click-through rate from 2.0% to 2.3%, with a 95% CI of [2.1%, 2.5%] for the new version. What can you conclude?
Answer:
  • We’re 95% confident the true CTR for the new version is between 2.1% and 2.5%
  • Since the entire CI is above 2.0% (the control), we have evidence of a real improvement
  • The minimum expected improvement is 0.1 percentage points (2.1% - 2.0%)
  • For business decisions, consider if a 0.1-0.5 pp improvement justifies the change
Note: If the CI included 2.0%, we couldn’t conclude there’s a real difference.
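A minimal sketch of that check in code, using the numbers quoted in the question (the CI bounds here are assumptions taken from the prompt, not real experiment output):
control_ctr = 0.020              # baseline click-through rate
new_version_ci = (0.021, 0.025)  # 95% CI for the new version

if new_version_ci[0] > control_ctr:
    print("Entire CI lies above the control rate - evidence of a real improvement")
else:
    print("CI includes the control rate - cannot conclude a real difference")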
Question: You want to estimate the proportion of customers who will buy a new product. You need ±5% precision with 95% confidence. How many customers should you survey?
Answer:
# Use p = 0.5 for maximum sample size (conservative)
# Formula: n = (z²×p×(1-p)) / E²
z = 1.96
p = 0.5  # Conservative assumption
E = 0.05  # Desired margin of error

n = (z**2 * p * (1-p)) / (E**2)
print(f"Sample size needed: {int(np.ceil(n))}")  # 385
Key insight: Using p = 0.5 gives the largest sample size because that’s where variance is maximized. If you know approximately what p will be, you can use that value for a smaller required sample.
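A quick check that p(1-p), and therefore the required sample size, peaks at p = 0.5 (reusing the same formula for a ±5% margin of error):
# p(1-p) is maximized at p = 0.5, so assuming 0.5 is the conservative choice
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    n = (1.96 / 0.05) ** 2 * p * (1 - p)
    print(f"p = {p:.1f}  p(1-p) = {p * (1 - p):.2f}  n needed = {int(np.ceil(n))}")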
Question: Daily active users (DAU) over the past 100 days had mean 50M with standard deviation 5M. What’s the 95% CI for the true mean DAU?
Answer:
mean_dau = 50  # millions
std_dau = 5  # millions
n = 100

# Standard error
se = std_dau / np.sqrt(n)  # 0.5 million

# 95% CI
z = 1.96
ci = (mean_dau - z*se, mean_dau + z*se)
print(f"95% CI: ({ci[0]:.2f}M, {ci[1]:.2f}M)")
# (49.02M, 50.98M)
The true average DAU is likely between 49M and 51M with 95% confidence.
Question: You survey users who contacted customer support about a new feature. 80% say they dislike it. Is this valid for all users?
Answer: No! This is selection bias. Users who contact support are more likely to have problems, so the sample is not representative of all users. You're measuring "satisfaction among users with issues," not "overall satisfaction." To get a valid estimate, you need:
  • Random sampling from all users
  • Or stratified sampling to ensure representation
  • Consider that satisfied users rarely reach out
This kind of selection bias (closely related to sampling bias and survivorship bias) is a major concern in ML training data as well.
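A toy simulation of this effect (the rates below are assumptions chosen only for illustration): if dissatisfied users are ten times more likely to contact support, the support sample wildly overstates the true dislike rate.
np.random.seed(0)
n_users = 100_000
dislikes = np.random.rand(n_users) < 0.30           # true dislike rate: 30%
contact_prob = np.where(dislikes, 0.20, 0.02)       # unhappy users contact support 10x more often
contacted = np.random.rand(n_users) < contact_prob

print(f"True dislike rate:              {dislikes.mean():.1%}")
print(f"Dislike rate among contacters:  {dislikes[contacted].mean():.1%}")  # close to 80%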

Practice Challenge

You’re planning an A/B test. The current conversion rate is 5%. You want to detect a 20% relative improvement (5% → 6%). How many users do you need per group?
# This is called a power analysis
# We need to balance:
# - Significance level (α): probability of false positive
# - Power (1-β): probability of detecting a real effect
# - Effect size: the difference we want to detect
# - Sample size: what we're solving for

from scipy import stats
import numpy as np

def required_sample_size_ab(p1, p2, alpha=0.05, power=0.8):
    """
    Calculate required sample size per group for A/B test.
    
    p1: baseline conversion rate
    p2: expected conversion rate after improvement
    alpha: significance level (typically 0.05)
    power: probability of detecting effect if real (typically 0.8)
    """
    # Effect size (Cohen's h)
    h = 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(p1)))
    
    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha/2)  # Two-tailed
    z_beta = stats.norm.ppf(power)
    
    # Sample size per group
    n = 2 * ((z_alpha + z_beta) / h) ** 2
    
    return int(np.ceil(n))

# Your task: Calculate and interpret
p_baseline = 0.05
p_improved = 0.06

n_per_group = required_sample_size_ab(p_baseline, p_improved)
print(f"Need {n_per_group} users per group")
print(f"Total users: {2 * n_per_group}")

# Also calculate for different scenarios:
# 1. What if you want to detect a 10% improvement (5% → 5.5%)?
# 2. What if power needs to be 90% instead of 80%?
Solution:
# Base case: 5% → 6% (20% relative improvement)
n_base = required_sample_size_ab(0.05, 0.06)
print(f"Base case: {n_base:,} per group")  # ~8,100

# Scenario 1: 5% → 5.5% (10% relative improvement)
n_smaller = required_sample_size_ab(0.05, 0.055)
print(f"Smaller effect: {n_smaller:,} per group")  # ~31,200
# 4x more users for half the effect size!

# Scenario 2: 90% power
n_high_power = required_sample_size_ab(0.05, 0.06, power=0.9)
print(f"Higher power: {n_high_power:,} per group")  # ~10,900
# ~34% more users for 10% more power

# Key insight: required sample size grows roughly as 1/(effect size)^2
# This is why A/B testing small improvements is expensive!

📝 Practice Exercises

Exercise 1

Calculate confidence intervals for proportions

Exercise 2

Determine required sample sizes for precision

Exercise 3

Construct confidence intervals for means

Exercise 4

Real-world: Election polling analysis

Key Takeaways

Population vs Sample

  • Population: entire group of interest
  • Sample: subset we actually observe
  • Statistics estimate parameters

Standard Error

  • Measures variability of sample statistics
  • Decreases with sqrt(n)
  • Foundation of confidence intervals

Confidence Intervals

  • Range of plausible values for parameter
  • Width = 2 x margin of error
  • Higher confidence = wider interval

Sample Size

  • More precision requires more data
  • Quadratic relationship (2x precision = 4x data)
  • Plan before collecting data

Common Pitfalls

Inference Mistakes to Avoid:
  1. Misinterpreting Confidence Intervals - “95% confident the true value is in this range” NOT “95% chance the parameter is here”
  2. Ignoring Sample Bias - Non-random samples lead to biased estimates regardless of sample size
  3. Confusing Confidence Level with Precision - 99% confidence is wider, not more precise (see the quick check after this list)
  4. Forgetting Standard Error Shrinks with √n - To halve SE, you need 4x the data, not 2x
  5. Using z when t is appropriate - For small samples (n < 30), t-distribution accounts for extra uncertainty
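A quick check of point 3, reusing the confidence_interval_proportion helper from earlier on the same 52%-of-1,000 poll:
# Raising the confidence level widens the interval; it does not add precision
for conf in [0.90, 0.95, 0.99]:
    ci, moe = confidence_interval_proportion(0.52, 1000, confidence=conf)
    print(f"{conf:.0%} CI: ({ci[0]:.1%}, {ci[1]:.1%})  width = {2 * moe:.1%}")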

Connection to Machine Learning

Inference Concept      ML Application
---------------------  -------------------------------------
Confidence intervals   Uncertainty in predictions
Standard error         Error bars, prediction intervals
Sample size            Training set size planning
t-distribution         Small data regimes, regularization
Bias in sampling       Training/test split, data collection
ML Connection: When you report model accuracy as “92% ± 2%”, you’re using confidence intervals! Cross-validation provides multiple samples, and the standard error tells you how much your accuracy estimate might vary.
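A minimal sketch of that idea, assuming five hypothetical cross-validation fold accuracies (the numbers are made up for illustration):
import numpy as np
from scipy import stats

fold_accuracies = np.array([0.91, 0.93, 0.92, 0.90, 0.94])  # hypothetical CV fold scores

mean_acc = np.mean(fold_accuracies)
se_acc = stats.sem(fold_accuracies)                          # standard error across folds
t_crit = stats.t.ppf(0.975, df=len(fold_accuracies) - 1)     # t-distribution: n is small
moe = t_crit * se_acc

print(f"Model accuracy: {mean_acc:.1%} ± {moe:.1%} (95% CI)")  # roughly 92% ± 2%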
Coming up next: We’ll learn about Hypothesis Testing - how to determine if a difference is real or just random noise. This is the foundation of A/B testing and scientific validation of ML models.

Next: Hypothesis Testing

Learn to distinguish real effects from random noise