It’s election night. With only 2% of votes counted, news networks are already predicting the winner with 95% confidence. How is that possible? They’ve only seen a tiny fraction of the votes! This is the power of statistical inference: the science of drawing conclusions about a large group (the population) by studying a small part of it (a sample).
The fundamental challenge: We want to know the parameter (the true value in the population), but we can only calculate the statistic (the value computed from our sample).
```python
import numpy as np

# Imagine this is the TRUE population (we normally don't know this!)
np.random.seed(42)
population = np.random.choice(['A', 'B'], size=10_000_000, p=[0.52, 0.48])
true_proportion = np.mean(population == 'A')
print(f"TRUE population proportion for A: {true_proportion:.4f}")  # ~0.52

# But we can only survey 1000 people
sample = np.random.choice(population, size=1000, replace=False)
sample_proportion = np.mean(sample == 'A')
print(f"Sample proportion for A: {sample_proportion:.4f}")  # Varies!
```
The standard error measures how much sample statistics vary from sample to sample.

For a proportion: SE = √(p(1 - p) / n)

For a mean: SE = σ / √n
```python
def standard_error_proportion(p, n):
    """Standard error for a sample proportion."""
    return np.sqrt(p * (1 - p) / n)

def standard_error_mean(std_dev, n):
    """Standard error for a sample mean."""
    return std_dev / np.sqrt(n)

# Example: Poll with 52% for candidate A, n=1000
se = standard_error_proportion(0.52, 1000)
print(f"Standard error: {se:.4f}")  # 0.0158, or about 1.58%

# With larger sample
se_large = standard_error_proportion(0.52, 4000)
print(f"SE with n=4000: {se_large:.4f}")  # 0.0079, or about 0.79%
```
Key Insight: Standard error decreases with square root of sample size. To halve the error, you need 4x the sample size.
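To see the square-root law directly, here is a minimal sketch that recomputes the standard error for a few sample sizes (equivalent to calling standard_error_proportion above):

```python
import numpy as np

p = 0.52
for n in [1000, 2000, 4000, 16000]:
    se = np.sqrt(p * (1 - p) / n)
    print(f"n={n:>6}: SE = {se:.4f}")

# Doubling n shrinks the SE by sqrt(2) ≈ 1.41x;
# quadrupling n (1000 -> 4000) is what it takes to halve it.
```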
A 95% confidence interval does NOT mean “there is a 95% probability the true value is in this interval.” It means: if we repeated this process many times, 95% of the intervals we construct would contain the true value.
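The demo below calls a confidence_interval_proportion helper. If you don't already have it from earlier in the tutorial, a minimal version consistent with how it is used here (it returns the (low, high) interval and the margin of error) might look like this:

```python
import numpy as np
from scipy import stats

def confidence_interval_proportion(p_hat, n, confidence=0.95):
    """Normal-approximation CI for a proportion.

    Returns ((low, high), margin_of_error).
    """
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # ~1.96 for 95%
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se
    return (p_hat - margin, p_hat + margin), margin
```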
```python
# Demonstrate: Create 100 confidence intervals
intervals_containing_truth = 0
true_p = 0.52  # Known true value (in real life, unknown)

for _ in range(100):
    # Take a sample
    sample = np.random.choice(['A', 'B'], size=1000, p=[true_p, 1 - true_p])
    p_hat = np.mean(sample == 'A')

    # Calculate CI
    ci, _ = confidence_interval_proportion(p_hat, 1000)

    # Check if CI contains true value
    if ci[0] <= true_p <= ci[1]:
        intervals_containing_truth += 1

print(f"{intervals_containing_truth}% of intervals contained the true value")
# Should be close to 95!
```
When the sample size is small (n < 30), we use the t-distribution instead of the normal distribution. Why? With a small sample, the sample standard deviation is itself an uncertain estimate of the true standard deviation, so there is extra uncertainty to account for.
```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Compare t and normal distributions
x = np.linspace(-4, 4, 1000)
normal = stats.norm.pdf(x)
t_5 = stats.t.pdf(x, df=5)    # 5 degrees of freedom
t_30 = stats.t.pdf(x, df=30)  # 30 degrees of freedom

plt.figure(figsize=(10, 5))
plt.plot(x, normal, label='Normal', linewidth=2)
plt.plot(x, t_5, label='t (df=5)', linestyle='--', linewidth=2)
plt.plot(x, t_30, label='t (df=30)', linestyle=':', linewidth=2)
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Normal vs t-Distribution')
plt.legend()
plt.show()
```
The t-distribution has heavier tails, meaning it accounts for more uncertainty. As sample size increases, it approaches the normal distribution.
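In practice this just means swapping the z critical value for a t critical value when building a confidence interval from a small sample. A short sketch (the data below is made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical small sample: response times (ms) from 12 test runs
data = np.array([212, 198, 230, 205, 221, 189, 240, 210, 199, 225, 218, 207])
n = len(data)

mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)      # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)   # ~2.20 for df=11 (vs 1.96 for normal)

ci = (mean - t_crit * se, mean + t_crit * se)
print(f"Mean: {mean:.1f} ms, 95% t-based CI: ({ci[0]:.1f}, {ci[1]:.1f})")
```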
```python
# BAD: Survey only people who answer phones during business hours
# This systematically excludes working people!

# GOOD: Random sampling from entire population
```
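To make this concrete, here is an illustrative simulation. The population and the preference gap are invented for the example: people who are home during business hours happen to prefer A more often, so a daytime phone survey overestimates support for A.

```python
import numpy as np

np.random.seed(0)
N = 1_000_000

# Hypothetical population: 40% are home during business hours.
home_daytime = np.random.rand(N) < 0.40

# Invented preference gap: 60% support A among those home, 45% among the rest.
supports_A = np.where(home_daytime,
                      np.random.rand(N) < 0.60,
                      np.random.rand(N) < 0.45)

print(f"True support for A: {supports_A.mean():.3f}")  # ~0.51

# BAD: a daytime phone survey only reaches people who are home
biased_idx = np.random.choice(np.where(home_daytime)[0], size=1000, replace=False)
print(f"Biased sample estimate: {supports_A[biased_idx].mean():.3f}")  # ~0.60

# GOOD: a simple random sample of the whole population
random_idx = np.random.choice(N, size=1000, replace=False)
print(f"Random sample estimate: {supports_A[random_idx].mean():.3f}")  # ~0.51
```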
Mistake 1: Misleading Margin of Error

A headline saying “Poll shows 52% support (±3%)” means the 95% CI is 49-55%. But if the race is 52% vs 48%, the other candidate’s CI is 45-51%: the two intervals overlap, and the race is actually too close to call!
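A quick check of that claim, assuming the ±3% margin comes from a poll of roughly 1,000 respondents (a hypothetical but typical size):

```python
import numpy as np

n = 1000            # assumed poll size (roughly what a ±3% margin implies)
p_a, p_b = 0.52, 0.48

for name, p in [("A", p_a), ("B", p_b)]:
    moe = 1.96 * np.sqrt(p * (1 - p) / n)
    print(f"Candidate {name}: {p:.0%} ± {moe:.1%} -> ({p - moe:.1%}, {p + moe:.1%})")

# In a two-candidate race, A's lead is 2*p_a - 1, so its margin of error is
# roughly twice that of p_a: a 4-point lead ± ~6 points easily includes zero.
lead = p_a - p_b
lead_moe = 2 * 1.96 * np.sqrt(p_a * (1 - p_a) / n)
print(f"Lead: {lead:.0%} ± {lead_moe:.1%}")
```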
Mistake 2: Small Sample, Big Claims
```python
# Survey of 30 people shows 60% prefer product A
ci, moe = confidence_interval_proportion(0.60, 30)
print(f"95% CI: ({ci[0]:.1%}, {ci[1]:.1%})")
# CI: (42.5%, 77.5%) - way too wide to claim victory!
```
Mistake 3: Confusing Confidence Level with Probability
```python
# WRONG: "There's a 95% chance the true value is between 48.9% and 55.1%"
# RIGHT: "We're 95% confident our method produces intervals containing the true value"
```
Question 1: Interpreting A/B Test Results
Question: Your A/B test shows the new feature increased click-through rate from 2.0% to 2.3%, with a 95% CI of [2.1%, 2.5%] for the new version. What can you conclude?
Answer:
We’re 95% confident the true CTR for the new version is between 2.1% and 2.5%
Since the entire CI is above 2.0% (the control), we have evidence of a real improvement
The lower bound of the CI corresponds to an improvement of about 0.1 percentage points (2.1% - 2.0%)
For business decisions, consider if a 0.1-0.5 pp improvement justifies the change
Note: If the CI included 2.0%, we couldn’t conclude there’s a real difference.
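For a rough sense of the sample size behind an interval like [2.1%, 2.5%]: a half-width of about 0.2 percentage points around a 2.3% CTR corresponds to roughly 20,000 impressions for the variant. The sample size below is an assumption, just to make the numbers concrete:

```python
import numpy as np

p_hat = 0.023   # observed CTR for the new version
n = 20_000      # assumed number of impressions (hypothetical)

se = np.sqrt(p_hat * (1 - p_hat) / n)
moe = 1.96 * se
print(f"95% CI: ({p_hat - moe:.3%}, {p_hat + moe:.3%})")  # roughly (2.1%, 2.5%)
print(f"Entire interval above the 2.0% control rate: {p_hat - moe > 0.02}")
```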
Question 2: Sample Size Planning (Amazon)
Question: You want to estimate the proportion of customers who will buy a new product. You need ±5% precision with 95% confidence. How many customers should you survey?
Answer:
```python
# Use p = 0.5 for maximum sample size (conservative)
# Formula: n = (z² × p × (1 - p)) / E²
z = 1.96
p = 0.5   # Conservative assumption
E = 0.05  # Desired margin of error

n = (z**2 * p * (1 - p)) / (E**2)
print(f"Sample size needed: {int(np.ceil(n))}")  # 385
```
Key insight: Using p = 0.5 gives the largest sample size because that’s where variance is maximized. If you know approximately what p will be, you can use that value for a smaller required sample.
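For example, if prior data suggested only about 10% of customers would buy, the same formula would call for far fewer respondents. A quick sketch:

```python
import numpy as np

z, E = 1.96, 0.05
for p in [0.5, 0.3, 0.1]:
    n = (z**2 * p * (1 - p)) / (E**2)
    print(f"Assumed p = {p:.0%}: need n = {int(np.ceil(n))}")

# p = 50% -> 385, p = 30% -> 323, p = 10% -> 139
```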
Question 3: Standard Error Application (Facebook/Meta)
Question: Daily active users (DAU) over the past 100 days had mean 50M with standard deviation 5M. What’s the 95% CI for the true mean DAU?
Answer: The standard error of the mean is 5M / √100 = 0.5M, so the 95% CI is 50M ± 1.96 × 0.5M ≈ 50M ± 0.98M. The true average DAU is likely between roughly 49M and 51M with 95% confidence.
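The same calculation in code, a minimal sketch using the numbers from the question:

```python
import numpy as np
from scipy import stats

mean_dau = 50_000_000
std_dau = 5_000_000
n_days = 100

se = std_dau / np.sqrt(n_days)     # 500,000
moe = stats.norm.ppf(0.975) * se   # ~980,000
print(f"95% CI: ({(mean_dau - moe) / 1e6:.2f}M, {(mean_dau + moe) / 1e6:.2f}M)")
# -> roughly (49.02M, 50.98M)
```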
Question 4: Bias in Sampling (Tech Companies)
Question: You survey users who contacted customer support about a new feature. 80% say they dislike it. Is this valid for all users?
Answer: No! This is selection bias. Users who contact support are more likely to have problems, so the sample is not representative of all users. You’re measuring “satisfaction among users with issues,” not “overall satisfaction.” To get a valid estimate, you need:
Random sampling from all users
Or stratified sampling to ensure representation
To account for the fact that satisfied users rarely reach out
This kind of selection bias (often called “sampling bias,” and closely related to survivorship bias) is a major concern in ML training data as well.
You’re planning an A/B test. The current conversion rate is 5%. You want to detect a 20% relative improvement (5% → 6%). How many users do you need per group?
```python
# This is called a power analysis
# We need to balance:
# - Significance level (α): probability of false positive
# - Power (1-β): probability of detecting a real effect
# - Effect size: the difference we want to detect
# - Sample size: what we're solving for
from scipy import stats
import numpy as np

def required_sample_size_ab(p1, p2, alpha=0.05, power=0.8):
    """
    Calculate required sample size per group for A/B test.

    p1: baseline conversion rate
    p2: expected conversion rate after improvement
    alpha: significance level (typically 0.05)
    power: probability of detecting effect if real (typically 0.8)
    """
    # Effect size (Cohen's h)
    h = 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(p1)))

    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # Two-tailed
    z_beta = stats.norm.ppf(power)

    # Sample size per group
    n = 2 * ((z_alpha + z_beta) / h) ** 2
    return int(np.ceil(n))

# Your task: Calculate and interpret
p_baseline = 0.05
p_improved = 0.06
n_per_group = required_sample_size_ab(p_baseline, p_improved)
print(f"Need {n_per_group} users per group")
print(f"Total users: {2 * n_per_group}")

# Also calculate for different scenarios:
# 1. What if you want to detect a 10% improvement (5% → 5.5%)?
# 2. What if power needs to be 90% instead of 80%?
```
Solution:
```python
# Base case: 5% → 6% (20% relative improvement)
n_base = required_sample_size_ab(0.05, 0.06)
print(f"Base case: {n_base:,} per group")  # ~8,143

# Scenario 1: 5% → 5.5% (10% relative improvement)
n_smaller = required_sample_size_ab(0.05, 0.055)
print(f"Smaller effect: {n_smaller:,} per group")  # ~31,218
# Nearly 4x more users for half the effect size!

# Scenario 2: 90% power
n_high_power = required_sample_size_ab(0.05, 0.06, power=0.9)
print(f"Higher power: {n_high_power:,} per group")  # ~10,901
# ~34% more users to raise power from 80% to 90%

# Key insight: required sample size grows with the INVERSE SQUARE of the
# effect size - halving the detectable effect roughly quadruples n.
# This is why A/B testing small improvements is expensive!
```
ML Connection: When you report model accuracy as “92% ± 2%”, you’re using confidence intervals! Cross-validation provides multiple samples, and the standard error tells you how much your accuracy estimate might vary.
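As a rough sketch of that idea (the cross-validation scores below are made up, and the normal-approximation interval is only one of several ways to summarize CV variability):

```python
import numpy as np

# Hypothetical accuracy scores from 5-fold cross-validation
cv_scores = np.array([0.91, 0.93, 0.92, 0.90, 0.94])

mean_acc = cv_scores.mean()
se = cv_scores.std(ddof=1) / np.sqrt(len(cv_scores))
moe = 1.96 * se

print(f"Accuracy: {mean_acc:.1%} ± {moe:.1%}")  # roughly 92% ± 1.4%
```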
Coming up next: We’ll learn about Hypothesis Testing - how to determine if a difference is real or just random noise. This is the foundation of A/B testing and scientific validation of ML models.