Probability Distributions

Probability Distributions: Patterns in Randomness

The Factory Quality Problem

You run a factory that produces ball bearings. Each bearing should be exactly 10mm in diameter. But manufacturing isn’t perfect - there’s always some variation. You measure 1000 bearings and get:
import numpy as np
import matplotlib.pyplot as plt

# Simulated bearing diameters (mm)
np.random.seed(42)
bearings = np.random.normal(loc=10.0, scale=0.05, size=1000)

print(f"Mean diameter: {np.mean(bearings):.4f} mm")
print(f"Std deviation: {np.std(bearings):.4f} mm")
print(f"Min: {np.min(bearings):.4f} mm")
print(f"Max: {np.max(bearings):.4f} mm")
Output (illustrative; exact values depend on the random draw):
Mean diameter: 10.0024 mm
Std deviation: 0.0498 mm
Min: 9.8521 mm
Max: 10.1534 mm
If you plot these measurements, something magical appears:
plt.figure(figsize=(10, 5))
plt.hist(bearings, bins=50, density=True, alpha=0.7, edgecolor='black')
plt.xlabel('Diameter (mm)')
plt.ylabel('Density')  # density=True scales bar heights so the histogram area is 1
plt.title('Distribution of Ball Bearing Diameters')
plt.axvline(10.0, color='red', linestyle='--', label='Target: 10mm')
plt.legend()
plt.show()
A bell curve emerges. This isn’t coincidence - it’s one of the most profound patterns in nature.
Key Probability Distributions for ML
Estimated Time: 3-4 hours
Difficulty: Beginner
Prerequisites: Modules 1-2 (Describing Data, Probability)
What You’ll Build: Quality control system, prediction intervals

What Is a Probability Distribution?

A probability distribution describes all possible values a random variable can take and how likely each value is. Think of it as a complete map of possibilities. Analogy: A probability distribution is like a city’s terrain map. The peaks show where values cluster (common outcomes), and the valleys show where values are rare. Just as different cities have different landscapes — some flat (uniform), some with a single mountain (normal), some with a long tail running to the east (exponential) — different types of data have different distributional shapes. Learning to “read the terrain” of your data is one of the most valuable skills in ML.

Discrete vs Continuous

| Type | Description | Examples |
| --- | --- | --- |
| Discrete | Countable outcomes | Coin flips, dice rolls, number of customers |
| Continuous | Infinite possible values | Height, weight, temperature, time |
# Discrete: Number of heads in 10 coin flips
# Can only be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10

# Continuous: Height of a person
# Can be 170.0 cm, 170.1 cm, 170.01 cm, 170.001 cm...
[Figure: Discrete vs Continuous Distributions]
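The practical difference shows up in how probabilities are computed. The sketch below is a minimal illustration, assuming SciPy is available: for a discrete variable the probability mass function (PMF) returns actual probabilities that sum to 1, while for a continuous variable probabilities come from intervals via the CDF. The height model Normal(170, 10) is an illustrative assumption, not data from this module.

```python
from scipy import stats

# Discrete: the binomial PMF gives real probabilities for each count of heads
n_flips = 10
pmf_values = [stats.binom.pmf(k, n_flips, 0.5) for k in range(n_flips + 1)]
total_probability = sum(pmf_values)
print(f"Sum of PMF over 0..10 heads: {total_probability:.4f}")

# Continuous: probability is assigned to intervals, computed from the CDF
# (assumed model: heights ~ Normal(mean=170 cm, std=10 cm))
height = stats.norm(loc=170, scale=10)
p_interval = height.cdf(175) - height.cdf(165)
print(f"P(165 < height < 175): {p_interval:.4f}")
```

For the continuous case, asking for the probability of exactly 170.000... cm is not meaningful; only ranges carry probability.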

The Uniform Distribution: Equal Chances

The simplest distribution - every outcome is equally likely.

Discrete Uniform: The Fair Die

import numpy as np
from collections import Counter

# Roll a fair die 10000 times
rolls = np.random.randint(1, 7, size=10000)

counts = Counter(rolls)
for face in sorted(counts.keys()):
    pct = counts[face] / 10000 * 100
    print(f"Face {face}: {counts[face]:4d} ({pct:.1f}%)")
Output:
Face 1: 1652 (16.5%)
Face 2: 1689 (16.9%)
Face 3: 1634 (16.3%)
Face 4: 1701 (17.0%)
Face 5: 1658 (16.6%)
Face 6: 1666 (16.7%)
Each face appears roughly 16.67% (1/6) of the time.
[Figure: Uniform Distribution - Dice and Lottery]

Continuous Uniform: Random Numbers

# Random time a customer arrives between 9:00 and 10:00 AM
arrival_minutes = np.random.uniform(0, 60, size=1000)

print(f"Mean arrival: {np.mean(arrival_minutes):.1f} minutes after 9:00")
print(f"P(arrive in first 15 min): {np.mean(arrival_minutes < 15):.1%}")
ML Applications:
  • Random weight initialization
  • Data augmentation (random crops, rotations)
  • Monte Carlo simulations
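The simulated arrival times can be checked against the exact continuous uniform distribution. This is a small sketch, assuming SciPy's `stats.uniform`, where `loc` is the lower bound and `scale` is the width of the interval; the variable names are illustrative.

```python
from scipy import stats

# Uniform on [0, 60] minutes after 9:00
arrival = stats.uniform(loc=0, scale=60)

p_first_15 = arrival.cdf(15)   # 15 / 60 of the interval
mean_arrival = arrival.mean()  # midpoint of the interval
std_arrival = arrival.std()    # width / sqrt(12)

print(f"P(arrive in first 15 min): {p_first_15:.1%}")
print(f"Mean arrival: {mean_arrival:.1f} minutes after 9:00")
print(f"Std deviation: {std_arrival:.2f} minutes")
```

The simulated values from `np.random.uniform` should land close to these exact figures, and closer still as the sample size grows.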

The Binomial Distribution: Success/Failure Experiments

The binomial distribution counts the number of successes when you repeat an independent experiment with two outcomes (success/failure) a fixed number of times. Parameters:
  • n = number of trials
  • p = probability of success on each trial
Question: If you flip a coin 10 times, what’s the probability of getting exactly 7 heads?

Mathematical Formula

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is “n choose k”.
from scipy import stats
import math

def binomial_probability(n, k, p):
    """Calculate P(X = k) for binomial distribution."""
    # n choose k, computed exactly with integer arithmetic (Python 3.8+)
    combinations = math.comb(n, k)
    # Probability
    return combinations * (p ** k) * ((1 - p) ** (n - k))

# P(exactly 7 heads in 10 flips)
p_7_heads = binomial_probability(n=10, k=7, p=0.5)
print(f"P(7 heads in 10 flips): {p_7_heads:.4f}")  # 0.1172

# Using scipy
p_7_scipy = stats.binom.pmf(k=7, n=10, p=0.5)
print(f"P(7 heads) via scipy: {p_7_scipy:.4f}")  # 0.1172
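For readers who want to follow the arithmetic by hand, the coin-flip question works out as below. This simply plugs n = 10, k = 7, p = 0.5 into the binomial formula.

$$P(X = 7) = \binom{10}{7} (0.5)^{7} (0.5)^{3} = \frac{10!}{7!\,3!} \cdot \frac{1}{2^{10}} = \frac{120}{1024} \approx 0.1172$$

There are 120 ways to place 7 heads among 10 flips, and each specific sequence of 10 flips has probability 1/1024.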

Visualizing the Binomial Distribution

n = 10
p = 0.5

k_values = range(0, n + 1)
probabilities = [stats.binom.pmf(k, n, p) for k in k_values]

plt.figure(figsize=(10, 5))
plt.bar(k_values, probabilities, edgecolor='black', alpha=0.7)
plt.xlabel('Number of Heads')
plt.ylabel('Probability')
plt.title(f'Binomial Distribution (n={n}, p={p})')
plt.xticks(k_values)
plt.show()

Real-World Example: Website Conversion

# Your website has a 3% conversion rate
# 100 people visit today
# What's the probability of 5 or more conversions?

n = 100
p = 0.03

# P(X >= 5) = 1 - P(X <= 4)
p_at_least_5 = 1 - stats.binom.cdf(4, n, p)
print(f"P(5+ conversions): {p_at_least_5:.1%}")  # 18.2%

# Expected conversions
expected = n * p
print(f"Expected conversions: {expected}")  # 3.0
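A quick simulation is one way to sanity-check a binomial calculation like this. The sketch below assumes NumPy's `default_rng` and SciPy's survival function `binom.sf`; the seed and the number of simulated days are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n_visitors, conv_rate = 100, 0.03

# Simulate many days, each with 100 visitors converting independently
daily_conversions = rng.binomial(n_visitors, conv_rate, size=100_000)
simulated = np.mean(daily_conversions >= 5)

# Exact tail: sf(4) = P(X > 4) = P(X >= 5)
exact = stats.binom.sf(4, n_visitors, conv_rate)

# Spread around the expected value: sqrt(n * p * (1 - p))
spread = np.sqrt(n_visitors * conv_rate * (1 - conv_rate))

print(f"Simulated P(5+ conversions): {simulated:.1%}")
print(f"Exact P(5+ conversions):     {exact:.1%}")
print(f"Std deviation of daily conversions: {spread:.2f}")
```

The simulated share should agree with the exact value to within about a percentage point at this number of simulated days.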
A factory produces items with a 2% defect rate. In a batch of 50 items:
  1. What’s the probability of exactly 0 defects?
  2. What’s the probability of 3 or more defects?
  3. How many defects do you expect?
n, p = 50, 0.02

# 1. P(X = 0)
p_zero = stats.binom.pmf(0, n, p)
print(f"P(0 defects): {p_zero:.1%}")  # 36.4%

# 2. P(X >= 3)
p_three_plus = 1 - stats.binom.cdf(2, n, p)
print(f"P(3+ defects): {p_three_plus:.1%}")  # 7.8%

# 3. Expected defects
expected = n * p
print(f"Expected defects: {expected}")  # 1.0

The Normal Distribution: The Bell Curve

This is the most important distribution in statistics. It appears everywhere:
  • Human heights and weights
  • Test scores
  • Measurement errors
  • Stock price changes (only roughly; real returns have heavier tails than the normal)
  • IQ scores
Parameters:
  • μ (mu) = mean (center of the bell)
  • σ (sigma) = standard deviation (width of the bell)

Mathematical Formula

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
[Figure: Normal Distribution Formula and Shape]
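The density formula can be translated into code almost term by term. The following is a minimal sketch, with `normal_pdf` as an illustrative helper name, compared against SciPy's `stats.norm.pdf` to confirm the two agree.

```python
import numpy as np
from scipy import stats

def normal_pdf(x, mu, sigma):
    """Normal density written out directly from the formula."""
    coeff = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coeff * np.exp(exponent)

# Evaluate on a few points around the mean (IQ-style scale: mu=100, sigma=15)
x_grid = np.linspace(40, 160, 7)
manual = normal_pdf(x_grid, mu=100, sigma=15)
reference = stats.norm.pdf(x_grid, loc=100, scale=15)

print(np.allclose(manual, reference))
```

In practice you would call `stats.norm.pdf` directly; writing the formula out once is mainly useful for seeing what μ and σ do to the curve.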
# Generate normally distributed data
mu = 100  # mean
sigma = 15  # standard deviation

# IQ scores follow this distribution
iq_scores = np.random.normal(mu, sigma, 10000)

plt.figure(figsize=(10, 5))
plt.hist(iq_scores, bins=50, density=True, alpha=0.7, edgecolor='black')

# Overlay theoretical curve
x = np.linspace(50, 150, 1000)
y = stats.norm.pdf(x, mu, sigma)
plt.plot(x, y, 'r-', linewidth=2, label='Theoretical')

plt.xlabel('IQ Score')
plt.ylabel('Probability Density')
plt.title('Normal Distribution of IQ Scores (μ=100, σ=15)')
plt.legend()
plt.show()

The 68-95-99.7 Rule (Empirical Rule)

One of the most useful facts in statistics:
| Range | Percentage of Data |
| --- | --- |
| μ ± 1σ | 68% |
| μ ± 2σ | 95% |
| μ ± 3σ | 99.7% |
# Verify with IQ scores
within_1_std = np.mean(np.abs(iq_scores - mu) <= sigma)
within_2_std = np.mean(np.abs(iq_scores - mu) <= 2 * sigma)
within_3_std = np.mean(np.abs(iq_scores - mu) <= 3 * sigma)

print(f"Within 1 std (85-115): {within_1_std:.1%}")   # ~68%
print(f"Within 2 std (70-130): {within_2_std:.1%}")   # ~95%
print(f"Within 3 std (55-145): {within_3_std:.1%}")   # ~99.7%
[Figure: 68-95-99.7 Rule Applied to Heights]
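The 68-95-99.7 figures are rounded. The exact coverage for any normal distribution can be read off the standard normal CDF, as in this short sketch (assuming SciPy; the dictionary name is illustrative).

```python
from scipy import stats

# Exact probability of landing within k standard deviations of the mean
coverage = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"Within {k} std: {prob:.4%}")
```

Because the calculation uses the standard normal (μ=0, σ=1), the same percentages hold for IQ scores, heights, or any other normally distributed quantity.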

Z-Scores: Standardizing Any Normal Distribution

A z-score tells you how many standard deviations a value is from the mean:

$$z = \frac{x - \mu}{\sigma}$$
def z_score(x, mu, sigma):
    """Convert value to z-score."""
    return (x - mu) / sigma

# How exceptional is an IQ of 130?
iq = 130
z = z_score(iq, mu=100, sigma=15)
print(f"IQ of 130 has z-score: {z:.2f}")  # 2.0

# This means 130 is 2 standard deviations above average
# Only about 2.3% of people score higher
percentile = stats.norm.cdf(z) * 100
print(f"Percentile: {percentile:.1f}%")  # 97.7%
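Z-scores also make values on different scales comparable. The sketch below uses two hypothetical exams (the means and standard deviations are made up for illustration) to show how a raw score of 82 on one test can be compared with 640 on another.

```python
def z_score(x, mu, sigma):
    """Convert a value to its z-score."""
    return (x - mu) / sigma

# Hypothetical exam A: mean 70, std 8; hypothetical exam B: mean 500, std 100
z_a = z_score(82, mu=70, sigma=8)
z_b = z_score(640, mu=500, sigma=100)

print(f"Exam A z-score: {z_a:.2f}")
print(f"Exam B z-score: {z_b:.2f}")
print("Stronger relative result:", "A" if z_a > z_b else "B")
```

The comparison is only as good as the normality assumption behind each scale; for skewed score distributions, percentiles are a safer basis.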

Calculating Probabilities

# Normal distribution with μ=100, σ=15

# P(IQ > 130)
p_above_130 = 1 - stats.norm.cdf(130, loc=100, scale=15)
print(f"P(IQ > 130): {p_above_130:.2%}")  # 2.28%

# P(IQ between 85 and 115)
p_middle = stats.norm.cdf(115, 100, 15) - stats.norm.cdf(85, 100, 15)
print(f"P(85 < IQ < 115): {p_middle:.2%}")  # 68.27%

# What IQ score is at the 99th percentile?
iq_99 = stats.norm.ppf(0.99, loc=100, scale=15)
print(f"99th percentile IQ: {iq_99:.1f}")  # 134.9

Why Is the Normal Distribution Everywhere?

The Central Limit Theorem (CLT) explains this magic:
Central Limit Theorem: When you add up many independent random variables with finite variance, their sum tends toward a normal distribution, regardless of the shape of the original distributions.

Demonstration

# Roll a single die - definitely NOT normal
single_die = np.random.randint(1, 7, 10000)

# Sum of 2 dice - starting to look different
sum_2_dice = np.array([np.random.randint(1, 7, 2).sum() for _ in range(10000)])

# Sum of 10 dice - getting bell-shaped
sum_10_dice = np.array([np.random.randint(1, 7, 10).sum() for _ in range(10000)])

# Sum of 30 dice - nearly perfect normal!
sum_30_dice = np.array([np.random.randint(1, 7, 30).sum() for _ in range(10000)])

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

axes[0, 0].hist(single_die, bins=6, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Single Die (Uniform)')

axes[0, 1].hist(sum_2_dice, bins=11, edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Sum of 2 Dice')

axes[1, 0].hist(sum_10_dice, bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_title('Sum of 10 Dice')

axes[1, 1].hist(sum_30_dice, bins=40, edgecolor='black', alpha=0.7)
axes[1, 1].set_title('Sum of 30 Dice (Nearly Normal!)')

plt.tight_layout()
plt.show()
This is why heights are normally distributed: Height is determined by thousands of genes, each adding a small random effect. Sum of many small random things = normal distribution.
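The convergence can also be measured rather than eyeballed. This sketch, which assumes NumPy and SciPy and uses an exponential distribution as an arbitrary skewed starting point, tracks the skewness of sample means as the number of values averaged grows. A normal distribution has skewness 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed

skew_by_n = {}
for n in (1, 5, 30, 200):
    # 20,000 repeated experiments, each averaging n exponential draws
    sample_means = rng.exponential(scale=1.0, size=(20_000, n)).mean(axis=1)
    skew_by_n[n] = stats.skew(sample_means)

for n, skewness in skew_by_n.items():
    print(f"n={n:>3}: skewness of sample means = {skewness:.2f}")
```

The skewness shrinks steadily toward 0 as n increases, which is the CLT at work. How fast it shrinks depends on how skewed the starting distribution is.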
ML Application: Batch Normalization and the CLT. Each unit in a neural network sums many weighted inputs, and by the CLT such sums often look roughly bell-shaped. Batch normalization re-centers and re-scales each layer's activations to zero mean and unit variance (it standardizes them; it does not force them to be normal), which tends to stabilize training and allow higher learning rates. When someone asks “why does batch norm help?”, the CLT is useful intuition for why standardizing these sums is reasonable, though it is not the whole explanation.
Statistical Mistake in ML — Assuming Normality of Features: Many ML practitioners apply z-score standardization and assume their features are normally distributed. But real-world features like income, click counts, and session durations are often heavily skewed. Before standardizing, plot your distributions. For right-skewed data, a log transform before standardization often dramatically improves model performance — especially for linear models and neural networks that implicitly assume symmetric inputs.
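The effect of a log transform on a skewed feature is easy to check. The sketch below uses synthetic log-normal "income" data (an assumption for illustration, not real data) and compares skewness after plain standardization with skewness after log-then-standardize.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed
income = rng.lognormal(mean=10.5, sigma=0.8, size=10_000)  # synthetic, right-skewed

def standardize(values):
    """Z-score standardization: zero mean, unit variance."""
    return (values - values.mean()) / values.std()

z_raw = standardize(income)          # standardizing does not remove skew
z_log = standardize(np.log(income))  # log first, then standardize

skew_raw = stats.skew(z_raw)
skew_log = stats.skew(z_log)
print(f"Skewness after standardizing only: {skew_raw:.2f}")
print(f"Skewness after log + standardize:  {skew_log:.2f}")
```

Standardization only shifts and rescales, so the shape (and the skew) is unchanged. For features that contain zeros, `np.log1p` is a common substitute for `np.log`.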

Other Important Distributions

Poisson Distribution: Rare Events Over Time

How many customers arrive per hour? How many defects per batch? How many emails per day? Parameter: λ (lambda) = average rate of events.

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
# Average 5 customers per hour
lambda_rate = 5

# Probability of exactly 3 customers in an hour
p_3 = stats.poisson.pmf(3, lambda_rate)
print(f"P(3 customers): {p_3:.2%}")  # 14.04%

# Probability of 10 or more
p_10_plus = 1 - stats.poisson.cdf(9, lambda_rate)
print(f"P(10+ customers): {p_10_plus:.2%}")  # 3.18%

# Visualize
k_values = range(0, 15)
probs = [stats.poisson.pmf(k, lambda_rate) for k in k_values]

plt.figure(figsize=(10, 5))
plt.bar(k_values, probs, edgecolor='black', alpha=0.7)
plt.xlabel('Number of Customers')
plt.ylabel('Probability')
plt.title(f'Poisson Distribution (λ={lambda_rate})')
plt.show()

Exponential Distribution: Time Between Events

If events occur at rate λ, how long until the next one?
# Average 5 customers per hour = 1 customer per 12 minutes average
lambda_rate = 5  # per hour
avg_wait = 60 / lambda_rate  # 12 minutes

# Probability of waiting more than 20 minutes for the next customer
p_wait_20 = 1 - stats.expon.cdf(20, scale=avg_wait)
print(f"P(wait > 20 min): {p_wait_20:.2%}")  # 18.89%

# Waiting time that is not exceeded 90% of the time
time_90 = stats.expon.ppf(0.90, scale=avg_wait)
print(f"90% of waits are shorter than: {time_90:.1f} minutes")  # 27.6 min
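The exponential distribution has a distinctive "memoryless" property: having already waited some time does not change the distribution of the remaining wait. Since P(T > t) = e^(-t / avg_wait), this can be checked directly. The sketch below assumes SciPy and reuses the 12-minute average wait from the example above.

```python
from scipy import stats

avg_wait = 12.0  # minutes, i.e. 5 customers per hour
wait = stats.expon(scale=avg_wait)

# P(wait > 30 | already waited 10) = P(T > 30) / P(T > 10)
p_conditional = wait.sf(30) / wait.sf(10)

# P(wait > 20) starting from scratch
p_fresh = wait.sf(20)

print(f"P(T > 30 | T > 10): {p_conditional:.4f}")
print(f"P(T > 20):          {p_fresh:.4f}")
```

The two numbers match, which is why the exponential suits processes with a constant event rate and is a poor fit when risk changes with age (for example, hardware that wears out).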

Mini-Project: Quality Control System

Build a complete quality control system for the ball bearing factory.
import numpy as np
from scipy import stats

class QualityControlSystem:
    """
    Quality control system using normal distribution.
    """
    
    def __init__(self, target, std_dev, tolerance):
        """
        Initialize QC system.
        
        target: desired measurement (e.g., 10mm)
        std_dev: expected standard deviation in production
        tolerance: acceptable deviation from target (e.g., ±0.1mm)
        """
        self.target = target
        self.std_dev = std_dev
        self.tolerance = tolerance
        self.lower_limit = target - tolerance
        self.upper_limit = target + tolerance
        
    def expected_defect_rate(self):
        """Calculate expected percentage of out-of-spec products."""
        # P(X < lower) + P(X > upper)
        p_below = stats.norm.cdf(self.lower_limit, self.target, self.std_dev)
        p_above = 1 - stats.norm.cdf(self.upper_limit, self.target, self.std_dev)
        return p_below + p_above
    
    def analyze_batch(self, measurements):
        """Analyze a batch of measurements."""
        n = len(measurements)
        mean = np.mean(measurements)
        std = np.std(measurements)
        
        # Count out-of-spec
        defects = np.sum((measurements < self.lower_limit) | 
                         (measurements > self.upper_limit))
        defect_rate = defects / n
        
        # Check if process is in control
        # Mean should be within 2 standard errors of target
        std_error = std / np.sqrt(n)
        z_score = (mean - self.target) / std_error
        
        results = {
            'batch_size': n,
            'mean': mean,
            'std_dev': std,
            'defects': defects,
            'defect_rate': defect_rate,
            'z_score': z_score,
            'process_in_control': abs(z_score) < 2
        }
        
        return results
    
    def print_report(self, results):
        """Print a formatted QC report."""
        print("\n" + "=" * 50)
        print("QUALITY CONTROL REPORT")
        print("=" * 50)
        print(f"Batch Size: {results['batch_size']}")
        print(f"Target: {self.target:.4f} ± {self.tolerance:.4f}")
        print(f"Specification Limits: [{self.lower_limit:.4f}, {self.upper_limit:.4f}]")
        print("-" * 50)
        print(f"Batch Mean: {results['mean']:.4f}")
        print(f"Batch Std Dev: {results['std_dev']:.4f}")
        print(f"Defects: {results['defects']} ({results['defect_rate']:.2%})")
        print(f"Expected Defect Rate: {self.expected_defect_rate():.2%}")
        print("-" * 50)
        print(f"Z-Score: {results['z_score']:.2f}")
        status = "IN CONTROL" if results['process_in_control'] else "OUT OF CONTROL"
        print(f"Process Status: {status}")
        print("=" * 50)


# Create QC system
qc = QualityControlSystem(
    target=10.0,      # 10mm target diameter
    std_dev=0.05,     # 0.05mm expected variation
    tolerance=0.1     # ±0.1mm acceptable
)

# Expected defect rate
print(f"Expected defect rate: {qc.expected_defect_rate():.2%}")

# Simulate a good batch
np.random.seed(42)
good_batch = np.random.normal(10.0, 0.05, 100)
results = qc.analyze_batch(good_batch)
qc.print_report(results)

# Simulate a problematic batch (shifted mean)
bad_batch = np.random.normal(10.08, 0.05, 100)  # Mean shifted by 0.08mm
results_bad = qc.analyze_batch(bad_batch)
qc.print_report(results_bad)
Output (illustrative; exact values depend on the random draw):
Expected defect rate: 4.55%

==================================================
QUALITY CONTROL REPORT
==================================================
Batch Size: 100
Target: 10.0000 ± 0.1000
Specification Limits: [9.9000, 10.1000]
--------------------------------------------------
Batch Mean: 10.0024
Batch Std Dev: 0.0496
Defects: 4 (4.00%)
Expected Defect Rate: 4.55%
--------------------------------------------------
Z-Score: 0.49
Process Status: IN CONTROL
==================================================

==================================================
QUALITY CONTROL REPORT
==================================================
Batch Size: 100
Target: 10.0000 ± 0.1000
Specification Limits: [9.9000, 10.1000]
--------------------------------------------------
Batch Mean: 10.0822
Batch Std Dev: 0.0518
Defects: 33 (33.00%)
Expected Defect Rate: 4.55%
--------------------------------------------------
Z-Score: 15.88
Process Status: OUT OF CONTROL
==================================================

Practice Exercises

Exercise 1: Height Analysis

# Adult male heights in the US follow N(69.1, 2.9) inches
# (mean 69.1 inches, std dev 2.9 inches)

# Calculate:
# 1. What percentage of men are over 6 feet (72 inches)?
# 2. What percentage are between 5'6" (66 in) and 6'0" (72 in)?
# 3. How tall do you need to be to be in the top 5%?
# 4. What is the z-score for someone 6'4" (76 inches)?
mu = 69.1
sigma = 2.9

# 1. P(height > 72)
p_over_6ft = 1 - stats.norm.cdf(72, mu, sigma)
print(f"Over 6 feet: {p_over_6ft:.1%}")  # 15.9%

# 2. P(66 < height < 72)
p_between = stats.norm.cdf(72, mu, sigma) - stats.norm.cdf(66, mu, sigma)
print(f"Between 5'6\" and 6'0\": {p_between:.1%}")  # 69.9%

# 3. Top 5% height
top_5_height = stats.norm.ppf(0.95, mu, sigma)
print(f"Top 5% starts at: {top_5_height:.1f} inches")  # 73.9 inches (6'2")

# 4. Z-score for 6'4"
z_76 = (76 - mu) / sigma
print(f"Z-score for 6'4\": {z_76:.2f}")  # 2.38
print(f"Percentile: {stats.norm.cdf(z_76) * 100:.1f}%")  # 99.1%

Exercise 2: Server Requests

# A web server receives an average of 100 requests per minute.
# Requests follow a Poisson distribution.

# Calculate:
# 1. P(exactly 100 requests in a minute)
# 2. P(more than 120 requests in a minute)
# 3. For capacity planning, what number of requests per minute
#    will only be exceeded 1% of the time?
lambda_rate = 100

# 1. P(X = 100)
p_exactly_100 = stats.poisson.pmf(100, lambda_rate)
print(f"P(exactly 100): {p_exactly_100:.2%}")  # 3.99%

# 2. P(X > 120)
p_over_120 = 1 - stats.poisson.cdf(120, lambda_rate)
print(f"P(over 120): {p_over_120:.2%}")  # roughly 2.3%

# 3. 99th percentile
capacity_99 = stats.poisson.ppf(0.99, lambda_rate)
print(f"99% of minutes have at most {capacity_99:.0f} requests")  # 124

Common Mistakes to Avoid

Mistake 1: Assuming Everything is Normal
Not all data follows a normal distribution. Income data is heavily right-skewed. Time-to-event data often follows exponential distributions. Always visualize your data before assuming normality.
Mistake 2: Misusing the 68-95-99.7 Rule
This rule ONLY applies to normal distributions. Applying it to skewed data will give wrong answers. For non-normal data, use Chebyshev’s inequality: at least 75% of data is within 2 std devs, regardless of distribution shape.
Mistake 3: Confusing PDF and CDF
The PDF gives the relative likelihood at a point (technically, density). The CDF gives the probability of being less than or equal to a value. P(X = exact value) is always 0 for continuous distributions.
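Two of these mistakes can be made concrete in a few lines. The sketch below, assuming NumPy and SciPy, first shows that a PDF value is a density and can exceed 1 (using the ball bearing parameters from this module), then checks the empirical rule and Chebyshev's bound on skewed exponential data (synthetic, with an arbitrary seed).

```python
import numpy as np
from scipy import stats

# PDF vs CDF: bearing diameters ~ Normal(10.0, 0.05)
bearing = stats.norm(loc=10.0, scale=0.05)
density_at_target = bearing.pdf(10.0)             # a density, not a probability
p_in_spec = bearing.cdf(10.1) - bearing.cdf(9.9)  # a probability, from the CDF
print(f"PDF at 10.0 mm: {density_at_target:.2f}")
print(f"P(9.9 < X < 10.1): {p_in_spec:.2%}")

# Empirical rule on skewed data: the normal-only percentages do not carry over
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=100_000)
deviation = np.abs(skewed - skewed.mean())
within_1_std = np.mean(deviation <= skewed.std())
within_2_std = np.mean(deviation <= 2 * skewed.std())
print(f"Within 1 std: {within_1_std:.1%} (normal rule says ~68%)")
print(f"Within 2 std: {within_2_std:.1%} (Chebyshev guarantees at least 75%)")
```

A density above 1 is not a contradiction: the narrow spread (σ = 0.05) packs the total probability of 1 into a small range of x values.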

Interview Questions

Question: Website response times follow a normal distribution with mean 200ms and std dev 50ms. What percentage of requests take more than 300ms?
Answer: About 2.3%
from scipy import stats
p_slow = 1 - stats.norm.cdf(300, loc=200, scale=50)
# Or using z-score: z = (300-200)/50 = 2
# P(Z > 2) ≈ 0.0228
print(f"{p_slow:.2%}")  # 2.28%
The 68-95-99.7 rule gives us a quick check: 300ms is 2 standard deviations above mean, so roughly 2.5% should be above that.
Question: You’re modeling these scenarios. Which distribution would you use for each?
  1. Number of customers arriving per hour
  2. Whether a user clicks an ad (yes/no)
  3. Time until a server fails
  4. Heights of basketball players
Answer:
  1. Poisson - Counts of events in fixed intervals
  2. Bernoulli (single trial) or Binomial (many users) - Binary outcomes
  3. Exponential - Time until an event (memoryless process)
  4. Normal - Continuous measurements of natural phenomena
For height, you might also consider that basketball players are selected to be tall, so it could be a truncated normal!
Question: User session times are heavily right-skewed (not normal). You calculate the average session time each day for 30 days. What distribution does the sample mean follow?
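Answer: Approximately normal! Thanks to the Central Limit Theorem, the sampling distribution of each daily mean will be approximately normal regardless of the underlying distribution shape, as long as:
  • The number of sessions averaged each day is sufficiently large (n ≥ 30 is a common rule of thumb, though heavily skewed data may need more). Note that n here is the sessions per day, not the 30 days.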
  • The original distribution has finite variance
This is why we can use confidence intervals and hypothesis tests based on the normal distribution even when the underlying data isn’t normal.
Question: Video start times follow a log-normal distribution (right-skewed). The P50 is 1.2 seconds and P95 is 4.8 seconds. What does this tell you about user experience?
Answer:
  • Half of users experience start times of 1.2s or less (good!)
  • 5% of users wait more than 4.8 seconds (potentially frustrating)
  • The ratio P95/P50 = 4 indicates significant variability
For right-skewed metrics like latency, the P95 or P99 is often more important than the mean because it captures the experience of the “unlucky” users. A 4x difference between median and P95 suggests there are edge cases worth investigating (slow CDNs, distant users, etc.).
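If the start times really are log-normal, the two percentiles are enough to pin down the whole distribution. The sketch below is an illustration under that assumption, using SciPy's `lognorm` parameterization (`s` is the standard deviation of the log, `scale` is the exponential of the mean of the log); the derived mean and P99 are model estimates, not measurements.

```python
import numpy as np
from scipy import stats

p50, p95 = 1.2, 4.8  # seconds, from the question

# For a log-normal, log(X) is normal: median = exp(mu), P95 = exp(mu + z95 * sigma)
mu_log = np.log(p50)
z95 = stats.norm.ppf(0.95)
sigma_log = (np.log(p95) - mu_log) / z95

start_time = stats.lognorm(s=sigma_log, scale=np.exp(mu_log))
mean_start = start_time.mean()
p99 = start_time.ppf(0.99)

print(f"sigma of log start time: {sigma_log:.3f}")
print(f"Implied mean start time: {mean_start:.2f} s")
print(f"Implied P99 start time:  {p99:.2f} s")
```

The implied mean sits well above the median, which is the usual signature of right skew and one reason the mean alone is a misleading summary for latency.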

Practice Challenge

You have real website session data. Determine which distribution best fits it:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated session durations (in seconds)
np.random.seed(42)
sessions = np.random.exponential(scale=120, size=1000)  # Unknown to you!

# Your task:
# 1. Visualize the data with a histogram
# 2. Calculate summary statistics
# 3. Fit different distributions and compare
# 4. Determine which distribution fits best

# Hint: Try normal, exponential, and log-normal

# Starter code:
plt.figure(figsize=(12, 4))

# Histogram
plt.subplot(1, 3, 1)
plt.hist(sessions, bins=50, density=True, alpha=0.7)
plt.title('Data Distribution')
plt.xlabel('Session Duration (s)')

# Q-Q plot for normal
plt.subplot(1, 3, 2)
stats.probplot(sessions, dist="norm", plot=plt)
plt.title('Normal Q-Q Plot')

# Q-Q plot for exponential
plt.subplot(1, 3, 3)
stats.probplot(sessions, dist="expon", plot=plt)
plt.title('Exponential Q-Q Plot')

plt.tight_layout()
plt.show()

# Fit distributions and compare
Solution:
# 1. Visual inspection shows right-skewed data
# 2. Summary stats
print(f"Mean: {np.mean(sessions):.1f}s")
print(f"Median: {np.median(sessions):.1f}s")
print(f"Std: {np.std(sessions):.1f}s")
print(f"Skewness: {stats.skew(sessions):.2f}")  # Positive = right-skewed

# 3. Fit distributions
# Normal
norm_params = stats.norm.fit(sessions)
# Exponential
exp_params = stats.expon.fit(sessions)
# Log-normal
lognorm_params = stats.lognorm.fit(sessions)

# 4. Compare using Kolmogorov-Smirnov test
# Null hypothesis: data follows the distribution
# Lower p-value = worse fit

ks_norm = stats.kstest(sessions, 'norm', args=norm_params)
ks_exp = stats.kstest(sessions, 'expon', args=exp_params)
ks_lognorm = stats.kstest(sessions, 'lognorm', args=lognorm_params)

print(f"\nKS Test p-values:")
print(f"Normal: {ks_norm.pvalue:.4f}")      # Low - bad fit
print(f"Exponential: {ks_exp.pvalue:.4f}")  # High - good fit!
print(f"Log-normal: {ks_lognorm.pvalue:.4f}")

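# Exponential should fit best here. A quick sanity check: mean ≈ std dev is a property of the exponential.
# Caveat: KS p-values are optimistic when the parameters were fitted from the same data.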

📝 Practice Exercises

Exercise 1

Work with normal distribution and z-scores

Exercise 2

Apply binomial distribution to A/B testing

Exercise 3

Model customer arrivals with Poisson distribution

Exercise 4

Real-world: Quality control with distributions

Key Takeaways

Distribution Types

  • Discrete: Countable outcomes (die rolls, counts)
  • Continuous: Any value in a range (measurements)
  • Each distribution has parameters that define its shape

The Normal Distribution

  • Defined by mean (μ) and standard deviation (σ)
  • 68-95-99.7 rule for quick calculations
  • Appears everywhere due to Central Limit Theorem

Key Distributions

  • Uniform: Equal probability (dice, random selection)
  • Binomial: Success/failure experiments (conversions, defects)
  • Normal: Continuous measurements (heights, errors)
  • Poisson: Count of rare events (arrivals, defects)

Z-Scores

  • Standardize any normal distribution
  • z = (x - μ) / σ
  • Allows comparison across different scales
  • Standard normal has μ=0, σ=1

Interview Prep: Common Questions

Q: When would you use Poisson vs Binomial distribution?
Poisson: Counting events in continuous time/space where events are rare (website visits, defects). Binomial: Fixed number of trials with binary outcomes (10 coin flips, 100 users converting).
Q: How do you check if data is normally distributed?
Visual: histogram, Q-Q plot. Statistical: Shapiro-Wilk test, Anderson-Darling test. Rule of thumb: Check skewness (< 2) and kurtosis (< 7).
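The checks in that answer can be run together in a few lines. This is a minimal sketch on synthetic data (one normal sample and one exponential sample, both assumptions for illustration), using `scipy.stats.shapiro`, `skew`, and `kurtosis`. Note that `stats.kurtosis` returns excess kurtosis by default, so a normal distribution scores about 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed
samples = {
    "normal": rng.normal(100, 15, size=500),
    "exponential": rng.exponential(120, size=500),
}

report = {}
for name, data in samples.items():
    shapiro_stat, shapiro_p = stats.shapiro(data)
    report[name] = {
        "skew": stats.skew(data),
        "excess_kurtosis": stats.kurtosis(data),
        "shapiro_p": shapiro_p,
    }

for name, row in report.items():
    print(f"{name:>12}: skew={row['skew']:.2f}, "
          f"excess kurtosis={row['excess_kurtosis']:.2f}, "
          f"Shapiro p={row['shapiro_p']:.4f}")
```

A very small Shapiro-Wilk p-value is evidence against normality. With large samples the test flags even trivial departures, so it is worth pairing with a Q-Q plot.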
Q: What is the Central Limit Theorem and why does it matter?
CLT states that sample means approach a normal distribution regardless of population distribution, given large enough samples (n ≥ 30). It’s why we can use normal-based methods even when data isn’t normally distributed.
Q: A process has 2% defect rate. What distribution models the number of defects in a batch of 50?
Binomial with n=50, p=0.02. Expected defects = np = 1. Could approximate with Poisson(λ=1) since n is large and p is small.
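The quality of the Poisson approximation in that last answer can be checked directly. The sketch below, assuming SciPy, compares the Binomial(50, 0.02) and Poisson(1) probabilities side by side; `max_gap` is an illustrative name for the largest difference between the two PMFs.

```python
from scipy import stats

n, p = 50, 0.02
lam = n * p  # Poisson rate matching the binomial mean

print(" k  Binomial  Poisson")
for k in range(5):
    b = stats.binom.pmf(k, n, p)
    q = stats.poisson.pmf(k, lam)
    print(f"{k:>2}  {b:.4f}    {q:.4f}")

# Largest absolute difference across all possible defect counts
max_gap = max(
    abs(stats.binom.pmf(k, n, p) - stats.poisson.pmf(k, lam))
    for k in range(n + 1)
)
print(f"Largest PMF difference: {max_gap:.4f}")
```

The approximation improves as n grows and p shrinks with n·p held fixed, which is the regime the interview answer describes.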

Common Pitfalls

Distribution Mistakes to Avoid:
  1. Assuming Normality - Always check; many real-world distributions are skewed or heavy-tailed
  2. Confusing Parameters - Variance (σ²) vs Standard Deviation (σ); Population vs Sample
  3. Ignoring Distribution Shape - Mean/std alone don’t fully describe a distribution; visualize first
  4. Wrong Distribution Choice - Using normal for bounded data, using binomial for continuous outcomes
  5. CLT Misapplication - CLT applies to sample means, not individual observations

Connection to Machine Learning

| Distribution Concept | ML Application |
| --- | --- |
| Normal distribution | Gaussian noise, regularization, Gaussian Naive Bayes |
| Central Limit Theorem | Why batch statistics work, confidence in predictions |
| Z-scores | Feature standardization, batch normalization |
| Binomial | Classification evaluation, confidence intervals |
| Poisson | Count prediction, event modeling |
ML Connection: When you see “Gaussian” in ML papers, it means “normal distribution.” Gaussian processes, Gaussian mixture models, and Gaussian noise all rely on properties of the normal distribution you just learned!
Coming up next: We’ll learn about Statistical Inference - how to draw conclusions about entire populations from just samples. This is how polls predict elections and A/B tests drive decisions.


Interview Deep-Dive

Strong Answer:
  • The choice depends on the nature of the data-generating process, not on what the histogram looks like. Poisson is the right choice when you are counting events in a continuous interval (tickets per hour) where events are independent and occur at a roughly constant rate. It has one parameter (lambda, the average rate) and its variance equals its mean.
  • Binomial is correct when you have a fixed number of discrete trials each with a binary outcome — for example, “out of 500 customers who contacted us, how many submitted a ticket?” It requires knowing the number of trials and the success probability.
  • Normal might be appropriate if you are looking at the average number of tickets per day over many days. By the Central Limit Theorem, the daily averages will be approximately normal even if individual arrivals follow a Poisson process. But you would not use normal for the raw counts because counts cannot be negative, and the normal distribution extends to negative infinity.
  • In practice, I would start by checking whether the mean and variance of the ticket counts are roughly equal. If they are, Poisson is a good fit. If the variance is much larger than the mean (overdispersion), I would consider a Negative Binomial distribution instead, which adds a dispersion parameter. Overdispersion is extremely common in real ticket data because arrival rates are not actually constant — they vary by time of day, day of week, and whether there was a product incident.
Follow-up: Your ticket data shows variance that is 4x the mean. What does this tell you and how do you handle it?
Variance much larger than the mean is overdispersion, and it means the Poisson assumption is violated. This typically happens because the arrival rate itself is not constant: it varies over time or across customer segments. Using Poisson in this situation would underestimate the probability of extreme counts (many tickets or zero tickets) and give overly narrow prediction intervals. The fix is to use a Negative Binomial distribution, which explicitly models this extra variation. Alternatively, you can build a hierarchical model: the arrival rate lambda follows a Gamma distribution across time periods, and conditional on lambda, counts follow a Poisson. This is actually mathematically equivalent to the Negative Binomial and gives you a richer understanding of what is driving the overdispersion.
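The Gamma-Poisson construction can be simulated to see overdispersion appear. In the sketch below (synthetic data; the mean of 20 tickets per hour and the seed are assumptions for illustration), the Gamma shape is chosen so that variance/mean comes out near 4, using Var = μ + μ²/shape for the mixture.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
mean_rate = 20.0     # assumed average tickets per hour
target_ratio = 4.0   # desired variance / mean
n_hours = 50_000

# Mixture variance is mu + mu^2 / shape, so ratio = 1 + mu / shape
shape = mean_rate / (target_ratio - 1.0)

# Step 1: each hour gets its own rate, drawn from a Gamma with mean mean_rate
rates = rng.gamma(shape=shape, scale=mean_rate / shape, size=n_hours)
# Step 2: given the rate, the count is Poisson
mixed_counts = rng.poisson(rates)

# Baseline: plain Poisson with a constant rate
plain_counts = rng.poisson(mean_rate, size=n_hours)

ratio_mixed = mixed_counts.var() / mixed_counts.mean()
ratio_plain = plain_counts.var() / plain_counts.mean()
print(f"Variance/mean, Gamma-Poisson mixture: {ratio_mixed:.2f}")
print(f"Variance/mean, plain Poisson:         {ratio_plain:.2f}")
```

Both series have the same average, so a dashboard showing only the mean would not distinguish them; the variance-to-mean ratio does.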
Strong Answer:
  • For the product manager: “Imagine you survey 100 random customers and compute the average satisfaction score. If you repeated that survey many times, each time with a different random 100 customers, those averages would form a bell curve — even if individual satisfaction scores are not bell-shaped at all. The Central Limit Theorem says that averages of random samples become predictable and bell-shaped as long as your sample is large enough. That is why we can compute a margin of error on any survey or test result.”
  • For the technical layer: the CLT states that the sampling distribution of the sample mean converges to a normal distribution as sample size increases, regardless of the population distribution, provided the population has finite variance. The rate of convergence depends on how “non-normal” the underlying distribution is — highly skewed distributions need larger n.
  • For A/B testing specifically, the CLT is the entire foundation. When you compare conversion rates between two groups, each conversion rate is a sample mean (of a Bernoulli variable). The CLT guarantees that the difference between these means is approximately normally distributed, which is why you can use a z-test to compute a p-value. Without the CLT, you would need to know the exact distribution of your metric to do any hypothesis testing.
  • The practical caveat: the CLT needs “large enough” samples, and “large enough” depends on the distribution. For proportions near 0.5, n=30 is usually fine. For proportions near 0.01 (like conversion rates), you might need n=500 or more before the normal approximation is accurate. This is why very low conversion rate tests need more traffic.
Follow-up: When does the CLT fail or give misleading results, even with a large sample?
The CLT fails when the underlying distribution does not have a finite variance. The canonical example is a Cauchy distribution (heavy-tailed), where the sample mean does not converge to anything normal no matter how many samples you take. In practice, this matters for financial data: stock returns have heavier tails than normal, and models that assume normal distributions (like VaR) systematically underestimate tail risk. The 2008 financial crisis was partly caused by this exact mistake. Another practical failure mode is when your data has structural dependencies that violate the “independent and identically distributed” assumption, like time-series data with autocorrelation or clustered data where observations within a cluster are correlated. In those cases, the effective sample size is much smaller than the nominal sample size, and the CLT-based confidence intervals are too narrow.
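The Cauchy failure is easy to observe in simulation. The sketch below, assuming NumPy's `standard_cauchy` generator, compares how tightly sample means of 1,000 draws cluster for normal data versus Cauchy data. The interquartile range (IQR) is used as the spread measure because the Cauchy has no finite variance.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_repeats, n = 2_000, 1_000

# Mean of n draws, repeated n_repeats times, for each distribution
normal_means = rng.normal(size=(n_repeats, n)).mean(axis=1)
cauchy_means = rng.standard_cauchy(size=(n_repeats, n)).mean(axis=1)

def iqr(values):
    """Interquartile range: robust spread that exists even without a variance."""
    q75, q25 = np.percentile(values, [75, 25])
    return q75 - q25

iqr_normal = iqr(normal_means)
iqr_cauchy = iqr(cauchy_means)
print(f"IQR of normal sample means: {iqr_normal:.3f}")
print(f"IQR of Cauchy sample means: {iqr_cauchy:.3f}")
```

Averaging 1,000 normal draws shrinks the spread by a factor of about √1000, while the average of 1,000 Cauchy draws is just as spread out as a single draw.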
Strong Answer:
  • Before stopping the line (which is expensive), I need to determine if 18 defects in 500 is statistically inconsistent with the expected 2% rate. Under the null hypothesis of 2%, the expected number of defects is 10, and the standard deviation is sqrt(500 x 0.02 x 0.98) = approximately 3.13.
  • The z-score for 18 defects is (18 - 10) / 3.13 = 2.56, giving a one-tailed p-value of about 0.005. This is well below the typical 0.05 threshold. So statistically, yes, 18 defects is very unlikely if the true rate is still 2%.
  • However, the statistical answer is only half the decision. I would also consider: Is this a sudden spike or a gradual trend? (Check a control chart for the last several batches.) What is the cost of stopping the line versus the cost of shipping defective products? Is there a known assignable cause (like a new material batch or a maintenance event)?
  • In a Six Sigma framework, this would trigger an investigation but not necessarily an immediate line stop. I would pull the last 5 batches of data and look at a Shewhart control chart. If the process mean has shifted (as opposed to one unlucky batch), that warrants corrective action. If this is a single batch anomaly, the response might be different — inspect remaining inventory from this batch rather than shutting everything down.
Follow-up: What is the difference between using a binomial exact test versus a normal approximation here, and when does it matter?
For n=500 and p=0.02, the usual rule of thumb for the normal approximation is met because np=10 and n(1-p)=490 are both greater than 5. The binomial exact test gives P(X >= 18 given n=500, p=0.02) directly without the normal approximation. Both approaches reach the same conclusion here, but the tail probabilities are not identical: with p this small the binomial is right-skewed, so the exact upper-tail probability (roughly 0.013) is noticeably larger than the 0.005 from the z-score. The gap grows when either np or n(1-p) is small, which happens with very rare events (like a 0.01% defect rate tested on 100 items). In those cases, the normal approximation can be meaningfully wrong, and you should use the exact binomial or a Poisson approximation instead. In modern practice, there is little reason not to use the exact test since computational cost is negligible, but understanding when the approximation breaks helps you catch errors in older tools that default to normal.
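One way to compare the two calculations for this batch is to compute both tails directly. The sketch below assumes SciPy; the variable names are illustrative, and the continuity-corrected and Poisson versions are included only for comparison.

```python
import numpy as np
from scipy import stats

n_items, p_defect, observed = 500, 0.02, 18

# Exact binomial tail: sf(17) = P(X > 17) = P(X >= 18)
exact_tail = stats.binom.sf(observed - 1, n_items, p_defect)

# Normal approximation, as in the z-score calculation
mean = n_items * p_defect
sd = np.sqrt(n_items * p_defect * (1 - p_defect))
normal_tail = stats.norm.sf((observed - mean) / sd)

# Normal approximation with a continuity correction (use 17.5 instead of 18)
normal_tail_cc = stats.norm.sf((observed - 0.5 - mean) / sd)

# Poisson approximation with the same mean
poisson_tail = stats.poisson.sf(observed - 1, mean)

print(f"Exact binomial:          {exact_tail:.4f}")
print(f"Normal approximation:    {normal_tail:.4f}")
print(f"Normal, continuity corr: {normal_tail_cc:.4f}")
print(f"Poisson approximation:   {poisson_tail:.4f}")
```

All four values fall below 0.05, so the decision to investigate is the same. The differences between them matter more when a result sits close to the chosen threshold.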