You run a factory that produces ball bearings. Each bearing should be exactly 10 mm in diameter. But manufacturing isn’t perfect - there’s always some variation. You measure 1000 bearings and get:
A probability distribution describes all possible values a random variable can take and how likely each value is. Think of it as a complete map of possibilities.

Analogy: A probability distribution is like a city’s terrain map. The peaks show where values cluster (common outcomes), and the valleys show where values are rare. Just as different cities have different landscapes (some flat like the uniform, some with a single mountain like the normal, some with a long tail running to the east like the exponential), different types of data have different distributional shapes. Learning to “read the terrain” of your data is one of the most valuable skills in ML.
# Discrete: Number of heads in 10 coin flips
# Can only be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10

# Continuous: Height of a person
# Can be 170.0 cm, 170.1 cm, 170.01 cm, 170.001 cm...
import numpy as np

# Random time a customer arrives between 9:00 and 10:00 AM
arrival_minutes = np.random.uniform(0, 60, size=1000)
print(f"Mean arrival: {np.mean(arrival_minutes):.1f} minutes after 9:00")
print(f"P(arrive in first 15 min): {np.mean(arrival_minutes < 15):.1%}")
from scipy import stats
import matplotlib.pyplot as plt

# Number of heads in 10 fair coin flips
n = 10
p = 0.5
k_values = range(0, n + 1)
probabilities = [stats.binom.pmf(k, n, p) for k in k_values]

plt.figure(figsize=(10, 5))
plt.bar(k_values, probabilities, edgecolor='black', alpha=0.7)
plt.xlabel('Number of Heads')
plt.ylabel('Probability')
plt.title(f'Binomial Distribution (n={n}, p={p})')
plt.xticks(k_values)
plt.show()
# Your website has a 3% conversion rate
# 100 people visit today
# What's the probability of 5 or more conversions?
n = 100
p = 0.03

# P(X >= 5) = 1 - P(X <= 4)
p_at_least_5 = 1 - stats.binom.cdf(4, n, p)
print(f"P(5+ conversions): {p_at_least_5:.1%}")  # 18.2%

# Expected conversions
expected = n * p
print(f"Expected conversions: {expected}")  # 3.0
Practice: Quality Control
A factory produces items with a 2% defect rate. In a batch of 50 items:
What’s the probability of exactly 0 defects?
What’s the probability of 3 or more defects?
How many defects do you expect?
n, p = 50, 0.02

# 1. P(X = 0)
p_zero = stats.binom.pmf(0, n, p)
print(f"P(0 defects): {p_zero:.1%}")  # 36.4%

# 2. P(X >= 3)
p_three_plus = 1 - stats.binom.cdf(2, n, p)
print(f"P(3+ defects): {p_three_plus:.1%}")  # 7.8%

# 3. Expected defects
expected = n * p
print(f"Expected defects: {expected}")  # 1.0
A z-score tells you how many standard deviations a value is from the mean.

$$z = \frac{x - \mu}{\sigma}$$
def z_score(x, mu, sigma):
    """Convert value to z-score."""
    return (x - mu) / sigma

# How exceptional is an IQ of 130?
iq = 130
z = z_score(iq, mu=100, sigma=15)
print(f"IQ of 130 has z-score: {z:.2f}")  # 2.00

# This means 130 is 2 standard deviations above average
# Only about 2.3% of people score higher
percentile = stats.norm.cdf(z) * 100
print(f"Percentile: {percentile:.1f}%")  # 97.7%
The Central Limit Theorem (CLT) explains this magic:
Central Limit Theorem: When you add up many independent random variables, their sum tends toward a normal distribution - regardless of the original distributions.
# Roll a single die - definitely NOT normal
single_die = np.random.randint(1, 7, 10000)

# Sum of 2 dice - starting to look different
sum_2_dice = np.array([np.random.randint(1, 7, 2).sum() for _ in range(10000)])

# Sum of 10 dice - getting bell-shaped
sum_10_dice = np.array([np.random.randint(1, 7, 10).sum() for _ in range(10000)])

# Sum of 30 dice - nearly perfect normal!
sum_30_dice = np.array([np.random.randint(1, 7, 30).sum() for _ in range(10000)])

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

axes[0, 0].hist(single_die, bins=6, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Single Die (Uniform)')

axes[0, 1].hist(sum_2_dice, bins=11, edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Sum of 2 Dice')

axes[1, 0].hist(sum_10_dice, bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_title('Sum of 10 Dice')

axes[1, 1].hist(sum_30_dice, bins=40, edgecolor='black', alpha=0.7)
axes[1, 1].set_title('Sum of 30 Dice (Nearly Normal!)')

plt.tight_layout()
plt.show()
This is why heights are normally distributed: Height is determined by thousands of genes, each adding a small random effect. Sum of many small random things = normal distribution.
ML Application: Batch Normalization and the CLT. The Central Limit Theorem helps explain why batch normalization is a natural fit for deep networks. Each unit in a layer sums many weighted inputs, and by the CLT those sums tend to be roughly bell-shaped. Batch normalization re-centers and re-scales each activation to zero mean and unit variance over the mini-batch (it standardizes the values; it does not force them to be normal), which stabilizes training and allows higher learning rates. When someone asks “why does batch norm help?”, the CLT is one part of the statistical intuition behind the answer.
Statistical Mistake in ML — Assuming Normality of Features: Many ML practitioners apply z-score standardization and assume their features are normally distributed. But real-world features like income, click counts, and session durations are often heavily skewed. Before standardizing, plot your distributions. For right-skewed data, a log transform before standardization often dramatically improves model performance — especially for linear models and neural networks that implicitly assume symmetric inputs.
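The effect of a log transform on a skewed feature can be sketched as follows. This is a minimal illustration, assuming a hypothetical income feature simulated as log-normal; the variable names and parameters are made up for the example, and `np.log` requires strictly positive values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature (e.g. income), simulated as log-normal
income = rng.lognormal(mean=10.5, sigma=0.8, size=10_000)

def standardize(x):
    """Z-score standardization: zero mean, unit variance."""
    return (x - x.mean()) / x.std()

raw_z = standardize(income)          # standardized, but still skewed
log_z = standardize(np.log(income))  # log first, then standardize

print(f"Skewness after standardizing raw values: {stats.skew(raw_z):.2f}")
print(f"Skewness after log + standardizing:      {stats.skew(log_z):.2f}")
```

Standardizing only shifts and rescales the values, so the skew of the raw feature survives it; the log step is what removes the skew in this simulated case.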
How many customers arrive per hour? How many defects per batch? How many emails per day?

Parameter: λ (lambda) = average rate of events

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
# Average 5 customers per hour
lambda_rate = 5

# Probability of exactly 3 customers in an hour
p_3 = stats.poisson.pmf(3, lambda_rate)
print(f"P(3 customers): {p_3:.2%}")  # 14.04%

# Probability of 10 or more
p_10_plus = 1 - stats.poisson.cdf(9, lambda_rate)
print(f"P(10+ customers): {p_10_plus:.2%}")  # 3.18%

# Visualize
k_values = range(0, 15)
probs = [stats.poisson.pmf(k, lambda_rate) for k in k_values]

plt.figure(figsize=(10, 5))
plt.bar(k_values, probs, edgecolor='black', alpha=0.7)
plt.xlabel('Number of Customers')
plt.ylabel('Probability')
plt.title(f'Poisson Distribution (λ={lambda_rate})')
plt.show()
If events occur at rate λ, how long until the next one?
# Average 5 customers per hour = 1 customer per 12 minutes on average
lambda_rate = 5  # per hour
avg_wait = 60 / lambda_rate  # 12 minutes

# Probability of waiting more than 20 minutes for the next customer
p_wait_20 = 1 - stats.expon.cdf(20, scale=avg_wait)
print(f"P(wait > 20 min): {p_wait_20:.2%}")  # 18.89%

# Wait time that 90% of gaps between arrivals fall under
time_90 = stats.expon.ppf(0.90, scale=avg_wait)
print(f"90% of waits are shorter than: {time_90:.1f} minutes")  # 27.6 min
# Adult male heights in the US follow N(69.1, 2.9) inches
# (mean 69.1 inches, std dev 2.9 inches)
# Calculate:
# 1. What percentage of men are over 6 feet (72 inches)?
# 2. What percentage are between 5'6" (66 in) and 6'0" (72 in)?
# 3. How tall do you need to be to be in the top 5%?
# 4. What is the z-score for someone 6'4" (76 inches)?
# A web server receives an average of 100 requests per minute.
# Requests follow a Poisson distribution.
# Calculate:
# 1. P(exactly 100 requests in a minute)
# 2. P(more than 120 requests in a minute)
# 3. For capacity planning, what number of requests per minute
#    will only be exceeded 1% of the time?
Mistake 1: Assuming Everything is Normal
Not all data follows a normal distribution. Income data is heavily right-skewed. Time-to-event data often follows exponential distributions. Always visualize your data before assuming normality.
Mistake 2: Misusing the 68-95-99.7 Rule
This rule ONLY applies to normal distributions. Applying it to skewed data will give wrong answers. For non-normal data, use Chebyshev’s inequality: at least 75% of data is within 2 std devs, regardless of distribution shape.
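A quick simulation makes the difference concrete. The sketch below compares simulated normal data with simulated exponential data (both chosen only for illustration) and counts how much of each falls within 1 and 2 standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
normal_data = rng.normal(loc=0, scale=1, size=100_000)
skewed_data = rng.exponential(scale=1.0, size=100_000)

def fraction_within(x, k):
    """Fraction of values within k standard deviations of the mean."""
    return np.mean(np.abs(x - x.mean()) <= k * x.std())

coverage = {}
for name, data in [("normal", normal_data), ("exponential", skewed_data)]:
    coverage[name] = (fraction_within(data, 1), fraction_within(data, 2))
    print(f"{name:12s} within 1 sd: {coverage[name][0]:.1%}, "
          f"within 2 sd: {coverage[name][1]:.1%}")

# Chebyshev's guarantee for k = 2, valid for any distribution with finite variance
chebyshev_bound_2 = 1 - 1 / 2**2
print(f"Chebyshev lower bound for 2 sd: {chebyshev_bound_2:.0%}")
```

For an exponential distribution the share within one standard deviation is 1 − e⁻² ≈ 86.5% in theory, not 68%, so the empirical rule misleads there. Chebyshev's 75% bound still holds, but it is only a floor, not an estimate.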
Mistake 3: Confusing PDF and CDF
The PDF gives the relative likelihood at a point (technically, density). The CDF gives the probability of being less than or equal to a value. P(X = exact value) is always 0 for continuous distributions.
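The distinction shows up directly in `scipy.stats`. The following sketch assumes a hypothetical height distribution N(170, 10) in cm, chosen only for illustration.

```python
from scipy import stats

heights = stats.norm(loc=170, scale=10)  # hypothetical heights in cm

density_at_170 = heights.pdf(170)                # a density, not a probability
p_below_170 = heights.cdf(170)                   # P(X <= 170)
p_between = heights.cdf(180) - heights.cdf(160)  # P(160 <= X <= 180)

# A density can exceed 1 when the distribution is narrow
narrow = stats.norm(loc=0, scale=0.1)
density_narrow = narrow.pdf(0)

print(f"pdf(170) = {density_at_170:.4f}  (density)")
print(f"cdf(170) = {p_below_170:.2f}  (probability)")
print(f"P(160 <= X <= 180) = {p_between:.1%}")
print(f"pdf of N(0, 0.1) at 0 = {density_narrow:.2f}  (greater than 1)")
```

Probabilities for continuous variables always come from differences of the CDF over an interval. A PDF value above 1 is not an error, because only the area under the curve has to equal 1.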
Question 1: Normal Distribution Application (Google)
Question: Website response times follow a normal distribution with mean 200ms and std dev 50ms. What percentage of requests take more than 300ms?
Answer: About 2.3%
from scipy import stats

p_slow = 1 - stats.norm.cdf(300, loc=200, scale=50)
# Or using z-score: z = (300-200)/50 = 2
# P(Z > 2) ≈ 0.0228
print(f"{p_slow:.2%}")  # 2.28%
The 68-95-99.7 rule gives us a quick check: 300ms is 2 standard deviations above mean, so roughly 2.5% should be above that.
Question 2: Choosing the Right Distribution (Amazon)
Question: You’re modeling these scenarios. Which distribution would you use for each?
Number of customers arriving per hour
Whether a user clicks an ad (yes/no)
Time until a server fails
Heights of basketball players
Answer:
Poisson - Counts of events in fixed intervals
Bernoulli (single trial) or Binomial (many users) - Binary outcomes
Exponential - Time until an event (memoryless process)
Normal - Continuous measurements of natural phenomena
For height, you might also consider that basketball players are selected to be tall, so it could be a truncated normal!
Question 3: Central Limit Theorem (Facebook/Meta)
Question: User session times are heavily right-skewed (not normal). You calculate the average session time each day for 30 days. What distribution does the sample mean follow?
Answer: Approximately normal!
Thanks to the Central Limit Theorem, the sampling distribution of the mean will be approximately normal regardless of the underlying distribution shape, as long as:
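Sample size is sufficiently large (n ≥ 30 is a common rule of thumb; here n is the number of sessions averaged each day, not the 30 days, and heavily skewed data may need more)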
The original distribution has finite variance
This is why we can use confidence intervals and hypothesis tests based on the normal distribution even when the underlying data isn’t normal.
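A small simulation illustrates the claim. It is a sketch under stated assumptions: session times are simulated as exponential with a mean of 120 s, each "day" averages 200 sessions, and many days are simulated so the sampling distribution is visible. None of these numbers come from the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_per_day = 200   # sessions averaged per day (assumed)
n_days = 5_000    # many simulated days, to see the sampling distribution

# Each row is one day of heavily right-skewed session times
sessions = rng.exponential(scale=120, size=(n_days, n_per_day))
daily_means = sessions.mean(axis=1)

raw_skew = stats.skew(sessions.ravel())
mean_skew = stats.skew(daily_means)
expected_se = 120 / np.sqrt(n_per_day)  # sigma / sqrt(n) for this exponential

print(f"Skewness of individual sessions: {raw_skew:.2f}")
print(f"Skewness of daily means:         {mean_skew:.2f}")
print(f"Std of daily means: {daily_means.std():.2f} (theory: {expected_se:.2f})")
```

The individual sessions stay strongly skewed, while the daily means are close to symmetric and their spread matches σ/√n, which is what normal-based confidence intervals rely on.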
Question 4: Percentiles in Practice (Netflix)
Question: Video start times follow a log-normal distribution (right-skewed). The P50 is 1.2 seconds and P95 is 4.8 seconds. What does this tell you about user experience?
Answer:
Half of users experience start times of 1.2s or less (good!)
5% of users wait more than 4.8 seconds (potentially frustrating)
The ratio P95/P50 = 4 indicates significant variability
For right-skewed metrics like latency, the P95 or P99 is often more important than the mean because it captures the experience of the “unlucky” users. A 4x difference between median and P95 suggests there are edge cases worth investigating (slow CDNs, distant users, etc.).
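If the log-normal assumption is taken at face value, the two quoted percentiles are enough to pin down the whole distribution. The sketch below does this; it assumes an exact log-normal fit (real latency data rarely fits this cleanly), and the variable names are chosen for the example.

```python
import numpy as np
from scipy import stats

p50, p95 = 1.2, 4.8  # seconds, from the question

# For a log-normal, log(X) is normal: the median is exp(mu),
# and P95 sits z_0.95 standard deviations above mu on the log scale.
mu = np.log(p50)
sigma = (np.log(p95) - mu) / stats.norm.ppf(0.95)

# scipy parameterizes the log-normal with s = sigma and scale = exp(mu)
start_time = stats.lognorm(s=sigma, scale=np.exp(mu))

mean_start = start_time.mean()
p99 = start_time.ppf(0.99)

print(f"sigma on log scale: {sigma:.3f}")
print(f"Implied mean start time: {mean_start:.2f}s (median is {p50}s)")
print(f"Implied P99: {p99:.1f}s")
```

Under this assumption the mean sits well above the median and the P99 is far beyond the P95, which is the tail behavior that averages hide.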
You have real website session data. Determine which distribution best fits it:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated session durations (in seconds)
np.random.seed(42)
sessions = np.random.exponential(scale=120, size=1000)  # Unknown to you!

# Your task:
# 1. Visualize the data with a histogram
# 2. Calculate summary statistics
# 3. Fit different distributions and compare
# 4. Determine which distribution fits best
# Hint: Try normal, exponential, and log-normal

# Starter code:
plt.figure(figsize=(12, 4))

# Histogram
plt.subplot(1, 3, 1)
plt.hist(sessions, bins=50, density=True, alpha=0.7)
plt.title('Data Distribution')
plt.xlabel('Session Duration (s)')

# Q-Q plot for normal
plt.subplot(1, 3, 2)
stats.probplot(sessions, dist="norm", plot=plt)
plt.title('Normal Q-Q Plot')

# Q-Q plot for exponential
plt.subplot(1, 3, 3)
stats.probplot(sessions, dist="expon", plot=plt)
plt.title('Exponential Q-Q Plot')

plt.tight_layout()
plt.show()

# Fit distributions and compare
Solution:
# 1. Visual inspection shows right-skewed data

# 2. Summary stats
print(f"Mean: {np.mean(sessions):.1f}s")
print(f"Median: {np.median(sessions):.1f}s")
print(f"Std: {np.std(sessions):.1f}s")
print(f"Skewness: {stats.skew(sessions):.2f}")  # Positive = right-skewed

# 3. Fit distributions
# Normal
norm_params = stats.norm.fit(sessions)
# Exponential
exp_params = stats.expon.fit(sessions)
# Log-normal
lognorm_params = stats.lognorm.fit(sessions)

# 4. Compare using Kolmogorov-Smirnov test
# Null hypothesis: data follows the distribution
# Lower p-value = worse fit
# Note: the parameters were estimated from the same data, so these p-values
# are optimistic; use them to rank the candidates, not as formal tests
ks_norm = stats.kstest(sessions, 'norm', args=norm_params)
ks_exp = stats.kstest(sessions, 'expon', args=exp_params)
ks_lognorm = stats.kstest(sessions, 'lognorm', args=lognorm_params)

print("\nKS Test p-values:")
print(f"Normal: {ks_norm.pvalue:.4f}")  # Low - bad fit
print(f"Exponential: {ks_exp.pvalue:.4f}")  # High - good fit!
print(f"Log-normal: {ks_lognorm.pvalue:.4f}")

# Exponential wins because mean ≈ std dev (property of exponential)
Q: When would you use Poisson vs Binomial distribution?
Poisson: Counting events in continuous time/space where events are rare (website visits, defects). Binomial: Fixed number of trials with binary outcomes (10 coin flips, 100 users converting).
Q: How do you check if data is normally distributed?
Visual: histogram, Q-Q plot. Statistical: Shapiro-Wilk test, Anderson-Darling test. Rule of thumb: Check skewness (< 2) and kurtosis (< 7).
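These checks can be run in a few lines. The sketch below applies them to two simulated samples (one normal, one exponential, both made up for illustration). Note that `scipy.stats.kurtosis` returns excess kurtosis by default, so a normal distribution scores about 0 rather than 3.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
samples = {
    "normal": rng.normal(loc=50, scale=5, size=500),
    "exponential": rng.exponential(scale=5, size=500),
}

results = {}
for name, x in samples.items():
    shapiro_stat, shapiro_p = stats.shapiro(x)  # H0: data is normal
    results[name] = {
        "shapiro_p": shapiro_p,
        "skewness": stats.skew(x),
        "excess_kurtosis": stats.kurtosis(x),   # Fisher definition: normal = 0
    }
    print(f"{name:12s} Shapiro p={shapiro_p:.4f}, "
          f"skew={results[name]['skewness']:.2f}, "
          f"excess kurtosis={results[name]['excess_kurtosis']:.2f}")
```

With large samples, formal tests reject normality for tiny, harmless deviations, so the Q-Q plot and the skewness and kurtosis values are often the more useful guide.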
Q: What is the Central Limit Theorem and why does it matter?
CLT states that sample means approach a normal distribution regardless of population distribution, given large enough samples (n ≥ 30). It’s why we can use normal-based methods even when data isn’t normally distributed.
Q: A process has 2% defect rate. What distribution models the number of defects in a batch of 50?
Binomial with n=50, p=0.02. Expected defects = np = 1. Could approximate with Poisson(λ=1) since n is large and p is small.
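How close the Poisson approximation is for these numbers can be checked directly. This is a small sketch that compares the two probability mass functions term by term.

```python
from scipy import stats

n, p = 50, 0.02
lam = n * p  # Poisson rate matching the binomial mean

for k in range(4):
    binom_pk = stats.binom.pmf(k, n, p)
    poisson_pk = stats.poisson.pmf(k, lam)
    print(f"k={k}: binomial={binom_pk:.4f}, poisson={poisson_pk:.4f}")

# Largest pointwise difference between the two PMFs
max_gap = max(
    abs(stats.binom.pmf(k, n, p) - stats.poisson.pmf(k, lam))
    for k in range(n + 1)
)
print(f"Largest difference in any P(X = k): {max_gap:.4f}")
```

The two agree to within about half a percentage point at every k, which is why the approximation is usually acceptable when n is large and p is small.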
Why batch statistics work, confidence in predictions
Z-scores: Feature standardization, batch normalization
Binomial: Classification evaluation, confidence intervals
Poisson: Count prediction, event modeling
ML Connection: When you see “Gaussian” in ML papers, it means “normal distribution.” Gaussian processes, Gaussian mixture models, and Gaussian noise all rely on properties of the normal distribution you just learned!
Coming up next: We’ll learn about Statistical Inference - how to draw conclusions about entire populations from just samples. This is how polls predict elections and A/B tests drive decisions.
You are modeling customer support ticket arrivals. How do you decide between Poisson, Binomial, and Normal distributions?
Strong Answer:
The choice depends on the nature of the data-generating process, not on what the histogram looks like. Poisson is the right choice when you are counting events in a continuous interval (tickets per hour) where events are independent and occur at a roughly constant rate. It has one parameter (lambda, the average rate) and its variance equals its mean.
Binomial is correct when you have a fixed number of discrete trials each with a binary outcome — for example, “out of 500 customers who contacted us, how many submitted a ticket?” It requires knowing the number of trials and the success probability.
Normal might be appropriate if you are looking at the average number of tickets per day over many days. By the Central Limit Theorem, the daily averages will be approximately normal even if individual arrivals follow a Poisson process. But you would not use normal for the raw counts because counts cannot be negative, and the normal distribution extends to negative infinity.
In practice, I would start by checking whether the mean and variance of the ticket counts are roughly equal. If they are, Poisson is a good fit. If the variance is much larger than the mean (overdispersion), I would consider a Negative Binomial distribution instead, which adds a dispersion parameter. Overdispersion is extremely common in real ticket data because arrival rates are not actually constant — they vary by time of day, day of week, and whether there was a product incident.
Follow-up: Your ticket data shows variance that is 4x the mean. What does this tell you and how do you handle it?

Variance much larger than the mean is overdispersion, and it means the Poisson assumption is violated. This typically happens because the arrival rate itself is not constant: it varies over time or across customer segments. Using Poisson in this situation would underestimate the probability of extreme counts (many tickets or zero tickets) and give overly narrow prediction intervals. The fix is to use a Negative Binomial distribution, which explicitly models this extra variation. Alternatively, you can build a hierarchical model: the arrival rate lambda follows a Gamma distribution across time periods, and conditional on lambda, counts follow a Poisson. This is actually mathematically equivalent to the Negative Binomial and gives you a richer understanding of what is driving the overdispersion.
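The Gamma-Poisson equivalence can be checked by simulation. The sketch below assumes a hypothetical mean of 12 tickets per hour and picks the Gamma shape so that the variance comes out at 4x the mean (the count variance is m + m²/shape, so shape = m/3). All parameter values are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

mean_rate = 12.0          # hypothetical average tickets per hour
shape = mean_rate / 3.0   # gives variance = mean + mean^2/shape = 4 * mean
scale = mean_rate / shape

# Hierarchical model: the rate varies hour to hour, counts are Poisson given the rate
rates = rng.gamma(shape=shape, scale=scale, size=200_000)
counts = rng.poisson(rates)

# Equivalent Negative Binomial in scipy's (n, p) parameterization
nb_p = shape / (shape + mean_rate)
nb = stats.nbinom(shape, nb_p)

print(f"Simulated mean: {counts.mean():.2f}, variance: {counts.var():.2f}")
print(f"Negative Binomial mean: {nb.mean():.2f}, variance: {nb.var():.2f}")

# Tail probability of a busy hour (25 or more tickets) under each model
p_busy_poisson = stats.poisson.sf(24, mean_rate)
p_busy_nb = nb.sf(24)
print(f"P(X >= 25): Poisson={p_busy_poisson:.4f}, Negative Binomial={p_busy_nb:.4f}")
```

The simulated counts match the Negative Binomial moments, and the plain Poisson assigns far less probability to a busy hour, which is the underestimated tail described above.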
Explain the Central Limit Theorem to a non-technical product manager, and then explain why it matters for A/B testing.
Strong Answer:
For the product manager: “Imagine you survey 100 random customers and compute the average satisfaction score. If you repeated that survey many times, each time with a different random 100 customers, those averages would form a bell curve — even if individual satisfaction scores are not bell-shaped at all. The Central Limit Theorem says that averages of random samples become predictable and bell-shaped as long as your sample is large enough. That is why we can compute a margin of error on any survey or test result.”
For the technical layer: the CLT states that the sampling distribution of the sample mean converges to a normal distribution as sample size increases, regardless of the population distribution, provided the population has finite variance. The rate of convergence depends on how “non-normal” the underlying distribution is — highly skewed distributions need larger n.
For A/B testing specifically, the CLT is the entire foundation. When you compare conversion rates between two groups, each conversion rate is a sample mean (of a Bernoulli variable). The CLT guarantees that the difference between these means is approximately normally distributed, which is why you can use a z-test to compute a p-value. Without the CLT, you would need to know the exact distribution of your metric to do any hypothesis testing.
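A pooled two-proportion z-test, the calculation this paragraph refers to, can be sketched as follows. The function name and the traffic numbers are hypothetical, chosen only to show the arithmetic.

```python
import numpy as np
from scipy import stats

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # rate under H0: no difference
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))       # relies on the CLT approximation
    return z, p_value

# Hypothetical experiment: 3.0% vs 3.6% conversion on 10,000 users each
z, p_value = two_proportion_z_test(conv_a=300, n_a=10_000, conv_b=360, n_b=10_000)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```

The only distributional assumption is in the `stats.norm.sf` line: it treats the difference in sample proportions as normal, which is exactly what the CLT supplies when both groups are large.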
The practical caveat: the CLT needs “large enough” samples, and “large enough” depends on the distribution. For proportions near 0.5, n=30 is usually fine. For proportions near 0.01 (like conversion rates), you might need n=500 or more before the normal approximation is accurate. This is why very low conversion rate tests need more traffic.
Follow-up: When does the CLT fail or give misleading results, even with a large sample?

The CLT fails when the underlying distribution does not have a finite variance. The canonical example is a Cauchy distribution (heavy-tailed), where the sample mean does not converge to anything normal no matter how many samples you take. In practice, this matters for financial data: stock returns have heavier tails than normal, and models that assume normal distributions (like VaR) systematically underestimate tail risk. Underestimated tail risk of this kind is often cited as a contributing factor in the 2008 financial crisis. Another practical failure mode is when your data has structural dependencies that violate the “independent and identically distributed” assumption, like time-series data with autocorrelation or clustered data where observations within a cluster are correlated. In those cases, the effective sample size is much smaller than the nominal sample size, and the CLT-based confidence intervals are too narrow.
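The Cauchy case is easy to see in simulation. The sketch below compares the spread of sample means from a normal and from a standard Cauchy distribution; the sample sizes are arbitrary, and the interquartile range is used because the Cauchy has no finite standard deviation.

```python
import numpy as np

rng = np.random.default_rng(11)
n_repeats, n = 1_000, 2_000  # 1,000 sample means, each from 2,000 draws

normal_means = rng.normal(size=(n_repeats, n)).mean(axis=1)
cauchy_means = rng.standard_cauchy(size=(n_repeats, n)).mean(axis=1)

def iqr(x):
    """Interquartile range: a spread measure that exists even for heavy tails."""
    q75, q25 = np.percentile(x, [75, 25])
    return q75 - q25

print(f"IQR of normal sample means: {iqr(normal_means):.3f}")
print(f"IQR of Cauchy sample means: {iqr(cauchy_means):.3f}")
```

Averaging 2,000 normal draws shrinks the spread by a factor of about √2000, while the mean of 2,000 Cauchy draws is just as spread out as a single draw (a standard Cauchy has an IQR of 2), so collecting more data does not help.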
A manufacturing line has a 2% defect rate. You test 500 items and find 18 defects (3.6%). Should you stop the line?
Strong Answer:
Before stopping the line (which is expensive), I need to determine if 18 defects in 500 is statistically inconsistent with the expected 2% rate. Under the null hypothesis of 2%, the expected number of defects is 10, and the standard deviation is sqrt(500 x 0.02 x 0.98) = approximately 3.13.
The z-score for 18 defects is (18 - 10) / 3.13 = 2.56, giving a one-tailed p-value of about 0.005. This is well below the typical 0.05 threshold. So statistically, yes, 18 defects is very unlikely if the true rate is still 2%.
However, the statistical answer is only half the decision. I would also consider: Is this a sudden spike or a gradual trend? (Check a control chart for the last several batches.) What is the cost of stopping the line versus the cost of shipping defective products? Is there a known assignable cause (like a new material batch or a maintenance event)?
In a Six Sigma framework, this would trigger an investigation but not necessarily an immediate line stop. I would pull the last 5 batches of data and look at a Shewhart control chart. If the process mean has shifted (as opposed to one unlucky batch), that warrants corrective action. If this is a single batch anomaly, the response might be different — inspect remaining inventory from this batch rather than shutting everything down.
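The control-chart view can be made concrete with a p-chart, the Shewhart chart for proportions. This sketch assumes the conventional 3-sigma limits and a fixed batch size of 500; a real chart would also plot the preceding batches.

```python
import numpy as np

p0, n = 0.02, 500           # historical defect rate and batch size
observed = 18 / n           # this batch: 3.6% defective

# 3-sigma control limits for a proportion
sigma_p = np.sqrt(p0 * (1 - p0) / n)
ucl = p0 + 3 * sigma_p
lcl = max(0.0, p0 - 3 * sigma_p)

out_of_control = observed > ucl or observed < lcl
print(f"Control limits: [{lcl:.4f}, {ucl:.4f}]")
print(f"Observed rate: {observed:.4f}, outside limits: {out_of_control}")
```

With these numbers the batch sits at about 2.6 sigma, inside the 3-sigma limits even though a one-sided test at 0.05 flags it. That is consistent with treating it as a trigger for investigation rather than an automatic line stop.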
Follow-up: What is the difference between using a binomial exact test versus a normal approximation here, and when does it matter?

For n=500 and p=0.02, the usual rule of thumb for the normal approximation is satisfied because np=10 and n(1-p)=490 are both greater than 5. The binomial exact test computes P(X >= 18 given n=500, p=0.02) directly, without the approximation. Even so, the two answers differ noticeably this far into the tail: the exact probability is roughly 0.013, while the uncorrected normal approximation gives about 0.005, because the binomial is right-skewed when p is small. Both are below 0.05, so the conclusion is the same here, but the gap would matter at a stricter threshold such as 0.01. The approximation degrades further when either np or n(1-p) is small, which happens with very rare events (like a 0.01% defect rate tested on 100 items). In those cases, the normal approximation can be meaningfully wrong, and you should use the exact binomial or a Poisson approximation instead. In modern practice, there is little reason not to use the exact test since computational cost is negligible, but understanding when the approximation breaks helps you catch errors in older tools that default to normal.
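The comparison can be run directly with `scipy.stats`. This sketch computes the one-sided tail probability four ways for the numbers in the question; the variable names are chosen for the example.

```python
import numpy as np
from scipy import stats

n, p0, observed = 500, 0.02, 18
mean = n * p0
sd = np.sqrt(n * p0 * (1 - p0))

# Normal approximation, without and with continuity correction
p_normal = stats.norm.sf((observed - mean) / sd)
p_normal_cc = stats.norm.sf((observed - 0.5 - mean) / sd)

# Exact binomial tail: P(X >= 18) = P(X > 17)
p_exact = stats.binom.sf(observed - 1, n, p0)

# Poisson approximation with the same mean
p_poisson = stats.poisson.sf(observed - 1, mean)

print(f"Normal approximation:       {p_normal:.4f}")
print(f"Normal with continuity:     {p_normal_cc:.4f}")
print(f"Exact binomial:             {p_exact:.4f}")
print(f"Poisson approximation:      {p_poisson:.4f}")
```

All four values fall below 0.05, so the decision at that threshold does not change. The continuity correction and the Poisson approximation both land closer to the exact value than the plain normal approximation does.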