You run a factory that produces ball bearings. Each bearing should be exactly 10mm in diameter. But manufacturing isn't perfect - there's always some variation. Measure 1000 bearings and you'll see the diameters cluster around 10mm with some spread.
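To make this concrete, here's a quick simulation of what those measurements might look like (a sketch; the 0.02mm spread is an assumed value):

import numpy as np

np.random.seed(0)
# 1000 bearings aimed at 10mm; the 0.02mm standard deviation is an assumption
diameters = np.random.normal(loc=10.0, scale=0.02, size=1000)
print(f"Mean: {diameters.mean():.3f}mm, Std: {diameters.std():.3f}mm")
print(f"Within 0.05mm of spec: {np.mean(np.abs(diameters - 10.0) < 0.05):.1%}")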
A probability distribution describes all possible values a random variable can take and how likely each value is. Think of it as a complete map of possibilities.
# Discrete: Number of heads in 10 coin flips
# Can only be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10

# Continuous: Height of a person
# Can be 170.0 cm, 170.1 cm, 170.01 cm, 170.001 cm...
# Random time a customer arrives between 9:00 and 10:00 AM
arrival_minutes = np.random.uniform(0, 60, size=1000)
print(f"Mean arrival: {np.mean(arrival_minutes):.1f} minutes after 9:00")
print(f"P(arrive in first 15 min): {np.mean(arrival_minutes < 15):.1%}")
from scipy import stats
import matplotlib.pyplot as plt

n = 10
p = 0.5
k_values = range(0, n + 1)
probabilities = [stats.binom.pmf(k, n, p) for k in k_values]

plt.figure(figsize=(10, 5))
plt.bar(k_values, probabilities, edgecolor='black', alpha=0.7)
plt.xlabel('Number of Heads')
plt.ylabel('Probability')
plt.title(f'Binomial Distribution (n={n}, p={p})')
plt.xticks(k_values)
plt.show()
# Your website has a 3% conversion rate
# 100 people visit today
# What's the probability of 5 or more conversions?
n = 100
p = 0.03

# P(X >= 5) = 1 - P(X <= 4)
p_at_least_5 = 1 - stats.binom.cdf(4, n, p)
print(f"P(5+ conversions): {p_at_least_5:.1%}")  # 18.2%

# Expected conversions
expected = n * p
print(f"Expected conversions: {expected}")  # 3.0
Practice: Quality Control
A factory produces items with a 2% defect rate. In a batch of 50 items:
What’s the probability of exactly 0 defects?
What’s the probability of 3 or more defects?
How many defects do you expect?
n, p = 50, 0.02

# 1. P(X = 0)
p_zero = stats.binom.pmf(0, n, p)
print(f"P(0 defects): {p_zero:.1%}")  # 36.4%

# 2. P(X >= 3)
p_three_plus = 1 - stats.binom.cdf(2, n, p)
print(f"P(3+ defects): {p_three_plus:.1%}")  # 7.8%

# 3. Expected defects
expected = n * p
print(f"Expected defects: {expected}")  # 1.0
A z-score tells you how many standard deviations a value is from the mean.

z = (x − μ) / σ
def z_score(x, mu, sigma):
    """Convert value to z-score."""
    return (x - mu) / sigma

# How exceptional is an IQ of 130?
iq = 130
z = z_score(iq, mu=100, sigma=15)
print(f"IQ of 130 has z-score: {z:.2f}")  # 2.0

# This means 130 is 2 standard deviations above average
# Only about 2.3% of people score higher
percentile = stats.norm.cdf(z) * 100
print(f"Percentile: {percentile:.1f}%")  # 97.7%
The Central Limit Theorem (CLT) explains why the normal distribution shows up so often:
Central Limit Theorem: When you add up many independent random variables, their sum tends toward a normal distribution - regardless of the original distributions.
# Roll a single die - definitely NOT normal
single_die = np.random.randint(1, 7, 10000)

# Sum of 2 dice - starting to look different
sum_2_dice = np.random.randint(1, 7, (10000, 2)).sum(axis=1)

# Sum of 10 dice - getting bell-shaped
sum_10_dice = np.random.randint(1, 7, (10000, 10)).sum(axis=1)

# Sum of 30 dice - nearly perfect normal!
sum_30_dice = np.random.randint(1, 7, (10000, 30)).sum(axis=1)

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].hist(single_die, bins=6, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Single Die (Uniform)')
axes[0, 1].hist(sum_2_dice, bins=11, edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Sum of 2 Dice')
axes[1, 0].hist(sum_10_dice, bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_title('Sum of 10 Dice')
axes[1, 1].hist(sum_30_dice, bins=40, edgecolor='black', alpha=0.7)
axes[1, 1].set_title('Sum of 30 Dice (Nearly Normal!)')
plt.tight_layout()
plt.show()
This is why heights are normally distributed: Height is determined by thousands of genes, each adding a small random effect. Sum of many small random things = normal distribution.
How many customers arrive per hour? How many defects per batch? How many emails per day? The Poisson distribution models counts like these.

Parameter: λ (lambda) = average rate of events

P(X = k) = λ^k e^(−λ) / k!
# Average 5 customers per hour
lambda_rate = 5

# Probability of exactly 3 customers in an hour
p_3 = stats.poisson.pmf(3, lambda_rate)
print(f"P(3 customers): {p_3:.2%}")  # 14.04%

# Probability of 10 or more
p_10_plus = 1 - stats.poisson.cdf(9, lambda_rate)
print(f"P(10+ customers): {p_10_plus:.2%}")  # 3.18%

# Visualize
k_values = range(0, 15)
probs = [stats.poisson.pmf(k, lambda_rate) for k in k_values]

plt.figure(figsize=(10, 5))
plt.bar(k_values, probs, edgecolor='black', alpha=0.7)
plt.xlabel('Number of Customers')
plt.ylabel('Probability')
plt.title(f'Poisson Distribution (λ={lambda_rate})')
plt.show()
If events occur at rate λ, how long until the next one? The exponential distribution answers exactly this.
# Average 5 customers per hour = 1 customer per 12 minutes on average
lambda_rate = 5  # per hour
avg_wait = 60 / lambda_rate  # 12 minutes

# Probability of waiting more than 20 minutes
p_wait_20 = 1 - stats.expon.cdf(20, scale=avg_wait)
print(f"P(wait > 20 min): {p_wait_20:.2%}")  # 18.9%

# Time by which 90% of customers will have arrived
time_90 = stats.expon.ppf(0.90, scale=avg_wait)
print(f"90% arrive within: {time_90:.1f} minutes")  # 27.6 min
# Adult male heights in the US follow N(69.1, 2.9) inches
# (mean 69.1 inches, std dev 2.9 inches)

# Calculate:
# 1. What percentage of men are over 6 feet (72 inches)?
# 2. What percentage are between 5'6" (66 in) and 6'0" (72 in)?
# 3. How tall do you need to be to be in the top 5%?
# 4. What is the z-score for someone 6'4" (76 inches)?
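One way to work through these with scipy.stats (a solution sketch; values in comments are approximate):

from scipy import stats

mu, sigma = 69.1, 2.9

# 1. P(height > 72)
print(f"Over 6 feet: {1 - stats.norm.cdf(72, mu, sigma):.1%}")  # ~15.9%

# 2. P(66 < height < 72)
p_between = stats.norm.cdf(72, mu, sigma) - stats.norm.cdf(66, mu, sigma)
print(f"Between 66 and 72 inches: {p_between:.1%}")  # ~69.9%

# 3. 95th percentile = cutoff for the top 5%
print(f"Top 5% cutoff: {stats.norm.ppf(0.95, mu, sigma):.1f} inches")  # ~73.9

# 4. z-score for 76 inches
print(f"z for 76 inches: {(76 - mu) / sigma:.2f}")  # ~2.38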
# A web server receives an average of 100 requests per minute.
# Requests follow a Poisson distribution.

# Calculate:
# 1. P(exactly 100 requests in a minute)
# 2. P(more than 120 requests in a minute)
# 3. For capacity planning, what number of requests per minute
#    will only be exceeded 1% of the time?
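A matching solution sketch (approximate values in comments):

from scipy import stats

lam = 100

# 1. P(exactly 100 requests)
print(f"P(exactly 100): {stats.poisson.pmf(100, lam):.2%}")  # ~3.99%

# 2. P(more than 120 requests)
print(f"P(>120): {1 - stats.poisson.cdf(120, lam):.2%}")  # ~2.3%

# 3. Level exceeded only 1% of the time = 99th percentile
print(f"99th percentile: {stats.poisson.ppf(0.99, lam):.0f} requests/min")  # ~124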
Mistake 1: Assuming Everything is Normal
Not all data follows a normal distribution. Income data is heavily right-skewed. Time-to-event data often follows exponential distributions. Always visualize your data before assuming normality.
Mistake 2: Misusing the 68-95-99.7 Rule
This rule ONLY applies to normal distributions. Applying it to skewed data will give wrong answers. For non-normal data, use Chebyshev's inequality: at least 75% of data is within 2 std devs, regardless of distribution shape.
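To see the difference on skewed data, here's a minimal sketch using simulated exponential data (not from the original):

import numpy as np

np.random.seed(0)
skewed = np.random.exponential(scale=1.0, size=100_000)  # heavily right-skewed
mu, sigma = skewed.mean(), skewed.std()
print(f"Within 1 std dev: {np.mean(np.abs(skewed - mu) <= sigma):.1%}")      # ~86%, not 68%
print(f"Within 2 std devs: {np.mean(np.abs(skewed - mu) <= 2 * sigma):.1%}") # ~95%; Chebyshev only guarantees >= 75%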
Mistake 3: Confusing PDF and CDF
The PDF gives the relative likelihood at a point (technically, density). The CDF gives the probability of being less than or equal to a value. P(X = exact value) is always 0 for continuous distributions.
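A quick illustration of the difference:

from scipy import stats

print(f"PDF at 0: {stats.norm.pdf(0):.4f}")  # 0.3989 - a density, not a probability
print(f"CDF at 0: {stats.norm.cdf(0):.4f}")  # 0.5000 - P(X <= 0)
# Probabilities for continuous variables come from intervals:
print(f"P(-1 <= X <= 1): {stats.norm.cdf(1) - stats.norm.cdf(-1):.4f}")  # 0.6827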
Question 1: Normal Distribution Application (Google)
Question: Website response times follow a normal distribution with mean 200ms and std dev 50ms. What percentage of requests take more than 300ms?
Answer: About 2.3%
from scipy import stats

p_slow = 1 - stats.norm.cdf(300, loc=200, scale=50)
# Or using z-score: z = (300-200)/50 = 2
# P(Z > 2) ≈ 0.0228
print(f"{p_slow:.2%}")  # 2.28%
The 68-95-99.7 rule gives us a quick check: 300ms is 2 standard deviations above mean, so roughly 2.5% should be above that.
Question 2: Choosing the Right Distribution (Amazon)
Question: You’re modeling these scenarios. Which distribution would you use for each?
Number of customers arriving per hour
Whether a user clicks an ad (yes/no)
Time until a server fails
Heights of basketball players
Answer:
Poisson - Counts of events in fixed intervals
Bernoulli (single trial) or Binomial (many users) - Binary outcomes
Exponential - Time until an event (memoryless process)
Normal - Continuous measurements of natural phenomena
For height, you might also consider that basketball players are selected to be tall, so it could be a truncated normal!
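If it helps, here's a sketch simulating each scenario (all parameters invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
arrivals = rng.poisson(lam=20, size=1000)          # customers per hour
clicks = rng.binomial(n=1, p=0.05, size=1000)      # Bernoulli: click / no click
uptimes = rng.exponential(scale=500, size=1000)    # hours until failure
heights = rng.normal(loc=200, scale=8, size=1000)  # player heights in cm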
Question 3: Central Limit Theorem (Facebook/Meta)
Question: User session times are heavily right-skewed (not normal). You calculate the average session time each day for 30 days. What distribution does the sample mean follow?
Answer: Approximately normal! Thanks to the Central Limit Theorem, the sampling distribution of the mean will be approximately normal regardless of the underlying distribution shape, as long as:
Sample size is sufficiently large (n ≥ 30 is a common rule of thumb)
The original distribution has finite variance
This is why we can use confidence intervals and hypothesis tests based on the normal distribution even when the underlying data isn’t normal.
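A quick simulation makes this visible (assuming, for illustration, exponential session times with a 300-second mean):

import numpy as np
from scipy import stats

np.random.seed(1)
# 10,000 replications of averaging 30 heavily skewed values
means = np.random.exponential(scale=300, size=(10_000, 30)).mean(axis=1)
print("Skewness of raw data: 2.0 (exponential)")
print(f"Skewness of the means: {stats.skew(means):.2f}")  # ~0.37, much closer to normal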
Question 4: Percentiles in Practice (Netflix)
Question: Video start times follow a log-normal distribution (right-skewed). The P50 is 1.2 seconds and P95 is 4.8 seconds. What does this tell you about user experience?
Answer:
Half of users experience start times of 1.2s or less (good!)
5% of users wait more than 4.8 seconds (potentially frustrating)
The ratio P95/P50 = 4 indicates significant variability
For right-skewed metrics like latency, the P95 or P99 is often more important than the mean because it captures the experience of the “unlucky” users. A 4x difference between median and P95 suggests there are edge cases worth investigating (slow CDNs, distant users, etc.).
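As a sketch, you can back out log-normal parameters from the two stated percentiles and inspect the tail (this assumes scipy's lognorm convention, where scale equals the median):

import numpy as np
from scipy import stats

p50, p95 = 1.2, 4.8
sigma = np.log(p95 / p50) / stats.norm.ppf(0.95)  # ~0.84
start_times = stats.lognorm(s=sigma, scale=p50)
print(f"Mean: {start_times.mean():.2f}s")    # ~1.7s - well below the P95
print(f"P99: {start_times.ppf(0.99):.2f}s")  # ~8.5s - the tail keeps stretching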
You have website session data (simulated in the starter code below). Determine which distribution best fits it:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated session durations (in seconds)
np.random.seed(42)
sessions = np.random.exponential(scale=120, size=1000)  # Unknown to you!

# Your task:
# 1. Visualize the data with a histogram
# 2. Calculate summary statistics
# 3. Fit different distributions and compare
# 4. Determine which distribution fits best
# Hint: Try normal, exponential, and log-normal

# Starter code:
plt.figure(figsize=(12, 4))

# Histogram
plt.subplot(1, 3, 1)
plt.hist(sessions, bins=50, density=True, alpha=0.7)
plt.title('Data Distribution')
plt.xlabel('Session Duration (s)')

# Q-Q plot for normal
plt.subplot(1, 3, 2)
stats.probplot(sessions, dist="norm", plot=plt)
plt.title('Normal Q-Q Plot')

# Q-Q plot for exponential
plt.subplot(1, 3, 3)
stats.probplot(sessions, dist="expon", plot=plt)
plt.title('Exponential Q-Q Plot')

plt.tight_layout()
plt.show()

# Fit distributions and compare
Solution:
# 1. Visual inspection shows right-skewed data

# 2. Summary stats
print(f"Mean: {np.mean(sessions):.1f}s")
print(f"Median: {np.median(sessions):.1f}s")
print(f"Std: {np.std(sessions):.1f}s")
print(f"Skewness: {stats.skew(sessions):.2f}")  # Positive = right-skewed

# 3. Fit distributions
norm_params = stats.norm.fit(sessions)
exp_params = stats.expon.fit(sessions)
lognorm_params = stats.lognorm.fit(sessions)

# 4. Compare using Kolmogorov-Smirnov test
# Null hypothesis: data follows the distribution
# Lower p-value = worse fit
ks_norm = stats.kstest(sessions, 'norm', args=norm_params)
ks_exp = stats.kstest(sessions, 'expon', args=exp_params)
ks_lognorm = stats.kstest(sessions, 'lognorm', args=lognorm_params)

print("\nKS Test p-values:")
print(f"Normal: {ks_norm.pvalue:.4f}")       # Low - bad fit
print(f"Exponential: {ks_exp.pvalue:.4f}")   # High - good fit!
print(f"Log-normal: {ks_lognorm.pvalue:.4f}")

# Exponential wins because mean ≈ std dev (property of exponential)
Q: When would you use Poisson vs Binomial distribution?
Poisson: Counting events in continuous time/space where events are rare (website visits, defects). Binomial: Fixed number of trials with binary outcomes (10 coin flips, 100 users converting).
Q: How do you check if data is normally distributed?
Visual: histogram, Q-Q plot. Statistical: Shapiro-Wilk test, Anderson-Darling test. Rule of thumb: check that |skewness| < 2 and |kurtosis| < 7.
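For example (a sketch on simulated data):

import numpy as np
from scipy import stats

np.random.seed(0)
data = np.random.normal(loc=0, scale=1, size=500)
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk p-value: {p:.3f}")  # large p: consistent with normality
print(f"Skewness: {stats.skew(data):.2f}")             # near 0 for normal data
print(f"Excess kurtosis: {stats.kurtosis(data):.2f}")  # near 0 for normal data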
Q: What is the Central Limit Theorem and why does it matter?
CLT states that sample means approach a normal distribution regardless of population distribution, given large enough samples (n ≥ 30). It’s why we can use normal-based methods even when data isn’t normally distributed.
Q: A process has 2% defect rate. What distribution models the number of defects in a batch of 50?
Binomial with n=50, p=0.02. Expected defects = np = 1. Could approximate with Poisson(λ=1) since n is large and p is small.
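A quick check of how close that approximation is:

from scipy import stats

n, p = 50, 0.02
lam = n * p  # 1.0
for k in range(4):
    b = stats.binom.pmf(k, n, p)
    po = stats.poisson.pmf(k, lam)
    print(f"P(X={k}): binomial {b:.4f} vs Poisson {po:.4f}")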
How these distributions show up in ML:

CLT - Why batch statistics work, confidence in predictions
Z-scores - Feature standardization, batch normalization
Binomial - Classification evaluation, confidence intervals
Poisson - Count prediction, event modeling
ML Connection: When you see “Gaussian” in ML papers, it means “normal distribution.” Gaussian processes, Gaussian mixture models, and Gaussian noise all rely on properties of the normal distribution you just learned!
Coming up next: We’ll learn about Statistical Inference - how to draw conclusions about entire populations from just samples. This is how polls predict elections and A/B tests drive decisions.