# Probability Distributions: Patterns in Randomness

> Discover the bell curve and other patterns that govern everything from test scores to stock prices

## The Factory Quality Problem

You run a factory that produces ball bearings. Each bearing should be exactly 10mm in diameter. But manufacturing isn't perfect - there's always some variation.

You measure 1000 bearings and get:

```python theme={null}
import numpy as np
import matplotlib.pyplot as plt

# Simulated bearing diameters (mm)
np.random.seed(42)
bearings = np.random.normal(loc=10.0, scale=0.05, size=1000)

print(f"Mean diameter: {np.mean(bearings):.4f} mm")
print(f"Std deviation: {np.std(bearings):.4f} mm")
print(f"Min: {np.min(bearings):.4f} mm")
print(f"Max: {np.max(bearings):.4f} mm")
```

**Output:**

```
Mean diameter: 10.0024 mm
Std deviation: 0.0498 mm
Min: 9.8521 mm
Max: 10.1534 mm
```

If you plot these measurements, something magical appears:

```python theme={null}
plt.figure(figsize=(10, 5))
# density=True rescales bar heights so the total area is 1
plt.hist(bearings, bins=50, density=True, alpha=0.7, edgecolor='black')
plt.xlabel('Diameter (mm)')
plt.ylabel('Probability Density')
plt.title('Distribution of Ball Bearing Diameters')
plt.axvline(10.0, color='red', linestyle='--', label='Target: 10mm')
plt.legend()
plt.show()
```

A **bell curve** emerges. This isn't coincidence - it's one of the most profound patterns in nature.

*Figure: Key Probability Distributions for ML*

**Estimated Time**: 3-4 hours\
**Difficulty**: Beginner\
**Prerequisites**: Modules 1-2 (Describing Data, Probability)\
**What You'll Build**: Quality control system, prediction intervals

***

## What Is a Probability Distribution?

A **probability distribution** describes all possible values a random variable can take and how likely each value is. Think of it as a complete map of possibilities.

**Analogy**: A probability distribution is like a city's terrain map. The peaks show where values cluster (common outcomes), and the valleys show where values are rare. Just as different cities have different landscapes -- some flat (uniform), some with a single mountain (normal), some with a long tail running to the east (exponential) -- different types of data have different distributional shapes. Learning to "read the terrain" of your data is one of the most valuable skills in ML.

### Discrete vs Continuous

| Type           | Description               | Examples                                    |
| -------------- | ------------------------- | ------------------------------------------- |
| **Discrete**   | Countable outcomes        | Coin flips, dice rolls, number of customers |
| **Continuous** | Any value within a range  | Height, weight, temperature, time           |

```python theme={null}
# Discrete: Number of heads in 10 coin flips
# Can only be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10

# Continuous: Height of a person
# Can be 170.0 cm, 170.1 cm, 170.01 cm, 170.001 cm...
```

*Figure: Discrete vs Continuous Distributions*

***

## The Uniform Distribution: Equal Chances

The simplest distribution - every outcome is equally likely.
### Discrete Uniform: The Fair Die

```python theme={null}
import numpy as np
from collections import Counter

# Roll a fair die 10000 times (upper bound 7 is exclusive)
rolls = np.random.randint(1, 7, size=10000)
counts = Counter(rolls)

for face in sorted(counts.keys()):
    pct = counts[face] / 10000 * 100
    print(f"Face {face}: {counts[face]:4d} ({pct:.1f}%)")
```

**Output:**

```
Face 1: 1652 (16.5%)
Face 2: 1689 (16.9%)
Face 3: 1634 (16.3%)
Face 4: 1701 (17.0%)
Face 5: 1658 (16.6%)
Face 6: 1666 (16.7%)
```

Each face appears roughly 16.67% (1/6) of the time.

*Figure: Uniform Distribution - Dice and Lottery*
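The "roughly 16.67%" above gets tighter as the number of rolls grows. Here is a minimal sketch of that convergence, assuming NumPy's `default_rng` generator with an arbitrary seed; the helper name `max_deviation_from_uniform` is made up for this illustration.

```python
import numpy as np

def max_deviation_from_uniform(n_rolls, rng):
    """Largest gap between an observed face frequency and the ideal 1/6."""
    rolls = rng.integers(1, 7, size=n_rolls)       # faces 1..6 (upper bound exclusive)
    counts = np.bincount(rolls, minlength=7)[1:]   # drop index 0, keep faces 1..6
    return np.max(np.abs(counts / n_rolls - 1 / 6))

rng = np.random.default_rng(0)  # seed chosen arbitrarily, only for reproducibility
deviations = {n: max_deviation_from_uniform(n, rng) for n in (100, 10_000, 1_000_000)}

for n, dev in deviations.items():
    print(f"{n:>9,} rolls: max deviation from 1/6 = {dev:.4f}")
```

With more rolls the largest deviation typically shrinks, which is why 10,000 rolls land close to, but not exactly on, 16.67% per face.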

### Continuous Uniform: Random Numbers

```python theme={null}
# Random time a customer arrives between 9:00 and 10:00 AM
arrival_minutes = np.random.uniform(0, 60, size=1000)

print(f"Mean arrival: {np.mean(arrival_minutes):.1f} minutes after 9:00")
print(f"P(arrive in first 15 min): {np.mean(arrival_minutes < 15):.1%}")
```

**ML Applications:**

* Random weight initialization
* Data augmentation (random crops, rotations)
* Monte Carlo simulations

***

## The Binomial Distribution: Success/Failure Experiments

The binomial distribution applies when you repeat an experiment with two outcomes (success/failure) a fixed number of times, with each trial independent and equally likely to succeed.

**Parameters:**

* n = number of trials
* p = probability of success on each trial

**Question:** If you flip a coin 10 times, what's the probability of getting exactly 7 heads?

### Mathematical Formula

$$
P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}
$$

Where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is "n choose k".

```python theme={null}
from scipy import stats
import math

def binomial_probability(n, k, p):
    """Calculate P(X = k) for binomial distribution."""
    # n choose k (math.comb returns an exact integer)
    combinations = math.comb(n, k)
    # Probability
    return combinations * (p ** k) * ((1 - p) ** (n - k))

# P(exactly 7 heads in 10 flips)
p_7_heads = binomial_probability(n=10, k=7, p=0.5)
print(f"P(7 heads in 10 flips): {p_7_heads:.4f}")  # 0.1172

# Using scipy
p_7_scipy = stats.binom.pmf(k=7, n=10, p=0.5)
print(f"P(7 heads) via scipy: {p_7_scipy:.4f}")  # 0.1172
```

### Visualizing the Binomial Distribution

```python theme={null}
n = 10
p = 0.5
k_values = range(0, n + 1)
probabilities = [stats.binom.pmf(k, n, p) for k in k_values]

plt.figure(figsize=(10, 5))
plt.bar(k_values, probabilities, edgecolor='black', alpha=0.7)
plt.xlabel('Number of Heads')
plt.ylabel('Probability')
plt.title(f'Binomial Distribution (n={n}, p={p})')
plt.xticks(k_values)
plt.show()
```

### Real-World Example: Website Conversion

```python theme={null}
# Your website has a 3% conversion rate
# 100 people visit today
# What's the probability of 5 or more conversions?
n = 100
p = 0.03

# P(X >= 5) = 1 - P(X <= 4)
p_at_least_5 = 1 - stats.binom.cdf(4, n, p)
print(f"P(5+ conversions): {p_at_least_5:.1%}")  # 18.2%

# Expected conversions
expected = n * p
print(f"Expected conversions: {expected}")  # 3.0
```

A factory produces items with a 2% defect rate. In a batch of 50 items:

1. What's the probability of exactly 0 defects?
2. What's the probability of 3 or more defects?
3. How many defects do you expect?

```python theme={null}
n, p = 50, 0.02

# 1. P(X = 0)
p_zero = stats.binom.pmf(0, n, p)
print(f"P(0 defects): {p_zero:.1%}")  # 36.4%

# 2. P(X >= 3)
p_three_plus = 1 - stats.binom.cdf(2, n, p)
print(f"P(3+ defects): {p_three_plus:.1%}")  # 7.8%

# 3. Expected defects
expected = n * p
print(f"Expected defects: {expected}")  # 1.0
```

***

## The Normal Distribution: The Bell Curve

This is the most important distribution in statistics. It appears, at least approximately, everywhere:

* Human heights and weights
* Test scores
* Measurement errors
* Stock price changes
* IQ scores

**Parameters:**

* $\mu$ (mu) = mean (center of the bell)
* $\sigma$ (sigma) = standard deviation (width of the bell)

### Mathematical Formula

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}
$$

*Figure: Normal Distribution Formula and Shape*

```python theme={null}
# Generate normally distributed data
mu = 100    # mean
sigma = 15  # standard deviation

# IQ scores follow this distribution
iq_scores = np.random.normal(mu, sigma, 10000)

plt.figure(figsize=(10, 5))
plt.hist(iq_scores, bins=50, density=True, alpha=0.7, edgecolor='black')

# Overlay theoretical curve
x = np.linspace(50, 150, 1000)
y = stats.norm.pdf(x, mu, sigma)
plt.plot(x, y, 'r-', linewidth=2, label='Theoretical')

plt.xlabel('IQ Score')
plt.ylabel('Probability Density')
plt.title('Normal Distribution of IQ Scores (μ=100, σ=15)')
plt.legend()
plt.show()
```

### The 68-95-99.7 Rule (Empirical Rule)

One of the most useful facts in statistics:

| Range  | Percentage of Data |
| ------ | ------------------ |
| μ ± 1σ | 68%                |
| μ ± 2σ | 95%                |
| μ ± 3σ | 99.7%              |

```python theme={null}
# Verify with IQ scores
within_1_std = np.mean(np.abs(iq_scores - mu) <= sigma)
within_2_std = np.mean(np.abs(iq_scores - mu) <= 2 * sigma)
within_3_std = np.mean(np.abs(iq_scores - mu) <= 3 * sigma)

print(f"Within 1 std (85-115): {within_1_std:.1%}")  # ~68%
print(f"Within 2 std (70-130): {within_2_std:.1%}")  # ~95%
print(f"Within 3 std (55-145): {within_3_std:.1%}")  # ~99.7%
```

*Figure: 68-95-99.7 Rule Applied to Heights*
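The percentages in the empirical rule are rounded. As a small sketch (the helper name `prob_within_k_sigma` is introduced here only for illustration), the exact values can be read off the standard normal CDF, since P(μ − kσ ≤ X ≤ μ + kσ) = Φ(k) − Φ(−k) for any normal distribution:

```python
from scipy import stats

def prob_within_k_sigma(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    # Standardizing removes mu and sigma, so the standard normal CDF is enough
    return stats.norm.cdf(k) - stats.norm.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k} sigma: {prob_within_k_sigma(k):.4%}")
```

Simulated data, such as a batch of generated IQ scores, will scatter around these exact values rather than hit them precisely.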

### Z-Scores: Standardizing Any Normal Distribution

A **z-score** tells you how many standard deviations a value is from the mean.

$$
z = \frac{x - \mu}{\sigma}
$$

```python theme={null}
def z_score(x, mu, sigma):
    """Convert value to z-score."""
    return (x - mu) / sigma

# How exceptional is an IQ of 130?
iq = 130
z = z_score(iq, mu=100, sigma=15)
print(f"IQ of 130 has z-score: {z:.2f}")  # 2.00

# This means 130 is 2 standard deviations above average
# Only about 2.3% of people score higher
percentile = stats.norm.cdf(z) * 100
print(f"Percentile: {percentile:.1f}%")  # 97.7%
```

### Calculating Probabilities

```python theme={null}
# Normal distribution with μ=100, σ=15

# P(IQ > 130)
p_above_130 = 1 - stats.norm.cdf(130, loc=100, scale=15)
print(f"P(IQ > 130): {p_above_130:.2%}")  # 2.28%

# P(IQ between 85 and 115)
p_middle = stats.norm.cdf(115, 100, 15) - stats.norm.cdf(85, 100, 15)
print(f"P(85 < IQ < 115): {p_middle:.2%}")  # 68.27%

# What IQ score is at the 99th percentile?
iq_99 = stats.norm.ppf(0.99, loc=100, scale=15)
print(f"99th percentile IQ: {iq_99:.1f}")  # 134.9
```

***

## Why Is the Normal Distribution Everywhere?

The **Central Limit Theorem** (CLT) explains this magic:

**Central Limit Theorem**: When you add up many independent random variables with finite variance, their sum tends toward a normal distribution - regardless of the shape of the original distributions.

### Demonstration

```python theme={null}
# Roll a single die - definitely NOT normal
single_die = np.random.randint(1, 7, 10000)

# Sum of 2 dice - starting to look different
sum_2_dice = np.array([np.random.randint(1, 7, 2).sum() for _ in range(10000)])

# Sum of 10 dice - getting bell-shaped
sum_10_dice = np.array([np.random.randint(1, 7, 10).sum() for _ in range(10000)])

# Sum of 30 dice - nearly perfect normal!
sum_30_dice = np.array([np.random.randint(1, 7, 30).sum() for _ in range(10000)])

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

axes[0, 0].hist(single_die, bins=6, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Single Die (Uniform)')

axes[0, 1].hist(sum_2_dice, bins=11, edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Sum of 2 Dice')

axes[1, 0].hist(sum_10_dice, bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_title('Sum of 10 Dice')

axes[1, 1].hist(sum_30_dice, bins=40, edgecolor='black', alpha=0.7)
axes[1, 1].set_title('Sum of 30 Dice (Nearly Normal!)')

plt.tight_layout()
plt.show()
```

**This is why heights are approximately normally distributed**: Height is influenced by thousands of genes, each adding a small random effect. Sum of many small random things ≈ normal distribution.

**ML Application -- Why Batch Normalization Works**: The Central Limit Theorem offers useful intuition for batch normalization in deep learning. Each unit in a neural network sums many weighted inputs, and by a CLT-style argument those sums tend to be roughly bell-shaped. Batch normalization re-centers and re-scales activations to zero mean and unit variance at each layer (it standardizes them; it does not force them to be exactly normal), which stabilizes training and allows higher learning rates. When someone asks "why does batch norm help?", the CLT is a good statistical starting point for the answer, though not the whole story.

**Statistical Mistake in ML -- Assuming Normality of Features**: Many ML practitioners apply z-score standardization and assume their features are normally distributed. Standardizing changes the center and scale of a feature, not its shape, and real-world features like income, click counts, and session durations are often heavily skewed. Before standardizing, plot your distributions. For right-skewed data, a log transform before standardization often improves model performance -- especially for linear models and neural networks, which tend to behave better with roughly symmetric inputs.

***

## Other Important Distributions

### Poisson Distribution: Rare Events Over Time

How many customers arrive per hour? How many defects per batch?
How many emails per day?

**Parameter:** λ (lambda) = average rate of events

$$
P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}
$$

```python theme={null}
# Average 5 customers per hour
lambda_rate = 5

# Probability of exactly 3 customers in an hour
p_3 = stats.poisson.pmf(3, lambda_rate)
print(f"P(3 customers): {p_3:.2%}")  # 14.04%

# Probability of 10 or more
p_10_plus = 1 - stats.poisson.cdf(9, lambda_rate)
print(f"P(10+ customers): {p_10_plus:.2%}")  # 3.18%

# Visualize
k_values = range(0, 15)
probs = [stats.poisson.pmf(k, lambda_rate) for k in k_values]

plt.figure(figsize=(10, 5))
plt.bar(k_values, probs, edgecolor='black', alpha=0.7)
plt.xlabel('Number of Customers')
plt.ylabel('Probability')
plt.title(f'Poisson Distribution (λ={lambda_rate})')
plt.show()
```

### Exponential Distribution: Time Between Events

If events occur at rate λ, how long until the next one?

```python theme={null}
# Average 5 customers per hour = 1 customer per 12 minutes on average
lambda_rate = 5  # per hour
avg_wait = 60 / lambda_rate  # 12 minutes

# Probability of waiting more than 20 minutes
p_wait_20 = 1 - stats.expon.cdf(20, scale=avg_wait)
print(f"P(wait > 20 min): {p_wait_20:.2%}")  # 18.89%

# Wait time within which the next customer arrives 90% of the time
time_90 = stats.expon.ppf(0.90, scale=avg_wait)
print(f"90% of waits end within: {time_90:.1f} minutes")  # 27.6 min
```

***

## Mini-Project: Quality Control System

Build a complete quality control system for the ball bearing factory.

```python theme={null}
import numpy as np
from scipy import stats

class QualityControlSystem:
    """
    Quality control system using normal distribution.
    """

    def __init__(self, target, std_dev, tolerance):
        """
        Initialize QC system.

        target: desired measurement (e.g., 10mm)
        std_dev: expected standard deviation in production
        tolerance: acceptable deviation from target (e.g., ±0.1mm)
        """
        self.target = target
        self.std_dev = std_dev
        self.tolerance = tolerance
        self.lower_limit = target - tolerance
        self.upper_limit = target + tolerance

    def expected_defect_rate(self):
        """Calculate expected percentage of out-of-spec products."""
        # P(X < lower) + P(X > upper)
        p_below = stats.norm.cdf(self.lower_limit, self.target, self.std_dev)
        p_above = 1 - stats.norm.cdf(self.upper_limit, self.target, self.std_dev)
        return p_below + p_above

    def analyze_batch(self, measurements):
        """Analyze a batch of measurements."""
        n = len(measurements)
        mean = np.mean(measurements)
        std = np.std(measurements)

        # Count out-of-spec
        defects = np.sum((measurements < self.lower_limit) |
                         (measurements > self.upper_limit))
        defect_rate = defects / n

        # Check if process is in control
        # Mean should be within 2 standard errors of target
        std_error = std / np.sqrt(n)
        z_score = (mean - self.target) / std_error

        results = {
            'batch_size': n,
            'mean': mean,
            'std_dev': std,
            'defects': defects,
            'defect_rate': defect_rate,
            'z_score': z_score,
            'process_in_control': abs(z_score) < 2
        }
        return results

    def print_report(self, results):
        """Print a formatted QC report."""
        print("\n" + "=" * 50)
        print("QUALITY CONTROL REPORT")
        print("=" * 50)
        print(f"Batch Size: {results['batch_size']}")
        print(f"Target: {self.target:.4f} ± {self.tolerance:.4f}")
        print(f"Specification Limits: [{self.lower_limit:.4f}, {self.upper_limit:.4f}]")
        print("-" * 50)
        print(f"Batch Mean: {results['mean']:.4f}")
        print(f"Batch Std Dev: {results['std_dev']:.4f}")
        print(f"Defects: {results['defects']} ({results['defect_rate']:.2%})")
        print(f"Expected Defect Rate: {self.expected_defect_rate():.2%}")
        print("-" * 50)
        print(f"Z-Score: {results['z_score']:.2f}")
        status = "IN CONTROL" if results['process_in_control'] else "OUT OF CONTROL"
        print(f"Process Status: {status}")
        print("=" * 50)

# Create QC system
qc = QualityControlSystem(
    target=10.0,    # 10mm target diameter
    std_dev=0.05,   # 0.05mm expected variation
    tolerance=0.1   # ±0.1mm acceptable
)

# Expected defect rate
print(f"Expected defect rate: {qc.expected_defect_rate():.2%}")

# Simulate a good batch
np.random.seed(42)
good_batch = np.random.normal(10.0, 0.05, 100)
results = qc.analyze_batch(good_batch)
qc.print_report(results)

# Simulate a problematic batch (shifted mean)
bad_batch = np.random.normal(10.08, 0.05, 100)  # Mean shifted by 0.08mm
results_bad = qc.analyze_batch(bad_batch)
qc.print_report(results_bad)
```

**Output:**

```
Expected defect rate: 4.55%

==================================================
QUALITY CONTROL REPORT
==================================================
Batch Size: 100
Target: 10.0000 ± 0.1000
Specification Limits: [9.9000, 10.1000]
--------------------------------------------------
Batch Mean: 10.0024
Batch Std Dev: 0.0496
Defects: 4 (4.00%)
Expected Defect Rate: 4.55%
--------------------------------------------------
Z-Score: 0.49
Process Status: IN CONTROL
==================================================

==================================================
QUALITY CONTROL REPORT
==================================================
Batch Size: 100
Target: 10.0000 ± 0.1000
Specification Limits: [9.9000, 10.1000]
--------------------------------------------------
Batch Mean: 10.0822
Batch Std Dev: 0.0518
Defects: 33 (33.00%)
Expected Defect Rate: 4.55%
--------------------------------------------------
Z-Score: 15.88
Process Status: OUT OF CONTROL
==================================================
```

***

## Practice Exercises

### Exercise 1: Height Analysis

```python theme={null}
# Adult male heights in the US follow N(69.1, 2.9) inches
# (mean 69.1 inches, std dev 2.9 inches)

# Calculate:
# 1. What percentage of men are over 6 feet (72 inches)?
# 2. What percentage are between 5'6" (66 in) and 6'0" (72 in)?
# 3. How tall do you need to be to be in the top 5%?
# 4. What is the z-score for someone 6'4" (76 inches)?
```

```python theme={null}
mu = 69.1
sigma = 2.9

# 1. P(height > 72)
p_over_6ft = 1 - stats.norm.cdf(72, mu, sigma)
print(f"Over 6 feet: {p_over_6ft:.1%}")  # 15.9%

# 2. P(66 < height < 72)
p_between = stats.norm.cdf(72, mu, sigma) - stats.norm.cdf(66, mu, sigma)
print(f"Between 5'6\" and 6'0\": {p_between:.1%}")  # 69.9%

# 3. Top 5% height
top_5_height = stats.norm.ppf(0.95, mu, sigma)
print(f"Top 5% starts at: {top_5_height:.1f} inches")  # 73.9 inches (about 6'2")

# 4. Z-score for 6'4"
z_76 = (76 - mu) / sigma
print(f"Z-score for 6'4\": {z_76:.2f}")  # 2.38
print(f"Percentile: {stats.norm.cdf(z_76) * 100:.1f}%")  # 99.1%
```

### Exercise 2: Server Requests

```python theme={null}
# A web server receives an average of 100 requests per minute.
# Requests follow a Poisson distribution.

# Calculate:
# 1. P(exactly 100 requests in a minute)
# 2. P(more than 120 requests in a minute)
# 3. For capacity planning, what number of requests per minute
#    will only be exceeded 1% of the time?
```

```python theme={null}
lambda_rate = 100

# 1. P(X = 100)
p_exactly_100 = stats.poisson.pmf(100, lambda_rate)
print(f"P(exactly 100): {p_exactly_100:.2%}")  # 3.99%

# 2. P(X > 120)
p_over_120 = 1 - stats.poisson.cdf(120, lambda_rate)
print(f"P(over 120): {p_over_120:.2%}")  # about 2.3%

# 3. 99th percentile (smallest k with P(X <= k) >= 0.99)
capacity_99 = stats.poisson.ppf(0.99, lambda_rate)
print(f"99% of minutes have at most {capacity_99:.0f} requests")  # 124
```

***

## Common Mistakes to Avoid

**Mistake 1: Assuming Everything is Normal**

Not all data follows a normal distribution. Income data is heavily right-skewed. Time-to-event data often follows exponential distributions. Always visualize your data before assuming normality.

**Mistake 2: Misusing the 68-95-99.7 Rule**

This rule ONLY applies to normal distributions. Applying it to skewed data will give wrong answers.
For non-normal data, use Chebyshev's inequality: at least 75% of data is within 2 std devs, regardless of distribution shape.

**Mistake 3: Confusing PDF and CDF**

The PDF gives the relative likelihood at a point (technically, density). The CDF gives the probability of being less than or equal to a value. P(X = exact value) is always 0 for continuous distributions.

***

## Interview Questions

**Question**: Website response times follow a normal distribution with mean 200ms and std dev 50ms. What percentage of requests take more than 300ms?

**Answer**: About 2.3%

```python theme={null}
from scipy import stats

p_slow = 1 - stats.norm.cdf(300, loc=200, scale=50)
# Or using z-score: z = (300-200)/50 = 2
# P(Z > 2) ≈ 0.0228
print(f"{p_slow:.2%}")  # 2.28%
```

The 68-95-99.7 rule gives us a quick check: 300ms is 2 standard deviations above the mean, so roughly 2.5% should be above that.

**Question**: You're modeling these scenarios. Which distribution would you use for each?

1. Number of customers arriving per hour
2. Whether a user clicks an ad (yes/no)
3. Time until a server fails
4. Heights of basketball players

**Answer**:

1. **Poisson** - Counts of events in fixed intervals
2. **Bernoulli** (single trial) or **Binomial** (many users) - Binary outcomes
3. **Exponential** - Time until an event (memoryless process)
4. **Normal** - Continuous measurements of natural phenomena

For height, you might also consider that basketball players are selected to be tall, so it could be a truncated normal!

**Question**: User session times are heavily right-skewed (not normal). You calculate the average session time each day for 30 days. What distribution does the sample mean follow?

**Answer**: Approximately normal!
Thanks to the Central Limit Theorem, the sampling distribution of the mean will be approximately normal regardless of the underlying distribution shape, as long as:

* Sample size (here, the number of sessions averaged each day) is sufficiently large (n ≥ 30 is a common rule of thumb; heavily skewed data may need more)
* The original distribution has finite variance

This is why we can use confidence intervals and hypothesis tests based on the normal distribution even when the underlying data isn't normal.

**Question**: Video start times follow a log-normal distribution (right-skewed). The P50 is 1.2 seconds and P95 is 4.8 seconds. What does this tell you about user experience?

**Answer**:

* Half of users experience start times of 1.2s or less (good!)
* 5% of users wait more than 4.8 seconds (potentially frustrating)
* The ratio P95/P50 = 4 indicates significant variability

For right-skewed metrics like latency, the P95 or P99 is often more important than the mean because it captures the experience of the "unlucky" users. A 4x difference between median and P95 suggests there are edge cases worth investigating (slow CDNs, distant users, etc.).

***

## Practice Challenge

You have real website session data. Determine which distribution best fits it:

```python theme={null}
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated session durations (in seconds)
np.random.seed(42)
sessions = np.random.exponential(scale=120, size=1000)  # Unknown to you!

# Your task:
# 1. Visualize the data with a histogram
# 2. Calculate summary statistics
# 3. Fit different distributions and compare
# 4. Determine which distribution fits best

# Hint: Try normal, exponential, and log-normal

# Starter code:
plt.figure(figsize=(12, 4))

# Histogram
plt.subplot(1, 3, 1)
plt.hist(sessions, bins=50, density=True, alpha=0.7)
plt.title('Data Distribution')
plt.xlabel('Session Duration (s)')

# Q-Q plot for normal
plt.subplot(1, 3, 2)
stats.probplot(sessions, dist="norm", plot=plt)
plt.title('Normal Q-Q Plot')

# Q-Q plot for exponential
plt.subplot(1, 3, 3)
stats.probplot(sessions, dist="expon", plot=plt)
plt.title('Exponential Q-Q Plot')

plt.tight_layout()
plt.show()

# Fit distributions and compare
```

**Solution**:

```python theme={null}
# 1. Visual inspection shows right-skewed data

# 2. Summary stats
print(f"Mean: {np.mean(sessions):.1f}s")
print(f"Median: {np.median(sessions):.1f}s")
print(f"Std: {np.std(sessions):.1f}s")
print(f"Skewness: {stats.skew(sessions):.2f}")  # Positive = right-skewed

# 3. Fit distributions
# Normal
norm_params = stats.norm.fit(sessions)
# Exponential
exp_params = stats.expon.fit(sessions)
# Log-normal
lognorm_params = stats.lognorm.fit(sessions)

# 4. Compare using Kolmogorov-Smirnov test
# Null hypothesis: data follows the distribution
# Lower p-value = worse fit
# Note: the parameters were estimated from the same data, so these p-values
# are optimistic; use them to rank the candidates, not as exact tests.
ks_norm = stats.kstest(sessions, 'norm', args=norm_params)
ks_exp = stats.kstest(sessions, 'expon', args=exp_params)
ks_lognorm = stats.kstest(sessions, 'lognorm', args=lognorm_params)

print(f"\nKS Test p-values:")
print(f"Normal: {ks_norm.pvalue:.4f}")        # Low - bad fit
print(f"Exponential: {ks_exp.pvalue:.4f}")    # High - good fit!
print(f"Log-normal: {ks_lognorm.pvalue:.4f}")

# Exponential wins because mean ≈ std dev (property of exponential)
```

***

## 📝 Practice Exercises

* Work with normal distribution and z-scores
* Apply binomial distribution to A/B testing
* Model customer arrivals with Poisson distribution
* Real-world: Quality control with distributions

**Exercise 1: SAT Scores Analysis** - Normal distribution and z-scores

**Problem**: SAT scores are normally distributed with mean μ = 1050 and standard deviation σ = 200.

1. What percentage of students score above 1250?
2. What score puts a student in the top 10%?
3. Between what scores do the middle 68% of students fall?
4. If a student scores 1400, how many standard deviations above average are they?

**Solution**:

```python theme={null}
from scipy import stats
import numpy as np

mu = 1050    # mean
sigma = 200  # standard deviation

# 1. P(X > 1250)
z = (1250 - mu) / sigma
p_above_1250 = 1 - stats.norm.cdf(z)
print(f"Z-score for 1250: {z}")  # 1.0
print(f"P(Score > 1250): {p_above_1250:.2%}")  # 15.87%

# Using scipy directly
p_above_1250_direct = 1 - stats.norm.cdf(1250, loc=mu, scale=sigma)
print(f"Direct calculation: {p_above_1250_direct:.2%}")

# 2. Score for top 10% (90th percentile)
top_10_score = stats.norm.ppf(0.90, loc=mu, scale=sigma)
print(f"\nTop 10% threshold: {top_10_score:.0f}")  # 1306

# 3. Middle 68% (empirical rule: mean ± 1 std dev)
lower_68 = mu - sigma  # 850
upper_68 = mu + sigma  # 1250
print(f"\nMiddle 68%: {lower_68} to {upper_68}")

# Verify with scipy
lower_16 = stats.norm.ppf(0.16, loc=mu, scale=sigma)
upper_84 = stats.norm.ppf(0.84, loc=mu, scale=sigma)
print(f"Verified: {lower_16:.0f} to {upper_84:.0f}")

# 4. Z-score for 1400
z_1400 = (1400 - mu) / sigma
print(f"\nZ-score for 1400: {z_1400:.2f}")  # 1.75
print(f"This is {z_1400:.2f} standard deviations above the mean")
print(f"Percentile: {stats.norm.cdf(z_1400)*100:.1f}%")  # 96th percentile
```
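Z-scores also let you compare results measured on different scales. As a short sketch, the snippet below puts the SAT score of 1400 from this exercise next to an IQ of 130, using the N(100, 15) IQ model from earlier in this module; treating both as exactly normal is an assumption made for illustration.

```python
from scipy import stats

def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

z_sat = z_score(1400, mu=1050, sigma=200)  # SAT model from this exercise
z_iq = z_score(130, mu=100, sigma=15)      # IQ model used earlier in the module

for label, z in (("SAT 1400", z_sat), ("IQ 130", z_iq)):
    # The standard normal CDF turns a z-score into a percentile
    print(f"{label}: z = {z:.2f}, percentile = {stats.norm.cdf(z):.1%}")
```

Because both values are expressed in standard deviations, they can be ranked directly: under these models the IQ of 130 is the more unusual result.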

**Exercise 2: A/B Test Analysis** - Binomial distribution

**Problem**: You're running an A/B test for a website button color:

* Control (Blue): 500 visitors, 45 conversions
* Treatment (Green): 500 visitors, 58 conversions

1. What's the conversion rate for each group?
2. Using binomial distribution, what's P(≥58 conversions) if true rate is 9%?
3. Is the difference likely due to chance?

**Solution**:

```python theme={null}
from scipy import stats
import numpy as np

# Data
control_visitors, control_conversions = 500, 45
treatment_visitors, treatment_conversions = 500, 58

# 1. Conversion rates
control_rate = control_conversions / control_visitors
treatment_rate = treatment_conversions / treatment_visitors
print(f"Control rate: {control_rate:.1%}")      # 9.0%
print(f"Treatment rate: {treatment_rate:.1%}")  # 11.6%
print(f"Lift: {(treatment_rate/control_rate - 1)*100:.1f}%")  # 28.9% lift

# 2. P(X >= 58) if true rate is 9% (null hypothesis)
p_null = 0.09  # Assume true rate is same as control
n = 500

# P(X >= 58) = 1 - P(X <= 57)
p_at_least_58 = 1 - stats.binom.cdf(57, n, p_null)
print(f"\nP(≥58 conversions | rate=9%): {p_at_least_58:.4f}")  # roughly 3%

# 3. Statistical significance (simplified, one-sided)
# This treats the 9% control rate as exactly known and ignores
# the sampling noise in the control group itself.
if p_at_least_58 < 0.05:
    print("\nSimplified check: unlikely if the true rate were exactly 9%.")
else:
    print("\nSimplified check: consistent with a true rate of 9%.")

# More rigorous: Two-proportion z-test (accounts for noise in both groups)
# Pooled proportion
pooled_p = (control_conversions + treatment_conversions) / (control_visitors + treatment_visitors)
se = np.sqrt(pooled_p * (1-pooled_p) * (1/control_visitors + 1/treatment_visitors))
z_stat = (treatment_rate - control_rate) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

print(f"\nZ-statistic: {z_stat:.3f}")            # about 1.35
print(f"P-value (two-tailed): {p_value:.4f}")    # about 0.18
# p > 0.05: once both groups' noise is included, the difference
# could plausibly be chance variation. Collect more data before deciding.
```
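One way to build intuition for "is the difference likely due to chance?" is to simulate many A/B tests in which both buttons share the same true conversion rate. The sketch below is an illustration rather than a replacement for the z-test: it assumes the pooled rate of 10.3% as the shared true rate and uses an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

n_visitors = 500
pooled_rate = (45 + 58) / (500 + 500)  # assumed shared true rate under the null
observed_gap = 58 - 45                 # treatment had 13 more conversions

# Simulate many experiments where the button color makes no difference
n_sims = 20_000
control_sim = rng.binomial(n_visitors, pooled_rate, size=n_sims)
treatment_sim = rng.binomial(n_visitors, pooled_rate, size=n_sims)

# Share of null experiments with a gap at least as large as the observed one
p_sim = np.mean(np.abs(treatment_sim - control_sim) >= observed_gap)
print(f"Simulated two-sided p-value: {p_sim:.3f}")
```

If a gap of 13 conversions shows up often when nothing has changed, the observed lift on its own is weak evidence that the green button is better.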

**Exercise 3: Customer Arrivals** - Poisson distribution

**Problem**: A coffee shop sees an average of 4 customers per minute during rush hour.

1. What's P(exactly 6 customers) in a given minute?
2. What's P(0 or 1 customers) in a given minute?
3. What's P(more than 10 customers) in a 2-minute window?
4. How many baristas needed if each can serve 5 customers/minute?

**Solution**:

```python theme={null}
from scipy import stats
import numpy as np

lambda_per_min = 4  # Average customers per minute

# 1. P(X = 6)
p_exactly_6 = stats.poisson.pmf(6, lambda_per_min)
print(f"P(exactly 6 customers): {p_exactly_6:.4f}")  # ~10.4%

# 2. P(X = 0 or X = 1) = P(X = 0) + P(X = 1)
p_0_or_1 = stats.poisson.cdf(1, lambda_per_min)  # CDF gives P(X <= 1)
print(f"P(0 or 1 customers): {p_0_or_1:.4f}")  # ~9.2%

# 3. In 2-minute window, lambda = 4 * 2 = 8
lambda_2min = 8
p_more_than_10 = 1 - stats.poisson.cdf(10, lambda_2min)
print(f"P(more than 10 in 2 min): {p_more_than_10:.4f}")  # ~18.4%

# 4. Staffing analysis
# Each barista serves 5 customers/minute
# If we want to handle 95th percentile of demand...
customers_95th = stats.poisson.ppf(0.95, lambda_per_min)
print(f"\n95th percentile demand: {customers_95th:.0f} customers/min")
baristas_needed = np.ceil(customers_95th / 5)
print(f"Baristas needed (95% coverage): {baristas_needed:.0f}")

# Probability of being overwhelmed with 1 vs 2 baristas
p_overwhelm_1 = 1 - stats.poisson.cdf(5, lambda_per_min)   # >5 customers
p_overwhelm_2 = 1 - stats.poisson.cdf(10, lambda_per_min)  # >10 customers
print(f"\nP(overwhelmed with 1 barista): {p_overwhelm_1:.1%}")
print(f"P(overwhelmed with 2 baristas): {p_overwhelm_2:.1%}")
```
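Part 3 relies on the fact that counts from independent Poisson intervals add up to another Poisson count whose rate is the sum of the rates. A quick simulation sketch (arbitrary seed, variable names chosen for this illustration) can check the 2-minute answer against that rule:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed

# Simulate many 2-minute windows as two independent 1-minute Poisson(4) counts
n_windows = 200_000
per_minute = rng.poisson(lam=4, size=(n_windows, 2))
two_minute_totals = per_minute.sum(axis=1)

p_simulated = np.mean(two_minute_totals > 10)
p_theoretical = 1 - stats.poisson.cdf(10, 8)  # Poisson with rate 4 + 4 = 8

print(f"Simulated   P(more than 10 in 2 min): {p_simulated:.4f}")
print(f"Theoretical P(more than 10 in 2 min): {p_theoretical:.4f}")
```

The two numbers should agree to within simulation noise, which supports scaling λ with the length of the window.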

**Exercise 4: Manufacturing Quality Control** - Real-world application

**Problem**: A factory produces bolts with target diameter 10mm and acceptable tolerance ±0.2mm. The machine produces bolts with μ = 10.02mm and σ = 0.08mm.

1. What percentage of bolts are within specification (9.8mm to 10.2mm)?
2. If you produce 10,000 bolts, how many are rejected?
3. To reduce rejects to under 1%, what standard deviation is needed?
4. Should you adjust the mean or reduce variability?

**Solution**:

```python theme={null}
from scipy import stats
import numpy as np

# Current machine parameters
mu = 10.02    # Slightly off-center
sigma = 0.08

# Specifications
lower_spec = 9.8
upper_spec = 10.2

# 1. Percentage within spec
p_below_upper = stats.norm.cdf(upper_spec, loc=mu, scale=sigma)
p_below_lower = stats.norm.cdf(lower_spec, loc=mu, scale=sigma)
p_in_spec = p_below_upper - p_below_lower

print(f"Current performance:")
print(f"  Mean: {mu}mm (target: 10mm, off by {mu-10:.2f}mm)")
print(f"  Std Dev: {sigma}mm")
print(f"  % in spec: {p_in_spec:.2%}")  # ~98.5%

# 2. Rejects from 10,000 bolts
n_bolts = 10000
n_rejects = n_bolts * (1 - p_in_spec)
print(f"\nFrom {n_bolts:,} bolts: {n_rejects:.0f} rejected")  # ~152

# 3. What sigma needed for <1% rejects?
target_reject_rate = 0.01
target_in_spec = 1 - target_reject_rate

# For centered distribution, need P(|X - μ| < 0.2) > 0.99
# This means 0.5% in each tail
# Z for 99.5th percentile ≈ 2.576
z_required = stats.norm.ppf(1 - target_reject_rate/2)
sigma_required = 0.2 / z_required  # If centered

print(f"\nTo achieve <1% rejects (if centered):")
print(f"  Z required: {z_required:.3f}")
print(f"  Sigma required: {sigma_required:.4f}mm")  # ~0.0776mm

# 4. Compare: Fix mean vs reduce variability
# Option A: Center the mean (mu = 10.00)
mu_centered = 10.0
p_in_spec_centered = (stats.norm.cdf(upper_spec, loc=mu_centered, scale=sigma) -
                      stats.norm.cdf(lower_spec, loc=mu_centered, scale=sigma))

# Option B: Keep mean, reduce sigma to 0.06
sigma_reduced = 0.06
p_in_spec_reduced_sigma = (stats.norm.cdf(upper_spec, loc=mu, scale=sigma_reduced) -
                           stats.norm.cdf(lower_spec, loc=mu, scale=sigma_reduced))

print(f"\nImprovement options:")
print(f"  Current: {p_in_spec:.2%} in spec")                           # ~98.48%
print(f"  Center mean (μ=10.00): {p_in_spec_centered:.2%} in spec")    # ~98.76%
print(f"  Reduce σ to 0.06: {p_in_spec_reduced_sigma:.2%} in spec")    # ~99.85%

print(f"\nRecommendation: Centering the mean is cheap but gains little here; "
      f"reducing σ is what brings rejects under 1%.")
```
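Part 3 assumes a centered process. As a sketch of how the same question could be answered for the off-center mean of 10.02mm, the snippet below solves numerically for the σ that gives exactly 1% rejects, using `scipy.optimize.brentq`. The helper names and the search bracket of 0.01mm to 0.2mm are assumptions made for this illustration.

```python
from scipy import stats
from scipy.optimize import brentq

LOWER_SPEC, UPPER_SPEC = 9.8, 10.2

def reject_rate(mu, sigma):
    """Fraction of bolts outside the spec limits for a normal process."""
    p_low = stats.norm.cdf(LOWER_SPEC, loc=mu, scale=sigma)
    p_high = stats.norm.sf(UPPER_SPEC, loc=mu, scale=sigma)  # sf = 1 - cdf
    return p_low + p_high

def sigma_for_reject_rate(mu, target_rate):
    """Sigma at which the reject rate equals target_rate (root found by bracketing)."""
    # The reject rate grows with sigma, so one root lies inside the assumed bracket
    return brentq(lambda s: reject_rate(mu, s) - target_rate, 0.01, 0.2)

sigma_centered = sigma_for_reject_rate(10.00, 0.01)
sigma_off_center = sigma_for_reject_rate(10.02, 0.01)

print(f"Sigma needed if centered (μ=10.00):   {sigma_centered:.4f}mm")
print(f"Sigma needed if off-center (μ=10.02): {sigma_off_center:.4f}mm")
```

The off-center process needs a slightly tighter σ than the centered one to hit the same reject rate, because the shifted mean sits closer to the upper limit.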

***

## Key Takeaways

* **Discrete**: Countable outcomes (die rolls, counts)
* **Continuous**: Any value in a range (measurements)
* Each distribution has parameters that define its shape
* The normal distribution is defined by its mean (μ) and standard deviation (σ)
* 68-95-99.7 rule for quick calculations
* The normal distribution appears everywhere due to the Central Limit Theorem
* **Uniform**: Equal probability (dice, random selection)
* **Binomial**: Success/failure experiments (conversions, defects)
* **Normal**: Continuous measurements (heights, errors)
* **Poisson**: Count of rare events (arrivals, defects)
* Z-scores standardize any normal distribution
* z = (x - μ) / σ
* Allows comparison across different scales
* Standard normal has μ=0, σ=1

***

## Interview Prep: Common Questions

**Q: When would you use Poisson vs Binomial distribution?**

> Poisson: Counting events in continuous time/space where events are rare (website visits, defects). Binomial: Fixed number of trials with binary outcomes (10 coin flips, 100 users converting).

**Q: How do you check if data is normally distributed?**

> Visual: histogram, Q-Q plot. Statistical: Shapiro-Wilk test, Anderson-Darling test. Rule of thumb: Check absolute skewness (\< 2) and kurtosis (\< 7).

**Q: What is the Central Limit Theorem and why does it matter?**

> CLT states that sample means approach a normal distribution regardless of the population distribution, provided the population has finite variance and the samples are large enough (n ≥ 30 is a common rule of thumb; heavily skewed data needs more). It's why we can use normal-based methods even when data isn't normally distributed.

**Q: A process has 2% defect rate. What distribution models the number of defects in a batch of 50?**

> Binomial with n=50, p=0.02. Expected defects = np = 1. Could approximate with Poisson(λ=1) since n is large and p is small.

***

## Common Pitfalls

**Distribution Mistakes to Avoid**:

1. **Assuming Normality** - Always check; many real-world distributions are skewed or heavy-tailed
2. **Confusing Parameters** - Variance (σ²) vs Standard Deviation (σ); Population vs Sample
3. **Ignoring Distribution Shape** - Mean/std alone don't fully describe a distribution; visualize first
4. **Wrong Distribution Choice** - Using normal for bounded data, using binomial for continuous outcomes
5. **CLT Misapplication** - CLT applies to sample means, not individual observations

***

## Connection to Machine Learning

| Distribution Concept  | ML Application                                       |
| --------------------- | ---------------------------------------------------- |
| Normal distribution   | Gaussian noise, regularization, Gaussian Naive Bayes |
| Central Limit Theorem | Why batch statistics work, confidence in predictions |
| Z-scores              | Feature standardization, batch normalization         |
| Binomial              | Classification evaluation, confidence intervals      |
| Poisson               | Count prediction, event modeling                     |

**ML Connection**: When you see "Gaussian" in ML papers, it means "normal distribution." Gaussian processes, Gaussian mixture models, and Gaussian noise all rely on properties of the normal distribution you just learned!

**Coming up next**: We'll learn about **Statistical Inference** - how to draw conclusions about entire populations from just samples. This is how polls predict elections and A/B tests drive decisions.

***

## Interview Deep-Dive

**Strong Answer:**

* The choice depends on the nature of the data-generating process, not on what the histogram looks like. Poisson is the right choice when you are counting events in a continuous interval (tickets per hour) where events are independent and occur at a roughly constant rate. It has one parameter (lambda, the average rate) and its variance equals its mean.
* Binomial is correct when you have a fixed number of discrete trials each with a binary outcome -- for example, "out of 500 customers who contacted us, how many submitted a ticket?" It requires knowing the number of trials and the success probability.
* Normal might be appropriate if you are looking at the average number of tickets per day over many days. By the Central Limit Theorem, the daily averages will be approximately normal even if individual arrivals follow a Poisson process. But you would not use normal for the raw counts because counts cannot be negative, and the normal distribution extends to negative infinity.
* In practice, I would start by checking whether the mean and variance of the ticket counts are roughly equal. If they are, Poisson is a good fit. If the variance is much larger than the mean (overdispersion), I would consider a Negative Binomial distribution instead, which adds a dispersion parameter. Overdispersion is extremely common in real ticket data because arrival rates are not actually constant -- they vary by time of day, day of week, and whether there was a product incident.

**Follow-up: Your ticket data shows variance that is 4x the mean. What does this tell you and how do you handle it?**

Variance much larger than the mean is overdispersion, and it means the Poisson assumption is violated. This typically happens because the arrival rate itself is not constant -- it varies over time or across customer segments. Using Poisson in this situation would underestimate the probability of extreme counts (many tickets or zero tickets) and give overly narrow prediction intervals. The fix is to use a Negative Binomial distribution, which explicitly models this extra variation. Alternatively, you can build a hierarchical model: the arrival rate lambda follows a Gamma distribution across time periods, and conditional on lambda, counts follow a Poisson. This is actually mathematically equivalent to the Negative Binomial and gives you a richer understanding of what is driving the overdispersion.

**Strong Answer:**

* For the product manager: "Imagine you survey 100 random customers and compute the average satisfaction score. If you repeated that survey many times, each time with a different random 100 customers, those averages would form a bell curve -- even if individual satisfaction scores are not bell-shaped at all. The Central Limit Theorem says that averages of random samples become predictable and bell-shaped as long as your sample is large enough. That is why we can compute a margin of error on any survey or test result."
* For the technical layer: the CLT states that the sampling distribution of the sample mean converges to a normal distribution as sample size increases, regardless of the population distribution, provided the population has finite variance. The rate of convergence depends on how "non-normal" the underlying distribution is -- highly skewed distributions need larger n.
* For A/B testing specifically, the CLT is the entire foundation. When you compare conversion rates between two groups, each conversion rate is a sample mean (of a Bernoulli variable). The CLT guarantees that the difference between these means is approximately normally distributed, which is why you can use a z-test to compute a p-value. Without the CLT, you would need to know the exact distribution of your metric to do any hypothesis testing.
* The practical caveat: the CLT needs "large enough" samples, and "large enough" depends on the distribution. For proportions near 0.5, n=30 is usually fine. For proportions near 0.01 (like conversion rates), you might need n=500 or more before the normal approximation is accurate. This is why very low conversion rate tests need more traffic.

**Follow-up: When does the CLT fail or give misleading results, even with a large sample?**

The CLT fails when the underlying distribution does not have a finite variance. The canonical example is a Cauchy distribution (heavy-tailed), where the sample mean does not converge to anything normal no matter how many samples you take. In practice, this matters for financial data -- stock returns have heavier tails than normal, and models that assume normal distributions (like some VaR calculations) systematically underestimate tail risk. Underestimating tail risk in this way is often cited as a contributing factor in the 2008 financial crisis. Another practical failure mode is when your data has structural dependencies that violate the "independent and identically distributed" assumption -- like time-series data with autocorrelation or clustered data where observations within a cluster are correlated. In those cases, the effective sample size is much smaller than the nominal sample size, and the CLT-based confidence intervals are too narrow.

**Strong Answer:**

* Before stopping the line (which is expensive), I need to determine if 18 defects in 500 is statistically inconsistent with the expected 2% rate. Under the null hypothesis of 2%, the expected number of defects is 10, and the standard deviation is sqrt(500 × 0.02 × 0.98) ≈ 3.13.
* The z-score for 18 defects is (18 - 10) / 3.13 = 2.56, giving a one-tailed p-value of about 0.005. This is well below the typical 0.05 threshold. So statistically, yes, 18 defects is very unlikely if the true rate is still 2%.
* However, the statistical answer is only half the decision. I would also consider: Is this a sudden spike or a gradual trend? (Check a control chart for the last several batches.) What is the cost of stopping the line versus the cost of shipping defective products? Is there a known assignable cause (like a new material batch or a maintenance event)?
* In a Six Sigma framework, this would trigger an investigation but not necessarily an immediate line stop. I would pull the last 5 batches of data and look at a Shewhart control chart. If the process mean has shifted (as opposed to one unlucky batch), that warrants corrective action. If this is a single batch anomaly, the response might be different -- inspect remaining inventory from this batch rather than shutting everything down.
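The arithmetic in the answer above can be reproduced with a short sketch. It uses the normal approximation to the binomial via the standard library's `statistics.NormalDist`, and it also computes a 3-sigma upper control limit of the kind a Shewhart np-chart would use. The variable names and the choice of 3-sigma limits are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

n = 500          # batch size
p0 = 0.02        # historical defect rate (null hypothesis)
observed = 18    # defects found in this batch

expected = n * p0                       # expected defects under the null
std_dev = sqrt(n * p0 * (1 - p0))       # binomial standard deviation
z = (observed - expected) / std_dev     # how many std devs above expected
p_value = 1 - NormalDist().cdf(z)       # one-tailed: P(Z >= z)

# Assumed 3-sigma upper control limit for an np-chart
upper_control_limit = expected + 3 * std_dev

print(f"Expected defects: {expected:.1f}")
print(f"Std deviation:    {std_dev:.2f}")
print(f"Z-score:          {z:.2f}")
print(f"One-tailed p:     {p_value:.4f}")
print(f"3-sigma UCL:      {upper_control_limit:.1f}")
print(f"Beyond UCL?       {observed > upper_control_limit}")
```

With these inputs the batch is significant at the 0.05 level yet still sits just inside the assumed 3-sigma limit, which fits the "investigate, but do not necessarily stop the line" reading.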
**Follow-up: What is the difference between using a binomial exact test versus a normal approximation here, and when does it matter?**

For n=500 and p=0.02, the usual rule of thumb for the normal approximation is met because np=10 and n(1-p)=490 are both greater than 5, but with np=10 the binomial is still noticeably right-skewed. The binomial exact test gives P(X >= 18 given n=500, p=0.02) directly without the normal approximation, and it comes out at roughly 0.013, compared with about 0.005 from the uncorrected normal approximation. Both values are below 0.05, so the conclusion is the same here, but the approximation understates the tail probability by more than a factor of two. The exact test matters most when either np or n(1-p) is small, which happens with very rare events (like a 0.01% defect rate tested on 100 items). In those cases, the normal approximation can be meaningfully wrong, and you should use the exact binomial or a Poisson approximation instead. In modern practice, there is little reason not to use the exact test since computational cost is negligible, but understanding when the approximation breaks helps you catch errors in older tools that default to normal.
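To see how the exact and approximate tail probabilities compare for this batch, the sketch below computes P(X >= 18) four ways: exact binomial, normal approximation with and without a continuity correction, and a Poisson approximation. It relies only on the standard library; the helper functions `binom_pmf` and `poisson_pmf` are written here for illustration and are not taken from any statistics package.

```python
from math import comb, exp, factorial, sqrt
from statistics import NormalDist

n, p0, observed = 500, 0.02, 18


def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events at rate lam."""
    return exp(-lam) * lam**k / factorial(k)


mean = n * p0
std_dev = sqrt(n * p0 * (1 - p0))
std_normal = NormalDist()

# Exact binomial tail: 1 - P(X <= 17)
exact_tail = 1 - sum(binom_pmf(k, n, p0) for k in range(observed))

# Normal approximation, without and with a continuity correction
normal_tail = 1 - std_normal.cdf((observed - mean) / std_dev)
corrected_tail = 1 - std_normal.cdf((observed - 0.5 - mean) / std_dev)

# Poisson approximation with lambda = n * p0
poisson_tail = 1 - sum(poisson_pmf(k, mean) for k in range(observed))

print(f"Exact binomial:              {exact_tail:.4f}")
print(f"Normal (no correction):      {normal_tail:.4f}")
print(f"Normal (continuity corr.):   {corrected_tail:.4f}")
print(f"Poisson approximation:       {poisson_tail:.4f}")
```

For a right-skewed count distribution like this one, the exact tail is somewhat larger than the uncorrected normal tail; the continuity correction and the Poisson approximation both move closer to the exact value.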