Bayesian Statistics

The Doctor’s Dilemma (Revisited)

Remember from our probability module: a test for a rare disease (1 in 1000 people) comes back positive. The test is 99% accurate: it catches 99% of true cases and false-alarms on only 1% of healthy people. What’s the probability you actually have the disease? Most people say “99%.” The real answer: about 9%. This counterintuitive result is the essence of Bayesian thinking. Let’s understand it deeply.
Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: Probability and Distributions modules
What You’ll Build: Spam filter, A/B test analyzer, and diagnostic tool

Bayes’ Theorem: The Core Formula

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

In plain English:
  • P(A|B): Probability of A given we observed B (posterior)
  • P(B|A): Probability of observing B if A is true (likelihood)
  • P(A): Prior probability of A before seeing evidence
  • P(B): Total probability of observing B

The Medical Test Example

Let’s define:
  • A: You have the disease
  • B: Test is positive
Given:
  • P(disease) = 1/1000 = 0.001 (prior)
  • P(positive|disease) = 0.99 (sensitivity)
  • P(positive|no disease) = 0.01 (false positive rate)
import numpy as np

# Prior probability of disease
p_disease = 0.001
p_no_disease = 1 - p_disease

# Likelihood of positive test
p_positive_given_disease = 0.99
p_positive_given_no_disease = 0.01

# Total probability of positive test
p_positive = (p_positive_given_disease * p_disease + 
              p_positive_given_no_disease * p_no_disease)

# Bayes' theorem
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")
print(f"That's only {p_disease_given_positive * 100:.1f}%!")
Output:
P(disease | positive test) = 0.090
That's only 9.0%!

Why Is It So Low?

# Let's think about 100,000 people
n_people = 100_000

# How many have the disease?
n_sick = int(n_people * p_disease)  # 100
n_healthy = n_people - n_sick        # 99,900

# How many test positive?
true_positives = int(n_sick * p_positive_given_disease)  # 99
false_positives = int(n_healthy * p_positive_given_no_disease)  # 999

total_positives = true_positives + false_positives

print("Out of 100,000 people:")
print(f"  Actually sick: {n_sick}")
print(f"  Test positive AND sick: {true_positives}")
print(f"  Test positive BUT healthy: {false_positives}")
print(f"  Total positive tests: {total_positives}")
print(f"\nChance you're sick if positive: {true_positives}/{total_positives} = {true_positives/total_positives:.1%}")
Key Insight: When the disease is rare, even a small false positive rate produces many false alarms that swamp the true positives!
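The same arithmetic at different base rates shows how quickly the posterior collapses as a condition gets rarer. A quick sweep, reusing the numbers above:
# Same 99%-accurate test, different disease base rates
for prior in [0.5, 0.1, 0.01, 0.001, 0.0001]:
    p_positive = 0.99 * prior + 0.01 * (1 - prior)
    posterior = 0.99 * prior / p_positive
    print(f"Prevalence {prior:g}: P(disease | positive) = {posterior:.1%}")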

Frequentist vs Bayesian: Two Philosophies

The Coin Flip Example

You flip a coin 10 times and get 7 heads. What’s the probability of heads?

Frequentist Answer:
  • The “true” probability is a fixed (unknown) number
  • Our best estimate is 7/10 = 0.70
  • Confidence interval: roughly 0.35 to 0.93 (wide because small sample)
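That interval is easy to verify with scipy’s exact (Clopper–Pearson) binomial test (available in scipy 1.7+):
from scipy import stats

ci = stats.binomtest(k=7, n=10).proportion_ci(confidence_level=0.95)
print(f"95% CI: ({ci.low:.2f}, {ci.high:.2f})")  # roughly (0.35, 0.93)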
Bayesian Answer:
  • We start with a prior belief (maybe 50-50)
  • We update based on evidence
  • We get a posterior distribution over possible values
import matplotlib.pyplot as plt
from scipy import stats

# Data: 7 heads in 10 flips
n_heads = 7
n_trials = 10

# Define a range of possible probabilities
p_range = np.linspace(0, 1, 1000)

# Bayesian approach with different priors
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Prior 1: Uniform (no prior knowledge)
prior_uniform = stats.beta(1, 1)  # Beta(1,1) = Uniform
posterior_uniform = stats.beta(1 + n_heads, 1 + n_trials - n_heads)

axes[0, 0].plot(p_range, prior_uniform.pdf(p_range), 'b--', label='Prior: Uniform')
axes[0, 0].plot(p_range, posterior_uniform.pdf(p_range), 'r-', linewidth=2, label='Posterior')
axes[0, 0].axvline(x=0.7, color='green', linestyle=':', label='MLE = 0.70')
axes[0, 0].set_xlabel('Probability of Heads')
axes[0, 0].set_ylabel('Density')
axes[0, 0].set_title('Uniform Prior: "I have no idea"')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Prior 2: Strong belief in fair coin
prior_fair = stats.beta(50, 50)  # Peaked at 0.5
posterior_fair = stats.beta(50 + n_heads, 50 + n_trials - n_heads)

axes[0, 1].plot(p_range, prior_fair.pdf(p_range), 'b--', label='Prior: Fair coin')
axes[0, 1].plot(p_range, posterior_fair.pdf(p_range), 'r-', linewidth=2, label='Posterior')
axes[0, 1].axvline(x=0.7, color='green', linestyle=':', label='MLE = 0.70')
axes[0, 1].set_xlabel('Probability of Heads')
axes[0, 1].set_ylabel('Density')
axes[0, 1].set_title('Strong Prior: "Coins are usually fair"')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Prior 3: Weak belief in fair coin
prior_weak = stats.beta(5, 5)  # Weakly peaked at 0.5
posterior_weak = stats.beta(5 + n_heads, 5 + n_trials - n_heads)

axes[1, 0].plot(p_range, prior_weak.pdf(p_range), 'b--', label='Prior: Slightly fair')
axes[1, 0].plot(p_range, posterior_weak.pdf(p_range), 'r-', linewidth=2, label='Posterior')
axes[1, 0].axvline(x=0.7, color='green', linestyle=':', label='MLE = 0.70')
axes[1, 0].set_xlabel('Probability of Heads')
axes[1, 0].set_ylabel('Density')
axes[1, 0].set_title('Weak Prior: "Coins are probably fair"')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# What happens with more data?
axes[1, 1].set_title('Posterior with Different Sample Sizes')
for n in [10, 50, 100, 500]:
    n_h = int(0.6 * n)  # True rate is 60%
    posterior = stats.beta(1 + n_h, 1 + n - n_h)
    axes[1, 1].plot(p_range, posterior.pdf(p_range), label=f'n={n}')
axes[1, 1].axvline(x=0.6, color='black', linestyle='--', label='True rate')
axes[1, 1].set_xlabel('Probability of Heads')
axes[1, 1].set_ylabel('Density')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print credible intervals
print("95% Credible Intervals:")
print(f"  Uniform prior: {posterior_uniform.interval(0.95)}")
print(f"  Strong prior:  {posterior_fair.interval(0.95)}")
print(f"  Weak prior:    {posterior_weak.interval(0.95)}")
Key Observations:
  1. Strong priors resist change (you need lots of data to override them)
  2. Weak priors let the data speak
  3. With enough data, all priors converge to the same answer
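These observations have a precise form: with a Beta(α, β) prior and h heads in n flips, the posterior is Beta(α + h, β + n − h), so the posterior mean is a compromise between the prior mean and the observed rate. For our 7 heads in 10 flips:
# Posterior mean = (alpha + heads) / (alpha + beta + n):
# a weighted compromise between the prior mean (0.5) and the observed rate (0.7)
for name, a, b in [("Uniform Beta(1,1)", 1, 1),
                   ("Weak Beta(5,5)", 5, 5),
                   ("Strong Beta(50,50)", 50, 50)]:
    post_mean = (a + 7) / (a + b + 10)
    print(f"{name:>18}: posterior mean = {post_mean:.3f}")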

Building a Bayesian Spam Filter

This is essentially how early spam filters worked: naive Bayes over word frequencies.
import re
from collections import defaultdict

class NaiveBayesSpamFilter:
    """
    A simple spam filter using Bayes' theorem.
    P(spam | words) ∝ P(words | spam) * P(spam)
    """
    
    def __init__(self, spam_prior=0.3):
        self.spam_prior = spam_prior
        self.word_spam_prob = defaultdict(lambda: 0.5)  # P(word | spam)
        self.word_ham_prob = defaultdict(lambda: 0.5)   # P(word | not spam)
        self.spam_words = defaultdict(int)
        self.ham_words = defaultdict(int)
        self.total_spam_words = 0
        self.total_ham_words = 0
    
    def tokenize(self, text):
        """Convert text to lowercase words."""
        return re.findall(r'\b[a-z]+\b', text.lower())
    
    def train(self, messages, labels):
        """Train on labeled messages."""
        # Count words in spam vs ham
        for msg, is_spam in zip(messages, labels):
            words = self.tokenize(msg)
            for word in words:
                if is_spam:
                    self.spam_words[word] += 1
                    self.total_spam_words += 1
                else:
                    self.ham_words[word] += 1
                    self.total_ham_words += 1
        
        # Calculate probabilities with Laplace smoothing
        vocab = set(self.spam_words.keys()) | set(self.ham_words.keys())
        vocab_size = len(vocab)
        
        for word in vocab:
            self.word_spam_prob[word] = (self.spam_words[word] + 1) / (self.total_spam_words + vocab_size)
            self.word_ham_prob[word] = (self.ham_words[word] + 1) / (self.total_ham_words + vocab_size)
    
    def predict_proba(self, message):
        """Calculate P(spam | message) using Bayes."""
        words = self.tokenize(message)
        
        # Log probabilities to avoid underflow
        log_p_spam = np.log(self.spam_prior)
        log_p_ham = np.log(1 - self.spam_prior)
        
        for word in words:
            log_p_spam += np.log(self.word_spam_prob[word])
            log_p_ham += np.log(self.word_ham_prob[word])
        
        # Convert back from log space
        # P(spam | words) = P(words | spam) * P(spam) / P(words)
        # Use log-sum-exp trick for numerical stability
        max_log = max(log_p_spam, log_p_ham)
        p_spam = np.exp(log_p_spam - max_log)
        p_ham = np.exp(log_p_ham - max_log)
        
        return p_spam / (p_spam + p_ham)
    
    def predict(self, message, threshold=0.5):
        """Classify as spam or not."""
        return self.predict_proba(message) > threshold

# Training data
spam_messages = [
    "FREE MONEY! Click here to claim your prize!",
    "You've WON $1,000,000! Act now!",
    "Congratulations! You're our lucky winner!",
    "Buy cheap pills online! No prescription needed!",
    "Make money fast! Work from home!",
    "FREE iPhone! Limited time offer!",
    "Your account has been compromised! Click to verify!",
    "Hot singles in your area want to meet!",
]

ham_messages = [
    "Hi, can we schedule a meeting for tomorrow?",
    "Thanks for sending the report. Looks good!",
    "Don't forget about dinner with mom on Sunday.",
    "The project deadline has been moved to Friday.",
    "Great job on the presentation!",
    "Please review the attached document.",
    "Looking forward to our call next week.",
    "Can you help me with this Python code?",
]

# Train the filter (avoid the name "filter", which shadows a Python built-in)
spam_filter = NaiveBayesSpamFilter(spam_prior=0.3)
messages = spam_messages + ham_messages
labels = [True] * len(spam_messages) + [False] * len(ham_messages)
spam_filter.train(messages, labels)

# Test on new messages
test_messages = [
    "You've won a FREE vacation!",
    "Meeting rescheduled to 3pm",
    "Claim your prize money now!",
    "The quarterly report is ready for review",
    "Buy now! Limited offer!",
]

print("Spam Filter Results:")
print("-" * 50)
for msg in test_messages:
    prob = spam_filter.predict_proba(msg)
    label = "SPAM" if spam_filter.predict(msg) else "HAM"
    print(f"{prob:.1%} spam | {label:4} | {msg[:40]}...")

# Show most "spammy" and "hammy" words
print("\n" + "=" * 50)
print("Most indicative words:")
word_ratios = {}
for word in spam_filter.spam_words:
    if spam_filter.ham_words[word] > 0:
        ratio = spam_filter.word_spam_prob[word] / spam_filter.word_ham_prob[word]
        word_ratios[word] = ratio

sorted_words = sorted(word_ratios.items(), key=lambda x: x[1], reverse=True)

print("\nMost 'spammy' words:")
for word, ratio in sorted_words[:5]:
    print(f"  {word}: {ratio:.2f}x more likely in spam")

print("\nMost 'legitimate' words:")
for word, ratio in sorted_words[-5:]:
    print(f"  {word}: {1/ratio:.2f}x more likely in ham")

Bayesian A/B Testing

The Problem with Frequentist A/B Tests

Traditional A/B tests give you a yes/no answer: “statistically significant” or not. Bayesian A/B testing gives you richer information:
  • Probability that B is better than A
  • Expected improvement
  • Distribution of possible outcomes
from scipy import stats

def bayesian_ab_test(visitors_a, conversions_a, visitors_b, conversions_b, 
                     prior_alpha=1, prior_beta=1, n_samples=100000):
    """
    Run a Bayesian A/B test.
    Returns probability that B is better than A, and expected lift.
    """
    # Posterior distributions (Beta-Binomial conjugate)
    posterior_a = stats.beta(prior_alpha + conversions_a, 
                            prior_beta + visitors_a - conversions_a)
    posterior_b = stats.beta(prior_alpha + conversions_b, 
                            prior_beta + visitors_b - conversions_b)
    
    # Sample from posteriors
    samples_a = posterior_a.rvs(n_samples)
    samples_b = posterior_b.rvs(n_samples)
    
    # Calculate metrics
    prob_b_better = np.mean(samples_b > samples_a)
    expected_lift = np.mean((samples_b - samples_a) / samples_a)
    
    return {
        'prob_b_better': prob_b_better,
        'expected_lift': expected_lift,
        'posterior_a': posterior_a,
        'posterior_b': posterior_b,
        'samples_a': samples_a,
        'samples_b': samples_b,
    }

# Example: Testing a new checkout button
results_a = {'visitors': 5000, 'conversions': 150}  # Control: 3.0%
results_b = {'visitors': 5000, 'conversions': 175}  # Variant: 3.5%

result = bayesian_ab_test(
    results_a['visitors'], results_a['conversions'],
    results_b['visitors'], results_b['conversions']
)

print("Bayesian A/B Test Results")
print("=" * 50)
print(f"Control (A): {results_a['conversions']}/{results_a['visitors']} = {results_a['conversions']/results_a['visitors']:.2%}")
print(f"Variant (B): {results_b['conversions']}/{results_b['visitors']} = {results_b['conversions']/results_b['visitors']:.2%}")
print(f"\nProbability B is better: {result['prob_b_better']:.1%}")
print(f"Expected lift if B wins: {result['expected_lift']:.1%}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Posterior distributions
x = np.linspace(0.02, 0.05, 500)
axes[0].plot(x, result['posterior_a'].pdf(x), 'b-', linewidth=2, label='A (Control)')
axes[0].plot(x, result['posterior_b'].pdf(x), 'r-', linewidth=2, label='B (Variant)')
axes[0].fill_between(x, result['posterior_a'].pdf(x), alpha=0.3, color='blue')
axes[0].fill_between(x, result['posterior_b'].pdf(x), alpha=0.3, color='red')
axes[0].set_xlabel('Conversion Rate')
axes[0].set_ylabel('Probability Density')
axes[0].set_title('Posterior Distributions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Difference distribution
diff = result['samples_b'] - result['samples_a']
axes[1].hist(diff * 100, bins=100, density=True, alpha=0.7, edgecolor='black')
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2, label='No difference')
axes[1].axvline(x=np.mean(diff) * 100, color='green', linestyle='-', linewidth=2, 
                label=f'Mean diff: {np.mean(diff)*100:.2f}%')
axes[1].set_xlabel('Difference in Conversion Rate (%)')
axes[1].set_ylabel('Probability Density')
axes[1].set_title('Distribution of B - A')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Credible intervals
print(f"\n95% Credible Interval for A: {result['posterior_a'].interval(0.95)}")
print(f"95% Credible Interval for B: {result['posterior_b'].interval(0.95)}")

Bayesian Linear Regression

Unlike regular linear regression (which gives point estimates), Bayesian regression gives distributions over parameters.
# Simple Bayesian linear regression from scratch
np.random.seed(42)

# Generate data
n = 50
x = np.random.uniform(0, 10, n)
y_true = 2 * x + 1
y = y_true + np.random.randn(n) * 2

# Prior beliefs
# We believe slope is around 0 (uncertain), intercept around 0 (uncertain)
prior_mean = np.array([0, 0])  # [intercept, slope]
prior_cov = np.array([[10, 0], [0, 10]])  # Wide priors

# Likelihood
# y = X @ w + noise, noise ~ N(0, sigma^2)
sigma = 2  # Known noise level

X = np.column_stack([np.ones(n), x])

# Bayesian update (closed form for Gaussian)
# Posterior = N(posterior_mean, posterior_cov)
posterior_cov_inv = np.linalg.inv(prior_cov) + (X.T @ X) / sigma**2
posterior_cov = np.linalg.inv(posterior_cov_inv)
posterior_mean = posterior_cov @ (np.linalg.inv(prior_cov) @ prior_mean + X.T @ y / sigma**2)

print("Bayesian Linear Regression Results")
print("=" * 50)
print(f"True parameters: intercept = 1.0, slope = 2.0")
print(f"Posterior mean:  intercept = {posterior_mean[0]:.3f}, slope = {posterior_mean[1]:.3f}")
print(f"Posterior std:   intercept = {np.sqrt(posterior_cov[0,0]):.3f}, slope = {np.sqrt(posterior_cov[1,1]):.3f}")

# Visualize with uncertainty
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot data and predictions with uncertainty
x_plot = np.linspace(0, 10, 100)
X_plot = np.column_stack([np.ones(100), x_plot])

# Sample from posterior to get prediction uncertainty
n_samples = 100
samples = np.random.multivariate_normal(posterior_mean, posterior_cov, n_samples)

for i in range(n_samples):
    y_sample = X_plot @ samples[i]
    axes[0].plot(x_plot, y_sample, 'r-', alpha=0.1)

axes[0].scatter(x, y, alpha=0.6, label='Data')
axes[0].plot(x_plot, X_plot @ posterior_mean, 'b-', linewidth=2, label='Mean prediction')
axes[0].plot(x_plot, 2 * x_plot + 1, 'g--', linewidth=2, label='True line')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Bayesian Regression with Uncertainty')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Posterior distribution over parameters
axes[1].scatter(samples[:, 1], samples[:, 0], alpha=0.3, s=10, label='Posterior samples')
axes[1].scatter([2], [1], color='green', s=100, marker='*', label='True values')
axes[1].scatter([posterior_mean[1]], [posterior_mean[0]], color='red', s=100, marker='x', label='Posterior mean')
axes[1].set_xlabel('Slope')
axes[1].set_ylabel('Intercept')
axes[1].set_title('Posterior Distribution over Parameters')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
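One caveat on the left plot: the red sample lines show only parameter uncertainty. To predict a new observation you must also add the noise variance σ², giving the posterior predictive. A sketch at a single point, reusing posterior_mean, posterior_cov, and sigma from above:
# Posterior predictive at x = 5: parameter uncertainty + observation noise
x_star = np.array([1.0, 5.0])  # design row: [intercept term, x]
pred_mean = x_star @ posterior_mean
pred_std = np.sqrt(x_star @ posterior_cov @ x_star + sigma**2)
print(f"y(x=5) ~ N({pred_mean:.2f}, {pred_std:.2f}^2)")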

When to Use Bayesian vs Frequentist

| Situation | Recommended Approach |
|---|---|
| Large dataset, no prior info | Frequentist (simpler) |
| Small dataset, strong prior knowledge | Bayesian |
| Need probability statements about parameters | Bayesian |
| Need to update beliefs as data arrives | Bayesian |
| Regulatory/scientific publishing | Often Frequentist (tradition) |
| Business decisions with uncertainty | Bayesian |
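The “update beliefs as data arrives” row deserves a concrete sketch: with a conjugate prior, yesterday’s posterior literally becomes today’s prior. Beta-Binomial again, with made-up batch counts:
# Sequential updating: fold in each new batch of flips as it arrives
a, b = 1, 1  # start from a uniform Beta(1, 1) prior
for heads, flips in [(3, 5), (6, 10), (12, 20)]:
    a += heads
    b += flips - heads
    print(f"After a batch of {flips} flips: Beta({a}, {b}), mean = {a / (a + b):.3f}")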

Practice Exercises

Problem: You think a coin is fair (prior: Beta(10, 10)). After 30 flips, you see 22 heads. What’s your posterior belief about P(heads)? Calculate the posterior distribution and 95% credible interval.
Problem: You’re testing 3 different ad creatives. After day 1:
  • Ad A: 50 clicks, 5 conversions
  • Ad B: 50 clicks, 8 conversions
  • Ad C: 50 clicks, 3 conversions
Use Bayesian inference to decide which ad to show more tomorrow (Thompson Sampling).
Problem: You have conversion rates from 5 different stores. Build a hierarchical Bayesian model that shares information across stores to get better estimates for stores with little data.
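To get started on the ad-creative problem, here is a minimal Thompson Sampling sketch under Beta(1, 1) priors: each incoming impression goes to the ad whose posterior draw is highest, which naturally balances exploration and exploitation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ads = {'A': (50, 5), 'B': (50, 8), 'C': (50, 3)}  # day-1 (clicks, conversions)

# Draw one sample from each ad's Beta posterior; show the ad with the highest draw
draws = {name: stats.beta(1 + conv, 1 + clicks - conv).rvs(random_state=rng)
         for name, (clicks, conv) in ads.items()}
print({name: round(float(d), 3) for name, d in draws.items()})
print("Show this impression to ad:", max(draws, key=draws.get))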

Summary

| Concept | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed but unknown | Random variables |
| Probability | Long-run frequency | Degree of belief |
| Prior knowledge | Not used | Explicitly encoded |
| Result | Point estimate + CI | Full distribution |
| Interpretation | “95% of intervals built this way contain the true value” | “95% probability the parameter is in this interval” |
Key Takeaway: Bayesian statistics lets you formally combine prior knowledge with data to get calibrated uncertainty estimates. It’s especially valuable when data is limited or when you need to make decisions under uncertainty.