Bayesian Statistics

The Doctor’s Dilemma (Revisited)

Remember from our probability module: a test for a rare disease (1 in 1000 people) comes back positive. The test is 99% accurate: it catches 99% of true cases and false-alarms on only 1% of healthy people. What’s the probability you actually have the disease? Most people say “99%.” The real answer: about 9%. This counterintuitive result is the essence of Bayesian thinking. Let’s understand it deeply.
Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: Probability and Distributions modules
What You’ll Build: Spam filter, A/B test analyzer, and diagnostic tool

Bayes’ Theorem: The Core Formula

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

In plain English:
  • P(A|B): Probability of A given we observed B (posterior)
  • P(B|A): Probability of observing B if A is true (likelihood)
  • P(A): Prior probability of A before seeing evidence
  • P(B): Total probability of observing B

The Medical Test Example

Let’s define:
  • A: You have the disease
  • B: Test is positive
Given:
  • P(disease) = 1/1000 = 0.001 (prior)
  • P(positive|disease) = 0.99 (sensitivity)
  • P(positive|no disease) = 0.01 (false positive rate)
import numpy as np

# Prior probability of disease
p_disease = 0.001
p_no_disease = 1 - p_disease

# Likelihood of positive test
p_positive_given_disease = 0.99
p_positive_given_no_disease = 0.01

# Total probability of positive test
p_positive = (p_positive_given_disease * p_disease + 
              p_positive_given_no_disease * p_no_disease)

# Bayes' theorem
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")
print(f"That's only {p_disease_given_positive * 100:.1f}%!")
Output:
P(disease | positive test) = 0.090
That's only 9.0%!

Why Is It So Low?

# Let's think about 100,000 people
n_people = 100_000

# How many have the disease?
n_sick = int(n_people * p_disease)  # 100
n_healthy = n_people - n_sick        # 99,900

# How many test positive?
true_positives = int(n_sick * p_positive_given_disease)  # 99
false_positives = int(n_healthy * p_positive_given_no_disease)  # 999

total_positives = true_positives + false_positives

print("Out of 100,000 people:")
print(f"  Actually sick: {n_sick}")
print(f"  Test positive AND sick: {true_positives}")
print(f"  Test positive BUT healthy: {false_positives}")
print(f"  Total positive tests: {total_positives}")
print(f"\nChance you're sick if positive: {true_positives}/{total_positives} = {true_positives/total_positives:.1%}")
Key Insight: When the disease is rare, even a small false positive rate produces many false alarms that swamp the true positives!
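The same arithmetic at different base rates shows how quickly the posterior collapses as a condition gets rarer. A quick sweep, reusing the numbers above:
# Same 99%-accurate test, different disease base rates
for prior in [0.5, 0.1, 0.01, 0.001, 0.0001]:
    p_positive = 0.99 * prior + 0.01 * (1 - prior)
    posterior = 0.99 * prior / p_positive
    print(f"Prevalence {prior:g}: P(disease | positive) = {posterior:.1%}")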

Frequentist vs Bayesian: Two Philosophies

The Coin Flip Example

You flip a coin 10 times and get 7 heads. What’s the probability of heads?

Frequentist Answer:
  • The “true” probability is a fixed (unknown) number
  • Our best estimate is 7/10 = 0.70
  • Confidence interval: roughly 0.35 to 0.93 (wide because small sample)
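That interval is easy to verify with scipy’s exact (Clopper–Pearson) binomial test (available in scipy 1.7+):
from scipy import stats

ci = stats.binomtest(k=7, n=10).proportion_ci(confidence_level=0.95)
print(f"95% CI: ({ci.low:.2f}, {ci.high:.2f})")  # roughly (0.35, 0.93)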
Bayesian Answer:
  • We start with a prior belief (maybe 50-50)
  • We update based on evidence
  • We get a posterior distribution over possible values
import matplotlib.pyplot as plt
from scipy import stats

# Data: 7 heads in 10 flips
n_heads = 7
n_trials = 10

# Define a range of possible probabilities
p_range = np.linspace(0, 1, 1000)

# Bayesian approach with different priors
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Prior 1: Uniform (no prior knowledge)
prior_uniform = stats.beta(1, 1)  # Beta(1,1) = Uniform
posterior_uniform = stats.beta(1 + n_heads, 1 + n_trials - n_heads)

axes[0, 0].plot(p_range, prior_uniform.pdf(p_range), 'b--', label='Prior: Uniform')
axes[0, 0].plot(p_range, posterior_uniform.pdf(p_range), 'r-', linewidth=2, label='Posterior')
axes[0, 0].axvline(x=0.7, color='green', linestyle=':', label='MLE = 0.70')
axes[0, 0].set_xlabel('Probability of Heads')
axes[0, 0].set_ylabel('Density')
axes[0, 0].set_title('Uniform Prior: "I have no idea"')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Prior 2: Strong belief in fair coin
prior_fair = stats.beta(50, 50)  # Peaked at 0.5
posterior_fair = stats.beta(50 + n_heads, 50 + n_trials - n_heads)

axes[0, 1].plot(p_range, prior_fair.pdf(p_range), 'b--', label='Prior: Fair coin')
axes[0, 1].plot(p_range, posterior_fair.pdf(p_range), 'r-', linewidth=2, label='Posterior')
axes[0, 1].axvline(x=0.7, color='green', linestyle=':', label='MLE = 0.70')
axes[0, 1].set_xlabel('Probability of Heads')
axes[0, 1].set_ylabel('Density')
axes[0, 1].set_title('Strong Prior: "Coins are usually fair"')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Prior 3: Weak belief in fair coin
prior_weak = stats.beta(5, 5)  # Weakly peaked at 0.5
posterior_weak = stats.beta(5 + n_heads, 5 + n_trials - n_heads)

axes[1, 0].plot(p_range, prior_weak.pdf(p_range), 'b--', label='Prior: Slightly fair')
axes[1, 0].plot(p_range, posterior_weak.pdf(p_range), 'r-', linewidth=2, label='Posterior')
axes[1, 0].axvline(x=0.7, color='green', linestyle=':', label='MLE = 0.70')
axes[1, 0].set_xlabel('Probability of Heads')
axes[1, 0].set_ylabel('Density')
axes[1, 0].set_title('Weak Prior: "Coins are probably fair"')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# What happens with more data?
axes[1, 1].set_title('Posterior with Different Sample Sizes')
for n in [10, 50, 100, 500]:
    n_h = int(0.6 * n)  # True rate is 60%
    posterior = stats.beta(1 + n_h, 1 + n - n_h)
    axes[1, 1].plot(p_range, posterior.pdf(p_range), label=f'n={n}')
axes[1, 1].axvline(x=0.6, color='black', linestyle='--', label='True rate')
axes[1, 1].set_xlabel('Probability of Heads')
axes[1, 1].set_ylabel('Density')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print credible intervals
print("95% Credible Intervals:")
print(f"  Uniform prior: {posterior_uniform.interval(0.95)}")
print(f"  Strong prior:  {posterior_fair.interval(0.95)}")
print(f"  Weak prior:    {posterior_weak.interval(0.95)}")
Key Observations:
  1. Strong priors resist change (you need lots of data to override them)
  2. Weak priors let the data speak
  3. With enough data, all priors converge to the same answer
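These observations have a precise form: with a Beta(α, β) prior and h heads in n flips, the posterior is Beta(α + h, β + n − h), so the posterior mean is a compromise between the prior mean and the observed rate. For our 7 heads in 10 flips:
# Posterior mean = (alpha + heads) / (alpha + beta + n):
# a weighted compromise between the prior mean (0.5) and the observed rate (0.7)
for name, a, b in [("Uniform Beta(1,1)", 1, 1),
                   ("Weak Beta(5,5)", 5, 5),
                   ("Strong Beta(50,50)", 50, 50)]:
    post_mean = (a + 7) / (a + b + 10)
    print(f"{name:>18}: posterior mean = {post_mean:.3f}")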

Building a Bayesian Spam Filter

This is essentially how early spam filters worked: naive Bayes over word frequencies.
import re
from collections import defaultdict

class NaiveBayesSpamFilter:
    """
    A simple spam filter using Bayes' theorem.
    P(spam | words) ∝ P(words | spam) * P(spam)
    """
    
    def __init__(self, spam_prior=0.3):
        self.spam_prior = spam_prior
        self.word_spam_prob = defaultdict(lambda: 0.5)  # P(word | spam)
        self.word_ham_prob = defaultdict(lambda: 0.5)   # P(word | not spam)
        self.spam_words = defaultdict(int)
        self.ham_words = defaultdict(int)
        self.total_spam_words = 0
        self.total_ham_words = 0
    
    def tokenize(self, text):
        """Convert text to lowercase words."""
        return re.findall(r'\b[a-z]+\b', text.lower())
    
    def train(self, messages, labels):
        """Train on labeled messages."""
        # Count words in spam vs ham
        for msg, is_spam in zip(messages, labels):
            words = self.tokenize(msg)
            for word in words:
                if is_spam:
                    self.spam_words[word] += 1
                    self.total_spam_words += 1
                else:
                    self.ham_words[word] += 1
                    self.total_ham_words += 1
        
        # Calculate probabilities with Laplace smoothing
        vocab = set(self.spam_words.keys()) | set(self.ham_words.keys())
        vocab_size = len(vocab)
        
        for word in vocab:
            self.word_spam_prob[word] = (self.spam_words[word] + 1) / (self.total_spam_words + vocab_size)
            self.word_ham_prob[word] = (self.ham_words[word] + 1) / (self.total_ham_words + vocab_size)
    
    def predict_proba(self, message):
        """Calculate P(spam | message) using Bayes."""
        words = self.tokenize(message)
        
        # Log probabilities to avoid underflow
        log_p_spam = np.log(self.spam_prior)
        log_p_ham = np.log(1 - self.spam_prior)
        
        for word in words:
            log_p_spam += np.log(self.word_spam_prob[word])
            log_p_ham += np.log(self.word_ham_prob[word])
        
        # Convert back from log space
        # P(spam | words) = P(words | spam) * P(spam) / P(words)
        # Use log-sum-exp trick for numerical stability
        max_log = max(log_p_spam, log_p_ham)
        p_spam = np.exp(log_p_spam - max_log)
        p_ham = np.exp(log_p_ham - max_log)
        
        return p_spam / (p_spam + p_ham)
    
    def predict(self, message, threshold=0.5):
        """Classify as spam or not."""
        return self.predict_proba(message) > threshold

# Training data
spam_messages = [
    "FREE MONEY! Click here to claim your prize!",
    "You've WON $1,000,000! Act now!",
    "Congratulations! You're our lucky winner!",
    "Buy cheap pills online! No prescription needed!",
    "Make money fast! Work from home!",
    "FREE iPhone! Limited time offer!",
    "Your account has been compromised! Click to verify!",
    "Hot singles in your area want to meet!",
]

ham_messages = [
    "Hi, can we schedule a meeting for tomorrow?",
    "Thanks for sending the report. Looks good!",
    "Don't forget about dinner with mom on Sunday.",
    "The project deadline has been moved to Friday.",
    "Great job on the presentation!",
    "Please review the attached document.",
    "Looking forward to our call next week.",
    "Can you help me with this Python code?",
]

# Train the filter (avoid the name "filter", which shadows a Python built-in)
spam_filter = NaiveBayesSpamFilter(spam_prior=0.3)
messages = spam_messages + ham_messages
labels = [True] * len(spam_messages) + [False] * len(ham_messages)
spam_filter.train(messages, labels)

# Test on new messages
test_messages = [
    "You've won a FREE vacation!",
    "Meeting rescheduled to 3pm",
    "Claim your prize money now!",
    "The quarterly report is ready for review",
    "Buy now! Limited offer!",
]

print("Spam Filter Results:")
print("-" * 50)
for msg in test_messages:
    prob = spam_filter.predict_proba(msg)
    label = "SPAM" if spam_filter.predict(msg) else "HAM"
    print(f"{prob:.1%} spam | {label:4} | {msg[:40]}...")

# Show most "spammy" and "hammy" words
print("\n" + "=" * 50)
print("Most indicative words:")
word_ratios = {}
for word in spam_filter.spam_words:
    if spam_filter.ham_words[word] > 0:
        ratio = spam_filter.word_spam_prob[word] / spam_filter.word_ham_prob[word]
        word_ratios[word] = ratio

sorted_words = sorted(word_ratios.items(), key=lambda x: x[1], reverse=True)

print("\nMost 'spammy' words:")
for word, ratio in sorted_words[:5]:
    print(f"  {word}: {ratio:.2f}x more likely in spam")

print("\nMost 'legitimate' words:")
for word, ratio in sorted_words[-5:]:
    print(f"  {word}: {1/ratio:.2f}x more likely in ham")

Bayesian A/B Testing

The Problem with Frequentist A/B Tests

Traditional A/B tests give you a yes/no answer: “statistically significant” or not. Bayesian A/B testing gives you richer information:
  • Probability that B is better than A
  • Expected improvement
  • Distribution of possible outcomes
from scipy import stats

def bayesian_ab_test(visitors_a, conversions_a, visitors_b, conversions_b, 
                     prior_alpha=1, prior_beta=1, n_samples=100000):
    """
    Run a Bayesian A/B test.
    Returns probability that B is better than A, and expected lift.
    """
    # Posterior distributions (Beta-Binomial conjugate)
    posterior_a = stats.beta(prior_alpha + conversions_a, 
                            prior_beta + visitors_a - conversions_a)
    posterior_b = stats.beta(prior_alpha + conversions_b, 
                            prior_beta + visitors_b - conversions_b)
    
    # Sample from posteriors
    samples_a = posterior_a.rvs(n_samples)
    samples_b = posterior_b.rvs(n_samples)
    
    # Calculate metrics
    prob_b_better = np.mean(samples_b > samples_a)
    expected_lift = np.mean((samples_b - samples_a) / samples_a)
    
    return {
        'prob_b_better': prob_b_better,
        'expected_lift': expected_lift,
        'posterior_a': posterior_a,
        'posterior_b': posterior_b,
        'samples_a': samples_a,
        'samples_b': samples_b,
    }

# Example: Testing a new checkout button
results_a = {'visitors': 5000, 'conversions': 150}  # Control: 3.0%
results_b = {'visitors': 5000, 'conversions': 175}  # Variant: 3.5%

result = bayesian_ab_test(
    results_a['visitors'], results_a['conversions'],
    results_b['visitors'], results_b['conversions']
)

print("Bayesian A/B Test Results")
print("=" * 50)
print(f"Control (A): {results_a['conversions']}/{results_a['visitors']} = {results_a['conversions']/results_a['visitors']:.2%}")
print(f"Variant (B): {results_b['conversions']}/{results_b['visitors']} = {results_b['conversions']/results_b['visitors']:.2%}")
print(f"\nProbability B is better: {result['prob_b_better']:.1%}")
print(f"Expected lift if B wins: {result['expected_lift']:.1%}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Posterior distributions
x = np.linspace(0.02, 0.05, 500)
axes[0].plot(x, result['posterior_a'].pdf(x), 'b-', linewidth=2, label='A (Control)')
axes[0].plot(x, result['posterior_b'].pdf(x), 'r-', linewidth=2, label='B (Variant)')
axes[0].fill_between(x, result['posterior_a'].pdf(x), alpha=0.3, color='blue')
axes[0].fill_between(x, result['posterior_b'].pdf(x), alpha=0.3, color='red')
axes[0].set_xlabel('Conversion Rate')
axes[0].set_ylabel('Probability Density')
axes[0].set_title('Posterior Distributions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Difference distribution
diff = result['samples_b'] - result['samples_a']
axes[1].hist(diff * 100, bins=100, density=True, alpha=0.7, edgecolor='black')
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2, label='No difference')
axes[1].axvline(x=np.mean(diff) * 100, color='green', linestyle='-', linewidth=2, 
                label=f'Mean diff: {np.mean(diff)*100:.2f}%')
axes[1].set_xlabel('Difference in Conversion Rate (%)')
axes[1].set_ylabel('Probability Density')
axes[1].set_title('Distribution of B - A')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Credible intervals
print(f"\n95% Credible Interval for A: {result['posterior_a'].interval(0.95)}")
print(f"95% Credible Interval for B: {result['posterior_b'].interval(0.95)}")

Bayesian Linear Regression

Unlike regular linear regression (which gives point estimates), Bayesian regression gives distributions over parameters.
# Simple Bayesian linear regression from scratch
np.random.seed(42)

# Generate data
n = 50
x = np.random.uniform(0, 10, n)
y_true = 2 * x + 1
y = y_true + np.random.randn(n) * 2

# Prior beliefs
# We believe slope is around 0 (uncertain), intercept around 0 (uncertain)
prior_mean = np.array([0, 0])  # [intercept, slope]
prior_cov = np.array([[10, 0], [0, 10]])  # Wide priors

# Likelihood
# y = X @ w + noise, noise ~ N(0, sigma^2)
sigma = 2  # Known noise level

X = np.column_stack([np.ones(n), x])

# Bayesian update (closed form for Gaussian)
# Posterior = N(posterior_mean, posterior_cov)
posterior_cov_inv = np.linalg.inv(prior_cov) + (X.T @ X) / sigma**2
posterior_cov = np.linalg.inv(posterior_cov_inv)
posterior_mean = posterior_cov @ (np.linalg.inv(prior_cov) @ prior_mean + X.T @ y / sigma**2)

print("Bayesian Linear Regression Results")
print("=" * 50)
print(f"True parameters: intercept = 1.0, slope = 2.0")
print(f"Posterior mean:  intercept = {posterior_mean[0]:.3f}, slope = {posterior_mean[1]:.3f}")
print(f"Posterior std:   intercept = {np.sqrt(posterior_cov[0,0]):.3f}, slope = {np.sqrt(posterior_cov[1,1]):.3f}")

# Visualize with uncertainty
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot data and predictions with uncertainty
x_plot = np.linspace(0, 10, 100)
X_plot = np.column_stack([np.ones(100), x_plot])

# Sample from posterior to get prediction uncertainty
n_samples = 100
samples = np.random.multivariate_normal(posterior_mean, posterior_cov, n_samples)

for i in range(n_samples):
    y_sample = X_plot @ samples[i]
    axes[0].plot(x_plot, y_sample, 'r-', alpha=0.1)

axes[0].scatter(x, y, alpha=0.6, label='Data')
axes[0].plot(x_plot, X_plot @ posterior_mean, 'b-', linewidth=2, label='Mean prediction')
axes[0].plot(x_plot, 2 * x_plot + 1, 'g--', linewidth=2, label='True line')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Bayesian Regression with Uncertainty')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Posterior distribution over parameters
axes[1].scatter(samples[:, 1], samples[:, 0], alpha=0.3, s=10, label='Posterior samples')
axes[1].scatter([2], [1], color='green', s=100, marker='*', label='True values')
axes[1].scatter([posterior_mean[1]], [posterior_mean[0]], color='red', s=100, marker='x', label='Posterior mean')
axes[1].set_xlabel('Slope')
axes[1].set_ylabel('Intercept')
axes[1].set_title('Posterior Distribution over Parameters')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
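One caveat on the left plot: the red sample lines show only parameter uncertainty. To predict a new observation you must also add the noise variance σ², giving the posterior predictive. A sketch at a single point, reusing posterior_mean, posterior_cov, and sigma from above:
# Posterior predictive at x = 5: parameter uncertainty + observation noise
x_star = np.array([1.0, 5.0])  # design row: [intercept term, x]
pred_mean = x_star @ posterior_mean
pred_std = np.sqrt(x_star @ posterior_cov @ x_star + sigma**2)
print(f"y(x=5) ~ N({pred_mean:.2f}, {pred_std:.2f}^2)")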

When to Use Bayesian vs Frequentist

| Situation | Recommended Approach |
|---|---|
| Large dataset, no prior info | Frequentist (simpler) |
| Small dataset, strong prior knowledge | Bayesian |
| Need probability statements about parameters | Bayesian |
| Need to update beliefs as data arrives | Bayesian |
| Regulatory/scientific publishing | Often Frequentist (tradition) |
| Business decisions with uncertainty | Bayesian |
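The “update beliefs as data arrives” row deserves a concrete sketch: with a conjugate prior, yesterday’s posterior literally becomes today’s prior. Beta-Binomial again, with made-up batch counts:
# Sequential updating: fold in each new batch of flips as it arrives
a, b = 1, 1  # start from a uniform Beta(1, 1) prior
for heads, flips in [(3, 5), (6, 10), (12, 20)]:
    a += heads
    b += flips - heads
    print(f"After a batch of {flips} flips: Beta({a}, {b}), mean = {a / (a + b):.3f}")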

Practice Exercises

Problem: You think a coin is fair (prior: Beta(10, 10)). After 30 flips, you see 22 heads. What’s your posterior belief about P(heads)? Calculate the posterior distribution and 95% credible interval.
Problem: You’re testing 3 different ad creatives. After day 1:
  • Ad A: 50 clicks, 5 conversions
  • Ad B: 50 clicks, 8 conversions
  • Ad C: 50 clicks, 3 conversions
Use Bayesian inference to decide which ad to show more tomorrow (Thompson Sampling).
Problem: You have conversion rates from 5 different stores. Build a hierarchical Bayesian model that shares information across stores to get better estimates for stores with little data.
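To get started on the ad-creative problem, here is a minimal Thompson Sampling sketch under Beta(1, 1) priors: each incoming impression goes to the ad whose posterior draw is highest, which naturally balances exploration and exploitation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ads = {'A': (50, 5), 'B': (50, 8), 'C': (50, 3)}  # day-1 (clicks, conversions)

# Draw one sample from each ad's Beta posterior; show the ad with the highest draw
draws = {name: stats.beta(1 + conv, 1 + clicks - conv).rvs(random_state=rng)
         for name, (clicks, conv) in ads.items()}
print({name: round(float(d), 3) for name, d in draws.items()})
print("Show this impression to ad:", max(draws, key=draws.get))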

Summary

| Concept | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed but unknown | Random variables |
| Probability | Long-run frequency | Degree of belief |
| Prior knowledge | Not used | Explicitly encoded |
| Result | Point estimate + CI | Full distribution |
| Interpretation | “95% of intervals built this way contain the true value” | “95% probability the parameter is in this interval” |
Key Takeaway: Bayesian statistics lets you formally combine prior knowledge with data to get calibrated uncertainty estimates. It’s especially valuable when data is limited or when you need to make decisions under uncertainty.