Dr. Sarah runs a routine test on a patient. The test comes back positive for a rare disease that affects 1 in 1,000 people. The test is 99% accurate:
If you HAVE the disease, it correctly says “positive” 99% of the time
If you DON’T have it, it correctly says “negative” 99% of the time
Question: What's the probability this patient actually has the disease?

Most people (and many doctors!) say "99%". The real answer? About 9%.

Surprised? This is why understanding probability properly can literally save lives - and it's the foundation of all machine learning.
Real Talk: Probability is one of the most misunderstood topics. Our intuition is terrible at it. By the end of this module, you’ll understand why our intuition fails and how to calculate correctly.
Estimated Time: 3-4 hours
Difficulty: Beginner
Prerequisites: Basic Python, Module 1 (Describing Data)
What You'll Build: A spam email detector using Bayes' theorem
Here's where it gets interesting. Conditional probability answers: "What's the probability of A, GIVEN that B already happened?"

Notation: P(A|B) reads as "probability of A given B"
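To make this concrete, here's a small illustrative sketch (my own example, not one of the module's exercises) that estimates a conditional probability by simple counting: among trials where the first die shows 6, how often does the sum exceed 9?

import random

random.seed(42)
trials = 100_000
count_b = 0        # times the first die shows 6
count_a_and_b = 0  # times the first die shows 6 AND the sum exceeds 9

for _ in range(trials):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    if d1 == 6:
        count_b += 1
        if d1 + d2 > 9:
            count_a_and_b += 1

# P(A|B) estimated by counting: only trials where B happened are considered
print(f"P(sum > 9 | first die is 6) ~ {count_a_and_b / count_b:.2f}")  # exact: 3/6 = 0.50

Notice that conditioning shrinks the sample space: we only count trials where B already happened.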
Now we can solve that medical test problem from the beginning. Bayes' Theorem lets you flip conditional probabilities:

P(A|B) = P(B|A) · P(A) / P(B)

In words: the probability of A given B equals the probability of B given A, times the probability of A, divided by the probability of B.
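Plugging the numbers from the opening example straight into the formula:

# Prior: 1 in 1,000 people have the disease
p_disease = 0.001
p_pos_given_disease = 0.99   # sensitivity
p_pos_given_healthy = 0.01   # false positive rate (1 - specificity)

# P(positive) via the law of total probability
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(f"P(disease | positive): {p_disease_given_pos:.1%}")  # 9.0%

The population simulation below reaches the same answer by counting people instead of multiplying probabilities: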
population = 10_000

# How many actually have the disease?
actually_sick = round(population * 0.001)      # 10 people
actually_healthy = population - actually_sick  # 9,990 people

# Of the sick, how many test positive?
sick_positive = round(actually_sick * 0.99)    # ~10 true positives

# Of the healthy, how many test positive (false positives)?
healthy_positive = round(actually_healthy * 0.01)  # ~100 false positives!

# Total positives
total_positive = sick_positive + healthy_positive

print(f"Actually sick: {actually_sick}")
print(f"True positives: {sick_positive}")
print(f"False positives: {healthy_positive}")
print(f"Total positives: {total_positive}")
print(f"P(sick | positive): {sick_positive/total_positive:.1%}")  # 9.1%, not 99%!
# Standard 52-card deck
# Calculate:
# 1. P(drawing a heart)
# 2. P(drawing a face card) - J, Q, K
# 3. P(drawing a red face card)
# 4. P(drawing a heart OR a face card)
Solution
# 1. P(heart)
p_heart = 13/52  # 25%

# 2. P(face card) - 12 face cards (3 per suit × 4 suits)
p_face = 12/52  # 23.08%

# 3. P(red face card) - 6 cards (3 per red suit × 2 red suits)
p_red_face = 6/52  # 11.54%

# 4. P(heart OR face card)
# Hearts = 13, Face cards = 12, but 3 are both (J, Q, K of hearts)
p_heart_or_face = (13 + 12 - 3) / 52  # 22/52 = 42.31%

print(f"P(heart): {p_heart:.2%}")
print(f"P(face): {p_face:.2%}")
print(f"P(red face): {p_red_face:.2%}")
print(f"P(heart OR face): {p_heart_or_face:.2%}")
# Historical data:
# - 30% of days are rainy
# - On rainy days, 80% are cloudy in the morning
# - On non-rainy days, 40% are cloudy in the morning

# This morning is cloudy. What's the probability of rain?
Solution
# Use Bayes' theoremp_rain = 0.30p_no_rain = 0.70p_cloudy_given_rain = 0.80p_cloudy_given_no_rain = 0.40# P(Cloudy)p_cloudy = p_cloudy_given_rain * p_rain + p_cloudy_given_no_rain * p_no_rainprint(f"P(Cloudy): {p_cloudy:.0%}") # 52%# P(Rain | Cloudy)p_rain_given_cloudy = (p_cloudy_given_rain * p_rain) / p_cloudyprint(f"P(Rain | Cloudy): {p_rain_given_cloudy:.1%}") # 46.2%# Even though it's cloudy, there's still less than 50% chance of rain# because the base rate of rain (30%) is low
# A disease affects 2% of the population
# Test A: 95% sensitivity, 90% specificity
# Test B: 90% sensitivity, 95% specificity

# A patient tests positive on BOTH tests.
# What's the probability they have the disease?

# Hint: Apply Bayes' theorem twice (sequentially)
Solution
# Initial state
p_disease = 0.02

# Test A
sensitivity_a = 0.95             # P(+|disease)
specificity_a = 0.90             # P(-|no disease)
false_pos_a = 1 - specificity_a  # 0.10

# After Test A positive
p_pos_a = sensitivity_a * p_disease + false_pos_a * (1 - p_disease)
p_disease_after_a = (sensitivity_a * p_disease) / p_pos_a
print(f"After Test A positive: P(disease) = {p_disease_after_a:.1%}")  # 16.2%

# Test B (now using updated probability as prior)
sensitivity_b = 0.90
specificity_b = 0.95
false_pos_b = 0.05

# After Test B positive (using p_disease_after_a as new prior)
p_pos_b_given_disease = sensitivity_b
p_pos_b_given_no_disease = false_pos_b
p_pos_b = (p_pos_b_given_disease * p_disease_after_a
           + p_pos_b_given_no_disease * (1 - p_disease_after_a))
p_disease_after_both = (p_pos_b_given_disease * p_disease_after_a) / p_pos_b
print(f"After both tests positive: P(disease) = {p_disease_after_both:.1%}")  # 77.7%

# Two positive tests dramatically increase confidence!
Mistake 1: Ignoring Base Rates (Base Rate Fallacy)

A 99% accurate test for a rare disease (1 in 10,000) will mostly produce false positives. Most people who test positive won't have the disease. Always consider how common the thing you're testing for actually is.
Mistake 2: Confusing P(A|B) with P(B|A)

P(sick|positive test) is NOT the same as P(positive test|sick). This confusion has led to wrongful convictions and medical misdiagnoses. Bayes' theorem is the bridge between them.
Mistake 3: Treating Dependent Events as Independent

Drawing cards without replacement means probabilities change. P(2nd card is ace | 1st was ace) = 3/51, not 4/52.
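A quick simulation (an illustrative sketch, not one of the module's exercises) makes the dependence visible:

import random

# A 52-card deck as rank labels; only rank matters here (rank 0 = ace)
deck = [rank for rank in range(13) for _ in range(4)]

random.seed(0)
trials = 200_000
first_ace = 0
both_aces = 0
for _ in range(trials):
    a, b = random.sample(deck, 2)  # draw two cards WITHOUT replacement
    if a == 0:
        first_ace += 1
        if b == 0:
            both_aces += 1

print(f"P(2nd ace | 1st ace) ~ {both_aces / first_ace:.3f}")  # exact: 3/51 ≈ 0.059

If the events were independent, this would come out near 4/52 ≈ 0.077; it doesn't, because the first draw changes the deck.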
Question 1: Two Children Problem

Question: A family has two children. Given that at least one is a boy, what's the probability that both are boys?
Answer: 1/3, not 1/2!

Possible outcomes for two children: BB, BG, GB, GG
Given “at least one boy”: BB, BG, GB (3 options)
Both boys: BB (1 option)

P(both boys | at least one boy) = 1/3

This is counterintuitive because we're not told which child is the boy, so both orders (BG, GB) are possible.
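You can confirm this by brute-force enumeration (a small sketch of my own, assuming all four birth orders are equally likely):

from itertools import product

outcomes = list(product("BG", repeat=2))  # BB, BG, GB, GG
at_least_one_boy = [o for o in outcomes if "B" in o]
both_boys = [o for o in at_least_one_boy if o == ("B", "B")]

# Conditioning = restricting to outcomes where the given event holds
print(f"{len(both_boys)}/{len(at_least_one_boy)}")  # 1/3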
Question 2: Birthday Problem (Amazon)
Question: In a room of 23 people, what’s the probability that at least two share a birthday?
Answer: About 50%!

It's easier to calculate the complement:
P(no shared birthdays) = (365/365) × (364/365) × (363/365) × … × (343/365)
import numpy as np

p_no_match = np.prod([(365 - i)/365 for i in range(23)])
p_at_least_one_match = 1 - p_no_match
print(f"P(shared birthday): {p_at_least_one_match:.1%}")  # 50.7%
This is famously counterintuitive because we're comparing ALL pairs, not just pairs involving you - 23 people form 253 distinct pairs, and every pair is a chance for a match.
Question 3: Monty Hall Problem (Classic)
Question: You’re on a game show with 3 doors. Behind one is a car, behind the others are goats. You pick door 1. The host (who knows what’s behind each door) opens door 3, revealing a goat. Should you switch to door 2?
Answer: Yes! Switching gives you a 2/3 chance of winning.

Initial pick: 1/3 chance of being right
Switching: 2/3 chance of winning

The key insight: The host's action gives you information. He always reveals a goat, so when you switch, you're essentially betting that your initial choice was wrong (which it probably was, 2/3 of the time).
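If the argument still feels slippery, a simulation settles it (an illustrative sketch of my own, not from the module):

import random

random.seed(1)
trials = 100_000
stay_wins = 0
switch_wins = 0

for _ in range(trials):
    car = random.randrange(3)   # door hiding the car
    pick = random.randrange(3)  # contestant's initial pick
    # Host opens a door that is neither the pick nor the car
    host = next(d for d in range(3) if d != pick and d != car)
    # Switching means taking the remaining unopened door
    switched = next(d for d in range(3) if d != pick and d != host)
    stay_wins += (pick == car)
    switch_wins += (switched == car)

print(f"Stay:   {stay_wins / trials:.1%}")    # ~33.3%
print(f"Switch: {switch_wins / trials:.1%}")  # ~66.7%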
Question 4: Spam Classifier (Tech Companies)
Question: Your spam filter has 98% sensitivity and 95% specificity. If 5% of emails are spam, what fraction of emails flagged as spam are actually spam?
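Answer: Only about 50.8%! Despite the impressive-sounding accuracy, roughly half of flagged emails are false alarms, because spam is the minority class. Working it out with Bayes' theorem:

p_spam = 0.05
sensitivity = 0.98           # P(flagged | spam)
specificity = 0.95
false_pos = 1 - specificity  # P(flagged | not spam) = 0.05

# P(flagged) via the law of total probability
p_flagged = sensitivity * p_spam + false_pos * (1 - p_spam)
p_spam_given_flagged = (sensitivity * p_spam) / p_flagged
print(f"P(spam | flagged): {p_spam_given_flagged:.1%}")  # 50.8%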
Build a simple sentiment classifier using Bayes’ theorem:
import numpy as np
from collections import defaultdict

# Training data
reviews = [
    ("great product love it", "positive"),
    ("terrible waste of money", "negative"),
    ("amazing quality highly recommend", "positive"),
    ("awful experience never again", "negative"),
    ("fantastic works perfectly", "positive"),
    ("horrible customer service", "negative"),
    ("excellent value great buy", "positive"),
    ("disappointing poor quality", "negative"),
]

# Your task: Implement a Naive Bayes classifier
# 1. Count word frequencies in positive vs negative reviews
# 2. Calculate P(word|positive) and P(word|negative) for each word
# 3. Use Bayes to classify new review: "great quality but poor service"

class NaiveBayesClassifier:
    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.class_counts = defaultdict(int)
        self.vocab = set()

    def train(self, reviews):
        # Your implementation here
        pass

    def predict(self, text):
        # Your implementation here
        pass

# Test your classifier
classifier = NaiveBayesClassifier()
classifier.train(reviews)
print(classifier.predict("great quality but poor service"))
Solution:
class NaiveBayesClassifier:
    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.class_counts = defaultdict(int)
        self.vocab = set()

    def train(self, reviews):
        for text, label in reviews:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        words = text.lower().split()
        total_reviews = sum(self.class_counts.values())
        scores = {}
        for label in self.class_counts:
            # Start with prior P(class)
            log_prob = np.log(self.class_counts[label] / total_reviews)
            # Add log P(word|class) for each word
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Laplace smoothing
                count = self.word_counts[label].get(word, 0) + 1
                prob = count / (total_words + len(self.vocab))
                log_prob += np.log(prob)
            scores[label] = log_prob
        return max(scores, key=scores.get)

# Test
classifier = NaiveBayesClassifier()
classifier.train(reviews)
result = classifier.predict("great quality but poor service")
print(f"Prediction: {result}")  # Could go either way!
Q: What’s the difference between independent and mutually exclusive events?
Independent: Occurrence of one doesn't affect the other (e.g., two coin flips). Mutually exclusive: Both cannot occur simultaneously (e.g., getting heads AND tails on one flip). Note: Mutually exclusive events are NOT independent (assuming both have nonzero probability) - knowing one happened tells you the other definitely didn't.
Q: You flip a fair coin 10 times and get 10 heads. What’s P(heads on flip 11)?
Still 50%! This is the gambler’s fallacy. Each flip is independent; past results don’t affect future outcomes.
✅ Basic Probability - P(A) = favorable outcomes / total outcomes
✅ Addition Rule - P(A or B) = P(A) + P(B) - P(A and B)
✅ Multiplication Rule - P(A and B) = P(A) × P(B|A)
✅ Conditional Probability - P(A|B) = P(A and B) / P(B)
✅ Bayes’ Theorem - Update beliefs with new evidence; essential for ML
✅ Independence - Events where one doesn’t affect the other
Bayes’ Theorem Intuition: Think of it as “updating your beliefs.” You start with a prior (what you knew before), see evidence, and get a posterior (what you know after). This is exactly how many ML algorithms learn!
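The whole update step fits in a few lines. Here's a generic sketch (the helper name bayes_update is my own illustration, not code from this module):

def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return the posterior P(H | evidence) from a prior and two likelihoods."""
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1 - prior))
    return p_evidence_given_h * prior / p_evidence

# Exercise 3 revisited: chain two positive test results
posterior = bayes_update(0.02, 0.95, 0.10)       # after Test A: ~16.2%
posterior = bayes_update(posterior, 0.90, 0.05)  # after Test B: ~77.7%
print(f"{posterior:.1%}")

Each call's posterior becomes the next call's prior, which is exactly the "updating your beliefs" loop described above.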
Coming up next: We’ll learn about probability distributions - the patterns that randomness follows. This is where we discover the famous “bell curve” and understand why it shows up everywhere!