Naive Bayes

The Probability Perspective

Most algorithms we’ve seen ask: “Which side of the boundary is this point on?” Naive Bayes asks: “Given the evidence, what’s the probability of each class?”

The Doctor’s Diagnosis Problem

A patient walks in with symptoms:
  • Fever: Yes
  • Cough: Yes
  • Fatigue: Yes
The doctor thinks: “Based on these symptoms, how likely is it they have the flu vs a cold?” This is Bayesian reasoning - updating beliefs based on evidence.

Bayes’ Theorem

$$P(\text{Disease} \mid \text{Symptoms}) = \frac{P(\text{Symptoms} \mid \text{Disease}) \times P(\text{Disease})}{P(\text{Symptoms})}$$

In English:
  • P(Disease|Symptoms): Probability of disease given symptoms (what we want)
  • P(Symptoms|Disease): How likely these symptoms are if you have the disease
  • P(Disease): How common the disease is (prior probability)
  • P(Symptoms): How common these symptoms are overall
Math Connection: This is Bayes’ Theorem from probability theory. See Probability for the full derivation.
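To make the formula concrete, here is a minimal worked sketch for a single symptom. All of the numbers are hypothetical, chosen only to illustrate the arithmetic; the denominator P(Symptoms) is obtained from the law of total probability.

```python
# Hypothetical numbers, for illustration only.
p_flu = 0.05                  # P(Disease): prior
p_fever_given_flu = 0.90      # P(Symptoms|Disease): likelihood
p_fever_given_no_flu = 0.10   # P(Symptoms|no Disease)

# P(Symptoms): how common the symptom is overall (law of total probability)
p_fever = p_fever_given_flu * p_flu + p_fever_given_no_flu * (1 - p_flu)

# Bayes' theorem: posterior = likelihood * prior / evidence
p_flu_given_fever = p_fever_given_flu * p_flu / p_fever

print(f"P(fever)       = {p_fever:.3f}")
print(f"P(flu | fever) = {p_flu_given_fever:.3f}")
```

Even with a 90% likelihood, the posterior stays well below 90% because the prior is small. The prior matters as much as the evidence.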

Why “Naive”?

The “naive” assumption: All features are independent given the class. For our flu example:
  • P(Fever AND Cough AND Fatigue | Flu)
  • Approximately equals P(Fever|Flu) x P(Cough|Flu) x P(Fatigue|Flu)
Is this realistic? No! Symptoms often correlate: fever and fatigue almost always appear together. To compute the true joint probability, we’d need to account for all of these correlations.

Does it work anyway? Surprisingly well, yes! Here’s why: Naive Bayes only needs to get the ranking of class probabilities right, not the exact values. Even if the absolute probabilities are wildly off (e.g., 99.9% instead of 70%), the classification is correct as long as the winning class is correct. The independence assumption distorts magnitudes but often preserves the ordering. It’s like a biased thermometer that always reads 10 degrees too high: useless for absolute temperature, but perfectly fine for telling you which room is hottest.
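The effect can be sketched with a toy calculation. The priors and per-symptom likelihoods below are hypothetical, and `naive_posterior` is an illustrative helper, not part of any library. Counting "fever" twice stands in for a second feature that is perfectly correlated with it.

```python
# Hypothetical priors and per-symptom likelihoods (illustrative only).
priors = {"flu": 0.3, "cold": 0.7}
likelihoods = {
    "flu":  {"fever": 0.9, "cough": 0.8, "fatigue": 0.9},
    "cold": {"fever": 0.2, "cough": 0.7, "fatigue": 0.4},
}

def naive_posterior(symptoms):
    """Posterior over classes, multiplying per-symptom likelihoods (naive assumption)."""
    scores = {}
    for c in priors:
        score = priors[c]
        for s in symptoms:
            score *= likelihoods[c][s]
        scores[c] = score
    total = sum(scores.values())  # normalize so the posteriors sum to 1
    return {c: v / total for c, v in scores.items()}

base = naive_posterior(["fever", "cough", "fatigue"])
# Count "fever" twice, as if a perfectly correlated duplicate feature existed
doubled = naive_posterior(["fever", "fever", "cough", "fatigue"])

print("Each symptom once:  ", base)
print("Fever counted twice:", doubled)
```

In this sketch the duplicated evidence makes the model more confident in "flu" without changing which class wins. That is the sense in which magnitudes are distorted while the ranking survives.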

Building Naive Bayes From Scratch

import numpy as np
from collections import defaultdict

class SimpleNaiveBayes:
    """Naive Bayes classifier from scratch."""
    
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.n_classes = len(self.classes)
        self.n_features = X.shape[1]
        
        # Calculate prior probabilities P(class).
        # If 60% of emails are spam, P(spam) = 0.6.
        # This is our "starting belief" before looking at any features.
        self.priors = {}
        for c in self.classes:
            self.priors[c] = np.mean(y == c)
        
        # Calculate likelihoods P(feature|class) for each feature.
        # We assume each feature follows a Gaussian (bell curve) distribution
        # within each class. So we just need the mean and std per class.
        # This is why it's called "Gaussian" Naive Bayes.
        self.means = {}
        self.stds = {}
        
        for c in self.classes:
            X_c = X[y == c]  # All samples belonging to class c
            self.means[c] = X_c.mean(axis=0)
            self.stds[c] = X_c.std(axis=0) + 1e-10  # Small epsilon prevents division by zero
    
    def _gaussian_probability(self, x, mean, std):
        """Evaluate the Gaussian probability density at x (a density, so it can exceed 1)."""
        exponent = np.exp(-((x - mean) ** 2) / (2 * std ** 2))
        return (1 / (np.sqrt(2 * np.pi) * std)) * exponent
    
    def predict_proba(self, X):
        """Return an unnormalized log-posterior score per class (not a true probability)."""
        probabilities = []
        
        for x in X:
            class_probs = {}
            
            for c in self.classes:
                # Start with the log prior
                prob = np.log(self.priors[c])
                
                # Add the log-likelihood of each feature
                # (adding logs is equivalent to multiplying probabilities)
                for i in range(self.n_features):
                    likelihood = self._gaussian_probability(x[i], self.means[c][i], self.stds[c][i])
                    prob += np.log(likelihood + 1e-10)  # Logs avoid underflow; epsilon avoids log(0)
                
                class_probs[c] = prob
            
            probabilities.append(class_probs)
        
        return probabilities
    
    def predict(self, X):
        """Predict class with highest probability."""
        probabilities = self.predict_proba(X)
        predictions = []
        
        for prob in probabilities:
            predicted_class = max(prob, key=prob.get)
            predictions.append(predicted_class)
        
        return np.array(predictions)

# Test on iris data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

nb = SimpleNaiveBayes()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)

accuracy = np.mean(predictions == y_test)
print(f"Our Naive Bayes accuracy: {accuracy:.2%}")
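The scores accumulated in `predict_proba` above are log-values, not probabilities that sum to 1. One common way to turn such scores into probabilities is the log-sum-exp trick, sketched below. `log_scores_to_probs` and the example scores are hypothetical and not part of the class above.

```python
import numpy as np

def log_scores_to_probs(log_scores):
    """Convert unnormalized log-scores into probabilities that sum to 1."""
    log_scores = np.asarray(log_scores, dtype=float)
    # Subtracting the max leaves the ratios unchanged but keeps exp() in a safe range
    shifted = log_scores - log_scores.max()
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical log-scores for three classes; very negative values are typical
# when many feature log-likelihoods are summed.
example_scores = [-1050.0, -1052.0, -1060.0]

print("Naive exp underflows to:", np.exp(example_scores[0]))
probs = log_scores_to_probs(example_scores)
print("Normalized probabilities:", probs)
print("Sum:", probs.sum())
```

Exponentiating the raw scores directly would give zero for every class. Shifting by the maximum first avoids that, while the predicted class (the argmax) stays the same.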

Types of Naive Bayes

1. Gaussian Naive Bayes

For continuous features (assumes normal distribution):
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

print(f"Gaussian NB Accuracy: {gnb.score(X_test, y_test):.2%}")
print(f"Class priors: {gnb.class_prior_}")

2. Multinomial Naive Bayes

For count data (word frequencies, document classification):
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Sample text classification
documents = [
    "I love this movie, it's amazing!",
    "Great film, highly recommend",
    "Terrible movie, waste of time",
    "Awful, boring, don't watch",
    "Best movie I've ever seen",
    "Horrible acting, bad plot"
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Convert text to word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Train Multinomial NB
mnb = MultinomialNB()
mnb.fit(X, labels)

# Predict on new text
new_reviews = ["This movie is great!", "I hated this film"]
X_new = vectorizer.transform(new_reviews)
predictions = mnb.predict(X_new)
probabilities = mnb.predict_proba(X_new)

for review, pred, prob in zip(new_reviews, predictions, probabilities):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"'{review}' -> {sentiment} (confidence: {max(prob):.2%})")

3. Bernoulli Naive Bayes

For binary features (word presence/absence):
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

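# Reuses `documents` and `labels` from the Multinomial example above.
# binary=True records only whether a word is present, not how often.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(documents)

bnb = BernoulliNB()
bnb.fit(X, labels)

# Scored on the training data itself, so this is an optimistic estimate
print(f"Bernoulli NB training accuracy: {bnb.score(X, labels):.2%}")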
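Unlike the multinomial variant, the Bernoulli model also scores the words that are absent from a document. The sketch below recomputes a Bernoulli NB prediction by hand on a tiny hypothetical binary matrix and compares it with scikit-learn's output. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary word-presence matrix (rows: documents, columns: words).
X_bin = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
])
y_toy = np.array([1, 1, 0, 0])

clf = BernoulliNB(alpha=1.0).fit(X_bin, y_toy)

log_p = clf.feature_log_prob_          # log P(word present | class)
log_not_p = np.log1p(-np.exp(log_p))   # log P(word absent | class)

x_new = np.array([1, 0, 0])
# Present words contribute log p, absent words contribute log(1 - p)
joint = clf.class_log_prior_ + (x_new * log_p + (1 - x_new) * log_not_p).sum(axis=1)
manual_log_proba = joint - np.logaddexp.reduce(joint)  # normalize in log space

print("Manual: ", np.exp(manual_log_proba))
print("sklearn:", clf.predict_proba(x_new.reshape(1, -1))[0])
```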

Real Example: Spam Classification

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Simulated email dataset
emails = [
    "Free money, click here now!",
    "Meeting tomorrow at 3pm",
    "Congratulations! You won a prize",
    "Can you review this document?",
    "Limited offer, buy now",
    "Project deadline is Friday",
    "Earn $1000 per day from home",
    "Lunch meeting at noon",
    "Act now, special discount",
    "Team standup in 10 minutes",
    "You're a winner! Claim prize",
    "Budget report attached",
    "Free gift card, click link",
    "Quarterly review scheduled",
    "Urgent: Verify your account",
    "Happy birthday from the team"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=spam, 0=ham

# Create TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(emails)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42
)

# Train
spam_classifier = MultinomialNB()
spam_classifier.fit(X_train, y_train)

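# Evaluate
y_pred = spam_classifier.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print("\nClassification Report:")
# Passing labels explicitly keeps the report valid even if the tiny test split lacks a class
print(classification_report(y_test, y_pred, labels=[0, 1],
                            target_names=['Ham', 'Spam'], zero_division=0))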

# Show most indicative words
feature_names = vectorizer.get_feature_names_out()
spam_log_prob = spam_classifier.feature_log_prob_[1]
ham_log_prob = spam_classifier.feature_log_prob_[0]

# Words most indicative of spam
spam_indicators = spam_log_prob - ham_log_prob
top_spam_words = [feature_names[i] for i in np.argsort(spam_indicators)[-5:]]
print(f"\nTop spam words: {top_spam_words}")
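The example above fits the vectorizer on every email before splitting, so vocabulary and IDF statistics from the test emails leak into training. One way to avoid this, sketched here under the assumption that a pipeline with cross-validation suits your setup, is to refit the vectorizer inside each fold. The mini-dataset is hypothetical and far too small for meaningful scores.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-dataset, for illustration only (1 = spam, 0 = ham).
demo_emails = [
    "Free prize, click now",
    "Meeting moved to 3pm",
    "Win money from home",
    "Please review the report",
    "Claim your free gift card",
    "Lunch at noon tomorrow",
    "Limited offer, act now",
    "Project deadline is Friday",
]
demo_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# The pipeline refits the vectorizer on the training part of each fold only
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, demo_emails, demo_labels, cv=cv)

print("Fold accuracies:", scores)
print(f"Mean accuracy: {scores.mean():.2%}")
```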

When Naive Bayes Shines

1. Text Classification

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Load 20 newsgroups dataset (subset)
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.misc']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

# Create pipeline
model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_features=5000),
    MultinomialNB()
)

# Train and evaluate
model.fit(newsgroups_train.data, newsgroups_train.target)
accuracy = model.score(newsgroups_test.data, newsgroups_test.target)
print(f"20 Newsgroups Accuracy: {accuracy:.2%}")

2. Fast Baseline Model

import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Large dataset
X, y = make_classification(n_samples=100000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    'Naive Bayes': GaussianNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100)
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    accuracy = model.score(X_test, y_test)
    print(f"{name:22s}: Accuracy={accuracy:.2%}, Time={train_time:.2f}s")

Laplace Smoothing

What if a word never appeared in training for a class?
P("cryptocurrency" | spam) = 0 / 100 = 0
Then the entire product becomes 0, regardless of other evidence!

Solution: Add a small count to everything (Laplace/additive smoothing). Think of it as giving every word the “benefit of the doubt”: with alpha=1, we pretend each word appeared one extra time in each class, so no probability is ever exactly zero. This prevents any single unseen word from vetoing the entire classification.
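# alpha controls smoothing strength
# alpha=1 is Laplace smoothing (pretend each word appeared once extra)
# alpha<1 is Lidstone smoothing (less aggressive)
# alpha=0 means no smoothing (dangerous: one unseen word kills everything)
mnb = MultinomialNB(alpha=1.0)  # Default, a safe starting point

# MultinomialNB needs non-negative features, so rebuild the TF-IDF split from
# the spam example (`emails`, `labels`) rather than reusing the
# make_classification data from the baseline comparison.
X_text = TfidfVectorizer(stop_words='english').fit_transform(emails)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_text, labels, test_size=0.25, random_state=42
)

# Try different smoothing values
for alpha in [0.01, 0.1, 1.0, 10.0]:
    mnb = MultinomialNB(alpha=alpha)
    mnb.fit(X_tr, y_tr)
    print(f"alpha={alpha}: Accuracy={mnb.score(X_te, y_te):.2%}")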
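For intuition, the smoothed estimate can be computed by hand as (count + alpha) / (total count + alpha × vocabulary size). The sketch below does this on a tiny hypothetical word-count matrix and checks the result against `MultinomialNB`. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: rows are documents, columns are three vocabulary words.
X_counts = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [0, 1, 3],
    [0, 0, 2],
])
y_toy = np.array([1, 1, 0, 0])
alpha = 1.0

# Total count of each word across the class-1 documents
counts_c1 = X_counts[y_toy == 1].sum(axis=0)
vocab_size = X_counts.shape[1]

unsmoothed = counts_c1 / counts_c1.sum()
smoothed = (counts_c1 + alpha) / (counts_c1.sum() + alpha * vocab_size)

clf = MultinomialNB(alpha=alpha).fit(X_counts, y_toy)
sklearn_probs = np.exp(clf.feature_log_prob_[1])  # row 1 corresponds to class 1

print("Unsmoothed P(word | class 1):", unsmoothed)
print("Smoothed   P(word | class 1):", smoothed)
print("sklearn    P(word | class 1):", sklearn_probs)
```

The third word never appears in class 1, so its unsmoothed probability is exactly zero. After smoothing it is small but positive.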

Naive Bayes vs Other Algorithms

| Aspect | Naive Bayes | Logistic Regression | Random Forest |
|---|---|---|---|
| Speed | Very fast | Fast | Slow |
| Training data needed | Little | Moderate | Lots |
| Handles text | Excellent | Good | Poor |
| Feature independence | Assumed | Not required | Not required |
| Interpretability | Good | Good | Poor |
| Probability calibration | Often poor | Good | Moderate |

Probability Calibration

Naive Bayes probabilities are often overconfident. Because the independence assumption is wrong, correlated features get counted as separate pieces of evidence, which pushes probabilities toward 0 and 1. The model might say “99.8% spam” when the true probability is 75%. This usually doesn’t hurt classification accuracy (the winning class is often still correct), but it’s a problem if you need reliable probability estimates, for example when ranking items by risk or making decisions with different cost thresholds.
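from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
import matplotlib.pyplot as plt

# Uses X_train, X_test, y_train, y_test from the make_classification baseline example

# Uncalibrated
gnb = GaussianNB()
gnb.fit(X_train, y_train)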

# Calibrated
gnb_calibrated = CalibratedClassifierCV(GaussianNB(), cv=5)
gnb_calibrated.fit(X_train, y_train)

# Compare probabilities
probs_uncal = gnb.predict_proba(X_test)[:, 1]
probs_cal = gnb_calibrated.predict_proba(X_test)[:, 1]

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.hist(probs_uncal, bins=20, edgecolor='black')
plt.title('Uncalibrated Probabilities')
plt.xlabel('Probability')

plt.subplot(1, 2, 2)
plt.hist(probs_cal, bins=20, edgecolor='black')
plt.title('Calibrated Probabilities')
plt.xlabel('Probability')

plt.tight_layout()
plt.show()
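Histograms show how the predicted probabilities are distributed, but not whether they are accurate. One way to quantify that, sketched here as an assumption about what you may want to measure, is the Brier score (mean squared error between predicted probability and the true 0/1 label; lower is better). The synthetic dataset and its parameters are chosen only for illustration, with redundant features included to violate the independence assumption.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant (correlated) features; parameters are illustrative.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=20, n_informative=5, n_redundant=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

raw_model = GaussianNB().fit(X_tr, y_tr)
calibrated_model = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated_model.fit(X_tr, y_tr)

brier_raw = brier_score_loss(y_te, raw_model.predict_proba(X_te)[:, 1])
brier_cal = brier_score_loss(y_te, calibrated_model.predict_proba(X_te)[:, 1])

print(f"Brier score, uncalibrated: {brier_raw:.4f}")
print(f"Brier score, calibrated:   {brier_cal:.4f}")
```

Whether calibration helps, and by how much, depends on the data, so it is worth checking on a held-out set rather than assuming an improvement.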

Key Takeaways

Probability-Based

Predicts class probabilities using Bayes’ theorem

Independence Assumption

Assumes features are conditionally independent given the class (often wrong, still works!)

Fast & Simple

Trains in a single pass over the data, great for baselines

Text Champion

Excels at document classification and spam filtering

What’s Next?

Now let’s learn about ensemble methods - combining multiple models for better predictions!

Continue to Module 6: Ensemble Methods

The wisdom of crowds - Random Forests and Gradient Boosting