
Naive Bayes

[Figure: Naive Bayes probability flow]

The Probability Perspective

Most algorithms we’ve seen ask: “Which side of the boundary is this point on?” Naive Bayes asks: “Given the evidence, what’s the probability of each class?”

The Doctor’s Diagnosis Problem

A patient walks in with symptoms:
  • Fever: Yes
  • Cough: Yes
  • Fatigue: Yes
The doctor thinks: “Based on these symptoms, how likely is it that they have the flu vs. a cold?” This is Bayesian reasoning: updating beliefs based on evidence.
[Figure: Spam detection with Naive Bayes]

Bayes’ Theorem

P(Disease|Symptoms) = \frac{P(Symptoms|Disease) \times P(Disease)}{P(Symptoms)}

In English:
  • P(Disease|Symptoms): Probability of disease given symptoms (what we want)
  • P(Symptoms|Disease): How likely these symptoms are if you have the disease
  • P(Disease): How common the disease is (prior probability)
  • P(Symptoms): How common these symptoms are overall
Math Connection: This is Bayes’ Theorem from probability theory. See Probability for the full derivation.
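
To make the formula concrete, here's a tiny hand-computed sketch (the numbers below are invented for illustration, not taken from any dataset):

# Invented numbers, purely for illustration
p_disease = 0.01                  # P(Disease): 1% of patients have the flu
p_symptoms_given_disease = 0.90   # P(Symptoms|Disease)
p_symptoms = 0.10                 # P(Symptoms): 10% of all patients show these symptoms

# Bayes' theorem: posterior = likelihood * prior / evidence
p_disease_given_symptoms = p_symptoms_given_disease * p_disease / p_symptoms
print(f"P(Disease|Symptoms) = {p_disease_given_symptoms:.1%}")  # 9.0%

Even though the symptoms are very likely given the flu, the low prior keeps the posterior modest; this is exactly the belief-updating the doctor does.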

Why “Naive”?

The “naive” assumption: All features are independent given the class. For our flu example:

P(Fever AND Cough AND Fatigue | Flu) ≈ P(Fever|Flu) × P(Cough|Flu) × P(Fatigue|Flu)
Is this realistic? No! Symptoms often correlate. Does it work anyway? Surprisingly well, yes!
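
As a quick numeric sketch (the per-symptom likelihoods are invented for illustration), the independence assumption replaces one hard-to-estimate joint probability with a product of easy per-symptom ones:

# Invented per-symptom likelihoods, for illustration only
p_fever_given_flu = 0.85
p_cough_given_flu = 0.80
p_fatigue_given_flu = 0.70

# Naive (independence) approximation of the joint likelihood
p_symptoms_given_flu = p_fever_given_flu * p_cough_given_flu * p_fatigue_given_flu
print(f"P(Fever, Cough, Fatigue | Flu) ≈ {p_symptoms_given_flu:.3f}")  # 0.476

Estimating each single-symptom probability needs far less data than estimating every symptom combination jointly, which is why this trick scales so well.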

Building Naive Bayes From Scratch

import numpy as np
from collections import defaultdict

class SimpleNaiveBayes:
    """Naive Bayes classifier from scratch."""
    
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.n_classes = len(self.classes)
        self.n_features = X.shape[1]
        
        # Calculate prior probabilities P(class)
        self.priors = {}
        for c in self.classes:
            self.priors[c] = np.mean(y == c)
        
        # Calculate likelihoods P(feature|class) for each feature
        # Using Gaussian (normal) distribution
        self.means = {}
        self.stds = {}
        
        for c in self.classes:
            X_c = X[y == c]
            self.means[c] = X_c.mean(axis=0)
            self.stds[c] = X_c.std(axis=0) + 1e-10  # Add small value to avoid division by zero
    
    def _gaussian_probability(self, x, mean, std):
        """Calculate probability using Gaussian distribution."""
        exponent = np.exp(-((x - mean) ** 2) / (2 * std ** 2))
        return (1 / (np.sqrt(2 * np.pi) * std)) * exponent
    
    def predict_proba(self, X):
        """Predict probability for each class."""
        probabilities = []
        
        for x in X:
            class_probs = {}
            
            for c in self.classes:
                # Start with the log prior
                prob = np.log(self.priors[c])
                
                # Add the log-likelihood of each feature (a product in log space)
                for i in range(self.n_features):
                    likelihood = self._gaussian_probability(x[i], self.means[c][i], self.stds[c][i])
                    prob += np.log(likelihood + 1e-10)  # Use log to avoid underflow
                
                class_probs[c] = prob
            
            probabilities.append(class_probs)
        
        return probabilities
    
    def predict(self, X):
        """Predict class with highest probability."""
        probabilities = self.predict_proba(X)
        predictions = []
        
        for prob in probabilities:
            predicted_class = max(prob, key=prob.get)
            predictions.append(predicted_class)
        
        return np.array(predictions)

# Test on iris data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

nb = SimpleNaiveBayes()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)

accuracy = np.mean(predictions == y_test)
print(f"Our Naive Bayes accuracy: {accuracy:.2%}")

Types of Naive Bayes

1. Gaussian Naive Bayes

For continuous features (assumes normal distribution):
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

print(f"Gaussian NB Accuracy: {gnb.score(X_test, y_test):.2%}")
print(f"Class priors: {gnb.class_prior_}")

2. Multinomial Naive Bayes

For count data (word frequencies, document classification):
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Sample text classification
documents = [
    "I love this movie, it's amazing!",
    "Great film, highly recommend",
    "Terrible movie, waste of time",
    "Awful, boring, don't watch",
    "Best movie I've ever seen",
    "Horrible acting, bad plot"
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Convert text to word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Train Multinomial NB
mnb = MultinomialNB()
mnb.fit(X, labels)

# Predict on new text
new_reviews = ["This movie is great!", "I hated this film"]
X_new = vectorizer.transform(new_reviews)
predictions = mnb.predict(X_new)
probabilities = mnb.predict_proba(X_new)

for review, pred, prob in zip(new_reviews, predictions, probabilities):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"'{review}' -> {sentiment} (confidence: {max(prob):.2%})")

3. Bernoulli Naive Bayes

For binary features (word presence/absence):
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Use binary features (word present or not)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(documents)

bnb = BernoulliNB()
bnb.fit(X, labels)

print(f"Bernoulli NB Accuracy: {bnb.score(X, labels):.2%}")

Real Example: Spam Classification

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Simulated email dataset
emails = [
    "Free money, click here now!",
    "Meeting tomorrow at 3pm",
    "Congratulations! You won a prize",
    "Can you review this document?",
    "Limited offer, buy now",
    "Project deadline is Friday",
    "Earn $1000 per day from home",
    "Lunch meeting at noon",
    "Act now, special discount",
    "Team standup in 10 minutes",
    "You're a winner! Claim prize",
    "Budget report attached",
    "Free gift card, click link",
    "Quarterly review scheduled",
    "Urgent: Verify your account",
    "Happy birthday from the team"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=spam, 0=ham

# Create TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(emails)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42
)

# Train
spam_classifier = MultinomialNB()
spam_classifier.fit(X_train, y_train)

# Evaluate
y_pred = spam_classifier.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

# Show most indicative words
feature_names = vectorizer.get_feature_names_out()
spam_log_prob = spam_classifier.feature_log_prob_[1]
ham_log_prob = spam_classifier.feature_log_prob_[0]

# Words most indicative of spam
spam_indicators = spam_log_prob - ham_log_prob
top_spam_words = [feature_names[i] for i in np.argsort(spam_indicators)[-5:]]
print(f"\nTop spam words: {top_spam_words}")

When Naive Bayes Shines

1. Text Classification

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Load 20 newsgroups dataset (subset)
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.misc']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

# Create pipeline
model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_features=5000),
    MultinomialNB()
)

# Train and evaluate
model.fit(newsgroups_train.data, newsgroups_train.target)
accuracy = model.score(newsgroups_test.data, newsgroups_test.target)
print(f"20 Newsgroups Accuracy: {accuracy:.2%}")

2. Fast Baseline Model

import time
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Large dataset
X, y = make_classification(n_samples=100000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

models = {
    'Naive Bayes': GaussianNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100)
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    accuracy = model.score(X_test, y_test)
    print(f"{name:22s}: Accuracy={accuracy:.2%}, Time={train_time:.2f}s")

Laplace Smoothing

What if a word never appeared in training for a class?
P("cryptocurrency" | spam) = 0 / 100 = 0
Then the entire product becomes 0, regardless of other evidence! Solution: Add a small count to everything (Laplace/additive smoothing):
# alpha controls smoothing
# alpha=1 is Laplace smoothing
# alpha<1 is Lidstone smoothing
mnb = MultinomialNB(alpha=1.0)  # Default

# Try different smoothing values (re-using the TF-IDF spam split from above;
# MultinomialNB requires non-negative, count-like features)
for alpha in [0.01, 0.1, 1.0, 10.0]:
    mnb = MultinomialNB(alpha=alpha)
    mnb.fit(X_train, y_train)
    print(f"alpha={alpha}: Accuracy={mnb.score(X_test, y_test):.2%}")

Naive Bayes vs Other Algorithms

| Aspect | Naive Bayes | Logistic Regression | Random Forest |
| --- | --- | --- | --- |
| Speed | Very fast | Fast | Slow |
| Training data needed | Little | Moderate | Lots |
| Handles text | Excellent | Good | Poor |
| Feature independence | Required | Not required | Not required |
| Interpretability | Good | Good | Poor |
| Probability calibration | Often poor | Good | Moderate |

Probability Calibration

Naive Bayes probabilities are often overconfident:
from sklearn.calibration import CalibratedClassifierCV
import matplotlib.pyplot as plt

# Uncalibrated
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Calibrated
gnb_calibrated = CalibratedClassifierCV(GaussianNB(), cv=5)
gnb_calibrated.fit(X_train, y_train)

# Compare probabilities
probs_uncal = gnb.predict_proba(X_test)[:, 1]
probs_cal = gnb_calibrated.predict_proba(X_test)[:, 1]

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.hist(probs_uncal, bins=20, edgecolor='black')
plt.title('Uncalibrated Probabilities')
plt.xlabel('Probability')

plt.subplot(1, 2, 2)
plt.hist(probs_cal, bins=20, edgecolor='black')
plt.title('Calibrated Probabilities')
plt.xlabel('Probability')

plt.tight_layout()
plt.show()
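
To see whether calibration actually helped, you can also compare reliability curves with scikit-learn's calibration_curve (a brief sketch reusing y_test, probs_uncal, and probs_cal from above):

from sklearn.calibration import calibration_curve

# Reliability curves: fraction of positives vs. mean predicted probability per bin
frac_pos_uncal, mean_pred_uncal = calibration_curve(y_test, probs_uncal, n_bins=10)
frac_pos_cal, mean_pred_cal = calibration_curve(y_test, probs_cal, n_bins=10)

plt.plot(mean_pred_uncal, frac_pos_uncal, 'o-', label='Uncalibrated')
plt.plot(mean_pred_cal, frac_pos_cal, 's-', label='Calibrated')
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()
plt.show()

A well-calibrated model hugs the diagonal: when it says 70%, the event happens about 70% of the time.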

Key Takeaways

Probability-Based

Predicts class probabilities using Bayes’ theorem

Independence Assumption

Assumes features are independent (often wrong, still works!)

Fast & Simple

Trains instantly, great for baselines

Text Champion

Excels at document classification and spam filtering

What’s Next?

Now let’s learn about ensemble methods - combining multiple models for better predictions!

Continue to Module 6: Ensemble Methods

The wisdom of crowds - Random Forests and Gradient Boosting