Naive Bayes

The Probability Perspective

Most algorithms we’ve seen ask: “Which side of the boundary is this point on?” Naive Bayes asks: “Given the evidence, what’s the probability of each class?”

The Doctor’s Diagnosis Problem

A patient walks in with symptoms:

Fever: Yes
Cough: Yes
Fatigue: Yes

The doctor thinks: “Based on these symptoms, how likely is it they have the flu vs a cold?” This is Bayesian reasoning - updating beliefs based on evidence.

Bayes’ Theorem

$$P(\text{Disease} \mid \text{Symptoms}) = \frac{P(\text{Symptoms} \mid \text{Disease}) \times P(\text{Disease})}{P(\text{Symptoms})}$$

In English:

P(Disease|Symptoms): Probability of disease given symptoms (what we want)
P(Symptoms|Disease): How likely these symptoms are if you have the disease
P(Disease): How common the disease is (prior probability)
P(Symptoms): How common these symptoms are overall

Math Connection: This is Bayes’ Theorem from probability theory. See Probability for the full derivation.
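To make the formula concrete, here is a minimal sketch that plugs numbers into Bayes' Theorem. The three probabilities are hypothetical values chosen only for illustration; they are not medical statistics.

```python
# Hypothetical numbers, chosen only to illustrate the formula
p_flu = 0.05                 # P(Disease): prior probability of flu
p_symptoms_given_flu = 0.90  # P(Symptoms|Disease): likelihood
p_symptoms = 0.20            # P(Symptoms): how common the symptoms are overall

# Bayes' Theorem: posterior = likelihood * prior / evidence
p_flu_given_symptoms = p_symptoms_given_flu * p_flu / p_symptoms
print(f"P(Flu|Symptoms) = {p_flu_given_symptoms:.3f}")
```

Even with a strong likelihood, the low prior keeps the posterior well below the likelihood itself, which is the kind of belief update the doctor is performing.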

Why “Naive”?

The “naive” assumption: All features are independent given the class. For our flu example:

P(Fever AND Cough AND Fatigue | Flu)
≈ P(Fever|Flu) × P(Cough|Flu) × P(Fatigue|Flu)

Is this realistic? No! Symptoms often correlate. Does it work anyway? Surprisingly well, yes!
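The factorization above can be sketched in a few lines. All priors and per-symptom likelihoods below are made-up values, and the sketch assumes flu and cold are the only two possible diagnoses.

```python
import math

# Hypothetical priors and per-symptom likelihoods (illustrative only)
priors = {"flu": 0.3, "cold": 0.7}
likelihoods = {
    "flu":  {"fever": 0.9, "cough": 0.8, "fatigue": 0.9},
    "cold": {"fever": 0.2, "cough": 0.7, "fatigue": 0.3},
}
symptoms = ["fever", "cough", "fatigue"]

# Naive assumption: the joint likelihood is the product of per-symptom likelihoods
scores = {
    disease: priors[disease] * math.prod(likelihoods[disease][s] for s in symptoms)
    for disease in priors
}

# Dividing by the sum of the scores plays the role of P(Symptoms)
total = sum(scores.values())
posteriors = {disease: score / total for disease, score in scores.items()}
print(posteriors)
```

Note that the cold starts with the higher prior, yet the combined evidence from three symptoms is enough to flip the ranking toward flu.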

Building Naive Bayes From Scratch

import numpy as np

class SimpleNaiveBayes:
    """Naive Bayes classifier from scratch."""
    
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.n_classes = len(self.classes)
        self.n_features = X.shape[1]
        
        # Calculate prior probabilities P(class)
        self.priors = {}
        for c in self.classes:
            self.priors[c] = np.mean(y == c)
        
        # Calculate likelihoods P(feature|class) for each feature
        # Using Gaussian (normal) distribution
        self.means = {}
        self.stds = {}
        
        for c in self.classes:
            X_c = X[y == c]
            self.means[c] = X_c.mean(axis=0)
            self.stds[c] = X_c.std(axis=0) + 1e-10  # Add small value to avoid division by zero
    
    def _gaussian_probability(self, x, mean, std):
        """Calculate probability using Gaussian distribution."""
        exponent = np.exp(-((x - mean) ** 2) / (2 * std ** 2))
        return (1 / (np.sqrt(2 * np.pi) * std)) * exponent
    
    def predict_proba(self, X):
        """Predict normalized posterior probabilities, one row per sample."""
        log_posteriors = []
        
        for x in X:
            class_scores = []
            
            for c in self.classes:
                # Start with the log prior
                score = np.log(self.priors[c])
                
                # Add the log-likelihood of each feature
                # (summing logs equals multiplying probabilities, without underflow)
                for i in range(self.n_features):
                    likelihood = self._gaussian_probability(x[i], self.means[c][i], self.stds[c][i])
                    score += np.log(likelihood + 1e-10)
                
                class_scores.append(score)
            
            log_posteriors.append(class_scores)
        
        # Normalize with the log-sum-exp trick so each row sums to 1
        log_posteriors = np.array(log_posteriors)
        log_posteriors -= log_posteriors.max(axis=1, keepdims=True)
        probabilities = np.exp(log_posteriors)
        return probabilities / probabilities.sum(axis=1, keepdims=True)
    
    def predict(self, X):
        """Predict the class with the highest posterior probability."""
        probabilities = self.predict_proba(X)
        return self.classes[np.argmax(probabilities, axis=1)]

# Test on iris data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

nb = SimpleNaiveBayes()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)

accuracy = np.mean(predictions == y_test)
print(f"Our Naive Bayes accuracy: {accuracy:.2%}")
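The from-scratch classifier works with logarithms rather than raw probabilities. A small sketch of why: multiplying many small likelihoods underflows to zero in floating point, while summing their logs stays finite. The 1,000 identical likelihood values are a hypothetical stand-in for a high-dimensional sample.

```python
import numpy as np

# 1,000 hypothetical per-feature likelihoods, each equal to 0.01
likelihoods = np.full(1000, 0.01)

direct_product = np.prod(likelihoods)   # too small for float64, collapses to 0.0
log_sum = np.sum(np.log(likelihoods))   # same quantity in log space, still finite

print(f"Direct product: {direct_product}")
print(f"Sum of logs:    {log_sum:.2f}")
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the original products would.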

Types of Naive Bayes

1. Gaussian Naive Bayes

For continuous features (assumes normal distribution):

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

print(f"Gaussian NB Accuracy: {gnb.score(X_test, y_test):.2%}")
print(f"Class priors: {gnb.class_prior_}")
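To connect the library model to the from-scratch version, the following sketch checks that the per-class feature means stored by GaussianNB (its `theta_` attribute) match means computed by hand. It fits on the full iris data purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X, y)

# theta_ has one row per class and one column per feature
manual_means = np.array([X[y == c].mean(axis=0) for c in gnb.classes_])

print(f"theta_ shape: {gnb.theta_.shape}")
print(f"Matches manual means: {np.allclose(gnb.theta_, manual_means)}")
```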

2. Multinomial Naive Bayes

For count data (word frequencies, document classification):

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Sample text classification
documents = [
    "I love this movie, it's amazing!",
    "Great film, highly recommend",
    "Terrible movie, waste of time",
    "Awful, boring, don't watch",
    "Best movie I've ever seen",
    "Horrible acting, bad plot"
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Convert text to word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Train Multinomial NB
mnb = MultinomialNB()
mnb.fit(X, labels)

# Predict on new text
new_reviews = ["This movie is great!", "I hated this film"]
X_new = vectorizer.transform(new_reviews)
predictions = mnb.predict(X_new)
probabilities = mnb.predict_proba(X_new)

for review, pred, prob in zip(new_reviews, predictions, probabilities):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"'{review}' -> {sentiment} (confidence: {max(prob):.2%})")

3. Bernoulli Naive Bayes

For binary features (word presence/absence):

from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

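# Reuses `documents` and `labels` from the Multinomial example above.
# Use binary features (word present or not)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(documents)

# Note: the score below is measured on the training data itself,
# so it says nothing about how well the model generalizes.
bnb = BernoulliNB()
bnb.fit(X, labels)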

print(f"Bernoulli NB Accuracy: {bnb.score(X, labels):.2%}")
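One practical consequence of the presence/absence view: repeated words carry no extra weight. The sketch below, on four hypothetical mini-documents, relies on BernoulliNB's `binarize` parameter (default `0.0`), which thresholds the input so that raw counts and 0/1 features lead to the same model.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical mini-corpus with repeated words
docs = ["good good good movie", "bad movie", "good film", "bad bad plot"]
y = [1, 0, 1, 0]

counts = CountVectorizer().fit_transform(docs)             # raw word counts
binary = CountVectorizer(binary=True).fit_transform(docs)  # 0/1 presence

# With binarize=0.0 (the default), any count above 0 is treated as 1
proba_counts = BernoulliNB().fit(counts, y).predict_proba(counts)
proba_binary = BernoulliNB().fit(binary, y).predict_proba(binary)

print(f"Same probabilities: {np.allclose(proba_counts, proba_binary)}")
```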

Real Example: Spam Classification

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Simulated email dataset
emails = [
    "Free money, click here now!",
    "Meeting tomorrow at 3pm",
    "Congratulations! You won a prize",
    "Can you review this document?",
    "Limited offer, buy now",
    "Project deadline is Friday",
    "Earn $1000 per day from home",
    "Lunch meeting at noon",
    "Act now, special discount",
    "Team standup in 10 minutes",
    "You're a winner! Claim prize",
    "Budget report attached",
    "Free gift card, click link",
    "Quarterly review scheduled",
    "Urgent: Verify your account",
    "Happy birthday from the team"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=spam, 0=ham

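# Split the raw emails first so the vectorizer never sees the test set
emails_train, emails_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, random_state=42, stratify=labels
)

# Create TF-IDF features (fit on the training emails only to avoid leakage)
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(emails_train)
X_test = vectorizer.transform(emails_test)

# Train
spam_classifier = MultinomialNB()
spam_classifier.fit(X_train, y_train)

# Evaluate (with only 4 test emails, treat these numbers as illustrative)
y_pred = spam_classifier.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, labels=[0, 1], target_names=['Ham', 'Spam']))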

# Show most indicative words
feature_names = vectorizer.get_feature_names_out()
spam_log_prob = spam_classifier.feature_log_prob_[1]
ham_log_prob = spam_classifier.feature_log_prob_[0]

# Words most indicative of spam
spam_indicators = spam_log_prob - ham_log_prob
top_spam_words = [feature_names[i] for i in np.argsort(spam_indicators)[-5:]]
print(f"\nTop spam words: {top_spam_words}")

When Naive Bayes Shines

1. Text Classification

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Load 20 newsgroups dataset (subset)
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.misc']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

# Create pipeline
model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_features=5000),
    MultinomialNB()
)

# Train and evaluate
model.fit(newsgroups_train.data, newsgroups_train.target)
accuracy = model.score(newsgroups_test.data, newsgroups_test.target)
print(f"20 Newsgroups Accuracy: {accuracy:.2%}")

2. Fast Baseline Model

import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Large dataset
X, y = make_classification(n_samples=100000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    'Naive Bayes': GaussianNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100)
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    accuracy = model.score(X_test, y_test)
    print(f"{name:22s}: Accuracy={accuracy:.2%}, Time={train_time:.2f}s")

Laplace Smoothing

What if a word never appeared in training for a class?

P("cryptocurrency" | spam) = 0 / 100 = 0

Then the entire product becomes 0, regardless of other evidence! Solution: Add a small count to everything (Laplace/additive smoothing):

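from sklearn.naive_bayes import MultinomialNB

# Assumes X_train, X_test, y_train, y_test hold non-negative text features,
# e.g. the TF-IDF split from the spam example (MultinomialNB rejects negative values).

# alpha controls smoothing
# alpha=1 is Laplace smoothing
# alpha<1 is Lidstone smoothing
mnb = MultinomialNB(alpha=1.0)  # Default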

# Try different smoothing values
for alpha in [0.01, 0.1, 1.0, 10.0]:
    mnb = MultinomialNB(alpha=alpha)
    mnb.fit(X_train, y_train)
    print(f"alpha={alpha}: Accuracy={mnb.score(X_test, y_test):.2%}")
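The "add a small count to everything" idea can be sketched by hand. The helper name `smoothed_probability` and the 50-word vocabulary are assumptions made for this illustration; the 0 out of 100 count comes from the "cryptocurrency" example above.

```python
def smoothed_probability(word_count, total_count, vocab_size, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# "cryptocurrency" appears 0 times among 100 spam word occurrences;
# the vocabulary size of 50 is a hypothetical value
unsmoothed = 0 / 100
smoothed = smoothed_probability(word_count=0, total_count=100, vocab_size=50)

print(f"Unsmoothed: {unsmoothed}")
print(f"Smoothed:   {smoothed:.4f}")
```

Adding `alpha * vocab_size` to the denominator keeps the smoothed probabilities over the whole vocabulary summing to 1, while the unseen word gets a small but nonzero share.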

Naive Bayes vs Other Algorithms

Aspect	Naive Bayes	Logistic Regression	Random Forest
Speed	Very Fast	Fast	Slow
Training data needed	Little	Moderate	Lots
Handles text	Excellent	Good	Poor
Feature independence	Assumed	Not assumed	Not assumed
Interpretability	Good	Good	Poor
Probability calibration	Often poor	Good	Moderate

Probability Calibration

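Naive Bayes probabilities are often overconfident: correlated features get counted as independent pieces of evidence, which pushes predictions toward 0 or 1. Wrapping the model in CalibratedClassifierCV is one way to correct this: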

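from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
import matplotlib.pyplot as plt

# Assumes X_train, X_test, y_train, y_test come from a binary problem,
# e.g. the make_classification split in the Fast Baseline example.

# Uncalibrated
gnb = GaussianNB()
gnb.fit(X_train, y_train)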

# Calibrated
gnb_calibrated = CalibratedClassifierCV(GaussianNB(), cv=5)
gnb_calibrated.fit(X_train, y_train)

# Compare probabilities
probs_uncal = gnb.predict_proba(X_test)[:, 1]
probs_cal = gnb_calibrated.predict_proba(X_test)[:, 1]

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.hist(probs_uncal, bins=20, edgecolor='black')
plt.title('Uncalibrated Probabilities')
plt.xlabel('Probability')

plt.subplot(1, 2, 2)
plt.hist(probs_cal, bins=20, edgecolor='black')
plt.title('Calibrated Probabilities')
plt.xlabel('Probability')

plt.tight_layout()
plt.show()
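Histograms show how the predicted probabilities are distributed, but not whether they are accurate. One way to put a number on calibration is the Brier score (mean squared error between predicted probability and outcome, lower is better). The sketch below is self-contained; the synthetic dataset and its redundant features are assumptions chosen to violate the independence assumption, and whether calibration helps on a given dataset is something to measure rather than assume.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary problem with correlated (redundant) features
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=5, n_redundant=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Uncalibrated": GaussianNB().fit(X_train, y_train),
    "Calibrated": CalibratedClassifierCV(GaussianNB(), cv=5).fit(X_train, y_train),
}

# Brier score on held-out data for each model
brier_scores = {
    name: brier_score_loss(y_test, model.predict_proba(X_test)[:, 1])
    for name, model in models.items()
}
for name, score in brier_scores.items():
    print(f"{name}: Brier score = {score:.4f}")
```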

Key Takeaways

Probability-Based: Predicts class probabilities using Bayes’ theorem
Independence Assumption: Assumes features are independent (often wrong, still works!)
Fast & Simple: Trains instantly, great for baselines
Text Champion: Excels at document classification and spam filtering

What’s Next?

Now let’s learn about ensemble methods - combining multiple models for better predictions!

Continue to Module 6: Ensemble Methods

The wisdom of crowds - Random Forests and Gradient Boosting
