# Naive Bayes

> Simple probabilistic classification that's surprisingly powerful

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/naive-bayes-concept.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=fb8f27a8fff0c062fc53caa1121e4abb" alt="Naive Bayes Probability Flow" width="1080" height="1080" data-path="images/courses/ml-mastery/naive-bayes-concept.svg" />
</Frame>

## The Probability Perspective

Most algorithms we've seen ask: *"Which side of the boundary is this point on?"*

Naive Bayes asks: *"Given the evidence, what's the probability of each class?"*

***

## The Doctor's Diagnosis Problem

A patient walks in with symptoms:

* Fever: Yes
* Cough: Yes
* Fatigue: Yes

The doctor thinks: *"Based on these symptoms, how likely is it they have the flu vs a cold?"*

This is **Bayesian reasoning** - updating beliefs based on evidence.

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/naive-bayes-real-world.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=dba0c969f3d77bb14cd9ca81326af8e7" alt="Spam Detection with Naive Bayes" width="1080" height="1080" data-path="images/courses/ml-mastery/naive-bayes-real-world.svg" />
</Frame>

***

## Bayes' Theorem

$$
P(\text{Disease} \mid \text{Symptoms}) = \frac{P(\text{Symptoms} \mid \text{Disease}) \times P(\text{Disease})}{P(\text{Symptoms})}
$$

In English:

* **P(Disease|Symptoms)**: Probability of disease given symptoms (what we want)
* **P(Symptoms|Disease)**: How likely these symptoms are if you have the disease
* **P(Disease)**: How common the disease is (prior probability)
* **P(Symptoms)**: How common these symptoms are overall

<Note>
  **Math Connection**: This is Bayes' Theorem from probability theory. See [Probability](/courses/statistics-for-ml/03-probability) for the full derivation.
</Note>
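
To make the formula concrete, here is a small worked example for a single symptom. The numbers (a 5% flu rate, fever in 90% of flu patients and 20% of everyone else) are made up for illustration and are not medical data.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
p_flu = 0.05                   # P(Disease): prior
p_fever_given_flu = 0.90       # P(Symptoms|Disease): likelihood
p_fever_given_not_flu = 0.20   # P(Symptoms|no Disease)

# P(Symptoms): total probability of seeing a fever, summed over both cases
p_fever = p_fever_given_flu * p_flu + p_fever_given_not_flu * (1 - p_flu)

# Bayes' theorem
p_flu_given_fever = p_fever_given_flu * p_flu / p_fever

print(f"P(fever)       = {p_fever:.3f}")            # 0.235
print(f"P(flu | fever) = {p_flu_given_fever:.3f}")  # 0.191
```

Even though a fever is much more likely with the flu than without it, the posterior stays below 20% because the prior is low.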

***

## Why "Naive"?

The "naive" assumption: **All features are independent given the class.**

For our flu example:

* P(Fever AND Cough AND Fatigue | Flu)
* Approximately equals P(Fever|Flu) x P(Cough|Flu) x P(Fatigue|Flu)
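
Written generally, the assumption turns the posterior into a product of one term per feature. The symbols are notation introduced here for illustration: c is a class and x_i is the i-th of n features.

$$
P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The denominator from Bayes' theorem is the same for every class, so it can be dropped when the goal is only to compare classes.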

**Is this realistic?** No! Symptoms often correlate -- fever and fatigue almost always appear together. If we were computing the true joint probability, we'd need to account for all these correlations.

**Does it work anyway?** Surprisingly well, yes! Here's why: Naive Bayes only needs to get the *ranking* of class probabilities right, not the exact values. Even if the absolute probabilities are wildly off (e.g., 99.9% instead of 70%), as long as the winning class is correct, the classification is correct. The independence assumption distorts magnitudes but often preserves the ordering. It's like a biased thermometer that always reads 10 degrees too high -- useless for absolute temperature, but perfectly fine for telling you which room is hottest.
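
A minimal sketch of this effect, using made-up numbers: one piece of evidence favors flu with likelihood 0.7 against 0.3, and the "naive" version counts five perfectly correlated copies of it as if they were independent.

```python
# Hypothetical setup: two equally likely classes and one symptom whose
# likelihood is 0.7 under flu and 0.3 under cold.
prior = {"flu": 0.5, "cold": 0.5}
likelihood = {"flu": 0.7, "cold": 0.3}

def posterior(n_copies):
    """Posterior when the same evidence is counted n_copies times."""
    scores = {c: prior[c] * likelihood[c] ** n_copies for c in prior}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

honest = posterior(1)  # evidence counted once (the correct answer here)
naive = posterior(5)   # five correlated copies treated as independent

print(f"Counted once: P(flu) = {honest['flu']:.3f}")  # 0.700
print(f"Counted 5x:   P(flu) = {naive['flu']:.3f}")   # 0.986
print("Same winner:", max(honest, key=honest.get) == max(naive, key=naive.get))
```

The probability is inflated from 70% to about 99%, but flu wins in both cases, so the predicted class does not change.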

***

## Building Naive Bayes From Scratch

```python theme={null}
import numpy as np
from collections import defaultdict

class SimpleNaiveBayes:
    """Naive Bayes classifier from scratch."""
    
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.n_classes = len(self.classes)
        self.n_features = X.shape[1]
        
        # Calculate prior probabilities P(class).
        # If 60% of emails are spam, P(spam) = 0.6.
        # This is our "starting belief" before looking at any features.
        self.priors = {}
        for c in self.classes:
            self.priors[c] = np.mean(y == c)
        
        # Calculate likelihoods P(feature|class) for each feature.
        # We assume each feature follows a Gaussian (bell curve) distribution
        # within each class. So we just need the mean and std per class.
        # This is why it's called "Gaussian" Naive Bayes.
        self.means = {}
        self.stds = {}
        
        for c in self.classes:
            X_c = X[y == c]  # All samples belonging to class c
            self.means[c] = X_c.mean(axis=0)
            self.stds[c] = X_c.std(axis=0) + 1e-10  # Small epsilon prevents division by zero
    
    def _gaussian_probability(self, x, mean, std):
        """Calculate probability using Gaussian distribution."""
        exponent = np.exp(-((x - mean) ** 2) / (2 * std ** 2))
        return (1 / (np.sqrt(2 * np.pi) * std)) * exponent
    
    def predict_proba(self, X):
        """Return the unnormalized log-posterior score for each class."""
        probabilities = []
        
        for x in X:
            class_probs = {}
            
            for c in self.classes:
                # Start with the log prior
                prob = np.log(self.priors[c])
                
                # Add the log-likelihood of each feature
                # (summing logs replaces multiplying probabilities and avoids underflow)
                for i in range(self.n_features):
                    likelihood = self._gaussian_probability(x[i], self.means[c][i], self.stds[c][i])
                    prob += np.log(likelihood + 1e-10)  # Epsilon guards against log(0)
                
                class_probs[c] = prob
            
            probabilities.append(class_probs)
        
        return probabilities
    
    def predict(self, X):
        """Predict class with highest probability."""
        probabilities = self.predict_proba(X)
        predictions = []
        
        for prob in probabilities:
            predicted_class = max(prob, key=prob.get)
            predictions.append(predicted_class)
        
        return np.array(predictions)

# Test on iris data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

nb = SimpleNaiveBayes()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)

accuracy = np.mean(predictions == y_test)
print(f"Our Naive Bayes accuracy: {accuracy:.2%}")
```
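
Note that `predict_proba` in the class above returns unnormalized log scores, which is enough for picking the winning class. If actual probabilities are needed, one common approach is the log-sum-exp trick. The sketch below uses hypothetical log scores rather than output from the class above.

```python
import numpy as np

# Hypothetical per-class log scores, as a log-space Naive Bayes might
# produce after summing many log-likelihoods.
log_scores = np.array([-1050.0, -1052.0, -1060.0])

# Exponentiating directly underflows: every entry becomes 0.0.
direct = np.exp(log_scores)
print("Direct exp:", direct)

# Log-sum-exp trick: subtract the max before exponentiating, then normalize.
shifted = log_scores - log_scores.max()
probs = np.exp(shifted) / np.exp(shifted).sum()
print("Normalized:", np.round(probs, 4))
```

Subtracting the maximum leaves the ratios between classes unchanged, so the normalized values are the same ones the direct computation would give if it did not underflow.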

***

## Types of Naive Bayes

### 1. Gaussian Naive Bayes

For **continuous features** (assumes normal distribution):

```python theme={null}
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

print(f"Gaussian NB Accuracy: {gnb.score(X_test, y_test):.2%}")
print(f"Class priors: {gnb.class_prior_}")
```

### 2. Multinomial Naive Bayes

For **count data** (word frequencies, document classification):

```python theme={null}
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Sample text classification
documents = [
    "I love this movie, it's amazing!",
    "Great film, highly recommend",
    "Terrible movie, waste of time",
    "Awful, boring, don't watch",
    "Best movie I've ever seen",
    "Horrible acting, bad plot"
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Convert text to word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Train Multinomial NB
mnb = MultinomialNB()
mnb.fit(X, labels)

# Predict on new text
new_reviews = ["This movie is great!", "I hated this film"]
X_new = vectorizer.transform(new_reviews)
predictions = mnb.predict(X_new)
probabilities = mnb.predict_proba(X_new)

for review, pred, prob in zip(new_reviews, predictions, probabilities):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"'{review}' -> {sentiment} (confidence: {max(prob):.2%})")
```

### 3. Bernoulli Naive Bayes

For **binary features** (word presence/absence):

```python theme={null}
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Reuses `documents` and `labels` from the Multinomial example above
# Use binary features (word present or not)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(documents)

bnb = BernoulliNB()
bnb.fit(X, labels)

print(f"Bernoulli NB Accuracy: {bnb.score(X, labels):.2%}")
```

***

## Real Example: Spam Classification

```python theme={null}
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Simulated email dataset
emails = [
    "Free money, click here now!",
    "Meeting tomorrow at 3pm",
    "Congratulations! You won a prize",
    "Can you review this document?",
    "Limited offer, buy now",
    "Project deadline is Friday",
    "Earn $1000 per day from home",
    "Lunch meeting at noon",
    "Act now, special discount",
    "Team standup in 10 minutes",
    "You're a winner! Claim prize",
    "Budget report attached",
    "Free gift card, click link",
    "Quarterly review scheduled",
    "Urgent: Verify your account",
    "Happy birthday from the team"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=spam, 0=ham

# Create TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(emails)

# Split
# stratify keeps both classes in the tiny test set, which the report below needs
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

# Train
spam_classifier = MultinomialNB()
spam_classifier.fit(X_train, y_train)

# Evaluate
y_pred = spam_classifier.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

# Show most indicative words
feature_names = vectorizer.get_feature_names_out()
spam_log_prob = spam_classifier.feature_log_prob_[1]
ham_log_prob = spam_classifier.feature_log_prob_[0]

# Words most indicative of spam
spam_indicators = spam_log_prob - ham_log_prob
top_spam_words = [feature_names[i] for i in np.argsort(spam_indicators)[-5:]]
print(f"\nTop spam words: {top_spam_words}")
```

***

## When Naive Bayes Shines

### 1. Text Classification

```python theme={null}
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Load 20 newsgroups dataset (subset)
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.misc']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

# Create pipeline
model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_features=5000),
    MultinomialNB()
)

# Train and evaluate
model.fit(newsgroups_train.data, newsgroups_train.target)
accuracy = model.score(newsgroups_test.data, newsgroups_test.target)
print(f"20 Newsgroups Accuracy: {accuracy:.2%}")
```

### 2. Fast Baseline Model

```python theme={null}
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Large dataset
X, y = make_classification(n_samples=100000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

models = {
    'Naive Bayes': GaussianNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100)
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    accuracy = model.score(X_test, y_test)
    print(f"{name:22s}: Accuracy={accuracy:.2%}, Time={train_time:.2f}s")
```

***

## Laplace Smoothing

What if a word never appeared in training for a class?

```
P("cryptocurrency" | spam) = 0 / 100 = 0
```

Then the entire product becomes 0, regardless of other evidence!

**Solution**: Add a small count to everything (Laplace/additive smoothing). Think of it as giving every word a "benefit of the doubt" -- we pretend we've seen each word at least once in each class, even if we haven't. This prevents any single unseen word from vetoing the entire classification.

```python theme={null}
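from sklearn.naive_bayes import MultinomialNB

# Assumes X_train, X_test, y_train, y_test come from the spam example above
# (MultinomialNB needs non-negative features such as counts or TF-IDF).

# alpha controls smoothing strength
# alpha=1 is Laplace smoothing (pretend each word appeared once extra)
# alpha<1 is Lidstone smoothing (less aggressive)
# alpha=0 means no smoothing (dangerous -- one unseen word kills everything)
mnb = MultinomialNB(alpha=1.0)  # Default -- a safe starting point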

# Try different smoothing values
for alpha in [0.01, 0.1, 1.0, 10.0]:
    mnb = MultinomialNB(alpha=alpha)
    mnb.fit(X_train, y_train)
    print(f"alpha={alpha}: Accuracy={mnb.score(X_test, y_test):.2%}")
```
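
To see what `alpha` does to the estimates themselves, the smoothed word probabilities can be computed by hand and compared with what scikit-learn stores. The count matrix below is a made-up toy example in which one word never appears in class 1.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: rows are documents, columns are 3 vocabulary words.
# The third word never appears in class 1.
X = np.array([[2, 1, 1],
              [1, 0, 2],
              [3, 1, 0],
              [2, 2, 0]])
y = np.array([0, 0, 1, 1])

alpha = 1.0
counts = X[y == 1].sum(axis=0)   # per-word totals in class 1: [5, 3, 0]
vocab_size = X.shape[1]

unsmoothed = counts / counts.sum()
smoothed = (counts + alpha) / (counts.sum() + alpha * vocab_size)

print("Unsmoothed:", unsmoothed)  # third word gets probability 0
print("Smoothed:  ", smoothed)    # third word gets 1/11 instead

# Cross-check against scikit-learn's fitted parameters
mnb = MultinomialNB(alpha=alpha).fit(X, y)
print("Matches sklearn:", np.allclose(np.exp(mnb.feature_log_prob_[1]), smoothed))
```

Adding `alpha` to every count and `alpha * vocab_size` to the total keeps the probabilities summing to 1 while ensuring none of them is exactly zero.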

***

## Naive Bayes vs Other Algorithms

| Aspect                      | Naive Bayes | Logistic Regression | Random Forest |
| --------------------------- | ----------- | ------------------- | ------------- |
| **Speed**                   | Very Fast   | Fast                | Slow          |
| **Training data needed**    | Little      | Moderate            | Lots          |
| **Handles text**            | Excellent   | Good                | Poor          |
| **Feature independence**    | Assumed     | Not assumed         | Not assumed   |
| **Interpretability**        | Good        | Good                | Poor          |
| **Probability calibration** | Often poor  | Good                | Moderate      |

***

## Probability Calibration

Naive Bayes probabilities are often overconfident. Because correlated features are counted as if each one were independent evidence, the model tends to push probabilities toward 0 and 1. It might say "99.8% spam" when the true probability is 75%. This usually doesn't hurt classification accuracy (the top-scoring class is often still the right one), but it's a problem if you need reliable probability estimates -- for example, when estimating expected costs or making decisions with different cost thresholds.

```python theme={null}
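from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
import matplotlib.pyplot as plt

# Assumes a binary X_train, X_test, y_train, y_test with continuous features,
# such as the make_classification split from the baseline example.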

# Uncalibrated
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Calibrated
gnb_calibrated = CalibratedClassifierCV(GaussianNB(), cv=5)
gnb_calibrated.fit(X_train, y_train)

# Compare probabilities
probs_uncal = gnb.predict_proba(X_test)[:, 1]
probs_cal = gnb_calibrated.predict_proba(X_test)[:, 1]

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.hist(probs_uncal, bins=20, edgecolor='black')
plt.title('Uncalibrated Probabilities')
plt.xlabel('Probability')

plt.subplot(1, 2, 2)
plt.hist(probs_cal, bins=20, edgecolor='black')
plt.title('Calibrated Probabilities')
plt.xlabel('Probability')

plt.tight_layout()
plt.show()
```
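
Histograms show how the probabilities are distributed, but not whether they are accurate. One way to put a number on calibration is the Brier score (mean squared error between predicted probability and the 0/1 outcome, lower is better). The sketch below is self-contained; the synthetic dataset with redundant features is an assumption chosen to provoke overconfidence, and the scores will differ on other data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant (correlated) features, an assumption made
# here so that the independence assumption is clearly violated.
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=2, n_redundant=10,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

gnb = GaussianNB().fit(X_train, y_train)
gnb_calibrated = CalibratedClassifierCV(GaussianNB(), cv=5).fit(X_train, y_train)

# Brier score for each model on the held-out set
brier = {}
for name, model in [("Uncalibrated", gnb), ("Calibrated", gnb_calibrated)]:
    probs = model.predict_proba(X_test)[:, 1]
    brier[name] = brier_score_loss(y_test, probs)
    print(f"{name:12s} Brier score: {brier[name]:.4f}")
```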

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Probability-Based" icon="percent">
    Predicts class probabilities using Bayes' theorem
  </Card>

  <Card title="Independence Assumption" icon="link-slash">
    Assumes features are independent given the class (often wrong, still works!)
  </Card>

  <Card title="Fast & Simple" icon="bolt">
    Trains in a single pass over the data, great for baselines
  </Card>

  <Card title="Text Champion" icon="file-lines">
    Excels at document classification and spam filtering
  </Card>
</CardGroup>

***

## What's Next?

Now let's learn about ensemble methods - combining multiple models for better predictions!

<Card title="Continue to Module 6: Ensemble Methods" icon="arrow-right" href="/courses/ml-mastery/06-ensemble-methods">
  The wisdom of crowds - Random Forests and Gradient Boosting
</Card>
