# Naive Bayes

> Simple probabilistic classification that's surprisingly powerful

*Figure: Naive Bayes Probability Flow*

## The Probability Perspective

Most algorithms we've seen ask: *"Which side of the boundary is this point on?"*

Naive Bayes asks: *"Given the evidence, what's the probability of each class?"*

***

## The Doctor's Diagnosis Problem

A patient walks in with symptoms:

* Fever: Yes
* Cough: Yes
* Fatigue: Yes

The doctor thinks: *"Based on these symptoms, how likely is it they have the flu vs a cold?"*

This is **Bayesian reasoning** - updating beliefs based on evidence.

*Figure: Spam Detection with Naive Bayes*

***

## Bayes' Theorem

$$
P(Disease|Symptoms) = \frac{P(Symptoms|Disease) \times P(Disease)}{P(Symptoms)}
$$

In English:

* **P(Disease|Symptoms)**: Probability of disease given symptoms (what we want)
* **P(Symptoms|Disease)**: How likely these symptoms are if you have the disease
* **P(Disease)**: How common the disease is (prior probability)
* **P(Symptoms)**: How common these symptoms are overall

**Math Connection**: This is Bayes' Theorem from probability theory. See [Probability](/courses/statistics-for-ml/03-probability) for the full derivation.

***

## Why "Naive"?

The "naive" assumption: **All features are independent given the class.**

For our flu example:

* P(Fever AND Cough AND Fatigue | Flu)
* Approximately equals P(Fever|Flu) x P(Cough|Flu) x P(Fatigue|Flu)

**Is this realistic?** No! Symptoms often correlate -- fever and fatigue almost always appear together. If we were computing the true joint probability, we'd need to account for all these correlations.

**Does it work anyway?** Surprisingly well, yes! Here's why: Naive Bayes only needs to get the *ranking* of class probabilities right, not the exact values. Even if the absolute probabilities are wildly off (e.g., 99.9% instead of 70%), as long as the winning class is correct, the classification is correct. The independence assumption distorts magnitudes but often preserves the ordering. It's like a biased thermometer that always reads 10 degrees too high -- useless for absolute temperature, but perfectly fine for telling you which room is hottest.

***

## Building Naive Bayes From Scratch

```python
import numpy as np


class SimpleNaiveBayes:
    """Gaussian Naive Bayes classifier from scratch."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.n_classes = len(self.classes)
        self.n_features = X.shape[1]

        # Calculate prior probabilities P(class).
        # If 60% of emails are spam, P(spam) = 0.6.
        # This is our "starting belief" before looking at any features.
        self.priors = {}
        for c in self.classes:
            self.priors[c] = np.mean(y == c)

        # Calculate likelihoods P(feature|class) for each feature.
        # We assume each feature follows a Gaussian (bell curve) distribution
        # within each class. So we just need the mean and std per class.
        # This is why it's called "Gaussian" Naive Bayes.
        self.means = {}
        self.stds = {}
        for c in self.classes:
            X_c = X[y == c]  # All samples belonging to class c
            self.means[c] = X_c.mean(axis=0)
            self.stds[c] = X_c.std(axis=0) + 1e-10  # Small epsilon prevents division by zero

    def _gaussian_probability(self, x, mean, std):
        """Evaluate the Gaussian density at x (a density, so it can exceed 1)."""
        exponent = np.exp(-((x - mean) ** 2) / (2 * std ** 2))
        return (1 / (np.sqrt(2 * np.pi) * std)) * exponent

    def predict_proba(self, X):
        """Return an unnormalized log-posterior score per class for each sample.

        The scores are log P(class) + sum of log P(feature|class). They do not
        sum to 1; predict() only needs their ranking.
        """
        probabilities = []
        for x in X:
            class_probs = {}
            for c in self.classes:
                # Start with the log prior
                prob = np.log(self.priors[c])
                # Add the log-likelihood of each feature
                # (adding logs = multiplying probabilities, without underflow)
                for i in range(self.n_features):
                    likelihood = self._gaussian_probability(
                        x[i], self.means[c][i], self.stds[c][i]
                    )
                    prob += np.log(likelihood + 1e-10)
                class_probs[c] = prob
            probabilities.append(class_probs)
        return probabilities

    def predict(self, X):
        """Predict the class with the highest score."""
        probabilities = self.predict_proba(X)
        predictions = []
        for prob in probabilities:
            predicted_class = max(prob, key=prob.get)
            predictions.append(predicted_class)
        return np.array(predictions)


# Test on iris data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

nb = SimpleNaiveBayes()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)
accuracy = np.mean(predictions == y_test)
print(f"Our Naive Bayes accuracy: {accuracy:.2%}")
```

***

## Types of Naive Bayes

### 1. Gaussian Naive Bayes

For **continuous features** (assumes normal distribution):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print(f"Gaussian NB Accuracy: {gnb.score(X_test, y_test):.2%}")
print(f"Class priors: {gnb.class_prior_}")
```

### 2. Multinomial Naive Bayes

For **count data** (word frequencies, document classification):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Sample text classification
documents = [
    "I love this movie, it's amazing!",
    "Great film, highly recommend",
    "Terrible movie, waste of time",
    "Awful, boring, don't watch",
    "Best movie I've ever seen",
    "Horrible acting, bad plot"
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Convert text to word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Train Multinomial NB
mnb = MultinomialNB()
mnb.fit(X, labels)

# Predict on new text
new_reviews = ["This movie is great!", "I hated this film"]
X_new = vectorizer.transform(new_reviews)
predictions = mnb.predict(X_new)
probabilities = mnb.predict_proba(X_new)

for review, pred, prob in zip(new_reviews, predictions, probabilities):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"'{review}' -> {sentiment} (confidence: {max(prob):.2%})")
```

### 3. Bernoulli Naive Bayes

For **binary features** (word presence/absence):

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Reuses `documents` and `labels` from the Multinomial example above.
# Use binary features (word present or not)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(documents)

bnb = BernoulliNB()
bnb.fit(X, labels)

# Scored on the training documents themselves, so this is training accuracy
print(f"Bernoulli NB training accuracy: {bnb.score(X, labels):.2%}")
```

***

## Real Example: Spam Classification

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Simulated email dataset
emails = [
    "Free money, click here now!",
    "Meeting tomorrow at 3pm",
    "Congratulations! You won a prize",
    "Can you review this document?",
    "Limited offer, buy now",
    "Project deadline is Friday",
    "Earn $1000 per day from home",
    "Lunch meeting at noon",
    "Act now, special discount",
    "Team standup in 10 minutes",
    "You're a winner! Claim prize",
    "Budget report attached",
    "Free gift card, click link",
    "Quarterly review scheduled",
    "Urgent: Verify your account",
    "Happy birthday from the team"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=spam, 0=ham

# Split the raw text first; stratify so both classes appear in the tiny test set
emails_train, emails_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, random_state=42, stratify=labels
)

# Create TF-IDF features, fitting the vocabulary on the training emails only
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(emails_train)
X_test = vectorizer.transform(emails_test)

# Train
spam_classifier = MultinomialNB()
spam_classifier.fit(X_train, y_train)

# Evaluate
y_pred = spam_classifier.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

# Show most indicative words
feature_names = vectorizer.get_feature_names_out()
spam_log_prob = spam_classifier.feature_log_prob_[1]
ham_log_prob = spam_classifier.feature_log_prob_[0]

# Words most indicative of spam: largest log P(word|spam) - log P(word|ham)
spam_indicators = spam_log_prob - ham_log_prob
top_spam_words = [feature_names[i] for i in np.argsort(spam_indicators)[-5:]]
print(f"\nTop spam words: {top_spam_words}")
```

***

## When Naive Bayes Shines

### 1. Text Classification

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Load 20 newsgroups dataset (subset); downloaded on first use
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.misc']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

# Create pipeline
model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_features=5000),
    MultinomialNB()
)

# Train and evaluate
model.fit(newsgroups_train.data, newsgroups_train.target)
accuracy = model.score(newsgroups_test.data, newsgroups_test.target)
print(f"20 Newsgroups Accuracy: {accuracy:.2%}")
```

### 2. Fast Baseline Model

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Large dataset
X, y = make_classification(n_samples=100000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    'Naive Bayes': GaussianNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100)
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    accuracy = model.score(X_test, y_test)
    print(f"{name:22s}: Accuracy={accuracy:.2%}, Time={train_time:.2f}s")
```

***

## Laplace Smoothing

What if a word never appeared in training for a class?

```
P("cryptocurrency" | spam) = 0 / 100 = 0
```

Then the entire product becomes 0, regardless of other evidence!

**Solution**: Add a small count to everything (Laplace/additive smoothing).
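
To make the arithmetic concrete, here is a minimal sketch of additive smoothing, assuming the usual estimate `(count + alpha) / (total + alpha * vocab_size)`. The helper name `smoothed_word_prob`, the 100 spam words, and the 50-word vocabulary are hypothetical values chosen for illustration, not numbers from a real dataset.

```python
def smoothed_word_prob(word_count, total_count, vocab_size, alpha=1.0):
    """Additive-smoothing estimate of P(word | class).

    word_count:  times the word appeared in documents of this class
    total_count: total number of words seen in documents of this class
    vocab_size:  number of distinct words in the vocabulary
    alpha:       smoothing strength (pseudo-count added to every word)
    """
    return (word_count + alpha) / (total_count + alpha * vocab_size)


# Hypothetical numbers: "cryptocurrency" never appeared among 100 spam words,
# and the vocabulary holds 50 distinct words.
unsmoothed = 0 / 100
laplace = smoothed_word_prob(0, 100, 50, alpha=1.0)   # (0 + 1) / (100 + 50)
lidstone = smoothed_word_prob(0, 100, 50, alpha=0.1)  # (0 + 0.1) / (100 + 5)

print(f"No smoothing: {unsmoothed:.4f}")
print(f"alpha=1.0:    {laplace:.4f}")
print(f"alpha=0.1:    {lidstone:.4f}")
```

The smoothed estimates are small but nonzero, so an unseen word lowers the class score without wiping out the other evidence.
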
Think of it as giving every word a "benefit of the doubt" -- we pretend we've seen each word at least once in each class, even if we haven't. This prevents any single unseen word from vetoing the entire classification.

```python
from sklearn.naive_bayes import MultinomialNB

# Assumes X_train, X_test, y_train, y_test are the non-negative TF-IDF
# features from the spam example (MultinomialNB rejects negative values).

# alpha controls smoothing strength
# alpha=1 is Laplace smoothing (pretend each word appeared once extra)
# 0 < alpha < 1 is Lidstone smoothing (less aggressive)
# alpha=0 means no smoothing (dangerous -- one unseen word kills everything)
mnb = MultinomialNB(alpha=1.0)  # Default -- a safe starting point

# Try different smoothing values
for alpha in [0.01, 0.1, 1.0, 10.0]:
    mnb = MultinomialNB(alpha=alpha)
    mnb.fit(X_train, y_train)
    print(f"alpha={alpha}: Accuracy={mnb.score(X_test, y_test):.2%}")
```

***

## Naive Bayes vs Other Algorithms

| Aspect                      | Naive Bayes | Logistic Regression | Random Forest |
| --------------------------- | ----------- | ------------------- | ------------- |
| **Speed**                   | Very Fast   | Fast                | Slow          |
| **Training data needed**    | Little      | Moderate            | Lots          |
| **Handles text**            | Excellent   | Good                | Poor          |
| **Feature independence**    | Assumed     | Not required        | Not required  |
| **Interpretability**        | Good        | Good                | Poor          |
| **Probability calibration** | Often poor  | Good                | Moderate      |

***

## Probability Calibration

Naive Bayes probabilities are often overconfident. Because the independence assumption rarely holds, correlated features get counted as separate pieces of evidence, and the model tends to push probabilities toward 0 and 1. It might say "99.8% spam" when the true probability is 75%. This usually doesn't hurt classification accuracy (it still picks the right class), but it's a problem if you need reliable probability estimates -- for example, when ranking items by risk or making decisions with different cost thresholds.
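
One way to see where the overconfidence comes from is to hand the model the same feature several times. The sketch below uses synthetic data (the sample size, class overlap, and number of copies are arbitrary assumptions) and compares `GaussianNB` on one feature against `GaussianNB` on five identical copies of it. The copies add no information, yet Naive Bayes counts each one as independent evidence.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic data: one informative feature with heavily overlapping classes.
n = 2000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=y.astype(float), scale=1.5, size=n).reshape(-1, 1)

# Copy the same column five times: perfectly correlated, zero new information.
x_dup = np.repeat(x, 5, axis=1)

p_single = GaussianNB().fit(x, y).predict_proba(x)[:, 1]
p_dup = GaussianNB().fit(x_dup, y).predict_proba(x_dup)[:, 1]


def mean_confidence(p):
    """Average probability the model assigns to its own predicted class."""
    return np.mean(np.maximum(p, 1 - p))


agreement = np.mean((p_single > 0.5) == (p_dup > 0.5))

print(f"Mean confidence, 1 feature:         {mean_confidence(p_single):.3f}")
print(f"Mean confidence, 5 duplicated cols: {mean_confidence(p_dup):.3f}")
print(f"Fraction of identical predictions:  {agreement:.3f}")
```

The predicted labels barely change, but the reported confidence rises, which is the pattern described above: the ranking is mostly preserved while the magnitudes are distorted.
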
scikit-learn's `CalibratedClassifierCV` can reduce the overconfidence by fitting a calibration map with cross-validation:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
import matplotlib.pyplot as plt

# Assumes X_train, X_test, y_train, y_test hold a binary classification split
# with continuous features, such as the make_classification data from the
# Fast Baseline Model example.

# Uncalibrated
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Calibrated
gnb_calibrated = CalibratedClassifierCV(GaussianNB(), cv=5)
gnb_calibrated.fit(X_train, y_train)

# Compare predicted probabilities for the positive class
probs_uncal = gnb.predict_proba(X_test)[:, 1]
probs_cal = gnb_calibrated.predict_proba(X_test)[:, 1]

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.hist(probs_uncal, bins=20, edgecolor='black')
plt.title('Uncalibrated Probabilities')
plt.xlabel('Probability')

plt.subplot(1, 2, 2)
plt.hist(probs_cal, bins=20, edgecolor='black')
plt.title('Calibrated Probabilities')
plt.xlabel('Probability')

plt.tight_layout()
plt.show()
```

***

## Key Takeaways

* Predicts class probabilities using Bayes' theorem
* Assumes features are independent (often wrong, still works!)
* Trains almost instantly, great for baselines
* Excels at document classification and spam filtering

***

## What's Next?

Now let's learn about ensemble methods - combining multiple models for better predictions!

The wisdom of crowds - Random Forests and Gradient Boosting