Most algorithms we’ve seen ask: “Which side of the boundary is this point on?” Naive Bayes asks: “Given the evidence, what’s the probability of each class?”

The doctor thinks: “Based on these symptoms, how likely is it that they have the flu versus a cold?” This is Bayesian reasoning: updating beliefs based on evidence.

The “naive” assumption: all features are independent given the class. For our flu example:
P(Fever AND Cough AND Fatigue | Flu)
≈ P(Fever | Flu) × P(Cough | Flu) × P(Fatigue | Flu)
Is this realistic? No! Symptoms often correlate: fever and fatigue almost always appear together. If we were computing the true joint probability, we’d need to account for all these correlations.

Does it work anyway? Surprisingly well, yes! Here’s why: Naive Bayes only needs to get the ranking of class probabilities right, not the exact values. Even if the absolute probabilities are wildly off (e.g., 99.9% instead of 70%), as long as the winning class is correct, the classification is correct. The independence assumption distorts magnitudes but often preserves the ordering.

It’s like a biased thermometer that always reads 10 degrees too high: useless for absolute temperature, but perfectly fine for telling you which room is hottest.
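To make the factorization concrete, here is a minimal sketch of the flu-vs-cold calculation. All the numbers (the priors and the per-symptom likelihoods) are invented for illustration, and `naive_posteriors` is a hypothetical helper, not part of any library.

```python
# Hypothetical numbers, chosen only for illustration.
priors = {"flu": 0.3, "cold": 0.7}
likelihoods = {
    "flu": {"fever": 0.9, "cough": 0.8, "fatigue": 0.9},
    "cold": {"fever": 0.2, "cough": 0.7, "fatigue": 0.3},
}
observed = ["fever", "cough", "fatigue"]


def naive_posteriors(priors, likelihoods, observed):
    """Return P(class | observed symptoms) under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        # Naive step: multiply the per-symptom likelihoods as if independent.
        for symptom in observed:
            score *= likelihoods[cls][symptom]
        scores[cls] = score
    # Normalize so the class scores sum to 1.
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}


posteriors = naive_posteriors(priors, likelihoods, observed)
best = max(posteriors, key=posteriors.get)
print(posteriors, best)
```

With these made-up values, flu wins even though its prior is lower, because all three symptoms are much more likely under flu than under a cold.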
```python
import numpy as np


class SimpleNaiveBayes:
    """Gaussian Naive Bayes classifier from scratch."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.n_classes = len(self.classes)
        self.n_features = X.shape[1]

        # Calculate prior probabilities P(class).
        # If 60% of emails are spam, P(spam) = 0.6.
        # This is our "starting belief" before looking at any features.
        self.priors = {}
        for c in self.classes:
            self.priors[c] = np.mean(y == c)

        # Calculate likelihoods P(feature|class) for each feature.
        # We assume each feature follows a Gaussian (bell curve) distribution
        # within each class. So we just need the mean and std per class.
        # This is why it's called "Gaussian" Naive Bayes.
        self.means = {}
        self.stds = {}
        for c in self.classes:
            X_c = X[y == c]  # All samples belonging to class c
            self.means[c] = X_c.mean(axis=0)
            self.stds[c] = X_c.std(axis=0) + 1e-10  # Small epsilon prevents division by zero
        return self

    def _gaussian_probability(self, x, mean, std):
        """Evaluate the Gaussian probability density at x."""
        exponent = np.exp(-((x - mean) ** 2) / (2 * std ** 2))
        return (1 / (np.sqrt(2 * np.pi) * std)) * exponent

    def predict_proba(self, X):
        """Predict the posterior probability of each class for each sample."""
        probabilities = []
        for x in X:
            log_scores = {}
            for c in self.classes:
                # Start with the log prior
                log_score = np.log(self.priors[c])
                # Add the log-likelihood of each feature. Summing logs
                # (instead of multiplying probabilities) avoids underflow.
                for i in range(self.n_features):
                    likelihood = self._gaussian_probability(x[i], self.means[c][i], self.stds[c][i])
                    log_score += np.log(likelihood + 1e-10)
                log_scores[c] = log_score
            # Turn log scores into probabilities: shift by the max for
            # numerical stability, exponentiate, then normalize to sum to 1.
            max_log = max(log_scores.values())
            unnormalized = {c: np.exp(s - max_log) for c, s in log_scores.items()}
            total = sum(unnormalized.values())
            probabilities.append({c: p / total for c, p in unnormalized.items()})
        return probabilities

    def predict(self, X):
        """Predict class with highest probability."""
        probabilities = self.predict_proba(X)
        predictions = []
        for prob in probabilities:
            predicted_class = max(prob, key=prob.get)
            predictions.append(predicted_class)
        return np.array(predictions)


# Test on iris data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

nb = SimpleNaiveBayes()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)
accuracy = np.mean(predictions == y_test)
print(f"Our Naive Bayes accuracy: {accuracy:.2%}")
```
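The classifier above adds log-probabilities rather than multiplying raw probabilities. A small sketch of why that matters, using a made-up setup of 100 features that each contribute a likelihood of 1e-5:

```python
import math

# Hypothetical setup: 100 features, each with a per-feature likelihood of 1e-5.
likelihoods = [1e-5] * 100

# Multiplying directly: the true value (1e-500) is far below the smallest
# positive float, so the running product collapses to exactly 0.0.
product = 1.0
for p in likelihoods:
    product *= p

# Summing logs instead keeps the score finite and comparable across classes.
log_sum = sum(math.log(p) for p in likelihoods)

print(product)
print(log_sum)
```

Once every class score has underflowed to 0.0 there is nothing left to compare, whereas the log scores can still be ranked.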
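```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Assumes `documents` (a list of strings) and `labels` are already defined.

# Use binary features (word present or not)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(documents)

bnb = BernoulliNB()
bnb.fit(X, labels)

# Note: this is accuracy on the training data itself, not on held-out data
print(f"Bernoulli NB Accuracy: {bnb.score(X, labels):.2%}")
```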
What if a word never appeared in training for a class?
P("cryptocurrency" | spam) = 0 / 100 = 0
Then the entire product becomes 0, regardless of other evidence!

Solution: add a small count to everything (Laplace/additive smoothing). Think of it as giving every word the “benefit of the doubt”: we pretend we’ve seen each word one extra time in each class, even if we haven’t. This prevents any single unseen word from vetoing the entire classification.
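As a rough sketch of the arithmetic, additive smoothing replaces count / total with (count + alpha) / (total + alpha × vocabulary size). The helper name `smoothed_likelihood` and the vocabulary size of 50 are assumptions made for illustration, and the 100 from the example above is read here as the total number of word occurrences seen in spam.

```python
def smoothed_likelihood(word_count, total_count, vocab_size, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)


# "cryptocurrency" appeared 0 times among 100 word occurrences in spam.
# vocab_size=50 is a made-up value for illustration.
unsmoothed = smoothed_likelihood(0, 100, vocab_size=50, alpha=0.0)
smoothed = smoothed_likelihood(0, 100, vocab_size=50, alpha=1.0)

print(unsmoothed)  # 0.0, which would zero out the whole product
print(smoothed)    # 1/150, small but non-zero
```

The denominator grows by alpha × vocabulary size so that the smoothed probabilities over the whole vocabulary still sum to 1.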
```python
from sklearn.naive_bayes import MultinomialNB

# alpha controls smoothing strength
# alpha=1 is Laplace smoothing (pretend each word appeared once extra)
# alpha<1 is Lidstone smoothing (less aggressive)
# alpha=0 means no smoothing (dangerous -- one unseen word kills everything)
mnb = MultinomialNB(alpha=1.0)  # Default -- a safe starting point

# Try different smoothing values
# (assumes X_train, X_test, y_train, y_test hold non-negative count features)
for alpha in [0.01, 0.1, 1.0, 10.0]:
    mnb = MultinomialNB(alpha=alpha)
    mnb.fit(X_train, y_train)
    print(f"alpha={alpha}: Accuracy={mnb.score(X_test, y_test):.2%}")
```
Naive Bayes probabilities are often overconfident. Because the independence assumption is wrong, correlated features get counted as separate pieces of evidence, and the model tends to push probabilities toward 0 and 1. It might say “99.8% spam” when the true probability is 75%. This usually doesn’t hurt classification accuracy (the model still picks the right class), but it’s a problem if you need reliable probability estimates, for example when ranking items by risk or making decisions with different cost thresholds.
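A minimal sketch of where the overconfidence comes from, under invented numbers: suppose one piece of evidence is three times as likely under spam as under ham, and the priors are equal. If the same evidence shows up as several perfectly correlated features, Naive Bayes multiplies it in once per copy. The function `posterior_spam` is a hypothetical helper for this illustration only.

```python
def posterior_spam(p_spam, p_ham, copies, prior_spam=0.5):
    """Naive Bayes posterior for spam when one piece of evidence is counted `copies` times."""
    spam_score = prior_spam * p_spam ** copies
    ham_score = (1 - prior_spam) * p_ham ** copies
    return spam_score / (spam_score + ham_score)


# Hypothetical likelihoods: P(evidence | spam) = 0.75, P(evidence | ham) = 0.25.
results = {copies: posterior_spam(0.75, 0.25, copies) for copies in (1, 2, 5)}
for copies, prob in results.items():
    print(f"copies={copies}: P(spam | evidence) = {prob:.4f}")
# copies=1: 0.7500  (the honest answer)
# copies=2: 0.9000
# copies=5: 0.9959  (same evidence, counted five times)
```

The winning class is spam in every case, so accuracy is unaffected, but the reported probability drifts far from the 75% that the evidence actually supports.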