Your model has 99% accuracy. Incredible, right? Wait. The dataset is 99% one class:
99% of emails are not spam
Model predicts “not spam” for everything
99% accuracy… but catches zero spam!
Think of it like a weather forecaster in the Sahara who predicts “no rain” every single day. They’d be right 99% of the time, and completely useless the 1% of the time it actually matters. Accuracy is a vanity metric when your classes are imbalanced, and in the real world, they almost always are.
This is why proper evaluation matters.
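The spam example above can be checked by hand. The sketch below uses plain Python and a made-up dataset of 1,000 emails (990 legitimate, 10 spam) to show how an always-"not spam" predictor scores on accuracy versus recall. The numbers are illustrative assumptions, not real data.

```python
# Toy dataset (invented): 990 legitimate emails (label 0), 10 spam (label 1).
y_true = [0] * 990 + [1] * 10

# A "model" that predicts "not spam" for every email.
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: of the emails that really are spam, what fraction did we catch?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"Accuracy:    {accuracy:.2%}")   # 99.00%
print(f"Spam recall: {recall:.2%}")     # 0.00%
```

Accuracy rewards the model for the 990 easy cases, while recall looks only at the 10 cases the filter exists to catch.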
Rule #1: Never evaluate on training data!
Evaluating on training data is like grading a student using the exact questions they practiced on. Of course they’ll ace it, but you have no idea if they actually understand the material. The test set is the “final exam” your model has never seen.
```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% for testing -- industry standard for medium datasets
    random_state=42,   # Reproducibility -- same split every run
    stratify=y         # Preserve class ratios -- critical for imbalanced data!
                       # Without stratify, your test set might randomly have 0% of the minority class
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

# Train
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on UNSEEN data
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"\nTraining accuracy: {train_acc:.2%}")
print(f"Testing accuracy: {test_acc:.2%}")
```
If training accuracy >> test accuracy: Your model is overfitting!
It memorized the training data instead of learning patterns.
Rules of thumb for the gap (training accuracy minus test accuracy):
Less than 5 percentage points: Normal and expected. Ship it.
5-15 percentage points: Mild overfitting. Try regularization or a simpler model.
More than 15 percentage points: Serious overfitting. Reduce model complexity, get more data, or add regularization (dropout, for neural networks).
Test noticeably higher than train: Something is probably wrong. Possible data leakage, a very lucky split, or a test set too small to trust. Investigate.
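The rules of thumb above can be written as a small helper. This is a minimal sketch: `diagnose_gap` is a hypothetical name, and the 5 and 15 point cutoffs are the heuristics from the list, not hard limits.

```python
def diagnose_gap(train_acc: float, test_acc: float) -> str:
    """Classify the train/test accuracy gap using the rules of thumb above.

    Both arguments are fractions in [0, 1], e.g. 0.97 for 97%.
    """
    gap = train_acc - test_acc
    if gap < 0:
        return "test > train: check for leakage or a lucky split"
    if gap < 0.05:
        return "normal"
    if gap <= 0.15:
        return "mild overfitting"
    return "serious overfitting"


# Example values (invented for illustration)
print(diagnose_gap(1.00, 0.96))  # normal
print(diagnose_gap(1.00, 0.90))  # mild overfitting
print(diagnose_gap(1.00, 0.80))  # serious overfitting
print(diagnose_gap(0.93, 0.95))  # test > train: check for leakage or a lucky split
```

A tiny negative gap is often just noise on a small test set, so treat the last branch as a prompt to look closer rather than proof of a bug.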
In k-fold cross-validation, the data is split into k folds and each fold takes one turn as the test set. Every sample gets to be in the test set exactly once!
The standard deviation of CV scores tells you how stable your model is. A model with 95% +/- 1% is much more trustworthy than one with 95% +/- 8%. High variance across folds often means your dataset is too small or your model is too sensitive to which specific examples it trains on.
```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
# Why 5 folds? It's a good trade-off between computational cost
# and reliable estimation. 10 folds gives slightly better estimates
# but takes twice as long. 3 folds is faster but noisier.
scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X, y,
    cv=5,
    scoring='accuracy'
)

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.4f} (+/- {scores.std():.4f})")
# The "+/-" is the standard deviation across folds.
# If it's > 5% of the mean, consider whether your data is too small
# or your model is too complex for the available data.
```
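To see why every sample is tested exactly once, it helps to build the folds by hand. The sketch below is a simplified, unshuffled, unstratified k-fold splitter in plain Python. `k_fold_indices` is a hypothetical helper written for illustration and is not how scikit-learn implements its splitters (for classifiers, `cross_val_score` with an integer `cv` uses stratified folds).

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_idx, test_idx) for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Count how many times each of 10 samples lands in a test fold.
test_counts = [0] * 10
for train_idx, test_idx in k_fold_indices(n_samples=10, k=5):
    for i in test_idx:
        test_counts[i] += 1
    print(f"train={train_idx} test={test_idx}")

print(test_counts)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Each sample also appears in the training set of the other k - 1 folds, which is why cross-validation uses the data more efficiently than a single split.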
By default, a predicted probability of 0.5 or higher is classified as positive. But you can adjust that threshold:
```python
from sklearn.metrics import precision_score, recall_score

# Get probabilities for the positive class (column 1)
y_prob = model.predict_proba(X_test)[:, 1]

# Different thresholds
for threshold in [0.3, 0.5, 0.7]:
    y_pred_thresh = (y_prob >= threshold).astype(int)
    precision = precision_score(y_test, y_pred_thresh)
    recall = recall_score(y_test, y_pred_thresh)
    print(f"Threshold {threshold}: Precision={precision:.3f}, Recall={recall:.3f}")
```
Trade-off — think of it like adjusting the sensitivity on a metal detector at an airport:
Lower threshold (more sensitive): Catches more threats but also beeps at belt buckles. More positive predictions, higher recall, lower precision.
Higher threshold (less sensitive): Only triggers on real weapons but might miss a hidden knife. Fewer positive predictions, lower recall, higher precision.
The right threshold depends on what’s more expensive: false alarms or missed catches. In cancer screening, you want a low threshold (catch everything). In email spam filtering, you want a higher threshold (don’t lose real mail).
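One way to make "what’s more expensive" concrete is to attach a cost to each error type and pick the threshold with the lowest total. The sketch below does this in plain Python. The labels, scores, and cost ratios are all invented for illustration, and `total_cost` and `best_threshold` are hypothetical helpers.

```python
# Hypothetical true labels and predicted probabilities (invented).
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]


def total_cost(threshold, cost_fp, cost_fn):
    """Total cost of errors when predicting positive for prob >= threshold."""
    fp = sum(p >= threshold and t == 0 for t, p in zip(y_true, y_prob))
    fn = sum(p < threshold and t == 1 for t, p in zip(y_true, y_prob))
    return fp * cost_fp + fn * cost_fn


def best_threshold(cost_fp, cost_fn, candidates=(0.3, 0.5, 0.7)):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda th: total_cost(th, cost_fp, cost_fn))


# Screening-style costs: a missed case is 10x worse than a false alarm.
print(best_threshold(cost_fp=1, cost_fn=10))  # 0.3

# Spam-style costs: losing a real email is 10x worse than letting spam through.
print(best_threshold(cost_fp=10, cost_fn=1))  # 0.7
```

The same scores lead to opposite threshold choices once the costs flip, which is the point of the metal detector analogy.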
AUC (Area Under Curve) — the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example:
1.0 = Perfect model (always ranks positives above negatives)
0.5 = Random guessing (coin flip)
> 0.9 = Excellent (production-ready for many applications)
> 0.8 = Good (worth deploying with monitoring)
> 0.7 = Fair (better than nothing, but investigate why it’s struggling)
< 0.5 = Your labels might be flipped, or the model is actively anti-predicting
Why AUC over accuracy? AUC doesn’t depend on a specific threshold, so it tells you about the model’s overall discriminative ability. Two models could have the same accuracy at threshold=0.5 but very different AUCs — the one with higher AUC has more “room to maneuver” when you adjust the threshold for business needs.
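The ranking definition of AUC can be computed directly by comparing every positive with every negative. The sketch below does that in plain Python, then applies it to two made-up models that have identical accuracy at threshold 0.5 but different AUCs. `pairwise_auc` and `accuracy_at` are hypothetical helpers and all scores are invented. The pairwise loop is quadratic, so for real data use a library routine such as scikit-learn's `roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at(y_true, y_score, threshold=0.5):
    """Accuracy when predicting positive for score >= threshold."""
    preds = [int(s >= threshold) for s in y_score]
    return sum(t == p for t, p in zip(y_true, preds)) / len(y_true)


y_true = [1, 1, 1, 0, 0, 0]
model_a = [0.90, 0.80, 0.40, 0.10, 0.20, 0.60]  # invented scores
model_b = [0.55, 0.52, 0.10, 0.45, 0.20, 0.60]  # invented scores

for name, scores in [("A", model_a), ("B", model_b)]:
    acc = accuracy_at(y_true, scores)
    auc = pairwise_auc(y_true, scores)
    print(f"Model {name}: accuracy@0.5={acc:.3f}, AUC={auc:.3f}")
# Model A: accuracy@0.5=0.667, AUC=0.889
# Model B: accuracy@0.5=0.667, AUC=0.444
```

Model A separates the classes well overall, so lowering its threshold to 0.3 lifts accuracy to 5 out of 6. Model B has no threshold that does better than 4 out of 6.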
Think of it like a cooking class where 95 students want to learn Italian but only 5 want to learn Thai. If you just teach to the majority, you’ll ignore Thai completely. Resampling either duplicates the Thai students (upsampling) or randomly removes some Italian students (downsampling) to give both groups fair representation.
```python
import numpy as np
from sklearn.utils import resample

# Separate classes.
# This snippet assumes class 0 is the majority and class 1 the minority.
# (In the breast cancer data loaded earlier it is the other way round:
# class 0 is the smaller class, so swap the labels there.)
X_majority = X_train[y_train == 0]
X_minority = X_train[y_train == 1]

# Upsample minority class -- duplicate minority examples until
# both classes have equal representation. The model sees each
# minority example multiple times, emphasizing those patterns.
X_minority_upsampled = resample(
    X_minority,
    replace=True,                # Sample WITH replacement (same point can appear twice)
    n_samples=len(X_majority),   # Match majority class size
    random_state=42
)

# Combine into balanced dataset
X_balanced = np.vstack([X_majority, X_minority_upsampled])
y_balanced = np.hstack([np.zeros(len(X_majority)), np.ones(len(X_minority_upsampled))])

# Caution: upsampling creates exact duplicates, which can cause overfitting
# on those specific examples. Consider SMOTE (Module 20) for synthetic samples.
# Resample the training set only, never the test set.
```
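Downsampling, the other half of the cooking-class analogy, goes the opposite way: keep every minority example and randomly drop majority examples. A minimal sketch using only the standard library, with invented "students" standing in for rows of data:

```python
import random

rng = random.Random(42)  # fixed seed so the selection is reproducible

# Invented data: 95 majority examples and 5 minority examples.
majority = [("italian", i) for i in range(95)]
minority = [("thai", i) for i in range(5)]

# Downsample: draw WITHOUT replacement, so no majority example repeats.
downsampled_majority = rng.sample(majority, k=len(minority))

balanced = downsampled_majority + minority
rng.shuffle(balanced)

n_italian = sum(label == "italian" for label, _ in balanced)
n_thai = sum(label == "thai" for label, _ in balanced)
print(f"italian={n_italian}, thai={n_thai}")  # italian=5, thai=5
```

The trade-off is visible in the numbers: the balanced set has only 10 examples, so 90 majority examples are thrown away. Downsampling tends to make sense when the majority class is large enough that you can afford the loss.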
```python
from sklearn.ensemble import RandomForestClassifier

# Automatically balance weights -- this tells the model to treat
# each minority sample as if it were worth MORE during training.
# With 'balanced', a class with 10x fewer samples gets 10x the weight.
# Mathematically: weight_i = n_samples / (n_classes * n_samples_for_class_i)
# This is the easiest fix and should be your first attempt.
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
```
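The weight formula in the comment above is easy to verify by hand. This sketch applies it in plain Python to an invented label list with a 10:1 imbalance. `balanced_class_weights` is a hypothetical helper that mirrors the formula, not scikit-learn's own code.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * n_samples_in_class_c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Invented labels: 1000 majority samples, 100 minority samples.
labels = [0] * 1000 + [1] * 100
weights = balanced_class_weights(labels)

print(weights)                   # {0: 0.55, 1: 5.5}
print(weights[1] / weights[0])   # approximately 10
```

The total weight of each class comes out equal (1000 x 0.55 = 100 x 5.5 = 550), so neither class dominates the training loss.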
Model selection tip for imbalanced data: Start with class_weight='balanced' on Logistic Regression or Random Forest before trying resampling techniques. It’s simpler, doesn’t create synthetic data, and often works just as well. Reserve SMOTE and other resampling for when class weights alone aren’t enough.
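```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Many real datasets have missing values.
# DON'T fit the imputer on data that includes the test set!
# That leaks test-set statistics (e.g. the median) into training.

# Illustrative stand-in for a dataset with gaps: blank out ~5% of X at random.
rng = np.random.default_rng(42)
X_with_missing = X.copy()
X_with_missing[rng.random(X.shape) < 0.05] = np.nan

# CORRECT: Imputation is part of the model pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# Cross-validation refits the imputer on each training fold only
cv_scores = cross_val_score(pipeline, X_with_missing, y, cv=5)
print(f"CV Score with proper imputation: {cv_scores.mean():.4f}")
```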
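To see what the pipeline protects against, compare a median fitted on the training data alone with one fitted on training and test data together. A minimal sketch in plain Python with invented numbers; `fit_median` and `impute` are hypothetical helpers.

```python
from statistics import median

# Invented single-feature data; None marks a missing value.
train = [1.0, 2.0, None, 4.0, 3.0]
test = [100.0, None, 200.0]


def fit_median(values):
    """Median of the observed (non-missing) values."""
    return median(v for v in values if v is not None)


def impute(values, fill):
    """Replace missing values with the given fill value."""
    return [fill if v is None else v for v in values]


# Correct: learn the fill value from the training data only.
train_median = fit_median(train)

# Leaky: the fill value is influenced by the test set.
leaky_median = fit_median(train + test)

print(train_median)                # 2.5
print(leaky_median)                # 3.5
print(impute(test, train_median))  # [100.0, 2.5, 200.0]
```

The leaky median has shifted toward the test values, so a model trained on data imputed with it has already seen a hint of the test distribution.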
```python
from sklearn.model_selection import TimeSeriesSplit

# BAD: Random split leaks future info into training
# X_train, X_test, y_train, y_test = train_test_split(X, y)  # WRONG for time series!

# GOOD: Time-aware split (assumes the rows of X are in chronological order)
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Test is always AFTER train in time
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```
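The idea behind a time-aware split is an expanding training window followed by a test block that comes strictly later. The sketch below builds such splits by hand. `expanding_window_splits` is a hypothetical, simplified helper for illustration and should not be read as scikit-learn's implementation. It assumes `n_samples` is at least `n_splits + 1`.

```python
def expanding_window_splits(n_samples: int, n_splits: int):
    """Yield (train_idx, test_idx) where each test block follows its train block."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(0, start))                 # everything before the test block
        test_idx = list(range(start, start + test_size))  # the next block in time
        yield train_idx, test_idx


splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for train_idx, test_idx in splits:
    print(f"train={train_idx} test={test_idx}")
# train=[0, 1, 2] test=[3, 4, 5]
# train=[0, 1, 2, 3, 4, 5] test=[6, 7, 8]
# train=[0, 1, 2, 3, 4, 5, 6, 7, 8] test=[9, 10, 11]
```

In every split the largest training index is smaller than the smallest test index, so the model never trains on observations from the future.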