Model Evaluation

The Hidden Trap

Your model has 99% accuracy. Incredible, right? Wait. The dataset has 99% of one class:
  • 99% emails are not spam
  • Model predicts “not spam” for everything
  • 99% accuracy… but catches zero spam!
Think of it like a weather forecaster in the Sahara who predicts “no rain” every single day. They’d be right 99% of the time — and completely useless the 1% of the time it actually matters. Accuracy is a vanity metric when your classes are imbalanced, and in the real world, they almost always are. This is why proper evaluation matters.
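The trap is easy to reproduce. The sketch below is illustrative only: it assumes a made-up label vector with 1% positives and uses scikit-learn's `DummyClassifier` as a stand-in for a model that always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (an assumption for illustration): 990 "not spam", 10 "spam"
y_demo = np.array([0] * 990 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant to this baseline

# A "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_demo_pred = baseline.predict(X_demo)

demo_accuracy = accuracy_score(y_demo, y_demo_pred)
demo_recall = recall_score(y_demo, y_demo_pred)
print(f"Accuracy:    {demo_accuracy:.2%}")  # 99.00%
print(f"Spam recall: {demo_recall:.2%}")    # 0.00%
```

A baseline like this is a useful sanity check: any real model should beat it on the metric you actually care about, not just on accuracy.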

The Train-Test Split

Rule #1: Never evaluate on training data! Evaluating on training data is like grading a student using the exact questions they practiced on. Of course they’ll ace it — but you have no idea if they actually understand the material. The test set is the “final exam” your model has never seen.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,     # 20% for testing -- a common default for medium-sized datasets
    random_state=42,    # Reproducibility -- same split every run
    stratify=y          # Preserve class ratios -- critical for imbalanced data!
    # Without stratify, your test set might randomly have 0% of the minority class
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples:  {len(X_test)}")

# Train
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on UNSEEN data
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"\nTraining accuracy: {train_acc:.2%}")
print(f"Testing accuracy:  {test_acc:.2%}")
If training accuracy >> test accuracy: Your model is overfitting! It memorized the training data instead of learning patterns. Rules of thumb for the gap:
  • less than 5%: Normal and expected. Ship it.
  • 5-15%: Mild overfitting. Try regularization or simpler model.
  • greater than 15%: Serious overfitting. Reduce model complexity, get more data, or add dropout/regularization.
  • Test higher than train: Something is wrong — possible data leakage or a very lucky split. Investigate.
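The rules of thumb above can be turned into a small helper. This is a sketch, not a standard API: `diagnose_gap` is a hypothetical function name, and it assumes the gap is measured in percentage points of accuracy (so 5% means 0.05).

```python
def diagnose_gap(train_acc: float, test_acc: float) -> str:
    """Map a train/test accuracy gap to the rough rules of thumb above."""
    gap = train_acc - test_acc
    if gap < 0:
        return "test > train: check for leakage or a lucky split"
    if gap < 0.05:
        return "normal gap"
    if gap <= 0.15:
        return "mild overfitting"
    return "serious overfitting"


print(diagnose_gap(1.00, 0.96))  # normal gap
print(diagnose_gap(0.99, 0.80))  # serious overfitting
```

Treat the cut-offs as starting points. On a small test set, a few percentage points of gap can come from sampling noise alone.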

Cross-Validation: More Reliable Evaluation

What if the test split was “lucky”? Use k-fold cross-validation:
Fold 1: [TEST] [train] [train] [train] [train]
Fold 2: [train] [TEST] [train] [train] [train]
Fold 3: [train] [train] [TEST] [train] [train]
Fold 4: [train] [train] [train] [TEST] [train]
Fold 5: [train] [train] [train] [train] [TEST]
Every sample gets to be in the test set exactly once! The standard deviation of CV scores tells you how stable your model is. A model with 95% +/- 1% is much more trustworthy than one with 95% +/- 8%. High variance across folds often means your dataset is too small or your model is too sensitive to which specific examples it trains on.
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
# Why 5 folds? It's a good trade-off between computational cost
# and reliable estimation. 10 folds gives slightly better estimates
# but takes twice as long. 3 folds is faster but noisier.
scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X, y,
    cv=5,
    scoring='accuracy'
)

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.4f} (+/- {scores.std():.4f})")
# The "+/-" is the standard deviation across folds.
# If it's > 5% of the mean, consider whether your data is too small
# or your model is too complex for the available data.
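To see what happens inside `cross_val_score`, the folds can be written out by hand. For classifiers, an integer `cv` makes scikit-learn use `StratifiedKFold`; the sketch below does the same explicitly. The `shuffle=True` setting, the smaller forest, and the variable names are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X_cv, y_cv = load_breast_cancer(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_counts = np.zeros(len(y_cv), dtype=int)  # how often each sample is tested
fold_scores = []

for fold, (train_idx, test_idx) in enumerate(skf.split(X_cv, y_cv), start=1):
    # A fresh model per fold, so nothing learned on one fold carries over
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(clf.score(X_cv[test_idx], y_cv[test_idx]))
    test_counts[test_idx] += 1
    print(f"Fold {fold}: {len(test_idx)} test samples, accuracy={fold_scores[-1]:.4f}")

print(f"Every sample tested exactly once: {bool(np.all(test_counts == 1))}")
print(f"Mean: {np.mean(fold_scores):.4f} (+/- {np.std(fold_scores):.4f})")
```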

Classification Metrics

The Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Train and predict
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Visual
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay(cm, display_labels=cancer.target_names).plot(ax=ax)
plt.title("Confusion Matrix")
plt.show()
                    Predicted
                    Neg     Pos
Actual  Negative   [TN      FP]    <- False Positive: "False alarm"
        Positive   [FN      TP]    <- False Negative: "Missed detection"

Precision, Recall, F1

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

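# Individual metrics
# Note: scikit-learn treats label 1 as the positive class by default.
# In this dataset, 1 = benign and 0 = malignant (see cancer.target_names).
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)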

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")

# Full report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))

Precision

Of predicted positives, how many are correct?
$\frac{TP}{TP + FP}$
“Don’t cry wolf”

Recall

Of actual positives, how many did we find?
$\frac{TP}{TP + FN}$
“Find them all”

F1 Score

Harmonic mean of precision and recall
$\frac{2 \cdot P \cdot R}{P + R}$
“Balance both”
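As a worked example, the three formulas can be evaluated directly from confusion-matrix counts. The labels below are made up for illustration, and the variable names are not part of the tutorial's running example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels, made up for illustration
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")   # TN=5, FP=1, FN=1, TP=3
print(f"Precision: {precision_manual:.2f}")    # 0.75
print(f"Recall:    {recall_manual:.2f}")       # 0.75
print(f"F1:        {f1_manual:.2f}")           # 0.75

# The hand-computed values agree with scikit-learn's helpers
print(np.isclose(precision_manual, precision_score(y_true_toy, y_pred_toy)))
print(np.isclose(recall_manual, recall_score(y_true_toy, y_pred_toy)))
print(np.isclose(f1_manual, f1_score(y_true_toy, y_pred_toy)))
```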

When to Use What?

| Scenario | Priority | Why |
| --- | --- | --- |
| Spam Filter | High Precision | Don’t want real emails in spam |
| Cancer Detection | High Recall | Don’t want to miss cancer cases |
| Search Engine | Precision@K | Top results must be relevant |
| Fraud Detection | High Recall | Don’t miss fraud |
| Recommendation | Precision | Show only relevant items |

Probability Thresholds

By default, we use 0.5 as the threshold. But you can adjust it:
# Get probabilities
y_prob = model.predict_proba(X_test)[:, 1]

# Different thresholds
for threshold in [0.3, 0.5, 0.7]:
    y_pred_thresh = (y_prob >= threshold).astype(int)
    precision = precision_score(y_test, y_pred_thresh)
    recall = recall_score(y_test, y_pred_thresh)
    print(f"Threshold {threshold}: Precision={precision:.3f}, Recall={recall:.3f}")
Trade-off — think of it like adjusting the sensitivity on a metal detector at an airport:
  • Lower threshold (more sensitive): Catches more threats but also beeps at belt buckles. More positive predictions, higher recall, lower precision.
  • Higher threshold (less sensitive): Only triggers on real weapons but might miss a hidden knife. Fewer positive predictions, lower recall, higher precision.
The right threshold depends on what’s more expensive: false alarms or missed catches. In cancer screening, you want a low threshold (catch everything). In email spam filtering, you want a higher threshold (don’t lose real mail).
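Rather than trying a few thresholds by hand, one option is to scan all of them with `precision_recall_curve` and keep the highest threshold that still meets a recall target. The sketch below assumes a hypothetical requirement of 95% recall and rebuilds the data and model so it runs on its own. In practice, pick the threshold on a validation split, not on the final test set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X_t, y_t = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_t, y_t, test_size=0.2, random_state=42, stratify=y_t
)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
val_prob = clf.predict_proba(X_val)[:, 1]

# precision/recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so the three arrays line up
precisions, recalls, thresholds_pr = precision_recall_curve(y_val, val_prob)
precisions, recalls = precisions[:-1], recalls[:-1]

target_recall = 0.95  # hypothetical business requirement
# Recall falls as the threshold rises, so the last qualifying index
# is the highest threshold that still meets the target
best_idx = np.where(recalls >= target_recall)[0][-1]
chosen_threshold = thresholds_pr[best_idx]

print(f"Threshold: {chosen_threshold:.3f}")
print(f"Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}")
```

Taking the highest qualifying threshold keeps precision as high as the recall target allows.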

ROC Curve and AUC

The ROC curve shows performance across all thresholds:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate ROC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', linewidth=2, label=f'ROC (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()
AUC (Area Under Curve) — the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example:
  • 1.0 = Perfect model (always ranks positives above negatives)
  • 0.5 = Random guessing (coin flip)
  • > 0.9 = Excellent (production-ready for many applications)
  • > 0.8 = Good (worth deploying with monitoring)
  • > 0.7 = Fair (better than nothing, but investigate why it’s struggling)
  • < 0.5 = Your labels might be flipped, or the model is actively anti-predicting
Why AUC over accuracy? AUC doesn’t depend on a specific threshold, so it tells you about the model’s overall discriminative ability. Two models could have the same accuracy at threshold=0.5 but very different AUCs — the one with higher AUC has more “room to maneuver” when you adjust the threshold for business needs.
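The ranking interpretation of AUC can be checked numerically: compare every positive score with every negative score and count how often the positive wins. The labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration
y_true_auc = np.array([0, 0, 0, 1, 1, 1])
scores_auc = np.array([0.1, 0.4, 0.6, 0.35, 0.8, 0.9])

pos = scores_auc[y_true_auc == 1]
neg = scores_auc[y_true_auc == 0]

# Compare every positive with every negative; ties count as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")                       # 0.7778
print(f"roc_auc_score:             {roc_auc_score(y_true_auc, scores_auc):.4f}")  # 0.7778
```

Seven of the nine positive/negative pairs are ranked correctly, which gives 7/9 ≈ 0.778, the same value `roc_auc_score` reports.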

Regression Metrics

For predicting numbers:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

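# Create regression data
# Separate names (X_reg, y_reg, ...) keep the classification variables
# from earlier sections intact for the examples that follow.
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42  # fixed seed for a reproducible split
)

# Train and predict
reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)
y_pred_reg = reg_model.predict(X_test_reg)

# Metrics
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)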

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"R2:   {r2:.4f}")

RMSE

Square root of the average squared error, in the same units as the target. More sensitive to large errors.

MAE

Average absolute error, in the same units as the target. More robust to outliers.

R2 Score

Share of variance explained. 1 = perfect fit, 0 = no better than always predicting the mean. It can go negative when the model does worse than that baseline.

MAPE

Average % error. Easy to interpret, but unstable when true values are at or near zero.
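MAPE is listed above but not computed in the earlier snippet. The sketch below uses made-up numbers with one large miss to show how the three error metrics react. It assumes scikit-learn 0.24 or newer, where `mean_absolute_percentage_error` is available.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

# Made-up values for illustration; the last prediction is a large miss
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 310.0, 300.0])

rmse_demo = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)
# scikit-learn returns MAPE as a fraction, so multiply by 100 for a percentage
mape_demo = mean_absolute_percentage_error(y_true_demo, y_pred_demo) * 100

print(f"RMSE: {rmse_demo:.2f}")   # 50.74
print(f"MAE:  {mae_demo:.2f}")    # 32.50
print(f"MAPE: {mape_demo:.2f}%")  # 10.83%
```

Three of the four errors are 10 units, yet the single 100-unit miss pulls RMSE well above MAE. That is the sensitivity to large errors described above.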

Handling Imbalanced Data

When one class dominates (99% vs 1%):

1. Use Appropriate Metrics

from sklearn.metrics import balanced_accuracy_score, f1_score

# Don't use accuracy!
balanced_acc = balanced_accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

2. Resample the Data

Think of it like a cooking class where 95 students want to learn Italian but only 5 want to learn Thai. If you just teach to the majority, you’ll ignore Thai completely. Resampling either duplicates the Thai students (upsampling) or randomly removes some Italian students (downsampling) to give both groups fair representation.
from sklearn.utils import resample

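# Separate classes
# This assumes class 0 is the majority and class 1 the minority.
# Check np.bincount(y_train) first and swap the labels if your data
# is the other way round.
X_majority = X_train[y_train == 0]
X_minority = X_train[y_train == 1]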

# Upsample minority class -- duplicate minority examples until
# both classes have equal representation. The model sees each
# minority example multiple times, emphasizing those patterns.
X_minority_upsampled = resample(
    X_minority,
    replace=True,       # Sample WITH replacement (same point can appear twice)
    n_samples=len(X_majority),  # Match majority class size
    random_state=42
)

# Combine into balanced dataset
X_balanced = np.vstack([X_majority, X_minority_upsampled])
y_balanced = np.hstack([np.zeros(len(X_majority)), np.ones(len(X_minority_upsampled))])
# Caution: upsampling creates exact duplicates, which can cause overfitting
# on those specific examples. Consider SMOTE (Module 20) for synthetic samples.

3. Use Class Weights

from sklearn.ensemble import RandomForestClassifier

# Automatically balance weights -- this tells the model to treat
# each minority sample as if it were worth MORE during training.
# With 'balanced', a class with 10x fewer samples gets 10x the weight.
# Mathematically: weight_i = n_samples / (n_classes * n_samples_for_class_i)
# This is the easiest fix and should be your first attempt.
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
Model selection tip for imbalanced data: Start with class_weight='balanced' on Logistic Regression or Random Forest before trying resampling techniques. It’s simpler, doesn’t create synthetic data, and often works just as well. Reserve SMOTE and other resampling for when class weights alone aren’t enough.
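The `'balanced'` weight formula quoted in the comments above can be verified against scikit-learn's `compute_class_weight` helper. The 90/10 label split is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels for illustration: 90 negatives, 10 positives
y_imb = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_imb)

# weight_i = n_samples / (n_classes * n_samples_for_class_i)
manual_weights = len(y_imb) / (len(classes) * np.bincount(y_imb))

sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_imb)

print(f"Manual:  {manual_weights}")
print(f"sklearn: {sk_weights}")
print(f"Minority/majority weight ratio: {manual_weights[1] / manual_weights[0]:.1f}")  # 9.0
```

With 9x fewer minority samples, each minority example carries 9x the weight, so both classes contribute equally to the training loss in total.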

Learning Curves: Diagnosing Problems

from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt

def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='accuracy',
        n_jobs=-1
    )
    
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', label='Training score')
    plt.plot(train_sizes, val_mean, 'o-', label='Validation score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1)
    plt.xlabel('Training Set Size')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

# Plot
plot_learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    "Learning Curve - Random Forest"
)
Diagnosing from learning curves — this is one of the most valuable debugging tools in ML:
| Pattern | Problem | What It Looks Like | Solution |
| --- | --- | --- | --- |
| High train, low val, big gap | Overfitting | Training score stays at ~99%, validation plateaus at ~75% | Simplify model (reduce depth/features), get more data, add regularization |
| Low train, low val, close together | Underfitting | Both scores hover around 65%, more data doesn’t help | Use a more complex model, engineer better features, reduce regularization |
| Both high and close | Good fit | Both scores at ~90% and converging as data increases | You’re done; ship it |
| Val score still rising at the right edge | Need more data | Gap is closing but hasn’t converged yet | Collect more training data; you’re on the right track |

Validation Curve: Tuning Hyperparameters

from sklearn.model_selection import validation_curve

# Vary max_depth
param_range = [1, 2, 3, 4, 5, 7, 10, 15, 20]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5,
    scoring='accuracy'
)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_scores.mean(axis=1), 'o-', label='Training')
plt.plot(param_range, val_scores.mean(axis=1), 'o-', label='Validation')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Validation Curve')
plt.legend()
plt.grid(True)
plt.show()
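The plot is usually read by eye, but the same arrays can be used to pick a value programmatically. The sketch below is one possible approach, not a prescribed one: it takes the depth with the best mean validation accuracy and prints the train/validation gap at each depth. The shorter depth list and the smaller forest are assumptions made to keep it quick.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X_vc, y_vc = load_breast_cancer(return_X_y=True)
depths = [1, 2, 3, 5, 10]

train_sc, val_sc = validation_curve(
    RandomForestClassifier(n_estimators=25, random_state=42),
    X_vc, y_vc,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# Rows are parameter values, columns are folds
val_means = val_sc.mean(axis=1)
gaps = train_sc.mean(axis=1) - val_means

for depth, val_mean, gap in zip(depths, val_means, gaps):
    print(f"max_depth={depth:2d}  val={val_mean:.4f}  train-val gap={gap:.4f}")

# np.argmax returns the first maximum, so ties go to the shallower tree
best_depth = depths[int(np.argmax(val_means))]
print(f"Best max_depth by validation accuracy: {best_depth}")
```

A growing gap with flat validation accuracy is the point where extra depth only adds overfitting.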

Complete Evaluation Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Multi-metric cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
results = cross_validate(
    pipeline, X, y,
    cv=5,
    scoring=scoring,
    return_train_score=True
)

# Display results
print("Cross-Validation Results (Mean +/- Std):\n")
for metric in scoring:
    train_key = f'train_{metric}'
    test_key = f'test_{metric}'
    print(f"{metric:12s}: Train={results[train_key].mean():.4f} (+/- {results[train_key].std():.4f}), "
          f"Val={results[test_key].mean():.4f} (+/- {results[test_key].std():.4f})")

🚀 Mini Projects

Project 1: Metric Dashboard Builder

Build a comprehensive evaluation dashboard that calculates all metrics and visualizes model performance.

Project 2: Cross-Validation Analyzer

Compare different cross-validation strategies and analyze their stability.

Project 3: Threshold Optimization

Find the optimal classification threshold for different business objectives.

Project 4: Model Comparison Report

Create an automated report comparing multiple models across all metrics.

Key Takeaways

Never Evaluate on Training Data

Always use a held-out test set or cross-validation

Accuracy Is Not Enough

Use precision, recall, F1, AUC depending on the problem

Cross-Validation

More reliable than a single train-test split

Watch for Leakage

Test data must not influence training in any way

🧹 Real-World Complications: Messy Data Evaluation

Real-world data creates evaluation challenges. Here’s how to handle them:

Handling Class Imbalance in Evaluation

from sklearn.metrics import classification_report, balanced_accuracy_score
from sklearn.datasets import make_classification

# Create imbalanced dataset (5% positive class)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y  # keep the ~5% positive rate in both splits
)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# BAD: Regular accuracy looks great
print(f"Regular Accuracy: {(y_pred == y_test).mean():.4f}")  # ~95% but misleading!

# BETTER: Balanced accuracy accounts for imbalance
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}")

# BEST: Look at per-class metrics
print("\nPer-Class Metrics:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

Evaluating with Missing Values

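import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Many real datasets have missing values
# DON'T fit imputation on test data!

# Illustrative stand-in for real missing data: blank out ~10% of the
# entries of X (from the previous example) at random
rng = np.random.default_rng(42)
X_with_missing = X.copy()
X_with_missing[rng.random(X.shape) < 0.1] = np.nan

# CORRECT: Imputation is part of the model pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# Cross-validation refits the imputer on each training fold only,
# so no statistics from the held-out fold leak into training
cv_scores = cross_val_score(pipeline, X_with_missing, y, cv=5)
print(f"CV Score with proper imputation: {cv_scores.mean():.4f}")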

Evaluating on Time Series (No Random Split!)

from sklearn.model_selection import TimeSeriesSplit

# BAD: Random split leaks future info into training
# X_train, X_test = train_test_split(X, y)  # WRONG for time series!

# GOOD: Time-aware split
tscv = TimeSeriesSplit(n_splits=5)
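for train_idx, test_idx in tscv.split(X):
    # Assumes the rows of X and y are sorted chronologically
    # Test is always AFTER train in time
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Fit and score the model here, once per fold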
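Printing the indices makes the expanding-window behaviour of `TimeSeriesSplit` concrete. The twelve observations below are made up for illustration and are assumed to be in chronological order.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations, made up for illustration
X_ts = np.arange(12).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=3)
splits = list(tscv_demo.split(X_ts))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Fold 1: train=[0, 1, 2] test=[3, 4, 5]
# Fold 2: train=[0, 1, 2, 3, 4, 5] test=[6, 7, 8]
# Fold 3: train=[0, 1, 2, 3, 4, 5, 6, 7, 8] test=[9, 10, 11]
```

The training window grows with each fold and every test index is later than every training index, so the model is never evaluated on data older than what it was trained on.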

Detecting Evaluation Errors

| Symptom | Likely Problem | Solution |
| --- | --- | --- |
| Train acc = 100%, Test acc low | Overfitting | More regularization, less complexity |
| Train and test acc both ~100% | Data leakage | Check for target in features |
| Accuracy high, F1 low | Class imbalance | Use balanced metrics |
| CV variance very high | Small dataset | Use more folds, bootstrap |
| Test performance varies wildly | Data order matters | Use stratified or time-aware splits |

What’s Next?

Before training, you need to prepare your data. Feature engineering can make or break your model!

Continue to Module 8: Feature Engineering

Learn how to transform raw data into powerful features