
Model Evaluation

[Figure: Confusion matrix visualization]

The Hidden Trap

Your model has 99% accuracy. Incredible, right? Not so fast. The dataset is 99% one class:
  • 99% of emails are not spam
  • The model predicts “not spam” for everything
  • 99% accuracy… but it catches zero spam!
This is why proper evaluation matters.
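A minimal sketch of this trap, using scikit-learn's DummyClassifier on synthetic, illustrative data (the labels and sizes below are made up for the demo):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Illustrative imbalanced labels: ~1% spam (1), ~99% not spam (0)
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.01).astype(int)
X_demo = rng.normal(size=(10_000, 5))  # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent")  # always predicts "not spam"
baseline.fit(X_demo, y_demo)
y_pred_demo = baseline.predict(X_demo)

print(f"Accuracy:    {accuracy_score(y_demo, y_pred_demo):.2%}")  # ~99%
print(f"Spam recall: {recall_score(y_demo, y_pred_demo):.2%}")    # 0% -- catches no spam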
[Figure: A/B testing model comparison]

The Train-Test Split

Rule #1: Never evaluate on training data!
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,     # 20% for testing
    random_state=42,    # Reproducibility
    stratify=y          # Preserve class ratios
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples:  {len(X_test)}")

# Train
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on UNSEEN data
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"\nTraining accuracy: {train_acc:.2%}")
print(f"Testing accuracy:  {test_acc:.2%}")
If training accuracy >> test accuracy: Your model is overfitting! It memorized the training data instead of learning patterns.

Cross-Validation: More Reliable Evaluation

What if the test split was “lucky”? Use k-fold cross-validation:
Fold 1: [TEST] [train] [train] [train] [train]
Fold 2: [train] [TEST] [train] [train] [train]
Fold 3: [train] [train] [TEST] [train] [train]
Fold 4: [train] [train] [train] [TEST] [train]
Fold 5: [train] [train] [train] [train] [TEST]
Every sample gets to be in the test set exactly once!
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X, y,
    cv=5,
    scoring='accuracy'
)

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.4f} (+/- {scores.std():.4f})")

Classification Metrics

The Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Train and predict
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Visual
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay(cm, display_labels=cancer.target_names).plot(ax=ax)
plt.title("Confusion Matrix")
plt.show()
                    Predicted
                    Neg     Pos
Actual  Negative   [TN      FP]    <- False Positive: "False alarm"
        Positive   [FN      TP]    <- False Negative: "Missed detection"
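To pull the four cells out of the matrix programmatically, ravel() flattens the 2x2 array; for binary labels scikit-learn orders the cells TN, FP, FN, TP. A quick sketch continuing the example above:

# Unpack the confusion matrix cells (order for binary problems: TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")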

Precision, Recall, F1

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

# Individual metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")

# Full report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))

Precision

Of predicted positives, how many are correct?

$\text{Precision} = \frac{TP}{TP + FP}$

“Don’t cry wolf”

Recall

Of actual positives, how many did we find?

$\text{Recall} = \frac{TP}{TP + FN}$

“Find them all”

F1 Score

Harmonic mean of precision and recall.

$F_1 = \frac{2 \cdot P \cdot R}{P + R}$

“Balance both”
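As a sanity check, all three metrics can be computed by hand from the confusion-matrix counts; a short sketch (the values should match the sklearn functions above):

# Recompute precision, recall, and F1 from the raw counts
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
print(f"Manual precision: {precision_manual:.4f}")
print(f"Manual recall:    {recall_manual:.4f}")
print(f"Manual F1:        {f1_manual:.4f}")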

When to Use What?

Scenario          Priority         Why
Spam Filter       High Precision   Don’t want real emails in spam
Cancer Detection  High Recall      Don’t want to miss cancer cases
Search Engine     Precision@K      Top results must be relevant
Fraud Detection   High Recall      Don’t miss fraud
Recommendation    Precision        Show only relevant items

Probability Thresholds

By default, classifiers turn predicted probabilities into labels using a 0.5 threshold. But you can adjust it:
# Get probabilities
y_prob = model.predict_proba(X_test)[:, 1]

# Different thresholds
for threshold in [0.3, 0.5, 0.7]:
    y_pred_thresh = (y_prob >= threshold).astype(int)
    precision = precision_score(y_test, y_pred_thresh)
    recall = recall_score(y_test, y_pred_thresh)
    print(f"Threshold {threshold}: Precision={precision:.3f}, Recall={recall:.3f}")
Trade-off:
  • Lower threshold → More positive predictions → Higher recall, lower precision
  • Higher threshold → Fewer positive predictions → Lower recall, higher precision
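If you need a threshold tuned to a specific target, precision_recall_curve evaluates every candidate threshold at once. A minimal sketch continuing with y_prob from above; the recall target of 0.95 is purely illustrative:

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

# Illustrative policy: the highest threshold that still keeps recall >= 0.95
ok = recalls[:-1] >= 0.95  # thresholds has one fewer entry than precisions/recalls
best_threshold = thresholds[ok][-1] if ok.any() else 0.5
print(f"Chosen threshold: {best_threshold:.3f}")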

ROC Curve and AUC

The ROC curve shows performance across all thresholds:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate ROC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', linewidth=2, label=f'ROC (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()
AUC (Area Under Curve):
  • 1.0 = Perfect model
  • 0.9 = Excellent
  • 0.8 = Good
  • 0.7 = Fair
  • 0.5 = Random guessing
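If you only need the number and not the plot, roc_auc_score computes the same AUC directly from the predicted probabilities:

from sklearn.metrics import roc_auc_score

print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")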

Regression Metrics

For predicting numbers:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Create regression data (separate names so the classification X, y above aren't overwritten)
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train and predict
reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)
y_pred_reg = reg_model.predict(X_test_reg)

# Metrics
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"R2:   {r2:.4f}")

RMSE

Square root of the mean squared error, in the same units as the target. More sensitive to large errors.

MAE

Average absolute error, in the same units as the target. More robust to outliers.

R2 Score

Proportion of variance explained. 1 = perfect fit, 0 = no better than predicting the mean (it can be negative for very poor models).

MAPE

Average percentage error. Easy to interpret, but unreliable when true values are close to zero.
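scikit-learn (0.24+) provides mean_absolute_percentage_error. A small sketch continuing the regression example; note that make_regression targets cross zero, so MAPE is not a meaningful metric on this particular data and is shown only for the API:

from sklearn.metrics import mean_absolute_percentage_error

# MAPE is returned as a fraction (0.25 = 25% average error)
mape = mean_absolute_percentage_error(y_test_reg, y_pred_reg)
print(f"MAPE: {mape:.2%}")  # unreliable here: some true values are near zero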

Handling Imbalanced Data

When one class dominates (99% vs 1%):

1. Use Appropriate Metrics

from sklearn.metrics import balanced_accuracy_score, f1_score

# Don't use accuracy!
balanced_acc = balanced_accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

2. Resample the Data

from sklearn.utils import resample

# Separate classes (this template assumes class 0 is the majority and class 1 the minority)
X_majority = X_train[y_train == 0]
X_minority = X_train[y_train == 1]

# Upsample minority class
X_minority_upsampled = resample(
    X_minority,
    replace=True,
    n_samples=len(X_majority),
    random_state=42
)

# Combine
X_balanced = np.vstack([X_majority, X_minority_upsampled])
y_balanced = np.hstack([np.zeros(len(X_majority)), np.ones(len(X_minority_upsampled))])

3. Use Class Weights

from sklearn.ensemble import RandomForestClassifier

# Automatically balance weights
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

Learning Curves: Diagnosing Problems

from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt

def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='accuracy',
        n_jobs=-1
    )
    
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', label='Training score')
    plt.plot(train_sizes, val_mean, 'o-', label='Validation score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1)
    plt.xlabel('Training Set Size')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

# Plot
plot_learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    "Learning Curve - Random Forest"
)
Diagnosing from learning curves:
Pattern                       Problem        Solution
High train, low val, gap      Overfitting    Simplify model, more data, regularization
Low train, low val, close     Underfitting   More complex model, more features
Both high and close           Good fit!      You’re done
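A rough programmatic version of this check, as a sketch: compare the final training and validation scores that learning_curve returns. The 0.05 gap and 0.80 score cutoffs are purely illustrative.

from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y, cv=5, scoring='accuracy', n_jobs=-1
)
train_final = train_scores.mean(axis=1)[-1]
val_final = val_scores.mean(axis=1)[-1]

# Illustrative cutoffs -- tune them for your own problem
if train_final - val_final > 0.05:
    print("Large train/val gap -> likely overfitting")
elif val_final < 0.80:
    print("Both scores low -> likely underfitting")
else:
    print("Scores high and close -> good fit")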

Validation Curve: Tuning Hyperparameters

from sklearn.model_selection import validation_curve

# Vary max_depth
param_range = [1, 2, 3, 4, 5, 7, 10, 15, 20]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5,
    scoring='accuracy'
)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_scores.mean(axis=1), 'o-', label='Training')
plt.plot(param_range, val_scores.mean(axis=1), 'o-', label='Validation')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Validation Curve')
plt.legend()
plt.grid(True)
plt.show()
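To read the best setting off programmatically, continuing directly from the arrays above:

# Pick the depth with the highest mean validation accuracy
best_depth = param_range[val_scores.mean(axis=1).argmax()]
print(f"Best max_depth by validation accuracy: {best_depth}")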

Complete Evaluation Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Multi-metric cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
results = cross_validate(
    pipeline, X, y,
    cv=5,
    scoring=scoring,
    return_train_score=True
)

# Display results
print("Cross-Validation Results (Mean +/- Std):\n")
for metric in scoring:
    train_key = f'train_{metric}'
    test_key = f'test_{metric}'
    print(f"{metric:12s}: Train={results[train_key].mean():.4f} (+/- {results[train_key].std():.4f}), "
          f"Val={results[test_key].mean():.4f} (+/- {results[test_key].std():.4f})")

🚀 Mini Projects


Project 1: Metric Dashboard Builder

Build a comprehensive evaluation dashboard that calculates all metrics and visualizes model performance.

Project 2: Cross-Validation Analyzer

Compare different cross-validation strategies and analyze their stability.

Project 3: Threshold Optimization

Find the optimal classification threshold for different business objectives.

Project 4: Model Comparison Report

Create an automated report comparing multiple models across all metrics.

Key Takeaways

Never Evaluate on Training Data

Always use a held-out test set or cross-validation

Accuracy Is Not Enough

Use precision, recall, F1, AUC depending on the problem

Cross-Validation

More reliable than a single train-test split

Watch for Leakage

Test data must not influence training in any way
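A classic leakage mistake is fitting a scaler on the full dataset before cross-validation. The sketch below contrasts the leaky and clean setups on the breast cancer data; LogisticRegression is used here only because it is sensitive to feature scaling, and is not part of the examples above:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# LEAKY: the scaler has already seen the test folds' statistics
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=5000), X_scaled_all, y, cv=5)

# CLEAN: scaling happens inside each fold, fit only on that fold's training data
clean_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=5000))
])
clean_scores = cross_val_score(clean_pipe, X, y, cv=5)

print(f"Leaky CV accuracy: {leaky_scores.mean():.4f}")
print(f"Clean CV accuracy: {clean_scores.mean():.4f}")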

🧹 Real-World Complications: Messy Data Evaluation

Real-world data creates evaluation challenges. Here’s how to handle them:

Handling Class Imbalance in Evaluation

from sklearn.metrics import classification_report, balanced_accuracy_score
from sklearn.datasets import make_classification

# Create imbalanced dataset (5% positive class)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# BAD: Regular accuracy looks great
print(f"Regular Accuracy: {(y_pred == y_test).mean():.4f}")  # ~95% but misleading!

# BETTER: Balanced accuracy accounts for imbalance
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}")

# BEST: Look at per-class metrics
print("\nPer-Class Metrics:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

Evaluating with Missing Values

import pandas as pd
import numpy as np

# Many real datasets have missing values
# DON'T fit imputation on test data!

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Illustrative only: punch random holes into the features so the example runs
rng = np.random.default_rng(42)
X_with_missing = X.astype(float)  # astype returns a copy
X_with_missing[rng.random(X_with_missing.shape) < 0.1] = np.nan

# CORRECT: Imputation is part of the model pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# Cross-validation handles imputation correctly
cv_scores = cross_val_score(pipeline, X_with_missing, y, cv=5)
print(f"CV Score with proper imputation: {cv_scores.mean():.4f}")

Evaluating on Time Series (No Random Split!)

from sklearn.model_selection import TimeSeriesSplit

# BAD: Random split leaks future info into training
# X_train, X_test, y_train, y_test = train_test_split(X, y)  # WRONG for time series!

# GOOD: Time-aware split
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Test is always AFTER train in time
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
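TimeSeriesSplit also plugs straight into cross_val_score as the cv argument. A quick sketch; note that the synthetic classification data used here is not a real time series, so this only demonstrates the mechanics:

# Time-aware cross-validation scores (mechanics demo only)
ts_scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X, y,
    cv=TimeSeriesSplit(n_splits=5)
)
print(f"Time-aware CV accuracy: {ts_scores.mean():.4f}")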

Detecting Evaluation Errors

Symptom                           Likely Problem       Solution
Train acc = 100%, test acc low    Overfitting          More regularization, less complexity
Train and test acc both ~100%     Data leakage         Check for target in features
Accuracy high, F1 low             Class imbalance      Use balanced metrics
CV variance very high             Small dataset        Use more folds, bootstrap
Test performance varies wildly    Data order matters   Use stratified or time-aware splits

What’s Next?

Before training, you need to prepare your data. Feature engineering can make or break your model!

Continue to Module 8: Feature Engineering

Learn how to transform raw data into powerful features