Model Evaluation

The Hidden Trap

Your model has 99% accuracy. Incredible, right? Not so fast. Suppose 99% of the dataset belongs to one class:

99% of emails are not spam
The model predicts “not spam” for everything
Result: 99% accuracy… but it catches zero spam!

This is why proper evaluation matters.
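
The trap is easy to reproduce. The sketch below is illustrative only: the labels are synthetic (1% spam, an assumption rather than a real email dataset), and scikit-learn's DummyClassifier stands in for the model that predicts “not spam” for everything.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (assumption): 990 "not spam" (0) and 10 "spam" (1)
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
spam_recall = recall_score(y, y_pred)  # fraction of spam actually caught

print(f"Accuracy:    {accuracy:.2%}")
print(f"Spam recall: {spam_recall:.2%}")
```

Accuracy comes out at 99% while recall on the spam class is 0%, which is exactly the gap that accuracy alone hides.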

The Train-Test Split

Rule #1: Never evaluate on training data!

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,     # 20% for testing
    random_state=42,    # Reproducibility
    stratify=y          # Preserve class ratios
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples:  {len(X_test)}")

# Train
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on UNSEEN data
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"\nTraining accuracy: {train_acc:.2%}")
print(f"Testing accuracy:  {test_acc:.2%}")

If training accuracy is much higher than test accuracy, your model is overfitting: it memorized the training data instead of learning patterns that generalize.

Cross-Validation: More Reliable Evaluation

What if the test split was “lucky”? Use k-fold cross-validation:

Fold 1: [TEST] [train] [train] [train] [train]
Fold 2: [train] [TEST] [train] [train] [train]
Fold 3: [train] [train] [TEST] [train] [train]
Fold 4: [train] [train] [train] [TEST] [train]
Fold 5: [train] [train] [train] [train] [TEST]

Every sample gets to be in the test set exactly once!

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X, y,
    cv=5,
    scoring='accuracy'
)

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.4f} (+/- {scores.std():.4f})")
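
To make the fold diagram concrete, here is a minimal sketch of roughly what happens inside a 5-fold run, written as an explicit loop. It assumes StratifiedKFold with shuffling and a smaller forest (n_estimators=50) for speed; both are choices made for this illustration, and test_counts is a helper array introduced here to check the “exactly once” claim.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
test_counts = np.zeros(len(y), dtype=int)  # how often each sample is tested
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    fold_scores.append(fold_model.score(X[test_idx], y[test_idx]))
    test_counts[test_idx] += 1

fold_scores = np.array(fold_scores)
print(f"Fold accuracies: {np.round(fold_scores, 4)}")
print(f"Mean: {fold_scores.mean():.4f} (+/- {fold_scores.std():.4f})")
print(f"Every sample tested exactly once: {(test_counts == 1).all()}")
```

Cloning the estimator for each fold matters: reusing one fitted model across folds would let information from earlier folds carry over.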

Classification Metrics

The Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Train and predict
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Visual
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay(cm, display_labels=cancer.target_names).plot(ax=ax)
plt.title("Confusion Matrix")
plt.show()

                    Predicted
                    Neg     Pos
Actual  Negative   [TN      FP]    <- False Positive: "False alarm"
        Positive   [FN      TP]    <- False Negative: "Missed detection"
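
The four cells can be read straight out of the matrix. The sketch below uses made-up toy labels (an assumption for illustration) and relies on scikit-learn's layout for binary labels 0 and 1: rows are actual classes, columns are predicted classes, so ravel() yields TN, FP, FN, TP in that order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (assumption, for illustration): 1 = positive, 0 = negative
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Keep in mind which label counts as “positive”. In the breast cancer dataset used above, label 1 is benign and label 0 is malignant, so the default positive class is benign.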

Precision, Recall, F1

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

# Individual metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")

# Full report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))

Precision

Of predicted positives, how many are correct?

$$\text{Precision} = \frac{TP}{TP + FP}$$

“Don’t cry wolf”

Recall

Of actual positives, how many did we find?

$$\text{Recall} = \frac{TP}{TP + FN}$$

“Find them all”

F1 Score

Harmonic mean of precision and recall

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

“Balance both”
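
A quick worked example shows how the three formulas relate. The counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical counts (assumption, for illustration)
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)  # 80 / 100 = 0.8
recall = tp / (tp + fn)     # 80 / 120, about 0.667
f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
print(f"Plain average of P and R: {arithmetic_mean:.3f}")
```

F1 (about 0.727) lands below the plain average (about 0.733) because the harmonic mean is pulled toward the smaller of the two values, so a model cannot hide poor recall behind high precision.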

When to Use What?

Scenario	Priority	Why
Spam Filter	High Precision	Don’t want real emails in spam
Cancer Detection	High Recall	Don’t want to miss cancer cases
Search Engine	Precision@K	Top results must be relevant
Fraud Detection	High Recall	Don’t miss fraud
Recommendation	Precision	Show only relevant items

Probability Thresholds

By default, predict() effectively uses 0.5 as the threshold on the predicted probability of the positive class. But you can adjust it:

# Get probabilities
y_prob = model.predict_proba(X_test)[:, 1]

# Different thresholds
for threshold in [0.3, 0.5, 0.7]:
    y_pred_thresh = (y_prob >= threshold).astype(int)
    precision = precision_score(y_test, y_pred_thresh)
    recall = recall_score(y_test, y_pred_thresh)
    print(f"Threshold {threshold}: Precision={precision:.3f}, Recall={recall:.3f}")

Trade-off:

Lower threshold → More positive predictions → Higher recall, usually lower precision
Higher threshold → Fewer positive predictions → Lower recall, usually higher precision
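
Rather than trying a few thresholds by hand, the whole trade-off can be scanned with precision_recall_curve. The sketch below picks the threshold with the best precision among those that still meet a recall target. The 0.95 target (min_recall) is a hypothetical business requirement, and the threshold is chosen on a held-out validation split, since tuning it on the final test set would leak information.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds;
# drop the final (precision=1, recall=0) point so the arrays line up
precision, recall, thresholds = precision_recall_curve(y_val, y_prob)
precision, recall = precision[:-1], recall[:-1]

min_recall = 0.95  # hypothetical business requirement
eligible = recall >= min_recall
best = np.argmax(np.where(eligible, precision, -1.0))
chosen_threshold = thresholds[best]

print(f"Threshold: {chosen_threshold:.2f}")
print(f"Precision: {precision[best]:.3f}, Recall: {recall[best]:.3f}")
```

At least one threshold always qualifies, because the lowest threshold labels every sample positive and therefore reaches a recall of 1.0.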

ROC Curve and AUC

The ROC curve plots the true positive rate against the false positive rate at every possible threshold:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate ROC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', linewidth=2, label=f'ROC (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()

AUC (Area Under Curve), with rough rules of thumb that vary by domain:

1.0 = Perfect model
0.9 = Excellent
0.8 = Good
0.7 = Fair
0.5 = Random guessing
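
AUC also has a direct interpretation: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below checks this on four toy scores (an assumption for illustration) by comparing every positive-negative pair and matching the result against roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy scores (assumption, for illustration)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))
sklearn_auc = roc_auc_score(y_true, scores)

print(f"Pairwise AUC:  {pairwise_auc:.2f}")
print(f"roc_auc_score: {sklearn_auc:.2f}")
```

Three of the four pairs are ranked correctly (only 0.35 vs 0.4 is not), so both numbers are 0.75.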

Regression Metrics

For predicting numbers:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Create regression data (separate names so the classification X, y stay intact)
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train and predict
reg_model = LinearRegression()
reg_model.fit(X_reg_train, y_reg_train)
y_reg_pred = reg_model.predict(X_reg_test)

# Metrics
mse = mean_squared_error(y_reg_test, y_reg_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_reg_test, y_reg_pred)
r2 = r2_score(y_reg_test, y_reg_pred)

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"R2:   {r2:.4f}")

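RMSE

Square root of the mean squared error, in the same units as the target. Penalizes large errors more heavily than MAE.

MAE

Mean absolute error, in the same units as the target. More robust to outliers.

R2 Score

Fraction of variance explained. 1 = perfect fit, 0 = no better than always predicting the mean. It can be negative when the model does worse than that baseline.

MAPE

Mean absolute percentage error. Easy to interpret, but undefined or unstable when true values are zero or close to zero.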
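
MAPE is not computed in the snippet above, and the lower end of the R2 range is easy to misread, so here is a small sketch of both. The numbers are toy values chosen for easy arithmetic (an assumption for illustration). mean_absolute_percentage_error is available in scikit-learn 0.24 and later.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, r2_score

# Toy values (assumption, for illustration): every prediction is off by 10%
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 180.0, 330.0])

# scikit-learn returns MAPE as a fraction, so multiply by 100 for percent
mape = mean_absolute_percentage_error(y_true, y_hat)
print(f"MAPE: {mape * 100:.1f}%")

# Predictions that are worse than always predicting the mean give R2 < 0
r2_bad = r2_score([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(f"R2 of a bad model: {r2_bad:.1f}")
```

The second case has a residual sum of squares of 8 against a total sum of squares of 2, so R2 = 1 - 8/2 = -3.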

Handling Imbalanced Data

When one class dominates (for example, 99% vs 1%), three techniques help:

1. Use Appropriate Metrics

from sklearn.metrics import balanced_accuracy_score, f1_score

# Don't use accuracy!
balanced_acc = balanced_accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

2. Resample the Data

from sklearn.utils import resample

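# Resample the TRAINING split only, so duplicated rows never reach the test set
# Separate classes (assumes class 1 is the minority)
X_majority = X_train[y_train == 0]
X_minority = X_train[y_train == 1]

# Upsample minority class with replacement until it matches the majority
X_minority_upsampled = resample(
    X_minority,
    replace=True,
    n_samples=len(X_majority),
    random_state=42
)

# Combine and rebuild integer labels
X_balanced = np.vstack([X_majority, X_minority_upsampled])
y_balanced = np.hstack([
    np.zeros(len(X_majority), dtype=int),
    np.ones(len(X_minority_upsampled), dtype=int)
])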

3. Use Class Weights

from sklearn.ensemble import RandomForestClassifier

# Automatically balance weights
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
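
To see what class_weight='balanced' does, the weights can be computed directly. This sketch uses toy labels with a 90/10 split (an assumption for illustration) and compares scikit-learn's compute_class_weight with the formula n_samples / (n_classes * class_count).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption): 90 negatives and 10 positives
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# 'balanced' weight for a class = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
manual = len(y) / (len(classes) * np.bincount(y))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

The rare class gets a weight of 5.0 and the common class about 0.556, so each misclassified minority sample costs roughly nine times as much during training.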

Learning Curves: Diagnosing Problems

from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt

def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='accuracy',
        n_jobs=-1
    )
    
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', label='Training score')
    plt.plot(train_sizes, val_mean, 'o-', label='Validation score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1)
    plt.xlabel('Training Set Size')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

# Plot
plot_learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    "Learning Curve - Random Forest"
)

Diagnosing from learning curves:

Pattern	Problem	Solution
High train, low val, gap	Overfitting	Simplify model, more data, regularization
Low train, low val, close	Underfitting	More complex model, more features
Both high and close	Good fit!	You’re done

Validation Curve: Tuning Hyperparameters

from sklearn.model_selection import validation_curve

# Vary max_depth
param_range = [1, 2, 3, 4, 5, 7, 10, 15, 20]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5,
    scoring='accuracy'
)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_scores.mean(axis=1), 'o-', label='Training')
plt.plot(param_range, val_scores.mean(axis=1), 'o-', label='Validation')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Validation Curve')
plt.legend()
plt.grid(True)
plt.show()
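
Once the curve is computed, the best setting can be read off programmatically instead of by eye. The sketch below is a minimal version under two assumptions: it swaps in a single DecisionTreeClassifier so it runs quickly, and it treats the highest mean validation accuracy as “best”, preferring the shallowest depth on ties.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
param_range = [1, 2, 3, 4, 5, 7, 10]

# A single decision tree (assumption: swapped in for speed)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    param_name="max_depth",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)

val_mean = val_scores.mean(axis=1)
best_idx = int(np.argmax(val_mean))  # first maximum, i.e. the shallowest on ties
best_depth = param_range[best_idx]
gap = train_scores.mean(axis=1)[best_idx] - val_mean[best_idx]

print(f"Best max_depth: {best_depth}")
print(f"Validation accuracy: {val_mean[best_idx]:.4f} (train-val gap {gap:.4f})")
```

A hyperparameter chosen this way has already seen the validation folds, so report final performance on a separate test set.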

Complete Evaluation Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Multi-metric cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
results = cross_validate(
    pipeline, X, y,
    cv=5,
    scoring=scoring,
    return_train_score=True
)

# Display results
print("Cross-Validation Results (Mean +/- Std):\n")
for metric in scoring:
    train_key = f'train_{metric}'
    test_key = f'test_{metric}'
    print(f"{metric:12s}: Train={results[train_key].mean():.4f} (+/- {results[train_key].std():.4f}), "
          f"Val={results[test_key].mean():.4f} (+/- {results[test_key].std():.4f})")

🚀 Mini Projects

Project 1: Metric Dashboard Builder

Build a comprehensive evaluation dashboard that calculates all metrics and visualizes model performance.

Project 2: Cross-Validation Analyzer

Compare different cross-validation strategies and analyze their stability.

Project 3: Threshold Optimization

Find the optimal classification threshold for different business objectives.

Project 4: Model Comparison Report

Create an automated report comparing multiple models across all metrics.

Key Takeaways

Never Evaluate on Training Data

Always use a held-out test set or cross-validation

Accuracy Is Not Enough

Use precision, recall, F1, AUC depending on the problem

Cross-Validation

More reliable than a single train-test split

Watch for Leakage

Test data must not influence training in any way
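
Leakage can be subtle. The sketch below is an illustrative experiment on pure noise (synthetic data, an assumption): selecting features with all labels before cross-validation makes a model look skilled even though there is nothing to learn, while putting the same selection step inside a Pipeline keeps it honest.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise (assumption, for illustration): labels are unrelated to features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)

# LEAKY: features are selected using ALL labels, including future test folds
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_selected, y, cv=5)

# HONEST: selection is refit inside each training fold
pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipeline, X, y, cv=5)

print(f"Leaky CV accuracy:  {leaky.mean():.2f}")
print(f"Honest CV accuracy: {honest.mean():.2f}")
```

The leaky score typically comes out well above chance, while the honest score should stay near 50%, which is the correct answer for random labels.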

🧹 Real-World Complications: Messy Data Evaluation

Evaluating Models on Messy Data

Real-world data creates evaluation challenges. Here’s how to handle them:

Handling Class Imbalance in Evaluation

from sklearn.metrics import classification_report, balanced_accuracy_score
from sklearn.datasets import make_classification

# Create imbalanced dataset (5% positive class)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
# Stratify so the rare class appears in both splits in the same proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# BAD: Regular accuracy looks great
print(f"Regular Accuracy: {(y_pred == y_test).mean():.4f}")  # ~95% but misleading!

# BETTER: Balanced accuracy accounts for imbalance
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}")

# BEST: Look at per-class metrics
print("\nPer-Class Metrics:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

Evaluating with Missing Values

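import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Many real datasets have missing values.
# DON'T fit the imputer on test data: its statistics must come from training folds only.

# Illustrative setup (assumption): blank out about 10% of the entries in X at random
rng = np.random.default_rng(42)
X_with_missing = X.copy()
X_with_missing[rng.random(X.shape) < 0.1] = np.nan

# CORRECT: Imputation is part of the model pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# Cross-validation refits the imputer on each training fold
cv_scores = cross_val_score(pipeline, X_with_missing, y, cv=5)
print(f"CV Score with proper imputation: {cv_scores.mean():.4f}")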

Evaluating on Time Series (No Random Split!)

from sklearn.model_selection import TimeSeriesSplit

# BAD: Random split leaks future info into training
# X_train, X_test, y_train, y_test = train_test_split(X, y)  # WRONG for time series!

# GOOD: Time-aware split (rows of X and y must already be sorted by time)
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Test is always AFTER train in time
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Fit and score the model on this fold here
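
Printing the indices makes the expanding-window behaviour visible. This sketch uses ten dummy observations in chronological order (an assumption for illustration) and three splits to keep the output short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten observations in chronological order (assumption, for illustration)
X_time = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_time))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

In every fold the training indices all come before the test indices, and the training window grows from one fold to the next, so the model is never asked to predict the past from the future.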

Detecting Evaluation Errors

Symptom	Likely Problem	Solution
Train acc = 100%, Test acc low	Overfitting	More regularization, less complexity
Train and test acc both ~100%	Data leakage	Check for target in features
Accuracy high, F1 low	Class imbalance	Use balanced metrics
CV variance very high	Small dataset	Use repeated k-fold CV or bootstrap
Test performance varies wildly	Data order matters	Use stratified or time-aware splits

What’s Next?

Before training, you need to prepare your data. Feature engineering can make or break your model!

Continue to Module 8: Feature Engineering

Learn how to transform raw data into powerful features
