Model Evaluation
The Hidden Trap
Your model has 99% accuracy. Incredible, right?
Wait. The dataset is 99% one class:
99% of emails are not spam
The model predicts “not spam” for everything
99% accuracy… but it catches zero spam!
This is why proper evaluation matters.
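Here is a minimal sketch of that trap (the dataset is a synthetic, illustrative assumption): a "model" that always predicts the majority class scores roughly 99% accuracy while catching nothing.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic data: ~99% "not spam" (class 0), ~1% "spam" (class 1)
X_demo, y_demo = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# A baseline that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_demo, y_demo)
y_hat = dummy.predict(X_demo)

print(f"Accuracy: {accuracy_score(y_demo, y_hat):.2%}")            # ~99%
print(f"Recall (spam caught): {recall_score(y_demo, y_hat):.2%}")  # 0%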
The Train-Test Split
Rule #1: Never evaluate on training data!
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% for testing
    random_state=42,   # Reproducibility
    stratify=y         # Preserve class ratios
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

# Train
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on UNSEEN data
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"\nTraining accuracy: {train_acc:.2%}")
print(f"Testing accuracy: {test_acc:.2%}")
If training accuracy >> test accuracy: your model is overfitting!
It memorized the training data instead of learning patterns.
Cross-Validation: More Reliable Evaluation
What if the test split was “lucky”? Use k-fold cross-validation:
Fold 1: [TEST] [train] [train] [train] [train]
Fold 2: [train] [TEST] [train] [train] [train]
Fold 3: [train] [train] [TEST] [train] [train]
Fold 4: [train] [train] [train] [TEST] [train]
Fold 5: [train] [train] [train] [train] [TEST]
Every sample gets to be in the test set exactly once!
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X, y,
    cv=5,
    scoring='accuracy'
)

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.4f} (+/- {scores.std():.4f})")
Classification Metrics
The Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Train and predict
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Visual
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay(cm, display_labels=cancer.target_names).plot(ax=ax)
plt.title("Confusion Matrix")
plt.show()
                 Predicted
                 Neg     Pos
Actual Negative [ TN      FP ]   <- False Positive: "False alarm"
       Positive [ FN      TP ]   <- False Negative: "Missed detection"
Precision, Recall, F1
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

# Individual metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

# Full report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
Precision: Of predicted positives, how many are correct? $\frac{TP}{TP + FP}$ (“Don’t cry wolf”)
Recall: Of actual positives, how many did we find? $\frac{TP}{TP + FN}$ (“Find them all”)
F1 Score: Harmonic mean of precision and recall, $\frac{2 \cdot P \cdot R}{P + R}$ (“Balance both”)
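To tie these formulas back to the confusion matrix computed above, here is a small sketch that recomputes the same metrics by hand from cm (assuming the binary matrix from the earlier snippet, with class 1 as the positive class, scikit-learn's default):

# Unpack the 2x2 confusion matrix (rows = actual, columns = predicted)
tn, fp, fn, tp = cm.ravel()

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"Precision (by hand): {precision_manual:.4f}")
print(f"Recall (by hand): {recall_manual:.4f}")
print(f"F1 (by hand): {f1_manual:.4f}")

These should match the precision_score, recall_score, and f1_score values printed earlier.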
When to Use What?
Scenario | Priority | Why
Spam Filter | High Precision | Don’t want real emails in spam
Cancer Detection | High Recall | Don’t want to miss cancer cases
Search Engine | Precision@K | Top results must be relevant
Fraud Detection | High Recall | Don’t miss fraud
Recommendation | Precision | Show only relevant items
Probability Thresholds
By default, we use 0.5 as the threshold. But you can adjust it:
# Get probabilities
y_prob = model.predict_proba(X_test)[:, 1]

# Different thresholds
for threshold in [0.3, 0.5, 0.7]:
    y_pred_thresh = (y_prob >= threshold).astype(int)
    precision = precision_score(y_test, y_pred_thresh)
    recall = recall_score(y_test, y_pred_thresh)
    print(f"Threshold {threshold}: Precision={precision:.3f}, Recall={recall:.3f}")
Trade-off:
Lower threshold → More positive predictions → Higher recall, lower precision
Higher threshold → Fewer positive predictions → Lower recall, higher precision
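To see this trade-off across every possible threshold at once, a precision-recall-versus-threshold plot is a handy companion (this sketch is an addition, reusing y_test and y_prob from the threshold example above):

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Precision and recall at every candidate threshold
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

plt.figure(figsize=(8, 6))
plt.plot(thresholds, precisions[:-1], label='Precision')
plt.plot(thresholds, recalls[:-1], label='Recall')
plt.xlabel('Decision threshold')
plt.ylabel('Score')
plt.title('Precision and Recall vs. Threshold')
plt.legend()
plt.grid(True)
plt.show()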
ROC Curve and AUC
The ROC curve shows performance across all thresholds:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate ROC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', linewidth=2, label=f'ROC (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()
AUC (Area Under the Curve):
1.0 = Perfect model
0.9 = Excellent
0.8 = Good
0.7 = Fair
0.5 = Random guessing
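If you only need the number and not the plot, scikit-learn's roc_auc_score computes the same AUC directly from the probabilities; a one-line sketch:

from sklearn.metrics import roc_auc_score

# Same value as auc(fpr, tpr) above, without building the curve by hand
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")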
Regression Metrics
For predicting numbers:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Create regression data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train and predict
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R2: {r2:.4f}")
RMSE: average error in the same units as the target; more sensitive to large errors.
MAE: average error in the same units as the target; more robust to outliers.
R2 Score: fraction of variance explained (1 = perfect fit, 0 = no better than predicting the mean; it can even go negative for very poor models).
MAPE: average percentage error; easy to interpret.
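MAPE is listed but not computed in the snippet above; here is a short sketch using scikit-learn's mean_absolute_percentage_error (available in recent versions), reusing y_test and y_pred from the regression example:

from sklearn.metrics import mean_absolute_percentage_error

# Returned as a fraction; multiply by 100 for a percentage.
# Note: MAPE is unstable when true values are close to zero.
mape = mean_absolute_percentage_error(y_test, y_pred)
print(f"MAPE: {mape * 100:.2f}%")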
Handling Imbalanced Data
When one class dominates (99% vs 1%):
1. Use Appropriate Metrics
from sklearn.metrics import balanced_accuracy_score, f1_score

# Don't use accuracy!
balanced_acc = balanced_accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
2. Resample the Data
from sklearn.utils import resample

# Separate classes
X_majority = X_train[y_train == 0]
X_minority = X_train[y_train == 1]

# Upsample minority class
X_minority_upsampled = resample(
    X_minority,
    replace=True,
    n_samples=len(X_majority),
    random_state=42
)

# Combine
X_balanced = np.vstack([X_majority, X_minority_upsampled])
y_balanced = np.hstack([np.zeros(len(X_majority)), np.ones(len(X_minority_upsampled))])
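The snippet stops at building X_balanced and y_balanced; a natural next step, sketched here (model_bal and y_pred_bal are illustrative names, not from the original), is to train on the balanced data while still evaluating on the untouched test split:

# Train on the resampled (balanced) training data...
model_bal = RandomForestClassifier(random_state=42)
model_bal.fit(X_balanced, y_balanced)

# ...but always evaluate on the original, untouched test set
y_pred_bal = model_bal.predict(X_test)
print(f"F1 on original test set: {f1_score(y_test, y_pred_bal):.4f}")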
3. Use Class Weights
from sklearn.ensemble import RandomForestClassifier

# Automatically balance class weights
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
Learning Curves: Diagnosing Problems
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt

def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='accuracy',
        n_jobs=-1
    )

    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', label='Training score')
    plt.plot(train_sizes, val_mean, 'o-', label='Validation score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1)
    plt.xlabel('Training Set Size')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

# Plot (expects the classification data; if you ran the regression example above,
# reload it first with: X, y = cancer.data, cancer.target)
plot_learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    "Learning Curve - Random Forest"
)
Diagnosing from learning curves:
Pattern | Problem | Solution
High train, low validation, large gap | Overfitting | Simplify the model, get more data, add regularization
Low train, low validation, close together | Underfitting | Use a more complex model, add features
Both high and close together | Good fit! | You're done
Validation Curve: Tuning Hyperparameters
from sklearn.model_selection import validation_curve

# Vary max_depth
param_range = [1, 2, 3, 4, 5, 7, 10, 15, 20]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5,
    scoring='accuracy'
)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_scores.mean(axis=1), 'o-', label='Training')
plt.plot(param_range, val_scores.mean(axis=1), 'o-', label='Validation')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Validation Curve')
plt.legend()
plt.grid(True)
plt.show()
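Reading the best value off the plot works, but you can also pick it programmatically; a quick sketch using the scores computed above:

# max_depth with the highest mean validation accuracy
val_mean = val_scores.mean(axis=1)
best_idx = val_mean.argmax()
print(f"Best max_depth: {param_range[best_idx]} (validation accuracy: {val_mean[best_idx]:.4f})")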
Complete Evaluation Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Multi-metric cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
results = cross_validate(
    pipeline, X, y,
    cv=5,
    scoring=scoring,
    return_train_score=True
)

# Display results
print("Cross-Validation Results (Mean +/- Std):\n")
for metric in scoring:
    train_key = f'train_{metric}'
    test_key = f'test_{metric}'
    print(f"{metric:12s}: Train={results[train_key].mean():.4f} (+/- {results[train_key].std():.4f}), "
          f"Val={results[test_key].mean():.4f} (+/- {results[test_key].std():.4f})")
🚀 Mini Projects
Project 1: Metric Dashboard Builder
Build a comprehensive evaluation dashboard that calculates all metrics and visualizes model performance.
Project 2: Cross-Validation Analyzer
Compare different cross-validation strategies and analyze their stability.
Project 3: Threshold Optimization
Find the optimal classification threshold for different business objectives.
Project 4: Model Comparison Report
Create an automated report comparing multiple models across all metrics.
Key Takeaways
Never evaluate on training data: always use a held-out test set or cross-validation.
Accuracy is not enough: use precision, recall, F1, or AUC depending on the problem.
Cross-validation is more reliable than a single train-test split.
Watch for leakage: test data must not influence training in any way.
🧹 Real-World Complications: Messy Data Evaluation
Evaluating Models on Messy Data
Real-world data creates evaluation challenges. Here’s how to handle them:
Handling Class Imbalance in Evaluation
from sklearn.metrics import classification_report, balanced_accuracy_score
from sklearn.datasets import make_classification

# Create imbalanced dataset (5% positive class)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# BAD: Regular accuracy looks great
print(f"Regular Accuracy: {(y_pred == y_test).mean():.4f}")  # ~95% but misleading!

# BETTER: Balanced accuracy accounts for imbalance
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}")

# BEST: Look at per-class metrics
print("\nPer-Class Metrics:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
Evaluating with Missing Values
import pandas as pd
import numpy as np

# Many real datasets have missing values
# DON'T fit imputation on test data!
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# CORRECT: Imputation is part of the model pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# Cross-validation handles imputation correctly
# (X_with_missing is a feature matrix containing NaNs; one way to build it is sketched below)
cv_scores = cross_val_score(pipeline, X_with_missing, y, cv=5)
print(f"CV Score with proper imputation: {cv_scores.mean():.4f}")
Evaluating on Time Series (No Random Split!)
from sklearn.model_selection import TimeSeriesSplit

# BAD: Random split leaks future info into training
# X_train, X_test = train_test_split(X, y)  # WRONG for time series!

# GOOD: Time-aware split
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Test is always AFTER train in time
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
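The loop above only slices the data; one way to turn it into a full evaluation (a sketch, not part of the original) is to fit and score within each fold and average the results:

fold_scores = []
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Fit only on the past, score only on the future
    fold_model = RandomForestClassifier(random_state=42)
    fold_model.fit(X_train, y_train)
    fold_scores.append(fold_model.score(X_test, y_test))

print(f"Mean time-series CV accuracy: {np.mean(fold_scores):.4f}")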
Detecting Evaluation Errors
Symptom | Likely Problem | Solution
Train acc = 100%, test acc low | Overfitting | More regularization, less complexity
Train and test acc both ~100% | Data leakage | Check for the target hiding in the features
Accuracy high, F1 low | Class imbalance | Use balanced metrics
CV variance very high | Small dataset | Use more folds, bootstrap
Test performance varies wildly | Data order matters | Use stratified or time-aware splits
What’s Next?
Before training, you need to prepare your data. Feature engineering can make or break your model!
Continue to Module 8: Feature Engineering, where you'll learn how to transform raw data into powerful features.