Handling Imbalanced Data
The 1% Problem
Your fraud detection model has 99.5% accuracy. Amazing, right?
Wait. The dataset has:
- 99.5% legitimate transactions
- 0.5% fraudulent transactions
Your model predicts “legitimate” for everything. It catches zero fraud.
This is the imbalanced data problem - and it’s everywhere in real ML.
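To see the trap in miniature, here is a quick sketch using scikit-learn's DummyClassifier on made-up labels matching the split above: always predicting "legitimate" scores 99.5% accuracy while catching 0% of fraud.

```python
# A minimal sketch of the accuracy trap (illustrative labels matching the
# 99.5% / 0.5% split above -- not a real fraud dataset).
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 995 + [1] * 5)   # 0 = legitimate, 1 = fraud
X = np.zeros((len(y), 1))           # features are irrelevant to this dummy model

always_legit = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = always_legit.predict(X)

print(f"Accuracy:     {accuracy_score(y, y_pred):.1%}")  # 99.5%
print(f"Fraud recall: {recall_score(y, y_pred):.1%}")    # 0.0%
```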
Real-World Imbalanced Problems
| Problem | Minority Class | Typical Ratio |
|---|---|---|
| Fraud Detection | Fraudulent | 1:1000 |
| Disease Diagnosis | Sick patients | 1:100 |
| Churn Prediction | Churners | 1:10 |
| Click Prediction | Clicks | 1:100 |
| Spam Detection | Spam | 1:5 |
| Manufacturing Defects | Defective | 1:1000 |
Why Standard ML Fails
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Create an imbalanced dataset (99% vs 1%)
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    weights=[0.99, 0.01],  # 99% class 0, 1% class 1
    random_state=42
)

# Stratify so the split preserves the 99:1 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Training set class distribution:")
print(f"  Class 0: {(y_train == 0).sum()} ({(y_train == 0).mean():.1%})")
print(f"  Class 1: {(y_train == 1).sum()} ({(y_train == 1).mean():.1%})")

# Train a standard model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("\nStandard Model Performance:")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```
The model learns to predict the majority class almost exclusively!
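You can confirm this directly from the predictions above by checking how often the model predicts the minority class at all:

```python
# Quick check (reusing y_pred and y_test from above): how often does the
# standard model predict the minority class?
print(f"Fraction of test samples predicted as class 1: {y_pred.mean():.2%}")
print(f"Actual fraction of class 1 in the test set:    {y_test.mean():.2%}")
```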
Evaluation for Imbalanced Data
Don’t Use Accuracy!
```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

def evaluate_imbalanced(y_true, y_pred, y_prob=None):
    """Proper metrics for imbalanced classification."""
    print("=== Imbalanced Metrics ===")
    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f} (misleading!)")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.4f} (critical!)")
    print(f"F1 Score:  {f1_score(y_true, y_pred):.4f}")
    if y_prob is not None:
        print(f"ROC-AUC:           {roc_auc_score(y_true, y_prob):.4f}")
        print(f"Average Precision: {average_precision_score(y_true, y_prob):.4f}")

y_prob = model.predict_proba(X_test)[:, 1]
evaluate_imbalanced(y_test, y_pred, y_prob)
```
The Metrics That Matter
- Recall (Sensitivity): of all actual positives, how many did we catch? Critical for fraud and disease detection.
- Precision: of all positive predictions, how many are correct? Important when false positives are costly.
- F1 Score: the harmonic mean of precision and recall; a good single metric for imbalanced data.
- PR-AUC: the area under the precision-recall curve; more informative than ROC-AUC for imbalanced data.
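A quick way to see why PR-AUC is the more honest curve metric here: for a no-skill classifier, ROC-AUC still hovers around 0.5, while average precision collapses to the positive rate. A small sketch, reusing y_test from the example above:

```python
# A small sketch (reusing y_test from above): random scores look fine by
# ROC-AUC but terrible by PR-AUC, because PR-AUC's no-skill baseline is the
# positive rate rather than 0.5.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
random_scores = rng.random(len(y_test))  # a classifier with no skill

print(f"No-skill ROC-AUC: {roc_auc_score(y_test, random_scores):.3f}")            # ~0.5
print(f"No-skill PR-AUC:  {average_precision_score(y_test, random_scores):.3f}")  # ~positive rate
```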
Solution 1: Class Weights
Tell the model that minority class errors matter more:
```python
from sklearn.linear_model import LogisticRegression

# Option 1: balanced class weights (automatic, inversely proportional to class frequency)
model_balanced = LogisticRegression(class_weight='balanced', random_state=42)
model_balanced.fit(X_train, y_train)

# Option 2: custom weights
# If class 1 is ~100x rarer, give it ~100x the weight
weights = {0: 1, 1: 100}
model_weighted = LogisticRegression(class_weight=weights, random_state=42)
model_weighted.fit(X_train, y_train)

# Compare
print("Standard Model:")
evaluate_imbalanced(y_test, model.predict(X_test), model.predict_proba(X_test)[:, 1])

print("\nBalanced Weights:")
y_pred_balanced = model_balanced.predict(X_test)
evaluate_imbalanced(y_test, y_pred_balanced, model_balanced.predict_proba(X_test)[:, 1])
```
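For reference, class_weight='balanced' sets each class's weight inversely proportional to its frequency (n_samples / (n_classes * class_count)). You can compute the same weights explicitly with scikit-learn's compute_class_weight:

```python
# What 'balanced' resolves to: weights inversely proportional to class frequency,
# n_samples / (n_classes * class_count), computed here by hand for comparison.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
auto_weights = compute_class_weight('balanced', classes=classes, y=y_train)
print(dict(zip(classes, auto_weights)))  # roughly {0: ~0.5, 1: ~50} for a 99:1 split
```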
Solution 2: Resampling
Oversampling (Add Minority Samples)
```python
# pip install imbalanced-learn
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Random oversampling: duplicate minority samples until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

print(f"After Random Oversampling: {len(X_ros)} samples")
print(f"  Class 0: {(y_ros == 0).sum()}")
print(f"  Class 1: {(y_ros == 1).sum()}")

# SMOTE: Synthetic Minority Over-sampling Technique
# (creates new minority samples by interpolating between minority-class neighbors)
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(f"\nAfter SMOTE: {len(X_smote)} samples")
```
Undersampling (Remove Majority Samples)
```python
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Random undersampling
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print(f"After Random Undersampling: {len(X_rus)} samples")

# Tomek Links: remove borderline majority samples
tomek = TomekLinks()
X_tomek, y_tomek = tomek.fit_resample(X_train, y_train)
print(f"After Tomek Links: {len(X_tomek)} samples")
```
Combination: SMOTE + Tomek
```python
from imblearn.combine import SMOTETomek

# SMOTE oversampling followed by Tomek-link cleaning
smote_tomek = SMOTETomek(random_state=42)
X_combo, y_combo = smote_tomek.fit_resample(X_train, y_train)

# Train on the resampled data, evaluate on the untouched test set
model_resampled = LogisticRegression(random_state=42)
model_resampled.fit(X_combo, y_combo)
y_pred_resampled = model_resampled.predict(X_test)

print("SMOTE + Tomek Model:")
evaluate_imbalanced(y_test, y_pred_resampled, model_resampled.predict_proba(X_test)[:, 1])
```
Never resample the test set! Resample only the training data; the test set should reflect the real-world class distribution.
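One convenient way to enforce this during cross-validation is an imbalanced-learn Pipeline: the sampler runs only when each training fold is fit, and validation folds keep their original distribution. A minimal sketch, reusing X_train and y_train from above:

```python
# A minimal sketch: SMOTE is applied only inside each training fold's fit,
# so validation folds (and the final test set) keep their real distribution.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

leak_free = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(random_state=42)),
])

scores = cross_val_score(leak_free, X_train, y_train, cv=5, scoring='f1')
print(f"Cross-validated F1 (resampling only in training folds): {scores.mean():.4f}")
```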
Solution 3: Threshold Tuning
The default decision threshold is 0.5. For imbalanced data, you often want to lower it:
```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Get predicted probabilities for the positive class
y_prob = model_balanced.predict_proba(X_test)[:, 1]

# Find the threshold that maximizes F1
# (thresholds has one fewer element than precision/recall, hence the [:-1])
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-10)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]

print("Default threshold (0.5):")
evaluate_imbalanced(y_test, (y_prob >= 0.5).astype(int), y_prob)

print(f"\nOptimal threshold ({optimal_threshold:.3f}):")
evaluate_imbalanced(y_test, (y_prob >= optimal_threshold).astype(int), y_prob)

# Plot the precision-recall tradeoff across thresholds
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.plot(thresholds, f1_scores, label='F1 Score', linewidth=2)
plt.axvline(optimal_threshold, color='r', linestyle='--', label=f'Optimal: {optimal_threshold:.3f}')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Tradeoff')
plt.legend()
plt.grid(True)
plt.show()
```
Solution 4: Ensemble Methods for Imbalanced Data
Balanced Random Forest
```python
from imblearn.ensemble import BalancedRandomForestClassifier

# Each tree is trained on a bootstrap sample balanced by undersampling the majority class
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)
y_pred_brf = brf.predict(X_test)

print("Balanced Random Forest:")
evaluate_imbalanced(y_test, y_pred_brf, brf.predict_proba(X_test)[:, 1])
```
Easy Ensemble (AdaBoost on balanced subsets)
```python
from imblearn.ensemble import EasyEnsembleClassifier

ee = EasyEnsembleClassifier(n_estimators=10, random_state=42)
ee.fit(X_train, y_train)
y_pred_ee = ee.predict(X_test)

print("Easy Ensemble:")
evaluate_imbalanced(y_test, y_pred_ee, ee.predict_proba(X_test)[:, 1])
```
Comparison: What Works Best?
```python
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import BalancedRandomForestClassifier

results = []

# 1. Baseline
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
results.append(('Baseline', f1_score(y_test, model.predict(X_test))))

# 2. Class weights
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
results.append(('Class Weights', f1_score(y_test, model.predict(X_test))))

# 3. SMOTE
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
model = LogisticRegression(random_state=42)
model.fit(X_smote, y_smote)
results.append(('SMOTE', f1_score(y_test, model.predict(X_test))))

# 4. Balanced Random Forest
model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
results.append(('Balanced RF', f1_score(y_test, model.predict(X_test))))

# Print comparison, best first
print("\n=== Method Comparison (F1 Score) ===")
for name, score in sorted(results, key=lambda x: x[1], reverse=True):
    print(f"{name:20s}: {score:.4f}")
```
Real Example: Credit Card Fraud
```python
# Simulated fraud dataset
np.random.seed(42)
n_samples = 10000
n_fraud = 50  # Only 50 fraud cases!

# Normal transactions
X_normal = np.random.randn(n_samples - n_fraud, 5)
X_normal[:, 0] += 5  # Shift mean

# Fraudulent transactions (different pattern)
X_fraud = np.random.randn(n_fraud, 5)
X_fraud[:, 0] -= 2  # Different mean
X_fraud[:, 1] += 3  # Different pattern

X = np.vstack([X_normal, X_fraud])
y = np.array([0] * (n_samples - n_fraud) + [1] * n_fraud)

# Shuffle
idx = np.random.permutation(len(X))
X, y = X[idx], y[idx]

# Split (stratified, so the few fraud cases are spread proportionally)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Fraud rate in test set: {y_test.mean():.2%}")

# Best approach for fraud: high recall with a balanced RF + threshold tuning
from imblearn.ensemble import BalancedRandomForestClassifier

fraud_model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
fraud_model.fit(X_train, y_train)

# Lower the threshold to catch more fraud
y_prob = fraud_model.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.3).astype(int)

print("\nFraud Detection Results:")
print(confusion_matrix(y_test, y_pred))
print(f"Recall (fraud caught): {recall_score(y_test, y_pred):.2%}")
print(f"Precision: {precision_score(y_test, y_pred):.2%}")
```
Decision Flowchart
- Is the data imbalanced?
  - No → use standard ML.
  - Yes → what is the imbalance ratio?
    - Milder than 1:10 → class weights
    - 1:10 to 1:100 → SMOTE + class weights
    - More severe than 1:100 → ensembles + threshold tuning
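If it helps, here is a hypothetical helper that encodes the flowchart's rough cutoffs in code (the thresholds are rules of thumb, not hard boundaries):

```python
import numpy as np

def suggest_strategy(y):
    """Hypothetical helper: map the minority/majority ratio to the rough
    strategy suggested by the flowchart above."""
    counts = np.bincount(y)
    ratio = counts.min() / counts.max()  # e.g. 0.01 for a 1:100 imbalance
    if ratio >= 0.5:
        return "Roughly balanced -> standard ML"
    if ratio > 0.1:    # milder than 1:10
        return "Mild imbalance -> class weights"
    if ratio > 0.01:   # between 1:10 and 1:100
        return "Moderate imbalance -> SMOTE + class weights"
    return "Severe imbalance -> balanced ensembles + threshold tuning"

print(suggest_strategy(y_train))  # with the fraud training labels above: severe imbalance
```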
Key Takeaways
- Accuracy lies: never use accuracy for imbalanced data.
- Focus on recall/F1: these metrics reveal true performance.
- Resample wisely: SMOTE for moderate imbalance, ensembles for severe imbalance.
- Tune thresholds: lower the threshold to catch more of the minority class.
What’s Next?
Learn how to make your models’ decisions understandable with explainability techniques!
Continue to Model Explainability to understand why your model makes its predictions.