
Handling Imbalanced Data

The 1% Problem

Your fraud detection model has 99.5% accuracy. Amazing, right? Wait. The dataset has:
  • 99.5% legitimate transactions
  • 0.5% fraudulent transactions
Your model predicts “legitimate” for everything. It catches zero fraud. This is the imbalanced data problem - and it’s everywhere in real ML.
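You can verify the trap in a few lines: a model that always predicts the majority class scores 99.5% accuracy while catching zero fraud. A minimal sketch (the counts below are illustrative):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels: 9,950 legitimate (0), 50 fraudulent (1)
y_true = np.array([0] * 9950 + [1] * 50)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.1%}")  # 99.5%: looks great
print(f"Recall:   {recall_score(y_true, y_pred):.1%}")    # 0.0%: catches no fraud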

Real-World Imbalanced Problems

Problem                 Minority Class    Typical Ratio
---------------------   ---------------   -------------
Fraud Detection         Fraudulent        1:1000
Disease Diagnosis       Sick patients     1:100
Churn Prediction        Churners          1:10
Click Prediction        Clicks            1:100
Spam Detection          Spam              1:5
Manufacturing Defects   Defective         1:1000

Why Standard ML Fails

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Create imbalanced dataset (99% vs 1%)
X, y = make_classification(
    n_samples=10000, 
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    weights=[0.99, 0.01],  # 99% class 0, 1% class 1
    random_state=42
)

# Stratify so the 1% class appears at the same rate in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print(f"Training set class distribution:")
print(f"  Class 0: {(y_train == 0).sum()} ({(y_train == 0).mean():.1%})")
print(f"  Class 1: {(y_train == 1).sum()} ({(y_train == 1).mean():.1%})")

# Train a standard model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("\nStandard Model Performance:")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
The model learns to predict the majority class almost exclusively!

Evaluation for Imbalanced Data

Don’t Use Accuracy!

from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                              f1_score, roc_auc_score, average_precision_score)

def evaluate_imbalanced(y_true, y_pred, y_prob=None):
    """Proper metrics for imbalanced classification."""
    print("=== Imbalanced Metrics ===")
    print(f"Accuracy:          {accuracy_score(y_true, y_pred):.4f}  (misleading!)")
    print(f"Precision:         {precision_score(y_true, y_pred):.4f}")
    print(f"Recall:            {recall_score(y_true, y_pred):.4f}  (critical!)")
    print(f"F1 Score:          {f1_score(y_true, y_pred):.4f}")
    
    if y_prob is not None:
        print(f"ROC-AUC:           {roc_auc_score(y_true, y_prob):.4f}")
        print(f"Average Precision: {average_precision_score(y_true, y_prob):.4f}")

y_prob = model.predict_proba(X_test)[:, 1]
evaluate_imbalanced(y_test, y_pred, y_prob)

The Metrics That Matter

Recall (Sensitivity)

Of all actual positives, how many did we catch? Critical for fraud and disease detection.

Precision

Of all positive predictions, how many are correct? Important when false positives are costly.

F1 Score

The harmonic mean of precision and recall. A good single-number metric for imbalanced data.

PR-AUC

The area under the precision-recall curve. More informative than ROC-AUC when positives are rare.
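To make these definitions concrete, here is a quick worked example computed directly from confusion-matrix counts (the TP/FP/FN numbers are made up for illustration):

# Illustrative confusion-matrix counts:
# 40 true positives, 10 false positives, 60 false negatives
tp, fp, fn = 40, 10, 60

precision = tp / (tp + fp)   # 0.80: most flagged cases are real
recall = tp / (tp + fn)      # 0.40: but 60% of positives are missed
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1:        {f1:.2f}")  # ~0.53, dragged down by the low recall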

Solution 1: Class Weights

Tell the model that minority class errors matter more:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Option 1: balanced class weights (automatic)
model_balanced = LogisticRegression(class_weight='balanced', random_state=42)
model_balanced.fit(X_train, y_train)

# Option 2: custom weights
# If class 1 is 100x rarer, give it 100x weight
weights = {0: 1, 1: 100}
model_weighted = LogisticRegression(class_weight=weights, random_state=42)
model_weighted.fit(X_train, y_train)

# Compare
print("Standard Model:")
evaluate_imbalanced(y_test, model.predict(X_test), model.predict_proba(X_test)[:, 1])

print("\nBalanced Weights:")
y_pred_balanced = model_balanced.predict(X_test)
evaluate_imbalanced(y_test, y_pred_balanced, model_balanced.predict_proba(X_test)[:, 1])
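For reference, class_weight='balanced' sets each class's weight to n_samples / (n_classes * count_of_class), so rarer classes get proportionally larger weights. You can reproduce the weights from the training labels yourself:

import numpy as np

# Reproduce scikit-learn's 'balanced' heuristic:
# weight_c = n_samples / (n_classes * count_c)
counts = np.bincount(y_train)
balanced_weights = len(y_train) / (2 * counts)
print(f"Class 0 weight: {balanced_weights[0]:.2f}")  # roughly 0.5 for the 99% class
print(f"Class 1 weight: {balanced_weights[1]:.2f}")  # roughly 50 for the 1% class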

Solution 2: Resampling

Oversampling (Add Minority Samples)

# pip install imbalanced-learn
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Random oversampling
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)
print(f"After Random Oversampling: {len(X_ros)} samples")
print(f"  Class 0: {(y_ros == 0).sum()}")
print(f"  Class 1: {(y_ros == 1).sum()}")

# SMOTE (Synthetic Minority Over-sampling Technique): creates new minority
# samples by interpolating between a minority point and its nearest minority neighbors
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(f"\nAfter SMOTE: {len(X_smote)} samples")

Undersampling (Remove Majority Samples)

from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Random undersampling
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print(f"After Random Undersampling: {len(X_rus)} samples")

# Tomek Links: pairs of opposite-class nearest neighbors; removing the
# majority member of each pair cleans up the class boundary
tomek = TomekLinks()
X_tomek, y_tomek = tomek.fit_resample(X_train, y_train)
print(f"After Tomek Links: {len(X_tomek)} samples")

Combination: SMOTE + Tomek

from imblearn.combine import SMOTETomek

smote_tomek = SMOTETomek(random_state=42)
X_combo, y_combo = smote_tomek.fit_resample(X_train, y_train)

# Train on resampled data
model_resampled = LogisticRegression(random_state=42)
model_resampled.fit(X_combo, y_combo)

y_pred_resampled = model_resampled.predict(X_test)
print("SMOTE + Tomek Model:")
evaluate_imbalanced(y_test, y_pred_resampled, model_resampled.predict_proba(X_test)[:, 1])
Never resample the test set! Only resample training data. Test data should reflect real-world distribution.
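A convenient way to enforce this during cross-validation is imblearn's Pipeline, which applies the sampler only when fitting on each training fold and scores on untouched validation folds. A minimal sketch:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# SMOTE runs only during fit (on each training fold);
# validation folds keep the original class distribution
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(random_state=42)),
])

scores = cross_val_score(pipe, X_train, y_train, scoring='f1', cv=5)
print(f"Cross-validated F1: {scores.mean():.4f} (+/- {scores.std():.4f})")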

Solution 3: Threshold Tuning

The default decision threshold is 0.5. For imbalanced data, lowering it trades precision for recall and usually raises the F1 score:
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Get probabilities
y_prob = model_balanced.predict_proba(X_test)[:, 1]

# Find optimal threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-10)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]

print(f"Default threshold (0.5):")
evaluate_imbalanced(y_test, (y_prob >= 0.5).astype(int), y_prob)

print(f"\nOptimal threshold ({optimal_threshold:.3f}):")
evaluate_imbalanced(y_test, (y_prob >= optimal_threshold).astype(int), y_prob)

# Plot Precision-Recall tradeoff
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.plot(thresholds, f1_scores, label='F1 Score', linewidth=2)
plt.axvline(optimal_threshold, color='r', linestyle='--', label=f'Optimal: {optimal_threshold:.3f}')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Tradeoff')
plt.legend()
plt.grid(True)
plt.show()

Solution 4: Ensemble Methods for Imbalanced Data

Balanced Random Forest

from imblearn.ensemble import BalancedRandomForestClassifier

brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)

y_pred_brf = brf.predict(X_test)
print("Balanced Random Forest:")
evaluate_imbalanced(y_test, y_pred_brf, brf.predict_proba(X_test)[:, 1])

Easy Ensemble (AdaBoost on balanced subsets)

from imblearn.ensemble import EasyEnsembleClassifier

ee = EasyEnsembleClassifier(n_estimators=10, random_state=42)
ee.fit(X_train, y_train)

y_pred_ee = ee.predict(X_test)
print("Easy Ensemble:")
evaluate_imbalanced(y_test, y_pred_ee, ee.predict_proba(X_test)[:, 1])

Comparison: What Works Best?

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import BalancedRandomForestClassifier

results = []

# 1. Baseline
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
results.append(('Baseline', f1_score(y_test, model.predict(X_test))))

# 2. Class weights
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
results.append(('Class Weights', f1_score(y_test, model.predict(X_test))))

# 3. SMOTE
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
model = LogisticRegression(random_state=42)
model.fit(X_smote, y_smote)
results.append(('SMOTE', f1_score(y_test, model.predict(X_test))))

# 4. Balanced Random Forest
model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
results.append(('Balanced RF', f1_score(y_test, model.predict(X_test))))

# Print comparison
print("\n=== Method Comparison (F1 Score) ===")
for name, score in sorted(results, key=lambda x: x[1], reverse=True):
    print(f"{name:20s}: {score:.4f}")

Real Example: Credit Card Fraud

# Simulated fraud dataset
np.random.seed(42)
n_samples = 10000
n_fraud = 50  # Only 50 fraud cases!

# Normal transactions
X_normal = np.random.randn(n_samples - n_fraud, 5)
X_normal[:, 0] += 5  # Shift mean

# Fraudulent transactions (different pattern)
X_fraud = np.random.randn(n_fraud, 5)
X_fraud[:, 0] -= 2  # Different mean
X_fraud[:, 1] += 3  # Different pattern

X = np.vstack([X_normal, X_fraud])
y = np.array([0] * (n_samples - n_fraud) + [1] * n_fraud)

# Shuffle
idx = np.random.permutation(len(X))
X, y = X[idx], y[idx]

# Split, stratifying so the 50 fraud cases are spread proportionally across train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print(f"Fraud rate in test set: {y_test.mean():.2%}")

# Best approach for fraud: High recall with balanced RF + threshold tuning
from imblearn.ensemble import BalancedRandomForestClassifier

fraud_model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
fraud_model.fit(X_train, y_train)

# Lower threshold to catch more fraud
y_prob = fraud_model.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.3).astype(int)  # Lower threshold

print("\nFraud Detection Results:")
print(confusion_matrix(y_test, y_pred))
print(f"Recall (fraud caught): {recall_score(y_test, y_pred):.2%}")
print(f"Precision: {precision_score(y_test, y_pred):.2%}")

Decision Flowchart

                    Is data imbalanced?
                             │
                 ┌───────────┴───────────┐
                 │                       │
                No                      Yes
                 │                       │
            Standard ML          What's the ratio?
                                         │
                      ┌──────────────────┼──────────────────┐
                      │                  │                  │
                   < 1:10           1:10-1:100           > 1:100
                      │                  │                  │
               Class Weights         SMOTE +            Ensemble +
                                     Weights            Threshold
Key Takeaways

Accuracy Lies

Never use accuracy for imbalanced data

Focus on Recall/F1

These metrics reveal true performance

Resample Wisely

SMOTE for moderate, ensembles for severe imbalance

Tune Thresholds

Lower threshold to catch more minority class

What’s Next?

Learn how to make your models’ decisions understandable with explainability techniques!

Continue to Model Explainability

Understand why your model makes its predictions