# Handling Imbalanced Data

> When 99% of your data is one class - techniques that actually work

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/imbalanced-data-concept.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=1e3b31069a6c7975cceacde76e785010" alt="Imbalanced Data Concept" width="1080" height="1080" data-path="images/courses/ml-mastery/imbalanced-data-concept.svg" />
</Frame>

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/imbalanced-data-real-world.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=638b033214fd648a94b520c8d163959a" alt="Imbalanced Data Real World Example" width="1080" height="1080" data-path="images/courses/ml-mastery/imbalanced-data-real-world.svg" />
</Frame>

## The 1% Problem

Your fraud detection model has 99.5% accuracy. Amazing, right?

**Wait.** The dataset has:

* 99.5% legitimate transactions
* 0.5% fraudulent transactions

Your model predicts "legitimate" for everything. It catches **zero fraud**.

This is the **imbalanced data problem** - and it's everywhere in real ML.
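
The arithmetic behind this trap can be checked in a few lines. This is a minimal sketch using hypothetical counts (9,950 legitimate and 50 fraudulent transactions) chosen to match the 99.5% / 0.5% split above; the always-"legitimate" predictor stands in for the model.

```python
# Hypothetical label counts matching the 99.5% / 0.5% split (assumption).
n_legit, n_fraud = 9950, 50
y_true = [0] * n_legit + [1] * n_fraud

# A "model" that always predicts the majority class (0 = legitimate).
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual fraud cases that were flagged.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / n_fraud

print(f"Accuracy: {accuracy:.1%}")  # Accuracy: 99.5%
print(f"Recall:   {recall:.1%}")    # Recall:   0.0%
```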

***

## Real-World Imbalanced Problems

| Problem               | Minority Class | Typical Ratio (minority:majority) |
| --------------------- | -------------- | --------------------------------- |
| Fraud Detection       | Fraudulent     | 1:1000                            |
| Disease Diagnosis     | Sick patients  | 1:100                             |
| Churn Prediction      | Churners       | 1:10                              |
| Click Prediction      | Clicks         | 1:100                             |
| Spam Detection        | Spam           | 1:5                               |
| Manufacturing Defects | Defective      | 1:1000                            |

***

## Why Standard ML Fails

```python theme={null}
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Create imbalanced dataset (99% vs 1%)
X, y = make_classification(
    n_samples=10000, 
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    weights=[0.99, 0.01],  # 99% class 0, 1% class 1
    random_state=42
)

# Stratify so the rare class keeps the same share in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Training set class distribution:")
print(f"  Class 0: {(y_train == 0).sum()} ({(y_train == 0).mean():.1%})")
print(f"  Class 1: {(y_train == 1).sum()} ({(y_train == 1).mean():.1%})")

# Train a standard model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("\nStandard Model Performance:")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

**The model learns to predict the majority class almost exclusively!**

***

## Evaluation for Imbalanced Data

### Don't Use Accuracy!

```python theme={null}
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                              f1_score, roc_auc_score, average_precision_score)

def evaluate_imbalanced(y_true, y_pred, y_prob=None):
    """Proper metrics for imbalanced classification."""
    print("=== Imbalanced Metrics ===")
    print(f"Accuracy:          {accuracy_score(y_true, y_pred):.4f}  (misleading!)")
    print(f"Precision:         {precision_score(y_true, y_pred):.4f}")
    print(f"Recall:            {recall_score(y_true, y_pred):.4f}  (critical!)")
    print(f"F1 Score:          {f1_score(y_true, y_pred):.4f}")
    
    if y_prob is not None:
        print(f"ROC-AUC:           {roc_auc_score(y_true, y_prob):.4f}")
        print(f"Average Precision: {average_precision_score(y_true, y_prob):.4f}")

y_prob = model.predict_proba(X_test)[:, 1]
evaluate_imbalanced(y_test, y_pred, y_prob)
```

### The Metrics That Matter

<CardGroup cols={2}>
  <Card title="Recall (Sensitivity)" icon="magnifying-glass">
    Of all actual positives, how many did we catch?
    **Critical for fraud, disease detection**
  </Card>

  <Card title="Precision" icon="bullseye">
    Of all positive predictions, how many are correct?
    **Important when false positives are costly**
  </Card>

  <Card title="F1 Score" icon="scale-balanced">
    Harmonic mean of precision and recall
    **Good single metric for imbalanced data**
  </Card>

  <Card title="PR-AUC" icon="chart-area">
    Area under Precision-Recall curve
    **Usually more informative than ROC-AUC when positives are rare**
  </Card>
</CardGroup>
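
To make the definitions concrete, here is a small sketch that computes these metrics directly from confusion-matrix counts. The function name `metrics_from_counts` and the counts themselves are hypothetical, picked only to show how accuracy can look strong while recall does not.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical test set: 2,000 samples, 20 of them actual positives.
m = metrics_from_counts(tp=12, fp=6, fn=8, tn=1974)
for name, value in m.items():
    print(f"{name:10s}: {value:.4f}")
```

With these counts, accuracy is above 99% even though 8 of the 20 positives are missed, which is why recall and F1 are reported alongside it.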

***

## Solution 1: Class Weights

Tell the model that minority class errors matter more:

```python theme={null}
from sklearn.linear_model import LogisticRegression

# Option 1: balanced class weights (automatic)
model_balanced = LogisticRegression(class_weight='balanced', random_state=42)
model_balanced.fit(X_train, y_train)

# Option 2: custom weights
# If class 1 is 100x rarer, give it 100x weight
weights = {0: 1, 1: 100}
model_weighted = LogisticRegression(class_weight=weights, random_state=42)
model_weighted.fit(X_train, y_train)

# Compare
print("Standard Model:")
evaluate_imbalanced(y_test, model.predict(X_test), model.predict_proba(X_test)[:, 1])

print("\nBalanced Weights:")
y_pred_balanced = model_balanced.predict(X_test)
evaluate_imbalanced(y_test, y_pred_balanced, model_balanced.predict_proba(X_test)[:, 1])
```
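
scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python so the resulting weights can be inspected; the helper name `balanced_class_weights` and the 990/10 label split are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weights per class: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}

# Hypothetical labels: 990 majority (0) and 10 minority (1) samples.
y_example = [0] * 990 + [1] * 10
weights = balanced_class_weights(y_example)

print(weights)
print(f"Minority errors count {weights[1] / weights[0]:.0f}x as much as majority errors")
```

The ratio between the two weights equals the inverse class ratio, which matches the custom-weight reasoning in Option 2.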

***

## Solution 2: Resampling

### Oversampling (Add Minority Samples)

```python theme={null}
# pip install imbalanced-learn
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Random oversampling
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)
print(f"After Random Oversampling: {len(X_ros)} samples")
print(f"  Class 0: {(y_ros == 0).sum()}")
print(f"  Class 1: {(y_ros == 1).sum()}")

# SMOTE: Synthetic Minority Over-sampling Technique
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(f"\nAfter SMOTE: {len(X_smote)} samples")
```
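
The core idea of SMOTE is interpolation: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the line segment between them. The following is a simplified NumPy sketch of that idea, not imbalanced-learn's implementation; `smote_like_samples` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise Euclidean distances; the diagonal is set to inf to exclude self.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # For each new point: random base sample, random neighbor, random gap in [0, 1).
    base_idx = rng.integers(0, n, size=n_new)
    nb_idx = neighbors[base_idx, rng.integers(0, k, size=n_new)]
    gaps = rng.random((n_new, 1))
    return X_minority[base_idx] + gaps * (X_minority[nb_idx] - X_minority[base_idx])

# Hypothetical minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic.round(3))
```

Because each synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies, unlike random oversampling, which only duplicates existing rows.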

### Undersampling (Remove Majority Samples)

```python theme={null}
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Random undersampling
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print(f"After Random Undersampling: {len(X_rus)} samples")

# Tomek Links: Remove borderline majority samples
tomek = TomekLinks()
X_tomek, y_tomek = tomek.fit_resample(X_train, y_train)
print(f"After Tomek Links: {len(X_tomek)} samples")
```

### Combination: SMOTE + Tomek

```python theme={null}
from imblearn.combine import SMOTETomek

smote_tomek = SMOTETomek(random_state=42)
X_combo, y_combo = smote_tomek.fit_resample(X_train, y_train)

# Train on resampled data
model_resampled = LogisticRegression(random_state=42)
model_resampled.fit(X_combo, y_combo)

y_pred_resampled = model_resampled.predict(X_test)
print("SMOTE + Tomek Model:")
evaluate_imbalanced(y_test, y_pred_resampled, model_resampled.predict_proba(X_test)[:, 1])
```

<Warning>
  **Never resample the test set!** Only resample training data. Test data should reflect real-world distribution.
</Warning>

***

## Solution 3: Threshold Tuning

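The default decision threshold is 0.5. For an unweighted model on imbalanced data, lowering it usually trades some precision for higher recall; for a model trained with class weights or resampling, the best threshold may instead be higher. The example below picks the threshold that maximizes F1. It tunes on the test set for brevity; in practice, tune on a separate validation set so the reported test metrics stay unbiased.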

```python theme={null}
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Get probabilities
y_prob = model_balanced.predict_proba(X_test)[:, 1]

# Find optimal threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-10)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]

print("Default threshold (0.5):")
evaluate_imbalanced(y_test, (y_prob >= 0.5).astype(int), y_prob)

print(f"\nOptimal threshold ({optimal_threshold:.3f}):")
evaluate_imbalanced(y_test, (y_prob >= optimal_threshold).astype(int), y_prob)

# Plot Precision-Recall tradeoff
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.plot(thresholds, f1_scores, label='F1 Score', linewidth=2)
plt.axvline(optimal_threshold, color='r', linestyle='--', label=f'Optimal: {optimal_threshold:.3f}')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Tradeoff')
plt.legend()
plt.grid(True)
plt.show()
```
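
The effect of moving the threshold is easiest to see on a handful of scores. This sketch uses hypothetical validation labels and probabilities (not output from the models above) and a hypothetical helper, `precision_recall_at`, to compare two thresholds.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting positive for y_prob >= threshold."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical validation labels and predicted probabilities (assumption).
y_val = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
p_val = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.45, 0.35, 0.55, 0.40, 0.60, 0.90]

for threshold in (0.5, 0.3):
    precision, recall = precision_recall_at(y_val, p_val, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, dropping the threshold from 0.5 to 0.3 catches every positive at the cost of more false alarms, which is the tradeoff the plot above visualizes.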

***

## Solution 4: Ensemble Methods for Imbalanced Data

### Balanced Random Forest

```python theme={null}
from imblearn.ensemble import BalancedRandomForestClassifier

brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)

y_pred_brf = brf.predict(X_test)
print("Balanced Random Forest:")
evaluate_imbalanced(y_test, y_pred_brf, brf.predict_proba(X_test)[:, 1])
```

### Easy Ensemble (AdaBoost on balanced subsets)

```python theme={null}
from imblearn.ensemble import EasyEnsembleClassifier

ee = EasyEnsembleClassifier(n_estimators=10, random_state=42)
ee.fit(X_train, y_train)

y_pred_ee = ee.predict(X_test)
print("Easy Ensemble:")
evaluate_imbalanced(y_test, y_pred_ee, ee.predict_proba(X_test)[:, 1])
```
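
The "balanced subsets" part of this approach can be sketched separately from the boosting part. The snippet below only draws the index sets: each subset keeps every minority sample and an equally sized random draw from the majority class. It is an illustration under assumptions (label 1 is the minority, sampling without replacement), not imbalanced-learn's internal code, and `balanced_subsets` is a hypothetical name.

```python
import numpy as np

def balanced_subsets(y, n_subsets=10, seed=0):
    """Return index arrays, each with all minority rows plus as many majority rows."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == 1)  # assumption: 1 is the minority label
    majority_idx = np.flatnonzero(y == 0)
    subsets = []
    for _ in range(n_subsets):
        # Undersample the majority class down to the minority class size.
        sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        subsets.append(np.concatenate([minority_idx, sampled_majority]))
    return subsets

# Hypothetical labels: 95 majority and 5 minority samples.
y_demo = np.array([0] * 95 + [1] * 5)
subsets = balanced_subsets(y_demo, n_subsets=3)
for i, idx in enumerate(subsets):
    print(f"Subset {i}: {len(idx)} samples, minority share {y_demo[idx].mean():.0%}")
```

One learner would then be trained per subset and the predictions averaged, so each learner sees balanced data while the ensemble as a whole still covers much of the majority class.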

***

## Comparison: What Works Best?

```python theme={null}
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import BalancedRandomForestClassifier

results = []

# 1. Baseline
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
results.append(('Baseline', f1_score(y_test, model.predict(X_test))))

# 2. Class weights
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
results.append(('Class Weights', f1_score(y_test, model.predict(X_test))))

# 3. SMOTE
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
model = LogisticRegression(random_state=42)
model.fit(X_smote, y_smote)
results.append(('SMOTE', f1_score(y_test, model.predict(X_test))))

# 4. Balanced Random Forest
model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
results.append(('Balanced RF', f1_score(y_test, model.predict(X_test))))

# Print comparison
print("\n=== Method Comparison (F1 Score) ===")
for name, score in sorted(results, key=lambda x: x[1], reverse=True):
    print(f"{name:20s}: {score:.4f}")
```

***

## Real Example: Credit Card Fraud

```python theme={null}
# Simulated fraud dataset
np.random.seed(42)
n_samples = 10000
n_fraud = 50  # Only 50 fraud cases!

# Normal transactions
X_normal = np.random.randn(n_samples - n_fraud, 5)
X_normal[:, 0] += 5  # Shift mean

# Fraudulent transactions (different pattern)
X_fraud = np.random.randn(n_fraud, 5)
X_fraud[:, 0] -= 2  # Different mean
X_fraud[:, 1] += 3  # Different pattern

X = np.vstack([X_normal, X_fraud])
y = np.array([0] * (n_samples - n_fraud) + [1] * n_fraud)

# Shuffle
idx = np.random.permutation(len(X))
X, y = X[idx], y[idx]

# Split
# Stratify so the few fraud cases are split proportionally
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Fraud rate in test set: {y_test.mean():.2%}")

# One reasonable approach for fraud: aim for high recall with a balanced RF + threshold tuning
from imblearn.ensemble import BalancedRandomForestClassifier

fraud_model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
fraud_model.fit(X_train, y_train)

# Lower threshold to catch more fraud
y_prob = fraud_model.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.3).astype(int)  # 0.3 is illustrative; tune it on validation data

print("\nFraud Detection Results:")
print(confusion_matrix(y_test, y_pred))
print(f"Recall (fraud caught): {recall_score(y_test, y_pred):.2%}")
print(f"Precision: {precision_score(y_test, y_pred):.2%}")
```

***

## Decision Flowchart

```
                    Is data imbalanced?
                          │
                ┌─────────┴─────────┐
                │                   │
               No                  Yes
                │                   │
           Standard ML        What's the ratio?
                                   │
                    ┌──────────────┼──────────────┐
                    │              │              │
                 < 1:10        1:10-1:100      > 1:100
                    │              │              │
              Class Weights    SMOTE +        Ensemble +
                               Weights        Threshold
```
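
The flowchart can also be written as a small helper. This is a sketch of the rule of thumb above, not a definitive policy: the function name `suggest_strategy` is hypothetical, and the `balanced_cutoff` parameter (the ratio below which data is treated as balanced) is an assumption, since the chart does not define it.

```python
def suggest_strategy(n_majority, n_minority, balanced_cutoff=1.5):
    """Map a class ratio to the starting strategy from the flowchart."""
    if n_minority <= 0:
        raise ValueError("Need at least one minority sample")
    ratio = n_majority / n_minority  # majority samples per minority sample

    if ratio < balanced_cutoff:      # assumption: roughly even classes
        return "Standard ML"
    if ratio < 10:                   # milder than 1:10
        return "Class weights"
    if ratio <= 100:                 # between 1:10 and 1:100
        return "SMOTE + class weights"
    return "Ensemble + threshold tuning"  # more severe than 1:100

for counts in [(500, 500), (900, 100), (9900, 100), (9990, 10)]:
    print(counts, "->", suggest_strategy(*counts))
```

Treat the result as a starting point and confirm it with the comparison approach shown earlier, since the best method depends on the dataset.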

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Accuracy Lies" icon="mask">
    Never rely on accuracy alone for imbalanced data
  </Card>

  <Card title="Focus on Recall/F1" icon="bullseye">
    These metrics show how well the minority class is actually handled
  </Card>

  <Card title="Resample Wisely" icon="recycle">
    SMOTE for moderate, ensembles for severe imbalance
  </Card>

  <Card title="Tune Thresholds" icon="sliders">
    Lower the threshold to catch more minority-class cases
  </Card>
</CardGroup>

***

## What's Next?

Learn how to make your models' decisions understandable with explainability techniques!

<Card title="Continue to Model Explainability" icon="arrow-right" href="/courses/ml-mastery/21-explainability">
  Understand why your model makes its predictions
</Card>
