# Model Evaluation

> Measure what matters - accuracy is not enough

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/model-evaluation-concept.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=e1d3caed9f8df9991f90be6e23376f13" alt="Confusion Matrix Visualization" width="1080" height="1080" data-path="images/courses/ml-mastery/model-evaluation-concept.svg" />
</Frame>

## The Hidden Trap

Your model has **99% accuracy**. Incredible, right?

**Wait.** 99% of the dataset belongs to one class:

* 99% of emails are not spam
* The model predicts "not spam" for everything
* 99% accuracy... but it catches **zero spam**!

Think of it like a weather forecaster in the Sahara who predicts "no rain" every single day. They'd be right 99% of the time -- and completely useless the 1% of the time it actually matters. Accuracy is a vanity metric when your classes are imbalanced, and in the real world, they almost always are.

This is why proper evaluation matters.
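
A minimal sketch of this trap, assuming a made-up inbox of 990 legitimate emails and 10 spam emails (the counts and variable names are illustrative, not from a real dataset). scikit-learn's `DummyClassifier` stands in for the model that predicts "not spam" for everything:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical inbox: 990 legitimate emails (0) and 10 spam emails (1).
y_true = np.array([0] * 990 + [1] * 10)
X_dummy = np.zeros((len(y_true), 1))  # features are irrelevant to this baseline

# "most_frequent" always predicts the majority class, i.e. "not spam".
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
y_hat = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_hat)
baseline_recall = recall_score(y_true, y_hat)  # share of spam actually caught

print(f"Accuracy: {baseline_accuracy:.2%}")
print(f"Recall:   {baseline_recall:.2%}")
```

A baseline like this is a useful sanity check: any real model should beat it on the metric you actually care about, not just match its accuracy.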

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/model-evaluation-real-world.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=e7900dbe7e14016c716c74fd4c2d67d7" alt="A/B Testing Model Comparison" width="1080" height="1080" data-path="images/courses/ml-mastery/model-evaluation-real-world.svg" />
</Frame>

***

## The Train-Test Split

**Rule #1**: Never evaluate on training data!

Evaluating on training data is like grading a student using the exact questions they practiced on. Of course they'll ace it -- but you have no idea if they actually understand the material. The test set is the "final exam" your model has never seen.

```python theme={null}
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,     # 20% for testing -- industry standard for medium datasets
    random_state=42,    # Reproducibility -- same split every run
    stratify=y          # Preserve class ratios -- critical for imbalanced data!
    # Without stratify, your test set might randomly have 0% of the minority class
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples:  {len(X_test)}")

# Train
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on UNSEEN data
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"\nTraining accuracy: {train_acc:.2%}")
print(f"Testing accuracy:  {test_acc:.2%}")
```
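
To see what `stratify` does, here is a small sketch on made-up labels (900 negatives and 100 positives; the arrays and names are assumptions for illustration only). It compares how many positives land in the test set with and without stratification:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 900 negatives, 100 positives (10% positive).
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Plain random split: the positive share in the test set can drift.
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Stratified split: the positive share is preserved in both parts.
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(f"Positives without stratify: {y_test_plain.sum()} of {len(y_test_plain)}")
print(f"Positives with stratify:    {y_test_strat.sum()} of {len(y_test_strat)}")
```

With stratification the test set keeps the original 10% positive rate; without it the count depends on the random seed.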

<Warning>
  **If training accuracy >> test accuracy**: Your model is **overfitting**!
  It memorized the training data instead of learning patterns.

  **Rules of thumb for the gap (in percentage points):**

  * **less than 5%**: Normal and expected. Ship it.
  * **5-15%**: Mild overfitting. Try regularization or a simpler model.
  * **greater than 15%**: Serious overfitting. Reduce model complexity, get more data, or add regularization (dropout, for neural networks).
  * **Test higher than train**: Unusual. A small difference can come from a lucky split or a small test set; a large one points to data leakage or a bug. Investigate.
</Warning>

***

## Cross-Validation: More Reliable Evaluation

What if the test split was "lucky"? Use **k-fold cross-validation**:

```
Fold 1: [TEST] [train] [train] [train] [train]
Fold 2: [train] [TEST] [train] [train] [train]
Fold 3: [train] [train] [TEST] [train] [train]
Fold 4: [train] [train] [train] [TEST] [train]
Fold 5: [train] [train] [train] [train] [TEST]
```

Every sample gets to be in the test set exactly once!
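
The "exactly once" claim can be checked directly. This sketch runs `KFold` on ten placeholder samples (the toy array is an assumption for illustration) and counts how often each index appears in a test fold:

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 hypothetical samples
times_tested = np.zeros(len(X_toy), dtype=int)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_toy), start=1):
    times_tested[test_idx] += 1  # record which samples were held out this fold
    print(f"Fold {fold}: test indices = {sorted(test_idx.tolist())}")

print(f"Times each sample was tested: {times_tested}")
```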

The standard deviation of CV scores tells you how *stable* your model is. A model with 95% +/- 1% is much more trustworthy than one with 95% +/- 8%. High variance across folds often means your dataset is too small or your model is too sensitive to which specific examples it trains on.

```python theme={null}
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
# Why 5 folds? It's a good trade-off between computational cost
# and reliable estimation. 10 folds gives slightly better estimates
# but takes twice as long. 3 folds is faster but noisier.
scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X, y,
    cv=5,
    scoring='accuracy'
)

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.4f} (+/- {scores.std():.4f})")
# The "+/-" is the standard deviation across folds.
# If it's > 5% of the mean, consider whether your data is too small
# or your model is too complex for the available data.
```

***

## Classification Metrics

### The Confusion Matrix

```python theme={null}
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Train and predict
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Visual
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay(cm, display_labels=cancer.target_names).plot(ax=ax)
plt.title("Confusion Matrix")
plt.show()
```

```
                    Predicted
                    Neg     Pos
Actual  Negative   [TN      FP]    <- False Positive: "False alarm"
        Positive   [FN      TP]    <- False Negative: "Missed detection"
```
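
For binary labels, the four cells can be pulled out of the matrix by name. A small sketch on made-up labels (the two lists are illustrative assumptions):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive, 0 = negative.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Rows are actual classes and columns are predicted classes, so for
# labels [0, 1] ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Note that scikit-learn orders rows and columns by sorted label value, so which class counts as "positive" depends on how your data is encoded.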

***

### Precision, Recall, F1

```python theme={null}
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

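# Individual metrics
# Note: scikit-learn treats label 1 as the positive class by default.
# In this dataset 1 = benign and 0 = malignant, so these scores describe
# how well the model finds benign cases; pass pos_label=0 to score malignant.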
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")

# Full report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
```

<CardGroup cols={3}>
  <Card title="Precision" icon="bullseye">
    Of predicted positives, how many are correct?

    $\frac{TP}{TP + FP}$

    **"Don't cry wolf"**
  </Card>

  <Card title="Recall" icon="magnifying-glass">
    Of actual positives, how many did we find?

    $\frac{TP}{TP + FN}$

    **"Find them all"**
  </Card>

  <Card title="F1 Score" icon="scale-balanced">
    Harmonic mean of precision and recall

    $\frac{2 \cdot P \cdot R}{P + R}$

    **"Balance both"**
  </Card>
</CardGroup>
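
The three formulas can be worked through by hand and compared with scikit-learn. This sketch assumes made-up counts (TP = 8, FP = 2, FN = 4, TN = 6) and builds label arrays that reproduce them:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical counts chosen for easy arithmetic.
TP, FP, FN, TN = 8, 2, 4, 6

precision_manual = TP / (TP + FP)   # 8 / 10
recall_manual = TP / (TP + FN)      # 8 / 12
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

# Label arrays that produce exactly those counts.
y_true_counts = np.array([1] * TP + [0] * FP + [1] * FN + [0] * TN)
y_pred_counts = np.array([1] * TP + [1] * FP + [0] * FN + [0] * TN)

print(f"Precision: {precision_manual:.4f} vs {precision_score(y_true_counts, y_pred_counts):.4f}")
print(f"Recall:    {recall_manual:.4f} vs {recall_score(y_true_counts, y_pred_counts):.4f}")
print(f"F1:        {f1_manual:.4f} vs {f1_score(y_true_counts, y_pred_counts):.4f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two scores than a simple average would.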

### When to Use What?

| Scenario         | Priority       | Why                             |
| ---------------- | -------------- | ------------------------------- |
| Spam Filter      | High Precision | Don't want real emails in spam  |
| Cancer Detection | High Recall    | Don't want to miss cancer cases |
| Search Engine    | Precision\@K   | Top results must be relevant    |
| Fraud Detection  | High Recall    | Don't miss fraud                |
| Recommendation   | Precision      | Show only relevant items        |

***

## Probability Thresholds

By default, a binary classifier's `predict` effectively uses 0.5 as the cutoff on the predicted probability of the positive class. But you can adjust it:

```python theme={null}
# Get probabilities
y_prob = model.predict_proba(X_test)[:, 1]

# Different thresholds
for threshold in [0.3, 0.5, 0.7]:
    y_pred_thresh = (y_prob >= threshold).astype(int)
    precision = precision_score(y_test, y_pred_thresh)
    recall = recall_score(y_test, y_pred_thresh)
    print(f"Threshold {threshold}: Precision={precision:.3f}, Recall={recall:.3f}")
```

**Trade-off** -- think of it like adjusting the sensitivity on a metal detector at an airport:

* **Lower threshold** (more sensitive): Catches more threats but also beeps at belt buckles. More positive predictions, higher recall, lower precision.
* **Higher threshold** (less sensitive): Only triggers on real weapons but might miss a hidden knife. Fewer positive predictions, lower recall, higher precision.

The right threshold depends on what's more expensive: false alarms or missed catches. In cancer screening, you want a low threshold (catch everything). In email spam filtering, you want a higher threshold (don't lose real mail).
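
One way to turn a requirement like "catch at least 90% of positives" into a threshold is to scan the precision-recall curve. The sketch below is an illustration under assumptions: the data is synthetic (`make_classification`), the model is a plain logistic regression, and `TARGET_RECALL` is a hypothetical business requirement.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (about 20% positives).
X_syn, y_syn = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_te, scores)

TARGET_RECALL = 0.90  # hypothetical requirement
# recalls has one more entry than thresholds; drop the final (recall = 0) point.
meets_target = recalls[:-1] >= TARGET_RECALL
# Recall falls as the threshold rises, so the last qualifying index is the
# highest threshold (and usually the best precision) that still meets the target.
chosen_idx = np.where(meets_target)[0][-1]
chosen_threshold = thresholds[chosen_idx]

achieved_recall = recall_score(y_te, (scores >= chosen_threshold).astype(int))
print(f"Threshold: {chosen_threshold:.3f}")
print(f"Recall:    {achieved_recall:.3f}, Precision: {precisions[chosen_idx]:.3f}")
```

In practice the threshold should be chosen on a validation set and then confirmed on the test set; picking it on the test set, as this short sketch does, gives an optimistic estimate.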

***

## ROC Curve and AUC

The **ROC curve** shows performance across all thresholds:

```python theme={null}
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate ROC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', linewidth=2, label=f'ROC (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()
```

**AUC (Area Under Curve)** -- the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example:

* 1.0 = Perfect model (always ranks positives above negatives)
* 0.5 = Random guessing (coin flip)
* \> 0.9 = Excellent (production-ready for many applications)
* \> 0.8 = Good (worth deploying with monitoring)
* \> 0.7 = Fair (better than nothing, but investigate why it's struggling)
* \< 0.5 = Your labels might be flipped, or the model is actively anti-predicting

**Why AUC over accuracy?** AUC doesn't depend on a specific threshold, so it tells you about the model's overall discriminative ability. Two models could have the same accuracy at threshold=0.5 but very different AUCs -- the one with higher AUC has more "room to maneuver" when you adjust the threshold for business needs.
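
The ranking interpretation of AUC can be verified by brute force: compare every positive score against every negative score and count how often the positive wins (ties count as half). The labels and scores below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and model scores.
y_true_auc = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score_auc = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90, 0.30, 0.70])

pos = y_score_auc[y_true_auc == 1]
neg = y_score_auc[y_true_auc == 0]

# Every (positive, negative) pair: positive minus negative score.
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

sklearn_auc = roc_auc_score(y_true_auc, y_score_auc)
print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:             {sklearn_auc:.4f}")
```

Both numbers agree, which is why AUC is often described as a measure of ranking quality rather than of any single decision.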

***

## Regression Metrics

For predicting numbers:

```python theme={null}
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

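# Create regression data (separate names so the classification
# variables X, y, model, and y_pred from earlier sections stay intact)
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42  # fixed seed for a reproducible split
)

# Train and predict
reg_model = LinearRegression()
reg_model.fit(X_reg_train, y_reg_train)
y_reg_pred = reg_model.predict(X_reg_test)

# Metrics
mse = mean_squared_error(y_reg_test, y_reg_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_reg_test, y_reg_pred)
r2 = r2_score(y_reg_test, y_reg_pred)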

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"R2:   {r2:.4f}")
```

<CardGroup cols={2}>
  <Card title="RMSE" icon="ruler-vertical">
    Typical error size in the same units as the target.
    Squares errors first, so large errors count more.
  </Card>

  <Card title="MAE" icon="ruler">
    Average error in same units as target.
    More robust to outliers.
  </Card>

  <Card title="R2 Score" icon="percent">
    Fraction of variance explained.
    1 = perfect fit, 0 = no better than predicting the mean; can be negative on held-out data.
  </Card>

  <Card title="MAPE" icon="divide">
    Average % error.
    Easy to interpret, but undefined when a true value is 0.
  </Card>
</CardGroup>
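
To make the four cards concrete, here is a sketch that computes each metric with plain NumPy on a tiny made-up example (the arrays and the helper name `regression_metrics` are assumptions for illustration). The second call uses deliberately bad predictions to show that R2 can drop below zero:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical targets and predictions.
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_reg = np.array([2.0, 5.0, 8.0, 9.0])

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R2, and MAPE from their definitions."""
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # spread around the mean
    return {
        "MAE": np.mean(np.abs(errors)),
        "RMSE": np.sqrt(np.mean(errors ** 2)),
        "R2": 1 - ss_res / ss_tot,
        "MAPE": np.mean(np.abs(errors / y_true)),     # breaks if any y_true == 0
    }

good = regression_metrics(y_true_reg, y_pred_reg)
bad = regression_metrics(y_true_reg, y_true_reg[::-1])  # reversed, i.e. badly wrong

for name, value in good.items():
    print(f"{name:5s}: {value:.4f}")
print(f"R2 for the reversed predictions: {bad['R2']:.1f}")
```

RMSE is larger than MAE here because squaring gives the two non-zero errors more weight; the two only coincide when every error has the same magnitude.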

***

## Handling Imbalanced Data

When one class dominates (99% vs 1%):

### 1. Use Appropriate Metrics

```python theme={null}
from sklearn.metrics import balanced_accuracy_score, f1_score

# Don't use accuracy!
balanced_acc = balanced_accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
```

### 2. Resample the Data

Think of it like a cooking class where 95 students want to learn Italian but only 5 want to learn Thai. If you just teach to the majority, you'll ignore Thai completely. Resampling either duplicates the Thai students (upsampling) or randomly removes some Italian students (downsampling) to give both groups fair representation.

```python theme={null}
from sklearn.utils import resample

# Find the majority and minority labels by count instead of assuming
# label 0 is the majority (in the breast cancer data, label 1 is larger).
labels, counts = np.unique(y_train, return_counts=True)
majority_label = labels[np.argmax(counts)]
minority_label = labels[np.argmin(counts)]

# Separate classes (training data only -- never resample the test set)
X_majority = X_train[y_train == majority_label]
X_minority = X_train[y_train == minority_label]

# Upsample minority class -- duplicate minority examples until
# both classes have equal representation. The model sees each
# minority example multiple times, emphasizing those patterns.
X_minority_upsampled = resample(
    X_minority,
    replace=True,       # Sample WITH replacement (same point can appear twice)
    n_samples=len(X_majority),  # Match majority class size
    random_state=42
)

# Combine into balanced dataset
X_balanced = np.vstack([X_majority, X_minority_upsampled])
y_balanced = np.hstack([
    np.full(len(X_majority), majority_label),
    np.full(len(X_minority_upsampled), minority_label),
])
# Caution: upsampling creates exact duplicates, which can cause overfitting
# on those specific examples. Consider SMOTE (Module 20) for synthetic samples.
```

### 3. Use Class Weights

```python theme={null}
from sklearn.ensemble import RandomForestClassifier

# Automatically balance weights -- this tells the model to treat
# each minority sample as if it were worth MORE during training.
# With 'balanced', a class with 10x fewer samples gets 10x the weight.
# Mathematically: weight_i = n_samples / (n_classes * n_samples_for_class_i)
# This is the easiest fix and should be your first attempt.
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
```

<Tip>
  **Model selection tip for imbalanced data**: Start with `class_weight='balanced'` on Logistic Regression or Random Forest before trying resampling techniques. It's simpler, doesn't create synthetic data, and often works just as well. Reserve SMOTE and other resampling for when class weights alone aren't enough.
</Tip>
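
The `'balanced'` weight formula can be checked against scikit-learn's own helper, `compute_class_weight`. The labels below (90 negatives, 10 positives) are a made-up example:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 of class 0, 10 of class 1.
y_imb = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_imb)

# What scikit-learn computes for class_weight='balanced'.
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_imb)

# The same formula by hand: n_samples / (n_classes * n_samples_for_class_i).
manual_weights = len(y_imb) / (len(classes) * np.bincount(y_imb))

for label, w_sklearn, w_manual in zip(classes, sklearn_weights, manual_weights):
    print(f"class {label}: sklearn={w_sklearn:.4f}, manual={w_manual:.4f}")
```

The minority class has 9x fewer samples here, so its weight comes out 9x larger than the majority's.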

***

## Learning Curves: Diagnosing Problems

```python theme={null}
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt

def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='accuracy',
        n_jobs=-1
    )
    
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', label='Training score')
    plt.plot(train_sizes, val_mean, 'o-', label='Validation score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1)
    plt.xlabel('Training Set Size')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

# Plot
plot_learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    "Learning Curve - Random Forest"
)
```

**Diagnosing from learning curves** -- this is one of the most valuable debugging tools in ML:

| Pattern                                  | Problem            | What It Looks Like                                          | Solution                                                                  |
| ---------------------------------------- | ------------------ | ----------------------------------------------------------- | ------------------------------------------------------------------------- |
| High train, low val, big gap             | **Overfitting**    | Training score stays at \~99%, validation plateaus at \~75% | Simplify model (reduce depth/features), get more data, add regularization |
| Low train, low val, close together       | **Underfitting**   | Both scores hover around 65%, more data doesn't help        | Use a more complex model, engineer better features, reduce regularization |
| Both high and close                      | **Good fit**       | Both scores at \~90% and converging as data increases       | You're done -- ship it                                                    |
| Val score still rising at the right edge | **Need more data** | Gap is closing but hasn't converged yet                     | Collect more training data -- you're on the right track                   |
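
The first three rows of the table can be turned into a rough helper for the final point on a learning curve. This is only a sketch: the function name `diagnose_fit` and both cutoffs (`good_score`, `max_gap`) are illustrative assumptions, not standard values, and the fourth row (validation score still rising) needs the whole curve rather than one point.

```python
def diagnose_fit(train_score, val_score, good_score=0.85, max_gap=0.10):
    """Rule-of-thumb reading of final train/validation scores.

    The cutoffs are illustrative assumptions; tune them for your problem.
    """
    gap = train_score - val_score
    if gap > max_gap:
        return "overfitting"      # high train, much lower validation
    if train_score < good_score:
        return "underfitting"     # both scores low and close together
    return "good fit"             # both scores high and close together


# The example values mirror the "What It Looks Like" column above.
for train, val in [(0.99, 0.75), (0.65, 0.64), (0.91, 0.90)]:
    print(f"train={train:.2f}, val={val:.2f} -> {diagnose_fit(train, val)}")
```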

***

## Validation Curve: Tuning Hyperparameters

```python theme={null}
from sklearn.model_selection import validation_curve

# Vary max_depth
param_range = [1, 2, 3, 4, 5, 7, 10, 15, 20]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5,
    scoring='accuracy'
)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_scores.mean(axis=1), 'o-', label='Training')
plt.plot(param_range, val_scores.mean(axis=1), 'o-', label='Validation')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Validation Curve')
plt.legend()
plt.grid(True)
plt.show()
```
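
Instead of reading the best value off the plot, you can pick it from the mean validation scores. To keep the sketch quick to run, it swaps in a single `DecisionTreeClassifier` and a shorter `depths` list in place of the random forest above; those choices and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X_vc, y_vc = load_breast_cancer(return_X_y=True)
depths = [1, 2, 3, 5, 10]

train_sc, val_sc = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X_vc, y_vc,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# One row per depth, one column per fold: average across folds.
val_means = val_sc.mean(axis=1)
best_depth = depths[int(np.argmax(val_means))]

for depth, train_mean, val_mean in zip(depths, train_sc.mean(axis=1), val_means):
    print(f"max_depth={depth:2d}: train={train_mean:.4f}, val={val_mean:.4f}")
print(f"Best max_depth by validation accuracy: {best_depth}")
```

When several depths score within a standard deviation of each other, the simpler (shallower) model is usually the safer pick.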

***

## Complete Evaluation Pipeline

```python theme={null}
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Multi-metric cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
results = cross_validate(
    pipeline, X, y,
    cv=5,
    scoring=scoring,
    return_train_score=True
)

# Display results
print("Cross-Validation Results (Mean +/- Std):\n")
for metric in scoring:
    train_key = f'train_{metric}'
    test_key = f'test_{metric}'
    print(f"{metric:12s}: Train={results[train_key].mean():.4f} (+/- {results[train_key].std():.4f}), "
          f"Val={results[test_key].mean():.4f} (+/- {results[test_key].std():.4f})")
```

***

## 🚀 Mini Projects

<CardGroup cols={2}>
  <Card title="Project 1: Metric Dashboard Builder" icon="gauge">
    Build a comprehensive model evaluation dashboard
  </Card>

  <Card title="Project 2: Cross-Validation Analyzer" icon="chart-line">
    Compare different CV strategies and their stability
  </Card>

  <Card title="Project 3: Threshold Optimization" icon="sliders">
    Find optimal decision thresholds for business needs
  </Card>

  <Card title="Project 4: Model Comparison Report" icon="file-chart-column">
    Create an automated model comparison report
  </Card>
</CardGroup>

### Project 1: Metric Dashboard Builder

Build a comprehensive evaluation dashboard that calculates all metrics and visualizes model performance.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                               f1_score, roc_auc_score, confusion_matrix,
                               roc_curve, precision_recall_curve)

  # Step 1: Load and prepare data
  cancer = load_breast_cancer()
  X_train, X_test, y_train, y_test = train_test_split(
      cancer.data, cancer.target, test_size=0.2, random_state=42,
      stratify=cancer.target  # keep class ratios identical in train and test
  )

  # Step 2: Train model
  model = RandomForestClassifier(n_estimators=100, random_state=42)
  model.fit(X_train, y_train)

  # Step 3: Get predictions
  y_pred = model.predict(X_test)
  y_proba = model.predict_proba(X_test)[:, 1]

  # Step 4: Calculate all metrics
  metrics = {
      'Accuracy': accuracy_score(y_test, y_pred),
      'Precision': precision_score(y_test, y_pred),
      'Recall': recall_score(y_test, y_pred),
      'F1 Score': f1_score(y_test, y_pred),
      'ROC-AUC': roc_auc_score(y_test, y_proba)
  }

  # Step 5: Create dashboard visualization
  fig, axes = plt.subplots(2, 2, figsize=(12, 10))

  # Metrics bar chart
  ax1 = axes[0, 0]
  ax1.barh(list(metrics.keys()), list(metrics.values()), color='steelblue')
  ax1.set_xlim(0, 1)
  ax1.set_title('Model Performance Metrics')
  for i, (metric, value) in enumerate(metrics.items()):
      ax1.text(value + 0.02, i, f'{value:.3f}', va='center')

  # Confusion matrix
  ax2 = axes[0, 1]
  cm = confusion_matrix(y_test, y_pred)
  im = ax2.imshow(cm, cmap='Blues')
  ax2.set_xticks([0, 1])
  ax2.set_yticks([0, 1])
  # Label 0 is malignant and label 1 is benign, so use the dataset's own names
  ax2.set_xticklabels(cancer.target_names)
  ax2.set_yticklabels(cancer.target_names)
  ax2.set_xlabel('Predicted')
  ax2.set_ylabel('Actual')
  ax2.set_title('Confusion Matrix')
  for i in range(2):
      for j in range(2):
          ax2.text(j, i, cm[i, j], ha='center', va='center', fontsize=16)

  # ROC curve
  ax3 = axes[1, 0]
  fpr, tpr, _ = roc_curve(y_test, y_proba)
  ax3.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC={metrics["ROC-AUC"]:.3f})')
  ax3.plot([0, 1], [0, 1], 'k--', label='Random')
  ax3.set_xlabel('False Positive Rate')
  ax3.set_ylabel('True Positive Rate')
  ax3.set_title('ROC Curve')
  ax3.legend()

  # Precision-Recall curve
  ax4 = axes[1, 1]
  precision, recall, _ = precision_recall_curve(y_test, y_proba)
  ax4.plot(recall, precision, 'g-', linewidth=2)
  ax4.set_xlabel('Recall')
  ax4.set_ylabel('Precision')
  ax4.set_title('Precision-Recall Curve')
  ax4.fill_between(recall, precision, alpha=0.3)

  plt.tight_layout()
  plt.savefig('model_dashboard.png', dpi=150)
  print("Dashboard saved!")

  # Print metrics summary
  print("\n📊 Model Evaluation Dashboard")
  print("=" * 40)
  for metric, value in metrics.items():
      print(f"{metric:15s}: {value:.4f}")
  ```

  **What you learned:**

  * Calculating multiple evaluation metrics at once
  * Visualizing model performance comprehensively
  * Understanding the relationships between different metrics
</details>

### Project 2: Cross-Validation Analyzer

Compare different cross-validation strategies and analyze their stability.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import (cross_val_score, KFold, StratifiedKFold,
                                       ShuffleSplit, RepeatedKFold)
  from sklearn.ensemble import RandomForestClassifier

  # Load data
  cancer = load_breast_cancer()
  X, y = cancer.data, cancer.target

  # Create model
  model = RandomForestClassifier(n_estimators=50, random_state=42)

  # Step 1: Define different CV strategies
  cv_strategies = {
      '5-Fold': KFold(n_splits=5, shuffle=True, random_state=42),
      '10-Fold': KFold(n_splits=10, shuffle=True, random_state=42),
      'Stratified 5-Fold': StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
      'Stratified 10-Fold': StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
      'Shuffle Split (10)': ShuffleSplit(n_splits=10, test_size=0.2, random_state=42),
      'Repeated 5-Fold (3x)': RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
  }

  # Step 2: Run each CV strategy
  results = {}
  for name, cv in cv_strategies.items():
      scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
      results[name] = {
          'scores': scores,
          'mean': scores.mean(),
          'std': scores.std(),
          'min': scores.min(),
          'max': scores.max()
      }
      print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")

  # Step 3: Visualize comparison
  fig, axes = plt.subplots(1, 2, figsize=(14, 5))

  # Box plot of scores
  ax1 = axes[0]
  score_data = [results[name]['scores'] for name in cv_strategies.keys()]
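  ax1.boxplot(score_data)
  # Set tick labels separately; boxplot's `labels` keyword is deprecated in recent Matplotlib
  ax1.set_xticklabels([n.replace(' ', '\n') for n in cv_strategies.keys()])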
  ax1.set_ylabel('Accuracy')
  ax1.set_title('CV Strategy Comparison (Score Distribution)')
  ax1.axhline(y=np.mean([r['mean'] for r in results.values()]), 
              color='r', linestyle='--', label='Overall Mean')
  ax1.legend()

  # Stability analysis (std)
  ax2 = axes[1]
  names = list(results.keys())
  means = [results[n]['mean'] for n in names]
  stds = [results[n]['std'] for n in names]
  x_pos = np.arange(len(names))

  bars = ax2.bar(x_pos, means, yerr=stds, capsize=5, color='steelblue', alpha=0.7)
  ax2.set_xticks(x_pos)
  ax2.set_xticklabels([n.replace(' ', '\n') for n in names], fontsize=8)
  ax2.set_ylabel('Accuracy')
  ax2.set_title('Mean Accuracy with Standard Deviation')
  ax2.set_ylim(0.9, 1.0)

  plt.tight_layout()
  plt.savefig('cv_comparison.png', dpi=150)

  # Step 4: Recommendation
  print("\n📋 CV Strategy Analysis")
  print("=" * 50)
  most_stable = min(results.keys(), key=lambda x: results[x]['std'])
  print(f"Most stable strategy: {most_stable}")
  print(f"  - Mean: {results[most_stable]['mean']:.4f}")
  print(f"  - Std:  {results[most_stable]['std']:.4f}")
  print("\n💡 Recommendations:")
  print("  - Use Stratified K-Fold for imbalanced datasets")
  print("  - Use Repeated K-Fold for more reliable estimates")
  print("  - More folds = lower bias but higher variance")
  ```

  **What you learned:**

  * Different cross-validation strategies have different stability
  * Stratified CV is important for imbalanced data
  * Repeated CV gives more reliable estimates but takes longer
</details>

### Project 3: Threshold Optimization

Find the optimal classification threshold for different business objectives.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import (precision_score, recall_score, f1_score,
                               precision_recall_curve, confusion_matrix)

  # Scenario: Medical diagnosis where missing cancer (FN) costs more than false alarm (FP)
  # Cost of FN (missed cancer): $100,000
  # Cost of FP (unnecessary tests): $5,000

  FN_COST = 100000
  FP_COST = 5000

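  # Step 1: Prepare data and model
  cancer = load_breast_cancer()
  # In this dataset 0 = malignant and 1 = benign. Relabel so malignant is the
  # positive class (1); otherwise FN would count false alarms, not missed cancers.
  y = (cancer.target == 0).astype(int)
  X_train, X_test, y_train, y_test = train_test_split(
      cancer.data, y, test_size=0.3, random_state=42, stratify=y
  )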

  model = RandomForestClassifier(n_estimators=100, random_state=42)
  model.fit(X_train, y_train)
  y_proba = model.predict_proba(X_test)[:, 1]

  # Step 2: Evaluate different thresholds
  thresholds = np.arange(0.1, 0.95, 0.05)
  results = []

  for thresh in thresholds:
      y_pred = (y_proba >= thresh).astype(int)
      cm = confusion_matrix(y_test, y_pred)
      tn, fp, fn, tp = cm.ravel()
      
      precision = precision_score(y_test, y_pred, zero_division=0)
      recall = recall_score(y_test, y_pred, zero_division=0)
      f1 = f1_score(y_test, y_pred, zero_division=0)
      total_cost = fn * FN_COST + fp * FP_COST
      
      results.append({
          'threshold': thresh,
          'precision': precision,
          'recall': recall,
          'f1': f1,
          'fp': fp,
          'fn': fn,
          'cost': total_cost
      })

  # Step 3: Find optimal thresholds for different objectives
  import pandas as pd
  df = pd.DataFrame(results)

  best_f1_idx = df['f1'].idxmax()
  best_recall_idx = df['recall'].idxmax()
  best_cost_idx = df['cost'].idxmin()

  print("📊 Threshold Optimization Results")
  print("=" * 60)
  print(f"\n🎯 Best F1 Score (balanced):")
  print(f"   Threshold: {df.loc[best_f1_idx, 'threshold']:.2f}")
  print(f"   F1: {df.loc[best_f1_idx, 'f1']:.4f}, Precision: {df.loc[best_f1_idx, 'precision']:.4f}, Recall: {df.loc[best_f1_idx, 'recall']:.4f}")

  print(f"\n🏥 Best Recall (minimize missed cases):")
  print(f"   Threshold: {df.loc[best_recall_idx, 'threshold']:.2f}")
  print(f"   Recall: {df.loc[best_recall_idx, 'recall']:.4f}, Precision: {df.loc[best_recall_idx, 'precision']:.4f}")

  print(f"\n💰 Minimum Cost (business optimal):")
  print(f"   Threshold: {df.loc[best_cost_idx, 'threshold']:.2f}")
  print(f"   Cost: ${df.loc[best_cost_idx, 'cost']:,.0f}")
  print(f"   FN: {df.loc[best_cost_idx, 'fn']}, FP: {df.loc[best_cost_idx, 'fp']}")

  # Step 4: Visualize
  fig, axes = plt.subplots(1, 3, figsize=(15, 4))

  # Precision-Recall tradeoff
  ax1 = axes[0]
  ax1.plot(df['threshold'], df['precision'], 'b-', label='Precision')
  ax1.plot(df['threshold'], df['recall'], 'r-', label='Recall')
  ax1.plot(df['threshold'], df['f1'], 'g-', label='F1')
  ax1.axvline(x=df.loc[best_f1_idx, 'threshold'], color='g', linestyle='--', alpha=0.5)
  ax1.set_xlabel('Threshold')
  ax1.set_ylabel('Score')
  ax1.set_title('Precision-Recall-F1 vs Threshold')
  ax1.legend()

  # Cost analysis
  ax2 = axes[1]
  ax2.plot(df['threshold'], df['cost']/1000, 'purple', linewidth=2)
  ax2.axvline(x=df.loc[best_cost_idx, 'threshold'], color='purple', linestyle='--', alpha=0.5)
  ax2.scatter([df.loc[best_cost_idx, 'threshold']], [df.loc[best_cost_idx, 'cost']/1000], 
             color='purple', s=100, zorder=5)
  ax2.set_xlabel('Threshold')
  ax2.set_ylabel('Total Cost ($K)')
  ax2.set_title('Cost vs Threshold')

  # FP vs FN tradeoff
  ax3 = axes[2]
  ax3.plot(df['threshold'], df['fp'], 'orange', label='False Positives')
  ax3.plot(df['threshold'], df['fn'], 'red', label='False Negatives')
  ax3.set_xlabel('Threshold')
  ax3.set_ylabel('Count')
  ax3.set_title('Error Types vs Threshold')
  ax3.legend()

  plt.tight_layout()
  plt.savefig('threshold_optimization.png', dpi=150)
  print("\n✅ Visualization saved!")
  ```

  **What you learned:**

  * Default 0.5 threshold isn't always optimal
  * Business costs should drive threshold selection
  * Precision and recall trade off with each other
</details>

### Project 4: Model Comparison Report

Create an automated report comparing multiple models across all metrics.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import cross_validate, train_test_split
  from sklearn.preprocessing import StandardScaler
  from sklearn.pipeline import Pipeline
  from sklearn.linear_model import LogisticRegression
  from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
  from sklearn.svm import SVC
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.tree import DecisionTreeClassifier
  import time

  # Step 1: Prepare data
  cancer = load_breast_cancer()
  X, y = cancer.data, cancer.target
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=42, stratify=y
  )

  # Step 2: Define models to compare
  models = {
      'Logistic Regression': Pipeline([
          ('scaler', StandardScaler()),
          ('clf', LogisticRegression(max_iter=1000, random_state=42))
      ]),
      'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
      'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
      'SVM': Pipeline([
          ('scaler', StandardScaler()),
          ('clf', SVC(probability=True, random_state=42))
      ]),
      'KNN': Pipeline([
          ('scaler', StandardScaler()),
          ('clf', KNeighborsClassifier(n_neighbors=5))
      ]),
      'Decision Tree': DecisionTreeClassifier(random_state=42)
  }

  # Step 3: Evaluate all models
  scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
  results = []

  for name, model in models.items():
      print(f"Evaluating {name}...")
      start = time.time()
      
      cv_results = cross_validate(
          model, X_train, y_train,
          cv=5, scoring=scoring,
          return_train_score=True
      )
      
      # Wall-clock time for the full 5-fold run (fitting plus scoring), not a single fit
      training_time = time.time() - start
      
      result = {'Model': name, 'Training Time (s)': training_time}
      for metric in scoring:
          result[f'{metric} (train)'] = cv_results[f'train_{metric}'].mean()
          result[f'{metric} (val)'] = cv_results[f'test_{metric}'].mean()
          result[f'{metric} (std)'] = cv_results[f'test_{metric}'].std()
      
      results.append(result)

  # Step 4: Create comparison DataFrame
  df_results = pd.DataFrame(results)

  # Step 5: Generate report
  print("\n" + "="*80)
  print("📊 MODEL COMPARISON REPORT")
  print("="*80)

  # Accuracy ranking
  print("\n🎯 ACCURACY RANKING:")
  accuracy_rank = df_results[['Model', 'accuracy (val)', 'accuracy (std)']].sort_values(
      'accuracy (val)', ascending=False
  )
  for i, row in accuracy_rank.iterrows():
      print(f"  {row['Model']:25s}: {row['accuracy (val)']:.4f} (+/- {row['accuracy (std)']:.4f})")

  # F1 ranking
  print("\n📈 F1 SCORE RANKING:")
  f1_rank = df_results[['Model', 'f1 (val)', 'f1 (std)']].sort_values('f1 (val)', ascending=False)
  for i, row in f1_rank.iterrows():
      print(f"  {row['Model']:25s}: {row['f1 (val)']:.4f} (+/- {row['f1 (std)']:.4f})")

  # Speed ranking
  print("\n⚡ SPEED RANKING:")
  speed_rank = df_results[['Model', 'Training Time (s)']].sort_values('Training Time (s)')
  for i, row in speed_rank.iterrows():
      print(f"  {row['Model']:25s}: {row['Training Time (s)']:.3f}s")

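  # Overfitting check: flag a train-validation accuracy gap above 0.05
  # (0.05 is a rule-of-thumb cutoff, not a formal test)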
  print("\n🔍 OVERFITTING ANALYSIS:")
  for _, row in df_results.iterrows():
      train_acc = row['accuracy (train)']
      val_acc = row['accuracy (val)']
      gap = train_acc - val_acc
      status = "⚠️ Overfitting" if gap > 0.05 else "✅ OK"
      print(f"  {row['Model']:25s}: Train={train_acc:.4f}, Val={val_acc:.4f}, Gap={gap:.4f} {status}")

  # Step 6: Visualization
  fig, axes = plt.subplots(2, 2, figsize=(14, 10))

  # Accuracy comparison
  ax1 = axes[0, 0]
  x = np.arange(len(models))
  width = 0.35
  ax1.bar(x - width/2, df_results['accuracy (train)'], width, label='Train', color='lightblue')
  ax1.bar(x + width/2, df_results['accuracy (val)'], width, label='Validation', color='steelblue')
  ax1.set_xticks(x)
  ax1.set_xticklabels(df_results['Model'], rotation=45, ha='right')
  ax1.set_ylabel('Accuracy')
  ax1.set_title('Accuracy: Train vs Validation')
  ax1.legend()
  ax1.set_ylim(0.8, 1.05)

  # Multi-metric comparison
  ax2 = axes[0, 1]
  metrics_to_plot = ['accuracy (val)', 'precision (val)', 'recall (val)', 'f1 (val)']
  x = np.arange(len(models))
  width = 0.2
  for i, metric in enumerate(metrics_to_plot):
      ax2.bar(x + i*width, df_results[metric], width, label=metric.replace(' (val)', ''))
  ax2.set_xticks(x + width * 1.5)
  ax2.set_xticklabels(df_results['Model'], rotation=45, ha='right')
  ax2.set_ylabel('Score')
  ax2.set_title('Multi-Metric Comparison')
  ax2.legend()
  ax2.set_ylim(0.8, 1.05)

  # Training time
  ax3 = axes[1, 0]
  ax3.barh(df_results['Model'], df_results['Training Time (s)'], color='coral')
  ax3.set_xlabel('Time (seconds)')
  ax3.set_title('Training Time (5-Fold CV)')

  # ROC-AUC comparison
  ax4 = axes[1, 1]
  roc_data = df_results[['Model', 'roc_auc (val)', 'roc_auc (std)']].sort_values('roc_auc (val)')
  ax4.barh(roc_data['Model'], roc_data['roc_auc (val)'], 
          xerr=roc_data['roc_auc (std)'], color='green', capsize=3)
  ax4.set_xlabel('ROC-AUC')
  ax4.set_title('ROC-AUC Comparison')
  ax4.set_xlim(0.9, 1.0)

  plt.tight_layout()
  plt.savefig('model_comparison_report.png', dpi=150)

  # Step 7: Recommendation
  print("\n" + "="*80)
  print("💡 RECOMMENDATION")
  print("="*80)
  best_model = df_results.loc[df_results['f1 (val)'].idxmax(), 'Model']
  print(f"\nBest overall model: {best_model}")
  print("Based on: cross-validated F1 score (balance of precision and recall)")
  ```
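
  The report stops at cross-validated scores on the training split; `X_test` is created in Step 1 but never used. A minimal sketch of the missing last step, assuming the same data and split and only two of the candidate models to keep it short: refit the winner on the full training split, then score it once on the held-out test set. The names `candidates` and `final_model` are illustrative.

  ```python
  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import classification_report, f1_score
  from sklearn.model_selection import cross_val_score, train_test_split
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler

  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=42
  )

  # Two of the candidates from the comparison (illustrative subset)
  candidates = {
      'Logistic Regression': Pipeline([
          ('scaler', StandardScaler()),
          ('clf', LogisticRegression(max_iter=1000, random_state=42)),
      ]),
      'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
  }

  # Model selection uses the training split only
  cv_f1 = {
      name: cross_val_score(model, X_train, y_train, cv=5, scoring='f1').mean()
      for name, model in candidates.items()
  }
  best_name = max(cv_f1, key=cv_f1.get)

  # Refit the winner on all training data, then touch the test set exactly once
  final_model = candidates[best_name].fit(X_train, y_train)
  y_pred = final_model.predict(X_test)
  test_f1 = f1_score(y_test, y_pred)

  print(f"Selected by CV: {best_name} (CV F1 = {cv_f1[best_name]:.4f})")
  print(f"Held-out test F1: {test_f1:.4f}")
  print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))
  ```

  Because the test set played no part in choosing the model, this single number is the least biased estimate of how the chosen model will behave on new data.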

  **What you learned:**

  * Systematic model comparison methodology
  * Multiple metrics reveal different model strengths
  * Overfitting detection through train-validation gap
  * Speed vs accuracy trade-offs
</details>

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Never Evaluate on Training Data" icon="skull-crossbones">
    Always use a held-out test set or cross-validation
  </Card>

  <Card title="Accuracy Is Not Enough" icon="circle-exclamation">
    Use precision, recall, F1, AUC depending on the problem
  </Card>

  <Card title="Cross-Validation" icon="repeat">
    More reliable than a single train-test split
  </Card>

  <Card title="Watch for Leakage" icon="eye">
    Test data must not influence training in any way
  </Card>
</CardGroup>

***

## 🧹 Real-World Complications: Messy Data Evaluation

<Accordion title="Evaluating Models on Messy Data" icon="broom">
  Real-world data creates evaluation challenges. Here's how to handle them:

  ### Handling Class Imbalance in Evaluation

  ```python theme={null}
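  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import classification_report, balanced_accuracy_score
  from sklearn.model_selection import train_test_split

  # Create imbalanced dataset (about 5% positive class)
  X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
  # stratify=y keeps the 95/5 class ratio in both splits
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=42
  )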

  # Train model
  model = RandomForestClassifier(random_state=42)
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)

  # BAD: Regular accuracy looks great
  # Looks high, but always predicting "negative" would already score ~95%
  print(f"Regular Accuracy: {(y_pred == y_test).mean():.4f}")

  # BETTER: Balanced accuracy accounts for imbalance
  print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}")

  # BEST: Look at per-class metrics
  print("\nPer-Class Metrics:")
  print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
  ```
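
  Balanced accuracy is the average of per-class recall, so a model that ignores the minority class scores 0.5 however skewed the data is. A small sketch with made-up labels (95 negatives, 5 positives) and an always-negative predictor makes the arithmetic explicit:

  ```python
  import numpy as np
  from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

  # Made-up labels for illustration: 95 negatives, 5 positives
  y_true = np.array([0] * 95 + [1] * 5)
  y_majority = np.zeros_like(y_true)  # always predicts "negative"

  accuracy = accuracy_score(y_true, y_majority)

  # Recall for each class: fraction of that class predicted correctly
  recall_neg = recall_score(y_true, y_majority, pos_label=0)
  recall_pos = recall_score(y_true, y_majority, pos_label=1)

  # Balanced accuracy = mean of the per-class recalls
  manual_balanced = (recall_neg + recall_pos) / 2
  sklearn_balanced = balanced_accuracy_score(y_true, y_majority)

  print(f"Accuracy:          {accuracy:.2f}")
  print(f"Recall (negative): {recall_neg:.2f}")
  print(f"Recall (positive): {recall_pos:.2f}")
  print(f"Balanced accuracy: {manual_balanced:.2f} (sklearn: {sklearn_balanced:.2f})")
  ```

  The 0.95 accuracy comes entirely from the negative class; the positive-class recall of 0 drags balanced accuracy down to chance level.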

  ### Evaluating with Missing Values

  ```python theme={null}
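  import numpy as np
  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.impute import SimpleImputer
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler

  # Many real datasets have missing values.
  # Simulate that here by blanking out ~10% of the entries at random (illustrative only).
  X, y = load_breast_cancer(return_X_y=True)
  rng = np.random.default_rng(42)
  X_with_missing = X.copy()
  X_with_missing[rng.random(X.shape) < 0.10] = np.nan

  # DON'T fit the imputer on the full dataset before splitting:
  # the held-out rows would leak into the imputation statistics.

  # CORRECT: Imputation is part of the model pipeline
  pipeline = Pipeline([
      ('imputer', SimpleImputer(strategy='median')),
      ('scaler', StandardScaler()),
      ('model', RandomForestClassifier(random_state=42))
  ])

  # Cross-validation refits the imputer on each training fold only
  cv_scores = cross_val_score(pipeline, X_with_missing, y, cv=5)
  print(f"CV Score with proper imputation: {cv_scores.mean():.4f}")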
  ```
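
  To see why the imputer belongs inside the pipeline, the sketch below (synthetic data, with values blanked out at random purely for illustration) checks that the medians the fitted pipeline uses come from the training rows alone. The test rows are filled with those training medians and never contribute to them.

  ```python
  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.impute import SimpleImputer
  from sklearn.model_selection import train_test_split
  from sklearn.pipeline import Pipeline

  # Synthetic data with ~10% of entries removed at random (illustrative assumption)
  X, y = make_classification(n_samples=500, n_features=8, random_state=42)
  rng = np.random.default_rng(42)
  X_with_missing = X.copy()
  X_with_missing[rng.random(X.shape) < 0.10] = np.nan

  X_train, X_test, y_train, y_test = train_test_split(
      X_with_missing, y, test_size=0.2, random_state=42
  )

  pipeline = Pipeline([
      ('imputer', SimpleImputer(strategy='median')),
      ('model', RandomForestClassifier(random_state=42)),
  ])
  pipeline.fit(X_train, y_train)

  # Medians learned by the fitted imputer vs. medians of the training rows
  learned = pipeline.named_steps['imputer'].statistics_
  train_medians = np.nanmedian(X_train, axis=0)
  all_medians = np.nanmedian(X_with_missing, axis=0)

  print("Matches training medians:", np.allclose(learned, train_medians))
  print("Matches full-data medians:", np.allclose(learned, all_medians))
  print(f"Test accuracy: {pipeline.score(X_test, y_test):.4f}")
  ```

  Calling `SimpleImputer().fit_transform` on the full matrix before splitting would instead bake test-set information into every imputed value.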

  ### Evaluating on Time Series (No Random Split!)

  ```python theme={null}
  from sklearn.model_selection import TimeSeriesSplit

  # BAD: Random split leaks future info into training
  # X_train, X_test, y_train, y_test = train_test_split(X, y)  # WRONG for time series!

  # GOOD: Time-aware split (rows of X and y must already be sorted by time)
  tscv = TimeSeriesSplit(n_splits=5)
  for train_idx, test_idx in tscv.split(X):
      # Test is always AFTER train in time
      X_train, X_test = X[train_idx], X[test_idx]
      y_train, y_test = y[train_idx], y[test_idx]
  ```
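
  What `TimeSeriesSplit` produces is easiest to see on a toy array. The sketch below uses 12 time-ordered samples (made-up data) and prints each fold's indices: the training window grows while the test window always sits immediately after it.

  ```python
  import numpy as np
  from sklearn.model_selection import TimeSeriesSplit

  # 12 observations, assumed to be already sorted from oldest to newest
  X_toy = np.arange(12).reshape(-1, 1)

  tscv = TimeSeriesSplit(n_splits=5)
  folds = list(tscv.split(X_toy))

  for fold, (train_idx, test_idx) in enumerate(folds, start=1):
      print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

  # Every training index precedes every test index in each fold
  no_future_leak = all(
      train_idx.max() < test_idx.min() for train_idx, test_idx in folds
  )
  print("Training data always precedes test data:", no_future_leak)
  ```

  Early folds train on very little data, so their scores tend to be noisier than later ones; looking at per-fold scores as well as the mean helps here.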

  ### Detecting Evaluation Errors

  | Symptom                        | Likely Problem     | Solution                             |
  | ------------------------------ | ------------------ | ------------------------------------ |
  | Train acc = 100%, Test acc low | Overfitting        | More regularization, less complexity |
  | Train and test acc both \~100% | Data leakage       | Check for target in features         |
  | Accuracy high, F1 low          | Class imbalance    | Use balanced metrics                 |
  | CV variance very high          | Small dataset      | Repeated k-fold CV or bootstrap      |
  | Test performance varies wildly | Data order matters | Use stratified or time-aware splits  |
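
  The "check for target in features" advice can be partly automated. One rough heuristic, sketched below on synthetic data with a deliberately leaked column, is to score each feature on its own and flag any that predicts the target almost perfectly. The 0.99 threshold and the helper name `flag_suspicious_features` are assumptions for illustration, not a standard API.

  ```python
  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.model_selection import cross_val_score
  from sklearn.tree import DecisionTreeClassifier

  def flag_suspicious_features(X, y, threshold=0.99, cv=5):
      """Return (index, score) for features that alone reach near-perfect CV accuracy."""
      flagged = []
      for j in range(X.shape[1]):
          # A depth-1 tree can only threshold this one feature
          stump = DecisionTreeClassifier(max_depth=1, random_state=42)
          score = cross_val_score(stump, X[:, [j]], y, cv=cv).mean()
          if score >= threshold:
              flagged.append((j, score))
      return flagged

  # Noisy labels (flip_y) so that no honest feature can look perfect
  X, y = make_classification(n_samples=500, n_features=5, flip_y=0.1, random_state=42)

  # Simulate leakage: append a column that is just the target plus tiny noise
  rng = np.random.default_rng(42)
  leaked = y + rng.normal(scale=0.01, size=y.shape)
  X_leaky = np.column_stack([X, leaked])

  suspicious = flag_suspicious_features(X_leaky, y)
  for idx, score in suspicious:
      print(f"Feature {idx} alone reaches {score:.3f} CV accuracy: inspect it for leakage")
  ```

  A flagged feature is not proof of leakage, only a prompt to ask where the column comes from and whether it would be available at prediction time.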
</Accordion>

***

## What's Next?

Before training, you need to prepare your data. Feature engineering can make or break your model!

<Card title="Continue to Module 8: Feature Engineering" icon="arrow-right" href="/courses/ml-mastery/08-feature-engineering">
  Learn how to transform raw data into powerful features
</Card>