Cross-Validation Strategies
Why One Train-Test Split Isn’t Enough
The Lucky Split Problem
K-Fold Cross-Validation
Stratified K-Fold
Leave-One-Out Cross-Validation (LOOCV)
Time Series Cross-Validation
Group K-Fold
Nested Cross-Validation
Repeated Cross-Validation
Choosing the Right Strategy
Summary

Cross-Validation Strategies

Why One Train-Test Split Isn’t Enough

You built a model. It got 95% accuracy on your test set. Ship it? Not so fast.

That 95% might be luck. Your test set might have been “easy.” A different split might show 75%.

The Lucky Split Problem

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Try multiple splits
accuracies = []
for seed in range(50):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    accuracies.append(acc)

print(f"Accuracy range: {min(accuracies):.3f} to {max(accuracies):.3f}")
print(f"Standard deviation: {np.std(accuracies):.3f}")

Output:

Accuracy range: 0.820 to 0.920
Standard deviation: 0.024

That’s a 10 percentage point swing just from changing the random split!

Estimated Time: 2-3 hours
Difficulty: Intermediate
Prerequisites: Model Evaluation chapter
Tools: scikit-learn, numpy

K-Fold Cross-Validation

The gold standard. Split data into K parts, train on K-1, test on 1, rotate.

Fold 1: [TEST] [Train] [Train] [Train] [Train]
Fold 2: [Train] [TEST] [Train] [Train] [Train]
Fold 3: [Train] [Train] [TEST] [Train] [Train]
Fold 4: [Train] [Train] [Train] [TEST] [Train]
Fold 5: [Train] [Train] [Train] [Train] [TEST]

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# K-Fold cross-validation
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Different values of K
k_values = [3, 5, 10, 15, 20]
results = {}

for k in k_values:
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    results[k] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'scores': scores
    }
    print(f"K={k}: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot of scores
boxes = [results[k]['scores'] for k in k_values]
axes[0].boxplot(boxes, labels=k_values)
axes[0].set_xlabel('Number of Folds (K)')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('K-Fold Cross-Validation Scores')
axes[0].axhline(y=np.mean([r['mean'] for r in results.values()]), 
                color='red', linestyle='--', label='Average')
axes[0].legend()

# Mean and std
means = [results[k]['mean'] for k in k_values]
stds = [results[k]['std'] for k in k_values]
axes[1].errorbar(k_values, means, yerr=stds, fmt='o-', capsize=5)
axes[1].set_xlabel('Number of Folds (K)')
axes[1].set_ylabel('Mean Accuracy')
axes[1].set_title('Mean Accuracy with Standard Deviation')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Choosing K:

K=5: Good default, balances bias and variance
K=10: More reliable estimate, more compute
K=n (LOOCV): Lowest bias, highest variance, very expensive

Stratified K-Fold

When classes are imbalanced, regular K-Fold can create folds with unequal class distributions.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, 
    n_features=20,
    weights=[0.9, 0.1],  # 90% class 0, 10% class 1
    random_state=42
)

print(f"Class distribution: {np.bincount(y)}")

# Regular K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
print("\nRegular K-Fold class distribution:")
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    train_dist = np.bincount(y[train_idx])
    test_dist = np.bincount(y[test_idx])
    print(f"Fold {fold+1}: Train={train_dist}, Test={test_dist}")

# Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("\nStratified K-Fold class distribution:")
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    train_dist = np.bincount(y[train_idx])
    test_dist = np.bincount(y[test_idx])
    print(f"Fold {fold+1}: Train={train_dist}, Test={test_dist}")

Output:

Class distribution: [900 100]

Regular K-Fold class distribution:
Fold 1: Train=[720  80], Test=[180  20]
Fold 2: Train=[718  82], Test=[182  18]
Fold 3: Train=[722  78], Test=[178  22]
Fold 4: Train=[719  81], Test=[181  19]
Fold 5: Train=[721  79], Test=[179  21]

Stratified K-Fold class distribution:
Fold 1: Train=[720  80], Test=[180  20]
Fold 2: Train=[720  80], Test=[180  20]
Fold 3: Train=[720  80], Test=[180  20]
Fold 4: Train=[720  80], Test=[180  20]
Fold 5: Train=[720  80], Test=[180  20]

Stratified preserves class ratios in every fold!

Leave-One-Out Cross-Validation (LOOCV)

Extreme case: K = n (number of samples). Train on all but one, test on one.

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
import time

iris = load_iris()
X, y = iris.data, iris.target

model = KNeighborsClassifier(n_neighbors=3)

# Time comparison
start = time.time()
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
loo_time = time.time() - start

start = time.time()
kf_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))
kf_time = time.time() - start

print(f"LOOCV: {loo_scores.mean():.3f} (+/- {loo_scores.std()*2:.3f})")
print(f"Time: {loo_time:.2f}s, {len(loo_scores)} iterations")

print(f"\n5-Fold: {kf_scores.mean():.3f} (+/- {kf_scores.std()*2:.3f})")
print(f"Time: {kf_time:.2f}s, {len(kf_scores)} iterations")

LOOCV Pitfalls:

Computationally expensive (n train-test cycles)
High variance in estimates
Use only for very small datasets

Time Series Cross-Validation

Time series data is special: you can’t peek at the future!

from sklearn.model_selection import TimeSeriesSplit
import numpy as np
import matplotlib.pyplot as plt

# Generate time series data
np.random.seed(42)
n = 100
time_index = np.arange(n)
X = np.random.randn(n, 5)
y = np.sin(time_index * 0.1) + np.random.randn(n) * 0.1

# Time Series Split
tscv = TimeSeriesSplit(n_splits=5)

fig, ax = plt.subplots(figsize=(12, 6))

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    train_y = np.ones(len(train_idx)) * fold
    test_y = np.ones(len(test_idx)) * fold
    
    ax.scatter(train_idx, train_y, c='blue', marker='s', s=20, label='Train' if fold == 0 else '')
    ax.scatter(test_idx, test_y, c='red', marker='o', s=20, label='Test' if fold == 0 else '')

ax.set_xlabel('Time Index')
ax.set_ylabel('Fold Number')
ax.set_title('Time Series Cross-Validation')
ax.legend()
ax.set_yticks(range(5))
ax.set_yticklabels([f'Fold {i+1}' for i in range(5)])
plt.show()

# Proper time series CV
from sklearn.linear_model import Ridge

scores = []
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    model = Ridge()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
    
    print(f"Train: {train_idx[0]}-{train_idx[-1]}, Test: {test_idx[0]}-{test_idx[-1]}, Score: {scores[-1]:.3f}")

print(f"\nMean score: {np.mean(scores):.3f}")

Why it matters:

Approach	Problem
Regular K-Fold on time series	Data leakage! Training on future to predict past
TimeSeriesSplit	Always predicts future from past

Group K-Fold

When data has groups (e.g., multiple samples from same patient), ensure entire groups stay together.

from sklearn.model_selection import GroupKFold
import numpy as np

# Medical data: multiple readings per patient
np.random.seed(42)
n_patients = 20
readings_per_patient = 5

X = np.random.randn(n_patients * readings_per_patient, 10)
y = np.random.randint(0, 2, n_patients * readings_per_patient)
groups = np.repeat(np.arange(n_patients), readings_per_patient)

print(f"Total samples: {len(X)}")
print(f"Unique patients: {len(np.unique(groups))}")

# Group K-Fold
gkf = GroupKFold(n_splits=5)

print("\nGroup K-Fold splits (patients in each fold):")
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups)):
    train_patients = np.unique(groups[train_idx])
    test_patients = np.unique(groups[test_idx])
    
    # Check for overlap
    overlap = np.intersect1d(train_patients, test_patients)
    
    print(f"Fold {fold+1}: Train patients={len(train_patients)}, Test patients={len(test_patients)}, Overlap={len(overlap)}")

When to use Group K-Fold:

Medical: Multiple readings per patient
E-commerce: Multiple transactions per user
Text: Multiple documents per author
Any scenario where samples aren’t independent

Nested Cross-Validation

For hyperparameter tuning + evaluation. Two loops:

Outer loop: Evaluate model
Inner loop: Tune hyperparameters

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# WRONG: Use same data for tuning and evaluation
model = SVC()
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']}

# This leaks info!
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Optimistic score: {grid_search.best_score_:.3f}")

# RIGHT: Nested cross-validation
from sklearn.model_selection import cross_val_score

# Inner CV for tuning (passed to GridSearchCV)
# Outer CV for evaluation (passed to cross_val_score)
nested_score = cross_val_score(
    GridSearchCV(SVC(), param_grid, cv=5),
    X, y, cv=5, scoring='accuracy'
)

print(f"\nNested CV score: {nested_score.mean():.3f} (+/- {nested_score.std()*2:.3f})")
print("This is the unbiased estimate!")

Best params: {'C': 10, 'gamma': 'scale'}
Optimistic score: 0.977

Nested CV score: 0.964 (+/- 0.024)
This is the unbiased estimate!

The nested CV score is lower but honest!

Repeated Cross-Validation

Run K-Fold multiple times with different shuffles for more stable estimates:

from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Single 5-fold
single_scores = cross_val_score(model, X, y, cv=5)

# Repeated 5-fold (10 repetitions)
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
repeated_scores = cross_val_score(model, X, y, cv=rkf)

print(f"Single 5-Fold: {single_scores.mean():.3f} (+/- {single_scores.std()*2:.3f})")
print(f"Repeated 5-Fold (10x): {repeated_scores.mean():.3f} (+/- {repeated_scores.std()*2:.3f})")

# Visualize
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].hist(single_scores, bins=5, edgecolor='black', alpha=0.7)
axes[0].axvline(single_scores.mean(), color='red', linestyle='--')
axes[0].set_title(f'Single 5-Fold (n=5)')
axes[0].set_xlabel('Accuracy')
axes[0].set_ylabel('Count')

axes[1].hist(repeated_scores, bins=20, edgecolor='black', alpha=0.7)
axes[1].axvline(repeated_scores.mean(), color='red', linestyle='--')
axes[1].set_title(f'Repeated 5-Fold 10x (n=50)')
axes[1].set_xlabel('Accuracy')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()

Choosing the Right Strategy

Scenario	Recommended CV
General classification	Stratified K-Fold (K=5 or 10)
Regression	K-Fold (K=5 or 10)
Time series	TimeSeriesSplit
Grouped data	GroupKFold
Hyperparameter tuning	Nested CV
Very small dataset	LOOCV or Repeated K-Fold
Imbalanced classes	Stratified K-Fold
Critical applications	Repeated Stratified K-Fold

Summary

Cross-validation transforms unreliable single-split estimates into robust performance measures:

K-Fold: Standard approach, every sample tested exactly once
Stratified: Maintains class balance
Time Series: Respects temporal order
Group: Keeps related samples together
Nested: Unbiased tuning + evaluation
Repeated: Reduces variance in estimates

Rule of Thumb: When in doubt, use Stratified 5-Fold for classification and 5-Fold for regression. Add repetition for critical applications.

# Your go-to template
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

Common Mistakes Anomaly Detection

Overview

Testing & Code Quality

Crash Courses

AI Engineering

Math for ML - Understanding Linear Algebra

Probability & Statistics for ML

Math for ML - Understanding Calculus

ML Mastery

Deep Learning Mastery

NestJS Mastery

Microservices Mastery

Low Level Design

OOP Concepts

SOLID Principles

Design Patterns

LLD Case Studies

System Design (HLD)

Senior Level (L5+/Staff)

HLD Case Studies

Engineering Fundamentals

DevOps & Operations

Azure Cloud Engineering

AWS Cloud

AWS Monitoring & Observability

AWS Security Services

AWS Serverless

AWS Operations

AWS Advanced

AWS Case Studies

GCP Cloud Engineering

DevOps Tools

Database Engineering

HIPAA Compliance Mastery

Operating Systems

Linux Internals

Distributed Systems

Networking Mastery

Build Your Own X

Go Lang Mastery

C Programming

Classic Research Papers

Distributed System Tools

​Cross-Validation Strategies

​Why One Train-Test Split Isn’t Enough

​The Lucky Split Problem

​K-Fold Cross-Validation

​Stratified K-Fold

​Leave-One-Out Cross-Validation (LOOCV)

​Time Series Cross-Validation

​Group K-Fold

​Nested Cross-Validation

​Repeated Cross-Validation

​Choosing the Right Strategy

​Summary

Cross-Validation Strategies

Why One Train-Test Split Isn’t Enough

The Lucky Split Problem

K-Fold Cross-Validation

Stratified K-Fold

Leave-One-Out Cross-Validation (LOOCV)

Time Series Cross-Validation

Group K-Fold

Nested Cross-Validation

Repeated Cross-Validation

Choosing the Right Strategy

Summary