Cross-Validation Strategies

Why One Train-Test Split Is Not Enough

You built a model. It got 95% accuracy on your test set. Ship it? Not so fast. A single train-test split is like evaluating a student based on one question. Maybe they happened to know that question. Maybe they got lucky. You need multiple questions to get a reliable assessment.
That 95% might be luck. Your test set might have been “easy.” A different random split might show 75%. Cross-validation replaces one unreliable measurement with many reliable ones.

The Lucky Split Problem

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Try multiple splits
accuracies = []
for seed in range(50):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    accuracies.append(acc)

print(f"Accuracy range: {min(accuracies):.3f} to {max(accuracies):.3f}")
print(f"Standard deviation: {np.std(accuracies):.3f}")
Output:
Accuracy range: 0.820 to 0.920
Standard deviation: 0.024
That’s a 10 percentage point swing just from changing the random split!
Estimated Time: 2-3 hours
Difficulty: Intermediate
Prerequisites: Model Evaluation chapter
Tools: scikit-learn, numpy

K-Fold Cross-Validation

The gold standard for model evaluation. The idea is simple but powerful: split data into K equal parts (folds). Train on K-1 folds, test on the remaining one. Rotate so every fold gets a turn as the test set. Every data point is used for both training and testing, just never at the same time.
Fold 1: [TEST] [Train] [Train] [Train] [Train]
Fold 2: [Train] [TEST] [Train] [Train] [Train]
Fold 3: [Train] [Train] [TEST] [Train] [Train]
Fold 4: [Train] [Train] [Train] [TEST] [Train]
Fold 5: [Train] [Train] [Train] [Train] [TEST]
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# K-Fold cross-validation
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Different values of K
k_values = [3, 5, 10, 15, 20]
results = {}

for k in k_values:
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    results[k] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'scores': scores
    }
    print(f"K={k}: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot of scores
boxes = [results[k]['scores'] for k in k_values]
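axes[0].boxplot(boxes)
# Set tick labels separately: boxplot's labels= argument was renamed in Matplotlib 3.9
axes[0].set_xticklabels(k_values)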
axes[0].set_xlabel('Number of Folds (K)')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('K-Fold Cross-Validation Scores')
axes[0].axhline(y=np.mean([r['mean'] for r in results.values()]), 
                color='red', linestyle='--', label='Average')
axes[0].legend()

# Mean and std
means = [results[k]['mean'] for k in k_values]
stds = [results[k]['std'] for k in k_values]
axes[1].errorbar(k_values, means, yerr=stds, fmt='o-', capsize=5)
axes[1].set_xlabel('Number of Folds (K)')
axes[1].set_ylabel('Mean Accuracy')
axes[1].set_title('Mean Accuracy with Standard Deviation')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Choosing K:
  • K=5: Good default, balances bias and variance
  • K=10: More reliable estimate, more compute
  • K=n (LOOCV): Lowest bias, highest variance, very expensive
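The rotation described above can be checked directly. The following is a minimal sketch, assuming a toy array of 23 rows (a size chosen here only because it is not divisible by 5), that counts how often each row lands in a test fold. The variable names are illustrative, not part of the tutorial's code.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 23  # assumption: deliberately not divisible by the number of folds
X_demo = np.arange(n_samples).reshape(-1, 1)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
test_counts = np.zeros(n_samples, dtype=int)
fold_sizes = []

for train_idx, test_idx in cv.split(X_demo):
    # Within a fold, train and test indices never overlap
    assert np.intersect1d(train_idx, test_idx).size == 0
    test_counts[test_idx] += 1
    fold_sizes.append(len(test_idx))

print("Test fold sizes:", fold_sizes)
print("Each sample tested exactly once:", bool(np.all(test_counts == 1)))
```

When the sample count is not a multiple of K, the folds are only approximately equal: the first few folds receive one extra sample each.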

Stratified K-Fold

When classes are imbalanced (e.g., 90% negative, 10% positive), regular K-Fold can create folds where one fold has 15% positive samples and another has only 5%. This makes fold-to-fold comparison noisy and can give misleading results. Stratified K-Fold solves this by ensuring each fold has approximately the same class ratio as the full dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, 
    n_features=20,
    weights=[0.9, 0.1],  # 90% class 0, 10% class 1
    random_state=42
)

print(f"Class distribution: {np.bincount(y)}")

# Regular K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
print("\nRegular K-Fold class distribution:")
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    train_dist = np.bincount(y[train_idx])
    test_dist = np.bincount(y[test_idx])
    print(f"Fold {fold+1}: Train={train_dist}, Test={test_dist}")

# Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("\nStratified K-Fold class distribution:")
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    train_dist = np.bincount(y[train_idx])
    test_dist = np.bincount(y[test_idx])
    print(f"Fold {fold+1}: Train={train_dist}, Test={test_dist}")
Output:
Class distribution: [900 100]

Regular K-Fold class distribution:
Fold 1: Train=[720  80], Test=[180  20]
Fold 2: Train=[718  82], Test=[182  18]
Fold 3: Train=[722  78], Test=[178  22]
Fold 4: Train=[719  81], Test=[181  19]
Fold 5: Train=[721  79], Test=[179  21]

Stratified K-Fold class distribution:
Fold 1: Train=[720  80], Test=[180  20]
Fold 2: Train=[720  80], Test=[180  20]
Fold 3: Train=[720  80], Test=[180  20]
Fold 4: Train=[720  80], Test=[180  20]
Fold 5: Train=[720  80], Test=[180  20]
Stratified preserves class ratios in every fold!
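In many cases you get stratification without asking for it. When an integer is passed as cv and the estimator is a classifier with a binary or multiclass target, scikit-learn resolves it to StratifiedKFold; otherwise it falls back to KFold. The sketch below uses check_cv, the helper scikit-learn exposes for this resolution, on two made-up targets (the arrays are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_labels = np.array([0] * 90 + [1] * 10)  # assumed imbalanced class labels
y_values = np.linspace(0.0, 1.0, 100)     # assumed continuous regression target

# What cv=5 turns into for a classifier versus a regressor
clf_cv = check_cv(5, y_labels, classifier=True)
reg_cv = check_cv(5, y_values, classifier=False)

print(type(clf_cv).__name__)  # StratifiedKFold
print(type(reg_cv).__name__)  # KFold
```

Neither default splitter shuffles, so if your rows are sorted (by class, by date, by site), pass an explicit splitter with shuffle=True where shuffling is appropriate.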

Leave-One-Out Cross-Validation (LOOCV)

The extreme case: K = n (number of samples). Train on all data except one sample, test on that one sample, then rotate through every sample. This gives the lowest bias (each training set is nearly the full dataset) but a high-variance estimate: the n training sets overlap almost completely, so the per-fold results are strongly correlated and averaging them cancels little noise. It is also computationally expensive (n full training cycles).
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
import time

iris = load_iris()
X, y = iris.data, iris.target

model = KNeighborsClassifier(n_neighbors=3)

# Time comparison
start = time.time()
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
loo_time = time.time() - start

start = time.time()
# iris is sorted by class, so shuffle before splitting into folds
kf_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42))
kf_time = time.time() - start

print(f"LOOCV: {loo_scores.mean():.3f} (+/- {loo_scores.std()*2:.3f})")
print(f"Time: {loo_time:.2f}s, {len(loo_scores)} iterations")

print(f"\n5-Fold: {kf_scores.mean():.3f} (+/- {kf_scores.std()*2:.3f})")
print(f"Time: {kf_time:.2f}s, {len(kf_scores)} iterations")
LOOCV Pitfalls:
  • Computationally expensive (n train-test cycles)
  • High variance in estimates
  • Use only for very small datasets
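One more practical wrinkle: each LOOCV fold contains a single sample, so every per-fold accuracy is either 0 or 1, and the standard deviation of those scores says little about the uncertainty of the mean. The sketch below shows this and, as a rough alternative, computes a binomial standard error. That formula treats the per-sample results as independent, which is only an approximation for LOOCV.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
loo_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut()
)

n = len(loo_scores)
p_hat = loo_scores.mean()
# Rough binomial standard error; assumes independent per-sample outcomes
std_error = np.sqrt(p_hat * (1 - p_hat) / n)

print(f"Distinct per-fold scores: {np.unique(loo_scores)}")
print(f"LOOCV accuracy: {p_hat:.3f} (approx. standard error {std_error:.3f})")
```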

Time Series Cross-Validation

Time series data is special: you cannot peek at the future. Standard K-Fold ignores time order. Whether or not it shuffles, most of its folds train on observations that come after the ones being tested, so you might train on December data to predict the January before it. That is time travel, not machine learning. TimeSeriesSplit always trains on past data and tests on future data, respecting the arrow of time.
from sklearn.model_selection import TimeSeriesSplit
import numpy as np
import matplotlib.pyplot as plt

# Generate time series data
np.random.seed(42)
n = 100
time_index = np.arange(n)
X = np.random.randn(n, 5)
y = np.sin(time_index * 0.1) + np.random.randn(n) * 0.1

# Time Series Split
tscv = TimeSeriesSplit(n_splits=5)

fig, ax = plt.subplots(figsize=(12, 6))

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    train_y = np.ones(len(train_idx)) * fold
    test_y = np.ones(len(test_idx)) * fold
    
    ax.scatter(train_idx, train_y, c='blue', marker='s', s=20, label='Train' if fold == 0 else '')
    ax.scatter(test_idx, test_y, c='red', marker='o', s=20, label='Test' if fold == 0 else '')

ax.set_xlabel('Time Index')
ax.set_ylabel('Fold Number')
ax.set_title('Time Series Cross-Validation')
ax.legend()
ax.set_yticks(range(5))
ax.set_yticklabels([f'Fold {i+1}' for i in range(5)])
plt.show()

# Proper time series CV
from sklearn.linear_model import Ridge

scores = []
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    model = Ridge()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
    
    print(f"Train: {train_idx[0]}-{train_idx[-1]}, Test: {test_idx[0]}-{test_idx[-1]}, Score: {scores[-1]:.3f}")

print(f"\nMean score: {np.mean(scores):.3f}")
Why it matters:

| Approach | What happens |
| --- | --- |
| Regular K-Fold on time series | Data leakage! Training on future to predict past |
| TimeSeriesSplit | Always predicts future from past |
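The leakage can be made concrete by inspecting indices. This is a minimal sketch, assuming 100 rows whose index doubles as the timestamp. The helper count_leaky_folds is a hypothetical name introduced here, not a scikit-learn function.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

n_obs = 100
X_time = np.arange(n_obs).reshape(-1, 1)  # assumption: row index is the time order


def count_leaky_folds(splitter, X):
    """Count folds whose training set contains rows later than the earliest test row."""
    leaky = 0
    for train_idx, test_idx in splitter.split(X):
        if train_idx.max() > test_idx.min():
            leaky += 1
    return leaky


kfold_leaks = count_leaky_folds(KFold(n_splits=5), X_time)
tscv_leaks = count_leaky_folds(TimeSeriesSplit(n_splits=5), X_time)

print(f"KFold folds that train on the future: {kfold_leaks} of 5")
print(f"TimeSeriesSplit folds that train on the future: {tscv_leaks} of 5")
```

Even without shuffling, four of the five K-Fold folds train on rows that come after the test rows; only the last fold happens to be clean. TimeSeriesSplit has none.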

Group K-Fold

When data has natural groups (e.g., multiple readings from the same patient, multiple transactions from the same user), samples within a group are not independent. If patient A’s readings appear in both train and test, the model can “cheat” by recognizing patient-specific patterns rather than learning general medical rules. GroupKFold ensures that all samples from the same group stay together — either entirely in training or entirely in testing.
from sklearn.model_selection import GroupKFold
import numpy as np

# Medical data: multiple readings per patient
np.random.seed(42)
n_patients = 20
readings_per_patient = 5

X = np.random.randn(n_patients * readings_per_patient, 10)
y = np.random.randint(0, 2, n_patients * readings_per_patient)
groups = np.repeat(np.arange(n_patients), readings_per_patient)

print(f"Total samples: {len(X)}")
print(f"Unique patients: {len(np.unique(groups))}")

# Group K-Fold
gkf = GroupKFold(n_splits=5)

print("\nGroup K-Fold splits (patients in each fold):")
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups)):
    train_patients = np.unique(groups[train_idx])
    test_patients = np.unique(groups[test_idx])
    
    # Check for overlap
    overlap = np.intersect1d(train_patients, test_patients)
    
    print(f"Fold {fold+1}: Train patients={len(train_patients)}, Test patients={len(test_patients)}, Overlap={len(overlap)}")
When to use Group K-Fold:
  • Medical: Multiple readings per patient
  • E-commerce: Multiple transactions per user
  • Text: Multiple documents per author
  • Any scenario where samples aren’t independent
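To see what goes wrong without grouping, the sketch below counts patients shared between train and test under a shuffled KFold and under GroupKFold, then passes the groups through cross_val_score. The synthetic data mirrors the example above; the helper shared_patients is a hypothetical name used only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(42)
n_patients, readings_per_patient = 20, 5
groups = np.repeat(np.arange(n_patients), readings_per_patient)
X = rng.normal(size=(len(groups), 10))
y = rng.integers(0, 2, size=len(groups))


def shared_patients(splitter, **split_kwargs):
    """Largest number of patients found in both train and test within any fold."""
    worst = 0
    for train_idx, test_idx in splitter.split(X, y, **split_kwargs):
        overlap = np.intersect1d(groups[train_idx], groups[test_idx])
        worst = max(worst, len(overlap))
    return worst


kfold_overlap = shared_patients(KFold(n_splits=5, shuffle=True, random_state=42))
group_overlap = shared_patients(GroupKFold(n_splits=5), groups=groups)
print(f"Worst-case shared patients, KFold: {kfold_overlap}")
print(f"Worst-case shared patients, GroupKFold: {group_overlap}")

# cross_val_score forwards groups to the splitter
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, groups=groups, cv=GroupKFold(n_splits=5)
)
print(f"Grouped CV accuracy: {scores.mean():.3f}")
```

The labels here are random, so the accuracy itself is uninformative; the point is the overlap count and the groups argument.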

Nested Cross-Validation

Here is a subtle but important problem: if you use cross-validation to tune hyperparameters and then report that same CV score as your model’s performance, you are being too optimistic. The tuning process “peeked” at the validation folds by selecting the best hyperparameters for them. Nested CV fixes this with two loops:
  • Outer loop: Provides an unbiased estimate of model performance
  • Inner loop: Tunes hyperparameters (this is where GridSearchCV lives)
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# WRONG: Use same data for tuning and evaluation
model = SVC()
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']}

# This leaks info!
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Optimistic score: {grid_search.best_score_:.3f}")

# RIGHT: Nested cross-validation
from sklearn.model_selection import cross_val_score

# Inner CV for tuning (passed to GridSearchCV)
# Outer CV for evaluation (passed to cross_val_score)
nested_score = cross_val_score(
    GridSearchCV(SVC(), param_grid, cv=5),
    X, y, cv=5, scoring='accuracy'
)

print(f"\nNested CV score: {nested_score.mean():.3f} (+/- {nested_score.std()*2:.3f})")
print("This is the unbiased estimate!")
Output:
Best params: {'C': 10, 'gamma': 'scale'}
Optimistic score: 0.977

Nested CV score: 0.964 (+/- 0.024)
This is the unbiased estimate!
The nested CV score is lower but honest!
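Writing the two loops out by hand can make the mechanics easier to follow. This is a sketch of the same idea, assuming 3 inner folds and explicitly shuffled, seeded splitters (choices made here for illustration, not taken from the code above).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", "auto"]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

outer_scores = []
chosen_params = []
for train_idx, test_idx in outer_cv.split(X, y):
    search = GridSearchCV(SVC(), param_grid, cv=inner_cv)
    # The search only ever sees the outer-training rows
    search.fit(X[train_idx], y[train_idx])
    # The refit best model is scored on rows the search never touched
    outer_scores.append(search.score(X[test_idx], y[test_idx]))
    chosen_params.append(search.best_params_)

for params, score in zip(chosen_params, outer_scores):
    print(params, f"{score:.3f}")
print(f"Nested CV accuracy: {sum(outer_scores) / len(outer_scores):.3f}")
```

The selected hyperparameters may differ from one outer fold to the next. That is expected: nested CV estimates the performance of the whole tune-then-fit procedure, not of one fixed configuration.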

Repeated Cross-Validation

A single 5-fold CV gives you 5 scores. That is not much data to estimate a mean and standard deviation. Repeated CV runs the entire K-Fold process multiple times with different random shuffles, giving you (K x repeats) scores. Averaging over repetitions removes most of the noise that comes from one particular random partition, much as flipping a coin 50 times instead of 5 gives a steadier estimate of whether it is fair. The analogy is loose: the scores all reuse the same samples and are therefore correlated, so the gain is smaller than 50 independent measurements would give.
from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Single 5-fold
single_scores = cross_val_score(model, X, y, cv=5)

# Repeated 5-fold (10 repetitions), stratified to match cv=5 above,
# which resolves to StratifiedKFold for classifiers
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
repeated_scores = cross_val_score(model, X, y, cv=rkf)

print(f"Single 5-Fold: {single_scores.mean():.3f} (+/- {single_scores.std()*2:.3f})")
print(f"Repeated 5-Fold (10x): {repeated_scores.mean():.3f} (+/- {repeated_scores.std()*2:.3f})")

# Visualize
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].hist(single_scores, bins=5, edgecolor='black', alpha=0.7)
axes[0].axvline(single_scores.mean(), color='red', linestyle='--')
axes[0].set_title('Single 5-Fold (n=5)')
axes[0].set_xlabel('Accuracy')
axes[0].set_ylabel('Count')

axes[1].hist(repeated_scores, bins=20, edgecolor='black', alpha=0.7)
axes[1].axvline(repeated_scores.mean(), color='red', linestyle='--')
axes[1].set_title('Repeated 5-Fold 10x (n=50)')
axes[1].set_xlabel('Accuracy')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()
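To see how much a single 5-fold run can move around, you can regroup the repeated scores by repetition. The sketch below swaps in a scaled logistic regression (an assumption made to keep the run fast) and relies on the scores being returned one repetition after another.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

n_splits, n_repeats = 5, 10
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

# Each row is one complete 5-fold run; its mean is what a single CV would report
per_repeat_means = scores.reshape(n_repeats, n_splits).mean(axis=1)

print(f"Single-run means range: {per_repeat_means.min():.3f} to {per_repeat_means.max():.3f}")
print(f"Mean over all {len(scores)} scores: {scores.mean():.3f}")
```

The spread of the per-repetition means shows how much the partition alone can shift a single 5-fold result; the overall mean averages that partition noise away.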

Choosing the Right Strategy

| Scenario | Recommended CV |
| --- | --- |
| General classification | Stratified K-Fold (K=5 or 10) |
| Regression | K-Fold (K=5 or 10) |
| Time series | TimeSeriesSplit |
| Grouped data | GroupKFold |
| Hyperparameter tuning | Nested CV |
| Very small dataset | LOOCV or Repeated K-Fold |
| Imbalanced classes | Stratified K-Fold |
| Critical applications | Repeated Stratified K-Fold |

Summary

Cross-validation transforms unreliable single-split estimates into robust performance measures:
  • K-Fold: Standard approach, every sample tested exactly once
  • Stratified: Maintains class balance
  • Time Series: Respects temporal order
  • Group: Keeps related samples together
  • Nested: Unbiased tuning + evaluation
  • Repeated: Reduces variance in estimates
Rule of Thumb: When in doubt, use Stratified 5-Fold for classification and 5-Fold for regression. Add repetition for critical applications.
# Your go-to template (assumes model, X, and y are already defined)
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

Interview Deep-Dive

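Question: Model A scores 91.2% and Model B scores 90.8% in 5-fold CV accuracy. Is Model A the better model?
No, and the fact that many practitioners would say yes is a common mistake. A 0.4% difference in 5-fold CV accuracy is almost certainly within the noise of the evaluation procedure.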
  • Check the standard deviation. If Model A is 91.2% +/- 2.1% and Model B is 90.8% +/- 1.8%, the ranges overlap almost entirely, so the fold-to-fold spread dwarfs the 0.4-point gap. Overlap alone is not a formal test, though: you need a statistical test (e.g., a paired t-test on fold-level scores computed on identical folds) to determine whether the difference is significant.
  • Use repeated CV for more reliable estimates. Five data points (one per fold) is too few to estimate a mean with precision. Use Repeated Stratified K-Fold with 10 repetitions, giving you 50 data points. This tightens the estimate of the mean, although less than 50 independent scores would, because the folds share training data. If the difference holds across 50 folds, it is more credible.
  • Consider the practical significance, not just statistical significance. Even if Model A is statistically significantly better by 0.4%, is that difference meaningful for the business? If it translates to catching 2 more fraud cases per year out of 500, but Model A is 10x more complex and expensive to maintain, Model B is the better choice.
  • Watch for information leakage in comparison. If you tried 20 different models and picked the best one based on CV scores, you have a multiple comparison problem. The “best” model might just be the one that happened to score high due to random variation. Use nested CV to get an honest estimate of the chosen model’s performance.
  • McNemar’s test is more appropriate than comparing means. Instead of comparing aggregate accuracy, use McNemar’s test to see if the two models disagree on specific predictions in a statistically significant way. Two models can have identical accuracy but disagree on 30% of predictions — that disagreement pattern tells you more than the aggregate number.
Follow-up: If both models have statistically indistinguishable accuracy, how do you choose between them?
When accuracy is a tie, I choose based on operational factors: inference latency (faster model wins for real-time serving), interpretability (simpler model wins for regulated domains), calibration quality (better-calibrated probabilities win for risk scoring), robustness to missing data (which model degrades more gracefully when features are null), and maintenance burden (fewer dependencies, simpler retraining). In my experience, the simpler model wins most tie-breakers, which is why “always start simple” is not just a platitude but a production-tested heuristic.
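The paired comparison mentioned in the answer above can be sketched as follows. The two models and the dataset are stand-ins chosen for illustration, and scipy's ttest_rel is used for the paired t-test. Because CV folds share training data, this test tends to be overconfident, so treat the p-value as a rough guide rather than a verdict.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Assumed stand-ins for "Model A" and "Model B"
model_a = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_b = DecisionTreeClassifier(random_state=0)

# A fixed seed makes the splitter produce identical folds for both models
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores_a = cross_val_score(model_a, X, y, cv=cv)
scores_b = cross_val_score(model_b, X, y, cv=cv)

diff = scores_a - scores_b  # fold-by-fold difference is what the paired test uses
t_stat, p_value = ttest_rel(scores_a, scores_b)

print(f"Model A: {scores_a.mean():.3f} +/- {scores_a.std():.3f}")
print(f"Model B: {scores_b.mean():.3f} +/- {scores_b.std():.3f}")
print(f"Mean paired difference: {diff.mean():.3f}, p-value: {p_value:.4f}")
```

Pairing matters because both models face the same easy and hard folds; comparing the two means without pairing throws that information away.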
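Question: Your time series model scores 95% accuracy in regular 5-fold CV but only 78% on a temporal holdout. What went wrong?
This is a textbook example of temporal leakage through inappropriate cross-validation, and it is one of the most common mistakes in time series ML.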
  • Root cause: standard K-Fold shuffles time. Regular 5-fold CV randomly assigns rows to folds. For time series data, this means the training fold might contain December 2024 data while the validation fold contains November 2024 data. The model literally trains on the future to predict the past. Any temporal pattern — trends, seasonality, autocorrelation — gets leaked from validation into training.
  • Why the gap is so large (17 percentage points). The 95% accuracy is inflated because the model sees future context for every prediction. The 78% on the temporal holdout reflects true forward-looking performance where no future data is available. The 17-point gap is a rough measure of how much the temporal leakage contributed.
  • The fix: TimeSeriesSplit. Replace KFold with TimeSeriesSplit, which always trains on earlier data and validates on later data. The training window expands over time while the validation window slides forward. This mimics how the model would actually be used in production.
  • Additional consideration: add a gap. Even TimeSeriesSplit can overestimate performance if there is strong short-term autocorrelation. Add a gap between training and validation periods (e.g., skip 7 days) to simulate the real-world delay between model training and deployment. TimeSeriesSplit accepts a gap argument for this, counted in samples rather than calendar time.
  • Re-evaluate your features for temporal leakage. Some features might use future data in their computation (e.g., centered rolling windows instead of trailing, or aggregate statistics computed on the full dataset). Even with TimeSeriesSplit, if the features themselves leak, the model will overperform in CV and underperform in production.
Follow-up: After switching to TimeSeriesSplit, your CV accuracy varies wildly between folds: Fold 1 is 72%, Fold 5 is 91%. How do you interpret this?
High variance across temporal folds means your model’s performance is regime-dependent. The early folds may cover a period of market turbulence, and the later folds cover a stable growth period. Reporting the average of these folds is misleading because it hides the regime sensitivity. I would report the min, max, mean, and standard deviation across folds, and I would investigate which time periods correspond to low performance. If the model fails during regime changes (market shifts, new product launches), I might need to include regime-detection features or use separate models for different regimes. The worst-fold performance is arguably more important than the average, because it tells you how badly the model can fail.
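Both suggestions above, the gap between train and validation and the per-fold reporting, fit in a few lines. This is a sketch on synthetic daily data (the series, the 7-row gap, and the Ridge model are assumptions for illustration). With one row per day, gap=7 skips seven days.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
n_days, gap_days = 200, 7
t = np.arange(n_days)
# Synthetic features and a target that depends on the next day's phase
X = np.column_stack([np.sin(t * 0.1), np.cos(t * 0.1)])
X = X + rng.normal(scale=0.1, size=X.shape)
y = np.sin((t + 1) * 0.1) + rng.normal(scale=0.1, size=n_days)

tscv = TimeSeriesSplit(n_splits=5, gap=gap_days)
fold_scores, boundaries = [], []
for train_idx, test_idx in tscv.split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 for regressors
    boundaries.append((train_idx[-1], test_idx[0]))
    print(f"Train ends at day {train_idx[-1]}, test starts at day {test_idx[0]}")

fold_scores = np.array(fold_scores)
print(
    f"min={fold_scores.min():.3f} max={fold_scores.max():.3f} "
    f"mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}"
)
```

Reporting the minimum alongside the mean keeps the worst period visible instead of averaging it away.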
Question: What problem does nested cross-validation solve, and when is it necessary?
Nested CV addresses a specific and subtle problem: the optimistic bias in performance estimates when the same data is used for both hyperparameter tuning and performance evaluation.
  • The problem with regular CV for model selection. Say you run GridSearchCV with 5-fold CV to tune hyperparameters, and it reports the best configuration has 92% accuracy. That 92% is an optimistic estimate because the tuning process “searched” for the configuration that maximizes performance on those specific validation folds. It is analogous to running 100 statistical tests and reporting only the most significant result — you have inflated your estimate through multiple comparisons.
  • How nested CV fixes this. The outer loop (e.g., 5-fold) splits data into train and test. The inner loop (e.g., 3-fold within each outer training set) runs the hyperparameter search. The outer test fold is never seen by the tuning process. The outer loop scores are the unbiased performance estimate.
  • When it is necessary. Use nested CV when you are reporting the expected performance of your model selection process — for example, in a paper or when comparing two different modeling approaches (e.g., “is XGBoost with tuning better than Random Forest with tuning on this dataset?”). The nested CV answers: “if I ran this entire process on new data, what accuracy would I expect?”
  • When regular CV is sufficient. If you have already fixed your hyperparameters (e.g., using domain knowledge or previous experiments) and just want to evaluate a specific model configuration, regular CV is fine — there is no search to create optimistic bias. Also, if you have a large enough dataset to afford a proper train/validation/test split, you do not need nested CV.
  • The computational cost. Nested CV with 5 outer and 5 inner folds means 25 model trainings per hyperparameter combination, plus one refit of the winning configuration in each outer fold. If your grid has 100 combinations, that is 2,500 search fits and 5 refits. For expensive models, this can be prohibitive. In practice, I use nested CV for the final honest evaluation and regular CV for the development iteration.
Follow-up: How much of an optimistic bias does regular CV typically introduce compared to nested CV?
In my experience, the gap is typically 1-3% for well-regularized models on reasonably-sized datasets. For small datasets (under 500 samples) or large hyperparameter search spaces, the gap can be 5-10%. The bias is larger when the search space is bigger (more comparisons) and the dataset is smaller (more variance per fold). If you are seeing a gap larger than 5%, it is a sign that your outer dataset is too small for reliable evaluation, or your search space is too large relative to the data. A practical heuristic: if your dataset has more than 5,000 samples and your search space has fewer than 50 configurations, the optimistic bias from regular CV is usually negligible.
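The cost arithmetic for nested CV is easy to get wrong, so a small counter can help when budgeting compute. The function nested_cv_fit_count below is a hypothetical helper written for this illustration. It assumes a grid search in the inner loop and, by default, one refit of the best configuration per outer fold, which is what GridSearchCV does with refit=True.

```python
from sklearn.model_selection import ParameterGrid


def nested_cv_fit_count(param_grid, n_outer, n_inner, refit=True):
    """Number of model fits for grid search nested inside an outer CV loop."""
    n_candidates = len(ParameterGrid(param_grid))
    fits_per_outer_fold = n_candidates * n_inner + (1 if refit else 0)
    return n_outer * fits_per_outer_fold


# The small SVC grid used in the nested CV section
small_grid = {"C": [0.1, 1, 10], "gamma": ["scale", "auto"]}
# A 100-combination grid; the values are placeholders used only for counting
large_grid = {"C": list(range(1, 11)), "gamma": list(range(1, 11))}

print(nested_cv_fit_count(small_grid, n_outer=5, n_inner=5))               # 155
print(nested_cv_fit_count(large_grid, n_outer=5, n_inner=5))               # 2505
print(nested_cv_fit_count(large_grid, n_outer=5, n_inner=5, refit=False))  # 2500
```

Shrinking the inner loop to 3 folds or sampling the grid randomly are the usual levers when this count is too high.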