Cross-Validation Strategies
Why One Train-Test Split Is Not Enough
You built a model. It got 95% accuracy on your test set. Ship it? Not so fast. A single train-test split is like evaluating a student based on one question. Maybe they happened to know that question. Maybe they got lucky. You need multiple questions to get a reliable assessment.
The Lucky Split Problem
Estimated Time: 2-3 hours
Difficulty: Intermediate
Prerequisites: Model Evaluation chapter
Tools: scikit-learn, numpy
K-Fold Cross-Validation
The gold standard for model evaluation. The idea is simple but powerful: split data into K equal parts (folds). Train on K-1 folds, test on the remaining one. Rotate so every fold gets a turn as the test set. Every data point is used for both training and testing, just never at the same time.
Stratified K-Fold
When classes are imbalanced (e.g., 90% negative, 10% positive), regular K-Fold can create folds where one fold has 15% positive samples and another has only 5%. This makes fold-to-fold comparison noisy and can give misleading results. Stratified K-Fold solves this by ensuring each fold has approximately the same class ratio as the full dataset.
Leave-One-Out Cross-Validation (LOOCV)
The extreme case: K = n (number of samples). Train on all data except one sample, test on that one sample, then rotate through every sample. This gives the lowest bias (each training set is nearly the full dataset), but the estimate often has high variance (the n training sets overlap almost completely, so their errors are strongly correlated, and each test score comes from a single sample). It is also computationally expensive (n full training cycles).
Time Series Cross-Validation
Time series data is special: you cannot peek at the future. Standard K-Fold ignores time order, so most folds train on observations that come after the test fold. You might train on December data to predict the preceding November, which is time travel, not machine learning. TimeSeriesSplit always trains on past data and tests on future data, respecting the arrow of time.
| Approach | What happens |
|---|---|
| Regular K-Fold on time series | Data leakage! Training on future to predict past |
| TimeSeriesSplit | Always predicts future from past |
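The contrast in the table can be checked directly on fold indices. The following is a minimal sketch, assuming 24 time-ordered rows of placeholder data; the helper `trains_on_future` is a hypothetical name introduced here, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# 24 time-ordered observations (index 0 is the oldest); placeholder data.
X = np.arange(24).reshape(-1, 1)

def trains_on_future(splitter, X):
    """One bool per fold: True if any training index is later than the earliest test index."""
    return [bool(train.max() > test.min()) for train, test in splitter.split(X)]

kfold_leaks = trains_on_future(KFold(n_splits=4), X)
tss_leaks = trains_on_future(TimeSeriesSplit(n_splits=4), X)

print("KFold trains on future:          ", kfold_leaks)
print("TimeSeriesSplit trains on future:", tss_leaks)

# Show the expanding training window used by TimeSeriesSplit.
for train, test in TimeSeriesSplit(n_splits=4).split(X):
    print(f"train 0..{train.max()}  ->  test {test.min()}..{test.max()}")
```

Note that `KFold` leaks here even without shuffling: any fold whose test block is not the last one has later rows in its training set.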
Group K-Fold
When data has natural groups (e.g., multiple readings from the same patient, multiple transactions from the same user), samples within a group are not independent. If patient A’s readings appear in both train and test, the model can “cheat” by recognizing patient-specific patterns rather than learning general medical rules. GroupKFold ensures that all samples from the same group stay together: either entirely in training or entirely in testing.
Nested Cross-Validation
Here is a subtle but important problem: if you use cross-validation to tune hyperparameters and then report that same CV score as your model’s performance, you are being too optimistic. The tuning process “peeked” at the validation folds by selecting the best hyperparameters for them. Nested CV fixes this with two loops:
- Outer loop: Provides an unbiased estimate of model performance
- Inner loop: Tunes hyperparameters (this is where GridSearchCV lives)
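The two loops can be sketched by passing a `GridSearchCV` object to `cross_val_score`. This is an illustrative setup only: the synthetic dataset, the logistic regression model, the `C` grid, and the 3/5 fold counts are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

# Inner loop: the search only ever sees the outer training portion.
search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
)

# Outer loop: each outer test fold scores a model tuned without access to it.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

The reported number describes the whole "tune, then fit" procedure rather than one fixed hyperparameter setting; different outer folds may select different values of `C`.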
Repeated Cross-Validation
A single 5-fold CV gives you 5 scores. That is not much data to estimate a mean and standard deviation. Repeated CV runs the entire K-Fold process multiple times with different random shuffles, giving you (K x repeats) scores. This reduces the part of the variance that comes from how the data happened to be partitioned. Think of it as flipping a coin 50 times instead of 5 to estimate whether it is fair, with the caveat that CV scores share training data and so are not fully independent flips.
Choosing the Right Strategy
| Scenario | Recommended CV |
|---|---|
| General classification | Stratified K-Fold (K=5 or 10) |
| Regression | K-Fold (K=5 or 10) |
| Time series | TimeSeriesSplit |
| Grouped data | GroupKFold |
| Hyperparameter tuning | Nested CV |
| Very small dataset | LOOCV or Repeated K-Fold |
| Imbalanced classes | Stratified K-Fold |
| Critical applications | Repeated Stratified K-Fold |
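As a rough guide, the rows of this table map onto scikit-learn splitter classes as sketched below. The dictionary and its keys are illustrative, and the fold counts and the 40-patient grouping are assumptions. Nested CV is not a splitter, so it is omitted. The grouped case is run end to end because it is the one that needs an extra `groups` argument.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GroupKFold, KFold, LeaveOneOut, RepeatedStratifiedKFold,
    StratifiedKFold, TimeSeriesSplit, cross_val_score,
)

# Illustrative mapping from table rows to splitter objects.
splitters = {
    "General classification": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "Regression": KFold(n_splits=5, shuffle=True, random_state=0),
    "Time series": TimeSeriesSplit(n_splits=5),
    "Grouped data": GroupKFold(n_splits=5),
    "Very small dataset": LeaveOneOut(),
    "Critical applications": RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0),
}

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
groups = np.repeat(np.arange(40), 5)  # hypothetical: 40 patients, 5 rows each

group_cv = splitters["Grouped data"]

# Verify that no group appears on both sides of any split.
group_overlap = [
    not set(groups[train]).isdisjoint(groups[test])
    for train, test in group_cv.split(X, y, groups=groups)
]

model = LogisticRegression(max_iter=1000)
group_scores = cross_val_score(model, X, y, cv=group_cv, groups=groups)
print("any group in both train and test:", any(group_overlap))
print(f"GroupKFold accuracy: {group_scores.mean():.3f} +/- {group_scores.std():.3f}")
```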
Summary
Cross-validation transforms unreliable single-split estimates into robust performance measures:
- K-Fold: Standard approach, every sample tested exactly once
- Stratified: Maintains class balance
- Time Series: Respects temporal order
- Group: Keeps related samples together
- Nested: Unbiased tuning + evaluation
- Repeated: Reduces variance in estimates
Rule of Thumb: When in doubt, use Stratified 5-Fold for classification and 5-Fold for regression. Add repetition for critical applications.
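The rule of thumb might look like this in practice. This is a sketch on synthetic data with roughly 10% positives; the model and the 10 repeats are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    RepeatedStratifiedKFold, StratifiedKFold, cross_val_score,
)

# Imbalanced synthetic data: about 90% negative, 10% positive.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Default choice for classification: Stratified 5-Fold.
single_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_pos_rates = [float(y[test].mean()) for _, test in single_cv.split(X, y)]
print("positive rate per fold:", [round(r, 3) for r in fold_pos_rates])

model = LogisticRegression(max_iter=1000)
single = cross_val_score(model, X, y, cv=single_cv)

# For critical applications: repeat the 5-fold procedure with fresh shuffles.
repeated_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
repeated = cross_val_score(model, X, y, cv=repeated_cv)

print(f"5 scores:  {single.mean():.3f} +/- {single.std():.3f}")
print(f"50 scores: {repeated.mean():.3f} +/- {repeated.std():.3f}")
```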
Interview Deep-Dive
You are comparing two models: Model A has 5-fold CV accuracy of 91.2% and Model B has 90.8%. Can you confidently say Model A is better?
No, although many practitioners would say yes. A 0.4 percentage point difference in 5-fold CV accuracy is very likely within the noise of the evaluation procedure.
- Check the standard deviation. If Model A is 91.2% +/- 2.1% and Model B is 90.8% +/- 1.8%, the ranges overlap heavily, so the 0.4-point gap may be nothing more than fold-to-fold noise. Overlap alone does not settle it: run a paired test (e.g., paired t-test on fold-level scores, with both models evaluated on the same folds) to determine if the difference is significant. Keep in mind that fold scores share training data, so such tests tend to be optimistic.
- Use repeated CV for more reliable estimates. Five data points (one per fold) is too few to estimate a mean with precision. Use Repeated Stratified K-Fold with 10 repetitions, giving you 50 data points. This tightens the estimate of the mean, although the 50 scores are correlated and carry less information than 50 independent samples. If the difference holds across 50 folds, it is more credible.
- Consider the practical significance, not just statistical significance. Even if Model A is statistically significantly better by 0.4%, is that difference meaningful for the business? If it translates to catching 2 more fraud cases per year out of 500, but Model A is 10x more complex and expensive to maintain, Model B is the better choice.
- Watch for information leakage in comparison. If you tried 20 different models and picked the best one based on CV scores, you have a multiple comparison problem. The “best” model might just be the one that happened to score high due to random variation. Use nested CV to get an honest estimate of the chosen model’s performance.
- McNemar’s test is more appropriate than comparing means. Instead of comparing aggregate accuracy, use McNemar’s test to see if the two models disagree on specific predictions in a statistically significant way. Two models can have identical accuracy but disagree on 30% of predictions — that disagreement pattern tells you more than the aggregate number.
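The paired comparison and the McNemar idea from the list above can be sketched as follows. Everything here is illustrative: the two models and the synthetic data are stand-ins. The McNemar step uses the exact form (a two-sided binomial test on the discordant predictions) via `scipy.stats.binomtest`, applied to out-of-fold predictions. Treat the p-values as rough guides, since fold scores are not independent.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    RepeatedStratifiedKFold, StratifiedKFold, cross_val_predict, cross_val_score,
)

X, y = make_classification(n_samples=400, n_features=12, n_informative=5, random_state=0)

model_a = RandomForestClassifier(n_estimators=25, random_state=0)  # stand-in for "Model A"
model_b = LogisticRegression(max_iter=1000)                        # stand-in for "Model B"

# A fixed random_state gives both models identical folds, so scores are paired.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores_a = cross_val_score(model_a, X, y, cv=cv)
scores_b = cross_val_score(model_b, X, y, cv=cv)

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"mean difference: {np.mean(scores_a - scores_b):+.4f}, paired t-test p = {p_value:.3f}")

# Exact McNemar test: look only at samples where exactly one model is right.
pred_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred_a = cross_val_predict(model_a, X, y, cv=pred_cv)
pred_b = cross_val_predict(model_b, X, y, cv=pred_cv)

only_a_right = int(np.sum((pred_a == y) & (pred_b != y)))
only_b_right = int(np.sum((pred_a != y) & (pred_b == y)))
n_discordant = only_a_right + only_b_right

mcnemar_p = (
    stats.binomtest(only_a_right, n_discordant, 0.5).pvalue if n_discordant else 1.0
)
print(f"only A right: {only_a_right}, only B right: {only_b_right}, McNemar p = {mcnemar_p:.3f}")
```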
Your time series model shows 95% accuracy with 5-fold CV but only 78% on a true temporal holdout. What went wrong?
This is a textbook example of temporal leakage through inappropriate cross-validation, and it is one of the most common mistakes in time series ML.
- Root cause: standard K-Fold ignores time order. Whether rows are shuffled or split into contiguous blocks, most training folds contain observations that come after the validation fold. For example, the training fold might contain December 2024 data while the validation fold contains November 2024 data. The model literally trains on the future to predict the past. Any temporal pattern (trends, seasonality, autocorrelation) leaks from the validation period into training.
- Why the gap is so large (17 percentage points). The 95% accuracy is inflated because the model sees future context for every prediction. The 78% on the temporal holdout reflects true forward-looking performance where no future data is available. Most of the 17-point gap is plausibly temporal leakage, although drift between the CV period and the holdout period can also contribute.
- The fix: TimeSeriesSplit. Replace KFold with TimeSeriesSplit, which always trains on earlier data and validates on later data. The training window expands over time while the validation window slides forward. This mimics how the model would actually be used in production.
- Additional consideration: add a gap. Even TimeSeriesSplit can overestimate performance if there is strong short-term autocorrelation. Add a gap between training and validation periods (e.g., skip 7 days) to simulate the real-world delay between model training and deployment.
- Re-evaluate your features for temporal leakage. Some features might use future data in their computation (e.g., centered rolling windows instead of trailing, or aggregate statistics computed on the full dataset). Even with TimeSeriesSplit, if the features themselves leak, the model will overperform in CV and underperform in production.
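The gap suggestion above corresponds to the `gap` parameter of scikit-learn's `TimeSeriesSplit`. The sketch below assumes 100 daily observations and a 7-day gap; both numbers are illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical: 100 consecutive daily observations, oldest first.
X = np.arange(100).reshape(-1, 1)

gap_days = 7
tscv = TimeSeriesSplit(n_splits=5, gap=gap_days)

observed_gaps = []
for train_idx, test_idx in tscv.split(X):
    # Number of days dropped between the last training day and the first validation day.
    skipped = int(test_idx.min() - train_idx.max() - 1)
    observed_gaps.append(skipped)
    print(
        f"train days 0..{train_idx.max()}  | skip {skipped} days |  "
        f"validate days {test_idx.min()}..{test_idx.max()}"
    )
```

The gap only protects against leakage through row adjacency. Features computed with centered windows or full-dataset statistics still need to be fixed separately.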
Explain nested cross-validation. When is it necessary, and when is regular CV sufficient?
Nested CV addresses a specific and subtle problem: the optimistic bias in performance estimates when the same data is used for both hyperparameter tuning and performance evaluation.
- The problem with regular CV for model selection. Say you run GridSearchCV with 5-fold CV to tune hyperparameters, and it reports the best configuration has 92% accuracy. That 92% is an optimistic estimate because the tuning process “searched” for the configuration that maximizes performance on those specific validation folds. It is analogous to running 100 statistical tests and reporting only the most significant result — you have inflated your estimate through multiple comparisons.
- How nested CV fixes this. The outer loop (e.g., 5-fold) splits data into train and test. The inner loop (e.g., 3-fold within each outer training set) runs the hyperparameter search. The outer test fold is never seen by the tuning process. The outer loop scores are the unbiased performance estimate.
- When it is necessary. Use nested CV when you are reporting the expected performance of your model selection process — for example, in a paper or when comparing two different modeling approaches (e.g., “is XGBoost with tuning better than Random Forest with tuning on this dataset?”). The nested CV answers: “if I ran this entire process on new data, what accuracy would I expect?”
- When regular CV is sufficient. If you have already fixed your hyperparameters (e.g., using domain knowledge or previous experiments) and just want to evaluate a specific model configuration, regular CV is fine — there is no search to create optimistic bias. Also, if you have a large enough dataset to afford a proper train/validation/test split, you do not need nested CV.
- The computational cost. Nested CV with 5 outer and 5 inner folds means 25 model trainings per hyperparameter combination. If your grid has 100 combinations, that is 2,500 training runs. For expensive models, this can be prohibitive. In practice, I use nested CV for the final honest evaluation and regular CV for the development iteration.
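The cost arithmetic can be written as a small helper. The function name `nested_cv_fit_count` is hypothetical, and it assumes an exhaustive grid search. The optional `refit` term accounts for `GridSearchCV`'s default behavior of retraining the best configuration once per outer fold, which the 2,500 figure above leaves out.

```python
def nested_cv_fit_count(outer_folds, inner_folds, n_param_combinations, refit=True):
    """Count model fits for nested CV with an exhaustive grid search in the inner loop."""
    # Every combination is fitted once per inner fold, inside every outer fold.
    search_fits = outer_folds * inner_folds * n_param_combinations
    # With refit=True, the best configuration is retrained once per outer fold.
    refit_fits = outer_folds if refit else 0
    return search_fits + refit_fits

fits_search_only = nested_cv_fit_count(5, 5, 100, refit=False)
fits_with_refit = nested_cv_fit_count(5, 5, 100)

print("search fits only:", fits_search_only)  # 5 * 5 * 100 = 2500
print("including refits:", fits_with_refit)   # 2500 + 5 = 2505
```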