Data Leakage
The Hidden Danger
Your model achieves 99% accuracy in development. You celebrate. You deploy it. It performs no better than random guessing. What went wrong? Data leakage, one of the most common reasons ML models that look brilliant in the notebook fail catastrophically in production. It is also one of the hardest bugs to find, because everything looks correct until you deploy.
What is Data Leakage?
Data leakage occurs when information that the model should not have access to, such as test-set statistics or values known only after the prediction is made, sneaks into training. It is the ML equivalent of accidentally seeing the answer key before an exam: your score looks great but you have learned nothing. There are three main types, each with a different mechanism:
- Target leakage: Features contain information derived from the target. The model takes a shortcut through the answer instead of learning the real pattern.
- Train-test contamination: Test data statistics influence the training process. Your evaluation is no longer measuring generalization.
- Temporal leakage: Future information is used to predict the past. Your model is a time traveler — impressive in the notebook, useless in production where the future is unknown.
Real Example: The 99% Accuracy Trap
A hospital builds a model to predict pneumonia from chest X-rays:
Type 1: Target Leakage
Information derived from the target leaks into features:
Example: Credit Card Fraud
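A minimal sketch of how such a leak can show up, assuming a toy transactions table. The rows, the merchant_category column, and the populated_rate_by_class helper are hypothetical; only the two leaky column names come from this example. Columns that are filled in for every fraudulent row and empty for every legitimate one are very likely derived from the label:

```python
import pandas as pd

# Hypothetical toy transactions; only the two leaky column names come from the text.
transactions = pd.DataFrame({
    "amount": [25.0, 980.0, 12.5, 430.0, 60.0, 1500.0],
    "merchant_category": ["grocery", "electronics", "cafe", "travel", "grocery", "jewelry"],
    "fraud_investigation_date": [None, "2024-03-02", None, None, None, "2024-03-09"],
    "chargeback_amount": [None, 980.0, None, None, None, 1500.0],
    "is_fraud": [0, 1, 0, 0, 0, 1],
})


def populated_rate_by_class(df, target):
    """Share of non-null values per feature column, split by target class."""
    features = df.drop(columns=[target])
    return features.notna().groupby(df[target]).mean().T


rates = populated_rate_by_class(transactions, "is_fraud")

# Columns populated for the positive class only are almost certainly label-derived.
suspects = rates[(rates[0] == 0.0) & (rates[1] == 1.0)].index.tolist()
print(suspects)
```

A check like this only catches leaks that show up as missing-value patterns; it is a first filter, not a replacement for reviewing how each column is produced.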
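fraud_investigation_date and chargeback_amount only exist for fraudulent transactions! Both are recorded after the fraud has been discovered, so neither will be available when the model has to score a new transaction.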
How to Fix
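One possible way to apply the fix is to record, for every column, whether its value is known at the moment of prediction, and to drop everything else before training. The sketch below assumes a hand-maintained availability map; the map, the helper name drop_unavailable_features, and the sample rows are all hypothetical:

```python
import pandas as pd

# Hypothetical availability map: True if the value is known when a transaction is scored.
AVAILABLE_AT_PREDICTION_TIME = {
    "amount": True,
    "merchant_category": True,
    "fraud_investigation_date": False,  # created after fraud is suspected
    "chargeback_amount": False,         # created after fraud is confirmed
}


def drop_unavailable_features(df, availability, target):
    """Keep the target plus the features that are known at prediction time."""
    undecided = [c for c in df.columns if c != target and c not in availability]
    if undecided:
        # Force an explicit decision for every new column instead of guessing.
        raise ValueError(f"No availability decision recorded for: {undecided}")
    keep = [c for c in df.columns if c == target or availability[c]]
    return df[keep]


raw = pd.DataFrame({
    "amount": [25.0, 980.0],
    "merchant_category": ["grocery", "electronics"],
    "fraud_investigation_date": [None, "2024-03-02"],
    "chargeback_amount": [None, 980.0],
    "is_fraud": [0, 1],
})

clean = drop_unavailable_features(raw, AVAILABLE_AT_PREDICTION_TIME, "is_fraud")
print(list(clean.columns))
```

Raising on columns with no recorded decision is a design choice: a newly added column has to be reviewed before it can reach the model.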
Type 2: Train-Test Contamination
Test data influences the training process:
Example: Data Preprocessing
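A minimal sketch of the difference, assuming synthetic data and StandardScaler as the preprocessing step. The leaky version fits the scaler on every row; the correct version splits first and fits on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y = rng.integers(0, 2, size=200)

# Leaky: the scaler sees every row, including rows that later become the test set.
leaky_scaler = StandardScaler().fit(X)

# Correct: split first, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training mean and std

print("mean learned from all rows:  ", leaky_scaler.mean_)
print("mean learned from train rows:", scaler.mean_)
```

The two sets of statistics are usually close, which is why this kind of contamination is easy to overlook.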
Example: Feature Selection
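The effect is easiest to see on data with no signal at all. The sketch below assumes pure-noise features and random labels, with SelectKBest and LogisticRegression chosen only for illustration. Selecting features on all rows before cross-validation tends to produce an accuracy well above chance, while doing the selection inside a pipeline should stay near 50%:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2000))  # pure noise features
y = rng.integers(0, 2, size=100)  # labels unrelated to X

# Leaky: pick the "best" features using all rows, then cross-validate.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Correct: selection lives inside the pipeline, so it is refit on each training fold.
pipeline = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(pipeline, X, y, cv=5).mean()

print(f"selection before CV: {leaky_score:.2f}")
print(f"selection inside CV: {honest_score:.2f}")
```

Because the labels are random, any score far above 0.5 in the first variant comes from the selection step having already seen the validation rows.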
Type 3: Temporal Leakage
Using future information to predict the past:
Example: Stock Prediction
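A small sketch with hypothetical daily closing prices, assuming the goal is to predict the next day's return. It contrasts a centered moving average, which averages over later days, with a trailing one that ends at the current row:

```python
import pandas as pd

# Hypothetical daily closing prices.
prices = pd.DataFrame(
    {"close": [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Target: the next day's return, known only after the next day's close.
prices["target_next_return"] = prices["close"].pct_change().shift(-1)

# Leaky: a centered window averages over a day that comes after the current row.
prices["ma3_centered"] = prices["close"].rolling(3, center=True).mean()

# Safe: a trailing window ends at the current row and uses no later prices.
prices["ma3_trailing"] = prices["close"].rolling(3).mean()

# Also safe: yesterday's return, a lagged feature built with a positive shift.
prices["prev_return"] = prices["close"].pct_change().shift(1)

print(prices)
```

On the third row the centered average already includes the fourth day's close of 105.0, a value that would not exist yet when the prediction is made.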
Time Series Cross-Validation
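A minimal sketch using scikit-learn's TimeSeriesSplit, assuming the rows are already sorted by time. Every fold trains on an initial stretch of the data and validates on the rows that immediately follow it:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # rows are assumed to be sorted by time

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for fold, (train_idx, test_idx) in enumerate(folds):
    # Training indices always come before the validation indices.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows from fold to fold, so early folds are fit on less data than later ones.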
The Pipeline Solution
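A minimal sketch of the idea on synthetic data, assuming median imputation, standard scaling, and logistic regression as the steps; the step names and the data are illustrative, not prescribed:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # inject some missing values

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each fold refits the imputer and scaler on that fold's training rows only.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```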
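Use scikit-learn Pipelines to prevent contamination. When the whole Pipeline is passed to cross-validation, every preprocessing step is refit on the training portion of each fold, so validation rows never influence the fitted statistics.
Common Leakage Sources Checklist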
Data Collection Issues
- Timestamps mixed between train/test
- Data from same entity in both train and test
- Features computed from all data (global statistics)
Feature Engineering Issues
- Features derived from target
- Future information in features
- Rolling windows including future data
Preprocessing Issues
- Scaling fit on all data
- Imputation fit on all data
- Feature selection using all data
- PCA/dimensionality reduction on all data
Validation Issues
- Random split on time series
- Same group/patient in train and test
- Test set seen during hyperparameter tuning
How to Detect Leakage
1. Suspiciously Good Results
The “too good to be true” heuristic is surprisingly reliable. In most real-world ML problems, a model that dramatically outperforms published benchmarks or domain-expert baselines probably has a leak, not a breakthrough.
2. Feature Importance Analysis
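A rough sketch of this check on a synthetic churn dataset. All three feature names are hypothetical, and cancellation_fee_charged is deliberately constructed to be non-zero only for churned customers. Ranking a random forest's importances puts the suspicious column at the top:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)  # 1 = churned

# Hypothetical features for illustration only.
X = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, size=n),         # unrelated to y here
    "monthly_spend": rng.normal(50, 10, size=n) + 2 * y,  # weak, legitimate signal
    "cancellation_fee_charged": y * rng.uniform(5, 20, size=n),  # only set after churn
})

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

A single feature carrying most of the importance is not proof of leakage, but it is a reason to trace how that column is produced before trusting the model.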
If a feature that should not logically be predictive shows up as the most important, that is a strong signal of leakage. For example, if “customer_id” is the top feature in a churn model, something is wrong: IDs should carry no predictive information.
3. Validation Gap
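A small helper can turn this check into a routine step. The function name validation_gap and the 0.05 tolerance are assumptions for illustration; a sensible threshold depends on the metric and on how noisy the scores are:

```python
import numpy as np


def validation_gap(cv_scores, holdout_score, tolerance=0.05):
    """Return (gap, is_suspicious) for mean CV score versus the holdout score."""
    gap = float(np.mean(cv_scores) - holdout_score)
    return gap, gap > tolerance


# Illustrative numbers: CV reports about 95%, the held-out set reports 80%.
gap, suspicious = validation_gap([0.94, 0.95, 0.96, 0.95, 0.95], 0.80)
print(f"gap={gap:.2f} suspicious={suspicious}")
```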
A large gap between cross-validation scores and holdout scores is a red flag. If your CV says 95% but your truly held-out test set says 80%, either your CV has leakage, or there is a distribution shift between training and test periods.
Real-World Prevention Strategy
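One way to put the pieces together is sketched below on synthetic data, with a scaler and logistic regression standing in for a real feature pipeline and model. The order of the steps is the point: split first, keep preprocessing inside a pipeline, cross-validate on the development set only, and score the holdout set once at the end.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(size=400) > 0).astype(int)

# 1. Split first: the holdout set is not touched until the very end.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# 2. Put all preprocessing inside a pipeline.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 3. Compare and tune models with cross-validation on the development set only.
cv_scores = cross_val_score(pipeline, X_dev, y_dev, cv=5)

# 4. Fit once on the development set and score the holdout set a single time.
holdout_score = pipeline.fit(X_dev, y_dev).score(X_holdout, y_holdout)

print(f"cv={cv_scores.mean():.2f} holdout={holdout_score:.2f}")
```

For time-ordered or grouped data, the random split in step 1 and the default folds in step 3 would need to be replaced with temporal or group-aware splitters.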
Key Takeaways
Split First
Always split data before any preprocessing or analysis
Use Pipelines
Scikit-learn Pipelines prevent contamination
Question Features
Ask: “Would I have this at prediction time?”
Validate Results
If it’s too good to be true, it probably is
What’s Next?
Now let’s learn about dimensionality reduction - handling high-dimensional data effectively!
Continue to Dimensionality Reduction
PCA, t-SNE, and handling the curse of dimensionality
Interview Deep-Dive
You inherited a model with 99.2% accuracy on a fraud detection task. The previous team says it is production-ready. What questions would you ask before deploying it?
A 99.2% accuracy on fraud detection is a giant red flag, not a reason to celebrate. Fraud detection is inherently imbalanced (typically 0.1-1% fraud rate), so predicting “not fraud” for everything already gives you 99%+ accuracy. Here is my interrogation checklist:
- What is the base rate? If 99% of transactions are legitimate, a model that always predicts “legitimate” gets 99% accuracy. I need to see precision, recall, and PR-AUC on the minority class specifically. If recall on fraud is below 50%, the model is useless regardless of overall accuracy.
- How was the train-test split done? Was it random or temporal? For fraud, new fraud patterns emerge over time. A random split puts future fraud patterns in the training set, inflating test performance. I need to see a temporal split where the model is trained on months 1-6 and tested on months 7-8.
- What features are in the model? I would audit every feature for target leakage. Common leaky features in fraud models: chargeback_amount (only exists because fraud was detected), investigation_flag (created after the fraud label), account_frozen_date (consequence of fraud, not a predictor). The litmus test: “Would this feature exist at the exact moment we need to make a prediction?”
- Was the preprocessing done before or after the split? If StandardScaler or target encoding was fit on the entire dataset, test statistics leaked into training. This is subtle but common.
- What happens when you remove the top feature? If removing one feature drops accuracy from 99.2% to 65%, that feature is almost certainly leaking. A legitimate model should degrade gracefully, not collapse.
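The base-rate point in the checklist above can be checked in a few lines. This sketch assumes hypothetical labels with a 1% fraud rate and uses scikit-learn's DummyClassifier as the "always predict legitimate" model:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels with a 1% fraud rate.
y = np.zeros(10_000, dtype=int)
y[:100] = 1
X = np.zeros((10_000, 1))  # features are irrelevant for this baseline

# Always predicts the majority class, i.e. "not fraud".
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
fraud_recall = recall_score(y, y_pred, zero_division=0)
print(f"accuracy={accuracy:.3f} fraud recall={fraud_recall:.3f}")
```

The baseline reaches 99% accuracy while catching no fraud at all, which is why recall, precision, and PR-AUC on the fraud class matter more than overall accuracy here.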
Walk me through a real scenario where data leakage is extremely subtle and hard to detect.
The subtlest leakage I have encountered involves group-level information leaking across train-test boundaries.
- Patient-level leakage in medical ML. Imagine predicting disease from lab results. A patient has 10 visits over 2 years. If you randomly split rows into train and test, the same patient’s visits appear in both sets. The model learns patient-specific patterns (their baseline lab values, their doctor’s ordering habits) rather than general medical signals. On paper, accuracy is high. On a new patient the model has never seen, it fails.
- The fix: GroupKFold or GroupShuffleSplit. Ensure all rows from the same patient are in either train or test, never both. This is not optional — it is methodologically required whenever your data has grouped structure.
- Another subtle one: global aggregation features. Say you create a feature “average transaction amount for this merchant.” If you compute this average across the entire dataset (including test rows), you have leaked test-set transaction amounts into the training features. The correct approach: compute the average only on the training set, then map it to both train and test.
- Time-based feature leakage via rolling windows. You compute “7-day rolling average revenue” for a forecasting model. But your rolling window implementation uses a centered window (3 days before, current day, 3 days after) instead of a trailing window (7 days before). The “3 days after” is future information. This bug produces no error, gives slightly better backtest results, and silently fails in production where future data does not exist.
- Data collection leakage. In a hospital ICU mortality prediction model, the number of lab tests ordered is a strong predictor — but only because sicker patients get more tests. The model learns “more tests = higher risk” which is a proxy for physician judgment, not a medical signal. In a new hospital with different testing protocols, this feature becomes noise.
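The group-aware split mentioned above can be sketched with GroupShuffleSplit. The data is hypothetical (20 patients with 10 visits each), and the check at the end confirms that no patient ends up on both sides of the split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 20 patients with 10 visits each.
patient_ids = np.repeat(np.arange(20), 10)
X = np.random.default_rng(0).normal(size=(len(patient_ids), 3))

# test_size is a fraction of groups (patients), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=patient_ids))

train_patients = set(patient_ids[train_idx])
test_patients = set(patient_ids[test_idx])
print(f"train patients: {len(train_patients)}, test patients: {len(test_patients)}")
print("overlap:", train_patients & test_patients)
```

GroupKFold follows the same pattern when several folds are needed instead of a single split.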
How does data leakage interact with cross-validation? Can proper CV prevent all forms of leakage?
Cross-validation prevents some forms of leakage but is completely blind to others. Understanding which types it catches and which it misses is critical.
- CV prevents train-test contamination in preprocessing — but only if you use pipelines. If your scaler, imputer, or feature selector is inside a Pipeline and that Pipeline is passed to cross_val_score, each fold correctly fits the preprocessor only on training data. If you preprocess before calling CV, the leakage happens before CV sees the data, and CV cannot detect or prevent it.
- CV does NOT prevent target leakage. If your feature “chargeback_amount” is derived from the target, it is leaky in every fold. CV will happily report 99% accuracy in every fold, giving you false confidence. Target leakage is a feature engineering problem, not a validation problem.
- Standard CV does NOT prevent temporal leakage. KFold, whether shuffled or not, trains on rows that come after the validation fold in time, so future data ends up in the training set. You must use TimeSeriesSplit for temporal data. This is one of the most common CV mistakes in time series ML.
- Standard CV does NOT prevent group leakage. If patient A’s rows are in both train and test folds, KFold will not flag this. Use GroupKFold when your data has natural groupings.
- Nested CV prevents one additional form: validation set overfitting. If you tune hyperparameters on CV folds and report the same CV score as your performance estimate, that score is optimistically biased. Nested CV uses an outer loop for performance estimation and an inner loop for tuning, keeping them honest.
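The nested CV arrangement described in the last point can be sketched by wrapping GridSearchCV inside cross_val_score. The data, the model, and the grid of C values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tunes C
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

# The search is rerun inside every outer training fold, so the outer
# validation rows never take part in choosing C.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"nested CV accuracy: {nested_scores.mean():.2f}")
```

Shuffled KFold is only appropriate here because the synthetic rows are independent; for temporal or grouped data both loops would use TimeSeriesSplit or GroupKFold instead.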