
Data Leakage

The Hidden Danger

Your model achieves 99% accuracy in development. You celebrate. You deploy it. It performs no better than random guessing. What went wrong? Data leakage — the most common reason ML models that look brilliant in the notebook fail catastrophically in production. It is also one of the hardest bugs to find, because everything looks correct until you deploy.

What is Data Leakage?

Data leakage occurs when information from outside the training data sneaks into your model during training. It is the ML equivalent of accidentally seeing the answer key before an exam — your score looks great but you have learned nothing. There are three main types, each with a different mechanism:
  1. Target leakage: Features contain information derived from the target. The model takes a shortcut through the answer instead of learning the real pattern.
  2. Train-test contamination: Test data statistics influence the training process. Your evaluation is no longer measuring generalization.
  3. Temporal leakage: Future information is used to predict the past. Your model is a time traveler — impressive in the notebook, useless in production where the future is unknown.
Data leakage is extremely common and often subtle. It’s responsible for many “too good to be true” results.

Real Example: The 99% Accuracy Trap

A hospital builds a model to predict pneumonia from chest X-rays:
# Simplified example of what went wrong
import pandas as pd
import numpy as np

# Hospital data
data = pd.DataFrame({
    'patient_id': range(1000),
    'has_portable_xray': np.random.choice([0, 1], 1000, p=[0.7, 0.3]),
    'pneumonia': np.zeros(1000)
})

# The leak: sick patients often get portable X-rays
# (they're too sick to go to the X-ray room)
sick_mask = data['has_portable_xray'] == 1
data.loc[sick_mask, 'pneumonia'] = np.random.choice([0, 1], sick_mask.sum(), p=[0.2, 0.8])

# Model learns: portable X-ray → pneumonia
# But this is correlation, not causation!
correlation = data['has_portable_xray'].corr(data['pneumonia'])
print(f"Correlation: {correlation:.3f}")  # Very high!
The model learned that portable X-ray equipment in the image predicts pneumonia — not the actual lung patterns! This is a real scenario based on research at the University of Washington. The model achieved near-perfect accuracy in the hospital where it was trained, but failed at other hospitals where portable X-ray usage patterns differed. The “signal” it learned was an artifact of hospital workflow, not medicine.
The litmus test for target leakage: Ask yourself, “Would I have this feature available at the exact moment I need to make a prediction in production?” If the answer is no — or if the feature only exists because of the outcome you are trying to predict — it is a leak.

Type 1: Target Leakage

Information derived from the target leaks into features:

Example: Credit Card Fraud

# LEAKY DATASET
fraud_data = pd.DataFrame({
    'transaction_amount': [100, 500, 50, 10000],
    'merchant_category': ['grocery', 'electronics', 'gas', 'jewelry'],
    'is_fraud': [0, 0, 0, 1],
    'fraud_investigation_date': [None, None, None, '2024-01-15'],  # LEAK!
    'chargeback_amount': [0, 0, 0, 10000]  # LEAK!
})

# These features only exist BECAUSE of fraud
# They leak the target!
Problem: fraud_investigation_date and chargeback_amount only exist for fraudulent transactions!

How to Fix

# CLEAN DATASET - only use features available at prediction time
clean_fraud_data = pd.DataFrame({
    'transaction_amount': [100, 500, 50, 10000],
    'merchant_category': ['grocery', 'electronics', 'gas', 'jewelry'],
    'time_since_last_transaction': [3600, 86400, 1800, 120],  # seconds
    'distance_from_home': [2, 50, 5, 500],  # miles
    'is_fraud': [0, 0, 0, 1]
})

# Ask: "Would I have this feature at prediction time?"

Type 2: Train-Test Contamination

Test data influences the training process:

Example: Data Preprocessing

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np

np.random.seed(42)
X = np.random.randn(1000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# ❌ WRONG: Fit scaler on all data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Test data statistics leak in!
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
leaky_accuracy = model.score(X_test, y_test)

# ✅ CORRECT: Fit scaler only on training data
# Same random_state, so both versions are scored on the same test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit only on train
X_test_scaled = scaler.transform(X_test)  # Transform (don't fit) test

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
clean_accuracy = model.score(X_test_scaled, y_test)

print(f"Leaky accuracy: {leaky_accuracy:.4f}")
print(f"Clean accuracy: {clean_accuracy:.4f}")

Example: Feature Selection

from sklearn.feature_selection import SelectKBest, f_classif

# ❌ WRONG: Select features using all data
selector = SelectKBest(f_classif, k=5)
X_selected = selector.fit_transform(X, y)  # Uses test data!
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2)

# ✅ CORRECT: Select features only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

selector = SelectKBest(f_classif, k=5)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
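
To see how much this ordering can matter, here is a minimal sketch on pure-noise data, where no honest model should beat chance. The array sizes, `k=20`, and the variable names are illustrative assumptions, not values from the example above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Pure noise: 100 samples, 2000 features, balanced labels unrelated to the features
X_noise = rng.standard_normal((100, 2000))
y_noise = rng.permutation(np.repeat([0, 1], 50))

# Leaky: pick the 20 "best" features using every row, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Clean: the selector is refit inside each training fold
clean_pipeline = Pipeline([
    ("selector", SelectKBest(f_classif, k=20)),
    ("classifier", LogisticRegression(max_iter=1000)),
])
clean_score = cross_val_score(clean_pipeline, X_noise, y_noise, cv=5).mean()

print(f"Selection before CV: {leaky_score:.3f}")
print(f"Selection inside CV: {clean_score:.3f}")
```

The leaky variant should report a score well above chance even though the labels are random, while the pipeline variant should stay near 50%.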

Type 3: Temporal Leakage

Using future information to predict the past:

Example: Stock Prediction

import pandas as pd
import numpy as np

# Stock data
dates = pd.date_range('2023-01-01', periods=100, freq='D')
prices = 100 + np.cumsum(np.random.randn(100))

stock_data = pd.DataFrame({
    'date': dates,
    'price': prices,
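    # LEAK if used as a feature: future price! shift(-1) leaves the last row NaN,
    # whereas np.roll would wrap around and reuse the first day's price
    'next_day_price': pd.Series(prices).shift(-1).to_numpy()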
})

# ❌ WRONG: Random split
# This might put Feb 15 in training and Feb 14 in test!
# Model could learn from future to predict past

# ✅ CORRECT: Temporal split
train_cutoff = dates[int(len(dates) * 0.8)]
train_data = stock_data[stock_data['date'] < train_cutoff]
test_data = stock_data[stock_data['date'] >= train_cutoff]

print(f"Training period: {train_data['date'].min()} to {train_data['date'].max()}")
print(f"Testing period: {test_data['date'].min()} to {test_data['date'].max()}")

Time Series Cross-Validation

from sklearn.model_selection import TimeSeriesSplit
import matplotlib.pyplot as plt

# Correct way to cross-validate time series
tscv = TimeSeriesSplit(n_splits=5)

fig, ax = plt.subplots(figsize=(12, 4))
for i, (train_idx, test_idx) in enumerate(tscv.split(stock_data)):
    ax.plot(train_idx, [i] * len(train_idx), 'b-', linewidth=3)
    ax.plot(test_idx, [i] * len(test_idx), 'r-', linewidth=3)

ax.set_xlabel('Time')
ax.set_ylabel('CV Fold')
ax.set_title('Time Series Cross-Validation (No Future Leakage)')
plt.show()
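
The ordering guarantee can also be checked in code rather than by eye. The sketch below assumes rows are already sorted by time and uses the `gap` parameter of `TimeSeriesSplit` to leave a buffer between training and test rows. The 7-row gap and `n_rows` are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 100   # hypothetical: one row per day, already sorted by date
gap_rows = 7   # rows skipped between the end of train and the start of test

tscv_gap = TimeSeriesSplit(n_splits=5, gap=gap_rows)

fold_ok = []
for train_idx, test_idx in tscv_gap.split(np.zeros((n_rows, 1))):
    # Every training row must end at least `gap_rows` rows before the test fold
    fold_ok.append(train_idx.max() + gap_rows < test_idx.min())

print(fold_ok)
```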

The Pipeline Solution

Use scikit-learn Pipelines to prevent contamination:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Create a pipeline - preprocessing happens inside CV
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=5)),
    ('classifier', LogisticRegression())
])

# Cross-validation now correctly separates train/test
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")
Always use Pipelines for preprocessing + modeling. They ensure that transformations are properly fit only on training folds.

Common Leakage Sources Checklist

Data Collection Issues

  • Timestamps mixed between train/test
  • Data from same entity in both train and test
  • Features computed from all data (global statistics)

Feature Engineering Issues

  • Features derived from target
  • Future information in features
  • Rolling windows including future data
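
As a small illustration of the rolling-window item, the sketch below contrasts a centered window (which peeks at tomorrow) with a trailing window built only from earlier days. The `revenue` series is made-up data.

```python
import pandas as pd

revenue = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Leaky: a centered window averages yesterday, today, AND tomorrow
centered = revenue.rolling(window=3, center=True).mean()

# Safe: shift by one day first, so each window covers only strictly earlier days
trailing = revenue.shift(1).rolling(window=3).mean()

print(pd.DataFrame({"revenue": revenue, "centered": centered, "trailing": trailing}))
```

The trailing version yields missing values for the first few days, which is expected: there is not yet enough history to fill the window.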

Preprocessing Issues

  • Scaling fit on all data
  • Imputation fit on all data
  • Feature selection using all data
  • PCA/dimensionality reduction on all data

Validation Issues

  • Random split on time series
  • Same group/patient in train and test
  • Test set seen during hyperparameter tuning
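
For the "same group/patient in train and test" item, one possible sketch uses scikit-learn's `GroupShuffleSplit` and then verifies that no group straddles the split. The patient IDs and random data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical data: 50 patients with 10 visits each
patient_ids = np.repeat(np.arange(50), 10)
X_visits = rng.standard_normal((500, 4))
y_visits = rng.integers(0, 2, size=500)

# test_size applies to groups, so about 20% of patients go to the test set
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_visits, y_visits, groups=patient_ids))

# Sanity check: no patient should appear on both sides of the split
shared_patients = set(patient_ids[train_idx]) & set(patient_ids[test_idx])
print(f"Patients in both train and test: {len(shared_patients)}")
```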

How to Detect Leakage

1. Suspiciously Good Results

The “too good to be true” heuristic is surprisingly reliable. In most real-world ML problems, a model that dramatically outperforms published benchmarks or domain-expert baselines probably has a leak, not a breakthrough.
# If your model seems too good to be true...
accuracy = 0.99  # placeholder: substitute your model's held-out accuracy
if accuracy > 0.98:
    print("WARNING: Accuracy suspiciously high!")
    print("Check for data leakage before celebrating!")
    # Rule of thumb: if a simple model (logistic regression) gets > 95%
    # on a problem that domain experts find hard, investigate immediately

2. Feature Importance Analysis

If a feature that should not logically be predictive shows up as the most important, that is a strong signal of leakage. For example, if “customer_id” is the top feature in a churn model, something is wrong — IDs should carry no predictive information.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# If irrelevant features are most important, you have leakage
model = RandomForestClassifier()
model.fit(X_train, y_train)

importances = pd.DataFrame({
    'feature': [f'feature_{i}' for i in range(X_train.shape[1])],
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("Top features (check if these make sense!):")
print(importances.head(10))

3. Validation Gap

A large gap between cross-validation scores and holdout scores is a red flag. If your CV says 95% but your truly held-out test set says 80%, either your CV has leakage, or there is a distribution shift between training and test periods.
# Large gap between CV and holdout = potential leakage
cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)

gap = abs(cv_score - holdout_score)
if gap > 0.1:
    print(f"WARNING: Large gap between CV ({cv_score:.3f}) and holdout ({holdout_score:.3f})")
    print("Possible data leakage or distribution shift!")
    # Common culprits:
    # - Preprocessing done before CV split (scaler fit on all data)
    # - Feature selection done before CV split
    # - Same entity (user, patient) appearing in both train and test

Real-World Prevention Strategy

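from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def safe_ml_pipeline(X, y, time_column=None):
    """
    A leakage-safe ML pipeline template.

    Assumes X is a DataFrame and y is a Series sharing X's (unique) index.
    """
    # 1. Split FIRST, before any processing
    if time_column is not None:
        # Temporal split: oldest 80% for training, newest 20% for testing
        X_sorted = X.sort_values(time_column)
        split_idx = int(len(X_sorted) * 0.8)
        X_train = X_sorted.iloc[:split_idx]
        X_test = X_sorted.iloc[split_idx:]
        # Align labels by index label (.loc); .iloc would treat labels as positions
        y_train = y.loc[X_train.index]
        y_test = y.loc[X_test.index]
        # Drop the time column so the scaler and model never see it
        X_train = X_train.drop(columns=[time_column])
        X_test = X_test.drop(columns=[time_column])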
    else:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
    
    print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
    
    # 2. Create pipeline (all preprocessing inside)
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', RandomForestClassifier())
    ])
    
    # 3. Fit only on training data
    pipeline.fit(X_train, y_train)
    
    # 4. Evaluate on untouched test set
    train_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    
    print(f"Train score: {train_score:.3f}")
    print(f"Test score: {test_score:.3f}")
    print(f"Gap: {train_score - test_score:.3f}")
    
    if train_score - test_score > 0.2:
        print("⚠️  Large train-test gap - check for overfitting or leakage")
    
    return pipeline

Key Takeaways

Split First

Always split data before any preprocessing or analysis

Use Pipelines

Scikit-learn Pipelines prevent contamination

Question Features

Ask: “Would I have this at prediction time?”

Validate Results

If it’s too good to be true, it probably is

What’s Next?

Now let’s learn about dimensionality reduction - handling high-dimensional data effectively!

Continue to Dimensionality Reduction

PCA, t-SNE, and handling the curse of dimensionality

Interview Deep-Dive

A 99.2% accuracy on fraud detection is a giant red flag, not a reason to celebrate. Fraud detection is inherently imbalanced (typically 0.1-1% fraud rate), so predicting “not fraud” for everything already gives you 99%+ accuracy. Here is my interrogation checklist:
  • What is the base rate? If 99% of transactions are legitimate, a model that always predicts “legitimate” gets 99% accuracy. I need to see precision, recall, and PR-AUC on the minority class specifically. If recall on fraud is below 50%, the model is useless regardless of overall accuracy.
  • How was the train-test split done? Was it random or temporal? For fraud, new fraud patterns emerge over time. A random split puts future fraud patterns in the training set, inflating test performance. I need to see a temporal split where the model is trained on months 1-6 and tested on months 7-8.
  • What features are in the model? I would audit every feature for target leakage. Common leaky features in fraud models: chargeback_amount (only exists because fraud was detected), investigation_flag (created after the fraud label), account_frozen_date (consequence of fraud, not a predictor). The litmus test: “Would this feature exist at the exact moment we need to make a prediction?”
  • Was the preprocessing done before or after the split? If StandardScaler or target encoding was fit on the entire dataset, test statistics leaked into training. This is subtle but common.
  • What happens when you remove the top feature? If removing one feature drops accuracy from 99.2% to 65%, that feature is almost certainly leaking. A legitimate model should degrade gracefully, not collapse.
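
The base-rate point in the checklist above can be sketched with a model that never predicts fraud. The labels are synthetic (10 fraud cases in 1,000 transactions) and `DummyClassifier` stands in for the "always legitimate" predictor.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

# Hypothetical labels: 10 fraud cases in 1,000 transactions (1% base rate)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1
X_unused = np.zeros((1000, 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X_unused, y_true)
y_pred = baseline.predict(X_unused)
y_score = baseline.predict_proba(X_unused)[:, 1]  # predicted fraud probability

accuracy = accuracy_score(y_true, y_pred)
fraud_recall = recall_score(y_true, y_pred)
pr_auc = average_precision_score(y_true, y_score)

print(f"Accuracy: {accuracy:.3f}, fraud recall: {fraud_recall:.3f}, PR-AUC: {pr_auc:.3f}")
```

Accuracy looks excellent while recall on the fraud class is zero, which is why the minority-class metrics are the ones worth interrogating.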
Follow-up: You discovered that one feature, “days_until_dispute,” has the highest feature importance. What does that tell you?
That is a textbook example of target leakage. “Days until dispute” is information derived from the outcome: you only know when a dispute was filed after the fraud has been identified. At prediction time (when the transaction occurs), this feature does not exist yet. The model learned: “if days_until_dispute is non-null, it is fraud”, which is trivially true but completely useless for catching fraud before it happens. I would remove this feature, retrain, and expect accuracy to drop substantially. The resulting model will be worse on paper but actually useful in production.
The subtlest leakage I have encountered involves group-level information leaking across train-test boundaries.
  • Patient-level leakage in medical ML. Imagine predicting disease from lab results. A patient has 10 visits over 2 years. If you randomly split rows into train and test, the same patient’s visits appear in both sets. The model learns patient-specific patterns (their baseline lab values, their doctor’s ordering habits) rather than general medical signals. On paper, accuracy is high. On a new patient the model has never seen, it fails.
  • The fix: GroupKFold or GroupShuffleSplit. Ensure all rows from the same patient are in either train or test, never both. This is not optional — it is methodologically required whenever your data has grouped structure.
  • Another subtle one: global aggregation features. Say you create a feature “average transaction amount for this merchant.” If you compute this average across the entire dataset (including test rows), you have leaked test-set transaction amounts into the training features. The correct approach: compute the average only on the training set, then map it to both train and test.
  • Time-based feature leakage via rolling windows. You compute “7-day rolling average revenue” for a forecasting model. But your rolling window implementation uses a centered window (3 days before, current day, 3 days after) instead of a trailing window (7 days before). The “3 days after” is future information. This bug produces no error, gives slightly better backtest results, and silently fails in production where future data does not exist.
  • Data collection leakage. In a hospital ICU mortality prediction model, the number of lab tests ordered is a strong predictor — but only because sicker patients get more tests. The model learns “more tests = higher risk” which is a proxy for physician judgment, not a medical signal. In a new hospital with different testing protocols, this feature becomes noise.
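
The global-aggregation fix described in the list above might look like the following sketch. The `train` and `test` frames, the column names, and the fallback to the overall training mean for unseen merchants are all illustrative assumptions.

```python
import pandas as pd

train = pd.DataFrame({
    "merchant": ["a", "a", "b", "b"],
    "amount": [10.0, 30.0, 100.0, 300.0],
})
test = pd.DataFrame({
    "merchant": ["a", "b", "c"],  # merchant "c" never appears in training
    "amount": [1000.0, 5.0, 50.0],
})

# Compute the statistic from training rows only
merchant_avg = train.groupby("merchant")["amount"].mean()
global_avg = train["amount"].mean()  # fallback for merchants unseen in training

# Map the training-derived statistic onto both sets
train["merchant_avg_amount"] = train["merchant"].map(merchant_avg)
test["merchant_avg_amount"] = test["merchant"].map(merchant_avg).fillna(global_avg)

print(test)
```

Note that the test-set amounts never influence the feature, even for merchants that appear in both sets.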
Follow-up: How would you build an automated leakage detection system for a team of 20 data scientists?
I would implement three automated checks in the ML pipeline. First, a feature-level temporal audit: for any time-indexed dataset, automatically verify that every feature at time T is computed only from data at time T or earlier. Second, a “suspicion score” that flags any feature where permutation importance is more than 3x higher than the next most important feature, since disproportionate importance is the strongest leakage signal. Third, a mandatory train-test gap: enforce a buffer period between training and test data (e.g., 7 days) in the pipeline configuration, making the most common form of temporal leakage structurally impossible. These checks should run in CI and block model deployment if violated.
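
A rough sketch of the "suspicion score" idea, using scikit-learn's `permutation_importance`. The helper name `dominant_feature`, the `ratio` threshold, and the synthetic data (where the last column is a deliberate copy of the target) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def dominant_feature(model, X_val, y_val, ratio=3.0, random_state=0):
    """Return the index of a feature whose permutation importance exceeds
    `ratio` times the runner-up, or None if no feature dominates."""
    result = permutation_importance(
        model, X_val, y_val, n_repeats=5, random_state=random_state
    )
    order = np.argsort(result.importances_mean)[::-1]
    top = result.importances_mean[order[0]]
    second = result.importances_mean[order[1]]
    if top > 0 and top > ratio * max(second, 0.0):
        return int(order[0])
    return None

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((600, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_with_leak = np.column_stack([X_demo, y_demo])  # column 5 copies the target

X_tr, X_val, y_tr, y_val = train_test_split(
    X_with_leak, y_demo, test_size=0.3, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

flagged = dominant_feature(model, X_val, y_val)
print(f"Flagged feature index: {flagged}")
```

A flag is a prompt for a human audit of that feature, not proof of leakage; a legitimately strong predictor can trip the same threshold.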
Cross-validation prevents some forms of leakage but is completely blind to others. Understanding which types it catches and which it misses is critical.
  • CV prevents train-test contamination in preprocessing — but only if you use pipelines. If your scaler, imputer, or feature selector is inside a Pipeline and that Pipeline is passed to cross_val_score, each fold correctly fits the preprocessor only on training data. If you preprocess before calling CV, the leakage happens before CV sees the data, and CV cannot detect or prevent it.
  • CV does NOT prevent target leakage. If your feature “chargeback_amount” is derived from the target, it is leaky in every fold. CV will happily report 99% accuracy in every fold, giving you false confidence. Target leakage is a feature engineering problem, not a validation problem.
  • Standard CV does NOT prevent temporal leakage. Whether or not it shuffles (scikit-learn's KFold does not shuffle by default), KFold trains on folds that come later in time than the test fold, so future data ends up in the training set. You must use TimeSeriesSplit for temporal data. This is the single most common CV mistake in time series ML.
  • Standard CV does NOT prevent group leakage. If patient A’s rows are in both train and test folds, KFold will not flag this. Use GroupKFold when your data has natural groupings.
  • Nested CV prevents one additional form: validation set overfitting. If you tune hyperparameters on CV folds and report the same CV score as your performance estimate, that score is optimistically biased. Nested CV uses an outer loop for performance estimation and an inner loop for tuning, keeping them honest.
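
Nested CV, as described in the last item above, can be sketched by passing a `GridSearchCV` object to `cross_val_score`. The data, the `C` grid, and the fold counts are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((300, 10))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

# The search is refit from scratch inside every outer training fold
search = GridSearchCV(pipeline, {"classifier__C": [0.01, 0.1, 1.0, 10.0]}, cv=inner_cv)
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv)

print(f"Nested CV: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
```

Because each outer test fold is never seen by the inner search, the outer scores estimate the performance of the whole tuning procedure rather than of one lucky configuration.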
The bottom line: CV is a powerful tool but it operates on the data you give it. If the data itself is leaky (target leakage, temporal leakage, group leakage), CV will faithfully evaluate a leaky model and tell you it is great. Domain knowledge and feature auditing are the only defenses against those forms of leakage.
Follow-up: How would you validate that a pipeline with proper CV is truly leakage-free?
The strongest test is the “production simulation” test. Take your trained model, freeze it, and evaluate it on a completely held-out dataset that was collected after the training data, from different entities, with no overlap. If the production-simulation score is close to the CV score, you are likely leakage-free. If there is a large gap (CV says 95%, production simulation says 78%), you have leakage somewhere; work backwards through each feature and preprocessing step to find it. I also recommend a “baseline sanity check”: fit a dummy model (DummyClassifier with strategy='stratified') to establish the chance-level score, then rerun the entire workflow, including every preprocessing and feature-selection step, on randomly shuffled labels. Because DummyClassifier ignores the features, it only sets the floor; if the real workflow scores significantly above that floor on shuffled labels, some step is using label information from outside the training folds.
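
One way to sketch a sanity check of this kind is a label-permutation test. This is a swapped-in technique using scikit-learn's `permutation_test_score`, not the DummyClassifier check itself, and the data and pipeline below are illustrative. For a sound workflow, scores on shuffled labels should hover near chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((300, 10))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Refit and score the pipeline on the real labels and on 30 shuffled copies
score, permuted_scores, p_value = permutation_test_score(
    pipeline, X_demo, y_demo, cv=5, n_permutations=30, random_state=0
)

print(f"Real labels: {score:.3f}")
print(f"Shuffled labels (mean): {permuted_scores.mean():.3f}")
print(f"p-value: {p_value:.3f}")
```

This check only covers steps that live inside the estimator passed in. Any preprocessing done outside it would need to be rerun on the shuffled labels by hand, and features derived from the original target are not caught because shuffling breaks their link to the labels.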