Common ML Mistakes

The ML Hall of Shame

Every data scientist has made these mistakes. Learn from them so you don’t have to repeat them!

Mistake 1: Training on the Test Set

❌ Wrong

# Fitting ANYTHING on all data before the split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses test data statistics!

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

✅ Correct

# Split FIRST, then fit only on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # Only transform!

Why it matters: Test set statistics leak into training, giving overly optimistic results.
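One way to make the split-first rule hard to break is to put the scaler and the model in a single pipeline, so that cross-validation refits the scaler on each fold's training portion. The sketch below assumes synthetic data from make_classification and uses LogisticRegression purely as an illustrative model; neither comes from the example above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: any numeric feature matrix behaves the same way)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out the test set before anything is fitted
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# In each fold, the scaler is fit on that fold's training portion only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Final fit on the full training set, then a single evaluation on the test set
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_score:.3f}")
```

Because the scaler lives inside the pipeline, there is no separate fit_transform call that could accidentally see test rows.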

Mistake 2: Using Accuracy for Imbalanced Data

❌ Wrong

# 99% accuracy sounds great!
from sklearn.metrics import accuracy_score

model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
# But: the model may just predict the majority class for everything

✅ Correct

# Use metrics that reflect minority-class performance
from sklearn.metrics import classification_report, f1_score, roc_auc_score

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"F1: {f1_score(y_test, y_pred):.4f}")

Rule of thumb: If the class ratio exceeds 10:1, don’t rely on accuracy.
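To see why accuracy misleads here, the following sketch scores a classifier that always predicts the majority class on a 99:1 dataset. The labels are made up for illustration, and DummyClassifier stands in for a model that has learned nothing.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# 990 negatives and 10 positives: a 99:1 class ratio (illustrative data)
y_true = np.array([0] * 990 + [1] * 10)
X_dummy = np.zeros((len(y_true), 1))  # features are irrelevant for this baseline

# A "model" that always predicts the most frequent class
majority = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = majority.predict(X_dummy)

acc = accuracy_score(y_true, y_pred)           # 0.99: looks excellent
rec = recall_score(y_true, y_pred)             # 0.0: no positive is ever found
bal = balanced_accuracy_score(y_true, y_pred)  # 0.5: no better than chance

print(f"Accuracy: {acc:.2%}, recall: {rec:.2%}, balanced accuracy: {bal:.2%}")
```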

Mistake 3: Random Split for Time Series

❌ Wrong

# Random shuffle breaks temporal order
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)
# Now you may be training on Dec 2024 data to predict Jan 2024!

✅ Correct

# Temporal split: train on the past, test on the future
split_date = '2024-01-01'
train_mask = df['date'] < split_date

X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]

# Or use TimeSeriesSplit for CV (rows must already be sorted by time)
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
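As a quick check of what TimeSeriesSplit produces, this sketch prints the fold indices for twelve time-ordered rows. The toy array is an assumption; the point is only that every training index precedes every test index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations, assumed to be sorted from oldest to newest
X_ts = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)

boundaries = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    # Record the last training index and the first test index of each fold
    boundaries.append((int(train_idx.max()), int(test_idx.min())))
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# The training window always ends before the test window starts
always_past_to_future = all(last_train < first_test for last_train, first_test in boundaries)
```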

Mistake 4: Ignoring Feature Scaling

❌ Wrong

# SVM, KNN, neural nets need scaled features!
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)  # age: 0-100, income: 0-1,000,000
# Income dominates everything

✅ Correct

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

svm = make_pipeline(StandardScaler(), SVC())
svm.fit(X_train, y_train)

Models that need scaling: SVM, KNN, Neural Networks, PCA, Logistic Regression (with regularization)

Models that don’t need scaling: Decision Trees, Random Forest, Gradient Boosting
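The effect on distance-based models can be shown with a small calculation. The sketch below uses four made-up (age, income) rows and measures how much of the squared Euclidean distance between two people comes from the age column, before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (years), income (dollars). Values are made up for illustration.
people = np.array([
    [25.0, 50_000.0],
    [65.0, 50_500.0],
    [26.0, 80_000.0],
    [40.0, 20_000.0],
])

def age_share_of_distance(a, b):
    """Fraction of the squared Euclidean distance contributed by the age column."""
    sq_diff = (a - b) ** 2
    return float(sq_diff[0] / sq_diff.sum())

# Rows 0 and 1 differ by 40 years of age but only $500 of income
raw_share = age_share_of_distance(people[0], people[1])

scaled = StandardScaler().fit_transform(people)
scaled_share = age_share_of_distance(scaled[0], scaled[1])

print(f"Age share of distance, raw: {raw_share:.2%}, scaled: {scaled_share:.2%}")
```

On the raw values the 40-year age gap accounts for under 1% of the distance; after scaling it accounts for nearly all of it.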

Mistake 5: Feature Leakage from Target

❌ Wrong

# Feature derived from the target, computed on ALL rows
df['avg_purchase_by_customer_type'] = df.groupby('customer_type')['purchase'].transform('mean')
# Test-set purchase values leak into the feature!

✅ Correct

# Calculate on training data only (y_train holds the 'purchase' target)
train_means = y_train.groupby(X_train['customer_type']).mean()
X_train['avg_purchase_type'] = X_train['customer_type'].map(train_means)
X_test['avg_purchase_type'] = X_test['customer_type'].map(train_means)

Mistake 6: Dropping Missing Values Carelessly

❌ Wrong

# Drop all rows with any missing value
df_clean = df.dropna()
# You may have just lost 50% of your data!

✅ Correct

# Strategy 1: Impute (fit on training data only)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Strategy 2: Create missing indicator
df['feature_missing'] = df['feature'].isna().astype(int)
df['feature'] = df['feature'].fillna(df['feature'].median())

# Strategy 3: Drop only columns with too many missing values
threshold = 0.5
cols_to_drop = df.columns[df.isna().mean() > threshold]
df = df.drop(columns=cols_to_drop)
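Strategies 1 and 2 can be combined in one step: SimpleImputer accepts add_indicator=True, which appends a 0/1 column for each feature that had missing values during fit. The tiny arrays below are made up so the result can be checked by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: both columns have a missing value in the training set
X_train = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [100.0, 40.0],
])
X_test = np.array([
    [np.nan, 20.0],
    [4.0, np.nan],
])

# Median imputation plus one "was missing" indicator column per affected feature
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train)

# Test rows are filled with the TRAINING medians (2.0 and 30.0)
X_test_imp = imputer.transform(X_test)
print(X_test_imp)
```

The output has four columns: the two imputed features followed by their two indicator columns.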

Mistake 7: Overfitting to Validation Set

❌ Wrong

# Keep tuning until validation score is perfect
best_score = 0.0
for i in range(1000):
    model = train_with_new_hyperparameters()
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_model = score, model
# You've now overfit to the validation set!

✅ Correct

# Use nested cross-validation for an honest estimate
from sklearn.model_selection import cross_val_score, GridSearchCV

# Inner loop: hyperparameter tuning
# Outer loop: performance estimation
outer_scores = cross_val_score(
    GridSearchCV(model, param_grid, cv=3),
    X, y, cv=5
)
print(f"Honest estimate: {outer_scores.mean():.4f}")
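A simpler alternative to nested cross-validation is to set aside a final test set that is scored exactly once, after all tuning is finished. The sketch below assumes synthetic data and a small, arbitrary grid over LogisticRegression's C parameter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Lock away 20% as the final test set before any tuning starts
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# All hyperparameter tuning happens inside the development set only
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_dev, y_dev)

# The test set is scored once, with the already-chosen configuration
final_score = search.score(X_test, y_test)
print(f"Best C: {search.best_params_['C']}, final test accuracy: {final_score:.4f}")
```

If the test score is then used to pick between further variants, it stops being an honest estimate and a fresh holdout is needed.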

Mistake 8: Not Checking for Data Drift

❌ Wrong

# Train once, deploy forever
model = train(historical_data)
deploy(model)
# 6 months later: "Why is accuracy dropping?"

✅ Correct

# Monitor distribution shifts (relative change in each column's mean)
def check_drift(reference_data, new_data, threshold=0.1):
    for col in reference_data.columns:
        ref_mean = reference_data[col].mean()
        new_mean = new_data[col].mean()
        # abs() in the denominator keeps the ratio positive for negative means
        shift = abs(ref_mean - new_mean) / (abs(ref_mean) + 1e-10)

        if shift > threshold:
            print(f"⚠️ Drift detected in {col}: {shift:.1%}")

# Monitor predictions against the mean predicted probability seen historically
def monitor_predictions(model, X_new, historical_mean, tolerance=0.1):
    probs = model.predict_proba(X_new)[:, 1]
    # Check shifts in both directions, not only increases
    if abs(probs.mean() - historical_mean) > tolerance:
        print("⚠️ Prediction distribution has shifted!")
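Comparing means misses changes in spread or shape. As one possible refinement, the sketch below applies SciPy's two-sample Kolmogorov-Smirnov test to each numeric column. The function name ks_drift_report, the alpha threshold, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def ks_drift_report(reference, new, alpha=0.01):
    """Return {column: p_value} for columns whose distributions differ (KS test)."""
    drifted = {}
    for col in reference.columns:
        statistic, p_value = ks_2samp(reference[col], new[col])
        if p_value < alpha:
            drifted[col] = p_value
    return drifted

rng = np.random.default_rng(0)

# Reference window: data the model was trained on (synthetic)
reference = pd.DataFrame({
    "age": rng.normal(40, 10, 2000),
    "income": rng.normal(50, 5, 2000),
})

# New window: income has shifted upward by two standard deviations
new = pd.DataFrame({
    "age": rng.normal(40, 10, 2000),
    "income": rng.normal(60, 5, 2000),
})

drifted = ks_drift_report(reference, new)
print(f"Columns flagged for drift: {sorted(drifted)}")
```

With many columns, some will be flagged by chance at any fixed alpha, so a flagged column is a prompt to investigate rather than proof of drift.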

Mistake 9: One-Hot Encoding High Cardinality

❌ Wrong

# City has 10,000 unique values
df = pd.get_dummies(df, columns=['city'])
# Now you have 10,000 sparse columns!

✅ Correct

# Strategy 1: Target encoding (means from training rows only, see Mistake 5)
# train_df / test_df are the frames obtained after splitting df
city_means = train_df.groupby('city')['target'].mean()
train_df['city_encoded'] = train_df['city'].map(city_means)
test_df['city_encoded'] = test_df['city'].map(city_means).fillna(train_df['target'].mean())

# Strategy 2: Frequency encoding
city_freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(city_freq)

# Strategy 3: Group rare categories
top_cities = df['city'].value_counts().head(50).index
df['city_grouped'] = df['city'].where(df['city'].isin(top_cities), 'Other')
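When grouping rare categories, the list of categories to keep is itself something learned from data, so it is safest to learn it on the training set and reuse it on the test set. A minimal sketch, with hypothetical helper names fit_top_categories and apply_grouping and made-up city values:

```python
import pandas as pd

def fit_top_categories(train_col, top_n):
    """Learn which categories to keep, using training data only."""
    return set(train_col.value_counts().head(top_n).index)

def apply_grouping(col, keep, other="Other"):
    """Replace every category outside `keep` with a single placeholder."""
    return col.where(col.isin(keep), other)

# Toy data: 'Boise' is rare in training, 'Austin' never appears in training
train_city = pd.Series(["NYC", "NYC", "NYC", "LA", "LA", "Boise"])
test_city = pd.Series(["NYC", "Austin", "Boise"])

keep = fit_top_categories(train_city, top_n=2)

train_grouped = apply_grouping(train_city, keep)
test_grouped = apply_grouping(test_city, keep)

print(train_grouped.tolist())
print(test_grouped.tolist())
```

Unseen test categories fall into the same placeholder as rare ones, so the model never receives a value it was not trained on.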

Mistake 10: Ignoring Class Imbalance in CV

❌ Wrong

# Plain K-fold ignores class labels when forming folds
from sklearn.model_selection import KFold, cross_val_score

scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))
# Some folds might have very few minority samples
# (Passing cv=5 with a classifier already stratifies, but without shuffling)

✅ Correct

from sklearn.model_selection import StratifiedKFold

# Stratified CV preserves class ratios in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
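The difference is easy to see by counting minority samples per test fold. The sketch below uses made-up labels with all ten positives stored at the end of the array, which is a worst case for unshuffled K-fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 90 negatives followed by 10 positives (illustrative, deliberately ordered)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features do not matter for the split itself

# Number of positive samples in each test fold
plain_counts = [int(y[test].sum()) for _, test in KFold(n_splits=5).split(X)]

stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat_counts = [int(y[test].sum()) for _, test in stratified.split(X, y)]

print(f"KFold positives per fold:           {plain_counts}")
print(f"StratifiedKFold positives per fold: {strat_counts}")
```

Plain K-fold puts all ten positives in the last fold, so four folds cannot measure minority-class performance at all, while stratification gives each fold two.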

Mistake 11: Not Setting Random Seeds

❌ Wrong

# Results change every run
model = RandomForestClassifier()
model.fit(X_train, y_train)
# "I swear it worked yesterday!"

✅ Correct

import numpy as np

# Set seeds everywhere
RANDOM_STATE = 42

np.random.seed(RANDOM_STATE)

model = RandomForestClassifier(random_state=RANDOM_STATE)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)
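A simple way to confirm that the seeds are doing their job is to run the whole experiment twice and compare the outputs. The run_experiment helper and the synthetic data below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

def run_experiment(seed):
    """Generate data, split, train, and return test-set probabilities."""
    X, y = make_classification(n_samples=300, n_features=8, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    return model.predict_proba(X_test)

first = run_experiment(RANDOM_STATE)
second = run_experiment(RANDOM_STATE)

# With every source of randomness seeded, both runs match exactly
identical = np.array_equal(first, second)
print(f"Runs identical: {identical}")
```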

Mistake 12: Selecting Features Before Train-Test Split

❌ Wrong

# Feature selection on all data
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=10)
X_selected = selector.fit_transform(X, y)  # Uses test info!

X_train, X_test, y_train, y_test = train_test_split(X_selected, y)

✅ Correct

# Feature selection inside cross-validation or on train only
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('selector', SelectKBest(k=10)),
    ('classifier', RandomForestClassifier())
])

# Selector fit only on training folds
scores = cross_val_score(pipeline, X, y, cv=5)
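How much this matters can be seen on pure noise, where no feature carries any real signal. In the sketch below, the data, the choice of f_classif, and LogisticRegression are illustrative assumptions; any honest evaluation should land near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# 60 samples, 3000 features of pure noise, labels unrelated to the features
X_noise = rng.normal(size=(60, 3000))
y_noise = np.array([0, 1] * 30)

# Leaky: pick the 20 "best" features using ALL rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Honest: selection is refit inside each training fold
honest_pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(honest_pipe, X_noise, y_noise, cv=5).mean()

print(f"Leaky CV accuracy:  {leaky_score:.2f}")
print(f"Honest CV accuracy: {honest_score:.2f}")
```

The leaky estimate looks far better than chance even though there is nothing to learn, because the selected features were chosen for their accidental correlation with the held-out labels.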

Mistake 13: Using Mean for Skewed Data

❌ Wrong

# Income is highly skewed
df['income'] = df['income'].fillna(df['income'].mean())
# Mean = $85k but median = $50k
# Filling with mean inflates values

✅ Correct

# Use median for skewed distributions
df['income'] = df['income'].fillna(df['income'].median())

# Or log-transform first, then impute on the log scale
import numpy as np
df['log_income'] = np.log1p(df['income'])
df['log_income'] = df['log_income'].fillna(df['log_income'].median())
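A tiny worked example shows how far a single outlier pulls the mean. The income values below are made up.

```python
import numpy as np
import pandas as pd

# Five known incomes (one very large) and one missing value
income = pd.Series([30_000, 40_000, 50_000, 60_000, 1_000_000, np.nan], dtype=float)

mean_value = income.mean()      # 236000.0, dragged up by the single outlier
median_value = income.median()  # 50000.0, a typical value

# fillna returns a new Series, so the result must be assigned
filled = income.fillna(median_value)

print(f"Mean: {mean_value:,.0f}, median: {median_value:,.0f}")
print(f"Imputed value: {filled.iloc[-1]:,.0f}")
```

Filling with the mean here would assign the missing person an income higher than four of the five observed ones.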

Mistake 14: Trusting Default Hyperparameters

❌ Wrong

# Just use defaults
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
# "Good enough"

✅ Correct

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.1, 0.2]
}

grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid,
    cv=5,
    scoring='roc_auc'
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
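The grid above has 3 × 4 × 3 = 36 combinations, which means 180 model fits with 5-fold CV. When that is too slow, a randomized search over parameter distributions is a common alternative. The sketch below assumes synthetic data, and the ranges and n_iter=8 are illustrative choices rather than recommendations.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Distributions to sample from instead of a fixed grid
param_distributions = {
    "n_estimators": randint(50, 201),          # integers from 50 to 200
    "max_depth": randint(2, 6),                # integers from 2 to 5
    "learning_rate": loguniform(0.01, 0.3),    # sampled on a log scale
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=8,            # only 8 sampled configurations are evaluated
    cv=3,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X, y)

print(f"Best params: {search.best_params_}")
print(f"Best CV ROC-AUC: {search.best_score_:.4f}")
```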

Mistake 15: Complex Model Without Baseline

❌ Wrong

# Jump straight to deep learning
model = SuperComplexNeuralNetwork(layers=50)
model.fit(X, y)
# "My model has 89% accuracy!"

✅ Correct

# Always compare to baselines
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Baseline 1: Random guessing
dummy = DummyClassifier(strategy='stratified')
print(f"Random baseline: {cross_val_score(dummy, X, y).mean():.4f}")

# Baseline 2: Simple model
lr = LogisticRegression()
print(f"Logistic Regression: {cross_val_score(lr, X, y).mean():.4f}")

# Now try complex model
complex_model = GradientBoostingClassifier(n_estimators=200)
print(f"Complex model: {cross_val_score(complex_model, X, y).mean():.4f}")

# Is the improvement worth the complexity?

Quick Reference Checklist

Before Training

Split data before any preprocessing
Set random seeds for reproducibility
Check class balance
Handle missing values appropriately
Scale features if needed by algorithm

During Training

Use pipelines to prevent leakage
Use stratified CV for imbalanced data
Use temporal splits for time series
Compare to baseline models
Tune hyperparameters systematically

After Training

Evaluate on held-out test set
Use appropriate metrics (not just accuracy)
Check for overfitting (train vs test gap)
Validate feature importance makes sense
Document everything

In Production

Monitor for data drift
Track prediction distributions
Set up alerts for performance degradation
Plan for model retraining

Key Takeaways

Split First

Always separate test data before any processing

Use Pipelines

Prevent leakage with sklearn pipelines

Right Metrics

Match metrics to your problem

Start Simple

Baseline first, complexity later

Congratulations! 🎉

You’ve completed the ML Mastery course! You now have comprehensive knowledge of:

ML fundamentals and algorithms
Feature engineering and data preprocessing
Model evaluation and selection
Advanced topics (time series, deep learning, deployment)
Professional practices (pipelines, explainability, common mistakes)

Continue Your Journey

AI Engineering

Build LLM-powered applications and agents

Math Foundations

Deepen your mathematical understanding

System Design

Design ML systems at scale

Kaggle Competitions

Apply your skills in real competitions
