Common ML Mistakes

The ML Hall of Shame

Every data scientist has made these mistakes — usually more than once. The tricky part is that most of these produce no error messages. Your code runs fine, your metrics look great, and you only discover the problem months later when the model fails in production. This chapter is your checklist for avoiding the silent killers.

Mistake 1: Training on the Test Set

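# Fitting ANYTHING on all data before split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses test data statistics!

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)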
Why it matters: When you fit the scaler on all data, the test set’s mean and standard deviation influence the training data’s scaling. This is subtle — the accuracy inflation might be only 1-2%, but that 1-2% is the difference between “model is ready” and “model needs more work.” In production, you will not have access to future data, so the scaler will produce different values, and your model will underperform.
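A minimal sketch of the leak-free order, assuming synthetic data (the arrays below are made up for illustration): split first, fit the scaler on the training rows only, then reuse those statistics on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not from the course)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split FIRST, so test rows never touch any fitted statistic
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```

The test set will usually not have exactly zero mean after this transform, and that is expected: it mirrors what happens to unseen data in production.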

Mistake 2: Using Accuracy for Imbalanced Data

# 99% accuracy sounds great!
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
# But: model just predicts majority class for everything
Rule of thumb: If class ratio > 10:1, don’t use accuracy.
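To see why accuracy misleads here, the sketch below builds a made-up 99:1 dataset and scores a model that always predicts the majority class. The data is synthetic, and `DummyClassifier` stands in for "a model that learned nothing".

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic labels: 990 negatives, 10 positives (illustrative only)
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features are irrelevant for this demonstration

majority = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = majority.predict(X)

acc = accuracy_score(y, y_pred)                 # 0.99: looks great
rec = recall_score(y, y_pred, zero_division=0)  # 0.0: finds no positives
f1 = f1_score(y, y_pred, zero_division=0)       # 0.0: useless on the minority class
print(f"accuracy={acc:.2%} recall={rec:.2%} f1={f1:.2%}")
```

Recall, precision, F1, or a precision-recall curve on the minority class are the usual alternatives; which one fits depends on the cost of each error type.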

Mistake 3: Random Split for Time Series

# Random shuffle breaks temporal order
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)
# Now you're training on Dec 2024 to predict Jan 2024!
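One common alternative is scikit-learn's `TimeSeriesSplit`, which always trains on earlier rows and tests on later ones. The sketch assumes the rows are already sorted oldest to newest; the 12-row array is a placeholder.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data: 12 observations, already in chronological order
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for train_idx, test_idx in folds:
    # Every training index precedes every test index
    print("train:", train_idx, "test:", test_idx)

always_past_to_future = all(tr.max() < te.min() for tr, te in folds)
```

For a single holdout, slicing the sorted data at a cutoff date gives the same guarantee with less machinery.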

Mistake 4: Ignoring Feature Scaling

# SVM, KNN, neural nets need scaled features!
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)  # age: 0-100, income: 0-1,000,000
# Income dominates everything
Models that need scaling: SVM, KNN, Neural Networks, PCA, Logistic Regression (with regularization). These models rely on distance calculations (SVM, KNN), variance (PCA), gradient-based optimization (neural nets), or a penalty on coefficient size (regularized logistic regression), so a feature with a large numeric range dominates the rest. Models that do NOT need scaling: Decision Trees, Random Forest, Gradient Boosting. These split on one feature at a time, so only the ordering of values matters: a split at income > $50,000 partitions the rows exactly as a split at income > 0.05 does when income is measured in millions.
When in doubt, scale anyway. Scaling does not hurt tree-based models (they are insensitive to it), while forgetting to scale distance-based models usually does. Making scaling the default is a safe habit.
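A sketch of making scaling the default, assuming synthetic `age` and `income` columns on the ranges mentioned above: wrapping the scaler and the SVM in one pipeline means the scaler is fit only on whatever data `fit` receives.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(0)
age = rng.uniform(0, 100, size=400)
income = rng.uniform(0, 1_000_000, size=400)
X = np.column_stack([age, income])
y = (age > 50).astype(int)  # made-up label that depends only on age

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaler + model travel together; no separate scaling step to forget
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_train, y_train)
test_accuracy = scaled_svm.score(X_test, y_test)
```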

Mistake 5: Feature Leakage from Target

# Features derived from target
df['avg_purchase_by_customer_type'] = df.groupby('customer_type')['purchase'].transform('mean')
# This leaks future purchase information!
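If you do want a target-derived feature like this, one hedged approach is to compute the group means from training rows only and map them onto new rows. The tiny frames below are invented for illustration; unseen categories fall back to the training-set global mean.

```python
import pandas as pd

# Invented training and test frames
train = pd.DataFrame({
    "customer_type": ["a", "a", "b", "b"],
    "purchase": [10.0, 20.0, 30.0, 50.0],
})
test = pd.DataFrame({"customer_type": ["a", "b", "c"]})

# Statistics come from the training rows only
type_means = train.groupby("customer_type")["purchase"].mean()
global_mean = train["purchase"].mean()

# Apply to new rows; category "c" was never seen, so it gets the global mean
test["avg_purchase_by_customer_type"] = (
    test["customer_type"].map(type_means).fillna(global_mean)
)
```

Note that on the training rows themselves, each row's own target still contributes to its group mean. Out-of-fold encoding is a common way to reduce that residual leak.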

Mistake 6: Dropping Missing Values Carelessly

# Drop all rows with any missing value
df_clean = df.dropna()
# Lost 50% of your data!
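Before dropping anything, it helps to measure how much is missing per column and consider imputing instead. The sketch uses an invented four-row frame and simple median/mode fills; whether those fills are appropriate depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Invented data with one gap in each column
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000.0, 62_000.0, np.nan, 58_000.0],
    "city": ["A", "B", "B", None],
})

missing_share = df.isna().mean()        # fraction missing per column
rows_kept_by_dropna = len(df.dropna())  # how much dropna() would leave

# Column-wise imputation keeps every row
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df["age"].median())
df_filled["income"] = df_filled["income"].fillna(df["income"].median())
df_filled["city"] = df_filled["city"].fillna(df["city"].mode()[0])
```

In a real workflow the median and mode should be computed on the training split only, for the same reason as the scaler in Mistake 1.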

Mistake 7: Overfitting to Validation Set

# Keep tuning until validation score is perfect
best_score = 0.0
for i in range(1000):
    model = train_with_new_hyperparameters()  # hypothetical helper
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_model = score, model
# You've now overfit to validation set!
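A common guard is a three-way split: tune against the validation set as much as you like, then touch the test set exactly once at the end. The 60/20/20 proportions and the placeholder arrays below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows, balanced binary labels
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# Step 1: carve off the final test set and do not look at it during tuning
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: split the remainder into train and validation (0.25 * 0.8 = 0.2 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)
```

Cross-validation on `X_trainval` is an alternative to a fixed validation set and makes the tuning signal less noisy.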

Mistake 8: Not Checking for Data Drift

Models are trained on a snapshot of the world. The world changes. Customer behavior shifts, new products launch, economic conditions evolve. A model trained on pre-pandemic e-commerce data would fail spectacularly in 2020. This is data drift, and it is inevitable — the question is not “if” but “when.”
# Train once, deploy forever
model = train(historical_data)
deploy(model)
# 6 months later: "Why is accuracy dropping?"
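One simple drift signal is the Population Stability Index (PSI) between a feature's training distribution and its recent production distribution. The function below is a sketch, not a library API; the data is synthetic, and the usual reading of PSI below 0.1 as stable and above 0.25 as a large shift is a convention, not a guarantee.

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between a reference sample and a current sample of one feature."""
    # Interior bin edges come from reference quantiles, so each bin holds
    # roughly an equal share of the reference data.
    interior_edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(interior_edges, reference), minlength=bins)
    cur_counts = np.bincount(np.searchsorted(interior_edges, current), minlength=bins)
    # Clip shares away from zero so the logarithm stays finite
    eps = 1e-6
    ref_share = np.clip(ref_counts / len(reference), eps, None)
    cur_share = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, size=5000)  # snapshot at training time
same_world = rng.normal(0.0, 1.0, size=5000)        # production, nothing changed
shifted_world = rng.normal(1.0, 1.0, size=5000)     # production, mean has moved

psi_stable = population_stability_index(training_feature, same_world)
psi_shifted = population_stability_index(training_feature, shifted_world)
print(f"stable: {psi_stable:.3f}  shifted: {psi_shifted:.3f}")
```

Running a check like this per feature on a schedule, and alerting when it crosses a threshold, turns "why is accuracy dropping?" into an early warning.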

Mistake 9: One-Hot Encoding High Cardinality

# City has 10,000 unique values
df = pd.get_dummies(df, columns=['city'])
# Now you have 10,000 sparse columns!
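Two lightweight alternatives are grouping rare categories into an "other" bucket and frequency encoding. The sketch uses an invented ten-row `city` column and an arbitrary threshold of 3 occurrences.

```python
import pandas as pd

# Invented data: two common cities, two rare ones
df = pd.DataFrame({"city": ["NYC"] * 5 + ["LA"] * 3 + ["Boise", "Fargo"]})

counts = df["city"].value_counts()

# Option 1: keep frequent cities, collapse the long tail into "other"
common = counts[counts >= 3].index
df["city_grouped"] = df["city"].where(df["city"].isin(common), "other")

# Option 2: replace each city with its share of rows (one numeric column)
df["city_freq"] = df["city"].map(counts / len(df))
```

As with any fitted statistic, the counts should come from the training split only. Target encoding is another option, with the leakage caveats from Mistake 5.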

Mistake 10: Ignoring Class Imbalance in CV

# Regular cross-validation with imbalanced data
scores = cross_val_score(model, X, y, cv=5)
# Some folds might have very few minority samples
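A sketch of stratified folds on a made-up 95:5 dataset. Note that scikit-learn already stratifies when `cv` is an integer and the estimator is a classifier; passing `StratifiedKFold` explicitly documents the intent and lets you control shuffling.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 95 negatives, 5 positives
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count minority samples in each test fold
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(minority_per_fold)
```

The splitter can then be passed as `cv=skf` to `cross_val_score`.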

Mistake 11: Not Setting Random Seeds

# Results change every run
model = RandomForestClassifier()
model.fit(X_train, y_train)
# "I swear it worked yesterday!"
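Fixing `random_state` makes the run repeatable. The sketch trains the same forest twice on synthetic data and checks that the predicted probabilities match exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, itself seeded so the whole script is reproducible
X, y = make_classification(n_samples=200, random_state=0)

def train_and_predict(seed):
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    return model.fit(X, y).predict_proba(X)

run_a = train_and_predict(seed=42)
run_b = train_and_predict(seed=42)
identical = np.array_equal(run_a, run_b)
```

The same idea applies to every source of randomness in the workflow: data splits, cross-validation shufflers, and any NumPy generator you create.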

Mistake 12: Selecting Features Before Train-Test Split

# Feature selection on all data
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=10)
X_selected = selector.fit_transform(X, y)  # Uses test info!

X_train, X_test, y_train, y_test = train_test_split(X_selected, y)
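Putting the selector inside a pipeline is one way to avoid this: during cross-validation, scaling and feature selection are refit on each training fold only. The dataset is synthetic, and `f_classif` with `k=10` is an arbitrary choice for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, only 5 of them informative
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),  # refit inside every fold
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```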

Mistake 13: Using Mean for Skewed Data

# Income is highly skewed
df['income'] = df['income'].fillna(df['income'].mean())
# Mean = $85k but median = $50k
# Filling with mean inflates values
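A quick arithmetic check on invented incomes shows how far a single large value pulls the mean away from the typical case, which is why the median is usually the safer fill for skewed columns.

```python
import statistics

# Invented incomes: five typical values and one outlier
incomes = [30_000, 40_000, 50_000, 60_000, 70_000, 500_000]

mean_income = statistics.mean(incomes)      # 125000
median_income = statistics.median(incomes)  # 55000.0

print(f"mean={mean_income:,.0f} median={median_income:,.0f}")
```

In pandas the equivalent fill would use `df['income'].median()` in place of `.mean()`.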

Mistake 14: Trusting Default Hyperparameters

# Just use defaults
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
# "Good enough"
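A small, systematic search is often enough to find out whether the defaults are leaving performance on the table. The grid below is deliberately tiny and its values are illustrative guesses, not recommended settings; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative grid: 2 x 2 x 2 = 8 candidates
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0), param_grid, cv=3
)
search.fit(X_train, y_train)  # tuning sees training data only

print(search.best_params_, f"CV score: {search.best_score_:.3f}")
```

For larger search spaces, `RandomizedSearchCV` covers more ground for the same compute budget.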

Mistake 15: Complex Model Without Baseline

This is the “Kaggle grandmaster” trap. You spend two weeks tuning XGBoost with 47 hyperparameters and get 89% accuracy. Is that good? You have no idea unless you know what a trivial model achieves. If logistic regression gets 87%, your two weeks bought you 2 percentage points, which is probably not worth the complexity, maintenance burden, and inference latency in production.
# Jump straight to deep learning
model = SuperComplexNeuralNetwork(layers=50)
model.fit(X, y)
# "My model has 89% accuracy!" -- But is that actually good?
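A sketch of establishing baselines first, on synthetic data: a majority-class predictor sets the floor, and logistic regression shows what minimal effort buys. Any complex model then has to beat both numbers to justify itself.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# Floor: always predict the most frequent class
trivial = DummyClassifier(strategy="most_frequent")
trivial_score = cross_val_score(trivial, X, y, cv=5).mean()

# Simple ML baseline: scaled logistic regression
simple = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
simple_score = cross_val_score(simple, X, y, cv=5).mean()

print(f"trivial: {trivial_score:.3f}  logistic regression: {simple_score:.3f}")
```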

Quick Reference Checklist

Before Training

  • Split data before any preprocessing
  • Set random seeds for reproducibility
  • Check class balance
  • Handle missing values appropriately
  • Scale features if needed by algorithm

During Training

  • Use pipelines to prevent leakage
  • Use stratified CV for imbalanced data
  • Use temporal splits for time series
  • Compare to baseline models
  • Tune hyperparameters systematically

After Training

  • Evaluate on held-out test set
  • Use appropriate metrics (not just accuracy)
  • Check for overfitting (train vs test gap)
  • Validate feature importance makes sense
  • Document everything

In Production

  • Monitor for data drift
  • Track prediction distributions
  • Set up alerts for performance degradation
  • Plan for model retraining

Key Takeaways

Split First

Always separate test data before any processing

Use Pipelines

Prevent leakage with sklearn pipelines

Right Metrics

Match metrics to your problem

Start Simple

Baseline first, complexity later

Congratulations! 🎉

You’ve completed the ML Mastery course! You now have comprehensive knowledge of:
  • ML fundamentals and algorithms
  • Feature engineering and data preprocessing
  • Model evaluation and selection
  • Advanced topics (time series, deep learning, deployment)
  • Professional practices (pipelines, explainability, common mistakes)

Continue Your Journey

AI Engineering

Build LLM-powered applications and agents

Math Foundations

Deepen your mathematical understanding

System Design

Design ML systems at scale

Kaggle Competitions

Apply your skills in real competitions

Interview Deep-Dive

This is a behavioral-style question disguised as a technical one. The interviewer wants to see your debugging methodology and your ability to handle production incidents. Here is how a strong answer would be structured:
  • Detection pattern: the “too good to be true” signal. The most common way leakage is discovered is during a routine model retrain. The original model had 97% AUC, but a fresh retrain on recent data gives 83% AUC. The gap is the clue — either the original model was leaky (inflated the original score) or the data has drifted dramatically. Check the original training code first.
  • Common root cause: preprocessing before split. In many production incidents, the leakage is in a StandardScaler or target encoding that was fit on the full dataset before train-test split. The model in production uses a scaler fit on historical data, but the offline evaluation was inflated by the leak. The model was always worse than the metrics suggested.
  • Impact assessment. The real-world impact depends on the domain. In a recommendation system, leaky evaluation means you deployed a model that was less personalized than you thought, leading to lower click-through rates. In fraud detection, it means you missed more fraud than expected. Calculate the gap between reported and actual performance, then estimate the business cost of that gap over the deployment period.
  • Remediation. Fix the leakage, retrain, and honestly report the corrected metrics. Then implement pipeline-level guardrails (automated leakage tests, the random-label permutation test) to prevent recurrence. The hardest part is communicating to stakeholders that the model was never as good as reported — but honesty here builds long-term trust.
  • Prevention going forward. Mandate that all preprocessing lives inside sklearn Pipelines. Add a CI check that runs the model with shuffled labels and verifies that accuracy is near the base rate. Review feature engineering code for any feature derived from the target. Together, these three checks catch most leakage.
Follow-up: If the leaky model was performing “well enough” in production according to business metrics, would you still fix it?

Yes, absolutely. A leaky model that appears to work is a ticking time bomb. It works by coincidence: the leaky feature happens to correlate with the true signal in the current data. When the data distribution shifts (and it always does), the leaky feature and the true signal will diverge, and the model will fail suddenly rather than gracefully. Fixing it now, while things are stable, is far easier than diagnosing a sudden production failure at 2 AM. I would fix the leakage, retrain, and accept the honest lower metrics. If the honest model does not meet business requirements, that is valuable information: it means you need a better model, not a leaky one.
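The shuffled-label check mentioned in the prevention step can be sketched as follows. The data, the model, and the 0.15 tolerance are assumptions for illustration; the idea is only that a pipeline trained on permuted labels should score close to the base rate, and a noticeably higher score suggests something is leaking.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Break any real relationship between features and labels
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
shuffled_score = cross_val_score(pipe, X, y_shuffled, cv=5).mean()

# Base rate: accuracy of always predicting the majority class
positive_share = float(np.mean(y_shuffled))
base_rate = max(positive_share, 1.0 - positive_share)

# Hypothetical CI threshold; tune it to your dataset size
leak_suspected = shuffled_score > base_rate + 0.15
```

scikit-learn also provides `permutation_test_score`, which repeats this idea over many permutations and reports a p-value.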
This is a triage question. You cannot fix everything at once, so prioritization reveals your engineering judgment.
  • Priority 1: Add monitoring (Day 1-3). Before anything else, you need to know if the model is currently working. Add prediction distribution monitoring (track mean, variance, and percentile distribution of model outputs daily), input feature null-rate monitoring, and basic health checks (is the model responding, what is the latency). If the model is silently failing right now, you need to know immediately. This takes a few days and requires no changes to the model itself.
  • Priority 2: Reproduce the model (Week 1-2). Can you retrain the model and get the same performance? If not, you do not truly control the model. Find the training data, the feature engineering code, and the hyperparameters. Pin library versions. Run the training pipeline end-to-end and compare metrics to whatever historical records exist. If you cannot reproduce, you need to understand why before making any changes.
  • Priority 3: Add a holdout evaluation (Week 2-3). Set aside the most recent data as a holdout and evaluate the current model honestly. This tells you how the model is actually performing right now, not how it performed when it was first deployed months ago.
  • Priority 4: Write tests (Week 3-4). Add unit tests for the feature engineering pipeline (given these inputs, do I get these outputs?), integration tests for the end-to-end prediction path, and regression tests that verify model performance stays within expected bounds after retraining.
  • Priority 5: Add documentation (Ongoing). Document the model’s purpose, the feature definitions, the training procedure, the deployment architecture, and the monitoring setup. Do this as you learn the system — document what you discover.
The anti-pattern is trying to refactor the code first. Resist that urge. A poorly written model that is monitored and reproducible is infinitely better than a beautifully refactored model that you accidentally broke during the refactor because you had no tests to catch the regression.

Follow-up: You discover during reproduction that the model’s current production performance is 15% worse than the original reported metrics. What do you do?

First, verify the finding by checking against multiple evaluation windows and metrics. If confirmed, investigate whether this is data drift (the world changed), code drift (a dependency was updated), or the original metrics were leaky. Then communicate the finding to stakeholders with a clear recommendation: either retrain on recent data (if drift), roll back a code change (if code drift), or fix the leakage and accept the honest lower performance. The most important thing is to act on the information. Many teams discover degradation and do nothing because the model is “still good enough”, until it is not.
The single most impactful mistake is deploying a model without a baseline comparison. Teams spend weeks building complex models, achieve some accuracy number, and deploy without ever asking “is this better than a simple rule or a logistic regression?”
  • Why it matters. A gradient boosting model with 89% accuracy sounds impressive until you learn that a logistic regression achieves 87% in 5 minutes of work, or that a hand-written business rule achieves 85% with zero ML infrastructure. The marginal 2-4% improvement from the complex model comes with significant costs: compute, maintenance, debugging complexity, and explainability challenges.
  • The cost of complexity without commensurate benefit. A complex model requires monitoring, retraining pipelines, feature stores, and on-call rotation. A logistic regression or business rule requires almost none of that. If the complex model is only marginally better, the total cost of ownership makes it a net negative.
  • How I prevent it. In every ML project, I mandate three baselines before any complex modeling begins. First, a trivial baseline (always predict the majority class, or random). This establishes the floor. Second, a simple ML baseline (logistic regression or a single decision tree). This establishes what you can achieve with minimal effort. Third, the existing solution (if any) — the current business rule, heuristic, or manual process. Only if the complex model significantly outperforms all three baselines on the relevant business metric do you proceed to deployment. “Significantly” means the improvement justifies the operational cost.
  • Cultural enforcement. Make the baseline comparison a required section in every model review document. No model can proceed to deployment without demonstrating improvement over the documented baseline. This single process change prevents the most common waste of ML engineering resources: over-engineering problems that do not need complex models.
Follow-up: The baseline logistic regression actually outperforms the team’s complex model. How do you communicate this diplomatically?

Frame it as a positive discovery, not a failure. “We found that a simple model achieves 88% accuracy on this problem, which means we can deploy faster, with lower operational cost, and with full explainability. The team’s work was not wasted: the feature engineering they did is what makes the simple model work so well. Let us deploy the logistic regression now, and if the business requirements change or accuracy needs to improve, we have the more complex model as a follow-up.” The key message is: simplicity is a feature, not a failure. The best engineers choose the simplest solution that meets requirements.