Common ML Mistakes
The ML Hall of Shame
Every data scientist has made these mistakes, usually more than once. The tricky part is that most of them produce no error messages. Your code runs fine, your metrics look great, and you only discover the problem months later when the model fails in production. This chapter is your checklist for avoiding the silent killers.
Mistake 1: Training on the Test Set
- ❌ Wrong
- ✅ Correct
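A minimal sketch of the correct pattern, assuming scikit-learn and a synthetic dataset from `make_classification` (both are illustrative choices, not part of the original material). The split happens first, and the scaler lives inside a `Pipeline` so it only ever sees training rows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 1. Split FIRST, before any statistic is computed from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Keep preprocessing inside the pipeline so it is fit on training rows only.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# 3. Touch the test set once, for the final score.
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Because the scaler is a pipeline step, calling `fit` on the training rows is the only place its mean and variance are learned.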
Mistake 2: Using Accuracy for Imbalanced Data
- ❌ Wrong
- ✅ Correct
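To see why accuracy misleads here, consider a sketch with a hypothetical 99:1 dataset and a "model" that always predicts the majority class. The numbers are invented for illustration.

```python
# Hypothetical fraud dataset: 990 legitimate (0) and 10 fraudulent (1) cases.
y_true = [0] * 990 + [1] * 10
# A useless "model" that always predicts the majority class.
y_pred = [0] * 1000

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when there are no positive predictions.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy comes out at 0.99 while recall is 0.0: the model catches no fraud at all. Metrics such as precision, recall, and F1 expose this where accuracy hides it.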
Mistake 3: Random Split for Time Series
- ❌ Wrong
- ✅ Correct
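A minimal sketch of a temporal split in plain Python. The `(day, value)` rows and the `temporal_split` helper are hypothetical names used only for illustration.

```python
from datetime import date, timedelta

# Hypothetical daily observations as (day, value) pairs, deliberately unordered.
start = date(2024, 1, 1)
rows = [(start + timedelta(days=i), float(i)) for i in range(100)]
rows.reverse()

def temporal_split(records, test_fraction=0.2):
    """Sort by timestamp, then hold out the most recent slice as the test set."""
    ordered = sorted(records, key=lambda r: r[0])
    n_test = round(len(ordered) * test_fraction)
    return ordered[:-n_test], ordered[-n_test:]

train, test = temporal_split(rows)
print(f"train ends {train[-1][0]}, test starts {test[0][0]}")
```

Every training row precedes every test row, so the model is never evaluated on data older than what it was trained on.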
Mistake 4: Ignoring Feature Scaling
- ❌ Wrong
- ✅ Correct
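The sketch below uses invented `(age, income)` rows to show how an unscaled feature can dominate a distance-based method. The scaling statistics are computed from the training rows only and then applied to the query.

```python
import math
import statistics

# Hypothetical training rows: (age in years, income in dollars).
train = [(31, 56000), (70, 52500), (45, 48000), (52, 60000)]
query = (30, 52000)

def nearest(points, q):
    """Return the point with the smallest Euclidean distance to q."""
    return min(points, key=lambda p: math.dist(p, q))

# Fit the scaling statistics on the training rows only.
means = [statistics.mean(col) for col in zip(*train)]
stds = [statistics.pstdev(col) for col in zip(*train)]

def standardize(point):
    """Convert each feature to a z-score using the training statistics."""
    return tuple((v - m) / s for v, m, s in zip(point, means, stds))

raw_neighbor = nearest(train, query)

scaled_train = [standardize(p) for p in train]
scaled_match = nearest(scaled_train, standardize(query))
scaled_neighbor = train[scaled_train.index(scaled_match)]

print("raw nearest:   ", raw_neighbor)
print("scaled nearest:", scaled_neighbor)
```

On raw values the nearest neighbor is the 70-year-old, because a 500-dollar income gap outweighs a 40-year age gap. After standardization the 31-year-old is nearest. Tree-based models are largely insensitive to this, which is why scaling is needed only for some algorithms.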
Mistake 5: Feature Leakage from Target
- ❌ Wrong
- ✅ Correct
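One rough heuristic for spotting target leakage is to flag any feature that is almost perfectly correlated with the target. The sketch below assumes a hypothetical churn dataset. The feature names and the 0.95 threshold are illustrative, and a high correlation is a prompt for manual review, not proof of leakage.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical churn labels and candidate features.
churned = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
features = {
    "monthly_spend": [40, 55, 30, 60, 35, 20, 50, 45, 25, 65],
    "support_tickets": [1, 0, 3, 1, 2, 4, 0, 1, 2, 0],
    # Recorded only after a customer cancels, so it encodes the target.
    "cancellation_fee_paid": [0, 0, 1, 0, 1, 1, 0, 0, 1, 0],
}

THRESHOLD = 0.95  # illustrative cutoff, not a standard value
suspicious = [
    name for name, values in features.items()
    if abs(pearson(values, churned)) > THRESHOLD
]
print("Review before training:", suspicious)
```

The question to ask of each flagged feature is whether its value would actually be known at prediction time.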
Mistake 6: Dropping Missing Values Carelessly
- ❌ Wrong
- ✅ Correct
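As one alternative to dropping rows, the sketch below imputes with the training median and keeps a was-missing flag so the model can still use the missingness signal. The age values and the `impute` helper are hypothetical.

```python
import statistics

# Hypothetical age column with missing entries (None), already split.
train_ages = [22, 35, None, 41, 29, None, 58]
test_ages = [None, 33, 47]

# Learn the fill value from the training rows only.
train_median = statistics.median(a for a in train_ages if a is not None)

def impute(values, fill):
    """Return (filled values, was-missing flags) so missingness stays visible."""
    filled = [fill if v is None else v for v in values]
    missing_flag = [int(v is None) for v in values]
    return filled, missing_flag

train_filled, train_flag = impute(train_ages, train_median)
test_filled, test_flag = impute(test_ages, train_median)
print(test_filled, test_flag)
```

No rows are lost, and the test set is filled with a statistic it did not contribute to.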
Mistake 7: Overfitting to Validation Set
- ❌ Wrong
- ✅ Correct
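A common safeguard is a three-way split: tune against the validation set as often as you like, and score the test set exactly once at the end. The helper below is a hypothetical sketch in plain Python, with illustrative 60/20/20 proportions.

```python
import random

def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then carve out test, validation, and training partitions."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

rows = list(range(1000))  # stand-in for row identifiers
train, val, test = three_way_split(rows)
print(len(train), len(val), len(test))
```

Every hyperparameter decision made against `val` leaks a little information about it, which is why the final number should come from `test`.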
Mistake 8: Not Checking for Data Drift
Models are trained on a snapshot of the world. The world changes. Customer behavior shifts, new products launch, economic conditions evolve. A model trained on pre-pandemic e-commerce data would fail spectacularly in 2020. This is data drift, and it is inevitable: the question is not “if” but “when.”
- ❌ Wrong
- ✅ Correct
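One widely used drift measure is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against recent data. It is only one of several options, and the original material does not name a specific method. The sketch below assumes equal-width bins and a non-constant reference sample. The 0.25 alert level is a commonly cited rule of thumb, not a guarantee.

```python
import math

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins  # assumes the reference sample is not constant

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so values outside the reference range land in the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [i / 10 for i in range(1000)]   # feature values at training time
shifted = [v + 30 for v in reference]       # same feature after a shift

print(f"PSI vs itself:  {psi(reference, reference):.3f}")
print(f"PSI vs shifted: {psi(reference, shifted):.3f}")

ALERT_LEVEL = 0.25  # rule-of-thumb threshold, tune for your data
drift_detected = psi(reference, shifted) > ALERT_LEVEL
```

Running a check like this per feature on a schedule turns drift from a surprise into an alert.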
Mistake 9: One-Hot Encoding High Cardinality
- ❌ Wrong
- ✅ Correct
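Frequency encoding is one alternative to one-hot encoding for a high-cardinality column (target encoding and hashing are others). This sketch uses a hypothetical `city` column and learns the frequency table from the training rows only, with a default for categories never seen in training.

```python
from collections import Counter

# Hypothetical city column, already split into train and test.
train_cities = ["NYC", "NYC", "LA", "NYC", "LA", "Boise", "Reno", "LA", "NYC", "Austin"]
test_cities = ["LA", "Reno", "Tulsa"]  # "Tulsa" never appears in training

# Learn each category's relative frequency from the training rows only.
counts = Counter(train_cities)
freq_table = {city: n / len(train_cities) for city, n in counts.items()}

def encode(cities, table, default=0.0):
    """Map each category to its training frequency; unseen categories get `default`."""
    return [table.get(city, default) for city in cities]

train_encoded = encode(train_cities, freq_table)
test_encoded = encode(test_cities, freq_table)
print(test_encoded)
```

The result is a single numeric column regardless of how many distinct categories exist, instead of one column per category.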
Mistake 10: Ignoring Class Imbalance in CV
- ❌ Wrong
- ✅ Correct
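A short sketch using scikit-learn's `StratifiedKFold` on a hypothetical 10%-positive label vector. Stratification keeps the class ratio roughly constant in every fold, so no fold ends up with zero positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 10 positives, 90 negatives.
y = np.array([1] * 10 + [0] * 90)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(positives_per_fold)
```

With a plain `KFold` the positive count per fold can vary by chance, which makes per-fold metrics noisy or even undefined.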
Mistake 11: Not Setting Random Seeds
- ❌ Wrong
- ✅ Correct
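The idea in miniature, using only the standard library: a dedicated, seeded generator returns the same "random" sample on every run. The `sample_indices` helper is hypothetical. In scikit-learn the equivalent habit is passing `random_state` to splitters and estimators, and in NumPy it is creating a generator with `numpy.random.default_rng(seed)`.

```python
import random

SEED = 42

def sample_indices(n_rows, n_sample, seed):
    """Draw row indices from a dedicated, seeded generator."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), n_sample)

first_run = sample_indices(1000, 5, SEED)
second_run = sample_indices(1000, 5, SEED)
different_seed = sample_indices(1000, 5, SEED + 1)

print(first_run == second_run)  # identical draws with the same seed
```

Using a local generator instead of the global `random.seed` keeps other code from silently changing your results.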
Mistake 12: Selecting Features Before the Train-Test Split
- ❌ Wrong
- ✅ Correct
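The sketch below assumes the mistake in question is running feature selection on the full dataset before splitting or cross-validating. It uses pure-noise features and random labels, so any score well above chance can only come from leakage. The data shapes and `k=20` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

# Leaky: features are chosen using every label, including future test folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Honest: selection is refit inside each training fold via the pipeline.
pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_score = cross_val_score(pipeline, X, y, cv=5).mean()

print(f"selection before CV: {leaky_score:.2f}")
print(f"selection inside CV: {honest_score:.2f}")
```

The leaky version typically reports accuracy far above chance on data that contains no signal at all, while the pipeline version stays near 50%.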
Mistake 13: Using Mean for Skewed Data
- ❌ Wrong
- ✅ Correct
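A quick illustration with invented income figures: one extreme value drags the mean far away from what a typical row looks like, while the median barely moves.

```python
import statistics

# Hypothetical incomes with a single extreme outlier.
incomes = [30_000, 32_000, 35_000, 38_000, 40_000, 1_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
```

Here the mean is more than five times the median, so using it as a fill value or a summary would misrepresent almost every row.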
Mistake 14: Trusting Default Hyperparameters
- ❌ Wrong
- ✅ Correct
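A minimal sketch of systematic tuning with scikit-learn's `GridSearchCV`. The synthetic data, the random forest, and the small grid are illustrative assumptions. The search runs cross-validation on the training split only, leaving the test split for the final check.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A deliberately small, illustrative grid.
param_grid = {"max_depth": [3, 6, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X_train, y_train)  # cross-validation happens on training data only

print("best params:", search.best_params_)
print(f"best CV AUC: {search.best_score_:.3f}")
test_accuracy = search.score(X_test, y_test)  # scored with the chosen metric (AUC)
```

For larger search spaces, `RandomizedSearchCV` covers more ground for the same compute budget.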
Mistake 15: Complex Model Without Baseline
This is the “Kaggle grandmaster” trap. You spend two weeks tuning XGBoost with 47 hyperparameters and get 89% accuracy. Is that good? You have no idea unless you know what a trivial model achieves. If logistic regression gets 87%, your two weeks bought you 2 percentage points, which is probably not worth the complexity, maintenance burden, and inference latency in production.
- ❌ Wrong
- ✅ Correct
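A sketch of the baseline-first habit, assuming scikit-learn and synthetic data. A majority-class dummy sets the floor, logistic regression sets the low-effort bar, and the complex model has to beat both. `GradientBoostingClassifier` stands in for XGBoost here to keep the example dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

candidates = {
    "majority class": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Same folds and metric for every candidate so the numbers are comparable.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name:>20}: {score:.3f}")
```

The interesting number is the gap between the rows, not any single score.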
Quick Reference Checklist
Before Training
- Split data before any preprocessing
- Set random seeds for reproducibility
- Check class balance
- Handle missing values appropriately
- Scale features if needed by algorithm
During Training
- Use pipelines to prevent leakage
- Use stratified CV for imbalanced data
- Use temporal splits for time series
- Compare to baseline models
- Tune hyperparameters systematically
After Training
- Evaluate on held-out test set
- Use appropriate metrics (not just accuracy)
- Check for overfitting (train vs test gap)
- Validate feature importance makes sense
- Document everything
In Production
- Monitor for data drift
- Track prediction distributions
- Set up alerts for performance degradation
- Plan for model retraining
Key Takeaways
Split First
Always separate test data before any processing
Use Pipelines
Prevent leakage with sklearn pipelines
Right Metrics
Match metrics to your problem
Start Simple
Baseline first, complexity later
Congratulations! 🎉
You’ve completed the ML Mastery course! You now have comprehensive knowledge of:
- ML fundamentals and algorithms
- Feature engineering and data preprocessing
- Model evaluation and selection
- Advanced topics (time series, deep learning, deployment)
- Professional practices (pipelines, explainability, common mistakes)
Continue Your Journey
AI Engineering
Build LLM-powered applications and agents
Math Foundations
Deepen your mathematical understanding
System Design
Design ML systems at scale
Kaggle Competitions
Apply your skills in real competitions
Interview Deep-Dive
Tell me about a time you discovered data leakage in a model that was already in production. How did you find it and what was the impact?
This is a behavioral-style question disguised as a technical one. The interviewer wants to see your debugging methodology and your ability to handle production incidents. Here is how a strong answer would be structured:
- Detection pattern: the “too good to be true” signal. The most common way leakage is discovered is during a routine model retrain. The original model had 97% AUC, but a fresh retrain on recent data gives 83% AUC. The gap is the clue — either the original model was leaky (inflated the original score) or the data has drifted dramatically. Check the original training code first.
- Common root cause: preprocessing before split. In many production incidents, the leakage is in a StandardScaler or target encoding that was fit on the full dataset before train-test split. The model in production uses a scaler fit on historical data, but the offline evaluation was inflated by the leak. The model was always worse than the metrics suggested.
- Impact assessment. The real-world impact depends on the domain. In a recommendation system, leaky evaluation means you deployed a model that was less personalized than you thought, leading to lower click-through rates. In fraud detection, it means you missed more fraud than expected. Calculate the gap between reported and actual performance, then estimate the business cost of that gap over the deployment period.
- Remediation. Fix the leakage, retrain, and honestly report the corrected metrics. Then implement pipeline-level guardrails (automated leakage tests, the random-label permutation test) to prevent recurrence. The hardest part is communicating to stakeholders that the model was never as good as reported — but honesty here builds long-term trust.
- Prevention going forward. Mandate that all preprocessing lives inside sklearn Pipelines. Add a CI check that runs the model with shuffled labels and verifies that accuracy is near the base rate. Review feature engineering code for any feature derived from the target. Together, these three checks catch the most common forms of leakage.
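The shuffled-label check mentioned above could look roughly like this. It is a sketch assuming scikit-learn, synthetic data, and an arbitrary tolerance of 0.10. A pipeline trained on permuted labels has nothing real to learn, so cross-validated accuracy well above the base rate suggests that label information is leaking through the training procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Break any real relationship between features and labels.
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
shuffled_accuracy = cross_val_score(pipeline, X, y_shuffled, cv=5).mean()

# Base rate: accuracy of always predicting the majority class.
base_rate = max(np.mean(y), 1 - np.mean(y))

TOLERANCE = 0.10  # hypothetical margin, tune for your dataset size
check_passed = shuffled_accuracy <= base_rate + TOLERANCE
print(f"shuffled accuracy={shuffled_accuracy:.3f}, base rate={base_rate:.3f}")
```

This check targets leaks in the fitting and evaluation code. It does not flag a raw feature that was derived from the target, which is why the manual feature review remains a separate step.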
You join a new team and inherit a model with no documentation, no tests, and no monitoring. What is your priority order for adding production hygiene?
This is a triage question. You cannot fix everything at once, so prioritization reveals your engineering judgment.
- Priority 1: Add monitoring (Day 1-3). Before anything else, you need to know if the model is currently working. Add prediction distribution monitoring (track mean, variance, and percentile distribution of model outputs daily), input feature null-rate monitoring, and basic health checks (is the model responding, what is the latency). If the model is silently failing right now, you need to know immediately. This takes a few days and requires no changes to the model itself.
- Priority 2: Reproduce the model (Week 1-2). Can you retrain the model and get the same performance? If not, you do not truly control the model. Find the training data, the feature engineering code, and the hyperparameters. Pin library versions. Run the training pipeline end-to-end and compare metrics to whatever historical records exist. If you cannot reproduce, you need to understand why before making any changes.
- Priority 3: Add a holdout evaluation (Week 2-3). Set aside the most recent data as a holdout and evaluate the current model honestly. This tells you how the model is actually performing right now, not how it performed when it was first deployed months ago.
- Priority 4: Write tests (Week 3-4). Add unit tests for the feature engineering pipeline (given these inputs, do I get these outputs?), integration tests for the end-to-end prediction path, and regression tests that verify model performance stays within expected bounds after retraining.
- Priority 5: Add documentation (Ongoing). Document the model’s purpose, the feature definitions, the training procedure, the deployment architecture, and the monitoring setup. Do this as you learn the system — document what you discover.
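The Priority 1 monitoring can start very small. The sketch below uses only the standard library. The helper names, the score data, and the `max_shift` threshold are hypothetical. It summarizes one day's prediction scores, computes a feature's null rate, and flags a large shift in the mean prediction against a stored baseline.

```python
import statistics

def summarize(predictions):
    """Daily summary of model outputs: mean, variance, and key percentiles."""
    deciles = statistics.quantiles(predictions, n=10)  # 9 cut points
    return {
        "mean": statistics.fmean(predictions),
        "variance": statistics.pvariance(predictions),
        "p10": deciles[0],
        "p50": deciles[4],
        "p90": deciles[8],
    }

def null_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def mean_shift_alert(baseline, today, max_shift=0.1):
    """True when the mean prediction moved more than `max_shift` from baseline."""
    return abs(today["mean"] - baseline["mean"]) > max_shift

# Hypothetical scores: a baseline day and a day where outputs drifted upward.
baseline_scores = [i / 100 for i in range(100)]
today_scores = [min(1.0, s + 0.3) for s in baseline_scores]

baseline_summary = summarize(baseline_scores)
today_summary = summarize(today_scores)
alert = mean_shift_alert(baseline_summary, today_summary)
feature_null_rate = null_rate([1.0, None, 3.0, None])

print(today_summary)
print("alert:", alert, "null rate:", feature_null_rate)
```

Logging these summaries daily requires no change to the model itself, which is what makes it a reasonable first step.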
What is the single most impactful ML mistake you have seen teams make repeatedly, and how do you prevent it?
The single most impactful mistake is deploying a model without a baseline comparison. Teams spend weeks building complex models, achieve some accuracy number, and deploy without ever asking “is this better than a simple rule or a logistic regression?”
- Why it matters. A gradient boosting model with 89% accuracy sounds impressive until you learn that a logistic regression achieves 87% in 5 minutes of work, or that a hand-written business rule achieves 85% with zero ML infrastructure. The marginal 2-4% improvement from the complex model comes with significant costs: compute, maintenance, debugging complexity, and explainability challenges.
- The cost of complexity without commensurate benefit. A complex model requires monitoring, retraining pipelines, feature stores, and on-call rotation. A logistic regression or business rule requires almost none of that. If the complex model is only marginally better, the total cost of ownership makes it a net negative.
- How I prevent it. In every ML project, I mandate three baselines before any complex modeling begins. First, a trivial baseline (always predict the majority class, or random). This establishes the floor. Second, a simple ML baseline (logistic regression or a single decision tree). This establishes what you can achieve with minimal effort. Third, the existing solution (if any) — the current business rule, heuristic, or manual process. Only if the complex model significantly outperforms all three baselines on the relevant business metric do you proceed to deployment. “Significantly” means the improvement justifies the operational cost.
- Cultural enforcement. Make the baseline comparison a required section in every model review document. No model can proceed to deployment without demonstrating improvement over the documented baseline. This single process change prevents the most common waste of ML engineering resources: over-engineering problems that do not need complex models.