# Common ML Mistakes

> Avoid the pitfalls that trip up even experienced practitioners

## The ML Hall of Shame

Every data scientist has made these mistakes -- usually more than once. The tricky part is that most of them produce no error messages. Your code runs fine, your metrics look great, and you only discover the problem months later when the model fails in production. This chapter is your checklist for avoiding the silent killers.

***

## Mistake 1: Training on the Test Set

```python
# Fitting ANYTHING on all data before split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses test data statistics!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
```

```python
# Split FIRST, then fit only on training
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # Only transform!
```

**Why it matters**: When you fit the scaler on all data, the test set's mean and standard deviation influence the training data's scaling. This is subtle -- the accuracy inflation might be only 1-2%, but that 1-2% is the difference between "model is ready" and "model needs more work." In production, you will not have access to future data, so the scaler will produce different values, and your model will underperform its offline estimate.

***

## Mistake 2: Using Accuracy for Imbalanced Data

```python
# 99% accuracy sounds great!
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
# But: model just predicts majority class for everything
```

```python
# Use appropriate metrics
from sklearn.metrics import classification_report, f1_score, roc_auc_score

y_pred = model.predict(X_test)              # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]  # positive-class probabilities

print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"F1: {f1_score(y_test, y_pred):.4f}")
```

**Rule of thumb**: If class ratio > 10:1, don't use accuracy.
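To make the arithmetic behind this concrete, here is a minimal sketch in plain Python. The label counts (990 negatives, 10 positives) and the helper name `binary_metrics` are assumptions chosen for illustration, not part of the course material. It scores a "model" that always predicts the majority class.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical 99:1 imbalanced labels
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts the majority class
y_majority = [0] * len(y_true)

metrics = binary_metrics(y_true, y_majority)
print(metrics)
```

Accuracy comes out at 99% while recall and F1 are both zero: the model never finds a single positive case, which is exactly what accuracy hides.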
***

## Mistake 3: Random Split for Time Series

```python
# Random shuffle breaks temporal order
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)
# Now you're training on Dec 2024 to predict Jan 2024!
```

```python
# Temporal split - train on past, test on future
split_date = '2024-01-01'
train_mask = df['date'] < split_date
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]

# Or use TimeSeriesSplit for CV
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
```

***

## Mistake 4: Ignoring Feature Scaling

```python
# SVM, KNN, neural nets need scaled features!
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)
# age: 0-100, income: 0-1,000,000
# Income dominates everything
```

```python
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

svm = make_pipeline(StandardScaler(), SVC())
svm.fit(X_train, y_train)
```

**Models that need scaling**: SVM, KNN, Neural Networks, PCA, Logistic Regression (with regularization). These models rely on distance calculations (SVM, KNN), gradient-based optimization (neural nets), or variance and penalty terms that depend on feature units (PCA, regularized logistic regression), so features on larger scales dominate.

**Models that do NOT need scaling**: Decision Trees, Random Forest, Gradient Boosting. These split on individual features independently, so the absolute scale does not matter -- a split at income > \$50,000 works the same whether income is in dollars or millions.

**When in doubt, scale anyway.** Scaling does not hurt tree-based models (their splits are unaffected by rescaling), but forgetting to scale distance-based models usually does. Making scaling the default is a safe habit.

***

## Mistake 5: Feature Leakage from Target

```python
# Features derived from target
df['avg_purchase_by_customer_type'] = df.groupby('customer_type')['purchase'].transform('mean')
# This leaks future purchase information!
```

```python
# Calculate on training data only
# (assumes X_train still carries the 'purchase' column at this point)
train_means = X_train.groupby('customer_type')['purchase'].mean()
X_train['avg_purchase_type'] = X_train['customer_type'].map(train_means)
# Customer types unseen in training map to NaN and need separate handling
X_test['avg_purchase_type'] = X_test['customer_type'].map(train_means)
```

***

## Mistake 6: Dropping Missing Values Carelessly

```python
# Drop all rows with any missing value
df_clean = df.dropna()
# Lost 50% of your data!
```

```python
# Strategy 1: Impute
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')

# Strategy 2: Create missing indicator
df['feature_missing'] = df['feature'].isna().astype(int)
df['feature'] = df['feature'].fillna(df['feature'].median())

# Strategy 3: Drop only if too many missing
threshold = 0.5
cols_to_drop = df.columns[df.isna().mean() > threshold]
df = df.drop(columns=cols_to_drop)
```

***

## Mistake 7: Overfitting to Validation Set

```python
# Keep tuning until validation score is perfect
best_score, best_model = 0.0, None
for i in range(1000):
    model = train_with_new_hyperparameters()
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_model = score, model
# You've now overfit to validation set!
```

```python
# Use nested cross-validation for honest estimate
from sklearn.model_selection import cross_val_score, GridSearchCV

# Inner loop: hyperparameter tuning
# Outer loop: performance estimation
outer_scores = cross_val_score(
    GridSearchCV(model, param_grid, cv=3),
    X, y, cv=5
)
print(f"Honest estimate: {outer_scores.mean():.4f}")
```

***

## Mistake 8: Not Checking for Data Drift

Models are trained on a snapshot of the world. The world changes. Customer behavior shifts, new products launch, economic conditions evolve. A model trained on pre-pandemic e-commerce data would fail spectacularly in 2020. This is data drift, and it is inevitable -- the question is not "if" but "when."

```python
# Train once, deploy forever
model = train(historical_data)
deploy(model)
# 6 months later: "Why is accuracy dropping?"
```

```python
# Monitor distribution shifts
def check_drift(reference_data, new_data, threshold=0.1):
    for col in reference_data.columns:
        ref_mean = reference_data[col].mean()
        new_mean = new_data[col].mean()
        # Relative shift in the mean; abs() keeps the denominator positive
        shift = abs(ref_mean - new_mean) / (abs(ref_mean) + 1e-10)
        if shift > threshold:
            print(f"⚠️ Drift detected in {col}: {shift:.1%}")

# Monitor predictions
def monitor_predictions(model, X_new, historical_mean):
    probs = model.predict_proba(X_new)[:, 1]
    # Flag a shift in either direction
    if abs(probs.mean() - historical_mean) > 0.1:
        alert("Prediction distribution has shifted!")  # alert() stands in for your alerting hook
```

***

## Mistake 9: One-Hot Encoding High Cardinality

```python
# City has 10,000 unique values
df = pd.get_dummies(df, columns=['city'])
# Now you have 10,000 sparse columns!
```

```python
# Strategy 1: Target encoding
# (compute city_means on training rows only, or this becomes Mistake 5)
city_means = df.groupby('city')['target'].mean()
df['city_encoded'] = df['city'].map(city_means)

# Strategy 2: Frequency encoding
city_freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(city_freq)

# Strategy 3: Group rare categories
top_cities = df['city'].value_counts().head(50).index
df['city_grouped'] = df['city'].where(df['city'].isin(top_cities), 'Other')
```

***

## Mistake 10: Ignoring Class Imbalance in CV

```python
# Plain KFold ignores class labels when building folds
from sklearn.model_selection import KFold, cross_val_score
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))
# Some folds might have very few minority samples
# (an integer cv with a classifier already stratifies; the risk is an
# explicit KFold or a custom splitter)
```

```python
from sklearn.model_selection import StratifiedKFold

# Stratified CV preserves class ratios in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
```

***

## Mistake 11: Not Setting Random Seeds

```python
# Results change every run
model = RandomForestClassifier()
model.fit(X_train, y_train)
# "I swear it worked yesterday!"
```

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Set seeds everywhere
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
model = RandomForestClassifier(random_state=RANDOM_STATE)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)
```

***

## Mistake 12: Selecting Features Before the Train-Test Split

```python
# Feature selection on all data
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=10)
X_selected = selector.fit_transform(X, y)  # Uses test info!
X_train, X_test, y_train, y_test = train_test_split(X_selected, y)
```

```python
# Feature selection inside cross-validation or on train only
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('selector', SelectKBest(k=10)),
    ('classifier', RandomForestClassifier())
])

# Selector fit only on training folds
scores = cross_val_score(pipeline, X, y, cv=5)
```

***

## Mistake 13: Using Mean for Skewed Data

```python
# Income is highly skewed
df['income'] = df['income'].fillna(df['income'].mean())
# Mean = $85k but median = $50k
# Filling with mean inflates values
```

```python
# Use median for skewed distributions
df['income'] = df['income'].fillna(df['income'].median())

# Or use log transform first
import numpy as np
df['log_income'] = np.log1p(df['income'])
df['log_income'] = df['log_income'].fillna(df['log_income'].median())
```

***

## Mistake 14: Trusting Default Hyperparameters

```python
# Just use defaults
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
# "Good enough"
```

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.1, 0.2]
}

grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid,
    cv=5,
    scoring='roc_auc'
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
```

***

## Mistake 15: Complex Model Without Baseline

This is the "Kaggle grandmaster" trap.
You spend two weeks tuning XGBoost with 47 hyperparameters and get 89% accuracy. Is that good? You have no idea unless you know what a trivial model achieves. If logistic regression gets 87%, your two weeks bought you 2 percentage points -- probably not worth the complexity, maintenance burden, and inference latency in production.

```python
# Jump straight to deep learning
model = SuperComplexNeuralNetwork(layers=50)
model.fit(X, y)
# "My model has 89% accuracy!" -- But is that actually good?
```

```python
# Always compare to baselines -- this takes 5 minutes and saves weeks
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Baseline 1: Random guessing (the absolute floor)
dummy = DummyClassifier(strategy='stratified')
print(f"Random baseline: {cross_val_score(dummy, X, y).mean():.4f}")

# Baseline 2: Simple model (the "is ML even needed?" check)
lr = LogisticRegression()
print(f"Logistic Regression: {cross_val_score(lr, X, y).mean():.4f}")

# Now try complex model
complex_model = GradientBoostingClassifier(n_estimators=200)
print(f"Complex model: {cross_val_score(complex_model, X, y).mean():.4f}")

# The question is not "is the complex model good?"
# It's "is the improvement over the simple model worth the extra complexity?"
```

***

## Quick Reference Checklist

### Before Training

* [ ] Split data before any preprocessing
* [ ] Set random seeds for reproducibility
* [ ] Check class balance
* [ ] Handle missing values appropriately
* [ ] Scale features if needed by algorithm

### During Training

* [ ] Use pipelines to prevent leakage
* [ ] Use stratified CV for imbalanced data
* [ ] Use temporal splits for time series
* [ ] Compare to baseline models
* [ ] Tune hyperparameters systematically

### After Training

* [ ] Evaluate on held-out test set
* [ ] Use appropriate metrics (not just accuracy)
* [ ] Check for overfitting (train vs test gap)
* [ ] Validate feature importance makes sense
* [ ] Document everything

### In Production

* [ ] Monitor for data drift
* [ ] Track prediction distributions
* [ ] Set up alerts for performance degradation
* [ ] Plan for model retraining

***

## Key Takeaways

* Always separate test data before any processing
* Prevent leakage with sklearn pipelines
* Match metrics to your problem
* Baseline first, complexity later

***

## Congratulations! 🎉

You've completed the **ML Mastery** course! You now have comprehensive knowledge of:

* ML fundamentals and algorithms
* Feature engineering and data preprocessing
* Model evaluation and selection
* Advanced topics (time series, deep learning, deployment)
* Professional practices (pipelines, explainability, common mistakes)

### Continue Your Journey

* Build LLM-powered applications and agents
* Deepen your mathematical understanding
* Design ML systems at scale
* Apply your skills in real competitions

***

## Interview Deep-Dive

This is a behavioral-style question disguised as a technical one. The interviewer wants to see your debugging methodology and your ability to handle production incidents. Here is how a strong answer would be structured:

* **Detection pattern: the "too good to be true" signal.** The most common way leakage is discovered is during a routine model retrain.
  The original model had 97% AUC, but a fresh retrain on recent data gives 83% AUC. The gap is the clue -- either the original model was leaky (which inflated the original score) or the data has drifted dramatically. Check the original training code first.
* **Common root cause: preprocessing before split.** In many production incidents, the leakage is in a StandardScaler or target encoding that was fit on the full dataset before train-test split. The model in production uses a scaler fit on historical data, but the offline evaluation was inflated by the leak. The model was always worse than the metrics suggested.
* **Impact assessment.** The real-world impact depends on the domain. In a recommendation system, leaky evaluation means you deployed a model that was less personalized than you thought, leading to lower click-through rates. In fraud detection, it means you missed more fraud than expected. Calculate the gap between reported and actual performance, then estimate the business cost of that gap over the deployment period.
* **Remediation.** Fix the leakage, retrain, and honestly report the corrected metrics. Then implement pipeline-level guardrails (automated leakage tests, the random-label permutation test) to prevent recurrence. The hardest part is communicating to stakeholders that the model was never as good as reported -- but honesty here builds long-term trust.
* **Prevention going forward.** Mandate that all preprocessing lives inside sklearn Pipelines. Add a CI check that runs the model with shuffled labels and verifies that accuracy is near the base rate. Review feature engineering code for any feature derived from the target. In practice, these three checks catch the large majority of leakage.

**Follow-up: If the leaky model was performing "well enough" in production according to business metrics, would you still fix it?**

Yes, absolutely. A leaky model that appears to work is a ticking time bomb.
It works by coincidence -- the leaky feature happens to correlate with the true signal in the current data. When the data distribution shifts (and it always does), the leaky feature and the true signal will diverge, and the model will fail suddenly rather than gracefully. Fixing it now, while things are stable, is far easier than diagnosing a sudden production failure at 2 AM. I would fix the leakage, retrain, and accept the honest lower metrics. If the honest model does not meet business requirements, that is valuable information -- it means you need a better model, not a leaky one.

This is a triage question. You cannot fix everything at once, so prioritization reveals your engineering judgment.

* **Priority 1: Add monitoring (Day 1-3).** Before anything else, you need to know if the model is currently working. Add prediction distribution monitoring (track mean, variance, and percentile distribution of model outputs daily), input feature null-rate monitoring, and basic health checks (is the model responding, what is the latency). If the model is silently failing right now, you need to know immediately. This takes a few days and requires no changes to the model itself.
* **Priority 2: Reproduce the model (Week 1-2).** Can you retrain the model and get the same performance? If not, you do not truly control the model. Find the training data, the feature engineering code, and the hyperparameters. Pin library versions. Run the training pipeline end-to-end and compare metrics to whatever historical records exist. If you cannot reproduce, you need to understand why before making any changes.
* **Priority 3: Add a holdout evaluation (Week 2-3).** Set aside the most recent data as a holdout and evaluate the current model honestly. This tells you how the model is actually performing right now, not how it performed when it was first deployed months ago.
* **Priority 4: Write tests (Week 3-4).** Add unit tests for the feature engineering pipeline (given these inputs, do I get these outputs?), integration tests for the end-to-end prediction path, and regression tests that verify model performance stays within expected bounds after retraining.
* **Priority 5: Add documentation (Ongoing).** Document the model's purpose, the feature definitions, the training procedure, the deployment architecture, and the monitoring setup. Do this as you learn the system -- document what you discover.

The anti-pattern is trying to refactor the code first. Resist that urge. A poorly written model that is monitored and reproducible is infinitely better than a beautifully refactored model that you accidentally broke during the refactor because you had no tests to catch the regression.

**Follow-up: You discover during reproduction that the model's current production performance is 15% worse than the original reported metrics. What do you do?**

First, verify the finding by checking against multiple evaluation windows and metrics. If confirmed, investigate whether this is data drift (the world changed), code drift (a dependency was updated), or the original metrics were leaky. Then communicate the finding to stakeholders with a clear recommendation: either retrain on recent data (if drift), roll back a code change (if code drift), or fix the leakage and accept the honest lower performance. The most important thing is to act on the information. Many teams discover degradation and do nothing because the model is "still good enough" -- until it is not.

The single most impactful mistake is deploying a model without a baseline comparison. Teams spend weeks building complex models, achieve some accuracy number, and deploy without ever asking "is this better than a simple rule or a logistic regression?"

* **Why it matters.** A gradient boosting model with 89% accuracy sounds impressive until you learn that a logistic regression achieves 87% in 5 minutes of work, or that a hand-written business rule achieves 85% with zero ML infrastructure. The marginal 2-4% improvement from the complex model comes with significant costs: compute, maintenance, debugging complexity, and explainability challenges.
* **The cost of complexity without commensurate benefit.** A complex model requires monitoring, retraining pipelines, feature stores, and on-call rotation. A logistic regression or business rule requires almost none of that. If the complex model is only marginally better, the total cost of ownership makes it a net negative.
* **How I prevent it.** In every ML project, I mandate three baselines before any complex modeling begins. First, a trivial baseline (always predict the majority class, or random). This establishes the floor. Second, a simple ML baseline (logistic regression or a single decision tree). This establishes what you can achieve with minimal effort. Third, the existing solution (if any) -- the current business rule, heuristic, or manual process. Only if the complex model significantly outperforms all three baselines on the relevant business metric do you proceed to deployment. "Significantly" means the improvement justifies the operational cost.
* **Cultural enforcement.** Make the baseline comparison a required section in every model review document. No model can proceed to deployment without demonstrating improvement over the documented baseline. This single process change prevents the most common waste of ML engineering resources: over-engineering problems that do not need complex models.

**Follow-up: The baseline logistic regression actually outperforms the team's complex model. How do you communicate this diplomatically?**

Frame it as a positive discovery, not a failure.
"We found that a simple model achieves 88% accuracy on this problem, which means we can deploy faster, with lower operational cost, and with full explainability. The team's work was not wasted -- the feature engineering they did is what makes the simple model work so well. Let us deploy the logistic regression now, and if the business requirements change or accuracy needs to improve, we have the more complex model as a follow-up." The key message is: simplicity is a feature, not a failure. The best engineers choose the simplest solution that meets requirements.
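
The three-baseline gate described in this answer can be sketched as a small check. This is an illustrative sketch only: the function name `passes_baseline_gate`, the `min_gain` threshold, and the 0.60 majority-class score are hypothetical, while the 0.89, 0.87 and 0.85 figures reuse the numbers quoted above. What counts as a "significant" gain remains a business decision that this code does not settle.

```python
def passes_baseline_gate(candidate_score, baseline_scores, min_gain):
    """Compare a candidate model against every documented baseline.

    Returns (ok, best_name, gain), where ok is True only if the candidate
    beats the strongest baseline by at least min_gain.
    """
    # The strongest baseline is the only one that matters for the gate
    best_name = max(baseline_scores, key=baseline_scores.get)
    gain = candidate_score - baseline_scores[best_name]
    return gain >= min_gain, best_name, gain


baselines = {
    "majority_class": 0.60,       # hypothetical floor, for illustration
    "logistic_regression": 0.87,  # simple ML baseline from the text
    "business_rule": 0.85,        # existing solution from the text
}

# Hypothetical policy: require at least 3 points over the best baseline
ok, best, gain = passes_baseline_gate(0.89, baselines, min_gain=0.03)
print(f"strongest baseline: {best}, gain: {gain:.3f}, deploy: {ok}")
```

With these numbers the complex model clears the logistic regression by only about two points, so the gate fails under the assumed threshold. A result like that could be recorded directly in the model review document.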