# Common ML Mistakes

> Avoid the pitfalls that trip up even experienced practitioners

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/common-mistakes-concept.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=cd5c82cb367c8e7cb644ec188220f124" alt="Common ML Mistakes Concept" width="1080" height="1080" data-path="images/courses/ml-mastery/common-mistakes-concept.svg" />
</Frame>

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/common-mistakes-real-world.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=f4a048ad6bf94605c94c47630497347f" alt="Common ML Mistakes Real World Example" width="1080" height="1080" data-path="images/courses/ml-mastery/common-mistakes-real-world.svg" />
</Frame>

## The ML Hall of Shame

Every data scientist has made these mistakes -- usually more than once. The tricky part is that most of these produce no error messages. Your code runs fine, your metrics look great, and you only discover the problem months later when the model fails in production. This chapter is your checklist for avoiding the silent killers.

***

## Mistake 1: Training on the Test Set

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Fitting ANYTHING on all data before split
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)  # Uses test data statistics!

    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
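    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Split FIRST, then fit only on training
    X_train, X_test, y_train, y_test = train_test_split(X, y)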

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)  # Only transform!
    ```
  </Tab>
</Tabs>

**Why it matters**: When you fit the scaler on all data, the test set's mean and standard deviation influence how the training data is scaled. The effect is subtle -- the accuracy inflation might be only 1-2% -- but that can be the difference between "model is ready" and "model needs more work." In production the scaler is fit only on data you already have, so an offline score that borrowed test-set statistics overstates what the deployed model will achieve.
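
One way to make this mistake hard to commit is to put the scaler and the model in a single pipeline, so that cross-validation refits the scaler on each training fold. The sketch below is illustrative only: the synthetic dataset from `make_classification` and the `LogisticRegression` model are assumptions, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler lives inside the pipeline, so each CV split fits it
# on the training fold and only transforms the validation fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)

print(f"Fold accuracies: {scores.round(3)}")
```

Because the scaler is a pipeline step, there is no separate `fit_transform` call that could accidentally see validation rows.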

***

## Mistake 2: Using Accuracy for Imbalanced Data

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # 99% accuracy sounds great!
    model.fit(X_train, y_train)
    print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
    # But: model just predicts majority class for everything
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
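    # Use appropriate metrics
    from sklearn.metrics import classification_report, f1_score, roc_auc_score

    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

    print(classification_report(y_test, y_pred))
    print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
    print(f"F1: {f1_score(y_test, y_pred):.4f}")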
    ```
  </Tab>
</Tabs>

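**Rule of thumb**: If the class ratio exceeds 10:1, do not rely on accuracy alone. At that ratio, a model that always predicts the majority class already scores above 90%.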
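
To make the failure concrete, here is a small sketch with made-up labels (990 negatives, 10 positives) and a "model" that always predicts the majority class. The data is hypothetical; only the metric functions come from scikit-learn.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    recall_score,
)

# Hypothetical labels: 990 negatives, 10 positives (99:1 imbalance)
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)  # precision is 0/0 here
balanced = balanced_accuracy_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.2%}")            # 99.00%
print(f"Recall: {recall:.2%}")                # 0.00%
print(f"F1: {f1:.2f}")                        # 0.00
print(f"Balanced accuracy: {balanced:.2%}")   # 50.00%
```

Accuracy rewards the model for the 990 easy negatives, while recall, F1, and balanced accuracy all expose that it never finds a single positive.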

***

## Mistake 3: Random Split for Time Series

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Random shuffle breaks temporal order
    X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)
    # Now you're training on Dec 2024 to predict Jan 2024!
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
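    # Temporal split - train on past, test on future
    split_date = '2024-01-01'
    train_mask = df['date'] < split_date  # assumes df['date'] is a datetime column aligned with X and y

    X_train, y_train = X[train_mask], y[train_mask]
    X_test, y_test = X[~train_mask], y[~train_mask]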

    # Or use TimeSeriesSplit for CV
    from sklearn.model_selection import TimeSeriesSplit
    tscv = TimeSeriesSplit(n_splits=5)
    ```
  </Tab>
</Tabs>
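
To see what `TimeSeriesSplit` actually produces, the sketch below prints the fold indices for twelve time-ordered rows. The toy array is an assumption for illustration; in each fold the training indices all come before the test indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (index 0 is the oldest)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # Every training index precedes every test index
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows with each fold while the test window moves forward in time, which mirrors how a deployed model is retrained and then used on newer data.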

***

## Mistake 4: Ignoring Feature Scaling

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # SVM, KNN, neural nets need scaled features!
    from sklearn.svm import SVC

    svm = SVC()
    svm.fit(X_train, y_train)  # age: 0-100, income: 0-1,000,000
    # Income dominates everything
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    svm = make_pipeline(StandardScaler(), SVC())
    svm.fit(X_train, y_train)
    ```
  </Tab>
</Tabs>

**Models that need scaling**: SVM, KNN, Neural Networks, PCA, Logistic Regression (with regularization). These models rely on distance calculations (SVM, KNN), variance along each axis (PCA), gradient-based optimization (neural nets), or a penalty on coefficient size (regularized logistic regression), so a feature on a much larger scale dominates the others.

**Models that do NOT need scaling**: Decision Trees, Random Forest, Gradient Boosting. These split on individual features independently, so the absolute scale does not matter -- a split at income > \$50,000 works the same whether income is in dollars or millions.

<Tip>
  **When in doubt, scale anyway.** Scaling does not change what tree-based models learn (their splits depend only on the ordering of values), but forgetting to scale distance-based models usually hurts. Making scaling the default is a safe habit.
</Tip>
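
The "income dominates everything" effect can be checked by hand. The sketch below uses three hypothetical customers and plain NumPy: it finds the nearest neighbour of a query point by Euclidean distance, first on raw values and then after standardizing. All numbers are invented for illustration.

```python
import numpy as np

# Hypothetical customers: columns are [age, income]
query = np.array([25.0, 50_000.0])
candidates = np.array([
    [26.0, 52_000.0],  # similar age, similar income
    [70.0, 50_500.0],  # very different age, slightly closer income
])

def nearest(q, points):
    """Return the index of the point with the smallest Euclidean distance to q."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

raw_nearest = nearest(query, candidates)

# Standardize with statistics from a small reference sample
# (it stands in for the training set)
reference = np.array([
    [25.0, 50_000.0],
    [26.0, 52_000.0],
    [70.0, 50_500.0],
    [45.0, 90_000.0],
])
mean, std = reference.mean(axis=0), reference.std(axis=0)
scaled_nearest = nearest((query - mean) / std, (candidates - mean) / std)

print(f"Nearest on raw features: candidate {raw_nearest}")
print(f"Nearest after scaling:   candidate {scaled_nearest}")
```

On raw values the 70-year-old is "nearest" because a 1,500 difference in income outweighs a 45-year age gap. After scaling, the 26-year-old wins, which matches intuition.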

***

## Mistake 5: Feature Leakage from Target

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Features derived from target
    df['avg_purchase_by_customer_type'] = df.groupby('customer_type')['purchase'].transform('mean')
    # This leaks future purchase information!
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
    # Calculate on training data only
    train_means = X_train.groupby('customer_type')['purchase'].mean()
    X_train['avg_purchase_type'] = X_train['customer_type'].map(train_means)
    X_test['avg_purchase_type'] = X_test['customer_type'].map(train_means)
    ```
  </Tab>
</Tabs>
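
One practical wrinkle with the train-only approach is that a category may appear in the test set without ever appearing in training, in which case `map` returns `NaN`. A minimal sketch of one possible fallback (the global training mean) follows; the tiny `train_df` and `test_df` frames are hypothetical.

```python
import pandas as pd

# Hypothetical training and test rows; 'purchase' is the target
train_df = pd.DataFrame({
    "customer_type": ["new", "new", "loyal", "loyal"],
    "purchase": [10.0, 20.0, 50.0, 70.0],
})
test_df = pd.DataFrame({"customer_type": ["new", "loyal", "vip"]})

# Statistics come from training rows only
train_means = train_df.groupby("customer_type")["purchase"].mean()
global_mean = train_df["purchase"].mean()

# Categories never seen in training map to NaN,
# so fall back to the global training mean
test_df["avg_purchase_type"] = (
    test_df["customer_type"].map(train_means).fillna(global_mean)
)

print(test_df)
```

The unseen `vip` category receives 37.5, the mean of all training purchases, instead of silently propagating a missing value into the model.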

***

## Mistake 6: Dropping Missing Values Carelessly

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Drop all rows with any missing value
    df_clean = df.dropna()
    # Lost 50% of your data!
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
    # Strategy 1: Impute
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(strategy='median')

    # Strategy 2: Create missing indicator
    df['feature_missing'] = df['feature'].isna().astype(int)
    df['feature'] = df['feature'].fillna(df['feature'].median())

    # Strategy 3: Drop only columns that are mostly missing
    threshold = 0.5
    cols_to_drop = df.columns[df.isna().mean() > threshold]
    df = df.drop(columns=cols_to_drop)
    ```
  </Tab>
</Tabs>
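
Strategies 1 and 2 can be combined in one step with `SimpleImputer(add_indicator=True)`, which imputes and appends a missing-value flag for every column that had gaps during `fit`. The sketch below uses a toy array (an assumption) and fits on training rows only, in line with Mistake 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: the first column has a missing value in both train and test
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
X_test = np.array([[np.nan, 40.0]])

# Median imputation plus a 0/1 indicator column for features
# that had missing values in the training data
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train)

X_test_imputed = imputer.transform(X_test)
print(X_test_imputed)  # [[ 2. 40.  1.]]
```

The test row is filled with the training median (2.0), and the extra third column records that the value was originally missing.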

***

## Mistake 7: Overfitting to Validation Set

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Keep tuning until validation score is perfect
    best_score = 0.0
    for i in range(1000):
        model = train_with_new_hyperparameters()
        score = model.score(X_val, y_val)
        if score > best_score:
            best_score, best_model = score, model
    # You've now overfit to validation set!
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
    # Use nested cross-validation for honest estimate
    from sklearn.model_selection import cross_val_score, GridSearchCV

    # Inner loop: hyperparameter tuning
    # Outer loop: performance estimation
    outer_scores = cross_val_score(
        GridSearchCV(model, param_grid, cv=3),
        X, y, cv=5
    )
    print(f"Honest estimate: {outer_scores.mean():.4f}")
    ```
  </Tab>
</Tabs>
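
A simpler alternative to nested cross-validation is to lock away a test set before tuning and score it exactly once at the end. The sketch below assumes synthetic data and a small, arbitrary grid over `C` for `LogisticRegression`; both are placeholders for your own model and search space.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Lock the test set away before any tuning
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# All tuning happens with cross-validation inside the development set
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_dev, y_dev)

# Touch the test set exactly once, after every choice is final
test_score = search.score(X_test, y_test)
print(f"CV score: {search.best_score_:.4f}, test score: {test_score:.4f}")
```

If you find yourself going back to change hyperparameters after looking at `test_score`, the test set has become a validation set and you need a fresh holdout.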

***

## Mistake 8: Not Checking for Data Drift

Models are trained on a snapshot of the world. The world changes. Customer behavior shifts, new products launch, economic conditions evolve. A model trained on pre-pandemic e-commerce data would fail spectacularly in 2020. This is data drift, and it is inevitable -- the question is not "if" but "when."

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Train once, deploy forever
    model = train(historical_data)
    deploy(model)
    # 6 months later: "Why is accuracy dropping?"
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
    # Monitor distribution shifts
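    def check_drift(reference_data, new_data, threshold=0.1):
        for col in reference_data.columns:
            ref_mean = reference_data[col].mean()
            new_mean = new_data[col].mean()
            # Relative shift; abs() in the denominator keeps it positive for negative means
            shift = abs(ref_mean - new_mean) / (abs(ref_mean) + 1e-10)

            if shift > threshold:
                print(f"⚠️ Drift detected in {col}: {shift:.1%}")

    # Monitor predictions (historical_mean is the average score recorded at deployment)
    def monitor_predictions(model, X_new, historical_mean, tolerance=0.1):
        probs = model.predict_proba(X_new)[:, 1]
        # Check shifts in both directions, not only increases
        if abs(probs.mean() - historical_mean) > tolerance:
            alert("Prediction distribution has shifted!")  # alert() is a placeholder for your alerting hook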
    ```
  </Tab>
</Tabs>
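
Comparing means misses drift that changes the shape of a distribution but not its centre. One common alternative, swapped in here as an illustration and not the method used above, is the two-sample Kolmogorov-Smirnov test from SciPy. The helper name `ks_drift`, the 0.05 significance level, and the simulated data are all assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(reference, new, alpha=0.05):
    """Flag drift when the two-sample KS test rejects 'same distribution'."""
    result = ks_2samp(reference, new)
    return bool(result.pvalue < alpha), float(result.statistic)

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=500)  # stands in for training-time values
shifted = rng.normal(loc=3.0, scale=1.0, size=500)    # simulated production values after a shift

same_flag, same_stat = ks_drift(reference, reference.copy())
drift_flag, drift_stat = ks_drift(reference, shifted)

print(f"Identical sample: drift={same_flag}, KS statistic={same_stat:.3f}")
print(f"Shifted sample:   drift={drift_flag}, KS statistic={drift_stat:.3f}")
```

With many features, running one test per column will produce occasional false alarms, so in practice you would tune `alpha` or require the flag to persist across several monitoring windows.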

***

## Mistake 9: One-Hot Encoding High Cardinality

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # City has 10,000 unique values
    df = pd.get_dummies(df, columns=['city'])
    # Now you have 10,000 sparse columns!
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
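    # Strategy 1: Target encoding, fit on training rows only (see Mistake 5)
    # train_df / test_df are the training and test portions of df
    city_means = train_df.groupby('city')['target'].mean()
    train_df['city_encoded'] = train_df['city'].map(city_means)
    test_df['city_encoded'] = test_df['city'].map(city_means)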

    # Strategy 2: Frequency encoding
    city_freq = df['city'].value_counts(normalize=True)
    df['city_freq'] = df['city'].map(city_freq)

    # Strategy 3: Group rare categories
    top_cities = df['city'].value_counts().head(50).index
    df['city_grouped'] = df['city'].where(df['city'].isin(top_cities), 'Other')
    ```
  </Tab>
</Tabs>

***

## Mistake 10: Ignoring Class Imbalance in CV

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Plain KFold ignores class labels when building folds
    from sklearn.model_selection import KFold

    scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))
    # Some folds might have very few minority samples
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Stratified CV preserves class ratios in each fold
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv)
    ```
  </Tab>
</Tabs>
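
The difference is easy to see by counting minority samples per test fold. The sketch below uses a hypothetical label vector of 90 negatives followed by 10 positives and compares plain `KFold` with `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 90 negatives followed by 10 positives (sorted, as data often is after a query)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for building folds

def positives_per_fold(cv):
    """Count positive labels in each test fold produced by the splitter."""
    return [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print(f"KFold:           {plain}")       # [0, 0, 0, 0, 10]
print(f"StratifiedKFold: {stratified}")  # [2, 2, 2, 2, 2]
```

With plain `KFold` on sorted data, four of the five test folds contain no positives at all, so metrics such as recall cannot even be computed on them. Note that passing an integer `cv` to `cross_val_score` with a classifier already uses stratified folds; the risk arises when you pass a `KFold` object or write your own splitting code.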

***

## Mistake 11: Not Setting Random Seeds

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Results change every run
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    # "I swear it worked yesterday!"
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
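    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Set seeds everywhere
    RANDOM_STATE = 42

    np.random.seed(RANDOM_STATE)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)
    model = RandomForestClassifier(random_state=RANDOM_STATE)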
    ```
  </Tab>
</Tabs>
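
A quick way to confirm that seeding works is to train the same model twice and compare outputs. The sketch below assumes synthetic data and a small forest; the helper `fit_and_predict` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_and_predict(seed):
    """Train a small forest with the given seed and return its probabilities."""
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    return model.fit(X, y).predict_proba(X)

# Same seed, same data: the two runs should agree exactly
same_seed_match = np.array_equal(fit_and_predict(42), fit_and_predict(42))
print(f"Identical predictions with the same seed: {same_seed_match}")
```

A check like this is cheap to add to a test suite and catches an unseeded component early, before someone spends a day chasing a result that cannot be reproduced.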

***

## Mistake 12: Selecting Features Before the Train-Test Split

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Feature selection on all data
    from sklearn.feature_selection import SelectKBest
    selector = SelectKBest(k=10)
    X_selected = selector.fit_transform(X, y)  # Uses test info!

    X_train, X_test, y_train, y_test = train_test_split(X_selected, y)
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
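    # Feature selection inside cross-validation or on train only
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline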

    pipeline = Pipeline([
        ('selector', SelectKBest(k=10)),
        ('classifier', RandomForestClassifier())
    ])

    # Selector fit only on training folds
    scores = cross_val_score(pipeline, X, y, cv=5)
    ```
  </Tab>
</Tabs>
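
The size of this leak is easiest to appreciate on data with no signal at all. The sketch below builds pure-noise features with labels that are unrelated to them, then compares selecting features before cross-validation against selecting inside a pipeline. The dataset shape, `k=20`, and the `LogisticRegression` model are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))  # pure noise features
y = np.array([0, 1] * 30)        # labels unrelated to X

# Leaky: select features using all rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Honest: selection is refit inside each training fold
pipeline = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(pipeline, X, y, cv=5).mean()

print(f"Leaky CV accuracy:  {leaky_score:.2f}")
print(f"Honest CV accuracy: {honest_score:.2f}")
```

Since the labels carry no information, any accuracy well above 50% in the leaky version is produced entirely by the selection step having seen the validation rows. The honest version should land near chance.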

***

## Mistake 13: Using Mean for Skewed Data

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Income is highly skewed
    df['income'] = df['income'].fillna(df['income'].mean())
    # Mean = $85k but median = $50k
    # Filling with mean inflates values
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
    # Use median for skewed distributions
    df['income'] = df['income'].fillna(df['income'].median())

    # Or, as an alternative to the line above, impute on the log scale
    import numpy as np
    df['log_income'] = np.log1p(df['income'])
    df['log_income'] = df['log_income'].fillna(df['log_income'].median())
    ```
  </Tab>
</Tabs>
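
A tiny worked example shows how far a single outlier pulls the mean. The income values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Five known incomes (one extreme) and one missing value
income = pd.Series([30_000, 40_000, 50_000, 60_000, 1_000_000, np.nan])

mean_income = income.mean()      # 236000.0, pulled up by the single outlier
median_income = income.median()  # 50000.0, a typical value

# Assign the result back; fillna returns a new Series
filled = income.fillna(median_income)

print(f"Mean: {mean_income:,.0f}  Median: {median_income:,.0f}")
print(f"Imputed value: {filled.iloc[-1]:,.0f}")
```

Filling with the mean would give the missing customer an income higher than four of the five observed customers, while the median gives a value in the middle of the typical range.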

***

## Mistake 14: Trusting Default Hyperparameters

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Just use defaults
    model = GradientBoostingClassifier()
    model.fit(X_train, y_train)
    # "Good enough"
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7, 10],
        'learning_rate': [0.01, 0.1, 0.2]
    }

    grid = GridSearchCV(
        GradientBoostingClassifier(),
        param_grid,
        cv=5,
        scoring='roc_auc'
    )
    grid.fit(X_train, y_train)
    print(f"Best params: {grid.best_params_}")
    ```
  </Tab>
</Tabs>

***

## Mistake 15: Complex Model Without Baseline

This is the "Kaggle grandmaster" trap. You spend two weeks tuning XGBoost with 47 hyperparameters and get 89% accuracy. Is that good? You have no idea unless you know what a trivial model achieves. If logistic regression gets 87%, your two weeks bought you 2 percentage points -- probably not worth the complexity, maintenance burden, and inference latency in production.

<Tabs>
  <Tab title="❌ Wrong">
    ```python theme={null}
    # Jump straight to deep learning
    model = SuperComplexNeuralNetwork(layers=50)
    model.fit(X, y)
    # "My model has 89% accuracy!" -- But is that actually good?
    ```
  </Tab>

  <Tab title="✅ Correct">
    ```python theme={null}
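    # Always compare to baselines -- this takes 5 minutes and saves weeks
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score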

    # Baseline 1: Random guessing (the absolute floor)
    dummy = DummyClassifier(strategy='stratified')
    print(f"Random baseline: {cross_val_score(dummy, X, y).mean():.4f}")

    # Baseline 2: Simple model (the "is ML even needed?" check)
    lr = LogisticRegression()
    print(f"Logistic Regression: {cross_val_score(lr, X, y).mean():.4f}")

    # Now try complex model
    complex_model = GradientBoostingClassifier(n_estimators=200)
    print(f"Complex model: {cross_val_score(complex_model, X, y).mean():.4f}")

    # The question is not "is the complex model good?"
    # It's "is the improvement over the simple model worth the extra complexity?"
    ```
  </Tab>
</Tabs>

***

## Quick Reference Checklist

### Before Training

* [ ] Split data before any preprocessing
* [ ] Set random seeds for reproducibility
* [ ] Check class balance
* [ ] Handle missing values appropriately
* [ ] Scale features if needed by algorithm

### During Training

* [ ] Use pipelines to prevent leakage
* [ ] Use stratified CV for imbalanced data
* [ ] Use temporal splits for time series
* [ ] Compare to baseline models
* [ ] Tune hyperparameters systematically

### After Training

* [ ] Evaluate on held-out test set
* [ ] Use appropriate metrics (not just accuracy)
* [ ] Check for overfitting (train vs test gap)
* [ ] Validate feature importance makes sense
* [ ] Document everything

### In Production

* [ ] Monitor for data drift
* [ ] Track prediction distributions
* [ ] Set up alerts for performance degradation
* [ ] Plan for model retraining

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Split First" icon="scissors">
    Always separate test data before any processing
  </Card>

  <Card title="Use Pipelines" icon="diagram-project">
    Prevent leakage with sklearn pipelines
  </Card>

  <Card title="Right Metrics" icon="gauge">
    Match metrics to your problem
  </Card>

  <Card title="Start Simple" icon="seedling">
    Baseline first, complexity later
  </Card>
</CardGroup>

***

## Congratulations! 🎉

You've completed the **ML Mastery** course!

You now have comprehensive knowledge of:

* ML fundamentals and algorithms
* Feature engineering and data preprocessing
* Model evaluation and selection
* Advanced topics (time series, deep learning, deployment)
* Professional practices (pipelines, explainability, common mistakes)

### Continue Your Journey

<CardGroup cols={2}>
  <Card title="AI Engineering" icon="robot" href="/ai-engineering/overview">
    Build LLM-powered applications and agents
  </Card>

  <Card title="Math Foundations" icon="calculator" href="/courses/math-for-ml-linear-algebra/01-introduction">
    Deepen your mathematical understanding
  </Card>

  <Card title="System Design" icon="diagram-project" href="/system-design/overview">
    Design ML systems at scale
  </Card>

  <Card title="Kaggle Competitions" icon="trophy" href="https://www.kaggle.com/competitions">
    Apply your skills in real competitions
  </Card>
</CardGroup>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="Tell me about a time you discovered data leakage in a model that was already in production. How did you find it and what was the impact?">
    This is a behavioral-style question disguised as a technical one. The interviewer wants to see your debugging methodology and your ability to handle production incidents. Here is how a strong answer would be structured:

    * **Detection pattern: the "too good to be true" signal.** The most common way leakage is discovered is during a routine model retrain. The original model had 97% AUC, but a fresh retrain on recent data gives 83% AUC. The gap is the clue -- either the original model was leaky (inflated the original score) or the data has drifted dramatically. Check the original training code first.
    * **Common root cause: preprocessing before split.** In many production incidents, the leakage is in a StandardScaler or target encoding that was fit on the full dataset before train-test split. The model in production uses a scaler fit on historical data, but the offline evaluation was inflated by the leak. The model was always worse than the metrics suggested.
    * **Impact assessment.** The real-world impact depends on the domain. In a recommendation system, leaky evaluation means you deployed a model that was less personalized than you thought, leading to lower click-through rates. In fraud detection, it means you missed more fraud than expected. Calculate the gap between reported and actual performance, then estimate the business cost of that gap over the deployment period.
    * **Remediation.** Fix the leakage, retrain, and honestly report the corrected metrics. Then implement pipeline-level guardrails (automated leakage tests, the random-label permutation test) to prevent recurrence. The hardest part is communicating to stakeholders that the model was never as good as reported -- but honesty here builds long-term trust.
    * **Prevention going forward.** Mandate that all preprocessing lives inside sklearn Pipelines. Add a CI check that runs the model with shuffled labels and verifies that accuracy is near the base rate. Review feature engineering code for any feature derived from the target. Together, these three checks catch the most common forms of leakage.

    **Follow-up: If the leaky model was performing "well enough" in production according to business metrics, would you still fix it?**

    Yes, absolutely. A leaky model that appears to work is a ticking time bomb. It works by coincidence -- the leaky feature happens to correlate with the true signal in the current data. When the data distribution shifts (and it always does), the leaky feature and the true signal will diverge, and the model will fail suddenly rather than gracefully. Fixing it now, while things are stable, is far easier than diagnosing a sudden production failure at 2 AM. I would fix the leakage, retrain, and accept the honest lower metrics. If the honest model does not meet business requirements, that is valuable information -- it means you need a better model, not a leaky one.
  </Accordion>

  <Accordion title="You join a new team and inherit a model with no documentation, no tests, and no monitoring. What is your priority order for adding production hygiene?">
    This is a triage question. You cannot fix everything at once, so prioritization reveals your engineering judgment.

    * **Priority 1: Add monitoring (Day 1-3).** Before anything else, you need to know if the model is currently working. Add prediction distribution monitoring (track mean, variance, and percentile distribution of model outputs daily), input feature null-rate monitoring, and basic health checks (is the model responding, what is the latency). If the model is silently failing right now, you need to know immediately. This takes a few days and requires no changes to the model itself.
    * **Priority 2: Reproduce the model (Week 1-2).** Can you retrain the model and get the same performance? If not, you do not truly control the model. Find the training data, the feature engineering code, and the hyperparameters. Pin library versions. Run the training pipeline end-to-end and compare metrics to whatever historical records exist. If you cannot reproduce, you need to understand why before making any changes.
    * **Priority 3: Add a holdout evaluation (Week 2-3).** Set aside the most recent data as a holdout and evaluate the current model honestly. This tells you how the model is actually performing right now, not how it performed when it was first deployed months ago.
    * **Priority 4: Write tests (Week 3-4).** Add unit tests for the feature engineering pipeline (given these inputs, do I get these outputs?), integration tests for the end-to-end prediction path, and regression tests that verify model performance stays within expected bounds after retraining.
    * **Priority 5: Add documentation (Ongoing).** Document the model's purpose, the feature definitions, the training procedure, the deployment architecture, and the monitoring setup. Do this as you learn the system -- document what you discover.

    The anti-pattern is trying to refactor the code first. Resist that urge. A poorly written model that is monitored and reproducible is infinitely better than a beautifully refactored model that you accidentally broke during the refactor because you had no tests to catch the regression.

    **Follow-up: You discover during reproduction that the model's current production performance is 15% worse than the original reported metrics. What do you do?**

    First, verify the finding by checking against multiple evaluation windows and metrics. If confirmed, investigate whether this is data drift (the world changed), code drift (a dependency was updated), or the original metrics were leaky. Then communicate the finding to stakeholders with a clear recommendation: either retrain on recent data (if drift), roll back a code change (if code drift), or fix the leakage and accept the honest lower performance. The most important thing is to act on the information. Many teams discover degradation and do nothing because the model is "still good enough" -- until it is not.
  </Accordion>

  <Accordion title="What is the single most impactful ML mistake you have seen teams make repeatedly, and how do you prevent it?">
    The single most impactful mistake is deploying a model without a baseline comparison. Teams spend weeks building complex models, achieve some accuracy number, and deploy without ever asking "is this better than a simple rule or a logistic regression?"

    * **Why it matters.** A gradient boosting model with 89% accuracy sounds impressive until you learn that a logistic regression achieves 87% in 5 minutes of work, or that a hand-written business rule achieves 85% with zero ML infrastructure. The marginal 2-4% improvement from the complex model comes with significant costs: compute, maintenance, debugging complexity, and explainability challenges.
    * **The cost of complexity without commensurate benefit.** A complex model requires monitoring, retraining pipelines, feature stores, and on-call rotation. A logistic regression or business rule requires almost none of that. If the complex model is only marginally better, the total cost of ownership makes it a net negative.
    * **How I prevent it.** In every ML project, I mandate three baselines before any complex modeling begins. First, a trivial baseline (always predict the majority class, or random). This establishes the floor. Second, a simple ML baseline (logistic regression or a single decision tree). This establishes what you can achieve with minimal effort. Third, the existing solution (if any) -- the current business rule, heuristic, or manual process. Only if the complex model significantly outperforms all three baselines on the relevant business metric do you proceed to deployment. "Significantly" means the improvement justifies the operational cost.
    * **Cultural enforcement.** Make the baseline comparison a required section in every model review document. No model can proceed to deployment without demonstrating improvement over the documented baseline. This single process change prevents the most common waste of ML engineering resources: over-engineering problems that do not need complex models.

    **Follow-up: The baseline logistic regression actually outperforms the team's complex model. How do you communicate this diplomatically?**

    Frame it as a positive discovery, not a failure. "We found that a simple model achieves 88% accuracy on this problem, which means we can deploy faster, with lower operational cost, and with full explainability. The team's work was not wasted -- the feature engineering they did is what makes the simple model work so well. Let us deploy the logistic regression now, and if the business requirements change or accuracy needs to improve, we have the more complex model as a follow-up." The key message is: simplicity is a feature, not a failure. The best engineers choose the simplest solution that meets requirements.
  </Accordion>
</AccordionGroup>
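
The shuffled-label check mentioned in the first answer can be sketched as follows. This is one possible shape for such a test, not a standard recipe: the synthetic data, the pipeline, and the 0.15 tolerance around the base rate are all assumptions you would adapt to your own project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for your real training set
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Break the link between features and labels
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)

real_score = cross_val_score(pipeline, X, y, cv=5).mean()
shuffled_score = cross_val_score(pipeline, X, y_shuffled, cv=5).mean()

# Accuracy of always predicting the majority class
base_rate = max(np.mean(y), 1 - np.mean(y))

# With shuffled labels, an honest pipeline should fall back to roughly the base rate
leakage_suspected = shuffled_score > base_rate + 0.15  # tolerance is an assumption

print(f"Real labels:     {real_score:.3f}")
print(f"Shuffled labels: {shuffled_score:.3f} (base rate {base_rate:.3f})")
print(f"Leakage suspected: {leakage_suspected}")
```

If the shuffled-label score stays well above the base rate, some step in the pipeline is seeing the labels through a side channel, such as a feature derived from the target or preprocessing fit outside the cross-validation loop.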
