
Common ML Mistakes

The ML Hall of Shame

Every data scientist has made these mistakes. Learn from them so you don’t have to!

Mistake 1: Training on the Test Set

# Fitting ANYTHING on all data before split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses test data statistics!

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
Why it matters: Test set statistics leak into training, giving overly optimistic results.
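The fix is to split first and fit the scaler on training data only. A minimal sketch, using synthetic data from `make_classification` (not part of the original example):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Split FIRST, then fit the scaler on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from train only
X_test_scaled = scaler.transform(X_test)        # applied to test, never re-fit
```

Better still, wrap the scaler in a pipeline so cross-validation can't leak either.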

Mistake 2: Using Accuracy for Imbalanced Data

# 99% accuracy sounds great!
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
# But: model just predicts majority class for everything
Rule of thumb: If the class ratio exceeds 10:1, don’t rely on accuracy alone; use precision, recall, F1, or ROC AUC.
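To see why, compare accuracy against F1 for a model that only ever predicts the majority class. A sketch using a synthetic 99:1 dataset and sklearn’s `DummyClassifier` (neither is in the original snippet):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# 99:1 imbalanced data
X, y = make_classification(n_samples=1000, weights=[0.99], flip_y=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# A "model" that always predicts the majority class
model = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, pred):.2%}")  # looks great
print(f"F1:       {f1_score(y_test, pred):.2%}")        # reveals the truth
```

Accuracy comes out near 99% while F1 is zero, because the minority class is never predicted.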

Mistake 3: Random Split for Time Series

# Random shuffle breaks temporal order
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)
# Now you're training on Dec 2024 to predict Jan 2024!
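The fix is to split along the time axis, e.g. with sklearn’s `TimeSeriesSplit`, so every fold trains on the past and tests on the future. A minimal sketch on dummy ordered data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # observations already in time order
y = np.arange(100)

# Each fold trains only on earlier indices and tests on later ones
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    assert train_idx.max() < test_idx.min()  # no future leakage
```

For a single split, `train_test_split(X, y, shuffle=False)` preserves order the same way.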

Mistake 4: Ignoring Feature Scaling

# SVM, KNN, neural nets need scaled features!
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)  # age: 0-100, income: 0-1,000,000
# Income dominates everything
Models that need scaling: SVM, KNN, Neural Networks, PCA, Logistic Regression (with regularization)
Models that don’t need scaling: Decision Trees, Random Forest, Gradient Boosting
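The idiomatic fix is to put the scaler and the model in one pipeline, so scaling is fitted on training folds only. A sketch on synthetic data (not the original age/income features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling happens inside the pipeline, fitted on training data only
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
```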

Mistake 5: Feature Leakage from Target

# Features derived from target
df['avg_purchase_by_customer_type'] = df.groupby('customer_type')['purchase'].transform('mean')
# This leaks future purchase information!
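One common mitigation is to learn such aggregates on the training set only, then map them onto other data (out-of-fold target encoding is stricter still). A sketch with made-up toy frames:

```python
import pandas as pd

train = pd.DataFrame({
    "customer_type": ["a", "a", "b", "b"],
    "purchase": [10.0, 20.0, 100.0, 200.0],
})
test = pd.DataFrame({"customer_type": ["a", "b", "c"]})

# Learn the aggregate on TRAIN only, then map onto both sets
means = train.groupby("customer_type")["purchase"].mean()
train["avg_purchase_by_customer_type"] = train["customer_type"].map(means)
# Unseen categories fall back to the global mean of the learned values
test["avg_purchase_by_customer_type"] = test["customer_type"].map(means).fillna(means.mean())
```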

Mistake 6: Dropping Missing Values Carelessly

# Drop all rows with any missing value
df_clean = df.dropna()
# Lost 50% of your data!
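Instead of a blanket `dropna()`, inspect how much is missing and impute column by column. A sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, 58_000],
    "city": ["NY", "SF", None, "LA"],
})

# Check the damage per column before deciding what to do
missing_frac = df.isna().mean()

# Impute numerics with the median and categoricals with a sentinel,
# instead of dropping every row that has any gap
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna("unknown")
```

All four rows survive, where `dropna()` would have kept only one.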

Mistake 7: Overfitting to Validation Set

# Keep tuning until validation score is perfect
for i in range(1000):
    model = train_with_new_hyperparameters()
    if model.score(X_val, y_val) > best_score:
        best_model = model
# You've now overfit to validation set!
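The remedy is to hold out a final test set that tuning never touches, and do the search with cross-validation on the rest. A sketch using `GridSearchCV` on synthetic data (the grid and model are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, random_state=0)

# Hold out a final test set that hyperparameter tuning never sees
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune with cross-validation on the dev portion only
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_dev, y_dev)

# Report the untouched test score ONCE, at the very end
final_score = search.score(X_test, y_test)
```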

Mistake 8: Not Checking for Data Drift

# Train once, deploy forever
model = train(historical_data)
deploy(model)
# 6 months later: "Why is accuracy dropping?"

Mistake 9: One-Hot Encoding High Cardinality

# City has 10,000 unique values
df = pd.get_dummies(df, columns=['city'])
# Now you have 10,000 sparse columns!
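For high-cardinality categoricals, frequency encoding (or target encoding) replaces thousands of dummy columns with one numeric column. A sketch on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "SF", "LA", "NY", "SF"]})

# Frequency encoding: one numeric column instead of thousands of dummies
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
```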

Mistake 10: Ignoring Class Imbalance in CV

# Regular cross-validation with imbalanced data
scores = cross_val_score(model, X, y, cv=5)
# Some folds might have very few minority samples
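The fix is `StratifiedKFold`, which preserves the class ratio in every fold. A sketch on synthetic 9:1 data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

# Stratified folds keep roughly the same class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
```

(For classifiers, `cross_val_score` stratifies by default when `cv` is an integer, but passing the splitter explicitly makes the intent unmissable.)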

Mistake 11: Not Setting Random Seeds

# Results change every run
model = RandomForestClassifier()
model.fit(X_train, y_train)
# "I swear it worked yesterday!"

Mistake 12: Selecting Features After Train-Test Split

# Feature selection on all data
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=10)
X_selected = selector.fit_transform(X, y)  # Uses test info!

X_train, X_test, y_train, y_test = train_test_split(X_selected, y)
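As with scaling, the fix is to do the selection inside a pipeline so it only ever sees training data. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature selection is fitted inside the pipeline, on training data only
model = make_pipeline(SelectKBest(k=10), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
```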

Mistake 13: Using Mean for Skewed Data

# Income is highly skewed
df['income'] = df['income'].fillna(df['income'].mean())
# Mean = $85k but median = $50k
# Filling with mean inflates values
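For skewed columns, fill with the median instead. A sketch on a fabricated income column with one outlier:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30_000, 40_000, 50_000, 60_000, 500_000, np.nan]})

# The median ignores the outlier; the mean is dragged up by it
median = df["income"].median()  # 50,000
mean = df["income"].mean()      # 136,000
df["income"] = df["income"].fillna(median)
```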

Mistake 14: Trusting Default Hyperparameters

# Just use defaults
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
# "Good enough"

Mistake 15: Complex Model Without Baseline

# Jump straight to deep learning
model = SuperComplexNeuralNetwork(layers=50)
model.fit(X, y)
# "My model has 89% accuracy!"

Quick Reference Checklist

Before Training

  • Split data before any preprocessing
  • Set random seeds for reproducibility
  • Check class balance
  • Handle missing values appropriately
  • Scale features if needed by algorithm

During Training

  • Use pipelines to prevent leakage
  • Use stratified CV for imbalanced data
  • Use temporal splits for time series
  • Compare to baseline models
  • Tune hyperparameters systematically

After Training

  • Evaluate on held-out test set
  • Use appropriate metrics (not just accuracy)
  • Check for overfitting (train vs test gap)
  • Validate feature importance makes sense
  • Document everything

In Production

  • Monitor for data drift
  • Track prediction distributions
  • Set up alerts for performance degradation
  • Plan for model retraining

Key Takeaways

Split First

Always separate test data before any processing

Use Pipelines

Prevent leakage with sklearn pipelines

Right Metrics

Match metrics to your problem

Start Simple

Baseline first, complexity later

Congratulations! 🎉

You’ve completed the ML Mastery course! You now have comprehensive knowledge of:
  • ML fundamentals and algorithms
  • Feature engineering and data preprocessing
  • Model evaluation and selection
  • Advanced topics (time series, deep learning, deployment)
  • Professional practices (pipelines, explainability, common mistakes)

Continue Your Journey