Bias-Variance Tradeoff
The Core Dilemma
Every ML model faces the same fundamental challenge:
- Too simple → Misses patterns (underfitting)
- Too complex → Memorizes noise (overfitting)
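To make the dilemma concrete, here is a minimal sketch (the sine target, noise level, and polynomial degrees are illustrative assumptions, not from the original) showing that training error keeps falling as model complexity grows, while test error eventually rises again:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a sine curve: a true pattern plus irreducible noise
x_train = np.linspace(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, size=30)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, size=100)

for degree in (1, 4, 15):  # too simple, balanced, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The degree-15 fit scores best on the training points but worst on fresh data: that widening train/test gap is the overfitting half of the dilemma.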
The Dartboard Analogy
Imagine throwing darts at a target:
- High Bias, Low Variance: darts land in a tight cluster, but far from the bullseye
- Low Bias, High Variance: darts center on the bullseye, but scatter widely around it
- Low Bias, Low Variance: darts land in a tight cluster on the bullseye (the goal)
The Math Behind It
Total Error = Bias² + Variance + Irreducible Noise

Where:
- Bias: Error from wrong assumptions (model too simple)
- Variance: Error from sensitivity to training data (model too complex)
- Irreducible Noise (σ²): Random error that can’t be reduced
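For squared-error loss, this decomposition can be written out at a fixed input x (standard notation, assumed here rather than taken from the original). With y = f(x) + ε, where E[ε] = 0 and Var(ε) = σ², and with f̂ denoting the model fitted on a randomly drawn training set:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible Noise}}
$$

The expectation is taken over both the training set and the noise, which is why bias and variance are properties of a *procedure*, not of any single fitted model.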
Visualizing the Tradeoff
As model complexity grows, bias² falls while variance rises; total error traces a U-shape, and the bottom of that U is the sweet spot we aim for.
Estimating Bias and Variance
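On synthetic data, where the true function is known, bias and variance can be estimated directly by retraining on many freshly drawn training sets and measuring how the predictions move. A minimal sketch of that recipe (the sine target, deep decision tree, and sample counts are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
true_fn = lambda x: np.sin(2 * np.pi * x)
x_test = np.linspace(0, 1, 50).reshape(-1, 1)  # fixed points to evaluate at

# Retrain on 200 independently drawn training sets and collect predictions
predictions = []
for _ in range(200):
    x_tr = rng.uniform(0, 1, size=(40, 1))
    y_tr = true_fn(x_tr).ravel() + rng.normal(0, 0.3, size=40)
    model = DecisionTreeRegressor().fit(x_tr, y_tr)  # deep tree: low bias, high variance
    predictions.append(model.predict(x_test))

predictions = np.array(predictions)    # shape (200, 50)
avg_pred = predictions.mean(axis=0)    # estimate of E[f_hat(x)]
bias_sq = np.mean((avg_pred - true_fn(x_test).ravel()) ** 2)
variance = predictions.var(axis=0).mean()
print(f"bias^2 ≈ {bias_sq:.3f}, variance ≈ {variance:.3f}")
```

Swapping the deep tree for a linear model reverses the picture: bias² rises and variance drops.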
Signs of High Bias vs High Variance
High Bias (Underfitting)
| Symptom | Example |
|---|---|
| High training error | Training accuracy = 65% |
| High test error | Test accuracy = 63% |
| Both errors similar | Gap is small |
High Variance (Overfitting)
| Symptom | Example |
|---|---|
| Low training error | Training accuracy = 99% |
| High test error | Test accuracy = 70% |
| Large gap between them | 29% difference! |
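A quick way to check for these symptoms in practice is to compare training and held-out scores directly; a minimal sketch (the synthetic dataset and unpruned tree are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unpruned tree
train_acc, test_acc = model.score(X_tr, y_tr), model.score(X_te, y_te)
print(f"train={train_acc:.2f}  test={test_acc:.2f}  gap={train_acc - test_acc:.2f}")
# Large gap -> high variance; two similarly poor scores -> high bias
```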
Learning Curves: Diagnostic Tool
A learning curve plots training and validation error against the number of training examples, turning the bias-variance diagnosis into something you can see.
Reading Learning Curves
- High Bias: Both curves plateau high and close together. More data won't help; you need a more complex model.
- High Variance: A large gap persists between a low training error and a high validation error. More data usually helps.
- Good Fit: Both curves converge to a low error with only a small gap.
Model Complexity Spectrum
Algorithms span a spectrum from simple, high-bias models (linear regression, shallow trees, large-k KNN) to complex, high-variance ones (deep trees, small-k KNN, large neural networks).
Practical Strategies
Fighting High Bias
- Add features or feature interactions (a sketch follows this list)
- Use a more complex model (deeper trees, higher-degree polynomials)
- Reduce regularization strength
- Train longer (for iterative learners)
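One concrete bias-fighting move is to add polynomial features to a linear model; a minimal sketch (the nonlinear target and degree grid are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=200)  # nonlinear target

# A plain linear fit underfits; polynomial features reduce the bias
for degree in (1, 3, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"degree {degree}: mean CV R^2 = {r2:.3f}")
```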
Fighting High Variance
- Get more training data
- Add regularization (L1/L2 penalties, dropout; a sketch follows this list)
- Remove features or reduce model complexity
- Use ensembles (bagging, random forests)
- Apply early stopping
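On the variance side, regularization is often the first lever to pull; a minimal sketch with ridge regression (the synthetic dataset and alpha grid are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Few samples, many features: a recipe for high variance
X, y = make_regression(n_samples=60, n_features=50, noise=10.0, random_state=0)

for alpha in (0.01, 1.0, 10.0, 100.0):
    r2 = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2").mean()
    print(f"alpha={alpha:>6}: mean CV R^2 = {r2:.3f}")
```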
Real-World Example: Housing Prices
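As a sketch of how such a comparison might look (the California housing dataset and the specific models are illustrative assumptions, not from the original), we can pit a high-bias linear model against a high-variance unpruned tree and a depth-limited tree in between:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)

models = {
    "linear model (higher bias)": LinearRegression(),
    "unpruned tree (higher variance)": DecisionTreeRegressor(random_state=0),
    "depth-6 tree (balanced)": DecisionTreeRegressor(max_depth=6, random_state=0),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```

The depth-limited tree typically lands between the two extremes, which is exactly the balance the tradeoff asks for.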
The Bias-Variance Tradeoff for Different Algorithms
| Algorithm | Default Bias | Default Variance | Tuning Focus |
|---|---|---|---|
| Linear Regression | High | Low | Add features |
| KNN (small k) | Low | High | Increase k |
| KNN (large k) | High | Low | Decrease k |
| Decision Tree (deep) | Low | High | Limit depth |
| Random Forest | Low | Lower than single trees | Increase n_estimators |
| Gradient Boosting | High at first, falls with iterations | Grows with iterations | Early stopping |
| Neural Networks | Low | High | Regularization, dropout |
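The KNN rows are easy to verify empirically: sweep k and watch the tradeoff play out; a minimal sketch (the dataset and k grid are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Small k -> low bias / high variance; large k -> high bias / low variance
for k in (1, 5, 25, 101):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:>3}: mean CV accuracy = {acc:.3f}")
```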
Key Takeaways
- Bias = Underfitting: The model is too simple and misses patterns consistently.
- Variance = Overfitting: The model is too complex and fits noise in the training data.
- Use Learning Curves: Diagnose whether you need more data or a different model.
- Balance is Key: Find the sweet spot through cross-validation.
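Cross-validated search over a complexity knob is the standard way to find that sweet spot; a minimal sketch using a decision tree's depth (the dataset and depth grid are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Depth controls complexity: too shallow underfits, too deep overfits
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 4, 8, 16, None]},
    cv=5,
)
search.fit(X, y)
print("best depth:", search.best_params_["max_depth"],
      "| mean CV accuracy:", round(search.best_score_, 3))
```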
What’s Next?
Next, learn how to avoid one of the most dangerous mistakes in ML: data leakage!

Continue to Data Leakage, the silent killer of ML models in production.