
Bias-Variance Tradeoff

Underfitting vs Overfitting

The Core Dilemma

Every ML model faces the same fundamental challenge:
  • Too simple → Misses patterns (underfitting)
  • Too complex → Memorizes noise (overfitting)
This is the Bias-Variance Tradeoff - and understanding it will make you a better ML practitioner.
[Figure: stock market overfitting example]

The Dartboard Analogy

Imagine throwing darts at a target:
        🎯
   •  •  •
   •  •  •
   •  •  •
The darts cluster tightly but land away from the bullseye: low variance, high bias. The throws are consistent, and consistently wrong. The opposite failure mode is darts scattered widely around the bullseye: low bias, high variance.

The Math Behind It

Total Error = Bias² + Variance + Irreducible Noise

$$
E\big[(y - \hat{f}(x))^2\big] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2
$$

Where:
  • Bias: Error from wrong assumptions (model too simple)
  • Variance: Error from sensitivity to training data (model too complex)
  • Irreducible Noise (σ²): Random error that can’t be reduced
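
To make the decomposition concrete, here is a minimal numerical check at a single test point. It is a sketch assuming the same toy setup used below (true function sin(4x), Gaussian noise with σ = 0.3); the two printed values should roughly agree.

import numpy as np

# Verify Bias² + Variance + σ² ≈ expected squared error at one point,
# by refitting a simple estimator on many freshly drawn training sets.
rng = np.random.default_rng(0)
sigma = 0.3
x0 = 0.5                   # the test point
f_true = np.sin(4 * x0)    # true function value at x0

n_trials = 5000
preds = np.empty(n_trials)
errors = np.empty(n_trials)

for i in range(n_trials):
    # Fresh training set for each trial
    X_tr = rng.uniform(0, 1, 50)
    y_tr = np.sin(4 * X_tr) + rng.normal(0, sigma, 50)
    # Deliberately simple (high-bias) estimator: a least-squares line
    slope, intercept = np.polyfit(X_tr, y_tr, deg=1)
    preds[i] = slope * x0 + intercept
    # Squared error against a fresh noisy observation at x0
    errors[i] = (f_true + rng.normal(0, sigma) - preds[i]) ** 2

bias_sq = (preds.mean() - f_true) ** 2
variance = preds.var()
print(f"Average squared error : {errors.mean():.3f}")
print(f"Bias² + Var + σ²      : {bias_sq + variance + sigma**2:.3f}")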

Visualizing the Tradeoff

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Generate true function with noise
np.random.seed(42)
n_samples = 50
X = np.sort(np.random.uniform(0, 1, n_samples))
y_true = np.sin(4 * X)  # True function
y = y_true + np.random.normal(0, 0.3, n_samples)  # Noisy observations

# Fit models of different complexity
degrees = [1, 3, 15]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

X_test = np.linspace(0, 1, 100)

for ax, degree in zip(axes, degrees):
    # Fit polynomial
    model = make_pipeline(
        PolynomialFeatures(degree),
        LinearRegression()
    )
    model.fit(X.reshape(-1, 1), y)
    y_pred = model.predict(X_test.reshape(-1, 1))
    
    # Plot
    ax.scatter(X, y, alpha=0.6, label='Data')
    ax.plot(X_test, np.sin(4 * X_test), 'g--', label='True function', linewidth=2)
    ax.plot(X_test, y_pred, 'r-', label=f'Degree {degree}', linewidth=2)
    ax.set_title(f'Polynomial Degree {degree}')
    ax.legend()
    ax.set_ylim(-2, 2)
    
    # Label bias/variance
    if degree == 1:
        ax.text(0.5, -1.5, 'High Bias\nLow Variance', ha='center', fontsize=10)
    elif degree == 3:
        ax.text(0.5, -1.5, 'Balanced', ha='center', fontsize=10)
    else:
        ax.text(0.5, -1.5, 'Low Bias\nHigh Variance', ha='center', fontsize=10)

plt.tight_layout()
plt.show()

Estimating Bias and Variance

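We can approximate both terms empirically: refit the model on many bootstrap resamples of the training data, average its predictions at each test point, then measure how far that average sits from the true function (bias²) and how much individual predictions scatter around it (variance).
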
def estimate_bias_variance(X, y, model_class, n_bootstraps=100):
    """
    Estimate bias and variance using bootstrap sampling.
    """
    X_test = np.linspace(0, 1, 50).reshape(-1, 1)
    y_true = np.sin(4 * X_test.ravel())
    
    # Collect predictions from multiple bootstrap samples
    predictions = np.zeros((n_bootstraps, len(X_test)))
    
    for i in range(n_bootstraps):
        # Bootstrap sample
        indices = np.random.choice(len(X), size=len(X), replace=True)
        X_boot = X[indices].reshape(-1, 1)
        y_boot = y[indices]
        
        # Fit and predict
        model = model_class()
        model.fit(X_boot, y_boot)
        predictions[i] = model.predict(X_test)
    
    # Calculate bias and variance
    mean_prediction = predictions.mean(axis=0)
    
    bias_squared = (mean_prediction - y_true) ** 2
    variance = predictions.var(axis=0)
    
    return {
        'bias_squared': bias_squared.mean(),
        'variance': variance.mean(),
        'total_error': bias_squared.mean() + variance.mean()  # excludes irreducible noise (σ²)
    }

# Compare different polynomial degrees
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

results = []
degrees = range(1, 16)

for degree in degrees:
    def model_class():
        return make_pipeline(
            PolynomialFeatures(degree),
            LinearRegression()
        )
    
    metrics = estimate_bias_variance(X, y, model_class)
    metrics['degree'] = degree
    results.append(metrics)

# Plot the tradeoff
degrees = [r['degree'] for r in results]
bias = [r['bias_squared'] for r in results]
variance = [r['variance'] for r in results]
total = [r['total_error'] for r in results]

plt.figure(figsize=(10, 6))
plt.plot(degrees, bias, 'b-o', label='Bias²')
plt.plot(degrees, variance, 'r-o', label='Variance')
plt.plot(degrees, total, 'g-o', label='Total Error', linewidth=2)
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Error')
plt.title('Bias-Variance Tradeoff')
plt.axvline(degrees[np.argmin(total)], color='gray', linestyle='--',
            label=f'Optimal: degree={degrees[np.argmin(total)]}')
plt.legend()
plt.show()

Signs of High Bias vs High Variance

High Bias (Underfitting)

| Symptom | Example |
| --- | --- |
| High training error | Training accuracy = 65% |
| High test error | Test accuracy = 63% |
| Both errors similar | Gap is small |

Solution: More complex model, more features

High Variance (Overfitting)

| Symptom | Example |
| --- | --- |
| Low training error | Training accuracy = 99% |
| High test error | Test accuracy = 70% |
| Large gap between them | 29% difference! |

Solution: More data, regularization, simpler model
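
A quick way to apply these rules of thumb in code is to compare the training score against a cross-validated score and look at the gap. The helper below is a rough sketch; the gap_threshold and low_score cutoffs are illustrative assumptions, not standard values.

from sklearn.model_selection import cross_val_score

def diagnose(model, X, y, gap_threshold=0.1, low_score=0.7):
    """Rough bias/variance diagnosis from train vs cross-validated R²."""
    model.fit(X, y)
    train_score = model.score(X, y)                        # R² on the training data
    cv_score = cross_val_score(model, X, y, cv=5).mean()   # held-out R²
    gap = train_score - cv_score

    if gap >= gap_threshold:
        verdict = "likely high variance: large train/validation gap"
    elif train_score < low_score:
        verdict = "likely high bias: low score even on the training data"
    else:
        verdict = "reasonably balanced"
    print(f"train={train_score:.2f}  cv={cv_score:.2f}  gap={gap:.2f}  -> {verdict}")

# Example with the synthetic data from above:
# diagnose(make_pipeline(PolynomialFeatures(15), LinearRegression()), X.reshape(-1, 1), y)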

Learning Curves: Diagnostic Tool

from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, X, y, title="Learning Curve"):
    """Plot learning curve to diagnose bias/variance."""
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X.reshape(-1, 1), y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='neg_mean_squared_error'
    )
    
    train_mean = -train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    test_mean = -test_scores.mean(axis=1)
    test_std = test_scores.std(axis=1)
    
    plt.figure(figsize=(10, 6))
    
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
    plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1, color='orange')
    
    plt.plot(train_sizes, train_mean, 'b-o', label='Training Error')
    plt.plot(train_sizes, test_mean, 'r-o', label='Validation Error')
    
    plt.xlabel('Training Set Size')
    plt.ylabel('Mean Squared Error')
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

# High bias model
high_bias = make_pipeline(PolynomialFeatures(1), LinearRegression())
plot_learning_curve(high_bias, X, y, "High Bias Model (Degree 1)")

# High variance model  
high_variance = make_pipeline(PolynomialFeatures(15), LinearRegression())
plot_learning_curve(high_variance, X, y, "High Variance Model (Degree 15)")

# Balanced model
balanced = make_pipeline(PolynomialFeatures(4), LinearRegression())
plot_learning_curve(balanced, X, y, "Balanced Model (Degree 4)")

Reading Learning Curves

  • High bias: both curves plateau at a high error and sit close together. More data won't help; try a more complex model or better features.
  • High variance: training error stays low while validation error is much higher. The gap tends to shrink as the training set grows, so more data, regularization, or a simpler model helps.
  • Balanced: both curves converge to a similarly low error.

Model Complexity Spectrum

Simple ←――――――――――――――――――――――――――→ Complex

Linear Regression                    Neural Networks
Logistic Regression                  Deep Learning
Naive Bayes                         Ensemble Methods
KNN (large k)        Decision Trees  KNN (k=1)
                     SVM + RBF kernel
                     Random Forest

HIGH BIAS ←――――――――――――――――――――――→ HIGH VARIANCE
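
To see the two ends of this spectrum on data, here is a small sketch, assuming the synthetic X and y arrays generated earlier: a 1-nearest-neighbour regressor memorizes the training set, while a large-k regressor averages over most of it.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# k=1 -> low bias, high variance; k=25 -> high bias, low variance
for k in (1, 25):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X.reshape(-1, 1), y)
    train_mse = mean_squared_error(y, knn.predict(X.reshape(-1, 1)))
    cv_mse = -cross_val_score(knn, X.reshape(-1, 1), y, cv=5,
                              scoring='neg_mean_squared_error').mean()
    print(f"k={k:2d}  train MSE={train_mse:.3f}  CV MSE={cv_mse:.3f}")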

Practical Strategies

Fighting High Bias

from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor

# 1. Add polynomial features
poly_features = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly_features.fit_transform(X.reshape(-1, 1))

# 2. Use more powerful model
rf = RandomForestRegressor(n_estimators=100, max_depth=None)

# 3. Add more features
# X_new = add_feature_interactions(X)

# 4. Reduce regularization
from sklearn.linear_model import Ridge
weak_reg = Ridge(alpha=0.001)  # Less regularization

Fighting High Variance

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

# 1. Add regularization
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1)

# 2. Use simpler model
from sklearn.linear_model import LinearRegression
simple_model = LinearRegression()

# 3. Get more data
# X_augmented, y_augmented = get_more_data()

# 4. Feature selection
from sklearn.feature_selection import SelectKBest, f_regression
selector = SelectKBest(f_regression, k=10)  # f_regression suits a regression target

# 5. Early stopping (for iterative algorithms)
from sklearn.ensemble import GradientBoostingRegressor
gb = GradientBoostingRegressor(n_estimators=100, validation_fraction=0.2,
                                n_iter_no_change=10, random_state=42)

# 6. Ensemble methods (average reduces variance)
from sklearn.ensemble import BaggingRegressor
bagging = BaggingRegressor(n_estimators=50)

Real-World Example: Housing Prices

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare models
models = {
    'Linear (High Bias)': Ridge(alpha=100),
    'Ridge (Balanced)': Ridge(alpha=1),
    'Random Forest (Low Bias)': RandomForestRegressor(n_estimators=100, max_depth=10),
    'RF Deep (High Variance)': RandomForestRegressor(n_estimators=100, max_depth=None)
}

print("Model Comparison:")
print("-" * 50)

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    
    train_pred = model.predict(X_train_scaled)
    test_pred = model.predict(X_test_scaled)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))
    gap = test_rmse - train_rmse
    
    print(f"{name:25s}: Train RMSE={train_rmse:.3f}, Test RMSE={test_rmse:.3f}, Gap={gap:.3f}")

Bias and Variance for Different Algorithms

| Algorithm | Default Bias | Default Variance | Tuning Focus |
| --- | --- | --- | --- |
| Linear Regression | High | Low | Add features |
| KNN (small k) | Low | High | Increase k |
| KNN (large k) | High | Low | Decrease k |
| Decision Tree (deep) | Low | High | Limit depth |
| Random Forest | Low | Lower than trees | n_estimators |
| Gradient Boosting | Starts high | Increases with iterations | Early stopping |
| Neural Networks | Low | High | Regularization, dropout |

Key Takeaways

Bias = Underfitting

Model too simple, misses patterns consistently

Variance = Overfitting

Model too complex, fits noise in training data

Use Learning Curves

Diagnose whether you need more data or a different model

Balance is Key

Find the sweet spot through cross-validation
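
As a concrete way to find that sweet spot, here is a minimal sketch that tunes the Ridge regularization strength on the housing data from the example above (it assumes X_train_scaled and y_train are still in scope).

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Cross-validation picks the regularization strength that generalizes best:
# too large -> high bias, too small -> high variance.
search = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.01, 0.1, 1, 10, 100, 1000]},
    scoring='neg_mean_squared_error',
    cv=5,
)
search.fit(X_train_scaled, y_train)
print("Best alpha:", search.best_params_['alpha'])
print("CV MSE:", -search.best_score_)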

What’s Next?

Understanding how to avoid one of the most dangerous mistakes in ML - data leakage!

Continue to Data Leakage

The silent killer of ML models in production