
Regularization: The Art of Keeping Models Simple

L1, L2, and Dropout Regularization

The Overfitting Problem

Remember: a model that memorizes training data is useless on new data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Simple data with noise
np.random.seed(42)
X = np.linspace(0, 1, 15).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.randn(15) * 0.3

# Fit polynomials of different degrees
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, degree in zip(axes, [1, 5, 14]):
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X)
    
    model = LinearRegression()
    model.fit(X_poly, y)
    
    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    X_test_poly = poly.transform(X_test)
    y_pred = model.predict(X_test_poly)
    
    ax.scatter(X, y, color='blue', label='Training data')
    ax.plot(X_test, y_pred, color='red', label=f'Degree {degree}')
    ax.set_title(f'Degree {degree} Polynomial')
    ax.legend()
    ax.set_ylim(-1, 5)

plt.tight_layout()
plt.show()
  • Degree 1: Too simple (underfitting)
  • Degree 5: Just right
  • Degree 14: Wiggles through every point (overfitting)
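To see the same story in numbers, here is a minimal sketch (reusing the X, y, and imports defined above) that measures training and test error for each degree on a held-out split; the exact values depend on the random split.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out a few points and compare training vs. test error per degree
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in [1, 5, 14]:
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(poly.transform(X_tr)))
    test_mse = mean_squared_error(y_te, model.predict(poly.transform(X_te)))
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
Typically the degree-14 model drives training error toward zero while its test error stays the highest of the three.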

What Is Regularization?

Core idea: Penalize complexity! Instead of just minimizing prediction error, we minimize:

$$\text{Loss} = \text{Prediction Error} + \lambda \times \text{Model Complexity}$$

where $\lambda$ is the regularization strength (a tiny numeric sketch follows the list below). Trade-off:
  • λ = 0: No regularization, risk overfitting
  • λ = ∞: Maximum regularization, model predicts the mean
  • λ = just right: Balance fit and complexity
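As a made-up numeric sketch (the error and weights below are invented purely to show the trade-off), watch how the total loss changes as λ grows:
import numpy as np

w = np.array([3.0, -2.0, 0.5])        # hypothetical model weights
prediction_error = 1.2                 # hypothetical training MSE
complexity = np.sum(w ** 2)            # here: sum of squared weights (an L2-style complexity measure)

for lam in [0.0, 0.1, 1.0, 10.0]:
    print(f"lambda = {lam:5.1f} -> total loss = {prediction_error + lam * complexity:.2f}")
With λ = 0 only the prediction error matters; as λ grows, large weights become increasingly expensive.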

L2 Regularization (Ridge)

Add the sum of squared weights to the loss:

$$\text{Loss}_{\text{Ridge}} = \text{MSE} + \lambda \sum_{j=1}^{p} w_j^2$$

Effect: Pushes weights toward zero, but never exactly zero. Creates “small” weights.
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# High-degree polynomial with Ridge regularization
X = np.linspace(0, 1, 15).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.randn(15) * 0.3

# Try different regularization strengths
alphas = [0, 0.1, 1, 10]

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for ax, alpha in zip(axes, alphas):
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=10)),
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=alpha))
    ])
    
    pipeline.fit(X, y)
    
    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    y_pred = pipeline.predict(X_test)
    
    ax.scatter(X, y, color='blue')
    ax.plot(X_test, y_pred, color='red')
    ax.set_title(f'Ridge (α = {alpha})')
    ax.set_ylim(-1, 5)

plt.tight_layout()
plt.show()
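A quick way to see the shrinkage directly is to print the size of the learned weights as α grows. This sketch reuses the pipeline setup above (the exact numbers depend on the noise in y):
for alpha in [0.01, 0.1, 1, 10, 100]:
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=10)),
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=alpha))
    ])
    pipeline.fit(X, y)
    w = pipeline.named_steps['ridge'].coef_
    print(f"alpha = {alpha:6.2f}: sum of squared weights = {np.sum(w ** 2):10.2f}")
The weights shrink steadily as α increases, but none of them ever become exactly zero.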

L1 Regularization (Lasso)

Add the sum of absolute weights to the loss:

$$\text{Loss}_{\text{Lasso}} = \text{MSE} + \lambda \sum_{j=1}^{p} |w_j|$$

Effect: Pushes weights toward zero, and some become exactly zero. Creates sparse models!
from sklearn.linear_model import Lasso

# Compare Ridge vs Lasso on feature selection
from sklearn.datasets import make_regression

# Create data with many irrelevant features
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=42)

# Fit both
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)

ridge.fit(X, y)
lasso.fit(X, y)

# Compare coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].bar(range(20), ridge.coef_)
axes[0].set_title('Ridge Coefficients (all non-zero)')
axes[0].set_xlabel('Feature')
axes[0].axhline(y=0, color='r', linestyle='--')

axes[1].bar(range(20), lasso.coef_)
axes[1].set_title('Lasso Coefficients (many are exactly 0!)')
axes[1].set_xlabel('Feature')
axes[1].axhline(y=0, color='r', linestyle='--')

plt.tight_layout()
plt.show()

print(f"Ridge: {np.sum(ridge.coef_ != 0)} non-zero coefficients")
print(f"Lasso: {np.sum(lasso.coef_ != 0)} non-zero coefficients")

Elastic Net: Best of Both Worlds

Combine L1 and L2:

$$\text{Loss}_{\text{ElasticNet}} = \text{MSE} + \lambda_1 \sum_{j=1}^{p} |w_j| + \lambda_2 \sum_{j=1}^{p} w_j^2$$
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)  # 50% L1, 50% L2
elastic.fit(X, y)

print(f"Elastic Net: {np.sum(elastic.coef_ != 0)} non-zero coefficients")

When to Use Which?

Method       | Use Case
Ridge (L2)   | Many small features all contribute
Lasso (L1)   | Feature selection, want a sparse model
Elastic Net  | Many correlated features
Math Connection: The L2 penalty is the squared Euclidean norm of the weight vector; the L1 penalty is its Manhattan (taxicab) norm.
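A two-line sketch of that connection (w is just an arbitrary example vector):
w = np.array([3.0, -2.0, 0.5])
print("L2 (Euclidean) norm:", np.linalg.norm(w, ord=2))   # sqrt(3^2 + 2^2 + 0.5^2)
print("L1 (Manhattan) norm:", np.linalg.norm(w, ord=1))   # |3| + |-2| + |0.5|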

Regularization in Classification

For logistic regression:
from sklearn.linear_model import LogisticRegression

# C is the inverse of regularization strength (smaller C = more regularization)
models = {
    'No regularization': LogisticRegression(penalty=None, max_iter=1000),
    'L2 (Ridge)': LogisticRegression(penalty='l2', C=1.0, max_iter=1000),
    'L1 (Lasso)': LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=1000),
    'Elastic Net': LogisticRegression(penalty='elasticnet', C=1.0, l1_ratio=0.5, solver='saga', max_iter=1000)
}

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Standardize features so the saga solver converges within max_iter
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

for name, model in models.items():
    scores = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=5)
    print(f"{name:20s}: {scores.mean():.4f} (+/- {scores.std():.4f})")

Regularization in Tree-Based Models

Trees don’t use L1/L2, but they have their own regularization:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Tree regularization parameters
tree = DecisionTreeClassifier(
    max_depth=5,           # Limit tree depth
    min_samples_split=10,  # Min samples to split
    min_samples_leaf=5,    # Min samples in leaf
    max_features='sqrt'    # Random subset of features
)

# Random Forest adds more regularization through bagging
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=2,
    max_features='sqrt'
)

# Gradient Boosting has learning rate as regularization
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,    # Smaller = more regularization
    max_depth=3,
    subsample=0.8         # Random sample of data
)
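To check that these knobs actually regularize, a quick sketch comparing an unconstrained tree (grown until every leaf is pure) against the depth-limited tree defined above, on the breast-cancer data already loaded:
deep_tree = DecisionTreeClassifier(random_state=42)   # no limits: grows until leaves are pure
scores_deep = cross_val_score(deep_tree, X, y, cv=5)
scores_reg = cross_val_score(tree, X, y, cv=5)
print(f"Unconstrained tree: {scores_deep.mean():.4f}")
print(f"Regularized tree:   {scores_reg.mean():.4f}")
The constrained tree fits the training data less closely; whether it wins on cross-validation depends on the dataset and the chosen limits.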

Dropout: Regularization for Neural Networks

Randomly “turn off” neurons during training:
import torch.nn as nn

class RegularizedNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(100, 256),
            nn.ReLU(),
            nn.Dropout(0.3),      # 30% dropout
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 10)
        )
    
    def forward(self, x):
        return self.layers(x)
Why it works: Because any neuron can be dropped on a given forward pass, the network cannot rely on any single neuron and has to learn redundant, more robust features. Dropout is active only during training; model.eval() disables it at inference.
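A quick sketch of that train/eval difference: with dropout active, two forward passes on the same input differ; in eval mode they are identical.
import torch

net = RegularizedNetwork()
x = torch.randn(1, 100)

net.train()                                   # dropout active
print(torch.allclose(net(x), net(x)))         # usually False: different neurons dropped each pass

net.eval()                                    # dropout disabled
with torch.no_grad():
    print(torch.allclose(net(x), net(x)))     # True: deterministic at inference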

Early Stopping

Stop training when validation performance stops improving:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
    max_iter=1000
)

# In PyTorch: a manual early-stopping loop (assumes model, train_loader, val_loader,
# and the train_one_epoch / evaluate helpers are defined elsewhere)
import torch

best_val_loss = float('inf')
patience = 10
patience_counter = 0

for epoch in range(1000):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        # Save best model
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            break

Data Augmentation

Create more training examples by transforming existing ones:
# For images
from torchvision import transforms

augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomCrop(224, padding=4)
])

# For tabular data: add noise
def augment_tabular(X, noise_level=0.01):
    noise = np.random.randn(*X.shape) * noise_level
    return X + noise
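A usage sketch: stack a noisy copy onto the original matrix to double the tabular training set (the right noise_level depends on the feature scales, so scale or tune it for real data):
X_aug = np.vstack([X, augment_tabular(X, noise_level=0.05)])
y_aug = np.concatenate([y, y])
print(X_aug.shape, y_aug.shape)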

Cross-Validation for Choosing λ

from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.datasets import make_regression

# Back to regression data with many irrelevant features (same setup as the Lasso example)
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=42)

# RidgeCV automatically finds the best alpha from the given grid
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100], cv=5)
ridge_cv.fit(X, y)
print(f"Best alpha: {ridge_cv.alpha_}")

# LassoCV does the same for Lasso
lasso_cv = LassoCV(alphas=[0.01, 0.1, 1, 10], cv=5)
lasso_cv.fit(X, y)
print(f"Best alpha: {lasso_cv.alpha_}")

Regularization Summary

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
import pandas as pd

# Compare all regularization methods
models = {
    'Linear Regression': LinearRegression(),
    'Ridge (α=0.1)': Ridge(alpha=0.1),
    'Ridge (α=1.0)': Ridge(alpha=1.0),
    'Ridge (α=10)': Ridge(alpha=10),
    'Lasso (α=0.1)': Lasso(alpha=0.1),
    'Lasso (α=1.0)': Lasso(alpha=1.0),
    'Elastic Net': ElasticNet(alpha=1.0, l1_ratio=0.5)
}

results = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    results.append({
        'Model': name,
        'MSE': -scores.mean(),
        'Std': scores.std()
    })

df = pd.DataFrame(results)
print(df.to_string(index=False))

The Bias-Variance Tradeoff

Regularization is really about balancing:
                 | Low Regularization | High Regularization
Bias             | Low                | High
Variance         | High               | Low
Training Error   | Low                | High
Test Error       | High (overfit)     | High (underfit)

The sweet spot is in between: just right!
The goal: Find the regularization strength that minimizes test error, not training error. Use cross-validation!
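One way to see this tradeoff is to plot training and validation error across a range of regularization strengths. Here is a sketch using scikit-learn's validation_curve with Ridge on the current X and y (the curve shapes depend on the data):
from sklearn.model_selection import validation_curve

alphas = np.logspace(-3, 3, 13)
train_scores, val_scores = validation_curve(
    Ridge(), X, y, param_name='alpha', param_range=alphas,
    cv=5, scoring='neg_mean_squared_error'
)

plt.semilogx(alphas, -train_scores.mean(axis=1), label='Training MSE')
plt.semilogx(alphas, -val_scores.mean(axis=1), label='Validation MSE')
plt.xlabel('alpha (regularization strength)')
plt.ylabel('MSE')
plt.legend()
plt.show()
Training error keeps falling as alpha shrinks, while validation error is minimized somewhere in the middle.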

🚀 Mini Projects


Project 1: Regularization Comparison

Compare Ridge, Lasso, and Elastic Net regularization on the same dataset.

Project 2: Feature Selection with Lasso

Use Lasso regularization to automatically select the most important features.

Project 3: Overfitting Simulator

Visualize how regularization prevents overfitting.

Project 4: Optimal Lambda Finder

Systematically find the best regularization strength using cross-validation.

Key Takeaways

Penalize Complexity

Add weight penalty to the loss function

L2 = Small Weights

Ridge shrinks all weights, none become zero

L1 = Zero Weights

Lasso creates sparse models, selects features

Cross-Validate λ

Always use CV to find the right regularization strength

What’s Next?

You now have a complete ML toolkit! Let’s see how to save, load, and deploy your models.

Continue to Module 14: Model Deployment

Learn how to save models and deploy them for real-world use