Regularization: The Art of Keeping Models Simple

The Overfitting Problem

Remember: a model that memorizes training data is useless on new data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Simple data with noise
np.random.seed(42)
X = np.linspace(0, 1, 15).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.randn(15) * 0.3

# Fit polynomials of different degrees
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, degree in zip(axes, [1, 5, 14]):
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X)
    
    model = LinearRegression()
    model.fit(X_poly, y)
    
    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    X_test_poly = poly.transform(X_test)
    y_pred = model.predict(X_test_poly)
    
    ax.scatter(X, y, color='blue', label='Training data')
    ax.plot(X_test, y_pred, color='red', label=f'Degree {degree}')
    ax.set_title(f'Degree {degree} Polynomial')
    ax.legend()
    ax.set_ylim(-1, 5)

plt.tight_layout()
plt.show()
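  • Degree 1: A straight line. This data was generated from a line, so it captures the true trend; on curved data the same model would be too simple (underfitting)
  • Degree 5: More flexible than the data needs, so it starts bending toward the noise
  • Degree 14: Wiggles through every point (overfitting)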
What Is Regularization?

Core idea: Penalize complexity! Instead of just minimizing prediction error, we minimize:
$$\text{Loss} = \text{Prediction Error} + \lambda \times \text{Model Complexity}$$
where $\lambda$ is the regularization strength. Trade-off:
  • λ = 0: No regularization, risk overfitting
  • λ → ∞: Maximum regularization; every weight is pushed to zero, so the model predicts the mean
  • λ = just right: Balance fit and complexity
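The penalized loss above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming complexity is measured as the sum of squared weights (one common choice); the function name `penalized_loss` and the toy data are hypothetical.

```python
import numpy as np

def penalized_loss(w, X, y, lam):
    """Prediction error (MSE) plus lam times a complexity term."""
    prediction_error = np.mean((X @ w - y) ** 2)
    complexity = np.sum(w ** 2)  # assumption: complexity = sum of squared weights
    return prediction_error + lam * complexity

# Two identical feature columns, so many weight vectors fit the data equally well
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

w_small = np.array([1.0, 1.0])    # fits exactly with small weights
w_large = np.array([5.0, -3.0])   # fits exactly with large, cancelling weights

for lam in [0.0, 0.1]:
    print(f"lam={lam}: small={penalized_loss(w_small, X, y, lam):.2f}, "
          f"large={penalized_loss(w_large, X, y, lam):.2f}")
```

With λ = 0 the two weight vectors are indistinguishable because both fit perfectly; once λ is positive, the penalty breaks the tie in favor of the smaller weights.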

L2 Regularization (Ridge)

Add the sum of squared weights to the loss:
$$\text{Loss}_{\text{Ridge}} = \text{MSE} + \lambda \sum_{j=1}^{p} w_j^2$$
Effect: Pushes weights toward zero, but never exactly zero. Creates “small” weights.
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# High-degree polynomial with Ridge regularization
X = np.linspace(0, 1, 15).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.randn(15) * 0.3

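# Try different regularization strengths (alpha is scikit-learn's name for λ).
# alpha=0 is plain least squares; scikit-learn advises LinearRegression for that case.
alphas = [0, 0.1, 1, 10]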

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for ax, alpha in zip(axes, alphas):
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=10)),
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=alpha))
    ])
    
    pipeline.fit(X, y)
    
    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    y_pred = pipeline.predict(X_test)
    
    ax.scatter(X, y, color='blue')
    ax.plot(X_test, y_pred, color='red')
    ax.set_title(f'Ridge (α = {alpha})')
    ax.set_ylim(-1, 5)

plt.tight_layout()
plt.show()
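To see the shrinking effect numerically rather than visually, here is a minimal NumPy sketch of ridge regression solved in closed form. It assumes no intercept and uses the sum of squared errors (rather than the mean) as the error term, so `lam` here is not on the same scale as λ in the formula above; the helper name `ridge_closed_form` and the synthetic data are illustrative only.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 by solving (X^T X + lam*I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_w + rng.normal(scale=0.5, size=50)

lams = [0.0, 1.0, 10.0, 100.0]
weights = {lam: ridge_closed_form(X, y, lam) for lam in lams}
norms = [float(np.linalg.norm(weights[lam])) for lam in lams]

for lam, norm in zip(lams, norms):
    print(f"lam={lam:>6}: ||w|| = {norm:.3f}, weights = {np.round(weights[lam], 3)}")
```

The weight norm drops as `lam` grows, yet every coefficient stays non-zero, which is the behavior described above.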

L1 Regularization (Lasso)

Add the sum of absolute weights to the loss:
$$\text{Loss}_{\text{Lasso}} = \text{MSE} + \lambda \sum_{j=1}^{p} |w_j|$$
Effect: Pushes weights toward zero, and some become exactly zero. Creates sparse models!
from sklearn.linear_model import Lasso

# Compare Ridge vs Lasso on feature selection
from sklearn.datasets import make_regression

# Create data with many irrelevant features
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=42)

# Fit both
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)

ridge.fit(X, y)
lasso.fit(X, y)

# Compare coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].bar(range(20), ridge.coef_)
axes[0].set_title('Ridge Coefficients (all non-zero)')
axes[0].set_xlabel('Feature')
axes[0].axhline(y=0, color='r', linestyle='--')

axes[1].bar(range(20), lasso.coef_)
axes[1].set_title('Lasso Coefficients (many are exactly 0!)')
axes[1].set_xlabel('Feature')
axes[1].axhline(y=0, color='r', linestyle='--')

plt.tight_layout()
plt.show()

print(f"Ridge: {np.sum(ridge.coef_ != 0)} non-zero coefficients")
print(f"Lasso: {np.sum(lasso.coef_ != 0)} non-zero coefficients")
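One way to see why L1 produces exact zeros is the soft-thresholding operator. In the special case of orthonormal features, the Lasso solution is the least-squares coefficients passed through this operator, while Ridge only rescales them. The sketch below assumes that special case; the threshold and scaling constants depend on how the loss is scaled, and the function names are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """L1-style shrinkage: move each value toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, lam):
    """L2-style shrinkage: rescale each value proportionally."""
    return z / (1.0 + lam)

least_squares_coefs = np.array([3.0, -0.5, 0.2, -2.0])  # hypothetical unregularized fit

lasso_like = soft_threshold(least_squares_coefs, 1.0)
ridge_like = ridge_shrink(least_squares_coefs, 1.0)

print("L1-style:", lasso_like)
print("L2-style:", ridge_like)
```

Coefficients smaller in magnitude than the threshold are set exactly to zero under the L1-style rule, whereas the L2-style rule halves everything and leaves nothing at zero.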

Elastic Net: Best of Both Worlds

Combine L1 and L2:
$$\text{Loss}_{\text{ElasticNet}} = \text{MSE} + \lambda_1 \sum_{j=1}^{p} |w_j| + \lambda_2 \sum_{j=1}^{p} w_j^2$$
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)  # 50% L1, 50% L2
elastic.fit(X, y)

print(f"Elastic Net: {np.sum(elastic.coef_ != 0)} non-zero coefficients")
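scikit-learn's `ElasticNet` does not take λ₁ and λ₂ directly; it takes `alpha` and `l1_ratio`. Its documented objective is `1/(2n) * ||y - Xw||² + alpha * l1_ratio * ||w||₁ + 0.5 * alpha * (1 - l1_ratio) * ||w||²`, so the two penalty weights can be recovered as in the sketch below. The helper name `elastic_net_penalties` is hypothetical, and the resulting values are relative to scikit-learn's `1/(2n)` error term, not plain MSE.

```python
def elastic_net_penalties(alpha, l1_ratio):
    """Map scikit-learn's (alpha, l1_ratio) to the (lambda_1, lambda_2) pair."""
    lambda_1 = alpha * l1_ratio                 # weight on sum of |w_j|
    lambda_2 = 0.5 * alpha * (1.0 - l1_ratio)   # weight on sum of w_j^2
    return lambda_1, lambda_2

print(elastic_net_penalties(alpha=1.0, l1_ratio=0.5))  # balanced mix
print(elastic_net_penalties(alpha=1.0, l1_ratio=1.0))  # pure L1 (Lasso)
print(elastic_net_penalties(alpha=1.0, l1_ratio=0.0))  # pure L2
```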

When to Use Which?

| Method | Use Case |
| --- | --- |
| Ridge (L2) | Many small features all contribute |
| Lasso (L1) | Feature selection, want sparse model |
| Elastic Net | Many correlated features |
Math Connection: The L2 penalty is the squared Euclidean norm of the weight vector, $\|w\|_2^2 = \sum_j w_j^2$. The L1 penalty is the Manhattan norm, $\|w\|_1 = \sum_j |w_j|$.
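As a quick check of the norm connection, the penalties can be computed directly with NumPy. The weight vector here is an arbitrary example.

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])  # example weight vector

l1_penalty = np.sum(np.abs(w))      # Manhattan norm
l2_penalty = np.sum(w ** 2)         # squared Euclidean norm
euclidean_norm = np.linalg.norm(w)  # Euclidean norm (square root of the L2 penalty)

print(l1_penalty, l2_penalty, euclidean_norm)
```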

Regularization in Classification

For logistic regression:
from sklearn.linear_model import LogisticRegression

# C is the inverse of regularization strength (smaller C = more regularization)
models = {
    'No regularization': LogisticRegression(penalty=None, max_iter=1000),
    'L2 (Ridge)': LogisticRegression(penalty='l2', C=1.0, max_iter=1000),
    'L1 (Lasso)': LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=1000),
    'Elastic Net': LogisticRegression(penalty='elasticnet', C=1.0, l1_ratio=0.5, solver='saga', max_iter=1000)
}

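from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

for name, model in models.items():
    # Standardize inside the pipeline so each CV fold is scaled on its own training split.
    # Penalties treat all coefficients alike, and saga converges slowly on unscaled features.
    scaled_model = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(scaled_model, X, y, cv=5)
    print(f"{name:20s}: {scores.mean():.4f} (+/- {scores.std():.4f})")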
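Because `C` is the inverse of regularization strength, lowering it should make an L1-penalized model sparser. The sketch below is one way to check that on the same breast cancer dataset; the choice of the `liblinear` solver and the grid of `C` values are assumptions made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # put features on a comparable scale

nonzero_counts = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear', max_iter=1000)
    clf.fit(X_scaled, y)
    # Count coefficients that survived the L1 penalty
    nonzero_counts[C] = int((clf.coef_ != 0).sum())
    print(f"C={C:>5}: {nonzero_counts[C]} of {X.shape[1]} coefficients are non-zero")
```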

Regularization in Tree-Based Models

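The tree models shown here don’t take an L1/L2 weight penalty; they are regularized by limiting how far each tree can grow and how much of the data and features it sees: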
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Tree regularization parameters
tree = DecisionTreeClassifier(
    max_depth=5,           # Limit tree depth
    min_samples_split=10,  # Min samples to split
    min_samples_leaf=5,    # Min samples in leaf
    max_features='sqrt'    # Random subset of features
)

# Random Forest adds more regularization through bagging
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=2,
    max_features='sqrt'
)

# Gradient Boosting has learning rate as regularization
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,    # Smaller = more regularization
    max_depth=3,
    subsample=0.8         # Random sample of data
)
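To make the effect of these parameters concrete, the following sketch compares an unrestricted tree with a restricted one on the breast cancer dataset. The specific limits (`max_depth=3`, `min_samples_leaf=5`) and the train/test split are arbitrary choices for illustration, and the accuracy numbers will vary with them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

unrestricted = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
restricted = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_train, y_train)

for name, tree in [("unrestricted", unrestricted), ("restricted", restricted)]:
    print(f"{name:12s}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"train acc={tree.score(X_train, y_train):.3f}, "
          f"test acc={tree.score(X_test, y_test):.3f}")
```

The restricted tree is structurally simpler by construction; whether it also generalizes better depends on the data, which is why these limits are usually tuned with cross-validation.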

Dropout: Regularization for Neural Networks

Randomly “turn off” neurons during training:
import torch.nn as nn

class RegularizedNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(100, 256),
            nn.ReLU(),
            nn.Dropout(0.3),      # 30% dropout
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 10)
        )
    
    def forward(self, x):
        return self.layers(x)
Why it works: Forces the network to not rely on any single neuron. Creates redundancy.
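The mechanics can be sketched without a deep learning framework. The NumPy version below uses "inverted" dropout, the same scheme `nn.Dropout` applies: during training each unit is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`, and at evaluation time the layer does nothing. The function name and signature are illustrative, not PyTorch's API.

```python
import numpy as np

def dropout(activations, p, training, rng):
    """Inverted dropout: zero units with probability p, rescale the rest by 1/(1-p)."""
    if not training or p == 0.0:
        return activations  # evaluation mode: pass activations through unchanged
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones(100_000)

train_out = dropout(activations, p=0.3, training=True, rng=rng)
eval_out = dropout(activations, p=0.3, training=False, rng=rng)

print(f"fraction zeroed in training: {(train_out == 0).mean():.3f}")
print(f"mean activation in training: {train_out.mean():.3f}")
print(f"mean activation in eval:     {eval_out.mean():.3f}")
```

The rescaling keeps the expected activation the same in both modes. In PyTorch, switching modes is done with `model.train()` and `model.eval()`.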

Early Stopping

Stop training when validation performance stops improving:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
    max_iter=1000
)

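# In PyTorch (model, train_loader, val_loader, train_one_epoch, and evaluate
# are placeholders for your own training code)
import torch

best_val_loss = float('inf')
patience = 10          # epochs to wait for an improvement before stopping
patience_counter = 0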

for epoch in range(1000):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        # Save best model
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            break
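The patience logic can be tested in isolation by feeding it a list of validation losses. This is a minimal sketch in plain Python; the function `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_val_loss = float('inf')
    best_epoch = None
    patience_counter = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_val_loss:
            # New best: remember it and reset the counter
            best_val_loss = val_loss
            best_epoch = epoch
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering early stopping
    return len(val_losses) - 1, best_epoch

# Validation loss improves until epoch 3, then drifts upward
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

Training stops three epochs after the last improvement, and the checkpoint worth keeping is the one from the best epoch, not the final one.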

Data Augmentation

Create more training examples by transforming existing ones:
# For images
from torchvision import transforms

augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomCrop(224, padding=4)
])

# For tabular data: add noise
def augment_tabular(X, noise_level=0.01):
    noise = np.random.randn(*X.shape) * noise_level
    return X + noise
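A slightly fuller tabular sketch is shown below. It assumes two things not stated above: noise is scaled by each feature's standard deviation so that large-valued and small-valued columns are perturbed proportionally, and the noisy copies are appended to the training set only (never to validation or test data). The function name and defaults are hypothetical.

```python
import numpy as np

def augment_training_set(X_train, y_train, n_copies=2, noise_level=0.05, seed=0):
    """Append n_copies noisy versions of X_train, repeating the labels."""
    rng = np.random.default_rng(seed)
    feature_std = X_train.std(axis=0)  # per-feature scale for the noise
    X_parts, y_parts = [X_train], [y_train]
    for _ in range(n_copies):
        noise = rng.normal(size=X_train.shape) * noise_level * feature_std
        X_parts.append(X_train + noise)
        y_parts.append(y_train)
    return np.vstack(X_parts), np.concatenate(y_parts)

X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
y_train = np.array([0, 1, 0])

X_aug, y_aug = augment_training_set(X_train, y_train)
print(X_aug.shape, y_aug.shape)
```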

Cross-Validation for Choosing λ

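from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

# Regression data again (X, y were last set to the classification dataset above)
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=42)

# RidgeCV automatically finds best alpha
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100], cv=5)
ridge_cv.fit(X, y)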
print(f"Best alpha: {ridge_cv.alpha_}")

# LassoCV does the same for Lasso
lasso_cv = LassoCV(alphas=[0.01, 0.1, 1, 10], cv=5)
lasso_cv.fit(X, y)
print(f"Best alpha: {lasso_cv.alpha_}")
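What `RidgeCV` does can be approximated by hand: fit on k-1 folds, score on the held-out fold, average, and repeat for each candidate strength. The NumPy sketch below assumes ridge without an intercept and uses synthetic data; `ridge_fit` and `cv_mse` are illustrative helpers, not scikit-learn functions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge without an intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, n_folds=5, seed=0):
    """Average validation MSE over n_folds; the same seed gives the same folds for every lam."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    fold_errors = []
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        fold_errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(fold_errors))

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 20))
true_w = np.zeros(20)
true_w[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]  # only 5 informative features
y = X @ true_w + rng.normal(scale=2.0, size=60)

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(X, y, lam) for lam in lams}
best_lam = min(scores, key=scores.get)

for lam in lams:
    print(f"lam={lam:>6}: CV MSE = {scores[lam]:.3f}")
print(f"Best lam: {best_lam}")
```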

Regularization Summary

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
import pandas as pd

# Compare all regularization methods
models = {
    'Linear Regression': LinearRegression(),
    'Ridge (α=0.1)': Ridge(alpha=0.1),
    'Ridge (α=1.0)': Ridge(alpha=1.0),
    'Ridge (α=10)': Ridge(alpha=10),
    'Lasso (α=0.1)': Lasso(alpha=0.1),
    'Lasso (α=1.0)': Lasso(alpha=1.0),
    'Elastic Net': ElasticNet(alpha=1.0, l1_ratio=0.5)
}

results = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    results.append({
        'Model': name,
        'MSE': -scores.mean(),
        'Std': scores.std()
    })

df = pd.DataFrame(results)
print(df.to_string(index=False))

The Bias-Variance Tradeoff

Regularization is really about balancing:
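| | Low Regularization | High Regularization |
| --- | --- | --- |
| Bias | Low | High |
| Variance | High | Low |
| Training Error | Low | High |
| Test Error | High (overfit) | High (underfit) |

Sweet Spot: Just right! It lies between the two extremes.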
The goal: Find the regularization strength that minimizes test error, not training error. Use cross-validation!
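The training-error row of the table can be checked numerically. The sketch below fits closed-form ridge (no intercept) on synthetic data for increasing `lam` and records training and test MSE. Training error can only go up as `lam` increases; where test error bottoms out depends on the data, so treat the printed test numbers as one example rather than a rule.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, n_features = 30, 200, 20
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]  # only 3 features carry signal

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = X_train @ true_w + rng.normal(size=n_train)
y_test = X_test @ true_w + rng.normal(size=n_test)

lams = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
train_errors, test_errors = [], []
for lam in lams:
    # Closed-form ridge: (X^T X + lam*I) w = X^T y
    w = np.linalg.solve(
        X_train.T @ X_train + lam * np.eye(n_features), X_train.T @ y_train
    )
    train_errors.append(float(np.mean((X_train @ w - y_train) ** 2)))
    test_errors.append(float(np.mean((X_test @ w - y_test) ** 2)))
    print(f"lam={lam:>7}: train MSE={train_errors[-1]:.3f}, test MSE={test_errors[-1]:.3f}")
```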

🚀 Mini Projects

Project 1: Regularization Comparison

Compare Ridge, Lasso, and ElasticNet on the same dataset.

Project 2: Feature Selection with Lasso

Use Lasso (L1) regularization to automatically select the most important features.

Project 3: Overfitting Simulator

Visualize how regularization prevents overfitting.

Project 4: Optimal Lambda Finder

Systematically find the best regularization strength using cross-validation.

Key Takeaways

Penalize Complexity

Add weight penalty to the loss function

L2 = Small Weights

Ridge shrinks all weights, none become zero

L1 = Zero Weights

Lasso creates sparse models, selects features

Cross-Validate λ

Always use CV to find the right regularization strength

What’s Next?

You now have a complete ML toolkit! Let’s see how to save, load, and deploy your models.

Continue to Module 14: Model Deployment

Learn how to save models and deploy them for real-world use

Interview Deep-Dive

Similar accuracy is not enough information to make this decision. I would ask several follow-up questions, but here is how I think about it:
  • Interpretability requirements. If stakeholders need to understand which features drive predictions — common in healthcare, finance, and compliance settings — the L1 model wins. Lasso produces sparse coefficients, so you can say “these 7 features matter, the rest do not.” L2 keeps all features active, making the explanation messier.
  • Feature stability over time. L1 models are sensitive to correlated features — if two features are highly correlated, Lasso will arbitrarily pick one and zero out the other. If feature availability or correlation structure changes in production, the L1 model may behave unpredictably. L2 is more stable because it distributes weight across correlated features.
  • Inference cost. If the L1 model zeroed out 80% of features, you only need to compute and transmit 20% of the features at inference time. At scale, this reduces latency and infrastructure cost. For a model serving millions of requests per day, fewer features means real savings.
  • Monitoring burden. Fewer active features (L1) means fewer things to monitor for drift. But it also means a single drifting feature has a bigger impact on predictions.
My default in production: if accuracy is truly similar and feature stability is not a concern, I lean toward L1 for the operational simplicity. But if the feature space has known multicollinearity, I would use ElasticNet or stick with L2.
Follow-up: How does the choice change if you are working in a pipeline where feature computation is expensive?
This strongly favors L1. If computing a feature requires a database query, an API call, or a complex aggregation, every feature you can eliminate from the model is infrastructure you do not need to maintain. I have seen production systems where the feature store was the bottleneck, not the model inference. In that scenario, a Lasso model that uses 10 features versus a Ridge model that uses 100 features can mean the difference between 20ms and 200ms prediction latency, and that is a business-critical difference.
I would use a hiring analogy. Imagine you are building a team to solve a problem:
  • L2 (Ridge) is like keeping everyone on the team but limiting how much each person works. Nobody gets fired, but everyone is told to contribute a little less. The result: a balanced team where everyone does a small part. The upside is stability — if one person calls in sick, others can compensate. The downside is that you are paying salary for people who contribute almost nothing.
  • L1 (Lasso) is like running a layoff based on performance. People who are not contributing get removed entirely. The team gets smaller and more focused. The upside is efficiency and clarity — you know exactly who matters. The downside is that if you fired the wrong person, there is nobody to cover for them.
Then I would connect it to the business context: “For our fraud detection model, I recommend L1 because it will tell us the 5 key signals that predict fraud, which the investigations team can act on. If I used L2, I would hand them a list of 50 factors, each with a tiny contribution, and that is not actionable.”
Follow-up: How do you handle a product manager who says “just use both”?
That is actually a real technique called ElasticNet. It combines both L1 and L2 penalties, and you tune a ratio parameter that controls how much of each you use. I would frame it as: “We can start with a blended approach, and the cross-validation process will automatically find the right balance between keeping all features and selecting the most important ones. The data will tell us the right answer.” This turns a binary decision into a continuous optimization, which is usually the right engineering instinct.
Time series adds several wrinkles that change how I think about regularization:
  • Temporal autocorrelation in features. Many engineered features in time series are lagged versions of each other (lag_1, lag_2, lag_7, etc.). These are highly correlated by construction. Pure L1 regularization will arbitrarily pick one lag and zero out others, which can make the model fragile if the most predictive lag shifts. I would default to ElasticNet here, or use L2 with aggressive feature selection as a separate preprocessing step.
  • Feature importance changes over time. The features that mattered last quarter may not matter this quarter. I would use a sliding-window retraining approach with regularization, and I would monitor whether the set of non-zero features (in L1) or the coefficient magnitudes (in L2) are stable across retraining windows. Large shifts signal regime change.
  • Multicollinearity from rolling statistics. If you engineer rolling_mean_7, rolling_mean_14, and rolling_mean_30, these are naturally correlated. Ridge handles this gracefully by sharing weight. Lasso will unpredictably drop some, which may break the model when the short-term vs long-term pattern changes.
  • The regularization strength should be tuned with TimeSeriesSplit, never random CV. This is critical. If you use random cross-validation to select lambda, you are letting future information influence the regularization choice, which inflates the perceived model quality.
In practice, for 200 time series features, I would first use Lasso to identify the top 30-50 features, then retrain with Ridge or ElasticNet on that reduced set. This two-stage approach gives you the sparsity benefit without the instability risk.
Follow-up: How would you detect if your regularization is masking a data leakage problem in the time series features?
The telltale sign is when an unregularized model performs almost as well on the test set as a regularized one. In a legitimate scenario with 200 features and noise, unregularized models should overfit badly. If they do not, it likely means some feature is leaking future information so strongly that even a simple model can exploit it. I would check feature importance and look for any feature with disproportionately high weight, especially any feature derived from rolling windows or aggregations that might accidentally include future timestamps.
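The point about tuning regularization strength with time-ordered splits rather than random CV can be sketched as follows. This is a hand-rolled expanding-window splitter in the spirit of scikit-learn's `TimeSeriesSplit`, not a reimplementation of it; the function name and the equal-sized folds are simplifying assumptions.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs where validation always follows training in time."""
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        val_end = train_end + fold_size
        yield np.arange(0, train_end), np.arange(train_end, val_end)

splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for train_idx, val_idx in splits:
    print(f"train: {train_idx.tolist()}  validate: {val_idx.tolist()}")
```

Each candidate λ would be scored on these validation windows only, so no observation from the future is ever used to fit a model that is evaluated on the past.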