Regularization: The Art of Keeping Models Simple
The Overfitting Problem
Remember: a model that memorizes training data is useless on new data.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Simple data with noise
np.random.seed(42)
X = np.linspace(0, 1, 15).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.randn(15) * 0.3

# Fit polynomials of different degrees
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, degree in zip(axes, [1, 5, 14]):
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X)

    model = LinearRegression()
    model.fit(X_poly, y)

    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    X_test_poly = poly.transform(X_test)
    y_pred = model.predict(X_test_poly)

    ax.scatter(X, y, color='blue', label='Training data')
    ax.plot(X_test, y_pred, color='red', label=f'Degree {degree}')
    ax.set_title(f'Degree {degree} Polynomial')
    ax.legend()
    ax.set_ylim(-1, 5)

plt.tight_layout()
plt.show()
```
- **Degree 1**: Too simple (underfitting)
- **Degree 5**: Just right
- **Degree 14**: Wiggles through every point (overfitting)
What Is Regularization?
**Core idea**: penalize complexity!
Instead of just minimizing prediction error, we minimize:
$$\text{Loss} = \text{Prediction Error} + \lambda \times \text{Model Complexity}$$

where $\lambda$ is the regularization strength.
Trade-off:
- λ = 0: no regularization, risk of overfitting
- λ = ∞: maximum regularization, the model just predicts the mean
- λ just right: balances fit and complexity
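To make the trade-off concrete, here is a minimal sketch with made-up numbers: a fixed prediction error, a hypothetical weight vector, and an L2 penalty standing in for the complexity term.

```python
import numpy as np

mse = 1.2                          # hypothetical prediction error
w = np.array([0.5, -2.0, 3.0])     # hypothetical model weights

for lam in [0.0, 0.1, 1.0, 10.0]:
    complexity = np.sum(w ** 2)    # L2 penalty as the complexity measure
    loss = mse + lam * complexity
    print(f"lambda = {lam:5.1f} -> penalized loss = {loss:.2f}")
```

The larger λ gets, the more the penalty dominates the loss, which is exactly why the optimizer responds by shrinking the weights.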
L2 Regularization (Ridge)
Add the sum of squared weights to the loss:
$$\text{Loss}_{\text{Ridge}} = \text{MSE} + \lambda \sum_{j=1}^{p} w_j^2$$
**Effect**: pushes weights toward zero, but never exactly to zero. The result is many small weights.
```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# High-degree polynomial with Ridge regularization
X = np.linspace(0, 1, 15).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.randn(15) * 0.3

# Try different regularization strengths
alphas = [0, 0.1, 1, 10]
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for ax, alpha in zip(axes, alphas):
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=10)),
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=alpha))
    ])
    pipeline.fit(X, y)

    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    y_pred = pipeline.predict(X_test)

    ax.scatter(X, y, color='blue')
    ax.plot(X_test, y_pred, color='red')
    ax.set_title(f'Ridge (α = {alpha})')
    ax.set_ylim(-1, 5)

plt.tight_layout()
plt.show()
```
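To see the shrinkage directly rather than just the smoother curves, here is a small sketch that reuses `X`, `y`, and the pipeline definition above and prints how the polynomial coefficients shrink as alpha grows:

```python
# Refit the same pipeline at several alphas and inspect the learned weights
for alpha in [0.001, 0.1, 1, 10]:
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=10)),
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=alpha))
    ])
    pipeline.fit(X, y)
    w = pipeline.named_steps['ridge'].coef_
    print(f"alpha = {alpha:6.3f}  max |w| = {np.max(np.abs(w)):8.3f}  "
          f"sum w^2 = {np.sum(w ** 2):10.3f}")
```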
L1 Regularization (Lasso)
Add the sum of absolute weights to the loss:
$$\text{Loss}_{\text{Lasso}} = \text{MSE} + \lambda \sum_{j=1}^{p} |w_j|$$
**Effect**: pushes weights toward zero, and some become *exactly* zero. This creates sparse models!
```python
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# Compare Ridge vs Lasso on feature selection:
# create data with many irrelevant features
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=42)

# Fit both
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)
ridge.fit(X, y)
lasso.fit(X, y)

# Compare coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].bar(range(20), ridge.coef_)
axes[0].set_title('Ridge Coefficients (all non-zero)')
axes[0].set_xlabel('Feature')
axes[0].axhline(y=0, color='r', linestyle='--')

axes[1].bar(range(20), lasso.coef_)
axes[1].set_title('Lasso Coefficients (many are exactly 0!)')
axes[1].set_xlabel('Feature')
axes[1].axhline(y=0, color='r', linestyle='--')

plt.tight_layout()
plt.show()

print(f"Ridge: {np.sum(ridge.coef_ != 0)} non-zero coefficients")
print(f"Lasso: {np.sum(lasso.coef_ != 0)} non-zero coefficients")
```
Elastic Net: Best of Both Worlds
Combine L1 and L2:
$$\text{Loss}_{\text{ElasticNet}} = \text{MSE} + \lambda_1 \sum |w_j| + \lambda_2 \sum w_j^2$$
```python
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)  # 50% L1, 50% L2
elastic.fit(X, y)
print(f"Elastic Net: {np.sum(elastic.coef_ != 0)} non-zero coefficients")
```
When to Use Which?
| Method | Use Case |
| --- | --- |
| Ridge (L2) | Many small features that all contribute |
| Lasso (L1) | Feature selection; you want a sparse model |
| Elastic Net | Many correlated features |
**Math connection**: the L2 penalty is the squared Euclidean norm of the weight vector, while the L1 penalty is its Manhattan (L1) norm.
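A tiny sketch with a made-up weight vector confirms the correspondence:

```python
w = np.array([3.0, -4.0, 0.0])                       # hypothetical weight vector
print(np.sum(w ** 2), np.linalg.norm(w) ** 2)        # L2 penalty = squared Euclidean norm (25.0)
print(np.sum(np.abs(w)), np.linalg.norm(w, ord=1))   # L1 penalty = Manhattan norm (7.0)
```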
Regularization in Classification
For logistic regression:
```python
from sklearn.linear_model import LogisticRegression

# C is the inverse of the regularization strength (smaller C = more regularization)
models = {
    'No regularization': LogisticRegression(penalty=None, max_iter=1000),
    'L2 (Ridge)': LogisticRegression(penalty='l2', C=1.0, max_iter=1000),
    'L1 (Lasso)': LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=1000),
    'Elastic Net': LogisticRegression(penalty='elasticnet', C=1.0, l1_ratio=0.5,
                                      solver='saga', max_iter=1000)
}
```
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```
Regularization in Tree-Based Models
Trees don’t use L1/L2, but they have their own regularization:
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Tree regularization parameters
tree = DecisionTreeClassifier(
    max_depth=5,              # Limit tree depth
    min_samples_split=10,     # Min samples to split
    min_samples_leaf=5,       # Min samples in leaf
    max_features='sqrt'       # Random subset of features
)

# Random Forest adds more regularization through bagging
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=2,
    max_features='sqrt'
)

# Gradient Boosting has learning rate as regularization
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,        # Smaller = more regularization
    max_depth=3,
    subsample=0.8             # Random sample of the data
)
```
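None of the constructors above are actually fit to data, so here is a quick sketch (using the breast cancer data loaded earlier and an arbitrary train/test split) of how limiting depth typically narrows the gap between training and test accuracy:

```python
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# An unconstrained tree versus a depth-limited (regularized) one
for depth in [None, 3]:
    t = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train = {t.score(X_tr, y_tr):.3f}, "
          f"test = {t.score(X_te, y_te):.3f}")
```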
Dropout: Regularization for Neural Networks
Randomly “turn off” neurons during training:
```python
import torch.nn as nn

class RegularizedNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(100, 256),
            nn.ReLU(),
            nn.Dropout(0.3),    # 30% dropout
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.layers(x)
```
**Why it works**: the network cannot rely on any single neuron, so it is forced to learn redundant representations.
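One detail worth seeing in code: dropout is only active in training mode. A minimal sketch (reusing the `nn` import above):

```python
import torch

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()        # training mode: roughly half the entries are zeroed,
print(drop(x))      # and the survivors are scaled by 1/(1-p) = 2

drop.eval()         # evaluation mode: dropout is a no-op
print(drop(x))      # identical to the input
```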
Early Stopping
Stop training when validation performance stops improving:
```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
    max_iter=1000
)
```

```python
# In PyTorch (model, train_one_epoch, evaluate, and the data loaders
# are assumed to be defined elsewhere)
import torch

best_val_loss = float('inf')
patience = 10
patience_counter = 0

for epoch in range(1000):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        # Save the best model so far
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            break
```
Data Augmentation
Create more training examples by transforming existing ones:
```python
# For images
from torchvision import transforms

augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomCrop(224, padding=4)
])

# For tabular data: add a little Gaussian noise
def augment_tabular(X, noise_level=0.01):
    noise = np.random.randn(*X.shape) * noise_level
    return X + noise
```
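One possible way to use the tabular helper (with hypothetical arrays) is to stack noisy copies onto the original training set:

```python
X_demo = np.random.randn(100, 5)            # hypothetical feature matrix
y_demo = np.random.randint(0, 2, size=100)  # hypothetical labels

X_big = np.vstack([X_demo, augment_tabular(X_demo)])
y_big = np.concatenate([y_demo, y_demo])    # noisy copies keep their original labels
print(X_big.shape, y_big.shape)             # (200, 5) (200,)
```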
Cross-Validation for Choosing λ
```python
from sklearn.linear_model import RidgeCV, LassoCV

# RidgeCV automatically finds the best alpha
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100], cv=5)
ridge_cv.fit(X, y)
print(f"Best alpha: {ridge_cv.alpha_}")

# LassoCV does the same for Lasso
lasso_cv = LassoCV(alphas=[0.01, 0.1, 1, 10], cv=5)
lasso_cv.fit(X, y)
print(f"Best alpha: {lasso_cv.alpha_}")
```
Regularization Summary
```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
import pandas as pd

# Compare all regularization methods
models = {
    'Linear Regression': LinearRegression(),
    'Ridge (α=0.1)': Ridge(alpha=0.1),
    'Ridge (α=1.0)': Ridge(alpha=1.0),
    'Ridge (α=10)': Ridge(alpha=10),
    'Lasso (α=0.1)': Lasso(alpha=0.1),
    'Lasso (α=1.0)': Lasso(alpha=1.0),
    'Elastic Net': ElasticNet(alpha=1.0, l1_ratio=0.5)
}

results = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    results.append({
        'Model': name,
        'MSE': -scores.mean(),
        'Std': scores.std()
    })

df = pd.DataFrame(results)
print(df.to_string(index=False))
```
The Bias-Variance Tradeoff
Regularization is really about balancing:
| | Low Regularization | High Regularization |
| --- | --- | --- |
| Bias | Low | High |
| Variance | High | Low |
| Training Error | Low | High |
| Test Error | High (overfit) | High (underfit) |

The sweet spot sits between the two extremes: just right.

**The goal**: find the regularization strength that minimizes *test* error, not training error. Use cross-validation!
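One way to locate that sweet spot is scikit-learn's `validation_curve`, sketched here on synthetic data (the dataset and alpha grid are arbitrary; swap in your own `X`, `y`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

X_vc, y_vc = make_regression(n_samples=200, n_features=30, n_informative=5,
                             noise=20, random_state=0)
alphas = np.logspace(-3, 3, 13)

# Score each alpha on held-out folds; higher neg-MSE means lower error
train_scores, val_scores = validation_curve(
    Ridge(), X_vc, y_vc, param_name='alpha', param_range=alphas,
    cv=5, scoring='neg_mean_squared_error')

best_alpha = alphas[val_scores.mean(axis=1).argmax()]
print(f"Alpha with the lowest cross-validated MSE: {best_alpha:.3g}")
```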
🚀 Mini Projects
Project 1: Regularization Comparison
Compare Ridge, Lasso, and Elastic Net on the same dataset.

Project 2: Feature Selection with Lasso
Use L1 regularization to automatically select the most important features.

Project 3: Overfitting Simulator
Visualize how regularization prevents overfitting.

Project 4: Optimal Lambda Finder
Systematically find the best regularization strength using cross-validation.
Key Takeaways
- **Penalize complexity**: add a weight penalty to the loss function.
- **L2 = small weights**: Ridge shrinks all weights; none become zero.
- **L1 = zero weights**: Lasso creates sparse models and selects features.
- **Cross-validate λ**: always use CV to find the right regularization strength.
What’s Next?
You now have a complete ML toolkit! Let’s see how to save, load, and deploy your models.
Continue to Module 14: Model Deployment to learn how to save models and deploy them for real-world use.