# Regularization

> Fight overfitting - keep your model honest

# Regularization: The Art of Keeping Models Simple

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/regularization-concept.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=a72f03159ee6547e4b82c10bbd62e2ea" alt="L1 L2 and Dropout Regularization" width="1080" height="1080" data-path="images/courses/ml-mastery/regularization-concept.svg" />
</Frame>

## The Overfitting Problem

Remember: a model that memorizes training data is useless on new data.

```python theme={null}
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Simple data with noise
np.random.seed(42)
X = np.linspace(0, 1, 15).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.randn(15) * 0.3

# Fit polynomials of different degrees
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, degree in zip(axes, [1, 5, 14]):
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X)
    
    model = LinearRegression()
    model.fit(X_poly, y)
    
    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    X_test_poly = poly.transform(X_test)
    y_pred = model.predict(X_test_poly)
    
    ax.scatter(X, y, color='blue', label='Training data')
    ax.plot(X_test, y_pred, color='red', label=f'Degree {degree}')
    ax.set_title(f'Degree {degree} Polynomial')
    ax.legend()
    ax.set_ylim(-1, 5)

plt.tight_layout()
plt.show()
```

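* **Degree 1**: A straight line. This data comes from a linear function plus noise, so the line captures the real trend (on curved data, degree 1 would underfit)
* **Degree 5**: Flexible enough to start bending toward the noise
* **Degree 14**: Wiggles through every point (overfitting)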

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/regularization-real-world.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=ea807a450433eb39ee09b5df0015ea12" alt="GPT Training Regularization at OpenAI" width="1080" height="1080" data-path="images/courses/ml-mastery/regularization-real-world.svg" />
</Frame>

***

## What Is Regularization?

**Core idea**: Penalize complexity!

Instead of just minimizing prediction error, we minimize:

$$
Loss = \text{Prediction Error} + \lambda \times \text{Model Complexity}
$$

Where $\lambda$ is the regularization strength.

**Trade-off:**

* λ = 0: No regularization, risk overfitting
* λ = ∞: Maximum regularization, all weights shrink to zero and the model predicts the mean
* λ = just right: Balance fit and complexity
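
As a minimal sketch of this objective, assume MSE as the prediction error and the sum of squared weights as the complexity term (one common choice). The helper `penalized_loss` and the toy data are illustrative only, not part of any library.

```python
import numpy as np

def penalized_loss(w, X, y, lam):
    """Mean squared error plus lam times the sum of squared weights."""
    mse = np.mean((X @ w - y) ** 2)
    complexity = np.sum(w ** 2)
    return mse + lam * complexity

# Toy data that the weights [1, 1] fit exactly
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, 2.0])

w_fit = np.array([1.0, 1.0])   # perfect fit, but non-zero weights
w_zero = np.zeros(2)           # ignores the inputs, zero complexity

for lam in [0.0, 0.1, 10.0]:
    print(f"lambda={lam}: fitted={penalized_loss(w_fit, X, y, lam):.2f}, "
          f"zero={penalized_loss(w_zero, X, y, lam):.2f}")
```

With a small λ the fitted weights have the lower loss; once λ is large enough, the all-zero weights win even though they ignore the data.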

***

## L2 Regularization (Ridge)

Add the **sum of squared weights** to the loss:

$$
Loss_{Ridge} = MSE + \lambda \sum_{j=1}^{p} w_j^2
$$

**Effect**: Pushes weights toward zero, but never exactly zero. Creates "small" weights.

```python theme={null}
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# High-degree polynomial with Ridge regularization
X = np.linspace(0, 1, 15).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.randn(15) * 0.3

# Try different regularization strengths
alphas = [0, 0.1, 1, 10]

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for ax, alpha in zip(axes, alphas):
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=10)),
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=alpha))
    ])
    
    pipeline.fit(X, y)
    
    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    y_pred = pipeline.predict(X_test)
    
    ax.scatter(X, y, color='blue')
    ax.plot(X_test, y_pred, color='red')
    ax.set_title(f'Ridge (α = {alpha})')
    ax.set_ylim(-1, 5)

plt.tight_layout()
plt.show()
```
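
The shrinkage effect can also be checked directly from the closed-form ridge solution. The sketch below is a NumPy-only illustration, not scikit-learn's implementation: `ridge_fit` is a hypothetical helper, the data is synthetic, and it penalizes the sum of squared errors rather than the MSE (which only rescales λ). The intercept is left unpenalized by centering the data.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit; centering keeps the intercept unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=50)

norms = []
for lam in [0.0, 1.0, 100.0, 1e9]:
    w, b = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"lambda={lam:>12}: ||w|| = {norms[-1]:.6f}, intercept = {b:.4f}")
```

The weight norm falls as λ grows but the weights stay non-zero, and for a very large λ the intercept approaches the mean of `y`, so the model predicts roughly the mean.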

***

## L1 Regularization (Lasso)

Add the **sum of absolute weights** to the loss:

$$
Loss_{Lasso} = MSE + \lambda \sum_{j=1}^{p} |w_j|
$$

**Effect**: Pushes weights toward zero, and **some become exactly zero**. Creates **sparse** models!

```python theme={null}
from sklearn.linear_model import Lasso

# Compare Ridge vs Lasso on feature selection
from sklearn.datasets import make_regression

# Create data with many irrelevant features
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=42)

# Fit both
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)

ridge.fit(X, y)
lasso.fit(X, y)

# Compare coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].bar(range(20), ridge.coef_)
axes[0].set_title('Ridge Coefficients (all non-zero)')
axes[0].set_xlabel('Feature')
axes[0].axhline(y=0, color='r', linestyle='--')

axes[1].bar(range(20), lasso.coef_)
axes[1].set_title('Lasso Coefficients (many are exactly 0!)')
axes[1].set_xlabel('Feature')
axes[1].axhline(y=0, color='r', linestyle='--')

plt.tight_layout()
plt.show()

print(f"Ridge: {np.sum(ridge.coef_ != 0)} non-zero coefficients")
print(f"Lasso: {np.sum(lasso.coef_ != 0)} non-zero coefficients")
```
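
One way to see why L1 produces exact zeros is to look at a single weight in isolation. Assuming the one-dimensional objectives ½(w − z)² + t·|w| for L1 and ½(w − z)² + (λ/2)·w² for L2, where z is the unregularized estimate, the minimizers are a soft-threshold and a rescaling respectively. The function names below are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """L1 minimizer: move z toward zero by t, clipping at exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, lam):
    """L2 minimizer: rescale z by a constant factor, never reaching zero."""
    return z / (1.0 + lam)

unregularized = np.array([3.0, 0.8, -0.3, 0.05])

l1_weights = soft_threshold(unregularized, 0.5)
l2_weights = ridge_shrink(unregularized, 0.5)

print("L1:", l1_weights, "non-zero:", np.count_nonzero(l1_weights))
print("L2:", l2_weights, "non-zero:", np.count_nonzero(l2_weights))
```

Any estimate smaller in magnitude than the threshold is set to exactly zero under L1, while L2 only scales every estimate down by the same factor.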

***

## Elastic Net: Best of Both Worlds

Combine L1 and L2:

$$
Loss_{ElasticNet} = MSE + \lambda_1 \sum_{j=1}^{p} |w_j| + \lambda_2 \sum_{j=1}^{p} w_j^2
$$

```python theme={null}
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)  # 50% L1, 50% L2
elastic.fit(X, y)

print(f"Elastic Net: {np.sum(elastic.coef_ != 0)} non-zero coefficients")
```
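
For reference, scikit-learn's `ElasticNet` does not take $\lambda_1$ and $\lambda_2$ directly. According to its documentation, it minimizes the objective below, where $\alpha$ is `alpha`, $\rho$ is `l1_ratio` (the symbol $\rho$ is notation introduced here), and $n$ is the number of samples:

$$
\frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \sum_{j=1}^{p} |w_j| + \frac{\alpha (1 - \rho)}{2} \sum_{j=1}^{p} w_j^2
$$

Matching terms against the Elastic Net loss gives $\lambda_1 = \alpha \rho$ and $\lambda_2 = \alpha (1 - \rho) / 2$, with the error term scaled as half the MSE. For `alpha=1.0` and `l1_ratio=0.5`, that works out to $\lambda_1 = 0.5$ and $\lambda_2 = 0.25$.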

***

## When to Use Which?

| Method          | Use Case                             |
| --------------- | ------------------------------------ |
| **Ridge (L2)**  | Many features, each with a small effect |
| **Lasso (L1)**  | Feature selection, want sparse model |
| **Elastic Net** | Many correlated features             |

<Note>
  **Math Connection**: The L2 penalty is the squared [Euclidean norm](/courses/math-for-ml-linear-algebra/02-vectors) of the weight vector. The L1 penalty is its Manhattan norm.
</Note>
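
A quick numeric check of the two norms, using `np.linalg.norm` on an arbitrary example weight vector:

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])

l1_norm = np.linalg.norm(w, ord=1)   # Manhattan norm: sum of |w_j|
l2_norm = np.linalg.norm(w, ord=2)   # Euclidean norm: sqrt of sum of w_j^2

lasso_penalty = l1_norm              # L1 penalty term (before multiplying by lambda)
ridge_penalty = l2_norm ** 2         # L2 penalty term is the squared norm

print(f"L1 norm: {l1_norm}, L2 norm: {l2_norm}")
print(f"Lasso penalty: {lasso_penalty}, Ridge penalty: {ridge_penalty}")
```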

***

## Regularization in Classification

For logistic regression:

```python theme={null}
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# C is the inverse of regularization strength (smaller C = more regularization)
models = {
    'No regularization': LogisticRegression(penalty=None, max_iter=1000),
    'L2 (Ridge)': LogisticRegression(penalty='l2', C=1.0, max_iter=1000),
    'L1 (Lasso)': LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=1000),
    'Elastic Net': LogisticRegression(penalty='elasticnet', C=1.0, l1_ratio=0.5, solver='saga', max_iter=1000)
}

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

for name, model in models.items():
    # Standardize inside the pipeline so each CV fold is scaled on its own training split.
    # The penalty treats all weights alike, and 'saga' converges poorly on unscaled features.
    scores = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=5)
    print(f"{name:20s}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

***

## Regularization in Tree-Based Models

scikit-learn's tree-based models have no L1/L2 penalty; they are regularized by limiting tree growth and by adding randomness:

```python theme={null}
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Tree regularization parameters
tree = DecisionTreeClassifier(
    max_depth=5,           # Limit tree depth
    min_samples_split=10,  # Min samples to split
    min_samples_leaf=5,    # Min samples in leaf
    max_features='sqrt'    # Random subset of features
)

# Random Forest adds more regularization through bagging
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=2,
    max_features='sqrt'
)

# Gradient Boosting has learning rate as regularization
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,    # Smaller = more regularization
    max_depth=3,
    subsample=0.8         # Random sample of data
)
```

***

## Dropout: Regularization for Neural Networks

Randomly "turn off" neurons during training:

```python theme={null}
import torch.nn as nn

class RegularizedNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(100, 256),
            nn.ReLU(),
            nn.Dropout(0.3),      # 30% dropout
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 10)
        )
    
    def forward(self, x):
        return self.layers(x)
```

**Why it works**: Forces the network to not rely on any single neuron. Creates redundancy.
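
To make the mechanics concrete, here is a NumPy sketch of "inverted" dropout, the variant in which surviving activations are rescaled by 1 / (1 − p) during training so that nothing needs to change at inference time. The `dropout` function is an illustration written for this page, not PyTorch's internal code.

```python
import numpy as np

def dropout(activations, p, training, rng):
    """Zero each unit with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return activations              # inference: pass through unchanged
    keep_prob = 1.0 - p
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((10000, 8))                 # stand-in for a layer's activations

train_out = dropout(h, p=0.3, training=True, rng=rng)
eval_out = dropout(h, p=0.3, training=False, rng=rng)

print(f"fraction zeroed in training: {np.mean(train_out == 0):.3f}")
print(f"mean activation (training):  {train_out.mean():.3f}")
print(f"mean activation (inference): {eval_out.mean():.3f}")
```

In PyTorch the same switch is controlled by `model.train()` and `model.eval()`; forgetting to call `model.eval()` before validation leaves dropout active and makes the scores noisy.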

***

## Early Stopping

Stop training when validation performance stops improving:

```python theme={null}
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
    max_iter=1000
)

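# In PyTorch. `model`, `train_loader`, `val_loader`, `train_one_epoch`, and
# `evaluate` are placeholders for your own model, data loaders, and helpers.
import torch

best_val_loss = float('inf')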
patience = 10
patience_counter = 0

for epoch in range(1000):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        # Save best model
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            break
```
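
The patience logic can be tested on its own, without a model. The sketch below replays a made-up list of validation losses through the same rule; `early_stopping_epoch` is a hypothetical helper, and the loss values are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch

# Invented losses: a brief plateau, one more improvement, then overfitting
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.69, 0.70, 0.75, 0.80, 0.90]

for patience in [2, 3]:
    stop, best = early_stopping_epoch(losses, patience)
    print(f"patience={patience}: stopped at epoch {stop}, best epoch {best}")
```

A patience that is too small stops during the plateau and misses the later improvement, which is why the best checkpoint is saved separately from the stopping decision.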

***

## Data Augmentation

Create more training examples by transforming existing ones:

```python theme={null}
# For images
from torchvision import transforms

augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomCrop(224, padding=4)
])

# For tabular data: add noise
def augment_tabular(X, noise_level=0.01):
    noise = np.random.randn(*X.shape) * noise_level
    return X + noise
```

***

## Cross-Validation for Choosing λ

```python theme={null}
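from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

# Regression data for tuning alpha (same setup as the Lasso example)
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=42)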

# RidgeCV automatically finds best alpha
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100], cv=5)
ridge_cv.fit(X, y)
print(f"Best alpha: {ridge_cv.alpha_}")

# LassoCV does the same for Lasso
lasso_cv = LassoCV(alphas=[0.01, 0.1, 1, 10], cv=5)
lasso_cv.fit(X, y)
print(f"Best alpha: {lasso_cv.alpha_}")
```
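
What the `*CV` estimators automate can be written out by hand. The following is a simplified NumPy sketch of k-fold selection for a ridge penalty, assuming contiguous folds, synthetic data, and a closed-form fit; `ridge_fit` and `cv_mse` are hypothetical helpers and not how scikit-learn implements it.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit on centered data (intercept not penalized)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def cv_mse(X, y, lam, k=5):
    """Average held-out MSE over k contiguous folds."""
    indices = np.arange(len(y))
    fold_errors = []
    for fold in np.array_split(indices, k):
        train = np.setdiff1d(indices, fold)
        w, b = ridge_fit(X[train], y[train], lam)
        fold_errors.append(np.mean((X[fold] @ w + b - y[fold]) ** 2))
    return float(np.mean(fold_errors))

# Synthetic data: 30 features, only the first 5 matter
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 30))
true_w = np.zeros(30)
true_w[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ true_w + rng.normal(scale=2.0, size=60)

candidates = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
scores = {lam: cv_mse(X, y, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)

for lam, mse in scores.items():
    print(f"lambda={lam:>7}: CV MSE = {mse:.3f}")
print(f"Selected lambda: {best_lam}")
```

Each candidate is scored only on data it was not fitted on, so the selected value reflects generalization rather than training fit.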

***

## Regularization Summary

```python theme={null}
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
import pandas as pd

# Compare all regularization methods
models = {
    'Linear Regression': LinearRegression(),
    'Ridge (α=0.1)': Ridge(alpha=0.1),
    'Ridge (α=1.0)': Ridge(alpha=1.0),
    'Ridge (α=10)': Ridge(alpha=10),
    'Lasso (α=0.1)': Lasso(alpha=0.1),
    'Lasso (α=1.0)': Lasso(alpha=1.0),
    'Elastic Net': ElasticNet(alpha=1.0, l1_ratio=0.5)
}

results = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    results.append({
        'Model': name,
        'MSE': -scores.mean(),
        'Std': scores.std()
    })

df = pd.DataFrame(results)
print(df.to_string(index=False))
```

***

## The Bias-Variance Tradeoff

Regularization is really about balancing:

|                    | Low Regularization | Moderate Regularization | High Regularization |
| ------------------ | ------------------ | ----------------------- | ------------------- |
| **Bias**           | Low                | Moderate                | High                |
| **Variance**       | High               | Moderate                | Low                 |
| **Training Error** | Low                | Moderate                | High                |
| **Test Error**     | High (overfit)     | Lowest (sweet spot)     | High (underfit)     |

<Note>
  **The goal**: Find the regularization strength that minimizes **test error**, not training error. Use cross-validation!
</Note>
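
The tradeoff can be measured in a small simulation. The sketch below assumes a fixed design matrix, a known "true" weight vector of ones, and Gaussian noise (all invented for illustration), refits ridge on 200 noisy versions of the training targets, and estimates the squared bias and variance of the prediction at one test point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_repeats = 30, 10, 200

X = rng.normal(size=(n, p))      # fixed training inputs
true_w = np.ones(p)              # hypothetical true weights
x_query = np.ones(p)             # single test point
target = x_query @ true_w        # noise-free answer at the test point

# Each row of Y is one noisy re-draw of the training targets
Y = X @ true_w + rng.normal(scale=1.0, size=(n_repeats, n))

results = {}
for lam in [0.0, 1.0, 10.0, 100.0]:
    # Ridge solution for all training sets at once (one column per dataset)
    W = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y.T)
    preds = x_query @ W
    bias_sq = (preds.mean() - target) ** 2
    variance = preds.var()
    results[lam] = (bias_sq, variance)
    print(f"lambda={lam:>6}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Moving from no penalty to a heavy penalty trades variance for bias; the best λ is the one where their sum, and therefore the expected test error, is smallest.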

***

## 🚀 Mini Projects

<CardGroup cols={2}>
  <Card title="Project 1: Regularization Comparison" icon="scale-balanced">
    Compare Ridge, Lasso, and ElasticNet
  </Card>

  <Card title="Project 2: Feature Selection with Lasso" icon="filter">
    Use L1 regularization to select features
  </Card>

  <Card title="Project 3: Overfitting Simulator" icon="chart-mixed">
    Visualize how regularization prevents overfitting
  </Card>

  <Card title="Project 4: Optimal Lambda Finder" icon="sliders">
    Find the perfect regularization strength
  </Card>
</CardGroup>

### Project 1: Regularization Comparison

Compare different regularization techniques on the same dataset.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.datasets import make_regression
  from sklearn.model_selection import train_test_split, cross_val_score
  from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
  from sklearn.preprocessing import StandardScaler
  from sklearn.metrics import mean_squared_error, r2_score

  # Step 1: Generate data with many features (some irrelevant)
  np.random.seed(42)
  n_samples = 200
  n_features = 50
  n_informative = 10

  X, y = make_regression(
      n_samples=n_samples,
      n_features=n_features,
      n_informative=n_informative,
      noise=10,
      random_state=42
  )

  print("="*60)
  print("📊 REGULARIZATION COMPARISON")
  print("="*60)
  print(f"Samples: {n_samples}")
  print(f"Features: {n_features} (only {n_informative} informative)")

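  # Split first, then fit the scaler on the training set only (avoids test-set leakage)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  scaler = StandardScaler()
  X_train = scaler.fit_transform(X_train)
  X_test = scaler.transform(X_test)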

  # Step 2: Train different models
  models = {
      'Linear Regression': LinearRegression(),
      'Ridge (L2)': Ridge(alpha=1.0),
      'Lasso (L1)': Lasso(alpha=1.0),
      'ElasticNet (L1+L2)': ElasticNet(alpha=1.0, l1_ratio=0.5)
  }

  results = []
  for name, model in models.items():
      model.fit(X_train, y_train)
      
      train_pred = model.predict(X_train)
      test_pred = model.predict(X_test)
      
      cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
      
      n_nonzero = np.sum(np.abs(model.coef_) > 1e-5) if hasattr(model, 'coef_') else n_features
      
      result = {
          'name': name,
          'train_mse': mean_squared_error(y_train, train_pred),
          'test_mse': mean_squared_error(y_test, test_pred),
          'cv_mse': -cv_scores.mean(),
          'r2': r2_score(y_test, test_pred),
          'n_nonzero': n_nonzero,
          'coef': model.coef_ if hasattr(model, 'coef_') else None
      }
      results.append(result)
      
      print(f"\n{name}:")
      print(f"  Train MSE: {result['train_mse']:.2f}")
      print(f"  Test MSE:  {result['test_mse']:.2f}")
      print(f"  CV MSE:    {result['cv_mse']:.2f}")
      print(f"  R² Score:  {result['r2']:.4f}")
      print(f"  Non-zero coefficients: {n_nonzero}/{n_features}")

  # Step 3: Visualize results
  fig, axes = plt.subplots(2, 2, figsize=(14, 10))

  # MSE comparison
  ax1 = axes[0, 0]
  names = [r['name'] for r in results]
  train_mses = [r['train_mse'] for r in results]
  test_mses = [r['test_mse'] for r in results]
  x_pos = np.arange(len(names))
  width = 0.35
  ax1.bar(x_pos - width/2, train_mses, width, label='Train', color='steelblue')
  ax1.bar(x_pos + width/2, test_mses, width, label='Test', color='coral')
  ax1.set_xticks(x_pos)
  ax1.set_xticklabels([n.split('(')[0] for n in names], rotation=15)
  ax1.set_ylabel('MSE')
  ax1.set_title('Train vs Test MSE')
  ax1.legend()

  # Coefficient magnitudes
  ax2 = axes[0, 1]
  for i, result in enumerate(results):
      if result['coef'] is not None:
          sorted_coef = np.sort(np.abs(result['coef']))[::-1]
          ax2.plot(sorted_coef, label=result['name'].split('(')[0])
  ax2.set_xlabel('Coefficient Index (sorted)')
  ax2.set_ylabel('|Coefficient|')
  ax2.set_title('Coefficient Magnitudes')
  ax2.legend()

  # Non-zero features
  ax3 = axes[1, 0]
  n_nonzeros = [r['n_nonzero'] for r in results]
  colors = ['red' if n == n_features else 'green' for n in n_nonzeros]
  ax3.bar(names, n_nonzeros, color=colors)
  ax3.axhline(y=n_informative, color='black', linestyle='--', label=f'True informative ({n_informative})')
  ax3.set_ylabel('Non-zero Coefficients')
  ax3.set_title('Feature Selection')
  ax3.legend()

  # Coefficient heatmap
  ax4 = axes[1, 1]
  coef_matrix = np.array([r['coef'][:20] for r in results if r['coef'] is not None])
  im = ax4.imshow(coef_matrix, aspect='auto', cmap='RdBu', vmin=-np.max(np.abs(coef_matrix)), vmax=np.max(np.abs(coef_matrix)))
  ax4.set_yticks(range(len(coef_matrix)))
  ax4.set_yticklabels([r['name'].split('(')[0] for r in results if r['coef'] is not None])
  ax4.set_xlabel('Feature Index (first 20)')
  ax4.set_title('Coefficient Values Heatmap')
  plt.colorbar(im, ax=ax4)

  plt.tight_layout()
  plt.savefig('regularization_comparison.png', dpi=150)

  print("\n💡 Key Insights:")
  print("  - Linear Regression may overfit with many features")
  print("  - Lasso (L1) sets many coefficients to exactly zero")
  print("  - Ridge (L2) shrinks all coefficients but keeps them non-zero")
  print("  - ElasticNet combines both approaches")
  ```

  **What you learned:**

  * L1 (Lasso) creates sparse models by setting coefficients to zero
  * L2 (Ridge) shrinks all coefficients but keeps them non-zero
  * ElasticNet provides the benefits of both
</details>

### Project 2: Feature Selection with Lasso

Use Lasso regularization to automatically select the most important features.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split, cross_val_score
  from sklearn.linear_model import Lasso, LassoCV
  from sklearn.preprocessing import StandardScaler
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.linear_model import LogisticRegression

  # Step 1: Load data
  cancer = load_breast_cancer()
  X = cancer.data
  y = cancer.target
  feature_names = cancer.feature_names

  print("="*60)
  print("🔍 FEATURE SELECTION WITH LASSO")
  print("="*60)
  print(f"Original features: {X.shape[1]}")
  print(f"Samples: {X.shape[0]}")

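  # Split first, then fit the scaler on the training set only (avoids test-set leakage)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=42
  )

  scaler = StandardScaler()
  X_train = scaler.fit_transform(X_train)
  X_test = scaler.transform(X_test)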

  # Step 2: Use LassoCV to find optimal alpha
  print("\n1️⃣ FINDING OPTIMAL REGULARIZATION")
  print("-"*40)

  lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000)
  lasso_cv.fit(X_train, y_train)

  print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")
  print(f"Non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}")

  # Step 3: Analyze selected features
  print("\n2️⃣ SELECTED FEATURES")
  print("-"*40)

  # Get feature importance from Lasso coefficients
  feature_importance = pd.DataFrame({
      'feature': feature_names,
      'coefficient': lasso_cv.coef_,
      'abs_coefficient': np.abs(lasso_cv.coef_)
  }).sort_values('abs_coefficient', ascending=False)

  selected_features = feature_importance[feature_importance['coefficient'] != 0]
  print(f"\nSelected {len(selected_features)} features out of {len(feature_names)}:")
  for _, row in selected_features.iterrows():
      direction = "+" if row['coefficient'] > 0 else "-"
      print(f"  {direction} {row['feature']}: {row['coefficient']:.4f}")

  # Step 4: Compare model performance with different feature sets
  print("\n3️⃣ PERFORMANCE COMPARISON")
  print("-"*40)

  selected_indices = [i for i, f in enumerate(feature_names) if f in selected_features['feature'].values]
  X_train_selected = X_train[:, selected_indices]
  X_test_selected = X_test[:, selected_indices]

  results = {}

  # All features
  lr_all = LogisticRegression(max_iter=1000, random_state=42)
  cv_all = cross_val_score(lr_all, X_train, y_train, cv=5)
  lr_all.fit(X_train, y_train)
  results['All Features (30)'] = {
      'cv_mean': cv_all.mean(),
      'cv_std': cv_all.std(),
      'test': lr_all.score(X_test, y_test)
  }

  # Lasso-selected features
  lr_selected = LogisticRegression(max_iter=1000, random_state=42)
  cv_selected = cross_val_score(lr_selected, X_train_selected, y_train, cv=5)
  lr_selected.fit(X_train_selected, y_train)
  results[f'Lasso Selected ({len(selected_indices)})'] = {
      'cv_mean': cv_selected.mean(),
      'cv_std': cv_selected.std(),
      'test': lr_selected.score(X_test_selected, y_test)
  }

  # Top 5 by RandomForest
  rf = RandomForestClassifier(n_estimators=100, random_state=42)
  rf.fit(X_train, y_train)
  rf_top5 = np.argsort(rf.feature_importances_)[-5:]
  X_train_top5 = X_train[:, rf_top5]
  X_test_top5 = X_test[:, rf_top5]

  lr_top5 = LogisticRegression(max_iter=1000, random_state=42)
  cv_top5 = cross_val_score(lr_top5, X_train_top5, y_train, cv=5)
  lr_top5.fit(X_train_top5, y_train)
  results['RF Top 5'] = {
      'cv_mean': cv_top5.mean(),
      'cv_std': cv_top5.std(),
      'test': lr_top5.score(X_test_top5, y_test)
  }

  for name, res in results.items():
      print(f"{name}:")
      print(f"  CV Score: {res['cv_mean']:.4f} (+/- {res['cv_std']:.4f})")
      print(f"  Test Score: {res['test']:.4f}")

  # Step 5: Regularization path
  print("\n4️⃣ REGULARIZATION PATH")
  print("-"*40)

  alphas = np.logspace(-4, 1, 50)
  coef_paths = []

  for alpha in alphas:
      lasso = Lasso(alpha=alpha, max_iter=10000)
      lasso.fit(X_train, y_train)
      coef_paths.append(lasso.coef_)

  coef_paths = np.array(coef_paths)

  # Visualize
  fig, axes = plt.subplots(2, 2, figsize=(14, 10))

  # Regularization path
  ax1 = axes[0, 0]
  for i in range(coef_paths.shape[1]):
      ax1.plot(alphas, coef_paths[:, i], alpha=0.7)
  ax1.axvline(x=lasso_cv.alpha_, color='r', linestyle='--', label=f'Optimal α={lasso_cv.alpha_:.4f}')
  ax1.set_xscale('log')
  ax1.set_xlabel('Alpha (regularization strength)')
  ax1.set_ylabel('Coefficient Value')
  ax1.set_title('Lasso Regularization Path')
  ax1.legend()

  # Number of non-zero features vs alpha
  ax2 = axes[0, 1]
  n_nonzero = (np.abs(coef_paths) > 1e-5).sum(axis=1)
  ax2.plot(alphas, n_nonzero, 'b-', linewidth=2)
  ax2.axvline(x=lasso_cv.alpha_, color='r', linestyle='--')
  ax2.set_xscale('log')
  ax2.set_xlabel('Alpha')
  ax2.set_ylabel('Number of Non-zero Coefficients')
  ax2.set_title('Feature Selection vs Regularization')

  # Selected feature coefficients
  ax3 = axes[1, 0]
  selected_coefs = selected_features.sort_values('coefficient')
  colors = ['green' if c > 0 else 'red' for c in selected_coefs['coefficient']]
  ax3.barh(range(len(selected_coefs)), selected_coefs['coefficient'], color=colors)
  ax3.set_yticks(range(len(selected_coefs)))
  ax3.set_yticklabels(selected_coefs['feature'], fontsize=8)
  ax3.set_xlabel('Coefficient')
  ax3.set_title('Selected Feature Coefficients')
  ax3.axvline(x=0, color='black', linewidth=0.5)

  # Comparison bar chart
  ax4 = axes[1, 1]
  names = list(results.keys())
  cv_scores = [r['cv_mean'] for r in results.values()]
  test_scores = [r['test'] for r in results.values()]
  x_pos = np.arange(len(names))
  width = 0.35
  ax4.bar(x_pos - width/2, cv_scores, width, label='CV', color='steelblue')
  ax4.bar(x_pos + width/2, test_scores, width, label='Test', color='coral')
  ax4.set_xticks(x_pos)
  ax4.set_xticklabels(names, rotation=15)
  ax4.set_ylabel('Accuracy')
  ax4.set_title('Model Performance Comparison')
  ax4.legend()
  ax4.set_ylim(0.9, 1.0)

  plt.tight_layout()
  plt.savefig('lasso_feature_selection.png', dpi=150)

  print("\n✅ Feature selection complete!")
  ```

  **What you learned:**

  * Lasso automatically selects features by setting coefficients to zero
  * The regularization path shows how features get eliminated
  * Fewer features can give similar or better performance
</details>

### Project 3: Overfitting Simulator

Visualize how regularization prevents overfitting.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.preprocessing import PolynomialFeatures
  from sklearn.linear_model import Ridge
  from sklearn.model_selection import train_test_split
  from sklearn.metrics import mean_squared_error

  # Step 1: Generate simple data
  np.random.seed(42)
  n_samples = 30

  # True function: simple quadratic
  X = np.linspace(0, 1, n_samples).reshape(-1, 1)
  y_true = 2 * X.ravel()**2 + 0.5 * X.ravel() + 1
  y = y_true + np.random.randn(n_samples) * 0.3

  print("="*60)
  print("📈 OVERFITTING SIMULATOR")
  print("="*60)

  # Split data
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
  y_train_true = 2 * X_train.ravel()**2 + 0.5 * X_train.ravel() + 1
  y_test_true = 2 * X_test.ravel()**2 + 0.5 * X_test.ravel() + 1

  print(f"Training samples: {len(X_train)}")
  print(f"Test samples: {len(X_test)}")

  # Step 2: Fit high-degree polynomials with different regularization
  degree = 15
  poly = PolynomialFeatures(degree)
  X_train_poly = poly.fit_transform(X_train)
  X_test_poly = poly.transform(X_test)

  print(f"\nPolynomial degree: {degree} ({X_train_poly.shape[1]} features)")

  alphas = [0, 0.001, 0.01, 0.1, 1, 10, 100]
  results = []

  from sklearn.linear_model import LinearRegression  # unregularized baseline for α=0

  for alpha in alphas:
      if alpha == 0:
          model = LinearRegression()
      else:
          model = Ridge(alpha=alpha)
      
      model.fit(X_train_poly, y_train)
      
      train_pred = model.predict(X_train_poly)
      test_pred = model.predict(X_test_poly)
      
      result = {
          'alpha': alpha,
          'train_mse': mean_squared_error(y_train, train_pred),
          'test_mse': mean_squared_error(y_test, test_pred),
          'model': model
      }
      results.append(result)
      
      print(f"α={alpha:6.3f}: Train MSE={result['train_mse']:.4f}, Test MSE={result['test_mse']:.4f}")

  # Step 3: Find optimal regularization
  best_result = min(results, key=lambda x: x['test_mse'])
  print(f"\nBest regularization: α={best_result['alpha']} (Test MSE={best_result['test_mse']:.4f})")

  # Step 4: Visualize
  fig, axes = plt.subplots(2, 3, figsize=(15, 10))

  X_plot = np.linspace(0, 1, 100).reshape(-1, 1)
  X_plot_poly = poly.transform(X_plot)
  y_plot_true = 2 * X_plot.ravel()**2 + 0.5 * X_plot.ravel() + 1

  # Plot for different regularization values
  plot_alphas = [0, 0.001, 0.1, 1, 10, 100]
  for i, alpha in enumerate(plot_alphas):
      ax = axes[i // 3, i % 3]
      
      result = next(r for r in results if r['alpha'] == alpha)
      y_plot = result['model'].predict(X_plot_poly)
      
      ax.scatter(X_train, y_train, c='blue', label='Train', alpha=0.7)
      ax.scatter(X_test, y_test, c='green', label='Test', alpha=0.7)
      ax.plot(X_plot, y_plot_true, 'k--', label='True function', linewidth=2)
      ax.plot(X_plot, y_plot, 'r-', label='Prediction', linewidth=2)
      ax.set_title(f'α = {alpha}\nTrain MSE={result["train_mse"]:.3f}, Test MSE={result["test_mse"]:.3f}')
      ax.set_ylim(-0.5, 4)
      ax.legend(fontsize=8)

  plt.tight_layout()
  plt.savefig('overfitting_simulator.png', dpi=150)

  # Step 5: Bias-variance tradeoff visualization
  fig, axes = plt.subplots(1, 2, figsize=(14, 5))

  # MSE vs alpha
  ax1 = axes[0]
  alphas_plot = [r['alpha'] if r['alpha'] > 0 else 1e-4 for r in results]
  train_mses = [r['train_mse'] for r in results]
  test_mses = [r['test_mse'] for r in results]

  ax1.plot(alphas_plot, train_mses, 'b-o', label='Train MSE')
  ax1.plot(alphas_plot, test_mses, 'r-o', label='Test MSE')
  ax1.axvline(x=best_result['alpha'] if best_result['alpha'] > 0 else 1e-4, 
              color='green', linestyle='--', label=f'Optimal α={best_result["alpha"]}')
  ax1.set_xscale('log')
  ax1.set_xlabel('Regularization (α)')
  ax1.set_ylabel('MSE')
  ax1.set_title('Bias-Variance Tradeoff')
  ax1.legend()

  # Coefficient magnitudes
  ax2 = axes[1]
  for result in results[1:]:  # Skip alpha=0
      coefs = np.abs(result['model'].coef_[1:11])  # First 10 non-intercept
      ax2.plot(coefs, 'o-', label=f'α={result["alpha"]}', alpha=0.7)
  ax2.set_xlabel('Coefficient Index')
  ax2.set_ylabel('|Coefficient|')
  ax2.set_title('Coefficient Magnitude vs Regularization')
  ax2.legend()

  plt.tight_layout()
  plt.savefig('bias_variance.png', dpi=150)

  print("\n💡 Key Observations:")
  print("  - α=0: No regularization → Overfitting (wiggly curve)")
  print("  - α too high: Underfitting (too smooth)")
  print(f"  - α={best_result['alpha']}: Best generalization")
  ```

  **What you learned:**

  * No regularization leads to overfitting (low train error, high test error)
  * Too much regularization leads to underfitting (high train and test error)
  * The sweet spot minimizes test error
</details>

### Project 4: Optimal Lambda Finder

Systematically find the best regularization strength using cross-validation.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.datasets import load_diabetes
  from sklearn.model_selection import cross_val_score, learning_curve
  from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, ElasticNetCV
  from sklearn.preprocessing import StandardScaler

  # Step 1: Load data
  diabetes = load_diabetes()
  X = diabetes.data
  y = diabetes.target

  print("="*60)
  print("🎯 OPTIMAL LAMBDA FINDER")
  print("="*60)
  print(f"Features: {X.shape[1]}")
  print(f"Samples: {X.shape[0]}")

  # Scale features
  scaler = StandardScaler()
  X = scaler.fit_transform(X)

  # Step 2: Find optimal lambda for Ridge
  print("\n1️⃣ RIDGE REGRESSION")
  print("-"*40)

  alphas_ridge = np.logspace(-3, 3, 100)
  cv_scores_ridge = []

  for alpha in alphas_ridge:
      ridge = Ridge(alpha=alpha)
      scores = cross_val_score(ridge, X, y, cv=5, scoring='neg_mean_squared_error')
      cv_scores_ridge.append(-scores.mean())

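  # Score by MSE so the selected alpha matches the manual CV curve computed above
  ridge_cv = RidgeCV(alphas=alphas_ridge, cv=5, scoring='neg_mean_squared_error')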
  ridge_cv.fit(X, y)
  print(f"Optimal Ridge alpha: {ridge_cv.alpha_:.4f}")
  print(f"CV MSE at optimal: {cv_scores_ridge[np.argmin(cv_scores_ridge)]:.2f}")

  # Step 3: Find optimal lambda for Lasso
  print("\n2️⃣ LASSO REGRESSION")
  print("-"*40)

  lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000)
  lasso_cv.fit(X, y)
  print(f"Optimal Lasso alpha: {lasso_cv.alpha_:.4f}")
  print(f"Non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}/{X.shape[1]}")

  # Step 4: Find optimal ElasticNet parameters
  print("\n3️⃣ ELASTICNET")
  print("-"*40)

  l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9]
  enet_cv = ElasticNetCV(l1_ratio=l1_ratios, cv=5, random_state=42, max_iter=10000)
  enet_cv.fit(X, y)
  print(f"Optimal ElasticNet alpha: {enet_cv.alpha_:.4f}")
  print(f"Optimal l1_ratio: {enet_cv.l1_ratio_:.2f}")

  # Step 5: Compare all methods
  print("\n4️⃣ COMPARISON")
  print("-"*40)

  models = {
      'Ridge': ridge_cv,
      'Lasso': lasso_cv,
      'ElasticNet': enet_cv
  }

  for name, model in models.items():
      scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
      print(f"{name}: CV MSE = {-scores.mean():.2f} (+/- {scores.std():.2f})")

  # Step 6: Visualize
  fig, axes = plt.subplots(2, 2, figsize=(14, 10))

  # Ridge path
  ax1 = axes[0, 0]
  ax1.plot(alphas_ridge, cv_scores_ridge, 'b-', linewidth=2)
  ax1.axvline(x=ridge_cv.alpha_, color='r', linestyle='--', label=f'Optimal α={ridge_cv.alpha_:.4f}')
  ax1.scatter([ridge_cv.alpha_], [cv_scores_ridge[np.argmin(cv_scores_ridge)]], color='r', s=100, zorder=5)
  ax1.set_xscale('log')
  ax1.set_xlabel('Alpha (λ)')
  ax1.set_ylabel('CV MSE')
  ax1.set_title('Ridge Regularization Path')
  ax1.legend()

  # Lasso path
  ax2 = axes[0, 1]
  ax2.semilogx(lasso_cv.alphas_, lasso_cv.mse_path_.mean(axis=1), 'b-', linewidth=2)
  ax2.fill_between(lasso_cv.alphas_, 
                   lasso_cv.mse_path_.mean(axis=1) - lasso_cv.mse_path_.std(axis=1),
                   lasso_cv.mse_path_.mean(axis=1) + lasso_cv.mse_path_.std(axis=1),
                   alpha=0.2)
  ax2.axvline(x=lasso_cv.alpha_, color='r', linestyle='--', label=f'Optimal α={lasso_cv.alpha_:.4f}')
  ax2.set_xlabel('Alpha (λ)')
  ax2.set_ylabel('CV MSE')
  ax2.set_title('Lasso: CV MSE vs. Alpha')
  ax2.legend()

  # Coefficient comparison
  ax3 = axes[1, 0]
  width = 0.25
  x_pos = np.arange(X.shape[1])
  ax3.bar(x_pos - width, ridge_cv.coef_, width, label='Ridge', alpha=0.7)
  ax3.bar(x_pos, lasso_cv.coef_, width, label='Lasso', alpha=0.7)
  ax3.bar(x_pos + width, enet_cv.coef_, width, label='ElasticNet', alpha=0.7)
  ax3.set_xlabel('Feature Index')
  ax3.set_ylabel('Coefficient')
  ax3.set_title('Coefficient Comparison')
  ax3.legend()

  # Learning curves (RidgeCV re-selects alpha on every training subset)
  ax4 = axes[1, 1]
  train_sizes, train_scores, val_scores = learning_curve(
      ridge_cv, X, y, cv=5, 
      train_sizes=np.linspace(0.1, 1.0, 10),
      scoring='neg_mean_squared_error'
  )
  ax4.plot(train_sizes, -train_scores.mean(axis=1), 'b-', label='Train')
  ax4.fill_between(train_sizes, 
                   -train_scores.mean(axis=1) - train_scores.std(axis=1),
                   -train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1)
  ax4.plot(train_sizes, -val_scores.mean(axis=1), 'r-', label='Validation')
  ax4.fill_between(train_sizes, 
                   -val_scores.mean(axis=1) - val_scores.std(axis=1),
                   -val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.1)
  ax4.set_xlabel('Training Size')
  ax4.set_ylabel('MSE')
  ax4.set_title('Learning Curve (RidgeCV)')
  ax4.legend()

  plt.tight_layout()
  plt.savefig('optimal_lambda.png', dpi=150)

  # Step 7: Summary recommendations
  print("\n📋 RECOMMENDATIONS")
  print("-"*40)
  print("For this dataset:")
  print(f"  1. Ridge (α={ridge_cv.alpha_:.4f}): Good when all features are useful")
  print(f"  2. Lasso (α={lasso_cv.alpha_:.4f}): When feature selection is desired")
  print(f"  3. ElasticNet (α={enet_cv.alpha_:.4f}, l1_ratio={enet_cv.l1_ratio_:.2f}): When there are correlated features")
  print(f"\n  Selected features by Lasso: {np.sum(lasso_cv.coef_ != 0)}/{X.shape[1]}")

  print("\n✅ Lambda optimization complete!")
  ```

  **What you learned:**

  * Cross-validation is essential for finding optimal regularization
  * Different regularization methods suit different problems
  * RidgeCV, LassoCV, ElasticNetCV automate the search process
</details>

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Penalize Complexity" icon="gavel">
    Add weight penalty to the loss function
  </Card>

  <Card title="L2 = Small Weights" icon="minimize">
    Ridge shrinks all weights, none become zero
  </Card>

  <Card title="L1 = Zero Weights" icon="broom">
    Lasso creates sparse models, selects features
  </Card>

  <Card title="Cross-Validate λ" icon="repeat">
    Always use CV to find the right regularization strength
  </Card>
</CardGroup>

***

## What's Next?

You now have a complete ML toolkit! Let's see how to save, load, and deploy your models.

<Card title="Continue to Module 14: Model Deployment" icon="arrow-right" href="/courses/ml-mastery/14-model-deployment">
  Learn how to save models and deploy them for real-world use
</Card>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="You deployed a Ridge regression model and performance degrades after a few months. How would you diagnose whether the issue is regularization-related or something else entirely?">
    The first thing I would do is separate the problem into two categories: is the model itself stale, or has the data distribution shifted underneath it?

    * **Check for data drift first.** Compare the feature distributions in recent production data against the training data. If the input distribution has changed significantly -- say a new customer segment emerged or seasonality shifted -- the model may be fine but the world moved. Use a KS test or population stability index on each feature.
    * **Examine the coefficient magnitudes.** Ridge does not drive coefficients to zero, so all original features are still in play. Pull the learned coefficients and compare them against the coefficients of a model freshly trained on recent data, with features on the same scale. If the ranking has changed dramatically, the regularization strength you chose at training time may no longer be appropriate for the new feature relationships.
    * **Re-run cross-validation on recent data with the same lambda.** If the CV score on recent data is much worse than the original CV score, the relationship between features and target has likely changed or become noisier. If the CV score is still good but production metrics are bad, the problem lies with the deployed model or its serving pipeline: either the model is stale and needs retraining, or there is preprocessing skew, a missing feature, or an encoding mismatch.
    * **Try retraining with RidgeCV on fresh data.** If the optimal alpha changes substantially (say from 1.0 to 100.0), that tells you the signal-to-noise ratio in the data has changed, and your original regularization strength is no longer calibrated.
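
    A minimal sketch of the first and last checks above, using synthetic arrays in place of real training and production data. The variable names, the 0.01 p-value cutoff, and the alpha grid are illustrative assumptions, not values from a real pipeline.

    ```python
    import numpy as np
    from scipy.stats import ks_2samp
    from sklearn.linear_model import RidgeCV

    rng = np.random.default_rng(0)

    # Hypothetical stand-ins for training-time data and recent production data.
    n, p = 500, 5
    X_train = rng.normal(size=(n, p))
    X_recent = rng.normal(size=(n, p))
    X_recent[:, 0] += 1.0  # simulate a shift in feature 0 only

    true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
    y_train = X_train @ true_w + rng.normal(scale=0.5, size=n)
    y_recent = X_recent @ true_w + rng.normal(scale=3.0, size=n)  # noisier regime

    # Check 1: per-feature two-sample KS test. A small p-value suggests drift.
    drifted = []
    for j in range(p):
        stat, p_value = ks_2samp(X_train[:, j], X_recent[:, j])
        if p_value < 0.01:
            drifted.append(j)
    print("Features flagged for drift:", drifted)

    # Check 2: re-tune alpha on each dataset and compare the selected values.
    alphas = np.logspace(-3, 3, 13)
    alpha_old = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train).alpha_
    alpha_new = RidgeCV(alphas=alphas, cv=5).fit(X_recent, y_recent).alpha_
    print(f"alpha on training data: {alpha_old:.3g}, on recent data: {alpha_new:.3g}")
    ```

    With many features, the per-feature KS test will flag some by chance, so treat the list as a starting point for investigation rather than a verdict.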

    **Follow-up: When would you switch from Ridge to Lasso after seeing this kind of degradation?**

    If the retrained model on fresh data shows that many features now have near-zero coefficients -- meaning the signal has concentrated into fewer predictors -- switching to Lasso makes sense. Lasso enforces sparsity, which is effectively automatic feature selection. In practice, I would try ElasticNet first since it gives you the sparsity of L1 with the stability of L2, and use cross-validation to find the optimal l1\_ratio. The key signal is: if your Ridge model is spreading weight across 50 features but only 8 of them actually matter for the current data regime, Lasso or ElasticNet will give you a simpler, more robust model that is easier to monitor and explain.
  </Accordion>

  <Accordion title="An interviewer shows you two models: one with L1 regularization and one with L2, both achieving similar test accuracy. Which do you deploy and why?">
    Similar accuracy is not enough information to make this decision. I would ask several follow-up questions, but here is how I think about it:

    * **Interpretability requirements.** If stakeholders need to understand which features drive predictions -- common in healthcare, finance, and compliance settings -- the L1 model wins. Lasso produces sparse coefficients, so you can say "these 7 features matter, the rest do not." L2 keeps all features active, making the explanation messier.
    * **Feature stability over time.** L1 models are sensitive to correlated features -- if two features are highly correlated, Lasso tends to keep one, more or less arbitrarily, and zero out the other. If feature availability or correlation structure changes in production, the L1 model may behave unpredictably. L2 is more stable because it distributes weight across correlated features.
    * **Inference cost.** If the L1 model zeroed out 80% of features, you only need to compute and transmit 20% of the features at inference time. At scale, this reduces latency and infrastructure cost. For a model serving millions of requests per day, fewer features means real savings.
    * **Monitoring burden.** Fewer active features (L1) means fewer things to monitor for drift. But it also means a single drifting feature has a bigger impact on predictions.

    My default in production: if accuracy is truly similar and feature stability is not a concern, I lean toward L1 for the operational simplicity. But if the feature space has known multicollinearity, I would use ElasticNet or stick with L2.
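
    The correlated-feature behavior is easy to see in a toy case. The sketch below uses an exact duplicate column as the limiting case of high correlation. The data and the alpha values are made up for illustration.

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1.copy()            # exact duplicate: the extreme of "highly correlated"
    x3 = rng.normal(size=n)   # independent feature
    X = np.column_stack([x1, x2, x3])
    y = 2.0 * x1 + 2.0 * x2 + 1.0 * x3 + rng.normal(scale=0.1, size=n)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.1).fit(X, y)

    # Ridge splits the shared signal evenly across the duplicated columns.
    print("Ridge coefficients:", np.round(ridge.coef_, 2))
    # Lasso's coordinate descent puts the weight on one copy and leaves the other at ~0.
    print("Lasso coefficients:", np.round(lasso.coef_, 2))
    ```

    With features that are merely highly correlated rather than identical, the Lasso split is typically less clean, but which feature survives can still change from one retraining to the next.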

    **Follow-up: How does the choice change if you are working in a pipeline where feature computation is expensive?**

    This strongly favors L1. If computing a feature requires a database query, an API call, or a complex aggregation, every feature you can eliminate from the model is infrastructure you do not need to maintain. I have seen production systems where the feature store was the bottleneck, not the model inference. In that scenario, a Lasso model that uses 10 features versus a Ridge model that uses 100 features can mean the difference between 20ms and 200ms prediction latency -- and that is a business-critical difference.
  </Accordion>

  <Accordion title="How would you explain the difference between L1 and L2 regularization to a non-technical product manager who needs to approve your model choice?">
    I would use a hiring analogy. Imagine you are building a team to solve a problem:

    * **L2 (Ridge) is like keeping everyone on the team but limiting how much each person works.** Nobody gets fired, but everyone is told to contribute a little less. The result: a balanced team where everyone does a small part. The upside is stability -- if one person calls in sick, others can compensate. The downside is that you are paying salary for people who contribute almost nothing.
    * **L1 (Lasso) is like running a layoff based on performance.** People who are not contributing get removed entirely. The team gets smaller and more focused. The upside is efficiency and clarity -- you know exactly who matters. The downside is that if you fired the wrong person, there is nobody to cover for them.

    Then I would connect it to the business context: "For our fraud detection model, I recommend L1 because it will tell us the 5 key signals that predict fraud, which the investigations team can act on. If I used L2, I would hand them a list of 50 factors, each with a tiny contribution, and that is not actionable."

    **Follow-up: How do you handle a product manager who says "just use both"?**

    That is actually a real technique called ElasticNet. It combines both L1 and L2 penalties, and you tune a ratio parameter that controls how much of each you use. I would frame it as: "We can start with a blended approach, and the cross-validation process will automatically find the right balance between keeping all features and selecting the most important ones. The data will tell us the right answer." This turns a binary decision into a continuous optimization, which is usually the right engineering instinct.
  </Accordion>

  <Accordion title="You are building a time series forecasting model with 200 engineered features. How would regularization strategy differ from a standard classification problem?">
    Time series adds several wrinkles that change how I think about regularization:

    * **Temporal autocorrelation in features.** Many engineered features in time series are lagged versions of each other (lag\_1, lag\_2, lag\_7, etc.). These are highly correlated by construction. Pure L1 regularization will arbitrarily pick one lag and zero out others, which can make the model fragile if the most predictive lag shifts. I would default to ElasticNet here, or use L2 with aggressive feature selection as a separate preprocessing step.
    * **Feature importance changes over time.** The features that mattered last quarter may not matter this quarter. I would use a sliding-window retraining approach with regularization, and I would monitor whether the set of non-zero features (in L1) or the coefficient magnitudes (in L2) are stable across retraining windows. Large shifts signal regime change.
    * **Multicollinearity from rolling statistics.** If you engineer rolling\_mean\_7, rolling\_mean\_14, and rolling\_mean\_30, these are naturally correlated. Ridge handles this gracefully by sharing weight. Lasso will unpredictably drop some, which may break the model when the short-term vs long-term pattern changes.
    * **The regularization strength should be tuned with TimeSeriesSplit, never random CV.** This is critical. If you use shuffled K-fold cross-validation to select lambda, some validation rows precede the rows the model was trained on, so future information influences the regularization choice and inflates the perceived model quality.

    In practice, for 200 time series features, I would first use Lasso to identify the top 30-50 features, then retrain with Ridge or ElasticNet on that reduced set. This two-stage approach gives you the sparsity benefit without the instability risk.
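
    One way to sketch the two-stage approach with time-ordered validation is a scikit-learn pipeline. The synthetic data, the 50-feature size, and the alpha grid below are assumptions for illustration. `SelectFromModel` keeps the features whose Lasso coefficient is effectively non-zero.

    ```python
    import numpy as np
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LassoCV, RidgeCV
    from sklearn.model_selection import TimeSeriesSplit
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)

    # Hypothetical data: 400 time-ordered rows, 50 features, 5 informative.
    n, p = 400, 50
    X = rng.normal(size=(n, p))
    w = np.zeros(p)
    w[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
    y = X @ w + rng.normal(scale=0.5, size=n)

    # Each fold trains on earlier rows and validates on later rows.
    tscv = TimeSeriesSplit(n_splits=5)

    pipeline = make_pipeline(
        StandardScaler(),
        # Stage 1: Lasso selects features; its alpha is tuned on time-ordered folds.
        SelectFromModel(LassoCV(cv=tscv, max_iter=10000)),
        # Stage 2: Ridge refits on the surviving features, again with time-ordered folds.
        RidgeCV(alphas=np.logspace(-3, 3, 13), cv=tscv),
    )
    pipeline.fit(X, y)

    selected = pipeline.named_steps["selectfrommodel"].get_support()
    print(f"Features kept by Lasso: {selected.sum()}/{p}")
    print(f"Ridge alpha on reduced set: {pipeline.named_steps['ridgecv'].alpha_:.3g}")
    ```

    Keeping both stages in one pipeline means the selection step is refit whenever the pipeline is retrained on a new window, which makes it straightforward to track whether the selected feature set is stable over time.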

    **Follow-up: How would you detect if your regularization is masking a data leakage problem in the time series features?**

    The telltale sign is when an unregularized model performs almost as well as the regularized one on the test set. In a legitimate scenario with 200 features and noise, unregularized models should overfit badly unless you have far more rows than features. If they do not, it likely means some feature is leaking future information so strongly that any model can exploit it. I would check feature importance and look for any feature with disproportionately high weight -- especially any feature derived from rolling windows or aggregations that might accidentally include future timestamps.
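
    A rough sketch of that comparison on synthetic data, where the "leak" is simulated by appending a column that is the target plus a little noise. All sizes, noise levels, and names here are illustrative assumptions.

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression, RidgeCV
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)

    # Hypothetical time-ordered data: 350 rows, 200 features, only 5 informative.
    n, p, split = 350, 200, 250
    X = rng.normal(size=(n, p))
    w = np.zeros(p)
    w[:5] = 1.0
    y = X @ w + rng.normal(scale=2.0, size=n)

    def holdout_scores(X, y):
        """Fit on the first `split` rows, score on the later rows."""
        X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]
        ols = LinearRegression().fit(X_tr, y_tr)
        ridge = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_tr, y_tr)
        return (r2_score(y_te, ols.predict(X_te)),
                r2_score(y_te, ridge.predict(X_te)),
                ols.coef_)

    clean_ols, clean_ridge, _ = holdout_scores(X, y)

    # Simulated leak: one extra column that is the target plus a little noise.
    X_leaky = np.column_stack([X, y + rng.normal(scale=0.1, size=n)])
    leaky_ols, leaky_ridge, leaky_coefs = holdout_scores(X_leaky, y)

    print(f"Clean R^2: OLS={clean_ols:.2f}, Ridge={clean_ridge:.2f}")
    print(f"Leaky R^2: OLS={leaky_ols:.2f}, Ridge={leaky_ridge:.2f}")
    print("Largest |coefficient| is at column:", int(np.argmax(np.abs(leaky_coefs))))
    ```

    On the clean data the unregularized fit should trail Ridge clearly. Once the leaky column is added, both models score near-perfectly and the leaked column carries the largest coefficient, which is the pattern to look for.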
  </Accordion>
</AccordionGroup>
