Core idea: Penalize complexity! Instead of just minimizing prediction error, we minimize:

$$\text{Loss} = \text{Prediction Error} + \lambda \times \text{Model Complexity}$$

where λ is the regularization strength.

Trade-off:
λ = 0: No regularization, risk overfitting
λ → ∞: Maximum regularization, all weights shrink to zero and the model predicts the mean (assuming the intercept is not penalized)
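As a rough illustration of this trade-off, the sketch below fits scikit-learn's `Ridge` (where λ is called `alpha`) at several strengths and records the size of the weight vector. The synthetic dataset and the alpha grid are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data; sizes and noise level are illustrative assumptions.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=0)

alphas = [1e-3, 1.0, 1e3, 1e6]  # scikit-learn calls lambda "alpha"
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:g}: ||w|| = {norms[-1]:.4f}, "
          f"intercept = {model.intercept_:.2f}")

print(f"mean of y = {y.mean():.2f}")
```

As alpha grows, the weight norm shrinks and the intercept approaches the mean of `y`, which is the "predicts the mean" limit described above.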
L2 regularization (Ridge): Add the sum of squared weights to the loss:

$$\text{Loss}_{\text{Ridge}} = \text{MSE} + \lambda \sum_{j=1}^{p} w_j^2$$

Effect: Pushes weights toward zero, but never exactly zero. Creates “small” weights.
L1 regularization (Lasso): Add the sum of absolute weights to the loss:

$$\text{Loss}_{\text{Lasso}} = \text{MSE} + \lambda \sum_{j=1}^{p} |w_j|$$

Effect: Pushes weights toward zero, and some become exactly zero. Creates sparse models!
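One way to see why the absolute-value penalty produces exact zeros while the squared penalty does not is the single-weight case. Minimizing ½(w − z)² + λ|w| gives the soft-thresholding rule, while minimizing ½(w − z)² + λw² gives the proportional shrink w = z / (1 + 2λ). The sketch below uses this simplified one-dimensional objective rather than the full multi-feature loss, and the function names are illustrative.

```python
import numpy as np

def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w| (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2 (proportional shrinkage)."""
    return z / (1.0 + 2.0 * lam)

z = np.array([-3.0, -0.5, 0.2, 1.5])  # weights before any penalty is applied
lam = 1.0
print("L1:", l1_shrink(z, lam))
print("L2:", l2_shrink(z, lam))
```

Any weight whose magnitude is below λ is set to exactly zero by the L1 rule, whereas the L2 rule only rescales it.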
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Compare Ridge vs Lasso on feature selection
# Create data with many irrelevant features
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=42)

# Fit both
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)
ridge.fit(X, y)
lasso.fit(X, y)

# Compare coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].bar(range(20), ridge.coef_)
axes[0].set_title('Ridge Coefficients (all non-zero)')
axes[0].set_xlabel('Feature')
axes[0].axhline(y=0, color='r', linestyle='--')

axes[1].bar(range(20), lasso.coef_)
axes[1].set_title('Lasso Coefficients (many are exactly 0!)')
axes[1].set_xlabel('Feature')
axes[1].axhline(y=0, color='r', linestyle='--')

plt.tight_layout()
plt.show()

print(f"Ridge: {np.sum(ridge.coef_ != 0)} non-zero coefficients")
print(f"Lasso: {np.sum(lasso.coef_ != 0)} non-zero coefficients")
Trees don’t use L1/L2, but they have their own regularization:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Tree regularization parameters
tree = DecisionTreeClassifier(
    max_depth=5,            # Limit tree depth
    min_samples_split=10,   # Min samples to split
    min_samples_leaf=5,     # Min samples in leaf
    max_features='sqrt'     # Random subset of features
)

# Random Forest adds more regularization through bagging
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=2,
    max_features='sqrt'
)

# Gradient Boosting has learning rate as regularization
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,      # Smaller = more regularization
    max_depth=3,
    subsample=0.8           # Random sample of data
)
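To see what these constraints do in practice, the following sketch compares an unconstrained decision tree with one that uses the depth and leaf-size limits shown above. The noisy synthetic dataset (with `flip_y` randomly flipping a share of the labels) is an assumption made for illustration, and the exact accuracies will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y randomly flips a share of the labels.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No limits: the tree grows until every training sample is fit.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Depth and leaf-size limits act as the regularizer.
pruned = DecisionTreeClassifier(max_depth=5, min_samples_split=10,
                                min_samples_leaf=5,
                                random_state=0).fit(X_train, y_train)

for name, tree in [("unconstrained", full), ("regularized", pruned)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")
```

A large gap between training and test accuracy for the unconstrained tree is the usual symptom of overfitting that these parameters are meant to limit.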
You deployed a Ridge regression model and performance degrades after a few months. How would you diagnose whether the issue is regularization-related or something else entirely?
The first thing I would do is separate the problem into two categories: is the model itself stale, or has the data distribution shifted underneath it?
Check for data drift first. Compare the feature distributions in recent production data against the training data. If the input distribution has changed significantly — say a new customer segment emerged or seasonality shifted — the model may be fine but the world moved. Use a KS test or population stability index on each feature.
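A minimal sketch of such a per-feature drift check follows, using `scipy.stats.ks_2samp` for the KS test and a hand-written population stability index. The `psi` helper, its quantile binning, and the simulated shift are assumptions for illustration, not a standard library implementation.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index using quantile bins from `expected`."""
    # Interior bin edges come from the training (expected) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])
    e_frac = np.bincount(np.searchsorted(edges, expected),
                         minlength=n_bins) / len(expected)
    a_frac = np.bincount(np.searchsorted(edges, actual),
                         minlength=n_bins) / len(actual)
    # Clip to avoid log(0) when a bin is empty.
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 2))    # stand-in for training features
recent = rng.normal(size=(2000, 2))   # stand-in for recent production features
recent[:, 1] += 0.5                   # simulate drift in the second feature only

results = {}
for j in range(train.shape[1]):
    ks = ks_2samp(train[:, j], recent[:, j])
    results[j] = (ks.pvalue, psi(train[:, j], recent[:, j]))
    print(f"feature {j}: KS p-value={ks.pvalue:.3g}, PSI={results[j][1]:.3f}")
```

In this synthetic setup only the shifted feature should stand out. With real data, the thresholds used to flag a feature are a judgment call.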
Examine the coefficient magnitudes. If you are using Ridge, none of the coefficients go to zero, so all original features are still in play. Pull the learned coefficients and compare them against feature importance from a freshly-trained model on recent data. If the ranking has changed dramatically, the regularization strength you chose at training time may no longer be appropriate for the new feature relationships.
Re-run cross-validation on recent data with the same lambda. If the CV score on recent data is much worse than the original CV score, the data has drifted. If the CV score is still good but production metrics are bad, the issue is likely in the serving pipeline — preprocessing skew, missing features, or an encoding mismatch.
Try retraining with RidgeCV on fresh data. If the optimal alpha changes substantially (say from 1.0 to 100.0), that tells you the signal-to-noise ratio in the data has changed, and your original regularization strength is no longer calibrated.
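The comparison could look roughly like this. The two synthetic datasets stand in for the original training data and noisier recent data, and the alpha grid is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n, p = 300, 30
w = rng.normal(size=p)  # shared "true" weights for both periods

def make_data(noise_std):
    """Generate features and a linear target with the given noise level."""
    X = rng.normal(size=(n, p))
    return X, X @ w + rng.normal(scale=noise_std, size=n)

X_old, y_old = make_data(noise_std=0.5)    # stand-in for original training data
X_new, y_new = make_data(noise_std=10.0)   # stand-in for noisier recent data

alphas = np.logspace(-3, 3, 13)
alpha_old = RidgeCV(alphas=alphas).fit(X_old, y_old).alpha_
alpha_new = RidgeCV(alphas=alphas).fit(X_new, y_new).alpha_
print(f"alpha on old data: {alpha_old:g}, alpha on recent data: {alpha_new:g}")
```

A shift of one or two grid steps is probably noise, while a jump of several orders of magnitude is the kind of change worth investigating.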
Follow-up: When would you switch from Ridge to Lasso after seeing this kind of degradation?

If the retrained model on fresh data shows that many features now have near-zero coefficients, meaning the signal has concentrated into fewer predictors, switching to Lasso makes sense. Lasso enforces sparsity, which is effectively automatic feature selection. In practice, I would try ElasticNet first since it gives you the sparsity of L1 with the stability of L2, and use cross-validation to find the optimal l1_ratio. The key signal is: if your Ridge model is spreading weight across 50 features but only 8 of them actually matter for the current data regime, Lasso or ElasticNet will give you a simpler, more robust model that is easier to monitor and explain.
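A minimal sketch of tuning `l1_ratio` by cross-validation with scikit-learn's `ElasticNetCV` follows. The dataset mirrors the "50 features, 8 that matter" example, and the candidate `l1_ratio` grid is an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# 50 features of which 8 carry signal, echoing the example in the text.
X, y = make_regression(n_samples=300, n_features=50, n_informative=8,
                       noise=10, random_state=0)

l1_ratios = [0.1, 0.5, 0.9, 1.0]  # 1.0 is pure Lasso; the grid is an assumption
model = ElasticNetCV(l1_ratio=l1_ratios, cv=5).fit(X, y)

n_active = int(np.sum(model.coef_ != 0))
print(f"chosen l1_ratio={model.l1_ratio_}, alpha={model.alpha_:.4f}, "
      f"non-zero coefficients: {n_active} of {X.shape[1]}")
```

If cross-validation settles near `l1_ratio=1.0`, that suggests a pure Lasso would do about as well. A value near the low end suggests the Ridge-like behavior is still helping.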
An interviewer shows you two models: one with L1 regularization and one with L2, both achieving similar test accuracy. Which do you deploy and why?
Similar accuracy is not enough information to make this decision. I would ask several follow-up questions, but here is how I think about it:
Interpretability requirements. If stakeholders need to understand which features drive predictions — common in healthcare, finance, and compliance settings — the L1 model wins. Lasso produces sparse coefficients, so you can say “these 7 features matter, the rest do not.” L2 keeps all features active, making the explanation messier.
Feature stability over time. L1 models are sensitive to correlated features — if two features are highly correlated, Lasso will arbitrarily pick one and zero out the other. If feature availability or correlation structure changes in production, the L1 model may behave unpredictably. L2 is more stable because it distributes weight across correlated features.
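The behavior with correlated features can be reproduced on a toy problem with two nearly identical columns. This is a sketch under assumed noise levels and penalty strengths. How unevenly Lasso splits the weight depends on those choices, so the printed coefficients are illustrative rather than guaranteed.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * rng.normal(size=n)  # only x1 drives the target

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge tends to share the weight; Lasso tends to concentrate it on one column.
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
```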
Inference cost. If the L1 model zeroed out 80% of features, you only need to compute and transmit 20% of the features at inference time. At scale, this reduces latency and infrastructure cost. For a model serving millions of requests per day, fewer features means real savings.
Monitoring burden. Fewer active features (L1) means fewer things to monitor for drift. But it also means a single drifting feature has a bigger impact on predictions.
My default in production: if accuracy is truly similar and feature stability is not a concern, I lean toward L1 for the operational simplicity. But if the feature space has known multicollinearity, I would use ElasticNet or stick with L2.

Follow-up: How does the choice change if you are working in a pipeline where feature computation is expensive?

This strongly favors L1. If computing a feature requires a database query, an API call, or a complex aggregation, every feature you can eliminate from the model is infrastructure you do not need to maintain. I have seen production systems where the feature store was the bottleneck, not the model inference. In that scenario, a Lasso model that uses 10 features versus a Ridge model that uses 100 features can mean the difference between 20ms and 200ms prediction latency, and that is a business-critical difference.
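To make the inference-cost argument concrete, the sketch below extracts the indices of the features a fitted Lasso actually uses and checks that scoring with only those columns reproduces the full model's predictions. The dataset and the alpha value are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)

# Indices of the features the model actually uses.
active = np.flatnonzero(lasso.coef_)
print(f"{active.size} of {X.shape[1]} features are needed at inference time")

# Scoring with only the active columns reproduces the full model's predictions,
# so the zeroed-out features never need to be computed or transmitted.
reduced_pred = X[:, active] @ lasso.coef_[active] + lasso.intercept_
print("matches full model:", np.allclose(reduced_pred, lasso.predict(X)))
```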
How would you explain the difference between L1 and L2 regularization to a non-technical product manager who needs to approve your model choice?
I would use a hiring analogy. Imagine you are building a team to solve a problem:
L2 (Ridge) is like keeping everyone on the team but limiting how much each person works. Nobody gets fired, but everyone is told to contribute a little less. The result: a balanced team where everyone does a small part. The upside is stability — if one person calls in sick, others can compensate. The downside is that you are paying salary for people who contribute almost nothing.
L1 (Lasso) is like running a layoff based on performance. People who are not contributing get removed entirely. The team gets smaller and more focused. The upside is efficiency and clarity — you know exactly who matters. The downside is that if you fired the wrong person, there is nobody to cover for them.
Then I would connect it to the business context: “For our fraud detection model, I recommend L1 because it will tell us the 5 key signals that predict fraud, which the investigations team can act on. If I used L2, I would hand them a list of 50 factors, each with a tiny contribution, and that is not actionable.”

Follow-up: How do you handle a product manager who says “just use both”?

That is actually a real technique called ElasticNet. It combines both L1 and L2 penalties, and you tune a ratio parameter that controls how much of each you use. I would frame it as: “We can start with a blended approach, and the cross-validation process will automatically find the right balance between keeping all features and selecting the most important ones. The data will tell us the right answer.” This turns a binary decision into a continuous optimization, which is usually the right engineering instinct.
You are building a time series forecasting model with 200 engineered features. How would regularization strategy differ from a standard classification problem?
Time series adds several wrinkles that change how I think about regularization:
Temporal autocorrelation in features. Many engineered features in time series are lagged versions of each other (lag_1, lag_2, lag_7, etc.). These are highly correlated by construction. Pure L1 regularization will arbitrarily pick one lag and zero out others, which can make the model fragile if the most predictive lag shifts. I would default to ElasticNet here, or use L2 with aggressive feature selection as a separate preprocessing step.
Feature importance changes over time. The features that mattered last quarter may not matter this quarter. I would use a sliding-window retraining approach with regularization, and I would monitor whether the set of non-zero features (in L1) or the coefficient magnitudes (in L2) are stable across retraining windows. Large shifts signal regime change.
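One possible way to monitor this is to refit a Lasso on each sliding window and measure how much the set of non-zero features overlaps between consecutive windows. The sketch below uses synthetic data with an artificial regime change halfway through. The helper names, window sizes, and alpha are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 600, 20
X = rng.normal(size=(n, p))

# Synthetic regime change: features 0-2 drive y early, features 3-5 later.
w_early = np.zeros(p)
w_early[:3] = 2.0
w_late = np.zeros(p)
w_late[3:6] = 2.0
signal = np.where(np.arange(n) < n // 2, X @ w_early, X @ w_late)
y = signal + 0.1 * rng.normal(size=n)

def active_set(X_win, y_win, alpha=0.2):
    """Indices of non-zero Lasso coefficients for one training window."""
    return set(np.flatnonzero(Lasso(alpha=alpha).fit(X_win, y_win).coef_))

def jaccard(a, b):
    """Overlap between two index sets (1.0 means identical)."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

window, step = 200, 100
starts = list(range(0, n - window + 1, step))
sets = [active_set(X[s:s + window], y[s:s + window]) for s in starts]
overlaps = [jaccard(a, b) for a, b in zip(sets, sets[1:])]

for s, o in zip(starts[1:], overlaps):
    print(f"window starting at {s}: overlap with previous window = {o:.2f}")
```

A sudden drop in overlap between consecutive windows is the kind of large shift that would prompt a closer look for regime change.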
Multicollinearity from rolling statistics. If you engineer rolling_mean_7, rolling_mean_14, and rolling_mean_30, these are naturally correlated. Ridge handles this gracefully by sharing weight. Lasso will unpredictably drop some, which may break the model when the short-term vs long-term pattern changes.
The regularization strength should be tuned with TimeSeriesSplit, never random CV. This is critical. If you use random cross-validation to select lambda, you are letting future information influence the regularization choice, which inflates the perceived model quality.
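A minimal sketch of tuning the regularization strength with time-ordered folds uses scikit-learn's `TimeSeriesSplit` inside `GridSearchCV`. The synthetic data and alpha grid are assumptions, and the rows are assumed to already be sorted by time.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
n, p = 400, 10
X = rng.normal(size=(n, p))  # rows are assumed to be in time order
y = X[:, 0] - 0.5 * X[:, 1] + 0.5 * rng.normal(size=n)

tscv = TimeSeriesSplit(n_splits=5)

# Each fold trains on the past and validates on the block that follows it.
for train_idx, test_idx in tscv.split(X):
    assert train_idx.max() < test_idx.min()

alphas = np.logspace(-3, 3, 13)
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=tscv,
                      scoring="neg_mean_squared_error").fit(X, y)
best_alpha = search.best_params_["alpha"]
print("alpha selected with time-ordered folds:", best_alpha)
```

Because every validation block lies strictly after its training block, the selected alpha is never influenced by observations from the future.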
In practice, for 200 time series features, I would first use Lasso to identify the top 30-50 features, then retrain with Ridge or ElasticNet on that reduced set. This two-stage approach gives you the sparsity benefit without the instability risk.

Follow-up: How would you detect if your regularization is masking a data leakage problem in the time series features?

The telltale sign is when an unregularized model performs almost as well as the regularized one on the test set. In a legitimate scenario with 200 features and noise, unregularized models should overfit badly. If they do not, it likely means some feature is leaking future information so strongly that even a simple model can exploit it. I would check feature importance and look for any feature with disproportionately high weight, especially any feature derived from rolling windows or aggregations that might accidentally include future timestamps.
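The two-stage approach from this answer (Lasso to select, Ridge to refit) can be sketched as a single scikit-learn pipeline. The synthetic data is a stand-in that does not reproduce the correlation structure of real lagged features, and the alpha values and the cap of 50 features are assumptions that would normally be tuned with time-ordered cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for 200 engineered features (uncorrelated, unlike real lag features).
X, y = make_regression(n_samples=500, n_features=200, n_informative=30,
                       noise=10, random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    # Stage 1: keep features with non-zero Lasso weight, capped at 50.
    SelectFromModel(Lasso(alpha=1.0), max_features=50),
    # Stage 2: refit a Ridge model on the surviving features only.
    Ridge(alpha=1.0),
)
pipe.fit(X, y)

n_selected = int(pipe.named_steps["selectfrommodel"].get_support().sum())
print(f"Stage 1 kept {n_selected} of {X.shape[1]} features")
```

Keeping both stages in one pipeline means the feature selection is refit inside every cross-validation fold, which avoids selecting features using data from the validation block.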