Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Convex Optimization
Why Some Problems Are “Easy”
In gradient descent, we worry about getting stuck in local minima. But for some problems, every local minimum is the global minimum. These are convex optimization problems, and they’re the foundation of many ML algorithms. Here is a good analogy: imagine you are dropped somewhere on a landscape and told to find the lowest point. If the landscape is a bowl (convex), any downhill path leads to the same bottom. You cannot get tricked by false valleys. But if the landscape is mountainous (non-convex, like a neural network loss surface), you might walk into a small valley and think you have found the bottom, when the real bottom is on the other side of a ridge you cannot see. This distinction has enormous practical consequences. For convex problems (linear regression, logistic regression, SVMs), you can prove that your solution is globally optimal. For non-convex problems (deep neural networks), you can only hope you found a “good enough” local minimum — and much of modern deep learning research is about ensuring that “good enough” is actually very good.Difficulty: Intermediate
Prerequisites: Gradient Descent, Optimization modules
What You’ll Learn: When optimization is easy, and how to leverage it
What Is Convexity?
Convex Sets
A set is convex if the line segment between any two points in lies entirely in .Convex Functions
A function is convex if the line segment between any two points on the graph lies above or on the graph. Mathematically: For all and :Testing for Convexity
Second Derivative Test
For a single-variable function, convexity testing is straightforward: compute the second derivative and check its sign. If the second derivative is non-negative everywhere, the function is convex. This is the mathematical version of checking whether a surface always curves “upward” like a bowl. For twice-differentiable functions:| Function Type | Condition | Visual Intuition |
|---|---|---|
| Convex | everywhere | Always curves upward (bowl) |
| Strictly Convex | everywhere | Curves upward with no flat regions |
| Concave | everywhere | Always curves downward (hill) |
Hessian Test (Multivariable)
For functions of multiple variables, check if the Hessian (matrix of second derivatives) is positive semi-definite.Why Convexity Matters
Theorem: Every Local Minimum Is Global
For convex functions, gradient descent is guaranteed to find the global minimum.Common Convex Functions in ML
Loss Functions
| Loss | Formula | Convexity |
|---|---|---|
| MSE | Convex ✓ | |
| MAE | Convex ✓ | |
| Cross-Entropy | Convex ✓ | |
| Hinge | Convex ✓ | |
| Huber | Combination | Convex ✓ |
Regularization
| Regularizer | Formula | Convexity |
|---|---|---|
| L2 (Ridge) | Strictly Convex ✓ | |
| L1 (Lasso) | Convex ✓ | |
| Elastic Net | Convex ✓ |
Linear Regression: A Convex Problem
Linear regression with MSE loss is convex: This is a quadratic function in , which is always convex!Strong Convexity: Even Better Guarantees
A function is strongly convex with parameter if: Why it matters: Strong convexity guarantees faster convergence!| Property | Convex | Strongly Convex |
|---|---|---|
| Unique minimum? | Maybe | Yes! |
| Convergence rate | (exponential!) | |
| Effect of L2 regularization | Makes strongly convex | Increases |
Practice Exercises
Exercise 1: Prove Convexity
Exercise 1: Prove Convexity
Exercise 2: L1 vs L2
Exercise 2: L1 vs L2
Exercise 3: Making Problems Convex
Exercise 3: Making Problems Convex
Summary
| Concept | What It Means | Practical Impact |
|---|---|---|
| Convex Function | Bowl-shaped, no local minima | Gradient descent finds global optimum |
| Convex Set | ”Solid” without holes | Feasible region for optimization |
| Strongly Convex | ”Very bowl-shaped” | Faster convergence |
| L2 Regularization | Makes function more convex | Improves optimization |
| Non-Convex | Multiple minima | No global optimum guarantee |
Interview Deep-Dive
Neural networks are non-convex optimization problems. If convexity guarantees finding the global minimum, why do neural networks work at all?
Neural networks are non-convex optimization problems. If convexity guarantees finding the global minimum, why do neural networks work at all?
- This is one of the most important open questions in deep learning theory, and the honest answer is that we do not have a complete theoretical understanding. But we have strong empirical and theoretical insights.
- First, the loss landscape of overparameterized neural networks (more parameters than training examples) is surprisingly benign. Choromanska et al. (2015) showed that for large networks, the loss values at different local minima are clustered close together — there are many local minima, but they are nearly as good as the global minimum. Getting stuck in a “bad” local minimum becomes increasingly rare as the network gets larger.
- Second, saddle points are more common than local minima in high dimensions, and modern optimizers (SGD with noise, momentum, Adam) handle saddle points well. The gradient noise from mini-batch training naturally perturbs the optimizer away from saddle points.
- Third, overparameterization creates a phenomenon where the loss surface has large flat regions at near-zero loss. With enough parameters, there are many weight configurations that fit the training data perfectly, and gradient descent can find one of them relatively easily. The challenge shifts from “finding a minimum” to “finding a minimum that generalizes” — which is a regularization question, not a convexity question.
- Fourth, the architecture matters. Residual connections, normalization layers, and careful initialization all smooth the loss landscape. Li et al. (2018) visualized the loss surfaces of ResNets versus plain networks and showed that skip connections dramatically smoothen the landscape.
- The practical takeaway: we trade convexity guarantees for representational power. A convex model (like logistic regression) is easy to optimize but limited in what it can learn. A neural network can learn arbitrarily complex functions but has no global optimality guarantee. In practice, the optimization works well enough, and the expressiveness gain far outweighs the theoretical risk.
Explain the difference between L1 and L2 regularization from a convex optimization perspective. Why does L1 produce sparse solutions while L2 does not?
Explain the difference between L1 and L2 regularization from a convex optimization perspective. Why does L1 produce sparse solutions while L2 does not?
- Both L1 and L2 regularization add a convex penalty to the loss function, which helps with both optimization (makes the problem more convex) and generalization (reduces overfitting). But their geometric properties are fundamentally different.
- L2 regularization adds lambda * ||w||^2. This is a smooth, strictly convex function. Its gradient is 2lambdaw, which pushes weights toward zero proportionally to their current value. Large weights get large pushes, small weights get small pushes. Weights shrink but never exactly reach zero.
- L1 regularization adds lambda * ||w||_1 = lambda * sum(|w_i|). This is convex but NOT smooth — it has a kink at w_i = 0. The subgradient is lambda * sign(w_i), which pushes ALL weights toward zero with the same force, regardless of magnitude. Small weights are actively driven to exactly zero, producing sparsity.
- Geometrically, L2’s constraint is a sphere, and loss contours typically touch it where no coordinates are exactly zero. L1’s constraint is a diamond with corners on the axes. The loss contours are much more likely to touch the diamond at a corner, which corresponds to one or more coordinates being exactly zero. In higher dimensions, the L1 diamond has even more corners relative to its surface area, making sparsity increasingly likely.
- Practical implications: L1 is used when you want automatic feature selection (LASSO identifies important features by zeroing out irrelevant ones). L2 is used when you want to keep all features but prevent any single feature from dominating. Elastic Net combines both: alpha * L1 + (1-alpha) * L2, getting sparsity plus the grouping effect for correlated features.
How does adding L2 regularization change the optimization landscape? Specifically, how does it relate to strong convexity and convergence speed?
How does adding L2 regularization change the optimization landscape? Specifically, how does it relate to strong convexity and convergence speed?
- L2 regularization adds lambda * ||w||^2 to the loss. The Hessian of this term is 2lambdaI (a scaled identity matrix). So the Hessian of the regularized loss is H_original + 2lambdaI. This shifts ALL eigenvalues upward by 2*lambda.
- If the original loss has minimum Hessian eigenvalue mu_min (which could be zero or negative), the regularized loss has minimum eigenvalue mu_min + 2*lambda. If lambda is large enough, all eigenvalues become positive, making the loss strongly convex.
- Strong convexity with parameter mu = 2lambda gives exponential convergence: O(exp(-mut)) instead of O(1/t) for merely convex functions. The condition number (ratio of largest to smallest eigenvalue) also improves because the denominator increases by 2*lambda.
- For non-convex neural networks, L2 regularization does not make the landscape convex (the non-convexity from activations dominates). But it flattens the landscape by penalizing large weights, eliminating sharp minima and favoring flatter solutions. It also prevents weight explosion for numerical stability.
- An important subtlety: in Adam, L2 regularization and weight decay are NOT equivalent. The L2 gradient (2lambdaw) gets divided by sqrt(v) in Adam, so parameters with large historical gradients receive less effective regularization. AdamW applies weight decay directly to the weights, preserving the intended regularization strength. This distinction matters significantly for transformer training.
Your manager asks you to choose between logistic regression and a neural network for a classification task with 50 features and 10,000 samples. Frame your answer in terms of convexity and optimization guarantees.
Your manager asks you to choose between logistic regression and a neural network for a classification task with 50 features and 10,000 samples. Frame your answer in terms of convexity and optimization guarantees.
- Logistic regression with cross-entropy loss is a convex optimization problem. The loss function is strictly convex (with L2 regularization), meaning gradient descent is guaranteed to find the unique global minimum. No local minima, no saddle points, no sensitivity to initialization. You can solve it with off-the-shelf solvers (L-BFGS, coordinate descent) in seconds. The solution is deterministic and reproducible.
- A neural network’s loss is non-convex. Different random initializations lead to different solutions. Training involves hyperparameter tuning (learning rate, architecture, batch size), takes longer, and requires more compute. There is no guarantee of finding the globally optimal solution.
- For 50 features and 10,000 samples, I would start with logistic regression. At this scale, the convex optimizer will find the provably optimal linear decision boundary in under a second. If the accuracy is sufficient for the business requirement, stop here. You get interpretability (feature importances from coefficients), reproducibility, fast inference, and easy deployment.
- I would switch to a neural network only if: (1) the logistic regression accuracy is insufficient, suggesting the decision boundary is nonlinear; (2) the features have complex interactions that a linear model cannot capture; or (3) domain knowledge suggests the data has hierarchical structure that benefits from learned representations.
- The key insight for the interview: choosing the right model is not about picking the most powerful one — it is about matching the model’s complexity to the problem’s complexity. A convex model that solves the problem is strictly better than a non-convex one, because you get the same accuracy with stronger guarantees, faster training, easier debugging, and simpler deployment.