$Convex Optimization$

Convex Optimization

Why Some Problems Are “Easy”

In gradient descent, we worry about getting stuck in local minima. But for some problems, every local minimum is the global minimum. These are convex optimization problems, and they’re the foundation of many ML algorithms.

The Catch: Neural networks are NOT convex. But understanding convexity helps you:

Know when you’re guaranteed to find the global optimum
Design better loss functions
Understand why some algorithms converge faster

Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Gradient Descent, Optimization modules
What You’ll Learn: When optimization is easy, and how to leverage it

What Is Convexity?

Convex Sets

A set

C

is convex if the line segment between any two points in

C

lies entirely in

C

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon, Circle

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Convex set: Circle
theta = np.linspace(0, 2*np.pi, 100)
circle = plt.Circle((0, 0), 1, fill=True, alpha=0.3, color='blue')
axes[0].add_patch(circle)
axes[0].plot([-.7, .5], [-.5, .7], 'r-', linewidth=2, label='Line segment inside')
axes[0].scatter([-.7, .5], [-.5, .7], c='red', s=100, zorder=5)
axes[0].set_xlim(-1.5, 1.5)
axes[0].set_ylim(-1.5, 1.5)
axes[0].set_title('Convex: Circle (Yes)')
axes[0].set_aspect('equal')
axes[0].legend()

# Convex set: Triangle
triangle = Polygon([[-1, -0.5], [1, -0.5], [0, 1]], alpha=0.3, color='blue')
axes[1].add_patch(triangle)
axes[1].plot([-.5, .3], [-.2, .5], 'r-', linewidth=2, label='Line segment inside')
axes[1].scatter([-.5, .3], [-.2, .5], c='red', s=100, zorder=5)
axes[1].set_xlim(-1.5, 1.5)
axes[1].set_ylim(-1, 1.5)
axes[1].set_title('Convex: Triangle (Yes)')
axes[1].set_aspect('equal')
axes[1].legend()

# Non-convex set: Crescent
t = np.linspace(0, 2*np.pi, 100)
x1 = np.cos(t)
y1 = np.sin(t)
x2 = 0.5 + 0.7*np.cos(t)
y2 = 0.7*np.sin(t)
axes[2].fill(x1, y1, alpha=0.3, color='blue')
axes[2].fill(x2, y2, color='white')
axes[2].plot([-.8, .2], [0, 0], 'r-', linewidth=2, label='Line segment OUTSIDE!')
axes[2].scatter([-.8, .2], [0, 0], c='red', s=100, zorder=5)
axes[2].set_xlim(-1.5, 1.5)
axes[2].set_ylim(-1.5, 1.5)
axes[2].set_title('Non-Convex: Crescent (No)')
axes[2].set_aspect('equal')
axes[2].legend()

plt.tight_layout()
plt.show()

Convex Functions

A function

f

is convex if the line segment between any two points on the graph lies above or on the graph. Mathematically: For all

x, y

and

\lambda \in [0, 1]

f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y)

# Visualize convex vs non-convex functions
x = np.linspace(-3, 3, 100)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Convex: Quadratic
y_quad = x**2
axes[0].plot(x, y_quad, 'b-', linewidth=2)
# Show the "line above" property
x1, x2 = -2, 1.5
y1, y2 = x1**2, x2**2
axes[0].plot([x1, x2], [y1, y2], 'r--', linewidth=2, label='Line segment')
axes[0].scatter([x1, x2], [y1, y2], c='red', s=100, zorder=5)
# Show a point in between
x_mid = 0.5 * x1 + 0.5 * x2
axes[0].scatter([x_mid], [x_mid**2], c='green', s=100, zorder=5, label='Function value')
axes[0].scatter([x_mid], [0.5*y1 + 0.5*y2], c='orange', s=100, zorder=5, label='Line value')
axes[0].set_title('Convex: f(x) = x^2 (Yes)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Convex: Exponential
y_exp = np.exp(x)
axes[1].plot(x, y_exp, 'b-', linewidth=2)
x1, x2 = -2, 1
y1, y2 = np.exp(x1), np.exp(x2)
axes[1].plot([x1, x2], [y1, y2], 'r--', linewidth=2)
axes[1].scatter([x1, x2], [y1, y2], c='red', s=100, zorder=5)
axes[1].set_title('Convex: f(x) = exp(x) (Yes)')
axes[1].grid(True, alpha=0.3)

# Non-convex: Sine
y_sin = np.sin(x)
axes[2].plot(x, y_sin, 'b-', linewidth=2)
x1, x2 = -2, 2
y1, y2 = np.sin(x1), np.sin(x2)
axes[2].plot([x1, x2], [y1, y2], 'r--', linewidth=2)
axes[2].scatter([x1, x2], [y1, y2], c='red', s=100, zorder=5)
axes[2].set_title('Non-Convex: f(x) = sin(x) (No)')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Testing for Convexity

Second Derivative Test

For twice-differentiable functions:

Function Type	Condition
Convex	$f''(x) \geq 0$ everywhere
Strictly Convex	$f''(x) > 0$ everywhere
Concave	$f''(x) \leq 0$ everywhere

def is_convex(f, f_double_prime, x_range):
    """Check if f is convex over x_range."""
    x = np.linspace(x_range[0], x_range[1], 1000)
    second_derivs = f_double_prime(x)
    
    return np.all(second_derivs >= -1e-10)  # Allow small numerical errors

# Test some functions
functions = [
    ("x²", lambda x: x**2, lambda x: 2*np.ones_like(x)),
    ("eˣ", lambda x: np.exp(x), lambda x: np.exp(x)),
    ("x³", lambda x: x**3, lambda x: 6*x),
    ("|x|", lambda x: np.abs(x), lambda x: np.zeros_like(x)),  # f''=0 except at 0
    ("sin(x)", lambda x: np.sin(x), lambda x: -np.sin(x)),
]

print("Convexity Check (on [-5, 5]):")
print("=" * 40)
for name, f, f_pp in functions:
    result = is_convex(f, f_pp, [-5, 5])
    print(f"{name:10} : {'Convex ✓' if result else 'Non-convex ✗'}")

Hessian Test (Multivariable)

For functions of multiple variables, check if the Hessian (matrix of second derivatives) is positive semi-definite.

from scipy.optimize import minimize

def check_hessian_pd(H):
    """Check if matrix H is positive semi-definite."""
    eigenvalues = np.linalg.eigvals(H)
    return np.all(eigenvalues >= -1e-10)

# Example: f(x,y) = x² + y² + xy
# Hessian = [[2, 1], [1, 2]]
H = np.array([[2, 1], [1, 2]])

eigenvalues = np.linalg.eigvals(H)
print(f"Hessian eigenvalues: {eigenvalues}")
print(f"Is convex: {check_hessian_pd(H)}")

# Non-convex example: saddle point
# f(x,y) = x² - y²
# Hessian = [[2, 0], [0, -2]]
H_saddle = np.array([[2, 0], [0, -2]])
eigenvalues_saddle = np.linalg.eigvals(H_saddle)
print(f"\nSaddle Hessian eigenvalues: {eigenvalues_saddle}")
print(f"Is convex: {check_hessian_pd(H_saddle)}")

Why Convexity Matters

Theorem: Every Local Minimum Is Global

For convex functions, gradient descent is guaranteed to find the global minimum.

# Demonstrate convergence to global minimum for convex function
def convex_function(x):
    return x[0]**2 + x[1]**2 + x[0]*x[1] + x[0] + 2*x[1]

def convex_gradient(x):
    return np.array([2*x[0] + x[1] + 1, 2*x[1] + x[0] + 2])

# Multiple random starting points
np.random.seed(42)
n_starts = 10
starting_points = np.random.randn(n_starts, 2) * 5

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot contours
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2 + X*Y + X + 2*Y

axes[0].contour(X, Y, Z, levels=30)
axes[0].set_title('Convex Function: All Paths Lead to Same Minimum')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')

# Gradient descent from each starting point
all_endpoints = []
for i, start in enumerate(starting_points):
    path = [start.copy()]
    x = start.copy()
    lr = 0.1
    
    for _ in range(50):
        x = x - lr * convex_gradient(x)
        path.append(x.copy())
    
    path = np.array(path)
    axes[0].plot(path[:, 0], path[:, 1], 'o-', markersize=2, alpha=0.7)
    all_endpoints.append(path[-1])

all_endpoints = np.array(all_endpoints)
axes[0].scatter(all_endpoints[:, 0], all_endpoints[:, 1], c='red', s=100, zorder=10, label='Final points')
axes[0].legend()

# Show that all endpoints are the same
print("All endpoints converge to the same point:")
print(f"Mean: {all_endpoints.mean(axis=0)}")
print(f"Std:  {all_endpoints.std(axis=0)}")

# Non-convex comparison
def nonconvex_function(x):
    return np.sin(x[0]) + np.sin(x[1]) + 0.1*(x[0]**2 + x[1]**2)

def nonconvex_gradient(x):
    return np.array([np.cos(x[0]) + 0.2*x[0], np.cos(x[1]) + 0.2*x[1]])

# Plot non-convex
Z_nc = np.sin(X) + np.sin(Y) + 0.1*(X**2 + Y**2)
axes[1].contour(X, Y, Z_nc, levels=30)
axes[1].set_title('Non-Convex Function: Different Minima!')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')

all_endpoints_nc = []
for start in starting_points:
    path = [start.copy()]
    x = start.copy()
    lr = 0.1
    
    for _ in range(100):
        x = x - lr * nonconvex_gradient(x)
        path.append(x.copy())
    
    path = np.array(path)
    axes[1].plot(path[:, 0], path[:, 1], 'o-', markersize=2, alpha=0.7)
    all_endpoints_nc.append(path[-1])

all_endpoints_nc = np.array(all_endpoints_nc)
axes[1].scatter(all_endpoints_nc[:, 0], all_endpoints_nc[:, 1], c='red', s=100, zorder=10)

plt.tight_layout()
plt.show()

print("\nNon-convex endpoints (different!):")
print(all_endpoints_nc.round(3))

Common Convex Functions in ML

Loss Functions

Loss	Formula	Convexity
MSE	$(y - \hat{y})^2$	Convex ✓
MAE	$\\|y - \hat{y}\\|$	Convex ✓
Cross-Entropy	$-y\log(\hat{y})$	Convex ✓
Hinge	$\max(0, 1-y\hat{y})$	Convex ✓
Huber	Combination	Convex ✓

Regularization

Regularizer	Formula	Convexity
L2 (Ridge)	$\sum w_i^2$	Strictly Convex ✓
L1 (Lasso)	$\sum \\|w_i\\|$	Convex ✓
Elastic Net	$\alpha \sum \\|w_i\\| + (1-\alpha) \sum w_i^2$	Convex ✓

# Visualize common loss functions
x = np.linspace(-3, 3, 500)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# MSE
axes[0, 0].plot(x, x**2, 'b-', linewidth=2)
axes[0, 0].set_title('MSE: (y - ŷ)²')
axes[0, 0].grid(True, alpha=0.3)

# MAE
axes[0, 1].plot(x, np.abs(x), 'b-', linewidth=2)
axes[0, 1].set_title('MAE: |y - ŷ|')
axes[0, 1].grid(True, alpha=0.3)

# Hinge loss
axes[0, 2].plot(x, np.maximum(0, 1-x), 'b-', linewidth=2)
axes[0, 2].set_title('Hinge: max(0, 1 - yŷ)')
axes[0, 2].grid(True, alpha=0.3)

# Cross entropy (for p ∈ (0,1))
p = np.linspace(0.01, 0.99, 500)
axes[1, 0].plot(p, -np.log(p), 'b-', linewidth=2, label='-log(p)')
axes[1, 0].plot(p, -np.log(1-p), 'r--', linewidth=2, label='-log(1-p)')
axes[1, 0].set_title('Cross-Entropy Components')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Huber loss
delta = 1.0
huber = np.where(np.abs(x) <= delta, 0.5*x**2, delta*(np.abs(x) - 0.5*delta))
axes[1, 1].plot(x, huber, 'b-', linewidth=2)
axes[1, 1].plot(x, x**2, 'r--', alpha=0.5, label='MSE')
axes[1, 1].plot(x, np.abs(x), 'g--', alpha=0.5, label='MAE')
axes[1, 1].set_title('Huber Loss (δ=1)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# L1 vs L2 regularization
w = np.linspace(-3, 3, 500)
axes[1, 2].plot(w, np.abs(w), 'b-', linewidth=2, label='L1 (Lasso)')
axes[1, 2].plot(w, w**2, 'r-', linewidth=2, label='L2 (Ridge)')
axes[1, 2].set_title('Regularization Penalties')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Linear Regression: A Convex Problem

Linear regression with MSE loss is convex:

L(w) = \sum_{i=1}^{n} (y_i - w^T x_i)^2

This is a quadratic function in

w

, which is always convex!

from mpl_toolkits.mplot3d import Axes3D

# 2D linear regression to visualize the loss surface
np.random.seed(42)
n = 50
X = np.random.randn(n, 2)
true_w = np.array([2.0, -1.0])
y = X @ true_w + np.random.randn(n) * 0.5

def mse_loss(w):
    return np.mean((y - X @ w)**2)

# Plot the loss surface
w0_range = np.linspace(-1, 5, 100)
w1_range = np.linspace(-4, 2, 100)
W0, W1 = np.meshgrid(w0_range, w1_range)

Z = np.zeros_like(W0)
for i in range(len(w0_range)):
    for j in range(len(w1_range)):
        Z[j, i] = mse_loss(np.array([w0_range[i], w1_range[j]]))

fig = plt.figure(figsize=(14, 6))

# 3D surface
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot_surface(W0, W1, Z, cmap='viridis', alpha=0.8)
ax1.set_xlabel('w₀')
ax1.set_ylabel('w₁')
ax1.set_zlabel('MSE Loss')
ax1.set_title('MSE Loss Surface (Convex Bowl)')
ax1.scatter([true_w[0]], [true_w[1]], [mse_loss(true_w)], c='red', s=200, label='Optimum')

# Contour plot with gradient descent paths
ax2 = fig.add_subplot(1, 2, 2)
contour = ax2.contour(W0, W1, Z, levels=30, cmap='viridis')
ax2.scatter([true_w[0]], [true_w[1]], c='red', s=200, zorder=10, label='Optimum')

# Gradient descent
def mse_gradient(w):
    return -2/n * X.T @ (y - X @ w)

for _ in range(5):
    start = np.random.randn(2) * 2
    path = [start]
    w = start.copy()
    
    for _ in range(50):
        w = w - 0.1 * mse_gradient(w)
        path.append(w.copy())
    
    path = np.array(path)
    ax2.plot(path[:, 0], path[:, 1], 'o-', markersize=3, alpha=0.7)

ax2.set_xlabel('w₀')
ax2.set_ylabel('w₁')
ax2.set_title('All Paths Converge to Global Minimum')
ax2.legend()

plt.tight_layout()
plt.show()

print(f"True weights: {true_w}")
print(f"Found weights: {np.linalg.lstsq(X, y, rcond=None)[0].round(3)}")

Strong Convexity: Even Better Guarantees

A function is strongly convex with parameter

\mu

if:

f(y) \geq f(x) + \nabla f(x)^T(y-x) + \frac{\mu}{2}\|y-x\|^2

Why it matters: Strong convexity guarantees faster convergence!

Property	Convex	Strongly Convex
Unique minimum?	Maybe	Yes!
Convergence rate	$O(1/t)$	$O(e^{-\mu t})$ (exponential!)
Effect of L2 regularization	Makes strongly convex	Increases $\mu$

# Demonstrate faster convergence with strong convexity
def f_convex(x):
    return np.abs(x)  # Convex but not strongly convex

def f_strongly_convex(x, mu=1):
    return np.abs(x) + 0.5 * mu * x**2  # Strongly convex

# Subgradient method for non-smooth functions
def subgradient_descent(f, grad, x0, lr, n_iters):
    x = x0
    history = [x]
    for t in range(n_iters):
        step = lr / np.sqrt(t + 1)  # Decreasing step size
        x = x - step * grad(x)
        history.append(x)
    return np.array(history)

# Gradient/subgradient functions
def grad_convex(x):
    return np.sign(x)

def grad_strongly(x, mu=1):
    return np.sign(x) + mu * x

# Run optimization
n_iters = 100
x0 = 5.0

history_convex = subgradient_descent(f_convex, grad_convex, x0, 1.0, n_iters)
history_strong = subgradient_descent(f_strongly_convex, lambda x: grad_strongly(x, 0.5), x0, 1.0, n_iters)

# Compare convergence
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(np.abs(history_convex), label='Convex (MAE)')
axes[0].plot(np.abs(history_strong), label='Strongly Convex (MAE + L2)')
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('|x|')
axes[0].set_title('Convergence Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_yscale('log')

# Function values
axes[1].plot([f_convex(x) for x in history_convex], label='Convex')
axes[1].plot([f_strongly_convex(x) for x in history_strong], label='Strongly Convex')
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('f(x)')
axes[1].set_title('Function Value Over Iterations')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

Practice Exercises

Exercise 1: Prove Convexity

Problem: Prove that log-loss (binary cross-entropy) is convex.

L(p) = -[y \log(p) + (1-y)\log(1-p)]

Hint: Compute the second derivative and show it’s non-negative.

Exercise 2: L1 vs L2

Problem: Why does L1 regularization produce sparse solutions (many zeros) while L2 doesn’t?Hint: Think about the geometry of the constraint sets and where they intersect the loss contours.

Exercise 3: Making Problems Convex

Problem: The function

f(x) = x^4 - 3x^2

is non-convex. Add L2 regularization to make it convex. What’s the minimum regularization strength needed?Hint: The regularized function is

f(x) + \lambda x^2

. Find

\lambda

such that

f''(x) + 2\lambda \geq 0

for all

x

Summary

Concept	What It Means	Practical Impact
Convex Function	Bowl-shaped, no local minima	Gradient descent finds global optimum
Convex Set	”Solid” without holes	Feasible region for optimization
Strongly Convex	”Very bowl-shaped”	Faster convergence
L2 Regularization	Makes function more convex	Improves optimization
Non-Convex	Multiple minima	No global optimum guarantee

Key Takeaway: Convexity is your friend. When designing ML systems, prefer convex loss functions and regularization. When you must use non-convex models (neural networks), understand that you’re trading optimization guarantees for model flexibility. Good initialization, learning rate schedules, and regularization help navigate non-convex landscapes.

Overview

Testing & Code Quality

Crash Courses

AI Engineering

Math for ML - Understanding Linear Algebra

Probability & Statistics for ML

Math for ML - Understanding Calculus

ML Mastery

Deep Learning Mastery

NestJS Mastery

Microservices Mastery

Low Level Design

OOP Concepts

SOLID Principles

Design Patterns

LLD Case Studies

System Design (HLD)

Senior Level (L5+/Staff)

HLD Case Studies

Engineering Fundamentals

DevOps & Operations

Azure Cloud Engineering

AWS Cloud

AWS Monitoring & Observability

AWS Security Services

AWS Serverless

AWS Operations

AWS Advanced

AWS Case Studies

GCP Cloud Engineering

DevOps Tools

Database Engineering

HIPAA Compliance Mastery

Operating Systems

Linux Internals

Distributed Systems

Networking Mastery

Build Your Own X

Go Lang Mastery

C Programming

Classic Research Papers

Distributed System Tools

​Convex Optimization

​Why Some Problems Are “Easy”

​What Is Convexity?

​Convex Sets

​Convex Functions

​Testing for Convexity

​Second Derivative Test

​Hessian Test (Multivariable)

​Why Convexity Matters

​Theorem: Every Local Minimum Is Global

​Common Convex Functions in ML

​Loss Functions

​Regularization

​Linear Regression: A Convex Problem

​Strong Convexity: Even Better Guarantees

​Practice Exercises

​Summary

Convex Optimization

Why Some Problems Are “Easy”

What Is Convexity?

Convex Sets

Convex Functions

Testing for Convexity

Second Derivative Test

Hessian Test (Multivariable)

Why Convexity Matters

Theorem: Every Local Minimum Is Global

Common Convex Functions in ML

Loss Functions

Regularization

Linear Regression: A Convex Problem

Strong Convexity: Even Better Guarantees

Practice Exercises

Summary