Convex Optimization

Why Some Problems Are “Easy”

In gradient descent, we worry about getting stuck in local minima. But for some problems, every local minimum is a global minimum. These are convex optimization problems, and they’re the foundation of many ML algorithms.
The Catch: Neural network training objectives are NOT convex. But understanding convexity helps you:
  1. Know when you’re guaranteed to find the global optimum
  2. Design better loss functions
  3. Understand why some algorithms converge faster
Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Gradient Descent, Optimization modules
What You’ll Learn: When optimization is easy, and how to leverage it

What Is Convexity?

Convex Sets

A set $C$ is convex if the line segment between any two points in $C$ lies entirely in $C$.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Convex set: Circle
theta = np.linspace(0, 2*np.pi, 100)
circle = plt.Circle((0, 0), 1, fill=True, alpha=0.3, color='blue')
axes[0].add_patch(circle)
axes[0].plot([-.7, .5], [-.5, .7], 'r-', linewidth=2, label='Line segment inside')
axes[0].scatter([-.7, .5], [-.5, .7], c='red', s=100, zorder=5)
axes[0].set_xlim(-1.5, 1.5)
axes[0].set_ylim(-1.5, 1.5)
axes[0].set_title('Convex: Circle (Yes)')
axes[0].set_aspect('equal')
axes[0].legend()

# Convex set: Triangle
triangle = Polygon([[-1, -0.5], [1, -0.5], [0, 1]], alpha=0.3, color='blue')
axes[1].add_patch(triangle)
axes[1].plot([-.5, .3], [-.2, .5], 'r-', linewidth=2, label='Line segment inside')
axes[1].scatter([-.5, .3], [-.2, .5], c='red', s=100, zorder=5)
axes[1].set_xlim(-1.5, 1.5)
axes[1].set_ylim(-1, 1.5)
axes[1].set_title('Convex: Triangle (Yes)')
axes[1].set_aspect('equal')
axes[1].legend()

# Non-convex set: Crescent
t = np.linspace(0, 2*np.pi, 100)
x1 = np.cos(t)
y1 = np.sin(t)
x2 = 0.5 + 0.7*np.cos(t)
y2 = 0.7*np.sin(t)
axes[2].fill(x1, y1, alpha=0.3, color='blue')
axes[2].fill(x2, y2, color='white')
axes[2].plot([-.8, .2], [0, 0], 'r-', linewidth=2, label='Line segment OUTSIDE!')
axes[2].scatter([-.8, .2], [0, 0], c='red', s=100, zorder=5)
axes[2].set_xlim(-1.5, 1.5)
axes[2].set_ylim(-1.5, 1.5)
axes[2].set_title('Non-Convex: Crescent (No)')
axes[2].set_aspect('equal')
axes[2].legend()

plt.tight_layout()
plt.show()
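The line-segment definition can also be spot-checked numerically. The sketch below is my own addition (using the unit disc as the example set): it samples pairs of points inside the disc and verifies that their convex combinations stay inside.
# Numerical spot-check of the convex-set definition for the unit disc
rng = np.random.default_rng(0)

def in_unit_disc(p):
    """True if point p lies in the closed unit disc."""
    return np.linalg.norm(p) <= 1.0

all_inside = True
for _ in range(1000):
    p, q = rng.uniform(-1, 1, size=(2, 2))  # two candidate points
    if not (in_unit_disc(p) and in_unit_disc(q)):
        continue  # only test pairs that are actually in the set
    lam = rng.uniform()
    if not in_unit_disc(lam * p + (1 - lam) * q):
        all_inside = False
        break

print(f"All sampled convex combinations stayed in the disc: {all_inside}")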

Convex Functions

A function $f$ is convex if the line segment between any two points on its graph lies on or above the graph. Mathematically, for all $x, y$ and $\lambda \in [0, 1]$:

$$f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y)$$
# Visualize convex vs non-convex functions
x = np.linspace(-3, 3, 100)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Convex: Quadratic
y_quad = x**2
axes[0].plot(x, y_quad, 'b-', linewidth=2)
# Show the "line above" property
x1, x2 = -2, 1.5
y1, y2 = x1**2, x2**2
axes[0].plot([x1, x2], [y1, y2], 'r--', linewidth=2, label='Line segment')
axes[0].scatter([x1, x2], [y1, y2], c='red', s=100, zorder=5)
# Show a point in between
x_mid = 0.5 * x1 + 0.5 * x2
axes[0].scatter([x_mid], [x_mid**2], c='green', s=100, zorder=5, label='Function value')
axes[0].scatter([x_mid], [0.5*y1 + 0.5*y2], c='orange', s=100, zorder=5, label='Line value')
axes[0].set_title('Convex: f(x) = x^2 (Yes)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Convex: Exponential
y_exp = np.exp(x)
axes[1].plot(x, y_exp, 'b-', linewidth=2)
x1, x2 = -2, 1
y1, y2 = np.exp(x1), np.exp(x2)
axes[1].plot([x1, x2], [y1, y2], 'r--', linewidth=2)
axes[1].scatter([x1, x2], [y1, y2], c='red', s=100, zorder=5)
axes[1].set_title('Convex: f(x) = exp(x) (Yes)')
axes[1].grid(True, alpha=0.3)

# Non-convex: Sine
y_sin = np.sin(x)
axes[2].plot(x, y_sin, 'b-', linewidth=2)
x1, x2 = -2, 2
y1, y2 = np.sin(x1), np.sin(x2)
axes[2].plot([x1, x2], [y1, y2], 'r--', linewidth=2)
axes[2].scatter([x1, x2], [y1, y2], c='red', s=100, zorder=5)
axes[2].set_title('Non-Convex: f(x) = sin(x) (No)')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
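We can also test the inequality directly instead of eyeballing the plots. This is a sketch of my own (random sampling is a heuristic check, not a proof): draw random $x$, $y$, $\lambda$ and look for a violation of $f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y)$.
# Randomized check of the convexity inequality (heuristic, not a proof)
rng = np.random.default_rng(1)

def finds_violation(f, n_trials=10_000, lo=-3.0, hi=3.0):
    """Return True if some random triple (x, y, lambda) violates convexity."""
    for _ in range(n_trials):
        a, b = rng.uniform(lo, hi, size=2)
        lam = rng.uniform()
        lhs = f(lam * a + (1 - lam) * b)
        rhs = lam * f(a) + (1 - lam) * f(b)
        if lhs > rhs + 1e-9:  # tolerance for floating-point error
            return True
    return False

for name, f in [("x^2", lambda v: v**2), ("exp(x)", np.exp), ("sin(x)", np.sin)]:
    verdict = "violation found (non-convex)" if finds_violation(f) else "no violation found"
    print(f"{name:8}: {verdict}")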

Testing for Convexity

Second Derivative Test

For twice-differentiable functions:
| Function Type | Condition |
| --- | --- |
| Convex | $f''(x) \geq 0$ everywhere |
| Strictly Convex | $f''(x) > 0$ everywhere |
| Concave | $f''(x) \leq 0$ everywhere |
def is_convex(f, f_double_prime, x_range):
    """Numerically spot-check convexity: f''(x) >= 0 on a grid over x_range.

    Note: f itself is unused; only its second derivative is needed here.
    """
    x = np.linspace(x_range[0], x_range[1], 1000)
    second_derivs = f_double_prime(x)

    return np.all(second_derivs >= -1e-10)  # Allow small numerical errors

# Test some functions
functions = [
    ("x²", lambda x: x**2, lambda x: 2*np.ones_like(x)),
    ("eˣ", lambda x: np.exp(x), lambda x: np.exp(x)),
    ("x³", lambda x: x**3, lambda x: 6*x),
    ("|x|", lambda x: np.abs(x), lambda x: np.zeros_like(x)),  # f''=0 except at 0
    ("sin(x)", lambda x: np.sin(x), lambda x: -np.sin(x)),
]

print("Convexity Check (on [-5, 5]):")
print("=" * 40)
for name, f, f_pp in functions:
    result = is_convex(f, f_pp, [-5, 5])
    print(f"{name:10} : {'Convex ✓' if result else 'Non-convex ✗'}")

Hessian Test (Multivariable)

For functions of multiple variables, check if the Hessian (matrix of second derivatives) is positive semi-definite.
def check_hessian_psd(H):
    """Check if matrix H is positive semi-definite (all eigenvalues >= 0)."""
    eigenvalues = np.linalg.eigvals(H)
    return np.all(eigenvalues >= -1e-10)  # Allow small numerical errors

# Example: f(x,y) = x² + y² + xy
# Hessian = [[2, 1], [1, 2]]
H = np.array([[2, 1], [1, 2]])

eigenvalues = np.linalg.eigvals(H)
print(f"Hessian eigenvalues: {eigenvalues}")
print(f"Is convex: {check_hessian_pd(H)}")

# Non-convex example: saddle point
# f(x,y) = x² - y²
# Hessian = [[2, 0], [0, -2]]
H_saddle = np.array([[2, 0], [0, -2]])
eigenvalues_saddle = np.linalg.eigvals(H_saddle)
print(f"\nSaddle Hessian eigenvalues: {eigenvalues_saddle}")
print(f"Is convex: {check_hessian_pd(H_saddle)}")

Why Convexity Matters

Theorem: Every Local Minimum Is Global

For convex functions, every local minimum is a global minimum, so gradient descent (with a suitable step size) converges to the global optimum no matter where it starts.
# Demonstrate convergence to global minimum for convex function
def convex_function(x):
    return x[0]**2 + x[1]**2 + x[0]*x[1] + x[0] + 2*x[1]

def convex_gradient(x):
    return np.array([2*x[0] + x[1] + 1, 2*x[1] + x[0] + 2])

# Multiple random starting points
np.random.seed(42)
n_starts = 10
starting_points = np.random.randn(n_starts, 2) * 5

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot contours
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2 + X*Y + X + 2*Y

axes[0].contour(X, Y, Z, levels=30)
axes[0].set_title('Convex Function: All Paths Lead to Same Minimum')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')

# Gradient descent from each starting point
all_endpoints = []
for i, start in enumerate(starting_points):
    path = [start.copy()]
    x = start.copy()
    lr = 0.1
    
    for _ in range(50):
        x = x - lr * convex_gradient(x)
        path.append(x.copy())
    
    path = np.array(path)
    axes[0].plot(path[:, 0], path[:, 1], 'o-', markersize=2, alpha=0.7)
    all_endpoints.append(path[-1])

all_endpoints = np.array(all_endpoints)
axes[0].scatter(all_endpoints[:, 0], all_endpoints[:, 1], c='red', s=100, zorder=10, label='Final points')
axes[0].legend()

# Show that all endpoints are the same
print("All endpoints converge to the same point:")
print(f"Mean: {all_endpoints.mean(axis=0)}")
print(f"Std:  {all_endpoints.std(axis=0)}")

# Non-convex comparison
def nonconvex_function(x):
    return np.sin(x[0]) + np.sin(x[1]) + 0.1*(x[0]**2 + x[1]**2)

def nonconvex_gradient(x):
    return np.array([np.cos(x[0]) + 0.2*x[0], np.cos(x[1]) + 0.2*x[1]])

# Plot non-convex
Z_nc = np.sin(X) + np.sin(Y) + 0.1*(X**2 + Y**2)
axes[1].contour(X, Y, Z_nc, levels=30)
axes[1].set_title('Non-Convex Function: Different Minima!')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')

all_endpoints_nc = []
for start in starting_points:
    path = [start.copy()]
    x = start.copy()
    lr = 0.1
    
    for _ in range(100):
        x = x - lr * nonconvex_gradient(x)
        path.append(x.copy())
    
    path = np.array(path)
    axes[1].plot(path[:, 0], path[:, 1], 'o-', markersize=2, alpha=0.7)
    all_endpoints_nc.append(path[-1])

all_endpoints_nc = np.array(all_endpoints_nc)
axes[1].scatter(all_endpoints_nc[:, 0], all_endpoints_nc[:, 1], c='red', s=100, zorder=10)

plt.tight_layout()
plt.show()

print("\nNon-convex endpoints (different!):")
print(all_endpoints_nc.round(3))

Common Convex Functions in ML

Loss Functions

| Loss | Formula | Convexity |
| --- | --- | --- |
| MSE | $(y - \hat{y})^2$ | Convex ✓ |
| MAE | $\lvert y - \hat{y} \rvert$ | Convex ✓ |
| Cross-Entropy | $-y\log(\hat{y})$ | Convex ✓ |
| Hinge | $\max(0, 1-y\hat{y})$ | Convex ✓ |
| Huber | Combination of MSE and MAE | Convex ✓ |

Regularization

| Regularizer | Formula | Convexity |
| --- | --- | --- |
| L2 (Ridge) | $\sum w_i^2$ | Strictly Convex ✓ |
| L1 (Lasso) | $\sum \lvert w_i \rvert$ | Convex ✓ |
| Elastic Net | $\alpha \sum \lvert w_i \rvert + (1-\alpha) \sum w_i^2$ | Convex ✓ |
# Visualize common loss functions
x = np.linspace(-3, 3, 500)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# MSE
axes[0, 0].plot(x, x**2, 'b-', linewidth=2)
axes[0, 0].set_title('MSE: (y - ŷ)²')
axes[0, 0].grid(True, alpha=0.3)

# MAE
axes[0, 1].plot(x, np.abs(x), 'b-', linewidth=2)
axes[0, 1].set_title('MAE: |y - ŷ|')
axes[0, 1].grid(True, alpha=0.3)

# Hinge loss
axes[0, 2].plot(x, np.maximum(0, 1-x), 'b-', linewidth=2)
axes[0, 2].set_title('Hinge: max(0, 1 - yŷ)')
axes[0, 2].grid(True, alpha=0.3)

# Cross entropy (for p ∈ (0,1))
p = np.linspace(0.01, 0.99, 500)
axes[1, 0].plot(p, -np.log(p), 'b-', linewidth=2, label='-log(p)')
axes[1, 0].plot(p, -np.log(1-p), 'r--', linewidth=2, label='-log(1-p)')
axes[1, 0].set_title('Cross-Entropy Components')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Huber loss
delta = 1.0
huber = np.where(np.abs(x) <= delta, 0.5*x**2, delta*(np.abs(x) - 0.5*delta))
axes[1, 1].plot(x, huber, 'b-', linewidth=2)
axes[1, 1].plot(x, x**2, 'r--', alpha=0.5, label='MSE')
axes[1, 1].plot(x, np.abs(x), 'g--', alpha=0.5, label='MAE')
axes[1, 1].set_title('Huber Loss (δ=1)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# L1 vs L2 regularization
w = np.linspace(-3, 3, 500)
axes[1, 2].plot(w, np.abs(w), 'b-', linewidth=2, label='L1 (Lasso)')
axes[1, 2].plot(w, w**2, 'r-', linewidth=2, label='L2 (Ridge)')
axes[1, 2].set_title('Regularization Penalties')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
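For one-dimensional losses, a quick numerical cross-check (a sketch I've added; the grid, step h, and tolerance are arbitrary) is the discrete second difference $f(x-h) - 2f(x) + f(x+h)$, which is non-negative everywhere for a convex function, even at kinks where $f''$ is undefined:
def second_difference_nonneg(f, lo=-3.0, hi=3.0, n_points=2001, h=1e-3):
    """Check that f(x-h) - 2f(x) + f(x+h) >= 0 on a grid over [lo, hi]."""
    xs = np.linspace(lo, hi, n_points)
    d2 = f(xs - h) - 2 * f(xs) + f(xs + h)
    return np.all(d2 >= -1e-9)  # tolerance for floating-point error

losses = [
    ("MSE",   lambda r: r**2),
    ("MAE",   np.abs),
    ("Hinge", lambda m: np.maximum(0, 1 - m)),
    ("Huber", lambda r: np.where(np.abs(r) <= 1, 0.5 * r**2, np.abs(r) - 0.5)),
]

for name, f in losses:
    verdict = "convex (no negative second difference)" if second_difference_nonneg(f) else "non-convex"
    print(f"{name:6}: {verdict}")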

Linear Regression: A Convex Problem

Linear regression with MSE loss is convex:

$$L(w) = \sum_{i=1}^{n} (y_i - w^T x_i)^2$$

This is a quadratic function of $w$ whose Hessian, $2 X^T X$, is positive semi-definite, so the loss is always convex.
from mpl_toolkits.mplot3d import Axes3D

# 2D linear regression to visualize the loss surface
np.random.seed(42)
n = 50
X = np.random.randn(n, 2)
true_w = np.array([2.0, -1.0])
y = X @ true_w + np.random.randn(n) * 0.5

def mse_loss(w):
    return np.mean((y - X @ w)**2)

# Plot the loss surface
w0_range = np.linspace(-1, 5, 100)
w1_range = np.linspace(-4, 2, 100)
W0, W1 = np.meshgrid(w0_range, w1_range)

Z = np.zeros_like(W0)
for i in range(len(w0_range)):
    for j in range(len(w1_range)):
        Z[j, i] = mse_loss(np.array([w0_range[i], w1_range[j]]))

fig = plt.figure(figsize=(14, 6))

# 3D surface
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot_surface(W0, W1, Z, cmap='viridis', alpha=0.8)
ax1.set_xlabel('w₀')
ax1.set_ylabel('w₁')
ax1.set_zlabel('MSE Loss')
ax1.set_title('MSE Loss Surface (Convex Bowl)')
ax1.scatter([true_w[0]], [true_w[1]], [mse_loss(true_w)], c='red', s=200, label='Optimum')

# Contour plot with gradient descent paths
ax2 = fig.add_subplot(1, 2, 2)
contour = ax2.contour(W0, W1, Z, levels=30, cmap='viridis')
ax2.scatter([true_w[0]], [true_w[1]], c='red', s=200, zorder=10, label='Optimum')

# Gradient descent
def mse_gradient(w):
    return -2/n * X.T @ (y - X @ w)

for _ in range(5):
    start = np.random.randn(2) * 2
    path = [start]
    w = start.copy()
    
    for _ in range(50):
        w = w - 0.1 * mse_gradient(w)
        path.append(w.copy())
    
    path = np.array(path)
    ax2.plot(path[:, 0], path[:, 1], 'o-', markersize=3, alpha=0.7)

ax2.set_xlabel('w₀')
ax2.set_ylabel('w₁')
ax2.set_title('All Paths Converge to Global Minimum')
ax2.legend()

plt.tight_layout()
plt.show()

print(f"True weights: {true_w}")
print(f"Found weights: {np.linalg.lstsq(X, y, rcond=None)[0].round(3)}")

Strong Convexity: Even Better Guarantees

A function is strongly convex with parameter $\mu > 0$ if:

$$f(y) \geq f(x) + \nabla f(x)^T(y-x) + \frac{\mu}{2}\|y-x\|^2$$

Why it matters: Strong convexity guarantees faster convergence!
| Property | Convex | Strongly Convex |
| --- | --- | --- |
| Unique minimum? | Maybe | Yes! |
| Convergence rate | $O(1/t)$ | $O(e^{-\mu t})$ (exponential!) |
| Effect of L2 regularization | Makes the function strongly convex | Increases $\mu$ |
# Demonstrate faster convergence with strong convexity
def f_convex(x):
    return np.abs(x)  # Convex but not strongly convex

def f_strongly_convex(x, mu=1):
    return np.abs(x) + 0.5 * mu * x**2  # Strongly convex

# Subgradient method for non-smooth functions
def subgradient_descent(f, grad, x0, lr, n_iters):
    x = x0
    history = [x]
    for t in range(n_iters):
        step = lr / np.sqrt(t + 1)  # Decreasing step size
        x = x - step * grad(x)
        history.append(x)
    return np.array(history)

# Gradient/subgradient functions
def grad_convex(x):
    return np.sign(x)

def grad_strongly(x, mu=1):
    return np.sign(x) + mu * x

# Run optimization
n_iters = 100
x0 = 5.0

history_convex = subgradient_descent(f_convex, grad_convex, x0, 1.0, n_iters)
history_strong = subgradient_descent(f_strongly_convex, lambda x: grad_strongly(x, 0.5), x0, 1.0, n_iters)

# Compare convergence
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(np.abs(history_convex), label='Convex (MAE)')
axes[0].plot(np.abs(history_strong), label='Strongly Convex (MAE + L2)')
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('|x|')
axes[0].set_title('Convergence Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_yscale('log')

# Function values
axes[1].plot([f_convex(x) for x in history_convex], label='Convex')
axes[1].plot([f_strongly_convex(x, 0.5) for x in history_strong], label='Strongly Convex')  # same mu=0.5 as the gradient
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('f(x)')
axes[1].set_title('Function Value Over Iterations')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()
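The table's claim that L2 regularization increases $\mu$ can be checked on the linear-regression Hessian from earlier. In the sketch below (my own, reusing X and n from the linear-regression example), adding $\lambda \|w\|^2$ shifts every Hessian eigenvalue up by $2\lambda$, so the smallest eigenvalue, which plays the role of $\mu$, grows with $\lambda$:
# Ridge regression: L(w) = mean((y - Xw)^2) + lam * ||w||^2
# Hessian = (2/n) X^T X + 2*lam*I, so mu (smallest eigenvalue) grows with lam
for lam in [0.0, 0.1, 1.0]:
    H_ridge = (2 / n) * X.T @ X + 2 * lam * np.eye(X.shape[1])
    mu = np.linalg.eigvalsh(H_ridge).min()
    print(f"lambda = {lam:4.1f}: mu = {mu:.3f}")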

Practice Exercises

Problem: Prove that log-loss (binary cross-entropy) is convex: $L(p) = -[y \log(p) + (1-y)\log(1-p)]$. Hint: Compute the second derivative and show it’s non-negative.
Problem: Why does L1 regularization produce sparse solutions (many zeros) while L2 doesn’t? Hint: Think about the geometry of the constraint sets and where they intersect the loss contours.
Problem: The function $f(x) = x^4 - 3x^2$ is non-convex. Add L2 regularization to make it convex. What’s the minimum regularization strength needed? Hint: The regularized function is $f(x) + \lambda x^2$. Find $\lambda$ such that $f''(x) + 2\lambda \geq 0$ for all $x$.

Summary

| Concept | What It Means | Practical Impact |
| --- | --- | --- |
| Convex Function | Bowl-shaped, no spurious local minima | Gradient descent finds the global optimum |
| Convex Set | "Solid" region without holes | Feasible region for optimization |
| Strongly Convex | "Very bowl-shaped" | Faster convergence |
| L2 Regularization | Makes the function more convex | Improves optimization |
| Non-Convex | May have multiple local minima | No global-optimum guarantee |
Key Takeaway: Convexity is your friend. When designing ML systems, prefer convex loss functions and regularization. When you must use non-convex models (neural networks), understand that you’re trading optimization guarantees for model flexibility. Good initialization, learning rate schedules, and regularization help navigate non-convex landscapes.