Two vectors are orthogonal (perpendicular) if they point in completely independent directions.Here is why this matters so much. Think about describing a location: you say “3 blocks east and 4 blocks north.” East and north are orthogonal directions — knowing how far east you went tells you absolutely nothing about how far north you went. They carry completely independent information. If instead you used “3 blocks east and 4 blocks northeast,” those directions overlap — some of the “northeast” information is redundant with the “east” information.Orthogonal bases give you zero redundancy. Every component carries unique information. This is why PCA looks for orthogonal directions, why Fourier transforms use orthogonal frequencies, and why QR decomposition creates orthogonal columns. Orthogonality is nature’s way of saying “these things are truly independent.”
With an orthonormal basis, decomposing a vector is trivially easy — just take dot products. No matrix inversion, no system of equations, no numerical instability. This is why algorithms go out of their way to build orthonormal bases.
# Standard basis in 3D is orthonormale1 = np.array([1, 0, 0])e2 = np.array([0, 1, 0])e3 = np.array([0, 0, 1])# Any vector can be easily decomposedv = np.array([3, -2, 5])# Finding components is trivial - just dot products!# With a non-orthogonal basis, you'd need to solve a system of equations.# With an orthonormal basis, you just project -- it's that simple.c1 = np.dot(v, e1) # 3c2 = np.dot(v, e2) # -2c3 = np.dot(v, e3) # 5print(f"v = {c1}·e1 + {c2}·e2 + {c3}·e3")print(f"v = {c1*e1 + c2*e2 + c3*e3}") # Reconstructed
ML connection: This is exactly what happens in PCA. The principal components form an orthonormal basis, so projecting your data onto them is just a matrix multiply — no expensive inversions needed.
Given any set of linearly independent vectors, we can create an orthonormal basis. The algorithm works by repeatedly projecting and subtracting: take each vector, remove its components along all previously computed orthonormal vectors (making it orthogonal to them), then normalize it. It is like straightening a set of leaning poles one at a time — each new pole gets adjusted to stand perfectly perpendicular to all the others.
def gram_schmidt(vectors): """ Convert a set of linearly independent vectors to an orthonormal basis. This is the foundation of QR decomposition! """ n = len(vectors) orthonormal = [] for i, v in enumerate(vectors): # Start with the original vector u = v.copy().astype(float) # Subtract projections onto all previous orthonormal vectors for q in orthonormal: u = u - np.dot(u, q) * q # Normalize norm = np.linalg.norm(u) if norm < 1e-10: raise ValueError("Vectors are linearly dependent!") orthonormal.append(u / norm) return np.array(orthonormal)# Example: Convert arbitrary vectors to orthonormalv1 = np.array([1, 1, 0])v2 = np.array([1, 0, 1])v3 = np.array([0, 1, 1])Q = gram_schmidt([v1, v2, v3])print("Original vectors:")print(f"v1 = {v1}")print(f"v2 = {v2}")print(f"v3 = {v3}")print("\nOrthonormal basis Q:")for i, q in enumerate(Q): print(f"q{i+1} = {q.round(4)}")# Verify orthonormalityprint("\nVerification (Q @ Q.T should be identity):")print((Q @ Q.T).round(4))
The projection of vector b onto vector a gives the component of b in the direction of a.Think of it like a shadow. If you shine a light straight down onto a surface, the shadow of a stick is its projection onto that surface. The projection of b onto a is the “shadow” of b when you shine light perpendicular to a. The residual (b minus its projection) is the part of b that is orthogonal to a — the part that “sticks up” out of the shadow.projab=a⋅aa⋅baThis formula has two parts: the scalar a⋅aa⋅b tells you how far along a to go, and then you multiply by a to get the actual vector in that direction.
def project_onto_vector(b, a): """Project vector b onto vector a.""" return (np.dot(a, b) / np.dot(a, a)) * a# Examplea = np.array([2, 1])b = np.array([1, 3])proj = project_onto_vector(b, a)error = b - proj # The residual (perpendicular to a)# Visualizeplt.figure(figsize=(10, 6))plt.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1, color='blue', label=f'a = {a}', width=0.015)plt.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1, color='red', label=f'b = {b}', width=0.015)plt.quiver(0, 0, proj[0], proj[1], angles='xy', scale_units='xy', scale=1, color='green', label=f'proj = {proj.round(2)}', width=0.015)plt.quiver(proj[0], proj[1], error[0], error[1], angles='xy', scale_units='xy', scale=1, color='purple', label='error (⊥ to a)', width=0.015)# Show right angleplt.plot([proj[0], proj[0] + 0.2*error[0]/np.linalg.norm(error)], [proj[1], proj[1] + 0.2*error[1]/np.linalg.norm(error)], 'k-')plt.xlim(-0.5, 3)plt.ylim(-0.5, 3.5)plt.grid(True, alpha=0.3)plt.legend()plt.title('Projection of b onto a')plt.gca().set_aspect('equal')plt.show()# Verify error is orthogonal to aprint(f"Error · a = {np.dot(error, a):.6f} (should be ≈ 0)")
When we have a matrix A with columns spanning a subspace, the projection of b onto this subspace is:x^=(ATA)−1ATbThis is exactly the normal equations for least squares!
# Linear regression as projectionnp.random.seed(42)# Data: y ≈ 2x + 1 + noisen = 50x = np.linspace(0, 10, n)y = 2 * x + 1 + np.random.randn(n) * 2# Feature matrix (with bias column)A = np.column_stack([np.ones(n), x])# The least squares solution is the projection!# We're projecting y onto the column space of A# Method 1: Normal equationsx_hat = np.linalg.inv(A.T @ A) @ A.T @ y# Method 2: QR decomposition (more numerically stable)Q, R = np.linalg.qr(A)x_hat_qr = np.linalg.solve(R, Q.T @ y)print(f"Normal equations: intercept = {x_hat[0]:.3f}, slope = {x_hat[1]:.3f}")print(f"QR decomposition: intercept = {x_hat_qr[0]:.3f}, slope = {x_hat_qr[1]:.3f}")print(f"True values: intercept = 1.000, slope = 2.000")# Visualizey_pred = A @ x_hatresiduals = y - y_predfig, axes = plt.subplots(1, 2, figsize=(14, 5))# Fitted lineaxes[0].scatter(x, y, alpha=0.6, label='Data')axes[0].plot(x, y_pred, 'r-', linewidth=2, label=f'Fit: y = {x_hat[1]:.2f}x + {x_hat[0]:.2f}')axes[0].set_xlabel('x')axes[0].set_ylabel('y')axes[0].set_title('Least Squares as Projection')axes[0].legend()axes[0].grid(True, alpha=0.3)# Residuals (should be orthogonal to columns of A)axes[1].scatter(x, residuals, alpha=0.6)axes[1].axhline(y=0, color='red', linestyle='--')axes[1].set_xlabel('x')axes[1].set_ylabel('Residual')axes[1].set_title('Residuals (⊥ to feature space)')axes[1].grid(True, alpha=0.3)plt.tight_layout()plt.show()# Verify residuals are orthogonal to column spaceprint(f"\nResiduals · ones = {np.dot(residuals, np.ones(n)):.6f} (should be ≈ 0)")print(f"Residuals · x = {np.dot(residuals, x):.6f} (should be ≈ 0)")
The normal equations approach computes (ATA)−1ATb, which requires forming the matrix ATA. This squaring operation is a disaster for numerical stability: it squares the condition number and can turn a mildly ill-conditioned problem into an unsolvable one.QR decomposition sidesteps this entirely. Since Q is orthogonal, QTQ=I, so applying QT is perfectly conditioned — it never amplifies errors. The remaining solve with the triangular matrix R is also well-conditioned. The result: QR-based least squares is numerically stable even when the normal equations produce garbage.
# The normal equations (A^T A)^(-1) can be numerically unstable# QR decomposition avoids computing A^T A entirelydef least_squares_qr(A, b): """ Solve least squares using QR decomposition. More stable than normal equations for ill-conditioned problems. """ Q, R = np.linalg.qr(A) return np.linalg.solve(R, Q.T @ b)# Compare on an ill-conditioned problemn = 100x = np.linspace(0, 1, n)# Create increasingly ill-conditioned feature matrix (polynomial features)degrees = [1, 5, 10, 15]for deg in degrees: # Vandermonde matrix (polynomials) A = np.vander(x, deg + 1, increasing=True) y = np.sin(3 * x) + np.random.randn(n) * 0.1 # Condition number cond = np.linalg.cond(A) # Try both methods try: x_normal = np.linalg.inv(A.T @ A) @ A.T @ y y_pred_normal = A @ x_normal rmse_normal = np.sqrt(np.mean((y - y_pred_normal)**2)) except: rmse_normal = float('inf') x_qr = least_squares_qr(A, y) y_pred_qr = A @ x_qr rmse_qr = np.sqrt(np.mean((y - y_pred_qr)**2)) print(f"Degree {deg:2d}: Condition = {cond:.2e}, Normal RMSE = {rmse_normal:.4f}, QR RMSE = {rmse_qr:.4f}")
Problem: Given vectors u1=[1,1,0] and u2=[1,−1,0] (which are orthogonal), find the projection of b=[3,4,2] onto the plane spanned by u1 and u2.Hint: For orthogonal bases, projections can be computed independently and summed.
Exercise 2: Gram-Schmidt by Hand
Problem: Apply Gram-Schmidt to vectors v1=[1,1] and v2=[1,2].Steps:
q1=v1/∥v1∥
u2=v2−(v2⋅q1)q1
q2=u2/∥u2∥
Exercise 3: Orthogonal Regression
Problem: In standard linear regression, we minimize vertical errors. What if we minimize perpendicular (orthogonal) distances to the line? This is called orthogonal regression or total least squares.Implement orthogonal regression using PCA: the first principal component gives the best-fit line that minimizes orthogonal distances.
This table summarizes the three main approaches to solving the least squares problem min∥Ax−b∥2, and when each approach shines or fails.
Method
How It Works
Condition Number Sensitivity
Speed
When to Use
Normal Equations(ATA)−1ATb
Forms ATA, then solves
κ(ATA)=κ(A)2 — squares the problem
Fastest for small, well-conditioned
Quick prototyping, κ(A)<103
QR DecompositionR−1QTb
Factors A=QR, avoids forming ATA
κ(R)=κ(A) — preserves original conditioning
Moderate
Default for most problems
SVDVΣ−1UTb
Full decomposition A=UΣVT
Handles rank deficiency; truncates small singular values
Slowest
Ill-conditioned or rank-deficient problems
Condition number and accuracy loss (float64, ~16 digits precision):kappa(A) Normal Equations QR SVD10^2 ~12 digits OK ~14 digits ~14 digits10^6 ~4 digits OK ~10 digits ~10 digits10^8 ~0 digits (garbage) ~8 digits ~8 digits10^12 Fails completely ~4 digits ~4 digits (with truncation)
The takeaway: QR is the right default for production code. Use the normal equations only when you have verified that the problem is well-conditioned. Use SVD when you suspect rank deficiency or when you need the pseudoinverse.
Practical Advice: In Python, np.linalg.lstsq() uses SVD internally. This is the safest single-call solution. If you need speed for a hot loop and have verified good conditioning, switch to QR via np.linalg.qr() followed by back-substitution. Only use the normal equations formula (ATA)−1ATb if you have a specific performance reason and have checked κ(A)<106.
Key Takeaway: Orthogonality simplifies everything. Decomposing into orthogonal components makes calculations independent and numerically stable. This is why PCA (finding orthogonal directions of variance) is so powerful for dimensionality reduction. It is also why Fourier analysis works (orthogonal frequencies), why QR decomposition is numerically superior to the normal equations, and why orthogonal weight initialization helps neural networks train faster. When you see orthogonality in a method, it is almost always there to prevent information from “leaking” between components and to keep computations clean.
Gram-Schmidt Numerical Pitfall: The classical Gram-Schmidt algorithm described above can lose orthogonality for nearly-dependent vectors due to floating-point errors. In practice, use the Modified Gram-Schmidt variant (which re-orthogonalizes against updated vectors rather than original ones) or simply call np.linalg.qr(), which uses Householder reflections — a numerically superior approach that avoids the problem entirely.