Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Orthogonality & Projections

Orthogonality & Projections

Why Orthogonality Matters

In machine learning and data science, orthogonality is everywhere:
ApplicationHow Orthogonality Is Used
PCAFind orthogonal directions of maximum variance
Least SquaresProject data onto the column space of features
Signal ProcessingDecompose signals into orthogonal frequency components
QR DecompositionNumerically stable way to solve linear systems
Gram-SchmidtCreate orthonormal bases for any subspace
Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Vectors and Matrices modules
What You’ll Build: Image compression, signal denoising, and robust regression

The Intuition: Perpendicular = Independent

A Simple Example

Two vectors are orthogonal (perpendicular) if they point in completely independent directions. Here is why this matters so much. Think about describing a location: you say “3 blocks east and 4 blocks north.” East and north are orthogonal directions — knowing how far east you went tells you absolutely nothing about how far north you went. They carry completely independent information. If instead you used “3 blocks east and 4 blocks northeast,” those directions overlap — some of the “northeast” information is redundant with the “east” information. Orthogonal bases give you zero redundancy. Every component carries unique information. This is why PCA looks for orthogonal directions, why Fourier transforms use orthogonal frequencies, and why QR decomposition creates orthogonal columns. Orthogonality is nature’s way of saying “these things are truly independent.”
import numpy as np
import matplotlib.pyplot as plt

# Two orthogonal vectors
v1 = np.array([3, 0])
v2 = np.array([0, 2])

# Visualize
plt.figure(figsize=(8, 8))
plt.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1, color='blue', label='v1 = [3, 0]')
plt.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1, color='red', label='v2 = [0, 2]')

# Show the right angle
theta = np.linspace(0, np.pi/2, 30)
r = 0.5
plt.plot(r * np.cos(theta), r * np.sin(theta), 'g-', linewidth=2)
plt.text(0.6, 0.3, '90°', fontsize=14, color='green')

plt.xlim(-1, 4)
plt.ylim(-1, 3)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.legend()
plt.title('Orthogonal Vectors: Perpendicular, Independent')
plt.gca().set_aspect('equal')
plt.show()

# Mathematical test: dot product = 0
print(f"v1 · v2 = {np.dot(v1, v2)}")
print("Orthogonal!" if np.dot(v1, v2) == 0 else "Not orthogonal")

The Mathematical Test

Two vectors u\mathbf{u} and v\mathbf{v} are orthogonal if and only if: uv=i=1nuivi=0\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = 0
Geometric Interpretation: When the dot product is zero, the vectors form a 90° angle. No component of one vector points in the direction of the other.

Orthonormal Bases: The Gold Standard

An orthonormal basis is a set of vectors that are:
  1. Orthogonal: Every pair has dot product zero
  2. Normalized: Each vector has length 1

Why Orthonormal Is Amazing

With an orthonormal basis, decomposing a vector is trivially easy — just take dot products. No matrix inversion, no system of equations, no numerical instability. This is why algorithms go out of their way to build orthonormal bases.
# Standard basis in 3D is orthonormal
e1 = np.array([1, 0, 0])
e2 = np.array([0, 1, 0])
e3 = np.array([0, 0, 1])

# Any vector can be easily decomposed
v = np.array([3, -2, 5])

# Finding components is trivial - just dot products!
# With a non-orthogonal basis, you'd need to solve a system of equations.
# With an orthonormal basis, you just project -- it's that simple.
c1 = np.dot(v, e1)  # 3
c2 = np.dot(v, e2)  # -2
c3 = np.dot(v, e3)  # 5

print(f"v = {c1}·e1 + {c2}·e2 + {c3}·e3")
print(f"v = {c1*e1 + c2*e2 + c3*e3}")  # Reconstructed
ML connection: This is exactly what happens in PCA. The principal components form an orthonormal basis, so projecting your data onto them is just a matrix multiply — no expensive inversions needed.

Gram-Schmidt: Making Any Basis Orthonormal

Given any set of linearly independent vectors, we can create an orthonormal basis. The algorithm works by repeatedly projecting and subtracting: take each vector, remove its components along all previously computed orthonormal vectors (making it orthogonal to them), then normalize it. It is like straightening a set of leaning poles one at a time — each new pole gets adjusted to stand perfectly perpendicular to all the others.
def gram_schmidt(vectors):
    """
    Convert a set of linearly independent vectors to an orthonormal basis.
    
    This is the foundation of QR decomposition!
    """
    n = len(vectors)
    orthonormal = []
    
    for i, v in enumerate(vectors):
        # Start with the original vector
        u = v.copy().astype(float)
        
        # Subtract projections onto all previous orthonormal vectors
        for q in orthonormal:
            u = u - np.dot(u, q) * q
        
        # Normalize
        norm = np.linalg.norm(u)
        if norm < 1e-10:
            raise ValueError("Vectors are linearly dependent!")
        
        orthonormal.append(u / norm)
    
    return np.array(orthonormal)

# Example: Convert arbitrary vectors to orthonormal
v1 = np.array([1, 1, 0])
v2 = np.array([1, 0, 1])
v3 = np.array([0, 1, 1])

Q = gram_schmidt([v1, v2, v3])

print("Original vectors:")
print(f"v1 = {v1}")
print(f"v2 = {v2}")
print(f"v3 = {v3}")

print("\nOrthonormal basis Q:")
for i, q in enumerate(Q):
    print(f"q{i+1} = {q.round(4)}")

# Verify orthonormality
print("\nVerification (Q @ Q.T should be identity):")
print((Q @ Q.T).round(4))

Projections: The Heart of Least Squares

Projecting onto a Line

The projection of vector b\mathbf{b} onto vector a\mathbf{a} gives the component of b\mathbf{b} in the direction of a\mathbf{a}. Think of it like a shadow. If you shine a light straight down onto a surface, the shadow of a stick is its projection onto that surface. The projection of b\mathbf{b} onto a\mathbf{a} is the “shadow” of b\mathbf{b} when you shine light perpendicular to a\mathbf{a}. The residual (b\mathbf{b} minus its projection) is the part of b\mathbf{b} that is orthogonal to a\mathbf{a} — the part that “sticks up” out of the shadow. projab=abaaa\text{proj}_{\mathbf{a}} \mathbf{b} = \frac{\mathbf{a} \cdot \mathbf{b}}{\mathbf{a} \cdot \mathbf{a}} \mathbf{a} This formula has two parts: the scalar abaa\frac{\mathbf{a} \cdot \mathbf{b}}{\mathbf{a} \cdot \mathbf{a}} tells you how far along a\mathbf{a} to go, and then you multiply by a\mathbf{a} to get the actual vector in that direction.
def project_onto_vector(b, a):
    """Project vector b onto vector a."""
    return (np.dot(a, b) / np.dot(a, a)) * a

# Example
a = np.array([2, 1])
b = np.array([1, 3])

proj = project_onto_vector(b, a)
error = b - proj  # The residual (perpendicular to a)

# Visualize
plt.figure(figsize=(10, 6))
plt.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1, color='blue', label=f'a = {a}', width=0.015)
plt.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1, color='red', label=f'b = {b}', width=0.015)
plt.quiver(0, 0, proj[0], proj[1], angles='xy', scale_units='xy', scale=1, color='green', label=f'proj = {proj.round(2)}', width=0.015)
plt.quiver(proj[0], proj[1], error[0], error[1], angles='xy', scale_units='xy', scale=1, color='purple', label='error (⊥ to a)', width=0.015)

# Show right angle
plt.plot([proj[0], proj[0] + 0.2*error[0]/np.linalg.norm(error)], 
         [proj[1], proj[1] + 0.2*error[1]/np.linalg.norm(error)], 'k-')

plt.xlim(-0.5, 3)
plt.ylim(-0.5, 3.5)
plt.grid(True, alpha=0.3)
plt.legend()
plt.title('Projection of b onto a')
plt.gca().set_aspect('equal')
plt.show()

# Verify error is orthogonal to a
print(f"Error · a = {np.dot(error, a):.6f} (should be ≈ 0)")

Projecting onto a Subspace (Least Squares!)

When we have a matrix AA with columns spanning a subspace, the projection of b\mathbf{b} onto this subspace is: x^=(ATA)1ATb\hat{\mathbf{x}} = (A^T A)^{-1} A^T \mathbf{b} This is exactly the normal equations for least squares!
# Linear regression as projection
np.random.seed(42)

# Data: y ≈ 2x + 1 + noise
n = 50
x = np.linspace(0, 10, n)
y = 2 * x + 1 + np.random.randn(n) * 2

# Feature matrix (with bias column)
A = np.column_stack([np.ones(n), x])

# The least squares solution is the projection!
# We're projecting y onto the column space of A

# Method 1: Normal equations
x_hat = np.linalg.inv(A.T @ A) @ A.T @ y

# Method 2: QR decomposition (more numerically stable)
Q, R = np.linalg.qr(A)
x_hat_qr = np.linalg.solve(R, Q.T @ y)

print(f"Normal equations: intercept = {x_hat[0]:.3f}, slope = {x_hat[1]:.3f}")
print(f"QR decomposition: intercept = {x_hat_qr[0]:.3f}, slope = {x_hat_qr[1]:.3f}")
print(f"True values: intercept = 1.000, slope = 2.000")

# Visualize
y_pred = A @ x_hat
residuals = y - y_pred

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Fitted line
axes[0].scatter(x, y, alpha=0.6, label='Data')
axes[0].plot(x, y_pred, 'r-', linewidth=2, label=f'Fit: y = {x_hat[1]:.2f}x + {x_hat[0]:.2f}')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Least Squares as Projection')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residuals (should be orthogonal to columns of A)
axes[1].scatter(x, residuals, alpha=0.6)
axes[1].axhline(y=0, color='red', linestyle='--')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Residual')
axes[1].set_title('Residuals (⊥ to feature space)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Verify residuals are orthogonal to column space
print(f"\nResiduals · ones = {np.dot(residuals, np.ones(n)):.6f} (should be ≈ 0)")
print(f"Residuals · x = {np.dot(residuals, x):.6f} (should be ≈ 0)")

QR Decomposition: The Practical Tool

QR decomposition factors a matrix as: A=QRA = QR Where:
  • QQ is orthogonal (columns are orthonormal)
  • RR is upper triangular

Why QR Is Better Than Normal Equations

The normal equations approach computes (ATA)1ATb(A^T A)^{-1} A^T b, which requires forming the matrix ATAA^T A. This squaring operation is a disaster for numerical stability: it squares the condition number and can turn a mildly ill-conditioned problem into an unsolvable one. QR decomposition sidesteps this entirely. Since QQ is orthogonal, QTQ=IQ^T Q = I, so applying QTQ^T is perfectly conditioned — it never amplifies errors. The remaining solve with the triangular matrix RR is also well-conditioned. The result: QR-based least squares is numerically stable even when the normal equations produce garbage.
# The normal equations (A^T A)^(-1) can be numerically unstable
# QR decomposition avoids computing A^T A entirely

def least_squares_qr(A, b):
    """
    Solve least squares using QR decomposition.
    More stable than normal equations for ill-conditioned problems.
    """
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T @ b)

# Compare on an ill-conditioned problem
n = 100
x = np.linspace(0, 1, n)

# Create increasingly ill-conditioned feature matrix (polynomial features)
degrees = [1, 5, 10, 15]

for deg in degrees:
    # Vandermonde matrix (polynomials)
    A = np.vander(x, deg + 1, increasing=True)
    y = np.sin(3 * x) + np.random.randn(n) * 0.1
    
    # Condition number
    cond = np.linalg.cond(A)
    
    # Try both methods
    try:
        x_normal = np.linalg.inv(A.T @ A) @ A.T @ y
        y_pred_normal = A @ x_normal
        rmse_normal = np.sqrt(np.mean((y - y_pred_normal)**2))
    except:
        rmse_normal = float('inf')
    
    x_qr = least_squares_qr(A, y)
    y_pred_qr = A @ x_qr
    rmse_qr = np.sqrt(np.mean((y - y_pred_qr)**2))
    
    print(f"Degree {deg:2d}: Condition = {cond:.2e}, Normal RMSE = {rmse_normal:.4f}, QR RMSE = {rmse_qr:.4f}")

Application: Signal Denoising with Orthogonal Bases

The Discrete Cosine Transform (DCT)

Signals can be decomposed into orthogonal frequency components.
from scipy.fft import dct, idct

# Create a noisy signal
t = np.linspace(0, 1, 256)
clean_signal = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 7 * t)
noise = np.random.randn(len(t)) * 0.5
noisy_signal = clean_signal + noise

# DCT decomposes into orthogonal frequency components
coeffs = dct(noisy_signal, norm='ortho')

# Keep only the largest coefficients (denoising)
n_keep = 20
denoised_coeffs = coeffs.copy()
threshold_idx = np.argsort(np.abs(coeffs))[-n_keep:]
mask = np.zeros_like(coeffs, dtype=bool)
mask[threshold_idx] = True
denoised_coeffs[~mask] = 0

# Reconstruct
denoised_signal = idct(denoised_coeffs, norm='ortho')

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].plot(t, clean_signal, 'g-', linewidth=2, label='Clean')
axes[0, 0].plot(t, noisy_signal, 'b-', alpha=0.5, label='Noisy')
axes[0, 0].set_title('Original Signals')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

axes[0, 1].stem(range(50), np.abs(coeffs[:50]), markerfmt='b.')
axes[0, 1].set_title('DCT Coefficients (first 50)')
axes[0, 1].set_xlabel('Frequency Index')
axes[0, 1].set_ylabel('|Coefficient|')
axes[0, 1].grid(True, alpha=0.3)

axes[1, 0].plot(t, noisy_signal, 'b-', alpha=0.5, label='Noisy')
axes[1, 0].plot(t, denoised_signal, 'r-', linewidth=2, label='Denoised')
axes[1, 0].set_title(f'Denoising (keeping {n_keep} coefficients)')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

axes[1, 1].plot(t, clean_signal, 'g-', linewidth=2, label='Clean')
axes[1, 1].plot(t, denoised_signal, 'r--', linewidth=2, label='Denoised')
axes[1, 1].set_title('Comparison: Clean vs Denoised')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Quality metrics
mse_noisy = np.mean((noisy_signal - clean_signal)**2)
mse_denoised = np.mean((denoised_signal - clean_signal)**2)

print(f"MSE (noisy): {mse_noisy:.4f}")
print(f"MSE (denoised): {mse_denoised:.4f}")
print(f"Improvement: {(mse_noisy - mse_denoised) / mse_noisy * 100:.1f}%")

Application: Image Compression

Images can be compressed by keeping only the most important orthogonal components.
from scipy.fft import dctn, idctn

# Create a simple test image
n = 64
x, y = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n))
image = np.exp(-(x**2 + y**2) / 0.3) * np.cos(4 * np.pi * x) * np.cos(4 * np.pi * y)
image = (image - image.min()) / (image.max() - image.min())

# 2D DCT
coeffs_2d = dctn(image, norm='ortho')

# Compress by keeping only top k% of coefficients
compression_ratios = [100, 50, 20, 10, 5]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

axes[0].imshow(image, cmap='viridis')
axes[0].set_title(f'Original (100% coefficients)')
axes[0].axis('off')

for i, ratio in enumerate(compression_ratios[1:], 1):
    n_keep = int(n * n * ratio / 100)
    
    # Keep largest coefficients
    flat_coeffs = coeffs_2d.flatten()
    threshold = np.sort(np.abs(flat_coeffs))[-n_keep]
    
    compressed_coeffs = np.where(np.abs(coeffs_2d) >= threshold, coeffs_2d, 0)
    reconstructed = idctn(compressed_coeffs, norm='ortho')
    
    mse = np.mean((image - reconstructed)**2)
    
    axes[i].imshow(reconstructed, cmap='viridis')
    axes[i].set_title(f'{ratio}% coeffs, MSE={mse:.4f}')
    axes[i].axis('off')

# Show coefficient distribution
flat_coeffs = np.abs(coeffs_2d.flatten())
axes[5].hist(flat_coeffs, bins=50, edgecolor='black', alpha=0.7)
axes[5].set_xlabel('|Coefficient|')
axes[5].set_ylabel('Count')
axes[5].set_title('DCT Coefficient Distribution')
axes[5].set_yscale('log')

plt.tight_layout()
plt.show()

Practice Exercises

Problem: Given vectors u1=[1,1,0]\mathbf{u}_1 = [1, 1, 0] and u2=[1,1,0]\mathbf{u}_2 = [1, -1, 0] (which are orthogonal), find the projection of b=[3,4,2]\mathbf{b} = [3, 4, 2] onto the plane spanned by u1\mathbf{u}_1 and u2\mathbf{u}_2.Hint: For orthogonal bases, projections can be computed independently and summed.
Problem: Apply Gram-Schmidt to vectors v1=[1,1]\mathbf{v}_1 = [1, 1] and v2=[1,2]\mathbf{v}_2 = [1, 2].Steps:
  1. q1=v1/v1\mathbf{q}_1 = \mathbf{v}_1 / \|\mathbf{v}_1\|
  2. u2=v2(v2q1)q1\mathbf{u}_2 = \mathbf{v}_2 - (\mathbf{v}_2 \cdot \mathbf{q}_1)\mathbf{q}_1
  3. q2=u2/u2\mathbf{q}_2 = \mathbf{u}_2 / \|\mathbf{u}_2\|
Problem: In standard linear regression, we minimize vertical errors. What if we minimize perpendicular (orthogonal) distances to the line? This is called orthogonal regression or total least squares.Implement orthogonal regression using PCA: the first principal component gives the best-fit line that minimizes orthogonal distances.

Least Squares Solvers: A Stability Comparison

This table summarizes the three main approaches to solving the least squares problem minAxb2\min \|Ax - b\|^2, and when each approach shines or fails.
MethodHow It WorksCondition Number SensitivitySpeedWhen to Use
Normal Equations (ATA)1ATb(A^TA)^{-1}A^TbForms ATAA^TA, then solvesκ(ATA)=κ(A)2\kappa(A^TA) = \kappa(A)^2 — squares the problemFastest for small, well-conditionedQuick prototyping, κ(A)<103\kappa(A) < 10^3
QR Decomposition R1QTbR^{-1}Q^TbFactors A=QRA = QR, avoids forming ATAA^TAκ(R)=κ(A)\kappa(R) = \kappa(A) — preserves original conditioningModerateDefault for most problems
SVD VΣ1UTbV \Sigma^{-1} U^T bFull decomposition A=UΣVTA = U\Sigma V^THandles rank deficiency; truncates small singular valuesSlowestIll-conditioned or rank-deficient problems
Condition number and accuracy loss (float64, ~16 digits precision):

kappa(A)     Normal Equations    QR             SVD
10^2         ~12 digits OK       ~14 digits     ~14 digits
10^6         ~4 digits OK        ~10 digits     ~10 digits
10^8         ~0 digits (garbage) ~8 digits      ~8 digits
10^12        Fails completely    ~4 digits      ~4 digits (with truncation)
The takeaway: QR is the right default for production code. Use the normal equations only when you have verified that the problem is well-conditioned. Use SVD when you suspect rank deficiency or when you need the pseudoinverse.
Practical Advice: In Python, np.linalg.lstsq() uses SVD internally. This is the safest single-call solution. If you need speed for a hot loop and have verified good conditioning, switch to QR via np.linalg.qr() followed by back-substitution. Only use the normal equations formula (ATA)1ATb(A^TA)^{-1}A^Tb if you have a specific performance reason and have checked κ(A)<106\kappa(A) < 10^6.

Summary

ConceptDefinitionUse Case
OrthogonalDot product = 0Independent directions
OrthonormalOrthogonal + unit lengthConvenient bases
Gram-SchmidtAlgorithm to orthonormalizeQR decomposition
ProjectionComponent in a directionLeast squares, dimensionality reduction
QR DecompositionA = QRNumerically stable solving
Key Takeaway: Orthogonality simplifies everything. Decomposing into orthogonal components makes calculations independent and numerically stable. This is why PCA (finding orthogonal directions of variance) is so powerful for dimensionality reduction. It is also why Fourier analysis works (orthogonal frequencies), why QR decomposition is numerically superior to the normal equations, and why orthogonal weight initialization helps neural networks train faster. When you see orthogonality in a method, it is almost always there to prevent information from “leaking” between components and to keep computations clean.
Gram-Schmidt Numerical Pitfall: The classical Gram-Schmidt algorithm described above can lose orthogonality for nearly-dependent vectors due to floating-point errors. In practice, use the Modified Gram-Schmidt variant (which re-orthogonalizes against updated vectors rather than original ones) or simply call np.linalg.qr(), which uses Householder reflections — a numerically superior approach that avoids the problem entirely.