Orthogonality & Projections

Why Orthogonality Matters

In machine learning and data science, orthogonality is everywhere:
| Application | How Orthogonality Is Used |
| --- | --- |
| PCA | Find orthogonal directions of maximum variance |
| Least Squares | Project data onto the column space of features |
| Signal Processing | Decompose signals into orthogonal frequency components |
| QR Decomposition | Numerically stable way to solve linear systems |
| Gram-Schmidt | Create orthonormal bases for any subspace |
Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Vectors and Matrices modules
What You’ll Build: Image compression, signal denoising, and numerically stable regression

The Intuition: Perpendicular = Independent

A Simple Example

Two vectors are orthogonal (perpendicular) if they point in completely independent directions: neither has any component along the other.
import numpy as np
import matplotlib.pyplot as plt

# Two orthogonal vectors
v1 = np.array([3, 0])
v2 = np.array([0, 2])

# Visualize
plt.figure(figsize=(8, 8))
plt.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1, color='blue', label='v1 = [3, 0]')
plt.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1, color='red', label='v2 = [0, 2]')

# Show the right angle
theta = np.linspace(0, np.pi/2, 30)
r = 0.5
plt.plot(r * np.cos(theta), r * np.sin(theta), 'g-', linewidth=2)
plt.text(0.6, 0.3, '90°', fontsize=14, color='green')

plt.xlim(-1, 4)
plt.ylim(-1, 3)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.legend()
plt.title('Orthogonal Vectors: Perpendicular, Independent')
plt.gca().set_aspect('equal')
plt.show()

# Mathematical test: dot product = 0
print(f"v1 · v2 = {np.dot(v1, v2)}")
print("Orthogonal!" if np.dot(v1, v2) == 0 else "Not orthogonal")

The Mathematical Test

Two vectors $\mathbf{u}$ and $\mathbf{v}$ are orthogonal if and only if:

$$\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = 0$$
Geometric Interpretation: When the dot product is zero, the vectors form a 90° angle. No component of one vector points in the direction of the other.
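The test is just as easy in higher dimensions, where we can no longer eyeball a right angle. Below is a minimal sketch of a dot-product check; the is_orthogonal helper and the two 4-dimensional vectors are made up for illustration, and the tolerance guards against floating-point round-off.
def is_orthogonal(u, v, tol=1e-10):
    """Return True if u and v are (numerically) orthogonal."""
    return abs(np.dot(u, v)) < tol

# Two 4-dimensional vectors chosen so the dot product cancels:
# 1*2 + 2*(-1) + 0*5 + 3*0 = 0
u = np.array([1.0, 2.0, 0.0, 3.0])
v = np.array([2.0, -1.0, 5.0, 0.0])

print(is_orthogonal(u, v))  # True
print(is_orthogonal(u, u))  # False: u · u = ||u||^2 > 0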

Orthonormal Bases: The Gold Standard

An orthonormal basis is a set of vectors that are:
  1. Orthogonal: Every pair has dot product zero
  2. Normalized: Each vector has length 1

Why Orthonormal Is Amazing

# Standard basis in 3D is orthonormal
e1 = np.array([1, 0, 0])
e2 = np.array([0, 1, 0])
e3 = np.array([0, 0, 1])

# Any vector can be easily decomposed
v = np.array([3, -2, 5])

# Finding components is trivial - just dot products!
c1 = np.dot(v, e1)  # 3
c2 = np.dot(v, e2)  # -2
c3 = np.dot(v, e3)  # 5

print(f"v = {c1}·e1 + {c2}·e2 + {c3}·e3")
print(f"v = {c1*e1 + c2*e2 + c3*e3}")  # Reconstructed

Gram-Schmidt: Making Any Basis Orthonormal

Given any set of linearly independent vectors, we can create an orthonormal basis.
def gram_schmidt(vectors):
    """
    Convert a set of linearly independent vectors to an orthonormal basis.
    
    This is the foundation of QR decomposition!
    """
    n = len(vectors)
    orthonormal = []
    
    for i, v in enumerate(vectors):
        # Start with the original vector
        u = v.copy().astype(float)
        
        # Subtract projections onto all previous orthonormal vectors
        for q in orthonormal:
            u = u - np.dot(u, q) * q
        
        # Normalize
        norm = np.linalg.norm(u)
        if norm < 1e-10:
            raise ValueError("Vectors are linearly dependent!")
        
        orthonormal.append(u / norm)
    
    return np.array(orthonormal)

# Example: Convert arbitrary vectors to orthonormal
v1 = np.array([1, 1, 0])
v2 = np.array([1, 0, 1])
v3 = np.array([0, 1, 1])

Q = gram_schmidt([v1, v2, v3])

print("Original vectors:")
print(f"v1 = {v1}")
print(f"v2 = {v2}")
print(f"v3 = {v3}")

print("\nOrthonormal basis Q:")
for i, q in enumerate(Q):
    print(f"q{i+1} = {q.round(4)}")

# Verify orthonormality
print("\nVerification (Q @ Q.T should be identity):")
print((Q @ Q.T).round(4))
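As a sanity check, the rows of Q should match, up to sign, the columns of the Q factor that np.linalg.qr computes for the matrix whose columns are v1, v2, v3. The comparison below is a sketch of that check; Q_np and R_np are just local names for NumPy's factors.
# Stack v1, v2, v3 as columns and factor with NumPy's built-in QR
A = np.column_stack([v1, v2, v3])
Q_np, R_np = np.linalg.qr(A)

# Householder-based QR may flip signs, so compare absolute values
print("Matches np.linalg.qr up to sign:", np.allclose(np.abs(Q.T), np.abs(Q_np)))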

Projections: The Heart of Least Squares

Projecting onto a Line

The projection of vector $\mathbf{b}$ onto vector $\mathbf{a}$ gives the component of $\mathbf{b}$ in the direction of $\mathbf{a}$:

$$\text{proj}_{\mathbf{a}} \mathbf{b} = \frac{\mathbf{a} \cdot \mathbf{b}}{\mathbf{a} \cdot \mathbf{a}} \, \mathbf{a}$$
def project_onto_vector(b, a):
    """Project vector b onto vector a."""
    return (np.dot(a, b) / np.dot(a, a)) * a

# Example
a = np.array([2, 1])
b = np.array([1, 3])

proj = project_onto_vector(b, a)
error = b - proj  # The residual (perpendicular to a)

# Visualize
plt.figure(figsize=(10, 6))
plt.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1, color='blue', label=f'a = {a}', width=0.015)
plt.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1, color='red', label=f'b = {b}', width=0.015)
plt.quiver(0, 0, proj[0], proj[1], angles='xy', scale_units='xy', scale=1, color='green', label=f'proj = {proj.round(2)}', width=0.015)
plt.quiver(proj[0], proj[1], error[0], error[1], angles='xy', scale_units='xy', scale=1, color='purple', label='error (⊥ to a)', width=0.015)

# Show right angle
plt.plot([proj[0], proj[0] + 0.2*error[0]/np.linalg.norm(error)], 
         [proj[1], proj[1] + 0.2*error[1]/np.linalg.norm(error)], 'k-')

plt.xlim(-0.5, 3)
plt.ylim(-0.5, 3.5)
plt.grid(True, alpha=0.3)
plt.legend()
plt.title('Projection of b onto a')
plt.gca().set_aspect('equal')
plt.show()

# Verify error is orthogonal to a
print(f"Error · a = {np.dot(error, a):.6f} (should be ≈ 0)")

Projecting onto a Subspace (Least Squares!)

When we have a matrix $A$ whose columns span a subspace, the projection of $\mathbf{b}$ onto that subspace is $A\hat{\mathbf{x}}$, where the coefficient vector $\hat{\mathbf{x}}$ satisfies:

$$\hat{\mathbf{x}} = (A^T A)^{-1} A^T \mathbf{b}$$

These are exactly the normal equations of least squares!
# Linear regression as projection
np.random.seed(42)

# Data: y ≈ 2x + 1 + noise
n = 50
x = np.linspace(0, 10, n)
y = 2 * x + 1 + np.random.randn(n) * 2

# Feature matrix (with bias column)
A = np.column_stack([np.ones(n), x])

# The least squares solution is the projection!
# We're projecting y onto the column space of A

# Method 1: Normal equations
x_hat = np.linalg.inv(A.T @ A) @ A.T @ y

# Method 2: QR decomposition (more numerically stable)
Q, R = np.linalg.qr(A)
x_hat_qr = np.linalg.solve(R, Q.T @ y)

print(f"Normal equations: intercept = {x_hat[0]:.3f}, slope = {x_hat[1]:.3f}")
print(f"QR decomposition: intercept = {x_hat_qr[0]:.3f}, slope = {x_hat_qr[1]:.3f}")
print(f"True values: intercept = 1.000, slope = 2.000")

# Visualize
y_pred = A @ x_hat
residuals = y - y_pred

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Fitted line
axes[0].scatter(x, y, alpha=0.6, label='Data')
axes[0].plot(x, y_pred, 'r-', linewidth=2, label=f'Fit: y = {x_hat[1]:.2f}x + {x_hat[0]:.2f}')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Least Squares as Projection')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residuals (should be orthogonal to columns of A)
axes[1].scatter(x, residuals, alpha=0.6)
axes[1].axhline(y=0, color='red', linestyle='--')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Residual')
axes[1].set_title('Residuals (⊥ to feature space)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Verify residuals are orthogonal to column space
print(f"\nResiduals · ones = {np.dot(residuals, np.ones(n)):.6f} (should be ≈ 0)")
print(f"Residuals · x = {np.dot(residuals, x):.6f} (should be ≈ 0)")

QR Decomposition: The Practical Tool

QR decomposition factors a matrix as:

$$A = QR$$

Where:
  • $Q$ is orthogonal (its columns are orthonormal)
  • $R$ is upper triangular
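Both properties are easy to verify numerically. Here is a minimal factor-and-verify sketch, where A_demo is an arbitrary 3×2 example chosen for illustration.
# Factor a small tall matrix and verify both properties of Q and R
A_demo = np.array([[1.0, 2.0],
                   [0.0, 1.0],
                   [1.0, 0.0]])

Q_demo, R_demo = np.linalg.qr(A_demo)  # reduced QR: Q_demo is 3x2, R_demo is 2x2

print("Q has orthonormal columns:", np.allclose(Q_demo.T @ Q_demo, np.eye(2)))
print("R is upper triangular:", np.allclose(R_demo, np.triu(R_demo)))
print("QR reconstructs A:", np.allclose(Q_demo @ R_demo, A_demo))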

Why QR Is Better Than Normal Equations

# The normal equations (A^T A)^(-1) can be numerically unstable
# QR decomposition avoids computing A^T A

def least_squares_qr(A, b):
    """
    Solve least squares using QR decomposition.
    More stable than normal equations for ill-conditioned problems.
    """
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T @ b)

# Compare on an ill-conditioned problem
n = 100
x = np.linspace(0, 1, n)

# Create increasingly ill-conditioned feature matrix (polynomial features)
degrees = [1, 5, 10, 15]

for deg in degrees:
    # Vandermonde matrix (polynomials)
    A = np.vander(x, deg + 1, increasing=True)
    y = np.sin(3 * x) + np.random.randn(n) * 0.1
    
    # Condition number
    cond = np.linalg.cond(A)
    
    # Try both methods
    try:
        x_normal = np.linalg.inv(A.T @ A) @ A.T @ y
        y_pred_normal = A @ x_normal
        rmse_normal = np.sqrt(np.mean((y - y_pred_normal)**2))
    except np.linalg.LinAlgError:
        rmse_normal = float('inf')
    
    x_qr = least_squares_qr(A, y)
    y_pred_qr = A @ x_qr
    rmse_qr = np.sqrt(np.mean((y - y_pred_qr)**2))
    
    print(f"Degree {deg:2d}: Condition = {cond:.2e}, Normal RMSE = {rmse_normal:.4f}, QR RMSE = {rmse_qr:.4f}")

Application: Signal Denoising with Orthogonal Bases

The Discrete Cosine Transform (DCT)

Signals can be decomposed into orthogonal frequency components.
from scipy.fft import dct, idct

# Create a noisy signal
t = np.linspace(0, 1, 256)
clean_signal = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 7 * t)
noise = np.random.randn(len(t)) * 0.5
noisy_signal = clean_signal + noise

# DCT decomposes into orthogonal frequency components
coeffs = dct(noisy_signal, norm='ortho')

# Keep only the largest coefficients (denoising)
n_keep = 20
denoised_coeffs = coeffs.copy()
threshold_idx = np.argsort(np.abs(coeffs))[-n_keep:]
mask = np.zeros_like(coeffs, dtype=bool)
mask[threshold_idx] = True
denoised_coeffs[~mask] = 0

# Reconstruct
denoised_signal = idct(denoised_coeffs, norm='ortho')

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].plot(t, clean_signal, 'g-', linewidth=2, label='Clean')
axes[0, 0].plot(t, noisy_signal, 'b-', alpha=0.5, label='Noisy')
axes[0, 0].set_title('Original Signals')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

axes[0, 1].stem(range(50), np.abs(coeffs[:50]), markerfmt='b.')
axes[0, 1].set_title('DCT Coefficients (first 50)')
axes[0, 1].set_xlabel('Frequency Index')
axes[0, 1].set_ylabel('|Coefficient|')
axes[0, 1].grid(True, alpha=0.3)

axes[1, 0].plot(t, noisy_signal, 'b-', alpha=0.5, label='Noisy')
axes[1, 0].plot(t, denoised_signal, 'r-', linewidth=2, label='Denoised')
axes[1, 0].set_title(f'Denoising (keeping {n_keep} coefficients)')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

axes[1, 1].plot(t, clean_signal, 'g-', linewidth=2, label='Clean')
axes[1, 1].plot(t, denoised_signal, 'r--', linewidth=2, label='Denoised')
axes[1, 1].set_title('Comparison: Clean vs Denoised')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Quality metrics
mse_noisy = np.mean((noisy_signal - clean_signal)**2)
mse_denoised = np.mean((denoised_signal - clean_signal)**2)

print(f"MSE (noisy): {mse_noisy:.4f}")
print(f"MSE (denoised): {mse_denoised:.4f}")
print(f"Improvement: {(mse_noisy - mse_denoised) / mse_noisy * 100:.1f}%")

Application: Image Compression

Images can be compressed by keeping only the most important orthogonal components.
from scipy.fft import dctn, idctn

# Create a simple test image
n = 64
x, y = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n))
image = np.exp(-(x**2 + y**2) / 0.3) * np.cos(4 * np.pi * x) * np.cos(4 * np.pi * y)
image = (image - image.min()) / (image.max() - image.min())

# 2D DCT
coeffs_2d = dctn(image, norm='ortho')

# Compress by keeping only top k% of coefficients
compression_ratios = [100, 50, 20, 10, 5]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

axes[0].imshow(image, cmap='viridis')
axes[0].set_title(f'Original (100% coefficients)')
axes[0].axis('off')

for i, ratio in enumerate(compression_ratios[1:], 1):
    n_keep = int(n * n * ratio / 100)
    
    # Keep largest coefficients
    flat_coeffs = coeffs_2d.flatten()
    threshold = np.sort(np.abs(flat_coeffs))[-n_keep]
    
    compressed_coeffs = np.where(np.abs(coeffs_2d) >= threshold, coeffs_2d, 0)
    reconstructed = idctn(compressed_coeffs, norm='ortho')
    
    mse = np.mean((image - reconstructed)**2)
    
    axes[i].imshow(reconstructed, cmap='viridis')
    axes[i].set_title(f'{ratio}% coeffs, MSE={mse:.4f}')
    axes[i].axis('off')

# Show coefficient distribution
flat_coeffs = np.abs(coeffs_2d.flatten())
axes[5].hist(flat_coeffs, bins=50, edgecolor='black', alpha=0.7)
axes[5].set_xlabel('|Coefficient|')
axes[5].set_ylabel('Count')
axes[5].set_title('DCT Coefficient Distribution')
axes[5].set_yscale('log')

plt.tight_layout()
plt.show()

Practice Exercises

Problem: Given vectors $\mathbf{u}_1 = [1, 1, 0]$ and $\mathbf{u}_2 = [1, -1, 0]$ (which are orthogonal), find the projection of $\mathbf{b} = [3, 4, 2]$ onto the plane spanned by $\mathbf{u}_1$ and $\mathbf{u}_2$. Hint: For orthogonal bases, projections can be computed independently and summed.
Problem: Apply Gram-Schmidt to vectors $\mathbf{v}_1 = [1, 1]$ and $\mathbf{v}_2 = [1, 2]$. Steps:
  1. $\mathbf{q}_1 = \mathbf{v}_1 / \|\mathbf{v}_1\|$
  2. $\mathbf{u}_2 = \mathbf{v}_2 - (\mathbf{v}_2 \cdot \mathbf{q}_1)\mathbf{q}_1$
  3. $\mathbf{q}_2 = \mathbf{u}_2 / \|\mathbf{u}_2\|$
Problem: In standard linear regression, we minimize vertical errors. What if we minimize perpendicular (orthogonal) distances to the line? This is called orthogonal regression or total least squares. Implement orthogonal regression using PCA: the first principal component gives the best-fit line that minimizes orthogonal distances.
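For the last problem, here is a minimal starting-point sketch rather than a full solution; the synthetic data (x_tls, y_tls) and variable names are made up for illustration. The idea is to center the data, take the first principal component from the SVD of the centered data matrix, and read the slope off that direction.
# Orthogonal (total least squares) regression via PCA -- starting point
rng = np.random.default_rng(0)
x_tls = rng.uniform(0, 10, 100)
y_tls = 2 * x_tls + 1 + rng.normal(0, 1, 100)

# Center the data and find the direction of maximum variance
X = np.column_stack([x_tls, y_tls])
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
direction = Vt[0]  # unit vector along the best-fit (orthogonal-distance) line

slope_tls = direction[1] / direction[0]
intercept_tls = y_tls.mean() - slope_tls * x_tls.mean()
print(f"Orthogonal regression: y ≈ {slope_tls:.3f} x + {intercept_tls:.3f}")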

Summary

| Concept | Definition | Use Case |
| --- | --- | --- |
| Orthogonal | Dot product = 0 | Independent directions |
| Orthonormal | Orthogonal + unit length | Convenient bases |
| Gram-Schmidt | Algorithm to orthonormalize | QR decomposition |
| Projection | Component in a direction | Least squares, dimensionality reduction |
| QR Decomposition | A = QR | Numerically stable solving |
Key Takeaway: Orthogonality simplifies everything. Decomposing into orthogonal components makes calculations independent and numerically stable. This is why PCA (finding orthogonal directions of variance) is so powerful for dimensionality reduction.