> ## Documentation Index
> Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Orthogonality & Projections

> Understanding perpendicularity in high dimensions - the foundation of least squares and signal processing

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/devweeekends/images/courses/math-for-ml-linear-algebra/orthogonality-concept.svg" alt="Orthogonality & Projections" />
</Frame>

# Orthogonality & Projections

## Why Orthogonality Matters

In machine learning and data science, orthogonality is everywhere:

| Application           | How Orthogonality Is Used                              |
| --------------------- | ------------------------------------------------------ |
| **PCA**               | Find orthogonal directions of maximum variance         |
| **Least Squares**     | Project data onto the column space of features         |
| **Signal Processing** | Decompose signals into orthogonal frequency components |
| **QR Decomposition**  | Numerically stable way to solve linear systems         |
| **Gram-Schmidt**      | Create orthonormal bases for any subspace              |

<Info>
  **Estimated Time**: 3-4 hours\
  **Difficulty**: Intermediate\
  **Prerequisites**: Vectors and Matrices modules\
  **What You'll Build**: Image compression, signal denoising, and robust regression
</Info>

***

## The Intuition: Perpendicular = Independent

### A Simple Example

Two vectors are **orthogonal** (perpendicular) if they point in completely independent directions.

Here is why this matters so much. Think about describing a location: you say "3 blocks east and 4 blocks north." East and north are orthogonal directions -- knowing how far east you went tells you *absolutely nothing* about how far north you went. They carry completely independent information. If instead you used "3 blocks east and 4 blocks northeast," those directions overlap -- some of the "northeast" information is redundant with the "east" information.

Orthogonal bases give you **zero redundancy**. Every component carries unique information. This is why PCA looks for orthogonal directions, why Fourier transforms use orthogonal frequencies, and why QR decomposition creates orthogonal columns. Orthogonality is nature's way of saying "these things are truly independent."

```python theme={null}
import numpy as np
import matplotlib.pyplot as plt

# Two orthogonal vectors
v1 = np.array([3, 0])
v2 = np.array([0, 2])

# Visualize
plt.figure(figsize=(8, 8))
plt.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1, color='blue', label='v1 = [3, 0]')
plt.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1, color='red', label='v2 = [0, 2]')

# Show the right angle
theta = np.linspace(0, np.pi/2, 30)
r = 0.5
plt.plot(r * np.cos(theta), r * np.sin(theta), 'g-', linewidth=2)
plt.text(0.6, 0.3, '90°', fontsize=14, color='green')

plt.xlim(-1, 4)
plt.ylim(-1, 3)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.legend()
plt.title('Orthogonal Vectors: Perpendicular, Independent')
plt.gca().set_aspect('equal')
plt.show()

# Mathematical test: dot product = 0
print(f"v1 · v2 = {np.dot(v1, v2)}")
print("Orthogonal!" if np.dot(v1, v2) == 0 else "Not orthogonal")
```

### The Mathematical Test

Two vectors $\mathbf{u}$ and $\mathbf{v}$ are orthogonal if and only if:

$$
\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = 0
$$

<Note>
  **Geometric Interpretation**: When the dot product is zero, the vectors form a 90° angle. No component of one vector points in the direction of the other.
</Note>

***

## Orthonormal Bases: The Gold Standard

An **orthonormal basis** is a set of vectors that are:

1. **Orthogonal**: Every pair has dot product zero
2. **Normalized**: Each vector has length 1

### Why Orthonormal Is Amazing

With an orthonormal basis, decomposing a vector is trivially easy -- just take dot products. No matrix inversion, no system of equations, no numerical instability. This is why algorithms go out of their way to build orthonormal bases.

```python theme={null}
# Standard basis in 3D is orthonormal
e1 = np.array([1, 0, 0])
e2 = np.array([0, 1, 0])
e3 = np.array([0, 0, 1])

# Any vector can be easily decomposed
v = np.array([3, -2, 5])

# Finding components is trivial - just dot products!
# With a non-orthogonal basis, you'd need to solve a system of equations.
# With an orthonormal basis, you just project -- it's that simple.
c1 = np.dot(v, e1)  # 3
c2 = np.dot(v, e2)  # -2
c3 = np.dot(v, e3)  # 5

print(f"v = {c1}·e1 + {c2}·e2 + {c3}·e3")
print(f"v = {c1*e1 + c2*e2 + c3*e3}")  # Reconstructed
```

**ML connection**: This is exactly what happens in PCA. The principal components form an orthonormal basis, so projecting your data onto them is just a matrix multiply -- no expensive inversions needed.

### Gram-Schmidt: Making Any Basis Orthonormal

Given any set of linearly independent vectors, we can create an orthonormal basis. The algorithm works by repeatedly projecting and subtracting: take each vector, remove its components along all previously computed orthonormal vectors (making it orthogonal to them), then normalize it. It is like straightening a set of leaning poles one at a time -- each new pole gets adjusted to stand perfectly perpendicular to all the others.

```python theme={null}
def gram_schmidt(vectors):
    """
    Convert a set of linearly independent vectors to an orthonormal basis.
    
    This is the foundation of QR decomposition!
    """
    n = len(vectors)
    orthonormal = []
    
    for i, v in enumerate(vectors):
        # Start with the original vector
        u = v.copy().astype(float)
        
        # Subtract projections onto all previous orthonormal vectors
        for q in orthonormal:
            u = u - np.dot(u, q) * q
        
        # Normalize
        norm = np.linalg.norm(u)
        if norm < 1e-10:
            raise ValueError("Vectors are linearly dependent!")
        
        orthonormal.append(u / norm)
    
    return np.array(orthonormal)

# Example: Convert arbitrary vectors to orthonormal
v1 = np.array([1, 1, 0])
v2 = np.array([1, 0, 1])
v3 = np.array([0, 1, 1])

Q = gram_schmidt([v1, v2, v3])

print("Original vectors:")
print(f"v1 = {v1}")
print(f"v2 = {v2}")
print(f"v3 = {v3}")

print("\nOrthonormal basis Q:")
for i, q in enumerate(Q):
    print(f"q{i+1} = {q.round(4)}")

# Verify orthonormality
print("\nVerification (Q @ Q.T should be identity):")
print((Q @ Q.T).round(4))
```

***

## Projections: The Heart of Least Squares

### Projecting onto a Line

The **projection** of vector $\mathbf{b}$ onto vector $\mathbf{a}$ gives the component of $\mathbf{b}$ in the direction of $\mathbf{a}$.

Think of it like a shadow. If you shine a light straight down onto a surface, the shadow of a stick is its projection onto that surface. The projection of $\mathbf{b}$ onto $\mathbf{a}$ is the "shadow" of $\mathbf{b}$ when you shine light perpendicular to $\mathbf{a}$. The residual ($\mathbf{b}$ minus its projection) is the part of $\mathbf{b}$ that is orthogonal to $\mathbf{a}$ -- the part that "sticks up" out of the shadow.

$$
\text{proj}_{\mathbf{a}} \mathbf{b} = \frac{\mathbf{a} \cdot \mathbf{b}}{\mathbf{a} \cdot \mathbf{a}} \mathbf{a}
$$

This formula has two parts: the scalar $\frac{\mathbf{a} \cdot \mathbf{b}}{\mathbf{a} \cdot \mathbf{a}}$ tells you *how far* along $\mathbf{a}$ to go, and then you multiply by $\mathbf{a}$ to get the actual vector in that direction.

```python theme={null}
def project_onto_vector(b, a):
    """Project vector b onto vector a."""
    return (np.dot(a, b) / np.dot(a, a)) * a

# Example
a = np.array([2, 1])
b = np.array([1, 3])

proj = project_onto_vector(b, a)
error = b - proj  # The residual (perpendicular to a)

# Visualize
plt.figure(figsize=(10, 6))
plt.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1, color='blue', label=f'a = {a}', width=0.015)
plt.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1, color='red', label=f'b = {b}', width=0.015)
plt.quiver(0, 0, proj[0], proj[1], angles='xy', scale_units='xy', scale=1, color='green', label=f'proj = {proj.round(2)}', width=0.015)
plt.quiver(proj[0], proj[1], error[0], error[1], angles='xy', scale_units='xy', scale=1, color='purple', label='error (⊥ to a)', width=0.015)

# Show right angle
plt.plot([proj[0], proj[0] + 0.2*error[0]/np.linalg.norm(error)], 
         [proj[1], proj[1] + 0.2*error[1]/np.linalg.norm(error)], 'k-')

plt.xlim(-0.5, 3)
plt.ylim(-0.5, 3.5)
plt.grid(True, alpha=0.3)
plt.legend()
plt.title('Projection of b onto a')
plt.gca().set_aspect('equal')
plt.show()

# Verify error is orthogonal to a
print(f"Error · a = {np.dot(error, a):.6f} (should be ≈ 0)")
```

### Projecting onto a Subspace (Least Squares!)

When we have a matrix $A$ with columns spanning a subspace, the projection of $\mathbf{b}$ onto this subspace is:

$$
\hat{\mathbf{x}} = (A^T A)^{-1} A^T \mathbf{b}
$$

This is exactly the **normal equations** for least squares!

```python theme={null}
# Linear regression as projection
np.random.seed(42)

# Data: y ≈ 2x + 1 + noise
n = 50
x = np.linspace(0, 10, n)
y = 2 * x + 1 + np.random.randn(n) * 2

# Feature matrix (with bias column)
A = np.column_stack([np.ones(n), x])

# The least squares solution is the projection!
# We're projecting y onto the column space of A

# Method 1: Normal equations
x_hat = np.linalg.inv(A.T @ A) @ A.T @ y

# Method 2: QR decomposition (more numerically stable)
Q, R = np.linalg.qr(A)
x_hat_qr = np.linalg.solve(R, Q.T @ y)

print(f"Normal equations: intercept = {x_hat[0]:.3f}, slope = {x_hat[1]:.3f}")
print(f"QR decomposition: intercept = {x_hat_qr[0]:.3f}, slope = {x_hat_qr[1]:.3f}")
print(f"True values: intercept = 1.000, slope = 2.000")

# Visualize
y_pred = A @ x_hat
residuals = y - y_pred

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Fitted line
axes[0].scatter(x, y, alpha=0.6, label='Data')
axes[0].plot(x, y_pred, 'r-', linewidth=2, label=f'Fit: y = {x_hat[1]:.2f}x + {x_hat[0]:.2f}')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Least Squares as Projection')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residuals (should be orthogonal to columns of A)
axes[1].scatter(x, residuals, alpha=0.6)
axes[1].axhline(y=0, color='red', linestyle='--')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Residual')
axes[1].set_title('Residuals (⊥ to feature space)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Verify residuals are orthogonal to column space
print(f"\nResiduals · ones = {np.dot(residuals, np.ones(n)):.6f} (should be ≈ 0)")
print(f"Residuals · x = {np.dot(residuals, x):.6f} (should be ≈ 0)")
```

***

## QR Decomposition: The Practical Tool

QR decomposition factors a matrix as:

$$
A = QR
$$

Where:

* $Q$ is orthogonal (columns are orthonormal)
* $R$ is upper triangular

### Why QR Is Better Than Normal Equations

The normal equations approach computes $(A^T A)^{-1} A^T b$, which requires forming the matrix $A^T A$. This squaring operation is a disaster for numerical stability: it squares the condition number and can turn a mildly ill-conditioned problem into an unsolvable one.

QR decomposition sidesteps this entirely. Since $Q$ is orthogonal, $Q^T Q = I$, so applying $Q^T$ is perfectly conditioned -- it never amplifies errors. The remaining solve with the triangular matrix $R$ is also well-conditioned. The result: QR-based least squares is numerically stable even when the normal equations produce garbage.

```python theme={null}
# The normal equations (A^T A)^(-1) can be numerically unstable
# QR decomposition avoids computing A^T A entirely

def least_squares_qr(A, b):
    """
    Solve least squares using QR decomposition.
    More stable than normal equations for ill-conditioned problems.
    """
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T @ b)

# Compare on an ill-conditioned problem
n = 100
x = np.linspace(0, 1, n)

# Create increasingly ill-conditioned feature matrix (polynomial features)
degrees = [1, 5, 10, 15]

for deg in degrees:
    # Vandermonde matrix (polynomials)
    A = np.vander(x, deg + 1, increasing=True)
    y = np.sin(3 * x) + np.random.randn(n) * 0.1
    
    # Condition number
    cond = np.linalg.cond(A)
    
    # Try both methods
    try:
        x_normal = np.linalg.inv(A.T @ A) @ A.T @ y
        y_pred_normal = A @ x_normal
        rmse_normal = np.sqrt(np.mean((y - y_pred_normal)**2))
    except:
        rmse_normal = float('inf')
    
    x_qr = least_squares_qr(A, y)
    y_pred_qr = A @ x_qr
    rmse_qr = np.sqrt(np.mean((y - y_pred_qr)**2))
    
    print(f"Degree {deg:2d}: Condition = {cond:.2e}, Normal RMSE = {rmse_normal:.4f}, QR RMSE = {rmse_qr:.4f}")
```

***

## Application: Signal Denoising with Orthogonal Bases

### The Discrete Cosine Transform (DCT)

Signals can be decomposed into orthogonal frequency components.

```python theme={null}
from scipy.fft import dct, idct

# Create a noisy signal
t = np.linspace(0, 1, 256)
clean_signal = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 7 * t)
noise = np.random.randn(len(t)) * 0.5
noisy_signal = clean_signal + noise

# DCT decomposes into orthogonal frequency components
coeffs = dct(noisy_signal, norm='ortho')

# Keep only the largest coefficients (denoising)
n_keep = 20
denoised_coeffs = coeffs.copy()
threshold_idx = np.argsort(np.abs(coeffs))[-n_keep:]
mask = np.zeros_like(coeffs, dtype=bool)
mask[threshold_idx] = True
denoised_coeffs[~mask] = 0

# Reconstruct
denoised_signal = idct(denoised_coeffs, norm='ortho')

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].plot(t, clean_signal, 'g-', linewidth=2, label='Clean')
axes[0, 0].plot(t, noisy_signal, 'b-', alpha=0.5, label='Noisy')
axes[0, 0].set_title('Original Signals')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

axes[0, 1].stem(range(50), np.abs(coeffs[:50]), markerfmt='b.')
axes[0, 1].set_title('DCT Coefficients (first 50)')
axes[0, 1].set_xlabel('Frequency Index')
axes[0, 1].set_ylabel('|Coefficient|')
axes[0, 1].grid(True, alpha=0.3)

axes[1, 0].plot(t, noisy_signal, 'b-', alpha=0.5, label='Noisy')
axes[1, 0].plot(t, denoised_signal, 'r-', linewidth=2, label='Denoised')
axes[1, 0].set_title(f'Denoising (keeping {n_keep} coefficients)')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

axes[1, 1].plot(t, clean_signal, 'g-', linewidth=2, label='Clean')
axes[1, 1].plot(t, denoised_signal, 'r--', linewidth=2, label='Denoised')
axes[1, 1].set_title('Comparison: Clean vs Denoised')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Quality metrics
mse_noisy = np.mean((noisy_signal - clean_signal)**2)
mse_denoised = np.mean((denoised_signal - clean_signal)**2)

print(f"MSE (noisy): {mse_noisy:.4f}")
print(f"MSE (denoised): {mse_denoised:.4f}")
print(f"Improvement: {(mse_noisy - mse_denoised) / mse_noisy * 100:.1f}%")
```

***

## Application: Image Compression

Images can be compressed by keeping only the most important orthogonal components.

```python theme={null}
from scipy.fft import dctn, idctn

# Create a simple test image
n = 64
x, y = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n))
image = np.exp(-(x**2 + y**2) / 0.3) * np.cos(4 * np.pi * x) * np.cos(4 * np.pi * y)
image = (image - image.min()) / (image.max() - image.min())

# 2D DCT
coeffs_2d = dctn(image, norm='ortho')

# Compress by keeping only top k% of coefficients
compression_ratios = [100, 50, 20, 10, 5]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

axes[0].imshow(image, cmap='viridis')
axes[0].set_title(f'Original (100% coefficients)')
axes[0].axis('off')

for i, ratio in enumerate(compression_ratios[1:], 1):
    n_keep = int(n * n * ratio / 100)
    
    # Keep largest coefficients
    flat_coeffs = coeffs_2d.flatten()
    threshold = np.sort(np.abs(flat_coeffs))[-n_keep]
    
    compressed_coeffs = np.where(np.abs(coeffs_2d) >= threshold, coeffs_2d, 0)
    reconstructed = idctn(compressed_coeffs, norm='ortho')
    
    mse = np.mean((image - reconstructed)**2)
    
    axes[i].imshow(reconstructed, cmap='viridis')
    axes[i].set_title(f'{ratio}% coeffs, MSE={mse:.4f}')
    axes[i].axis('off')

# Show coefficient distribution
flat_coeffs = np.abs(coeffs_2d.flatten())
axes[5].hist(flat_coeffs, bins=50, edgecolor='black', alpha=0.7)
axes[5].set_xlabel('|Coefficient|')
axes[5].set_ylabel('Count')
axes[5].set_title('DCT Coefficient Distribution')
axes[5].set_yscale('log')

plt.tight_layout()
plt.show()
```

***

## Practice Exercises

<Accordion title="Exercise 1: Orthogonal Projection">
  **Problem**: Given vectors $\mathbf{u}_1 = [1, 1, 0]$ and $\mathbf{u}_2 = [1, -1, 0]$ (which are orthogonal), find the projection of $\mathbf{b} = [3, 4, 2]$ onto the plane spanned by $\mathbf{u}_1$ and $\mathbf{u}_2$.

  **Hint**: For orthogonal bases, projections can be computed independently and summed.
</Accordion>

<Accordion title="Exercise 2: Gram-Schmidt by Hand">
  **Problem**: Apply Gram-Schmidt to vectors $\mathbf{v}_1 = [1, 1]$ and $\mathbf{v}_2 = [1, 2]$.

  **Steps**:

  1. $\mathbf{q}_1 = \mathbf{v}_1 / \|\mathbf{v}_1\|$
  2. $\mathbf{u}_2 = \mathbf{v}_2 - (\mathbf{v}_2 \cdot \mathbf{q}_1)\mathbf{q}_1$
  3. $\mathbf{q}_2 = \mathbf{u}_2 / \|\mathbf{u}_2\|$
</Accordion>

<Accordion title="Exercise 3: Orthogonal Regression">
  **Problem**: In standard linear regression, we minimize vertical errors. What if we minimize perpendicular (orthogonal) distances to the line? This is called **orthogonal regression** or **total least squares**.

  Implement orthogonal regression using PCA: the first principal component gives the best-fit line that minimizes orthogonal distances.
</Accordion>

***

## Least Squares Solvers: A Stability Comparison

This table summarizes the three main approaches to solving the least squares problem $\min \|Ax - b\|^2$, and when each approach shines or fails.

| Method                                 | How It Works                            | Condition Number Sensitivity                               | Speed                               | When to Use                                |
| -------------------------------------- | --------------------------------------- | ---------------------------------------------------------- | ----------------------------------- | ------------------------------------------ |
| **Normal Equations** $(A^TA)^{-1}A^Tb$ | Forms $A^TA$, then solves               | $\kappa(A^TA) = \kappa(A)^2$ -- squares the problem        | Fastest for small, well-conditioned | Quick prototyping, $\kappa(A) < 10^3$      |
| **QR Decomposition** $R^{-1}Q^Tb$      | Factors $A = QR$, avoids forming $A^TA$ | $\kappa(R) = \kappa(A)$ -- preserves original conditioning | Moderate                            | Default for most problems                  |
| **SVD** $V \Sigma^{-1} U^T b$          | Full decomposition $A = U\Sigma V^T$    | Handles rank deficiency; truncates small singular values   | Slowest                             | Ill-conditioned or rank-deficient problems |

```
Condition number and accuracy loss (float64, ~16 digits precision):

kappa(A)     Normal Equations    QR             SVD
10^2         ~12 digits OK       ~14 digits     ~14 digits
10^6         ~4 digits OK        ~10 digits     ~10 digits
10^8         ~0 digits (garbage) ~8 digits      ~8 digits
10^12        Fails completely    ~4 digits      ~4 digits (with truncation)
```

The takeaway: **QR is the right default for production code**. Use the normal equations only when you have verified that the problem is well-conditioned. Use SVD when you suspect rank deficiency or when you need the pseudoinverse.

<Tip>
  **Practical Advice**: In Python, `np.linalg.lstsq()` uses SVD internally. This is the safest single-call solution. If you need speed for a hot loop and have verified good conditioning, switch to QR via `np.linalg.qr()` followed by back-substitution. Only use the normal equations formula $(A^TA)^{-1}A^Tb$ if you have a specific performance reason and have checked $\kappa(A) < 10^6$.
</Tip>

***

## Summary

| Concept              | Definition                  | Use Case                                |
| -------------------- | --------------------------- | --------------------------------------- |
| **Orthogonal**       | Dot product = 0             | Independent directions                  |
| **Orthonormal**      | Orthogonal + unit length    | Convenient bases                        |
| **Gram-Schmidt**     | Algorithm to orthonormalize | QR decomposition                        |
| **Projection**       | Component in a direction    | Least squares, dimensionality reduction |
| **QR Decomposition** | A = QR                      | Numerically stable solving              |

<Note>
  **Key Takeaway**: Orthogonality simplifies everything. Decomposing into orthogonal components makes calculations independent and numerically stable. This is why PCA (finding orthogonal directions of variance) is so powerful for dimensionality reduction. It is also why Fourier analysis works (orthogonal frequencies), why QR decomposition is numerically superior to the normal equations, and why orthogonal weight initialization helps neural networks train faster. When you see orthogonality in a method, it is almost always there to prevent information from "leaking" between components and to keep computations clean.
</Note>

<Warning>
  **Gram-Schmidt Numerical Pitfall**: The classical Gram-Schmidt algorithm described above can lose orthogonality for nearly-dependent vectors due to floating-point errors. In practice, use the **Modified Gram-Schmidt** variant (which re-orthogonalizes against updated vectors rather than original ones) or simply call `np.linalg.qr()`, which uses Householder reflections -- a numerically superior approach that avoids the problem entirely.
</Warning>
