Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Singular Value Decomposition

Singular Value Decomposition (SVD)

The Magic Behind “People Like You Also Bought…”

An Everyday Mystery

You buy running shoes on Amazon. Suddenly Amazon knows you might want:
  • Protein powder
  • A fitness tracker
  • Compression socks
  • A foam roller
How? You never searched for those things. You never told Amazon you’re a runner. Amazon discovered a hidden pattern: “Runner type” people buy these things together. Similarly:
  • Spotify knows your music taste after 10 songs
  • Netflix predicts you’ll rate a movie 4.2 stars
  • YouTube knows which videos you’ll watch next
All of these use SVD — a technique that finds hidden patterns in incomplete data. If eigendecomposition is like finding the natural vibration modes of a single drum, SVD is like finding the natural vibration modes of any shape — it works on rectangular matrices (where rows and columns represent different things, like users and movies), not just square ones. It is the Swiss Army knife of matrix decompositions.
The Real Power: SVD can predict things you haven’t done yet based on patterns from millions of other people.You’ve rated 50 movies? SVD predicts your ratings for the other 9,950 movies!

Before We Dive In: Why SVD Matters

Your DataHidden Patterns FoundBusiness Value
Purchase history”Customer types”Personalized recommendations
Song listening”Music taste dimensions""Discover Weekly” playlist
Movie ratings”Genre preferences""Because you watched…”
Job applications”Candidate profiles”Better matching
Photos”Visual features”Face recognition, search
SVD is behind billions of dollars of revenue at Amazon, Netflix, Spotify, and Google.
Estimated Time: 4-5 hours
Difficulty: Intermediate to Advanced
Prerequisites: Eigenvalues and PCA modules
Key Insight: Any table of data can be decomposed into hidden factors

A Non-Math Example: Restaurant Preferences

The Problem

Five friends rate 6 restaurants. Can we predict what Alice thinks of restaurants she hasn’t tried?
# Friends' restaurant ratings (1-5 stars)
# 0 = haven't tried
ratings = {
    #              Pizza  Sushi  Steak  Salad  Taco  Ramen
    "Alice":      [5,     4,     0,     2,     0,    0],
    "Bob":        [5,     0,     5,     1,     4,    0],
    "Carol":      [1,     5,     1,     5,     2,    5],
    "Dave":       [0,     4,     0,     5,     0,    5],
    "Eve":        [4,     0,     5,     0,     5,    0],
}

The Human Insight

Looking at this, you notice:
  • Alice, Bob, Eve like hearty food (pizza, steak, tacos)
  • Carol, Dave like lighter options (sushi, salad, ramen)
There seem to be 2 “types” of eaters:
  1. “Hearty eater” factor
  2. “Light eater” factor

SVD Discovers This Automatically!

import numpy as np

# Convert to matrix
ratings_matrix = np.array([
    [5, 4, 0, 2, 0, 0],
    [5, 0, 5, 1, 4, 0],
    [1, 5, 1, 5, 2, 5],
    [0, 4, 0, 5, 0, 5],
    [4, 0, 5, 0, 5, 0],
])

# Apply SVD
U, S, VT = np.linalg.svd(ratings_matrix, full_matrices=False)

print("Hidden factors found:", len(S))
print("Importance of each:", S.round(1))
# [12.2, 9.8, 2.1, 1.2, 0.3]
# First 2 factors dominate! (Hearty vs Light)

Predict Missing Ratings

# Keep only top 2 factors
k = 2
predicted = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]

print("Original (0 = unknown):")
print(ratings_matrix)

print("\nPredicted:")
print(predicted.round(1))
Result: SVD predicts Alice would rate Steak 4.2 and Taco 3.8! (She’s a hearty eater like Bob and Eve.) This is EXACTLY how Netflix recommendations work! SVD Netflix Recommendations

The Math: Breaking Any Matrix Into Patterns

The Core Formula

Any matrix can be broken into 3 simpler matrices: A=UΣVTA = U \Sigma V^T Where:
  • UU = left singular vectors (patterns in rows — users/people). Each column of UU is a “user archetype.”
  • Σ\Sigma = singular values (how important each pattern is). These are always non-negative and sorted from largest to smallest. Think of them as volume knobs for each pattern.
  • VTV^T = right singular vectors (patterns in columns — items/restaurants). Each row of VTV^T is an “item archetype.”
The beauty of SVD is that it decomposes any messy data table into a sum of simple rank-1 “layers.” The first layer (σ1u1v1T\sigma_1 \mathbf{u}_1 \mathbf{v}_1^T) captures the single most important pattern. The second layer adds the next most important pattern. And so on. You can stop at any point and have the best possible approximation using that many layers — this is guaranteed by the Eckart-Young theorem.
import numpy as np

# Any matrix
A = np.array([
    [4, 0, 2],
    [0, 3, 0],
    [2, 0, 1]
])

# SVD decomposition
U, S, VT = np.linalg.svd(A)

print(f"U (row patterns): {U.shape}")    # (3, 3)
print(f"S (importance): {S.shape}")       # (3,)
print(f"VT (column patterns): {VT.shape}")  # (3, 3)

# Reconstruct perfectly
A_reconstructed = U @ np.diag(S) @ VT
print(np.allclose(A, A_reconstructed))  # True!
SVD Math Concept

SVD vs Eigendecomposition

This is one of the most common points of confusion, so let us be precise about when to use which.
AspectEigendecompositionSVD
Works onSquare matrices onlyAny matrix (m x n)
FormulaA=VΛV1A = V \Lambda V^{-1}A=UΣVTA = U \Sigma V^T
ValuesEigenvalues (can be negative or complex)Singular values (always real, non-negative)
RequirementMatrix must be diagonalizableAlways works for any matrix
VectorsLeft and right eigenvectors (may not be orthogonal)UU and VV are always orthogonal
Best forSquare systems, stability analysis, PageRankRectangular data, recommendations, compression
The practical rule: If your matrix is square and symmetric (like a covariance matrix), eigendecomposition and SVD give essentially the same information (singular values = absolute eigenvalues). For everything else — especially user-item matrices, document-term matrices, or any non-square data — use SVD.

The Key Relationship

For a matrix AA:
  • Singular values of AA = square roots of eigenvalues of ATAA^T A
  • σi=λi(ATA)\sigma_i = \sqrt{\lambda_i(A^T A)}
A = np.array([[4, 0], [3, -5]])

# SVD
U, S, VT = np.linalg.svd(A)
print("Singular values:", S)  # [6.32, 3.16]

# Eigenvalues of A^T A
eigenvalues = np.linalg.eigvals(A.T @ A)
print("√eigenvalues:", np.sqrt(sorted(eigenvalues, reverse=True)))  # [6.32, 3.16]

Low-Rank Approximation

The magic of SVD: Keep only the top-k singular values for a best approximation! Ak=UkΣkVkT=i=1kσiuiviTA_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T This is optimal in the Frobenius norm (and the spectral norm): AkA_k is the best rank-k approximation to AA. No other rank-k matrix is closer. This is a remarkable guarantee — it means SVD truncation is not just “pretty good,” it is mathematically provably the best you can do with k components. In practice, this means: if you want to compress a 1000x1000 matrix down to something manageable, SVD tells you exactly how much information you lose at each level of compression, and no other method can do better.
# Original 5×6 matrix with rank 5
# Keep only top 2 singular values → rank 2 approximation

U, S, VT = np.linalg.svd(A, full_matrices=False)

# Keep top k=2 components
k = 2
A_approx = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]

# How much did we lose?
error = np.linalg.norm(A - A_approx) / np.linalg.norm(A)
print(f"Reconstruction error: {error:.2%}")

Example 1: Movie Recommendation System (Netflix-Style)

The Setup

# User-Movie rating matrix (5 users × 6 movies)
# Rows = users, Columns = movies
# 0 = not rated (what we want to predict!)

import numpy as np

ratings = np.array([
    [5, 3, 0, 1, 0, 0],  # User 0: seems to like action (Movies 0,1)
    [4, 0, 0, 1, 0, 0],  # User 1: similar to User 0
    [1, 1, 0, 5, 4, 5],  # User 2: seems to like romance (Movies 3,4,5)
    [0, 0, 0, 4, 5, 4],  # User 3: similar to User 2
    [0, 1, 5, 4, 0, 0],  # User 4: mixed taste
])

print(f"Matrix shape: {ratings.shape}")  # (5, 6)
print(f"Unknown ratings: {(ratings == 0).sum()} out of {ratings.size}")
# 15 out of 30 ratings are unknown (50% missing!)
Challenge: Predict the 15 missing ratings!

Apply SVD

# Decompose the ratings matrix
U, S, VT = np.linalg.svd(ratings, full_matrices=False)

print("Singular values (pattern importance):")
print(S.round(2))
# [12.48, 9.51, 1.35, 0.89, 0.12]
# First 2 values are MUCH bigger — 2 main patterns!

Discover Hidden Factors

# What are the 2 main patterns?
print("\nMovie patterns (which movies go together):")
print("Pattern 1:", VT[0].round(2))  # [0.67, 0.38, 0.09, -0.23, -0.32, -0.26]
print("Pattern 2:", VT[1].round(2))  # [-0.08, -0.21, 0.15, 0.58, 0.52, 0.57]

# Pattern 1: High for Movies 0,1 (Action movies!)
# Pattern 2: High for Movies 3,4,5 (Romance movies!)
SVD automatically discovered: Movies 0-1 are “action” and Movies 3-5 are “romance” — without us labeling them!

→ Romance movies (Movies 3, 4, 5)


**Interpretation**:
- **Factor 1**: "Action preference"
- **Factor 2**: "Romance preference"

Users are combinations of these factors!

```python
print("\\nUser factors (U):")
print(U_k)

# User 0: [high action, low romance]
# User 2: [low action, high romance]
# User 4: [medium action, medium romance] - mixed taste!
Real Application: This is exactly how Netflix works! They use ~50 hidden factors instead of 2.

Example 2: House Price Patterns

The Problem

# House-Feature matrix (100 houses × 10 features)
# Discover hidden patterns in house data

np.random.seed(42)
n_houses = 100
n_features = 10

# Simulate house data with hidden patterns
houses = np.random.randn(n_houses, n_features)

# Add pattern 1: "Luxury" (beds, baths, sqft high together)
luxury_factor = np.random.randn(n_houses)
houses[:, 0] += luxury_factor * 0.8  # beds
houses[:, 1] += luxury_factor * 0.7  # baths
houses[:, 2] += luxury_factor * 0.9  # sqft

# Add pattern 2: "Location" (school, walk, crime related)
location_factor = np.random.randn(n_houses)
houses[:, 5] += location_factor * 0.6  # school
houses[:, 7] += location_factor * 0.5  # walkability
houses[:, 6] -= location_factor * 0.4  # crime (inverse)

Apply SVD

# SVD
U, S, VT = np.linalg.svd(houses, full_matrices=False)

print(f"Singular values: {S[:5]}")
# [12.3, 10.8, 3.2, 2.9, 2.1]
# Top 2 dominate!

# Keep top 2 patterns
k = 2
houses_compressed = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]

# Compression ratio
original_size = houses.size
compressed_size = U[:, :k].size + k + VT[:k, :].size
print(f"Compression: {compressed_size / original_size:.1%}")  # 24% of original!

Interpret Patterns

# What do the patterns represent?
features = ['beds', 'baths', 'sqft', 'age', 'dist', 'school', 'crime', 'walk', 'park', 'yard']

print("\\nPattern 1 (Luxury):")
for i, feature in enumerate(features):
    print(f"  {feature}: {VT[0, i]:.3f}")

# High: beds, baths, sqft
# → "Luxury/Size" pattern

print("\\nPattern 2 (Location):")
for i, feature in enumerate(features):
    print(f"  {feature}: {VT[1, i]:.3f}")

# High: school, walkability
# Low: crime
# → "Location Quality" pattern
Real Application: Zillow uses SVD to discover hidden house “types” (luxury urban, suburban family, etc.)

Example 3: Student Performance Prediction

The Problem

# Student-Subject matrix (200 students × 8 subjects)
# Predict grades in subjects students haven't taken yet

students = np.random.randn(200, 8)

# Add pattern 1: "STEM aptitude"
stem_factor = np.random.randn(200)
students[:, 0] += stem_factor * 0.9  # Math
students[:, 2] += stem_factor * 0.8  # Physics
students[:, 4] += stem_factor * 0.7  # CS

# Add pattern 2: "Humanities aptitude"
humanities_factor = np.random.randn(200)
students[:, 1] += humanities_factor * 0.8  # English
students[:, 3] += humanities_factor * 0.7  # History
students[:, 5] += humanities_factor * 0.6  # Art

Apply SVD

U, S, VT = np.linalg.svd(students, full_matrices=False)

print(f"Singular values: {S[:5]}")
# [18.5, 15.2, 4.3, 3.8, 2.9]

# Keep top 2 aptitudes
k = 2
students_approx = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]

Predict Missing Grades

# Simulate missing grades (set some to 0)
students_incomplete = students.copy()
mask = np.random.rand(*students.shape) < 0.3  # 30% missing
students_incomplete[mask] = 0

# Predict using SVD
U, S, VT = np.linalg.svd(students_incomplete, full_matrices=False)
students_predicted = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]

# Compare predictions
print("\\nActual grade:", students[0, 3])
print("Predicted grade:", students_predicted[0, 3])
# Close match!
Real Application: Coursera uses SVD to recommend courses based on your performance in similar courses!

Matrix Decomposition Methods: The Complete Comparison

This is the single most important reference table in the course. Every ML practitioner should know when to reach for which decomposition.
DecompositionFormulaInput RequirementsValuesUniquenessComputational CostPrimary ML Use
EigendecompositionA=VΛV1A = V \Lambda V^{-1}Square, diagonalizableCan be negative or complexEigenvectors unique up to scalingO(n3)O(n^3)Covariance analysis, PageRank, stability
SVDA=UΣVTA = U \Sigma V^TAny matrix (m x n)Always real, non-negativeUnique (for distinct singular values)O(mnmin(m,n))O(mn \cdot \min(m,n))Recommendations, compression, pseudoinverse
PCAProject onto top eigenvectors of XTXX^TXCentered data matrixEigenvalues of covarianceUnique up to sign flipsO(nd2)O(nd^2) or via SVDDimensionality reduction, visualization
LUA=LUA = LUSquare, non-singularN/AUnique with pivotingO(n3)O(n^3) once, O(n2)O(n^2) per solveSolving linear systems efficiently
QRA=QRA = QRAny matrix (m x n)N/AUnique (for full rank)O(mn2)O(mn^2)Numerically stable least squares
CholeskyA=LLTA = LL^TSymmetric positive definiteN/AUniqueO(n3/3)O(n^3/3) — fastestGaussian processes, sampling, fast solves
Common Confusion — “Which decomposition should I use?”
  • Square symmetric matrix (covariance, kernel, Laplacian): Eigendecomposition or Cholesky. These exploit symmetry for speed and guaranteed real values.
  • Rectangular data matrix (users x items, documents x words): SVD. It is the only decomposition that works on non-square matrices and provides an optimal low-rank approximation.
  • Solving Ax = b once: LU decomposition (or just np.linalg.solve).
  • Solving Ax = b many times with different b: LU decomposition, factor once, solve cheaply.
  • Least squares (overdetermined system): QR decomposition. More stable than normal equations.
  • Sampling from a multivariate Gaussian: Cholesky. It is twice as fast as eigendecomposition for this purpose.
  • Not sure: SVD is the safest default. It always works and never fails.

Relationship Between Decompositions

The decompositions are not independent — they are deeply connected:
PCA of data matrix X:
  1. Center X (subtract mean)
  2. Compute covariance C = X^T X / (n-1)
  3. Eigendecompose C = V Lambda V^T
  4. Project: X_reduced = X @ V_k

Equivalently via SVD:
  1. Center X
  2. SVD: X = U Sigma V^T
  3. Principal components = columns of V
  4. Singular values = sqrt((n-1) * eigenvalues of covariance)
  5. Projected data = U_k Sigma_k (same as X @ V_k)
In practice, computing PCA via SVD of the data matrix directly (rather than forming XTXX^T X and eigendecomposing) is more numerically stable, especially for wide matrices where the number of features exceeds the number of samples.

Comparison

MethodInputOutputUse Case
EigenvaluesSquare matrixEigenvalues + eigenvectorsFeature importance
PCAData matrixPrincipal componentsDimensionality reduction
SVDAny matrix3 matrices (U, Sigma, V)Recommendation, compression

When to Use Each

Use Eigenvalues when:
  • You have a square matrix (covariance, adjacency)
  • You want to find important directions
  • Example: PageRank, stability analysis
Use PCA when:
  • You want to reduce features
  • You want to visualize high-D data
  • Example: Compress 10 features → 3
Use SVD when:
  • You have a rectangular matrix (users x items, documents x words)
  • You want to fill missing values (recommendation systems)
  • You want to discover hidden patterns (latent factor models)
  • You want the most numerically stable decomposition (SVD never fails)
  • You need a guaranteed optimal low-rank approximation
  • Example: Recommendations, collaborative filtering, latent semantic analysis, image compression
The mental model: Eigendecomposition asks “what are the natural axes of this transformation?” SVD asks “what are the most important patterns in this data table?” Both decompose complexity into simple, ranked components — but SVD is more general and more robust.

SVD Applications

1. Image Compression

from PIL import Image

# Load image
img = Image.open('photo.jpg').convert('L')  # Grayscale
img_array = np.array(img)

# SVD
U, S, VT = np.linalg.svd(img_array, full_matrices=False)

# Compress: keep top k singular values
k = 50  # Instead of 512
img_compressed = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]

# Compression ratio
original_size = img_array.size
compressed_size = U[:, :k].size + k + VT[:k, :].size
print(f"Compression: {compressed_size / original_size:.1%}")  # 20%!

# Save compressed image
Image.fromarray(img_compressed.astype(np.uint8)).save('compressed.jpg')

2. Noise Reduction

The key insight: real signal lives in the top singular values (large, structured patterns), while noise is spread across all singular values (small, random contributions). By keeping only the top-k singular values, you keep the signal and discard the noise. It is like listening to a conversation in a noisy room — your brain naturally focuses on the loud, structured voice (the signal) and ignores the quiet, random background chatter (the noise).
# Noisy data
data_noisy = data_clean + np.random.randn(*data_clean.shape) * 0.5

# SVD
U, S, VT = np.linalg.svd(data_noisy, full_matrices=False)

# Keep only large singular values (signal)
# Drop small ones (noise)
# The challenge: choosing k. Too small = lose signal. Too large = keep noise.
# A common heuristic: look for a "gap" in the singular value spectrum.
k = 10
data_denoised = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]

3. Latent Semantic Analysis (LSA)

# Document-Term matrix
# Rows = documents, Columns = words
# Discover hidden topics!

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "machine learning is great",
    "deep learning neural networks",
    "python programming language",
    # ... more documents
]

vectorizer = TfidfVectorizer()
doc_term_matrix = vectorizer.fit_transform(documents).toarray()

# SVD
U, S, VT = np.linalg.svd(doc_term_matrix, full_matrices=False)

# Top 3 topics
k = 3
topics = VT[:k, :]

# Interpret topics
terms = vectorizer.get_feature_names_out()
for i in range(k):
    top_terms_idx = np.argsort(topics[i])[-5:][::-1]
    print(f"Topic {i}: {[terms[idx] for idx in top_terms_idx]}")

🎯 Practice Exercises & Real-World Applications

Challenge yourself! These exercises demonstrate SVD applications powering billion-dollar companies.

Exercise 1: Build a Movie Recommendation Engine 🎬

Create a Netflix-style recommendation system:
import numpy as np

# User-Movie ratings matrix (0 = not rated)
# Users: Alice, Bob, Carol, Dave, Eve
# Movies: Action1, Action2, Romance1, Romance2, SciFi1, Comedy1
ratings = np.array([
    [5, 4, 1, 0, 5, 3],   # Alice: likes action, scifi
    [4, 5, 0, 1, 4, 2],   # Bob: similar to Alice
    [1, 0, 5, 5, 2, 4],   # Carol: likes romance, comedy
    [0, 1, 4, 5, 1, 5],   # Dave: similar to Carol
    [5, 5, 1, 1, 0, 0],   # Eve: action fan, hasn't seen scifi/comedy
])

# TODO:
# 1. Apply SVD with rank-2 approximation
# 2. Predict Eve's rating for SciFi1 and Comedy1
# 3. Which movie should we recommend to Eve?
# 4. Find users most similar to Alice
import numpy as np

ratings = np.array([
    [5, 4, 1, 0, 5, 3],
    [4, 5, 0, 1, 4, 2],
    [1, 0, 5, 5, 2, 4],
    [0, 1, 4, 5, 1, 5],
    [5, 5, 1, 1, 0, 0],
])

users = ['Alice', 'Bob', 'Carol', 'Dave', 'Eve']
movies = ['Action1', 'Action2', 'Romance1', 'Romance2', 'SciFi1', 'Comedy1']

print("🎬 Movie Recommendation Engine with SVD")
print("=" * 55)

# 1. Apply SVD
U, S, VT = np.linalg.svd(ratings, full_matrices=False)

# Rank-2 approximation (captures main "taste" dimensions)
k = 2
U_k = U[:, :k]
S_k = np.diag(S[:k])
VT_k = VT[:k, :]

# Predicted ratings
predicted = U_k @ S_k @ VT_k

print("\n📊 Original Ratings:")
print("       ", "  ".join([m[:6].ljust(6) for m in movies]))
for i, user in enumerate(users):
    row = " ".join([f"{r:5.0f}" if r > 0 else "    -" for r in ratings[i]])
    print(f"  {user:6s} {row}")

print("\n🔮 Predicted Ratings (SVD Rank-2):")
print("       ", "  ".join([m[:6].ljust(6) for m in movies]))
for i, user in enumerate(users):
    row = " ".join([f"{r:5.1f}" for r in predicted[i]])
    print(f"  {user:6s} {row}")

# 2. Eve's predictions for unrated movies
print("\n🎯 Predictions for Eve:")
eve_idx = 4
for j in [4, 5]:  # SciFi1 and Comedy1
    print(f"   {movies[j]}: {predicted[eve_idx, j]:.2f}")

# 3. Recommend to Eve
unrated_by_eve = [j for j in range(6) if ratings[eve_idx, j] == 0]
best_movie_idx = max(unrated_by_eve, key=lambda j: predicted[eve_idx, j])
print(f"\n🌟 Recommendation for Eve: {movies[best_movie_idx]}")
print(f"   (Predicted rating: {predicted[eve_idx, best_movie_idx]:.2f})")

# 4. Find users similar to Alice (using user vectors in latent space)
alice_vec = U_k[0]
print("\n👥 Users Most Similar to Alice:")
for i in range(1, len(users)):
    similarity = np.dot(alice_vec, U_k[i]) / (np.linalg.norm(alice_vec) * np.linalg.norm(U_k[i]))
    print(f"   {users[i]}: {similarity:.3f}")
Real-World Insight: Netflix’s recommendation system uses a more advanced version of this with 100+ latent factors. Their recommendation engine is worth ~$1B/year in reduced churn!

Exercise 2: Image Compression with SVD 📷

Compress an image using low-rank approximation:
import numpy as np

# Create a simple 16x16 "image" with structure
np.random.seed(42)
image = np.zeros((16, 16))

# Add some patterns
image[2:6, 2:6] = 0.8     # Square
image[2:6, 10:14] = 0.9   # Another square
image[10:14, 5:11] = 0.7  # Rectangle at bottom
image += np.random.randn(16, 16) * 0.1  # Add noise

# TODO:
# 1. Apply SVD
# 2. Reconstruct with different ranks (1, 3, 5, 10)
# 3. Calculate compression ratio and error for each
# 4. Find minimum rank for < 5% error
import numpy as np

np.random.seed(42)
image = np.zeros((16, 16))
image[2:6, 2:6] = 0.8
image[2:6, 10:14] = 0.9
image[10:14, 5:11] = 0.7
image += np.random.randn(16, 16) * 0.1

print("📷 Image Compression with SVD")
print("=" * 55)
print(f"Original image: {image.shape}")

# Apply SVD
U, S, VT = np.linalg.svd(image, full_matrices=False)

print("\n📊 Singular Values:")
print(f"   {S[:5].round(2)} ...")
print(f"   Top 3 capture: {(S[:3]**2).sum() / (S**2).sum() * 100:.1f}% of variance")

# Compression analysis
print("\n📈 Compression Analysis:")
print("-" * 55)
print(f"{'Rank':<6} {'Storage':<12} {'Compression':<14} {'MSE':<10} {'Quality'}")
print("-" * 55)

original_size = 16 * 16  # 256 values

for k in [1, 3, 5, 10, 16]:
    # Reconstruct with rank k
    reconstructed = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]
    
    # Error
    mse = np.mean((image - reconstructed) ** 2)
    psnr = 10 * np.log10(1.0 / mse) if mse > 0 else 100
    
    # Storage: k values in U columns + k singular values + k rows of VT
    storage = k * 16 + k + k * 16  # 33k values
    compression = original_size / storage
    
    quality = "★" * min(5, int(psnr / 10))
    print(f"k={k:<4} {storage:<12} {compression:<14.2f}x {mse:<10.4f} {quality}")

# Find minimum rank for <5% relative error
target_error = np.mean(image ** 2) * 0.05
for k in range(1, 17):
    recon = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]
    mse = np.mean((image - recon) ** 2)
    if mse < target_error:
        print(f"\n🎯 Minimum rank for <5% error: k={k}")
        print(f"   Compression: {original_size / (33*k):.2f}x")
        break

# Visual comparison (ASCII art)
print("\n🖼️ Visual Comparison (scaled 0-9):")

def to_ascii(img):
    scaled = np.clip(img, 0, 1) * 9
    return "\n".join(["".join([str(int(v)) for v in row]) for row in scaled])

print("\nOriginal:")
print(to_ascii(image[::4, ::4]))  # Sample 4x4

print("\nRank-3 Approximation:")
recon_3 = U[:, :3] @ np.diag(S[:3]) @ VT[:3, :]
print(to_ascii(recon_3[::4, ::4]))
Real-World Insight: SVD-based compression is used in JPEG 2000 and video codecs. A 1080p video frame can be compressed 50x with SVD!

Exercise 3: Latent Semantic Analysis (LSA) for Search 🔍

Build a semantic search engine that understands meaning:
import numpy as np

# Document-term matrix (rows=docs, cols=words)
# Words: car, vehicle, auto, road, engine, cat, dog, pet, animal, kitten
doc_term = np.array([
    [3, 2, 1, 1, 2, 0, 0, 0, 0, 0],  # Doc about cars
    [2, 3, 2, 0, 1, 0, 0, 0, 0, 0],  # Doc about vehicles
    [0, 0, 0, 0, 0, 3, 2, 2, 1, 0],  # Doc about dogs
    [0, 0, 0, 0, 0, 2, 0, 1, 2, 3],  # Doc about cats
    [1, 1, 0, 2, 0, 0, 0, 0, 0, 0],  # Doc about roads
])

# Query: "automobile" (similar to car, vehicle, auto)
query = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

# TODO:
# 1. Apply SVD with k=2 latent topics
# 2. Project query into latent space
# 3. Find most relevant document
# 4. Identify the "topics" discovered
import numpy as np

doc_term = np.array([
    [3, 2, 1, 1, 2, 0, 0, 0, 0, 0],
    [2, 3, 2, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 3, 2, 2, 1, 0],
    [0, 0, 0, 0, 0, 2, 0, 1, 2, 3],
    [1, 1, 0, 2, 0, 0, 0, 0, 0, 0],
])

words = ['car', 'vehicle', 'auto', 'road', 'engine', 'cat', 'dog', 'pet', 'animal', 'kitten']
docs = ['Cars Article', 'Vehicles Guide', 'Dogs 101', 'Cat Lovers', 'Road Trip']

query = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])  # "automobile"

print("🔍 Semantic Search with LSA (SVD)")
print("=" * 55)

# 1. Apply SVD
U, S, VT = np.linalg.svd(doc_term, full_matrices=False)

# Keep k=2 latent topics
k = 2
U_k = U[:, :k]       # Documents in topic space
S_k = np.diag(S[:k])
VT_k = VT[:k, :]     # Words in topic space

print(f"Reduced to {k} latent topics")
print(f"Singular values: {S[:k].round(2)}")

# 4. Interpret topics (words)
print("\n📚 Discovered Topics:")
for i in range(k):
    topic_words = VT_k[i]
    top_word_idx = np.argsort(np.abs(topic_words))[-4:][::-1]
    top_words = [f"{words[j]}({topic_words[j]:.2f})" for j in top_word_idx]
    print(f"   Topic {i+1}: {', '.join(top_words)}")

# 2. Project query into latent space
query_latent = query @ VT_k.T @ np.linalg.inv(S_k)
print(f"\nQuery 'automobile' in latent space: {query_latent.round(3)}")

# 3. Find most relevant document (cosine similarity)
print("\n📄 Search Results:")
print("-" * 40)

similarities = []
for i in range(len(docs)):
    doc_latent = U_k[i]
    sim = np.dot(query_latent, doc_latent) / (np.linalg.norm(query_latent) * np.linalg.norm(doc_latent) + 1e-10)
    similarities.append((docs[i], sim))

similarities.sort(key=lambda x: x[1], reverse=True)
for rank, (doc, sim) in enumerate(similarities, 1):
    bar = "█" * int(max(0, sim) * 20)
    print(f"   {rank}. {doc:<15} {sim:.3f} {bar}")

# Demonstrate semantic matching
print("\n💡 Key Insight:")
print("   'automobile' matches 'car' & 'vehicle' docs")
print("   even though 'automobile' wasn't in documents!")
print("   SVD discovered the semantic relationship!")
Real-World Insight: This is exactly how early Google worked! Modern search engines (Google, Bing) use neural network embeddings (like BERT), but LSA was the breakthrough that made semantic search possible.

Exercise 4: Noise Reduction in Signals 📡

Use SVD to clean noisy sensor data:
import numpy as np

# Clean signal: 3 underlying components
np.random.seed(42)
t = np.linspace(0, 4*np.pi, 100)

# True signal components
signal1 = np.sin(t)
signal2 = 0.5 * np.cos(2*t)  
signal3 = 0.3 * np.sin(3*t)

# Observed at 5 sensors with different mixings
mixing = np.array([
    [1.0, 0.5, 0.3],
    [0.8, 0.7, 0.2],
    [0.6, 0.4, 0.5],
    [0.9, 0.3, 0.4],
    [0.7, 0.6, 0.3],
])

true_signals = np.vstack([signal1, signal2, signal3])
clean_data = mixing @ true_signals

# Add noise
noise = np.random.randn(5, 100) * 0.3
noisy_data = clean_data + noise

# TODO:
# 1. Apply SVD to noisy data
# 2. Keep only top 3 components (the true rank)
# 3. Reconstruct denoised signal
# 4. Calculate noise reduction (SNR improvement)
import numpy as np

np.random.seed(42)
t = np.linspace(0, 4*np.pi, 100)

signal1 = np.sin(t)
signal2 = 0.5 * np.cos(2*t)
signal3 = 0.3 * np.sin(3*t)

mixing = np.array([
    [1.0, 0.5, 0.3],
    [0.8, 0.7, 0.2],
    [0.6, 0.4, 0.5],
    [0.9, 0.3, 0.4],
    [0.7, 0.6, 0.3],
])

true_signals = np.vstack([signal1, signal2, signal3])
clean_data = mixing @ true_signals
noise = np.random.randn(5, 100) * 0.3
noisy_data = clean_data + noise

print("📡 Signal Denoising with SVD")
print("=" * 55)

# 1. Apply SVD
U, S, VT = np.linalg.svd(noisy_data, full_matrices=False)

print("Singular Values:")
for i, s in enumerate(S):
    bar = "█" * int(s * 2)
    print(f"   σ{i+1} = {s:6.2f} {bar}")

# 2 & 3. Reconstruct with top 3 components
k = 3
denoised = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]

# 4. Calculate SNR improvement
def snr(signal, noise):
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    return 10 * np.log10(signal_power / noise_power)

noise_before = noisy_data - clean_data
noise_after = denoised - clean_data

snr_before = snr(clean_data, noise_before)
snr_after = snr(clean_data, noise_after)

print(f"\n📊 Noise Reduction Results:")
print("-" * 40)
print(f"   Original SNR:  {snr_before:.2f} dB")
print(f"   Denoised SNR:  {snr_after:.2f} dB")
print(f"   Improvement:   {snr_after - snr_before:.2f} dB")
print(f"   Noise reduced: {(1 - 10**((snr_before-snr_after)/10)) * 100:.1f}%")

# Error analysis
mse_noisy = np.mean((noisy_data - clean_data) ** 2)
mse_denoised = np.mean((denoised - clean_data) ** 2)

print(f"\n📉 Error Analysis:")
print(f"   MSE (noisy):    {mse_noisy:.4f}")
print(f"   MSE (denoised): {mse_denoised:.4f}")
print(f"   Error reduced:  {(1 - mse_denoised/mse_noisy) * 100:.1f}%")

# Visual comparison (one sensor)
print("\n📈 Sample Signal (Sensor 1, first 20 points):")
print("   Clean:    " + " ".join([f"{v:5.2f}" for v in clean_data[0, :10]]))
print("   Noisy:    " + " ".join([f"{v:5.2f}" for v in noisy_data[0, :10]]))
print("   Denoised: " + " ".join([f"{v:5.2f}" for v in denoised[0, :10]]))
Real-World Insight: SVD denoising is used in MRI imaging (saving lives by improving image clarity), audio processing (Dolby noise reduction), and financial data analysis (separating signal from market noise).

Key Takeaways

SVD Core Concepts:
  • Universal Decomposition - Any matrix A = UΣVᵀ (works for non-square!)
  • Low-Rank Approximation - Keep top k singular values for compression
  • Collaborative Filtering - Predict missing ratings via latent factors
  • Latent Factors - Discover hidden patterns (movie genres, user preferences)
  • Noise Reduction - Large singular values = signal, small = noise

Interview Prep: SVD Questions

Q: What’s the difference between PCA and SVD?
PCA uses eigendecomposition of the covariance matrix (square, symmetric). SVD works directly on any matrix (even non-square). For centered data, SVD of X gives the same principal components as PCA. SVD is more numerically stable.
Q: How does Netflix use SVD for recommendations?
User-movie rating matrix is decomposed into user factors and movie factors. Each user/movie is represented by a latent vector capturing hidden preferences (action lover, comedy hater). Predictions = dot product of user and movie vectors.
Q: How do you choose the rank k for truncated SVD?
Methods: (1) Energy/variance retained (e.g., 95%), (2) Cross-validation for prediction tasks, (3) Visualization of singular value decay, (4) Domain knowledge about expected rank.
Q: What are singular values vs eigenvalues?
Singular values are always non-negative real numbers and exist for any matrix. Eigenvalues can be complex and only exist for square matrices. For symmetric A: singular values = |eigenvalues|.

SVD in Modern Deep Learning

SVD is not just a classical technique — it appears throughout modern deep learning in ways that are often invisible but critical.
ApplicationHow SVD Is UsedWhy It Matters
LoRA (Low-Rank Adaptation)Approximate weight updates as low-rank matrices ΔW=BA\Delta W = BAFine-tune billion-parameter LLMs with 1000x fewer trainable parameters
Model CompressionReplace large weight matrices with truncated SVD approximationsDeploy models on mobile/edge devices with limited memory
Attention Head AnalysisSVD of attention matrices reveals what heads have learnedInterpret and debug Transformer models
Weight InitializationSVD-based initialization ensures balanced singular valuesPrevent exploding/vanishing gradients at training start
Matrix CompletionNuclear norm minimization (sum of singular values) as a convex relaxationRecommendation systems, image inpainting
Embedding CompressionSVD on embedding matrices reduces vocabulary storageShrink NLP model sizes by 50-80% with minimal quality loss
LoRA and the SVD Connection: LoRA (Low-Rank Adaptation), one of the most important techniques for fine-tuning large language models, is a direct application of the low-rank approximation idea from SVD. Instead of updating all parameters of a weight matrix WW, LoRA freezes WW and learns a low-rank update ΔW=BA\Delta W = BA where BB is (d x r) and AA is (r x d) with rdr \ll d. This works because the useful updates to a pretrained model tend to live in a low-dimensional subspace — exactly the insight that SVD formalizes. A model with 7 billion parameters can be fine-tuned by training only a few million LoRA parameters.

Common Pitfalls

SVD Mistakes to Avoid:
  1. Confusing U and V - Rows of VTV^T are right singular vectors (features); columns of U are left (samples). Swapping them transposes the entire interpretation.
  2. Wrong Truncation - Keeping too few components loses signal; too many keeps noise. Use cross-validation or the singular value gap heuristic to choose k.
  3. Forgetting Scale - SVD is sensitive to feature scales; consider normalizing. A feature in millions will dominate all singular vectors regardless of its actual importance.
  4. Cold Start Problem - SVD cannot recommend for new users/items with no data. New rows or columns have no representation in the latent space.
  5. Memory Issues - Full SVD on large matrices is expensive (O(mnmin(m,n))O(mn \cdot \min(m,n))). For matrices larger than ~10,000 x 10,000, use randomized SVD (sklearn.utils.extmath.randomized_svd) or sparse SVD (scipy.sparse.linalg.svds).
  6. Treating zeros as ratings - In recommendation matrices, a zero typically means “unobserved,” not “rated zero.” Treating all zeros as real observations biases the decomposition toward predicting low ratings for unseen items. Use masked matrix factorization or fill with user means instead.

Course Complete!

🎉 Congratulations! You’ve mastered Linear Algebra for Machine Learning!

Course Complete!

You’ve completed Linear Algebra for Machine Learning!You now have the mathematical foundation to understand neural networks, dimensionality reduction, recommendation systems, and much more. These concepts appear everywhere in ML—from word embeddings to attention mechanisms to image processing.
Your Linear Algebra Toolkit:
  • Vectors - Data representation, similarity, embeddings
  • Matrices - Transformations, neural network layers, data batches
  • Eigenvalues - Feature importance, stability analysis
  • PCA - Dimensionality reduction, visualization, noise filtering
  • SVD - Recommendations, compression, matrix approximation

Practice More

Apply your skills on real datasets with Kaggle competitions

Continue Learning

Continue to our AI Engineering course for production ML systems

Interview Deep-Dive

Strong Answer:
  • PCA eigendecomposes the covariance matrix C=1n1XTXC = \frac{1}{n-1}X^TX to get principal components. SVD decomposes the data matrix XX directly as X=UΣVTX = U\Sigma V^T. The right singular vectors VV are identical to the eigenvectors of XTXX^TX (the principal components), and singular values relate to eigenvalues by λi=σi2/(n1)\lambda_i = \sigma_i^2 / (n-1).
  • You should always prefer SVD over explicit covariance eigendecomposition for numerical reasons. Computing XTXX^TX squares the condition number. If XX has condition number κ\kappa, then XTXX^TX has κ2\kappa^2. For κ=106\kappa = 10^6, the covariance approach has effective condition number 101210^{12} — you lose 12 digits of precision with float64. SVD works directly on XX with κ=106\kappa = 10^6, preserving 10 reliable digits.
  • SVD is strictly more general: eigendecomposition requires square matrices, but SVD works on any m×nm \times n matrix. For recommendation systems (1M users x 100K items), there is no square covariance matrix to eigendecompose at reasonable cost. SVD of the ratings matrix is the direct path.
  • sklearn.decomposition.PCA uses SVD internally for exactly these reasons. “PCA” is the statistical concept; SVD is the implementation.
  • For the top-kk components, randomized SVD (Halko et al., 2011) runs in O(mnk)O(mnk) versus O(mnmin(m,n))O(mn \cdot \min(m,n)) for full SVD — dramatically faster for low-rank approximations.
Follow-up: What is randomized SVD and why is it so much faster?Randomized SVD exploits the fact that you need only top-kk singular vectors. The algorithm: (1) multiply XX by a random Gaussian matrix Ω\Omega of shape n×(k+p)n \times (k + p) (oversampling parameter p510p \approx 5-10), producing Y=XΩY = X\Omega of shape m×(k+p)m \times (k+p). This “sketches” the column space. (2) QR factorize Y=QRY = QR for an orthonormal basis. (3) Project: B=QTXB = Q^TX, a small matrix. (4) Exact SVD of BB. For a 1M x 100K matrix with k=50k = 50, this is 2000x faster than full SVD. Accuracy loss is negligible when singular values decay rapidly — the error bound is proportional to σk+1\sigma_{k+1}.
Strong Answer:
  • The theorem guarantees: among all rank-kk matrices, the truncated SVD Ak=UkΣkVkTA_k = U_k\Sigma_k V_k^T minimizes AAkF\|A - A_k\|_F. The error is exactly σk+12+σk+22+\sqrt{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \ldots}.
  • This is foundational because it guarantees optimality across many ML applications. Image compression via SVD is the best linear compression. Low-rank matrix factorization for recommendations is the best rank-kk reconstruction. PCA’s top-kk eigenvectors are the projection preserving the most variance. All are the same theorem in different domains.
  • The singular value spectrum tells you compressibility. If values decay rapidly (σ1=100,σ2=50,σ3=10,σ4=1\sigma_1 = 100, \sigma_2 = 50, \sigma_3 = 10, \sigma_4 = 1), rank-3 captures 99.98% of energy. If values decay slowly (all roughly equal), no low-rank approximation is good. Always plot the spectrum first.
  • The theorem holds for Frobenius norm (maps to MSE), but real objectives often care about different metrics — ranking quality (NDCG), click-through rate, or perceptual image quality. Truncated SVD minimizes MSE but not necessarily the metric you care about. Production recommendation systems use SGD-based matrix factorization with arbitrary loss functions rather than pure SVD.
  • This also explains why LoRA works for LLM fine-tuning. The weight update during fine-tuning is approximately low-rank, so ΔW=BA\Delta W = BA (with rdr \ll d) captures most of the meaningful update with far fewer parameters.
Follow-up: Give a concrete example of when the SVD solution would be suboptimal despite being Eckart-Young optimal.Movie recommendation where the business metric is “percentage of recommendations users actually watch” (precision@10). SVD minimizes average squared error across all predictions, treating a 0.1-star error on a 2-star movie the same as on a 5-star movie. For a top-10 list, only the ordering of top items matters. Bayesian Personalized Ranking (BPR) directly optimizes ranking order rather than pointwise MSE. In practice, Netflix combined SVD features with gradient-boosted ranking models optimizing NDCG. The SVD was a good starting point but not the final answer.
Strong Answer:
  • Eigendecomposition (A=QΛQ1A = Q\Lambda Q^{-1}): requires square matrices. Right tool when the matrix represents a transformation or graph: PCA (covariance matrix), PageRank (transition matrix), spectral clustering (Laplacian), stability analysis (Jacobian). Fails for non-square matrices; can produce complex results for non-symmetric matrices.
  • SVD (A=UΣVTA = U\Sigma V^T): works for any matrix. Most general and numerically stable. Use for: dimensionality reduction, image compression, least-squares problems, pseudo-inverses, recommendations. Singular values are always real and non-negative. The downside: no sparsity or non-negativity constraints, hard to interpret.
  • NMF (AWHA \approx WH, W0W \geq 0, H0H \geq 0): produces parts-based, additive decompositions. Faces decompose into eyes, nose, mouth. Documents decompose into topics. Use when data is non-negative and interpretability matters (topic modeling, source separation, image analysis). NMF is NP-hard to solve optimally; algorithms find local minima and are sensitive to initialization.
  • Key trade-off: SVD gives the optimal low-rank approximation (Eckart-Young) but may produce negative values uninterpretable for non-negative data. NMF is interpretable but not globally optimal. Eigendecomposition is natural for square symmetric matrices but cannot handle rectangular matrices.
  • In production: Spotify uses NMF for interpretable playlist “themes,” SVD for cold-start recommendations, and neural collaborative filtering for final ranking.
Follow-up: NMF is NP-hard to solve optimally. How do practical implementations handle this?Alternating optimization: fix WW, optimize HH (non-negative least squares); fix HH, optimize WW. Each sub-problem is convex, converging to a local minimum. Standard practice: run NMF with 10-50 random initializations and keep the lowest reconstruction error. Use NNDSVD initialization (non-negative double SVD) for a deterministic good start. Evaluate on downstream quality (topic coherence, recommendation accuracy), not just reconstruction error. If multiple random starts produce similar solutions, you are likely near the global optimum.
Strong Answer:
  • A grayscale image is an m×nm \times n matrix. Full SVD: A=UΣVTA = U\Sigma V^T. Full storage: mnmn values.
  • Truncated at rank kk: Ak=UkΣkVkTA_k = U_k\Sigma_k V_k^T. Storage: k(m+n+1)k(m + n + 1). For a 1000x1000 image at rank 50: 100,050\approx 100{,}050 values versus 1,000,000 — a 10x compression. At rank 10: 50x compression.
  • Natural images are compressible because they contain large regions of similar color, smooth gradients, and repeated patterns. Top singular values capture large-scale structure (brightness, major shapes); middle values capture textures; the tail captures noise. Dropping the tail barely changes human perception.
  • The scree plot of singular values is the diagnostic: a sharp “elbow” means high compressibility; slow decay means not. Photographs compress well; random noise does not (all singular values roughly equal).
  • This principle underlies JPEG (DCT on 8x8 blocks, related to SVD on fixed basis matrices), though JPEG outperforms direct SVD by exploiting perceptual models and block-local processing.
Follow-up: Why is SVD-based compression not used in practice for images?Three reasons: (1) SVD is global — a single noisy pixel affects all singular vectors. Block-based methods (JPEG’s 8x8 DCT blocks) localize errors. (2) SVD is O(mnmin(m,n))O(mn \cdot \min(m,n)); DCT on 8x8 blocks is O(mn)O(mn) with tiny constants. (3) SVD does not exploit perceptual properties — JPEG uses quantization tables calibrated to human vision, aggressively discarding imperceptible high-frequency components. SVD treats all singular values as equally important. However, SVD is used for matrix completion (Netflix), model compression (LoRA), and latent semantic analysis (text).