Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Principal Component Analysis

Principal Component Analysis (PCA)

The Big Question: Can we predict house prices with fewer features? You have 10 house features, but your model is slow and overfitting. Can you reduce to 3 features while keeping 95% of the information? PCA (Principal Component Analysis) is the answer — it is dimensionality reduction using eigenvectors! Here is the intuition. Imagine shining a flashlight at a 3D object and looking at its shadow on a wall. The shadow is a 2D representation of a 3D thing — some information is lost, but if you choose the right angle for the light, the shadow captures the object’s most recognizable shape. PCA automatically finds the “best angle” to project your high-dimensional data onto fewer dimensions so that the shadow preserves as much of the original structure as possible.
Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Eigenvalues module
Main Example: House feature reduction
Supporting Examples: Student profile compression, Movie recommendation optimization

The Core Idea

From 10 Features to 3

Problem: Too many features cause:
  • Slow training
  • Overfitting
  • Hard to visualize
  • Expensive to collect
Solution: Find the 3 most important directions (principal components) and project data onto them!
# Before PCA
X = np.array([[3, 2000, 15, 5, 8, 7, 2, 95, 3, 1500]])  # 10 features

# After PCA
X_reduced = pca.transform(X)  # 3 features
# Still captures 95% of variance!

The Mathematics of PCA

Step-by-Step Algorithm

  1. Center the data: Subtract the mean from each feature
  2. Compute covariance matrix: C=1n1XTXC = \frac{1}{n-1}X^TX
  3. Find eigenvalues & eigenvectors: Solve Cv=λvCv = \lambda v
  4. Sort by eigenvalue: Largest eigenvalues = most important directions
  5. Project data: Xnew=XVkX_{new} = X \cdot V_k where VkV_k contains top-k eigenvectors

Mathematical Formulation

Given data matrix XRn×dX \in \mathbb{R}^{n \times d} (n samples, d features): Covariance Matrix: C=1n1(XXˉ)T(XXˉ)C = \frac{1}{n-1}(X - \bar{X})^T(X - \bar{X}) Eigendecomposition: C=VΛVTC = V \Lambda V^T Where:
  • VV = matrix of eigenvectors (columns)
  • Λ\Lambda = diagonal matrix of eigenvalues
Projection: Xreduced=(XXˉ)VkX_{reduced} = (X - \bar{X}) V_k Where VkV_k contains the top-k eigenvectors.

Variance Explained

Each eigenvalue λi\lambda_i represents the variance captured by that component. Think of the total variance as a pie: each eigenvalue is a slice, and the variance ratio tells you what fraction of the pie each principal component gets. Variance ratioi=λijλj\text{Variance ratio}_i = \frac{\lambda_i}{\sum_j \lambda_j} When people say “PC1 explains 55% of the variance,” they mean that 55% of all the spread and variation in your data occurs along the direction of the first principal component. The remaining 45% is spread across all other directions. Example: If eigenvalues are [5.0, 2.0, 1.5, 0.5], total = 9.0
ComponentEigenvalueVariance Explained
PC15.055.6%
PC22.022.2%
PC31.516.7%
PC40.55.5%
Keeping PC1 + PC2 = 77.8% of the variance with only 2 features!

Example 1: House Price Prediction with Fewer Features

The Dataset

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# 1000 houses × 10 features
# [beds, baths, sqft, age, distance, school, crime, walk, parking, yard]
np.random.seed(42)
n_houses = 1000

houses = np.random.randn(n_houses, 10)
# Add correlations (realistic data)
houses[:, 2] = houses[:, 0] * 500 + 1500  # sqft correlates with beds
houses[:, 1] = houses[:, 0] * 0.5 + 1.5   # baths correlate with beds

print(f"Original shape: {houses.shape}")  # (1000, 10)

Apply PCA

# Step 1: Standardize (important!)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
houses_scaled = scaler.fit_transform(houses)

# Step 2: Apply PCA
pca = PCA(n_components=3)  # Keep top 3 components
houses_pca = pca.fit_transform(houses_scaled)

print(f"Reduced shape: {houses_pca.shape}")  # (1000, 3)
print(f"Variance explained: {pca.explained_variance_ratio_}")
# [0.45, 0.28, 0.15] = 88% total!

What Do the Components Mean?

# Component loadings (how original features contribute)
components = pca.components_

print("First component (PC1):")
features = ['beds', 'baths', 'sqft', 'age', 'dist', 'school', 'crime', 'walk', 'park', 'yard']
for i, feature in enumerate(features):
    print(f"  {feature}: {components[0, i]:.3f}")

# Output:
# beds: 0.450
# baths: 0.420
# sqft: 0.480  ← Dominates!
# age: -0.120
# ...
Interpretation:
  • PC1 (45% variance): “House size” (beds + baths + sqft)
  • PC2 (28% variance): “Location quality” (school + walk - crime)
  • PC3 (15% variance): “Age vs. modernization”

Visualize in 2D

# Plot first 2 principal components
plt.figure(figsize=(10, 6))
plt.scatter(houses_pca[:, 0], houses_pca[:, 1], alpha=0.5)
plt.xlabel('PC1: House Size (45% variance)')
plt.ylabel('PC2: Location Quality (28% variance)')
plt.title('Houses in PCA Space')
plt.grid(True)
plt.show()

Predict Prices with Reduced Features

# Train model on original 10 features
from sklearn.linear_model import LinearRegression

prices = np.random.randn(n_houses) * 50000 + 300000  # Simulated prices

model_full = LinearRegression()
model_full.fit(houses_scaled, prices)
score_full = model_full.score(houses_scaled, prices)

# Train model on 3 PCA features
model_pca = LinearRegression()
model_pca.fit(houses_pca, prices)
score_pca = model_pca.score(houses_pca, prices)

print(f"R² with 10 features: {score_full:.3f}")
print(f"R² with 3 features: {score_pca:.3f}")
# Almost the same performance with 70% fewer features!
Key Insight: PCA reduced features from 10 → 3 while maintaining 88% of information and similar prediction accuracy! Real Application: Zillow uses PCA to compress house features for faster recommendations. PCA House Features

Example 2: Student Performance - Simplifying Profiles

The Dataset

# 500 students × 8 features
# [study_hrs, prev_gpa, attendance, sleep, exercise, social, stress, motivation]
students = np.random.randn(500, 8)

# Standardize
students_scaled = scaler.fit_transform(students)

# Apply PCA
pca = PCA(n_components=3)
students_pca = pca.fit_transform(students_scaled)

print(f"Variance explained: {pca.explained_variance_ratio_}")
# [0.38, 0.25, 0.18] = 81% total

Interpret Components

features = ['study', 'gpa', 'attend', 'sleep', 'exercise', 'social', 'stress', 'motiv']
components = pca.components_

print("\\nPrincipal Components:")
for i in range(3):
    print(f"\\nPC{i+1} ({pca.explained_variance_ratio_[i]:.1%} variance):")
    # Find top 3 contributing features
    top_idx = np.argsort(np.abs(components[i]))[-3:][::-1]
    for idx in top_idx:
        print(f"  {features[idx]}: {components[i, idx]:.3f}")
Interpretation:
  • PC1 (38%): “Academic engagement” (study + GPA + attendance)
  • PC2 (25%): “Well-being” (sleep + exercise - stress)
  • PC3 (18%): “Social balance” (social + motivation)

Identify Student Clusters

from sklearn.cluster import KMeans

# Cluster students in PCA space
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(students_pca)

# Visualize
plt.scatter(students_pca[:, 0], students_pca[:, 1], c=clusters, cmap='viridis', alpha=0.6)
plt.xlabel('PC1: Academic Engagement')
plt.ylabel('PC2: Well-being')
plt.title('Student Clusters')
plt.colorbar(label='Cluster')
plt.show()
Clusters Found:
  • Cluster 0: High academic, low well-being (burnout risk!)
  • Cluster 1: Balanced students
  • Cluster 2: Low academic, high social (need intervention)
Real Application: Khan Academy uses PCA to group students and personalize learning paths!

Example 3: Movie Recommendations - Faster Matching

The Dataset

# 1000 movies × 20 features
# [action, romance, comedy, horror, sci-fi, drama, thriller, mystery, 
#  animation, documentary, rating, runtime, year, budget, revenue, ...]
movies = np.random.randn(1000, 20)

# Standardize
movies_scaled = scaler.fit_transform(movies)

# Apply PCA
pca = PCA(n_components=5)  # 20 → 5 features
movies_pca = pca.fit_transform(movies_scaled)

print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")
# 85% with just 5 components!

Interpret Movie “Factors”

# PC1: "Blockbuster factor"
# High: action, sci-fi, budget, revenue
# Low: documentary, indie

# PC2: "Emotional factor"  
# High: romance, drama
# Low: action, horror

# PC3: "Comedy factor"
# High: comedy, animation
# Low: thriller, horror
# Find similar movies in PCA space (much faster!)
from scipy.spatial.distance import cdist

# Query movie (in PCA space)
query_movie = movies_pca[0]

# Compute distances (5D instead of 20D!)
distances = cdist([query_movie], movies_pca, metric='cosine')[0]

# Top 5 similar movies
similar_idx = np.argsort(distances)[1:6]  # Exclude self
print(f"Similar movies: {similar_idx}")
Speed Comparison:
import time

# Original 20D space
start = time.time()
for _ in range(1000):
    distances_full = cdist([movies_scaled[0]], movies_scaled, metric='cosine')
time_full = time.time() - start

# PCA 5D space
start = time.time()
for _ in range(1000):
    distances_pca = cdist([movies_pca[0]], movies_pca, metric='cosine')
time_pca = time.time() - start

print(f"Full: {time_full:.3f}s")
print(f"PCA: {time_pca:.3f}s")
print(f"Speedup: {time_full/time_pca:.1f}x")
# 4x faster with PCA!
Real Application: Netflix uses PCA to compress movie features for real-time recommendations to millions of users!

How PCA Works: Step-by-Step

PCA Math Concept

Step 1: Standardize Data

This is not optional. If you skip standardization, PCA will just find the feature with the biggest numbers and call it the “most important direction.” That tells you nothing useful.
# Why? Features have different scales
# beds: 1-5, sqft: 500-5000, age: 0-100
# Without standardization, sqft dominates everything just because
# it uses bigger numbers -- not because it carries more information.

X = np.array([[3, 2000, 15], [4, 2200, 8]])

# Standardize: mean=0, std=1
# After this, every feature has equal "voice" in the analysis
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_scaled = (X - X_mean) / X_std
The One Exception: If your features are already in the same units and you want PCA to respect absolute variance (e.g., all features are pixel intensities 0-255), you can skip standardization and just center the data (subtract the mean). In image processing, this is common. But for mixed-unit data like house features, always standardize.

Step 2: Compute Covariance Matrix

# Covariance shows how features vary together.
# If beds and sqft tend to rise together, their covariance is positive.
# If age and price move in opposite directions, their covariance is negative.
# The covariance matrix is symmetric -- cov(A,B) = cov(B,A).
cov_matrix = np.cov(X_scaled.T)
print(cov_matrix.shape)  # (3, 3)
Think of the covariance matrix as a summary of all pairwise relationships in your data. Its eigenvectors point in the directions of maximum spread, and its eigenvalues tell you how much spread exists in each direction. PCA simply reads off these directions and sorts them by importance.

Step 3: Find Eigenvectors

# Eigenvectors = principal directions
# Each eigenvector is a direction in feature-space.
# The first eigenvector points in the direction of maximum variance.
# The second is perpendicular to the first, capturing the next-most variance.
# And so on -- they form an orthogonal coordinate system.
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Sort by eigenvalue (largest first)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
Numerical Stability Tip: In practice, use np.linalg.eigh() instead of np.linalg.eig() when your matrix is symmetric (covariance matrices always are). The eigh function exploits symmetry for faster computation and guarantees real-valued eigenvalues, avoiding tiny imaginary artifacts from floating-point errors.

Step 4: Project Data

# Keep top k eigenvectors
k = 2
principal_components = eigenvectors[:, :k]

# Project data onto principal components
X_pca = X_scaled @ principal_components
print(X_pca.shape)  # (2, 2) - reduced from 3 to 2!

Choosing Number of Components

Method 1: Variance Threshold

# Keep components that explain 95% of variance
pca = PCA(n_components=0.95)  # Automatic!
X_pca = pca.fit_transform(X_scaled)
print(f"Kept {pca.n_components_} components")

Method 2: Scree Plot

The scree plot is named after the “scree” at the bottom of a cliff — the pile of small rocks. In the plot, the tall bars on the left are the important components (the cliff), and the short bars on the right are the noise (the scree). The “elbow” where the cliff meets the scree tells you where to cut.
# Fit PCA with all components
pca_full = PCA()
pca_full.fit(X_scaled)

# Plot explained variance
plt.plot(range(1, len(pca_full.explained_variance_ratio_) + 1),
         pca_full.explained_variance_ratio_, 'bo-')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Scree Plot')
plt.grid(True)
plt.show()

# Look for "elbow" - where curve flattens
# Components after the elbow are mostly capturing noise

Method 3: Cumulative Variance

cumsum = np.cumsum(pca_full.explained_variance_ratio_)
n_components = np.argmax(cumsum >= 0.95) + 1
print(f"Need {n_components} components for 95% variance")

🎯 Practice Exercises & Real-World Applications

Challenge yourself! These exercises demonstrate how PCA is used in production systems at major tech companies.

Exercise 1: Customer 360° View Compression 🎯

A marketing team has 50 metrics per customer, but their ML models are slow. Use PCA to compress:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Simulate customer data: 10000 customers, 50 features
np.random.seed(42)

# Create correlated features (realistic)
base_features = np.random.randn(10000, 10)
# Expand to 50 features with correlations
customer_data = np.hstack([
    base_features,
    base_features @ np.random.randn(10, 10) + np.random.randn(10000, 10) * 0.5,
    base_features[:, :5] @ np.random.randn(5, 10) + np.random.randn(10000, 10) * 0.3,
    base_features @ np.random.randn(10, 10) + np.random.randn(10000, 10) * 0.4,
    base_features[:, 5:] @ np.random.randn(5, 10) + np.random.randn(10000, 10) * 0.6,
])

# TODO:
# 1. Standardize the data
# 2. Find how many components capture 95% variance
# 3. Calculate compression ratio
# 4. Measure information loss
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Setup (same as above)
np.random.seed(42)
base_features = np.random.randn(10000, 10)
customer_data = np.hstack([
    base_features,
    base_features @ np.random.randn(10, 10) + np.random.randn(10000, 10) * 0.5,
    base_features[:, :5] @ np.random.randn(5, 10) + np.random.randn(10000, 10) * 0.3,
    base_features @ np.random.randn(10, 10) + np.random.randn(10000, 10) * 0.4,
    base_features[:, 5:] @ np.random.randn(5, 10) + np.random.randn(10000, 10) * 0.6,
])

print("📊 Customer Data Compression with PCA")
print("=" * 50)
print(f"Original shape: {customer_data.shape}")

# 1. Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customer_data)

# 2. Find optimal components
pca_full = PCA()
pca_full.fit(X_scaled)

cumsum = np.cumsum(pca_full.explained_variance_ratio_)
n_95 = np.argmax(cumsum >= 0.95) + 1
n_99 = np.argmax(cumsum >= 0.99) + 1

print(f"\n🎯 Components for 95% variance: {n_95} (was 50)")
print(f"   Components for 99% variance: {n_99}")

# 3. Compression ratio
compression_95 = 50 / n_95
compression_99 = 50 / n_99
print(f"\n📦 Compression Ratios:")
print(f"   95% variance: {compression_95:.1f}x compression")
print(f"   99% variance: {compression_99:.1f}x compression")

# 4. Apply final PCA
pca = PCA(n_components=n_95)
X_compressed = pca.fit_transform(X_scaled)
X_reconstructed = pca.inverse_transform(X_compressed)

reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"\n📉 Information Loss:")
print(f"   Mean squared reconstruction error: {reconstruction_error:.4f}")

# Memory savings
original_size = customer_data.nbytes / 1024**2
compressed_size = X_compressed.nbytes / 1024**2
print(f"\n💾 Memory Savings:")
print(f"   Original: {original_size:.2f} MB")
print(f"   Compressed: {compressed_size:.2f} MB")
print(f"   Saved: {(1 - compressed_size/original_size)*100:.1f}%")
Real-World Insight: Spotify compresses user taste profiles from 1000+ features to ~40 components, enabling real-time playlist generation for 500M+ users!

Exercise 2: Image Compression 📸

Use PCA to compress grayscale images:
import numpy as np

# Simulate a 64x64 grayscale image (like a face)
np.random.seed(42)
image = np.random.rand(64, 64)

# Add some structure (face-like features)
image[20:25, 20:25] = 0.9  # Left eye
image[20:25, 40:45] = 0.9  # Right eye
image[35:40, 25:40] = 0.9  # Mouth
image[15:50, 28:37] = 0.7  # Nose

# TODO:
# 1. Reshape image for PCA (each row is a feature)
# 2. Compress to k components
# 3. Reconstruct and measure error
# 4. Find minimum k for "acceptable" quality (MSE < 0.01)
import numpy as np

np.random.seed(42)
image = np.random.rand(64, 64)
image[20:25, 20:25] = 0.9
image[20:25, 40:45] = 0.9
image[35:40, 25:40] = 0.9
image[15:50, 28:37] = 0.7

print("🖼️ Image Compression with PCA")
print("=" * 50)
print(f"Original image: {image.shape} = {64*64} pixels")

# PCA on rows (each row is a sample, each column is a feature)
from sklearn.decomposition import PCA

def compress_image(img, n_components):
    pca = PCA(n_components=n_components)
    compressed = pca.fit_transform(img)
    reconstructed = pca.inverse_transform(compressed)
    mse = np.mean((img - reconstructed) ** 2)
    return reconstructed, mse, pca.explained_variance_ratio_.sum()

print("\n📊 Compression Results:")
print("-" * 45)
print(f"{'Components':<12} {'MSE':<12} {'Variance %':<12} {'Compression':<12}")
print("-" * 45)

for k in [5, 10, 20, 30, 40, 50]:
    recon, mse, var = compress_image(image, k)
    compression = 64*64 / (k*64 + k)  # Approximate storage
    quality = "✓" if mse < 0.01 else ""
    print(f"{k:<12} {mse:<12.4f} {var*100:<12.1f} {compression:<12.1f}x {quality}")

# Find minimum acceptable k
print("\n🔍 Finding minimum components for MSE < 0.01:")
for k in range(1, 65):
    _, mse, var = compress_image(image, k)
    if mse < 0.01:
        print(f"   Minimum k = {k} (MSE: {mse:.4f}, Variance: {var*100:.1f}%)")
        break

# Visual representation of compression
print("\n📈 Compression Quality by Components:")
for k in [5, 15, 30, 50]:
    _, mse, var = compress_image(image, k)
    bar_len = int((1 - mse) * 30)
    bar = "█" * bar_len + "░" * (30 - bar_len)
    print(f"  k={k:2d}: {bar} {(1-mse)*100:.1f}% quality")
Real-World Insight: JPEG compression uses a related technique (DCT). Netflix uses PCA-like methods to compress video frames, saving billions in bandwidth costs!

Exercise 3: Anomaly Detection in Server Logs 🔍

Use PCA to detect unusual server behavior:
import numpy as np

# Normal server metrics: [CPU, Memory, Disk IO, Network, Requests/s]
np.random.seed(42)
normal_data = np.random.randn(1000, 5) * np.array([10, 15, 20, 25, 100]) + \
              np.array([50, 60, 40, 30, 500])

# Add some anomalies
anomalies = np.array([
    [95, 95, 90, 80, 50],    # High CPU/Memory, low requests (crypto mining?)
    [20, 90, 95, 10, 100],   # Memory leak, disk thrashing
    [50, 60, 40, 95, 5000],  # DDoS attack?
])

all_data = np.vstack([normal_data, anomalies])

# TODO:
# 1. Fit PCA on normal data only
# 2. Project all data onto PCA space
# 3. Calculate reconstruction error for each point
# 4. Flag anomalies (high reconstruction error)
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

np.random.seed(42)
normal_data = np.random.randn(1000, 5) * np.array([10, 15, 20, 25, 100]) + \
              np.array([50, 60, 40, 30, 500])

anomalies = np.array([
    [95, 95, 90, 80, 50],
    [20, 90, 95, 10, 100],
    [50, 60, 40, 95, 5000],
])

all_data = np.vstack([normal_data, anomalies])

print("🔍 Server Anomaly Detection with PCA")
print("=" * 50)

# 1. Standardize using normal data statistics
scaler = StandardScaler()
scaler.fit(normal_data)
normal_scaled = scaler.transform(normal_data)
all_scaled = scaler.transform(all_data)

# 2. Fit PCA on normal data only (keep 2 components for simplicity)
pca = PCA(n_components=2)
pca.fit(normal_scaled)

print(f"PCA trained on {len(normal_data)} normal samples")
print(f"Components capture {pca.explained_variance_ratio_.sum()*100:.1f}% variance")

# 3. Calculate reconstruction error for all points
def reconstruction_error(pca, data):
    transformed = pca.transform(data)
    reconstructed = pca.inverse_transform(transformed)
    return np.mean((data - reconstructed) ** 2, axis=1)

errors = reconstruction_error(pca, all_scaled)

# 4. Set threshold based on normal data
normal_errors = errors[:len(normal_data)]
threshold = np.percentile(normal_errors, 99)  # 99th percentile

print(f"\n📊 Reconstruction Error Statistics:")
print(f"   Normal mean: {normal_errors.mean():.4f}")
print(f"   Normal std:  {normal_errors.std():.4f}")
print(f"   Threshold (99th %ile): {threshold:.4f}")

# Flag anomalies
print("\n🚨 Anomaly Detection Results:")
print("-" * 50)

anomaly_errors = errors[-len(anomalies):]
for i, (err, anomaly) in enumerate(zip(anomaly_errors, anomalies)):
    status = "🔴 ANOMALY" if err > threshold else "🟢 Normal"
    print(f"  Server {i+1}: Error = {err:.4f} {status}")
    if err > threshold:
        print(f"           Metrics: CPU={anomaly[0]:.0f}%, Mem={anomaly[1]:.0f}%, "
              f"Disk={anomaly[2]:.0f}%, Net={anomaly[3]:.0f}%, Req={anomaly[4]:.0f}/s")

# False positive rate on normal data
false_positives = np.sum(normal_errors > threshold)
print(f"\n📈 Performance:")
print(f"   False positive rate: {false_positives/len(normal_data)*100:.2f}%")
print(f"   True positives: {np.sum(anomaly_errors > threshold)}/{len(anomalies)}")
Real-World Insight: AWS CloudWatch and Datadog use similar PCA-based anomaly detection to automatically flag unusual server behavior across millions of instances!

Exercise 4: Gene Expression Analysis 🧬

Reduce high-dimensional genetic data for cancer subtype discovery:
import numpy as np

# Simulated gene expression: 200 patients, 1000 genes
np.random.seed(42)

# 3 cancer subtypes with different gene signatures
subtype1 = np.random.randn(70, 1000) + np.hstack([np.ones(300), np.zeros(700)]) * 2
subtype2 = np.random.randn(70, 1000) + np.hstack([np.zeros(300), np.ones(400), np.zeros(300)]) * 2
subtype3 = np.random.randn(60, 1000) + np.hstack([np.zeros(700), np.ones(300)]) * 2

gene_data = np.vstack([subtype1, subtype2, subtype3])
true_labels = [0]*70 + [1]*70 + [2]*60

# TODO:
# 1. Apply PCA to reduce to 2 components
# 2. Visualize the subtypes (would they cluster?)
# 3. Which genes contribute most to PC1?
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

np.random.seed(42)

subtype1 = np.random.randn(70, 1000) + np.hstack([np.ones(300), np.zeros(700)]) * 2
subtype2 = np.random.randn(70, 1000) + np.hstack([np.zeros(300), np.ones(400), np.zeros(300)]) * 2
subtype3 = np.random.randn(60, 1000) + np.hstack([np.zeros(700), np.ones(300)]) * 2

gene_data = np.vstack([subtype1, subtype2, subtype3])
true_labels = np.array([0]*70 + [1]*70 + [2]*60)

print("🧬 Gene Expression Analysis with PCA")
print("=" * 50)
print(f"Original data: {gene_data.shape[0]} patients × {gene_data.shape[1]} genes")

# 1. Standardize and apply PCA
scaler = StandardScaler()
gene_scaled = scaler.fit_transform(gene_data)

pca = PCA(n_components=2)
gene_2d = pca.fit_transform(gene_scaled)

print(f"\nReduced to: {gene_2d.shape[1]} dimensions")
print(f"Variance captured: {pca.explained_variance_ratio_.sum()*100:.1f}%")

# 2. Analyze cluster separation
print("\n📊 Subtype Centroids in PCA Space:")
for subtype in [0, 1, 2]:
    mask = true_labels == subtype
    centroid = gene_2d[mask].mean(axis=0)
    print(f"   Subtype {subtype+1}: PC1={centroid[0]:6.2f}, PC2={centroid[1]:6.2f}")

# Calculate between-class separation
centroids = np.array([gene_2d[true_labels == i].mean(axis=0) for i in range(3)])
separation = np.mean([np.linalg.norm(centroids[i] - centroids[j]) 
                      for i in range(3) for j in range(i+1, 3)])
print(f"\n   Average centroid distance: {separation:.2f}")

# 3. Find most important genes for PC1
loadings = pca.components_[0]
top_genes_pc1 = np.argsort(np.abs(loadings))[-10:][::-1]

print("\n🔬 Top 10 Genes Contributing to PC1:")
for i, gene_idx in enumerate(top_genes_pc1):
    direction = "↑" if loadings[gene_idx] > 0 else "↓"
    print(f"   {i+1}. Gene_{gene_idx:04d}: {loadings[gene_idx]:+.4f} {direction}")

# Visualize cluster quality (text-based)
print("\n📈 PCA Visualization (ASCII):")
print("   " + "-" * 42)
for subtype, marker in [(0, 'A'), (1, 'B'), (2, 'C')]:
    mask = true_labels == subtype
    mean_pc1 = gene_2d[mask, 0].mean()
    mean_pc2 = gene_2d[mask, 1].mean()
    x_pos = int((mean_pc1 + 3) / 6 * 40)
    x_pos = max(0, min(39, x_pos))
    print(f"   |{' ' * x_pos}{marker}{' ' * (39 - x_pos)}| Subtype {subtype+1}")
print("   " + "-" * 42)
print("     PC1 →")
Real-World Insight: This is exactly how cancer researchers discovered breast cancer subtypes (Luminal A, Luminal B, HER2+, Triple-negative). PCA on gene expression data has saved countless lives through personalized treatment!

Key Takeaways

Core PCA Concepts:
  • Principal Components - Orthogonal directions of maximum variance
  • Standardization Required - Features must be centered and scaled
  • Variance Explained - Choose k components to retain 95%+ variance
  • Use Cases - Dimensionality reduction, visualization, noise reduction, feature extraction

Dimensionality Reduction Methods: When to Use What

PCA is the most common technique, but it is not always the right choice. Here is how it compares to alternatives.
MethodTypePreservesHandles Nonlinearity?Interpretable?SpeedBest For
PCALinearGlobal varianceNoModerate (loadings)Fast (O(nd2)O(nd^2))General-purpose reduction, preprocessing
t-SNENonlinearLocal neighborhoodsYesNo (components are meaningless)Slow (O(n2)O(n^2))2D/3D visualization of clusters
UMAPNonlinearLocal + some global structureYesNoModerateVisualization, clustering preprocessing
AutoencodersNonlinear (learned)Depends on architectureYesNo (latent space is opaque)Slow (training required)Complex nonlinear relationships, images
Factor AnalysisLinear (probabilistic)Latent factor structureNoYes (factor loadings)ModerateWhen you believe data has latent causes
Sparse PCALinear (constrained)Variance with sparse loadingsNoHigh (few features per component)ModerateWhen you need interpretable components
Kernel PCANonlinear (kernel trick)Nonlinear varianceYesNoSlow (O(n3)O(n^3))Nonlinear structure with moderate data size
Decision Flowchart:
  • Need to reduce features for a downstream model? Start with PCA. It is fast, well-understood, and works as a preprocessing step before any algorithm.
  • Need to visualize clusters in 2D? Use t-SNE or UMAP. PCA often fails to reveal cluster structure because it optimizes for variance, not local neighborhood preservation.
  • Have nonlinear relationships (e.g., a spiral in 2D)? PCA will fail. Use Kernel PCA, autoencoders, or UMAP.
  • Need interpretable components for stakeholders? Use Sparse PCA or plain feature selection. “PC1 = 0.3age + 0.2income + …” is hard to explain in a board meeting.
  • Have very high dimensionality (10,000+ features, like genomics)? PCA first to reduce to ~100 dimensions, then t-SNE/UMAP for visualization. This two-step approach is standard practice.

PCA vs t-SNE vs UMAP: A Quick Visual Intuition

Original data (3 clusters in 10D):

PCA (2D):                t-SNE (2D):             UMAP (2D):
  A  A                    AAA    BBB               AAA   BBB
 A  A B B                 AAA    BBB              AAA     BBB
   A  B B                                              
 C C   B                  CCC                       CCC
C C C                     CCC                       CCC

Variance-preserving       Neighborhood-preserving   Balanced local/global
Clusters may overlap      Clusters clearly separated Clusters separated
Distances are meaningful  Distances are NOT meaningful Distances somewhat meaningful
Fast (milliseconds)       Slow (minutes)            Moderate (seconds)
PCA projects onto the directions of maximum variance, which sometimes separates clusters and sometimes does not. t-SNE and UMAP explicitly optimize for keeping nearby points nearby, which is why they produce cleaner cluster visualizations — but at the cost of distorting global distances.

Interview Prep: PCA Questions

Q: Why do we standardize data before PCA?
Without standardization, features with larger scales dominate the principal components. A feature measured in millions would overshadow one measured in decimals, regardless of importance. PCA maximizes variance, so if one feature has variance 1,000,000 and another has variance 1, the first principal component will align almost entirely with the high-variance feature. Standardization (mean=0, std=1) puts all features on equal footing so PCA finds directions of correlated variation, not just directions of big numbers.
Q: How do you choose the number of components?
Common methods: (1) Keep components explaining 95% of variance, (2) Elbow method on scree plot, (3) Kaiser criterion (eigenvalue > 1), (4) Cross-validation if using for prediction.
Q: What are the limitations of PCA?
PCA finds linear relationships only; it struggles with non-linear patterns (e.g., a spiral in 2D cannot be untangled by PCA). It assumes variance = importance, which is not always true — in anomaly detection, the low-variance directions may be exactly where interesting anomalies hide. All components are combinations of all original features, making interpretation difficult — PC1 is not “just temperature” but rather “0.4 * temperature + 0.3 * humidity + 0.2 * pressure + …” which is hard to explain to stakeholders. For interpretable dimensionality reduction, consider feature selection or sparse PCA instead.
Q: When should you NOT use PCA?
When interpretability is crucial (PCA features are hard to explain), when relationships are non-linear (use t-SNE or UMAP instead), or when you have categorical features (PCA is for continuous data).

Common Pitfalls

PCA Mistakes to Avoid:
  1. Forgetting to Standardize - Always scale before PCA; unscaled data gives meaningless results. This is by far the most common mistake. If one feature is in millions and another is 0-1, PCA will obsess over the big-number feature regardless of its actual importance.
  2. Too Few Components - Keeping too few loses important information; check variance explained. A good default threshold is 95%, but for noisy data you might accept 90%.
  3. Using on Categorical Data - PCA assumes continuous, roughly normally distributed features. One-hot encoded categoricals violate this assumption badly. Use MCA (Multiple Correspondence Analysis) or entity embeddings for categoricals instead.
  4. Over-Interpreting Components - PCs are linear combos of all features; labeling “PC1 = size” is often oversimplified. Look at the actual loadings before naming components.
  5. Ignoring Outliers - PCA maximizes variance, so a single extreme outlier can dominate an entire principal component. Consider Robust PCA or remove outliers before fitting.
  6. Applying PCA After Train-Test Split Incorrectly - Fit PCA only on training data, then use the same transformation to project test data. Fitting PCA on the entire dataset before splitting leaks information and inflates your metrics.
  7. Expecting PCA to Help with Classification: PCA maximizes variance, not class separability. The direction of maximum variance might be orthogonal to the direction that separates your classes. If your goal is classification, consider Linear Discriminant Analysis (LDA) instead — it explicitly finds directions that maximize between-class separation relative to within-class scatter.

PCA Reconstruction: What Information Is Lost?

When you reduce from d dimensions to k, the lost information is the projection onto the discarded (d - k) eigenvectors. You can quantify this exactly:
# Reconstruction error = sum of discarded eigenvalues / total
pca_full = PCA()
pca_full.fit(X_scaled)

# If you keep k=3 components out of 10:
kept_variance = sum(pca_full.explained_variance_ratio_[:3])
lost_variance = 1 - kept_variance

print(f"Kept: {kept_variance:.1%}")
print(f"Lost: {lost_variance:.1%}")

# You can also reconstruct and see the difference
pca_3 = PCA(n_components=3)
X_reduced = pca_3.fit_transform(X_scaled)
X_reconstructed = pca_3.inverse_transform(X_reduced)

# The difference shows exactly what PCA threw away
X_lost = X_scaled - X_reconstructed
print(f"Reconstruction MSE: {np.mean(X_lost**2):.4f}")
ML Application — Anomaly Detection via Reconstruction Error: Points that are poorly reconstructed by PCA (high reconstruction error) are anomalies. They have unusual patterns in the dimensions that PCA discarded as “unimportant.” This is the basis of PCA-based anomaly detection, used in fraud detection, manufacturing quality control, and network intrusion detection. The intuition: normal data lives near the principal subspace; anomalies are far from it.

What’s Next?

PCA is great for reducing features, but what if you want to find hidden patterns in your data? Like discovering that users who like action movies also like sci-fi? That’s Singular Value Decomposition (SVD) - the most powerful matrix decomposition!

Next: Singular Value Decomposition (SVD)

Build a movie recommendation system using matrix factorization

Interview Deep-Dive

Strong Answer:
  • This is a red flag, not a victory. One component explaining 95% of variance strongly suggests a data quality issue rather than a genuine low-dimensional structure. The most common causes:
  • (1) Feature scaling problem: one feature has variance orders of magnitude larger than the others (e.g., “annual_revenue” in dollars alongside “employee_count” in single digits). PCA maximizes variance, so PC1 will align almost entirely with the high-variance feature. Check: did you standardize (mean=0, std=1) before PCA? If not, standardize and rerun. This is by far the most common explanation.
  • (2) Redundant features: many features are near-perfect linear combinations of one another. For example, “total_cost” = “unit_cost” * “quantity”, or Fahrenheit and Celsius temperature readings. PC1 captures this shared dimension. Check: compute the correlation matrix and look for clusters of features with r>0.95|r| > 0.95. Remove redundancies.
  • (3) Genuine low intrinsic dimensionality: the data really does vary primarily along one axis. This happens in some physical systems (temperature explains most variation in a weather dataset) or in narrow domains. In this case, the 95% is informative — you might only need 1-2 features for your model, saving significant computation.
  • The investigation: examine the loadings of PC1 (the eigenvector coefficients). If one feature has loading 0.99 and all others are near 0, it is a scaling problem. If several features have similar loadings, it captures a real shared pattern. Also plot PC1 vs PC2 to see if the remaining 5% carries important cluster structure — sometimes the interesting separation is in the low-variance directions (anomaly detection relies on this).
Follow-up: In anomaly detection, why might the low-variance principal components be more useful than the high-variance ones?Normal data points cluster tightly along the high-variance directions (the dominant PCs capture the “normal” pattern of variation). Anomalies, by definition, deviate from normal patterns. If an anomaly deviates along a direction where normal data has near-zero variance, its projection onto that low-variance PC will be an extreme outlier — the signal-to-noise ratio for anomaly detection is actually highest in the low-variance directions. This is the principle behind PCA-based anomaly detection: project data onto the minor principal components (those with small eigenvalues) and flag points with large reconstruction error. A $1,000 transaction that is unusual in the “timing + merchant category” dimension (a low-variance direction) is more suspicious than one that is unusual in the “transaction amount” dimension (a high-variance direction, where large variation is normal).
Strong Answer:
  • PCA fails when its core assumptions are violated. There are four major failure modes, each with a specific alternative:
  • (1) Non-linear relationships: PCA finds linear directions of maximum variance. If your data lies on a curved manifold (a Swiss roll, interleaving spirals, a donut), PCA projects it onto a flat plane, destroying the manifold structure. A 2D Swiss roll projected by PCA onto its top 2 PCs looks like a rectangle — the spiral structure is gone. Alternatives: kernel PCA (maps to higher-dimensional feature space where non-linear structure becomes linear), t-SNE (preserves local neighborhoods), UMAP (preserves both local and some global structure), or autoencoders (learn non-linear projections).
  • (2) Non-Gaussian data: PCA implicitly assumes the “important” directions are those with maximum variance, which is optimal when data is Gaussian. For non-Gaussian distributions (e.g., heavy-tailed, multimodal), variance is a poor proxy for information content. Alternative: Independent Component Analysis (ICA), which finds statistically independent (not just uncorrelated) components. ICA is used in signal processing (separating mixed audio signals — the “cocktail party problem”) where PCA fails because the mixed signals have similar variance but very different statistical structure.
  • (3) Categorical or mixed-type features: PCA requires continuous, numeric features. Applying PCA to one-hot encoded categoricals produces nonsensical components (what does “0.7 * male + 0.3 * female” mean?). Alternatives: Multiple Correspondence Analysis (MCA) for categoricals, FAMD (Factor Analysis of Mixed Data) for mixed types, or learn entity embeddings via a neural network first and apply PCA to the embeddings.
  • (4) When interpretability is required: PCA components are linear combinations of all features, making them hard to explain to stakeholders (“PC1 is 0.34 * age + 0.28 * income - 0.22 * credit_score + …”). Alternatives: Sparse PCA (constrains most loadings to zero, so each PC is a combination of only a few features), feature selection (LASSO, mutual information), or Non-negative Matrix Factorization (NMF), which produces additive, parts-based representations that are naturally interpretable.
Follow-up: You mentioned kernel PCA for non-linear data. How does it work mathematically, and what is the “kernel trick” doing under the hood?Kernel PCA maps data to a higher-dimensional feature space ϕ(x)\phi(x) where non-linear structure becomes linear, then applies standard PCA in that space. The kernel trick avoids explicitly computing ϕ(x)\phi(x) (which may be infinite-dimensional) by noting that PCA only needs dot products between data points, not the actual coordinates. The kernel function k(xi,xj)=ϕ(xi)ϕ(xj)k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) computes these dot products implicitly. For a Gaussian (RBF) kernel, k(xi,xj)=exp(xixj2/2σ2)k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2), the implicit feature space is infinite-dimensional, yet the kernel matrix is just n×nn \times n. You eigendecompose this n×nn \times n kernel matrix rather than the d×dd \times d covariance matrix. The trade-off: standard PCA is O(d2n)O(d^2 n), kernel PCA is O(n2d+n3)O(n^2 d + n^3) — it scales with the number of samples, not features. For large datasets (n>10,000n > 10{,}000), kernel PCA becomes expensive and Nystrom approximation or random Fourier features are used to approximate the kernel matrix.
Strong Answer:
  • This is the pnp \gg n regime (10,000 features, 200 samples), and it fundamentally changes how PCA behaves. The covariance matrix XTXX^TX is 10,000 x 10,000 but has rank at most n1=199n - 1 = 199. This means at most 199 eigenvalues are non-zero — the remaining 9,801 principal components capture zero variance and are meaningless. You cannot possibly find more than 199 informative directions from 200 data points.
  • Computational trick: instead of eigendecomposing the 10,000 x 10,000 matrix XTXX^TX, compute the 200 x 200 matrix XXTXX^T (the dual formulation). The non-zero eigenvalues are identical, and the eigenvectors of XTXX^TX can be recovered from those of XXTXX^T via vi=XTui/σiv_i = X^T u_i / \sigma_i. This reduces computation from O(p3)O(p^3) to O(n2p+n3)O(n^2 p + n^3) — a 50x speedup in this case. Scikit-learn does this automatically when p>np > n.
  • Statistical concern: with so many features relative to samples, the sample covariance matrix is a poor estimate of the true covariance. Many estimated eigenvalues will be artificially inflated or deflated due to sampling noise (the Marcenko-Pastur distribution describes this). Standard PCA will find “directions of maximum noise” rather than “directions of maximum signal.” Regularized covariance estimation (Ledoit-Wolf shrinkage, graphical lasso) or sparse PCA (penalizing the number of non-zero loadings) are essential in this regime.
  • In genomics specifically, the top few PCs often capture population structure (ethnicity, geographic origin) rather than the biological signal of interest. This is both a feature and a bug: researchers use PCA to visualize and correct for population stratification, but if you are looking for disease-associated patterns, you need to remove these confounding PCs first.
  • Alternative approaches for pnp \gg n: LASSO (feature selection, chooses a sparse subset), random forests (handles high dimensions natively), or transfer learning from a pretrained model on a larger dataset.
Follow-up: What is the Marcenko-Pastur distribution and how does it help you decide which principal components are signal versus noise?The Marcenko-Pastur distribution describes the eigenvalue distribution of large random covariance matrices. If your data were pure noise (no signal), the eigenvalues of XTX/nX^TX / n would follow a specific distribution bounded between λ=(1p/n)2\lambda_{-} = (1 - \sqrt{p/n})^2 and λ+=(1+p/n)2\lambda_{+} = (1 + \sqrt{p/n})^2. Eigenvalues above λ+\lambda_{+} are unlikely to be noise — they represent genuine signal. For p=10,000p = 10{,}000 and n=200n = 200, λ+(1+50)264\lambda_{+} \approx (1 + \sqrt{50})^2 \approx 64, meaning any eigenvalue above 64 times the noise level is likely signal. This gives a principled cutoff for choosing the number of components, replacing the ad-hoc “keep 95% of variance” rule. Tracy-Widom statistics provide an exact hypothesis test for each eigenvalue. This approach is standard in genomics and finance, where the pnp \gg n setting makes naive variance-explained thresholds unreliable.
Strong Answer:
  • PCA finds the optimal linear projection that preserves maximum variance. A linear autoencoder (single hidden layer, no activation function, MSE loss) converges to exactly the same solution as PCA — its encoder weights span the same subspace as the top-k principal components. This is a proven mathematical equivalence, not just an empirical observation.
  • The key difference is that non-linear autoencoders (deep networks with ReLU/tanh activations) can capture non-linear structure that PCA misses entirely. A Swiss roll dataset in 3D can be “unrolled” by a non-linear autoencoder into a meaningful 2D representation, while PCA would just squash it flat.
  • When to choose PCA: (1) interpretability matters (PCA loadings are directly interpretable; autoencoder weights are not), (2) dataset is small (autoencoders need thousands of samples to train; PCA works with any size), (3) the data is approximately linear (most tabular data), (4) computational simplicity (PCA is a single eigendecomposition; autoencoders require GPU training, hyperparameter tuning, and regularization).
  • When to choose autoencoders: (1) the data is high-dimensional and non-linear (images, text, audio), (2) you have abundant data (100K+ samples), (3) you want to learn a generative model (variational autoencoders generate new samples), (4) you need to incorporate domain-specific structure (convolutional autoencoders for images, recurrent for sequences).
  • The practical trade-off in production: PCA adds zero latency (precomputed projection matrix, applied as a single matrix multiply). Autoencoders add inference latency (forward pass through the encoder). For a feature preprocessing step in a real-time serving pipeline, PCA is almost always preferred unless the non-linear capacity is demonstrably necessary.
Follow-up: You mentioned that a linear autoencoder converges to the PCA solution. Does it converge to the exact same principal components, or just the same subspace?The same subspace, not necessarily the same components. PCA returns an orthogonal basis ordered by eigenvalue (PC1, PC2, …), where each component is a specific, unique direction. A linear autoencoder’s encoder weights span the same kk-dimensional subspace but may produce an arbitrary rotation of the principal components within that subspace. The reconstruction loss is identical (both find the rank-kk approximation that minimizes MSE), but the intermediate representation may differ by an orthogonal transformation. If you need the specific ordered principal components (e.g., for interpretability or to decide how many to keep), use PCA. If you only need the subspace (e.g., as a preprocessing step before another model), either works.