
Principal Component Analysis (PCA)

The Big Question: Can we predict house prices with fewer features? You have 10 house features, but your model is slow and overfitting. Can you reduce to 3 features while keeping 95% of the information? PCA (Principal Component Analysis) is the answer - it’s dimensionality reduction using eigenvectors!
Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Eigenvalues module
Main Example: House feature reduction
Supporting Examples: Student profile compression, Movie recommendation optimization

The Core Idea

From 10 Features to 3

Problem: Too many features cause:
  • Slow training
  • Overfitting
  • Hard to visualize
  • Expensive to collect
Solution: Find the 3 most important directions (principal components) and project data onto them!
# Before PCA
X = np.array([[3, 2000, 15, 5, 8, 7, 2, 95, 3, 1500]])  # 10 features

# After PCA (assuming pca is a fitted PCA(n_components=3); full example below)
X_reduced = pca.transform(X)  # 3 features
# Still captures ~95% of the variance!

The Mathematics of PCA

Step-by-Step Algorithm

  1. Center the data: Subtract the mean from each feature
  2. Compute covariance matrix: $C = \frac{1}{n-1}X^TX$
  3. Find eigenvalues & eigenvectors: Solve $Cv = \lambda v$
  4. Sort by eigenvalue: Largest eigenvalues = most important directions
  5. Project data: $X_{new} = X \cdot V_k$ where $V_k$ contains the top-k eigenvectors

Mathematical Formulation

Given a data matrix $X \in \mathbb{R}^{n \times d}$ (n samples, d features):
Covariance matrix: $C = \frac{1}{n-1}(X - \bar{X})^T(X - \bar{X})$
Eigendecomposition: $C = V \Lambda V^T$ Where:
  • $V$ = matrix of eigenvectors (columns)
  • $\Lambda$ = diagonal matrix of eigenvalues
Projection: $X_{reduced} = (X - \bar{X}) V_k$ where $V_k$ contains the top-k eigenvectors.
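A minimal NumPy sketch of these formulas on small synthetic data (the 6×3 matrix is purely illustrative):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                       # n=6 samples, d=3 features
X_centered = X - X.mean(axis=0)                   # subtract the column means

C = X_centered.T @ X_centered / (X.shape[0] - 1)  # covariance matrix (d × d)
eigvals, V = np.linalg.eigh(C)                    # eigh: for symmetric matrices
eigvals, V = eigvals[::-1], V[:, ::-1]            # sort largest eigenvalue first

print(np.allclose(C, V @ np.diag(eigvals) @ V.T))  # True: C = V Λ V^T
X_reduced = X_centered @ V[:, :2]                  # project onto top-2 eigenvectors
print(X_reduced.shape)                             # (6, 2)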

Variance Explained

Each eigenvalue $\lambda_i$ represents the variance captured by that component: $\text{Variance ratio}_i = \frac{\lambda_i}{\sum_j \lambda_j}$
Example: If the eigenvalues are [5.0, 2.0, 1.5, 0.5], the total is 9.0:

Component | Eigenvalue | Variance Explained
PC1       | 5.0        | 55.6%
PC2       | 2.0        | 22.2%
PC3       | 1.5        | 16.7%
PC4       | 0.5        | 5.6%

Keeping PC1 + PC2 = 77.8% of the variance with only 2 features!
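A two-line check of that arithmetic (the eigenvalues are just the example values above):
import numpy as np

eigenvalues = np.array([5.0, 2.0, 1.5, 0.5])
ratios = eigenvalues / eigenvalues.sum()
print(np.round(ratios, 3))   # [0.556 0.222 0.167 0.056]
print(ratios[:2].sum())      # ~0.778 -> PC1 + PC2 keep ~77.8% of the variance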

Example 1: House Price Prediction with Fewer Features

The Dataset

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# 1000 houses × 10 features
# [beds, baths, sqft, age, distance, school, crime, walk, parking, yard]
np.random.seed(42)
n_houses = 1000

houses = np.random.randn(n_houses, 10)
# Add correlations (realistic data)
houses[:, 2] = houses[:, 0] * 500 + 1500  # sqft correlates with beds
houses[:, 1] = houses[:, 0] * 0.5 + 1.5   # baths correlate with beds

print(f"Original shape: {houses.shape}")  # (1000, 10)

Apply PCA

# Step 1: Standardize (important!)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
houses_scaled = scaler.fit_transform(houses)

# Step 2: Apply PCA
pca = PCA(n_components=3)  # Keep top 3 components
houses_pca = pca.fit_transform(houses_scaled)

print(f"Reduced shape: {houses_pca.shape}")  # (1000, 3)
print(f"Variance explained: {pca.explained_variance_ratio_}")
# e.g. [0.45, 0.28, 0.15] ≈ 88% total (illustrative; exact values depend on the simulated data)

What Do the Components Mean?

# Component loadings (how original features contribute)
components = pca.components_

print("First component (PC1):")
features = ['beds', 'baths', 'sqft', 'age', 'dist', 'school', 'crime', 'walk', 'park', 'yard']
for i, feature in enumerate(features):
    print(f"  {feature}: {components[0, i]:.3f}")

# Example output (illustrative):
# beds: 0.450
# baths: 0.420
# sqft: 0.480  ← Dominates!
# age: -0.120
# ...
Interpretation:
  • PC1 (45% variance): “House size” (beds + baths + sqft)
  • PC2 (28% variance): “Location quality” (school + walk - crime)
  • PC3 (15% variance): “Age vs. modernization”

Visualize in 2D

# Plot first 2 principal components
plt.figure(figsize=(10, 6))
plt.scatter(houses_pca[:, 0], houses_pca[:, 1], alpha=0.5)
plt.xlabel('PC1: House Size (45% variance)')
plt.ylabel('PC2: Location Quality (28% variance)')
plt.title('Houses in PCA Space')
plt.grid(True)
plt.show()

Predict Prices with Reduced Features

# Train model on original 10 features
from sklearn.linear_model import LinearRegression

# Simulated prices driven mainly by house size (sqft), plus noise
prices = houses[:, 2] * 150 + np.random.randn(n_houses) * 20000

model_full = LinearRegression()
model_full.fit(houses_scaled, prices)
score_full = model_full.score(houses_scaled, prices)

# Train model on 3 PCA features
model_pca = LinearRegression()
model_pca.fit(houses_pca, prices)
score_pca = model_pca.score(houses_pca, prices)

print(f"R² with 10 features: {score_full:.3f}")
print(f"R² with 3 features: {score_pca:.3f}")
# Almost the same performance with 70% fewer features!
Key Insight: PCA reduced the feature count from 10 to 3 while retaining roughly 88% of the variance and nearly the same prediction accuracy!
Real Application: Zillow uses PCA to compress house features for faster recommendations.

Example 2: Student Performance - Simplifying Profiles

The Dataset

# 500 students × 8 features
# [study_hrs, prev_gpa, attendance, sleep, exercise, social, stress, motivation]
students = np.random.randn(500, 8)

# Standardize
students_scaled = scaler.fit_transform(students)

# Apply PCA
pca = PCA(n_components=3)
students_pca = pca.fit_transform(students_scaled)

print(f"Variance explained: {pca.explained_variance_ratio_}")
# e.g. [0.38, 0.25, 0.18] ≈ 81% total (illustrative)

Interpret Components

features = ['study', 'gpa', 'attend', 'sleep', 'exercise', 'social', 'stress', 'motiv']
components = pca.components_

print("\\nPrincipal Components:")
for i in range(3):
    print(f"\\nPC{i+1} ({pca.explained_variance_ratio_[i]:.1%} variance):")
    # Find top 3 contributing features
    top_idx = np.argsort(np.abs(components[i]))[-3:][::-1]
    for idx in top_idx:
        print(f"  {features[idx]}: {components[i, idx]:.3f}")
Interpretation:
  • PC1 (38%): “Academic engagement” (study + GPA + attendance)
  • PC2 (25%): “Well-being” (sleep + exercise - stress)
  • PC3 (18%): “Social balance” (social + motivation)

Identify Student Clusters

from sklearn.cluster import KMeans

# Cluster students in PCA space
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(students_pca)

# Visualize
plt.scatter(students_pca[:, 0], students_pca[:, 1], c=clusters, cmap='viridis', alpha=0.6)
plt.xlabel('PC1: Academic Engagement')
plt.ylabel('PC2: Well-being')
plt.title('Student Clusters')
plt.colorbar(label='Cluster')
plt.show()
Clusters Found:
  • Cluster 0: High academic, low well-being (burnout risk!)
  • Cluster 1: Balanced students
  • Cluster 2: Low academic, high social (need intervention)
Real Application: Khan Academy uses PCA to group students and personalize learning paths!

Example 3: Movie Recommendations - Faster Matching

The Dataset

# 1000 movies × 20 features
# [action, romance, comedy, horror, sci-fi, drama, thriller, mystery, 
#  animation, documentary, rating, runtime, year, budget, revenue, ...]
movies = np.random.randn(1000, 20)

# Standardize
movies_scaled = scaler.fit_transform(movies)

# Apply PCA
pca = PCA(n_components=5)  # 20 → 5 features
movies_pca = pca.fit_transform(movies_scaled)

print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")
# e.g. ~85% with just 5 components (illustrative)

Interpret Movie “Factors”

# PC1: "Blockbuster factor"
# High: action, sci-fi, budget, revenue
# Low: documentary, indie

# PC2: "Emotional factor"  
# High: romance, drama
# Low: action, horror

# PC3: "Comedy factor"
# High: comedy, animation
# Low: thriller, horror
# Find similar movies in PCA space (much faster!)
from scipy.spatial.distance import cdist

# Query movie (in PCA space)
query_movie = movies_pca[0]

# Compute distances (5D instead of 20D!)
distances = cdist([query_movie], movies_pca, metric='cosine')[0]

# Top 5 similar movies
similar_idx = np.argsort(distances)[1:6]  # Exclude self
print(f"Similar movies: {similar_idx}")
Speed Comparison:
import time

# Original 20D space
start = time.time()
for _ in range(1000):
    distances_full = cdist([movies_scaled[0]], movies_scaled, metric='cosine')
time_full = time.time() - start

# PCA 5D space
start = time.time()
for _ in range(1000):
    distances_pca = cdist([movies_pca[0]], movies_pca, metric='cosine')
time_pca = time.time() - start

print(f"Full: {time_full:.3f}s")
print(f"PCA: {time_pca:.3f}s")
print(f"Speedup: {time_full/time_pca:.1f}x")
# Roughly 4x faster with PCA!
Real Application: Netflix uses PCA to compress movie features for real-time recommendations to millions of users!

How PCA Works: Step-by-Step

[Figure: PCA math concept]

Step 1: Standardize Data

# Why? Features have different scales
# beds: 1-5, sqft: 500-5000, age: 0-100

X = np.array([[3, 2000, 15], [4, 2200, 8]])

# Standardize: mean=0, std=1
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_scaled = (X - X_mean) / X_std

Step 2: Compute Covariance Matrix

# Covariance shows how features vary together
cov_matrix = np.cov(X_scaled.T)
print(cov_matrix.shape)  # (3, 3)

Step 3: Find Eigenvectors

# Eigenvectors = principal directions
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Sort by eigenvalue (largest first)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

Step 4: Project Data

# Keep top k eigenvectors
k = 2
principal_components = eigenvectors[:, :k]

# Project data onto principal components
X_pca = X_scaled @ principal_components
print(X_pca.shape)  # (2, 2) - reduced from 3 to 2!
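
As a sanity check (on fresh synthetic data, since the tiny 2×3 example above is degenerate), the manual eigenvector route should agree with scikit-learn's PCA up to the sign of each component:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_centered = X - X.mean(axis=0)

# Manual route: covariance -> sorted eigenvectors -> projection
cov = np.cov(X_centered.T)
eigvals, eigvecs = np.linalg.eigh(cov)
order = eigvals.argsort()[::-1]
manual_scores = X_centered @ eigvecs[:, order[:2]]

# sklearn route
sklearn_scores = PCA(n_components=2).fit_transform(X_centered)

# Columns may differ in sign, so compare absolute values
print(np.allclose(np.abs(manual_scores), np.abs(sklearn_scores)))  # True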

Choosing Number of Components

Method 1: Variance Threshold

# Keep components that explain 95% of variance
pca = PCA(n_components=0.95)  # Automatic!
X_pca = pca.fit_transform(X_scaled)
print(f"Kept {pca.n_components_} components")

Method 2: Scree Plot

# Fit PCA with all components
pca_full = PCA()
pca_full.fit(X_scaled)

# Plot explained variance
plt.plot(range(1, len(pca_full.explained_variance_ratio_) + 1),
         pca_full.explained_variance_ratio_, 'bo-')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Scree Plot')
plt.grid(True)
plt.show()

# Look for "elbow" - where curve flattens

Method 3: Cumulative Variance

cumsum = np.cumsum(pca_full.explained_variance_ratio_)
n_components = np.argmax(cumsum >= 0.95) + 1
print(f"Need {n_components} components for 95% variance")

🎯 Practice Exercises & Real-World Applications

Challenge yourself! These exercises demonstrate how PCA is used in production systems at major tech companies.

Exercise 1: Customer 360° View Compression 🎯

A marketing team has 50 metrics per customer, but their ML models are slow. Use PCA to compress:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Simulate customer data: 10000 customers, 50 features
np.random.seed(42)

# Create correlated features (realistic)
base_features = np.random.randn(10000, 10)
# Expand to 50 features with correlations
customer_data = np.hstack([
    base_features,
    base_features @ np.random.randn(10, 10) + np.random.randn(10000, 10) * 0.5,
    base_features[:, :5] @ np.random.randn(5, 10) + np.random.randn(10000, 10) * 0.3,
    base_features @ np.random.randn(10, 10) + np.random.randn(10000, 10) * 0.4,
    base_features[:, 5:] @ np.random.randn(5, 10) + np.random.randn(10000, 10) * 0.6,
])

# TODO:
# 1. Standardize the data
# 2. Find how many components capture 95% variance
# 3. Calculate compression ratio
# 4. Measure information loss
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Setup (same as above)
np.random.seed(42)
base_features = np.random.randn(10000, 10)
customer_data = np.hstack([
    base_features,
    base_features @ np.random.randn(10, 10) + np.random.randn(10000, 10) * 0.5,
    base_features[:, :5] @ np.random.randn(5, 10) + np.random.randn(10000, 10) * 0.3,
    base_features @ np.random.randn(10, 10) + np.random.randn(10000, 10) * 0.4,
    base_features[:, 5:] @ np.random.randn(5, 10) + np.random.randn(10000, 10) * 0.6,
])

print("📊 Customer Data Compression with PCA")
print("=" * 50)
print(f"Original shape: {customer_data.shape}")

# 1. Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customer_data)

# 2. Find optimal components
pca_full = PCA()
pca_full.fit(X_scaled)

cumsum = np.cumsum(pca_full.explained_variance_ratio_)
n_95 = np.argmax(cumsum >= 0.95) + 1
n_99 = np.argmax(cumsum >= 0.99) + 1

print(f"\n🎯 Components for 95% variance: {n_95} (was 50)")
print(f"   Components for 99% variance: {n_99}")

# 3. Compression ratio
compression_95 = 50 / n_95
compression_99 = 50 / n_99
print(f"\n📦 Compression Ratios:")
print(f"   95% variance: {compression_95:.1f}x compression")
print(f"   99% variance: {compression_99:.1f}x compression")

# 4. Apply final PCA
pca = PCA(n_components=n_95)
X_compressed = pca.fit_transform(X_scaled)
X_reconstructed = pca.inverse_transform(X_compressed)

reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"\n📉 Information Loss:")
print(f"   Mean squared reconstruction error: {reconstruction_error:.4f}")

# Memory savings
original_size = customer_data.nbytes / 1024**2
compressed_size = X_compressed.nbytes / 1024**2
print(f"\n💾 Memory Savings:")
print(f"   Original: {original_size:.2f} MB")
print(f"   Compressed: {compressed_size:.2f} MB")
print(f"   Saved: {(1 - compressed_size/original_size)*100:.1f}%")
Real-World Insight: Spotify compresses user taste profiles from 1000+ features to ~40 components, enabling real-time playlist generation for 500M+ users!

Exercise 2: Image Compression 📸

Use PCA to compress grayscale images:
import numpy as np

# Simulate a 64x64 grayscale image (like a face)
np.random.seed(42)
image = np.random.rand(64, 64)

# Add some structure (face-like features)
image[20:25, 20:25] = 0.9  # Left eye
image[20:25, 40:45] = 0.9  # Right eye
image[35:40, 25:40] = 0.9  # Mouth
image[15:50, 28:37] = 0.7  # Nose

# TODO:
# 1. Reshape image for PCA (each row is a feature)
# 2. Compress to k components
# 3. Reconstruct and measure error
# 4. Find minimum k for "acceptable" quality (MSE < 0.01)
import numpy as np

np.random.seed(42)
image = np.random.rand(64, 64)
image[20:25, 20:25] = 0.9
image[20:25, 40:45] = 0.9
image[35:40, 25:40] = 0.9
image[15:50, 28:37] = 0.7

print("🖼️ Image Compression with PCA")
print("=" * 50)
print(f"Original image: {image.shape} = {64*64} pixels")

# PCA on rows (each row is a sample, each column is a feature)
from sklearn.decomposition import PCA

def compress_image(img, n_components):
    pca = PCA(n_components=n_components)
    compressed = pca.fit_transform(img)
    reconstructed = pca.inverse_transform(compressed)
    mse = np.mean((img - reconstructed) ** 2)
    return reconstructed, mse, pca.explained_variance_ratio_.sum()

print("\n📊 Compression Results:")
print("-" * 45)
print(f"{'Components':<12} {'MSE':<12} {'Variance %':<12} {'Compression':<12}")
print("-" * 45)

for k in [5, 10, 20, 30, 40, 50]:
    recon, mse, var = compress_image(image, k)
    compression = 64*64 / (64*k + k*64 + 64)  # scores + components + mean (approximate storage)
    quality = "✓" if mse < 0.01 else ""
    print(f"{k:<12} {mse:<12.4f} {var*100:<12.1f} {compression:<12.1f}x {quality}")

# Find minimum acceptable k
print("\n🔍 Finding minimum components for MSE < 0.01:")
for k in range(1, 65):
    _, mse, var = compress_image(image, k)
    if mse < 0.01:
        print(f"   Minimum k = {k} (MSE: {mse:.4f}, Variance: {var*100:.1f}%)")
        break

# Visual representation of compression
print("\n📈 Compression Quality by Components:")
for k in [5, 15, 30, 50]:
    _, mse, var = compress_image(image, k)
    bar_len = int((1 - mse) * 30)
    bar = "█" * bar_len + "░" * (30 - bar_len)
    print(f"  k={k:2d}: {bar} {(1-mse)*100:.1f}% quality")
Real-World Insight: JPEG compression uses a related technique (DCT). Netflix uses PCA-like methods to compress video frames, saving billions in bandwidth costs!

Exercise 3: Anomaly Detection in Server Logs 🔍

Use PCA to detect unusual server behavior:
import numpy as np

# Normal server metrics: [CPU, Memory, Disk IO, Network, Requests/s]
np.random.seed(42)
normal_data = np.random.randn(1000, 5) * np.array([10, 15, 20, 25, 100]) + \
              np.array([50, 60, 40, 30, 500])

# Add some anomalies
anomalies = np.array([
    [95, 95, 90, 80, 50],    # High CPU/Memory, low requests (crypto mining?)
    [20, 90, 95, 10, 100],   # Memory leak, disk thrashing
    [50, 60, 40, 95, 5000],  # DDoS attack?
])

all_data = np.vstack([normal_data, anomalies])

# TODO:
# 1. Fit PCA on normal data only
# 2. Project all data onto PCA space
# 3. Calculate reconstruction error for each point
# 4. Flag anomalies (high reconstruction error)
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

np.random.seed(42)
normal_data = np.random.randn(1000, 5) * np.array([10, 15, 20, 25, 100]) + \
              np.array([50, 60, 40, 30, 500])

anomalies = np.array([
    [95, 95, 90, 80, 50],
    [20, 90, 95, 10, 100],
    [50, 60, 40, 95, 5000],
])

all_data = np.vstack([normal_data, anomalies])

print("🔍 Server Anomaly Detection with PCA")
print("=" * 50)

# 1. Standardize using normal data statistics
scaler = StandardScaler()
scaler.fit(normal_data)
normal_scaled = scaler.transform(normal_data)
all_scaled = scaler.transform(all_data)

# 2. Fit PCA on normal data only (keep 2 components for simplicity)
pca = PCA(n_components=2)
pca.fit(normal_scaled)

print(f"PCA trained on {len(normal_data)} normal samples")
print(f"Components capture {pca.explained_variance_ratio_.sum()*100:.1f}% variance")

# 3. Calculate reconstruction error for all points
def reconstruction_error(pca, data):
    transformed = pca.transform(data)
    reconstructed = pca.inverse_transform(transformed)
    return np.mean((data - reconstructed) ** 2, axis=1)

errors = reconstruction_error(pca, all_scaled)

# 4. Set threshold based on normal data
normal_errors = errors[:len(normal_data)]
threshold = np.percentile(normal_errors, 99)  # 99th percentile

print(f"\n📊 Reconstruction Error Statistics:")
print(f"   Normal mean: {normal_errors.mean():.4f}")
print(f"   Normal std:  {normal_errors.std():.4f}")
print(f"   Threshold (99th %ile): {threshold:.4f}")

# Flag anomalies
print("\n🚨 Anomaly Detection Results:")
print("-" * 50)

anomaly_errors = errors[-len(anomalies):]
for i, (err, anomaly) in enumerate(zip(anomaly_errors, anomalies)):
    status = "🔴 ANOMALY" if err > threshold else "🟢 Normal"
    print(f"  Server {i+1}: Error = {err:.4f} {status}")
    if err > threshold:
        print(f"           Metrics: CPU={anomaly[0]:.0f}%, Mem={anomaly[1]:.0f}%, "
              f"Disk={anomaly[2]:.0f}%, Net={anomaly[3]:.0f}%, Req={anomaly[4]:.0f}/s")

# False positive rate on normal data
false_positives = np.sum(normal_errors > threshold)
print(f"\n📈 Performance:")
print(f"   False positive rate: {false_positives/len(normal_data)*100:.2f}%")
print(f"   True positives: {np.sum(anomaly_errors > threshold)}/{len(anomalies)}")
Real-World Insight: AWS CloudWatch and Datadog use similar PCA-based anomaly detection to automatically flag unusual server behavior across millions of instances!

Exercise 4: Gene Expression Analysis 🧬

Reduce high-dimensional genetic data for cancer subtype discovery:
import numpy as np

# Simulated gene expression: 200 patients, 1000 genes
np.random.seed(42)

# 3 cancer subtypes with different gene signatures
subtype1 = np.random.randn(70, 1000) + np.hstack([np.ones(300), np.zeros(700)]) * 2
subtype2 = np.random.randn(70, 1000) + np.hstack([np.zeros(300), np.ones(400), np.zeros(300)]) * 2
subtype3 = np.random.randn(60, 1000) + np.hstack([np.zeros(700), np.ones(300)]) * 2

gene_data = np.vstack([subtype1, subtype2, subtype3])
true_labels = [0]*70 + [1]*70 + [2]*60

# TODO:
# 1. Apply PCA to reduce to 2 components
# 2. Visualize the subtypes (would they cluster?)
# 3. Which genes contribute most to PC1?
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

np.random.seed(42)

subtype1 = np.random.randn(70, 1000) + np.hstack([np.ones(300), np.zeros(700)]) * 2
subtype2 = np.random.randn(70, 1000) + np.hstack([np.zeros(300), np.ones(400), np.zeros(300)]) * 2
subtype3 = np.random.randn(60, 1000) + np.hstack([np.zeros(700), np.ones(300)]) * 2

gene_data = np.vstack([subtype1, subtype2, subtype3])
true_labels = np.array([0]*70 + [1]*70 + [2]*60)

print("🧬 Gene Expression Analysis with PCA")
print("=" * 50)
print(f"Original data: {gene_data.shape[0]} patients × {gene_data.shape[1]} genes")

# 1. Standardize and apply PCA
scaler = StandardScaler()
gene_scaled = scaler.fit_transform(gene_data)

pca = PCA(n_components=2)
gene_2d = pca.fit_transform(gene_scaled)

print(f"\nReduced to: {gene_2d.shape[1]} dimensions")
print(f"Variance captured: {pca.explained_variance_ratio_.sum()*100:.1f}%")

# 2. Analyze cluster separation
print("\n📊 Subtype Centroids in PCA Space:")
for subtype in [0, 1, 2]:
    mask = true_labels == subtype
    centroid = gene_2d[mask].mean(axis=0)
    print(f"   Subtype {subtype+1}: PC1={centroid[0]:6.2f}, PC2={centroid[1]:6.2f}")

# Calculate between-class separation
centroids = np.array([gene_2d[true_labels == i].mean(axis=0) for i in range(3)])
separation = np.mean([np.linalg.norm(centroids[i] - centroids[j]) 
                      for i in range(3) for j in range(i+1, 3)])
print(f"\n   Average centroid distance: {separation:.2f}")

# 3. Find most important genes for PC1
loadings = pca.components_[0]
top_genes_pc1 = np.argsort(np.abs(loadings))[-10:][::-1]

print("\n🔬 Top 10 Genes Contributing to PC1:")
for i, gene_idx in enumerate(top_genes_pc1):
    direction = "↑" if loadings[gene_idx] > 0 else "↓"
    print(f"   {i+1}. Gene_{gene_idx:04d}: {loadings[gene_idx]:+.4f} {direction}")

# Visualize cluster quality (text-based)
print("\n📈 PCA Visualization (ASCII):")
print("   " + "-" * 42)
for subtype, marker in [(0, 'A'), (1, 'B'), (2, 'C')]:
    mask = true_labels == subtype
    mean_pc1 = gene_2d[mask, 0].mean()
    mean_pc2 = gene_2d[mask, 1].mean()
    x_pos = int((mean_pc1 + 3) / 6 * 40)
    x_pos = max(0, min(39, x_pos))
    print(f"   |{' ' * x_pos}{marker}{' ' * (39 - x_pos)}| Subtype {subtype+1}")
print("   " + "-" * 42)
print("     PC1 →")
Real-World Insight: This mirrors how cancer researchers discovered breast cancer subtypes (Luminal A, Luminal B, HER2+, Triple-negative). PCA and related analyses of gene expression data have enabled personalized treatment for countless patients!

Key Takeaways

Core PCA Concepts:
  • Principal Components - Orthogonal directions of maximum variance
  • Standardization Required - Features must be centered and scaled
  • Variance Explained - Choose k components to retain 95%+ variance
  • Use Cases - Dimensionality reduction, visualization, noise reduction, feature extraction
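
One way to bake the "standardize first" rule into your workflow is to chain the steps; a minimal sketch using scikit-learn's Pipeline (the step names and the 0.95 threshold are illustrative choices, and X_train/y_train are assumed placeholders):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Scaling, PCA, and the downstream model are fitted together, so the exact
# same transformations are applied at training and prediction time.
model = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),   # keep enough components for 95% variance
    ('reg', LinearRegression()),
])
# model.fit(X_train, y_train)
# model.predict(X_test)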

Interview Prep: PCA Questions

Q: Why do we standardize data before PCA?
Without standardization, features with larger scales dominate the principal components. A feature measured in millions would overshadow one measured in decimals, regardless of importance.
Q: How do you choose the number of components?
Common methods: (1) Keep components explaining 95% of variance, (2) Elbow method on scree plot, (3) Kaiser criterion (eigenvalue > 1), (4) Cross-validation if using for prediction.
Q: What are the limitations of PCA?
PCA finds linear relationships only; it struggles with non-linear patterns. It also assumes variance = importance, which isn’t always true. All components are combinations of all features, making interpretation difficult.
Q: When should you NOT use PCA?
When interpretability is crucial (PCA features are hard to explain), when relationships are non-linear (use t-SNE or UMAP instead), or when you have categorical features (PCA is for continuous data).
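
To make the first answer concrete, here is a small sketch on synthetic data (two correlated house features with very different scales) showing how skipping standardization lets the large-scale feature swamp PC1:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
beds = rng.normal(3, 1, size=200)                    # small-scale feature
sqft = beds * 500 + rng.normal(0, 200, size=200)     # correlated, large-scale feature
X = np.column_stack([beds, sqft])

pca_raw = PCA(n_components=1).fit(X)
pca_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X))

print(np.round(np.abs(pca_raw.components_[0]), 3))   # ~[0.002, 1.0]: sqft dominates
print(np.round(np.abs(pca_std.components_[0]), 3))   # ~[0.707, 0.707]: balanced weights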

Common Pitfalls

PCA Mistakes to Avoid:
  1. Forgetting to Standardize - Always scale before PCA; unscaled data gives meaningless results
  2. Too Few Components - Keeping too few loses important information; check variance explained
  3. Using on Categorical Data - PCA is for continuous features; encode categoricals separately
  4. Over-Interpreting Components - PCs are linear combos; labeling “PC1 = size” is often oversimplified
  5. Ignoring Outliers - PCA is sensitive to outliers; consider robust alternatives
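
For pitfall 5, a quick illustration (purely synthetic data) of how a single extreme point can rotate the first principal component away from the main cloud:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 1.0]])  # elongated cloud
X_out = np.vstack([X, [[0.0, 60.0]]])                               # one extreme outlier

pc1_clean = PCA(n_components=1).fit(X).components_[0]
pc1_outlier = PCA(n_components=1).fit(X_out).components_[0]

print(np.round(pc1_clean, 2))    # roughly along the cloud's long (first) axis
print(np.round(pc1_outlier, 2))  # rotated toward the outlier's direction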

What’s Next?

PCA is great for reducing features, but what if you want to find hidden patterns in your data? Like discovering that users who like action movies also like sci-fi? That’s Singular Value Decomposition (SVD) - the most powerful matrix decomposition!

Next: Singular Value Decomposition (SVD)

Build a movie recommendation system using matrix factorization