The Big Question: Can we predict house prices with fewer features?

You have 10 house features, but your model is slow and overfitting. Can you reduce to 3 features while keeping 95% of the information? PCA (Principal Component Analysis) is the answer - it’s dimensionality reduction using eigenvectors!
Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Eigenvalues module
Main Example: House feature reduction
Supporting Examples: Student profile compression, Movie recommendation optimization
Each eigenvalue λᵢ represents the variance captured by that component:

$$\text{Variance ratio}_i = \frac{\lambda_i}{\sum_j \lambda_j}$$

Example: If eigenvalues are [5.0, 2.0, 1.5, 0.5], total = 9.0
| Component | Eigenvalue | Variance Explained |
|-----------|------------|--------------------|
| PC1       | 5.0        | 55.6%              |
| PC2       | 2.0        | 22.2%              |
| PC3       | 1.5        | 16.7%              |
| PC4       | 0.5        | 5.6%               |
Keeping PC1 + PC2 = 77.8% of the variance with only 2 features!
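A quick sanity check of these ratios, as a minimal sketch using only NumPy and the eigenvalues from the example above:

import numpy as np

eigenvalues = np.array([5.0, 2.0, 1.5, 0.5])   # from the example above
variance_ratios = eigenvalues / eigenvalues.sum()

for i, ratio in enumerate(variance_ratios, start=1):
    print(f"PC{i}: {ratio*100:.1f}%")           # 55.6%, 22.2%, 16.7%, 5.6%

print(f"PC1 + PC2: {variance_ratios[:2].sum()*100:.1f}%")  # 77.8%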
# Train model on original 10 features
from sklearn.linear_model import LinearRegression

prices = np.random.randn(n_houses) * 50000 + 300000  # Simulated prices

model_full = LinearRegression()
model_full.fit(houses_scaled, prices)
score_full = model_full.score(houses_scaled, prices)

# Train model on 3 PCA features
model_pca = LinearRegression()
model_pca.fit(houses_pca, prices)
score_pca = model_pca.score(houses_pca, prices)

print(f"R² with 10 features: {score_full:.3f}")
print(f"R² with 3 features: {score_pca:.3f}")
# Almost the same performance with 70% fewer features!
Key Insight: PCA reduced features from 10 → 3 while maintaining 88% of the information and similar prediction accuracy!

Real Application: Zillow uses PCA to compress house features for faster recommendations.
# Find similar movies in PCA space (much faster!)
from scipy.spatial.distance import cdist

# Query movie (in PCA space)
query_movie = movies_pca[0]

# Compute distances (5D instead of 20D!)
distances = cdist([query_movie], movies_pca, metric='cosine')[0]

# Top 5 similar movies
similar_idx = np.argsort(distances)[1:6]  # Exclude self
print(f"Similar movies: {similar_idx}")
# Keep top k eigenvectors
k = 2
principal_components = eigenvectors[:, :k]

# Project data onto principal components
X_pca = X_scaled @ principal_components
print(X_pca.shape)  # (2, 2) - reduced from 3 to 2!
Real-World Insight: Spotify compresses user taste profiles from 1000+ features to ~40 components, enabling real-time playlist generation for 500M+ users!
import numpy as np

# Simulate a 64x64 grayscale image (like a face)
np.random.seed(42)
image = np.random.rand(64, 64)

# Add some structure (face-like features)
image[20:25, 20:25] = 0.9   # Left eye
image[20:25, 40:45] = 0.9   # Right eye
image[35:40, 25:40] = 0.9   # Mouth
image[15:50, 28:37] = 0.7   # Nose

# TODO:
# 1. Reshape image for PCA (treat each row as a sample, each column as a feature)
# 2. Compress to k components
# 3. Reconstruct and measure error
# 4. Find minimum k for "acceptable" quality (MSE < 0.01)
💡 Solution
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
image = np.random.rand(64, 64)
image[20:25, 20:25] = 0.9
image[20:25, 40:45] = 0.9
image[35:40, 25:40] = 0.9
image[15:50, 28:37] = 0.7

print("🖼️ Image Compression with PCA")
print("=" * 50)
print(f"Original image: {image.shape} = {64*64} pixels")

# PCA on rows (each row is a sample, each column is a feature)
def compress_image(img, n_components):
    pca = PCA(n_components=n_components)
    compressed = pca.fit_transform(img)
    reconstructed = pca.inverse_transform(compressed)
    mse = np.mean((img - reconstructed) ** 2)
    return reconstructed, mse, pca.explained_variance_ratio_.sum()

print("\n📊 Compression Results:")
print("-" * 45)
print(f"{'Components':<12} {'MSE':<12} {'Variance %':<12} {'Compression':<12}")
print("-" * 45)
for k in [5, 10, 20, 30, 40, 50]:
    recon, mse, var = compress_image(image, k)
    compression = 64*64 / (k*64 + k)  # Approximate storage ratio
    quality = "✓" if mse < 0.01 else ""
    print(f"{k:<12} {mse:<12.4f} {var*100:<12.1f} {compression:<12.1f}x {quality}")

# Find minimum acceptable k
print("\n🔍 Finding minimum components for MSE < 0.01:")
for k in range(1, 65):
    _, mse, var = compress_image(image, k)
    if mse < 0.01:
        print(f"  Minimum k = {k} (MSE: {mse:.4f}, Variance: {var*100:.1f}%)")
        break

# Visual representation of compression quality
print("\n📈 Compression Quality by Components:")
for k in [5, 15, 30, 50]:
    _, mse, var = compress_image(image, k)
    bar_len = int((1 - mse) * 30)
    bar = "█" * bar_len + "░" * (30 - bar_len)
    print(f"  k={k:2d}: {bar} {(1-mse)*100:.1f}% quality")
Real-World Insight: JPEG compression uses a related technique (DCT). Netflix uses PCA-like methods to compress video frames, saving billions in bandwidth costs!
import numpy as np

# Normal server metrics: [CPU, Memory, Disk IO, Network, Requests/s]
np.random.seed(42)
normal_data = np.random.randn(1000, 5) * np.array([10, 15, 20, 25, 100]) + \
              np.array([50, 60, 40, 30, 500])

# Add some anomalies
anomalies = np.array([
    [95, 95, 90, 80, 50],     # High CPU/Memory, low requests (crypto mining?)
    [20, 90, 95, 10, 100],    # Memory leak, disk thrashing
    [50, 60, 40, 95, 5000],   # DDoS attack?
])
all_data = np.vstack([normal_data, anomalies])

# TODO:
# 1. Fit PCA on normal data only
# 2. Project all data onto PCA space
# 3. Calculate reconstruction error for each point
# 4. Flag anomalies (high reconstruction error)
💡 Solution
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

np.random.seed(42)
normal_data = np.random.randn(1000, 5) * np.array([10, 15, 20, 25, 100]) + \
              np.array([50, 60, 40, 30, 500])
anomalies = np.array([
    [95, 95, 90, 80, 50],
    [20, 90, 95, 10, 100],
    [50, 60, 40, 95, 5000],
])
all_data = np.vstack([normal_data, anomalies])

print("🔍 Server Anomaly Detection with PCA")
print("=" * 50)

# 1. Standardize using normal data statistics
scaler = StandardScaler()
scaler.fit(normal_data)
normal_scaled = scaler.transform(normal_data)
all_scaled = scaler.transform(all_data)

# 2. Fit PCA on normal data only (keep 2 components for simplicity)
pca = PCA(n_components=2)
pca.fit(normal_scaled)
print(f"PCA trained on {len(normal_data)} normal samples")
print(f"Components capture {pca.explained_variance_ratio_.sum()*100:.1f}% variance")

# 3. Calculate reconstruction error for all points
def reconstruction_error(pca, data):
    transformed = pca.transform(data)
    reconstructed = pca.inverse_transform(transformed)
    return np.mean((data - reconstructed) ** 2, axis=1)

errors = reconstruction_error(pca, all_scaled)

# 4. Set threshold based on normal data
normal_errors = errors[:len(normal_data)]
threshold = np.percentile(normal_errors, 99)  # 99th percentile

print(f"\n📊 Reconstruction Error Statistics:")
print(f"  Normal mean: {normal_errors.mean():.4f}")
print(f"  Normal std:  {normal_errors.std():.4f}")
print(f"  Threshold (99th %ile): {threshold:.4f}")

# Flag anomalies
print("\n🚨 Anomaly Detection Results:")
print("-" * 50)
anomaly_errors = errors[-len(anomalies):]
for i, (err, anomaly) in enumerate(zip(anomaly_errors, anomalies)):
    status = "🔴 ANOMALY" if err > threshold else "🟢 Normal"
    print(f"  Server {i+1}: Error = {err:.4f} {status}")
    if err > threshold:
        print(f"    Metrics: CPU={anomaly[0]:.0f}%, Mem={anomaly[1]:.0f}%, "
              f"Disk={anomaly[2]:.0f}%, Net={anomaly[3]:.0f}%, Req={anomaly[4]:.0f}/s")

# False positive rate on normal data
false_positives = np.sum(normal_errors > threshold)
print(f"\n📈 Performance:")
print(f"  False positive rate: {false_positives/len(normal_data)*100:.2f}%")
print(f"  True positives: {np.sum(anomaly_errors > threshold)}/{len(anomalies)}")
Real-World Insight: AWS CloudWatch and Datadog use similar PCA-based anomaly detection to automatically flag unusual server behavior across millions of instances!
Reduce high-dimensional genetic data for cancer subtype discovery:
import numpy as np

# Simulated gene expression: 200 patients, 1000 genes
np.random.seed(42)

# 3 cancer subtypes with different gene signatures
subtype1 = np.random.randn(70, 1000) + np.hstack([np.ones(300), np.zeros(700)]) * 2
subtype2 = np.random.randn(70, 1000) + np.hstack([np.zeros(300), np.ones(400), np.zeros(300)]) * 2
subtype3 = np.random.randn(60, 1000) + np.hstack([np.zeros(700), np.ones(300)]) * 2

gene_data = np.vstack([subtype1, subtype2, subtype3])
true_labels = [0]*70 + [1]*70 + [2]*60

# TODO:
# 1. Apply PCA to reduce to 2 components
# 2. Visualize the subtypes (would they cluster?)
# 3. Which genes contribute most to PC1?
💡 Solution
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

np.random.seed(42)
subtype1 = np.random.randn(70, 1000) + np.hstack([np.ones(300), np.zeros(700)]) * 2
subtype2 = np.random.randn(70, 1000) + np.hstack([np.zeros(300), np.ones(400), np.zeros(300)]) * 2
subtype3 = np.random.randn(60, 1000) + np.hstack([np.zeros(700), np.ones(300)]) * 2
gene_data = np.vstack([subtype1, subtype2, subtype3])
true_labels = np.array([0]*70 + [1]*70 + [2]*60)

print("🧬 Gene Expression Analysis with PCA")
print("=" * 50)
print(f"Original data: {gene_data.shape[0]} patients × {gene_data.shape[1]} genes")

# 1. Standardize and apply PCA
scaler = StandardScaler()
gene_scaled = scaler.fit_transform(gene_data)

pca = PCA(n_components=2)
gene_2d = pca.fit_transform(gene_scaled)
print(f"\nReduced to: {gene_2d.shape[1]} dimensions")
print(f"Variance captured: {pca.explained_variance_ratio_.sum()*100:.1f}%")

# 2. Analyze cluster separation
print("\n📊 Subtype Centroids in PCA Space:")
for subtype in [0, 1, 2]:
    mask = true_labels == subtype
    centroid = gene_2d[mask].mean(axis=0)
    print(f"  Subtype {subtype+1}: PC1={centroid[0]:6.2f}, PC2={centroid[1]:6.2f}")

# Calculate between-class separation
centroids = np.array([gene_2d[true_labels == i].mean(axis=0) for i in range(3)])
separation = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                      for i in range(3) for j in range(i+1, 3)])
print(f"\n  Average centroid distance: {separation:.2f}")

# 3. Find most important genes for PC1
loadings = pca.components_[0]
top_genes_pc1 = np.argsort(np.abs(loadings))[-10:][::-1]
print("\n🔬 Top 10 Genes Contributing to PC1:")
for i, gene_idx in enumerate(top_genes_pc1):
    direction = "↑" if loadings[gene_idx] > 0 else "↓"
    print(f"  {i+1}. Gene_{gene_idx:04d}: {loadings[gene_idx]:+.4f} {direction}")

# Visualize cluster quality (text-based)
print("\n📈 PCA Visualization (ASCII):")
print("  " + "-" * 42)
for subtype, marker in [(0, 'A'), (1, 'B'), (2, 'C')]:
    mask = true_labels == subtype
    mean_pc1 = gene_2d[mask, 0].mean()
    mean_pc2 = gene_2d[mask, 1].mean()
    x_pos = int((mean_pc1 + 3) / 6 * 40)
    x_pos = max(0, min(39, x_pos))
    print(f"  |{' ' * x_pos}{marker}{' ' * (39 - x_pos)}| Subtype {subtype+1}")
print("  " + "-" * 42)
print("  PC1 →")
Real-World Insight: This is exactly how cancer researchers discovered breast cancer subtypes (Luminal A, Luminal B, HER2+, Triple-negative). PCA on gene expression data has saved countless lives through personalized treatment!
Without standardization, features with larger scales dominate the principal components. A feature measured in millions would overshadow one measured in decimals, regardless of importance.
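To see why this matters, here is a minimal sketch on made-up toy data (not the house example): one column is on the scale of hundreds of thousands, the other on single digits, and PC1 follows the big-scale column almost exactly until StandardScaler levels the playing field.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical toy data: a dollar-scale feature vs. a 1-5 rating-scale feature
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(300_000, 50_000, 200),  # e.g. price in dollars
    rng.normal(3.5, 1.0, 200),         # e.g. rating from 1-5
])

# Without standardization: PC1 points almost entirely along the large-scale column
pca_raw = PCA(n_components=1).fit(X)
print("PC1 without scaling:", np.round(pca_raw.components_[0], 3))

# With standardization: both features get a fair say
pca_scaled = PCA(n_components=1).fit(StandardScaler().fit_transform(X))
print("PC1 with scaling:   ", np.round(pca_scaled.components_[0], 3))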
Q: How do you choose the number of components?
Common methods: (1) Keep components explaining 95% of variance, (2) Elbow method on scree plot, (3) Kaiser criterion (eigenvalue > 1), (4) Cross-validation if using for prediction.
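For method (1), here is a minimal sketch; X_scaled stands in for any standardized feature matrix (random placeholder data below):

import numpy as np
from sklearn.decomposition import PCA

# Placeholder for any standardized (n_samples, n_features) matrix
X_scaled = np.random.randn(500, 10)

pca = PCA().fit(X_scaled)                              # fit all components
cumulative = np.cumsum(pca.explained_variance_ratio_)  # running total of variance
k = np.argmax(cumulative >= 0.95) + 1                  # first k reaching 95%
print(f"Components needed for 95% variance: {k}")

# Or let scikit-learn pick k directly from a variance target:
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print(f"PCA(n_components=0.95) kept {pca_95.n_components_} components")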
Q: What are the limitations of PCA?
PCA finds linear relationships only; it struggles with non-linear patterns. It also assumes variance = importance, which isn’t always true. All components are combinations of all features, making interpretation difficult.
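A tiny sketch of the linearity limitation, using made-up points on a circle: the data is intrinsically one-dimensional (the angle), but a single linear component can only capture about half the variance.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: points on a circle - intrinsically 1D, but non-linear
theta = np.linspace(0, 2 * np.pi, 200)
circle = np.column_stack([np.cos(theta), np.sin(theta)])

pca = PCA(n_components=1).fit(circle)
print(f"Variance kept by 1 linear component: {pca.explained_variance_ratio_[0]*100:.1f}%")
# ≈ 50% - PCA cannot "unroll" the circle; kernel PCA, t-SNE, or UMAP handle such shapes better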
Q: When should you NOT use PCA?
When interpretability is crucial (PCA features are hard to explain), when relationships are non-linear (use t-SNE or UMAP instead), or when you have categorical features (PCA is for continuous data).
PCA is great for reducing features, but what if you want to find hidden patterns in your data? Like discovering that users who like action movies also like sci-fi? That’s Singular Value Decomposition (SVD) - the most powerful matrix decomposition!