


Dimensionality Reduction

The Curse of Dimensionality

Imagine searching for your friend in a 1D hallway — easy. Now imagine a 2D football field — harder. Now a 3D building — much harder. Now imagine a 100-dimensional space. The volume of that space grows so fast that any reasonable dataset becomes absurdly sparse. As features increase:
  • Data becomes sparse: Points are so far apart that “nearby” neighbors are barely closer than distant ones
  • Models need exponentially more data: With just 10 sample positions per axis, a 2D grid has 10^2 cells while a 100D grid has 10^100, so keeping the same density of examples is out of reach for any realistic dataset
  • Distance metrics lose contrast: In high dimensions, the nearest and farthest points from a query tend to be almost the same distance away (a well-known result for many common data distributions), which undermines KNN, clustering, and other distance-based methods
  • Training time grows: More features usually mean more parameters to learn and more computation per step
This is the “curse of dimensionality,” and it is why blindly throwing 500 features at a model often performs worse than carefully picking 20.
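The loss of distance contrast described above can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly from the unit hypercube (one convenient distribution, not the only one where the effect appears); the function name and the chosen dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

def nearest_to_farthest_ratio(n_dims):
    """Ratio of nearest to farthest Euclidean distance from one random query point."""
    points = rng.random((n_points, n_dims))   # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

# A ratio near 0 means "nearest" is meaningfully closer than "farthest";
# a ratio near 1 means all points are roughly equidistant from the query.
ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"{d:>4}D: nearest/farthest = {r:.3f}")
```

As the number of dimensions grows, the ratio climbs toward 1, which is why nearest-neighbor lookups carry less and less information.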

Why Reduce Dimensions?

Visualization

Plot 100D data in 2D

Speed

Faster training and inference

Noise Reduction

Remove noisy features

Better Models

Reduce overfitting

PCA: Principal Component Analysis

PCA finds new axes (directions) that capture the most variance in your data. It answers the question: “If I could only look at this data from a few angles, which angles would show me the most information?”

The Intuition

Imagine photographing a cigar from different angles. From the side, you see its full length — lots of information about shape. From the tip, you just see a circle — almost no useful information. PCA automatically finds the “best side view” (most variance) first, then the next most informative angle perpendicular to that, and so on.
  • First principal component: Along the cigar (direction of most variance — the most informative angle)
  • Second principal component: Across the cigar (direction of second most variance, perpendicular to the first)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Create cigar-shaped data
np.random.seed(42)
n_samples = 200

# Original data with correlation
x = np.random.randn(n_samples)
y = 2 * x + np.random.randn(n_samples) * 0.5
X = np.column_stack([x, y])

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Original data
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], alpha=0.5)
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].set_title('Original Data')
axes[0].set_aspect('equal')

# Draw principal components
mean = X_scaled.mean(axis=0)
for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    axes[0].arrow(mean[0], mean[1], 
                  comp[0] * np.sqrt(var) * 2, 
                  comp[1] * np.sqrt(var) * 2,
                  head_width=0.1, head_length=0.1, fc=f'C{i}', ec=f'C{i}',
                  label=f'PC{i+1}: {pca.explained_variance_ratio_[i]:.1%}')
axes[0].legend()

# Transformed data
axes[1].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5)
axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
axes[1].set_title('PCA Transformed Data')
axes[1].set_aspect('equal')

plt.tight_layout()
plt.show()

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance captured: {pca.explained_variance_ratio_.sum():.1%}")
Math Connection: PCA uses eigendecomposition of the covariance matrix. See Linear Algebra Course for the full theory.
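To make the eigendecomposition connection concrete, here is a small sketch that reproduces PCA by hand on cigar-shaped data like the example above and compares the result with scikit-learn. It is an illustration under simple assumptions (dense data, all components kept), not how scikit-learn computes PCA internally.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.standard_normal(200)
X = np.column_stack([x, 2 * x + 0.5 * rng.standard_normal(200)])

# 1. Center the data (PCA works on mean-centered features)
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features (rows are samples, hence rowvar=False)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition; eigh is meant for symmetric matrices and
#    returns eigenvalues in ascending order, so sort them descending
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project onto the eigenvectors (each column is one principal direction)
X_manual = X_centered @ eigenvectors

pca = PCA(n_components=2).fit(X)
print("Manual eigenvalues:", eigenvalues)
print("sklearn variances: ", pca.explained_variance_)
```

The eigenvalues match `explained_variance_`, and the projections match `pca.transform(X)` up to the sign of each component, since an eigenvector and its negation describe the same axis.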

Choosing the Number of Components

Method 1: Explained Variance

from sklearn.datasets import load_digits

# Load digit images (64 features)
digits = load_digits()
X = digits.data
y = digits.target

print(f"Original shape: {X.shape}")  # (1797, 64)

# Fit PCA with all components
pca_full = PCA()
pca_full.fit(X)

# Plot cumulative explained variance
cumsum = np.cumsum(pca_full.explained_variance_ratio_)

plt.figure(figsize=(10, 5))
plt.plot(range(1, len(cumsum) + 1), cumsum, 'b-o')
plt.axhline(0.95, color='r', linestyle='--', label='95% threshold')
plt.axhline(0.90, color='orange', linestyle='--', label='90% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Explained Variance')
plt.legend()
plt.grid(True)
plt.show()

# Find number of components for 95% variance
n_components_95 = np.argmax(cumsum >= 0.95) + 1
print(f"Components for 95% variance: {n_components_95}")
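scikit-learn can also apply the threshold for you: passing a float between 0 and 1 as `n_components` is interpreted as a target fraction of explained variance. The sketch below compares that shortcut with the manual count on the digits data; variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Manual route: fit all components, then count how many reach the threshold
cumsum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_manual = int(np.argmax(cumsum >= 0.95) + 1)

# Shortcut: a float in (0, 1) is read as a target fraction of variance
pca_95 = PCA(n_components=0.95).fit(X)

print(f"Manual count: {n_manual}, PCA(n_components=0.95): {pca_95.n_components_}")
```

The fitted `n_components_` attribute reports how many components were actually kept.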

Method 2: Cross-Validated Model Performance

# Use cross-validation to find optimal components
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

n_components_range = [5, 10, 15, 20, 30, 40, 50, 64]
scores = []

for n_comp in n_components_range:
    pipeline = Pipeline([
        ('pca', PCA(n_components=n_comp)),
        ('classifier', LogisticRegression(max_iter=5000))
    ])
    
    score = cross_val_score(pipeline, X, y, cv=5).mean()
    scores.append(score)
    print(f"n_components={n_comp:2d}: accuracy={score:.3f}")

plt.figure(figsize=(10, 5))
plt.plot(n_components_range, scores, 'b-o')
plt.xlabel('Number of Components')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Model Performance vs PCA Components')
plt.grid(True)
plt.show()

Visualizing High-Dimensional Data

Digits in 2D

# Reduce 64D to 2D for visualization
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', alpha=0.6)
plt.colorbar(scatter, label='Digit')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title(f'Digits Dataset in 2D (PCA)\nExplained Variance: {pca_2d.explained_variance_ratio_.sum():.1%}')
plt.show()

t-SNE: For Visualization

While PCA preserves global structure (overall spread and distances), t-SNE focuses on preserving local structure — it ensures that points that are close in high-dimensional space remain close in the 2D projection. This makes it excellent for revealing clusters that PCA might smear together.
from sklearn.manifold import TSNE

# t-SNE is slow on large data, so sample if needed (seeded for reproducibility)
sample_size = min(1000, len(X))
rng = np.random.default_rng(42)
indices = rng.choice(len(X), sample_size, replace=False)
X_sample = X[indices]
y_sample = y[indices]

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_sample)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample, cmap='tab10', alpha=0.6)
plt.colorbar(scatter, label='Digit')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('Digits Dataset - t-SNE Visualization')
plt.show()

PCA vs t-SNE

| Aspect | PCA | t-SNE |
|---|---|---|
| Speed | Fast (linear algebra) | Slow (iterative optimization) |
| Purpose | Feature reduction, preprocessing, denoising | Visualization only |
| Preserves | Global structure (distances, spread) | Local structure (neighborhoods) |
| New data | Can transform new points directly | Must refit on entire dataset |
| Interpretable | Yes (loadings tell you which original features matter) | No (axes have no meaning) |
| Deterministic | Yes (always same result) | No (depends on random initialization) |
t-SNE is for visualization only! Do not use it as a preprocessing step for models. The distances between clusters in a t-SNE plot are meaningless — two clusters that appear far apart may actually be close in the original space. t-SNE can also create apparent clusters in random data, so always verify patterns with other methods.
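The "new data" difference is visible directly in the scikit-learn API. A short sketch, assuming the digits dataset and a hypothetical split into fitted rows and later-arriving rows:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

X = load_digits().data
X_train, X_new = train_test_split(X, test_size=0.2, random_state=42)

# PCA learns a fixed projection, so unseen rows can be mapped later
pca = PCA(n_components=2).fit(X_train)
X_new_2d = pca.transform(X_new)
print("PCA projection of unseen rows:", X_new_2d.shape)

# scikit-learn's TSNE offers fit and fit_transform, but no transform method
print("PCA has transform: ", hasattr(pca, "transform"))
print("TSNE has transform:", hasattr(TSNE(), "transform"))
```

Because there is no `transform`, a t-SNE embedding cannot be applied to a held-out test set or to production traffic without refitting.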

UMAP: Best of Both Worlds

UMAP is now widely used in place of t-SNE for visualization. It is typically faster (especially on large datasets), tends to preserve more global structure (so distances between clusters carry somewhat more meaning), and a fitted UMAP model can transform new, unseen data, which t-SNE cannot do.
# pip install umap-learn
from umap import UMAP

# Apply UMAP
umap_model = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = umap_model.fit_transform(X)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', alpha=0.6)
plt.colorbar(scatter, label='Digit')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.title('Digits Dataset - UMAP Visualization')
plt.show()

PCA for Preprocessing

Speed Up Training

from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import time

# Load MNIST (larger dataset)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X_mnist, y_mnist = mnist.data, mnist.target

# Sample for demo
X_sample, _, y_sample, _ = train_test_split(X_mnist, y_mnist, train_size=10000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.2, random_state=42)

print(f"Original features: {X_train.shape[1]}")

# Without PCA
start = time.time()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
time_no_pca = time.time() - start
acc_no_pca = rf.score(X_test, y_test)

# With PCA (95% variance)
pca = PCA(n_components=0.95)  # Keep 95% variance
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

print(f"PCA features: {X_train_pca.shape[1]}")

start = time.time()
rf_pca = RandomForestClassifier(n_estimators=100, random_state=42)
rf_pca.fit(X_train_pca, y_train)
time_pca = time.time() - start
acc_pca = rf_pca.score(X_test_pca, y_test)

print(f"\nWithout PCA: {acc_no_pca:.3f} accuracy, {time_no_pca:.1f}s")
print(f"With PCA:    {acc_pca:.3f} accuracy, {time_pca:.1f}s")
print(f"Speedup: {time_no_pca/time_pca:.1f}x")

Noise Reduction

# Add noise to images
np.random.seed(42)
noise = np.random.randn(*X_sample.shape) * 50
X_noisy = X_sample + noise
X_noisy = np.clip(X_noisy, 0, 255)  # Keep valid pixel range

# Denoise using PCA
pca_denoise = PCA(n_components=100)
X_denoised = pca_denoise.inverse_transform(pca_denoise.fit_transform(X_noisy))

# Visualize
fig, axes = plt.subplots(3, 5, figsize=(12, 8))

for i in range(5):
    axes[0, i].imshow(X_sample[i].reshape(28, 28), cmap='gray')
    axes[0, i].axis('off')
    if i == 0:
        axes[0, i].set_title('Original')
    
    axes[1, i].imshow(X_noisy[i].reshape(28, 28), cmap='gray')
    axes[1, i].axis('off')
    if i == 0:
        axes[1, i].set_title('Noisy')
    
    axes[2, i].imshow(X_denoised[i].reshape(28, 28), cmap='gray')
    axes[2, i].axis('off')
    if i == 0:
        axes[2, i].set_title('Denoised (PCA)')

plt.tight_layout()
plt.show()
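Denoising works because most of the image signal lives in the top components, while random noise is spread evenly across all directions, so dropping the trailing components removes more noise than signal. The sketch below puts a number on that using the small digits dataset (so it runs without downloading MNIST); the noise level and the choice of 20 components are arbitrary illustrative values.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_clean = load_digits().data          # 8x8 images, pixel values in [0, 16]
rng = np.random.default_rng(42)
X_noisy = X_clean + rng.normal(scale=4.0, size=X_clean.shape)

# Keep 20 of 64 components, then map back to pixel space
pca = PCA(n_components=20)
X_denoised = pca.inverse_transform(pca.fit_transform(X_noisy))

# Compare both versions against the clean images
mse_noisy = np.mean((X_noisy - X_clean) ** 2)
mse_denoised = np.mean((X_denoised - X_clean) ** 2)
print(f"MSE vs clean images, noisy:    {mse_noisy:.2f}")
print(f"MSE vs clean images, denoised: {mse_denoised:.2f}")
```

Keeping too few components starts to erase real detail, so the component count is a trade-off between noise removed and signal lost.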

Feature Selection vs Feature Extraction

These are two fundamentally different strategies for reducing dimensions, and knowing when to use which is a practical skill that matters in production.
| Aspect | Feature Selection | Feature Extraction (PCA) |
|---|---|---|
| Method | Pick the best original features, discard the rest | Create new features as combinations of originals |
| Interpretability | High: you keep original feature names and meanings | Lower: “PC1” is a weighted mix of all original features |
| Information | May lose some if you drop correlated but useful features | Preserves overall variance efficiently |
| When to prefer | When stakeholders need to understand which inputs matter (healthcare, finance, compliance) | When you want maximum compression and interpretability is secondary |
| Examples | SelectKBest, Recursive Feature Elimination (RFE), Lasso (L1) | PCA, LDA, Autoencoders |
from sklearn.feature_selection import SelectKBest, f_classif

# Feature selection: pick top k features
selector = SelectKBest(f_classif, k=50)
X_selected = selector.fit_transform(X_train, y_train)

# Feature extraction: create new features
pca = PCA(n_components=50)
X_extracted = pca.fit_transform(X_train)

print(f"Selection: {X_selected.shape}")
print(f"Extraction: {X_extracted.shape}")
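The interpretability difference is easier to see on a dataset with named columns. This sketch uses scikit-learn's breast cancer dataset (a substitution made here for illustration, since pixel indices are not very readable) and keeps 5 features or components, an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

# Selection: the survivors are original, named columns
selector = SelectKBest(f_classif, k=5).fit(X_scaled, data.target)
selected_names = data.feature_names[selector.get_support()]
print("Selected features:", list(selected_names))

# Extraction: every component has a loading on every original column
pca = PCA(n_components=5).fit(X_scaled)
print("Loadings matrix shape:", pca.components_.shape)

# The largest absolute loadings hint at what a component represents
top = np.abs(pca.components_[0]).argsort()[::-1][:3]
print("Largest PC1 loadings:", list(data.feature_names[top]))
```

With selection you can report the kept columns by name; with extraction the best you can do is inspect the loadings.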

LDA: Supervised Dimensionality Reduction

While PCA asks “which directions have the most variance?”, LDA asks “which directions best separate my classes?” This makes LDA a good fit when your goal is classification rather than general-purpose compression. The tradeoff: LDA can produce at most min(n_classes - 1, n_features) components, so for binary classification you get a single dimension.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA uses class labels
lda = LinearDiscriminantAnalysis(n_components=9)  # Max = n_classes - 1
X_lda = lda.fit_transform(X, y)

# Compare PCA vs LDA
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.5)
axes[0].set_title('PCA (Unsupervised)')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')

# LDA
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
axes[1].scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='tab10', alpha=0.5)
axes[1].set_title('LDA (Supervised)')
axes[1].set_xlabel('LD1')
axes[1].set_ylabel('LD2')

plt.tight_layout()
plt.show()
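The component limit can be checked directly. A brief sketch on the digits data (10 classes, 64 features); the behavior when asking for too many components is what recent scikit-learn versions do, and older releases may differ.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)
n_classes = len(set(y))
max_components = min(n_classes - 1, X.shape[1])

X_lda = LinearDiscriminantAnalysis(n_components=max_components).fit_transform(X, y)
print(f"{n_classes} classes -> at most {max_components} components, got {X_lda.shape}")

# Asking for one more than the limit is expected to be rejected at fit time
try:
    LinearDiscriminantAnalysis(n_components=max_components + 1).fit(X, y)
    rejected = False
except ValueError as err:
    rejected = True
    print("Rejected:", err)
```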

When to Use What

                    Visualization?

           ┌─────────────┴─────────────┐
           │                           │
          Yes                         No
           │                           │
   ┌───────┴───────┐         ┌────────┴────────┐
   │               │         │                 │
Small data    Large data   Supervised?    Unsupervised?
   │               │         │                 │
 t-SNE          UMAP       LDA              PCA

Practical Example: Image Compression

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Original images
digits = load_digits()
X = digits.data

# Compress with different components
components = [5, 10, 20, 40, 64]

fig, axes = plt.subplots(len(components), 5, figsize=(10, 12))

for row, n_comp in enumerate(components):
    if n_comp < 64:
        pca = PCA(n_components=n_comp)
        X_compressed = pca.fit_transform(X)
        X_reconstructed = pca.inverse_transform(X_compressed)
        compression_ratio = 64 / n_comp
    else:
        X_reconstructed = X
        compression_ratio = 1
    
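    for col in range(5):
        axes[row, col].imshow(X_reconstructed[col].reshape(8, 8), cmap='gray')
        # Hide ticks instead of calling axis('off'), which would also hide the row label
        axes[row, col].set_xticks([])
        axes[row, col].set_yticks([])
        if col == 0:
            axes[row, col].set_ylabel(f'{n_comp} comp\n({compression_ratio:.1f}x)')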

plt.suptitle('Image Compression with PCA')
plt.tight_layout()
plt.show()

Key Takeaways

PCA for Preprocessing

Reduce dimensions while keeping variance

t-SNE/UMAP for Visualization

See high-dimensional data in 2D/3D

LDA for Classification

Maximize class separation

Choose Components Wisely

Use explained variance or CV performance

What’s Next?

Congratulations! You’ve completed the advanced topics. Now let’s bring everything together in a capstone project!

Continue to Capstone Project

Build a complete ML system from scratch

Interview Deep-Dive

PCA can hurt model performance, and knowing when requires comparing what PCA optimizes for with what your model needs.
  • PCA maximizes variance, not predictive power. The directions of maximum variance are not necessarily the directions that separate your classes. Imagine a medical dataset where 90% of the variance comes from patient height and weight (wide spread), but the actual disease signal is in a subtle protein biomarker with low variance. PCA would keep the height/weight components and throw away the protein signal. LDA would be the right choice here because it maximizes class separation, not variance.
  • Tree-based models rarely benefit from PCA. Random forests and gradient boosting are invariant to feature scale and can handle high-dimensional, correlated features natively. PCA actually removes the interpretability of individual features (each PC is a weighted mix of all originals) and can hurt trees by obscuring the axis-aligned splits they rely on.
  • PCA destroys sparsity. If your original features are sparse (many zeros, like text bag-of-words), PCA produces dense components. A sparse 10,000-feature matrix might be efficient in memory and computation, but after PCA it becomes a dense 100-feature matrix that loses the computational advantages of sparsity.
  • PCA assumes linear relationships. If the meaningful structure in your data is nonlinear (e.g., a spiral in 2D), PCA will project it into components that do not capture that structure. Kernel PCA or autoencoders handle nonlinear manifolds better.
  • Small datasets with many features can benefit. This is where PCA shines — it acts as regularization by reducing the effective dimensionality, which reduces overfitting. The classic case: 50 samples with 500 features. Without PCA (or some other reduction), most models will overfit badly.
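The first point in the list above (variance is not the same as predictive power) can be reproduced with synthetic data. This is a toy sketch: the two features are made-up stand-ins for a wide, uninformative measurement and a narrow, informative one, and they are deliberately left unscaled so that the variance gap survives.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

# Synthetic features: a wide one unrelated to the label, a narrow one that carries it
high_variance_noise = rng.normal(scale=10.0, size=n)
low_variance_signal = y + rng.normal(scale=0.2, size=n)
X = np.column_stack([high_variance_noise, low_variance_signal])

# Reduce to a single dimension with each method, then classify
pca_model = make_pipeline(PCA(n_components=1), LogisticRegression())
lda_model = make_pipeline(LinearDiscriminantAnalysis(n_components=1), LogisticRegression())

acc_pca = cross_val_score(pca_model, X, y, cv=5).mean()
acc_lda = cross_val_score(lda_model, X, y, cv=5).mean()
print(f"1 PCA component: {acc_pca:.3f}")
print(f"1 LDA component: {acc_lda:.3f}")
```

PCA keeps the high-variance direction and lands near chance, while LDA keeps the direction that separates the classes.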
Follow-up: How would you decide the number of PCA components to keep in a production pipeline?
I would never use the “keep 95% variance” rule blindly. Instead, I would put PCA inside a pipeline and use cross-validation with different n_components values, scoring on the downstream task metric (accuracy, AUC, whatever matters). The optimal number of components is the one that maximizes task performance, not explained variance. In my experience, the “95% variance” heuristic often keeps too many components. The task-optimal number frequently corresponds to only 60-70% of explained variance, because the remaining variance is mostly noise that hurts generalization.
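One way to express that search is to treat n_components as a hyperparameter and let GridSearchCV tune it on the task metric. A minimal sketch on the digits data; the candidate values, the scaler, and the classifier are illustrative choices rather than recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# PCA sits inside the pipeline, so it is refit on each training fold only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("classifier", LogisticRegression(max_iter=2000)),
])

# Candidate values are illustrative; a real grid depends on the dataset
param_grid = {"pca__n_components": [10, 20, 40]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print("Best n_components:", search.best_params_["pca__n_components"])
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Keeping PCA inside the pipeline matters: fitting it on the full dataset before cross-validation would leak information from the validation folds.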
PCA, t-SNE, and UMAP solve fundamentally different problems, and confusing them is a common mistake in interviews.
  • PCA: preprocessing and compression. PCA is the only one of the three that should be used as a preprocessing step before modeling. It is deterministic, fast, approximately invertible (you can reconstruct the original features, exactly only when all components are kept), and can transform new unseen data. In production, I use PCA to reduce feature dimensionality before feeding data into a model, to speed up training on high-dimensional datasets, and for denoising (reconstruct from top-K components).
  • t-SNE: visualization only. t-SNE produces beautiful 2D scatter plots that reveal cluster structure, but the output is not stable (different random seeds give different plots), the axes are meaningless, and you cannot transform new data points without rerunning the entire algorithm. Never use t-SNE output as features for a model. The distances between clusters in a t-SNE plot are meaningless — two clusters that appear far apart may be close in the original space, and vice versa.
  • UMAP: visualization with some production uses. UMAP is faster than t-SNE, preserves more global structure (cluster distances are more meaningful), and crucially can transform new data points. This makes UMAP usable in light production scenarios — for example, projecting new data into an existing 2D space for anomaly visualization dashboards. However, I still would not use UMAP embeddings as features for a downstream model in most cases because the embedding is sensitive to hyperparameters (n_neighbors, min_dist) and small changes can produce very different geometries.
  • A common production pattern: Use PCA for the actual model pipeline (reduce 500 features to 50), then use UMAP or t-SNE to create monitoring visualizations that show how new data clusters compared to training data. This gives you the best of both worlds: a robust, reproducible model with informative visual monitoring.
Follow-up: A data scientist on your team used t-SNE embeddings as features for a classifier and got great accuracy. What would you say?
I would say the accuracy is likely misleading and the approach will fail in production. t-SNE is fit on the entire dataset (train + test) simultaneously, which means test data influences the embedding of training data and vice versa. This is a form of data leakage. Additionally, t-SNE embeddings are not stable: rerunning with a different seed gives different features, making the model non-reproducible. Even if you fix the seed, you cannot embed new production data without rerunning t-SNE on the entire dataset including the new point, which is computationally prohibitive at scale. Replace t-SNE with PCA in the pipeline, or if you need nonlinear reduction, use a trained autoencoder that can transform new data independently.
Handling a feature set that changes over time while PCA sits in the pipeline is a production-reality problem: it tests whether you understand the operational challenges of maintaining ML systems over time.
  • PCA trained on old features cannot handle new features. PCA learns a projection matrix based on a specific set of input features. If the data engineering team adds 5 new features next quarter, you cannot simply append them — the PCA model expects the original feature set. You have three options: retrain PCA including the new features (requires retraining the downstream model too), ignore the new features in the existing model (lose potential signal), or maintain a separate model for new features and ensemble.
  • Use a feature store with versioning. Each model version should be pinned to a specific feature set version. When new features are added, they go into a new feature set version. The existing model continues to use the old feature set until you explicitly retrain and validate a new model version with the expanded features.
  • Design for feature evolution from the start. Instead of hard-coding PCA with n_components=50, use a pipeline that dynamically selects the top K features by importance (e.g., using SelectFromModel with a tree-based estimator) and then applies PCA on the selected features. When new features arrive, the selection step can automatically include them if they are informative.
  • Monitor feature importance after adding new features. After retraining with new features, check if the new features rank highly in importance. If a newly added feature dominates, investigate whether it is genuinely informative or leaking information. New features that immediately become the most important are suspicious.
  • Automate the retrain-evaluate-deploy cycle. If features change quarterly, you need a pipeline that can retrain the model (including PCA), evaluate against a holdout set, compare performance to the current production model, and deploy only if the new model is better. This should be triggered automatically when the feature schema changes.
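The "design for feature evolution" idea from the list above can be sketched as a pipeline that selects features by importance and then applies PCA. The step names, `TOP_K`, and the estimator settings below are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
TOP_K = 30  # hypothetical budget of original features to keep

pipeline = Pipeline([
    # threshold=-np.inf makes max_features the only selection criterion
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        max_features=TOP_K, threshold=-np.inf)),
    ("pca", PCA(n_components=10)),          # must not exceed TOP_K
    ("classifier", LogisticRegression(max_iter=5000)),
])
pipeline.fit(X, y)

kept = pipeline.named_steps["select"].get_support(indices=True)
print(f"Kept {len(kept)} of {X.shape[1]} original features:", kept)
```

When the input gains new columns, refitting this pipeline lets the selection step decide whether they make the cut. It is still a retrain: the fitted PCA and classifier remain tied to the columns seen at fit time.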
Follow-up: How do you handle the case where a feature that was previously important gets deprecated by the data engineering team?
This is a production emergency if not handled carefully. If the model was trained with that feature and it suddenly becomes null or is removed, predictions will either fail (if the model expects it) or degrade silently (if it defaults to zero). I would implement a feature availability monitoring check that alerts immediately if any expected feature is missing or has an unusual null rate. The short-term fix is to impute the missing feature based on training-time statistics (mean/median). The long-term fix is to retrain the model without that feature. The key lesson is that feature contracts between teams need to be explicit and versioned, just like API contracts.
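A feature availability check and the short-term imputation fix might look roughly like the sketch below. Everything here is hypothetical: the function names, the feature names, the 5% null-rate threshold, and the dict-of-arrays batch format are assumptions for illustration, not part of any particular monitoring tool.

```python
import numpy as np

def check_feature_availability(batch, expected_features, max_null_rate=0.05):
    """Return alert messages for missing features or unusually high null rates."""
    alerts = []
    for name in expected_features:
        if name not in batch:
            alerts.append(f"MISSING: {name}")
            continue
        null_rate = float(np.mean(np.isnan(batch[name])))
        if null_rate > max_null_rate:
            alerts.append(f"HIGH_NULL_RATE: {name} ({null_rate:.0%})")
    return alerts

def impute_with_training_stats(batch, training_medians):
    """Short-term fix: fill absent columns and NaNs with training-time medians."""
    n_rows = len(next(iter(batch.values())))
    fixed = {}
    for name, median in training_medians.items():
        column = np.asarray(batch.get(name, np.full(n_rows, np.nan)), dtype=float)
        fixed[name] = np.where(np.isnan(column), median, column)
    return fixed

# Hypothetical feature names and values, for illustration only
training_medians = {"age": 41.0, "income": 52000.0, "tenure_months": 18.0}
batch = {
    "age": np.array([35.0, np.nan, 52.0, 47.0]),
    "income": np.array([48000.0, 61000.0, 39000.0, 75000.0]),
    # "tenure_months" was deprecated upstream and is absent from this batch
}

alerts = check_feature_availability(batch, training_medians.keys())
fixed = impute_with_training_stats(batch, training_medians)
print(alerts)
```

In a real system the alerts would go to whatever paging or logging channel the team uses, and the imputation would only be a stopgap until the model is retrained without the deprecated feature.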