Dimensionality Reduction
The Curse of Dimensionality
Imagine searching for your friend in a 1D hallway: easy. Now imagine a 2D football field: harder. Now a 3D building: much harder. Now imagine a 100-dimensional space. The volume of that space grows so fast that any reasonable dataset becomes absurdly sparse.
As features increase:
- Data becomes sparse: points are so far apart that “nearby” neighbors are barely closer than distant ones
- Models need exponentially more data: covering each axis with just 10 sample values takes 10^2 points in 2D but 10^100 points in 100D
- Distance metrics lose contrast: in high dimensions, the farthest point and the nearest point are almost the same distance away (this can be shown formally for many common data distributions, and it undermines KNN, clustering, and other distance-based methods)
- Training time explodes: more features means more parameters to learn and more computation per step
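The loss of distance contrast can be checked numerically. The sketch below is illustrative only: it assumes NumPy and uniformly random points in a unit hypercube, and `relative_contrast` is a hypothetical helper name, not something from this course.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=1000):
    """Return (farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))   # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# A large value means "near" and "far" are clearly different;
# a value close to 0 means every point is roughly equally far away.
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the printed contrast shrinks toward zero, which is the effect that makes nearest-neighbor queries less informative.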
Why Reduce Dimensions?
Visualization
Plot 100D data in 2D
Speed
Faster training and inference
Noise Reduction
Remove noisy features
Better Models
Reduce overfitting
PCA: Principal Component Analysis
PCA finds new axes (directions) that capture the most variance in your data. It answers the question: “If I could only look at this data from a few angles, which angles would show me the most information?”
The Intuition
Imagine photographing a cigar from different angles. From the side, you see its full length: lots of information about shape. From the tip, you just see a circle: almost no useful information. PCA automatically finds the “best side view” (most variance) first, then the next most informative angle perpendicular to that, and so on.
- First principal component: Along the cigar (direction of most variance, the most informative angle)
- Second principal component: Across the cigar (direction of second most variance, perpendicular to the first)
Math Connection: PCA uses eigendecomposition of the covariance matrix. See Linear Algebra Course for the full theory.
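The eigendecomposition view can be sketched in a few lines of NumPy. This is a minimal illustration on a synthetic, cigar-shaped 2D cloud (the data and variable names are assumptions made for the example), not a replacement for a library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "cigar": long along one axis, thin along the other,
# then rotated 45 degrees so the long axis lies on the diagonal.
latent = rng.normal(size=(500, 2)) * np.array([5.0, 0.5])
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X = latent @ rotation.T

# Step 1: center the data. Step 2: covariance matrix. Step 3: eigendecomposition.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues

# Sort descending so the first column is the first principal component.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()
pc1 = eigvecs[:, 0]                              # direction "along the cigar"
X_projected = X_centered @ eigvecs[:, :1]        # data reduced to 1 dimension

print("PC1 direction:", np.round(pc1, 3))
print("Explained variance ratio:", np.round(explained_ratio, 3))
```

The first eigenvector should point along the diagonal (up to sign), and almost all of the variance should fall on it.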
Choosing the Number of Components
Method 1: Explained Variance
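One way to apply this method is to fit PCA with all components and read the cumulative explained variance. The sketch below assumes scikit-learn and its bundled digits dataset; both are choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 8x8 digit images flattened to 64 features

pca = PCA().fit(X)                    # no n_components: keep every component
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches each threshold.
components_needed = {
    t: int(np.argmax(cumulative >= t)) + 1 for t in (0.80, 0.90, 0.95)
}
for t, k in components_needed.items():
    print(f"{t:.0%} of the variance needs {k} of {X.shape[1]} components")
```

Plotting `cumulative` against the component index gives the usual "elbow" curve for picking a cutoff by eye.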
Method 2: Preserve Target Variance
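scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components`, in which case it keeps the smallest number of components that reaches that fraction of variance. A minimal sketch, again assuming the digits dataset as a stand-in for your own data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# A float n_components means "keep enough components for this share of variance".
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

kept = pca.n_components_                          # chosen automatically during fit
retained = pca.explained_variance_ratio_.sum()
print(f"Kept {kept} of {X.shape[1]} components, retaining {retained:.1%} of the variance")
```

The 0.95 target is a common starting point rather than a rule; cross-validated model performance is the better final judge.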
Visualizing High-Dimensional Data
Digits in 2D
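A sketch of the 2D projection, assuming scikit-learn's digits dataset. It only computes the coordinates and a per-class summary; the plotting call is left out so the example runs without a plotting library.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)   # 64 features per image, labels 0-9

pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X)        # each image is now a single (x, y) point

kept = pca_2d.explained_variance_ratio_.sum()
print(f"Two components keep {kept:.1%} of the variance")

# Class centroids give a rough idea of which digits separate in this view.
for digit in range(10):
    cx, cy = X_2d[y == digit].mean(axis=0)
    print(f"digit {digit}: centroid = ({cx:7.2f}, {cy:7.2f})")
```

To draw the figure, pass `X_2d[:, 0]` and `X_2d[:, 1]` to a scatter plot and color the points by `y`.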
t-SNE: For Visualization
While PCA preserves global structure (overall spread and distances), t-SNE focuses on preserving local structure: it tries to keep points that are close in high-dimensional space close in the 2D projection. This makes it excellent for revealing clusters that PCA might smear together.
PCA vs t-SNE
| Aspect | PCA | t-SNE |
|---|---|---|
| Speed | Fast (linear algebra) | Slow (iterative optimization) |
| Purpose | Feature reduction, preprocessing, denoising | Visualization only |
| Preserves | Global structure (distances, spread) | Local structure (neighborhoods) |
| New data | Can transform new points directly | Must refit on entire dataset |
| Interpretable | Yes (loadings tell you which original features matter) | No (axes have no meaning) |
| Deterministic | Yes (always same result) | No (depends on random initialization) |
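Several rows of this table can be checked directly. The sketch below assumes scikit-learn; it uses a 500-image subset of the digits dataset to keep t-SNE quick, and scores local-neighborhood preservation with `sklearn.manifold.trustworthiness`. The subset size and `perplexity` value are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

X, y = load_digits(return_X_y=True)
X_small = X[:500]                      # subsample: t-SNE is iterative and slow

X_pca = PCA(n_components=2).fit_transform(X_small)
X_tsne = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(X_small)

# Trustworthiness is in [0, 1]; higher means local neighborhoods are better preserved.
trust_pca = trustworthiness(X_small, X_pca, n_neighbors=10)
trust_tsne = trustworthiness(X_small, X_tsne, n_neighbors=10)
print(f"PCA   trustworthiness: {trust_pca:.3f}")
print(f"t-SNE trustworthiness: {trust_tsne:.3f}")

# "New data" row: PCA exposes transform(); scikit-learn's TSNE only has fit_transform().
print("PCA has transform: ", hasattr(PCA, "transform"))
print("TSNE has transform:", hasattr(TSNE, "transform"))
```

Fixing `random_state` makes a single t-SNE run repeatable, but a different seed can still produce a visibly different layout.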
UMAP: Best of Both Worlds
UMAP is now often preferred over t-SNE as the go-to visualization tool. It is typically much faster (especially on large datasets), tends to preserve more global structure (distances between clusters are somewhat more meaningful), and can transform new, unseen data, which standard t-SNE implementations cannot do.
PCA for Preprocessing
Speed Up Training
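A sketch of PCA as a pipeline step, assuming scikit-learn, the digits dataset, and logistic regression as the downstream model (all illustrative choices). Timings on a dataset this small are noisy, so treat the printed seconds as indicative only.

```python
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
with_pca = make_pipeline(StandardScaler(),
                         PCA(n_components=0.95, svd_solver="full"),
                         LogisticRegression(max_iter=2000))

results = {}
for name, model in {"no PCA": baseline, "PCA 95%": with_pca}.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)        # PCA is fit on the training split only
    elapsed = time.perf_counter() - start
    results[name] = model.score(X_test, y_test)
    print(f"{name:>8}: fit in {elapsed:.3f}s, test accuracy {results[name]:.3f}")

n_kept = with_pca.named_steps["pca"].n_components_
print(f"PCA kept {n_kept} of {X.shape[1]} features")
```

Keeping PCA inside the pipeline means it is refit on each training fold during cross-validation, which avoids leaking test data into the projection.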
Noise Reduction
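Denoising with PCA means projecting onto the top components and then mapping back with `inverse_transform`, so that low-variance directions (where noise tends to live) are dropped. The sketch below assumes scikit-learn and adds synthetic Gaussian noise to the digits dataset; the noise level and the choice of 15 components are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_clean, _ = load_digits(return_X_y=True)            # pixel values in 0..16
rng = np.random.default_rng(0)
X_noisy = X_clean + rng.normal(scale=4.0, size=X_clean.shape)

# Fit on the noisy data, keep only the strongest directions, then reconstruct.
pca = PCA(n_components=15).fit(X_noisy)
X_denoised = pca.inverse_transform(pca.transform(X_noisy))

mse_noisy = np.mean((X_noisy - X_clean) ** 2)
mse_denoised = np.mean((X_denoised - X_clean) ** 2)
print(f"MSE vs clean, noisy input:    {mse_noisy:.2f}")
print(f"MSE vs clean, PCA-denoised:   {mse_denoised:.2f}")
```

Too few components removes signal along with the noise, so the component count is worth tuning against a clean reference when one is available.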
Feature Selection vs Feature Extraction
These are two fundamentally different strategies for reducing dimensions, and knowing when to use which is a practical skill that matters in production.
| Aspect | Feature Selection | Feature Extraction (PCA) |
|---|---|---|
| Method | Pick the best original features, discard the rest | Create new features as combinations of originals |
| Interpretability | High — you keep original feature names and meanings | Lower — “PC1” is a weighted mix of all original features |
| Information | May lose some if you drop correlated but useful features | Preserves overall variance efficiently |
| When to prefer | When stakeholders need to understand which inputs matter (healthcare, finance, compliance) | When you want maximum compression and interpretability is secondary |
| Examples | SelectKBest, Recursive Feature Elimination (RFE), Lasso (L1) | PCA, LDA, Autoencoders |
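The interpretability row is easy to see in code. This sketch assumes scikit-learn and its breast cancer dataset (an illustrative choice), and reduces 30 features to 5 with each strategy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keep 5 original columns, names intact.
selector = SelectKBest(f_classif, k=5).fit(X, y)
selected_names = data.feature_names[selector.get_support()]
print("Selected features:", list(selected_names))

# Feature extraction: 5 new columns, each a weighted mix of all 30 originals.
pca = PCA(n_components=5).fit(X)
nonzero_loadings = np.count_nonzero(pca.components_[0])
top_loading = data.feature_names[np.abs(pca.components_[0]).argmax()]
print(f"PC1 mixes {nonzero_loadings} original features; largest loading: {top_loading}")
```

Note that `SelectKBest` uses the labels `y` while PCA does not, which is another practical difference between the two approaches.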
LDA: Supervised Dimensionality Reduction
While PCA asks “which directions have the most variance?”, LDA asks “which directions best separate my classes?” This makes LDA ideal when your goal is classification rather than general-purpose compression. The tradeoff: LDA can produce at most (n_classes - 1) components, so for binary classification you get exactly one dimension.
When to Use What
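One concrete way to compare PCA and LDA is to run both on the same labeled data. The sketch below assumes scikit-learn and the iris dataset (3 classes, 4 features), and also demonstrates the (n_classes - 1) limit on LDA components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)      # 3 classes, 4 features

# PCA is unsupervised: it never sees y.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it needs y, and 3 classes allow at most 2 components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print("PCA output:", X_pca.shape, " LDA output:", X_lda.shape)

# Asking for more than n_classes - 1 components is rejected at fit time.
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)
    too_many_rejected = False
except ValueError as err:
    too_many_rejected = True
    print("LDA with 3 components failed:", err)
```

If you have labels and the goal is classification, LDA is the natural first try; without labels, or when you need more than (n_classes - 1) dimensions, PCA is the fallback.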
Practical Example: Image Compression
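A small-scale sketch of PCA as lossy image compression, assuming scikit-learn and the 8x8 digits images (a stand-in for larger images). Each image is stored as `k` component scores instead of 64 pixels; the shared components and mean are counted in the storage cost.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)    # 1797 images x 64 pixels

report = {}
for k in (8, 16, 32):
    pca = PCA(n_components=k).fit(X)
    codes = pca.transform(X)                       # compressed representation
    restored = pca.inverse_transform(codes)        # lossy reconstruction

    # Values stored: per-image codes plus the shared components and mean vector.
    stored = codes.size + pca.components_.size + pca.mean_.size
    ratio = X.size / stored
    rmse = float(np.sqrt(np.mean((X - restored) ** 2)))
    report[k] = (ratio, rmse)
    print(f"k={k:>2}: compression {ratio:.2f}x, reconstruction RMSE {rmse:.2f}")
```

More components lower the reconstruction error but also lower the compression ratio, which is the tradeoff to tune for a given application.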
Key Takeaways
PCA for Preprocessing
Reduce dimensions while keeping variance
t-SNE/UMAP for Visualization
See high-dimensional data in 2D/3D
LDA for Classification
Maximize class separation
Choose Components Wisely
Use explained variance or CV performance
What’s Next?
Congratulations! You’ve completed the advanced topics. Now let’s bring everything together in a capstone project!
Continue to Capstone Project
Build a complete ML system from scratch
Interview Deep-Dive
You applied PCA before training a model and accuracy improved. A teammate says PCA always helps. When does PCA actually hurt model performance?
PCA can absolutely hurt, and understanding when requires thinking about what PCA optimizes for versus what your model needs.
- PCA maximizes variance, not predictive power. The directions of maximum variance are not necessarily the directions that separate your classes. Imagine a medical dataset where 90% of the variance comes from patient height and weight (wide spread), but the actual disease signal is in a subtle protein biomarker with low variance. PCA would keep the height/weight components and throw away the protein signal. LDA would be the right choice here because it maximizes class separation, not variance.
- Tree-based models rarely benefit from PCA. Random forests and gradient boosting are invariant to feature scale and can handle high-dimensional, correlated features natively. PCA actually removes the interpretability of individual features (each PC is a weighted mix of all originals) and can hurt trees by obscuring the axis-aligned splits they rely on.
- PCA destroys sparsity. If your original features are sparse (many zeros, like text bag-of-words), PCA produces dense components, and the centering step alone turns a sparse matrix dense. A sparse 10,000-feature matrix might be efficient in memory and computation, but after PCA it becomes a dense 100-feature matrix that loses the computational advantages of sparsity. For sparse input, TruncatedSVD (which skips centering) is the usual alternative.
- PCA assumes linear relationships. If the meaningful structure in your data is nonlinear (e.g., a spiral in 2D), PCA will project it into components that do not capture that structure. Kernel PCA or autoencoders handle nonlinear manifolds better.
- The flip side: small datasets with many features can benefit. This is where PCA shines, because it acts as regularization by reducing the effective dimensionality, which reduces overfitting. The classic case: 50 samples with 500 features. Without PCA (or some other reduction), most models will overfit badly.
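The first point (variance is not predictive power) can be reproduced on synthetic data. In the sketch below, which assumes scikit-learn, one feature is high-variance noise and the other is a low-variance signal that carries the label; the feature names and scales are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

high_variance_noise = rng.normal(scale=10.0, size=n)       # unrelated to y
low_variance_signal = y + rng.normal(scale=0.3, size=n)    # carries the label
X = np.column_stack([high_variance_noise, low_variance_signal])

# Both pipelines reduce to one dimension before the same classifier.
pca_model = make_pipeline(PCA(n_components=1), LogisticRegression())
lda_model = make_pipeline(LinearDiscriminantAnalysis(n_components=1),
                          LogisticRegression())

acc_pca = cross_val_score(pca_model, X, y, cv=5).mean()
acc_lda = cross_val_score(lda_model, X, y, cv=5).mean()
print(f"PCA -> 1 component: accuracy {acc_pca:.2f}")   # keeps the noisy axis
print(f"LDA -> 1 component: accuracy {acc_lda:.2f}")   # keeps the separating axis
```

PCA keeps the noise axis because it has the most variance, so accuracy falls to roughly chance, while LDA keeps the direction that separates the classes.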
Explain the practical differences between PCA, t-SNE, and UMAP. When would you choose each in a production setting?
These three tools solve fundamentally different problems, and confusing them is a common mistake in interviews.
- PCA: preprocessing and compression. PCA is the only one of the three that should be used as a preprocessing step before modeling. It is deterministic, fast, approximately invertible (you can reconstruct the original features, exactly only if every component is kept), and can transform new unseen data. In production, I use PCA to reduce feature dimensionality before feeding data into a model, to speed up training on high-dimensional datasets, and for denoising (reconstruct from top-K components).
- t-SNE: visualization only. t-SNE produces beautiful 2D scatter plots that reveal cluster structure, but the output is not stable (different random seeds give different plots), the axes are meaningless, and you cannot transform new data points without rerunning the entire algorithm. Never use t-SNE output as features for a model. The distances between clusters in a t-SNE plot are meaningless — two clusters that appear far apart may be close in the original space, and vice versa.
- UMAP: visualization with some production uses. UMAP is faster than t-SNE, preserves more global structure (cluster distances are more meaningful), and crucially can transform new data points. This makes UMAP usable in light production scenarios — for example, projecting new data into an existing 2D space for anomaly visualization dashboards. However, I still would not use UMAP embeddings as features for a downstream model in most cases because the embedding is sensitive to hyperparameters (n_neighbors, min_dist) and small changes can produce very different geometries.
- A common production pattern: Use PCA for the actual model pipeline (reduce 500 features to 50), then use UMAP or t-SNE to create monitoring visualizations that show how new data clusters compared to training data. This gives you the best of both worlds: a robust, reproducible model with informative visual monitoring.
How would you handle dimensionality reduction in a production ML pipeline where new features are added quarterly by the data engineering team?
This is a great production-reality question because it tests whether you understand the operational challenges of maintaining ML systems over time.
- PCA trained on old features cannot handle new features. PCA learns a projection matrix based on a specific set of input features. If the data engineering team adds 5 new features next quarter, you cannot simply append them — the PCA model expects the original feature set. You have three options: retrain PCA including the new features (requires retraining the downstream model too), ignore the new features in the existing model (lose potential signal), or maintain a separate model for new features and ensemble.
- Use a feature store with versioning. Each model version should be pinned to a specific feature set version. When new features are added, they go into a new feature set version. The existing model continues to use the old feature set until you explicitly retrain and validate a new model version with the expanded features.
- Design for feature evolution from the start. Instead of hard-coding PCA with n_components=50, use a pipeline that dynamically selects the top K features by importance (e.g., using SelectFromModel with a tree-based estimator) and then applies PCA on the selected features. When new features arrive, the selection step can automatically include them if they are informative.
- Monitor feature importance after adding new features. After retraining with new features, check if the new features rank highly in importance. If a newly added feature dominates, investigate whether it is genuinely informative or leaking information. New features that immediately become the most important are suspicious.
- Automate the retrain-evaluate-deploy cycle. If features change quarterly, you need a pipeline that can retrain the model (including PCA), evaluate against a holdout set, compare performance to the current production model, and deploy only if the new model is better. This should be triggered automatically when the feature schema changes.
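The selection-then-PCA design and the schema pinning described above can be sketched as follows. This assumes scikit-learn; the synthetic data, the `build_pipeline` helper, and the "quarter 1 / quarter 2" feature sets are hypothetical names invented for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_pipeline():
    """Pipeline with no hard-coded feature count or component count."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectFromModel(
            RandomForestClassifier(n_estimators=100, random_state=0),
            threshold="median")),                   # keep the more important half
        ("pca", PCA(n_components=0.95, svd_solver="full")),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

# Quarter 1 schema: 20 features. Quarter 2 adds 5 hypothetical new columns.
X_q1, y = make_classification(n_samples=600, n_features=20,
                              n_informative=6, random_state=0)
rng = np.random.default_rng(0)
X_q2 = np.hstack([X_q1, rng.normal(size=(600, 5))])

model_q1 = build_pipeline().fit(X_q1, y)   # pinned to the 20-feature schema
model_q2 = build_pipeline().fit(X_q2, y)   # same code, retrained on 25 features

# The old model rejects the new schema instead of silently misreading columns.
try:
    model_q1.predict(X_q2)
    schema_mismatch = False
except ValueError:
    schema_mismatch = True
print("Old model rejects new schema:", schema_mismatch)
print("PCA components (q1, q2):",
      model_q1.named_steps["pca"].n_components_,
      model_q2.named_steps["pca"].n_components_)
```

In a real pipeline, the retrained `model_q2` would still need to beat the production model on a holdout set before being deployed.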