The Deep Learning Landscape
The Timeline That Changed Everything
Let’s start with some perspective. Here’s what happened:
| Year | Breakthrough | Impact |
|---|---|---|
| 1958 | Perceptron | First learning machine (couldn’t solve XOR) |
| 1986 | Backpropagation | Training multi-layer networks became possible |
| 2006 | Deep Belief Networks | Showed deep networks could be trained |
| 2012 | AlexNet | Won ImageNet by a huge margin, started the revolution |
| 2014 | GANs | Generating realistic images |
| 2015 | ResNet | 152-layer networks that actually train |
| 2017 | Transformer | Attention is all you need |
| 2018 | BERT | Language understanding breakthrough |
| 2020 | GPT-3 | Few-shot learning at scale |
| 2022 | ChatGPT | AI goes mainstream |
| 2023 | GPT-4 | Multimodal reasoning |
| 2024 | Sora | Video generation from text |
Deep Learning vs. Machine Learning
Let’s be precise about what we mean:
Traditional Machine Learning
Think of traditional ML like hiring an expert art appraiser. The appraiser (you, the engineer) decides what features matter (brush stroke width, color palette, canvas texture) and manually measures each one. The ML model then learns patterns from those measurements. If you missed a critical feature, tough luck.
- Feature engineering is time-consuming
- Requires domain expertise
- Features may not capture what matters
- Doesn’t scale to complex patterns
Deep Learning
Deep learning is like hiring an apprentice who figures out what matters on their own. You show them thousands of paintings labeled “Monet” or “not Monet,” and they discover, without being told what to look for, that brush stroke patterns, color palettes, and light diffusion are the distinguishing features. No hand-crafted features required.
- Learns features automatically
- Scales to complex patterns
- Transfers across tasks
- State-of-the-art performance
When to Use What
| Scenario | Best Choice | Why |
|---|---|---|
| Small dataset (<1000 samples) | Traditional ML | Deep learning tends to overfit with so few examples |
| Tabular data | Traditional ML (XGBoost) | Often beats deep learning |
| Images, audio, text | Deep Learning | Hierarchical patterns |
| Limited compute | Traditional ML | Deep learning is expensive |
| Need interpretability | Traditional ML | Deep learning is a “black box” |
| Massive data available | Deep Learning | Benefits from scale |
The Deep Learning Ecosystem
Major Application Domains
Computer Vision
- Image classification
- Object detection (YOLO, Faster R-CNN)
- Segmentation
- Face recognition
- Medical imaging
- Autonomous vehicles
Natural Language Processing
- Text classification
- Machine translation
- Question answering
- Summarization
- Chatbots (ChatGPT)
- Code generation (Copilot)
Speech & Audio
- Speech recognition (Whisper)
- Text-to-speech
- Music generation
- Audio classification
- Voice cloning
Generative AI
- Image generation (DALL-E, Stable Diffusion)
- Video generation (Sora)
- 3D model generation
- Code generation
- Drug discovery
The Architecture Zoo
| Architecture | Domain | Key Idea |
|---|---|---|
| CNN (1998) | Vision | Local patterns with convolutions |
| RNN/LSTM (1997) | Sequences | Memory for temporal dependencies |
| Transformer (2017) | Everything | Attention over all positions |
| GAN (2014) | Generation | Adversarial training |
| VAE (2013) | Generation | Probabilistic latent space |
| Diffusion (2020) | Generation | Iterative denoising |
| Graph NN (2017) | Graphs | Message passing on structure |
Key Concepts Overview
Before we dive into details, here’s a map of what you’ll learn:
The Learning Process
The training loop is the heartbeat of deep learning. It follows a simple cycle that repeats millions of times: guess, check, adjust.
What Makes Deep Networks Work
| Component | What It Does | Analogy |
|---|---|---|
| Layers | Transform data step by step | Assembly line workers |
| Weights | Learnable parameters | Workers’ skill levels |
| Activations | Non-linear functions | Decision gates |
| Loss | Measures error | Quality inspector |
| Optimizer | Updates weights | Manager adjusting workers |
| Backprop | Computes gradients | Feedback mechanism |
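To make the table concrete, here is a minimal sketch of the guess, check, adjust cycle in plain Python. It is deliberately not a deep network: it trains a single sigmoid neuron (logistic regression) on a toy AND dataset, and the names `forward`, `train`, and the chosen learning rate are illustrative assumptions rather than code from this course. Each component from the table still appears once.

```python
import math


def sigmoid(z):
    """Activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Layer + activation: weighted sum of the inputs, then the non-linearity."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


def binary_cross_entropy(p, y):
    """Loss: small when the prediction p agrees with the label y."""
    eps = 1e-12  # guards against log(0)
    return -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))


def train(inputs, targets, steps=5000, learning_rate=1.0):
    """Run the training loop and return the learned weights, bias, and loss history."""
    w, b = [0.0, 0.0], 0.0  # weights: the learnable parameters (zero init, assumed)
    n = len(inputs)
    losses = []
    for _ in range(steps):
        # 1. Guess: forward pass on every example.
        preds = [forward(w, b, x) for x in inputs]
        # 2. Check: average loss over the dataset.
        losses.append(sum(binary_cross_entropy(p, y) for p, y in zip(preds, targets)) / n)
        # 3. Backprop: for sigmoid + cross-entropy, dLoss/dz simplifies to (p - y).
        errors = [p - y for p, y in zip(preds, targets)]
        grad_w = [sum(e * x[i] for e, x in zip(errors, inputs)) / n for i in range(2)]
        grad_b = sum(errors) / n
        # 4. Adjust: optimizer step (plain gradient descent).
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b, losses


# Toy dataset: the logical AND function (linearly separable).
inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 0.0, 0.0, 1.0]

w, b, losses = train(inputs, targets)
predictions = [round(forward(w, b, x)) for x in inputs]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}, predictions: {predictions}")
```

A real deep network stacks many such layers and lets a framework compute the gradients automatically, but the loop has the same four steps.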
Your First Neural Network
Let’s build a simple network to classify handwritten digits (MNIST). This is the “Hello World” of deep learning: simple enough to understand completely, but real enough to teach you the full training pipeline.
Understanding What Happened
Let’s break down what the network learned:
Visualizing Learned Features
- Edges at different orientations
- Curve detectors
- Stroke patterns
What Each Layer Does
| Layer | Input Shape | Output Shape | What It Learns |
|---|---|---|---|
| fc1 | 784 (28×28) | 512 | Low-level patterns (edges, strokes) |
| fc2 | 512 | 256 | Mid-level combinations (curves, corners) |
| fc3 | 256 | 10 | Digit-specific patterns |
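As a quick arithmetic check on the shapes in the table, the sketch below counts the learnable parameters in each layer. The layer names and sizes come from the table; the assumption that every fully connected layer carries one bias per output is mine and may not match the original model exactly.

```python
# Layer sizes taken from the table above: (name, inputs, outputs).
layers = [("fc1", 784, 512), ("fc2", 512, 256), ("fc3", 256, 10)]


def count_parameters(n_in, n_out):
    """One weight per input-output pair, plus one bias per output (assumed)."""
    return n_in * n_out + n_out


total = 0
for name, n_in, n_out in layers:
    n_params = count_parameters(n_in, n_out)
    total += n_params
    print(f"{name}: {n_in} -> {n_out}, {n_params:,} parameters")

print(f"total: {total:,} parameters")
```

Under that assumption the network has 535,818 parameters, and about three quarters of them sit in fc1, because the first layer is connected to all 784 input pixels.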
The Deep Learning Mindset
It’s All About Representations
The key insight: Deep learning is about learning good representations of your data.
The Three Pillars
| Pillar | What It Means | How to Get It |
|---|---|---|
| Data | More data = better models | Web scraping, data augmentation, synthetic data |
| Compute | More GPUs = larger models | Cloud computing, efficient architectures |
| Algorithms | Better architectures | Research, this course! |
Empirical Science
Deep learning is highly empirical. Unlike traditional algorithms where you can prove properties mathematically, deep learning requires:
- Experimentation: Try different architectures
- Ablation studies: Remove components to see what matters
- Hyperparameter tuning: Search for the best settings
- Visualization: Look at what your model learned
Common Mistakes for Beginners
| Mistake | Why It’s Wrong | Better Approach |
|---|---|---|
| Jumping to deep learning | May not need it | Start with a baseline (logistic regression, random forest) |
| Not normalizing inputs | Unstable training | Normalize to mean=0, std=1 |
| Wrong loss function | Model won’t learn properly | Classification → Cross-entropy, Regression → MSE |
| Learning rate too high | Training diverges | Start with 0.001, reduce if unstable |
| Not enough data | Model overfits | Data augmentation, transfer learning |
| Training too long | Overfitting | Use early stopping based on validation loss |
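The loss-function row in the table can be made concrete with a small sketch of the two losses it names. The function names and the example numbers below are illustrative assumptions; deep learning frameworks ship their own, more numerically careful implementations.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Classification loss: negative log-probability of the correct class."""
    return -math.log(softmax(logits)[true_class])


def mean_squared_error(predictions, targets):
    """Regression loss: average squared distance from the targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# An undecided 10-class classifier pays ln(10), whatever the true label is.
uniform_loss = cross_entropy([0.0] * 10, true_class=3)

# A confident correct prediction is cheap; a confident wrong one is expensive.
confident_right = cross_entropy([5.0, 0.0, 0.0], true_class=0)
confident_wrong = cross_entropy([5.0, 0.0, 0.0], true_class=1)

regression_loss = mean_squared_error([1.0, 2.0], [1.0, 4.0])

print(f"uniform: {uniform_loss:.3f}")
print(f"confident right: {confident_right:.3f}, confident wrong: {confident_wrong:.3f}")
print(f"mse: {regression_loss:.1f}")
```

Cross-entropy punishes confident mistakes much more heavily than hesitant ones, which is one reason it suits classification better than a squared error on class probabilities.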
What’s Next
Now that you understand the landscape, we’ll dive into the fundamentals:
Module 2: Perceptrons & Multi-Layer Networks
Exercises
Exercise 1: Explore the Network
- What happens if you remove the hidden layers (just fc1 → fc3)?
- What if you make it deeper (add fc4)?
- What if you change the hidden layer sizes?
Exercise 2: Visualize Confusion
Exercise 3: Compare to Traditional ML
Interview Deep-Dive
Explain the difference between feature engineering in traditional ML and representation learning in deep learning. Why is this distinction important?
- In traditional ML, a human expert designs features: computing edge histograms for images, TF-IDF vectors for text, or hand-crafted statistical summaries for time series. The model then learns a mapping from these fixed features to outputs. The quality of the model is fundamentally bottlenecked by the quality of the features — if you miss a critical feature, no amount of training will recover it.
- In deep learning, the network learns its own features through hierarchical representation learning. Early layers discover low-level patterns (edges, character n-grams), middle layers compose these into higher-level features (shapes, phrases), and late layers form task-specific representations (object categories, sentiment). The features themselves are optimized end-to-end for the task.
- This distinction matters because representation learning scales to modalities where human feature engineering is impractical. No human can design the right features for recognizing 10,000 object categories or understanding arbitrary natural language. The network discovers features that humans would never think to engineer, and those learned features often outperform hand-crafted alternatives by large margins.
- The trade-off: deep learning’s learned representations require substantially more data and compute. With 500 labeled examples, a carefully engineered feature set plus a linear model will usually beat a neural network that must learn everything from scratch.
You are building a model to predict customer churn from a tabular dataset with 50 features and 10,000 rows. Your manager insists on using a deep neural network. How do you push back?
- I would present evidence, not opinions. The empirical reality is that gradient-boosted trees (XGBoost, LightGBM, CatBoost) consistently match or outperform deep learning on tabular data, as demonstrated across hundreds of Kaggle competitions and recent benchmark papers (e.g., Grinsztajn et al. 2022, “Why do tree-based models still outperform deep learning on tabular data?”).
- The reasons are structural. Tabular data typically has heterogeneous features (mix of categorical and continuous), irregular feature interactions, and no spatial or temporal structure. Deep learning’s strengths — hierarchical feature learning, translation invariance, weight sharing — do not apply. Trees naturally handle feature heterogeneity and learn sharp decision boundaries that neural networks approximate poorly.
- With 10,000 rows and 50 features, a neural network is likely to overfit without aggressive regularization. XGBoost trains in seconds on data this size, can be explained to stakeholders with SHAP values, and usually needs far less hyperparameter tuning.
- My recommendation: start with XGBoost as a strong baseline, measure its performance carefully, and only explore neural approaches if the baseline is insufficient and there is a clear hypothesis for why depth would help (e.g., complex feature interactions that trees miss).
Why do we normalize input data before training a neural network? What goes wrong if we skip this step?
- Normalization (scaling inputs to mean 0, standard deviation 1) ensures that all features contribute roughly equally to the gradient updates. Without normalization, features with large magnitudes dominate the loss landscape, creating elongated elliptical contours that cause gradient descent to oscillate and converge slowly.
- Geometrically, unnormalized data creates an ill-conditioned optimization problem. If feature A ranges from 0-1000 and feature B ranges from 0-1, the loss landscape is stretched along the A-axis. The optimal learning rate for A is far too small for B and vice versa. Normalization makes the landscape more spherical, allowing a single learning rate to work well for all parameters.
- Without normalization, activations in early layers can saturate (for sigmoid/tanh) or become very large (for ReLU), which causes vanishing gradients or numerical instability. The MNIST example normalizes with mean 0.1307 and std 0.3081 (precomputed dataset statistics) specifically to center the pixel distributions.
- In practice, normalization also makes the model less sensitive to the choice of learning rate and initialization, which speeds up the hyperparameter search process.
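The scaling described above (mean 0, standard deviation 1) can be sketched with the standard library alone. The two feature columns are hypothetical values chosen to mirror the 0-1000 versus 0-1 example, and `standardize` is an illustrative helper, not part of any library.

```python
import statistics


def standardize(values):
    """Rescale one feature column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical features on very different scales.
feature_a = [0.0, 250.0, 500.0, 750.0, 1000.0]  # ranges over 0-1000
feature_b = [0.0, 0.25, 0.5, 0.75, 1.0]  # ranges over 0-1

scaled_a = standardize(feature_a)
scaled_b = standardize(feature_b)

print([round(v, 3) for v in scaled_a])
print([round(v, 3) for v in scaled_b])
```

After scaling, the two columns are numerically identical, so neither dominates the gradient. In a real pipeline the mean and standard deviation should be computed on the training split only and then reused for validation and test data.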
You train a neural network and get 99% training accuracy but only 72% test accuracy. Diagnose the problem and propose a systematic fix.
- This is textbook overfitting: the model has memorized the training data rather than learning generalizable patterns. The 27-point gap between train and test accuracy is the key diagnostic signal.
- Systematic approach, in order of impact and ease of implementation:
  - Data augmentation (highest impact, no model changes): for images, add random crops, flips, color jitter, CutMix/MixUp. This effectively multiplies the dataset size and forces the model to learn invariant features rather than memorize specific examples.
  - Regularization: add dropout (0.3-0.5 for dense layers), increase weight decay (try 0.01-0.1 with AdamW), and consider label smoothing (epsilon=0.1).
  - Reduce model capacity: the model may be too large for the dataset. Try fewer layers, fewer neurons per layer, or a simpler architecture. A model that barely fits the training data will generalize better than one that memorizes it effortlessly.
  - Early stopping: monitor validation loss and stop training when it starts increasing. This is cheap to implement and consistently helps.
  - Get more data: if feasible, this is the most reliable long-term solution. More diverse training examples directly address the generalization gap.
- I would NOT start by changing the optimizer or learning rate: those mainly affect how well training converges, and convergence is not the problem here. The diagnosis points specifically to a capacity/data mismatch.
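Of the fixes listed above, early stopping is the easiest to show in code. The sketch below works on a list of per-epoch validation losses; the `patience` parameter and the example loss curve are assumptions for illustration, since the text only says to stop when validation loss starts increasing.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; best_epoch is the checkpoint worth restoring.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    # The loss never stalled for long enough: training ran to completion.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation curve: improves, then overfitting sets in.
val_losses = [0.90, 0.62, 0.48, 0.41, 0.43, 0.45, 0.47, 0.52]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

Using a patience of a few epochs rather than stopping at the first increase avoids reacting to ordinary noise in the validation loss.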