The Deep Learning Landscape

The Timeline That Changed Everything

Let’s start with some perspective. Here’s what happened:
| Year | Breakthrough | Impact |
|------|--------------|--------|
| 1958 | Perceptron | First learning machine (couldn’t solve XOR) |
| 1986 | Backpropagation | Training multi-layer networks became possible |
| 2006 | Deep Belief Networks | Showed deep networks could be trained |
| 2012 | AlexNet | Won ImageNet by huge margin, started the revolution |
| 2014 | GANs | Generating realistic images |
| 2015 | ResNet | 152-layer networks that actually train |
| 2017 | Transformer | Attention is all you need |
| 2018 | BERT | Language understanding breakthrough |
| 2020 | GPT-3 | Few-shot learning at scale |
| 2022 | ChatGPT | AI goes mainstream |
| 2023 | GPT-4 | Multimodal reasoning |
| 2024 | Sora | Video generation from text |
The common thread: from AlexNet onward, nearly every breakthrough combined deeper networks, more data, and more compute, built on algorithmic ideas such as backpropagation and attention.
🔗 Connection: The methods you’ll learn in this course — backpropagation, attention, normalization — are the exact techniques powering these breakthroughs. We’re not teaching theory for theory’s sake; we’re teaching the building blocks of modern AI.

Deep Learning vs. Machine Learning

Let’s be precise about what we mean:
[Figure: ML vs Deep Learning]

Traditional Machine Learning

Think of traditional ML like hiring an expert art appraiser. The appraiser (you, the engineer) decides what features matter — brush stroke width, color palette, canvas texture — and manually measures each one. The ML model then learns patterns from those measurements. If you missed a critical feature, tough luck.
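# Traditional ML: YOU design the features
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-ins for real feature extractors (hypothetical helpers, not a library API)
def count_edges(image, threshold=0.2):
    gray = image.mean(axis=-1)                   # collapse RGB to grayscale
    gy, gx = np.gradient(gray)                   # intensity change along rows and columns
    return np.sum(np.hypot(gx, gy) > threshold)  # number of pixels with a strong gradient

def color_histogram(image, bins=8):
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()                     # normalized histogram, pooled over channels

# Feature engineering (manual) -- this is where the real work lives
def extract_features(image):
    features = []
    features.append(np.mean(image))  # brightness -- simple but lossy
    features.append(np.std(image))   # contrast -- captures variation
    features.append(count_edges(image))  # edges -- requires domain knowledge
    features.extend(color_histogram(image))  # colors -- you hope this matters (extend: one entry per bin)
    # ... 100 more hand-crafted features -- weeks of domain expert time
    return np.array(features)

# Train on hand-crafted features
# `images` (H x W x 3 float arrays in [0, 1]) and `labels` are assumed to be loaded already
X = np.array([extract_features(img) for img in images])
model = RandomForestClassifier()
model.fit(X, labels)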
Problems:
  • Feature engineering is time-consuming
  • Requires domain expertise
  • Features may not capture what matters
  • Doesn’t scale to complex patterns

Deep Learning

Deep learning is like hiring an apprentice who figures out what matters on their own. You show them thousands of paintings labeled “Monet” or “not Monet,” and they discover — without any instruction — that brush stroke patterns, color palettes, and light diffusion are the distinguishing features. No domain expert required.
# Deep Learning: The network LEARNS the features
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        # No hand-crafted features -- the network discovers what matters
        self.conv1 = nn.Conv2d(3, 32, 3)   # Layer 1: tends to learn edge detectors
        self.conv2 = nn.Conv2d(32, 64, 3)  # Layer 2: tends to learn shape detectors
        self.conv3 = nn.Conv2d(64, 128, 3) # Layer 3: tends to learn object-part detectors
        self.fc = nn.Linear(128 * 4 * 4, 10)  # Final: maps features to 10 classes (4x4 assumes 32x32 inputs)

    def forward(self, x):
        # 2x2 max-pooling after the first two convs shrinks a 32x32 input to 4x4 (assumed input size)
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # learns edges automatically from data
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # combines edges into shapes
        x = F.relu(self.conv3(x))  # combines shapes into object parts
        return self.fc(x.flatten(1))  # classifies based on learned features

# Train end-to-end -- just give it raw pixels and labels
model = CNN()
# The network figures out which features matter. That is the revolution.
Benefits:
  • Learns features automatically
  • Scales to complex patterns
  • Transfers across tasks
  • State-of-the-art performance
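The `128 * 4 * 4` input size of the final linear layer in the CNN sketch only holds for one particular input size and amount of downsampling. One way to arrive at it is shown below, assuming 32×32 inputs and a 2×2 max-pool after each of the first two convolutions. Both are assumptions for illustration, not something the text states, and `conv_out` is a hypothetical helper.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer (square inputs)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                       # assumed input: 32x32 RGB image
size = conv_out(size, 3)        # conv1, 3x3 kernel      -> 30
size = conv_out(size, 2, 2)     # assumed 2x2 max-pool   -> 15
size = conv_out(size, 3)        # conv2, 3x3 kernel      -> 13
size = conv_out(size, 2, 2)     # assumed 2x2 max-pool   -> 6
size = conv_out(size, 3)        # conv3, 3x3 kernel      -> 4
flat_features = 128 * size * size

print(size, flat_features)      # 4 2048
```

With a different input size or no pooling, the flattened size changes and the linear layer has to be resized to match.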

When to Use What

| Scenario | Best Choice | Why |
|----------|-------------|-----|
| Small dataset (<1000 samples) | Traditional ML | Deep learning overfits |
| Tabular data | Traditional ML (XGBoost) | Often beats deep learning |
| Images, audio, text | Deep Learning | Hierarchical patterns |
| Limited compute | Traditional ML | Deep learning is expensive |
| Need interpretability | Traditional ML | Deep learning is a “black box” |
| Massive data available | Deep Learning | Benefits from scale |
Don’t be a “deep learning hammer”: Deep learning isn’t always the answer. Gradient boosting (XGBoost, LightGBM) still often wins on tabular data. Understand your problem before reaching for neural networks.
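In practice, "understand your problem first" often means fitting a couple of cheap baselines before any neural network. The sketch below is one way to do that with scikit-learn. The synthetic dataset from `make_classification` is an assumption standing in for a real tabular problem, and scikit-learn's `GradientBoostingClassifier` is used in place of XGBoost or LightGBM only to keep the example dependency-light.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a small tabular dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)

baselines = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in baselines.items():
    # 3-fold cross-validated accuracy: cheap to run before reaching for a neural network
    scores[name] = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```

Whatever a later neural network achieves should be judged against these numbers, not in isolation.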

The Deep Learning Ecosystem

Major Application Domains

Computer Vision

  • Image classification
  • Object detection (YOLO, Faster R-CNN)
  • Segmentation
  • Face recognition
  • Medical imaging
  • Autonomous vehicles

Natural Language Processing

  • Text classification
  • Machine translation
  • Question answering
  • Summarization
  • Chatbots (ChatGPT)
  • Code generation (Copilot)

Speech & Audio

  • Speech recognition (Whisper)
  • Text-to-speech
  • Music generation
  • Audio classification
  • Voice cloning

Generative AI

  • Image generation (DALL-E, Stable Diffusion)
  • Video generation (Sora)
  • 3D model generation
  • Code generation
  • Drug discovery

The Architecture Zoo

| Architecture | Domain | Key Idea |
|--------------|--------|----------|
| CNN (1998) | Vision | Local patterns with convolutions |
| RNN/LSTM (1997) | Sequences | Memory for temporal dependencies |
| Transformer (2017) | Everything | Attention over all positions |
| GAN (2014) | Generation | Adversarial training |
| VAE (2013) | Generation | Probabilistic latent space |
| Diffusion (2020) | Generation | Iterative denoising |
| Graph NN (2017) | Graphs | Message passing on structure |
The Transformer Takeover: Transformers have largely replaced RNNs for sequences and are increasingly competing with CNNs for vision (Vision Transformer, ViT). By the end of this course, you’ll understand why.
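To make "attention over all positions" concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single sequence. It is a simplification: real Transformers add learned query/key/value projections, multiple heads, and masking, all omitted here. The toy sizes are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(q k^T / sqrt(d)) v for one sequence; q, k, v have shape (seq_len, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                          # (seq_len, seq_len): each position vs. every position
    scores = scores - scores.max(axis=-1, keepdims=True)   # subtract row max for a stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))     # 5 positions, 8-dim embeddings (toy sizes)
out, weights = scaled_dot_product_attention(x, x, x)  # self-attention without learned projections

print(out.shape, weights.shape)  # (5, 8) (5, 5)
```

Each output row is a weighted average of all positions, which is what lets a Transformer relate distant tokens in a single step.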

Key Concepts Overview

Before we dive into details, here’s a map of what you’ll learn:

The Learning Process

The training loop is the heartbeat of deep learning. It follows a simple cycle that repeats millions of times: guess, check, adjust.
1. FORWARD PASS (make a prediction)
   Input → [Layer 1] → [Layer 2] → ... → [Layer N] → Prediction
   
2. LOSS COMPUTATION (measure how wrong we are)
   Compare Prediction vs. Ground Truth → Loss Value (a single number)
   
3. BACKWARD PASS (figure out who is responsible for the error)
   Compute gradients of loss w.r.t. each parameter using backpropagation
   
4. PARAMETER UPDATE (adjust to do better next time)
   parameters = parameters - learning_rate × gradients
   
5. REPEAT for all data, many epochs (one epoch = one pass through all data)
Think of it like tuning a guitar by ear. You pluck a string (forward pass), hear how off it sounds (loss), figure out which direction to turn the peg (backward pass), and make a small adjustment (parameter update). You repeat until it sounds right.
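The five steps above can be sketched without any framework. The example below fits a one-variable linear model with hand-derived gradients. The toy data (`y ≈ 3x + 0.5`), the learning rate, and the epoch count are assumptions chosen so the loop converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.normal(size=200)   # toy data: y is roughly 3x + 0.5

w, b = 0.0, 0.0          # parameters, deliberately started far from the answer
learning_rate = 0.1
losses = []

for epoch in range(100):
    pred = w * X[:, 0] + b                 # 1. forward pass
    error = pred - y
    loss = np.mean(error ** 2)             # 2. loss (mean squared error)
    grad_w = 2 * np.mean(error * X[:, 0])  # 3. backward pass: d(loss)/dw ...
    grad_b = 2 * np.mean(error)            #    ... and d(loss)/db, derived by hand
    w -= learning_rate * grad_w            # 4. parameter update
    b -= learning_rate * grad_b
    losses.append(loss)                    # 5. repeat

print(f"w={w:.2f}, b={b:.2f}, final loss={losses[-1]:.4f}")
```

In a deep network, step 3 is what backpropagation automates: the gradients are computed for every parameter instead of being derived by hand.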

What Makes Deep Networks Work

| Component | What It Does | Analogy |
|-----------|--------------|---------|
| Layers | Transform data step by step | Assembly line workers |
| Weights | Learnable parameters | Worker’s skill levels |
| Activations | Non-linear functions | Decision gates |
| Loss | Measures error | Quality inspector |
| Optimizer | Updates weights | Manager adjusting workers |
| Backprop | Computes gradients | Feedback mechanism |
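The "Activations" row is easy to underestimate. A small NumPy check, using random toy matrices, shows that two linear layers without a non-linearity are exactly one linear layer, and that inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # batch of 4 inputs, 3 features
W1 = rng.normal(size=(3, 5))     # "layer 1" weights
W2 = rng.normal(size=(5, 2))     # "layer 2" weights

# Two linear layers with no activation equal one linear layer with weights W1 @ W2.
two_linear = (x @ W1) @ W2
one_linear = x @ (W1 @ W2)
collapses = np.allclose(two_linear, one_linear)

# With a ReLU in between, the result is no longer that single linear map of x.
with_relu = np.maximum(x @ W1, 0) @ W2
differs = not np.allclose(with_relu, one_linear)

print(collapses, differs)
```

Depth only adds expressive power when something non-linear sits between the layers.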

Your First Neural Network

Let’s build a simple network to classify handwritten digits (MNIST). This is the “Hello World” of deep learning — simple enough to understand completely, but real enough to teach you the full training pipeline.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# 1. LOAD DATA
# transforms.Compose chains preprocessing steps -- order matters!
transform = transforms.Compose([
    transforms.ToTensor(),              # Convert PIL image to tensor (also scales 0-255 to 0-1)
    transforms.Normalize((0.1307,), (0.3081,))  # Normalize with MNIST mean and std
    # Why normalize? It centers data around 0, which helps gradients flow evenly
])

train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)

# batch_size=64: process 64 images at once (a balance between speed and memory)
# shuffle=True for training: prevents the model from learning the order of examples
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)  # Larger batches for eval (no gradients stored)

# 2. DEFINE NETWORK
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()          # 28x28 image -> 784-dim vector
        self.fc1 = nn.Linear(28 * 28, 512)   # First hidden layer: compress to 512 features
        self.fc2 = nn.Linear(512, 256)        # Second hidden layer: compress further to 256
        self.fc3 = nn.Linear(256, 10)         # Output layer: 10 classes (digits 0-9)
        self.relu = nn.ReLU()                 # Non-linearity -- without this, the stacked linear layers collapse into one linear map
        self.dropout = nn.Dropout(0.2)        # Randomly zero 20% of neurons during training to prevent overfitting
    
    def forward(self, x):
        x = self.flatten(x)                          # Flatten image to vector
        x = self.dropout(self.relu(self.fc1(x)))     # Layer 1: linear -> ReLU -> dropout
        x = self.dropout(self.relu(self.fc2(x)))     # Layer 2: same pattern
        return self.fc3(x)                           # Output: raw logits (no activation -- CrossEntropyLoss handles softmax)

model = SimpleNet()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# ~535K parameters -- tiny by modern standards, but enough for MNIST

# 3. SETUP TRAINING
criterion = nn.CrossEntropyLoss()  # Combines LogSoftmax + NLLLoss -- the standard for classification
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam adapts learning rate per-parameter

# 4. TRAINING LOOP -- this is the core rhythm of all deep learning
def train_epoch(model, loader, criterion, optimizer):
    model.train()  # Enable dropout and batch norm training behavior
    total_loss = 0
    correct = 0
    
    for batch_idx, (data, target) in enumerate(loader):
        optimizer.zero_grad()             # CRITICAL: clear old gradients (PyTorch accumulates by default)
        output = model(data)              # Forward pass: input -> prediction
        loss = criterion(output, target)  # How wrong are we? (single number)
        loss.backward()                   # Backward pass: compute gradient of loss w.r.t. every parameter
        optimizer.step()                  # Update parameters (plain SGD: param -= lr * grad; Adam rescales each step per parameter)
        
        total_loss += loss.item()
        pred = output.argmax(dim=1)       # Pick the class with highest score
        correct += pred.eq(target).sum().item()
    
    return total_loss / len(loader), 100. * correct / len(loader.dataset)

# 5. EVALUATION -- always evaluate on data the model has never seen
def evaluate(model, loader, criterion):
    model.eval()  # Disable dropout -- use all neurons for prediction
    total_loss = 0
    correct = 0
    
    with torch.no_grad():  # No gradient computation needed -- saves memory and speeds up inference
        for data, target in loader:
            output = model(data)
            total_loss += criterion(output, target).item()
            pred = output.argmax(dim=1)
            correct += pred.eq(target).sum().item()
    
    return total_loss / len(loader), 100. * correct / len(loader.dataset)

# 6. TRAIN!
for epoch in range(1, 11):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer)
    test_loss, test_acc = evaluate(model, test_loader, criterion)
    print(f"Epoch {epoch}: Train Acc: {train_acc:.2f}%, Test Acc: {test_acc:.2f}%")
Expected Output (exact numbers will vary from run to run):
Parameters: 535,818
Epoch 1: Train Acc: 93.82%, Test Acc: 96.51%
Epoch 2: Train Acc: 97.42%, Test Acc: 97.33%
...
Epoch 10: Train Acc: 99.12%, Test Acc: 98.15%
Congratulations! You just trained a neural network that’s 98% accurate at recognizing handwritten digits.

Understanding What Happened

Let’s break down what the network learned:

Visualizing Learned Features

import matplotlib.pyplot as plt

# Get first layer weights
weights = model.fc1.weight.data.cpu().numpy()

# Visualize some learned features
fig, axes = plt.subplots(4, 8, figsize=(16, 8))
for i, ax in enumerate(axes.flat):
    # Reshape weight to 28x28 image
    feature = weights[i].reshape(28, 28)
    ax.imshow(feature, cmap='RdBu', vmin=-0.3, vmax=0.3)
    ax.axis('off')
plt.suptitle("First Layer Learned Features")
plt.show()
You’ll typically see noisy, stroke-like templates rather than clean filters. Look for patterns such as:
  • Edges at different orientations
  • Curve detectors
  • Stroke patterns
This is the network discovering, on its own, that these patterns are useful for digit recognition!

What Each Layer Does

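| Layer | Input Shape | Output Shape | What It Learns |
|-------|-------------|--------------|----------------|
| fc1 | 784 (28×28) | 512 | Low-level patterns (edges, strokes) |
| fc2 | 512 | 256 | Mid-level combinations (curves, corners) |
| fc3 | 256 | 10 | Digit-specific patterns |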
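The shapes in the table also account for the `Parameters: 535,818` line printed by the training script. A fully connected layer has one weight per input-output pair plus one bias per output. `linear_params` below is a hypothetical helper that spells out that arithmetic.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

layers = {"fc1": (784, 512), "fc2": (512, 256), "fc3": (256, 10)}
counts = {name: linear_params(*shape) for name, shape in layers.items()}
total = sum(counts.values())

print(counts)   # {'fc1': 401920, 'fc2': 131328, 'fc3': 2570}
print(total)    # 535818
```

About three quarters of the parameters sit in `fc1`, because it is the layer connected to all 784 pixels.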

The Deep Learning Mindset

It’s All About Representations

The key insight: Deep learning is about learning good representations of your data.
Raw Pixels → [Layer 1: Edges] → [Layer 2: Shapes] → [Layer 3: Parts] → [Layer 4: Digits]
Each layer transforms the representation into something more useful for the final task.
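XOR, the problem the 1958 perceptron could not solve, is a compact illustration of why representations matter. No single linear boundary separates its classes in the raw inputs, but a two-unit hidden layer re-represents the points so that a linear readout works. The weights below are hand-picked for illustration, not learned.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor = np.array([0, 1, 1, 0])

# Hand-picked weights (an illustration, not learned): two hidden ReLU units.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])       # both units compute x1 + x2 ...
b_hidden = np.array([0.0, -1.0])        # ... the second one shifted down by 1
hidden = np.maximum(X @ W_hidden + b_hidden, 0)   # the new representation of each input

w_out = np.array([1.0, -2.0])           # plain linear readout on the hidden features
output = hidden @ w_out

print(hidden.tolist())   # [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [2.0, 1.0]]
print(output.tolist())   # [0.0, 1.0, 1.0, 0.0]
```

Training a network amounts to finding weights like these automatically, at a much larger scale.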

The Three Pillars

| Pillar | What It Means | How to Get It |
|--------|---------------|---------------|
| Data | More data = better models | Web scraping, data augmentation, synthetic data |
| Compute | More GPUs = larger models | Cloud computing, efficient architectures |
| Algorithms | Better architectures | Research, this course! |

Empirical Science

Deep learning is highly empirical. Unlike traditional algorithms where you can prove properties mathematically, deep learning requires:
  1. Experimentation: Try different architectures
  2. Ablation studies: Remove components to see what matters
  3. Hyperparameter tuning: Search for the best settings
  4. Visualization: Look at what your model learned
This is closer to chemistry than mathematics. You have theories about why things work, but at the end of the day, you run the experiment and see. This is not a weakness — it is the nature of learning systems that are too complex to analyze analytically.
Expect to iterate: Your first model will rarely be your best. Budget time for experimentation. A good rule of thumb: spend 20% of your time on the first working model and 80% on improving it. The first model tells you what is possible; iteration tells you what is achievable.
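Hyperparameter tuning is easiest to appreciate on a problem small enough to reason about. The sketch below runs gradient descent on `f(w) = w**2` with a handful of learning rates. The function, step count, and candidate values are all assumptions picked for illustration.

```python
def final_weight(learning_rate: float, steps: int = 20, w: float = 1.0) -> float:
    """Run gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2 * w                      # derivative of w**2
        w = w - learning_rate * grad      # each step multiplies w by (1 - 2 * learning_rate)
    return w

results = {lr: final_weight(lr) for lr in (0.01, 0.1, 0.5, 1.1)}
for lr, w in results.items():
    print(f"lr={lr}: |w| after 20 steps = {abs(w):.4g}")
```

The smallest rate barely moves, the middle ones converge, and 1.1 overshoots further on every step and diverges. Real loss surfaces are far messier, which is why the sweep has to be run rather than derived.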

Common Mistakes for Beginners

| Mistake | Why It’s Wrong | Better Approach |
|---------|----------------|-----------------|
| Jumping to deep learning | May not need it | Start with a baseline (logistic regression, random forest) |
| Not normalizing inputs | Unstable training | Normalize to mean=0, std=1 |
| Wrong loss function | Model won’t learn properly | Classification → Cross-entropy, Regression → MSE |
| Learning rate too high | Training diverges | Start with 0.001, reduce if unstable |
| Not enough data | Model overfits | Data augmentation, transfer learning |
| Training too long | Overfitting | Use early stopping based on validation loss |
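The table recommends cross-entropy for classification, and the MNIST script notes that `CrossEntropyLoss` combines log-softmax with negative log-likelihood. A NumPy sketch of that computation, with made-up logits, shows why the network's final layer can output raw scores. The function name is hypothetical and this is not the PyTorch implementation.

```python
import numpy as np

def cross_entropy_from_logits(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy for integer class targets, computed from raw logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))  # log-softmax
    return float(-log_probs[np.arange(len(targets)), targets].mean())         # negative log-likelihood

logits = np.array([[2.0, 0.5, -1.0],     # confident in class 0
                   [0.1, 0.2, 3.0]])     # confident in class 2
targets = np.array([0, 2])

loss = cross_entropy_from_logits(logits, targets)
uniform = cross_entropy_from_logits(np.zeros((2, 3)), targets)   # a model that knows nothing

print(f"{loss:.4f} {uniform:.4f}")
```

The uninformed model scores `log(3)`, the loss of guessing uniformly among three classes, which is a useful sanity check for the first few batches of any classifier.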

What’s Next

Now that you understand the landscape, we’ll dive into the fundamentals:

Module 2: Perceptrons & Multi-Layer Networks

Build neural networks from scratch. Understand exactly how neurons compute and connect.

Exercises

Modify the MNIST network above:
  1. What happens if you remove a hidden layer (connect fc1 directly to fc3, which must then take 512 inputs)?
  2. What if you make it deeper (add fc4)?
  3. What if you change the hidden layer sizes?
Track how accuracy changes with each modification.
Create a confusion matrix showing which digits the model confuses:
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from sklearn.metrics import confusion_matrix

# Collect all predictions
all_preds = []
all_targets = []
model.eval()
with torch.no_grad():
    for data, target in test_loader:
        pred = model(data).argmax(dim=1)
        all_preds.extend(pred.numpy())
        all_targets.extend(target.numpy())

cm = confusion_matrix(all_targets, all_preds)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
Which pairs of digits are most commonly confused? Why might that be?
Train a Random Forest on the same MNIST data and compare:
from sklearn.ensemble import RandomForestClassifier

# Flatten images for sklearn
X_train = train_data.data.numpy().reshape(-1, 784)
y_train = train_data.targets.numpy()
X_test = test_data.data.numpy().reshape(-1, 784)
y_test = test_data.targets.numpy()

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f"Random Forest Accuracy: {rf.score(X_test, y_test):.4f}")
How does it compare to the neural network? When might you prefer Random Forest?

Interview Deep-Dive

Strong Answer:
  • In traditional ML, a human expert designs features: computing edge histograms for images, TF-IDF vectors for text, or hand-crafted statistical summaries for time series. The model then learns a mapping from these fixed features to outputs. The quality of the model is fundamentally bottlenecked by the quality of the features — if you miss a critical feature, no amount of training will recover it.
  • In deep learning, the network learns its own features through hierarchical representation learning. Early layers discover low-level patterns (edges, character n-grams), middle layers compose these into higher-level features (shapes, phrases), and late layers form task-specific representations (object categories, sentiment). The features themselves are optimized end-to-end for the task.
  • This distinction matters because representation learning scales to modalities where human feature engineering is impractical. No human can design the right features for recognizing 10,000 object categories or understanding arbitrary natural language. The network discovers features that humans would never think to engineer — and often outperform hand-crafted alternatives by large margins.
  • The trade-off: deep learning’s learned representations require substantially more data and compute. With 500 labeled examples, a carefully engineered feature set plus a linear model will usually beat a neural network that must learn everything from scratch.
Follow-up: Can you think of a scenario where hand-crafted features combined with deep learning outperforms either alone?
This is common in practice. In medical imaging, radiologists’ domain knowledge (e.g., tissue density features, geometric ratios) can be concatenated with CNN-learned features before the classification head. The domain features provide a strong inductive bias that helps with small datasets, while the learned features capture patterns the expert missed. Similarly, in NLP, linguistic features (POS tags, dependency parses) combined with transformer embeddings can improve performance on tasks like relation extraction where structural information matters.
Strong Answer:
  • I would present evidence, not opinions. The empirical reality is that gradient-boosted trees (XGBoost, LightGBM, CatBoost) consistently match or outperform deep learning on tabular data, as demonstrated across hundreds of Kaggle competitions and recent benchmark papers (e.g., Grinsztajn et al. 2022, “Why do tree-based models still outperform deep learning on tabular data?”).
  • The reasons are structural. Tabular data typically has heterogeneous features (mix of categorical and continuous), irregular feature interactions, and no spatial or temporal structure. Deep learning’s strengths — hierarchical feature learning, translation invariance, weight sharing — do not apply. Trees naturally handle feature heterogeneity and learn sharp decision boundaries that neural networks approximate poorly.
  • With 10,000 rows and 50 features, a neural network is likely to overfit without aggressive regularization. XGBoost will train in seconds, is trivially interpretable via SHAP values for stakeholder communication, and requires far less hyperparameter tuning.
  • My recommendation: start with XGBoost as a strong baseline, measure its performance carefully, and only explore neural approaches if the baseline is insufficient and there is a clear hypothesis for why depth would help (e.g., complex feature interactions that trees miss).
Follow-up: Are there recent architectures that challenge the “trees beat NNs on tabular” narrative?
Yes. TabNet, FT-Transformer, and TabPFN have shown competitive or superior results on certain tabular benchmarks. FT-Transformer applies self-attention over individual features, treating each feature as a token, which captures complex feature interactions. However, the gains are often marginal (1-2% accuracy) while training cost and complexity increase significantly. For a 10,000-row churn problem, the engineering overhead of these approaches is rarely justified. The practical answer remains: start with trees, explore neural approaches only if the problem demands it.
Strong Answer:
  • Normalization (scaling inputs to mean 0, standard deviation 1) ensures that all features contribute roughly equally to the gradient updates. Without normalization, features with large magnitudes dominate the loss landscape, creating elongated elliptical contours that cause gradient descent to oscillate and converge slowly.
  • Geometrically, unnormalized data creates an ill-conditioned optimization problem. If feature A ranges from 0-1000 and feature B ranges from 0-1, the loss landscape is stretched along the A-axis. The optimal learning rate for A is far too small for B and vice versa. Normalization makes the landscape more spherical, allowing a single learning rate to work well for all parameters.
  • Without normalization, activations in early layers can saturate (for sigmoid/tanh) or become very large (for ReLU), which causes vanishing gradients or numerical instability. The MNIST example normalizes with mean 0.1307 and std 0.3081 (precomputed dataset statistics) specifically to center the pixel distributions.
  • In practice, normalization also makes the model less sensitive to the choice of learning rate and initialization, which speeds up the hyperparameter search process.
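The conditioning argument above can be checked numerically. For a least-squares loss the curvature is proportional to `X^T X / n`, so its condition number measures how stretched the loss landscape is. The sketch uses synthetic features on the 0-1000 and 0-1 scales mentioned in the answer. The data and the helper name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_a = rng.uniform(0, 1000, size=500)   # large-scale feature
feature_b = rng.uniform(0, 1, size=500)      # small-scale feature
X = np.column_stack([feature_a, feature_b])

def curvature_condition_number(X: np.ndarray) -> float:
    """Condition number of X^T X / n, the curvature of a least-squares loss (up to a constant)."""
    return float(np.linalg.cond(X.T @ X / len(X)))

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize: mean 0, std 1 per feature

cond_raw = curvature_condition_number(X)
cond_norm = curvature_condition_number(X_norm)

print(f"raw: {cond_raw:.3g}, normalized: {cond_norm:.3g}")
```

A condition number near 1 means the contours are close to circular, so one learning rate suits both directions. The raw features are many orders of magnitude away from that.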
Follow-up: Batch normalization normalizes intermediate activations, not just inputs. Why is this helpful, and what problems does it introduce?
Batch normalization was originally motivated by internal covariate shift: the distribution of each layer’s inputs changes during training as weights in preceding layers update. Later analysis (Santurkar et al., 2018) suggests that much of the benefit comes from a smoother loss landscape rather than reduced covariate shift. Either way, normalizing activations within each mini-batch stabilizes training and allows higher learning rates. However, it introduces batch-size dependency: with small batches (under 8-16), batch statistics become noisy and unstable, degrading performance. This is one reason transformers use Layer Normalization (batch-independent) rather than BatchNorm. BatchNorm also behaves differently at train vs. eval time (using running statistics at eval), which is a common source of bugs when deploying models.
Strong Answer:
  • This is textbook overfitting: the model has memorized the training data rather than learning generalizable patterns. The 27-point gap between train and test accuracy is the key diagnostic signal.
  • Systematic approach, in order of impact and ease of implementation:
    • Data augmentation (highest impact, no model changes): for images, add random crops, flips, color jitter, CutMix/MixUp. This effectively multiplies the dataset size and forces the model to learn invariant features rather than memorize specific examples.
    • Regularization: add dropout (0.3-0.5 for dense layers), increase weight decay (try 0.01-0.1 with AdamW), and consider label smoothing (epsilon=0.1).
    • Reduce model capacity: the model may be too large for the dataset. Try fewer layers, fewer neurons per layer, or a simpler architecture. A model that barely fits the training data will generalize better than one that memorizes it effortlessly.
    • Early stopping: monitor validation loss and stop training when it starts increasing. This is cheap to implement and consistently helps.
    • Get more data: if feasible, this is the most reliable long-term solution. More diverse training examples directly address the generalization gap.
  • I would NOT start by changing the optimizer or learning rate — those affect convergence, not generalization. The diagnosis points specifically to a capacity/data mismatch.
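Of the remedies listed, early stopping is the simplest to sketch. The function below applies a patience rule to a sequence of validation losses. The function name, the patience value, and the made-up loss curve are all assumptions; a real training loop would also save a checkpoint whenever the validation loss improves.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), 0-indexed, using patience-based early stopping.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0   # a real loop would checkpoint the model here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation curve: improves, then rises as the model starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)

print(stop_epoch, best_epoch)   # 6 3
```

Training halts at epoch 6, and the weights worth keeping are the ones from epoch 3, which is why checkpointing the best model matters as much as stopping.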
Follow-up: How do you distinguish overfitting from a domain shift between your train and test sets?
Key diagnostic: hold out a validation split drawn from the same source as the training data. If accuracy is high on that split but low on the test set, or if the model fails on specific categories rather than uniformly, suspect domain shift; if the held-out split is also far below training accuracy, it is overfitting. Check whether train and test data come from the same distribution: plot feature histograms, compare class distributions, and visually inspect misclassified examples. Overfitting tends to produce scattered, random-looking errors; domain shift produces systematic errors (e.g., all nighttime images misclassified if training only had daytime images). The fix for domain shift is not more regularization; it is fixing the data pipeline or applying domain adaptation techniques.