Activation Functions
Why We Need Non-Linearity
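Here’s a fundamental question: why can’t we just stack linear transformations? Because a composition of linear maps is itself a linear map, a network without non-linear activations can never represent:
- Curves in decision boundaries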
- XOR logic
- Image features
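The reason stacking does not help can be checked numerically. The following is a minimal sketch in plain Python; the weights `W1`, `b1`, `W2`, `b2` and the helper names are arbitrary illustrative choices, not taken from any real model. Two linear layers applied in sequence give the same outputs as a single linear layer with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Illustrative weights for two stacked linear layers (no activation in between).
W1 = [[1.0, -2.0], [0.5, 3.0]]
b1 = [0.1, -0.2]
W2 = [[2.0, 1.0], [-1.0, 4.0]]
b2 = [0.3, 0.0]

def two_linear_layers(x):
    hidden = add(matvec(W1, x), b1)
    return add(matvec(W2, hidden), b2)

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_linear_layer(x):
    return add(matvec(W, x), b)

for x in ([1.0, 2.0], [-3.0, 0.5]):
    print(two_linear_layers(x), one_linear_layer(x))
```

Up to floating-point rounding, both functions agree on every input, so depth without a non-linearity adds no expressive power.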
The Classic Activations
Sigmoid
| Property | Value |
|---|---|
| Range | (0, 1) |
| Max gradient | 0.25 (at z=0) |
| Problem | Vanishing gradients for large \|z\| |
| Use case | Output layer for binary classification |
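The properties in the table follow from the standard definition $\sigma(z) = 1 / (1 + e^{-z})$ and its derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. Here is a small sketch using only the standard library; the two-branch form is one common way to avoid overflow for large negative inputs, not the only one.

```python
import math

def sigmoid(z):
    """Sigmoid, written so that exp() never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.6f}  grad={sigmoid_grad(z):.6f}")
```

The gradient peaks at 0.25 for z=0 and shrinks toward zero as |z| grows, which is the vanishing-gradient problem listed above.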
Tanh
| Property | Value |
|---|---|
| Range | (-1, 1) |
| Max gradient | 1 (at z=0) |
| Centered | Yes (zero-centered output) |
| Problem | Still vanishes for large \|z\| |
| Use case | Hidden layers in RNNs, older networks |
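Tanh can be seen as a rescaled, shifted sigmoid: $\tanh(z) = 2\sigma(2z) - 1$, with derivative $1 - \tanh^2(z)$. The sketch below checks both facts numerically; the helper names are illustrative.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t

def tanh_via_sigmoid(z):
    """tanh expressed through sigmoid: 2 * sigmoid(2z) - 1."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:5.1f}  tanh={math.tanh(z):+.6f}  grad={tanh_grad(z):.6f}")
```

The maximum gradient of 1 at z=0 is four times sigmoid's 0.25, but the gradient still decays toward zero for large |z|.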
ReLU: The Workhorse of Modern Deep Learning
| Advantage | Explanation |
|---|---|
| No vanishing gradient | Gradient is 1 for positive inputs |
| Sparse activation | Many neurons output 0 → efficient |
| Computationally simple | Just a max operation |
| Faster convergence | About 6x faster than tanh units on CIFAR-10 (AlexNet paper) |
The dying ReLU problem:
- If a neuron’s pre-activations are always negative, its gradient is always 0
- Neuron “dies” and never activates — it becomes a permanent zero, wasting capacity
- This typically happens when the learning rate is too high, causing weights to overshoot into a region where all inputs produce negative pre-activations
- Solution: Leaky ReLU, PReLU, or ELU (all allow small gradients for negative inputs)
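To make the dead-neuron mechanism concrete, here is a toy sketch with a single neuron $y = \text{ReLU}(wx + b)$. The inputs and the "overshot" weights are hypothetical values chosen for illustration: once every pre-activation is negative, every weight gradient is exactly zero, so gradient descent cannot move the neuron back.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Derivative of ReLU; at exactly z = 0 we use 0 by convention.
    return 1.0 if z > 0 else 0.0

# Hypothetical toy data (all positive features) and weights after an
# imagined oversized update pushed them strongly negative.
inputs = [0.5, 1.0, 2.0]
w, b = -3.0, -0.1

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]
# Chain rule: dy/dw = relu'(w*x + b) * x
weight_grads = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]

print("pre-activations:", pre_activations)
print("outputs:        ", outputs)
print("dy/dw per input:", weight_grads)
```

Every entry of `weight_grads` is zero, so no update computed from this data can change `w` or `b`.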
ReLU Variants
Leaky ReLU
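Leaky ReLU keeps a small slope for negative inputs: $f(z) = z$ for $z > 0$ and $f(z) = \alpha z$ otherwise, where $\alpha$ is typically 0.01.
Parametric ReLU (PReLU)
Same as Leaky ReLU, but $\alpha$ is learned during training.
Exponential Linear Unit (ELU)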
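- Defined as $f(z) = z$ for $z > 0$ and $f(z) = \alpha(e^{z} - 1)$ otherwise
- Smooth everywhere for $\alpha = 1$ (differentiable at z=0)
- Pushes mean activation closer to 0
- More robust to dying neurons than ReLU, since negative inputs keep a non-zero gradient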
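These variants differ from ReLU only in how they treat negative inputs. The sketch below uses the commonly cited definitions, with `alpha` defaults of 0.01 for Leaky ReLU and 1.0 for ELU as assumptions; PReLU has the same form as Leaky ReLU with `alpha` treated as a trainable parameter.

```python
import math

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

def elu(z, alpha=1.0):
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def elu_grad(z, alpha=1.0):
    return 1.0 if z > 0 else alpha * math.exp(z)

for z in (-3.0, -1.0, -0.1, 0.1, 1.0):
    print(
        f"z={z:5.1f}  leaky={leaky_relu(z):+.4f} (grad {leaky_relu_grad(z):.2f})"
        f"  elu={elu(z):+.4f} (grad {elu_grad(z):.4f})"
    )
```

Both gradients stay strictly positive for negative inputs, which is what lets a neuron recover instead of dying. With `alpha=1.0`, the ELU gradient approaches 1 from the left at z=0, matching the slope on the positive side.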
Comparison Plot
Modern Activations
GELU (Gaussian Error Linear Unit)
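$\text{GELU}(z) = z \cdot \Phi(z)$, where $\Phi$ is the standard normal CDF.
Approximation: $\text{GELU}(z) \approx 0.5z\left(1 + \tanh\left(\sqrt{2/\pi}\,(z + 0.044715z^3)\right)\right)$
- Used in BERT, GPT-2, GPT-3, and most transformers
- Smooth approximation of ReLU with a probabilistic interpretation: it multiplies the input $z$ by $\Phi(z)$, the probability that a standard normal random variable is less than $z$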
- The key difference from ReLU: GELU is smooth everywhere (differentiable at zero) and has a small negative region, which acts as a soft form of dropout — small negative inputs are slightly suppressed rather than hard-zeroed
- Empirically outperforms ReLU in NLP tasks, likely because the smooth gating better suits the continuous, high-dimensional representations in language models
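The exact form and the tanh approximation can be compared directly. This is a minimal sketch using `math.erf` for the standard normal CDF; the function names are illustrative.

```python
import math

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi written via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

for z in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(f"z={z:5.1f}  exact={gelu(z):+.6f}  tanh_approx={gelu_tanh(z):+.6f}")
```

The two columns agree closely, and the negative values for small negative inputs show the "small negative region" described above.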
Swish / SiLU
- Non-monotonic (dips slightly below 0)
- Self-gated: $\text{Swish}(z) = z \cdot \sigma(z)$, the input multiplied by its own sigmoid
- Discovered by neural architecture search (Google Brain)
- Used in EfficientNet and, in its hard-swish form, MobileNetV3
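The non-monotonic dip is easy to see numerically. The sketch below assumes the SiLU form $z \cdot \sigma(z)$ (Swish with its scaling parameter fixed to 1) and scans a few negative inputs.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(z):
    """Swish / SiLU: the input gated by its own sigmoid."""
    return z * sigmoid(z)

for z in (-5.0, -2.0, -1.278, -0.5, 0.0, 1.0):
    print(f"z={z:7.3f}  swish={swish(z):+.4f}")
```

The output first decreases as z moves from 0 into negative territory, bottoms out a little below z = -1, then climbs back toward 0 for very negative inputs.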
Mish
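The source gives no formula for Mish at this point, so the sketch below assumes the commonly cited definition $\text{Mish}(z) = z \cdot \tanh(\text{softplus}(z))$, with $\text{softplus}(z) = \ln(1 + e^{z})$ written in a numerically stable form.

```python
import math

def softplus(z):
    """Stable softplus: max(z, 0) + log(1 + exp(-|z|))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def mish(z):
    """Mish under the assumed definition z * tanh(softplus(z))."""
    return z * math.tanh(softplus(z))

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:5.1f}  mish={mish(z):+.4f}")
```

Like Swish, it is smooth, dips slightly below zero for negative inputs, and approaches the identity for large positive inputs.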
Activation Functions for Output Layers
The output activation depends on your task:

| Task | Output Activation | Loss Function |
|---|---|---|
| Binary Classification | Sigmoid | Binary Cross-Entropy |
| Multi-Class Classification | Softmax | Categorical Cross-Entropy |
| Regression | None (Linear) | MSE |
| Multi-Label Classification | Sigmoid | Binary CE per label |
| Bounded Regression | Sigmoid/Tanh | MSE |
Softmax
- Outputs sum to 1 (probability distribution) — this is what makes it valid as a set of class probabilities
- Larger inputs get exponentially more probability — softmax amplifies differences between logits
- Temperature scaling: $\text{softmax}(z_i / T)$ for controlling sharpness. $T \to 0$ makes it approach argmax (hard selection), $T \to \infty$ makes it uniform (complete uncertainty). This is why language model “temperature” controls creativity: lower temperature makes the model more deterministic, higher temperature makes it more exploratory
- Numerical stability: Always subtract `max(z)` before computing `exp(z)`. Without this, `exp(1000)` overflows to infinity. The math is identical, since $\text{softmax}(z) = \text{softmax}(z - c)$ for any constant $c$, but the numerics are night and day
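The temperature and stability points can be combined in one short function. This is a sketch in plain Python; the `temperature` parameter name is an illustrative choice. Note that in plain Python `math.exp(1000)` raises `OverflowError` rather than returning infinity, so the max-subtraction step is what makes large logits usable at all.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # shifting by a constant leaves the result unchanged
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 3.0]
print("T=1    :", softmax(logits))
print("T=0.01 :", softmax(logits, temperature=0.01))    # close to one-hot
print("T=1000 :", softmax(logits, temperature=1000.0))  # close to uniform
print("large  :", softmax([1000.0, 999.0]))             # no overflow
```

Lowering the temperature concentrates probability on the largest logit, while raising it flattens the distribution toward uniform.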
Choosing the Right Activation
Decision Flowchart
Rules of Thumb
| Situation | Recommendation |
|---|---|
| Starting a new project | ReLU everywhere |
| RNNs/LSTMs | Tanh (traditional) |
| Transformers/BERT/GPT | GELU |
| EfficientNet/MobileNet | Swish |
| Dying neurons observed | Leaky ReLU or ELU |
| Very deep networks | ELU or SELU |
Implementation in PyTorch
Experiments: Which Activation Works Best?
Exercises
Exercise 1: Implement All Activations
Implement these activation functions and their derivatives from scratch:
- SELU (Scaled ELU)
- Softplus
- Hardswish
Exercise 2: Gradient Flow Analysis
For a 10-layer network, compute and plot the gradient magnitude at each layer for:
- Sigmoid activation
- ReLU activation
- GELU activation
Exercise 3: Activation Visualization
Create an interactive visualization showing how different activations transform the output space of a 2D network. Use contour plots to show the decision boundary.
Exercise 4: Custom Activation
Design your own activation function that:
- Is non-linear
- Is differentiable everywhere
- Doesn’t have vanishing gradients
- Is bounded below (like ReLU)
Key Takeaways
| Activation | Best For | Avoid When |
|---|---|---|
| ReLU | Default choice, hidden layers | Dying neuron problem |
| Leaky ReLU | When neurons die | (Generally safe) |
| GELU | Transformers, NLP | Simple networks |
| Swish/SiLU | Efficient architectures | (Generally safe) |
| Sigmoid | Binary output | Hidden layers |
| Softmax | Multi-class output | Hidden layers |
| Tanh | RNN gates | Deep networks |
What’s Next
Module 5: Loss Functions & Objectives
Define what “learning” means mathematically — MSE, cross-entropy, contrastive loss, and more.
Interview Deep-Dive
Why did GELU replace ReLU as the default activation in transformers? What is the mathematical intuition behind it?
Strong Answer:
- GELU (Gaussian Error Linear Unit) is defined as $\text{GELU}(z) = z \cdot \Phi(z)$, where $\Phi$ is the standard normal CDF. Unlike ReLU, which makes a hard binary decision (pass or block), GELU makes a soft probabilistic decision: it multiplies the input by the probability that a standard normal random variable falls below it.
- The mathematical intuition: GELU smoothly interpolates between identity (for large positive inputs) and zero (for large negative inputs), with a smooth transition region near zero. Small negative inputs are slightly suppressed rather than hard-zeroed. This smooth gating acts as a form of stochastic regularization — it is effectively a deterministic approximation of randomly zeroing activations weighted by their magnitude.
- Why it works better in transformers: transformers process high-dimensional continuous representations where ReLU’s kink at zero (its derivative jumps from 0 to 1) can create problems. Because GELU’s gradient is smooth, small perturbations to inputs near zero produce only small changes in the gradient, which improves optimization stability. In the feed-forward sublayers, which are stacked many layers deep, this smoothness compounds.
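- Empirically, GELU is reported to outperform ReLU on NLP benchmarks by roughly 0.5-1%, which is significant at the scale of BERT and GPT. The tanh approximation $0.5z\left(1 + \tanh\left(\sqrt{2/\pi}\,(z + 0.044715z^3)\right)\right)$ is used in practice for computational efficiency.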
The 'dying ReLU' problem: what is it, how do you detect it in practice, and what are the best fixes?
Strong Answer:
- What it is: A ReLU neuron “dies” when its input is permanently negative for all training examples. Since ReLU’s gradient is zero for negative inputs, the neuron receives zero gradient and can never update its weights to recover. It becomes a constant-zero output, permanently wasting model capacity.
- How it happens: typically caused by a learning rate that is too high early in training. A large gradient update pushes a neuron’s weights into a region where the pre-activation is negative for all inputs in the training set. Once in this state, zero gradient means zero updates, creating an irreversible death.
- How to detect it: Monitor the fraction of active (non-zero) activations across a batch: `(activations > 0).float().mean()`. Healthy layers should have 40-60% active neurons. If a layer drops below 20%, you have significant dying ReLU. You can also check for parameters with zero gradient norm.
- Best fixes:
  - Leaky ReLU ($\alpha = 0.01$): guarantees a small non-zero gradient for negative inputs, allowing dead neurons to gradually recover. Minimal computational overhead.
  - He initialization: sets weight variance to $2 / n_{\text{in}}$ (where $n_{\text{in}}$ is the number of inputs to the layer), specifically calibrated for ReLU to prevent activations from collapsing to zero from the start.
  - Lower learning rate or warmup: prevents the large early updates that push neurons into the dead zone.
  - Batch normalization before ReLU: keeps pre-activations centered near zero, ensuring roughly half are positive.
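The detection step can be sketched without any framework. The example below is a plain-Python analogue of the `(activations > 0).float().mean()` check; the `batch` values and function names are hypothetical. It reports both the overall active fraction and the neurons that output zero for every example in the batch.

```python
def active_fraction(activations):
    """Fraction of strictly positive entries (rows = examples, columns = neurons)."""
    flat = [a for row in activations for a in row]
    return sum(1 for a in flat if a > 0) / len(flat)

def dead_neuron_indices(activations):
    """Indices of neurons whose post-ReLU output is zero for every example."""
    n_neurons = len(activations[0])
    return [j for j in range(n_neurons) if all(row[j] <= 0 for row in activations)]

# Hypothetical post-ReLU activations: 3 examples, 4 neurons.
batch = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.4, 0.0, 0.0],
]

print("active fraction:", active_fraction(batch))
print("always-zero neurons in this batch:", dead_neuron_indices(batch))
```

A neuron that is silent on one batch is only a candidate. To call it dead, the same check would need to hold across the whole training set or across many batches.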
Why do we use different activation functions for hidden layers versus output layers? Walk through the design reasoning.