Activation Functions
Why We Need Non-Linearity
Here’s a fundamental question: Why can’t we just stack linear transformations? Because the composition of linear maps is itself linear, a stack of purely linear layers collapses into a single linear layer, and no amount of depth lets it represent (see the short demonstration after this list):

- Curves in decision boundaries
- XOR logic
- Image features
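To make this concrete, here is a minimal PyTorch sketch (layer sizes chosen arbitrarily for illustration) showing that two stacked linear layers are exactly equivalent to a single linear layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers with no activation in between
f1 = nn.Linear(4, 8)
f2 = nn.Linear(8, 3)

# Collapse them into one equivalent layer:
# f2(f1(x)) = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
combined = nn.Linear(4, 3)
with torch.no_grad():
    combined.weight.copy_(f2.weight @ f1.weight)
    combined.bias.copy_(f2.weight @ f1.bias + f2.bias)

x = torch.randn(5, 4)
print(torch.allclose(f2(f1(x)), combined(x), atol=1e-6))  # True
```

A non-linear activation between the layers is exactly what breaks this collapse and gives depth its expressive power.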
The Classic Activations
Sigmoid
| Property | Value |
|---|---|
| Range | (0, 1) |
| Max gradient | 0.25 (at z=0) |
| Problem | Vanishing gradients for large \|z\| |
| Use case | Output layer for binary classification |
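As a quick check on the numbers above, here is a minimal NumPy sketch (function names are my own) of the sigmoid and its derivative:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)); squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative: sigma(z) * (1 - sigma(z)); peaks at 0.25 when z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the maximum gradient
print(sigmoid_grad(10.0))  # ~4.5e-05; gradients vanish for large |z|
```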
Tanh
| Property | Value |
|---|---|
| Range | (-1, 1) |
| Max gradient | 1 (at z=0) |
| Centered | Yes (zero-centered output) |
| Problem | Still vanishes for large \|z\| |
| Use case | Hidden layers in RNNs, older networks |
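The same spot check works for tanh (again, the helper name is mine):

```python
import numpy as np

def tanh_grad(z):
    # Derivative of tanh: 1 - tanh(z)^2; equals 1 at z = 0
    return 1.0 - np.tanh(z) ** 2

print(tanh_grad(0.0))   # 1.0, four times the peak gradient of sigmoid
print(tanh_grad(10.0))  # ~8.2e-09; still vanishes for large |z|
```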
ReLU: The Workhorse of Modern Deep Learning
| Advantage | Explanation |
|---|---|
| No vanishing gradient | Gradient is 1 for positive inputs |
| Sparse activation | Many neurons output 0 → efficient |
| Computationally simple | Just a max operation |
| Faster convergence | ~6x faster than tanh on CIFAR-10 (reported in the AlexNet paper) |
However, ReLU has a well-known failure mode, the "dying ReLU" problem (a small demonstration follows this list):

- If a neuron's inputs are always negative, its gradient is 0
- Neuron “dies” and never activates
- Solution: Leaky ReLU, PReLU, ELU
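A minimal sketch of the dying-ReLU effect (the large negative bias is artificial, chosen only to push the neuron into the dead regime):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A single ReLU neuron whose bias is pushed far negative,
# so its pre-activation is negative for any reasonable input.
neuron = nn.Linear(10, 1)
with torch.no_grad():
    neuron.bias.fill_(-100.0)

x = torch.randn(32, 10)
out = torch.relu(neuron(x)).sum()
out.backward()

# Every output is 0 and every gradient is 0: the neuron cannot recover.
print(out.item())                             # 0.0
print(neuron.weight.grad.abs().sum().item())  # 0.0
```

Because the gradient is exactly zero, no optimizer step will ever move the neuron out of this state; the leaky variants below keep a small slope on the negative side precisely to avoid it.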
ReLU Variants
Leaky ReLU
$$\text{LeakyReLU}(z) = \max(\alpha z,\ z)$$

where $\alpha$ is typically 0.01.

Parametric ReLU (PReLU)

Same as Leaky ReLU, but $\alpha$ is learned during training.

Exponential Linear Unit (ELU)
- Smooth everywhere (differentiable at z=0)
- Pushes mean activation closer to 0
- More robust to the dying-neuron problem than ReLU, since negative inputs keep a small nonzero gradient
Comparison Plot
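A minimal matplotlib sketch for plotting ReLU against its variants (the slope values are the common defaults):

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-4, 4, 400)

activations = {
    "ReLU": np.maximum(0, z),
    "Leaky ReLU (alpha=0.01)": np.where(z > 0, z, 0.01 * z),
    "ELU (alpha=1)": np.where(z > 0, z, np.exp(z) - 1),
}

for name, y in activations.items():
    plt.plot(z, y, label=name)

plt.axhline(0, color="gray", linewidth=0.5)
plt.axvline(0, color="gray", linewidth=0.5)
plt.xlabel("z")
plt.ylabel("activation(z)")
plt.title("ReLU and its variants")
plt.legend()
plt.show()
```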
Modern Activations
GELU (Gaussian Error Linear Unit)
$$\text{GELU}(z) = z \cdot \Phi(z)$$

where $\Phi$ is the CDF of the standard normal distribution. Approximation:

$$\text{GELU}(z) \approx 0.5\,z\left(1 + \tanh\!\left[\sqrt{2/\pi}\left(z + 0.044715\,z^{3}\right)\right]\right)$$

- Used in BERT, GPT-2, GPT-3, and most transformers
- Smooth approximation of ReLU with probabilistic interpretation
- Better for NLP tasks
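A minimal sketch comparing PyTorch's exact GELU with the tanh approximation given above:

```python
import torch
import torch.nn.functional as F

z = torch.linspace(-4, 4, steps=81)

exact = F.gelu(z)                       # z * Phi(z), using the exact Gaussian CDF
approx = F.gelu(z, approximate="tanh")  # the tanh approximation shown above

# The two curves agree closely; the approximation was historically used
# where a fast erf implementation was not available.
print(torch.max(torch.abs(exact - approx)).item())
```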
Swish / SiLU
- Non-monotonic (dips slightly below 0)
- Self-gated (multiplication by sigmoid)
- Discovered by neural architecture search (Google Brain)
- Used in EfficientNet, MobileNet
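Swish/SiLU multiplies the input by its own sigmoid; a minimal sketch with the built-in PyTorch module:

```python
import torch
import torch.nn as nn

z = torch.linspace(-4, 4, steps=9)

silu = nn.SiLU()               # Swish with beta = 1, i.e. z * sigmoid(z)
manual = z * torch.sigmoid(z)  # the "self-gating" written out by hand

print(torch.allclose(silu(z), manual))  # True
print(silu(torch.tensor(-1.0)).item())  # about -0.269: the slight dip below zero
```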
Mish
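For reference, Mish gates the input by a smooth, saturating function of itself: z * tanh(softplus(z)). A minimal sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z = torch.linspace(-4, 4, steps=9)

mish = nn.Mish()                        # built into PyTorch (1.9+)
manual = z * torch.tanh(F.softplus(z))  # z * tanh(ln(1 + e^z))

print(torch.allclose(mish(z), manual, atol=1e-6))  # True
```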
Activation Functions for Output Layers
The output activation depends on your task:

| Task | Output Activation | Loss Function |
|---|---|---|
| Binary Classification | Sigmoid | Binary Cross-Entropy |
| Multi-Class Classification | Softmax | Categorical Cross-Entropy |
| Regression | None (Linear) | MSE |
| Multi-Label Classification | Sigmoid | Binary CE per label |
| Bounded Regression | Sigmoid/Tanh | MSE |
Softmax

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
- Outputs sum to 1 (probability distribution)
- Larger inputs get exponentially more probability
- Temperature scaling: dividing the logits by a temperature $T$, i.e. $\mathrm{softmax}(z_i / T)$, controls the sharpness of the distribution (see the sketch below)
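A minimal sketch of softmax with temperature scaling (the temperature values are arbitrary examples):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])

for T in (0.5, 1.0, 5.0):
    probs = F.softmax(logits / T, dim=-1)
    # Lower T gives a sharper (more confident) distribution; higher T flattens it.
    print(f"T={T}: {probs.numpy().round(3)}, sum={probs.sum().item():.2f}")
```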
Choosing the Right Activation
Decision Flowchart
Rules of Thumb
| Situation | Recommendation |
|---|---|
| Starting a new project | ReLU everywhere |
| RNNs/LSTMs | Tanh (traditional) |
| Transformers/BERT/GPT | GELU |
| EfficientNet/MobileNet | Swish |
| Dying neurons observed | Leaky ReLU or ELU |
| Very deep networks | ELU or SELU |
Implementation in PyTorch
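A minimal sketch of how the activations discussed in this chapter appear as built-in PyTorch modules, and how to drop one into a small MLP (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# The activations discussed above, as built-in PyTorch modules.
activations = {
    "relu": nn.ReLU(),
    "leaky_relu": nn.LeakyReLU(negative_slope=0.01),
    "prelu": nn.PReLU(),          # learnable negative slope
    "elu": nn.ELU(alpha=1.0),
    "gelu": nn.GELU(),
    "silu": nn.SiLU(),            # Swish
    "mish": nn.Mish(),
    "sigmoid": nn.Sigmoid(),
    "tanh": nn.Tanh(),
}

def make_mlp(act: nn.Module) -> nn.Sequential:
    # Hidden layers use the chosen activation; the output layer stays linear
    # so the logits can be fed to a loss such as nn.CrossEntropyLoss.
    # (Note: the same module instance is reused here; for PReLU you would
    # typically create one instance per layer so the slopes are independent.)
    return nn.Sequential(
        nn.Linear(784, 128), act,
        nn.Linear(128, 64), act,
        nn.Linear(64, 10),
    )

model = make_mlp(activations["gelu"])
print(model(torch.randn(2, 784)).shape)  # torch.Size([2, 10])
```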
Experiments: Which Activation Works Best?
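One way to set up such a comparison, sketched under assumptions (a toy circle-classification task, a small MLP, and arbitrarily chosen hyperparameters):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy non-linear task: is a 2-D point inside the unit circle?
X = torch.randn(2000, 2)
y = (X.pow(2).sum(dim=1) < 1.0).float().unsqueeze(1)

def train(act: nn.Module, epochs: int = 200) -> float:
    model = nn.Sequential(nn.Linear(2, 32), act, nn.Linear(32, 32), act, nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return ((model(X) > 0).float() == y).float().mean().item()

for name, act in [("ReLU", nn.ReLU()), ("Tanh", nn.Tanh()),
                  ("GELU", nn.GELU()), ("SiLU", nn.SiLU())]:
    print(f"{name}: train accuracy = {train(act):.3f}")
```

On a toy problem like this, all four activations reach similar accuracy; the differences discussed above mostly show up in deeper networks and harder tasks, so treat this harness as a template rather than a conclusive benchmark.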
Exercises
Exercise 1: Implement All Activations
Implement these activation functions and their derivatives from scratch:
- SELU (Scaled ELU)
- Softplus
- Hardswish
Exercise 2: Gradient Flow Analysis
For a 10-layer network, compute and plot the gradient magnitude at each layer for:
- Sigmoid activation
- ReLU activation
- GELU activation
Exercise 3: Activation Visualization
Create an interactive visualization showing how different activations transform the output space of a 2D network. Use contour plots to show the decision boundary.
Exercise 4: Custom Activation
Design your own activation function that:
- Is non-linear
- Is differentiable everywhere
- Doesn’t have vanishing gradients
- Is bounded below (like ReLU)
Key Takeaways
| Activation | Best For | Avoid When |
|---|---|---|
| ReLU | Default choice, hidden layers | Dying neuron problem |
| Leaky ReLU | When neurons die | (Generally safe) |
| GELU | Transformers, NLP | Simple networks |
| Swish/SiLU | Efficient architectures | (Generally safe) |
| Sigmoid | Binary output | Hidden layers |
| Softmax | Multi-class output | Hidden layers |
| Tanh | RNN gates | Deep networks |