Activation Functions

Why We Need Non-Linearity

Here’s a fundamental question: Why can’t we just stack linear transformations?
# Two linear layers
def layer1(x):
    return W1 @ x + b1

def layer2(x):
    return W2 @ x + b2

# Composition of two linear functions...
def network(x):
    return layer2(layer1(x))
    # = W2 @ (W1 @ x + b1) + b2
    # = (W2 @ W1) @ x + (W2 @ b1 + b2)
    # = W_combined @ x + b_combined
The composition of linear functions is still linear! No matter how many linear layers you stack, you get a single linear transformation. You can’t learn complex patterns like:
  • Curves in decision boundaries
  • XOR logic
  • Image features
Activation functions add non-linearity, enabling networks to learn arbitrarily complex functions. Here is the analogy: a linear network is like a chef who can only blend ingredients (linear combinations). No matter how many times you blend, you still have a smoothie. Activation functions are the cooking techniques — roasting, searing, fermenting — that transform ingredients into something qualitatively different. Without them, a 100-layer network is no more expressive than a single layer. With them, each layer creates new “flavors” that the next layer can build on.
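To make the collapse concrete, here is a minimal NumPy sketch that checks numerically that two stacked linear layers equal one combined layer, and that a ReLU in between breaks the equivalence. The shapes, the seed, and the batch size are arbitrary illustrative choices, not part of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Two linear layers with illustrative shapes: 4 -> 3 -> 2
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)
X = rng.standard_normal((4, 100))  # 100 input vectors stored as columns

# Stacked: apply layer 1, then layer 2
stacked = W2 @ (W1 @ X + b1[:, None]) + b2[:, None]

# Collapsed: a single linear layer with combined weights and bias
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
collapsed = W_combined @ X + b_combined[:, None]

print("stacked == collapsed:", np.allclose(stacked, collapsed))

# With a ReLU between the layers, no single linear layer reproduces the output
with_relu = W2 @ np.maximum(0, W1 @ X + b1[:, None]) + b2[:, None]
print("with ReLU == collapsed:", np.allclose(with_relu, collapsed))
```

The first comparison holds for any choice of weights; the second fails as soon as any hidden pre-activation is negative.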
[Figure: Linear vs Non-Linear Decision Boundaries]

The Classic Activations

Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-6, 6, 100)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, sigmoid(z), 'b-', linewidth=2)
axes[0].set_title('Sigmoid: σ(z) = 1/(1+e⁻ᶻ)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('σ(z)')
axes[0].grid(True)
axes[0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=0, color='gray', linestyle='--', alpha=0.5)

axes[1].plot(z, sigmoid_derivative(z), 'r-', linewidth=2)
axes[1].set_title('Sigmoid Derivative: σ\'(z) = σ(z)(1-σ(z))')
axes[1].set_xlabel('z')
axes[1].set_ylabel('σ\'(z)')
axes[1].grid(True)

plt.tight_layout()
plt.show()
Properties:
| Property | Value |
| --- | --- |
| Range | (0, 1) |
| Max gradient | 0.25 (at z = 0) |
| Problem | Vanishing gradients for large $\lvert z \rvert$ |
| Use case | Output layer for binary classification |
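The 0.25 ceiling on the derivative is what makes sigmoid risky in deep stacks, because backpropagation multiplies one such factor per layer. The sketch below deliberately ignores the weight matrices (an assumption made to isolate the activation's contribution) and multiplies the best-case derivative across a few hypothetical depths.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# Best case: every pre-activation sits exactly at z = 0, where the derivative peaks
best_case = sigmoid_derivative(0.0)
print("peak derivative:", best_case)

# Product of per-layer derivative factors for a few hypothetical depths
for depth in (1, 5, 10, 20):
    print(f"depth {depth:2d}: factor <= {best_case ** depth:.3e}")

# Away from zero the per-layer factor is far smaller than the peak
saturated = sigmoid_derivative(5.0)
print("derivative at z = 5:", saturated)
```

Even in the best case the activation alone shrinks the gradient by a factor of at least 4 per layer, so early layers in a deep sigmoid network receive very small updates.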

Tanh

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1$$
def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    return 1 - np.tanh(z)**2

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, tanh(z), 'b-', linewidth=2)
axes[0].set_title('Tanh: tanh(z)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('tanh(z)')
axes[0].grid(True)

axes[1].plot(z, tanh_derivative(z), 'r-', linewidth=2)
axes[1].set_title('Tanh Derivative: 1 - tanh²(z)')
axes[1].set_xlabel('z')
axes[1].set_ylabel('tanh\'(z)')
axes[1].grid(True)

plt.tight_layout()
plt.show()
Properties:
| Property | Value |
| --- | --- |
| Range | (-1, 1) |
| Max gradient | 1 (at z = 0) |
| Centered | Yes (zero-centered output) |
| Problem | Still vanishes for large $\lvert z \rvert$ |
| Use case | Hidden layers in RNNs, older networks |
Why tanh over sigmoid? Zero-centered outputs make optimization easier. When all inputs to a neuron are positive (as with sigmoid outputs from the previous layer), the gradients with respect to that neuron’s incoming weights all share the same sign, so in a single step those weights can only all increase or all decrease. The optimizer is forced into a zigzag path toward the optimum. Zero-centered outputs (tanh) let the gradient components have mixed signs, giving the optimizer a more direct route. In practice, this translates to faster convergence.
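The sign argument can be checked directly. For a single neuron with pre-activation z = w·x, the gradient of the loss with respect to w is (∂L/∂z)·x, so if every component of x is positive, every component of the gradient carries the sign of ∂L/∂z. In this sketch the previous-layer values `h` and the upstream gradient `dL_dz` are made-up numbers used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(8)          # hypothetical pre-activations of the previous layer

x_sigmoid = 1 / (1 + np.exp(-h))    # sigmoid outputs: all in (0, 1)
x_tanh = np.tanh(h)                 # tanh outputs: same signs as h

dL_dz = -0.7                        # made-up upstream gradient for one neuron

# Gradient of the loss w.r.t. the neuron's incoming weights: dL/dw = dL/dz * x
grad_sigmoid = dL_dz * x_sigmoid
grad_tanh = dL_dz * x_tanh

print("sigmoid inputs:", np.sign(grad_sigmoid))  # every entry shares one sign
print("tanh inputs:   ", np.sign(grad_tanh))     # signs follow the signs of h
```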

ReLU: The Workhorse of Modern Deep Learning

$$\text{ReLU}(z) = \max(0, z)$$
def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, relu(z), 'b-', linewidth=2)
axes[0].set_title('ReLU: max(0, z)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('ReLU(z)')
axes[0].grid(True)

axes[1].plot(z, relu_derivative(z), 'r-', linewidth=2)
axes[1].set_title('ReLU Derivative')
axes[1].set_xlabel('z')
axes[1].set_ylabel('ReLU\'(z)')
axes[1].set_ylim(-0.1, 1.5)
axes[1].grid(True)

plt.tight_layout()
plt.show()
Why ReLU Changed Everything:
| Advantage | Explanation |
| --- | --- |
| No vanishing gradient | Gradient is 1 for positive inputs |
| Sparse activation | Many neurons output 0 → efficient |
| Computationally simple | Just a max operation |
| Faster convergence | About 6x faster than tanh on CIFAR-10 (AlexNet paper) |
The “Dying ReLU” Problem:
  • If inputs are always negative, gradient is 0
  • Neuron “dies” and never activates — it becomes a permanent zero, wasting capacity
  • This typically happens when the learning rate is too high, causing weights to overshoot into a region where all inputs produce negative pre-activations
  • Solution: Leaky ReLU, PReLU, or ELU (all allow small gradients for negative inputs)
Training pitfall: Dying ReLU is insidious because your loss might still decrease while the surviving neurons compensate. You only notice the problem when you realize your 512-neuron layer is effectively a 200-neuron layer. Monitor activation statistics during training: (activations > 0).float().mean() should sit roughly in the 0.4-0.6 range for a healthy layer, and a value that keeps falling well below that points to dying units.
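# Demonstration of dying ReLU
# A neuron is dead only if it outputs 0 for every input, so test a whole batch.
dead_neurons = 0
total_neurons = 1000
X = np.random.randn(200, 10)  # batch of 200 inputs

for _ in range(total_neurons):
    # Random weights plus a large negative bias (e.g. after an oversized update)
    w = np.random.randn(10) * 0.5
    b = -4.0
    # Dead on this batch: the output is 0 for all 200 inputs
    if np.all(relu(X @ w + b) == 0):
        dead_neurons += 1

print(f"Dead neurons: {dead_neurons}/{total_neurons} ({100*dead_neurons/total_neurons:.1f}%)")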
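The monitoring statistic from the training-pitfall note can be sketched in NumPy as follows. The helper name `activation_stats`, the layer sizes, and the bias values are assumptions made for illustration; in PyTorch the per-layer number is the `(activations > 0).float().mean()` expression quoted earlier.

```python
import numpy as np

def activation_stats(activations):
    """Summarize ReLU outputs of shape (batch, units)."""
    active = activations > 0
    return {
        # Fraction of all (sample, unit) pairs that fired
        "active_fraction": float(active.mean()),
        # Fraction of units that never fired on this batch (candidates for "dead")
        "dead_fraction": float((~active.any(axis=0)).mean()),
    }

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 10))
W = rng.standard_normal((10, 64)) * 0.5

healthy = np.maximum(0, X @ W)           # zero bias: about half the entries fire
shifted = np.maximum(0, X @ W - 100.0)   # extreme negative bias: nothing fires

print("healthy:", activation_stats(healthy))
print("shifted:", activation_stats(shifted))
```

Tracking both numbers is a deliberate choice here: a low active fraction could mean many units fire rarely, while a high dead fraction means specific units never fire at all on the batch.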

ReLU Variants

Leaky ReLU

$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$
where $\alpha$ is typically 0.01.
def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

Parametric ReLU (PReLU)

Same as Leaky ReLU, but $\alpha$ is learned during training.
import torch.nn as nn

# In PyTorch
prelu = nn.PReLU()  # α is learnable

Exponential Linear Unit (ELU)

$$\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}$$
def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))
Advantages of ELU:
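  • Smooth at z=0 when α=1 (the left and right derivatives both equal 1); for other α the slope jumps from α to 1
  • Pushes mean activation closer to 0
  • Keeps a non-zero gradient for negative inputs, so units are less prone to dying than with ReLU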
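Whether ELU is differentiable at zero depends on α. A quick finite-difference check compares the one-sided slopes at z = 0; this is only a sketch, and the helper `one_sided_slopes` and the step size `eps` are illustrative choices.

```python
import numpy as np

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def one_sided_slopes(f, eps=1e-6):
    """Finite-difference slopes of f just left and just right of zero."""
    left = (f(np.array(0.0)) - f(np.array(-eps))) / eps
    right = (f(np.array(eps)) - f(np.array(0.0))) / eps
    return float(left), float(right)

# ELU with alpha = 1: both slopes are close to 1, so there is no kink
print("ELU, alpha=1.0:", one_sided_slopes(lambda z: elu(z, alpha=1.0)))

# ELU with alpha = 0.5: left slope is close to 0.5, right slope is close to 1
print("ELU, alpha=0.5:", one_sided_slopes(lambda z: elu(z, alpha=0.5)))

# ReLU for comparison: left slope 0, right slope 1
print("ReLU:          ", one_sided_slopes(lambda z: np.maximum(0, z)))
```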

Comparison Plot

fig, ax = plt.subplots(figsize=(10, 6))

z = np.linspace(-3, 3, 100)

ax.plot(z, relu(z), 'b-', linewidth=2, label='ReLU')
ax.plot(z, leaky_relu(z, 0.1), 'g-', linewidth=2, label='Leaky ReLU (α=0.1)')
ax.plot(z, elu(z), 'r-', linewidth=2, label='ELU')
ax.plot(z, np.tanh(z), 'm--', linewidth=2, label='Tanh')

ax.set_xlabel('z')
ax.set_ylabel('Activation')
ax.set_title('Comparison of Activation Functions')
ax.legend()
ax.grid(True)
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)

plt.show()
[Figure: Activation Function Comparison]

Modern Activations

GELU (Gaussian Error Linear Unit)

$$\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$$
Approximation:
$$\text{GELU}(z) \approx 0.5z\left(1 + \tanh\left[\sqrt{2/\pi}\,(z + 0.044715z^3)\right]\right)$$
def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

# Or using scipy
from scipy.special import erf
def gelu_exact(z):
    return z * 0.5 * (1 + erf(z / np.sqrt(2)))
Why GELU?
  • Used in BERT, GPT-2, GPT-3, and most transformers
  • Smooth approximation of ReLU with a probabilistic interpretation: it scales the input by Φ(z), the probability that a standard normal random variable is less than or equal to z
  • The key difference from ReLU: GELU is smooth everywhere (differentiable at zero) and has a small negative region, which acts as a soft form of dropout — small negative inputs are slightly suppressed rather than hard-zeroed
  • Empirically outperforms ReLU in NLP tasks, likely because the smooth gating better suits the continuous, high-dimensional representations in language models
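To see how close the tanh approximation is to the exact definition, and where the small negative region lies, the following sketch compares the two on a grid. It uses `math.erf` from the standard library (wrapped with `np.vectorize`) instead of SciPy purely to stay dependency-free; the grid range and resolution are arbitrary.

```python
import math
import numpy as np

def gelu_tanh(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

# Exact form via the standard library's erf, vectorized so it accepts arrays
_erf = np.vectorize(math.erf)

def gelu_exact(z):
    return z * 0.5 * (1 + _erf(z / np.sqrt(2)))

z = np.linspace(-6, 6, 2001)
exact = gelu_exact(z)

max_abs_error = np.max(np.abs(gelu_tanh(z) - exact))
print(f"max |tanh approx - exact| on [-6, 6]: {max_abs_error:.2e}")

# GELU dips slightly below zero for moderately negative inputs
i_min = np.argmin(exact)
print(f"minimum of GELU: {exact[i_min]:.4f} at z = {z[i_min]:.2f}")
```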

Swish / SiLU

$$\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}$$
def swish(z):
    return z * sigmoid(z)
Properties:
  • Non-monotonic (dips slightly below 0)
  • Self-gated (multiplication by sigmoid)
  • Found by an automated search over candidate activation functions (Google Brain)
  • Used in EfficientNet; MobileNetV3 uses the cheaper hard-swish approximation
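The non-monotonic dip is easy to locate numerically. This sketch evaluates Swish on a dense grid (range and resolution are arbitrary choices) and reports where the minimum falls.

```python
import numpy as np

def swish(z):
    return z / (1 + np.exp(-z))

z = np.linspace(-6, 6, 12001)
values = swish(z)

# Locate the lowest point of the curve
i_min = np.argmin(values)
print(f"minimum {values[i_min]:.4f} at z = {z[i_min]:.3f}")

# Non-monotonic: the curve falls to the left of the minimum and rises to the right
diffs = np.diff(values)
print("decreasing somewhere:", bool((diffs < 0).any()))
print("increasing somewhere:", bool((diffs > 0).any()))
```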

Mish

$$\text{Mish}(z) = z \cdot \tanh(\text{softplus}(z)) = z \cdot \tanh(\ln(1 + e^z))$$
def mish(z):
    # np.logaddexp(0, z) computes softplus(z) = ln(1 + e^z) without overflow for large z
    return z * np.tanh(np.logaddexp(0, z))

Activation Functions for Output Layers

The output activation depends on your task:
| Task | Output Activation | Loss Function |
| --- | --- | --- |
| Binary Classification | Sigmoid | Binary Cross-Entropy |
| Multi-Class Classification | Softmax | Categorical Cross-Entropy |
| Regression | None (Linear) | MSE |
| Multi-Label Classification | Sigmoid | Binary CE per label |
| Bounded Regression | Sigmoid/Tanh | MSE |

Softmax

$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
def softmax(z):
    """Numerically stable softmax."""
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Example: 3-class classification
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Logits: {logits}")
print(f"Probabilities: {probs}")
print(f"Sum: {probs.sum():.4f}")  # Should be 1.0
Properties:
  • Outputs sum to 1 (probability distribution) — this is what makes it valid as a set of class probabilities
  • Larger inputs get exponentially more probability — softmax amplifies differences between logits
  • Temperature scaling: $\text{Softmax}(z/T)$ controls sharpness. $T \to 0$ makes it approach argmax (hard selection), $T \to \infty$ makes it uniform (complete uncertainty). This is why language model “temperature” controls creativity: lower temperature makes the model more deterministic, higher temperature makes it more exploratory
  • Numerical stability: Always subtract max(z) before computing exp(z). Without this, exp(1000) overflows to infinity. The math is identical, since $\text{softmax}(z) = \text{softmax}(z - c)$ for any constant $c$, but the numerics are night and day
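Temperature scaling and shift invariance can both be seen in a few lines. The `temperature` argument below is an illustrative extension of the stable softmax, not a signature taken from any library.

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Numerically stable softmax with an illustrative temperature parameter."""
    scaled = np.asarray(z, dtype=float) / temperature
    exp_z = np.exp(scaled - np.max(scaled, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])

# Low T sharpens toward the argmax, high T flattens toward uniform
for T in (0.1, 1.0, 10.0, 1000.0):
    print(f"T = {T:>6}: {np.round(softmax(logits, temperature=T), 3)}")

# Shift invariance: adding a constant to every logit leaves the output unchanged
print("shift invariant:", np.allclose(softmax(logits), softmax(logits + 1000.0)))

# Large logits stay finite because the max is subtracted before exp
print("large logits:", softmax(np.array([1000.0, 999.0])))
```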

Choosing the Right Activation

Decision Flowchart

Is it the OUTPUT layer?
├── Yes
│   ├── Binary classification → Sigmoid
│   ├── Multi-class classification → Softmax
│   ├── Regression → Linear (None)
│   └── Bounded regression → Sigmoid/Tanh

└── No (HIDDEN layers)
    ├── Default choice → ReLU
    ├── Dying ReLU problem → Leaky ReLU / ELU
    ├── Transformer architecture → GELU
    ├── Mobile/efficient networks → Swish
    └── Experimental → Mish

Rules of Thumb

| Situation | Recommendation |
| --- | --- |
| Starting a new project | ReLU everywhere |
| RNNs/LSTMs | Tanh (traditional) |
| Transformers/BERT/GPT | GELU |
| EfficientNet/MobileNet | Swish |
| Dying neurons observed | Leaky ReLU or ELU |
| Very deep networks | ELU or SELU |

Implementation in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

# All available as modules
class ActivationDemo(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)
        
        # As modules
        self.relu = nn.ReLU()
        self.leaky_relu = nn.LeakyReLU(0.01)
        self.elu = nn.ELU(alpha=1.0)
        self.gelu = nn.GELU()
        self.silu = nn.SiLU()  # Same as Swish
        self.mish = nn.Mish()
        self.prelu = nn.PReLU()  # Learnable
    
    def forward(self, x):
        # Can also use functional versions
        x = F.relu(self.fc(x))
        x = F.gelu(self.fc(x))
        x = F.silu(self.fc(x))
        return x


# Print available activations
print("PyTorch activation modules:")
for name in dir(nn):
    obj = getattr(nn, name)
    # Activation classes are defined in torch.nn.modules.activation
    if isinstance(obj, type) and issubclass(obj, nn.Module) and 'activation' in obj.__module__:
        print(f"  nn.{name}")

Experiments: Which Activation Works Best?

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Generate synthetic data
torch.manual_seed(42)
X = torch.randn(1000, 20)
y = (X[:, :5].sum(dim=1) > 0).float().unsqueeze(1)

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

def create_network(activation_class):
    """Create a network with specified activation."""
    return nn.Sequential(
        nn.Linear(20, 64),
        activation_class(),
        nn.Linear(64, 32),
        activation_class(),
        nn.Linear(32, 16),
        activation_class(),
        nn.Linear(16, 1),
        nn.Sigmoid()
    )

def train_and_evaluate(activation_name, activation_class, epochs=50):
    """Train a network and return loss history."""
    model = create_network(activation_class)
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.BCELoss()
    
    losses = []
    for epoch in range(epochs):
        epoch_loss = 0
        for batch_x, batch_y in loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        losses.append(epoch_loss / len(loader))
    
    return losses

# Compare activations
activations = {
    'ReLU': nn.ReLU,
    'LeakyReLU': nn.LeakyReLU,
    'ELU': nn.ELU,
    'GELU': nn.GELU,
    'SiLU': nn.SiLU,
    'Tanh': nn.Tanh,
}

plt.figure(figsize=(10, 6))
for name, act_class in activations.items():
    losses = train_and_evaluate(name, act_class)
    plt.plot(losses, label=name, linewidth=2)

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Convergence by Activation Function')
plt.legend()
plt.grid(True)
plt.show()

Exercises

Implement these activation functions and their derivatives from scratch:
  1. SELU (Scaled ELU)
  2. Softplus
  3. Hardswish
Verify your implementations against PyTorch.
For a 10-layer network, compute and plot the gradient magnitude at each layer for:
  1. Sigmoid activation
  2. ReLU activation
  3. GELU activation
Explain the differences you observe.
Create an interactive visualization showing how different activations transform the output space of a 2D network. Use contour plots to show the decision boundary.
Design your own activation function that:
  1. Is non-linear
  2. Is differentiable everywhere
  3. Doesn’t have vanishing gradients
  4. Is bounded below (like ReLU)
Train a network with it and compare to ReLU.

Key Takeaways

| Activation | Best For | Avoid When |
| --- | --- | --- |
| ReLU | Default choice, hidden layers | Dying neuron problem |
| Leaky ReLU | When neurons die | (Generally safe) |
| GELU | Transformers, NLP | Simple networks |
| Swish/SiLU | Efficient architectures | (Generally safe) |
| Sigmoid | Binary output | Hidden layers |
| Softmax | Multi-class output | Hidden layers |
| Tanh | RNN gates | Deep networks |

What’s Next

Module 5: Loss Functions & Objectives

Define what “learning” means mathematically — MSE, cross-entropy, contrastive loss, and more.

Interview Deep-Dive

Strong Answer:
  • GELU (Gaussian Error Linear Unit) is defined as $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi(x)$ is the standard normal CDF. Unlike ReLU, which makes a hard binary decision (pass or block), GELU makes a soft probabilistic decision: it scales the input by $\Phi(x)$, the probability that a standard normal random variable is less than or equal to $x$.
  • The mathematical intuition: GELU smoothly interpolates between identity (for large positive inputs) and zero (for large negative inputs), with a smooth transition region near zero. Small negative inputs are slightly suppressed rather than hard-zeroed. This smooth gating acts as a form of stochastic regularization — it is effectively a deterministic approximation of randomly zeroing activations weighted by their magnitude.
  • Why it works better in transformers: transformers process high-dimensional continuous representations where the hard discontinuity of ReLU at zero can create problems. The smooth gradient of GELU means that small perturbations to inputs near zero produce small perturbations to outputs, which improves optimization stability. In the attention mechanism, where values flow through many sequential operations, this smoothness compounds.
  • Empirically, GELU outperforms ReLU on NLP benchmarks by 0.5-1%, which is significant at the scale of BERT and GPT. The approximation $\text{GELU}(x) \approx 0.5x(1 + \tanh[\sqrt{2/\pi}(x + 0.044715x^3)])$ is used in practice for computational efficiency.
Follow-up: Swish ($x \cdot \sigma(x)$) looks similar to GELU. When would you prefer one over the other?
Swish and GELU are nearly identical in shape and empirical performance. The key difference is provenance and convention: GELU was motivated by a probabilistic argument and became the standard in NLP transformers (BERT, GPT-2+), while Swish was discovered through automated architecture search and became the standard in efficient vision networks (EfficientNet, MobileNetV3). In practice, I would use GELU for transformer-based architectures and Swish/SiLU for vision models, following the conventions of the respective communities. The performance difference between them is typically within noise.
Strong Answer:
  • What it is: A ReLU neuron “dies” when its input is permanently negative for all training examples. Since ReLU’s gradient is zero for negative inputs, the neuron receives zero gradient and can never update its weights to recover. It becomes a constant-zero output, permanently wasting model capacity.
  • How it happens: typically caused by a learning rate that is too high early in training. A large gradient update pushes a neuron’s weights into a region where the pre-activation $Wx + b$ is negative for all inputs in the training set. Once in this state, zero gradient means zero updates, creating an irreversible death.
  • How to detect it: Monitor the fraction of neurons with zero activations across a batch: (activations > 0).float().mean(). Healthy layers should have 40-60% active neurons. If a layer drops below 20%, you have significant dying ReLU. You can also check for parameters with zero gradient norm.
  • Best fixes:
    • Leaky ReLU ($\max(0.01x, x)$): guarantees a small non-zero gradient for negative inputs, allowing dead neurons to gradually recover. Minimal computational overhead.
    • He initialization: sets weight variance to $2/n_{in}$, specifically calibrated for ReLU to prevent activations from collapsing to zero from the start.
    • Lower learning rate or warmup: prevents the large early updates that push neurons into the dead zone.
    • Batch normalization before ReLU: keeps pre-activations centered near zero, ensuring roughly half are positive.
Follow-up: If dying ReLU is such a problem, why is vanilla ReLU still the default recommendation for new projects?
Because the dying ReLU problem is easily preventable with proper initialization and learning rate scheduling, and vanilla ReLU has the lowest computational cost (a single comparison operation). In most practical scenarios with He initialization and a reasonable learning rate, fewer than 10% of neurons die, which has negligible impact on performance. Leaky ReLU and ELU add complexity (an extra multiply) and hyperparameters (the leak coefficient) for marginal benefit. The engineering principle is: start with the simplest thing that works, and only add complexity when you have evidence of a problem.
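The effect of He initialization mentioned among the fixes can be illustrated by pushing a random batch through a stack of ReLU layers and watching the activation scale. This is a rough sketch under stated assumptions: bias-free fully connected layers, Gaussian inputs, and a hypothetical helper `forward_std`; the width, depth, and the "too small" scale of 0.01 are arbitrary.

```python
import numpy as np

def forward_std(weight_std_fn, depth=20, width=256, seed=0):
    """Push a random batch through `depth` bias-free ReLU layers; record activation std."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((512, width))
    stds = []
    for _ in range(depth):
        # Weight standard deviation is chosen from the fan-in of the layer
        W = rng.standard_normal((width, width)) * weight_std_fn(width)
        h = np.maximum(0, h @ W)
        stds.append(float(h.std()))
    return stds

he = forward_std(lambda n_in: np.sqrt(2.0 / n_in))   # He: Var(W) = 2 / n_in
small = forward_std(lambda n_in: 0.01)               # arbitrary too-small scale

print(f"He init    : layer 1 std {he[0]:.3f}, layer 20 std {he[-1]:.3f}")
print(f"std = 0.01 : layer 1 std {small[0]:.3e}, layer 20 std {small[-1]:.3e}")
```

With the He scale the activation magnitude stays of order one through the stack, while the small fixed scale shrinks it layer after layer toward zero.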
Strong Answer:
  • Hidden layer activations serve a different purpose than output layer activations, so they have different design requirements.
  • Hidden layers need: (1) non-linearity to enable complex function approximation, (2) well-behaved gradients for backpropagation through many layers, and (3) computational efficiency since they are applied billions of times. ReLU and its variants satisfy all three: they are non-linear, have gradients of 0 or 1 (no vanishing), and are trivially cheap to compute.
  • Output layers need to match the probability structure of the task:
    • Binary classification: sigmoid squashes output to (0, 1), interpretable as $P(y=1 \mid x)$. Combined with binary cross-entropy, this is the maximum likelihood estimator for Bernoulli outcomes.
    • Multi-class classification: softmax produces a valid probability distribution over $K$ classes (non-negative, sums to 1). Combined with cross-entropy, this is MLE for categorical outcomes.
    • Regression: no activation (linear output) because we want unbounded real-valued predictions. MSE loss assumes Gaussian noise.
    • Bounded regression (e.g., predicting a percentage): sigmoid or tanh to constrain the output range.
  • Using the wrong combination causes subtle failures. For example, using sigmoid in hidden layers causes vanishing gradients in deep networks. Using ReLU as the output activation for regression clips all negative predictions to zero. Using softmax in hidden layers wastes capacity by imposing a competition between neurons.
Follow-up: Why does PyTorch’s CrossEntropyLoss expect raw logits instead of softmax probabilities?
Numerical stability. Computing softmax first and then taking the log separately can lose the answer entirely: a very small probability (say $10^{-45}$, where $\log(10^{-45}) \approx -103.6$) sits at the edge of what float32 can represent, and anything smaller underflows to exactly 0, so the log returns $-\infty$. PyTorch’s CrossEntropyLoss internally uses the LogSumExp trick, which computes $\log(\text{softmax}(x))$ in a numerically stable way by factoring out the maximum logit before exponentiation. This avoids both overflow (from $e^{1000}$) and underflow (from probabilities too small to represent). The rule: always feed raw logits to CrossEntropyLoss and BCEWithLogitsLoss. Never apply softmax or sigmoid manually before these loss functions.
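The underflow can be reproduced in NumPy with float32 logits. The two helper functions are illustrative sketches of the "softmax then log" path and the LogSumExp path; they are not PyTorch's actual implementation.

```python
import numpy as np

def log_softmax_naive(z):
    """Softmax first, then log: a tiny probability underflows to 0 before the log."""
    shifted = z - np.max(z)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    with np.errstate(divide="ignore"):   # silence the log(0) warning
        return np.log(probs)

def log_softmax_stable(z):
    """LogSumExp form: (z - max) - log(sum(exp(z - max)))."""
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

# float32 matches the default precision used for most training
logits = np.array([0.0, 200.0], dtype=np.float32)

naive = log_softmax_naive(logits)
stable = log_softmax_stable(logits)
print("naive :", naive)     # first entry is -inf
print("stable:", stable)    # first entry is -200

# Cross-entropy when the true class is index 0
print("loss (naive): ", -naive[0])
print("loss (stable):", -stable[0])
```

The naive path yields an infinite loss, and therefore unusable gradients, whereas the stable path returns the finite value 200.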