Activation Functions

Why We Need Non-Linearity

Here’s a fundamental question: Why can’t we just stack linear transformations?
# Two linear layers
def layer1(x):
    return W1 @ x + b1

def layer2(x):
    return W2 @ x + b2

# Composition of two linear functions...
def network(x):
    return layer2(layer1(x))
    # = W2 @ (W1 @ x + b1) + b2
    # = (W2 @ W1) @ x + (W2 @ b1 + b2)
    # = W_combined @ x + b_combined
The composition of linear functions is still linear! No matter how many linear layers you stack, you get a single linear transformation. You can’t learn complex patterns like:
  • Curves in decision boundaries
  • XOR logic
  • Image features
Activation functions add non-linearity, enabling networks to learn arbitrarily complex functions.
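As a minimal sketch of what a single non-linearity buys you, here is XOR computed by a tiny 2-2-1 network with one hidden ReLU layer and hand-picked (not learned) weights; no stack of purely linear layers can represent this mapping.
import numpy as np

def relu(z):
    return np.maximum(0, z)

# XOR inputs and hand-picked weights for a 2-2-1 network
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W1 = np.array([[1.0, 1.0], [1.0, 1.0]]); b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0]); b2 = 0.0

hidden = relu(X @ W1.T + b1)   # the non-linear step is what makes XOR representable
output = hidden @ w2 + b2
print(output)                  # [0. 1. 1. 0.], i.e. XOR of the two inputs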
[Figure: Linear vs Non-Linear Decision Boundaries]

The Classic Activations

Sigmoid

\sigma(z) = \frac{1}{1 + e^{-z}}
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-6, 6, 100)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, sigmoid(z), 'b-', linewidth=2)
axes[0].set_title('Sigmoid: σ(z) = 1/(1+e⁻ᶻ)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('σ(z)')
axes[0].grid(True)
axes[0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=0, color='gray', linestyle='--', alpha=0.5)

axes[1].plot(z, sigmoid_derivative(z), 'r-', linewidth=2)
axes[1].set_title('Sigmoid Derivative: σ\'(z) = σ(z)(1-σ(z))')
axes[1].set_xlabel('z')
axes[1].set_ylabel("σ'(z)")
axes[1].grid(True)

plt.tight_layout()
plt.show()
Properties:
Property | Value
--- | ---
Range | (0, 1)
Max gradient | 0.25 (at z = 0)
Problem | Vanishing gradients when z is large in magnitude
Use case | Output layer for binary classification
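A quick numerical check of the gradient values in the table, reusing sigmoid_derivative from the code above:
print(sigmoid_derivative(0.0))   # 0.25, the maximum gradient
print(sigmoid_derivative(6.0))   # ~0.0025, the gradient has nearly vanished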

Tanh

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1
def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    return 1 - np.tanh(z)**2

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, tanh(z), 'b-', linewidth=2)
axes[0].set_title('Tanh: tanh(z)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('tanh(z)')
axes[0].grid(True)

axes[1].plot(z, tanh_derivative(z), 'r-', linewidth=2)
axes[1].set_title('Tanh Derivative: 1 - tanh²(z)')
axes[1].set_xlabel('z')
axes[1].set_ylabel('tanh\'(z)')
axes[1].grid(True)

plt.tight_layout()
plt.show()
Properties:
Property | Value
--- | ---
Range | (-1, 1)
Max gradient | 1 (at z = 0)
Centered | Yes (zero-centered output)
Problem | Still vanishes when z is large in magnitude
Use case | Hidden layers in RNNs, older networks
Why tanh over sigmoid? Zero-centered outputs make optimization easier.
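The identity tanh(z) = 2σ(2z) - 1 used in the formula above is easy to verify numerically, reusing sigmoid from earlier:
z_test = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(z_test), 2 * sigmoid(2 * z_test) - 1))  # True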

ReLU: The Workhorse of Modern Deep Learning

\text{ReLU}(z) = \max(0, z)
def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, relu(z), 'b-', linewidth=2)
axes[0].set_title('ReLU: max(0, z)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('ReLU(z)')
axes[0].grid(True)

axes[1].plot(z, relu_derivative(z), 'r-', linewidth=2)
axes[1].set_title('ReLU Derivative')
axes[1].set_xlabel('z')
axes[1].set_ylabel('ReLU\'(z)')
axes[1].set_ylim(-0.1, 1.5)
axes[1].grid(True)

plt.tight_layout()
plt.show()
Why ReLU Changed Everything:
Advantage | Explanation
--- | ---
No vanishing gradient | Gradient is 1 for positive inputs
Sparse activation | Many neurons output 0 → efficient
Computationally simple | Just a max operation
Faster convergence | ~6x faster than tanh (AlexNet paper)
The “Dying ReLU” Problem:
  • If inputs are always negative, gradient is 0
  • Neuron “dies” and never activates
  • Solution: Leaky ReLU, PReLU, ELU
# Simplified illustration of the dying ReLU effect: count neurons whose
# output is 0 for a single random input (a truly "dead" neuron outputs 0
# for every input, so this single-sample check is only a rough proxy)
dead_neurons = 0
total_neurons = 1000

for _ in range(total_neurons):
    w = np.random.randn(10) * 0.5   # random weights
    x = np.random.randn(10)         # one random input
    if relu(np.dot(w, x)) == 0:
        dead_neurons += 1

print(f"Dead neurons: {dead_neurons}/{total_neurons} ({100*dead_neurons/total_neurons:.1f}%)")

ReLU Variants

Leaky ReLU

\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}
where α is typically 0.01.
def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)
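Its derivative is just as simple; a small sketch to pair with the definition above:
def leaky_relu_derivative(z, alpha=0.01):
    # slope is 1 for z > 0 and alpha otherwise, so the gradient never hits exactly 0
    return np.where(z > 0, 1.0, alpha)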

Parametric ReLU (PReLU)

Same as Leaky ReLU, but α is learned during training.
import torch.nn as nn

# In PyTorch
prelu = nn.PReLU()  # α is learnable

Exponential Linear Unit (ELU)

\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}
def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))
Advantages of ELU:
  • Smooth at z = 0 when α = 1 (both branches meet with matching slope)
  • Pushes mean activation closer to 0
  • Saturates for large negative inputs, which makes it more robust to noise than ReLU
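A sketch of the ELU derivative makes the smoothness claim concrete: for the default α = 1 both branches meet at 1 when z = 0.
def elu_derivative(z, alpha=1.0):
    # 1 for z > 0, alpha * e^z for z <= 0; both equal 1 at z = 0 when alpha = 1
    return np.where(z > 0, 1.0, alpha * np.exp(z))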

Comparison Plot

fig, ax = plt.subplots(figsize=(10, 6))

z = np.linspace(-3, 3, 100)

ax.plot(z, relu(z), 'b-', linewidth=2, label='ReLU')
ax.plot(z, leaky_relu(z, 0.1), 'g-', linewidth=2, label='Leaky ReLU (α=0.1)')
ax.plot(z, elu(z), 'r-', linewidth=2, label='ELU')
ax.plot(z, np.tanh(z), 'm--', linewidth=2, label='Tanh')

ax.set_xlabel('z')
ax.set_ylabel('Activation')
ax.set_title('Comparison of Activation Functions')
ax.legend()
ax.grid(True)
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)

plt.show()
[Figure: Comparison of Activation Functions]

Modern Activations

GELU (Gaussian Error Linear Unit)

\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]
Approximation: \text{GELU}(z) \approx 0.5z\left(1 + \tanh\left[\sqrt{2/\pi}\,(z + 0.044715z^3)\right]\right)
def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

# Or using scipy
from scipy.special import erf
def gelu_exact(z):
    return z * 0.5 * (1 + erf(z / np.sqrt(2)))
Why GELU?
  • Used in BERT, GPT-2, GPT-3, and most transformers
  • Smooth approximation of ReLU with probabilistic interpretation
  • Better for NLP tasks
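As a quick sanity check, the tanh approximation stays very close to the exact erf form (reusing gelu and gelu_exact defined above):
z_test = np.linspace(-4, 4, 801)
print(np.max(np.abs(gelu(z_test) - gelu_exact(z_test))))  # well below 1e-3 on this range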

Swish / SiLU

\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}
def swish(z):
    return z * sigmoid(z)
Properties:
  • Non-monotonic (dips slightly below 0)
  • Self-gated (multiplication by sigmoid)
  • Discovered by neural architecture search (Google Brain)
  • Used in EfficientNet, MobileNet
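Because Swish is self-gated, its derivative also has a compact form; a sketch, reusing sigmoid from earlier:
def swish_derivative(z):
    # d/dz [z * sigmoid(z)] = sigmoid(z) + z * sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s + z * s * (1 - s)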

Mish

\text{Mish}(z) = z \cdot \tanh(\text{softplus}(z)) = z \cdot \tanh(\ln(1 + e^z))
def mish(z):
    # softplus computed as logaddexp(0, z) to avoid overflow of exp(z) for large z
    return z * np.tanh(np.logaddexp(0, z))

Activation Functions for Output Layers

The output activation depends on your task:
Task | Output Activation | Loss Function
--- | --- | ---
Binary classification | Sigmoid | Binary cross-entropy
Multi-class classification | Softmax | Categorical cross-entropy
Regression | None (linear) | MSE
Multi-label classification | Sigmoid | Binary cross-entropy per label
Bounded regression | Sigmoid / Tanh | MSE
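One practical caveat to pair with this table: in PyTorch, nn.CrossEntropyLoss and nn.BCEWithLogitsLoss apply the softmax/sigmoid internally, so the network should output raw logits when using them. A minimal sketch:
import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # raw scores, no softmax applied
targets = torch.tensor([0, 2, 1, 0])  # class indices

# CrossEntropyLoss combines log-softmax and negative log-likelihood internally
loss = nn.CrossEntropyLoss()(logits, targets)
print(loss.item())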

Softmax

\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
def softmax(z):
    """Numerically stable softmax."""
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Example: 3-class classification
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Logits: {logits}")
print(f"Probabilities: {probs}")
print(f"Sum: {probs.sum():.4f}")  # Should be 1.0
Properties:
  • Outputs sum to 1 (probability distribution)
  • Larger inputs get exponentially more probability
  • Temperature scaling: \text{Softmax}(z/T) controls the sharpness of the distribution
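A quick sketch of temperature scaling, reusing the softmax and logits defined above:
for T in [0.5, 1.0, 2.0]:
    # lower T sharpens the distribution, higher T flattens it
    print(f"T={T}: {softmax(logits / T)}")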

Choosing the Right Activation

Decision Flowchart

Is it the OUTPUT layer?
├── Yes
│   ├── Binary classification → Sigmoid
│   ├── Multi-class classification → Softmax
│   ├── Regression → Linear (None)
│   └── Bounded regression → Sigmoid/Tanh

└── No (HIDDEN layers)
    ├── Default choice → ReLU
    ├── Dying ReLU problem → Leaky ReLU / ELU
    ├── Transformer architecture → GELU
    ├── Mobile/efficient networks → Swish
    └── Experimental → Mish

Rules of Thumb

Situation | Recommendation
--- | ---
Starting a new project | ReLU everywhere
RNNs/LSTMs | Tanh (traditional)
Transformers/BERT/GPT | GELU
EfficientNet/MobileNet | Swish
Dying neurons observed | Leaky ReLU or ELU
Very deep networks | ELU or SELU
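One way to encode these defaults in code; the dictionary below is purely illustrative, not a standard API:
import torch.nn as nn

# Illustrative mapping from situation to a reasonable default activation
DEFAULT_ACTIVATION = {
    "general": nn.ReLU,
    "transformer": nn.GELU,
    "efficient_cnn": nn.SiLU,            # Swish
    "dying_relu_observed": nn.LeakyReLU,
}

act = DEFAULT_ACTIVATION["transformer"]()  # instantiates nn.GELU()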

Implementation in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

# All available as modules
class ActivationDemo(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)
        
        # As modules
        self.relu = nn.ReLU()
        self.leaky_relu = nn.LeakyReLU(0.01)
        self.elu = nn.ELU(alpha=1.0)
        self.gelu = nn.GELU()
        self.silu = nn.SiLU()  # Same as Swish
        self.mish = nn.Mish()
        self.prelu = nn.PReLU()  # Learnable
    
    def forward(self, x):
        # Functional versions are also available (the same fc layer is reused
        # here purely to demonstrate the calls)
        x = F.relu(self.fc(x))
        x = F.gelu(self.fc(x))
        x = F.silu(self.fc(x))
        return x


# Print available activation modules (those defined in torch.nn.modules.activation)
print("PyTorch activation modules:")
for name in dir(nn):
    obj = getattr(nn, name)
    if isinstance(obj, type) and issubclass(obj, nn.Module):
        if 'activation' in getattr(obj, '__module__', ''):
            print(f"  nn.{name}")

Experiments: Which Activation Works Best?

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Generate synthetic data
torch.manual_seed(42)
X = torch.randn(1000, 20)
y = (X[:, :5].sum(dim=1) > 0).float().unsqueeze(1)

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

def create_network(activation_class):
    """Create a network with specified activation."""
    return nn.Sequential(
        nn.Linear(20, 64),
        activation_class(),
        nn.Linear(64, 32),
        activation_class(),
        nn.Linear(32, 16),
        activation_class(),
        nn.Linear(16, 1),
        nn.Sigmoid()
    )

def train_and_evaluate(activation_name, activation_class, epochs=50):
    """Train a network and return loss history."""
    model = create_network(activation_class)
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.BCELoss()
    
    losses = []
    for epoch in range(epochs):
        epoch_loss = 0
        for batch_x, batch_y in loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        losses.append(epoch_loss / len(loader))
    
    return losses

# Compare activations
activations = {
    'ReLU': nn.ReLU,
    'LeakyReLU': nn.LeakyReLU,
    'ELU': nn.ELU,
    'GELU': nn.GELU,
    'SiLU': nn.SiLU,
    'Tanh': nn.Tanh,
}

plt.figure(figsize=(10, 6))
for name, act_class in activations.items():
    losses = train_and_evaluate(name, act_class)
    plt.plot(losses, label=name, linewidth=2)

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Convergence by Activation Function')
plt.legend()
plt.grid(True)
plt.show()

Exercises

1. Implement these activation functions and their derivatives from scratch:
  • SELU (Scaled ELU)
  • Softplus
  • Hardswish
Verify your implementations against PyTorch.
2. For a 10-layer network, compute and plot the gradient magnitude at each layer for:
  • Sigmoid activation
  • ReLU activation
  • GELU activation
Explain the differences you observe.
3. Create an interactive visualization showing how different activations transform the output space of a 2D network. Use contour plots to show the decision boundary.
4. Design your own activation function that:
  • Is non-linear
  • Is differentiable everywhere
  • Doesn’t have vanishing gradients
  • Is bounded below (like ReLU)
Train a network with it and compare to ReLU.

Key Takeaways

Activation | Best For | Avoid When
--- | --- | ---
ReLU | Default choice, hidden layers | Dying neuron problem
Leaky ReLU | When neurons die | (Generally safe)
GELU | Transformers, NLP | Simple networks
Swish/SiLU | Efficient architectures | (Generally safe)
Sigmoid | Binary output | Hidden layers
Softmax | Multi-class output | Hidden layers
Tanh | RNN gates | Deep networks

What’s Next