Activation Functions
Why We Need Non-Linearity
The Classic Activations
Sigmoid
Tanh
ReLU: The Workhorse of Modern Deep Learning
ReLU Variants
Leaky ReLU
Parametric ReLU (PReLU)
Exponential Linear Unit (ELU)
Comparison Plot
Modern Activations
GELU (Gaussian Error Linear Unit)
Swish / SiLU
Mish
Activation Functions for Output Layers
Softmax
Choosing the Right Activation
Decision Flowchart
Rules of Thumb
Implementation in PyTorch
Experiments: Which Activation Works Best?
Exercises
Key Takeaways
What’s Next

Activation Functions

Why We Need Non-Linearity

Here’s a fundamental question: Why can’t we just stack linear transformations?

# Two linear layers
def layer1(x):
    return W1 @ x + b1

def layer2(x):
    return W2 @ x + b2

# Composition of two linear functions...
def network(x):
    return layer2(layer1(x))
    # = W2 @ (W1 @ x + b1) + b2
    # = (W2 @ W1) @ x + (W2 @ b1 + b2)
    # = W_combined @ x + b_combined

The composition of linear functions is still linear! No matter how many linear layers you stack, you get a single linear transformation. You can’t learn complex patterns like:

Curves in decision boundaries
XOR logic
Image features

Activation functions add non-linearity, enabling networks to learn arbitrarily complex functions.
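
The collapse argument above can be checked numerically. The following is a minimal sketch, assuming small random weight matrices (the shapes and the seed are arbitrary choices for illustration). It verifies that two stacked linear layers equal one combined layer, and that putting a ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Illustrative shapes: 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

# Collapse the two layers into a single linear map
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2

X = rng.standard_normal((100, 3))  # 100 random test inputs, one per row

stacked = (X @ W1.T + b1) @ W2.T + b2                 # layer2(layer1(x)) for every row
collapsed = X @ W_combined.T + b_combined             # single combined layer
with_relu = np.maximum(0, X @ W1.T + b1) @ W2.T + b2  # ReLU between the layers

linear_layers_collapse = np.allclose(stacked, collapsed)
relu_breaks_collapse = not np.allclose(with_relu, collapsed)

print("Stacked linear layers equal one linear layer:", linear_layers_collapse)
print("ReLU in between breaks the equivalence:", relu_breaks_collapse)
```

The only difference between `stacked` and `with_relu` is the element-wise `np.maximum`, which is enough to stop the two layers from folding into one matrix.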

Figure: Linear vs Non-Linear Decision Boundaries

The Classic Activations

Sigmoid

\sigma(z) = \frac{1}{1 + e^{-z}}

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-6, 6, 100)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, sigmoid(z), 'b-', linewidth=2)
axes[0].set_title('Sigmoid: σ(z) = 1/(1+e⁻ᶻ)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('σ(z)')
axes[0].grid(True)
axes[0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=0, color='gray', linestyle='--', alpha=0.5)

axes[1].plot(z, sigmoid_derivative(z), 'r-', linewidth=2)
axes[1].set_title('Sigmoid Derivative: σ\'(z) = σ(z)(1-σ(z))')
axes[1].set_xlabel('z')
axes[1].set_ylabel('σ\'(z)')
axes[1].grid(True)

plt.tight_layout()
plt.show()

Properties:

Property	Value
Range	(0, 1)
Max gradient	0.25 (at z=0)
Problem	Vanishing gradients for large |z|
Use case	Output layer for binary classification
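
To see why a maximum gradient of 0.25 matters, here is a rough sketch of how the sigmoid factor in backpropagation shrinks with depth. It deliberately ignores the weight terms that also appear in the chain rule, so the numbers are an illustration of the trend, not a prediction for a real network. The depths shown are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# Best case: every layer sits exactly at z = 0, where the derivative peaks at 0.25
peak = sigmoid_derivative(0.0)
for depth in (1, 5, 10, 20):
    print(f"depth {depth:2d}: best-case product of sigmoid derivatives = {peak ** depth:.2e}")

# Saturated case: the derivative is already tiny for a single layer
saturated = sigmoid_derivative(np.array([-10.0, 10.0]))
print("sigmoid'(-10), sigmoid'(10):", saturated)
```

Even in the best case the factor shrinks geometrically with depth, and once a unit saturates the factor for that single layer is already close to zero.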

Tanh

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1

def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    return 1 - np.tanh(z)**2

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, tanh(z), 'b-', linewidth=2)
axes[0].set_title('Tanh: tanh(z)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('tanh(z)')
axes[0].grid(True)

axes[1].plot(z, tanh_derivative(z), 'r-', linewidth=2)
axes[1].set_title('Tanh Derivative: 1 - tanh²(z)')
axes[1].set_xlabel('z')
axes[1].set_ylabel('tanh\'(z)')
axes[1].grid(True)

plt.tight_layout()
plt.show()

Properties:

Property	Value
Range	(-1, 1)
Max gradient	1 (at z=0)
Centered	Yes (zero-centered output)
Problem	Still vanishes for large |z|
Use case	Hidden layers in RNNs, older networks

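Why tanh over sigmoid? Zero-centered outputs make optimization easier. Sigmoid outputs are always positive, so the gradients for all incoming weights of a next-layer neuron share the same sign, which forces zig-zagging updates.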
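
A quick numerical check of the two claims in this section: the identity tanh(z) = 2σ(2z) - 1, and the zero-centering. This is a small sketch on a grid of inputs that is symmetric around 0 (the grid itself is an arbitrary choice).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-6, 6, 101)  # symmetric grid around 0

# Identity from the formula: tanh(z) = 2*sigmoid(2z) - 1
identity_holds = np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)

# On symmetric inputs, tanh averages to 0 while sigmoid averages to 0.5
mean_tanh = np.tanh(z).mean()
mean_sigmoid = sigmoid(z).mean()

print("tanh(z) == 2*sigmoid(2z) - 1:", identity_holds)
print(f"mean tanh output:    {mean_tanh:.3f}")
print(f"mean sigmoid output: {mean_sigmoid:.3f}")
```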

ReLU: The Workhorse of Modern Deep Learning

\text{ReLU}(z) = \max(0, z)

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, relu(z), 'b-', linewidth=2)
axes[0].set_title('ReLU: max(0, z)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('ReLU(z)')
axes[0].grid(True)

axes[1].plot(z, relu_derivative(z), 'r-', linewidth=2)
axes[1].set_title('ReLU Derivative')
axes[1].set_xlabel('z')
axes[1].set_ylabel('ReLU\'(z)')
axes[1].set_ylim(-0.1, 1.5)
axes[1].grid(True)

plt.tight_layout()
plt.show()

Why ReLU Changed Everything:

Advantage	Explanation
No saturation for positive inputs	Gradient is exactly 1 for z > 0, so it does not shrink layer by layer
Sparse activation	Many neurons output 0 → efficient
Computationally simple	Just a max operation
Faster convergence	About 6x faster than tanh to reach the same training error on CIFAR-10 (AlexNet paper)
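
Two of the rows above are easy to check directly. The sketch below assumes zero-mean, normally distributed pre-activations (an assumption for illustration, real networks will differ) and measures how many ReLU outputs are exactly 0. It also compares the ReLU and sigmoid gradients at a large positive input.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def sigmoid_derivative(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000)  # assumed: zero-mean pre-activations

# Sparse activation: fraction of outputs that are exactly zero
sparsity = np.mean(relu(z) == 0)
print(f"fraction of zero outputs: {sparsity:.2f}")

# No saturation for positive inputs: compare gradients at z = 10
z_large = np.array([10.0])
print("ReLU gradient at z=10:   ", relu_derivative(z_large)[0])
print("sigmoid gradient at z=10:", sigmoid_derivative(z_large)[0])
```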

The “Dying ReLU” Problem:

If a neuron's pre-activation is negative for every input, its gradient is always 0
The neuron “dies”: it never activates and its weights stop updating
Solution: Leaky ReLU, PReLU, ELU

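# Demonstration of dying ReLU
# A neuron is "dead" if it outputs 0 for every input, not just for one sample.
rng = np.random.default_rng(0)
total_neurons = 1000
n_samples = 500

X = rng.standard_normal((n_samples, 10))            # batch of inputs
W = rng.standard_normal((total_neurons, 10)) * 0.5  # random weights, one row per neuron
# Assumption for illustration: biases were pushed strongly negative,
# e.g. by one oversized gradient step
b = rng.normal(-3.0, 1.0, size=total_neurons)

activations = relu(X @ W.T + b)                     # shape (n_samples, total_neurons)
dead_neurons = int(np.all(activations == 0, axis=0).sum())

print(f"Dead neurons: {dead_neurons}/{total_neurons} ({100*dead_neurons/total_neurons:.1f}%)")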

ReLU Variants

Leaky ReLU

\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}

where α is typically 0.01.

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)
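
The point of the small negative slope is that the gradient never becomes exactly 0. A minimal sketch of the derivative, where `leaky_relu_derivative` is a helper name introduced here for illustration:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    # Slope is 1 for positive inputs and alpha otherwise
    # (alpha is used at z = 0 by convention, matching the z <= 0 branch)
    return np.where(z > 0, 1.0, alpha)

z = np.array([-5.0, -0.5, 0.5, 5.0])
outputs = leaky_relu(z)
grads = leaky_relu_derivative(z)

print("outputs:  ", outputs)
print("gradients:", grads)
```

Because the gradient for negative inputs is α and not 0, a unit that lands in the negative region can still receive updates and recover.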

Parametric ReLU (PReLU)

Same as Leaky ReLU, but α is learned during training.

import torch.nn as nn

# In PyTorch
prelu = nn.PReLU()  # α is learnable
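
To make "learned" concrete, here is a hand-written NumPy sketch of gradient descent on α alone. It is not how `nn.PReLU` is implemented internally; it only illustrates that α receives a gradient like any other parameter. The toy target (generated with α = 0.1), the starting value, and the learning rate are all assumptions chosen for the example.

```python
import numpy as np

def prelu(z, alpha):
    return np.where(z > 0, z, alpha * z)

def prelu_grad_alpha(z):
    # d PReLU / d alpha is z for non-positive inputs and 0 for positive ones
    return np.where(z > 0, 0.0, z)

z = np.array([-2.0, -1.0, 1.0, 2.0])
target = np.where(z > 0, z, 0.1 * z)  # toy target: produced by alpha = 0.1

alpha = 0.25  # hypothetical starting value
lr = 0.1      # hypothetical learning rate

for step in range(200):
    error = prelu(z, alpha) - target
    # Gradient of the mean squared error with respect to alpha (chain rule)
    grad = np.mean(2 * error * prelu_grad_alpha(z))
    alpha -= lr * grad

print(f"learned alpha: {alpha:.4f}")
```

Only the negative inputs contribute to the gradient, so α is shaped entirely by how the unit should behave on the negative side.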

Exponential Linear Unit (ELU)

\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

Advantages of ELU:

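Differentiable at z=0 when α=1 (the default here); for other α the slope jumps from α to 1
Negative outputs push the mean activation closer to 0
Negative inputs keep a non-zero gradient, so units are less prone to dying than with ReLU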
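
The behavior at z = 0 can be checked from the derivative, which is 1 for z > 0 and α·e^z for z ≤ 0. The sketch below compares the slope just left and just right of 0 for two values of α; `elu_derivative` is a helper name introduced here for illustration.

```python
import numpy as np

def elu_derivative(z, alpha=1.0):
    # Slope is 1 for z > 0 and alpha * exp(z) for z <= 0
    return np.where(z > 0, 1.0, alpha * np.exp(z))

eps = 1e-8
for alpha in (1.0, 0.5):
    left = float(elu_derivative(np.array(-eps), alpha))
    right = float(elu_derivative(np.array(eps), alpha))
    print(f"alpha={alpha}: slope just left of 0 = {left:.3f}, just right = {right:.3f}")

# The gradient for negative inputs is small but never exactly 0
grad_at_minus_3 = float(elu_derivative(np.array(-3.0)))
print(f"gradient at z=-3: {grad_at_minus_3:.4f}")
```

With α = 1 the two one-sided slopes agree, so the function has no kink at 0. With any other α the slope changes abruptly there, as it does for Leaky ReLU.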

Comparison Plot

fig, ax = plt.subplots(figsize=(10, 6))

z = np.linspace(-3, 3, 100)

ax.plot(z, relu(z), 'b-', linewidth=2, label='ReLU')
ax.plot(z, leaky_relu(z, 0.1), 'g-', linewidth=2, label='Leaky ReLU (α=0.1)')
ax.plot(z, elu(z), 'r-', linewidth=2, label='ELU')
ax.plot(z, np.tanh(z), 'm--', linewidth=2, label='Tanh')

ax.set_xlabel('z')
ax.set_ylabel('Activation')
ax.set_title('Comparison of Activation Functions')
ax.legend()
ax.grid(True)
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)

plt.show()

Modern Activations

GELU (Gaussian Error Linear Unit)

\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]

Approximation:

\text{GELU}(z) \approx 0.5z(1 + \tanh[\sqrt{2/\pi}(z + 0.044715z^3)])

def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

# Or using scipy
from scipy.special import erf
def gelu_exact(z):
    return z * 0.5 * (1 + erf(z / np.sqrt(2)))
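
To get a feel for how close the tanh approximation is, the following sketch compares it with the exact form on a grid. It uses `math.erf` from the standard library in place of SciPy so that it runs with NumPy alone. The grid range [-5, 5] is an arbitrary choice, and `gelu_tanh` and `gelu_erf` are names introduced here.

```python
import math
import numpy as np

def gelu_tanh(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def gelu_erf(z):
    # math.erf works on scalars, so apply it element by element
    return np.array([0.5 * v * (1 + math.erf(v / math.sqrt(2))) for v in z])

z = np.linspace(-5, 5, 1001)
max_abs_error = np.max(np.abs(gelu_tanh(z) - gelu_erf(z)))

print(f"max |tanh approximation - exact| on [-5, 5]: {max_abs_error:.2e}")
```

The two curves agree closely across the grid, which is why the approximation has often been used as a drop-in replacement.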

Why GELU?

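Used in BERT, GPT-2, GPT-3, and many other transformers
Smooth approximation of ReLU with a probabilistic interpretation: each input is weighted by Φ(z), the probability that a standard normal variable falls below it
Often reported to work better than ReLU in transformer language models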

Swish / SiLU

\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}

def swish(z):
    return z * sigmoid(z)

Properties:

Non-monotonic (dips slightly below 0)
Self-gated (multiplication by sigmoid)
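Found by an automated search over candidate activation functions (Google Brain); the same function had been proposed earlier under the name SiLU
Used in EfficientNet; MobileNetV3 uses the cheaper hard-swish approximation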
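
The "dips slightly below 0" property can be located numerically. This is a small sketch that scans a fine grid (the range and step size are arbitrary choices) for the minimum of Swish and confirms that the function both decreases and increases.

```python
import numpy as np

def swish(z):
    return z / (1 + np.exp(-z))

z = np.linspace(-6, 6, 12001)  # step of 0.001
values = swish(z)

i_min = np.argmin(values)
print(f"minimum value {values[i_min]:.4f} at z = {z[i_min]:.3f}")

# Non-monotonic: the function decreases before the minimum and increases after it
diffs = np.diff(values)
has_decreasing_part = bool(np.any(diffs < 0))
has_increasing_part = bool(np.any(diffs > 0))
print("decreases somewhere:", has_decreasing_part)
print("increases somewhere:", has_increasing_part)
```

The dip is shallow and sits a little to the left of z = -1, so small negative inputs are passed through as small negative outputs instead of being clipped to 0.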

Mish

\text{Mish}(z) = z \cdot \tanh(\text{softplus}(z)) = z \cdot \tanh(\ln(1 + e^z))

def mish(z):
    # np.logaddexp(0, z) computes softplus(z) = ln(1 + e^z) without overflowing for large z
    return z * np.tanh(np.logaddexp(0, z))
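
One practical detail with Mish is the softplus term: written literally as log(1 + exp(z)), it overflows for large z. The sketch below contrasts that literal form with `np.logaddexp(0, z)`, which computes the same quantity in a numerically stable way. The function names and the extreme test inputs are chosen here for illustration.

```python
import numpy as np

def softplus_naive(z):
    return np.log(1 + np.exp(z))

def softplus_stable(z):
    # log(exp(0) + exp(z)), evaluated without forming exp(z) directly
    return np.logaddexp(0, z)

def mish_stable(z):
    return z * np.tanh(softplus_stable(z))

z = np.array([-1000.0, 0.0, 1000.0])

with np.errstate(over="ignore"):  # silence the overflow warning from exp(1000)
    naive = softplus_naive(z)

stable = softplus_stable(z)
mish_values = mish_stable(z)

print("naive softplus: ", naive)
print("stable softplus:", stable)
print("mish:           ", mish_values)
```

For large positive z, softplus(z) is essentially z and tanh of it is 1, so Mish behaves like the identity there, much like ReLU.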