# Activation Functions

> ReLU, sigmoid, tanh, GELU, Swish - when to use which and why they matter

<Frame>
  <img src="https://mintcdn.com/devweeekends/0kwJwOL2KCwg2YYu/images/courses/deep-learning-mastery/activation-functions-concept.svg?fit=max&auto=format&n=0kwJwOL2KCwg2YYu&q=85&s=decdeb8753f4120c140050d8dc60cf3c" alt="Activation Functions" width="1080" height="1080" data-path="images/courses/deep-learning-mastery/activation-functions-concept.svg" />
</Frame>

## Why We Need Non-Linearity

Here's a fundamental question: Why can't we just stack linear transformations?

```python theme={null}
# Two linear layers
def layer1(x):
    return W1 @ x + b1

def layer2(x):
    return W2 @ x + b2

# Composition of two linear functions...
def network(x):
    return layer2(layer1(x))
    # = W2 @ (W1 @ x + b1) + b2
    # = (W2 @ W1) @ x + (W2 @ b1 + b2)
    # = W_combined @ x + b_combined
```

**The composition of linear functions is still linear!**

No matter how many linear layers you stack, you get a single linear transformation. You can't learn complex patterns like:

* Curves in decision boundaries
* XOR logic
* Image features

**Activation functions add non-linearity**, enabling networks to learn arbitrarily complex functions.

Here is the analogy: a linear network is like a chef who can only blend ingredients (linear combinations). No matter how many times you blend, you still have a smoothie. Activation functions are the cooking techniques -- roasting, searing, fermenting -- that transform ingredients into something qualitatively different. Without them, a 100-layer network is no more expressive than a single layer. With them, each layer creates new "flavors" that the next layer can build on.
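
The collapse is easy to confirm numerically. The sketch below uses small random matrices (the layer sizes and the seed are arbitrary choices for illustration) and checks that two stacked linear layers give the same output as the single combined layer derived above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Two linear layers: 4 inputs -> 5 hidden units -> 3 outputs (illustrative sizes)
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal(3)

x = rng.standard_normal(4)

# Output of the stacked layers
stacked = W2 @ (W1 @ x + b1) + b2

# Equivalent single layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
single = W_combined @ x + b_combined

print(np.allclose(stacked, single))  # True

# Putting a non-linearity between the layers generally breaks the equivalence
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(np.allclose(with_relu, single))
```

The second check only differs from the first by the `np.maximum(0, ...)` call, which is exactly the role an activation function plays.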

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/devweeekends/images/courses/deep-learning-mastery/linear-vs-nonlinear.svg" alt="Linear vs Non-Linear Decision Boundaries" />
</Frame>

***

## The Classic Activations

### Sigmoid

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

```python theme={null}
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-6, 6, 100)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, sigmoid(z), 'b-', linewidth=2)
axes[0].set_title('Sigmoid: σ(z) = 1/(1+e⁻ᶻ)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('σ(z)')
axes[0].grid(True)
axes[0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=0, color='gray', linestyle='--', alpha=0.5)

axes[1].plot(z, sigmoid_derivative(z), 'r-', linewidth=2)
axes[1].set_title('Sigmoid Derivative: σ\'(z) = σ(z)(1-σ(z))')
axes[1].set_xlabel('z')
axes[1].set_ylabel('σ\'(z)')
axes[1].grid(True)

plt.tight_layout()
plt.show()
```

**Properties**:

| Property     | Value                                           |
| ------------ | ----------------------------------------------- |
| Range        | (0, 1)                                          |
| Max gradient | 0.25 (at z=0)                                   |
| Problem      | Vanishing gradients for large $\lvert z \rvert$ |
| Use case     | Output layer for binary classification          |
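
To see why the 0.25 ceiling matters, consider backpropagating through a chain of sigmoid layers. Ignoring the weight matrices (a simplifying assumption made only for this sketch), each layer multiplies the gradient by at most 0.25, so the upper bound shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# Best case: every pre-activation sits at z = 0, where the derivative peaks
max_grad = sigmoid_derivative(0.0)  # 0.25

for depth in (1, 5, 10, 20):
    # Upper bound on the product of activation derivatives across `depth` layers
    bound = max_grad ** depth
    print(f"{depth:2d} layers: gradient factor at most {bound:.2e}")
```

Real networks also multiply by weight matrices at each layer, so the actual gradient can be larger or smaller than this bound suggests; the point is only that the activation itself contributes a shrinking factor.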

### Tanh

$$
\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1
$$

```python theme={null}
def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    return 1 - np.tanh(z)**2

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, tanh(z), 'b-', linewidth=2)
axes[0].set_title('Tanh: tanh(z)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('tanh(z)')
axes[0].grid(True)

axes[1].plot(z, tanh_derivative(z), 'r-', linewidth=2)
axes[1].set_title('Tanh Derivative: 1 - tanh²(z)')
axes[1].set_xlabel('z')
axes[1].set_ylabel('tanh\'(z)')
axes[1].grid(True)

plt.tight_layout()
plt.show()
```

**Properties**:

| Property     | Value                                      |
| ------------ | ------------------------------------------ |
| Range        | (-1, 1)                                    |
| Max gradient | 1 (at z=0)                                 |
| Centered     | Yes (zero-centered output)                 |
| Problem      | Still vanishes for large $\lvert z \rvert$ |
| Use case     | Hidden layers in RNNs, older networks      |

**Why tanh over sigmoid?** Zero-centered outputs make optimization easier. When every input to a neuron is positive (as with sigmoid outputs from the previous layer), the gradients for all of that neuron's incoming weights share the same sign, so a single update can only increase them all or decrease them all. The optimizer is then forced onto a zigzag path toward the optimum. Zero-centered outputs (tanh) let those gradient components take mixed signs, giving the optimizer a more direct route. In practice, this tends to translate into faster convergence.
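
Two of the statements above can be checked numerically. The sketch below verifies the identity $\tanh(z) = 2\sigma(2z) - 1$ from the formula and compares mean outputs on zero-mean inputs; the sample size and seed are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000)  # zero-mean pre-activations

# tanh is a shifted and rescaled sigmoid
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))  # True

# Sigmoid outputs are all positive, so their mean sits near 0.5;
# tanh outputs are symmetric around 0
print(f"mean sigmoid output: {sigmoid(z).mean():.3f}")
print(f"mean tanh output:    {np.tanh(z).mean():.3f}")
```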

***

## ReLU: The Workhorse of Modern Deep Learning

$$
\text{ReLU}(z) = \max(0, z)
$$

```python theme={null}
def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(z, relu(z), 'b-', linewidth=2)
axes[0].set_title('ReLU: max(0, z)')
axes[0].set_xlabel('z')
axes[0].set_ylabel('ReLU(z)')
axes[0].grid(True)

axes[1].plot(z, relu_derivative(z), 'r-', linewidth=2)
axes[1].set_title('ReLU Derivative')
axes[1].set_xlabel('z')
axes[1].set_ylabel('ReLU\'(z)')
axes[1].set_ylim(-0.1, 1.5)
axes[1].grid(True)

plt.tight_layout()
plt.show()
```

**Why ReLU Changed Everything**:

| Advantage                  | Explanation                            |
| -------------------------- | -------------------------------------- |
| **No vanishing gradient**  | Gradient is 1 for positive inputs      |
| **Sparse activation**      | Many neurons output 0 → efficient      |
| **Computationally simple** | Just a max operation                   |
| **Faster convergence**     | ~6x faster than tanh on CIFAR-10 (AlexNet paper) |

**The "Dying ReLU" Problem**:

* If inputs are always negative, gradient is 0
* Neuron "dies" and never activates -- it becomes a permanent zero, wasting capacity
* This typically happens when the learning rate is too high, causing weights to overshoot into a region where all inputs produce negative pre-activations
* Solution: Leaky ReLU, PReLU, or ELU (all allow small gradients for negative inputs)

<Tip>
  **Training pitfall**: Dying ReLU is insidious because your loss might still decrease -- the surviving neurons compensate. You only notice the problem when you realize your 512-neuron layer is effectively a 200-neuron layer. Monitor activation statistics during training: `(activations > 0).float().mean()` gives the fraction of active outputs, which typically sits around 0.4-0.6 for healthy layers; a sustained drop well below that range is a warning sign.
</Tip>
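
The monitoring idea in the tip can be sketched with PyTorch forward hooks. The model below is a small stand-in with arbitrary layer sizes, and `stats` and `make_hook` are hypothetical helper names introduced for this example. Besides the overall fraction of active outputs, it counts units that never fire on the batch, which is the closer proxy for truly dead neurons.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small illustrative model; layer sizes are arbitrary
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

stats = {}  # layer name -> (fraction of active outputs, units silent on this batch)

def make_hook(name):
    def hook(module, inputs, output):
        active = output > 0                           # shape (batch, units)
        frac_active = active.float().mean().item()    # the statistic from the tip
        silent_units = int((~active.any(dim=0)).sum())  # never fired on this batch
        stats[name] = (frac_active, silent_units)
    return hook

# Attach one hook to every ReLU in the model
handles = [
    m.register_forward_hook(make_hook(name))
    for name, m in model.named_modules()
    if isinstance(m, nn.ReLU)
]

with torch.no_grad():
    model(torch.randn(256, 20))

for name, (frac, silent) in stats.items():
    print(f"layer {name}: {frac:.1%} active, {silent} silent units on this batch")

for h in handles:
    h.remove()  # detach hooks when monitoring is done
```

A unit that is silent on one batch is only a candidate; it counts as dead if it stays silent across the whole training set.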

```python theme={null}
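# Demonstration of dying ReLU
# A neuron is "dead" only if it outputs 0 for every input in the data,
# so we evaluate each neuron on a whole batch rather than a single sample.
rng = np.random.default_rng(0)
total_neurons = 1000

X = rng.standard_normal((256, 10))                  # 256 inputs, 10 features
W = rng.standard_normal((total_neurons, 10)) * 0.5  # random weights
# Simulate biases that an overly large update has pushed strongly negative
b = -3.0 * np.abs(rng.standard_normal(total_neurons))

activations = relu(X @ W.T + b)                     # shape (256, 1000)
dead_neurons = int((activations == 0).all(axis=0).sum())

print(f"Dead neurons: {dead_neurons}/{total_neurons} ({100*dead_neurons/total_neurons:.1f}%)")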
```

***

## ReLU Variants

### Leaky ReLU

$$
\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}
$$

where $\alpha$ is typically 0.01.

```python theme={null}
def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)
```

### Parametric ReLU (PReLU)

Same as Leaky ReLU, but $\alpha$ is **learned** during training.

```python theme={null}
import torch.nn as nn

# In PyTorch
prelu = nn.PReLU()  # α is learnable
```
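
For reference, the learned slope lives in the module's `weight` parameter. A minimal sketch, assuming PyTorch's default constructor arguments (one shared $\alpha$, initialized to 0.25):

```python
import torch
import torch.nn as nn

prelu = nn.PReLU()  # defaults: num_parameters=1, init=0.25

# alpha is an ordinary trainable parameter, so the optimizer updates it
print(prelu.weight.item())         # 0.25
print(prelu.weight.requires_grad)  # True

x = torch.tensor([-2.0, 0.0, 3.0])
print(prelu(x))  # the negative input is scaled by alpha: -2.0 * 0.25 = -0.5
```

Passing `num_parameters` equal to the number of channels gives each channel its own slope instead of a single shared one.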

### Exponential Linear Unit (ELU)

$$
\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}
$$

```python theme={null}
def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))
```

**Advantages of ELU**:

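* Differentiable at z=0 when $\alpha = 1$ (the default in the code above), unlike ReLU's kink
* Pushes mean activation closer to 0
* Keeps a non-zero gradient for negative inputs, so neurons are less prone to dying than with ReLU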
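
Whether ELU is differentiable at zero depends on $\alpha$. The sketch below estimates the one-sided slopes at zero with finite differences (the step size `h` is an arbitrary small value). Analytically the left slope is $\alpha$ and the right slope is 1, so the two match only when $\alpha = 1$.

```python
import numpy as np

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def one_sided_slopes(alpha, h=1e-6):
    """Finite-difference slopes of ELU just left and just right of z = 0."""
    left = (elu(0.0, alpha) - elu(-h, alpha)) / h
    right = (elu(h, alpha) - elu(0.0, alpha)) / h
    return float(left), float(right)

for alpha in (1.0, 0.5):
    left, right = one_sided_slopes(alpha)
    print(f"alpha={alpha}: left slope ~ {left:.4f}, right slope ~ {right:.4f}")
```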

### Comparison Plot

```python theme={null}
fig, ax = plt.subplots(figsize=(10, 6))

z = np.linspace(-3, 3, 100)

ax.plot(z, relu(z), 'b-', linewidth=2, label='ReLU')
ax.plot(z, leaky_relu(z, 0.1), 'g-', linewidth=2, label='Leaky ReLU (α=0.1)')
ax.plot(z, elu(z), 'r-', linewidth=2, label='ELU')
ax.plot(z, np.tanh(z), 'm--', linewidth=2, label='Tanh')

ax.set_xlabel('z')
ax.set_ylabel('Activation')
ax.set_title('Comparison of Activation Functions')
ax.legend()
ax.grid(True)
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)

plt.show()
```

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/devweeekends/images/courses/deep-learning-mastery/activation-comparison.svg" alt="Activation Function Comparison" />
</Frame>

***

## Modern Activations

### GELU (Gaussian Error Linear Unit)

$$
\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]
$$

Approximation: $\text{GELU}(z) \approx 0.5z(1 + \tanh[\sqrt{2/\pi}(z + 0.044715z^3)])$

```python theme={null}
def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

# Or using scipy
from scipy.special import erf
def gelu_exact(z):
    return z * 0.5 * (1 + erf(z / np.sqrt(2)))
```

**Why GELU?**

* Used in **BERT, GPT-2, GPT-3, and many other transformers**
* Smooth approximation of ReLU with a probabilistic interpretation: it scales the input by $\Phi(z) = P(X \leq z)$, the probability that a standard normal variable $X$ falls below $z$
* The key difference from ReLU: GELU is smooth everywhere (differentiable at zero) and has a small negative region, which acts as a soft form of dropout -- small negative inputs are slightly suppressed rather than hard-zeroed
* Often reported to outperform ReLU in NLP tasks, likely because the smooth gating better suits the continuous, high-dimensional representations in language models
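
The sketch below compares the tanh approximation with the exact form and shows how GELU treats small negative inputs differently from ReLU. It uses the standard library's `math.erf` (wrapped with `np.vectorize`) instead of SciPy so that it runs with NumPy alone; the evaluation grid is an arbitrary choice.

```python
import math
import numpy as np

def gelu_tanh(z):
    """Tanh approximation of GELU."""
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def gelu_exact(z):
    """Exact GELU, z * Phi(z), using the standard-library erf."""
    erf = np.vectorize(math.erf)
    return 0.5 * z * (1 + erf(z / math.sqrt(2)))

def relu(z):
    return np.maximum(0, z)

z = np.linspace(-6, 6, 1201)

# How far the approximation strays from the exact form on this grid
max_err = np.max(np.abs(gelu_tanh(z) - gelu_exact(z)))
print(f"max |approx - exact| on [-6, 6]: {max_err:.2e}")

# Unlike ReLU, GELU lets small negative inputs through, attenuated
for v in (-3.0, -1.0, -0.1, 0.1, 1.0, 3.0):
    g = float(gelu_exact(np.array(v)))
    print(f"z={v:+.1f}  ReLU={relu(v):.4f}  GELU={g:.4f}")
```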

### Swish / SiLU

$$
\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}
$$

```python theme={null}
def swish(z):
    return z * sigmoid(z)
```

**Properties**:

* Non-monotonic (dips slightly below 0)
* Self-gated (multiplication by sigmoid)
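* Found by an automated search over activation functions (Google Brain); the same function had been proposed earlier under the name SiLU
* Used in **EfficientNet**; **MobileNetV3** uses the cheaper piecewise-linear approximation Hardswish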
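
The non-monotonic dip can be located numerically. A minimal sketch on an arbitrary grid, which finds the lowest point of Swish and confirms that the function both decreases and increases over the range:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def swish(z):
    return z * sigmoid(z)

z = np.linspace(-6, 6, 12001)
s = swish(z)

# Locate the dip below zero on the grid
i_min = np.argmin(s)
print(f"minimum of Swish ~ {s[i_min]:.3f} at z ~ {z[i_min]:.2f}")

# Non-monotonic: the function decreases before the minimum and increases after it
diffs = np.diff(s)
print("decreasing somewhere:", bool((diffs < 0).any()))
print("increasing somewhere:", bool((diffs > 0).any()))
```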

### Mish

$$
\text{Mish}(z) = z \cdot \tanh(\text{softplus}(z)) = z \cdot \tanh(\ln(1 + e^z))
$$

```python theme={null}
def mish(z):
    # np.logaddexp(0, z) computes softplus ln(1 + e^z) without overflowing for large z
    return z * np.tanh(np.logaddexp(0, z))
```

***

## Activation Functions for Output Layers

The output activation depends on your task:

| Task                       | Output Activation | Loss Function             |
| -------------------------- | ----------------- | ------------------------- |
| Binary Classification      | Sigmoid           | Binary Cross-Entropy      |
| Multi-Class Classification | Softmax           | Categorical Cross-Entropy |
| Regression                 | None (Linear)     | MSE                       |
| Multi-Label Classification | Sigmoid           | Binary CE per label       |
| Bounded Regression         | Sigmoid/Tanh      | MSE                       |

### Softmax

$$
\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$

```python theme={null}
def softmax(z):
    """Numerically stable softmax."""
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Example: 3-class classification
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Logits: {logits}")
print(f"Probabilities: {probs}")
print(f"Sum: {probs.sum():.4f}")  # Should be 1.0
```

**Properties**:

* Outputs sum to 1 (probability distribution) -- this is what makes it valid as a set of class probabilities
* Larger inputs get exponentially more probability -- softmax amplifies differences between logits
* Temperature scaling: $\text{Softmax}(z/T)$ for controlling sharpness. $T \to 0$ makes it approach argmax (hard selection), $T \to \infty$ makes it uniform (complete uncertainty). This is why language model "temperature" controls creativity: lower temperature makes the model more deterministic, higher temperature makes it more exploratory
* **Numerical stability**: Always subtract `max(z)` before computing `exp(z)`. Without this, `exp(1000)` overflows to infinity. The math is identical -- $\text{softmax}(z) = \text{softmax}(z - c)$ for any constant $c$ -- but the numerics are night and day
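
The temperature and stability points can both be seen in a few lines. The sketch below extends the softmax above with a temperature argument `T` (a parameter added here for illustration) and contrasts the naive formula with the max-subtracted one on large logits.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T (T > 0)."""
    z = np.asarray(z, dtype=float) / T
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])

# Low T sharpens toward the argmax, high T flattens toward uniform
for T in (0.1, 1.0, 10.0, 100.0):
    print(f"T={T:>5}: {np.round(softmax(logits, T), 4)}")

# Why the max subtraction matters: the naive form overflows on large logits
big = np.array([1000.0, 999.0, 998.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(big) / np.sum(np.exp(big))  # inf / inf -> nan
print("naive: ", naive)
print("stable:", softmax(big))

# Shift invariance: softmax(z) == softmax(z - c)
print(np.allclose(softmax(logits), softmax(logits - 123.0)))  # True
```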

***

## Choosing the Right Activation

### Decision Flowchart

```
Is it the OUTPUT layer?
├── Yes
│   ├── Binary classification → Sigmoid
│   ├── Multi-class classification → Softmax
│   ├── Regression → Linear (None)
│   └── Bounded regression → Sigmoid/Tanh
│
└── No (HIDDEN layers)
    ├── Default choice → ReLU
    ├── Dying ReLU problem → Leaky ReLU / ELU
    ├── Transformer architecture → GELU
    ├── Mobile/efficient networks → Swish
    └── Experimental → Mish
```

### Rules of Thumb

| Situation                  | Recommendation     |
| -------------------------- | ------------------ |
| **Starting a new project** | ReLU everywhere    |
| **RNNs/LSTMs**             | Tanh (traditional) |
| **Transformers/BERT/GPT**  | GELU               |
| **EfficientNet/MobileNet** | Swish              |
| **Dying neurons observed** | Leaky ReLU or ELU  |
| **Very deep networks**     | ELU or SELU        |

***

## Implementation in PyTorch

```python theme={null}
import torch
import torch.nn as nn
import torch.nn.functional as F

# All available as modules
class ActivationDemo(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)
        
        # As modules
        self.relu = nn.ReLU()
        self.leaky_relu = nn.LeakyReLU(0.01)
        self.elu = nn.ELU(alpha=1.0)
        self.gelu = nn.GELU()
        self.silu = nn.SiLU()  # Same as Swish
        self.mish = nn.Mish()
        self.prelu = nn.PReLU()  # Learnable
    
    def forward(self, x):
        # Can also use functional versions
        x = F.relu(self.fc(x))
        x = F.gelu(self.fc(x))
        x = F.silu(self.fc(x))
        return x


# Print available activations
print("PyTorch activation modules:")
for name in dir(nn):
    obj = getattr(nn, name)
    # Keep only nn.Module subclasses defined in an "activation" module
    if isinstance(obj, type) and issubclass(obj, nn.Module):
        if 'activation' in obj.__module__:
            print(f"  nn.{name}")
```

***

## Experiments: Which Activation Works Best?

```python theme={null}
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Generate synthetic data
torch.manual_seed(42)
X = torch.randn(1000, 20)
y = (X[:, :5].sum(dim=1) > 0).float().unsqueeze(1)

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

def create_network(activation_class):
    """Create a network with specified activation."""
    return nn.Sequential(
        nn.Linear(20, 64),
        activation_class(),
        nn.Linear(64, 32),
        activation_class(),
        nn.Linear(32, 16),
        activation_class(),
        nn.Linear(16, 1),
        nn.Sigmoid()
    )

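def train_and_evaluate(activation_name, activation_class, epochs=50):
    """Train a network with the given activation and return its loss history."""
    torch.manual_seed(0)  # same seed for every run, so activations are compared fairly
    model = create_network(activation_class)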
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.BCELoss()
    
    losses = []
    for epoch in range(epochs):
        epoch_loss = 0
        for batch_x, batch_y in loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        losses.append(epoch_loss / len(loader))
    
    return losses

# Compare activations
activations = {
    'ReLU': nn.ReLU,
    'LeakyReLU': nn.LeakyReLU,
    'ELU': nn.ELU,
    'GELU': nn.GELU,
    'SiLU': nn.SiLU,
    'Tanh': nn.Tanh,
}

plt.figure(figsize=(10, 6))
for name, act_class in activations.items():
    losses = train_and_evaluate(name, act_class)
    plt.plot(losses, label=name, linewidth=2)

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Convergence by Activation Function')
plt.legend()
plt.grid(True)
plt.show()
```

***

## Exercises

<AccordionGroup>
  <Accordion title="Exercise 1: Implement All Activations">
    Implement these activation functions and their derivatives from scratch:

    1. SELU (Scaled ELU)
    2. Softplus
    3. Hardswish

    Verify your implementations against PyTorch.
  </Accordion>

  <Accordion title="Exercise 2: Gradient Flow Analysis">
    For a 10-layer network, compute and plot the gradient magnitude at each layer for:

    1. Sigmoid activation
    2. ReLU activation
    3. GELU activation

    Explain the differences you observe.
  </Accordion>

  <Accordion title="Exercise 3: Activation Visualization">
    Create an interactive visualization showing how different activations transform the output space of a 2D network. Use contour plots to show the decision boundary.
  </Accordion>

  <Accordion title="Exercise 4: Custom Activation">
    Design your own activation function that:

    1. Is non-linear
    2. Is differentiable everywhere
    3. Doesn't have vanishing gradients
    4. Is bounded below (like ReLU)

    Train a network with it and compare to ReLU.
  </Accordion>
</AccordionGroup>

***

## Key Takeaways

| Activation     | Best For                      | Avoid When           |
| -------------- | ----------------------------- | -------------------- |
| **ReLU**       | Default choice, hidden layers | Dying neuron problem |
| **Leaky ReLU** | When neurons die              | (Generally safe)     |
| **GELU**       | Transformers, NLP             | Simple networks      |
| **Swish/SiLU** | Efficient architectures       | (Generally safe)     |
| **Sigmoid**    | Binary output                 | Hidden layers        |
| **Softmax**    | Multi-class output            | Hidden layers        |
| **Tanh**       | RNN/LSTM hidden states        | Deep feedforward networks |

***

## What's Next

<CardGroup cols={1}>
  <Card title="Module 5: Loss Functions & Objectives" icon="crosshairs" href="/courses/deep-learning-mastery/05-loss-functions">
    Define what "learning" means mathematically — MSE, cross-entropy, contrastive loss, and more.
  </Card>
</CardGroup>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="Why did GELU replace ReLU as the default activation in transformers? What is the mathematical intuition behind it?">
    **Strong Answer:**

    * GELU (Gaussian Error Linear Unit) is defined as $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi(x)$ is the standard normal CDF. Unlike ReLU, which makes a hard binary decision (pass or block), GELU makes a soft probabilistic decision: it scales the input by $\Phi(x) = P(X \leq x)$, the probability that a standard normal variable $X$ falls below $x$.
    * The mathematical intuition: GELU smoothly interpolates between identity (for large positive inputs) and zero (for large negative inputs), with a smooth transition region near zero. Small negative inputs are slightly suppressed rather than hard-zeroed. This smooth gating acts as a form of stochastic regularization -- it is effectively a deterministic approximation of randomly zeroing activations weighted by their magnitude.
    * Why it works better in transformers: transformers process high-dimensional continuous representations where the kink of ReLU at zero (a jump in its derivative) can create problems. The smooth gradient of GELU means that small perturbations to inputs near zero produce small changes in the gradient, which can improve optimization stability. In a deep stack of transformer blocks, where values flow through many sequential feed-forward layers, this smoothness compounds.
    * Empirically, GELU is often reported to give small but consistent gains over ReLU on NLP benchmarks, which matters at the scale of BERT and GPT. The approximation $\text{GELU}(x) \approx 0.5x(1 + \tanh[\sqrt{2/\pi}(x + 0.044715x^3)])$ is often used in practice for computational efficiency.

    **Follow-up: Swish ($x \cdot \sigma(x)$) looks similar to GELU. When would you prefer one over the other?**

    Swish and GELU are nearly identical in shape and empirical performance. The key difference is provenance and convention: GELU was motivated by a probabilistic argument and became the standard in NLP transformers (BERT, GPT-2+), while Swish was discovered through automated architecture search and became the standard in efficient vision networks (EfficientNet, MobileNetV3). In practice, I would use GELU for transformer-based architectures and Swish/SiLU for vision models, following the conventions of the respective communities. The performance difference between them is typically within noise.
  </Accordion>

  <Accordion title="The 'dying ReLU' problem: what is it, how do you detect it in practice, and what are the best fixes?">
    **Strong Answer:**

    * **What it is**: A ReLU neuron "dies" when its input is permanently negative for all training examples. Since ReLU's gradient is zero for negative inputs, the neuron receives zero gradient and can never update its weights to recover. It becomes a constant-zero output, permanently wasting model capacity.
    * **How it happens**: typically caused by a learning rate that is too high early in training. A large gradient update pushes a neuron's weights into a region where the pre-activation $Wx + b$ is negative for all inputs in the training set. Once in this state, zero gradient means zero updates, creating an irreversible death.
    * **How to detect it**: Monitor the fraction of active (non-zero) activations across a batch: `(activations > 0).float().mean()`. Healthy layers typically have 40-60% active outputs. If a layer drops below 20%, you likely have significant dying ReLU. To count truly dead neurons, check which units are zero for every example in the batch, e.g. with `(activations > 0).any(dim=0)`. You can also check for parameters with zero gradient norm.
    * **Best fixes**:
      * **Leaky ReLU** ($\max(0.01x, x)$): guarantees a small non-zero gradient for negative inputs, allowing dead neurons to gradually recover. Minimal computational overhead.
      * **He initialization**: sets weight variance to $2/n_{in}$, specifically calibrated for ReLU to prevent activations from collapsing to zero from the start.
      * **Lower learning rate or warmup**: prevents the large early updates that push neurons into the dead zone.
      * **Batch normalization before ReLU**: keeps pre-activations centered near zero, ensuring roughly half are positive.

    **Follow-up: If dying ReLU is such a problem, why is vanilla ReLU still the default recommendation for new projects?**

    Because the dying ReLU problem is easily preventable with proper initialization and learning rate scheduling, and vanilla ReLU has the lowest computational cost (a single comparison operation). In most practical scenarios with He initialization and a reasonable learning rate, only a small fraction of neurons die, which usually has negligible impact on performance. Leaky ReLU and ELU add complexity (an extra multiply) and hyperparameters (the leak coefficient) for marginal benefit. The engineering principle is: start with the simplest thing that works, and only add complexity when you have evidence of a problem.
  </Accordion>

  <Accordion title="Why do we use different activation functions for hidden layers versus output layers? Walk through the design reasoning.">
    **Strong Answer:**

    * Hidden layer activations serve a different purpose than output layer activations, so they have different design requirements.
    * **Hidden layers** need: (1) non-linearity to enable complex function approximation, (2) well-behaved gradients for backpropagation through many layers, and (3) computational efficiency since they are applied billions of times. ReLU and its variants satisfy all three: they are non-linear, have gradients of 0 or 1 (no vanishing), and are trivially cheap to compute.
    * **Output layers** need to match the probability structure of the task:
      * **Binary classification**: sigmoid squashes output to (0, 1), interpretable as $P(y=1|x)$. Combined with binary cross-entropy, this is the maximum likelihood estimator for Bernoulli outcomes.
      * **Multi-class classification**: softmax produces a valid probability distribution over $K$ classes (non-negative, sums to 1). Combined with cross-entropy, this is MLE for categorical outcomes.
      * **Regression**: no activation (linear output) because we want unbounded real-valued predictions. MSE loss assumes Gaussian noise.
      * **Bounded regression** (e.g., predicting a percentage): sigmoid or tanh to constrain the output range.
    * Using the wrong combination causes subtle failures. For example, using sigmoid in hidden layers causes vanishing gradients in deep networks. Using ReLU as the output activation for regression clips all negative predictions to zero. Using softmax in hidden layers wastes capacity by imposing a competition between neurons.

    **Follow-up: Why does PyTorch's CrossEntropyLoss expect raw logits instead of softmax probabilities?**

    Numerical stability. Computing softmax first and then taking the log separately can lose precision catastrophically: if a class probability falls below what float32 can represent (roughly $10^{-45}$, whose log is only about $-103.6$), softmax returns exactly zero and the subsequent $\log(0)$ is $-\infty$. PyTorch's CrossEntropyLoss internally uses the LogSumExp trick, which computes $\log(\text{softmax}(x))$ in a numerically stable way by factoring out the maximum logit before exponentiation. This avoids both overflow (from $e^{1000}$) and underflow (from probabilities that round to zero before the log is taken). The rule: always feed raw logits to CrossEntropyLoss and BCEWithLogitsLoss. Never apply softmax or sigmoid manually before these loss functions.
  </Accordion>
</AccordionGroup>
