# Interrupts & Exception Handling

> Master interrupt handling, IRQ architecture, softirqs, tasklets, and workqueues in the Linux kernel

Interrupts are fundamental to how Linux handles hardware events, system calls, and exceptional conditions. Understanding the interrupt subsystem is crucial for debugging performance issues and writing high-performance systems code.

**The analogy**: Imagine you're a chef cooking an elaborate meal (the running process). A kitchen timer goes off (hardware interrupt) -- you must stop what you're doing, turn off the oven, and then decide: do you handle the hot dish right now (hardirq), or set it aside on the counter to plate later when you have a moment (softirq/workqueue)? The key insight is the same as in the kernel: **acknowledge the interrupt immediately, but defer the heavy work**. If you spend too long handling the timer, all your other dishes burn.

This "top-half / bottom-half" split is the single most important concept in interrupt handling. Get it wrong and you get latency spikes, dropped packets, and unresponsive systems.

<Info>
  **Interview Frequency**: High (especially for performance-critical roles)\
  **Key Topics**: IRQ handling, softirqs, tasklets, workqueues, interrupt coalescing\
  **Time to Master**: 12-14 hours
</Info>

***

## Why Interrupts Matter

Every time a network packet arrives, a disk I/O completes, or a timer fires, an interrupt is involved. Understanding interrupts explains:

* **Why context switches happen**: Interrupts can preempt any code
* **Network performance**: Interrupt coalescing and NAPI
* **CPU affinity effects**: IRQ pinning and load balancing
* **Latency sources**: Interrupt storms and processing time

**A real-world example**: At 10Gbps with minimum-size 64-byte frames, a NIC can receive up to about 14.88 million packets per second. If every packet raised its own interrupt and each interrupt cost 5 microseconds of CPU time, that would be roughly 74 seconds of CPU time per second of wall-clock time -- the equivalent of more than 70 fully loaded cores doing nothing but handling interrupts. This is why NAPI (polling mode) was invented, and why interrupt affinity tuning is a daily task for infrastructure engineers at companies like Cloudflare and Datadog.
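
The arithmetic behind that example can be reproduced with a few lines of C. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes standard Ethernet framing, where each frame carries 20 extra bytes on the wire (8 bytes of preamble plus a 12-byte inter-frame gap).

```c
#include <assert.h>

/* Per-frame overhead on the wire: 8-byte preamble/SFD + 12-byte gap. */
#define WIRE_OVERHEAD_BYTES 20

/* Maximum frames per second for a given link speed and frame size. */
static double frames_per_second(double link_bps, int frame_bytes)
{
    assert(frame_bytes > 0);
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8.0);
}

/* CPU-seconds consumed per wall-clock second. This equals the number of
 * fully busy cores needed if every frame raised its own interrupt. */
static double cores_needed(double irqs_per_second, double seconds_per_irq)
{
    return irqs_per_second * seconds_per_irq;
}
```

Calling `frames_per_second(10e9, 64)` gives roughly 14.88 million frames per second, and `cores_needed()` with 5 microseconds per interrupt gives about 74 cores, which is why per-packet interrupts do not scale.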

***

## Interrupt Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    LINUX INTERRUPT ARCHITECTURE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HARDWARE LAYER                                                              │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │  ┌─────┐   ┌─────┐   ┌─────┐   ┌──────┐   ┌─────┐                      ││
│  │  │ NIC │   │ Disk│   │Timer│   │ PCIe │   │ USB │                      ││
│  │  └──┬──┘   └──┬──┘   └──┬──┘   └───┬──┘   └──┬──┘                      ││
│  │     │         │         │          │         │                          ││
│  │     └─────────┴────┬────┴──────────┴─────────┘                          ││
│  │                    ▼                                                     ││
│  │          ┌─────────────────────┐                                        ││
│  │          │   Interrupt         │    Modern: MSI/MSI-X                   ││
│  │          │   Controller        │    (Message Signaled Interrupts)       ││
│  │          │   (APIC/GIC)        │                                        ││
│  │          └──────────┬──────────┘                                        ││
│  └─────────────────────│───────────────────────────────────────────────────┘│
│                        │                                                     │
│                        ▼                                                     │
│  CPU                                                                         │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                                                                          ││
│  │  1. CPU receives interrupt signal                                       ││
│  │  2. Saves current context (registers, flags)                           ││
│  │  3. Looks up handler in IDT (Interrupt Descriptor Table)               ││
│  │  4. Switches to kernel stack                                            ││
│  │  5. Jumps to interrupt handler                                          ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                        │                                                     │
│                        ▼                                                     │
│  KERNEL INTERRUPT HANDLING                                                   │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                                                                          ││
│  │  ┌───────────────────────────────────────────────────────────────┐      ││
│  │  │           HARDIRQ CONTEXT (interrupts disabled)               │      ││
│  │  │                                                               │      ││
│  │  │  • Acknowledge interrupt to hardware                          │      ││
│  │  │  • Do MINIMAL work (read status, copy data to buffer)        │      ││
│  │  │  • Schedule deferred work (softirq, tasklet, workqueue)      │      ││
│  │  │  • Return ASAP (microseconds, not milliseconds)              │      ││
│  │  │                                                               │      ││
│  │  └────────────────────────────┬──────────────────────────────────┘      ││
│  │                               │                                          ││
│  │                               ▼                                          ││
│  │  ┌───────────────────────────────────────────────────────────────┐      ││
│  │  │          SOFTIRQ CONTEXT (interrupts enabled)                  │      ││
│  │  │                                                               │      ││
│  │  │  • Process accumulated data (network packets, block I/O)     │      ││
│  │  │  • Can be preempted by hardirqs                              │      ││
│  │  │  • Cannot sleep or block                                     │      ││
│  │  │                                                               │      ││
│  │  └────────────────────────────┬──────────────────────────────────┘      ││
│  │                               │                                          ││
│  │                               ▼                                          ││
│  │  ┌───────────────────────────────────────────────────────────────┐      ││
│  │  │           PROCESS CONTEXT (workqueues)                         │      ││
│  │  │                                                               │      ││
│  │  │  • Can sleep and block                                       │      ││
│  │  │  • Can allocate memory with GFP_KERNEL                       │      ││
│  │  │  • Full kernel API available                                 │      ││
│  │  │                                                               │      ││
│  │  └───────────────────────────────────────────────────────────────┘      ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
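
The hardirq-to-deferred-work handoff in the diagram can be modeled in user space. The sketch below is a single-threaded toy, not kernel code: `event_ring`, `top_half()`, and `bottom_half()` are hypothetical names, and the "heavy work" is just a sum. It shows the shape of the split: the top half only records the event and flags pending work, while the bottom half drains everything later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8

struct event_ring {
    uint32_t slots[RING_SIZE];
    size_t head;                /* next slot the top half writes */
    size_t tail;                /* next slot the bottom half reads */
    bool bottom_half_pending;
};

/* "Top half": record the event and request deferred processing.
 * Returns false if the ring is full and the event was dropped. */
static bool top_half(struct event_ring *r, uint32_t status)
{
    if (r->head - r->tail == RING_SIZE)
        return false;
    r->slots[r->head % RING_SIZE] = status;
    r->head++;
    r->bottom_half_pending = true;
    return true;
}

/* "Bottom half": drain everything queued so far, return the count. */
static size_t bottom_half(struct event_ring *r, uint64_t *sum)
{
    size_t handled = 0;

    assert(sum != NULL);
    while (r->tail != r->head) {
        *sum += r->slots[r->tail % RING_SIZE];  /* stand-in for heavy work */
        r->tail++;
        handled++;
    }
    r->bottom_half_pending = false;
    return handled;
}
```

If the bottom half falls behind, the ring fills and `top_half()` starts dropping events, which is the user-space analogue of dropped packets when deferred processing cannot keep up.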

***

## Interrupt Types

### Exceptions vs Interrupts

| Type                   | Cause                | Synchronous? | Examples                            |
| ---------------------- | -------------------- | ------------ | ----------------------------------- |
| **Exception**          | CPU execution        | Yes          | Page fault, divide by zero, syscall |
| **Hardware IRQ**       | External device      | No           | NIC, disk, timer, keyboard          |
| **Software Interrupt** | Explicit instruction | Yes          | `int 0x80`, `syscall`               |

### Exception Categories

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         EXCEPTION TYPES (x86-64)                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  FAULTS                                                                      │
│  ────────                                                                   │
│  • Recoverable: execution can continue after handling                       │
│  • Return address = instruction that caused fault                           │
│  • Examples: page fault (#PF), segment not present (#NP)                   │
│                                                                              │
│  TRAPS                                                                       │
│  ──────                                                                     │
│  • Intentional: used for debugging and system calls                        │
│  • Return address = next instruction                                        │
│  • Examples: breakpoint (#BP), overflow (#OF), int 0x80                    │
│                                                                              │
│  ABORTS                                                                      │
│  ────────                                                                   │
│  • Severe: cannot continue execution                                        │
│  • Examples: machine check (#MC), double fault (#DF)                       │
│                                                                              │
│  COMMON EXCEPTION VECTORS                                                    │
│  ─────────────────────────                                                  │
│  0:  #DE - Divide Error                                                     │
│  3:  #BP - Breakpoint                                                       │
│  6:  #UD - Invalid Opcode                                                   │
│  8:  #DF - Double Fault                                                     │
│  13: #GP - General Protection                                               │
│  14: #PF - Page Fault                                                       │
│  18: #MC - Machine Check                                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
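
The vector list and the fault/trap/abort classes above can be combined into a small lookup table. This is an illustrative user-space sketch with made-up names (`exc_info`, `lookup_exception()`); it covers only the vectors listed in the diagram.

```c
#include <assert.h>
#include <stddef.h>

enum exc_class { EXC_FAULT, EXC_TRAP, EXC_ABORT };

struct exc_info {
    int vector;
    const char *mnemonic;
    enum exc_class cls;
};

/* Only the vectors listed in the diagram above. */
static const struct exc_info exc_table[] = {
    {  0, "#DE", EXC_FAULT },   /* divide error */
    {  3, "#BP", EXC_TRAP  },   /* breakpoint */
    {  6, "#UD", EXC_FAULT },   /* invalid opcode */
    {  8, "#DF", EXC_ABORT },   /* double fault */
    { 13, "#GP", EXC_FAULT },   /* general protection */
    { 14, "#PF", EXC_FAULT },   /* page fault */
    { 18, "#MC", EXC_ABORT },   /* machine check */
};

/* Returns the table entry for a vector, or NULL if it is not listed. */
static const struct exc_info *lookup_exception(int vector)
{
    assert(vector >= 0 && vector <= 255);
    for (size_t i = 0; i < sizeof(exc_table) / sizeof(exc_table[0]); i++)
        if (exc_table[i].vector == vector)
            return &exc_table[i];
    return NULL;
}
```

The class matters for the return address: after a fault the CPU re-executes the faulting instruction, which is what lets a page fault handler map the page and retry transparently.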

***

## Interrupt Descriptor Table (IDT)

The IDT maps interrupt vectors to handlers:

```c theme={null}
// arch/x86/kernel/idt.c (simplified)
struct idt_data {
    unsigned int    vector;      // Interrupt number (0-255)
    unsigned int    segment;     // Code segment selector
    struct idt_bits bits;        // Type, DPL, present
    const void      *addr;       // Handler address
};

// IDT entries (condensed; the kernel splits these across several tables such as early_idts and def_idts)
static const __initconst struct idt_data early_idts[] = {
    INTG(X86_TRAP_DE,     asm_exc_divide_error),       // #DE
    INTG(X86_TRAP_NMI,    asm_exc_nmi),                // NMI
    INTG(X86_TRAP_BP,     asm_exc_int3),               // #BP
    INTG(X86_TRAP_OF,     asm_exc_overflow),           // #OF
    INTG(X86_TRAP_UD,     asm_exc_invalid_op),         // #UD
    INTG(X86_TRAP_DF,     asm_exc_double_fault),       // #DF
    INTG(X86_TRAP_GP,     asm_exc_general_protection), // #GP
    INTG(X86_TRAP_PF,     asm_exc_page_fault),         // #PF
    // ... more entries
};

// Hardware interrupts (IRQs) start at vector 32
#define FIRST_EXTERNAL_VECTOR 0x20
```
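
To make the table entries less abstract, here is a user-space sketch of how a handler address is split across the fields of a 16-byte x86-64 gate descriptor. The struct and helper names (`idt_gate`, `make_gate()`, `gate_handler()`) are illustrative and not the kernel's own definitions, and the selector value used in any call is just an example.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit IDT gate descriptor: 16 bytes, handler address in three pieces. */
struct idt_gate {
    uint16_t offset_low;     /* handler address bits 0..15 */
    uint16_t segment;        /* code segment selector */
    uint8_t  ist;            /* bits 0..2: IST index, upper bits zero */
    uint8_t  type_attr;      /* type (4 bits), 0, DPL (2 bits), present */
    uint16_t offset_middle;  /* handler address bits 16..31 */
    uint32_t offset_high;    /* handler address bits 32..63 */
    uint32_t reserved;
};

_Static_assert(sizeof(struct idt_gate) == 16, "gate must be 16 bytes");

#define GATE_TYPE_INTERRUPT 0xE  /* interrupts are disabled on entry */
#define GATE_TYPE_TRAP      0xF  /* interrupt flag is left unchanged */

static struct idt_gate make_gate(uint64_t handler, uint16_t segment,
                                 uint8_t type, uint8_t dpl, uint8_t ist)
{
    struct idt_gate g = {0};

    assert(type <= 0xF && dpl <= 3 && ist <= 7);
    g.offset_low    = (uint16_t)(handler & 0xFFFF);
    g.segment       = segment;
    g.ist           = ist;
    g.type_attr     = (uint8_t)(0x80 | (dpl << 5) | type);  /* 0x80 = present */
    g.offset_middle = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high   = (uint32_t)(handler >> 32);
    return g;
}

/* Reassemble the 64-bit handler address from its three pieces. */
static uint64_t gate_handler(const struct idt_gate *g)
{
    return ((uint64_t)g->offset_high << 32) |
           ((uint64_t)g->offset_middle << 16) |
           g->offset_low;
}
```

The DPL field is what decides whether user space may trigger the vector with a software interrupt: a gate with DPL 3 can be invoked from ring 3, while a gate with DPL 0 raises a general protection fault if user code tries.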

### Viewing Interrupt Information

```bash theme={null}
# View interrupt statistics
cat /proc/interrupts

# Example output:
#            CPU0       CPU1       CPU2       CPU3
#   0:         25          0          0          0  IR-IO-APIC   2-edge      timer
#   8:          0          0          0          1  IR-IO-APIC   8-edge      rtc0
#  16:          0          0          0          0  IR-IO-APIC  16-fasteoi   ehci_hcd
# 120:          0          0          0          0  DMAR-MSI    0-edge       dmar0
# 121:     234567      12345       9876       8765  IR-PCI-MSI 512000-edge  nvme0q0
# 122:          0     987654          0          0  IR-PCI-MSI 512001-edge  nvme0q1
# NMI:          0          0          0          0   Non-maskable interrupts
# LOC:    1234567    1234567    1234567    1234567   Local timer interrupts

# View affinity for a specific IRQ
cat /proc/irq/121/smp_affinity      # Hex mask of allowed CPUs
cat /proc/irq/121/smp_affinity_list # Human-readable CPU list
```
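
When scripting around `/proc/interrupts`, a common first step is to total the per-CPU columns of one row. The helper below is a minimal sketch under the assumption that the row has the layout shown above (label, colon, one counter per CPU, then descriptive text). `sum_irq_row()` is a hypothetical name, and the caller must supply the CPU count.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sum up to `ncpus` per-CPU counters on one /proc/interrupts row.
 * Returns 0 if the row has no "label:" prefix. */
static unsigned long long sum_irq_row(const char *line, int ncpus)
{
    unsigned long long total = 0;
    const char *p;

    assert(line != NULL && ncpus > 0);

    p = strchr(line, ':');          /* skip the row label ("121:", "NMI:") */
    if (p == NULL)
        return 0;
    p++;

    for (int i = 0; i < ncpus; i++) {
        char *end;
        unsigned long long v = strtoull(p, &end, 10);

        if (end == p)               /* no more numeric columns */
            break;
        total += v;
        p = end;
    }
    return total;
}
```

Comparing these totals between two samples taken a second apart gives an interrupt rate per IRQ, and comparing individual columns shows whether one CPU is taking a disproportionate share.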

***

## Hardware IRQ Handling

### IRQ Handler Registration

```c theme={null}
// include/linux/interrupt.h
int request_irq(unsigned int irq,
                irq_handler_t handler,
                unsigned long flags,
                const char *name,
                void *dev_id);

// flags:
#define IRQF_SHARED         0x00000080  // Multiple devices share IRQ
#define IRQF_TRIGGER_HIGH   0x00000004  // Level-triggered, active high
#define IRQF_TRIGGER_RISING 0x00000001  // Edge-triggered, rising
#define IRQF_ONESHOT        0x00002000  // Keep IRQ masked until the threaded handler completes

// Handler prototype
typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);

// Handler return values (include/linux/irqreturn.h)
enum irqreturn {
    IRQ_NONE        = (0 << 0),  // Interrupt wasn't from this device
    IRQ_HANDLED     = (1 << 0),  // Interrupt was handled
    IRQ_WAKE_THREAD = (1 << 1),  // Wake threaded handler
};
```

### Example: Network Driver IRQ Handler

```c theme={null}
// Simplified network driver interrupt handler
static irqreturn_t my_net_irq_handler(int irq, void *dev_id)
{
    struct my_net_device *dev = dev_id;
    u32 status;
    
    // Read interrupt status register
    status = my_read_reg(dev, INTR_STATUS);
    
    if (!(status & MY_INTR_MASK))
        return IRQ_NONE;  // Not our interrupt (shared IRQ line)
    
    // Acknowledge interrupt to hardware
    my_write_reg(dev, INTR_ACK, status);
    
    if (status & INTR_RX_DONE) {
        // Disable RX interrupts, schedule NAPI poll
        my_write_reg(dev, INTR_DISABLE, INTR_RX_DONE);
        napi_schedule(&dev->napi);  // Schedule softirq
    }
    
    if (status & INTR_TX_DONE) {
        // TX completion - can do minimal work here
        tasklet_schedule(&dev->tx_tasklet);
    }
    
    return IRQ_HANDLED;
}
```
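
The `IRQ_NONE` check at the top of the handler only makes sense once you see how a shared line is dispatched: every handler registered on the line is called, and each one decides whether its device raised the interrupt. The following user-space model sketches that behavior. `fake_dev`, `irq_action`, and `dispatch_shared_irq()` are illustrative names, and the types are redefined locally so the example compiles outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;
typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);

struct fake_dev {
    uint32_t status;    /* pretend interrupt status register */
    unsigned handled;   /* how many interrupts this device serviced */
};

struct irq_action {
    irq_handler_t handler;
    void *dev_id;       /* distinguishes devices sharing the line */
};

static irqreturn_t fake_handler(int irq, void *dev_id)
{
    struct fake_dev *dev = dev_id;

    (void)irq;
    if (dev->status == 0)
        return IRQ_NONE;    /* our device did not raise this interrupt */
    dev->status = 0;        /* acknowledge */
    dev->handled++;
    return IRQ_HANDLED;
}

/* Call every handler on the line; return how many claimed the interrupt. */
static int dispatch_shared_irq(int irq, struct irq_action *actions, size_t n)
{
    int claimed = 0;

    assert(actions != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (actions[i].handler(irq, actions[i].dev_id) == IRQ_HANDLED)
            claimed++;
    return claimed;
}
```

A return value of zero from the dispatch step corresponds to a spurious interrupt: nobody claimed it. A handler that returns `IRQ_HANDLED` unconditionally on a shared line hides exactly that condition.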

***

## Softirqs: High-Priority Deferred Work

**The analogy**: If hardirqs are the smoke alarm (drop everything, acknowledge immediately), softirqs are the **cleanup crew** that arrives after the alarm stops. They run with interrupts re-enabled, so new alarms can still ring, but they still cannot take a nap (no sleeping) because they need to be fast. Softirqs are how the kernel processes network packets in bulk, completes block I/O, and runs timer callbacks -- all the heavy lifting deferred from the hardirq handler.

### Softirq Types

```c theme={null}
// include/linux/interrupt.h
enum {
    HI_SOFTIRQ = 0,        // High-priority tasklets
    TIMER_SOFTIRQ,         // Timer processing
    NET_TX_SOFTIRQ,        // Network transmit
    NET_RX_SOFTIRQ,        // Network receive
    BLOCK_SOFTIRQ,         // Block device completion
    IRQ_POLL_SOFTIRQ,      // IRQ polling
    TASKLET_SOFTIRQ,       // Regular tasklets
    SCHED_SOFTIRQ,         // Scheduler load balancing
    HRTIMER_SOFTIRQ,       // High-resolution timers
    RCU_SOFTIRQ,           // RCU callbacks
    NR_SOFTIRQS
};
```

### Softirq Properties

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         SOFTIRQ CHARACTERISTICS                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Properties:                                                                 │
│  • Fixed number (10 currently, compile-time)                                │
│  • Run with interrupts enabled                                              │
│  • Cannot sleep or block                                                    │
│  • Same softirq can run simultaneously on different CPUs                    │
│  • Must use appropriate locking                                             │
│                                                                              │
│  Execution Points:                                                           │
│  • After hardirq handler returns                                            │
│  • When explicitly enabled (local_bh_enable())                              │
│  • By ksoftirqd kernel threads (when too many pending)                      │
│                                                                              │
│  Priority Order:                                                             │
│  HI_SOFTIRQ > TIMER > NET_TX > NET_RX > BLOCK > ... > RCU                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
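
The priority order above falls out of how pending softirqs are tracked: one bit per softirq in a pending mask, scanned from the lowest bit upward. The sketch below models that in user space. The `_sim` functions are hypothetical stand-ins, not the kernel's implementation, and a single global replaces the kernel's per-CPU mask.

```c
#include <assert.h>
#include <stdint.h>

enum {
    HI_SOFTIRQ = 0, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ, IRQ_POLL_SOFTIRQ, TASKLET_SOFTIRQ, SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ, RCU_SOFTIRQ, NR_SOFTIRQS
};

static uint32_t pending;   /* one bit per softirq (per CPU in the kernel) */

/* Mark a softirq as pending; raising it twice is the same as once. */
static void raise_softirq_sim(unsigned nr)
{
    assert(nr < NR_SOFTIRQS);
    pending |= 1u << nr;
}

/* Snapshot and clear the mask, then "run" set bits from lowest to highest.
 * Writes the order into out[] (up to max entries) and returns the count. */
static int run_softirqs_sim(int *out, int max)
{
    uint32_t snapshot = pending;
    int n = 0;

    pending = 0;
    for (int nr = 0; nr < NR_SOFTIRQS; nr++) {
        if (!(snapshot & (1u << nr)))
            continue;
        if (n < max)
            out[n] = nr;
        n++;
    }
    return n;
}
```

Because the mask is a bitmap, raising the same softirq many times before it runs results in a single pass of its handler, so handlers have to drain all queued work rather than process one item per invocation.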

### Viewing Softirq Activity

```bash theme={null}
# View softirq statistics per CPU
cat /proc/softirqs
#                    CPU0       CPU1       CPU2       CPU3
#          HI:          0          0          0          0
#       TIMER:   12345678   12345678   12345678   12345678
#      NET_TX:      12345      12345      12345      12345
#      NET_RX:   98765432    1234567     123456      12345
#       BLOCK:     123456     234567     345678     456789
#    IRQ_POLL:          0          0          0          0
#     TASKLET:       1234       2345       3456       4567
#       SCHED:    1234567    1234567    1234567    1234567
#     HRTIMER:     123456     123456     123456     123456
#         RCU:    1234567    1234567    1234567    1234567

# Watch ksoftirqd CPU usage
top -p $(pgrep -d',' ksoftirqd)
```

***

## Tasklets: Dynamic Deferred Work

Tasklets are built on top of softirqs but are easier to use: they can be created at runtime, and a given tasklet never runs on two CPUs at once:

```c theme={null}
// Tasklet declaration
struct tasklet_struct {
    struct tasklet_struct *next;
    unsigned long state;      // TASKLET_STATE_SCHED, TASKLET_STATE_RUN
    atomic_t count;           // Disable count
    void (*func)(unsigned long);
    unsigned long data;
};

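// Static initialization (pick one form per tasklet).
// This is the legacy three-argument API; since Linux 5.9, DECLARE_TASKLET()
// takes a callback that receives the tasklet_struct pointer, and
// tasklet_setup() is the dynamic equivalent.
DECLARE_TASKLET(my_tasklet, my_tasklet_handler, data);
DECLARE_TASKLET_DISABLED(my_other_tasklet, my_tasklet_handler, data);

// Dynamic initialization (legacy form)
tasklet_init(&my_tasklet, my_tasklet_handler, data);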

// Schedule for execution
tasklet_schedule(&my_tasklet);     // Normal priority (TASKLET_SOFTIRQ)
tasklet_hi_schedule(&my_tasklet);  // High priority (HI_SOFTIRQ)

// Control
tasklet_disable(&my_tasklet);  // Prevent execution
tasklet_enable(&my_tasklet);   // Re-enable
tasklet_kill(&my_tasklet);     // Remove (waits if running)
```

### Tasklet vs Softirq

| Feature         | Softirq                               | Tasklet                                     |
| --------------- | ------------------------------------- | ------------------------------------------- |
| **Concurrency** | Same softirq can run on multiple CPUs | Same tasklet runs on only one CPU at a time |
| **Definition**  | Static (compile-time)                 | Dynamic (runtime)                           |
| **Locking**     | Must handle SMP yourself              | Serialized per-tasklet                      |
| **Use case**    | High-frequency, performance-critical  | General deferred work                       |

***

## Workqueues: Process Context Deferred Work

When the deferred work needs to sleep, for example to take a mutex or allocate memory with `GFP_KERNEL`, use a workqueue:

```c theme={null}
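// Embed the work item in the device structure so that the handler
// can recover the device with container_of()
struct my_device {
    struct work_struct work;
    struct mutex lock;
    // ...
};

// Work handler: runs in process context on a kworker thread
static void my_work_handler(struct work_struct *work)
{
    struct my_device *dev = container_of(work, struct my_device, work);
    void *buf;

    // Allocate memory (GFP_KERNEL may sleep)
    buf = kmalloc(4096, GFP_KERNEL);
    if (!buf)
        return;

    // Take a lock that might sleep, then do the heavy processing
    mutex_lock(&dev->lock);
    process_data(dev);
    mutex_unlock(&dev->lock);

    kfree(buf);
}

// Initialize once, e.g. in the driver's probe function
INIT_WORK(&dev->work, my_work_handler);

// Schedule work
schedule_work(&dev->work);                  // Global workqueue (system_wq)
queue_work(my_workqueue, &dev->work);       // Custom workqueue
// my_dwork is a struct delayed_work set up with INIT_DELAYED_WORK()
schedule_delayed_work(&my_dwork, HZ * 5);   // Delayed by 5 seconds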
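
The `container_of()` call in the handler is plain pointer arithmetic, and it can be tried out in user space. The sketch below defines a simplified version of the macro using `offsetof` and uses a stand-in `work_item` type instead of the kernel's `work_struct`. `run_work()` is a hypothetical substitute for a worker thread picking up the item.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space version: step back from the member's address
 * to the start of the structure that contains it. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct work_item {
    void (*func)(struct work_item *work);
};

struct my_device {
    int id;
    int processed;
    struct work_item work;   /* embedded, not a pointer */
};

static void my_work_handler(struct work_item *work)
{
    /* The handler only receives the work item; recover the device. */
    struct my_device *dev = container_of(work, struct my_device, work);

    dev->processed++;
}

/* Stand-in for a worker thread executing one queued item. */
static void run_work(struct work_item *work)
{
    assert(work != NULL && work->func != NULL);
    work->func(work);
}
```

This only works when the work item is embedded in the device structure. Passing a standalone work item to a handler that uses `container_of()` would compute a pointer into unrelated memory.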

### Workqueue Types

```c theme={null}
// Create workqueues
// Unbound workqueue (work may run on any CPU) that is safe to use
// on the memory-reclaim path
struct workqueue_struct *wq = alloc_workqueue("my_wq",
    WQ_UNBOUND | WQ_MEM_RECLAIM, max_active);

// Flags:
#define WQ_UNBOUND       (1 << 1)   // Work can run on any CPU
#define WQ_FREEZABLE     (1 << 2)   // Participate in system suspend
#define WQ_MEM_RECLAIM   (1 << 3)   // Can be used for memory reclaim
#define WQ_HIGHPRI       (1 << 4)   // High priority workers
#define WQ_CPU_INTENSIVE (1 << 5)   // CPU-intensive work

// System workqueues
system_wq              // Default, bound
system_highpri_wq      // High priority
system_long_wq         // For long-running work
system_unbound_wq      // Unbound: workers not tied to a specific CPU
system_freezable_wq    // Freezable for suspend
```

### Concurrency Managed Workqueue (cmwq)

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CONCURRENCY MANAGED WORKQUEUE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Work Items                                                                  │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐                                   │
│  │ W1  │ │ W2  │ │ W3  │ │ W4  │ │ W5  │                                   │
│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘                                   │
│     │       │       │       │       │                                        │
│     └───────┴───────┼───────┴───────┘                                        │
│                     │                                                         │
│                     ▼                                                         │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                    Per-CPU Worker Pools                                  ││
│  │                                                                          ││
│  │  CPU 0 Pool              CPU 1 Pool              CPU 2 Pool             ││
│  │  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐    ││
│  │  │ Worker threads: │     │ Worker threads: │     │ Worker threads: │    ││
│  │  │ [kworker/0:0]  │     │ [kworker/1:0]  │     │ [kworker/2:0]  │    ││
│  │  │ [kworker/0:1]  │     │ [kworker/1:1]  │     │ [kworker/2:1]  │    ││
│  │  │ (dynamic)      │     │ (dynamic)      │     │ (dynamic)      │    ││
│  │  └─────────────────┘     └─────────────────┘     └─────────────────┘    ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
│  Key Properties:                                                             │
│  • Workers created/destroyed dynamically based on load                       │
│  • Work items queued to per-CPU pools for locality                          │
│  • Automatic concurrency management (avoids thundering herd)                │
│  • WQ_UNBOUND work uses separate unbound pools                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## Choosing the Right Deferred Work

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    DEFERRED WORK DECISION TREE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Need to defer work from interrupt context?                                  │
│  │                                                                           │
│  └─► Does the work need to sleep?                                           │
│      │                                                                       │
│      ├─ YES ──► Use WORKQUEUE                                               │
│      │          • Can sleep (mutex_lock, kmalloc with GFP_KERNEL)           │
│      │          • Full kernel API available                                 │
│      │          • Higher latency (process context switch)                   │
│      │                                                                       │
│      └─ NO ───► Is it performance-critical networking/block I/O?            │
│                 │                                                            │
│                 ├─ YES ──► Use SOFTIRQ (if modifying kernel)               │
│                 │          or NAPI (for network drivers)                    │
│                 │          • Lowest latency                                 │
│                 │          • Runs at highest priority                       │
│                 │          • Need to handle SMP locking                     │
│                 │                                                            │
│                 └─ NO ───► Use TASKLET                                      │
│                            • Simpler than softirq                           │
│                            • Serialized execution                           │
│                            • Good for most driver deferred work             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## Interrupt Affinity and Performance

### Setting IRQ Affinity

```bash theme={null}
# View current affinity (hex bitmask)
cat /proc/irq/121/smp_affinity
# 0000000f means CPUs 0-3

# Set affinity to CPUs 0 and 1
echo 3 > /proc/irq/121/smp_affinity

# Using affinity list (human-readable)
echo 0-3 > /proc/irq/121/smp_affinity_list
echo 0,2,4,6 > /proc/irq/121/smp_affinity_list

# Check if irqbalance is running (it may override your settings)
systemctl status irqbalance

# Disable irqbalance for manual control
systemctl stop irqbalance
```
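
The two files express the same setting in different notations, and converting between them by hand is error-prone. As a rough sketch, the helper below turns a CPU list such as `0-3` or `0,2,4,6` into the corresponding bitmask. `cpulist_to_mask()` is a hypothetical name, and the sketch is limited to CPUs 0 through 63.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Convert a CPU list ("0-3", "0,2,4,6") to a bitmask.
 * Handles CPUs 0..63 only; returns 0 for a malformed list. */
static uint64_t cpulist_to_mask(const char *list)
{
    uint64_t mask = 0;
    const char *p = list;

    assert(list != NULL);
    while (*p != '\0') {
        char *end;
        unsigned long lo = strtoul(p, &end, 10);
        unsigned long hi = lo;

        if (end == p)
            return 0;               /* expected a number */
        p = end;
        if (*p == '-') {            /* range "lo-hi" */
            p++;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return 0;
            p = end;
        }
        if (lo > hi || hi > 63)
            return 0;
        for (unsigned long cpu = lo; cpu <= hi; cpu++)
            mask |= (uint64_t)1 << cpu;
        if (*p == ',')
            p++;
        else if (*p != '\0')
            return 0;               /* unexpected character */
    }
    return mask;
}
```

For example, `0,2,4,6` maps to `0x55` and `0-3` maps to `0xf`, matching the `0000000f` value shown above for CPUs 0-3.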

### Performance Tuning Patterns

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    IRQ AFFINITY STRATEGIES                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. NETWORK-INTENSIVE WORKLOADS                                              │
│     ────────────────────────────                                            │
│     • Pin NIC IRQs to dedicated CPUs                                        │
│     • Use RPS/RFS for receive steering                                      │
│     • Enable XPS for transmit steering                                      │
│     • Consider isolcpus for application CPUs                                │
│                                                                              │
│     Example (4-queue NIC, 8 CPUs):                                          │
│     IRQ 121 (rx-0) → CPU 0                                                  │
│     IRQ 122 (rx-1) → CPU 1                                                  │
│     IRQ 123 (rx-2) → CPU 2                                                  │
│     IRQ 124 (rx-3) → CPU 3                                                  │
│     CPUs 4-7 → Application threads                                          │
│                                                                              │
│  2. LATENCY-SENSITIVE WORKLOADS                                              │
│     ──────────────────────────                                              │
│     • Isolate CPUs for application (isolcpus=)                              │
│     • Pin all IRQs to housekeeping CPUs                                     │
│     • Use nohz_full for tickless operation                                  │
│     • Consider disabling irqbalance                                         │
│                                                                              │
│  3. NUMA-AWARE AFFINITY                                                      │
│     ───────────────────                                                     │
│     • Pin IRQs to same NUMA node as device                                  │
│     • Avoid cross-NUMA memory access                                        │
│     • Check: cat /sys/class/net/eth0/device/numa_node                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Measuring Interrupt Latency

```bash theme={null}
# Using cyclictest for latency measurement
sudo cyclictest -p 80 -t1 -n -i 1000 -l 10000
# -p 80: priority
# -t1: 1 thread
# -n: use nanosleep
# -i 1000: 1ms interval
# -l 10000: 10000 loops

# Using perf for scheduling (wakeup-to-run) latency; record a trace first
sudo perf sched record -- sleep 10
sudo perf sched latency

# Using bpftrace to histogram how long each hardirq handler runs
sudo bpftrace -e '
tracepoint:irq:irq_handler_entry {
    @start[args->irq] = nsecs;
}
tracepoint:irq:irq_handler_exit /@start[args->irq]/ {
    @latency_ns = hist(nsecs - @start[args->irq]);
    delete(@start[args->irq]);
}'
```

***

## NAPI: Network Interrupt Coalescing

NAPI (New API) reduces interrupt overhead for high-throughput networking by switching from one interrupt per packet to polling while traffic is heavy:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         NAPI MECHANISM                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Traditional (interrupt per packet):                                         │
│  ──────────────────────────────────                                         │
│  Packet 1 → IRQ → Handle → Return                                           │
│  Packet 2 → IRQ → Handle → Return                                           │
│  Packet 3 → IRQ → Handle → Return                                           │
│  ... (high CPU overhead)                                                    │
│                                                                              │
│  NAPI (polling mode):                                                        │
│  ─────────────────────                                                      │
│  Packet 1 → IRQ → Disable IRQ → Schedule NAPI                              │
│                                  │                                           │
│                                  ▼                                           │
│                    ┌─────────────────────────────────────┐                  │
│                    │         NAPI Poll Loop              │                  │
│                    │                                     │                  │
│                    │  while (budget > 0) {               │                  │
│                    │      packet = poll_device();        │                  │
│                    │      if (!packet) break;            │                  │
│                    │      process(packet);               │                  │
│                    │      budget--;                      │                  │
│                    │  }                                  │                  │
│                    │                                     │                  │
│                    │  if (budget > 0) {                  │                  │
│                    │      napi_complete();  // Re-enable │                  │
│                    │      enable_irq();     // IRQ       │                  │
│                    │  } else {                           │                  │
│                    │      reschedule();     // More work │                  │
│                    │  }                                  │                  │
│                    └─────────────────────────────────────┘                  │
│                                                                              │
│  Benefits:                                                                   │
│  • Reduced interrupt overhead                                               │
│  • Better CPU cache utilization                                             │
│  • Back-pressure mechanism (budget)                                         │
│  • Automatic adaptation to load                                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### NAPI API

```c theme={null}
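// Initialize NAPI. Kernels 6.1 and later take three arguments and use the
// default weight; older kernels take a fourth, such as NAPI_POLL_WEIGHT.
netif_napi_add(netdev, &dev->napi, my_poll);
napi_enable(&dev->napi);

// In IRQ handler
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    struct my_device *dev = dev_id;

    // Mask the device's RX interrupt and schedule NAPI.
    // my_device_mask_irqs() is a placeholder for a device register write.
    // Do not call disable_irq() here: it takes an IRQ number and waits for
    // the running handler (this one) to finish.
    if (napi_schedule_prep(&dev->napi)) {
        my_device_mask_irqs(dev);
        __napi_schedule(&dev->napi);
    }
    return IRQ_HANDLED;
}

// Poll function
static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_device *dev = container_of(napi, struct my_device, napi);
    int work_done = 0;

    while (work_done < budget) {
        // get_next_packet() is a placeholder for pulling from the RX ring
        struct sk_buff *skb = get_next_packet(dev);
        if (!skb)
            break;

        // Hand the packet to the network stack (with GRO)
        napi_gro_receive(napi, skb);
        work_done++;
    }

    // Queue drained: leave polling mode. napi_complete_done() returns false
    // when NAPI must keep polling, so unmask only when it returns true.
    if (work_done < budget && napi_complete_done(napi, work_done))
        my_device_unmask_irqs(dev);  // placeholder, mirrors the mask above

    return work_done;
}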
```
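
The budget contract in `my_poll` can be explored outside the kernel. The following is a minimal userspace sketch, not driver code: `struct sim_queue`, `sim_poll`, and `sim_drain` are hypothetical names invented for illustration, and a packet is reduced to a counter. It assumes a positive budget.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a NIC receive queue. */
struct sim_queue {
    size_t pending;      /* packets waiting in the ring */
    bool   irq_enabled;  /* whether the device may raise an interrupt */
};

/* Mirrors the poll contract: process at most `budget` packets.
 * Returning less than `budget` means the queue drained, so polling
 * stops and the interrupt is re-armed. Returning exactly `budget`
 * tells the caller to poll again. */
static int sim_poll(struct sim_queue *q, int budget)
{
    int work_done = 0;

    while (work_done < budget && q->pending > 0) {
        q->pending--;          /* "process" one packet */
        work_done++;
    }

    if (work_done < budget)
        q->irq_enabled = true; /* queue empty: back to interrupt mode */

    return work_done;
}

/* Simulates one burst: the first packet raises an IRQ, the handler
 * masks it, and the queue is polled until it drains.
 * Returns the number of poll calls that were needed. */
static int sim_drain(struct sim_queue *q, int budget)
{
    int polls = 0;

    assert(budget > 0);        /* a zero budget would never re-arm the IRQ */
    q->irq_enabled = false;    /* what the hardirq handler does */
    do {
        sim_poll(q, budget);
        polls++;
    } while (!q->irq_enabled);

    return polls;
}
```

With 150 pending packets and a budget of 64, `sim_drain` needs three polls (64, 64, 22) for a single interrupt. A queue holding exactly 128 packets also needs three: the second poll uses its whole budget, so one more poll is required to discover that the queue is empty.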

***

## Threaded IRQs

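When a handler needs to sleep (take a mutex, talk to a slow bus such as I2C or SPI), split it into a minimal hardirq part and a threaded part that runs in a dedicated kernel thread in process context: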

```c theme={null}
// Request threaded IRQ
int request_threaded_irq(unsigned int irq,
                         irq_handler_t handler,        // Hardirq handler
                         irq_handler_t thread_fn,      // Threaded handler
                         unsigned long flags,
                         const char *name,
                         void *dev_id);

// Example
static irqreturn_t my_hardirq_handler(int irq, void *dev_id)
{
    // Quick check: is this our interrupt?
    if (!check_interrupt_pending(dev_id))
        return IRQ_NONE;
    
    // Acknowledge hardware
    ack_interrupt(dev_id);
    
    // Wake threaded handler
    return IRQ_WAKE_THREAD;
}

static irqreturn_t my_threaded_handler(int irq, void *dev_id)
{
    // Can sleep here!
    mutex_lock(&dev_lock);
    process_data(dev_id);
    mutex_unlock(&dev_lock);
    
    return IRQ_HANDLED;
}

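// Registration (typically in probe). IRQF_ONESHOT keeps the line masked
// until my_threaded_handler() returns. The return value must be checked.
int ret = request_threaded_irq(irq, my_hardirq_handler, my_threaded_handler,
                               IRQF_ONESHOT, "my_device", dev);
if (ret)
    return ret;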
```

***

## Interview Questions

<AccordionGroup>
  <Accordion title="Q: Explain the difference between hardirq and softirq context" icon="question-circle">
    **Answer**:

    **Hardirq context** (top half):

    * Runs with interrupts disabled on local CPU
    * Must be extremely fast (microseconds)
    * Cannot sleep or allocate memory with GFP\_KERNEL
    * Preempts any code running with interrupts enabled, including kernel code and softirqs

    **Softirq context** (bottom half):

    * Runs with interrupts enabled
    * Can be preempted by hardirqs
    * Still cannot sleep (atomic context)
    * Used for deferred processing (networking, block I/O)

    **Key insight**: Split processing into minimal hardirq work (acknowledge, disable, schedule) and heavier softirq work (process data).
  </Accordion>

  <Accordion title="Q: When would you use a workqueue vs a tasklet?" icon="question-circle">
    **Answer**:

    **Use workqueue when**:

    * You need to sleep (mutex, blocking I/O)
    * You need to allocate memory with GFP\_KERNEL
    * The work is not time-critical
    * You need to call functions that might block

    **Use tasklet when**:

    * Work must be done quickly after interrupt
    * You don't need to sleep
    * Work is small and fast
    * You want serialization (same tasklet won't run concurrently)

    **Example**: Network driver TX completion → tasklet (fast, no sleeping). Firmware loading → workqueue (needs file I/O, can sleep).
  </Accordion>

  <Accordion title="Q: How does NAPI improve network performance?" icon="question-circle">
    **Answer**:

    **Problem**: At high packet rates, per-packet interrupts cause:

    * High CPU overhead (interrupt entry and exit cost paid for every packet)
    * Cache thrashing
    * Interrupt storms (CPU spends all time in IRQ handlers)

    **NAPI solution**:

    1. First packet triggers interrupt
    2. Disable further interrupts for that queue
    3. Switch to polling mode (softirq)
    4. Process packets in batches (budget-based)
    5. Re-enable interrupts when queue is empty

    **Benefits**:

    * Amortized interrupt cost across many packets
    * Better cache locality (process batch together)
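    * Natural back-pressure (the budget caps work per poll; under overload, excess packets are dropped in the NIC ring instead of starving the rest of the system)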
    * Scales to millions of packets/second
  </Accordion>

  <Accordion title="Q: How would you debug an interrupt storm?" icon="question-circle">
    **Answer**:

    **Symptoms**:

    * High CPU usage in `si` (softirq) or `hi` (hardirq)
    * System unresponsive
    * `/proc/interrupts` shows rapidly increasing counts

    **Debugging steps**:

    ```bash theme={null}
    # 1. Identify which IRQ
    watch -n 1 cat /proc/interrupts

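    # 2. Check which device owns it (the name is the last column)
    grep -E '^ *<irq_num>:' /proc/interrupts
    ls -la /proc/irq/<irq_num>/

    # Check which CPUs are allowed to service it
    cat /proc/irq/<irq_num>/smp_affinity_list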

    # 3. Monitor interrupt rate
    perf stat -e irq:irq_handler_entry -a sleep 1

    # 4. Trace specific IRQ
    sudo bpftrace -e '
    tracepoint:irq:irq_handler_entry /args->irq == 121/ {
        @count = count();
    }'
    ```

    **Common causes**:

    * Faulty hardware generating spurious interrupts
    * Driver bug not acknowledging interrupt properly
    * Shared IRQ with misbehaving device
    * Misconfigured interrupt coalescing
  </Accordion>
</AccordionGroup>

***

## Practice Exercises

<Steps>
  <Step title="Interrupt Statistics">
    Write a script that monitors `/proc/interrupts` and alerts when any IRQ rate exceeds a threshold
  </Step>

  <Step title="Affinity Configuration">
    Set up IRQ affinity for a multi-queue NIC to optimize for either throughput or latency
  </Step>

  <Step title="eBPF Tracing">
    Write a bpftrace script to trace interrupt handler latency and identify slow handlers
  </Step>

  <Step title="Workqueue Analysis">
    Use `workqueue:*` tracepoints to analyze work item execution patterns
  </Step>
</Steps>
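
As one possible starting point for the first exercise, the sketch below sums the per-CPU counters on a single `/proc/interrupts` line and turns two samples into a rate. The function names `irq_line_total` and `irq_rate` are made up for illustration, and the sample line in the comment is illustrative rather than captured from a real system. Reading the file, sampling on a timer, and alerting are left out.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sums the per-CPU counters on one data line of /proc/interrupts, e.g.
 *   " 121:   1500   2500   PCI-MSI 524288-edge   eth0-rx-0"
 * `ncpus` is the number of CPU columns in the header line.
 * Returns 0 for lines without a "label:" prefix, such as the header. */
static unsigned long long irq_line_total(const char *line, int ncpus)
{
    const char *p = strchr(line, ':');
    unsigned long long total = 0;

    if (!p)
        return 0;
    p++;                                   /* skip the ':' */

    for (int cpu = 0; cpu < ncpus; cpu++) {
        char *end;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;                         /* description columns or end of line */
        total += strtoull(p, &end, 10);
        p = end;
    }
    return total;
}

/* Interrupts per second between two samples taken interval_s apart.
 * Returns 0 if the interval is not positive or the counter went backwards. */
static double irq_rate(unsigned long long prev, unsigned long long cur,
                       double interval_s)
{
    if (interval_s <= 0.0 || cur < prev)
        return 0.0;
    return (double)(cur - prev) / interval_s;
}
```

Bounding the loop by `ncpus` matters because the description columns can contain numbers of their own; stopping after the CPU columns keeps them out of the sum.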

***

## Summary

| Mechanism    | Context   | Can Sleep? | Use Case                               |
| ------------ | --------- | ---------- | -------------------------------------- |
| Hardirq      | Interrupt | No         | Acknowledge HW, schedule deferred work |
| Softirq      | Atomic    | No         | High-frequency networking, block I/O   |
| Tasklet      | Atomic    | No         | Driver deferred work, serialized       |
| Workqueue    | Process   | Yes        | General deferred work, can block       |
| Threaded IRQ | Process   | Yes        | Device handling that needs sleeping    |

***

## Debugging Tips

```bash theme={null}
# Detect interrupt storms: watch for rapidly increasing counts
watch -d -n 1 cat /proc/interrupts
# If a single IRQ line is incrementing by thousands per second,
# you likely have a misbehaving device or driver bug

# Check if ksoftirqd is consuming excessive CPU
# This indicates softirq backlog -- the kernel cannot keep up
top -p $(pgrep -d',' ksoftirqd)
# If ksoftirqd is using >50% of a core, investigate which softirq type

# Identify which softirq type is dominating
cat /proc/softirqs
# NET_RX growing much faster than others → network receive backlog
# BLOCK growing fast → storage completion backlog
# TIMER growing fast → timer callback overload

# Trace hardirq handler duration (find slow handlers)
sudo bpftrace -e '
tracepoint:irq:irq_handler_entry { @start[args->irq] = nsecs; }
tracepoint:irq:irq_handler_exit /@start[args->irq]/ {
    $dur = (nsecs - @start[args->irq]) / 1000;
    if ($dur > 100) {  // handlers > 100us are suspicious
        printf("SLOW IRQ %d: %d us\n", args->irq, $dur);
    }
    delete(@start[args->irq]);
}'

# Check for IRQ affinity imbalance
# One CPU handling all interrupts while others are idle
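# Uses the header line for the CPU columns, so rows with fewer counters
# (ERR, MIS) or multi-word descriptions do not skew the totals
awk 'NR==1 {n=NF; for(i=1;i<=n;i++) cpu[i]=$i; next}
     {for(i=1;i<=n && $(i+1) ~ /^[0-9]+$/;i++) sum[i]+=$(i+1)}
     END {for(i=1;i<=n;i++) print cpu[i] ": " (sum[i]+0)}' /proc/interrupts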
```

<Warning>
  **Common Misconception**: "Setting IRQ affinity is enough to control interrupt placement." In practice, the `irqbalance` daemon may override your manual settings. Always check `systemctl status irqbalance` and stop it if you need manual control. Also, some NIC drivers pin MSI-X interrupts internally, overriding smp\_affinity writes.
</Warning>

***

## Common Misconceptions

| Misconception                                              | Reality                                                                                                                                                                                                                          |
| ---------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
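| "Softirqs run on the CPU that raised the hardirq"          | True for the softirq itself. Under heavy backlog the work is deferred to that CPU's `ksoftirqd/N` thread, which is bound to the same CPU but must compete with other tasks for scheduling. Processing moves to another CPU only through explicit steering such as RPS/RFS |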
| "Workqueues are slow"                                      | For deferred work that needs sleeping, workqueues add only a few microseconds of overhead. The alternative (busy-waiting in atomic context) is far worse                                                                         |
| "NAPI polling wastes CPU"                                  | NAPI only polls when there are packets. When the queue empties, it automatically re-enables interrupts and stops polling                                                                                                         |
| "Tasklets are the recommended bottom-half mechanism"       | Tasklets are actually semi-deprecated in modern kernel development. New code should prefer threaded IRQs or workqueues. Tasklets have problematic semantics (latency-unpredictable, blocks softirq processing for other devices) |
| "Disabling interrupts with `spin_lock_irq` is always safe" | Only safe if you *know* interrupts were enabled before the lock. Use `spin_lock_irqsave`/`spin_unlock_irqrestore` in code paths that might be called with interrupts already disabled                                            |
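
The last row is easier to see with a toy model. The sketch below is a userspace illustration only: the `sim_*` functions are hypothetical stand-ins that mimic how the `spin_lock_irq` and `spin_lock_irqsave` families treat the local interrupt-enable flag, and they perform no actual locking.

```c
#include <stdbool.h>

/* Hypothetical model of the local CPU's interrupt-enable flag. */
static bool sim_irqs_enabled = true;

/* Mimics spin_lock_irq()/spin_unlock_irq(): disable, then enable
 * unconditionally, whatever the state was before. */
static void sim_lock_irq(void)   { sim_irqs_enabled = false; }
static void sim_unlock_irq(void) { sim_irqs_enabled = true; }

/* Mimics spin_lock_irqsave()/spin_unlock_irqrestore(): remember the
 * caller's state and put back exactly what was saved. */
static void sim_lock_irqsave(bool *flags)
{
    *flags = sim_irqs_enabled;
    sim_irqs_enabled = false;
}

static void sim_unlock_irqrestore(bool flags)
{
    sim_irqs_enabled = flags;
}

/* Returns the interrupt state after the helper's critical section. */
static bool helper_with_lock_irq(void)
{
    sim_lock_irq();
    /* ... critical section ... */
    sim_unlock_irq();
    return sim_irqs_enabled;     /* always true, even if the caller had IRQs off */
}

static bool helper_with_irqsave(void)
{
    bool flags;

    sim_lock_irqsave(&flags);
    /* ... critical section ... */
    sim_unlock_irqrestore(flags);
    return sim_irqs_enabled;     /* same as on entry */
}
```

If a caller enters with `sim_irqs_enabled` set to `false`, `helper_with_lock_irq` returns with interrupts switched back on behind the caller's back, while `helper_with_irqsave` leaves them off.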

***

## Key Takeaways

1. **Split interrupt handling**: Minimal work in hardirq, bulk processing in softirq/workqueue
2. **Choose the right mechanism**: Workqueue or threaded IRQ if you need to sleep; softirq (or a tasklet in legacy code) otherwise
3. **IRQ affinity matters**: Align with NUMA topology and application threads
4. **NAPI is essential**: For any high-performance networking
5. **Monitor interrupt rates**: A sudden spike on a single IRQ line or sustained `ksoftirqd` CPU usage usually points to a device, driver, or affinity problem

***

## Next Steps

* [Kernel Synchronization →](/courses/linux-internals/synchronization) - Locking mechanisms for concurrent access
* [Networking Stack →](/courses/linux-internals/networking-stack) - See NAPI in action
* [eBPF →](/courses/linux-internals/ebpf) - Trace interrupt handlers in production
