Interrupts & Exception Handling

Interrupts are fundamental to how Linux handles hardware events, system calls, and exceptional conditions. Understanding the interrupt subsystem is crucial for debugging performance issues and writing high-performance systems code.

The analogy: Imagine you’re a chef cooking an elaborate meal (the running process). A kitchen timer goes off (hardware interrupt), so you must stop what you’re doing and turn off the oven. Then you decide: do you handle the hot dish right now (hardirq), or set it aside on the counter to plate later when you have a moment (softirq/workqueue)? The key insight is the same as in the kernel: acknowledge the interrupt immediately, but defer the heavy work. If you spend too long handling the timer, all your other dishes burn.

This “top-half / bottom-half” split is the single most important concept in interrupt handling. Get it wrong and you get latency spikes, dropped packets, and unresponsive systems.
Interview Frequency: High (especially for performance-critical roles)
Key Topics: IRQ handling, softirqs, tasklets, workqueues, interrupt coalescing
Time to Master: 12-14 hours

Why Interrupts Matter

Every time a network packet arrives, a disk I/O completes, or a timer fires, an interrupt is involved. Understanding interrupts explains:
  • Why context switches happen: Interrupts can preempt any code
  • Network performance: Interrupt coalescing and NAPI
  • CPU affinity effects: IRQ pinning and load balancing
  • Latency sources: Interrupt storms and processing time
A real-world example: At 10 Gbps with 64-byte packets, a NIC can receive over 14 million packets per second. With one interrupt per packet and 5 microseconds of CPU time per interrupt, that would be roughly 70 seconds of CPU time per second, or about 70 cores doing nothing but handling interrupts. This is why NAPI (polling mode) was invented, and why interrupt affinity tuning is a daily task for infrastructure engineers at companies like Cloudflare and Datadog.
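The packet-rate figure can be checked with a few lines of arithmetic. The sketch below is a user-space calculation, not kernel code; the function names are made up for illustration, and it assumes standard Ethernet framing, where each frame costs an extra 20 bytes on the wire (preamble, start delimiter, and inter-frame gap).

```c
#include <stdint.h>

/* Per-frame wire overhead that is not part of the frame size itself:
 * 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Maximum frames per second for a given link speed and frame size. */
static uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
    uint64_t bits_per_frame =
        (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;
    return link_bps / bits_per_frame;
}

/* CPU-seconds consumed per wall-clock second (that is, the number of
 * fully busy cores) if every packet raised its own interrupt. */
static double cores_needed(uint64_t irqs_per_sec, double usec_per_irq)
{
    return (double)irqs_per_sec * usec_per_irq / 1e6;
}
```

For example, `line_rate_pps(10000000000ULL, 64)` gives the worst-case packet rate of a 10 Gbps link, and `cores_needed(14000000, 5.0)` gives the core count for the scenario described above.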

Interrupt Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LINUX INTERRUPT ARCHITECTURE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HARDWARE LAYER                                                              │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │  ┌─────┐   ┌─────┐   ┌─────┐   ┌──────┐   ┌─────┐                      ││
│  │  │ NIC │   │ Disk│   │Timer│   │ PCIe │   │ USB │                      ││
│  │  └──┬──┘   └──┬──┘   └──┬──┘   └───┬──┘   └──┬──┘                      ││
│  │     │         │         │          │         │                          ││
│  │     └─────────┴────┬────┴──────────┴─────────┘                          ││
│  │                    ▼                                                     ││
│  │          ┌─────────────────────┐                                        ││
│  │          │   Interrupt         │    Modern: MSI/MSI-X                   ││
│  │          │   Controller        │    (Message Signaled Interrupts)       ││
│  │          │   (APIC/GIC)        │                                        ││
│  │          └──────────┬──────────┘                                        ││
│  └─────────────────────│───────────────────────────────────────────────────┘│
│                        │                                                     │
│                        ▼                                                     │
│  CPU                                                                         │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                                                                          ││
│  │  1. CPU receives interrupt signal                                       ││
│  │  2. Saves current context (registers, flags)                           ││
│  │  3. Looks up handler in IDT (Interrupt Descriptor Table)               ││
│  │  4. Switches to kernel stack                                            ││
│  │  5. Jumps to interrupt handler                                          ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                        │                                                     │
│                        ▼                                                     │
│  KERNEL INTERRUPT HANDLING                                                   │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                                                                          ││
│  │  ┌───────────────────────────────────────────────────────────────┐      ││
│  │  │           HARDIRQ CONTEXT (interrupts disabled)               │      ││
│  │  │                                                               │      ││
│  │  │  • Acknowledge interrupt to hardware                          │      ││
│  │  │  • Do MINIMAL work (read status, copy data to buffer)        │      ││
│  │  │  • Schedule deferred work (softirq, tasklet, workqueue)      │      ││
│  │  │  • Return ASAP (microseconds, not milliseconds)              │      ││
│  │  │                                                               │      ││
│  │  └────────────────────────────┬──────────────────────────────────┘      ││
│  │                               │                                          ││
│  │                               ▼                                          ││
│  │  ┌───────────────────────────────────────────────────────────────┐      ││
│  │  │          SOFTIRQ CONTEXT (interrupts enabled)                  │      ││
│  │  │                                                               │      ││
│  │  │  • Process accumulated data (network packets, block I/O)     │      ││
│  │  │  • Can be preempted by hardirqs                              │      ││
│  │  │  • Cannot sleep or block                                     │      ││
│  │  │                                                               │      ││
│  │  └────────────────────────────┬──────────────────────────────────┘      ││
│  │                               │                                          ││
│  │                               ▼                                          ││
│  │  ┌───────────────────────────────────────────────────────────────┐      ││
│  │  │           PROCESS CONTEXT (workqueues)                         │      ││
│  │  │                                                               │      ││
│  │  │  • Can sleep and block                                       │      ││
│  │  │  • Can allocate memory with GFP_KERNEL                       │      ││
│  │  │  • Full kernel API available                                 │      ││
│  │  │                                                               │      ││
│  │  └───────────────────────────────────────────────────────────────┘      ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Interrupt Types

Exceptions vs Interrupts

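| Type               | Cause                | Synchronous? | Examples                            |
|--------------------|----------------------|--------------|-------------------------------------|
| Exception          | CPU execution        | Yes          | Page fault, divide by zero, syscall |
| Hardware IRQ       | External device      | No           | NIC, disk, timer, keyboard          |
| Software Interrupt | Explicit instruction | Yes          | int 0x80, syscall                   |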

Exception Categories

┌─────────────────────────────────────────────────────────────────────────────┐
│                         EXCEPTION TYPES (x86-64)                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  FAULTS                                                                      │
│  ────────                                                                   │
│  • Recoverable: execution can continue after handling                       │
│  • Return address = instruction that caused fault                           │
│  • Examples: page fault (#PF), segment not present (#NP)                   │
│                                                                              │
│  TRAPS                                                                       │
│  ──────                                                                     │
│  • Intentional: used for debugging and system calls                        │
│  • Return address = next instruction                                        │
│  • Examples: breakpoint (#BP), overflow (#OF), syscall                     │
│                                                                              │
│  ABORTS                                                                      │
│  ────────                                                                   │
│  • Severe: cannot continue execution                                        │
│  • Examples: machine check (#MC), double fault (#DF)                       │
│                                                                              │
│  COMMON EXCEPTION VECTORS                                                    │
│  ─────────────────────────                                                  │
│  0:  #DE - Divide Error                                                     │
│  3:  #BP - Breakpoint                                                       │
│  6:  #UD - Invalid Opcode                                                   │
│  8:  #DF - Double Fault                                                     │
│  13: #GP - General Protection                                               │
│  14: #PF - Page Fault                                                       │
│  18: #MC - Machine Check                                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
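The practical difference between the three categories is where execution resumes. As a rough illustration, the lookup table below encodes the vectors listed above together with their class. It is a user-space sketch with hypothetical type and function names, not a kernel data structure.

```c
#include <stddef.h>

enum exc_class { EXC_FAULT, EXC_TRAP, EXC_ABORT };

struct exc_info {
    unsigned vector;        /* IDT vector number */
    const char *mnemonic;   /* Architectural mnemonic, e.g. "#PF" */
    enum exc_class cls;     /* Fault, trap, or abort */
};

/* The common x86-64 vectors from the diagram above. */
static const struct exc_info exc_table[] = {
    { 0,  "#DE", EXC_FAULT },
    { 3,  "#BP", EXC_TRAP  },
    { 6,  "#UD", EXC_FAULT },
    { 8,  "#DF", EXC_ABORT },
    { 13, "#GP", EXC_FAULT },
    { 14, "#PF", EXC_FAULT },
    { 18, "#MC", EXC_ABORT },
};

/* Return the table entry for a vector, or NULL if it is not listed. */
static const struct exc_info *exc_lookup(unsigned vector)
{
    for (size_t i = 0; i < sizeof(exc_table) / sizeof(exc_table[0]); i++)
        if (exc_table[i].vector == vector)
            return &exc_table[i];
    return NULL;
}

/* Faults save the address of the faulting instruction, so it is
 * re-executed after the handler returns; traps resume at the next one. */
static int exc_restarts_instruction(const struct exc_info *e)
{
    return e->cls == EXC_FAULT;
}
```

This is why a page fault is transparent to the program: once the page is mapped, the same load or store simply runs again.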

Interrupt Descriptor Table (IDT)

The IDT maps interrupt vectors to handlers:
// arch/x86/kernel/idt.c (simplified)
struct idt_data {
    unsigned int    vector;      // Interrupt number (0-255)
    unsigned int    segment;     // Code segment selector
    struct idt_bits bits;        // Type, DPL, present
    const void      *addr;       // Handler address
};

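// IDT entries (illustrative: the real file splits them across several
// tables such as early_idts[] and def_idts[], and some gates use
// ISTG/SYSG rather than INTG)
static const __initconst struct idt_data def_idts[] = {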
    INTG(X86_TRAP_DE,     asm_exc_divide_error),       // #DE
    INTG(X86_TRAP_NMI,    asm_exc_nmi),                // NMI
    INTG(X86_TRAP_BP,     asm_exc_int3),               // #BP
    INTG(X86_TRAP_OF,     asm_exc_overflow),           // #OF
    INTG(X86_TRAP_UD,     asm_exc_invalid_op),         // #UD
    INTG(X86_TRAP_DF,     asm_exc_double_fault),       // #DF
    INTG(X86_TRAP_GP,     asm_exc_general_protection), // #GP
    INTG(X86_TRAP_PF,     asm_exc_page_fault),         // #PF
    // ... more entries
};

// Hardware interrupts (IRQs) start at vector 32
#define FIRST_EXTERNAL_VECTOR 0x20

Viewing Interrupt Information

# View interrupt statistics
cat /proc/interrupts

# Example output:
#            CPU0       CPU1       CPU2       CPU3
#   0:         25          0          0          0  IR-IO-APIC   2-edge      timer
#   8:          0          0          0          1  IR-IO-APIC   8-edge      rtc0
#  16:          0          0          0          0  IR-IO-APIC  16-fasteoi   ehci_hcd
# 120:          0          0          0          0  DMAR-MSI    0-edge       dmar0
# 121:     234567      12345       9876       8765  IR-PCI-MSI 512000-edge  nvme0q0
# 122:          0     987654          0          0  IR-PCI-MSI 512001-edge  nvme0q1
# NMI:          0          0          0          0   Non-maskable interrupts
# LOC:    1234567    1234567    1234567    1234567   Local timer interrupts

# View affinity for a specific IRQ
cat /proc/irq/121/smp_affinity      # Hex mask of allowed CPUs
cat /proc/irq/121/smp_affinity_list # Human-readable CPU list
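When scripting around /proc/interrupts, a common first step is to total the per-CPU counters of one row to see how busy an IRQ is overall. The helper below is a minimal user-space sketch; the function name is hypothetical and it assumes the row layout shown in the example output above (a label ending in a colon, then one decimal counter per CPU).

```c
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on a single /proc/interrupts row.
 * Returns -1 if the row has no "label:" prefix. */
static long long irq_row_total(const char *row, int ncpus)
{
    const char *p = strchr(row, ':');
    unsigned long long total = 0;

    if (!p)
        return -1;
    p++;  /* Skip the colon */

    for (int i = 0; i < ncpus; i++) {
        char *end;
        unsigned long long v = strtoull(p, &end, 10);

        if (end == p)   /* No digits: reached the chip name or the end */
            break;
        total += v;
        p = end;
    }
    return (long long)total;
}
```

Comparing the total with the individual columns also shows at a glance whether one CPU is taking nearly all of the interrupts, as with `nvme0q1` in the sample output.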

Hardware IRQ Handling

IRQ Handler Registration

// include/linux/interrupt.h
int request_irq(unsigned int irq,
                irq_handler_t handler,
                unsigned long flags,
                const char *name,
                void *dev_id);

// Common flags:
#define IRQF_SHARED         0x00000080  // Multiple devices share the IRQ line
#define IRQF_TRIGGER_HIGH   0x00000004  // Level-triggered, active high
#define IRQF_TRIGGER_RISING 0x00000001  // Edge-triggered, rising edge
#define IRQF_ONESHOT        0x00002000  // Keep the line masked until the threaded handler finishes

// Handler prototype
typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);

// Handler return values (enum irqreturn in include/linux/irqreturn.h)
enum irqreturn {
    IRQ_NONE        = (0 << 0),  // Interrupt was not from this device
    IRQ_HANDLED     = (1 << 0),  // Interrupt was handled
    IRQ_WAKE_THREAD = (1 << 1),  // Wake the threaded handler
};

Example: Network Driver IRQ Handler

// Simplified network driver interrupt handler
static irqreturn_t my_net_irq_handler(int irq, void *dev_id)
{
    struct my_net_device *dev = dev_id;
    u32 status;
    
    // Read interrupt status register
    status = my_read_reg(dev, INTR_STATUS);
    
    if (!(status & MY_INTR_MASK))
        return IRQ_NONE;  // Not our interrupt (shared IRQ line)
    
    // Acknowledge interrupt to hardware
    my_write_reg(dev, INTR_ACK, status);
    
    if (status & INTR_RX_DONE) {
        // Disable RX interrupts, schedule NAPI poll
        my_write_reg(dev, INTR_DISABLE, INTR_RX_DONE);
        napi_schedule(&dev->napi);  // Schedule softirq
    }
    
    if (status & INTR_TX_DONE) {
        // TX completion - can do minimal work here
        tasklet_schedule(&dev->tx_tasklet);
    }
    
    return IRQ_HANDLED;
}
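The `IRQ_NONE` return matters because, on a shared line, every registered handler is called and each one must check its own device. The following is a user-space model of that dispatch, not the kernel's implementation: the types mirror the kernel names, while `struct fake_dev`, `fake_handler`, and `dispatch_shared_irq` are hypothetical.

```c
#include <stddef.h>

typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;
typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);

/* One registered handler on an IRQ line (simplified). */
struct irq_action {
    irq_handler_t handler;
    void *dev_id;
    struct irq_action *next;
};

/* Hypothetical device: "pending" stands in for its status register. */
struct fake_dev {
    int pending;
    int serviced;
};

static irqreturn_t fake_handler(int irq, void *dev_id)
{
    struct fake_dev *dev = dev_id;

    (void)irq;
    if (!dev->pending)
        return IRQ_NONE;      /* Another device on the line raised it */
    dev->pending = 0;         /* Acknowledge */
    dev->serviced++;
    return IRQ_HANDLED;
}

/* Call every handler on the line; return how many claimed the interrupt.
 * A result of 0 means the interrupt was spurious. */
static int dispatch_shared_irq(int irq, struct irq_action *chain)
{
    int handled = 0;

    for (struct irq_action *a = chain; a; a = a->next)
        if (a->handler(irq, a->dev_id) == IRQ_HANDLED)
            handled++;
    return handled;
}
```

A handler that returns `IRQ_HANDLED` without checking its device would hide spurious interrupts from the kernel, which is why the status check comes first.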

Softirqs: High-Priority Deferred Work

The analogy: If hardirqs are the smoke alarm (drop everything, acknowledge immediately), softirqs are the cleanup crew that arrives after the alarm stops. They run with interrupts re-enabled, so new alarms can still ring, but they still cannot take a nap (no sleeping) because they need to be fast. Softirqs are how the kernel processes network packets in bulk, completes block I/O, and runs timer callbacks — all the heavy lifting deferred from the hardirq handler.

Softirq Types

// include/linux/interrupt.h
enum {
    HI_SOFTIRQ = 0,        // High-priority tasklets
    TIMER_SOFTIRQ,         // Timer processing
    NET_TX_SOFTIRQ,        // Network transmit
    NET_RX_SOFTIRQ,        // Network receive
    BLOCK_SOFTIRQ,         // Block device completion
    IRQ_POLL_SOFTIRQ,      // IRQ polling
    TASKLET_SOFTIRQ,       // Regular tasklets
    SCHED_SOFTIRQ,         // Scheduler load balancing
    HRTIMER_SOFTIRQ,       // High-resolution timers
    RCU_SOFTIRQ,           // RCU callbacks
    NR_SOFTIRQS
};

Softirq Properties

┌─────────────────────────────────────────────────────────────────────────────┐
│                         SOFTIRQ CHARACTERISTICS                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Properties:                                                                 │
│  • Fixed number (10 currently, compile-time)                                │
│  • Run with interrupts enabled                                              │
│  • Cannot sleep or block                                                    │
│  • Same softirq can run simultaneously on different CPUs                    │
│  • Must use appropriate locking                                             │
│                                                                              │
│  Execution Points:                                                           │
│  • After hardirq handler returns                                            │
│  • When explicitly enabled (local_bh_enable())                              │
│  • By ksoftirqd kernel threads (when too many pending)                      │
│                                                                              │
│  Priority Order:                                                             │
│  HI_SOFTIRQ > TIMER > NET_TX > NET_RX > BLOCK > ... > RCU                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
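The priority order falls out of how pending softirqs are tracked: each type is one bit in a per-CPU mask, and the bits are serviced from lowest to highest. The sketch below models that in user space. The enum matches the kernel's, but `raise_softirq_model`, `run_pending_softirqs`, and the recording handler are hypothetical names, and real handlers have a different signature.

```c
#include <stdint.h>

enum {
    HI_SOFTIRQ = 0, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ, IRQ_POLL_SOFTIRQ, TASKLET_SOFTIRQ, SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ, RCU_SOFTIRQ, NR_SOFTIRQS
};

typedef void (*softirq_fn)(int nr);

/* Mark a softirq as pending (what a hardirq handler would do). */
static void raise_softirq_model(uint32_t *pending, int nr)
{
    *pending |= 1u << nr;
}

/* Run all pending handlers, lowest bit (highest priority) first.
 * Returns the number of handlers that ran. */
static int run_pending_softirqs(uint32_t *pending, const softirq_fn *vec)
{
    uint32_t snapshot = *pending;
    int ran = 0;

    *pending = 0;  /* Softirqs raised from now on go into a fresh mask */
    for (int nr = 0; nr < NR_SOFTIRQS; nr++) {
        if ((snapshot & (1u << nr)) && vec[nr]) {
            vec[nr](nr);
            ran++;
        }
    }
    return ran;
}

/* Demo handler: records the order in which softirqs were serviced. */
static int run_order[NR_SOFTIRQS];
static int run_count;

static void record_softirq(int nr)
{
    if (run_count < NR_SOFTIRQS)
        run_order[run_count++] = nr;
}
```

Raising `NET_RX_SOFTIRQ` before `TIMER_SOFTIRQ` still results in the timer handler running first, because only the bit position decides the order.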

Viewing Softirq Activity

# View softirq statistics per CPU
cat /proc/softirqs
#                    CPU0       CPU1       CPU2       CPU3
#          HI:          0          0          0          0
#       TIMER:   12345678   12345678   12345678   12345678
#      NET_TX:      12345      12345      12345      12345
#      NET_RX:   98765432    1234567     123456      12345
#       BLOCK:     123456     234567     345678     456789
#    IRQ_POLL:          0          0          0          0
#     TASKLET:       1234       2345       3456       4567
#       SCHED:    1234567    1234567    1234567    1234567
#     HRTIMER:     123456     123456     123456     123456
#         RCU:    1234567    1234567    1234567    1234567

# Watch ksoftirqd CPU usage
top -p $(pgrep -d',' ksoftirqd)

Tasklets: Dynamic Deferred Work

Tasklets are built on top of softirqs but are more flexible: they can be created at runtime, and a given tasklet never runs on two CPUs at once.
// Tasklet declaration
struct tasklet_struct {
    struct tasklet_struct *next;
    unsigned long state;      // TASKLET_STATE_SCHED, TASKLET_STATE_RUN
    atomic_t count;           // Disable count
    void (*func)(unsigned long);
    unsigned long data;
};

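// Static initialization (pre-5.9 signature; newer kernels call this form
// DECLARE_TASKLET_OLD and pass the tasklet itself to the callback)
DECLARE_TASKLET(my_tasklet, my_tasklet_handler, data);
DECLARE_TASKLET_DISABLED(my_tasklet, my_tasklet_handler, data);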

// Dynamic initialization
tasklet_init(&my_tasklet, my_tasklet_handler, data);

// Schedule for execution
tasklet_schedule(&my_tasklet);     // Normal priority (TASKLET_SOFTIRQ)
tasklet_hi_schedule(&my_tasklet);  // High priority (HI_SOFTIRQ)

// Control
tasklet_disable(&my_tasklet);  // Prevent execution
tasklet_enable(&my_tasklet);   // Re-enable
tasklet_kill(&my_tasklet);     // Remove (waits if running)

Tasklet vs Softirq

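| Feature     | Softirq                               | Tasklet                                     |
|-------------|---------------------------------------|---------------------------------------------|
| Concurrency | Same softirq can run on multiple CPUs | Same tasklet runs on only one CPU at a time |
| Definition  | Static (compile-time)                 | Dynamic (runtime)                           |
| Locking     | Must handle SMP yourself              | Serialized per-tasklet                      |
| Use case    | High-frequency, performance-critical  | General deferred work                       |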
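The per-tasklet serialization comes from two state bits: one that says the tasklet is already queued, and one that says it is currently running. The model below illustrates the idea with C11 atomics in user space. It is a simplified sketch, not the kernel's code, and every name in it is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_STATE_SCHED 0x1u  /* Queued, waiting to run */
#define MODEL_STATE_RUN   0x2u  /* Body is executing on some CPU */

struct tasklet_model {
    atomic_uint state;
    unsigned runs;      /* How many times the body has executed */
};

static void tasklet_model_init(struct tasklet_model *t)
{
    atomic_init(&t->state, 0u);
    t->runs = 0;
}

/* Returns true if this call queued the tasklet, false if it was
 * already pending (scheduling twice still runs it only once). */
static bool tasklet_schedule_model(struct tasklet_model *t)
{
    unsigned old = atomic_fetch_or(&t->state, MODEL_STATE_SCHED);
    return !(old & MODEL_STATE_SCHED);
}

/* Returns true if the body ran, false if another CPU holds RUN. */
static bool tasklet_run_model(struct tasklet_model *t)
{
    unsigned old = atomic_fetch_or(&t->state, MODEL_STATE_RUN);

    if (old & MODEL_STATE_RUN)
        return false;                         /* Already running elsewhere */
    atomic_fetch_and(&t->state, ~MODEL_STATE_SCHED);
    t->runs++;                                /* The tasklet body */
    atomic_fetch_and(&t->state, ~MODEL_STATE_RUN);
    return true;
}
```

Because the queued bit is cleared before the body runs, a tasklet may reschedule itself from inside its own handler.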

Workqueues: Process Context Deferred Work

When the deferred work needs to sleep or make blocking allocations:
// Embed the work item in the device structure so the handler can recover it
struct my_device {
    struct work_struct work;
    struct mutex lock;
    // ...
};

// Work handler: runs in process context on a kworker thread
static void my_work_handler(struct work_struct *work)
{
    struct my_device *dev = container_of(work, struct my_device, work);
    void *buf;

    // Do heavy processing
    process_data(dev);

    // Allocate memory (GFP_KERNEL may sleep); always check the result
    buf = kmalloc(4096, GFP_KERNEL);
    if (!buf)
        return;

    // Call functions that might sleep
    mutex_lock(&dev->lock);
    // ...
    mutex_unlock(&dev->lock);

    kfree(buf);
}

// Initialize once, e.g. in the driver's probe function
INIT_WORK(&dev->work, my_work_handler);

// Schedule work
schedule_work(&dev->work);                  // System workqueue (system_wq)
queue_work(my_workqueue, &dev->work);       // Custom workqueue
schedule_delayed_work(&my_dwork, HZ * 5);   // Delayed by 5 seconds (my_dwork is a struct delayed_work)
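The `container_of` step is what lets a handler that only receives a `work_struct` pointer find the device that owns it. The pointer arithmetic can be tried in user space. In the sketch below, the `_model` types are stand-ins invented for illustration, and the macro is a simplified version of the kernel's without its type checking.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct work_struct: just the callback. */
struct work_struct_model {
    void (*func)(struct work_struct_model *work);
};

/* Stand-in for a driver's device structure with an embedded work item. */
struct my_device_model {
    int id;
    int processed;
    struct work_struct_model work;
};

static void my_work_handler_model(struct work_struct_model *work)
{
    /* Recover the enclosing device from the embedded member. */
    struct my_device_model *dev =
        container_of(work, struct my_device_model, work);

    dev->processed++;
}

/* Stand-in for a worker thread picking up one queued item. */
static void run_work_model(struct work_struct_model *work)
{
    work->func(work);
}
```

Embedding the work item avoids a separate allocation and a back-pointer, at the cost of tying the work item's lifetime to the device structure.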

Workqueue Types

// Create workqueues
// Unbound workqueue (work may run on any CPU); omit WQ_UNBOUND for a
// bound workqueue, where work runs on the submitting CPU
struct workqueue_struct *wq = alloc_workqueue("my_wq",
    WQ_UNBOUND | WQ_MEM_RECLAIM, max_active);

// Flags:
#define WQ_UNBOUND       (1 << 1)   // Work can run on any CPU
#define WQ_FREEZABLE     (1 << 2)   // Participate in system suspend
#define WQ_MEM_RECLAIM   (1 << 3)   // Can be used for memory reclaim
#define WQ_HIGHPRI       (1 << 4)   // High priority workers
#define WQ_CPU_INTENSIVE (1 << 5)   // CPU-intensive work

// System workqueues
system_wq              // Default, bound
system_highpri_wq      // High priority
system_long_wq         // For long-running work
system_unbound_wq      // Unbound: workers not tied to a specific CPU
system_freezable_wq    // Freezable for suspend

Concurrency Managed Workqueue (cmwq)

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CONCURRENCY MANAGED WORKQUEUE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Work Items                                                                  │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐                                   │
│  │ W1  │ │ W2  │ │ W3  │ │ W4  │ │ W5  │                                   │
│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘                                   │
│     │       │       │       │       │                                        │
│     └───────┴───────┼───────┴───────┘                                        │
│                     │                                                         │
│                     ▼                                                         │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                    Per-CPU Worker Pools                                  ││
│  │                                                                          ││
│  │  CPU 0 Pool              CPU 1 Pool              CPU 2 Pool             ││
│  │  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐    ││
│  │  │ Worker threads: │     │ Worker threads: │     │ Worker threads: │    ││
│  │  │ [kworker/0:0]  │     │ [kworker/1:0]  │     │ [kworker/2:0]  │    ││
│  │  │ [kworker/0:1]  │     │ [kworker/1:1]  │     │ [kworker/2:1]  │    ││
│  │  │ (dynamic)      │     │ (dynamic)      │     │ (dynamic)      │    ││
│  │  └─────────────────┘     └─────────────────┘     └─────────────────┘    ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
│  Key Properties:                                                             │
│  • Workers created/destroyed dynamically based on load                       │
│  • Work items queued to per-CPU pools for locality                          │
│  • Automatic concurrency management (avoids thundering herd)                │
│  • WQ_UNBOUND work uses separate unbound pools                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Choosing the Right Deferred Work

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DEFERRED WORK DECISION TREE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Need to defer work from interrupt context?                                  │
│  │                                                                           │
│  └─► Does the work need to sleep?                                           │
│      │                                                                       │
│      ├─ YES ──► Use WORKQUEUE                                               │
│      │          • Can sleep (mutex_lock, kmalloc with GFP_KERNEL)           │
│      │          • Full kernel API available                                 │
│      │          • Higher latency (process context switch)                   │
│      │                                                                       │
│      └─ NO ───► Is it performance-critical networking/block I/O?            │
│                 │                                                            │
│                 ├─ YES ──► Use SOFTIRQ (if modifying kernel)               │
│                 │          or NAPI (for network drivers)                    │
│                 │          • Lowest latency                                 │
│                 │          • Runs at highest priority                       │
│                 │          • Need to handle SMP locking                     │
│                 │                                                            │
│                 └─ NO ───► Use TASKLET                                      │
│                            • Simpler than softirq                           │
│                            • Serialized execution                           │
│                            • Good for most driver deferred work             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
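The decision tree above reduces to two questions, which can be written down as a small function. This is only a restatement of the diagram in code, with hypothetical names, and it is not an API that exists in the kernel.

```c
#include <stdbool.h>

enum deferral {
    DEFER_WORKQUEUE,        /* Process context, may sleep */
    DEFER_SOFTIRQ_OR_NAPI,  /* Lowest latency, needs SMP-safe code */
    DEFER_TASKLET           /* Serialized, simpler than a raw softirq */
};

/* Pick a deferred-work mechanism following the decision tree:
 * first ask whether the work must sleep, then whether it sits on a
 * performance-critical networking or block I/O path. */
static enum deferral choose_deferral(bool needs_sleep, bool perf_critical_io)
{
    if (needs_sleep)
        return DEFER_WORKQUEUE;
    if (perf_critical_io)
        return DEFER_SOFTIRQ_OR_NAPI;
    return DEFER_TASKLET;
}
```

Note the ordering of the checks: the need to sleep overrides everything else, because sleeping is simply not possible in softirq or tasklet context.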

Interrupt Affinity and Performance

Setting IRQ Affinity

# View current affinity (hex bitmask)
cat /proc/irq/121/smp_affinity
# 0000000f means CPUs 0-3

# Set affinity to CPUs 0 and 1
echo 3 > /proc/irq/121/smp_affinity

# Using affinity list (human-readable)
echo 0-3 > /proc/irq/121/smp_affinity_list
echo 0,2,4,6 > /proc/irq/121/smp_affinity_list

# Check if irqbalance is running (it may override your settings)
systemctl status irqbalance

# Disable irqbalance for manual control
systemctl stop irqbalance
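The two files express the same information: `smp_affinity` is a hex bitmask with bit N standing for CPU N, and `smp_affinity_list` is the readable form. The conversion is sketched below in user-space C. The function names are hypothetical, and the sketch assumes at most 64 CPUs (larger systems print the mask as several comma-separated 32-bit groups, which the parser concatenates).

```c
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parse an smp_affinity string such as "0000000f" or
 * "00000000,00000003" into a bitmask. Non-hex characters
 * (commas, newline) are skipped. */
static uint64_t parse_affinity_mask(const char *s)
{
    uint64_t mask = 0;

    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;

        if (!isxdigit(c))
            continue;
        mask = (mask << 4) |
               (uint64_t)(isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
    }
    return mask;
}

/* Render a mask in smp_affinity_list style, e.g. "0-3" or "0,2,4,6".
 * Output is truncated if the buffer is too small. */
static void mask_to_cpulist(uint64_t mask, char *out, size_t outsz)
{
    size_t len = 0;

    if (outsz == 0)
        return;
    out[0] = '\0';

    for (int cpu = 0; cpu < 64; cpu++) {
        int start, n;

        if (!(mask & (1ULL << cpu)))
            continue;
        start = cpu;
        while (cpu + 1 < 64 && (mask & (1ULL << (cpu + 1))))
            cpu++;  /* Extend the run of consecutive CPUs */

        if (start == cpu)
            n = snprintf(out + len, outsz - len, "%s%d",
                         len ? "," : "", start);
        else
            n = snprintf(out + len, outsz - len, "%s%d-%d",
                         len ? "," : "", start, cpu);
        if (n < 0 || (size_t)n >= outsz - len)
            return;
        len += (size_t)n;
    }
}
```

With these helpers, the mask `0000000f` from the example renders as `0-3`, and writing `3` corresponds to the list `0-1`.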

Performance Tuning Patterns

┌─────────────────────────────────────────────────────────────────────────────┐
│                    IRQ AFFINITY STRATEGIES                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. NETWORK-INTENSIVE WORKLOADS                                              │
│     ────────────────────────────                                            │
│     • Pin NIC IRQs to dedicated CPUs                                        │
│     • Use RPS/RFS for receive steering                                      │
│     • Enable XPS for transmit steering                                      │
│     • Consider isolcpus for application CPUs                                │
│                                                                              │
│     Example (4-queue NIC, 8 CPUs):                                          │
│     IRQ 121 (rx-0) → CPU 0                                                  │
│     IRQ 122 (rx-1) → CPU 1                                                  │
│     IRQ 123 (rx-2) → CPU 2                                                  │
│     IRQ 124 (rx-3) → CPU 3                                                  │
│     CPUs 4-7 → Application threads                                          │
│                                                                              │
│  2. LATENCY-SENSITIVE WORKLOADS                                              │
│     ──────────────────────────                                              │
│     • Isolate CPUs for application (isolcpus=)                              │
│     • Pin all IRQs to housekeeping CPUs                                     │
│     • Use nohz_full for tickless operation                                  │
│     • Consider disabling irqbalance                                         │
│                                                                              │
│  3. NUMA-AWARE AFFINITY                                                      │
│     ───────────────────                                                     │
│     • Pin IRQs to same NUMA node as device                                  │
│     • Avoid cross-NUMA memory access                                        │
│     • Check: cat /sys/class/net/eth0/device/numa_node                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Measuring Interrupt Latency

# Using cyclictest for latency measurement
sudo cyclictest -p 80 -t1 -n -i 1000 -l 10000
# -p 80: priority
# -t1: 1 thread
# -n: use nanosleep
# -i 1000: 1ms interval
# -l 10000: 10000 loops

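# Using perf for scheduler (wakeup-to-run) latency; record a trace first
sudo perf sched record -- sleep 10
sudo perf sched latency

# Using bpftrace to histogram hardirq handler run time (not delivery latency)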
sudo bpftrace -e '
tracepoint:irq:irq_handler_entry {
    @start[args->irq] = nsecs;
}
tracepoint:irq:irq_handler_exit /@start[args->irq]/ {
    @latency_ns = hist(nsecs - @start[args->irq]);
    delete(@start[args->irq]);
}'

NAPI: Network Interrupt Coalescing

NAPI (New API) reduces interrupt overhead for high-throughput networking:
┌─────────────────────────────────────────────────────────────────────────────┐
│                         NAPI MECHANISM                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Traditional (interrupt per packet):                                         │
│  ──────────────────────────────────                                         │
│  Packet 1 → IRQ → Handle → Return                                           │
│  Packet 2 → IRQ → Handle → Return                                           │
│  Packet 3 → IRQ → Handle → Return                                           │
│  ... (high CPU overhead)                                                    │
│                                                                              │
│  NAPI (polling mode):                                                        │
│  ─────────────────────                                                      │
│  Packet 1 → IRQ → Disable IRQ → Schedule NAPI                              │
│                                  │                                           │
│                                  ▼                                           │
│                    ┌─────────────────────────────────────┐                  │
│                    │         NAPI Poll Loop              │                  │
│                    │                                     │                  │
│                    │  while (budget > 0) {               │                  │
│                    │      packet = poll_device();        │                  │
│                    │      if (!packet) break;            │                  │
│                    │      process(packet);               │                  │
│                    │      budget--;                      │                  │
│                    │  }                                  │                  │
│                    │                                     │                  │
│                    │  if (budget > 0) {                  │                  │
│                    │      napi_complete();  // Re-enable │                  │
│                    │      enable_irq();     // IRQ       │                  │
│                    │  } else {                           │                  │
│                    │      reschedule();     // More work │                  │
│                    │  }                                  │                  │
│                    └─────────────────────────────────────┘                  │
│                                                                              │
│  Benefits:                                                                   │
│  • Reduced interrupt overhead                                               │
│  • Better CPU cache utilization                                             │
│  • Back-pressure mechanism (budget)                                         │
│  • Automatic adaptation to load                                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

NAPI API

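// Initialize NAPI. Since Linux 6.1, netif_napi_add() takes three arguments
// and uses the default weight; older kernels also pass NAPI_POLL_WEIGHT.
netif_napi_add(netdev, &dev->napi, my_poll);
napi_enable(&dev->napi);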

// In IRQ handler
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    struct my_device *dev = dev_id;
    
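    // Mask further interrupts and schedule NAPI.
    // disable_irq() waits for running handlers and would deadlock here,
    // so use the non-waiting variant (real drivers usually mask at the device).
    if (napi_schedule_prep(&dev->napi)) {
        disable_irq_nosync(irq);
        __napi_schedule(&dev->napi);
    }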
    return IRQ_HANDLED;
}

// Poll function
static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_device *dev = container_of(napi, struct my_device, napi);
    int work_done = 0;
    
    while (work_done < budget) {
        struct sk_buff *skb = get_next_packet(dev);
        if (!skb)
            break;
        
        // Process packet
        napi_gro_receive(napi, skb);
        work_done++;
    }
    
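    // Ring drained: leave polling mode. Re-enable interrupts only if
    // napi_complete_done() reports that NAPI really completed.
    if (work_done < budget && napi_complete_done(napi, work_done))
        enable_irq(dev->irq);  // dev->irq: IRQ number saved at probe time (assumed field)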
    
    return work_done;
}

Threaded IRQs

For handlers that need more flexibility:
// Request threaded IRQ
int request_threaded_irq(unsigned int irq,
                         irq_handler_t handler,        // Hardirq handler
                         irq_handler_t thread_fn,      // Threaded handler
                         unsigned long flags,
                         const char *name,
                         void *dev_id);

// Example
static irqreturn_t my_hardirq_handler(int irq, void *dev_id)
{
    // Quick check: is this our interrupt?
    if (!check_interrupt_pending(dev_id))
        return IRQ_NONE;
    
    // Acknowledge hardware
    ack_interrupt(dev_id);
    
    // Wake threaded handler
    return IRQ_WAKE_THREAD;
}

static irqreturn_t my_threaded_handler(int irq, void *dev_id)
{
    // Can sleep here!
    mutex_lock(&dev_lock);
    process_data(dev_id);
    mutex_unlock(&dev_lock);
    
    return IRQ_HANDLED;
}

// Registration
request_threaded_irq(irq, my_hardirq_handler, my_threaded_handler,
                     IRQF_ONESHOT, "my_device", dev);

Interview Questions

Answer:
Hardirq context (top half):
  • Runs with interrupts disabled on local CPU
  • Must be extremely fast (microseconds)
  • Cannot sleep or allocate memory with GFP_KERNEL
  • Preempts whatever is running on that CPU, including kernel code (unless local interrupts are disabled)
Softirq context (bottom half):
  • Runs with interrupts enabled
  • Can be preempted by hardirqs
  • Still cannot sleep (atomic context)
  • Used for deferred processing (networking, block I/O)
Key insight: Split processing into minimal hardirq work (acknowledge, disable, schedule) and heavier softirq work (process data).
Answer:
Use workqueue when:
  • You need to sleep (mutex, blocking I/O)
  • You need to allocate memory with GFP_KERNEL
  • The work is not time-critical
  • You need to call functions that might block
Use tasklet when:
  • Work must be done quickly after interrupt
  • You don’t need to sleep
  • Work is small and fast
  • You want serialization (same tasklet won’t run concurrently)
Example: Network driver TX completion → tasklet (fast, no sleeping). Firmware loading → workqueue (needs file I/O, can sleep).
Answer:
Problem: At high packet rates, per-packet interrupts cause:
  • High CPU overhead (interrupt entry and exit for every packet)
  • Cache thrashing
  • Interrupt storms (CPU spends all time in IRQ handlers)
NAPI solution:
  1. First packet triggers interrupt
  2. Disable further interrupts for that queue
  3. Switch to polling mode (softirq)
  4. Process packets in batches (budget-based)
  5. Re-enable interrupts when queue is empty
Benefits:
  • Amortized interrupt cost across many packets
  • Better cache locality (process batch together)
  • Natural back-pressure (the budget caps the work done per poll, so a flood cannot monopolize the CPU)
  • Scales to millions of packets/second
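The numbered steps above can be modeled in userspace to see how the budget shapes the work. This is a minimal sketch, not kernel code: `napi_model`, `struct napi_stats`, and the assumption that every packet is already queued when the first interrupt fires are all illustrative.

```c
/* Result of draining one burst of packets through the modeled poll loop. */
struct napi_stats {
    int interrupts;  /* hardware interrupts taken */
    int polls;       /* calls to the poll function */
    int packets;     /* packets processed */
};

/*
 * Hypothetical userspace model: `queued` packets are already waiting in the
 * RX ring when the first interrupt fires, and none arrive while we poll.
 */
static struct napi_stats napi_model(int queued, int budget)
{
    struct napi_stats s = {0, 0, 0};

    if (queued <= 0 || budget <= 0)
        return s;

    s.interrupts = 1;            /* first packet raises the IRQ, then IRQs are masked */

    for (;;) {
        int work_done = 0;

        s.polls++;
        while (work_done < budget && queued > 0) {
            queued--;            /* "process" one packet */
            work_done++;
        }
        s.packets += work_done;

        if (work_done < budget)  /* ring drained: complete and unmask IRQs */
            break;
        /* budget exhausted: stay in polling mode and get polled again */
    }
    return s;
}
```

In this model, `napi_model(150, 64)` reports one interrupt and three polls for 150 packets, where a per-packet interrupt scheme would take 150 interrupts.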
Answer:
Symptoms:
  • High CPU usage in si (softirq) or hi (hardirq) as reported by top
  • System unresponsive
  • /proc/interrupts shows rapidly increasing counts
Debugging steps:
# 1. Identify which IRQ
watch -n 1 cat /proc/interrupts

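# 2. Check which device owns it and which CPUs it is routed to
ls -la /proc/irq/<irq_num>/                  # registered handlers appear as subdirectories
cat /proc/irq/<irq_num>/smp_affinity_list    # CPUs allowed to service this IRQ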

# 3. Monitor interrupt rate
perf stat -e irq:irq_handler_entry -a sleep 1

# 4. Trace specific IRQ (121 is an example; substitute the IRQ number from step 1)
sudo bpftrace -e '
tracepoint:irq:irq_handler_entry /args->irq == 121/ {
    @count = count();
}'
Common causes:
  • Faulty hardware generating spurious interrupts
  • Driver bug not acknowledging interrupt properly
  • Shared IRQ with misbehaving device
  • Misconfigured interrupt coalescing
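To turn the "rapidly increasing counts" symptom into a number, one option is to sum the per-CPU counters of a /proc/interrupts row in two snapshots and divide by the interval. The sketch below assumes the usual row layout (a label, a colon, then one counter per CPU); `irq_row_total` and `irq_rate` are hypothetical helper names, not part of any library.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on one /proc/interrupts row.
 * Returns -1 if the row does not contain a "<label>:" prefix.
 */
static long long irq_row_total(const char *row, int ncpus)
{
    const char *p = strchr(row, ':');
    long long total = 0;

    if (!p)
        return -1;
    p++;                               /* skip the ':' after the IRQ label */

    for (int cpu = 0; cpu < ncpus; cpu++) {
        char *end;
        unsigned long long n = strtoull(p, &end, 10);

        if (end == p)                  /* fewer counters than CPUs (e.g. ERR:) */
            break;
        total += (long long)n;
        p = end;
    }
    return total;
}

/* Interrupts per second between two snapshots taken interval_s apart. */
static double irq_rate(long long before, long long after, double interval_s)
{
    return (double)(after - before) / interval_s;
}
```

A monitoring loop would read the same row twice, a known interval apart, and compare `irq_rate(...)` against a threshold chosen for the device in question.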

Practice Exercises

  1. Interrupt Statistics: Write a script that monitors /proc/interrupts and alerts when any IRQ rate exceeds a threshold
  2. Affinity Configuration: Set up IRQ affinity for a multi-queue NIC to optimize for either throughput or latency
  3. eBPF Tracing: Write a bpftrace script to trace interrupt handler latency and identify slow handlers
  4. Workqueue Analysis: Use workqueue:* tracepoints to analyze work item execution patterns

Summary

| Mechanism | Context | Can Sleep? | Use Case |
|---|---|---|---|
| Hardirq | Interrupt | No | Acknowledge HW, schedule deferred work |
| Softirq | Atomic | No | High-frequency networking, block I/O |
| Tasklet | Atomic | No | Driver deferred work, serialized |
| Workqueue | Process | Yes | General deferred work, can block |
| Threaded IRQ | Process | Yes | Device handling that needs sleeping |

Debugging Tips

# Detect interrupt storms: watch for rapidly increasing counts
watch -d -n 1 cat /proc/interrupts
# If a single IRQ line is incrementing by thousands per second,
# you likely have a misbehaving device or driver bug

# Check if ksoftirqd is consuming excessive CPU
# This indicates softirq backlog -- the kernel cannot keep up
# (top -p accepts at most 20 PIDs, so narrow the list on larger machines)
top -p $(pgrep -d',' ksoftirqd)
# If ksoftirqd is using >50% of a core, investigate which softirq type

# Identify which softirq type is dominating
cat /proc/softirqs
# NET_RX growing much faster than others → network receive backlog
# BLOCK growing fast → storage completion backlog
# TIMER growing fast → timer callback overload

# Trace hardirq handler duration (find slow handlers)
sudo bpftrace -e '
tracepoint:irq:irq_handler_entry { @start[args->irq] = nsecs; }
tracepoint:irq:irq_handler_exit /@start[args->irq]/ {
    $dur = (nsecs - @start[args->irq]) / 1000;
    if ($dur > 100) {  // handlers > 100us are suspicious
        printf("SLOW IRQ %d: %d us\n", args->irq, $dur);
    }
    delete(@start[args->irq]);
}'

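# Check for IRQ affinity imbalance
# One CPU handling all interrupts while others are idle
# (column count and CPU names come from the header row, so trailing description fields are ignored)
awk 'NR==1 {n=NF; for(i=1;i<=n;i++) cpu[i]=$i; next} {for(i=1;i<=n;i++) sum[i]+=$(i+1)} END {for(i=1;i<=n;i++) print cpu[i]": "sum[i]}' /proc/interrupts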
Common Misconception: “Setting IRQ affinity is enough to control interrupt placement.” In practice, the irqbalance daemon may override your manual settings. Always check systemctl status irqbalance and stop it if you need manual control. Also, some NIC drivers manage the affinity of their MSI-X interrupts themselves, so writes to smp_affinity may be rejected or overridden.
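When setting affinity by hand, /proc/irq/<irq_num>/smp_affinity expects a hexadecimal CPU bitmask (smp_affinity_list takes a plain list such as 0,2 instead). The sketch below shows how such a mask is built. `cpus_to_mask` is a hypothetical helper, and it is deliberately limited to CPUs 0-31 to keep the example to a single 32-bit word.

```c
#include <stdint.h>

/*
 * Build the bitmask for smp_affinity from a list of CPU numbers.
 * Sketch limited to CPUs 0-31. Returns 0 if any CPU is out of range;
 * an empty mask is never a valid affinity, so 0 doubles as the error value.
 */
static uint32_t cpus_to_mask(const int *cpus, int count)
{
    uint32_t mask = 0;

    for (int i = 0; i < count; i++) {
        if (cpus[i] < 0 || cpus[i] > 31)
            return 0;
        mask |= UINT32_C(1) << cpus[i];   /* bit N selects CPU N */
    }
    return mask;
}
```

CPUs 0 and 2 give 0x5 and CPUs 4-7 give 0xf0. Printed with `%x`, these are the strings one would write to smp_affinity.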

Common Misconceptions

| Misconception | Reality |
|---|---|
| “Softirqs always run right after the hardirq that raised them” | They run on the same CPU, but when the softirq backlog is high the work is handed to that CPU's ksoftirqd thread, which is scheduled like an ordinary task and can be delayed. Packet steering (RPS/RFS) can additionally move network processing to other CPUs |
| “Workqueues are slow” | For deferred work that needs sleeping, workqueues add only a few microseconds of overhead. The alternative (busy-waiting in atomic context) is far worse |
| “NAPI polling wastes CPU” | NAPI only polls when there are packets. When the queue empties, it automatically re-enables interrupts and stops polling |
| “Tasklets are the recommended bottom-half mechanism” | Tasklets are semi-deprecated in modern kernel development. New code should prefer threaded IRQs or workqueues. Tasklets have problematic semantics (unpredictable latency, and they block softirq processing for other devices) |
| “Disabling interrupts with spin_lock_irq is always safe” | Only safe if you know interrupts were enabled before the lock. Use spin_lock_irqsave/spin_unlock_irqrestore in code paths that might be called with interrupts already disabled |
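The spin_lock_irq pitfall is easier to see in a small model. The following is a userspace sketch, not kernel code: a single boolean stands in for the CPU's interrupt-enable flag, and every `model_*` function is a hypothetical stand-in for the corresponding kernel primitive.

```c
#include <stdbool.h>

/* Userspace model of the local CPU's interrupt-enable flag. */
static bool irqs_enabled = true;

/* Models spin_lock_irq / spin_unlock_irq: unlock enables unconditionally. */
static void model_lock_irq(void)   { irqs_enabled = false; }
static void model_unlock_irq(void) { irqs_enabled = true; }

/* Models spin_lock_irqsave / spin_unlock_irqrestore: previous state is kept. */
static bool model_lock_irqsave(void)
{
    bool flags = irqs_enabled;
    irqs_enabled = false;
    return flags;
}
static void model_unlock_irqrestore(bool flags) { irqs_enabled = flags; }

/* Inner helpers report the interrupt state they leave behind. */
static bool inner_with_irq(void)
{
    model_lock_irq();
    model_unlock_irq();
    return irqs_enabled;
}

static bool inner_with_irqsave(void)
{
    bool flags = model_lock_irqsave();
    model_unlock_irqrestore(flags);
    return irqs_enabled;
}

/* Caller already holds interrupts off when it invokes the inner helper. */
static bool irqs_on_inside_caller(bool (*inner)(void))
{
    bool flags = model_lock_irqsave();   /* caller's critical section: IRQs off */
    bool state = inner();                /* did the helper turn IRQs back on? */
    model_unlock_irqrestore(flags);
    return state;
}
```

With `inner_with_irq` the caller finds interrupts enabled in the middle of its own critical section; with `inner_with_irqsave` they stay disabled until the caller restores them.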

Key Takeaways

  1. Split interrupt handling: Minimal work in hardirq, bulk processing in softirq/workqueue
  2. Choose the right mechanism: Workqueue or threaded IRQ if you need to sleep, softirq (or a legacy tasklet) otherwise
  3. IRQ affinity matters: Align with NUMA topology and application threads
  4. NAPI is essential: For any high-performance networking
  5. Monitor interrupt rates: Sudden jumps in /proc/interrupts or /proc/softirqs counts point to storms or softirq backlog

Next Steps