Interrupts & Exception Handling

Interrupts are fundamental to how Linux handles hardware events, system calls, and exceptional conditions. Understanding the interrupt subsystem is crucial for debugging performance issues and writing high-performance systems code.
Interview Frequency: High (especially for performance-critical roles)
Key Topics: IRQ handling, softirqs, tasklets, workqueues, interrupt coalescing
Time to Master: 12-14 hours

Why Interrupts Matter

Every time a network packet arrives, a disk I/O completes, or a timer fires, an interrupt is involved. Understanding interrupts explains:
  • Why context switches happen: Interrupts can preempt any code
  • Network performance: Interrupt coalescing and NAPI
  • CPU affinity effects: IRQ pinning and load balancing
  • Latency sources: Interrupt storms and processing time

Interrupt Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LINUX INTERRUPT ARCHITECTURE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HARDWARE LAYER                                                              │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │  ┌─────┐   ┌─────┐   ┌─────┐   ┌──────┐   ┌─────┐                      ││
│  │  │ NIC │   │ Disk│   │Timer│   │ PCIe │   │ USB │                      ││
│  │  └──┬──┘   └──┬──┘   └──┬──┘   └───┬──┘   └──┬──┘                      ││
│  │     │         │         │          │         │                          ││
│  │     └─────────┴────┬────┴──────────┴─────────┘                          ││
│  │                    ▼                                                     ││
│  │          ┌─────────────────────┐                                        ││
│  │          │   Interrupt         │    Modern: MSI/MSI-X                   ││
│  │          │   Controller        │    (Message Signaled Interrupts)       ││
│  │          │   (APIC/GIC)        │                                        ││
│  │          └──────────┬──────────┘                                        ││
│  └─────────────────────│───────────────────────────────────────────────────┘│
│                        │                                                     │
│                        ▼                                                     │
│  CPU                                                                         │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                                                                          ││
│  │  1. CPU receives interrupt signal                                       ││
│  │  2. Saves current context (registers, flags)                           ││
│  │  3. Looks up handler in IDT (Interrupt Descriptor Table)               ││
│  │  4. Switches to kernel stack                                            ││
│  │  5. Jumps to interrupt handler                                          ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                        │                                                     │
│                        ▼                                                     │
│  KERNEL INTERRUPT HANDLING                                                   │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                                                                          ││
│  │  ┌───────────────────────────────────────────────────────────────┐      ││
│  │  │           HARDIRQ CONTEXT (interrupts disabled)               │      ││
│  │  │                                                               │      ││
│  │  │  • Acknowledge interrupt to hardware                          │      ││
│  │  │  • Do MINIMAL work (read status, copy data to buffer)        │      ││
│  │  │  • Schedule deferred work (softirq, tasklet, workqueue)      │      ││
│  │  │  • Return ASAP (microseconds, not milliseconds)              │      ││
│  │  │                                                               │      ││
│  │  └────────────────────────────┬──────────────────────────────────┘      ││
│  │                               │                                          ││
│  │                               ▼                                          ││
│  │  ┌───────────────────────────────────────────────────────────────┐      ││
│  │  │          SOFTIRQ CONTEXT (interrupts enabled)                  │      ││
│  │  │                                                               │      ││
│  │  │  • Process accumulated data (network packets, block I/O)     │      ││
│  │  │  • Can be preempted by hardirqs                              │      ││
│  │  │  • Cannot sleep or block                                     │      ││
│  │  │                                                               │      ││
│  │  └────────────────────────────┬──────────────────────────────────┘      ││
│  │                               │                                          ││
│  │                               ▼                                          ││
│  │  ┌───────────────────────────────────────────────────────────────┐      ││
│  │  │           PROCESS CONTEXT (workqueues)                         │      ││
│  │  │                                                               │      ││
│  │  │  • Can sleep and block                                       │      ││
│  │  │  • Can allocate memory with GFP_KERNEL                       │      ││
│  │  │  • Full kernel API available                                 │      ││
│  │  │                                                               │      ││
│  │  └───────────────────────────────────────────────────────────────┘      ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Interrupt Types

Exceptions vs Interrupts

Type                  Cause                  Synchronous?   Examples
Exception             CPU execution          Yes            Page fault, divide by zero, syscall
Hardware IRQ          External device        No             NIC, disk, timer, keyboard
Software Interrupt    Explicit instruction   Yes            int 0x80, syscall

Exception Categories

┌─────────────────────────────────────────────────────────────────────────────┐
│                         EXCEPTION TYPES (x86-64)                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  FAULTS                                                                      │
│  ────────                                                                   │
│  • Recoverable: execution can continue after handling                       │
│  • Return address = instruction that caused fault                           │
│  • Examples: page fault (#PF), segment not present (#NP)                   │
│                                                                              │
│  TRAPS                                                                       │
│  ──────                                                                     │
│  • Intentional: used for debugging and system calls                        │
│  • Return address = next instruction                                        │
│  • Examples: breakpoint (#BP), overflow (#OF), syscall                     │
│                                                                              │
│  ABORTS                                                                      │
│  ────────                                                                   │
│  • Severe: cannot continue execution                                        │
│  • Examples: machine check (#MC), double fault (#DF)                       │
│                                                                              │
│  COMMON EXCEPTION VECTORS                                                    │
│  ─────────────────────────                                                  │
│  0:  #DE - Divide Error                                                     │
│  3:  #BP - Breakpoint                                                       │
│  6:  #UD - Invalid Opcode                                                   │
│  8:  #DF - Double Fault                                                     │
│  13: #GP - General Protection                                               │
│  14: #PF - Page Fault                                                       │
│  18: #MC - Machine Check                                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Interrupt Descriptor Table (IDT)

The IDT maps interrupt vectors to handlers:
// arch/x86/kernel/idt.c (simplified)
struct idt_data {
    unsigned int    vector;      // Interrupt number (0-255)
    unsigned int    segment;     // Code segment selector
    struct idt_bits bits;        // Type, DPL, present
    const void      *addr;       // Handler address
};

// IDT entries
static const __initconst struct idt_data early_idts[] = {
    INTG(X86_TRAP_DE,     asm_exc_divide_error),       // #DE
    INTG(X86_TRAP_NMI,    asm_exc_nmi),                // NMI
    INTG(X86_TRAP_BP,     asm_exc_int3),               // #BP
    INTG(X86_TRAP_OF,     asm_exc_overflow),           // #OF
    INTG(X86_TRAP_UD,     asm_exc_invalid_op),         // #UD
    INTG(X86_TRAP_DF,     asm_exc_double_fault),       // #DF
    INTG(X86_TRAP_GP,     asm_exc_general_protection), // #GP
    INTG(X86_TRAP_PF,     asm_exc_page_fault),         // #PF
    // ... more entries
};

// Hardware interrupts (IRQs) start at vector 32
#define FIRST_EXTERNAL_VECTOR 0x20

Viewing Interrupt Information

# View interrupt statistics
cat /proc/interrupts

# Example output:
#            CPU0       CPU1       CPU2       CPU3
#   0:         25          0          0          0  IR-IO-APIC   2-edge      timer
#   8:          0          0          0          1  IR-IO-APIC   8-edge      rtc0
#  16:          0          0          0          0  IR-IO-APIC  16-fasteoi   ehci_hcd
# 120:          0          0          0          0  DMAR-MSI    0-edge       dmar0
# 121:     234567      12345       9876       8765  IR-PCI-MSI 512000-edge  nvme0q0
# 122:          0     987654          0          0  IR-PCI-MSI 512001-edge  nvme0q1
# NMI:          0          0          0          0   Non-maskable interrupts
# LOC:    1234567    1234567    1234567    1234567   Local timer interrupts

# View affinity for a specific IRQ
cat /proc/irq/121/smp_affinity      # Hex mask of allowed CPUs
cat /proc/irq/121/smp_affinity_list # Human-readable CPU list

Hardware IRQ Handling

IRQ Handler Registration

// include/linux/interrupt.h
int request_irq(unsigned int irq,
                irq_handler_t handler,
                unsigned long flags,
                const char *name,
                void *dev_id);

// flags:
#define IRQF_SHARED         0x00000080  // Multiple devices share IRQ
#define IRQF_TRIGGER_HIGH   0x00000004  // Level-triggered, active high
#define IRQF_TRIGGER_RISING 0x00000001  // Edge-triggered, rising
#define IRQF_ONESHOT        0x00002000  // IRQ disabled until handler completes

// Handler return values
typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);

#define IRQ_NONE      (0)  // Interrupt wasn't from this device
#define IRQ_HANDLED   (1)  // Interrupt was handled
#define IRQ_WAKE_THREAD (2)  // Wake threaded handler

Example: Network Driver IRQ Handler

// Simplified network driver interrupt handler
static irqreturn_t my_net_irq_handler(int irq, void *dev_id)
{
    struct my_net_device *dev = dev_id;
    u32 status;
    
    // Read interrupt status register
    status = my_read_reg(dev, INTR_STATUS);
    
    if (!(status & MY_INTR_MASK))
        return IRQ_NONE;  // Not our interrupt (shared IRQ line)
    
    // Acknowledge interrupt to hardware
    my_write_reg(dev, INTR_ACK, status);
    
    if (status & INTR_RX_DONE) {
        // Disable RX interrupts, schedule NAPI poll
        my_write_reg(dev, INTR_DISABLE, INTR_RX_DONE);
        napi_schedule(&dev->napi);  // Schedule softirq
    }
    
    if (status & INTR_TX_DONE) {
        // TX completion - can do minimal work here
        tasklet_schedule(&dev->tx_tasklet);
    }
    
    return IRQ_HANDLED;
}
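
Registration usually happens in the driver's probe path, with a matching free_irq() on teardown. A minimal sketch under the same assumptions as the handler above (my_net_probe, my_net_remove, and the dev->irq field are illustrative names, not from a specific driver):
// Hypothetical probe/remove pair showing request_irq()/free_irq()
static int my_net_probe(struct my_net_device *dev)
{
    int ret;

    // IRQF_SHARED allows the line to be shared with other devices,
    // which is why my_net_irq_handler() must check INTR_STATUS and
    // return IRQ_NONE when the interrupt is not ours.
    ret = request_irq(dev->irq, my_net_irq_handler, IRQF_SHARED,
                      "my_net", dev);
    if (ret)
        return ret;

    return 0;
}

static void my_net_remove(struct my_net_device *dev)
{
    free_irq(dev->irq, dev);   // dev_id must match the request_irq() cookie
}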

Softirqs: High-Priority Deferred Work

Softirq Types

// include/linux/interrupt.h
enum {
    HI_SOFTIRQ = 0,        // High-priority tasklets
    TIMER_SOFTIRQ,         // Timer processing
    NET_TX_SOFTIRQ,        // Network transmit
    NET_RX_SOFTIRQ,        // Network receive
    BLOCK_SOFTIRQ,         // Block device completion
    IRQ_POLL_SOFTIRQ,      // IRQ polling
    TASKLET_SOFTIRQ,       // Regular tasklets
    SCHED_SOFTIRQ,         // Scheduler load balancing
    HRTIMER_SOFTIRQ,       // High-resolution timers
    RCU_SOFTIRQ,           // RCU callbacks
    NR_SOFTIRQS
};
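
Drivers never add new softirq types; the fixed set above is registered once at boot with open_softirq() and marked pending with raise_softirq(). A simplified sketch of how the networking core wires up NET_RX_SOFTIRQ (condensed from net/core/dev.c; handler signatures vary slightly across kernel versions):
// At network subsystem init: bind the vector to its handler.
// net_rx_action() runs in softirq context (interrupts enabled, no sleeping)
// and walks the per-CPU NAPI poll list under a packet/time budget.
open_softirq(NET_RX_SOFTIRQ, net_rx_action);

// From hardirq context, the work is scheduled by setting the per-CPU
// pending bit; napi_schedule() ends up doing the equivalent of:
raise_softirq(NET_RX_SOFTIRQ);   // runs on IRQ exit, or in ksoftirqd under load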

Softirq Properties

┌─────────────────────────────────────────────────────────────────────────────┐
│                         SOFTIRQ CHARACTERISTICS                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Properties:                                                                 │
│  • Fixed number (10 currently, compile-time)                                │
│  • Run with interrupts enabled                                              │
│  • Cannot sleep or block                                                    │
│  • Same softirq can run simultaneously on different CPUs                    │
│  • Must use appropriate locking                                             │
│                                                                              │
│  Execution Points:                                                           │
│  • After hardirq handler returns                                            │
│  • When explicitly enabled (local_bh_enable())                              │
│  • By ksoftirqd kernel threads (when too many pending)                      │
│                                                                              │
│  Priority Order:                                                             │
│  HI_SOFTIRQ > TIMER > NET_TX > NET_RX > BLOCK > ... > RCU                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
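
Process-context code that shares data with a softirq handler typically brackets the access with local_bh_disable()/local_bh_enable() (or the spin_lock_bh() variants); re-enabling is one of the execution points listed above. A small sketch with illustrative names (my_lock, my_list, item):
// Process-context side of data shared with softirq context
local_bh_disable();                    // keep softirqs off this CPU for now
spin_lock(&my_lock);                   // still needed against other CPUs
list_add_tail(&item->node, &my_list);
spin_unlock(&my_lock);
local_bh_enable();                     // pending softirqs may run right here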

Viewing Softirq Activity

# View softirq statistics per CPU
cat /proc/softirqs
#                    CPU0       CPU1       CPU2       CPU3
#          HI:          0          0          0          0
#       TIMER:   12345678   12345678   12345678   12345678
#      NET_TX:      12345      12345      12345      12345
#      NET_RX:   98765432    1234567     123456      12345
#       BLOCK:     123456     234567     345678     456789
#    IRQ_POLL:          0          0          0          0
#     TASKLET:       1234       2345       3456       4567
#       SCHED:    1234567    1234567    1234567    1234567
#     HRTIMER:     123456     123456     123456     123456
#         RCU:    1234567    1234567    1234567    1234567

# Watch ksoftirqd CPU usage
top -p $(pgrep -d',' ksoftirqd)

Tasklets: Dynamic Deferred Work

Tasklets are built on top of softirqs (TASKLET_SOFTIRQ and HI_SOFTIRQ) but can be created and scheduled dynamically at runtime:
// Tasklet declaration
struct tasklet_struct {
    struct tasklet_struct *next;
    unsigned long state;      // TASKLET_STATE_SCHED, TASKLET_STATE_RUN
    atomic_t count;           // Disable count
    void (*func)(unsigned long);
    unsigned long data;
};

// Static initialization
DECLARE_TASKLET(my_tasklet, my_tasklet_handler, data);
DECLARE_TASKLET_DISABLED(my_tasklet, my_tasklet_handler, data);

// Dynamic initialization
tasklet_init(&my_tasklet, my_tasklet_handler, data);

// Schedule for execution
tasklet_schedule(&my_tasklet);     // Normal priority (TASKLET_SOFTIRQ)
tasklet_hi_schedule(&my_tasklet);  // High priority (HI_SOFTIRQ)

// Control
tasklet_disable(&my_tasklet);  // Prevent execution
tasklet_enable(&my_tasklet);   // Re-enable
tasklet_kill(&my_tasklet);     // Remove (waits if running)
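
Putting the pieces together, a driver usually schedules its tasklet from the hardirq handler and does the heavier (still non-sleeping) work in the callback. A sketch using the classic data-based API shown above; recent kernels have moved to tasklet_setup() with a callback taking a struct tasklet_struct *, and the helper names here are illustrative:
static void my_tx_tasklet_fn(unsigned long data)
{
    struct my_device *dev = (struct my_device *)data;

    // Softirq context: reclaim completed TX descriptors, wake the queue, etc.
    reclaim_tx_descriptors(dev);
}

// During probe:
tasklet_init(&dev->tx_tasklet, my_tx_tasklet_fn, (unsigned long)dev);

// From the hardirq handler (see the network driver example earlier):
tasklet_schedule(&dev->tx_tasklet);

// During teardown:
tasklet_kill(&dev->tx_tasklet);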

Tasklet vs Softirq

Feature        Softirq                                   Tasklet
Concurrency    Same softirq can run on multiple CPUs     Same tasklet runs on only one CPU at a time
Definition     Static (compile-time)                     Dynamic (runtime)
Locking        Must handle SMP locking yourself          Serialized per tasklet
Use case       High-frequency, performance-critical      General deferred work

Workqueues: Process Context Deferred Work

Use a workqueue when the deferred work needs to sleep or allocate memory with GFP_KERNEL:
// Create work items
struct work_struct my_work;
struct delayed_work my_dwork;               // used by schedule_delayed_work() below
INIT_WORK(&my_work, my_work_handler);
INIT_DELAYED_WORK(&my_dwork, my_work_handler);

// Work handler
static void my_work_handler(struct work_struct *work)
{
    // Can sleep, allocate memory, etc.
    struct my_device *dev = container_of(work, struct my_device, work);
    
    // Do heavy processing
    process_data(dev);
    
    // Allocate memory (can sleep)
    void *buf = kmalloc(4096, GFP_KERNEL);
    
    // Call functions that might sleep
    mutex_lock(&dev->lock);
    // ...
    mutex_unlock(&dev->lock);
}

// Schedule work
schedule_work(&my_work);                    // Global workqueue
queue_work(my_workqueue, &my_work);         // Custom workqueue
schedule_delayed_work(&my_dwork, HZ * 5);   // Delayed by 5 seconds

Workqueue Types

// Create a custom workqueue
// WQ_UNBOUND: workers are not bound to the submitting CPU
// WQ_MEM_RECLAIM: guarantees a rescuer thread so the queue can make
// forward progress even under memory pressure
struct workqueue_struct *wq = alloc_workqueue("my_wq",
    WQ_UNBOUND | WQ_MEM_RECLAIM, max_active);

// Flags:
#define WQ_UNBOUND       (1 << 1)   // Work can run on any CPU
#define WQ_FREEZABLE     (1 << 2)   // Participate in system suspend
#define WQ_MEM_RECLAIM   (1 << 3)   // Can be used for memory reclaim
#define WQ_HIGHPRI       (1 << 4)   // High priority workers
#define WQ_CPU_INTENSIVE (1 << 5)   // CPU-intensive work

// System workqueues
system_wq              // Default, bound
system_highpri_wq      // High priority
system_long_wq         // For long-running work
system_unbound_wq      // Not bound to any CPU; long or CPU-heavy work
system_freezable_wq    // Freezable for suspend
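
A typical lifecycle for a driver-owned workqueue is: create it once, queue work items as events arrive, then flush and destroy it on teardown. A minimal sketch (my_wq, my_init/my_exit, and the dev->work field are illustrative and reuse my_work_handler() from the example above):
static struct workqueue_struct *my_wq;

static int my_init(void)
{
    // Allow up to 4 in-flight work items on this queue
    my_wq = alloc_workqueue("my_wq", WQ_UNBOUND, 4);
    if (!my_wq)
        return -ENOMEM;
    return 0;
}

static void my_event(struct my_device *dev)
{
    queue_work(my_wq, &dev->work);   // my_work_handler() runs in process context
}

static void my_exit(void)
{
    flush_workqueue(my_wq);          // wait for queued and running work
    destroy_workqueue(my_wq);
}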

Concurrency Managed Workqueue (cmwq)

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CONCURRENCY MANAGED WORKQUEUE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Work Items                                                                  │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐                                   │
│  │ W1  │ │ W2  │ │ W3  │ │ W4  │ │ W5  │                                   │
│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘                                   │
│     │       │       │       │       │                                        │
│     └───────┴───────┼───────┴───────┘                                        │
│                     │                                                         │
│                     ▼                                                         │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                    Per-CPU Worker Pools                                  ││
│  │                                                                          ││
│  │  CPU 0 Pool              CPU 1 Pool              CPU 2 Pool             ││
│  │  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐    ││
│  │  │ Worker threads: │     │ Worker threads: │     │ Worker threads: │    ││
│  │  │ [kworker/0:0]   │     │ [kworker/1:0]   │     │ [kworker/2:0]   │    ││
│  │  │ [kworker/0:1]   │     │ [kworker/1:1]   │     │ [kworker/2:1]   │    ││
│  │  │ (dynamic)       │     │ (dynamic)       │     │ (dynamic)       │    ││
│  │  └─────────────────┘     └─────────────────┘     └─────────────────┘    ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
│  Key Properties:                                                             │
│  • Workers created/destroyed dynamically based on load                       │
│  • Work items queued to per-CPU pools for locality                          │
│  • Automatic concurrency management (avoids thundering herd)                │
│  • WQ_UNBOUND work uses separate unbound pools                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Choosing the Right Deferred Work

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DEFERRED WORK DECISION TREE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Need to defer work from interrupt context?                                  │
│  │                                                                           │
│  └─► Does the work need to sleep?                                           │
│      │                                                                       │
│      ├─ YES ──► Use WORKQUEUE                                               │
│      │          • Can sleep (mutex_lock, kmalloc with GFP_KERNEL)           │
│      │          • Full kernel API available                                 │
│      │          • Higher latency (process context switch)                   │
│      │                                                                       │
│      └─ NO ───► Is it performance-critical networking/block I/O?            │
│                 │                                                            │
│                 ├─ YES ──► Use SOFTIRQ (if modifying kernel)               │
│                 │          or NAPI (for network drivers)                    │
│                 │          • Lowest latency                                 │
│                 │          • Runs at highest priority                       │
│                 │          • Need to handle SMP locking                     │
│                 │                                                            │
│                 └─ NO ───► Use TASKLET                                      │
│                            • Simpler than softirq                           │
│                            • Serialized execution                           │
│                            • Good for most driver deferred work             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Interrupt Affinity and Performance

Setting IRQ Affinity

# View current affinity (hex bitmask)
cat /proc/irq/121/smp_affinity
# 0000000f means CPUs 0-3

# Set affinity to CPUs 0 and 1
echo 3 > /proc/irq/121/smp_affinity

# Using affinity list (human-readable)
echo 0-3 > /proc/irq/121/smp_affinity_list
echo 0,2,4,6 > /proc/irq/121/smp_affinity_list

# Check if irqbalance is running (it may override your settings)
systemctl status irqbalance

# Disable irqbalance for manual control
systemctl stop irqbalance

Performance Tuning Patterns

┌─────────────────────────────────────────────────────────────────────────────┐
│                    IRQ AFFINITY STRATEGIES                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. NETWORK-INTENSIVE WORKLOADS                                              │
│     ────────────────────────────                                            │
│     • Pin NIC IRQs to dedicated CPUs                                        │
│     • Use RPS/RFS for receive steering                                      │
│     • Enable XPS for transmit steering                                      │
│     • Consider isolcpus for application CPUs                                │
│                                                                              │
│     Example (4-queue NIC, 8 CPUs):                                          │
│     IRQ 121 (rx-0) → CPU 0                                                  │
│     IRQ 122 (rx-1) → CPU 1                                                  │
│     IRQ 123 (rx-2) → CPU 2                                                  │
│     IRQ 124 (rx-3) → CPU 3                                                  │
│     CPUs 4-7 → Application threads                                          │
│                                                                              │
│  2. LATENCY-SENSITIVE WORKLOADS                                              │
│     ──────────────────────────                                              │
│     • Isolate CPUs for application (isolcpus=)                              │
│     • Pin all IRQs to housekeeping CPUs                                     │
│     • Use nohz_full for tickless operation                                  │
│     • Consider disabling irqbalance                                         │
│                                                                              │
│  3. NUMA-AWARE AFFINITY                                                      │
│     ───────────────────                                                     │
│     • Pin IRQs to same NUMA node as device                                  │
│     • Avoid cross-NUMA memory access                                        │
│     • Check: cat /sys/class/net/eth0/device/numa_node                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Measuring Interrupt Latency

# Using cyclictest for latency measurement
sudo cyclictest -p 80 -t1 -n -i 1000 -l 10000
# -p 80: priority
# -t1: 1 thread
# -n: use nanosleep
# -i 1000: 1ms interval
# -l 10000: 10000 loops

# Using perf to examine scheduling latency (record a trace first, then report)
sudo perf sched record -- sleep 10
sudo perf sched latency

# Using bpftrace
sudo bpftrace -e '
tracepoint:irq:irq_handler_entry {
    @start[args->irq] = nsecs;
}
tracepoint:irq:irq_handler_exit /@start[args->irq]/ {
    @latency_ns = hist(nsecs - @start[args->irq]);
    delete(@start[args->irq]);
}'

NAPI: Network Interrupt Coalescing

NAPI (New API) reduces interrupt overhead for high-throughput networking:
┌─────────────────────────────────────────────────────────────────────────────┐
│                         NAPI MECHANISM                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Traditional (interrupt per packet):                                         │
│  ──────────────────────────────────                                         │
│  Packet 1 → IRQ → Handle → Return                                           │
│  Packet 2 → IRQ → Handle → Return                                           │
│  Packet 3 → IRQ → Handle → Return                                           │
│  ... (high CPU overhead)                                                    │
│                                                                              │
│  NAPI (polling mode):                                                        │
│  ─────────────────────                                                      │
│  Packet 1 → IRQ → Disable IRQ → Schedule NAPI                              │
│                                  │                                           │
│                                  ▼                                           │
│                    ┌─────────────────────────────────────┐                  │
│                    │         NAPI Poll Loop              │                  │
│                    │                                     │                  │
│                    │  while (budget > 0) {               │                  │
│                    │      packet = poll_device();        │                  │
│                    │      if (!packet) break;            │                  │
│                    │      process(packet);               │                  │
│                    │      budget--;                      │                  │
│                    │  }                                  │                  │
│                    │                                     │                  │
│                    │  if (budget > 0) {                  │                  │
│                    │      napi_complete(); // Re-enable  │                  │
│                    │      enable_irq();    // IRQ        │                  │
│                    │  } else {                           │                  │
│                    │      reschedule();    // More work  │                  │
│                    │  }                                  │                  │
│                    │                                     │                  │
│                    └─────────────────────────────────────┘                  │
│                                                                              │
│  Benefits:                                                                   │
│  • Reduced interrupt overhead                                               │
│  • Better CPU cache utilization                                             │
│  • Back-pressure mechanism (budget)                                         │
│  • Automatic adaptation to load                                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

NAPI API

// Initialize NAPI
netif_napi_add(netdev, &dev->napi, my_poll, NAPI_POLL_WEIGHT);
napi_enable(&dev->napi);

// In IRQ handler
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    struct my_device *dev = dev_id;
    
    // Mask the device's RX interrupts and schedule NAPI polling
    if (napi_schedule_prep(&dev->napi)) {
        my_disable_rx_irqs(dev);    // device-specific register write, not disable_irq()
        __napi_schedule(&dev->napi);
    }
    return IRQ_HANDLED;
}

// Poll function
static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_device *dev = container_of(napi, struct my_device, napi);
    int work_done = 0;
    
    while (work_done < budget) {
        struct sk_buff *skb = get_next_packet(dev);
        if (!skb)
            break;
        
        // Process packet
        napi_gro_receive(napi, skb);
        work_done++;
    }
    
    if (work_done < budget) {
        napi_complete_done(napi, work_done);
        my_enable_rx_irqs(dev);    // device-specific: unmask RX interrupts again
    }
    
    return work_done;
}
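
On teardown the driver quiesces NAPI before releasing the interrupt, so no new polls can be scheduled while resources are being freed. A brief sketch of the usual ordering (simplified; dev->irq is illustrative):
// During device close/remove
napi_disable(&dev->napi);      // waits for any in-flight poll to finish
free_irq(dev->irq, dev);       // no further hardirqs can schedule NAPI
netif_napi_del(&dev->napi);    // unregister the NAPI instance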

Threaded IRQs

Threaded IRQs split handling into a quick hardirq check and a handler that runs in a dedicated kernel thread, where sleeping is allowed:
// Request threaded IRQ
int request_threaded_irq(unsigned int irq,
                         irq_handler_t handler,        // Hardirq handler
                         irq_handler_t thread_fn,      // Threaded handler
                         unsigned long flags,
                         const char *name,
                         void *dev_id);

// Example
static irqreturn_t my_hardirq_handler(int irq, void *dev_id)
{
    // Quick check: is this our interrupt?
    if (!check_interrupt_pending(dev_id))
        return IRQ_NONE;
    
    // Acknowledge hardware
    ack_interrupt(dev_id);
    
    // Wake threaded handler
    return IRQ_WAKE_THREAD;
}

static irqreturn_t my_threaded_handler(int irq, void *dev_id)
{
    // Can sleep here!
    mutex_lock(&dev_lock);
    process_data(dev_id);
    mutex_unlock(&dev_lock);
    
    return IRQ_HANDLED;
}

// Registration
request_threaded_irq(irq, my_hardirq_handler, my_threaded_handler,
                     IRQF_ONESHOT, "my_device", dev);

Interview Questions

Q: What is the difference between hardirq (top half) and softirq (bottom half) context?
Answer: Hardirq context (top half):
  • Runs with interrupts disabled on local CPU
  • Must be extremely fast (microseconds)
  • Cannot sleep or allocate memory with GFP_KERNEL
  • Preempts everything including kernel code
Softirq context (bottom half):
  • Runs with interrupts enabled
  • Can be preempted by hardirqs
  • Still cannot sleep (atomic context)
  • Used for deferred processing (networking, block I/O)
Key insight: Split processing into minimal hardirq work (acknowledge, disable, schedule) and heavier softirq work (process data).
Q: When would you choose a workqueue over a tasklet (or vice versa)?
Answer: Use workqueue when:
  • You need to sleep (mutex, blocking I/O)
  • You need to allocate memory with GFP_KERNEL
  • The work is not time-critical
  • You need to call functions that might block
Use tasklet when:
  • Work must be done quickly after interrupt
  • You don’t need to sleep
  • Work is small and fast
  • You want serialization (same tasklet won’t run concurrently)
Example: Network driver TX completion → tasklet (fast, no sleeping). Firmware loading → workqueue (needs file I/O, can sleep).
Q: What problem does NAPI solve, and how?
Answer: Problem: At high packet rates, per-packet interrupts cause:
  • High CPU overhead (context switch per packet)
  • Cache thrashing
  • Interrupt storms (CPU spends all time in IRQ handlers)
NAPI solution:
  1. First packet triggers interrupt
  2. Disable further interrupts for that queue
  3. Switch to polling mode (softirq)
  4. Process packets in batches (budget-based)
  5. Re-enable interrupts when queue is empty
Benefits:
  • Amortized interrupt cost across many packets
  • Better cache locality (process batch together)
  • Natural back-pressure (stop polling when overwhelmed)
  • Scales to millions of packets/second
Q: How would you debug a suspected interrupt storm?
Answer: Symptoms:
  • High CPU usage in si (softirq) or hi (hardirq)
  • System unresponsive
  • /proc/interrupts shows rapidly increasing counts
Debugging steps:
# 1. Identify which IRQ
watch -n 1 cat /proc/interrupts

# 2. Check which device
cat /proc/irq/<irq_num>/smp_affinity_list
ls -la /proc/irq/<irq_num>/

# 3. Monitor interrupt rate
perf stat -e irq:irq_handler_entry -a sleep 1

# 4. Trace specific IRQ
sudo bpftrace -e '
tracepoint:irq:irq_handler_entry /args->irq == 121/ {
    @count = count();
}'
Common causes:
  • Faulty hardware generating spurious interrupts
  • Driver bug not acknowledging interrupt properly
  • Shared IRQ with misbehaving device
  • Misconfigured interrupt coalescing

Practice Exercises

  1. Interrupt Statistics: Write a script that monitors /proc/interrupts and alerts when any IRQ rate exceeds a threshold.
  2. Affinity Configuration: Set up IRQ affinity for a multi-queue NIC to optimize for either throughput or latency.
  3. eBPF Tracing: Write a bpftrace script to trace interrupt handler latency and identify slow handlers.
  4. Workqueue Analysis: Use workqueue:* tracepoints to analyze work item execution patterns.

Summary

Mechanism      Context     Can Sleep?   Use Case
Hardirq        Interrupt   No           Acknowledge HW, schedule deferred work
Softirq        Atomic      No           High-frequency networking, block I/O
Tasklet        Atomic      No           Driver deferred work, serialized
Workqueue      Process     Yes          General deferred work, can block
Threaded IRQ   Process     Yes          Device handling that needs sleeping

Key Takeaways

  1. Split interrupt handling: Minimal work in hardirq, bulk processing in softirq/workqueue
  2. Choose the right mechanism: Workqueue if you need to sleep, tasklet/softirq otherwise
  3. IRQ affinity matters: Align with NUMA topology and application threads
  4. NAPI is essential: For any high-performance networking
  5. Monitor interrupt rates: High rates indicate potential issues

Next Steps