# Interrupts & Exception Handling

> Master interrupt handling, IRQ architecture, softirqs, tasklets, and workqueues in the Linux kernel

Interrupts are fundamental to how Linux handles hardware events, system calls, and exceptional conditions. Understanding the interrupt subsystem is crucial for debugging performance issues and writing high-performance systems code.

**The analogy**: Imagine you're a chef cooking an elaborate meal (the running process). A kitchen timer goes off (hardware interrupt) -- you must stop what you're doing, turn off the oven, and then decide: do you handle the hot dish right now (hardirq), or set it aside on the counter to plate later when you have a moment (softirq/workqueue)? The key insight is the same as in the kernel: **acknowledge the interrupt immediately, but defer the heavy work**. If you spend too long handling the timer, all your other dishes burn.

This "top-half / bottom-half" split is the single most important concept in interrupt handling. Get it wrong and you get latency spikes, dropped packets, and unresponsive systems.

**Interview Frequency**: High (especially for performance-critical roles)\
**Key Topics**: IRQ handling, softirqs, tasklets, workqueues, interrupt coalescing\
**Time to Master**: 12-14 hours

***

## Why Interrupts Matter

Every time a network packet arrives, a disk I/O completes, or a timer fires, an interrupt is involved.
Understanding interrupts explains:

* **Why context switches happen**: Interrupts can preempt any code
* **Network performance**: Interrupt coalescing and NAPI
* **CPU affinity effects**: IRQ pinning and load balancing
* **Latency sources**: Interrupt storms and processing time

**A real-world example**: At 10Gbps with 64-byte packets, a NIC can receive over 14 million packets per second, which means over 14 million interrupts per second if every packet raises its own interrupt. If each interrupt takes 5 microseconds of CPU time, that's about 70 seconds of CPU time for every second of traffic -- the equivalent of roughly 70 cores doing nothing but handling interrupts. This is why NAPI (polling mode) was invented, and why interrupt affinity tuning is a daily task for infrastructure engineers at companies like Cloudflare and Datadog.

***

## Interrupt Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        LINUX INTERRUPT ARCHITECTURE                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  HARDWARE LAYER                                                             │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │   ┌─────┐   ┌─────┐   ┌─────┐   ┌──────┐   ┌─────┐                      ││
│  │   │ NIC │   │ Disk│   │Timer│   │ PCIe │   │ USB │                      ││
│  │   └──┬──┘   └──┬──┘   └──┬──┘   └───┬──┘   └──┬──┘                      ││
│  │      │         │         │          │         │                         ││
│  │      └─────────┴────┬────┴──────────┴─────────┘                         ││
│  │                     ▼                                                   ││
│  │          ┌─────────────────────┐                                        ││
│  │          │     Interrupt       │  Modern: MSI/MSI-X                     ││
│  │          │     Controller      │  (Message Signaled Interrupts)         ││
│  │          │     (APIC/GIC)      │                                        ││
│  │          └──────────┬──────────┘                                        ││
│  └─────────────────────│───────────────────────────────────────────────────┘│
│                        │                                                    │
│                        ▼                                                    │
│  CPU                                                                        │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                                                                         ││
│  │  1. CPU receives interrupt signal                                       ││
│  │  2. Saves current context (registers, flags)                            ││
│  │  3. Looks up handler in IDT (Interrupt Descriptor Table)                ││
│  │  4. Switches to kernel stack                                            ││
│  │  5. Jumps to interrupt handler                                          ││
│  │                                                                         ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                        │                                                    │
│                        ▼                                                    │
│  KERNEL INTERRUPT HANDLING                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                                                                         ││
│  │    ┌───────────────────────────────────────────────────────────────┐    ││
│  │    │  HARDIRQ CONTEXT (interrupts disabled)                        │    ││
│  │    │                                                               │    ││
│  │    │  • Acknowledge interrupt to hardware                          │    ││
│  │    │  • Do MINIMAL work (read status, copy data to buffer)         │    ││
│  │    │  • Schedule deferred work (softirq, tasklet, workqueue)       │    ││
│  │    │  • Return ASAP (microseconds, not milliseconds)               │    ││
│  │    │                                                               │    ││
│  │    └────────────────────────────┬──────────────────────────────────┘    ││
│  │                                 │                                       ││
│  │                                 ▼                                       ││
│  │    ┌───────────────────────────────────────────────────────────────┐    ││
│  │    │  SOFTIRQ CONTEXT (interrupts enabled)                         │    ││
│  │    │                                                               │    ││
│  │    │  • Process accumulated data (network packets, block I/O)      │    ││
│  │    │  • Can be preempted by hardirqs                               │    ││
│  │    │  • Cannot sleep or block                                      │    ││
│  │    │                                                               │    ││
│  │    └────────────────────────────┬──────────────────────────────────┘    ││
│  │                                 │                                       ││
│  │                                 ▼                                       ││
│  │    ┌───────────────────────────────────────────────────────────────┐    ││
│  │    │  PROCESS CONTEXT (workqueues)                                 │    ││
│  │    │                                                               │    ││
│  │    │  • Can sleep and block                                        │    ││
│  │    │  • Can allocate memory with GFP_KERNEL                        │    ││
│  │    │  • Full kernel API available                                  │    ││
│  │    │                                                               │    ││
│  │    └───────────────────────────────────────────────────────────────┘    ││
│  │                                                                         ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## Interrupt Types

### Exceptions vs Interrupts

| Type                   | Cause                | Synchronous? | Examples                            |
| ---------------------- | -------------------- | ------------ | ----------------------------------- |
| **Exception**          | CPU execution        | Yes          | Page fault, divide by zero, syscall |
| **Hardware IRQ**       | External device      | No           | NIC, disk, timer, keyboard          |
| **Software Interrupt** | Explicit instruction | Yes          | `int 0x80`, `syscall`               |

### Exception Categories

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          EXCEPTION TYPES (x86-64)                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  FAULTS                                                                     │
│  ──────                                                                     │
│  • Recoverable: execution can continue after handling                       │
│  • Return address = instruction that caused fault                           │
│  • Examples: page fault (#PF), segment not present (#NP)                    │
│                                                                             │
│  TRAPS                                                                      │
│  ─────                                                                      │
│  • Intentional: used for debugging and system calls                         │
│  • Return address = next instruction                                        │
│  • Examples: breakpoint (#BP), overflow (#OF), syscall                      │
│                                                                             │
│  ABORTS                                                                     │
│  ──────                                                                     │
│  • Severe: cannot continue execution                                        │
│  • Examples: machine check (#MC), double fault (#DF)                        │
│                                                                             │
│  COMMON EXCEPTION VECTORS                                                   │
│  ────────────────────────                                                   │
│   0: #DE - Divide Error                                                     │
│   3: #BP - Breakpoint                                                       │
│   6: #UD - Invalid Opcode                                                   │
│   8: #DF - Double Fault                                                     │
│  13: #GP - General Protection                                               │
│  14: #PF - Page Fault                                                       │
│  18: #MC - Machine Check                                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## Interrupt Descriptor Table (IDT)

The IDT maps interrupt vectors to handlers:

```c
// arch/x86/kernel/idt.c (simplified)
struct idt_data {
    unsigned int    vector;     // Interrupt number (0-255)
    unsigned int    segment;    // Code segment selector
    struct idt_bits bits;       // Type, DPL, present
    const void      *addr;      // Handler address
};

// IDT entries
static const __initconst struct idt_data early_idts[] = {
    INTG(X86_TRAP_DE,  asm_exc_divide_error),        // #DE
    INTG(X86_TRAP_NMI, asm_exc_nmi),                 // NMI
    INTG(X86_TRAP_BP,  asm_exc_int3),                // #BP
    INTG(X86_TRAP_OF,  asm_exc_overflow),            // #OF
    INTG(X86_TRAP_UD,  asm_exc_invalid_op),          // #UD
    INTG(X86_TRAP_DF,  asm_exc_double_fault),        // #DF
    INTG(X86_TRAP_GP,  asm_exc_general_protection),  // #GP
    INTG(X86_TRAP_PF,  asm_exc_page_fault),          // #PF
    // ... more entries
};

// Hardware interrupts (IRQs) start at vector 32
#define FIRST_EXTERNAL_VECTOR 0x20
```

### Viewing IDT Information

```bash
# View interrupt statistics
cat /proc/interrupts

# Example output:
#            CPU0       CPU1       CPU2       CPU3
#   0:         25          0          0          0  IR-IO-APIC    2-edge      timer
#   8:          0          0          0          1  IR-IO-APIC    8-edge      rtc0
#  16:          0          0          0          0  IR-IO-APIC   16-fasteoi   ehci_hcd
# 120:          0          0          0          0  DMAR-MSI      0-edge      dmar0
# 121:     234567      12345       9876       8765  IR-PCI-MSI 512000-edge    nvme0q0
# 122:          0     987654          0          0  IR-PCI-MSI 512001-edge    nvme0q1
# NMI:          0          0          0          0  Non-maskable interrupts
# LOC:    1234567    1234567    1234567    1234567  Local timer interrupts

# View affinity for a specific IRQ
cat /proc/irq/121/smp_affinity       # Hex mask of allowed CPUs
cat /proc/irq/121/smp_affinity_list  # Human-readable CPU list
```

***

## Hardware IRQ Handling

### IRQ Handler Registration

```c
// include/linux/interrupt.h
int request_irq(unsigned int irq,
                irq_handler_t handler,
                unsigned long flags,
                const char *name,
                void *dev_id);

// flags:
#define IRQF_SHARED          0x00000080  // Multiple devices share IRQ
#define IRQF_TRIGGER_HIGH    0x00000004  // Level-triggered, active high
#define IRQF_TRIGGER_RISING  0x00000001  // Edge-triggered, rising
#define IRQF_ONESHOT         0x00002000  // IRQ stays masked until the threaded handler completes

// Handler return values
typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);

#define IRQ_NONE         (0)  // Interrupt wasn't from this device
#define IRQ_HANDLED      (1)  // Interrupt was handled
#define IRQ_WAKE_THREAD  (2)  // Wake threaded handler
```

### Example: Network Driver IRQ Handler

```c
// Simplified network driver interrupt handler
static irqreturn_t my_net_irq_handler(int irq, void *dev_id)
{
    struct my_net_device *dev = dev_id;
    u32 status;

    // Read interrupt status register
    status = my_read_reg(dev, INTR_STATUS);

    if (!(status & MY_INTR_MASK))
        return IRQ_NONE;  // Not our interrupt (shared IRQ line)

    // Acknowledge interrupt to hardware
    my_write_reg(dev, INTR_ACK, status);

    if (status & INTR_RX_DONE) {
        // Disable RX interrupts, schedule NAPI poll
        my_write_reg(dev, INTR_DISABLE, INTR_RX_DONE);
        napi_schedule(&dev->napi);  // Schedule softirq
    }

    if (status & INTR_TX_DONE) {
        // TX completion - can do minimal work here
        tasklet_schedule(&dev->tx_tasklet);
    }

    return IRQ_HANDLED;
}
```

***

## Softirqs: High-Priority Deferred Work

**The analogy**: If hardirqs are the smoke alarm (drop everything, acknowledge immediately), softirqs are the **cleanup crew** that arrives after the alarm stops. They run with interrupts re-enabled, so new alarms can still ring, but they still cannot take a nap (no sleeping) because they need to be fast. Softirqs are how the kernel processes network packets in bulk, completes block I/O, and runs timer callbacks -- all the heavy lifting deferred from the hardirq handler.
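
The "raise now, run later" idea behind softirqs can be sketched in plain user-space C. This is only an illustrative simulation, not kernel code: every `sim_*` and `SIM_*` name is hypothetical, and the real kernel additionally keeps the pending mask per CPU, limits how long it loops, and hands leftover work to `ksoftirqd`. The sketch assumes a single pending bitmask in which a lower bit number means the handler runs earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical softirq numbers for this sketch; lower number = runs first. */
enum { SIM_HI = 0, SIM_TIMER, SIM_NET_RX, SIM_NR };

typedef void (*sim_action_t)(void);

static uint32_t sim_pending;           /* one pending bit per softirq type */
static sim_action_t sim_vec[SIM_NR];   /* handler registered for each type */
static int sim_order[8];               /* execution log, for inspection */
static size_t sim_order_len;

static void sim_log(int nr)
{
    if (sim_order_len < sizeof(sim_order) / sizeof(sim_order[0]))
        sim_order[sim_order_len++] = nr;
}

static void sim_hi_action(void)     { sim_log(SIM_HI); }
static void sim_net_rx_action(void) { sim_log(SIM_NET_RX); }

/* "Top half": record that work is pending and return immediately. */
static void sim_raise(int nr)
{
    assert(nr >= 0 && nr < SIM_NR);
    sim_pending |= (uint32_t)1 << nr;
}

/* "Bottom half": run every pending handler, lowest number first. */
static void sim_do_softirq(void)
{
    uint32_t pending = sim_pending;

    sim_pending = 0;  /* anything raised from now on waits for the next pass */
    for (int nr = 0; nr < SIM_NR; nr++) {
        if ((pending & ((uint32_t)1 << nr)) && sim_vec[nr])
            sim_vec[nr]();
    }
}

/* Raise NET_RX before HI, then let the bottom half pick the order. */
int sim_demo(void)
{
    sim_order_len = 0;
    sim_pending = 0;
    sim_vec[SIM_HI] = sim_hi_action;
    sim_vec[SIM_NET_RX] = sim_net_rx_action;

    sim_raise(SIM_NET_RX);   /* e.g. what a NIC hardirq handler would do */
    sim_raise(SIM_HI);       /* e.g. a high-priority tasklet being scheduled */
    sim_do_softirq();

    return (int)sim_order_len;   /* number of handlers that ran */
}
```

Calling `sim_demo()` from a `main()` runs two handlers, and `sim_order` records the HI handler before the NET_RX handler even though NET_RX was raised first. The raise step stays trivially cheap, which is the property the hardirq side depends on.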

### Softirq Types

```c
// include/linux/interrupt.h
enum {
    HI_SOFTIRQ = 0,    // High-priority tasklets
    TIMER_SOFTIRQ,     // Timer processing
    NET_TX_SOFTIRQ,    // Network transmit
    NET_RX_SOFTIRQ,    // Network receive
    BLOCK_SOFTIRQ,     // Block device completion
    IRQ_POLL_SOFTIRQ,  // IRQ polling
    TASKLET_SOFTIRQ,   // Regular tasklets
    SCHED_SOFTIRQ,     // Scheduler load balancing
    HRTIMER_SOFTIRQ,   // High-resolution timers
    RCU_SOFTIRQ,       // RCU callbacks
    NR_SOFTIRQS
};
```

### Softirq Properties

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           SOFTIRQ CHARACTERISTICS                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Properties:                                                                │
│  • Fixed number (10 currently, compile-time)                                │
│  • Run with interrupts enabled                                              │
│  • Cannot sleep or block                                                    │
│  • Same softirq can run simultaneously on different CPUs                    │
│  • Must use appropriate locking                                             │
│                                                                             │
│  Execution Points:                                                          │
│  • After hardirq handler returns                                            │
│  • When explicitly enabled (local_bh_enable())                              │
│  • By ksoftirqd kernel threads (when too many pending)                      │
│                                                                             │
│  Priority Order:                                                            │
│  HI_SOFTIRQ > TIMER > NET_TX > NET_RX > BLOCK > ... > RCU                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Viewing Softirq Activity

```bash
# View softirq statistics per CPU
cat /proc/softirqs

#                CPU0       CPU1       CPU2       CPU3
# HI:               0          0          0          0
# TIMER:     12345678   12345678   12345678   12345678
# NET_TX:       12345      12345      12345      12345
# NET_RX:    98765432    1234567     123456      12345
# BLOCK:       123456     234567     345678     456789
# IRQ_POLL:         0          0          0          0
# TASKLET:       1234       2345       3456       4567
# SCHED:      1234567    1234567    1234567    1234567
# HRTIMER:     123456     123456     123456     123456
# RCU:        1234567    1234567    1234567    1234567

# Watch ksoftirqd CPU usage
top -p $(pgrep -d',' ksoftirqd)
```

***

## Tasklets: Dynamic Deferred Work

Tasklets are built on top of softirqs but more flexible:

```c
// Tasklet declaration
// (classic API shown; since kernel 5.9 the callback takes a struct tasklet_struct *
//  and this three-argument macro form is named DECLARE_TASKLET_OLD)
struct tasklet_struct {
    struct tasklet_struct *next;
    unsigned long state;         // TASKLET_STATE_SCHED, TASKLET_STATE_RUN
    atomic_t count;              // Disable count
    void (*func)(unsigned long);
    unsigned long data;
};

// Static initialization (pick one)
DECLARE_TASKLET(my_tasklet, my_tasklet_handler, data);
DECLARE_TASKLET_DISABLED(my_tasklet, my_tasklet_handler, data);

// Dynamic initialization
tasklet_init(&my_tasklet, my_tasklet_handler, data);

// Schedule for execution
tasklet_schedule(&my_tasklet);     // Normal priority (TASKLET_SOFTIRQ)
tasklet_hi_schedule(&my_tasklet);  // High priority (HI_SOFTIRQ)

// Control
tasklet_disable(&my_tasklet);  // Prevent execution
tasklet_enable(&my_tasklet);   // Re-enable
tasklet_kill(&my_tasklet);     // Remove (waits if running)
```

### Tasklet vs Softirq

| Feature         | Softirq                               | Tasklet                                     |
| --------------- | ------------------------------------- | ------------------------------------------- |
| **Concurrency** | Same softirq can run on multiple CPUs | Same tasklet runs on only one CPU at a time |
| **Definition**  | Static (compile-time)                 | Dynamic (runtime)                           |
| **Locking**     | Must handle SMP yourself              | Serialized per-tasklet                      |
| **Use case**    | High-frequency, performance-critical  | General deferred work                       |

***

## Workqueues: Process Context Deferred Work

When you need to sleep or allocate memory:

```c
// Create work item
struct work_struct my_work;
INIT_WORK(&my_work, my_work_handler);

// Work handler
static void my_work_handler(struct work_struct *work)
{
    // Can sleep, allocate memory, etc.
    struct my_device *dev = container_of(work, struct my_device, work);

    // Do heavy processing
    process_data(dev);

    // Allocate memory (can sleep)
    void *buf = kmalloc(4096, GFP_KERNEL);
    if (!buf)
        return;

    // Call functions that might sleep
    mutex_lock(&dev->lock);
    // ...
    mutex_unlock(&dev->lock);

    kfree(buf);
}

// Schedule work
schedule_work(&my_work);                   // Global workqueue
queue_work(my_workqueue, &my_work);        // Custom workqueue
schedule_delayed_work(&my_dwork, HZ * 5);  // Delayed by 5 seconds
```

### Workqueue Types

```c
// Create workqueues
// Unbound workqueue (work can run on any CPU; omit WQ_UNBOUND for a bound,
// per-CPU queue where work runs on the submitting CPU)
struct workqueue_struct *wq = alloc_workqueue("my_wq",
                                              WQ_UNBOUND | WQ_MEM_RECLAIM,
                                              max_active);

// Flags:
#define WQ_UNBOUND        (1 << 1)  // Work can run on any CPU
#define WQ_FREEZABLE      (1 << 2)  // Participate in system suspend
#define WQ_MEM_RECLAIM    (1 << 3)  // Can be used for memory reclaim
#define WQ_HIGHPRI        (1 << 4)  // High priority workers
#define WQ_CPU_INTENSIVE  (1 << 5)  // CPU-intensive work

// System workqueues
system_wq            // Default, bound
system_highpri_wq    // High priority
system_long_wq       // For long-running work
system_unbound_wq    // Unbound (not tied to the submitting CPU)
system_freezable_wq  // Freezable for suspend
```

### Concurrency Managed Workqueue (cmwq)

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        CONCURRENCY MANAGED WORKQUEUE                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Work Items                                                                 │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐                                    │
│  │ W1  │ │ W2  │ │ W3  │ │ W4  │ │ W5  │                                    │
│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘                                    │
│     │       │       │       │       │                                       │
│     └───────┴───────┼───────┴───────┘                                       │
│                     │                                                       │
│                     ▼                                                       │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                          Per-CPU Worker Pools                           ││
│  │                                                                         ││
│  │  CPU 0 Pool           CPU 1 Pool           CPU 2 Pool                   ││
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐          ││
│  │  │ Worker threads: │  │ Worker threads: │  │ Worker threads: │          ││
│  │  │ [kworker/0:0]   │  │ [kworker/1:0]   │  │ [kworker/2:0]   │          ││
│  │  │ [kworker/0:1]   │  │ [kworker/1:1]   │  │ [kworker/2:1]   │          ││
│  │  │ (dynamic)       │  │ (dynamic)       │  │ (dynamic)       │          ││
│  │  └─────────────────┘  └─────────────────┘  └─────────────────┘          ││
│  │                                                                         ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                             │
│  Key Properties:                                                            │
│  • Workers created/destroyed dynamically based on load                      │
│  • Work items queued to per-CPU pools for locality                          │
│  • Automatic concurrency management (avoids thundering herd)                │
│  • WQ_UNBOUND work uses separate unbound pools                              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## Choosing the Right Deferred Work

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         DEFERRED WORK DECISION TREE                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Need to defer work from interrupt context?                                 │
│  │                                                                          │
│  └─► Does the work need to sleep?                                           │
│      │                                                                      │
│      ├─ YES ──► Use WORKQUEUE                                               │
│      │          • Can sleep (mutex_lock, kmalloc with GFP_KERNEL)           │
│      │          • Full kernel API available                                 │
│      │          • Higher latency (process context switch)                   │
│      │                                                                      │
│      └─ NO ───► Is it performance-critical networking/block I/O?            │
│                 │                                                           │
│                 ├─ YES ──► Use SOFTIRQ (if modifying kernel)                │
│                 │          or NAPI (for network drivers)                    │
│                 │          • Lowest latency                                 │
│                 │          • Runs at highest priority                       │
│                 │          • Need to handle SMP locking                     │
│                 │                                                           │
│                 └─ NO ───► Use TASKLET                                      │
│                            • Simpler than softirq                           │
│                            • Serialized execution                           │
│                            • Good for most driver deferred work             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## Interrupt Affinity and Performance

### Setting IRQ Affinity

```bash
# View current affinity (hex bitmask)
cat /proc/irq/121/smp_affinity
# 0000000f means CPUs 0-3

# Set affinity to CPUs 0 and 1
echo 3 > /proc/irq/121/smp_affinity

# Using affinity list (human-readable)
echo 0-3 > /proc/irq/121/smp_affinity_list
echo 0,2,4,6 > /proc/irq/121/smp_affinity_list

# Check if irqbalance is running (it may override your settings)
systemctl status irqbalance

# Disable irqbalance for manual control
systemctl stop irqbalance
```

### Performance Tuning Patterns

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           IRQ AFFINITY STRATEGIES                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  1. NETWORK-INTENSIVE WORKLOADS                                             │
│     ───────────────────────────                                             │
│     • Pin NIC IRQs to dedicated CPUs                                        │
│     • Use RPS/RFS for receive steering                                      │
│     • Enable XPS for transmit steering                                      │
│     • Consider isolcpus for application CPUs                                │
│                                                                             │
│     Example (4-queue NIC, 8 CPUs):                                          │
│     IRQ 121 (rx-0) → CPU 0                                                  │
│     IRQ 122 (rx-1) → CPU 1                                                  │
│     IRQ 123 (rx-2) → CPU 2                                                  │
│     IRQ 124 (rx-3) → CPU 3                                                  │
│     CPUs 4-7       → Application threads                                    │
│                                                                             │
│  2. LATENCY-SENSITIVE WORKLOADS                                             │
│     ───────────────────────────                                             │
│     • Isolate CPUs for application (isolcpus=)                              │
│     • Pin all IRQs to housekeeping CPUs                                     │
│     • Use nohz_full for tickless operation                                  │
│     • Consider disabling irqbalance                                         │
│                                                                             │
│  3. NUMA-AWARE AFFINITY                                                     │
│     ───────────────────                                                     │
│     • Pin IRQs to same NUMA node as device                                  │
│     • Avoid cross-NUMA memory access                                        │
│     • Check: cat /sys/class/net/eth0/device/numa_node                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Measuring Interrupt Latency

```bash
# Using cyclictest for latency measurement
sudo cyclictest -p 80 -t1 -n -i 1000 -l 10000
# -p 80: priority
# -t1: 1 thread
# -n: use nanosleep
# -i 1000: 1ms interval
# -l 10000: 10000 loops

# Using perf for interrupt latency
sudo perf sched latency

# Using bpftrace
sudo bpftrace -e '
tracepoint:irq:irq_handler_entry {
    @start[args->irq] = nsecs;
}
tracepoint:irq:irq_handler_exit /@start[args->irq]/ {
    @latency_ns = hist(nsecs - @start[args->irq]);
    delete(@start[args->irq]);
}'
```

***

## NAPI: Network Interrupt Coalescing

NAPI (New API) reduces interrupt overhead for high-throughput networking:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               NAPI MECHANISM                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Traditional (interrupt per packet):                                        │
│  ───────────────────────────────────                                        │
│  Packet 1 → IRQ → Handle → Return                                           │
│  Packet 2 → IRQ → Handle → Return                                           │
│  Packet 3 → IRQ → Handle → Return                                           │
│  ... (high CPU overhead)                                                    │
│                                                                             │
│  NAPI (polling mode):                                                       │
│  ────────────────────                                                       │
│  Packet 1 → IRQ → Disable IRQ → Schedule NAPI                               │
│                                      │                                      │
│                                      ▼                                      │
│                   ┌─────────────────────────────────────┐                   │
│                   │           NAPI Poll Loop            │                   │
│                   │                                     │                   │
│                   │  while (budget > 0) {               │                   │
│                   │      packet = poll_device();        │                   │
│                   │      if (!packet) break;            │                   │
│                   │      process(packet);               │                   │
│                   │      budget--;                      │                   │
│                   │  }                                  │                   │
│                   │                                     │                   │
│                   │  if (budget > 0) {                  │                   │
│                   │      napi_complete(); // Re-enable  │                   │
│                   │      enable_irq();    // IRQ        │                   │
│                   │  } else {                           │                   │
│                   │      reschedule();    // More work  │                   │
│                   │  }                                  │                   │
│                   │                                     │                   │
│                   └─────────────────────────────────────┘                   │
│                                                                             │
│  Benefits:                                                                  │
│  • Reduced interrupt overhead                                               │
│  • Better CPU cache utilization                                             │
│  • Back-pressure mechanism (budget)                                         │
│  • Automatic adaptation to load                                             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### NAPI API

```c
// Initialize NAPI
// (four-argument form shown; kernels 6.1 and later drop the weight argument:
//  netif_napi_add(netdev, &dev->napi, my_poll))
netif_napi_add(netdev, &dev->napi, my_poll, NAPI_POLL_WEIGHT);
napi_enable(&dev->napi);

// In IRQ handler
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    struct my_device *dev = dev_id;

    // Mask the device's interrupts and schedule NAPI.
    // my_disable_irq()/my_enable_irq() stand for driver-specific helpers that
    // write a device register; they are not the kernel's disable_irq(), which
    // takes an IRQ number and waits for running handlers to finish.
    if (napi_schedule_prep(&dev->napi)) {
        my_disable_irq(dev);
        __napi_schedule(&dev->napi);
    }
    return IRQ_HANDLED;
}

// Poll function
static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_device *dev = container_of(napi, struct my_device, napi);
    int work_done = 0;

    while (work_done < budget) {
        struct sk_buff *skb = get_next_packet(dev);
        if (!skb)
            break;

        // Process packet
        napi_gro_receive(napi, skb);
        work_done++;
    }

    if (work_done < budget) {
        napi_complete_done(napi, work_done);
        my_enable_irq(dev);  // Re-enable device interrupts
    }

    return work_done;
}
```

***

## Threaded IRQs

For handlers that need more flexibility:

```c
// Request threaded IRQ
int request_threaded_irq(unsigned int irq,
                         irq_handler_t handler,    // Hardirq handler
                         irq_handler_t thread_fn,  // Threaded handler
                         unsigned long flags,
                         const char *name,
                         void *dev_id);

// Example
static irqreturn_t my_hardirq_handler(int irq, void *dev_id)
{
    // Quick check: is this our interrupt?
    if (!check_interrupt_pending(dev_id))
        return IRQ_NONE;

    // Acknowledge hardware
    ack_interrupt(dev_id);

    // Wake threaded handler
    return IRQ_WAKE_THREAD;
}

static irqreturn_t my_threaded_handler(int irq, void *dev_id)
{
    // Can sleep here!
    mutex_lock(&dev_lock);
    process_data(dev_id);
    mutex_unlock(&dev_lock);

    return IRQ_HANDLED;
}

// Registration
request_threaded_irq(irq, my_hardirq_handler, my_threaded_handler,
                     IRQF_ONESHOT, "my_device", dev);
```

***

## Interview Questions

**Answer**:

**Hardirq context** (top half):

* Runs with interrupts disabled on local CPU
* Must be extremely fast (microseconds)
* Cannot sleep or allocate memory with GFP\_KERNEL
* Preempts everything including kernel code

**Softirq context** (bottom half):

* Runs with interrupts enabled
* Can be preempted by hardirqs
* Still cannot sleep (atomic context)
* Used for deferred processing (networking, block I/O)

**Key insight**: Split processing into minimal hardirq work (acknowledge, disable, schedule) and heavier softirq work (process data).

**Answer**:

**Use workqueue when**:

* You need to sleep (mutex, blocking I/O)
* You need to allocate memory with GFP\_KERNEL
* The work is not time-critical
* You need to call functions that might block

**Use tasklet when**:

* Work must be done quickly after interrupt
* You don't need to sleep
* Work is small and fast
* You want serialization (same tasklet won't run concurrently)

**Example**: Network driver TX completion → tasklet (fast, no sleeping). Firmware loading → workqueue (needs file I/O, can sleep).

**Answer**:

**Problem**: At high packet rates, per-packet interrupts cause:

* High CPU overhead (context switch per packet)
* Cache thrashing
* Interrupt storms (CPU spends all time in IRQ handlers)

**NAPI solution**:

1. First packet triggers interrupt
2. Disable further interrupts for that queue
3. Switch to polling mode (softirq)
4. Process packets in batches (budget-based)
5. Re-enable interrupts when queue is empty

**Benefits**:

* Amortized interrupt cost across many packets
* Better cache locality (process batch together)
* Natural back-pressure (stop polling when overwhelmed)
* Scales to millions of packets/second

**Answer**:

**Symptoms**:

* High CPU usage in `si` (softirq) or `hi` (hardirq)
* System unresponsive
* `/proc/interrupts` shows rapidly increasing counts

**Debugging steps**:

```bash
# 1. Identify which IRQ
watch -n 1 cat /proc/interrupts

# 2. Check which device
cat /proc/irq//smp_affinity_list
ls -la /proc/irq//

# 3. Monitor interrupt rate
perf stat -e irq:irq_handler_entry -a sleep 1

# 4. Trace specific IRQ
sudo bpftrace -e '
tracepoint:irq:irq_handler_entry /args->irq == 121/ {
    @count = count();
}'
```

**Common causes**:

* Faulty hardware generating spurious interrupts
* Driver bug not acknowledging interrupt properly
* Shared IRQ with misbehaving device
* Misconfigured interrupt coalescing

***

## Practice Exercises

1. Write a script that monitors `/proc/interrupts` and alerts when any IRQ rate exceeds a threshold
2. Set up IRQ affinity for a multi-queue NIC to optimize for either throughput or latency
3. Write a bpftrace script to trace interrupt handler latency and identify slow handlers
4. Use `workqueue:*` tracepoints to analyze work item execution patterns

***

## Summary

| Mechanism    | Context   | Can Sleep? | Use Case                               |
| ------------ | --------- | ---------- | -------------------------------------- |
| Hardirq      | Interrupt | No         | Acknowledge HW, schedule deferred work |
| Softirq      | Atomic    | No         | High-frequency networking, block I/O   |
| Tasklet      | Atomic    | No         | Driver deferred work, serialized       |
| Workqueue    | Process   | Yes        | General deferred work, can block       |
| Threaded IRQ | Process   | Yes        | Device handling that needs sleeping    |

***

## Debugging Tips

```bash
# Detect interrupt storms: watch for rapidly increasing counts
watch -d -n 1 cat /proc/interrupts
# If a single IRQ line is incrementing by thousands per second,
# you likely have a misbehaving device or driver bug

# Check if ksoftirqd is consuming excessive CPU
# This indicates softirq backlog -- the kernel cannot keep up
top -p $(pgrep -d',' ksoftirqd)
# If ksoftirqd is using >50% of a core, investigate which softirq type

# Identify which softirq type is dominating
cat /proc/softirqs
# NET_RX growing much faster than others → network receive backlog
# BLOCK growing fast → storage completion backlog
# TIMER growing fast → timer callback overload

# Trace hardirq handler duration (find slow handlers)
sudo bpftrace -e '
tracepoint:irq:irq_handler_entry {
    @start[args->irq] = nsecs;
}
tracepoint:irq:irq_handler_exit /@start[args->irq]/ {
    $dur = (nsecs - @start[args->irq]) / 1000;
    if ($dur > 100) {  // handlers > 100us are suspicious
        printf("SLOW IRQ %d: %d us\n", args->irq, $dur);
    }
    delete(@start[args->irq]);
}'

# Check for IRQ affinity imbalance
# One CPU handling all interrupts while others are idle
cat /proc/interrupts | awk 'NR>1 {for(i=2;i<=NF-3;i++) sum[i]+=$i} END {for(i in sum) print "CPU"i-2": "sum[i]}'
```

**Common Misconception**: "Setting IRQ affinity is enough to control interrupt placement." In practice, the `irqbalance` daemon may override your manual settings. Always check `systemctl status irqbalance` and stop it if you need manual control.
Also, some NIC drivers pin MSI-X interrupts internally, overriding smp\_affinity writes.

***

## Common Misconceptions

| Misconception | Reality |
| ------------- | ------- |
| "Softirqs run on the CPU that raised the hardirq" | Usually true, but when softirq backlog is high, they're offloaded to `ksoftirqd` kernel threads which can be migrated |
| "Workqueues are slow" | For deferred work that needs sleeping, workqueues add only a few microseconds of overhead. The alternative (busy-waiting in atomic context) is far worse |
| "NAPI polling wastes CPU" | NAPI only polls when there are packets. When the queue empties, it automatically re-enables interrupts and stops polling |
| "Tasklets are the recommended bottom-half mechanism" | Tasklets are actually semi-deprecated in modern kernel development. New code should prefer threaded IRQs or workqueues. Tasklets have problematic semantics (latency-unpredictable, blocks softirq processing for other devices) |
| "Disabling interrupts with `spin_lock_irq` is always safe" | Only safe if you *know* interrupts were enabled before the lock. Use `spin_lock_irqsave`/`spin_unlock_irqrestore` in code paths that might be called with interrupts already disabled |

***

## Key Takeaways

1. **Split interrupt handling**: Minimal work in hardirq, bulk processing in softirq/workqueue
2. **Choose the right mechanism**: Workqueue if you need to sleep, tasklet/softirq otherwise
3. **IRQ affinity matters**: Align with NUMA topology and application threads
4. **NAPI is essential**: For any high-performance networking
5. **Monitor interrupt rates**: High rates indicate potential issues

***

## Next Steps

* [Kernel Synchronization →](/courses/linux-internals/synchronization) - Locking mechanisms for concurrent access
* [Networking Stack →](/courses/linux-internals/networking-stack) - See NAPI in action
* [eBPF →](/courses/linux-internals/ebpf) - Trace interrupt handlers in production