Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Kernel Data Structures
The Linux kernel uses specialized data structures optimized for performance, concurrency, and cache efficiency. Understanding these is essential for reading kernel source code and for systems programming interviews. Think of kernel data structures as the plumbing of the operating system. User-space programs get fancy high-level containers (hash maps, vectors, trees) from their standard libraries. The kernel has to build its own — and every decision matters because these structures are hit millions of times per second across every CPU. A poorly chosen data structure in the scheduler means your 64-core server acts like a single-core machine. A cache-unfriendly list in the networking stack means dropped packets at 10Gbps. This is why kernel data structures look different from textbook implementations: they are battle-hardened for real hardware constraints.Key Topics: Linked lists, RB-trees, RCU, slab allocator
Time to Master: 8-10 hours
Kernel Linked Lists
The kernel’s linked list implementation is one of the most elegant pieces of C code you’ll encounter. The analogy: Imagine a chain of hotel rooms. The traditional approach puts a list of room numbers on each door (the data structure contains pointers to the next element). The kernel approach is different: instead of a list that holds data, you embed a chain link inside each room. Any room can be on any chain, and you can have multiple chains running through the same rooms. This is thelist_head pattern — the link lives inside the data, not the other way around.
Traditional vs Kernel-Style Linked Lists
Why Kernel-Style is Better
- Type-agnostic: Same
list_headworks for any struct - No memory allocation: Node is part of the data
- Multiple lists: One struct can be on multiple lists
- Cache friendly: Data and links together
The Magic: container_of Macro
container_of is the “find my house given my mailbox” macro. If you know where the mailbox (list_head) is, and you know how far the mailbox is from the front door (the offset), you can find the house (the containing struct). This single macro is what makes the entire kernel-style linked list pattern work, and it appears thousands of times throughout the kernel source.
List Operations
Red-Black Trees
Used throughout the kernel for fast ordered access: process scheduling, memory management, etc. The analogy: Think of an RB-tree as a self-balancing bookshelf. You could throw books onto a pile (unsorted list, O(n) lookup) or carefully arrange them by title (sorted array, O(n) insert). An RB-tree gives you O(log n) for both — like a bookshelf that automatically reorganizes itself every time you add or remove a book, guaranteeing that no one section gets too tall compared to others. Why not a plain binary search tree? Without balancing, a BST can degenerate into a linked list (O(n) operations). The kernel cannot tolerate that — imagine the CFS scheduler taking O(n) to pick the next task on a server with 10,000 runnable processes. RB-trees guarantee O(log n) worst-case, which is why they show up in almost every performance-critical kernel subsystem.RB-Tree Properties
- Every node is red or black
- Root is black
- All leaves (NULL) are black
- Red node’s children are black
- All paths to leaves have same black count
Kernel RB-Tree API
RB-Tree Use Cases in Kernel
| Subsystem | Use |
|---|---|
| CFS Scheduler | Task ordering by vruntime |
| Memory Management | VMA ordering by address |
| epoll | File descriptor management |
| ext4 | Extent tree |
Radix Trees and XArray
For sparse arrays with integer keys (like page cache). The analogy: Imagine a phone book where you can look up entries by page number. A regular array would require memory for every possible page number (billions of entries for a large file). A radix tree is like a hierarchical index — you only allocate entries for page numbers that actually exist. For a 1GB file that only has 100 pages cached, you store 100 entries instead of 262,144. The XArray replaced the older radix tree API starting in kernel 4.20 with a cleaner, more consistent interface. If you’re reading older kernel code, you’ll seeradix_tree_lookup() calls; modern code uses xa_load().
XArray (Modern Replacement for Radix Tree)
Use Cases
- Page cache: Map file offset to page
- Process ID allocation: Map PID to task_struct
- Inode cache: Map inode number to inode
Hash Tables
For O(1) lookups when ordering isn’t needed. When to use hash tables vs RB-trees: Hash tables win when you only need “find this exact key” lookups (like finding a socket by port number). RB-trees win when you need “find the nearest key” or “iterate in order” (like finding the VMA that contains a given address). The kernel uses both extensively, and knowing which one a subsystem chose — and why — is the kind of insight that impresses in interviews.Kernel Hash Table API
Per-CPU Variables
Avoid cache bouncing by giving each CPU its own copy. The analogy: Imagine a shared whiteboard in an office where every employee writes tally marks. With 100 employees, they form a line, each waiting their turn. Per-CPU variables are like giving each employee their own whiteboard. Everyone writes simultaneously with zero waiting. When the manager needs the total, they walk around and add up all the whiteboards. This is exactly how per-CPU counters work in the kernel — each CPU has its own copy, and you only aggregate when you need the global value. Why this matters in practice: On a 64-core server, a shared atomic counter can become a severe bottleneck. Eachatomic_inc() invalidates the cache line on all 63 other CPUs (a phenomenon called “cache-line bouncing” or “false sharing”). Per-CPU variables eliminate this entirely. The kernel uses them for statistics counters, memory allocator free lists, and many other high-contention paths.
Why Per-CPU Matters
RCU (Read-Copy-Update)
The most important synchronization mechanism for read-heavy kernel data structures. The analogy: Imagine you’re editing a Wikipedia article. The traditional approach (locking) would be: lock the page, make edits, unlock. While locked, nobody can read it. RCU works differently: you make a copy of the article, edit the copy, then atomically swap the link so new readers see the new version. Old readers who started before the swap still see the old version — and that’s fine, because they’ll finish reading soon. Once all old readers are done, you delete the old copy. Readers never wait, ever. Why RCU matters: The routing table in a busy Linux server might be read millions of times per second (every packet triggers a lookup) but updated only a few times per day. Using traditional locks would serialize all those reads. RCU makes reads essentially free — just a preemption disable/enable pair, no atomic operations, no memory barriers on the read path. This is why the networking stack, filesystem layer, and module system all depend on RCU.RCU Concept
RCU API
Why RCU is Important
- Zero-overhead reads: No atomics, no memory barriers on read path
- Scalability: Readers never block each other
- Used everywhere: Routing tables, file descriptors, module list
Memory Allocators
kmalloc - Small Object Allocation
GFP Flags
| Flag | Meaning | When to Use |
|---|---|---|
GFP_KERNEL | May sleep, normal allocation | Most kernel code |
GFP_ATOMIC | Cannot sleep, uses reserves | Interrupt context |
GFP_NOWAIT | Cannot sleep, fails quickly | When you can retry |
GFP_USER | For user-space requests | mmap, etc. |
GFP_DMA | DMA-capable memory | Device drivers |
vmalloc - Large Allocations
kmalloc vs vmalloc
| Aspect | kmalloc | vmalloc |
|---|---|---|
| Physical memory | Contiguous | May be non-contiguous |
| Size limit | ~4MB | Limited by virtual address space |
| Performance | Faster | Slower (TLB overhead) |
| Use case | Small objects | Large buffers, module loading |
Slab Allocator
For frequently allocated objects of the same size:Slab Allocator Internals
Lab Exercises
Lab 1: Implement Kernel-Style Linked List
Lab 1: Implement Kernel-Style Linked List
Lab 2: Explore Slab Caches
Lab 2: Explore Slab Caches
Lab 3: RCU Simulation
Lab 3: RCU Simulation
Interview Questions
Q1: Explain how container_of works
Q1: Explain how container_of works
container_of(ptr, type, member) returns a pointer to the containing structure given a pointer to a member.How it works:offsetof(type, member)- bytes from struct start to member(char *)ptr- treat ptr as byte pointer- Subtract offset to get struct start
- Cast to struct pointer
Q2: When would you use RCU vs a spinlock?
Q2: When would you use RCU vs a spinlock?
- Reads vastly outnumber writes
- Read-side performance is critical
- Can tolerate slightly stale data on reads
- Updates can use copy-and-replace semantics
- Read/write ratio is balanced
- Need to modify data in place
- Can’t tolerate stale reads
- Critical section is very short
| Operation | RCU | Spinlock |
|---|---|---|
| Read (uncontended) | ~1 cycle | ~10 cycles |
| Read (contended) | ~1 cycle | ~100+ cycles |
| Write | ~100+ cycles | ~10 cycles |
- Routing table: RCU (millions of lookups, rare updates)
- Memory allocator: Spinlock (frequent alloc/free)
Q3: Why does the kernel use slab allocation?
Q3: Why does the kernel use slab allocation?
- Internal fragmentation (small objects waste pages)
- Slow initialization (must init objects each time)
- Cache unfriendly (random placement)
- Reduced fragmentation: Groups same-size objects
- Cache coloring: Staggers objects across cache lines
- Object caching: Keeps freed objects initialized
- Per-CPU caches: Reduces lock contention
- Debugging support: Red zones, poisoning
- Creating process:
task_structallocated instantly from slab - Without slab: Would need page allocation + initialization
- Measured speedup: 10-100x for hot objects
Q4: Explain per-CPU variables and their benefits
Q4: Explain per-CPU variables and their benefits
- Each CPU gets its own copy of the variable
- No synchronization needed for CPU-local access
- No cache bouncing: Variable always in local cache
- No locking needed: Each CPU has exclusive access
- Better scalability: Performance scales with CPU count
- Counters (stats, metrics)
- Caches (per-CPU object pools)
- State (current process, interrupt flags)
Debugging Tips
Common Misconceptions
| Misconception | Reality |
|---|---|
”kernel linked lists are slower because of container_of overhead” | container_of compiles to a single pointer subtraction — zero runtime cost. The real advantage is avoiding a separate allocation for the node. |
| ”RCU is only for networking” | RCU is used in 20+ subsystems: VFS (dcache, inode), scheduler, module list, PID lookup, audit, security, and more. |
| ”Per-CPU variables waste memory” | On a 64-core machine, a per-CPU int uses 256 bytes. An atomic counter with cache-line bouncing across 64 CPUs wastes far more in stalled cycles. |
| ”Slab allocator is just a memory pool” | SLUB also provides: per-CPU caching for lock-free fast paths, SLAB_HWCACHE_ALIGN for cache coloring, red-zone poisoning for debugging, and merging of similar-sized caches. |
| ”Hash tables are always better than trees” | Hash tables cannot do range queries or ordered iteration. The kernel chooses RB-trees for VMAs (need to find “which VMA contains address X?”) and hash tables for sockets (need exact port lookup). |
Key Takeaways
Kernel Lists
RCU
Slab Allocator
Per-CPU Variables
Interview Deep-Dive
The CFS scheduler uses a red-black tree to organize tasks by vruntime. Why a red-black tree and not a min-heap, sorted array, or skip list? Discuss the algorithmic trade-offs in the context of scheduler requirements.
The CFS scheduler uses a red-black tree to organize tasks by vruntime. Why a red-black tree and not a min-heap, sorted array, or skip list? Discuss the algorithmic trade-offs in the context of scheduler requirements.
- The scheduler needs three operations to be efficient: find the minimum vruntime task (the next task to run), insert a task when it becomes runnable, and remove a task when it is selected or blocks. A red-black tree provides O(log n) for all three operations, and the leftmost node (minimum) can be cached for O(1) access.
- A min-heap gives O(1) find-min and O(log n) insert, but removal of an arbitrary element (not the minimum) is O(n) because you need to find it first. The scheduler frequently removes tasks that block (sleep on I/O, mutex, etc.), not just the minimum, so this O(n) removal is unacceptable.
- A sorted array gives O(1) find-min and O(log n) search, but insertion requires shifting elements, which is O(n). With hundreds or thousands of runnable tasks, this becomes expensive.
- A skip list would work (O(log n) for all operations on average), but red-black trees have lower constant factors due to better cache locality (tree nodes are compact structs, skip list nodes have variable-sized forward pointer arrays) and deterministic worst-case guarantees (skip lists are probabilistic).
- The kernel’s rb-tree implementation is also extremely well-optimized: it uses augmented rb-trees where the leftmost node pointer is cached in
rb_root_cached, makingpick_next_task_fair()effectively O(1). The rebalancing after insertion or removal touches at most 3 rotations, keeping the tree balanced without expensive restructuring.
container_of macro make the rb-tree implementation type-agnostic, and why is this important?Follow-up Answer:- The kernel’s rb-tree does not store user data in tree nodes. Instead, the
rb_nodestructure is embedded inside the user’s data structure (for example,sched_entitycontains anrb_node run_node). When traversing the tree, you get pointers torb_nodestructures.container_of(ptr, struct sched_entity, run_node)uses pointer arithmetic (subtracting the field offset from the pointer) to recover the containingsched_entitypointer. This means the same rb-tree code works for any data type without templates, generics, or void pointers with casts. It also means zero memory allocation overhead for the tree structure itself — the link nodes are part of the data, so inserting a task requires no additional memory allocation.
Explain RCU in detail. A new kernel developer says 'RCU is just a read-write lock with deferred freeing.' Correct their understanding and explain what makes RCU fundamentally different.
Explain RCU in detail. A new kernel developer says 'RCU is just a read-write lock with deferred freeing.' Correct their understanding and explain what makes RCU fundamentally different.
- The fundamental difference is that RCU readers do not acquire any lock at all.
rcu_read_lock()merely disables preemption (a single per-CPU counter increment). There is no atomic operation, no memory barrier, no cache line bouncing between CPUs. This means RCU read-side critical sections scale perfectly: 1000 CPUs reading simultaneously have zero contention. A read-write lock, even an optimized one likerwlock_t, requires atomic operations on a shared cache line, creating contention that increases with CPU count. - RCU achieves this by shifting all synchronization cost to the write side. A writer does not modify data in place. Instead, it allocates a new copy, modifies the copy, then atomically updates the pointer to the new version using
rcu_assign_pointer()(which includes a write memory barrier). Old readers continue reading the old version — they are safe because the old data is not freed until all pre-existing readers finish. The writer callssynchronize_rcu()orcall_rcu()to wait for the grace period. - A grace period is the time during which all CPUs have gone through at least one context switch (or voluntary quiescent state). Since
rcu_read_lock()disables preemption, a context switch on a CPU guarantees no reader on that CPU is in an RCU read-side critical section. Once all CPUs have passed through a quiescent state, the old data can safely be freed. - The trade-off: writers pay a significant cost (memory allocation for copies, waiting for grace periods), and readers see stale data during the grace period. This makes RCU ideal for read-mostly data (routing tables, configuration, module lists) and terrible for write-heavy data.
- Regular RCU readers cannot sleep because
rcu_read_lock()disables preemption, and sleeping while preemption is disabled would cause scheduling problems. SRCU (Sleepable RCU) allows readers to sleep by using a per-CPU counter pair instead of preemption disable.srcu_read_lock()increments one counter,srcu_read_unlock()increments the other. The writer’ssynchronize_srcu()waits until both counters match, indicating no readers are active. This is more expensive on the read side (two atomic increments instead of one preemption counter) but necessary when the read-side critical section must perform blocking operations like disk I/O or sleeping locks. Use cases include file system operations where you need RCU-like read scalability but the read path might block.
A kernel module you are writing needs a data structure to map 10 million IP addresses to connection metadata with fast lookup and frequent updates. Compare the options available in the kernel and justify your choice.
A kernel module you are writing needs a data structure to map 10 million IP addresses to connection metadata with fast lookup and frequent updates. Compare the options available in the kernel and justify your choice.
- The options are: kernel hash table (
hashtable.h), rhashtable (resizable hash table), rb-tree, and XArray. - For 10 million entries with fast lookup and frequent updates, I would use rhashtable (resizable hash table with RCU-protected reads). The standard kernel
hashtable.hhas a fixed number of buckets set at compile time. With 10M entries and 1024 buckets, each bucket chain averages 10K entries, making lookup O(10K) — unacceptable. Rhashtable dynamically resizes its bucket count to maintain short chains (target 0-1 entries per bucket), providing amortized O(1) lookup. - Rhashtable also supports RCU-protected reads: lookups do not take any locks, only
rcu_read_lock(). Insertions and deletions take a per-bucket spinlock, providing fine-grained write concurrency. The resizing operation is done incrementally in the background using RCU, so readers are never blocked during a resize. - An rb-tree would give O(log n) lookup (about 23 comparisons for 10M entries), which is slower than a well-sized hash table’s O(1). XArray is designed for sparse integer-indexed data (like page cache indices), not arbitrary key types like IP addresses.
- Implementation: I would define the rhashtable parameters with
struct rhashtable_paramsspecifying the key offset, key length (4 bytes for IPv4), hash function (jhash), and a reasonable initial size. For IPv6 support, the key length increases to 16 bytes but the approach is identical.
- Each rhashtable entry requires the user struct plus an
rhash_head(a single pointer, 8 bytes on 64-bit). The bucket array itself is an array of pointers, which for 10M entries with a load factor of 0.75 would be approximately 13.3M buckets, each 8 bytes = ~106MB just for buckets. During resize, both old and new bucket arrays coexist temporarily, doubling the bucket memory. The user data (connection metadata) dominates memory usage at this scale. Under memory pressure, rhashtable allocation can fail during resize. Therhashtable_paramsstructure allows specifyingGFP_NOWAITfor non-blocking allocation and settingmax_sizeto cap bucket growth. If resize allocation fails, the table continues operating with longer chains (degraded but functional).
Next: Process Subsystem Deep Dive →