Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Kernel Data Structures

The Linux kernel uses specialized data structures optimized for performance, concurrency, and cache efficiency. Understanding these is essential for reading kernel source code and for systems programming interviews. Think of kernel data structures as the plumbing of the operating system. User-space programs get fancy high-level containers (hash maps, vectors, trees) from their standard libraries. The kernel has to build its own — and every decision matters because these structures are hit millions of times per second across every CPU. A poorly chosen data structure in the scheduler means your 64-core server acts like a single-core machine. A cache-unfriendly list in the networking stack means dropped packets at 10Gbps. This is why kernel data structures look different from textbook implementations: they are battle-hardened for real hardware constraints.
Interview Frequency: High (especially for infrastructure roles)
Key Topics: Linked lists, RB-trees, RCU, slab allocator
Time to Master: 8-10 hours

Kernel Linked Lists

The kernel’s linked list implementation is one of the most elegant pieces of C code you’ll encounter. The analogy: Imagine a chain of hotel rooms. The traditional approach puts a list of room numbers on each door (the data structure contains pointers to the next element). The kernel approach is different: instead of a list that holds data, you embed a chain link inside each room. Any room can be on any chain, and you can have multiple chains running through the same rooms. This is the list_head pattern — the link lives inside the data, not the other way around.

Traditional vs Kernel-Style Linked Lists

┌─────────────────────────────────────────────────────────────────────────────┐
│                    TRADITIONAL VS KERNEL LINKED LIST                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   TRADITIONAL (data embedded in node):                                       │
│                                                                              │
│   struct node {                                                              │
│       int data;                                                              │
│       struct node *next;                                                     │
│       struct node *prev;                                                     │
│   };                                                                         │
│                                                                              │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                              │
│   │ data: 1  │───→│ data: 2  │───→│ data: 3  │                              │
│   │ next ────│    │ next ────│    │ next:NULL│                              │
│   │ prev:NULL│←───│ prev ────│←───│ prev ────│                              │
│   └──────────┘    └──────────┘    └──────────┘                              │
│                                                                              │
│   KERNEL-STYLE (node embedded in data):                                      │
│                                                                              │
│   struct list_head {                                                         │
│       struct list_head *next;                                                │
│       struct list_head *prev;                                                │
│   };                                                                         │
│                                                                              │
│   struct my_data {                                                           │
│       int value;                                                             │
│       char name[32];                                                         │
│       struct list_head list;  // embed the link                             │
│   };                                                                         │
│                                                                              │
│   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐         │
│   │ value: 1        │    │ value: 2        │    │ value: 3        │         │
│   │ name: "first"   │    │ name: "second"  │    │ name: "third"   │         │
│   │ list.next ──────│───→│ list.next ──────│───→│ list.next ──────│─→(head)│
│   │ list.prev ←─────│←───│ list.prev ←─────│←───│ list.prev ←─────│         │
│   └─────────────────┘    └─────────────────┘    └─────────────────┘         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Why Kernel-Style is Better

  1. Type-agnostic: Same list_head works for any struct
  2. No memory allocation: Node is part of the data
  3. Multiple lists: One struct can be on multiple lists
  4. Cache friendly: Data and links together

The Magic: container_of Macro

container_of is the “find my house given my mailbox” macro. If you know where the mailbox (list_head) is, and you know how far the mailbox is from the front door (the offset), you can find the house (the containing struct). This single macro is what makes the entire kernel-style linked list pattern work, and it appears thousands of times throughout the kernel source.
// include/linux/kernel.h
// Given a pointer to a member inside a struct, compute the pointer
// to the containing struct by subtracting the member's offset.
#define container_of(ptr, type, member) ({                \
    const typeof(((type *)0)->member) *__mptr = (ptr);    \
    (type *)((char *)__mptr - offsetof(type, member)); })

// Example usage:
struct my_data {
    int value;
    struct list_head list;
};

// Given a list_head pointer, get the containing struct
struct list_head *pos;
struct my_data *data = container_of(pos, struct my_data, list);

List Operations

#include <linux/list.h>

// Initialize list head
LIST_HEAD(my_list);  // Static initialization
// or
struct list_head my_list;
INIT_LIST_HEAD(&my_list);

// Add to list
struct my_data *new_item = kmalloc(sizeof(*new_item), GFP_KERNEL);
new_item->value = 42;
list_add(&new_item->list, &my_list);       // Add to front
list_add_tail(&new_item->list, &my_list);  // Add to back

// Delete from list
list_del(&item->list);

// Iterate over list
struct my_data *entry;
list_for_each_entry(entry, &my_list, list) {
    printk("Value: %d\n", entry->value);
}

// Safe iteration (allows deletion during iteration)
struct my_data *tmp;
list_for_each_entry_safe(entry, tmp, &my_list, list) {
    if (entry->value < 0) {
        list_del(&entry->list);
        kfree(entry);
    }
}

// Check if list is empty
if (list_empty(&my_list)) { ... }

// Move entire list
LIST_HEAD(new_list);
list_splice(&my_list, &new_list);

Red-Black Trees

Used throughout the kernel for fast ordered access: process scheduling, memory management, etc. The analogy: Think of an RB-tree as a self-balancing bookshelf. You could throw books onto a pile (unsorted list, O(n) lookup) or carefully arrange them by title (sorted array, O(n) insert). An RB-tree gives you O(log n) for both — like a bookshelf that automatically reorganizes itself every time you add or remove a book, guaranteeing that no one section gets too tall compared to others. Why not a plain binary search tree? Without balancing, a BST can degenerate into a linked list (O(n) operations). The kernel cannot tolerate that — imagine the CFS scheduler taking O(n) to pick the next task on a server with 10,000 runnable processes. RB-trees guarantee O(log n) worst-case, which is why they show up in almost every performance-critical kernel subsystem.
Common Misconception: Many engineers assume the kernel uses hash tables everywhere for O(1) lookups. In reality, the kernel uses RB-trees when it needs ordered access (e.g., VMAs sorted by address range, CFS tasks sorted by vruntime). Hash tables give O(1) lookup but O(n) ordered traversal, so the choice depends on the access pattern.

RB-Tree Properties

  1. Every node is red or black
  2. Root is black
  3. All leaves (NULL) are black
  4. Red node’s children are black
  5. All paths to leaves have same black count

Kernel RB-Tree API

#include <linux/rbtree.h>

struct my_node {
    struct rb_node node;  // Must be first or use rb_entry
    int key;
    int value;
};

// Tree root
struct rb_root my_tree = RB_ROOT;

// Insert (must implement search yourself)
void my_insert(struct rb_root *root, struct my_node *new)
{
    struct rb_node **link = &root->rb_node;
    struct rb_node *parent = NULL;
    struct my_node *entry;

    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct my_node, node);

        if (new->key < entry->key)
            link = &parent->rb_left;
        else
            link = &parent->rb_right;
    }

    rb_link_node(&new->node, parent, link);
    rb_insert_color(&new->node, root);
}

// Search
struct my_node *my_search(struct rb_root *root, int key)
{
    struct rb_node *node = root->rb_node;

    while (node) {
        struct my_node *entry = rb_entry(node, struct my_node, node);

        if (key < entry->key)
            node = node->rb_left;
        else if (key > entry->key)
            node = node->rb_right;
        else
            return entry;
    }
    return NULL;
}

// Iterate in order
struct rb_node *node;
for (node = rb_first(&my_tree); node; node = rb_next(node)) {
    struct my_node *entry = rb_entry(node, struct my_node, node);
    printk("Key: %d, Value: %d\n", entry->key, entry->value);
}

RB-Tree Use Cases in Kernel

SubsystemUse
CFS SchedulerTask ordering by vruntime
Memory ManagementVMA ordering by address
epollFile descriptor management
ext4Extent tree

Radix Trees and XArray

For sparse arrays with integer keys (like page cache). The analogy: Imagine a phone book where you can look up entries by page number. A regular array would require memory for every possible page number (billions of entries for a large file). A radix tree is like a hierarchical index — you only allocate entries for page numbers that actually exist. For a 1GB file that only has 100 pages cached, you store 100 entries instead of 262,144. The XArray replaced the older radix tree API starting in kernel 4.20 with a cleaner, more consistent interface. If you’re reading older kernel code, you’ll see radix_tree_lookup() calls; modern code uses xa_load().

XArray (Modern Replacement for Radix Tree)

#include <linux/xarray.h>

DEFINE_XARRAY(my_xa);

// Store value at index
void *old = xa_store(&my_xa, index, pointer, GFP_KERNEL);

// Load value at index
void *value = xa_load(&my_xa, index);

// Erase entry
xa_erase(&my_xa, index);

// Iterate
unsigned long index;
void *entry;
xa_for_each(&my_xa, index, entry) {
    printk("Index: %lu, Entry: %px\n", index, entry);
}

Use Cases

  • Page cache: Map file offset to page
  • Process ID allocation: Map PID to task_struct
  • Inode cache: Map inode number to inode

Hash Tables

For O(1) lookups when ordering isn’t needed. When to use hash tables vs RB-trees: Hash tables win when you only need “find this exact key” lookups (like finding a socket by port number). RB-trees win when you need “find the nearest key” or “iterate in order” (like finding the VMA that contains a given address). The kernel uses both extensively, and knowing which one a subsystem chose — and why — is the kind of insight that impresses in interviews.

Kernel Hash Table API

#include <linux/hashtable.h>

// Define hash table with 2^10 = 1024 buckets
DEFINE_HASHTABLE(my_hash, 10);

struct my_entry {
    int key;
    int value;
    struct hlist_node node;
};

// Add entry
struct my_entry *entry = kmalloc(sizeof(*entry), GFP_KERNEL);
entry->key = 42;
entry->value = 100;
hash_add(my_hash, &entry->node, entry->key);

// Lookup
struct my_entry *found;
hash_for_each_possible(my_hash, found, node, search_key) {
    if (found->key == search_key)
        return found;
}

// Delete
hash_del(&entry->node);

// Iterate all entries
int bkt;
struct my_entry *cur;
hash_for_each(my_hash, bkt, cur, node) {
    printk("Key: %d, Value: %d\n", cur->key, cur->value);
}

Per-CPU Variables

Avoid cache bouncing by giving each CPU its own copy. The analogy: Imagine a shared whiteboard in an office where every employee writes tally marks. With 100 employees, they form a line, each waiting their turn. Per-CPU variables are like giving each employee their own whiteboard. Everyone writes simultaneously with zero waiting. When the manager needs the total, they walk around and add up all the whiteboards. This is exactly how per-CPU counters work in the kernel — each CPU has its own copy, and you only aggregate when you need the global value. Why this matters in practice: On a 64-core server, a shared atomic counter can become a severe bottleneck. Each atomic_inc() invalidates the cache line on all 63 other CPUs (a phenomenon called “cache-line bouncing” or “false sharing”). Per-CPU variables eliminate this entirely. The kernel uses them for statistics counters, memory allocator free lists, and many other high-contention paths.
#include <linux/percpu.h>

// Static per-CPU variable
DEFINE_PER_CPU(int, my_counter);

// Access
int val = get_cpu_var(my_counter);  // Disables preemption
val++;
put_cpu_var(my_counter);            // Re-enables preemption

// Or using this_cpu operations (preferred, handles preemption)
this_cpu_inc(my_counter);
this_cpu_add(my_counter, 5);
int sum = this_cpu_read(my_counter);

// Dynamic per-CPU allocation
int __percpu *ptr = alloc_percpu(int);
*per_cpu_ptr(ptr, cpu) = value;
free_percpu(ptr);

// Sum across all CPUs
int total = 0;
int cpu;
for_each_possible_cpu(cpu)
    total += *per_cpu_ptr(ptr, cpu);

Why Per-CPU Matters

┌─────────────────────────────────────────────────────────────────────────────┐
│                      SHARED VS PER-CPU VARIABLE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   SHARED VARIABLE (cache bouncing problem):                                  │
│                                                                              │
│   CPU 0 writes → invalidates cache line on CPU 1, 2, 3                      │
│   CPU 1 writes → invalidates cache line on CPU 0, 2, 3                      │
│   CPU 2 writes → ...                                                         │
│                                                                              │
│   Cache line bounces between CPUs = ~100+ cycles per access                 │
│                                                                              │
│   ────────────────────────────────────────────────────────────────          │
│                                                                              │
│   PER-CPU VARIABLE (no bouncing):                                            │
│                                                                              │
│   ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐                   │
│   │ CPU 0  │     │ CPU 1  │     │ CPU 2  │     │ CPU 3  │                   │
│   │counter │     │counter │     │counter │     │counter │                   │
│   │   5    │     │   3    │     │   7    │     │   2    │                   │
│   └────────┘     └────────┘     └────────┘     └────────┘                   │
│                                                                              │
│   Each CPU has its own copy = always in local cache                         │
│   To get total: sum all per-CPU values                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

RCU (Read-Copy-Update)

The most important synchronization mechanism for read-heavy kernel data structures. The analogy: Imagine you’re editing a Wikipedia article. The traditional approach (locking) would be: lock the page, make edits, unlock. While locked, nobody can read it. RCU works differently: you make a copy of the article, edit the copy, then atomically swap the link so new readers see the new version. Old readers who started before the swap still see the old version — and that’s fine, because they’ll finish reading soon. Once all old readers are done, you delete the old copy. Readers never wait, ever. Why RCU matters: The routing table in a busy Linux server might be read millions of times per second (every packet triggers a lookup) but updated only a few times per day. Using traditional locks would serialize all those reads. RCU makes reads essentially free — just a preemption disable/enable pair, no atomic operations, no memory barriers on the read path. This is why the networking stack, filesystem layer, and module system all depend on RCU.
Common Misconception: “RCU is just a fancy lock.” It is not a lock at all. Readers never block, and there is no lock acquisition on the read path. RCU is a publication mechanism combined with a deferred reclamation scheme. The grace period concept (waiting for all pre-existing readers to finish) is fundamentally different from mutual exclusion.

RCU Concept

┌─────────────────────────────────────────────────────────────────────────────┐
│                          RCU UPDATE PROCESS                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Initial state:                                                             │
│   ┌─────────┐     ┌─────────────┐                                           │
│   │  Head   │────→│  Old Data   │                                           │
│   └─────────┘     └─────────────┘                                           │
│        ↑                ↑                                                    │
│     Writers          Readers (may be reading)                               │
│                                                                              │
│   Step 1: Allocate new, copy data, modify copy                              │
│   ┌─────────┐     ┌─────────────┐                                           │
│   │  Head   │────→│  Old Data   │  (readers still using)                    │
│   └─────────┘     └─────────────┘                                           │
│                   ┌─────────────┐                                           │
│                   │  New Data   │  (prepared, not linked)                   │
│                   └─────────────┘                                           │
│                                                                              │
│   Step 2: Update pointer atomically                                         │
│   ┌─────────┐     ┌─────────────┐                                           │
│   │  Head   │──┐  │  Old Data   │  (existing readers still valid!)          │
│   └─────────┘  │  └─────────────┘                                           │
│                │  ┌─────────────┐                                           │
│                └─→│  New Data   │  (new readers see this)                   │
│                   └─────────────┘                                           │
│                                                                              │
│   Step 3: Wait for grace period (all readers finish)                        │
│   synchronize_rcu() blocks until no reader can have old pointer             │
│                                                                              │
│   Step 4: Free old data safely                                              │
│                   ┌─────────────┐                                           │
│   ┌─────────┐────→│  New Data   │                                           │
│   └─────────┘     └─────────────┘                                           │
│                   Old data freed (no readers could be using it)             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

RCU API

#include <linux/rcupdate.h>

struct my_data {
    int value;
    struct rcu_head rcu;  // For deferred freeing
};

struct my_data __rcu *global_ptr;

// Reader side (no blocking allowed!)
rcu_read_lock();
struct my_data *ptr = rcu_dereference(global_ptr);
if (ptr)
    printk("Value: %d\n", ptr->value);
rcu_read_unlock();

// Writer side
struct my_data *new = kmalloc(sizeof(*new), GFP_KERNEL);
new->value = 42;

struct my_data *old = rcu_dereference_protected(global_ptr, 
                            lockdep_is_held(&my_lock));
rcu_assign_pointer(global_ptr, new);

// Wait for all readers, then free
synchronize_rcu();
kfree(old);

// Or use call_rcu for async freeing
void my_free(struct rcu_head *head) {
    struct my_data *ptr = container_of(head, struct my_data, rcu);
    kfree(ptr);
}
call_rcu(&old->rcu, my_free);

Why RCU is Important

  • Zero-overhead reads: No atomics, no memory barriers on read path
  • Scalability: Readers never block each other
  • Used everywhere: Routing tables, file descriptors, module list

Memory Allocators

kmalloc - Small Object Allocation

#include <linux/slab.h>

// Basic allocation
void *ptr = kmalloc(size, GFP_KERNEL);
kfree(ptr);

// Zero-initialized
void *ptr = kzalloc(size, GFP_KERNEL);

// Array allocation
void *arr = kmalloc_array(n, sizeof(type), GFP_KERNEL);
void *arr = kcalloc(n, sizeof(type), GFP_KERNEL);  // zeroed

GFP Flags

FlagMeaningWhen to Use
GFP_KERNELMay sleep, normal allocationMost kernel code
GFP_ATOMICCannot sleep, uses reservesInterrupt context
GFP_NOWAITCannot sleep, fails quicklyWhen you can retry
GFP_USERFor user-space requestsmmap, etc.
GFP_DMADMA-capable memoryDevice drivers

vmalloc - Large Allocations

#include <linux/vmalloc.h>

// For large allocations (pages may not be contiguous)
void *ptr = vmalloc(large_size);
vfree(ptr);

// With zeroing
void *ptr = vzalloc(large_size);

kmalloc vs vmalloc

Aspectkmallocvmalloc
Physical memoryContiguousMay be non-contiguous
Size limit~4MBLimited by virtual address space
PerformanceFasterSlower (TLB overhead)
Use caseSmall objectsLarge buffers, module loading

Slab Allocator

For frequently allocated objects of the same size:
// Create cache
struct kmem_cache *my_cache = kmem_cache_create(
    "my_objects",           // Name (for debugging)
    sizeof(struct my_obj),  // Object size
    0,                      // Alignment (0 = default)
    SLAB_HWCACHE_ALIGN,     // Flags
    NULL                    // Constructor
);

// Allocate from cache
struct my_obj *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);

// Free to cache
kmem_cache_free(my_cache, obj);

// Destroy cache
kmem_cache_destroy(my_cache);

Slab Allocator Internals

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SLAB ALLOCATOR STRUCTURE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   kmem_cache ("task_struct" cache)                                          │
│   ┌────────────────────────────────────────────────────────────────────┐    │
│   │  name: "task_struct"                                                │    │
│   │  object_size: 2688 bytes                                            │    │
│   │  objects_per_slab: 12                                               │    │
│   │                                                                     │    │
│   │  Per-CPU cache (fast path):                                        │    │
│   │  ┌─────────────────────────────────────────────────────────────┐   │    │
│   │  │ CPU 0: [obj][obj][obj][free][free]                          │   │    │
│   │  │ CPU 1: [obj][obj][free][free][free]                         │   │    │
│   │  │ CPU 2: [free][free][free][free][free]                       │   │    │
│   │  └─────────────────────────────────────────────────────────────┘   │    │
│   │                                                                     │    │
│   │  Partial slabs (shared):                                           │    │
│   │  ┌──────────────────────┐  ┌──────────────────────┐               │    │
│   │  │ Slab 1: 8/12 used    │  │ Slab 2: 3/12 used    │               │    │
│   │  └──────────────────────┘  └──────────────────────┘               │    │
│   │                                                                     │    │
│   │  Full slabs:                                                       │    │
│   │  ┌──────────────────────┐                                          │    │
│   │  │ Slab 3: 12/12 used   │                                          │    │
│   │  └──────────────────────┘                                          │    │
│   └────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Lab Exercises

Objective: Use kernel-style linked lists in user space
// list.c - User-space implementation
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

// Minimal kernel-style list implementation
struct list_head {
    struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_entry(ptr, type, member) container_of(ptr, type, member)

#define list_for_each_entry(pos, head, member) \
    for (pos = list_entry((head)->next, typeof(*pos), member); \
         &pos->member != (head); \
         pos = list_entry(pos->member.next, typeof(*pos), member))

static inline void INIT_LIST_HEAD(struct list_head *list) {
    list->next = list->prev = list;
}

static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next) {
    next->prev = new;
    new->next = next;
    new->prev = prev;
    prev->next = new;
}

static inline void list_add_tail(struct list_head *new, 
                                 struct list_head *head) {
    __list_add(new, head->prev, head);
}

// Example struct with embedded list
struct process {
    int pid;
    char name[32];
    struct list_head list;
};

LIST_HEAD(process_list);

int main() {
    // Create processes
    for (int i = 0; i < 5; i++) {
        struct process *p = malloc(sizeof(*p));
        p->pid = 100 + i;
        snprintf(p->name, sizeof(p->name), "process_%d", i);
        list_add_tail(&p->list, &process_list);
    }
    
    // Iterate
    struct process *p;
    list_for_each_entry(p, &process_list, list) {
        printf("PID: %d, Name: %s\n", p->pid, p->name);
    }
    
    return 0;
}
Objective: Analyze kernel memory allocation
# View all slab caches
cat /proc/slabinfo | head -20

# Detailed slab stats
sudo slabtop

# Memory usage by cache
cat /proc/slabinfo | awk 'NR>2 {print $1, $2*$4}' | sort -nk2 | tail -20

# Find task_struct cache
cat /proc/slabinfo | grep task_struct

# View with slabinfo tool
sudo cat /sys/kernel/slab/*/alloc_calls | head -50
Objective: Understand RCU behavior
// rcu_sim.c - Simple RCU-like simulation
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

struct data {
    int value;
};

atomic_int grace_period;
struct data *_Atomic global_ptr;

void *reader(void *arg) {
    int tid = *(int *)arg;
    for (int i = 0; i < 1000000; i++) {
        // Simulate rcu_read_lock (just read grace period)
        int gp = atomic_load(&grace_period);
        
        // Read data
        struct data *ptr = atomic_load(&global_ptr);
        if (ptr) {
            // Use the data
            volatile int v = ptr->value;
            (void)v;
        }
        
        // Simulate rcu_read_unlock
    }
    printf("Reader %d done\n", tid);
    return NULL;
}

void *writer(void *arg) {
    for (int i = 0; i < 100; i++) {
        // Allocate new
        struct data *new = malloc(sizeof(*new));
        new->value = i;
        
        // Swap pointer
        struct data *old = atomic_exchange(&global_ptr, new);
        
        // Wait for grace period (simplified)
        atomic_fetch_add(&grace_period, 1);
        usleep(1000);  // Real RCU uses smarter waiting
        
        // Free old
        free(old);
    }
    printf("Writer done\n");
    return NULL;
}

int main() {
    global_ptr = malloc(sizeof(struct data));
    ((struct data *)global_ptr)->value = 0;
    
    pthread_t readers[4], writer_t;
    int tids[4] = {0, 1, 2, 3};
    
    for (int i = 0; i < 4; i++)
        pthread_create(&readers[i], NULL, reader, &tids[i]);
    pthread_create(&writer_t, NULL, writer, NULL);
    
    for (int i = 0; i < 4; i++)
        pthread_join(readers[i], NULL);
    pthread_join(writer_t, NULL);
    
    return 0;
}

Interview Questions

Answer:container_of(ptr, type, member) returns a pointer to the containing structure given a pointer to a member.How it works:
#define container_of(ptr, type, member) \
    (type *)((char *)ptr - offsetof(type, member))
  1. offsetof(type, member) - bytes from struct start to member
  2. (char *)ptr - treat ptr as byte pointer
  3. Subtract offset to get struct start
  4. Cast to struct pointer
Example:
struct person {
    char name[32];     // offset 0
    int age;           // offset 32
    struct list_head list;  // offset 36
};

// Given &person.list (address 0x1024)
// container_of returns 0x1024 - 36 = 0x1000 (struct start)
Why important: Enables type-agnostic data structures in C.
Answer:Use RCU when:
  • Reads vastly outnumber writes
  • Read-side performance is critical
  • Can tolerate slightly stale data on reads
  • Updates can use copy-and-replace semantics
Use spinlock when:
  • Read/write ratio is balanced
  • Need to modify data in place
  • Can’t tolerate stale reads
  • Critical section is very short
Performance comparison:
OperationRCUSpinlock
Read (uncontended)~1 cycle~10 cycles
Read (contended)~1 cycle~100+ cycles
Write~100+ cycles~10 cycles
Real examples:
  • Routing table: RCU (millions of lookups, rare updates)
  • Memory allocator: Spinlock (frequent alloc/free)
Answer:Problems with raw page allocation:
  • Internal fragmentation (small objects waste pages)
  • Slow initialization (must init objects each time)
  • Cache unfriendly (random placement)
Slab benefits:
  1. Reduced fragmentation: Groups same-size objects
  2. Cache coloring: Staggers objects across cache lines
  3. Object caching: Keeps freed objects initialized
  4. Per-CPU caches: Reduces lock contention
  5. Debugging support: Red zones, poisoning
Example impact:
  • Creating process: task_struct allocated instantly from slab
  • Without slab: Would need page allocation + initialization
  • Measured speedup: 10-100x for hot objects
Answer:What are per-CPU variables:
  • Each CPU gets its own copy of the variable
  • No synchronization needed for CPU-local access
Benefits:
  1. No cache bouncing: Variable always in local cache
  2. No locking needed: Each CPU has exclusive access
  3. Better scalability: Performance scales with CPU count
Implementation:
// Data layout in memory:
CPU 0: [counter_0][padding...][other_vars_0]
CPU 1: [counter_1][padding...][other_vars_1]
CPU 2: [counter_2][padding...][other_vars_2]
Use cases:
  • Counters (stats, metrics)
  • Caches (per-CPU object pools)
  • State (current process, interrupt flags)
Trade-off: Uses more memory (N copies) for better performance.

Debugging Tips

Practical Debugging Guidance for Kernel Data StructuresThese tips will save you hours when you’re tracing bugs in kernel code or kernel modules.
# Inspect slab cache health -- look for high allocation failures
sudo slabtop -o -s c | head -20
# Columns: OBJS, ACTIVE, USE%, OBJ SIZE, SLABS -- watch for USE% near 100%

# Detect memory leaks in slab caches (growing over time)
watch -n 5 'cat /proc/slabinfo | grep -E "task_struct|dentry|inode_cache" | awk "{print \$1, \$2, \$3}"'
# $2 = num_objs, $3 = active_objs. If active_objs keeps growing, you have a leak.

# Trace RCU grace periods (slow grace periods cause memory pressure)
cat /sys/kernel/debug/rcu/rcu_preempt/rcudata
# Look for "c=" (completed) and "g=" (grace period number)
# If g >> c, RCU callbacks are piling up

# Check for list corruption (common kernel bug symptom)
# Symptom: "list_del corruption" or "list_add corruption" in dmesg
# Enable CONFIG_DEBUG_LIST=y to get early detection
dmesg | grep -i "list.*corruption"

# Trace kmalloc/kfree to find memory leaks in modules
sudo bpftrace -e '
kprobe:kmalloc { @allocs[comm] = count(); }
kprobe:kfree   { @frees[comm] = count(); }
interval:s:5   { print(@allocs); print(@frees); clear(@allocs); clear(@frees); }
'
Debugging RCU Stalls: If you see “rcu_sched self-detected stall on CPU” in dmesg, it means a CPU has been in an RCU read-side critical section for too long (or RCU callbacks cannot run). Common causes: a tight loop with preemption disabled, a softirq handler that never yields, or a spinlock held too long. This is a serious bug that can make the system unresponsive.

Common Misconceptions

MisconceptionReality
”kernel linked lists are slower because of container_of overhead”container_of compiles to a single pointer subtraction — zero runtime cost. The real advantage is avoiding a separate allocation for the node.
”RCU is only for networking”RCU is used in 20+ subsystems: VFS (dcache, inode), scheduler, module list, PID lookup, audit, security, and more.
”Per-CPU variables waste memory”On a 64-core machine, a per-CPU int uses 256 bytes. An atomic counter with cache-line bouncing across 64 CPUs wastes far more in stalled cycles.
”Slab allocator is just a memory pool”SLUB also provides: per-CPU caching for lock-free fast paths, SLAB_HWCACHE_ALIGN for cache coloring, red-zone poisoning for debugging, and merging of similar-sized caches.
”Hash tables are always better than trees”Hash tables cannot do range queries or ordered iteration. The kernel chooses RB-trees for VMAs (need to find “which VMA contains address X?”) and hash tables for sockets (need exact port lookup).

Key Takeaways

Kernel Lists

Embedded list_head pattern enables type-agnostic, efficient linked lists

RCU

Read-Copy-Update provides scalable read-heavy synchronization

Slab Allocator

Object caching and per-CPU pools optimize frequent allocations

Per-CPU Variables

Eliminate cache bouncing for frequently accessed data

Interview Deep-Dive

Strong Answer:
  • The scheduler needs three operations to be efficient: find the minimum vruntime task (the next task to run), insert a task when it becomes runnable, and remove a task when it is selected or blocks. A red-black tree provides O(log n) for all three operations, and the leftmost node (minimum) can be cached for O(1) access.
  • A min-heap gives O(1) find-min and O(log n) insert, but removal of an arbitrary element (not the minimum) is O(n) because you need to find it first. The scheduler frequently removes tasks that block (sleep on I/O, mutex, etc.), not just the minimum, so this O(n) removal is unacceptable.
  • A sorted array gives O(1) find-min and O(log n) search, but insertion requires shifting elements, which is O(n). With hundreds or thousands of runnable tasks, this becomes expensive.
  • A skip list would work (O(log n) for all operations on average), but red-black trees have lower constant factors due to better cache locality (tree nodes are compact structs, skip list nodes have variable-sized forward pointer arrays) and deterministic worst-case guarantees (skip lists are probabilistic).
  • The kernel’s rb-tree implementation is also extremely well-optimized: it uses augmented rb-trees where the leftmost node pointer is cached in rb_root_cached, making pick_next_task_fair() effectively O(1). The rebalancing after insertion or removal touches at most 3 rotations, keeping the tree balanced without expensive restructuring.
Follow-up: How does the kernel’s container_of macro make the rb-tree implementation type-agnostic, and why is this important?Follow-up Answer:
  • The kernel’s rb-tree does not store user data in tree nodes. Instead, the rb_node structure is embedded inside the user’s data structure (for example, sched_entity contains an rb_node run_node). When traversing the tree, you get pointers to rb_node structures. container_of(ptr, struct sched_entity, run_node) uses pointer arithmetic (subtracting the field offset from the pointer) to recover the containing sched_entity pointer. This means the same rb-tree code works for any data type without templates, generics, or void pointers with casts. It also means zero memory allocation overhead for the tree structure itself — the link nodes are part of the data, so inserting a task requires no additional memory allocation.
Strong Answer:
  • The fundamental difference is that RCU readers do not acquire any lock at all. rcu_read_lock() merely disables preemption (a single per-CPU counter increment). There is no atomic operation, no memory barrier, no cache line bouncing between CPUs. This means RCU read-side critical sections scale perfectly: 1000 CPUs reading simultaneously have zero contention. A read-write lock, even an optimized one like rwlock_t, requires atomic operations on a shared cache line, creating contention that increases with CPU count.
  • RCU achieves this by shifting all synchronization cost to the write side. A writer does not modify data in place. Instead, it allocates a new copy, modifies the copy, then atomically updates the pointer to the new version using rcu_assign_pointer() (which includes a write memory barrier). Old readers continue reading the old version — they are safe because the old data is not freed until all pre-existing readers finish. The writer calls synchronize_rcu() or call_rcu() to wait for the grace period.
  • A grace period is the time during which all CPUs have gone through at least one context switch (or voluntary quiescent state). Since rcu_read_lock() disables preemption, a context switch on a CPU guarantees no reader on that CPU is in an RCU read-side critical section. Once all CPUs have passed through a quiescent state, the old data can safely be freed.
  • The trade-off: writers pay a significant cost (memory allocation for copies, waiting for grace periods), and readers see stale data during the grace period. This makes RCU ideal for read-mostly data (routing tables, configuration, module lists) and terrible for write-heavy data.
Follow-up: What is SRCU (Sleepable RCU) and when would you use it instead of regular RCU?Follow-up Answer:
  • Regular RCU readers cannot sleep because rcu_read_lock() disables preemption, and sleeping while preemption is disabled would cause scheduling problems. SRCU (Sleepable RCU) allows readers to sleep by using a per-CPU counter pair instead of preemption disable. srcu_read_lock() increments one counter, srcu_read_unlock() increments the other. The writer’s synchronize_srcu() waits until both counters match, indicating no readers are active. This is more expensive on the read side (two atomic increments instead of one preemption counter) but necessary when the read-side critical section must perform blocking operations like disk I/O or sleeping locks. Use cases include file system operations where you need RCU-like read scalability but the read path might block.
Strong Answer:
  • The options are: kernel hash table (hashtable.h), rhashtable (resizable hash table), rb-tree, and XArray.
  • For 10 million entries with fast lookup and frequent updates, I would use rhashtable (resizable hash table with RCU-protected reads). The standard kernel hashtable.h has a fixed number of buckets set at compile time. With 10M entries and 1024 buckets, each bucket chain averages 10K entries, making lookup O(10K) — unacceptable. Rhashtable dynamically resizes its bucket count to maintain short chains (target 0-1 entries per bucket), providing amortized O(1) lookup.
  • Rhashtable also supports RCU-protected reads: lookups do not take any locks, only rcu_read_lock(). Insertions and deletions take a per-bucket spinlock, providing fine-grained write concurrency. The resizing operation is done incrementally in the background using RCU, so readers are never blocked during a resize.
  • An rb-tree would give O(log n) lookup (about 23 comparisons for 10M entries), which is slower than a well-sized hash table’s O(1). XArray is designed for sparse integer-indexed data (like page cache indices), not arbitrary key types like IP addresses.
  • Implementation: I would define the rhashtable parameters with struct rhashtable_params specifying the key offset, key length (4 bytes for IPv4), hash function (jhash), and a reasonable initial size. For IPv6 support, the key length increases to 16 bytes but the approach is identical.
Follow-up: What are the memory overhead considerations for 10 million entries, and how does rhashtable handle memory pressure?Follow-up Answer:
  • Each rhashtable entry requires the user struct plus an rhash_head (a single pointer, 8 bytes on 64-bit). The bucket array itself is an array of pointers, which for 10M entries with a load factor of 0.75 would be approximately 13.3M buckets, each 8 bytes = ~106MB just for buckets. During resize, both old and new bucket arrays coexist temporarily, doubling the bucket memory. The user data (connection metadata) dominates memory usage at this scale. Under memory pressure, rhashtable allocation can fail during resize. The rhashtable_params structure allows specifying GFP_NOWAIT for non-blocking allocation and setting max_size to cap bucket growth. If resize allocation fails, the table continues operating with longer chains (degraded but functional).

Next: Process Subsystem Deep Dive →