> ## Documentation Index
> Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Kernel Data Structures

> Essential data structures used throughout the Linux kernel: linked lists, trees, RCU, and memory allocators

# Kernel Data Structures

The Linux kernel uses specialized data structures optimized for performance, concurrency, and cache efficiency. Understanding these is essential for reading kernel source code and for systems programming interviews.

Think of kernel data structures as the **plumbing of the operating system**. User-space programs get fancy high-level containers (hash maps, vectors, trees) from their standard libraries. The kernel has to build its own -- and every decision matters because these structures are hit millions of times per second across every CPU. A poorly chosen data structure in the scheduler means your 64-core server acts like a single-core machine. A cache-unfriendly list in the networking stack means dropped packets at 10Gbps. This is why kernel data structures look different from textbook implementations: they are battle-hardened for real hardware constraints.

<Info>
  **Interview Frequency**: High (especially for infrastructure roles)\
  **Key Topics**: Linked lists, RB-trees, RCU, slab allocator\
  **Time to Master**: 8-10 hours
</Info>

***

## Kernel Linked Lists

The kernel's linked list implementation is one of the most elegant pieces of C code you'll encounter.

**The analogy**: Imagine a chain of hotel rooms. The traditional approach puts a list of room numbers on each door (the data structure contains pointers to the next element). The kernel approach is different: instead of a list that holds data, you **embed a chain link inside each room**. Any room can be on any chain, and you can have multiple chains running through the same rooms. This is the `list_head` pattern -- the link lives *inside* the data, not the other way around.

### Traditional vs Kernel-Style Linked Lists

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    TRADITIONAL VS KERNEL LINKED LIST                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   TRADITIONAL (data embedded in node):                                       │
│                                                                              │
│   struct node {                                                              │
│       int data;                                                              │
│       struct node *next;                                                     │
│       struct node *prev;                                                     │
│   };                                                                         │
│                                                                              │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                              │
│   │ data: 1  │───→│ data: 2  │───→│ data: 3  │                              │
│   │ next ────│    │ next ────│    │ next:NULL│                              │
│   │ prev:NULL│←───│ prev ────│←───│ prev ────│                              │
│   └──────────┘    └──────────┘    └──────────┘                              │
│                                                                              │
│   KERNEL-STYLE (node embedded in data):                                      │
│                                                                              │
│   struct list_head {                                                         │
│       struct list_head *next;                                                │
│       struct list_head *prev;                                                │
│   };                                                                         │
│                                                                              │
│   struct my_data {                                                           │
│       int value;                                                             │
│       char name[32];                                                         │
│       struct list_head list;  // embed the link                             │
│   };                                                                         │
│                                                                              │
│   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐         │
│   │ value: 1        │    │ value: 2        │    │ value: 3        │         │
│   │ name: "first"   │    │ name: "second"  │    │ name: "third"   │         │
│   │ list.next ──────│───→│ list.next ──────│───→│ list.next ──────│─→(head)│
│   │ list.prev ←─────│←───│ list.prev ←─────│←───│ list.prev ←─────│         │
│   └─────────────────┘    └─────────────────┘    └─────────────────┘         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Why Kernel-Style is Better

1. **Type-agnostic**: Same `list_head` works for any struct
2. **No memory allocation**: Node is part of the data
3. **Multiple lists**: One struct can be on multiple lists
4. **Cache friendly**: Data and links together

### The Magic: container\_of Macro

`container_of` is the "find my house given my mailbox" macro. If you know where the mailbox (list\_head) is, and you know how far the mailbox is from the front door (the offset), you can find the house (the containing struct). This single macro is what makes the entire kernel-style linked list pattern work, and it appears thousands of times throughout the kernel source.

```c theme={null}
// include/linux/kernel.h
// Given a pointer to a member inside a struct, compute the pointer
// to the containing struct by subtracting the member's offset.
#define container_of(ptr, type, member) ({                \
    const typeof(((type *)0)->member) *__mptr = (ptr);    \
    (type *)((char *)__mptr - offsetof(type, member)); })

// Example usage:
struct my_data {
    int value;
    struct list_head list;
};

// Given a list_head pointer, get the containing struct
struct list_head *pos;
struct my_data *data = container_of(pos, struct my_data, list);
```

### List Operations

```c theme={null}
#include <linux/list.h>

// Initialize list head
LIST_HEAD(my_list);  // Static initialization
// or
struct list_head my_list;
INIT_LIST_HEAD(&my_list);

// Add to list
struct my_data *new_item = kmalloc(sizeof(*new_item), GFP_KERNEL);
new_item->value = 42;
list_add(&new_item->list, &my_list);       // Add to front
list_add_tail(&new_item->list, &my_list);  // Add to back

// Delete from list
list_del(&item->list);

// Iterate over list
struct my_data *entry;
list_for_each_entry(entry, &my_list, list) {
    printk("Value: %d\n", entry->value);
}

// Safe iteration (allows deletion during iteration)
struct my_data *tmp;
list_for_each_entry_safe(entry, tmp, &my_list, list) {
    if (entry->value < 0) {
        list_del(&entry->list);
        kfree(entry);
    }
}

// Check if list is empty
if (list_empty(&my_list)) { ... }

// Move entire list
LIST_HEAD(new_list);
list_splice(&my_list, &new_list);
```

***

## Red-Black Trees

Used throughout the kernel for fast ordered access: process scheduling, memory management, etc.

**The analogy**: Think of an RB-tree as a **self-balancing bookshelf**. You could throw books onto a pile (unsorted list, O(n) lookup) or carefully arrange them by title (sorted array, O(n) insert). An RB-tree gives you O(log n) for both -- like a bookshelf that automatically reorganizes itself every time you add or remove a book, guaranteeing that no one section gets too tall compared to others.

**Why not a plain binary search tree?** Without balancing, a BST can degenerate into a linked list (O(n) operations). The kernel cannot tolerate that -- imagine the CFS scheduler taking O(n) to pick the next task on a server with 10,000 runnable processes. RB-trees guarantee O(log n) worst-case, which is why they show up in almost every performance-critical kernel subsystem.

<Warning>
  **Common Misconception**: Many engineers assume the kernel uses hash tables everywhere for O(1) lookups. In reality, the kernel uses RB-trees when it needs *ordered* access (e.g., VMAs sorted by address range, CFS tasks sorted by vruntime). Hash tables give O(1) lookup but O(n) ordered traversal, so the choice depends on the access pattern.
</Warning>

### RB-Tree Properties

1. Every node is red or black
2. Root is black
3. All leaves (NULL) are black
4. Red node's children are black
5. All paths to leaves have same black count

### Kernel RB-Tree API

```c theme={null}
#include <linux/rbtree.h>

struct my_node {
    struct rb_node node;  // Must be first or use rb_entry
    int key;
    int value;
};

// Tree root
struct rb_root my_tree = RB_ROOT;

// Insert (must implement search yourself)
void my_insert(struct rb_root *root, struct my_node *new)
{
    struct rb_node **link = &root->rb_node;
    struct rb_node *parent = NULL;
    struct my_node *entry;

    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct my_node, node);

        if (new->key < entry->key)
            link = &parent->rb_left;
        else
            link = &parent->rb_right;
    }

    rb_link_node(&new->node, parent, link);
    rb_insert_color(&new->node, root);
}

// Search
struct my_node *my_search(struct rb_root *root, int key)
{
    struct rb_node *node = root->rb_node;

    while (node) {
        struct my_node *entry = rb_entry(node, struct my_node, node);

        if (key < entry->key)
            node = node->rb_left;
        else if (key > entry->key)
            node = node->rb_right;
        else
            return entry;
    }
    return NULL;
}

// Iterate in order
struct rb_node *node;
for (node = rb_first(&my_tree); node; node = rb_next(node)) {
    struct my_node *entry = rb_entry(node, struct my_node, node);
    printk("Key: %d, Value: %d\n", entry->key, entry->value);
}
```

### RB-Tree Use Cases in Kernel

| Subsystem         | Use                        |
| ----------------- | -------------------------- |
| CFS Scheduler     | Task ordering by vruntime  |
| Memory Management | VMA ordering by address    |
| epoll             | File descriptor management |
| ext4              | Extent tree                |

***

## Radix Trees and XArray

For sparse arrays with integer keys (like page cache).

**The analogy**: Imagine a phone book where you can look up entries by page number. A regular array would require memory for every possible page number (billions of entries for a large file). A radix tree is like a **hierarchical index** -- you only allocate entries for page numbers that actually exist. For a 1GB file that only has 100 pages cached, you store 100 entries instead of 262,144.

The XArray replaced the older radix tree API starting in kernel 4.20 with a cleaner, more consistent interface. If you're reading older kernel code, you'll see `radix_tree_lookup()` calls; modern code uses `xa_load()`.

### XArray (Modern Replacement for Radix Tree)

```c theme={null}
#include <linux/xarray.h>

DEFINE_XARRAY(my_xa);

// Store value at index
void *old = xa_store(&my_xa, index, pointer, GFP_KERNEL);

// Load value at index
void *value = xa_load(&my_xa, index);

// Erase entry
xa_erase(&my_xa, index);

// Iterate
unsigned long index;
void *entry;
xa_for_each(&my_xa, index, entry) {
    printk("Index: %lu, Entry: %px\n", index, entry);
}
```

### Use Cases

* **Page cache**: Map file offset to page
* **Process ID allocation**: Map PID to task\_struct
* **Inode cache**: Map inode number to inode

***

## Hash Tables

For O(1) lookups when ordering isn't needed.

**When to use hash tables vs RB-trees**: Hash tables win when you only need "find this exact key" lookups (like finding a socket by port number). RB-trees win when you need "find the nearest key" or "iterate in order" (like finding the VMA that contains a given address). The kernel uses both extensively, and knowing which one a subsystem chose -- and why -- is the kind of insight that impresses in interviews.

### Kernel Hash Table API

```c theme={null}
#include <linux/hashtable.h>

// Define hash table with 2^10 = 1024 buckets
DEFINE_HASHTABLE(my_hash, 10);

struct my_entry {
    int key;
    int value;
    struct hlist_node node;
};

// Add entry
struct my_entry *entry = kmalloc(sizeof(*entry), GFP_KERNEL);
entry->key = 42;
entry->value = 100;
hash_add(my_hash, &entry->node, entry->key);

// Lookup
struct my_entry *found;
hash_for_each_possible(my_hash, found, node, search_key) {
    if (found->key == search_key)
        return found;
}

// Delete
hash_del(&entry->node);

// Iterate all entries
int bkt;
struct my_entry *cur;
hash_for_each(my_hash, bkt, cur, node) {
    printk("Key: %d, Value: %d\n", cur->key, cur->value);
}
```

***

## Per-CPU Variables

Avoid cache bouncing by giving each CPU its own copy.

**The analogy**: Imagine a shared whiteboard in an office where every employee writes tally marks. With 100 employees, they form a line, each waiting their turn. Per-CPU variables are like giving **each employee their own whiteboard**. Everyone writes simultaneously with zero waiting. When the manager needs the total, they walk around and add up all the whiteboards. This is exactly how per-CPU counters work in the kernel -- each CPU has its own copy, and you only aggregate when you need the global value.

**Why this matters in practice**: On a 64-core server, a shared atomic counter can become a severe bottleneck. Each `atomic_inc()` invalidates the cache line on all 63 other CPUs (a phenomenon called "cache-line bouncing" or "false sharing"). Per-CPU variables eliminate this entirely. The kernel uses them for statistics counters, memory allocator free lists, and many other high-contention paths.

```c theme={null}
#include <linux/percpu.h>

// Static per-CPU variable
DEFINE_PER_CPU(int, my_counter);

// Access
int val = get_cpu_var(my_counter);  // Disables preemption
val++;
put_cpu_var(my_counter);            // Re-enables preemption

// Or using this_cpu operations (preferred, handles preemption)
this_cpu_inc(my_counter);
this_cpu_add(my_counter, 5);
int sum = this_cpu_read(my_counter);

// Dynamic per-CPU allocation
int __percpu *ptr = alloc_percpu(int);
*per_cpu_ptr(ptr, cpu) = value;
free_percpu(ptr);

// Sum across all CPUs
int total = 0;
int cpu;
for_each_possible_cpu(cpu)
    total += *per_cpu_ptr(ptr, cpu);
```

### Why Per-CPU Matters

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      SHARED VS PER-CPU VARIABLE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   SHARED VARIABLE (cache bouncing problem):                                  │
│                                                                              │
│   CPU 0 writes → invalidates cache line on CPU 1, 2, 3                      │
│   CPU 1 writes → invalidates cache line on CPU 0, 2, 3                      │
│   CPU 2 writes → ...                                                         │
│                                                                              │
│   Cache line bounces between CPUs = ~100+ cycles per access                 │
│                                                                              │
│   ────────────────────────────────────────────────────────────────          │
│                                                                              │
│   PER-CPU VARIABLE (no bouncing):                                            │
│                                                                              │
│   ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐                   │
│   │ CPU 0  │     │ CPU 1  │     │ CPU 2  │     │ CPU 3  │                   │
│   │counter │     │counter │     │counter │     │counter │                   │
│   │   5    │     │   3    │     │   7    │     │   2    │                   │
│   └────────┘     └────────┘     └────────┘     └────────┘                   │
│                                                                              │
│   Each CPU has its own copy = always in local cache                         │
│   To get total: sum all per-CPU values                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## RCU (Read-Copy-Update)

The most important synchronization mechanism for read-heavy kernel data structures.

**The analogy**: Imagine you're editing a Wikipedia article. The traditional approach (locking) would be: lock the page, make edits, unlock. While locked, nobody can read it. RCU works differently: you **make a copy** of the article, edit the copy, then **atomically swap** the link so new readers see the new version. Old readers who started before the swap still see the old version -- and that's fine, because they'll finish reading soon. Once all old readers are done, you delete the old copy. Readers never wait, ever.

**Why RCU matters**: The routing table in a busy Linux server might be read millions of times per second (every packet triggers a lookup) but updated only a few times per day. Using traditional locks would serialize all those reads. RCU makes reads essentially free -- just a preemption disable/enable pair, no atomic operations, no memory barriers on the read path. This is why the networking stack, filesystem layer, and module system all depend on RCU.

<Warning>
  **Common Misconception**: "RCU is just a fancy lock." It is not a lock at all. Readers never block, and there is no lock acquisition on the read path. RCU is a *publication mechanism* combined with a *deferred reclamation scheme*. The grace period concept (waiting for all pre-existing readers to finish) is fundamentally different from mutual exclusion.
</Warning>

### RCU Concept

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          RCU UPDATE PROCESS                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Initial state:                                                             │
│   ┌─────────┐     ┌─────────────┐                                           │
│   │  Head   │────→│  Old Data   │                                           │
│   └─────────┘     └─────────────┘                                           │
│        ↑                ↑                                                    │
│     Writers          Readers (may be reading)                               │
│                                                                              │
│   Step 1: Allocate new, copy data, modify copy                              │
│   ┌─────────┐     ┌─────────────┐                                           │
│   │  Head   │────→│  Old Data   │  (readers still using)                    │
│   └─────────┘     └─────────────┘                                           │
│                   ┌─────────────┐                                           │
│                   │  New Data   │  (prepared, not linked)                   │
│                   └─────────────┘                                           │
│                                                                              │
│   Step 2: Update pointer atomically                                         │
│   ┌─────────┐     ┌─────────────┐                                           │
│   │  Head   │──┐  │  Old Data   │  (existing readers still valid!)          │
│   └─────────┘  │  └─────────────┘                                           │
│                │  ┌─────────────┐                                           │
│                └─→│  New Data   │  (new readers see this)                   │
│                   └─────────────┘                                           │
│                                                                              │
│   Step 3: Wait for grace period (all readers finish)                        │
│   synchronize_rcu() blocks until no reader can have old pointer             │
│                                                                              │
│   Step 4: Free old data safely                                              │
│                   ┌─────────────┐                                           │
│   ┌─────────┐────→│  New Data   │                                           │
│   └─────────┘     └─────────────┘                                           │
│                   Old data freed (no readers could be using it)             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### RCU API

```c theme={null}
#include <linux/rcupdate.h>

struct my_data {
    int value;
    struct rcu_head rcu;  // For deferred freeing
};

struct my_data __rcu *global_ptr;

// Reader side (no blocking allowed!)
rcu_read_lock();
struct my_data *ptr = rcu_dereference(global_ptr);
if (ptr)
    printk("Value: %d\n", ptr->value);
rcu_read_unlock();

// Writer side
struct my_data *new = kmalloc(sizeof(*new), GFP_KERNEL);
new->value = 42;

struct my_data *old = rcu_dereference_protected(global_ptr, 
                            lockdep_is_held(&my_lock));
rcu_assign_pointer(global_ptr, new);

// Wait for all readers, then free
synchronize_rcu();
kfree(old);

// Or use call_rcu for async freeing
void my_free(struct rcu_head *head) {
    struct my_data *ptr = container_of(head, struct my_data, rcu);
    kfree(ptr);
}
call_rcu(&old->rcu, my_free);
```

### Why RCU is Important

* **Zero-overhead reads**: No atomics, no memory barriers on read path
* **Scalability**: Readers never block each other
* **Used everywhere**: Routing tables, file descriptors, module list

***

## Memory Allocators

### kmalloc - Small Object Allocation

```c theme={null}
#include <linux/slab.h>

// Basic allocation
void *ptr = kmalloc(size, GFP_KERNEL);
kfree(ptr);

// Zero-initialized
void *ptr = kzalloc(size, GFP_KERNEL);

// Array allocation
void *arr = kmalloc_array(n, sizeof(type), GFP_KERNEL);
void *arr = kcalloc(n, sizeof(type), GFP_KERNEL);  // zeroed
```

### GFP Flags

| Flag         | Meaning                      | When to Use        |
| ------------ | ---------------------------- | ------------------ |
| `GFP_KERNEL` | May sleep, normal allocation | Most kernel code   |
| `GFP_ATOMIC` | Cannot sleep, uses reserves  | Interrupt context  |
| `GFP_NOWAIT` | Cannot sleep, fails quickly  | When you can retry |
| `GFP_USER`   | For user-space requests      | mmap, etc.         |
| `GFP_DMA`    | DMA-capable memory           | Device drivers     |

### vmalloc - Large Allocations

```c theme={null}
#include <linux/vmalloc.h>

// For large allocations (pages may not be contiguous)
void *ptr = vmalloc(large_size);
vfree(ptr);

// With zeroing
void *ptr = vzalloc(large_size);
```

### kmalloc vs vmalloc

| Aspect          | kmalloc       | vmalloc                          |
| --------------- | ------------- | -------------------------------- |
| Physical memory | Contiguous    | May be non-contiguous            |
| Size limit      | \~4MB         | Limited by virtual address space |
| Performance     | Faster        | Slower (TLB overhead)            |
| Use case        | Small objects | Large buffers, module loading    |

### Slab Allocator

For frequently allocated objects of the same size:

```c theme={null}
// Create cache
struct kmem_cache *my_cache = kmem_cache_create(
    "my_objects",           // Name (for debugging)
    sizeof(struct my_obj),  // Object size
    0,                      // Alignment (0 = default)
    SLAB_HWCACHE_ALIGN,     // Flags
    NULL                    // Constructor
);

// Allocate from cache
struct my_obj *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);

// Free to cache
kmem_cache_free(my_cache, obj);

// Destroy cache
kmem_cache_destroy(my_cache);
```

### Slab Allocator Internals

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        SLAB ALLOCATOR STRUCTURE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   kmem_cache ("task_struct" cache)                                          │
│   ┌────────────────────────────────────────────────────────────────────┐    │
│   │  name: "task_struct"                                                │    │
│   │  object_size: 2688 bytes                                            │    │
│   │  objects_per_slab: 12                                               │    │
│   │                                                                     │    │
│   │  Per-CPU cache (fast path):                                        │    │
│   │  ┌─────────────────────────────────────────────────────────────┐   │    │
│   │  │ CPU 0: [obj][obj][obj][free][free]                          │   │    │
│   │  │ CPU 1: [obj][obj][free][free][free]                         │   │    │
│   │  │ CPU 2: [free][free][free][free][free]                       │   │    │
│   │  └─────────────────────────────────────────────────────────────┘   │    │
│   │                                                                     │    │
│   │  Partial slabs (shared):                                           │    │
│   │  ┌──────────────────────┐  ┌──────────────────────┐               │    │
│   │  │ Slab 1: 8/12 used    │  │ Slab 2: 3/12 used    │               │    │
│   │  └──────────────────────┘  └──────────────────────┘               │    │
│   │                                                                     │    │
│   │  Full slabs:                                                       │    │
│   │  ┌──────────────────────┐                                          │    │
│   │  │ Slab 3: 12/12 used   │                                          │    │
│   │  └──────────────────────┘                                          │    │
│   └────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## Lab Exercises

<AccordionGroup>
  <Accordion title="Lab 1: Implement Kernel-Style Linked List" icon="code">
    **Objective**: Use kernel-style linked lists in user space

    ```c theme={null}
    // list.c - User-space implementation
    #include <stdio.h>
    #include <stdlib.h>
    #include <stddef.h>

    // Minimal kernel-style list implementation
    struct list_head {
        struct list_head *next, *prev;
    };

    #define LIST_HEAD_INIT(name) { &(name), &(name) }
    #define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    #define list_entry(ptr, type, member) container_of(ptr, type, member)

    #define list_for_each_entry(pos, head, member) \
        for (pos = list_entry((head)->next, typeof(*pos), member); \
             &pos->member != (head); \
             pos = list_entry(pos->member.next, typeof(*pos), member))

    static inline void INIT_LIST_HEAD(struct list_head *list) {
        list->next = list->prev = list;
    }

    static inline void __list_add(struct list_head *new,
                                  struct list_head *prev,
                                  struct list_head *next) {
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
    }

    static inline void list_add_tail(struct list_head *new, 
                                     struct list_head *head) {
        __list_add(new, head->prev, head);
    }

    // Example struct with embedded list
    struct process {
        int pid;
        char name[32];
        struct list_head list;
    };

    LIST_HEAD(process_list);

    int main() {
        // Create processes
        for (int i = 0; i < 5; i++) {
            struct process *p = malloc(sizeof(*p));
            p->pid = 100 + i;
            snprintf(p->name, sizeof(p->name), "process_%d", i);
            list_add_tail(&p->list, &process_list);
        }
        
        // Iterate
        struct process *p;
        list_for_each_entry(p, &process_list, list) {
            printf("PID: %d, Name: %s\n", p->pid, p->name);
        }
        
        return 0;
    }
    ```
  </Accordion>

  <Accordion title="Lab 2: Explore Slab Caches" icon="magnifying-glass">
    **Objective**: Analyze kernel memory allocation

    ```bash theme={null}
    # View all slab caches
    cat /proc/slabinfo | head -20

    # Detailed slab stats
    sudo slabtop

    # Memory usage by cache
    cat /proc/slabinfo | awk 'NR>2 {print $1, $2*$4}' | sort -nk2 | tail -20

    # Find task_struct cache
    cat /proc/slabinfo | grep task_struct

    # View with slabinfo tool
    sudo cat /sys/kernel/slab/*/alloc_calls | head -50
    ```
  </Accordion>

  <Accordion title="Lab 3: RCU Simulation" icon="sync">
    **Objective**: Understand RCU behavior

    ```c theme={null}
    // rcu_sim.c - Simple RCU-like simulation
    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <unistd.h>

    struct data {
        int value;
    };

    atomic_int grace_period;
    struct data *_Atomic global_ptr;

    void *reader(void *arg) {
        int tid = *(int *)arg;
        for (int i = 0; i < 1000000; i++) {
            // Simulate rcu_read_lock (just read grace period)
            int gp = atomic_load(&grace_period);
            
            // Read data
            struct data *ptr = atomic_load(&global_ptr);
            if (ptr) {
                // Use the data
                volatile int v = ptr->value;
                (void)v;
            }
            
            // Simulate rcu_read_unlock
        }
        printf("Reader %d done\n", tid);
        return NULL;
    }

    void *writer(void *arg) {
        for (int i = 0; i < 100; i++) {
            // Allocate new
            struct data *new = malloc(sizeof(*new));
            new->value = i;
            
            // Swap pointer
            struct data *old = atomic_exchange(&global_ptr, new);
            
            // Wait for grace period (simplified)
            atomic_fetch_add(&grace_period, 1);
            usleep(1000);  // Real RCU uses smarter waiting
            
            // Free old
            free(old);
        }
        printf("Writer done\n");
        return NULL;
    }

    int main() {
        global_ptr = malloc(sizeof(struct data));
        ((struct data *)global_ptr)->value = 0;
        
        pthread_t readers[4], writer_t;
        int tids[4] = {0, 1, 2, 3};
        
        for (int i = 0; i < 4; i++)
            pthread_create(&readers[i], NULL, reader, &tids[i]);
        pthread_create(&writer_t, NULL, writer, NULL);
        
        for (int i = 0; i < 4; i++)
            pthread_join(readers[i], NULL);
        pthread_join(writer_t, NULL);
        
        return 0;
    }
    ```
  </Accordion>
</AccordionGroup>

***

## Interview Questions

<AccordionGroup>
  <Accordion title="Q1: Explain how container_of works" icon="question">
    **Answer**:

    `container_of(ptr, type, member)` returns a pointer to the containing structure given a pointer to a member.

    **How it works**:

    ```c theme={null}
    #define container_of(ptr, type, member) \
        (type *)((char *)ptr - offsetof(type, member))
    ```

    1. `offsetof(type, member)` - bytes from struct start to member
    2. `(char *)ptr` - treat ptr as byte pointer
    3. Subtract offset to get struct start
    4. Cast to struct pointer

    **Example**:

    ```c theme={null}
    struct person {
        char name[32];     // offset 0
        int age;           // offset 32
        struct list_head list;  // offset 36
    };

    // Given &person.list (address 0x1024)
    // container_of returns 0x1024 - 36 = 0x1000 (struct start)
    ```

    **Why important**: Enables type-agnostic data structures in C.
  </Accordion>

  <Accordion title="Q2: When would you use RCU vs a spinlock?" icon="question">
    **Answer**:

    **Use RCU when**:

    * Reads vastly outnumber writes
    * Read-side performance is critical
    * Can tolerate slightly stale data on reads
    * Updates can use copy-and-replace semantics

    **Use spinlock when**:

    * Read/write ratio is balanced
    * Need to modify data in place
    * Can't tolerate stale reads
    * Critical section is very short

    **Performance comparison**:

    | Operation          | RCU           | Spinlock      |
    | ------------------ | ------------- | ------------- |
    | Read (uncontended) | \~1 cycle     | \~10 cycles   |
    | Read (contended)   | \~1 cycle     | \~100+ cycles |
    | Write              | \~100+ cycles | \~10 cycles   |

    **Real examples**:

    * Routing table: RCU (millions of lookups, rare updates)
    * Memory allocator: Spinlock (frequent alloc/free)
  </Accordion>

  <Accordion title="Q3: Why does the kernel use slab allocation?" icon="question">
    **Answer**:

    **Problems with raw page allocation**:

    * Internal fragmentation (small objects waste pages)
    * Slow initialization (must init objects each time)
    * Cache unfriendly (random placement)

    **Slab benefits**:

    1. **Reduced fragmentation**: Groups same-size objects
    2. **Cache coloring**: Staggers objects across cache lines
    3. **Object caching**: Keeps freed objects initialized
    4. **Per-CPU caches**: Reduces lock contention
    5. **Debugging support**: Red zones, poisoning

    **Example impact**:

    * Creating process: `task_struct` allocated instantly from slab
    * Without slab: Would need page allocation + initialization
    * Measured speedup: 10-100x for hot objects
  </Accordion>

  <Accordion title="Q4: Explain per-CPU variables and their benefits" icon="question">
    **Answer**:

    **What are per-CPU variables**:

    * Each CPU gets its own copy of the variable
    * No synchronization needed for CPU-local access

    **Benefits**:

    1. **No cache bouncing**: Variable always in local cache
    2. **No locking needed**: Each CPU has exclusive access
    3. **Better scalability**: Performance scales with CPU count

    **Implementation**:

    ```c theme={null}
    // Data layout in memory:
    CPU 0: [counter_0][padding...][other_vars_0]
    CPU 1: [counter_1][padding...][other_vars_1]
    CPU 2: [counter_2][padding...][other_vars_2]
    ```

    **Use cases**:

    * Counters (stats, metrics)
    * Caches (per-CPU object pools)
    * State (current process, interrupt flags)

    **Trade-off**: Uses more memory (N copies) for better performance.
  </Accordion>
</AccordionGroup>

***

## Debugging Tips

<Note>
  **Practical Debugging Guidance for Kernel Data Structures**

  These tips will save you hours when you're tracing bugs in kernel code or kernel modules.
</Note>

```bash theme={null}
# Inspect slab cache health -- look for high allocation failures
sudo slabtop -o -s c | head -20
# Columns: OBJS, ACTIVE, USE%, OBJ SIZE, SLABS -- watch for USE% near 100%

# Detect memory leaks in slab caches (growing over time)
watch -n 5 'cat /proc/slabinfo | grep -E "task_struct|dentry|inode_cache" | awk "{print \$1, \$2, \$3}"'
# $2 = num_objs, $3 = active_objs. If active_objs keeps growing, you have a leak.

# Trace RCU grace periods (slow grace periods cause memory pressure)
cat /sys/kernel/debug/rcu/rcu_preempt/rcudata
# Look for "c=" (completed) and "g=" (grace period number)
# If g >> c, RCU callbacks are piling up

# Check for list corruption (common kernel bug symptom)
# Symptom: "list_del corruption" or "list_add corruption" in dmesg
# Enable CONFIG_DEBUG_LIST=y to get early detection
dmesg | grep -i "list.*corruption"

# Trace kmalloc/kfree to find memory leaks in modules
sudo bpftrace -e '
kprobe:kmalloc { @allocs[comm] = count(); }
kprobe:kfree   { @frees[comm] = count(); }
interval:s:5   { print(@allocs); print(@frees); clear(@allocs); clear(@frees); }
'
```

<Warning>
  **Debugging RCU Stalls**: If you see "rcu\_sched self-detected stall on CPU" in dmesg, it means a CPU has been in an RCU read-side critical section for too long (or RCU callbacks cannot run). Common causes: a tight loop with preemption disabled, a softirq handler that never yields, or a spinlock held too long. This is a serious bug that can make the system unresponsive.
</Warning>

***

## Common Misconceptions

| Misconception                                                       | Reality                                                                                                                                                                                             |
| ------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| "kernel linked lists are slower because of `container_of` overhead" | `container_of` compiles to a single pointer subtraction -- zero runtime cost. The real advantage is avoiding a separate allocation for the node.                                                    |
| "RCU is only for networking"                                        | RCU is used in 20+ subsystems: VFS (dcache, inode), scheduler, module list, PID lookup, audit, security, and more.                                                                                  |
| "Per-CPU variables waste memory"                                    | On a 64-core machine, a per-CPU `int` uses 256 bytes. An atomic counter with cache-line bouncing across 64 CPUs wastes far more in stalled cycles.                                                  |
| "Slab allocator is just a memory pool"                              | SLUB also provides: per-CPU caching for lock-free fast paths, SLAB\_HWCACHE\_ALIGN for cache coloring, red-zone poisoning for debugging, and merging of similar-sized caches.                       |
| "Hash tables are always better than trees"                          | Hash tables cannot do range queries or ordered iteration. The kernel chooses RB-trees for VMAs (need to find "which VMA contains address X?") and hash tables for sockets (need exact port lookup). |

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Kernel Lists" icon="link">
    Embedded list\_head pattern enables type-agnostic, efficient linked lists
  </Card>

  <Card title="RCU" icon="sync">
    Read-Copy-Update provides scalable read-heavy synchronization
  </Card>

  <Card title="Slab Allocator" icon="boxes-stacked">
    Object caching and per-CPU pools optimize frequent allocations
  </Card>

  <Card title="Per-CPU Variables" icon="microchip">
    Eliminate cache bouncing for frequently accessed data
  </Card>
</CardGroup>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="The CFS scheduler uses a red-black tree to organize tasks by vruntime. Why a red-black tree and not a min-heap, sorted array, or skip list? Discuss the algorithmic trade-offs in the context of scheduler requirements." icon="message">
    **Strong Answer:**

    * The scheduler needs three operations to be efficient: find the minimum vruntime task (the next task to run), insert a task when it becomes runnable, and remove a task when it is selected or blocks. A red-black tree provides O(log n) for all three operations, and the leftmost node (minimum) can be cached for O(1) access.
    * A min-heap gives O(1) find-min and O(log n) insert, but removal of an arbitrary element (not the minimum) is O(n) because you need to find it first. The scheduler frequently removes tasks that block (sleep on I/O, mutex, etc.), not just the minimum, so this O(n) removal is unacceptable.
    * A sorted array gives O(1) find-min and O(log n) search, but insertion requires shifting elements, which is O(n). With hundreds or thousands of runnable tasks, this becomes expensive.
    * A skip list would work (O(log n) for all operations on average), but red-black trees have lower constant factors due to better cache locality (tree nodes are compact structs, skip list nodes have variable-sized forward pointer arrays) and deterministic worst-case guarantees (skip lists are probabilistic).
    * The kernel's rb-tree implementation is also extremely well-optimized: it uses augmented rb-trees where the leftmost node pointer is cached in `rb_root_cached`, making `pick_next_task_fair()` effectively O(1). The rebalancing after insertion or removal touches at most 3 rotations, keeping the tree balanced without expensive restructuring.

    **Follow-up:** How does the kernel's `container_of` macro make the rb-tree implementation type-agnostic, and why is this important?

    **Follow-up Answer:**

    * The kernel's rb-tree does not store user data in tree nodes. Instead, the `rb_node` structure is embedded inside the user's data structure (for example, `sched_entity` contains an `rb_node run_node`). When traversing the tree, you get pointers to `rb_node` structures. `container_of(ptr, struct sched_entity, run_node)` uses pointer arithmetic (subtracting the field offset from the pointer) to recover the containing `sched_entity` pointer. This means the same rb-tree code works for any data type without templates, generics, or void pointers with casts. It also means zero memory allocation overhead for the tree structure itself -- the link nodes are part of the data, so inserting a task requires no additional memory allocation.
  </Accordion>

  <Accordion title="Explain RCU in detail. A new kernel developer says 'RCU is just a read-write lock with deferred freeing.' Correct their understanding and explain what makes RCU fundamentally different." icon="message">
    **Strong Answer:**

    * The fundamental difference is that RCU readers do not acquire any lock at all. `rcu_read_lock()` merely disables preemption (a single per-CPU counter increment). There is no atomic operation, no memory barrier, no cache line bouncing between CPUs. This means RCU read-side critical sections scale perfectly: 1000 CPUs reading simultaneously have zero contention. A read-write lock, even an optimized one like `rwlock_t`, requires atomic operations on a shared cache line, creating contention that increases with CPU count.
    * RCU achieves this by shifting all synchronization cost to the write side. A writer does not modify data in place. Instead, it allocates a new copy, modifies the copy, then atomically updates the pointer to the new version using `rcu_assign_pointer()` (which includes a write memory barrier). Old readers continue reading the old version -- they are safe because the old data is not freed until all pre-existing readers finish. The writer calls `synchronize_rcu()` or `call_rcu()` to wait for the grace period.
    * A grace period is the time during which all CPUs have gone through at least one context switch (or voluntary quiescent state). Since `rcu_read_lock()` disables preemption, a context switch on a CPU guarantees no reader on that CPU is in an RCU read-side critical section. Once all CPUs have passed through a quiescent state, the old data can safely be freed.
    * The trade-off: writers pay a significant cost (memory allocation for copies, waiting for grace periods), and readers see stale data during the grace period. This makes RCU ideal for read-mostly data (routing tables, configuration, module lists) and terrible for write-heavy data.

    **Follow-up:** What is SRCU (Sleepable RCU) and when would you use it instead of regular RCU?

    **Follow-up Answer:**

    * Regular RCU readers cannot sleep because `rcu_read_lock()` disables preemption, and sleeping while preemption is disabled would cause scheduling problems. SRCU (Sleepable RCU) allows readers to sleep by using a per-CPU counter pair instead of preemption disable. `srcu_read_lock()` increments one counter, `srcu_read_unlock()` increments the other. The writer's `synchronize_srcu()` waits until both counters match, indicating no readers are active. This is more expensive on the read side (two atomic increments instead of one preemption counter) but necessary when the read-side critical section must perform blocking operations like disk I/O or sleeping locks. Use cases include file system operations where you need RCU-like read scalability but the read path might block.
  </Accordion>

  <Accordion title="A kernel module you are writing needs a data structure to map 10 million IP addresses to connection metadata with fast lookup and frequent updates. Compare the options available in the kernel and justify your choice." icon="message">
    **Strong Answer:**

    * The options are: kernel hash table (`hashtable.h`), rhashtable (resizable hash table), rb-tree, and XArray.
    * For 10 million entries with fast lookup and frequent updates, I would use **rhashtable** (resizable hash table with RCU-protected reads). The standard kernel `hashtable.h` has a fixed number of buckets set at compile time. With 10M entries and 1024 buckets, each bucket chain averages 10K entries, making lookup O(10K) -- unacceptable. Rhashtable dynamically resizes its bucket count to maintain short chains (target 0-1 entries per bucket), providing amortized O(1) lookup.
    * Rhashtable also supports RCU-protected reads: lookups do not take any locks, only `rcu_read_lock()`. Insertions and deletions take a per-bucket spinlock, providing fine-grained write concurrency. The resizing operation is done incrementally in the background using RCU, so readers are never blocked during a resize.
    * An rb-tree would give O(log n) lookup (about 23 comparisons for 10M entries), which is slower than a well-sized hash table's O(1). XArray is designed for sparse integer-indexed data (like page cache indices), not arbitrary key types like IP addresses.
    * Implementation: I would define the rhashtable parameters with `struct rhashtable_params` specifying the key offset, key length (4 bytes for IPv4), hash function (`jhash`), and a reasonable initial size. For IPv6 support, the key length increases to 16 bytes but the approach is identical.

    **Follow-up:** What are the memory overhead considerations for 10 million entries, and how does rhashtable handle memory pressure?

    **Follow-up Answer:**

    * Each rhashtable entry requires the user struct plus an `rhash_head` (a single pointer, 8 bytes on 64-bit). The bucket array itself is an array of pointers, which for 10M entries with a load factor of 0.75 would be approximately 13.3M buckets, each 8 bytes = \~106MB just for buckets. During resize, both old and new bucket arrays coexist temporarily, doubling the bucket memory. The user data (connection metadata) dominates memory usage at this scale. Under memory pressure, rhashtable allocation can fail during resize. The `rhashtable_params` structure allows specifying `GFP_NOWAIT` for non-blocking allocation and setting `max_size` to cap bucket growth. If resize allocation fails, the table continues operating with longer chains (degraded but functional).
  </Accordion>
</AccordionGroup>

***

Next: [Process Subsystem Deep Dive →](/courses/linux-internals/process-subsystem)
