Threads & Concurrency

In modern operating systems, a thread is the fundamental unit of CPU utilization. While a process provides the environment (address space, resources), the thread provides the execution. For senior engineers, understanding threads isn’t just about calling pthread_create; it’s about understanding how the kernel manages execution contexts, how hardware registers are swapped, and how different threading models impact system throughput.
Mastery Level: Senior Systems Engineer
Key Internals: TCB structures, FS/GS segment registers, clone() flags, Context switch assembly
Prerequisites: Process Internals, Virtual Memory

1. The Anatomy of a Thread

A thread is often called a “Lightweight Process” (LWP), but this is slightly misleading. In Linux, threads and processes are both represented by the task_struct, but they differ in what they share.

The Thread Control Block (TCB)

The TCB is the kernel data structure that stores everything unique to a thread. While the process has a Process Control Block (PCB), the TCB contains:
| Component | Description |
| --- | --- |
| Thread ID (TID) | Unique identifier within the system (distinct from the PID). |
| Register Set | Saved values of RAX, RBX, RCX, etc., when the thread is not running. |
| Program Counter (PC) | The address of the next instruction to execute. |
| Stack Pointer (SP) | Points to the top of the thread’s private stack. |
| Thread State | Running, Ready, Blocked, etc. |
| Signal Mask | Which signals are currently blocked for this specific thread. |
| Scheduling Priority | Thread-specific priority for the OS scheduler. |
| TLS Pointer | Pointer to the Thread-Local Storage area (often stored via the FS/GS registers). |

Shared vs. Private State

In a multi-threaded process, the following partitioning applies:
Shared (Process-wide):
  • Address Space: Code (Text), Data (Globals), and Heap.
  • File Descriptors: If one thread opens a file, all can read/write it.
  • Signal Handlers: sigaction applies to the whole process.
  • User/Group IDs: Security context is process-wide.
Private (Per-thread):
  • Registers: Each thread has its own execution flow.
  • Stack: Critical for keeping function call history separate.
  • Thread Local Storage (TLS): Variables marked with __thread.
  • Pending Signals: Signals can be directed at specific threads (e.g., pthread_kill).

2. Threading Models: Mapping User to Kernel

How do user-space threads (created by your code) map to kernel-space threads (managed by the OS)? This mapping defines the performance characteristics of your application.

The 1:1 Model (Kernel-Level Threads)

This is the standard model in modern Linux (NPTL) and Windows. Every user thread is backed by exactly one kernel thread.
  • Implementation: In Linux, pthread_create calls the clone() system call with flags like CLONE_VM, CLONE_FS, CLONE_FILES, and CLONE_SIGHAND.
  • Pros:
    • True parallelism (threads can run on different cores).
    • Blocking syscalls (like read()) only block the calling thread.
  • Cons:
    • Thread creation and context switching require kernel traps (expensive).
    • High memory overhead (kernel must maintain a TCB and a kernel stack for every thread).

The N:1 Model (User-Level Threads / Green Threads)

All user threads map to a single kernel thread.
  • Implementation: A user-space library handles switching between threads by manually swapping registers and stacks.
  • Pros:
    • Extremely fast switching (no syscalls).
    • Minimal memory overhead.
  • Cons:
    • No true parallelism (limited to one core).
    • If one thread calls a blocking syscall, the entire process blocks.

The M:N Model (Hybrid / Scheduler Activations)

M user threads are multiplexed onto N kernel threads.
  • Implementation: Used by Go (Goroutines) and Erlang. The runtime manages a pool of kernel threads and schedules user tasks onto them.
  • Pros:
    • Efficient switching + True parallelism.
    • Low memory (Goroutine stacks start at ~2KB).
  • Cons:
    • Scheduler Complexity: The runtime must detect when a kernel thread is about to block and “hand off” the remaining user threads to another kernel thread. The classic kernel-assisted mechanism for this notification is known as Scheduler Activations; Go’s runtime achieves a similar effect in user space.

3. Context Switching: The Assembly Reality

What happens when the CPU stops executing Thread A and starts Thread B? This is the most performance-critical part of an OS.

The Hardware Context

On x86-64, a context switch involves saving the state of the current thread to its TCB and loading the state of the next thread.
; Simplified Context Switch Logic (Pseudo-Assembly)
; switch_to(prev_tcb, next_tcb)

save_context:
    push rbp        ; Save base pointer
    push rbx        ; Save callee-saved registers
    push r12
    push r13
    push r14
    push r15
    mov [rdi + STACK_PTR_OFFSET], rsp ; Save current stack pointer to prev TCB

load_context:
    mov rsp, [rsi + STACK_PTR_OFFSET] ; Load new stack pointer from next TCB
    pop r15         ; Restore registers for the new thread
    pop r14
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret             ; Return to the PC stored on the new thread's stack!

The Return Address Trick

Notice the ret at the end. When a thread is first created, the OS pushes the address of the thread’s entry function onto its new stack. When the first context switch happens to that thread, the ret instruction pops that address and “returns” into the thread’s starting code.

4. Thread Local Storage (TLS) Internals

TLS allows each thread to have its own copy of a global variable. This is vital for thread-safe libraries (e.g., errno in C).

How it works (x86-64)

The CPU uses segment registers (FS or GS) to point to a thread-specific memory block.
  1. The linker aggregates all __thread variables into a special ELF section (.tdata and .tbss).
  2. When a thread is created, the OS allocates a block of memory for these variables.
  3. The FS register is set to point to the base of this block.
  4. To access a TLS variable, the compiler generates code like mov eax, fs:[0x10].

The TCB Pointer

In Linux, the TCB itself is usually stored at the end of the TLS block. This allows the thread to find its own metadata (like its TID) by just reading from the FS register.

5. Linux Threads: NPTL and the clone() System Call

Linux doesn’t distinguish between processes and threads internally; it sees “tasks.”

The clone() Flags

When you create a thread via pthread_create, the following clone() flags are typically used:
| Flag | Meaning |
| --- | --- |
| CLONE_VM | Share the same memory address space. |
| CLONE_FS | Share filesystem information (root, cwd, umask). |
| CLONE_FILES | Share the file descriptor table. |
| CLONE_SIGHAND | Share signal handlers. |
| CLONE_THREAD | Put the new task in the same thread group (same TGID). |
| CLONE_SETTLS | Set the TLS descriptor for the new thread. |

PID vs. TID vs. TGID

  • PID: In the kernel, every task has a unique PID.
  • TGID (Thread Group ID): For threads, this is the PID of the first thread (the “main” thread).
  • User-space view: When you call getpid() in C, it actually returns the TGID. To get the actual kernel-level unique ID, you must call gettid().

6. Pthread Implementation Details

The POSIX Threads (pthreads) library is the standard interface, but its implementation hides significant complexity.

Stack Management

  • Allocation: pthread_create allocates a stack for the new thread using mmap().
  • Guard Pages: To detect stack overflow, pthreads places a “Guard Page” (a page with no read/write permissions) at the end of the stack. If the thread grows its stack into this page, the CPU triggers a Segmentation Fault immediately.
  • Default Size: Usually 2MB to 8MB. For thousands of threads, this can exhaust address space on 32-bit systems, requiring pthread_attr_setstacksize.

Thread Joining (Futexes)

When you call pthread_join(), the calling thread doesn’t busy-wait. It uses a Futex (Fast User-space Mutex).
  1. The joining thread tells the kernel: “Put me to sleep until Thread X exits.”
  2. When Thread X exits, its final kernel cleanup routine performs a futex_wake on the joining thread.

7. High-Performance Concurrency: Fibers & Coroutines

Modern high-scale systems (such as Nginx or Node.js) often avoid one OS thread per task for high-concurrency workloads, opting instead for event loops, Fibers, or Coroutines.

Fibers (Cooperative Threads)

Fibers are user-space execution contexts that use cooperative multitasking.
  • Switching: A fiber must explicitly “yield” control back to the scheduler.
  • Benefit: No race conditions within a single kernel thread (since only one fiber runs at a time).
  • Drawback: A single long-running fiber can “starve” all others.

Coroutines (Stackless vs. Stackful)

  • Stackful (Fibers): Have their own stack. Can yield from deep within nested function calls (e.g., Go).
  • Stackless: Transformed by the compiler into a State Machine. They “remember” where they were but don’t have a real stack (e.g., Python async/await, C++20 Coroutines).

8. Interview Deep Dive: Senior Level

Q: Why is switching between threads of the same process faster than switching between processes?
Switching between threads of the same process is significantly faster because:
  1. Address Space remains the same: The CR3 register (Page Table Base) doesn’t need to be changed.
  2. TLB (Translation Lookaside Buffer) remains valid: No need to flush the cache that maps virtual addresses to physical ones.
  3. Cache Warmth: Shared data (Heap/Code) remains in L1/L2 caches.
In contrast, a process switch requires a CR3 update, which invalidates the TLB (unless PCIDs are used) and results in massive cache misses as the new process accesses entirely different memory regions.
Q: Which thread handles a signal sent to a multi-threaded process?
The kernel follows these rules:
  1. Synchronous Signals (e.g., SIGSEGV, SIGFPE) are delivered to the specific thread that caused the error.
  2. Asynchronous Signals (e.g., SIGINT, SIGTERM) are delivered to the process as a whole. The kernel picks any thread that does not currently have the signal blocked to handle it.
  3. Thread-Directed Signals (pthread_kill) are delivered to the specific target thread.
Senior Tip: In multi-threaded programs, it is common practice to block all signals in all worker threads and have one dedicated thread calling sigwait() to handle them.
Q: What is a “zombie thread”?
Just like a zombie process, a thread that has exited but hasn’t been “joined” by pthread_join() remains in the system. Its TCB and stack are not fully freed because the OS must preserve the thread’s return value for the joiner. If you don’t plan to join a thread, you must create it in a Detached state (pthread_detach) to ensure resources are reclaimed immediately upon exit.

9. Advanced Practice: The Thread Challenge

  1. The Minimal Switch: Write a C program that uses setjmp and longjmp to implement a basic user-space cooperative thread switch.
  2. The Stack Investigator: Write a program that prints the address of a local variable in two different threads. Calculate the distance between their stacks.
  3. The TLS Explorer: Use the __thread keyword and use gdb to inspect the value of the FS segment register (info registers fs_base).

Key Takeaways for Senior Engineers

  • Threads = Shared Address Space: The primary benefit is zero-copy communication.
  • Stacks are the bottleneck: Large default stack sizes limit thread count; small stacks risk overflow.
  • Context Switches aren’t free: Even 1:1 threads incur kernel overhead. For millions of concurrent tasks, use M:N models or stackless coroutines.
  • The Kernel is agnostic: Internally, Linux schedules task_struct objects. It doesn’t care if they are “threads” or “processes” until it checks the CLONE_VM flag.

OS Threads vs. Runtime Concurrency

Understanding how different languages and runtimes map their concurrency primitives to the kernel is critical for performance tuning.

Comparison Matrix

| Runtime | User-Level Abstraction | Kernel Mapping | Stack Size | Blocking Behavior |
| --- | --- | --- | --- | --- |
| pthreads (C) | pthread_t | 1:1 (one kernel thread) | 2–8 MB | Blocks kernel thread |
| Java Threads | Thread | 1:1 | ~1 MB | Blocks kernel thread |
| Go goroutines | go func() | M:N (many goroutines : few OS threads) | ~2 KB (growable) | Runtime parks goroutine, not OS thread |
| Python threading | Thread | 1:1 | Interpreter-managed | GIL limits parallelism |
| Rust async/Tokio | async fn | M:N (tasks : worker threads) | Stackless (state machine) | Runtime schedules tasks |
| Node.js | Callbacks / Promises | 1 main thread + worker pool | N/A (event loop) | I/O offloaded to libuv |

Why M:N and Async Matter

  • 1:1 model: Simple but each thread costs kernel resources and full stack space. Creating 100K threads is impractical.
  • M:N model: User-level scheduler multiplexes cheap “tasks” onto a small pool of kernel threads. Enables millions of concurrent goroutines or async tasks.
  • Stackless coroutines: No dedicated stack; the compiler transforms async functions into state machines. Extremely memory-efficient.

Implications for System Design

  • If your workload is I/O-bound (network servers, databases): Prefer async runtimes or M:N models.
  • If your workload is CPU-bound (number crunching): Use one kernel thread per core; avoid excessive context switches.
  • If you need true parallelism: Only kernel threads (or goroutines backed by kernel threads) can run on multiple cores simultaneously.

Common Threading Pitfalls

Bugs in multi-threaded code are subtle and hard to reproduce. Here are the classic mistakes.

Pitfall 1: Data Races

Two threads access the same memory location without synchronization, and at least one is a write.
// BUGGY: data race on `counter`
int counter = 0;

void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) counter++;
    return NULL;
}
Symptoms: Non-deterministic results, values that don’t add up. Fix: Use a mutex, atomic operations, or thread-local storage.

Pitfall 2: Deadlocks

Two threads each hold a lock and wait for the other’s lock.
// Thread A: lock(m1); lock(m2);
// Thread B: lock(m2); lock(m1);  // Deadlock!
Symptoms: Program hangs; ps shows threads in D or S state forever. Fix: Always acquire locks in a consistent global order. See Deadlocks chapter.
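One simple way to enforce a consistent global order is to always lock the mutex with the lower address first (helper names below are illustrative):

```c
#include <pthread.h>

/* Acquire two mutexes in a canonical order (lower address first),
   so Thread A and Thread B can never hold opposite halves. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    if (a > b) { pthread_mutex_t *tmp = a; a = b; b = tmp; }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    /* Unlock order does not matter for deadlock avoidance */
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}
```

With this helper, both threads from the buggy example above acquire m1 and m2 in the same order no matter which arguments they pass.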

Pitfall 3: False Sharing

Different threads write to different variables that happen to share the same cache line, causing cache-line ping-pong.
// BUGGY: arr[0] and arr[1] likely share a cache line
int arr[2];
// Thread 0 writes arr[0]; Thread 1 writes arr[1]; massive slowdown
Symptoms: Parallel code runs slower than expected despite no logical contention. Fix: Pad data structures to cache-line boundaries (64 bytes).

Pitfall 4: Forgetting to Join or Detach

If you don’t pthread_join or pthread_detach, terminated threads become “zombie threads,” leaking resources.
pthread_create(&t, NULL, worker, NULL);
// ... never join or detach ...
Fix: Always join threads you care about, or create them detached.

Pitfall 5: Signal Handling in Multi-threaded Programs

Signals are delivered to “any” thread that hasn’t blocked them, causing unpredictable behavior. Fix: Block all signals in worker threads; have one dedicated thread call sigwait().
Next: IPC & Signals