Threads & Concurrency
1. The Anatomy of a Thread
The Thread Control Block (TCB)
Shared vs. Private State
2. Threading Models: Mapping User to Kernel
The 1:1 Model (Kernel-Level Threads)
The N:1 Model (User-Level Threads / Green Threads)
The M:N Model (Hybrid / Scheduler Activations)
3. Context Switching: The Assembly Reality
The Hardware Context
The Return Address Trick
4. Thread Local Storage (TLS) Internals
How it works (x86-64)
The TCB Pointer
5. Linux Threads: NPTL and the clone() System Call
The clone() Flags
PID vs. TID vs. TGID
6. Pthread Implementation Details
Stack Management
Thread Joining (Futexes)
7. High-Performance Concurrency: Fibers & Coroutines
Fibers (Cooperative Threads)
Coroutines (Stackless vs. Stackful)
8. Interview Deep Dive: Senior Level
9. Advanced Practice: The Thread Challenge
Key Takeaways for Senior Engineers
OS Threads vs. Runtime Concurrency
Comparison Matrix
Why M:N and Async Matter
Implications for System Design
Common Threading Pitfalls
Pitfall 1: Data Races
Pitfall 2: Deadlocks
Pitfall 3: False Sharing
Pitfall 4: Forgetting to Join or Detach
Pitfall 5: Signal Handling in Multi-threaded Programs

Threads & Concurrency

In modern operating systems, a thread is the fundamental unit of CPU utilization. While a process provides the environment (address space, resources), the thread provides the execution. For senior engineers, understanding threads isn’t just about calling pthread_create; it’s about understanding how the kernel manages execution contexts, how hardware registers are swapped, and how different threading models impact system throughput.

Mastery Level: Senior Systems Engineer
Key Internals: TCB structures, FS/GS segment registers, clone() flags, Context switch assembly
Prerequisites: Process Internals, Virtual Memory

1. The Anatomy of a Thread

A thread is often called a “Lightweight Process” (LWP), but this is slightly misleading. In Linux, threads and processes are both represented by the task_struct, but they differ in what they share.

The Thread Control Block (TCB)

The TCB is the kernel data structure that stores everything unique to a thread. While the process has a Process Control Block (PCB), the TCB contains:

Component	Description
Thread ID (TID)	Unique identifier within the system (distinct from PID).
Register Set	Saved values of RAX, RBX, RCX, etc., when the thread is not running.
Program Counter (PC)	The address of the next instruction to execute.
Stack Pointer (SP)	Points to the top of the thread’s private stack.
Thread State	Running, Ready, Blocked, etc.
Signal Mask	Which signals are currently blocked for this specific thread.
Scheduling Priority	Thread-specific priority for the OS scheduler.
TLS Pointer	Pointer to the Thread-Local Storage area (often stored in the FS/GS registers).

Shared vs. Private State

In a multi-threaded process, the following partitioning applies:

Shared (Process-wide):

Address Space: Code (Text), Data (Globals), and Heap.
File Descriptors: If one thread opens a file, all can read/write it.
Signal Handlers: sigaction applies to the whole process.
User/Group IDs: Security context is process-wide.

Private (Per-thread):

Registers: Each thread has its own execution flow.
Stack: Critical for keeping function call history separate.
Thread Local Storage (TLS): Variables marked with __thread.
Pending Signals: Signals can be directed at specific threads (e.g., pthread_kill).

2. Threading Models: Mapping User to Kernel

How do user-space threads (created by your code) map to kernel-space threads (managed by the OS)? This mapping defines the performance characteristics of your application.

The 1:1 Model (Kernel-Level Threads)

This is the standard model in modern Linux (NPTL) and Windows. Every user thread is exactly one kernel thread.

Implementation: In Linux, pthread_create calls the clone() system call with flags like CLONE_VM, CLONE_FS, CLONE_FILES, and CLONE_SIGHAND.
Pros:
- True parallelism (threads can run on different cores).
- Blocking syscalls (like read()) only block the calling thread.
Cons:
- Thread creation and context switching require kernel traps (expensive).
- High memory overhead (kernel must maintain a TCB and a kernel stack for every thread).

The N:1 Model (User-Level Threads / Green Threads)

All user threads map to a single kernel thread.

Implementation: A user-space library handles switching between threads by manually swapping registers and stacks.
Pros:
- Extremely fast switching (no syscalls).
- Minimal memory overhead.
Cons:
- No true parallelism (limited to one core).
- If one thread calls a blocking syscall, the entire process blocks.

The M:N Model (Hybrid / Scheduler Activations)

M user threads are multiplexed onto N kernel threads.

Implementation: Used by Go (Goroutines) and Erlang. The runtime manages a pool of kernel threads and schedules user tasks onto them.
Pros:
- Efficient switching + True parallelism.
- Low memory (Goroutine stacks start at ~2KB).
Cons:
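- Scheduler Complexity: The runtime must detect when a kernel thread is about to block and “hand off” the remaining user threads to another kernel thread. Scheduler Activations are one kernel-assisted answer to this problem: the kernel notifies the user-level scheduler through upcalls when a thread blocks. Runtimes such as Go instead perform the hand-off inside the runtime itself.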

3. Context Switching: The Assembly Reality

What happens when the CPU stops executing Thread A and starts Thread B? This is one of the most performance-critical paths in an OS.

The Hardware Context

On x86-64, a context switch involves saving the state of the current thread to its TCB and loading the state of the next thread.

; Simplified Context Switch Logic (Pseudo-Assembly)
; switch_to(prev_tcb, next_tcb)

save_context:
    push rbp        ; Save base pointer
    push rbx        ; Save callee-saved registers
    push r12
    push r13
    push r14
    push r15
    mov [rdi + STACK_PTR_OFFSET], rsp ; Save current stack pointer to prev TCB

load_context:
    mov rsp, [rsi + STACK_PTR_OFFSET] ; Load new stack pointer from next TCB
    pop r15         ; Restore registers for the new thread
    pop r14
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret             ; Return to the PC stored on the new thread's stack!

The Return Address Trick

Notice the ret at the end. When a thread is first created, the OS pushes the address of the thread’s entry function onto its new stack. When the first context switch happens to that thread, the ret instruction pops that address and “returns” into the thread’s starting code.

4. Thread Local Storage (TLS) Internals

TLS allows each thread to have its own copy of a global variable. This is vital for thread-safe libraries (e.g., errno in C).

How it works (x86-64)

The CPU uses segment registers (FS or GS) to point to a thread-specific memory block.

The linker aggregates all __thread variables into a special ELF section (.tdata and .tbss).
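When a thread is created, the threading library (glibc's NPTL on Linux) allocates a block of memory for these variables and initializes it from .tdata.
The FS base is set to point at this thread's block (on Linux, through CLONE_SETTLS at thread creation).
To access a TLS variable, the compiler generates an FS-relative load such as mov eax, fs:[-4]. On x86-64, the variables of the main executable sit at negative offsets from the FS base.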

The TCB Pointer

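In Linux, the user-space thread descriptor (glibc's struct pthread, which is distinct from the kernel's task_struct) is usually stored at the end of the TLS block, at the address the FS base points to. This allows the thread to find its own metadata (like its TID) with a single FS-relative load.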

5. Linux Threads: NPTL and the `clone()` System Call

Linux doesn’t distinguish between processes and threads internally; it sees “tasks.”

The `clone()` Flags

When you create a thread via pthread_create, the following clone() flags are typically used:

Flag	Meaning
`CLONE_VM`	Share the same memory address space.
`CLONE_FS`	Share filesystem information (root, cwd, umask).
`CLONE_FILES`	Share the file descriptor table.
`CLONE_SIGHAND`	Share signal handlers.
`CLONE_THREAD`	Put the new task in the same thread group (same TGID).
`CLONE_SETTLS`	Set the TLS descriptor for the new thread.

PID vs. TID vs. TGID

PID: In the kernel, every task has a unique PID.
TGID (Thread Group ID): For threads, this is the PID of the first thread (the “main” thread).
User-space view: When you call getpid() in C, it actually returns the TGID. To get the actual kernel-level unique ID, you must call gettid().

6. Pthread Implementation Details

The POSIX Threads (pthreads) library is the standard interface, but its implementation hides significant complexity.

Stack Management

Allocation: pthread_create allocates a stack for the new thread using mmap().
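Guard Pages: To detect stack overflow, pthreads places a “Guard Page” (a page mapped with no access permissions, PROT_NONE) at the low end of the stack, since the stack grows downward. If the thread grows its stack into this page, the access raises a page fault and the kernel delivers SIGSEGV immediately, instead of letting the thread silently corrupt adjacent memory.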
Default Size: Usually 2MB to 8MB. For thousands of threads, this can exhaust address space on 32-bit systems, requiring pthread_attr_setstacksize.

Thread Joining (Futexes)

When you call pthread_join(), the calling thread doesn’t busy-wait. It uses a Futex (Fast User-space Mutex).

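The joining thread tells the kernel: “Put me to sleep until Thread X exits.” Concretely, it waits on a futex word that holds Thread X's TID.
When Thread X exits, the kernel clears that TID word (registered through CLONE_CHILD_CLEARTID) and performs a futex wake on it, which wakes the joining thread.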

7. High-Performance Concurrency: Fibers & Coroutines

Modern high-scale systems often avoid dedicating an OS thread to each concurrent task: Nginx and Node.js use event loops, while other runtimes rely on Fibers or Coroutines.

Fibers (Cooperative Threads)

Fibers are user-space execution contexts that use cooperative multitasking.

Switching: A fiber must explicitly “yield” control back to the scheduler.
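Benefit: No preemption within a single kernel thread. Only one fiber runs at a time and it switches only at explicit yield points, so data races between fibers on that thread are avoided. State that spans a yield can still be changed by another fiber.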
Drawback: A single long-running fiber can “starve” all others.

Coroutines (Stackless vs. Stackful)

Stackful (Fibers): Have their own stack. Can yield from deep within nested function calls (e.g., Go).
Stackless: Transformed by the compiler into a State Machine. They “remember” where they were but don’t have a real stack (e.g., Python async/await, C++20 Coroutines).

8. Interview Deep Dive: Senior Level

Explain the performance impact of a context switch between threads of the same process vs. different processes.

Switching between threads of the same process is significantly faster because:

Address Space remains the same: The CR3 register (Page Table Base) doesn’t need to be changed.
TLB (Translation Lookaside Buffer) remains valid: No need to flush the cache that maps virtual addresses to physical ones.
Cache Warmth: Shared data (Heap/Code) remains in L1/L2 caches.

In contrast, a process switch requires a CR3 update, which invalidates the TLB (unless PCIDs are used) and results in massive cache misses as the new process accesses entirely different memory regions.

How does the kernel handle signals in a multi-threaded process?

The kernel follows these rules:

Synchronous Signals (e.g., SIGSEGV, SIGFPE) are delivered to the specific thread that caused the error.
Asynchronous Signals (e.g., SIGINT, SIGTERM) are delivered to the process as a whole. The kernel picks any thread that does not currently have the signal blocked to handle it.
Thread-Directed Signals (pthread_kill) are delivered to the specific target thread.

Senior Tip: In multi-threaded programs, it is common practice to block all signals in all worker threads and have one dedicated thread calling sigwait() to handle them.

What is a 'Zombie Thread'?

Just like a zombie process, a thread that has exited but hasn’t been “joined” by pthread_join() remains in the system. Its TCB and stack are not fully freed because the OS must preserve the thread’s return value for the joiner. If you don’t plan to join a thread, you must either create it in a Detached state (pthread_attr_setdetachstate with PTHREAD_CREATE_DETACHED) or detach it afterwards (pthread_detach) to ensure resources are reclaimed immediately upon exit.

9. Advanced Practice: The Thread Challenge

The Minimal Switch: Write a C program that uses setjmp and longjmp to implement a basic user-space cooperative thread switch.
The Stack Investigator: Write a program that prints the address of a local variable in two different threads. Calculate the distance between their stacks.
The TLS Explorer: Use the __thread keyword and use gdb to inspect the value of the FS segment register (info registers fs_base).

Key Takeaways for Senior Engineers

Threads = Shared Address Space: The primary benefit is zero-copy communication.
Stacks are the bottleneck: Large default stack sizes limit thread count; small stacks risk overflow.
Context Switches aren’t free: Every switch between 1:1 threads goes through the kernel. For millions of concurrent tasks, use M:N models or stackless coroutines.
The Kernel is agnostic: Internally, Linux schedules task_struct objects. Whether a task behaves like a “thread” or a “process” depends only on which resources it shares with other tasks, as chosen by clone() flags such as CLONE_VM and CLONE_THREAD.

OS Threads vs. Runtime Concurrency

Understanding how different languages and runtimes map their concurrency primitives to the kernel is critical for performance tuning.

Comparison Matrix

Runtime	User-Level Abstraction	Kernel Mapping	Stack Size	Blocking Behavior
pthreads (C)	`pthread_t`	1:1 (one kernel thread)	2–8 MB	Blocks kernel thread
Java Threads	`Thread`	1:1	~1 MB	Blocks kernel thread
Go goroutines	`go func()`	M:N (many goroutines : few OS threads)	~2 KB (growable)	Runtime parks goroutine, not OS thread
Python threading	`Thread`	1:1	Interpreter-managed	GIL limits parallelism
Rust async/Tokio	`async fn`	M:N (tasks : worker threads)	Stackless (state machine)	Runtime schedules tasks
Node.js	Callbacks / Promises	1 main thread + worker pool	N/A (event loop)	I/O offloaded to libuv

Why M:N and Async Matter

1:1 model: Simple but each thread costs kernel resources and full stack space. Creating 100K threads is impractical.
M:N model: User-level scheduler multiplexes cheap “tasks” onto a small pool of kernel threads. Enables millions of concurrent goroutines or async tasks.
Stackless coroutines: No dedicated stack; the compiler transforms async functions into state machines. Extremely memory-efficient.

Implications for System Design

If your workload is I/O-bound (network servers, databases): Prefer async runtimes or M:N models.
If your workload is CPU-bound (number crunching): Use one kernel thread per core; avoid excessive context switches.
If you need true parallelism: Only kernel threads (or goroutines backed by kernel threads) can run on multiple cores simultaneously.

Common Threading Pitfalls

Bugs in multi-threaded code are subtle and hard to reproduce. Here are the classic mistakes.

Pitfall 1: Data Races

Two threads access the same memory location without synchronization, and at least one is a write.

// BUGGY: data race on `counter`
int counter = 0;

void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) counter++;
    return NULL;
}

Symptoms: Non-deterministic results, values that don’t add up. Fix: Use a mutex, atomic operations, or thread-local storage.

Pitfall 2: Deadlocks

Two threads each hold a lock and wait for the other’s lock.

// Thread A: lock(m1); lock(m2);
// Thread B: lock(m2); lock(m1);  // Deadlock!

Symptoms: Program hangs; ps shows the threads stuck in S (sleeping) state forever. Fix: Always acquire locks in a consistent global order. See Deadlocks chapter.

Pitfall 3: False Sharing

Different threads write to different variables that happen to share the same cache line, causing cache-line ping-pong.

// BUGGY: arr[0] and arr[1] likely share a cache line
int arr[2];
// Thread 0 writes arr[0]; Thread 1 writes arr[1]; massive slowdown

Symptoms: Parallel code runs slower than expected despite no logical contention. Fix: Pad data structures to cache-line boundaries (64 bytes).

Pitfall 4: Forgetting to Join or Detach

If you don’t pthread_join or pthread_detach, terminated threads become “zombie threads,” leaking resources.

pthread_t t;
pthread_create(&t, NULL, worker, NULL);
// BUGGY: never joined or detached, so the thread's resources leak after it exits

Fix: Always join threads you care about, or create them detached.

Pitfall 5: Signal Handling in Multi-threaded Programs

Signals are delivered to “any” thread that hasn’t blocked them, causing unpredictable behavior. Fix: Block all signals in worker threads; have one dedicated thread call sigwait().
