Skip to main content

Threads & Concurrency

A thread is the smallest unit of CPU execution — a lightweight process that shares memory with other threads in the same process. Understanding threading is critical for senior engineers building high-performance systems.
Interview Frequency: Very High
Key Topics: Threading models, pthreads, thread safety, thread pools
Time to Master: 10-12 hours

Thread vs Process

Process

  • Heavy — has own address space
  • Expensive to create (fork + COW)
  • Isolated — protected from other processes
  • Crash is contained
  • Communication needs IPC

Thread

  • Light — shares address space
  • Cheap to create
  • Shared memory — easy communication
  • One thread crash can kill all
  • Direct memory sharing (with care)

What Threads Share (and Don’t)

Thread vs Process Memory

Thread Models

User-Level Threads (N:1)

Threads managed entirely in user space by a library: User Threads Model
ProsCons
Very fast thread switch (no kernel)One block blocks all threads
No kernel modifications neededCan’t use multiple CPUs
Portable across OSesNot true parallelism
Examples: Early Java green threads, GNU Pth

Kernel-Level Threads (1:1)

Each user thread maps to one kernel thread: Kernel Threads Model
ProsCons
True parallelism on multiple CPUsSlower thread creation
One block doesn’t affect othersKernel resources per thread
Full kernel scheduling featuresMore context switch overhead
Examples: Linux (NPTL), Windows threads, Modern pthreads

Hybrid (M:N)

M user threads mapped to N kernel threads: Hybrid Threads Model
ProsCons
Best of both worldsComplex implementation
Efficient thread switchingScheduling challenges
True parallelismPriority inversion possible
Examples: Go goroutines, Erlang processes, Solaris (historical)

POSIX Threads (pthreads)

The standard threading API on Unix systems:

Creating and Joining Threads

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *worker(void *arg) {
    int id = *(int *)arg;
    printf("Thread %d: Starting work\n", id);
    
    // Simulate work
    for (int i = 0; i < 1000000; i++);
    
    printf("Thread %d: Finished\n", id);
    
    // Return a value (optional)
    int *result = malloc(sizeof(int));
    *result = id * 100;
    return result;
}

int main() {
    pthread_t threads[4];
    int ids[4];
    
    // Create threads
    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        int err = pthread_create(&threads[i], NULL, worker, &ids[i]);
        if (err) {
            fprintf(stderr, "Error creating thread %d\n", i);
            exit(1);
        }
    }
    
    // Wait for all threads to complete
    for (int i = 0; i < 4; i++) {
        void *retval;
        pthread_join(threads[i], &retval);
        printf("Thread %d returned %d\n", i, *(int *)retval);
        free(retval);
    }
    
    return 0;
}
Compile with: gcc -pthread program.c -o program

Thread Attributes

pthread_attr_t attr;
pthread_attr_init(&attr);

// Set stack size (default is typically 2MB-8MB)
pthread_attr_setstacksize(&attr, 1024 * 1024);  // 1MB

// Set detach state
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

// Create thread with attributes
pthread_create(&thread, &attr, worker, arg);

pthread_attr_destroy(&attr);

Detached Threads

// Option 1: Create as detached
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

// Option 2: Detach after creation
pthread_t thread;
pthread_create(&thread, NULL, worker, NULL);
pthread_detach(thread);  // Can't join after this
Detached vs Joinable:
  • Joinable (default): Resources held until pthread_join() called
  • Detached: Resources freed immediately on thread exit, can’t retrieve return value

Thread-Local Storage (TLS)

Data that’s private to each thread:

Method 1: pthread_key (Traditional)

pthread_key_t key;

void destructor(void *value) {
    free(value);  // Cleanup when thread exits
}

void init() {
    pthread_key_create(&key, destructor);
}

void *worker(void *arg) {
    // Set thread-local value
    int *my_data = malloc(sizeof(int));
    *my_data = pthread_self();  // Each thread has unique value
    pthread_setspecific(key, my_data);
    
    // Later, retrieve it
    int *data = pthread_getspecific(key);
    printf("Thread-local value: %d\n", *data);
    
    return NULL;
}

Method 2: __thread keyword (GCC/Modern)

// Each thread gets its own copy
__thread int thread_id = 0;
__thread int call_count = 0;

void *worker(void *arg) {
    thread_id = *(int *)arg;
    
    for (int i = 0; i < 100; i++) {
        call_count++;
    }
    
    // call_count is 100 for each thread (not shared)
    printf("Thread %d: count = %d\n", thread_id, call_count);
    return NULL;
}

Thread Cancellation

void *worker(void *arg) {
    // Set cancellation type
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);  // At cancellation points
    // pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);  // Immediate
    
    while (1) {
        // pthread_testcancel();  // Explicit cancellation point
        
        // Most blocking calls are implicit cancellation points:
        // read(), write(), sleep(), pthread_cond_wait(), etc.
        
        do_work();
    }
    return NULL;
}

int main() {
    pthread_t thread;
    pthread_create(&thread, NULL, worker, NULL);
    
    sleep(5);
    pthread_cancel(thread);  // Request cancellation
    
    void *result;
    pthread_join(thread, &result);
    
    if (result == PTHREAD_CANCELED) {
        printf("Thread was canceled\n");
    }
    
    return 0;
}
Thread cancellation is dangerous! The thread may be holding locks, have allocated memory, or be in the middle of a transaction. Use with extreme caution.

Thread Pool Pattern

Creating threads is expensive. Thread pools reuse threads:
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>

#define POOL_SIZE 4
#define QUEUE_SIZE 100

typedef struct {
    void (*function)(void *);
    void *arg;
} Task;

typedef struct {
    pthread_t threads[POOL_SIZE];
    Task queue[QUEUE_SIZE];
    int queue_front, queue_rear, queue_count;
    pthread_mutex_t mutex;
    pthread_cond_t not_empty;
    pthread_cond_t not_full;
    bool shutdown;
} ThreadPool;

void *worker_thread(void *arg) {
    ThreadPool *pool = (ThreadPool *)arg;
    
    while (true) {
        pthread_mutex_lock(&pool->mutex);
        
        // Wait for task or shutdown
        while (pool->queue_count == 0 && !pool->shutdown) {
            pthread_cond_wait(&pool->not_empty, &pool->mutex);
        }
        
        if (pool->shutdown && pool->queue_count == 0) {
            pthread_mutex_unlock(&pool->mutex);
            break;
        }
        
        // Dequeue task
        Task task = pool->queue[pool->queue_front];
        pool->queue_front = (pool->queue_front + 1) % QUEUE_SIZE;
        pool->queue_count--;
        
        pthread_cond_signal(&pool->not_full);
        pthread_mutex_unlock(&pool->mutex);
        
        // Execute task
        task.function(task.arg);
    }
    
    return NULL;
}

ThreadPool *pool_create() {
    ThreadPool *pool = malloc(sizeof(ThreadPool));
    pool->queue_front = pool->queue_rear = pool->queue_count = 0;
    pool->shutdown = false;
    
    pthread_mutex_init(&pool->mutex, NULL);
    pthread_cond_init(&pool->not_empty, NULL);
    pthread_cond_init(&pool->not_full, NULL);
    
    for (int i = 0; i < POOL_SIZE; i++) {
        pthread_create(&pool->threads[i], NULL, worker_thread, pool);
    }
    
    return pool;
}

void pool_submit(ThreadPool *pool, void (*fn)(void *), void *arg) {
    pthread_mutex_lock(&pool->mutex);
    
    while (pool->queue_count == QUEUE_SIZE) {
        pthread_cond_wait(&pool->not_full, &pool->mutex);
    }
    
    pool->queue[pool->queue_rear].function = fn;
    pool->queue[pool->queue_rear].arg = arg;
    pool->queue_rear = (pool->queue_rear + 1) % QUEUE_SIZE;
    pool->queue_count++;
    
    pthread_cond_signal(&pool->not_empty);
    pthread_mutex_unlock(&pool->mutex);
}

void pool_shutdown(ThreadPool *pool) {
    pthread_mutex_lock(&pool->mutex);
    pool->shutdown = true;
    pthread_cond_broadcast(&pool->not_empty);
    pthread_mutex_unlock(&pool->mutex);
    
    for (int i = 0; i < POOL_SIZE; i++) {
        pthread_join(pool->threads[i], NULL);
    }
    
    pthread_mutex_destroy(&pool->mutex);
    pthread_cond_destroy(&pool->not_empty);
    pthread_cond_destroy(&pool->not_full);
    free(pool);
}

Thread Pool Sizing

Optimal Thread Pool Size:
  • CPU-bound tasks: N_threads = N_CPUs (or N_CPUs + 1)
  • I/O-bound tasks: N_threads = N_CPUs * (1 + W/S) where W = wait time, S = service time
  • Mixed workloads: Separate pools for CPU and I/O tasks

Modern Concurrency: Coroutines & Green Threads

Go Goroutines (M:N)

package main

import (
    "fmt"
    "runtime"
    "time"
)

func worker(id int, ch chan int) {
    fmt.Printf("Worker %d starting\n", id)
    time.Sleep(time.Second)
    ch <- id * 100
}

func main() {
    // Use all available CPUs
    runtime.GOMAXPROCS(runtime.NumCPU())
    
    ch := make(chan int, 10)
    
    // Create 1000 goroutines (very lightweight)
    for i := 0; i < 1000; i++ {
        go worker(i, ch)  // Only ~2KB stack initially
    }
    
    // Collect results
    for i := 0; i < 1000; i++ {
        fmt.Println(<-ch)
    }
}
Go’s Runtime:
  • Goroutines are multiplexed onto OS threads (M:N model)
  • Initial stack is ~2KB (vs 2MB for pthreads)
  • Stack grows dynamically as needed
  • Runtime handles scheduling, not kernel

Python asyncio (Cooperative)

import asyncio

async def fetch_data(url):
    print(f"Fetching {url}")
    await asyncio.sleep(1)  # Simulate I/O
    return f"Data from {url}"

async def main():
    # Concurrent execution
    tasks = [
        fetch_data("url1"),
        fetch_data("url2"),
        fetch_data("url3"),
    ]
    results = await asyncio.gather(*tasks)
    print(results)

asyncio.run(main())
Python’s GIL: The Global Interpreter Lock means only one thread executes Python bytecode at a time. Use multiprocessing for CPU parallelism, asyncio for I/O concurrency.

Thread Safety

Identifying Thread-Unsafe Code

// UNSAFE: Shared static variable
int count = 0;  // Global

void increment() {
    count++;  // Race condition!
}

// UNSAFE: Static local variable
char *dangerous_function() {
    static char buffer[100];  // Shared between calls
    // ...
    return buffer;  // Caller gets pointer to shared memory
}

// Thread-safe alternative
int safe_increment(int *count, pthread_mutex_t *lock) {
    pthread_mutex_lock(lock);
    int result = ++(*count);
    pthread_mutex_unlock(lock);
    return result;
}

Reentrancy vs Thread Safety

PropertyReentrantThread-Safe
Multiple threadsSafeSafe
Interrupted and re-enteredSafeNot necessarily
Uses global/static stateNoMaybe (with locks)
Calls non-reentrant functionsNoMaybe
Uses locksUsually noOften yes
Reentrant functions: strtok_r, localtime_r, rand_r
Non-reentrant: strtok, localtime, rand

Interview Deep Dive Questions

Answer:
  1. Memory: Threads share address space (code, data, heap), only need private stack and registers
  2. Creation: No page table copy, no memory mapping setup
  3. Context switch: No TLB flush (same address space), cache stays warm for shared data
  4. Communication: Direct memory access vs IPC mechanisms
Quantification:
  • Process creation: ~10,000 CPU cycles
  • Thread creation: ~1,000 CPU cycles
  • Thread stack: 2-8 MB default
  • Process overhead: PCB + page tables + mappings
Answer:Go uses GMP model:
  • G (Goroutine): Lightweight user-space thread
  • M (Machine): OS thread
  • P (Processor): Logical processor context (GOMAXPROCS)
Scheduling:
  1. P holds a run queue of Gs
  2. M takes P to execute Gs
  3. When G blocks on syscall, M releases P for another M
  4. Work stealing: idle P steals Gs from busy P’s queue
Benefits:
  • Thousands of goroutines on few OS threads
  • Low creation cost (~2KB initial stack)
  • Preemptive (since Go 1.14) at function calls
  • Efficient I/O multiplexing with netpoller
Answer:Analysis:
  1. Profile typical request: CPU time (S) and I/O wait time (W)
  2. Determine request characteristics
Formula: N = N_CPU * (1 + W/S)Example:
  • 8 CPUs
  • Average request: 10ms CPU, 90ms I/O (database)
  • N = 8 * (1 + 90/10) = 8 * 10 = 80 threads
Practical considerations:
  • Monitor queue length and response times
  • Use separate pools for CPU-bound vs I/O-bound
  • Consider connection limits (database connections)
  • Add bounded queue to handle bursts
Production pattern: Start with 2-4x CPU count, tune based on metrics
Answer:Problems:
  1. Resource leaks: Thread holding malloc’d memory
  2. Lock abandonment: Thread holding mutex when canceled
  3. Inconsistent state: Canceled mid-transaction
  4. Deadlock: Other threads waiting on canceled thread’s lock
Solutions:
  1. Cleanup handlers:
pthread_cleanup_push(cleanup_fn, resource);
// Critical section
pthread_cleanup_pop(1);  // Execute cleanup
  1. Cancellation points: Define where cancellation can happen
  2. Defer cancellation:
pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);
// Critical section
pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
  1. Better design: Use cooperative shutdown with flag
volatile bool shutdown = false;
while (!shutdown) { do_work(); }
Answer:Detection tools:
  1. ThreadSanitizer (TSan): gcc -fsanitize=thread
  2. Helgrind: Part of Valgrind suite
  3. Intel Inspector: Commercial tool
Manual debugging:
  1. Stress testing: Run with many threads, iterations
  2. Add sleeps: Increase timing windows
  3. Logging: Track thread IDs and operations
  4. Core dumps: Analyze thread states
Prevention:
  1. Minimize shared state
  2. Use immutable data
  3. Lock ordering: Prevent deadlocks
  4. Code review: Focus on shared access
Example TSan output:
WARNING: ThreadSanitizer: data race
  Write of size 4 at 0x7ffc...
    #0 increment() in /code/race.c:10
  Previous read of size 4 at 0x7ffc...
    #0 increment() in /code/race.c:10

Practice Exercises

1

Producer-Consumer

Implement a thread-safe bounded buffer with multiple producers and consumers using pthreads.
2

Thread Pool

Build a thread pool with dynamic sizing based on queue length.
3

Parallel Matrix Multiply

Parallelize matrix multiplication with proper thread count selection.
4

Read-Write Lock

Implement a readers-writer lock from mutexes and condition variables.

Key Takeaways

1:1 is Standard

Modern Linux/Windows use 1:1 model. Goroutines/Erlang use M:N for scale.

Thread Pool Always

Create pools, not individual threads. Size based on workload profile.

TLS for Thread State

Use thread-local storage for per-thread data, not globals.

Cancellation is Dangerous

Prefer cooperative shutdown with flags over pthread_cancel.

Next: CPU Scheduling