Operating Systems Case Studies

Learn from real-world examples of OS concepts applied in production systems. These case studies demonstrate how theory meets practice.

Purpose: Connect theory to real systems
Target: Senior engineers preparing for system design
Approach: Analysis of actual production incidents and design decisions

Case Study 1: Chrome’s Multi-Process Architecture

Background

Chrome runs each tab (more precisely, each site instance) in its own renderer process. Why?

Problem

Before (single-process browsers):

One tab crash = entire browser crash
Malicious site can access other tabs’ data
Memory leaks accumulate
Limited parallelism across cores (one busy page can stall every tab)

Solution

┌─────────────────────────────────────────────────────────────────┐
│                    CHROME ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │              Browser Process (privileged)               │   │
│   │  • UI, network, storage, disk access                    │   │
│   │  • Manages all other processes                          │   │
│   │  • Single instance                                       │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │ IPC (Mojo)                       │
│           ┌───────────────────┼───────────────────┐              │
│           │                   │                   │              │
│           ▼                   ▼                   ▼              │
│   ┌───────────────┐   ┌───────────────┐   ┌───────────────┐     │
│   │   Renderer    │   │   Renderer    │   │   Renderer    │     │
│   │   (Tab 1)     │   │   (Tab 2)     │   │   (Tab 3)     │     │
│   │               │   │               │   │               │     │
│   │ • Sandboxed   │   │ • Sandboxed   │   │ • Sandboxed   │     │
│   │ • No disk     │   │ • No disk     │   │ • No disk     │     │
│   │ • No network  │   │ • No network  │   │ • No network  │     │
│   │   (directly)  │   │   (directly)  │   │   (directly)  │     │
│   └───────────────┘   └───────────────┘   └───────────────┘     │
│                                                                  │
│   Additional Processes:                                          │
│   • GPU Process: Hardware acceleration                          │
│   • Plugin Processes: Flash, etc. (sandboxed)                   │
│   • Utility Processes: Audio, network service                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

OS Concepts Applied

Process Isolation: Each renderer is a separate process
- Own address space
- Crash doesn’t affect others
- Memory limits per tab
Sandboxing: Renderers have minimal privileges
- Seccomp filters: ~70 allowed syscalls (out of 300+)
- No file system access
- No network access (must ask browser process)
- Namespaces for isolation
IPC: Mojo framework
- Message passing between processes
- Shared memory for large data (bitmaps)
- File descriptor passing

Tradeoffs

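| Aspect | Multi-Process | Single-Process |
|--------|---------------|----------------|
| Memory | Higher (per-process heaps and runtime state) | Lower |
| CPU | Context switch and IPC overhead | Lower (no cross-process IPC) |
| Security | Excellent | Poor |
| Stability | Tab crash isolated | Browser crash |
| Complexity | High | Low |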

Lesson

Security and stability often outweigh memory/CPU costs for user-facing applications.

Case Study 2: Mars Pathfinder Priority Inversion

Background

July 4, 1997: Mars Pathfinder landed on Mars. Days later, the lander began resetting itself at unpredictable intervals.

Problem

┌─────────────────────────────────────────────────────────────────┐
│                    PATHFINDER TASKS                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   High Priority: bc_dist                                        │
│   - Bus distribution task                                        │
│   - Must run frequently                                          │
│   - Uses shared bus via mutex                                   │
│                                                                  │
│   Medium Priority: Various tasks                                │
│   - Image processing                                             │
│   - Data logging                                                 │
│                                                                  │
│   Low Priority: Meteorological data collection                  │
│   - Takes bus mutex for long time                               │
│   - Reads sensors                                                │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Timeline of bug:
┌────────────────────────────────────────────────────────────────┐
│                                                                 │
│  Time   Action                                                  │
│  ─────  ───────────────────────────────────────────────────    │
│  T+0    Low priority (L) acquires bus mutex                    │
│  T+1    High priority (H) wakes up, needs mutex, BLOCKS        │
│  T+2    Medium priority (M) wakes up, preempts L               │
│  T+3    M runs... and runs... and runs...                      │
│  T+4    H is still waiting (for L, which can't run)            │
│  T+5    Watchdog timer fires → SYSTEM RESET!                   │
│                                                                 │
│  Problem: H is waiting for L, but M (lower than H) runs        │
│  This is PRIORITY INVERSION                                     │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

Solution

Priority Inheritance Protocol:

When H blocks on mutex held by L
L temporarily inherits H’s priority
L runs (not preempted by M)
L releases mutex
H runs
L returns to original priority
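
The protocol above can be checked with a small discrete-time simulation. This is a sketch under stated assumptions: a single CPU, fixed-priority preemptive scheduling, one mutex, and made-up task lengths and release times. The `Task` class and `simulate` function are illustrative and have nothing to do with the actual flight software.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare tasks by identity, not by field values
class Task:
    name: str
    priority: int        # larger number = higher priority
    release: int         # tick at which the task becomes ready
    work: int            # CPU ticks needed in total
    lock_ticks: int = 0  # the first lock_ticks of work run inside the mutex
    done: int = 0
    finished_at: int = -1

    def needs_mutex(self) -> bool:
        return self.done < self.lock_ticks


def simulate(inherit: bool, max_ticks: int = 100) -> dict[str, int]:
    """Return the tick at which each task finishes."""
    tasks = [
        Task("L", priority=1, release=0, work=4, lock_ticks=3),
        Task("H", priority=3, release=1, work=2, lock_ticks=1),
        Task("M", priority=2, release=2, work=10),
    ]
    owner = None  # task currently holding the mutex
    for tick in range(max_ticks):
        ready = [t for t in tasks if t.release <= tick and t.finished_at < 0]
        if not ready:
            if all(t.finished_at >= 0 for t in tasks):
                break
            continue
        # A task is blocked if it needs the mutex and someone else holds it.
        blocked = [t for t in ready
                   if t.needs_mutex() and owner is not None and owner is not t]

        def effective_priority(t: Task) -> int:
            # With inheritance, the owner runs at the priority of its
            # highest-priority waiter.
            if inherit and t is owner and blocked:
                return max(t.priority, max(b.priority for b in blocked))
            return t.priority

        runnable = [t for t in ready if t not in blocked]
        current = max(runnable, key=effective_priority)
        if current.needs_mutex():
            owner = current          # acquire (or keep holding) the mutex
        current.done += 1
        if owner is current and not current.needs_mutex():
            owner = None             # critical section finished: release
        if current.done == current.work:
            current.finished_at = tick + 1
    return {t.name: t.finished_at for t in tasks}


if __name__ == "__main__":
    print("without inheritance:", simulate(inherit=False))
    print("with inheritance:   ", simulate(inherit=True))
```

With these made-up numbers, H finishes at tick 15 without inheritance (it waits out all of M's work) and at tick 5 with it. In a real system the first case is where a watchdog would fire.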

Implementation

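// VxWorks RTOS (used on Pathfinder)
// The fix was a configuration flag that was OFF by default!
#include <semLib.h>

// Enable priority inheritance on the mutex (busMutex is an illustrative name).
// SEM_INVERSION_SAFE is only valid together with SEM_Q_PRIORITY.
SEM_ID busMutex = semMCreate(SEM_Q_PRIORITY | SEM_INVERSION_SAFE);
//                                            ^^^^^^^^^^^^^^^^^^
//                                            This was missing!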

Remote Debug

The amazing part: NASA debugged this from 119 million miles away:

Analyzed telemetry showing reset patterns
Reproduced on ground hardware
Identified priority inversion via traces
Uploaded patch to enable priority inheritance
Problem solved!

Lesson

Test real-time constraints under load
Enable safety features even if they have overhead
Instrument everything for post-mortem analysis
Design for remote debugging

Case Study 3: Cloudflare Outage (Regex Backtracking)

Background

July 2, 2019: Cloudflare experienced a 27-minute global outage.

Problem

A regex in their Web Application Firewall (WAF) caused catastrophic backtracking:

(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

When this regex encountered certain input:

CPU usage spiked to 100%
Worker processes became unresponsive
Edge servers stopped responding
Global outage

Why It Happened

┌─────────────────────────────────────────────────────────────────┐
│                    REGEX BACKTRACKING                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Regex: .*.*=.*                                                │
│   Input: "xxxxxxxxxxxxxxxxxxxxxxxxxx"                           │
│                                                                  │
│   First .* matches all of "xxx..."                              │
│   No "=" follows, so the match fails and backtracks             │
│   First .* gives back one char, second .* retries the rest      │
│   Every split of the two .* is tried: about n^2/2 attempts      │
│                                                                  │
│   Complexity: O(n^2) for n characters (quadratic)               │
│                                                                  │
│   n=1,000:    ~5 x 10^5 attempts                                │
│   n=10,000:   ~5 x 10^7 attempts                                │
│   n=100,000:  ~5 x 10^9 attempts per request → CPU pegged       │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
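
For the simplified pattern `.*.*=.*`, the amount of backtracking can be counted directly. The sketch below is a hand-written model of a naive backtracking matcher, not PCRE itself: it assumes the match is attempted only from position 0 and ignores the optimizations real engines apply. The function name `count_attempts` is hypothetical.

```python
def count_attempts(text: str) -> int:
    """Count how often a naive backtracking matcher tests for '=' while
    matching the pattern .*.*=.* against text, starting at position 0."""
    n = len(text)
    attempts = 0
    # The first .* greedily takes i characters, then gives them back one by one.
    for i in range(n, -1, -1):
        # The second .* greedily extends to position j, then backs off too.
        for j in range(n, i - 1, -1):
            attempts += 1
            if j < n and text[j] == "=":
                return attempts  # the trailing .* always matches the rest
    return attempts  # every split was tried and none found an '='


if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        print(n, count_attempts("x" * n))
```

For an input of n characters with no `=`, the count is (n+1)(n+2)/2, so doubling the input length roughly quadruples the work. Multiplied across every request passing through the WAF, that super-linear growth was enough to exhaust CPU.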

OS/Systems Lessons

No timeout on regex execution
- Process ran indefinitely
- Should have CPU time limits
Insufficient isolation
- Bad regex affected all traffic
- Should have per-request resource limits
Cascading failures
- Retry storms made it worse
- Should have circuit breakers

Fixed By

Immediate: Reverted the WAF rule
Short-term: Added regex timeout (Lua)
Long-term:
- Moved to a regex engine with run-time guarantees (re2 or Rust's regex)
- Added automated regex complexity analysis
- Staged rollouts with monitoring

Implementation

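-- Before: No protection
local match = ngx.re.match(input, pattern)

-- After: cap PCRE backtracking with a match limit.
-- ngx.re.match takes no per-call limit; it is set once in nginx.conf:
--   lua_regex_match_limit 1000;
local match, err = ngx.re.match(input, pattern, "jo")
if err then
    -- limit exceeded (or another PCRE error): log it and treat as no match
    ngx.log(ngx.ERR, "regex failed: ", err)
end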

Lesson

Always bound CPU time for untrusted input processing. Use:

cgroups for CPU limits
Timeouts on operations
Algorithms with guaranteed complexity
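
As a minimal sketch of the "timeouts on operations" item, the snippet below runs untrusted work in a child process and kills it when a time budget expires. It assumes a wall-clock budget is acceptable; `run_bounded` is a hypothetical helper name. Kernel-enforced CPU limits (for example `RLIMIT_CPU` or a cgroup `cpu.max` setting) are stricter alternatives that this sketch does not show.

```python
import subprocess
import sys


def run_bounded(code: str, seconds: float) -> str:
    """Run code in a child interpreter; kill it if it exceeds the time budget."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=seconds,  # subprocess.run kills the child on expiry
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "error"


if __name__ == "__main__":
    print(run_bounded("print(2 + 2)", 5.0))
    print(run_bounded("while True: pass", 1.0))
```

The runaway loop costs the caller at most the budget it chose, and the parent stays responsive because the work never ran in its own process.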

Case Study 4: Linux Kernel OOM Killer

Background

When Linux runs out of memory, the OOM (Out of Memory) Killer terminates processes to free memory.

Problem Scenario

┌─────────────────────────────────────────────────────────────────┐
│                    OOM SITUATION                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Memory Usage:                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │████████████████████████████████████████████████████████│   │
│   │            Used: 15.8 GB / 16 GB                       │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│   Swap:                                                          │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │████████████████████████████████████████████████████████│   │
│   │            Used: 4 GB / 4 GB (FULL!)                   │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│   New allocation request comes in...                            │
│   No memory available!                                           │
│                                                                  │
│   Options:                                                       │
│   1. Fail the allocation → Process crashes anyway               │
│   2. Kill a process to free memory → OOM Killer!               │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

OOM Killer Algorithm

// Simplified scoring algorithm
// Higher score = more likely to be killed

// memory_usage = RSS + swap + page tables of the process
oom_score = (memory_usage * 1000) / total_memory;  // multiply first: integer math
oom_score += oom_score_adj;                        // user adjustment, -1000 to +1000

// Notes:
// - Process age and runtime are not part of the current heuristic
// - Root processes used to get a small discount; newer kernels dropped it
// - oom_score_adj of -1000 = never selected
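
The simplified scoring above can be turned into a small victim-selection sketch. This follows the simplified formula in this section, not the kernel's actual `oom_badness()` code; the function names and the sample process table are hypothetical.

```python
from typing import Optional


def oom_score(used_kb: int, total_kb: int, oom_score_adj: int = 0) -> Optional[int]:
    """Simplified badness score; None means the process is exempt."""
    if oom_score_adj == -1000:
        return None
    return max(0, used_kb * 1000 // total_kb + oom_score_adj)


def pick_victim(procs: dict[str, tuple[int, int]], total_kb: int) -> Optional[str]:
    """procs maps name -> (used_kb, oom_score_adj); return the highest scorer."""
    scored = {}
    for name, (used_kb, adj) in procs.items():
        score = oom_score(used_kb, total_kb, adj)
        if score is not None:
            scored[name] = score
    return max(scored, key=scored.get) if scored else None


if __name__ == "__main__":
    total_kb = 16 * 1024 * 1024  # a 16 GB machine
    procs = {
        "postgres": (6_891_234, 0),
        "nginx": (204_800, 0),
        "sshd": (8_192, -1000),   # exempt
    }
    print(pick_victim(procs, total_kb))
```

Note the side effect of exempting the largest process: the killer then picks the next-largest one, which may free far less memory.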

Real Incident

Scenario: Production database server running out of memory.

# dmesg output:
[10854.231] Out of memory: Killed process 8234 (postgres) 
            total-vm:7234512kB, anon-rss:6891234kB, file-rss:1234kB

# What happened:
# 1. A runaway query consumed excessive memory
# 2. System couldn't allocate for other processes
# 3. OOM killer chose postgres (highest memory user)
# 4. Database terminated, service outage

Prevention Strategies

# 1. Make critical processes immune
#    (pgrep -o picks the oldest match, the postmaster; plain pgrep returns many PIDs)
echo -1000 > /proc/$(pgrep -o postgres)/oom_score_adj

# 2. Limit memory at cgroup level (cgroup v2 path;
#    v1 uses /sys/fs/cgroup/memory/postgres/memory.limit_in_bytes)
echo 8G > /sys/fs/cgroup/postgres/memory.max

# 3. Disable overcommit (strict mode)
echo 2 > /proc/sys/vm/overcommit_memory
echo 80 > /proc/sys/vm/overcommit_ratio

# 4. Add swap (buys time)
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# 5. Monitor and alert before OOM
# Set up alerts at 80% memory usage

Better Approach

┌─────────────────────────────────────────────────────────────────┐
│                    PROPER MEMORY MANAGEMENT                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   1. Application-level limits                                    │
│      • PostgreSQL: shared_buffers, work_mem limits              │
│      • JVM: -Xmx heap limit                                     │
│      • Go: GOMEMLIMIT                                           │
│                                                                  │
│   2. Container/cgroup limits                                     │
│      • Kubernetes: resources.limits.memory                      │
│      • Docker: --memory flag                                    │
│                                                                  │
│   3. Systemd service limits                                      │
│      • MemoryMax=8G in unit file                               │
│                                                                  │
│   4. Graceful degradation                                       │
│      • Reject new connections at 80%                            │
│      • Drop caches at 90%                                       │
│      • Circuit breaker at 95%                                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
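
The graceful-degradation thresholds in item 4 can be sketched as a small policy function. This is an illustration under assumptions: memory pressure is estimated from the `MemTotal` and `MemAvailable` fields of `/proc/meminfo`, and the function names and action strings are hypothetical.

```python
def memory_used_fraction(meminfo_text: str) -> float:
    """Compute used memory as a fraction from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are in kB
    return 1.0 - fields["MemAvailable"] / fields["MemTotal"]


def degradation_action(used_fraction: float) -> str:
    """Map memory pressure to the action from the thresholds above."""
    if used_fraction >= 0.95:
        return "open circuit breaker"
    if used_fraction >= 0.90:
        return "drop caches"
    if used_fraction >= 0.80:
        return "reject new connections"
    return "normal"


if __name__ == "__main__":
    sample = (
        "MemTotal:       16000000 kB\n"
        "MemFree:          500000 kB\n"
        "MemAvailable:    2400000 kB\n"
    )
    print(degradation_action(memory_used_fraction(sample)))
```

On a live Linux host the text would come from reading `/proc/meminfo`; the sample string keeps the sketch runnable anywhere.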

### Lesson

**Don't rely on the OOM Killer** — it's a last resort. Instead:
- Set appropriate memory limits
- Monitor and alert
- Design for graceful degradation

---

## Case Study 5: End-to-End Web Request (Linux Server)

### Background

A typical web request on a Linux server exercises **almost every OS subsystem** covered in this course. Understanding the end-to-end path helps cement the relationships between chapters.

### Request Lifecycle

1. **Packet arrival**:
   - NIC receives an Ethernet frame with an IP/TCP packet.
   - DMA transfers the frame into RAM; NIC triggers an interrupt.
2. **Interrupt handling and networking stack**:
   - Interrupt handler schedules NAPI; packets are pulled from the device ring.
   - The kernel’s network stack parses Ethernet/IP/TCP headers, validates checksums.
   - Payload is placed into the appropriate socket’s receive queue.
3. **Scheduler and application wake-up**:
   - A worker thread blocked in `epoll_wait` / `io_uring_enter` / `read` is woken.
   - The **scheduler** chooses a CPU, considering runnable threads and affinities.
4. **System call and process context**:
   - The thread issues `read()` or similar; control transitions to the kernel.
   - Data is copied from kernel buffers into user-space memory.
5. **Application processing**:
   - User-space parses HTTP, runs business logic, maybe hits a database.
   - This triggers further syscalls: `connect`, `send`, `read`, file I/O, etc.
6. **Response send**:
   - Application writes the HTTP response (possibly via `sendfile` or `writev`).
   - Kernel queues data into the socket’s send buffer; TCP handles retransmissions and congestion control.
7. **Scheduling, I/O, and completion**:
   - The scheduler multiplexes the CPU among many connections.
   - The storage stack and file systems serve static assets from disk or page cache.

### OS Concepts Applied

- **Networking**: NIC, DMA rings, NAPI, TCP/IP stack, socket buffers.
- **Scheduling**: CFS/EEVDF deciding which request handler runs.
- **Virtual Memory**: Page cache for static assets; working set of application code and data.
- **File Systems & I/O**: Serving static content via page cache, `sendfile`, `io_uring`.
- **Synchronization**: Worker pools, connection queues, logging, shared caches.
- **Security**: Process isolation, capabilities, seccomp profiles for the web server.

### Lesson

Every “simple” web request is a tour through **CPU, memory, scheduler, I/O, networking, and security**. When debugging latency or throughput, trace the request along this path and map symptoms to the relevant OS chapter.

---


## Case Study 6: Docker Fork Bomb Prevention

### Problem

A container runs a fork bomb, potentially taking down the host:

```bash
# Classic fork bomb
:(){ :|:& };:

# This creates exponential processes
# 2^n processes very quickly
# Can exhaust PIDs, file descriptors, memory
```

Without Protection

┌─────────────────────────────────────────────────────────────────┐
│                    FORK BOMB IMPACT                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Second 0:   1 process                                         │
│   Second 1:   2 processes                                       │
│   Second 2:   4 processes                                       │
│   Second 3:   8 processes                                       │
│   Second 4:   16 processes                                      │
│   Second 5:   32 processes                                      │
│   ...                                                           │
│   Second 15:  32,768 processes → PID limit hit!                │
│                                                                  │
│   Effects:                                                       │
│   • Can't create new processes (even ssh!)                      │
│   • System becomes unresponsive                                 │
│   • Other containers affected                                    │
│   • May require hard reboot                                      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Solution: PID Cgroups

# Limit PIDs per container
docker run --pids-limit 100 myimage

# Manually via cgroups (cgroup v1 layout; on cgroup v2, write pids.max
# in the container's own cgroup directory instead):
echo 100 > /sys/fs/cgroup/pids/docker/<container_id>/pids.max
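
The effect of the limit can be modeled in a few lines. This is a toy model, not a measurement: it assumes every process forks once per generation and that forks beyond the limit simply fail, which is how the pids controller behaves (`fork()` returns `EAGAIN` once the cgroup is full). The function name `fork_bomb_counts` is hypothetical.

```python
def fork_bomb_counts(pids_max: int, generations: int) -> list[int]:
    """Process count after each generation when every process forks once,
    with the total capped at pids_max (forks beyond the cap fail)."""
    counts = [1]
    for _ in range(generations):
        counts.append(min(counts[-1] * 2, pids_max))
    return counts


if __name__ == "__main__":
    print("limit 100:  ", fork_bomb_counts(100, 8))
    print("limit 32768:", fork_bomb_counts(32768, 15)[-1])
```

With a limit of 100 the bomb plateaus after seven doublings and stays inside its container; the host and other containers keep their own PID budget.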

Complete Container Hardening

# docker-compose.yml
version: "3.9"
services:
  myapp:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 4G
          pids: 100
    security_opt:
      - no-new-privileges:true
      - seccomp:custom-profile.json
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    read_only: true
    tmpfs:
      - /tmp:size=100M

Kubernetes Pod Security

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  containers:
  - name: app
    image: myapp
    resources:
      limits:
        memory: "4Gi"
        cpu: "2"
        # PID limits are not set here: configure podPidsLimit on the kubelet
    securityContext:
      runAsNonRoot: true
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL

Lesson

Defense in depth for containers:

PID limits (fork bombs)
Memory limits (memory bombs)
CPU limits (CPU bombs)
Seccomp (syscall filtering)
Capability dropping
Read-only filesystem
Non-root user

Summary: Key Lessons

Isolation is Worth It

Chrome proves process isolation’s value despite memory overhead.

Priority Inversion is Real

Mars Pathfinder shows subtle bugs can have catastrophic effects.

Bound All Operations

Cloudflare regex outage: always limit CPU time for untrusted input.

Don't Trust OOM Killer

Set proper limits; OOM Killer is a last resort, not a strategy.

Practice Exercise

Design a container runtime that:

Isolates processes (namespaces)
Limits resources (cgroups)
Filters syscalls (seccomp)
Survives fork bombs
Handles OOM gracefully

Consider:

What limits would you set by default?
How would you detect resource abuse?
How would you alert operators?
How would you handle cleanup?
