
CPU Scheduling

The CPU scheduler decides which process/thread runs when, for how long, and on which CPU. Understanding scheduling is essential for performance optimization and is a frequent interview topic for systems roles.
Interview Frequency: High
Key Topics: Scheduling algorithms, CFS, real-time, multi-core
Time to Master: 8-10 hours

Scheduling Goals

Different workloads have different priorities:
Metric           | Definition                     | Important For
-----------------|--------------------------------|----------------------
CPU utilization  | % of time the CPU is busy      | Maximizing throughput
Throughput       | Jobs completed per unit time   | Batch systems
Turnaround time  | Submit → complete              | Batch jobs
Response time    | Submit → first response        | Interactive systems
Waiting time     | Time spent in the ready queue  | Fairness
Fairness         | Equal CPU distribution         | Multi-tenant systems
Trade-offs are inevitable! Optimizing for throughput hurts response time. Optimizing for fairness may hurt throughput.

When Does Scheduling Happen?

Scheduling decisions occur when the running process blocks (e.g., on I/O), terminates, or yields; when a new process is created or a sleeping one wakes up; and, on preemptive systems, when a timer interrupt fires or a higher-priority process becomes ready.

[Diagram: Scheduling Triggers]

Non-Preemptive vs Preemptive

Type           | When a Context Switch Occurs                     | Examples
---------------|--------------------------------------------------|-------------------------------
Non-preemptive | Only when the process blocks or terminates       | Old batch systems, Windows 3.1
Preemptive     | Also on timer expiry or higher-priority arrival  | All modern OSes

Classic Scheduling Algorithms

First-Come, First-Served (FCFS)

The simplest algorithm: run processes in arrival order.

[Diagram: FCFS Timeline]

Convoy Effect: short processes get stuck behind long ones. For example, with bursts P1 = 24 ms, P2 = 3 ms, P3 = 3 ms arriving in that order, FCFS gives average wait = (0 + 24 + 27) / 3 = 17. If P2 and P3 ran first: average wait = (0 + 3 + 6) / 3 = 3.

Shortest Job First (SJF)

Run the process with the shortest (predicted) CPU burst first.

[Diagram: SJF Timeline]

Problem: how do you know the burst time in advance? Solution: predict it from history using exponential averaging:

τₙ₊₁ = α × tₙ + (1 − α) × τₙ

where:
  tₙ = actual length of the last burst
  τₙ = previous prediction (the predicted length of that burst)
  α  = weight given to recent history (typically 0.5)

Priority Scheduling

Each process has a priority; highest priority runs first:
// Lower number = higher priority (typical convention)
struct process {
    int pid;
    int priority;    // lower value = more urgent
    int burst_time;
    int wait_time;   // time spent waiting in the ready queue
};

// Priority assignment factors:
// - Static: user-defined, process type
// - Dynamic: aging, I/O behavior, CPU usage
Starvation Problem: low-priority processes may never run. Solution - Aging: raise the priority of processes the longer they wait.
#define AGING_FACTOR 10  // tune per system

void age_priorities(struct process *ready_queue, int count) {
    for (int i = 0; i < count; i++) {
        // With "lower number = higher priority", aging must DECREASE the value
        ready_queue[i].priority -= ready_queue[i].wait_time / AGING_FACTOR;
        if (ready_queue[i].priority < 0)
            ready_queue[i].priority = 0;
    }
}

Round Robin (RR)

Each process gets a fixed time quantum, then goes to the back of the queue.

[Diagram: Round Robin Timeline]

Time Quantum Selection:
Quantum Size  | Effect
--------------|------------------------------------------------
Too small     | Too much context-switch overhead
Too large     | Approaches FCFS behavior
Rule of thumb | ~80% of bursts should complete in one quantum

Multi-Level Feedback Queue (MLFQ)

The algorithm that combines multiple approaches:

[Diagram: Multi-Level Feedback Queue]

MLFQ Advantages:
  • Interactive processes stay responsive
  • CPU-bound processes get bulk CPU time
  • Adapts to process behavior automatically
  • No need to know burst time in advance

Linux Completely Fair Scheduler (CFS)

The default Linux scheduler since 2.6.23:

Core Concept: Virtual Runtime

// Each task has a virtual runtime (vruntime)
// CFS always runs the task with lowest vruntime

struct sched_entity {
    struct rb_node run_node;  // Position in red-black tree
    u64 vruntime;             // Virtual runtime
    u64 sum_exec_runtime;     // Actual runtime
    // ...
};

// vruntime increases as the task runs:
// vruntime += delta_exec * (NICE_0_WEIGHT / task_weight)

// Higher priority (lower nice) → HIGHER weight → vruntime increases slower
// → the task gets more CPU time before others catch up

Red-Black Tree for Task Selection

CFS keeps runnable tasks in a red-black tree ordered by vruntime and always picks the leftmost node (lowest vruntime). The leftmost node is cached, so selecting the next task is O(1) while insertion is O(log n).

[Diagram: CFS Red-Black Tree]

CFS Parameters

Parameter                   | Default | Effect
----------------------------|---------|---------------------------------------------
sched_latency_ns            | 6 ms    | Target scheduling period
sched_min_granularity_ns    | 0.75 ms | Minimum time slice
sched_wakeup_granularity_ns | 1 ms    | Minimum vruntime advantage needed to preempt

Nice Values and Weights

Nice value:  -20 ─────────── 0 ─────────── +19
Priority:    HIGH            NORMAL        LOW
Weight:      88761          1024           15

// A nice -20 task gets ~88761/1024 ≈ 87x more CPU than nice 0
// A nice +19 task gets ~15/1024 ≈ 1/68x as much CPU as nice 0

Real-Time Scheduling

For tasks with hard deadlines:

SCHED_FIFO and SCHED_RR

#include <sched.h>

struct sched_param param;
param.sched_priority = 99;  // 1-99, higher = more urgent (requires root or CAP_SYS_NICE)

// FIFO: runs until it voluntarily yields or blocks
sched_setscheduler(pid, SCHED_FIFO, &param);

// Round-robin: like FIFO but with time slicing among same priority
sched_setscheduler(pid, SCHED_RR, &param);

Real-Time Scheduling Algorithms

Rate Monotonic Scheduling (RMS): static priority based on period; the shorter the period, the higher the priority.
Task | Period | Execution Time | Priority
-----|--------|----------------|----------
T1   |  20ms  |      5ms       |    3 (highest)
T2   |  50ms  |     15ms       |    2
T3   | 100ms  |     25ms       |    1 (lowest)
Schedulability test (sufficient, not necessary):

U = Σᵢ Cᵢ/Tᵢ ≤ n(2^(1/n) − 1)

For 3 tasks: U ≤ 0.78 (as n grows, the bound approaches ln 2 ≈ 0.69)

Multi-Core Scheduling

Load Balancing

The kernel periodically rebalances tasks across per-CPU run queues, trading idle CPUs against the cost of migrating a task away from its warm cache.

[Diagram: Multi-Core Scheduling]

CPU Affinity

#define _GNU_SOURCE
#include <sched.h>

// Set CPU affinity
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);  // Allow only CPU 0
CPU_SET(2, &cpuset);  // Also allow CPU 2

sched_setaffinity(pid, sizeof(cpuset), &cpuset);

// Get current affinity
sched_getaffinity(pid, sizeof(cpuset), &cpuset);
for (int i = 0; i < CPU_SETSIZE; i++) {
    if (CPU_ISSET(i, &cpuset)) {
        printf("Process can run on CPU %d\n", i);
    }
}

NUMA Considerations

[Diagram: NUMA Architecture]
#include <numa.h>    // numa_alloc_onnode (link with -lnuma)
#include <numaif.h>  // set_mempolicy, move_pages

// Allocate memory on a specific node
void *ptr = numa_alloc_onnode(size, node);

// Bind future allocations to node 0
unsigned long nodemask = 1UL << 0;
set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8);

// Migrate already-mapped pages to specific nodes
move_pages(pid, count, pages, nodes, status, MPOL_MF_MOVE);

Scheduling Classes in Linux

Linux consults its scheduling classes in strict order: stop, deadline (SCHED_DEADLINE), real-time (SCHED_FIFO, SCHED_RR), fair (SCHED_NORMAL, SCHED_BATCH), then idle.

[Diagram: Linux Scheduling Classes]

Interview Deep Dive Questions

Question: Why doesn't CFS use static priorities like classic priority scheduling?

Answer: Problems with static priority:
  1. Starvation: Low-priority processes may never run
  2. Priority Inversion: Low-priority process holds lock needed by high-priority
  3. No adaptation: Doesn’t respond to changing process behavior
CFS solves these:
  1. Virtual runtime ensures all processes eventually run
  2. Proportional sharing based on nice values, not absolute priority
  3. Self-adjusting: Interactive processes naturally maintain low vruntime
Still uses priorities for:
  • Real-time scheduling (SCHED_FIFO, SCHED_RR)
  • Nice values influence CFS weight, not absolute priority
Question: What is the difference between response time and turnaround time, and why do they conflict?

Answer:
Time:    0────────5────────10────────15────────20

         Submit                 First          Complete
           ↓                    Output           ↓
           │←── Response Time ──→│               │
           │                                     │
           │←────────── Turnaround Time ─────────│
  • Response Time: Time from submission to first output/response
  • Turnaround Time: Time from submission to completion
Optimization conflicts:
  • Interactive: Minimize response time (run briefly, often)
  • Batch: Minimize turnaround (run to completion, minimize overhead)
Example:
Job: 100ms total work
Round-robin (10ms quantum):
  - Response time: 0-10ms (first quantum starts immediately if no queue)
  - Turnaround: varies based on competition

FCFS:
  - Response time = waiting time in queue
  - Turnaround = waiting + 100ms
Question: How does CFS keep interactive processes responsive?

Answer: Scenario: an interactive process (e.g., a text editor) sleeps waiting for input.

CFS behavior:
  1. While sleeping, vruntime doesn’t increase
  2. Other processes’ vruntimes increase
  3. When process wakes, it has relatively low vruntime
  4. Low vruntime means it’s near left of RB-tree
  5. Gets scheduled quickly → good response time
Sleeper fairness adjustment:
// On wakeup, cap vruntime to not be too far behind
vruntime = max(vruntime, min_vruntime - sched_latency)
This prevents:
  • Process sleeping for 1 hour, waking with vruntime = 0
  • Monopolizing CPU to “catch up”
Result: Interactive processes are naturally favored without explicit priority boost
Question: How would you design a scheduler for a database server with mixed query workloads?

Answer: Requirements analysis:
  • Mix of short queries and long batch jobs
  • Predictable latency for user-facing queries
  • Efficient use of CPU for analytics
Design:
  1. Multi-queue with priorities:
    • Queue 1: Interactive queries (< 10ms expected)
    • Queue 2: Short transactions (< 100ms)
    • Queue 3: Long-running analytics
  2. Scheduling policy:
    • Preemptive within queues
    • Higher queues preempt lower
    • Time-based demotion (like MLFQ)
  3. Adaptive classification:
    if query.estimated_cost < threshold_1:
        assign_to_queue_1()
    elif query.actual_runtime > threshold_2:
        demote_to_queue_3()
    
  4. Resource limits:
    • Queue 3 gets max 40% CPU (prevents starvation)
    • Aging promotes long-waiting queries
Real examples: PostgreSQL uses similar concepts with parallel query limits and statement timeouts
Question: What is priority inversion, and how is it solved?

Answer: Scenario:
Low priority (L): Holds lock X
Medium priority (M): CPU-bound, no locks
High priority (H): Waiting for lock X

Timeline:
1. L runs, acquires lock X
2. H arrives, needs lock X, blocks
3. M arrives, preempts L (M > L priority)
4. M runs indefinitely
5. H starves — blocked by M indirectly!
Solutions:
  1. Priority Inheritance:
    • L inherits H’s priority while holding X
    • L can preempt M, finish quickly, release X
    • H then runs
  2. Priority Ceiling:
    • Each lock has a ceiling (highest priority that may acquire it)
    • Process acquiring lock gets ceiling priority
    • Prevents blocking entirely
  3. Random boosting (Windows):
    • Randomly boost low-priority threads
    • Probabilistically solves inversion
Famous case: Mars Pathfinder (1997) — priority inversion caused system resets until patched with priority inheritance

Practice Exercises

Exercise 1: Simulate MLFQ

Implement a multi-level feedback queue scheduler simulation. Track queue demotions and response times.

Exercise 2: Measure Scheduling Latency

Write a program that measures scheduling latency by yielding and timing wake-up.

Exercise 3: CPU Pinning Experiment

Compare the performance of a cache-sensitive workload with and without CPU affinity.

Exercise 4: Priority Inversion Demo

Create a program that demonstrates priority inversion with pthreads and measure its effect.

Key Takeaways

CFS is Default

Linux uses virtual runtime for fair CPU sharing. No explicit priority queue.

MLFQ for Balance

Combines priority, round-robin, and adaptive demotion. Used by many OS.

Real-Time is Different

FIFO/RR with fixed priorities for hard deadlines. Rate Monotonic or EDF.

Multi-Core Adds Complexity

Load balancing, cache affinity, NUMA awareness all matter for performance.

Next: Virtual Memory