Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
CPU Scheduling: The Mathematics of Execution
The CPU Scheduler is the heart of the operating system’s process management. It is the arbiter of time, deciding which process/thread gains control of the CPU, for how long, and on which core. In modern high-concurrency environments, a poorly designed scheduler can lead to system-wide jitter, priority inversion, and throughput collapse. This module provides an exhaustive, senior-level deep dive into scheduling theory, the evolution from MLFQ to CFS, and the emerging EEVDF paradigm.1. The Core Problem: Resource Contention
Scheduling is fundamentally an optimization problem under constraints. We have processes and CPUs (where ). The goal is to maximize “useful” work while satisfying various quality-of-service (QoS) requirements.1.1. The Scheduling Framework
Every scheduler operates within a specific lifecycle:- Admission: Which processes are allowed into the “Ready” state?
- Dispatch: Which process from the “Ready” queue gets the CPU?
- Preemption: When do we forcibly take the CPU away from a running process?
- Termination/Yield: How do we clean up or move a process back to waiting?
1.2. Little’s Law and Queuing Theory
To understand scheduling performance, we must first look at the mathematical foundation: Little’s Law. Where:- : Average number of processes in the system (load).
- : Average arrival rate of processes.
- : Average time a process spends in the system (response time).
2. Scheduling Metrics and Trade-offs
There is no “perfect” scheduler. Every algorithm optimizes for one metric at the expense of others.| Metric | Focus | Mathematical Target | User Experience |
|---|---|---|---|
| Throughput | Productivity | Maximize | System finishes work faster. |
| Response Time | Latency | Minimize | UI feels “snappy” and fluid. |
| Turnaround Time | Batch Speed | Minimize | Large tasks finish sooner. |
| Wait Time | Fairness | Minimize | No process feels “ignored.” |
| CPU Utilization | Efficiency | Maximize | Getting your money’s worth of hardware. |
The Paradox of Choice
- Interactive Systems: Optimize for Response Time.
- Batch Systems (HPC): Optimize for Throughput.
- Real-Time Systems: Optimize for Predictability (Jitter).
3. Classic Scheduling Algorithms: From Simple to Sophisticated
3.1. First-Come, First-Served (FCFS)
The simplest non-preemptive algorithm. Uses a FIFO queue. The Convoy Effect: If a CPU-bound process () arrives first, followed by ten I/O-bound processes (), the I/O-bound processes will wait for the entire duration of ‘s burst. This leads to low device utilization because the I/O devices sit idle while processes wait for the CPU.3.2. Shortest Job First (SJF) & SRTF
SJF (Non-preemptive) and Shortest Remaining Time First (SRTF - Preemptive) are provably optimal for minimizing average waiting time. The Prediction Problem: How do we know the length of the next CPU burst? We can’t see the future, so we use Exponential Averaging:- : Actual length of the burst.
- : Predicted value for the burst.
- : Weight (usually ).
3.3. Round Robin (RR)
The foundation of time-sharing. Each process gets a Time Quantum ().- Small : Great response time, but high Context Switch Overhead.
- Large : Low overhead, but approaches FCFS behavior (poor response time).
- Rule of Thumb: should be large relative to context switch time (e.g., vs ), and of CPU bursts should be shorter than .
4. Multi-Level Feedback Queue (MLFQ): The Intelligent Arbiter
MLFQ is the “Gold Standard” for general-purpose OS (macOS, Windows, older Linux). It aims to minimize response time for interactive jobs while also being fair to long-running batch jobs, without knowing burst times in advance.4.1. The Five Rules of MLFQ
- Rule 1: If Priority() > Priority(), runs ( doesn’t).
- Rule 2: If Priority() = Priority(), & run in RR.
- Rule 3: When a job enters the system, it is placed at the highest priority.
- Rule 4a: If a job uses up an entire time slice while running, its priority is reduced (it moves down one queue).
- Rule 4b: If a job gives up the CPU before the time slice is up, it stays at the same priority level.
4.2. Solving the “Gaming” and “Starvation” Problems
The Gaming Problem: A process could run for of a time slice, then issue an I/O to stay at high priority, effectively monopolizing the CPU. Solution (Rule 4 Revision - Accounting): Once a job uses up its time allotment at a given level (regardless of how many times it gave up the CPU), its priority is reduced. The Starvation Problem: Too many interactive jobs could prevent long-running jobs from ever getting CPU time. Solution (Priority Boost): After some period , move all jobs in the system to the topmost queue.5. Proportional Share Scheduling: Lottery and Stride
Instead of optimizing for response time or turnaround, these schedulers ensure each job gets a certain percentage of the CPU.5.1. Lottery Scheduling
Each process holds a number of tickets. The scheduler “picks a random ticket,” and the winner runs.- Probabilistic Fairness: Over time, if A has 75 tickets and B has 25, A will get of the CPU.
- Advantages: Extremely simple to implement; handles new processes easily.
- Disadvantages: Inaccurate over short time scales due to randomness.
5.2. Stride Scheduling
The deterministic version of Lottery.- Each process has a stride ().
- Each process has a pass value (starts at 0).
- Scheduler runs the process with the lowest pass.
- After running, .
6. Linux Internals: The Completely Fair Scheduler (CFS)
Since Linux 2.6.23, CFS has been the default. It abandons traditional priority queues for a Red-Black Tree.6.1. Virtual Runtime (vruntime)
The heart of CFS. Every process tracks itsvruntime, which is the amount of time it has spent on the CPU, normalized by its priority.
- High Priority (Nice -20):
vruntimeincreases very slowly. - Low Priority (Nice +19):
vruntimeincreases very quickly. - Target: CFS always picks the process with the minimum
vruntimeto run next.
6.2. The Red-Black Tree Structure
CFS stores all runnable processes in a Red-Black Tree, keyed byvruntime.
- Selection: (the leftmost node is cached as
min_vruntime). - Insertion: .
- Self-Balancing: Ensures the tree doesn’t become a linked list.
6.3. Latency Targeting
Instead of a fixed quantum, CFS uses Target Latency.- If target latency is and there are 4 equal-weight processes, each gets .
- If there are 100 processes, , which might be below the context switch cost.
- Solution:
min_granularity(e.g., ). If a process is scheduled, it is guaranteed at least this much time.
7. Multiprocessor Scheduling: Scaling to Cores
Moving from one CPU to many introduces two massive headaches: Cache Affinity and Synchronization.7.1. Single Queue Multiprocessor Scheduling (SQMS)
One global queue. All CPUs pull from it.- Pros: Perfectly balanced load.
- Cons: High lock contention on the global queue; Zero cache affinity (Process A might run on CPU 0, then CPU 5, losing all L1/L2 cache data).
7.2. Multi-Queue Multiprocessor Scheduling (MQMS)
Each CPU has its own runqueue.- Pros: No lock contention between cores; Excellent cache affinity.
- Cons: Load Imbalance. CPU 0 might have 100 tasks while CPU 1 is idle.
7.3. Load Balancing: Push vs. Pull
Modern kernels (Linux) use MQMS with balancing:- Push Migration: A kernel task periodically checks loads and “pushes” tasks from busy to idle cores.
- Pull Migration: An idle core looks at other cores’ queues and “pulls” (steals) a task.
8. Real-Time Scheduling: When “Late” equals “Failure”
Real-time OS (RTOS) must guarantee that tasks meet their deadlines.8.1. Rate Monotonic (RM)
Static Priority: Shorter period = Higher priority.- Schedulability Bound: . For large , this is . If CPU utilization is below this, all deadlines are guaranteed.
8.2. Earliest Deadline First (EDF)
Dynamic Priority: The task with the closest deadline runs first.- Schedulability: . Optimal for uniprocessors.
8. Real-World Workload Stories
Theory is useful, but scheduling only really “sticks” when you can picture concrete systems.8.1 Latency-Sensitive Web Service
- Scenario: A multi-threaded HTTP API serving p95 under 100 ms.
- Workload: Short CPU bursts, heavy I/O (database, caches), thousands of concurrent connections.
- Scheduling Implications:
- Favor low latency over maximum throughput.
- Too many worker threads lead to context switch storms and cache thrash.
- Use CFS with reasonable
ulimit -u/ thread counts and tunenet.core.somaxconnand backlog handling.
- Design pattern:
- Use a fixed-size worker pool (bounded threads) + non-blocking I/O.
- Keep CPU utilization in the 60–80% range under peak to absorb bursts.
8.2 Batch Analytics Cluster
- Scenario: Spark/Hadoop jobs running for minutes to hours.
- Workload: CPU-bound phases, long-running tasks, predictable throughput goals.
- Scheduling Implications:
- Optimize for throughput and fairness across tenants, not per-request latency.
- Use cgroups and nice values to prevent one user from monopolizing CPUs.
- Longer time slices (higher
sched_latency_ns) can reduce context switch overhead.
8.3 Low-Latency Trading / Audio Processing
- Scenario: Missed deadlines translate directly to financial loss or audible glitches.
- Workload: Small, periodic tasks with very tight deadlines.
- Scheduling Implications:
- Use real-time scheduling classes (
SCHED_FIFO,SCHED_RR) for critical threads. - Pin threads to specific cores and isolate those cores from general-purpose workloads (
isolcpus,nohz_full). - Minimize kernel preemption and disable frequency scaling jitter where necessary.
- Use real-time scheduling classes (
9. The Future: EEVDF (Earliest Eligible Virtual Deadline First)
As of Linux 6.6, CFS is being replaced by EEVDF.- The Flaw in CFS: CFS is “fair” over long periods but can have high “lag” (short-term unfairness).
- EEVDF Logic: Processes have a “virtual deadline.” The scheduler picks the one with the earliest deadline that is also “eligible” (hasn’t overstayed its welcome). This provides much tighter latency bounds for audio and high-frequency trading applications.
10. Advanced Interview Questions
Q1: Why does CFS use a Red-Black Tree instead of a Heap?
Q1: Why does CFS use a Red-Black Tree instead of a Heap?
find-min, but for delete-min and for insert. However, a Heap is not efficient for updating a value or deleting a specific node (which happens when a process blocks).A Red-Black Tree allows us to find the min_vruntime in (by caching), but also provides for searching/deleting arbitrary nodes (e.g., when a process is killed or changes its nice value).Q2: Explain Priority Inversion and how a scheduler helps.
Q2: Explain Priority Inversion and how a scheduler helps.
Q3: What is the 'Convoy Effect' in FCFS?
Q3: What is the 'Convoy Effect' in FCFS?
Practice Lab: Building a Mini-Scheduler
Objective: Implement a Stride Scheduler in C.Interview Deep-Dive
You are designing a scheduler for a system that must handle both interactive web requests (p99 latency under 50ms) and batch ML training jobs (maximize GPU/CPU throughput). How would you approach this, and what existing Linux mechanisms would you use?
You are designing a scheduler for a system that must handle both interactive web requests (p99 latency under 50ms) and batch ML training jobs (maximize GPU/CPU throughput). How would you approach this, and what existing Linux mechanisms would you use?
- CPU isolation: Use
isolcpusor cpusets to dedicate specific cores to each workload class. Give the web service 8 cores and the ML training 24 cores on a 32-core machine. The web service cores run with theschedutilgovernor for responsive frequency scaling; the ML cores run with theperformancegovernor for maximum throughput. - Cgroups v2 CPU controller: Even within the shared cores, set CPU bandwidth limits. The web service cgroup gets
cpu.weight = 500(high priority) and the ML cgroup getscpu.weight = 100. Under contention, the web service gets 5x the CPU share. Additionally, setcpu.maxon the ML cgroup to ensure it cannot consume more than its allocated cores’ worth of CPU time. - Scheduling class separation: For the most latency-sensitive web request handler threads, consider using
SCHED_FIFOwith a moderate priority (e.g., 50). This gives them preemptive priority over allSCHED_NORMALtasks, ensuring they run immediately when work arrives. The ML training runs under defaultSCHED_NORMAL(CFS/EEVDF). The risk with SCHED_FIFO is that a bug in the web service could starve everything else, so always setrlimit_rttimeas a safety valve. - NUMA awareness: If the server is multi-socket, bind the web service to one NUMA node and the ML training to the other. This prevents cross-node memory access latency from affecting web request processing.
Walk through how CFS determines the time slice for a process. What happens when you have 1000 runnable processes versus 4?
Walk through how CFS determines the time slice for a process. What happens when you have 1000 runnable processes versus 4?
sched_latency_ns (default 6ms for up to 8 tasks) and sched_min_granularity_ns (default 0.75ms).The algorithm:- Few tasks (N is at most sched_nr_latency, default 8): Each task gets
sched_latency_ns / Nweighted by its nice value. With 4 equal-weight tasks: each gets 6ms / 4 = 1.5ms per scheduling round. This means every task runs at least once every 6ms, giving good interactive responsiveness. - Many tasks (N > 8): The time slice would become
6ms / N. For 1000 tasks, that is 6 microseconds — less than the context switch cost. So CFS clamps atsched_min_granularity_ns(0.75ms). With 1000 tasks, the effective scheduling period becomes 1000 * 0.75ms = 750ms. This means a task might wait 750ms between runs. Interactive responsiveness degrades significantly.
- CFS always picks the task with the lowest
vruntimefrom the red-black tree. When a task runs fordeltaactual time, its vruntime increases bydelta * (NICE_0_LOAD / weight). A high-priority task (nice -20, weight ~88761) accumulates vruntime slowly, so it stays at the front of the tree longer. A low-priority task (nice +19, weight ~15) accumulates vruntime quickly and gets displaced.
min_vruntime (leftmost node) is cached, so selection is O(1). The scheduler is still efficient — the problem is not algorithmic overhead but the sheer number of tasks competing for finite CPU time.The practical lesson: if you have 1000 runnable processes and care about latency, you have an architecture problem, not a scheduling problem. The solution is to reduce the number of runnable tasks (use async I/O instead of thread-per-connection, use connection pooling) or increase the number of cores.Follow-up: How does the nice value actually translate to CPU share in CFS?Each nice level maps to a weight via a lookup table. Nice 0 has weight 1024 (the baseline). Each nice level changes the weight by approximately 1.25x. Nice -1 has weight 1277, nice -20 has weight 88761, nice +19 has weight 15. The CPU share for a task is task_weight / sum_of_all_weights. So two tasks at nice 0 each get 50%. One task at nice 0 and one at nice 5 (weight ~335): the nice 0 task gets 1024/(1024+335) = 75%, the nice 5 task gets 25%. The 1.25x factor per level was chosen so that each nice level corresponds to roughly 10% less CPU time, which gives administrators a predictable tuning knob.Explain the Convoy Effect in FCFS scheduling and how it manifests in real production systems. Have you seen it outside of CPU scheduling?
Explain the Convoy Effect in FCFS scheduling and how it manifests in real production systems. Have you seen it outside of CPU scheduling?
- Database lock queues: A single transaction holding a row lock for a long time (full table scan with row-level locking) blocks dozens of short transactions behind it. The short transactions could complete in microseconds but wait milliseconds for the lock. This is why databases use MVCC (Multi-Version Concurrency Control) — readers do not block writers and vice versa, breaking the convoy.
- Network I/O scheduling: On a shared network link, a single large TCP flow (a backup job sending 10GB) can fill the switch buffer, causing small latency-sensitive packets (database queries, health checks) to queue behind it. This is head-of-line blocking. The solution is QoS (Quality of Service) — separate queues for different traffic classes with priority scheduling.
- Garbage collection: In a GC-managed runtime (Java, Go), a stop-the-world GC pause is the ultimate convoy effect. Every application thread stops while the GC runs. Concurrent GC algorithms (ZGC, Shenandoah, Go’s tri-color GC) break this convoy by doing most GC work concurrently with application threads.
- Kubernetes pod scheduling: If a large pod (requesting 64 CPUs) is ahead in the scheduling queue, smaller pods (1 CPU each) cannot be scheduled until the scheduler finds a node for the large pod — even if there are plenty of nodes suitable for the small pods. This is why Kubernetes has priority classes and preemption.
Production Caveats & Common Pitfalls
The scheduler is a place where intuition fails badly. Here are four traps that have caused real outages.Senior-Level Interview Questions
Explain CFS vruntime and how it converges to fairness. What happens when a process sleeps for 10 seconds and wakes up?
Explain CFS vruntime and how it converges to fairness. What happens when a process sleeps for 10 seconds and wakes up?
- Define vruntime: Each task in CFS has a
vruntime(virtual runtime) — the amount of CPU time the task has consumed, scaled by the inverse of its weight (nice value). A nice 0 task’s vruntime grows at 1.0x wall-clock seconds per CPU-second. A nice -20 task (weight ~88761) grows at ~0.012x. A nice +19 task (weight 15) grows at ~68x. CFS always picks the runnable task with the smallest vruntime. - Explain convergence: When the lowest-vruntime task runs, its vruntime increases until another task has lower vruntime; then the scheduler switches. Over time, tasks of equal weight accumulate vruntime at equal rates, getting equal CPU. Tasks with higher weight (lower nice) accumulate more slowly, so they get more CPU time per scheduling round.
- Address the wakeup case: A task that sleeps for 10 seconds would have very low vruntime if we did nothing — it would dominate the CPU upon waking, starving other tasks. CFS prevents this with
min_vruntime: a per-runqueue counter that tracks the smallest vruntime of currently runnable tasks. When a task wakes up, its vruntime is set tomax(saved_vruntime, min_vruntime - sched_latency_ns/2). This gives the waking task a small “bonus” (it goes a bit before others) but caps how far behind it can be — fairness is preserved. - Highlight the latency target: CFS does not use a fixed quantum. Instead, it uses
sched_latency_ns(default 6ms with up to 8 tasks) — the period in which every task should get a turn. With 4 tasks, each gets 1.5ms before the next switch. Combined withsched_min_granularity_ns(default 0.75ms), this prevents excessive switching when there are many tasks.
- “CFS uses time slices like Round Robin.” Wrong: CFS uses dynamic preemption based on vruntime, not fixed slices.
- “Higher nice = more CPU.” Wrong: higher nice = lower priority. Nice -20 is the highest priority for CFS.
- “Sleeping tasks accumulate vruntime credit.” Wrong: vruntime only grows when running; the wakeup adjustment is what prevents starvation of long-sleepers.
Documentation/scheduler/sched-design-CFS.rstin the Linux kernel.- “Inside the Linux 2.6 Completely Fair Scheduler” by M. Tim Jones (IBM developerWorks, 2009).
- “An EEVDF CPU scheduler for Linux” — LWN article on the EEVDF transition: lwn.net/Articles/925371/.
When does CPU pinning a hot path help, and when does it hurt?
When does CPU pinning a hot path help, and when does it hurt?
- Define CPU pinning: Setting CPU affinity (via
taskset,pthread_setaffinity_np, or cgroup cpuset) restricts which cores a thread can run on. Pinning to one core: the thread always runs there, the cache stays warm, no migration cost. - When pinning helps: (a) Latency-sensitive workloads where TLB/cache warm-up cost dominates — HFT, audio processing, packet processing. (b) NUMA-aware code that accesses node-local memory — pin to a core on the same NUMA node as the data. (c) Workloads with large working sets that fit in L1/L2 — avoiding cache invalidation matters more than load balancing. (d) Real-time systems where any migration is unacceptable jitter.
- When pinning hurts: (a) Variable-load workloads — if your pinned core is busy with another task, you queue behind it while other cores are idle. (b) Bursty workloads where the kernel scheduler’s migration logic would naturally rebalance. (c) Multi-threaded workloads where threads sometimes need to run in parallel and sometimes serialize — the kernel’s load balancer is smarter than static pinning. (d) Hyper-threading: pinning to two SMT siblings on the same physical core means they share execution units; you don’t actually get parallelism.
- The senior insight: Pinning is a sledgehammer. The kernel’s CFS scheduler is already cache-aware (
SD_BALANCE_EXEC,SD_SHARE_CPUCAPACITY). It only migrates when there is a strong reason. Don’t pin unless you have measured that the kernel’s default behavior is hurting you. When you do pin, also pin memory (NUMA), isolate cores from kernel work (isolcpus,nohz_full), and disable IRQ steering on those cores.
isolcpus and why combine it with pinning? isolcpus=2,3 (boot parameter) tells the kernel “these CPUs are not part of the default scheduler load balancing.” Tasks won’t be scheduled on them unless explicitly pinned. This eliminates background kernel work (kworkers, watchdogs) from interfering with your latency-critical thread. Combine with nohz_full=2,3 to also disable the timer tick on those cores — now there are no periodic interrupts at all when a single task runs there.top shows it occasionally running on CPU 6. Why? Several possibilities: (a) Someone called taskset -p to change the affinity. (b) The thread inherits affinity from its parent; if the parent’s affinity changed after the child was created, behavior diverges. (c) Cgroup cpuset may override per-thread affinity. (d) The kernel’s sched_setaffinity was called with a wider mask than expected. Always check /proc/PID/status’s Cpus_allowed_list for the actual mask.cpu-manager-policy=static, pods with integer CPU requests get exclusive CPU pinning. The kubelet manages the cpuset cgroup. This is essential for latency-sensitive workloads on shared nodes. The catch: it only works for “Guaranteed” QoS pods (limits == requests, integer CPUs). Burstable pods still float across cores.- “Always pin for performance.” Wrong: pinning often hurts because it disables load balancing.
- “Pinning is just hardware affinity.” Misleading: it also affects how the kernel schedules and balances.
- “Pin to two SMT siblings for max throughput.” Wrong: SMT siblings share execution resources, so you don’t get true parallelism.
- “Performance Analysis Methodology” by Brendan Gregg — includes the USE method for CPU analysis.
- DPDK Programmer’s Guide — doc.dpdk.org/guides/prog_guide/.
- “Real-time Linux on ARM” — article on isolcpus and nohz_full: lwn.net/Articles/549580/.
A process is stuck in D-state for 10 minutes. Walk me through diagnosis and recovery.
A process is stuck in D-state for 10 minutes. Walk me through diagnosis and recovery.
- Define D-state:
TASK_UNINTERRUPTIBLE— the kernel has put the task to sleep waiting for something (typically I/O completion) and made it impossible to wake via signals. This is intentional: some kernel paths cannot tolerate being interrupted halfway (e.g., a journal commit). The downside is that even SIGKILL is ineffective — the process literally cannot be killed. - Diagnose what it’s waiting on:
cat /proc/PID/stackshows the kernel call stack of the sleeping task. Read top-down to see what subsystem is waiting. Common patterns:nfs_*(NFS server unresponsive),xfs_*orext4_*(local filesystem — usually disk I/O hang),__schedule -> io_schedule -> submit_bio(block device hang),mutex_lock_slowpath(kernel mutex contention). - Confirm the hardware/network path: If NFS, ping the server, check
mountfor the offending mount, look at server logs. If local disk, checkdmesgfor I/O errors,smartctlfor drive health,iostat -x 1for queue depths and service times. If block device, look fornvmeorataerrors in dmesg. - Recover: For NFS hangs, you usually have to fix the network or remount with
umount -f -l(force-lazy unmount) — this releases waiters with EIO. For disk hangs, sometimes echoing to/proc/PID/wchandoesn’t help; you may need to fail over the disk or reboot. For kernel mutex hangs (rare), inspect/proc/PID/wchanto identify which mutex, then look at who holds it via lockdep traces if available. Last resort:echo b > /proc/sysrq-triggerfor an immediate reboot if the system is unrecoverable.
submit_bio — the block layer had submitted I/O that the controller never completed. ps showed them as D, kill -9 did nothing, even kill -9 1 (init) couldn’t help. They had to evacuate the host and force-reboot. The post-mortem (githubeng.com) recommended adding /proc/PID/stack collection to all alerts about long-running D-state processes.TASK_KILLABLE and why was it added? Linux 2.6.25 (2008) added TASK_KILLABLE — like TASK_UNINTERRUPTIBLE but allows wakeup specifically by SIGKILL. NFS code was the main motivator: pre-2.6.25, an NFS server crash would create permanent D-state processes; post, you could SIGKILL them. Most uninterruptible sleeps in modern kernels use TASK_KILLABLE. If a process is stuck in TASK_KILLABLE, kill -9 works.read() and it has not returned for 10 minutes. The kernel side is in read_iter or submit_bio waiting.- “kill -9 to fix D-state.” Wrong: D-state ignores all signals (unless TASK_KILLABLE).
- “Restart the process.” Often impossible — the process won’t terminate.
- “Increase CPU.” Doesn’t help an I/O-bound D-state.
- “What Is Disk I/O Wait?” — Brendan Gregg blog.
Documentation/process-management/runaway-processes.rstin the kernel tree.- “Sysrq triggers” — kernel.org/doc/Documentation/admin-guide/sysrq.rst — emergency reboot mechanisms.
Design a scheduling policy for a system that runs both batch ML training and real-time video transcoding on the same nodes. What knobs do you turn?
Design a scheduling policy for a system that runs both batch ML training and real-time video transcoding on the same nodes. What knobs do you turn?
- Identify the SLA: Real-time transcoding has a deadline (frame must be encoded in 33ms for 30fps). Missing the deadline is a quality regression (dropped frame). Batch ML training has a throughput goal (utilize 100% of GPU/CPU when possible) but no per-task deadline.
- Use scheduling class separation: Run transcoding under
SCHED_FIFOpriority 50 — it preempts everything else. Run ML training under defaultSCHED_NORMAL. Setkernel.sched_rt_runtime_us=950000so RT cannot consume more than 95% per second; the remaining 5% lets ML make progress and lets you SSH in. - Use cgroups for resource isolation: Put transcoding in cgroup A with
cpu.weight=10000(max). Put ML in cgroup B withcpu.weight=100. Setcpu.maxon ML to limit its bursts. Usememory.highandmemory.maxto prevent ML from swapping out transcoding pages. - NUMA placement and isolation: On a 2-socket box, dedicate node 0 to transcoding (8 cores) and node 1 to ML (24 cores). Pin transcoding threads with
tasksetand usenumactl --membind=0for memory. ML uses node 1 exclusively. This eliminates QPI/UPI cross-socket traffic. Useisolcpus=0-7on the transcoding cores so no kernel work runs there. - Watchdog and observability: Monitor transcoding p99 latency continuously. If it exceeds 30ms, page the on-call. Alert on RT throttling (
/proc/sched_debugshows RT runtime exceeded). Track NUMA misses withnumastat. Set upbcc-tools/runqlatto track scheduling latency per cgroup.
cpu.weight with high asymmetry (10000 vs 100) and pin transcoding to dedicated cores via cpuset. With dedicated cores plus the weight, latency is usually within 5-10% of the SCHED_FIFO solution. The remaining gap comes from kernel preemption granularity.cpu.max set generously (e.g., 200% of typical use). Under contention with ML, weights ensure transcoding wins. When ML is idle, transcoding can use spare capacity. The catch: transcoding’s working set must fit in its dedicated cores’ cache for predictability; bursting to other cores means cold cache.devices.allow) or Kubernetes device plugins. Don’t share a GPU between latency-sensitive and throughput-sensitive workloads — there is no preemption guarantee.- “Use nice -20 for transcoding.” Wrong: nice doesn’t preempt batch jobs strongly enough.
- “Just give transcoding more cores.” Doesn’t help if the cores are shared with ML.
- “Run them as separate Kubernetes priority classes and trust the scheduler.” Pod-level scheduling doesn’t address per-thread CPU contention within a node.
- “Cgroup v2 documentation” — kernel.org/doc/Documentation/admin-guide/cgroup-v2.rst.
- “Predictable Latency in Multi-tenant Systems” — ACM Queue article on isolation techniques.
- Netflix Tech Blog: “Application data caching using SSDs” — netflixtechblog.com — discusses workload isolation patterns.
Next: Memory Management →