Power & Thermal Management

Understanding power management is essential for infrastructure engineers. Whether optimizing cloud costs, managing thermal throttling, or debugging performance issues, these concepts matter at scale. The kernel’s power management subsystem is the negotiation layer between hardware capabilities (what the CPU can do) and software policy (what the OS wants the CPU to do), and getting this negotiation wrong leads to either wasted money or degraded performance.
Interview Frequency: Medium (important for infrastructure/performance roles)
Key Topics: cpufreq, cpuidle, thermal throttling, power governors
Time to Master: 6-8 hours

Why Power Management Matters

In the cloud era, power management directly impacts:
  • Cost: Power = money. A single cloud server running at max frequency 24/7 versus using schedutil can mean 15-30% higher energy draw. Across 10,000 servers, that is millions of dollars per year in electricity and cooling costs.
  • Performance: Thermal throttling degrades performance silently. Your application benchmarks beautifully on a cool machine, then loses 20% throughput at 3 AM when the data center ambient temperature rises.
  • Reliability: Heat shortens hardware lifespan. A common rule of thumb, derived from the Arrhenius equation, is that component failure rates roughly double for every 10 degrees C rise above 25 degrees C.
  • Capacity planning: Power and cooling constraints limit rack density. A 10kW rack power budget means you cannot simply add more servers — you must optimize what you have.
A senior engineer would say: “We do not tune power management because we care about electricity bills in isolation. We tune it because power, thermal headroom, and performance are three sides of the same triangle — you cannot change one without affecting the other two.”

Power Management Architecture

Think of the kernel’s power management stack like a building’s climate control system. The hardware (CPU, sensors, fans) is the physical plant. The kernel frameworks (cpufreq, cpuidle, thermal) are the control logic. User-space tools (turbostat, tuned) are the building manager’s dashboard. Each layer has its own responsibility, and they coordinate through well-defined interfaces.
+-----------------------------------------------------------------------------+
|                    LINUX POWER MANAGEMENT STACK                              |
+-----------------------------------------------------------------------------+
|                                                                              |
|  User Space                                                                  |
|  +-------------------------------------------------------------------------+|
|  |  Tools: turbostat, cpupower, thermald, tuned                            ||
|  |  Interfaces: /sys/devices/system/cpu/, /sys/class/thermal/              ||
|  +-------------------------------------------------------------------------+|
|                                                                              |
|  ========================================================================   |
|                                                                              |
|  Kernel Frameworks                                                           |
|  +-------------------------------------------------------------------------+|
|  |                                                                          ||
|  |  +-------------+  +-------------+  +-------------+  +-------------+    ||
|  |  |  cpufreq    |  |  cpuidle    |  |  thermal    |  |  PM QoS     |    ||
|  |  | (frequency) |  |  (C-states) |  |  (cooling)  |  |  (latency)  |    ||
|  |  +------+------+  +------+------+  +------+------+  +------+------+    ||
|  |         |                |                |                |            ||
|  |         +----------------+----------------+----------------+            ||
|  |                          |                |                              ||
|  |                          v                v                              ||
|  |         +--------------------------------------------+                  ||
|  |         |           Power Management Core             |                  ||
|  |         |  - ACPI interface (tables + AML bytecode)   |                  ||
|  |         |  - Intel/AMD specific drivers               |                  ||
|  |         |  - Device PM (runtime suspend/resume)       |                  ||
|  |         +--------------------------------------------+                  ||
|  |                                                                          ||
|  +-------------------------------------------------------------------------+|
|                                                                              |
|  ========================================================================   |
|                                                                              |
|  Hardware                                                                    |
|  +-------------------------------------------------------------------------+|
|  |  CPU: P-states (frequency), C-states (idle), Turbo Boost               ||
|  |  Sensors: Temperature, power consumption, voltage                       ||
|  |  Cooling: Fans, passive cooling                                         ||
|  +-------------------------------------------------------------------------+|
|                                                                              |
+-----------------------------------------------------------------------------+
The kernel’s power management core reads hardware capabilities through ACPI tables at boot time. ACPI (Advanced Configuration and Power Interface) is essentially a contract between the BIOS/firmware and the OS: the firmware advertises what power states the hardware supports, and the OS decides which states to use and when. On modern processors, the intel_pstate and amd_pstate drivers go beyond the legacy ACPI P-state tables and control the hardware more directly through hardware-specific interfaces such as MSRs (Model Specific Registers).

CPU Frequency Scaling (cpufreq)

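The cpufreq subsystem controls how fast the CPU runs when it is active. This is the kernel’s answer to “how much performance do I need right now?” The key insight: dynamic power scales roughly with voltage squared times frequency, and the voltage a CPU needs scales roughly linearly with frequency, so dynamic power grows roughly with the cube of frequency. Halving the frequency can therefore cut the dynamic power component by up to roughly 8x (2 cubed), although real savings are smaller because voltage cannot drop below a minimum and static leakage power remains.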
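The cubic relationship can be written out with the standard first-order model of CMOS dynamic power. The symbols here are labels chosen for this sketch (C for effective switched capacitance, V for supply voltage, f for clock frequency); the assumption that V tracks f linearly holds only over part of the operating range.

```latex
\begin{align*}
P_{\text{dyn}} &\approx C\,V^{2} f \\
V \propto f \;&\Rightarrow\; P_{\text{dyn}} \propto f^{3} \\
\frac{P_{\text{dyn}}(f/2)}{P_{\text{dyn}}(f)} &\approx \left(\tfrac{1}{2}\right)^{3} = \tfrac{1}{8}
\end{align*}
```

The 1/8 figure is an upper bound on the saving: once voltage reaches its minimum, further frequency reductions only save power linearly.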

P-States: Performance States

P-states control the frequency/voltage operating point of an active CPU. Think of them like gears in a car — higher gears (lower P-numbers) mean more speed but more fuel consumption.
+-----------------------------------------------------------------------------+
|                    CPU P-STATES (Performance States)                         |
+-----------------------------------------------------------------------------+
|                                                                              |
|  P-State    Frequency      Voltage       Power          Use Case            |
|  -------------------------------------------------------------------------- |
|  P0         3.6 GHz       1.2 V         ~100W          Max performance      |
|  P1         3.2 GHz       1.1 V         ~75W           High performance     |
|  P2         2.8 GHz       1.0 V         ~50W           Normal use           |
|  P3         2.4 GHz       0.9 V         ~35W           Power saving         |
|  P4         2.0 GHz       0.85V         ~25W           Low power            |
|  ...        ...           ...           ...            ...                   |
|  Pn         800 MHz       0.7 V         ~5W            Minimum              |
|                                                                              |
|  Turbo Boost (above base frequency):                                         |
|  - Single-core turbo: Up to 4.8 GHz                                         |
|  - All-core turbo: Up to 4.2 GHz                                            |
|  - Depends on: Temperature, power budget, active cores                      |
|                                                                              |
|  The power budget is shared. If one core turbos to 4.8 GHz,                 |
|  other cores may be forced to lower P-states to stay within                  |
|  the package TDP (Thermal Design Power) envelope.                            |
|                                                                              |
+-----------------------------------------------------------------------------+
Production gotcha: Turbo Boost is opportunistic. A core can only reach turbo frequencies when the overall package has thermal and power headroom. On a busy server where all cores are active, per-core turbo is often unachievable. Benchmarking on a cold, idle machine and expecting those frequencies in production is a classic mistake.

Viewing Frequency Information

# View available frequencies for CPU 0
# These are the discrete P-state steps the hardware supports
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
# 800000 1000000 1200000 1400000 1600000 1800000 2000000 2200000 2400000
# Values are in kHz -- divide by 1,000,000 for GHz (or by 1,000 for MHz)

# Current frequency (what the CPU is running at RIGHT NOW)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
# 2400000 (2.4 GHz)

# Hardware limits -- the absolute floor and ceiling
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq  # 800000
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq  # 3600000

# Current governor -- the policy algorithm making scaling decisions
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# schedutil

# Using cpupower tool for a friendlier summary
cpupower frequency-info
cpupower frequency-set -g performance  # Set governor

# Real-time monitoring with turbostat -- the gold standard tool
# Shows actual frequency, C-state residency, temperature, and power
sudo turbostat --interval 1
# Tip: Avg_MHz is actual work done, Bzy_MHz is frequency when busy.
# If Avg_MHz << Bzy_MHz, the CPU is spending a lot of time idle.
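A small helper can make the kHz values easier to read across all CPUs. This is a minimal sketch: the function names khz_to_ghz and freq_summary are made up for illustration, and the sysfs root is a parameter so the function can be pointed at a copy of the tree.

```shell
# Convert a cpufreq value in kHz to GHz with two decimals.
khz_to_ghz() {
    LC_ALL=C awk -v khz="$1" 'BEGIN { printf "%.2f\n", khz / 1000000 }'
}

# Print "cpuN governor current_GHz" for every cpufreq policy under a sysfs root.
# The root defaults to the real sysfs path (hypothetical helper, not a standard tool).
freq_summary() {
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        # Skip CPUs without a readable cpufreq policy (for example inside some VMs)
        [ -r "$dir/scaling_cur_freq" ] || continue
        cpu=$(basename "$(dirname "$dir")")
        gov=$(cat "$dir/scaling_governor")
        ghz=$(khz_to_ghz "$(cat "$dir/scaling_cur_freq")")
        echo "$cpu $gov $ghz"
    done
}
```

Calling freq_summary with no argument reads the live system; on machines without cpufreq support it simply prints nothing.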

Frequency Governors

Governors are the policy algorithms that decide what frequency the CPU should run at. They are the “brain” of the cpufreq subsystem.
| Governor | Description | Use Case |
|---|---|---|
| performance | Always max frequency | Latency-sensitive workloads, benchmarks. Wastes power if load is variable. |
| powersave | Always min frequency | Battery saving, idle servers. Terrible for anything needing actual CPU time. |
| ondemand | Scale with CPU load (reactive) | General purpose (legacy). Samples load on a timer, so it reacts with a delay. |
| conservative | Scale gradually (ramp up/down) | Smooth frequency transitions. Less aggressive than ondemand. |
| schedutil | Scheduler-integrated (proactive) | Modern default, best for most workloads. Uses scheduler utilization signals directly, so it knows about load changes before they show up in sampling. |
| userspace | Manual control from user space | Custom tuning, testing. You set the frequency explicitly. |
Why schedutil wins: Older governors like ondemand sample CPU utilization on a timer (every 10-50ms). They are reactive: they see high load only after it has already happened. schedutil is integrated with CFS (the Completely Fair Scheduler) and is driven by the scheduler’s own utilization tracking, so it is invoked whenever those utilization signals are updated instead of waiting for a separate sampling timer. It can raise the frequency as soon as the scheduler sees load building, making it both faster and more accurate. On kernels 4.7+, prefer schedutil unless you have a specific reason not to.

Setting Frequency

# Set governor for all CPUs using a shell loop
# This writes the policy to each CPU's sysfs entry
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo performance > "$cpu/cpufreq/scaling_governor"
done

# Using cpupower (preferred -- handles edge cases)
cpupower frequency-set -g performance

# Set specific frequency (only works with userspace governor)
cpupower frequency-set -g userspace
cpupower frequency-set -f 2.4GHz

# Set min/max frequency bounds (works with any governor)
# Useful for clamping frequency to a range without using 'performance'
echo 2400000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
echo 3600000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
# If you set min = max, you effectively pin the frequency regardless of governor
Debugging tip: If writing to scaling_governor returns “Operation not permitted” or “Invalid argument” even as root, check whether the intel_pstate driver is active and list the governors it offers with cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors. When intel_pstate runs in active mode (with or without HWP, Hardware P-states), only the performance and powersave governors are available, and the driver or the hardware itself handles fine-grained scaling. To get the full governor list back, boot with intel_pstate=passive or intel_pstate=disable.
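One way to avoid the “Invalid argument” surprise is to check scaling_available_governors before writing. The sketch below assumes a hypothetical helper name, set_governor, and takes the sysfs root as an optional second argument so it can be exercised against a test directory.

```shell
# Set a governor on every cpufreq policy that actually offers it.
# Returns non-zero if any policy does not list the requested governor.
set_governor() {
    gov="$1"
    root="${2:-/sys/devices/system/cpu}"
    rc=0
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -r "$dir/scaling_available_governors" ] || continue
        # Pad with spaces so only whole governor names match
        case " $(cat "$dir/scaling_available_governors") " in
            *" $gov "*)
                echo "$gov" > "$dir/scaling_governor"
                ;;
            *)
                echo "skip $dir: '$gov' not offered by this driver" >&2
                rc=1
                ;;
        esac
    done
    return "$rc"
}
```

Run as root, set_governor schedutil would apply the governor where it exists and report the policies where the active driver does not provide it.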

Intel P-State Driver

Modern Intel CPUs use the intel_pstate driver instead of the generic ACPI cpufreq driver. This driver communicates directly with the CPU via MSRs rather than going through ACPI, giving faster and more precise control.
# Check if intel_pstate is active
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
# intel_pstate  -- means intel_pstate is managing frequency
# acpi-cpufreq  -- means the generic ACPI driver is in use

# Intel pstate specific controls
ls /sys/devices/system/cpu/intel_pstate/
# max_perf_pct  min_perf_pct  no_turbo  status  turbo_pct

# Disable turbo boost -- useful for CONSISTENT performance
# Turbo introduces frequency variance (jitter), which hurts
# latency-sensitive workloads more than the extra MHz helps
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

# Set performance limits (percentage of max frequency)
# 100% means "allow up to max turbo frequency"
echo 100 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
echo 50 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
# min_perf_pct=50 on a 4GHz max CPU means never go below ~2GHz

# Check what percentage of frequency range is turbo
cat /sys/devices/system/cpu/intel_pstate/turbo_pct
# e.g., 33 means the turbo P-states make up 33% of the supported P-state range
HWP (Hardware P-State) mode: On Skylake and newer Intel CPUs, the kernel can delegate frequency selection to the hardware using HWP. The CPU has its own internal governor that reacts on a much shorter timescale than the kernel’s millisecond-scale governors. When HWP is active, the kernel sets minimum and maximum performance bounds plus a hint called EPP (Energy Performance Preference), and the hardware picks the operating point within those bounds. Check with: grep -c "hwp" /proc/cpuinfo (a non-zero count means the CPU advertises HWP).

CPU Idle States (cpuidle)

While cpufreq controls what happens when the CPU is working, cpuidle controls what happens when the CPU has nothing to do. On a typical server, CPUs are idle 60-90% of the time even under moderate load, so idle state management has an enormous impact on power consumption.

C-States: Idle States

Think of C-states as sleep depths. C0 is wide awake (executing). C1 is dozing (instant wakeup). C6 is deep sleep (takes time to wake). The deeper the sleep, the more power you save, but the longer it takes to get back to work.
+-----------------------------------------------------------------------------+
|                    CPU C-STATES (Idle States)                                |
+-----------------------------------------------------------------------------+
|                                                                              |
|  C-State    Name           Power        Exit Latency    Description          |
|  -------------------------------------------------------------------------- |
|  C0         Active         High         0 us            CPU executing        |
|  C1         Halt           Low          1-2 us          Clock gated          |
|  C1E        Enhanced Halt  Lower        2-5 us          + voltage reduction  |
|  C3         Sleep          Very Low     50-100 us       L1/L2 flushed        |
|  C6         Deep Sleep     Minimal      100-200 us      Core powered off     |
|  C7/C8/C9   Deeper         Ultra Low    200-500 us      Package states       |
|                                                                              |
|  The critical trade-off:                                                     |
|  - Deeper states = more power savings                                        |
|  - Deeper states = higher wake-up latency                                    |
|  - C3+ flushes L1/L2 cache, so first instructions after wake run COLD       |
|  - Must balance power vs. latency requirements                               |
|                                                                              |
|  For latency-sensitive apps (HFT, real-time): Limit to C1/C1E               |
|  For power efficiency (batch, background): Allow all C-states                |
|                                                                              |
+-----------------------------------------------------------------------------+
The hidden cost of deep C-states: When a CPU enters C3 or deeper, the L1 and L2 caches are flushed. When it wakes up, the first few hundred microseconds of execution suffer cache misses. This “cache cold” penalty is not captured in the exit latency number — the 100us exit latency for C6 only measures the time to become electrically active, not the time to warm up the caches. For workloads with frequent short idle periods (like event-driven servers), the effective latency penalty of deep C-states can be 3-5x the advertised exit latency.

Viewing C-State Information

# Available idle states for CPU 0
# Each state directory contains metadata about one C-state
ls /sys/devices/system/cpu/cpu0/cpuidle/
# state0/ state1/ state2/ state3/

# State details -- inspect each field to understand behavior
cat /sys/devices/system/cpu/cpu0/cpuidle/state2/name    # C6
cat /sys/devices/system/cpu/cpu0/cpuidle/state2/latency # 200 (microseconds to exit)
cat /sys/devices/system/cpu/cpu0/cpuidle/state2/usage   # 1234567 (times entered)
cat /sys/devices/system/cpu/cpu0/cpuidle/state2/time    # 98765432 (total us spent)
# Tip: time/usage = average residency. If avg residency << target residency,
# the cpuidle governor is choosing a state that is too deep for the workload.

# Current cpuidle governor (selects which C-state to enter)
cat /sys/devices/system/cpu/cpuidle/current_governor
# menu   -- the default, uses expected idle duration heuristics
# teo    -- Timer Events Oriented, better for workloads with predictable timers
# haltpoll -- spins in a poll loop before entering C-states (good for VMs)

# Using turbostat to see C-state residency across all CPUs
# This is the single most useful command for power analysis
sudo turbostat --show Core,CPU,Busy%,Avg_MHz,C1%,C3%,C6%,C7% --interval 1
# C6% = 85 means the CPU spent 85% of that second in C6 deep sleep
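The time/usage calculation from the tip above can be scripted per idle state. This is a sketch under the assumption that each state directory exposes name, usage, time, and (on most kernels) residency; idle_residency is a hypothetical helper name.

```shell
# Print "name avg_residency_us target_residency_us" for each idle state of one CPU.
# A "-" in the last column means the kernel does not expose a residency file.
idle_residency() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state[0-9]*; do
        [ -r "$state/usage" ] || continue
        usage=$(cat "$state/usage")
        total=$(cat "$state/time")
        if [ "$usage" -gt 0 ]; then
            avg=$((total / usage))   # integer microseconds per entry
        else
            avg=0                    # state never entered
        fi
        target=$(cat "$state/residency" 2>/dev/null || echo -)
        echo "$(cat "$state/name") $avg $target"
    done
}
```

A state whose average residency sits well below its target residency is a candidate for disabling or for a PM QoS constraint.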

Limiting C-States for Low Latency

# Disable deeper C-states (disable state 2 and beyond on all CPUs)
# State numbering varies by platform and driver: on x86, state0 is usually
# POLL and state1 is C1, but check each state's name file before assuming more
for state in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]; do
    echo 1 > "$state/disable"
done
# This takes effect immediately -- no reboot needed

# Kernel parameter to limit C-states at boot (persistent across reboots)
# Add to GRUB_CMDLINE_LINUX in /etc/default/grub:
intel_idle.max_cstate=1  # Limits the intel_idle driver
processor.max_cstate=1   # Limits the ACPI processor driver
# Then run: update-grub && reboot

# Using PM QoS to set latency constraint (application-level control)
# This is the BEST approach for applications because it is declarative:
# "I need at most 10us wake-up latency" and the kernel figures out which
# C-states to avoid to meet that constraint
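# Note: a plain `echo 10 > /dev/cpu_dma_latency` does NOT hold the constraint,
# because the shell closes the file as soon as echo returns.
exec 3> /dev/cpu_dma_latency   # Hold the device open on fd 3
printf '0x%08x' 10 >&3         # 10 us, written as a 10-character hex string
# ... run the latency-sensitive work, then release with: exec 3>&-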

# Programmatic PM QoS in C (what tuned and DPDK do internally):
# int fd = open("/dev/cpu_dma_latency", O_RDWR);
# int latency = 10;  // microseconds -- kernel disables C-states > 10us exit
# write(fd, &latency, sizeof(latency));
# // IMPORTANT: Keep fd open as long as low latency is needed.
# // Closing fd removes the constraint. This is intentional --
# // it prevents stale constraints from permanently wasting power.
PM QoS is the production-grade approach. Setting max_cstate=1 in GRUB is a blunt instrument that affects the entire machine permanently. PM QoS lets individual applications request low latency only when they need it, and the constraint automatically cleans up when the application exits (fd closes). This is why DPDK, SPDK, and real-time audio applications use PM QoS rather than boot parameters.
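From a shell, the same “constraint lives as long as the fd” behavior can be wrapped around a single command. This is a sketch, not how tuned or DPDK do it: with_cpu_latency is a hypothetical helper, the device path is passed explicitly, and the value is written in the 10-character hex string form that the kernel’s PM QoS interface accepts as an alternative to a binary 32-bit integer.

```shell
# Run a command while holding a CPU wake-up latency constraint.
# usage: with_cpu_latency <microseconds> <device> <command> [args...]
with_cpu_latency() {
    latency_us="$1"
    dev="$2"
    shift 2
    (
        # fd 3 stays open for the lifetime of this subshell only
        exec 3> "$dev" || exit 1
        printf '0x%08x' "$latency_us" >&3
        "$@"
    )
    # When the subshell exits, fd 3 closes and the kernel drops the constraint
}
```

For example, with_cpu_latency 10 /dev/cpu_dma_latency ./myapp (run as root) keeps the 10 us limit in force only while ./myapp runs, and the function returns the command’s exit status.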

Thermal Management

The thermal subsystem is the kernel’s safety system. Its job is to prevent hardware damage from overheating, ideally without the user noticing. It monitors temperature sensors, evaluates trip points, and activates cooling devices (fans, frequency throttling) as needed.

Thermal Zones

# List thermal zones -- each zone represents a temperature sensor
ls -d /sys/class/thermal/thermal_zone*/
# thermal_zone0/ thermal_zone1/ ...

# View temperature (millidegrees Celsius)
cat /sys/class/thermal/thermal_zone0/temp
# 45000 (= 45.000 degrees C)
# Tip: Some VMs report 0 or -1 here because the hypervisor does not
# expose physical sensors. This is normal in cloud environments.

# View zone type -- tells you WHAT is being measured
cat /sys/class/thermal/thermal_zone0/type
# x86_pkg_temp   -- overall CPU package temperature
# coretemp        -- per-core temperature
# acpitz         -- ACPI thermal zone (motherboard sensor)

# View trip points -- the temperature thresholds that trigger actions
cat /sys/class/thermal/thermal_zone0/trip_point_0_temp  # 85000 (85C)
cat /sys/class/thermal/thermal_zone0/trip_point_0_type  # passive
cat /sys/class/thermal/thermal_zone0/trip_point_1_temp  # 95000 (95C)
cat /sys/class/thermal/thermal_zone0/trip_point_1_type  # critical
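To survey every zone at once, the per-file reads can be folded into a loop. A minimal sketch, assuming a hypothetical helper name thermal_summary and an optional root argument for testing against a copied tree:

```shell
# Print "zone type temperature" for every thermal zone, converting
# millidegrees Celsius to degrees with one decimal.
thermal_summary() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone[0-9]*; do
        # Some zones exist but return an error on read; skip those
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        celsius=$(LC_ALL=C awk -v m="$milli" 'BEGIN { printf "%.1f", m / 1000 }')
        echo "$(basename "$zone") $(cat "$zone/type") ${celsius}C"
    done
}
```

On a VM with no exposed sensors the loop finds nothing to read and prints no lines.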

Thermal Trip Points

+-----------------------------------------------------------------------------+
|                    THERMAL TRIP POINTS                                       |
+-----------------------------------------------------------------------------+
|                                                                              |
|  Temperature                                                                 |
|       |                                                                      |
|  95C  +----------------------------------------- CRITICAL                   |
|       |                                          - Emergency shutdown        |
|       |                                          - Prevent physical damage   |
|       |                                          - Cannot be overridden      |
|       |                                                                      |
|  85C  +----------------------------------------- PASSIVE                    |
|       |                                          - CPU throttling begins     |
|       |                                          - Reduce frequency/voltage  |
|       |                                          - Visible as perf drop      |
|       |                                                                      |
|  75C  +----------------------------------------- ACTIVE                     |
|       |                                          - Increase fan speed        |
|       |                                          - No performance impact     |
|       |                                                                      |
|       |                                                                      |
|  45C  +----------------------------------------- NORMAL                     |
|       |                                          - Normal operation          |
|       |                                          - All features available    |
|       |                                                                      |
|       +------------------------------------------------------------->       |
|                                                                              |
+-----------------------------------------------------------------------------+
The kernel’s thermal framework uses a control loop: when temperature rises above a trip point, it activates a cooling device. The cooling device reduces heat generation (by throttling CPU frequency) or increases heat removal (by speeding up fans). When temperature drops, the cooling device is deactivated. This loop runs every polling_interval milliseconds (typically 1000-5000ms).

Cooling Devices

# List cooling devices -- these are the actuators the thermal framework controls
ls -d /sys/class/thermal/cooling_device*/

# View cooling device type -- tells you WHAT kind of cooling it provides
cat /sys/class/thermal/cooling_device0/type
# intel_powerclamp  -- injects idle time to reduce power (and thus heat)
# Processor          -- reduces CPU frequency (passive cooling)
# Fan               -- controls fan speed (active cooling)

# Current and max cooling state
# state 0 = no cooling, max_state = maximum cooling effort
cat /sys/class/thermal/cooling_device0/cur_state    # 0
cat /sys/class/thermal/cooling_device0/max_state    # 10
# For a Processor cooling device, higher state = lower frequency
# For a Fan, higher state = faster fan speed

# Which cooling device is bound to this zone? (cdev0 links to the bound device)
cat /sys/class/thermal/thermal_zone0/cdev0/type

Monitoring Thermal Throttling

Thermal throttling is the number one cause of mysterious intermittent performance degradation in production. It is silent, it is transient, and it rarely shows up in application-level metrics.
# Using turbostat -- best single tool for detecting throttling
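# Watch PkgTmp (package temp) and compare Bzy_MHz to the expected frequency
sudo turbostat --show Core,Busy%,Avg_MHz,Bzy_MHz,PkgTmp,PkgWatt --interval 1
# If PkgTmp is near the passive trip point (85C here) AND Bzy_MHz is dropping
# under steady load, you are being throttled (Avg_MHz also falls when the CPU is merely idle)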

# Check for throttling via MSR (Model Specific Register)
# MSR_THERM_STATUS (0x19C) has hardware-level throttling flags
sudo rdmsr 0x19C
# Bit 0: Thermal status (1 = currently throttling RIGHT NOW)
# Bit 1: Thermal status log (1 = has throttled since last cleared)
# Bit 1 is sticky -- it stays set until you clear it, so it catches
# transient throttling that happened between your checks

# Power measurement with perf
# This shows total energy consumed over a time window
# RAPL counters are package-wide, so perf needs system-wide mode (-a) here
perf stat -a -e power/energy-pkg/,power/energy-cores/ sleep 10

# Using sensors (lm-sensors package) -- human-friendly temperature display
sensors
# coretemp-isa-0000
# Core 0:       +45.0C  (high = +85.0C, crit = +95.0C)
# The (high) and (crit) values correspond to passive and critical trip points
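The two status bits described for register 0x19C can be decoded from the hex value that rdmsr prints. This is a small sketch with a hypothetical function name, decode_therm_status; it only interprets bits 0 and 1 as described above and ignores the rest of the register.

```shell
# Decode bit 0 (throttling now) and bit 1 (sticky log) from a hex MSR value.
# Accepts the value with or without a leading 0x.
decode_therm_status() {
    value=$((0x${1#0x}))
    now=$(( $value & 1 ))
    log=$(( ($value >> 1) & 1 ))
    echo "throttling_now=$now throttled_since_clear=$log"
}
```

Typical use would be decode_therm_status "$(sudo rdmsr 0x19C)", which requires the msr kernel module and msr-tools to be available.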
Cloud gotcha: In most cloud VMs, you cannot read thermal sensors or MSRs directly because the hypervisor virtualizes or hides them. Thermal throttling still happens at the physical host level, but from inside the guest it typically shows up only as unexplained performance drops. Also monitor steal% in top or mpstat: a spike means the hypervisor gave your vCPU’s time to something else, which points to host contention and often accompanies an overloaded, hot host. This is especially common on burstable instance types (AWS t3, GCP e2) that share physical cores.

Power Monitoring

RAPL (Running Average Power Limit)

RAPL is Intel’s interface for measuring and limiting power consumption. It provides fine-grained power data at the package, core, DRAM, and GPU levels. Think of it as a power meter built into the CPU.
# View power zones -- RAPL exposes a hierarchy of power domains
ls /sys/class/powercap/intel-rapl/

# Package power (entire CPU socket)
cat /sys/class/powercap/intel-rapl/intel-rapl:0/name           # package-0
cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj      # Cumulative energy
# energy_uj is in microjoules and wraps around (at the value in max_energy_range_uj)
# To get power in watts: sample energy_uj twice, divide delta by time
# Example: delta_uj=15000000 over 1 second = 15 watts

# DRAM power (memory subsystem)
cat /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/name  # dram (or core, depending on the platform)

# Power constraints (limits) -- what the hardware enforces
cat /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_0_power_limit_uw
# This is PL1 (Power Limit 1) -- the sustained power limit in microwatts
cat /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_0_time_window_us
# Time window over which PL1 is averaged

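# Using perf for power events -- convenient for measuring around one run of an application
# RAPL is package-wide, so -a (system-wide) is needed and other load on the socket is included
perf stat -a -e power/energy-pkg/,power/energy-cores/,power/energy-ram/ ./myapp
# Shows total energy consumed by the package while your application ran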

# Using turbostat for continuous monitoring
sudo turbostat --show PkgWatt,CorWatt,RAMWatt --interval 1
# PkgWatt = total package, CorWatt = CPU cores only, RAMWatt = DRAM
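The “sample energy_uj twice and divide” recipe, including counter wraparound, can be sketched as a small function. The name rapl_watts and its argument order are assumptions for illustration; the wrap handling uses the zone’s max_energy_range_uj value and is approximate to within one counter step.

```shell
# Compute average power in watts from two energy_uj samples.
# usage: rapl_watts <first_uj> <second_uj> <max_energy_range_uj> <seconds>
rapl_watts() {
    LC_ALL=C awk -v e1="$1" -v e2="$2" -v max="$3" -v secs="$4" 'BEGIN {
        delta = e2 - e1
        if (delta < 0) delta += max    # counter wrapped between samples
        printf "%.2f\n", delta / 1000000 / secs
    }'
}

# Example against the live system (usually needs root to read energy_uj):
#   zone=/sys/class/powercap/intel-rapl/intel-rapl:0
#   e1=$(cat "$zone/energy_uj"); sleep 1; e2=$(cat "$zone/energy_uj")
#   rapl_watts "$e1" "$e2" "$(cat "$zone/max_energy_range_uj")" 1
```

Keeping the arithmetic in awk avoids shell integer limits and gives a fractional result without extra tools.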

Power Capping

RAPL can also enforce power limits, which the CPU hardware implements by throttling frequency when the power budget is exceeded.
# Set power limit (microwatts) -- this is PL1 (sustained limit)
echo 65000000 > /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_0_power_limit_uw
# 65000000 uw = 65 watts. The CPU will throttle to stay at or below 65W
# averaged over the time window below.

# Set time window (microseconds) -- averaging period for the limit
echo 1000000 > /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_0_time_window_us
# 1000000 us = 1 second. Power is averaged over 1-second windows.
# Short windows = more aggressive throttling, less sustained burst capability
# Long windows = allow brief bursts above limit if average stays below
Power capping in multi-tenant environments: Operators of shared hosts can use RAPL power capping as a form of “noisy neighbor” protection. RAPL limits apply to a whole package, not to an individual VM, so if the tenants on a host together draw too much power, capping the package keeps the host within its power and thermal budget at the cost of lower frequency for everyone on it. This is one reason you can see “throttled” behavior on cloud instances that are not at 100% CPU: the physical host’s power budget is the constraint, not your VM’s CPU utilization.

Performance vs Power Trade-offs

Latency-Sensitive Workloads

For workloads where every microsecond matters (high-frequency trading, real-time audio/video, in-memory databases), you want maximum, predictable performance at the cost of higher power draw.
# Maximum performance configuration
# This is what a latency-performance tuned profile does internally

# 1. Set performance governor -- always run at max frequency
cpupower frequency-set -g performance

# 2. Disable turbo (for CONSISTENT performance)
# Turbo varies based on thermal headroom, creating frequency jitter
# A steady 3.2 GHz is better than 3.6 GHz that occasionally drops to 3.0 GHz
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

# 3. Disable deep C-states -- avoid wake-up latency and cache cold effects
for state in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]; do
    echo 1 > $state/disable
done

# 4. Pin frequency to max -- eliminate P-state transition latency
for cpu in /sys/devices/system/cpu/cpu*/cpufreq; do
    cat $cpu/cpuinfo_max_freq > $cpu/scaling_min_freq
done

# 5. Isolate CPUs for application -- prevent kernel scheduler interference
# Add to kernel boot parameters:
# isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7
# isolcpus: removes CPUs from general scheduling
# nohz_full: disables timer ticks on those CPUs
# rcu_nocbs: offloads RCU callbacks to other CPUs
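After applying steps 1 and 4, it can be worth checking that every CPU really ended up with the performance governor and a pinned frequency floor. The sketch below uses a hypothetical helper name, audit_pinned_max, and compares scaling_min_freq against scaling_max_freq (the “min = max” condition) since the kernel may clamp the floor when turbo is disabled.

```shell
# Report cpufreq policies that are not on the performance governor
# or whose frequency floor is not pinned to the ceiling.
# Returns non-zero if any policy is out of line.
audit_pinned_max() {
    root="${1:-/sys/devices/system/cpu}"
    rc=0
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -r "$dir/scaling_governor" ] || continue
        gov=$(cat "$dir/scaling_governor")
        min=$(cat "$dir/scaling_min_freq")
        max=$(cat "$dir/scaling_max_freq")
        if [ "$gov" != "performance" ] || [ "$min" != "$max" ]; then
            echo "$dir: governor=$gov scaling_min_freq=$min scaling_max_freq=$max"
            rc=1
        fi
    done
    return "$rc"
}
```

An empty output with a zero exit status means every policy found matches the intended low-latency configuration.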

Power-Efficient Workloads

For batch processing, CI/CD, background jobs, and anything where latency does not matter but cost does.
# Power saving configuration
# 1. Set schedutil governor (balances power/performance dynamically)
cpupower frequency-set -g schedutil

# 2. Enable all C-states (allow deepest sleep when idle)
for state in /sys/devices/system/cpu/cpu*/cpuidle/state*; do
    echo 0 > $state/disable 2>/dev/null
done

# 3. Enable turbo (uses power when needed, saves when not)
# With schedutil, turbo is only activated when load demands it
echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo

# 4. Set power cap if needed (hard ceiling on power draw)
echo 35000000 > /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_0_power_limit_uw
# 35W cap -- suitable for a low-priority background worker
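
The power limit file takes microwatts, which makes it easy to be off by a factor of 1000. A small sketch of the conversion, assuming the cap is expressed as a percentage of the CPU's TDP (the helper name rapl_cap_uw is hypothetical):

```shell
# Convert "percent of TDP" into microwatts, the unit expected by
# constraint_*_power_limit_uw.
#   watts * percent / 100 * 1,000,000  ==  watts * percent * 10,000
# rapl_cap_uw is a hypothetical helper name (assumption).
rapl_cap_uw() {
    tdp_watts=$1
    percent=$2
    echo $(( tdp_watts * percent * 10000 ))
}
```

For example, rapl_cap_uw 50 70 prints 35000000, the same 35W value written in step 4 above.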

Using tuned Profiles

tuned is a daemon that applies and manages system-wide tuning profiles. It is the recommended way to manage power/performance settings in production because it handles all the individual sysfs knobs in a coordinated, tested configuration.
# Install tuned
yum install tuned  # RHEL/CentOS
apt install tuned  # Debian/Ubuntu

# List available profiles
tuned-adm list

# Common profiles and roughly what they do (details vary by tuned version;
# the profile's tuned.conf is the authoritative list of settings):
# balanced              - Default. schedutil governor, all C-states enabled
# throughput-performance - performance governor, C-states limited, NUMA-aware
# latency-performance   - performance governor, C1 only, no turbo jitter
# powersave             - powersave governor, all C-states, turbo off
# virtual-guest         - Optimized for VMs (shorter C-state residency targets)
# virtual-host          - Optimized for hypervisors (memory huge pages, etc.)

# Set profile -- takes effect immediately
tuned-adm profile latency-performance

# Check current profile
tuned-adm active

# Verify what tuned actually changed
tuned-adm verify
# This compares current system state against the profile's expected values
# Useful for debugging when someone manually overrode a setting
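
For a scripted drift check, the active profile can be compared against the one a host is supposed to run. This sketch assumes tuned-adm active prints a line of the form "Current active profile: NAME"; the function names are hypothetical.

```shell
# Extract the profile name from `tuned-adm active` output on stdin.
# Assumes the "Current active profile: NAME" line format (assumption).
active_profile() {
    sed -n 's/^Current active profile: //p'
}

# Succeed only if the profile read from stdin equals the expected one.
profile_matches() {
    expected=$1
    actual=$(active_profile)
    [ "$actual" = "$expected" ]
}
```

Typical use would be `tuned-adm active | profile_matches latency-performance || echo "profile drift"`.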

Cloud and Container Considerations

Container Power Impact

Containers do not have direct access to power management hardware — they share the host kernel’s power configuration. But container scheduling decisions have a significant indirect impact on power consumption.
# Containers do not directly manage power, but:
# 1. CPU limits affect power indirectly (less CPU time = less active power)
# 2. Throttled containers are descheduled until the next CFS period, so the same
#    work is stretched over more wall-clock time and more idle/wake transitions
# 3. Poorly-packed containers leave CPUs idle, increasing C-state transitions

# Check if a container is being CPU-throttled (cgroup v1 path shown;
# on cgroup v2 the file reports throttled_usec in microseconds instead)
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.stat
# nr_throttled 12345      -- number of times the container was throttled
# throttled_time 67890123456  -- total nanoseconds spent throttled
# If throttled_time is high, the container is hitting its CPU limit
# and the work it is trying to do takes longer, wasting power on overhead
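
Raw counters are hard to judge on their own, so it helps to express throttling as a share of scheduling periods. A minimal sketch, assuming the cpu.stat file also contains an nr_periods line (the helper name throttle_percent is hypothetical):

```shell
# Print the percentage of CFS periods in which the cgroup was throttled,
# computed as nr_throttled / nr_periods from a cpu.stat file.
# throttle_percent is a hypothetical helper name (assumption).
throttle_percent() {
    awk '$1 == "nr_periods"   { periods = $2 }
         $1 == "nr_throttled" { throttled = $2 }
         END {
             if (periods > 0) printf "%.1f\n", 100 * throttled / periods
             else             print "0.0"
         }' "$1"
}
```

Pass the path of the container's cpu.stat file as the only argument; a persistently high percentage suggests the CPU limit is undersized for the workload.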
# Kubernetes pod with power-aware settings
# Setting requests=limits gives Guaranteed QoS and a predictable CPU share and power draw
apiVersion: v1
kind: Pod
metadata:
  name: power-aware
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest  # placeholder image (assumption); a Pod needs one
    resources:
      requests:
        cpu: "2"       # Request 2 full cores (guaranteed)
      limits:
        cpu: "2"       # Limit equals request: no bursting above 2 cores
    # NUMA alignment is not set here; it comes from the kubelet's Topology Manager policy
    # Accessing remote NUMA memory costs ~2x latency and extra power
  nodeSelector:
    kubernetes.io/arch: amd64
  # topology.kubernetes.io/zone can be used for power-zone-aware placement
Cost optimization insight: In Kubernetes, setting CPU requests equal to limits (Guaranteed QoS) can look counter-intuitive for power efficiency. You would think “give the container only what it needs” saves power. But Burstable QoS containers experience frequent throttle/unthrottle cycles, which cause P-state and C-state transitions that waste power. Guaranteed QoS with right-sized limits often uses less total energy for the same work because the CPU can operate steadily at an efficient P-state. A Guaranteed pod is still subject to its CPU quota, so this only holds when the limit matches real demand.

Cloud Instance Types

+-----------------------------------------------------------------------------+
|                    CLOUD POWER CONSIDERATIONS                                |
+-----------------------------------------------------------------------------+
|                                                                              |
|  Instance Type        Power Profile          Notes                          |
|  -------------------------------------------------------------------------- |
|  General Purpose      Balanced               C-states enabled, schedutil    |
|  (m5/m6i, n2)        ~$0.10-0.19/hr         Good default for most work     |
|                                                                              |
|  Compute Optimized    Max Performance        Often performance governor     |
|  (c5/c6i, c2)        Higher power draw       Turbo boost active             |
|                       ~$0.08-0.17/hr                                        |
|                                                                              |
|  Memory Optimized     Balanced               Large memory, moderate CPU     |
|  (r5/r6i, m2)        ~$0.13-0.25/hr         CPU often idle (memory-bound)  |
|                                                                              |
|  Burstable            Power efficient        Credits for bursts             |
|  (t3/t4g, e2-micro)  Deep C-states           Cheapest, but steal time risk  |
|                       ~$0.01-0.04/hr                                        |
|                                                                              |
|  High Frequency       Max single-thread      Turbo boost pinned, high TDP  |
|  (z1d, c6i.metal)    ~$0.19-0.37/hr         For workloads that need GHz    |
|                                                                              |
|  Spot/Preemptible     Variable               Reclaimed for capacity         |
|                       60-90% discount         Power profile of base type     |
|                                                                              |
|  Key: Match instance type to workload power profile for cost efficiency     |
|                                                                              |
+-----------------------------------------------------------------------------+

Debugging Power Issues

High CPU Temperature

# 1. Check current temperature across all zones
cat /sys/class/thermal/thermal_zone*/temp
# Compare against trip points. If close to passive trip, throttling is imminent.

# 2. Check for active throttling with turbostat
sudo turbostat --show Core,Busy%,Avg_MHz,PkgTmp,PkgWatt --interval 1
# Key signal: if PkgTmp > 80 and Avg_MHz is dropping below base frequency,
# thermal throttling is active

# 3. Identify the process causing the heat
top -o %CPU
# Look for processes consuming > 100% CPU (multi-threaded)
# Also check: is it legitimate work or a runaway process?

# 4. Check cooling device states
for cooling in /sys/class/thermal/cooling_device*; do
    echo "$cooling: $(cat $cooling/type) state: $(cat $cooling/cur_state)/$(cat $cooling/max_state)"
done
# If Processor cooling is at max_state, the kernel has already throttled
# as much as it can -- the problem is insufficient cooling or excessive load

# 5. Check for thermal trips in kernel logs
dmesg | grep -i thermal
# Look for "above threshold" or "critical temperature reached" messages
# These indicate the thermal framework has been actively intervening
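
Step 1 says to compare the temperature against the trip points. That comparison can be sketched as below, assuming the zone exposes trip_point_N_type and trip_point_N_temp files in millidegrees Celsius (the function name passive_headroom_mc is hypothetical):

```shell
# Print how many millidegrees C remain before the first "passive" trip
# point of a thermal zone. A negative number means the zone is already
# past the trip. Returns 1 if the zone has no passive trip point.
# passive_headroom_mc is a hypothetical helper name (assumption).
passive_headroom_mc() {
    zone=$1
    temp=$(cat "$zone/temp")
    for type_file in "$zone"/trip_point_*_type; do
        [ -f "$type_file" ] || continue
        if [ "$(cat "$type_file")" = "passive" ]; then
            trip=$(cat "${type_file%_type}_temp")
            echo $(( trip - temp ))
            return 0
        fi
    done
    return 1
}
```

For example, `passive_headroom_mc /sys/class/thermal/thermal_zone0` printing 13000 would mean 13 degrees C of headroom before passive cooling starts.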

Unexpected Frequency Drops

When the CPU runs slower than expected, the cause is one of four things: thermal throttling, power limit throttling, governor policy, or turbo being disabled. Here is how to isolate which:
# 1. Is it thermal? Check throttle counts
cat /sys/devices/system/cpu/cpu0/thermal_throttle/core_throttle_count
# If this number is increasing, thermal throttling is the cause

# 2. Is it power limit? Check RAPL constraints
cat /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_0_power_limit_uw
# If the power limit is set below the CPU's TDP, power limit throttling
# will cap frequency even when the CPU is thermally fine

# 3. Is it the governor? Check policy
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# If 'powersave', frequency will stay at minimum regardless of load

# 4. Is turbo disabled?
cat /sys/devices/system/cpu/intel_pstate/no_turbo
# 1 = turbo disabled. On a CPU with 2.4 GHz base and 3.6 GHz turbo,
# disabling turbo caps frequency at 2.4 GHz, one third below the 3.6 GHz peak

# 5. Comprehensive real-time monitoring
sudo turbostat --show Core,CPU,Avg_MHz,Busy%,Bzy_MHz,TSC_MHz,PkgTmp --interval 1
# Bzy_MHz = frequency when the CPU is actually busy
# If Bzy_MHz < expected max, something is limiting frequency
# TSC_MHz = constant reference frequency (useful as a baseline)
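
"Is this number increasing?" in step 1 requires two readings. A minimal sketch of that sampling, with a hypothetical helper name and a default interval chosen only for illustration:

```shell
# Read a sysfs counter twice, `interval` seconds apart, and print how
# much it grew. A non-zero result for a throttle counter means
# throttling happened during the sampling window.
# counter_delta and the 5-second default are assumptions.
counter_delta() {
    file=$1
    interval=${2:-5}
    before=$(cat "$file")
    sleep "$interval"
    after=$(cat "$file")
    echo $(( after - before ))
}
```

For example, `counter_delta /sys/devices/system/cpu/cpu0/thermal_throttle/core_throttle_count 10` samples the thermal throttle counter over ten seconds.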

High Power Consumption

# 1. Monitor power draw -- establish a baseline first
sudo turbostat --show PkgWatt,CorWatt,RAMWatt --interval 1
# Idle server: expect 10-30W package. Fully loaded: near TDP (65-280W).
# If idle power is high, something is preventing deep C-states.

# 2. Check which processes are consuming CPU
perf top
# Sort by overhead% to find the hot functions

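# 3. Check C-state usage and residency -- are deep C-states being used?
cat /sys/devices/system/cpu/cpu*/cpuidle/state*/usage   # entry counts
cat /sys/devices/system/cpu/cpu*/cpuidle/state*/time    # residency in microseconds
# If the deepest state (often state3 or higher, e.g. C6) shows usage 0 while
# state1 (typically C1) is high, deep C-states are disabled or the workload
# is too busy to idle. Check each state's name file; numbering varies by CPU.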

# 4. Check for unnecessary background activity (timer storms, polling)
sudo perf record -g -a sleep 10
sudo perf report
# Look for periodic wakeups -- these are functions that pull CPUs out of
# deep C-states. Common culprits: monitoring agents polling every 1s,
# applications with aggressive keep-alive timers, or a busy kernel timer
Debugging checklist for “why is my server drawing more power than expected?”:
  1. Check C-state residency (are deep states being reached?)
  2. Check for CPU-polling processes (100% CPU in top even on “idle” servers)
  3. Check if monitoring agents are causing frequent wakeups (powertop -t 10)
  4. Check if a tuned profile is overriding C-state settings
  5. Check if PM QoS constraints are preventing deep idle (read /dev/cpu_dma_latency, for example with hexdump; the value is a raw 32-bit integer in microseconds)
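
The first checklist item can be made concrete by turning the per-state residency counters into percentages. This is a sketch under two assumptions: state names contain no spaces, and the time files hold cumulative microseconds since boot, so the result is each state's share of total idle time rather than of wall-clock time. The function name is hypothetical.

```shell
# For one CPU, print each idle state's share of total idle residency,
# based on the cumulative `time` counters (microseconds).
# cstate_residency is a hypothetical helper name (assumption).
cstate_residency() {
    cpuidle_dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$cpuidle_dir"/state[0-9]*; do
        [ -d "$state" ] || continue
        echo "$(cat "$state/name") $(cat "$state/time")"
    done | awk '{ name[NR] = $1; t[NR] = $2; total += $2 }
                END {
                    for (i = 1; i <= NR; i++)
                        printf "%s %.1f%%\n", name[i], (total > 0 ? 100 * t[i] / total : 0)
                }'
}
```

If the deepest state's share stays near zero on a server that is supposed to be idle, something is holding the CPU in shallow states.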

Interview Questions

Answer: For latency-sensitive workloads, minimize wake-up latency:
  1. Disable deep C-states: Prevent 100+ us wake latencies
# Kernel boot parameter:
intel_idle.max_cstate=1
# or disable at runtime via /sys/devices/system/cpu/cpu*/cpuidle/state*/disable
  2. Lock CPU frequency: Prevent P-state transitions
cpupower frequency-set -g performance
# Set min freq = max freq
  3. Consider disabling turbo: Turbo varies, causing jitter
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
  4. Use PM QoS: Programmatically request low latency
int fd = open("/dev/cpu_dma_latency", O_RDWR);
int32_t latency = 0;  // Request zero latency (needs <fcntl.h>, <stdint.h>, <unistd.h>)
write(fd, &latency, sizeof(latency));
// Keep fd open: the request is dropped as soon as the file is closed
  5. Isolate CPUs: Prevent kernel interference
isolcpus=2-7 nohz_full=2-7
Trade-off: Higher power consumption, but consistent low latency.
Answer:
Causes:
  • Sustained high CPU utilization pushing package power to or past TDP
  • Inadequate cooling (failed fan, blocked vents)
  • High ambient temperature
  • Too many active cores at high frequency
Detection:
# Check temperature
cat /sys/class/thermal/thermal_zone0/temp  # millidegrees

# Check throttle count
cat /sys/devices/system/cpu/cpu0/thermal_throttle/package_throttle_count

# Real-time monitoring
sudo turbostat --show Core,CPU,Avg_MHz,PkgTmp --interval 1
# If Avg_MHz drops when temp is high -> throttling

# Check dmesg for thermal warnings
dmesg | grep -i "thermal\|throttl"
Solutions:
  • Improve cooling (check fans, thermal paste)
  • Reduce power limit via RAPL
  • Spread work across more cores (lower per-core heat)
  • Check for runaway processes
Answer:
P-states (Performance States):
  • Control frequency and voltage when CPU is active
  • Higher P-number = lower frequency = lower power
  • Transition time: ~10-100 us
  • Managed by: cpufreq governors
C-states (Idle States):
  • Control power when CPU is idle
  • Higher C-number = deeper sleep = lower power
  • Deeper states take longer to wake from
  • Managed by: cpuidle governors
When each matters:
  • High load: P-states matter (CPU always active)
  • Low/variable load: C-states matter (CPU frequently idle)
  • Latency-sensitive: Limit both (avoid transitions)
Example: Web server handling requests
  • Between requests: CPU enters C-states (saves power)
  • During request: CPU at appropriate P-state
  • With C6 disabled: Faster wake-up, higher power draw

Interview Deep-Dive

Strong Answer:
  • I would start by establishing whether the spikes are caused by thermal throttling, power limit throttling, or C-state wake-up latency. These are the three power-related mechanisms that cause latency spikes, and each has a distinct signature.
  • First, I would run turbostat --show Core,Avg_MHz,Bzy_MHz,PkgTmp,PkgWatt,C1%,C6% --interval 1 on the database nodes during a spike window. If PkgTmp exceeds 80-85C and Bzy_MHz drops below the expected base frequency, thermal throttling is the culprit. If PkgWatt is at or near the RAPL power limit and Bzy_MHz drops, it is power limit throttling. If C6% is high and spikes correlate with transitions from idle to active, the issue is C-state wake-up latency.
  • If it is thermal: the overnight batch jobs are increasing ambient rack temperature or sharing physical cooling infrastructure. I would check sensors output over time (logged to a monitoring system), correlate with the batch job schedule, and work with the data center team to verify rack inlet temperatures. Solutions include spreading batch work across more racks, adjusting RAPL power limits on the batch servers to reduce their heat output, or physically relocating the database servers.
  • If it is power limit throttling: the host may be enforcing a shared power budget across VMs. I would check cat /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_0_power_limit_uw and compare it to the CPU TDP. In cloud environments, I would check steal time via mpstat — rising steal time during batch hours means the hypervisor is reclaiming CPU cycles.
  • If it is C-state wake-up: the database workload has idle periods long enough for the CPU to enter C6 (100-200us exit latency), and the spikes are the cache-cold penalty after waking. I would set PM QoS to 10us via /dev/cpu_dma_latency on the database servers, which constrains CPUs to C1 maximum. This trades approximately 10-20W more power per server for elimination of the deep-sleep wake-up penalty.
  • The key insight an interviewer wants to hear: 50ms spikes are too long to be a single C-state transition (those are microseconds). So it is likely either sustained thermal throttling during a period, or multiple C-state transitions compounding with cache cold effects across several cores simultaneously. I would use perf sched latency to see if scheduler delays correlate with the spikes, confirming whether the CPU was actually unavailable or just slow.
Follow-up: How would you approach this if the servers are in a cloud environment where you cannot access thermal sensors or RAPL?
Follow-up Answer:
  • In a cloud VM, I lose direct visibility into thermals and power, but I can measure their effects indirectly. I would monitor CPU steal time (mpstat -P ALL 1), which reflects the hypervisor reclaiming cycles — often caused by host-level power or thermal management. I would also track actual achieved CPU frequency using turbostat (which works in some cloud environments) or by timing a calibrated compute loop to detect frequency changes. If steal time correlates with the batch window, I would request host migration to a different physical server, switch to a dedicated/metal instance type, or schedule my latency-sensitive work to avoid the batch window. In AWS, I would also check if I am on a burstable instance type (t3) that has exhausted its CPU credits, which presents identically to power throttling.
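
The triage rules from the strong answer above can be written down as a small decision function. This is only a sketch: the 80C threshold comes from the answer, while treating "near the RAPL limit" as 95% of the limit and "C6% is high" as 50% or more are assumptions, as is the function name.

```shell
# Classify one turbostat sample according to the three signatures above.
# Arguments: PkgTmp(C) Bzy_MHz base_MHz PkgWatt RAPL_limit_W C6%
# The 0.95 and 50 thresholds are illustrative assumptions.
classify_spike() {
    awk -v tmp="$1" -v bzy="$2" -v base="$3" \
        -v watt="$4" -v limit="$5" -v c6="$6" 'BEGIN {
        if (tmp >= 80 && bzy < base)                 print "thermal"
        else if (watt >= 0.95 * limit && bzy < base) print "power-limit"
        else if (c6 >= 50)                           print "cstate-wakeup"
        else                                         print "inconclusive"
    }'
}
```

For instance, `classify_spike 86 2100 2400 120 165 5` prints thermal, because the package is hot and the busy frequency sits below base.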
Strong Answer:
  • The fundamental tension is that latency-sensitive workloads want maximum, consistent performance (performance governor, C1 only, isolated CPUs), while batch and background workloads benefit from power efficiency (schedutil, deep C-states, turbo bursts). Running both on the same physical hosts means one workload’s power configuration hurts the other.
  • My strategy uses node pools with distinct power profiles. I would create three node pools: (1) “latency” nodes with the latency-performance tuned profile — performance governor, max_cstate=1, turbo disabled, isolcpus for the API server cores. (2) “compute” nodes with throughput-performance — performance governor but all C-states enabled, turbo enabled, because ML training is throughput-sensitive but not latency-sensitive. (3) “efficiency” nodes with balanced — schedutil governor, all C-states, RAPL power cap at 70% TDP, because log processing is neither latency nor throughput critical.
  • For Kubernetes integration, I would use node labels (power-profile=latency, power-profile=compute, power-profile=efficiency) and node affinity rules in pod specs. The API server deployments have requiredDuringSchedulingIgnoredDuringExecution affinity to latency nodes. ML training jobs prefer compute nodes. Log processors prefer efficiency nodes but can spill to compute nodes if needed.
  • For cost optimization, latency nodes should be on-demand (we need guaranteed performance), ML training jobs on spot/preemptible instances (checkpointed, can tolerate interruption), and log processors on spot instances in the efficiency pool.
  • The tuned profiles are applied via a DaemonSet that runs tuned-adm profile on node startup, keyed to the node’s power-profile label. This ensures that node replacements (spot interruptions, autoscaler scale-up) get the correct profile automatically.
  • An important subtlety: on the latency nodes, I would also configure Kubernetes CPU Manager with the static policy and set API server pods to Guaranteed QoS (requests == limits). This ensures the kubelet assigns exclusive CPUs to the pods, which combined with nohz_full on those CPUs, gives near-bare-metal latency characteristics.
Follow-up: What happens when the latency node pool is full but you need to schedule more API server pods?
Follow-up Answer:
  • This is where pod priority and preemption classes become essential. I would assign the API server pods a high PriorityClass. If latency nodes are full, the cluster autoscaler should add a new latency-profile node (with a startup tuned profile DaemonSet). However, cloud provider node provisioning takes 60-120 seconds, so I would also maintain a small buffer of warm standby latency nodes (cluster autoscaler min set 1-2 above current demand). If a new node truly cannot be provisioned in time, I would configure a preferredDuringSchedulingIgnoredDuringExecution affinity to compute nodes as a fallback — they at least have performance governor even if their C-state configuration is not ideal. The API server pods would emit a metric indicating they are running on a non-latency node, triggering an alert for the platform team to investigate capacity.
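
The DaemonSet described in the strong answer needs some mapping from the node's power-profile label to a tuned profile. A minimal sketch of that mapping, using the three label values and profiles named in the answer (the function name is hypothetical, and how the label value reaches the script is left open):

```shell
# Map a power-profile node label value to the tuned profile described
# in the answer. Unknown labels are rejected rather than guessed.
# profile_for_label is a hypothetical helper name (assumption).
profile_for_label() {
    case "$1" in
        latency)    echo latency-performance ;;
        compute)    echo throughput-performance ;;
        efficiency) echo balanced ;;
        *)          echo "unknown power-profile label: $1" >&2
                    return 1 ;;
    esac
}
```

The startup script would then run something like `tuned-adm profile "$(profile_for_label latency)"`. Failing on unknown labels keeps a mislabeled node from silently running the wrong profile.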
Strong Answer:
  • P-states control the operating frequency and voltage when the CPU is active. C-states control power consumption when the CPU is idle. Turbo boost is a mechanism where the CPU temporarily exceeds its base frequency by using the thermal and power headroom freed up by other cores being idle or in deep C-states. These three mechanisms are not independent — they share a common thermal and power budget (the TDP envelope).
  • The reason the performance governor can make an application slower is turbo boost budget exhaustion. Here is the mechanism: with schedutil or ondemand, not all cores are at max frequency simultaneously. Some cores are at lower P-states or in deep C-states. This frees up thermal and power budget, allowing the active cores to turbo higher — potentially 4.5+ GHz on a single core. When the developer switches to performance governor, ALL cores are pinned to max base frequency (say 3.2 GHz) simultaneously. The total package power draw is now at or near TDP. There is no headroom left for turbo boost. The application’s critical threads, which previously turboed to 4.5 GHz, are now stuck at 3.2 GHz. Net result: 30% lower single-thread performance.
  • This is especially pronounced for applications with a single hot thread and many idle or lightly-loaded threads. With schedutil, the idle threads let their cores sleep in C6, freeing ~10W per core of power budget. The hot thread’s core uses that budget to turbo. With performance governor, every core draws ~5W even with nothing to do (C1 only, full frequency), consuming the turbo budget. Note that the governor by itself does not disable deep C-states, so the effect is strongest when the switch also restricts idle states, as a latency-tuned profile does.
  • To verify this diagnosis, I would have the developer run turbostat --interval 1 under both configurations. Under schedutil, they should see the busy core at high Bzy_MHz (turbo range) with other cores showing high C6%. Under performance, they should see all cores at base Bzy_MHz with 0% C6. PkgWatt under performance will be higher, confirming the power budget is saturated.
  • The correct fix depends on the workload: if they need one thread to go as fast as possible, use schedutil and let the kernel manage the turbo budget. If they need consistent, predictable frequency across all cores, use performance but accept lower peak frequency. If they need both, isolate the hot thread’s core with isolcpus and set performance governor only on that core, while other cores use schedutil.
Follow-up: How does Intel’s Hardware P-state (HWP) change this dynamic?
Follow-up Answer:
  • HWP shifts frequency decision-making from the kernel to the CPU hardware itself. The CPU has an internal control loop that adjusts frequency at a much finer time granularity than the kernel’s millisecond-scale governor decisions. With HWP active, the kernel sets a desired performance range and writes the EPP (Energy Performance Preference) hint, a value from 0 (max performance) to 255 (max efficiency). The hardware then makes real-time frequency decisions within that range based on actual instruction throughput, thermal conditions, and power budget. HWP generally handles the turbo budget problem better than software governors because it can react to micro-idle periods (even within a busy workload) and redistribute power budget between cores faster than a software governor can. On modern Intel CPUs (Skylake+), the best approach is usually to leave HWP enabled, set the EPP to performance for latency-sensitive workloads and balance_performance for general workloads, and let the hardware optimize. You can check EPP via cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference.
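
Checking cpu0 alone can hide a mixed configuration, so a quick summary across all CPUs is handy. A sketch, assuming each CPU exposes the energy_performance_preference file mentioned above (the function name is hypothetical):

```shell
# Count how many CPUs use each EPP value. Prints "value count" lines.
# CPUs without the file (HWP not active) are silently skipped.
# epp_summary is a hypothetical helper name (assumption).
epp_summary() {
    cpu_root=${1:-/sys/devices/system/cpu}
    cat "$cpu_root"/cpu[0-9]*/cpufreq/energy_performance_preference 2>/dev/null |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

More than one line of output means different cores are running with different EPP hints, which is worth knowing before comparing per-core frequencies.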
Strong Answer:
  • I would collect four categories of power metrics per server, all available via sysfs and exported by a lightweight agent (node_exporter with custom collectors or a dedicated power agent): (1) RAPL energy counters (energy_uj from /sys/class/powercap/) sampled every 5 seconds, converted to watts. This gives package, core, and DRAM power. (2) CPU frequency and C-state residency from turbostat or /sys/devices/system/cpu/ — actual operating frequency, time spent in each C-state. (3) Thermal zone temperatures from /sys/class/thermal/ and throttle event counts from thermal_throttle/. (4) Cooling device states (fan speeds, throttling levels).
  • For anomaly detection, I would establish per-server-model baselines (idle power, loaded power, thermal profile) and alert on deviations. Key alerts: (a) Idle power above baseline + 20% — indicates a process keeping CPUs out of deep C-states (monitoring agent bug, crypto miner, stuck process). (b) Temperature above passive trip point for more than 5 minutes — sustained throttling indicates a cooling failure. (c) Throttle event count increasing — even if temperature is below critical, increasing throttle events indicate degrading thermal conditions. (d) Power draw suddenly dropping to near-zero on a server that should be loaded — indicates a hardware fault or unexpected shutdown. (e) Frequency consistently below base frequency under load — indicates either RAPL power capping or thermal throttling.
  • For automated remediation: (a) When idle power anomaly is detected, automatically query the process list for CPU-intensive processes and cross-reference with expected workloads. If a rogue process is found, alert the owning team. (b) When thermal throttling is sustained, first verify via IPMI/BMC that fans are running. If fans are at max and temperature is still rising, initiate workload drain (cordon the node in Kubernetes, migrate VMs) and file a hardware ticket. (c) When power draw exceeds server model TDP by more than 10%, apply RAPL power cap as an emergency measure to prevent hardware damage, then investigate.
  • At fleet scale (5000 servers), the aggregated data also enables capacity planning: total fleet power draw versus data center capacity, per-rack power distribution for hotspot detection, and power efficiency metrics (useful work per watt) that inform hardware refresh decisions.
Follow-up: How do you handle the RAPL energy counter overflow problem at scale?
Follow-up Answer:
  • RAPL energy_uj wraps around: the underlying hardware counter is 32 bits wide, and the wrap point in microjoules is reported by max_energy_range_uj. If that range were exactly 2^32 microjoules, the counter would wrap about every 28.6 seconds at 150W (2^32 microjoules / 150,000,000 microwatts) and about every 14.3 seconds at 300W (high-end servers). On many platforms the hardware energy unit is coarser than one microjoule, so the range is larger and wraps are correspondingly rarer, but the agent must still tolerate wraps and sampling jitter. My agent handles this by: (1) sampling at 2-second intervals for the raw counter, even if we only export 5-second aggregates, (2) detecting wraps by checking if the new value is less than the previous value, and if so adding the counter range to the delta, (3) reading max_energy_range_uj to determine the actual wrap point instead of hard-coding 2^32. In practice, I would use the perf_event interface with the power PMU's energy events (for example power/energy-pkg/) rather than raw sysfs reads, because the kernel handles overflow tracking internally and exposes a monotonically increasing counter.
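
The wrap handling in steps (2) and (3) comes down to a few lines of arithmetic. A minimal sketch, assuming the counter restarts near zero after reaching max_energy_range_uj (function names are hypothetical):

```shell
# Energy consumed between two energy_uj readings, tolerating one wrap.
# Arguments: previous reading, current reading, max_energy_range_uj.
# rapl_delta_uj and avg_watts are hypothetical helper names (assumption).
rapl_delta_uj() {
    prev=$1
    cur=$2
    max_range=$3
    if [ "$cur" -ge "$prev" ]; then
        echo $(( cur - prev ))
    else
        # Counter wrapped: energy up to the wrap point plus energy since
        echo $(( max_range - prev + cur ))
    fi
}

# Average power in watts for a delta in microjoules over N seconds.
avg_watts() {
    awk -v uj="$1" -v s="$2" 'BEGIN { printf "%.2f\n", uj / s / 1000000 }'
}
```

For example, `avg_watts "$(rapl_delta_uj "$prev" "$cur" "$range")" 5` converts two readings taken five seconds apart into watts. The sketch cannot detect two wraps inside one interval, which is why the sampling interval has to stay well below the wrap period.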

Summary

Mechanism   Purpose                      Key Files
---------   --------------------------   -------------------------------------
cpufreq     CPU frequency scaling        /sys/devices/system/cpu/cpu*/cpufreq/
cpuidle     Idle state management        /sys/devices/system/cpu/cpu*/cpuidle/
thermal     Temperature monitoring       /sys/class/thermal/
RAPL        Power limiting/monitoring    /sys/class/powercap/intel-rapl/
PM QoS      Latency constraints          /dev/cpu_dma_latency
tuned       Profile-based optimization   /etc/tuned/profiles/

Key Tools

# Monitor all power/thermal/frequency at once -- single most useful command
sudo turbostat --interval 1

# Set power/performance profile (production-grade approach)
tuned-adm profile <profile-name>

# CPU frequency control
cpupower frequency-set -g performance

# View temperatures (requires lm-sensors)
sensors

# Package energy while an application runs (the counter is system-wide,
# so measure on an otherwise quiet host)
sudo perf stat -a -e power/energy-pkg/ ./myapp

# Find what is preventing deep C-states (wakeup analysis)
sudo powertop

Next Steps