Power Management: Implementation and Runtime Systems¶
This episode explores the technical implementation of power management on modern CPUs, including the software interfaces, hardware mechanisms, and runtime systems that enable dynamic power optimization in HPC environments.
Scaling Drivers and Governors¶
Intel P-state Driver¶
The intel_pstate driver provides direct hardware-level frequency control on modern Intel processors (Sandy Bridge and newer).
Key characteristics:
Controls P-states directly via MSR (Model-Specific Register)
Firmware-independent implementation
Per-core frequency capability on newer CPUs
More responsive to workload changes
Sysfs interface (/sys/devices/system/cpu/intel_pstate/):
max_perf_pct - Maximum P-state allowed (% of max supported)
min_perf_pct - Minimum P-state allowed (% of max supported)
turbo_pct - Ratio of turbo range to total range
no_turbo - Turbo control (0 = turbo frequencies allowed, 1 = turbo frequencies disabled)
hwp_dynamic_boost - Enable iowait-triggered boosting (HWP mode)
num_pstates - Number of supported P-states
status - Driver operation mode: "active", "passive", or "off"
Example - Disable turbo boost:
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
Example - Limit maximum frequency to 80%:
echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
ACPI CPUFreq Driver¶
The acpi-cpufreq driver implements ACPI-based frequency scaling, widely used across different CPU vendors.
Key characteristics:
Works with ACPI firmware tables
Supports multiple CPU types (Intel, AMD, others)
Requires firmware to provide P-state information
More portable across different systems
Sysfs interface (/sys/devices/system/cpu/cpu*/cpufreq/), the generic cpufreq interface shared by all scaling drivers:
scaling_driver - Current driver (acpi-cpufreq, intel_pstate, etc.)
scaling_governor - Current governor (performance, powersave, etc.)
scaling_cur_freq - Current frequency in kHz
cpuinfo_min_freq - Minimum CPU frequency in kHz
cpuinfo_max_freq - Maximum CPU frequency in kHz
cpuinfo_base_freq - Nominal/base frequency in kHz
scaling_min_freq - Minimum frequency driver is allowed to set
scaling_max_freq - Maximum frequency driver is allowed to set
scaling_setspeed - Set specific frequency (userspace governor only)
scaling_available_governors - List of available governors
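energy_performance_preference - Hardware P-State energy/performance trade-off (only present with HWP-capable drivers such as intel_pstate in active mode)
base_frequency - Nominal frequency without turbo (only present with intel_pstate when HWP is enabled)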
Example - View all available governors:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
Example - Change to powersave governor:
echo powersave > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
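The example above changes cpu0 only. A minimal sketch of applying a governor to every CPU is shown below; the helper name `set_governor` and its `cpu_root` parameter are illustrative assumptions, not part of any kernel tool, and writing to the real sysfs tree requires root.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")

def set_governor(governor, cpu_root=CPU_ROOT):
    """Write `governor` to every CPU policy that lists it as available.

    Returns the names of the CPUs that were updated (e.g. ["cpu0", "cpu1"]).
    """
    updated = []
    for policy in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq")):
        available = (policy / "scaling_available_governors").read_text().split()
        if governor not in available:
            continue  # the active driver does not offer this governor
        (policy / "scaling_governor").write_text(governor)
        updated.append(policy.parent.name)
    return updated
```

Passing a different `cpu_root` makes it possible to try the function against a scratch directory before touching a live system.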
Scaling Governors: Policies for Frequency Selection¶
A scaling governor implements the policy that decides which P-state (frequency) to use based on current conditions.
Performance Governor¶
Policy: Always run at maximum frequency.
Advantages:
Highest computational performance
Simplest policy (no decision logic)
Predictable behavior
Disadvantages:
Maximum power consumption
Wastes power during I/O waits
Contributes to thermal issues
Use case: Performance-critical applications where energy is not a constraint.
Powersave Governor¶
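Policy: Always run at minimum frequency. Note that with intel_pstate in active mode, the governor named powersave is instead a dynamic, utilization-based policy.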
Advantages:
Minimum power consumption
Best for energy-constrained systems
Reduces thermal load
Disadvantages:
Worst computational performance
Only suitable for loosely timed workloads
Can cause severe performance degradation
Use case: Energy-critical systems or background tasks.
Ondemand Governor¶
Policy: Scale frequency based on CPU utilization.
Advantages:
Dynamic response to workload changes
Better energy efficiency than performance
Reasonable performance for most workloads
Disadvantages:
Frequency scaling latency can cause performance dips
Threshold tuning is system-specific
Not optimal for irregular workloads
Tuning parameters:
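up_threshold - Utilization threshold to scale up (default 80%; many current kernels default to 95%)
down_threshold - Utilization threshold to scale down (a tunable of the conservative governor; ondemand itself does not expose it)
sampling_rate - How frequently to re-evaluate frequency (in microseconds)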
Use case: General-purpose systems with variable workloads.
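The ondemand policy can be sketched as a small function. This is a simplified model under stated assumptions (load above `up_threshold` jumps to the maximum, otherwise the target is proportional to load); the real governor has further tunables that are ignored here, and the function name is hypothetical.

```python
def ondemand_target(load_pct, min_khz, max_khz, up_threshold=80):
    """Pick a target frequency in kHz from the CPU load in percent."""
    if not 0 <= load_pct <= 100:
        raise ValueError("load_pct must be within 0..100")
    if load_pct > up_threshold:
        return max_khz  # busy: jump straight to the policy maximum
    # otherwise scale linearly between the policy limits
    return min_khz + (max_khz - min_khz) * load_pct // 100
```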
Conservative Governor¶
Policy: Similar to ondemand but with more gradual frequency changes.
Advantages:
More stable than ondemand
Avoids rapid frequency oscillation
Balanced energy/performance trade-off
Disadvantages:
Slower response to load increases
May miss performance opportunities
Not suitable for bursty workloads
Use case: Systems requiring stability with moderate power savings.
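The gradual behavior can be illustrated with a step function. This is a sketch, assuming the commonly documented defaults (up threshold 80%, down threshold 20%, step of 5% of the maximum frequency); the function name is hypothetical.

```python
def conservative_step(cur_khz, load_pct, min_khz, max_khz,
                      up_threshold=80, down_threshold=20, freq_step_pct=5):
    """Move the frequency by at most one step per sampling period."""
    step = max_khz * freq_step_pct // 100
    if load_pct > up_threshold:
        return min(cur_khz + step, max_khz)  # one step up, capped at max
    if load_pct < down_threshold:
        return max(cur_khz - step, min_khz)  # one step down, floored at min
    return cur_khz  # load is between the thresholds: keep the frequency
```

Compared with the ondemand policy, a sudden load spike needs several sampling periods to reach the maximum frequency, which is why bursty workloads suffer.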
Userspace Governor¶
Policy: Allow user applications or system administrators to directly set frequency.
Advantages:
Full control for specialized applications
Can implement custom power management strategies
Enables research into novel power policies
Disadvantages:
Requires user/application to manage frequency
Incorrect policies can waste power
Not suitable for general use
Use case: Research, application-specific optimization, or HPC runtime systems.
Hardware Frequency Control: MSR Registers¶
The actual frequency selection on Intel CPUs is controlled by writing to a Model-Specific Register (MSR):
IA32_PERF_CTL (0x199)¶
This register specifies the target P-state for the CPU:
MSR 0x199 (IA32_PERF_CTL)
Bits [15:8] - Target P-State
Bits [31:16] - Reserved
Example: Writing 0x1C00 sets the CPU to run at P-state 28 (out of 0-39).
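The bit layout above can be checked with a few lines of Python. This sketch only computes register values and does not touch any MSR; the 100 MHz bus clock used to convert a ratio to a frequency is an assumption that holds for many recent Intel CPUs but should be verified for a given part.

```python
BUS_CLOCK_MHZ = 100  # assumption: ratio 1 corresponds to 100 MHz

def encode_perf_ctl(ratio):
    """Place the target P-state ratio into bits [15:8]."""
    if not 0 <= ratio <= 0xFF:
        raise ValueError("ratio must fit in 8 bits")
    return ratio << 8

def decode_perf_ctl(value):
    """Extract the target P-state ratio from bits [15:8]."""
    return (value >> 8) & 0xFF

def ratio_to_mhz(ratio):
    return ratio * BUS_CLOCK_MHZ

print(hex(encode_perf_ctl(28)), ratio_to_mhz(decode_perf_ctl(0x1C00)), "MHz")
```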
CPU core frequency scaling via DVFS operates by specifying a target P-State, which may differ from the current P-State. The hardware then transitions to the target frequency.
When controlled via scaling driver: The driver automatically writes to MSR 0x199 based on the selected governor policy and workload conditions.
When controlled from userspace:
Set the userspace scaling governor to enable direct frequency control
Disable Hardware P-State (HWP) if available to allow manual control
Write the target frequency to the scaling_setspeed sysfs interface
Important: Frequency changes are not instantaneous. Each change incurs a transition latency (typically 10-50 microseconds on modern CPUs) during which the CPU is temporarily unavailable.
Intel Turbo Boost Technology¶
Turbo Boost is a hardware feature that opportunistically increases frequency beyond the nominal specification when thermal and power headroom exists.
Boost Frequency Levels¶
Different instruction sets have different maximum boost frequencies, from highest to lowest:
SSE instructions: highest boost frequency
AVX/AVX2 instructions: medium boost frequency
AVX-512 instructions: lowest boost frequency
Nominal frequency: base frequency (lower bound of all boost ranges)
Why Different Frequencies for Different Instructions?¶
Power and thermal constraints:
AVX-512 instructions perform more computation per cycle → more current/heat
To stay within power budget, frequency must be reduced
Total work per unit time may still increase despite lower frequency
Example (Hypothetical):
SSE turbo: 3.8 GHz × 1 compute unit per cycle = 3.8 units/s
AVX-512 turbo: 3.0 GHz × 8 compute units per cycle = 24 units/s
Despite the lower frequency, AVX-512 delivers more throughput within the same power budget.
Disabling Turbo Boost¶
System administrators can disable turbo boost to:
Reduce power consumption
Improve frequency stability for benchmarking
Ensure consistent thermal behavior
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
Frequency Transition Latency¶
Frequency changes are not instantaneous. Understanding transition latency is important for applications that require tight timing.
What Happens During Frequency Change¶
OS writes target P-state to MSR
CPU begins voltage/frequency ramping
Transition period: CPU unavailable for instruction execution (~10-50 μs)
CPU reaches target frequency and resumes execution
Measuring Frequency Transition Latency¶
Linux reports the transition latency of the active scaling driver through sysfs:
# Read the reported latency (the value is in nanoseconds)
$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency
# Typical values:
# Modern Intel: 5-20 μs
# Older CPUs: 100-500 μs
Impact on Applications¶
For most HPC applications: Frequency transition latency (tens of microseconds) is negligible compared to computation time (milliseconds to seconds).
Exception: Real-time applications or GPU-CPU synchronization may be affected by frequency jitter.
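The "negligible" claim is easy to quantify. The sketch below estimates the share of wall-clock time lost to transitions; the numbers in the example call are hypothetical, not measurements.

```python
def transition_overhead(transitions_per_second, latency_us):
    """Fraction of wall-clock time lost to frequency transitions."""
    return transitions_per_second * latency_us * 1e-6

# Hypothetical case: 100 P-state changes per second at 50 μs each
print(f"{transition_overhead(100, 50):.2%} of runtime")
```

Even with frequent transitions the overhead stays well below one percent, which is why only jitter-sensitive codes need to care.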
GPU Frequency Management¶
GPUs also support frequency scaling, though the mechanisms differ from CPUs.
NVIDIA GPU Frequency Scaling¶
nvidia-smi allows querying and setting GPU frequency:
# View the current SM clock and the maximum memory clock
nvidia-smi --query-gpu=clocks.current.sm,clocks.max.memory --format=csv
# Lock the GPU clock range in MHz (requires root; persistence mode keeps the driver loaded)
nvidia-smi -pm 1
nvidia-smi -lgc <min_frequency>,<max_frequency>
Characteristics:
Coarser granularity than CPU (typically 25-50 MHz steps)
Separate memory and core clocks
Can be locked to specific frequencies or set to dynamic
AMD GPU Frequency Scaling¶
rocm-smi provides AMD GPU frequency control:
# View supported frequencies
rocm-smi --showclkfrq
# Set frequency
rocm-smi --setsclk <level> # Compute clock
rocm-smi --setmclk <level> # Memory clock
GPU Frequency Switching Latency¶
Hardware P-State (HWP) and SpeedShift¶
Starting with Skylake architecture, Intel introduced Hardware P-State (HWP), also known as SpeedShift Technology.
How HWP Works¶
Instead of the OS setting a specific frequency via MSR writes, HWP allows hardware to autonomously select P-states within a range specified by the OS.
Traditional OS-controlled approach:
OS: "Set frequency to 2.4 GHz"
CPU: Adjusts to 2.4 GHz (10-50 μs latency)
HWP approach:
OS: "Select a P-state between 5 and 30 (min/max)"
CPU: Autonomously selects optimal P-state (0.5-2 μs latency)
Advantages of HWP¶
Reduced latency - Hardware responds faster to workload changes (10-100× faster)
Smarter decisions - Hardware can respond to iowait, memory latency, other signals
Better multi-core management - Different cores can independently optimize
AVX-512 awareness - Hardware automatically reduces frequency for power-intensive instructions
Enabling HWP¶
HWP is enabled via MSR 0x770 (IA32_PM_ENABLE):
# Check if HWP is supported
grep -m1 -ow hwp /proc/cpuinfo
# Enable HWP (requires boot-time kernel parameter or firmware change)
# Once enabled, further writes to certain MSR registers are ignored
Important: HWP significantly improves responsiveness to workload changes and is enabled by default on modern systems.
Dynamic Duty Clock Modulation¶
Clock modulation is an alternative power management technique: the hardware skips a user-defined fraction of clock cycles, reducing power without changing the frequency.
Energy-Performance Preference (EPP)¶
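Modern Intel CPUs support an energy-performance hint via MSR that allows fine-grained control over the energy-performance trade-off. The older form of this hint is the Energy Performance Bias (EPB) register; the Energy-Performance Preference (EPP) proper is a field of IA32_HWP_REQUEST.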
MSR IA32_ENERGY_PERF_BIAS (0x1B0)¶
Specifies hardware preference:
Values range from 0 to 15
0 - Preference to highest performance (maximize frequency)
7 - Balanced hint (balance performance and energy)
15 - Preference to maximize energy saving (minimize frequency)
Sysfs interface: /sys/devices/system/cpu/cpu*/power/energy_perf_bias
Enhanced EPP: MSR 0x774 (IA32_HWP_REQUEST)¶
Modern Intel (Skylake+):
Bits [31:24] - Energy-Performance Preference (0=performance, 128=balanced, 255=energy)
Provides finer granularity and better hardware optimization
This allows the hardware to make local optimization decisions while respecting overall policy directives.
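To make the register layout concrete, the sketch below packs and unpacks the low 32 bits of IA32_HWP_REQUEST. It assumes the field layout from Intel's architecture manual (minimum performance in bits [7:0], maximum in [15:8], desired in [23:16], EPP in [31:24]) and only computes values; it does not write any MSR.

```python
def encode_hwp_request(min_perf, max_perf, epp, desired=0):
    """Pack an HWP request; desired=0 leaves the choice to the hardware."""
    fields = {"min_perf": min_perf, "max_perf": max_perf,
              "epp": epp, "desired": desired}
    for name, value in fields.items():
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    if min_perf > max_perf:
        raise ValueError("min_perf must not exceed max_perf")
    return min_perf | (max_perf << 8) | (desired << 16) | (epp << 24)

def decode_hwp_request(value):
    """Split the low 32 bits of an HWP request into its fields."""
    return {
        "min_perf": value & 0xFF,
        "max_perf": (value >> 8) & 0xFF,
        "desired": (value >> 16) & 0xFF,
        "epp": (value >> 24) & 0xFF,
    }

# P-state range 5..30 with a balanced preference (EPP = 128)
print(hex(encode_hwp_request(5, 30, 128)))
```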
CPU Uncore Frequency¶
Intel processors contain “uncore” subsystems shared by multiple cores:
Last-Level Cache (LLC)
On-chip interconnect
Integrated memory controller
These components consume ~30% of chip area and can be frequency-scaled independently of core frequency.
Intel Uncore Control¶
MSR MSR_UNCORE_RATIO_LIMIT (0x620) controls uncore frequency limits for:
Frequency of subsystems shared by multiple processor cores
Last level cache, on-chip ring interconnect, integrated memory controllers
Specification of maximum and minimum limits
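A minimal sketch of how the two limits are packed into this register follows. It assumes the commonly documented layout (maximum ratio in bits [6:0], minimum ratio in bits [14:8], one ratio step equal to 100 MHz) and only computes values; function names are illustrative.

```python
RATIO_MHZ = 100  # assumption: one ratio step is 100 MHz

def encode_uncore_limits(min_mhz, max_mhz):
    """Pack uncore min/max frequency limits into a register value."""
    min_ratio, max_ratio = min_mhz // RATIO_MHZ, max_mhz // RATIO_MHZ
    if not 0 <= min_ratio <= max_ratio <= 0x7F:
        raise ValueError("ratios must satisfy 0 <= min <= max <= 127")
    return (min_ratio << 8) | max_ratio

def decode_uncore_limits(value):
    """Return (min_mhz, max_mhz) from a register value."""
    return ((value >> 8) & 0x7F) * RATIO_MHZ, (value & 0x7F) * RATIO_MHZ
```

Setting both limits to the same value pins the uncore to a fixed frequency, for example `encode_uncore_limits(2200, 2200)` for 2.2 GHz.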
Monitoring Uncore Performance¶
MSR U_MSR_PMON_FIXED_CTR (0x704 since Haswell) - Uncore performance counter (fixed counter of uncore clock cycles)
MSR 0x703 - Uncore performance counter enable
AMD Uncore¶
Data Fabric (Infinity Fabric interconnect)
I/O subsystems
Can be controlled separately with P-states
Workload Characterization¶
Understanding workload characteristics enables targeted power optimization:
Memory-bound workloads - Limited by memory bandwidth, not CPU cycles
Benefit from frequency reduction (saves power without hurting performance)
High latency tolerance for power management decisions
Compute-bound workloads - Limited by available CPU cycles
Require high frequency to maintain performance
Sensitive to frequency reduction
Communication-bound workloads - Limited by network bandwidth
Opportunity for frequency reduction during communication waits
Can benefit from dynamic scaling
I/O-bound workloads - Frequent stalls on disk/network access
Aggressive power reduction candidates
Minimal performance impact from lower frequency
Arithmetic Intensity¶
The ratio of compute operations to memory accesses helps predict workload boundness.
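As an illustration, the sketch below uses the common roofline convention of floating-point operations per byte moved, which is an assumption about the intended definition. The peak numbers in the usage note are hypothetical, not taken from a specific machine.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Floating-point operations per byte of memory traffic."""
    return flops / bytes_moved

def classify(intensity, peak_gflops, peak_gb_per_s):
    """Compare against the machine balance (FLOP per byte at peak rates)."""
    machine_balance = peak_gflops / peak_gb_per_s
    return "compute-bound" if intensity >= machine_balance else "memory-bound"

# DAXPY (y = a*x + y) in double precision: 2 FLOP per element,
# 24 bytes moved (load x, load y, store y)
daxpy = arithmetic_intensity(2, 24)
print(f"{daxpy:.3f} FLOP/byte ->", classify(daxpy, 3000, 200))
```

With a hypothetical peak of 3000 GFLOP/s and 200 GB/s the machine balance is 15 FLOP/byte, so DAXPY is strongly memory-bound and a good candidate for core frequency reduction.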
Power Capping: Intel RAPL¶
Intel Running Average Power Limit (RAPL) provides hardware-based power capping and monitoring.
RAPL Architecture¶
Power Domains¶
Sysfs: /sys/devices/virtual/powercap/intel-rapl/intel-rapl:X/intel-rapl:X:Y (X = package, Y = subdomain)
Package domain:
Limits power consumption for entire CPU package (cores + uncore)
Short window: ~1.2× TDP (milliseconds)
Long window: ~TDP (seconds)
DRAM domain:
Used for memory power capping and monitoring
Enables P-State scaling for memory subsystem
Server architectures only (not client)
Single time window
Disabled by default
PP0 (Core) domain:
Restricts power limit to CPU cores only
Single time window
Not available on latest server CPUs
PP1 (Graphics) domain:
Power limits only integrated GPU
Not on server systems
Single time window
PSys (Platform) domain:
Controls entire System on Chip
Short and long windows
Available from Skylake architecture onwards
Requires vendor support
RAPL MSR Registers¶
MSR MSR_PKG_POWER_LIMIT (0x610):
MSR MSR_RAPL_POWER_UNIT (0x606):
Power units (Watts per unit)
Energy status units (Joules per unit)
Time units (seconds per unit)
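The unit register stores exponents rather than plain values. A minimal decoding sketch follows, assuming the layout from Intel's architecture manual (power units in bits [3:0], energy units in bits [12:8], time units in bits [19:16], each meaning 1/2^n); the sample value in the usage line is one often seen on Intel CPUs and is used here only as an example.

```python
def decode_rapl_units(value):
    """Convert MSR_RAPL_POWER_UNIT exponents into physical units."""
    return {
        "power_w": 1.0 / (1 << (value & 0xF)),            # Watts per unit
        "energy_j": 1.0 / (1 << ((value >> 8) & 0x1F)),   # Joules per unit
        "time_s": 1.0 / (1 << ((value >> 16) & 0xF)),     # seconds per unit
    }

units = decode_rapl_units(0xA0E03)
print(units)
```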
Energy Consumption Measurement¶
MSR Energy Status Registers:
MSR MSR_PKG_ENERGY_STATUS (0x611) - Package energy
MSR MSR_DRAM_ENERGY_STATUS (0x619) - DRAM energy
MSR MSR_PP0_ENERGY_STATUS (0x639) - Core energy
MSR MSR_PP1_ENERGY_STATUS (0x641) - Graphics energy
MSR MSR_PLATFORM_ENERGY_COUNTER (0x64D) - Platform total
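The energy status registers are running counters, so power is obtained from the difference of two readings. The sketch below assumes 32-bit counters and at most one wraparound between samples; the raw readings would come from an MSR or powercap read that is not shown here, and the function names are illustrative.

```python
def energy_delta_joules(before, after, energy_unit_j, counter_bits=32):
    """Energy consumed between two raw counter readings, handling wraparound."""
    delta = (after - before) % (1 << counter_bits)
    return delta * energy_unit_j

def average_power_watts(before, after, interval_s, energy_unit_j):
    """Average power over the sampling interval."""
    return energy_delta_joules(before, after, energy_unit_j) / interval_s
```

Sampling often enough matters: at high package power a 32-bit counter with a fine energy unit can wrap within minutes, and a second wraparound between samples would go unnoticed.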
Power Capping Behavior¶
Intel RAPL algorithm:
Power capping system downscales CPU core and uncore frequencies to keep power consumption at the limit
Note: Intel RAPL does not reflect the arithmetic intensity of the workload
Case Study: Cascade Lake Power Management¶
Frequency Scaling with Arithmetic Intensity¶
Example: AVX-512 workload with arithmetic intensity 8
RAPL behavior:
Keeps core frequency at maximum
Downscales uncore frequency to stay within power limit
Results¶
AVX-512 with arithmetic intensity 8, core 1.9 GHz, uncore 2.2 GHz:
18% CPU energy savings
3.5% runtime improvement
15% node energy savings
Advanced Platforms: Grace Hopper Power Management¶
Grace Hopper exposes a variety of power domains (CPU cores, GPU SMs, module-level) but only a limited set of knobs (frequencies).
Challenge: Power shifting between CPU and GPU requires frequency coordination to stay within node power limit.
Disabling Units for Power Savings¶
Beyond frequency scaling, some systems support unit disabling:
Multi-threading (on/off) - Disable simultaneous multithreading
Disabling cores - Complex, affects memory bandwidth
AMD xGMI lanes - External Global Memory Interconnect per link
Fujitsu A64FX FPU pipelines - FLA (floating-point) and EXA (integer) elimination
P-cores and E-cores - Heterogeneous cores (not yet in HPC)
Power Management Knobs Overview¶
Across different HPC-relevant platforms:
Intel:
CPU - core frequency, uncore frequency, power capping
ACC (PVC) - GPU frequency, memory frequency, power capping
ACC (KNL) - core frequency, power capping
AMD:
CPU - core frequency, power capping, Data Fabric frequency
ACC - power capping, frequency (system, Data Fabric, display, SOC, memory, PCIe)
NVIDIA:
GPU - SM frequency, memory frequency, power capping
CPU+GPU - Grace Hopper: CPU core & GPU SM frequencies, multi-level power capping
IBM:
CPU - core frequency, power capping
CPU+GPU - core frequency, CPU/GPU/node power capping
ARM:
CPU - Fujitsu A64FX: core frequency, FPU pipeline elimination, memory frequency
CPU - NVIDIA Grace: core frequency, power capping
Runtime Systems for Automatic Power Management¶
Runtime systems automatically adjust power management parameters based on application characteristics and system constraints.
What Runtime Systems Do¶
Profile workload - Identify CPU/memory/I/O patterns
Monitor performance - Track actual vs expected performance
Select frequencies - Choose P-states to meet constraints (power budget, deadline, etc.)
Adapt dynamically - Respond to runtime conditions
Example: Energy-Aware Scheduling¶
A runtime system might:
Reduce frequency for loosely-coupled parallel tasks (communication-bound)
Increase frequency for CPU-bound tasks requiring maximum throughput
Collectively respect power budget across all running jobs
Adapt based on observed thermal conditions
Example: Power-Capping Runtime¶
Ensures total node power doesn’t exceed a limit while maximizing performance:
Algorithm:
FOR each core:
    power_limit_per_core = total_power_limit / num_cores
    frequency = find_max_frequency(power_limit_per_core)
    set_frequency(frequency)
Monitor actual power
Adjust frequencies if overage
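The algorithm can be turned into a small runnable sketch. Everything numeric here is an assumption for illustration: the frequency list, the cubic per-core power model and its constants are hypothetical, and a real runtime would replace the model with measurements (for example from RAPL) and apply the result through the cpufreq interface.

```python
FREQS_MHZ = [1200, 1600, 2000, 2400, 2800]  # hypothetical available frequencies

def core_power_watts(freq_mhz, static_w=1.0, k=2.0e-10):
    """Hypothetical per-core model: static power plus a cubic dynamic term."""
    return static_w + k * freq_mhz ** 3

def find_max_frequency(limit_w, freqs=FREQS_MHZ):
    """Highest frequency whose modeled power fits the per-core limit."""
    fitting = [f for f in freqs if core_power_watts(f) <= limit_w]
    return max(fitting) if fitting else min(freqs)

def plan_frequencies(total_limit_w, num_cores):
    """Split the budget evenly and pick one frequency per core."""
    per_core = total_limit_w / num_cores
    return [find_max_frequency(per_core)] * num_cores

def adjust_on_overage(freqs_now, measured_w, total_limit_w, freqs=FREQS_MHZ):
    """Step every core down one level if measured power exceeds the limit."""
    if measured_w <= total_limit_w:
        return list(freqs_now)
    return [freqs[max(freqs.index(f) - 1, 0)] for f in freqs_now]

plan = plan_frequencies(total_limit_w=16.0, num_cores=4)
print(plan, adjust_on_overage(plan, measured_w=17.0, total_limit_w=16.0))
```

An even split is the simplest choice; a more refined runtime would shift budget toward cores running compute-bound work.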
Practical HPC Power Management Strategies¶
Strategy 1: Fixed Frequency¶
Set a conservative fixed frequency across the cluster:
# Cap all nodes at 80% of the maximum supported performance level
echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
Pros:
Simple to deploy and manage
Predictable power consumption
Measurable energy savings (typically 10-20%)
Cons:
No performance adaptation
May waste power during I/O waits
Not optimal for heterogeneous workloads
Best for: Homogeneous workloads with known characteristics, power-constrained systems
Strategy 2: Per-Application Tuning¶
Different application classes get different frequencies based on profiling:
Batch jobs (deadline-loose): 80% frequency (20% power saving)
Interactive jobs (latency-critical): 100% frequency
GPU-accelerated jobs: CPU 60%, GPU 100%
I/O-bound services: 70% frequency
Pros:
Better performance for latency-critical work
Energy savings for batch jobs
Balanced approach for mixed workloads
Cons:
Requires profiling and characterization
Need to identify workload class at runtime
Moderate complexity in implementation
Best for: Mixed HPC centers with diverse job types
Strategy 3: Dynamic Runtime Control¶
Runtime system adjusts frequency based on:
CPU utilization and workload signals
Memory bandwidth usage
Power budget remaining
Deadline/QoS requirements
Pros:
Highest potential for energy savings (20-40%)
Adapts automatically to workload changes
Respects power budgets dynamically
Cons:
Most complex to implement
Requires careful tuning of heuristics
May have overhead from monitoring
Best for: Advanced HPC centers with sophisticated workload management
Summary: Power Management Parameters¶
| Parameter | Conservative | Aggressive | Notes |
|---|---|---|---|
| Governor | performance | ondemand | Depends on workload characteristics |
| Max Frequency | 100% | 70-80% | Significant power savings at a performance cost |
| Turbo Boost | Enabled | Disabled | Disabling reduces power ~5-10% |
| C-states | Enabled | Enabled | Should always enable (minimal performance cost) |
| HWP | Enabled | Enabled | Always use if supported (improves responsiveness) |
| Uncore Frequency | 100% | 80-90% | Less impact than core frequency |

The optimal settings depend on your specific workload, power budget, and performance requirements. Profiling and validation are essential for any production deployment.
Case Study: Energy-Aware HPC - RIKEN Fugaku¶
System Overview¶
Ranking: #1 in Top500 from June 2020 through November 2021
Processor: Fujitsu A64FX
48 compute cores + 4 assistant cores (OS daemon and MPI offload)
No TDP, no nominal frequency → no traditional turbo concept
Available frequencies: 1.6, 1.8, 2.0, or 2.2 GHz
User-Controlled Power Options¶
Power mode (scheduler option):
Normal - 2.0 GHz frequency (baseline)
Boost - 2.2 GHz frequency (performance)
ECO - 2.0 GHz + use one of two FPU units only + reduces standby power
Boost ECO - 2.2 GHz + FPU unit elimination
Core retention (ON/OFF):
When enabled: Eliminates standby power for idle CPU cores
Significant power savings for workloads that don’t utilize all cores
Reference: https://sites.google.com/view/rikenfugakushowcase/home
Summary: Episode 1 Learning Outcomes¶
After completing this episode, you should be able to:
Implement power management - Use sysfs and MSR interfaces to control CPU frequencies
Select appropriate governors - Choose scaling policies based on workload characteristics
Understand hardware mechanisms - Explain HWP, RAPL, turbo boost, and other hardware features
Design runtime systems - Create algorithms for automatic power optimization
Evaluate trade-offs - Balance performance, power, and reliability constraints
Apply best practices - Implement fixed, per-application, or dynamic power strategies
Analyze real systems - Understand power management in production HPC systems
Assessment: You should now be able to profile a workload, design a power management strategy, and implement it on an HPC system while respecting power budgets and performance requirements.