Power Management: Implementation and Runtime Systems

This episode explores the technical implementation of power management on modern CPUs, including the software interfaces, hardware mechanisms, and runtime systems that enable dynamic power optimization in HPC environments.

Scaling Drivers and Governors

Intel P-state Driver

The intel_pstate driver provides direct hardware-level frequency control on modern Intel processors (Sandy Bridge and newer).

Key characteristics:

  • Controls P-states directly via MSR (Model-Specific Register)

  • Firmware-independent implementation

  • Per-core frequency capability on newer CPUs

  • More responsive to workload changes

Sysfs interface (/sys/devices/system/cpu/intel_pstate/):

max_perf_pct          - Maximum P-state allowed (% of max supported)
min_perf_pct          - Minimum P-state allowed (% of max supported)
turbo_pct             - Ratio of turbo range to total range
no_turbo              - Turbo control (0 = turbo allowed, 1 = all turbo frequencies disabled)
hwp_dynamic_boost     - Enable iowait-triggered boosting (HWP mode)
num_pstates           - Number of supported P-states
status                - Driver operation mode: "active", "passive", or "off"

Example - Disable turbo boost:

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

Example - Limit maximum frequency to 80%:

echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
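
The same attributes can be read and written from a script. The following is a minimal Python sketch, not a definitive tool: the helper names read_pstate_settings and set_max_perf_pct are hypothetical, the attribute list is taken from the table above, and writing requires root on a real system. The base directory is a parameter so the logic can be tried against a scratch directory.

```python
from pathlib import Path

# Location of the intel_pstate global attributes listed above.
INTEL_PSTATE_DIR = Path("/sys/devices/system/cpu/intel_pstate")

# Attribute names taken from the sysfs table in this section.
ATTRIBUTES = ["status", "no_turbo", "min_perf_pct", "max_perf_pct",
              "turbo_pct", "num_pstates"]


def read_pstate_settings(base: Path = INTEL_PSTATE_DIR) -> dict:
    """Return the intel_pstate attributes that exist under `base` as strings."""
    settings = {}
    for name in ATTRIBUTES:
        attr = base / name
        if attr.is_file():  # not every attribute exists in every driver mode
            settings[name] = attr.read_text().strip()
    return settings


def set_max_perf_pct(percent: int, base: Path = INTEL_PSTATE_DIR) -> None:
    """Write a new max_perf_pct value (needs root on a real system)."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    (base / "max_perf_pct").write_text(f"{percent}\n")
```

Calling set_max_perf_pct(80) has the same effect as the echo command above; read_pstate_settings() can then confirm the value the driver accepted.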

ACPI CPUFreq Driver

The acpi-cpufreq driver implements ACPI-based frequency scaling, widely used across different CPU vendors.

Key characteristics:

  • Works with ACPI firmware tables

  • Supports multiple CPU types (Intel, AMD, others)

  • Requires firmware to provide P-state information

  • More portable across different systems

Sysfs interface (/sys/devices/system/cpu/cpu*/cpufreq/), the generic cpufreq policy interface shared by all scaling drivers:

scaling_driver              - Current driver (acpi-cpufreq, intel_pstate, etc.)
scaling_governor            - Current governor (performance, powersave, etc.)
scaling_cur_freq            - Current frequency in kHz
cpuinfo_min_freq            - Minimum CPU frequency in kHz
cpuinfo_max_freq            - Maximum CPU frequency in kHz
cpuinfo_base_freq           - Nominal/base frequency in kHz
scaling_min_freq            - Minimum frequency driver is allowed to set
scaling_max_freq            - Maximum frequency driver is allowed to set
scaling_setspeed            - Set specific frequency (userspace governor only)
scaling_available_governors - List of available governors
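energy_performance_preference - Hardware P-State energy/performance trade-off (only with drivers that use hardware-managed P-states, e.g. intel_pstate with HWP)
base_frequency              - Nominal frequency without turbo (intel_pstate with HWP)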

Example - View all available governors:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

Example - Change to powersave governor:

echo powersave > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
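
The echo command above changes cpu0 only. A minimal Python sketch for applying a governor to every CPU follows; the function name set_governor is hypothetical, and the code assumes the standard cpuN/cpufreq layout shown above. It skips CPUs whose driver does not offer the requested governor.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def set_governor(governor: str, cpu_root: Path = CPU_ROOT) -> list:
    """Set `governor` on every CPU that offers it; return the CPUs changed."""
    changed = []
    for policy in sorted(cpu_root.glob("cpu[0-9]*/cpufreq")):
        available = (policy / "scaling_available_governors").read_text().split()
        if governor not in available:
            continue  # this CPU's driver does not provide the governor
        (policy / "scaling_governor").write_text(governor + "\n")
        changed.append(policy.parent.name)
    return changed
```

On a real system this needs root; set_governor("powersave") returns the list of CPU directory names that were switched.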

Scaling Governors: Policies for Frequency Selection

A scaling governor implements the policy that decides which P-state (frequency) to use based on current conditions.

Performance Governor

Policy: Always run at maximum frequency.

Advantages:

  • Highest computational performance

  • Simplest policy (no decision logic)

  • Predictable behavior

Disadvantages:

  • Maximum power consumption

  • Wastes power during I/O waits

  • Contributes to thermal issues

Use case: Performance-critical applications where energy is not a constraint.

Powersave Governor

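Policy: Always run at minimum frequency. (Under intel_pstate in active mode, the governor named powersave behaves differently: it scales frequency dynamically with load.)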

Advantages:

  • Minimum power consumption

  • Best for energy-constrained systems

  • Reduces thermal load

Disadvantages:

  • Worst computational performance

  • Only suitable for loosely timed workloads

  • Can cause severe performance degradation

Use case: Energy-critical systems or background tasks.

Ondemand Governor

Policy: Scale frequency based on CPU utilization.

Advantages:

  • Dynamic response to workload changes

  • Better energy efficiency than performance

  • Reasonable performance for most workloads

Disadvantages:

  • Frequency scaling latency can cause performance dips

  • Threshold tuning is system-specific

  • Not optimal for irregular workloads

Tuning parameters:

  • up_threshold - Utilization threshold to scale up (default 80%)

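  • sampling_down_factor - Multiplier applied to sampling_rate while load stays above up_threshold, which delays scaling back down (down_threshold is a tunable of the conservative governor, not ondemand)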

  • sampling_rate - How frequently to re-evaluate frequency

Use case: General-purpose systems with variable workloads.
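
The ondemand policy can be illustrated with a simplified model. The sketch below is an approximation written for this lesson, not the kernel source: above up_threshold it jumps straight to the maximum frequency, otherwise it picks a frequency proportional to the load. The function name and the frequency values in the usage note are hypothetical.

```python
def ondemand_target(load_pct: float, min_khz: int, max_khz: int,
                    up_threshold: float = 80.0) -> int:
    """Simplified ondemand decision for one sampling period.

    load_pct is the CPU utilization (0-100) observed since the last sample.
    """
    if not 0 <= load_pct <= 100:
        raise ValueError("load_pct must be between 0 and 100")
    if load_pct > up_threshold:
        return max_khz  # busy: go directly to the highest frequency
    # Otherwise scale linearly between the minimum and maximum frequency.
    return int(min_khz + load_pct * (max_khz - min_khz) / 100)
```

For a CPU ranging from 1.0 GHz to 3.0 GHz, a load of 50% selects 2.0 GHz in this model, while any load above 80% selects 3.0 GHz.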

Conservative Governor

Policy: Similar to ondemand but with more gradual frequency changes.

Advantages:

  • More stable than ondemand

  • Avoids rapid frequency oscillation

  • Balanced energy/performance trade-off

Disadvantages:

  • Slower response to load increases

  • May miss performance opportunities

  • Not suitable for bursty workloads

Use case: Systems requiring stability with moderate power savings.
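
The gradual behavior of the conservative governor can be sketched in the same simplified style. This is a model under stated assumptions, not the kernel implementation: the frequency moves by one fixed step per sampling period, with the step expressed as a percentage of the maximum frequency. The default values (up 80%, down 20%, step 5%) mirror the kernel defaults; the function name is hypothetical.

```python
def conservative_target(load_pct: float, cur_khz: int, min_khz: int, max_khz: int,
                        up_threshold: float = 80.0, down_threshold: float = 20.0,
                        freq_step_pct: float = 5.0) -> int:
    """Simplified conservative decision for one sampling period."""
    step = int(max_khz * freq_step_pct / 100)
    if load_pct > up_threshold:
        return min(cur_khz + step, max_khz)  # one step up, capped at max
    if load_pct < down_threshold:
        return max(cur_khz - step, min_khz)  # one step down, floored at min
    return cur_khz  # between the thresholds: hold the current frequency
```

With a 1.0 to 3.0 GHz range the step is 150 MHz, so a sustained load needs 14 sampling periods to climb from minimum to maximum. This is the slower response to load increases listed among the disadvantages.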

Userspace Governor

Policy: Allow user applications or system administrators to directly set frequency.

Advantages:

  • Full control for specialized applications

  • Can implement custom power management strategies

  • Enables research into novel power policies

Disadvantages:

  • Requires user/application to manage frequency

  • Incorrect policies can waste power

  • Not suitable for general use

Use case: Research, application-specific optimization, or HPC runtime systems.

Hardware Frequency Control: MSR Registers

The actual frequency selection on Intel CPUs is controlled by writing to a Model-Specific Register (MSR):

IA32_PERF_CTL (0x199)

This register specifies the target P-state for the CPU:

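MSR 0x199 (IA32_PERF_CTL)
Bits [15:8] - Target P-State, encoded as a frequency ratio (multiplier of the bus clock)
Bits [31:16] - Reserved

Example: Writing 0x1C00 requests P-state ratio 28 (0x1C), which corresponds to 2.8 GHz with a 100 MHz bus clock. The range of valid ratios is CPU-specific.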
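
The bit layout above can be made concrete with a small encode/decode sketch. It only manipulates integers and does not touch any MSR; the helper names are hypothetical, and the conversion to MHz assumes the P-state field is a ratio of a 100 MHz bus clock, which should be checked for the CPU in question.

```python
BUS_CLOCK_MHZ = 100  # assumption: 100 MHz bus clock


def encode_perf_ctl(ratio: int) -> int:
    """Place a target P-state ratio into bits [15:8] of an IA32_PERF_CTL value."""
    if not 0 <= ratio <= 0xFF:
        raise ValueError("ratio must fit in 8 bits")
    return ratio << 8


def decode_perf_ctl(value: int) -> int:
    """Extract the target P-state ratio from bits [15:8]."""
    return (value >> 8) & 0xFF


def ratio_to_mhz(ratio: int) -> int:
    """Convert a ratio to a frequency under the bus-clock assumption above."""
    return ratio * BUS_CLOCK_MHZ
```

encode_perf_ctl(28) yields 0x1C00, the value used in the example above.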

CPU core frequency scaling via DVFS operates by specifying a target P-State, which may differ from the current P-State. The hardware then transitions to the target frequency.

When controlled via scaling driver: The driver automatically writes to MSR 0x199 based on the selected governor policy and workload conditions.

When controlled from userspace:

  • Set userspace scaling governor to enable direct frequency control

  • Disable Hardware P-State (HWP) if available to allow manual control

  • Write target frequency to scaling_setspeed sysfs interface

Important: Frequency changes are not instantaneous. There is a transition latency (typically 10-50 microseconds on modern CPUs) where the CPU is temporarily unavailable during frequency switching.

../../_images/3.png

Intel Turbo Boost Technology

Turbo Boost is a hardware feature that opportunistically increases frequency beyond the nominal specification when thermal and power headroom exists.

Boost Frequency Levels

Different instruction sets have different maximum boost frequencies:

SSE Instructions       ──────────┐
                                  │ Highest boost frequency
                                  │
AVX/AVX2 Instructions ─────────┐ │
                                │ │ Medium boost frequency
                                │ │
AVX-512 Instructions  ──────┐   │ │
                            │   │ │ Lowest boost frequency
                            │   │ │
  Nominal Frequency ────────┴───┴─┴── Base frequency

Why Different Frequencies for Different Instructions?

Power and thermal constraints:

  • AVX-512 instructions perform more computation per cycle → more current/heat

  • To stay within power budget, frequency must be reduced

  • Total work per unit time may still increase despite lower frequency

Example (Hypothetical):

  • SSE turbo: 3.8 GHz × 1 compute unit per cycle = 3.8 units/s

  • AVX-512 turbo: 3.0 GHz × 8 compute units per cycle = 24 units/s

Despite lower frequency, AVX-512 actually delivers more throughput within power budget.
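
The arithmetic of this hypothetical example can be checked in a few lines of Python. The numbers are the illustrative ones from the bullets above, not measurements, and the function names are made up for this sketch.

```python
def throughput(freq_ghz: float, units_per_cycle: int) -> float:
    """Compute units per nanosecond: frequency (GHz) times work per cycle."""
    return freq_ghz * units_per_cycle


def break_even_freq(scalar_freq_ghz: float, scalar_units: int, vector_units: int) -> float:
    """Frequency at which the wider instructions merely match the narrower ones."""
    return scalar_freq_ghz * scalar_units / vector_units


sse = throughput(3.8, 1)
avx512 = throughput(3.0, 8)
speedup = avx512 / sse  # how much more work AVX-512 completes per unit time
```

In this model AVX-512 stays ahead as long as its turbo frequency remains above break_even_freq(3.8, 1, 8), so the frequency drop alone does not erase the benefit of wider vectors.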

Disabling Turbo Boost

System administrators can disable turbo boost to:

  • Reduce power consumption

  • Improve frequency stability for benchmarking

  • Ensure consistent thermal behavior

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

../../_images/4-1.png
../../_images/4-2.jpg

Frequency Transition Latency

Frequency changes are not instantaneous. Understanding transition latency is important for applications that require tight timing.

What Happens During Frequency Change

  1. OS writes target P-state to MSR

  2. CPU begins voltage/frequency ramping

  3. Transition period: CPU unavailable for instruction execution (~10-50 μs)

  4. CPU reaches target frequency and resumes execution

Measuring Frequency Transition Latency

Linux exposes the transition latency declared by the scaling driver (a reported value in nanoseconds, not a live measurement):

# Driver-reported transition latency in nanoseconds
$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency

# Typical values:
# Modern Intel: 5-20 μs
# Older CPUs: 100-500 μs

Impact on Applications

For most HPC applications: Frequency transition latency (tens of microseconds) is negligible compared to computation time (milliseconds to seconds).

Exception: Real-time applications or GPU-CPU synchronization may be affected by frequency jitter.
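
A back-of-the-envelope estimate shows why the latency rarely matters. The sketch below assumes the CPU is stalled for the full latency on every transition, which is a pessimistic simplification; the function name and the example numbers are hypothetical.

```python
def transition_overhead(latency_us: float, transitions: int, runtime_s: float) -> float:
    """Fraction of the runtime spent in frequency transitions."""
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    return transitions * latency_us * 1e-6 / runtime_s
```

For example, 100 transitions per second at 50 μs each cost about 0.5% of the runtime, whereas a control loop with a 100 μs period would notice a single transition.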

../../_images/5-1.png
../../_images/5-2.png

GPU Frequency Management

GPUs also support frequency scaling, though the mechanisms differ from CPUs.

NVIDIA GPU Frequency Scaling

nvidia-smi allows querying and setting GPU frequency:

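# View the current SM clock and the maximum memory clock
nvidia-smi --query-gpu=clocks.current.sm,clocks.max.memory --format=csv

# List supported clock frequencies
nvidia-smi -q -d SUPPORTED_CLOCKS

# Lock the GPU clock to a range in MHz (requires root); -pm 1 enables persistence mode first
nvidia-smi -pm 1
nvidia-smi -lgc <min_mhz>,<max_mhz>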

Characteristics:

  • Coarser granularity than CPU (typically 25-50 MHz steps)

  • Separate memory and core clocks

  • Can be locked to specific frequencies or set to dynamic

AMD GPU Frequency Scaling

rocm-smi provides AMD GPU frequency control:

# View supported clock frequency levels
rocm-smi --showclkfrq

# Set clock levels (the performance level must be manual)
rocm-smi --setperflevel manual
rocm-smi --setsclk <level>  # GPU core (system) clock
rocm-smi --setmclk <level>  # Memory clock

../../_images/6-1.jpg
../../_images/6-2.jpg

GPU Frequency Switching Latency

../../_images/7.png

Hardware P-State (HWP) and SpeedShift

Starting with Skylake architecture, Intel introduced Hardware P-State (HWP), also known as SpeedShift Technology.

How HWP Works

Instead of the OS setting a specific frequency via MSR writes, HWP allows hardware to autonomously select P-states within a range specified by the OS.

Traditional OS-controlled approach:

OS: "Set frequency to 2.4 GHz"
CPU: Adjusts to 2.4 GHz (10-50 μs latency)

HWP approach:

OS: "Select a P-state between 5 and 30 (min/max)"
CPU: Autonomously selects optimal P-state (0.5-2 μs latency)

Advantages of HWP

  1. Reduced latency - Hardware responds faster to workload changes (10-100× faster)

  2. Smarter decisions - Hardware can respond to iowait, memory latency, other signals

  3. Better multi-core management - Different cores can independently optimize

  4. AVX-512 awareness - Hardware automatically reduces frequency for power-intensive instructions

Enabling HWP

HWP is enabled via MSR 0x770 (IA32_PM_ENABLE):

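# Check if HWP is supported (prints "hwp" if the CPU flag is present)
grep -m1 -ow hwp /proc/cpuinfo

# With intel_pstate, HWP is enabled at boot when supported (intel_pstate=no_hwp on the kernel command line prevents this)
# Once enabled, HWP stays on until the next reset, and writes to legacy P-state MSRs such as IA32_PERF_CTL no longer select the frequency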

Important: HWP significantly improves responsiveness to workload changes and is enabled by default on modern systems.

../../_images/8.png

Dynamic Duty Clock Modulation

Dynamic duty-cycle (clock) modulation is an alternative power management technique: the hardware skips a user-defined fraction of clock cycles, reducing power without changing frequency.

../../_images/9.png

Energy-Performance Preference (EPP)

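Modern Intel CPUs accept an energy-performance hint via MSR that gives software fine-grained control over the energy-performance trade-off. The older form of this hint is the Energy Performance Bias (EPB); HWP-capable CPUs add the finer-grained EPP field.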

MSR IA32_ENERGY_PERF_BIAS (0x1B0)

Specifies hardware preference:

  • Values range from 0 to 15

  • 0 - Preference to highest performance (maximize frequency)

  • 7 - Balanced hint (balance performance and energy)

  • 15 - Preference to maximize energy saving (minimize frequency)

Sysfs interface: /sys/devices/system/cpu/cpu*/power/energy_perf_bias

Enhanced EPP: MSR 0x774 (IA32_HWP_REQUEST)

Modern Intel (Skylake+):

  • Bits [31:24] - Energy-Performance Preference (0=performance, 128=balanced, 255=energy)

  • Provides finer granularity and better hardware optimization

This allows the hardware to make local optimization decisions while respecting overall policy directives.
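
The field layout of IA32_HWP_REQUEST can be illustrated with a small decoder. This sketch only works on integer values (it does not read or write the MSR) and covers the four low byte-wide fields: minimum performance in bits [7:0], maximum in [15:8], desired in [23:16], and EPP in [31:24]. The type and function names are hypothetical.

```python
from typing import NamedTuple


class HwpRequest(NamedTuple):
    min_perf: int      # bits [7:0]
    max_perf: int      # bits [15:8]
    desired_perf: int  # bits [23:16], 0 lets the hardware choose
    epp: int           # bits [31:24], 0 = performance ... 255 = energy saving


def decode_hwp_request(value: int) -> HwpRequest:
    """Split the low 32 bits of an IA32_HWP_REQUEST value into its fields."""
    return HwpRequest(value & 0xFF, (value >> 8) & 0xFF,
                      (value >> 16) & 0xFF, (value >> 24) & 0xFF)


def with_epp(value: int, epp: int) -> int:
    """Return `value` with only the EPP field replaced."""
    if not 0 <= epp <= 255:
        raise ValueError("epp must be between 0 and 255")
    return (value & ~(0xFF << 24)) | (epp << 24)
```

For instance, decode_hwp_request(0x80002A08) describes a request with performance levels 8 to 42 and a balanced EPP of 128; with_epp changes the preference while leaving the min/max range untouched.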

../../_images/10.png

CPU Uncore Frequency

Intel processors contain “uncore” subsystems shared by multiple cores:

  • Last-Level Cache (LLC)

  • On-chip interconnect

  • Integrated memory controller

These components consume ~30% of chip area and can be frequency-scaled independently of core frequency.

Intel Uncore Control

MSR MSR_UNCORE_RATIO_LIMIT (0x620) sets the frequency limits of the uncore:

  • Applies to the subsystems shared by multiple processor cores (last-level cache, on-chip ring interconnect, integrated memory controllers)

  • Specifies the maximum and minimum frequency limits
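
A value read from MSR_UNCORE_RATIO_LIMIT can be decoded as sketched below. The layout used here (maximum ratio in bits [6:0], minimum ratio in bits [14:8]) and the 100 MHz multiplier are stated as assumptions to verify against the documentation of the specific CPU; the function names are hypothetical.

```python
def decode_uncore_ratio_limit(value: int, bus_clock_mhz: int = 100) -> tuple:
    """Return (min_mhz, max_mhz) encoded in an MSR_UNCORE_RATIO_LIMIT value."""
    max_ratio = value & 0x7F         # bits [6:0]
    min_ratio = (value >> 8) & 0x7F  # bits [14:8]
    return min_ratio * bus_clock_mhz, max_ratio * bus_clock_mhz


def encode_uncore_ratio_limit(min_mhz: int, max_mhz: int, bus_clock_mhz: int = 100) -> int:
    """Build a register value from frequency limits given in MHz."""
    if min_mhz > max_mhz:
        raise ValueError("min_mhz must not exceed max_mhz")
    return ((min_mhz // bus_clock_mhz) << 8) | (max_mhz // bus_clock_mhz)
```

Setting both limits to the same value pins the uncore frequency, which is a common way to study its effect in isolation.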

Monitoring Uncore Performance

  • MSR U_MSR_PMON_FIXED_CTR (0x704 since Haswell) - Fixed uncore clock counter

  • MSR 0x703 - Control register that enables the fixed uncore counter

AMD Uncore

  • Data Fabric (Infinity Fabric interconnect)

  • I/O subsystems

  • Can be controlled separately with P-states

../../_images/11-1.png

../../_images/12-1.png
../../_images/12-2.png

Workload Characterization

Understanding workload characteristics enables targeted power optimization:

  • Memory-bound workloads - Limited by memory bandwidth, not CPU cycles

    • Benefit from frequency reduction (saves power without hurting performance)

    • High latency tolerance for power management decisions

  • Compute-bound workloads - Limited by available CPU cycles

    • Require high frequency to maintain performance

    • Sensitive to frequency reduction

  • Communication-bound workloads - Limited by network bandwidth

    • Opportunity for frequency reduction during communication waits

    • Can benefit from dynamic scaling

  • I/O-bound workloads - Frequent stalls on disk/network access

    • Aggressive power reduction candidates

    • Minimal performance impact from lower frequency

Arithmetic Intensity

Arithmetic intensity, the ratio of compute operations to bytes moved to and from memory (FLOP/byte), helps predict whether a workload is compute-bound or memory-bound.
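
One common way to turn arithmetic intensity into a prediction is the roofline model. The sketch below is a minimal version of it; the peak performance and memory bandwidth in the usage note are hypothetical machine parameters, not values from this lesson.

```python
def attainable_gflops(intensity: float, peak_gflops: float, mem_bw_gbs: float) -> float:
    """Roofline bound: limited by compute peak or by memory traffic."""
    return min(peak_gflops, intensity * mem_bw_gbs)


def classify(intensity: float, peak_gflops: float, mem_bw_gbs: float) -> str:
    """Compare the intensity with the ridge point of the machine."""
    ridge = peak_gflops / mem_bw_gbs  # FLOP/byte where both limits meet
    return "memory-bound" if intensity < ridge else "compute-bound"
```

On a hypothetical node with 2000 GFLOP/s peak and 200 GB/s memory bandwidth, the ridge point is 10 FLOP/byte: a kernel at 0.25 FLOP/byte is memory-bound and a candidate for lower core frequency, while one at 16 FLOP/byte is compute-bound.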

../../_images/13.png

../../_images/14.png

../../_images/15.png

../../_images/16.png

Power Capping: Intel RAPL

Intel Running Average Power Limit (RAPL) provides hardware-based power capping and monitoring.

RAPL Architecture

images/17.png

Power Domains

Sysfs: /sys/devices/virtual/powercap/intel-rapl/intel-rapl:X/intel-rapl:X:Y (X = package, Y = subdomain)

Package domain:

  • Limits power consumption for entire CPU package (cores + uncore)

  • Short window: ~1.2× TDP (milliseconds)

  • Long window: ~TDP (seconds)

DRAM domain:

  • Used for memory power capping and monitoring

  • Enables P-State scaling for memory subsystem

  • Server architectures only (not client)

  • Single time window

  • Disabled by default

PP0 (Core) domain:

  • Restricts power limit to CPU cores only

  • Single time window

  • Not available on latest server CPUs

PP1 (Graphics) domain:

  • Power limits only integrated GPU

  • Not on server systems

  • Single time window

PSys (Platform) domain:

  • Controls entire System on Chip

  • Short and long windows

  • Available from Skylake architecture onwards

  • Requires vendor support

../../_images/18-1.png
../../_images/18-2.png

RAPL MSR Registers

MSR MSR_PKG_POWER_LIMIT (0x610):

../../_images/19-1.png

MSR MSR_RAPL_POWER_UNIT (0x606):

  • Power units (bits 3:0) - Watts per unit, 1/2^PU W

  • Energy status units (bits 12:8) - Joules per unit, 1/2^ESU J

  • Time units (bits 19:16) - Seconds per unit, 1/2^TU s

../../_images/19-2.jpg

Energy Consumption Measurement

MSR Energy Status Registers:

  • MSR MSR_PKG_ENERGY_STATUS (0x611) - Package energy

  • MSR MSR_DRAM_ENERGY_STATUS (0x619) - DRAM energy

  • MSR MSR_PP0_ENERGY_STATUS (0x639) - Core energy

  • MSR MSR_PP1_ENERGY_STATUS (0x641) - Graphics energy

  • MSR MSR_PLATFORM_ENERGY_COUNTER (0x64D) - Platform total
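
Raw energy counter values only become joules after scaling with the units from MSR_RAPL_POWER_UNIT. The following sketch shows that arithmetic on plain integers, assuming the usual layout (power unit in bits 3:0, energy unit in bits 12:8, time unit in bits 19:16, each meaning 1/2^n) and a 32-bit energy counter that wraps around. The function names are hypothetical, and reading the MSRs themselves is out of scope here.

```python
def decode_rapl_units(value: int) -> dict:
    """Decode MSR_RAPL_POWER_UNIT into watts, joules and seconds per unit."""
    return {
        "power_w": 1.0 / (1 << (value & 0xF)),
        "energy_j": 1.0 / (1 << ((value >> 8) & 0x1F)),
        "time_s": 1.0 / (1 << ((value >> 16) & 0xF)),
    }


def energy_delta_joules(before: int, after: int, energy_unit_j: float) -> float:
    """Energy consumed between two counter reads, tolerating one wraparound."""
    delta = (after - before) & 0xFFFFFFFF  # the counter is 32 bits wide
    return delta * energy_unit_j


def average_power_w(before: int, after: int, energy_unit_j: float, interval_s: float) -> float:
    """Average power over the sampling interval."""
    return energy_delta_joules(before, after, energy_unit_j) / interval_s
```

Because the counter wraps, samples must be taken often enough that at most one wraparound can occur between two reads; otherwise the computed energy is too low.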

../../_images/19-3.png

../../_images/20-1.png

images/20-2.jpg

Power Capping Behavior

Intel RAPL algorithm:

  • The power capping system downscales CPU core and uncore frequencies to keep power consumption at or below the limit

  • Note: Intel RAPL does not take the arithmetic intensity of the workload into account

../../_images/21.png

Case Study: Cascade Lake Power Management

Frequency Scaling with Arithmetic Intensity

Example: AVX-512 workload with arithmetic intensity 8

RAPL behavior:

  • Keeps core frequency at maximum

  • Downscales uncore frequency to stay within power limit

../../_images/22.png

../../_images/23.png

Results

AVX-512 with arithmetic intensity 8, core 1.9 GHz, uncore 2.2 GHz:

  • 18% CPU energy savings

  • 3.5% runtime improvement

  • 15% node energy savings

../../_images/24.png

Advanced Platforms: Grace Hopper Power Management

The platform exposes a variety of power domains (CPU cores, GPU SMs, module level) but only a limited set of knobs (frequencies):

../../_images/25.png

Challenge: Power shifting between CPU and GPU requires frequency coordination to stay within node power limit.

Disabling Units for Power Savings

Beyond frequency scaling, some systems support unit disabling:

  • Multi-threading (on/off) - Disable simultaneous multithreading

  • Disabling cores - Complex, affects memory bandwidth

  • AMD xGMI lanes - External Global Memory Interconnect per link

  • Fujitsu A64FX FPU pipelines - FLA (floating-point) and EXA (integer) elimination

  • P-cores and E-cores - Heterogeneous cores (not yet in HPC)

../../_images/26.png

Power Management Knobs Overview

Across different HPC-relevant platforms:

Intel:

  • CPU - core frequency, uncore frequency, power capping

  • ACC (PVC) - GPU frequency, memory frequency, power capping

  • ACC (KNL) - core frequency, power capping

AMD:

  • CPU - core frequency, power capping, Data Fabric frequency

  • ACC - power capping, frequency (system, Data Fabric, display, SOC, memory, PCIe)

NVIDIA:

  • GPU - SM frequency, memory frequency, power capping

  • CPU+GPU - Grace Hopper: CPU core & GPU SM frequencies, multi-level power capping

IBM:

  • CPU - core frequency, power capping

  • CPU+GPU - core frequency, CPU/GPU/node power capping

ARM:

  • CPU - Fujitsu A64FX: core frequency, FPU pipeline elimination, memory frequency

  • CPU - NVIDIA Grace: core frequency, power capping

Runtime Systems for Automatic Power Management

Runtime systems automatically adjust power management parameters based on application characteristics and system constraints.

What Runtime Systems Do

  1. Profile workload - Identify CPU/memory/I/O patterns

  2. Monitor performance - Track actual vs expected performance

  3. Select frequencies - Choose P-states to meet constraints (power budget, deadline, etc.)

  4. Adapt dynamically - Respond to runtime conditions

Example: Energy-Aware Scheduling

A runtime system might:

  • Reduce frequency for loosely-coupled parallel tasks (communication-bound)

  • Increase frequency for CPU-bound tasks requiring maximum throughput

  • Collectively respect power budget across all running jobs

  • Adapt based on observed thermal conditions

Example: Power-Capping Runtime

Ensures total node power doesn’t exceed a limit while maximizing performance:

Algorithm:
  power_limit_per_core = total_power_limit / num_cores
  FOR each core:
    frequency = find_max_frequency(power_limit_per_core)
    set_frequency(core, frequency)
  LOOP:
    Monitor actual power
    IF actual power > total_power_limit: lower frequencies
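
The algorithm above can be sketched in Python as follows. Everything here is illustrative: the power model is a hypothetical stand-in (static power plus a cubic frequency term), the list of frequencies is made up, and a real runtime would measure power through RAPL and apply frequencies through cpufreq instead of returning lists.

```python
def find_max_frequency(power_limit_w, freqs_mhz, power_model):
    """Highest frequency whose modeled power fits the limit, else the lowest."""
    fitting = [f for f in sorted(freqs_mhz) if power_model(f) <= power_limit_w]
    return fitting[-1] if fitting else min(freqs_mhz)


def plan_frequencies(total_limit_w, num_cores, freqs_mhz, power_model):
    """Split the budget evenly and choose one frequency per core."""
    per_core_limit = total_limit_w / num_cores
    return [find_max_frequency(per_core_limit, freqs_mhz, power_model)] * num_cores


def adjust_on_overage(freqs, measured_w, total_limit_w, freqs_mhz):
    """Step every core down one level if measured power exceeds the limit."""
    if measured_w <= total_limit_w:
        return list(freqs)
    levels = sorted(freqs_mhz)
    return [levels[max(levels.index(f) - 1, 0)] for f in freqs]


def toy_power_model(freq_mhz):
    """Hypothetical per-core power in watts: 1 W static plus a cubic term."""
    return 1.0 + 4.0 * (freq_mhz / 2000.0) ** 3
```

With the toy model, four cores, and a 16 W budget, each core gets 4 W and is planned at 1600 MHz; if the measured power still exceeds the budget, adjust_on_overage drops all cores to the next lower level. An even split is the simplest policy and ignores that idle cores could donate budget to busy ones.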

Practical HPC Power Management Strategies

Strategy 1: Fixed Frequency

Set a conservative fixed frequency across the cluster:

# Cap all nodes at 80% of maximum performance (an upper limit, not a pinned frequency)
echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct

Pros:

  • Simple to deploy and manage

  • Predictable power consumption

  • Measurable energy savings (typically 10-20%)

Cons:

  • No performance adaptation

  • May waste power during I/O waits

  • Not optimal for heterogeneous workloads

Best for: Homogeneous workloads with known characteristics, power-constrained systems

Strategy 2: Per-Application Tuning

Different application classes get different frequencies based on profiling:

Batch jobs (deadline-loose):          80% frequency (20% power saving)
Interactive jobs (latency-critical):  100% frequency
GPU-accelerated jobs:                 CPU 60%, GPU 100%
I/O-bound services:                   70% frequency
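
Such a policy is often kept as a small lookup table that a job prologue script consults. The sketch below mirrors the CPU percentages listed above; the class names and the helper cpu_cap_for are hypothetical, and unknown job classes fall back to no cap.

```python
# CPU frequency cap (percent of maximum) per job class, from the table above.
CPU_CAP_PCT = {
    "batch": 80,
    "interactive": 100,
    "gpu": 60,
    "io": 70,
}


def cpu_cap_for(job_class: str, default: int = 100) -> int:
    """Return the CPU cap for a job class, or `default` if the class is unknown."""
    return CPU_CAP_PCT.get(job_class, default)
```

The returned value could then be written to max_perf_pct before the job starts and reset to 100 afterwards.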

Pros:

  • Better performance for latency-critical work

  • Energy savings for batch jobs

  • Balanced approach for mixed workloads

Cons:

  • Requires profiling and characterization

  • Need to identify workload class at runtime

  • Moderate complexity in implementation

Best for: Mixed HPC centers with diverse job types

Strategy 3: Dynamic Runtime Control

Runtime system adjusts frequency based on:

  • CPU utilization and workload signals

  • Memory bandwidth usage

  • Power budget remaining

  • Deadline/QoS requirements

Pros:

  • Highest potential for energy savings (20-40%)

  • Adapts automatically to workload changes

  • Respects power budgets dynamically

Cons:

  • Most complex to implement

  • Requires careful tuning of heuristics

  • May have overhead from monitoring

Best for: Advanced HPC centers with sophisticated workload management

Summary: Power Management Parameters

Parameter          Conservative   Aggressive   Notes
Governor           performance    ondemand     Depends on workload characteristics
Max Frequency      100%           70-80%       Significant power savings at cost
Turbo Boost        Enabled        Disabled     Disabling reduces power ~5-10%
C-states           Enabled        Enabled      Should always enable (minimal performance cost)
HWP                Enabled        Enabled      Always use if supported (improves responsiveness)
Uncore Frequency   100%           80-90%       Less impact than core frequency

The optimal settings depend on your specific workload, power budget, and performance requirements. Profiling and validation are essential for any production deployment.

Case Study: Energy-Aware HPC - RIKEN Fugaku

System Overview

Ranking: #1 in Top500 from June 2020 through November 2021

Processor: Fujitsu A64FX

  • 48 compute cores + 4 assistant cores (OS daemon and MPI offload)

  • No TDP, no nominal frequency → no traditional turbo concept

  • Available frequencies: 1.6, 1.8, 2.0, or 2.2 GHz

User-Controlled Power Options

Power mode (scheduler option):

  • Normal - 2.0 GHz frequency (baseline)

  • Boost - 2.2 GHz frequency (performance)

  • ECO - 2.0 GHz + only one of the two FPU units in use + reduced standby power

  • Boost ECO - 2.2 GHz + FPU unit elimination

Core retention (ON/OFF):

  • When enabled: Eliminates standby power for idle CPU cores

  • Significant power savings for workloads that don’t utilize all cores

Reference: https://sites.google.com/view/rikenfugakushowcase/home

../../_images/27.png

../../_images/28.png

Summary: Episode 1 Learning Outcomes

After completing this episode, you should be able to:

  1. Implement power management - Use sysfs and MSR interfaces to control CPU frequencies

  2. Select appropriate governors - Choose scaling policies based on workload characteristics

  3. Understand hardware mechanisms - Explain HWP, RAPL, turbo boost, and other hardware features

  4. Design runtime systems - Create algorithms for automatic power optimization

  5. Evaluate trade-offs - Balance performance, power, and reliability constraints

  6. Apply best practices - Implement fixed, per-application, or dynamic power strategies

  7. Analyze real systems - Understand power management in production HPC systems


Assessment: You should now be able to profile a workload, design a power management strategy, and implement it on an HPC system while respecting power budgets and performance requirements.