Power Management: Implementation and Runtime Systems¶

This episode explores the technical implementation of power management on modern CPUs, including the software interfaces, hardware mechanisms, and runtime systems that enable dynamic power optimization in HPC environments.

Scaling Drivers and Governors¶

Intel P-state Driver¶

The intel_pstate driver provides direct hardware-level frequency control on modern Intel processors (Sandy Bridge and newer).

Key characteristics:

Controls P-states directly via MSR (Model-Specific Register)
Firmware-independent implementation
Per-core frequency capability on newer CPUs
More responsive to workload changes

Sysfs interface (/sys/devices/system/cpu/intel_pstate/):

max_perf_pct          - Maximum P-state allowed (% of max supported)
min_perf_pct          - Minimum P-state allowed (% of max supported)
turbo_pct             - Ratio of turbo range to total range
no_turbo              - Disable all turbo frequencies (0=enabled, 1=disabled)
hwp_dynamic_boost     - Enable iowait-triggered boosting (HWP mode)
num_pstates           - Number of supported P-states
status                - Driver operation mode: "active", "passive", or "off"

Example - Disable turbo boost:

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

Example - Limit maximum frequency to 80%:

echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
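The same write can be scripted with a range check. The following is a minimal sketch, assuming the intel_pstate sysfs directory shown above; the function name set_max_perf_pct and its root parameter (which lets the code run against a scratch directory) are illustrative, not part of any library.

```python
from pathlib import Path

INTEL_PSTATE = Path("/sys/devices/system/cpu/intel_pstate")


def set_max_perf_pct(pct, root=INTEL_PSTATE):
    """Write max_perf_pct and return the value read back.

    `root` is a hypothetical parameter for testing against a scratch
    directory; on a real system the write requires root privileges.
    """
    min_pct = int((root / "min_perf_pct").read_text())
    if not min_pct <= pct <= 100:
        raise ValueError(f"max_perf_pct must be in [{min_pct}, 100], got {pct}")
    (root / "max_perf_pct").write_text(f"{pct}\n")
    # Read back: the kernel may clamp the value it actually accepted.
    return int((root / "max_perf_pct").read_text())
```

Reading the value back rather than trusting the write is a deliberate choice here, since the driver is free to adjust the request.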

ACPI CPUFreq Driver¶

The acpi-cpufreq driver implements ACPI-based frequency scaling, widely used across different CPU vendors.

Key characteristics:

Works with ACPI firmware tables
Supports multiple CPU types (Intel, AMD, others)
Requires firmware to provide P-state information
More portable across different systems

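Sysfs interface (/sys/devices/system/cpu/cpu*/cpufreq/; these attributes belong to the generic cpufreq layer, and some appear only with particular drivers):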

scaling_driver              - Current driver (acpi-cpufreq, intel_pstate, etc.)
scaling_governor            - Current governor (performance, powersave, etc.)
scaling_cur_freq            - Current frequency in kHz
cpuinfo_min_freq            - Minimum CPU frequency in kHz
cpuinfo_max_freq            - Maximum CPU frequency in kHz
cpuinfo_base_freq           - Nominal/base frequency in kHz
scaling_min_freq            - Minimum frequency driver is allowed to set
scaling_max_freq            - Maximum frequency driver is allowed to set
scaling_setspeed            - Set specific frequency (userspace governor only)
scaling_available_governors - List of available governors
energy_performance_preference - Hardware P-State energy/performance trade-off
base_frequency              - Nominal frequency without turbo

Example - View all available governors:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

Example - Change to powersave governor:

echo powersave > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
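The command above changes cpu0 only. A minimal sketch of applying a governor to every CPU follows, assuming the per-CPU cpufreq directories listed above; set_governor_all and the cpu_root parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def set_governor_all(governor, cpu_root=CPU_ROOT):
    """Set `governor` on every CPU that offers it; return the CPUs changed."""
    changed = []
    for cpufreq in sorted(cpu_root.glob("cpu[0-9]*/cpufreq")):
        available = (cpufreq / "scaling_available_governors").read_text().split()
        if governor not in available:
            continue  # this CPU's driver does not offer the requested governor
        (cpufreq / "scaling_governor").write_text(governor + "\n")
        changed.append(cpufreq.parent.name)
    return changed
```

Checking scaling_available_governors first avoids a failed write on systems where the active driver does not expose the requested policy.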

Scaling Governors: Policies for Frequency Selection¶

A scaling governor implements the policy that decides which P-state (frequency) to use based on current conditions.

Performance Governor¶

Policy: Always run at maximum frequency.

Advantages:

Highest computational performance
Simplest policy (no decision logic)
Predictable behavior

Disadvantages:

Maximum power consumption
Wastes power during I/O waits
Contributes to thermal issues

Use case: Performance-critical applications where energy is not a constraint.

Powersave Governor¶

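Policy: Always run at minimum frequency. This describes the generic cpufreq governor; under intel_pstate in active mode, the governor named powersave is instead a dynamic, utilization-driven policy.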

Advantages:

Minimum power consumption
Best for energy-constrained systems
Reduces thermal load

Disadvantages:

Worst computational performance
Only suitable for loosely timed workloads
Can cause severe performance degradation

Use case: Energy-critical systems or background tasks.

Ondemand Governor¶

Policy: Scale frequency based on CPU utilization.

Advantages:

Dynamic response to workload changes
Better energy efficiency than performance
Reasonable performance for most workloads

Disadvantages:

Frequency scaling latency can cause performance dips
Threshold tuning is system-specific
Not optimal for irregular workloads

Tuning parameters:

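up_threshold - Load percentage above which the governor requests the maximum frequency (defaults of 80% or 95% are common, depending on the kernel)
sampling_down_factor - Number of sampling periods to stay at maximum frequency before re-evaluating (down_threshold is a tunable of the conservative governor, not of ondemand)
sampling_rate - How frequently to re-evaluate load, in microseconds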
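The policy can be illustrated with a simplified model of one sampling period. This is a sketch under stated assumptions, not the kernel's implementation: it uses 80% as the threshold, and the function name ondemand_target_khz is hypothetical.

```python
def ondemand_target_khz(load_pct, min_khz, max_khz, up_threshold=80.0):
    """Simplified ondemand decision for one sampling period.

    Above up_threshold the maximum frequency is requested; otherwise the
    request is proportional to the load between min_khz and max_khz.
    """
    if not 0.0 <= load_pct <= 100.0:
        raise ValueError("load_pct must be between 0 and 100")
    if load_pct > up_threshold:
        return max_khz
    return int(min_khz + load_pct * (max_khz - min_khz) / 100.0)
```

For a CPU ranging from 1.0 to 3.0 GHz, a 50% load maps to 2.0 GHz in this model, while a 90% load jumps straight to 3.0 GHz.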

Use case: General-purpose systems with variable workloads.

Conservative Governor¶

Policy: Similar to ondemand but with more gradual frequency changes.
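The gradual behavior can be sketched as a step function. This is a simplified model, not the kernel code; the default values (80% up threshold, 20% down threshold, steps of 5% of the maximum frequency) are assumptions modelled on the conservative governor's up_threshold, down_threshold and freq_step tunables, and the function name is hypothetical.

```python
def conservative_next_khz(cur_khz, load_pct, min_khz, max_khz,
                          up_threshold=80.0, down_threshold=20.0,
                          freq_step_pct=5.0):
    """Move one step up or down per sampling period, or stay put."""
    step = int(max_khz * freq_step_pct / 100.0)
    if load_pct > up_threshold:
        return min(cur_khz + step, max_khz)   # step up, clamp at maximum
    if load_pct < down_threshold:
        return max(cur_khz - step, min_khz)   # step down, clamp at minimum
    return cur_khz                            # load in the dead band: no change
```

Unlike a policy that jumps directly to the maximum, this model needs several sampling periods to climb from a low frequency to the top.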

Advantages:

More stable than ondemand
Avoids rapid frequency oscillation
Balanced energy/performance trade-off

Disadvantages:

Slower response to load increases
May miss performance opportunities
Not suitable for bursty workloads

Use case: Systems requiring stability with moderate power savings.

Userspace Governor¶

Policy: Allow user applications or system administrators to directly set frequency.

Advantages:

Full control for specialized applications
Can implement custom power management strategies
Enables research into novel power policies

Disadvantages:

Requires user/application to manage frequency
Incorrect policies can waste power
Not suitable for general use

Use case: Research, application-specific optimization, or HPC runtime systems.

Hardware Frequency Control: MSR Registers¶

The actual frequency selection on Intel CPUs is controlled by writing to a Model-Specific Register (MSR):

IA32_PERF_CTL (0x199)¶

This register specifies the target P-state for the CPU:

MSR 0x199 (IA32_PERF_CTL)
Bits [15:8] - Target P-State
Bits [31:16] - Reserved

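Example: Writing 0x1C00 places 0x1C (decimal 28) in bits [15:8], requesting P-state 28. The value is a clock ratio: on processors with a 100 MHz bus clock it corresponds to 2.8 GHz.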
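The bit layout can be checked with a few lines of arithmetic. This is a minimal sketch of the encoding only (it does not touch any MSR); the helper names are hypothetical, and the 100 MHz bus clock used to convert a ratio to a frequency is an assumption that holds for many, but not all, Intel processors.

```python
BUS_CLOCK_MHZ = 100  # assumption: 100 MHz bus clock


def encode_perf_ctl(ratio):
    """Place the target P-state (ratio) in bits [15:8]."""
    if not 0 <= ratio <= 0xFF:
        raise ValueError("ratio must fit in 8 bits")
    return ratio << 8


def decode_perf_ctl(value):
    """Extract the target P-state from bits [15:8]."""
    return (value >> 8) & 0xFF


def ratio_to_mhz(ratio):
    """Convert a ratio to MHz under the bus clock assumption above."""
    return ratio * BUS_CLOCK_MHZ
```

With these helpers, encode_perf_ctl(28) gives 0x1C00, matching the example value above.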

CPU core frequency scaling via DVFS operates by specifying a target P-State, which may differ from the current P-State. The hardware then transitions to the target frequency.

When controlled via scaling driver: The driver automatically writes to MSR 0x199 based on the selected governor policy and workload conditions.

When controlled from userspace:

Set userspace scaling governor to enable direct frequency control
Disable Hardware P-State (HWP) if available to allow manual control
Write target frequency to scaling_setspeed sysfs interface
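The first and third steps above can be sketched as follows. The sketch assumes a driver that exposes scaling_available_frequencies (acpi-cpufreq does; intel_pstate does not), leaves the HWP step to firmware or boot configuration, and uses a hypothetical function name and directory argument.

```python
from pathlib import Path


def set_fixed_frequency(target_khz, cpufreq_dir):
    """Select the userspace governor and request the nearest supported frequency.

    `cpufreq_dir` is a cpuN/cpufreq directory; writes require root on a
    real system. Disabling HWP is not handled here.
    """
    cpufreq_dir = Path(cpufreq_dir)
    text = (cpufreq_dir / "scaling_available_frequencies").read_text()
    available = [int(f) for f in text.split()]
    chosen = min(available, key=lambda f: abs(f - target_khz))
    (cpufreq_dir / "scaling_governor").write_text("userspace\n")
    (cpufreq_dir / "scaling_setspeed").write_text(f"{chosen}\n")
    return chosen
```

Snapping to the nearest listed frequency keeps the request within the set of P-states the firmware actually advertises.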

Important: Frequency changes are not instantaneous. There is a transition latency (typically 10-50 microseconds on modern CPUs) where the CPU is temporarily unavailable during frequency switching.

Intel Turbo Boost Technology¶

Turbo Boost is a hardware feature that opportunistically increases frequency beyond the nominal specification when thermal and power headroom exists.

Boost Frequency Levels¶

Different instruction sets have different maximum boost frequencies:

SSE Instructions       ──────────┐
                                  │ Highest boost frequency
                                  │
AVX/AVX2 Instructions ─────────┐ │
                                │ │ Medium boost frequency
                                │ │
AVX-512 Instructions  ──────┐   │ │
                            │   │ │ Lowest boost frequency
                            │   │ │
  Nominal Frequency ────────┴───┴─┴── Base frequency

Why Different Frequencies for Different Instructions?¶

Power and thermal constraints:

AVX-512 instructions perform more computation per cycle → more current/heat
To stay within power budget, frequency must be reduced
Total work per unit time may still increase despite lower frequency

Example (Hypothetical):

SSE turbo: 3.8 GHz × 1 compute unit per cycle = 3.8 G units/s
AVX-512 turbo: 3.0 GHz × 8 compute units per cycle = 24 G units/s

Despite the lower frequency, AVX-512 delivers more throughput within the same power budget.

Disabling Turbo Boost¶

System administrators can disable turbo boost to:

Reduce power consumption
Improve frequency stability for benchmarking
Ensure consistent thermal behavior

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

Frequency Transition Latency¶

Frequency changes are not instantaneous. Understanding transition latency is important for applications that require tight timing.

What Happens During Frequency Change¶

OS writes target P-state to MSR
CPU begins voltage/frequency ramping
Transition period: CPU unavailable for instruction execution (~10-50 μs)
CPU reaches target frequency and resumes execution

Measuring Frequency Transition Latency¶

The cpufreq subsystem reports the transition latency declared by the scaling driver; the value is not measured at runtime:

# Driver-reported transition latency, in nanoseconds
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency

# Typical values:
# Modern Intel: 5-20 μs
# Older CPUs: 100-500 μs

Impact on Applications¶

For most HPC applications: Frequency transition latency (tens of microseconds) is negligible compared to computation time (milliseconds to seconds).

Exception: Real-time applications or GPU-CPU synchronization may be affected by frequency jitter.
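The claim that transition latency is negligible for most HPC codes can be checked with a quick estimate: the fraction of time lost is the number of transitions per second times the latency of each. A minimal sketch, assuming (as the text above does) that the core executes nothing during a transition; the function name and the example rates are illustrative.

```python
def transition_overhead_fraction(transitions_per_second, latency_us):
    """Fraction of wall time spent in frequency transitions."""
    return transitions_per_second * latency_us * 1e-6
```

At 100 transitions per second and 50 μs each, the estimate is 0.5% of wall time; a governor that switches 10,000 times per second at the same latency would lose half of it, which is why jitter-sensitive codes often pin the frequency.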

GPU Frequency Management¶

GPUs also support frequency scaling, though the mechanisms differ from CPUs.

NVIDIA GPU Frequency Scaling¶

nvidia-smi allows querying and setting GPU frequency:

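# View current and maximum SM and memory clocks
nvidia-smi --query-gpu=clocks.sm,clocks.max.sm,clocks.mem,clocks.max.mem --format=csv

# List supported clock frequencies
nvidia-smi -q -d SUPPORTED_CLOCKS

# Lock the GPU clock to a range in MHz (requires root; enable persistence mode first so the setting is retained)
nvidia-smi -pm 1
nvidia-smi -lgc <min_mhz>,<max_mhz>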

Characteristics:

Coarser granularity than CPU (typically 25-50 MHz steps)
Separate memory and core clocks
Can be locked to specific frequencies or set to dynamic

AMD GPU Frequency Scaling¶

rocm-smi provides AMD GPU frequency control:

# View supported frequency levels
rocm-smi --showclkfrq

# Set frequency
rocm-smi --setsclk <level>  # Compute clock
rocm-smi --setmclk <level>  # Memory clock

GPU Frequency Switching Latency¶

Hardware P-State (HWP) and SpeedShift¶

Starting with the Skylake architecture, Intel introduced Hardware P-States (HWP), marketed as Speed Shift Technology.

How HWP Works¶

Instead of the OS setting a specific frequency via MSR writes, HWP allows hardware to autonomously select P-states within a range specified by the OS.

Traditional OS-controlled approach:

OS: "Set frequency to 2.4 GHz"
CPU: Adjusts to 2.4 GHz (10-50 μs latency)

HWP approach:

OS: "Select a P-state between 5 and 30 (min/max)"
CPU: Autonomously selects optimal P-state (0.5-2 μs latency)

Advantages of HWP¶

Reduced latency - Hardware responds faster to workload changes (10-100× faster)
Smarter decisions - Hardware can respond to iowait, memory latency, other signals
Better multi-core management - Different cores can independently optimize
AVX-512 awareness - Hardware automatically reduces frequency for power-intensive instructions

Enabling HWP¶

HWP is enabled via MSR 0x770 (IA32_PM_ENABLE):

# Check if HWP is supported (prints "hwp" if the CPU flag is present)
grep -ow -m1 hwp /proc/cpuinfo

# Enable HWP (requires boot-time kernel parameter or firmware change)
# Once enabled, HWP stays on until the next reset, and further writes to certain MSR registers are ignored

Important: HWP significantly improves responsiveness to workload changes and is enabled by default on modern systems.

Dynamic Duty Clock Modulation¶

An alternative power management technique that statistically skips a user-defined number of clock cycles, reducing power without changing frequency.

Energy-Performance Preference (EPP)¶

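Modern Intel CPUs accept an energy/performance hint via MSR that allows fine-grained control over the energy-performance trade-off. The older form is the Energy Performance Bias (EPB); HWP-capable CPUs add the Energy-Performance Preference (EPP) field.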

MSR IA32_ENERGY_PERF_BIAS (0x1B0)¶

Specifies hardware preference:

Values range from 0 to 15
0 - Preference to highest performance (maximize frequency)
7 - Balanced hint (balance performance and energy)
15 - Preference to maximize energy saving (minimize frequency)

Sysfs interface: /sys/devices/system/cpu/cpu*/power/energy_perf_bias

Enhanced EPP: MSR 0x774 (IA32_HWP_REQUEST)¶

Modern Intel (Skylake+):

Bits [31:24] - Energy-Performance Preference (0=performance, 128=balanced, 255=energy)
Provides finer granularity and better hardware optimization

This allows the hardware to make local optimization decisions while respecting overall policy directives.
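A request value can be unpacked to see how the EPP field sits alongside the P-state range. The sketch below works on an integer only and does not read the MSR. The EPP position (bits [31:24]) is taken from the text above; the positions of the minimum, maximum and desired performance fields (bits [7:0], [15:8], [23:16]) are assumptions based on Intel's published register layout, and all names are hypothetical.

```python
from typing import NamedTuple


class HwpRequest(NamedTuple):
    min_perf: int      # bits [7:0]   (assumed layout)
    max_perf: int      # bits [15:8]  (assumed layout)
    desired_perf: int  # bits [23:16] (assumed layout); 0 lets hardware choose
    epp: int           # bits [31:24], 0=performance ... 255=energy


def decode_hwp_request(value):
    """Split the low 32 bits of an IA32_HWP_REQUEST value into fields."""
    return HwpRequest(
        min_perf=value & 0xFF,
        max_perf=(value >> 8) & 0xFF,
        desired_perf=(value >> 16) & 0xFF,
        epp=(value >> 24) & 0xFF,
    )


def with_epp(value, epp):
    """Return `value` with only the EPP field replaced."""
    if not 0 <= epp <= 255:
        raise ValueError("EPP must be in 0..255")
    return (value & ~(0xFF << 24)) | (epp << 24)
```

For example, 0x80001E05 decodes to a range of P-states 5 to 30 with a balanced EPP of 128; with_epp(0x80001E05, 255) changes only the preference byte.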

CPU Uncore Frequency¶

Intel processors contain “uncore” subsystems shared by multiple cores:

Last-Level Cache (LLC)
On-chip interconnect
Integrated memory controller

These components consume ~30% of chip area and can be frequency-scaled independently of core frequency.

Intel Uncore Control¶

MSR MSR_UNCORE_RATIO_LIMIT (0x620) controls uncore frequency limits for:

Frequency of subsystems shared by multiple processor cores
Last level cache, on-chip ring interconnect, integrated memory controllers
Specification of maximum and minimum limits
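The minimum and maximum limits can be decoded from a register value as sketched below. The field positions (maximum ratio in bits [6:0], minimum ratio in bits [14:8]) and the 100 MHz ratio unit are assumptions based on Intel's documented layout for this MSR, not on the text above; the function name is hypothetical, and the code only does arithmetic on an integer.

```python
def decode_uncore_ratio_limit(value):
    """Return (min_mhz, max_mhz) from an MSR_UNCORE_RATIO_LIMIT value.

    Assumed layout: bits [6:0] = max ratio, bits [14:8] = min ratio,
    each in units of 100 MHz.
    """
    max_ratio = value & 0x7F
    min_ratio = (value >> 8) & 0x7F
    return min_ratio * 100, max_ratio * 100
```

A value of 0x0C18 would decode to an uncore range of 1200 to 2400 MHz under these assumptions.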

Monitoring Uncore Performance¶

MSR U_MSR_PMON_FIXED_CTR (0x704, since Haswell) - Uncore performance counter
MSR 0x703 - Uncore performance counter enable

AMD Uncore¶

Data Fabric (Infinity Fabric interconnect)
I/O subsystems
Can be controlled separately with P-states

Workload Characterization¶

Understanding workload characteristics enables targeted power optimization:

Memory-bound workloads - Limited by memory bandwidth, not CPU cycles
- Benefit from frequency reduction (saves power without hurting performance)
- High latency tolerance for power management decisions
Compute-bound workloads - Limited by available CPU cycles
- Require high frequency to maintain performance
- Sensitive to frequency reduction
Communication-bound workloads - Limited by network bandwidth
- Opportunity for frequency reduction during communication waits
- Can benefit from dynamic scaling
I/O-bound workloads - Frequent stalls on disk/network access
- Aggressive power reduction candidates
- Minimal performance impact from lower frequency

Arithmetic Intensity¶

Arithmetic intensity, the ratio of compute operations to bytes moved to and from memory (FLOP/byte), helps predict whether a workload is compute-bound or memory-bound.
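A roofline-style estimate makes this concrete: a kernel is memory-bound when its intensity times the memory bandwidth is below the peak compute rate. The sketch below is illustrative only; the function names are hypothetical and the peak and bandwidth figures used in the example are made-up round numbers, not measurements of any system.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOP per byte of memory traffic."""
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, mem_bw_gbs):
    """Roofline bound: limited by compute peak or by memory bandwidth."""
    return min(peak_gflops, intensity * mem_bw_gbs)


def is_memory_bound(intensity, peak_gflops, mem_bw_gbs):
    """True when memory bandwidth, not compute, sets the bound."""
    return intensity * mem_bw_gbs < peak_gflops
```

A double-precision update y = a*x + y performs 2 FLOP per element while moving 24 bytes (two loads and one store), giving an intensity of 1/12 FLOP/byte. On a hypothetical node with 3000 GFLOP/s peak and 200 GB/s bandwidth this kernel is firmly memory-bound, which makes it a candidate for core frequency reduction.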

Power Capping: Intel RAPL¶

Intel Running Average Power Limit (RAPL) provides hardware-based power capping and monitoring.

RAPL Architecture¶

Power Domains¶

Sysfs: /sys/devices/virtual/powercap/intel-rapl/intel-rapl:X/intel-rapl:X:Y (X = package, Y = subdomain)

Package domain:

Limits power consumption for entire CPU package (cores + uncore)
Short window: ~1.2× TDP (milliseconds)
Long window: ~TDP (seconds)
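The two package windows are visible in the powercap sysfs tree as numbered constraints. The following is a minimal sketch that reads them, assuming the standard powercap attribute names (constraint_N_name, constraint_N_power_limit_uw, constraint_N_time_window_us); the function name and the directory argument are hypothetical.

```python
from pathlib import Path


def read_package_limits(domain_dir):
    """Return {constraint name: {"watts": ..., "window_s": ...}}.

    `domain_dir` is a package directory such as .../intel-rapl:0.
    """
    domain_dir = Path(domain_dir)
    limits = {}
    for name_file in sorted(domain_dir.glob("constraint_*_name")):
        idx = name_file.name.split("_")[1]
        name = name_file.read_text().strip()
        limit_uw = int((domain_dir / f"constraint_{idx}_power_limit_uw").read_text())
        window_us = int((domain_dir / f"constraint_{idx}_time_window_us").read_text())
        limits[name] = {"watts": limit_uw / 1e6, "window_s": window_us / 1e6}
    return limits
```

On typical systems the constraints are named long_term and short_term, corresponding to the long and short windows described above.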

DRAM domain:

Used for memory power capping and monitoring
Enables P-State scaling for memory subsystem
Server architectures only (not client)
Single time window
Disabled by default

PP0 (Core) domain:

Restricts power limit to CPU cores only
Single time window
Not available on latest server CPUs

PP1 (Graphics) domain:

Power limits only integrated GPU
Not on server systems
Single time window

PSys (Platform) domain:

Controls entire System on Chip
Short and long windows
Available from Skylake architecture onwards
Requires vendor support

RAPL MSR Registers¶

MSR MSR_PKG_POWER_LIMIT (0x610):

MSR MSR_RAPL_POWER_UNIT (0x606):

Power units (Watts per unit)
Energy status units (Joules per unit)
Time units (seconds per unit)
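Each unit is stored as an exponent: the register holds N and the unit is 1/2^N. The sketch below decodes a register value under the field layout documented by Intel (power in bits [3:0], energy in bits [12:8], time in bits [19:16]), which is an assumption not stated in the text above; the function name is hypothetical.

```python
def decode_rapl_power_unit(value):
    """Return (watts_per_unit, joules_per_unit, seconds_per_unit).

    Assumed layout: bits [3:0] power, bits [12:8] energy status,
    bits [19:16] time; each field N encodes a unit of 1 / 2**N.
    """
    power_unit_w = 1.0 / (1 << (value & 0xF))
    energy_unit_j = 1.0 / (1 << ((value >> 8) & 0x1F))
    time_unit_s = 1.0 / (1 << ((value >> 16) & 0xF))
    return power_unit_w, energy_unit_j, time_unit_s
```

A value of 0xA0E03 decodes to 0.125 W, about 61 μJ and about 0.98 ms per unit. Energy units in particular vary between processor generations and domains, so they should be read from the register rather than hard-coded.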

Energy Consumption Measurement¶

MSR Energy Status Registers:

MSR MSR_PKG_ENERGY_STATUS (0x611) - Package energy
MSR MSR_DRAM_ENERGY_STATUS (0x619) - DRAM energy
MSR MSR_PP0_ENERGY_STATUS (0x639) - Core energy
MSR MSR_PP1_ENERGY_STATUS (0x641) - Graphics energy
MSR MSR_PLATFORM_ENERGY_COUNTER (0x64D) - Platform total
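These registers are cumulative counters, so power is obtained by sampling twice and dividing the energy difference by the elapsed time. The sketch below assumes a 32-bit counter that wraps around and an energy unit obtained separately from MSR_RAPL_POWER_UNIT; the function name and parameters are hypothetical.

```python
def average_power_watts(raw_start, raw_end, energy_unit_j, elapsed_s,
                        counter_bits=32):
    """Average power between two raw energy-status readings.

    The modulo handles a single counter wraparound; more than one wrap
    between samples cannot be detected, so sample often enough.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta = (raw_end - raw_start) % (1 << counter_bits)
    return delta * energy_unit_j / elapsed_s
```

With an energy unit of 1/16384 J, a counter that advances by 1,638,400 ticks in two seconds corresponds to 100 J, or 50 W on average.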

Power Capping Behavior¶

Intel RAPL algorithm:

Power capping system downscales CPU core and uncore frequencies to keep power consumption at the limit
Note: Intel RAPL does not reflect the arithmetic intensity of the workload

Case Study: Cascade Lake Power Management¶

Frequency Scaling with Arithmetic Intensity¶

Example: AVX-512 workload with arithmetic intensity 8

RAPL behavior:

Keeps core frequency at maximum
Downscales uncore frequency to stay within power limit

Results¶

AVX-512 with arithmetic intensity 8, core 1.9 GHz, uncore 2.2 GHz:

18% CPU energy savings
3.5% runtime improvement
15% node energy savings

Advanced Platforms: Grace Hopper Power Management¶

The platform exposes a variety of power domains (CPU cores, GPU SMs, module-level) but only a limited set of knobs (frequencies) to control them.

Challenge: Power shifting between CPU and GPU requires frequency coordination to stay within node power limit.

Disabling Units for Power Savings¶

Beyond frequency scaling, some systems support unit disabling:

Multi-threading (on/off) - Disable simultaneous multithreading
Disabling cores - Complex, affects memory bandwidth
AMD xGMI lanes - External Global Memory Interconnect per link
Fujitsu A64FX FPU pipelines - FLA (floating-point) and EXA (integer) elimination
P-cores and E-cores - Heterogeneous cores (not yet in HPC)

Power Management Knobs Overview¶

Across different HPC-relevant platforms:

Intel:

CPU - core frequency, uncore frequency, power capping
ACC (PVC) - GPU frequency, memory frequency, power capping
ACC (KNL) - core frequency, power capping

AMD:

CPU - core frequency, power capping, Data Fabric frequency
ACC - power capping, frequency (system, Data Fabric, display, SOC, memory, PCIe)

NVIDIA:

GPU - SM frequency, memory frequency, power capping
CPU+GPU - Grace Hopper: CPU core & GPU SM frequencies, multi-level power capping

IBM:

CPU - core frequency, power capping
CPU+GPU - core frequency, CPU/GPU/node power capping

ARM:

CPU - Fujitsu A64FX: core frequency, FPU pipeline elimination, memory frequency
CPU - NVIDIA Grace: core frequency, power capping

Runtime Systems for Automatic Power Management¶

Runtime systems automatically adjust power management parameters based on application characteristics and system constraints.

What Runtime Systems Do¶

Profile workload - Identify CPU/memory/I/O patterns
Monitor performance - Track actual vs expected performance
Select frequencies - Choose P-states to meet constraints (power budget, deadline, etc.)
Adapt dynamically - Respond to runtime conditions

Example: Energy-Aware Scheduling¶

A runtime system might:

Reduce frequency for loosely-coupled parallel tasks (communication-bound)
Increase frequency for CPU-bound tasks requiring maximum throughput
Collectively respect power budget across all running jobs
Adapt based on observed thermal conditions

Example: Power-Capping Runtime¶

Ensures total node power doesn’t exceed a limit while maximizing performance:

Algorithm:
  FOR each core:
    power_limit_per_core = total_power_limit / num_cores
    frequency = find_max_frequency(power_limit_per_core)
    set_frequency(frequency)
  Monitor actual power
  Adjust frequencies if overage
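The algorithm above can be sketched in runnable form. Everything beyond its structure is an assumption made for illustration: the cubic per-core power model and its coefficients are hypothetical, the frequency list is arbitrary, and a real runtime would measure power (for example through RAPL) instead of modelling it.

```python
def core_power_w(freq_ghz, static_w=1.0, k=0.5):
    """Hypothetical power model: static power plus a cubic dynamic term."""
    return static_w + k * freq_ghz ** 3


def find_max_frequency(limit_w, freqs_ghz):
    """Highest frequency whose modelled power fits; lowest one if none fits."""
    fitting = [f for f in freqs_ghz if core_power_w(f) <= limit_w]
    return max(fitting) if fitting else min(freqs_ghz)


def plan_frequencies(total_limit_w, num_cores, freqs_ghz):
    """Split the budget evenly and pick a frequency for each core."""
    per_core_limit = total_limit_w / num_cores
    return [find_max_frequency(per_core_limit, freqs_ghz)] * num_cores


def adjust_on_overage(plan, measured_w, total_limit_w, freqs_ghz):
    """If measured power exceeds the limit, step every core down one level."""
    if measured_w <= total_limit_w:
        return plan
    levels = sorted(freqs_ghz)
    return [levels[max(levels.index(f) - 1, 0)] for f in plan]
```

With a 40 W budget over four cores and frequencies of 1.0 to 3.0 GHz in 0.5 GHz steps, the plan selects 2.5 GHz per core; if the measured power then reads 45 W, one adjustment step drops every core to 2.0 GHz. An even split is the simplest choice and ignores that idle cores could donate budget to busy ones.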

Practical HPC Power Management Strategies¶

Strategy 1: Fixed Frequency¶

Cap every node in the cluster at a conservative fraction of its maximum frequency:

# Cap all nodes at 80% of the maximum supported performance level
echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct

Pros:

Simple to deploy and manage
Predictable power consumption
Measurable energy savings (typically 10-20%)

Cons:

No performance adaptation
May waste power during I/O waits
Not optimal for heterogeneous workloads

Best for: Homogeneous workloads with known characteristics, power-constrained systems

Strategy 2: Per-Application Tuning¶

Different application classes get different frequencies based on profiling:

Batch jobs (deadline-loose):          80% frequency (20% power saving)
Interactive jobs (latency-critical):  100% frequency
GPU-accelerated jobs:                 CPU 60%, GPU 100%
I/O-bound services:                   70% frequency

Pros:

Better performance for latency-critical work
Energy savings for batch jobs
Balanced approach for mixed workloads

Cons:

Requires profiling and characterization
Need to identify workload class at runtime
Moderate complexity in implementation

Best for: Mixed HPC centers with diverse job types

Strategy 3: Dynamic Runtime Control¶

Runtime system adjusts frequency based on:

CPU utilization and workload signals
Memory bandwidth usage
Power budget remaining
Deadline/QoS requirements

Pros:

Highest potential for energy savings (20-40%)
Adapts automatically to workload changes
Respects power budgets dynamically

Cons:

Most complex to implement
Requires careful tuning of heuristics
May have overhead from monitoring

Best for: Advanced HPC centers with sophisticated workload management

Summary: Power Management Parameters¶

Parameter	Conservative	Aggressive	Notes
Governor	performance	ondemand	Depends on workload characteristics
Max Frequency	100%	70-80%	Significant power savings at some performance cost
Turbo Boost	Enabled	Disabled	Disabling reduces power ~5-10%
C-states	Enabled	Enabled	Should always enable (minimal performance cost)
HWP	Enabled	Enabled	Always use if supported (improves responsiveness)
Uncore Frequency	100%	80-90%	Less impact than core frequency

The optimal settings depend on your specific workload, power budget, and performance requirements. Profiling and validation are essential for any production deployment.

Case Study: Energy-Aware HPC - RIKEN Fugaku¶

System Overview¶

Ranking: #1 in Top500 from June 2020 through November 2021

Processor: Fujitsu A64FX

48 compute cores + 4 assistant cores (OS daemon and MPI offload)
No TDP, no nominal frequency → no traditional turbo concept
Available frequencies: 1.6, 1.8, 2.0, or 2.2 GHz

User-Controlled Power Options¶

Power mode (scheduler option):

Normal - 2.0 GHz frequency (baseline)
Boost - 2.2 GHz frequency (performance)
ECO - 2.0 GHz + use one of two FPU units only + reduces standby power
Boost ECO - 2.2 GHz + FPU unit elimination

Core retention (ON/OFF):

When enabled: Eliminates standby power for idle CPU cores
Significant power savings for workloads that don’t utilize all cores

Reference: https://sites.google.com/view/rikenfugakushowcase/home

Summary: Episode 1 Learning Outcomes¶

After completing this episode, you should be able to:

Implement power management - Use sysfs and MSR interfaces to control CPU frequencies
Select appropriate governors - Choose scaling policies based on workload characteristics
Understand hardware mechanisms - Explain HWP, RAPL, turbo boost, and other hardware features
Design runtime systems - Create algorithms for automatic power optimization
Evaluate trade-offs - Balance performance, power, and reliability constraints
Apply best practices - Implement fixed, per-application, or dynamic power strategies
Analyze real systems - Understand power management in production HPC systems

Assessment: You should now be able to profile a workload, design a power management strategy, and implement it on an HPC system while respecting power budgets and performance requirements.