Quiz Solutions: HPC Power Management (Episodes 0 & 1)

Episode 0: Power Management Hardware Knobs - Fundamentals and Concepts

Multiple choice, single answer

Power Physics Fundamentals:

What is the primary component of CPU power consumption that scales with frequency?

B) Dynamic power (CV²f)
Explanation: The CV²f term dominates at high frequencies. Leakage power is relatively constant.

According to the power equation P = CV²f + I_leak·V, which change provides the greatest power savings?

B) Reducing voltage by 20%
Explanation: Dynamic power scales with V², so a 20% voltage reduction (factor 0.8) leaves 0.8² = 0.64 of the dynamic power, a saving of ~36%. Frequency reduction only scales power linearly.

Why is dynamic voltage and frequency scaling (DVFS) more effective at lower frequencies?

B) Power savings are quadratic with voltage reduction at lower frequencies
Explanation: At lower frequencies, voltage can be reduced more significantly (within safe operating margins), yielding quadratic power savings.

CPU Frequency Scaling (P-states):

What is the primary purpose of P-states in CPU power management?

B) Select the frequency and voltage for CPU cores

Intel processors typically support how many P-states?

C) 20-40
Explanation: Modern Intel processors support 20-40 distinct frequency/voltage pairs (P-states).

What does P0 represent in the P-state hierarchy?

B) Turbo boost frequency
Explanation: P0 is the highest-performance state (turbo); higher-numbered P-states correspond to lower frequencies.

Idle Power Management (C-states):

Which C-state represents the CPU actively executing instructions?

A) C0

Approximately what percentage of power can be saved by transitioning from C0 to C3?

C) 50%+
Explanation: Deep C-states like C3 disable most core logic, saving 50-80% core power.

What is the trade-off when using deeper C-states (C3+)?

B) Increased latency to wake up and resume execution
Explanation: Deeper C-states require more time to power back up.

Thermal and System Power Management:

What are T-states used for in CPU power management?

B) Reducing frequency during thermal stress (thermal throttling)

What does S5 represent in the system power state (S-state) hierarchy?

C) System fully powered off

ACPI and Scaling Drivers:

Which of the following is NOT a scaling driver mentioned in Episode 0?

D) gpu-pstate
Explanation: GPU frequency uses different mechanisms (nvidia-smi, rocm-smi), not traditional scaling drivers.

What is the primary role of a scaling governor?

B) Decide which P-state (frequency) to use based on workload conditions

Why Power Management Matters in HPC:

What is the typical power consumption percentage for a data center (facility-level)?

C) 50-70% of operational costs
Explanation: Power and cooling typically account for 50-70% of HPC operational expenses.

Approximately what fraction of power in a large HPC system goes to cooling?

B) 25%
Explanation: Cooling typically accounts for 25-40% of facility power consumption (PUE ~1.25-1.40).

Which of the following is NOT a benefit of power management in HPC?

C) Guaranteed faster execution time
Explanation: Power management may reduce frequency, which can slow execution (though smart strategies minimize this).

Conceptual questions

Power Equation Analysis:

Initial: P₁ = CV²f = C × (0.8)² × 2.0 = 0.64 × 2.0 × C = 1.28C

New: P_new = C × (0.7)² × 1.6 = 0.49 × 1.6 × C = 0.784C

Ratio: P_new / P₁ = 0.784C / 1.28C = 0.61 (or 39% reduction)

Key insight: The 20% frequency reduction alone would cut dynamic power by 20% (factor 0.8). The 12.5% voltage reduction (0.8V → 0.7V) contributes a further factor of (0.7/0.8)² ≈ 0.77, so the combined saving is 39%. Voltage scaling is crucial because power depends on V².
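The ratio can be checked numerically. The following is a minimal sketch, assuming a pure CV²f model with leakage ignored; the function name dynamic_power_ratio is made up for this example.

```python
def dynamic_power_ratio(v_old, f_old, v_new, f_new):
    """Ratio of new to old dynamic power under P = C * V^2 * f (C cancels out)."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Values from the worked solution: 0.8 V -> 0.7 V and 2.0 GHz -> 1.6 GHz
ratio = dynamic_power_ratio(v_old=0.8, f_old=2.0, v_new=0.7, f_new=1.6)
voltage_factor = (0.7 / 0.8) ** 2   # effect of the voltage change alone
frequency_factor = 1.6 / 2.0        # effect of the frequency change alone

print(f"combined ratio   = {ratio:.4f}")
print(f"voltage factor   = {voltage_factor:.4f}")
print(f"frequency factor = {frequency_factor:.2f}")
```

Splitting the ratio into its two factors makes it easy to see how much of the saving comes from voltage and how much from frequency.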

Workload and Power Interaction:

Workload 1 - All-reduce (Memory-bound):

Frequency reduction will NOT hurt performance significantly because the bottleneck is memory bandwidth, not CPU cycles
Safe to reduce frequency: the CPU is already waiting for memory data
Monitoring: Watch for memory-wait cycles (performance counters), thermal effects

Workload 2 - Dense matrix multiplication (Compute-bound):

Frequency reduction WILL hurt performance because throughput is limited by CPU cycles available
Unsafe to reduce frequency: every cycle matters, and reduction directly impacts FLOP/s delivered
Monitoring: Track actual FLOP/s achieved; frequency reduction should be minimal or none

Power Management Strategy Design:

A comprehensive strategy might include:

Batch jobs (30%, deadline-loose):

Use ondemand or conservative governor
Set max_freq to 70-80% (10-20% power savings)
Prioritize energy efficiency

Interactive jobs (50%, latency-critical):

Use performance governor or HWP with a performance-oriented EPP (a low EPP value; 0 requests maximum performance)
Run at 95-100% frequency
Prioritize low latency

GPU-accelerated jobs (20%, compute-intensive):

CPU: 60-70% frequency (since CPU often waits for GPU)
GPU: 100% frequency (compute bottleneck)
Use power capping to respect node power budget
Monitor CPU/GPU power balance

Trade-offs:

Energy savings vs latency: Accept slower batch jobs to save power
Complexity: Need job classification system
Flexibility: Must allow overrides for deadline-critical work

Episode 1: Power Management Implementation and Runtime Systems

Multiple choice, single answer

Scaling Drivers and Interfaces:

Which driver provides more responsive CPU frequency control on modern Intel processors?

B) intel_pstate
Explanation: intel_pstate directly controls MSRs and is more responsive than ACPI firmware-based acpi-cpufreq.

What does the max_perf_pct parameter in intel_pstate sysfs control?

B) Maximum allowed P-state as a percentage of maximum frequency

To disable turbo boost on an intel_pstate system, which sysfs file should be written to?

C) no_turbo
Explanation: echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo disables turbo.
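The same write can be done from a script. This is only a sketch: the helper names are hypothetical, writing the real sysfs file requires root, and the file exists only when the intel_pstate driver is active. The path argument is there so the helpers can be tried against an ordinary file.

```python
from pathlib import Path

# sysfs file named in the explanation above
NO_TURBO = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")


def set_turbo_disabled(disabled, path=NO_TURBO):
    """Write 1 (turbo off) or 0 (turbo on) to the no_turbo file."""
    Path(path).write_text("1\n" if disabled else "0\n")


def turbo_disabled(path=NO_TURBO):
    """Return True if the file currently reads 1."""
    return Path(path).read_text().strip() == "1"
```

Calling set_turbo_disabled(True) as root has the same effect as the echo command.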

Scaling Governors (Policies):

Which scaling governor always runs at maximum frequency?

B) performance

What is the main advantage of the ondemand governor compared to conservative?

B) Faster response to load increases
Explanation: Ondemand adjusts frequency more aggressively; conservative makes gradual changes.

Which governor allows direct user/application control of CPU frequency?

C) userspace

In the ondemand governor, what does up_threshold control?

A) CPU utilization threshold for scaling frequency up

Hardware Frequency Control (MSR):

What is the MSR (Model-Specific Register) address for IA32_PERF_CTL?

A) 0x199

Which bit field in MSR 0x199 specifies the target P-state?

B) Bits [15:8]

Intel Turbo Boost:

What is the key difference between SSE and AVX-512 boost frequencies?

B) SSE has the highest boost frequency due to lower power density
Explanation: SSE instructions use less power per operation, allowing higher frequency. AVX-512 is power-intensive, so turbo frequency is lower.

Why does Intel implement instruction-set-specific frequency levels?

B) To allow higher compute throughput within power and thermal budgets
Explanation: Despite lower frequency, AVX-512 can deliver more compute operations per second within power/thermal constraints.

Frequency Transition Latency:

What is a typical frequency transition latency on modern Intel processors?

B) 5-20 microseconds

For which type of application is frequency transition latency most critical?

B) Real-time embedded systems
Explanation: HPC jobs (hours-long) are insensitive to microsecond latencies.

GPU Frequency Management:

What command-line tool is used to control NVIDIA GPU frequency?

B) nvidia-smi

Which AMD tool is used for GPU frequency management on AMD GPUs?

B) rocm-smi

What is the typical frequency granularity for NVIDIA GPUs?

C) 25-50 MHz

Hardware P-State (HWP) and SpeedShift:

How does HWP differ from OS-controlled frequency scaling?

B) Hardware autonomously selects P-states within OS-specified range

What is the approximate latency improvement of HWP over OS control?

C) 10-100× faster
Explanation: HWP reduces frequency change latency from 10-50 μs to 0.5-2 μs.

Energy-Performance Preference (EPP):

What does MSR IA32_ENERGY_PERF_BIAS (0x1B0) allow?

B) Specifying energy-performance trade-off preference (0-15 scale)

On modern Intel (Skylake+), which MSR provides finer-grained EPP control with 0-255 scale?

C) 0x774

CPU Uncore Frequency:

Approximately what percentage of CPU chip area does the uncore subsystem consume?

C) 30%

What is MSR MSR_UNCORE_RATIO_LIMIT used for?

B) Setting limits on uncore (shared subsystem) frequency

Workload Characterization:

Which workload type would most benefit from frequency reduction without performance loss?

B) Sparse linear solver (memory-latency-bound)
Explanation: Memory-bound workloads tolerate frequency reduction because they’re waiting for memory, not CPU cycles.

What metric helps predict whether a workload is compute-bound or memory-bound?

C) Arithmetic intensity (operations per memory access)

Intel RAPL Power Capping:

How many power domains does Intel RAPL typically support?

C) 3-5
Explanation: Typically Package, DRAM, PP0 (Core), PP1 (Graphics), and PSys (Platform on Skylake+).

Which RAPL domain is specific to server architectures?

C) DRAM

What are the two time windows in Intel RAPL Package domain?

B) Short (~1.2× TDP, ms) and Long (~TDP, seconds)

What does MSR MSR_PKG_POWER_LIMIT (0x610) control?

C) Power capping limits and time windows

Case Studies and Advanced Platforms:

In the Cascade Lake case study, what percentage CPU energy savings was achieved with frequency scaling?

B) 18%

Why is power management challenging on Grace Hopper?

B) Multiple power domains (CPU, GPU, interconnects) require coordination
Explanation: Separate frequency controls for CPU and GPU mean power must be balanced to respect node limit.

On RIKEN Fugaku’s A64FX, what does FPU elimination in ECO mode do?

C) Uses one of two FPU pipelines only, reducing power

Runtime Systems and Strategies:

What does a power-capping runtime system do?

B) Ensures total node power doesn’t exceed a limit while maximizing performance

Which power management strategy is most suitable for a tightly power-constrained HPC facility?

C) Dynamic runtime control with power budgeting
Explanation: Fixed frequency is inflexible; dynamic control adapts to actual workload needs.

What is the typical energy savings range for dynamic runtime power management?

C) 20-40%

Coding and analysis questions

MSR-Based Frequency Control:

a) To set P-state 24 with bits [15:8] = 24 (0x18):

MSR 0x199 = 0x00001800  (or 0x0000_0000_0000_1800 in full 64-bit form)
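The bit placement can be expressed as a shift and a mask. A minimal sketch, with hypothetical helper names, that only builds and decodes the register value and does not touch any hardware:

```python
def perf_ctl_value(pstate):
    """Place an 8-bit P-state value in bits [15:8] of the register value."""
    if not 0 <= pstate <= 0xFF:
        raise ValueError("the P-state field is 8 bits wide")
    return pstate << 8


def perf_ctl_pstate(value):
    """Extract bits [15:8] from a register value."""
    return (value >> 8) & 0xFF


value = perf_ctl_value(24)
print(f"0x{value:016X}")  # 0x0000000000001800
```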

b) Frequency calculation from P-state:

If P-state range is 0-39 with min=0.8GHz, max=3.8GHz:
frequency = min_freq + (P_state / 39) × (max_freq - min_freq)
frequency = 0.8 + (P / 39) × 3.0 GHz
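The linear mapping and its inverse can be sketched as follows. The index range 0-39 and the 0.8-3.8 GHz limits are the quiz's assumptions, and the function names are made up for illustration; this is not how any particular processor enumerates its P-states.

```python
MIN_GHZ = 0.8    # frequency at index 0 (quiz assumption)
MAX_GHZ = 3.8    # frequency at index 39 (quiz assumption)
MAX_INDEX = 39


def pstate_to_ghz(pstate):
    """Linear interpolation between the minimum and maximum frequency."""
    return MIN_GHZ + (pstate / MAX_INDEX) * (MAX_GHZ - MIN_GHZ)


def ghz_to_pstate(freq_ghz):
    """Inverse mapping, rounded to the nearest integer index."""
    return round((freq_ghz - MIN_GHZ) / (MAX_GHZ - MIN_GHZ) * MAX_INDEX)


print(f"index 24 -> {pstate_to_ghz(24):.3f} GHz")
print(f"1.8 GHz  -> index {ghz_to_pstate(1.8)}")
```

The inverse function is the kind of conversion that a helper such as compute_target_pstate(1.8) would need to perform.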

c) Pseudocode for frequency ramp:

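import time

# read_pstate(), compute_target_pstate() and write_msr() are platform helpers assumed to exist
current_pstate = read_pstate()
target_pstate = compute_target_pstate(1.8)  # Convert GHz to P-state
steps = 10

for i in range(1, steps + 1):
    # Interpolate linearly; round() so the last step lands exactly on the target
    new_pstate = round(current_pstate + i * (target_pstate - current_pstate) / steps)
    write_msr(0x199, new_pstate << 8)  # target P-state goes in bits [15:8]
    time.sleep(0.1)  # 100ms between steps, 1s for the whole ramp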

Scaling Governor Selection:

Scientific simulation (CPU-intensive, loose deadline)
- Governor: ondemand or userspace
- Justification: Can reduce frequency for power savings; loose deadline allows higher latency
- Parameters: up_threshold=90, sampling_rate=50000 (conservative scaling)
Data processing pipeline (30% comm, 70% compute)
- Governor: ondemand with custom tuning
- Justification: Scale down during communication waits, scale up for compute bursts
- Parameters: up_threshold=75, sampling_rate=10000 (down_threshold=25 applies only if the conservative governor is used instead; ondemand has no down_threshold tunable)
Interactive visualization (< 100ms latency)
- Governor: performance or HWP with a performance-oriented (low) EPP value
- Justification: Latency-critical, need maximum responsiveness
- Parameters: Run at maximum frequency; consider HWP for better responsiveness

Power Equation Application:

Initial: P = CV²f + P_leak = 56W + 4W = 60W

New state: V = 0.8V, f = 2.0 GHz

Dynamic: P_dyn = 56W × (0.8)² × (2.0/2.5) = 56 × 0.64 × 0.8 ≈ 28.7W
Leakage: P_leak ≈ 3.6W (slightly below the original 4W because of the lower voltage)
Total ≈ 28.7W + 3.6W = 32.3W (46% reduction from 60W)

a) Energy savings per job:

Original: 60W × 1h = 60 Wh
New: 32.3W × 1h = 32.3 Wh (assuming the job still takes 1h; a compute-bound job would run longer at 2.0 GHz)
Savings per job: 27.7 Wh

b) Annual cost savings:

Savings per job: 27.7 Wh = 0.0277 kWh
Jobs/year: 500 jobs/day × 365 days = 182,500 jobs
Total energy saved: 182,500 × 0.0277 = 5,055 kWh
Cost: 5,055 × $0.10 = $505.50/year
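The whole chain from power to annual cost can be reproduced in a few lines. This is a sketch under the solution's own assumptions: a voltage ratio of 0.8 and a frequency ratio of 2.0/2.5 as implied by the formula, the 3.6W leakage estimate, an unchanged 1-hour runtime, 500 jobs per day and $0.10 per kWh. Because it keeps unrounded intermediate values, the result differs slightly from the hand calculation.

```python
P_DYN_OLD_W = 56.0      # dynamic power before scaling
P_LEAK_OLD_W = 4.0      # leakage power before scaling
P_LEAK_NEW_W = 3.6      # the solution's leakage estimate at the lower voltage
V_RATIO = 0.8           # new voltage / old voltage (implied by the solution)
F_RATIO = 2.0 / 2.5     # new frequency / old frequency

p_old_w = P_DYN_OLD_W + P_LEAK_OLD_W
p_dyn_new_w = P_DYN_OLD_W * V_RATIO ** 2 * F_RATIO
p_new_w = p_dyn_new_w + P_LEAK_NEW_W

job_hours = 1.0                          # runtime assumed unchanged
saved_wh_per_job = (p_old_w - p_new_w) * job_hours
jobs_per_year = 500 * 365
saved_kwh_per_year = saved_wh_per_job * jobs_per_year / 1000.0
saved_usd_per_year = saved_kwh_per_year * 0.10

print(f"new power:        {p_new_w:.1f} W")
print(f"saving per job:   {saved_wh_per_job:.1f} Wh")
print(f"saving per year:  {saved_kwh_per_year:.0f} kWh, ${saved_usd_per_year:.2f}")
```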

Intel RAPL Analysis:

a) Decode MSR_RAPL_POWER_UNIT (register value 0xA1003):

Power unit, bits [3:0] = 0x3: 2^(-3) = 0.125 W
Energy unit, bits [12:8] = 0x10 = 16: 2^(-16) J ≈ 15.3 μJ per count
Time unit, bits [19:16] = 0xA = 10: 2^(-10) s ≈ 0.977 ms

b) Energy consumed:

ΔE = 0x3F7F0E00 - 0x2A5B0E00 = 0x15240000 = 354,680,832 counts
Energy: 354,680,832 × 2^(-16) J = 5,412 J ≈ 5.4 kJ (about 1.5 Wh)

c) Average power:

P_avg = 5,412 J / 60s ≈ 90.2 W

d) Power cap check:

Average 90.2W is well below the 200W limit
No, the limit was not exceeded on average (two counter samples cannot rule out short peaks within the interval)
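A minimal sketch of the decoding and the counter arithmetic, assuming the field layout documented for MSR_RAPL_POWER_UNIT in Intel's Software Developer's Manual (power unit in bits [3:0], energy unit in bits [12:8], time unit in bits [19:16]) and a 32-bit energy counter that can wrap. The function names are hypothetical; the register value, the two counter readings and the 60-second interval are taken from the quiz.

```python
def decode_rapl_units(raw):
    """Return (power unit in W, energy unit in J, time unit in s)."""
    power_unit_w = 2.0 ** -(raw & 0xF)             # bits [3:0]
    energy_unit_j = 2.0 ** -((raw >> 8) & 0x1F)    # bits [12:8]
    time_unit_s = 2.0 ** -((raw >> 16) & 0xF)      # bits [19:16]
    return power_unit_w, energy_unit_j, time_unit_s


def energy_delta_joules(start, end, energy_unit_j):
    """Energy between two counter reads, tolerating one 32-bit wraparound."""
    counts = (end - start) % (1 << 32)
    return counts * energy_unit_j


power_unit_w, energy_unit_j, time_unit_s = decode_rapl_units(0xA1003)
energy_j = energy_delta_joules(0x2A5B0E00, 0x3F7F0E00, energy_unit_j)
avg_power_w = energy_j / 60.0

print(f"energy unit: {energy_unit_j * 1e6:.2f} uJ")
print(f"energy:      {energy_j:.0f} J")
print(f"avg power:   {avg_power_w:.1f} W")
print(f"under 200 W: {avg_power_w < 200.0}")
```

The modulo in energy_delta_joules matters for long sampling intervals, where the counter may wrap between two reads.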

HWP vs OS Control Comparison:

Scenario A: 50 μs × 4 changes = 200 μs overhead = 0.0002s = 0.002% of 10s
Scenario B: 2 μs × 4 changes = 8 μs overhead = 0.000008s = 0.00008% of 10s

a) Total latency overhead:

OS control: 200 μs
HWP: 8 μs

b) Performance impact:

OS: 200 μs / 10s = 0.002% (negligible)
HWP: 8 μs / 10s = 0.00008% (negligible)
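The percentages are easy to get wrong by a factor of ten, so a quick check helps. A small sketch with a hypothetical helper name, using the latencies, change count and runtime from the two scenarios:

```python
def overhead_percent(latency_us, changes, runtime_s):
    """Total transition time as a percentage of the run time."""
    total_s = latency_us * 1e-6 * changes
    return 100.0 * total_s / runtime_s


os_overhead = overhead_percent(latency_us=50, changes=4, runtime_s=10)
hwp_overhead = overhead_percent(latency_us=2, changes=4, runtime_s=10)

print(f"OS control: {os_overhead:.5f} %")
print(f"HWP:        {hwp_overhead:.5f} %")
```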

c) Why HWP is preferable:

Although the overhead is negligible in both cases for a 10-second run, HWP provides:
Better responsiveness to iowait and memory latency events
Per-core independent optimization
AVX-512 frequency awareness
Reduced monitoring overhead

Case Study Analysis - Cascade Lake:

a) New CPU power with 18% savings:

Original CPU: 100W
Savings: 18%
New: 100 × (1 - 0.18) = 82W

b) New total node power with 15% node savings:

Original: 350W
Savings: 15%
New: 350 × (1 - 0.15) = 297.5W

c) Annual energy cost savings for 1000 nodes:

Power reduction per node: (350 - 297.5)W = 52.5W
Per node per year: 52.5W × 24h × 365d = 460 kWh
For 1000 nodes: 460,000 kWh
Cost: 460,000 × $0.12 = $55,200/year

d) Payback period:

Software development cost: $500,000
Annual savings: $55,200
Payback: $500,000 / $55,200 ≈ 9 years
(Alternative strategies like hardware monitoring tools may have shorter payback)
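The payback estimate can be recomputed as follows. This is a sketch using only the figures given in the solution (350W per node, 15% node-level saving, 1000 nodes, $0.12 per kWh, $500,000 development cost) and assuming the nodes run around the clock all year.

```python
NODES = 1000
NODE_POWER_W = 350.0
NODE_SAVING_FRACTION = 0.15
PRICE_USD_PER_KWH = 0.12
DEVELOPMENT_COST_USD = 500_000

saved_w_per_node = NODE_POWER_W * NODE_SAVING_FRACTION
# 24 h x 365 d of continuous operation (assumption), converted from Wh to kWh
saved_kwh_per_node = saved_w_per_node * 24 * 365 / 1000.0
annual_savings_usd = saved_kwh_per_node * NODES * PRICE_USD_PER_KWH
payback_years = DEVELOPMENT_COST_USD / annual_savings_usd

print(f"saved per node: {saved_kwh_per_node:.1f} kWh/year")
print(f"annual savings: ${annual_savings_usd:,.0f}")
print(f"payback:        {payback_years:.1f} years")
```

The script gives a slightly lower annual saving than the hand calculation because it does not round 459.9 kWh up to 460 kWh.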

Runtime System Design - Power Budget Allocation:

a) Frequency reduction tolerance by workload:

Core 0 (compute-bound, 8 ops/byte): Tolerates ~5-10% reduction (very sensitive)
Core 1 (memory-bound, 0.5 ops/byte): Tolerates ~30-40% reduction (insensitive)
Core 2 (I/O-bound, 0.1 ops/byte): Tolerates ~50%+ reduction (very insensitive)
Core 3 (balanced, 2 ops/byte): Tolerates ~20-25% reduction (moderately sensitive)

b) Algorithm for power allocation (maximize throughput):

// Greedy allocation: allocate to most sensitive workloads first
Sort cores by compute intensity (high to low)
Allocate remaining power to each core in order
Each core gets: P_core = P_baseline + 0.8 × (available_power / num_cores)
Compute expected throughput reduction per core
Iteratively reallocate to maximize total throughput

Allocation:

Core 0 (compute): 60W (no reduction)
Core 3 (balanced): 50W (no reduction)
Core 1 (memory): 40W + 50W = 90W available
Core 2 (I/O): 30W + 50W = 80W available
Total: 60 + 50 + 90 + 80 = 280W (under 500W limit)

c) Dynamic reallocation pseudocode:

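def reallocate_power():
    # detect_idle_core() is assumed to return a core index, or None if no core is idle
    idle_core = detect_idle_core()
    if idle_core is None:  # a plain "if idle_core:" would wrongly skip core 0
        return
    active_cores = [c for c in cores if not idle[c]]
    if not active_cores:  # every core is idle, nothing to redistribute
        return
    # Split the idle core's budget evenly among the active cores
    power_per_core = allocated_power[idle_core] / len(active_cores)
    for core in active_cores:
        if can_increase_frequency(core):
            increase_frequency(core, power_per_core)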

d) Thermal constraint handling:

Monitor core temperature continuously
If core > 85°C: reduce frequency by 5%
If core > 90°C: reduce frequency by 10% (approaching limit)
Use thermal headroom to allow temporary frequency boost
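The two temperature thresholds translate directly into a small policy function. A sketch only: the function name is made up, and reading the temperature and applying the resulting factor are left to platform-specific code.

```python
def thermal_frequency_factor(temp_c):
    """Return the frequency multiplier for a given core temperature."""
    if temp_c > 90:
        return 0.90   # approaching the limit: reduce frequency by 10%
    if temp_c > 85:
        return 0.95   # warm: reduce frequency by 5%
    return 1.00       # thermal headroom available: no reduction


for temp in (70, 86, 92):
    print(f"{temp} C -> factor {thermal_frequency_factor(temp):.2f}")
```

Checking the higher threshold first ensures that a core above 90°C gets the larger reduction rather than matching the 85°C rule.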

Workload Characterization and Power Optimization:

a) Arithmetic intensity:

Memory accesses per 1000 cycles: 450
FLOP per 1000 cycles: 1800
Bytes per access (typical L3 cache line): 64 bytes
Total bytes: 450 × 64 = 28,800 bytes
AI = 1800 FLOPs / 28,800 bytes = 0.0625 FLOP/byte (or 16 bytes/FLOP)
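The same calculation as a reusable function, in a minimal sketch. The function name is hypothetical, and the 64-byte default assumes every memory access moves one full cache line, as in the solution.

```python
def arithmetic_intensity(flops, memory_accesses, bytes_per_access=64):
    """FLOPs per byte moved, counting one cache line per access."""
    return flops / (memory_accesses * bytes_per_access)


# Counter values per 1000 cycles, taken from the quiz
ai = arithmetic_intensity(flops=1800, memory_accesses=450)
print(f"AI = {ai} FLOP/byte, or {1 / ai:.0f} bytes/FLOP")
```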

b) Boundedness analysis:

This is memory-bound
Justification: L3 hit rate is only 85%, meaning 15% miss rate with 200-cycle latency
With 0.0625 FLOP/byte, memory throughput is the bottleneck
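Latency-bound: 450 accesses × 15% miss rate = 67.5 L3 misses per 1000 cycles; at 200 cycles each that is 13,500 miss-cycles per 1000 cycles, so misses must overlap heavily and memory latency dominates execution time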

c) Performance impact of 20% frequency reduction:

Baseline: 2.5 GHz × 1800 FLOP/1000 cycles = 4.5 GFLOP/s
With reduction: 2.0 GHz × 1800 FLOP/1000 cycles = 3.6 GFLOP/s
Impact if the cycle count per FLOP stayed fixed: 20% reduction in FLOP/s
However, memory latency is fixed in wall-clock time (nanoseconds), so it costs fewer core cycles at the lower clock
Overall runtime impact: ~10-15% (memory accesses still dominate, not CPU cycles)

d) Power management strategy:

Reduce frequency to 70-75% (save 25-30% power)
Runtime will increase ~12-15% (memory-latency-bound, not CPU-sensitive)
Overall energy reduction: roughly 14-22% (relative energy = power × runtime ≈ 0.70-0.75 × 1.12-1.15 ≈ 0.78-0.86; the power saving outweighs the longer runtime)
Use ondemand governor with high up_threshold (85-90%)
Monitor memory bandwidth utilization to validate memory-boundedness
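Whether a frequency reduction pays off depends on the product of relative power and relative runtime. A minimal sketch, with a hypothetical function name, that evaluates the corners of the ranges quoted in this strategy (power at 70-75% of baseline, runtime 12-15% longer):

```python
def relative_energy(power_fraction, runtime_factor):
    """Energy relative to baseline: E = P * t, both expressed as ratios."""
    return power_fraction * runtime_factor


cases = [(0.70, 1.12), (0.70, 1.15), (0.75, 1.12), (0.75, 1.15)]
results = [relative_energy(p, t) for p, t in cases]

for (p, t), e in zip(cases, results):
    print(f"power {p:.2f} x runtime {t:.2f} -> energy {e:.3f} "
          f"({(1 - e) * 100:.1f}% saved)")
```

A result below 1.0 means the job uses less energy overall even though it runs longer.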