Quiz Solutions: HPC Power Management (Episodes 0 & 1)

Episode 0: Power Management Hardware Knobs - Fundamentals and Concepts

Multiple choice, single answer

Power Physics Fundamentals:

What is the primary component of CPU power consumption that scales with frequency?

  • B) Dynamic power (CV²f)

  • Explanation: The CV²f term scales linearly with frequency and dominates at high frequencies. Leakage power depends on voltage and temperature rather than on frequency.

According to the power equation P = CV²f + I_leak·V, which change provides the greatest power savings?

  • B) Reducing voltage by 20%

  • Explanation: Dynamic power is proportional to V². A 20% voltage reduction (a 0.8× factor) leaves 0.8² = 0.64 of the dynamic power, a saving of ~36%. A frequency reduction only scales power linearly.

Why is dynamic voltage and frequency scaling (DVFS) more effective at lower frequencies?

  • B) Power savings are quadratic with voltage reduction at lower frequencies

  • Explanation: At lower frequencies, voltage can be reduced more significantly (within safe operating margins), yielding quadratic power savings.

CPU Frequency Scaling (P-states):

What is the primary purpose of P-states in CPU power management?

  • B) Select the frequency and voltage for CPU cores

Intel processors typically support how many P-states?

  • C) 20-40

  • Explanation: Modern Intel processors support 20-40 distinct frequency/voltage pairs (P-states).

What does P0 represent in the P-state hierarchy?

  • B) Turbo boost frequency

  • Explanation: P0 is the highest frequency (turbo); higher P-state numbers = lower frequencies.

Idle Power Management (C-states):

Which C-state represents the CPU actively executing instructions?

  • A) C0

Approximately what percentage of power can be saved by transitioning from C0 to C3?

  • C) 50%+

  • Explanation: Deep C-states like C3 disable most core logic, saving 50-80% core power.

What is the trade-off when using deeper C-states (C3+)?

  • B) Increased latency to wake up and resume execution

  • Explanation: Deeper C-states require more time to power back up.

Thermal and System Power Management:

What are T-states used for in CPU power management?

  • B) Reducing frequency during thermal stress (thermal throttling)

What does S5 represent in the system power state (S-state) hierarchy?

  • C) System fully powered off

ACPI and Scaling Drivers:

Which of the following is NOT a scaling driver mentioned in Episode 0?

  • D) gpu-pstate

  • Explanation: GPU frequency uses different mechanisms (nvidia-smi, rocm-smi), not traditional scaling drivers.

What is the primary role of a scaling governor?

  • B) Decide which P-state (frequency) to use based on workload conditions

Why Power Management Matters in HPC:

What is the typical power consumption percentage for a data center (facility-level)?

  • C) 50-70% of operational costs

  • Explanation: Power and cooling typically account for 50-70% of HPC operational expenses.

Approximately what fraction of power in a large HPC system goes to cooling?

  • B) 25%

  • Explanation: Cooling and other facility overhead typically add 25-40% on top of IT power (PUE ~1.25-1.40).

Which of the following is NOT a benefit of power management in HPC?

  • C) Guaranteed faster execution time

  • Explanation: Power management may reduce frequency, which can slow execution (though smart strategies minimize this).

Conceptual questions

Power Equation Analysis:

Initial: P₁ = CV²f = C × (0.8)² × 2.0 = 0.64 × 2.0 × C = 1.28C

New: P_new = C × (0.7)² × 1.6 = 0.49 × 1.6 × C = 0.784C

Ratio: P_new / P₁ = 0.784C / 1.28C = 0.61 (or 39% reduction)

Key insight: The 12.5% voltage reduction (0.8 V → 0.7 V) contributes a factor of (0.7/0.8)² ≈ 0.77, a saving of about 23%. That is larger than the 20% saving from the frequency reduction, even though the voltage change itself is smaller. The two factors multiply (0.77 × 0.80 ≈ 0.61). Voltage scaling is crucial because power depends on V².
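The ratio above can be checked numerically. This is a minimal sketch that sets C = 1 (it cancels in the ratio); the function and variable names are illustrative only.

```python
def dynamic_power(c: float, v: float, f: float) -> float:
    """Dynamic power P = C * V^2 * f (arbitrary units when C = 1)."""
    return c * v**2 * f

# Operating points from the analysis above: (0.8 V, 2.0 GHz) -> (0.7 V, 1.6 GHz)
p_old = dynamic_power(1.0, 0.8, 2.0)
p_new = dynamic_power(1.0, 0.7, 1.6)

ratio = p_new / p_old                # overall power ratio
voltage_factor = (0.7 / 0.8) ** 2    # contribution of the voltage change
frequency_factor = 1.6 / 2.0         # contribution of the frequency change

print(f"power ratio      = {ratio:.4f}")
print(f"voltage factor   = {voltage_factor:.4f}")
print(f"frequency factor = {frequency_factor:.4f}")
```

The product of the two factors equals the overall ratio, which makes it easy to see how much each knob contributes.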

Workload and Power Interaction:

Workload 1 - All-reduce (Memory-bound):

  • Frequency reduction will NOT hurt performance significantly because the bottleneck is memory bandwidth, not CPU cycles

  • Safe to reduce frequency: the CPU is already waiting for memory data

  • Monitoring: Watch for memory-wait cycles (performance counters), thermal effects

Workload 2 - Dense matrix multiplication (Compute-bound):

  • Frequency reduction WILL hurt performance because throughput is limited by CPU cycles available

  • Unsafe to reduce frequency: every cycle matters, and reduction directly impacts FLOP/s delivered

  • Monitoring: Track actual FLOP/s achieved; frequency reduction should be minimal or none

Power Management Strategy Design:

A comprehensive strategy might include:

Batch jobs (30%, deadline-loose):

  • Use ondemand or conservative governor

  • Set max_freq to 70-80% (10-20% power savings)

  • Prioritize energy efficiency

Interactive jobs (50%, latency-critical):

  • Use performance governor or HWP with a performance-oriented EPP (a low EPP value; 0 = maximum performance)

  • Run at 95-100% frequency

  • Prioritize low latency

GPU-accelerated jobs (20%, compute-intensive):

  • CPU: 60-70% frequency (since CPU often waits for GPU)

  • GPU: 100% frequency (compute bottleneck)

  • Use power capping to respect node power budget

  • Monitor CPU/GPU power balance

Trade-offs:

  • Energy savings vs latency: Accept slower batch jobs to save power

  • Complexity: Need job classification system

  • Flexibility: Must allow overrides for deadline-critical work

Episode 1: Power Management Implementation and Runtime Systems

Multiple choice, single answer

Scaling Drivers and Interfaces:

Which driver provides more responsive CPU frequency control on modern Intel processors?

  • B) intel_pstate

  • Explanation: intel_pstate directly controls MSRs and is more responsive than ACPI firmware-based acpi-cpufreq.

What does the max_perf_pct parameter in intel_pstate sysfs control?

  • B) Maximum allowed P-state as a percentage of maximum frequency

To disable turbo boost on an intel_pstate system, which sysfs file should be written to?

  • C) no_turbo

  • Explanation: echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo disables turbo.

Scaling Governors (Policies):

Which scaling governor always runs at maximum frequency?

  • B) performance

What is the main advantage of the ondemand governor compared to conservative?

  • B) Faster response to load increases

  • Explanation: Ondemand adjusts frequency more aggressively; conservative makes gradual changes.

Which governor allows direct user/application control of CPU frequency?

  • C) userspace

In the ondemand governor, what does up_threshold control?

  • A) CPU utilization threshold for scaling frequency up

Hardware Frequency Control (MSR):

What is the MSR (Model-Specific Register) address for IA32_PERF_CTL?

  • A) 0x199

Which bit field in MSR 0x199 specifies the target P-state?

  • B) Bits [15:8]

Intel Turbo Boost:

What is the key difference between SSE and AVX-512 boost frequencies?

  • B) SSE has the highest boost frequency due to lower power density

  • Explanation: SSE instructions use less power per operation, allowing higher frequency. AVX-512 is power-intensive, so turbo frequency is lower.

Why does Intel implement instruction-set-specific frequency levels?

  • B) To allow higher compute throughput within power and thermal budgets

  • Explanation: Despite lower frequency, AVX-512 can deliver more compute operations per second within power/thermal constraints.

Frequency Transition Latency:

What is a typical frequency transition latency on modern Intel processors?

  • B) 5-20 microseconds

For which type of application is frequency transition latency most critical?

  • B) Real-time embedded systems

  • Explanation: HPC jobs (hours-long) are insensitive to microsecond latencies.

GPU Frequency Management:

What command-line tool is used to control NVIDIA GPU frequency?

  • B) nvidia-smi

Which AMD tool is used for GPU frequency management on AMD GPUs?

  • B) rocm-smi

What is the typical frequency granularity for NVIDIA GPUs?

  • C) 25-50 MHz

Hardware P-State (HWP) and SpeedShift:

How does HWP differ from OS-controlled frequency scaling?

  • B) Hardware autonomously selects P-states within OS-specified range

What is the approximate latency improvement of HWP over OS control?

  • C) 10-100× faster

  • Explanation: HWP reduces frequency change latency from 10-50 μs to 0.5-2 μs.

Energy-Performance Preference (EPP):

What does MSR IA32_ENERGY_PERF_BIAS (0x1B0) allow?

  • B) Specifying energy-performance trade-off preference (0-15 scale)

On modern Intel (Skylake+), which MSR provides finer-grained EPP control with 0-255 scale?

  • C) 0x774

CPU Uncore Frequency:

Approximately what percentage of CPU chip area does the uncore subsystem consume?

  • C) 30%

What is MSR_UNCORE_RATIO_LIMIT used for?

  • B) Setting limits on uncore (shared subsystem) frequency

Workload Characterization:

Which workload type would most benefit from frequency reduction without performance loss?

  • B) Sparse linear solver (memory-latency-bound)

  • Explanation: Memory-bound workloads tolerate frequency reduction because they’re waiting for memory, not CPU cycles.

What metric helps predict whether a workload is compute-bound or memory-bound?

  • C) Arithmetic intensity (operations per memory access)

Intel RAPL Power Capping:

How many power domains does Intel RAPL typically support?

  • C) 3-5

  • Explanation: Typically Package, DRAM, PP0 (Core), PP1 (Graphics), and PSys (Platform on Skylake+).

Which RAPL domain is specific to server architectures?

  • C) DRAM

What are the two time windows in Intel RAPL Package domain?

  • B) Short (~1.2× TDP, ms) and Long (~TDP, seconds)

What does MSR_PKG_POWER_LIMIT (0x610) control?

  • C) Power capping limits and time windows

Case Studies and Advanced Platforms:

In the Cascade Lake case study, what percentage CPU energy savings was achieved with frequency scaling?

  • B) 18%

Why is power management challenging on Grace Hopper?

  • B) Multiple power domains (CPU, GPU, interconnects) require coordination

  • Explanation: Separate frequency controls for CPU and GPU mean power must be balanced to respect node limit.

On RIKEN Fugaku’s A64FX, what does FPU elimination in ECO mode do?

  • C) Uses one of two FPU pipelines only, reducing power

Runtime Systems and Strategies:

What does a power-capping runtime system do?

  • B) Ensures total node power doesn’t exceed a limit while maximizing performance

Which power management strategy is most suitable for a tightly power-constrained HPC facility?

  • C) Dynamic runtime control with power budgeting

  • Explanation: Fixed frequency is inflexible; dynamic control adapts to actual workload needs.

What is the typical energy savings range for dynamic runtime power management?

  • C) 20-40%

Coding and analysis questions

MSR-Based Frequency Control:

a) To set P-state 24 with bits [15:8] = 24 (0x18):

MSR 0x199 = 0x00001800  (or 0x0000_0000_0000_1800 in full 64-bit form)

b) Frequency calculation from P-state:

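If the P-state value ranges over 0-39 with min = 0.8 GHz and max = 3.8 GHz (here a higher value means a higher frequency, as with the field written to the MSR, unlike the P0/P1/... naming where P0 is fastest):
frequency = min_freq + (P_state / 39) × (max_freq - min_freq)
frequency = 0.8 + (P_state / 39) × 3.0 GHz
Example: P_state = 24 gives 0.8 + (24 / 39) × 3.0 ≈ 2.65 GHz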
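Parts a) and b) can be sketched in a few lines. The helper names below are hypothetical, and the linear 0-39 mapping is the one assumed in this answer rather than a property of any particular processor.

```python
def encode_perf_ctl(pstate: int) -> int:
    """Place an 8-bit P-state value in bits [15:8] of the MSR 0x199 value."""
    if not 0 <= pstate <= 0xFF:
        raise ValueError("P-state value must fit in 8 bits")
    return pstate << 8

def pstate_to_ghz(pstate: int, max_pstate: int = 39,
                  min_ghz: float = 0.8, max_ghz: float = 3.8) -> float:
    """Linear mapping from a P-state value (0..max_pstate) to a frequency."""
    return min_ghz + (pstate / max_pstate) * (max_ghz - min_ghz)

msr_value = encode_perf_ctl(24)
msr_text = f"0x{msr_value:016X}"   # full 64-bit hexadecimal form
print(msr_text)
print(f"{pstate_to_ghz(24):.3f} GHz")
```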

c) Pseudocode for frequency ramp:

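import time

# read_pstate(), compute_target_pstate(), and write_msr() are placeholder
# helpers assumed to be provided by the platform layer.
current_pstate = read_pstate()
target_pstate = compute_target_pstate(1.8)  # convert 1.8 GHz to a P-state value
steps = 10
step = (target_pstate - current_pstate) / steps

for i in range(1, steps + 1):
    # round() rather than int() so the last step lands exactly on the target
    new_pstate = round(current_pstate + i * step)
    write_msr(0x199, new_pstate << 8)  # target P-state goes in bits [15:8]
    time.sleep(0.1)  # 100 ms between steps (1 s for the whole ramp)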
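To try the ramp logic without touching real hardware, the sketch below replaces the MSR with an in-memory stand-in. The FakeMsr class, the ramp_pstate function, and the start and target values (36 and 13) are all hypothetical; the delay defaults to zero here, whereas the answer above uses 100 ms per step.

```python
import time

class FakeMsr:
    """In-memory stand-in for an MSR device (illustration only)."""
    def __init__(self, initial_pstate: int):
        self.regs = {0x199: initial_pstate << 8}
        self.history = []  # P-state values written, in order

    def read(self, addr: int) -> int:
        return self.regs[addr]

    def write(self, addr: int, value: int) -> None:
        self.regs[addr] = value
        self.history.append((value >> 8) & 0xFF)

def ramp_pstate(msr, target_pstate: int, steps: int = 10,
                delay_s: float = 0.0) -> None:
    """Move from the current P-state to the target in equal steps."""
    current = (msr.read(0x199) >> 8) & 0xFF
    step = (target_pstate - current) / steps
    for i in range(1, steps + 1):
        # round() so that the final write equals the target exactly
        msr.write(0x199, round(current + i * step) << 8)
        time.sleep(delay_s)

msr = FakeMsr(initial_pstate=36)
ramp_pstate(msr, target_pstate=13)
print(msr.history)
```

Recording the written values makes it easy to confirm that the ramp is monotonic and ends on the requested P-state.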

Scaling Governor Selection:

  1. Scientific simulation (CPU-intensive, loose deadline)

    • Governor: ondemand or userspace

    • Justification: Can reduce frequency for power savings; loose deadline allows higher latency

    • Parameters: up_threshold=90, sampling_rate=50000 (conservative scaling)

  2. Data processing pipeline (30% comm, 70% compute)

    • Governor: ondemand with custom tuning

    • Justification: Scale down during communication waits, scale up for compute bursts

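    • Parameters: up_threshold=75, sampling_rate=10000 (down_threshold=25 is a tunable of the conservative governor, not ondemand, so it applies only if conservative is used instead)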

  3. Interactive visualization (< 100ms latency)

    • Governor: performance or HWP with a performance-oriented EPP (low EPP value)

    • Justification: Latency-critical, need maximum responsiveness

    • Parameters: Run at maximum frequency; consider HWP for better responsiveness

Power Equation Application:

Initial: P = CV²f + P_leak = 56W + 4W = 60W

New state: V = 0.8V, f = 2.0 GHz

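  • Dynamic: P_dyn = 56 W × (0.8)² × (2.0/2.5) = 56 × 0.64 × 0.8 ≈ 28.7 W (the voltage and frequency ratios are taken relative to the initial operating point; C is already folded into the 56 W)

  • Leakage: P_leak ≈ 3.6 W (slightly below the original 4 W because of the lower voltage)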

  • Total ≈ 32.3W (46% reduction from 60W)

a) Energy savings per job:

  • Original: 60W × 1h = 60 Wh

  • New: 32.3W × 1h = 32.3 Wh

  • Savings per job: 27.7 Wh

b) Annual cost savings:

  • Savings per job: 27.7 Wh = 0.0277 kWh

  • Jobs/year: 500 jobs/day × 365 days = 182,500 jobs

  • Total energy saved: 182,500 × 0.0277 = 5,055 kWh

  • Cost: 5,055 × $0.10 = $505.50/year
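The chain of calculations in this answer can be reproduced as follows. This is a sketch using the figures quoted above; the 0.8 voltage ratio assumes the initial voltage was 1.0 V, and the 3.6 W leakage value is the estimate from the answer, not a derived quantity. Because the code carries unrounded intermediate values, its final figures differ slightly from the rounded steps above.

```python
P_TOTAL_OLD_W = 60.0    # 56 W dynamic + 4 W leakage
P_DYN_OLD_W = 56.0
P_LEAK_NEW_W = 3.6      # estimate quoted in the answer above

# Scale dynamic power by (V ratio)^2 and the frequency ratio.
p_dyn_new = P_DYN_OLD_W * (0.8 ** 2) * (2.0 / 2.5)
p_total_new = p_dyn_new + P_LEAK_NEW_W

job_hours = 1.0
saved_wh_per_job = (P_TOTAL_OLD_W - p_total_new) * job_hours

jobs_per_year = 500 * 365
saved_kwh_per_year = saved_wh_per_job * jobs_per_year / 1000.0
cost_saved = saved_kwh_per_year * 0.10   # $0.10 per kWh

print(f"new power:      {p_total_new:.1f} W")
print(f"saved per job:  {saved_wh_per_job:.1f} Wh")
print(f"saved per year: {saved_kwh_per_year:.0f} kWh (${cost_saved:.2f})")
```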

Intel RAPL Analysis:

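a) Decode MSR_RAPL_POWER_UNIT (value 0xA1003):

  • Energy status unit field: bits [12:8] = (0xA1003 >> 8) & 0x1F = 0x10 = 16

  • Energy unit = 2^(-16) J = 1/65,536 J ≈ 15.3 μJ per count

  • The other fields: bits [19:16] = 0xA give the time unit (2^(-10) s ≈ 0.977 ms) and bits [3:0] = 0x3 give the power unit (2^(-3) W = 0.125 W)

b) Energy consumed:

  • ΔE = 0x3F7F0E00 - 0x2A5B0E00 = 0x15240000 = 354,680,832 counts

  • Energy: 354,680,832 × 2^(-16) J = 5,412 J ≈ 5.4 kJ (about 0.0015 kWh)

c) Average power:

  • P_avg = 5,412 J / 60 s ≈ 90.2 W

d) Power cap check:

  • Average 90.2 W < 200 W limit

  • No, the average stayed well below the limit (two samples taken 60 s apart cannot rule out brief instantaneous peaks)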
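A sketch of the decoding and energy arithmetic, assuming the documented field layout of MSR_RAPL_POWER_UNIT (power unit in bits [3:0], energy status unit in bits [12:8], time unit in bits [19:16]) and a 32-bit energy counter. The function names are illustrative, and the raw values are the ones given in the question.

```python
def decode_rapl_units(raw: int) -> dict:
    """Decode the three unit fields; each unit is 1 / 2^field."""
    return {
        "power_w": 1.0 / (1 << (raw & 0xF)),
        "energy_j": 1.0 / (1 << ((raw >> 8) & 0x1F)),
        "time_s": 1.0 / (1 << ((raw >> 16) & 0xF)),
    }

def energy_delta_joules(start: int, end: int, energy_unit_j: float,
                        counter_bits: int = 32) -> float:
    """Counter difference in joules, tolerating one wraparound."""
    delta = (end - start) % (1 << counter_bits)
    return delta * energy_unit_j

units = decode_rapl_units(0xA1003)
energy_j = energy_delta_joules(0x2A5B0E00, 0x3F7F0E00, units["energy_j"])
avg_power_w = energy_j / 60.0          # 60 s between the two samples
exceeded = avg_power_w > 200.0         # 200 W package limit

print(f"energy unit: {units['energy_j'] * 1e6:.2f} uJ")
print(f"energy:      {energy_j:.0f} J")
print(f"avg power:   {avg_power_w:.1f} W, limit exceeded: {exceeded}")
```

The modulo in energy_delta_joules is a design choice for counters that wrap between samples; it is not needed for the two readings used here.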

HWP vs OS Control Comparison:

  • Scenario A: 50 μs × 4 changes = 200 μs overhead = 0.0002s = 0.002% of 10s

  • Scenario B: 2 μs × 4 changes = 8 μs overhead = 0.000008s = 0.00008% of 10s

a) Total latency overhead:

  • OS control: 200 μs

  • HWP: 8 μs

b) Performance impact:

  • OS: 200 μs / 10s = 0.002% (negligible)

  • HWP: 8 μs / 10s = 0.00008% (negligible)

c) Why HWP is preferable:

  • Although both overheads are negligible for a 10-second run, HWP provides:

  • Better responsiveness to iowait and memory latency events

  • Per-core independent optimization

  • AVX-512 frequency awareness

  • Reduced monitoring overhead
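The overhead percentages in parts a) and b) follow from a one-line calculation. The helper below is a hypothetical illustration using the latencies and transition count from the two scenarios.

```python
def transition_overhead(latency_us: float, transitions: int,
                        runtime_s: float) -> tuple:
    """Return (total overhead in microseconds, overhead as % of runtime)."""
    total_us = latency_us * transitions
    percent = 100.0 * (total_us * 1e-6) / runtime_s
    return total_us, percent

os_us, os_pct = transition_overhead(50, 4, 10)    # OS-controlled scaling
hwp_us, hwp_pct = transition_overhead(2, 4, 10)   # HWP

print(f"OS:  {os_us:.0f} us, {os_pct:.5f} %")
print(f"HWP: {hwp_us:.0f} us, {hwp_pct:.5f} %")
```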

Case Study Analysis - Cascade Lake:

a) New CPU power with 18% savings:

  • Original CPU: 100W

  • Savings: 18%

  • New: 100 × (1 - 0.18) = 82W

b) New total node power with 15% node savings:

  • Original: 350W

  • Savings: 15%

  • New: 350 × (1 - 0.15) = 297.5W

c) Annual energy cost savings for 1000 nodes:

  • Power reduction per node: (350 - 297.5)W = 52.5W

  • Per node per year: 52.5W × 24h × 365d = 460 kWh

  • For 1000 nodes: 460,000 kWh

  • Cost: 460,000 × $0.12 = $55,200/year

d) Payback period:

  • Software development cost: $500,000

  • Annual savings: $55,200

  • Payback: $500,000 / $55,200 ≈ 9 years

  • (Alternative strategies like hardware monitoring tools may have shorter payback)
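The node-level savings and payback period can be recomputed as a quick check. This sketch uses the figures from the answer and assumes nodes run continuously all year; it carries unrounded values, so the annual savings come out slightly below the rounded $55,200.

```python
NODE_POWER_W = 350.0
NODE_SAVINGS = 0.15
NODES = 1000
PRICE_PER_KWH = 0.12
DEV_COST = 500_000.0

saved_w_per_node = NODE_POWER_W * NODE_SAVINGS             # 52.5 W
kwh_per_node_year = saved_w_per_node * 24 * 365 / 1000.0   # continuous operation
total_kwh_year = kwh_per_node_year * NODES
annual_savings = total_kwh_year * PRICE_PER_KWH
payback_years = DEV_COST / annual_savings

print(f"saved per node:  {kwh_per_node_year:.1f} kWh/year")
print(f"annual savings:  ${annual_savings:,.0f}")
print(f"payback period:  {payback_years:.1f} years")
```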

Runtime System Design - Power Budget Allocation:

a) Frequency reduction tolerance by workload:

  • Core 0 (compute-bound, 8 ops/byte): Tolerates ~5-10% reduction (very sensitive)

  • Core 1 (memory-bound, 0.5 ops/byte): Tolerates ~30-40% reduction (insensitive)

  • Core 2 (I/O-bound, 0.1 ops/byte): Tolerates ~50%+ reduction (very insensitive)

  • Core 3 (balanced, 2 ops/byte): Tolerates ~20-25% reduction (moderately sensitive)

b) Algorithm for power allocation (maximize throughput):

// Greedy allocation: allocate to most sensitive workloads first
1. Sort cores by compute intensity (high to low)
2. Allocate remaining power to each core in order
3. Each core gets: P_core = P_baseline + 0.8 × (available_power / num_cores)
4. Compute expected throughput reduction per core
5. Iteratively reallocate to maximize total throughput

Allocation:

  • Core 0 (compute): 60W (no reduction)

  • Core 3 (balanced): 50W (no reduction)

  • Core 1 (memory): 40W + 50W = 90W available

  • Core 2 (I/O): 30W + 50W = 80W available

  • Total: 60 + 50 + 90 + 80 = 280W (under 500W limit)

c) Dynamic reallocation pseudocode:

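def reallocate_power(cores, idle, allocated_power):
    """Offer an idle core's power budget to the active cores, split evenly."""
    idle_core = detect_idle_core()  # placeholder: returns a core ID or None
    if idle_core is None:  # explicit None check so that core 0 is not skipped
        return
    idle_power = allocated_power[idle_core]
    active_cores = [c for c in cores if not idle[c]]
    if not active_cores:  # avoid dividing by zero when every core is idle
        return
    power_per_core = idle_power / len(active_cores)

    for core in active_cores:
        if can_increase_frequency(core):  # placeholder: thermal/limit check
            increase_frequency(core, power_per_core)  # placeholder actuator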

d) Thermal constraint handling:

  • Monitor core temperature continuously

  • If core > 85°C: reduce frequency by 5%

  • If core > 90°C: reduce frequency by 10% (approaching limit)

  • Use thermal headroom to allow temporary frequency boost
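The two temperature thresholds above map naturally onto a small policy function. This is a sketch with a hypothetical function name; it covers only the throttling rules and leaves out the headroom-based boost.

```python
def thermal_frequency_scale(temp_c: float) -> float:
    """Frequency scale factor for a core at the given temperature.

    Thresholds follow the rules above: more than 90 C -> reduce by 10%,
    more than 85 C -> reduce by 5%, otherwise no reduction.
    """
    if temp_c > 90.0:
        return 0.90
    if temp_c > 85.0:
        return 0.95
    return 1.0

for temp in (70.0, 87.5, 93.0):
    print(f"{temp:.1f} C -> scale {thermal_frequency_scale(temp):.2f}")
```

Checking the higher threshold first keeps the two rules from overlapping.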

Workload Characterization and Power Optimization:

a) Arithmetic intensity:

  • Memory accesses per 1000 cycles: 450

  • FLOP per 1000 cycles: 1800

  • Bytes per access (typical L3 cache line): 64 bytes

  • Total bytes: 450 × 64 = 28,800 bytes

  • AI = 1800 FLOPs / 28,800 bytes = 0.0625 FLOP/byte (or 16 bytes/FLOP)

b) Boundedness analysis:

  • This is memory-bound

  • Justification: L3 hit rate is only 85%, meaning 15% miss rate with 200-cycle latency

  • With 0.0625 FLOP/byte, memory throughput is the bottleneck

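  • Latency check: 450 accesses × 15% miss rate ≈ 68 L3 misses per 1000 cycles; at 200 cycles each that is ~13,500 cycles of miss latency per 1000 cycles, which can only be sustained if many misses overlap, so the core spends most of its time waiting on memory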

c) Performance impact of 20% frequency reduction:

  • Baseline: 2.5 GHz × 1800 FLOP/1000 cycles = 4.5 GFLOP/s

  • With reduction: 2.0 GHz × 1800 FLOP/1000 cycles = 3.6 GFLOP/s

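  • Impact: up to a 20% reduction in FLOP/s, if cycles per FLOP stayed constant

  • However, memory latency is roughly fixed in wall-clock time, so each miss costs ~20% fewer core cycles at 2.0 GHz and the stalled part of the runtime barely changes

  • Overall runtime impact: ~10-15% (memory accesses still dominate, not CPU cycles)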

d) Power management strategy:

  • Reduce frequency to 70-75% (save 25-30% power)

  • Runtime will increase ~12-15% (memory-latency-bound, not CPU-sensitive)

  • Overall energy reduction: roughly 14-22% (relative energy = power × runtime, from 0.75 × 1.15 ≈ 0.86 down to 0.70 × 1.12 ≈ 0.78; the power saving outweighs the longer runtime)

  • Use ondemand governor with high up_threshold (85-90%)

  • Monitor memory bandwidth utilization to validate memory-boundedness
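The quantities used in parts a) to d) can be recomputed with a short script. This is a sketch under the assumptions stated in the answer (64-byte lines, 85% L3 hit rate, 200-cycle miss latency); the relative_energy helper is a hypothetical name for the power × runtime product.

```python
CACHE_LINE_BYTES = 64
ACCESSES_PER_KCYCLE = 450      # memory accesses per 1000 cycles
FLOPS_PER_KCYCLE = 1800        # FLOPs per 1000 cycles
L3_HIT_RATE = 0.85
MISS_LATENCY_CYCLES = 200

# a) Arithmetic intensity in FLOP per byte
ai = FLOPS_PER_KCYCLE / (ACCESSES_PER_KCYCLE * CACHE_LINE_BYTES)

# b) Misses and total miss latency per 1000 cycles (before any overlap)
misses = ACCESSES_PER_KCYCLE * (1.0 - L3_HIT_RATE)
miss_latency = misses * MISS_LATENCY_CYCLES

# c) Peak FLOP rate if cycles per FLOP stay constant
def gflops(freq_ghz: float) -> float:
    return freq_ghz * FLOPS_PER_KCYCLE / 1000.0

# d) Energy relative to baseline = power fraction x runtime factor
def relative_energy(power_fraction: float, runtime_factor: float) -> float:
    return power_fraction * runtime_factor

print(f"AI = {ai:.4f} FLOP/byte")
print(f"misses = {misses:.1f}, miss latency = {miss_latency:.0f} cycles")
print(f"{gflops(2.5):.1f} -> {gflops(2.0):.1f} GFLOP/s")
print(f"energy: {relative_energy(0.75, 1.15):.3f} .. {relative_energy(0.70, 1.12):.3f}")
```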