Recently, I initiated experiments with a four-channel power meter in the lab to monitor the power consumption of our testbed, specifically the Distributed Unit (DU) and the Centralized Unit (CU). However, the initial findings were puzzling: the DU and CU displayed markedly different power consumption profiles (see plot below). The experimental procedure was straightforward: measure the power drawn by each node (CU and DU) while the Radio Access Network (RAN) software was inactive, to establish a baseline. Given that both nodes operated under identical loads and used the same hardware components (RAM, processor, GPU, etc.), I naturally anticipated similar power consumption profiles.
This anomaly sparked my curiosity and, to my surprise, led me to a highly enlightening troubleshooting session. That’s precisely why I’m eager to share with you today what I learned from this experience and guide you towards a methodology that you can employ should you encounter a similar challenge in the future. In this post we’ll cover the full journey: measuring power and CPU frequency, understanding C-states, P-states, and frequency scaling, and finally tracking down the root cause.
If any of this sounds interesting to you, then please read on!
As I delved deeper into the troubleshooting process, it became evident that isolating power consumption at a per-component level - specifically, examining the processor, RAM, and other key components, the processor being my primary suspect - was crucial for understanding the significant disparity in power consumption between the two computers. However, I soon realized that monitoring energy consumption at such a granular level posed real challenges, requiring specialized hardware and a meticulous approach. Thankfully, Intel CPUs have incorporated a feature known as Running Average Power Limit (RAPL) since the Sandy Bridge architecture. This feature allows us to monitor power consumption across various domains of the CPU chip, including the attached DRAM and the on-chip GPU.
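On Linux, RAPL counters are exposed through the powercap sysfs interface as monotonically increasing energy counters in microjoules, so average power is simply the difference between two readings divided by the elapsed time. Below is a minimal sketch of that computation; the sysfs path is the standard package-domain location on Intel systems, and the demo at the end uses made-up sample readings so it runs anywhere:

```shell
# Standard powercap location of the package-domain energy counter (uJ).
# Reading it usually requires root; the counter wraps at max_energy_range_uj.
RAPL=/sys/class/powercap/intel-rapl:0/energy_uj

# Convert two energy samples (uJ) taken INTERVAL seconds apart into watts.
avg_watts() {  # usage: avg_watts E1_UJ E2_UJ INTERVAL_S
  awk -v e1="$1" -v e2="$2" -v t="$3" \
    'BEGIN { printf "%.2f\n", (e2 - e1) / (t * 1e6) }'
}

# On a machine with RAPL support (run as root):
#   e1=$(cat "$RAPL"); sleep 1; e2=$(cat "$RAPL"); avg_watts "$e1" "$e2" 1

# Demo with sample readings: 30 J consumed over 2 s
avg_watts 1000000 31000000 2   # -> 15.00
```

Tools like turbostat read these same counters under the hood, which is why no extra hardware is needed for package-level measurements.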
For the purpose of this troubleshooting process, there was no need to increase the granularity of the power measurements, so I decided to stick with the package power consumption, which I collected using the user-space tool turbostat:
sudo turbostat --out [FILE] --cpu package --quiet --show PkgWatt --debug -i 1 --num_iterations 300
In the previous command, notice the -i option, which is set to 1. This option sets the sampling period and should not be ignored. According to the Linux man pages: “extremely short measurement intervals (much less than 1 second), or system activity that prevents turbostat from being able to run on all CPUs to quickly collect data, will result in inconsistent results.”
Upon parsing the turbostat output, it became evident that the processors were contributing to the disparate power consumption profiles of the CU and the DU. While I had managed to isolate the issue to a certain extent (it’s worth noting that this finding didn’t rule out the possibility of other components exhibiting differences in power consumption), the reason behind the varying power consumption levels of the processors remained unclear. The next step in this process involved examining the frequency at which each processor was operating. Once again, I parsed the turbostat output to retrieve this information. I then plotted the ‘Bzy_MHz’ metric given in the turbostat output, which represents the average clock rate while the CPU was actively executing instructions.
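Pulling these metrics out of a turbostat log only takes a small awk helper. This is a sketch assuming a log produced with `--show PkgWatt,Bzy_MHz`; since the column order depends on the options used, the header row is used to locate the column, and the sample file at the end is made up for the demo:

```shell
# Average a named column of a turbostat log.
avg_column() {  # usage: avg_column COLUMN_NAME FILE
  awk -v col="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    c && $c ~ /^[0-9.]+$/ { sum += $c; n++ }
    END { if (n) printf "%.1f\n", sum / n }
  ' "$2"
}

# Demo with a made-up sample in turbostat column format (real logs repeat
# the header block per interval; the numeric filter above skips those rows):
cat > /tmp/turbostat_sample.log <<'EOF'
Bzy_MHz PkgWatt
3000 42.0
3200 44.0
3100 43.0
EOF

avg_column PkgWatt /tmp/turbostat_sample.log   # -> 43.0
avg_column Bzy_MHz /tmp/turbostat_sample.log   # -> 3100.0
```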
The new results revealed that the two processors were operating at notably different frequencies, even when running under comparable workloads. At this point, it became clear that the two processors couldn’t be using the same frequency scaling policies, and thus a closer inspection of the implemented frequency scaling configurations was needed.
This section introduces concepts that are relevant to understanding how frequency scaling works. Note that some of these concepts are specific to Intel CPUs.
One of the first things you’ll want to do when trying to configure frequency scaling in your CPU is to refer to the specifications of your CPU model. You can do this in various ways: for example, on Linux you can run lscpu in the terminal to obtain this information, or you may refer to your CPU manufacturer’s webpage. In my lab, the CU and DU computers run on 13th Gen Intel(R) Core(TM) i9-13900K CPUs. By looking into the specifications of my CPU, I found the following:
- CPU Model: 13th Gen Intel(R) Core(TM) i9-13900K
- Max. Freq: 5.8 GHz
- 1 socket
- 24 cores per socket
- 8 performance cores
- Each can execute 2 threads using multithreading technology
- Performance-core Max Turbo Frequency: 5.40 GHz
- Performance-core Base Frequency: 3.00 GHz
- 16 efficient cores
- Efficient-core Max Turbo Frequency: 4.30 GHz
- Efficient-core Base Frequency: 2.20 GHz
- 32 total threads (16 threads max. for performance cores and other 16 threads for efficient cores)
- Processor base power: 125 W
- Maximum turbo power: 253 W
As you can see, the CPU has twenty-four cores in a single socket, but they are not all the same. Eight of those cores are performance cores (or P-cores) and the other 16 are efficient cores (or E-cores). What’s the difference between them? The main difference is that P-cores are capable of hyper-threading (i.e. running two software threads at once), while E-cores can only run a single software thread. There are also considerable differences in terms of the frequencies these cores support: P-cores are tuned for high turbo frequencies and high IPC (instructions per cycle), while E-cores are optimized for power-efficient throughput.
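A quick way to see this split on a running system is to look at each CPU's maximum frequency limit, since P-core and E-core threads report different cpuinfo_max_freq values (in kHz). The helper below just groups a list of such values; the demo feeds it four sample readings rather than touching sysfs, so the expected grouping is illustrative:

```shell
# Count how many CPUs report each maximum frequency.
count_core_types() {
  sort -n | uniq -c | awk '{ print $1, $2 }'
}

# On a real hybrid system you would feed it the per-CPU limits:
#   cat /sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_max_freq | count_core_types
# and expect two groups: E-core threads at a lower limit, P-core threads higher.

# Demo with sample values (two E-core threads, two P-core threads):
printf '%s\n' 5400000 5400000 4300000 4300000 | count_core_types
# -> 2 4300000
#    2 5400000
```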
We can also see in the CPU’s specifications that both P-cores and E-cores support two reference frequencies: a base frequency and a max turbo frequency. The CPU frequency defines the number of cycles (also known as clock ticks) a processor/core can execute per second: the higher the frequency, the higher the achievable performance, but also the higher the power consumption. Processor core clocks can effectively operate at different frequencies, and can transition between them depending on the current workload, thermal and power constraints, and the frequency scaling policy configured in the operating system.
Intel CPUs can dynamically adjust the frequency of the cores according to the current load, which can save energy or improve performance depending on the use-case. For this reason, Intel CPUs implement so-called C-states and P-states, which correspond to different idle and performance levels, and thus to different levels of power consumption.
| Mode | Definition |
|---|---|
| C0 | Operational state. CPU fully turned on. |
| C1 | First idle state. Stops CPU main internal clocks via software. Bus interface unit and APIC are kept running at full speed. |
| C2 | Stops CPU main internal clocks via hardware. State in which the processor maintains all software-visible states, but may take longer to wake up through interrupts. |
| C3 | Stops all CPU internal clocks. The processor does not need to keep its cache coherent, but maintains other states. Some processors have variations of the C3 state that differ in how long it takes to wake the processor through interrupts. |
C0 corresponds to an operational state, i.e. the core is executing instructions. The higher the C number, the deeper the sleep state, and thus the higher the energy savings. The downside is that the deeper the sleep state, the higher the latency introduced to put the core back to C0. It is possible to configure the processor to use up to a certain C-state if saving energy is a priority.
When a core is in C0 state, it can be in one of several performance states (P-states). Thus, unlike C-states, P-states are exclusively operational states that correspond to specific frequency and voltage values.
Some processors support raising their frequency above the normal maximum for a short burst of time, under appropriate thermal conditions. Intel implements this capability through Intel® Turbo Boost Technology 2.0 and Intel® Turbo Boost Max Technology 3.0; the latter additionally identifies the processor’s fastest cores and directs the boost to them.
As discussed previously, Turbo Boost can dynamically increase the frequency of some cores above the normal maximum, which is why it is sometimes called algorithmic overclocking. This differs from manual CPU overclocking, where the user can fine-tune the overclocking settings to meet some performance requirement (e.g. for gaming purposes). If you are interested in overclocking your CPU, there’s good news: various methods are available, for example through the CPU Core Ratio option in your motherboard’s BIOS.
The Linux kernel supports CPU performance/frequency scaling by means of the CPUFreq (CPU Frequency scaling) subsystem, which consists of three different levels of abstraction: the core, scaling governors, and scaling drivers.
User-space tools such as cpupower allow us to select the scaling driver and governor.
In modern Intel processors, there are at least two drivers available: intel_pstate and acpi_cpufreq. In this post, we focus on the intel_pstate driver since this is the one we have been using. It implements a scaling driver with an internal governor for Intel Core (Sandy Bridge and newer) processors.
The “governors” used in active mode are not generic scaling governors, but their names are the same as the names of some generic governors. You should pay special attention to this because they generally do not work in the same way as the generic governors they share the names with.
The powersave P-state selection algorithm provided by intel_pstate is not a counterpart of the generic powersave governor (roughly, it corresponds to the schedutil and ondemand governors).
There are two P-state selection algorithms provided by intel_pstate in the active mode: powersave and performance. The way they both operate depends on whether or not the hardware-managed P-states (HWP) feature has been enabled in the processor, and possibly on the processor model.
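Whether the processor supports HWP can be checked from the CPU flags the kernel reports in /proc/cpuinfo, where HWP support shows up as the "hwp" flag. A small sketch, demonstrated on a made-up flags line so it runs on any machine:

```shell
# True if stdin contains the standalone "hwp" CPU flag.
has_hwp() { grep -qw hwp; }

# On a real system:
#   has_hwp < /proc/cpuinfo && echo "HWP supported"

# Demo with a sample /proc/cpuinfo flags line:
printf 'flags\t\t: fpu vme est tm2 hwp hwp_notify\n' | has_hwp && echo "HWP supported"
# -> HWP supported
```

Note that `grep -w` matches only the standalone `hwp` token, not related flags such as `hwp_notify`.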
ArchLinux’s Wiki provides a comprehensive overview of the generic CPU governors, which I have referenced below for easier access.
| Governor | Description |
|---|---|
| performance | Run the CPU at the maximum frequency, obtained from /sys/devices/system/cpu/cpuX/cpufreq/scaling_max_freq |
| powersave | Run the CPU at the minimum frequency, obtained from /sys/devices/system/cpu/cpuX/cpufreq/scaling_min_freq |
| userspace | Run the CPU at user-specified frequencies, configurable via /sys/devices/system/cpu/cpuX/cpufreq/scaling_setspeed |
| ondemand | Scales the frequency dynamically according to current load. Jumps to the highest frequency and then possibly backs off as the idle time increases |
| conservative | Scales the frequency dynamically according to current load. Scales the frequency more gradually than ondemand |
| schedutil | Scheduler-driven CPU frequency selection |
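In practice, the standard way to switch governors is `cpupower frequency-set -g <governor>`. The sketch below wraps that command in a hypothetical dry-run helper (the `set_governor` name and its output are illustrative, not part of cpupower) that first checks the requested governor against the available list, which on a real system comes from /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors:

```shell
# Dry-run helper: validate a governor against the available list before
# emitting the cpupower command that would apply it on all CPUs.
set_governor() {  # usage: set_governor GOVERNOR "AVAILABLE_GOVERNORS"
  case " $2 " in
    *" $1 "*) echo "would run: sudo cpupower frequency-set -g $1" ;;
    *) echo "governor '$1' not available" >&2; return 1 ;;
  esac
}

set_governor performance "conservative ondemand performance schedutil"
# -> would run: sudo cpupower frequency-set -g performance
```

Validating first avoids the confusing case where a governor name valid for one driver (e.g. ondemand under acpi_cpufreq) is simply absent under intel_pstate in active mode.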
To enable or disable the intel_pstate driver, you can pass a kernel boot parameter through GRUB:

1. Edit /etc/default/grub and set GRUB_CMDLINE_LINUX="intel_pstate=disable" to turn the driver off (it is enabled by default on supported CPUs). Note that when intel_pstate is disabled, acpi_cpufreq is used instead.
2. Run sudo update-grub on the terminal.
3. Run sudo reboot.

To select the driver's operation mode, the procedure is analogous:

1. Edit /etc/default/grub and set GRUB_CMDLINE_LINUX="intel_pstate=X", where X can be either active or passive.
2. Run sudo update-grub on the terminal.
3. Run sudo reboot.
4. Verify the change by running cat /sys/devices/system/cpu/intel_pstate/status. Alternatively, you can execute sudo cpupower frequency-info: if the driver was set to passive mode, then you should see the line “driver: intel_cpufreq”, otherwise you should see “driver: intel_pstate”.

To change the scaling governor:

1. Run cpupower -c all frequency-info to inspect the current configuration of all CPUs.
2. Configure the desired governor (e.g. via the cpufrequtils service configuration) and apply it with sudo systemctl restart cpufrequtils.
3. Run cpupower -c all frequency-info again to check which governor is now being used.

Finally, to list the governors supported on your system, execute cpupower frequency-info and look for “available cpufreq governors”.
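If you use the cpufrequtils service to persist the governor across reboots, the governor it applies is typically set in its defaults file. A minimal hypothetical example (the path below is the one used by the Debian/Ubuntu cpufrequtils package; it may differ on other distributions):

```shell
# /etc/default/cpufrequtils -- hypothetical example contents.
# The service reads this shell-style file and applies GOVERNOR at startup.
GOVERNOR="performance"
```

After editing it, restarting the service (sudo systemctl restart cpufrequtils) applies the new governor.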
Edit the file /sys/devices/system/cpu/intel_pstate/no_turbo or /sys/devices/system/cpu/cpufreq/intel_pstate/no_turbo, depending on which one is available. There should be a single number written to that file: if it is 1 (see command below), then the driver is not allowed to set any turbo P-states; if it is 0 (the default), then turbo P-states can be set by the driver.
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
Execute watch cat /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq to monitor the current operating frequency of each core and verify that the change took effect.
As it turned out, the large differences in power consumption between the CU and DU computers could be explained by differences in the BIOS configuration of the CPU Core Ratio option. In particular, the DU was overclocked, while the CU was not. Once we re-configured the two computers to use the same settings (overclocking, state of the intel_pstate driver, and governor), we repeated our experiments and were pleased to find that the two computers finally exhibited comparable power consumption profiles.