Recently, I initiated experiments with a four-channel power meter in the lab to monitor the power consumption of our testbed, specifically the Distributed Unit (DU) and the Centralized Unit (CU). However, the initial findings were puzzling: the DU and CU displayed markedly different power consumption profiles (see plot below). The experimental procedure was straightforward: measure the power drawn by each node (CU and DU) while the Radio Access Network (RAN) software was inactive, to establish a baseline. Given that both nodes operated under identical loads and used the same hardware components (RAM, processor, GPU, etc.), I naturally anticipated similar power consumption profiles.
This anomaly sparked my curiosity and, to my surprise, led me to a highly enlightening troubleshooting session. That’s precisely why I’m eager to share with you today what I learned from this experience and guide you towards a methodology that you can employ should you encounter a similar challenge in the future. In this post we’ll cover the full journey: measuring power and CPU frequency, understanding C-states, P-states, and frequency scaling, and finally tracking down the root cause.
If any of this sounds interesting to you, then please read on!
As I delved deeper into the troubleshooting process, it became evident that isolating power consumption at a per-component level - specifically, examining the processor, RAM, and other key components, the processor being my primary suspect - was crucial for understanding the significant disparity in power consumption between the two computers. However, I soon realized that monitoring energy consumption at such a granular level posed real challenges, requiring specialized hardware and a meticulous approach. Thankfully, Intel CPUs have incorporated a feature known as Running Average Power Limit (RAPL) since the Sandy Bridge architecture. This feature allows us to monitor power consumption across various domains of the CPU chip, including the attached DRAM and the on-chip GPU.
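On Linux, RAPL counters are exposed through the powercap sysfs interface as monotonically increasing energy counters in microjoules, so average power is simply the difference between two readings divided by the elapsed time. Below is a minimal sketch of that computation; the sysfs path is the standard package-domain location on Intel systems, and the demo at the end uses made-up sample readings so it runs anywhere:

```shell
# Standard powercap location of the package-domain energy counter (uJ).
# Reading it usually requires root; the counter wraps at max_energy_range_uj.
RAPL=/sys/class/powercap/intel-rapl:0/energy_uj

# Convert two energy samples (uJ) taken INTERVAL seconds apart into watts.
avg_watts() {  # usage: avg_watts E1_UJ E2_UJ INTERVAL_S
  awk -v e1="$1" -v e2="$2" -v t="$3" \
    'BEGIN { printf "%.2f\n", (e2 - e1) / (t * 1e6) }'
}

# On a machine with RAPL support (run as root):
#   e1=$(cat "$RAPL"); sleep 1; e2=$(cat "$RAPL"); avg_watts "$e1" "$e2" 1

# Demo with sample readings: 30 J consumed over 2 s
avg_watts 1000000 31000000 2   # -> 15.00
```

Tools like turbostat read these same counters under the hood, which is why no extra hardware is needed for package-level measurements.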
For the purpose of this troubleshooting process, there was no need to increase the granularity of the power measurements, so I decided to stick with the package power consumption, which I collected using the user-space tool turbostat:
sudo turbostat --out [FILE] --cpu package --quiet --show PkgWatt --debug -i 1 --num_iterations 300
In the previous command, notice the -i option, which is set to 1. This option sets the sampling period and should not be ignored. According to the Linux man pages: “extremely short measurement intervals (much less than 1 second), or system activity that prevents turbostat from being able to run on all CPUs to quickly collect data, will result in inconsistent results.”
Upon parsing the turbostat output, it became evident that the processors were contributing to the disparate power consumption profiles of the CU and the DU. While I had managed to isolate the issue to a certain extent (it’s worth noting that this finding didn’t rule out the possibility of other components exhibiting differences in power consumption), the reason behind the varying power consumption levels of the processors remained unclear. The next step in this process involved examining the frequency at which each processor was operating. Once again, I parsed the turbostat output to retrieve this information. I then plotted the ‘Bzy_MHz’ metric given in the turbostat output, which represents the average clock rate while the CPU was actively executing instructions.
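Pulling these metrics out of a turbostat log only takes a small awk helper. This is a sketch assuming a log produced with `--show PkgWatt,Bzy_MHz`; since the column order depends on the options used, the header row is used to locate the column, and the sample file at the end is made up for the demo:

```shell
# Average a named column of a turbostat log.
avg_column() {  # usage: avg_column COLUMN_NAME FILE
  awk -v col="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    c && $c ~ /^[0-9.]+$/ { sum += $c; n++ }
    END { if (n) printf "%.1f\n", sum / n }
  ' "$2"
}

# Demo with a made-up sample in turbostat column format (real logs repeat
# the header block per interval; the numeric filter above skips those rows):
cat > /tmp/turbostat_sample.log <<'EOF'
Bzy_MHz PkgWatt
3000 42.0
3200 44.0
3100 43.0
EOF

avg_column PkgWatt /tmp/turbostat_sample.log   # -> 43.0
avg_column Bzy_MHz /tmp/turbostat_sample.log   # -> 3100.0
```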
The new results revealed that the two processors were operating at notably different frequencies, even when running under comparable workloads. At this point, it became clear that the two processors couldn’t be using the same frequency scaling policies, and thus a closer inspection of the implemented frequency scaling configurations was needed.
This section introduces concepts that are relevant to understanding how frequency scaling works. Note that some of these concepts are specific to Intel CPUs.
One of the first things you’ll want to do when trying to configure frequency scaling in your CPU is to refer to the specifications of your CPU model. You can do this in various ways: for example, on Linux you can run lscpu in the terminal to obtain this information, or you may refer to your CPU manufacturer’s webpage. In my lab, the CU and DU computers run on 13th Gen Intel(R) Core(TM) i9-13900K CPUs. By looking into the specifications of my CPU, I found the following:
- CPU Model: 13th Gen Intel(R) Core(TM) i9-13900K
- Max. Freq: 5.8 GHz
- 1 socket
- 24 cores per socket
- 8 performance cores
- Each can execute 2 threads using multithreading technology
- Performance-core Max Turbo Frequency: 5.40 GHz
- Performance-core Base Frequency: 3.00 GHz
- 16 efficient cores
- Efficient-core Max Turbo Frequency: 4.30 GHz
- Efficient-core Base Frequency: 2.20 GHz
- 32 total threads (16 threads max. for performance cores and other 16 threads for efficient cores)
- Processor base power: 125 W
- Maximum turbo power: 253 W
As you can see, the CPU has twenty-four cores in a single socket, but they are not all the same. Eight of those cores are performance cores (or P-cores) and the other 16 are efficient cores (or E-cores). What’s the difference between them? The main difference is that P-cores are capable of hyper-threading (i.e. running two software threads at once), while E-cores can only run a single software thread. There are also considerable differences in terms of the frequencies these cores support: P-cores are tuned for high turbo frequencies and high IPC (instructions per cycle), while E-cores are optimized for power-efficient throughput.
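A quick way to see this split on a running system is to look at each CPU's maximum frequency limit, since P-core and E-core threads report different cpuinfo_max_freq values (in kHz). The helper below just groups a list of such values; the demo feeds it four sample readings rather than touching sysfs, so the expected grouping is illustrative:

```shell
# Count how many CPUs report each maximum frequency.
count_core_types() {
  sort -n | uniq -c | awk '{ print $1, $2 }'
}

# On a real hybrid system you would feed it the per-CPU limits:
#   cat /sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_max_freq | count_core_types
# and expect two groups: E-core threads at a lower limit, P-core threads higher.

# Demo with sample values (two E-core threads, two P-core threads):
printf '%s\n' 5400000 5400000 4300000 4300000 | count_core_types
# -> 2 4300000
#    2 5400000
```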
We can also see in the CPU’s specifications that both P-cores and E-cores support two reference frequencies: a base frequency and a max turbo frequency. The CPU frequency defines the number of cycles (also known as clock ticks) a processor/core can execute per second: the higher the frequency, the higher the achievable performance, but also the higher the power consumption. Processor core clocks can effectively operate at different frequencies, and can transition between them depending on the current workload, thermal and power constraints, and the frequency scaling policy configured in the operating system.
Intel CPUs can dynamically adjust the frequency of the cores according to the current load, which can save energy or improve performance depending on the use-case. For this reason, Intel CPUs implement so-called C-states and P-states, which correspond to different idle and performance levels, and thus to different levels of power consumption.
| Mode | Definition |
|---|---|
| C0 | Operational state. CPU fully turned on. |
| C1 | First idle state. Stops CPU main internal clocks via software. Bus interface unit and APIC are kept running at full speed. |
| C2 | Stops CPU main internal clocks via hardware. State in which the processor maintains all software-visible states, but may take longer to wake up through interrupts. |
| C3 | Stops all CPU internal clocks. The processor does not need to keep its cache coherent, but maintains other states. Some processors have variations of the C3 state that differ in how long it takes to wake the processor through interrupts. |
C0 corresponds to an operational state, i.e. the core is executing instructions. The higher the C number, the deeper the sleep state, and thus the higher the energy savings. The downside is that the deeper the sleep state, the higher the latency introduced to put the core back to C0. It is possible to configure the processor to use up to a certain C-state if saving energy is a priority.
When a core is in C0 state, it can be in one of several performance states (P-states). Thus, unlike C-states, P-states are exclusively operational states that correspond to specific frequency and voltage values.
Some processors support raising their frequency above the normal maximum for a short burst of time, under appropriate thermal conditions. Intel implements this capability through Intel® Turbo Boost Technology 2.0 and Intel® Turbo Boost Max Technology 3.0; the latter additionally identifies the processor’s fastest cores and directs the boost to them.
As discussed previously, Turbo Boost can dynamically increase the frequency of some cores above the normal maximum, which is why it is sometimes called algorithmic overclocking. This differs from manual CPU overclocking, where the user can fine-tune the overclocking settings to meet some performance requirement (e.g. for gaming purposes). If you are interested in overclocking your CPU, there’s good news: various methods are available, for example through the CPU Core Ratio option in your motherboard’s BIOS.
The Linux kernel supports CPU performance/frequency scaling by means of the CPUFreq (CPU Frequency scaling) subsystem, which consists of three different levels of abstraction: the core, scaling governors, and scaling drivers.
User-space tools such as cpupower allow us to select the scaling driver and governor.
In modern Intel processors, there are at least two drivers available: intel_pstate and acpi_cpufreq. In this post, we focus on the intel_pstate driver since this is the one we have been using. It implements a scaling driver with an internal governor for Intel Core (Sandy Bridge and newer) processors.
The “governors” used in active mode are not generic scaling governors, but their names are the same as the names of some generic governors. You should pay special attention to this because they generally do not work in the same way as the generic governors they share the names with.
The powersave P-state selection algorithm provided by intel_pstate is not a counterpart of the generic powersave governor (roughly, it corresponds to the schedutil and ondemand governors).
There are two P-state selection algorithms provided by intel_pstate in the active mode: powersave and performance. The way they both operate depends on whether or not the hardware-managed P-states (HWP) feature has been enabled in the processor, and possibly on the processor model.
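Whether the processor supports HWP can be checked from the CPU flags the kernel reports in /proc/cpuinfo, where HWP support shows up as the "hwp" flag. A small sketch, demonstrated on a made-up flags line so it runs on any machine:

```shell
# True if stdin contains the standalone "hwp" CPU flag.
has_hwp() { grep -qw hwp; }

# On a real system:
#   has_hwp < /proc/cpuinfo && echo "HWP supported"

# Demo with a sample /proc/cpuinfo flags line:
printf 'flags\t\t: fpu vme est tm2 hwp hwp_notify\n' | has_hwp && echo "HWP supported"
# -> HWP supported
```

Note that `grep -w` matches only the standalone `hwp` token, not related flags such as `hwp_notify`.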
ArchLinux’s Wiki provides a comprehensive overview of the generic CPU governors, which I have referenced below for easier access.
| Governor | Description |
|---|---|
| performance | Run the CPU at the maximum frequency, obtained from /sys/devices/system/cpu/cpuX/cpufreq/scaling_max_freq |
| powersave | Run the CPU at the minimum frequency, obtained from /sys/devices/system/cpu/cpuX/cpufreq/scaling_min_freq |
| userspace | Run the CPU at user-specified frequencies, configurable via /sys/devices/system/cpu/cpuX/cpufreq/scaling_setspeed |
| ondemand | Scales the frequency dynamically according to current load. Jumps to the highest frequency and then possibly backs off as the idle time increases |
| conservative | Scales the frequency dynamically according to current load. Scales the frequency more gradually than ondemand |
| schedutil | Scheduler-driven CPU frequency selection |
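In practice, the standard way to switch governors is `cpupower frequency-set -g <governor>`. The sketch below wraps that command in a hypothetical dry-run helper (the `set_governor` name and its output are illustrative, not part of cpupower) that first checks the requested governor against the available list, which on a real system comes from /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors:

```shell
# Dry-run helper: validate a governor against the available list before
# emitting the cpupower command that would apply it on all CPUs.
set_governor() {  # usage: set_governor GOVERNOR "AVAILABLE_GOVERNORS"
  case " $2 " in
    *" $1 "*) echo "would run: sudo cpupower frequency-set -g $1" ;;
    *) echo "governor '$1' not available" >&2; return 1 ;;
  esac
}

set_governor performance "conservative ondemand performance schedutil"
# -> would run: sudo cpupower frequency-set -g performance
```

Validating first avoids the confusing case where a governor name valid for one driver (e.g. ondemand under acpi_cpufreq) is simply absent under intel_pstate in active mode.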
To enable or disable the intel_pstate driver, you can pass a kernel boot parameter through GRUB:

1. Edit /etc/default/grub and set GRUB_CMDLINE_LINUX="intel_pstate=disable" to turn the driver off (it is enabled by default on supported CPUs). Note that when intel_pstate is disabled, acpi_cpufreq is used instead.
2. Run sudo update-grub on the terminal.
3. Run sudo reboot.

To select the driver's operation mode, the procedure is analogous:

1. Edit /etc/default/grub and set GRUB_CMDLINE_LINUX="intel_pstate=X", where X can be either active or passive.
2. Run sudo update-grub on the terminal.
3. Run sudo reboot.
4. Verify the change by running cat /sys/devices/system/cpu/intel_pstate/status. Alternatively, you can execute sudo cpupower frequency-info: if the driver was set to passive mode, then you should see the line “driver: intel_cpufreq”, otherwise you should see “driver: intel_pstate”.

To change the scaling governor:

1. Run cpupower -c all frequency-info to inspect the current configuration of all CPUs.
2. Configure the desired governor (e.g. via the cpufrequtils service configuration) and apply it with sudo systemctl restart cpufrequtils.
3. Run cpupower -c all frequency-info again to check which governor is now being used.

Finally, to list the governors supported on your system, execute cpupower frequency-info and look for “available cpufreq governors”.
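If you use the cpufrequtils service to persist the governor across reboots, the governor it applies is typically set in its defaults file. A minimal hypothetical example (the path below is the one used by the Debian/Ubuntu cpufrequtils package; it may differ on other distributions):

```shell
# /etc/default/cpufrequtils -- hypothetical example contents.
# The service reads this shell-style file and applies GOVERNOR at startup.
GOVERNOR="performance"
```

After editing it, restarting the service (sudo systemctl restart cpufrequtils) applies the new governor.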
Edit the file /sys/devices/system/cpu/intel_pstate/no_turbo or /sys/devices/system/cpu/cpufreq/intel_pstate/no_turbo, depending on which one is available. There should be a single number written to that file: if it is 1 (see command below), then the driver is not allowed to set any turbo P-states; if it is 0 (the default), then turbo P-states can be set by the driver.
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
Execute watch cat /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq to monitor the current operating frequency of each core and verify that the change took effect.
As it turned out, the large differences in power consumption between the CU and DU computers could be explained by differences in the BIOS configuration of the CPU Core Ratio option. In particular, the DU was overclocked, while the CU was not. Once we re-configured the two computers to use the same settings (overclocking, state of the intel_pstate driver, and governor), we repeated our experiments and were pleased to find that the two computers finally exhibited comparable power consumption profiles.