================
RAPL MSR support
================

The RAPL (Running Average Power Limit) interface advertises the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).

The consumption is reported via MSRs (model-specific registers) such as
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64-bit
registers that report the accumulated energy consumption, which is converted
to microjoules using the energy unit from MSR_RAPL_POWER_UNIT.

Thanks to the MSR filtering patch [#a]_, not all MSRs have to be handled by
KVM: some of them can now be handled by userspace (QEMU). With MSR filtering,
a list of MSRs is given to KVM when the VM is initialized, so that a callback
is put in place for accesses to those MSRs. The design of this feature relies
solely on this mechanism to handle the MSRs between guest and host.

At the moment the following MSRs are involved:

.. code:: C

    #define MSR_RAPL_POWER_UNIT             0x00000606
    #define MSR_PKG_POWER_LIMIT             0x00000610
    #define MSR_PKG_ENERGY_STATUS           0x00000611
    #define MSR_PKG_POWER_INFO              0x00000614

The ``*_POWER_UNIT``, ``*_POWER_LIMIT`` and ``*_POWER_INFO`` MSRs are part of
the RAPL spec. They specify the power limit of the package, provide a range of
parameters (min power, max power, ...) and give the multiplier to apply to the
energy counter. These MSRs are populated once at start-up by reading the host
CPU MSRs and are passed back to the guest 1:1 when requested.

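
As an illustration of the multiplier mentioned above, the sketch below decodes
the unit fields of MSR_RAPL_POWER_UNIT. The bit layout is taken from the Intel
SDM, not from this document, and the ``rapl_units`` structure and
``rapl_decode_units`` helper are hypothetical names used only for this example;
they are not QEMU functions.

.. code:: C

    #include <stdint.h>

    /* Decoded fields of MSR_RAPL_POWER_UNIT (layout assumed from the Intel SDM). */
    struct rapl_units {
        unsigned power_shift;   /* bits 3:0,   power unit  = 1 / 2^n watts   */
        unsigned energy_shift;  /* bits 12:8,  energy unit = 1 / 2^n joules  */
        unsigned time_shift;    /* bits 19:16, time unit   = 1 / 2^n seconds */
    };

    /* Hypothetical helper: split a raw MSR value into its three exponents. */
    static struct rapl_units rapl_decode_units(uint64_t msr)
    {
        struct rapl_units u;

        u.power_shift  = (unsigned)(msr & 0xf);
        u.energy_shift = (unsigned)((msr >> 8) & 0x1f);
        u.time_shift   = (unsigned)((msr >> 16) & 0xf);
        return u;
    }

For a raw value of ``0xA0E03`` this yields an energy exponent of 14, i.e. one
counter increment corresponds to 1/16384 of a joule.
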
MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of energy
consumed since the last time the register was cleared. Scaling it by the
energy unit provided by MSR_RAPL_POWER_UNIT gives the energy (not the power)
in microjoules. The counter only ever increases, at a rate that depends on the
power consumption of the package, and it is expected to overflow (wrap around)
at some point.

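
A minimal sketch of how two readings of the counter could be turned into an
energy delta in microjoules, including the wraparound case. It assumes the
counter occupies the low 32 bits of the MSR (per the Intel SDM) and that at
most one wraparound happens between two readings. Both function names are
hypothetical.

.. code:: C

    #include <stdint.h>

    /*
     * Delta between two raw readings of the 32-bit energy counter.
     * Unsigned subtraction gives the right result across one wraparound.
     */
    static uint32_t rapl_energy_delta(uint32_t before, uint32_t after)
    {
        return (uint32_t)(after - before);
    }

    /*
     * Convert raw energy units to microjoules:
     * raw * 1000000 / 2^energy_shift, truncated to an integer.
     */
    static uint64_t rapl_raw_to_uj(uint64_t raw, unsigned energy_shift)
    {
        return (raw * 1000000ULL) >> energy_shift;
    }

With an energy exponent of 14, a raw delta of 16384 units corresponds to
1000000 microjoules, i.e. one joule.
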
Every core belonging to the same package that reads MSR_PKG_ENERGY_STATUS
(i.e. "rdmsr 0x611") retrieves the same value, which represents the energy for
the whole package. A core that belongs to PKG-0 cannot read the value of
PKG-1, and vice versa.

High level implementation
-------------------------

In order to update the value of the virtual MSR, a QEMU thread is created.
The thread is basically just an infinite loop that does the following:

1. Snapshot the time metrics of all QEMU threads (time spent scheduled in
   userspace and in system mode).

2. Snapshot the actual MSR_PKG_ENERGY_STATUS counter of all packages on which
   the QEMU threads are running.

3. Sleep for 1 second. During this pause the vCPU threads and the other,
   non-vCPU threads do what they have to do, and so the energy counter
   increases.

4. Repeat 1. and 2., then calculate the delta of every metric: the time
   spent scheduled for each QEMU thread *and* the energy spent by the
   packages during the pause.

5. Separate the vCPU threads from the non-vCPU threads.

6. Retrieve the topology of the virtual machine. This helps identify which
   vCPU is running on which virtual package.

7. Divide the total energy spent by the non-vCPU threads by the number of
   vCPU threads, so that each vCPU thread gets an equal share of the energy
   spent by the QEMU workers.

8. Calculate the ratio of energy spent per vCPU thread.

9. Calculate the energy for each virtual package.

10. Update the virtual MSRs for each virtual package. Each vCPU that belongs
    to the same package returns the same value when accessing the MSR.

11. Loop back to 1.

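
Steps 7 to 9 can be sketched as follows. This is a simplified illustration,
not the QEMU implementation: it assumes that all threads ran on a single host
package, that the energy delta is already expressed in microjoules, and that
``max_ticks`` is the number of ticks the package could schedule during the
pause. The ``thread_sample`` structure and ``attribute_energy`` function are
hypothetical names.

.. code:: C

    #include <stddef.h>
    #include <stdint.h>

    struct thread_sample {
        uint64_t delta_ticks;  /* utime + stime ticks accumulated during the pause */
        int is_vcpu;           /* non-zero for a vCPU thread */
        unsigned vpkg;         /* virtual package of the vCPU (unused for workers) */
    };

    /* Split the package energy among the virtual packages. */
    static void attribute_energy(const struct thread_sample *t, size_t n,
                                 uint64_t pkg_energy_uj, uint64_t max_ticks,
                                 uint64_t *vpkg_energy_uj, size_t n_vpkg)
    {
        uint64_t worker_energy = 0;
        uint64_t share;
        size_t n_vcpu = 0;
        size_t i;

        for (i = 0; i < n_vpkg; i++) {
            vpkg_energy_uj[i] = 0;
        }
        if (max_ticks == 0) {
            return;
        }

        /* Step 7: sum the energy of the non-vCPU threads, count the vCPUs. */
        for (i = 0; i < n; i++) {
            if (t[i].is_vcpu) {
                n_vcpu++;
            } else {
                worker_energy += pkg_energy_uj * t[i].delta_ticks / max_ticks;
            }
        }
        if (n_vcpu == 0) {
            return;
        }
        share = worker_energy / n_vcpu;

        /* Steps 8 and 9: per-vCPU energy plus its share, summed per package. */
        for (i = 0; i < n; i++) {
            if (!t[i].is_vcpu || t[i].vpkg >= n_vpkg) {
                continue;
            }
            vpkg_energy_uj[t[i].vpkg] +=
                pkg_energy_uj * t[i].delta_ticks / max_ticks + share;
        }
    }

The equal split of the worker energy follows step 7 above; a real
implementation also has to handle threads spread over several host packages.
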
Ratio calculation
-----------------

In Linux, a process has an execution time associated with it. The scheduler
divides time into clock ticks. The number of clock ticks per second can be
obtained with ``sysconf(_SC_CLK_TCK)``. A typical value is 100 clock ticks per
second, so a core can run a process for at most 100 ticks per second. If a
package has 4 cores, at most 400 ticks can be scheduled on all the cores of
the package over a period of 1 second.

``/proc/[pid]/stat`` [#b]_ is a procfs file that gives the execution time of
the process whose ID is [pid]. It reports the number of ticks the process has
been scheduled in userspace (utime) and in kernel space (stime).

By reading those metrics for a thread, one can calculate the ratio of time the
package has spent executing the thread.

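
The following sketch shows one way to extract utime and stime from a line in
the ``/proc/[pid]/stat`` format. It relies on the field order documented in
proc(5), where utime and stime are fields 14 and 15, and it searches for the
last ``)`` because the command name in field 2 may itself contain spaces or
parentheses. The ``parse_stat_ticks`` name is hypothetical.

.. code:: C

    #include <stdio.h>
    #include <string.h>

    /*
     * Parse utime and stime (in clock ticks) from a stat line.
     * Returns 0 on success, -1 if the line cannot be parsed.
     */
    static int parse_stat_ticks(const char *line,
                                unsigned long long *utime,
                                unsigned long long *stime)
    {
        /* Skip "pid (comm)": everything up to the last closing parenthesis. */
        const char *p = strrchr(line, ')');
        int n;

        if (p == NULL) {
            return -1;
        }

        /* Skip fields 3 to 13 (state ... cmajflt), then read fields 14 and 15. */
        n = sscanf(p + 1,
                   " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %llu %llu",
                   utime, stime);
        return n == 2 ? 0 : -1;
    }

Per-thread values use the same format and are available in
``/proc/[pid]/task/[tid]/stat``.
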
Example:

A 4-core package can schedule a maximum of 400 ticks per second, with 100
ticks per second per core. If a thread was scheduled for 100 ticks during one
second on this package, the thread has used 1/4 of the package's capacity.
The energy attributed to the thread on this package for that second is
therefore 1/4 of the total energy spent by the package.

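
The arithmetic of this example can be written as a small function. This is
only a sketch of the ratio described above; ``thread_energy_uj`` and its
parameters are hypothetical names, and the energy is assumed to be already
expressed in microjoules.

.. code:: C

    #include <stdint.h>

    /*
     * Energy attributed to one thread over an interval:
     * pkg_energy * thread_ticks / (cores * ticks_per_sec * seconds).
     */
    static uint64_t thread_energy_uj(uint64_t pkg_energy_uj,
                                     uint64_t thread_ticks,
                                     unsigned cores,
                                     long ticks_per_sec,
                                     unsigned seconds)
    {
        uint64_t max_ticks;

        if (ticks_per_sec <= 0) {
            return 0;
        }
        max_ticks = (uint64_t)cores * (uint64_t)ticks_per_sec * seconds;
        if (max_ticks == 0) {
            return 0;
        }
        return pkg_energy_uj * thread_ticks / max_ticks;
    }

With 4 cores, 100 ticks per second and a 1 second interval, a thread that
accumulated 100 ticks is attributed a quarter of the package energy. The
value of ``ticks_per_sec`` would come from ``sysconf(_SC_CLK_TCK)``.
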
Usage
-----

Currently this feature only works on an Intel CPU that has the RAPL driver
loaded and available in sysfs. If not, QEMU fails at start-up.

This feature is activated with
``-accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock``

It is important that the socket path is the same as the one
:program:`qemu-vmsr-helper` is listening to.

qemu-vmsr-helper
----------------

The qemu-vmsr-helper works very much like the qemu-pr-helper. Instead of
making persistent reservations, qemu-vmsr-helper is there to work around the
fix for CVE-2020-8694, which removed user access to the RAPL MSR attributes.

A socket connection is established between the QEMU processes that have RAPL
MSR support activated and the qemu-vmsr-helper. A systemd service and socket
activation are provided in contrib/systemd/qemu-vmsr-helper.(service/socket).

The systemd socket uses mode 600, like contrib/systemd/qemu-pr-helper.socket.
The socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be
changed (e.g. 660 and root:kvm on a Debian system). Libvirt could also start a
separate helper if needed. All in all, the policy is left to the user.

See the qemu-pr-helper documentation or manpage for further details.

Current Limitations
-------------------

- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
  moment.

References
----------

.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/
.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html