qemu/hw
David Hildenbrand 177f9b1ee4 virtio-mem: Expose device memory dynamically via multiple memslots if enabled
Having large virtio-mem devices that only expose little memory to a VM
is currently a problem: we map the whole sparse memory region into the
guest using a single memslot, resulting in one gigantic memslot in KVM.
KVM allocates metadata for the whole memslot, which can result in quite
some memory waste.

Assuming we have a 1 TiB virtio-mem device and only expose little (e.g.,
1 GiB) memory, we would create a single 1 TiB memslot and KVM has to
allocate metadata for that 1 TiB memslot: on x86, this implies allocating
a significant amount of memory for metadata:

(1) RMAP: 8 bytes per 4 KiB, 8 bytes per 2 MiB, 8 bytes per 1 GiB
    -> For 1 TiB: 2147483648 + 4194304 + 8192 = ~ 2 GiB (0.2 %)

    With the TDP MMU (cat /sys/module/kvm/parameters/tdp_mmu) this gets
    allocated lazily when required for nested VMs
(2) gfn_track: 2 bytes per 4 KiB
    -> For 1 TiB: 536870912 = ~512 MiB (0.05 %)
(3) lpage_info: 4 bytes per 2 MiB, 4 bytes per 1 GiB
    -> For 1 TiB: 2097152 + 4096 = ~2 MiB (0.0002 %)
(4) 2x dirty bitmaps for tracking: 2x 1 bit per 4 KiB page
    -> For 1 TiB: 536870912 = 64 MiB (0.006 %)

So we primarily care about (1) and (2). The bad thing is, that the
memory consumption *doubles* once SMM is enabled, because we create the
memslot once for !SMM and once for SMM.

Having a 1 TiB memslot without the TDP MMU consumes around:
* With SMM: 5 GiB
* Without SMM: 2.5 GiB
Having a 1 TiB memslot with the TDP MMU consumes around:
* With SMM: 1 GiB
* Without SMM: 512 MiB

... and that's really something we want to optimize, to be able to just
start a VM with small boot memory (e.g., 4 GiB) and a virtio-mem device
that can grow very large (e.g., 1 TiB).

Consequently, using multiple memslots and only mapping the memslots we
really need can significantly reduce memory waste and speed up
memslot-related operations. Let's expose the sparse RAM memory region using
multiple memslots, mapping only the memslots we currently need into our
device memory region container.

The feature can be enabled using "dynamic-memslots=on" and requires
"unplugged-inaccessible=on", which is nowadays the default.

Once enabled, we'll auto-detect the number of memslots to use based on the
memslot limit provided by the core. We'll use at most 1 memslot per
gigabyte. Note that our global limit of memslots accross all memory devices
is currently set to 256: even with multiple large virtio-mem devices,
we'd still have a sane limit on the number of memslots used.

The default is to not dynamically map memslot for now
("dynamic-memslots=off"). The optimization must be enabled manually,
because some vhost setups (e.g., hotplug of vhost-user devices) might be
problematic until we support more memslots especially in vhost-user backends.

Note that "dynamic-memslots=on" is just a hint that multiple memslots
*may* be used for internal optimizations, not that multiple memslots
*must* be used. The actual number of memslots that are used is an
internal detail: for example, once memslot metadata is no longer an
issue, we could simply stop optimizing for that. Migration source and
destination can differ on the setting of "dynamic-memslots".

Message-ID: <20230926185738.277351-17-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:22 +02:00
..
9pfs fsdev: Use ThrottleDirection instread of bool is_write 2023-08-29 10:49:24 +02:00
acpi virtio,pci: features, cleanups 2023-10-05 09:01:01 -04:00
adc meson: Replace softmmu_ss -> system_ss 2023-06-20 10:01:30 +02:00
alpha hw/alpha: Use MachineClass->default_nic in the alpha machine 2023-05-26 09:10:49 +02:00
arm * fix from optionrom build 2023-10-03 07:43:44 -04:00
audio hw/audio/es1370: trace lost interrupts 2023-10-10 12:31:05 +00:00
avr
block swim: update IWM/ISM register block decoding 2023-10-06 10:33:43 +02:00
char hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
core cpu: Correct invalid mentions of 'softmmu' by 'system-mode' 2023-10-07 19:02:33 +02:00
cpu hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
cris
cxl hw/cxl: Fix local variable shadowing of cap_hdrs 2023-10-06 10:56:54 +02:00
display virtio,pci: features, cleanups 2023-10-05 09:01:01 -04:00
dma hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
gpio hw/gpio/nrf51: implement DETECT signal 2023-08-22 17:30:59 +01:00
hppa target/hppa: Report and clear BTLBs via fw_cfg at startup 2023-09-15 17:34:38 +02:00
hyperv win32: replace closesocket() with close() wrapper 2023-03-13 15:39:31 +04:00
i2c aspeed/i2c: Clean up local variable shadowing 2023-09-29 10:07:19 +02:00
i386 hw/i386: changes towards enabling -Wshadow=local for x86 machines 2023-10-06 10:56:54 +02:00
ide hw/ide/ahci: Clean up local variable shadowing 2023-10-06 13:16:57 +02:00
input audio: propagate Error * out of audio_init 2023-10-03 10:29:40 +02:00
intc accel/tcg: Replace CPUState.env_ptr with cpu_env() 2023-10-04 11:03:54 -07:00
ipack meson: Replace softmmu_ss -> system_ss 2023-06-20 10:01:30 +02:00
ipmi hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
isa hw/acpi/acpi_dev_interface: Remove now unused madt_cpu virtual method 2023-10-04 18:15:05 -04:00
loongarch target/loongarch: Clean up local variable shadowing 2023-10-06 10:56:54 +02:00
m68k q800: add alias for MacOS toolbox ROM at 0x40000000 2023-10-06 10:33:43 +02:00
mem memory-device,vhost: Support automatic decision on the number of memslots 2023-10-12 14:15:22 +02:00
microblaze hw/microblaze: Clean up local variable shadowing 2023-09-29 10:07:16 +02:00
mips vt82c686 machines: Support machine-default audiodev with fallback 2023-10-03 10:29:40 +02:00
misc mac_via: extend timer calibration hack to work with A/UX 2023-10-06 10:33:43 +02:00
net hw/net/vhost_net: Silence compiler warning when compiling with -Wshadow 2023-10-06 10:56:54 +02:00
nios2 hw/nios2: Clean up local variable shadowing 2023-09-29 10:07:16 +02:00
nubus trace-events: Fix the name of the tracing.rst file 2023-09-08 13:08:51 +03:00
nvme hw/nvme: Clean up local variable shadowing in nvme_ns_init() 2023-09-29 10:07:20 +02:00
nvram crypto: only include tls-cipher-suites in emulators 2023-10-03 10:29:39 +02:00
openrisc *: Add missing includes of qemu/error-report.h 2023-03-22 15:06:57 +00:00
pci pcie_sriov: unregister_vfs(): fix error path 2023-10-04 18:15:06 -04:00
pci-bridge hw/pci-bridge/cxl-upstream: Add serial number extended capability support 2023-10-04 18:15:06 -04:00
pci-host pc: remove short_root_bus property 2023-09-29 09:33:10 +02:00
pcmcia meson: Replace softmmu_ss -> system_ss 2023-06-20 10:01:30 +02:00
ppc accel/tcg: Replace CPUState.env_ptr with cpu_env() 2023-10-04 11:03:54 -07:00
rdma meson: Replace softmmu_ss -> system_ss 2023-06-20 10:01:30 +02:00
remote exec/memory: Add symbol for memory listener priority for device backend 2023-06-28 14:27:59 +02:00
riscv hw/riscv: opentitan: Fixup local variables shadowing 2023-09-29 10:07:20 +02:00
rtc hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
rx hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
s390x s390x: do a subsystem reset before the unprotect on reboot 2023-09-12 11:13:33 +02:00
scsi virtio,pci: features, cleanups 2023-10-05 09:01:01 -04:00
sd aspeed queue: 2023-09-06 11:14:55 -04:00
sensor hw/i2c: spelling fixes 2023-08-31 19:47:43 +02:00
sh4 hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
smbios hw/acpi: changes towards enabling -Wshadow=local 2023-09-29 10:07:18 +02:00
sparc other architectures: spelling fixes 2023-07-25 17:14:07 +03:00
sparc64 hw/pci/pci: Remove multifunction parameter from pci_new_multifunction() 2023-07-10 18:59:32 -04:00
ssi hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
timer aspeed/timer: Clean up local variable shadowing 2023-09-29 10:07:19 +02:00
tpm hw/tpm: spelling fixes 2023-09-20 07:54:34 +03:00
tricore hw/tricore: Log failing test in testdevice 2023-09-29 08:28:02 +02:00
ufs hw/ufs: Support for UFS logical unit 2023-09-07 14:01:29 -04:00
usb hw/usb: Silence compiler warnings in USB code when compiling with -Wshadow 2023-10-06 13:27:48 +02:00
vfio vfio/pci: enable MSI-X in interrupt restoring on dynamic allocation 2023-10-05 22:04:51 +02:00
virtio virtio-mem: Expose device memory dynamically via multiple memslots if enabled 2023-10-12 14:15:22 +02:00
watchdog meson: Replace softmmu_ss -> system_ss 2023-06-20 10:01:30 +02:00
xen xen: spelling fix 2023-09-08 13:08:52 +03:00
xenpv hw/xenpv: Initialize Xen backend operations 2023-03-24 14:52:14 +00:00
xtensa trivial: Simplify the spots that use TARGET_BIG_ENDIAN as a numeric value 2023-09-08 13:08:52 +03:00
Kconfig hw/ufs: Initial commit for emulated Universal-Flash-Storage 2023-09-07 14:01:29 -04:00
meson.build hw/ufs: Initial commit for emulated Universal-Flash-Storage 2023-09-07 14:01:29 -04:00