Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) base

This driver is like virtio-balloon on steroids: it allows both changing the
guest memory allocation via ballooning and (in the next patch) inserting
pieces of extra RAM into it on demand from a provided memory backend.

The actual resizing is done via the ballooning interface (for example, via
the "balloon" HMP command).
This includes resizing the guest past its boot size - that is, hot-adding
additional memory at a granularity limited only by the guest alignment
requirements, as provided by the next patch.

In contrast with ACPI DIMM hotplug, where one can only request to unplug a
whole DIMM stick, this driver allows removing memory from the guest in
single-page (4k) units via ballooning.

After a VM reboot the guest is back to its original (boot) size.

In the future, the guest boot memory size might be changed on reboot
instead, taking into account the effective size that the VM had before that
reboot (much like Hyper-V does).

For performance reasons, the guest-released memory is tracked in a few
range trees, as a series of (start, count) ranges.
Each time a new page range is inserted into such a tree, its neighbors are
checked as candidates for possible merging with it.

Besides performance reasons, the Dynamic Memory protocol itself uses page
ranges as the data structure in its messages, so relevant pages need to be
merged into such ranges anyway.
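
The merge-on-insert idea described above can be illustrated with a minimal
sketch. It assumes a small sorted array instead of the GTree the driver
actually uses, and every name in it (Range, RangeSet, range_set_insert,
MAX_RANGES) is hypothetical and chosen only for this illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_RANGES 64 /* illustrative fixed capacity, not a driver limit */

typedef struct {
    uint64_t start;
    uint64_t count;
} Range;

/* Ranges are kept sorted by start, disjoint and non-adjacent. */
typedef struct {
    Range r[MAX_RANGES];
    size_t n;
} RangeSet;

/*
 * Insert [start, start + count) and merge it with every stored range that
 * overlaps it or touches it. Returns how many of the inserted pages were
 * already present in the set.
 */
static uint64_t range_set_insert(RangeSet *s, uint64_t start, uint64_t count)
{
    uint64_t end, mstart, mend, dup = 0;
    size_t i = 0, j;

    if (count == 0) {
        return 0;
    }
    assert(start <= UINT64_MAX - count); /* caller must rule out overflow */
    end = start + count;
    mstart = start;
    mend = end;

    /* skip ranges that end strictly before the new range begins */
    while (i < s->n && s->r[i].start + s->r[i].count < start) {
        i++;
    }

    /* absorb ranges that overlap or touch the (growing) merged range */
    for (j = i; j < s->n && s->r[j].start <= mend; j++) {
        uint64_t rend = s->r[j].start + s->r[j].count;
        uint64_t lo = s->r[j].start > start ? s->r[j].start : start;
        uint64_t hi = rend < end ? rend : end;

        if (hi > lo) {
            dup += hi - lo; /* pages reported twice */
        }
        if (s->r[j].start < mstart) {
            mstart = s->r[j].start;
        }
        if (rend > mend) {
            mend = rend;
        }
    }

    /* replace the absorbed ranges [i, j) with the single merged range */
    assert(j > i || s->n < MAX_RANGES);
    memmove(&s->r[i + 1], &s->r[j], (s->n - j) * sizeof(Range));
    s->n = s->n - (j - i) + 1;
    s->r[i].start = mstart;
    s->r[i].count = mend - mstart;

    return dup;
}
```

With this sketch, inserting (10, 5), (20, 5) and then (15, 5) leaves a single
(10, 15) range, which is the shape the protocol messages need. A balanced tree
avoids the linear scan and the memmove() that the array version pays for.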

One has to be careful when tracking the guest-released pages, since the
guest can maliciously report returning pages outside its current address
space, which could later clash with the address range of newly added memory.
Similarly, the guest can report freeing the same page twice.
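
As a rough illustration of the checks this implies, the sketch below assumes
that guest RAM is a single span of page numbers [0, ram_pages). That is a
simplification made only for this example, and all function names are
hypothetical. The overflow test mirrors the SUM_OVERFLOW_U64() macro added by
this patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* true if start + count does not fit in a uint64_t */
static bool range_overflows(uint64_t start, uint64_t count)
{
    return start > UINT64_MAX - count;
}

/*
 * Accept a guest-reported range only if it is non-empty, does not wrap
 * around and lies entirely inside [0, ram_pages).
 */
static bool guest_range_acceptable(uint64_t start, uint64_t count,
                                   uint64_t ram_pages)
{
    if (count == 0 || range_overflows(start, count)) {
        return false;
    }
    return start + count <= ram_pages;
}

/*
 * Number of pages shared by two half-open ranges (neither may overflow).
 * A non-zero result for a newly reported range against an already tracked
 * one means the guest reported some page twice.
 */
static uint64_t range_overlap(uint64_t s1, uint64_t c1,
                              uint64_t s2, uint64_t c2)
{
    uint64_t lo = s1 > s2 ? s1 : s2;
    uint64_t e1 = s1 + c1, e2 = s2 + c2;
    uint64_t hi = e1 < e2 ? e1 : e2;

    return hi > lo ? hi - lo : 0;
}
```

Doing the bounds check before the range reaches the tracking structure keeps
untrusted values out of the later arithmetic.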

The above design results in much better ballooning performance than when
using virtio-balloon with the same guest: 230 GB / minute with this driver
versus 70 GB / minute with virtio-balloon.

During a ballooning operation most of the time is spent waiting for the
guest to come up with newly freed page ranges; processing the received
ranges on the host side (in QEMU and KVM) is nearly instantaneous.

The unballoon operation is also nearly instantaneous:
thanks to the merging of the ballooned-out page ranges, 200 GB of memory can
be returned to the guest in about 1 second.
With virtio-balloon this operation takes about 2.5 minutes.

These tests were done against a Windows Server 2019 guest running on a
Xeon E5-2699, after dirtying the whole memory inside the guest before each
balloon operation.

Using a range tree instead of a bitmap to track the removed memory also
means that the solution scales well with the guest size: even a 1 TB range
takes just a few bytes of such metadata.
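
The scaling argument can be checked with simple arithmetic. The sketch below
assumes 4 KiB pages (matching HV_BALLOON_PFN_SHIFT in this patch) and reads
"1 TB" as 2^40 bytes. The helper names are hypothetical, and the per-range
figure ignores the tree node and key overhead of the real GTree:

```c
#include <stdint.h>

#define PFN_SHIFT 12 /* 4 KiB pages */

typedef struct {
    uint64_t start;
    uint64_t count;
} Range;

/* bytes needed by a one-bit-per-page bitmap covering mem_bytes of RAM */
static uint64_t bitmap_bytes(uint64_t mem_bytes)
{
    uint64_t pages = mem_bytes >> PFN_SHIFT;

    return (pages + 7) / 8;
}

/* bytes needed to describe nranges contiguous ranges */
static uint64_t range_bytes(uint64_t nranges)
{
    return nranges * sizeof(Range);
}
```

Under these assumptions a bitmap for 1 TB takes 32 MiB regardless of how the
removed pages are laid out, while one contiguous 1 TB range takes 16 bytes.
The range representation only grows with fragmentation, not with guest size.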

Since the required GTree operations aren't present in every GLib version,
a check for them was added to the meson build script, together with new
"--enable-hv-balloon" and "--disable-hv-balloon" configure arguments.
If these GTree operations are missing in the system's GLib version, this
driver will be skipped during the QEMU build.

An optional "status-report=on" device parameter requests memory status
events from the guest (typically sent every second), which allow the host
to learn both the guest memory available and the guest memory in use
counts.

Subsequent commits will add support for their external emission as
"HV_BALLOON_STATUS_REPORT" QMP events.

The driver is named hv-balloon since the Linux kernel client driver for
the Dynamic Memory Protocol is named as such and to follow the naming
pattern established by the virtio-balloon driver.
The whole protocol runs over Hyper-V VMBus.

The driver was tested against Windows Server 2012 R2, Windows Server 2016
and Windows Server 2019 guests and obeys the guest alignment requirements
reported to the host via the DM_CAPABILITIES_REPORT message.

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Maciej S. Szmigiero 2023-06-12 16:00:54 +02:00
parent 4f80cd2f03
commit 0d9e8c0b67
12 changed files with 1616 additions and 1 deletions

@ -46,3 +46,6 @@ config FUZZ
config VFIO_USER_SERVER_ALLOWED
    bool
    imply VFIO_USER_SERVER

config HV_BALLOON_POSSIBLE
    bool

@ -16,3 +16,13 @@ config SYNDBG
    bool
    default y
    depends on VMBUS

config HV_BALLOON_SUPPORTED
    bool

config HV_BALLOON
    bool
    default y
    depends on VMBUS
    depends on HV_BALLOON_POSSIBLE
    depends on HV_BALLOON_SUPPORTED

@ -0,0 +1,33 @@
/*
 * QEMU Hyper-V Dynamic Memory Protocol driver
 *
 * Copyright (C) 2020-2023 Oracle and/or its affiliates.
 *
 * This work is licensed under the terms of the GNU GPL, version 2 or later.
 * See the COPYING file in the top-level directory.
 */

#ifndef HW_HYPERV_HV_BALLOON_INTERNAL_H
#define HW_HYPERV_HV_BALLOON_INTERNAL_H

#include "qemu/osdep.h"

#define HV_BALLOON_PFN_SHIFT 12
#define HV_BALLOON_PAGE_SIZE (1 << HV_BALLOON_PFN_SHIFT)

#define SUM_OVERFLOW_U64(in1, in2) ((in1) > UINT64_MAX - (in2))
#define SUM_SATURATE_U64(in1, in2)              \
    ({                                          \
        uint64_t _in1 = (in1), _in2 = (in2);    \
        uint64_t _result;                       \
                                                \
        if (!SUM_OVERFLOW_U64(_in1, _in2)) {    \
            _result = _in1 + _in2;              \
        } else {                                \
            _result = UINT64_MAX;               \
        }                                       \
                                                \
        _result;                                \
    })

#endif

@ -0,0 +1,228 @@
/*
 * QEMU Hyper-V Dynamic Memory Protocol driver
 *
 * Copyright (C) 2020-2023 Oracle and/or its affiliates.
 *
 * This work is licensed under the terms of the GNU GPL, version 2 or later.
 * See the COPYING file in the top-level directory.
 */

#include "hv-balloon-internal.h"
#include "hv-balloon-page_range_tree.h"

/*
 * temporarily avoid warnings about enhanced GTree API usage requiring a
 * too recent Glib version until GLIB_VERSION_MAX_ALLOWED finally reaches
 * the Glib version with this API
 */
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"

/* PageRangeTree */
static gint page_range_tree_key_compare(gconstpointer leftp,
                                        gconstpointer rightp,
                                        gpointer user_data)
{
    const uint64_t *left = leftp, *right = rightp;

    if (*left < *right) {
        return -1;
    } else if (*left > *right) {
        return 1;
    } else { /* *left == *right */
        return 0;
    }
}

static GTreeNode *page_range_tree_insert_new(PageRangeTree tree,
                                             uint64_t start, uint64_t count)
{
    uint64_t *key = g_malloc(sizeof(*key));
    PageRange *range = g_malloc(sizeof(*range));

    assert(count > 0);

    *key = range->start = start;
    range->count = count;

    return g_tree_insert_node(tree.t, key, range);
}

void hvb_page_range_tree_insert(PageRangeTree tree,
                                uint64_t start, uint64_t count,
                                uint64_t *dupcount)
{
    GTreeNode *node;
    bool joinable;
    uint64_t intersection;
    PageRange *range;

    assert(!SUM_OVERFLOW_U64(start, count));
    if (count == 0) {
        return;
    }

    node = g_tree_upper_bound(tree.t, &start);
    if (node) {
        node = g_tree_node_previous(node);
    } else {
        node = g_tree_node_last(tree.t);
    }

    if (node) {
        range = g_tree_node_value(node);
        assert(range);
        intersection = page_range_intersection_size(range, start, count);
        joinable = page_range_joinable_right(range, start, count);
    }

    if (!node ||
        (!intersection && !joinable)) {
        /*
         * !node case: the tree is empty or the very first node in the tree
         * already has a higher key (the start of its range).
         * the other case: there is a gap in the tree between the new range
         * and the previous one.
         * anyway, let's just insert the new range into the tree.
         */
        node = page_range_tree_insert_new(tree, start, count);
        assert(node);
        range = g_tree_node_value(node);
        assert(range);
    } else {
        /*
         * the previous range in the tree either partially covers the new
         * range or ends just at its beginning - extend it
         */
        if (dupcount) {
            *dupcount += intersection;
        }

        count += start - range->start;
        range->count = MAX(range->count, count);
    }

    /* check next nodes for possible merging */
    for (node = g_tree_node_next(node); node; ) {
        PageRange *rangecur;

        rangecur = g_tree_node_value(node);
        assert(rangecur);

        intersection = page_range_intersection_size(rangecur,
                                                    range->start, range->count);
        joinable = page_range_joinable_left(rangecur,
                                            range->start, range->count);
        if (!intersection && !joinable) {
            /* the current node is disjoint */
            break;
        }

        if (dupcount) {
            *dupcount += intersection;
        }

        count = rangecur->count + (rangecur->start - range->start);
        range->count = MAX(range->count, count);

        /* the current node was merged in, remove it */
        start = rangecur->start;
        node = g_tree_node_next(node);
        /* no hinted removal in GTree... */
        g_tree_remove(tree.t, &start);
    }
}

bool hvb_page_range_tree_pop(PageRangeTree tree, PageRange *out,
                             uint64_t maxcount)
{
    GTreeNode *node;
    PageRange *range;

    node = g_tree_node_last(tree.t);
    if (!node) {
        return false;
    }

    range = g_tree_node_value(node);
    assert(range);

    out->start = range->start;

    /* can't modify range->start as it is the node key */
    if (range->count > maxcount) {
        out->start += range->count - maxcount;
        out->count = maxcount;
        range->count -= maxcount;
    } else {
        out->count = range->count;
        /* no hinted removal in GTree... */
        g_tree_remove(tree.t, &out->start);
    }

    return true;
}

bool hvb_page_range_tree_intree_any(PageRangeTree tree,
                                    uint64_t start, uint64_t count)
{
    GTreeNode *node;

    if (count == 0) {
        return false;
    }

    /* find the first node that can possibly intersect our range */
    node = g_tree_upper_bound(tree.t, &start);
    if (node) {
        /*
         * a NULL node below means that the very first node in the tree
         * already has a higher key (the start of its range).
         */
        node = g_tree_node_previous(node);
    } else {
        /* a NULL node below means that the tree is empty */
        node = g_tree_node_last(tree.t);
    }
    /* node range start <= range start */

    if (!node) {
        /* node range start > range start */
        node = g_tree_node_first(tree.t);
    }

    for ( ; node; node = g_tree_node_next(node)) {
        PageRange *range = g_tree_node_value(node);

        assert(range);
        /*
         * if this node starts beyond or at the end of our range so does
         * every next one
         */
        if (range->start >= start + count) {
            break;
        }

        if (page_range_intersection_size(range, start, count) > 0) {
            return true;
        }
    }

    return false;
}

void hvb_page_range_tree_init(PageRangeTree *tree)
{
    tree->t = g_tree_new_full(page_range_tree_key_compare, NULL,
                              g_free, g_free);
}

void hvb_page_range_tree_destroy(PageRangeTree *tree)
{
    /* g_tree_destroy() is not NULL-safe */
    if (!tree->t) {
        return;
    }

    g_tree_destroy(tree->t);
    tree->t = NULL;
}

@ -0,0 +1,118 @@
/*
 * QEMU Hyper-V Dynamic Memory Protocol driver
 *
 * Copyright (C) 2020-2023 Oracle and/or its affiliates.
 *
 * This work is licensed under the terms of the GNU GPL, version 2 or later.
 * See the COPYING file in the top-level directory.
 */

#ifndef HW_HYPERV_HV_BALLOON_PAGE_RANGE_TREE_H
#define HW_HYPERV_HV_BALLOON_PAGE_RANGE_TREE_H

#include "qemu/osdep.h"

/* PageRange */
typedef struct PageRange {
    uint64_t start;
    uint64_t count;
} PageRange;

/* return just the part of range before (start) */
static inline void page_range_part_before(const PageRange *range,
                                          uint64_t start, PageRange *out)
{
    uint64_t endr = range->start + range->count;
    uint64_t end = MIN(endr, start);

    out->start = range->start;
    if (end > out->start) {
        out->count = end - out->start;
    } else {
        out->count = 0;
    }
}

/* return just the part of range after (start, count) */
static inline void page_range_part_after(const PageRange *range,
                                         uint64_t start, uint64_t count,
                                         PageRange *out)
{
    uint64_t end = range->start + range->count;
    uint64_t ends = start + count;

    out->start = MAX(range->start, ends);
    if (end > out->start) {
        out->count = end - out->start;
    } else {
        out->count = 0;
    }
}

static inline void page_range_intersect(const PageRange *range,
                                        uint64_t start, uint64_t count,
                                        PageRange *out)
{
    uint64_t end1 = range->start + range->count;
    uint64_t end2 = start + count;
    uint64_t end = MIN(end1, end2);

    out->start = MAX(range->start, start);
    out->count = out->start < end ? end - out->start : 0;
}

static inline uint64_t page_range_intersection_size(const PageRange *range,
                                                    uint64_t start, uint64_t count)
{
    PageRange trange;

    page_range_intersect(range, start, count, &trange);
    return trange.count;
}

static inline bool page_range_joinable_left(const PageRange *range,
                                            uint64_t start, uint64_t count)
{
    return start + count == range->start;
}

static inline bool page_range_joinable_right(const PageRange *range,
                                             uint64_t start, uint64_t count)
{
    return range->start + range->count == start;
}

static inline bool page_range_joinable(const PageRange *range,
                                       uint64_t start, uint64_t count)
{
    return page_range_joinable_left(range, start, count) ||
        page_range_joinable_right(range, start, count);
}

/* PageRangeTree */
/* type safety */
typedef struct PageRangeTree {
    GTree *t;
} PageRangeTree;

static inline bool page_range_tree_is_empty(PageRangeTree tree)
{
    guint nnodes = g_tree_nnodes(tree.t);

    return nnodes == 0;
}

void hvb_page_range_tree_init(PageRangeTree *tree);
void hvb_page_range_tree_destroy(PageRangeTree *tree);

bool hvb_page_range_tree_intree_any(PageRangeTree tree,
                                    uint64_t start, uint64_t count);

bool hvb_page_range_tree_pop(PageRangeTree tree, PageRange *out,
                             uint64_t maxcount);

void hvb_page_range_tree_insert(PageRangeTree tree,
                                uint64_t start, uint64_t count,
                                uint64_t *dupcount);

#endif

hw/hyperv/hv-balloon.c: new file, 1160 lines added

File diff suppressed because it is too large.

@ -2,3 +2,4 @@ specific_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'))
specific_ss.add(when: 'CONFIG_HYPERV_TESTDEV', if_true: files('hyperv_testdev.c'))
specific_ss.add(when: 'CONFIG_VMBUS', if_true: files('vmbus.c'))
specific_ss.add(when: 'CONFIG_SYNDBG', if_true: files('syndbg.c'))
specific_ss.add(when: 'CONFIG_HV_BALLOON', if_true: files('hv-balloon.c', 'hv-balloon-page_range_tree.c'))

@ -16,3 +16,16 @@ vmbus_gpadl_torndown(uint32_t gpadl_id) "gpadl #%d"
vmbus_open_channel(uint32_t chan_id, uint32_t gpadl_id, uint32_t target_vp) "channel #%d gpadl #%d target vp %d"
vmbus_channel_open(uint32_t chan_id, uint32_t status) "channel #%d status %d"
vmbus_close_channel(uint32_t chan_id) "channel #%d"
# hv-balloon
hv_balloon_state_change(const char *tostr) "-> %s"
hv_balloon_incoming_version(uint16_t major, uint16_t minor) "incoming proto version %u.%u"
hv_balloon_incoming_caps(uint32_t caps) "incoming caps 0x%x"
hv_balloon_outgoing_unballoon(uint32_t trans_id, uint64_t count, uint64_t start, uint64_t rempages) "posting unballoon %"PRIu32" for %"PRIu64" @ 0x%"PRIx64", remaining %"PRIu64
hv_balloon_incoming_unballoon(uint32_t trans_id) "incoming unballoon response %"PRIu32
hv_balloon_outgoing_balloon(uint32_t trans_id, uint64_t count, uint64_t rempages) "posting balloon %"PRIu32" for %"PRIu64", remaining %"PRIu64
hv_balloon_incoming_balloon(uint32_t trans_id, uint32_t range_count, uint32_t more_pages) "incoming balloon response %"PRIu32", ranges %"PRIu32", more %"PRIu32
hv_balloon_remove_response(uint64_t count, uint64_t start, unsigned int both) "processing remove response range %"PRIu64" @ 0x%"PRIx64", both %u"
hv_balloon_remove_response_hole(uint64_t counthole, uint64_t starthole, uint64_t countrange, uint64_t startrange, uint64_t starthpr, unsigned int both) "response range hole %"PRIu64" @ 0x%"PRIx64" from range %"PRIu64" @ 0x%"PRIx64", before our start 0x%"PRIx64", both %u"
hv_balloon_remove_response_common(uint64_t countcommon, uint64_t startcommon, uint64_t countrange, uint64_t startrange, uint64_t counthpr, uint64_t starthpr, uint64_t removed, unsigned int both) "response common range %"PRIu64" @ 0x%"PRIx64" from range %"PRIu64" @ 0x%"PRIx64" with our %"PRIu64" @ 0x%"PRIx64", removed %"PRIu64", both %u"
hv_balloon_remove_response_remainder(uint64_t count, uint64_t start, unsigned int both) "remove response remaining range %"PRIu64" @ 0x%"PRIx64", both %u"

@ -0,0 +1,18 @@
/*
 * QEMU Hyper-V Dynamic Memory Protocol driver
 *
 * Copyright (C) 2020-2023 Oracle and/or its affiliates.
 *
 * This work is licensed under the terms of the GNU GPL, version 2 or later.
 * See the COPYING file in the top-level directory.
 */

#ifndef HW_HV_BALLOON_H
#define HW_HV_BALLOON_H

#include "qom/object.h"

#define TYPE_HV_BALLOON "hv-balloon"
OBJECT_DECLARE_SIMPLE_TYPE(HvBalloon, HV_BALLOON)

#endif

@ -1323,6 +1323,30 @@ if not get_option('glusterfs').auto() or have_block
  endif
endif

hv_balloon = false
if get_option('hv_balloon').allowed() and have_system
  if cc.links('''
    #include <string.h>
    #include <gmodule.h>
    int main(void) {
        GTree *tree;

        tree = g_tree_new((GCompareFunc)strcmp);
        (void)g_tree_node_first(tree);
        g_tree_destroy(tree);
        return 0;
    }
  ''', dependencies: glib)
    hv_balloon = true
  else
    if get_option('hv_balloon').enabled()
      error('could not enable hv-balloon, update your glib')
    else
      warning('could not find glib support for hv-balloon, disabling')
    endif
  endif
endif

libssh = not_found
if not get_option('libssh').auto() or have_block
  libssh = dependency('libssh', version: '>=0.8.7',
@ -2855,7 +2879,8 @@ host_kconfig = \
(targetos == 'linux' ? ['CONFIG_LINUX=y'] : []) + \
(have_pvrdma ? ['CONFIG_PVRDMA=y'] : []) + \
(multiprocess_allowed ? ['CONFIG_MULTIPROCESS_ALLOWED=y'] : []) + \
(vfio_user_server_allowed ? ['CONFIG_VFIO_USER_SERVER_ALLOWED=y'] : [])
(vfio_user_server_allowed ? ['CONFIG_VFIO_USER_SERVER_ALLOWED=y'] : []) + \
(hv_balloon ? ['CONFIG_HV_BALLOON_POSSIBLE=y'] : [])
ignored = [ 'TARGET_XML_FILES', 'TARGET_ABI_DIR', 'TARGET_ARCH' ]
@ -4321,6 +4346,7 @@ if targetos == 'windows'
endif
summary_info += {'seccomp support': seccomp}
summary_info += {'GlusterFS support': glusterfs}
summary_info += {'hv-balloon support': hv_balloon}
summary_info += {'TPM support': have_tpm}
summary_info += {'libssh support': libssh}
summary_info += {'lzo support': lzo}

@ -150,6 +150,8 @@ option('gio', type : 'feature', value : 'auto',
       description: 'use libgio for D-Bus support')
option('glusterfs', type : 'feature', value : 'auto',
       description: 'Glusterfs block device driver')
option('hv_balloon', type : 'feature', value : 'auto',
       description: 'hv-balloon driver (requires Glib 2.68+ GTree API)')
option('libdw', type : 'feature', value : 'auto',
       description: 'debuginfo support')
option('libiscsi', type : 'feature', value : 'auto',

@ -123,6 +123,7 @@ meson_options_help() {
printf "%s\n" ' gtk-clipboard clipboard support for the gtk UI (EXPERIMENTAL, MAY HANG)'
printf "%s\n" ' guest-agent Build QEMU Guest Agent'
printf "%s\n" ' guest-agent-msi Build MSI package for the QEMU Guest Agent'
printf "%s\n" ' hv-balloon hv-balloon driver (requires Glib 2.68+ GTree API)'
printf "%s\n" ' hvf HVF acceleration support'
printf "%s\n" ' iconv Font glyph conversion support'
printf "%s\n" ' jack JACK sound support'
@ -333,6 +334,8 @@ _meson_option_parse() {
--disable-guest-agent-msi) printf "%s" -Dguest_agent_msi=disabled ;;
--enable-hexagon-idef-parser) printf "%s" -Dhexagon_idef_parser=true ;;
--disable-hexagon-idef-parser) printf "%s" -Dhexagon_idef_parser=false ;;
--enable-hv-balloon) printf "%s" -Dhv_balloon=enabled ;;
--disable-hv-balloon) printf "%s" -Dhv_balloon=disabled ;;
--enable-hvf) printf "%s" -Dhvf=enabled ;;
--disable-hvf) printf "%s" -Dhvf=disabled ;;
--iasl=*) quote_sh "-Diasl=$2" ;;