We already use MADV_NORESERVE to deal with sparse memory regions. Let's
also set madvise(MADV_DONTDUMP), otherwise a crash of the process can
result in us allocating all memory in the mmap'ed region for dumping
purposes.
This change implies that the mmap'ed rings won't be included in a
coredump. If ever required for debugging purposes, we could mark only
the mapped rings MADV_DODUMP.
Ignore errors during madvise() for now.
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-15-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Currently, we try to remap all rings whenever we add a single new memory
region. That doesn't quite make sense, because we already map rings when
setting the ring address, and panic if that goes wrong. Likely, that
handling was simply copied from set_mem_table code, where we actually
have to remap all rings.
Remapping all rings might require us to walk quite a lot of memory
regions to perform the address translations. Ideally, we'd simply remove
that remapping.
However, let's be a bit careful. There might be some weird corner cases
where we might temporarily remove a single memory region (e.g., resize
it), that would have worked for now. Further, a ring might be located on
hotplugged memory, and as the VM reboots, we might unplug that memory, to
hotplug memory before resetting the ring addresses.
So let's unmap affected rings as we remove a memory region, and try
dynamically mapping the ring again when required.
Acked-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-14-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Let's factor it out to prepare for further changes.
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-13-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In the past, QEMU would create memory regions that could partially cover
hugetlb pages, making mmap() fail if we would use the mmap_offset as an
fd_offset. For that reason, we never used the mmap_offset as an offset into
the fd and instead always mapped the fd from the very start.
However, that can easily result in us mmap'ing a lot of unnecessary
parts of an fd, possibly repeatedly.
QEMU nowadays does not create memory regions that partially cover huge
pages -- it never really worked with postcopy. QEMU handles merging of
regions that partially cover huge pages (due to holes in boot memory) since
2018 in c1ece84e7c ("vhost: Huge page align and merge").
Let's be a bit careful and not unconditionally convert the
mmap_offset into an fd_offset. Instead, let's simply detect the hugetlb
size and pass as much as we can as fd_offset, making sure that we call
mmap() with a properly aligned offset.
With QEMU and a virtio-mem device that is fully plugged (50GiB using 50
memslots) the qemu-storage daemon process consumes in the VA space
1281GiB before this change and 58GiB after this change.
================ Vhost user message ================
Request: VHOST_USER_ADD_MEM_REG (37)
Flags: 0x9
Size: 40
Fds: 59
Adding region 4
guest_phys_addr: 0x0000000200000000
memory_size: 0x0000000040000000
userspace_addr: 0x00007fb73bffe000
old mmap_offset: 0x0000000080000000
fd_offset: 0x0000000080000000
new mmap_offset: 0x0000000000000000
mmap_addr: 0x00007f02f1bdc000
Successfully added new region
================ Vhost user message ================
Request: VHOST_USER_ADD_MEM_REG (37)
Flags: 0x9
Size: 40
Fds: 59
Adding region 5
guest_phys_addr: 0x0000000240000000
memory_size: 0x0000000040000000
userspace_addr: 0x00007fb77bffe000
old mmap_offset: 0x00000000c0000000
fd_offset: 0x00000000c0000000
new mmap_offset: 0x0000000000000000
mmap_addr: 0x00007f0284000000
Successfully added new region
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-12-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Let's speed up GPA to memory region / virtual address lookup. Store the
memory regions ordered by guest physical addresses, and use binary
search for address translation, as well as when adding/removing memory
regions.
Most importantly, this will speed up GPA->VA address translation when we
have many memslots.
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-11-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Memory regions cannot overlap, and if we ever hit that case something
would be really flawed.
For example, when vhost code in QEMU decides to increase the size of memory
regions to cover full huge pages, it makes sure to never create overlaps,
and if there would be overlaps, it would bail out.
QEMU commits 48d7c97577 ("vhost: Merge sections added to temporary
list"), c1ece84e7c ("vhost: Huge page align and merge") and
e7b94a84b6 ("vhost: Allow adjoining regions") added and clarified that
handling and how overlaps are impossible.
Consequently, each GPA can belong to at most one memory region, and
everything else doesn't make sense. Let's factor out our search to prepare
for further changes.
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-10-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We cannot have duplicate memory regions, something would be deeply
flawed elsewhere. Let's just stop the search once we found an entry.
We'll add more sanity checks when adding memory regions later.
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-9-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
dev->nregions always covers only valid entries. Stop zeroing out other
array elements that are unused.
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-8-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We never add a memory region if mmap() failed. Therefore, no need to check
for NULL.
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-7-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Let's factor it out, reducing quite some code duplication and perparing
for further changes.
If we fail to mmap a region and panic, we now simply don't add that
(broken) region.
Note that we now increment dev->nregions as we are successfully
adding memory regions, and don't increment dev->nregions if anything went
wrong.
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-6-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Let's reduce some code duplication and prepare for further changes.
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-5-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Let's factor it out. Note that the check for MAP_FAILED was wrong as
we never set mmap_addr if mmap() failed. We'll remove the NULL check
separately.
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-4-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Let's prepare for increasing VHOST_USER_MAX_RAM_SLOTS by dynamically
allocating dev->regions. We don't have any ABI guarantees (not
dynamically linked), so we can simply change the layout of VuDev.
Let's zero out the memory, just as we used to do.
Reviewed-by: Raphael Norwitz <raphael@enfabrica.net>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240214151701.29906-2-david@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
There is no need to use the Linux-internal __u64 type, 1ULL is
guaranteed to be wide enough.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20240117160313.175609-1-pbonzini@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Rename shadowing variables to make this code compilable
with -Wshadow=local.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20231006121129.487251-1-thuth@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
In the libvhost-user library we need to
handle VHOST_USER_GET_SHARED_OBJECT requests,
and add helper functions to allow sending messages
to interact with the virtio shared objects
hash table.
Signed-off-by: Albert Esteve <aesteve@redhat.com>
Message-Id: <20231002065706.94707-5-aesteve@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Explain Coverity that we are not going to overflow vmsg->fds.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20230925194040.68592-5-vsementsov@yandex-team.ru>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Slave/master nomenclature was replaced with backend/frontend in commit
1fc19b6527 ("vhost-user: Adopt new backend naming")
This patch replaces all remaining uses of master and slave in the
codebase.
Signed-off-by: Emmanouil Pitsidianakis <manos.pitsidianakis@linaro.org>
Message-Id: <20230613080849.2115347-1-manos.pitsidianakis@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
The source file uses VIRTIO_F_VERSION_1 which is
not defined by <linux/virtio_config.h> on Debian 10.
The system-provided <linux/virtio_config.h> which
does not include the macro definition is included
through <linux/vhost.h>, so fix the issue by including
the standard-headers version before that.
Signed-off-by: David 'Digit' Turner <digit@google.com>
Message-Id: <20230405125920.2951721-2-digit@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently block_resize qmp command is simply ignored by vhost-user-blk
export. So, the block-node is successfully resized, but virtio config
is unchanged and guest doesn't see that disk is resized.
Let's handle the resize by modifying the config and notifying the guest
appropriately.
After this comment, lsblk in linux guest with attached
vhost-user-blk-pci device shows new size immediately after block_resize
QMP command on vhost-user exported block node.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20230321201323.3695923-1-vsementsov@yandex-team.ru>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The Vhost-user specification changed feature and request
naming from _SLAVE_ to _BACKEND_.
This patch adopts the new naming convention.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Message-Id: <20230208203259.381326-3-maxime.coquelin@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Since it was proposed to change the code in libvduse.c to use memcpy
instead of an assignment, the code in libvhost-user.c should also be
changed to use memcpy.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <502b22723264db064e4b05008233a9c1f2f8aaaa.1671741278.git.marcel@holtmann.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The assignment of dev->postcopy_ufd can be moved into an else clause and
then the code becomes C90 compliant.
CC libvhost-user.o
libvhost-user.c: In function ‘vu_set_postcopy_advise’:
libvhost-user.c:1625:5: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
1625 | struct uffdio_api api_struct;
| ^~~~~~
Understandable, it might be desired to avoid else clauses, but in this
case it seems clear enough and frankly the dev->postcopy_ufd is only
assigned once.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <74db52afb1203c4580ffc7fa462b4b2ba260a353.1671741278.git.marcel@holtmann.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When using libvhost-user source in an external project that wants to
comply with the C90 standard, it is best to declare variables before
code.
CC libvhost-user.o
libvhost-user.c: In function ‘generate_faults’:
libvhost-user.c:683:9: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
683 | struct uffdio_register reg_struct;
| ^~~~~~
In this case, it is also simple enough and doesn't cause any extra
ifdef additions.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <556c2d00c01fa134d13c0371d4014c90694c2943.1671741278.git.marcel@holtmann.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The sign-compare warning also hits some of the for-loops, but it easy
fixed by just making the iterator variable unsigned int.
CC libvhost-user.o
libvhost-user.c: In function ‘vu_gpa_to_va’:
libvhost-user.c:223:19: error: comparison of integer expressions of different signedness: ‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare]
223 | for (i = 0; i < dev->nregions; i++) {
| ^
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Message-Id: <decb925e1a6fb9538738d2570bda2804f888fa15.1671741278.git.marcel@holtmann.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The assert from recvmsg() return value against an uint32_t size field
from a protocol struct throws a compiler warning.
CC libvhost-user.o
In file included from libvhost-user.c:27:
libvhost-user.c: In function ‘vu_message_read_default’:
libvhost-user.c:363:19: error: comparison of integer expressions of different signedness: ‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare]
363 | assert(rc == vmsg->size);
| ^~
This is not critical, but annoying when the libvhost-user source are
used in an external project that has this compiler warning switched on.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Message-Id: <7a791e27b7bd3e0a8b8cc8fbb15090a870d226d5.1671741278.git.marcel@holtmann.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Strictly speaking only -std=gnu99 support the usage of typeof and for
easier inclusion in external projects, it is better to use __typeof__.
CC libvhost-user.o
libvhost-user.c: In function ‘vu_log_queue_fill’:
libvhost-user.c:86:13: error: implicit declaration of function ‘typeof’ [-Werror=implicit-function-declaration]
86 | typeof(x) _min1 = (x); \
| ^~~~~~
Changing these two users of typeof makes the compiler happy and no extra
flags or pragmas need to be provided.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <981aa822bcaaa2b8d74f245339a99a85c25b346f.1671741278.git.marcel@holtmann.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Then the libvhost-user sources are used by another project, it can not
be guaranteed that _GNU_SOURCE is set by the build system. If it is for
example not set, errors like this show up.
CC libvhost-user.o
libvhost-user.c: In function ‘vu_panic’:
libvhost-user.c:195:9: error: implicit declaration of function ‘vasprintf’; did you mean ‘vsprintf’? [-Werror=implicit-function-declaration]
195 | if (vasprintf(&buf, msg, ap) < 0) {
| ^~~~~~~~~
| vsprintf
The simplest way to allow external complication of libvhost-user.[ch] is
by setting _GNU_SOURCE if it is not already set by the build system.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Message-Id: <be27dcc747a6b5cc6f8ae3f79e0b79171382bcef.1671741278.git.marcel@holtmann.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This fix is required for 32 bit hosts. The bug was detected by CI
for arm-linux, but is also relevant for i386-linux.
Reported-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20221126152507.283271-4-sw@weilnetz.de>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Message-Id: <20220422070144.1043697-2-sw@weilnetz.de>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20221126152507.283271-2-sw@weilnetz.de>
With REPLY_NEEDED, libvhost-user sends both the acutal result and an
additional ACK reply for VHOST_USER_ADD_MEM_REG. This is incorrect, the
spec mandates that it behave the same with and without REPLY_NEEDED
because it always sends a reply.
Fixes: ec94c8e621
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220627134500.94842-4-kwolf@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
With REPLY_NEEDED, libvhost-user sends both the acutal result and an
additional ACK reply for VHOST_USER_GET_MAX_MEM_SLOTS. This is
incorrect, the spec mandates that it behave the same with and without
REPLY_NEEDED because it always sends a reply.
Fixes: 6fb2e173d2
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220627134500.94842-3-kwolf@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This is useful for more human readable debug messages in vhost-user
programs.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20220321153037.3622127-9-alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The spec clarifies now that QEMU should not send a file descriptor in a
request to remove a memory region. Change it accordingly.
For libvhost-user, this is a bug fix that makes it compatible with
rust-vmm's implementation that doesn't send a file descriptor. Keep
accepting, but ignoring a file descriptor for compatibility with older
QEMU versions.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220407133657.155281-4-kwolf@redhat.com>
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Outside of postcopy mode, neither VHOST_USER_ADD_MEM_REG nor
VHOST_USER_REM_MEM_REG are supposed to send a reply unless explicitly
requested with the need_reply flag. Their current implementation always
sends a reply, even if it isn't requested. This confuses the master
because it will interpret the reply as a reply for the next message for
which it actually expects a reply.
need_reply is already handled correctly by vu_dispatch(), so just don't
send a reply in the non-postcopy part of the message handler for these
two commands.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220407133657.155281-3-kwolf@redhat.com>
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For fd-based shared memory, MAP_NORESERVE is only effective for hugetlb,
otherwise it's ignored. Older Linux versions that didn't support
reservation of huge pages ignored MAP_NORESERVE completely.
The first client to mmap a hugetlb fd without MAP_NORESERVE will
trigger reservation of huge pages for the whole mmapped range. There are
two cases to consider:
1) QEMU mapped RAM without MAP_NORESERVE
We're not dealing with a sparse mapping, huge pages for the whole range
have already been reserved by QEMU. An additional mmap() without
MAP_NORESERVE won't have any effect on the reservation.
2) QEMU mapped RAM with MAP_NORESERVE
We're delaing with a sparse mapping, no huge pages should be reserved.
Further mappings without MAP_NORESERVE should be avoided.
For 1), it doesn't matter if we set MAP_NORESERVE or not, so we can
simply set it. For 2), we'd be overriding QEMUs decision and trigger
reservation of huge pages, which might just fail if there are not
sufficient huge pages around. We must map with MAP_NORESERVE.
This change is required to support virtio-mem with hugetlb: a
virtio-mem device mapped into the guest physical memory corresponds to
a sparse memory mapping and QEMU maps this memory with MAP_NORESERVE.
Whenever memory in that sparse region will be accessed by the VM, QEMU
populates huge pages for the affected range by preallocating memory
and handling any preallocation errors gracefully.
So let's map shared RAM with MAP_NORESERVE. As libvhost-user only
supports Linux, there shouldn't be anything to take care of in regard of
other OS support.
Without this change, libvhost-user will fail mapping the region if there
are currently not enough huge pages to perform the reservation:
fv_panic: libvhost-user: region mmap error: Cannot allocate memory
Cc: "Marc-André Lureau" <marcandre.lureau@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Raphael Norwitz <raphael.norwitz@nutanix.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20220111123939.132659-1-david@redhat.com>
Acked-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Today if QEMU (or any other VMM) has sent multiple copies of the same
region to a libvhost-user based backend and then attempts to remove the
region, only one instance of the region will be removed, leaving stale
copies of the region in dev->regions[].
This change resolves this by having vu_rem_mem_reg() iterate through all
regions in dev->regions[] and delete all matching regions.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Message-Id: <20220117041050.19718-7-raphael.norwitz@nutanix.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
When VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS support was added to
libvhost-user, no guardrails were added to protect against QEMU
attempting to hot-add too many RAM slots to a VM with a libvhost-user
based backed attached.
This change adds the missing error handling by introducing a check on
the number of RAM slots the device has available before proceeding to
process the VHOST_USER_ADD_MEM_REG message.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Message-Id: <20220117041050.19718-6-raphael.norwitz@nutanix.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
We end up not closing the file descriptor, resulting in leaking one
file descriptor for each VHOST_USER_REM_MEM_REG message.
Fixes: 875b9fd97b ("Support individual region unmap in libvhost-user")
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Raphael Norwitz <raphael.norwitz@nutanix.com>
Cc: "Marc-André Lureau" <marcandre.lureau@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Coiby Xu <coiby.xu@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Message-Id: <20220117041050.19718-5-raphael.norwitz@nutanix.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Let's avoid having to manually copy all elements. Copy only the ones
necessary to close the hole and perform the operation in-place without
a second array.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Message-Id: <20220117041050.19718-4-raphael.norwitz@nutanix.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Today if multiple FDs are sent from the VMM to the backend in a
VHOST_USER_ADD_MEM_REG message, one FD will be mapped and the remaining
FDs will be leaked. Therefore if multiple FDs are sent we report an
error and fail the operation, closing all FDs in the message.
Likewise in case the VMM sends a message with a size less than that
of a memory region descriptor, we add a check to gracefully report an
error and fail the operation rather than crashing.
Signed-off-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Message-Id: <20220117041050.19718-3-raphael.norwitz@nutanix.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Today if multiple FDs are sent from the VMM to the backend in a
VHOST_USER_REM_MEM_REG message, one FD will be unmapped and the remaining
FDs will be leaked. Therefore if multiple FDs are sent we report an
error and fail the operation, closing all FDs in the message.
Likewise in case the VMM sends a message with a size less than that of a
memory region descriptor, we add a check to gracefully report an error
and fail the operation rather than crashing.
Signed-off-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Message-Id: <20220117041050.19718-2-raphael.norwitz@nutanix.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
We end up not copying the mmap_addr of all existing regions, resulting
in a SEGFAULT once we actually try to map/access anything within our
memory regions.
Fixes: 875b9fd97b ("Support individual region unmap in libvhost-user")
Cc: qemu-stable@nongnu.org
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Raphael Norwitz <raphael.norwitz@nutanix.com>
Cc: "Marc-André Lureau" <marcandre.lureau@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Coiby Xu <coiby.xu@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20211011201047.62587-1-david@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
../subprojects/libvhost-user/libvhost-user.c:1070:12: error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘__u64’ {aka ‘long long unsigned int’} [-Werror=format=]
1070 | DPRINT(" desc_user_addr: 0x%016" PRIx64 "\n", vra->desc_user_addr);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~
| |
| __u64 {aka long long unsigned int}
Rather than using %llx, which may fail if __u64 is declared differently
elsewhere, let's just cast the values. Feel free to propose a better solution!
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20210505151313.203258-2-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Meson now checks that subprojects do not access files from parent
project. While we all agree this is best practice, libvhost-user also
want to share a few headers with QEMU, and libvhost-user isn't really a
standalone project at this point (although this is making the dependency
a bit more explicit).
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20210505151313.203258-1-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>