This patch wittingly sets BDRV_REQ_NO_FALLBACK and silently ignores
BDRV_REQ_MAY_UNMAP for older librbd versions.
The rationale for this is as follows (citing Ilya Dryomov current RBD
maintainer):
---8<---
a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
and as a consequence always unmap if librbd is too old
It's not clear what qemu's expectation is but in general Write
Zeroes is allowed to unmap. The only guarantee is that subsequent
reads return zeroes, everything else is a hint. This is how it is
specified in the kernel and in the NVMe spec.
In particular, block/nvme.c implements it as follows:
if (flags & BDRV_REQ_MAY_UNMAP) {
cdw12 |= (1 << 25);
}
This sets the Deallocate bit. But if it's not set, the device may
still deallocate:
"""
If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
command, and the namespace supports clearing all bytes to 0h in the
values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
from a deallocated logical block and its metadata (excluding
protection information), then for each specified logical block, the
controller:
- should deallocate that logical block;
...
If the Deallocate bit is cleared to '0' in a Write Zeroes command,
and the namespace supports clearing all bytes to 0h in the values
read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
a deallocated logical block and its metadata (excluding protection
information), then, for each specified logical block, the
controller:
- may deallocate that logical block;
"""
https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf
b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags
Again, it's not clear what qemu expects here, but without it we end
up in a ridiculous situation where specifying the "don't allow slow
fallback" switch immediately fails all efficient zeroing requests on
a device where Write Zeroes is always efficient:
$ qemu-io -c 'help write' | grep -- '-[zun]'
-n, -- with -z, don't allow slow fallback
-u, -- with -z, allow unmapping
-z, -- write zeroes using blk_co_pwrite_zeroes
$ qemu-io -f rbd -c 'write -z -u -n 0 1M' rbd:foo/bar
write failed: Operation not supported
--->8---
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Message-Id: <20210702172356.11574-6-idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Message-Id: <20210702172356.11574-5-idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
While at it just call rbd_get_size and avoid rbd_image_info_t.
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Message-Id: <20210702172356.11574-4-idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Message-Id: <20210702172356.11574-3-idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Ceph Luminous (version 12.2.z) is almost 4 years old at this point.
Bump the requirement to get rid of the ifdef'ry in the code.
Qemu 6.1 dropped the support for RHEL-7 which was the last supported
OS that required an older librbd.
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Message-Id: <20210702172356.11574-2-idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Starting from ceph Pacific, RBD has built-in support for image-level encryption.
Currently supported formats are LUKS version 1 and 2.
There are 2 new relevant librbd APIs for controlling encryption, both expect an
open image context:
rbd_encryption_format: formats an image (i.e. writes the LUKS header)
rbd_encryption_load: loads encryptor/decryptor to the image IO stack
This commit extends the qemu rbd driver API to support the above.
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Message-Id: <20210627114635.39326-1-oro@il.ibm.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Jason has moved on from working on RBD and Ceph. I'm taking over
his role upstream.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Message-Id: <20210519112513.19694-1-idryomov@gmail.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If KVM_CAP_RPT_INVALIDATE KVM capability is enabled, then
- indicate the availability of H_RPT_INVALIDATE hcall to the guest via
ibm,hypertas-functions property.
- Enable the hcall
Both the above are done only if the new sPAPR machine capability
cap-rpt-invalidate is set.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Message-Id: <20210706112440.1449562-3-bharata@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Update to mainline commit: 79160a603bdb ("Merge tag 'usb-5.14-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb"
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Message-Id: <20210706112440.1449562-2-bharata@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This addresses the comments from v22.
The functional changes are (the VOF ones need retesting with Pegasos2):
(VOF) setprop will start failing if the machine class callback
did not handle it;
(VOF) unit addresses are lowered in path_offset();
(SPAPR) /chosen/bootargs is initialized from kernel_cmdline if
the client did not change it.
Fixes: 5c991e5d4378 ("spapr: Implement Open Firmware client interface")
Cc: BALATON Zoltan <balaton@eik.bme.hu>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20210708065625.548396-1-aik@ozlabs.ru>
Tested-by: BALATON Zoltan <balaton@eik.bme.hu>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The function ppc_tlb_invalid_all is not compiled anymore in a TCG-less
environment, and the call to that function has been disabled in this
situation
Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
Message-Id: <20210708164957.28096-2-lucas.araujo@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Linux uses RTAS functions to access PCI devices so we need to provide
these with VOF. Implement some of the most important functions to
allow booting Linux with VOF. With this the board is now usable
without a binary ROM image and we can enable it by default as other
boards.
Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
Message-Id: <20210708215113.B3F747456E3@zero.eik.bme.hu>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The pegasos2 board comes with an Open Firmware compliant ROM based on
SmartFirmware but it has some changes that are not open source
therefore the ROM binary cannot be included in QEMU. Guests running on
the board however depend on services provided by the firmware. The
Virtual Open Firmware recently added to QEMU implements a minimal set
of these services to allow some guests to boot without the original
firmware. This patch adds VOF as the default firmware for pegasos2
which allows booting Linux and MorphOS via -kernel option while a ROM
image can still be used with -bios for guests that don't run with VOF.
Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
Message-Id: <1d6ed6f290c5c1f0b5a1e1c51cf1151452d70d9a.1624811233.git.balaton@eik.bme.hu>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
There are several new L1D cache flush bits added to the hcall which reflect
hardware security features for speculative cache access issues.
These behaviours are now being specified as negative in order to simplify
patched kernel compatibility with older firmware (a new problem found in
existing systems would automatically be vulnerable).
[dwg: Technically this changes behaviour for existing machine types.
After discussion with Nick, we've determined this is safe, because
the worst that will happen if a guest gets the wrong information due
to a migration is that it will perform some unnecessary workarounds,
but will remain correct and secure (well, as secure as it was going
to be anyway). In addition the change only affects cap-cfpc=safe
which is not enabled by default, and in fact is not possible to set
on any current hardware (though it's expected it will be possible on
POWER10)]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20210615044107.1481608-1-npiggin@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Change the assert in ppc_store_sdr1() to allow vhyp to be set on CPUs
without HV bit. This allows using the vhyp interface for firmware
emulation on pegasos2.
Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
Message-Id: <21c7745aabbb68fcc50bb2ffaf16b939ba21261c.1624811233.git.balaton@eik.bme.hu>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Add own machine state structure which will be used to store state
needed for firmware emulation.
Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <7f6d5fbf4f70c64dba001483174a2921dd616ecd.1624811233.git.balaton@eik.bme.hu>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
MSR is a 32-bit register in BookE and there is no mtmsrd instruction.
Cc: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20210706051321.609046-1-npiggin@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The PAPR platform describes an OS environment that's presented by
a combination of a hypervisor and firmware. The features it specifies
require collaboration between the firmware and the hypervisor.
Since the beginning, the runtime component of the firmware (RTAS) has
been implemented as a 20 byte shim which simply forwards it to
a hypercall implemented in qemu. The boot time firmware component is
SLOF - but a build that's specific to qemu, and has always needed to be
updated in sync with it. Even though we've managed to limit the amount
of runtime communication we need between qemu and SLOF, there's some,
and it has become increasingly awkward to handle as we've implemented
new features.
This implements a boot time OF client interface (CI) which is
enabled by a new "x-vof" pseries machine option (stands for "Virtual Open
Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
which implements Open Firmware Client Interface (OF CI). This allows
using a smaller stateless firmware which does not have to manage
the device tree.
The new "vof.bin" firmware image is included with source code under
pc-bios/. It also includes RTAS blob.
This implements a handful of CI methods just to get -kernel/-initrd
working. In particular, this implements the device tree fetching and
simple memory allocator - "claim" (an OF CI memory allocator) and updates
"/memory@0/available" to report the client about available memory.
This implements changing some device tree properties which we know how
to deal with, the rest is ignored. To allow changes, this skips
fdt_pack() when x-vof=on as not packing the blob leaves some room for
appending.
In absence of SLOF, this assigns phandles to device tree nodes to make
device tree traversing work.
When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
This adds basic instances support which are managed by a hash map
ihandle -> [phandle].
Before the guest started, the used memory is:
0..e60 - the initial firmware
8000..10000 - stack
400000.. - kernel
3ea0000.. - initramdisk
This OF CI does not implement "interpret".
Unlike SLOF, this does not format uninitialized nvram. Instead, this
includes a disk image with pre-formatted nvram.
With this basic support, this can only boot into kernel directly.
However this is just enough for the petitboot kernel and initradmdisk to
boot from any possible source. Note this requires reasonably recent guest
kernel with:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735
The immediate benefit is much faster booting time which especially
crucial with fully emulated early CPU bring up environments. Also this
may come handy when/if GRUB-in-the-userspace sees light of the day.
This separates VOF and sPAPR in a hope that VOF bits may be reused by
other POWERPC boards which do not support pSeries.
This assumes potential support for booting from QEMU backends
such as blockdev or netdev without devices/drivers used.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20210625055155.2252896-1-aik@ozlabs.ru>
Reviewed-by: BALATON Zoltan <balaton@eik.bme.hu>
[dwg: Adjusted some includes which broke compile in some more obscure
compilation setups]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This adds eTSEC support to the PowerPC `ppce500` machine documentation.
Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Changed hash32 address translation to use the supplied mmu_idx, instead
of using what was stored in the msr, for parity purposes (radix64
already uses that) and for conceptual correctness, all the relevant
functions should always use the supplied mmu_idx, as there are no
guarantees that the mmu_idx stored in the CPU variable will not desync.
Signed-off-by: Bruno Larsen (billionai) <bruno.larsen@eldorado.org.br>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20210706150316.21005-3-bruno.larsen@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Intrudoce a header common to all BookS MMUs, that can hold code that is
common to hash32 and book3s-v3 MMUs.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Bruno Larsen (billionai) <bruno.larsen@eldorado.org.br>
Message-Id: <20210706150316.21005-2-bruno.larsen@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Changed hash64 address translation to use the supplied mmu_idx instead
of using the one stored in the msr, for parity purposes (other book3s
MMUs already use it).
Signed-off-by: Bruno Larsen (billionai) <bruno.larsen@eldorado.org.br>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210628133610.1143-4-bruno.larsen@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This commit attempts to fix a technical hiccup first mentioned by Richard
Henderson in
https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg06247.html
To sumarize the hiccup here, when radix-style mmus are translating an
address, they might need to call a second level of translation, with
hypervisor privileges. However, the way it was being done up until
this point meant that the second level translation had the same
privileges as the first level. It could lead to a bug in address
translation when running KVM inside a TCG guest, but this bug was never
experienced by users, so this isn't as much a bug fix as it is a
correctness cleanup.
This patch attempts that cleanup by making radix64_*_xlate functions
receive the mmu_idx, and passing one with the correct permission for the
second level translation.
The mmuidx macros added by this patch are only correct for non-bookE
mmus, because BookE style set the IS and DS bits inverted and there
might be other subtle differences. However, there doesn't seem to be
BookE cpus that have radix-style mmus, so we left a comment there to
document the issue, in case a machine does have that and was missed.
As part of this cleanup, we now need to send the correct mmmu_idx
when calling get_phys_page_debug, otherwise we might not be able to see the
memory that the CPU could
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Bruno Larsen (billionai) <bruno.larsen@eldorado.org.br>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Tested-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20210628133610.1143-2-bruno.larsen@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This function is used by TCGCPUOps, and is thus TCG specific.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210621125115.67717-10-bruno.larsen@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Create one common dispatch for all of the ppc_*_xlate functions.
Use ppc64_v3_radix to directly dispatch between ppc_radix64_xlate
and ppc_hash64_xlate.
Remove the separate *_handle_mmu_fault and *_get_phys_page_debug
functions, using common code for ppc_cpu_tlb_fill and
ppc_cpu_get_phys_page_debug.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210621125115.67717-9-bruno.larsen@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Mirror the interface of ppc_radix64_xlate (mostly), putting all
of the logic for older mmu translation into a single entry point.
For booke, we need to add mmu_idx to the xlate-style interface.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210621125115.67717-8-bruno.larsen@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Mirror the interface of ppc_radix64_xlate, putting all of
the logic for hash32 translation into a single entry point.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210621125115.67717-7-bruno.larsen@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Mirror the interface of ppc_radix64_xlate, putting all of
the logic for hash64 translation into a single function.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210621125115.67717-6-bruno.larsen@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Instead of returning non-zero for failure, return true for success.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210621125115.67717-5-bruno.larsen@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This removes some incomplete duplication between
ppc_radix64_handle_mmu_fault and ppc_radix64_get_phys_page_debug.
The former was correct wrt SPR_HRMOR and the latter was not.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210621125115.67717-4-bruno.larsen@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
These changes were waiting until we didn't need to match
the function type of PowerPCCPUClass.handle_mmu_fault.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210621125115.67717-3-bruno.larsen@eldorado.org.br>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Instead, use a switch on env->mmu_model. This avoids some
replicated information in cpu setup.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20210621125115.67717-2-bruno.larsen@eldorado.org.br>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
QEMU reserves space for RTAS via /rtas/rtas-size which tells the client
how much space the RTAS requires to work which includes the RTAS binary
blob implementing RTAS runtime. Because pseries supports FWNMI which
requires plenty of space, QEMU reserves more than 2KB which is
enough for the RTAS blob as it is just 20 bytes (under QEMU).
Since FWNMI reset delivery was added, RTAS_SIZE macro is not used anymore.
This replaces RTAS_SIZE with RTAS_MIN_SIZE and uses it in
the /rtas/rtas-size calculation to account for the RTAS blob.
Fixes: 0e236d3477 ("ppc/spapr: Implement FWNMI System Reset delivery")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20210622070336.1463250-1-aik@ozlabs.ru>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This isn't used anymore.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210622140926.677618-3-groug@kaod.org>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
PowerPC CPUs use big endian by default but starting with POWER7,
server grade CPUs use the ILE bit of the LPCR special purpose
register to decide on the endianness to use when handling
interrupts. This gives a clue to QEMU on the endianness the
guest kernel is running, which is needed when generating an
ELF dump of the guest or when delivering an FWNMI machine
check interrupt.
Commit 382d2db62b ("target-ppc: Introduce callback for interrupt
endianness") added a class method to PowerPCCPUClass to modelize
this : default implementation returns a fixed "big endian" value,
while POWER7 and newer do the LPCR_ILE check. This is suboptimal
as it forces to implement the method for every new CPU family, and
it is very unlikely that this will ever be different than what we
have today.
We basically only have three cases to consider:
a) CPU doesn't have an LPCR => big endian
b) CPU has an LPCR but doesn't support the ILE bit => big endian
c) CPU has an LPCR and supports the ILE bit => little or big endian
Instead of class methods, introduce an inline helper that checks the
ILE bit in the LPCR_MASK to decide on the outcome. The new helper
words little endian instead of big endian. This allows to drop a !
operator in ppc_cpu_do_fwnmi_machine_check().
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210622140926.677618-2-groug@kaod.org>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We support coordinated discarding of RAM using the RamDiscardManager for
the VFIO_TYPE1 iommus. Let's unlock support for coordinated discards,
keeping uncoordinated discards (e.g., via virtio-balloon) disabled if
possible.
This unlocks virtio-mem + vfio on x86-64. Note that vfio used via "nvme://"
by the block layer has to be implemented/unlocked separately. For now,
virtio-mem only supports x86-64; we don't restrict RamDiscardManager to
x86-64, though: arm64 and s390x are supposed to work as well, and we'll
test once unlocking virtio-mem support. The spapr IOMMUs will need special
care, to be tackled later, e.g.., once supporting virtio-mem.
Note: The block size of a virtio-mem device has to be set to sane sizes,
depending on the maximum hotplug size - to not run out of vfio mappings.
The default virtio-mem block size is usually in the range of a couple of
MBs. The maximum number of mapping is 64k, shared with other users.
Assume you want to hotplug 256GB using virtio-mem - the block size would
have to be set to at least 8 MiB (resulting in 32768 separate mappings).
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-14-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
We implement the RamDiscardManager interface and only require coordinated
discarding of RAM to work.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-13-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
We want to separate the two cases whereby we discard ram
- uncoordinated: e.g., virito-balloon
- coordinated: e.g., virtio-mem coordinated via the RamDiscardManager
Reviewed-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-12-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
We have users in migration context that don't hold the BQL (when
finishing migration). To prepare for further changes, use a dedicated mutex
instead of atomic operations. Keep using qatomic_read ("READ_ONCE") for the
functions that only extract the current state (e.g., used by
virtio-balloon), locking isn't necessary.
While at it, split up the counter into two variables to make it easier
to understand.
Suggested-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-11-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
vIOMMU support works already with RamDiscardManager as long as guests only
map populated memory. Both, populated and discarded memory is mapped
into &address_space_memory, where vfio_get_xlat_addr() will find that
memory, to create the vfio mapping.
Sane guests will never map discarded memory (e.g., unplugged memory
blocks in virtio-mem) into an IOMMU - or keep it mapped into an IOMMU while
memory is getting discarded. However, there are two cases where a malicious
guests could trigger pinning of more memory than intended.
One case is easy to handle: the guest trying to map discarded memory
into an IOMMU.
The other case is harder to handle: the guest keeping memory mapped in
the IOMMU while it is getting discarded. We would have to walk over all
mappings when discarding memory and identify if any mapping would be a
violation. Let's keep it simple for now and print a warning, indicating
that setting RLIMIT_MEMLOCK can mitigate such attacks.
We have to take care of incoming migration: at the point the
IOMMUs get restored and start creating mappings in vfio, RamDiscardManager
implementations might not be back up and running yet: let's add runstate
priorities to enforce the order when restoring.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-10-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Although RamDiscardManager can handle running into the maximum number of
DMA mappings by propagating errors when creating a DMA mapping, we want
to sanity check and warn the user early that there is a theoretical setup
issue and that virtio-mem might not be able to provide as much memory
towards a VM as desired.
As suggested by Alex, let's use the number of KVM memory slots to guess
how many other mappings we might see over time.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-9-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Let's query the maximum number of possible DMA mappings by querying the
available mappings when creating the container (before any mappings are
created). We'll use this informaton soon to perform some sanity checks
and warn the user.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-8-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Implement support for RamDiscardManager, to prepare for virtio-mem
support. Instead of mapping the whole memory section, we only map
"populated" parts and update the mapping when notified about
discarding/population of memory via the RamDiscardListener. Similarly, when
syncing the dirty bitmaps, sync only the actually mapped (populated) parts
by replaying via the notifier.
Using virtio-mem with vfio is still blocked via
ram_block_discard_disable()/ram_block_discard_require() after this patch.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-7-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Let's properly notify when (un)plugging blocks, after discarding memory
and before allowing the guest to consume memory. Handle errors from
notifiers gracefully (e.g., no remaining VFIO mappings) when plugging,
rolling back the change and telling the guest that the VM is busy.
One special case to take care of is replaying all notifications after
restoring the vmstate. The device starts out with all memory discarded,
so after loading the vmstate, we have to notify about all plugged
blocks.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-6-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>