When vfio_realize fails, the mmap_timer used for INTx optimization
isn't freed. As this timer isn't activated yet, the potential impact
is just a piece of leaked memory.
Fixes: ea486926b0 ("vfio-pci: Update slow path INTx algorithm timer related")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
The kvm irqchip notifier is only registered if the device supports
INTx, however it's unconditionally removed in vfio realize error
path. If the assigned device does not support INTx, this will cause
QEMU to crash when vfio realize fails. Change it to conditionally
remove the notifier only if the notify hook is setup.
Before fix:
(qemu) device_add vfio-pci,host=81:11.1,id=vfio1,bus=root1,xres=1
Connection closed by foreign host.
After fix:
(qemu) device_add vfio-pci,host=81:11.1,id=vfio1,bus=root1,xres=1
Error: vfio 0000:81:11.1: xres and yres properties require display=on
(qemu)
Fixes: c5478fea27 ("vfio/pci: Respond to KVM irqchip change notifier")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
The major parts of VFIO migration are supported today in QEMU. This
includes basic VFIO migration, device dirty page tracking and precopy
support.
Thus, at this point in time, it seems appropriate to make VFIO migration
non-experimental: remove the x prefix from enable_migration property,
change it to ON_OFF_AUTO and let the default value be AUTO.
In addition, make the following adjustments:
1. When enable_migration is ON and migration is not supported, fail VFIO
device realization.
2. When enable_migration is AUTO (i.e., not explicitly enabled), require
device dirty tracking support. This is because device dirty tracking
is currently the only method to do dirty page tracking, which is
essential for migrating in a reasonable downtime. Setting
enable_migration to ON will not require device dirty tracking.
3. Make migration error and blocker messages more elaborate.
4. Remove error prints in vfio_migration_query_flags().
5. Rename trace_vfio_migration_probe() to
trace_vfio_migration_realize().
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Currently, VFIO bytes_transferred is not reset properly:
1. bytes_transferred is not reset after a VM snapshot (so a migration
following a snapshot will report incorrect value).
2. bytes_transferred is a single counter for all VFIO devices, however
upon migration failure it is reset multiple times, by each VFIO
device.
Fix it by introducing a new function vfio_reset_bytes_transferred() and
calling it during migration and snapshot start.
Remove existing bytes_transferred reset in VFIO migration state
notifier, which is not needed anymore.
Fixes: 3710586caa ("qapi: Add VFIO devices migration stats in Migration stats")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
When vfio_enable_vectors() returns with less than requested nr_vectors
we retry with what kernel reported back. But the retry path doesn't
call vfio_prepare_kvm_msi_virq_batch() and this results in,
qemu-system-aarch64: vfio: Error: Failed to enable 4 MSI vectors, retry with 1
qemu-system-aarch64: ../hw/vfio/pci.c:602: vfio_commit_kvm_msi_virq_batch: Assertion `vdev->defer_kvm_irq_routing' failed
Fixes: dc580d51f7 ("vfio: defer to commit kvm irq routing when enable msi/msix")
Reviewed-by: Longpeng <longpeng2@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
NVIDIA Turing and newer GPUs implement the MSI-X capability at the offset
previously reserved for use by hypervisors to implement the GPUDirect
Cliques capability. A revised specification provides an alternate
location. Add a config space walk to the quirk to check for conflicts,
allowing us to fall back to the new location or generate an error at the
quirk setup rather than when the real conflicting capability is added
should there be no available location.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
A common helper implementing the realloc algorithm for handling
capabilities.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Robin Voetter <robin@streamhpc.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Loading of a VFIO device's data can take a substantial amount of time as
the device may need to allocate resources, prepare internal data
structures, etc. This can increase migration downtime, especially for
VFIO devices with a lot of resources.
To solve this, VFIO migration uAPI defines "initial bytes" as part of
its precopy data stream. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, it can be used to
transfer device metadata to pre-allocate resources in the destination.
However, for this to work we need to make sure that all initial bytes
are sent and loaded in the destination before the source VM is stopped.
Use migration switchover ack capability to make sure a VFIO device's
initial bytes are sent and loaded in the destination before the source
stops the VM and attempts to complete the migration.
This can significantly reduce migration downtime for some devices.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Tested-by: YangHang Liu <yanghliu@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Pre-copy support allows the VFIO device data to be transferred while the
VM is running. This helps to accommodate VFIO devices that have a large
amount of data that needs to be transferred, and it can reduce migration
downtime.
Pre-copy support is optional in VFIO migration protocol v2.
Implement pre-copy of VFIO migration protocol v2 and use it for devices
that support it. Full description of it can be found in the following
Linux commit: 4db52602a607 ("vfio: Extend the device migration protocol
with PRE_COPY").
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Tested-by: YangHang Liu <yanghliu@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
VFIO migration flags are queried once in vfio_migration_init(). Store
them in VFIOMigration so they can be used later to check the device's
migration capabilities without re-querying them.
This will be used in the next patch to check if the device supports
precopy migration.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Tested-by: YangHang Liu <yanghliu@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Refactor vfio_save_block() to return the size of saved data on success
and -errno on error.
This will be used in next patch to implement VFIO migration pre-copy
support.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Tested-by: YangHang Liu <yanghliu@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Include the number of dirty pages on the vfio_get_dirty_bitmap tracepoint.
These are fetched from the newly added return value in
cpu_physical_memory_set_dirty_lebitmap().
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20230530180556.24441-3-joao.m.martins@oracle.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
vbasedev->name is freed wrongly which leads to garbage VFIO trace log.
Fix it by allocating a dup of vbasedev->name and then free the dup.
Fixes: 2dca1b37a7 ("vfio/pci: add support for VF token")
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
The PCI Resizable BAR (ReBAR) capability is currently hidden from the
VM because the protocol for interacting with the capability does not
support a mechanism for the device to reject an advertised supported
BAR size. However, when assigned to a VM, the act of resizing the
BAR requires adjustment of host resources for the device, which
absolutely can fail. Linux does not currently allow us to reserve
resources for the device independent of the current usage.
The only writable field within the ReBAR capability is the BAR Size
register. The PCIe spec indicates that when written, the device
should immediately begin to operate with the provided BAR size. The
spec however also notes that software must only write values
corresponding to supported sizes as indicated in the capability and
control registers. Writing unsupported sizes produces undefined
results. Therefore, if the hypervisor were to virtualize the
capability and control registers such that the current size is the
only indicated available size, then a write of anything other than
the current size falls into the category of undefined behavior,
where we can essentially expose the modified ReBAR capability as
read-only.
This may seem pointless, but users have reported that virtualizing
the capability in this way not only allows guest software to expose
related features as available (even if only cosmetic), but in some
scenarios can resolve guest driver issues. Additionally, no
regressions in behavior have been reported for this change.
A caveat here is that the PCIe spec requires for compatibility that
devices report support for a size in the range of 1MB to 512GB,
therefore if the current BAR size falls outside that range we revert
to hiding the capability.
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230505232308.2869912-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently, VFIO log_sync can be issued while migration is in SETUP
state. However, doing this log_sync is at best redundant and at worst
can fail.
Redundant -- all RAM is marked dirty in migration SETUP state and is
transferred only after migration is set to ACTIVE state, so doing
log_sync during migration SETUP is pointless.
Can fail -- there is a time window, between setting migration state to
SETUP and starting dirty tracking by RAM save_live_setup handler, during
which dirty tracking is still not started. Any VFIO log_sync call that
is issued during this time window will fail. For example, this error can
be triggered by migrating a VM when a GUI is active, which constantly
calls log_sync.
Fix it by skipping VFIO log_sync while migration is in SETUP state.
Fixes: 758b96b61d ("vfio/migrate: Move switch of dirty tracking into vfio_memory_listener")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Link: https://lore.kernel.org/r/20230403130000.6422-1-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This had been pulled in via qemu/plugin.h from hw/core/cpu.h,
but that will be removed.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20230310195252.210956-5-richard.henderson@linaro.org>
[AJB: add various additional cases shown by CI]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20230315174331.2959-15-alex.bennee@linaro.org>
Reviewed-by: Emilio Cota <cota@braap.org>
Pick names that align with the section drivers should use them from,
avoiding the confusion of calling a _finalize() function from _exit()
and generalizing the actual _finalize() to handle removing the viommu
blocker.
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/167820912978.606734.12740287349119694623.stgit@omen
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Now that everything has been set up for device dirty page tracking,
query the device for device dirty page tracking support.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230307125450.62409-15-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Migrating with vIOMMU will require either tracking maximum
IOMMU supported address space (e.g. 39/48 address width on Intel)
or range-track current mappings and dirty track the new ones
post starting dirty tracking. This will be done as a separate
series, so add a live migration blocker until that is fixed.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230307125450.62409-14-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add device dirty page bitmap sync functionality. This uses the device
DMA logging uAPI to sync dirty page bitmap from the device.
Device dirty page bitmap sync is used only if all devices within a
container support device dirty page tracking.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230307125450.62409-13-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Extract the VFIO_IOMMU_DIRTY_PAGES ioctl code in vfio_get_dirty_bitmap()
to its own function.
This will help the code to be more readable after next patch will add
device dirty page bitmap sync functionality.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230307125450.62409-12-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add device dirty page tracking start/stop functionality. This uses the
device DMA logging uAPI to start and stop dirty page tracking by device.
Device dirty page tracking is used only if all devices within a
container support device dirty page tracking.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20230307125450.62409-11-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
According to the device DMA logging uAPI, IOVA ranges to be logged by
the device must be provided all at once upon DMA logging start.
As preparation for the following patches which will add device dirty
page tracking, keep a record of all DMA mapped IOVA ranges so later they
can be used for DMA logging start.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230307125450.62409-10-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In preparation to be used in device dirty tracking, move the code that
calculate a iova/end range from the container/section. This avoids
duplication on the common checks across listener callbacks.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230307125450.62409-9-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The checks are replicated against region_add and region_del
and will be soon added in another memory listener dedicated
for dirty tracking.
Move these into a new helper for avoid duplication.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
Link: https://lore.kernel.org/r/20230307125450.62409-8-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In preparation to turn more of the memory listener checks into
common functions, one of the affected places is how we trace when
sections are skipped. Right now there is one for each. Change it
into one single tracepoint `vfio_listener_region_skip` which receives
a name which refers to the callback i.e. region_add and region_del.
Suggested-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230307125450.62409-7-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Move the code that finds the container host DMA window against a iova
range. This avoids duplication on the common checks across listener
callbacks.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
Link: https://lore.kernel.org/r/20230307125450.62409-6-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There are already two places where dirty page bitmap allocation and
calculations are done in open code.
To avoid code duplication, introduce VFIOBitmap struct and corresponding
alloc function and use them where applicable.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230307125450.62409-5-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If VFIO dirty pages log start/stop/sync fails during migration,
migration should be aborted as pages dirtied by VFIO devices might not
be reported properly.
This is not the case today, where in such scenario only an error is
printed.
Fix it by aborting migration in the above scenario.
Fixes: 758b96b61d ("vfio/migrate: Move switch of dirty tracking into vfio_memory_listener")
Fixes: b6dd6504e3 ("vfio: Add vfio_listener_log_sync to mark dirty pages")
Fixes: 9e7b0442f2 ("vfio: Add ioctl to get dirty pages bitmap during dma unmap")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230307125450.62409-4-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There are several places where the %m conversion is used if one of
vfio_dma_map(), vfio_dma_unmap() or vfio_get_dirty_bitmap() fail.
The %m usage in these places is wrong since %m relies on errno value while
the above functions don't report errors via errno.
Fix it by using strerror() with the returned value instead.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20230307125450.62409-3-joao.m.martins@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Use the VFIO_CCW() QOM type-checking macro to avoid DO_UPCAST().
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20230213170145.45666-7-philmd@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
QOM parenthood relationship is:
VFIOCCWDevice -> S390CCWDevice -> CcwDevice -> DeviceState
No need to double-cast, call CCW_DEVICE() on VFIOCCWDevice.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20230213170145.45666-6-philmd@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Use the S390_CCW_DEVICE() QOM type-checking macro to avoid DO_UPCAST().
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20230213170145.45666-5-philmd@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
'cdev' is VFIOCCWDevice's private parent object.
Access it using the S390_CCW_DEVICE() QOM macro.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20230213170145.45666-4-philmd@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
QOM parenthood relationship is:
VFIOCCWDevice -> S390CCWDevice -> CcwDevice -> DeviceState
We can directly use the QOM DEVICE() macro to get the parent object.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20230213170145.45666-3-philmd@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Sort the migration section of VFIO trace events file alphabetically
and move two misplaced traces to common.c section.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Link: https://lore.kernel.org/r/20230216143630.25610-11-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Now that v2 protocol implementation has been added, remove the
deprecated v1 implementation.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Link: https://lore.kernel.org/r/20230216143630.25610-10-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Implement the basic mandatory part of VFIO migration protocol v2.
This includes all functionality that is necessary to support
VFIO_MIGRATION_STOP_COPY part of the v2 protocol.
The two protocols, v1 and v2, will co-exist and in the following patches
v1 protocol code will be removed.
There are several main differences between v1 and v2 protocols:
- VFIO device state is now represented as a finite state machine instead
of a bitmap.
- Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
ioctl and normal read() and write() instead of the migration region.
- Pre-copy is made optional in v2 protocol. Support for pre-copy will be
added later on.
Detailed information about VFIO migration protocol v2 and its difference
compared to v1 protocol can be found here [1].
[1]
https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>.
Link: https://lore.kernel.org/r/20230216143630.25610-9-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
To avoid name collisions, rename functions and structs related to VFIO
migration protocol v1. This will allow the two protocols to co-exist
when v2 protocol is added, until v1 is removed. No functional changes
intended.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Link: https://lore.kernel.org/r/20230216143630.25610-8-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Move vfio_dev_get_region_info() logic from vfio_migration_probe() to
vfio_migration_init(). This logic is specific to v1 protocol and moving
it will make it easier to add the v2 protocol implementation later.
No functional changes intended.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Link: https://lore.kernel.org/r/20230216143630.25610-7-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently VFIO migration doesn't implement some kind of intermediate
quiescent state in which P2P DMAs are quiesced before stopping or
running the device. This can cause problems in multi-device migration
where the devices are doing P2P DMAs, since the devices are not stopped
together at the same time.
Until such support is added, block migration of multiple devices.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Link: https://lore.kernel.org/r/20230216143630.25610-6-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_devices_all_running_and_saving() is used to check if migration is
in pre-copy phase. This is done by checking if migration is in setup or
active states and if all VFIO devices are in pre-copy state, i.e.
_SAVING | _RUNNING.
In VFIO migration protocol v2 pre-copy support is made optional. Hence,
a matching v2 protocol pre-copy state can't be used here.
As preparation for adding v2 protocol, change
vfio_devices_all_running_and_saving() logic such that it doesn't use the
VFIO pre-copy state.
The new equivalent logic checks if migration is in active state and if
all VFIO devices are in running state [1]. No functional changes
intended.
[1] Note that checking if migration is in setup or active states and if
all VFIO devices are in running state doesn't guarantee that we are in
pre-copy phase, thus we check if migration is only in active state.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Link: https://lore.kernel.org/r/20230216143630.25610-5-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently, if IOMMU of a VFIO container doesn't support dirty page
tracking, migration is blocked. This is because a DMA-able VFIO device
can dirty RAM pages without updating QEMU about it, thus breaking the
migration.
However, this doesn't mean that migration can't be done at all.
In such case, allow migration and let QEMU VFIO code mark all pages
dirty.
This guarantees that all pages that might have gotten dirty are reported
back, and thus guarantees a valid migration even without VFIO IOMMU
dirty tracking support.
The motivation for this patch is the introduction of iommufd [1].
iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into its internal ops, allowing the usage of these IOCTLs
over iommufd. However, VFIO IOMMU dirty tracking is not supported by
this VFIO compatibility API.
This patch will allow migration by hosts that use the VFIO compatibility
API and prevent migration regressions caused by the lack of VFIO IOMMU
dirty tracking support.
[1]
https://lore.kernel.org/kvm/0-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com/
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Link: https://lore.kernel.org/r/20230216143630.25610-4-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As part of its error flow, vfio_vmstate_change() accesses
MigrationState->to_dst_file without any checks. This can cause a NULL
pointer dereference if the error flow is taken and
MigrationState->to_dst_file is not set.
For example, this can happen if VM is started or stopped not during
migration and vfio_vmstate_change() error flow is taken, as
MigrationState->to_dst_file is not set at that time.
Fix it by checking that MigrationState->to_dst_file is set before using
it.
Fixes: 02a7e71b1e ("vfio: Add VM state change handler to know state of VM")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Link: https://lore.kernel.org/r/20230216143630.25610-3-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Once that res_compatible is removed, they don't make sense anymore.
We remove the _only preffix. And to make things clearer we rename
them to must_precopy and can_postcopy.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Nothing assigns to it after previous commit.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Until previous commit, save_live_pending() was used for ram. Now with
the split into state_pending_estimate() and state_pending_exact() it
is not needed anymore, so remove them.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
We split the function into to:
- state_pending_estimate: We estimate the remaining state size without
stopping the machine.
- state pending_exact: We calculate the exact amount of remaining
state.
The only "device" that implements different functions for _estimate()
and _exact() is ram.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
PCIDeviceClass and PCIDevice are defined in pci.h. Many users of the
header don't actually need them. Similar structs live in their own
headers: PCIBusClass and PCIBus in pci_bus.h, PCIBridge in
pci_bridge.h, PCIHostBridgeClass and PCIHostState in pci_host.h,
PCIExpressHost in pcie_host.h, and PCIERootPortClass, PCIEPort, and
PCIESlot in pcie_port.h.
Move PCIDeviceClass and PCIDeviceClass to new pci_device.h, along with
the code that needs them. Adjust include directives.
This also enables the next commit.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20221222100330.380143-6-armbru@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
lots of acpi rework
first version of biosbits infrastructure
ASID support in vhost-vdpa
core_count2 support in smbios
PCIe DOE emulation
virtio vq reset
HMAT support
part of infrastructure for viommu support in vhost-vdpa
VTD PASID support
fixes, tests all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmNpXDkPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpD0AH/2G8ZPrgrxJC9y3uD5/5J6QRzO+TsDYbg5ut
uBf4rKSHHzcu6zdyAfsrhbAKKzyD4HrEGNXZrBjnKM1xCiB/SGBcDIWntwrca2+s
5Dpbi4xvd4tg6tVD4b47XNDCcn2uUbeI0e2M5QIbtCmzdi/xKbFAfl5G8DQp431X
Kmz79G4CdKWyjVlM0HoYmdCw/4FxkdjD02tE/Uc5YMrePNaEg5Bw4hjCHbx1b6ur
6gjeXAtncm9s4sO0l+sIdyiqlxiTry9FSr35WaQ0qPU+Og5zaf1EiWfdl8TRo4qU
EAATw5A4hyw11GfOGp7oOVkTGvcNB/H7aIxD7emdWZV8+BMRPKo=
=zTCn
-----END PGP SIGNATURE-----
Merge tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu into staging
pci,pc,virtio: features, tests, fixes, cleanups
lots of acpi rework
first version of biosbits infrastructure
ASID support in vhost-vdpa
core_count2 support in smbios
PCIe DOE emulation
virtio vq reset
HMAT support
part of infrastructure for viommu support in vhost-vdpa
VTD PASID support
fixes, tests all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# -----BEGIN PGP SIGNATURE-----
#
# iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmNpXDkPHG1zdEByZWRo
# YXQuY29tAAoJECgfDbjSjVRpD0AH/2G8ZPrgrxJC9y3uD5/5J6QRzO+TsDYbg5ut
# uBf4rKSHHzcu6zdyAfsrhbAKKzyD4HrEGNXZrBjnKM1xCiB/SGBcDIWntwrca2+s
# 5Dpbi4xvd4tg6tVD4b47XNDCcn2uUbeI0e2M5QIbtCmzdi/xKbFAfl5G8DQp431X
# Kmz79G4CdKWyjVlM0HoYmdCw/4FxkdjD02tE/Uc5YMrePNaEg5Bw4hjCHbx1b6ur
# 6gjeXAtncm9s4sO0l+sIdyiqlxiTry9FSr35WaQ0qPU+Og5zaf1EiWfdl8TRo4qU
# EAATw5A4hyw11GfOGp7oOVkTGvcNB/H7aIxD7emdWZV8+BMRPKo=
# =zTCn
# -----END PGP SIGNATURE-----
# gpg: Signature made Mon 07 Nov 2022 14:27:53 EST
# gpg: using RSA key 5D09FD0871C8F85B94CA8A0D281F0DB8D28D5469
# gpg: issuer "mst@redhat.com"
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [full]
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>" [full]
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu: (83 commits)
checkpatch: better pattern for inline comments
hw/virtio: introduce virtio_device_should_start
tests/acpi: update tables for new core count test
bios-tables-test: add test for number of cores > 255
tests/acpi: allow changes for core_count2 test
bios-tables-test: teach test to use smbios 3.0 tables
hw/smbios: add core_count2 to smbios table type 4
vhost-user: Support vhost_dev_start
vhost: Change the sequence of device start
intel-iommu: PASID support
intel-iommu: convert VTD_PE_GET_FPD_ERR() to be a function
intel-iommu: drop VTDBus
intel-iommu: don't warn guest errors when getting rid2pasid entry
vfio: move implement of vfio_get_xlat_addr() to memory.c
tests: virt: Update expected *.acpihmatvirt tables
tests: acpi: aarch64/virt: add a test for hmat nodes with no initiators
hw/arm/virt: Enable HMAT on arm virt machine
tests: Add HMAT AArch64/virt empty table files
tests: acpi: q35: update expected blobs *.hmat-noinitiators expected HMAT:
tests: acpi: q35: add test for hmat nodes without initiators
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
- Move the implement vfio_get_xlat_addr to softmmu/memory.c, and
change the name to memory_get_xlat_addr(). So we can use this
function on other devices, such as vDPA device.
- Add a new function vfio_get_xlat_addr in vfio/common.c, and it will check
whether the memory is backed by a discard manager. then device can
have its own warning.
Signed-off-by: Cindy Lu <lulu@redhat.com>
Message-Id: <20221031031020.1405111-2-lulu@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vfio_migration_init() initializes VFIOMigration->device_state using enum
of VFIO migration protocol v2. Current implemented protocol is v1 so v1
enum should be used. Fix it.
Fixes: 429c728006 ("vfio/migration: Fix incorrect initialization value for parameters in VFIOMigration")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Zhang Chen <chen.zhang@intel.com>
Link: https://lore.kernel.org/r/20221016085752.32740-1-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
On error, vfio_get_iommu_info() frees and clears *info, but
vfio_connect_container() continues to use the pointer regardless
of the return value. Restructure the code such that a failure
of this function triggers an error and clean up the remainder of
the function, including updating an outdated comment that had
drifted from its relevant line of code and using host page size
for a default for better compatibility on non-4KB systems.
Reported-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/all/20220910004245.2878-1-nicolinc@nvidia.com/
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/166326219630.3388898.12882473157184946072.stgit@omen
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The structure VFIOMigration of a VFIODevice is allocated and initialized
in vfio_migration_init(). "device_state" and "vm_running" are initialized
to 0, indicating that VFIO device is_STOP and VM is not-running. The
initialization value is incorrect. According to the agreement, default
state of VFIO device is _RUNNING. And if a VFIO device is hot-plugged
while the VM is running, "vm_running" should be 1. This patch fixes it.
Fixes: 02a7e71b1e ("vfio: Add VM state change handler to know state of VM")
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
Link: https://lore.kernel.org/r/20220711014651.1327-1-jiangkunkun@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The returned value is not used and misleading.
Signed-off-by: Akihiko Odaki <akihiko.odaki@gmail.com>
Message-Id: <20220226115516.59830-2-akihiko.odaki@gmail.com>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
851d6d1a0f ("vfio/common: remove spurious tpm-crb-cmd misalignment
warning") removed the warning on vfio_listener_region_add() path.
However the same warning also hits on region_del path. Let's remove
it and reword the dynamic trace as this can be called on both
map and unmap path.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/20220524091405.416256-1-eric.auger@redhat.com
Fixes: 851d6d1a0f ("vfio/common: remove spurious tpm-crb-cmd misalignment warning")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The QOM API already provides getters for uint64 and uint32 values, so reuse
them.
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20220301225220.239065-2-shentey@gmail.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Update to c5eb0a61238d ("Linux 5.18-rc6"). Mechanical search and
replace of vfio defines with white space massaging.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Rename VFIOGuestIOMMU iommu field into iommu_mr. Then it becomes clearer
it is an IOMMU memory region.
no functional change intended
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20220502094223.36384-4-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Using a VFIODevice handle local variable to improve the code readability.
no functional change intended
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20220502094223.36384-3-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
"%m" format specifier is not interpreted by the trace infrastructure
and thus "%m" is output instead of the actual errno string. Fix it by
outputting strerror(errno).
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20220502094223.36384-2-yi.l.liu@intel.com
[aw: replace commit log as provided by Eric]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The CRB command buffer currently is a RAM MemoryRegion and given
its base address alignment, it causes an error report on
vfio_listener_region_add(). This region could have been a RAM device
region, easing the detection of such safe situation but this option
was not well received. So let's add a helper function that uses the
memory region owner type to detect the situation is safe wrt
the assignment. Other device types can be checked here if such kind
of problem occurs again.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Acked-by: Stefan Berger <stefanb@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/20220506132510.1847942-3-eric.auger@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
It uses [offset, offset + size - 1] to indicate that the length of range is
size in most places in vfio trace code (such as
trace_vfio_region_region_mmap()) execpt trace_vfio_region_sparse_mmap_entry().
So change it for trace_vfio_region_sparse_mmap_entry(), but if size is zero,
the trace will be weird with an underflow, so move the trace and trace it
only if size is not zero.
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Link: https://lore.kernel.org/r/1650100104-130737-1-git-send-email-chenxiang66@hisilicon.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In migration resume phase, all unmasked msix vectors need to be
setup when loading the VF state. However, the setup operation would
take longer if the VM has more VFs and each VF has more unmasked
vectors.
The hot spot is kvm_irqchip_commit_routes, it'll scan and update
all irqfds that are already assigned each invocation, so more
vectors means need more time to process them.
vfio_pci_load_config
vfio_msix_enable
msix_set_vector_notifiers
for (vector = 0; vector < dev->msix_entries_nr; vector++) {
vfio_msix_vector_do_use
vfio_add_kvm_msi_virq
kvm_irqchip_commit_routes <-- expensive
}
We can reduce the cost by only committing once outside the loop.
The routes are cached in kvm_state, we commit them first and then
bind irqfd for each vector.
The test VM has 128 vcpus and 8 VF (each one has 65 vectors),
we measure the cost of the vfio_msix_enable for each VF, and
we can see 90+% costs can be reduce.
VF Count of irqfds[*] Original With this patch
1st 65 8 2
2nd 130 15 2
3rd 195 22 2
4th 260 24 3
5th 325 36 2
6th 390 44 3
7th 455 51 3
8th 520 58 4
Total 258ms 21ms
[*] Count of irqfds
How many irqfds that already assigned and need to process in this
round.
The optimization can be applied to msi type too.
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Link: https://lore.kernel.org/r/20220326060226.1892-6-longpeng2@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Commit ecebe53fe9 ("vfio: Avoid disabling and enabling vectors
repeatedly in VFIO migration") avoids inefficiently disabling and
enabling vectors repeatedly and lets the unmasked vectors be enabled
one by one.
But we want to batch multiple routes and defer the commit, and only
commit once outside the loop of setting vector notifiers, so we
cannot enable the vectors one by one in the loop now.
Revert that commit and we will take another way in the next patch,
it can not only avoid disabling/enabling vectors repeatedly, but
also satisfy our requirement of defer to commit.
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Link: https://lore.kernel.org/r/20220326060226.1892-5-longpeng2@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
It's unnecessary to test against the specific return value of
VFIO_DEVICE_SET_IRQS, since any positive return is an error
indicating the number of vectors we should retry with.
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Link: https://lore.kernel.org/r/20220326060226.1892-2-longpeng2@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Replace the global variables with inlined helper functions. getpagesize() is very
likely annotated with a "const" function attribute (at least with glibc), and thus
optimization should apply even better.
This avoids the need for a constructor initialization too.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20220323155743.1585078-12-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
g_new(T, n) is neater than g_malloc(sizeof(T) * n). It's also safer,
for two reasons. One, it catches multiplication overflowing size_t.
Two, it returns T * rather than void *, which lets the compiler catch
more type errors.
This commit only touches allocations with size arguments of the form
sizeof(T).
Patch created mechanically with:
$ spatch --in-place --sp-file scripts/coccinelle/use-g_new-etc.cocci \
--macro-file scripts/cocci-macro-file.h FILES...
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20220315144156.1595462-4-armbru@redhat.com>
Reviewed-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
We invoke the kvm_irqchip_commit_routes() for each addition to MSI route
table, which is not efficient if we are adding lots of routes in some cases.
This patch lets callers invoke the kvm_irqchip_commit_routes(), so the
callers can decide how to optimize.
[1] https://lists.gnu.org/archive/html/qemu-devel/2021-11/msg00967.html
Signed-off-by: Longpeng <longpeng2@huawei.com>
Message-Id: <20220222141116.2091-3-longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
More than 1k of TypeInfo instances are already marked as const. Mark the
remaining ones, too.
This commit was created with:
git grep -z -l 'static TypeInfo' -- '*.c' | \
xargs -0 sed -i 's/static TypeInfo/static const TypeInfo/'
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Message-id: 20220117145805.173070-2-shentey@gmail.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
hostwin is allocated and added to hostwin_list in vfio_host_win_add, but
it is only deleted from hostwin_list in vfio_host_win_del, which causes
a memory leak. Also, freeing all elements in hostwin_list is missing in
vfio_disconnect_container.
Fix: 2e4109de8e ("vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)")
CC: qemu-stable@nongnu.org
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
Link: https://lore.kernel.org/r/20211117014739.1839263-1-liangpeng10@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The MSI-X structures of some devices and other non-MSI-X structures
may be in the same BAR. They may share one host page, especially in
the case of large page granularity, such as 64K.
For example, MSIX-Table size of 82599 NIC is 0x30 and the offset in
Bar 3(size 64KB) is 0x0. vfio_listener_region_add() will be called
to map the remaining range (0x30-0xffff). If host page size is 64KB,
it will return early at 'int128_ge((int128_make64(iova), llend))'
without any message. Let's add a trace point to inform users like commit
5c08600547 ("vfio: Use a trace point when a RAM section cannot be DMA mapped")
did.
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
Link: https://lore.kernel.org/r/20211027090406.761-3-jiangkunkun@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We can expand MemoryRegions of sub-page MMIO BARs in
vfio_pci_write_config() to improve IO performance for some
devices. However, the MemoryRegions of destination VM are
not expanded any more after live migration. Because their
addresses have been updated in vmstate_load_state()
(vfio_pci_load_config) and vfio_sub_page_bar_update_mapping()
will not be called.
This may result in poor performance after live migration.
So iterate BARs in vfio_pci_load_config() and try to update
sub-page BARs.
Reported-by: Nianyao Tang <tangnianyao@huawei.com>
Reported-by: Qixin Gan <ganqixin@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
Link: https://lore.kernel.org/r/20211027090406.761-2-jiangkunkun@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
QDicts are both what QMP natively uses and what the keyval parser
produces. Going through QemuOpts isn't useful for either one, so switch
the main device creation function to QDicts. By sharing more code with
the -object/object-add code path, we can even reduce the code size a
bit.
This commit doesn't remove the detour through QemuOpts from any code
path yet, but it allows the following commits to do so.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211008133442.141332-15-kwolf@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Peter Krempa <pkrempa@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Provide a name field for all the memory listeners. It can be used to identify
which memory listener is which.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210817013553.30584-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new RAMBlock flag to denote "protected" memory, i.e. memory that
looks and acts like RAM but is inaccessible via normal mechanisms,
including DMA. Use the flag to skip protected memory regions when
mapping RAM for DMA in VFIO.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
hsch and csch basically have two parts: execute the command,
and perform the halt/clear function. For fully emulated
subchannels, it is pretty clear how it will work: check the
subchannel state, and actually 'perform the halt/clear function'
and set cc 0 if everything looks good.
For passthrough subchannels, some of the checking is done
within QEMU, but some has to be done within the kernel. QEMU's
subchannel state may be such that we can perform the async
function, but the kernel may still get a cc != 0 when it is
actually executing the instruction. In that case, we need to
set the condition actually encountered by the kernel; if we
set cc 0 on error, we would actually need to inject an interrupt
as well.
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Jared Rossi <jrossi@linux.ibm.com>
Message-Id: <20210705163952.736020-2-cohuck@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
When migrate_add_blocker(blocker, &err) is followed by
error_propagate(errp, err), we can often just as well do
migrate_add_blocker(..., errp). This is the case in
vfio_migration_probe().
Prior art: commit 386f6c07d2 "error: Avoid error_propagate() after
migrate_add_blocker()".
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20210720125408.387910-8-armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Fix pba_offset initialization value for BAIDU KUNLUN Virtual
Function device. The KUNLUN hardware returns an incorrect
value for the VF PBA offset, and add a quirk to instead
return a hardcoded value of 0xb400.
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Link: https://lore.kernel.org/r/20210713093743.942-1-caihuoqing@baidu.com
[aw: comment & whitespace tuning]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
CID 1458134: Integer handling issues (BAD_SHIFT)
In expression "1 << ctz64(container->pgsizes)", left shifting by more
than 31 bits has undefined behavior. The shift amount,
"ctz64(container->pgsizes)", is 64.
Commit 5e3b981c33 ("vfio: Support for RamDiscardManager in the !vIOMMU
case") added an assertion that our granularity is at least as big as the
page size.
Although unlikely, we could have a page size that does not fit into
32 bit. In that case, we'd try shifting by more than 31 bit.
Let's use 1ULL instead and make sure we're not shifting by more than 63
bit by asserting that any bit in container->pgsizes is set.
Fixes: CID 1458134
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Link: https://lore.kernel.org/r/20210712083135.15755-1-david@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We support coordinated discarding of RAM using the RamDiscardManager for
the VFIO_TYPE1 iommus. Let's unlock support for coordinated discards,
keeping uncoordinated discards (e.g., via virtio-balloon) disabled if
possible.
This unlocks virtio-mem + vfio on x86-64. Note that vfio used via "nvme://"
by the block layer has to be implemented/unlocked separately. For now,
virtio-mem only supports x86-64; we don't restrict RamDiscardManager to
x86-64, though: arm64 and s390x are supposed to work as well, and we'll
test once unlocking virtio-mem support. The spapr IOMMUs will need special
care, to be tackled later, e.g.., once supporting virtio-mem.
Note: The block size of a virtio-mem device has to be set to sane sizes,
depending on the maximum hotplug size - to not run out of vfio mappings.
The default virtio-mem block size is usually in the range of a couple of
MBs. The maximum number of mapping is 64k, shared with other users.
Assume you want to hotplug 256GB using virtio-mem - the block size would
have to be set to at least 8 MiB (resulting in 32768 separate mappings).
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-14-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
vIOMMU support works already with RamDiscardManager as long as guests only
map populated memory. Both, populated and discarded memory is mapped
into &address_space_memory, where vfio_get_xlat_addr() will find that
memory, to create the vfio mapping.
Sane guests will never map discarded memory (e.g., unplugged memory
blocks in virtio-mem) into an IOMMU - or keep it mapped into an IOMMU while
memory is getting discarded. However, there are two cases where a malicious
guests could trigger pinning of more memory than intended.
One case is easy to handle: the guest trying to map discarded memory
into an IOMMU.
The other case is harder to handle: the guest keeping memory mapped in
the IOMMU while it is getting discarded. We would have to walk over all
mappings when discarding memory and identify if any mapping would be a
violation. Let's keep it simple for now and print a warning, indicating
that setting RLIMIT_MEMLOCK can mitigate such attacks.
We have to take care of incoming migration: at the point the
IOMMUs get restored and start creating mappings in vfio, RamDiscardManager
implementations might not be back up and running yet: let's add runstate
priorities to enforce the order when restoring.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-10-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Although RamDiscardManager can handle running into the maximum number of
DMA mappings by propagating errors when creating a DMA mapping, we want
to sanity check and warn the user early that there is a theoretical setup
issue and that virtio-mem might not be able to provide as much memory
towards a VM as desired.
As suggested by Alex, let's use the number of KVM memory slots to guess
how many other mappings we might see over time.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-9-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Let's query the maximum number of possible DMA mappings by querying the
available mappings when creating the container (before any mappings are
created). We'll use this informaton soon to perform some sanity checks
and warn the user.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-8-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Implement support for RamDiscardManager, to prepare for virtio-mem
support. Instead of mapping the whole memory section, we only map
"populated" parts and update the mapping when notified about
discarding/population of memory via the RamDiscardListener. Similarly, when
syncing the dirty bitmaps, sync only the actually mapped (populated) parts
by replaying via the notifier.
Using virtio-mem with vfio is still blocked via
ram_block_discard_disable()/ram_block_discard_require() after this patch.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210413095531.25603-7-david@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
move kvm files into kvm/
After the reshuffling, update MAINTAINERS accordingly.
Make use of the new directory:
target/s390x/kvm/
Signed-off-by: Claudio Fontana <cfontana@suse.de>
Signed-off-by: Cho, Yu-Chen <acho@suse.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20210707105324.23400-14-acho@suse.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Wire in the subchannel callback for building the IRB
ESW and ECW space for passthrough devices, and copy
the hardware's ESW into the IRB we are building.
If the hardware presented concurrent sense, then copy
that sense data into the IRB's ECW space.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20210617232537.1337506-5-farman@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Set _SAVING flag for device state from vmstate change handler when it
gets called from savevm.
Currently State transition savevm/suspend is seen as:
_RUNNING -> _STOP -> Stop-and-copy -> _STOP
State transition savevm/suspend should be:
_RUNNING -> Stop-and-copy -> _STOP
State transition from _RUNNING to _STOP occurs from
vfio_vmstate_change() where when vmstate changes from running to
!running, _RUNNING flag is reset but at the same time when
vfio_vmstate_change() is called for RUN_STATE_SAVE_VM, _SAVING bit
should be set.
Reported by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Message-Id: <1623177441-27496-1-git-send-email-kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In the vfio_migration_init(), the SaveVMHandler is registered for
VFIO device. But it lacks the operation of 'unregister'. It will
lead to 'Segmentation fault (core dumped)' in
qemu_savevm_state_setup(), if performing live migration after a
VFIO device is hot deleted.
Fixes: 7c2f5f75f9 (vfio: Register SaveVMHandlers for VFIO device)
Reported-by: Qixin Gan <ganqixin@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
Message-Id: <20210527123101.289-1-jiangkunkun@huawei.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Commit e50caf4a5c ("tracing: convert documentation to rST")
converted docs/devel/tracing.txt to docs/devel/tracing.rst.
We still have several references to the old file, so let's fix them
with the following command:
sed -i s/tracing.txt/tracing.rst/ $(git grep -l docs/devel/tracing.txt)
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210517151702.109066-2-sgarzare@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
The vfio_ccw_unrealize() routine makes an unconditional attempt to
unregister every IRQ notifier, though they may not have been registered
in the first place (when running on an older kernel, for example).
Let's mirror this behavior in the error cleanups in vfio_ccw_realize()
so that if/when new IRQs are added, it is less confusing to recognize
the necessary procedures. The worst case scenario would be some extra
messages about an undefined IRQ, but since this is an error exit that
won't be the only thing to worry about.
And regarding those messages, let's change it to a warning instead of
an error, to better reflect their severity. The existing code in both
paths handles everything anyway.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Acked-by: Matthew Rosato <mjrosato@linux.ibm.com>
Message-Id: <20210428143652.1571487-1-farman@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>