The code used to assign an interrupt index/subindex to an
eventfd is duplicated many times. Let's introduce an helper that
allows to set/unset the signaling for an ACTION_TRIGGER,
ACTION_MASK or ACTION_UNMASK action.
In the error message, we now use errno in case of any
VFIO_DEVICE_SET_IRQS ioctl failure.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Li Qiang <liq3ea@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The MSI-X relocation code can sometimes be used to work around bogus
MSI-X capabilities, but this test for whether the PBA is outside of
the specified BAR causes the device to error before we can apply a
relocation. Let it proceed if we intend to relocate MSI-X anyway.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The resizable BAR capability is currently exposed read-only from the
kernel and we don't yet implement a protocol for virtualizing it to
the VM. Exposing it to the guest read-only introduces poor behavior
as the guest has no reason to test that a control register write is
accepted by the hardware. This can lead to cases where the guest OS
assumes the BAR has been resized, but it hasn't. This has been
observed when assigning AMD Vega GPUs.
Note, this does not preclude future enablement of resizable BARs, but
it's currently incorrect to expose this capability as read-only, so
better to not expose it at all.
Reported-by: James Courtier-Dutton <james.dutton@gmail.com>
Tested-by: James Courtier-Dutton <james.dutton@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Rather than looking inside the definition of a DeviceState with
"s->qdev", use the QOM prefered style: "DEVICE(s)".
This patch was generated using the following Coccinelle script:
// Use DEVICE() macros to access DeviceState.qdev
@use_device_macro_to_access_qdev@
expression obj;
identifier dev;
@@
-&obj->dev.qdev
+DEVICE(obj)
Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20190528164020.32250-10-philmd@redhat.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
'MSIX_CAP_LENGTH' is defined in two .c file. Move it
to hw/pci/msix.h file to reduce duplicated code.
CC: qemu-trivial@nongnu.org
Signed-off-by: Li Qiang <liq3ea@163.com>
Message-Id: <20190521151543.92274-5-liq3ea@163.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
The QOMConventions recommends we should use TYPE_FOO
for a TypeInfo's name. Though "vfio-pci-nohotplug" is not
used in other parts, for consistency we should make this change.
CC: qemu-trivial@nongnu.org
Signed-off-by: Li Qiang <liq3ea@163.com>
Message-Id: <20190521151543.92274-2-liq3ea@163.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
implements special regions for such GPUs and emulates an NVLink bridge.
NVLink2-enabled POWER9 CPUs also provide address translation services
which includes an ATS shootdown (ATSD) register exported via the NVLink
bridge device.
This adds a quirk to VFIO to map the GPU memory and create an MR;
the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
this to get the MR and map it to the system address space.
Another quirk does the same for ATSD.
This adds additional steps to sPAPR PHB setup:
1. Search for specific GPUs and NPUs, collect findings in
sPAPRPHBState::nvgpus, manage system address space mappings;
2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
"memory-block", "link-speed" to advertise the NVLink2 function to
the guest;
3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
4. Add new memory blocks (with extra "linux,memory-usable" to prevent
the guest OS from accessing the new memory until it is onlined) and
npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
uses it for link discovery.
This allocates space for GPU RAM and ATSD like we do for MMIOs by
adding 2 new parameters to the phb_placement() hook. Older machine types
set these to zero.
This puts new memory nodes in a separate NUMA node to as the GPU RAM
needs to be configured equally distant from any other node in the system.
Unlike the host setup which assigns numa ids from 255 downwards, this
adds new NUMA nodes after the user configures nodes or from 1 if none
were configured.
This adds requirement similar to EEH - one IOMMU group per vPHB.
The reason for this is that ATSD registers belong to a physical NPU
so they cannot invalidate translations on GPUs attached to another NPU.
It is guaranteed by the host platform as it does not mix NVLink bridges
or GPUs from different NPU in the same IOMMU group. If more than one
IOMMU group is detected on a vPHB, this disables ATSD support for that
vPHB and prints a warning.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for vfio portions]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20190312082103.130561-1-aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190417190641.26814-8-armbru@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
This allows configure the display resolution which the vgpu should use.
The information will be passed to the guest using EDID, so the mdev
driver must support the vfio edid region for this to work.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The '%m' format instructs glibc's printf()/syslog() implementation to
insert the contents of strerror(errno). Since this is a glibc extension
it should generally be avoided in QEMU due to need for portability to a
variety of platforms.
Even though vfio is Linux-only code that could otherwise use "%m", it
must still be avoided in trace-events files because several of the
backends do not use the format string and so this error information is
invisible to them.
The errno string value should be given as an explicit trace argument
instead, making it accessible to all backends. This also allows it to
work correctly with future patches that use the format string with
systemtap's simple printf code.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-id: 20190123120016.4538-4-berrange@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now that the downstream port will virtually negotiate itself to the
link status of the downstream device, we can remove this emulation.
It's not clear that it was every terribly useful anyway.
Tested-by: Geoffrey McRae <geoff@hostfission.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In preparation for reporting higher virtual link speeds and widths,
create enums and macros to help us manage them.
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Tested-by: Geoffrey McRae <geoff@hostfission.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The previous commit changed vfio's warning messages from
vfio warning: DEV-NAME: Could not frobnicate
to
warning: vfio DEV-NAME: Could not frobnicate
To match this change, change error messages from
vfio error: DEV-NAME: On fire
to
vfio DEV-NAME: On fire
Note the loss of "error". If we think marking error messages that way
is a good idea, we should mark *all* error messages, i.e. make
error_report() print it.
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20181017082702.5581-7-armbru@redhat.com>
The vfio code reports warnings like
error_report(WARN_PREFIX "Could not frobnicate", DEV-NAME);
where WARN_PREFIX is defined so the message comes out as
vfio warning: DEV-NAME: Could not frobnicate
This usage predates the introduction of warn_report() & friends in
commit 97f40301f1. It's time to convert to that interface. Since
these functions already prefix the message with "warning: ", replace
WARN_PREFIX by VFIO_MSG_PREFIX, so the messages come out like
warning: vfio DEV-NAME: Could not frobnicate
The next commit will replace ERR_PREFIX.
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20181017082702.5581-6-armbru@redhat.com>
From include/qapi/error.h:
* Pass an existing error to the caller with the message modified:
* error_propagate(errp, err);
* error_prepend(errp, "Could not frobnicate '%s': ", name);
Fei Li pointed out that doing error_propagate() first doesn't work
well when @errp is &error_fatal or &error_abort: the error_prepend()
is never reached.
Since I doubt fixing the documentation will stop people from getting
it wrong, introduce error_propagate_prepend(), in the hope that it
lures people away from using its constituents in the wrong order.
Update the instructions in error.h accordingly.
Convert existing error_prepend() next to error_propagate to
error_propagate_prepend(). If any of these get reached with
&error_fatal or &error_abort, the error messages improve. I didn't
check whether that's the case anywhere.
Cc: Fei Li <fli@suse.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20181017082702.5581-2-armbru@redhat.com>
Define a TYPE_VFIO_PCI and drop DO_UPCAST.
Signed-off-by: Li Qiang <liq3ea@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
So we have a boot display when using a vgpu as primary display.
ramfb depends on a fw_cfg file. fw_cfg files can not be added and
removed at runtime, therefore a ramfb-enabled vfio device can't be
hotplugged.
Add a nohotplug variant of the vfio-pci device (as child class). Add
the ramfb property to the nohotplug variant only. So to enable the vgpu
display with boot support use this:
-device vfio-pci-nohotplug,display=on,ramfb=on,sysfsdev=...
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fix error reported by Coverity where realpath can return NULL,
resulting in a segfault in strcmp(). This should never happen given
that we're working through regularly structured sysfs paths, but
trivial enough to easily avoid.
Fixes: 238e917285 ("vfio/ccw/pci: Allow devices to opt-in for ballooning")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a vfio assigned device makes use of a physical IOMMU, then memory
ballooning is necessarily inhibited due to the page pinning, lack of
page level granularity at the IOMMU, and sufficient notifiers to both
remove the page on balloon inflation and add it back on deflation.
However, not all devices are backed by a physical IOMMU. In the case
of mediated devices, if a vendor driver is well synchronized with the
guest driver, such that only pages actively used by the guest driver
are pinned by the host mdev vendor driver, then there should be no
overlap between pages available for the balloon driver and pages
actively in use by the device. Under these conditions, ballooning
should be safe.
vfio-ccw devices are always mediated devices and always operate under
the constraints above. Therefore we can consider all vfio-ccw devices
as balloon compatible.
The situation is far from straightforward with vfio-pci. These
devices can be physical devices with physical IOMMU backing or
mediated devices where it is unknown whether a physical IOMMU is in
use or whether the vendor driver is well synchronized to the working
set of the guest driver. The safest approach is therefore to assume
all vfio-pci devices are incompatible with ballooning, but allow user
opt-in should they have further insight into mediated devices.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
PCI devices needing a ROM allocate an optional MemoryRegion with
pci_add_option_rom(). pci_del_option_rom() does the cleanup when the
device is destroyed. The only action taken by this routine is to call
vmstate_unregister_ram() which clears the id string of the optional
ROM RAMBlock and now, also flags the RAMBlock as non-migratable. This
was recently added by commit b895de5027 ("migration: discard
non-migratable RAMBlocks"), .
VFIO devices do their own loading of the PCI option ROM in
vfio_pci_size_rom(). The memory region is switched to an I/O region
and the PCI attribute 'has_rom' is set but the RAMBlock of the ROM
region is not allocated. When the associated PCI device is deleted,
pci_del_option_rom() calls vmstate_unregister_ram() which tries to
flag a NULL RAMBlock, leading to a SEGV.
It seems that 'has_rom' was set to have memory_region_destroy()
called, but since commit 469b046ead ("memory: remove
memory_region_destroy") this is not necessary anymore as the
MemoryRegion is freed automagically.
Remove the PCIDevice 'has_rom' attribute setting in vfio.
Fixes: b895de5027 ("migration: discard non-migratable RAMBlocks")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
It eases code review, unit is explicit.
Patch generated using:
$ git grep -E '(1024|2048|4096|8192|(<<|>>).?(10|20|30))' hw/ include/hw/
and modified manually.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20180625124238.25339-38-f4bug@amsat.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit a9994687cb ("vfio/display: core & wireup") added display
support to vfio-pci with the default being "auto", which breaks
existing VMs when the vGPU requires GL support but had no previous
requirement for a GL compatible configuration. "Off" is the safer
default as we impose no new requirements to VM configurations.
Fixes: a9994687cb ("vfio/display: core & wireup")
Cc: qemu-stable@nongnu.org
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With vfio ioeventfd support, we can program vfio-pci to perform a
specified BAR write when an eventfd is triggered. This allows the
KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
userspace handling for these events. On the same micro-benchmark
where the ioeventfd got us to almost 90% of performance versus
disabling the GeForce quirks, this gets us to within 95%.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found
in device MMIO space. Normally PCI config space is considered a slow
path and further optimization is unnecessary, however NVIDIA uses a
register here to enable the MSI interrupt to re-trigger. Exiting to
QEMU for this MSI-ACK handling can therefore rate limit our interrupt
handling. Fortunately the MSI-ACK write is easily detected since the
quirk MemoryRegion otherwise has very few accesses, so simply looking
for consecutive writes with the same data is sufficient, in this case
10 consecutive writes with the same data and size is arbitrarily
chosen. We configure the KVM ioeventfd with data match, so there's
no risk of triggering for the wrong data or size, but we do risk that
pathological driver behavior might consume all of QEMU's file
descriptors, so we cap ourselves to 10 ioeventfds for this purpose.
In support of the above, generic ioeventfd infrastructure is added
for vfio quirks. This automatically initializes an ioeventfd list
per quirk, disables and frees ioeventfds on exit, and allows
ioeventfds marked as dynamic to be dropped on device reset. The
rationale for this latter feature is that useful ioeventfds may
depend on specific driver behavior and since we necessarily place a
cap on our use of ioeventfds, a machine reset is a reasonable point
at which to assume a new driver and re-profile.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Quirks can be self modifying, provide a hook to allow them to cleanup
on device reset if desired.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
During guest OS reboot, guest framebuffer is invalid. It will cause
bugs, if the invalid guest framebuffer is still used by host.
This patch is to introduce vfio_display_reset which is invoked
during vfio display reset. This vfio_display_reset function is used
to release the invalid display resource, disable scanout mode and
replace the invalid surface with QemuConsole's DisplaySurafce.
This patch can fix the GPU hang issue caused by gd_egl_draw during
guest OS reboot.
Changes v3->v4:
- Move dma-buf based display check into the vfio_display_reset().
(Gerd)
Changes v2->v3:
- Limit vfio_display_reset to dma-buf based vfio display. (Gerd)
Changes v1->v2:
- Use dpy_gfx_update_full() update screen after reset. (Gerd)
- Remove dpy_gfx_switch_surface(). (Gerd)
Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Message-id: 1524820266-27079-3-git-send-email-tina.zhang@intel.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
This adds a possibility for the platform to tell VFIO not to emulate MSIX
so MMIO memory regions do not get split into chunks in flatview and
the entire page can be registered as a KVM memory slot and make direct
MMIO access possible for the guest.
This enables the entire MSIX BAR mapping to the guest for the pseries
platform in order to achieve the maximum MMIO preformance for certain
devices.
Tested on:
LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
At the moment we unconditionally avoid mapping MSIX data of a BAR and
emulate MSIX table in QEMU. However it is 1) not always necessary as
a platform may provide a paravirt interface for MSIX configuration;
2) can affect the speed of MMIO access by emulating them in QEMU when
frequently accessed registers share same system page with MSIX data,
this is particularly a problem for systems with the page size bigger
than 4KB.
A new capability - VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - has been added
to the kernel [1] which tells the userspace that mapping of the MSIX data
is possible now. This makes use of it so from now on QEMU tries mapping
the entire BAR as a whole and emulate MSIX on top of that.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Infrastructure for display support. Must be enabled
using 'display' property.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed By: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
basename(3) and dirname(3) modify their argument and may return
pointers to statically allocated memory which may be overwritten by
subsequent calls.
g_path_get_basename and g_path_get_dirname have no such issues, and
therefore more preferable.
Signed-off-by: Julia Suvorova <jusual@mail.ru>
Message-Id: <1519888086-4207-1-git-send-email-jusual@mail.ru>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- new stats in virtio balloon
- virtio eventfd rework for boot speedup
- vhost memory rework for boot speedup
- fixes and cleanups all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJagxKDAAoJECgfDbjSjVRp5qAH/3gmgBaIzL3KRHd5i0RZifJv
PvyAVYgZd7h0+/1r9GM7guHKyEPZ08JtbHSm/HuDV4BD/Vf3/8joy8roExIfde2A
6k8fd6ANVQmE3t5zUxNXi9qiG4pO4xDIu4cMAbixzgN9x5ttlcfTw7fTT0e0VJxJ
8SN02/uCPPR/DY4/cpjah+slSyv6rBKT1v1ONy7djyRTYHi6h3Meoh05YfEALkwA
goxTKBZHi0L1IZ3HP/ZpXJDohQ5n2P09DX0fQgb8PgmW6WIWB/Qpi5pD53LZpMCV
n9waTF0U0ahneFd2FHo22QMMrwWvQyrjv+w5uXVr+qmHb/OyH2tUt7PgGF9+QKA=
=78s5
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
virtio,vhost,pci,pc: features, fixes and cleanups
- new stats in virtio balloon
- virtio eventfd rework for boot speedup
- vhost memory rework for boot speedup
- fixes and cleanups all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# gpg: Signature made Tue 13 Feb 2018 16:29:55 GMT
# gpg: using RSA key 281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>"
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>"
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream: (22 commits)
virtio-balloon: include statistics of disk/file caches
acpi-test: update FADT
lpc: drop pcie host dependency
tests: acpi: fix FADT not being compared to reference table
hw/pci-bridge: fix pcie root port's IO hints capability
libvhost-user: Support across-memory-boundary access
libvhost-user: Fix resource leak
virtio-balloon: unref the memory region before continuing
pci: removed the is_express field since a uniform interface was inserted
virtio-blk: enable multiple vectors when using multiple I/O queues
pci/bus: let it has higher migration priority
pci-bridge/i82801b11: clear bridge registers on platform reset
vhost: Move log_dirty check
vhost: Merge and delete unused callbacks
vhost: Clean out old vhost_set_memory and friends
vhost: Regenerate region list from changed sections list
vhost: Merge sections added to temporary list
vhost: Simplify ring verification checks
vhost: Build temporary section list and deref after commit
virtio: improve virtio devices initialization time
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
qemu-common.h includes qemu/option.h, but most places that include the
former don't actually need the latter. Drop the include, and add it
to the places that actually need it.
While there, drop superfluous includes of both headers, and
separate #include from file comment with a blank line.
This cleanup makes the number of objects depending on qemu/option.h
drop from 4545 (out of 4743) to 284 in my "build everything" tree.
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20180201111846.21846-20-armbru@redhat.com>
[Semantic conflict with commit bdd6a90a9e in block/nvme.c resolved]
according to Eduardo Habkost's commit fd3b02c889 all PCIEs now implement
INTERFACE_PCIE_DEVICE so we don't need is_express field anymore.
Devices that implements only INTERFACE_PCIE_DEVICE (is_express == 1)
or
devices that implements only INTERFACE_CONVENTIONAL_PCI_DEVICE (is_express == 0)
where not affected by the change.
The only devices that were affected are those that are hybrid and also
had (is_express == 1) - therefor only:
- hw/vfio/pci.c
- hw/usb/hcd-xhci.c
- hw/xen/xen_pt.c
For those 3 I made sure that QEMU_PCI_CAP_EXPRESS is on in instance_init()
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Yoni Bettan <ybettan@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
These quirks are necessary for GeForce, but not for Quadro/GRID/Tesla
assignment. Leaving them enabled is fully functional and provides the
most compatibility, but due to the unique NVIDIA MSI ACK behavior[1],
it also introduces latency in re-triggering the MSI interrupt. This
overhead is typically negligible, but has been shown to adversely
affect some (very) high interrupt rate applications. This adds the
vfio-pci device option "x-no-geforce-quirks=" which can be set to
"on" to disable this additional overhead.
A follow-on optimization for GeForce might be to make use of an
ioeventfd to allow KVM to trigger an irqfd in the kernel vfio-pci
driver, avoiding the bounce through userspace to handle this device
write.
[1] Background: the NVIDIA driver has been observed to issue a write
to the MMIO mirror of PCI config space in BAR0 in order to allow the
MSI interrupt for the device to retrigger. Older reports indicated a
write of 0xff to the (read-only) MSI capability ID register, while
more recently a write of 0x0 is observed at config space offset 0x704,
non-architected, extended config space of the device (BAR0 offset
0x88704). Virtualization of this range is only required for GeForce.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Recently proposed vfio-pci kernel changes (v4.16) remove the
restriction preventing userspace from mmap'ing PCI BARs in areas
overlapping the MSI-X vector table. This change is primarily intended
to benefit host platforms which make use of system page sizes larger
than the PCI spec recommendation for alignment of MSI-X data
structures (ie. not x86_64). In the case of POWER systems, the SPAPR
spec requires the VM to program MSI-X using hypercalls, rendering the
MSI-X vector table unused in the VM view of the device. However,
ARM64 platforms also support 64KB pages and rely on QEMU emulation of
MSI-X. Regardless of the kernel driver allowing mmaps overlapping
the MSI-X vector table, emulation of the MSI-X vector table also
prevents direct mapping of device MMIO spaces overlapping this page.
Thanks to the fact that PCI devices have a standard self discovery
mechanism, we can try to resolve this by relocating the MSI-X data
structures, either by creating a new PCI BAR or extending an existing
BAR and updating the MSI-X capability for the new location. There's
even a very slim chance that this could benefit devices which do not
adhere to the PCI spec alignment guidelines on x86_64 systems.
This new x-msix-relocation option accepts the following choices:
off: Disable MSI-X relocation, use native device config (default)
auto: Use a known good combination for the platform/device (none yet)
bar0..bar5: Specify the target BAR for MSI-X data structures
If compatible, the target BAR will either be created or extended and
the new portion will be used for MSI-X emulation.
The first obvious user question with this option is how to determine
whether a given platform and device might benefit from this option.
In most cases, the answer is that it won't, especially on x86_64.
Devices often dedicate an entire BAR to MSI-X and therefore no
performance sensitive registers overlap the MSI-X area. Take for
example:
# lspci -vvvs 0a:00.0
0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection
...
Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K]
Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K]
...
Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00002000
This device uses the 16K bar3 for MSI-X with the vector table at
offset zero and the pending bits arrary at offset 8K, fully honoring
the PCI spec alignment guidance. The data sheet specifically refers
to this as an MSI-X BAR. This device would not see a benefit from
MSI-X relocation regardless of the platform, regardless of the page
size.
However, here's another example:
# lspci -vvvs 02:00.0
02:00.0 Serial Attached SCSI controller: xxxxxxxx
...
Region 0: I/O ports at c000 [size=256]
Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K]
...
Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
Vector table: BAR=1 offset=0000e000
PBA: BAR=1 offset=0000f000
Here the MSI-X data structures are placed on separate 4K pages at the
end of a 64KB BAR. If our host page size is 4K, we're likely fine,
but at 64KB page size, MSI-X emulation at that location prevents the
entire BAR from being directly mapped into the VM address space.
Overlapping performance sensitive registers then starts to be a very
likely scenario on such a platform. At this point, the user could
enable tracing on vfio_region_read and vfio_region_write to determine
more conclusively if device accesses are being trapped through QEMU.
Upon finding a device and platform in need of MSI-X relocation, the
next problem is how to choose target PCI BAR to host the MSI-X data
structures. A few key rules to keep in mind for this selection
include:
* There are only 6 BAR slots, bar0..bar5
* 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot
* PCI BARs are always a power of 2 in size, extending == doubling
* The maximum size of a 32-bit BAR is 2GB
* MSI-X data structures must reside in an MMIO BAR
Using these rules, we can evaluate each BAR of the second example
device above as follows:
bar0: I/O port BAR, incompatible with MSI-X tables
bar1: BAR could be extended, incurring another 64KB of MMIO
bar2: Unavailable, bar1 is 64-bit, this register is used by bar1
bar3: BAR could be extended, incurring another 256KB of MMIO
bar4: Unavailable, bar3 is 64bit, this register is used by bar3
bar5: Available, empty BAR, minimum additional MMIO
A secondary optimization we might wish to make in relocating MSI-X
is to minimize the additional MMIO required for the device, therefore
we might test the available choices in order of preference as bar5,
bar1, and finally bar3. The original proposal for this feature
included an 'auto' option which would choose bar5 in this case, but
various drivers have been found that make assumptions about the
properties of the "first" BAR or the size of BARs such that there
appears to be no foolproof automatic selection available, requiring
known good combinations to be sourced from users. This patch is
pre-enabled for an 'auto' selection making use of a validated lookup
table, but no entries are yet identified.
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The kernel provides similar emulation of PCI BAR register access to
QEMU, so up until now we've used that for things like BAR sizing and
storing the BAR address. However, if we intend to resize BARs or add
BARs that don't exist on the physical device, we need to switch to the
pure QEMU emulation of the BAR.
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add one more layer to our stack of MemoryRegions, this base region
allows us to register BARs independently of the vfio region or to
extend the size of BARs which do map to a region. This will be
useful when we want hypervisor defined BARs or sections of BARs,
for purposes such as relocating MSI-X emulation. We therefore call
msix_init() based on this new base MemoryRegion, while the quirks,
which only modify regions still operate on those sub-MemoryRegions.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The bus pointer in PCIDevice is basically redundant with QOM information.
It's always initialized to the qdev_get_parent_bus(), the only difference
is the type.
Therefore this patch eliminates the field, instead creating a pci_get_bus()
helper to do the type mangling to derive it conveniently from the QOM
Device object underneath.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
The following devices support both PCI Express and Conventional
PCI, by including special code to handle the QEMU_PCI_CAP_EXPRESS
flag and/or conditional pcie_endpoint_cap_init() calls:
* vfio-pci (is_express=1, but legacy PCI handled by
vfio_populate_device())
* vmxnet3 (is_express=0, but PCIe handled by vmxnet3_realize())
* pvscsi (is_express=0, but PCIe handled by pvscsi_realize())
* virtio-pci (is_express=0, but PCIe handled by
virtio_pci_dc_realize(), and additional legacy PCI code at
virtio_pci_realize())
* base-xhci (is_express=1, but pcie_endpoint_cap_init() call
is conditional on pci_bus_is_express(dev->bus)
* Note that xhci does not clear QEMU_PCI_CAP_EXPRESS like the
other hybrid devices
Cc: Dmitry Fleytman <dmitry@daynix.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
NVIDIA has defined a specification for creating GPUDirect "cliques",
where devices with the same clique ID support direct peer-to-peer DMA.
When running on bare-metal, tools like NVIDIA's p2pBandwidthLatencyTest
(part of cuda-samples) determine which GPUs can support peer-to-peer
based on chipset and topology. When running in a VM, these tools have
no visibility to the physical hardware support or topology. This
option allows the user to specify hints via a vendor defined
capability. For instance:
<qemu:commandline>
<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev0.x-nv-gpudirect-clique=0'/>
<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev1.x-nv-gpudirect-clique=1'/>
<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev2.x-nv-gpudirect-clique=1'/>
</qemu:commandline>
This enables two cliques. The first is a singleton clique with ID 0,
for the first hostdev defined in the XML (note that since cliques
define peer-to-peer sets, singleton clique offer no benefit). The
subsequent two hostdevs are both added to clique ID 1, indicating
peer-to-peer is possible between these devices.
QEMU only provides validation that the clique ID is valid and applied
to an NVIDIA graphics device, any validation that the resulting
cliques are functional and valid is the user's responsibility. The
NVIDIA specification allows a 4-bit clique ID, thus valid values are
0-15.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If the hypervisor needs to add purely virtual capabilties, give us a
hook through quirks to do that. Note that we determine the maximum
size for a capability based on the physical device, if we insert a
virtual capability, that can change. Therefore if maximum size is
smaller after added virt capabilities, use that.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If vfio_add_std_cap() errors then going to out prepends irrelevant
errors for capabilities we haven't attempted to add as we unwind our
recursive stack. Just return error.
Fixes: 7ef165b9a8 ("vfio/pci: Pass an error object to vfio_add_capabilities")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
hw/vfio/pci.c:308:29: warning: Use of memory after it is freed
qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
^~~~
Reported-by: Clang Static Analyzer
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Intel 82599 VFs report a PCIe capability version of 0, which is
invalid. The earliest version of the PCIe spec used version 1. This
causes Windows to fail startup on the device and it will be disabled
with error code 10. Our choices are either to drop the PCIe cap on
such devices, which has the side effect of likely preventing the guest
from discovering any extended capabilities, or performing a fixup to
update the capability to the earliest valid version. This implements
the latter.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIOGroup.device_list is effectively our reference tracking mechanism
such that we can teardown a group when all of the device references
are removed. However, we also use this list from our machine reset
handler for processing resets that affect multiple devices. Generally
device removals are fully processed (exitfn + finalize) when this
reset handler is invoked, however if the removal is triggered via
another reset handler (piix4_reset->acpi_pcihp_reset) then the device
exitfn may run, but not finalize. In this case we hit asserts when
we start trying to access PCI helpers since much of the PCI state of
the device is released. To resolve this, add a pointer to the Object
DeviceState in our common base-device and skip non-realized devices
as we iterate.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
After the patch 'Make errp the last parameter of pci_add_capability()',
pci_add_capability() and pci_add_capability2() now do exactly the same.
So drop the wrapper pci_add_capability() of pci_add_capability2(), then
replace the pci_add_capability2() with pci_add_capability() everywhere.
Cc: pbonzini@redhat.com
Cc: rth@twiddle.net
Cc: ehabkost@redhat.com
Cc: mst@redhat.com
Cc: dmitry@daynix.com
Cc: jasowang@redhat.com
Cc: marcel@redhat.com
Cc: alex.williamson@redhat.com
Cc: armbru@redhat.com
Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Mao Zhongyi <maozy.fnst@cn.fujitsu.com>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When the "No host device provided" error occurs, the hint message
that starts with "Use -vfio-pci," makes no sense, since "-vfio-pci"
is not a valid command line parameter.
Correct this by replacing "-vfio-pci" with "-device vfio-pci".
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>