The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found
in device MMIO space. Normally PCI config space is considered a slow
path and further optimization is unnecessary, however NVIDIA uses a
register here to enable the MSI interrupt to re-trigger. Exiting to
QEMU for this MSI-ACK handling can therefore rate limit our interrupt
handling. Fortunately the MSI-ACK write is easily detected since the
quirk MemoryRegion otherwise has very few accesses, so simply looking
for consecutive writes with the same data is sufficient, in this case
10 consecutive writes with the same data and size is arbitrarily
chosen. We configure the KVM ioeventfd with data match, so there's
no risk of triggering for the wrong data or size, but we do risk that
pathological driver behavior might consume all of QEMU's file
descriptors, so we cap ourselves to 10 ioeventfds for this purpose.
In support of the above, generic ioeventfd infrastructure is added
for vfio quirks. This automatically initializes an ioeventfd list
per quirk, disables and frees ioeventfds on exit, and allows
ioeventfds marked as dynamic to be dropped on device reset. The
rationale for this latter feature is that useful ioeventfds may
depend on specific driver behavior and since we necessarily place a
cap on our use of ioeventfds, a machine reset is a reasonable point
at which to assume a new driver and re-profile.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Quirks can be self modifying, provide a hook to allow them to cleanup
on device reset if desired.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This will later be used to include list initialization.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
These quirks are necessary for GeForce, but not for Quadro/GRID/Tesla
assignment. Leaving them enabled is fully functional and provides the
most compatibility, but due to the unique NVIDIA MSI ACK behavior[1],
it also introduces latency in re-triggering the MSI interrupt. This
overhead is typically negligible, but has been shown to adversely
affect some (very) high interrupt rate applications. This adds the
vfio-pci device option "x-no-geforce-quirks=" which can be set to
"on" to disable this additional overhead.
A follow-on optimization for GeForce might be to make use of an
ioeventfd to allow KVM to trigger an irqfd in the kernel vfio-pci
driver, avoiding the bounce through userspace to handle this device
write.
[1] Background: the NVIDIA driver has been observed to issue a write
to the MMIO mirror of PCI config space in BAR0 in order to allow the
MSI interrupt for the device to retrigger. Older reports indicated a
write of 0xff to the (read-only) MSI capability ID register, while
more recently a write of 0x0 is observed at config space offset 0x704,
non-architected, extended config space of the device (BAR0 offset
0x88704). Virtualization of this range is only required for GeForce.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add INTERFACE_CONVENTIONAL_PCI_DEVICE to all direct subtypes of
TYPE_PCI_DEVICE, except:
1) The ones that already have INTERFACE_PCIE_DEVICE set:
* base-xhci
* e1000e
* nvme
* pvscsi
* vfio-pci
* virtio-pci
* vmxnet3
2) base-pci-bridge
Not all PCI bridges are Conventional PCI devices, so
INTERFACE_CONVENTIONAL_PCI_DEVICE is added only to the subtypes
that are actually Conventional PCI:
* dec-21154-p2p-bridge
* i82801b11-bridge
* pbm-bridge
* pci-bridge
The direct subtypes of base-pci-bridge not touched by this patch
are:
* xilinx-pcie-root: Already marked as PCIe-only.
* pcie-pci-bridge: Already marked as PCIe-only.
* pcie-port: all non-abstract subtypes of pcie-port are already
marked as PCIe-only devices.
3) megasas-base
Not all megasas devices are Conventional PCI devices, so the
interface names are added to the subclasses registered by
megasas_register_types(), according to information in the
megasas_devices[] array.
"megasas-gen2" already implements INTERFACE_PCIE_DEVICE, so add
INTERFACE_CONVENTIONAL_PCI_DEVICE only to "megasas".
Acked-by: Alberto Garcia <berto@igalia.com>
Acked-by: John Snow <jsnow@redhat.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
NVIDIA has defined a specification for creating GPUDirect "cliques",
where devices with the same clique ID support direct peer-to-peer DMA.
When running on bare-metal, tools like NVIDIA's p2pBandwidthLatencyTest
(part of cuda-samples) determine which GPUs can support peer-to-peer
based on chipset and topology. When running in a VM, these tools have
no visibility to the physical hardware support or topology. This
option allows the user to specify hints via a vendor defined
capability. For instance:
<qemu:commandline>
<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev0.x-nv-gpudirect-clique=0'/>
<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev1.x-nv-gpudirect-clique=1'/>
<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev2.x-nv-gpudirect-clique=1'/>
</qemu:commandline>
This enables two cliques. The first is a singleton clique with ID 0,
for the first hostdev defined in the XML (note that since cliques
define peer-to-peer sets, singleton clique offer no benefit). The
subsequent two hostdevs are both added to clique ID 1, indicating
peer-to-peer is possible between these devices.
QEMU only provides validation that the clique ID is valid and applied
to an NVIDIA graphics device, any validation that the resulting
cliques are functional and valid is the user's responsibility. The
NVIDIA specification allows a 4-bit clique ID, thus valid values are
0-15.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If the hypervisor needs to add purely virtual capabilties, give us a
hook through quirks to do that. Note that we determine the maximum
size for a capability based on the physical device, if we insert a
virtual capability, that can change. Therefore if maximum size is
smaller after added virt capabilities, use that.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The NVIDIA BAR5 quirk is targeting an ioport BAR. Some older devices
have a BAR5 which is not ioport and can induce a segfault here. Test
the BAR type to skip these devices.
Link: https://bugs.launchpad.net/qemu/+bug/1678466
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This reverts commit c2b2e158cc.
The original patch intend to prevent linux i915 driver from using
stolen meory. But this patch breaks windows IGD driver loading on
Gen9+, as IGD HW will use stolen memory on Gen9+, once windows IGD
driver see zero size stolen memory, it will unload.
Meanwhile stolen memory will be disabled in 915 when i915 run as
a guest.
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
[aw: Gen9+ is SkyLake and newer]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Regardless of running in UPT or legacy mode, the guest igd
drivers may attempt to use stolen memory, however only legacy
mode has BIOS support for reserving stolen memmory in the
guest VM. We zero out the stolen memory size in all cases,
then guest igd driver won't use stolen memory.
In legacy mode, user could use x-igd-gms option to specify the
amount of stolen memory which will be pre-allocated and reserved
by bios for igd use.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99028https://bugs.freedesktop.org/show_bug.cgi?id=99025
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The device has "bridge" in its name, so it should obviously be in
the category DEVICE_CATEGORY_BRIDGE.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Previous IGD, up through Broadwell, only seem to write GTT values into
the first 1MB of space allocated for the BDSM, but clearly the GTT
can be multiple MB in size. Our test in vfio_igd_quirk_data_write()
correctly filters out indexes beyond 1MB, but given the 1MB mask we're
using, we re-apply writes only to the first 1MB of the guest allocated
BDSM.
We can't assume either the host or guest BDSM is naturally aligned, so
we can't simply apply a different mask. Instead, save the host BDSM
and do the arithmetic to subtract the host value to get the BDSM
offset and add it to the guest allocated BDSM.
Reported-by: Alexander Indenbaum <alexander.indenbaum@gmail.com>
Tested-by: Alexander Indenbaum <alexander.indenbaum@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Introductory comment for rtl8168 VFIO MSI-X quirk states:
At BAR2 offset 0x70 there is a dword data register,
offset 0x74 is a dword address register.
vfio: vfio_bar_read(0000:05:00.0:BAR2+0x70, 4) = 0xfee00398 // read data
Thus, correct offset for data read is 0x70,
but function vfio_rtl8168_quirk_data_read() wrongfully uses offset 0x74.
Signed-off-by: Thorsten Kohfeldt <thorsten.kohfeldt@gmx.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Pass an error object to prepare for migration to VFIO-PCI realize.
In vfio_probe_igd_bar4_quirk, simply report the error.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Pass an error object to prepare for the same operation in
vfio_populate_device. Eventually this contributes to the migration
to VFIO-PCI realize.
We now report an error on vfio_get_region_info failure.
vfio_probe_igd_bar4_quirk is not involved in the migration to realize
and simply calls error_reportf_err.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Commit 2d82f8a3cd ("vfio/pci: Convert all MemoryRegion to dynamic
alloc and consistent functions") converted VFIOPCIDevice.vga to be
dynamically allocted, negating the need for VFIOPCIDevice.has_vga.
Unfortunately not all of the has_vga users were converted, nor was
the field removed from the structure. Correct these oversights.
Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de>
Tested-by: Peter Maloney <peter.maloney@brockmann-consult.de>
Fixes: 2d82f8a3cd ("vfio/pci: Convert all MemoryRegion to dynamic alloc and consistent functions")
Fixes: https://bugs.launchpad.net/qemu/+bug/1591628
Cc: qemu-stable@nongnu.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The IGD OpRegion is enabled automatically when running in legacy mode,
but it can sometimes be useful in universal passthrough mode as well.
Without an OpRegion, output spigots don't work, and even though Intel
doesn't officially support physical outputs in UPT mode, it's a
useful feature. Note that if an OpRegion is enabled but a monitor is
not connected, some graphics features will be disabled in the guest
versus a headless system without an OpRegion, where they would work.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>
Tested-by: Gerd Hoffmann <kraxel@redhat.com>
Enable quirks to support SandyBridge and newer IGD devices as primary
VM graphics. This requires new vfio-pci device specific regions added
in kernel v4.6 to expose the IGD OpRegion, the shadow ROM, and config
space access to the PCI host bridge and LPC/ISA bridge. VM firmware
support, SeaBIOS only so far, is also required for reserving memory
regions for IGD specific use. In order to enable this mode, IGD must
be assigned to the VM at PCI bus address 00:02.0, it must have a ROM,
it must be able to enable VGA, it must have or be able to create on
its own an LPC/ISA bridge of the proper type at PCI bus address
00:1f.0 (sorry, not compatible with Q35 yet), and it must have the
above noted vfio-pci kernel features and BIOS. The intention is that
to enable this mode, a user simply needs to assign 00:02.0 from the
host to 00:02.0 in the VM:
-device vfio-pci,host=0000:00:02.0,bus=pci.0,addr=02.0
and everything either happens automatically or it doesn't. In the
case that it doesn't, we leave error reports, but assume the device
will operate in universal passthrough mode (UPT), which doesn't
require any of this, but has a much more narrow window of supported
devices, supported use cases, and supported guest drivers.
When using IGD in this mode, the VM firmware is required to reserve
some VM RAM for the OpRegion (on the order or several 4k pages) and
stolen memory for the GTT (up to 8MB for the latest GPUs). An
additional option, x-igd-gms allows the user to specify some amount
of additional memory (value is number of 32MB chunks up to 512MB) that
is pre-allocated for graphics use. TBH, I don't know of anything that
requires this or makes use of this memory, which is why we don't
allocate any by default, but the specification suggests this is not
actually a valid combination, so the option exists as a workaround.
Please report if it's actually necessary in some environment.
See code comments for further discussion about the actual operation
of the quirks necessary to assign these devices.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>
Tested-by: Gerd Hoffmann <kraxel@redhat.com>
Match common vfio code with setup, exit, and finalize functions for
BAR, quirk, and VGA management. VGA is also changed to dynamic
allocation to match the other MemoryRegions.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Both platform and PCI vfio drivers create a "slow", I/O memory region
with one or more mmap memory regions overlayed when supported by the
device. Generalize this to a set of common helpers in the core that
pulls the region info from vfio, fills the region data, configures
slow mapping, and adds helpers for comleting the mmap, enable/disable,
and teardown. This can be immediately used by the PCI MSI-X code,
which needs to mmap around the MSI-X vector table.
This also changes VFIORegion.mem to be dynamically allocated because
otherwise we don't know how the caller has allocated VFIORegion and
therefore don't know whether to unreference it to destroy the
MemoryRegion or not.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1453832250-766-22-git-send-email-peter.maydell@linaro.org
For quirks that support the full PCIe extended config space, limit the
quirk to only the size of config space available through vfio. This
allows host systems with broken MMCONFIG regions to still make use of
these quirks without generating bad address faults trying to access
beyond the end of config space exposed through vfio. This may expose
direct access to the mirror of extended config space, only trapping
the sub-range of standard config space, but allowing this makes the
quirk, and thus the device, functional. We expect that only device
specific accesses make use of the mirror, not general extended PCI
capability accesses, so any virtualization in this space is likely
unnecessary anyway, and the device is still IOMMU isolated, so it
should only be able to hurt itself through any bogus configurations
enabled by this space.
Link: https://www.redhat.com/archives/vfio-users/2015-November/msg00192.html
Reported-by: Ronnie Swanink <ronnie@ronnieswanink.nl>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
g_new(T, n) is neater than g_malloc(sizeof(T) * n). It's also safer,
for two reasons. One, it catches multiplication overflowing size_t.
Two, it returns T * rather than void *, which lets the compiler catch
more type errors.
This commit only touches allocations with size arguments of the form
sizeof(T). Same Coccinelle semantic patch as in commit b45c03f.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Specifying an emulated PCI vendor/device ID can be useful for testing
various quirk paths, even though the behavior and functionality of
the device with bogus IDs is fully unsupportable. We need to use a
uint32_t for the vendor/device IDs, even though the registers
themselves are only 16-bit in order to be able to determine whether
the value is valid and user set.
The same support is added for subsystem vendor/device ID, though these
have the possibility of being useful and supported for more than a
testing tool. An emulated platform might want to impose their own
subsystem IDs or at least hide the physical subsystem ID. Windows
guests will often reinstall drivers due to a change in subsystem IDs,
something that VM users may want to avoid. Of course careful
attention would be required to ensure that guest drivers do not rely
on the subsystem ID as a basis for device driver quirks.
All of these options are added using the standard experimental option
prefix and should not be considered stable.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Simplify access to commonly referenced PCI vendor and device ID by
caching it on the VFIOPCIDevice struct.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is just another quirk, for reset rather than affecting memory
regions. Move it to our new quirks file.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Config windows make use of an address register and a data register.
In VGA cards, these are often used to provide real mode code in the
BIOS an easy way to access MMIO registers since the window often
resides in an I/O port register. When the MMIO register has a mirror
of PCI config space, we need to trap those accesses and redirect them
to emulated config space.
The previous version of this functionality made use of a single
MemoryRegion and single match address. This version uses separate
MemoryRegions for each of the address and data registers and allows
for multiple match addresses. This is useful for Nvidia cards which
have two ranges which index into PCI config space.
The previous implementation is left for the follow-on patch for a more
reviewable diff.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Another rework of this quirk, this time to update to the new quirk
structure. We can handle the address and data registers with
separate MemoryRegions and a quirk specific data structure, making the
code much more understandable.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The Nvidia 0x3d0 quirk makes use of a two separate registers and gives
us our first chance to make use of separate memory regions for each to
simplify the code a bit.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is an easy quirk that really doesn't need a data structure if
its own. We can pass vdev as the opaque data and access to the
MemoryRegion isn't required.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIOQuirk hosts a single memory region and a fixed set of data fields
that try to handle all the quirk cases, but end up making those that
don't exactly match really confusing. This patch introduces a struct
intended to provide more flexibility and simpler code. VFIOQuirk is
stripped to its basics, an opaque data pointer for quirk specific
data and a pointer to an array of MemoryRegions with a counter. This
still allows us to have common teardown routines, but adds much
greater flexibility to support multiple memory regions and quirk
specific data structures that are easier to maintain. The existing
VFIOQuirk is transformed into VFIOLegacyQuirk, which further patches
will eliminate entirely.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Create a vendor:device ID helper that we'll also use as we rework the
rest of the quirks. Re-reading the config entries, even if we get
more blacklist entries, is trivial overhead and only incurred during
device setup. There's no need to typedef the blacklist structure,
it's a static private data type used once. The elements get bumped
up to uint32_t to avoid future maintenance issues if PCI_ANY_ID gets
used for a blacklist entry (avoiding an actual hardware match). Our
test loop is also crying out to be simplified as a for loop.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>