mirrors/qemu - qemu - SynapseOS git

Author	SHA1	Message	Date
Alex Williamson	db32d0f438	vfio/pci: Add option to disable GeForce quirks These quirks are necessary for GeForce, but not for Quadro/GRID/Tesla assignment. Leaving them enabled is fully functional and provides the most compatibility, but due to the unique NVIDIA MSI ACK behavior[1], it also introduces latency in re-triggering the MSI interrupt. This overhead is typically negligible, but has been shown to adversely affect some (very) high interrupt rate applications. This adds the vfio-pci device option "x-no-geforce-quirks=" which can be set to "on" to disable this additional overhead. A follow-on optimization for GeForce might be to make use of an ioeventfd to allow KVM to trigger an irqfd in the kernel vfio-pci driver, avoiding the bounce through userspace to handle this device write. [1] Background: the NVIDIA driver has been observed to issue a write to the MMIO mirror of PCI config space in BAR0 in order to allow the MSI interrupt for the device to retrigger. Older reports indicated a write of 0xff to the (read-only) MSI capability ID register, while more recently a write of 0x0 is observed at config space offset 0x704, non-architected, extended config space of the device (BAR0 offset 0x88704). Virtualization of this range is only required for GeForce. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-02-06 11:08:27 -07:00
Alex Williamson	89d5202edc	vfio/pci: Allow relocating MSI-X MMIO Recently proposed vfio-pci kernel changes (v4.16) remove the restriction preventing userspace from mmap'ing PCI BARs in areas overlapping the MSI-X vector table. This change is primarily intended to benefit host platforms which make use of system page sizes larger than the PCI spec recommendation for alignment of MSI-X data structures (ie. not x86_64). In the case of POWER systems, the SPAPR spec requires the VM to program MSI-X using hypercalls, rendering the MSI-X vector table unused in the VM view of the device. However, ARM64 platforms also support 64KB pages and rely on QEMU emulation of MSI-X. Regardless of the kernel driver allowing mmaps overlapping the MSI-X vector table, emulation of the MSI-X vector table also prevents direct mapping of device MMIO spaces overlapping this page. Thanks to the fact that PCI devices have a standard self discovery mechanism, we can try to resolve this by relocating the MSI-X data structures, either by creating a new PCI BAR or extending an existing BAR and updating the MSI-X capability for the new location. There's even a very slim chance that this could benefit devices which do not adhere to the PCI spec alignment guidelines on x86_64 systems. This new x-msix-relocation option accepts the following choices: off: Disable MSI-X relocation, use native device config (default) auto: Use a known good combination for the platform/device (none yet) bar0..bar5: Specify the target BAR for MSI-X data structures If compatible, the target BAR will either be created or extended and the new portion will be used for MSI-X emulation. The first obvious user question with this option is how to determine whether a given platform and device might benefit from this option. In most cases, the answer is that it won't, especially on x86_64. Devices often dedicate an entire BAR to MSI-X and therefore no performance sensitive registers overlap the MSI-X area. Take for example: # lspci -vvvs 0a:00.0 0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection ... Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K] Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K] ... Capabilities: [70] MSI-X: Enable+ Count=10 Masked- Vector table: BAR=3 offset=00000000 PBA: BAR=3 offset=00002000 This device uses the 16K bar3 for MSI-X with the vector table at offset zero and the pending bits arrary at offset 8K, fully honoring the PCI spec alignment guidance. The data sheet specifically refers to this as an MSI-X BAR. This device would not see a benefit from MSI-X relocation regardless of the platform, regardless of the page size. However, here's another example: # lspci -vvvs 02:00.0 02:00.0 Serial Attached SCSI controller: xxxxxxxx ... Region 0: I/O ports at c000 [size=256] Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K] Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K] ... Capabilities: [c0] MSI-X: Enable+ Count=16 Masked- Vector table: BAR=1 offset=0000e000 PBA: BAR=1 offset=0000f000 Here the MSI-X data structures are placed on separate 4K pages at the end of a 64KB BAR. If our host page size is 4K, we're likely fine, but at 64KB page size, MSI-X emulation at that location prevents the entire BAR from being directly mapped into the VM address space. Overlapping performance sensitive registers then starts to be a very likely scenario on such a platform. At this point, the user could enable tracing on vfio_region_read and vfio_region_write to determine more conclusively if device accesses are being trapped through QEMU. Upon finding a device and platform in need of MSI-X relocation, the next problem is how to choose target PCI BAR to host the MSI-X data structures. A few key rules to keep in mind for this selection include: * There are only 6 BAR slots, bar0..bar5 * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot * PCI BARs are always a power of 2 in size, extending == doubling * The maximum size of a 32-bit BAR is 2GB * MSI-X data structures must reside in an MMIO BAR Using these rules, we can evaluate each BAR of the second example device above as follows: bar0: I/O port BAR, incompatible with MSI-X tables bar1: BAR could be extended, incurring another 64KB of MMIO bar2: Unavailable, bar1 is 64-bit, this register is used by bar1 bar3: BAR could be extended, incurring another 256KB of MMIO bar4: Unavailable, bar3 is 64bit, this register is used by bar3 bar5: Available, empty BAR, minimum additional MMIO A secondary optimization we might wish to make in relocating MSI-X is to minimize the additional MMIO required for the device, therefore we might test the available choices in order of preference as bar5, bar1, and finally bar3. The original proposal for this feature included an 'auto' option which would choose bar5 in this case, but various drivers have been found that make assumptions about the properties of the "first" BAR or the size of BARs such that there appears to be no foolproof automatic selection available, requiring known good combinations to be sourced from users. This patch is pre-enabled for an 'auto' selection making use of a validated lookup table, but no entries are yet identified. Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-02-06 11:08:26 -07:00
Alex Williamson	3a286732d1	vfio/pci: Add base BAR MemoryRegion Add one more layer to our stack of MemoryRegions, this base region allows us to register BARs independently of the vfio region or to extend the size of BARs which do map to a region. This will be useful when we want hypervisor defined BARs or sections of BARs, for purposes such as relocating MSI-X emulation. We therefore call msix_init() based on this new base MemoryRegion, while the quirks, which only modify regions still operate on those sub-MemoryRegions. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-02-06 11:08:25 -07:00
Alex Williamson	edd0927893	vfio/pci: Fixup VFIOMSIXInfo comment The fields were removed in the referenced commit, but the comment still mentions them. Fixes: `2fb9636ebf` ("vfio-pci: Remove unused fields from VFIOMSIXInfo") Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-02-06 11:08:25 -07:00
Alexey Kardashevskiy	2fb9636ebf	vfio-pci: Remove unused fields from VFIOMSIXInfo When support for multiple mappings per a region were added, this was left behind, let's finish and remove unused bits. Fixes: `db0da029a1` ("vfio: Generalize region support") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2017-12-13 10:19:34 -07:00
Alex Williamson	dfbee78db8	vfio/pci: Add NVIDIA GPUDirect Cliques support NVIDIA has defined a specification for creating GPUDirect "cliques", where devices with the same clique ID support direct peer-to-peer DMA. When running on bare-metal, tools like NVIDIA's p2pBandwidthLatencyTest (part of cuda-samples) determine which GPUs can support peer-to-peer based on chipset and topology. When running in a VM, these tools have no visibility to the physical hardware support or topology. This option allows the user to specify hints via a vendor defined capability. For instance: <qemu:commandline> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev0.x-nv-gpudirect-clique=0'/> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev1.x-nv-gpudirect-clique=1'/> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev2.x-nv-gpudirect-clique=1'/> </qemu:commandline> This enables two cliques. The first is a singleton clique with ID 0, for the first hostdev defined in the XML (note that since cliques define peer-to-peer sets, singleton clique offer no benefit). The subsequent two hostdevs are both added to clique ID 1, indicating peer-to-peer is possible between these devices. QEMU only provides validation that the clique ID is valid and applied to an NVIDIA graphics device, any validation that the resulting cliques are functional and valid is the user's responsibility. The NVIDIA specification allows a 4-bit clique ID, thus valid values are 0-15. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2017-10-03 12:57:36 -06:00
Alex Williamson	e3f79f3bd4	vfio/pci: Add virtual capabilities quirk infrastructure If the hypervisor needs to add purely virtual capabilties, give us a hook through quirks to do that. Note that we determine the maximum size for a capability based on the physical device, if we insert a virtual capability, that can change. Therefore if maximum size is smaller after added virt capabilities, use that. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2017-10-03 12:57:36 -06:00
Eric Auger	7237011d05	vfio/pci: Pass an error object to vfio_pci_igd_opregion_init Pass an error object to prepare for migration to VFIO-PCI realize. In vfio_probe_igd_bar4_quirk, simply report the error. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2016-10-17 10:57:59 -06:00
Eric Auger	cde4279baa	vfio/pci: Pass an error object to vfio_populate_vga Pass an error object to prepare for the same operation in vfio_populate_device. Eventually this contributes to the migration to VFIO-PCI realize. We now report an error on vfio_get_region_info failure. vfio_probe_igd_bar4_quirk is not involved in the migration to realize and simply calls error_reportf_err. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2016-10-17 10:57:57 -06:00
Alex Williamson	4d3fc4fdc6	vfio/pci: Fix VGA quirks Commit `2d82f8a3cd` ("vfio/pci: Convert all MemoryRegion to dynamic alloc and consistent functions") converted VFIOPCIDevice.vga to be dynamically allocted, negating the need for VFIOPCIDevice.has_vga. Unfortunately not all of the has_vga users were converted, nor was the field removed from the structure. Correct these oversights. Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de> Tested-by: Peter Maloney <peter.maloney@brockmann-consult.de> Fixes: `2d82f8a3cd` ("vfio/pci: Convert all MemoryRegion to dynamic alloc and consistent functions") Fixes: https://bugs.launchpad.net/qemu/+bug/1591628 Cc: qemu-stable@nongnu.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2016-06-30 13:00:22 -06:00
Alex Williamson	6ced0bba70	vfio/pci: Add a separate option for IGD OpRegion support The IGD OpRegion is enabled automatically when running in legacy mode, but it can sometimes be useful in universal passthrough mode as well. Without an OpRegion, output spigots don't work, and even though Intel doesn't officially support physical outputs in UPT mode, it's a useful feature. Note that if an OpRegion is enabled but a monitor is not connected, some graphics features will be disabled in the guest versus a headless system without an OpRegion, where they would work. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Gerd Hoffmann <kraxel@redhat.com> Tested-by: Gerd Hoffmann <kraxel@redhat.com>	2016-05-26 11:12:03 -06:00
Alex Williamson	c4c45e943e	vfio/pci: Intel graphics legacy mode assignment Enable quirks to support SandyBridge and newer IGD devices as primary VM graphics. This requires new vfio-pci device specific regions added in kernel v4.6 to expose the IGD OpRegion, the shadow ROM, and config space access to the PCI host bridge and LPC/ISA bridge. VM firmware support, SeaBIOS only so far, is also required for reserving memory regions for IGD specific use. In order to enable this mode, IGD must be assigned to the VM at PCI bus address 00:02.0, it must have a ROM, it must be able to enable VGA, it must have or be able to create on its own an LPC/ISA bridge of the proper type at PCI bus address 00:1f.0 (sorry, not compatible with Q35 yet), and it must have the above noted vfio-pci kernel features and BIOS. The intention is that to enable this mode, a user simply needs to assign 00:02.0 from the host to 00:02.0 in the VM: -device vfio-pci,host=0000:00:02.0,bus=pci.0,addr=02.0 and everything either happens automatically or it doesn't. In the case that it doesn't, we leave error reports, but assume the device will operate in universal passthrough mode (UPT), which doesn't require any of this, but has a much more narrow window of supported devices, supported use cases, and supported guest drivers. When using IGD in this mode, the VM firmware is required to reserve some VM RAM for the OpRegion (on the order or several 4k pages) and stolen memory for the GTT (up to 8MB for the latest GPUs). An additional option, x-igd-gms allows the user to specify some amount of additional memory (value is number of 32MB chunks up to 512MB) that is pre-allocated for graphics use. TBH, I don't know of anything that requires this or makes use of this memory, which is why we don't allocate any by default, but the specification suggests this is not actually a valid combination, so the option exists as a workaround. Please report if it's actually necessary in some environment. See code comments for further discussion about the actual operation of the quirks necessary to assign these devices. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Gerd Hoffmann <kraxel@redhat.com> Tested-by: Gerd Hoffmann <kraxel@redhat.com>	2016-05-26 11:12:01 -06:00
Alex Williamson	e593c0211b	vfio/pci: Split out VGA setup This could be setup later by device specific code, such as IGD initialization. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2016-03-10 20:50:41 -07:00
Alex Williamson	2d82f8a3cd	vfio/pci: Convert all MemoryRegion to dynamic alloc and consistent functions Match common vfio code with setup, exit, and finalize functions for BAR, quirk, and VGA management. VGA is also changed to dynamic allocation to match the other MemoryRegions. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2016-03-10 20:50:38 -07:00
Alex Williamson	95239e1625	vfio/pci: Lazy PBA emulation The PCI spec recommends devices use additional alignment for MSI-X data structures to allow software to map them to separate processor pages. One advantage of doing this is that we can emulate those data structures without a significant performance impact to the operation of the device. Some devices fail to implement that suggestion and assigned device performance suffers. One such case of this is a Mellanox MT27500 series, ConnectX-3 VF, where the MSI-X vector table and PBA are aligned on separate 4K pages. If PBA emulation is enabled, performance suffers. It's not clear how much value we get from PBA emulation, but the solution here is to only lazily enable the emulated PBA when a masked MSI-X vector fires. We then attempt to more aggresively disable the PBA memory region any time a vector is unmasked. The expectation is then that a typical VM will run entirely with PBA emulation disabled, and only when used is that emulation re-enabled. Reported-by: Shyam Kaushik <shyam.kaushik@gmail.com> Tested-by: Shyam Kaushik <shyam.kaushik@gmail.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2016-01-19 11:33:42 -07:00
Alex Williamson	89dcccc593	vfio/pci: Add emulated PCI IDs Specifying an emulated PCI vendor/device ID can be useful for testing various quirk paths, even though the behavior and functionality of the device with bogus IDs is fully unsupportable. We need to use a uint32_t for the vendor/device IDs, even though the registers themselves are only 16-bit in order to be able to determine whether the value is valid and user set. The same support is added for subsystem vendor/device ID, though these have the possibility of being useful and supported for more than a testing tool. An emulated platform might want to impose their own subsystem IDs or at least hide the physical subsystem ID. Windows guests will often reinstall drivers due to a change in subsystem IDs, something that VM users may want to avoid. Of course careful attention would be required to ensure that guest drivers do not rely on the subsystem ID as a basis for device driver quirks. All of these options are added using the standard experimental option prefix and should not be considered stable. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2015-09-23 13:04:49 -06:00
Alex Williamson	ff635e3775	vfio/pci: Cache vendor and device ID Simplify access to commonly referenced PCI vendor and device ID by caching it on the VFIOPCIDevice struct. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2015-09-23 13:04:49 -06:00
Alex Williamson	c9c5000991	vfio/pci: Move AMD device specific reset to quirks This is just another quirk, for reset rather than affecting memory regions. Move it to our new quirks file. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2015-09-23 13:04:49 -06:00
Alex Williamson	958d553405	vfio/pci: Remove old config window and mirror quirks These are now unused. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2015-09-23 13:04:48 -06:00
Alex Williamson	8c4f234853	vfio/pci: Foundation for new quirk structure VFIOQuirk hosts a single memory region and a fixed set of data fields that try to handle all the quirk cases, but end up making those that don't exactly match really confusing. This patch introduces a struct intended to provide more flexibility and simpler code. VFIOQuirk is stripped to its basics, an opaque data pointer for quirk specific data and a pointer to an array of MemoryRegions with a counter. This still allows us to have common teardown routines, but adds much greater flexibility to support multiple memory regions and quirk specific data structures that are easier to maintain. The existing VFIOQuirk is transformed into VFIOLegacyQuirk, which further patches will eliminate entirely. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2015-09-23 13:04:46 -06:00
Alex Williamson	056dfcb695	vfio/pci: Cleanup ROM blacklist quirk Create a vendor:device ID helper that we'll also use as we rework the rest of the quirks. Re-reading the config entries, even if we get more blacklist entries, is trivial overhead and only incurred during device setup. There's no need to typedef the blacklist structure, it's a static private data type used once. The elements get bumped up to uint32_t to avoid future maintenance issues if PCI_ANY_ID gets used for a blacklist entry (avoiding an actual hardware match). Our test loop is also crying out to be simplified as a for loop. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2015-09-23 13:04:45 -06:00
Alex Williamson	c00d61d8fa	vfio/pci: Split quirks to a separate file Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2015-09-23 13:04:45 -06:00
Alex Williamson	78f33d2bfd	vfio/pci: Extract PCI structures to a separate header Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2015-09-23 13:04:44 -06:00

23 Commits