qemu/hw
Alex Williamson 89d5202edc vfio/pci: Allow relocating MSI-X MMIO
Recently proposed vfio-pci kernel changes (v4.16) remove the
restriction preventing userspace from mmap'ing PCI BARs in areas
overlapping the MSI-X vector table.  This change is primarily intended
to benefit host platforms which make use of system page sizes larger
than the PCI spec recommendation for alignment of MSI-X data
structures (ie. not x86_64).  In the case of POWER systems, the SPAPR
spec requires the VM to program MSI-X using hypercalls, rendering the
MSI-X vector table unused in the VM view of the device.  However,
ARM64 platforms also support 64KB pages and rely on QEMU emulation of
MSI-X.  Regardless of the kernel driver allowing mmaps overlapping
the MSI-X vector table, emulation of the MSI-X vector table also
prevents direct mapping of device MMIO spaces overlapping this page.
Thanks to the fact that PCI devices have a standard self discovery
mechanism, we can try to resolve this by relocating the MSI-X data
structures, either by creating a new PCI BAR or extending an existing
BAR and updating the MSI-X capability for the new location.  There's
even a very slim chance that this could benefit devices which do not
adhere to the PCI spec alignment guidelines on x86_64 systems.

This new x-msix-relocation option accepts the following choices:

  off: Disable MSI-X relocation, use native device config (default)
  auto: Use a known good combination for the platform/device (none yet)
  bar0..bar5: Specify the target BAR for MSI-X data structures

If compatible, the target BAR will either be created or extended and
the new portion will be used for MSI-X emulation.

The first obvious user question with this option is how to determine
whether a given platform and device might benefit from this option.
In most cases, the answer is that it won't, especially on x86_64.
Devices often dedicate an entire BAR to MSI-X and therefore no
performance sensitive registers overlap the MSI-X area.  Take for
example:

# lspci -vvvs 0a:00.0
0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection
	...
	Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K]
	Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K]
	...
	Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000

This device uses the 16K bar3 for MSI-X with the vector table at
offset zero and the pending bits arrary at offset 8K, fully honoring
the PCI spec alignment guidance.  The data sheet specifically refers
to this as an MSI-X BAR.  This device would not see a benefit from
MSI-X relocation regardless of the platform, regardless of the page
size.

However, here's another example:

# lspci -vvvs 02:00.0
02:00.0 Serial Attached SCSI controller: xxxxxxxx
	...
	Region 0: I/O ports at c000 [size=256]
	Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K]
	Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K]
	...
	Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=1 offset=0000e000
		PBA: BAR=1 offset=0000f000

Here the MSI-X data structures are placed on separate 4K pages at the
end of a 64KB BAR.  If our host page size is 4K, we're likely fine,
but at 64KB page size, MSI-X emulation at that location prevents the
entire BAR from being directly mapped into the VM address space.
Overlapping performance sensitive registers then starts to be a very
likely scenario on such a platform.  At this point, the user could
enable tracing on vfio_region_read and vfio_region_write to determine
more conclusively if device accesses are being trapped through QEMU.

Upon finding a device and platform in need of MSI-X relocation, the
next problem is how to choose target PCI BAR to host the MSI-X data
structures.  A few key rules to keep in mind for this selection
include:

 * There are only 6 BAR slots, bar0..bar5
 * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot
 * PCI BARs are always a power of 2 in size, extending == doubling
 * The maximum size of a 32-bit BAR is 2GB
 * MSI-X data structures must reside in an MMIO BAR

Using these rules, we can evaluate each BAR of the second example
device above as follows:

 bar0: I/O port BAR, incompatible with MSI-X tables
 bar1: BAR could be extended, incurring another 64KB of MMIO
 bar2: Unavailable, bar1 is 64-bit, this register is used by bar1
 bar3: BAR could be extended, incurring another 256KB of MMIO
 bar4: Unavailable, bar3 is 64bit, this register is used by bar3
 bar5: Available, empty BAR, minimum additional MMIO

A secondary optimization we might wish to make in relocating MSI-X
is to minimize the additional MMIO required for the device, therefore
we might test the available choices in order of preference as bar5,
bar1, and finally bar3.  The original proposal for this feature
included an 'auto' option which would choose bar5 in this case, but
various drivers have been found that make assumptions about the
properties of the "first" BAR or the size of BARs such that there
appears to be no foolproof automatic selection available, requiring
known good combinations to be sourced from users.  This patch is
pre-enabled for an 'auto' selection making use of a validated lookup
table, but no entries are yet identified.

Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-02-06 11:08:26 -07:00
..
9pfs tests: virtio-9p: add FLUSH operation test 2018-02-02 11:11:55 +01:00
acpi nvdimm: add 'unarmed' option 2018-01-19 11:18:51 -02:00
adc maint: Fix macros with broken 'do/while(0); ' usage 2018-01-16 14:54:52 +01:00
alpha Merge remote-tracking branch 'origin/master' into HEAD 2018-01-11 22:03:50 +02:00
arm hw/audio/wm8750: move WM8750 declarations from i2c/i2c.h to audio/wm8750.h 2018-02-02 08:19:25 +01:00
audio hw/audio/sb16.c: change dolog() to qemu_log_mask() 2018-02-02 08:19:47 +01:00
block Block layer patches 2018-01-24 22:55:57 +00:00
bt hw/bt: Replace fprintf(stderr, "*\n" with error_report() 2018-01-22 09:51:00 +01:00
char hw: convert the escc device to keycodemapdb 2018-01-29 09:30:25 +01:00
core qapi: Create DEFINE_PROP_OFF_AUTO_PCIBAR 2018-02-06 11:08:26 -07:00
cpu hw: use "qemu/osdep.h" as first #include in source files 2017-12-18 17:07:02 +03:00
cris cris: use generic cpu_model parsing 2017-10-27 16:03:54 +02:00
display virtio-gpu: disallow vIOMMU 2018-02-02 08:53:22 +01:00
dma aarch64-softmmu.mak: Use an ARM specific config 2018-01-26 11:09:09 +01:00
gpio Replace all occurances of __FUNCTION__ with __func__ 2018-01-22 09:46:18 +01:00
hppa hw/hppa: Use qemu_log_mask instead of fprintf to stderr 2018-02-04 14:11:03 -08:00
i2c Replace all occurances of __FUNCTION__ with __func__ 2018-01-22 09:46:18 +01:00
i386 tpm: add CRB device 2018-01-29 14:22:50 -05:00
ide Pull request for various patches that have been reviewed and 2018-01-23 10:15:09 +00:00
input input: switch devices to keycodemapdb, bugfixes. 2018-01-29 15:52:27 +00:00
intc xlnx-zynqmp-ipi: Initial version of the Xilinx IPI device 2018-01-26 11:09:09 +01:00
ipack pci: Add INTERFACE_CONVENTIONAL_PCI_DEVICE to Conventional PCI devices 2017-10-15 05:54:43 +03:00
ipmi ipmi: Allow BMC device properties to be set 2018-01-30 15:52:53 -06:00
isa hw/isa: Replace fprintf(stderr, "*\n" with error_report() 2018-01-22 09:51:00 +01:00
lm32 lm32: lm32_boards: use generic cpu_model parsing 2017-10-27 16:03:54 +02:00
m68k m68k: mcf5208: use generic cpu_model parsing 2017-10-27 16:03:54 +02:00
mem nvdimm: add 'unarmed' option 2018-01-19 11:18:51 -02:00
microblaze xlnx-zynqmp-pmu: Connect the IPI device to the PMU 2018-01-26 11:09:09 +01:00
mips Replace all occurances of __FUNCTION__ with __func__ 2018-01-22 09:46:18 +01:00
misc Replace all occurances of __FUNCTION__ with __func__ 2018-01-22 09:46:18 +01:00
moxie hw/moxie/moxiesim: Add support for loading a BIOS on moxiesim 2017-12-21 09:30:31 +01:00
net i.MX: Fix FEC/ENET receive funtions 2018-01-25 11:45:28 +00:00
nios2 nios2: remove duplicated includes (in code commented out) 2017-12-18 17:07:02 +03:00
nvram fw_cfg: fix memory corruption when all fw_cfg slots are used 2018-01-19 11:18:51 -02:00
openrisc openrisc: use generic cpu_model parsing 2017-10-27 16:03:54 +02:00
pci pci/shpc: Move function to generic header file 2018-01-18 21:52:38 +02:00
pci-bridge simba: rename PBMPCIBridge and QOM types to reflect simba naming 2018-01-24 19:19:50 +00:00
pci-host uninorth: convert to trace-events 2018-01-27 17:26:46 +11:00
pcmcia hw: Clean up includes 2016-01-29 15:07:25 +00:00
ppc spapr/iommu: Enable in-kernel TCE acceleration via VFIO KVM device 2018-02-06 11:08:24 -07:00
s390x s390x: fix storage attributes migration for non-small guests 2018-01-22 11:04:52 +01:00
scsi usb-storage: Fix share-rw option parsing 2018-01-26 07:58:34 +01:00
sd sdhci: fix a NULL pointer dereference due to uninitialized AddresSpace object 2018-01-25 11:45:30 +00:00
sh4 pci: Rename root bus initialization functions for clarity 2017-12-05 19:13:45 +02:00
smbios Merge remote-tracking branch 'origin/master' into HEAD 2018-01-11 22:03:50 +02:00
sparc sun4m: remove include/hw/sparc/sun4m.h and all references to it 2018-01-09 21:48:20 +00:00
sparc64 sun4u: implement power device 2018-01-25 13:39:39 +00:00
ssi xilinx_spips: Correct usage of an uninitialized local variable 2018-01-25 11:45:30 +00:00
timer Replace all occurances of __FUNCTION__ with __func__ 2018-01-22 09:46:18 +01:00
tpm tpm: tis: move one-line function into caller 2018-02-03 09:01:56 -05:00
tricore tricore: use generic cpu_model parsing 2017-10-27 16:04:27 +02:00
unicore32 hw/unicore32: restrict hw addr defines to source file 2017-12-18 17:07:02 +03:00
usb usb-ccid: convert CCIDCardClass::exitfn() -> unrealize() 2018-01-26 07:59:33 +01:00
vfio vfio/pci: Allow relocating MSI-X MMIO 2018-02-06 11:08:26 -07:00
virtio Revert "virtio: postpone the execution of event_notifier_cleanup function" 2018-01-24 19:20:19 +02:00
watchdog misc: drop old i386 dependency 2017-12-18 17:07:03 +03:00
xen xen: Add only xen-sysdev to dynamic sysbus device list 2018-01-19 11:18:51 -02:00
xenpv Replace all occurances of __FUNCTION__ with __func__ 2018-01-22 09:46:18 +01:00
xtensa target/xtensa: allow different default CPU for MMU/noMMU 2018-01-22 11:54:23 -08:00
Makefile.objs 9pfs: fix dependencies 2017-08-30 18:23:25 +02:00