qemu/hw
Alex Williamson 77ef8f8db2 pci: Use PCI aliases when determining device IOMMU address space
PCIe requester IDs are used by modern IOMMUs to differentiate devices
in order to provide a unique IOVA address space per device.  These
requester IDs are composed of the bus/device/function (BDF) of the
requesting device.  Conventional PCI pre-dates this concept and is
simply a shared parallel bus where transactions are claimed by
decoding target ranges rather than the packetized, point-to-point
mechanisms of PCI-express.  In order to interface conventional PCI
to PCIe, the PCIe-to-PCI bridge creates and accepts packetized
transactions on behalf of all downstream devices, using one of two
potential forms of a requester ID relating to the bridge itself or its
subordinate bus.  All downstream devices are therefore aliased by the
bridge's requester ID and it's not possible for the IOMMU to create
unique IOVA spaces for devices downstream of such buses.
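
As an illustration of the paragraph above, here is a minimal sketch of how
a 16-bit requester ID is packed from a BDF (8-bit bus, 5-bit device, 3-bit
function).  The function name requester_id() is hypothetical and chosen
for this example only; it is not a QEMU API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only (not a QEMU function).
 * Packs bus[15:8], device[7:3], function[2:0] into a requester ID.
 */
static uint16_t requester_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
    /* A PCI bus has at most 32 devices with 8 functions each. */
    assert(dev < 32 && fn < 8);
    return (uint16_t)((bus << 8) | (dev << 3) | fn);
}
```

With this layout, the two functions 01:00.0 and 01:00.1 from a PCIe
topology carry distinct requester IDs, which is what lets an IOMMU tell
them apart.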

At least that's how it works on bare metal.  Up until this point we've
ignored this nuance of vIOMMU support in QEMU, creating a unique
AddressSpace per device regardless of the virtual bus topology.

Aside from simply being true to bare metal behavior, there are aspects
of a shared address space that we can use to our advantage when
designing a VM.  For instance, a PCI device assignment scenario where
we have the following IOMMU group on the host system:

  $ ls  /sys/kernel/iommu_groups/1/devices/
  0000:00:01.0  0000:01:00.0  0000:01:00.1

An IOMMU group is considered the smallest set of devices which are
fully DMA isolated from other devices by the IOMMU.  In this case the
root port at 00:01.0 does not guarantee that it prevents peer-to-peer
traffic between the endpoints on bus 01, and the devices are therefore
grouped together.  VFIO considers an IOMMU group to be the smallest
unit of device ownership and allows only a single shared IOVA space
per group due to the limitations of the isolation.

Therefore, if we attempt to create the following VM, we get an error:

qemu-system-x86_64 -machine q35... \
  -device intel-iommu,intremap=on \
  -device pcie-root-port,addr=1e.0,id=pcie.1 \
  -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
  -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1

qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
0000:01:00.1: group 1 used in multiple address spaces

VFIO only allows a single IOVA space (AddressSpace) for both devices,
but we've placed them into a topology where the vIOMMU expects a
separate AddressSpace for each device.  On bare metal we know that
a conventional PCI bus would provide the sort of aliasing we need
here, forcing the IOMMU to consider these devices to be part of a
single shared IOVA space.  The support provided here does the same
for QEMU, such that we can create a conventional PCI topology to
expose equivalent AddressSpace sharing requirements to the VM:

qemu-system-x86_64 -machine q35... \
  -device intel-iommu,intremap=on \
  -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
  -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
  -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
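
The aliasing rule described above can be sketched as a walk from the
device up through its parent bridges.  This is a simplified model under
stated assumptions, not the code added by this commit: struct node,
enum bridge_kind and iommu_alias() are hypothetical names, and the
choice of alias per bridge type (secondary bus with devfn 0 for a
PCIe-to-PCI bridge, the bridge's own BDF for a conventional PCI bridge)
is one reading of the "two potential forms" mentioned earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified topology model; not QEMU's PCIBus/PCIDevice. */
enum bridge_kind {
    BRIDGE_NONE,        /* endpoint, not a bridge */
    BRIDGE_PCIE_PORT,   /* PCIe port: requester IDs pass through unchanged */
    BRIDGE_PCIE_TO_PCI, /* assumed alias: (secondary bus, devfn 0) */
    BRIDGE_PCI_TO_PCI   /* assumed alias: the bridge's own bus/devfn */
};

struct node {
    const struct node *parent; /* upstream bridge, NULL on the root bus */
    enum bridge_kind kind;
    uint8_t bus;               /* bus this node sits on */
    uint8_t devfn;             /* device[7:3], function[2:0] */
    uint8_t secondary_bus;     /* bus behind this node, if it is a bridge */
};

static uint16_t build_bdf(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)((bus << 8) | devfn);
}

/* Return the requester ID the IOMMU would see for DMA from dev. */
static uint16_t iommu_alias(const struct node *dev)
{
    assert(dev != NULL);
    uint16_t rid = build_bdf(dev->bus, dev->devfn);

    /* Each conventional bridge on the way up replaces the requester ID. */
    for (const struct node *br = dev->parent; br != NULL; br = br->parent) {
        switch (br->kind) {
        case BRIDGE_PCIE_TO_PCI:
            rid = build_bdf(br->secondary_bus, 0);
            break;
        case BRIDGE_PCI_TO_PCI:
            rid = build_bdf(br->bus, br->devfn);
            break;
        default:
            break; /* PCIe ports leave the requester ID untouched */
        }
    }
    return rid;
}
```

In this model, two functions placed behind a PCIe-to-PCI bridge resolve
to the same alias and would therefore share one AddressSpace, while the
same functions behind a PCIe root port keep distinct requester IDs.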

There are pros and cons to this configuration.  It's not necessarily
recommended; it's simply a tool we can use to create configurations
which may provide additional functionality in spite of host hardware
limitations or as a benefit to the guest configuration or resource
usage.  An incomplete list of pros and cons:

Cons:
 a) Extended PCI configuration space is unavailable to devices
    downstream of a conventional PCI bus.  The degree to which this
    is a drawback depends on the device and guest drivers.
 b) Applying this topology to devices which are already isolated by
    the host IOMMU (singleton IOMMU groups) will result in devices
    which appear to be non-isolated to the VM (non-singleton groups).
    This can limit configurations within the guest, such as userspace
    drivers or nested device assignment.

Pros:
 a) QEMU better emulates bare metal.
 b) Configurations as above are now possible.
 c) Host IOMMU resources and VM locked memory requirements are reduced
    in vIOMMU configurations due to shared IOMMU domains on the host
    and avoidance of duplicate locked memory accounting.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <157187083548.5439.14747141504058604843.stgit@gimli.home>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2019-11-05 04:04:21 -05:00
9pfs 9p: Use variable length suffixes for inode remapping 2019-10-10 11:36:23 +02:00
acpi hw/i386: split PCMachineState deriving X86MachineState from it 2019-10-22 09:39:50 +02:00
adc Include hw/hw.h exactly where needed 2019-08-16 13:31:52 +02:00
alpha hw: Move MC146818 device from hw/timer/ to hw/rtc/ subdirectory 2019-10-24 20:13:10 +02:00
arm hw/arm/boot: Rebuild hflags when modifying CPUState at boot 2019-11-01 20:41:00 +00:00
audio audio: remove audio_MIN, audio_MAX 2019-08-21 09:13:37 +02:00
block bootdevice: Gather LCHS from all relevant devices 2019-10-31 11:47:29 -04:00
bt Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
char virtio: basic packed virtqueue support 2019-10-25 07:46:22 -04:00
core TCG Plugins initial implementation 2019-10-30 14:10:32 +00:00
cpu hw/core: Move cpu.c, cpu.h from qom/ to hw/core/ 2019-08-21 13:24:01 +02:00
cris Include hw/hw.h exactly where needed 2019-08-16 13:31:52 +02:00
display hw/m68k: add Nubus macfb video card 2019-10-28 19:06:49 +01:00
dma hw/dma/xilinx_axidma.c: Switch to transaction-based ptimer API 2019-10-24 17:16:29 +01:00
gpio hw/gpio: Fix property accessors of the AST2600 GPIO 1.8V model 2019-10-24 17:16:27 +01:00
hppa hw: Move MC146818 device from hw/timer/ to hw/rtc/ subdirectory 2019-10-24 20:13:10 +02:00
hyperv Include hw/qdev-properties.h less 2019-08-16 13:31:53 +02:00
i2c aspeed/i2c: Add AST2600 support 2019-10-15 18:09:04 +01:00
i386 i386: implement IGNNE 2019-10-26 15:38:07 +02:00
ide bootdevice: Gather LCHS from all relevant devices 2019-10-31 11:47:29 -04:00
input hw/input/lm832x: Convert reset handler to DeviceReset 2019-10-15 18:18:08 -03:00
intc core: replace getpagesize() with qemu_real_host_page_size 2019-10-26 15:38:06 +02:00
ipack Include hw/qdev-properties.h less 2019-08-16 13:31:53 +02:00
ipmi ipmi: Add an SMBus IPMI interface 2019-09-20 14:08:10 -05:00
isa hw/isa/vt82c686: Convert reset handler to DeviceReset 2019-10-15 18:18:08 -03:00
lm32 Include hw/qdev-properties.h less 2019-08-16 13:31:53 +02:00
m68k hw/m68k: define Macintosh Quadra 800 2019-10-28 19:06:53 +01:00
mem memory-device: simplify Makefile.objs conditions 2019-10-22 09:38:42 +02:00
microblaze microblaze: fix leak of fdevice tree blob 2019-10-04 18:49:16 +02:00
mips hw: Move MC146818 device from hw/timer/ to hw/rtc/ subdirectory 2019-10-24 20:13:10 +02:00
misc Add Macintosh Quadra 800 machine in hw/m68k 2019-10-29 16:27:48 +00:00
moxie Include hw/hw.h exactly where needed 2019-08-16 13:31:52 +02:00
net virtio_net: use RCU_READ_LOCK_GUARD 2019-10-29 18:56:45 -04:00
nios2 Clean up inclusion of sysemu/sysemu.h 2019-08-16 13:31:53 +02:00
nubus hw/m68k: add Nubus support 2019-10-28 19:06:47 +01:00
nvram bootdevice: FW_CFG interface for LCHS values 2019-10-31 11:47:38 -04:00
openrisc Include hw/qdev-properties.h less 2019-08-16 13:31:53 +02:00
pci pci: Use PCI aliases when determining device IOMMU address space 2019-11-05 04:04:21 -05:00
pci-bridge numa: move numa global variable nb_numa_nodes into MachineState 2019-09-03 11:26:55 -03:00
pci-host hw/core: Add a config switch for the "or-irq" device 2019-08-20 09:11:17 +02:00
pcmcia Include hw/hw.h exactly where needed 2019-08-16 13:31:52 +02:00
ppc core: replace getpagesize() with qemu_real_host_page_size 2019-10-26 15:38:06 +02:00
rdma core: replace getpagesize() with qemu_real_host_page_size 2019-10-26 15:38:06 +02:00
riscv riscv/boot: Fix possible memory leak 2019-10-28 08:46:06 -07:00
rtc Merge commit 'df84f17' into HEAD 2019-10-26 15:38:02 +02:00
s390x target/s390x: Remove ilen parameter from s390_program_interrupt 2019-10-09 12:49:01 +02:00
scsi bootdevice: Gather LCHS from all relevant devices 2019-10-31 11:47:29 -04:00
sd hw/sd/sdhci: Add dummy Samsung SDHCI controller 2019-10-22 17:44:00 +01:00
semihosting Clean up inclusion of sysemu/sysemu.h 2019-08-16 13:31:53 +02:00
sh4 sysemu: Split sysemu/runstate.h off sysemu/sysemu.h 2019-08-16 13:37:36 +02:00
smbios smbios:ipmi: Ignore IPMI devices with no fwinfo function 2019-09-20 14:08:10 -05:00
sparc hw: Move M48T59 device from hw/timer/ to hw/rtc/ subdirectory 2019-10-24 20:20:45 +02:00
sparc64 hw: Move sun4v hypervisor RTC from hw/timer/ to hw/rtc/ subdirectory 2019-10-24 20:23:15 +02:00
ssi aspeed/smc: Add AST2600 support 2019-10-15 18:09:04 +01:00
timer Fix typos and docs, trivial changes and RTC devices split 2019-10-25 14:17:08 +01:00
tpm Include hw/qdev-properties.h less 2019-08-16 13:31:53 +02:00
tricore Clean up inclusion of sysemu/sysemu.h 2019-08-16 13:31:53 +02:00
unicore32 Include hw/irq.h a lot less 2019-08-16 13:31:52 +02:00
usb usbaudio: change playback counters to 64 bit 2019-10-18 08:14:05 +02:00
vfio vfio: unplug failover primary device before migration 2019-10-29 18:55:26 -04:00
virtio virtio: Use auto rcu_read macros 2019-10-29 18:56:45 -04:00
watchdog hw: wdt_aspeed: Add AST2600 support 2019-10-15 18:09:04 +01:00
xen xen-bus: only set the xen device frontend state if it is missing 2019-09-24 12:21:29 +01:00
xenpv Include sysemu/sysemu.h a lot less 2019-08-16 13:31:53 +02:00
xtensa hw/xtensa: add virt machine 2019-10-18 20:38:10 -07:00
Kconfig Add Macintosh Quadra 800 machine in hw/m68k 2019-10-29 16:27:48 +00:00
Makefile.objs Add Macintosh Quadra 800 machine in hw/m68k 2019-10-29 16:27:48 +00:00