vfio_container_ioctl() was a bad interface that bypassed abstraction
boundaries, had semantics that sat uneasily with its name, and was unsafe
in many realistic circumstances. Now that spapr-pci-vfio-host-bridge has
been folded into spapr-pci-host-bridge, there are no more users, so remove
it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
At present the code handling IBM's Enhanced Error Handling (EEH) interface
on VFIO devices operates by bypassing the usual VFIO logic with
vfio_container_ioctl(). That's a poorly designed interface with unclear
semantics about exactly what can be operated on.
In particular it operates on a single vfio container internally (hence the
name), but takes an address space and group id, from which it deduces the
container in a rather roundabout way. groupids are something that code
outside vfio shouldn't even be aware of.
This patch creates new interfaces for EEH operations. Internally we
have vfio_eeh_container_op() which takes a VFIOContainer object
directly. For external use we have vfio_eeh_as_ok() which determines
if an AddressSpace is usable for EEH (at present this means it has a
single container with exactly one group attached), and vfio_eeh_as_op()
which will perform an operation on an AddressSpace in the unambiguous case,
and otherwise returns an error.
This interface still isn't great, but it's enough of an improvement to
allow a number of cleanups in other places.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Devices like Intel graphics are known to not only have bad checksums,
but also the wrong device ID. This is not so surprising given that
the video BIOS is typically part of the system firmware image rather
that embedded into the device and needs to support any IGD device
installed into the system.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Match common vfio code with setup, exit, and finalize functions for
BAR, quirk, and VGA management. VGA is also changed to dynamic
allocation to match the other MemoryRegions.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Both platform and PCI vfio drivers create a "slow", I/O memory region
with one or more mmap memory regions overlayed when supported by the
device. Generalize this to a set of common helpers in the core that
pulls the region info from vfio, fills the region data, configures
slow mapping, and adds helpers for comleting the mmap, enable/disable,
and teardown. This can be immediately used by the PCI MSI-X code,
which needs to mmap around the MSI-X vector table.
This also changes VFIORegion.mem to be dynamically allocated because
otherwise we don't know how the caller has allocated VFIORegion and
therefore don't know whether to unreference it to destroy the
MemoryRegion or not.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In preparation for supporting capability chains on regions, wrap
ioctl(VFIO_DEVICE_GET_REGION_INFO) so we don't duplicate the code for
each caller.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio-pci currently requires a host= parameter, which comes in the
form of a PCI address in [domain:]<bus:slot.function> notation. We
expect to find a matching entry in sysfs for that under
/sys/bus/pci/devices/. vfio-platform takes a similar approach, but
defines the host= parameter to be a string, which can be matched
directly under /sys/bus/platform/devices/. On the PCI side, we have
some interest in using vfio to expose vGPU devices. These are not
actual discrete PCI devices, so they don't have a compatible host PCI
bus address or a device link where QEMU wants to look for it. There's
also really no requirement that vfio can only be used to expose
physical devices, a new vfio bus and iommu driver could expose a
completely emulated device. To fit within the vfio framework, it
would need a kernel struct device and associated IOMMU group, but
those are easy constraints to manage.
To support such devices, which would include vGPUs, that honor the
VFIO PCI programming API, but are not necessarily backed by a unique
PCI address, add support for specifying any device in sysfs. The
vfio API already has support for probing the device type to ensure
compatibility with either vfio-pci or vfio-platform.
With this, a vfio-pci device could either be specified as:
-device vfio-pci,host=02:00.0
or
-device vfio-pci,sysfsdev=/sys/devices/pci0000:00/0000:00:1c.0/0000:02:00.0
or even
-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:02:00.0
When vGPU support comes along, this might look something more like:
-device vfio-pci,sysfsdev=/sys/devices/virtual/intel-vgpu/vgpu0@0000:00:02.0
NB - This is only a made up example path
The same change is made for vfio-platform, specifying sysfsdev has
precedence over the old host option.
Tested-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
This just catches a couple of stragglers since I posted
the last clean-includes patchset last week.
Even PCI_CAP_FLAGS has the same value as PCI_MSIX_FLAGS, the later one is
the more proper on retrieving MSIX entries.
This patch uses PCI_MSIX_FLAGS to retrieve the MSIX entries.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch introduces the amd-xgbe VFIO platform device. It
allows the guest to do passthrough on a device exposing an
"amd,xgbe-seattle-v1a" compat string.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Use the macro PCI_CAP_LIST_NEXT instead of 1, so that the code would be
more self-explain.
This patch makes this change and also fixs one typo in comment.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
this function search the capability from the end, the last
size should 0x100 - pos, not 0xff - pos.
Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1453832250-766-22-git-send-email-peter.maydell@linaro.org
The PCI spec recommends devices use additional alignment for MSI-X
data structures to allow software to map them to separate processor
pages. One advantage of doing this is that we can emulate those data
structures without a significant performance impact to the operation
of the device. Some devices fail to implement that suggestion and
assigned device performance suffers.
One such case of this is a Mellanox MT27500 series, ConnectX-3 VF,
where the MSI-X vector table and PBA are aligned on separate 4K
pages. If PBA emulation is enabled, performance suffers. It's not
clear how much value we get from PBA emulation, but the solution here
is to only lazily enable the emulated PBA when a masked MSI-X vector
fires. We then attempt to more aggresively disable the PBA memory
region any time a vector is unmasked. The expectation is then that
a typical VM will run entirely with PBA emulation disabled, and only
when used is that emulation re-enabled.
Reported-by: Shyam Kaushik <shyam.kaushik@gmail.com>
Tested-by: Shyam Kaushik <shyam.kaushik@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
For quirks that support the full PCIe extended config space, limit the
quirk to only the size of config space available through vfio. This
allows host systems with broken MMCONFIG regions to still make use of
these quirks without generating bad address faults trying to access
beyond the end of config space exposed through vfio. This may expose
direct access to the mirror of extended config space, only trapping
the sub-range of standard config space, but allowing this makes the
quirk, and thus the device, functional. We expect that only device
specific accesses make use of the mirror, not general extended PCI
capability accesses, so any virtualization in this space is likely
unnecessary anyway, and the device is still IOMMU isolated, so it
should only be able to hurt itself through any bogus configurations
enabled by this space.
Link: https://www.redhat.com/archives/vfio-users/2015-November/msg00192.html
Reported-by: Ronnie Swanink <ronnie@ronnieswanink.nl>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
g_new(T, n) is neater than g_malloc(sizeof(T) * n). It's also safer,
for two reasons. One, it catches multiplication overflowing size_t.
Two, it returns T * rather than void *, which lets the compiler catch
more type errors.
This commit only touches allocations with size arguments of the form
sizeof(T). Same Coccinelle semantic patch as in commit b45c03f.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When we have a PCIe VM, such as Q35, guests start to care more about
valid configurations of devices relative to the VM view of the PCI
topology. Windows will error with a Code 10 for an assigned device if
a PCIe capability is found for a device on a conventional bus. We
also have the possibility of IOMMUs, like VT-d, where the where the
guest may be acutely aware of valid express capabilities on physical
hardware.
Some devices, like tg3 are adversely affected by this due to driver
dependencies on the PCIe capability. The only solution for such
devices is to attach them to an express capable bus in the VM.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In-kernel ITS emulation on ARM64 will require to supply requester IDs.
These IDs can now be retrieved from the device pointer using new
pci_requester_id() function.
This patch adds pci_dev pointer to KVM GSI routing functions and makes
callers passing it.
x86 architecture does not use requester IDs, but hw/i386/kvm/pci-assign.c
also made passing PCI device pointer instead of NULL for consistency with
the rest of the code.
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Message-Id: <ce081423ba2394a4efc30f30708fca07656bc500.1444916432.git.p.fedin@samsung.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
At present the memory listener used by vfio to keep host IOMMU mappings
in sync with the guest memory image assumes that if a guest IOMMU
appears, then it has no existing mappings.
This may not be true if a VFIO device is hotplugged onto a guest bus
which didn't previously include a VFIO device, and which has existing
guest IOMMU mappings.
Therefore, use the memory_region_register_iommu_notifier_replay()
function in order to fix this case, replaying existing guest IOMMU
mappings, bringing the host IOMMU into sync with the guest IOMMU.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Depending on the host IOMMU type we determine and record the available page
sizes for IOMMU translation. We'll need this for other validation in
future patches.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The current vfio core code assumes that the host IOMMU is capable of
mapping any IOVA the guest wants to use to where we need. However, real
IOMMUs generally only support translating a certain range of IOVAs (the
"DMA window") not a full 64-bit address space.
The common x86 IOMMUs support a wide enough range that guests are very
unlikely to go beyond it in practice, however the IOMMU used on IBM Power
machines - in the default configuration - supports only a much more limited
IOVA range, usually 0..2GiB.
If the guest attempts to set up an IOVA range that the host IOMMU can't
map, qemu won't report an error until it actually attempts to map a bad
IOVA. If guest RAM is being mapped directly into the IOMMU (i.e. no guest
visible IOMMU) then this will show up very quickly. If there is a guest
visible IOMMU, however, the problem might not show up until much later when
the guest actually attempt to DMA with an IOVA the host can't handle.
This patch adds a test so that we will detect earlier if the guest is
attempting to use IOVA ranges that the host IOMMU won't be able to deal
with.
For now, we assume that "Type1" (x86) IOMMUs can support any IOVA, this is
incorrect, but no worse than what we have already. We can't do better for
now because the Type1 kernel interface doesn't tell us what IOVA range the
IOMMU actually supports.
For the Power "sPAPR TCE" IOMMU, however, we can retrieve the supported
IOVA range and validate guest IOVA ranges against it, and this patch does
so.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a DMA mapping operation fails in vfio_listener_region_add() it
checks to see if we've already completed initial setup of the
container. If so it reports an error so the setup code can fail
gracefully, otherwise throws a hw_error().
There are other potential failure cases in vfio_listener_region_add()
which could benefit from the same logic, so move it to its own
fail: block. Later patches can use this to extend other failure cases
to fail as gracefully as possible under the circumstances.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently the VFIOContainer iommu_data field contains a union with
different information for different host iommu types. However:
* It only actually contains information for the x86-like "Type1" iommu
* Because we have a common listener the Type1 fields are actually used
on all IOMMU types, including the SPAPR TCE type as well
In fact we now have a general structure for the listener which is unlikely
to ever need per-iommu-type information, so this patch removes the union.
In a similar way we can unify the setup of the vfio memory listener in
vfio_connect_container() that is currently split across a switch on iommu
type, but is effectively the same in both cases.
The iommu_data.release pointer was only needed as a cleanup function
which would handle potentially different data in the union. With the
union gone, it too can be removed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In irqfd mode, current code attempts to set a resamplefd whatever
the type of the IRQ. For an edge-sensitive IRQ this attempt fails
and as a consequence, the whole irqfd setup fails and we fall back
to the slow mode. This patch bypasses the resamplefd setting for
non level-sentive IRQs.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
unmask EventNotifier might not be initialized in case of edge
sensitive irq. Using EventNotifier pointers make life simpler to
handle the edge-sensitive irqfd setup.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With current implementation, eventfd VFIO signaling is first set up and
then irqfd is setup, if supported and allowed.
This start sequence causes several issues with IRQ forwarding setup
which, if supported, is transparently attempted on irqfd setup:
IRQ forwarding setup is likely to fail if the IRQ is detected as under
injection into the guest (active at irqchip level or VFIO masked).
This currently always happens because the current sequence explicitly
VFIO-masks the IRQ before setting irqfd.
Even if that masking were removed, we couldn't prevent the case where
the IRQ is under injection into the guest.
So the simpler solution is to remove this 2-step startup and directly
attempt irqfd setup. This is what this patch does.
Also in case the eventfd setup fails, there is no reason to go farther:
let's abort.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Specifying an emulated PCI vendor/device ID can be useful for testing
various quirk paths, even though the behavior and functionality of
the device with bogus IDs is fully unsupportable. We need to use a
uint32_t for the vendor/device IDs, even though the registers
themselves are only 16-bit in order to be able to determine whether
the value is valid and user set.
The same support is added for subsystem vendor/device ID, though these
have the possibility of being useful and supported for more than a
testing tool. An emulated platform might want to impose their own
subsystem IDs or at least hide the physical subsystem ID. Windows
guests will often reinstall drivers due to a change in subsystem IDs,
something that VM users may want to avoid. Of course careful
attention would be required to ensure that guest drivers do not rely
on the subsystem ID as a basis for device driver quirks.
All of these options are added using the standard experimental option
prefix and should not be considered stable.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Simplify access to commonly referenced PCI vendor and device ID by
caching it on the VFIOPCIDevice struct.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is just another quirk, for reset rather than affecting memory
regions. Move it to our new quirks file.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Config windows make use of an address register and a data register.
In VGA cards, these are often used to provide real mode code in the
BIOS an easy way to access MMIO registers since the window often
resides in an I/O port register. When the MMIO register has a mirror
of PCI config space, we need to trap those accesses and redirect them
to emulated config space.
The previous version of this functionality made use of a single
MemoryRegion and single match address. This version uses separate
MemoryRegions for each of the address and data registers and allows
for multiple match addresses. This is useful for Nvidia cards which
have two ranges which index into PCI config space.
The previous implementation is left for the follow-on patch for a more
reviewable diff.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Another rework of this quirk, this time to update to the new quirk
structure. We can handle the address and data registers with
separate MemoryRegions and a quirk specific data structure, making the
code much more understandable.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The Nvidia 0x3d0 quirk makes use of a two separate registers and gives
us our first chance to make use of separate memory regions for each to
simplify the code a bit.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is an easy quirk that really doesn't need a data structure if
its own. We can pass vdev as the opaque data and access to the
MemoryRegion isn't required.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIOQuirk hosts a single memory region and a fixed set of data fields
that try to handle all the quirk cases, but end up making those that
don't exactly match really confusing. This patch introduces a struct
intended to provide more flexibility and simpler code. VFIOQuirk is
stripped to its basics, an opaque data pointer for quirk specific
data and a pointer to an array of MemoryRegions with a counter. This
still allows us to have common teardown routines, but adds much
greater flexibility to support multiple memory regions and quirk
specific data structures that are easier to maintain. The existing
VFIOQuirk is transformed into VFIOLegacyQuirk, which further patches
will eliminate entirely.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Create a vendor:device ID helper that we'll also use as we rework the
rest of the quirks. Re-reading the config entries, even if we get
more blacklist entries, is trivial overhead and only incurred during
device setup. There's no need to typedef the blacklist structure,
it's a static private data type used once. The elements get bumped
up to uint32_t to avoid future maintenance issues if PCI_ANY_ID gets
used for a blacklist entry (avoiding an actual hardware match). Our
test loop is also crying out to be simplified as a for loop.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The default should be to allow mmap and new drivers shouldn't need to
expose an option or set it to other than the allocation default in
their initfn. Take advantage of the experimental flag to change this
option to the correct polarity.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tracing is more effective when we can completely disable all KVM
bypass paths. Make these runtime rather than build-time configurable.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This allows vfio_msi* tracing. The MSI/X interrupt tracing is also
pulled out of #ifdef DEBUG_VFIO to avoid a recompile for tracing this
path. A few cycles to read the message is hardly anything if we're
already in QEMU.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Rename functions and tracing callbacks so that we can trace vfio_intx*
to see all the INTx related activities.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With the addition of the Chelsio quirk we have an error path out of
vfio_early_setup_msix() that doesn't free the allocated VFIOMSIXInfo
struct. This doesn't introduce a leak as it still gets freed in the
vfio_put_device() path, but it's complicated and sloppy to rely on
that. Restructure to free the allocated data on error and only link
it into the vdev on success.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
There's quite a bit of cleanup that can be done to the RTL8168 quirk,
as well as the tracing to prevent a spew of uninteresting accesses
for anything else the driver might choose to use the window registers
for besides the MSI-X table. There should be no functional change,
but it's now possible to get compact and useful traces by enabling
vfio_rtl8168_quirk*, ex:
vfio_rtl8168_quirk_write 0000:04:00.0 [address]: 0x1f000
vfio_rtl8168_quirk_read 0000:04:00.0 [address]: 0x8001f000
vfio_rtl8168_quirk_read 0000:04:00.0 [data]: 0xfee0100c
vfio_rtl8168_quirk_write 0000:04:00.0 [address]: 0x1f004
vfio_rtl8168_quirk_read 0000:04:00.0 [address]: 0x8001f004
vfio_rtl8168_quirk_read 0000:04:00.0 [data]: 0x0
vfio_rtl8168_quirk_write 0000:04:00.0 [address]: 0x1f008
vfio_rtl8168_quirk_read 0000:04:00.0 [address]: 0x8001f008
vfio_rtl8168_quirk_read 0000:04:00.0 [data]: 0x49b1
vfio_rtl8168_quirk_write 0000:04:00.0 [address]: 0x1f00c
vfio_rtl8168_quirk_read 0000:04:00.0 [address]: 0x8001f00c
vfio_rtl8168_quirk_read 0000:04:00.0 [data]: 0x0
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Minor cleanup.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>