That's never a good place to stop QEMU process... Since now we have
both the machine done sanity check and also the hotplug handler, we
can safely remove this to avoid that.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20190916080718.3299-5-peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This check was previously only happened when the IOMMU is enabled in
the guest. It was always too late because the enabling of IOMMU
normally only happens during the boot of guest OS. It means that we
can bail out and exit directly during the guest OS boots if the
configuration of devices are not supported. Or, if the guest didn't
enable vIOMMU at all, then the user can use the guest normally but as
long as it reconfigure the guest OS to enable the vIOMMU then reboot,
the user will see the panic right after the reset when the next boot
starts.
Let's make this failure even earlier so that we force the user to use
caching-mode for vfio-pci devices when with the vIOMMU. So the user
won't get surprise at least during execution of the guest, which seems
a bit nicer.
This will affect some user who didn't enable vIOMMU in the guest OS
but was using vfio-pci and the vtd device in the past. However I hope
it's not a majority because not enabling vIOMMU with the device
attached is actually meaningless.
We still keep the old assertion for safety so far because the hotplug
path could still reach it, so far.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20190916080718.3299-2-peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In my "build everything" tree, changing hw/qdev-properties.h triggers
a recompile of some 2700 out of 6600 objects (not counting tests and
objects that don't depend on qemu/osdep.h).
Many places including hw/qdev-properties.h (directly or via hw/qdev.h)
actually need only hw/qdev-core.h. Include hw/qdev-core.h there
instead.
hw/qdev.h is actually pointless: all it does is include hw/qdev-core.h
and hw/qdev-properties.h, which in turn includes hw/qdev-core.h.
Replace the remaining uses of hw/qdev.h by hw/qdev-properties.h.
While there, delete a few superfluous inclusions of hw/qdev-core.h.
Touching hw/qdev-properties.h now recompiles some 1200 objects.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20190812052359.30071-22-armbru@redhat.com>
In my "build everything" tree, changing qemu/main-loop.h triggers a
recompile of some 5600 out of 6600 objects (not counting tests and
objects that don't depend on qemu/osdep.h). It includes block/aio.h,
which in turn includes qemu/event_notifier.h, qemu/notify.h,
qemu/processor.h, qemu/qsp.h, qemu/queue.h, qemu/thread-posix.h,
qemu/thread.h, qemu/timer.h, and a few more.
Include qemu/main-loop.h only where it's needed. Touching it now
recompiles only some 1700 objects. For block/aio.h and
qemu/event_notifier.h, these numbers drop from 5600 to 2800. For the
others, they shrink only slightly.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190812052359.30071-21-armbru@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
In my "build everything" tree, changing migration/vmstate.h triggers a
recompile of some 2700 out of 6600 objects (not counting tests and
objects that don't depend on qemu/osdep.h).
hw/hw.h supposedly includes it for convenience. Several other headers
include it just to get VMStateDescription. The previous commit made
that unnecessary.
Include migration/vmstate.h only where it's still needed. Touching it
now recompiles only some 1600 objects.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-Id: <20190812052359.30071-16-armbru@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
This is an replacement work of Yan Zhao's patch:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg625340.html
vtd_address_space_unmap() will do proper page mask alignment to make
sure each IOTLB message will have correct masks for notification
messages (2^N-1), but sometimes it can be expanded to even supercede
the registered range. That could lead to unexpected UNMAP of already
mapped regions in some other notifiers.
Instead of doing mindless expension of the start address and address
mask, we split the range into smaller ones and guarantee that each
small range will have correct masks (2^N-1) and at the same time we
should also try our best to generate as less IOTLB messages as
possible.
Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Message-Id: <20190624091811.30412-3-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
IOMMUNotifier is with inclusive ranges, so we should check
against (VTD_ADDRESS_SIZE(s->aw_bits) - 1).
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
[peterx: split from another bigger patch]
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20190624091811.30412-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VTD_RTADDR_RTT is dropped even by the VT-d spec, so QEMU should
probably do the same thing (after all we never really implemented it).
Since we've had a field for that in the migration stream, to keep
compatibility we need to fill the hole up.
Please refer to VT-d spec 10.4.6.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20190329061422.7926-3-peterx@redhat.com>
Reviewed-by: Liu, Yi L <yi.l.liu@intel.com>
Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When introducing the initial support for scalable mode we added a
new field into vmstate however we blindly migrate that field without
notice. That'll break migration no matter forward or backward.
The normal way should be that we use something like
VMSTATE_UINT32_TEST() or subsections for the new vmstate field however
for this case of vt-d we can even make it simpler because we've
already migrated all the registers and it'll be fairly simple that we
re-generate root_scalable field from the register values during post
load of the device.
Fixes: fb43cf739e ("intel_iommu: scalable mode emulation")
Reviewed-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20190329061422.7926-2-peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
If we try to use the intel-iommu device with vfio-pci devices without
caching mode enabled, we're told:
qemu-system-x86_64: We need to set caching-mode=1 for intel-iommu to enable
device assignment with IOMMU protection.
But to enable caching mode, the option is actually "caching-mode=on".
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <155364147432.16467.15898335025013220939.stgit@gimli.home>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Alex Williamson <<a href="mailto:alex.williamson@redhat.com" target="_blank" rel="noreferrer">alex.williamson@redhat.com</a>><br>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Previously we have per-device system memory aliases when DMAR is
disabled by the system. It will slow the system down if there are
lots of devices especially when DMAR is disabled, because each of the
aliased system address space will contain O(N) slots, and rendering
such N address spaces will be O(N^2) complexity.
This patch introduces a shared nodmar memory region and for each
device we only create an alias to the shared memory region. With the
aliasing, QEMU memory core API will be able to detect when devices are
sharing the same address space (which is the nodmar address space)
when rendering the FlatViews and the total number of FlatViews can be
dramatically reduced when there are a lot of devices.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20190313094323.18263-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds an option to provide flexibility for user to expose
Scalable Mode to guest. User could expose Scalable Mode to guest by
the config as below:
"-device intel-iommu,caching-mode=on,scalable-mode=on"
The Linux iommu driver has supported scalable mode. Please refer below
patch set:
https://www.spinics.net/lists/kernel/msg2985279.html
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Message-Id: <1551753295-30167-4-git-send-email-yi.y.sun@linux.intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Per Intel(R) VT-d 3.0, the qi_desc is 256 bits in Scalable
Mode. This patch adds emulation of 256bits qi_desc.
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
[Yi Sun is co-developer to rebase and refine the patch.]
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <1551753295-30167-3-git-send-email-yi.y.sun@linux.intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Intel(R) VT-d 3.0 spec introduces scalable mode address translation to
replace extended context mode. This patch extends current emulator to
support Scalable Mode which includes root table, context table and new
pasid table format change. Now intel_iommu emulates both legacy mode
and scalable mode (with legacy-equivalent capability set).
The key points are below:
1. Extend root table operations to support both legacy mode and scalable
mode.
2. Extend context table operations to support both legacy mode and
scalable mode.
3. Add pasid tabled operations to support scalable mode.
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
[Yi Sun is co-developer to contribute much to refine the whole commit.]
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Message-Id: <1551753295-30167-2-git-send-email-yi.y.sun@linux.intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
This is found when I was debugging another problem. Until now no bug
is reported with this but we'd better reset the IR status correctly
after a system reset.
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When calculating use_iommu, we wanted to first detect whether DMAR is
enabled, then check whether PT is enabled if DMAR is enabled. However
in the current code we used "&" rather than "&&" so the ordering
requirement is lost (instead it'll be an "AND" operation). This could
introduce errors dumped in QEMU console when rebooting a guest with
both assigned device and vIOMMU, like:
qemu-system-x86_64: vtd_dev_to_context_entry: invalid root entry:
rsvd=0xf000ff53f000e2c3, val=0xf000ff53f000ff53 (reserved nonzero)
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Switch the intr_supported variable from a boolean to OnOffAuto type so
that we can know whether the user specified it or not. With that
we'll have a chance to help the user to choose more wisely where
possible. Introduce x86_iommu_ir_supported() to mask these changes.
No functional change at all.
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We're going to have 57bits aw-bits support sooner. It's possibly time
to remove the "x-" prefix.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Support DMA read/write draining should be easy for existing VT-d
emulation since the emulation itself does not have any request queue
there so we don't need to do anything to flush the un-commited queue.
What we need to do is to declare the support.
These capabilities are required to pass Windows SVVP test program. It
is verified that when with parameters "x-aw-bits=48,caching-mode=off"
we can pass the Windows SVVP test with this patch applied. Otherwise
we'll fail with:
IOMMU[0] - DWD (DMA write draining) not supported
IOMMU[0] - DWD (DMA read draining) not supported
Segment 0 has no DMA remapping capable IOMMU units
However since these bits are not declared support for QEMU<=3.1, we'll
need a compatibility bit for it and we turn this on by default only
for QEMU>=4.0.
Please refer to VT-d spec 6.5.4 for more information.
CC: Yu Wang <wyu@redhat.com>
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1654550
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Report more *_invalid() tracepoints to error_report_once() so that we
can detect issues even without tracing enabled. Drop those tracepoints.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The iotlb.iova can be zero if failure really happened. Dump the addr
instead.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The vtd_generate_msi_message() in intel-iommu is used to construct a MSI
Message from IRQ. A similar function will be needed when we add interrupt
remapping support in amd-iommu. Moving the function in common file to
avoid the code duplication. Rename it to x86_iommu_irq_to_msi_message().
There is no logic changes in the code flow.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Tom Lendacky <Thomas.Lendacky@amd.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Interrupt remapping needs kernel-irqchip={off|split} on both Intel and AMD
platforms. Move the check in common place.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Tom Lendacky <Thomas.Lendacky@amd.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We should handle VTD_FR_CONTEXT_ENTRY_P properly when synchronizing
shadow page tables. Having invalid context entry there is perfectly
valid when we move a device out of an existing domain. When that
happens, instead of posting an error we invalidate the whole region.
Without this patch, QEMU will crash if we do these steps:
(1) start QEMU with VT-d IOMMU and two 10G NICs (ixgbe)
(2) bind the NICs with vfio-pci in the guest
(3) start testpmd with the NICs applied
(4) stop testpmd
(5) rebind the NIC back to ixgbe kernel driver
The patch should fix it.
Reported-by: Pei Zhang <pezhang@redhat.com>
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1627272
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
There are two callers for vtd_sync_shadow_page_table_range(): one
provided a valid context entry and one not. Move that fetching
operation into the caller vtd_sync_shadow_page_table() where we need to
fetch the context entry.
Meanwhile, remove the error_report_once() directly since we're already
tracing all the error cases in the previous call. Instead, return error
number back to caller. This will not change anything functional since
callers are dropping it after all.
We do this move majorly because we want to do something more later in
vtd_sync_shadow_page_table().
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
QEMU is not handling the global DMAR switch well, especially when from
"on" to "off".
Let's first take the example of system reset.
Assuming that a guest has IOMMU enabled. When it reboots, we will drop
all the existing DMAR mappings to handle the system reset, however we'll
still keep the existing memory layouts which has the IOMMU memory region
enabled. So after the reboot and before the kernel reloads again, there
will be no mapping at all for the host device. That's problematic since
any software (for example, SeaBIOS) that runs earlier than the kernel
after the reboot will assume the IOMMU is disabled, so any DMA from the
software will fail.
For example, a guest that boots on an assigned NVMe device might fail to
find the boot device after a system reboot/reset and we'll be able to
observe SeaBIOS errors if we capture the debugging log:
WARNING - Timeout at nvme_wait:144!
Meanwhile, we should see DMAR errors on the host of that NVMe device.
It's the DMA fault that caused a NVMe driver timeout.
The correct fix should be that we do proper switching of device DMA
address spaces when system resets, which will setup correct memory
regions and notify the backend of the devices. This might not affect
much on non-assigned devices since QEMU VT-d emulation will assume a
default passthrough mapping if DMAR is not enabled in the GCMD
register (please refer to vtd_iommu_translate). However that's required
for an assigned devices, since that'll rebuild the correct GPA to HPA
mapping that is needed for any DMA operation during guest bootstrap.
Besides the system reset, we have some other places that might change
the global DMAR status and we'd better do the same thing there. For
example, when we change the state of GCMD register, or the DMAR root
pointer. Do the same refresh for all these places. For these two
places we'll also need to explicitly invalidate the context entry cache
and iotlb cache.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1625173
CC: QEMU Stable <qemu-stable@nongnu.org>
Reported-by: Cong Li <coli@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
--
v2:
- do the same for GCMD write, or root pointer update [Alex]
- test is carried out by me this time, by observing the
vtd_switch_address_space tracepoint after system reboot
v3:
- rewrite commit message as suggested by Alex
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Provide the function and use it in vtd_init(). Used to reset both
context entry cache and iotlb cache for the whole IOMMU unit.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Replace all the trace_vtd_err_*() hooks with the new error_report_once()
since they are similar to trace_vtd_err() - dumping the first error
would be mostly enough, then we have them on by default too.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20180815095328.32414-4-peterx@redhat.com>
[Use "%x" instead of "%" PRIx16 to print uint16_t, whitespace tidied up]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Replace existing trace_vtd_err() with error_report_once() then stderr
will capture something if any of the error happens, meanwhile we don't
suffer from any DDOS. Then remove the trace point. Since at it,
provide more information where proper (now we can pass parameters into
the report function).
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20180815095328.32414-3-peterx@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
[Two format strings fixed, whitespace tidied up]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Add an IOMMU index argument to the translate method of
IOMMUs. Since all of our current IOMMU implementations
support only a single IOMMU index, this has no effect
on the behaviour.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20180604152941.20374-4-peter.maydell@linaro.org
Add support for multiple IOMMU indexes to the IOMMU notifier APIs.
When initializing a notifier with iommu_notifier_init(), the caller
must pass the IOMMU index that it is interested in. When a change
happens, the IOMMU implementation must pass
memory_region_notify_iommu() the IOMMU index that has changed and
that notifiers must be called for.
IOMMUs which support only a single index don't need to change.
Callers which only really support working with IOMMUs with a single
index can use the result of passing MEMTXATTRS_UNSPECIFIED to
memory_region_iommu_attrs_to_index().
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20180604152941.20374-3-peter.maydell@linaro.org
This patch fixes a potential small window that the DMA page table might
be incomplete or invalid when the guest sends domain/context
invalidations to a device. This can cause random DMA errors for
assigned devices.
This is a major change to the VT-d shadow page walking logic. It
includes but is not limited to:
- For each VTDAddressSpace, now we maintain what IOVA ranges we have
mapped and what we have not. With that information, now we only send
MAP or UNMAP when necessary. Say, we don't send MAP notifies if we
know we have already mapped the range, meanwhile we don't send UNMAP
notifies if we know we never mapped the range at all.
- Introduce vtd_sync_shadow_page_table[_range] APIs so that we can call
in any places to resync the shadow page table for a device.
- When we receive domain/context invalidation, we should not really run
the replay logic, instead we use the new sync shadow page table API to
resync the whole shadow page table without unmapping the whole
region. After this change, we'll only do the page walk once for each
domain invalidations (before this, it can be multiple, depending on
number of notifiers per address space).
While at it, the page walking logic is also refactored to be simpler.
CC: QEMU Stable <qemu-stable@nongnu.org>
Reported-by: Jintack Lim <jintack@cs.columbia.edu>
Tested-by: Jintack Lim <jintack@cs.columbia.edu>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This patch only modifies the trace points.
Previously we were tracing page walk levels. They are redundant since
we have page mask (size) already. Now we trace something much more
useful which is the domain ID of the page walking. That can be very
useful when we trace more than one devices on the same system, so that
we can know which map is for which domain.
CC: QEMU Stable <qemu-stable@nongnu.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We pass in the VTDAddressSpace too. It'll be used in the follow up
patches.
CC: QEMU Stable <qemu-stable@nongnu.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
During the recursive page walking of IOVA page tables, some stack
variables are constant variables and never changed during the whole page
walking procedure. Isolate them into a struct so that we don't need to
pass those contants down the stack every time and multiple times.
CC: QEMU Stable <qemu-stable@nongnu.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
For UNMAP-only IOMMU notifiers, we don't need to walk the page tables.
Fasten that procedure by skipping the page table walk. That should
boost performance for UNMAP-only notifiers like vhost.
CC: QEMU Stable <qemu-stable@nongnu.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
SECURITY IMPLICATION: this patch fixes a potential race when multiple
threads access the IOMMU IOTLB cache.
Add a per-iommu big lock to protect IOMMU status. Currently the only
thing to be protected is the IOTLB/context cache, since that can be
accessed even without BQL, e.g., in IO dataplane.
Note that we don't need to protect device page tables since that's fully
controlled by the guest kernel. However there is still possibility that
malicious drivers will program the device to not obey the rule. In that
case QEMU can't really do anything useful, instead the guest itself will
be responsible for all uncertainties.
CC: QEMU Stable <qemu-stable@nongnu.org>
Reported-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
That is not really necessary. Removing that node struct and put the
list entry directly into VTDAddressSpace. It simplfies the code a lot.
Since at it, rename the old notifiers_list into vtd_as_with_notifiers.
CC: QEMU Stable <qemu-stable@nongnu.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
SECURITY IMPLICATION: without this patch, any guest with both assigned
device and a vIOMMU might encounter stale IO page mappings even if guest
has already unmapped the page, which may lead to guest memory
corruption. The stale mappings will only be limited to the guest's own
memory range, so it should not affect the host memory or other guests on
the host.
During IOVA page table walking, there is a special case when the PSI
covers one whole PDE (Page Directory Entry, which contains 512 Page
Table Entries) or more. In the past, we skip that entry and we don't
notify the IOMMU notifiers. This is not correct. We should send UNMAP
notification to registered UNMAP notifiers in this case.
For UNMAP only notifiers, this might cause IOTLBs cached in the devices
even if they were already invalid. For MAP/UNMAP notifiers like
vfio-pci, this will cause stale page mappings.
This special case doesn't trigger often, but it is very easy to be
triggered by nested device assignments, since in that case we'll
possibly map the whole L2 guest RAM region into the device's IOVA
address space (several GBs at least), which is far bigger than normal
kernel driver usages of the device (tens of MBs normally).
Without this patch applied to L1 QEMU, nested device assignment to L2
guests will dump some errors like:
qemu-system-x86_64: VFIO_MAP_DMA: -17
qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
0x7f89a920d000) = -17 (File exists)
CC: QEMU Stable <qemu-stable@nongnu.org>
Acked-by: Jason Wang <jasowang@redhat.com>
[peterx: rewrite the commit message]
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Instead of having the same error checks in vtd_realize()
and amdvi_realize(), move that over to the generic
x86_iommu_realize().
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
The current implementation of Intel IOMMU code only supports 39 bits
iova address width. This patch provides a new parameter (x-aw-bits)
for intel-iommu to extend its address width to 48 bits but keeping the
default the same (39 bits). The reason for not changing the default
is to avoid potential compatibility problems with live migration of
intel-iommu enabled QEMU guest. The only valid values for 'x-aw-bits'
parameter are 39 and 48.
After enabling larger address width (48), we should be able to map
larger iova addresses in the guest. For example, a QEMU guest that
is configured with large memory ( >=1TB ). To check whether 48 bits
aw is enabled, we can grep in the guest dmesg output with line:
"DMAR: Host address width 48".
Signed-off-by: Prasad Singamsetty <prasad.singamsety@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The current implementation of Intel IOMMU code only supports 39 bits
host/iova address width so number of macros use hard coded values based
on that. This patch is to redefine them so they can be used with
variable address widths. This patch doesn't add any new functionality
but enables adding support for 48 bit address width.
Signed-off-by: Prasad Singamsetty <prasad.singamsety@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
It should be caching-mode. It may confuse people when it pops up.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Liu, Yi L <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We have PCI_DEVFN_MAX now.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Liu, Yi L <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In vtd_switch_address_space() we did the memory region switch, however
it's possible that the caller of it has not taken the BQL at all. Make
sure we have it.
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Jason Wang <jasowang@redhat.com>
CC: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
It was cached by read/write separately. Let's merge them.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
IOMMUTLBEntry.iova is returned incorrectly on one PT path (though mostly
we cannot really trigger this path, even if we do, we are mostly
disgarding this value, so it didn't break anything). Fix it by
converting the VTD_PAGE_MASK into the correct definition
VTD_PAGE_MASK_4K, then remove VTD_PAGE_MASK.
Fixes: b93130 ("intel_iommu: cleanup vtd_{do_}iommu_translate()")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This finishes QOM'fication of IOMMUMemoryRegion by introducing
a IOMMUMemoryRegionClass. This also provides a fastpath analog for
IOMMU_MEMORY_REGION_GET_CLASS().
This makes IOMMUMemoryRegion an abstract class.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20170711035620.4232-3-aik@ozlabs.ru>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>