* qemu-kvm/uq/master:
virtio/vhost: Add support for KVM in-kernel MSI injection
msix: Add msix_nr_vectors_allocated
kvm: Enable use of kvm_irqchip_in_kernel in hwlib code
kvm: Introduce kvm_irqchip_add/remove_irqfd
kvm: Make kvm_irqchip_commit_routes an internal service
kvm: Publicize kvm_irqchip_release_virq
kvm: Introduce kvm_irqchip_add_msi_route
kvm: Rename kvm_irqchip_add_route to kvm_irqchip_add_irq_route
msix: Introduce vector notifiers
msix: Invoke msix_handle_mask_update on msix_mask_all
msix: Factor out msix_get_message
kvm: update vmxcap for EPT A/D, INVPCID, RDRAND, VMFUNC
kvm: Enable in-kernel irqchip support by default
kvm: Add support for direct MSI injections
kvm: Update kernel headers
kvm: x86: Wire up MSI support for in-kernel irqchip
pc: Enable MSI support at APIC level
kvm: Introduce basic MSI support for in-kernel irqchips
Introduce MSIMessage structure
kvm: Refactor KVMState::max_gsi to gsi_count
VIRTIO_BLK_F_SCSI is supposed to mean whether the host can *parse*
SCSI requests, not *execute* them. You could run QEMU with scsi=on
and a file-backed disk, and QEMU would fail all SCSI requests even
though it advertises VIRTIO_BLK_F_SCSI.
Because this is needed to fix a migration compatibility problem
related to how management tools invoke QEMU, it must be done
unconditionally, even on older machine types. This more or less assumes
that no one ever invoked QEMU with scsi=off.
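For illustration only (not the actual patch), unconditionally advertising
the bit as a host feature amounts to something like the sketch below; the
function name is hypothetical.

    /* Hypothetical sketch: always advertise VIRTIO_BLK_F_SCSI as a host
     * feature, since the device can always *parse* SCSI requests even
     * when the backend cannot execute them. */
    #include <stdint.h>

    #define VIRTIO_BLK_F_SCSI 7   /* feature bit number from the virtio spec */

    static uint32_t blk_host_features(uint32_t features)
    {
        features |= 1u << VIRTIO_BLK_F_SCSI;
        return features;
    }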
Here is how testing goes:
- old QEMU, scsi=on -> new QEMU, scsi=on
- new QEMU, scsi=on -> old QEMU, scsi=on
- old QEMU, scsi=off -> new QEMU, scsi=on
- new QEMU, scsi=off -> old QEMU, scsi=on
  ok (new QEMU has VIRTIO_BLK_F_SCSI, adding host features is fine)

- old QEMU, scsi=off -> new QEMU, scsi=off
  ok (new QEMU has VIRTIO_BLK_F_SCSI, adding host features is fine)

- old QEMU, scsi=on -> new QEMU, scsi=off
  ok, bug fixed

- new QEMU, scsi=on -> old QEMU, scsi=off
  doesn't work (same as: old QEMU, scsi=on -> old QEMU, scsi=off)

- new QEMU, scsi=off -> old QEMU, scsi=off
  broken by the patch
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We will have to add another field to the virtio-blk configuration in
the next patch. Avoid a proliferation of arguments to virtio_blk_init.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Move it from virtio_blk_exit_pci to virtio_blk_exit.
This is included here because the next patch removes proxy->block.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Make use of the new vector notifier to track changes of the MSI-X
configuration of virtio PCI devices. On enabling events, we establish
the required virtual IRQ to MSI-X message route and link the signaling
eventfd file descriptor to this vIRQ line. That way, vhost-generated
interrupts can be directly delivered to an in-kernel MSI-X consumer like
the x86 APIC.
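Roughly, the unmask path described above boils down to the following
sketch; the signatures are assumed from the helpers introduced in this
series, and 'msg'/'notifier' stand in for the vector's MSI message and
the vhost event notifier.

    /* Sketch with assumed signatures: route the vector's MSI message to a
     * KVM GSI, then bind the vhost eventfd to that vIRQ so interrupts
     * bypass userspace. */
    int virq = kvm_irqchip_add_msi_route(kvm_state, msg);
    if (virq >= 0 &&
        kvm_irqchip_add_irqfd(kvm_state,
                              event_notifier_get_fd(&notifier), virq) < 0) {
        kvm_irqchip_release_virq(kvm_state, virq);
        virq = -1;
    }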
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
There are no outside references to virtio_portio.
Add missing 'static' specifier.
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Currently the virtio balloon device, when using the virtio-pci interface,
advertises itself with PCI class code MEMORY_RAM. This is wrong; the
balloon is vaguely related to memory, but is nothing like a PCI memory
device in the meaning of the class code, and this class code is not
required or suggested by the virtio PCI specification.
Worse, this class code causes problems on the pseries machine, because
the firmware, seeing it, advertises the device as memory in the
device tree, and then a guest kernel bug causes it to see this "memory"
before the real system memory, leading to a crash in early boot.
This patch fixes the problem by removing the bogus PCI class code from the
balloon device. The backwards-compatible PC machine types get new compat
properties so that they don't change.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Replace device_init() with generalized type_init().
While at it, unify naming convention: type_init([$prefix_]register_types)
Also, type_init() is a function, so add preceding blank line where
necessary and don't put a semicolon after the closing brace.
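For a hypothetical device "foo", the resulting convention looks roughly
like this ("foo" is a placeholder, not a real device in the tree):

    /* Illustrative only; "foo" is a placeholder device name. */
    static void foo_register_types(void)
    {
        type_register_static(&foo_info);
    }

    type_init(foo_register_types)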
Signed-off-by: Andreas Färber <afaerber@suse.de>
Cc: Anthony Liguori <anthony@codemonkey.ws>
Cc: malc <av1474@comtv.ru>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Limit them to the device_add functionality. Device aliases were a hack based
on the fact that virtio was modeled the wrong way. The mechanism for aliasing
is very limited in that only one alias can exist for any device.
We have to support it for compatibility purposes, but we only need it in
device_add, so restrict it to that piece of code.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
v1 -> v2
- Use a table for aliases (Paolo)
This was done in a mostly automated fashion. I did it in three steps and then
rebased it into a single step which avoids repeatedly touching every file in
the tree.
The first step was a sed-based addition of the parent type to the subclass
registration functions.
The second step was another sed-based removal of subclass registration functions
while also adding virtual functions from the base class into a class_init
function as appropriate.
Finally, a python script was used to convert the DeviceInfo structures and
qdev_register_subclass functions to TypeInfo structures, class_init functions,
and type_register_static calls.
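The converted registration for a hypothetical device "foo" ends up shaped
roughly like this (field names are illustrative, not taken from a specific
file):

    /* Rough shape of the conversion result; "foo" is a placeholder. */
    static void foo_class_init(ObjectClass *klass, void *data)
    {
        DeviceClass *dc = DEVICE_CLASS(klass);

        dc->init = foo_init;           /* formerly DeviceInfo::init */
    }

    static TypeInfo foo_info = {
        .name          = "foo",
        .parent        = TYPE_DEVICE,  /* parent added by the first sed step */
        .instance_size = sizeof(FooState),
        .class_init    = foo_class_init,
    };

Registration then goes through type_register_static() from a type_init()'d
function.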
We are almost fully converted to QOM after this commit.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
These are various small stylistic changes which help make things more
consistent such that the automated conversion script can be simpler.
It's not necessary to agree or disagree with these style changes because all
of this code is going to be rewritten by the patch monkey script anyway.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The virtio config area in PIO space is a bit special. The initial
header is little endian but the rest (device specific) is guest
native endian.
The PIO accessors for PCI on machines that don't have native IO ports
assume that all PIO is little endian, which works fine for everything
except the above.
A complicated way to fix it would be to split the BAR into two memory
regions with different endianness settings, but this isn't practical
to do; besides, the PIO code doesn't honor region endianness anyway
(I have a patch for that too but it isn't necessary at this stage).
So I decided to go for the quick fix instead which consists of
reverting the swap in virtio-pci in selected places, hoping that when
we eventually do a "v2" of the virtio protocols, we sort that out once
and for all using a fixed endian setting for everything.
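The offset-dependent behaviour described above can be modelled by a small
standalone function; this is an illustration, not virtio-pci's actual
accessor, and the 20-byte header size is the non-MSI-X legacy layout.

    /* Standalone model: the common header is always little endian, the
     * device-specific area that follows is guest-endian. */
    #include <stdint.h>
    #include <stdbool.h>

    #define VIRTIO_PCI_CONFIG 20   /* size of the little-endian common header */

    static uint16_t swap16(uint16_t v) { return (uint16_t)((v >> 8) | (v << 8)); }

    static uint16_t read_config_word(const uint8_t *bar, unsigned off,
                                     bool guest_is_big_endian)
    {
        uint16_t v = (uint16_t)(bar[off] | (bar[off + 1] << 8)); /* LE fetch */

        /* Swap only in the device-specific area, and only for BE guests. */
        if (off >= VIRTIO_PCI_CONFIG && guest_is_big_endian) {
            v = swap16(v);
        }
        return v;
    }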
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
[agraf: keep virtio in libhw and determine endianness through a
helper function in exec.c]
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
All files under GPLv2 will get GPLv2+ changes starting tomorrow.
event_notifier.c and exec-obsolete.h were only ever touched by Red Hat
employees and can be relicensed now.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
vdev->guest_features is not masking features that are not supported by
the guest. Fix this by introducing a common wrapper to be used by all
virtio bus implementations.
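A minimal standalone model of the masking such a wrapper typically
enforces; the names are placeholders and the actual helper added by the
patch may differ in direction and detail.

    /* Standalone model: record only feature bits that were actually
     * offered; not QEMU's VirtIODevice. */
    #include <stdint.h>

    struct vdev_model {
        uint32_t host_features;    /* offered by the device/binding */
        uint32_t guest_features;   /* acked by the guest */
    };

    static void model_set_features(struct vdev_model *vdev, uint32_t val)
    {
        vdev->guest_features = val & vdev->host_features;
    }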
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently, virtio devices are usually presented to the guest as an
emulated PCI device, virtio_pci. Although the actual IO operations
are done through system memory, the configuration of the virtio device
is done through the one PCI IO space BAR that virtio_pci presents.
But PCI IO space (aka PIO) is deprecated for modern PCI devices, and
on some systems with many PCI domains accessing PIO space can be
problematic. For example on the existing PowerVM implementation of
the PAPR spec, PCI PIO access is not supported at all. We're hoping
that our KVM implementation will support PCI PIO (once we support PCI
at all), but it will probably have some irritating limitations.
This patch, therefore, extends the virtio_pci device to have a PCI
memory space (MMIO) BAR as well as the IO BAR. The MMIO BAR contains
exactly the same registers, in exactly the same layout as the existing
PIO BAR.
Because the PIO BAR is still present, existing guest drivers should
still work fine. With this change in place, future guest drivers can
check for an MMIO BAR and use that if present (falling back to PIO
when it is not, to support older qemu versions).
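Conceptually the change boils down to registering the same register block
behind a second, memory-space BAR. The snippet below is a sketch only: the
field names, BAR numbers, and the exact memory API calls of that era are
assumptions, not the patch itself.

    /* Sketch (assumed names): same ops, same layout, one I/O BAR and one
     * memory BAR. */
    memory_region_init_io(&proxy->bar_pio, &virtio_pci_config_ops, proxy,
                          "virtio-pci-pio", size);
    pci_register_bar(&proxy->pci_dev, 0, PCI_BASE_ADDRESS_SPACE_IO,
                     &proxy->bar_pio);

    memory_region_init_io(&proxy->bar_mmio, &virtio_pci_config_ops, proxy,
                          "virtio-pci-mmio", size);
    pci_register_bar(&proxy->pci_dev, 2, PCI_BASE_ADDRESS_SPACE_MEMORY,
                     &proxy->bar_mmio);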
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The msix table is defined as a subregion, to allow for a BAR that
mixes device specific regions with the msix table.
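As a sketch (assumed names and offsets, not the actual msix.c code):

    /* The table gets its own region nested inside the device's BAR
     * region, so device registers and the MSI-X table can share a BAR. */
    memory_region_init_io(&dev->msix_table_mmio, &msix_table_ops, dev,
                          "msix-table", table_size);
    memory_region_add_subregion(&dev->bar, table_offset,
                                &dev->msix_table_mmio);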
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Add an exit handler that will free up RAM after a virtio-balloon device
is unplugged.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Multiple balloon registrations are not allowed; check if the
registration with the qemu balloon api succeeded. If not, fail the
device init.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
In practice, guests don't generate config requests
that cross a word boundary, so the logic to
detect command word access is correct because
PCI_COMMAND is 0x4. But depending on this is
tricky; further, it will break with guests
that do try to generate a misaligned access
as we pass it to devices without splitting.
Better to use the generic range_covers_byte for this.
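A self-contained model of the check (this mirrors the intent of the
range_covers_byte() helper, it is not the exact patch):

    /* Does a config write [offset, offset+len) touch the PCI_COMMAND byte? */
    #include <stdint.h>
    #include <stdbool.h>

    #define PCI_COMMAND 0x04

    static bool covers_byte(uint64_t offset, uint64_t len, uint64_t byte)
    {
        return offset <= byte && byte < offset + len;
    }

    /* e.g. a 2-byte write at 0x03 covers PCI_COMMAND even though it is
     * not word aligned: covers_byte(0x03, 2, PCI_COMMAND) == true */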
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
It needs to be a qdev property, because it belongs to the drive's
guest part. Precedent: commits a0fef654 and 6ced55a5.
Bonus: info qtree now shows the serial number.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The virtio_queue_notify() function checks that the virtqueue number is
less than the maximum number of virtqueues. A signed comparison is used
but the virtqueue number could be negative if a buggy or malicious guest
is run. This results in memory accesses outside of the virtqueue array.
It is risky doing input validation in common code instead of at the
guest<->host boundary. Note that virtio_queue_set_addr(),
virtio_queue_get_addr(), virtio_queue_get_num(), and many other virtio
functions do *not* validate the virtqueue number argument.
Instead of fixing the comparison in virtio_queue_notify(), move the
comparison to the virtio bindings (just like VIRTIO_PCI_QUEUE_SEL) where
we have a uint32_t value and can avoid ever calling into common virtio
code if the virtqueue number is invalid.
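A standalone model of doing the check at the binding, with an unsigned
queue number (constants and names are illustrative, not virtio-pci's):

    /* Validate the queue number before any common virtio code runs. */
    #include <stdint.h>

    #define QUEUE_MAX 64

    static void notify_vq(uint32_t vq_index)
    {
        if (vq_index >= QUEUE_MAX) {
            return;                 /* ignore bogus queue numbers */
        }
        /* ... safe to index the virtqueue array and notify here ... */
    }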
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This patch moves the 9p device registration into its own file.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri (JV) <jvrao@linux.vnet.ibm.com>
We have two different virtio buses: pci and s390. The abstraction path
taken in qemu is to have generic aliases for each device type in the
architecture specific qdev devices.
So let's define these aliases wherever we can and make use of them
whenever possible.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Commit c81131db15
detects old guests by comparing virtio and
PCI status. It attempts to do this on load,
as well, but the load_config callback in a binding
is invoked too early, so the virtio status
isn't set yet.
We could add yet another callback to the
binding, to invoke after load, but it
seems easier to reuse the existing vmstate
callback.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Alexander Graf <agraf@suse.de>
Enable ioeventfd for virtio-serial devices by default. Commit
25db9ebe15 lists the benefits of using
ioeventfd.
Copying a file from guest to host over a virtio-serial channel didn't
show much difference in time or io_exit rate.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Instead of using a single variable to pass to the virtio_serial_init
function, use a struct so that expanding the number of variables to be
passed on later is easier.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
When MSI is off, each interrupt needs to be bounced through the io
thread when it's set/cleared, so vhost-net causes more context switches and
higher CPU utilization than userspace virtio, which handles networking in
the same thread.
We'll need to fix this by adding level irq support in kvm irqfd;
for now, disable vhost-net in these configurations.
Added a vhostforce flag to force vhost-net back on.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
- Don't return status from start/stop functions where it's ignored
- report errors to make debugging easier
- assert on unexpected failures
- don't disable notifiers on error so that we'll
retry when guest driver restarts
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Virtqueue notify is currently handled synchronously in userspace virtio. This
prevents the vcpu from executing guest code while hardware emulation code
handles the notify.
On systems that support KVM, the ioeventfd mechanism can be used to make
virtqueue notify a lightweight exit by deferring hardware emulation to the
iothread and allowing the VM to continue execution. This model is similar to
how vhost receives virtqueue notifies.
The result of this change is improved performance for userspace virtio devices.
Virtio-blk throughput increases especially for multithreaded scenarios and
virtio-net transmit throughput increases substantially.
Some virtio devices are known to have guest drivers which expect a notify to be
processed synchronously and spin waiting for completion.
For virtio-net, this also seems to interact with the guest stack in strange
ways so that TCP throughput for small message sizes (~200 bytes)
is harmed. Only enable ioeventfd for virtio-blk for now.
Care must be taken not to interfere with vhost-net, which uses host
notifiers. If the set_host_notifier() API is used by a device
virtio-pci will disable virtio-ioeventfd and let the device deal with
host notifiers as it wishes.
Finally, there used to be a limit of 6 KVM io bus devices inside the
kernel. On such a kernel, don't use ioeventfd for virtqueue host
notification since the limit is reached too easily. This ensures that
existing vhost-net setups (which always use ioeventfd) have ioeventfds
available so they can continue to work.
After migration and on VM change state (running/paused) virtio-ioeventfd
will enable/disable itself.
* VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
* !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
* virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
* vm_change_state(running=0) -> disable virtio-ioeventfd
* vm_change_state(running=1) -> enable virtio-ioeventfd
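The enable/disable rule listed above can be summed up in a small
standalone model; the parameter names are illustrative, not virtio-pci's
actual fields.

    /* ioeventfd is used only while the driver is up, the VM is running,
     * and nothing else (e.g. vhost-net) owns the host notifier. */
    #include <stdbool.h>

    #define VIRTIO_CONFIG_S_DRIVER_OK 4

    static bool use_ioeventfd(unsigned status, bool vm_running,
                              bool host_notifier_taken)
    {
        return (status & VIRTIO_CONFIG_S_DRIVER_OK) &&
               vm_running &&
               !host_notifier_taken;
    }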
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The VirtIOPCIProxy bugs field is currently used to enable workarounds
for older guests. Rename it to flags so that other per-device behavior
can be tracked.
A later patch uses the flags field to remember whether ioeventfd should
be used for virtqueue host notification.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Add "fw_name" to DeviceInfo to use in device path building. In
contrast to "name" "fw_name" should refer to functionality device
provides instead of particular device model like "name" does.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This patch enables MSI-X for virtfs-9p-pci. It also adds a
compat property to pc-0.13 which turns it off there to stay
compatible with 0.13-stable.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When using irqfd with vhost-net to inject interrupts,
a single eventfd might inject multiple interrupts.
Implementing this is much easier with a single
per-device callback to set guest notifiers.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Based on a patch from Mark McLoughlin, this patch introduces a new
bottom half packet transmitter that avoids the latency imposed by
the tx_timer approach. Rather than scheduling a timer when a TX
packet comes in, schedule a bottom half to be run from the iothread.
The bottom half handler first attempts to flush the queue with
notification disabled (this is where we could race with a guest
without txburst). If we flush a full burst, reschedule immediately.
If we send short of a full burst, try to re-enable notification.
To avoid a race with TXs that may have occurred, we must then
flush again. If we find some packets to send, the guest is probably
active, so we can reschedule again.
tx_timer and tx_bh are mutually exclusive, so we can re-use the
tx_waiting flag to indicate one or the other needs to be setup.
This allows us to seamlessly migrate between timer and bh TX
handling.
The bottom half handler becomes the new default and we add a new
tx= option to virtio-net-pci. Usage:
-device virtio-net-pci,tx=timer # select timer mitigation vs "bh"
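The handler flow described above, as a standalone model; the helpers and
the txq_model type are placeholders, not virtio-net's actual code.

    struct txq_model;
    extern void disable_notify(struct txq_model *q);
    extern void enable_notify(struct txq_model *q);
    extern void schedule_bh(struct txq_model *q);
    extern int  flush_tx(struct txq_model *q);      /* packets sent */
    extern int  tx_burst(struct txq_model *q);

    static void tx_bh_model(struct txq_model *q)
    {
        disable_notify(q);                 /* flush without notifications */
        if (flush_tx(q) >= tx_burst(q)) {
            schedule_bh(q);                /* full burst: keep going */
            return;
        }
        enable_notify(q);                  /* short burst: re-arm, then... */
        if (flush_tx(q) > 0) {             /* ...flush again to close the race */
            disable_notify(q);
            schedule_bh(q);                /* guest is active: run again */
        }
    }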
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
If virtio_net_flush_tx() is called with notification disabled, we can
race with the guest, processing packets at the same rate as they
get produced. The trouble is that this means we have no guaranteed
exit condition from the function and can spend minutes in there.
Currently flush_tx is only called with notification on, which seems
to limit us to one pass through the queue per call. An upcoming
patch changes this.
Also add an option to set this value on the command line as different
workloads may wish to use different values. We can't necessarily
support any random value, so this is a developer option: x-txburst=
Usage:
-device virtio-net-pci,x-txburst=64 # 64 packets per tx flush
One pass through the queue (256) seems to be a good default value
for this, balancing latency with throughput. We use a signed int
for x-txburst because 2^31 packets in a burst would take many, many
minutes to process and it allows us to easily return a negative
value from virtio_net_flush_tx() to indicate a back-off
or error condition.
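A standalone model of the bound (placeholder helpers, not virtio-net's
code):

    /* Stop after tx_burst packets so a guest producing at the same rate
     * we drain cannot keep us in the loop indefinitely. */
    struct txq_model;
    extern int pop_and_send(struct txq_model *q);   /* 1 if a packet went out */

    static int flush_tx_model(struct txq_model *q, int tx_burst)
    {
        int num_packets = 0;

        while (pop_and_send(q)) {
            if (++num_packets >= tx_burst) {
                break;
            }
        }
        return num_packets;
    }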
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Add an option to make the TX mitigation timer adjustable as a device
option. The 150us hard-coded default used currently is reasonable,
but may not be suitable for all workloads; this gives us a way to
adjust it using a single binary. We can't support any random option
though, so use the "x-" prefix to indicate this is a developer
option. Usage:
-device virtio-net-pci,x-txtimer=500000,... # .5ms timeout
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Changing block.h or blockdev.h resulted in recompiling most objects.
Move DriveInfo typedef and BlockInterfaceType enum definitions
to qemu-common.h and rearrange blockdev.h use to decrease churn.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>