kvm_allows_irq0_override() is a totally x86 specific concept:
move it to the target-specific source file where it belongs.
This means we need a new header file for the prototype:
kvm_i386.h, in line with the existing kvm_ppc.h.
While we are moving it, fix the return type to be 'bool' rather
than 'int'.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Rename the function kvm_irqchip_set_irq() to kvm_set_irq(),
since it can be used for sending (asynchronous) interrupts whether
there is a full irqchip model in the kernel or not. (We don't
include 'async' in the function name since asynchronous is the
normal case.)
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
On x86 userspace delivers interrupts to the kernel asynchronously
(and therefore VCPU idle management is done in the kernel) if and
only if there is an in-kernel irqchip. On other architectures this
isn't necessarily true (they may always send interrupts
asynchronously), so define a new kvm_async_interrupts_enabled()
function instead of misusing kvm_irqchip_in_kernel().
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Add a helper function for fetching max cpus supported by kvm.
Make QEMU exit with an error message if smp_cpus exceeds limit
of VCPU count retrieved by invoking this helper function.
Signed-off-by: Dunrong Huang <riegamaths@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
* qemu-kvm/uq/master:
virtio: move common irqfd handling out of virtio-pci
virtio: move common ioeventfd handling out of virtio-pci
event_notifier: add event_notifier_set_handler
memory: pass EventNotifier, not eventfd
ivshmem: wrap ivshmem_del_eventfd loops with transaction
ivshmem: use EventNotifier and memory API
event_notifier: add event_notifier_init_fd
event_notifier: remove event_notifier_test
event_notifier: add event_notifier_set
apic: Defer interrupt updates to VCPU thread
apic: Reevaluate pending interrupts on LVT_LINT0 changes
apic: Resolve potential endless loop around apic_update_irq
kvm: expose tsc deadline timer feature to guest
kvm_pv_eoi: add flag support
kvm: Don't abort on kvm_irqchip_add_msi_route()
All transports can use the same event handler for the irqfd, though the
exact mechanics of the assignment will be specific. Note that there
are three states: handled by the kernel, handled in userspace, disabled.
This also lets virtio use event_notifier_set_handler.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Under Win32, EventNotifiers will not have event_notifier_get_fd, so we
cannot call it in common code such as hw/virtio-pci.c. Pass a pointer to
the notifier, and only retrieve the file descriptor in kvm-specific code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
By default qemu will use MAP_PRIVATE for guest pages. This will write
protect pages and thus break on s390 systems that dont support this feature.
Therefore qemu has a hack to always use MAP_SHARED for s390. But MAP_SHARED
has other problems (no dirty pages tracking, a lot more swap overhead etc.)
Newer systems allow the distinction via KVM_CAP_S390_COW. With this feature
qemu can use the standard qemu alloc if available, otherwise it will use
the old s390 hack.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Acked-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Anyone using these functions has to be prepared that irqchip
support may not be present. It shouldn't be up to the core
code to determine whether this is a fatal error. Currently
code written as:
virq = kvm_irqchip_add_msi_route(...)
if (virq < 0) {
<slow path>
} else {
<fast path>
}
works on x86 with and without kvm irqchip enabled, works
without kvm support compiled in, but aborts() on !x86 with
kvm support.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
These are included via monitor.h right now, add them explicitly.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
A type definition and a KVMState field initialization escaped the
required wrapping with KVM_CAP_IRQ_ROUTING. Also, we need to provide a
dummy kvm_irqchip_release_virq as virtio-pci references (but does not
use) it.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Ben Collins <bcollins@ubuntu.com>
Tested-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
Add services to associate an eventfd file descriptor as input with an
IRQ line as output. Such a line can be an input pin of an in-kernel
irqchip or a virtual line returned by kvm_irqchip_add_route.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Automatically commit route changes after kvm_add_routing_entry and
kvm_irqchip_release_virq. There is no performance relevant use case for
which collecting multiple route changes is beneficial. This makes
kvm_irqchip_commit_routes an internal service which assert()s that the
corresponding IOCTL will always succeed.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This allows to drop routes created by kvm_irqchip_add_irq/msi_route
again.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Add a service that establishes a static route from a virtual IRQ line to
an MSI message. Will be used for IRQFD and device assignment. As we will
use this service outside of CONFIG_KVM protected code, stub it properly.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We will add kvm_irqchip_add_msi_route, so let's make the difference
clearer.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
As MSI is now fully supported by KVM (/wrt available features in
upstream), we can finally enable the in-kernel irqchip by default.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If the kernel supports KVM_SIGNAL_MSI, we can avoid the route-based
MSI injection mechanism.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch basically adds kvm_irqchip_send_msi, a service for sending
arbitrary MSI messages to KVM's in-kernel irqchip models.
As the original KVM API requires us to establish a static route from a
pseudo GSI to the target MSI message and inject the MSI via toggling
that virtual IRQ, we need to play some tricks to make this interface
transparent. We create those routes on demand and keep them in a hash
table. Succeeding messages can then search for an existing route in the
table first and reuse it whenever possible. If we should run out of
limited GSIs, we simply flush the table and rebuild it as messages are
sent.
This approach is rather simple and could be optimized further. However,
latest kernels contains a more efficient MSI injection interface that
will obsolete the GSI-based dynamic injection.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Instead of the bitmap size, store the maximum of GSIs the kernel
support. Move the GSI limit assertion to the API function
kvm_irqchip_add_route and make it stricter.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If the kernel page size is larger than TARGET_PAGE_SIZE, which
happens for example on ppc64 with kernels compiled for 64K pages,
the dirty tracking doesn't work.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Avi Kivity <avi@redhat.com>
The current kvm_init_irq_routing() doesn't set up the used_gsi_bitmap
correctly, and as a consequence pins max_gsi to 32 when it really
should be 1024. I ran into this limitation while testing pci
passthrough, where I consistently got an -ENOSPC return from
kvm_get_irq_route_gsi() called from assigned_dev_update_msix_mmio().
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We use a 2 byte ioeventfd for virtio memory,
add support for this.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Amos Kong <akong@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In kvm-all.c we store an ioctl cmd number in the irqchip_inject_ioctl field
of KVMState, which has type 'int'. This seems to make sense since the
ioctl() man page says that the cmd parameter has type int.
However, the kernel treats ioctl numbers as unsigned - sys_ioctl() takes an
unsigned int, and the macros which generate ioctl numbers expand to
unsigned expressions. Furthermore, some ioctls (IOC_READ ioctls on x86
and IOC_WRITE ioctls on powerpc) have bit 31 set, and so would be negative
if interpreted as an int. This has the surprising and compile-breaking
consequence that in kvm_irqchip_set_irq() where we do:
return (s->irqchip_inject_ioctl == KVM_IRQ_LINE) ? 1 : event.status;
We will get a "comparison is always false due to limited range of data
type" warning from gcc if KVM_IRQ_LINE is one of the bit-31-set ioctls,
which it is on powerpc.
So, despite the fact that the man page and posix say ioctl numbers are
signed, they're actually unsigned. The kernel uses unsigned, the glibc
header uses unsigned long, and FreeBSD, NetBSD and OSX also use unsigned
long ioctl numbers in the code.
Therefore, this patch changes the variable to be unsigned, fixing the
compile.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
Scripted conversion:
for file in *.[hc] hw/*.[hc] hw/kvm/*.[hc] linux-user/*.[hc] linux-user/m68k/*.[hc] bsd-user/*.[hc] darwin-user/*.[hc] tcg/*/*.[hc] target-*/cpu.h; do
sed -i "s/CPUState/CPUArchState/g" $file
done
All occurrences of CPUArchState are expected to be replaced by QOM CPUState,
once all targets are QOM'ified and common fields have been extracted.
Signed-off-by: Andreas Färber <afaerber@suse.de>
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
* stefanha/trivial-patches:
configure: Quote the configure args printed in config.log
osdep: Remove local definition of macro offsetof
libcacard: Spelling and grammar fixes in documentation
Spelling fixes in comments (it's -> its)
vnc: Add break statement
libcacard: Use format specifier %u instead of %d for unsigned values
Fix sign of sscanf format specifiers
block/vmdk: Fix warning from splint (comparision of unsigned value)
qmp: Fix spelling fourty -> forty
qom: Fix spelling in documentation
sh7750: Remove redundant 'struct' from MemoryRegionOps
* it's -> its (fixed for all files)
* dont -> don't (only fixed in a line which was touched by the previous fix)
* distrub -> disturb (fixed in the same line)
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
valgrind warns about padding fields which are passed
to vcpu ioctls uninitialized.
This is not an error in practice because kvm ignored padding.
Since the ioctls in question are off data path and
the cost is zero anyway, initialize padding to 0
to suppress these errors.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
* qemu-kvm/memory/core: (30 commits)
memory: allow phys_map tree paths to terminate early
memory: unify PhysPageEntry::node and ::leaf
memory: change phys_page_set() to set multiple pages
memory: switch phys_page_set() to a recursive implementation
memory: replace phys_page_find_alloc() with phys_page_set()
memory: simplify multipage/subpage registration
memory: give phys_page_find() its own tree search loop
memory: make phys_page_find() return a MemoryRegionSection
memory: move tlb flush to MemoryListener commit callback
memory: unify the two branches of cpu_register_physical_memory_log()
memory: fix RAM subpages in newly initialized pages
memory: compress phys_map node pointers to 16 bits
memory: store MemoryRegionSection pointers in phys_map
memory: unify phys_map last level with intermediate levels
memory: remove first level of l1_phys_map
memory: change memory registration to rebuild the memory map on each change
memory: support stateless memory listeners
memory: split memory listener for the two address spaces
xen: ignore I/O memory regions
memory: allow MemoryListeners to observe a specific address space
...
kvm_set_phys_mem() may be passed sections that are not aligned to a page
boundary. The current code simply brute-forces the alignment which leads
to an inconsistency and an abort().
Fix by aligning the start and the end of the section correctly, discarding
and unaligned head or tail.
This was triggered by a guest sizing a 64-bit BAR that is smaller than a page
with PCI_COMMAND_MEMORY enabled and the upper dword clear.
Signed-off-by: Avi Kivity <avi@redhat.com>
Current memory listeners are incremental; that is, they are expected to
maintain their own state, and receive callbacks for changes to that state.
This patch adds support for stateless listeners; these work by receiving
a ->begin() callback (which tells them that new state is coming), a
sequence of ->region_add() and ->region_nop() callbacks, and then a
->commit() callback which signifies the end of the new state. They should
ignore ->region_del() callbacks.
Signed-off-by: Avi Kivity <avi@redhat.com>
This allows reverse iteration, which in turns allows consistent ordering
among multiple listeners:
l1->add
l2->add
l2->del
l1->del
Signed-off-by: Avi Kivity <avi@redhat.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
As we have thread-local cpu_single_env now and KVM uses exactly one
thread per VCPU, we can drop the cpu_single_env updates from the loop
and initialize this variable only once during setup.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
To both avoid that kvm_irqchip_in_kernel always has to be paired with
kvm_enabled and that the former ends up in a function call, implement it
like the latter. This means keeping the state in a global variable and
defining kvm_irqchip_in_kernel as a preprocessor macro.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Commit 84b058d broke compilation for KVM on non-x86 targets, which
don't have KVM_CAP_IRQ_ROUTING defined.
Fix by not using the unavailable constant when it's not around.
Signed-off-by: Alexander Graf <agraf@suse.de>
Instead of each target knowing or guessing the guest page size,
just pass the desired size of dirtied memory area.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
* qemu-kvm/uq/master:
kvm: Activate in-kernel irqchip support
kvm: x86: Add user space part for in-kernel IOAPIC
kvm: x86: Add user space part for in-kernel i8259
kvm: x86: Add user space part for in-kernel APIC
kvm: x86: Establish IRQ0 override control
kvm: Introduce core services for in-kernel irqchip support
memory: Introduce memory_region_init_reservation
ioapic: Factor out base class for KVM reuse
ioapic: Drop post-load irr initialization
i8259: Factor out base class for KVM reuse
i8259: Completely privatize PicState
apic: Open-code timer save/restore
apic: Factor out base class for KVM reuse
apic: Introduce apic_report_irq_delivered
apic: Inject external NMI events via LINT1
apic: Stop timer on reset
kvm: Move kvmclock into hw/kvm folder
msi: Generalize msix_supported to msi_supported
hyper-v: initialize Hyper-V CPUID leaves.
hyper-v: introduce Hyper-V support infrastructure.
Conflicts:
Makefile.target
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The kvm_get_dirty_pages_log_range() function uses two address
variables to step through the monitored memory region to update the
dirty log. However, these variables have type unsigned long, which
can overflow if running a 64-bit guest with a 32-bit qemu binary.
This patch changes these to target_phys_addr_t which will have the
correct size.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
KVM is forced to disable the IRQ0 override when we run with in-kernel
irqchip but without IRQ routing support of the kernel. Set the fwcfg
value correspondingly. This aligns us with qemu-kvm.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Add the basic infrastructure to active in-kernel irqchip support, inject
interrupts into these models, and maintain IRQ routes.
Routing is optional and depends on the host arch supporting
KVM_CAP_IRQ_ROUTING. When it's not available on x86, we looe the HPET as
we can't route GSI0 to IOAPIC pin 2.
In-kernel irqchip support will once be controlled by the machine
property 'kernel_irqchip', but this is not yet wired up.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Otherwise, the dirty log information is lost in the kernel forever.
Fixes opensuse-12.1 boot screen, which changes the vga windows rapidly.
Signed-off-by: Avi Kivity <avi@redhat.com>
Drop the use of cpu_register_phys_memory_client() in favour of the new
MemoryListener API. The new API simplifies the caller, since there is no
need to deal with splitting and merging slots; however this is not exploited
in this patch.
Signed-off-by: Avi Kivity <avi@redhat.com>
It's a little unfriendly to call abort() without printing any sort of
error message. So turn the DPRINTK() into an fprintf(stderr, ...).
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
And kvm_ioctl(s, KVM_CREATE_VM, 0)'s return value can be < -1,
so change the check of vmfd at label 'err'.
Signed-off-by: Xu He Jie <xuhj@linux.vnet.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
mmio callbacks invoked by kvm_flush_coalesced_mmio_buffer() may
themselves indirectly call kvm_flush_coalesced_mmio_buffer().
Prevent reentering the function by checking a flag that indicates
we're processing coalesced mmio requests.
Signed-off-by: Avi Kivity <avi@redhat.com>
Next commit will convert the query-status command to use the
RunState type as generated by the QAPI.
In order to "transparently" replace the current enum by the QAPI
one, we have to make some changes to some enum values.
As the changes are simple renames, I'll do them in one shot. The
changes are:
- Rename the prefix from RSTATE_ to RUN_STATE_
- RUN_STATE_SAVEVM to RUN_STATE_SAVE_VM
- RUN_STATE_IN_MIGRATE to RUN_STATE_INMIGRATE
- RUN_STATE_PANICKED to RUN_STATE_INTERNAL_ERROR
- RUN_STATE_POST_MIGRATE to RUN_STATE_POSTMIGRATE
- RUN_STATE_PRE_LAUNCH to RUN_STATE_PRELAUNCH
- RUN_STATE_PRE_MIGRATE to RUN_STATE_PREMIGRATE
- RUN_STATE_RESTORE to RUN_STATE_RESTORE_VM
- RUN_STATE_PRE_MIGRATE to RUN_STATE_FINISH_MIGRATE
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Today, when notifying a VM state change with vm_state_notify(),
we pass a VMSTOP macro as the 'reason' argument. This is not ideal
because the VMSTOP macros tell why qemu stopped and not exactly
what the current VM state is.
One example to demonstrate this problem is that vm_start() calls
vm_state_notify() with reason=0, which turns out to be VMSTOP_USER.
This commit fixes that by replacing the VMSTOP macros with a proper
state type called RunState.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Enabling the I/O thread by default seems like an important part of declaring
1.0. Besides allowing true SMP support with KVM, the I/O thread means that the
TCG VCPU doesn't have to multiplex itself with the I/O dispatch routines which
currently requires a (racey) signal based alarm system.
I know there have been concerns about performance. I think so far the ones that
have come up (virtio-net) are most likely due to secondary reasons like
decreased batching.
I think we ought to force enabling I/O thread early in 1.0 development and
commit to resolving any lingering issues.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
No longer needed with accompanied kernel headers. We are only left with
build dependencies that are controlled by kvm arch headers.
CC: Alexander Graf <agraf@suse.de>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
On PPC, the default PAGE_SIZE is 64kb. Unfortunately, the hardware
alignments don't match here: There are RAM and MMIO regions within
a single page when it's 64kb in size.
So the only way out for now is to tell the user that he should use 4k
PAGE_SIZE.
This patch gives the user a hint on that, telling him that failing to
register a prefix slot is most likely to be caused by mismatching PAGE_SIZE.
This way it's also more future-proof, as bigger PAGE_SIZE can easily be
supported by other machines then, as long as they stick to 64kb granularities.
Signed-off-by: Alexander Graf <agraf@suse.de>
This change fixes a long-standing immediate crash (memory corruption
and abort in glibc malloc code) in migration on 32bits.
The bug is present since this commit:
commit 692d9aca97b865b0f7903565274a52606910f129
Author: Bruce Rogers <brogers@novell.com>
Date: Wed Sep 23 16:13:18 2009 -0600
qemu-kvm: allocate correct size for dirty bitmap
The dirty bitmap copied out to userspace is stored in a long array,
and gets copied out to userspace accordingly. This patch accounts
for that correctly. Currently I'm seeing kvm crashing due to writing
beyond the end of the alloc'd dirty bitmap memory, because the buffer
has the wrong size.
Signed-off-by: Bruce Rogers
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ int kvm_get_dirty_pages_range(kvm_context_t kvm, unsigned long phys_addr,
- buf = qemu_malloc((slots[i].len / 4096 + 7) / 8 + 2);
+ buf = qemu_malloc(BITMAP_SIZE(slots[i].len));
r = kvm_get_map(kvm, KVM_GET_DIRTY_LOG, i, buf);
BITMAP_SIZE is now open-coded in that function, like this:
size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS), HOST_LONG_BITS) / 8;
The problem is that HOST_LONG_BITS in 32bit userspace is 32
but it's 64 in 64bit kernel. So userspace aligns this to
32, and kernel to 64, but since no length is passed from
userspace to kernel on ioctl, kernel uses its size calculation
and copies 4 extra bytes to userspace, corrupting memory.
Here's how it looks like during migrate execution:
our=20, kern=24
our=4, kern=8
...
our=4, kern=8
our=4064, kern=4064
our=512, kern=512
our=4, kern=8
our=20, kern=24
our=4, kern=8
...
our=4, kern=8
our=4064, kern=4064
*** glibc detected *** ./x86_64-softmmu/qemu-system-x86_64: realloc(): invalid next size: 0x08f20528 ***
(our is userspace size above, kern is the size as calculated
by the kernel).
Fix this by always aligning to 64 in a hope that no platform will
have sizeof(long)>8 any time soon, and add a comment describing it
all. It's a small price to pay for bad kernel design.
Alternatively it's possible to fix that in the kernel by using
different size calculation depending on the current process.
But this becomes quite ugly.
Special thanks goes to Stefan Hajnoczi for spotting the fundamental
cause of the issue, and to Alexander Graf for his support in #qemu.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
CC: Bruce Rogers <brogers@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM only requires to set the raised IRQ in CPUState and to kick the
receiving vcpu if it is remote. Installing a specialized handler allows
potential future changes to the TCG code path without risking KVM side
effects.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
use the new api to reduce the number of these (expensive)
system calls.
Note: using this API, we should be able to
get rid of vga_dirty_log_xxx APIs. Using them doesn't
affect the performance though because we detects
the log_dirty flag set and ignores the call.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
There are no generic bits remaining in the handling of KVM_EXIT_DEBUG.
So push its logic completely into arch hands, i.e. only x86 so far.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Make the return code of kvm_arch_handle_exit directly usable for
kvm_cpu_exec. This is straightforward for x86 and ppc, just s390
would require more work. Avoid this for now by pushing the return code
translation logic into s390's kvm_arch_handle_exit.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
CC: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Let kvm_cpu_exec return EXCP_* values consistently and generate those
codes already inside its inner loop. This means we will now re-enter the
kernel while ret == 0.
Update kvm_handle_internal_error accordingly, but keep
kvm_arch_handle_exit untouched, it will be converted in a separate step.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Test for general errors first as this is the slower path.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Avoid using 'ret' both for the return value of KVM_RUN as well as the
code kvm_cpu_exec is supposed to return. Both have no direct relation.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Without KVM_CAP_SET_GUEST_DEBUG, we neither motivate the kernel to
report KVM_EXIT_DEBUG nor do we expect such exits. So fall through to
the arch code which will simply report an unknown exit reason.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This avoids that early cpu_synchronize_state calls try to retrieve an
uninitialized state from the kernel. That even causes a deadlock if
io-thread is enabled.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We will broaden the scope of this function on x86 beyond irqchip events.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Original fix by David Gibson.
CC: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
KVM-assisted devices need access to it but we have no clean channel to
distribute a reference. As a workaround until there is a better
solution, export kvm_state for global use, though use should remain
restricted to the mentioned scenario.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In order to use log_start/log_stop with Xen as well in the vga code,
this two operations have been put in CPUPhysMemoryClient.
The two new functions cpu_physical_log_start,cpu_physical_log_stop are
used in hw/vga.c and replace the kvm_log_start/stop. With this, vga does
no longer depends on kvm header.
[ Jan: rebasing and style fixlets ]
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The number of slots and the location of private ones changed several
times in KVM's early days. However, it's stable since 2.6.29 (our
required baseline), and slots 8..11 are no longer reserved since then.
So remove this unneeded restriction.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
CC: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Mixing up TCG bits with KVM already led to problems around eflags
emulation on x86. Moreover, quite some code that TCG requires on cpu
enty/exit is useless for KVM. So dispatch between tcg_cpu_exec and
kvm_cpu_exec as early as possible.
The core logic of cpu_halted from cpu_exec is added to
kvm_arch_process_irqchip_events. Moving away from cpu_exec makes
exception_index meaningless for KVM, we can simply pass the exit reason
directly (only "EXCP_DEBUG vs. rest" is relevant).
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Define and use dedicated constants for vm_stop reasons, they actually
have nothing to do with the EXCP_* defines used so far. At this chance,
specify more detailed reasons so that VM state change handlers can
evaluate them.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The reset we issue on KVM_EXIT_SHUTDOWN implies that we should also
leave the VCPU loop. As we now check for exit_request which is set by
qemu_system_reset_request, this bug is no longer critical. Still it's an
unneeded extra turn.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Improve the readability of the exit dispatcher by moving the static
return value of kvm_handle_io to its caller.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
KVM requires to reenter the kernel after IO exits in order to complete
instruction emulation. Failing to do so will leave the kernel state
inconsistently behind. To ensure that we will get back ASAP, we issue a
self-signal that will cause KVM_RUN to return once the pending
operations are completed.
We can move kvm_arch_process_irqchip_events out of the inner VCPU loop.
The only state that mattered at its old place was a pending INIT
request. Catch it in kvm_arch_pre_run and also trigger a self-signal to
process the request on next kvm_cpu_exec.
This patch also fixes the missing exit_request check in kvm_cpu_exec in
the CONFIG_IOTHREAD case.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
CC: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Provide arch-independent kvm_on_sigbus* stubs to remove the #ifdef'ery
from cpus.c. This patch also fixes --disable-kvm build by providing the
missing kvm_on_sigbus_vcpu kvm-stub.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
It is not possible to use virtio-ioeventfd when building without an I/O
thread. We rely on a signal to kick us out of vcpu execution. Timers
and AIO use SIGALRM and SIGUSR2 respectively. Unfortunately eventfd
does not support O_ASYNC (SIGIO) so eventfd cannot be used in a signal
driven manner.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We must flush pending mmio writes if we leave kvm_cpu_exec for an IO
window. Otherwise we risk to loose those requests when migrating to a
different host during that window.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Instead of splattering the code with #ifdefs and runtime checks for
capabilities we cannot work without anyway, provide central test
infrastructure for verifying their availability both at build and
runtime.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Introduce the cpu_dump_state flag CPU_DUMP_CODE and implement it for
x86. This writes out the code bytes around the current instruction
pointer. Make use of this feature in KVM to help debugging fatal vm
exits.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Report KVM_EXIT_UNKNOWN, KVM_EXIT_FAIL_ENTRY, and KVM_EXIT_EXCEPTION
with more details to stderr. The latter two are so far x86-only, so move
them into the arch-specific handler. Integrate the Intel real mode
warning on KVM_EXIT_FAIL_ENTRY that qemu-kvm carries, but actually
restrict it to Intel CPUs. Moreover, always dump the CPU state in case
we fail.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Ensure that we stop the guest whenever we face a fatal or unknown exit
reason. If we stop, we also have to enforce a cpu loop exit.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
simple cleanup and use existing helper: kvm_check_extension().
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
There used to be a limit of 6 KVM io bus devices in the kernel.
On such a kernel, we can't use many ioeventfds for host notification
since the limit is reached too easily.
Add an API to test for this condition.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This makes ram block ordering under migration stable, ordered by offset.
This is especially useful for migration to exec, for debugging.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Jason Wang <jasowang@redhat.com>
In QEMU-KVM, physical address != RAM address. While MCE simulation
needs physical address instead of RAM address. So
kvm_physical_memory_addr_from_ram() is implemented to do the
conversion, and it is invoked before being filled in the IA32_MCi_ADDR
MSR.
Reported-by: Dean Nelson <dnelson@redhat.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
vl.c has a Sun-specific hack to supply a prototype for madvise(),
but the call site has apparently moved to arch_init.c.
Haiku doesn't implement madvise() in favor of posix_madvise().
OpenBSD and Solaris 10 don't implement posix_madvise() but madvise().
MinGW implements neither.
Check for madvise() and posix_madvise() in configure and supply qemu_madvise()
as wrapper. Prefer madvise() over posix_madvise() due to flag availability.
Convert all callers to use qemu_madvise() and QEMU_MADV_*.
Note that on Solaris the warning is fixed by moving the madvise() prototype,
not by qemu_madvise() itself. It helps with porting though, and it simplifies
most call sites.
v7 -> v8:
* Some versions of MinGW have no sys/mman.h header. Reported by Blue Swirl.
v6 -> v7:
* Adopt madvise() rather than posix_madvise() semantics for returning errors.
* Use EINVAL in place of ENOTSUP.
v5 -> v6:
* Replace two leftover instances of POSIX_MADV_NORMAL with QEMU_MADV_INVALID.
Spotted by Blue Swirl.
v4 -> v5:
* Introduce QEMU_MADV_INVALID, suggested by Alexander Graf.
Note that this relies on -1 not being a valid advice value.
v3 -> v4:
* Eliminate #ifdefs at qemu_advise() call sites. Requested by Blue Swirl.
This will currently break the check in kvm-all.c by calling madvise() with
a supported flag, which will not fail. Ideas/patches welcome.
v2 -> v3:
* Reuse the *_MADV_* defines for QEMU_MADV_*. Suggested by Alexander Graf.
* Add configure check for madvise(), too.
Add defines to Makefile, not QEMU_CFLAGS.
Convert all callers, untested. Suggested by Blue Swirl.
* Keep Solaris' madvise() prototype around. Pointed out by Alexander Graf.
* Display configure check results.
v1 -> v2:
* Don't rely on posix_madvise() availability, add qemu_madvise().
Suggested by Blue Swirl.
Signed-off-by: Andreas Färber <afaerber@opensolaris.org>
Cc: Blue Swirl <blauwirbel@gmail.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>