This change fixes a long-standing immediate crash (memory corruption
and abort in glibc malloc code) in migration on 32bits.
The bug is present since this commit:
commit 692d9aca97b865b0f7903565274a52606910f129
Author: Bruce Rogers <brogers@novell.com>
Date: Wed Sep 23 16:13:18 2009 -0600
qemu-kvm: allocate correct size for dirty bitmap
The dirty bitmap copied out to userspace is stored in a long array,
and gets copied out to userspace accordingly. This patch accounts
for that correctly. Currently I'm seeing kvm crashing due to writing
beyond the end of the alloc'd dirty bitmap memory, because the buffer
has the wrong size.
Signed-off-by: Bruce Rogers
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ int kvm_get_dirty_pages_range(kvm_context_t kvm, unsigned long phys_addr,
- buf = qemu_malloc((slots[i].len / 4096 + 7) / 8 + 2);
+ buf = qemu_malloc(BITMAP_SIZE(slots[i].len));
r = kvm_get_map(kvm, KVM_GET_DIRTY_LOG, i, buf);
BITMAP_SIZE is now open-coded in that function, like this:
size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS), HOST_LONG_BITS) / 8;
The problem is that HOST_LONG_BITS in 32bit userspace is 32
but it's 64 in 64bit kernel. So userspace aligns this to
32, and kernel to 64, but since no length is passed from
userspace to kernel on ioctl, kernel uses its size calculation
and copies 4 extra bytes to userspace, corrupting memory.
Here's how it looks like during migrate execution:
our=20, kern=24
our=4, kern=8
...
our=4, kern=8
our=4064, kern=4064
our=512, kern=512
our=4, kern=8
our=20, kern=24
our=4, kern=8
...
our=4, kern=8
our=4064, kern=4064
*** glibc detected *** ./x86_64-softmmu/qemu-system-x86_64: realloc(): invalid next size: 0x08f20528 ***
(our is userspace size above, kern is the size as calculated
by the kernel).
Fix this by always aligning to 64 in a hope that no platform will
have sizeof(long)>8 any time soon, and add a comment describing it
all. It's a small price to pay for bad kernel design.
Alternatively it's possible to fix that in the kernel by using
different size calculation depending on the current process.
But this becomes quite ugly.
Special thanks goes to Stefan Hajnoczi for spotting the fundamental
cause of the issue, and to Alexander Graf for his support in #qemu.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
CC: Bruce Rogers <brogers@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM only requires to set the raised IRQ in CPUState and to kick the
receiving vcpu if it is remote. Installing a specialized handler allows
potential future changes to the TCG code path without risking KVM side
effects.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
use the new api to reduce the number of these (expensive)
system calls.
Note: using this API, we should be able to
get rid of vga_dirty_log_xxx APIs. Using them doesn't
affect the performance though because we detects
the log_dirty flag set and ignores the call.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
There are no generic bits remaining in the handling of KVM_EXIT_DEBUG.
So push its logic completely into arch hands, i.e. only x86 so far.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Make the return code of kvm_arch_handle_exit directly usable for
kvm_cpu_exec. This is straightforward for x86 and ppc, just s390
would require more work. Avoid this for now by pushing the return code
translation logic into s390's kvm_arch_handle_exit.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
CC: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Let kvm_cpu_exec return EXCP_* values consistently and generate those
codes already inside its inner loop. This means we will now re-enter the
kernel while ret == 0.
Update kvm_handle_internal_error accordingly, but keep
kvm_arch_handle_exit untouched, it will be converted in a separate step.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Test for general errors first as this is the slower path.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Avoid using 'ret' both for the return value of KVM_RUN as well as the
code kvm_cpu_exec is supposed to return. Both have no direct relation.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Without KVM_CAP_SET_GUEST_DEBUG, we neither motivate the kernel to
report KVM_EXIT_DEBUG nor do we expect such exits. So fall through to
the arch code which will simply report an unknown exit reason.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This avoids that early cpu_synchronize_state calls try to retrieve an
uninitialized state from the kernel. That even causes a deadlock if
io-thread is enabled.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We will broaden the scope of this function on x86 beyond irqchip events.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Original fix by David Gibson.
CC: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
KVM-assisted devices need access to it but we have no clean channel to
distribute a reference. As a workaround until there is a better
solution, export kvm_state for global use, though use should remain
restricted to the mentioned scenario.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In order to use log_start/log_stop with Xen as well in the vga code,
this two operations have been put in CPUPhysMemoryClient.
The two new functions cpu_physical_log_start,cpu_physical_log_stop are
used in hw/vga.c and replace the kvm_log_start/stop. With this, vga does
no longer depends on kvm header.
[ Jan: rebasing and style fixlets ]
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The number of slots and the location of private ones changed several
times in KVM's early days. However, it's stable since 2.6.29 (our
required baseline), and slots 8..11 are no longer reserved since then.
So remove this unneeded restriction.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
CC: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Mixing up TCG bits with KVM already led to problems around eflags
emulation on x86. Moreover, quite some code that TCG requires on cpu
enty/exit is useless for KVM. So dispatch between tcg_cpu_exec and
kvm_cpu_exec as early as possible.
The core logic of cpu_halted from cpu_exec is added to
kvm_arch_process_irqchip_events. Moving away from cpu_exec makes
exception_index meaningless for KVM, we can simply pass the exit reason
directly (only "EXCP_DEBUG vs. rest" is relevant).
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Define and use dedicated constants for vm_stop reasons, they actually
have nothing to do with the EXCP_* defines used so far. At this chance,
specify more detailed reasons so that VM state change handlers can
evaluate them.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The reset we issue on KVM_EXIT_SHUTDOWN implies that we should also
leave the VCPU loop. As we now check for exit_request which is set by
qemu_system_reset_request, this bug is no longer critical. Still it's an
unneeded extra turn.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Improve the readability of the exit dispatcher by moving the static
return value of kvm_handle_io to its caller.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
KVM requires to reenter the kernel after IO exits in order to complete
instruction emulation. Failing to do so will leave the kernel state
inconsistently behind. To ensure that we will get back ASAP, we issue a
self-signal that will cause KVM_RUN to return once the pending
operations are completed.
We can move kvm_arch_process_irqchip_events out of the inner VCPU loop.
The only state that mattered at its old place was a pending INIT
request. Catch it in kvm_arch_pre_run and also trigger a self-signal to
process the request on next kvm_cpu_exec.
This patch also fixes the missing exit_request check in kvm_cpu_exec in
the CONFIG_IOTHREAD case.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
CC: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Provide arch-independent kvm_on_sigbus* stubs to remove the #ifdef'ery
from cpus.c. This patch also fixes --disable-kvm build by providing the
missing kvm_on_sigbus_vcpu kvm-stub.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
It is not possible to use virtio-ioeventfd when building without an I/O
thread. We rely on a signal to kick us out of vcpu execution. Timers
and AIO use SIGALRM and SIGUSR2 respectively. Unfortunately eventfd
does not support O_ASYNC (SIGIO) so eventfd cannot be used in a signal
driven manner.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We must flush pending mmio writes if we leave kvm_cpu_exec for an IO
window. Otherwise we risk to loose those requests when migrating to a
different host during that window.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Instead of splattering the code with #ifdefs and runtime checks for
capabilities we cannot work without anyway, provide central test
infrastructure for verifying their availability both at build and
runtime.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Introduce the cpu_dump_state flag CPU_DUMP_CODE and implement it for
x86. This writes out the code bytes around the current instruction
pointer. Make use of this feature in KVM to help debugging fatal vm
exits.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Report KVM_EXIT_UNKNOWN, KVM_EXIT_FAIL_ENTRY, and KVM_EXIT_EXCEPTION
with more details to stderr. The latter two are so far x86-only, so move
them into the arch-specific handler. Integrate the Intel real mode
warning on KVM_EXIT_FAIL_ENTRY that qemu-kvm carries, but actually
restrict it to Intel CPUs. Moreover, always dump the CPU state in case
we fail.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Ensure that we stop the guest whenever we face a fatal or unknown exit
reason. If we stop, we also have to enforce a cpu loop exit.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
simple cleanup and use existing helper: kvm_check_extension().
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
There used to be a limit of 6 KVM io bus devices in the kernel.
On such a kernel, we can't use many ioeventfds for host notification
since the limit is reached too easily.
Add an API to test for this condition.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This makes ram block ordering under migration stable, ordered by offset.
This is especially useful for migration to exec, for debugging.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Jason Wang <jasowang@redhat.com>
In QEMU-KVM, physical address != RAM address. While MCE simulation
needs physical address instead of RAM address. So
kvm_physical_memory_addr_from_ram() is implemented to do the
conversion, and it is invoked before being filled in the IA32_MCi_ADDR
MSR.
Reported-by: Dean Nelson <dnelson@redhat.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
vl.c has a Sun-specific hack to supply a prototype for madvise(),
but the call site has apparently moved to arch_init.c.
Haiku doesn't implement madvise() in favor of posix_madvise().
OpenBSD and Solaris 10 don't implement posix_madvise() but madvise().
MinGW implements neither.
Check for madvise() and posix_madvise() in configure and supply qemu_madvise()
as wrapper. Prefer madvise() over posix_madvise() due to flag availability.
Convert all callers to use qemu_madvise() and QEMU_MADV_*.
Note that on Solaris the warning is fixed by moving the madvise() prototype,
not by qemu_madvise() itself. It helps with porting though, and it simplifies
most call sites.
v7 -> v8:
* Some versions of MinGW have no sys/mman.h header. Reported by Blue Swirl.
v6 -> v7:
* Adopt madvise() rather than posix_madvise() semantics for returning errors.
* Use EINVAL in place of ENOTSUP.
v5 -> v6:
* Replace two leftover instances of POSIX_MADV_NORMAL with QEMU_MADV_INVALID.
Spotted by Blue Swirl.
v4 -> v5:
* Introduce QEMU_MADV_INVALID, suggested by Alexander Graf.
Note that this relies on -1 not being a valid advice value.
v3 -> v4:
* Eliminate #ifdefs at qemu_advise() call sites. Requested by Blue Swirl.
This will currently break the check in kvm-all.c by calling madvise() with
a supported flag, which will not fail. Ideas/patches welcome.
v2 -> v3:
* Reuse the *_MADV_* defines for QEMU_MADV_*. Suggested by Alexander Graf.
* Add configure check for madvise(), too.
Add defines to Makefile, not QEMU_CFLAGS.
Convert all callers, untested. Suggested by Blue Swirl.
* Keep Solaris' madvise() prototype around. Pointed out by Alexander Graf.
* Display configure check results.
v1 -> v2:
* Don't rely on posix_madvise() availability, add qemu_madvise().
Suggested by Blue Swirl.
Signed-off-by: Andreas Färber <afaerber@opensolaris.org>
Cc: Blue Swirl <blauwirbel@gmail.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This abort() condition is easily triggerable by a guest if it configures
pci bar with unaligned address that overlaps main memory.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If we've unregistered a memory area, we should avoid calling
qemu_get_ram_ptr() on the left over phys_offset cruft in the
slot array. Now that we support removing ramblocks, the
phys_offset ram_addr_t can go away and cause a lookup fault
and abort.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Guest debugging is currently broken under CONFIG_IOTHREAD. The reason is
inconsistent or even lacking signaling the debug events from the source
VCPU to the main loop and the gdbstub.
This patch addresses the issue by pushing this signaling into a
CPUDebugExcpHandler: cpu_debug_handler is registered as first handler,
thus will be executed last after potential breakpoint emulation
handlers. It sets informs the gdbstub about the debug event source,
requests a debug exit of the main loop and stops the current VCPU. This
mechanism works both for TCG and KVM, with and without IO-thread.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Guest debugging under KVM is currently broken once io-threads are
enabled. Easily fixable by switching the fake on_vcpu to the real
run_on_cpu implementation.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Continue vcpu execution in case emulation failure happened while vcpu
was in userspace. In this case #UD will be injected into the guest
allowing guest OS to kill offending process and continue.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
It is not safe to retrieve the KVM internal state of a given cpu
while its potentially modifying it.
Queue the request to run on cpu context, similarly to qemu-kvm.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Zero cpu_single_env before leaving global lock protection, and
restore on return.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Port qemu-kvm's KVM_EXIT_INTERNAL_ERROR handling to upstream.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Make use of the new KVM_GET/SET_DEBUGREGS to save/restore the x86 debug
registers.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This allows limited use of kvm functions (which will return ENOSYS)
even in once-compiled modules. The patch also improves a bit the error
messages for KVM initialization.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[blauwirbel@gmail.com: fixed Win32 build]
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Fixes clang errors:
CC i386-softmmu/kvm.o
/src/qemu/target-i386/kvm.c:40:9: error: 'dprintf' macro redefined
In file included from /src/qemu/target-i386/kvm.c:21:
In file included from /src/qemu/qemu-common.h:27:
In file included from /usr/include/stdio.h:910:
/usr/include/bits/stdio2.h:189:12: note: previous definition is here
CC i386-softmmu/kvm-all.o
/src/qemu/kvm-all.c:39:9: error: 'dprintf' macro redefined
In file included from /src/qemu/kvm-all.c:23:
In file included from /src/qemu/qemu-common.h:27:
In file included from /usr/include/stdio.h:910:
/usr/include/bits/stdio2.h:189:12: note: previous definition is here
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
The KVM kernel module on S390 refuses to create a VM when the switch_amode
kernel parameter is not used.
Since that is not exactly obvious, let's give the user a nice warning.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Comment on kvm usage: rather than require users to do if (kvm_enabled())
and/or ifdefs, this patch adds an API that, internally, is defined to
stub function on non-kvm build, and checks kvm_enabled for non-kvm
run.
While rest of qemu code still uses if (kvm_enabled()), I think this
approach is cleaner, and we should convert rest of code to it
long term.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This grand cleanup drops all reset and vmsave/load related
synchronization points in favor of four(!) generic hooks:
- cpu_synchronize_all_states in qemu_savevm_state_complete
(initial sync from kernel before vmsave)
- cpu_synchronize_all_post_init in qemu_loadvm_state
(writeback after vmload)
- cpu_synchronize_all_post_init in main after machine init
- cpu_synchronize_all_post_reset in qemu_system_reset
(writeback after system reset)
These writeback points + the existing one of VCPU exec after
cpu_synchronize_state map on three levels of writeback:
- KVM_PUT_RUNTIME_STATE (during runtime, other VCPUs continue to run)
- KVM_PUT_RESET_STATE (on synchronous system reset, all VCPUs stopped)
- KVM_PUT_FULL_STATE (on init or vmload, all VCPUs stopped as well)
This level is passed to the arch-specific VCPU state writing function
that will decide which concrete substates need to be written. That way,
no writer of load, save or reset functions that interact with in-kernel
KVM states will ever have to worry about synchronization again. That
also means that a lot of reasons for races, segfaults and deadlocks are
eliminated.
cpu_synchronize_state remains untouched, just as Anthony suggested. We
continue to need it before reading or writing of VCPU states that are
also tracked by in-kernel KVM subsystems.
Consequently, this patch removes many cpu_synchronize_state calls that
are now redundant, just like remaining explicit register syncs.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
So far we synchronized any dirty VCPU state back into the kernel before
updating the guest debug state. This was a tribute to a deficite in x86
kernels before 2.6.33. But as this is an arch-dependent issue, it is
better handle in the x86 part of KVM and remove the writeback point for
generic code. This also avoids overwriting the flushed state later on if
user space decides to change some more registers before resuming the
guest.
We furthermore need to reinject guest exceptions via the appropriate
mechanism. That is KVM_SET_GUEST_DEBUG for older kernels and
KVM_SET_VCPU_EVENTS for recent ones. Using both mechanisms at the same
time will cause state corruptions.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
With SIG_IPI blocked vcpu loop exit notification happens via -EAGAIN
from KVM_RUN.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Change the way the internal qemu signal, used for communication between
iothread and vcpus, is handled.
Block and consume it with sigtimedwait on the outer vcpu loop, which
allows more precise timing control.
Change from standard signal (SIGUSR1) to real-time one, so multiple
signals are not collapsed.
Set the signal number on KVM's in-kernel allowed sigmask.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We have some duplicated code in the CONFIG_IOTHREAD #ifdef and #else
cases. Fix that.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
remove direct kvm calls from exec.c, make
kvm use memory notifiers framework instead.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
move kvm_set_phys_mem so that it will
be later available earlier in the file.
needed for next patch using memory notifiers.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Touching the user space representation of KVM's VCPU state is -
naturally - a per-VCPU thing. So move the dirty flag into KVM_CPU_COMMON
and rename it at this chance to reflect its true meaning.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
The default action of coalesced MMIO is, cache the writing in buffer, until:
1. The buffer is full.
2. Or the exit to QEmu due to other reasons.
But this would result in a very late writing in some condition.
1. The each time write to MMIO content is small.
2. The writing interval is big.
3. No need for input or accessing other devices frequently.
This issue was observed in a experimental embbed system. The test image
simply print "test" every 1 seconds. The output in QEmu meets expectation,
but the output in KVM is delayed for seconds.
Per Avi's suggestion, I hooked flushing coalesced MMIO buffer in VGA update
handler. By this way, We don't need vcpu explicit exit to QEmu to
handle this issue.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch extends the qemu-kvm state sync logic with support for
KVM_GET/SET_VCPU_EVENTS, giving access to yet missing exception,
interrupt and NMI states.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We're leaking file descriptors to child processes. Set FD_CLOEXEC on file
descriptors that don't need to be passed to children to stop this misbehaviour.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Without this, kvm will hold the mutex while it issues its run ioctl,
and never be able to step out of it, causing a deadlock.
Patchworks-ID: 35359
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Recent changes made on_vcpu hit the abort() path, even with the IO thread
disabled. This is because cpu_single_env is no longer set when we call this
function. Although the correct fix is a little bit more complicated that that,
the recent thread in which I proposed qemu_queue_work (which fixes that, btw),
is likely to go on a quite different direction.
So for the benefit of those using guest debugging, I'm proposing this simple
fix in the interim.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Hopefully the last regression of 4c0960c0: KVM_SET_GUEST_DEBUG requires
properly synchronized guest registers (on x86: eflags) on entry.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
In the very least, a change like this requires discussion on the list.
The naming convention is goofy and it causes a massive merge problem. Something
like this _must_ be presented on the list first so people can provide input
and cope with it.
This reverts commit 99a0949b72.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The CPU state parameter is not used, remove it and adjust callers. Now we
can compile ioport.c once for all targets.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Problem: Our file sys-queue.h is a copy of the BSD file, but there are
some additions and it's not entirely compatible. Because of that, there have
been conflicts with system headers on BSD systems. Some hacks have been
introduced in the commits 15cc923584,
f40d753718,
96555a96d7 and
3990d09adf but the fixes were fragile.
Solution: Avoid the conflict entirely by renaming the functions and the
file. Revert the previous hacks.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
cpu_synchronize_state() is a little unreadable since the 'modified'
argument isn't self-explanatory. Simplify it by making it always
synchronize the kernel state into qemu, and automatically flush the
registers back to the kernel if they've been synchronized on this
exit.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This reverts commit bd83677612.
PPC should just implement dirty logging so we can avoid all the fall-out from
this changeset.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The only caller of on_vcpu() is protected by ifdef
KVM_CAP_SET_GUEST_DEBUG, so protect on_vcpu() too otherwise QEMU
may not to build.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We currently use host endian long types to store information
in the dirty bitmap.
This works reasonably well on Little Endian targets, because the
u32 after the first contains the next 32 bits. On Big Endian this
breaks completely though, forcing us to be inventive here.
So Ben suggested to always use Little Endian, which looks reasonable.
We only have dirty bitmap implemented in Little Endian targets so far
and since PowerPC would be the first Big Endian platform, we can just
as well switch to Little Endian always with little effort without
breaking existing targets.
This is the userspace part of the patch. It shouldn't change anything
for existing targets, but help PowerPC.
It replaces my older patch called "Use 64bit pointer for dirty log".
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Dirty logs currently get written with native "long" size. On little endian
it doesn't matter if we use uint64_t instead though, because we'd still end
up using the right bytes.
On big endian, this does become a bigger problem, so we need to ensure that
kernel and userspace talk the same language, which means getting rid of "long"
and using a defined size instead.
So I decided to use 64 bit types at all times. This doesn't break existing
targets but will in conjunction with a patch I'll send to the KVM ML make
dirty logs work with 32 bit userspace on 64 kernel with big endian.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
KVM can have an in-kernel pit or irqchip. While we don't implement it
yet, having a way for test for it (that always returns zero) will allow us
to reuse code in qemu-kvm that tests for it.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
on_vcpu is a qemu-kvm function that will make sure that a specific
piece of code will run on a requested cpu. We don't need that because
we're restricted to -smp 1 right now, but those days are likely to end soon.
So for the benefit of having qemu-kvm share more code with us, I'm
introducing our own version of on_vcpu(). Right now, we either run
a function on the current cpu, or abort the execution, because it would
mean something is seriously wrong.
As an example code, I "ported" kvm_update_guest_debug to use it,
with some slight differences from qemu-kvm.
This is probably 0.12 material
Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Some KVM platforms don't support dirty logging yet, like IA64 and PPC,
so in order to still have screen updates on those, we need to fake it.
This patch just tells the getter function for dirty bitmaps, that all
pages within a slot are dirty when the slot has dirty logging enabled.
That way we can implement dirty logging on those platforms sometime when
it drags down performance, but share the rest of the code with dirty
logging capable platforms.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This fixes a warning I stumbled across while compiling qemu on PPC64.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This reverts commit 8217606e6e (and
updates later added users of qemu_register_reset), we solved the
problem it originally addressed less invasively.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
During startup and after reset we have to synchronize user space to the
in-kernel KVM state. Namely, we need to transfer the VCPU registers when
they change due to VCPU as well as APIC reset.
This patch refactors the required hooks so that kvm_init_vcpu registers
its own per-VCPU reset handler and adds a cpu_synchronize_state to the
APIC reset. That way we no longer depend on the new reset order (and can
drop this disliked interface again) and we can even drop a KVM hook in
main().
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
User space may only inject interrupts during kvm_arch_pre_run if
ready_for_interrupt_injection is set in kvm_run. But that field is
updated on exit from KVM_RUN, so we must ensure that we enter the
kernel after potentially queuing an interrupt, otherwise we risk to
loose one - like it happens with the current code against latest
kernel modules (since kvm-86) that started to queue only a single
interrupt.
Fix the problem by reordering kvm_cpu_exec.
Credits go to Gleb Natapov for analyzing the issue in details.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Users complained that it is not obvious what to do when kvm refuses to
build or run due to an unsupported host kernel, so let's improve the
hints.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Save and restore all so far neglected KVM-specific CPU states. Handling
the TSC stabilizes migration in KVM mode. The interrupt_bitmap and
mp_state are currently unused, but will become relevant for in-kernel
irqchip support. By including proper saving/restoring already, we avoid
having to increment CPU_SAVE_VERSION later on once again.
v2:
- initialize mp_state runnable (for the boot CPU)
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Use standard callback with highest order to synchronize VCPU on reset
after all device callbacks were execute. This allows to remove the
special kvm hook in qemu_system_reset.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Extend kvm_physical_sync_dirty_bitmap() so that is can sync across
multiple slots. Useful for updating the whole dirty log during
migration. Moreover, properly pass down errors the whole call chain.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The buffer passed to KVM_GET_DIRTY_LOG requires one bit per page. Fix
the size calculation in kvm_physical_sync_dirty_bitmap accordingly,
avoiding allocation of extremly oversized buffers.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Introduce a global dirty logging flag that enforces logging for all
slots. This can be used by the live migration code to enable/disable
global logging withouth destroying the per-slot setting.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Only apply the workaround for broken slot joining in KVM when the
capability was not found that signals the corresponding fix existence.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Avi Kivity wrote:
> Suggest wrapping in a function and hiding it deep inside kvm-all.c.
>
Done in v2:
---------->
If the KVM MMU is asynchronous (kernel does not support MMU_NOTIFIER),
we have to avoid COW for the guest memory. Otherwise we risk serious
breakage when guest pages change there physical locations due to COW
after fork. Seen when forking smbd during runtime via -smb.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
There is no need to reject an unaligned memory region registration if
the region will be I/O memory and it will not split an existing KVM
slot. This fixes KVM support on PPC.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This reworks the slot management to handle more patterns of
cpu_register_physical_memory*, finally allowing to reset KVM guests (so
far address remapping on reset broke the slot management).
We could actually handle all possible ones without failing, but a KVM
kernel bug in older versions would force us to track all previous
fragmentations and maintain them (as that bug prevents registering
larger slots that overlap also deleted ones). To remain backward
compatible but avoid overly complicated workarounds, we apply a simpler
workaround that covers all currently used patterns.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@7139 c046a42c-6fe2-441c-8c8c-71466251a162
Fail loudly if we run out of memory slot.
Make sure that dirty log start/stop works with consistent memory regions
by reporting invalid parameters. This reveals several inconsistencies in
the vga code, patch to fix them follows later in this series.
And, for simplicity reasons, also catch and report unaligned memory
regions passed to kvm_set_phys_mem (KVM works on page basis).
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@7138 c046a42c-6fe2-441c-8c8c-71466251a162
Testing for TLB_MMIO on unmap makes no sense as A) that flag belongs to
CPUTLBEntry and not to io_memory slots or physical addresses and B) we
already use a different condition before mapping. So make this test
consistent.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@7137 c046a42c-6fe2-441c-8c8c-71466251a162
This is a backport of the guest debugging support for the KVM
accelerator that is now part of the KVM tree. It implements the reworked
KVM kernel API for guest debugging (KVM_CAP_SET_GUEST_DEBUG) which is
not yet part of any mainline kernel but will probably be 2.6.30 stuff.
So far supported is x86, but PPC is expected to catch up soon.
Core features are:
- unlimited soft-breakpoints via code patching
- hardware-assisted x86 breakpoints and watchpoints
Changes in this version:
- use generic hook cpu_synchronize_state to transfer registers between
user space and kvm
- push kvm_sw_breakpoints into KVMState
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@6825 c046a42c-6fe2-441c-8c8c-71466251a162
env->interrupt_request is accessed as the bit level from both main code
and signal handler, making a race condition possible even on CISC CPU.
This causes freeze of QEMU under high load when running the dyntick
clock.
The patch below move the bit corresponding to CPU_INTERRUPT_EXIT in a
separate variable, declared as volatile sig_atomic_t, so it should be
work even on RISC CPU.
We may want to move the cpu_interrupt(env, CPU_INTERRUPT_EXIT) case in
its own function and get rid of CPU_INTERRUPT_EXIT. That can be done
later, I wanted to keep the patch short for easier review.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@6728 c046a42c-6fe2-441c-8c8c-71466251a162
Move s under #ifdef to avoid compiler warning.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@6086 c046a42c-6fe2-441c-8c8c-71466251a162
Currently on x86, qemu initializes CPUState but KVM ignores it and does its
own vcpu initialization. However, PowerPC KVM needs to be able to set the
initial register state to support the -kernel and -append options.
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@6060 c046a42c-6fe2-441c-8c8c-71466251a162
MMIO exits are more expensive in KVM or Xen than in QEMU because they
involve, at least, privilege transitions. However, MMIO write
operations can be effectively batched if those writes do not have side
effects.
Good examples of this include VGA pixel operations when in a planar
mode. As it turns out, we can get a nice boost in other areas too.
Laurent mentioned a 9.7% performance boost in iperf with the coalesced
MMIO changes for the e1000 when he originally posted this work for KVM.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@5961 c046a42c-6fe2-441c-8c8c-71466251a162
Prior to kvm-80, memory slot deletion was broken in the KVM kernel
modules. In kvm-81, a new capability is introduced to signify that this
problem has been fixed.
Since we rely on being able to delete memory slots, refuse to work with
any kernel module that does not have this capability present.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@5960 c046a42c-6fe2-441c-8c8c-71466251a162
This adds a VirtIO based balloon driver. It uses madvise() to actually balloon
the memory when possible.
Until 2.6.27, KVM forced memory pinning so we must disable ballooning unless the
kernel actually supports it when using KVM. It's always safe when using TCG.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@5874 c046a42c-6fe2-441c-8c8c-71466251a162
Introduce functions to control logging of memory regions.
We select regions based on its start address, a
guest_physical_addr (target_phys_addr_t, in qemu nomenclature).
The main user of this interface right now is VGA optimization
(a way of reducing the number of mmio exits).
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@5792 c046a42c-6fe2-441c-8c8c-71466251a162
struct kvm_userspace_memory_region does not use QEMU friendly types to
define memory slots. This results in lots of ugly casting with warnings
on 32-bit platforms.
This patch introduces a proper KVMSlot structure that uses QEMU types to
describe memory slots. This eliminates many of the casts and isolates
the type conversions to one spot.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@5755 c046a42c-6fe2-441c-8c8c-71466251a162
Besides unassigned memory, we also don't care about MMIO.
So if we're giving an MMIO area that is already registered,
wipe it out.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@5753 c046a42c-6fe2-441c-8c8c-71466251a162
KVM keeps track of physical memory based on slots in the kernel. The current
code that translates QEMU memory mappings to slots work but is not robust
in the fact of reregistering partial regions of memory.
This patch does the right thing for reregistering partial regions of memory. It
also prevents QEMU from using KVM private slots.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@5734 c046a42c-6fe2-441c-8c8c-71466251a162
The third argument to ioctl is a ... which allows any value to be passed. In
practice, glibc always treats the argument as a void *.
Do the same thing for the kvm ioctls to keep things consistent with a
traditional ioctl.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@5715 c046a42c-6fe2-441c-8c8c-71466251a162
We don't need to use cpu_loop_exit() because we never use the
condition codes so everything can be folded into a single case.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@5669 c046a42c-6fe2-441c-8c8c-71466251a162