Commit Graph

142 Commits

Author SHA1 Message Date
Eduardo Habkost
30c60f77a8 x86-iommu: Rename QOM type macros
Some QOM macros were using a X86_IOMMU_DEVICE prefix, and others
were using a X86_IOMMU prefix.  Rename all of them to use the
same X86_IOMMU_DEVICE prefix.

This will make future conversion to OBJECT_DECLARE* easier.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20200825192110.3528606-47-ehabkost@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-09-02 07:29:25 -04:00
Vitaly Kuznetsov
0baa4b445e KVM: fix CPU reset wrt HF2_GIF_MASK
HF2_GIF_MASK is set in env->hflags2 unconditionally on CPU reset
(see x86_cpu_reset()) but when calling KVM_SET_NESTED_STATE,
KVM_STATE_NESTED_GIF_SET is only valid for nSVM as e.g. nVMX code
looks like

if (kvm_state->hdr.vmx.vmxon_pa == -1ull) {
    if (kvm_state->flags & ~KVM_STATE_NESTED_EVMCS)
        return -EINVAL;
}

Also, when adjusting the environment after KVM_GET_NESTED_STATE we
need not reset HF2_GIF_MASK on VMX as e.g. x86_cpu_pending_interrupt()
expects it to be set.

Alternatively, we could've made env->hflags2 SVM-only.

Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Fixes: b16c0e20c7 ("KVM: add support for AMD nested live migration")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200723142701.2521161-1-vkuznets@redhat.com>
Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-07-23 15:03:54 -04:00
Paolo Bonzini
e1e43813e7 KVM: x86: believe what KVM says about WAITPKG
Currently, QEMU is overriding KVM_GET_SUPPORTED_CPUID's answer for
the WAITPKG bit depending on the "-overcommit cpu-pm" setting.  This is a
bad idea because it does not even check if the host supports it, but it
can be done in x86_cpu_realizefn just like we do for the MONITOR bit.

This patch moves it there, while making it conditional on host
support for the related UMWAIT MSR.

Cc: qemu-stable@nongnu.org
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-10 18:02:22 -04:00
Paolo Bonzini
b16c0e20c7 KVM: add support for AMD nested live migration
Support for nested guest live migration is part of Linux 5.8, add the
corresponding code to QEMU.  The migration format consists of a few
flags, is an opaque 4k blob.

The blob is in VMCB format (the control area represents the L1 VMCB
control fields, the save area represents the pre-vmentry state; KVM does
not use the host save area since the AMD manual allows that) but QEMU
does not really care about that.  However, the flags need to be
copied to hflags/hflags2 and back.

In addition, support for retrieving and setting the AMD nested virtualization
states allows the L1 guest to be reset while running a nested guest, but
a small bug in CPU reset needs to be fixed for that to work.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-10 18:02:17 -04:00
Marcelo Tosatti
74aaddc628 kvm: i386: allow TSC to differ by NTP correction bounds without TSC scaling
The Linux TSC calibration procedure is subject to small variations
(its common to see +-1 kHz difference between reboots on a given CPU, for example).

So migrating a guest between two hosts with identical processor can fail, in case
of a small variation in calibrated TSC between them.

Allow a conservative 250ppm error between host TSC and VM TSC frequencies,
rather than requiring an exact match. NTP daemon in the guest can
correct this difference.

Also change migration to accept this bound.

KVM_SET_TSC_KHZ depends on a kernel interface change. Without this change,
the behaviour remains the same: in case of a different frequency
between host and VM, KVM_SET_TSC_KHZ will fail and QEMU will exit.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Message-Id: <20200616165805.GA324612@fuller.cnet>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-26 09:39:40 -04:00
Like Xu
ea39f9b643 target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES
The Perfmon and Debug Capability MSR named IA32_PERF_CAPABILITIES is
a feature-enumerating MSR, which only enumerates the feature full-width
write (via bit 13) by now which indicates the processor supports IA32_A_PMCx
interface for updating bits 32 and above of IA32_PMCx.

The existence of MSR IA32_PERF_CAPABILITIES is enumerated by CPUID.1:ECX[15].

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: qemu-devel@nongnu.org
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20200529074347.124619-5-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-10 12:10:47 -04:00
Pan Nengyuan
2a69314258 i386/kvm: fix a use-after-free when vcpu plug/unplug
When we hotplug vcpus, cpu_update_state is added to vm_change_state_head
in kvm_arch_init_vcpu(). But it forgot to delete in kvm_arch_destroy_vcpu() after
unplug. Then it will cause a use-after-free access. This patch delete it in
kvm_arch_destroy_vcpu() to fix that.

Reproducer:
    virsh setvcpus vm1 4 --live
    virsh setvcpus vm1 2 --live
    virsh suspend vm1
    virsh resume vm1

The UAF stack:
==qemu-system-x86_64==28233==ERROR: AddressSanitizer: heap-use-after-free on address 0x62e00002e798 at pc 0x5573c6917d9e bp 0x7fff07139e50 sp 0x7fff07139e40
WRITE of size 1 at 0x62e00002e798 thread T0
    #0 0x5573c6917d9d in cpu_update_state /mnt/sdb/qemu/target/i386/kvm.c:742
    #1 0x5573c699121a in vm_state_notify /mnt/sdb/qemu/vl.c:1290
    #2 0x5573c636287e in vm_prepare_start /mnt/sdb/qemu/cpus.c:2144
    #3 0x5573c6362927 in vm_start /mnt/sdb/qemu/cpus.c:2150
    #4 0x5573c71e8304 in qmp_cont /mnt/sdb/qemu/monitor/qmp-cmds.c:173
    #5 0x5573c727cb1e in qmp_marshal_cont qapi/qapi-commands-misc.c:835
    #6 0x5573c7694c7a in do_qmp_dispatch /mnt/sdb/qemu/qapi/qmp-dispatch.c:132
    #7 0x5573c7694c7a in qmp_dispatch /mnt/sdb/qemu/qapi/qmp-dispatch.c:175
    #8 0x5573c71d9110 in monitor_qmp_dispatch /mnt/sdb/qemu/monitor/qmp.c:145
    #9 0x5573c71dad4f in monitor_qmp_bh_dispatcher /mnt/sdb/qemu/monitor/qmp.c:234

Reported-by: Euler Robot <euler.robot@huawei.com>
Signed-off-by: Pan Nengyuan <pannengyuan@huawei.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200513132630.13412-1-pannengyuan@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-10 12:10:01 -04:00
Liran Alon
73b994f6d7 i386/cpu: Store LAPIC bus frequency in CPU structure
No functional change.
This information will be used by following patches.

Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Message-Id: <20200312165431.82118-15-liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-10 12:09:52 -04:00
Dongjiu Geng
6b552b9bc8 KVM: Move hwpoison page related functions into kvm-all.c
kvm_hwpoison_page_add() and kvm_unpoison_all() will both
be used by X86 and ARM platforms, so moving them into
"accel/kvm/kvm-all.c" to avoid duplicate code.

For architectures that don't use the poison-list functionality
the reset handler will harmlessly do nothing, so let's register
the kvm_unpoison_all() function in the generic kvm_init() function.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
Acked-by: Xiang Zheng <zhengxiang9@huawei.com>
Message-id: 20200512030609.19593-8-gengdongjiu@huawei.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2020-05-14 15:03:09 +01:00
Vitaly Kuznetsov
4a910e1f6a target/i386: do not set unsupported VMX secondary execution controls
Commit 048c95163b ("target/i386: work around KVM_GET_MSRS bug for
secondary execution controls") added a workaround for KVM pre-dating
commit 6defc591846d ("KVM: nVMX: include conditional controls in /dev/kvm
KVM_GET_MSRS") which wasn't setting certain available controls. The
workaround uses generic CPUID feature bits to set missing VMX controls.

It was found that in some cases it is possible to observe hosts which
have certain CPUID features but lack the corresponding VMX control.

In particular, it was reported that Azure VMs have RDSEED but lack
VMX_SECONDARY_EXEC_RDSEED_EXITING; attempts to enable this feature
bit result in QEMU abort.

Resolve the issue but not applying the workaround when we don't have
to. As there is no good way to find out if KVM has the fix itself, use
95c5c7c77c ("KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST") instead
as these [are supposed to] come together.

Fixes: 048c95163b ("target/i386: work around KVM_GET_MSRS bug for secondary execution controls")
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200331162752.1209928-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-02 14:55:45 -04:00
Paolo Bonzini
6702514814 target/i386: check for availability of MSR_IA32_UCODE_REV as an emulated MSR
Even though MSR_IA32_UCODE_REV has been available long before Linux 5.6,
which added it to the emulated MSR list, a bug caused the microcode
version to revert to 0x100000000 on INIT.  As a result, processors other
than the bootstrap processor would not see the host microcode revision;
some Windows version complain loudly about this and crash with a
fairly explicit MICROCODE REVISION MISMATCH error.

[If running 5.6 prereleases, the kernel fix "KVM: x86: do not reset
 microcode version on INIT or RESET" should also be applied.]

Reported-by: Alex Williamson <alex.williamson@redhat.com>
Message-id: <20200211175516.10716-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-12 16:29:40 +01:00
Philippe Mathieu-Daudé
4f7f589381 accel: Replace current_machine->accelerator by current_accel() wrapper
We actually want to access the accelerator, not the machine, so
use the current_accel() wrapper instead.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200121110349.25842-10-philmd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-24 20:59:11 +01:00
Paolo Bonzini
32c87d70ff target/i386: kvm: initialize microcode revision from KVM
KVM can return the host microcode revision as a feature MSR.
Use it as the default value for -cpu host.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1579544504-3616-4-git-send-email-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-24 20:59:10 +01:00
Paolo Bonzini
420ae1fc51 target/i386: kvm: initialize feature MSRs very early
Some read-only MSRs affect the behavior of ioctls such as
KVM_SET_NESTED_STATE.  We can initialize them once and for all
right after the CPU is realized, since they will never be modified
by the guest.

Reported-by: Qingua Cheng <qcheng@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1579544504-3616-2-git-send-email-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-24 20:59:09 +01:00
Michal Privoznik
8f54bbd0b4 x86: Check for machine state object class before typecasting it
In ed9e923c3c ("x86: move SMM property to X86MachineState", 2019-12-17)
In v4.2.0-246-ged9e923c3c the SMM property was moved from PC
machine class to x86 machine class. Makes sense, but the change
was too aggressive: in target/i386/kvm.c:kvm_arch_init() it
altered check which sets SMRAM if given machine has SMM enabled.
The line that detects whether given machine object is class of
PC_MACHINE was removed from the check. This makes qemu try to
enable SMRAM for all machine types, which is not what we want.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Fixes: ed9e923c3c ("x86: move SMM property to X86MachineState", 2019-12-17)
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <7cc91bab3191bfd7e071bdd3fdf7fe2a2991deb0.1577692822.git.mprivozn@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-07 12:08:38 +01:00
Eiichi Tsukata
7529a79607 target/i386: remove unused pci-assign codes
Legacy PCI device assignment has been already removed in commit ab37bfc7d6
("pci-assign: Remove"), but some codes remain unused.

CC: qemu-trivial@nongnu.org
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Message-Id: <20191209072932.313056-1-devel@etsukata.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-18 02:34:11 +01:00
Paolo Bonzini
89a289c7e9 x86: move more x86-generic functions out of PC files
These are needed by microvm too, so move them outside of PC-specific files.
With this patch, microvm.c need not include pc.h anymore.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-17 19:33:50 +01:00
Paolo Bonzini
ed9e923c3c x86: move SMM property to X86MachineState
Add it to microvm as well, it is a generic property of the x86
architecture.

Suggested-by: Sergio Lopez <slp@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-17 19:33:50 +01:00
Paolo Bonzini
4376c40ded kvm: introduce kvm_kernel_irqchip_* functions
The KVMState struct is opaque, so provide accessors for the fields
that will be moved from current_machine to the accelerator.  For now
they just forward to the machine object, but this will change.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-17 19:32:45 +01:00
Paolo Bonzini
23b0898e44 kvm: convert "-machine kvm_shadow_mem" to an accelerator property
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-17 19:32:27 +01:00
Yang Zhong
2605188240 target/i386: disable VMX features if nested=0
If kvm does not support VMX feature by nested=0, the kvm_vmx_basic
can't get the right value from MSR_IA32_VMX_BASIC register, which
make qemu coredump when qemu do KVM_SET_MSRS.

The coredump info:
error: failed to set MSR 0x480 to 0x0
kvm_put_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.

Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20191206071111.12128-1-yang.zhong@intel.com>
Reported-by: Catherine Ho <catherine.hecx@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-06 12:35:40 +01:00
Paolo Bonzini
2a9758c51e target/i386: add support for MSR_IA32_TSX_CTRL
The MSR_IA32_TSX_CTRL MSR can be used to hide TSX (also known as the
Trusty Side-channel Extension).  By virtualizing the MSR, KVM guests
can disable TSX and avoid paying the price of mitigating TSX-based
attacks on microarchitectural side channels.

Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-21 16:35:05 +01:00
Tao Xu
6508799707 target/i386: Add support for save/load IA32_UMWAIT_CONTROL MSR
UMWAIT and TPAUSE instructions use 32bits IA32_UMWAIT_CONTROL at MSR
index E1H to determines the maximum time in TSC-quanta that the processor
can reside in either C0.1 or C0.2.

This patch is to Add support for save/load IA32_UMWAIT_CONTROL MSR in
guest.

Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Message-Id: <20191011074103.30393-3-tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-23 17:50:27 +02:00
Tao Xu
67192a298f x86/cpu: Add support for UMONITOR/UMWAIT/TPAUSE
UMONITOR, UMWAIT and TPAUSE are a set of user wait instructions.
This patch adds support for user wait instructions in KVM. Availability
of the user wait instructions is indicated by the presence of the CPUID
feature flag WAITPKG CPUID.0x07.0x0:ECX[5]. User wait instructions may
be executed at any privilege level, and use IA32_UMWAIT_CONTROL MSR to
set the maximum time.

The patch enable the umonitor, umwait and tpause features in KVM.
Because umwait and tpause can put a (psysical) CPU into a power saving
state, by default we dont't expose it to kvm and enable it only when
guest CPUID has it. And use QEMU command-line "-overcommit cpu-pm=on"
(enable_cpu_pm is enabled), a VM can use UMONITOR, UMWAIT and TPAUSE
instructions. If the instruction causes a delay, the amount of time
delayed is called here the physical delay. The physical delay is first
computed by determining the virtual delay (the time to delay relative to
the VM’s timestamp counter). Otherwise, UMONITOR, UMWAIT and TPAUSE cause
an invalid-opcode exception(#UD).

The release document ref below link:
https://software.intel.com/sites/default/files/\
managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf

Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Message-Id: <20191011074103.30393-2-tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-23 17:50:27 +02:00
Vitaly Kuznetsov
30d6ff662d i386/kvm: add NoNonArchitecturalCoreSharing Hyper-V enlightenment
Hyper-V TLFS specifies this enlightenment as:
"NoNonArchitecturalCoreSharing - Indicates that a virtual processor will never
share a physical core with another virtual processor, except for virtual
processors that are reported as sibling SMT threads. This can be used as an
optimization to avoid the performance overhead of STIBP".

However, STIBP is not the only implication. It was found that Hyper-V on
KVM doesn't pass MD_CLEAR bit to its guests if it doesn't see
NoNonArchitecturalCoreSharing bit.

KVM reports NoNonArchitecturalCoreSharing in KVM_GET_SUPPORTED_HV_CPUID to
indicate that SMT on the host is impossible (not supported of forcefully
disabled).

Implement NoNonArchitecturalCoreSharing support in QEMU as tristate:
'off' - the feature is disabled (default)
'on' - the feature is enabled. This is only safe if vCPUS are properly
 pinned and correct topology is exposed. As CPU pinning is done outside
 of QEMU the enablement decision will be made on a higher level.
'auto' - copy KVM setting. As during live migration SMT settings on the
source and destination host may differ this requires us to add a migration
blocker.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20191018163908.10246-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-22 09:38:42 +02:00
Mario Smarduch
73284563dc target/i386: log MCE guest and host addresses
Patch logs MCE AO, AR messages injected to guest or taken by QEMU itself.
We print the QEMU address for guest MCEs, helps on hypervisors that have
another source of MCE logging like mce log, and when they go missing.

For example we found these QEMU logs:

September 26th 2019, 17:36:02.309        Droplet-153258224: Guest MCE Memory Error at qemu addr 0x7f8ce14f5000 and guest 3d6f5000 addr of type BUS_MCEERR_AR injected   qemu-system-x86_64      amsN    ams3nodeNNNN

September 27th 2019, 06:25:03.234        Droplet-153258224: Guest MCE Memory Error at qemu addr 0x7f8ce14f5000 and guest 3d6f5000 addr of type BUS_MCEERR_AR injected   qemu-system-x86_64      amsN    ams3nodeNNNN

The first log had a corresponding mce log entry, the second didnt (logging
thresholds) we can infer from second entry same PA and mce type.

Signed-off-by: Mario Smarduch <msmarduch@digitalocean.com>
Message-Id: <20191009164459.8209-3-msmarduch@digitalocean.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-22 09:38:38 +02:00
Eduardo Habkost
af95cafb87 i386: Omit all-zeroes entries from KVM CPUID table
KVM has a 80-entry limit at KVM_SET_CPUID2.  With the
introduction of CPUID[0x1F], it is now possible to hit this limit
with unusual CPU configurations, e.g.:

  $ ./x86_64-softmmu/qemu-system-x86_64 \
    -smp 1,dies=2,maxcpus=2 \
    -cpu EPYC,check=off,enforce=off \
    -machine accel=kvm
  qemu-system-x86_64: kvm_init_vcpu failed: Argument list too long

This happens because QEMU adds a lot of all-zeroes CPUID entries
for unused CPUID leaves.  In the example above, we end up
creating 48 all-zeroes CPUID entries.

KVM already returns all-zeroes when emulating the CPUID
instruction if an entry is missing, so the all-zeroes entries are
redundant.  Skip those entries.  This reduces the CPUID table
size by half while keeping CPUID output unchanged.

Reported-by: Yumei Huang <yuhuang@redhat.com>
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1741508
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20190822225210.32541-1-ehabkost@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2019-10-15 18:34:44 -03:00
Thomas Huth
a1834d975f target/i386/kvm: Silence warning from Valgrind about uninitialized bytes
When I run QEMU with KVM under Valgrind, I currently get this warning:

 Syscall param ioctl(generic) points to uninitialised byte(s)
    at 0x95BA45B: ioctl (in /usr/lib64/libc-2.28.so)
    by 0x429DC3: kvm_ioctl (kvm-all.c:2365)
    by 0x51B249: kvm_arch_get_supported_msr_feature (kvm.c:469)
    by 0x4C2A49: x86_cpu_get_supported_feature_word (cpu.c:3765)
    by 0x4C4116: x86_cpu_expand_features (cpu.c:5065)
    by 0x4C7F8D: x86_cpu_realizefn (cpu.c:5242)
    by 0x5961F3: device_set_realized (qdev.c:835)
    by 0x7038F6: property_set_bool (object.c:2080)
    by 0x707EFE: object_property_set_qobject (qom-qobject.c:26)
    by 0x705814: object_property_set_bool (object.c:1338)
    by 0x498435: pc_new_cpu (pc.c:1549)
    by 0x49C67D: pc_cpus_init (pc.c:1681)
  Address 0x1ffeffee74 is on thread 1's stack
  in frame #2, created by kvm_arch_get_supported_msr_feature (kvm.c:445)

It's harmless, but a little bit annoying, so silence it by properly
initializing the whole structure with zeroes.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-04 18:49:20 +02:00
Paolo Bonzini
048c95163b target/i386: work around KVM_GET_MSRS bug for secondary execution controls
Some secondary controls are automatically enabled/disabled based on the CPUID
values that are set for the guest.  However, they are still available at a
global level and therefore should be present when KVM_GET_MSRS is sent to
/dev/kvm.

Unfortunately KVM forgot to include those, so fix that.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-04 18:49:20 +02:00
Paolo Bonzini
20a78b02d3 target/i386: add VMX features
Add code to convert the VMX feature words back into MSR values,
allowing the user to enable/disable VMX features as they wish.  The same
infrastructure enables support for limiting VMX features in named
CPU models.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-04 18:49:20 +02:00
Paolo Bonzini
ede146c2e7 target/i386: expand feature words to 64 bits
VMX requires 64-bit feature words for the IA32_VMX_EPT_VPID_CAP
and IA32_VMX_BASIC MSRs.  (The VMX control MSRs are 64-bit wide but
actually have only 32 bits of information).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-04 18:49:19 +02:00
Philippe Mathieu-Daudé
d6d059ca07 hw/i386/pc: Extract e820 memory layout code
Suggested-by: Samuel Ortiz <sameo@linux.intel.com>
Reviewed-by: Li Qiang <liq3ea@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20190818225414.22590-3-philmd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-16 17:13:07 +02:00
Wanpeng Li
d38d201f0e i386/kvm: support guest access CORE cstate
Allow guest reads CORE cstate when exposing host CPU power management capabilities
to the guest. PKG cstate is restricted to avoid a guest to get the whole package
information in multi-tenant scenario.

Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1563154124-18579-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-16 12:32:20 +02:00
Jing Liu
80db491da4 x86: Intel AVX512_BF16 feature enabling
Intel CooperLake cpu adds AVX512_BF16 instruction, defining as
CPUID.(EAX=7,ECX=1):EAX[bit 05].

The patch adds a property for setting the subleaf of CPUID leaf 7 in
case that people would like to specify it.

The release spec link as follows,
https://software.intel.com/sites/default/files/managed/c5/15/\
architecture-instruction-set-extensions-programming-reference.pdf

Signed-off-by: Jing Liu <jing2.liu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-20 20:00:52 +02:00
Andrey Shinkevich
1f670a95b3 i386/kvm: initialize struct at full before ioctl call
Not the whole structure is initialized before passing it to the KVM.
Reduce the number of Valgrind reports.

Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Message-Id: <1564502498-805893-4-git-send-email-andrey.shinkevich@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-20 17:26:19 +02:00
Li Qiang
de428cead6 target-i386: kvm: 'kvm_get_supported_msrs' cleanup
Function 'kvm_get_supported_msrs' is only called once
now, get rid of the static variable 'kvm_supported_msrs'.

Signed-off-by: Li Qiang <liq3ea@163.com>
Message-Id: <20190725151639.21693-1-liq3ea@163.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-20 17:26:19 +02:00
Marcelo Tosatti
d645e13287 kvm: i386: halt poll control MSR support
Add support for halt poll control MSR: save/restore, migration
and new feature name.

The purpose of this MSR is to allow the guest to disable
host halt poll.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20190603230408.GA7938@amt.cnet>
[Do not enable by default, as pointed out by Mark Kanda. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-20 17:26:17 +02:00
Markus Armbruster
54d31236b9 sysemu: Split sysemu/runstate.h off sysemu/sysemu.h
sysemu/sysemu.h is a rather unfocused dumping ground for stuff related
to the system-emulator.  Evidence:

* It's included widely: in my "build everything" tree, changing
  sysemu/sysemu.h still triggers a recompile of some 1100 out of 6600
  objects (not counting tests and objects that don't depend on
  qemu/osdep.h, down from 5400 due to the previous two commits).

* It pulls in more than a dozen additional headers.

Split stuff related to run state management into its own header
sysemu/runstate.h.

Touching sysemu/sysemu.h now recompiles some 850 objects.  qemu/uuid.h
also drops from 1100 to 850, and qapi/qapi-types-run-state.h from 4400
to 4200.  Touching new sysemu/runstate.h recompiles some 500 objects.

Since I'm touching MAINTAINERS to add sysemu/runstate.h anyway, also
add qemu/main-loop.h.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190812052359.30071-30-armbru@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
[Unbreak OS-X build]
2019-08-16 13:37:36 +02:00
Markus Armbruster
db72581598 Include qemu/main-loop.h less
In my "build everything" tree, changing qemu/main-loop.h triggers a
recompile of some 5600 out of 6600 objects (not counting tests and
objects that don't depend on qemu/osdep.h).  It includes block/aio.h,
which in turn includes qemu/event_notifier.h, qemu/notify.h,
qemu/processor.h, qemu/qsp.h, qemu/queue.h, qemu/thread-posix.h,
qemu/thread.h, qemu/timer.h, and a few more.

Include qemu/main-loop.h only where it's needed.  Touching it now
recompiles only some 1700 objects.  For block/aio.h and
qemu/event_notifier.h, these numbers drop from 5600 to 2800.  For the
others, they shrink only slightly.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190812052359.30071-21-armbru@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
2019-08-16 13:31:52 +02:00
Markus Armbruster
71e8a91585 Include sysemu/reset.h a lot less
In my "build everything" tree, changing sysemu/reset.h triggers a
recompile of some 2600 out of 6600 objects (not counting tests and
objects that don't depend on qemu/osdep.h).

The main culprit is hw/hw.h, which supposedly includes it for
convenience.

Include sysemu/reset.h only where it's needed.  Touching it now
recompiles less than 200 objects.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20190812052359.30071-9-armbru@redhat.com>
2019-08-16 13:31:52 +02:00
Jan Kiszka
bec7156a45 i386/kvm: Do not sync nested state during runtime
Writing the nested state e.g. after a vmport access can invalidate
important parts of the kernel-internal state, and it is not needed as
well. So leave this out from KVM_PUT_RUNTIME_STATE.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Message-Id: <bdd53f40-4e60-f3ae-7ec6-162198214953@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-24 11:21:59 +02:00
Paolo Bonzini
1e44f3ab71 target/i386: skip KVM_GET/SET_NESTED_STATE if VMX disabled, or for SVM
Do not allocate env->nested_state unless we later need to migrate the
nested virtualization state.

With this change, nested_state_needed() will return false if the
VMX flag is not included in the virtual machine.  KVM_GET/SET_NESTED_STATE
is also disabled for SVM which is safer (we know that at least the NPT
root and paging mode have to be saved/loaded), and thus the corresponding
subsection can go away as well.

Inspired by a patch from Liran Alon.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-19 18:02:22 +02:00
Liran Alon
79a197ab18 target/i386: kvm: Demand nested migration kernel capabilities only when vCPU may have enabled VMX
Previous to this change, a vCPU exposed with VMX running on a kernel
without KVM_CAP_NESTED_STATE or KVM_CAP_EXCEPTION_PAYLOAD resulted in
adding a migration blocker. This was because when the code was written
it was thought there is no way to reliably know if a vCPU is utilising
VMX or not at runtime. However, it turns out that this can be known to
some extent:

In order for a vCPU to enter VMX operation it must have CR4.VMXE set.
Since it was set, CR4.VMXE must remain set as long as the vCPU is in
VMX operation. This is because CR4.VMXE is one of the bits set
in MSR_IA32_VMX_CR4_FIXED1.
There is one exception to the above statement when vCPU enters SMM mode.
When a vCPU enters SMM mode, it temporarily exits VMX operation and
may also reset CR4.VMXE during execution in SMM mode.
When the vCPU exits SMM mode, vCPU state is restored to be in VMX operation
and CR4.VMXE is restored to its original state of being set.
Therefore, when the vCPU is not in SMM mode, we can infer whether
VMX is being used by examining CR4.VMXE. Otherwise, we cannot
know for certain but assume the worse that vCPU may utilise VMX.

Summaring all the above, a vCPU may have enabled VMX in case
CR4.VMXE is set or vCPU is in SMM mode.

Therefore, remove migration blocker and check before migration
(cpu_pre_save()) if the vCPU may have enabled VMX. If true, only then
require relevant kernel capabilities.

While at it, demand KVM_CAP_EXCEPTION_PAYLOAD only when the vCPU is in
guest-mode and there is a pending/injected exception. Otherwise, this
kernel capability is not required for proper migration.

Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
Tested-by: Maran Wilson <maran.wilson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-19 18:01:47 +02:00
Peter Maydell
c4107e8208 Bugfixes.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJdH7FgAAoJEL/70l94x66DdUUH+gLr/ZdjLIdfYy9cjcnevf4E
 cJlxdaW9KvUsK2uVgqQ/3b1yF+GCGk10n6n8ZTIbClhs+6NpqEMz5O3FA/Na6FGA
 48M2DwJaJ2H9AG/lQBlSBNUZfLsEJ9rWy7DHvNut5XMJFuWGwdtF/jRhUm3KqaRq
 vAaOgcQHbzHU9W8r1NJ7l6pnPebeO7S0JQV+82T/ITTz2gEBDUkJ36boO6fedkVQ
 jLb9nZyG3CJXHm2WlxGO4hkqbLFzURnCi6imOh2rMdD8BCu1eIVl59tD1lC/A0xv
 Pp3xXnv9SgJXsV4/I/N3/nU85ZhGVMPQZXkxaajHPtJJ0rQq7FAG8PJMEj9yPe8=
 =poke
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging

Bugfixes.

# gpg: Signature made Fri 05 Jul 2019 21:21:52 BST
# gpg:                using RSA key BFFBD25F78C7AE83
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" [full]
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>" [full]
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4  E2F7 7E15 100C CD36 69B1
#      Subkey fingerprint: F133 3857 4B66 2389 866C  7682 BFFB D25F 78C7 AE83

* remotes/bonzini/tags/for-upstream:
  ioapic: use irq number instead of vector in ioapic_eoi_broadcast
  hw/i386: Fix linker error when ISAPC is disabled
  Makefile: generate header file with the list of devices enabled
  target/i386: kvm: Fix when nested state is needed for migration
  minikconf: do not include variables from MINIKCONF_ARGS in config-all-devices.mak
  target/i386: fix feature check in hyperv-stub.c
  ioapic: clear irq_eoi when updating the ioapic redirect table entry
  intel_iommu: Fix unexpected unmaps during global unmap
  intel_iommu: Fix incorrect "end" for vtd_address_space_unmap
  i386/kvm: Fix build with -m32
  checkpatch: do not warn for multiline parenthesized returned value
  pc: fix possible NULL pointer dereference in pc_machine_get_device_memory_region_size()

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-07-08 10:26:18 +01:00
Max Reitz
9dc83cd9c3 i386/kvm: Fix build with -m32
find_next_bit() takes a pointer of type "const unsigned long *", but the
first argument passed here is a "uint64_t *".  These types are
incompatible when compiling qemu with -m32.

Just use ctz64() instead.

Fixes: c686193072
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20190624193913.28343-1-mreitz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05 22:16:46 +02:00
Like Xu
a94e142899 target/i386: Add CPUID.1F generation support for multi-dies PCMachine
The CPUID.1F as Intel V2 Extended Topology Enumeration Leaf would be
exposed if guests want to emulate multiple software-visible die within
each package. Per Intel's SDM, the 0x1f is a superset of 0xb, thus they
can be generated by almost same code as 0xb except die_offset setting.

If the number of dies per package is greater than 1, the cpuid_min_level
would be adjusted to 0x1f regardless of whether the host supports CPUID.1F.
Likewise, the CPUID.1F wouldn't be exposed if env->nr_dies < 2.

Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20190620054525.37188-2-like.xu@linux.intel.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2019-07-05 17:08:04 -03:00
Liran Alon
12604092e2 target/i386: kvm: Add nested migration blocker only when kernel lacks required capabilities
Previous commits have added support for migration of nested virtualization
workloads. This was done by utilising two new KVM capabilities:
KVM_CAP_NESTED_STATE and KVM_CAP_EXCEPTION_PAYLOAD. Both which are
required in order to correctly migrate such workloads.

Therefore, change code to add a migration blocker for vCPUs exposed with
Intel VMX or AMD SVM in case one of these kernel capabilities is
missing.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
Message-Id: <20190619162140.133674-11-liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-21 13:25:28 +02:00
Liran Alon
fd13f23b8c target/i386: kvm: Add support for KVM_CAP_EXCEPTION_PAYLOAD
Kernel commit c4f55198c7c2 ("kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD")
introduced a new KVM capability which allows userspace to correctly
distinguish between pending and injected exceptions.

This distinguish is important in case of nested virtualization scenarios
because a L2 pending exception can still be intercepted by the L1 hypervisor
while a L2 injected exception cannot.

Furthermore, when an exception is attempted to be injected by QEMU,
QEMU should specify the exception payload (CR2 in case of #PF or
DR6 in case of #DB) instead of having the payload already delivered in
the respective vCPU register. Because in case exception is injected to
L2 guest and is intercepted by L1 hypervisor, then payload needs to be
reported to L1 intercept (VMExit handler) while still preserving
respective vCPU register unchanged.

This commit adds support for QEMU to properly utilise this new KVM
capability (KVM_CAP_EXCEPTION_PAYLOAD).

Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Message-Id: <20190619162140.133674-10-liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-21 13:25:27 +02:00
Liran Alon
ebbfef2f34 target/i386: kvm: Add support for save and restore nested state
Kernel commit 8fcc4b5923af ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
introduced new IOCTLs to extract and restore vCPU state related to
Intel VMX & AMD SVM.

Utilize these IOCTLs to add support for migration of VMs which are
running nested hypervisors.

Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
Tested-by: Maran Wilson <maran.wilson@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Message-Id: <20190619162140.133674-9-liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-21 13:23:47 +02:00
Liran Alon
18ab37ba1c target/i386: kvm: Block migration for vCPUs exposed with nested virtualization
Commit d98f26073b ("target/i386: kvm: add VMX migration blocker")
added a migration blocker for vCPU exposed with Intel VMX.
However, migration should also be blocked for vCPU exposed with
AMD SVM.

Both cases should be blocked because QEMU should extract additional
vCPU state from KVM that should be migrated as part of vCPU VMState.
E.g. Whether vCPU is running in guest-mode or host-mode.

Fixes: d98f26073b ("target/i386: kvm: add VMX migration blocker")
Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Message-Id: <20190619162140.133674-6-liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-21 13:23:44 +02:00