Commit Graph

146 Commits

Author SHA1 Message Date
Anthony Harivel
0418f90809 Add support for RAPL MSRs in KVM/Qemu
Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
interface (Running Average Power Limit) for advertising the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM,
etc.).

The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
64 bits registers that represent the accumulated energy consumption in
micro Joules. They are updated by microcode every ~1ms.

For now, KVM always returns 0 when the guest requests the value of
these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle
these MSRs dynamically in userspace.

To limit the amount of system calls for every MSR call, create a new
thread in QEMU that updates the "virtual" MSR values asynchronously.

Each vCPU has its own vMSR to reflect the independence of vCPUs. The
thread updates the vMSR values with the ratio of energy consumed of
the whole physical CPU package the vCPU thread runs on and the
thread's utime and stime values.

All other non-vCPU threads are also taken into account. Their energy
consumption is evenly distributed among all vCPUs threads running on
the same physical CPU package.

To overcome the problem that reading the RAPL MSR requires priviliged
access, a socket communication between QEMU and the qemu-vmsr-helper is
mandatory. You can specified the socket path in the parameter.

This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock

Actual limitation:
- Works only on Intel host CPU because AMD CPUs are using different MSR
  adresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
  the moment.

Signed-off-by: Anthony Harivel <aharivel@redhat.com>
Link: https://lore.kernel.org/r/20240522153453.1230389-4-aharivel@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-22 19:19:37 +02:00
Richard Henderson
5915139aba * meson: Pass objects and dependencies to declare_dependency(), not static_library()
* meson: Drop the .fa library suffix
 * target/i386: drop AMD machine check bits from Intel CPUID
 * target/i386: add avx-vnni-int16 feature
 * target/i386: SEV bugfixes
 * target/i386: SEV-SNP -cpu host support
 * char: fix exit issues
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmaGceoUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNcpgf/XziKojGOTvYsE7xMijOUswYjCG5m
 ZVLqxTug8Q0zO/9mGvluKBTWmh8KhRWOovX5iZL8+F0gPoYPG4ONpNhh3wpA9+S7
 H7ph4V6sDJBX4l3OrOK6htD8dO5D9kns1iKGnE0lY60PkcHl+pU8BNWfK1zYp5US
 geiyzuRFRRtDmoNx5+o+w+D+W5msPZsnlj5BnPWM+O/ykeFfSrk2ztfdwHKXUhCB
 5FJcu2sWVx+wsdVzdjgT8USi5+VTK4vabq3SfccmNRxBRnJOCU5MrR63stMDceo4
 TswSB88I0WRV1848AudcGZRkjvKaXLyHJ+QTjg2dp7itEARJ3MGsvOpS5A==
 =3kv7
 -----END PGP SIGNATURE-----

Merge tag 'for-upstream' of https://gitlab.com/bonzini/qemu into staging

* meson: Pass objects and dependencies to declare_dependency(), not static_library()
* meson: Drop the .fa library suffix
* target/i386: drop AMD machine check bits from Intel CPUID
* target/i386: add avx-vnni-int16 feature
* target/i386: SEV bugfixes
* target/i386: SEV-SNP -cpu host support
* char: fix exit issues

# -----BEGIN PGP SIGNATURE-----
#
# iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmaGceoUHHBib256aW5p
# QHJlZGhhdC5jb20ACgkQv/vSX3jHroNcpgf/XziKojGOTvYsE7xMijOUswYjCG5m
# ZVLqxTug8Q0zO/9mGvluKBTWmh8KhRWOovX5iZL8+F0gPoYPG4ONpNhh3wpA9+S7
# H7ph4V6sDJBX4l3OrOK6htD8dO5D9kns1iKGnE0lY60PkcHl+pU8BNWfK1zYp5US
# geiyzuRFRRtDmoNx5+o+w+D+W5msPZsnlj5BnPWM+O/ykeFfSrk2ztfdwHKXUhCB
# 5FJcu2sWVx+wsdVzdjgT8USi5+VTK4vabq3SfccmNRxBRnJOCU5MrR63stMDceo4
# TswSB88I0WRV1848AudcGZRkjvKaXLyHJ+QTjg2dp7itEARJ3MGsvOpS5A==
# =3kv7
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 04 Jul 2024 02:56:58 AM PDT
# gpg:                using RSA key F13338574B662389866C7682BFFBD25F78C7AE83
# gpg:                issuer "pbonzini@redhat.com"
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" [full]
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>" [full]

* tag 'for-upstream' of https://gitlab.com/bonzini/qemu:
  target/i386/SEV: implement mask_cpuid_features
  target/i386: add support for masking CPUID features in confidential guests
  char-stdio: Restore blocking mode of stdout on exit
  target/i386: add avx-vnni-int16 feature
  i386/sev: Fallback to the default SEV device if none provided in sev_get_capabilities()
  i386/sev: Fix error message in sev_get_capabilities()
  target/i386: do not include undefined bits in the AMD topoext leaf
  target/i386: SEV: fix formatting of CPUID mismatch message
  target/i386: drop AMD machine check bits from Intel CPUID
  target/i386: pass X86CPU to x86_cpu_get_supported_feature_word
  meson: Drop the .fa library suffix
  Revert "meson: Propagate gnutls dependency"
  meson: Pass objects and dependencies to declare_dependency()
  meson: merge plugin_ldflags into emulator_link_args
  meson: move block.syms dependency out of libblock
  meson: move shared_module() calls where modules are already walked

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2024-07-04 09:16:07 -07:00
Paolo Bonzini
c28d8b097f target/i386: add support for masking CPUID features in confidential guests
Some CPUID features may be provided by KVM for some guests, independent of
processor support, for example TSC deadline or TSC adjust.  If these are
not supported by the confidential computing firmware, however, the guest
will fail to start.  Add support for removing unsupported features from
"-cpu host".

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-04 07:47:11 +02:00
David Woodhouse
93c76555d8 hw/i386/fw_cfg: Add etc/e820 to fw_cfg late
In e820_add_entry() the e820_table is reallocated with g_renew() to make
space for a new entry. However, fw_cfg_arch_create() just uses the
existing e820_table pointer. This leads to a use-after-free if anything
adds a new entry after fw_cfg is set up.

Shift the addition of the etc/e820 file to the machine done notifier, via
a new fw_cfg_add_e820() function.

Also make e820_table private and use an e820_get_table() accessor function
for it, which sets a flag that will trigger an assert() for any *later*
attempts to add to the table.

Make e820_add_entry() return void, as most callers don't check for error
anyway.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <a2708734f004b224f33d3b4824e9a5a262431568.camel@infradead.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-03 18:14:06 -04:00
Alex Bennée
5b7d54d4ed gdbstub: move enums into separate header
This is an experiment to further reduce the amount we throw into the
exec headers. It might not be as useful as I initially thought because
just under half of the users also need gdbserver_start().

Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240620152220.2192768-3-alex.bennee@linaro.org>
2024-06-24 10:14:17 +01:00
Philippe Mathieu-Daudé
8291239113 target/i386: Remove X86CPU::kvm_no_smi_migration field
X86CPU::kvm_no_smi_migration was only used by the
pc-i440fx-2.3 machine, which got removed. Remove it
and simplify kvm_put_vcpu_events().

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20240617071118.60464-23-philmd@linaro.org>
2024-06-19 12:40:49 +02:00
John Allen
1ea1432199 i386: Add support for overflow recovery
Add cpuid bit definition for overflow recovery. This is needed in the case
where a deferred error has been sent to the guest, a guest process accesses the
poisoned memory, but the machine_check_poll function has not yet handled the
original deferred error. If overflow recovery is not set in this case, when we
handle the uncorrected error from the poisoned memory access, the overflow bit
will be set and will result in the guest being shut down.

By the time the MCE reaches the guest, the overflow has been handled
by the host and has not caused a shutdown, so include the bit unconditionally.

Signed-off-by: John Allen <john.allen@amd.com>
Message-ID: <20240603193622.47156-4-john.allen@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-08 10:33:39 +02:00
John Allen
2ba8b7ee63 i386: Add support for SUCCOR feature
Add cpuid bit definition for the SUCCOR feature. This cpuid bit is required to
be exposed to guests to allow them to handle machine check exceptions on AMD
hosts.

----
v2:
  - Add "succor" feature word.
  - Add case to kvm_arch_get_supported_cpuid for the SUCCOR feature.

Reported-by: William Roche <william.roche@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: John Allen <john.allen@amd.com>
Message-ID: <20240603193622.47156-3-john.allen@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-08 10:33:39 +02:00
John Allen
4b77512b27 i386: Fix MCE support for AMD hosts
For the most part, AMD hosts can use the same MCE injection code as Intel, but
there are instances where the qemu implementation is Intel specific. First, MCE
delivery works differently on AMD and does not support broadcast. Second,
kvm_mce_inject generates MCEs that include a number of Intel specific status
bits. Modify kvm_mce_inject to properly generate MCEs on AMD platforms.

Reported-by: William Roche <william.roche@oracle.com>
Signed-off-by: John Allen <john.allen@amd.com>
Message-ID: <20240603193622.47156-2-john.allen@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-08 10:33:38 +02:00
Xin Li
4ebd98eb3a target/i386: Add get/set/migrate support for FRED MSRs
FRED CPU states are managed in 9 new FRED MSRs, in addtion to a few
existing CPU registers and MSRs, e.g., CR4.FRED and MSR_IA32_PL0_SSP.

Save/restore/migrate FRED MSRs if FRED is exposed to the guest.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Message-ID: <20231109072012.8078-7-xin3.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-08 10:33:38 +02:00
Richard Henderson
f1572ab947 * virtio-blk: remove SCSI passthrough functionality
* require x86-64-v2 baseline ISA
 * SEV-SNP host support
 * fix xsave.flat with TCG
 * fixes for CPUID checks done by TCG
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmZgKVYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPKYgf/QkWrNXdjjD3yAsv5LbJFVTVyCYW3
 b4Iax29kEDy8k9wbzfLxOfIk9jXIjmbOMO5ZN9LFiHK6VJxbXslsMh6hm50M3xKe
 49X1Rvf9YuVA7KZX+dWkEuqLYI6Tlgj3HaCilYWfXrjyo6hY3CxzkPV/ChmaeYlV
 Ad4Y8biifoUuuEK8OTeTlcDWLhOHlFXylG3AXqULsUsXp0XhWJ9juXQ60eATv/W4
 eCEH7CSmRhYFu2/rV+IrWFYMnskLRTk1OC1/m6yXGPKOzgnOcthuvQfiUgPkbR/d
 llY6Ni5Aaf7+XX3S7Avcyvoq8jXzaaMzOrzL98rxYGDR1sYBYO+4h4ZToA==
 =qQeP
 -----END PGP SIGNATURE-----

Merge tag 'for-upstream' of https://gitlab.com/bonzini/qemu into staging

* virtio-blk: remove SCSI passthrough functionality
* require x86-64-v2 baseline ISA
* SEV-SNP host support
* fix xsave.flat with TCG
* fixes for CPUID checks done by TCG

# -----BEGIN PGP SIGNATURE-----
#
# iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmZgKVYUHHBib256aW5p
# QHJlZGhhdC5jb20ACgkQv/vSX3jHroPKYgf/QkWrNXdjjD3yAsv5LbJFVTVyCYW3
# b4Iax29kEDy8k9wbzfLxOfIk9jXIjmbOMO5ZN9LFiHK6VJxbXslsMh6hm50M3xKe
# 49X1Rvf9YuVA7KZX+dWkEuqLYI6Tlgj3HaCilYWfXrjyo6hY3CxzkPV/ChmaeYlV
# Ad4Y8biifoUuuEK8OTeTlcDWLhOHlFXylG3AXqULsUsXp0XhWJ9juXQ60eATv/W4
# eCEH7CSmRhYFu2/rV+IrWFYMnskLRTk1OC1/m6yXGPKOzgnOcthuvQfiUgPkbR/d
# llY6Ni5Aaf7+XX3S7Avcyvoq8jXzaaMzOrzL98rxYGDR1sYBYO+4h4ZToA==
# =qQeP
# -----END PGP SIGNATURE-----
# gpg: Signature made Wed 05 Jun 2024 02:01:10 AM PDT
# gpg:                using RSA key F13338574B662389866C7682BFFBD25F78C7AE83
# gpg:                issuer "pbonzini@redhat.com"
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" [full]
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>" [full]

* tag 'for-upstream' of https://gitlab.com/bonzini/qemu: (46 commits)
  hw/i386: Add support for loading BIOS using guest_memfd
  hw/i386/sev: Use guest_memfd for legacy ROMs
  memory: Introduce memory_region_init_ram_guest_memfd()
  i386/sev: Allow measured direct kernel boot on SNP
  i386/sev: Reorder struct declarations
  i386/sev: Extract build_kernel_loader_hashes
  i386/sev: Enable KVM_HC_MAP_GPA_RANGE hcall for SNP guests
  i386/kvm: Add KVM_EXIT_HYPERCALL handling for KVM_HC_MAP_GPA_RANGE
  i386/sev: Invoke launch_updata_data() for SNP class
  i386/sev: Invoke launch_updata_data() for SEV class
  hw/i386/sev: Add support to encrypt BIOS when SEV-SNP is enabled
  i386/sev: Add support for SNP CPUID validation
  i386/sev: Add support for populating OVMF metadata pages
  hw/i386/sev: Add function to get SEV metadata from OVMF header
  i386/sev: Set CPU state to protected once SNP guest payload is finalized
  i386/sev: Add handling to encrypt/finalize guest launch data
  i386/sev: Add the SNP launch start context
  i386/sev: Update query-sev QAPI format to handle SEV-SNP
  i386/sev: Add a class method to determine KVM VM type for SNP guests
  i386/sev: Don't return launch measurements for SEV-SNP guests
  ...

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2024-06-05 07:45:23 -07:00
Michael Roth
47e76d03b1 i386/kvm: Add KVM_EXIT_HYPERCALL handling for KVM_HC_MAP_GPA_RANGE
KVM_HC_MAP_GPA_RANGE will be used to send requests to userspace for
private/shared memory attribute updates requested by the guest.
Implement handling for that use-case along with some basic
infrastructure for enabling specific hypercall events.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Pankaj Gupta <pankaj.gupta@amd.com>
Message-ID: <20240530111643.1091816-31-pankaj.gupta@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-05 11:01:06 +02:00
Paolo Bonzini
a808132f6d i386/sev: Add a class method to determine KVM VM type for SNP guests
SEV guests can use either KVM_X86_DEFAULT_VM, KVM_X86_SEV_VM,
or KVM_X86_SEV_ES_VM depending on the configuration and what
the host kernel supports. SNP guests on the other hand can only
ever use KVM_X86_SNP_VM, so split determination of VM type out
into a separate class method that can be set accordingly for
sev-guest vs. sev-snp-guest objects and add handling for SNP.

Signed-off-by: Pankaj Gupta <pankaj.gupta@amd.com>
Message-ID: <20240530111643.1091816-14-pankaj.gupta@amd.com>
[Remove unnecessary function pointer declaration. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-05 11:01:06 +02:00
Richard Henderson
a93b4061b0 target/i386/kvm: Improve KVM_EXIT_NOTIFY warnings
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Message-ID: <20240412073346.458116-28-richard.henderson@linaro.org>
[PMD: Fixed typo reported by Peter Maydell]
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
2024-06-04 10:02:39 +02:00
Zhao Liu
6ddeb0ec8c i386/cpu: Introduce bitmap to cache available CPU topology levels
Currently, QEMU checks the specify number of topology domains to detect
if there's extended topology levels (e.g., checking nr_dies).

With this bitmap, the extended CPU topology (the levels other than SMT,
core and package) could be easier to detect without touching the
topology details.

This is also in preparation for the follow-up to decouple CPUID[0x1F]
subleaf with specific topology level.

Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240424154929.1487382-10-zhao1.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-22 19:43:29 +02:00
Paolo Bonzini
663e2f443e target/i386: SEV: use KVM_SEV_INIT2 if possible
Implement support for the KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM virtual
machine types, and the KVM_SEV_INIT2 function of KVM_MEMORY_ENCRYPT_OP.

These replace the KVM_SEV_INIT and KVM_SEV_ES_INIT functions, and have
several advantages:

- sharing the initialization sequence with SEV-SNP and TDX

- allowing arguments including the set of desired VMSA features

- protection against invalid use of KVM_GET/SET_* ioctls for guests
  with encrypted state

If the KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types are not supported,
fall back to KVM_SEV_INIT and KVM_SEV_ES_INIT (which use the
default x86 VM type).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-23 17:35:25 +02:00
Paolo Bonzini
ee88612df1 target/i386: Implement mc->kvm_type() to get VM type
KVM is introducing a new API to create confidential guests, which
will be used by TDX and SEV-SNP but is also available for SEV and
SEV-ES.  The API uses the VM type argument to KVM_CREATE_VM to
identify which confidential computing technology to use.

Since there are no other expected uses of VM types, delegate
mc->kvm_type() for x86 boards to the confidential-guest-support
object pointed to by ms->cgs.

For example, if a sev-guest object is specified to confidential-guest-support,
like,

  qemu -machine ...,confidential-guest-support=sev0 \
       -object sev-guest,id=sev0,...

it will check if a VM type KVM_X86_SEV_VM or KVM_X86_SEV_ES_VM
is supported, and if so use them together with the KVM_SEV_INIT2
function of the KVM_MEMORY_ENCRYPT_OP ioctl. If not, it will fall back to
KVM_SEV_INIT and KVM_SEV_ES_INIT.

This is a preparatory work towards TDX and SEV-SNP support, but it
will also enable support for VMSA features such as DebugSwap, which
are only available via KVM_SEV_INIT2.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-23 17:35:25 +02:00
Paolo Bonzini
a99c0c66eb KVM: remove kvm_arch_cpu_check_are_resettable
Board reset requires writing a fresh CPU state.  As far as KVM is
concerned, the only thing that blocks reset is that CPU state is
encrypted; therefore, kvm_cpus_are_resettable() can simply check
if that is the case.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-23 17:35:25 +02:00
Xiaoyao Li
637c95b37b i386/sev: Switch to use confidential_guest_kvm_init()
Use confidential_guest_kvm_init() instead of calling SEV
specific sev_kvm_init(). This allows the introduction of multiple
confidential-guest-support subclasses for different x86 vendors.

As a bonus, stubs are not needed anymore since there is no
direct call from target/i386/kvm/kvm.c to SEV code.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20240229060038.606591-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-23 17:35:25 +02:00
Sean Christopherson
a5acf4f26c i386/kvm: Move architectural CPUID leaf generation to separate helper
Move the architectural (for lack of a better term) CPUID leaf generation
to a separate helper so that the generation code can be reused by TDX,
which needs to generate a canonical VM-scoped configuration.

For now this is just a cleanup, so keep the function static.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240229063726.610065-23-xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-23 17:35:13 +02:00
Maciej S. Szmigiero
6093637b4d vmbus: Print a warning when enabled without the recommended set of features
Some Windows versions crash at boot or fail to enable the VMBus device if
they don't see the expected set of Hyper-V features (enlightenments).

Since this provides poor user experience let's warn user if the VMBus
device is enabled without the recommended set of Hyper-V features.

The recommended set is the minimum set of Hyper-V features required to make
the VMBus device work properly in Windows Server versions 2016, 2019 and
2022.

Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
2024-03-08 14:18:56 +01:00
Xiaoyao Li
0729857c70 i386/cpuid: Move leaf 7 to correct group
CPUID leaf 7 was grouped together with SGX leaf 0x12 by commit
b9edbadefb ("i386: Propagate SGX CPUID sub-leafs to KVM") by mistake.

SGX leaf 0x12 has its specific logic to check if subleaf (starting from 2)
is valid or not by checking the bit 0:3 of corresponding EAX is 1 or
not.

Leaf 7 follows the logic that EAX of subleaf 0 enumerates the maximum
valid subleaf.

Fixes: b9edbadefb ("i386: Propagate SGX CPUID sub-leafs to KVM")
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240125024016.2521244-4-xiaoyao.li@intel.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-16 13:56:08 +01:00
Xiaoyao Li
a3b5376521 i386/cpuid: Remove subleaf constraint on CPUID leaf 1F
No such constraint that subleaf index needs to be less than 64.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by:Yang Weijiang <weijiang.yang@intel.com>
Message-ID: <20240125024016.2521244-3-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-16 13:56:08 +01:00
Xiaoyao Li
10f92799af i386/cpuid: Decrease cpuid_i when skipping CPUID leaf 1F
Existing code misses a decrement of cpuid_i when skip leaf 0x1F.
There's a blank CPUID entry(with leaf, subleaf as 0, and all fields
stuffed 0s) left in the CPUID array.

It conflicts with correct CPUID leaf 0.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by:Yang Weijiang <weijiang.yang@intel.com>
Message-ID: <20240125024016.2521244-2-xiaoyao.li@intel.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-16 13:56:08 +01:00
Stefan Hajnoczi
195801d700 system/cpus: rename qemu_mutex_lock_iothread() to bql_lock()
The Big QEMU Lock (BQL) has many names and they are confusing. The
actual QemuMutex variable is called qemu_global_mutex but it's commonly
referred to as the BQL in discussions and some code comments. The
locking APIs, however, are called qemu_mutex_lock_iothread() and
qemu_mutex_unlock_iothread().

The "iothread" name is historic and comes from when the main thread was
split into into KVM vcpu threads and the "iothread" (now called the main
loop thread). I have contributed to the confusion myself by introducing
a separate --object iothread, a separate concept unrelated to the BQL.

The "iothread" name is no longer appropriate for the BQL. Rename the
locking APIs to:
- void bql_lock(void)
- void bql_unlock(void)
- bool bql_locked(void)

There are more APIs with "iothread" in their names. Subsequent patches
will rename them. There are also comments and documentation that will be
updated in later patches.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: Fabiano Rosas <farosas@suse.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
Acked-by: Hyman Huang <yong.huang@smartx.com>
Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-id: 20240102153529.486531-2-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-01-08 10:45:43 -05:00
Michael Roth
5746f70d68 i386/sev: Avoid SEV-ES crash due to missing MSR_EFER_LMA bit
Commit 7191f24c7f ("accel/kvm/kvm-all: Handle register access errors")
added error checking for KVM_SET_SREGS/KVM_SET_SREGS2. In doing so, it
exposed a long-running bug in current KVM support for SEV-ES where the
kernel assumes that MSR_EFER_LMA will be set explicitly by the guest
kernel, in which case EFER write traps would result in KVM eventually
seeing MSR_EFER_LMA get set and recording it in such a way that it would
be subsequently visible when accessing it via KVM_GET_SREGS/etc.

However, guest kernels currently rely on MSR_EFER_LMA getting set
automatically when MSR_EFER_LME is set and paging is enabled via
CR0_PG_MASK. As a result, the EFER write traps don't actually expose the
MSR_EFER_LMA bit, even though it is set internally, and when QEMU
subsequently tries to pass this EFER value back to KVM via
KVM_SET_SREGS* it will fail various sanity checks and return -EINVAL,
which is now considered fatal due to the aforementioned QEMU commit.

This can be addressed by inferring the MSR_EFER_LMA bit being set when
paging is enabled and MSR_EFER_LME is set, and synthesizing it to ensure
the expected bits are all present in subsequent handling on the host
side.

Ultimately, this handling will be implemented in the host kernel, but to
avoid breaking QEMU's SEV-ES support when using older host kernels, the
same handling can be done in QEMU just after fetching the register
values via KVM_GET_SREGS*. Implement that here.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Akihiko Odaki <akihiko.odaki@daynix.com>
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Cc: Lara Lazier <laramglazier@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc:  <kvm@vger.kernel.org>
Fixes: 7191f24c7f ("accel/kvm/kvm-all: Handle register access errors")
Signed-off-by: Michael Roth <michael.roth@amd.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20231206155821.1194551-1-michael.roth@amd.com>
2023-12-06 14:34:11 -05:00
David Woodhouse
8473607bcf i386/xen: advertise XEN_HVM_CPUID_UPCALL_VECTOR in CPUID
This will allow Linux guests (since v6.0) to use the per-vCPU upcall
vector delivered as MSI through the local APIC.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
2023-11-07 08:54:20 +00:00
Paolo Bonzini
39dd3e1f55 kvm: i8254: require KVM_CAP_PIT2 and KVM_CAP_PIT_STATE2
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25 19:53:38 +02:00
Paolo Bonzini
52b04ea49d kvm: i386: require KVM_CAP_SET_IDENTITY_MAP_ADDR
This was introduced in KVM in Linux 2.6.32, we can require it unconditionally.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25 19:53:38 +02:00
Paolo Bonzini
700766ba60 kvm: i386: require KVM_CAP_ADJUST_CLOCK
This was introduced in KVM in Linux 2.6.33, we can require it
unconditionally.  KVM_CLOCK_TSC_STABLE was only added in Linux 4.9,
for now do not require it (though it would allow the removal of some
pretty yucky code).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25 19:53:38 +02:00
Paolo Bonzini
86f2438fc2 kvm: i386: require KVM_CAP_MCE
This was introduced in KVM in Linux 2.6.34, we can require it unconditionally.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25 19:53:38 +02:00
Paolo Bonzini
1a44a79ddf kvm: i386: require KVM_CAP_SET_VCPU_EVENTS and KVM_CAP_X86_ROBUST_SINGLESTEP
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25 19:53:30 +02:00
Paolo Bonzini
8bba0a3b76 kvm: i386: require KVM_CAP_XSAVE
This was introduced in KVM in Linux 2.6.36, and could already be used at
the time to save/restore FPU data even on older processor.  We can require
it unconditionally and stop using KVM_GET/SET_FPU.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25 17:35:15 +02:00
Paolo Bonzini
f57a4dd311 kvm: i386: require KVM_CAP_DEBUGREGS
This was introduced in KVM in Linux 2.6.35, we can require it unconditionally.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25 17:35:15 +02:00
Paolo Bonzini
4b2991666c kvm: i386: move KVM_CAP_IRQ_ROUTING detection to kvm_arch_required_capabilities
Simple code cleanup.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25 17:35:15 +02:00
Paolo Bonzini
cc5e719e2c kvm: require KVM_CAP_SIGNAL_MSI
This was introduced in KVM in Linux 3.5, we can require it unconditionally
in kvm_irqchip_send_msi().  However, not all architectures have to implement
it so check it only in x86, the only architecture that ever had MSI injection
but not KVM_CAP_SIGNAL_MSI.

ARM uses it to detect the presence of the ITS emulation in the kernel,
introduced in Linux 4.8.  Assume that it's there and possibly fail when
realizing the arm-its-kvm device.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25 17:35:14 +02:00
Steve Sistare
c8a7fc5179 migration: simplify blockers
Modify migrate_add_blocker and migrate_del_blocker to take an Error **
reason.  This allows migration to own the Error object, so that if
an error occurs in migrate_add_blocker, migration code can free the Error
and clear the client handle, simplifying client code.  It also simplifies
the migrate_del_blocker call site.

In addition, this is a pre-requisite for a proposed future patch that would
add a mode argument to migration requests to support live update, and
maintain a list of blockers for each mode.  A blocker may apply to a single
mode or to multiple modes, and passing Error** will allow one Error object
to be registered for multiple modes.

No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Tested-by: Michael Galaxy <mgalaxy@akamai.com>
Reviewed-by: Michael Galaxy <mgalaxy@akamai.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-ID: <1697634216-84215-1-git-send-email-steven.sistare@oracle.com>
2023-10-20 08:51:41 +02:00
Xiaoyao Li
83629b1461 target/i386/cpu: Fix CPUID_HT exposure
When explicitly booting a multiple vcpus vm with "-cpu +ht", it gets
warning of

  warning: host doesn't support requested feature: CPUID.01H:EDX.ht [bit 28]

Make CPUID_HT as supported unconditionally can resolve the warning.
However it introduces another issue that it also expose CPUID_HT to
guest when "-cpu host/max" with only 1 vcpu. To fix this, need mark
CPUID_HT as the no_autoenable_flags.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20231010060539.210258-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-17 15:20:46 +02:00
Paolo Bonzini
e0c3ef715b target/i386/kvm: eliminate shadowed local variables
These are harmless are they die immediately after their use.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-26 16:41:49 +02:00
Michael Tokarev
bad5cfcd60 i386: spelling fixes
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2023-09-20 07:54:34 +03:00
Philippe Mathieu-Daudé
fc30abf846 sysemu/kvm: Restrict kvm_has_pit_state2() to x86 targets
kvm_has_pit_state2() is only defined for x86 targets (in
target/i386/kvm/kvm.c). Its declaration is pointless on
all other targets. Have it return a boolean.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-13-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-07 13:32:37 +02:00
Anton Johansson
b8a6eb1862 sysemu/kvm: Use vaddr for kvm_arch_[insert|remove]_hw_breakpoint
Changes the signature of the target-defined functions for
inserting/removing kvm hw breakpoints. The address and length arguments
are now of vaddr type, which both matches the type used internally in
accel/kvm/kvm-all.c and makes the api target-agnostic.

Signed-off-by: Anton Johansson <anjo@rev.ng>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230807155706.9580-4-anjo@rev.ng>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2023-08-24 11:21:35 -07:00
Akihiko Odaki
5e0d65909c kvm: Introduce kvm_arch_get_default_type hook
kvm_arch_get_default_type() returns the default KVM type. This hook is
particularly useful to derive a KVM type that is valid for "none"
machine model, which is used by libvirt to probe the availability of
KVM.

For MIPS, the existing mips_kvm_type() is reused. This function ensures
the availability of VZ which is mandatory to use KVM on the current
QEMU.

Cc: qemu-stable@nongnu.org
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-id: 20230727073134.134102-2-akihiko.odaki@daynix.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
[PMM: added doc comment for new function]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
2023-08-22 17:31:02 +01:00
Tao Su
ba3709feaa target/i386: Allow MCDT_NO if host supports
MCDT_NO bit indicates HW contains the security fix and doesn't need to
be mitigated to avoid data-dependent behaviour for certain instructions.
It needs no hypervisor support. Treat it as supported regardless of what
KVM reports.

Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20230706054949.66556-4-tao1.su@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-07 12:52:27 +02:00
Miroslav Rezanina
ddc7cb30f8 Fix build without CONFIG_XEN_EMU
Upstream commit ddf0fd9ae1 "hw/xen: Support HVM_PARAM_CALLBACK_TYPE_GSI callback"
added kvm_xen_maybe_deassert_callback usage to target/i386/kvm/kvm.c file without
conditional preprocessing check. This breaks any build not using CONFIG_XEN_EMU.

Protect call by conditional preprocessing to allow build without CONFIG_XEN_EMU.

Signed-off-by: Miroslav Rezanina <mrezanin@redhat.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20230308130557.2420-1-mrezanin@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-15 11:52:24 +01:00
David Woodhouse
e16aff4cc2 kvm/i386: Add xen-evtchn-max-pirq property
The default number of PIRQs is set to 256 to avoid issues with 32-bit MSI
devices. Allow it to be increased if the user desires.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
2023-03-01 09:09:22 +00:00
David Woodhouse
6096cf7877 hw/xen: Support MSI mapping to PIRQ
The way that Xen handles MSI PIRQs is kind of awful.

There is a special MSI message which targets a PIRQ. The vector in the
low bits of data must be zero. The low 8 bits of the PIRQ# are in the
destination ID field, the extended destination ID field is unused, and
instead the high bits of the PIRQ# are in the high 32 bits of the address.

Using the high bits of the address means that we can't intercept and
translate these messages in kvm_send_msi(), because they won't be caught
by the APIC — addresses like 0x1000fee46000 aren't in the APIC's range.

So we catch them in pci_msi_trigger() instead, and deliver the event
channel directly.

That isn't even the worst part. The worst part is that Xen snoops on
writes to devices' MSI vectors while they are *masked*. When a MSI
message is written which looks like it targets a PIRQ, it remembers
the device and vector for later.

When the guest makes a hypercall to bind that PIRQ# (snooped from a
marked MSI vector) to an event channel port, Xen *unmasks* that MSI
vector on the device. Xen guests using PIRQ delivery of MSI don't
ever actually unmask the MSI for themselves.

Now that this is working we can finally enable XENFEAT_hvm_pirqs and
let the guest use it all.

Tested with passthrough igb and emulated e1000e + AHCI.

           CPU0       CPU1
  0:         65          0   IO-APIC   2-edge      timer
  1:          0         14  xen-pirq   1-ioapic-edge  i8042
  4:          0        846  xen-pirq   4-ioapic-edge  ttyS0
  8:          1          0  xen-pirq   8-ioapic-edge  rtc0
  9:          0          0  xen-pirq   9-ioapic-level  acpi
 12:        257          0  xen-pirq  12-ioapic-edge  i8042
 24:       9600          0  xen-percpu    -virq      timer0
 25:       2758          0  xen-percpu    -ipi       resched0
 26:          0          0  xen-percpu    -ipi       callfunc0
 27:          0          0  xen-percpu    -virq      debug0
 28:       1526          0  xen-percpu    -ipi       callfuncsingle0
 29:          0          0  xen-percpu    -ipi       spinlock0
 30:          0       8608  xen-percpu    -virq      timer1
 31:          0        874  xen-percpu    -ipi       resched1
 32:          0          0  xen-percpu    -ipi       callfunc1
 33:          0          0  xen-percpu    -virq      debug1
 34:          0       1617  xen-percpu    -ipi       callfuncsingle1
 35:          0          0  xen-percpu    -ipi       spinlock1
 36:          8          0   xen-dyn    -event     xenbus
 37:          0       6046  xen-pirq    -msi       ahci[0000:00:03.0]
 38:          1          0  xen-pirq    -msi-x     ens4
 39:          0         73  xen-pirq    -msi-x     ens4-rx-0
 40:         14          0  xen-pirq    -msi-x     ens4-rx-1
 41:          0         32  xen-pirq    -msi-x     ens4-tx-0
 42:         47          0  xen-pirq    -msi-x     ens4-tx-1

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
2023-03-01 09:09:22 +00:00
David Woodhouse
6f43f2ee49 kvm/i386: Add xen-gnttab-max-frames property
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
2023-03-01 09:07:52 +00:00
David Woodhouse
ddf0fd9ae1 hw/xen: Support HVM_PARAM_CALLBACK_TYPE_GSI callback
The GSI callback (and later PCI_INTX) is a level triggered interrupt. It
is asserted when an event channel is delivered to vCPU0, and is supposed
to be cleared when the vcpu_info->evtchn_upcall_pending field for vCPU0
is cleared again.

Thankfully, Xen does *not* assert the GSI if the guest sets its own
evtchn_upcall_pending field; we only need to assert the GSI when we
have delivered an event for ourselves. So that's the easy part, kind of.

There's a slight complexity in that we need to hold the BQL before we
can call qemu_set_irq(), and we definitely can't do that while holding
our own port_lock (because we'll need to take that from the qemu-side
functions that the PV backend drivers will call). So if we end up
wanting to set the IRQ in a context where we *don't* already hold the
BQL, defer to a BH.

However, we *do* need to poll for the evtchn_upcall_pending flag being
cleared. In an ideal world we would poll that when the EOI happens on
the PIC/IOAPIC. That's how it works in the kernel with the VFIO eventfd
pairs — one is used to trigger the interrupt, and the other works in the
other direction to 'resample' on EOI, and trigger the first eventfd
again if the line is still active.

However, QEMU doesn't seem to do that. Even VFIO level interrupts seem
to be supported by temporarily unmapping the device's BARs from the
guest when an interrupt happens, then trapping *all* MMIO to the device
and sending the 'resample' event on *every* MMIO access until the IRQ
is cleared! Maybe in future we'll plumb the 'resample' concept through
QEMU's irq framework but for now we'll do what Xen itself does: just
check the flag on every vmexit if the upcall GSI is known to be
asserted.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
2023-03-01 09:06:44 +00:00
Joao Martins
c345104cd1 i386/xen: handle VCPUOP_register_vcpu_info
Handle the hypercall to set a per vcpu info, and also wire up the default
vcpu_info in the shared_info page for the first 32 vCPUs.

To avoid deadlock within KVM a vCPU thread must set its *own* vcpu_info
rather than it being set from the context in which the hypercall is
invoked.

Add the vcpu_info (and default) GPA to the vmstate_x86_cpu for migration,
and restore it in kvm_arch_put_registers() appropriately.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
2023-03-01 08:22:49 +00:00