Refactor migrate_get_socket_address to internally utilize 'socket-address'
parameter, reducing redundancy in the function definition.
migrate_get_socket_address implicitly converts SocketAddress into str.
Move migrate_get_socket_address inside migrate_get_connect_uri which
should return the uri string instead.
Signed-off-by: Het Gala <het.gala@nutanix.com>
Suggested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240312202634.63349-4-het.gala@nutanix.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Move the calls to migrate_get_socket_address() into migrate_qmp().
Get rid of connect_uri and replace it with args->connect_uri only
because 'to' object will help to generate connect_uri with the
correct port number.
Signed-off-by: Het Gala <het.gala@nutanix.com>
Suggested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240312202634.63349-3-het.gala@nutanix.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Add the 'to' object into migrate_qmp(), so we can use
migrate_get_socket_address() inside migrate_qmp() to get
the port value. This is not applied to other migrate_qmp*
because they don't need the port.
Signed-off-by: Het Gala <het.gala@nutanix.com>
Suggested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240312202634.63349-2-het.gala@nutanix.com
Signed-off-by: Peter Xu <peterx@redhat.com>
The various Intel CPU manuals claim that SGDT and SIDT can write either 24-bits
or 32-bits depending upon the operand size, but this is incorrect. Not only do
the Intel CPU manuals give contradictory information between processor
revisions, but this information doesn't even match real-life behaviour.
In fact, tests on real hardware show that the CPU always writes 32-bits for SGDT
and SIDT, and this behaviour is required for at least OS/2 Warp and WFW 3.11 with
Win32s to function correctly. Remove the masking applied due to the operand size
for SGDT and SIDT so that the TCG behaviour matches the behaviour on real
hardware.
Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2198
--
MCA: Whilst I don't have a copy of OS/2 Warp handy, I've confirmed that this
patch fixes the issue in WFW 3.11 with Win32s. For more technical information I
highly recommend the excellent write-up at
https://www.os2museum.com/wp/sgdtsidt-fiction-and-reality/.
Message-ID: <20240419195147.434894-1-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
docs/requirements.txt is expected by readthedocs and should be in sync
with pythondeps.toml. Add a comment to both.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use warn_report_once() to get rid of the static local variable "notified".
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Message-ID: <20240418100716.1085491-1-zhao1.liu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the difference between warn_report_once() and
error_report_once() is the former has the "warning:" prefix, while the
latter does not have a similar level prefix.
At the meantime, considering that there is no error handling logic here,
and the purpose of error_report_once() is only to prompt the user with
an abnormal message, there is no need to use an error-level message here,
and instead we can just use a warning.
Therefore, downgrade the message in error_report_once() to warning, and
merge it into the previous warn_report_once().
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Message-ID: <20240327103951.3853425-4-zhao1.liu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The difference between error_printf() and error_report() is the latter
may contain more information, such as the name of the program
("qemu-system-x86_64").
Thus its variant error_report_once() and warn_report()'s variant
warn_report_once() can be used here to print the information only once
without a static local variable "ht_warned".
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Message-ID: <20240327103951.3853425-3-zhao1.liu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use warn_report_once() to get rid of the static local variable "warned".
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Message-ID: <20240327103951.3853425-2-zhao1.liu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX requires vMMIO region to be shared. For KVM, MMIO region is the region
which kvm memslot isn't assigned to (except in-kernel emulation).
qemu has the memory region for vMMIO at each device level.
While OVMF issues MapGPA(to-shared) conservatively on 32bit PCI MMIO
region, qemu doesn't find corresponding vMMIO region because it's before
PCI device allocation and memory_region_find() finds the device region, not
PCI bus region. It's safe to ignore MapGPA(to-shared) because when guest
accesses those region they use GPA with shared bit set for vMMIO. Ignore
memory conversion request of non-assigned region to shared and return
success. Otherwise OVMF is confused and panics there.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240229063726.610065-35-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Because vMMIO region needs to be shared region, guest TD may explicitly
convert such region from private to shared. Don't complain such
conversion.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240229063726.610065-34-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Upon an KVM_EXIT_MEMORY_FAULT exit, userspace needs to do the memory
conversion on the RAMBlock to turn the memory into desired attribute,
switching between private and shared.
Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
KVM_EXIT_MEMORY_FAULT happens.
Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has
guest_memfd memory backend.
Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is
added.
When page is converted from shared to private, the original shared
memory can be discarded via ram_block_discard_range(). Note, shared
memory can be discarded only when it's not back'ed by hugetlb because
hugetlb is supposed to be pre-allocated and no need for discarding.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240320083945.991426-13-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When memory page is converted from private to shared, the original
private memory is back'ed by guest_memfd. Introduce
ram_block_discard_guest_memfd_range() for discarding memory in
guest_memfd.
Based on a patch by Isaku Yamahata <isaku.yamahata@intel.com>.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240320083945.991426-12-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some subsystems like VFIO might disable ram block discard, but guest_memfd
uses discard operations to implement conversions between private and
shared memory. Because of this, sequences like the following can result
in stale IOMMU mappings:
1. allocate shared page
2. convert page shared->private
3. discard shared page
4. convert page private->shared
5. allocate shared page
6. issue DMA operations against that shared page
This is not a use-after-free, because after step 3 VFIO is still pinning
the page. However, DMA operations in step 6 will hit the old mapping
that was allocated in step 1.
Address this by taking ram_block_discard_is_enabled() into account when
deciding whether or not to discard pages.
Since kvm_convert_memory()/guest_memfd doesn't implement a
RamDiscardManager handler to convey and replay discard operations,
this is a case of uncoordinated discard, which is blocked/released
by ram_block_discard_require(). Interestingly, this function had
no use so far.
Alternative approaches would be to block discard of shared pages, but
this would cause guests to consume twice the memory if they use VFIO;
or to implement a RamDiscardManager and only block uncoordinated
discard, i.e. use ram_block_coordinated_discard_require().
[Commit message mostly by Michael Roth <michael.roth@amd.com>]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new member "guest_memfd" to memory backends. When it's set
to true, it enables RAM_GUEST_MEMFD in ram_flags, thus private kvm
guest_memfd will be allocated during RAMBlock allocation.
Memory backend's @guest_memfd is wired with @require_guest_memfd
field of MachineState. It avoid looking up the machine in phymem.c.
MachineState::require_guest_memfd is supposed to be set by any VMs
that requires KVM guest memfd as private memory, e.g., TDX VM.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <20240320083945.991426-8-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM side leaves the memory to shared by default, which may incur the
overhead of paging conversion on the first visit of each page. Because
the expectation is that page is likely to private for the VMs that
require private memory (has guest memfd).
Explicitly set the memory to private when memory region has valid
guest memfd backend.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240320083945.991426-16-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Switch to KVM_SET_USER_MEMORY_REGION2 when supported by KVM.
With KVM_SET_USER_MEMORY_REGION2, QEMU can set up memory region that
backend'ed both by hva-based shared memory and guest memfd based private
memory.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240320083945.991426-10-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add KVM guest_memfd support to RAMBlock so both normal hva based memory
and kvm guest memfd based private memory can be associated in one RAMBlock.
Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
create private guest_memfd during RAMBlock setup.
Allocating a new RAM_GUEST_MEMFD flag to instruct the setup of guest memfd
is more flexible and extensible than simply relying on the VM type because
in the future we may have the case that not all the memory of a VM need
guest memfd. As a benefit, it also avoid getting MachineState in memory
subsystem.
Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
confidential guests, such as TDX VM. How and when to set it for memory
backends will be implemented in the following patches.
Introduce memory_region_has_guest_memfd() to query if the MemoryRegion has
KVM guest_memfd allocated.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <20240320083945.991426-7-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce the helper functions to set the attributes of a range of
memory to private or shared.
This is necessary to notify KVM the private/shared attribute of each gpa
range. KVM needs the information to decide the GPA needs to be mapped at
hva-based shared memory or guest_memfd based private memory.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240320083945.991426-11-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The upper 16 bits of kvm_userspace_memory_region::slot are
address space id. Parse it separately in trace_kvm_set_user_memory().
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240229063726.610065-5-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Newer 9.1 machine types will default to using the KVM_SEV_INIT2 API for
creating SEV/SEV-ES going forward. However, this API results in guest
measurement changes which are generally not expected for users of these
older guest types and can cause disruption if they switch to a newer
QEMU/kernel version. Avoid this by continuing to use the older
KVM_SEV_INIT/KVM_SEV_ES_INIT APIs for older machine types.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240409230743.962513-4-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
QEMU will currently automatically make use of the KVM_SEV_INIT2 API for
initializing SEV and SEV-ES guests verses the older
KVM_SEV_INIT/KVM_SEV_ES_INIT interfaces.
However, the older interfaces will silently avoid sync'ing FPU/XSAVE
state to the VMSA prior to encryption, thus relying on behavior and
measurements that assume the related fields to be allow zero.
With KVM_SEV_INIT2, this state is now synced into the VMSA, resulting in
measurements changes and, theoretically, behaviorial changes, though the
latter are unlikely to be seen in practice.
To allow a smooth transition to the newer interface, while still
providing a mechanism to maintain backward compatibility with VMs
created using the older interfaces, provide a new command-line
parameter:
-object sev-guest,legacy-vm-type=true,...
and have it default to false.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240409230743.962513-2-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement support for the KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM virtual
machine types, and the KVM_SEV_INIT2 function of KVM_MEMORY_ENCRYPT_OP.
These replace the KVM_SEV_INIT and KVM_SEV_ES_INIT functions, and have
several advantages:
- sharing the initialization sequence with SEV-SNP and TDX
- allowing arguments including the set of desired VMSA features
- protection against invalid use of KVM_GET/SET_* ioctls for guests
with encrypted state
If the KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types are not supported,
fall back to KVM_SEV_INIT and KVM_SEV_ES_INIT (which use the
default x86 VM type).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM is introducing a new API to create confidential guests, which
will be used by TDX and SEV-SNP but is also available for SEV and
SEV-ES. The API uses the VM type argument to KVM_CREATE_VM to
identify which confidential computing technology to use.
Since there are no other expected uses of VM types, delegate
mc->kvm_type() for x86 boards to the confidential-guest-support
object pointed to by ms->cgs.
For example, if a sev-guest object is specified to confidential-guest-support,
like,
qemu -machine ...,confidential-guest-support=sev0 \
-object sev-guest,id=sev0,...
it will check if a VM type KVM_X86_SEV_VM or KVM_X86_SEV_ES_VM
is supported, and if so use them together with the KVM_SEV_INIT2
function of the KVM_MEMORY_ENCRYPT_OP ioctl. If not, it will fall back to
KVM_SEV_INIT and KVM_SEV_ES_INIT.
This is a preparatory work towards TDX and SEV-SNP support, but it
will also enable support for VMSA features such as DebugSwap, which
are only available via KVM_SEV_INIT2.
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce a common superclass for x86 confidential guest implementations.
It will extend ConfidentialGuestSupportClass with a method that provides
the VM type to be passed to KVM_CREATE_VM.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Board reset requires writing a fresh CPU state. As far as KVM is
concerned, the only thing that blocks reset is that CPU state is
encrypted; therefore, kvm_cpus_are_resettable() can simply check
if that is the case.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
So far, KVM has allowed KVM_GET/SET_* ioctls to execute even if the
guest state is encrypted, in which case they do nothing. For the new
API using VM types, instead, the ioctls will fail which is a safer and
more robust approach.
The new API will be the only one available for SEV-SNP and TDX, but it
is also usable for SEV and SEV-ES. In preparation for that, require
architecture-specific KVM code to communicate the point at which guest
state is protected (which must be after kvm_cpu_synchronize_post_init(),
though that might change in the future in order to suppor migration).
From that point, skip reading registers so that cpu->vcpu_dirty is
never true: if it ever becomes true, kvm_arch_put_registers() will
fail miserably.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Right now, the system reset is concluded by a call to
cpu_synchronize_all_post_reset() in order to sync any changes
that the machine reset callback applied to the CPU state.
However, for VMs with encrypted state such as SEV-ES guests (currently
the only case of guests with non-resettable CPUs) this cannot be done,
because guest state has already been finalized by machine-init-done notifiers.
cpu_synchronize_all_post_reset() does nothing on these guests, and actually
we would like to make it fail if called once guest has been encrypted.
So, assume that boards that support non-resettable CPUs do not touch
CPU state and that all such setup is done before, at the time of
cpu_synchronize_all_post_init().
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Data structures like struct setup_data have been moved to a separate
setup_data.h header which bootparam.h relies on. Add setup_data.h to
the cp_portable() list and sync it along with the other header files.
Note that currently struct setup_data is stripped away as part of
generating bootparam.h, but that handling is no currently needed for
setup_data.h since it doesn't pull in many external
headers/dependencies. However, QEMU currently redefines struct
setup_data in hw/i386/x86.c, so that will need to be removed as part of
any header update that pulls in the new setup_data.h to avoid build
bisect breakage.
Because <asm/setup_data.h> is the first architecture specific #include
in include/standard-headers/, add a new sed substitution to rewrite
asm/ include to the standard-headers/asm-* subdirectory for the current
architecture.
And while at it, remove asm-generic/kvm_para.h from the list of
allowed includes: it does not have a matching substitution, and therefore
it would not be possible to use it on non-Linux systems where there is
no /usr/include/asm-generic/ directory.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use unified confidential_guest_kvm_init() for consistency with
other architectures.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20240229060038.606591-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the unified interface to call confidential guest related kvm_init()
and kvm_reset(), to avoid exposing pef specific functions.
As a bonus, pef.h goes away since there is no direct call from sPAPR
board code to PEF code anymore.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use confidential_guest_kvm_init() instead of calling SEV
specific sev_kvm_init(). This allows the introduction of multiple
confidential-guest-support subclasses for different x86 vendors.
As a bonus, stubs are not needed anymore since there is no
direct call from target/i386/kvm/kvm.c to SEV code.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20240229060038.606591-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Different confidential VMs in different architectures all have the same
needs to do their specific initialization (and maybe resetting) stuffs
with KVM. Currently each of them exposes individual *_kvm_init()
functions and let machine code or kvm code to call it.
To facilitate the introduction of confidential guest technology from
different x86 vendors, add two virtual functions, kvm_init() and kvm_reset()
in ConfidentialGuestSupportClass, and expose two helpers functions for
invodking them.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20240229060038.606591-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system
also has a PC-AT-compatible dual-8259 setup, i.e., the PIC. When PIC
is not enabled (pic=off) for x86 machine, the PCAT_COMPAT bit needs to
be cleared. The PIC probe should then print:
[ 0.155970] Using NULL legacy PIC
However, no such log printed in guest kernel unless PCAT_COMPAT is
cleared.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240403145953.3082491-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a q35 property to check whether or not SMM ranges, e.g. SMRAM, TSEG,
etc... exist for the target platform. TDX doesn't support SMM and doesn't
play nice with QEMU modifying related guest memory ranges.
Signed-off-by: Isaku Yamahata <isaku.yamahata@linux.intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240320083945.991426-19-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In mch_realize(), process PAM initialization before SMRAM initialization so
that later patch can skill all the SMRAM related with a single check.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240320083945.991426-18-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Register File Data Sampling (RFDS) is a CPU side-channel vulnerability
that may expose stale register value. CPUs that set RFDS_NO bit in MSR
IA32_ARCH_CAPABILITIES indicate that they are not vulnerable to RFDS.
Similarly, RFDS_CLEAR indicates that CPU is affected by RFDS, and has
the microcode to help mitigate RFDS.
Make RFDS_CLEAR and RFDS_NO bits available to guests.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Message-ID: <9a38877857392b5c2deae7e7db1b170d15510314.1710341348.git.pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to table 1-2 in Intel Architecture Instruction Set Extensions and
Future Features (rev 051) [1], SierraForest has the following new features
which have already been virtualized:
- CMPCCXADD CPUID.(EAX=7,ECX=1):EAX[bit 7]
- AVX-IFMA CPUID.(EAX=7,ECX=1):EAX[bit 23]
- AVX-VNNI-INT8 CPUID.(EAX=7,ECX=1):EDX[bit 4]
- AVX-NE-CONVERT CPUID.(EAX=7,ECX=1):EDX[bit 5]
Add above features to new CPU model SierraForest. Comparing with GraniteRapids
CPU model, SierraForest bare-metal removes the following features:
- HLE CPUID.(EAX=7,ECX=0):EBX[bit 4]
- RTM CPUID.(EAX=7,ECX=0):EBX[bit 11]
- AVX512F CPUID.(EAX=7,ECX=0):EBX[bit 16]
- AVX512DQ CPUID.(EAX=7,ECX=0):EBX[bit 17]
- AVX512_IFMA CPUID.(EAX=7,ECX=0):EBX[bit 21]
- AVX512CD CPUID.(EAX=7,ECX=0):EBX[bit 28]
- AVX512BW CPUID.(EAX=7,ECX=0):EBX[bit 30]
- AVX512VL CPUID.(EAX=7,ECX=0):EBX[bit 31]
- AVX512_VBMI CPUID.(EAX=7,ECX=0):ECX[bit 1]
- AVX512_VBMI2 CPUID.(EAX=7,ECX=0):ECX[bit 6]
- AVX512_VNNI CPUID.(EAX=7,ECX=0):ECX[bit 11]
- AVX512_BITALG CPUID.(EAX=7,ECX=0):ECX[bit 12]
- AVX512_VPOPCNTDQ CPUID.(EAX=7,ECX=0):ECX[bit 14]
- LA57 CPUID.(EAX=7,ECX=0):ECX[bit 16]
- TSXLDTRK CPUID.(EAX=7,ECX=0):EDX[bit 16]
- AMX-BF16 CPUID.(EAX=7,ECX=0):EDX[bit 22]
- AVX512_FP16 CPUID.(EAX=7,ECX=0):EDX[bit 23]
- AMX-TILE CPUID.(EAX=7,ECX=0):EDX[bit 24]
- AMX-INT8 CPUID.(EAX=7,ECX=0):EDX[bit 25]
- AVX512_BF16 CPUID.(EAX=7,ECX=1):EAX[bit 5]
- fast zero-length MOVSB CPUID.(EAX=7,ECX=1):EAX[bit 10]
- fast short CMPSB, SCASB CPUID.(EAX=7,ECX=1):EAX[bit 12]
- AMX-FP16 CPUID.(EAX=7,ECX=1):EAX[bit 21]
- PREFETCHI CPUID.(EAX=7,ECX=1):EDX[bit 14]
- XFD CPUID.(EAX=0xD,ECX=1):EAX[bit 4]
- EPT_PAGE_WALK_LENGTH_5 VMX_EPT_VPID_CAP(0x48c)[bit 7]
Add all features of GraniteRapids CPU model except above features to
SierraForest CPU model.
SierraForest doesn’t support TSX and RTM but supports TAA_NO. When RTM is
not enabled in host, KVM will not report TAA_NO. So, just don't include
TAA_NO in SierraForest CPU model.
[1] https://cdrdv2.intel.com/v1/dl/getContent/671368
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Message-ID: <20240320021044.508263-1-tao1.su@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When start L2 guest with both L1/L2 using Icelake-Server-v3 or above,
QEMU reports below warning:
"warning: host doesn't support requested feature: MSR(10AH).taa-no [bit 8]"
Reason is QEMU Icelake-Server-v3 has TSX feature disabled but enables taa-no
bit. It's meaningless that TSX isn't supported but still claim TSX is secure.
So L1 KVM doesn't expose taa-no to L2 if TSX is unsupported, then starting L2
triggers the warning.
Fix it by introducing a new version Icelake-Server-v7 which has both TSX
and taa-no features. Then guest can use TSX securely when it see taa-no.
This matches the production Icelake which supports TSX and isn't susceptible
to TSX Async Abort (TAA) vulnerabilities, a.k.a, taa-no.
Ideally, TSX should have being enabled together with taa-no since v3, but for
compatibility, we'd better to add v7 to enable it.
Fixes: d965dc3559 ("target/i386: Add ARCH_CAPABILITIES related bits into Icelake-Server CPU model")
Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Message-ID: <20240320093138.80267-2-zhenzhong.duan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the architectural (for lack of a better term) CPUID leaf generation
to a separate helper so that the generation code can be reused by TDX,
which needs to generate a canonical VM-scoped configuration.
For now this is just a cleanup, so keep the function static.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240229063726.610065-23-xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Query kvm for supported guest physical address bits, in cpuid
function 80000008, eax[23:16]. Usually this is identical to host
physical address bits. With NPT or EPT being used this might be
restricted to 48 (max 4-level paging address space size) even if
the host cpu supports more physical address bits.
When set pass this to the guest, using cpuid too. Guest firmware
can use this to figure how big the usable guest physical address
space is, so PCI bar mapping are actually reachable.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Message-ID: <20240318155336.156197-2-kraxel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allows to set guest-phys-bits (cpuid leaf 80000008, eax[23:16])
via -cpu $model,guest-phys-bits=$nr.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Message-ID: <20240318155336.156197-3-kraxel@redhat.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If an architecture adds support for KVM_CAP_SET_GUEST_DEBUG but QEMU does not
have the necessary code, QEMU will fail to build after updating kernel headers.
Avoid this by using a #define in config-target.h instead of KVM_CAP_SET_GUEST_DEBUG.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Take into account split screen mode close to wrap around, which is the
other special case for dirty memory region computation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The depth == 0 and depth == 15 have to be special cased because
width * depth / 8 does not provide the correct scanline length.
However, thanks to the recent reorganization of vga_draw_graphic()
the correct value of VRAM bits per pixel is available in "bits".
Use it (via the same "bwidth" computation that is used later in
the function), thus restricting the slow path to the wraparound case.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Even though monitor_get_fd() has to remain separate because it is mocked by
tests/unit/test-util-sockets, monitor_fdsets_cleanup() is logically part
of the stubs for monitor/fds.c, so move it there.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20240408155330.522792-19-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>