The KVM FWNMI capability should be enabled with the "ibm,nmi-register"
rtas call. Although MCEs from KVM will be delivered as architected
interrupts to the guest before "ibm,nmi-register" is called, KVM has
different behaviour depending on whether the guest has enabled FWNMI
(it attempts to do more recovery on behalf of a non-FWNMI guest).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200325142906.221248-2-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Coverity detected an issue (CID 1421903) with potential call of clz64(0)
which returns 64 which make it do "<<" with a negative number.
This checks the mask and avoids undefined behaviour.
In practice pgsizes and memory_region_iommu_get_min_page_size() always
have some common page sizes and even if they did not, the resulting page
size would be 0x8000.0000.0000.0000 (gcc 9.2) and
ioctl(VFIO_IOMMU_SPAPR_TCE_CREATE) would fail anyway.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20200324063912.25063-1-aik@ozlabs.ru>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
If qemu_find_file() doesn't find the BIOS it returns NULL; we were
passing that unchecked through to load_elf(), which assumes a non-NULL
pointer and may misbehave. In practice it fails with a weird message:
$ qemu-system-ppc -M ppce500 -display none -kernel nonesuch
Bad address
qemu-system-ppc: could not load firmware '(null)'
Handle the failure case better:
$ qemu-system-ppc -M ppce500 -display none -kernel nonesuch
qemu-system-ppc: could not find firmware/kernel file 'nonesuch'
Spotted by Coverity (CID 1238954).
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20200324121216.23899-1-peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reorganize the descriptor handling so that CUR_DSCR always
points to the next descriptor to be processed.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Francisco Iglesias <frasse.iglesias@gmail.com>
Message-id: 20200402134721.27863-6-edgar.iglesias@gmail.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Advance the descriptor address when stopping the channel.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Reviewed-by: Francisco Iglesias <frasse.iglesias@gmail.com>
Acked-by: Alistair Francis <alistair.francis@wdc.com>
Message-id: 20200402134721.27863-5-edgar.iglesias@gmail.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Clear DMA_DONE when halting the DMA channel.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Reviewed-by: Francisco Iglesias <frasse.iglesias@gmail.com>
Acked-by: Alistair Francis <alistair.francis@wdc.com>
Message-id: 20200402134721.27863-4-edgar.iglesias@gmail.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Populate DBG0.CMN_BUF_FREE so that SW can see some free space.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Francisco Iglesias <frasse.iglesias@gmail.com>
Message-id: 20200402134721.27863-3-edgar.iglesias@gmail.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Remove comment.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Francisco Iglesias <frasse.iglesias@gmail.com>
Message-id: 20200402134721.27863-2-edgar.iglesias@gmail.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
local_err is used several times in guest_suspend(). Setting non-NULL
local_err will crash, so let's zero it after freeing. Also fix possible
leak of local_err in final if().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200324153630.11882-7-vsementsov@virtuozzo.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
It's possible that we'll try to set err twice (or more). It's bad, it
will crash.
Instead, use warn_report().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200324153630.11882-4-vsementsov@virtuozzo.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Add script to find and fix trivial use-after-free of Error objects.
How to use:
spatch --sp-file scripts/coccinelle/error-use-after-free.cocci \
--macro-file scripts/cocci-macro-file.h --in-place \
--no-show-diff ( FILES... | --use-gitgrep . )
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200324153630.11882-2-vsementsov@virtuozzo.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Pastos in commit message and comment fixed, globbing in MAINTAINERS
expanded]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
In write_elf_section() we set the 'shdr' pointer to point to local
structures shdr32 or shdr64, which we fill in to be written out to
the ELF dump. Unfortunately the address we pass to fd_write_vmcore()
has a spurious '&' operator, so instead of writing out the section
header we write out the literal pointer value followed by whatever is
on the stack after the 'shdr' local variable.
Pass the correct address into fd_write_vmcore().
Spotted by Coverity: CID 1421970.
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20200324173630.12221-1-peter.maydell@linaro.org
Remove a direct include of assert.h -- this is already
provided by qemu/osdep.h, and it breaks our rule that the
first include must always be osdep.h.
In particular we must get the assert() macro via osdep.h
to avoid compile failures on mingw (see the comment in
osdep.h where we redefine assert() for that platform).
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-id: 20200403124712.24826-1-peter.maydell@linaro.org
An old comment in get_phys_addr_lpae() claims that the code does not
support the different format TCR for VTCR_EL2. This used to be true
but it is not true now (in particular the aa64_va_parameters() and
aa32_va_parameters() functions correctly handle the different
register format by checking whether the mmu_idx is Stage2).
Remove the out of date parts of the comment.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20200331143407.3186-1-peter.maydell@linaro.org
Our implementation of the PSTATE.PAN bit incorrectly cleared all
access permission bits for privileged access to memory which is
user-accessible. It should only affect the privileged read and write
permissions; execute permission is dealt with via XN/PXN instead.
Fixes: 81636b70c2
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20200330170651.20901-1-peter.maydell@linaro.org
Coverity complains that the collie_init() function leaks the memory
allocated in sa1110_init(). This is true but not significant since
the function is called only once on machine init and the memory must
remain in existence until QEMU exits anyway.
Still, we can avoid the technical memory leak by keeping the pointer
to the StrongARMState inside the machine state struct. Switch from
the simple DEFINE_MACHINE() style to defining a subclass of
TYPE_MACHINE which extends the MachineState struct, and keep the
pointer there.
Fixes: CID 1421921
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20200326204919.22006-1-peter.maydell@linaro.org
While support for parsing ieee_half in the XML description was added
to gdb in 2019 (a6d0f249) there is no easy way for the gdbstub to know
if the gdb end will understand it. Disable it for now and allow older
gdbs to successfully connect to the default -cpu max SVE enabled
QEMUs.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20200402143913.24005-1-alex.bennee@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
When a file descriptor becomes ready we must re-arm POLL_ADD. This is
done by adding an sqe to the io_uring sq ring. The ->need_wait()
function wasn't taking pending sqes into account and therefore
io_uring_submit_and_wait() was not being called. Polling for cqes
failed to detect fd readiness since we hadn't submitted the sqe to
io_uring.
This patch fixes the following tests/test-aio -p /aio/event/wait
failure:
ok 11 /aio/event/wait
**
ERROR:tests/test-aio.c:374:test_flush_event_notifier: assertion failed: (aio_poll(ctx, false))
Reported-by: Cole Robinson <crobinso@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Cole Robinson <crobinso@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200402145434.99349-1-stefanha@redhat.com
Fixes: 73fd282e7b
("aio-posix: add io_uring fd monitoring implementation")
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Since bd457782b3 ("x86/pc: use memdev for RAM") Xen
machine fails to start with:
qemu-system-i386: xen: failed to populate ram at 0
The reason is that xen_ram_alloc() which is called by
memory_region_init_ram(), compares memory region with
statically allocated 'global' ram_memory memory region
that it uses for RAM, and does nothing in case it matches.
While it's possible feed machine->ram to xen_ram_alloc()
in the same manner to keep that hack working, I'd prefer
not to keep that circular dependency and try to untangle that.
However it doesn't look trivial to fix, so as temporary
fixup opt out Xen machine from memdev based RAM allocation,
and let xen_ram_alloc() do its trick for now.
Reported-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20200402145418.5139-1-imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
No need to return an empty value from object-add (it would also leak
if the command failed). While at it, remove the "if" around object_unref
since object_unref handles NULL arguments just fine.
Reported-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20200325184723.2029630-4-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Direct leak of 4120 byte(s) in 1 object(s) allocated from:
#0 0x7fa114931887 in __interceptor_calloc (/lib64/libasan.so.6+0xb0887)
#1 0x7fa1144ad8f0 in g_malloc0 (/lib64/libglib-2.0.so.0+0x588f0)
#2 0x561e3c9c8897 in qmp_object_add /home/elmarco/src/qemu/qom/qom-qmp-cmds.c:291
#3 0x561e3cf48736 in qmp_dispatch /home/elmarco/src/qemu/qapi/qmp-dispatch.c:155
#4 0x561e3c8efb36 in monitor_qmp_dispatch /home/elmarco/src/qemu/monitor/qmp.c:145
#5 0x561e3c8f09ed in monitor_qmp_bh_dispatcher /home/elmarco/src/qemu/monitor/qmp.c:234
#6 0x561e3d08c993 in aio_bh_call /home/elmarco/src/qemu/util/async.c:136
#7 0x561e3d08d0a5 in aio_bh_poll /home/elmarco/src/qemu/util/async.c:164
#8 0x561e3d0a535a in aio_dispatch /home/elmarco/src/qemu/util/aio-posix.c:380
#9 0x561e3d08e3ca in aio_ctx_dispatch /home/elmarco/src/qemu/util/async.c:298
#10 0x7fa1144a776e in g_main_context_dispatch (/lib64/libglib-2.0.so.0+0x5276e)
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20200325184723.2029630-3-marcandre.lureau@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 048c95163b ("target/i386: work around KVM_GET_MSRS bug for
secondary execution controls") added a workaround for KVM pre-dating
commit 6defc591846d ("KVM: nVMX: include conditional controls in /dev/kvm
KVM_GET_MSRS") which wasn't setting certain available controls. The
workaround uses generic CPUID feature bits to set missing VMX controls.
It was found that in some cases it is possible to observe hosts which
have certain CPUID features but lack the corresponding VMX control.
In particular, it was reported that Azure VMs have RDSEED but lack
VMX_SECONDARY_EXEC_RDSEED_EXITING; attempts to enable this feature
bit result in QEMU abort.
Resolve the issue but not applying the workaround when we don't have
to. As there is no good way to find out if KVM has the fix itself, use
95c5c7c77c ("KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST") instead
as these [are supposed to] come together.
Fixes: 048c95163b ("target/i386: work around KVM_GET_MSRS bug for secondary execution controls")
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200331162752.1209928-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After c9808d6028 we have both an object representing the serial-isa
device and a separate object representing the underlying common serial
uart. Both of these have vmsd's associated with them and thus the
migration stream ends up with two copies of the migration data - the
serial-isa includes the vmstate of the core serial. Besides
being wrong, it breaks backwards migration compatibility.
Fix this by removing the dc->vmsd from the core device, so it only
gets migrated by any parent devices including it.
Add a vmstate_serial_mm so that any device that uses serial_mm_init
rather than creating a device still gets migrated.
(That doesn't fix backwards migration for serial_mm_init users,
but does seem to work forwards for ppce500).
Fixes: c9808d6028 ('serial: realize the serial device')
Buglink: https://bugs.launchpad.net/qemu/+bug/1869426
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20200330164712.198282-1-dgilbert@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The sequence of instructions exposes an issue:
sti
hlt
Interrupts cannot be delivered to hvf after hlt instruction cpu because
HF_INHIBIT_IRQ_MASK is set just before hlt is handled and never reset
after moving instruction pointer beyond hlt.
So, after hvf_vcpu_exec() returns, CPU thread gets locked up forever in
qemu_wait_io_event() (cpu_thread_is_idle() evaluates inhibition
flag and considers the CPU idle if the flag is set).
Cc: Cameron Esfahani <dirty@apple.com>
Signed-off-by: Roman Bolshakov <r.bolshakov@yadro.com>
Message-Id: <20200328174411.51491-1-r.bolshakov@yadro.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit a1b18df9a4, broke virt_kvm_type() logic, which depends on
maxram_size, ram_size, ram_slots being parsed/set on machine instance
at the time accelerator (KVM) is initialized.
set_memory_options() part was already reverted by commit 2a7b18a320,
so revert remaining initialization of above machine fields to make
virt_kvm_type() work as it used to.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Reported-by: Auger Eric <eric.auger@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Message-Id: <20200326112829.19989-1-imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Older QEMU versions did fixup the ram size to match what can be reported
via sclp. We need to mimic this behaviour for machine types 4.2 and
older to not fail on inbound migration for memory sizes that do not fit.
Old machines with proper aligned memory sizes are not affected.
Alignment table:
VM size (<=) | Alignment
--------------------------
1020M | 1M
2040M | 2M
4080M | 4M
8160M | 8M
16320M | 16M
32640M | 32M
65280M | 64M
130560M | 128M
261120M | 256M
522240M | 512M
1044480M | 1G
2088960M | 2G
4177920M | 4G
8355840M | 8G
Suggested action is to replace unaligned -m value with a suitable
aligned one or if a change to a newer machine type is possible, use a
machine version >= 5.0.
A future version might remove the compatibility handling.
For machine types >= 5.0 we can simply use an increment size of 1M and
use the full range of increment number which allows for all possible
memory sizes. The old limitation of having a maximum of 1020 increments
was added for standby memory, which we no longer support. With that we
can now support even weird memory sizes like 10001234 MB.
As we no longer fixup maxram_size as well, make other users use ram_size
instead. Keep using maxram_size when setting the maximum ram size in KVM,
as that will come in handy in the future when supporting memory hotplug
(in contrast, storage keys and storage attributes for hotplugged memory
will have to be migrated per RAM block in the future).
Fixes: 3a12fc61af ("390x/s390-virtio-ccw: use memdev for RAM")
Reported-by: Lukáš Doktor <ldoktor@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20200401123754.109602-1-borntraeger@de.ibm.com>
[CH: fixed up message on memory size fixup]
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
The cpu number reporting is handled by KVM and QEMU only fills in the
VM name, uuid and other values.
Unfortunately KVM doesn't report reserved cpus and doesn't even know
they exist until the are created via the ioctl.
So let's fix up the cpu values after KVM has written its values to the
3.2.2 sysib. To be consistent, we use the same code to retrieve the cpu
numbers as the STSI TCG code in target/s390x/misc_helper.c:HELPER(stsi).
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-Id: <20200331110123.3774-1-frankja@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
By increasing avx2 length_to_accel to 128, we can simplify its logic and reduce a
branch.
The authorship of this patch actually belongs to Richard Henderson
<richard.henderson@linaro.org>, I just fixed a boundary case on his
original patch.
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Message-Id: <1585119021-46593-2-git-send-email-robert.hu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Because in unit test, init_accel() will be called several times, each with
different accelerator type.
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Message-Id: <1585119021-46593-1-git-send-email-robert.hu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The virtio-iommu device attaches itself to a PCI bus, so it makes
no sense to include it unless PCI is supported---and in fact
compilation fails without this change.
Reported-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The license is the 'GNU General Public License v2.0 or later',
not 'and':
This program is free software; you can redistribute it and/ori
modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of
the License, or (at your option) any later version.
Fix the license comment.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200312213712.16671-1-philmd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When running Ubuntu 3.13.0-65-generic guest, QEMU sometimes crashes
during guest ACPI reset. It crashes on assert(s->rings_info_valid)
in pvscsi_process_io().
Analyzing the crash revealed that it happens when userspace issues
a sync during a reboot syscall.
Below are backtraces we gathered from the guests.
Guest backtrace when issuing PVSCSI_CMD_ADAPTER_RESET:
pci_device_shutdown
device_shutdown
init_pid_ns
init_pid_ns
kernel_power_off
SYSC_reboot
Guest backtrace when issuing PVSCSI_REG_OFFSET_KICK_RW_IO:
scsi_done
scsi_dispatch_cmd
blk_add_timer
scsi_request_fn
elv_rb_add
__blk_run_queue
queue_unplugged
blk_flush_plug_list
blk_finish_plug
ext4_writepages
set_next_entity
do_writepages
__filemap_fdatawrite_range
filemap_write_and_wait_range
ext4_sync_file
ext4_sync_file
do_fsync
sys_fsync
Since QEMU pvscsi should imitate VMware pvscsi device emulation,
we decided to imitate VMware's behavior in this case.
To check VMware behavior, we wrote a kernel module that issues
a reset to the pvscsi device and then issues a kick. We ran it on
VMware ESXi 6.5 and it seems that it simply ignores the kick.
Hence, we decided to ignore the kick as well.
Signed-off-by: Elazar Leibovich <elazar.leibovich@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Message-Id: <20200315132634.113632-1-liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Current Icelake-Server CPU model lacks all the features enumerated by
MSR_IA32_ARCH_CAPABILITIES.
Add them, so that guest of "Icelake-Server" can see all of them.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200316095605.12318-1-xiaoyao.li@intel.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
The CPUID level need to be set to 0x14 manually on old
machine-type if Intel PT is enabled in guest. E.g. the
CPUID[0].EAX(level)=7 and CPUID[7].EBX[25](intel-pt)=1 when the
Qemu with "-machine pc-i440fx-3.1 -cpu qemu64,+intel-pt" parameter.
Some Intel PT capabilities are exposed by leaf 0x14 and the
missing capabilities will cause some MSRs access failed.
This patch add a warning message to inform the user to extend
the CPUID level.
Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Message-Id: <1584031686-16444-1-git-send-email-luwei.kang@intel.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
If the system is numa configured the pkg_offset needs
to be adjusted for EPYC cpu models. Fix it calling the
model specific handler.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <158396725589.58170.16424607815207074485.stgit@naples-babu.amd.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
The APIC ID is decoded based on the sequence sockets->dies->cores->threads.
This works fine for most standard AMD and other vendors' configurations,
but this decoding sequence does not follow that of AMD's APIC ID enumeration
strictly. In some cases this can cause CPU topology inconsistency.
When booting a guest VM, the kernel tries to validate the topology, and finds
it inconsistent with the enumeration of EPYC cpu models. The more details are
in the bug https://bugzilla.redhat.com/show_bug.cgi?id=1728166.
To fix the problem we need to build the topology as per the Processor
Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
Processors. The documentation is available from the bugzilla Link below.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
It is also available at
https://www.amd.com/system/files/TechDocs/55570-B1_PUB.zip
Here is the text from the PPR.
Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
number of least significant bits in the Initial APIC ID that indicate core ID
within a processor, in constructing per-core CPUID masks.
Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
(MNC) that the processor could theoretically support, not the actual number of
cores that are actually implemented or enabled on the processor, as indicated
by Core::X86::Cpuid::SizeId[NC].
Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
• ApicId[6] = Socket ID.
• ApicId[5:4] = Node ID.
• ApicId[3] = Logical CCX L3 complex ID
• ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}
The new apic id encoding is enabled for EPYC and EPYC-Rome models.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <158396724913.58170.3539083528095710811.stgit@naples-babu.amd.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Apicid calculation depends on knowing the total number of numa nodes
for EPYC cpu models. Right now, we are calculating the arch_id while
parsing the numa(parse_numa). At this time, it is not known how many
total numa nodes are configured in the system.
Move the arch_id calculation inside x86_cpus_init. At this time, smp
parse is already completed and numa node information is available.
Override the handlers if use_epyc_apic_id_encoding is enabled in
cpu model definition.
Also replace the calling convention to use handlers from
X86MachineState.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <158396724217.58170.12256158354204870716.stgit@naples-babu.amd.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Add a boolean variable use_epyc_apic_id_encoding in X86CPUDefinition.
This will be set if this cpu model needs to use new EPYC based
apic id encoding.
Override the handlers with EPYC based handlers if use_epyc_apic_id_encoding
is set. This will be done in x86_cpus_init.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <158396723514.58170.14825482171652019765.stgit@naples-babu.amd.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Introduce model specific apicid functions inside X86MachineState.
These functions will be loaded from X86CPUDefinition.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <158396722838.58170.5675998866484476427.stgit@naples-babu.amd.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Use the new functions from topology.h and delete the unused code. Given the
sockets, nodes, cores and threads, the new functions generate apic id for EPYC
mode. Removes all the hardcoded values.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <158396722151.58170.8031705769621392927.stgit@naples-babu.amd.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
These functions add support for building EPYC mode topology given the smp
details like numa nodes, cores, threads and sockets.
The new apic id decoding is mostly similar to current apic id decoding
except that it adds a new field node_id when numa configured. Removes all
the hardcoded values. Subsequent patches will use these functions to build
the topology.
Following functions are added.
apicid_llc_width_epyc
apicid_llc_offset_epyc
apicid_pkg_offset_epyc
apicid_from_topo_ids_epyc
x86_topo_ids_from_idx_epyc
x86_topo_ids_from_apicid_epyc
x86_apicid_from_cpu_idx_epyc
The topology details are available in Processor Programming Reference (PPR)
for AMD Family 17h Model 01h, Revision B1 Processors. The revision guides are
available from the bugzilla Link below.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <158396721426.58170.2930696192478912976.stgit@naples-babu.amd.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>