Linux and Windows need ACPI SRAT table to make memory hotplug work properly,
however currently QEMU doesn't create SRAT table if numa options aren't present
on CLI.
Which breaks both linux and windows guests in certain conditions:
* Windows: won't enable memory hotplug without SRAT table at all
* Linux: if QEMU is started with initial memory all below 4Gb and no SRAT table
present, guest kernel will use nommu DMA ops, which breaks 32bit hw drivers
when memory is hotplugged and guest tries to use it with that drivers.
Fix above issues by automatically creating a numa node when QEMU is started with
memory hotplug enabled but without '-numa' options on CLI.
(PS: auto-create numa node only for new machine types so not to break migration).
Which would provide SRAT table to guests without explicit -numa options on CLI
and would allow:
* Windows: to enable memory hotplug
* Linux: switch to SWIOTLB DMA ops, to bounce DMA transfers to 32bit allocated
buffers that legacy drivers/hw can handle.
[Rewritten by Igor]
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Suggested-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Marcel Apfelbaum <marcel@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Alistair Francis <alistair23@gmail.com>
Cc: Takao Indoh <indou.takao@jp.fujitsu.com>
Cc: Izumi Taku <izumi.taku@jp.fujitsu.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Currently there is no MMIO range over 4G
reserved for PCI hotplug. Since the 32bit PCI hole
depends on the number of cold-plugged PCI devices
and other factors, it is very possible is too small
to hotplug PCI devices with large BARs.
Fix it by reserving 2G for I4400FX chipset
in order to comply with older Win32 Guest OSes
and 32G for Q35 chipset.
Even if the new defaults of pci-hole64-size will appear in
"info qtree" also for older machines, the property was
not implemented so no changes will be visible to guests.
Note this is a regression since prev QEMU versions had
some range reserved for 64bit PCI hotplug.
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Legacy PCI device assignment has been removed from Linux in 4.12,
and had been deprecated 2 years ago there. We can remove it from
QEMU as well.
The ROM loading code was shared with Xen PCI passthrough, so move
it to hw/xen.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xen vIOMMU device model will be in Xen hypervisor. Skip vIOMMU
check for Xen here when vcpu number is more than 255.
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Message-Id: <1502842933-8323-1-git-send-email-tianyu.lan@intel.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
xen_modified_memory() sets errno to communicate what went wrong so log
this rather than the return value which is not interesting.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Since 4458fb3a79 (pc: Eliminate pc_default_machine_options()),
hot_add_cpu is set in pc_machine_class_init(), so we don't
need to set it in pc_q35_machine_options(), pc_i440fx_machine_options()
and xenfv_machine_options(), except to clear it in
pc_i440fx_1_4_machine_opt().
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We don't touch isapc when we change guest ABI and add new entries
to PC_COMPAT_* or new PCMachineClass compat flags. This means
isapc never guaranteed guest ABI and cross-QEMU-version live
migration compatibility. There's no point in keeping code for
kvm-pv-eoi and APIC ID compatibility in pc_init_isa().
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Add INTERFACE_CONVENTIONAL_PCI_DEVICE to all direct subtypes of
TYPE_PCI_DEVICE, except:
1) The ones that already have INTERFACE_PCIE_DEVICE set:
* base-xhci
* e1000e
* nvme
* pvscsi
* vfio-pci
* virtio-pci
* vmxnet3
2) base-pci-bridge
Not all PCI bridges are Conventional PCI devices, so
INTERFACE_CONVENTIONAL_PCI_DEVICE is added only to the subtypes
that are actually Conventional PCI:
* dec-21154-p2p-bridge
* i82801b11-bridge
* pbm-bridge
* pci-bridge
The direct subtypes of base-pci-bridge not touched by this patch
are:
* xilinx-pcie-root: Already marked as PCIe-only.
* pcie-pci-bridge: Already marked as PCIe-only.
* pcie-port: all non-abstract subtypes of pcie-port are already
marked as PCIe-only devices.
3) megasas-base
Not all megasas devices are Conventional PCI devices, so the
interface names are added to the subclasses registered by
megasas_register_types(), according to information in the
megasas_devices[] array.
"megasas-gen2" already implements INTERFACE_PCIE_DEVICE, so add
INTERFACE_CONVENTIONAL_PCI_DEVICE only to "megasas".
Acked-by: Alberto Garcia <berto@igalia.com>
Acked-by: John Snow <jsnow@redhat.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reintroduce the write callback that was removed when write support was
removed in commit 023e314856.
Contrary to the previous callback implementation, the write_cb
callback is called whenever a write happened, so handlers must be
ready to handle partial write as necessary.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
heterogeneous cpus are not supported and hotplugging different
cpu model crashes QEMU:
qemu-system-x86_64 -cpu qemu64 -smp 1,maxcpus=2
(qemu) device_add host-x86_64-cpu,socket-id=1,core-id=0,thread-id=0,id=foo
(qemu) info cpus
error: failed to get MSR 0x38d
qemu-system-x86_64: target/i386/kvm.c:2121: kvm_get_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
Aborted (core dumped)
Gracefully fail hotplug process in case of user mistake.
Reported-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <1507638879-200718-1-git-send-email-imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes e2b6c17 (kvmclock: update system_time_msr address forcibly)
which makes a call to get the latest value of the address
stored in system_timer_msr, but then uses the old address anyway.
Signed-off-by: Jim Somerville <Jim.Somerville@windriver.com>
Message-Id: <59b67db0bd15a46ab47c3aa657c81a4c11f168ea.1506702472.git.Jim.Somerville@windriver.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Modify the pre_save method on VMStateDescription to return an int
rather than void so that it potentially can fail.
Changed zillions of devices to make them return 0; the only
case I've made it return non-0 is hw/intc/s390_flic_kvm.c that already
had an error_report/return case.
Note: If you add an error exit in your pre_save you must emit
an error_report to say why.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20170925112917.21340-2-dgilbert@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJZwXs9AAoJECgHk2+YTcWmS18P/1OceEmetatwBQ6YWURA0VfU
CB+nCHxijlWXU8UiwoTVDOXc11P00V5VO0mMFouUbv+o+/80qyAfNl2DhDVq7ZXo
nhIFHKXUR1BX3YbcqfIonH8xCMGtrAexghhhaGPQbx450PqmGyyjBsZsXttNge1S
Yt+jzgD9drsZPYw4ZLJjT/tnWI/+8kDPn37jVujzRzApTD9/fq+77ZZq7q25RzQH
ISa5OXOQk6pq+tPHvaIFXfOqfhILcpM7u/X7MYVwiA1oBqcLXTDCjDX3cKavKade
+i3yrH7ahNgUO1DyT2bMZX6NAFoZDBrwlYgpkw8n+Yf+EUcdPDOHHEcEeaMpdTGx
wgWbQrs+xzIg/ulRb8Qqe9FwdXGQbehfFGofd+gnGQ0XuxekT82in+ucMOivQO3x
W/azGnzoz6D83stJFIZ93S69SRswqBuj2R8mu821yzqx1EUSNXgfKXz9OPwwFed5
El9YN127F/VAyp0av1CsOg1XgqIujMUGRxGf7eQBfkh1R3C/g2XNPTvR3yaY9L5B
zuMJfWLF6r6zL53ymt7/9UVEim295Lia3mNGS5/Min5QGms7edphsDsrubzXsZGq
2owWfAU/KeDH9gNVNNkZdLcEcS5TEz+2oGPR5oeDeB/QlzVdNQ3FeTVlzFavxNQa
8nrzeFcw7VNrIx2gdvsY
=LQ8U
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/ehabkost/tags/machine-next-pull-request' into staging
Machine/CPU/NUMA queue, 2017-09-19
# gpg: Signature made Tue 19 Sep 2017 21:17:01 BST
# gpg: using RSA key 0x2807936F984DC5A6
# gpg: Good signature from "Eduardo Habkost <ehabkost@redhat.com>"
# Primary key fingerprint: 5A32 2FD5 ABC4 D3DB ACCF D1AA 2807 936F 984D C5A6
* remotes/ehabkost/tags/machine-next-pull-request:
MAINTAINERS: Update git URLs for my trees
hw/acpi-build: Fix SRAT memory building in case of node 0 without RAM
NUMA: Replace MAX_NODES with nb_numa_nodes in for loop
numa: cpu: calculate/set default node-ids after all -numa CLI options are parsed
arm: drop intermediate cpu_model -> cpu type parsing and use cpu type directly
pc: use generic cpu_model parsing
vl.c: convert cpu_model to cpu type and set of global properties before machine_init()
cpu: make cpu_generic_init() abort QEMU on error
qom: cpus: split cpu_generic_init() on feature parsing and cpu creation parts
hostmem-file: Add "discard-data" option
osdep: Define QEMU_MADV_REMOVE
vl: Clean up user-creatable objects when exiting
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Currently, Using the fisrt node without memory on the machine makes
QEMU unhappy. With this example command line:
... \
-m 1024M,slots=4,maxmem=32G \
-numa node,nodeid=0 \
-numa node,mem=1024M,nodeid=1 \
-numa node,nodeid=2 \
-numa node,nodeid=3 \
Guest reports "No NUMA configuration found" and the NUMA topology is
wrong.
This is because when QEMU builds ACPI SRAT, it regards node 0 as the
default node to deal with the memory hole(640K-1M). this means the
node0 must have some memory(>1M), but, actually it can have no
memory.
Fix this problem by cut out the 640K hole in the same way the PCI
4G hole does.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Message-Id: <1504231805-30957-2-git-send-email-douly.fnst@cn.fujitsu.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Calculating default node-ids for CPUs in possible_cpu_arch_ids()
is rather fragile since defaults calculation uses nb_numa_nodes but
callback might be potentially called early before all -numa CLI
options are parsed, which would lead to cpus assigned only upto
nb_numa_nodes at the time possible_cpu_arch_ids() is called.
Issue was introduced by
(7c88e65 numa: mirror cpu to node mapping in MachineState::possible_cpus)
and for example CLI:
-smp 4 -numa node,cpus=0 -numa node
would set props.node-id in possible_cpus array for every non
explicitly mapped CPU to the first node.
Issue is not visible to guest nor to mgmt interface due to
1) implictly mapped cpus are forced to the first node in
case of partial mapping
2) in case of default mapping possible_cpu_arch_ids() is
called after all -numa options are parsed (resulting
in correct mapping).
However it's fragile to rely on late execution of
possible_cpu_arch_ids(), therefore add machine specific
callback that returns node-id for CPU and use it to calculate/
set defaults at machine_numa_finish_init() time when all -numa
options are parsed.
Reported-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <1496314408-163972-1-git-send-email-imammedo@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Tidy up some of the warn_report() messages after having converted them
to use warn_report().
Signed-off-by: Alistair Francis <alistair.francis@xilinx.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <9cb1d23551898c9c9a5f84da6773e99871285120.1505158760.git.alistair.francis@xilinx.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert all the multi-line uses of fprintf(stderr, "warning:"..."\n"...
to use warn_report() instead. This helps standardise on a single
method of printing warnings to the user.
All of the warnings were changed using these commands:
find ./* -type f -exec sed -i \
'N; {s|fprintf(.*".*warning[,:] \(.*\)\\n"\(.*\));|warn_report("\1"\2);|Ig}' \
{} +
find ./* -type f -exec sed -i \
'N;N; {s|fprintf(.*".*warning[,:] \(.*\)\\n"\(.*\));|warn_report("\1"\2);|Ig}' \
{} +
find ./* -type f -exec sed -i \
'N;N;N; {s|fprintf(.*".*warning[,:] \(.*\)\\n"\(.*\));|warn_report("\1"\2);|Ig}' \
{} +
find ./* -type f -exec sed -i \
'N;N;N;N {s|fprintf(.*".*warning[,:] \(.*\)\\n"\(.*\));|warn_report("\1"\2);|Ig}' \
{} +
find ./* -type f -exec sed -i \
'N;N;N;N;N {s|fprintf(.*".*warning[,:] \(.*\)\\n"\(.*\));|warn_report("\1"\2);|Ig}' \
{} +
find ./* -type f -exec sed -i \
'N;N;N;N;N;N {s|fprintf(.*".*warning[,:] \(.*\)\\n"\(.*\));|warn_report("\1"\2);|Ig}' \
{} +
find ./* -type f -exec sed -i \
'N;N;N;N;N;N;N; {s|fprintf(.*".*warning[,:] \(.*\)\\n"\(.*\));|warn_report("\1"\2);|Ig}' \
{} +
Indentation fixed up manually afterwards.
Some of the lines were manually edited to reduce the line length to below
80 charecters. Some of the lines with newlines in the middle of the
string were also manually edit to avoid checkpatch errrors.
The #include lines were manually updated to allow the code to compile.
Several of the warning messages can be improved after this patch, to
keep this patch mechanical this has been moved into a later patch.
Signed-off-by: Alistair Francis <alistair.francis@xilinx.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Anthony Perard <anthony.perard@citrix.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Cc: Yongbok Kim <yongbok.kim@imgtec.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Jason Wang <jasowang@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <5def63849ca8f551630c6f2b45bcb1c482f765a6.1505158760.git.alistair.francis@xilinx.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert all the single line uses of fprintf(stderr, "warning:"..."\n"...
to use warn_report() instead. This helps standardise on a single
method of printing warnings to the user.
All of the warnings were changed using this command:
find ./* -type f -exec sed -i \
's|fprintf(.*".*warning[,:] \(.*\)\\n"\(.*\));|warn_report("\1"\2);|Ig' \
{} +
Some of the lines were manually edited to reduce the line length to below
80 charecters.
The #include lines were manually updated to allow the code to compile.
Signed-off-by: Alistair Francis <alistair.francis@xilinx.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Cc: Yongbok Kim <yongbok.kim@imgtec.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: James Hogan <james.hogan@imgtec.com> [mips]
Message-Id: <ae8f8a7f0a88ded61743dff2adade21f8122a9e7.1505158760.git.alistair.francis@xilinx.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While loading kernel via multiboot-v1 image, (flags & 0x00010000)
indicates that multiboot header contains valid addresses to load
the kernel image. These addresses are used to compute kernel
size and kernel text offset in the OS image. Validate these
address values to avoid an OOB access issue.
This is CVE-2017-14167.
Reported-by: Thomas Garnier <thgarnie@google.com>
Signed-off-by: Prasad J Pandit <pjp@fedoraproject.org>
Message-Id: <20170907063256.7418-1-ppandit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
define default CPU type in generic way in pc_machine_class_init()
and let common machine code to handle cpu_model parsing
Patch also introduces TARGET_DEFAULT_CPU_TYPE define for 2 purposes:
* make foo_machine_class_init() look uniform on every target
* use define in [bsd|linux]-user targets to pick default
cpu type
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <1505318697-77161-5-git-send-email-imammedo@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
In vtd_switch_address_space() we did the memory region switch, however
it's possible that the caller of it has not taken the BQL at all. Make
sure we have it.
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Jason Wang <jasowang@redhat.com>
CC: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
HW part of ACPI PCI hotplug in QEMU depends on ACPI_PCIHP_PROP_BSEL
being set on a PCI bus that supports ACPI hotplug. It should work
regardless of the source of ACPI tables (QEMU generator/legacy SeaBIOS/Xen).
So move ACPI_PCIHP_PROP_BSEL initialization into HW ACPI implementation
part from QEMU's ACPI table generator.
To do PCI passthrough with Xen, the property ACPI_PCIHP_PROP_BSEL needs
to be set, but this was done only when ACPI tables are built which is
not needed for a Xen guest. The need for the property starts with commit
"pc: pcihp: avoid adding ACPI_PCIHP_PROP_BSEL twice"
(f0c9d64a68).
Adding find_i440fx into stubs so that mips-softmmu target can be built.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
I found these pattern via grepping the source tree. I don't have a
coccinelle script for it!
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
The 'm->numa_auto_assign_ram = numa_legacy_auto_assign_ram;' line
was supposed to be in pc_i440fx_2_9_machine_options() (see commit
3bfe5716 "numa: equally distribute memory on nodes"), but the
merge commit adb354dd ("Merge remote-tracking branch
'mst/tags/for_upstream' into staging") moved it to the
pc_i440fx_2_10_machine_options().
Move the line back to pc_i440fx_2_9_machine_options().
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-id: 20170818190943.23858-1-ehabkost@redhat.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
QEMU currently crashes when trying to use a 'pc-dimm' on the pseries
machine without specifying its 'memdev' property. This happens because
pc_dimm_get_memory_region() does not check whether the 'memdev' property
has properly been set by the user. Looking closer at this function, it's
also obvious that it is using &error_abort to call another function - and
this is bad in a function that is used in the hot-plugging calling chain
since this can also cause QEMU to exit unexpectedly.
So let's fix these issues in a proper way now: Add a "Error **errp"
parameter to pc_dimm_get_memory_region() which we use in case the 'memdev'
property has not been set by the user, and which we can use instead of
the &error_abort, and change the callers of get_memory_region() to make
use of this "errp" parameter for proper error checking.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Unmask previously masked SHPC feature in _OSC method.
Signed-off-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
w2k used to boot on QEMU until revision of FADT has
been bumped to rev3
(commit 77af8a2b hw/i386: Use Rev3 FADT (ACPI 2.0) instead of Rev1 to improve guest OS support.)
Keep PC machine at rev1 to remain compatible and Q35
at rev3 where w2k isn't supported anyway so OSX could
run as well.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Tested-by: John Arbuckle <programmingkidx@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
looking at bios ROM mapping in QEMU it seems that only isapc
(i.e. not PCI enabled machine) requires ROM being mapped as
RW in other cases BIOS is mapped as RO. Do the same for option
ROM 'pc.rom' when machine has PCI enabled.
As useful side-effect pc.rom MemoryRegion stops being
put in vhost memory map (filtered out by vhost_section()),
which reduces number of entries by 1.
Coincidentally it fixes migration failure reported in
"[PATCH V2] vhost: fix a migration failed because of vhost region merge"
where following destination CLI with /sys/module/vhost/parameters/max_mem_regions = 8
export DIMMSCOUNT=6
QEMU -enable-kvm \
-netdev type=tap,id=guest0,vhost=on,script=no,vhostforce \
-device virtio-net-pci,netdev=guest0 \
-m 256,slots=256,maxmem=2G \
`i=0; while [ $i -lt $DIMMSCOUNT ]; do echo \
"-object memory-backend-ram,id=m$i,size=128M \
-device pc-dimm,id=d$i,memdev=m$i"; i=$(($i + 1)); \
done`
will fail to startup with error:
"-device pc-dimm,id=d5,memdev=m5: a used vhost backend has no free memory slots left"
while it's possible to add the 6th DIMM during hotplug
on source.
Issue is caused by the fact that number of entries in vhost map
is bigger on 1 entry, when -device is processed, than
after guest boots up, and that offending entry belongs to
'pc.rom', it's not like vhost intends to do IO in ROM range
so making it RO hides region from vhost and makes number
of entries in vhost memory map at -device/machine_done time
match number of entries after guest boots.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Reported-by: Peng Hao <peng.hao2@zte.com.cn>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
It was cached by read/write separately. Let's merge them.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
IOMMUTLBEntry.iova is returned incorrectly on one PT path (though mostly
we cannot really trigger this path, even if we do, we are mostly
disgarding this value, so it didn't break anything). Fix it by
converting the VTD_PAGE_MASK into the correct definition
VTD_PAGE_MASK_4K, then remove VTD_PAGE_MASK.
Fixes: b93130 ("intel_iommu: cleanup vtd_{do_}iommu_translate()")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The only exception are groups of numers separated by symbols
'.', ' ', ':', '/', like 'ab.09.7d'.
This patch is made by the following:
> find . -name trace-events | xargs python script.py
where script.py is the following python script:
=========================
#!/usr/bin/env python
import sys
import re
import fileinput
rhex = '%[-+ *.0-9]*(?:[hljztL]|ll|hh)?(?:x|X|"\s*PRI[xX][^"]*"?)'
rgroup = re.compile('((?:' + rhex + '[.:/ ])+' + rhex + ')')
rbad = re.compile('(?<!0x)' + rhex)
files = sys.argv[1:]
for fname in files:
for line in fileinput.input(fname, inplace=True):
arr = re.split(rgroup, line)
for i in range(0, len(arr), 2):
arr[i] = re.sub(rbad, '0x\g<0>', arr[i])
sys.stdout.write(''.join(arr))
=========================
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Message-id: 20170731160135.12101-5-vsementsov@virtuozzo.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In trace format '#' flag of printf is forbidden. Fix it to '0x%'.
This patch is created by the following:
check that we have a problem
> find . -name trace-events | xargs grep '%#' | wc -l
56
check that there are no cases with additional printf flags before '#'
> find . -name trace-events | xargs grep "%[-+ 0'I]+#" | wc -l
0
check that there are no wrong usage of '#' and '0x' together
> find . -name trace-events | xargs grep '0x%#' | wc -l
0
fix the problem
> find . -name trace-events | xargs sed -i 's/%#/0x%/g'
[Eric Blake noted that xargs grep '%[-+ 0'I]+#' should be xargs grep
"%[-+ 0'I]+#" instead so the shell quoting is correct.
--Stefan]
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20170731160135.12101-3-vsementsov@virtuozzo.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
With the move of some docs/ to docs/devel/ on ac06724a71,
no references were updated.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Under certain circumstances normal xen-mapcache functioning may be broken
by guest's actions. This may lead to either QEMU performing exit() due to
a caught bad pointer (and with QEMU process gone the guest domain simply
appears hung afterwards) or actual use of the incorrect pointer inside
QEMU address space -- a write to unmapped memory is possible. The bug is
hard to reproduce on a i440 machine as multiple DMA sources are required
(though it's possible in theory, using multiple emulated devices), but can
be reproduced somewhat easily on a Q35 machine using an emulated AHCI
controller -- each NCQ queue command slot may be used as an independent
DMA source ex. using READ FPDMA QUEUED command, so a single storage
device on the AHCI controller port will be enough to produce multiple DMAs
(up to 32). The detailed description of the issue follows.
Xen-mapcache provides an ability to map parts of a guest memory into
QEMU's own address space to work with.
There are two types of cache lookups:
- translating a guest physical address into a pointer in QEMU's address
space, mapping a part of guest domain memory if necessary (while trying
to reduce a number of such (re)mappings to a minimum)
- translating a QEMU's pointer back to its physical address in guest RAM
These lookups are managed via two linked-lists of structures.
MapCacheEntry is used for forward cache lookups, while MapCacheRev -- for
reverse lookups.
Every guest physical address is broken down into 2 parts:
address_index = phys_addr >> MCACHE_BUCKET_SHIFT;
address_offset = phys_addr & (MCACHE_BUCKET_SIZE - 1);
MCACHE_BUCKET_SHIFT depends on a system (32/64) and is equal to 20 for
a 64-bit system (which assumed for the further description). Basically,
this means that we deal with 1 MB chunks and offsets within those 1 MB
chunks. All mappings are created with 1MB-granularity, i.e. 1MB/2MB/3MB
etc. Most DMA transfers typically are less than 1MB, however, if the
transfer crosses any 1MB border(s) - than a nearest larger mapping size
will be used, so ex. a 512-byte DMA transfer with the start address
700FFF80h will actually require a 2MB range.
Current implementation assumes that MapCacheEntries are unique for a given
address_index and size pair and that a single MapCacheEntry may be reused
by multiple requests -- in this case the 'lock' field will be larger than
1. On other hand, each requested guest physical address (with 'lock' flag)
is described by each own MapCacheRev. So there may be multiple MapCacheRev
entries corresponding to a single MapCacheEntry. The xen-mapcache code
uses MapCacheRev entries to retrieve the address_index & size pair which
in turn used to find a related MapCacheEntry. The 'lock' field within
a MapCacheEntry structure is actually a reference counter which shows
a number of corresponding MapCacheRev entries.
The bug lies in ability for the guest to indirectly manipulate with the
xen-mapcache MapCacheEntries list via a special sequence of DMA
operations, typically for storage devices. In order to trigger the bug,
guest needs to issue DMA operations in specific order and timing.
Although xen-mapcache is protected by the mutex lock -- this doesn't help
in this case, as the bug is not due to a race condition.
Suppose we have 3 DMA transfers, namely A, B and C, where
- transfer A crosses 1MB border and thus uses a 2MB mapping
- transfers B and C are normal transfers within 1MB range
- and all 3 transfers belong to the same address_index
In this case, if all these transfers are to be executed one-by-one
(without overlaps), no special treatment necessary -- each transfer's
mapping lock will be set and then cleared on unmap before starting
the next transfer.
The situation changes when DMA transfers overlap in time, ex. like this:
|===== transfer A (2MB) =====|
|===== transfer B (1MB) =====|
|===== transfer C (1MB) =====|
time --->
In this situation the following sequence of actions happens:
1. transfer A creates a mapping to 2MB area (lock=1)
2. transfer B (1MB) tries to find available mapping but cannot find one
because transfer A is still in progress, and it has 2MB size + non-zero
lock. So transfer B creates another mapping -- same address_index,
but 1MB size.
3. transfer A completes, making 1st mapping entry available by setting its
lock to 0
4. transfer C starts and tries to find available mapping entry and sees
that 1st entry has lock=0, so it uses this entry but remaps the mapping
to a 1MB size
5. transfer B completes and by this time
- there are two locked entries in the MapCacheEntry list with the SAME
values for both address_index and size
- the entry for transfer B actually resides farther in list while
transfer C's entry is first
6. xen_ram_addr_from_mapcache() for transfer B gets correct address_index
and size pair from corresponding MapCacheRev entry, but then it starts
looking for MapCacheEntry with these values and finds the first entry
-- which belongs to transfer C.
At this point there may be following possible (bad) consequences:
1. xen_ram_addr_from_mapcache() will use a wrong entry->vaddr_base value
in this statement:
raddr = (reventry->paddr_index << MCACHE_BUCKET_SHIFT) +
((unsigned long) ptr - (unsigned long) entry->vaddr_base);
resulting in an incorrent raddr value returned from the function. The
(ptr - entry->vaddr_base) expression may produce both positive and negative
numbers and its actual value may differ greatly as there are many
map/unmap operations take place. If the value will be beyond guest RAM
limits then a "Bad RAM offset" error will be triggered and logged,
followed by exit() in QEMU.
2. If raddr value won't exceed guest RAM boundaries, the same sequence
of actions will be performed for xen_invalidate_map_cache_entry() on DMA
unmap, resulting in a wrong MapCacheEntry being unmapped while DMA
operation which uses it is still active. The above example must
be extended by one more DMA transfer in order to allow unmapping as the
first mapping in the list is sort of resident.
The patch modifies the behavior in which MapCacheEntry's are added to the
list, avoiding duplicates.
Signed-off-by: Alexey Gerasimenko <x1917x@gmail.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
If we have a system with xenforeignmemory_map2() implemented
we don't need to save/restore physmap on suspend/restore
anymore. In case we resume a VM without physmap - try to
recreate the physmap during memory region restore phase and
remap map cache entries accordingly. The old code is left
for compatibility reasons.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
This new call is trying to update a requested map cache entry
according to the changes in the physmap. The call is searching
for the entry, unmaps it and maps again at the same place using
a new guest address. If the mapping is dummy this call will
make it real.
This function makes use of a new xenforeignmemory_map2() call
with an extended interface that was recently introduced in
libxenforeignmemory [1].
[1] https://www.mail-archive.com/xen-devel@lists.xen.org/msg113007.html
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Dummys are simple anonymous mappings that are placed instead
of regular foreign mappings in certain situations when we need
to postpone the actual mapping but still have to give a
memory region to QEMU to play with.
This is planned to be used for restore on Xen.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Commit 090fa1c8 "add support for unplugging NVMe disks..." extended the
existing disk unplug flag to cover NVMe disks as well as IDE and SCSI.
The recent thread on the xen-devel mailing list [1] has highlighted that
this is not desirable behaviour: PV frontends should be able to distinguish
NVMe disks from other types of disk and should have separate control over
whether they are unplugged.
This patch defines a new bit in the unplug mask for this purpose (see Xen
commit [2]) and also tidies up the definitions of, and improves the
comments regarding, the previously exiting bits in the protocol.
[1] https://lists.xen.org/archives/html/xen-devel/2017-03/msg02924.html
[2] http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=1096aa02
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Instead of reaching into the PCI state, allow the AHCIDevice to
respond with how many ports it has.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Tested-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-id: 20170623220926.11479-2-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
Use the new functions memory_region_init_{ram,rom,rom_device}()
instead of manually calling the _nomigrate() version and then
vmstate_register_ram_global().
Patch automatically created using coccinelle script:
spatch --in-place -sp_file scripts/coccinelle/memory-region-init-ram.cocci -dir hw
(As it turns out, there are no instances of the rom and
rom_device functions that are caught by this script.)
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1499438577-7674-8-git-send-email-peter.maydell@linaro.org
Rename memory_region_init_ram() to memory_region_init_ram_nomigrate().
This leaves the way clear for us to provide a memory_region_init_ram()
which does handle migration.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1499438577-7674-4-git-send-email-peter.maydell@linaro.org
This finishes QOM'fication of IOMMUMemoryRegion by introducing
a IOMMUMemoryRegionClass. This also provides a fastpath analog for
IOMMU_MEMORY_REGION_GET_CLASS().
This makes IOMMUMemoryRegion an abstract class.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20170711035620.4232-3-aik@ozlabs.ru>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>