LSI mapping in spapr currently open-codes standard PCI swizzling. It thus
duplicates the code of pci_swizzle_map_irq_fn().
Expose the swizzling formula so that it can be used with a slot number
when building the device tree. Simply drop pci_spapr_map_irq() and call
pci_swizzle_map_irq_fn() instead.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <155448184841.8446.13959787238854054119.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Removing RTAS handlers will become necessary when the new pseries
machine supporting multiple interrupt mode is introduced.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190321144914.19934-9-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
implements special regions for such GPUs and emulates an NVLink bridge.
NVLink2-enabled POWER9 CPUs also provide address translation services
which includes an ATS shootdown (ATSD) register exported via the NVLink
bridge device.
This adds a quirk to VFIO to map the GPU memory and create an MR;
the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
this to get the MR and map it to the system address space.
Another quirk does the same for ATSD.
This adds additional steps to sPAPR PHB setup:
1. Search for specific GPUs and NPUs, collect findings in
sPAPRPHBState::nvgpus, manage system address space mappings;
2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
"memory-block", "link-speed" to advertise the NVLink2 function to
the guest;
3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
4. Add new memory blocks (with extra "linux,memory-usable" to prevent
the guest OS from accessing the new memory until it is onlined) and
npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
uses it for link discovery.
This allocates space for GPU RAM and ATSD like we do for MMIOs by
adding 2 new parameters to the phb_placement() hook. Older machine types
set these to zero.
This puts new memory nodes in a separate NUMA node to as the GPU RAM
needs to be configured equally distant from any other node in the system.
Unlike the host setup which assigns numa ids from 255 downwards, this
adds new NUMA nodes after the user configures nodes or from 1 if none
were configured.
This adds requirement similar to EEH - one IOMMU group per vPHB.
The reason for this is that ATSD registers belong to a physical NPU
so they cannot invalidate translations on GPUs attached to another NPU.
It is guaranteed by the host platform as it does not mix NVLink bridges
or GPUs from different NPU in the same IOMMU group. If more than one
IOMMU group is detected on a vPHB, this disables ATSD support for that
vPHB and prints a warning.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for vfio portions]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20190312082103.130561-1-aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
besides the existing 'shared' flags, we are going to add
'is_pmem' to qemu_ram_mmap(), which indicated the memory backend
file is a persist memory.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Message-Id: <786c46862cfeb253ee0ea2f44d62ffe76edb7fa4.1549555521.git.yi.z.zhang@linux.intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
The "model[,option...]" string parsed by the function is not just
a CPU model. Rename the function and its argument to indicate it
expects the full "-cpu" option to be provided.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20190417025944.16154-2-ehabkost@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Function find_default_machine() is introduced by commit 2c8cffa599
"vl: make find_default_machine externally visible", and it was used
outside of vl.c until commit a904410af5 "pc_sysfw: remove the rom_only
property".
Commit a904410af5 "pc_sysfw: remove the rom_only property" removed the
only user of find_default_machine() outside vl.c, but neglected to make
it static. Do that now.
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Message-Id: <20190405064121.23662-2-richardw.yang@linux.intel.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJcsHOvAAoJEC7Z13T+cC21IgoP/37yKOLOuagT4an9L573qWPp
xCS48CJ4rNkpWXP3SHuTe+UHSp20sk+5b6rgM/VkLT2d301WS4gxF/VVu85ZFxGX
tkqDwnljb87MqwTbH5Yj/U4mmq8tZkNg4CqmWuvJhWv4aFe6T/YtAzUkl7y7YCcT
QfKqErl363JJKkL7cz+QWopFn5Gl/hy+mvEhbEtexWLIV1UNZ1i2hPWURfkh8FWH
C1047CAfnC/rEJy0GcwkH4iCem8n4LWkMuf3Zehq+Yx+f2e8FfxMkoOtLJVCoKWj
qoMidAplGqUxLoamZsbU1wEzwH6YH28X+uNUULsgDIIBuyW35ymbGsGTmKmNm6ED
zVM1K7badvLeO3PXBxUkviZk7UFjxjXz3xCQMheY47wPoslfX+EN0xUvQ2gW2MvO
Dhb9oPWsr/PFrMMJ35D2OOFH5kJC8Sj30YiP5lsnRoUBi4ecHCIUSlw6esKuYI+H
JfDAYzxe6QqoGg5cxSNUXP+vAgU/FQq+nGNGHzHnsIR4Udt/JsAtNwxQ9DCbYR0C
LA5qxZZTqkDtLPAHynqOzd8m7AoNaBSAgP2qp7Yp8ItXMyemlW2OIYR7yRxuL5bH
zWO/deGYHp3+j9/Z0quzSUL5G85m0o1xgRJcJe9T2fYjWgsy271WFqaZg1JEvpI6
zHcXEw71B7WQuAFaFO1+
=tJvv
-----END PGP SIGNATURE-----
Merge tag 's390-ccw-bios-2019-04-12' into s390-next-staging
Support for booting from a vfio-ccw passthrough dasd device
# gpg: Signature made Fri 12 Apr 2019 01:17:03 PM CEST
# gpg: using RSA key 2ED9D774FE702DB5
# gpg: Good signature from "Thomas Huth <th.huth@gmx.de>" [full]
# gpg: aka "Thomas Huth <thuth@redhat.com>" [undefined]
# gpg: aka "Thomas Huth <huth@tuxfamily.org>" [undefined]
# gpg: aka "Thomas Huth <th.huth@posteo.de>" [unknown]
* tag 's390-ccw-bios-2019-04-12':
pc-bios/s390: Update firmware images
s390-bios: Use control unit type to find bootable devices
s390-bios: Support booting from real dasd device
s390-bios: Add channel command codes/structs needed for dasd-ipl
s390-bios: Use control unit type to determine boot method
s390-bios: Refactor virtio to run channel programs via cio
s390-bios: Factor finding boot device out of virtio code path
s390-bios: Extend find_dev() for non-virtio devices
s390-bios: cio error handling
s390-bios: Support for running format-0/1 channel programs
s390-bios: ptr2u32 and u32toptr
s390-bios: Map low core memory
s390-bios: Decouple channel i/o logic from virtio
s390-bios: Clean up cio.h
s390-bios: decouple common boot logic from virtio
s390-bios: decouple cio setup from virtio
s390 vfio-ccw: Add bootindex property and IPLB data
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Rename qemu_getrampagesize() to qemu_minrampagesize(). While at it,
properly rename find_max_supported_pagesize() to
find_min_backend_pagesize().
s390x is actually interested into the maximum ram pagesize, so
introduce and use qemu_maxrampagesize().
Add a TODO, indicating that looking at any mapped memory backends is not
100% correct in some cases.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20190417113143.5551-3-david@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
In order to handle TB's that translate to too much code, we
need to place the control of the length of the translation
in the hands of the code gen master loop.
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
The previous commits have eliminated fprintf_function outside
disassemblers, simplifying code and cleaning up the ugly type-punning
fprintf_function seems to attract. Move fprintf_function to
include/disas/dis-asm.h to reduce the temptation to abuse it.
I considered renaming it to fprintf_ftype (reverting that part of
commit 6e2d864edf, v0.14.0) to get us closer to binutils, but I
figure the fork is too distant to make this worthwhile.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-18-armbru@redhat.com>
Commit dc99065b5f (v0.1.0) added dis-asm.h from binutils.
Commit 43d4145a98 (v0.1.5) inlined bfd.h into dis-asm.h to remove the
dependency on binutils.
Commit 76cad71136 (v1.4.0) moved dis-asm.h to include/disas/bfd.h.
The new name is confusing when you try to match against (pre GPLv3+)
binutils. Rename it back. Keep it in the same directory, of course.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190417191805.28198-17-armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
INIT_DISASSEMBLE_INFO() takes an fprintf()-like callback and a FILE *
to pass to it. monitor_disas() passes monitor_fprintf() and the
current monitor cast to FILE *. monitor_fprintf() casts it right
back, and is otherwise identical to monitor_printf(). The
type-punning is ugly.
Pass qemu_fprintf() and NULL instead.
monitor_fprintf() is now unused; delete it.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-16-armbru@redhat.com>
[Commit message typo corrected]
CPUClass method dump_statistics() takes an fprintf()-like callback and
a FILE * to pass to it. Most callers pass fprintf() and stderr.
log_cpu_state() passes fprintf() and qemu_log_file.
hmp_info_registers() passes monitor_fprintf() and the current monitor
cast to FILE *. monitor_fprintf() casts it right back, and is
otherwise identical to monitor_printf().
The callback gets passed around a lot, which is tiresome. The
type-punning around monitor_fprintf() is ugly.
Drop the callback, and call qemu_fprintf() instead. Also gets rid of
the type-punning, since qemu_fprintf() takes NULL instead of the
current monitor cast to FILE *.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-15-armbru@redhat.com>
Code that doesn't want to know about current monitor vs. stdout
vs. stderr takes an fprintf_function callback and a FILE * argument to
pass to it. Actual arguments are either fprintf() and stdout or
stderr, or monitor_fprintf() and the current monitor cast to FILE *.
monitor_fprintf() casts it right back, and is otherwise identical to
monitor_printf(). The type-punning is ugly.
New qemu_fprintf() and qemu_vprintf() address this need without type
punning: they are like fprintf() and vfprintf(), except they print to
the current monitor when passed a null FILE *. The next commits will
put them to use.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-14-armbru@redhat.com>
CPUClass method dump_statistics() takes an fprintf()-like callback and
a FILE * to pass to it.
Its only caller hmp_info_cpustats() (via cpu_dump_statistics()) passes
monitor_fprintf() and the current monitor cast to FILE *.
monitor_fprintf() casts it right back, and is otherwise identical to
monitor_printf(). The type-punning is ugly.
Drop the callback, and call qemu_printf() instead.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-13-armbru@redhat.com>
The various TARGET_cpu_list() take an fprintf()-like callback and a
FILE * to pass to it. Their callers (vl.c's main() via list_cpus(),
bsd-user/main.c's main(), linux-user/main.c's main()) all pass
fprintf() and stdout. Thus, the flexibility provided by the (rather
tiresome) indirection isn't actually used.
Drop the callback, and call qemu_printf() instead.
Calling printf() would also work, but would make the code unsuitable
for monitor context without making it simpler.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190417191805.28198-10-armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
mtree_info() takes an fprintf()-like callback and a FILE * to pass to
it, and so do its helper functions. Passing around callback and
argument is rather tiresome.
Its only caller hmp_info_mtree() passes monitor_printf() cast to
fprintf_function and the current monitor cast to FILE *.
The type-punning is technically undefined behaviour, but works in
practice. Clean up: drop the callback, and call qemu_printf()
instead.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-9-armbru@redhat.com>
bdrv_snapshot_dump(), bdrv_image_info_specific_dump(),
bdrv_image_info_dump() and their helpers take an fprintf()-like
callback and a FILE * to pass to it.
hmp.c passes monitor_printf() cast to fprintf_function and the current
monitor cast to FILE *.
qemu-img.c and qemu-io-cmds.c pass fprintf and stdout.
The type-punning is technically undefined behaviour, but works in
practice. Clean up: drop the callback, and call qemu_printf()
instead.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-8-armbru@redhat.com>
qsp_report() takes an fprintf()-like callback and a FILE * to pass to
it.
Its only caller hmp_sync_profile() passes monitor_fprintf() and the
current monitor cast to FILE *. monitor_fprintf() casts it right
back, and is otherwise identical to monitor_printf(). The
type-punning is ugly.
Drop the callback, and call qemu_printf() instead.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-7-armbru@redhat.com>
dump_drift_info() takes an fprintf()-like callback and a FILE * to pass
to it.
Its only caller hmp_info_jit() passes monitor_fprintf() and a Monitor
* cast to FILE *. monitor_fprintf() casts it right back, and is
otherwise identical to monitor_printf(). The type-punning is ugly.
Drop the callback, and call qemu_printf() instead.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-6-armbru@redhat.com>
dump_exec_info() takes an fprintf()-like callback and a FILE * to pass
to it.
Its only caller hmp_info_jit() passes monitor_fprintf() and the
current monitor cast to FILE *. monitor_fprintf() casts it right
back, and is otherwise identical to monitor_printf(). The
type-punning is ugly.
Drop the callback, and call qemu_printf() instead.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-5-armbru@redhat.com>
dump_opcount_info() takes an fprintf()-like callback and a FILE * to
pass to it.
Its only caller hmp_info_opcount() passes monitor_fprintf() and the
current monitor cast to FILE *. monitor_fprintf() casts it right
back, and is otherwise identical to monitor_printf(). The
type-punning is ugly.
Drop the callback, and call qemu_printf() instead.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-4-armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-2-armbru@redhat.com>
Commit a95db58f21 added monitor_vfprintf() as an error_printf()
generalized from stderr to arbitrary streams, then used it wrapped in
helper out_printf() to print -device/device_add help to stdout. Use
qemu_printf() instead, and delete monitor_vfprintf() and out_printf().
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417190641.26814-16-armbru@redhat.com>
We commonly want to print to the current monitor if we have one, else
to stdout/stderr. For stderr, have error_printf(). For stdout, all
we have is monitor_vfprintf(), which is rather unwieldy. We often
print to stderr just because error_printf() is easier.
New qemu_printf() and qemu_vprintf() do exactly what's needed. The
next commits will put them to use.
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417190641.26814-12-armbru@redhat.com>
printf() & friends return the number of characters written on success,
negative value on error.
monitor_printf(), monitor_vfprintf(), monitor_vprintf(),
error_printf(), error_printf_unless_qmp(), error_vprintf(), and
error_vprintf_unless_qmp() return void. Some of them carry a TODO
comment asking for int instead.
Improve them to return int like printf() does.
This makes our use of monitor_printf() as fprintf_function slightly
less dirty: the function cast no longer adds a return value that isn't
there. It still changes a parameter's pointer type. That will be
addressed in a future commit.
monitor_vfprintf() always returns zero. Improve it to return the
proper value.
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417190641.26814-11-armbru@redhat.com>
This commit adds a error_init() helper which calls
g_log_set_default_handler() so that glib logs (g_log, g_warning, ...)
are handled similarly to other QEMU logs. This means they will get a
timestamp if timestamps are enabled, and they will go through the
HMP monitor if one is configured.
This commit also adds a call to error_init() to the binaries
installed by QEMU. Since error_init() also calls error_set_progname(),
this means that *-linux-user, *-bsd-user and qemu-pr-helper messages
output with error_report, info_report, ... will slightly change: they
will be prefixed by the binary name.
glib debug messages are enabled through G_MESSAGES_DEBUG similarly to
the glib default log handler.
At the moment, this change will mostly impact SPICE logging if your
spice version is >= 0.14.1. With older spice versions, this is not going
to work as expected, but will not have any ill effect, so this call is
not conditional on the SPICE version.
Signed-off-by: Christophe Fergeau <cfergeau@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20190131164614.19209-3-cfergeau@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Add bootindex property and iplb data for vfio-ccw devices. This allows us to
forward boot information into the bios for vfio-ccw devices.
Refactor s390_get_ccw_device() to return device type. This prevents us from
having to use messy casting logic in several places.
Signed-off-by: Jason J. Herne <jjherne@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <1554388475-18329-2-git-send-email-jjherne@linux.ibm.com>
[thuth: fixed "typedef struct VFIOCCWDevice" build failure with clang]
Signed-off-by: Thomas Huth <thuth@redhat.com>
In the accessor functions ld*_he_p() and st*_he_p() we use memcpy()
to perform a load or store to a pointer which might not be aligned
for the size of the type. We rely on the compiler to optimize this
memcpy() into an efficient load or store instruction where possible.
This is required for good performance, but at the moment it is also
required for correct operation, because some users of these functions
require that the access is atomic if the pointer is aligned, which
will only be the case if the compiler has optimized out the memcpy().
(The particular example where we discovered this is the virtio
vring_avail_idx() which calls virtio_lduw_phys_cached() which
eventually ends up calling lduw_he_p().)
Unfortunately some compile environments, such as the fortify-source
setup used in Alpine Linux, define memcpy() to a wrapper function
in a way that inhibits this compiler optimization.
The correct long-term fix here is to add a set of functions for
doing atomic accesses into AddressSpaces (and to other relevant
families of accessor functions like the virtio_*_phys_cached()
ones), and make sure that callsites which want atomic behaviour
use the correct functions.
In the meantime, switch to using __builtin_memcpy() in the
bswap.h accessor functions. This will make us robust against things
like this fortify library in the short term. In the longer term
it will mean that we don't end up with these functions being really
badly-performing even if the semantics of the out-of-line memcpy()
are correct.
Reported-by: Fernando Casas Schössow <casasfernando@outlook.com>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20190318112938.8298-1-peter.maydell@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some PHB implementations, eg. PAPR used on pseries machine, act like
a regular PCI bus rather than a PCIe bus, but allow access to the
PCIe extended config space anyway.
Introduce a new PCI bus class method to modelize this behaviour and
use it when adjusting the config space size limit during accesses.
No behaviour change for existing PCI bus types.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <155414130271.574858.4253514266378127489.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This patch fixes four different things, to maintain bisectability they
have been merged into a single patch. The following fixes are below:
sifive_plic: Fix incorrect irq calculation
The irq is incorrectly calculated to be off by one. It has worked in the
past as the priority_base offset has also been set incorrectly. We are
about to fix the priority_base offset so first first the irq
calculation.
sifive_u: Fix PLIC priority base offset and numbering
According to the FU540 manual the PLIC source priority address starts at
an offset of 0x04 and not 0x00. The same manual also specifies that the
PLIC only has 53 source priorities. Fix these two incorrect header
files.
We also need to over extend the plic_gpios[] array as the PLIC sources
count from 1 and not 0.
riscv: sifive_e: Fix PLIC priority base offset
According to the FE31 manual the PLIC source priority address starts at
an offset of 0x04 and not 0x00.
riscv: virt: Fix PLIC priority base offset
Update the virt offsets based on the newly updated SiFive U and SiFive E
offsets.
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
VTD_RTADDR_RTT is dropped even by the VT-d spec, so QEMU should
probably do the same thing (after all we never really implemented it).
Since we've had a field for that in the migration stream, to keep
compatibility we need to fill the hole up.
Please refer to VT-d spec 10.4.6.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20190329061422.7926-3-peterx@redhat.com>
Reviewed-by: Liu, Yi L <yi.l.liu@intel.com>
Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Watch IDs are allocated from incrementing a int counter against
the QFileMonitor object. In very long life QEMU processes with
a huge amount of USB MTP activity creating & deleting directories
it is just about conceivable that the int counter can wrap
around. This would result in incorrect behaviour of the file
monitor watch APIs due to clashing watch IDs.
Instead of trying to detect this situation, this patch changes
the way watch IDs are allocated. It is turned into an int64_t
variable where the high 32 bits are set from the underlying
inotify "int" ID. This gives an ID that is guaranteed unique
for the directory as a whole, and we can rely on the kernel
to enforce this. QFileMonitor then sets the low 32 bits from
a per-directory counter.
The USB MTP device only sets watches on the directory as a
whole, not files within, so there is no risk of guest
triggered wrap around on the low 32 bits.
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
This reverts commit 3df663e575.
This reverts commit b605c47b57.
Command line option --only-migratable is for disallowing any
configuration that can block migration.
Initially, --only-migratable set global variable @only_migratable.
Commit 3df663e575 "migration: move only_migratable to MigrationState"
replaced it by MigrationState member @only_migratable. That was a
mistake.
First, it doesn't make sense on the design level. MigrationState
captures the state of an individual migration, but --only-migratable
isn't a property of an individual migration, it's a restriction on
QEMU configuration. With fault tolerance, we could have several
migrations at once. --only-migratable would certainly protect all of
them. Storing it in MigrationState feels inappropriate.
Second, it contributes to a dependency cycle that manifests itself as
a bug now.
Putting @only_migratable into MigrationState means its available only
after migration_object_init().
We can't set it before migration_object_init(), so we delay setting it
with a global property (this is fixup commit b605c47b57 "migration:
fix handling for --only-migratable").
We can't get it before migration_object_init(), so anything that uses
it can only run afterwards.
Since migrate_add_blocker() needs to obey --only-migratable, any code
adding migration blockers can run only afterwards. This contributes
to the following dependency cycle:
* configure_blockdev() must run before machine_set_property()
so machine properties can refer to block backends
* machine_set_property() before configure_accelerator()
so machine properties like kvm-irqchip get applied
* configure_accelerator() before migration_object_init()
so that Xen's accelerator compat properties get applied.
* migration_object_init() before configure_blockdev()
so configure_blockdev() can add migration blockers
The cycle was closed when recent commit cda4aa9a5a "Create block
backends before setting machine properties" added the first
dependency, and satisfied it by violating the last one. Broke block
backends that add migration blockers.
Moving @only_migratable into MigrationState was a mistake. Revert it.
This doesn't quite break the "migration_object_init() before
configure_blockdev() dependency, since migrate_add_blocker() still has
another dependency on migration_object_init(). To be addressed the
next commit.
Note that the reverted commit made -only-migratable sugar for -global
migration.only-migratable=on below the hood. Documentation has only
ever mentioned -only-migratable. This commit removes the arcane &
undocumented alternative to -only-migratable again. Nobody should be
using it.
Conflicts:
include/migration/misc.h
migration/migration.c
migration/migration.h
vl.c
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190401090827.20793-3-armbru@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
The next patch needs access to a device's minimum permitted
alignment, since NBD wants to advertise this to clients. Add
an accessor function, borrowing from blk_get_max_transfer()
for accessing a backend's block limits.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190329042750.14704-6-eblake@redhat.com>
27461d69a0 "ppc: add host-serial and host-model machine attributes
(CVE-2019-8934)" introduced 'host-serial' and 'host-model' machine
properties for spapr to explicitly control the values advertised to the
guest in device tree properties with the same names.
The previous behaviour on KVM was to unconditionally populate the device
tree with the real host serial number and model, which leaks possibly
sensitive information about the host to the guest.
To maintain compatibility for old machine types, we allowed those props
to be set to "passthrough" to take the value from the host as before. Or
they could be set to "none" to explicitly omit the device tree items.
Special casing specific values on what's otherwise a user supplied string
is very ugly. So, this patch simplifies things by implementing the
backwards compatibility in a different way: we have a machine class flag
set for the older machines, and we only load the host values into the
device tree if A) they're not set by the user and B) we have that flag set.
This does mean that the "passthrough" functionality is no longer available
with the current machine type. That's ok though: if a user or management
layer really wants the information passed through they can read it
themselves (OpenStack Nova already does something similar for x86).
It also means the user can't explicitly ask for the values to be omitted
on the old machine types. I think that's an acceptable trade-off: if you
care enough about not leaking the host information you can either move to
the new machine type, or use a dummy value for the properties.
For the new machine type, this also removes an odd inconsistency
between running on a POWER and non-POWER (or non-Linux) hosts: if the
host information couldn't be read from where we expect (in the host's
device tree as exposed by Linux), we'd fallback to omitting the guest
device tree items.
While we're there, improve some poorly worded comments, and the help text
for the properties.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Tested-by: Greg Kurz <groug@kaod.org>
We know that the kernel implements a slow fallback code path for
BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
The other operations we call in the context of .bdrv_co_pwrite_zeroes
should usually be quick, so no modification should be needed for them.
If we ever notice that there are additional problematic cases, we can
still make these conditional as well.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
For qemu-img convert, we want an operation that zeroes out the whole
image if this can be done efficiently, but that returns an error
otherwise so we don't write explicit zeroes and immediately overwrite
them with the real data, potentially doubling the amount of data to be
written.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
We reject undersized backends with a rather enigmatic "failed to read
the initial flash content" error. For instance:
$ qemu-system-ppc64 -S -display none -M sam460ex -drive if=pflash,format=raw,file=eins.img
qemu-system-ppc64: Initialization of device cfi.pflash02 failed: failed to read the initial flash content
We happily accept oversized images, ignoring their tail. Throwing
away parts of firmware that way is pretty much certain to end in an
even more enigmatic failure to boot.
Require the backend's size to match the device's size exactly. Report
mismatch like this:
qemu-system-ppc64: Initialization of device cfi.pflash01 failed: device requires 1048576 bytes, block backend provides 512 bytes
Improve the error for actual read failures to "can't read block
backend".
To avoid duplicating even more code between the two pflash device
models, do all that in new helper blk_check_size_and_read_all().
The error reporting can still be confusing. For instance:
qemu-system-ppc64 -S -display none -M taihu -drive if=pflash,format=raw,file=eins.img -drive if=pflash,unit=1,format=raw,file=zwei.img
qemu-system-ppc64: Initialization of device cfi.pflash02 failed: device requires 2097152 bytes, block backend provides 512 bytes
Leaves the user guessing which of the two -drive is wrong. Mention
the issue in a TODO comment.
Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190319163551.32499-2-armbru@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
TYPE_QAUTHZ is an abstract object of type TYPE_OBJECT. All other
are children of TYPE_QAUTHZ, thus also objects.
Keep INTERFACE_CHECK() for interfaces, and use OBJECT_CHECK() on
objects.
Reported-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Previously we have per-device system memory aliases when DMAR is
disabled by the system. It will slow the system down if there are
lots of devices especially when DMAR is disabled, because each of the
aliased system address space will contain O(N) slots, and rendering
such N address spaces will be O(N^2) complexity.
This patch introduces a shared nodmar memory region and for each
device we only create an alias to the shared memory region. With the
aliasing, QEMU memory core API will be able to detect when devices are
sharing the same address space (which is the nodmar address space)
when rendering the FlatViews and the total number of FlatViews can be
dramatically reduced when there are a lot of devices.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20190313094323.18263-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refer to the RISC-V PSABI specification for details:
- https://github.com/riscv/riscv-elf-psabi-doc/blob/master/riscv-elf.md
Cc: Michael Tokarev <mjt@tls.msk.ru>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Alistair Francis <Alistair.Francis@wdc.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Michael Clark <mjc@sifive.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
If renderer_blocked is set do not call virtio_gpu_virgl_reset().
Instead set a flag indicating that virglrenderer needs a reset.
When renderer_blocked gets cleared do the actual reset call.
Without this we can trigger an assert in spice due to calling
spice_qxl_gl_scanout() while another operation is still running:
spice_qxl_gl_scanout: condition `qxl_state->gl_draw_cookie == GL_DRAW_COOKIE_INVALID' failed
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Message-id: 20190314115358.26678-2-kraxel@redhat.com
Allow interrogating device internals through HMP interface.
The exposed indicators can be used for troubleshooting by developers or
sysadmin.
There is no need to expose these attributes to a management system (e.x.
libvirt) because (1) most of them are not "device-management' related
info and (2) there is no guarantee the interface is stable.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <1552300155-25216-6-git-send-email-yuval.shaia@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Signed-off-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
The BCM2836 control logic module includes a simple
"local timer" which is a programmable down-counter that
can generates an interrupt. Implement this functionality.
Signed-off-by: Zoltán Baldaszti <bztemail@gmail.com>
[PMM: wrote commit message; wrapped long line; tweaked
some comments to match the final version of the code]
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
- file-posix: Make auto-read-only dynamic
- Add x-blockdev-reopen QMP command
- Finalize block-latency-histogram QMP command
- gluster: Build fixes for newer lib version
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJciAjXAAoJEH8JsnLIjy/WwMcP/3VawpEBO4I94gp7KNUO1yZa
rW7Rk0VO0gcvjk4fyTpQ1I3U2dfX6NwZLFMrk4oUS382QcKL9ky/TXVeeCaWxYxi
51OH6+wHQKu1MuAjM9acXRD59pfOwmI6wbKAgrungeFzHF3TvCYcLD0rY9Mhz1wp
Q7Oqkk2au6cFrmqZChCF2S5guZc0JOuwzd+LdDshRNDek2Px8a3etVq37VBUuxzK
WDvIws1IZkFI5y2WE3T8kn7YJ8NgMZ1p47tgkymDX7fkn3V766tec8ZYBy1Qz9ab
+I3UlzijuXB8vq+egEtzQfJvvTyoPrb65VFjW94ITu9onuclYo1oV5XVgx2c/NiR
WnUagbu9nft1E4+zmSrVB3Y4I7Pbwi+At/2L2dMQXIrrebK50Cqg8GW2fthhq/KM
5NavsqgdH14gOGS1yUGu06J0HO87XiJUKta4Th9M6iKvcGJqZ+F1WZGSiVEhhk1W
w0FSmWdB/XdUwWoWdQnfx8d43OK34Q3spqsAe59DvKKyORw8Uug1i43yWvSepwSf
SILmRYLOpMIrKUihh9NPIxB6QCzv/Mt7WjSEtaif+EXgQIGZhoQBjWpde0h5Dwh5
DrVz6NqDNnz0VDARPq2hgWOs4RlNpMvx5eZFJcK66RsOfLom3CL7mpnVaa+jd6e6
Nf0HPh/ubFB0IeEYbV0n
=dQt+
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches:
- file-posix: Make auto-read-only dynamic
- Add x-blockdev-reopen QMP command
- Finalize block-latency-histogram QMP command
- gluster: Build fixes for newer lib version
# gpg: Signature made Tue 12 Mar 2019 19:30:31 GMT
# gpg: using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream: (28 commits)
qemu-iotests: Test the x-blockdev-reopen QMP command
block: Add an 'x-blockdev-reopen' QMP command
block: Remove the AioContext parameter from bdrv_reopen_multiple()
block: Add bdrv_reset_options_allowed()
block: Add a 'mutable_opts' field to BlockDriver
block: Allow changing the backing file on reopen
block: Allow omitting the 'backing' option in certain cases
block: Handle child references in bdrv_reopen_queue()
block: Add 'keep_old_opts' parameter to bdrv_reopen_queue()
block: Freeze the backing chain for the duration of the stream job
block: Freeze the backing chain for the duration of the mirror job
block: Freeze the backing chain for the duration of the commit job
block: Allow freezing BdrvChild links
nvme: fix write zeroes offset and count
file-posix: Make auto-read-only dynamic
file-posix: Prepare permission code for fd switching
file-posix: Lock new fd in raw_reopen_prepare()
file-posix: Store BDRVRawState.reopen_state during reopen
file-posix: Factor out raw_reconfigure_getfd()
file-posix: Fix bdrv_open_flags() for snapshot=on
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Currently we do device realization like below:
hotplug_handler_pre_plug()
dc->realize()
hotplug_handler_plug()
Before we do device realization and plug, we should allocate necessary
resources and check if memory-hotplug-support property is enabled.
At the piix4 and ich9, the memory-hotplug-support property is checked at
plug stage. This means that device has been realized and mapped into guest
address space 'pc_dimm_plug()' by the time acpi plug handler is called,
where it might fail and crash QEMU due to reaching g_assert_not_reached()
(piix4) or error_abort (ich9).
Fix it by checking if memory hotplug is enabled at pre_plug stage
where we can gracefully abort hotplug request.
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
CC: Igor Mammedov <imammedo@redhat.com>
CC: Eric Blake <eblake@redhat.com>
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Message-Id: <20190301033548.6691-1-richardw.yang@linux.intel.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Claim ACS support in the generic PCIe root port to allow
passthrough of individual functions of a device to different
guests (in a nested virt.setting) with VFIO.
Without this patch, all functions of a device, such as all VFs of
an SR/IOV device, will end up in the same IOMMU group.
A similar situation occurs on Windows with Hyper-V.
In the single function device case, it also has a small cosmetic
benefit in that the root port itself is not grouped with
the device. VFIO handles that situation in that binding rules
only apply to endpoints, so it does not limit passthrough in
those cases.
Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Message-Id: <319460b483f566dd57487eb3dd340ed4c10aa53c.1550768238.git-series.knut.omang@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Implementing an ACS capability on downstream ports and multifunction
endpoints indicates isolation and IOMMU visibility to a finer
granularity. This creates smaller IOMMU groups in the guest and thus
more flexibility in assigning endpoints to guest userspace or an L2
guest.
Signed-off-by: Knut Omang <knut.omang@oracle.com>
Message-Id: <07489975121696f5573b0a92baaf3486ef51e35d.1550768238.git-series.knut.omang@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
This patch adds support for vhost-user-blk device to get/set
inflight buffer from/to backend.
Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Message-Id: <20190228085355.9614-6-xieyongji@baidu.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This patch introduces two new messages VHOST_USER_GET_INFLIGHT_FD
and VHOST_USER_SET_INFLIGHT_FD to support transferring a shared
buffer between qemu and backend.
Firstly, qemu uses VHOST_USER_GET_INFLIGHT_FD to get the
shared buffer from backend. Then qemu should send it back
through VHOST_USER_SET_INFLIGHT_FD each time we start vhost-user.
This shared buffer is used to track inflight I/O by backend.
Qemu should retrieve a new one when vm reset.
Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Chai Wen <chaiwen@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Message-Id: <20190228085355.9614-2-xieyongji@baidu.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This patch adds an option to provide flexibility for user to expose
Scalable Mode to guest. User could expose Scalable Mode to guest by
the config as below:
"-device intel-iommu,caching-mode=on,scalable-mode=on"
The Linux iommu driver has supported scalable mode. Please refer below
patch set:
https://www.spinics.net/lists/kernel/msg2985279.html
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Message-Id: <1551753295-30167-4-git-send-email-yi.y.sun@linux.intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Per Intel(R) VT-d 3.0, the qi_desc is 256 bits in Scalable
Mode. This patch adds emulation of 256bits qi_desc.
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
[Yi Sun is co-developer to rebase and refine the patch.]
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <1551753295-30167-3-git-send-email-yi.y.sun@linux.intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Intel(R) VT-d 3.0 spec introduces scalable mode address translation to
replace extended context mode. This patch extends current emulator to
support Scalable Mode which includes root table, context table and new
pasid table format change. Now intel_iommu emulates both legacy mode
and scalable mode (with legacy-equivalent capability set).
The key points are below:
1. Extend root table operations to support both legacy mode and scalable
mode.
2. Extend context table operations to support both legacy mode and
scalable mode.
3. Add pasid tabled operations to support scalable mode.
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
[Yi Sun is co-developer to contribute much to refine the whole commit.]
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Message-Id: <1551753295-30167-2-git-send-email-yi.y.sun@linux.intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Take a VhostUserState* that can be pre-allocated, and initialize it
with the associated chardev.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Tiwei Bie <tiwei.bie@intel.com>
Message-Id: <20190308140454.32437-4-marcandre.lureau@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
If we reopen a BlockDriverState and there is an option that is present
in bs->options but missing from the new set of options then we have to
return an error unless the driver is able to reset it to its default
value.
This patch adds a new 'mutable_opts' field to BlockDriver. This is
a list of runtime options that can be modified during reopen. If an
option in this list is unspecified on reopen then it must be reset (or
return an error).
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch allows the user to change the backing file of an image that
is being reopened. Here's what it does:
- In bdrv_reopen_prepare(): check that the value of 'backing' points
to an existing node or is null. If it points to an existing node it
also needs to make sure that replacing the backing file will not
create a cycle in the node graph (i.e. you cannot reach the parent
from the new backing file).
- In bdrv_reopen_commit(): perform the actual node replacement by
calling bdrv_set_backing_hd().
There may be temporary implicit nodes between a BDS and its backing
file (e.g. a commit filter node). In these cases bdrv_reopen_prepare()
looks for the real (non-implicit) backing file and requires that the
'backing' option points to it. Replacing or detaching a backing file
is forbidden if there are implicit nodes in the middle.
Although x-blockdev-reopen is meant to be used like blockdev-add,
there's an important thing that must be taken into account: the only
way to set a new backing file is by using a reference to an existing
node (previously added with e.g. blockdev-add). If 'backing' contains
a dictionary with a new set of options ({"driver": "qcow2", "file": {
... }}) then it is interpreted that the _existing_ backing file must
be reopened with those options.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Children in QMP are specified with BlockdevRef / BlockdevRefOrNull,
which can contain a set of child options, a child reference, or
NULL. In optional attributes like "backing" it can also be missing.
Only the first case (set of child options) is being handled properly
by bdrv_reopen_queue(). This patch deals with all the others.
Here's how these cases should be handled when bdrv_reopen_queue() is
deciding what to do with each child of a BlockDriverState:
1) Set of child options: if the child was implicitly created (i.e
inherits_from points to the parent) then the options are removed
from the parent's options QDict and are passed to the child with
a recursive bdrv_reopen_queue() call. This case was already
working fine.
2) Child reference: there's two possibilites here.
2a) Reference to the current child: if the child was implicitly
created then it is put in the reopen queue, keeping its
current set of options (since this was a child reference
there was no way to specify a different set of options).
If the child is not implicit then it keeps its current set
of options but it is not reopened (and therefore does not
inherit any new option from the parent).
2b) Reference to a different BDS: the current child is not put
in the reopen queue at all. Passing a reference to a
different BDS can be used to replace a child, although at
the moment no driver implements this, so it results in an
error. In any case, the current child is not going to be
reopened (and might in fact disappear if it's replaced)
3) NULL: This is similar to (2b). Although no driver allows this
yet it can be used to detach the current child so it should not
be put in the reopen queue.
4) Missing option: at the moment "backing" is the only case where
this can happen. With "blockdev-add", leaving "backing" out
means that the default backing file is opened. We don't want to
open a new image during reopen, so we require that "backing" is
always present. We'll relax this requirement a bit in the next
patch. If keep_old_opts is true and "backing" is missing then
this behaves like 2a (the current child is reopened).
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The bdrv_reopen_queue() function is used to create a queue with
the BDSs that are going to be reopened and their new options. Once
the queue is ready bdrv_reopen_multiple() is called to perform the
operation.
The original options from each one of the BDSs are kept, with the new
options passed to bdrv_reopen_queue() applied on top of them.
For "x-blockdev-reopen" we want a function that behaves much like
"blockdev-add". We want to ignore the previous set of options so that
only the ones actually specified by the user are applied, with the
rest having their default values.
One of the things that we need is a way to tell bdrv_reopen_queue()
whether we want to keep the old set of options or not, and that's what
this patch does. All current callers are setting this new parameter to
true and x-blockdev-reopen will set it to false.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Our permission system is useful to define what operations are allowed
on a certain block node and includes things like BLK_PERM_WRITE or
BLK_PERM_RESIZE among others.
One of the permissions is BLK_PERM_GRAPH_MOD which allows "changing
the node that this BdrvChild points to". The exact meaning of this has
never been very clear, but it can be understood as "change any of the
links connected to the node". This can be used to prevent changing a
backing link, but it's too coarse.
This patch adds a new 'frozen' attribute to BdrvChild, which forbids
detaching the link from the node it points to, and new API to freeze
and unfreeze a backing chain.
After this change a few functions can fail, so they need additional
checks.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit a88b179f introduced the ability to set and query bitmap
persistence, but with an atypical spelling.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 20190308205845.25734-1-eblake@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
Instead of checking against busy, inconsistent, or read only directly,
use a check function with permissions bits that let us streamline the
checks without reproducing them in many places.
Included in this patch are permissions changes that simply add the
inconsistent check to existing permissions call spots, without
addressing existing bugs.
In general, this means that busy+readonly checks become BDRV_BITMAP_DEFAULT,
which checks against all three conditions. busy-only checks become
BDRV_BITMAP_ALLOW_RO.
Notably, remove allows inconsistent bitmaps, so it doesn't follow the pattern.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190301191545.8728-4-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
Add an inconsistent bit to dirty-bitmaps that allows us to report a bitmap as
persistent but potentially inconsistent, i.e. if we find bitmaps on a qcow2
that have been marked as "in use".
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190301191545.8728-2-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
These mean the same thing now. Unify them and rename the merged call
bdrv_dirty_bitmap_busy to indicate semantically what we are describing,
as well as help disambiguate from the various _locked and _unlocked
versions of bitmap helpers that refer to mutex locks.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190223000614.13894-8-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
"Frozen" was a good description a long time ago, but it isn't adequate now.
Rename the frozen predicate to has_successor to make the semantics of the
predicate more clear to outside callers.
In the process, remove some calls to frozen() that no longer semantically
make sense. For bdrv_enable_dirty_bitmap_locked and
bdrv_disable_dirty_bitmap_locked, it doesn't make sense to prohibit QEMU
internals from performing this action when we only wished to prohibit QMP
users from issuing these commands. All of the QMP API commands for bitmap
manipulation already check against user_locked() to prohibit these actions.
Several other assertions really want to check that the bitmap isn't in-use
by another operation -- use the bitmap_user_locked function for this instead,
which presently also checks for has_successor. This leaves some redundant
checks of has_successor through different helpers that are addressed in
forthcoming patches.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190223000614.13894-3-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
This makes vfio_get_region_info_cap() to be used in quirks.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20190307050518.64968-3-aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The qemu coding standard is to use CamelCase for type and structure names,
and the pseries code follows that... sort of. There are quite a lot of
places where we bend the rules in order to preserve the capitalization of
internal acronyms like "PHB", "TCE", "DIMM" and most commonly "sPAPR".
That was a bad idea - it frequently leads to names ending up with hard to
read clusters of capital letters, and means they don't catch the eye as
type identifiers, which is kind of the point of the CamelCase convention in
the first place.
In short, keeping type identifiers look like CamelCase is more important
than preserving standard capitalization of internal "words". So, this
patch renames a heap of spapr internal type names to a more standard
CamelCase.
In addition to case changes, we also make some other identifier renames:
VIOsPAPR* -> SpaprVio*
The reverse word ordering was only ever used to mitigate the capital
cluster, so revert to the natural ordering.
VIOsPAPRVTYDevice -> SpaprVioVty
VIOsPAPRVLANDevice -> SpaprVioVlan
Brevity, since the "Device" didn't add useful information
sPAPRDRConnector -> SpaprDrc
sPAPRDRConnectorClass -> SpaprDrcClass
Brevity, and makes it clearer this is the same thing as a "DRC"
mentioned in many other places in the code
This is 100% a mechanical search-and-replace patch. It will, however,
conflict with essentially any and all outstanding patches touching the
spapr code.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The POWER9 processor does not support per-core frequency control. The
cores are arranged in groups of four, along with their respective L2
and L3 caches, into a structure known as a Quad. The frequency must be
managed at the Quad level.
Provide a basic Quad model to fake the settings done by the firmware
on the Non-Cacheable Unit (NCU). Each core pair (EX) needs a special
BAR setting for the TIMA area of XIVE because it resides on the same
address on all chips.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190307223548.20516-12-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Provide a new class attribute to define XSCOM operations per CPU
family and add a couple of XSCOM addresses controlling the power
management states of the core on POWER9.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190307223548.20516-11-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The OCC on POWER9 is very similar to the one found on POWER8. Provide
the same routines with P9 values for the registers and IRQ number.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190307223548.20516-10-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
To ease the introduction of the OCC model for POWER9, provide a new
class attributes to define XSCOM operations per CPU family and a PSI
IRQ number.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20190307223548.20516-9-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This is just a simple reminder that SerIRQ routing should be
addressed.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190307223548.20516-8-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The LPC Controller on POWER9 is very similar to the one found on
POWER8 but accesses are now done via on MMIOs, without the XSCOM and
ECCB logic. The device tree is populated differently so we add a
specific POWER9 routine for the purpose.
SerIRQ routing is yet to be done.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190307223548.20516-7-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The ISA bus has a different DT nodename on POWER9. Compute the name
when the PnvChip is realized, that is before it is used by the machine
to populate the device tree with the ISA devices.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190307223548.20516-6-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
It will ease the introduction of the LPC Controller model for POWER9.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20190307223548.20516-5-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The PSI bridge on POWER9 is very similar to POWER8. The BAR is still
set through XSCOM but the controls are now entirely done with MMIOs.
More interrupts are defined and the interrupt controller interface has
changed to XIVE. The POWER9 model is a first example of the usage of
the notify() handler of the XiveNotifier interface, linking the PSI
XiveSource to its owning device model.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190307223548.20516-3-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
To ease the introduction of the PSI bridge model for POWER9, abstract
the POWER chip differences in a PnvPsi class model and introduce a
specific Pnv8Psi type for POWER8. POWER8 interface to the interrupt
controller is still XICS whereas POWER9 uses the new XIVE model.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190307223548.20516-2-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
On sPAPR vfio_listener_region_add() is called in 2 situations:
1. a new listener is registered from vfio_connect_container();
2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window().
In both cases vfio_listener_region_add() calls
memory_region_iommu_replay() to notify newly registered IOMMU notifiers
about existing mappings which is totally desirable for case 1.
However for case 2 it is nothing but noop as the window has just been
created and has no valid mappings so replaying those does not do anything.
It is barely noticeable with usual guests but if the window happens to be
really big, such no-op replay might take minutes and trigger RCU stall
warnings in the guest.
For example, a upcoming GPU RAM memory region mapped at 64TiB (right
after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB
which is (128<<40)/0x10000=2.147.483.648 TCEs to replay.
This mitigates the problem by adding an "skipping_replay" flag to
sPAPRTCETable and defining sPAPR own IOMMU MR replay() hook which does
exactly the same thing as the generic one except it returns early if
@skipping_replay==true.
Another way of fixing this would be delaying replay till the very first
H_PUT_TCE but this does not work if in-kernel H_PUT_TCE handler is
enabled (a likely case).
When "ibm,create-pe-dma-window" is complete, the guest will map only
required regions of the huge DMA window.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20190307050518.64968-2-aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The POWER9 and POWER8 processors have different interrupt controllers,
and reporting their state requires calling different helper routines.
However, the interrupt presenters are still handled in the higher
level pic_print_info() routine because they are not related to the
chip.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190306085032.15744-9-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The POWER9 and POWER8 processors have a different set of devices and a
different device tree layout.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190306085032.15744-8-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This is a simple model of the POWER9 XIVE interrupt controller for the
PowerNV machine which only addresses the needs of the skiboot
firmware. The PowerNV model reuses the common XIVE framework developed
for sPAPR as the fundamentals aspects are quite the same. The
difference are outlined below.
The controller initial BAR configuration is performed using the XSCOM
bus from there, MMIO are used for further configuration.
The MMIO regions exposed are :
- Interrupt controller registers
- ESB pages for IPIs and ENDs
- Presenter MMIO (Not used)
- Thread Interrupt Management Area MMIO, direct and indirect
The virtualization controller MMIO region containing the IPI ESB pages
and END ESB pages is sub-divided into "sets" which map portions of the
VC region to the different ESB pages. These are modeled with custom
address spaces and the XiveSource and XiveENDSource objects are sized
to the maximum allowed by HW. The memory regions are resized at
run-time using the configuration of EDT set translation table provided
by the firmware.
The XIVE virtualization structure tables (EAT, ENDT, NVTT) are now in
the machine RAM and not in the hypervisor anymore. The firmware
(skiboot) configures these tables using Virtual Structure Descriptor
defining the characteristics of each table : SBE, EAS, END and
NVT. These are later used to access the virtual interrupt entries. The
internal cache of these tables in the interrupt controller is updated
and invalidated using a set of registers.
Still to address to complete the model but not fully required is the
support for block grouping. Escalation support will be necessary for
KVM guests.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190306085032.15744-7-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The POWER9 PowerNV machine will use a XIVE interrupt presenter type.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190306085032.15744-6-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The PowerNV machine with need to encode the block id in the source
interrupt number before forwarding the source event notification to
the Router.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190306085032.15744-5-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The PowerNV machine can perform indirect loads and stores on the TIMA
on behalf of another CPU. Give the controller the possibility to call
the TIMA memory accessors with a XiveTCTX of its choice.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190306085032.15744-4-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We will use it to get the CPU interrupt presenter in XIVE when the
TIMA is accessed from the indirect page.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20190306085032.15744-3-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
SPAPR_MEMORY_BLOCK_SIZE is logically a difference in memory addresses, and
hence of type hwaddr which is 64-bit. Previously it wasn't marked as such
which means that it could be treated as 32-bit. That will work in some
circumstances but if multiplied by another 32-bit value it could lead to
a 32-bit overflow and an incorrect result.
One specific instance of this in spapr_lmb_dt_populate() was spotted by
Coverity (CID 1399145).
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Introduce a new spapr_cap SPAPR_CAP_CCF_ASSIST to be used to indicate
the requirement for a hw-assisted version of the count cache flush
workaround.
The count cache flush workaround is a software workaround which can be
used to flush the count cache on context switch. Some revisions of
hardware may have a hardware accelerated flush, in which case the
software flush can be shortened. This cap is used to set the
availability of such hardware acceleration for the count cache flush
routine.
The availability of such hardware acceleration is indicated by the
H_CPU_CHAR_BCCTR_FLUSH_ASSIST flag being set in the characteristics
returned from the KVM_PPC_GET_CPU_CHAR ioctl.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Message-Id: <20190301031912.28809-2-sjitindarsingh@gmail.com>
[dwg: Small style fixes]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The spapr_cap SPAPR_CAP_IBS is used to indicate the level of capability
for mitigations for indirect branch speculation. Currently the available
values are broken (default), fixed-ibs (fixed by serialising indirect
branches) and fixed-ccd (fixed by diabling the count cache).
Introduce a new value for this capability denoted workaround, meaning that
software can work around the issue by flushing the count cache on
context switch. This option is available if the hypervisor sets the
H_CPU_BEHAV_FLUSH_COUNT_CACHE flag in the cpu behaviours returned from
the KVM_PPC_GET_CPU_CHAR ioctl.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Message-Id: <20190301031912.28809-1-sjitindarsingh@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>