Commit e2392d4395 ("ppc/pnv: Create BMC devices at machine init")
introduced default BMC devices which can be a problem when the same
devices are defined on the command line with :
-device ipmi-bmc-sim,id=bmc0 -device isa-ipmi-bt,bmc=bmc0,irq=10
QEMU fails with :
qemu-system-ppc64: error creating device tree: node: FDT_ERR_EXISTS
Use defaults_enabled() when creating the default BMC devices to let
the user provide its own BMC devices using '-nodefaults'. If no BMC
device are provided, output a warning but let QEMU run as this is a
supported configuration. However, when multiple BMC devices are
defined, stop QEMU with a clear error as the results are unexpected.
Fixes: e2392d4395 ("ppc/pnv: Create BMC devices at machine init")
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200404153655.166834-1-clg@kaod.org>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
In dcr_write_pcie() we take the iothread lock around a call to
pcie_host_mmcfg_udpate(). This is an incorrect attempt to deal with
the bug fixed in commit 235352ee6e, where we were not taking
the iothread lock before calling device dcr read/write functions.
(It's not sufficient locking, because although the other cases in the
switch statement won't assert, there is no locking which prevents
multiple guest CPUs from trying to access the PPC460EXPCIEState
struct at the same time and corrupting data.)
Unfortunately with commit 235352ee6e we are now trying
to recursively take the iothread lock, which will assert:
$ qemu-system-ppc -M sam460ex --display none
**
ERROR:/home/petmay01/linaro/qemu-from-laptop/qemu/cpus.c:1830:qemu_mutex_lock_iothread_impl: assertion failed: (!qemu_mutex_iothread_locked())
Aborted (core dumped)
Remove the locking within dcr_write_pcie().
Fixes: 235352ee6e
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20200330125228.24994-1-peter.maydell@linaro.org>
Tested-by: BALATON Zoltan <balaton@eik.bme.hu>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
For various technical reasons we can't currently allow unplug a PCI to PCI
bridge on the pseries machine. spapr_pci_unplug_request() correctly
generates an error message if that's attempted.
But.. if the given errp is not error_abort or error_fatal, it doesn't
actually stop trying to unplug the bridge anyway.
Fixes: 14e714900f "spapr: Allow hot plug/unplug of PCI bridges and devices under PCI bridges"
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
Try to be tolerant of FWNMI delivery errors if the machine check had been
recovered by the host.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200325142906.221248-5-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
[dwg: Updated comment at Greg's suggestion]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Add some messages which explain problems and guest misbehaviour that
may be difficult to diagnose in rare cases of machine checks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200325142906.221248-4-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Some of the conditions are not as clearly documented as they could be.
Also the non-FWNMI case does not need a large comment.
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200325142906.221248-3-npiggin@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The KVM FWNMI capability should be enabled with the "ibm,nmi-register"
rtas call. Although MCEs from KVM will be delivered as architected
interrupts to the guest before "ibm,nmi-register" is called, KVM has
different behaviour depending on whether the guest has enabled FWNMI
(it attempts to do more recovery on behalf of a non-FWNMI guest).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200325142906.221248-2-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
If qemu_find_file() doesn't find the BIOS it returns NULL; we were
passing that unchecked through to load_elf(), which assumes a non-NULL
pointer and may misbehave. In practice it fails with a weird message:
$ qemu-system-ppc -M ppce500 -display none -kernel nonesuch
Bad address
qemu-system-ppc: could not load firmware '(null)'
Handle the failure case better:
$ qemu-system-ppc -M ppce500 -display none -kernel nonesuch
qemu-system-ppc: could not find firmware/kernel file 'nonesuch'
Spotted by Coverity (CID 1238954).
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20200324121216.23899-1-peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This code is inside the "if (dinfo)" condition, so testing
again here whether it is NULL is unnecessary.
Fixes: dd59bcae7 (Don't size flash memory to match backing image)
Reported-by: Coverity (CID 1421917)
Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200320155740.5342-1-philmd@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This is the only error path that needs to free the previously allocated
ov1.
Reported-by: Coverity (CID 1421924)
Fixes: cbd0d7f363 "spapr: Fail CAS if option vector table cannot be parsed"
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158481206205.336182.16106097429336044843.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Per PAPR, it is expected to set effective address provided flag in
sub_err_type member of mc extended error log (i.e
rtas_event_log_v6_mc.sub_err_type). This somehow got missed in original
fwnmi-mce patch series. The current code just updates the effective address
but does not set the flag to indicate that it is available. Hence guest
fails to extract effective address from mce rtas log. This patch fixes
that.
Without this patch guest MCE logs fails print DAR value:
[ 11.933608] Disabling lock debugging due to kernel taint
[ 11.933773] MCE: CPU0: machine check (Severe) Host TLB Multihit [Recovered]
[ 11.933979] MCE: CPU0: NIP: [c000000000090b34] radix__flush_tlb_range_psize+0x194/0xf00
[ 11.934223] MCE: CPU0: Initiator CPU
[ 11.934341] MCE: CPU0: Unknown
After the change:
[ 22.454149] Disabling lock debugging due to kernel taint
[ 22.454316] MCE: CPU0: machine check (Severe) Host TLB Multihit DAR: deadbeefdeadbeef [Recovered]
[ 22.454605] MCE: CPU0: NIP: [c0000000003e5804] kmem_cache_alloc+0x84/0x330
[ 22.454820] MCE: CPU0: Initiator CPU
[ 22.454944] MCE: CPU0: Unknown
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Message-Id: <158451653844.22972.17999316676230071087.stgit@jupiter>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+ber27ys35W+dsvQfe+BBqr8OQ4FAl5xW7kACgkQfe+BBqr8
OQ6x2w/9HAM9tyP65wMebkvvg29v6PeO65g81BOzdfcuyWhkZl0pWg6LjNfaN9a3
xin2MDB9ODOug8kBICeCGEzuJ/qe3wcXEkjnK4uklSk4YZDBIzgfVnC4N+3/pkMr
pvJM2GNHKk8PQI0YoBPZXwfvzN1CB03f0oaWokkpQq4XYLO6rltflPLwI33De5kx
igPA7rfRAz12PxP5xzhvVWfaD54xc9pFoQ8SSxrnUqr+3OWfV6+xovE5F7e1O6vw
x84rRod50tp4c9ABS0mY1kcdnFUKK1YXh+oRvtj9B5QbjYfZY+wvz8Iisgk3cB1s
CtKTvQSvbvBkdghecX5hHmeSerVKxjjMR8tnoS9A0eaTjfOuum2eBqS0Cf51C61O
UuMVHFVRyR8g+t0xcDbciPMGbS08UEVaXlibYU1tA8lr6EB1G4aHW1ZvdAsc/eeY
WrDPb9+QaItT9yL5U43s3/ABFMbHwqyJwdDgNEmet5L89voSGY8VfhDj7wesoQv4
rzCCeDnl1drFiKqiHSc0IrTc7ktpz7vpfh3mydaD52yj5/xmD/3fS5UpUk3kYDJp
JrN9npjnsbuLhdI63TrJPXXzdFqSiRiHaNlmiPtKm8ER/NowwpO5BUPSNLK4HIBX
QcgbcSjbdj1GgmmINPylzShyev9cBfigTks1uF1ln4XuN96S45Q=
=Q+Rb
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/jnsnow/tags/ide-pull-request' into staging
Pull request
# gpg: Signature made Tue 17 Mar 2020 23:22:33 GMT
# gpg: using RSA key F9B7ABDBBCACDF95BE76CBD07DEF8106AAFC390E
# gpg: Good signature from "John Snow (John Huston) <jsnow@redhat.com>" [full]
# Primary key fingerprint: FAEB 9711 A12C F475 812F 18F2 88A9 064D 1835 61EB
# Subkey fingerprint: F9B7 ABDB BCAC DF95 BE76 CBD0 7DEF 8106 AAFC 390E
* remotes/jnsnow/tags/ide-pull-request:
hw/ide: Remove unneeded inclusion of hw/ide.h
hw/ide: Move MAX_IDE_DEVS define to hw/ide/internal.h
hw/ide: Do ide_drive_get() within pci_ide_create_devs()
hw/ide/pci.c: Coding style update to fix checkpatch errors
hw/ide: Remove now unneded #include "hw/pci/pci.h" from hw/ide.h
hw/ide: Get rid of piix4_init function
hw/isa/piix4.c: Introduce variable to store devfn
hw/ide: Get rid of piix3_init functions
hd-geo-test: Clean up use of buf[] in create_qcow2_with_mbr()
via-ide: always use legacy IRQ 14/15 routing
via-ide: allow guests to write to PCI_CLASS_PROG
via-ide: initialise IDE controller in legacy mode
via-ide: ensure that PCI_INTERRUPT_LINE is hard-wired to its default value
pci: Honour wmask when resetting PCI_INTERRUPT_LINE
ide/via: Get rid of via_ide_init()
via-ide: move registration of VMStateDescription to DeviceClass
cmd646: remove unused pci_cmd646_ide_init() function
dp264: use pci_create_simple() to initialise the cmd646 device
cmd646: register vmstate_ide_pci VMStateDescription in DeviceClass
cmd646: register cmd646_reset() function in DeviceClass
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Here's my final pull request for the qemu-5.0 soft freeze. Sorry this
is just under the wire - I hit some last minute problems that took a
while to fix up and retest.
Highlights are:
* Numerous fixes for the FWNMI feature
* A handful of cleanups to the device tree construction code
* Numerous fixes for the spapr-vscsi device
* A number of fixes and cleanups for real mode (MMU off) softmmu
handling
* Fixes for handling of the PAPR RMA
* Better handling of hotplug/unplug events during boot
* Assorted other fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEdfRlhq5hpmzETofcbDjKyiDZs5IFAl5wnnsACgkQbDjKyiDZ
s5JdpQ//eY/AOTs09UhvKxt8DN7lC2WyHGxYSncb2Tj2zaJyPPX9p296IDBMw+KX
Cafr6LzwLjpcpOyf/EWzg7qYGbNYoYgRWoOkHI/9pHsrIH3ZvhmnyTVQI5CffeEb
EDDXJUQo/2sFpAGeODr5zz+zAQUGzt6ZZUxAiQAF9RYc9ohUGD2x5c86Asx6ZTZo
/14bd3qnrcy1x+TxDetb1idFxFr2DsdYqpHAi88zHm+UaWzxYrb7kakd+YbqI24N
tYryf5SdtGrWAAdF/7nq2PQJFzskx+t0QearU+ruovRydxYbUtBpkr5HauoVuQXR
LiV270sDYDS/D1vvQQKzLxkUuvWmbZ0rB+2BAtS1rwq2sOKqYyQEAkTWfGtSXcf8
7fuZm2i1G78MuYGTOLCrF1u0owUB3QYHvt1NUW09GyWS8X3mahtj2fRe1RtPV/5d
NL217bcd32fkMoGCg/lFvK9sCQzR6zJGKkJvOGMVW4ahHCLixpjIWabWtdXjfguT
UahRPvlX7fzeVT+DISfjqyxwL+THnTvB3CTMWG2cktf0K1ke4SXcQ0mPyksN1NuC
QocfPCr1TN2ri8g9dAPwQmOkojnNs9izpIWRYSl3avTJFNseNPxuHQALXj2Y3Y/O
EoYxLN+cqPukQ1O3GxEj5QMKe8V/0986mxWnuS/dMohQOoy+zV4=
=BPnR
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/dgibson/tags/ppc-for-5.0-20200317' into staging
ppc patch queue 2020-03-17
Here's my final pull request for the qemu-5.0 soft freeze. Sorry this
is just under the wire - I hit some last minute problems that took a
while to fix up and retest.
Highlights are:
* Numerous fixes for the FWNMI feature
* A handful of cleanups to the device tree construction code
* Numerous fixes for the spapr-vscsi device
* A number of fixes and cleanups for real mode (MMU off) softmmu
handling
* Fixes for handling of the PAPR RMA
* Better handling of hotplug/unplug events during boot
* Assorted other fixes
# gpg: Signature made Tue 17 Mar 2020 09:55:07 GMT
# gpg: using RSA key 75F46586AE61A66CC44E87DC6C38CACA20D9B392
# gpg: Good signature from "David Gibson <david@gibson.dropbear.id.au>" [full]
# gpg: aka "David Gibson (Red Hat) <dgibson@redhat.com>" [full]
# gpg: aka "David Gibson (ozlabs.org) <dgibson@ozlabs.org>" [full]
# gpg: aka "David Gibson (kernel.org) <dwg@kernel.org>" [unknown]
# Primary key fingerprint: 75F4 6586 AE61 A66C C44E 87DC 6C38 CACA 20D9 B392
* remotes/dgibson/tags/ppc-for-5.0-20200317: (45 commits)
pseries: Update SLOF firmware image
ppc/spapr: Ignore common "ibm,nmi-interlock" Linux bug
ppc/spapr: Implement FWNMI System Reset delivery
target/ppc: allow ppc_cpu_do_system_reset to take an alternate vector
ppc/spapr: Allow FWNMI on TCG
ppc/spapr: Fix FWNMI machine check interrupt delivery
ppc/spapr: Add FWNMI System Reset state
ppc/spapr: Change FWNMI names
ppc/spapr: Fix FWNMI machine check failure handling
spapr: Rename DT functions to newer naming convention
spapr: Move creation of ibm,architecture-vec-5 property
spapr: Move creation of ibm,dynamic-reconfiguration-memory dt node
spapr/rtas: Reserve space for RTAS blob and log
pseries: Update SLOF firmware image
ppc/spapr: Move GPRs setup to one place
target/ppc: Fix rlwinm on ppc64
spapr/xive: use SPAPR_IRQ_IPI to define IPI ranges exposed to the guest
hw/scsi/spapr_vscsi: Convert debug fprintf() to trace event
hw/scsi/spapr_vscsi: Prevent buffer overflow
hw/scsi/spapr_vscsi: Do not mix SRP IU size with DMA buffer size
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
After previous clean ups we can drop direct inclusion of hw/ide.h from
several places.
Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
Reviewed-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-id: a3f72b663e537701c63cec5fc9cb8ed4f4249f28.1584457537.git.balaton@eik.bme.hu
Signed-off-by: John Snow <jsnow@redhat.com>
The scripts/coccinelle/memory-region-housekeeping.cocci reported:
* TODO [[view:./hw/ppc/ppc405_boards.c::face=ovl-face1::linb=195::colb=8::cole=30][potential use of memory_region_init_rom*() in ./hw/ppc/ppc405_boards.c::195]]
* TODO [[view:./hw/ppc/ppc405_boards.c::face=ovl-face1::linb=464::colb=8::cole=30][potential use of memory_region_init_rom*() in ./hw/ppc/ppc405_boards.c::464]]
We can indeed replace the memory_region_init_ram() and
memory_region_set_readonly() calls by memory_region_init_rom().
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
This commit was produced with the Coccinelle script
scripts/coccinelle/memory-region-housekeeping.cocci.
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Linux kernels call "ibm,nmi-interlock" in their system reset handlers
contrary to PAPR. Returning an error because the CPU does not hold the
interlock here causes Linux to print warning messages. PowerVM returns
success in this case, so do the same for now.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200316142613.121089-9-npiggin@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
PAPR requires that if "ibm,nmi-register" succeeds, then the hypervisor
delivers all system reset and machine check exceptions to the registered
addresses.
System Resets are delivered with registers set to the architected state,
and with no interlock.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200316142613.121089-8-npiggin@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Provide for an alternate delivery location, -1 defaults to the
architected address.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200316142613.121089-7-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
There should no longer be a reason to prevent TCG providing FWNMI.
System Reset interrupts are generated to the guest with nmi monitor
command and H_SIGNAL_SYS_RESET. Machine Checks can not be injected
currently, but this could be implemented with the mce monitor cmd
similarly to i386.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200316142613.121089-6-npiggin@gmail.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
[dwg: Re-enable FWNMI in qtests, since that now works]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
FWNMI machine check delivery misses a few things that will make it fail
with TCG at least (which we would like to allow in future to improve
testing).
It's not nice to scatter interrupt delivery logic around the tree, so
move it to excp_helper.c and share code where possible.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200316142613.121089-5-npiggin@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The FWNMI option must deliver system reset interrupts to their
registered address, and there are a few constraints on the handler
addresses specified in PAPR. Add the system reset address state and
checks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200316142613.121089-4-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviwed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The option is called "FWNMI", and it involves more than just machine
checks, also machine checks can be delivered without the FWNMI option,
so re-name various things to reflect that.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200316142613.121089-3-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
ppc_cpu_do_system_reset delivers a system rreset interrupt to the guest,
which is certainly not what is intended here. Panic the guest like other
failure cases here do.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200316142613.121089-2-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
In the spapr code we've been gradually moving towards a convention that
functions which create pieces of the device tree are called spapr_dt_*().
This patch speeds that along by renaming most of the things that don't yet
match that so that they do.
For now we leave the *_dt_populate() functions which are actual methods
used in the DRCClass::dt_populate method.
While we're there we remove a few comments that don't really say anything
useful.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
This is currently called from spapr_dt_cas_updates() which is a hang
over from when we created this only as a diff to the DT at CAS time.
Now that we fully rebuild the DT at CAS time, just create it along
with the rest of the properties in /chosen.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
Currently this node with information about hotpluggable memory is created
from spapr_dt_cas_updates(). But that's just a hangover from when we
created it only as a diff to the device tree at CAS time. Now that we
fully rebuild the DT as CAS time, it makes more sense to create this along
with the rest of the memory information in the device tree.
So, move it to spapr_populate_memory(). The patch is huge, but it's nearly
all just code motion.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
At the moment SLOF reserves space for RTAS and instantiates the RTAS blob
which is 20 bytes binary blob calling an hypercall. The rest of the RTAS
area is a log which SLOF has no idea about but QEMU does.
This moves RTAS sizing to QEMU and this overrides the size from SLOF.
The only remaining problem is that SLOF copies the number of bytes it
reserved (2KB for now) so QEMU needs to reserve at least this much;
SLOF will be fixed separately to check that rtas-size from QEMU is
enough for those 20 bytes for the H_RTAS hcall.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20200316011841.99970-1-aik@ozlabs.ru>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
At the moment "pseries" starts in SLOF which only expects the FDT blob
pointer in r3. As we are going to introduce a OpenFirmware support in
QEMU, we will be booting OF clients directly and these expect a stack
pointer in r1, Linux looks at r3/r4 for the initramdisk location
(although vmlinux can find this from the device tree but zImage from
distro kernels cannot).
This extends spapr_cpu_set_entry_state() to take more registers. This
should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20200310050733.29805-2-aik@ozlabs.ru>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Move the calculation of the Real Mode Area (RMA) size into a helper
function. While we're there clean it up and correct it in a few ways:
* Add comments making it clearer where the various constraints come from
* Remove a pointless check that the RMA fits within Node 0 (we've just
clamped it so that it does)
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
In spapr_machine_init() we clamp the size of the RMA to 16GiB and the
comment saying why doesn't make a whole lot of sense. In fact, this was
done because the real mode handling code elsewhere limited the RMA in TCG
mode to the maximum value configurable in LPCR[RMLS], 16GiB.
But,
* Actually LPCR[RMLS] has been able to encode a 256GiB size for a very
long time, we just didn't implement it properly in the softmmu
* LPCR[RMLS] shouldn't really be relevant anyway, it only was because we
used to abuse the RMOR based translation mode in order to handle the
fact that we're not modelling the hypervisor parts of the cpu
We've now removed those limitations in the modelling so the 16GiB clamp no
longer serves a function. However, we can't just remove the limit
universally: that would break migration to earlier qemu versions, where
the 16GiB RMLS limit still applies, no matter how bad the reasons for it
are.
So, we replace the 16GiB clamp, with a clamp to a limit defined in the
machine type class. We set it to 16 GiB for machine types 4.2 and earlier,
but set it to 0 meaning unlimited for the new 5.0 machine type.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
The Real Mode Area (RMA) is the part of memory which a guest can access
when in real (MMU off) mode. Of course, for a guest under KVM, the MMU
isn't really turned off, it's just in a special translation mode - Virtual
Real Mode Area (VRMA) - which looks like real mode in guest mode.
The mechanics of how this works when using the hash MMU (HPT) put a
constraint on the size of the RMA, which depends on the size of the
HPT. So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA
we advertise to the guest based on this VRMA limit.
There are several things wrong with this:
1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum
of Node 0 memory size and the VRMA limit. That will *often* work the
same as clamping, but there can be other constraints on RMA size which
supersede Node 0 memory size. We have real bugs caused by this
(currently worked around in the guest kernel)
2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where
we're past the point that we can actually advertise an RMA limit to the
guest
3) But most fundamentally, the VRMA limit depends on host configuration
(page size) which shouldn't be visible to the guest, but this partially
exposes it. This can cause problems with migration in certain edge
cases, although we will mostly get away with it.
In practice, this clamping is almost never applied anyway. With 64kiB
pages and the normal rules for sizing of the HPT, the theoretical VRMA
limit will be 4x(guest memory size) and so never hit. It will hit with
4kiB pages, where it will be (guest memory size)/4. However all mainstream
distro kernels for POWER have used a 64kiB page size for at least 10 years.
So, simply replace this logic with a check that the RMA we've calculated
based only on guest visible configuration will fit within the host implied
VRMA limit. This can break if running HPT guests on a host kernel with
4kiB page size. As noted that's very rare. There also exist several
possible workarounds:
* Change the host kernel to use 64kiB pages
* Use radix MMU (RPT) guests instead of HPT
* Use 64kiB hugepages on the host to back guest memory
* Increase the guest memory size so that the RMA hits one of the fixed
limits before the RMA limit. This is relatively easy on POWER8 which
has a 16GiB limit, harder on POWER9 which has a 1TiB limit.
* Use a guest NUMA configuration which artificially constrains the RMA
within the VRMA limit (the RMA must always fit within Node 0).
Previously, on KVM, we also temporarily reduced the rma_size to 256M so
that the we'd load the kernel and initrd safely, regardless of the VRMA
limit. This was a) confusing, b) could significantly limit the size of
images we could load and c) introduced a behavioural difference between
KVM and TCG. So we remove that as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Greg Kurz <groug@kaod.org>
This function calculates the maximum size of the RMA as implied by the
host's page size of structure of the VRMA (there are a number of other
constraints on the RMA size which will supersede this one in many
circumstances).
The current interface takes the current RMA size estimate, and clamps it
to the VRMA derived size. The only current caller passes in an arguably
wrong value (it will match the current RMA estimate in some but not all
cases).
We want to fix that, but for now just keep concerns separated by having the
KVM helper function just return the VRMA derived limit, and let the caller
combine it with other constraints. We call the new function
kvmppc_vrma_limit() to more clearly indicate its limited responsibility.
The helper should only ever be called in the KVM enabled case, so replace
its !CONFIG_KVM stub with an assert() rather than a dummy value.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cedric Le Goater <clg@fr.ibm.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
MIN_RMA_SLOF records the minimum about of RMA that the SLOF firmware
requires. It lets us give a meaningful error if the RMA ends up too small,
rather than just letting SLOF crash.
It's currently stored as a number of megabytes, which is strange for global
constants. Move that megabyte scaling into the definition of the constant
like most other things use.
Change from M to MiB in the associated message while we're at it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
For the "pseries" machine, we use "virtual hypervisor" mode where we
only model the CPU in non-hypervisor privileged mode. This means that
we need guest physical addresses within the modelled cpu to be treated
as absolute physical addresses.
We used to do that by clearing LPCR[VPM0] and setting LPCR[RMLS] to a high
limit so that the old offset based translation for guest mode applied,
which does what we need. However, POWER9 has removed support for that
translation mode, which meant we had some ugly hacks to keep it working.
We now explicitly handle this sort of translation for virtual hypervisor
mode, so the hacks aren't necessary. We don't need to set VPM0 and RMLS
from the machine type code - they're now ignored in vhyp mode. On the cpu
side we don't need to allow LPCR[RMLS] to be set on POWER9 in vhyp mode -
that was only there to allow the hack on the machine side.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200228123303.14540-1-philmd@redhat.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Fixes Coverity issue,
CID 1419883: Error handling issues (CHECKED_RETURN)
Calling "qemu_uuid_parse" without checking return value
nvdimm_set_uuid() already verifies if the user provided uuid is valid or
not. So, need to check for the validity during pre-plug validation again.
As this a false positive in this case, assert if not valid to be safe.
Also, error_abort if QOM accessor encounters error while fetching the uuid
property.
Reported-by: Coverity (CID 1419883)
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Message-Id: <158281096564.89540.4507375445765515529.stgit@lep8c.aus.stglabs.ibm.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
If a hot plug or unplug request is pending at CAS, we currently trigger
a CAS reboot, which severely increases the guest boot time. This is
because SLOF doesn't handle hot plug events and we had no way to fix
the FDT that gets presented to the guest.
We can do better thanks to recent changes in QEMU and SLOF:
- we now return a full FDT to SLOF during CAS
- SLOF was fixed to correctly detect any device that was either added or
removed since boot time and to update its internal DT accordingly.
The right solution is to process all pending hot plug/unplug requests
during CAS: convert hot plugged devices to cold plugged devices and
remove the hot unplugged ones, which is exactly what spapr_drc_reset()
does. Also clear all hot plug events that are currently queued since
they're no longer relevant.
Note that SLOF cannot currently populate hot plugged PCI bridges or PHBs
at CAS. Until this limitation is lifted, SLOF will reset the machine when
this scenario occurs : this will allow the FDT to be fully processed when
SLOF is started again (ie. the same effect as the CAS reboot that would
occur anyway without this patch).
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158257222352.4102917.8984214333937947307.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Several objects implemented their own uint property getters and setters,
despite them being straightforward (without any checks/validations on
the values themselves) and identical across objects. This makes use of
an enhanced API for object_property_add_uintXX_ptr() which offers
default setters.
Some of these setters used to update the value even if the type visit
failed (eg. because the value being set overflowed over the given type).
The new setter introduces a check for these errors, not updating the
value if an error occurred. The error is propagated.
Signed-off-by: Felipe Franciosi <felipe@nutanix.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Traditionally, the uint-specific property helpers only offer getters.
When adding object (or class) uint types, one must therefore use the
generic property helper if a setter is needed (and probably duplicate
some code writing their own getters/setters).
This enhances the uint-specific property helper APIs by adding a
bitwise-or'd 'flags' field and modifying all clients of that API to set
this paramater to OBJ_PROP_FLAG_READ. This maintains the current
behaviour whilst allowing others to also set OBJ_PROP_FLAG_WRITE (or use
the more convenient OBJ_PROP_FLAG_READWRITE) in the future (which will
automatically install a setter). Other flags may be added later.
Signed-off-by: Felipe Franciosi <felipe@nutanix.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There's no good reason for it to be type int, change it to bool.
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200207161948.15972-3-philmd@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
This series removes ad hoc RAM allocation API (memory_region_allocate_system_memory)
and consolidates it around hostmem backend. It allows to
* resolve conflicts between global -mem-prealloc and hostmem's "policy" option,
fixing premature allocation before binding policy is applied
* simplify complicated memory allocation routines which had to deal with 2 ways
to allocate RAM.
* reuse hostmem backends of a choice for main RAM without adding extra CLI
options to duplicate hostmem features. A recent case was -mem-shared, to
enable vhost-user on targets that don't support hostmem backends [1] (ex: s390)
* move RAM allocation from individual boards into generic machine code and
provide them with prepared MemoryRegion.
* clean up deprecated NUMA features which were tied to the old API (see patches)
- "numa: remove deprecated -mem-path fallback to anonymous RAM"
- (POSTPONED, waiting on libvirt side) "forbid '-numa node,mem' for 5.0 and newer machine types"
- (POSTPONED) "numa: remove deprecated implicit RAM distribution between nodes"
Introduce a new machine.memory-backend property and wrapper code that aliases
global -mem-path and -mem-alloc into automatically created hostmem backend
properties (provided memory-backend was not set explicitly given by user).
A bulk of trivial patches then follow to incrementally convert individual
boards to using machine.memory-backend provided MemoryRegion.
Board conversion typically involves:
* providing MachineClass::default_ram_size and MachineClass::default_ram_id
so generic code could create default backend if user didn't explicitly provide
memory-backend or -m options
* dropping memory_region_allocate_system_memory() call
* using convenience MachineState::ram MemoryRegion, which points to MemoryRegion
allocated by ram-memdev
On top of that for some boards:
* missing ram_size checks are added (typically it were boards with fixed ram size)
* ram_size fixups are replaced by checks and hard errors, forcing user to
provide correct "-m" values instead of ignoring it and continuing running.
After all boards are converted, the old API is removed and memory allocation
routines are cleaned up.
The device tree blob returned by load_device_tree is malloced.
We should free it after cpu_physical_memory_write().
Reported-by: Euler Robot <euler.robot@huawei.com>
Signed-off-by: Chen Qun <kuhn.chenqun@huawei.com>
Message-Id: <20200218091154.21696-3-kuhn.chenqun@huawei.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We already detect if a device is being hot plugged before CAS to trigger
a CAS reboot and during migration to migrate the state of the associated
DRC. But hot unplugging a device is also an asynchronous operation that
requires the guest to take action. This means that if the guest is migrated
after the hot unplug event was sent but before it could release the device
with RTAS, the destination QEMU doesn't know about the pending unplug
operation and doesn't actually remove the device when the guest finally
releases it.
Similarly, if the unplug request is fired before CAS, the guest isn't
notified of the change, just like with hotplug. It ends up booting with
the device still present in the DT and configures it, just like it was
never removed. Even weirder, since the event is still queued, it will
be eventually processed when some other unrelated event is posted to
the guest.
Enhance spapr_drc_transient() to also return true if an unplug request is
pending. This fixes the issue at CAS with a CAS reboot request and
causes the DRC state to be migrated. Some extra care is still needed to
inform the destination that an unplug request is pending : migrate the
unplug_requested field of the DRC in an optional subsection. This might
break backwards migration, but this is still better than ending with
an inconsistent guest.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158169248798.3465937.1108351365840514270.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We currently don't support hotplug of devices between boot and CAS. If
this happens a CAS reboot is triggered. We detect this during CAS using
the spapr_drc_needed() function which is essentially a VMStateDescription
.needed callback. Even if the condition for CAS reboot happens to be the
same as for DRC migration, it looks wrong to piggyback a migration helper
for this.
Introduce a helper with slightly more explicit name and use it in both CAS
and DRC migration code. Since a subsequent patch will enhance this helper
to cover the case of hot unplug, let's go for spapr_drc_transient(). While
here convert spapr_hotplugged_dev_before_cas() to the "transient" wording as
well.
This doesn't change any behaviour.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158169248180.3465937.9531405453362718771.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
'fdt' forgot to clean both e500 and pnv when we call 'system_reset' on ppc,
this patch fix it. The leak stacks are as follow:
Direct leak of 4194304 byte(s) in 4 object(s) allocated from:
#0 0x7fafe37dd970 in __interceptor_calloc (/lib64/libasan.so.5+0xef970)
#1 0x7fafe2e3149d in g_malloc0 (/lib64/libglib-2.0.so.0+0x5249d)
#2 0x561876f7f80d in create_device_tree /mnt/sdb/qemu-new/qemu/device_tree.c:40
#3 0x561876b7ac29 in ppce500_load_device_tree /mnt/sdb/qemu-new/qemu/hw/ppc/e500.c:364
#4 0x561876b7f437 in ppce500_reset_device_tree /mnt/sdb/qemu-new/qemu/hw/ppc/e500.c:617
#5 0x56187718b1ae in qemu_devices_reset /mnt/sdb/qemu-new/qemu/hw/core/reset.c:69
#6 0x561876f6938d in qemu_system_reset /mnt/sdb/qemu-new/qemu/vl.c:1412
#7 0x561876f6a25b in main_loop_should_exit /mnt/sdb/qemu-new/qemu/vl.c:1645
#8 0x561876f6a398 in main_loop /mnt/sdb/qemu-new/qemu/vl.c:1679
#9 0x561876f7da8e in main /mnt/sdb/qemu-new/qemu/vl.c:4438
#10 0x7fafde16b812 in __libc_start_main ../csu/libc-start.c:308
#11 0x5618765c055d in _start (/mnt/sdb/qemu-new/qemu/build/ppc64-softmmu/qemu-system-ppc64+0x2b1555d)
Direct leak of 1048576 byte(s) in 1 object(s) allocated from:
#0 0x7fc0a6f1b970 in __interceptor_calloc (/lib64/libasan.so.5+0xef970)
#1 0x7fc0a656f49d in g_malloc0 (/lib64/libglib-2.0.so.0+0x5249d)
#2 0x55eb05acd2ca in pnv_dt_create /mnt/sdb/qemu-new/qemu/hw/ppc/pnv.c:507
#3 0x55eb05ace5bf in pnv_reset /mnt/sdb/qemu-new/qemu/hw/ppc/pnv.c:578
#4 0x55eb05f2f395 in qemu_system_reset /mnt/sdb/qemu-new/qemu/vl.c:1410
#5 0x55eb05f43850 in main /mnt/sdb/qemu-new/qemu/vl.c:4403
#6 0x7fc0a18a9812 in __libc_start_main ../csu/libc-start.c:308
#7 0x55eb0558655d in _start (/mnt/sdb/qemu-new/qemu/build/ppc64-softmmu/qemu-system-ppc64+0x2b1555d)
Reported-by: Euler Robot <euler.robot@huawei.com>
Signed-off-by: Pan Nengyuan <pannengyuan@huawei.com>
Message-Id: <20200214033206.4395-1-pannengyuan@huawei.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This allows moving the kernel in the guest memory. The option is useful
for step debugging (as Linux is linked at 0x0); it also allows loading
grub which is normally linked to run at 0x20000.
This uses the existing kernel address by default.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20200203032943.121178-6-aik@ozlabs.ru>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This patch implements few of the necessary hcalls for the nvdimm support.
PAPR semantics is such that each NVDIMM device is comprising of multiple
SCM(Storage Class Memory) blocks. The guest requests the hypervisor to
bind each of the SCM blocks of the NVDIMM device using hcalls. There can
be SCM block unbind requests in case of driver errors or unplug(not
supported now) use cases. The NVDIMM label read/writes are done through
hcalls.
Since each virtual NVDIMM device is divided into multiple SCM blocks,
the bind, unbind, and queries using hcalls on those blocks can come
independently. This doesn't fit well into the qemu device semantics,
where the map/unmap are done at the (whole)device/object level granularity.
The patch doesnt actually bind/unbind on hcalls but let it happen at the
device_add/del phase itself instead.
The guest kernel makes bind/unbind requests for the virtual NVDIMM device
at the region level granularity. Without interleaving, each virtual NVDIMM
device is presented as a separate guest physical address range. So, there
is no way a partial bind/unbind request can come for the vNVDIMM in a
hcall for a subset of SCM blocks of a virtual NVDIMM. Hence it is safe to
do bind/unbind everything during the device_add/del.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Message-Id: <158131059899.2897.11515211602702956854.stgit@lep8c.aus.stglabs.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Add support for NVDIMM devices for sPAPR. Piggyback on existing nvdimm
device interface in QEMU to support virtual NVDIMM devices for Power.
Create the required DT entries for the device (some entries have
dummy values right now).
The patch creates the required DT node and sends a hotplug
interrupt to the guest. Guest is expected to undertake the normal
DR resource add path in response and start issuing PAPR SCM hcalls.
The device support is verified based on the machine version unlike x86.
This is how it can be used ..
Ex :
For coldplug, the device to be added in qemu command line as shown below
-object memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
-device nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0
For hotplug, the device to be added from monitor as below
object_add memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
device_add nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
[Early implementation]
Message-Id: <158131058078.2897.12767731856697459923.stgit@lep8c.aus.stglabs.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>