mirrors/qemu - qemu - SynapseOS git

Author	SHA1	Message	Date
David Gibson	3340e5c4f2	spapr: Use unplug_request for PCI hot unplug AIUI, ->unplug_request in the HotplugHandler is used for "soft" unplug, where acknowledgement from the guest is required before completing the unplug, whereas ->unplug is used for "hard" unplug where qemu unilaterally removes the device, and the guest just has to cope with its sudden absence. For spapr we (correctly) use ->unplug_request for CPU and memory hot unplug but we use ->unplug for PCI. While I think it might be possible to support "hard" PCI unplug within the PAPR model, that's not how it actually works now. Although it's called from ->unplug, the PCI unplug path will usually just mark the device for removal, with completion of the unplug delayed until userspace responds to the unplug notification. If the guest doesn't respond as expected, that could delay the unplug completion arbitrarily long. To reflect that, change the PCI unplug path to be called from ->unplug_request. We also rename spapr_phb_hot_plug_child() and spapr_phb_hot_unplug_child() to spapr_pci_plug() and spapr_pci_unplug_request() to more obviously reflect the callbacks they're implementing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org>	2017-07-11 11:04:02 +10:00
David Gibson	5c1da81215	spapr: Remove unnecessary differences between hotplug and coldplug paths spapr_drc_attach() has a 'coldplug' parameter which sets the DRC into configured state initially, instead of the usual ISOLATED/UNUSABLE state. It turns out this is unnecessary: although coldplugged devices do need to be in CONFIGURED state once the guest starts, that will already be accomplished by the reset code which will move DRCs for already plugged devices into a coldplug equivalent state. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org>	2017-07-11 11:04:01 +10:00
Laurent Vivier	e806b4db14	spapr: fix migration to pseries machine < 2.8 since commit `5c4537bd` ("spapr: Fix 2.7<->2.8 migration of PCI host bridge"), some migration fields are forged from the new ones in spapr_pci_pre_save(). It works well, except when the number of MSI devices is 0, because in this case the function exits immediately. This fix moves the migration code before the exit code. The problem can be reproduced with these commands: source qemu-2.9: qemu-system-ppc64 -monitor stdio -M pseries-2.6 -nodefaults -S destination qemu-2.6: qemu-system-ppc64 -monitor stdio -M pseries-2.6 -nodefaults \ -incoming tcp:0:4444 on the source: migrate tcp:localhost:4444 Destination fails with the following error: qemu-system-ppc64: error while loading state for instance 0x0 of device 'spapr_pci' qemu-system-ppc64: load of migration failed: Invalid argument Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2017-07-11 11:04:01 +10:00
Halil Pasic	d2164ad35c	vmstate: error hint for failed equal checks In some cases a failing VMSTATE_*_EQUAL does not mean we detected a bug, but it's actually the best we can do. Especially in these cases a verbose error message is required. Let's introduce infrastructure for specifying a error hint to be used if equal check fails. Let's do this by adding a parameter to the _EQUAL macros called _err_hint. Also change all current users to pass NULL as last parameter so nothing changes for them. Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com> Message-Id: <20170623144823.42936-1-pasic@linux.vnet.ibm.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2017-06-28 11:18:44 +02:00
David Gibson	6304fd27ef	spapr: Fold spapr_phb_{add,remove}_pci_device() into their only callers Both functions are fairly short, and so are their callers. There's no particular logical distinction between them, so fold them together. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Michael Roth <mdroth@linux.vnet.ibm.com>	2017-06-08 14:38:27 +10:00
David Gibson	0be4e88621	spapr: Change DRC attach & detach methods to functions DRC objects have attach & detach methods, but there's only one implementation. Although there are some differences in its behaviour for different DRC types, the overall structure is the same, so while we might want different method implementations for some parts, we're unlikely to want them for the top-level functions. So, replace them with direct function calls. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Michael Roth <mdroth@linux.vnet.ibm.com>	2017-06-08 14:38:26 +10:00
David Gibson	f224d35be9	spapr: Clean up DR entity sense handling DRC classes have an entity_sense method to determine (in a specific PAPR sense) the presence or absence of a device plugged into a DRC. However, we only have one implementation of the method, which explicitly tests for different DRC types. This changes it to instead have different method implementations for the two cases: "logical" and "physical" DRCs. While we're at it, the entity sense method always returns RTAS_OUT_SUCCESS, and the interesting value is returned via pass-by-reference. Simplify this to directly return the value we care about Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Michael Roth <mdroth@linux.vnet.ibm.com>	2017-06-08 14:38:26 +10:00
David Gibson	fbf5539718	spapr: Clean up spapr_dr_connector_by_() Change names to something less ludicrously verbose * Now that we have QOM subclasses for the different DRC types, use a QOM typename instead of a PAPR type value parameter The latter allows removal of the get_type_shift() helper. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Michael Roth <mdroth@linux.vnet.ibm.com>	2017-06-06 09:24:08 +10:00
David Gibson	2d33581899	spapr: Introduce DRC subclasses Currently we only have a single QOM type for all DRCs, but lots of places where we switch behaviour based on the DRC's PAPR defined type. This is a poor use of our existing type system. So, instead create QOM subclasses for each PAPR defined DRC type. We also introduce intermediate subclasses for physical and logical DRCs, a division which will be useful later on. Instead of being stored in the DRC object itself, the PAPR type is now stored in the class structure. There are still many places where we switch directly on the PAPR type value, but this at least provides the basis to start to remove those. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Michael Roth <mdroth@linux.vnet.ibm.com>	2017-06-06 09:23:46 +10:00
David Gibson	0b55aa91c9	spapr: Make DRC get_index and get_type methods into plain functions These two methods only have one implementation, and the spec they're implementing means any other implementation is unlikely, verging on impossible. So replace them with simple functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Tested-by: Daniel Barboza <danielhb@linux.vnet.ibm.com>	2017-06-06 08:53:24 +10:00
Daniel Henrique Barboza	318347234d	hw/ppc: removing drc->detach_cb and drc->detach_cb_opaque The pointer drc->detach_cb is being used as a way of informing the detach() function inside spapr_drc.c which cb to execute. This information can also be retrieved simply by checking drc->type and choosing the right callback based on it. In this context, detach_cb is redundant information that must be managed. After the previous spapr_lmb_release change, no detach_cb_opaques are being used by any of the three callbacks functions. This is yet another information that is now unused and, on top of that, can't be migrated either. This patch makes the following changes: - removal of detach_cb_opaque. the 'opaque' argument was removed from the callbacks and from the detach() function of sPAPRConnectorClass. The attribute detach_cb_opaque of sPAPRConnector was removed. - removal of detach_cb from the detach() call. The function pointer detach_cb of sPAPRConnector was removed. detach() now uses a switch(drc->type) to execute the apropriate callback. To achieve this, spapr_core_release, spapr_lmb_release and spapr_phb_remove_pci_device_cb callbacks were made public to be visible inside detach(). Signed-off-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2017-05-25 11:31:33 +10:00
Eduardo Habkost	e4f4fb1eca	sysbus: Set user_creatable=false by default on TYPE_SYS_BUS_DEVICE commit `33cd52b5d7` unset cannot_instantiate_with_device_add_yet in TYPE_SYSBUS, making all sysbus devices appear on "-device help" and lack the "no-user" flag in "info qdm". To fix this, we can set user_creatable=false by default on TYPE_SYS_BUS_DEVICE, but this requires setting user_creatable=true explicitly on the sysbus devices that actually work with -device. Fortunately today we have just a few has_dynamic_sysbus=1 machines: virt, pc-q35-, ppce500, and spapr. virt, ppce500, and spapr have extra checks to ensure just a few device types can be instantiated: virt supports only TYPE_VFIO_CALXEDA_XGMAC, TYPE_VFIO_AMD_XGBE. * ppce500 supports only TYPE_ETSEC_COMMON. * spapr supports only TYPE_SPAPR_PCI_HOST_BRIDGE. This patch sets user_creatable=true explicitly on those 4 device classes. Now, the more complex cases: pc-q35-: q35 has no sysbus device whitelist yet (which is a separate bug). We are in the process of fixing it and building a sysbus whitelist on q35, but in the meantime we can fix the "-device help" and "info qdm" bugs mentioned above. Also, despite not being strictly necessary for fixing the q35 bug, reducing the list of user_creatable=true devices will help us be more confident when building the q35 whitelist. xen: We also have a hack at xen_set_dynamic_sysbus(), that sets has_dynamic_sysbus=true at runtime when using the Xen accelerator. This hack is only used to allow xen-backend devices to be dynamically plugged/unplugged. This means today we can use -device with the following 22 device types, that are the ones compiled into the qemu-system-x86_64 and qemu-system-i386 binaries: allwinner-ahci * amd-iommu * cfi.pflash01 * esp * fw_cfg_io * fw_cfg_mem * generic-sdhci * hpet * intel-iommu * ioapic * isabus-bridge * kvmclock * kvm-ioapic * kvmvapic * SUNW,fdtwo * sysbus-ahci * sysbus-fdc * sysbus-ohci * unimplemented-device * virtio-mmio * xen-backend * xen-sysdev This patch adds user_creatable=true explicitly to those devices, temporarily, just to keep 100% compatibility with existing behavior of q35. Subsequent patches will remove user_creatable=true from the devices that are really not meant to user-creatable on any machine, and remove the FIXME comment from the ones that are really supposed to be user-creatable. This is being done in separate patches because we still don't have an obvious list of devices that will be whitelisted by q35, and I would like to get each device reviewed individually. Cc: Alexander Graf <agraf@suse.de> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Alistair Francis <alistair.francis@xilinx.com> Cc: Beniamino Galvani <b.galvani@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Cornelia Huck <cornelia.huck@de.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: "Edgar E. Iglesias" <edgar.iglesias@gmail.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Frank Blaschka <frank.blaschka@de.ibm.com> Cc: Gabriel L. Somlo <somlo@cmu.edu> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: John Snow <jsnow@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kevin Wolf <kwolf@redhat.com> Cc: Laszlo Ersek <lersek@redhat.com> Cc: Marcel Apfelbaum <marcel@redhat.com> Cc: Markus Armbruster <armbru@redhat.com> Cc: Max Reitz <mreitz@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Maydell <peter.maydell@linaro.org> Cc: Pierre Morel <pmorel@linux.vnet.ibm.com> Cc: Prasad J Pandit <pjp@fedoraproject.org> Cc: qemu-arm@nongnu.org Cc: qemu-block@nongnu.org Cc: qemu-ppc@nongnu.org Cc: Richard Henderson <rth@twiddle.net> Cc: Rob Herring <robh@kernel.org> Cc: Shannon Zhao <zhaoshenglong@huawei.com> Cc: sstabellini@kernel.org Cc: Thomas Huth <thuth@redhat.com> Cc: Yi Min Zhao <zyimin@linux.vnet.ibm.com> Acked-by: John Snow <jsnow@redhat.com> Acked-by: Juergen Gross <jgross@suse.com> Acked-by: Marcel Apfelbaum <marcel@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Message-Id: <20170503203604.31462-3-ehabkost@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> [ehabkost: Small changes at sysbus_device_class_init() comments] Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>	2017-05-17 10:37:01 -03:00
Alexey Kardashevskiy	c88fa6dd4a	spapr_pci: Removed unused include Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2017-04-26 12:00:41 +10:00
Alexey Kardashevskiy	a01f3432dd	spapr_pci: Warn when RAM page size is not enabled in IOMMU page mask If a page size used by QEMU is not enabled in the PHB IOMMU page mask, in-kernel acceleration of TCE handling won't be enabled and performance might be slower than expected. This prints a warning if system page size is not enabled. This should print a warning if huge pages are enabled but sphb.pgsz still uses the default value of 4K\|64K. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2017-04-26 12:00:41 +10:00
David Gibson	82516263ce	pseries: Don't expose PCIe extended config space on older machine types `bb9986452` "spapr_pci: Advertise access to PCIe extended config space" allowed guests to access the extended config space of PCI Express devices via the PAPR interfaces, even though the paravirtualized bus mostly acts like plain PCI. However, that patch enabled access unconditionally, including for existing machine types, which is an unwise change in behaviour. This patch limits the change to pseries-2.9 (and later) machine types. Suggested-by: Andrea Bolognani <abologna@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2017-03-14 11:54:17 +11:00
David Gibson	bb99864528	spapr_pci: Advertise access to PCIe extended config space The (paravirtual) PCI host bridge on the 'pseries' machine in most regards acts like a regular PCI bus, rather than a PCIe bus. Despite this, though, it does allow access to the PCIe extended config space. We already implemented the RTAS methods to allow this access.. but forgot to put the markers into the device tree so that guest's know it is there. This adds them in. With this, a pseries guest is able to view extended config space on (for example an e1000e device. This should be enough to allow guests to use at least some PCIe devices. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2017-03-03 11:30:59 +11:00
Cédric Le Goater	f7759e4331	ppc/xics: use the QOM interface to get irqs Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2017-03-01 11:23:39 +11:00
Cédric Le Goater	681bfaded6	ppc/xics: store the ICS object under the sPAPR machine A list of ICS objects was introduced under the XICS object for the PowerNV machine but, for the sPAPR machine, it brings extra complexity as there is only a single ICS. To simplify the code, let's add the ICS pointer under the sPAPR machine and try to reduce the use of this list where possible. Also, change the xics_spapr_*() routines to use an ICS object instead of an XICSState and change their name to reflect that these are specific to the sPAPR ICS object. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2017-03-01 11:23:39 +11:00
Greg Kurz	a8eeafda19	spapr/pci: populate PCI DT in reverse order Since commit `1d2d974244` "spapr_pci: enumerate and add PCI device tree", QEMU populates the PCI device tree in the opposite order compared to SLOF. Before `1d2d974244`: Populating /pci@800000020000000 00 0000 (D) : 1af4 1000 virtio [ net ] 00 0800 (D) : 1af4 1001 virtio [ block ] 00 1000 (D) : 1af4 1009 virtio [ network ] Populating /pci@800000020000000/unknown-legacy-device@2 7e5294b8 : /pci@800000020000000 7e52b998 : \|-- ethernet@0 7e52c0c8 : \|-- scsi@1 7e52c7e8 : +-- unknown-legacy-device@2 ok Since `1d2d974244`: Populating /pci@800000020000000 00 1000 (D) : 1af4 1009 virtio [ network ] Populating /pci@800000020000000/unknown-legacy-device@2 00 0800 (D) : 1af4 1001 virtio [ block ] 00 0000 (D) : 1af4 1000 virtio [ net ] 7e5e8118 : /pci@800000020000000 7e5ea6a0 : \|-- unknown-legacy-device@2 7e5eadb8 : \|-- scsi@1 7e5eb4d8 : +-- ethernet@0 ok This behaviour change is not actually a bug since no assumptions should be made on DT ordering. But it has no real justification either, other than being the consequence of the way fdt_add_subnode() inserts new elements to the front of the FDT rather than adding them to the tail. This patch reverts to the historical SLOF ordering by walking PCI devices in reverse order. This reconciles pseries with x86 machine types behavior. It is expected to make things easier when porting existing applications to power. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Tested-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> (slight update to the changelog) Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2017-03-01 11:23:39 +11:00
Laurent Vivier	2530a1a5cf	spapr: generate DT node names When DT node names for PCI devices are generated by SLOF, they are generated according to the type of the device (for instance, ethernet for virtio-net-pci device). Node name for hotplugged devices is generated by QEMU. This patch adds the mechanic to QEMU to create the node name according to the device type too. The data structure has been roughly copied from OpenBIOS/OpenHackware, node names from SLOF. Example: Hotplugging some PCI cards with QEMU monitor: device_add virtio-tablet-pci device_add virtio-serial-pci device_add virtio-mouse-pci device_add virtio-scsi-pci device_add virtio-gpu-pci device_add ne2k_pci device_add nec-usb-xhci device_add intel-hda What we can see in linux device tree: for dir in /proc/device-tree/pci@800000020000000/@/; do echo $dir cat $dir/name echo done WITHOUT this patch: /proc/device-tree/pci@800000020000000/pci@0/ pci /proc/device-tree/pci@800000020000000/pci@1/ pci /proc/device-tree/pci@800000020000000/pci@2/ pci /proc/device-tree/pci@800000020000000/pci@3/ pci /proc/device-tree/pci@800000020000000/pci@4/ pci /proc/device-tree/pci@800000020000000/pci@5/ pci /proc/device-tree/pci@800000020000000/pci@6/ pci /proc/device-tree/pci@800000020000000/pci@7/ pci WITH this patch: /proc/device-tree/pci@800000020000000/communication-controller@1/ communication-controller /proc/device-tree/pci@800000020000000/display@4/ display /proc/device-tree/pci@800000020000000/ethernet@5/ ethernet /proc/device-tree/pci@800000020000000/input-controller@0/ input-controller /proc/device-tree/pci@800000020000000/mouse@2/ mouse /proc/device-tree/pci@800000020000000/multimedia-device@7/ multimedia-device /proc/device-tree/pci@800000020000000/scsi@3/ scsi /proc/device-tree/pci@800000020000000/usb-xhci@6/ usb-xhci Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2017-03-01 11:23:39 +11:00
David Gibson	5c4537bded	spapr: Fix 2.7<->2.8 migration of PCI host bridge `daa2369` "spapr_pci: Add a 64-bit MMIO window" subtly broke migration from qemu-2.7 to the current version. It split the device's MMIO window into two pieces for 32-bit and 64-bit MMIO. The patch included backwards compatibility code to convert the old property into the new format. However, the property value was also transferred in the migration stream and compared with a (probably unwise) VMSTATE_EQUAL. So, the "raw" value from 2.7 is compared to the new style converted value from (pre-)2.8 giving a mismatch and migration failure. Along with the actual field that caused the breakage, there are several other ill-advised VMSTATE_EQUAL()s. To fix forwards migration, we read the values in the stream into scratch variables and ignore them, instead of comparing for equality. To fix backwards migration, we populate those scratch variables in pre_save() with adjusted values to match the old behaviour. To permit the eventual possibility of removing this cruft from the stream, we only include these compatibility fields if a new 'pre-2.8-migration' property is set. We clear it on the pseries-2.8 machine type, which obviously can't be migrated backwards, but set it on earlier machine type versions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>	2016-11-23 12:00:48 +11:00
David Gibson	5a78b821eb	Revert "spapr: Fix migration of PCI host bridges from qemu-2.7" This reverts commit `9b54ca0ba7`. The commit above corrected a migration breakage between qemu-2.7 and qemu-2.8. However it did so by advancing the migration version for the PCI host bridge, which obviously breaks migration backwards to earlier qemu versions. Although it's not totally essential, we'd like to maintain the possibility for backwards migration, so revert the change in preparation for a better fix. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>	2016-11-23 12:00:48 +11:00
David Gibson	9b54ca0ba7	spapr: Fix migration of PCI host bridges from qemu-2.7 `daa2369` "spapr_pci: Add a 64-bit MMIO window" subtly broke migration from qemu-2.7 to the current version. It split the device's MMIO window into two pieces for 32-bit and 64-bit MMIO. The patch included backwards compatibility code to convert the old property into the new format. However, the property value was also transferred in the migration stream and compared with a (probably unwise) VMSTATE_EQUAL. So, the "raw" value from 2.7 is compared to the new style converted value from (pre-)2.8 giving a mismatch and migration failure. Although it would be technically possible to fix this in a way allowing backwards migration, that would leave an ugly legacy around indefinitely. This patch takes the simpler approach of bumping the migration version, dropping the unwise VMSTATE_EQUAL (and some equally unwise ones around it) and ignoring them on an incoming migration. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>	2016-11-15 10:08:42 +11:00
Michael Roth	4bcfa56ca9	spapr_pci: advertise explicit numa IDs even when there's 1 node With the addition of "numa_node" properties for PHBs we began advertising NUMA affinity in cases where nb_numa_nodes > 1. Since the default on the guest side is to make no assumptions about PHB NUMA affinity (defaulting to -1), there is still a valid use-case for explicitly defining a PHB's NUMA affinity even when there's just one node. In particular, some workloads make faulty assumptions about /sys/bus/pci/<devid>/numa_node being >= 0, warranting the use of this property as a workaround even if there's just 1 PHB or NUMA node. Enable this use-case by always advertising the PHB's NUMA affinity if "numa_node" has been explicitly set. We could achieve this by relaxing the check to simply be nb_numa_nodes > 0, but even safer would be to check numa_info[nodeid].present explicitly, and to fail at start time for cases where it does not exist. This has an additional affect of no longer advertising PHB NUMA affinity unconditionally if nb_numa_nodes > 1 and "numa_node" property is unset/-1, but since the default value on the guest side for each PHB is also -1, the behavior should be the same for that situation. We could still retain the old behavior if desired, but the decision seems arbitrary, so we take the simpler route. Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Shivaprasad G. Bhat <shivapbh@in.ibm.com> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-10-28 09:36:58 +11:00
David Gibson	357d1e3bc7	spapr: Improved placement of PCI host bridges in guest memory map Currently, the MMIO space for accessing PCI on pseries guests begins at 1 TiB in guest address space. Each PCI host bridge (PHB) has a 64 GiB chunk of address space in which it places its outbound PIO and 32-bit and 64-bit MMIO windows. This scheme as several problems: - It limits guest RAM to 1 TiB (though we have a limited fix for this now) - It limits the total MMIO window to 64 GiB. This is not always enough for some of the large nVidia GPGPU cards - Putting all the windows into a single 64 GiB area means that naturally aligning things within there will waste more address space. In addition there was a miscalculation in some of the defaults, which meant that the MMIO windows for each PHB actually slightly overran the 64 GiB region for that PHB. We got away without nasty consequences because the overrun fit within an unused area at the beginning of the next PHB's region, but it's not pretty. This patch implements a new scheme which addresses those problems, and is also closer to what bare metal hardware and pHyp guests generally use. Because some guest versions (including most current distro kernels) can't access PCI MMIO above 64 TiB, we put all the PCI windows between 32 TiB and 64 TiB. This is broken into 1 TiB chunks. The first 1 TiB contains the PIO (64 kiB) and 32-bit MMIO (2 GiB) windows for all of the PHBs. Each subsequent TiB chunk contains a naturally aligned 64-bit MMIO window for one PHB each. This reduces the number of allowed PHBs (without full manual configuration of all the windows) from 256 to 31, but this should still be plenty in practice. We also change some of the default window sizes for manually configured PHBs to saner values. Finally we adjust some tests and libqos so that it correctly uses the new default locations. Ideally it would parse the device tree given to the guest, but that's a more complex problem for another time. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com>	2016-10-16 12:04:15 +11:00
David Gibson	daa2369903	spapr_pci: Add a 64-bit MMIO window On real hardware, and under pHyp, the PCI host bridges on Power machines typically advertise two outbound MMIO windows from the guest's physical memory space to PCI memory space: - A 32-bit window which maps onto 2GiB..4GiB in the PCI address space - A 64-bit window which maps onto a large region somewhere high in PCI address space (traditionally this used an identity mapping from guest physical address to PCI address, but that's not always the case) The qemu implementation in spapr-pci-host-bridge, however, only supports a single outbound MMIO window, however. At least some Linux versions expect the two windows however, so we arranged this window to map onto the PCI memory space from 2 GiB..~64 GiB, then advertised it as two contiguous windows, the "32-bit" window from 2G..4G and the "64-bit" window from 4G..~64G. This approach means, however, that the 64G window is not naturally aligned. In turn this limits the size of the largest BAR we can map (which does have to be naturally aligned) to roughly half of the total window. With some large nVidia GPGPU cards which have huge memory BARs, this is starting to be a problem. This patch adds true support for separate 32-bit and 64-bit outbound MMIO windows to the spapr-pci-host-bridge implementation, each of which can be independently configured. The 32-bit window always maps to 2G.. in PCI space, but the PCI address of the 64-bit window can be configured (it defaults to the same as the guest physical address). So as not to break possible existing configurations, as long as a 64-bit window is not specified, a large single window can be specified. This will appear the same way to the guest as the old approach, although it's now implemented by two contiguous memory regions rather than a single one. For now, this only adds the possibility of 64-bit windows. The default configuration still uses the legacy mode. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com>	2016-10-16 12:03:09 +11:00
David Gibson	6737d9ad79	spapr_pci: Delegate placement of PCI host bridges to machine type The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB) for a PAPR guest. Unlike on x86, it's routine on Power (both bare metal and PAPR guests) to have numerous independent PHBs, each controlling a separate PCI domain. There are two ways of configuring the spapr-pci-host-bridge device: first it can be done fully manually, specifying the locations and sizes of all the IO windows. This gives the most control, but is very awkward with 6 mandatory parameters. Alternatively just an "index" can be specified which essentially selects from an array of predefined PHB locations. The PHB at index 0 is automatically created as the default PHB. The current set of default locations causes some problems for guests with large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia GPGPU cards via VFIO). Obviously, for migration we can only change the locations on a new machine type, however. This is awkward, because the placement is currently decided within the spapr-pci-host-bridge code, so it breaks abstraction to look inside the machine type version. So, this patch delegates the "default mode" PHB placement from the spapr-pci-host-bridge device back to the machine type via a public method in sPAPRMachineClass. It's still a bit ugly, but it's about the best we can do. For now, this just changes where the calculation is done. It doesn't change the actual location of the host bridges, or any other behaviour. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com>	2016-10-16 12:03:09 +11:00
Benjamin Herrenschmidt	cc706a5305	ppc/xics: Make the ICSState a list Instead of an array of fixed sized blocks, use a list, as we will need to have sources with variable number of interrupts. SPAPR only uses a single entry. Native will create more. If performance becomes an issue we can add some hashed lookup but for now this will do fine. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [ move the initialization of list to xics_common_initfn, restore xirr_owner after migration and move restoring to icp_post_load] Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> [ clg: removed the icp_post_load() changes from nikunj patchset v3: http://patchwork.ozlabs.org/patch/646008/ ] Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-10-14 16:31:02 +11:00
Alexey Kardashevskiy	4814401fa0	spapr_pci: Add numa node id This adds a numa id property to a PHB to allow linking passed PCI device to CPU/memory. It is up to the management stack to do CPU/memory pinning to the node with the actual PCI device. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [dwg: Renamed property from "node" to "numa_node" to match the similar one in the pxb device] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-09-23 12:39:07 +10:00
Alexey Kardashevskiy	ae4de14cd3	spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) This adds support for Dynamic DMA Windows (DDW) option defined by the SPAPR specification which allows to have additional DMA window(s) The "ddw" property is enabled by default on a PHB but for compatibility the pseries-2.6 machine and older disable it. This also creates a single DMA window for the older machines to maintain backward migration. This implements DDW for PHB with emulated and VFIO devices. The host kernel support is required. The advertised IOMMU page sizes are 4K and 64K; 16M pages are supported but not advertised by default, in order to enable them, the user has to specify "pgsz" property for PHB and enable huge pages for RAM. The existing linux guests try creating one additional huge DMA window with 64K or 16MB pages and map the entire guest RAM to. If succeeded, the guest switches to dma_direct_ops and never calls TCE hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM and not waste time on map/unmap later. This adds a "dma64_win_addr" property which is a bus address for the 64bit window and by default set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware uses and this allows having emulated and VFIO devices on the same bus. This adds 4 RTAS handlers: * ibm,query-pe-dma-window * ibm,create-pe-dma-window * ibm,remove-pe-dma-window * ibm,reset-pe-dma-window These are registered from type_init() callback. These RTAS handlers are implemented in a separate file to avoid polluting spapr_iommu.c with PCI. This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs and updates all references to dma_liobn. However this does not add 64bit LIOBN to the migration stream as in fact even 32bit LIOBN is rather pointless there (as it is a PHB property and the management software can/should pass LIOBNs via CLI) but we keep it for the backward migration support. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-07-05 14:31:08 +10:00
Alexey Kardashevskiy	606b54986d	spapr_iommu: Realloc guest visible TCE table when starting/stopping listening The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU - a guest view of the table and a hardware TCE table. If there is no VFIO presense in the address space, then just the guest view is used, if this is the case, it is allocated in the KVM. However since there is no support yet for VFIO in KVM TCE hypercalls, when we start using VFIO, we need to move the guest view from KVM to the userspace; and we need to do this for every IOMMU on a bus with VFIO devices. This implements the callbacks for the sPAPR IOMMU - notify_started() reallocated the guest view to the user space, notify_stopped() does the opposite. This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug path as the new callbacks do this better - they notify IOMMU at the exact moment when the configuration is changed, and this also includes the case of PCI hot unplug. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-07-05 10:43:02 +10:00
Benjamin Herrenschmidt	27f2458245	ppc/xics: Replace "icp" with "xics" in most places The "ICP" is a different object than the "XICS". For historical reasons, we have a number of places where we name a variable "icp" while it contains a XICSState pointer. There is an ICPState structure too so this makes the code really confusing. This is a mechanical replacement of all those instances to use the name "xics" instead. There should be no functional change. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [spapr_cpu_init has been moved to spapr_cpu_core.c, change there] Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-07-01 13:41:47 +10:00
Benjamin Herrenschmidt	161deaf225	ppc/xics: Rename existing xics to xics_spapr The common class doesn't change, the KVM one is sPAPR specific. Rename variables and functions to xics_spapr. Retain the type name as "xics" to preserve migration for existing sPAPR guests. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-07-01 13:41:46 +10:00
Markus Armbruster	679dd415bb	spapr_pci: Drop cannot_instantiate_with_device_add_yet=false It's become redundant since it was added in commit `09aa9a5` "spapr-pci: enable adding PHB via -device". Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-06-07 10:17:45 +10:00
Alexey Kardashevskiy	b3162f22cb	spapr_pci: Add and export DMA resetting helper This will be later used by the "ibm,reset-pe-dma-window" RTAS handler which resets the DMA configuration to the defaults. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-06-07 10:17:45 +10:00
Alexey Kardashevskiy	acf1b6dd22	spapr_pci: Reset DMA config on PHB reset LoPAPR dictates that during system reset all DMA windows must be removed and the default DMA32 window must be created so does the patch. At the moment there is just one window supported so no change in behaviour is expected. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-06-07 10:17:45 +10:00
Alexey Kardashevskiy	b4b6eb771a	spapr_iommu: Add root memory region We are going to have multiple DMA windows at different offsets on a PCI bus. For the sake of migration, we will have as many TCE table objects pre-created as many windows supported. So we need a way to map windows dynamically onto a PCI bus when migration of a table is completed but at this stage a TCE table object does not have access to a PHB to ask it to map a DMA window backed by just migrated TCE table. This adds a "root" memory region (UINT64_MAX long) to the TCE object. This new region is mapped on a PCI bus with enabled overlapping as there will be one root MR per TCE table, each of them mapped at 0. The actual IOMMU memory region is a subregion of the root region and a TCE table enables/disables this subregion and maps it at the specific offset inside the root MR which is 1:1 mapping of a PCI address space. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-06-07 10:17:45 +10:00
Alexey Kardashevskiy	df7625d422	spapr_iommu: Introduce "enabled" state for TCE table Currently TCE tables are created once at start and their sizes never change. We are going to change that by introducing a Dynamic DMA windows support where DMA configuration may change during the guest execution. This changes spapr_tce_new_table() to create an empty zero-size IOMMU memory region (IOMMU MR). Only LIOBN is assigned by the time of creation. It still will be called once at the owner object (VIO or PHB) creation. This introduces an "enabled" state for TCE table objects, some helper functions are added: - spapr_tce_table_enable() receives TCE table parameters, stores in sPAPRTCETable and allocates a guest view of the TCE table (in the user space or KVM) and sets the correct size on the IOMMU MR; - spapr_tce_table_disable() disposes the table and resets the IOMMU MR size; it is made public as the following DDW code will be using it. This changes the PHB reset handler to do the default DMA initialization instead of spapr_phb_realize(). This does not make differenct now but later with more than just one DMA window, we will have to remove them all and create the default one on a system reset. No visible change in behaviour is expected except the actual table will be reallocated every reset. We might optimize this later. The other way to implement this would be dynamically create/remove the TCE table QOM objects but this would make migration impossible as the migration code expects all QOM objects to exist at the receiver so we have to have TCE table objects created when migration begins. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-06-07 10:17:45 +10:00
Alexey Kardashevskiy	eded5bac3b	spapr_pci: Use correct DMA LIOBN when composing the device tree The user could have picked LIOBN via the CLI but the device tree rendering code would still use the value derived from the PHB index (which is the default fallback if LIOBN is not set in the CLI). This replaces SPAPR_PCI_LIOBN() with the actual DMA LIOBN value. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-05-27 09:40:23 +10:00
Jianjun Duan	5dd5238c0b	spapr: ensure device trees are always associated with DRC There are possible racing situations involving hotplug events and guest migration. For cases where a hotplug event is migrated, or the guest is in the process of fetching device tree at the time of migration, we need to ensure the device tree is created and associated with the corresponding DRC for devices that were hotplugged on the source, but 'coldplugged' on the target. Signed-off-by: Jianjun Duan <duanj@linux.vnet.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-05-27 09:40:23 +10:00
Paolo Bonzini	77ac58ddc6	dma: do not depend on kvm_enabled() Memory barriers are needed also by Xen and, when the ioeventfd bugs are fixed, by TCG as well. sysemu/kvm.h is not anymore needed in sysemu/dma.h, move it to the actual users. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-05-19 16:42:28 +02:00
Thomas Huth	da34fed707	hw/ppc/spapr: Fix crash when specifying bad parameters to spapr-pci-host-bridge QEMU currently crashes when using bad parameters for the spapr-pci-host-bridge device: $ qemu-system-ppc64 -device spapr-pci-host-bridge,buid=0x123,liobn=0x321,mem_win_addr=0x1,io_win_addr=0x10 Segmentation fault The problem is that spapr_tce_find_by_liobn() might return NULL, but the code in spapr_populate_pci_dt() does not check for this condition and then tries to dereference this NULL pointer. Apart from that, the return value of spapr_populate_pci_dt() also has to be checked for all PCI buses, not only for the last one, to make sure we catch all errors. Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-04-23 16:52:20 +10:00
Paolo Bonzini	4771d756f4	hw: explicitly include qemu-common.h and cpu.h Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-22 22:20:17 +01:00
Markus Armbruster	da34e65cb4	include/qemu/osdep.h: Don't include qapi/error.h Commit `57cb38b` included qapi/error.h into qemu/osdep.h to get the Error typedef. Since then, we've moved to include qemu/osdep.h everywhere. Its file comment explains: "To avoid getting into possible circular include dependencies, this file should not include any other QEMU headers, with the exceptions of config-host.h, compiler.h, os-posix.h and os-win32.h, all of which are doing a similar job to this file and are under similar constraints." qapi/error.h doesn't do a similar job, and it doesn't adhere to similar constraints: it includes qapi-types.h. That's in excess of 100KiB of crap most .c files don't actually need. Add the typedef to qemu/typedefs.h, and include that instead of qapi/error.h. Include qapi/error.h in .c files that need it and don't get it now. Include qapi-types.h in qom/object.h for uint16List. Update scripts/clean-includes accordingly. Update it further to match reality: replace config.h by config-target.h, add sysemu/os-posix.h, sysemu/os-win32.h. Update the list of includes in the qemu/osdep.h comment quoted above similarly. This reduces the number of objects depending on qapi/error.h from "all of them" to less than a third. Unfortunately, the number depending on qapi-types.h shrinks only a little. More work is needed for that one. Signed-off-by: Markus Armbruster <armbru@redhat.com> [Fix compilation without the spice devel packages. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-22 22:20:15 +01:00
David Gibson	a36304fdca	spapr_pci: Remove finish_realize hook Now that spapr-pci-vfio-host-bridge is reduced to just a stub, there is only one implementation of the finish_realize hook in sPAPRPHBClass. So, we can fold that implementation into its (single) caller, and remove the hook. That's the last thing left in sPAPRPHBClass, so that can go away as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>	2016-03-16 09:55:11 +11:00
David Gibson	c1fa017c7e	spapr_pci: Allow EEH on spapr-pci-host-bridge Now that the EEH code is independent of the special spapr-vfio-pci-host-bridge device, we can allow it on all spapr PCI host bridges instead. We do this by changing spapr_phb_eeh_available() to be based on the vfio_eeh_as_ok() call instead of the host bridge class. Because the value of vfio_eeh_as_ok() can change with devices being hotplugged or unplugged, this can potentially lead to some strange edge cases where the guest starts using EEH, then it starts failing because of a change in status. However, it's not really any worse than the current situation. Cases that would have worked previously will still work (i.e. VFIO devices from at most one VFIO IOMMU group per vPHB), it's just that it's no longer necessary to use spapr-vfio-pci-host-bridge with the groupid pre-specified. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>	2016-03-16 09:55:11 +11:00
David Gibson	fbb4e98341	spapr_pci: Eliminate class callbacks The EEH operations in the spapr-vfio-pci-host-bridge no longer rely on the special groupid field in sPAPRPHBVFIOState. So we can simplify, removing the class specific callbacks with direct calls based on a simple spapr_phb_eeh_enabled() helper. For now we implement that in terms of a boolean in the class, but we'll continue to clean that up later. On its own this is a rather strange way of doing things, but it's a useful intermediate step to further cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>	2016-03-16 09:55:10 +11:00
Michael Roth	788d2599de	spapr_pci: fix multifunction hotplug Since `3f1e147`, QEMU has adopted a convention of supporting function hotplug by deferring hotplug events until func 0 is hotplugged. This is likely how management tools like libvirt would expose such support going forward. Since sPAPR guests rely on per-func events rather than slot-based, our protocol has been to hotplug func 0 first to avoid cases where devices appear within guests without func 0 present to avoid undefined behavior. To remain compatible with new convention, defer hotplug in a similar manner, but then generate events in 0-first order as we did in the past. Once func 0 present, fail any attempts to plug additional functions (as we do with PCIe). For unplug, defer unplug operations in a similar manner, but generate unplug events such that function 0 is removed last in guest. Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-03-16 09:55:05 +11:00
Michael S. Tsirkin	226419d615	msi_supported -> msi_nonbroken Rename controller flag to make it clearer what it means. Add some documentation as well. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2016-03-11 16:45:21 +02:00
Greg Kurz	a005b3ef50	xics: report errors with the QEMU Error API Using the return value to report errors is error prone: - xics_alloc() returns -1 on error but spapr_vio_busdev_realize() errors on 0 - xics_alloc_block() returns the unclear value of ics->offset - 1 on error but both rtas_ibm_change_msi() and spapr_phb_realize() error on 0 This patch adds an errp argument to xics_alloc() and xics_alloc_block() to report errors. The return value of these functions is a valid IRQ number if errp is NULL. It is undefined otherwise. The corresponding error traces get promotted to error messages. Note that the "can't allocate IRQ" error message in spapr_vio_busdev_realize() also moves to xics_alloc(). Similar error message consolidation isn't really applicable to xics_alloc_block() because callers have extra context (device config address, MSI or MSIX). This fixes the issues mentioned above. Based on previous work from Brian W. Hart. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-02-28 16:19:02 +11:00

1 2 3

129 Commits