The resizeable memory region / RAMBlock that is created for the cmd blob
has a maximum size of whole host pages (e.g., 4k), because RAMBlocks
work on full host pages. In addition, in i386 ACPI code:
acpi_align_size(tables->linker->cmd_blob, ACPI_BUILD_ALIGN_SIZE);
makes sure to align to multiples of 4k, padding with 0.
For example, if our cmd_blob is created with a size of 2k, the maximum
size is 4k - we cannot grow beyond that. Growing might be required
due to guest action when rebuilding the tables, but also on incoming
migration.
This automatic generation of the maximum size used to be sufficient,
however, there are cases where we cross host pages now when growing at
runtime: we exceed the maximum size of the RAMBlock and can crash QEMU when
trying to resize the resizeable memory region / RAMBlock:
$ build/qemu-system-x86_64 --enable-kvm \
-machine q35,nvdimm=on \
-smp 1 \
-cpu host \
-m size=2G,slots=8,maxmem=4G \
-object memory-backend-file,id=mem0,mem-path=/tmp/nvdimm,size=256M \
-device nvdimm,label-size=131072,memdev=mem0,id=nvdimm0,slot=1 \
-nodefaults \
-device vmgenid \
-device intel-iommu
Results in:
Unexpected error in qemu_ram_resize() at ../softmmu/physmem.c:1850:
qemu-system-x86_64: Size too large: /rom@etc/table-loader:
0x2000 > 0x1000: Invalid argument
In this configuration, we consume exactly 4k (32 entries, 128 bytes each)
when creating the VM. However, once the guest boots up and maps the MCFG,
we also create the MCFG table and end up consuming 2 additional entries
(pointer + checksum) -- which is where we try resizing the memory region
/ RAMBlock, however, the maximum size does not allow for it.
Currently, we get the following maximum sizes for our different
mutable tables based on behavior of resizeable RAMBlock:
hw table max_size
------- ---------------------------------------------------------
virt "etc/acpi/tables" ACPI_BUILD_TABLE_MAX_SIZE (0x200000)
virt "etc/table-loader" HOST_PAGE_ALIGN(initial_size)
virt "etc/acpi/rsdp" HOST_PAGE_ALIGN(initial_size)
i386 "etc/acpi/tables" ACPI_BUILD_TABLE_MAX_SIZE (0x200000)
i386 "etc/table-loader" HOST_PAGE_ALIGN(initial_size)
i386 "etc/acpi/rsdp" HOST_PAGE_ALIGN(initial_size)
microvm "etc/acpi/tables" ACPI_BUILD_TABLE_MAX_SIZE (0x200000)
microvm "etc/table-loader" HOST_PAGE_ALIGN(initial_size)
microvm "etc/acpi/rsdp" HOST_PAGE_ALIGN(initial_size)
Let's set the maximum table size for "etc/table-loader" to 64k, so we
can properly grow at runtime, which should be good enough for the future.
Migration is not concerned with the maximum size of a RAMBlock, only
with the used size - so existing setups are not affected. Of course, we
cannot migrate a VM that would have crash when started on older QEMU from
new QEMU to older QEMU without failing early on the destination when
synchronizing the RAM state:
qemu-system-x86_64: Size too large: /rom@etc/table-loader: 0x2000 > 0x1000: Invalid argument
qemu-system-x86_64: error while loading state for instance 0x0 of device 'ram'
qemu-system-x86_64: load of migration failed: Invalid argument
We'll refactor the code next, to make sure we get rid of this implicit
behavior for "etc/acpi/rsdp" as well and to make the code easier to
grasp.
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Cc: Alistair Francis <alistair.francis@xilinx.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Shannon Zhao <shannon.zhaosl@gmail.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210304105554.121674-2-david@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Implement _DSM according to:
PCI Firmware Specification 3.1
4.6.7. DSM for Naming a PCI or PCI Express Device Under
Operating Systems
and wire it up to cold and hot-plugged PCI devices.
Feature depends on ACPI hotplug being enabled (as that provides
PCI devices descriptions in ACPI and MMIO registers that are
reused to fetch acpi-index).
acpi-index should work for
- cold plugged NICs:
$QEMU -device e1000,acpi-index=100
=> 'eno100'
- hot-plugged
(monitor) device_add e1000,acpi-index=200,id=remove_me
=> 'eno200'
- re-plugged
(monitor) device_del remove_me
(monitor) device_add e1000,acpi-index=1
=> 'eno1'
Windows also sees index under "PCI Label Id" field in properties
dialog but otherwise it doesn't seem to have any effect.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20210315180102.3008391-6-imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
it will be used by follow up patches
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20210315180102.3008391-5-imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
it helps to avoid device naming conflicts when guest OS is
configured to use acpi-index for naming.
Spec ialso says so:
PCI Firmware Specification Revision 3.2
4.6.7. _DSM for Naming a PCI or PCI Express Device Under Operating Systems
"
Instance number must be unique under \_SB scope. This instance number does not have to
be sequential in a given system configuration.
"
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20210315180102.3008391-4-imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In x86/ACPI world, linux distros are using predictable
network interface naming since systemd v197. Which on
QEMU based VMs results into path based naming scheme,
that names network interfaces based on PCI topology.
With itm on has to plug NIC in exactly the same bus/slot,
which was used when disk image was first provisioned/configured
or one risks to loose network configuration due to NIC being
renamed to actually used topology.
That also restricts freedom to reshape PCI configuration of
VM without need to reconfigure used guest image.
systemd also offers "onboard" naming scheme which is
preferred over PCI slot/topology one, provided that
firmware implements:
"
PCI Firmware Specification 3.1
4.6.7. DSM for Naming a PCI or PCI Express Device Under
Operating Systems
"
that allows to assign user defined index to PCI device,
which systemd will use to name NIC. For example, using
-device e1000,acpi-index=100
guest will rename NIC to 'eno100', where 'eno' is default
prefix for "onboard" naming scheme. This doesn't require
any advance configuration on guest side to com in effect
at 'onboard' scheme takes priority over path based naming.
Hope is that 'acpi-index' it will be easier to consume by
management layer, compared to forcing specific PCI topology
and/or having several disk image templates for different
topologies and will help to simplify process of spawning
VM from the same template without need to reconfigure
guest NIC.
This patch adds, 'acpi-index'* property and wires up
a 32bit register on top of pci hotplug register block
to pass index value to AML code at runtime.
Following patch will add corresponding _DSM code and
wire it up to PCI devices described in ACPI.
*) name comes from linux kernel terminology
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20210315180102.3008391-3-imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
ret in virtio_pmem_resp is a uint32_t variable, which should be assigned
using virtio_stl_p.
The kernel side driver does not guarantee virtio_pmem_resp to be initialized
to zero in advance, So sometimes the flush operation will fail.
Signed-off-by: Wang Liang <wangliangzz@inspur.com>
Message-Id: <20210317024145.271212-1-wangliangzz@126.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Now that everything is in place, have the nested event loop to monitor
the slave channel. The source in the main event loop is destroyed and
recreated to ensure any pending even for the slave channel that was
previously detected is purged. This guarantees that the main loop
wont invoke slave_read() based on an event that was already handled
by the nested loop.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210312092212.782255-7-groug@kaod.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
A deadlock condition potentially exists if a vhost-user process needs
to request something to QEMU on the slave channel while processing a
vhost-user message.
This doesn't seem to affect any vhost-user implementation so far, but
this is currently biting the upcoming enablement of DAX with virtio-fs.
The issue is being observed when the guest does an emergency reboot while
a mapping still exits in the DAX window, which is very easy to get with
a busy enough workload (e.g. as simulated by blogbench [1]) :
- QEMU sends VHOST_USER_GET_VRING_BASE to virtiofsd.
- In order to complete the request, virtiofsd then asks QEMU to remove
the mapping on the slave channel.
All these dialogs are synchronous, hence the deadlock.
As pointed out by Stefan Hajnoczi:
When QEMU's vhost-user master implementation sends a vhost-user protocol
message, vhost_user_read() does a "blocking" read during which slave_fd
is not monitored by QEMU.
The natural solution for this issue is an event loop. The main event
loop cannot be nested though since we have no guarantees that its
fd handlers are prepared for re-entrancy.
Introduce a new event loop that only monitors the chardev I/O for now
in vhost_user_read() and push the actual reading to a one-shot handler.
A subsequent patch will teach the loop to monitor and process messages
from the slave channel as well.
[1] https://github.com/jedisct1/Blogbench
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210312092212.782255-6-groug@kaod.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The slave channel is implemented with socketpair() : QEMU creates
the pair, passes one of the socket to virtiofsd and monitors the
other one with the main event loop using qemu_set_fd_handler().
In order to fix a potential deadlock between QEMU and a vhost-user
external process (e.g. virtiofsd with DAX), we want to be able to
monitor and service the slave channel while handling vhost-user
requests.
Prepare ground for this by converting the slave channel to be a
QIOChannelSocket. This will make monitoring of the slave channel
as simple as calling qio_channel_add_watch_source(). Since the
connection is already established between the two sockets, only
incoming I/O (G_IO_IN) and disconnect (G_IO_HUP) need to be
serviced.
This also allows to get rid of the ancillary data parsing since
QIOChannelSocket can do this for us. Note that the MSG_CTRUNC
check is dropped on the way because QIOChannelSocket ignores this
case. This isn't a problem since slave_read() provisions space for
8 file descriptors, but affected vhost-user slave protocol messages
generally only convey one. If for some reason a buggy implementation
passes more file descriptors, no need to break the connection, just
like we don't break it if some other type of ancillary data is
received : this isn't explicitely violating the protocol per-se so
it seems better to ignore it.
The current code errors out on short reads and writes. Use the
qio_channel_*_all() variants to address this on the way.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210312092212.782255-5-groug@kaod.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210312092212.782255-4-groug@kaod.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Some message types, e.g. VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG,
can convey file descriptors. These must be closed before returning
from slave_read() to avoid being leaked. This can currently be done
in two different places:
[1] just after the request has been processed
[2] on the error path, under the goto label err:
These path are supposed to be mutually exclusive but they are not
actually. If the VHOST_USER_NEED_REPLY_MASK flag was passed and the
sending of the reply fails, both [1] and [2] are performed with the
same descriptor values. This can potentially cause subtle bugs if one
of the descriptor was recycled by some other thread in the meantime.
This code duplication complicates rollback for no real good benefit.
Do the closing in a unique place, under a new fdcleanup: goto label
at the end of the function.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210312092212.782255-3-groug@kaod.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
slave_read() checks EAGAIN when reading or writing to the socket
fails. This gives the impression that the slave channel is in
non-blocking mode, which is certainly not the case with the current
code base. And the rest of the code isn't actually ready to cope
with non-blocking I/O.
Just drop the checks everywhere in this function for the sake of
clarity.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210312092212.782255-2-groug@kaod.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Both functions don't check the personality of the interface (legacy or
modern) before accessing the configuration memory and always use
virtio_config_readX()/virtio_config_writeX().
With this patch, they now check the personality and in legacy mode
call virtio_config_readX()/virtio_config_writeX(), otherwise call
virtio_config_modern_readX()/virtio_config_modern_writeX().
This change has been tested with virtio-mmio guests (virt stretch/armhf and
virt sid/m68k) and virtio-pci guests (pseries RHEL-7.3/ppc64 and /ppc64le).
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20210314200300.3259170-1-laurent@vivier.eu>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Several QOM type names contain ',':
ARM,bitband-memory
etraxfs,pic
etraxfs,serial
etraxfs,timer
fsl,imx25
fsl,imx31
fsl,imx6
fsl,imx6ul
fsl,imx7
grlib,ahbpnp
grlib,apbpnp
grlib,apbuart
grlib,gptimer
grlib,irqmp
qemu,register
SUNW,bpp
SUNW,CS4231
SUNW,DBRI
SUNW,DBRI.prom
SUNW,fdtwo
SUNW,sx
SUNW,tcx
xilinx,zynq_slcr
xlnx,zynqmp
xlnx,zynqmp-pmu-soc
xlnx,zynq-xadc
These are all device types. They can't be plugged with -device /
device_add, except for xlnx,zynqmp-pmu-soc, and I doubt that one
actually works.
They *can* be used with -device / device_add to request help.
Usability is poor, though: you have to double the comma, like this:
$ qemu-system-x86_64 -device SUNW,,fdtwo,help
Trap for the unwary. The fact that this was broken in
device-introspect-test for more than six years until commit e27bd49876
fixed it demonstrates that "the unwary" includes seasoned developers.
One QOM type name contains ' ': "ICH9 SMB". Because having to
remember just one way to quote would be too easy.
Rename the "SUNW,FOO types to "sun-FOO". Summarily replace ',' and '
' by '-' in the other type names.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20210304140229.575481-2-armbru@redhat.com>
Reviewed-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
The previous commit rendered the name fdctrl_connect_drives() somewhat
misleading. Get rid of it by inlining the (now pretty simple)
function into its only caller.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20210309161214.1402527-4-armbru@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
Drop the crap deprecated in commit 4a27a638e7 "fdc: Deprecate
configuring floppies with -global isa-fdc" (v5.1.0).
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20210309161214.1402527-3-armbru@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
Some compiler versions are smart enough to detect a potentially
uninitialized variable, but are not smart enough to detect that this
cannot happen due to the code flow:
../hw/intc/i8259.c: In function ‘pic_read_irq’:
../hw/intc/i8259.c:203:13: error: ‘irq2’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
203 | irq = irq2 + 8;
| ~~~~^~~~~~~~~~
Restrict irq2 variable use to the inner statement.
Fixes: 78ef2b6989 ("i8259: Reorder intack in pic_read_irq")
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210318163059.3686596-1-philmd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- QAPIfy object-add and --object
- stream: Fail gracefully if permission is denied
- storage-daemon: Fix crash on quit when job is still running
- curl: Fix use after free
- char: Deprecate backend aliases, fix QMP query-chardev-backends
- Fix image creation option defaults that exist in both the format and
the protocol layer (e.g. 'cluster_size' in qcow2 and rbd; the qcow2
default was incorrectly applied to the rbd layer)
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmBUbF4RHGt3b2xmQHJl
ZGhhdC5jb20ACgkQfwmycsiPL9YX2Q//Ve6++hRulIuJVuh8QDxlmGWERqey/ClX
mUqGDOkSSXfftPTDPCYSUFE7QD6HD25oJmUTix2B2P89AIyDcvqvthMDU/j8clor
X3Kx03ky3NLJilZNdYZ2GOMyljgNP3JSrDHBjc/tZx+1e7C5tPNVxXOUW946wIC9
no6xTAarAANl/GS23ZI+vJ3PBEggzAbu6t/hwT//WAB0WB9wFhkCCzWPXIkdBXwP
QpG8chTwuwFAW1c52F0OeQV5FpM0bcMtxYASuNq0HPL6B8qUdKOusTRgTB/fjLLV
tYMhc6tzPLUlin1mGD4m0P+9tRMBFtF/flZVwbd4S+avcAbV2L5S6Xq0QsiNTbx2
oQUk6N2/IWBOMC6D8aBTBwZ7CCasgEg0imtLUdJ8gKp6T44C7cqg1oZwT2dOyYuI
jS+3T+DcZZn3mHmp61nowL/2/2LDAVaLmOfbsvmvlbuX5j8QHj/Lvt6udRjqpelJ
n0jV9Ay0myu7dSK5ng7WNQUlSrba5I/W3CAjuH0CDp90ADCymWSp2jktvv5rSO/R
bQpz58kRY72y4dEwOy0zkWc/EClqh3p4abq5HDBCIkO+EO8CjJhnEnT+oOrFF5C/
LU93bFPyp6ZJoXzsKnjEjSzMzgDT6XuGTAgrh6upZy52ssjkG8zACbNOmXZTYmAg
hg3OlpdEUvM=
=zZCm
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches and object-add QAPIfication
- QAPIfy object-add and --object
- stream: Fail gracefully if permission is denied
- storage-daemon: Fix crash on quit when job is still running
- curl: Fix use after free
- char: Deprecate backend aliases, fix QMP query-chardev-backends
- Fix image creation option defaults that exist in both the format and
the protocol layer (e.g. 'cluster_size' in qcow2 and rbd; the qcow2
default was incorrectly applied to the rbd layer)
# gpg: Signature made Fri 19 Mar 2021 09:18:22 GMT
# gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg: issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream: (42 commits)
vl: allow passing JSON to -object
qom: move user_creatable_add_opts logic to vl.c and QAPIfy it
tests: convert check-qom-proplist to keyval
qom: Support JSON in HMP object_add and tools --object
char: Simplify chardev_name_foreach()
char: Deprecate backend aliases 'tty' and 'parport'
char: Skip CLI aliases in query-chardev-backends
qom: Add user_creatable_parse_str()
hmp: QAPIfy object_add
qemu-img: Use user_creatable_process_cmdline() for --object
qom: Add user_creatable_add_from_str()
qemu-nbd: Use user_creatable_process_cmdline() for --object
qemu-io: Use user_creatable_process_cmdline() for --object
qom: Factor out user_creatable_process_cmdline()
qom: Remove user_creatable_add_dict()
qemu-storage-daemon: Implement --object with qmp_object_add()
qom: Make "object" QemuOptsList optional
qapi/qom: QAPIfy object-add
qapi/qom: Add ObjectOptions for x-remote-object
qapi/qom: Add ObjectOptions for input-*
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This converts object-add from 'gen': false to the ObjectOptions QAPI
type. As an immediate benefit, clients can now use QAPI schema
introspection for user creatable QOM objects.
It is also the first step towards making the QAPI schema the only
external interface for the creation of user creatable objects. Once all
other places (HMP and command lines of the system emulator and all
tools) go through QAPI, too, some object implementations can be
simplified because some checks (e.g. that mandatory options are set) are
already performed by QAPI, and in another step, QOM boilerplate code
could be generated from the schema.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
* fixes for Coverity CID 1450756, 1450757 and 1450758 (me)
* fix for a bug in zone management receive (me)
* metadata and end-to-end data protection support (me & Gollu Appalanaidu)
* verify support (Gollu Appalanaidu)
* multiple lba formats and format nvm support (Minwoo Im)
and a couple of misc refactorings from me.
v2:
- remove an unintended submodule update. Argh.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmBTP0wACgkQTeGvMW1P
DelZvAf+Ijw77YLz2t0P97CiwoOAG4FADwo61WQJo2AHyS3JMYPVdgNUXF7UGt/S
cNhU1JjdEylgZPKi5t9o/qgy7+vtSL0KKqoXIS0z9ZWZkdFsgObNetGULhaqXgaX
4KcAt7PyJS/33uFqYGSVZxJTO4GtCy34hGw6XrVs388tQD1S+eI+oS5EYBN57rl9
MrV5Z72iMCBx8Y074u81SD/u7n34b5WWbUOJADVI1rAKhilDhnkAhBQ/gK/UBodB
trUUQY/KU+eCnXEqfBeMD6fL75UOAFOjnm7Kfk/Q/g3di3H2fu0RGLrJuSS1dG0n
zmwgE7pQ1jT1UcB3QwAJLR3tGtd6sA==
=HwqP
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging
emulated nvme updates and fixes
* fixes for Coverity CID 1450756, 1450757 and 1450758 (me)
* fix for a bug in zone management receive (me)
* metadata and end-to-end data protection support (me & Gollu Appalanaidu)
* verify support (Gollu Appalanaidu)
* multiple lba formats and format nvm support (Minwoo Im)
and a couple of misc refactorings from me.
v2:
- remove an unintended submodule update. Argh.
# gpg: Signature made Thu 18 Mar 2021 11:53:48 GMT
# gpg: using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg: aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468 4272 63D5 6FC5 E55D A838
# Subkey fingerprint: 5228 33AA 75E2 DCE6 A247 66C0 4DE1 AF31 6D4F 0DE9
* remotes/nvme/tags/nvme-next-pull-request:
hw/block/nvme: add support for the format nvm command
hw/block/nvme: pull lba format initialization
hw/block/nvme: prefer runtime helpers instead of device parameters
hw/block/nvme: support multiple lba formats
hw/block/nvme: add non-mdts command size limit for verify
hw/block/nvme: add verify command
hw/block/nvme: end-to-end data protection
hw/block/nvme: add metadata support
hw/block/nvme: fix zone management receive reporting too many zones
hw/block/nvme: assert namespaces array indices
hw/block/nvme: fix potential overflow
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Format NVM admin command can make a namespace or namespaces to be
with different LBA size and metadata size with protection information
types.
This patch introduces Format NVM command with LBA format, Metadata, and
Protection Information for the device. The secure erase operation things
and support for formatting zoned namespaces are yet to be added.
The parameter checks inside of this patch has been referred from
Keith's old branch.
Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
[anaidu.gollu: rebased on e2e]
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: rebased for reworked aio tracking]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Pull lba format initialization code into separate function in
preparation for Format NVM support.
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
In preparation for Format NVM support, use runtime helpers instead of
the constant device parameters when getting lba size information etc.
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
This patch introduces multiple LBA formats supported with the typical
logical block sizes of 512 bytes and 4096 bytes as well as metadata
sizes of 0, 8, 16 and 64 bytes. The format will be chosed based on the
lbads and ms parameters of the nvme-ns device.
Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
[k.jensen: resurrected and rebased]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Verify is not subject to MDTS, so a single Verify command may result in
excessive amounts of allocated memory. Impose a limit on the data size
by adding support for TP 4040 ("Non-MDTS Command Size Limits").
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Add support for namespaces formatted with protection information. The
type of end-to-end data protection (i.e. Type 1, Type 2 or Type 3) is
selected with the `pi` nvme-ns device parameter. If the number of
metadata bytes is larger than 8, the `pil` nvme-ns device parameter may
be used to control the location of the 8-byte DIF tuple. The default
`pil` value of '0', causes the DIF tuple to be transferred as the last
8 bytes of the metadata. Set to 1 to store this in the first eight bytes
instead.
Co-authored-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Add support for metadata in the form of extended logical blocks as well
as a separate buffer of data. The new `ms` nvme-ns device parameter
specifies the size of metadata per logical block in bytes. The `mset`
nvme-ns device parameter controls whether metadata is transfered as part
of an extended lba (set to '1') or in a separate buffer (set to '0',
the default).
Regardsless of the scheme chosen with `mset`, metadata is stored at the
end of the namespace backing block device. This requires the user
provided PRP/SGLs to be walked and "split" into data and metadata
scatter/gather lists if the extended logical block scheme is used, but
has the advantage of not breaking the deallocated blocks support.
Co-authored-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
nvme_zone_mgmt_recv uses nvme_ns_nlbas() to get the number of LBAs in
the namespace and then calculates the number of zones to report by
incrementing slba with ZSZE until exceeding the number of LBAs as
returned by nvme_ns_nlbas().
This is bad because the namespace might be of such as size that some
LBAs are valid, but are not part of any zone, causing zone management
receive to report one additional (but non-existing) zone.
Fix this with a conventional loop on i < ns->num_zones instead.
Fixes: a479335bfa ("hw/block/nvme: Support Zoned Namespace Command Set")
Cc: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Coverity complains about a possible memory corruption in the
nvme_ns_attach and _detach functions. While we should not (famous last
words) be able to reach this function without nsid having previously
been validated, this is still an open door for future misuse.
Make Coverity and maintainers happy by asserting that the index into the
array is valid. Also, while not detected by Coverity (yet), add an
assert in nvme_subsys_ns and nvme_subsys_register_ns as well since a
similar issue is exists there.
Fixes: 037953b5b2 ("hw/block/nvme: support namespace detach")
Fixes: CID 1450757
Fixes: CID 1450758
Cc: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
page_size is a uint32_t, and zasl is a uint8_t, so the expression
`page_size << zasl` is done using 32-bit arithmetic and might overflow.
Since we then compare this against a 64 bit data_size value, Coverity
complains that we might overflow unintentionally. An MDTS/ZASL value in
excess of 4GiB is probably impractical, but it is not entirely
unrealistic, so add a cast such that we handle that case properly.
Fixes: 578d914b26 ("hw/block/nvme: align zoned.zasl with mdts")
Fixes: CID 1450756
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Rather than having a device specific debug implementation in
pflash_cfi01.c and pflash_cfi02.c, use the standard tracing facility.
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210216142721.1985543-2-david.edmondson@oracle.com>
[PMD: Rebased, fixed pflash_write_block_erase trace event format]
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
PFlashCFI01.ro is a bool, declare it as such.
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210216142721.1985543-3-david.edmondson@oracle.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Use the 'mode_read_array' event when we set the device in such
mode, and use the 'reset' event in DeviceReset handler.
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210310170528.1184868-10-philmd@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Message-Id: <20210310170528.1184868-9-philmd@redhat.com>
There is multiple places resetting the internal state machine.
Factor the code out in a new pflash_reset_state_machine() method.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Message-Id: <20210310170528.1184868-8-philmd@redhat.com>
The same pattern is used when setting the flash in READ_ARRAY mode:
- Set the state machine command to READ_ARRAY
- Reset the write_cycle counter
- Reset the memory region in ROMD
Refactor the current code by extracting this pattern.
It is used three times:
- When the timer expires and not in bypass mode
- On a read access (on invalid command).
- When the device is initialized. Here the ROMD mode is hidden
by the memory_region_init_rom_device() call.
pflash_register_memory(rom_mode=true) already sets the ROM device
in "read array" mode (from I/O device to ROM one). Explicit that
by renaming the function as pflash_mode_read_array(), adding
a trace event and resetting wcycle.
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210310170528.1184868-7-philmd@redhat.com>
There is only one call to pflash_register_memory() with
rom_mode == false. As we want to modify pflash_register_memory()
in the next patch, open-code this trivial function in place for
the 'rom_mode == false' case.
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210310170528.1184868-6-philmd@redhat.com>
There is only one call to pflash_setup_mappings(). Convert 'rom_mode'
to boolean and set it to true directly within pflash_setup_mappings().
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210310170528.1184868-5-philmd@redhat.com>
Fill the CFI table in out of DeviceRealize() in a new function:
pflash_cfi02_fill_cfi_table().
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210310170528.1184868-4-philmd@redhat.com>
Fill the CFI table in out of DeviceRealize() in a new function:
pflash_cfi01_fill_cfi_table().
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210310170528.1184868-3-philmd@redhat.com>
We are going to move this code, fix its style first.
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210310170528.1184868-2-philmd@redhat.com>
The 'scsi-hd' and 'scsi-cd' devices provide suitable alternatives.
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The 'ide-hd' and 'ide-cd' devices provide suitable alternatives.
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>