Commit Graph

1464 Commits

Author SHA1 Message Date
Hanna Reitz
5a1cfd2150 block: Clarify that @bytes is no limit on *pnum
.bdrv_co_block_status() implementations are free to return a *pnum that
exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp
*pnum as necessary.

On the other hand, if drivers' implementations return values for *pnum
that are as large as possible, our recently introduced block-status
cache will become more effective.

So, make a note in block_int.h that @bytes is no upper limit for *pnum.

Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210812084148.14458-4-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2021-09-15 15:54:07 +02:00
Hanna Reitz
0bc329fbb0 block: block-status cache for data regions
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210812084148.14458-3-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[hreitz: Added `local_file == bs` assertion, as suggested by Vladimir]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 15:54:06 +02:00
Hanna Reitz
33ff4c9e08 block: Drop BDS comment regarding bdrv_append()
There is a comment above the BDS definition stating care must be taken
to consider handling newly added fields in bdrv_append().

Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac
(nine years ago), and in any case, bdrv_swap() was dropped in
8e419aefa (six years ago).  So no such care is necessary anymore.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210812084148.14458-2-hreitz@redhat.com>
2021-09-15 15:54:06 +02:00
Vladimir Sementsov-Ogievskiy
abde8ac2a5 block/block-copy: block_copy_state_new(): drop extra arguments
The only caller pass copy_range and compress both false. Let's just
drop these arguments.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210824083856.17408-35-vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-01 14:38:08 +02:00
Vladimir Sementsov-Ogievskiy
b518e9e9ef block/backup: move cluster size calculation to block-copy
The main consumer of cluster-size is block-copy. Let's calculate it
here instead of passing through backup-top.

We are going to publish copy-before-write filter soon, so it will be
created through options. But we don't want for now to make explicit
option for cluster-size, let's continue to calculate it automatically.
So, now is the time to get rid of cluster_size argument for
bdrv_cbw_append().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-10-vsementsov@virtuozzo.com>
[hreitz: Add qemu/error-report.h include to block/block-copy.c]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-01 14:03:11 +02:00
Vladimir Sementsov-Ogievskiy
f8b9504bac block/block-copy: introduce block_copy_set_copy_opts()
We'll need a possibility to set compress and use_copy_range options
after initialization of the state. So make corresponding part of
block_copy_state_new() separate and public.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210824083856.17408-8-vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-01 12:57:31 +02:00
Vladimir Sementsov-Ogievskiy
49577723d4 block-copy: move detecting fleecing scheme to block-copy
We want to simplify initialization interface of copy-before-write
filter as we are going to make it public. So, let's detect fleecing
scheme exactly in block-copy code, to not pass this information through
extra levels.

Why not just set BDRV_REQ_SERIALISING unconditionally: because we are
going to implement new more efficient fleecing scheme which will not
rely on backing feature.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210824083856.17408-7-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-01 12:57:31 +02:00
Vladimir Sementsov-Ogievskiy
bd8f4c42c8 block: introduce bdrv_replace_child_bs()
Add function to transactionally replace bs inside BdrvChild.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-2-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-01 12:57:31 +02:00
Klaus Jensen
a316aa50e6 hw/nvme: use symbolic names for registers
Add the NvmeBarRegs enum and use these instead of explicit register
offsets.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-07-26 21:09:38 +02:00
Klaus Jensen
5d45edbeac hw/nvme: split pmrmsc register into upper and lower
The specification uses a set of 32 bit PMRMSCL and PMRMSCU registers to
make up the 64 bit logical PMRMSC register.

Make it so.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-07-26 21:09:38 +02:00
Stefano Garzarella
1793ad0247 iothread: add aio-max-batch parameter
The `aio-max-batch` parameter will be propagated to AIO engines
and it will be used to control the maximum number of queued requests.

When there are in queue a number of requests equal to `aio-max-batch`,
the engine invokes the system call to forward the requests to the kernel.

This parameter allows us to control the maximum batch size to reduce
the latency that requests might accumulate while queued in the AIO
engine queue.

If `aio-max-batch` is equal to 0 (default value), the AIO engine will
use its default maximum batch size value.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20210721094211.69853-3-sgarzare@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2021-07-21 13:47:50 +01:00
Peter Maydell
42e1d798a6 Block layer patches
- Make blockdev-reopen stable
 - Remove deprecated qemu-img backing file without format
 - rbd: Convert to coroutines and add write zeroes support
 - rbd: Updated MAINTAINERS
 - export/fuse: Allow other users access to the export
 - vhost-user: Fix backends without multiqueue support
 - Fix drive-backup transaction endless drained section
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmDoRdIRHGt3b2xmQHJl
 ZGhhdC5jb20ACgkQfwmycsiPL9bvgQ/+Ogq24n1UOQc8FEKRYfyhajNToQ9ofzWN
 iLiblSGx2QDq+CauD3qdu6z7DLlqEXeoM4NYM462oIPumptQj+9XZt7ftfh6FLWW
 4yJEbjfnVKOba+vFdJ+E0DStwnPaxYdnrPGd53cwHZfbZh4ZmkpTM350mzHHiLTb
 KYKOgWd+UHZbkYeCVNYTGe30SRBiKeAecTpsVZ5HVhe7LstjByuy5stk8dytLpdV
 YqdKOToZfOp77XiHr8YcLLp1HHBGlr5hw73V4SDas0beCp7hqtnAqsTYyXBue4xO
 4zfD4Gujr5JVOCb0crDTyOmOQY5E+y2dqFoOUF00D5AoN2vj4nfQ9ESkbqlE9BVh
 mgJ1izSokYlN2X8rIwGXNR5fbxRmxxfkAA4rScNRytj1KxDHyrDxrp/k8YFemxSQ
 qwgb/FBm0fcr69evPRzovKwZFhcyPremksluHQE4rZZ66qBQ2cGuDJPE7PWVTpPH
 67JCrIVK/O6n5p+4ilFHmQQ3aP3ol0frMFcboYVRchJ2MhIDTsfFL3F/tTK8hy86
 AmrrdQ1BQIAoKNOKnAmOSOUdExM55OcfPmX69+AhEk2GeWP6kgz5Pks4H3qCiKGf
 YoRk8F1V+N4q+C0mFFovB61bNQ6COIlBuzmD9EtmpDD/Ta3Wib+3ZnoGVIdPS+OI
 jyj+qJxd9z4=
 =kH+r
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches

- Make blockdev-reopen stable
- Remove deprecated qemu-img backing file without format
- rbd: Convert to coroutines and add write zeroes support
- rbd: Updated MAINTAINERS
- export/fuse: Allow other users access to the export
- vhost-user: Fix backends without multiqueue support
- Fix drive-backup transaction endless drained section

# gpg: Signature made Fri 09 Jul 2021 13:49:22 BST
# gpg:                using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg:                issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream: (28 commits)
  block: Make blockdev-reopen stable API
  iotests: Test reopening multiple devices at the same time
  block: Support multiple reopening with x-blockdev-reopen
  block: Acquire AioContexts during bdrv_reopen_multiple()
  block: Add bdrv_reopen_queue_free()
  qcow2: Fix dangling pointer after reopen for 'file'
  qemu-img: Improve error for rebase without backing format
  qemu-img: Require -F with -b backing image
  qcow2: Prohibit backing file changes in 'qemu-img amend'
  blockdev: fix drive-backup transaction endless drained section
  vhost-user: Fix backends without multiqueue support
  MAINTAINERS: add block/rbd.c reviewer
  block/rbd: fix type of task->complete
  iotests/fuse-allow-other: Test allow-other
  iotests/308: Test +w on read-only FUSE exports
  export/fuse: Let permissions be adjustable
  export/fuse: Give SET_ATTR_SIZE its own branch
  export/fuse: Add allow-other option
  export/fuse: Pass default_permissions for mount
  util/uri: do not check argument of uri_free()
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-07-10 19:55:21 +01:00
Kevin Wolf
6cf42ca2f9 block: Acquire AioContexts during bdrv_reopen_multiple()
As the BlockReopenQueue can contain nodes in multiple AioContexts, only
one of which may be locked when AIO_WAIT_WHILE() can be called, we can't
let the caller lock the right contexts. Instead, individually lock the
AioContext of a single node when iterating the queue.

Reintroduce bdrv_reopen() as a wrapper for reopening a single node that
drains the node and temporarily drops the AioContext lock for
bdrv_reopen_multiple().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210708114709.206487-4-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-07-09 13:19:11 +02:00
Alberto Garcia
ab5b522879 block: Add bdrv_reopen_queue_free()
Move the code to free a BlockReopenQueue to a separate function.
It will be used in a subsequent patch.

[ kwolf: Also free explicit_options and options, and explicitly
  qobject_ref() the value when it continues to be used. This makes
  future memory leaks less likely. ]

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210708114709.206487-3-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-07-09 13:19:11 +02:00
Stefan Hajnoczi
0f08586c71 util/async: add a human-readable name to BHs for debugging
It can be difficult to debug issues with BHs in production environments.
Although BHs can usually be identified by looking up their ->cb()
function pointer, this requires debug information for the program. It is
also not possible to print human-readable diagnostics about BHs because
they have no identifier.

This patch adds a name to each BH. The name is not unique per instance
but differentiates between cb() functions, which is usually enough. It's
done by changing aio_bh_new() and friends to macros that stringify cb.

The next patch will use the name field when reporting leaked BHs.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210414200247.917496-2-stefanha@redhat.com>
2021-07-05 11:40:32 +01:00
Peter Maydell
9c2647f750 Block layer patches
- Supporting changing 'file' in x-blockdev-reopen
 - ssh: add support for sha256 host key fingerprints
 - vhost-user-blk: Implement reconnection during realize
 - introduce QEMU_AUTO_VFREE
 - Don't require password of encrypted backing file for image creation
 - Code cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmDclTcRHGt3b2xmQHJl
 ZGhhdC5jb20ACgkQfwmycsiPL9bzaw/+PYQ9vPG+ZROWl633TUOQu7IYGZynXCET
 ZHlV2JlXnFH8QoO8A53U72cgVg+GlwDpiMCCtjEGMG1yfMBNe+DXR1wFUMDne1Gs
 qIFX4gIVpPGDi3gPeQvefLCwoN8VXwIxvCJj40YR9BY0cqQR6joI81kjVlqKemqB
 cHG4qJHmnphihSLep/dd2BRInkeHWXWU63jK4d7ctdCwHZPNDf0u0FEraxEqS/d0
 5tfdIl3haghhoDRRamyh5bLZBCW1KlpfTR98RdspcwpSlXq/FgV5N9CFa15XSYfQ
 rinSClSIOpLzG90+tBihROVBXvbugO5qZVQTG0Yg1tt4FGG8Cmiqf9MNXC5yctNg
 WnaQQipx/37deafGA4jqorZBJd1R87JLJTBFTpkB47XAFq/ltqsTDhrrfdS+jail
 Fd+qyqWg0Jx3JjdhSUpHvDKBBsErjoxtoyQIGakSreXGmj2UY6BFmGii7lnLZNLo
 +E81C7exnkCIGKkOHy+y9DkpVY/PEJKCG7uwcyy+F2qOqGUOxKLuZomWcLodo6Vf
 /eJ/UsLJt6HhXhXq/1ZZHmaORn8Lft1yr/9azoGXZ7er+jZcbEkhbcZmET+Y6ykq
 Vox/GmLkhyVkM96MA0lMW5hHPWUbF29m9Jmq3nNfvFWBcILEs4uWSlbd0M2oAmWj
 ung9sKIV/8s=
 =aB0a
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches

- Supporting changing 'file' in x-blockdev-reopen
- ssh: add support for sha256 host key fingerprints
- vhost-user-blk: Implement reconnection during realize
- introduce QEMU_AUTO_VFREE
- Don't require password of encrypted backing file for image creation
- Code cleanups

# gpg: Signature made Wed 30 Jun 2021 17:00:55 BST
# gpg:                using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg:                issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream: (24 commits)
  vhost-user-blk: Implement reconnection during realize
  vhost-user-blk: Factor out vhost_user_blk_realize_connect()
  vhost: Distinguish errors in vhost_dev_get_config()
  vhost-user-blk: Add Error parameter to vhost_user_blk_start()
  vhost: Return 0/-errno in vhost_dev_init()
  vhost: Distinguish errors in vhost_backend_init()
  vhost: Add Error parameter to vhost_dev_init()
  block/ssh: add support for sha256 host key fingerprints
  block/commit: use QEMU_AUTO_VFREE
  introduce QEMU_AUTO_VFREE
  iotests: Test replacing files with x-blockdev-reopen
  block: Allow changing bs->file on reopen
  block: BDRVReopenState: drop replace_backing_bs field
  block: move supports_backing check to bdrv_set_file_or_backing_noperm()
  block: bdrv_reopen_parse_backing(): simplify handling implicit filters
  block: bdrv_reopen_parse_backing(): don't check frozen child
  block: bdrv_reopen_parse_backing(): don't check aio context
  block: introduce bdrv_set_file_or_backing_noperm()
  block: introduce bdrv_remove_file_or_backing_child()
  block: comment graph-modifying function not updating permissions
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-07-02 11:46:32 +01:00
Peter Maydell
1ec2cd0ce2 hw/nvme patches
* namespace eui64 support (Heinrich)
 * aiocb refactoring (Klaus)
 * controller parameter for auto zone transitioning (Niklas)
 * misc fixes and additions (Gollu, Klaus, Keith)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmDbap8ACgkQTeGvMW1P
 Denz9wf9GtU4z7G0pQpFlv1smWPvQfv9qlkGiRvGcIQjEgPjcKjEsJC1CqtG6nZ5
 BGYzey706iM7NAs6wPkyUyHgWXl+N/Ku5nNBvNJuzTN5J9aHgnqehe8civvS6ONU
 0WtAQEEZCLu0Qbvw6KawCShXrJ6Jpmu+bYGggYEy5+baumLF0Qj50l2wRiDbImNu
 rE3iZuhEenwnNNSTTkyt2RT8lSzizAtth4bAqzl3Zw1Q3NO+qDCL0wPOWD7YuD8p
 LTZ8/ojB1+YUI3JrMM4MlzbJyGstdk2Y5zq6QpAcf86agaaFEmS0iqjOic52ImpL
 VzmZZZ0W7kmhJVc5H/Q5y/KAFIHnFA==
 =efjI
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging

hw/nvme patches

* namespace eui64 support (Heinrich)
* aiocb refactoring (Klaus)
* controller parameter for auto zone transitioning (Niklas)
* misc fixes and additions (Gollu, Klaus, Keith)

# gpg: Signature made Tue 29 Jun 2021 19:46:55 BST
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* remotes/nvme/tags/nvme-next-pull-request: (23 commits)
  hw/nvme: add 'zoned.zasl' to documentation
  hw/nvme: fix pin-based interrupt behavior (again)
  hw/nvme: fix missing check for PMR capability
  hw/nvme: documentation fix
  hw/nvme: fix endianess conversion and add controller list
  Partially revert "hw/block/nvme: drain namespaces on sq deletion"
  hw/nvme: reimplement format nvm to allow cancellation
  hw/nvme: reimplement zone reset to allow cancellation
  hw/nvme: reimplement the copy command to allow aio cancellation
  hw/nvme: add dw0/1 to the req completion trace event
  hw/nvme: use prinfo directly in nvme_check_prinfo and nvme_dif_check
  hw/nvme: remove assert from nvme_get_zone_by_slba
  hw/nvme: save reftag when generating pi
  hw/nvme: reimplement dsm to allow cancellation
  hw/nvme: add nvme_block_status_all helper
  hw/nvme: reimplement flush to allow cancellation
  hw/nvme: default for namespace EUI-64
  hw/nvme: namespace parameter for EUI-64
  hw/nvme: fix csi field for cns 0x00 and 0x11
  hw/nvme: add param to control auto zone transitioning to zone state closed
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-06-30 21:09:27 +01:00
Alberto Garcia
ecd30d2d97 block: Allow changing bs->file on reopen
When the x-blockdev-reopen was added it allowed reconfiguring the
graph by replacing backing files, but changing the 'file' option was
forbidden. Because of this restriction some operations are not
possible, notably inserting and removing block filters.

This patch adds support for replacing the 'file' option. This is
similar to replacing the backing file and the user is likewise
responsible for the correctness of the resulting graph, otherwise this
can lead to data corruption.

Signed-off-by: Alberto Garcia <berto@igalia.com>
 [vsementsov: bdrv_reopen_parse_file_or_backing() is modified a lot]
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610120537.196183-9-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-06-29 16:51:00 +02:00
Vladimir Sementsov-Ogievskiy
3d0e8743f0 block: BDRVReopenState: drop replace_backing_bs field
It's used only in bdrv_reopen_commit(). "backing" is covered by the
loop through all children except for case when we removed backing child
during reopen.

Make it more obvious and drop extra boolean field: qdict_del will not
fail if there is no such entry.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610120537.196183-8-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-06-29 16:51:00 +02:00
Gollu Appalanaidu
5f4eb94dbb hw/nvme: fix endianess conversion and add controller list
Add the controller identifiers list CNS 0x13, available list of ctrls
in NVM Subsystem that may or may not be attached to namespaces.

In Identify Ctrl List of the CNS 0x12 and 0x13 no endian conversion
for the nsid field.

These two CNS values shows affect when there exists a Subsystem.
Added condition if there is no Subsystem return invalid field in
command.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-06-29 07:16:25 +02:00
Klaus Jensen
2a132309e4 hw/nvme: use prinfo directly in nvme_check_prinfo and nvme_dif_check
The nvme_check_prinfo() and nvme_dif_check() functions operate on the
16 bit "control" member of the NvmeCmd. These functions do not otherwise
operate on an NvmeCmd or an NvmeRequest, so change them to expect the
actual 4 bit PRINFO field and add constants that work on this field as
well.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-06-29 07:16:25 +02:00
Gollu Appalanaidu
18de1526ba hw/nvme: add identify namespace flbas/mc enums
Add enums for the Identify Namespace FLBAS and MC fields.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: squashed separate flbas/mc commits into one]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-06-29 07:16:25 +02:00
Peter Maydell
6512fa497c * Some Meson test conversions
* KVM dirty page ring buffer fix
 * KVM TSC scaling support
 * Fixes for SG_IO with /dev/sdX devices
 * (Non)support for host devices on iOS
 * -smp cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDV5TIUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNySgf9HMnAtLWp36p2ie74o4rrW9x3Ojrm
 fuCq2i3q3nBhEKqqiyp+QQJGubE44mXEZQYtX89tOfSFgg7o6SLIoAcQQskr+In6
 f9I1jjpSVTls0AaGUO+iRn9KiTzeMWeo1l6Wht+2mfBL5XpNLaLLu/T49uPhjlvN
 zFi5blgILxIYMqMCD1joDBnIiqqDozr0p7QzRZD8re25sRhg0NHQxyIh3OxBPpJ9
 3Jhy1Us0cDWrwvPbxz6S5N0zesLu1ojtojVPy6iKjyHSv+6eiE6bHyIbS8duG5+H
 zBC1THOsUV3X1UvPAjuSNlgfNeobGAzmxSJ/evLgWWkpkx1mLtsnL5RARQ==
 =YoOL
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/bonzini-gitlab/tags/for-upstream' into staging

* Some Meson test conversions
* KVM dirty page ring buffer fix
* KVM TSC scaling support
* Fixes for SG_IO with /dev/sdX devices
* (Non)support for host devices on iOS
* -smp cleanups

# gpg: Signature made Fri 25 Jun 2021 15:16:18 BST
# gpg:                using RSA key F13338574B662389866C7682BFFBD25F78C7AE83
# gpg:                issuer "pbonzini@redhat.com"
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" [full]
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>" [full]
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4  E2F7 7E15 100C CD36 69B1
#      Subkey fingerprint: F133 3857 4B66 2389 866C  7682 BFFB D25F 78C7 AE83

* remotes/bonzini-gitlab/tags/for-upstream: (28 commits)
  machine: reject -smp dies!=1 for non-PC machines
  machine: pass QAPI struct to mc->smp_parse
  machine: add error propagation to mc->smp_parse
  machine: move common smp_parse code to caller
  machine: move dies from X86MachineState to CpuTopology
  file-posix: handle EINTR during ioctl
  block: detect DKIOCGETBLOCKCOUNT/SIZE before use
  block: try BSD disk size ioctls one after another
  block: check for sys/disk.h
  block: feature detection for host block support
  file-posix: try BLKSECTGET on block devices too, do not round to power of 2
  block: add max_hw_transfer to BlockLimits
  block-backend: align max_transfer to request alignment
  osdep: provide ROUND_DOWN macro
  scsi-generic: pass max_segments via max_iov field in BlockLimits
  file-posix: fix max_iov for /dev/sg devices
  KVM: Fix dirty ring mmap incorrect size due to renaming accident
  configure, meson: convert libusbredir detection to meson
  configure, meson: convert libcacard detection to meson
  configure, meson: convert libusb detection to meson
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-06-28 21:04:22 +01:00
Emanuele Giuseppe Esposito
149009bef4 block-copy: atomic .cancelled and .finished fields in BlockCopyCallState
By adding acquire/release pairs, we ensure that .ret and .error_is_read
fields are written by block_copy_dirty_clusters before .finished is true,
and that they are read by API user after .finished is true.

The atomic here are necessary because the fields are concurrently modified
in coroutines, and read outside.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20210624072043.180494-6-eesposit@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-06-25 14:33:51 +03:00
Paolo Bonzini
24b36e9813 block: add max_hw_transfer to BlockLimits
For block host devices, I/O can happen through either the kernel file
descriptor I/O system calls (preadv/pwritev, io_submit, io_uring)
or the SCSI passthrough ioctl SG_IO.

In the latter case, the size of each transfer can be limited by the
HBA, while for file descriptor I/O the kernel is able to split and
merge I/O in smaller pieces as needed.  Applying the HBA limits to
file descriptor I/O results in more system calls and suboptimal
performance, so this patch splits the max_transfer limit in two:
max_transfer remains valid and is used in general, while max_hw_transfer
is limited to the maximum hardware size.  max_hw_transfer can then be
included by the scsi-generic driver in the block limits page, to ensure
that the stricter hardware limit is used.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25 10:54:13 +02:00
Vladimir Sementsov-Ogievskiy
97cf89259e nbd/client-connection: add option for non-blocking connection attempt
We'll need a possibility of non-blocking nbd_co_establish_connection(),
so that it returns immediately, and it returns success only if a
connections was previously established in background.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-30-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 12:21:22 -05:00
Vladimir Sementsov-Ogievskiy
43cb34dede nbd/client-connection: return only one io channel
block/nbd doesn't need underlying sioc channel anymore. So, we can
update nbd/client-connection interface to return only one top-most io
channel, which is more straight forward.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-27-vsementsov@virtuozzo.com>
[eblake: squash in Vladimir's fixes for uninit usage caught by clang]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 12:20:53 -05:00
Vladimir Sementsov-Ogievskiy
e0e67cbe58 nbd/client-connection: implement connection retry
Add an option for a thread to retry connecting until it succeeds. We'll
use nbd/client-connection both for reconnect and for initial connection
in nbd_open(), so we need a possibility to use same NBDClientConnection
instance to connect once in nbd_open() and then use retry semantics for
reconnect.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-21-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 10:59:53 -05:00
Vladimir Sementsov-Ogievskiy
130d49baa5 nbd/client-connection: add possibility of negotiation
Add arguments and logic to support nbd negotiation in the same thread
after successful connection.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-20-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 10:59:53 -05:00
Vladimir Sementsov-Ogievskiy
5276c87c12 nbd: move connection code from block/nbd to nbd/client-connection
We now have bs-independent connection API, which consists of four
functions:

  nbd_client_connection_new()
  nbd_client_connection_release()
  nbd_co_establish_connection()
  nbd_co_establish_connection_cancel()

Move them to a separate file together with NBDClientConnection
structure which becomes private to the new API.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-18-vsementsov@virtuozzo.com>
[eblake: comment tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 10:59:53 -05:00
Paolo Bonzini
5f50be9b58 async: the main AioContext is only "current" if under the BQL
If we want to wake up a coroutine from a worker thread, aio_co_wake()
currently does not work.  In that scenario, aio_co_wake() calls
aio_co_enter(), but there is no current AioContext and therefore
qemu_get_current_aio_context() returns the main thread.  aio_co_wake()
then attempts to call aio_context_acquire() instead of going through
aio_co_schedule().

The default case of qemu_get_current_aio_context() was added to cover
synchronous I/O started from the vCPU thread, but the main and vCPU
threads are quite different.  The main thread is an I/O thread itself,
only running a more complicated event loop; the vCPU thread instead
is essentially a worker thread that occasionally calls
qemu_mutex_lock_iothread().  It is only in those critical sections
that it acts as if it were the home thread of the main AioContext.

Therefore, this patch detaches qemu_get_current_aio_context() from
iothreads, which is a useless complication.  The AioContext pointer
is stored directly in the thread-local variable, including for the
main loop.  Worker threads (including vCPU threads) optionally behave
as temporary home threads if they have taken the big QEMU lock,
but if that is not the case they will always schedule coroutines
on remote threads via aio_co_schedule().

With this change, the stub qemu_mutex_iothread_locked() must be changed
from true to false.  The previous value of true was needed because the
main thread did not have an AioContext in the thread-local variable,
but now it does have one.

Reported-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210609122234.544153-1-pbonzini@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Tested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[eblake: tweak commit message per Vladimir's review]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 10:59:52 -05:00
Paolo Bonzini
c0d4aa82f8 vl: plumb keyval-based options into -readconfig
Let -readconfig support parsing command line options into QDict or
QemuOpts.  This will be used to add back support for objects in
-readconfig.

Cc: Markus Armbruster <armbru@redhat.com>
Cc: qemu-stable@nongnu.org
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210524105752.3318299-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-04 13:50:04 +02:00
Vladimir Sementsov-Ogievskiy
260242a833 block: drop BlockBackendRootState::read_only
Instead of keeping additional boolean field, let's store the
information in BDRV_O_RDWR bit of BlockBackendRootState::open_flags.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210527154056.70294-4-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-06-02 14:23:20 +02:00
Vladimir Sementsov-Ogievskiy
975da07374 block: drop BlockDriverState::read_only
This variable is just a cache for !(bs->open_flags & BDRV_O_RDWR),
which we have to synchronize everywhere. Let's just drop it and
consistently use bdrv_is_read_only().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210527154056.70294-3-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-06-02 14:23:20 +02:00
Vladimir Sementsov-Ogievskiy
fb62b58896 block/vvfat: child_vvfat_qcow: add .get_parent_aio_context, fix crash
Commit 3ca1f32257
"block: BdrvChildClass: add .get_parent_aio_context handler" introduced
new handler and commit 228ca37e12
"block: drop ctx argument from bdrv_root_attach_child" made a generic
use of it. But 3ca1f32257 didn't update
child_vvfat_qcow. Fix that.

Before that fix the command

./build/qemu-system-x86_64 -usb -device usb-storage,drive=fat16 \
  -drive file=fat:rw:fat-type=16:"<path of a host folder>",id=fat16,format=raw,if=none

crashes:

1  bdrv_child_get_parent_aio_context (c=0x559d62426d20)
    at ../block.c:1440
2  bdrv_attach_child_common
    (child_bs=0x559d62468190, child_name=0x559d606f9e3d "write-target",
     child_class=0x559d60c58d20 <child_vvfat_qcow>, child_role=3,
     perm=3, shared_perm=4, opaque=0x559d62445690,
     child=0x7ffc74c2acc8, tran=0x559d6246ddd0, errp=0x7ffc74c2ae60)
    at ../block.c:2795
3  bdrv_attach_child_noperm
    (parent_bs=0x559d62445690, child_bs=0x559d62468190,
     child_name=0x559d606f9e3d "write-target",
     child_class=0x559d60c58d20 <child_vvfat_qcow>, child_role=3,
     child=0x7ffc74c2acc8, tran=0x559d6246ddd0, errp=0x7ffc74c2ae60) at
    ../block.c:2855
4  bdrv_attach_child
    (parent_bs=0x559d62445690, child_bs=0x559d62468190,
     child_name=0x559d606f9e3d "write-target",
     child_class=0x559d60c58d20 <child_vvfat_qcow>, child_role=3,
     errp=0x7ffc74c2ae60) at ../block.c:2953
5  bdrv_open_child
    (filename=0x559d62464b80 "/var/tmp/vl.h3TIS4",
     options=0x559d6246ec20, bdref_key=0x559d606f9e3d "write-target",
     parent=0x559d62445690, child_class=0x559d60c58d20
     <child_vvfat_qcow>, child_role=3, allow_none=false,
     errp=0x7ffc74c2ae60) at ../block.c:3351
6  enable_write_target (bs=0x559d62445690, errp=0x7ffc74c2ae60) at
   ../block/vvfat.c:3176
7  vvfat_open (bs=0x559d62445690, options=0x559d6244adb0, flags=155650,
               errp=0x7ffc74c2ae60) at ../block/vvfat.c:1236
8  bdrv_open_driver (bs=0x559d62445690, drv=0x559d60d4f7e0
                     <bdrv_vvfat>, node_name=0x0,
                     options=0x559d6244adb0, open_flags=155650,
                     errp=0x7ffc74c2af70) at ../block.c:1557
9  bdrv_open_common (bs=0x559d62445690, file=0x0,
                     options=0x559d6244adb0, errp=0x7ffc74c2af70) at
...

(gdb) fr 1
 #1  0x0000559d603ea3bf in bdrv_child_get_parent_aio_context
     (c=0x559d62426d20) at ../block.c:1440
1440        return c->klass->get_parent_aio_context(c);
 (gdb) p c->klass
$1 = (const BdrvChildClass *) 0x559d60c58d20 <child_vvfat_qcow>
 (gdb) p c->klass->get_parent_aio_context
$2 = (AioContext *(*)(BdrvChild *)) 0x0

Fixes: 3ca1f32257
Fixes: 228ca37e12
Reported-by: John Arbuckle <programmingkidx@gmail.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210524101257.119377-2-vsementsov@virtuozzo.com>
Tested-by: John Arbuckle <programmingkidx@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-06-02 14:23:20 +02:00
Paolo Bonzini
b02629550d replication: move include out of root directory
The replication.h file is included from migration/colo.c and tests/unit/test-replication.c,
so it should be in include/.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-26 14:49:46 +02:00
Peter Maydell
15e147b3c7 emulated nvme updates
* various fixes (Gollu Appalanaidu)
 * refactoring (me)
 * move to hw/nvme from hw/block (me)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmCiNFEACgkQTeGvMW1P
 DeldNQgAgF/FdkGPNNxNlT5Ei7zmycjbKKf7QDSfn4BXgP9a+7Ccil60kGFCX7Ib
 ERiTOFgofJ587GcYVghBSQv4zYKW+3uXIwfFSUxrMi1hG48j8+z7NyxeWUSXSeSi
 AqohaE70iZyoUx7HQWwRhDsoTnlgRfmpK5Ju9jVRE0be28Y5Z9/hSoyti3PeJPNb
 Igg4bYXTGnXIgJcAfXQBe4iHBn9iZrJB+mp59a8Avb62tP2zwgiRED2M0FNHWyhV
 s/Ra1SiUuEzDJROD47mlxCiqfstknZJ31EI34wJwwKJU2rSQUEMcSRZKc+k7dDHG
 qmQAGPwqS7EnW1uJDxSUFZ944nx+dg==
 =gum+
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging

emulated nvme updates

* various fixes (Gollu Appalanaidu)
* refactoring (me)
* move to hw/nvme from hw/block (me)

# gpg: Signature made Mon 17 May 2021 10:16:01 BST
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* remotes/nvme/tags/nvme-next-pull-request:
  hw/nvme: move nvme emulation out of hw/block
  hw/block/nvme: move zoned constraints checks
  hw/block/nvme: remove irrelevant zone resource checks
  hw/block/nvme: remove num_namespaces member
  hw/block/nvme: streamline namespace array indexing
  hw/block/nvme: add metadata offset helper
  hw/block/nvme: cache lba and ms sizes
  hw/block/nvme: replace nvme_ns_status
  hw/block/nvme: remove non-shared defines from header file
  hw/block/nvme: cleanup includes
  hw/block/nvme: consolidate header files
  hw/block/nvme: rename __nvme_select_ns_iocs
  hw/block/nvme: rename __nvme_advance_zone_wp
  hw/block/nvme: rename __nvme_zrm_open
  hw/block/nvme: align with existing style
  hw/block/nvme: function formatting fix
  hw/block/nvme: fix io-command set profile feature
  hw/block/nvme: consider metadata read aio return value in compare
  hw/block/nvme: rename reserved fields declarations
  hw/block/nvme: remove redundant invalid_lba_range trace

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-05-18 12:22:16 +01:00
Gollu Appalanaidu
312c3531bb hw/block/nvme: align with existing style
While QEMU coding style prefers lowercase hexadecimals in constants, the
NVMe subsystem uses the format from the NVMe specifications in comments,
i.e. 'h' suffix instead of '0x' prefix.

Fix this up across the code base.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: updated message; added conversion in a couple of missing comments]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-05-17 09:15:13 +02:00
Gollu Appalanaidu
9a31c61583 hw/block/nvme: rename reserved fields declarations
Align the 'rsvd1' reserved field declaration in NvmeBar with existing
style.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: minor commit message fixup]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-05-17 09:15:13 +02:00
Vladimir Sementsov-Ogievskiy
c61ebf362d write-threshold: deal with includes
"qemu/typedefs.h" is enough for include/block/write-threshold.h header
with forward declaration of BlockDriverState. Also drop extra includes
from block/write-threshold.c and tests/unit/test-write-threshold.c

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210506090621.11848-9-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-05-14 16:14:10 +02:00
Vladimir Sementsov-Ogievskiy
2e0e9cbd89 block/write-threshold: drop extra APIs
bdrv_write_threshold_exceeded() is unused.

bdrv_write_threshold_is_set() is used only to double check the value of
bs->write_threshold_offset in tests. No real sense in it (both tests do
check real value with help of bdrv_write_threshold_get())

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[mreitz: Adjusted commit message as per Eric's suggestion]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-05-14 16:14:10 +02:00
Vladimir Sementsov-Ogievskiy
ad578c56d5 block: drop write notifiers
They are unused now.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-3-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-05-14 16:14:10 +02:00
Vladimir Sementsov-Ogievskiy
94783301b8 block/write-threshold: don't use write notifiers
write-notifiers are used only for write-threshold. New code for such
purpose should create filters.

Let's better special-case write-threshold and drop write notifiers at
all. (Actually, write-threshold is special-cased anyway, as the only
user of write-notifiers)

So, create a new direct interface for bdrv_co_write_req_prepare() and
drop all write-notifier related logic from write-threshold.c.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[mreitz: Adjusted comment as per Eric's suggestion]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-05-14 16:14:10 +02:00
Vladimir Sementsov-Ogievskiy
9c785cd714 mirror: stop cancelling in-flight requests on non-force cancel in READY
If mirror is READY than cancel operation is not discarding the whole
result of the operation, but instead it's a documented way get a
point-in-time snapshot of source disk.

So, we should not cancel any requests if mirror is READ and
force=false. Let's fix that case.

Note, that bug that we have before this commit is not critical, as the
only .bdrv_cancel_in_flight implementation is nbd_cancel_in_flight()
and it cancels only requests waiting for reconnection, so it should be
rare case.

Fixes: 521ff8b779
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210421075858.40197-1-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-05-14 16:14:10 +02:00
Kevin Wolf
35b7f4abd5 block: Add BDRV_O_NO_SHARE for blk_new_open()
Normally, blk_new_open() just shares all permissions. This was fine
originally when permissions only protected against uses in the same
process because no other part of the code would actually get to access
the block nodes opened with blk_new_open(). However, since we use it for
file locking now, unsharing permissions becomes desirable.

Add a new BDRV_O_NO_SHARE flag that is used in blk_new_open() to unshare
any permissions that can be unshared.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210422164344.283389-2-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:48 +02:00
Vladimir Sementsov-Ogievskiy
ecb776bd93 block: refactor bdrv_child_set_perm_safe() transaction action
Old interfaces dropped, nobody directly calls
bdrv_child_set_perm_abort() and bdrv_child_set_perm_commit(), so we can
use personal state structure for the action and stop exploiting
BdrvChild structure. Also, drop "_safe" suffix which is redundant now.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-35-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:48 +02:00
Vladimir Sementsov-Ogievskiy
72373e40fb block: bdrv_reopen_multiple: refresh permissions on updated graph
Move bdrv_reopen_multiple to new paradigm of permission update:
first update graph relations, then do refresh the permissions.

We have to modify reopen process in file-posix driver: with new scheme
we don't have prepared permissions in raw_reopen_prepare(), so we
should reconfigure fd in raw_check_perm(). Still this seems more native
and simple anyway.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-31-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:48 +02:00
Vladimir Sementsov-Ogievskiy
1e4c797c75 block: make bdrv_refresh_limits() to be a transaction action
To be used in further commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-28-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:48 +02:00
Vladimir Sementsov-Ogievskiy
3108a15cf0 block: introduce bdrv_drop_filter()
Using bdrv_replace_node() for removing filter is not good enough: it
keeps child reference of the filter, which may conflict with original
top node during permission update.

Instead let's create new interface, which will do all graph
modifications first and then update permissions.

Let's modify bdrv_replace_node_common(), allowing it additionally drop
backing chain child link pointing to new node. This is quite
appropriate for bdrv_drop_intermediate() and makes possible to add
new bdrv_drop_filter() as a simple wrapper.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-24-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:48 +02:00
Vladimir Sementsov-Ogievskiy
53e96d1e9f block: make bdrv_reopen_{prepare,commit,abort} private
These functions are called only from bdrv_reopen_multiple() in block.c.
No reason to publish them.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-8-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:47 +02:00
Vladimir Sementsov-Ogievskiy
228ca37e12 block: drop ctx argument from bdrv_root_attach_child
Passing parent aio context is redundant, as child_class and parent
opaque pointer are enough to retrieve it. Drop the argument and use new
bdrv_child_get_parent_aio_context() interface.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-7-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:47 +02:00
Vladimir Sementsov-Ogievskiy
3ca1f32257 block: BdrvChildClass: add .get_parent_aio_context handler
Add new handler to get aio context and implement it in all child
classes. Add corresponding public interface to be used soon.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-6-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:47 +02:00
Klaus Jensen
e548935634 hw/block/nvme: fix handling of private namespaces
Prior to this patch, if a private nvme-ns device (that is, a namespace
that is not linked to a subsystem) is wired up to an nvme-subsys linked
nvme controller device, the device fails to verify that the namespace id
is unique within the subsystem. NVM Express v1.4b, Section 6.1.6 ("NSID
and Namespace Usage") states that because the device supports Namespace
Management, "NSIDs *shall* be unique within the NVM subsystem".

Additionally, prior to this patch, private namespaces are not known to
the subsystem and the namespace is considered exclusive to the
controller with which it is initially wired up to. However, this is not
the definition of a private namespace; per Section 1.6.33 ("private
namespace"), a private namespace is just a namespace that does not
support multipath I/O or namespace sharing, which means "that it is only
able to be attached to one controller at a time".

Fix this by always allocating namespaces in the subsystem (if one is
linked to the controller), regardless of the shared/private status of
the namespace. Whether or not the namespace is shareable is controlled
by a new `shared` nvme-ns parameter.

Finally, this fix allows the nvme-ns `subsys` parameter to be removed,
since the `shared` parameter now serves the purpose of attaching the
namespace to all controllers in the subsystem upon device realization.
It is invalid to have an nvme-ns namespace device with a linked
subsystem without the parent nvme controller device also being linked to
one and since the nvme-ns devices will unconditionally be "attached" (in
QEMU terms that is) to an nvme controller device through an NvmeBus, the
nvme-ns namespace device can always get a reference to the subsystem of
the controller it is explicitly (using 'bus=' parameter) or implicitly
attaching to.

Fixes: e570768566 ("hw/block/nvme: support for shared namespace in subsystem")
Cc: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
2021-04-07 10:48:31 +02:00
Peter Maydell
8a40754bca emulated nvme updates and fixes
* fixes for Coverity CID 1450756, 1450757 and 1450758 (me)
 * fix for a bug in zone management receive (me)
 * metadata and end-to-end data protection support (me & Gollu Appalanaidu)
 * verify support (Gollu Appalanaidu)
 * multiple lba formats and format nvm support (Minwoo Im)
 
 and a couple of misc refactorings from me.
 
 v2:
   - remove an unintended submodule update. Argh.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmBTP0wACgkQTeGvMW1P
 DelZvAf+Ijw77YLz2t0P97CiwoOAG4FADwo61WQJo2AHyS3JMYPVdgNUXF7UGt/S
 cNhU1JjdEylgZPKi5t9o/qgy7+vtSL0KKqoXIS0z9ZWZkdFsgObNetGULhaqXgaX
 4KcAt7PyJS/33uFqYGSVZxJTO4GtCy34hGw6XrVs388tQD1S+eI+oS5EYBN57rl9
 MrV5Z72iMCBx8Y074u81SD/u7n34b5WWbUOJADVI1rAKhilDhnkAhBQ/gK/UBodB
 trUUQY/KU+eCnXEqfBeMD6fL75UOAFOjnm7Kfk/Q/g3di3H2fu0RGLrJuSS1dG0n
 zmwgE7pQ1jT1UcB3QwAJLR3tGtd6sA==
 =HwqP
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging

emulated nvme updates and fixes

* fixes for Coverity CID 1450756, 1450757 and 1450758 (me)
* fix for a bug in zone management receive (me)
* metadata and end-to-end data protection support (me & Gollu Appalanaidu)
* verify support (Gollu Appalanaidu)
* multiple lba formats and format nvm support (Minwoo Im)

and a couple of misc refactorings from me.

v2:
  - remove an unintended submodule update. Argh.

# gpg: Signature made Thu 18 Mar 2021 11:53:48 GMT
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* remotes/nvme/tags/nvme-next-pull-request:
  hw/block/nvme: add support for the format nvm command
  hw/block/nvme: pull lba format initialization
  hw/block/nvme: prefer runtime helpers instead of device parameters
  hw/block/nvme: support multiple lba formats
  hw/block/nvme: add non-mdts command size limit for verify
  hw/block/nvme: add verify command
  hw/block/nvme: end-to-end data protection
  hw/block/nvme: add metadata support
  hw/block/nvme: fix zone management receive reporting too many zones
  hw/block/nvme: assert namespaces array indices
  hw/block/nvme: fix potential overflow

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-03-18 19:55:37 +00:00
Minwoo Im
dc04d25e2f hw/block/nvme: add support for the format nvm command
Format NVM admin command can make a namespace or namespaces to be
with different LBA size and metadata size with protection information
types.

This patch introduces Format NVM command with LBA format, Metadata, and
Protection Information for the device. The secure erase operation things
and support for formatting zoned namespaces are yet to be added.

The parameter checks inside of this patch has been referred from
Keith's old branch.

Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
[anaidu.gollu: rebased on e2e]
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: rebased for reworked aio tracking]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-03-18 12:41:43 +01:00
Gollu Appalanaidu
3e1da158c4 hw/block/nvme: add verify command
See NVM Express 1.4, section 6.14 ("Verify Command").

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: rebased, refactored for e2e]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-03-18 12:34:51 +01:00
Klaus Jensen
146f720c55 hw/block/nvme: end-to-end data protection
Add support for namespaces formatted with protection information. The
type of end-to-end data protection (i.e. Type 1, Type 2 or Type 3) is
selected with the `pi` nvme-ns device parameter. If the number of
metadata bytes is larger than 8, the `pil` nvme-ns device parameter may
be used to control the location of the 8-byte DIF tuple. The default
`pil` value of '0', causes the DIF tuple to be transferred as the last
8 bytes of the metadata. Set to 1 to store this in the first eight bytes
instead.

Co-authored-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-03-18 12:34:51 +01:00
Daniel P. Berrangé
81cbfd5088 block: remove dirty bitmaps 'status' field
The same information is available via the 'recording' and 'busy' fields.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2021-03-18 09:22:55 +00:00
Peter Maydell
9abda42bf2 nbd patches for 2021-03-09
- Add Vladimir as NBD co-maintainer
 - Fix reporting of holes in NBD_CMD_BLOCK_STATUS
 - Improve command-line parsing accuracy of large numbers (anything going
 through qemu_strtosz), including the deprecation of hex+suffix
 - Improve some error reporting in the block layer
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEccLMIrHEYCkn0vOqp6FrSiUnQ2oFAmBHlmIACgkQp6FrSiUn
 Q2q2cQgAqJWNb4J/ShjvzocDDPzJ0iBitFbg0huFPfbt4DScubEZo5wBJG7vOhOW
 hIHrWCRzGvRgsn0tcSfrgFaegmHKrLgjkibM7ou8ni9NC1kUBd3R/3FBNIMxhYf7
 Q8Kfspl0LRfMJDKF9jdCnQ4Gxcd6h2OIYZqiWVg8V4Tc8WdCpIVOah7e7wjuW8bT
 vgZvfboUWm5AmIF9j/MxuMn+HFZ4ArSuFVL80ZaXlD00vRra7u3HZ8pUfcOlOujg
 7HeouM1E5j3NNE6aZSN++x/EQ3sg0zmirbWUCcgAyRfdRkAmB15uh2PUzPxEIJKH
 UHUIW5LvNtz2+yzOAz2yK29OE523Yg==
 =blE1
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2021-03-09' into staging

nbd patches for 2021-03-09

- Add Vladimir as NBD co-maintainer
- Fix reporting of holes in NBD_CMD_BLOCK_STATUS
- Improve command-line parsing accuracy of large numbers (anything going
through qemu_strtosz), including the deprecation of hex+suffix
- Improve some error reporting in the block layer

# gpg: Signature made Tue 09 Mar 2021 15:38:10 GMT
# gpg:                using RSA key 71C2CC22B1C4602927D2F3AAA7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg:                 aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg:                 aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2  F3AA A7A1 6B4A 2527 436A

* remotes/ericb/tags/pull-nbd-2021-03-09:
  block/qcow2: refactor qcow2_update_options_prepare error paths
  block/qed: bdrv_qed_do_open: deal with errp
  block/qcow2: simplify qcow2_co_invalidate_cache()
  block/qcow2: read_cache_sizes: return status value
  block/qcow2-bitmap: return status from qcow2_store_persistent_dirty_bitmaps
  block/qcow2-bitmap: improve qcow2_load_dirty_bitmaps() interface
  block/qcow2: qcow2_get_specific_info(): drop error propagation
  blockjob: return status from block_job_set_speed()
  block/mirror: drop extra error propagation in commit_active_start()
  block: drop extra error propagation for bdrv_set_backing_hd
  blockdev: fix drive_backup_prepare() missed error
  block: check return value of bdrv_open_child and drop error propagation
  utils: Deprecate hex-with-suffix sizes
  utils: Improve qemu_strtosz() to have 64 bits of precision
  utils: Enhance testsuite for do_strtosz()
  nbd: server: Report holes for raw images
  MAINTAINERS: add Vladimir as co-maintainer of NBD

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-03-11 13:57:08 +00:00
Peter Maydell
d689ecad07 hw/block/nvme updates
* NVMe subsystem support (`-device nvme-subsys`) (Minwoo Im)
 * Namespace (De|At)tachment support (Minwoo Im)
 * Simple Copy command support (Klaus Jensen)
 * Flush broadcast support (Gollu Appalanaidu)
 * QEMUIOVector/QEMUSGList duality refactoring (Klaus Jensen)
 
 plus various fixes from Minwoo, Gollu, Dmitry and me.
 
 v2:
   - add `nqn` nvme-subsys device parameter instead of using `id`.
     (Paolo)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmBHX5EACgkQTeGvMW1P
 Dek7ygf9H1g1wwp0Epo3U9z3wCWGcGz6uhBcqEW/T7MtmFg0G50C9iUGKOSe7xtv
 bWDabGJj0VCCzceeVUtMLfDXtDsekdUEiP/40OWuD3GsMDE7BJj0YIEklPioqa14
 maDuFoXbXvgmbPUGIf4IIvmQDTOg22K2oBkjCHA9nbaZM6qxIPC3wRtM77sxXDcW
 QdKQR4QXOs0lCXUlPdO9PhhnYcxt7k2/WxCKQsdF7ZL7USDkgZ/c7m54MUaw2st4
 gQWUAh6vxa+oEjpAjgivCdnmd6B4srpUsvMK3U4UKSpYFXus7teK2x88e8t/plKR
 9X0sgL40+PQVI9/CeBh0Z8l/vJYCvg==
 =WF8v
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging

hw/block/nvme updates

* NVMe subsystem support (`-device nvme-subsys`) (Minwoo Im)
* Namespace (De|At)tachment support (Minwoo Im)
* Simple Copy command support (Klaus Jensen)
* Flush broadcast support (Gollu Appalanaidu)
* QEMUIOVector/QEMUSGList duality refactoring (Klaus Jensen)

plus various fixes from Minwoo, Gollu, Dmitry and me.

v2:
  - add `nqn` nvme-subsys device parameter instead of using `id`.
    (Paolo)

# gpg: Signature made Tue 09 Mar 2021 11:44:17 GMT
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* remotes/nvme/tags/nvme-next-pull-request: (38 commits)
  hw/block/nvme: support Identify NS Attached Controller List
  hw/block/nvme: support changed namespace asynchronous event
  hw/block/nvme: support namespace attachment command
  hw/block/nvme: refactor nvme_select_ns_iocs
  hw/block/nvme: support allocated namespace type
  hw/block/nvme: fix allocated namespace list to 256
  hw/block/nvme: fix namespaces array to 1-based
  hw/block/nvme: support namespace detach
  hw/block/nvme: refactor nvme_dma
  hw/block/nvme: remove the req dependency in map functions
  hw/block/nvme: try to deal with the iov/qsg duality
  hw/block/nvme: fix strerror printing
  hw/block/nvme: remove block accounting for write zeroes
  hw/block/nvme: remove redundant len member in compare context
  hw/block/nvme: report non-mdts command size limit for dsm
  hw/block/nvme: add trace event for zone read check
  hw/block/nvme: fix potential compilation error
  hw/block/nvme: add identify trace event
  hw/block/nvme: remove unnecessary endian conversion
  hw/block/nvme: align zoned.zasl with mdts
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-03-10 20:11:33 +00:00
Minwoo Im
23fb7dfeca hw/block/nvme: support Identify NS Attached Controller List
Support Identify command for Namespace attached controller list.  This
command handler will traverse the controller instances in the given
subsystem to figure out whether the specified nsid is attached to the
controllers or not.

The 4096bytes Identify data will return with the first entry (16bits)
indicating the number of the controller id entries.  So, the data can
hold up to 2047 entries for the controller ids.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
[k.jensen: rebased for dma refactor]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:58 +01:00
Minwoo Im
f432fdfa12 hw/block/nvme: support changed namespace asynchronous event
If namespace inventory is changed due to some reasons (e.g., namespace
attachment/detachment), controller can send out event notifier to the
host to manage namespaces.

This patch sends out the AEN to the host after either attach or detach
namespaces from controllers.  To support clear of the event from the
controller, this patch also implemented Get Log Page command for Changed
Namespace List log type.  To return namespace id list through the
command, when namespace inventory is updated, id is added to the
per-controller list (changed_ns_list).

To indicate the support of this async event, this patch set
OAES(Optional Asynchronous Events Supported) in Identify Controller data
structure.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:58 +01:00
Minwoo Im
645ce1a70c hw/block/nvme: support namespace attachment command
This patch supports Namespace Attachment command for the pre-defined
nvme-ns device nodes.  Of course, attach/detach namespace should only be
supported in case 'subsys' is given.  This is because if we detach a
namespace from a controller, somebody needs to manage the detached, but
allocated namespace in the NVMe subsystem.

As command effect for the namespace attachment command is registered,
the host will be notified that namespace inventory is changed so that
host will rescan the namespace inventory after this command.  For
example, kernel driver manages this command effect via passthru IOCTL.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
[k.jensen: rebased for dma refactor]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:58 +01:00
Gollu Appalanaidu
67ce28a1fd hw/block/nvme: report non-mdts command size limit for dsm
Dataset Management is not subject to MDTS, but exceeded a certain size
per range causes internal looping. Report this limit (DMRSL) in the NVM
command set specific identify controller data structure.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-03-09 11:00:57 +01:00
Gollu Appalanaidu
c94973288c hw/block/nvme: add broadcast nsid support flush command
Add support for using the broadcast nsid to issue a flush on all
namespaces through a single command.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:57 +01:00
Klaus Jensen
3862efff31 nvme: updated shared header for copy command
Add new data structures and types for the Simple Copy command.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-03-09 11:00:57 +01:00
Minwoo Im
adc36b8d21 hw/block/nvme: add NMIC enum value for Identify Namespace
Added Namespace Multi-path I/O and Namespace Sharing Capabilities (NMIC)
field to support shared namespace from controller(s).

This field is in Identify Namespace data structure in [30].

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:57 +01:00
Minwoo Im
66b7e9bed0 hw/block/nvme: add CMIC enum value for Identify Controller
Added Controller Multi-path I/O and Namespace Sharing Capabilities
(CMIC) field to support multi-controller in the following patches.

This field is in Identify Controller data structure in [76].

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:57 +01:00
Vladimir Sementsov-Ogievskiy
775d0c0508 blockjob: return status from block_job_set_speed()
Better to return status together with setting errp. It allows to avoid
error propagation in the caller.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20210202124956.63146-8-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-03-08 15:14:15 -06:00
Vladimir Sementsov-Ogievskiy
35f428ba39 qcow2-bitmap: make bytes_covered_by_bitmap_cluster() public
Rename bytes_covered_by_bitmap_cluster() to
bdrv_dirty_bitmap_serialization_coverage() and make it public.
It is needed as we are going to share it with bitmap loading in
parallels format.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Message-Id: <20210224104707.88430-2-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-03-08 14:56:54 +01:00
Maxim Levitsky
a890f08e58 block: add bdrv_co_delete_file_noerr
This function wraps bdrv_co_delete_file for the common case of removing a file,
which was just created by format driver, on an error condition.

It hides the -ENOTSUPP error, and reports all other errors otherwise.

Use it in luks driver

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20201217170904.946013-3-mlevitsk@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-02-15 15:10:14 +01:00
Peter Maydell
392b9a74b9 bitmaps patches for 2021-02-12
- add 'transform' member to manipulate bitmaps across migration
 - work towards better error handling during bdrv_open
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEccLMIrHEYCkn0vOqp6FrSiUnQ2oFAmAnDQsACgkQp6FrSiUn
 Q2qc5Qf/SKVdpX4j7OnHF6sBuf/8LVWz4KazSqEU0ohazBJmafgJpH2EA5pXMXR4
 frZDWeanGmhj1MjMkta/++uvEBU/TMpW2z98mZvjErteXdnRQAlII/hOCI+QZJvg
 viQ5t1EyrkyXzUePOjs+AwqA5KHWbCKt6QqyItQ78HvI23sw/fuvHj0G67KbVzXZ
 VcSrVr0J7PXnZV/hWfg+C+Nn9Ro9tsVdn79awLYVQ7/SDro3hzylpcHMQaHMK2oe
 mX4D2kNq7s21E27Zb6vlknUhQPkMdETk0gfEbpn7sTVMEc58GRLC7Tqfx7l0JIFK
 5izVyA5vndKVxDGYPkbDK6VL2uDg4A==
 =+Epy
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/ericb/tags/pull-bitmaps-2021-02-12' into staging

bitmaps patches for 2021-02-12

- add 'transform' member to manipulate bitmaps across migration
- work towards better error handling during bdrv_open

# gpg: Signature made Fri 12 Feb 2021 23:19:39 GMT
# gpg:                using RSA key 71C2CC22B1C4602927D2F3AAA7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg:                 aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg:                 aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2  F3AA A7A1 6B4A 2527 436A

* remotes/ericb/tags/pull-bitmaps-2021-02-12:
  block: use return status of bdrv_append()
  block: return status from bdrv_append and friends
  qemu-iotests: 300: Add test case for modifying persistence of bitmap
  migration: dirty-bitmap: Allow control of bitmap persistence
  migration: dirty-bitmap: Use struct for alias map inner members

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-02-13 21:26:00 +00:00
Vladimir Sementsov-Ogievskiy
a1e708fcda block: return status from bdrv_append and friends
The recommended use of qemu error api assumes returning status together
with setting errp and avoid void functions with errp parameter. Let's
improve bdrv_append and some friends to reduce error-propagation
overhead in further patches.

Choose int return status, because bdrv_replace_node_common() has call
to bdrv_check_update_perm(), which reports int status, which seems
correct to propagate.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210202124956.63146-2-vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-12 15:36:41 -06:00
Vladimir Sementsov-Ogievskiy
bd54669a4a block: add new BlockDriver handler: bdrv_cancel_in_flight
It will be used to stop retrying NBD requests on mirror cancel.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210205163720.887197-2-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-12 09:45:18 -06:00
Peter Maydell
1214d55d1c Emulated NVMe device updates
* deallocate or unwritten logical block error feature (me)
   * dataset management command (me)
   * compare command (Gollu Appalanaidu)
   * namespace types (Niklas Cassel)
   * zoned namespaces (Dmitry Fomichev)
   * smart critical warning toggle (Zhenwei Pi)
   * allow cmb and pmr to coexist (me)
   * pmr rds/wds support (Naveen Nagar)
   * cmb v1.4 logic (Padmakar Kalghatgi)
 
 And a lot of smaller fixes from Gollu Appalanaidu and Minwoo Im.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmAiON4ACgkQTeGvMW1P
 DenhxQf/WzoiO5bXmvZeRv+IHoIn5GhJ1NxLRYO5MX0diswh0BBwUmIscbEDMe1b
 GsD6rpd9YfEO80L/sEGqgV09HT+6e8YwDsNNZBTAsUNRVx0WxckgcGWcrzNNA9Nc
 WEl0Q8si5USSQ1C7djplLXdR6p4pbA/gIk6AjNIo3q2VK1ZqCBhQGESGEfGgrAXW
 xWo8C1V8dnKdxUYI2blbti44sElHZJ6jcF5N3Xmv0UUa1WL0hh0u6qr7IbCZe1kO
 SUFWMIGLF+1C35MUyWpgCjCn5cUdnTA0s/SLEuWtDlNYRhRRh0D6LZviVTZi38Wx
 6Cxg/bRkSlcKo1/jswwYcAaH7qQ4Eg==
 =NB9D
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging

Emulated NVMe device updates

  * deallocate or unwritten logical block error feature (me)
  * dataset management command (me)
  * compare command (Gollu Appalanaidu)
  * namespace types (Niklas Cassel)
  * zoned namespaces (Dmitry Fomichev)
  * smart critical warning toggle (Zhenwei Pi)
  * allow cmb and pmr to coexist (me)
  * pmr rds/wds support (Naveen Nagar)
  * cmb v1.4 logic (Padmakar Kalghatgi)

And a lot of smaller fixes from Gollu Appalanaidu and Minwoo Im.

# gpg: Signature made Tue 09 Feb 2021 07:25:18 GMT
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* remotes/nvme/tags/nvme-next-pull-request: (56 commits)
  hw/block/nvme: refactor the logic for zone write checks
  hw/block/nvme: fix zone boundary check for append
  hw/block/nvme: fix wrong parameter name 'cross_read'
  hw/block/nvme: align with existing style
  hw/block/nvme: fix set feature save field check
  hw/block/nvme: fix set feature for error recovery
  hw/block/nvme: error if drive less than a zone size
  hw/block/nvme: lift cmb restrictions
  hw/block/nvme: bump to v1.4
  hw/block/nvme: move cmb logic to v1.4
  hw/block/nvme: add PMR RDS/WDS support
  hw/block/nvme: disable PMR at boot up
  hw/block/nvme: remove redundant zeroing of PMR registers
  hw/block/nvme: rename PMR/CMB shift/mask fields
  hw/block/nvme: allow cmb and pmr to coexist
  hw/block/nvme: move msix table and pba to BAR 0
  hw/block/nvme: indicate CMB support through controller capabilities register
  hw/block/nvme: fix 64 bit register hi/lo split writes
  hw/block/nvme: add size to mmio read/write trace events
  hw/block/nvme: trigger async event during injecting smart warning
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-02-09 13:24:37 +00:00
Klaus Jensen
c2a3640de8 hw/block/nvme: bump to v1.4
With the new CMB logic in place, bump the implemented specification
version to v1.4 by default.

This requires adding the setting the CNTRLTYPE field and modifying the
VWC field since 0x00 is no longer a valid value for bits 2:1.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:54 +01:00
Padmakar Kalghatgi
f4319477b4 hw/block/nvme: move cmb logic to v1.4
Implement v1.4 logic for configuring the Controller Memory Buffer. By
default, the v1.4 scheme will be used (CMB must be explicitly enabled by
the host), so drivers that only support v1.3 will not be able to use the
CMB anymore.

To retain the v1.3 behavior, set the boolean 'legacy-cmb' nvme device
parameter.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Padmakar Kalghatgi <p.kalghatgi@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:54 +01:00
Klaus Jensen
8e9e8b4821 hw/block/nvme: rename PMR/CMB shift/mask fields
Use the correct field names.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:53 +01:00
Andrzej Jakowski
c705063129 hw/block/nvme: indicate CMB support through controller capabilities register
This patch sets CMBS bit in controller capabilities register when user
configures NVMe driver with CMB support, so capabilites are correctly
reported to guest OS.

Signed-off-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsky@gmail.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:53 +01:00
zhenwei pi
c62720f137 hw/block/nvme: trigger async event during injecting smart warning
During smart critical warning injection by setting property from QMP
command, also try to trigger asynchronous event.

Suggested by Keith, if a event has already been raised, there is no
need to enqueue the duplicate event any more.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
[k.jensen: fix typo in commit message]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:53 +01:00
zhenwei pi
4714791b66 hw/block/nvme: add smart_critical_warning property
There is a very low probability that hitting physical NVMe disk
hardware critical warning case, it's hard to write & test a monitor
agent service.

For debugging purposes, add a new 'smart_critical_warning' property
to emulate this situation.

The orignal version of this change is implemented by adding a fixed
property which could be initialized by QEMU command line. Suggested
by Philippe & Klaus, rework like current version.

Test with this patch:
1, change smart_critical_warning property for a running VM:
 #virsh qemu-monitor-command nvme-upstream '{ "execute": "qom-set",
  "arguments": { "path": "/machine/peripheral-anon/device[0]",
  "property": "smart_critical_warning", "value":16 } }'
2, run smartctl in guest
 #smartctl -H -l error /dev/nvme0n1

  === START OF SMART DATA SECTION ===
  SMART overall-health self-assessment test result: FAILED!
  - volatile memory backup device has failed

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:53 +01:00
zhenwei pi
c6d1b5c13b nvme: introduce bit 5 for critical warning
According to NVM Express v1.4, Section 5.14.1.2 ("SMART / Health
Information"), introduce bit 5 for "Persistent Memory Region has become
read-only or unreliable".

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
[k.jensen: minor brush ups in commit message]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:53 +01:00
Klaus Jensen
b05fde2881 hw/block/nvme: enum style fix
Align with existing style and use a typedef for header-file enums.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
2021-02-08 21:15:53 +01:00
Dmitry Fomichev
e9ba46eeaf nvme: Make ZNS-related definitions
Define values and structures that are needed to support Zoned
Namespace Command Set (NVMe TP 4053).

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 20:58:34 +01:00
Niklas Cassel
922e6f4ebd hw/block/nvme: Support allocated CNS command variants
Many CNS commands have "allocated" command variants. These include
a namespace as long as it is allocated, that is a namespace is
included regardless if it is active (attached) or not.

While these commands are optional (they are mandatory for controllers
supporting the namespace attachment command), our QEMU implementation
is more complete by actually providing support for these CNS values.

However, since our QEMU model currently does not support the namespace
attachment command, these new allocated CNS commands will return the
same result as the active CNS command variants.

The reason for not hooking up this command completely is because the
NVMe specification requires the namespace management command to be
supported if the namespace attachment command is supported.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 20:58:34 +01:00
Niklas Cassel
141354d55b hw/block/nvme: Add support for Namespace Types
Define the structures and constants required to implement
Namespace Types support.

Namespace Types introduce a new command set, "I/O Command Sets",
that allows the host to retrieve the command sets associated with
a namespace. Introduce support for the command set and enable
detection for the NVM Command Set.

The new workflows for identify commands rely heavily on zero-filled
identify structs. E.g., certain CNS commands are defined to return
a zero-filled identify struct when an inactive namespace NSID
is supplied.

Add a helper function in order to avoid code duplication when
reporting zero-filled identify structures.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 20:58:34 +01:00
Dmitry Fomichev
62e8faa468 hw/block/nvme: Add Commands Supported and Effects log
This log page becomes necessary to implement to allow checking for
Zone Append command support in Zoned Namespace Command Set.

This commit adds the code to report this log page for NVM Command
Set only. The parts that are specific to zoned operation will be
added later in the series.

All incoming admin and i/o commands are now only processed if their
corresponding support bits are set in this log. This provides an
easy way to control what commands to support and what not to
depending on set CC.CSS.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 20:58:32 +01:00
Klaus Jensen
6fd704a59a nvme: add namespace I/O optimization fields to shared header
This adds the NPWG, NPWA, NPDG, NPDA and NOWS family of fields to the
shared nvme.h header for use by later patches.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <fam@euphon.net>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
2021-02-08 18:55:48 +01:00
Klaus Jensen
54064e51d1 hw/block/nvme: add dulbe support
Add support for reporting the Deallocated or Unwritten Logical Block
Error (DULBE).

Rely on the block status flags reported by the block layer and consider
any block with the BDRV_BLOCK_ZERO flag to be deallocated.

Multiple factors affect when a Write Zeroes command result in
deallocation of blocks.

  * the underlying file system block size
  * the blockdev format
  * the 'discard' and 'logical_block_size' parameters

     format | discard | wz (512B)  wz (4KiB)  wz (64KiB)
    -----------------------------------------------------
      qcow2    ignore   n          n          y
      qcow2    unmap    n          n          y
      raw      ignore   n          y          y
      raw      unmap    n          y          y

So, this works best with an image in raw format and 4KiB LBAs, since
holes can then be punched on a per-block basis (this assumes a file
system with a 4kb block size, YMMV). A qcow2 image, uses a cluster size
of 64KiB by default and blocks will only be marked deallocated if a full
cluster is zeroed or discarded. However, this *is* consistent with the
spec since Write Zeroes "should" deallocate the block if the Deallocate
attribute is set and "may" deallocate if the Deallocate attribute is not
set. Thus, we always try to deallocate (the BDRV_REQ_MAY_UNMAP flag is
always set).

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-02-08 18:55:48 +01:00
Daniel P. Berrangé
3d3e9b1f66 block: rename and alter bdrv_all_find_snapshot semantics
Currently bdrv_all_find_snapshot() will return 0 if it finds
a snapshot, -1 if an error occurs, or if it fails to find a
snapshot. New callers to be added want to distinguish between
the error scenario and failing to find a snapshot.

Rename it to bdrv_all_has_snapshot and make it return -1 on
error, 0 if no snapshot is found and 1 if snapshot is found.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20210204124834.774401-7-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-02-08 11:19:51 +00:00
Daniel P. Berrangé
c22d644ca7 block: allow specifying name of block device for vmstate storage
Currently the vmstate will be stored in the first block device that
supports snapshots. Historically this would have usually been the
root device, but with UEFI it might be the variable store. There
needs to be a way to override the choice of block device to store
the state in.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20210204124834.774401-6-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-02-08 11:19:51 +00:00
Daniel P. Berrangé
cf3a74c94f block: add ability to specify list of blockdevs during snapshot
When running snapshot operations, there are various rules for which
blockdevs are included/excluded. While this provides reasonable default
behaviour, there are scenarios that are not well handled by the default
logic. Some of the conditions do not have a single correct answer.

Thus there needs to be a way for the mgmt app to provide an explicit
list of blockdevs to perform snapshots across. This can be achieved
by passing a list of node names that should be used.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20210204124834.774401-5-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-02-08 11:19:51 +00:00
Daniel P. Berrangé
e26f98e209 block: push error reporting into bdrv_all_*_snapshot functions
The bdrv_all_*_snapshot functions return a BlockDriverState pointer
for the invalid backend, which the callers then use to report an
error message. In some cases multiple callers are reporting the
same error message, but with slightly different text. In the future
there will be more error scenarios for some of these methods, which
will benefit from fine grained error message reporting. So it is
helpful to push error reporting down a level.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
[PMD: Initialize variables]
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210204124834.774401-2-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-02-08 11:19:51 +00:00
Roman Kagan
5082fc82a6 nbd: make nbd_read* return -EIO on error
NBD reconnect logic considers the error code from the functions that
read NBD messages to tell if reconnect should be attempted or not: it is
attempted on -EIO, otherwise the client transitions to NBD_CLIENT_QUIT
state (see nbd_channel_error).  This error code is propagated from the
primitives like nbd_read.

The problem, however, is that nbd_read itself turns every error into -1
rather than -EIO.  As a result, if the NBD server happens to die while
sending the message, the client in QEMU receives less data than it
expects, considers it as a fatal error, and wouldn't attempt
reestablishing the connection.

Fix it by turning every negative return from qio_channel_read_all into
-EIO returned from nbd_read.  Apparently that was the original behavior,
but got broken later.  Also adjust nbd_readXX to follow.

Fixes: e6798f06a6 ("nbd: generalize usage of nbd_read")
Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210129073859.683063-4-rvkagan@yandex-team.ru>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:17:12 -06:00
Vladimir Sementsov-Ogievskiy
a5215b8fdf block/io: use int64_t bytes in copy_range
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert now copy_range parameters which are already 64bit to signed
type.

It's safe as we don't work with requests overflowing BDRV_MAX_LENGTH
(which is less than INT64_MAX), and do check the requests in
bdrv_co_copy_range_internal() (by bdrv_check_request32(), which calls
bdrv_check_request()).

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-17-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:17:12 -06:00
Vladimir Sementsov-Ogievskiy
e9e52efdc5 block/io: support int64_t bytes in read/write wrappers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

Now, since bdrv_co_preadv_part() and bdrv_co_pwritev_part() have been
updated, update all their wrappers.

For all of them type of 'bytes' is widening, so callers are safe. We
have update request_fn in blkverify.c simultaneously. Still it's just a
pointer to one of bdrv_co_pwritev() or bdrv_co_preadv(), and type is
widening for callers of the request_fn anyway.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-16-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:17:12 -06:00
Vladimir Sementsov-Ogievskiy
37e9403ea8 block/io: support int64_t bytes in bdrv_co_p{read,write}v_part()
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, prepare bdrv_co_preadv_part() and bdrv_co_pwritev_part() and their
remaining dependencies now.

bdrv_pad_request() is updated simultaneously, as pointer to bytes passed
to it both from bdrv_co_pwritev_part() and bdrv_co_preadv_part().

So, all callers of bdrv_pad_request() are updated to pass 64bit bytes.
bdrv_pad_request() is already good for 64bit requests, add
corresponding assertion.

Look at bdrv_co_preadv_part() and bdrv_co_pwritev_part().
Type is widening, so callers are safe. Let's look inside the functions.

In bdrv_co_preadv_part() and bdrv_aligned_pwritev() we only pass bytes
to other already int64_t interfaces (and some obviously safe
calculations), it's OK.

In bdrv_co_do_zero_pwritev() aligned_bytes may become large now, still
it's passed to bdrv_aligned_pwritev which supports int64_t bytes.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-15-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:17:11 -06:00
Eric Blake
8024726459 block: use int64_t as bytes type in tracked requests
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

All requests in block/io must not overflow BDRV_MAX_LENGTH, all
external users of BdrvTrackedRequest already have corresponding
assertions, so we are safe. Add some assertions still.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-9-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:14:15 -06:00
Vladimir Sementsov-Ogievskiy
801625e69d block/throttle-groups: throttle_group_co_io_limits_intercept(): 64bit bytes
The function is called from 64bit io handlers, and bytes is just passed
to throttle_account() which is 64bit too (unsigned though). So, let's
convert intermediate argument to 64bit too.

This patch is a first in the 64-bit-blocklayer series, so we are
generally moving to int64_t for both offset and bytes parameters on all
io paths. Main motivation is realization of 64-bit write_zeroes
operation for fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

Patch-correctness audit by Eric Blake:

  Caller has 32-bit, this patch now causes widening which is safe:
  block/block-backend.c: blk_do_preadv() passes 'unsigned int'
  block/block-backend.c: blk_do_pwritev_part() passes 'unsigned int'
  block/throttle.c: throttle_co_pwrite_zeroes() passes 'int'
  block/throttle.c: throttle_co_pdiscard() passes 'int'

  Caller has 64-bit, this patch fixes potential bug where pre-patch
  could narrow, except it's easy enough to trace that callers are still
  capped at 2G actions:
  block/throttle.c: throttle_co_preadv() passes 'uint64_t'
  block/throttle.c: throttle_co_pwritev() passes 'uint64_t'

  Implementation in question: block/throttle-groups.c
  throttle_group_co_io_limits_intercept() takes 'unsigned int bytes'
  and uses it: argument to util/throttle.c throttle_account(uint64_t)

  All safe: it patches a latent bug, and does not introduce any 64-bit
  gotchas once throttle_co_p{read,write}v are relaxed, and assuming
  throttle_account() is not buggy.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20201211183934.169161-7-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:14:00 -06:00
Vladimir Sementsov-Ogievskiy
69b55e03f7 block: refactor bdrv_check_request: add errp
It's better to pass &error_abort than just assert that result is 0: on
crash, we'll immediately see the reason in the backtrace.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: fix iotest 206 fallout]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:00:33 -06:00
Vladimir Sementsov-Ogievskiy
143a6384f5 block/block-copy: drop unused argument of block_copy()
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-21-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy
5b49c2bdc1 block/block-copy: drop unused block_copy_set_progress_callback()
Drop unused code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-20-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy
e0323a045f blockjob: add set_speed to BlockJobDriver
We are going to use async block-copy call in backup, so we'll need to
passthrough setting backup speed to block-copy call.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-9-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy
a6d23d56df block/block-copy: add block_copy_cancel
Add function to cancel running async block-copy call. It will be used
in backup.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-8-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy
7e032df0ea block/block-copy: add ratelimit to block-copy
We are going to directly use one async block-copy operation for backup
job, so we need rate limiter.

We want to maintain current backup behavior: only background copying is
limited and copy-before-write operations only participate in limit
calculation. Therefore we need one rate limiter for block-copy state
and boolean flag for block-copy call state for actual limitation.

Note, that we can't just calculate each chunk in limiter after
successful copying: it will not save us from starting a lot of async
sub-requests which will exceed limit too much. Instead let's use the
following scheme on sub-request creation:
1. If at the moment limit is not exceeded, create the request and
account it immediately.
2. If at the moment limit is already exceeded, drop create sub-request
and handle limit instead (by sleep).
With this approach we'll never exceed the limit more than by one
sub-request (which pretty much matches current backup behavior).

Note also, that if there is in-flight block-copy async call,
block_copy_kick() should be used after set-speed to apply new setup
faster. For that block_copy_kick() published in this patch.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-7-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy
26be9d62dd block/block-copy: add max_chunk and max_workers parameters
They will be used for backup.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-5-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy
de4641b46b block/block-copy: implement block_copy_async
We'll need async block-copy invocation to use in backup directly.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-4-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy
86c6a3b690 qapi: backup: add perf.use-copy-range parameter
Experiments show, that copy_range is not always making things faster.
So, to make experimentation simpler, let's add a parameter. Some more
perf parameters will be added soon, so here is a new struct.

For now, add new backup qmp parameter with x- prefix for the following
reasons:

 - We are going to add more performance parameters, some will be
   related to the whole block-copy process, some only to background
   copying in backup (ignored for copy-before-write operations).
 - On the other hand, we are going to use block-copy interface in other
   block jobs, which will need performance options as well.. And it
   should be the same structure or at least somehow related.

So, there are too much unclean things about how the interface and now
we need the new options mostly for testing. Let's keep them
experimental for a while.

In do_backup_common() new x-perf parameter handled in a way to
make further options addition simpler.

We add use-copy-range with default=true, and we'll change the default
in further patch, after moving backup to use block-copy.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-2-vsementsov@virtuozzo.com>
[mreitz: s/5\.2/6.0/]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy
7f4a396d76 qapi: block-stream: add "bottom" argument
The code already don't freeze base node and we try to make it prepared
for the situation when base node is changed during the operation. In
other words, block-stream doesn't own base node.

Let's introduce a new interface which should replace the current one,
which will in better relations with the code. Specifying bottom node
instead of base, and requiring it to be non-filter gives us the
following benefits:

 - drop difference between above_base and base_overlay, which will be
   renamed to just bottom, when old interface dropped

 - clean way to work with parallel streams/commits on the same backing
   chain, which otherwise become a problem when we introduce a filter
   for stream job

 - cleaner interface. Nobody will surprised the fact that base node may
   disappear during block-stream, when there is no word about "base" in
   the interface.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201216061703.70908-11-vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Andrey Shinkevich
e275458b29 copy-on-read: skip non-guest reads if no copy needed
If the flag BDRV_REQ_PREFETCH was set, skip idling read/write
operations in COR-driver. It can be taken into account for the
COR-algorithms optimization. That check is being made during the
block stream job by the moment.

Add the BDRV_REQ_PREFETCH flag to the supported_read_flags of the
COR-filter.

block: Modify the comment for the flag BDRV_REQ_PREFETCH as we are
going to use it alone and pass it to the COR-filter driver for further
processing.

Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-9-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Andrey Shinkevich
897dd0ec4f block: include supported_read_flags into BDS structure
Add the new member supported_read_flags to the BlockDriverState
structure. It will control the flags set for copy-on-read operations.
Make the block generic layer evaluate supported read flags before they
go to a block driver.

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
 [vsementsov: use assert instead of abort]
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-8-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Andrey Shinkevich
880747a887 qapi: add filter-node-name to block-stream
Provide the possibility to pass the 'filter-node-name' parameter to the
block-stream job as it is done for the commit block job.

Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  [vsementsov: comment indentation, s/Since: 5.2/Since: 6.0/]
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-5-vsementsov@virtuozzo.com>
[mreitz: s/commit/stream/]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 11:26:54 +01:00
Andrey Shinkevich
8872ef78ab block: add API function to insert a node
Provide API for insertion a node to backing chain.

Suggested-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-3-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 11:26:54 +01:00
Vladimir Sementsov-Ogievskiy
d1a764d126 block: introduce BDRV_REQ_NO_WAIT flag
Add flag to make serialising request no wait: if there are conflicting
requests, just return error immediately. It's will be used in upcoming
preallocate filter.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201021145859.11201-7-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-12-18 12:35:55 +01:00
Vladimir Sementsov-Ogievskiy
8ac5aab255 block: bdrv_mark_request_serialising: split non-waiting function
We'll need a separate function, which will only "mark" request
serialising with specified align but not wait for conflicting
requests. So, it will be like old bdrv_mark_request_serialising(),
before merging bdrv_wait_serialising_requests_locked() into it.

To reduce the possible mess, let's do the following:

Public function that does both marking and waiting will be called
bdrv_make_request_serialising, and private function which will only
"mark" will be called tracked_request_set_serialising().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201021145859.11201-6-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-12-18 12:35:55 +01:00
Vladimir Sementsov-Ogievskiy
2153994e2e block: simplify comment to BDRV_REQ_SERIALISING
1. BDRV_REQ_NO_SERIALISING doesn't exist already, don't mention it.

2. We are going to add one more user of BDRV_REQ_SERIALISING, so
   comment about backup becomes a bit confusing here. The use case in
   backup is documented in block/backup.c, so let's just drop
   duplication here.

3. The fact that BDRV_REQ_SERIALISING is only for write requests is
   omitted. Add a note.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20201021145859.11201-2-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-12-18 12:35:55 +01:00
Vladimir Sementsov-Ogievskiy
8b1170012b block: introduce BDRV_MAX_LENGTH
We are going to modify block layer to work with 64bit requests. And
first step is moving to int64_t type for both offset and bytes
arguments in all block request related functions.

It's mostly safe (when widening signed or unsigned int to int64_t), but
switching from uint64_t is questionable.

So, let's first establish the set of requests we want to work with.
First signed int64_t should be enough, as off_t is signed anyway. Then,
obviously offset + bytes should not overflow.

And most interesting: (offset + bytes) being aligned up should not
overflow as well. Aligned to what alignment? First thing that comes in
mind is bs->bl.request_alignment, as we align up request to this
alignment. But there is another thing: look at
bdrv_mark_request_serialising(). It aligns request up to some given
alignment. And this parameter may be bdrv_get_cluster_size(), which is
often a lot greater than bs->bl.request_alignment.
Note also, that bdrv_mark_request_serialising() uses signed int64_t for
calculations. So, actually, we already depend on some restrictions.

Happily, bdrv_get_cluster_size() returns int and
bs->bl.request_alignment has 32bit unsigned type, but defined to be a
power of 2 less than INT_MAX. So, we may establish, that INT_MAX is
absolute maximum for any kind of alignment that may occur with the
request.

Note, that bdrv_get_cluster_size() is not documented to return power
of 2, still bdrv_mark_request_serialising() behaves like it is.
Also, backup uses bdi.cluster_size and is not prepared to it not being
power of 2.
So, let's establish that Qemu supports only power-of-2 clusters and
alignments.

So, alignment can't be greater than 2^30.

Finally to be safe with calculations, to not calculate different
maximums for different nodes (depending on cluster size and
request_alignment), let's simply set QEMU_ALIGN_DOWN(INT64_MAX, 2^30)
as absolute maximum bytes length for Qemu. Actually, it's not much less
than INT64_MAX.

OK, then, let's apply it to block/io.

Let's consider all block/io entry points of offset/bytes:

4 bytes/offset interface functions: bdrv_co_preadv_part(),
bdrv_co_pwritev_part(), bdrv_co_copy_range_internal() and
bdrv_co_pdiscard() and we check them all with bdrv_check_request().

We also have one entry point with only offset: bdrv_co_truncate().
Check the offset.

And one public structure: BdrvTrackedRequest. Happily, it has only
three external users:

 file-posix.c: adopted by this patch
 write-threshold.c: only read fields
 test-write-threshold.c: sets obviously small constant values

Better is to make the structure private and add corresponding
interfaces.. Still it's not obvious what kind of interface is needed
for file-posix.c. Let's keep it public but add corresponding
assertions.

After this patch we'll convert functions in block/io.c to int64_t bytes
and offset parameters. We can assume that offset/bytes pair always
satisfy new restrictions, and make
corresponding assertions where needed. If we reach some offset/bytes
point in block/io.c missing bdrv_check_request() it is considered a
bug. As well, if block/io.c modifies a offset/bytes request, expanding
it more then aligning up to request_alignment, it's a bug too.

For all io requests except for discard we keep for now old restriction
of 32bit request length.

iotest 206 output error message changed, as now test disk size is
larger than new limit. Add one more test case with new maximum disk
size to cover too-big-L1 case.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201203222713.13507-5-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-12-11 17:52:40 +01:00
Max Reitz
0c9b70d590 fuse: Allow exporting BDSs via FUSE
block-export-add type=fuse allows mounting block graph nodes via FUSE on
some existing regular file.  That file should then appears like a raw
disk image, and accesses to it result in accesses to the exported BDS.

Right now, we only implement the necessary block export functions to set
it up and shut it down.  We do not implement any access functions, so
accessing the mount point only results in errors.  This will be
addressed by a followup patch.

We keep a hash table of exported mount points, because we want to be
able to detect when users try to use a mount point twice.  This is
because we invoke stat() to check whether the given mount point is a
regular file, but if that file is served by ourselves (because it is
already used as a mount point), then this stat() would have to be served
by ourselves, too, which is impossible to do while we (as the caller)
are waiting for it to settle.  Therefore, keep track of mount point
paths to at least catch the most obvious instances of that problem.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201027190600.192171-3-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-12-11 17:52:39 +01:00
Peter Maydell
683685e72d Pull request for 5.2
NVMe fixes to solve IOMMU issues on non-x86 and error message/tracing
 improvements. Elena Afanasova's ioeventfd fixes are also included.
 
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEhpWov9P5fNqsNXdanKSrs4Grc8gFAl+ixjgACgkQnKSrs4Gr
 c8iZYgf+OB2eAGsdZO97fKh6VUUoRKa+BgWKuh37Cfpp3q+dLuIFMSKfU/UgprLc
 aowt6uTFfwudDV9KltUB2EiXIzpuf7JhMNOiDRkyEvYSj4KHRPsQmFCd35Nrjezy
 VvxSGafe2Z60Qnvcx+iGeMATSFX9YTcTZeHttC07v7dWn/yEK3b1hobcmjCcwWeR
 Ud8pjMyh5E2z/NpW8E669/byJf9iahx3LSQxSWt+9PVTPuftAB0Suu+m6svz1wvk
 sjVfIbtVWCp2BdGf5U6a2rEqF3+kIcFkfHp+MwgE0EdMz1wfjudaPl13a0C4DSun
 PSt9E+Ct5BTrDUvqCHvQDOaFiMZTPg==
 =Poyb
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/stefanha-gitlab/tags/block-pull-request' into staging

Pull request for 5.2

NVMe fixes to solve IOMMU issues on non-x86 and error message/tracing
improvements. Elena Afanasova's ioeventfd fixes are also included.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

# gpg: Signature made Wed 04 Nov 2020 15:18:16 GMT
# gpg:                using RSA key 8695A8BFD3F97CDAAC35775A9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" [full]
# gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>" [full]
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35  775A 9CA4 ABB3 81AB 73C8

* remotes/stefanha-gitlab/tags/block-pull-request: (33 commits)
  util/vfio-helpers: Assert offset is aligned to page size
  util/vfio-helpers: Convert vfio_dump_mapping to trace events
  util/vfio-helpers: Improve DMA trace events
  util/vfio-helpers: Trace where BARs are mapped
  util/vfio-helpers: Trace PCI BAR region info
  util/vfio-helpers: Trace PCI I/O config accesses
  util/vfio-helpers: Improve reporting unsupported IOMMU type
  block/nvme: Fix nvme_submit_command() on big-endian host
  block/nvme: Fix use of write-only doorbells page on Aarch64 arch
  block/nvme: Align iov's va and size on host page size
  block/nvme: Change size and alignment of prp_list_pages
  block/nvme: Change size and alignment of queue
  block/nvme: Change size and alignment of IDENTIFY response buffer
  block/nvme: Correct minimum device page size
  block/nvme: Set request_alignment at initialization
  block/nvme: Simplify nvme_cmd_sync()
  block/nvme: Simplify ADMIN queue access
  block/nvme: Correctly initialize Admin Queue Attributes
  block/nvme: Use definitions instead of magic values in add_io_queue()
  block/nvme: Introduce Completion Queue definitions
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2020-11-23 13:03:13 +00:00
Greg Kurz
009cde17a5 block: Move bdrv_drain_all_end_quiesce() to block_int.h
This function is really an internal helper for bdrv_close(). Update its
doc comment to make this clear and make the function private.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <160387245480.131299.13430357162209598411.stgit@bahia>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-11-09 15:44:21 +01:00
Philippe Mathieu-Daudé
54248d4d73 block/nvme: Introduce Completion Queue definitions
Rename Submission Queue flags with 'Sq' to differentiate
submission queue flags from command queue flags, and introduce
Completion Queue flag definitions.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20201029093306.1063879-13-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03 19:06:21 +00:00
Peter Maydell
8680d6e364 nvme pull 2 Nov 2020
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE28EdLTc7SjdV9QLsYlFWYQpPbMAFAl+gI74ACgkQYlFWYQpP
 bMCBrA/9GXMZZGDfHFenXF+rS6J+ZKxtk29vq9Ly8KZ9YW7CzF9MP8qE/5iyFfmx
 d1BknXGQerW2kAzpkOq2/MKDklOc+0BAhaTdUaFR/ao5ZKuv2LQ8uFnKVoTrhTx9
 +HVkTVUTnez6ReCZVIrtN4+XVdyQTeQotJg6H2m5Q/BxQKcj6OMOlneuSGDn5vFN
 EWgDvEmfFEkzbN8FMXtkT35bg3vA5TGmfQRMk1SMMREOPxF04CaTVTxYscCpS0WC
 Cl+62mx4XLjscK7hwXuTNTrxeOLxZ2xLK5dhDd/qxBveio07mIM5X2psdKR0t5qX
 HLtm437T9CAYmyo8jgvM4KL8f+rbJnLd579qyVwIMsue28Qisj9nuWCTcaEpjfck
 4krhxJwxenRtqQ9wYrnbnQI5yQDIE6iUGf0toXwCNdJIr+FvyIcT7vJtTzZXtRI8
 sxwK5wfJ/WSey9uNLZGFbQuv4vjOMV+Nk3mEi1gUV8ujogo+2U6WUAE3NhqFLKn1
 YT6AJhDZvqL1f8gFrbiqR8xwvPrYmwK/tK38X1exSDOqiB7UNzR/apAb1oniul0e
 rS5xWzIs9APvkdWQssCHvrVDdh6VISXQ5bnT8lkfmvYrCTn2gUGAFXDrxZjXIaL9
 scCr8N9STkHmoYpc2ACRKIpfK3E1sDjGA8mAPemkxsLakNwBS4o=
 =s4KC
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/pull-nvme-20201102' into staging

nvme pull 2 Nov 2020

# gpg: Signature made Mon 02 Nov 2020 15:20:30 GMT
# gpg:                using RSA key DBC11D2D373B4A3755F502EC625156610A4F6CC0
# gpg: Good signature from "Keith Busch <kbusch@kernel.org>" [unknown]
# gpg:                 aka "Keith Busch <keith.busch@gmail.com>" [unknown]
# gpg:                 aka "Keith Busch <keith.busch@intel.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DBC1 1D2D 373B 4A37 55F5  02EC 6251 5661 0A4F 6CC0

* remotes/nvme/tags/pull-nvme-20201102: (30 commits)
  hw/block/nvme: fix queue identifer validation
  hw/block/nvme: fix create IO SQ/CQ status codes
  hw/block/nvme: fix prp mapping status codes
  hw/block/nvme: report actual LBA data shift in LBAF
  hw/block/nvme: add trace event for requests with non-zero status code
  hw/block/nvme: add nsid to get/setfeat trace events
  hw/block/nvme: reject io commands if only admin command set selected
  hw/block/nvme: support for admin-only command set
  hw/block/nvme: validate command set selected
  hw/block/nvme: support per-namespace smart log
  hw/block/nvme: fix log page offset check
  hw/block/nvme: remove pointless rw indirection
  hw/block/nvme: update nsid when registered
  hw/block/nvme: change controller pci id
  pci: allocate pci id for nvme
  hw/block/nvme: support multiple namespaces
  hw/block/nvme: refactor identify active namespace id list
  hw/block/nvme: add support for sgl bit bucket descriptor
  hw/block/nvme: add support for scatter gather lists
  hw/block/nvme: harden cmb access
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2020-11-02 17:17:29 +00:00
Eric Blake
71719cd57f nbd: Add new qemu:allocation-depth metadata context
'qemu-img map' provides a way to determine which extents of an image
come from the top layer vs. inherited from a backing chain.  This is
useful information worth exposing over NBD.  There is a proposal to
add a QMP command block-dirty-bitmap-populate which can create a dirty
bitmap that reflects allocation information, at which point the
qemu:dirty-bitmap:NAME metadata context can expose that information
via the creation of a temporary bitmap, but we can shorten the effort
by adding a new qemu:allocation-depth metadata context that does the
same thing without an intermediate bitmap (this patch does not
eliminate the need for that proposal, as it will have other uses as
well).

While documenting things, remember that although the NBD protocol has
NBD_OPT_SET_META_CONTEXT, the rest of its documentation refers to
'metadata context', which is a more apt description of what is
actually being used by NBD_CMD_BLOCK_STATUS: the user is requesting
metadata by passing one or more context names.  So I also touched up
some existing wording to prefer the term 'metadata context' where it
makes sense.

Note that this patch does not actually enable any way to request a
server to enable this context; that will come in the next patch.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20201027050556.269064-10-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2020-10-30 15:22:00 -05:00
Greg Kurz
1a6d3bd229 block: End quiescent sections when a BDS is deleted
If a BDS gets deleted during blk_drain_all(), it might miss a
call to bdrv_do_drained_end(). This means missing a call to
aio_enable_external() and the AIO context remains disabled for
ever. This can cause a device to become irresponsive and to
disrupt the guest execution, ie. hang, loop forever or worse.

This scenario is quite easy to encounter with virtio-scsi
on POWER when punching multiple blockdev-create QMP commands
while the guest is booting and it is still running the SLOF
firmware. This happens because SLOF disables/re-enables PCI
devices multiple times via IO/MEM/MASTER bits of PCI_COMMAND
register after the initial probe/feature negotiation, as it
tends to work with a single device at a time at various stages
like probing and running block/network bootloaders without
doing a full reset in-between. This naturally generates many
dataplane stops and starts, and thus many drain sections that
can race with blockdev_create_run(). In the end, SLOF bails
out.

It is somehow reproducible on x86 but it requires to generate
articial dataplane start/stop activity with stop/cont QMP
commands. In this case, seabios ends up looping for ever,
waiting for the virtio-scsi device to send a response to
a command it never received.

Add a helper that pairs all previously called bdrv_do_drained_begin()
with a bdrv_do_drained_end() and call it from bdrv_close().
While at it, update the "/bdrv-drain/graph-change/drain_all"
test in test-bdrv-drain so that it can catch the issue.

BugId: https://bugzilla.redhat.com/show_bug.cgi?id=1874441
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <160346526998.272601.9045392804399803158.stgit@bahia.lan>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-27 15:26:20 +01:00
Alberto Garcia
46cd1e8a47 qcow2: Skip copy-on-write when allocating a zero cluster
Since commit c8bb23cbdb when a write
request results in a new allocation QEMU first tries to see if the
rest of the cluster outside the written area contains only zeroes.

In that case, instead of doing a normal copy-on-write operation and
writing explicit zero buffers to disk, the code zeroes the whole
cluster efficiently using pwrite_zeroes() with BDRV_REQ_NO_FALLBACK.

This improves performance very significantly but it only happens when
we are writing to an area that was completely unallocated before. Zero
clusters (QCOW2_CLUSTER_ZERO_*) are treated like normal clusters and
are therefore slower to allocate.

This happens because the code uses bdrv_is_allocated_above() rather
bdrv_block_status_above(). The former is not as accurate for this
purpose but it is faster. However in the case of qcow2 the underlying
call does already report zero clusters just fine so there is no reason
why we cannot use that information.

After testing 4KB writes on an image that only contains zero clusters
this patch results in almost five times more IOPS.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <6d77cab968c501c44d6e1089b9bc91b04170b49e.1603731354.git.berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-27 15:26:20 +01:00
Gollu Appalanaidu
28fee5b5d0 hw/block/nvme: fix prp mapping status codes
Address 0 is not an invalid address. Remove those invalikd checks.

Unaligned PRP2 and PRP list entries should result in Invalid PRP Offset
status code and not Invalid Field. Fix that.

See NVMe Express v1.3d, Section 4.3 ("Physical Region Page Entry and
List").

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2020-10-27 11:29:25 +01:00
Klaus Jensen
1b48e4611a hw/block/nvme: reject io commands if only admin command set selected
If the host sets CC.CSS to 111b, all commands submitted to I/O queues
should be completed with status Invalid Command Opcode.

Note that this is technically a v1.4 feature, but it does not hurt to
implement before we finally bump the reported version implemented.

Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-10-27 11:29:25 +01:00
Keith Busch
8c5cea8593 hw/block/nvme: support for admin-only command set
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2020-10-27 11:29:25 +01:00
Keith Busch
492f9a8d79 hw/block/nvme: validate command set selected
Fail to start the controller if the user requests a command set that the
controller does not support.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2020-10-27 11:29:25 +01:00
Keith Busch
2fbbecc5cd hw/block/nvme: support per-namespace smart log
Let the user specify a specific namespace if they want to get access
stats for a specific namespace.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2020-10-27 11:29:25 +01:00
Klaus Jensen
cba0a8a344 hw/block/nvme: add support for scatter gather lists
For now, support the Data Block, Segment and Last Segment descriptor
types.

See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2020-10-27 07:24:47 +01:00
Kevin Wolf
18c6ac1c6e block: Add bdrv_lock()/unlock()
Inside of coroutine context, we can't directly use aio_context_acquire()
for the AioContext of a block node because we already own the lock of
the current AioContext and we need to avoid double locking to prevent
deadlocks.

This provides helper functions to lock the AioContext of a node only if
it's not the same as the current AioContext.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20201005155855.256490-14-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2020-10-09 07:08:20 +02:00
Kevin Wolf
e336fd4c4b block: Add bdrv_co_enter()/leave()
Add a pair of functions to temporarily move the current coroutine to the
AioContext of a given BlockDriverState.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20201005155855.256490-13-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2020-10-09 07:08:20 +02:00
Kevin Wolf
26b0b698c0 util/async: Add aio_co_reschedule_self()
Add a function that can be used to move the currently running coroutine
to a different AioContext (and therefore potentially a different
thread).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201005155855.256490-12-kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2020-10-09 07:08:20 +02:00
Peter Maydell
f2687fdb75 * Reverse debugging (Pavel)
* CFLAGS cleanup (Paolo)
 * ASLR fix (Mark)
 * cpus.c refactoring (Claudio)
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl98EB0UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOCsQf9G7EUAK1zcEOx20LtDdXFrk4tjsRp
 S83OGdihWe8SM+XiY9BfqsBbXdByqF+SitePOV3feGK0mOP5vtJIL7/2DLrtFTeF
 wOeARRA9ePVb7hcL5oXAQeE3bXrX8wq8Qtw9xAoHdw5JAEVmKIEJS6AL5Eu3M2Fh
 pvdBoV84pOm2/ARS3eRstRyW8gCC8rdLDlNsVDtCbYdNVq+VdkzR0l5Phc8JDx1M
 Qjdl1KpN6ZkuN8M6tnaQNTb9IUVu5c1tu5jdR6JdLUqAWp1wYZJ6r2jSatZWfLR3
 H+gzFsDoLPfCjZ3IhfZyvzF5leSZmdbFfzI0tHS1UJ/ZZYjutDvlPlbyYA==
 =Jys5
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/bonzini-gitlab/tags/for-upstream' into staging

* Reverse debugging (Pavel)
* CFLAGS cleanup (Paolo)
* ASLR fix (Mark)
* cpus.c refactoring (Claudio)

# gpg: Signature made Tue 06 Oct 2020 07:35:09 BST
# gpg:                using RSA key F13338574B662389866C7682BFFBD25F78C7AE83
# gpg:                issuer "pbonzini@redhat.com"
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" [full]
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>" [full]
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4  E2F7 7E15 100C CD36 69B1
#      Subkey fingerprint: F133 3857 4B66 2389 866C  7682 BFFB D25F 78C7 AE83

* remotes/bonzini-gitlab/tags/for-upstream: (37 commits)
  tests/acceptance: add reverse debugging test
  replay: create temporary snapshot at debugger connection
  replay: describe reverse debugging in docs/replay.txt
  gdbstub: add reverse continue support in replay mode
  gdbstub: add reverse step support in replay mode
  replay: flush rr queue before loading the vmstate
  replay: implement replay-seek command
  replay: introduce breakpoint at the specified step
  replay: introduce info hmp/qmp command
  qapi: introduce replay.json for record/replay-related stuff
  migration: introduce icount field for snapshots
  qcow2: introduce icount field for snapshots
  replay: provide an accessor for rr filename
  replay: don't record interrupt poll
  configure: don't enable ASLR for --enable-debug Windows builds
  configure: consistently pass CFLAGS/CXXFLAGS/LDFLAGS to meson
  configure: do not clobber environment CFLAGS/CXXFLAGS/LDFLAGS
  dtc: Convert Makefile bits to meson bits
  slirp: Convert Makefile bits to meson bits
  accel/tcg: use current_machine as it is always set for softmmu
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2020-10-06 15:04:10 +01:00
Pavel Dovgalyuk
b39847a505 migration: introduce icount field for snapshots
Saving icount as a parameters of the snapshot allows navigation between
them in the execution replay scenario.
This information can be used for finding a specific snapshot for proceeding
the recorded execution to the specific moment of the time.
E.g., 'reverse step' action (introduced in one of the following patches)
needs to load the nearest snapshot which is prior to the current moment
of time.
This patch also updates snapshot test which verifies qemu monitor output.

Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Acked-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>

--

v4 changes:
 - squashed format update with test output update
v7 changes:
 - introduced the spaces between the fields in snapshot info output
 - updated the test to match new field widths
Message-Id: <160174518865.12451.14327573383978752463.stgit@pasha-ThinkPad-X280>

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-06 08:34:49 +02:00
Vladimir Sementsov-Ogievskiy
685257a284 include/block/block.h: drop non-ascii quotation mark
This is the only non-ascii character in the file and it doesn't really
needed here. Let's use normal "'" symbol for consistency with the rest
11 occurrences of "'" in the file.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2020-10-05 10:59:42 +01:00
Vladimir Sementsov-Ogievskiy
b33b354f3a block/io: refactor save/load vmstate
Like for read/write in a previous commit, drop extra indirection layer,
generate directly bdrv_readv_vmstate() and bdrv_writev_vmstate().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-8-vsementsov@virtuozzo.com>
2020-10-05 10:59:42 +01:00
Vladimir Sementsov-Ogievskiy
fae2681add block: drop bdrv_prwv
Now that we are not maintaining boilerplate code for coroutine
wrappers, there is no more sense in keeping the extra indirection layer
of bdrv_prwv().  Let's drop it and instead generate pure bdrv_preadv()
and bdrv_pwritev().

Currently, bdrv_pwritev() and bdrv_preadv() are returning bytes on
success, auto generated functions will instead return zero, as their
_co_ prototype. Still, it's simple to make the conversion safe: the
only external user of bdrv_pwritev() is test-bdrv-drain, and it is
comfortable enough with bdrv_co_pwritev() instead. So prototypes are
moved to local block/coroutines.h. Next, the only internal use is
bdrv_pread() and bdrv_pwrite(), which are modified to return bytes on
success.

Of course, it would be great to convert bdrv_pread() and bdrv_pwrite()
to return 0 on success. But this requires audit (and probably
conversion) of all their users, let's leave it for another day
refactoring.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-7-vsementsov@virtuozzo.com>
2020-10-05 10:59:42 +01:00
Vladimir Sementsov-Ogievskiy
9bb4b066cc block: generate coroutine-wrapper code
Use code generation implemented in previous commit to generated
coroutine wrappers in block.c and block/io.c

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-6-vsementsov@virtuozzo.com>
2020-10-05 10:59:42 +01:00
Vladimir Sementsov-Ogievskiy
aaaa20b69b scripts: add block-coroutine-wrapper.py
We have a very frequent pattern of creating a coroutine from a function
with several arguments:

  - create a structure to pack parameters
  - create _entry function to call original function taking parameters
    from struct
  - do different magic to handle completion: set ret to NOT_DONE or
    EINPROGRESS or use separate bool field
  - fill the struct and create coroutine from _entry function with this
    struct as a parameter
  - do coroutine enter and BDRV_POLL_WHILE loop

Let's reduce code duplication by generating coroutine wrappers.

This patch adds scripts/block-coroutine-wrapper.py together with some
friends, which will generate functions with declared prototypes marked
by the 'generated_co_wrapper' specifier.

The usage of new code generation is as follows:

    1. define the coroutine function somewhere

        int coroutine_fn bdrv_co_NAME(...) {...}

    2. declare in some header file

        int generated_co_wrapper bdrv_NAME(...);

       with same list of parameters (generated_co_wrapper is
       defined in "include/block/block.h").

    3. Make sure the block_gen_c declaration in block/meson.build
       mentions the file with your marker function.

Still, no function is now marked, this work is for the following
commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200924185414.28642-5-vsementsov@virtuozzo.com>
[Added encoding='utf-8' to open() calls as requested by Vladimir. Fixed
typo and grammar issues pointed out by Eric Blake. Removed clang-format
dependency that caused build test issues.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2020-10-05 10:59:06 +01:00
Vladimir Sementsov-Ogievskiy
5416645fcf block: return error-code from bdrv_invalidate_cache
This is the only coroutine wrapper from block.c and block/io.c which
doesn't return a value, so let's convert it to the common behavior, to
simplify moving to generated coroutine wrappers in a further commit.

Also, bdrv_invalidate_cache is a void function, returning error only
through **errp parameter, which is considered to be bad practice, as
it forces callers to define and propagate local_err variable, so
conversion is good anyway.

This patch leaves the conversion of .bdrv_co_invalidate_cache() driver
callbacks and bdrv_invalidate_cache_all() for another day.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-2-vsementsov@virtuozzo.com>
2020-10-05 09:35:52 +01:00
Kevin Wolf
5b1cb49704 nbd: Merge nbd_export_new() and nbd_export_create()
There is no real reason any more why nbd_export_new() and
nbd_export_create() should be separate functions. The latter only
performs a few checks before it calls the former.

What makes the current state stand out is that it's the only function in
BlockExportDriver that is not a static function inside nbd/server.c, but
a small wrapper in blockdev-nbd.c that then calls back into nbd/server.c
for the real functionality.

Move all the checks to nbd/server.c and make the resulting function
static to improve readability.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-27-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
331170e073 block/export: Create BlockBackend in blk_exp_add()
Every export type will need a BlockBackend, so creating it centrally in
blk_exp_add() instead of the .create driver callback avoids duplication.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-24-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
37a4f70cea block/export: Move blk to BlockExport
Every block export has a BlockBackend representing the disk that is
exported. It should live in BlockExport therefore.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-23-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
3c3bc462ad block/export: Add block-export-del
Implement a new QMP command block-export-del and make nbd-server-remove
a wrapper around it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-21-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
3859ad36f0 block/export: Move strong user reference to block_exports
The reference owned by the user/monitor that is created when adding the
export and dropped when removing it was tied to the 'exports' list in
nbd/server.c. Every block export will have a user reference, so move it
to the block export level and tie it to the 'block_exports' list in
block/export/export.c instead. This is necessary for introducing a QMP
command for removing exports.

Note that exports are present in block_exports even after the user has
requested shutdown. This is different from NBD's exports where exports
are immediately removed on a shutdown request, even if they are still in
the process of shutting down. In order to avoid that the user still
interacts with an export that is shutting down (and possibly removes it
a second time), we need to remember if the user actually still owns it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-20-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
d53be9ce55 block/export: Add 'id' option to block-export-add
We'll need an id to identify block exports in monitor commands. This
adds one.

Note that this is different from the 'name' option in the NBD server,
which is the externally visible export name. While block export ids need
to be unique in the whole process, export names must be unique only for
the same server. Different export types or (potentially in the future)
multiple NBD servers can have the same export name externally, but still
need different block export ids internally.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-19-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
bc4ee65b8c block/export: Add blk_exp_close_all(_type)
This adds a function to shut down all block exports, and another one to
shut down the block exports of a single type. The latter is used for now
when stopping the NBD server. As soon as we implement support for
multiple NBD servers, we'll need a per-server list of exports and it
will be replaced by a function using that.

As a side effect, the BlockExport layer has a list tracking all existing
exports now. closed_exports loses its only user and can go away.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-18-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
a6ff798966 block/export: Allocate BlockExport in blk_exp_add()
Instead of letting the driver allocate and return the BlockExport
object, allocate it already in blk_exp_add() and pass it. This allows us
to initialise the generic part before calling into the driver so that
the driver can just use these values instead of having to parse the
options a second time.

For symmetry, move freeing the BlockExport to blk_exp_unref().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-17-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
8612c68673 block/export: Move AioContext from NBDExport to BlockExport
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-15-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
c69de1bef5 block/export: Move refcount from NBDExport to BlockExport
Having a refcount makes sense for all types of block exports. It is also
a prerequisite for keeping a list of all exports at the BlockExport
level.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-14-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
00917172a6 qemu-nbd: Use blk_exp_add() to create the export
With this change, NBD exports are now only created through the
BlockExport interface. This allows us finally to move things from the
NBD layer to the BlockExport layer if they make sense for other export
types, too.

blk_exp_add() returns only a weak reference, so the explicit
nbd_export_put() goes away.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-12-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
d794f7f372 nbd: Remove NBDExport.close callback
The export close callback is unused by the built-in NBD server. qemu-nbd
uses it only during shutdown to wait for the unrefed export to actually
go away. It can just use nbd_export_close_all() instead and do without
the callback.

This removes the close callback from nbd_export_new() and makes both
callers of it more similar.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200924152717.287415-11-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
1c8222b014 nbd: Add max-connections to nbd-server-start
This is a QMP equivalent of qemu-nbd's --shared option, limiting the
maximum number of clients that can attach at the same time.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200924152717.287415-9-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
9b562c646b block/export: Remove magic from block-export-add
nbd-server-add tries to be convenient and adds two questionable
features that we don't want to share in block-export-add, even for NBD
exports:

1. When requesting a writable export of a read-only device, the export
   is silently downgraded to read-only. This should be an error in the
   context of block-export-add.

2. When using a BlockBackend name, unplugging the device from the guest
   will automatically stop the NBD server, too. This may sometimes be
   what you want, but it could also be very surprising. Let's keep
   things explicit with block-export-add. If the user wants to stop the
   export, they should tell us so.

Move these things into the nbd-server-add QMP command handler so that
they apply only there.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-8-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
b57e4de079 qemu-nbd: Use raw block driver for --offset
Instead of implementing qemu-nbd --offset in the NBD code, just put a
raw block node with the requested offset on top of the user image and
rely on that doing the job.

This does not only simplify the nbd_export_new() interface and bring it
closer to the set of options that the nbd-server-add QMP command offers,
but in fact it also eliminates a potential source for bugs in the NBD
code which previously had to add the offset manually in all relevant
places.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200924152717.287415-7-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
56ee86261e block/export: Add BlockExport infrastructure and block-export-add
We want to have a common set of commands for all types of block exports.
Currently, this is only NBD, but we're going to add more types.

This patch adds the basic BlockExport and BlockExportDriver structs and
a QMP command block-export-add that creates a new export based on the
given BlockExportOptions.

qmp_nbd_server_add() becomes a wrapper around qmp_block_export_add().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-5-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
5daa6bfd8e qapi: Create block-export module
Move all block export related types and commands from block-core to the
new QAPI module block-export.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200924152717.287415-3-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Kevin Wolf
8760366cdb nbd: Remove unused nbd_export_get_blockdev()
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200924152717.287415-2-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-02 15:46:40 +02:00
Stefan Hajnoczi
d73415a315 qemu/atomic.h: rename atomic_ to qatomic_
clang's C11 atomic_fetch_*() functions only take a C11 atomic type
pointer argument. QEMU uses direct types (int, etc) and this causes a
compiler error when a QEMU code calls these functions in a source file
that also included <stdatomic.h> via a system header file:

  $ CC=clang CXX=clang++ ./configure ... && make
  ../util/async.c:79:17: error: address argument to atomic operation must be a pointer to _Atomic type ('unsigned int *' invalid)

Avoid using atomic_*() names in QEMU's atomic.h since that namespace is
used by <stdatomic.h>. Prefix QEMU's APIs with 'q' so that atomic.h
and <stdatomic.h> can co-exist. I checked /usr/include on my machine and
searched GitHub for existing "qatomic_" users but there seem to be none.

This patch was generated using:

  $ git grep -h -o '\<atomic\(64\)\?_[a-z0-9_]\+' include/qemu/atomic.h | \
    sort -u >/tmp/changed_identifiers
  $ for identifier in $(</tmp/changed_identifiers); do
        sed -i "s%\<$identifier\>%q$identifier%g" \
            $(git grep -I -l "\<$identifier\>")
    done

I manually fixed line-wrap issues and misaligned rST tables.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200923105646.47864-1-stefanha@redhat.com>
2020-09-23 16:07:44 +01:00
Eduardo Habkost
8063396bf3 Use OBJECT_DECLARE_SIMPLE_TYPE when possible
This converts existing DECLARE_INSTANCE_CHECKER usage to
OBJECT_DECLARE_SIMPLE_TYPE when possible.

$ ./scripts/codeconverter/converter.py -i \
  --pattern=AddObjectDeclareSimpleType $(git grep -l '' -- '*.[ch]')

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: Paul Durrant <paul@xen.org>
Message-Id: <20200916182519.415636-6-ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-09-18 14:12:32 -04:00
Peter Maydell
f4ef8c9cc1 QOM boilerplate cleanup
Documentation build fix:
 * memory: Remove kernel-doc comment marker (Eduardo Habkost)
 
 QOM cleanups:
 * Rename QOM macros for consistency between
   TYPE_* and type checking constants (Eduardo Habkost)
 
 QOM new macros:
 * OBJECT_DECLARE_* and OBJECT_DEFINE_* macros (Daniel P. Berrangé)
 * DECLARE_*_CHECKER macros (Eduardo Habkost)
 
 Automated QOM boilerplate changes:
 * Automated changes to use DECLARE_*_CHECKER (Eduardo Habkost
 * Automated changes to use OBJECT_DECLARE* (Eduardo Habkost)
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEWjIv1avE09usz9GqKAeTb5hNxaYFAl9abc0UHGVoYWJrb3N0
 QHJlZGhhdC5jb20ACgkQKAeTb5hNxaYU9Q/8CyK1w2SlItxBhos7zojqnZ9TP1Jt
 b1YCApQJ+bKSPAUDyefajQA0D9HeR9bFlreiOprQnmZWOqeOvnRIxNGvelJRqRRu
 KcIA5DIfVMJRkKJQEXairrGdnPmFLWSLEb7AmwxyAhp5G51PCP/3kbudi3T/vrNr
 OaccUejs5UgImPfO8Fm+0zqZPmblq/xmtU0p77FvDxGNFPPG8ddpu7eKksGD7FYd
 5bTJTtUhONYG9EJMUD2TBxnJoy1pi6AYUu4+2T211RpBcxeiyNSSitI8fZTk6BGl
 33VwQib9SXjGaE8VsSvHDHhLLec7sqqr2JH3rfvyKF6BOptKWzmSzFdbo2mrRkSy
 8jfCImQgTBBMAHBWP+MFTeKuzfhikZx2DbBLzpppHMMvCca6Zc+oYgR2FbVwuPsw
 H2YL+8Wx4Ws6RXe147toNDRbv75vnS7F3fU800Pcur5VHJWTgSpT/tggzmVPWsdU
 GeUgceYlXyVk5/fC89ZhhtD9eurfBSzQR4eN7/nie2wD6PFMpZkOjHwLn40uWsyq
 xRO0F4uYghNU1N8z6NBhEYLTBtEcS1HFEisSLQrnTQH9W0I7mBx3MaZib/uK7NLC
 b2gT0hossTT8Z46Z8ynoZarwO5EquAMWEQtc9hfZGWacrQEpjVm2DMYMfu83krWb
 xhgl+mpKqVasAPk=
 =RjXc
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/ehabkost/tags/machine-next-pull-request' into staging

QOM boilerplate cleanup

Documentation build fix:
* memory: Remove kernel-doc comment marker (Eduardo Habkost)

QOM cleanups:
* Rename QOM macros for consistency between
  TYPE_* and type checking constants (Eduardo Habkost)

QOM new macros:
* OBJECT_DECLARE_* and OBJECT_DEFINE_* macros (Daniel P. Berrangé)
* DECLARE_*_CHECKER macros (Eduardo Habkost)

Automated QOM boilerplate changes:
* Automated changes to use DECLARE_*_CHECKER (Eduardo Habkost
* Automated changes to use OBJECT_DECLARE* (Eduardo Habkost)

# gpg: Signature made Thu 10 Sep 2020 19:17:49 BST
# gpg:                using RSA key 5A322FD5ABC4D3DBACCFD1AA2807936F984DC5A6
# gpg:                issuer "ehabkost@redhat.com"
# gpg: Good signature from "Eduardo Habkost <ehabkost@redhat.com>" [full]
# Primary key fingerprint: 5A32 2FD5 ABC4 D3DB ACCF  D1AA 2807 936F 984D C5A6

* remotes/ehabkost/tags/machine-next-pull-request: (33 commits)
  virtio-vga: Use typedef name for instance_size
  vhost-user-vga: Use typedef name for instance_size
  xilinx_axienet: Use typedef name for instance_size
  lpc_ich9: Use typedef name for instance_size
  omap_intc: Use typedef name for instance_size
  xilinx_axidma: Use typedef name for instance_size
  tusb6010: Rename TUSB to TUSB6010
  pc87312: Rename TYPE_PC87312_SUPERIO to TYPE_PC87312
  vfio: Rename PCI_VFIO to VFIO_PCI
  usb: Rename USB_SERIAL_DEV to USB_SERIAL
  sabre: Rename SABRE_DEVICE to SABRE
  rs6000_mc: Rename RS6000MC_DEVICE to RS6000MC
  filter-rewriter: Rename FILTER_COLO_REWRITER to FILTER_REWRITER
  esp: Rename ESP_STATE to ESP
  ahci: Rename ICH_AHCI to ICH9_AHCI
  vmgenid: Rename VMGENID_DEVICE to TYPE_VMGENID
  vfio: Rename VFIO_AP_DEVICE_TYPE to TYPE_VFIO_AP_DEVICE
  dev-smartcard-reader: Rename CCID_DEV_NAME to TYPE_USB_CCID_DEV
  ap-device: Rename AP_DEVICE_TYPE to TYPE_AP_DEVICE
  gpex: Fix type checking function name
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2020-09-11 19:26:51 +01:00
Eduardo Habkost
8110fa1d94 Use DECLARE_*CHECKER* macros
Generated using:

 $ ./scripts/codeconverter/converter.py -i \
   --pattern=TypeCheckMacro $(git grep -l '' -- '*.[ch]')

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20200831210740.126168-12-ehabkost@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20200831210740.126168-13-ehabkost@redhat.com>
Message-Id: <20200831210740.126168-14-ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-09-09 09:27:09 -04:00
Eduardo Habkost
db1015e92e Move QOM typedefs and add missing includes
Some typedefs and macros are defined after the type check macros.
This makes it difficult to automatically replace their
definitions with OBJECT_DECLARE_TYPE.

Patch generated using:

 $ ./scripts/codeconverter/converter.py -i \
   --pattern=QOMStructTypedefSplit $(git grep -l '' -- '*.[ch]')

which will split "typdef struct { ... } TypedefName"
declarations.

Followed by:

 $ ./scripts/codeconverter/converter.py -i --pattern=MoveSymbols \
    $(git grep -l '' -- '*.[ch]')

which will:
- move the typedefs and #defines above the type check macros
- add missing #include "qom/object.h" lines if necessary

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20200831210740.126168-9-ehabkost@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20200831210740.126168-10-ehabkost@redhat.com>
Message-Id: <20200831210740.126168-11-ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-09-09 09:26:43 -04:00
Max Reitz
0b877d09df block: Leave BDS.backing_{file,format} constant
Parts of the block layer treat BDS.backing_file as if it were whatever
the image header says (i.e., if it is a relative path, it is relative to
the overlay), other parts treat it like a cache for
bs->backing->bs->filename (relative paths are relative to the CWD).
Considering bs->backing->bs->filename exists, let us make it mean the
former.

Among other things, this now allows the user to specify a base when
using qemu-img to commit an image file in a directory that is not the
CWD (assuming, everything uses relative filenames).

Before this patch:

$ ./qemu-img create -f qcow2 foo/bot.qcow2 1M
$ ./qemu-img create -f qcow2 -b bot.qcow2 foo/mid.qcow2
$ ./qemu-img create -f qcow2 -b mid.qcow2 foo/top.qcow2
$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find '[...]/foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'

After this patch:

$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
Image committed.
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
Image committed.

With this change, bdrv_find_backing_image() must look at whether the
user has overridden a BDS's backing file.  If so, it can no longer use
bs->backing_file, but must instead compare the given filename against
the backing node's filename directly.

Note that this changes the QAPI output for a node's backing_file.  We
had very inconsistent output there (sometimes what the image header
said, sometimes the actual filename of the backing image).  This
inconsistent output was effectively useless, so we have to decide one
way or the other.  Considering that bs->backing_file usually at runtime
contained the path to the image relative to qemu's CWD (or absolute),
this patch changes QAPI's backing_file to always report the
bs->backing->bs->filename from now on.  If you want to receive the image
header information, you have to refer to full-backing-filename.

This necessitates a change to iotest 228.  The interesting information
it really wanted is the image header, and it can get that now, but it
has to use full-backing-filename instead of backing_file.  Because of
this patch's changes to bs->backing_file's behavior, we also need some
reference output changes.

Along with the changes to bs->backing_file, stop updating
BDS.backing_format in bdrv_backing_attach() as well.  This way,
ImageInfo's backing-filename and backing-filename-format fields will
represent what the image header says and nothing else.

iotest 245 changes in behavior: With the backing node no longer
overriding the parent node's backing_file string, you can now omit the
@backing option when reopening a node with neither a default nor a
current backing file even if it used to have a backing node at some
point.

273 also changes: The base image is opened without a format layer, so
ImageInfo.backing-filename-format used to report "file" for the base
image's overlay after blockdev-snapshot.  However, the image header
never says "file" anywhere, so it now reports $IMGFMT.

Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-09-07 12:31:31 +02:00
Max Reitz
549ec0d978 block: Inline bdrv_co_block_status_from_*()
With bdrv_filter_bs(), we can easily handle this default filter behavior
in bdrv_co_block_status().

blkdebug wants to have an additional assertion, so it keeps its own
implementation, except bdrv_co_block_status_from_file() needs to be
inlined there.

Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
2020-09-07 12:31:31 +02:00
Max Reitz
f1a7f18f07 block: Drop backing_bs()
We want to make it explicit where bs->backing is used, and we have done
so.  The old role of backing_bs() is now effectively taken by
bdrv_cow_bs().

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
2020-09-07 12:31:31 +02:00
Max Reitz
ae23f78646 block: Add bdrv_supports_compressed_writes()
Filters cannot compress data themselves but they have to implement
.bdrv_co_pwritev_compressed() still (or they cannot forward compressed
writes).  Therefore, checking whether
bs->drv->bdrv_co_pwritev_compressed is non-NULL is not sufficient to
know whether the node can actually handle compressed writes.  This
function looks down the filter chain to see whether there is a
non-filter that can actually convert the compressed writes into
compressed data (and thus normal writes).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
2020-09-07 12:31:31 +02:00
Max Reitz
8b8277cdb0 block: Drop bdrv_is_encrypted()
The original purpose of bdrv_is_encrypted() was to inquire whether a BDS
can be used without the user entering a password or not.  It has not
been used for that purpose for quite some time.

Actually, it is not even fit for that purpose, because to answer that
question, it would have recursively query all of the given node's
children.

So now we have to decide in which direction we want to fix
bdrv_is_encrypted(): Recursively query all children, or drop it and just
use bs->encrypted to get the current node's status?

Nowadays, its only purpose is to report through bdrv_query_image_info()
whether the given image is encrypted or not.  For this purpose, it is
probably more interesting to see whether a given node itself is
encrypted or not (otherwise, a management application cannot discern for
certain which nodes are really encrypted and which just have encrypted
children).

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
2020-09-07 12:31:30 +02:00
Max Reitz
d38d7eb8a5 block: Add chain helper functions
Add some helper functions for skipping filters in a chain of block
nodes.

Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-09-07 12:31:30 +02:00
Max Reitz
9a6fc88799 block: Add child access functions
There are BDS children that the general block layer code can access,
namely bs->file and bs->backing.  Since the introduction of filters and
external data files, their meaning is not quite clear.  bs->backing can
be a COW source, or it can be a filtered child; bs->file can be a
filtered child, it can be data and metadata storage, or it can be just
metadata storage.

This overloading really is not helpful.  This patch adds functions that
retrieve the correct child for each exact purpose.  Later patches in
this series will make use of them.  Doing so will allow us to handle
filter nodes in a meaningful way.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2020-09-07 12:31:30 +02:00
Klaus Jensen
69265150aa hw/block/nvme: be consistent about zeros vs zeroes
The NVM Express specification generally uses 'zeroes' and not 'zeros',
so let us align with it.

Cc: Fam Zheng <fam@euphon.net>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
2020-09-02 08:48:50 +02:00
Klaus Jensen
7c46310d29 hw/block/nvme: support the get/set features select and save fields
Since the device does not have any persistent state storage, no
features are "saveable" and setting the Save (SV) field in any Set
Features command will result in a Feature Identifier Not Saveable status
code.

Similarly, if the Select (SEL) field is set to request saved values, the
devices will (as it should) return the default values instead.

Since this also introduces "Supported Capabilities", the nsid field is
now also checked for validity wrt. the feature being get/set'ed.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200706061303.246057-13-its@irrelevant.dk>
2020-09-02 08:48:50 +02:00
Klaus Jensen
1302e48e49 hw/block/nvme: add remaining mandatory controller parameters
Add support for any remaining mandatory controller operating parameters
(features).

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200706061303.246057-12-its@irrelevant.dk>
2020-09-02 08:48:50 +02:00
Klaus Jensen
46ac29c38b hw/block/nvme: move NvmeFeatureVal into hw/block/nvme.h
The NvmeFeatureVal does not belong with the spec-related data structures
in include/block/nvme.h that is shared between the block-level nvme
driver and the emulated nvme device.

Move it into the nvme device specific header file as it is the only
user of the structure. Also, remove the unused members.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200706061303.246057-10-its@irrelevant.dk>
2020-09-02 08:48:50 +02:00
Klaus Jensen
5d5a53302b hw/block/nvme: add support for the asynchronous event request command
Add support for the Asynchronous Event Request command. Required for
compliance with NVMe revision 1.3d. See NVM Express 1.3d, Section 5.2
("Asynchronous Event Request command").

Mostly imported from Keith's qemu-nvme tree. Modified with a max number
of queued events (controllable with the aer_max_queued device
parameter). The spec states that the controller *should* retain
events, so we do best effort here.

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Message-Id: <20200706061303.246057-9-its@irrelevant.dk>
2020-09-02 08:48:50 +02:00
Klaus Jensen
94a7897c41 hw/block/nvme: add support for the get log page command
Add support for the Get Log Page command and basic implementations of
the mandatory Error Information, SMART / Health Information and Firmware
Slot Information log pages.

In violation of the specification, the SMART / Health Information log
page does not persist information over the lifetime of the controller
because the device has no place to store such persistent state.

Note that the LPA field in the Identify Controller data structure
intentionally has bit 0 cleared because there is no namespace specific
information in the SMART / Health information log page.

Required for compliance with NVMe revision 1.3d. See NVM Express 1.3d,
Section 5.14 ("Get Log Page command").

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200706061303.246057-8-its@irrelevant.dk>
2020-09-02 08:48:50 +02:00
Klaus Jensen
42a42e4610 hw/block/nvme: mark fw slot 1 as read-only
Mark firmware slot 1 as read-only and only support that slot.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200706061303.246057-7-its@irrelevant.dk>
2020-09-02 08:48:50 +02:00
Klaus Jensen
69ff06c49e hw/block/nvme: add temperature threshold feature
It might seem weird to implement this feature for an emulated device,
but it is mandatory to support and the feature is useful for testing
asynchronous event request support, which will be added in a later
patch.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Message-Id: <20200706061303.246057-6-its@irrelevant.dk>
2020-09-02 08:48:50 +02:00
Klaus Jensen
c26f217370 hw/block/nvme: bump spec data structures to v1.3
Add missing fields in the Identify Controller and Identify Namespace
data structures to bring them in line with NVMe v1.3.

This also adds data structures and defines for SGL support which
requires a couple of trivial changes to the nvme block driver as well.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Acked-by: Fam Zheng <fam@euphon.net>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Message-Id: <20200706061303.246057-2-its@irrelevant.dk>
2020-09-02 08:48:50 +02:00
Philippe Mathieu-Daudé
74e18435c0 hw/block/nvme: Align I/O BAR to 4 KiB
Simplify the NVMe emulated device by aligning the I/O BAR to 4 KiB.

Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200630110429.19972-5-philmd@redhat.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2020-09-02 08:48:50 +02:00
Philippe Mathieu-Daudé
af4a367d15 hw/block/nvme: Fix pmrmsc register size
The Persistent Memory Region Controller Memory Space Control
register is 64-bit wide. See 'Figure 68: Register Definition'
of the 'NVM Express Base Specification Revision 1.4'.

Fixes: 6cf9413229 ("introduce PMR support from NVMe 1.4 spec")
Reported-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200630110429.19972-4-philmd@redhat.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2020-09-02 08:48:50 +02:00
Philippe Mathieu-Daudé
e989738f3a hw/block/nvme: Use QEMU_PACKED on hardware/packet structures
These structures either describe hardware registers, or
commands ('packets') to send to the hardware. To forbid
the compiler to optimize and change fields alignment,
mark the structures as packed.

Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200630110429.19972-3-philmd@redhat.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2020-09-02 08:48:50 +02:00
Eduardo Habkost
7c9dcd6cab throttle-groups: Move ThrottleGroup typedef to header
Move typedef closer to the type check macros, to make it easier
to convert the code to OBJECT_DEFINE_TYPE() in the future.

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Tested-By: Roman Bolshakov <r.bolshakov@yadro.com>
Message-Id: <20200825192110.3528606-17-ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-08-27 14:04:54 -04:00
Alberto Garcia
7be2025258 qcow2: Add the 'extended_l2' option and the QCOW2_INCOMPAT_EXTL2 bit
Now that the implementation of subclusters is complete we can finally
add the necessary options to create and read images with this feature,
which we call "extended L2 entries".

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <6476caaa73216bd05b7bb2d504a20415e1665176.1594396418.git.berto@igalia.com>
[mreitz: %s/5\.1/5.2/; fixed 302's and 303's reference output]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-08-25 09:19:55 +02:00
Eric Blake
e54ee1b385 block: Add support to warn on backing file change without format
For now, this is a mechanical addition; all callers pass false. But
the next patch will use it to improve 'qemu-img rebase -u' when
selecting a backing file with no format.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Message-Id: <20200706203954.341758-10-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-07-14 15:18:59 +02:00
Kevin Wolf
ffa244c84a file-posix: Mitigate file fragmentation with extent size hints
Especially when O_DIRECT is used with image files so that the page cache
indirection can't cause a merge of allocating requests, the file will
fragment on the file system layer, with a potentially very small
fragment size (this depends on the requests the guest sent).

On Linux, fragmentation can be reduced by setting an extent size hint
when creating the file (at least on XFS, it can't be set any more after
the first extent has been allocated), basically giving raw files a
"cluster size" for allocation.

This adds a create option to set the extent size hint, and changes the
default from not setting a hint to setting it to 1 MB. The main reason
why qcow2 defaults to smaller cluster sizes is that COW becomes more
expensive, which is not an issue with raw files, so we can choose a
larger size. The tradeoff here is only potentially wasted disk space.

For qcow2 (or other image formats) over file-posix, the advantage should
even be greater because they grow sequentially without leaving holes, so
there won't be wasted space. Setting even larger extent size hints for
such images may make sense. This can be done with the new option, but
let's keep the default conservative for now.

The effect is very visible with a test that intentionally creates a
badly fragmented file with qemu-img bench (the time difference while
creating the file is already remarkable) and then looks at the number of
extents and the time a simple "qemu-img map" takes.

Without an extent size hint:

    $ ./qemu-img create -f raw -o extent_size_hint=0 ~/tmp/test.raw 10G
    Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=0
    $ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
    Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
    Run completed in 25.848 seconds.
    $ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
    Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
    Run completed in 19.616 seconds.
    $ filefrag ~/tmp/test.raw
    /home/kwolf/tmp/test.raw: 2000000 extents found
    $ time ./qemu-img map ~/tmp/test.raw
    Offset          Length          Mapped to       File
    0               0x1e8480000     0               /home/kwolf/tmp/test.raw

    real    0m1,279s
    user    0m0,043s
    sys     0m1,226s

With the new default extent size hint of 1 MB:

    $ ./qemu-img create -f raw -o extent_size_hint=1M ~/tmp/test.raw 10G
    Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=1048576
    $ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
    Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
    Run completed in 11.833 seconds.
    $ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
    Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
    Run completed in 10.155 seconds.
    $ filefrag ~/tmp/test.raw
    /home/kwolf/tmp/test.raw: 178 extents found
    $ time ./qemu-img map ~/tmp/test.raw
    Offset          Length          Mapped to       File
    0               0x1e8480000     0               /home/kwolf/tmp/test.raw

    real    0m0,061s
    user    0m0,040s
    sys     0m0,014s

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200707142329.48303-1-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-07-14 15:18:59 +02:00
Vladimir Sementsov-Ogievskiy
795d946d07 nbd: Use ERRP_GUARD()
If we want to check error after errp-function call, we need to
introduce local_err and then propagate it to errp. Instead, use
the ERRP_GUARD() macro, benefits are:
1. No need of explicit error_propagate call
2. No need of explicit local_err variable: use errp directly
3. ERRP_GUARD() leaves errp as is if it's not NULL or
   &error_fatal, this means that we don't break error_abort
   (we'll abort on error_set, not on error_propagate)

If we want to add some info to errp (by error_prepend() or
error_append_hint()), we must use the ERRP_GUARD() macro.
Otherwise, this info will not be added when errp == &error_fatal
(the program will exit prior to the error_append_hint() or
error_prepend() call).  Fix several such cases, e.g. in nbd_read().

This commit is generated by command

    sed -n '/^Network Block Device (NBD)$/,/^$/{s/^F: //p}' \
        MAINTAINERS | \
    xargs git ls-files | grep '\.[hc]$' | \
    xargs spatch \
        --sp-file scripts/coccinelle/errp-guard.cocci \
        --macro-file scripts/cocci-macro-file.h \
        --in-place --no-show-diff --max-width 80

Reported-by: Kevin Wolf <kwolf@redhat.com>
Reported-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Commit message tweaked]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20200707165037.1026246-8-armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[ERRP_AUTO_PROPAGATE() renamed to ERRP_GUARD(), and
auto-propagated-errp.cocci to errp-guard.cocci.  Commit message
tweaked again.]
2020-07-10 15:18:09 +02:00
Vladimir Sementsov-Ogievskiy
a2adbbf603 block: drop unallocated_blocks_are_zero
Currently this field only set by qed and qcow2. But in fact, all
backing-supporting formats (parallels, qcow, qcow2, qed, vmdk) share
these semantics: on unallocated blocks, if there is no backing file they
just memset the buffer with zeroes.

So, document this behavior for .supports_backing and drop
.unallocated_blocks_are_zero

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200528094405.145708-10-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-07-06 10:34:14 +02:00
Vladimir Sementsov-Ogievskiy
7b1efe996c block: inline bdrv_unallocated_blocks_are_zero()
The function has only one user: bdrv_co_block_status(). Inline it to
simplify reviewing of the following patches, which will finally drop
unallocated_blocks_are_zero field too.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200528094405.145708-3-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-07-06 08:49:28 +02:00
Maxim Levitsky
ced914d0ab block/core: add generic infrastructure for x-blockdev-amend qmp command
blockdev-amend will be used similiar to blockdev-create
to allow on the fly changes of the structure of the format based block devices.

Current plan is to first support encryption keyslot management for luks
based formats (raw and embedded in qcow2)

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20200608094030.670121-12-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-07-06 08:49:28 +02:00
Maxim Levitsky
df373fb0a3 block/amend: separate amend and create options for qemu-img
Some options are only useful for creation
(or hard to be amended, like cluster size for qcow2), while some other
options are only useful for amend, like upcoming keyslot management
options for luks

Since currently only qcow2 supports amend, move all its options
to a common macro and then include it in each action option list.

In future it might be useful to remove some options which are
not supported anyway from amend list, which currently
cause an error message if amended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200608094030.670121-5-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-07-06 08:49:28 +02:00
Maxim Levitsky
a3579bfa0a block/amend: add 'force' option
'force' option will be used for some unsafe amend operations.

This includes things like erasing last keyslot in luks based formats
(which destroys the data, unless the master key is backed up
by external means), but that _might_ be desired result.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200608094030.670121-4-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-07-06 08:49:28 +02:00
Eric Blake
f9919116b8 osdep: Make MIN/MAX evaluate arguments only once
I'm not aware of any immediate bugs in qemu where a second runtime
evaluation of the arguments to MIN() or MAX() causes a problem, but
proactively preventing such abuse is easier than falling prey to an
unintended case down the road.  At any rate, here's the conversation
that sparked the current patch:
https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg05718.html

Update the MIN/MAX macros to only evaluate their argument once at
runtime; this uses typeof(1 ? (a) : (b)) to ensure that we are
promoting the temporaries to the same type as the final comparison (we
have to trigger type promotion, as typeof(bitfield) won't compile; and
we can't use typeof((a) + (b)) or even typeof((a) + 0), as some of our
uses of MAX are on void* pointers where such addition is undefined).

However, we are unable to work around gcc refusing to compile ({}) in
a constant context (such as the array length of a static variable),
even when only used in the dead branch of a __builtin_choose_expr(),
so we have to provide a second macro pair MIN_CONST and MAX_CONST for
use when both arguments are known to be compile-time constants and
where the result must also be usable as a constant; this second form
evaluates arguments multiple times but that doesn't matter for
constants.  By using a void expression as the expansion if a
non-constant is presented to this second form, we can enlist the
compiler to ensure the double evaluation is not attempted on
non-constants.

Alas, as both macros now rely on compiler intrinsics, they are no
longer usable in preprocessor #if conditions; those will just have to
be open-coded or the logic rewritten into #define or runtime 'if'
conditions (but where the compiler dead-code-elimination will probably
still apply).

I tested that both gcc 10.1.1 and clang 10.0.0 produce errors for all
forms of macro mis-use.  As the errors can sometimes be cryptic, I'm
demonstrating the gcc output:

Use of MIN when MIN_CONST is needed:

In file included from /home/eblake/qemu/qemu-img.c:25:
/home/eblake/qemu/include/qemu/osdep.h:249:5: error: braced-group within expression allowed only inside a function
  249 |     ({                                                  \
      |     ^
/home/eblake/qemu/qemu-img.c:92:12: note: in expansion of macro ‘MIN’
   92 | char array[MIN(1, 2)] = "";
      |            ^~~

Use of MIN_CONST when MIN is needed:

/home/eblake/qemu/qemu-img.c: In function ‘is_allocated_sectors’:
/home/eblake/qemu/qemu-img.c:1225:15: error: void value not ignored as it ought to be
 1225 |             i = MIN_CONST(i, n);
      |               ^

Use of MIN in the preprocessor:

In file included from /home/eblake/qemu/accel/tcg/translate-all.c:20:
/home/eblake/qemu/accel/tcg/translate-all.c: In function ‘page_check_range’:
/home/eblake/qemu/include/qemu/osdep.h:249:6: error: token "{" is not valid in preprocessor expressions
  249 |     ({                                                  \
      |      ^

Fix the resulting callsites that used #if or computed a compile-time
constant min or max to use the new macros.  cpu-defs.h is interesting,
as CPU_TLB_DYN_MAX_BITS is sometimes used as a constant and sometimes
dynamic.

It may be worth improving glib's MIN/MAX definitions to be saner, but
that is a task for another day.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200625162602.700741-1-eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-26 09:39:39 -04:00
Klaus Jensen
3e829fd438 hw/block/nvme: use constants in identify
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Message-Id: <20200609190333.59390-6-its@irrelevant.dk>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-06-17 14:53:40 +02:00
Vladimir Sementsov-Ogievskiy
7ae89a0de9 block/dirty-bitmap: add bdrv_has_named_bitmaps helper
To be used for bitmap migration in further commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200521220648.3255-3-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2020-05-28 13:15:22 -05:00
Eric Blake
c6996cf9a6 blockdev: Promote several bitmap functions to non-static
The next patch will split blockdev.c, which will require accessing
some previously-static functions from more than one .c file.  But part
of promoting a function to public is picking a naming scheme that does
not reek of exposing too many internals (two of the three functions
were named starting with 'do_').  To make future code motion easier,
perform the function rename and non-static promotion into its own
patch.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513011648.166876-5-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2020-05-19 10:32:14 -05:00
Eric Blake
ef893b5c84 block: Make it easier to learn which BDS support bitmaps
Upcoming patches will enhance bitmap support in qemu-img, but in doing
so, it turns out to be nice to suppress output when persistent bitmaps
make no sense (such as on a qcow2 v2 image).  Add a hook to make this
easier to query.

This patch adds a new callback .bdrv_supports_persistent_dirty_bitmap,
rather than trying to shoehorn the answer in via existing callbacks.
In particular, while it might have been possible to overload
.bdrv_co_can_store_new_dirty_bitmap to special-case a NULL input to
answer whether any persistent bitmaps are supported, that is at odds
with whether a particular bitmap can be stored (for example, even on
an image that supports persistent bitmaps but has currently filled up
the maximum number of bitmaps, attempts to store another one should
fail); and the new functionality doesn't require coroutine safety.
Similarly, we could have added one more piece of information to
.bdrv_get_info, but then again, most callers to that function tend to
already discard extraneous information, and making it a catch-all
rather than a series of dedicated scalar queries hasn't really
simplified life.

In the future, when we improve the ability to look up bitmaps through
a filter, we will probably also want to teach the block layer to
automatically let filters pass this request on through.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513011648.166876-4-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2020-05-19 10:32:14 -05:00
Peter Maydell
f2465433b4 Pull request
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEhpWov9P5fNqsNXdanKSrs4Grc8gFAl7DkiAACgkQnKSrs4Gr
 c8h3lAf/encwoP2tfsFIMRmVfP2lu7lmbfcagv2CeuFU8DzFqrv1lJEoJTyWPq+X
 3mrlyhxBiUTt8+CiLLeOOiNHzq9P/7ApvREeBYZ+LV2vV12lZwas23pdmXV4YvAk
 OammorLzwoy6xFRoWUkElitMrcCrpwNNMe0DWGwvUPLAx4HUuXZ1Ilo1DKpB8YkR
 pI/x7h3/UG6jgxUIHao4P078jG+wANWxJBt8gPzm8vS9eC+nOLkkzvi6zfN8IqzD
 7RG9ieENJSz9lRQP/GGglW9aB83rmKsENGD0xrPWUNM6t9gvthJevAJUVBaoHUXr
 XKTsEUTYyR8nLasSNeAaT5nzHUa01w==
 =8IIS
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging

Pull request

# gpg: Signature made Tue 19 May 2020 09:00:32 BST
# gpg:                using RSA key 8695A8BFD3F97CDAAC35775A9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" [full]
# gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>" [full]
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35  775A 9CA4 ABB3 81AB 73C8

* remotes/stefanha/tags/block-pull-request:
  aio-posix: disable fdmon-io_uring when GSource is used
  aio-posix: don't duplicate fd handler deletion in fdmon_io_uring_destroy()
  tests/fuzz: Extract ioport_fuzz_qtest() method
  tests/fuzz: Extract pciconfig_fuzz_qos() method
  tests/fuzz: Remove unuseful/unused typedefs
  tests/fuzz: Add missing space in test description
  Makefile: List fuzz targets in 'make help'
  tests/fuzz/Makefile: Do not link code using unavailable devices

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2020-05-19 13:42:58 +01:00
Stefan Hajnoczi
ba607ca8bf aio-posix: disable fdmon-io_uring when GSource is used
The glib event loop does not call fdmon_io_uring_wait() so fd handlers
waiting to be submitted build up in the list. There is no benefit is
using io_uring when the glib GSource is being used, so disable it
instead of implementing a more complex fix.

This fixes a memory leak where AioHandlers would build up and increasing
amounts of CPU time were spent iterating them in aio_pending(). The
symptom is that guests become slow when QEMU is built with io_uring
support.

Buglink: https://bugs.launchpad.net/qemu/+bug/1877716
Fixes: 73fd282e7b ("aio-posix: add io_uring fd monitoring implementation")
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Message-id: 20200511183630.279750-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2020-05-18 18:16:00 +01:00
Max Reitz
e5d8a40685 block: Drop @child_class from bdrv_child_perm()
Implementations should decide the necessary permissions based on @role.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-35-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
f6de853fa3 block: Drop child_file
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-33-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
9aab945e9c block: Drop bdrv_format_default_perms()
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-32-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
87278af1d9 block: Make bdrv_filter_default_perms() static
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-31-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
ff3541c4e2 block: Drop child_backing
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-25-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
f34ade1148 block: Drop child_format
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-23-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
2519f54919 block: Add bdrv_default_perms()
This callback can be used by BDSs that use child_of_bds with the
appropriate BdrvChildRole for their children.

Also, make bdrv_format_default_perms() use it for child_of_bds children
(just a temporary solution until we can drop bdrv_format_default_perms()
altogether).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-20-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
4348355032 block: Add child_of_bds
Any current user of child_file, child_format, and child_backing can and
should use this generic BdrvChildClass instead, as it can handle all of
these cases.  However, to be able to do so, the users must pass the
appropriate BdrvChildRole when the child is created/attached.  (The
following commits will take care of that.)

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-15-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
3cdc69d31b block: Pass parent_is_format to .inherit_options()
We plan to unify the generic .inherit_options() functions.  The
resulting common function will need to decide whether to force-enable
format probing, force-disable it, or leave it as-is.  To make this
decision, it will need to know whether the parent node is a format node
or not (because we never want format probing if the parent is a format
node already (except for the backing chain)).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-9-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
272c02eaef block: Pass BdrvChildRole to .inherit_options()
For now, all callers (effectively) pass 0 and no callee evaluates thie
value.  Later patches will change both.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-8-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
bf8e925eb5 block: Pass BdrvChildRole to bdrv_child_perm()
For now, all callers pass 0 and no callee evaluates this value.  Later
patches will change both.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-7-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
258b776515 block: Add BdrvChildRole to BdrvChild
For now, it is always set to 0.  Later patches in this series will
ensure that all callers pass an appropriate combination of flags.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-6-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
3284bcf430 block: Add BdrvChildRole and BdrvChildRoleBits
This mask will supplement BdrvChildClass when it comes to what role (or
combination of roles) a child takes for its parent.  It consists of
BdrvChildRoleBits values (which is an enum).

Because empty enums are not allowed, let us just start with it filled.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-5-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
bd86fb990c block: Rename BdrvChildRole to BdrvChildClass
This structure nearly only contains parent callbacks for child state
changes.  It cannot really reflect a child's role, because different
roles may overlap (as we will see when real roles are introduced), and
because parents can have custom callbacks even when the child fulfills a
standard role.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20200513110544.176672-4-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
d67066d8bc block: Add BlockDriver.is_format
We want to unify child_format and child_file at some point.  One of the
important things that set format drivers apart from other drivers is
that they do not expect other format nodes under them (except in the
backing chain), i.e. we must not probe formats inside of formats.  That
means we need something on which to distinguish format drivers from
others, and hence this flag.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20200513110544.176672-3-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Max Reitz
6f7a3b535f block: Add bdrv_make_empty()
Right now, all users of bdrv_make_empty() call the BlockDriver method
directly.  That is not only bad style, it is also wrong, unless the
caller has a BdrvChild with a WRITE or WRITE_UNCHANGED permission.
(WRITE_UNCHANGED suffices, because callers generally use this function
to clear a node with a backing file after a commit operation.)

Introduce bdrv_make_empty() that verifies that it does.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200429141126.85159-2-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-18 19:05:25 +02:00
Denis Plotnikov
572ad9783f qcow2: introduce compression type feature
The patch adds some preparation parts for incompatible compression type
feature to qcow2 allowing the use different compression methods for
image clusters (de)compressing.

It is implied that the compression type is set on the image creation and
can be changed only later by image conversion, thus compression type
defines the only compression algorithm used for the image, and thus,
for all image clusters.

The goal of the feature is to add support of other compression methods
to qcow2. For example, ZSTD which is more effective on compression than ZLIB.

The default compression is ZLIB. Images created with ZLIB compression type
are backward compatible with older qemu versions.

Adding of the compression type breaks a number of tests because now the
compression type is reported on image creation and there are some changes
in the qcow2 header in size and offsets.

The tests are fixed in the following ways:
    * filter out compression_type for many tests
    * fix header size, feature table size and backing file offset
      affected tests: 031, 036, 061, 080
      header_size +=8: 1 byte compression type
                       7 bytes padding
      feature_table += 48: incompatible feature compression type
      backing_file_offset += 56 (8 + 48 -> header_change + feature_table_change)
    * add "compression type" for test output matching when it isn't filtered
      affected tests: 049, 060, 061, 065, 082, 085, 144, 182, 185, 198, 206,
                      242, 255, 274, 280

Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
QAPI part:
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20200507082521.29210-2-dplotnikov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-05-13 14:20:31 +02:00
Eric Blake
47e0b38a13 block: Drop unused .bdrv_has_zero_init_truncate
Now that there are no clients of bdrv_has_zero_init_truncate, none of
the drivers need to worry about providing it.

What's more, this eliminates a source of some confusion: a literal
reading of the documentation as written in ceaca56f and implemented in
commit 1dcaf527 claims that a driver which returns 0 for
bdrv_has_zero_init_truncate() must not return 1 for
bdrv_has_zero_init(); this condition was violated for parallels, qcow,
and sometimes for vdi, although in practice it did not matter since
those drivers also lacked .bdrv_co_truncate.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200428202905.770727-10-eblake@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-08 13:26:35 +02:00
Andrzej Jakowski
6cf9413229 nvme: introduce PMR support from NVMe 1.4 spec
This patch introduces support for PMR that has been defined as part of NVMe 1.4
spec. User can now specify a pmrdev option that should point to HostMemoryBackend.
pmrdev memory region will subsequently be exposed as PCI BAR 2 in emulated NVMe
device. Guest OS can perform mmio read and writes to the PMR region that will stay
persistent across system reboot.

Signed-off-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200330164656.9348-1-andrzej.jakowski@linux.intel.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-04-30 17:51:07 +02:00
Kevin Wolf
7b8e485742 block: Add flags to bdrv(_co)_truncate()
Now that block drivers can support flags for .bdrv_co_truncate, expose
the parameter in the node level interfaces bdrv_co_truncate() and
bdrv_truncate().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200424125448.63318-3-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-04-30 17:51:07 +02:00
Kevin Wolf
92b92799dc block: Add flags to BlockDriver.bdrv_co_truncate()
This adds a new BdrvRequestFlags parameter to the .bdrv_co_truncate()
driver callbacks, and a supported_truncate_flags field in
BlockDriverState that allows drivers to advertise support for request
flags in the context of truncate.

For now, we always pass 0 and no drivers declare support for any flag.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200424125448.63318-2-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-04-30 17:51:07 +02:00
Paolo Bonzini
3c18a92dc4 aio-wait: delegate polling of main AioContext if BQL not held
Any thread that is not a iothread returns NULL for qemu_get_current_aio_context().
As a result, it would also return true for
in_aio_context_home_thread(qemu_get_aio_context()), causing
AIO_WAIT_WHILE to invoke aio_poll() directly.  This is incorrect
if the BQL is not held, because aio_poll() does not expect to
run concurrently from multiple threads, and it can actually
happen when savevm writes to the vmstate file from the
migration thread.

Therefore, restrict in_aio_context_home_thread to return true
for the main AioContext only if the BQL is held.

The function is moved to aio-wait.h because it is mostly used
there and to avoid a circular reference between main-loop.h
and block/aio.h.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200407140746.8041-5-pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2020-04-09 16:16:28 +01:00
Maxim Levitsky
5a5e7f8cd8 block: trickle down the fallback image creation function use to the block drivers
Instead of checking the .bdrv_co_create_opts to see if we need the
fallback, just implement the .bdrv_co_create_opts in the drivers that
need it.

This way we don't break various places that need to know if the
underlying protocol/format really supports image creation, and this way
we still allow some drivers to not support image creation.

Fixes: fd17146cd9
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1816007

Note that technically this driver reverts the image creation fallback
for the vxhs driver since I don't have a means to test it, and IMHO it
is better to leave it not supported as it was prior to generic image
creation patches.

Also drop iscsi_create_opts which was left accidentally.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200326011218.29230-3-mlevitsk@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
[mreitz: Fixed alignment, and moved bdrv_co_create_opts_simple() and
         bdrv_create_opts_simple from block.h into block_int.h]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-03-26 14:44:33 +01:00
Maxim Levitsky
b92902dfea block: pass BlockDriver reference to the .bdrv_co_create
This will allow the reuse of a single generic .bdrv_co_create
implementation for several drivers.
No functional changes.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200326011218.29230-2-mlevitsk@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-03-26 14:44:33 +01:00
Vladimir Sementsov-Ogievskiy
299ea9ff01 block/dirty-bitmap: improve _next_dirty_area API
Firstly, _next_dirty_area is for scenarios when we may contiguously
search for next dirty area inside some limited region, so it is more
comfortable to specify "end" which should not be recalculated on each
iteration.

Secondly, let's add a possibility to limit resulting area size, not
limiting searching area. This will be used in NBD code in further
commit. (Note that now bdrv_dirty_bitmap_next_dirty_area is unused)

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20200205112041.6003-8-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
2020-03-18 14:03:46 -04:00
Vladimir Sementsov-Ogievskiy
9399c54b75 block/dirty-bitmap: add _next_dirty API
We have bdrv_dirty_bitmap_next_zero, let's add corresponding
bdrv_dirty_bitmap_next_dirty, which is more comfortable to use than
bitmap iterators in some cases.

For test modify test_hbitmap_next_zero_check_range to check both
next_zero and next_dirty and add some new checks.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20200205112041.6003-7-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
2020-03-18 14:03:46 -04:00
Vladimir Sementsov-Ogievskiy
642700fda0 block/dirty-bitmap: switch _next_dirty_area and _next_zero to int64_t
We are going to introduce bdrv_dirty_bitmap_next_dirty so that same
variable may be used to store its return value and to be its parameter,
so it would int64_t.

Similarly, we are going to refactor hbitmap_next_dirty_area to use
hbitmap_next_dirty together with hbitmap_next_zero, therefore we want
hbitmap_next_zero parameter type to be int64_t too.

So, for convenience update all parameters of *_next_zero and
*_next_dirty_area to be int64_t.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20200205112041.6003-6-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
2020-03-18 14:03:46 -04:00
Peter Maydell
49780a582d Block layer patches:
- Relax restrictions for blockdev-snapshot (allows libvirt to do live
   storage migration with blockdev-mirror)
 - luks: Delete created files when block_crypto_co_create_opts_luks fails
 - Fix memleaks in qmp_object_add
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJeaQYTAAoJEH8JsnLIjy/WQwYP/3pzAjqVecL3dGmnPWAkBqCV
 CFpxT2nIMe+xCBvWQBoeekHsFJ7GQf4E1WVNRZgoAh9VQkvkajZsVNn8Auo2Veq2
 c7w/R4xf/Wet2hKGVRS0JXwbg69U5BbpcF7E2DRNfp+CvaDCafHSDNeGTb3hFUjT
 x1jwhK6VqfY7+LHU0B0QmX2KA66nDx1p8l8HJQYd1MlCKAbj8kv/swEbqBJn32hA
 32CIYfC4VCqkW5va1eOjd3Kyi/ugkFCHTI8+mOa45/BFBzIiKfCsDaFHh/DI59QB
 qcDKkUcO3+W788vCKgJQGnG070TwKPx2OnjhxFKiEGaoX3Sz+AY4wUvf3mfFl8GM
 zYqTdOy4Xh0ckvA6JCS0jtAKmANkeEGqnECAgub22Z+kyOzqC05B1FkYwqYDcFXY
 atWKm5Vr47jgD6Oq6O0OpZaZrAUWOfoBmq4ErnrBEHuW5329NEInmjYwxednK+43
 CwU/lSdX7ujRSsjS8Xi1dHS4pxHK/mg51dInL44zGFUayegiLPgA8cuESE0mHOfZ
 67X14rxu6D4Y5r0L+w7rsSGjByR29VynE1McL9fZ1Wp29JHaQ5fjdG6GMXEwYxmV
 R0YNXe85FAlNgqj0Bme+fR2YPxZ48NqHIMOvFFStNHyfD0qQN0TtT1iarfEcBR0u
 jm8MnSoIDLkRXEgBW9bW
 =ociU
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches:

- Relax restrictions for blockdev-snapshot (allows libvirt to do live
  storage migration with blockdev-mirror)
- luks: Delete created files when block_crypto_co_create_opts_luks fails
- Fix memleaks in qmp_object_add

# gpg: Signature made Wed 11 Mar 2020 15:38:59 GMT
# gpg:                using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream:
  qemu-iotests: adding LUKS cleanup for non-UTF8 secret error
  crypto.c: cleanup created file when block_crypto_co_create_opts_luks fails
  block.c: adding bdrv_co_delete_file
  block: introducing 'bdrv_co_delete_file' interface
  tests/qemu-iotests: Fix socket_scm_helper build path
  qapi: Add '@allow-write-only-overlay' feature for 'blockdev-snapshot'
  iotests: Add iothread cases to 155
  block: Fix cross-AioContext blockdev-snapshot
  iotests: Test mirror with temporarily disabled target backing file
  iotests: Fix run_job() with use_log=False
  block: Relax restrictions for blockdev-snapshot
  block: Make bdrv_get_cumulative_perm() public
  qom-qmp-cmds: fix two memleaks in qmp_object_add

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2020-03-12 16:51:26 +00:00
Peter Maydell
5931ed5641 Block patches for the 5.0 softfreeze:
- qemu-img measure for LUKS
 - Improve block-copy's performance by reducing inter-request
   dependencies
 - Make curl's detection of accept-ranges more robust
 - Memleak fixes
 - iotest fix
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEkb62CjDbPohX0Rgp9AfbAGHVz0AFAl5o5UUSHG1yZWl0ekBy
 ZWRoYXQuY29tAAoJEPQH2wBh1c9AnpMIALI1JGJ+kywe+BNTphOKtFGuIuHku/K4
 2GujC9WjX4to1xyozockXcP/AlFCk1yicbxW8hxPMxgtNmLh6E6Y2vRNT4/Fnpjc
 KeyFw062rtD/O6H8jchfX1wIaMBDBASM2GGUtonbfs9mdYmFnEPg/9jo0eX/b9Qp
 kEDFyVonJsAGFT7gWYB7p2RxToqC/JQBdenacKuYeKkLQOMvUHChERktnRhSfRt6
 bJIBkQ1e8xFuuFfnEUQR2GkvzzVw8k7kkY5gSN5m1kBCi3D1i/Pee5CFkYnQFiAD
 66GGKH5OIFt83t7msqu1zhb7kobZ1OH671nMSTdqTRncRnQfLWloCkk=
 =D7LF
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/maxreitz/tags/pull-block-2020-03-11' into staging

Block patches for the 5.0 softfreeze:
- qemu-img measure for LUKS
- Improve block-copy's performance by reducing inter-request
  dependencies
- Make curl's detection of accept-ranges more robust
- Memleak fixes
- iotest fix

# gpg: Signature made Wed 11 Mar 2020 13:19:01 GMT
# gpg:                using RSA key 91BEB60A30DB3E8857D11829F407DB0061D5CF40
# gpg:                issuer "mreitz@redhat.com"
# gpg: Good signature from "Max Reitz <mreitz@redhat.com>" [full]
# Primary key fingerprint: 91BE B60A 30DB 3E88 57D1  1829 F407 DB00 61D5 CF40

* remotes/maxreitz/tags/pull-block-2020-03-11:
  block/block-copy: hide structure definitions
  block/block-copy: reduce intersecting request lock
  block/block-copy: rename start to offset in interfaces
  block/block-copy: refactor interfaces to use bytes instead of end
  block/block-copy: factor out find_conflicting_inflight_req
  block/block-copy: use block_status
  block/block-copy: specialcase first copy_range request
  block/block-copy: fix progress calculation
  job: refactor progress to separate object
  block/qcow2-threads: fix qcow2_decompress
  qemu-img: free memory before re-assign
  block/qcow2: do free crypto_opts in qcow2_close()
  iotests: Fix nonportable use of od --endian
  block/curl: HTTP header field names are case insensitive
  block/curl: HTTP header fields allow whitespace around values
  iotests: add 288 luks qemu-img measure test
  qemu-img: allow qemu-img measure --object without a filename
  luks: implement .bdrv_measure()
  luks: extract qcrypto_block_calculate_payload_offset()

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2020-03-11 17:06:40 +00:00
Daniel Henrique Barboza
e1d7f8bb1e block.c: adding bdrv_co_delete_file
Using the new 'bdrv_co_delete_file' interface, a pure co_routine function
'bdrv_co_delete_file' inside block.c can can be used in a way similar of
the existing bdrv_create_file to to clean up a created file.

We're creating a pure co_routine because the only caller of
'bdrv_co_delete_file' will be already in co_routine context, thus there
is no need to add all the machinery to check for qemu_in_coroutine() and
create a separated co_routine to do the job.

Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20200130213907.2830642-3-danielhb413@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-03-11 15:54:38 +01:00
Daniel Henrique Barboza
9bffae14df block: introducing 'bdrv_co_delete_file' interface
Adding to Block Drivers the capability of being able to clean up
its created files can be useful in certain situations. For the
LUKS driver, for instance, a failure in one of its authentication
steps can leave files in the host that weren't there before.

This patch adds the 'bdrv_co_delete_file' interface to block
drivers and add it to the 'file' driver in file-posix.c. The
implementation is given by 'raw_co_delete_file'.

Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20200130213907.2830642-2-danielhb413@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-03-11 15:54:38 +01:00
Kevin Wolf
c7a0f2be8f block: Make bdrv_get_cumulative_perm() public
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200310113831.27293-2-kwolf@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-03-11 15:54:38 +01:00
Peter Maydell
6e8a73e911 Pull request
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEhpWov9P5fNqsNXdanKSrs4Grc8gFAl5o3EQACgkQnKSrs4Gr
 c8jbYwgAupuS62MCITszybRE5Ote5CL80QDbMXNsZnZh6YBIGhPCqW2zdI2+m0zu
 +qoN6x6dxxNUWCvNlhbOcA45fiZVmYzS69TiEo21kCDijoK+h8+W+YbXuxGR2xJi
 cZMm8Q1DiK6Lj3vyfiwkFf4ns3VNz9DhI9hXu6CcpSkNcp79elQu87JJbzEWWWWy
 uEc7uEyBr0uCAKLEJvaLzzzWE2D2i6qKlmj3G17UbDNgCJ/Q/5HX13RUfMrgNgiP
 wmpcJ5MsB3Prz3K4XMMytUKXX/M8zpRLahp3p31t9qHelTWC3Lk1U4xzLPTJZPlm
 if/lrGRCRml+DKb9keBjWeTF4U31vg==
 =LftU
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging

Pull request

# gpg: Signature made Wed 11 Mar 2020 12:40:36 GMT
# gpg:                using RSA key 8695A8BFD3F97CDAAC35775A9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" [full]
# gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>" [full]
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35  775A 9CA4 ABB3 81AB 73C8

* remotes/stefanha/tags/block-pull-request:
  aio-posix: remove idle poll handlers to improve scalability
  aio-posix: support userspace polling of fd monitoring
  aio-posix: add io_uring fd monitoring implementation
  aio-posix: simplify FDMonOps->update() prototype
  aio-posix: extract ppoll(2) and epoll(7) fd monitoring
  aio-posix: move RCU_READ_LOCK() into run_poll_handlers()
  aio-posix: completely stop polling when disabled
  aio-posix: remove confusing QLIST_SAFE_REMOVE()
  qemu/queue.h: clear linked list pointers on remove

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2020-03-11 14:41:27 +00:00
Vladimir Sementsov-Ogievskiy
397f4e9d83 block/block-copy: hide structure definitions
Hide structure definitions and add explicit API instead, to keep an
eye on the scope of the shared fields.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200311103004.7649-10-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-03-11 12:42:30 +01:00
Vladimir Sementsov-Ogievskiy
8719091f9d block/block-copy: rename start to offset in interfaces
offset/bytes pair is more usual naming in block layer, let's use it.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200311103004.7649-8-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-03-11 12:42:30 +01:00
Vladimir Sementsov-Ogievskiy
dafaf13593 block/block-copy: refactor interfaces to use bytes instead of end
We have a lot of "chunk_end - start" invocations, let's switch to
bytes/cur_bytes scheme instead.

While being here, improve check on block_copy_do_copy parameters to not
overflow when calculating nbytes and use int64_t for bytes in
block_copy for consistency.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200311103004.7649-7-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-03-11 12:42:30 +01:00
Vladimir Sementsov-Ogievskiy
d0ebeca14a block/block-copy: fix progress calculation
Assume we have two regions, A and B, and region B is in-flight now,
region A is not yet touched, but it is unallocated and should be
skipped.

Correspondingly, as progress we have

  total = A + B
  current = 0

If we reset unallocated region A and call progress_reset_callback,
it will calculate 0 bytes dirty in the bitmap and call
job_progress_set_remaining, which will set

   total = current + 0 = 0 + 0 = 0

So, B bytes are actually removed from total accounting. When job
finishes we'll have

   total = 0
   current = B

, which doesn't sound good.

This is because we didn't considered in-flight bytes, actually when
calculating remaining, we should have set (in_flight + dirty_bytes)
as remaining, not only dirty_bytes.

To fix it, let's refactor progress calculation, moving it to block-copy
itself instead of fixing callback. And, of course, track in_flight
bytes count.

We still have to keep one callback, to maintain backup job bytes_read
calculation, but it will go on soon, when we turn the whole backup
process into one block_copy call.

Cc: qemu-stable@nongnu.org
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Message-Id: <20200311103004.7649-3-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-03-11 12:42:30 +01:00
Maxim Levitsky
89802d5ae7 monitor/hmp: Move hmp_drive_add_node to block-hmp-cmds.c
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20200308092440.23564-12-mlevitsk@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2020-03-09 18:20:22 +00:00
Maxim Levitsky
2bcad73c4b monitor/hmp: move hmp_info_block* to block-hmp-cmds.c
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20200308092440.23564-11-mlevitsk@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2020-03-09 18:20:21 +00:00
Maxim Levitsky
1061f8dd80 monitor/hmp: move remaining hmp_block* functions to block-hmp-cmds.c
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20200308092440.23564-10-mlevitsk@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2020-03-09 18:20:13 +00:00
Maxim Levitsky
e263120ecc monitor/hmp: move hmp_nbd_server* to block-hmp-cmds.c
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20200308092440.23564-9-mlevitsk@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2020-03-09 18:17:58 +00:00
Maxim Levitsky
fce2b91fdf monitor/hmp: move hmp_snapshot_* to block-hmp-cmds.c
hmp_snapshot_blkdev is from GPLv2 version of the hmp-cmds.c thus
have to change the licence to GPLv2

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20200308092440.23564-8-mlevitsk@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2020-03-09 18:07:50 +00:00
Maxim Levitsky
6b7fbf61fb monitor/hmp: move hmp_block_job* to block-hmp-cmds.c
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20200308092440.23564-7-mlevitsk@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2020-03-09 18:07:48 +00:00
Maxim Levitsky
0932e3f23d monitor/hmp: move hmp_drive_mirror and hmp_drive_backup to block-hmp-cmds.c
Moved code was added after 2012-01-13, thus under GPLv2+

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20200308092440.23564-6-mlevitsk@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  Fixed commit message
2020-03-09 18:07:35 +00:00
Maxim Levitsky
a1edae276a monitor/hmp: move hmp_drive_del and hmp_commit to block-hmp-cmds.c
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20200308092440.23564-5-mlevitsk@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2020-03-09 18:05:33 +00:00
Maxim Levitsky
a2dde2f221 monitor/hmp: rename device-hotplug.c to block/monitor/block-hmp-cmds.c
These days device-hotplug.c only contains the hmp_drive_add
In the next patch, rest of hmp_drive* functions will be moved
there.

Also add block-hmp-cmds.h to contain prototypes of these
functions

License for block-hmp-cmds.h since it contains the code
moved from sysemu.h which lacks license and thus according
to LICENSE is under GPLv2+

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20200308092440.23564-4-mlevitsk@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2020-03-09 18:05:31 +00:00
Stefan Hajnoczi
d37d0e365a aio-posix: remove idle poll handlers to improve scalability
When there are many poll handlers it's likely that some of them are idle
most of the time.  Remove handlers that haven't had activity recently so
that the polling loop scales better for guests with a large number of
devices.

This feature only takes effect for the Linux io_uring fd monitoring
implementation because it is capable of combining fd monitoring with
userspace polling.  The other implementations can't do that and risk
starving fds in favor of poll handlers, so don't try this optimization
when they are in use.

IOPS improves from 10k to 105k when the guest has 100
virtio-blk-pci,num-queues=32 devices and 1 virtio-blk-pci,num-queues=1
device for rw=randread,iodepth=1,bs=4k,ioengine=libaio on NVMe.

[Clarified aio_poll_handlers locking discipline explanation in comment
after discussion with Paolo Bonzini <pbonzini@redhat.com>.
--Stefan]

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20200305170806.1313245-8-stefanha@redhat.com
Message-Id: <20200305170806.1313245-8-stefanha@redhat.com>
2020-03-09 16:45:16 +00:00
Stefan Hajnoczi
aa38e19f05 aio-posix: support userspace polling of fd monitoring
Unlike ppoll(2) and epoll(7), Linux io_uring completions can be polled
from userspace.  Previously userspace polling was only allowed when all
AioHandler's had an ->io_poll() callback.  This prevented starvation of
fds by userspace pollable handlers.

Add the FDMonOps->need_wait() callback that enables userspace polling
even when some AioHandlers lack ->io_poll().

For example, it's now possible to do userspace polling when a TCP/IP
socket is monitored thanks to Linux io_uring.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20200305170806.1313245-7-stefanha@redhat.com
Message-Id: <20200305170806.1313245-7-stefanha@redhat.com>
2020-03-09 16:41:31 +00:00
Stefan Hajnoczi
73fd282e7b aio-posix: add io_uring fd monitoring implementation
The recent Linux io_uring API has several advantages over ppoll(2) and
epoll(2).  Details are given in the source code.

Add an io_uring implementation and make it the default on Linux.
Performance is the same as with epoll(7) but later patches add
optimizations that take advantage of io_uring.

It is necessary to change how aio_set_fd_handler() deals with deleting
AioHandlers since removing monitored file descriptors is asynchronous in
io_uring.  fdmon_io_uring_remove() marks the AioHandler deleted and
aio_set_fd_handler() will let it handle deletion in that case.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20200305170806.1313245-6-stefanha@redhat.com
Message-Id: <20200305170806.1313245-6-stefanha@redhat.com>
2020-03-09 16:41:31 +00:00