Commit Graph

5717 Commits

Author SHA1 Message Date
Vladimir Sementsov-Ogievskiy
79ef0cebb5 block/copy-before-write: refactor option parsing
We are going to add one more option of enum type. Let's refactor option
parsing so that we can simply work with BlockdevOptionsCbw object.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
2022-06-28 10:20:31 +03:00
Xie Yongji
779d82e1d3 vduse-blk: Add name option
Currently we use 'id' option as the name of VDUSE device.
It's a bit confusing since we use one value for two different
purposes: the ID to identfy the export within QEMU (must be
distinct from any other exports in the same QEMU process, but
can overlap with names used by other processes), and the VDUSE
name to uniquely identify it on the host (must be distinct from
other VDUSE devices on the same host, but can overlap with other
export types like NBD in the same process). To make it clear,
this patch adds a separate 'name' option to specify the VDUSE
name for the vduse-blk export instead.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Message-Id: <20220614051532.92-7-xieyongji@bytedance.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Xie Yongji
0862a087fd vduse-blk: Add serial option
Add a 'serial' option to allow user to specify this value
explicitly. And the default value is changed to an empty
string as what we did in "hw/block/virtio-blk.c".

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Message-Id: <20220614051532.92-6-xieyongji@bytedance.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Eric Blake
2866ddd121 nbd: Drop dead code spotted by Coverity
CID 1488362 points out that the second 'rc >= 0' check is now dead
code.

Reported-by: Peter Maydell <peter.maydell@linaro.org>
Fixes: 172f5f1a40(nbd: remove peppering of nbd_client_connected)
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20220516210519.76135-1-eblake@redhat.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Fabian Ebner
9b38fc56c0 block/gluster: correctly set max_pdiscard
On 64-bit platforms, assigning SIZE_MAX to the int64_t max_pdiscard
results in a negative value, and the following assertion would trigger
down the line (it's not the same max_pdiscard, but computed from the
other one):
qemu-system-x86_64: ../block/io.c:3166: bdrv_co_pdiscard: Assertion
`max_pdiscard >= bs->bl.request_alignment' failed.

On 32-bit platforms, it's fine to keep using SIZE_MAX.

The assertion in qemu_gluster_co_pdiscard() is checking that the value
of 'bytes' can safely be passed to glfs_discard_async(), which takes a
size_t for the argument in question, so it is kept as is. And since
max_pdiscard is still <= SIZE_MAX, relying on max_pdiscard is still
fine.

Fixes: 0c8022876f ("block: use int64_t instead of int in driver discard handlers")
Cc: qemu-stable@nongnu.org
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Message-Id: <20220520075922.43972-1-f.ebner@proxmox.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Stefano Garzarella
66dc5f9606 block/rbd: report a better error when namespace does not exist
If the namespace does not exist, rbd_create() fails with -ENOENT and
QEMU reports a generic "error rbd create: No such file or directory":

    $ qemu-img create rbd:rbd/namespace/image 1M
    Formatting 'rbd:rbd/namespace/image', fmt=raw size=1048576
    qemu-img: rbd:rbd/namespace/image: error rbd create: No such file or directory

Unfortunately rados_ioctx_set_namespace() does not fail if the namespace
does not exist, so let's use rbd_namespace_exists() in qemu_rbd_connect()
to check if the namespace exists, reporting a more understandable error:

    $ qemu-img create rbd:rbd/namespace/image 1M
    Formatting 'rbd:rbd/namespace/image', fmt=raw size=1048576
    qemu-img: rbd:rbd/namespace/image: namespace 'namespace' does not exist

Reported-by: Tingting Mao <timao@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20220517071012.6120-1-sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Xie Yongji
d043e2db87 libvduse: Add support for reconnecting
To support reconnecting after restart or crash, VDUSE backend
might need to resubmit inflight I/Os. This stores the metadata
such as the index of inflight I/O's descriptors to a shm file so
that VDUSE backend can restore them during reconnecting.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Message-Id: <20220523084611.91-9-xieyongji@bytedance.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Xie Yongji
9e4dea6727 vduse-blk: Add vduse-blk resize support
To support block resize, this uses vduse_dev_update_config()
to update the capacity field in configuration space and inject
config interrupt on the block resize callback.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220523084611.91-8-xieyongji@bytedance.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Xie Yongji
2a2359b844 vduse-blk: Implement vduse-blk export
This implements a VDUSE block backends based on
the libvduse library. We can use it to export the BDSs
for both VM and container (host) usage.

The new command-line syntax is:

$ qemu-storage-daemon \
    --blockdev file,node-name=drive0,filename=test.img \
    --export vduse-blk,node-name=drive0,id=vduse-export0,writable=on

After the qemu-storage-daemon started, we need to use
the "vdpa" command to attach the device to vDPA bus:

$ vdpa dev add name vduse-export0 mgmtdev vduse

Also the device must be removed via the "vdpa" command
before we stop the qemu-storage-daemon.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220523084611.91-7-xieyongji@bytedance.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Xie Yongji
5c36802970 block/export: Abstract out the logic of virtio-blk I/O process
Abstract the common logic of virtio-blk I/O process to a function
named virtio_blk_process_req(). It's needed for the following commit.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Message-Id: <20220523084611.91-4-xieyongji@bytedance.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Xie Yongji
8e7fd6f623 block/export: Fix incorrect length passed to vu_queue_push()
Now the req->size is set to the correct value only
when handling VIRTIO_BLK_T_GET_ID request. This patch
fixes it.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Message-Id: <20220523084611.91-3-xieyongji@bytedance.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Xie Yongji
ac1fc3a3a9 block: Support passing NULL ops to blk_set_dev_ops()
This supports passing NULL ops to blk_set_dev_ops()
so that we can remove stale ops in some cases.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220523084611.91-2-xieyongji@bytedance.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Vladimir Sementsov-Ogievskiy
618af89e55 block: simplify handling of try to merge different sized bitmaps
We have too much logic to simply check that bitmaps are of the same
size. Let's just define that hbitmap_merge() and
bdrv_dirty_bitmap_merge_internal() require their argument bitmaps be of
same size, this simplifies things.

Let's look through the callers:

For backup_init_bcs_bitmap() we already assert that merge can't fail.

In bdrv_reclaim_dirty_bitmap_locked() we gracefully handle the error
that can't happen: successor always has same size as its parent, drop
this logic.

In bdrv_merge_dirty_bitmap() we already has assertion and separate
check. Make the check explicit and improve error message.

Signed-off-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220517111206.23585-4-v.sementsov-og@mail.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Vladimir Sementsov-Ogievskiy
58cbfbdf73 block: improve block_dirty_bitmap_merge(): don't allocate extra bitmap
We don't need extra bitmap. All we need is to backup the original
bitmap when we do first merge. So, drop extra temporary bitmap and work
directly with target and backup.

Still to keep old semantics, that on failure target is unchanged and
user don't need to restore, we need a local_backup variable and do
restore ourselves on failure path.

Signed-off-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
Message-Id: <20220517111206.23585-3-v.sementsov-og@mail.ru>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Vladimir Sementsov-Ogievskiy
775b30b305 block: block_dirty_bitmap_merge(): fix error path
At the end we ignore failure of bdrv_merge_dirty_bitmap() and report
success. And still set errp. That's wrong.

Signed-off-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220517111206.23585-2-v.sementsov-og@mail.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Stefan Hajnoczi
1ab5096b3a block: get rid of blk->guest_block_size
Commit 1b7fd72955 ("block: rename buffer_alignment to
guest_block_size") noted:

  At this point, the field is set by the device emulation, but completely
  ignored by the block layer.

The last time the value of buffer_alignment/guest_block_size was
actually used was before commit 339064d506 ("block: Don't use guest
sector size for qemu_blockalign()").

This value has not been used since 2013. Get rid of it.

Cc: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220518130945.2657905-1-stefanha@redhat.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Faria <afaria@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Stefan Hajnoczi
3399848b7f block: drop unused bdrv_co_drain() API
bdrv_co_drain() has not been used since commit 9a0cec664e ("mirror:
use bdrv_drained_begin/bdrv_drained_end") in 2016. Remove it so there
are fewer drain scenarios to worry about.

Use bdrv_drained_begin()/bdrv_drained_end() instead. They are "mixed"
functions that can be called from coroutine context. Unlike
bdrv_co_drain(), these functions provide control of the length of the
drained section, which is usually the right thing.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220521122714.3837731-1-stefanha@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Alberto Faria <afaria@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Stefan Hajnoczi
99b969fbe1 linux-aio: explain why max batch is checked in laio_io_unplug()
It may not be obvious why laio_io_unplug() checks max batch. I discussed
this with Stefano and have added a comment summarizing the reason.

Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20220609164712.1539045-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-06-15 16:43:42 +01:00
Stefan Hajnoczi
f387cac5af linux-aio: fix unbalanced plugged counter in laio_io_unplug()
Every laio_io_plug() call has a matching laio_io_unplug() call. There is
a plugged counter that tracks the number of levels of plugging and
allows for nesting.

The plugged counter must reflect the balance between laio_io_plug() and
laio_io_unplug() calls accurately. Otherwise I/O stalls occur since
io_submit(2) calls are skipped while plugged.

Reported-by: Nikolay Tenev <nt@storpool.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20220609164712.1539045-2-stefanha@redhat.com
Cc: Stefano Garzarella <sgarzare@redhat.com>
Fixes: 68d7946648 ("linux-aio: add `dev_max_batch` parameter to laio_io_unplug()")
[Stefano Garzarella suggested adding a Fixes tag.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-06-15 16:43:42 +01:00
Sam Li
e2848bc574 Use io_uring_register_ring_fd() to skip fd operations
Linux recently added a new io_uring(7) optimization API that QEMU
doesn't take advantage of yet. The liburing library that QEMU uses
has added a corresponding new API calling io_uring_register_ring_fd().
When this API is called after creating the ring, the io_uring_submit()
library function passes a flag to the io_uring_enter(2) syscall
allowing it to skip the ring file descriptor fdget()/fdput()
operations. This saves some CPU cycles.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Message-id: 20220531105011.111082-1-faithilikerun@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-06-15 14:50:41 +01:00
Paolo Bonzini
f0d43b1ece coroutine-lock: qemu_co_queue_restart_all is a coroutine-only qemu_co_enter_all
qemu_co_queue_restart_all is basically the same as qemu_co_enter_all
but without a QemuLockable argument.  That's perfectly fine, but only as
long as the function is marked coroutine_fn.  If used outside coroutine
context, qemu_co_queue_wait will attempt to take the lock and that
is just broken: if you are calling qemu_co_queue_restart_all outside
coroutine context, the lock is going to be a QemuMutex which cannot be
taken twice by the same thread.

The patch adds the marker to qemu_co_queue_restart_all and to its sole
non-coroutine_fn caller; it then reimplements the function in terms of
qemu_co_enter_all_impl, to remove duplicated code and to clarify that the
latter also works in coroutine context.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20220427130830.150180-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 12:29:44 +02:00
Markus Armbruster
9c0928045c Clean up ill-advised or unusual header guards
Leading underscores are ill-advised because such identifiers are
reserved.  Trailing underscores are merely ugly.  Strip both.

Our header guards commonly end in _H.  Normalize the exceptions.

Macros should be ALL_CAPS.  Normalize the exception.

Done with scripts/clean-header-guards.pl.

include/hw/xen/interface/ and tools/virtiofsd/ left alone, because
these were imported from Xen and libfuse respectively.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20220506134911.2856099-3-armbru@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
2022-05-11 16:50:01 +02:00
Markus Armbruster
52581c718c Clean up header guards that don't match their file name
Header guard symbols should match their file name to make guard
collisions less likely.

Cleaned up with scripts/clean-header-guards.pl, followed by some
renaming of new guard symbols picked by the script to better ones.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20220506134911.2856099-2-armbru@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
[Change to generated file ebpf/rss.bpf.skeleton.h backed out]
2022-05-11 16:49:06 +02:00
Hanna Reitz
6d17e28798 block/vmdk: Fix reopening bs->file
VMDK disk data is stored in extents, which may or may not be separate
from bs->file.  VmdkExtent.file points to where they are stored.  Each
that is stored in bs->file will simply reuse the exact pointer value of
bs->file.

(That is why vmdk_free_extents() will unref VmdkExtent.file (e->file)
only if e->file != bs->file.)

Reopen operations can change bs->file (they will replace the whole
BdrvChild object, not just the BDS stored in that BdrvChild), and then
we will need to change all .file pointers of all such VmdkExtents to
point to the new BdrvChild.

In vmdk_reopen_prepare(), we have to check which VmdkExtents are
affected, and in vmdk_reopen_commit(), we can modify them.  We have to
split this because:
- The new BdrvChild is created only after prepare, so we can change
  VmdkExtent.file only in commit
- In commit, there no longer is any (valid) reference to the old
  BdrvChild object, so there would be nothing to compare VmdkExtent.file
  against to see whether it was equal to bs->file before reopening
  (There is BDRVReopenState.old_file_bs, but the old bs->file
  BdrvChild's .bs pointer will be NULL-ed when the new BdrvChild is
  created, and so we cannot compare VmdkExtent.file->bs against
  BDRVReopenState.old_file_bs)

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220314162719.65384-2-hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-05-04 15:55:23 +02:00
Hanna Reitz
06e9cd19a4 qcow2: Do not reopen data_file in invalidate_cache
qcow2_co_invalidate_cache() closes and opens the qcow2 file, by calling
qcow2_close() and qcow2_do_open().  These two functions must thus be
usable from both a global-state and an I/O context.

As they are, they are not safe to call in an I/O context, because they
use bdrv_unref_child() and bdrv_open_child() to close/open the data_file
child, respectively, both of which are global-state functions.  When
used from qcow2_co_invalidate_cache(), we do not need to close/open the
data_file child, though (we do not do this for bs->file or bs->backing
either), and so we should skip it in the qcow2_co_invalidate_cache()
path.

To do so, add a parameter to qcow2_do_open() and qcow2_close() to make
them skip handling s->data_file, and have qcow2_co_invalidate_cache()
exempt it from the memset() on the BDRVQcow2State.

(Note that the QED driver similarly closes/opens the QED image by
invoking bdrv_qed_close()+bdrv_qed_do_open(), but both functions seem
safe to use in an I/O context.)

Fixes: https://gitlab.com/qemu-project/qemu/-/issues/945
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220427114057.36651-3-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-05-04 15:55:23 +02:00
Marc-André Lureau
ad24b679d2 block: move fcntl_setfl()
It is only used by block/file-posix.c, move it there.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
2022-05-03 15:17:53 +04:00
Paolo Bonzini
620c5cb5da nbd: document what is protected by the CoMutexes
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220414175756.671165-10-pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-26 13:16:42 -05:00
Paolo Bonzini
a80a9a1c73 nbd: take receive_mutex when reading requests[].receiving
requests[].receiving is set by nbd_receive_replies() under the receive_mutex;
Read it under the same mutex as well.  Waking up receivers on errors happens
after each reply finishes processing, in nbd_co_receive_one_chunk().
If there is no currently-active reply, there are two cases:

* either there is no active request at all, in which case no
element of request[] can have .receiving = true

* or nbd_receive_replies() must be running and owns receive_mutex;
in that case it will get back to nbd_co_receive_one_chunk() because
the socket has been shutdown, and all waiting coroutines will wake up
in turn.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220414175756.671165-9-pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-26 13:16:41 -05:00
Paolo Bonzini
dba5156c0e nbd: move s->state under requests_lock
Remove the confusing, and most likely wrong, atomics.  The only function
that used to be somewhat in a hot path was nbd_client_connected(),
but it is not anymore after the previous patches.

The same logic is used both to check if a request had to be reissued
and also in nbd_reconnecting_attempt().  The former cases are outside
requests_lock, while nbd_reconnecting_attempt() does have the lock,
therefore the two have been separated in the previous commit.
nbd_client_will_reconnect() can simply take s->requests_lock, while
nbd_reconnecting_attempt() can inline the access now that no
complicated atomics are involved.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220414175756.671165-8-pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-26 13:16:39 -05:00
Paolo Bonzini
8d45185cb7 nbd: code motion and function renaming
Prepare for the next patch, so that the diff is less confusing.

nbd_client_connecting is moved closer to the definition point.

nbd_client_connecting_wait() is kept only for the reconnection
logic; when it is used to check if a request has to be reissued,
use the renamed function nbd_client_will_reconnect().  In the
next patch, the two cases will have different locking requirements.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220414175756.671165-7-pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-26 13:16:38 -05:00
Paolo Bonzini
ee19d953ec nbd: use a QemuMutex to synchronize yanking, reconnection and coroutines
The condition for waiting on the s->free_sema queue depends on
both s->in_flight and s->state.  The latter is currently using
atomics, but this is quite dubious and probably wrong.

Because s->state is written in the main thread too, for example by
the yank callback, it cannot be protected by a CoMutex.  Introduce a
separate lock that can be used by nbd_co_send_request(); later on this
lock will also be used for s->state.  There will not be any contention
on the lock unless there is a yank or reconnect, so this is not
performance sensitive.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220414175756.671165-6-pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-26 13:16:37 -05:00
Paolo Bonzini
8610b4491f nbd: keep send_mutex/free_sema handling outside nbd_co_do_establish_connection
Elevate s->in_flight early so that other incoming requests will wait
on the CoQueue in nbd_co_send_request; restart them after getting back
from nbd_reconnect_attempt.  This could be after the reconnect timer or
nbd_cancel_in_flight have cancelled the attempt, so there is no
need anymore to cancel the requests there.

nbd_co_send_request now handles both stopping and restarting pending
requests after a successful connection, and there is no need to
hold send_mutex in nbd_co_do_establish_connection.  The current setup
is confusing because nbd_co_do_establish_connection is called both with
send_mutex taken and without it.  Before the patch it uses free_sema which
(at least in theory...) is protected by send_mutex, after the patch it
does not anymore.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220414175756.671165-5-pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: wrap long line]
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-26 13:16:36 -05:00
Paolo Bonzini
172f5f1a40 nbd: remove peppering of nbd_client_connected
It is unnecessary to check nbd_client_connected() because every time
s->state is moved out of NBD_CLIENT_CONNECTED the socket is shut down
and all coroutines are resumed.

The only case where it was actually needed is when the NBD
server disconnects and there is no reconnect-delay.  In that
case, nbd_receive_replies() does not set s->reply.handle and
nbd_co_do_receive_one_chunk() cannot continue.  For that one case,
check the return value of nbd_receive_replies().

As to the others:

* nbd_receive_replies() can put the current coroutine to sleep if another
reply is ongoing; then it will be woken by nbd_channel_error(), called
by the ongoing reply.  Or it can try itself to read a reply header and
fail, thus calling nbd_channel_error() itself.

* nbd_co_send_request() will write the body of the request and fail

* nbd_reply_chunk_iter_receive() will call nbd_co_receive_one_chunk()
and then nbd_co_do_receive_one_chunk(), which will handle the failure as
above; or it will just detect a previous call to nbd_iter_channel_error()
via iter->ret < 0.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220414175756.671165-4-pbonzini@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-26 13:16:36 -05:00
Paolo Bonzini
0c43c6fc89 nbd: mark more coroutine_fns
Several coroutine functions in block/nbd.c are not marked as such.  This
patch adds a few more markers; it is not exhaustive, but it focuses
especially on:

- places that wake other coroutines, because aio_co_wake() has very
different semantics inside a coroutine (queuing after yield vs. entering
immediately);

- functions with _co_ in their names, to avoid confusion

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220414175756.671165-3-pbonzini@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-26 13:16:35 -05:00
Paolo Bonzini
8846b7d1c1 nbd: safeguard against waking up invalid coroutine
The .reply_possible field of s->requests is never set to false.  This is
not a problem as it is only a safeguard to detect protocol errors,
but it's sloppy.  In fact, the field is actually not necessary at all,
because .coroutine is set to NULL in NBD_FOREACH_REPLY_CHUNK after
receiving the last chunk.  Thus, replace .reply_possible with .coroutine
and move the check before deciding the fate of this request.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220414175756.671165-2-pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-26 13:16:24 -05:00
Vladimir Sementsov-Ogievskiy
1466ef6cbe qapi: rename BlockDirtyBitmapMergeSource to BlockDirtyBitmapOrStr
Rename the type to be reused. Old name is "what is it for". To be
natively reused for other needs, let's name it exactly "what is it".

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Message-Id: <20220314213226.362217-2-v.sementsov-og@mail.ru>
[eblake: Adjust S-o-b to Vladimir's new email, with permission]
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: John Snow <jsnow@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-26 13:13:50 -05:00
Denis V. Lunev
80dd5aff1b block: add 'force' parameter to 'blockdev-change-medium' command
'blockdev-change-medium' is a convinient wrapper for the following
sequence of commands:
 * blockdev-open-tray
 * blockdev-remove-medium
 * blockdev-insert-medium
 * blockdev-close-tray
and should be used f.e. to change ISO image inside the CD-ROM tray.
Though the guest could lock the tray and some linux guests like
CentOS 8.5 actually does that. In this case the execution if this
command results in the error like the following:
  Device 'scsi0-0-1-0' is locked and force was not specified,
  wait for tray to open and try again.

This situation is could be resolved 'blockdev-open-tray' by passing
flag 'force' inside. Thus is seems reasonable to add the same
capability for 'blockdev-change-medium' too.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Acked-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Hanna Reitz <hreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Markus Armbruster <armbru@redhat.com>
Message-Id: <20220412221846.280723-1-den@openvz.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-04-25 12:02:36 +02:00
Hanna Reitz
0423f75351 qcow2: Add errp to rebuild_refcount_structure()
Instead of fprint()-ing error messages in rebuild_refcount_structure()
and its rebuild_refcounts_write_refblocks() helper, pass them through an
Error object to qcow2_check_refcounts() (which will then print it).

Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220405134652.19278-4-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2022-04-20 12:09:17 +02:00
Hanna Reitz
a8c07ec287 qcow2: Improve refcount structure rebuilding
When rebuilding the refcount structures (when qemu-img check -r found
errors with refcount = 0, but reference count > 0), the new refcount
table defaults to being put at the image file end[1].  There is no good
reason for that except that it means we will not have to rewrite any
refblocks we already wrote to disk.

Changing the code to rewrite those refblocks is not too difficult,
though, so let us do that.  That is beneficial for images on block
devices, where we cannot really write beyond the end of the image file.

Use this opportunity to add extensive comments to the code, and refactor
it a bit, getting rid of the backwards-jumping goto.

[1] Unless there is something allocated in the area pointed to by the
    last refblock, so we have to write that refblock.  In that case, we
    try to put the reftable in there.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1519071
Closes: https://gitlab.com/qemu-project/qemu/-/issues/941
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220405134652.19278-2-hreitz@redhat.com>
2022-04-20 10:14:28 +02:00
Marc-André Lureau
0f9668e0c1 Remove qemu-common.h include from most units
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20220323155743.1585078-33-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-06 14:31:55 +02:00
Marc-André Lureau
89fc45d5c6 include: move qemu_get_vm_name() to sysemu.h
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20220323155743.1585078-26-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-06 14:31:43 +02:00
Marc-André Lureau
8e3b0cbb72 Replace qemu_real_host_page variables with inlined functions
Replace the global variables with inlined helper functions. getpagesize() is very
likely annotated with a "const" function attribute (at least with glibc), and thus
optimization should apply even better.

This avoids the need for a constructor initialization too.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20220323155743.1585078-12-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-06 10:50:38 +02:00
Hanna Reitz
b1e1af394d block/stream: Drain subtree around graph change
When the stream block job cuts out the nodes between top and base in
stream_prepare(), it does not drain the subtree manually; it fetches the
base node, and tries to insert it as the top node's backing node with
bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
the actual base node might change (because the base node is actually not
part of the stream job) before the old base node passed to
bdrv_set_backing_hd() is installed.

This has two implications:

First, the stream job does not keep a strong reference to the base node.
Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
because some other block job is drained to finish), we will get a
use-after-free.  We should keep a strong reference to that node.

Second, even with such a strong reference, the problem remains that the
base node might change before bdrv_set_backing_hd() actually runs and as
a result the wrong base node is installed.

Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
case, which has five nodes, and simultaneously streams from the middle
node to the top node, and commits the middle node down to the base node.
As it is, this will sometimes crash, namely when we encounter the
above-described use-after-free.

Taking a strong reference to the base node, we no longer get a crash,
but the resuling block graph is less than ideal: The expected result is
obviously that all middle nodes are cut out and the base node is the
immediate backing child of the top node.  However, if stream_prepare()
takes a strong reference to its base node (the middle node), and then
the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
that middle node, the stream job will just reinstall it again.

Therefore, we need to keep the whole subtree drained in
stream_prepare(), so that the graph modification it performs is
effectively atomic, i.e. that the base node it fetches is still the base
node when bdrv_set_backing_hd() sets it as the top node's backing node.

Verify this by asserting in said 030's test case that the base node is
always the top node's immediate backing child when both jobs are done.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220324140907.17192-1-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
2022-03-29 16:30:55 +02:00
Philippe Mathieu-Daudé
3f1db95917 block: Fix misleading hexadecimal format
"0x%u" format is very misleading, replace by "0x%x".

Found running:

  $ git grep -E '0x%[0-9]*([lL]*|" ?PRI)[dDuU]' block/

Inspired-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Message-id: 20220323114718.58714-2-philippe.mathieu.daude@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-03-24 10:38:42 +00:00
Peter Maydell
04ddcda6a2 Fixes and cleanups for 7.0
Hi,
 
 A collection of fixes & cleanup patches that should be safe for 7.0 inclusion.
 -----BEGIN PGP SIGNATURE-----
 
 iQJQBAABCAA6FiEEh6m9kz+HxgbSdvYt2ujhCXWWnOUFAmI5vPIcHG1hcmNhbmRy
 ZS5sdXJlYXVAcmVkaGF0LmNvbQAKCRDa6OEJdZac5a7ED/9+DCc6b+yAeMsFR7SI
 kqxSvPW9RbgQrJo0LrJxX7H+xYs40JFpkNZFhuAGgWPrk6GlebMzg+aMgSlZi4XN
 B7y5/dAKUUPCC+kNQ7azP4Gp+xb+Pxg2ZZxQ9SnxsGgPWC1prliiB8Zbvs8f5lHl
 ACbh7wvfVOcSJoMaCAf5km4AFzWYQQkwn2w3CRl4CfWnuWUhjnnYL9DfjHrfaYPK
 JCbRCx534dy/amrMPgbAOcDRl0K9/9Tw+xATxOkQPLZ4Za4tclsAGZ9Hb2WoDuWS
 LYQ1ZJVouv37EnaPVMCyPyC2n4oLJ86L2RCSBqKgIgv7rmwTUcqlfYPVg7TZGxuw
 T234lIc8AXcm2UNQ4iTXLH/Od9RGHKseZSF8QYTVGNDtfvp3bDFVT6k5e2X/SpXY
 gVloTdFzmwYWM8dtREPepZlEhXNKz7XdltlrcwyDdKWW0OffLRyKkNIsuUja7EoL
 q4n8l4tq084iLTHpEUSWaFwZvu89b8n81hML0box6XXrOldk1qdf57Ka5gqxNrnk
 pJES7ocRoTANjZgASrJW8vPu3/GkdlmE/Khf5bnOzq/lWMwVxPqYEQY+PRoAU2zR
 MS1UJ9IITe3toJlx7+DqR8Lo6fUyralwKv/MUnBW65S45S7VkbCO4anELNnVvzAE
 CFfsa30VblNDEbppBMXwRFyX0Q==
 =fKgO
 -----END PGP SIGNATURE-----

Merge tag 'fixes-pull-request' of gitlab.com:marcandre.lureau/qemu into staging

Fixes and cleanups for 7.0

Hi,

A collection of fixes & cleanup patches that should be safe for 7.0 inclusion.

# gpg: Signature made Tue 22 Mar 2022 12:11:30 GMT
# gpg:                using RSA key 87A9BD933F87C606D276F62DDAE8E10975969CE5
# gpg:                issuer "marcandre.lureau@redhat.com"
# gpg: Good signature from "Marc-André Lureau <marcandre.lureau@redhat.com>" [full]
# gpg:                 aka "Marc-André Lureau <marcandre.lureau@gmail.com>" [full]
# Primary key fingerprint: 87A9 BD93 3F87 C606 D276  F62D DAE8 E109 7596 9CE5

* tag 'fixes-pull-request' of gitlab.com:marcandre.lureau/qemu: (21 commits)
  qapi: remove needless include
  Remove trailing ; after G_DEFINE_AUTO macro
  tests: remove needless include
  error: use GLib to remember the program name
  qga: remove bswap.h include
  qapi: remove needless include
  meson: fix CONFIG_ATOMIC128 check
  meson: move int128 checks from configure
  qapi: remove needless include
  util: remove the net/net.h dependency
  util: remove needless includes
  scripts/modinfo-collect: remove unused/dead code
  Move HOST_LONG_BITS to compiler.h
  Simplify HOST_LONG_BITS
  compiler.h: replace QEMU_SENTINEL with G_GNUC_NULL_TERMINATED
  compiler.h: replace QEMU_WARN_UNUSED_RESULT with G_GNUC_WARN_UNUSED_RESULT
  Replace GCC_FMT_ATTR with G_GNUC_PRINTF
  Drop qemu_foo() socket API wrapper
  m68k/nios2-semi: fix gettimeofday() result check
  vl: typo fix in a comment
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2022-03-22 18:43:03 +00:00
Peter Maydell
9d36d5f7e0 Block patches for 7.0-rc1:
- iotest fixes:
   - Fix some iotests for riscv targets
   - Use GNU sed in more places where required
   - Meson-related fixes (i.e. to print errors when they occur)
   - Have qemu-img calls (from Python tests) generally raise nicely
     formattable exceptions on errors
   - Fix iotest 207
 - Allow RBD images to be growable by writing zeroes past the end of
   file, fixing qcow2 on rbd
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEy2LXoO44KeRfAE00ofpA0JgBnN8FAmI5uC4SHGhyZWl0ekBy
 ZWRoYXQuY29tAAoJEKH6QNCYAZzfsPIP/1iRUWmWd9h9m7gX5hSU+TWYXsH+Ua/G
 fKdDHqNlVjQHq6SDN+A3jDCjAV9I91vlHRVIMWXc4QlKbdNdY7yUP1zYKuRTbfig
 UXce6g1NQoZUIlBSHZDqzgEmrvjunwP1U2te2LWliQjmvlTJgdIHYUJn2VgmY3wg
 2vo8exE3YO1FArS9nFsiX/1Ju35Dm4+3w9NkI6KxnKvaFpY++jovVVy2Bj7Jmfsh
 bRnfiQtDVX+0H4FXYS02hDEX4h7PTzsd1DeapVbiiZW9KJrbb0rchTSGc+VKMdkC
 z2XDfU+Hw4jYlNomuWdSZy6qJMNzUaKEqJMji1OiLym4429OAyseL8EO5c6Utjcb
 RRqGKWBOp1ceFZcy8vmQ2Rxc7b3Nc/Jv41Ty7PbyHmrtd2nbgD+AVnlFH9qWDtZo
 clvFSaIHcHaC4k+MppsbGTKvbW7qRYUkdk1B+tFlZytwQpFvM4oK2CF8jNzoLYfY
 qJIvrBqgaBKnYzAGCV4qgH9I0gWY7WHwvwfevHy47rk7XDsErWKMfBFy294TYDyq
 +yU6K1VijWDEn/DdQZMSZQJeE7ByA4cfSSYGRmwtTLOZxmUi4mEBEEo6Lw3x4eI2
 eXj/fhRenbjfE+5p3xmWvEyaVvvyor9oXUPH45HN/LvGgUJmEsjs0XGwnNNwxJBT
 s2lp8U5Lo4Lb
 =0cG5
 -----END PGP SIGNATURE-----

Merge tag 'pull-block-2022-03-22' of https://gitlab.com/hreitz/qemu into staging

Block patches for 7.0-rc1:
- iotest fixes:
  - Fix some iotests for riscv targets
  - Use GNU sed in more places where required
  - Meson-related fixes (i.e. to print errors when they occur)
  - Have qemu-img calls (from Python tests) generally raise nicely
    formattable exceptions on errors
  - Fix iotest 207
- Allow RBD images to be growable by writing zeroes past the end of
  file, fixing qcow2 on rbd

# gpg: Signature made Tue 22 Mar 2022 11:51:10 GMT
# gpg:                using RSA key CB62D7A0EE3829E45F004D34A1FA40D098019CDF
# gpg:                issuer "hreitz@redhat.com"
# gpg: Good signature from "Hanna Reitz <hreitz@redhat.com>" [marginal]
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg:          It is not certain that the signature belongs to the owner.
# Primary key fingerprint: CB62 D7A0 EE38 29E4 5F00  4D34 A1FA 40D0 9801 9CDF

* tag 'pull-block-2022-03-22' of https://gitlab.com/hreitz/qemu: (25 commits)
  iotests/207: Filter host fingerprint
  iotests.py: Filters for VM.run_job()
  iotests: make qemu_img_log and img_info_log raise on error
  iotests: remove qemu_img_pipe_and_status()
  iotests: replace qemu_img_log('create', ...) calls
  iotests: use qemu_img() in has_working_luks()
  iotests: remove remaining calls to qemu_img_pipe()
  iotests/149: Remove qemu_img_pipe() call
  iotests: replace unchecked calls to qemu_img_pipe()
  iotests: change supports_quorum to use qemu_img
  iotests: add qemu_img_map() function
  iotests/remove-bitmap-from-backing: use qemu_img_info()
  iotests: add qemu_img_info()
  iotests: use qemu_img_json() when applicable
  iotests: add qemu_img_json()
  iotests: fortify compare_images() against crashes
  iotests: make qemu_img raise on non-zero rc by default
  iotests: Remove explicit checks for qemu_img() == 0
  python/utils: add VerboseProcessError
  python/utils: add add_visual_margin() text decoration utility
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2022-03-22 12:44:11 +00:00
Marc-André Lureau
c08401793a compiler.h: replace QEMU_WARN_UNUSED_RESULT with G_GNUC_WARN_UNUSED_RESULT
One less qemu-specific macro. It also helps to make some headers/units
only depend on glib, and thus moved in standalone projects eventually.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
2022-03-22 14:40:51 +04:00
Marc-André Lureau
9edc6313da Replace GCC_FMT_ATTR with G_GNUC_PRINTF
One less qemu-specific macro. It also helps to make some headers/units
only depend on glib, and thus moved in standalone projects eventually.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2022-03-22 14:40:51 +04:00
Stefano Garzarella
cc5387a544 block/rbd: fix write zeroes with growing images
Commit d24f80234b ("block/rbd: increase dynamically the image size")
added a workaround to support growing images (eg. qcow2), resizing
the image before write operations that exceed the current size.

We recently added support for write zeroes and without the
workaround we can have problems with qcow2.

So let's move the resize into qemu_rbd_start_co() and do it when
the command is RBD_AIO_WRITE or RBD_AIO_WRITE_ZEROES.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2020993
Fixes: c56ac27d2a ("block/rbd: add write zeroes support")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20220317162638.41192-1-sgarzare@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-22 09:40:54 +01:00
Rao Lei
6690302b84 block/nbd.c: Fixed IO request coroutine not being wakeup when kill NBD server
During the IO stress test, the IO request coroutine has a probability that is
can't be awakened when the NBD server is killed.

The GDB stack is as follows:
(gdb) bt
0  0x00007f2ff990cbf6 in __ppoll (fds=0x55575de85000, nfds=1, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
1  0x000055575c302e7c in qemu_poll_ns (fds=0x55575de85000, nfds=1, timeout=599999603140) at ../util/qemu-timer.c:348
2  0x000055575c2d3c34 in fdmon_poll_wait (ctx=0x55575dc480f0, ready_list=0x7ffd9dd1dae0, timeout=599999603140) at ../util/fdmon-poll.c:80
3  0x000055575c2d350d in aio_poll (ctx=0x55575dc480f0, blocking=true) at ../util/aio-posix.c:655
4  0x000055575c16eabd in bdrv_do_drained_begin(bs=0x55575dee7fe0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true)at ../block/io.c:474
5  0x000055575c16eba6 in bdrv_drained_begin (bs=0x55575dee7fe0) at ../block/io.c:480
6  0x000055575c1aff33 in quorum_del_child (bs=0x55575dee7fe0, child=0x55575dcea690, errp=0x7ffd9dd1dd08) at ../block/quorum.c:1130
7  0x000055575c14239b in bdrv_del_child (parent_bs=0x55575dee7fe0, child=0x55575dcea690, errp=0x7ffd9dd1dd08) at ../block.c:7705
8  0x000055575c12da28 in qmp_x_blockdev_change(parent=0x55575df404c0 "colo-disk0", has_child=true, child=0x55575de867f0 "children.1", has_node=false, no   de=0x0, errp=0x7ffd9dd1dd08) at ../blockdev.c:3676
9  0x000055575c258435 in qmp_marshal_x_blockdev_change (args=0x7f2fec008190, ret=0x7f2ff7b0bd98, errp=0x7f2ff7b0bd90) at qapi/qapi-commands-block-core.c   :1675
10 0x000055575c2c6201 in do_qmp_dispatch_bh (opaque=0x7f2ff7b0be30) at ../qapi/qmp-dispatch.c:129
11 0x000055575c2ebb1c in aio_bh_call (bh=0x55575dc429c0) at ../util/async.c:141
12 0x000055575c2ebc2a in aio_bh_poll (ctx=0x55575dc480f0) at ../util/async.c:169
13 0x000055575c2d2d96 in aio_dispatch (ctx=0x55575dc480f0) at ../util/aio-posix.c:415
14 0x000055575c2ec07f in aio_ctx_dispatch (source=0x55575dc480f0, callback=0x0, user_data=0x0) at ../util/async.c:311
15 0x00007f2ff9e7cfbd in g_main_context_dispatch () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
16 0x000055575c2fd581 in glib_pollfds_poll () at ../util/main-loop.c:232
17 0x000055575c2fd5ff in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:255
18 0x000055575c2fd710 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:531
19 0x000055575bfa7588 in qemu_main_loop () at ../softmmu/runstate.c:726
20 0x000055575bbee57a in main (argc=60, argv=0x7ffd9dd1e0e8, envp=0x7ffd9dd1e2d0) at ../softmmu/main.c:50

(gdb) qemu coroutine 0x55575e16aac0
0  0x000055575c2ee7dc in qemu_coroutine_switch (from_=0x55575e16aac0, to_=0x7f2ff830fba0, action=COROUTINE_YIELD) at ../util/coroutine-ucontext.c:302
1  0x000055575c2fe2a9 in qemu_coroutine_yield () at ../util/qemu-coroutine.c:195
2  0x000055575c2fe93c in qemu_co_queue_wait_impl (queue=0x55575dc46170, lock=0x7f2b32ad9850) at ../util/qemu-coroutine-lock.c:56
3  0x000055575c17ddfb in nbd_co_send_request (bs=0x55575ebfaf20, request=0x7f2b32ad9920, qiov=0x55575dfc15d8) at ../block/nbd.c:478
4  0x000055575c17f931 in nbd_co_request (bs=0x55575ebfaf20, request=0x7f2b32ad9920, write_qiov=0x55575dfc15d8) at ../block/nbd.c:1182
5  0x000055575c17fe14 in nbd_client_co_pwritev (bs=0x55575ebfaf20, offset=403487858688, bytes=4538368, qiov=0x55575dfc15d8, flags=0) at ../block/nbd.c:1284
6  0x000055575c170d25 in bdrv_driver_pwritev (bs=0x55575ebfaf20, offset=403487858688, bytes=4538368, qiov=0x55575dfc15d8, qiov_offset=0, flags=0)
    at ../block/io.c:1264
7  0x000055575c1733b4 in bdrv_aligned_pwritev
    (child=0x55575dff6890, req=0x7f2b32ad9ad0, offset=403487858688, bytes=4538368, align=1, qiov=0x55575dfc15d8, qiov_offset=0, flags=0) at ../block/io.c:2126
8  0x000055575c173c67 in bdrv_co_pwritev_part (child=0x55575dff6890, offset=403487858688, bytes=4538368, qiov=0x55575dfc15d8, qiov_offset=0, flags=0)
    at ../block/io.c:2314
9  0x000055575c17391b in bdrv_co_pwritev (child=0x55575dff6890, offset=403487858688, bytes=4538368, qiov=0x55575dfc15d8, flags=0) at ../block/io.c:2233
10 0x000055575c1ee506 in replication_co_writev (bs=0x55575e9824f0, sector_num=788062224, remaining_sectors=8864, qiov=0x55575dfc15d8, flags=0)
    at ../block/replication.c:270
11 0x000055575c170eed in bdrv_driver_pwritev (bs=0x55575e9824f0, offset=403487858688, bytes=4538368, qiov=0x55575dfc15d8, qiov_offset=0, flags=0)
    at ../block/io.c:1297
12 0x000055575c1733b4 in bdrv_aligned_pwritev
    (child=0x55575dcea690, req=0x7f2b32ad9e00, offset=403487858688, bytes=4538368, align=512, qiov=0x55575dfc15d8, qiov_offset=0, flags=0)
    at ../block/io.c:2126
13 0x000055575c173c67 in bdrv_co_pwritev_part (child=0x55575dcea690, offset=403487858688, bytes=4538368, qiov=0x55575dfc15d8, qiov_offset=0, flags=0)
    at ../block/io.c:2314
14 0x000055575c17391b in bdrv_co_pwritev (child=0x55575dcea690, offset=403487858688, bytes=4538368, qiov=0x55575dfc15d8, flags=0) at ../block/io.c:2233
15 0x000055575c1aeffa in write_quorum_entry (opaque=0x7f2fddaf8c50) at ../block/quorum.c:699
16 0x000055575c2ee4db in coroutine_trampoline (i0=1578543808, i1=21847) at ../util/coroutine-ucontext.c:173
17 0x00007f2ff9855660 in __start_context () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91

When we do failover in COLO mode, QEMU will hang while it is waiting for
the in-flight IO. From the call trace, we can see the IO request coroutine
has yielded in nbd_co_send_request(). When we kill the NBD server, it will never
be wake up. Actually, when we do IO stress test, it will have a lot of
requests in free_sema queue. When the NBD server is killed, current
MAX_NBD_REQUESTS finishes with errors but they wake up at most
MAX_NBD_REQEUSTS from the queue. So, let's move qemu_co_queue_next out
to fix this issue.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Message-Id: <20220309074844.275450-1-lei.rao@intel.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-03-21 15:21:09 -05:00
Philippe Mathieu-Daudé
aa44d3f6b8 block/file-posix: Remove a deprecation warning on macOS 12
When building on macOS 12 we get:

  block/file-posix.c:3335:18: warning: 'IOMasterPort' is deprecated: first deprecated in macOS 12.0 [-Wdeprecated-declarations]
      kernResult = IOMasterPort( MACH_PORT_NULL, &masterPort );
                   ^~~~~~~~~~~~
                   IOMainPort

Replace by IOMainPort, redefining it to IOMasterPort if not available.

Suggested-by: Akihiko Odaki <akihiko.odaki@gmail.com>
Reviewed-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Reviewed by: Cameron Esfahani <dirty@apple.com>
Reviewed-by: Akihiko Odaki <akihiko.odaki@gmail.com>
Tested-by: Akihiko Odaki <akihiko.odaki@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
2022-03-15 13:36:33 +01:00
Peter Maydell
fdee2c9692 nbd patches for 2022-03-07
- Dan Berrange: Allow qemu-nbd to support TLS over Unix sockets
 - Eric Blake: Minor cleanups related to 64-bit block operations
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEccLMIrHEYCkn0vOqp6FrSiUnQ2oFAmImtE8ACgkQp6FrSiUn
 Q2ovmgf/aksDqf2eNcahs++fez+8Qi9ll5OY/qGyjnzBgsatYKjrK+xF7OnjoJox
 eRX026lh81Q4EQK7oZBUnr2UCY4bncDBTI7MTLh603EV/tId5ZLwx007ERhzvtC1
 mIsQHXNuO9X25LQG2eWnfunY9YztQpiT5r/g3khD2yPBqJWIvBfblzPLx6FkF7px
 /WM8xEKCihmGr1Wr3b+zGYL083YkaBWCvHoR8mJt3tEFUj+Qie8XcdV0OVyI0XUj
 5goIFRcpVwBE8P2nLtfUKNzEXz22cmdonOJUX7E5IvGO21k5F/HrWlHdo8JnuSUZ
 t0w5L9yCxBrRpY1burz30b77J0WMCw==
 =C8Dd
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2022-03-07' into staging

nbd patches for 2022-03-07

- Dan Berrange: Allow qemu-nbd to support TLS over Unix sockets
- Eric Blake: Minor cleanups related to 64-bit block operations

# gpg: Signature made Tue 08 Mar 2022 01:41:35 GMT
# gpg:                using RSA key 71C2CC22B1C4602927D2F3AAA7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg:                 aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg:                 aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2  F3AA A7A1 6B4A 2527 436A

* remotes/ericb/tags/pull-nbd-2022-03-07:
  qemu-io: Allow larger write zeroes under no fallback
  qemu-io: Utilize 64-bit status during map
  nbd/server: Minor cleanups
  tests/qemu-iotests: validate NBD TLS with UNIX sockets and PSK
  tests/qemu-iotests: validate NBD TLS with UNIX sockets
  tests/qemu-iotests: validate NBD TLS with hostname mismatch
  tests/qemu-iotests: convert NBD TLS test to use standard filters
  tests/qemu-iotests: introduce filter for qemu-nbd export list
  tests/qemu-iotests: expand _filter_nbd rules
  tests/qemu-iotests: add QEMU_IOTESTS_REGEN=1 to update reference file
  block/nbd: don't restrict TLS usage to IP sockets
  qemu-nbd: add --tls-hostname option for TLS certificate validation
  block/nbd: support override of hostname for TLS certificate validation
  block: pass desired TLS hostname through from block driver client
  crypto: mandate a hostname when checking x509 creds on a client

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2022-03-09 11:38:29 +00:00
Peter Maydell
9740b907a5 target-arm queue:
* cleanups of qemu_oom_check() and qemu_memalign()
  * target/arm/translate-neon: UNDEF if VLD1/VST1 stride bits are non-zero
  * target/arm/translate-neon: Simplify align field check for VLD3
  * GICv3 ITS: add more trace events
  * GICv3 ITS: implement 8-byte accesses properly
  * GICv3: fix minor issues with some trace/log messages
  * ui/cocoa: Use the standard about panel
  * target/arm: Provide cpu property for controling FEAT_LPA2
  * hw/arm/virt: Disable LPA2 for -machine virt-6.2
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE4aXFk81BneKOgxXPPCUl7RQ2DN4FAmImNs4ZHHBldGVyLm1h
 eWRlbGxAbGluYXJvLm9yZwAKCRA8JSXtFDYM3q87D/0cMQeF00uVRNqftrQg2SDI
 txJIG2QYUOPMCDfGWlGTfXv2TUc5y3XwA77C9vTcJcIWJlZ30DUa95DNYqA0BbOH
 TEOzRuZME64wA/JndHadz7oh+xb3HYn+6aSr63LeQCI3/h1eXVHknnEcyF1danOb
 YNB1T308THTEwJHQuKHYksIasgVwcjOf8FvMRYFozVkAKEx1SlabpFXST+aVNyx4
 ASsC2PTiJYAqwnYrTX8lWOYKMiKfkNrQcTd6x7rkoDw1pV7ZDMw2/69tpkhdJ5Fa
 lwxhwZ3+40x49eFGAhfuZWZmGLd4c+76u64pmWW429uk1JhaoXgErJM3xfHbI1er
 d7XSQYkMhDrY5SFuoE5XYwOuxanPtn3f7luM236Uzgf4ZR6qTrf6x+R1xLPZVYa9
 fWbjvR3g5sltTOzyc+9UsBq1OPCbRUbmhJtJDvojj5sWmNvgOwZnSkTu5kMAqvFP
 T2cQIi6phRBo3oMN/fhEZi3g828JjYEA9QlpWZ74JOyiXjYUq9VVNpoe/dtAv4Yy
 wZ+XhVNIK82/4Mxjr9SEeYeNzYrsEEvFAUqe9Bil2CpuIMV5ONEzs+UfQ/gyk4eq
 QnGPiojCrpf6PPAfci0Y6b4RzO+loMFpLjCpurngB4g4cBdmThKip0sVZdTZAI9Y
 lnusB8MR1sESoqYdPZsAfQ==
 =ix0J
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20220307' into staging

target-arm queue:
 * cleanups of qemu_oom_check() and qemu_memalign()
 * target/arm/translate-neon: UNDEF if VLD1/VST1 stride bits are non-zero
 * target/arm/translate-neon: Simplify align field check for VLD3
 * GICv3 ITS: add more trace events
 * GICv3 ITS: implement 8-byte accesses properly
 * GICv3: fix minor issues with some trace/log messages
 * ui/cocoa: Use the standard about panel
 * target/arm: Provide cpu property for controling FEAT_LPA2
 * hw/arm/virt: Disable LPA2 for -machine virt-6.2

# gpg: Signature made Mon 07 Mar 2022 16:46:06 GMT
# gpg:                using RSA key E1A5C593CD419DE28E8315CF3C2525ED14360CDE
# gpg:                issuer "peter.maydell@linaro.org"
# gpg: Good signature from "Peter Maydell <peter.maydell@linaro.org>" [ultimate]
# gpg:                 aka "Peter Maydell <pmaydell@gmail.com>" [ultimate]
# gpg:                 aka "Peter Maydell <pmaydell@chiark.greenend.org.uk>" [ultimate]
# Primary key fingerprint: E1A5 C593 CD41 9DE2 8E83  15CF 3C25 25ED 1436 0CDE

* remotes/pmaydell/tags/pull-target-arm-20220307:
  hw/arm/virt: Disable LPA2 for -machine virt-6.2
  target/arm: Provide cpu property for controling FEAT_LPA2
  ui/cocoa: Use the standard about panel
  hw/intc/arm_gicv3_cpuif: Fix register names in ICV_HPPIR read trace event
  hw/intc/arm_gicv3: Fix missing spaces in error log messages
  hw/intc/arm_gicv3: Specify valid and impl in MemoryRegionOps
  hw/intc/arm_gicv3_its: Add trace events for table reads and writes
  hw/intc/arm_gicv3_its: Add trace events for commands
  target/arm/translate-neon: Simplify align field check for VLD3
  target/arm/translate-neon: UNDEF if VLD1/VST1 stride bits are non-zero
  osdep: Move memalign-related functions to their own header
  util: Put qemu_vfree() in memalign.c
  util: Use meson checks for valloc() and memalign() presence
  util: Share qemu_try_memalign() implementation between POSIX and Windows
  meson.build: Don't misdetect posix_memalign() on Windows
  util: Return valid allocation for qemu_try_memalign() with zero size
  util: Unify implementations of qemu_memalign()
  util: Make qemu_oom_check() a static function

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2022-03-08 15:26:10 +00:00
Daniel P. Berrangé
e8ae8b1a75 block/nbd: don't restrict TLS usage to IP sockets
The TLS usage for NBD was restricted to IP sockets because validating
x509 certificates requires knowledge of the hostname that the client
is connecting to.

TLS does not have to use x509 certificates though, as PSK (pre-shared
keys) provide an alternative credential option. These have no
requirement for a hostname and can thus be trivially used for UNIX
sockets.

Furthermore, with the ability to overide the default hostname for
TLS validation in the previous patch, it is now also valid to want
to use x509 certificates with FD passing and UNIX sockets.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20220304193610.3293146-6-berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-03-07 15:58:42 -06:00
Daniel P. Berrangé
a0cd6d2972 block/nbd: support override of hostname for TLS certificate validation
When connecting to an NBD server with TLS and x509 credentials,
the client must validate the hostname it uses for the connection,
against that published in the server's certificate. If the client
is tunnelling its connection over some other channel, however, the
hostname it uses may not match the info reported in the server's
certificate. In such a case, the user needs to explicitly set an
override for the hostname to use for certificate validation.

This is achieved by adding a 'tls-hostname' property to the NBD
block driver.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20220304193610.3293146-4-berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-03-07 15:58:42 -06:00
Daniel P. Berrangé
046f98d075 block: pass desired TLS hostname through from block driver client
In

  commit a71d597b98
  Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Date:   Thu Jun 10 13:08:00 2021 +0300

    block/nbd: reuse nbd_co_do_establish_connection() in nbd_open()

the use of the 'hostname' field from the BDRVNBDState struct was
lost, and 'nbd_connect' just hardcoded it to match the IP socket
address. This was a harmless bug at the time since we block use
with anything other than IP sockets.

Shortly though, we want to allow the caller to override the hostname
used in the TLS certificate checks. This is to allow for TLS
when doing port forwarding or tunneling. Thus we need to reinstate
the passing along of the 'hostname'.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20220304193610.3293146-3-berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-03-07 15:58:42 -06:00
Peter Maydell
5df022cf2e osdep: Move memalign-related functions to their own header
Move the various memalign-related functions out of osdep.h and into
their own header, which we include only where they are used.
While we're doing this, add some brief documentation comments.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-id: 20220226180723.1706285-10-peter.maydell@linaro.org
2022-03-07 13:16:49 +00:00
Vladimir Sementsov-Ogievskiy
af5bcd775f block: copy-before-write: realize snapshot-access API
Current scheme of image fleecing looks like this:

[guest]                    [NBD export]
  |                              |
  |root                          | root
  v                              v
[copy-before-write] -----> [temp.qcow2]
  |                 target  |
  |file                     |backing
  v                         |
[active disk] <-------------+

 - On guest writes copy-before-write filter copies old data from active
   disk to temp.qcow2. So fleecing client (NBD export) when reads
   changed regions from temp.qcow2 image and unchanged from active disk
   through backing link.

This patch makes possible new image fleecing scheme:

[guest]                   [NBD export]
   |                            |
   | root                       | root
   v                 file       v
[copy-before-write]<------[snapshot-access]
   |           |
   | file      | target
   v           v
[active-disk] [temp.img]

 - copy-before-write does CBW operations and also provides
   snapshot-access API. The API may be accessed through
   snapshot-access driver.

Benefits of new scheme:

1. Access control: if remote client try to read data that not covered
   by original dirty bitmap used on copy-before-write open, client gets
   -EACCES.

2. Discard support: if remote client do DISCARD, this additionally to
   discarding data in temp.img informs block-copy process to not copy
   these clusters. Next read from discarded area will return -EACCES.
   This is significant thing: when fleecing user reads data that was
   not yet copied to temp.img, we can avoid copying it on further guest
   write.

3. Synchronisation between client reads and block-copy write is more
   efficient. In old scheme we just rely on BDRV_REQ_SERIALISING flag
   used for writes to temp.qcow2. New scheme is less blocking:
     - fleecing reads are never blocked: if data region is untouched or
       in-flight, we just read from active-disk, otherwise we read from
       temp.img
     - writes to temp.img are not blocked by fleecing reads
     - still, guest writes of-course are blocked by in-flight fleecing
       reads, that currently read from active-disk - it's the minimum
       necessary blocking

4. Temporary image may be of any format, as we don't rely on backing
   feature.

5. Permission relation are simplified. With old scheme we have to share
   write permission on target child of copy-before-write, otherwise
   backing link conflicts with copy-before-write file child write
   permissions. With new scheme we don't have backing link, and
   copy-before-write node may have unshared access to temporary node.
   (Not realized in this commit, will be in future).

6. Having control on fleecing reads we'll be able to implement
   alternative behavior on failed copy-before-write operations.
   Currently we just break guest request (that's a historical behavior
   of backup). But in some scenarios it's a bad behavior: better
   is to drop the backup as failed but don't break guest request.
   With new scheme we can simply unset some bits in a bitmap on CBW
   failure and further fleecing reads will -EACCES, or something like
   this. (Not implemented in this commit, will be in future)
   Additional application for this is implementing timeout for CBW
   operations.

Iotest 257 output is updated, as two more bitmaps now live in
copy-before-write filter.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20220303194349.2304213-13-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:31 +01:00
Vladimir Sementsov-Ogievskiy
1c14eaabdb block: introduce snapshot-access block driver
The new block driver simply utilizes snapshot-access API of underlying
block node.

In further patches we want to use it like this:

[guest]                   [NBD export]
   |                            |
   | root                       | root
   v                 file       v
[copy-before-write]<------[snapshot-access]
   |           |
   | file      | target
   v           v
[active-disk] [temp.img]

This way, NBD client will be able to read snapshotted state of active
disk, when active disk is continued to be written by guest. This is
known as "fleecing", and currently uses another scheme based on qcow2
temporary image which backing file is active-disk. New scheme comes
with benefits - see next commit.

The other possible application is exporting internal snapshots of
qcow2, like this:

[guest]          [NBD export]
   |                  |
   | root             | root
   v       file       v
[qcow2]<---------[snapshot-access]

For this, we'll need to implement snapshot-access API handlers in
qcow2 driver, and improve snapshot-access block driver (and API) to
make it possible to select snapshot by name. Another thing to improve
is size of snapshot. Now for simplicity we just use size of bs->file,
which is OK for backup, but for qcow2 snapshots export we'll need to
imporve snapshot-access API to get size of snapshot.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20220303194349.2304213-12-vsementsov@virtuozzo.com>
[hreitz: Rebased on block GS/IO split]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:31 +01:00
Vladimir Sementsov-Ogievskiy
ce14f3b407 block/io: introduce block driver snapshot-access API
Add new block driver handlers and corresponding generic wrappers.
It will be used to allow copy-before-write filter to provide
reach fleecing interface in further commit.

In future this approach may be used to allow reading qcow2 internal
snapshots, for example to export them through NBD.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-11-vsementsov@virtuozzo.com>
[hreitz: Rebased on block GS/IO split]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:31 +01:00
Vladimir Sementsov-Ogievskiy
3b7ca26bdf block/reqlist: add reqlist_wait_all()
Add function to wait for all intersecting requests.
To be used in the further commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-10-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy
a6426475a7 block/dirty-bitmap: introduce bdrv_dirty_bitmap_status()
Add a convenient function similar with bdrv_block_status() to get
status of dirty bitmap.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-9-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy
84b1e80f67 block/reqlist: reqlist_find_conflict(): use ranges_overlap()
Let's reuse convenient helper.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-8-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy
d088e6a48a block: intoduce reqlist
Split intersecting-requests functionality out of block-copy to be
reused in copy-before-write filter.

Note: while being here, fix tiny typo in MAINTAINERS.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-7-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy
177541e671 block/block-copy: add block_copy_reset()
Split block_copy_reset() out of block_copy_reset_unallocated() to be
used separately later.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-6-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy
5f3a3cd7f0 block/copy-before-write: add bitmap open parameter
This brings "incremental" mode to copy-before-write filter: user can
specify bitmap so that filter will copy only "dirty" areas.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20220303194349.2304213-5-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy
1f7252e868 block/block-copy: block_copy_state_new(): add bitmap parameter
This will be used in the following commit to bring "incremental" mode
to copy-before-write filter.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-4-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy
34ffacb7f4 block/dirty-bitmap: bdrv_merge_dirty_bitmap(): add return value
That simplifies handling failure in existing code and in further new
usage of bdrv_merge_dirty_bitmap().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-3-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy
051f66caa2 block/block-copy: move copy_bitmap initialization to block_copy_state_new()
We are going to complicate bitmap initialization in the further
commit. And in future, backup job will be able to work without filter
(when source is immutable), so we'll need same bitmap initialization in
copy-before-write filter and in backup job. So, it's reasonable to do
it in block-copy.

Note that for now cbw_open() is the only caller of
block_copy_state_new().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-2-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy
45e62b464a block: fix preallocate filter: don't do unaligned preallocate requests
There is a bug in handling BDRV_REQ_NO_WAIT flag: we still may wait in
wait_serialising_requests() if request is unaligned. And this is
possible for the only user of this flag (preallocate filter) if
underlying file is unaligned to its request_alignment on start.

So, we have to fix preallocate filter to do only aligned preallocate
requests.

Next, we should fix generic block/io.c somehow. Keeping in mind that
preallocate is the only user of BDRV_REQ_NO_WAIT and that we have to
fix its behavior now, it seems more safe to just assert that we never
use BDRV_REQ_NO_WAIT with unaligned requests and add corresponding
comment. Let's do so.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Message-Id: <20220215121609.38570-1-vsementsov@virtuozzo.com>
[hreitz: Rebased on block GS/IO split]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:19:20 +01:00
Peter Maydell
b0ea6c98fa block/curl.c: Check error return from curl_easy_setopt()
Coverity points out that we aren't checking the return value
from curl_easy_setopt() for any of the calls to it we make
in block/curl.c.

Some of these options are documented as always succeeding (e.g.
CURLOPT_VERBOSE) but others have documented failure cases (e.g.
CURLOPT_URL).  For consistency we check every call, even the ones
that theoretically cannot fail.

Fixes: Coverity CID 1459336, 1459482, 1460331
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220222152341.850419-3-peter.maydell@linaro.org>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:19:20 +01:00
Peter Maydell
2ea7dfcd05 block/curl.c: Set error message string if curl_init_state() fails
In curl_open(), the 'out' label assumes that the state->errmsg string
has been set (either by curl_easy_perform() or by manually copying a
string into it); however if curl_init_state() fails we will jump to
that label without setting the string.  Add the missing error string
setup.

(We can't be specific about the cause of failure: the documentation
of curl_easy_init() just says "If this function returns NULL,
something went wrong".)

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220222152341.850419-2-peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:19:20 +01:00
Hanna Reitz
78fa41fc67 block/amend: Keep strong reference to BDS
Otherwise, the BDS might be freed while the job is running, which would
cause a use-after-free.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220304153729.711387-5-hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:26 +01:00
Hanna Reitz
b8ba60067b block/amend: Always call .bdrv_amend_clean()
.bdrv_amend_clean() says block drivers can use it to clean up what was
done in .bdrv_amend_pre_run().  Therefore, it should always be called
after .bdrv_amend_pre_run(), which means we need it to call it in the
JobDriver.free() callback, not in JobDriver.clean().

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220304153729.711387-3-hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:26 +01:00
Hanna Reitz
4d378bbd83 block: Make bdrv_refresh_limits() non-recursive
bdrv_refresh_limits() recurses down to the node's children.  That does
not seem necessary: When we refresh limits on some node, and then
recurse down and were to change one of its children's BlockLimits, then
that would mean we noticed the changed limits by pure chance.  The fact
that we refresh the parent's limits has nothing to do with it, so the
reason for the change probably happened before this point in time, and
we should have refreshed the limits then.

Consequently, we should actually propagate block limits changes upwards,
not downwards.  That is a separate and pre-existing issue, though, and
so will not be addressed in this patch.

The problem with recursing is that bdrv_refresh_limits() is not atomic.
It begins with zeroing BDS.bl, and only then sets proper, valid limits.
If we do not drain all nodes whose limits are refreshed, then concurrent
I/O requests can encounter invalid request_alignment values and crash
qemu.  Therefore, a recursing bdrv_refresh_limits() requires the whole
subtree to be drained, which is currently not ensured by most callers.

A non-recursive bdrv_refresh_limits() only requires the node in question
to not receive I/O requests, and this is done by most callers in some
way or another:
- bdrv_open_driver() deals with a new node with no parents yet
- bdrv_set_file_or_backing_noperm() acts on a drained node
- bdrv_reopen_commit() acts only on drained nodes
- bdrv_append() should in theory require the node to be drained; in
  practice most callers just lock the AioContext, which should at least
  be enough to prevent concurrent I/O requests from accessing invalid
  limits

So we can resolve the bug by making bdrv_refresh_limits() non-recursive.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1879437
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20220216105355.30729-2-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:26 +01:00
Emanuele Giuseppe Esposito
da359909bd block_int-common.h: assertions in the callers of BlockDriver function pointers
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-27-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
1581a70ddd block/coroutines: I/O and "I/O or GS" API
block coroutines functions run in different aiocontext, and are
not protected by the BQL. Therefore are I/O.

On the other side, generated_co_wrapper functions use BDRV_POLL_WHILE,
meaning the caller can either be the main loop or a specific iothread.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-25-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
377cc15bf1 block/copy-before-write.h: global state API + assertions
copy-before-write functions always run under BQL.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-24-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
6b573efec8 include/block/snapshot: global state API + assertions
Snapshots run also under the BQL, so they all are
in the global state API. The aiocontext lock that they hold
is currently an overkill and in future could be removed.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-23-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
c5be7445b7 assertions for blockdev.h global state API
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-22-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
bdb734763b block.c: add assertions to static functions
Following the assertion derived from the API split,
propagate the assertion also in the static functions.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-18-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
967d7905d1 IO_CODE and IO_OR_GS_CODE for block_int I/O API
Mark all I/O functions with IO_CODE, and all "I/O OR GS" with
IO_OR_GS_CODE.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-14-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
b4ad82aab1 assertions for block_int global state API
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-13-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
37868b2ac6 IO_CODE and IO_OR_GS_CODE for block-backend I/O API
Mark all I/O functions with IO_CODE, and all "I/O OR GS" with
IO_OR_GS_CODE.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-10-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
0439c5a462 block/block-backend.c: assertions for block-backend
All the global state (GS) API functions will check that
qemu_in_main_thread() returns true. If not, it means
that the safety of BQL cannot be guaranteed, and
they need to be moved to I/O.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-9-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
a2c4c3b19b include/sysemu/block-backend: split header into I/O and global state (GS) API
Similarly to the previous patches, split block-backend.h
in block-backend-io.h and block-backend-global-state.h

In addition, remove "block/block.h" include as it seems
it is not necessary anymore, together with "qemu/iov.h"

block-backend-common.h contains the structures shared between
the two headers, and the functions that can't be categorized as
I/O or global state.

Assertions are added in the next patch.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-8-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
8cc5882c7f block/export/fuse.c: allow writable exports to take RESIZE permission
Allow writable exports to get BLK_PERM_RESIZE permission
from creation, in fuse_export_create().
In this way, there is no need to give the permission in
fuse_do_truncate(), which might be run in an iothread.

Permissions should be set only in the main thread, so
in any case if an iothread tries to set RESIZE, it will
be blocked.

Also assert in fuse_do_truncate that if we give the
RESIZE permission we can then restore the original ones.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303151616.325444-7-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
384a48fb74 IO_CODE and IO_OR_GS_CODE for block I/O API
Mark all I/O functions with IO_CODE, and all "I/O OR GS" with
IO_OR_GS_CODE.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-6-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
f791bf7f93 assertions for block global state API
All the global state (GS) API functions will check that
qemu_in_main_thread() returns true. If not, it means
that the safety of BQL cannot be guaranteed, and
they need to be moved to I/O.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-5-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
3b491a9056 include/block/block: split header into I/O and global state API
block.h currently contains a mix of functions:
some of them run under the BQL and modify the block layer graph,
others are instead thread-safe and perform I/O in iothreads.
Some others can only be called by either the main loop or the
iothread running the AioContext (and not other iothreads),
and using them in another thread would cause deadlocks, and therefore
it is not ideal to define them as I/O.

It is not easy to understand which function is part of which
group (I/O vs GS vs "I/O or GS"), and this patch aims to clarify it.

The "GS" functions need the BQL, and often use
aio_context_acquire/release and/or drain to be sure they
can modify the graph safely.
The I/O function are instead thread safe, and can run in
any AioContext.
"I/O or GS" functions run instead in the main loop or in
a single iothread, and use BDRV_POLL_WHILE().

By splitting the header in two files, block-io.h
and block-global-state.h we have a clearer view on what
needs what kind of protection. block-common.h
contains common structures shared by both headers.

block.h is left there for legacy and to avoid changing
all includes in all c files that use the block APIs.

Assertions are added in the next patch.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-4-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito
3b71719462 block: rename bdrv_invalidate_cache_all, blk_invalidate_cache and test_sync_op_invalidate_cache
Following the bdrv_activate renaming, change also the name
of the respective callers.

bdrv_invalidate_cache_all -> bdrv_activate_all
blk_invalidate_cache -> blk_activate
test_sync_op_invalidate_cache -> test_sync_op_activate

No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220209105452.1694545-5-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:14:40 +01:00
Emanuele Giuseppe Esposito
a94750d956 block: introduce bdrv_activate
This function is currently just a wrapper for bdrv_invalidate_cache(),
but in future will contain the code of bdrv_co_invalidate_cache() that
has to always be protected by BQL, and leave the rest in the I/O
coroutine.

Replace all bdrv_invalidate_cache() invokations with bdrv_activate().

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220209105452.1694545-4-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:14:40 +01:00
Emanuele Giuseppe Esposito
dae84929e4 crypto: distinguish between main loop and I/O in block_crypto_amend_options_generic_luks
block_crypto_amend_options_generic_luks uses the block layer
permission API, therefore it should be called with the BQL held.

However, the same function is being called by two BlockDriver
callbacks: bdrv_amend_options (under BQL) and bdrv_co_amend (I/O).

The latter is I/O because it is invoked by block/amend.c's
blockdev_amend_run(), a .run callback of the amend JobDriver.

Therefore we want to change this function to still perform
the permission check, but making sure it is done under BQL regardless
of the caller context.

Remove the permission check in block_crypto_amend_options_generic_luks()
and:
- in block_crypto_amend_options_luks() (BQL case, called by
  .bdrv_amend_options()), reuse helper functions
  block_crypto_amend_{prepare/cleanup} that take care of checking
  permissions.

- for block_crypto_co_amend_luks() (I/O case, called by
  .bdrv_co_amend()), don't check for permissions but delegate
  .bdrv_amend_pre_run() and .bdrv_amend_clean() to do it,
  performing these checks before and after the job runs in its aiocontext.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220209105452.1694545-3-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:14:40 +01:00
Emanuele Giuseppe Esposito
c1019d1687 crypto: perform permission checks under BQL
Move the permission API calls into driver-specific callbacks
that always run under BQL. In this case, bdrv_crypto_luks
needs to perform permission checks before and after
qcrypto_block_amend_options(). The problem is that the caller,
block_crypto_amend_options_generic_luks(), can also run in I/O
from .bdrv_co_amend(). This does not comply with Global State-I/O API split,
as permissions API must always run under BQL.

Firstly, introduce .bdrv_amend_pre_run() and .bdrv_amend_clean()
callbacks. These two callbacks are guaranteed to be invoked under
BQL, respectively before and after .bdrv_co_amend().
They take care of performing the permission checks
in the same way as they are currently done before and after
qcrypto_block_amend_options().
These callbacks are in preparation for next patch, where we
delete the original permission check. Right now they just add redundant
control.

Then, call .bdrv_amend_pre_run() before job_start in
qmp_x_blockdev_amend(), so that it will be run before the job coroutine
is created and stay in the main loop.
As a cleanup, use JobDriver's .clean() callback to call
.bdrv_amend_clean(), and run amend-specific cleanup callbacks under BQL.

After this patch, permission failures occur early in the blockdev-amend
job to update a LUKS volume's keys.  iotest 296 must now expect them in
x-blockdev-amend's QMP reply instead of waiting for the actual job to
fail later.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220209105452.1694545-2-eesposit@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220304153729.711387-6-hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:14:39 +01:00
Peter Maydell
4aa2e497a9 This misc series of changes:
- Improves documentation of SSH fingerprint checking
  - Fixes SHA256 fingerprints with non-blockdev usage
  - Blocks the clone3, setns, unshare & execveat syscalls
    with seccomp
  - Blocks process spawning via clone syscall, but allows
    threads, with seccomp
  - Takes over seccomp maintainer role
  - Expands firmware descriptor spec to allow flash
    without NVRAM
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE2vOm/bJrYpEtDo4/vobrtBUQT98FAmIOOBkACgkQvobrtBUQ
 T9/ruhAAr8jkAH8FN5ftx2/L7q8SHpjPupue1CJ0Nl/ykmYhTGc+SqC3R2nZWOk2
 Ws8hHVcDVT1lhrGxPtU7o+JPC1TebJTsloimJoKQY3qfdvZadJeR/4KsOUzi2ruu
 VZ6HiYvZc1c9T+NPf3QRhBo7yyascKWKWHDseUNIt/2DiefCox4QFUDDMG86HiQF
 KK30xWTvwJdcPxRlbfZbWRoqA0v4OoSDK6Ftp94FQSNBkExO85kstDq3xVaApf8H
 DE1QD7gf+dvz11wVuFhrf4d1EH032nU0p0kMxhABc4/kZXo5iWXohhzML3/MUEVT
 pe5/9pzUdWpfXQd/2r7x2PyPgySAG7lGbkgltowY52qnRPaNw9ukwkFCFAj8wiD8
 FT2ghvkYD3zLfnZ3nuuzJVjf3pXgCc5VcfXaoffT72a7gpI1LTuEqPFwo04imV4l
 21fYFx26mYTGCLH1CwVw8MQ2z/dg6uorT/NHdmRA/KrYJ1Elay2K7DV3Z5jOM5MI
 0Ll5HkfsUut+1rioUjNgmlQ+96k/G0P0hVUoTUIcgl3U/GDx2+ypcrNTfmEcaCLV
 bOhsjtrcg/KAXsCSbvnfDe3bWf0txnscyqoilEzDahLvciWG3d6qlhczLy29LGb4
 /w7iqnUcSygXc+a9/ckVo1h5fo0i9qb3W8Pw9klapvz6SGJ83g4=
 =PeCY
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/berrange-gitlab/tags/misc-next-pull-request' into staging

This misc series of changes:

 - Improves documentation of SSH fingerprint checking
 - Fixes SHA256 fingerprints with non-blockdev usage
 - Blocks the clone3, setns, unshare & execveat syscalls
   with seccomp
 - Blocks process spawning via clone syscall, but allows
   threads, with seccomp
 - Takes over seccomp maintainer role
 - Expands firmware descriptor spec to allow flash
   without NVRAM

# gpg: Signature made Thu 17 Feb 2022 11:57:13 GMT
# gpg:                using RSA key DAF3A6FDB26B62912D0E8E3FBE86EBB415104FDF
# gpg: Good signature from "Daniel P. Berrange <dan@berrange.com>" [full]
# gpg:                 aka "Daniel P. Berrange <berrange@redhat.com>" [full]
# Primary key fingerprint: DAF3 A6FD B26B 6291 2D0E  8E3F BE86 EBB4 1510 4FDF

* remotes/berrange-gitlab/tags/misc-next-pull-request:
  docs: expand firmware descriptor to allow flash without NVRAM
  MAINTAINERS: take over seccomp from Eduardo Otubo
  seccomp: block setns, unshare and execveat syscalls
  seccomp: block use of clone3 syscall
  seccomp: fix blocking of process spawning
  seccomp: add unit test for seccomp filtering
  seccomp: allow action to be customized per syscall
  block: print the server key type and fingerprint on failure
  block: support sha256 fingerprint with pre-blockdev options
  block: better document SSH host key fingerprint checking

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2022-02-23 09:25:05 +00:00
Paolo Bonzini
406523f6b3 configure, meson: move block layer options to meson_options.txt
Unlike image formats, these also require an entry in config-host.h.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-21 10:35:53 +01:00
Paolo Bonzini
ed793c2c45 configure, meson: move image format options to meson_options.txt
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-21 10:35:53 +01:00
Daniel P. Berrangé
e3296cc796 block: print the server key type and fingerprint on failure
When validating the server key fingerprint fails, it is difficult for
the user to know what they got wrong. The fingerprint accepted by QEMU
is received in a different format than OpenSSH displays. There can also
be keys for multiple different ciphers in known_hosts. It may not be
obvious which cipher QEMU will use and whether it will be the same
as OpenSSH. Address this by printing the server key type and its
corresponding fingerprint in the format QEMU accepts.

Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2022-02-16 14:34:16 +00:00
Daniel P. Berrangé
ea0f60e6f2 block: support sha256 fingerprint with pre-blockdev options
When support for sha256 fingerprint checking was aded in

  commit bf783261f0
  Author: Daniel P. Berrangé <berrange@redhat.com>
  Date:   Tue Jun 22 12:51:56 2021 +0100

    block/ssh: add support for sha256 host key fingerprints

it was only made to work with -blockdev. Getting it working with
-drive requires some extra custom parsing.

Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2022-02-16 14:34:15 +00:00
Peter Maydell
48033ad678 nbd: handle AioContext change correctly
v2: add my s-o-b marks to each commit
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEi5wmzbL9FHyIDoahVh8kwfGfefsFAmIGYU8ACgkQVh8kwfGf
 efslpQ/+OAEgA8z2frIOzoa5PKmyWDaE4xOAsr595EzhSnDeLi1QVIqSSXFap/m6
 1rzqglD+5udM9LT9FZBHeMDbWr1UqvODZTAah6/T8E6vVzXm9Jn+CivEj4kT4Dd2
 2Mrg3ydjkZsVivs1el4eG7QZ88TvSZkoNAIlT9lVYuT/TaRBKohrms1B4h25wqn9
 u78tnbdAdEGtlHUFy9qyOXPt0eGRt6a1CTo+u0wwJLxKjLpq8cht0DPA4/CQhZay
 Jm7cudJAJyTFxT7EUxpfROj4KuSMsUDmF/pNE8qwQuSnZC9G9u67ImhWpMFmwEZl
 yCAJbOTudWYefgZ3+BtKMH5SBVLA6AsQ60luWkXkqK1dUB4cWxt5lG/UAm1OMJsn
 FL6w3WJFkRTIC1OtHqqJnBtDFHWhTxNX9jjcAdb6lpjz4/wKywUVs+l5aSbxswog
 DMrE1XT+JIab1IqAnSGw0QTY0H9COiQyOlsJ8EjtFFsqwLACSX4qBowesGfoA7IF
 N0ZorAs6QuYYzEaSbRG1vB1lAusSCF5s/UnTg1mDLFWSqJOsqZN7jxXyQbbiFhsN
 f+OTlyIwv4t32eGLRrKHF0vaWUoKQ3skB8m9UFMdQbGBO/LakZL9L42ITxarzUMI
 CVeGLAzs3KvbZUlUyJWl97F2KkdrornmM/9CB57sjw13K94oNYo=
 =+lgn
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/vsementsov/tags/pull-nbd-2022-02-09-v2' into staging

nbd: handle AioContext change correctly

v2: add my s-o-b marks to each commit

# gpg: Signature made Fri 11 Feb 2022 13:14:55 GMT
# gpg:                using RSA key 8B9C26CDB2FD147C880E86A1561F24C1F19F79FB
# gpg: Good signature from "Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 8B9C 26CD B2FD 147C 880E  86A1 561F 24C1 F19F 79FB

* remotes/vsementsov/tags/pull-nbd-2022-02-09-v2:
  iotests/281: Let NBD connection yield in iothread
  block/nbd: Move s->ioc on AioContext change
  iotests/281: Test lingering timers
  iotests.py: Add QemuStorageDaemon class
  block/nbd: Assert there are no timers when closed
  block/nbd: Delete open timer when done
  block/nbd: Delete reconnect delay timer when done

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2022-02-12 22:04:07 +00:00
Hanna Reitz
e15f3a66c8 block/nbd: Move s->ioc on AioContext change
s->ioc must always be attached to the NBD node's AioContext.  If that
context changes, s->ioc must be attached to the new context.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2033626
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2022-02-11 14:06:15 +01:00
Hanna Reitz
8a39c381e5 block/nbd: Assert there are no timers when closed
Our two timers must not remain armed beyond nbd_clear_bdrvstate(), or
they will access freed data when they fire.

This patch is separate from the patches that actually fix the issue
(HEAD^^ and HEAD^) so that you can run the associated regression iotest
(281) on a configuration that reproducibly exposes the bug.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2022-02-11 14:06:02 +01:00
Hanna Reitz
717be9644b block/nbd: Delete open timer when done
We start the open timer to cancel the connection attempt after a while.
Once nbd_do_establish_connection() has returned, the attempt is over,
and we no longer need the timer.

Delete it before returning from nbd_open(), so that it does not persist
for longer.  It has no use after nbd_open(), and just like the reconnect
delay timer, it might well be dangerous if it were to fire afterwards.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2022-02-11 14:05:59 +01:00
Hanna Reitz
3ce1fc16ba block/nbd: Delete reconnect delay timer when done
We start the reconnect delay timer to cancel the reconnection attempt
after a while.  Once nbd_co_do_establish_connection() has returned, this
attempt is over, and we no longer need the timer.

Delete it before returning from nbd_reconnect_attempt(), so that it does
not persist beyond the I/O request that was paused for reconnecting; we
do not want it to fire in a drained section, because all sort of things
can happen in such a section (e.g. the AioContext might be changed, and
we do not want the timer to fire in the wrong context; or the BDS might
even be deleted, and so the timer CB would access already-freed data).

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2022-02-11 14:05:36 +01:00
Michael Tokarev
9e8be4c546 drop libxml2 checks since libxml is not actually used (for parallels)
For a long time, we assumed that libxml2 is necessary for parallels
block format support (block/parallels*). However, this format actually
does not use libxml [*]. Since this is the only user of libxml2 in
whole QEMU tree, we can drop all libxml2 checks and dependencies too.

It is even more: --enable-parallels configure option was the only
option which was silently ignored when it's (fake) dependency
(libxml2) isn't installed.

Drop all mentions of libxml2.

[*] Actually the basis for libxml use were introduced in commit
    ed279a06c5 ("configure: add dependency") but the implementation
    was never merged:
    https://lore.kernel.org/qemu-devel/70227bbd-a517-70e9-714f-e6e0ec431be9@openvz.org/

Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220119090423.149315-1-mjt@msgid.tls.msk.ru>
Tested-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
[PMD: Updated description and adapted to use lcitool]
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20220121154134.315047-5-f4bug@amsat.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20220204204335.1689602-9-alex.bennee@linaro.org>
2022-02-09 12:08:42 +00:00
Peter Maydell
47cc1a3655 Block layer patches
- rbd: fix handling of holes in .bdrv_co_block_status
 - Fix potential crash in bdrv_set_backing_hd()
 - vhost-user-blk export: Fix shutdown with requests in flight
 - FUSE export: Fix build failure on FreeBSD
 - Documentation improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmH5TlARHGt3b2xmQHJl
 ZGhhdC5jb20ACgkQfwmycsiPL9biGQ/9GLOXFaFVDdrAOSievKc1xGy3tirX21Wn
 xSQgRUFHcjbMu/r/I5hA5imCNWq8KmT5S+aMUO76RAsRDH94QZdTMlq/1bmPBgkY
 Pu4aKhmP0WzPOmqnjhq19rpk44J75lCtAwc+r+VLzGZUali/wOcIkEQPID3RgSlQ
 628dylVwFF57cQzdvUPph7+iaewJ3OUlk3plYUkyLB/1lRuBTZD6E0bcUeN4eo/K
 YvKMpiRMLyFJwX9d50YRhFw8zwM4cXLUynRzdDSZuUoGeaih59p2GJzkbvrXbBer
 edtEjwvf5PAVLXmHwWI+zz/aC4KYIE+sppB2YCOHhcORcAmKbCpP5Ky7W2jJQ6rJ
 UvbVwjHxVUB3JN59MYsVbhH5l7i/HrT13TZ2VR2HAn4kswk8s3DNGVF0I+DnGD1g
 gHBlxtAeORvM/+7E6hxX4cFY8ZNsji5DGBpbEtfXtGizP0LkF1YJhH7lB2ZSml50
 PJqqxTCTS8MevxWHuSdp+gV7stQoQHIuaNu9jKXrzqQWh+ezuJp1AhcRRWguxoOp
 n+SZpDybQBCXN0EfWlVECmdri8WJsmdBSD/K5qJ0ehN2bF4d6No0c5aCKJAKzfgp
 ygQ+rKPzGplp6cP16Pluu/tCiu1HDar8NajxErX8qqopBVnZmMZNtqi0GjktmzdB
 OhYOyI3m0G0=
 =Eyza
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kwolf-gitlab/tags/for-upstream' into staging

Block layer patches

- rbd: fix handling of holes in .bdrv_co_block_status
- Fix potential crash in bdrv_set_backing_hd()
- vhost-user-blk export: Fix shutdown with requests in flight
- FUSE export: Fix build failure on FreeBSD
- Documentation improvements

# gpg: Signature made Tue 01 Feb 2022 15:14:24 GMT
# gpg:                using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg:                issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kwolf-gitlab/tags/for-upstream:
  block/rbd: workaround for ceph issue #53784
  block/rbd: fix handling of holes in .bdrv_co_block_status
  qemu-img: Unify [-b [-F]] documentation
  qsd: Document fuse's allow-other option
  block.h: remove outdated comment
  block/export/fuse: Fix build failure on FreeBSD
  block/export/fuse: Rearrange if-else-if ladder in fuse_fallocate()
  block/export: Fix vhost-user-blk shutdown with requests in flight
  block: bdrv_set_backing_hd(): use drained section
  qemu-storage-daemon: Fix typo in vhost-user-blk help

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2022-02-01 19:48:15 +00:00
Peter Lieven
fc176116cd block/rbd: workaround for ceph issue #53784
librbd had a bug until early 2022 that affected all versions of ceph that
supported fast-diff. This bug results in reporting of incorrect offsets
if the offset parameter to rbd_diff_iterate2 is not object aligned.

This patch works around this bug for pre Quincy versions of librbd.

Fixes: 0347a8fd4c
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <20220113144426.4036493-3-pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-02-01 15:16:32 +01:00
Peter Lieven
9e302f64bb block/rbd: fix handling of holes in .bdrv_co_block_status
the assumption that we can't hit a hole if we do not diff against a snapshot was wrong.

We can see a hole in an image if we diff against base if there exists an older snapshot
of the image and we have discarded blocks in the image where the snapshot has data.

Fix this by simply handling a hole like an unallocated area. There are no callbacks
for unallocated areas so just bail out if we hit a hole.

Fixes: 0347a8fd4c
Suggested-by: Ilya Dryomov <idryomov@gmail.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <20220113144426.4036493-2-pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-02-01 15:14:12 +01:00
Philippe Mathieu-Daudé
3c9c70347b block/export/fuse: Fix build failure on FreeBSD
When building on FreeBSD we get:

  [816/6851] Compiling C object libblockdev.fa.p/block_export_fuse.c.o
  ../block/export/fuse.c:628:16: error: use of undeclared identifier 'FALLOC_FL_KEEP_SIZE'
      if (mode & FALLOC_FL_KEEP_SIZE) {
                 ^
  ../block/export/fuse.c:651:16: error: use of undeclared identifier 'FALLOC_FL_PUNCH_HOLE'
      if (mode & FALLOC_FL_PUNCH_HOLE) {
                 ^
  ../block/export/fuse.c:652:22: error: use of undeclared identifier 'FALLOC_FL_KEEP_SIZE'
          if (!(mode & FALLOC_FL_KEEP_SIZE)) {
                       ^
  3 errors generated.
  FAILED: libblockdev.fa.p/block_export_fuse.c.o

Meson indeed reported FALLOC_FL_PUNCH_HOLE is not available:

  C compiler for the host machine: cc (clang 10.0.1 "FreeBSD clang version 10.0.1")
  Checking for function "fallocate" : NO
  Checking for function "posix_fallocate" : YES
  Header <linux/falloc.h> has symbol "FALLOC_FL_PUNCH_HOLE" : NO
  Header <linux/falloc.h> has symbol "FALLOC_FL_ZERO_RANGE" : NO
  ...

Similarly to commit 304332039 ("block/export/fuse.c: fix musl build"),
guard the code requiring FALLOC_FL_KEEP_SIZE / FALLOC_FL_PUNCH_HOLE
definitions under CONFIG_FALLOCATE_PUNCH_HOLE #ifdef'ry.

Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20220201112655.344373-3-f4bug@amsat.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-02-01 13:49:15 +01:00
Philippe Mathieu-Daudé
ac50419460 block/export/fuse: Rearrange if-else-if ladder in fuse_fallocate()
In order to safely maintain a mixture of #ifdef'ry with if-else-if
ladder, rearrange the last statement (!mode) first. Since it is
mutually exclusive with the other conditions, checking it first
doesn't make any logical difference, but allows to add #ifdef'ry
around in a more cleanly way.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20220201112655.344373-2-f4bug@amsat.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-02-01 13:49:15 +01:00
Kevin Wolf
520d8b40e8 block/export: Fix vhost-user-blk shutdown with requests in flight
The vhost-user-blk export runs requests asynchronously in their own
coroutine. When the vhost connection goes away and we want to stop the
vhost-user server, we need to wait for these coroutines to stop before
we can unmap the shared memory. Otherwise, they would still access the
unmapped memory and crash.

This introduces a refcount to VuServer which is increased when spawning
a new request coroutine and decreased before the coroutine exits. The
memory is only unmapped when the refcount reaches zero.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220125151435.48792-1-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-02-01 13:49:15 +01:00
Hanna Reitz
492a119610 block-backend: Retain permissions after migration
After migration, the permissions the guest device wants to impose on its
BlockBackend are stored in blk->perm and blk->shared_perm.  In
blk_root_activate(), we take our permissions, but keep all shared
permissions open by calling `blk_set_perm(blk->perm, BLK_PERM_ALL)`.

Only afterwards (immediately or later, depending on the runstate) do we
restrict the shared permissions by calling
`blk_set_perm(blk->perm, blk->shared_perm)`.  Unfortunately, our first
call with shared_perm=BLK_PERM_ALL has overwritten blk->shared_perm to
be BLK_PERM_ALL, so this is a no-op and the set of shared permissions is
not restricted.

Fix this bug by saving the set of shared permissions before invoking
blk_set_perm() with BLK_PERM_ALL and restoring it afterwards.

Fixes: 5f7772c4d0
       ("block-backend: Defer shared_perm tightening migration
       completion")
Reported-by: Peng Liang <liangpeng10@huawei.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20211125135317.186576-2-hreitz@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Peng Liang <liangpeng10@huawei.com>
2022-02-01 10:51:39 +01:00
Vladimir Sementsov-Ogievskiy
083c24561a qcow2: simple case support for downgrading of qcow2 images with zstd
If image doesn't have any compressed cluster we can easily switch to
zlib compression, which may allow to downgrade the image.

That's mostly needed to support IMGOPTS='compression_type=zstd' in some
iotests which do qcow2 downgrade.

While being here also fix checkpatch complain against '#' in printf
formatting.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20211223160144.1097696-13-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-02-01 10:51:39 +01:00
Hanna Reitz
113b727ce7 block/io: Update BSC only if want_zero is true
We update the block-status cache whenever we get new information from a
bdrv_co_block_status() call to the block driver.  However, if we have
passed want_zero=false to that call, it may flag areas containing zeroes
as data, and so we would update the block-status cache with wrong
information.

Therefore, we should not update the cache with want_zero=false.

Reported-by: Nir Soffer <nsoffer@redhat.com>
Fixes: 0bc329fbb0 ("block: block-status cache for data regions")
Reviewed-by: Nir Soffer <nsoffer@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220118170000.49423-2-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-01-28 16:52:40 -06:00
Peter Maydell
1cd2ad11d3 Block layer patches
- qemu-storage-daemon: Add vhost-user-blk help
 - block-backend: Fix use-after-free for BDS pointers after aio_poll()
 - qemu-img: Fix sparseness of output image with unaligned ranges
 - vvfat: Fix crashes in read-write mode
 - Fix device deletion events with -device JSON syntax
 - Code cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmHhf5gRHGt3b2xmQHJl
 ZGhhdC5jb20ACgkQfwmycsiPL9YBMA//ZkaIigVsfjRoeUh2MccgOuvYpXZtq4po
 q7l6AwGLbBpTt5Fy468gYhwmXuwHCapTMRmvWf6mpb86jtJ6vdbE16L0Z4/Z9iiW
 C0w69fsAAP9XyI+f7Q5FNtzz3jWztKowgyhkU33izbwYM7dm5Xw1q5bDkOiIBNoO
 d8cdxLC1oQGEWJmGLgmbaM/ow0iDogFpT8zU5j0VE3uK01si8pblWlXm1SM3nOK9
 b4uROqKYsTzTny/zX7KxD4SX3UGKYK393rQxr5HdmTiW14uGfB+EVfBxJmn07Qch
 lWM/v9tYoP1aVbR6IL5osAQdmbDYX0zsRMq5UA+dQ6OqnE3GpluVrYIFoaUSoShf
 S704hYdWgO0sKfpAYgJgGo6y0mglnp9Z7xO4Ng3XUNj0gvfgnOe3CdCdXIOeTFwC
 eP+KlFvbUT2xpTqI6ttBgKCcwKHA3hgWCnlo39C80bL1ZVKWSqh6zORfwmptouQ3
 BmuhEqZRyoYrknrTELN+lIKK2gP6MLup/ymeXWOOOE58KSpmrdeBAXmgJNXX3ucx
 lAWGsIz0CxdaKQoZpKpikho4rhrGkqZ33B3H7mdcsKS6zYzmsDIqa9FzUjtpvN2V
 K/jXlK7dv58Y+LLzpcuJAf8HNnitA107WD5RA1s5nTw0ahD2GwR4UPzEhnSO9/nT
 yZ3dGUysj7Q=
 =dnBv
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches

- qemu-storage-daemon: Add vhost-user-blk help
- block-backend: Fix use-after-free for BDS pointers after aio_poll()
- qemu-img: Fix sparseness of output image with unaligned ranges
- vvfat: Fix crashes in read-write mode
- Fix device deletion events with -device JSON syntax
- Code cleanups

# gpg: Signature made Fri 14 Jan 2022 13:50:16 GMT
# gpg:                using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg:                issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream:
  iotests/testrunner.py: refactor test_field_width
  block: drop BLK_PERM_GRAPH_MOD
  qemu-img: make is_allocated_sectors() more efficient
  iotests: Test qemu-img convert of zeroed data cluster
  vvfat: Fix vvfat_write() for writes before the root directory
  vvfat: Fix size of temporary qcow file
  iotests/308: Fix for CAP_DAC_OVERRIDE
  iotests/stream-error-on-reset: New test
  block-backend: prevent dangling BDS pointers across aio_poll()
  qapi/block: Restrict vhost-user-blk to CONFIG_VHOST_USER_BLK_SERVER
  qemu-storage-daemon: Add vhost-user-blk help
  docs: Correct 'vhost-user-blk' spelling
  softmmu: fix device deletion events with -device JSON syntax
  include/sysemu/blockdev.h: remove drive_get_max_devs
  include/sysemu/blockdev.h: remove drive_mark_claimed_by_board and inline drive_def
  block_int: make bdrv_backing_overridden static

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2022-01-14 15:56:30 +00:00
Vladimir Sementsov-Ogievskiy
64631f3681 block: drop BLK_PERM_GRAPH_MOD
First, this permission never protected a node from being changed, as
generic child-replacing functions don't check it.

Second, it's a strange thing: it presents a permission of parent node
to change its child. But generally, children are replaced by different
mechanisms, like jobs or qmp commands, not by nodes.

Graph-mod permission is hard to understand. All other permissions
describe operations which done by parent node on its child: read,
write, resize. Graph modification operations are something completely
different.

The only place where BLK_PERM_GRAPH_MOD is used as "perm" (not shared
perm) is mirror_start_job, for s->target. Still modern code should use
bdrv_freeze_backing_chain() to protect from graph modification, if we
don't do it somewhere it may be considered as a bug. So, it's a bit
risky to drop GRAPH_MOD, and analyzing of possible loss of protection
is hard. But one day we should do it, let's do it now.

One more bit of information is that locking the corresponding byte in
file-posix doesn't make sense at all.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902093754.2352-1-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-01-14 12:03:16 +01:00
Kevin Wolf
b9b8860d24 vvfat: Fix vvfat_write() for writes before the root directory
The calculation in sector2cluster() is done relative to the offset of
the root directory. Any writes to blocks before the start of the root
directory (in particular, writes to the FAT) result in negative values,
which are not handled correctly in vvfat_write().

This changes sector2cluster() to return a signed value, and makes sure
that vvfat_write() doesn't try to find mappings for negative cluster
number. It clarifies the code in vvfat_write() to make it more obvious
that the cluster numbers can be negative.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211209152231.23756-1-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-01-14 12:03:16 +01:00
Kevin Wolf
2db9b9e96f vvfat: Fix size of temporary qcow file
The size of the qcow size was calculated so that only the FAT partition
would fit on it, but not the whole disk. However, offsets relative to
the whole disk are used to access it, so increase its size to be large
enough for that.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211209151815.23495-1-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-01-14 12:03:16 +01:00
Stefan Hajnoczi
1e3552dbd2 block-backend: prevent dangling BDS pointers across aio_poll()
The BlockBackend root child can change when aio_poll() is invoked. This
happens when a temporary filter node is removed upon blockjob
completion, for example.

Functions in block/block-backend.c must be aware of this when using a
blk_bs() pointer across aio_poll() because the BlockDriverState refcnt
may reach 0, resulting in a stale pointer.

One example is scsi_device_purge_requests(), which calls blk_drain() to
wait for in-flight requests to cancel. If the backup blockjob is active,
then the BlockBackend root child is a temporary filter BDS owned by the
blockjob. The blockjob can complete during bdrv_drained_begin() and the
last reference to the BDS is released when the temporary filter node is
removed. This results in a use-after-free when blk_drain() calls
bdrv_drained_end(bs) on the dangling pointer.

Explicitly hold a reference to bs across block APIs that invoke
aio_poll().

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2021778
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2036178
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220111153613.25453-2-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-01-14 12:03:16 +01:00
Emanuele Giuseppe Esposito
cc67f28ea2 include/sysemu/blockdev.h: remove drive_mark_claimed_by_board and inline drive_def
drive_def is only a particular use case of
qemu_opts_parse_noisily, so it can be inlined.

Also remove drive_mark_claimed_by_board, as it is only defined
but not implemented (nor used) anywhere.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20211215121140.456939-3-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-01-14 12:03:16 +01:00
Peter Maydell
1001c9d9c0 Pull request
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEhpWov9P5fNqsNXdanKSrs4Grc8gFAmHfDFIACgkQnKSrs4Gr
 c8hoMwf/QPaU1svRdP9pPiMkJiwmtmgacEKEfrF3I8w8aOOf3dLPyUKafuStJtfZ
 Fhl2631jHL7JKKQKGomJhdzQovHAPsPEC8YFxesB1LvO0LIX4UtYplkxkj27In2D
 9w+cIMVMTkFyIv/5GgTaFBbnmk2at4tqXkcGmcblp0qZCMsElJvGWOkToM+Fjot4
 A4jYUCviqQqdt4j558UjIdecdaWy+5Cnej3NsKwH5V62o2uZY1+7vu0cf0ARcja1
 kptZBbvMIfjyl1TeuJWuEya8aWo0KwIbbs3tVKz16Na7RXlG01mYCwGLAVkBADCD
 mJaM1jZVADtUZyoCkh4M4KBBwFnFCw==
 =ITwP
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/stefanha-gitlab/tags/block-pull-request' into staging

Pull request

# gpg: Signature made Wed 12 Jan 2022 17:13:54 GMT
# gpg:                using RSA key 8695A8BFD3F97CDAAC35775A9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" [full]
# gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>" [full]
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35  775A 9CA4 ABB3 81AB 73C8

* remotes/stefanha-gitlab/tags/block-pull-request:
  virtio: unify dataplane and non-dataplane ->handle_output()
  virtio: use ->handle_output() instead of ->handle_aio_output()
  virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane
  virtio-blk: drop unused virtio_blk_handle_vq() return value
  virtio: get rid of VirtIOHandleAIOOutput
  aio-posix: split poll check from ready handler

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2022-01-14 10:43:32 +00:00
Stefan Hajnoczi
826cc32423 aio-posix: split poll check from ready handler
Adaptive polling measures the execution time of the polling check plus
handlers called when a polled event becomes ready. Handlers can take a
significant amount of time, making it look like polling was running for
a long time when in fact the event handler was running for a long time.

For example, on Linux the io_submit(2) syscall invoked when a virtio-blk
device's virtqueue becomes ready can take 10s of microseconds. This
can exceed the default polling interval (32 microseconds) and cause
adaptive polling to stop polling.

By excluding the handler's execution time from the polling check we make
the adaptive polling calculation more accurate. As a result, the event
loop now stays in polling mode where previously it would have fallen
back to file descriptor monitoring.

The following data was collected with virtio-blk num-queues=2
event_idx=off using an IOThread. Before:

168k IOPS, IOThread syscalls:

  9837.115 ( 0.020 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 16, iocbpp: 0x7fcb9f937db0)    = 16
  9837.158 ( 0.002 ms): IO iothread1/620155 write(fd: 103, buf: 0x556a2ef71b88, count: 8)                         = 8
  9837.161 ( 0.001 ms): IO iothread1/620155 write(fd: 104, buf: 0x556a2ef71b88, count: 8)                         = 8
  9837.163 ( 0.001 ms): IO iothread1/620155 ppoll(ufds: 0x7fcb90002800, nfds: 4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3
  9837.164 ( 0.001 ms): IO iothread1/620155 read(fd: 107, buf: 0x7fcb9f939cc0, count: 512)                        = 8
  9837.174 ( 0.001 ms): IO iothread1/620155 read(fd: 105, buf: 0x7fcb9f939cc0, count: 512)                        = 8
  9837.176 ( 0.001 ms): IO iothread1/620155 read(fd: 106, buf: 0x7fcb9f939cc0, count: 512)                        = 8
  9837.209 ( 0.035 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 32, iocbpp: 0x7fca7d0cebe0)    = 32

174k IOPS (+3.6%), IOThread syscalls:

  9809.566 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0cdd62be0)    = 32
  9809.625 ( 0.001 ms): IO iothread1/623061 write(fd: 103, buf: 0x5647cfba5f58, count: 8)                         = 8
  9809.627 ( 0.002 ms): IO iothread1/623061 write(fd: 104, buf: 0x5647cfba5f58, count: 8)                         = 8
  9809.663 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0d0388b50)    = 32

Notice that ppoll(2) and eventfd read(2) syscalls are eliminated because
the IOThread stays in polling mode instead of falling back to file
descriptor monitoring.

As usual, polling is not implemented on Windows so this patch ignores
the new io_poll_read() callback in aio-win32.c.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20211207132336.36627-2-stefanha@redhat.com

[Fixed up aio_set_event_notifier() calls in
tests/unit/test-fdmon-epoll.c added after this series was queued.
--Stefan]

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-01-12 17:09:39 +00:00
Thomas Huth
a5730b8bd3 block/file-posix: Simplify the XFS_IOC_DIOINFO handling
The handling for the XFS_IOC_DIOINFO ioctl is currently quite excessive:
This is not a "real" feature like the other features that we provide with
the "--enable-xxx" and "--disable-xxx" switches for the configure script,
since this does not influence lots of code (it's only about one call to
xfsctl() in file-posix.c), so people don't gain much with the ability to
disable this with "--disable-xfsctl".
It's also unfortunate that the ioctl will be disabled on Linux in case
the user did not install the right xfsprogs-devel package before running
configure. Thus let's simplify this by providing the ioctl definition
on our own, so we can completely get rid of the header dependency and
thus the related code in the configure script.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20211215125824.250091-1-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-12 14:09:04 +01:00
Vladimir Sementsov-Ogievskiy
985cac8f20 blockjob: drop BlockJob.blk field
It's unused now (except for permission handling)[*]. The only reasonable
user of it was block-stream job, recently updated to use own blk. And
other block jobs prefer to use own source node related objects.

So, the arguments of dropping the field are:

 - block jobs prefer not to use it
 - block jobs usually has more then one node to operate on, and better
   to operate symmetrically (for example has both source and target
   blk's in specific block-job state structure)

*: BlockJob.blk is used to keep some permissions. We simply move
permissions to block-job child created in block_job_create() together
with blk.

In mirror, we just should not care anymore about restoring state of
blk. Most probably this code could be dropped long ago, after dropping
bs->job pointer. Now it finally goes away together with BlockJob.blk
itself.

iotest 141 output is updated, as "bdrv_has_blk(bs)" check in
qmp_blockdev_del() doesn't fail (we don't have blk now). Still, new
error message looks even better.

In iotest 283 we need to add a job id, otherwise "Invalid job ID"
happens now earlier than permission check (as permissions moved from
blk to block-job node).

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
2021-12-28 15:18:59 +01:00
Vladimir Sementsov-Ogievskiy
048954e2f6 block/stream: add own blk
block-stream is the only block-job, that reasonably use BlockJob.blk.
We are going to drop BlockJob.blk soon. So, let block-stream have own
blk.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
2021-12-28 15:18:54 +01:00
Vladimir Sementsov-Ogievskiy
be16b8bf9f nbd: allow reconnect on open, with corresponding new options
It is useful when start of vm and start of nbd server are not
simple to sync.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2021-12-21 14:52:06 +01:00
Stefan Hajnoczi
cf4fbc3030 block/nvme: fix infinite loop in nvme_free_req_queue_cb()
When the request free list is exhausted the coroutine waits on
q->free_req_queue for the next free request. Whenever a request is
completed a BH is scheduled to invoke nvme_free_req_queue_cb() and wake
up waiting coroutines.

1. nvme_get_free_req() waits for a free request:

    while (q->free_req_head == -1) {
        ...
            trace_nvme_free_req_queue_wait(q->s, q->index);
            qemu_co_queue_wait(&q->free_req_queue, &q->lock);
        ...
    }

2. nvme_free_req_queue_cb() wakes up the coroutine:

    while (qemu_co_enter_next(&q->free_req_queue, &q->lock)) {
       ^--- infinite loop when free_req_head == -1
    }

nvme_free_req_queue_cb() and the coroutine form an infinite loop when
q->free_req_head == -1. Fix this by checking q->free_req_head in
nvme_free_req_queue_cb(). If the free request list is exhausted, don't
wake waiting coroutines. Eventually an in-flight request will complete
and the BH will be scheduled again, guaranteeing forward progress.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20211208152246.244585-1-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2021-12-09 09:19:49 +00:00
Daniella Lee
22c36b75c8 block/vvfat.c fix leak when failure occurs
Function vvfat_open called function enable_write_target and init_directories,
and these functions malloc new memory for BDRVVVFATState::qcow_filename,
BDRVVVFATState::used_clusters, and BDRVVVFATState::cluster_buff.

When the specified folder does not exist ,it may contains memory leak.
After init_directories function is executed, the vvfat_open return -EIO,
and bdrv_open_driver goto label open_failed,
the program use g_free(bs->opaque) to release BDRVVVFATState struct
without members mentioned.

command line:
qemu-system-x86_64 -hdb <vdisk qcow file>  -usb -device usb-storage,drive=fat16
-drive file=fat:rw:fat-type=16:"<path of a host folder does not exist>",
id=fat16,format=raw,if=none

enable_write_target called:
(gdb) bt
    at ../block/vvfat.c:3114
    flags=155650, errp=0x7fffffffd780) at ../block/vvfat.c:1236
    node_name=0x0, options=0x555556fa45d0, open_flags=155650,
    errp=0x7fffffffd890) at ../block.c:1558
    errp=0x7fffffffd890) at ../block.c:1852
    reference=0x0, options=0x555556fa45d0, flags=40962, parent=0x555556f98cd0,
    child_class=0x555556b1d6a0 <child_of_bds>, child_role=19,
    errp=0x7fffffffda90) at ../block.c:3779
    options=0x555556f9cfc0, bdref_key=0x555556239bb8 "file",
    parent=0x555556f98cd0, child_class=0x555556b1d6a0 <child_of_bds>,
    child_role=19, allow_none=true, errp=0x7fffffffda90) at ../block.c:3419
    reference=0x0, options=0x555556f9cfc0, flags=8194, parent=0x0,
    child_class=0x0, child_role=0, errp=0x555556c98c40 <error_fatal>)
    at ../block.c:3726
    options=0x555556f757b0, flags=0, errp=0x555556c98c40 <error_fatal>)
    at ../block.c:3872
    options=0x555556f757b0, flags=0, errp=0x555556c98c40 <error_fatal>)
    at ../block/block-backend.c:436
    bs_opts=0x555556f757b0, errp=0x555556c98c40 <error_fatal>)
    at ../blockdev.c:608
    errp=0x555556c98c40 <error_fatal>) at ../blockdev.c:992
......

Signed-off-by: Daniella Lee <daniellalee111@gmail.com>
Message-Id: <20211119112553.352222-1-daniellalee111@gmail.com>
[hreitz: Took commit message from v1]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-11-23 15:39:12 +01:00
Kevin Wolf
5dbd0ce115 file-posix: Fix alignment after reopen changing O_DIRECT
At the end of a reopen, we already call bdrv_refresh_limits(), which
should update bs->request_alignment according to the new file
descriptor. However, raw_probe_alignment() relies on s->needs_alignment
and just uses 1 if it isn't set. We neglected to update this field, so
starting with cache=writeback and then reopening with cache=none means
that we get an incorrect bs->request_alignment == 1 and unaligned
requests fail instead of being automatically aligned.

Fix this by recalculating s->needs_alignment in raw_refresh_limits()
before calling raw_probe_alignment().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211104113109.56336-1-kwolf@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211115145409.176785-13-kwolf@redhat.com>
[hreitz: Fix iotest 142 for block sizes greater than 512 by operating on
         a file with a size of 1 MB]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20211116101431.105252-1-hreitz@redhat.com>
2021-11-16 11:30:29 +01:00
Hanna Reitz
8d3dd037d9 stream: Traverse graph after modification
bdrv_cor_filter_drop() modifies the block graph.  That means that other
parties can also modify the block graph before it returns.  Therefore,
we cannot assume that the result of a graph traversal we did before
remains valid afterwards.

We should thus fetch `base` and `unfiltered_base` afterwards instead of
before.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211111120829.81329-2-hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211115145409.176785-2-kwolf@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-11-16 09:43:23 +01:00
Richard Henderson
741bdeb1d5 Block layer patches
- Fail gracefully when blockdev-snapshot creates loops
 - ide: Fix IDENTIFY DEVICE for disks > 128 GiB
 - file-posix: Fix return value translation for AIO discards
 - file-posix: add 'aio-max-batch' option
 - rbd: implement bdrv_co_block_status
 - Code cleanups and build fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmGBYXIRHGt3b2xmQHJl
 ZGhhdC5jb20ACgkQfwmycsiPL9Y3Qw//S9iBIU0cemm5mX/LlDzkrb0VhB5zCQ/C
 8ReZh2R0j3kJl2n+ViRr3qRzvjKyjYuKKJ+bb8FNl/DC88evmiowuXj+csVnC6PM
 I6a6vfKPdioIk36MwACWw75N208eVRNHOPvWH/s3sjar1hMlDftdN1lDZA9sDLLx
 NGH9VRG8gSoKox9+Wg4x584XGI3YU/BqGdcM+W9E0T4OLT9X0RUmXbzVzspYFlPh
 O/r/gTZ0Ip5UXoPiHgyGhCalNep6Uc9A8n1sTaHIjaVcfSE86AuJgc5RcM0HFASn
 1fTNmxbX2taEeynISJAn2x77LASCvLSUErBmVQjHZOAkbNgJD4/R9MUlndxr+MAR
 tXJD+Pkah+M2LwUWdUIfm8UiBBX3LC4Qd/5byc85qr+s24q+EPvOmyqThLT2RNjq
 NLGO9WmRnf7sNvFjvmMXz1JAp1pV4P82j2P/4oy3ZZSW8AVaWr6xxS5hgePpqD4x
 yFP91lnce//UKVfU60VafPh6wUv1Isv0aN1JaW0Vz9wo0LF/i/6zbFO+RVS2PbMR
 VpGmGee7rbgPyWxPrjunyXv+FqciRmbd+IEAg2MlKacHk1F2pQiDLnJUPIGf+9tz
 nv+cq92FNd7hpx4qsuBhe0Ys89QJ2hueTk4RrQF/38tEoRzZ6IwppLlp7kRK39C2
 8TPSkiyqlIQ=
 =Jtih
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kwolf/tags/for-upstream' into staging

Block layer patches

- Fail gracefully when blockdev-snapshot creates loops
- ide: Fix IDENTIFY DEVICE for disks > 128 GiB
- file-posix: Fix return value translation for AIO discards
- file-posix: add 'aio-max-batch' option
- rbd: implement bdrv_co_block_status
- Code cleanups and build fixes

# gpg: Signature made Tue 02 Nov 2021 12:04:02 PM EDT
# gpg:                using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg:                issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]

* remotes/kwolf/tags/for-upstream:
  block/nvme: Extract nvme_free_queue() from nvme_free_queue_pair()
  block/nvme: Display CQ/SQ pointer in nvme_free_queue_pair()
  block/nvme: Automatically free qemu_memalign() with QEMU_AUTO_VFREE
  block-backend: Silence clang -m32 compiler warning
  linux-aio: add `dev_max_batch` parameter to laio_io_unplug()
  linux-aio: add `dev_max_batch` parameter to laio_co_submit()
  file-posix: add `aio-max-batch` option
  block/export/fuse.c: fix musl build
  ide: Cap LBA28 capacity announcement to 2^28-1
  block/rbd: implement bdrv_co_block_status
  block: Fail gracefully when blockdev-snapshot creates loops
  block/file-posix: Fix return value translation for AIO discards

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2021-11-03 00:32:56 -04:00
Philippe Mathieu-Daudé
a895143894 block/nvme: Extract nvme_free_queue() from nvme_free_queue_pair()
Instead of duplicating code, extract the common helper to free
a single queue.

Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20211006164931.172349-4-philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 15:49:13 +01:00
Philippe Mathieu-Daudé
53cedeaaee block/nvme: Display CQ/SQ pointer in nvme_free_queue_pair()
For debugging purpose it is helpful to know the CQ/SQ pointers.
We already have a trace event in nvme_free_queue_pair(), extend
it to report these pointer addresses.

Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20211006164931.172349-3-philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 15:49:12 +01:00
Philippe Mathieu-Daudé
4a613bd862 block/nvme: Automatically free qemu_memalign() with QEMU_AUTO_VFREE
Since commit 4d324c0bf6 ("introduce QEMU_AUTO_VFREE") buffers
allocated by qemu_memalign() can automatically freed when using
the QEMU_AUTO_VFREE macro. Use it to simplify a bit.

Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20211006164931.172349-2-philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 15:48:55 +01:00
Hanna Reitz
73d4a11300 block-backend: Silence clang -m32 compiler warning
Similarly to e7e588d432, there is a
warning in block/block-backend.c that qiov->size <= INT64_MAX is always
true on machines where size_t is narrower than a uint64_t.  In said
commit, we silenced this warning by casting to uint64_t.

The commit introducing this warning here
(a93d81c84a) anticipated it and so tried
to address it the same way.  However, it only did so in one of two
places where this comparison occurs, and so we still need to fix up the
other one.

Fixes: a93d81c84a
       ("block-backend: convert blk_aio_ functions to int64_t bytes
       paramter")
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20211026090745.30800-1-hreitz@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:22:09 +01:00
Stefano Garzarella
68d7946648 linux-aio: add dev_max_batch parameter to laio_io_unplug()
Between the submission of a request and the unplug, other devices
with larger limits may have been queued new requests without flushing
the batch.

Using the new `dev_max_batch` parameter, laio_io_unplug() can check
if the batch exceeds the device limit to flush the current batch.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-4-sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:03:35 +01:00
Stefano Garzarella
512da21101 linux-aio: add dev_max_batch parameter to laio_co_submit()
This new parameter can be used by block devices to limit the
Linux AIO batch size more than the limit set by the AIO context.

file-posix backend supports this, passing its `aio-max-batch` option
previously added.

Add an helper function to calculate the maximum batch size.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-3-sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:03:35 +01:00
Stefano Garzarella
684960d462 file-posix: add aio-max-batch option
Commit d7ddd0a161 ("linux-aio: limit the batch size using
`aio-max-batch` parameter") added a way to limit the batch size
of Linux AIO backend for the entire AIO context.

The same AIO context can be shared by multiple devices, so
latency-sensitive devices may want to limit the batch size even
more to avoid increasing latency.

For this reason we add the `aio-max-batch` option to the file
backend, which will be used by the next commits to limit the size of
batches including requests generated by this device.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-2-sgarzare@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:03:30 +01:00
Fabrice Fontaine
3043320390 block/export/fuse.c: fix musl build
Include linux/falloc.h if CONFIG_FALLOCATE_ZERO_RANGE is defined to fix
50482fda98
and avoid the following build failure on musl:

../block/export/fuse.c: In function 'fuse_fallocate':
../block/export/fuse.c:643:21: error: 'FALLOC_FL_ZERO_RANGE' undeclared (first use in this function)
  643 |     else if (mode & FALLOC_FL_ZERO_RANGE) {
      |                     ^~~~~~~~~~~~~~~~~~~~

Fixes:
 - http://autobuild.buildroot.org/results/be24433a429fda681fb66698160132c1c99bc53b

Fixes: 50482fda98 ("block/export/fuse.c: fix musl build")
Signed-off-by: Fabrice Fontaine <fontaine.fabrice@gmail.com>
Message-Id: <20211022095209.1319671-1-fontaine.fabrice@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:02:46 +01:00
Peter Lieven
0347a8fd4c block/rbd: implement bdrv_co_block_status
the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <20211012152231.24868-1-pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:02:46 +01:00
Ari Sundholm
13a028336f block/file-posix: Fix return value translation for AIO discards
AIO discards regressed as a result of the following commit:
	0dfc7af2 block/file-posix: Optimize for macOS

When trying to run blkdiscard within a Linux guest, the request would
fail, with some errors in dmesg:

---- [ snip ] ----
[    4.010070] sd 2:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[    4.011061] sd 2:0:0:0: [sda] tag#0 Sense Key : Aborted Command
[current]
[    4.011061] sd 2:0:0:0: [sda] tag#0 Add. Sense: I/O process
terminated
[    4.011061] sd 2:0:0:0: [sda] tag#0 CDB: Unmap/Read sub-channel 42
00 00 00 00 00 00 00 18 00
[    4.011061] blk_update_request: I/O error, dev sda, sector 0
---- [ snip ] ----

This turns out to be a result of a flaw in changes to the error value
translation logic in handle_aiocb_discard(). The default return value
may be left untranslated in some configurations, and the wrong variable
is used in one translation.

Fix both issues.

Fixes: 0dfc7af2b2 ("block/file-posix: Optimize for macOS")
Cc: qemu-stable@nongnu.org
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Emil Karlson <jkarlson@tuxera.com>
Reviewed-by: Akihiko Odaki <akihiko.odaki@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20211019110954.4170931-1-ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:02:46 +01:00
Thomas Huth
7da9623cc0 block/vpc: Add a sanity check that fixed-size images have the right type
The code in vpc.c uses BDRVVPCState->footer.type in various places
to decide whether the image is a fixed-size (VHD_FIXED) or a dynamic
(VHD_DYNAMIC) image. However, we never check that this field really
contains VHD_FIXED if we detected a fixed size image in vpc_open(),
so a wrong value here could cause quite some trouble during runtime.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20211012082702.792259-1-thuth@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-11-02 12:47:51 +01:00
Thomas Weißschuh
f3d43dfd9a vmdk: allow specification of tools version
VMDK files support an attribute that represents the version of the guest
tools that are installed on the disk.
This attribute is used by vSphere before a machine has been started to
determine if the VM has the guest tools installed.
This is important when configuring "Operating system customizations" in
vSphere, as it checks for the presence of the guest tools before
allowing those customizations.
Thus when the VM has not yet booted normally it would be impossible to
customize it, therefore preventing a customized first-boot.

The attribute should not hurt on disks that do not have the guest tools
installed and indeed the VMware tools also unconditionally add this
attribute.
(Defaulting to the value "2147483647", as is done in this patch)

Signed-off-by: Thomas Weißschuh <thomas.weissschuh.ext@zeiss.com>
Message-Id: <20210913130419.13241-1-thomas.weissschuh.ext@zeiss.com>
[hreitz: Added missing '#' in block-core.json]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-11-02 12:47:51 +01:00
Vladimir Sementsov-Ogievskiy
aa78b82516 block-backend: drop INT_MAX restriction from blk_check_byte_request()
blk_check_bytes_request is called from blk_co_do_preadv,
blk_co_do_pwritev_part, blk_co_do_pdiscard and blk_co_copy_range
before (maybe) calling throttle_group_co_io_limits_intercept() (which
has int64_t argument) and then calling corresponding bdrv_co_ function.
bdrv_co_ functions are OK with int64_t bytes as well.

So dropping the check for INT_MAX we just get same restrictions as in
bdrv_ layer: discard and write-zeroes goes through
bdrv_check_qiov_request() and are allowed to be 64bit. Other requests
go through bdrv_check_request32() and still restricted by INT_MAX
boundary.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-13-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 16:00:07 -05:00
Vladimir Sementsov-Ogievskiy
14149710f9 block-backend: blk_pread, blk_pwrite: rename count parameter to bytes
To be consistent with declarations in include/sysemu/block-backend.h.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-12-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:59:26 -05:00
Vladimir Sementsov-Ogievskiy
a93d81c84a block-backend: convert blk_aio_ functions to int64_t bytes paramter
1. Convert bytes in BlkAioEmAIOCB:
  aio->bytes is only passed to already int64_t interfaces, and set in
  blk_aio_prwv, which is updated here.

2. For all updated functions the parameter type becomes wider so callers
   are safe.

3. In blk_aio_prwv we only store bytes to BlkAioEmAIOCB, which is
   updated here.

4. Other updated functions are wrappers on blk_aio_prwv.

Note that blk_aio_preadv and blk_aio_pwritev become safer: before this
commit, it's theoretically possible to pass qiov with size exceeding
INT_MAX, which than converted to int argument of blk_aio_prwv. Now it's
converted to int64_t which is a lot better. Still add assertions.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-11-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: tweak assertion and grammar]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:57:29 -05:00
Vladimir Sementsov-Ogievskiy
e192179bb2 block-backend: convert blk_co_copy_range to int64_t bytes
Function is updated so that parameter type becomes wider, so all
callers should be OK with it.

Look at blk_co_copy_range() itself: bytes is passed only to
blk_check_byte_request() and bdrv_co_copy_range(), which already have
int64_t bytes parameter, so we are OK.

Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-10-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:55:28 -05:00
Vladimir Sementsov-Ogievskiy
06f0325c5b block-backend: convert blk_foo wrappers to use int64_t bytes parameter
Convert blk_pdiscard, blk_pwrite_compressed, blk_pwrite_zeroes.
These are just wrappers for functions with int64_t argument, so allow
passing int64_t as well. Parameter type becomes wider so all callers
should be OK with it.

Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().

Note also that we don't (and are not going to) convert blk_pwrite and
blk_pread: these functions return number of bytes on success, so to
update them, we should change return type to int64_t as well, which
will lead to investigating and updating all callers which is too much.

So, blk_pread and blk_pwrite remain unchanged.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-9-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:53:48 -05:00
Vladimir Sementsov-Ogievskiy
16d36e2996 block-backend: drop blk_prw, use block-coroutine-wrapper
Let's drop hand-made coroutine wrappers and use coroutine wrapper
generation like in block/io.c.

Now, blk_foo() functions are written in same way as blk_co_foo() ones,
but wrap blk_do_foo() instead of blk_co_do_foo().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-8-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: spelling fix]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:53:24 -05:00
Vladimir Sementsov-Ogievskiy
7d55a3bbad block-coroutine-wrapper.py: support BlockBackend first argument
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-7-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:51:33 -05:00
Vladimir Sementsov-Ogievskiy
70e8775ed9 block-backend: rename _do_ helper functions to _co_do_
This is a preparation to the following commit, to use automatic
coroutine wrapper generation.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-6-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:50:40 -05:00
Vladimir Sementsov-Ogievskiy
2800637a33 block-backend: convert blk_co_pdiscard to int64_t bytes
We updated blk_do_pdiscard() and its wrapper blk_co_pdiscard(). Both
functions are updated so that the parameter type becomes wider, so all
callers should be OK with it.

Look at blk_do_pdiscard(): bytes is passed only to
blk_check_byte_request() and bdrv_co_pdiscard(), which already have
int64_t bytes parameter, so we are OK.

Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:48:56 -05:00
Vladimir Sementsov-Ogievskiy
34460feb63 block-backend: convert blk_co_pwritev_part to int64_t bytes
We convert blk_do_pwritev_part() and some wrappers:
blk_co_pwritev_part(), blk_co_pwritev(), blk_co_pwrite_zeroes().

All functions are converted so that the parameter type becomes wider, so
all callers should be OK with it.

Look at blk_do_pwritev_part() body:
bytes is passed to:

 - trace_blk_co_pwritev (we update it here)
 - blk_check_byte_request, throttle_group_co_io_limits_intercept,
   bdrv_co_pwritev_part - all already have int64_t argument.

Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:47:18 -05:00
Vladimir Sementsov-Ogievskiy
9547907705 block-backend: make blk_co_preadv() 64bit
For both updated functions, the type of bytes becomes wider, so all callers
should be OK with it.

blk_co_preadv() only passes its arguments to blk_do_preadv().

blk_do_preadv() passes bytes to:

 - trace_blk_co_preadv, which is updated too
 - blk_check_byte_request, throttle_group_co_io_limits_intercept,
   bdrv_co_preadv, which are already int64_t.

Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:46:44 -05:00
Vladimir Sementsov-Ogievskiy
7242db6389 block-backend: blk_check_byte_request(): int64_t bytes
Rename size and make it int64_t to correspond to modern block layer,
which always uses int64_t for offset and bytes (not in blk layer yet,
which is a task for following commits).

All callers pass int or unsigned int.

So, for bytes in [0, INT_MAX] nothing is changed, for negative bytes we
now fail on "bytes < 0" check instead of "bytes > INT_MAX" check.

Note, that blk_check_byte_request() still doesn't allow requests
exceeding INT_MAX.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:39:40 -05:00
Hanna Reitz
e7e588d432 qcow2: Silence clang -m32 compiler warning
With -m32, size_t is generally only a uint32_t.  That makes clang
complain that in the assertion

  assert(qiov->size <= INT64_MAX);

the range of the type of qiov->size (size_t) is too small for any of its
values to ever exceed INT64_MAX.

Cast qiov->size to uint64_t to silence clang.

Fixes: f7ef38dd13
       ("block: use int64_t instead of uint64_t in driver read
       handlers")
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20211011155031.149158-1-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:39:38 -05:00
Paolo Bonzini
ff66f3e55b configure, meson: move libaio check to meson.build
Message-Id: <20211007130829.632254-10-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-14 09:50:57 +02:00
Richard Henderson
14f12119aa mirror: Handle errors after READY cancel
v2: add small fix by Stefano, Hanna's series fixed
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEi5wmzbL9FHyIDoahVh8kwfGfefsFAmFfEVMACgkQVh8kwfGf
 eftAVA//WLtOaiVYPSjEl5EK80kry39VknZkQyeYUzyV7JNr/FRMlbJaIF2sOjH5
 KPRpfwBiuijOc8R0s34HY0BpyweRd1rbypHblZkO7EO4XwHx1FLF5kNHF6yV7wPL
 c9W564sZpc6Z96wSMgC4Is9QHJ6JbO4TJNJsG8v/PHEqGQV/yCYkgBox4loJckww
 uSAZ7l63IWA8uPSq/rOu34bREKN9s0kHkvFq0JNWk2HtOBLDiRDUYmbSfdjfT4jz
 np7ojKiffcAJED9JA28Zo2Y+MSId+FyoO4lbt+deMNzIHboy2oVlHouoHHprr61x
 dIO7Qt1IoMk5IBIfkPRYkReMwxxSVKuIJcWm8Qqtkcg2X0g5ayNUmHwpBMd50h2z
 XPjrr0YdOixhxMHoBnqlkPlWU0Y/B+YJIQ+mjqp+vRNkk94NoXhsXnCod1ajkgWO
 zjc/dztew7HvNStJaMM0rnEjanLhzFZKtlMO4WwZHQp2yZG2AINkPStswo2f3AmL
 FI+2By/UhFKm3BEemf0wYWDPWrPHU+BOiu16KjSKeS0GA9t7GXBUDRxNYPhUheXJ
 eJKIpNsGbseNxKrAbLyRhAB75Fa/ReZqqybmEcLyal/ball3R/cNF3gaMHeX0o1n
 HTGIAF5JOAXNGApS5TilkXPZ7jHFOVPh/Fi6/16/08tcgxjVfro=
 =TVTu
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/vsementsov/tags/pull-jobs-2021-10-07-v2' into staging

mirror: Handle errors after READY cancel
v2: add small fix by Stefano, Hanna's series fixed

# gpg: Signature made Thu 07 Oct 2021 08:25:07 AM PDT
# gpg:                using RSA key 8B9C26CDB2FD147C880E86A1561F24C1F19F79FB
# gpg: Good signature from "Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 8B9C 26CD B2FD 147C 880E  86A1 561F 24C1 F19F 79FB

* remotes/vsementsov/tags/pull-jobs-2021-10-07-v2:
  iotests: Add mirror-ready-cancel-error test
  mirror: Do not clear .cancelled
  mirror: Stop active mirroring after force-cancel
  mirror: Check job_is_cancelled() earlier
  mirror: Use job_is_cancelled()
  job: Add job_cancel_requested()
  job: Do not soft-cancel after a job is done
  jobs: Give Job.force_cancel more meaning
  job: @force parameter for job_cancel_sync()
  job: Force-cancel jobs in a failed transaction
  mirror: Drop s->synced
  mirror: Keep s->synced on error
  job: Context changes in job_completed_txn_abort()
  block/aio_task: assert `max_busy_tasks` is greater than 0
  block/backup: avoid integer overflow of `max-workers`

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2021-10-07 10:26:35 -07:00
Hanna Reitz
a640fa0e38 mirror: Do not clear .cancelled
Clearing .cancelled before leaving the main loop when the job has been
soft-cancelled is no longer necessary since job_is_cancelled() only
returns true for jobs that have been force-cancelled.

Therefore, this only makes a differences in places that call
job_cancel_requested().  In block/mirror.c, this is done only before
.cancelled was cleared.

In job.c, there are two callers:
- job_completed_txn_abort() asserts that .cancelled is true, so keeping
  it true will not affect this place.

- job_complete() refuses to let a job complete that has .cancelled set.
  It is correct to refuse to let the user invoke job-complete on mirror
  jobs that have already been soft-cancelled.

With this change, there are no places that reset .cancelled to false and
so we can be sure that .force_cancel can only be true if .cancelled is
true as well.  Assert this in job_is_cancelled().

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-13-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:50 +02:00
Hanna Reitz
9b230ef93e mirror: Stop active mirroring after force-cancel
Once the mirror job is force-cancelled (job_is_cancelled() is true), we
should not generate new I/O requests.  This applies to active mirroring,
too, so stop it once the job is cancelled.

(We must still forward all I/O requests to the source, though, of
course, but those are not really I/O requests generated by the job, so
this is fine.)

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-12-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:50 +02:00
Hanna Reitz
4feeec7e23 mirror: Check job_is_cancelled() earlier
We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

Jobs can be cancelled while they yield, and once they are
(force-cancelled), they should not generate new I/O requests.
Therefore, we should put the check after the last yield before
mirror_iteration() is invoked.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-11-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:50 +02:00
Hanna Reitz
20ad4d204a mirror: Use job_is_cancelled()
mirror_drained_poll() returns true whenever the job is cancelled,
because "we [can] be sure that it won't issue more requests".  However,
this is only true for force-cancelled jobs, so use job_is_cancelled().

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-10-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:50 +02:00
Hanna Reitz
08b83bff2a job: Add job_cancel_requested()
Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_requested() as the general variant, which returns true for
any jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Finally, here is a justification for how different job_is_cancelled()
invocations are treated by this patch:

- block/mirror.c (mirror_run()):
  - The first invocation is a while loop that should loop until the job
    has been cancelled or scheduled for completion.  What kind of cancel
    does not matter, only the fact that the job is supposed to end.

  - The second invocation wants to know whether the job has been
    soft-cancelled.  Calling job_cancel_requested() is a bit too broad,
    but if the job were force-cancelled, we should leave the main loop
    as soon as possible anyway, so this should not matter here.

  - The last two invocations already check force_cancel, so they should
    continue to use job_is_cancelled().

- block/backup.c, block/commit.c, block/stream.c, anything in tests/:
  These jobs know only force-cancel, so there is no difference between
  job_is_cancelled() and job_cancel_requested().  We can continue using
  job_is_cancelled().

- job.c:
  - job_pause_point(), job_yield(), job_sleep_ns(): Only force-cancelled
    jobs should be prevented from being paused.  Continue using job_is_cancelled().

  - job_update_rc(), job_finalize_single(), job_finish_sync(): These
    functions are all called after the job has left its main loop.  The
    mirror job (the only job that can be soft-cancelled) will clear
    .cancelled before leaving the main loop if it has been
    soft-cancelled.  Therefore, these functions will observe .cancelled
    to be true only if the job has been force-cancelled.  We can
    continue to use job_is_cancelled().
    (Furthermore, conceptually, a soft-cancelled mirror job should not
    report to have been cancelled.  It should report completion (see
    also the block-job-cancel QAPI documentation).  Therefore, it makes
    sense for these functions not to distinguish between a
    soft-cancelled mirror job and a job that has completed as normal.)

  - job_completed_txn_abort(): All jobs other than @job have been
    force-cancelled.  job_is_cancelled() must be true for them.
    Regarding @job itself: job_completed_txn_abort() is mostly called
    when the job's return value is not 0.  A soft-cancelled mirror has a
    return value of 0, and so will not end up here then.
    However, job_cancel() invokes job_completed_txn_abort() if the job
    has been deferred to the main loop, which is mostly the case for
    completed jobs (which skip the assertion), but not for sure.
    To be safe, use job_cancel_requested() in this assertion.

  - job_complete(): This is function eventually invoked by the user
    (through qmp_block_job_complete() or qmp_job_complete(), or
    job_complete_sync(), which comes from qemu-img).  The intention here
    is to prevent a user from invoking job-complete after the job has
    been cancelled.  This should also apply to soft cancelling: After a
    mirror job has been soft-cancelled, the user should not be able to
    decide otherwise and have it complete as normal (i.e. pivoting to
    the target).

  - job_cancel(): Both functions are equivalent (see comment there), but
    we want to use job_is_cancelled(), because this shows that we call
    job_completed_txn_abort() only for force-cancelled jobs.  (As
    explained for job_update_rc(), soft-cancelled jobs should be treated
    as if they have completed as normal.)

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-9-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:40 +02:00
Hanna Reitz
73895f3838 jobs: Give Job.force_cancel more meaning
We largely have two cancel modes for jobs:

First, there is actual cancelling.  The job is terminated as soon as
possible, without trying to reach a consistent result.

Second, we have mirror in the READY state.  Technically, the job is not
really cancelled, but it just is a different completion mode.  The job
can still run for an indefinite amount of time while it tries to reach a
consistent result.

We want to be able to clearly distinguish which cancel mode a job is in
(when it has been cancelled).  We can use Job.force_cancel for this, but
right now it only reflects cancel requests from the user with
force=true, but clearly, jobs that do not even distinguish between
force=false and force=true are effectively always force-cancelled.

So this patch has Job.force_cancel signify whether the job will
terminate as soon as possible (force_cancel=true) or whether it will
effectively remain running despite being "cancelled"
(force_cancel=false).

To this end, we let jobs that provide JobDriver.cancel() tell the
generic job code whether they will terminate as soon as possible or not,
and for jobs that do not provide that method we assume they will.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211006151940.214590-7-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:34 +02:00
Hanna Reitz
4cfb3f0562 job: @force parameter for job_cancel_sync()
Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.

In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true.  The replication block driver is the exception,
specifically the active commit job it runs.

As for job_cancel_sync_all(), all callers want it to force-cancel all
jobs, because that is the point of it: To cancel all remaining jobs as
quickly as possible (generally on process termination).  So make it
invoke job_cancel_sync() with force=true.

This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want.  (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting.  If users want consistent results, they must have all jobs be
done before they quit qemu.)

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-6-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:09 +02:00
Hanna Reitz
4471622428 mirror: Drop s->synced
As of HEAD^, there is no meaning to s->synced other than whether the job
is READY or not.  job_is_ready() gives us that information, too.

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211006151940.214590-4-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:40:48 +02:00
Hanna Reitz
a3810da5cf mirror: Keep s->synced on error
An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_sync is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211006151940.214590-3-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:40:48 +02:00
Paolo Bonzini
cc07162953 block: introduce max_hw_iov for use in scsi-generic
Linux limits the size of iovecs to 1024 (UIO_MAXIOV in the kernel
sources, IOV_MAX in POSIX).  Because of this, on some host adapters
requests with many iovecs are rejected with -EINVAL by the
io_submit() or readv()/writev() system calls.

In fact, the same limit applies to SG_IO as well.  To fix both the
EINVAL and the possible performance issues from using fewer iovecs
than allowed by Linux (some HBAs have max_segments as low as 128),
introduce a separate entry in BlockLimits to hold the max_segments
value from sysfs.  This new limit is used only for SG_IO and clamped
to bs->bl.max_iov anyway, just like max_hw_transfer is clamped to
bs->bl.max_transfer.

Reported-by: Halil Pasic <pasic@linux.ibm.com>
Cc: Hanna Reitz <hreitz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org
Cc: qemu-stable@nongnu.org
Fixes: 18473467d5 ("file-posix: try BLKSECTGET on block devices too, do not round to power of 2", 2021-06-25)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210923130436.1187591-1-pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-10-06 10:25:55 +02:00
Stefano Garzarella
a9515df4d6 block/aio_task: assert max_busy_tasks is greater than 0
All code in block/aio_task.c expects `max_busy_tasks` to always
be greater than 0.

Assert this condition during the AioTaskPool creation where
`max_busy_tasks` is set.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211005161157.282396-3-sgarzare@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-05 18:56:41 +02:00
Stefano Garzarella
8fc898ce0b block/backup: avoid integer overflow of max-workers
QAPI generates `struct BackupPerf` where `max-workers` value is stored
in an `int64_t` variable.
But block_copy_async(), and the underlying code, uses an `int` parameter.

At the end that variable is used to initialize `max_busy_tasks` in
block/aio_task.c causing the following assertion failure if a value
greater than INT_MAX(2147483647) is used:

  ../block/aio_task.c:63: aio_task_pool_wait_one: Assertion `pool->busy_tasks > 0' failed.

Let's check that `max-workers` doesn't exceed INT_MAX and print an
error in that case.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2009310
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211005161157.282396-2-sgarzare@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-05 18:56:20 +02:00
Vladimir Sementsov-Ogievskiy
1af7737871 block/nbd: check that received handle is valid
If we don't have active request, that waiting for this handle to be
received, we should report an error.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902103805.25686-6-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:33 -05:00
Vladimir Sementsov-Ogievskiy
4ddb5d2fde block/nbd: drop connection_co
OK, that's a big rewrite of the logic.

Pre-patch we have an always running coroutine - connection_co. It does
reply receiving and reconnecting. And it leads to a lot of difficult
and unobvious code around drained sections and context switch. We also
abuse bs->in_flight counter which is increased for connection_co and
temporary decreased in points where we want to allow drained section to
begin. One of these place is in another file: in nbd_read_eof() in
nbd/client.c.

We also cancel reconnect and requests waiting for reconnect on drained
begin which is not correct. And this patch fixes that.

Let's finally drop this always running coroutine and go another way:
do both reconnect and receiving in request coroutines.

The detailed list of changes below (in the sequence of diff hunks).

1. receiving coroutines are woken directly from nbd_channel_error, when
   we change s->state

2. nbd_co_establish_connection_cancel(): we don't have drain_begin now,
   and in nbd_teardown_connection() all requests should already be
   finished (and reconnect is done from request). So
   nbd_co_establish_connection_cancel() is called from
   nbd_cancel_in_flight() (to cancel the request that is doing
   nbd_co_establish_connection()) and from reconnect_delay_timer_cb()
   (previously we didn't need it, as reconnect delay only should cancel
   active requests not the reconnection itself). But now reconnection
   itself is done in the separate thread (we now call
   nbd_client_connection_enable_retry() in nbd_open()), and we need to
   cancel the requests that wait in nbd_co_establish_connection()
   now).

2A. We do receive headers in request coroutine. But we also should
   dispatch replies for other pending requests. So,
   nbd_connection_entry() is turned into nbd_receive_replies(), which
   does reply dispatching while it receives other request headers, and
   returns when it receives the requested header.

3. All old staff around drained sections and context switch is dropped.
   In details:
   - we don't need to move connection_co to new aio context, as we
     don't have connection_co anymore
   - we don't have a fake "request" of connection_co (extra increasing
     in_flight), so don't care with it in drain_begin/end
   - we don't stop reconnection during drained section anymore. This
     means that drain_begin may wait for a long time (up to
     reconnect_delay). But that's an improvement and more correct
     behavior see below[*]

4. In nbd_teardown_connection() we don't have to wait for
   connection_co, as it is dropped. And cleanup for s->ioc and nbd_yank
   is moved here from removed connection_co.

5. In nbd_co_do_establish_connection() we now should handle
   NBD_CLIENT_CONNECTING_NOWAIT: if new request comes when we are in
   NBD_CLIENT_CONNECTING_NOWAIT, it still should call
   nbd_co_establish_connection() (who knows, maybe the connection was
   already established by another thread in the background). But we
   shouldn't wait: if nbd_co_establish_connection() can't return new
   channel immediately the request should fail (we are in
   NBD_CLIENT_CONNECTING_NOWAIT state).

6. nbd_reconnect_attempt() is simplified: it's now easier to wait for
   other requests in the caller, so here we just assert that fact.
   Also delay time is now initialized here: we can easily detect first
   attempt and start a timer.

7. nbd_co_reconnect_loop() is dropped, we don't need it. Reconnect
   retries are fully handle by thread (nbd/client-connection.c), delay
   timer we initialize in nbd_reconnect_attempt(), we don't have to
   bother with s->drained and friends. nbd_reconnect_attempt() now
   called from nbd_co_send_request().

8. nbd_connection_entry is dropped: reconnect is now handled by
   nbd_co_send_request(), receiving reply is now handled by
   nbd_receive_replies(): all handled from request coroutines.

9. So, welcome new nbd_receive_replies() called from request coroutine,
   that receives reply header instead of nbd_connection_entry().
   Like with sending requests, only one coroutine may receive in a
   moment. So we introduce receive_mutex, which is locked around
   nbd_receive_reply(). It also protects some related fields. Still,
   full audit of thread-safety in nbd driver is a separate task.
   New function waits for a reply with specified handle being received
   and works rather simple:

   Under mutex:
     - if current handle is 0, do receive by hand. If another handle
       received - switch to other request coroutine, release mutex and
       yield. Otherwise return success
     - if current handle == requested handle, we are done
     - otherwise, release mutex and yield

10: in nbd_co_send_request() we now do nbd_reconnect_attempt() if
    needed. Also waiting in free_sema queue we now wait for one of two
    conditions:
    - connectED, in_flight < MAX_NBD_REQUESTS (so we can start new one)
    - connectING, in_flight == 0, so we can call
      nbd_reconnect_attempt()
    And this logic is protected by s->send_mutex

    Also, on failure we don't have to care of removed s->connection_co

11. nbd_co_do_receive_one_chunk(): now instead of yield() and wait for
    s->connection_co we just call new nbd_receive_replies().

12. nbd_co_receive_one_chunk(): place where s->reply.handle becomes 0,
    which means that handling of the whole reply is finished. Here we
    need to wake one of coroutines sleeping in nbd_receive_replies().
    If none are sleeping - do nothing. That's another behavior change: we
    don't have endless recv() in the idle time. It may be considered as
    a drawback. If so, it may be fixed later.

13. nbd_reply_chunk_iter_receive(): don't care about removed
    connection_co, just ping in_flight waiters.

14. Don't create connection_co, enable retry in the connection thread
    (we don't have own reconnect loop anymore)

15. We now need to add a nbd_co_establish_connection_cancel() call in
    nbd_cancel_in_flight(), to cancel the request that is doing a
    connection attempt.

[*], ok, now we don't cancel reconnect on drain begin. That's correct:
    reconnect feature leads to possibility of long-running requests (up
    to reconnect delay). Still, drain begin is not a reason to kill
    long requests. We should wait for them.

    This also means, that we can again reproduce a dead-lock, described
    in 8c517de24a.
    Why we are OK with it:
    1. Now this is not absolutely-dead dead-lock: the vm is unfrozen
       after reconnect delay. Actually 8c517de24a fixed a bug in
       NBD logic, that was not described in 8c517de24a and led to
       forever dead-lock. The problem was that nobody woke the free_sema
       queue, but drain_begin can't finish until there is a request in
       free_sema queue. Now we have a reconnect delay timer that works
       well.
    2. It's not a problem of the NBD driver, but of the ide code,
       because it does drain_begin under the global mutex; the problem
       doesn't reproduce when using scsi instead of ide.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902103805.25686-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar and comment tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:33 -05:00
Vladimir Sementsov-Ogievskiy
04a953b232 block/nbd: refactor nbd_recv_coroutines_wake_all()
Split out nbd_recv_coroutine_wake_one(), as it will be used
separately.
Rename the function and add a possibility to wake only first found
sleeping coroutine.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902103805.25686-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:33 -05:00
Vladimir Sementsov-Ogievskiy
3bc0bd1f42 block/nbd: move nbd_recv_coroutines_wake_all() up
We are going to use it in nbd_channel_error(), so move it up. Note,
that we are going also refactor and rename
nbd_recv_coroutines_wake_all() in future anyway, so keeping it where it
is and making forward declaration doesn't make real sense.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210902103805.25686-3-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:33 -05:00
Vladimir Sementsov-Ogievskiy
cb116da7d7 block/nbd: nbd_channel_error() shutdown channel unconditionally
Don't rely on connection being totally broken in case of -EIO. Safer
and more correct is to just shut down the channel anyway, since we
change the state and plan on reconnecting.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210902103805.25686-2-vsementsov@virtuozzo.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:33 -05:00
Vladimir Sementsov-Ogievskiy
6a8f3dbb19 block/io: allow 64bit discard requests
Now that all drivers are updated by the previous commit, we can drop
the last limiter on pdiscard path: INT_MAX in bdrv_co_pdiscard().

Now everything is prepared for implementing incredibly cool and fast
big-discard requests in NBD and qcow2. And any other driver which wants
it of course.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-12-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy
0c8022876f block: use int64_t instead of int in driver discard handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver discard handlers bytes parameter to int64_t.

The only caller of all updated function is bdrv_co_pdiscard in
block/io.c. It is already prepared to work with 64bit requests, but
pass at most max(bs->bl.max_pdiscard, INT_MAX) to the driver.

Let's look at all updated functions:

blkdebug: all calculations are still OK, thanks to
  bdrv_check_qiov_request().
  both rule_check and bdrv_co_pdiscard are 64bit

blklogwrites: pass to blk_loc_writes_co_log which is 64bit

blkreplay, copy-on-read, filter-compress: pass to bdrv_co_pdiscard, OK

copy-before-write: pass to bdrv_co_pdiscard which is 64bit and to
  cbw_do_copy_before_write which is 64bit

file-posix: one handler calls raw_account_discard() is 64bit and both
  handlers calls raw_do_pdiscard(). Update raw_do_pdiscard, which pass
  to RawPosixAIOData::aio_nbytes, which is 64bit (and calls
  raw_account_discard())

gluster: somehow, third argument of glfs_discard_async is size_t.
  Let's set max_pdiscard accordingly.

iscsi: iscsi_allocmap_set_invalid is 64bit,
  !is_byte_request_lun_aligned is 64bit.
  list.num is uint32_t. Let's clarify max_pdiscard and
  pdiscard_alignment.

mirror_top: pass to bdrv_mirror_top_do_write() which is
  64bit

nbd: protocol limitation. max_pdiscard is alredy set strict enough,
  keep it as is for now.

nvme: buf.nlb is uint32_t and we do shift. So, add corresponding limits
  to nvme_refresh_limits().

preallocate: pass to bdrv_co_pdiscard() which is 64bit.

rbd: pass to qemu_rbd_start_co() which is 64bit.

qcow2: calculations are still OK, thanks to bdrv_check_qiov_request(),
  qcow2_cluster_discard() is 64bit.

raw-format: raw_adjust_offset() is 64bit, bdrv_co_pdiscard too.

throttle: pass to bdrv_co_pdiscard() which is 64bit and to
  throttle_group_co_io_limits_intercept() which is 64bit as well.

test-block-iothread: bytes argument is unused

Great! Now all drivers are prepared to handle 64bit discard requests,
or else have explicit max_pdiscard limits.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-11-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy
39af49c0d7 block: make BlockLimits::max_pdiscard 64bit
We are going to support 64 bit discard requests. Now update the
limit variable. It's absolutely safe. The variable is set in some
drivers, and used in bdrv_co_pdiscard().

Update also max_pdiscard variable in bdrv_co_pdiscard(), so that
bdrv_co_pdiscard() is now prepared for 64bit requests. The remaining
logic including num, offset and bytes variables is already
supporting 64bit requests.

So the only thing that prevents 64 bit requests is limiting
max_pdiscard variable to INT_MAX in bdrv_co_pdiscard().
We'll drop this limitation after updating all block drivers.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-10-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy
2aaa3f9b33 block/io: allow 64bit write-zeroes requests
Now that all drivers are updated by previous commit, we can drop two
last limiters on write-zeroes path: INT_MAX in
bdrv_co_do_pwrite_zeroes() and bdrv_check_request32() in
bdrv_co_pwritev_part().

Now everything is prepared for implementing incredibly cool and fast
big-write-zeroes in NBD and qcow2. And any other driver which wants it
of course.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-9-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy
f34b2bcf8c block: use int64_t instead of int in driver write_zeroes handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver write_zeroes handlers bytes parameter to int64_t.

The only caller of all updated function is bdrv_co_do_pwrite_zeroes().

bdrv_co_do_pwrite_zeroes() itself is of course OK with widening of
callee parameter type. Also, bdrv_co_do_pwrite_zeroes()'s
max_write_zeroes is limited to INT_MAX. So, updated functions all are
safe, they will not get "bytes" larger than before.

Still, let's look through all updated functions, and add assertions to
the ones which are actually unprepared to values larger than INT_MAX.
For these drivers also set explicit max_pwrite_zeroes limit.

Let's go:

blkdebug: calculations can't overflow, thanks to
  bdrv_check_qiov_request() in generic layer. rule_check() and
  bdrv_co_pwrite_zeroes() both have 64bit argument.

blklogwrites: pass to blk_log_writes_co_log() with 64bit argument.

blkreplay, copy-on-read, filter-compress: pass to
  bdrv_co_pwrite_zeroes() which is OK

copy-before-write: Calls cbw_do_copy_before_write() and
  bdrv_co_pwrite_zeroes, both have 64bit argument.

file-posix: both handler calls raw_do_pwrite_zeroes, which is updated.
  In raw_do_pwrite_zeroes() calculations are OK due to
  bdrv_check_qiov_request(), bytes go to RawPosixAIOData::aio_nbytes
  which is uint64_t.
  Check also where that uint64_t gets handed:
  handle_aiocb_write_zeroes_block() passes a uint64_t[2] to
  ioctl(BLKZEROOUT), handle_aiocb_write_zeroes() calls do_fallocate()
  which takes off_t (and we compile to always have 64-bit off_t), as
  does handle_aiocb_write_zeroes_unmap. All look safe.

gluster: bytes go to GlusterAIOCB::size which is int64_t and to
  glfs_zerofill_async works with off_t.

iscsi: Aha, here we deal with iscsi_writesame16_task() that has
  uint32_t num_blocks argument and iscsi_writesame16_task() has
  uint16_t argument. Make comments, add assertions and clarify
  max_pwrite_zeroes calculation.
  iscsi_allocmap_() functions already has int64_t argument
  is_byte_request_lun_aligned is simple to update, do it.

mirror_top: pass to bdrv_mirror_top_do_write which has uint64_t
  argument

nbd: Aha, here we have protocol limitation, and NBDRequest::len is
  uint32_t. max_pwrite_zeroes is cleanly set to 32bit value, so we are
  OK for now.

nvme: Again, protocol limitation. And no inherent limit for
  write-zeroes at all. But from code that calculates cdw12 it's obvious
  that we do have limit and alignment. Let's clarify it. Also,
  obviously the code is not prepared to handle bytes=0. Let's handle
  this case too.
  trace events already 64bit

preallocate: pass to handle_write() and bdrv_co_pwrite_zeroes(), both
  64bit.

rbd: pass to qemu_rbd_start_co() which is 64bit.

qcow2: offset + bytes and alignment still works good (thanks to
  bdrv_check_qiov_request()), so tail calculation is OK
  qcow2_subcluster_zeroize() has 64bit argument, should be OK
  trace events updated

qed: qed_co_request wants int nb_sectors. Also in code we have size_t
  used for request length which may be 32bit. So, let's just keep
  INT_MAX as a limit (aligning it down to pwrite_zeroes_alignment) and
  don't care.

raw-format: Is OK. raw_adjust_offset and bdrv_co_pwrite_zeroes are both
  64bit.

throttle: Both throttle_group_co_io_limits_intercept() and
  bdrv_co_pwrite_zeroes() are 64bit.

vmdk: pass to vmdk_pwritev which is 64bit

quorum: pass to quorum_co_pwritev() which is 64bit

Hooray!

At this point all block drivers are prepared to support 64bit
write-zero requests, or have explicitly set max_pwrite_zeroes.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-8-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: use <= rather than < in assertions relying on max_pwrite_zeroes]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy
d544f5d3b1 block: make BlockLimits::max_pwrite_zeroes 64bit
We are going to support 64 bit write-zeroes requests. Now update the
limit variable. It's absolutely safe. The variable is set in some
drivers, and used in bdrv_co_do_pwrite_zeroes().

Update also max_write_zeroes variable in bdrv_co_do_pwrite_zeroes(), so
that bdrv_co_do_pwrite_zeroes() is now prepared to 64bit requests. The
remaining logic including num, offset and bytes variables is already
supporting 64bit requests.

So the only thing that prevents 64 bit requests is limiting
max_write_zeroes variable to INT_MAX in bdrv_co_do_pwrite_zeroes().
We'll drop this limitation after updating all block drivers.

Ah, we also have bdrv_check_request32() in bdrv_co_pwritev_part(). It
will be modified to do bdrv_check_request() for write-zeroes path.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-7-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy
485350497b block: use int64_t instead of uint64_t in copy_range driver handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver copy_range handlers parameters which are already
64bit to signed type.

Now let's consider all callers. Simple

  git grep '\->bdrv_co_copy_range'

shows the only caller:

  bdrv_co_copy_range_internal(), which does bdrv_check_request32(),
  so everything is OK.

Still, the functions may be called directly, not only by drv->...
Let's check:

git grep '\.bdrv_co_copy_range_\(from\|to\)\s*=' | \
awk '{print $4}' | sed 's/,//' | sed 's/&//' | sort | uniq | \
while read func; do git grep "$func(" | \
grep -v "$func(BlockDriverState"; done

shows no more callers. So, we are done.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-6-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy
e75abedab7 block: use int64_t instead of uint64_t in driver write handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver write handlers parameters which are already 64bit to
signed type.

While being here, convert also flags parameter to be BdrvRequestFlags.

Now let's consider all callers. Simple

  git grep '\->bdrv_\(aio\|co\)_pwritev\(_part\)\?'

shows that's there three callers of driver function:

 bdrv_driver_pwritev() and bdrv_driver_pwritev_compressed() in
 block/io.c, both pass int64_t, checked by bdrv_check_qiov_request() to
 be non-negative.

 qcow2_save_vmstate() does bdrv_check_qiov_request().

Still, the functions may be called directly, not only by drv->...
Let's check:

git grep '\.bdrv_\(aio\|co\)_pwritev\(_part\)\?\s*=' | \
awk '{print $4}' | sed 's/,//' | sed 's/&//' | sort | uniq | \
while read func; do git grep "$func(" | \
grep -v "$func(BlockDriverState"; done

shows several callers:

qcow2:
  qcow2_co_truncate() write at most up to @offset, which is checked in
    generic qcow2_co_truncate() by bdrv_check_request().
  qcow2_co_pwritev_compressed_task() pass the request (or part of the
    request) that already went through normal write path, so it should
    be OK

qcow:
  qcow_co_pwritev_compressed() pass int64_t, it's updated by this patch

quorum:
  quorum_co_pwrite_zeroes() pass int64_t and int - OK

throttle:
  throttle_co_pwritev_compressed() pass int64_t, it's updated by this
  patch

vmdk:
  vmdk_co_pwritev_compressed() pass int64_t, it's updated by this
  patch

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:31 -05:00
Vladimir Sementsov-Ogievskiy
f7ef38dd13 block: use int64_t instead of uint64_t in driver read handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver read handlers parameters which are already 64bit to
signed type.

While being here, convert also flags parameter to be BdrvRequestFlags.

Now let's consider all callers. Simple

  git grep '\->bdrv_\(aio\|co\)_preadv\(_part\)\?'

shows that's there three callers of driver function:

 bdrv_driver_preadv() in block/io.c, passes int64_t, checked by
   bdrv_check_qiov_request() to be non-negative.

 qcow2_load_vmstate() does bdrv_check_qiov_request().

 do_perform_cow_read() has uint64_t argument. And a lot of things in
 qcow2 driver are uint64_t, so converting it is big job. But we must
 not work with requests that don't satisfy bdrv_check_qiov_request(),
 so let's just assert it here.

Still, the functions may be called directly, not only by drv->...
Let's check:

git grep '\.bdrv_\(aio\|co\)_preadv\(_part\)\?\s*=' | \
awk '{print $4}' | sed 's/,//' | sed 's/&//' | sort | uniq | \
while read func; do git grep "$func(" | \
grep -v "$func(BlockDriverState"; done

The only one such caller:

    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1);
    ...
    ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0);

in tests/unit/test-bdrv-drain.c, and it's OK obviously.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: fix typos]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:31 -05:00
Vladimir Sementsov-Ogievskiy
558902cc3d qcow2: check request on vmstate save/load path
We modify the request by adding an offset to vmstate. Let's check the
modified request. It will help us to safely move .bdrv_co_preadv_part
and .bdrv_co_pwritev_part to int64_t type of offset and bytes.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-3-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:31 -05:00
Vladimir Sementsov-Ogievskiy
b984b2968b block/io: bring request check to bdrv_co_(read,write)v_vmstate
Only qcow2 driver supports vmstate.
In qcow2 these requests go through .bdrv_co_p{read,write}v_part
handlers.

So, let's do our basic check for the request on vmstate generic
handlers.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-2-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:31 -05:00
Peter Maydell
d1fe59377b Trivial patches pull request 20210916
-----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEzS913cjjpNwuT1Fz8ww4vT8vvjwFAmFDUCMSHGxhdXJlbnRA
 dml2aWVyLmV1AAoJEPMMOL0/L748Oi0P/0Osr0C7A/movW2TCCIXtA/a3vjEtE6s
 FpWlmLtjTT6bODu8GjAq2QszByb0K5aKap55y1FEn3cejueXdbrZLCMBKh7Rgjr8
 iu8tkJMBR0WtC8vIMoSDNbmGmOl8gXAsWwtAB24nRjXODT8SePShqsXbGFK0BsbL
 88duxEqwVMtQnHESa0v3gPt+vYNrZC3p2zBN3pxR7RTGEQKa3kMAHTqu5dm/OtAB
 dhNVT79HMagP1Nm+NqPqMbPW5Jwh9DPNWJ4Pf4cKvHKxyE9fnwvUNiNqiBTah/oK
 B/LPLfru6f5ndKYvpGddex+0ToAqfd/J/W2m6qo5fp8c4XyTn77prNJauM7ij+tu
 72hCATsFVuaSJo1hgs70boo00rLd4JifumQRPahekAQBudHSy28K76oC7Km42cyG
 S01zkJ7cVo9fBj0csj95P5tegGznmXsklk0bQf78lwJAcf4TnAL9bh4cRNi/0Xle
 sRQsBB5DGJbn32t+Oo3g+AHP448uHQPBT/CuwN4S3C+fguAaY3wBAd9GLwJSZeFW
 lg8UejcZQ6b4fY+W3Q7fr4tIRqALbqFhAltQqnOqXZuNGNtup6E/SZEZZCjtIRyz
 +RsNBTrb8MvbaHGwzXYfLWUUPUGrD/Tea+CpKX5IpVXIlAgNzKejY/urmq4kWA72
 NYLVtXDNRw3A
 =z2sk
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/vivier2/tags/trivial-branch-for-6.2-pull-request' into staging

Trivial patches pull request 20210916

# gpg: Signature made Thu 16 Sep 2021 15:09:39 BST
# gpg:                using RSA key CD2F75DDC8E3A4DC2E4F5173F30C38BD3F2FBE3C
# gpg:                issuer "laurent@vivier.eu"
# gpg: Good signature from "Laurent Vivier <lvivier@redhat.com>" [full]
# gpg:                 aka "Laurent Vivier <laurent@vivier.eu>" [full]
# gpg:                 aka "Laurent Vivier (Red Hat) <lvivier@redhat.com>" [full]
# Primary key fingerprint: CD2F 75DD C8E3 A4DC 2E4F  5173 F30C 38BD 3F2F BE3C

* remotes/vivier2/tags/trivial-branch-for-6.2-pull-request:
  target/sparc: Make sparc_cpu_dump_state() static
  target/avr: Fix compiler errors (-Werror=enum-conversion)
  hw/vfio: Fix typo in comments
  intel_iommu: Fix typo in comments
  target/i386: spelling: occured=>occurred, mininum=>minimum
  configure: add missing pc-bios/qemu_vga.ndrv symlink in build tree
  spelling: sytem => system
  qdev: Complete qdev_init_gpio_out() documentation
  hw/i386/acpi-build: Fix a typo
  util: Remove redundant checks in the openpty()

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-09-16 16:02:31 +01:00
Vladimir Sementsov-Ogievskiy
8fba395151 qcow2-refcount: check_refblocks(): add separate message for reserved
Split checking for reserved bits out of aligned offset check.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-11-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 18:42:38 +02:00
Vladimir Sementsov-Ogievskiy
98bc07d6cd qcow2-refcount: check_refcounts_l1(): check reserved bits
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-10-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 18:42:38 +02:00
Vladimir Sementsov-Ogievskiy
cd6efd60e9 qcow2-refcount: improve style of check_refcounts_l1()
- use g_autofree for l1_table
 - better name for size in bytes variable
 - reduce code blocks nesting
 - whitespaces, braces, newlines

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-9-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 18:42:38 +02:00
Vladimir Sementsov-Ogievskiy
289ef5f219 qcow2-refcount: check_refcounts_l2(): check reserved bits
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-8-vsementsov@virtuozzo.com>
[hreitz: Separated `type` declaration from statements]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 18:42:38 +02:00
Vladimir Sementsov-Ogievskiy
9631c7822e qcow2-refcount: check_refcounts_l2(): check l2_bitmap
Check subcluster bitmap of the l2 entry for different types of
clusters:

 - for compressed it must be zero
 - for allocated check consistency of two parts of the bitmap
 - for unallocated all subclusters should be unallocated
   (or zero-plain)

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Tested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Message-Id: <20210914122454.141075-7-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 18:42:38 +02:00
Vladimir Sementsov-Ogievskiy
5c3216c046 qcow2-refcount: fix_l2_entry_by_zero(): also zero L2 entry bitmap
We'll reuse the function to fix wrong L2 entry bitmap. Support it now.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-6-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 18:42:38 +02:00
Vladimir Sementsov-Ogievskiy
a2debf6506 qcow2-refcount: introduce fix_l2_entry_by_zero()
Split fix_l2_entry_by_zero() out of check_refcounts_l2() to be
reused in further patch.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-5-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 18:42:38 +02:00
Vladimir Sementsov-Ogievskiy
a6e098462b qcow2: introduce qcow2_parse_compressed_l2_entry() helper
Add helper to parse compressed l2_entry and use it everywhere instead
of open-coding.

Note, that in most places we move to precise coffset/csize instead of
sector-aligned. Still it should work good enough for updating
refcounts.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-4-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 18:42:38 +02:00
Vladimir Sementsov-Ogievskiy
9a3978a46b qcow2: compressed read: simplify cluster descriptor passing
Let's pass the whole L2 entry and not bother with
L2E_COMPRESSED_OFFSET_SIZE_MASK.

It also helps further refactoring that adds generic
qcow2_parse_compressed_l2_entry() helper.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-3-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 18:42:38 +02:00
Vladimir Sementsov-Ogievskiy
786c22d9c2 qcow2-refcount: improve style of check_refcounts_l2()
- don't use same name for size in bytes and in entries
 - use g_autofree for l2_table
 - add whitespace
 - fix block comment style

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-2-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 18:42:38 +02:00
Vladimir Sementsov-Ogievskiy
ff812c5563 qcow2: handle_dependencies(): relax conflict detection
There is no conflict and no dependency if we have parallel writes to
different subclusters of one cluster when the cluster itself is already
allocated. So, relax extra dependency.

Measure performance:
First, prepare build/qemu-img-old and build/qemu-img-new images.

cd scripts/simplebench
./img_bench_templater.py

Paste the following to stdin of running script:

qemu_img=../../build/qemu-img-{old|new}
$qemu_img create -f qcow2 -o extended_l2=on /ssd/x.qcow2 1G
$qemu_img bench -c 100000 -d 8 [-s 2K|-s 2K -o 512|-s $((1024*2+512))] \
        -w -t none -n /ssd/x.qcow2

The result:

All results are in seconds

------------------  ---------  ---------
                    old        new
-s 2K               6.7 ± 15%  6.2 ± 12%
                                 -7%
-s 2K -o 512        13 ± 3%    11 ± 5%
                                 -16%
-s $((1024*2+512))  9.5 ± 4%   8.4
                                 -12%
------------------  ---------  ---------

So small writes are more independent now and that helps to keep deeper
io queue which improves performance.

271 iotest output becomes racy for three allocation in one cluster.
Second and third writes may finish in different order. Second and
third requests don't depend on each other any more. Still they both
depend on first request anyway. Filter out second and third write
offsets to cover both possible outputs.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210824101517.59802-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
[hreitz: s/ an / and /]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 15:54:07 +02:00
Vladimir Sementsov-Ogievskiy
6d207d3501 qcow2: refactor handle_dependencies() loop body
No logic change, just prepare for the following commit. While being
here do also small grammar fix in a comment.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210824101517.59802-3-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 15:54:07 +02:00
Stefano Garzarella
66fed30c9c block/mirror: fix NULL pointer dereference in mirror_wait_on_conflicts()
In mirror_iteration() we call mirror_wait_on_conflicts() with
`self` parameter set to NULL.

Starting from commit d44dae1a7c we dereference `self` pointer in
mirror_wait_on_conflicts() without checks if it is not NULL.

Backtrace:
  Program terminated with signal SIGSEGV, Segmentation fault.
  #0  mirror_wait_on_conflicts (self=0x0, s=<optimized out>, offset=<optimized out>, bytes=<optimized out>)
      at ../block/mirror.c:172
  172	                self->waiting_for_op = op;
  [Current thread is 1 (Thread 0x7f0908931ec0 (LWP 380249))]
  (gdb) bt
  #0  mirror_wait_on_conflicts (self=0x0, s=<optimized out>, offset=<optimized out>, bytes=<optimized out>)
      at ../block/mirror.c:172
  #1  0x00005610c5d9d631 in mirror_run (job=0x5610c76a2c00, errp=<optimized out>) at ../block/mirror.c:491
  #2  0x00005610c5d58726 in job_co_entry (opaque=0x5610c76a2c00) at ../job.c:917
  #3  0x00005610c5f046c6 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>)
      at ../util/coroutine-ucontext.c:173
  #4  0x00007f0909975820 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
      from /usr/lib64/libc.so.6

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2001404
Fixes: d44dae1a7c ("block/mirror: fix active mirror dead-lock in mirror_wait_on_conflicts")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20210910124533.288318-1-sgarzare@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 15:54:07 +02:00