By adding acquire/release pairs, we ensure that .ret and .error_is_read
fields are written by block_copy_dirty_clusters before .finished is true,
and that they are read by API user after .finished is true.
The atomic here are necessary because the fields are concurrently modified
in coroutines, and read outside.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20210624072043.180494-6-eesposit@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Group various structures fields, to better understand what we need to
protect with a lock and what doesn't need it.
Then, add a CoMutex to protect concurrent access of block-copy
data structures. This mutex also protects .copy_bitmap, because its thread-safe
API does not prevent it from assigning two tasks to the same
bitmap region.
Exceptions to the lock:
- .sleep_state is handled in the series "coroutine: new sleep/wake API"
and thus here left as TODO.
- .finished, .cancelled and reads to .ret and .error_is_read will be
protected in the following patch, because are used also outside
coroutines.
- .skip_unallocated is atomic. Including it under the mutex would
increase the critical sections and make them also much more complex.
We can have it as atomic since it is only written from outside and
read by block-copy coroutines.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20210624072043.180494-5-eesposit@redhat.com>
[vsementsov: fix typo in comment]
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Moving this function in task_end ensures to update the progress
anyways, even if there is an error.
It also helps in next patch, allowing task_end to have only
one critical section.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20210624072043.180494-4-eesposit@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Put the logic to determine the copy size in a separate function, so
that there is a simple state machine for the possible methods of
copying data from one BlockDriverState to the other.
Use .method instead of .copy_range as in-out argument, and
include also .zeroes as an additional copy method.
While at it, store the common computation of block_copy_max_transfer
into a new field of BlockCopyState, and make sure that we always
obey max_transfer; that's more efficient even for the
COPY_RANGE_READ_WRITE case.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210624072043.180494-3-eesposit@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Use a local variable instead of referencing BlockCopyState through a
BlockCopyCallState or BlockCopyTask every time.
This is in preparation for next patches.
No functional change intended.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20210624072043.180494-2-eesposit@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Progressmeter is protected by the AioContext mutex, which
is taken by the block jobs and their caller (like blockdev).
We would like to remove the dependency of block layer code on the
AioContext mutex, since most drivers and the core I/O code are already
not relying on it.
Create a new C file to implement the ProgressMeter API, but keep the
struct as public, to avoid forcing allocation on the heap.
Also add a mutex to be able to provide an accurate snapshot of the
progress values to the caller.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20210614081130.22134-5-eesposit@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20210614081130.22134-3-eesposit@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Similar to other handle_aiocb_* functions, handle_aiocb_ioctl needs to cater
for the possibility that ioctl is interrupted by a signal. Otherwise, the
I/O is incorrectly reported as a failure to the guest.
Reported-by: Gordon Watson <gwatson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
iOS hosts do not have these defined so we fallback to the
default behaviour.
Co-authored-by: Warner Losh <imp@bsdimp.com>
Signed-off-by: Joelle van Dyne <j@getutm.app>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Try all the possible ioctls for disk size as long as they are
supported, to keep the #if ladder simple.
Extracted and cleaned up from a patch by Joelle van Dyne and
Warner Losh.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On Darwin (iOS), there are no system level APIs for directly accessing
host block devices. We detect this at configure time.
Signed-off-by: Joelle van Dyne <j@getutm.app>
Message-Id: <20210315180341.31638-2-j@getutm.app>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
bs->sg is only true for character devices, but block devices can also
be used with scsi-block and scsi-generic. Unfortunately BLKSECTGET
returns bytes in an int for /dev/sgN devices, and sectors in a short
for block devices, so account for that in the code.
The maximum transfer also need not be a power of 2 (for example I have
seen disks with 1280 KiB maximum transfer) so there's no need to pass
the result through pow2floor.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For block host devices, I/O can happen through either the kernel file
descriptor I/O system calls (preadv/pwritev, io_submit, io_uring)
or the SCSI passthrough ioctl SG_IO.
In the latter case, the size of each transfer can be limited by the
HBA, while for file descriptor I/O the kernel is able to split and
merge I/O in smaller pieces as needed. Applying the HBA limits to
file descriptor I/O results in more system calls and suboptimal
performance, so this patch splits the max_transfer limit in two:
max_transfer remains valid and is used in general, while max_hw_transfer
is limited to the maximum hardware size. max_hw_transfer can then be
included by the scsi-generic driver in the block limits page, to ensure
that the stricter hardware limit is used.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Block device requests must be aligned to bs->bl.request_alignment.
It makes sense for drivers to align bs->bl.max_transfer the same
way; however when there is no specified limit, blk_get_max_transfer
just returns INT_MAX. Since the contract of the function does not
specify that INT_MAX means "no maximum", just align the outcome
of the function (whether INT_MAX or bs->bl.max_transfer) before
returning it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
I/O to a disk via read/write is not limited by the number of segments allowed
by the host adapter; the kernel can split requests if needed, and the limit
imposed by the host adapter can be very low (256k or so) to avoid that SG_IO
returns EINVAL if memory is heavily fragmented.
Since this value is only interesting for SG_IO-based I/O, do not include
it in the max_transfer and only take it into account when patching the
block limits VPD page in the scsi-generic device.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Even though it was only called for devices that have bs->sg set (which
must be character devices), sg_get_max_segments looked at /sys/dev/block
which only works for block devices.
On Linux the sg driver has its own way to provide the maximum number of
iovecs in a scatter/gather list, so add support for it. The block device
path is kept because it will be reinstated in the next patches.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
In the bdrv_snapshot_goto() fallback code, we work with a pointer to
either bs->file or bs->backing. We detach that child, close the node
(with .bdrv_close()), apply the snapshot on the child node, and then
re-open the node (with .bdrv_open()).
In order for .bdrv_open() to attach the same child node that we had
before, we pass "file={child-node}" or "backing={child-node}" to it.
Therefore, when .bdrv_open() has returned success, we can assume that
bs->file or bs->backing (respectively) points to our original child
again. This is verified by an assertion.
All of this is not immediately clear from a quick glance at the code,
so add a comment to the assertion what it is for, and why it is valid.
It certainly confused Coverity.
Reported-by: Coverity (CID 1452774)
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210503095418.31521-1-mreitz@redhat.com>
[mreitz: s/close/detach/]
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
req->receiving is a flag of request being in one concrete yield point
in nbd_co_do_receive_one_chunk().
Such kind of boolean flag is always better to unset before scheduling
the coroutine, to avoid double scheduling. So, let's be more careful.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-33-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We already have two similar helpers for other state. Let's add another
one for convenience.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-32-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
The only last step we need to reuse the function is coroutine-wrapper.
nbd_open() may be called from non-coroutine context. So, generate the
wrapper and use it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-31-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We'll need a possibility of non-blocking nbd_co_establish_connection(),
so that it returns immediately, and it returns success only if a
connections was previously established in background.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-30-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Split out the part that we want to reuse for nbd_open().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-29-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
block/nbd doesn't need underlying sioc channel anymore. So, we can
update nbd/client-connection interface to return only one top-most io
channel, which is more straight forward.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-27-vsementsov@virtuozzo.com>
[eblake: squash in Vladimir's fixes for uninit usage caught by clang]
Signed-off-by: Eric Blake <eblake@redhat.com>
Currently sioc pointer is used just to pass from socket-connection to
nbd negotiation. Drop the field, and use local variables instead. With
next commit we'll update nbd/client-connection.c to behave
appropriately (return only top-most ioc, not two channels).
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-26-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Negotiation during reconnect is now done in a thread, and s->sioc is
not available during negotiation. Negotiation in thread will be
cancelled by nbd_client_connection_release() called from
nbd_clear_bdrvstate(). So, we don't need this code chunk anymore.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-25-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Now that we can opt in to negotiation as part of the client connection
thread, use that to simplify connection_co. This is another step on
the way to moving all reconnect code into NBDClientConnection.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-24-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
To be reused in the following patch.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-23-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Add arguments and logic to support nbd negotiation in the same thread
after successful connection.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-20-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We now have bs-independent connection API, which consists of four
functions:
nbd_client_connection_new()
nbd_client_connection_release()
nbd_co_establish_connection()
nbd_co_establish_connection_cancel()
Move them to a separate file together with NBDClientConnection
structure which becomes private to the new API.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-18-vsementsov@virtuozzo.com>
[eblake: comment tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
This is a last step of creating bs-independent nbd connection
interface. With next commit we can finally move it to separate file.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-17-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
This is a step of creating bs-independent nbd connection interface.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-16-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We are going to move the connection code to its own file, and want
clear names and APIs first.
The structure is shared between user and (possibly) several runs of
connect-thread. So it's wrong to call it "thread". Let's rename to
something more generic.
Appropriately rename connect_thread and thr variables to conn.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-15-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
nbd_co_establish_connection_cancel() actually needs only pointer to
NBDConnectThread. So, make it clean.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-14-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We are going to split connection code to a separate file. Now we are
ready to give nbd_co_establish_connection() clean and bs-independent
interface.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-13-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We don't need all these states. The code refactored to use two boolean
variables looks simpler.
While moving the comment in nbd_co_establish_connection() rework it to
give better information. Also, we are going to move the connection code
to separate file and mentioning drained section would be confusing.
Improve also the comment in NBDConnectThread, while dropping removed
state names from it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-12-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: comment tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>
Instead of managing connect_bh, bh_ctx, and wait_connect fields, we
can use a single link to the waiting coroutine with proper mutex
protection.
So new logic is:
nbd_co_establish_connection() sets wait_co under the mutex, releases
the mutex, then yield()s. Note that wait_co may be scheduled by the
thread immediately after unlocking the mutex. Still, the main thread
(or iothread) will not reach the code for entering the coroutine until
the yield(), so we are safe.
connect_thread_func() and nbd_co_establish_connection_cancel() do
the following to handle wait_co:
Under the mutex, if thr->wait_co is not NULL, make it NULL and
schedule it. This way, we avoid scheduling the coroutine twice.
Still scheduling is a bit different:
In connect_thread_func() we can just call aio_co_wake under mutex,
after commit
[async: the main AioContext is only "current" if under the BQL]
we are sure that aio_co_wake() will not try to acquire the aio context
and do qemu_aio_coroutine_enter() but simply schedule the coroutine by
aio_co_schedule().
nbd_co_establish_connection_cancel() will be called from non-coroutine
context in further patch and will be able to go through
qemu_aio_coroutine_enter() path of aio_co_wake(). So keep current
behavior of waking the coroutine after the critical section.
Also, this commit reduces the dependence of
nbd_co_establish_connection() on the internals of bs (we now use a
generic pointer to the coroutine, instead of direct use of
s->connection_co). This is a step towards splitting the connection
API out of nbd.c.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-11-vsementsov@virtuozzo.com>
Reviewied-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
These fields are write-only. Drop them.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-10-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Roman Kagan <rvkagan@yandex-team.ru>
Message-Id: <20210610100802.5888-9-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Simplify lifetime management of BDRVNBDState->connect_thread by
delaying the possible cleanup of it until the BDRVNBDState itself goes
away.
This also reverts
0267101af6 "block/nbd: fix possible use after free of s->connect_thread"
as now s->connect_thread can't be cleared until the very end.
Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
[vsementsov: rebase, revert 0267101af6 changes]
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[eblake: tweak comment]
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-8-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Detecting monitor by current coroutine works bad when we are not in
coroutine context. And that's exactly so in nbd reconnect code, where
qio_channel_socket_connect_sync() is called from thread.
Monitor is needed only to parse named file descriptor. So, let's just
parse it during nbd_open(), so that all further users of s->saddr don't
need to access monitor.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-7-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
nbd_open() does it (through nbd_establish_connection()).
Actually we lost that call on reconnect path in 1dc4718d84
"block/nbd: use non-blocking connect: fix vm hang on connect()"
when we have introduced reconnect thread.
Fixes: 1dc4718d84
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We have two "return error" paths in nbd_open() after
nbd_process_options(). Actually we should call nbd_clear_bdrvstate()
on these paths. Interesting that nbd_process_options() calls
nbd_clear_bdrvstate() by itself.
Let's fix leaks and refactor things to be more obvious:
- intialize yank at top of nbd_open()
- move yank cleanup to nbd_clear_bdrvstate()
- refactor nbd_open() so that all failure paths except for
yank-register goes through nbd_clear_bdrvstate()
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-4-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
nbd_free_connect_thread leaks the channel object if it hasn't been
stolen.
Unref it and fix the leak.
Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-3-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
The GDateTime APIs provided by GLib avoid portability pitfalls, such
as some platforms where 'struct timeval.tv_sec' field is still 'long'
instead of 'time_t'. When combined with automatic cleanup, GDateTime
often results in simpler code too.
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Connor Kuehl <ckuehl@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
A flush failure is a critical failure scenario for some operations.
For example, it will prevent migration from completing, as it will
make vm_stop() report an error. Thus it is important to have a
trace point present for debugging.
Reviewed-by: Connor Kuehl <ckuehl@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
When fdatasync() fails on a file backend we set a flag that
short-circuits any future attempts to call fdatasync(). The
first failure returns the true errno, but the later short-
circuited calls return a generic EIO. The latter is unhelpful
because fdatasync() can return a variety of errnos, including
EACCESS.
Reviewed-by: Connor Kuehl <ckuehl@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Depending on the configuration of QEMU, some binaries might not need libm
at all. In that case libiscsi, which uses exp(), will fail to load.
Link it in the module explicitly.
Reported-by: Yi Sun <yisun@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Right now all dependencies for block modules are passed to
module_ss.add(when: ...), so they are mandatory. In the next patch we
will need to add a libm dependency to a module, but libm does not exist
on all systems. So, modify the creation of module_ss and modsrc so that
dependencies can also be passed to module_ss.add(if_true: ...).
While touching the array, remove the useless dependency of the curl
module on glib. glib is always linked in QEMU and in fact all other
block modules also need it, but they don't have to specify it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- NBD server: Fix crashes related to switching between AioContexts
- file-posix: Workaround for discard/write_zeroes on buggy filesystems
- Follow-up fixes for the reopen vs. permission changes
- quorum: Fix error handling for flush
- block-copy: Refactor copy_range handling
- docs: Describe how to use 'null-co' block driver
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmC3iy8RHGt3b2xmQHJl
ZGhhdC5jb20ACgkQfwmycsiPL9ZP0hAAuh07CFWLzHCcRC7PBzekSPfzRYYBLDSW
EObJ1Ov4mvz8UZoP6BDJ5QVzLPhel6hXkxTd83B1D7t/Dq+yJYR0z8Kv3USpaVJ4
2U26SsoGQM8BmtVDL1Q8tQ5eDWQ4ykxNx6F2/lKBe1EH1lfaun04Xj1rNh7jpilo
nNmKMDDI1UOkH0lKDR3tqfEV0XQE7o+ZKfPlIbvYMjXk9ZPKUHfjNPGVdCLQVnqH
VJI01hF7eEx1ykSMdlC+TzNoVGG+mCBokGuW0JlUvOpX6FcGnAlxXQx8u53c1I8O
lggZV8b2IbrNBUVwXQHLLrXxjOo+u54Ct9y/gXUTAj8qai+9jVRp60Y1pnCyeIeu
DzFx10xwy04PGleb7AAZ4dT8du2+PTkuyQ9KmlvQ2U4IUcgW124CrDeO7XYr1aif
hCTJPeEDrC9YNU6AQ8rLXrYUtkumSm2zUzU5nZ//i5WH41475/vsmgP5A+Jr457A
Xu0yiI2Gqkr9CNsP9ZzMkNj03oIBhPFuGxiwibLQsy/6UVnaDYS0+rQ8FXYnF5+K
iEpgXe3vKTWxM097kzJMBTDVRMXRa75NtK7KWXMDgVpHTbcv1t1otsn+6dfv+B55
ULJM1ETsyYS0T6BqNglvdytsraSt7JgSF+ZLHbYk3KVDshwnq/0ksgSqHNNA14ca
kYTzHhMgo5w=
=gSq/
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches
- NBD server: Fix crashes related to switching between AioContexts
- file-posix: Workaround for discard/write_zeroes on buggy filesystems
- Follow-up fixes for the reopen vs. permission changes
- quorum: Fix error handling for flush
- block-copy: Refactor copy_range handling
- docs: Describe how to use 'null-co' block driver
# gpg: Signature made Wed 02 Jun 2021 14:44:15 BST
# gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg: issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream:
docs/secure-coding-practices: Describe how to use 'null-co' block driver
block-copy: refactor copy_range handling
block-copy: fix block_copy_task_entry() progress update
nbd/server: Use drained block ops to quiesce the server
block-backend: add drained_poll
block: improve permission conflict error message
block: simplify bdrv_child_user_desc()
block/vvfat: inherit child_vvfat_qcow from child_of_bds
block: improve bdrv_child_get_parent_desc()
block-backend: improve blk_root_get_parent_desc()
block: document child argument of bdrv_attach_child_common()
block/file-posix: Try other fallbacks after invalid FALLOC_FL_ZERO_RANGE
block/file-posix: Fix problem with fallocate(PUNCH_HOLE) on GPFS
block: drop BlockBackendRootState::read_only
block: drop BlockDriverState::read_only
block: consistently use bdrv_is_read_only()
block/vvfat: fix vvfat_child_perm crash
block/vvfat: child_vvfat_qcow: add .get_parent_aio_context, fix crash
qemu-io-cmds: assert that we don't have .perm requested in no-blk case
block/quorum: Provide .bdrv_co_flush instead of .bdrv_co_flush_to_disk
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Currently we update s->use_copy_range and s->copy_size in
block_copy_do_copy().
It's not very good:
1. block_copy_do_copy() is intended to be a simple function, that wraps
bdrv_co_<io> functions for need of block copy. That's why we don't pass
BlockCopyTask into it. So, block_copy_do_copy() is bad place for
manipulation with generic state of block-copy process
2. We are going to make block-copy thread-safe. So, it's good to move
manipulation with state of block-copy to the places where we'll need
critical sections anyway, to not introduce extra synchronization
primitives in block_copy_do_copy().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210528141628.44287-3-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Don't report successful progress on failure, when call_state->ret is
set.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210528141628.44287-2-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow block backends to poll their devices/users to check if they have
been quiesced when entering a drained section.
This will be used in the next patch to wait for the NBD server to be
completely quiesced.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Sergio Lopez <slp@redhat.com>
Message-Id: <20210602060552.17433-2-slp@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Recently we've fixed a crash by adding .get_parent_aio_context handler
to child_vvfat_qcow. Now we want it to support .get_parent_desc as
well. child_vvfat_qcow wants to implement own .inherit_options, it's
not bad. But omitting all other handlers is a bad idea. Let's inherit
the class from child_of_bds instead, similar to chain_child_class and
detach_by_driver_cb_class in test-bdrv-drain.c.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210601075218.79249-5-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We have different types of parents: block nodes, block backends and
jobs. So, it makes sense to specify type together with name.
While being here also use g_autofree.
iotest 307 output is updated.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20210601075218.79249-3-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If fallocate(... FALLOC_FL_ZERO_RANGE ...) returns EINVAL, it's likely
an indication that the file system is buggy and does not implement
unaligned accesses right. We still might be lucky with the other
fallback fallocate() calls later in this function, though, so we should
not return immediately and try the others first.
Since FALLOC_FL_ZERO_RANGE could also return EINVAL if the file descriptor
is not a regular file, we ignore this filesystem bug silently, without
printing an error message for the user.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20210527172020.847617-3-thuth@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A customer reported that running
qemu-img convert -t none -O qcow2 -f qcow2 input.qcow2 output.qcow2
fails for them with the following error message when the images are
stored on a GPFS file system :
qemu-img: error while writing sector 0: Invalid argument
After analyzing the strace output, it seems like the problem is in
handle_aiocb_write_zeroes(): The call to fallocate(FALLOC_FL_PUNCH_HOLE)
returns EINVAL, which can apparently happen if the file system has
a different idea of the granularity of the operation. It's arguably
a bug in GPFS, since the PUNCH_HOLE mode should not result in EINVAL
according to the man-page of fallocate(), but the file system is out
there in production and so we have to deal with it. In commit 294682cc3a
("block: workaround for unaligned byte range in fallocate()") we also
already applied the a work-around for the same problem to the earlier
fallocate(FALLOC_FL_ZERO_RANGE) call, so do it now similar with the
PUNCH_HOLE call. But instead of silently catching and returning
-ENOTSUP (which causes the caller to fall back to writing zeroes),
let's rather inform the user once about the buggy file system and
try the other fallback instead.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20210527172020.847617-2-thuth@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of keeping additional boolean field, let's store the
information in BDRV_O_RDWR bit of BlockBackendRootState::open_flags.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210527154056.70294-4-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It's better to use accessor function instead of bs->read_only directly.
In some places use bdrv_is_writable() instead of
checking both BDRV_O_RDWR set and BDRV_O_INACTIVE not set.
In bdrv_open_common() it's a bit strange to add one more variable, but
we are going to drop bs->read_only in the next patch, so new ro local
variable substitutes it here.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210527154056.70294-2-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It's wrong to rely on s->qcow in vvfat_child_perm, as on permission
update during bdrv_open_child() call this field is not set yet.
Still prior to aa5a04c7db, it didn't
crash, as bdrv_open_child passed NULL as child to bdrv_child_perm(),
and NULL was equal to NULL in assertion (still, it was bad guarantee
for child being s->qcow, not backing :).
Since aa5a04c7db
"add bdrv_attach_child_noperm" bdrv_refresh_perms called on parent node
when attaching child, and new correct child pointer is passed to
.bdrv_child_perm. Still, s->qcow is NULL at the moment. Let's rely only
on role instead.
Without that fix,
./build/qemu-system-x86_64 -usb -device usb-storage,drive=fat16 \
-drive \
file=fat:rw:fat-type=16:"<path of a host folder>",id=fat16,format=raw,if=none
crashes:
(gdb) bt
0 raise () at /lib64/libc.so.6
1 abort () at /lib64/libc.so.6
2 _nl_load_domain.cold () at /lib64/libc.so.6
3 annobin_assert.c_end () at /lib64/libc.so.6
4 vvfat_child_perm (bs=0x559186f3d690, c=0x559186f1ed20, role=3,
reopen_queue=0x0, perm=0, shared=31,
nperm=0x7ffe56f28298, nshared=0x7ffe56f282a0) at
../block/vvfat.c:3214
5 bdrv_child_perm (bs=0x559186f3d690, child_bs=0x559186f60190,
c=0x559186f1ed20, role=3, reopen_queue=0x0,
parent_perm=0, parent_shared=31,
nperm=0x7ffe56f28298, nshared=0x7ffe56f282a0)
at ../block.c:2094
6 bdrv_node_refresh_perm (bs=0x559186f3d690, q=0x0,
tran=0x559186f65850, errp=0x7ffe56f28530) at
../block.c:2336
7 bdrv_list_refresh_perms (list=0x559186db5b90 = {...}, q=0x0,
tran=0x559186f65850, errp=0x7ffe56f28530)
at ../block.c:2358
8 bdrv_refresh_perms (bs=0x559186f3d690, errp=0x7ffe56f28530) at
../block.c:2419
9 bdrv_attach_child
(parent_bs=0x559186f3d690, child_bs=0x559186f60190,
child_name=0x559184d83e3d "write-target",
child_class=0x5591852f3b00 <child_vvfat_qcow>, child_role=3,
errp=0x7ffe56f28530) at ../block.c:2959
10 bdrv_open_child
(filename=0x559186f5cb80 "/var/tmp/vl.7WYmFU",
options=0x559186f66c20, bdref_key=0x559184d83e3d "write-target",
parent=0x559186f3d690, child_class=0x5591852f3b00
<child_vvfat_qcow>, child_role=3, allow_none=false,
errp=0x7ffe56f28530) at ../block.c:3351
11 enable_write_target (bs=0x559186f3d690, errp=0x7ffe56f28530) at
../block/vvfat.c:3177
12 vvfat_open (bs=0x559186f3d690, options=0x559186f42db0, flags=155650,
errp=0x7ffe56f28530) at ../block/vvfat.c:1236
13 bdrv_open_driver (bs=0x559186f3d690, drv=0x5591853d97e0
<bdrv_vvfat>, node_name=0x0,
options=0x559186f42db0, open_flags=155650,
errp=0x7ffe56f28640) at ../block.c:1557
14 bdrv_open_common (bs=0x559186f3d690, file=0x0,
options=0x559186f42db0, errp=0x7ffe56f28640) at
../block.c:1833
...
(gdb) fr 4
#4 vvfat_child_perm (bs=0x559186f3d690, c=0x559186f1ed20, role=3,
reopen_queue=0x0, perm=0, shared=31,
nperm=0x7ffe56f28298, nshared=0x7ffe56f282a0) at
../block/vvfat.c:3214
3214 assert(c == s->qcow || (role & BDRV_CHILD_COW));
(gdb) p role
$1 = 3 # BDRV_CHILD_DATA | BDRV_CHILD_METADATA
(gdb) p *c
$2 = {bs = 0x559186f60190, name = 0x559186f669d0 "write-target", klass
= 0x5591852f3b00 <child_vvfat_qcow>, role = 3, opaque =
0x559186f3d690, perm = 3, shared_perm = 4, frozen = false,
parent_quiesce_counter = 0, next = {le_next = 0x0, le_prev =
0x559186f41818}, next_parent = {le_next = 0x0, le_prev =
0x559186f64320}}
(gdb) p s->qcow
$3 = (BdrvChild *) 0x0
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210524101257.119377-3-vsementsov@virtuozzo.com>
Tested-by: John Arbuckle <programmingkidx@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The quorum block driver uses a custom flush callback to handle the
case when some children return io errors. In that case it still
returns success if enough children are healthy.
However, it provides it as the .bdrv_co_flush_to_disk callback, not
as .bdrv_co_flush. This causes the block layer to do it's own
generic flushing for the children instead, which doesn't handle
errors properly.
Fix this by providing .bdrv_co_flush instead of
.bdrv_co_flush_to_disk so the block layer uses the custom flush
callback.
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Reported-by: Minghao Yuan <meeho@qq.com>
Message-Id: <20210518134214.11ccf05f@gecko.fritz.box>
Tested-by: Zhang Chen <chen.zhang@intel.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It has been over two years since RHEL-8 was released, and thus per the
platform build policy, we no longer need to support RHEL-7 as a build
target. So from the RHEL-7 perspective, we do not have to support
libssh v0.7 anymore now.
Let's look at the versions from other distributions and operating
systems - according to repology.org, current shipping versions are:
RHEL-8: 0.9.4
Debian Buster: 0.8.7
openSUSE Leap 15.2: 0.8.7
Ubuntu LTS 18.04: 0.8.0 *
Ubuntu LTS 20.04: 0.9.3
FreeBSD: 0.9.5
Fedora 33: 0.9.5
Fedora 34: 0.9.5
OpenBSD: 0.9.5
macOS HomeBrew: 0.9.5
HaikuPorts: 0.9.5
* The version of libssh in Ubuntu 18.04 claims to be 0.8.0 from the
name of the package, but in reality it is a 0.7 patched up as a
Frankenstein monster with patches from the 0.8 development branch.
This gave us some headaches in the past already and so it never worked
with QEMU. All attempts to get it supported have failed in the past,
patches for QEMU have never been merged and a request to Ubuntu to
fix it in their 18.04 distro has been ignored:
https://bugs.launchpad.net/ubuntu/+source/libssh/+bug/1847514
Thus we really should ignore the libssh in Ubuntu 18.04 in QEMU, too.
Fix it by bumping the minimum libssh version to something that is
greater than 0.8.0 now. Debian Buster and openSUSE Leap have the
oldest version and so 0.8.7 is the new minimum.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
Message-Id: <20210519155859.344569-1-thuth@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Commit e50caf4a5c ("tracing: convert documentation to rST")
converted docs/devel/tracing.txt to docs/devel/tracing.rst.
We still have several references to the old file, so let's fix them
with the following command:
sed -i s/tracing.txt/tracing.rst/ $(git grep -l docs/devel/tracing.txt)
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210517151702.109066-2-sgarzare@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
The replication.h file is included from migration/colo.c and tests/unit/test-replication.c,
so it should be in include/.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Right now, users of qemu_co_sleep_ns_wakeable are simply passing
a pointer to QemuCoSleepState by reference to the function. But
QemuCoSleepState really is just a Coroutine*; making the
content of the struct public is just as efficient and lets us
skip the user_state_pointer indirection.
Since the usage is changed, take the occasion to rename the
struct to QemuCoSleep.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20210517100548.28806-6-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
All callers of qemu_co_sleep_wake are checking whether they are passing
a NULL argument inside the pointer-to-pointer: do the check in
qemu_co_sleep_wake itself.
As a side effect, qemu_co_sleep_wake can be called more than once and
it will only wake the coroutine once; after the first time, the argument
will be set to NULL via *sleep_state->user_state_pointer. However, this
would not be safe unless co_sleep_cb keeps using the QemuCoSleepState*
directly, so make it go through the pointer-to-pointer instead.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20210517100548.28806-4-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The checks in vu_blk_sect_range_ok() assume VIRTIO_BLK_SECTOR_SIZE is
equal to BDRV_SECTOR_SIZE. This is true, but let's add a
QEMU_BUILD_BUG_ON() to make it explicit.
We might as well check that the request buffer size is a multiple of
VIRTIO_BLK_SECTOR_SIZE while we're at it.
Suggested-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20210331142727.391465-1-stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Set bdi->is_dirty, so that qemu-img info could show dirty flag.
After this commit the following check will show '"dirty-flag": true':
./build/qemu-img create -f qcow2 -o lazy_refcounts=on x 1M
./build/qemu-io x
qemu-io> write 0 1M
After "write" command success, kill the qemu-io process:
kill -9 <qemu-io pid>
./build/qemu-img info --output=json x
This will show '"dirty-flag": true' among other things. (before this
commit it shows '"dirty-flag": false')
Note, that qcow2's dirty-bit is not a "dirty bit for the image". It
only protects qcow2 lazy refcounts feature. So, there are a lot of
conditions when qcow2 session may be not closed correctly, but bit is
0. Still, when bit is set, the last session is definitely not finished
correctly and it's better to report it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210504160656.462836-1-vsementsov@virtuozzo.com>
Tested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
"qemu/typedefs.h" is enough for include/block/write-threshold.h header
with forward declaration of BlockDriverState. Also drop extra includes
from block/write-threshold.c and tests/unit/test-write-threshold.c
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210506090621.11848-9-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
bdrv_write_threshold_exceeded() is unused.
bdrv_write_threshold_is_set() is used only to double check the value of
bs->write_threshold_offset in tests. No real sense in it (both tests do
check real value with help of bdrv_write_threshold_get())
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[mreitz: Adjusted commit message as per Eric's suggestion]
Signed-off-by: Max Reitz <mreitz@redhat.com>
They are unused now.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-3-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
write-notifiers are used only for write-threshold. New code for such
purpose should create filters.
Let's better special-case write-threshold and drop write notifiers at
all. (Actually, write-threshold is special-cased anyway, as the only
user of write-notifiers)
So, create a new direct interface for bdrv_co_write_req_prepare() and
drop all write-notifier related logic from write-threshold.c.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[mreitz: Adjusted comment as per Eric's suggestion]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Now, after huge update of block graph permission update algorithm, we
don't need this workaround with active state of the filter. Drop it and
use new smart bdrv_drop_filter() function.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210506194143.394141-1-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
If mirror is READY than cancel operation is not discarding the whole
result of the operation, but instead it's a documented way get a
point-in-time snapshot of source disk.
So, we should not cancel any requests if mirror is READ and
force=false. Let's fix that case.
Note, that bug that we have before this commit is not critical, as the
only .bdrv_cancel_in_flight implementation is nbd_cancel_in_flight()
and it cancels only requests waiting for reconnection, so it should be
rare case.
Fixes: 521ff8b779
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210421075858.40197-1-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Max reported the following bug:
$ ./qemu-img create -f raw src.img 1G
$ ./qemu-img create -f raw dst.img 1G
$ (echo '
{"execute":"qmp_capabilities"}
{"execute":"blockdev-mirror",
"arguments":{"job-id":"mirror",
"device":"source",
"target":"target",
"sync":"full",
"filter-node-name":"mirror-top"}}
'; sleep 3; echo '
{"execute":"human-monitor-command",
"arguments":{"command-line":
"qemu-io mirror-top \"write 0 1G\""}}') \
| x86_64-softmmu/qemu-system-x86_64 \
-qmp stdio \
-blockdev file,node-name=source,filename=src.img \
-blockdev file,node-name=target,filename=dst.img \
-object iothread,id=iothr0 \
-device virtio-blk,drive=source,iothread=iothr0
crashes:
0 raise () at /usr/lib/libc.so.6
1 abort () at /usr/lib/libc.so.6
2 error_exit
(err=<optimized out>,
msg=msg@entry=0x55fbb1634790 <__func__.27> "qemu_mutex_unlock_impl")
at ../util/qemu-thread-posix.c:37
3 qemu_mutex_unlock_impl
(mutex=mutex@entry=0x55fbb25ab6e0,
file=file@entry=0x55fbb1636957 "../util/async.c",
line=line@entry=650)
at ../util/qemu-thread-posix.c:109
4 aio_context_release (ctx=ctx@entry=0x55fbb25ab680) at ../util/async.c:650
5 bdrv_do_drained_begin
(bs=bs@entry=0x55fbb3a87000, recursive=recursive@entry=false,
parent=parent@entry=0x0,
ignore_bds_parents=ignore_bds_parents@entry=false,
poll=poll@entry=true) at ../block/io.c:441
6 bdrv_do_drained_begin
(poll=true, ignore_bds_parents=false, parent=0x0, recursive=false,
bs=0x55fbb3a87000) at ../block/io.c:448
7 blk_drain (blk=0x55fbb26c5a00) at ../block/block-backend.c:1718
8 blk_unref (blk=0x55fbb26c5a00) at ../block/block-backend.c:498
9 blk_unref (blk=0x55fbb26c5a00) at ../block/block-backend.c:491
10 hmp_qemu_io (mon=0x7fffaf3fc7d0, qdict=<optimized out>)
at ../block/monitor/block-hmp-cmds.c:628
man pthread_mutex_unlock
...
EPERM The mutex type is PTHREAD_MUTEX_ERRORCHECK or
PTHREAD_MUTEX_RECURSIVE, or the mutex is a robust mutex, and the
current thread does not own the mutex.
So, thread doesn't own the mutex. And we have iothread here.
Next, note that AIO_WAIT_WHILE() documents that ctx must be acquired
exactly once by caller. But where is it acquired in the call stack?
Seems nowhere.
qemuio_command do acquire aio context.. But we need context acquired
around blk_unref() as well and actually around blk_insert_bs() too.
Let's refactor qemuio_command so that it doesn't acquire aio context
but callers do that instead. This way we can cleanly acquire aio
context in hmp_qemu_io() around all three calls.
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210423134233.51495-1-vsementsov@virtuozzo.com>
[mreitz: Fixed comment]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Sometimes the parser needs to further split a token it has collected
from the token input stream. Right now, it does a cursory check to see
if the relevant characters appear in the token to determine if it should
break it down further.
However, qemu_rbd_next_tok() will escape characters as it removes tokens
from the token stream and plain strchr() won't. This can make the
initial strchr() check slightly misleading since it implies
qemu_rbd_next_tok() will find the token and split on it, except the
reality is that qemu_rbd_next_tok() will pass over it if it is escaped.
Use a custom strchr to avoid mixing escaped and unescaped string
operations. Furthermore, this code is identical to how
qemu_rbd_next_tok() seeks its next token, so incorporate this custom
strchr into the body of that function to reduce duplication.
Reported-by: Han Han <hhan@redhat.com>
Fixes: https://bugzilla.redhat.com/1873913
Signed-off-by: Connor Kuehl <ckuehl@redhat.com>
Message-Id: <20210421212343.85524-3-ckuehl@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
It was deprecated in commit e1c4269763, v5.2.0. See that commit
message for rationale.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20210501075747.3293186-1-armbru@redhat.com>
ACKed-by: Peter Krempa <pkrempa@redhat.com>
Right now, rate limiting is protected by the AioContext mutex, which is
taken for example both by the block jobs and by qmp_block_job_set_speed
(via find_block_job).
We would like to remove the dependency of block layer code on the
AioContext mutex, since most drivers and the core I/O code are already
not relying on it. However, there is no existing lock that can easily
be taken by both ratelimit_set_speed and ratelimit_calculate_delay,
especially because the latter might run in coroutine context (and
therefore under a CoMutex) but the former will not.
Since concurrent calls to ratelimit_calculate_delay are not possible,
one idea could be to use a seqlock to get a snapshot of slice_ns and
slice_quota. But for now keep it simple, and just add a mutex to the
RateLimit struct; block jobs are generally not performance critical to
the point of optimizing the clock cycles spent in synchronization.
This also requires the introduction of init/destroy functions, so
add them to the two users of ratelimit.h.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Stop including sysemu/sysemu.h in files that don't need it.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20210416171314.2074665-2-thuth@redhat.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Normally, blk_new_open() just shares all permissions. This was fine
originally when permissions only protected against uses in the same
process because no other part of the code would actually get to access
the block nodes opened with blk_new_open(). However, since we use it for
file locking now, unsharing permissions becomes desirable.
Add a new BDRV_O_NO_SHARE flag that is used in blk_new_open() to unshare
any permissions that can be unshared.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210422164344.283389-2-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move bdrv_reopen_multiple to new paradigm of permission update:
first update graph relations, then do refresh the permissions.
We have to modify reopen process in file-posix driver: with new scheme
we don't have prepared permissions in raw_reopen_prepare(), so we
should reconfigure fd in raw_check_perm(). Still this seems more native
and simple anyway.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-31-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
To be used in further commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-28-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We don't need this workaround anymore: bdrv_append is already smart
enough and we can use new bdrv_drop_filter().
This commit efficiently reverts also recent 705dde27c6, which
checked .active on io path. Still it said that the problem should be
theoretical. And the logic of filter removement is changed anyway.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-25-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Passing parent aio context is redundant, as child_class and parent
opaque pointer are enough to retrieve it. Drop the argument and use new
bdrv_child_get_parent_aio_context() interface.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-7-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add new handler to get aio context and implement it in all child
classes. Add corresponding public interface to be used soon.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-6-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We have too much comments for this feature. It seems better just don't
do it. Most of real users (tests don't count) have to create additional
reference.
Drop also comment in external_snapshot_prepare:
- bdrv_append doesn't "remove" old bs in common sense, it sounds
strange
- the fact that bdrv_append can fail is obvious from the context
- the fact that we must rollback all changes in transaction abort is
known (it's the direct role of abort)
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-5-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If on nbd_close() we detach the thread (in
nbd_co_establish_connection_cancel() thr->state becomes
CONNECT_THREAD_RUNNING_DETACHED), after that point we should not use
s->connect_thread (which is set to NULL), as running thread may free it
at any time.
Still nbd_co_establish_connection() does exactly this: it saves
s->connect_thread to local variable (just for better code style) and
use it even after yield point, when thread may be already detached.
Fix that. Also check thr to be non-NULL on
nbd_co_establish_connection() start for safety.
After this patch "case CONNECT_THREAD_RUNNING_DETACHED" becomes
impossible in the second switch in nbd_co_establish_connection().
Still, don't add extra abort() just before the release. If it somehow
possible to reach this "case:" it won't hurt. Anyway, good refactoring
of all this reconnect mess will come soon.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210406155114.1057355-1-vsementsov@virtuozzo.com>
Reviewed-by: Roman Kagan <rvkagan@yandex-team.ru>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Currently, it is impossible to complete jobs on standby (i.e. paused
ready jobs), but actually the only thing in mirror_complete() that does
not work quite well with a paused job is the job_enter() at the end.
If we make it conditional, this function works just fine even if the
mirror job is paused.
So technically this is a no-op, but obviously the intention is to accept
block-job-complete even for jobs on standby, which we need this patch
for first.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210409120422.144040-3-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is a graph change and therefore should be done in job-finalize
(which is what invokes mirror_exit_common()).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210409120422.144040-2-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When we allocate 'q_namespace', we forgot to set 'has_q_namespace'
to true. This can cause several issues, including a memory leak,
since qapi_free_BlockdevCreateOptions() does not deallocate that
memory, as reported by valgrind:
13 bytes in 1 blocks are definitely lost in loss record 7 of 96
at 0x4839809: malloc (vg_replace_malloc.c:307)
by 0x48CEBB8: g_malloc (in /usr/lib64/libglib-2.0.so.0.6600.8)
by 0x48E3FE3: g_strdup (in /usr/lib64/libglib-2.0.so.0.6600.8)
by 0x180010: qemu_rbd_co_create_opts (rbd.c:446)
by 0x1AE72C: bdrv_create_co_entry (block.c:492)
by 0x241902: coroutine_trampoline (coroutine-ucontext.c:173)
by 0x57530AF: ??? (in /usr/lib64/libc-2.32.so)
by 0x1FFEFFFA6F: ???
Fix setting 'has_q_namespace' to true when we allocate 'q_namespace'.
Fixes: 19ae9ae014 ("block/rbd: Add support for ceph namespaces")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20210329150129.121182-3-sgarzare@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In qemu_rbd_connect(), 'mon_host' is allocated by qemu_rbd_mon_host()
using g_strjoinv(), but it's only freed in the error path, leaking
memory in the success path as reported by valgrind:
80 bytes in 4 blocks are definitely lost in loss record 5,028 of 6,516
at 0x4839809: malloc (vg_replace_malloc.c:307)
by 0x5315BB8: g_malloc (in /usr/lib64/libglib-2.0.so.0.6600.8)
by 0x532B6FF: g_strjoinv (in /usr/lib64/libglib-2.0.so.0.6600.8)
by 0x87D07E: qemu_rbd_mon_host (rbd.c:538)
by 0x87D07E: qemu_rbd_connect (rbd.c:562)
by 0x87E1CE: qemu_rbd_open (rbd.c:740)
by 0x840EB1: bdrv_open_driver (block.c:1528)
by 0x8453A9: bdrv_open_common (block.c:1802)
by 0x8453A9: bdrv_open_inherit (block.c:3444)
by 0x8464C2: bdrv_open (block.c:3537)
by 0x8108CD: qmp_blockdev_add (blockdev.c:3569)
by 0x8EA61B: qmp_marshal_blockdev_add (qapi-commands-block-core.c:1086)
by 0x90B528: do_qmp_dispatch_bh (qmp-dispatch.c:131)
by 0x907EA4: aio_bh_poll (async.c:164)
Fix freeing 'mon_host' also when qemu_rbd_connect() ends correctly.
Fixes: 0a55679b4a
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20210329150129.121182-2-sgarzare@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Given that the block size is read from the header of the VDI file, a
wide variety of sizes might be seen. Rather than re-using a block
sized memory region when writing the VDI header, allocate an
appropriately sized buffer.
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Max Reitz <mreitz@redhat.com>
Message-id: 20210325112941.365238-3-pbonzini@redhat.com
Message-Id: <20210309144015.557477-3-david.edmondson@oracle.com>
Acked-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If a new bitmap entry is allocated, requiring the entire block to be
written, avoiding leaking the buffer allocated for the block should
the write fail.
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Max Reitz <mreitz@redhat.com>
Message-id: 20210325112941.365238-2-pbonzini@redhat.com
Message-Id: <20210309144015.557477-2-david.edmondson@oracle.com>
Acked-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Setting the qcow2 data-file-raw bit means that you can ignore the
qcow2 metadata when reading from the external data file. It does not
mean that you have to ignore it, though. Therefore, the data read must
be the same regardless of whether you interpret the metadata or whether
you ignore it, and thus the L1/L2 tables must all be present and give a
1:1 mapping.
This patch changes 244's output: First, the qcow2 file is larger right
after creation, because of metadata preallocation. Second, the qemu-img
map output changes: Everything that was not explicitly discarded or
zeroed is now a data area.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210326145509.163455-2-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
mirror_top currently shares all permissions, and takes only the WRITE
permission (if some parent has taken that permission, too).
That is wrong, though; mirror_top is a filter, so it should take
permissions like any other filter does. For example, if the parent
needs CONSISTENT_READ, we need to take that, too, and if it cannot share
the WRITE permission, we cannot share it either.
The exception is when mirror_top is used for active commit, where we
cannot take CONSISTENT_READ (because it is deliberately unshared above
the base node) and where we must share WRITE (so that it is shared for
all images in the backing chain, so the mirror job can take it for the
target BB).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210211172242.146671-2-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Regular virtual timers are used to emulate timings
related to vCPU and peripheral states. QCOW2 uses timers
to clean the cache. These timers should have external
flag. In the opposite case they affect the execution
and it can't be recorded and replayed.
This patch adds external flag to the timer for qcow2
cache clean.
Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <161700516327.1141158.8366564693714562536.stgit@pasha-ThinkPad-X280>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Command block_passwd always fails since
Commit c01c214b69 "block: remove all encryption handling APIs"
(v2.10.0) turned block_passwd into a stub that always fails, and
hardcoded encryption_key_missing to false in query-named-block-nodes
and query-block.
Commit ad1324e044 "block: remove 'encryption_key_missing' flag from
QAPI" just landed. Complete the cleanup job: remove block_passwd.
Cc: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20210323101951.3686029-1-armbru@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
The vhost-user in-flight shmfd feature has not been tested with
qemu-storage-daemon's vhost-user-blk server. Disable this optional
feature for now because it requires MFD_ALLOW_SEALING, which is not
available in some CI environments.
If we need this feature in the future it can be re-enabled after
testing.
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20210309094106.196911-2-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>