Since qemu currently doesn't flush persistent bitmaps to disk until
shutdown (which might be MUCH later), it's useful if 'query-block'
at least shows WHICH bitmaps will (eventually) make it to persistent
storage. Update affected iotests.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20190204210512.27458-1-eblake@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
In the 'Format specific information' section of the 'qemu-img info'
command output, the supplemental information about existing QCOW2
bitmaps will be shown, such as a bitmap name, flags and granularity:
image: /vz/vmprivate/VM1/harddisk.hdd
file format: qcow2
virtual size: 64G (68719476736 bytes)
disk size: 3.0M
cluster_size: 1048576
Format specific information:
compat: 1.1
lazy refcounts: true
bitmaps:
[0]:
flags:
[0]: in-use
[1]: auto
name: back-up1
granularity: 65536
[1]:
flags:
[0]: in-use
[1]: auto
name: back-up2
granularity: 65536
refcount bits: 16
corrupt: false
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <1549638368-530182-3-git-send-email-andrey.shinkevich@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Inform a user in case qcow2_get_specific_info fails to obtain
QCOW2 image specific information. This patch is preliminary to
the one "qcow2: Add list of bitmaps to ImageInfoSpecificQCow2".
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <1549638368-530182-2-git-send-email-andrey.shinkevich@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
This coroutine will serve nbd reconnects, so, rename it to be something
more generic.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190201130138.94525-7-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We have several paranoid checks for ioc != NULL. But ioc may become
NULL only on close, which should not happen during requests handling.
Also, we check ioc only sometimes, not after each yield, which is
inconsistent. Let's drop these checks. However, for safety, let's leave
asserts instead.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190201130138.94525-6-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Use exported report, not the variable to be reused (should not really
matter).
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190201130138.94525-5-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Split connection code to reuse it for reconnect.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190201130138.94525-4-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Keep all connection code in one file, to be able to implement reconnect
in further patches.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190201130138.94525-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: format tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>
To implement nbd reconnect in further patches, we need to distinguish
error codes, returned by nbd server, from channel errors, to reconnect
only in the latter case.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190201130138.94525-2-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We generally do very similar things around nbd_read: error_prepend
specifying what we have tried to read, and be_to_cpu conversion of
integers.
So, it seems reasonable to move common things to helper functions,
which:
1. simplify code a bit
2. generalize nbd_read error descriptions, all starting with
"Failed to read"
3. make it more difficult to forget to convert things from BE
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190128165830.165170-1-vsementsov@virtuozzo.com>
[eblake: rename macro to DEF_NBD_READ_N and formatting tweaks;
checkpatch has false positive complaint]
Signed-off-by: Eric Blake <eblake@redhat.com>
We define 54 macros for the powers of two >= 1024. We use six, in six
macro definitions. Four of them could just as well use the common MiB
macro, so do that. The remaining two can't, because they get passed
to stringify. Replace the macro by the literal number there.
Slightly harder to read in one instance (1048576 vs. S_1MiB), so add a
comment there. The other instance is a wash: 65536 vs S_64KiB. 65536
has been good enough for more than seven years there.
This effectively reverts commit 540b849261 and 1240ac558d.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The last user of blk_attach_dev_legacy() was the code in xen_disk which
has recently been reworked. Now there is no user for this legacy function
anymore. Thus we can finally remove all code related to the "legacy_dev"
flag, too, and turn the related "void *" in block-backend.c into proper
"DeviceState *" to fix some of the remaining TODOs there.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently qemu_uuid_bswap() takes a pointer to the QemuUUID to
be byte-swapped. This means it can't be used when the UUID
to be swapped is in a packed member of a struct. It's also
out of line with the general bswap*() functions we provide
in bswap.h, which take the value to be swapped and return it.
Make qemu_uuid_bswap() take a QemuUUID and return the swapped version.
This fixes some clang warnings about taking the address of
a packed struct member in block/vdi.c.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Taking the address of a field in a packed struct is a bad idea, because
it might not be actually aligned enough for that pointer type (and
thus cause a crash on dereference on some host architectures). Newer
versions of clang warn about this.
Instead of passing UUID related functions the address of a possibly
unaligned QemuUUID struct, use local variables and then copy to/from
the struct field as appropriate.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Taking the address of a field in a packed struct is a bad idea, because
it might not be actually aligned enough for that pointer type (and
thus cause a crash on dereference on some host architectures). Newer
versions of clang warn about this. Avoid the bug by generating the
UUID into a local variable which is definitely safely aligned and
then copying it into place.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Clarify that the number of extents provided in BlockdevCreateOptionsVmdk
must match the number of extents that will actually be used. Providing
more extents will result in an error now.
This requires adapting the test case to provide the right number of
extents.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
This makes VMDK support blockdev-create. The implementation reuses the
image creation code in vmdk_co_create_opts which now acceptes a callback
pointer to "retrieve" BlockBackend pointers from the caller. This way we
separate the logic between file/extent acquisition and initialization.
The QAPI command parameters are mostly the same as the old create_opts
except the dropped legacy @compat6 switch, which is redundant with
@hwversion.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The extracted vmdk_init_extent takes a BlockBackend object and
initializes the format metadata. It is the common part between "qemu-img
create" and "blockdev-create".
Add a "BlockBackend *pbb" parameter to vmdk_create_extent, to return the
opened BB to the caller in the next patch.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the block layer, synchronous APIs are often implemented by creating a
coroutine that calls the asynchronous coroutine-based implementation and
then waiting for completion with BDRV_POLL_WHILE().
For this to work with iothreads (more specifically, when the synchronous
API is called in a thread that is not the home thread of the block
device, so that the coroutine will run in a different thread), we must
make sure to call aio_wait_kick() at the end of the operation. Many
places are missing this, so that BDRV_POLL_WHILE() keeps hanging even if
the condition has long become false.
Note that bdrv_dec_in_flight() involves an aio_wait_kick() call. This
corresponds to the BDRV_POLL_WHILE() in the drain functions, but it is
generally not enough for most other operations because they haven't set
the return value in the coroutine entry stub yet. To avoid race
conditions there, we need to kick after setting the return value.
The race window is small enough that the problem doesn't usually surface
in the common path. However, it does surface and causes easily
reproducible hangs if the operation can return early before even calling
bdrv_inc/dec_in_flight, which many of them do (trivial error or no-op
success paths).
The bug in bdrv_truncate(), bdrv_check() and bdrv_invalidate_cache() is
slightly different: These functions even neglected to schedule the
coroutine in the home thread of the node. This avoids the hang, but is
obviously wrong, too. Fix those to schedule the coroutine in the right
AioContext in addition to adding aio_wait_kick() calls.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Patch created mechanically by rerunning:
$ spatch --sp-file scripts/coccinelle/qobject.cocci \
--macro-file scripts/cocci-macro-file.h \
--dir block --in-place
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Refcount table entries have a field to store the offset of the
refcount block. The rest of the bits of the entry are currently
reserved.
The offset is always taken from the entry using REFT_OFFSET_MASK to
ensure that we only use the bits that belong to that field.
While that mask is used every time we read from the refcount table, it
is never used when we write to it. Due to the other constraints of the
qcow2 format QEMU can never produce refcount block offsets that don't
fit in that field so any such offset when allocating a refcount block
would indicate a bug in QEMU.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The mirror_start_job() function used for the commit-active job blocks
the source, target and all intermediate nodes for the duration of the
job.
target <- intermediate <- source
Since 4ef85a9c23 this function creates a dummy mirror_top_bs that
goes on top of the source node, and it is this dummy node that gets
blocked instead. The source node is never blocked or added to the
job's list of nodes.
target <- intermediate <- source <- mirror_top
At the moment I don't think it is possible to exploit this problem
because any additional job on 'source' would either be forbidden for
other reasons or it would need to involve an additional node that is
blocked, causing an error.
This can be seen in the error messages, however, because they never
refer to the source node being blocked:
$ qemu-img create -f qcow2 hd0.qcow2 1M
$ qemu-img create -f qcow2 -b hd0.qcow2 hd1.qcow2
$ qemu-io -c 'write 0 1M' hd0.qcow2
$ $QEMU -drive if=none,file=hd1.qcow2,node-name=hd1
{ "execute": "qmp_capabilities" }
{ "execute": "block-commit", "arguments": {"device": "hd1", "speed": 256}}
{ "execute": "block-stream", "arguments": {"device": "hd1"}}
{ "error": {"class": "GenericError",
"desc": "Node 'hd0' is busy: block device is in use by block job: commit"}}
After this patch the error message refers to 'hd1', as it should.
The expected output of iotest 141 also needs to be updated for the
same reason.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
At the moment I don't see how to make this function fail after the
dirty bitmap has been created, but if that was possible then we would
hit the assert(QLIST_EMPTY(&bs->dirty_bitmaps)) in bdrv_close().
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20181213162727.17438-5-lvivier@redhat.com
[mreitz: Fixed sheepdog_snapshot_create_inode's format string to use
PRIx32 for uint32_ts]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20181213162727.17438-3-lvivier@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20181213162727.17438-2-lvivier@redhat.com
[mreitz: Fixed type of ssh_{read,write}_return's parameter to be ssize_t
instead of size_t]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add a new command, returning block nodes (and their users) graph.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20181221170909.25584-2-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
The following QMP command leads to a crash when iothreads are used:
{ 'execute': 'device_del', 'arguments': {'id': 'data'} }
The backtrace involves the queue restart coroutine where
tgm->throttle_state is a NULL pointer because
throttle_group_unregister_tgm() has already been called:
(gdb) bt full
#0 0x00005585a7a3b378 in qemu_mutex_lock_impl (mutex=0xffffffffffffffd0, file=0x5585a7bb3d54 "block/throttle-groups.c", line=412) at util/qemu-thread-posix.c:64
err = <optimized out>
__PRETTY_FUNCTION__ = "qemu_mutex_lock_impl"
__func__ = "qemu_mutex_lock_impl"
#1 0x00005585a79be074 in throttle_group_restart_queue_entry (opaque=0x5585a9de4eb0) at block/throttle-groups.c:412
_f = <optimized out>
data = 0x5585a9de4eb0
tgm = 0x5585a9079440
ts = 0x0
tg = 0xffffffffffffff98
is_write = false
empty_queue = 255
This coroutine should not execute in the iothread after the throttle
group member has been unregistered!
The root cause is that the device_del code path schedules the restart
coroutine in the iothread while holding the AioContext lock. Therefore
the iothread cannot execute the coroutine until after device_del
releases the lock - by this time it's too late.
This patch adds a reference count to ThrottleGroupMember so we can
synchronously wait for restart coroutines to complete. Once they are
done it is safe to unregister the ThrottleGroupMember.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190114133257.30299-2-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The header "scsi-lowlevel.h" of libiscsi 1.9.0 contains some bad
"inline" prototype definitions which GCC refuses to compile in its
gnu99 mode:
In file included from block/iscsi.c:52:0:
/usr/include/iscsi/scsi-lowlevel.h:810:13: error: inline function
‘scsi_set_uint16’ declared but never defined [-Werror]
inline void scsi_set_uint16(unsigned char *c, uint16_t val);
^
/usr/include/iscsi/scsi-lowlevel.h:809:13: error: inline function
‘scsi_set_uint32’ declared but never defined [-Werror]
inline void scsi_set_uint32(unsigned char *c, uint32_t val);
^
[...]
This has been fixed by upstream libiscsi in version 1.10.0 (see
https://github.com/sahlberg/libiscsi/commit/7692027d6c11 ), but
since we still want to support 1.9.0 for CentOS 7 / RHEL7, we
have to work-around the issue by redefining the "inline" keyword
to use the old "gnu89" mode behavior via "gnu_inline" instead.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Pass 'info' instead of three separate parameters related to info,
when requesting the server to set the meta context. Update the
NBDExportInfo struct to rename the received id field to match the
fact that we are currently overloading the field to match whatever
context the user supplied through the x-dirty-bitmap hack, as well
as adding a TODO comment to remind future patches about a desire
to request two contexts at once.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190117193658.16413-11-eblake@redhat.com>
Refactor the 'name' parameter of nbd_receive_negotiate() from
being a separate parameter into being part of the in-out 'info'.
This also spills over to a simplification of nbd_opt_go().
The main driver for this refactoring is that an upcoming patch
would like to add support to qemu-nbd to list information about
all exports available on a server, where the name(s) will be
provided by the server instead of the client. But another benefit
is that we can now allow the client to explicitly specify the
empty export name "" even when connecting to an oldstyle server
(even if qemu is no longer such a server after commit 7f7dfe2a).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190117193658.16413-10-eblake@redhat.com>
tpm physical presence interface
rsc support in virtio net
ivshmem is removed
misc cleanups and fixes all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcQTW/AAoJECgfDbjSjVRpY34H/0wniUqht3CVVihAfC9qnL7N
jgAJ41OGgrrCmXLrwa5P5iWxIKRjp2Odo7DQ2dl+9sqYshe6WoBBXmqEa8bDeldE
69EJ3xSnZjkpScqXKoSaDO3dtibUGpdvlTKpjuh7q2FFBOvNvf3hQubhyt/vINsO
qoDMqycjXJkHSaZLfdNb3P/LAkAuHI4veNk8KEbog3XTijSUbZlb2GYc8vT/agka
0SldiEVi9CTIh7co0L2BkcLVeTMVrF3OZuWDek3tVHXPxPaNbQM3GMvHZaxqZusU
SZinnNhcnI++kgLZVHuurxo7e1S9d8Oy0QuHGX8CZ1sfeqy3vzxer0Hbb7BdLIY=
=GhIM
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
pci, pc, virtio: fixes, features
tpm physical presence interface
rsc support in virtio net
ivshmem is removed
misc cleanups and fixes all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# gpg: Signature made Fri 18 Jan 2019 02:11:11 GMT
# gpg: using RSA key 281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>"
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>"
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream: (49 commits)
migration: Use strnlen() for fixed-size string
migration: Fix stringop-truncation warning
hw/acpi: Use QEMU_NONSTRING for non NUL-terminated arrays
block/sheepdog: Use QEMU_NONSTRING for non NUL-terminated arrays
qemu/compiler: Define QEMU_NONSTRING
acpi: update expected files
hw: acpi: Fix memory hotplug AML generation error
tpm: clear RAM when "memory overwrite" requested
acpi: add ACPI memory clear interface
acpi: build TPM Physical Presence interface
acpi: expose TPM/PPI configuration parameters to firmware via fw_cfg
tpm: allocate/map buffer for TPM Physical Presence interface
tpm: add a "ppi" boolean property
hw/misc/edu: add msi_uninit() for pci_edu_uninit()
virtio: Make disable-legacy/disable-modern compat properties optional
globals: Allow global properties to be optional
virtio: virtio 9p really requires CONFIG_VIRTFS to work
virtio: split virtio crypto bits from virtio-pci.h
virtio: split virtio gpu bits from virtio-pci.h
virtio: split virtio serial bits from virtio-pci
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
GCC 8 added a -Wstringop-truncation warning:
The -Wstringop-truncation warning added in GCC 8.0 via r254630 for
bug 81117 is specifically intended to highlight likely unintended
uses of the strncpy function that truncate the terminating NUL
character from the source string.
This new warning leads to compilation failures:
CC block/sheepdog.o
qemu/block/sheepdog.c: In function 'find_vdi_name':
qemu/block/sheepdog.c:1239:5: error: 'strncpy' specified bound 256 equals destination size [-Werror=stringop-truncation]
strncpy(buf + SD_MAX_VDI_LEN, tag, SD_MAX_VDI_TAG_LEN);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make: *** [qemu/rules.mak:69: block/sheepdog.o] Error 1
As described previous to the strncpy() calls, the use of strncpy() is
correct here:
/* This pair of strncpy calls ensures that the buffer is zero-filled,
* which is desirable since we'll soon be sending those bytes, and
* don't want the send_req to read uninitialized data.
*/
strncpy(buf, filename, SD_MAX_VDI_LEN);
strncpy(buf + SD_MAX_VDI_LEN, tag, SD_MAX_VDI_TAG_LEN);
Use the QEMU_NONSTRING attribute, since this array is intended to store
character arrays that do not necessarily contain a terminating NUL.
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
This reverts commit a33fbb4f8b.
The functionality is unused.
Note: in addition to automatic revert, drop second parameter in
hbitmap_iter_next() call from hbitmap_next_dirty_area() too.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
This reverts commit 72d10a9421.
The function is unused now.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Use bdrv_dirty_bitmap_next_dirty_area() instead of
bdrv_dirty_iter_next_area(), because of the following problems of
bdrv_dirty_iter_next_area():
1. Using HBitmap iterators we should carefully handle unaligned offset,
as first call to hbitmap_iter_next() may return a value less than
original offset (actually, it will be original offset rounded down to
bitmap granularity). This handling is not done in
do_sync_target_write().
2. bdrv_dirty_iter_next_area() handles unaligned max_offset
incorrectly:
look at the code:
if (max_offset == iter->bitmap->size) {
/* If max_offset points to the image end, round it up by the
* bitmap granularity */
gran_max_offset = ROUND_UP(max_offset, granularity);
} else {
gran_max_offset = max_offset;
}
ret = hbitmap_iter_next(&iter->hbi, false);
if (ret < 0 || ret + granularity > gran_max_offset) {
return false;
}
and assume that max_offset != iter->bitmap->size but still unaligned.
if 0 < ret < max_offset we found dirty area, but the function can
return false in this case (if ret + granularity > max_offset).
3. bdrv_dirty_iter_next_area() uses inefficient loop to find the end of
the dirty area. Let's use more efficient hbitmap_next_zero instead
(bdrv_dirty_bitmap_next_dirty_area() do so)
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
The function alters bdrv_dirty_iter_next_area(), which is wrong and
less efficient (see further commit
"block/mirror: fix and improve do_sync_target_write" for description).
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
When making a backup of a dirty bitmap (for transactions), we want to
restore that backup whether or not the bitmap is enabled.
It is perfectly valid to write into bitmaps that are disabled. It is
only illegitimate for the guest to have done so.
Remove this assertion.
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20181221093529.23855-3-jsnow@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Most files that have TABs only contain a handful of them. Change
them to spaces so that we don't confuse people.
disas, standard-headers, linux-headers and libdecnumber are imported
from other projects and probably should be exempted from the check.
Outside those, after this patch the following files still contain both
8-space and TAB sequences at the beginning of the line. Many of them
have a majority of TABs, or were initially committed with all tabs.
bsd-user/i386/target_syscall.h
bsd-user/x86_64/target_syscall.h
crypto/aes.c
hw/audio/fmopl.c
hw/audio/fmopl.h
hw/block/tc58128.c
hw/display/cirrus_vga.c
hw/display/xenfb.c
hw/dma/etraxfs_dma.c
hw/intc/sh_intc.c
hw/misc/mst_fpga.c
hw/net/pcnet.c
hw/sh4/sh7750.c
hw/timer/m48t59.c
hw/timer/sh_timer.c
include/crypto/aes.h
include/disas/bfd.h
include/hw/sh4/sh.h
libdecnumber/decNumber.c
linux-headers/asm-generic/unistd.h
linux-headers/linux/kvm.h
linux-user/alpha/target_syscall.h
linux-user/arm/nwfpe/double_cpdo.c
linux-user/arm/nwfpe/fpa11_cpdt.c
linux-user/arm/nwfpe/fpa11_cprt.c
linux-user/arm/nwfpe/fpa11.h
linux-user/flat.h
linux-user/flatload.c
linux-user/i386/target_syscall.h
linux-user/ppc/target_syscall.h
linux-user/sparc/target_syscall.h
linux-user/syscall.c
linux-user/syscall_defs.h
linux-user/x86_64/target_syscall.h
slirp/cksum.c
slirp/if.c
slirp/ip.h
slirp/ip_icmp.c
slirp/ip_icmp.h
slirp/ip_input.c
slirp/ip_output.c
slirp/mbuf.c
slirp/misc.c
slirp/sbuf.c
slirp/socket.c
slirp/socket.h
slirp/tcp_input.c
slirp/tcpip.h
slirp/tcp_output.c
slirp/tcp_subr.c
slirp/tcp_timer.c
slirp/tftp.c
slirp/udp.c
slirp/udp.h
target/cris/cpu.h
target/cris/mmu.c
target/cris/op_helper.c
target/sh4/helper.c
target/sh4/op_helper.c
target/sh4/translate.c
tcg/sparc/tcg-target.inc.c
tests/tcg/cris/check_addo.c
tests/tcg/cris/check_moveq.c
tests/tcg/cris/check_swap.c
tests/tcg/multiarch/test-mmap.c
ui/vnc-enc-hextile-template.h
ui/vnc-enc-zywrle.h
util/envlist.c
util/readline.c
The following have only TABs:
bsd-user/i386/target_signal.h
bsd-user/sparc64/target_signal.h
bsd-user/sparc64/target_syscall.h
bsd-user/sparc/target_signal.h
bsd-user/sparc/target_syscall.h
bsd-user/x86_64/target_signal.h
crypto/desrfb.c
hw/audio/intel-hda-defs.h
hw/core/uboot_image.h
hw/sh4/sh7750_regnames.c
hw/sh4/sh7750_regs.h
include/hw/cris/etraxfs_dma.h
linux-user/alpha/termbits.h
linux-user/arm/nwfpe/fpopcode.h
linux-user/arm/nwfpe/fpsr.h
linux-user/arm/syscall_nr.h
linux-user/arm/target_signal.h
linux-user/cris/target_signal.h
linux-user/i386/target_signal.h
linux-user/linux_loop.h
linux-user/m68k/target_signal.h
linux-user/microblaze/target_signal.h
linux-user/mips64/target_signal.h
linux-user/mips/target_signal.h
linux-user/mips/target_syscall.h
linux-user/mips/termbits.h
linux-user/ppc/target_signal.h
linux-user/sh4/target_signal.h
linux-user/sh4/termbits.h
linux-user/sparc64/target_syscall.h
linux-user/sparc/target_signal.h
linux-user/x86_64/target_signal.h
linux-user/x86_64/termbits.h
pc-bios/optionrom/optionrom.h
slirp/mbuf.h
slirp/misc.h
slirp/sbuf.h
slirp/tcp.h
slirp/tcp_timer.h
slirp/tcp_var.h
target/i386/svm.h
target/sparc/asi.h
target/xtensa/core-dc232b/xtensa-modules.inc.c
target/xtensa/core-dc233c/xtensa-modules.inc.c
target/xtensa/core-de212/core-isa.h
target/xtensa/core-de212/xtensa-modules.inc.c
target/xtensa/core-fsf/xtensa-modules.inc.c
target/xtensa/core-sample_controller/core-isa.h
target/xtensa/core-sample_controller/xtensa-modules.inc.c
target/xtensa/core-test_kc705_be/core-isa.h
target/xtensa/core-test_kc705_be/xtensa-modules.inc.c
tests/tcg/cris/check_abs.c
tests/tcg/cris/check_addc.c
tests/tcg/cris/check_addcm.c
tests/tcg/cris/check_addoq.c
tests/tcg/cris/check_bound.c
tests/tcg/cris/check_ftag.c
tests/tcg/cris/check_int64.c
tests/tcg/cris/check_lz.c
tests/tcg/cris/check_openpf5.c
tests/tcg/cris/check_sigalrm.c
tests/tcg/cris/crisutils.h
tests/tcg/cris/sys.c
tests/tcg/i386/test-i386-ssse3.c
ui/vgafont.h
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20181213223737.11793-3-pbonzini@redhat.com>
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Acked-by: Richard Henderson <richard.henderson@linaro.org>
Acked-by: Eric Blake <eblake@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Stefan Markovic <smarkovic@wavecomp.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Most list head structs need not be given a name. In most cases the
name is given just in case one is going to use QTAILQ_LAST, QTAILQ_PREV
or reverse iteration, but this does not apply to lists of other kinds,
and even for QTAILQ in practice this is only rarely needed. In addition,
we will soon reimplement those macros completely so that they do not
need a name for the head struct. So clean up everything, not giving a
name except in the rare case where it is necessary.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The libiscsi iscsi_task_mgmt_async() API documentation says:
abort_task will also cancel the scsi task. The callback for the scsi
task will be invoked with SCSI_STATUS_CANCELLED
The libiscsi implementation does not fulfil this promise. The task's
callback is not invoked and its struct iscsi_pdu remains in the internal
list (effectively leaked).
This patch invokes the libiscsi iscsi_scsi_cancel_task() API to force
the task's callback to be invoked with SCSI_STATUS_CANCELLED when the
ABORT TASK TMF completes and the task's callback hasn't been invoked
yet.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20180215111526.2464-1-stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
iscsi_aio_cancel() does not increment the request's reference count,
causing a use-after-free when ABORT TASK finishes after the request has
already completed.
There are some additional issues with iscsi_aio_cancel():
1. Several ABORT TASKs may be sent for the same task if
iscsi_aio_cancel() is invoked multiple times. It's better to avoid
this just in case the command identifier is reused.
2. The iscsilun->mutex protection is missing in iscsi_aio_cancel().
Reported-by: Felipe Franciosi <felipe@nutanix.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20180203061621.7033-4-stefanha@redhat.com>
Reviewed-by: Felipe Franciosi <felipe@nutanix.com>
Tested-by: Sreejith Mohanan <sreejit.mohanan@nutanix.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit d045c466d9 ("iscsi: do not use
aio_context_acquire/release") introduced iscsilun->mutex but appears to
have overlooked iscsi_timed_check_events() when introducing the mutex.
iscsi_service() and iscsi_set_events() must be called with
iscsilun->mutex held.
iscsi_timed_check_events() is invoked from the AioContext and does not
take the mutex.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20180203061621.7033-3-stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The IscsiAIOCB->buf field has not been used since commit
e49ab19fca ("block/iscsi: bump libiscsi
requirement to 1.9.0"). It used to be a linear buffer for old libiscsi
versions that didn't support scatter-gather. The minimum libiscsi
version supports scatter-gather so we don't linearize buffers anymore.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20180203061621.7033-2-stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the IO size is larger than 2 pages, we move the the pointer one by
one in the pagelist, this is inefficient.
This is a simple benchmark result:
Before:
$ qemu-io -c 'write 0 1G' nvme://0000:00:04.0/1
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:02.41 (424.504 MiB/sec and 0.4146 ops/sec)
$ qemu-io -c 'read 0 1G' nvme://0000:00:04.0/1
read 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:02.03 (503.055 MiB/sec and 0.4913 ops/sec)
After:
$ qemu-io -c 'write 0 1G' nvme://0000:00:04.0/1
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:02.17 (471.517 MiB/sec and 0.4605 ops/sec)
$ qemu-io -c 'read 0 1G' nvme://0000:00:04.0/1
read 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:01.94 (526.770 MiB/sec and 0.5144 ops/sec)
Signed-off-by: Li Feng <lifeng1519@gmail.com>
Message-Id: <20181101103807.25862-1-lifeng1519@gmail.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Error and trace improvements in NBD code, such as less noise for
common disconnect scenarios.
- Vladimir Sementsov-Ogievskiy: 0/3 nbd-client: drop extra error noise
- Eric Blake: portions of 0/22 nbd: add qemu-nbd --list
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJcMLgeAAoJEKeha0olJ0NqzXgH/RVRrCtmz0dOrrAkoJUfFX3t
AcnOSb8nTWXTLGvGpU6x5l9yGL77VaN83j3z/4tXAraGA3sZsFx4cNNepT7P3zBB
GlyT/b/Aie4EQAJPjuZdIZBe5pR6seSLoMFoGiga2oS2ik/wx4Wxxn9Gf9ZHUwVO
muCtsJQypNrfbv7GY2Qf5NKyB2IQxt4ljSCqOmi0/T2//mQrdmRhcHbgTTDUA5HS
3Rpr9gPV0u6YzA9foCu+UqUpyhF3RBQY9sSMLS+Y9L6W72WgynrG7y0A637o4Q61
W+2UDDjuvyyCDR8DjD+LnWyGn4QuxIglfta2kFRcFGoPr1GqcmP4gDLeqqdJKFs=
=R1PV
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2019-01-05' into staging
nbd patches for 2019-01-05
Error and trace improvements in NBD code, such as less noise for
common disconnect scenarios.
- Vladimir Sementsov-Ogievskiy: 0/3 nbd-client: drop extra error noise
- Eric Blake: portions of 0/22 nbd: add qemu-nbd --list
# gpg: Signature made Sat 05 Jan 2019 13:58:54 GMT
# gpg: using RSA key A7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>"
# gpg: aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>"
# gpg: aka "[jpeg image of size 6874]"
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2 F3AA A7A1 6B4A 2527 436A
* remotes/ericb/tags/pull-nbd-2019-01-05:
nbd/client: Drop pointless buf variable
qemu-nbd: Fail earlier for -c/-d on non-linux
nbd/client: More consistent error messages
nbd: Document timeline of various features
qemu-nbd: Use program name in error messages
block/nbd-client: use traces instead of noisy error_report_err
nbd/client: Trace all server option error messages
nbd: publish _lookup functions
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reduce extra noise of nbd-client, change 083 correspondingly.
In various commits (be41c100 in 2.10, f140e300 in 2.11, 78a33ab
in 2.12), we added spots where qemu as an NBD client would report
problems communicating with the server to stderr, because there
was no where else to send the error to. However, this is racy,
particularly since the most common source of these errors is when
either the client or the server abruptly hangs up, leaving one
coroutine to report the error only if it wins (or loses) the
race in attempting the read from the server before another
thread completes its cleanup of a protocol error that caused the
disconnect in the first place. The race is also apparent in the
fact that differences in the flush behavior of the server can
alter the frequency of encountering the race in the client (see
commit 6d39db96).
Rather than polluting stderr, it's better to just trace these
situations, for use by developers debugging a flaky connection,
particularly since the real error that either triggers the abrupt
disconnection in the first place, or that results from the EIO
when a request can't receive a reply, DOES make it back to the
user in the normal Error propagation channels.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20181102151152.288399-4-vsementsov@virtuozzo.com>
[eblake: drop depedence on error hint, enhance commit message]
Signed-off-by: Eric Blake <eblake@redhat.com>
The dmg file has many tables which describe: "start from sector XXX to
sector XXX, the compression method is XXX and where the compressed data
resides on".
Each sector in the expanded file should be covered by a table. The table
will describe the offset of compressed data (or raw depends on the type)
in the dmg.
For example:
[-----------The expanded file------------]
[---bzip table ---]/* zeros */[---zlib---]
^
| if we want to read this sector.
we will find bzip table which contains this sector, and get the
compressed data offset, read it from dmg, uncompress it, finally write to
expanded file.
If we skip zero chunk (table), some sector cannot find the table which
will cause search_chunk() return s->n_chunks, dmg_read_chunk() return -1
and finally causing dmg_co_preadv() return EIO.
See:
[-----------The expanded file------------]
[---bzip table ---]/* zeros */[---zlib---]
^
| if we want to read this sector.
Oops, we cannot find the table contains it...
In the original implementation, we don't have zero table. When we try to
read sector inside the zero chunk. We will get EIO, and skip reading.
After this patch, we treat zero chunk the same as ignore chunk, it will
directly write zero and avoid some sector may not find the table.
After this patch:
[-----------The expanded file------------]
[---bzip table ---][--zeros--][---zlib---]
Signed-off-by: yuchenlin <npes87184@gmail.com>
Reviewed-by: Julio Faracco <jcfaracco@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190103114700.9686-4-npes87184@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
There is a possible hang in original binary search implementation. That is
if chunk1 = 4, chunk2 = 5, chunk3 = 4, and we go else case.
The chunk1 will be still 4, and so on.
Signed-off-by: yuchenlin <npes87184@gmail.com>
Message-id: 20190103114700.9686-2-npes87184@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is a trivial patch to fix a wrong value for block terminator.
The old value was 0x7fffffff which is wrong. It was not affecting the
code because QEMU dmg block is not handling block terminator right now.
Neverthless, it should be fixed.
Signed-off-by: Julio Faracco <jcfaracco@gmail.com>
Reviewed-by: yuchenlin <yuchenlin@synology.com>
Message-id: 20181228145055.18039-1-jcfaracco@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Marking a function coroutine_fn currently has no effect on the compiler,
but it documents that this function must be called from coroutine
context and it may yield. This is important information for the
programmer.
Also, if we ever transition to a stackless coroutine implementation,
then it's likely that the annotation will become mandatory so the
compiler can use the correct calling convention for coroutine functions.
Cc: Max Reitz <mreitz@redhat.com>
Cc: John Snow <jsnow@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that all callers are passing all flag changes as QDict options,
the flags parameter is no longer necessary, so we can get rid of it.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This function is used to put the hidden and secondary disks in
read-write mode before launching the backup job, and back in read-only
mode afterwards.
This patch does the following changes:
- Use an options QDict with the "read-only" option instead of
passing the changes as flags only.
- Simplify the code (it was unnecessarily complicated and verbose).
- Fix a bug due to which the secondary disk was not being put back
in read-only mode when writable=false (because in this case
orig_secondary_flags always had the BDRV_O_RDWR flag set).
- Stop clearing the BDRV_O_INACTIVE flag.
The flags parameter to bdrv_reopen_queue() becomes redundant and we'll
be able to get rid of it in a subsequent patch.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The 'block-commit' QMP command is implemented internally using two
different drivers. If the source image is the active layer then the
mirror driver is used (commit_active_start()), otherwise the commit
driver is used (commit_start()).
In both cases the destination image must be put temporarily in
read-write mode. This is done correctly in the latter case, but what
commit_active_start() does is copy all flags instead.
This patch replaces the bdrv_reopen() calls in that function with
bdrv_reopen_set_read_only() so that only the read-only status is
changed.
A similar change is made in mirror_exit(), which is also used by the
'drive-mirror' and 'blockdev-mirror' commands.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch replaces the bdrv_reopen() calls that set and remove the
BDRV_O_RDWR flag with the new bdrv_reopen_set_read_only() function.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch replaces the bdrv_reopen() calls that set and remove the
BDRV_O_RDWR flag with the new bdrv_reopen_set_read_only() function.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch replaces the bdrv_reopen() calls that set and remove the
BDRV_O_RDWR flag with the new bdrv_reopen_set_read_only() function.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.
As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.
This was the last user of aio_worker(), so the function goes away now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
No real reason to keep using the callback based mechanism here when the
rest of the file-posix driver is coroutine based. Changing it brings
ioctls more in line with how other request types work.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.
As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
aio_worker() for reads and writes isn't boring enough yet. It still does
some postprocessing for handling short reads and turning the result into
the right return value.
However, there is no reason why handle_aiocb_rw() couldn't do the same,
and even without duplicating code between the read and write path. So
move the code there.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.
As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.
As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.
As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.
As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.
As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Getting the thread pool of the AioContext of a block node and scheduling
some work in it is an operation that is already done twice, and we'll
get more instances. Factor it out into a separate function.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
RawPosixAIOData contains a lot of fields for several separate operations
that are to be processed in a worker thread and that need different
parameters. The struct is currently rather unorganised, with unions that
cover some, but not all operations, and even one #define for field names
instead of a union.
Clean this up to have some common fields and a single union. As a side
effect, on x86_64 the struct shrinks from 72 to 48 bytes.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Do decompression in threads, like it is already done for compression.
This improves asynchronous compressed reads performance.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allocate buffers locally and release qcow2 lock. Than, reads inside
qcow2_co_preadv_compressed may be done in parallel, however all
decompression is still done synchronously. Let's improve it in the
following commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Get rid of it here too.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
- make it look more like a pair of qcow2_compress - rename the function
and its parameters
- drop extra out_len variable, check filling of output buffer by strm
structure itself
- fix code style
- add some documentation
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Compression is done in threads in qcow2.c. We want to do decompression
in the same way, so, firstly, move it to the same file.
The only change is braces around if-body in decompress_buffer, to
satisfy checkpatch.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Give explicit size both for source and destination buffers, to make it
similar with decompression path and than cleanly reuse parameter
structure for decompression threads.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use appropriate macro, corresponding to deflateInit2 spec.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After commit f8d59dfb40
"block/backup: fix fleecing scheme: use serialized writes" fleecing
(specifically reading from backup target, when backup source is in
backing chain of backup target) is safe, because all backup-job writes
to target are serialized. Therefore we don't need additional
synchronization for these reads.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This change is better to understand what kind of block type is being
handled by the code. Using a syntax similar to the DMG documentation is
easier than tracking all hex values assigned to a block type.
Signed-off-by: Julio Faracco <jcfaracco@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This commit includes the support to new module dmg-lzfse into dmg block
driver. It includes the support for block type ULFO (0x80000007).
Signed-off-by: Julio Faracco <jcfaracco@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This commit includes the support to lzfse opensource library. With this
library dmg block driver can decompress images with this type of
compression inside.
Signed-off-by: Julio Faracco <jcfaracco@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QEMU dmg support includes zlib and bzip2, but it does not contains lzfse
support. This commit adds the source file to extend compression support
for new DMGs.
Signed-off-by: Julio Faracco <jcfaracco@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The two thing that should be handled are cipher and ivgen. For ivgen
the solution is just mutex, as iv calculations should not be long in
comparison with encryption/decryption. And for cipher let's just keep
per-thread ciphers.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Let start from the beginning:
Commit b9e413dd37 (in 2.9)
"block: explicitly acquire aiocontext in aio callbacks that need it"
added pairs of aio_context_acquire/release to mirror_write_complete and
mirror_read_complete, when they were aio callbacks for blk_aio_* calls.
Then, commit 2e1990b26e (in 3.0) "block/mirror: Convert to coroutines"
dropped these blk_aio_* calls, than mirror_write_complete and
mirror_read_complete are not callbacks more, and don't need additional
aiocontext acquiring. Furthermore, mirror_read_complete calls
blk_co_pwritev inside these pair of aio_context_acquire/release, which
leads to the following dead-lock with mirror:
(gdb) info thr
Id Target Id Frame
3 Thread (LWP 145412) "qemu-system-x86" syscall ()
2 Thread (LWP 145416) "qemu-system-x86" __lll_lock_wait ()
* 1 Thread (LWP 145411) "qemu-system-x86" __lll_lock_wait ()
(gdb) bt
#0 __lll_lock_wait ()
#1 _L_lock_812 ()
#2 __GI___pthread_mutex_lock
#3 qemu_mutex_lock_impl (mutex=0x561032dce420 <qemu_global_mutex>,
file=0x5610327d8654 "util/main-loop.c", line=236) at
util/qemu-thread-posix.c:66
#4 qemu_mutex_lock_iothread_impl
#5 os_host_main_loop_wait (timeout=480116000) at util/main-loop.c:236
#6 main_loop_wait (nonblocking=0) at util/main-loop.c:497
#7 main_loop () at vl.c:1892
#8 main
Printing contents of qemu_global_mutex, I see that "__owner = 145416",
so, thr1 is main loop, and now it wants BQL, which is owned by thr2.
(gdb) thr 2
(gdb) bt
#0 __lll_lock_wait ()
#1 _L_lock_870 ()
#2 __GI___pthread_mutex_lock
#3 qemu_mutex_lock_impl (mutex=0x561034d25dc0, ...
#4 aio_context_acquire (ctx=0x561034d25d60)
#5 dma_blk_cb
#6 dma_blk_io
#7 dma_blk_read
#8 ide_dma_cb
#9 bmdma_cmd_writeb
#10 bmdma_write
#11 memory_region_write_accessor
#12 access_with_adjusted_size
#15 flatview_write
#16 address_space_write
#17 address_space_rw
#18 kvm_handle_io
#19 kvm_cpu_exec
#20 qemu_kvm_cpu_thread_fn
#21 qemu_thread_start
#22 start_thread
#23 clone ()
Printing mutex in fr 2, I see "__owner = 145411", so thr2 wants aio
context mutex, which is owned by thr1. Classic dead-lock.
Then, let's check that aio context is hold by mirror coroutine: just
print coroutine stack of first tracked request in mirror job target:
(gdb) [...]
(gdb) qemu coroutine 0x561035dd0860
#0 qemu_coroutine_switch
#1 qemu_coroutine_yield
#2 qemu_co_mutex_lock_slowpath
#3 qemu_co_mutex_lock
#4 qcow2_co_pwritev
#5 bdrv_driver_pwritev
#6 bdrv_aligned_pwritev
#7 bdrv_co_pwritev
#8 blk_co_pwritev
#9 mirror_read_complete () at block/mirror.c:232
#10 mirror_co_read () at block/mirror.c:370
#11 coroutine_trampoline
#12 __start_context
Yes it is mirror_read_complete calling blk_co_pwritev after acquiring
aio context.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If nbd_client_init() fails after we are already connected,
then the server will spam logs with:
Disconnect client, due to: Unexpected end-of-file before all bytes were read
unless we gracefully disconnect before closing the connection.
Ways to trigger this:
$ opts=driver=nbd,export=foo,server.type=inet,server.host=localhost,server.port=10809
$ qemu-img map --output=json --image-opts $opts,read-only=off
$ qemu-img map --output=json --image-opts $opts,x-dirty-bitmap=nosuch:
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20181130023232.3079982-4-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
The implementation of x-dirty-bitmap in qemu 3.0 (commit 216ee365)
silently falls back to treating the server as not supporting
NBD_CMD_BLOCK_STATUS if a requested meta_context name was not
negotiated, which in turn means treating the _entire_ image as
data. Since our hack relied on using 'qemu-img map' to view
which portions of the image were dirty by seeing what the
redirected bdrv_block_status() treats as holes, this means
that our fallback treats the entire image as clean. Better
would have been to treat the entire image as dirty, or to fail
to connect because the user's request for a specific context
could not be honored. This patch goes with the latter.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20181130023232.3079982-3-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
s->locked_shared_perm is the set of bits locked in the file, which is
the inverse of the permissions actually shared. So we need to pass them
as they are to raw_apply_lock_bytes() instead of inverting them again.
Reported-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Our code was already checking that we did not attempt to
allocate more clusters than what would fit in an INT64 (the
physical maximimum if we can access a full off_t's worth of
data). But this does not catch smaller limits enforced by
various spots in the qcow2 image description: L1 and normal
clusters of L2 are documented as having bits 63-56 reserved
for other purposes, capping our maximum offset at 64PB (bit
55 is the maximum bit set). And for compressed images with
2M clusters, the cap drops the maximum offset to bit 48, or
a maximum offset of 512TB. If we overflow that offset, we
would write compressed data into one place, but try to
decompress from another, which won't work.
It's actually possible to prove that overflow can cause image
corruption without this patch; I'll add the iotests separately
in the next commit.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Don't leak 'cluster' in the mapping == NULL case. Found by Coverity
(CID 1055918).
Fixes: 8d9401c279
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
The commit for 0e4e4318ea increments QCOW2_OL_MAX_BITNR but does not
add an array entry for QCOW2_OL_BITMAP_DIRECTORY_BITNR to metadata_ol_names[].
As a result, an array dereference of metadata_ol_names[8] in
qcow2_pre_write_overlap_check() could result in a read outside of the array bounds.
Fixes: 0e4e4318ea ('qcow2: add overlap check for bitmap directory')
Cc: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Liam Merwick <Liam.Merwick@oracle.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1541453919-25973-6-git-send-email-Liam.Merwick@oracle.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
The calls to find_mapping_for_cluster() may return NULL but it
isn't always checked for before dereferencing the value returned.
Additionally, add some asserts to cover cases where NULL can't
be returned but which might not be obvious at first glance.
Signed-off-by: Liam Merwick <Liam.Merwick@oracle.com>
Message-id: 1541453919-25973-5-git-send-email-Liam.Merwick@oracle.com
[mreitz: Dropped superfluous check of "mapping" following an assertion
that it is not NULL, and fixed some indentation]
Signed-off-by: Max Reitz <mreitz@redhat.com>
The dev_id returned by the call to blk_get_attached_dev_id() in
blk_root_get_parent_desc() can be NULL (an internal call to
object_get_canonical_path may have returned NULL).
Instead of just checking this case before before dereferencing,
adjust blk_get_attached_dev_id() to return the empty string if no
object path can be found (similar to the case when blk->dev is NULL
and an empty string is returned).
Signed-off-by: Liam Merwick <Liam.Merwick@oracle.com>
Message-id: 1541453919-25973-3-git-send-email-Liam.Merwick@oracle.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This adds configure options to control the following block drivers:
* Bochs
* Cloop
* Dmg
* Qcow (V1)
* Vdi
* Vvfat
* qed
* parallels
* sheepdog
Each of these defaults to being enabled.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20181107063644.2254-1-armbru@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
The lock_fd field is not strictly necessary because transferring locked
bytes from old fd to the new one shouldn't fail anyway. This spares the
user one fd per image.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If we know we've already locked the bytes, don't do it again; similarly
don't unlock a byte if we haven't locked it. This doesn't change the
behavior, but fixes a corner case explained below.
Libvirt had an error handling bug that an image can get its (ownership,
file mode, SELinux) permissions changed (RHBZ 1584982) by mistake behind
QEMU. Specifically, an image in use by Libvirt VM has:
$ ls -lhZ b.img
-rw-r--r--. qemu qemu system_u:object_r:svirt_image_t:s0:c600,c690 b.img
Trying to attach it a second time won't work because of image locking.
And after the error, it becomes:
$ ls -lhZ b.img
-rw-r--r--. root root system_u:object_r:virt_image_t:s0 b.img
Then, we won't be able to do OFD lock operations with the existing fd.
In other words, the code such as in blk_detach_dev:
blk_set_perm(blk, 0, BLK_PERM_ALL, &error_abort);
can abort() QEMU, out of environmental changes.
This patch is an easy fix to this and the change is regardlessly
reasonable, so do it.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use error_report for situations that affect user operation (i.e. we're
actually returning error), and warn_report/warn_report_err when some
less critical error happened but the user operation can still carry on.
For raw_normalize_devicepath, add Error parameter to propagate to
its callers.
Suggested-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If an expression is used to define DEFAULT_CLUSTER_SIZE, when compiled,
it will be embedded as a literal expression in the binary (as the
default value) because it is stringified to mark the size of the default
value. Now this is fixed by using a defined number to define this value.
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If read-only=off, but auto-read-only=on is given, open the volume
read-write if we have the permissions, but instead of erroring out for
read-only volumes, just degrade to read-only.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
If read-only=off, but auto-read-only=on is given, open the file
read-write if we have the permissions, but instead of erroring out for
read-only files, just degrade to read-only.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Niels de Vos <ndevos@redhat.com>
If read-only=off, but auto-read-only=on is given, just degrade to
read-only.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
If read-only=off, but auto-read-only=on is given, open the file
read-write if we have the permissions, but instead of erroring out for
read-only files, just degrade to read-only.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
If read-only=off, but auto-read-only=on is given, open a read-write NBD
connection if the server provides a read-write export, but instead of
erroring out for read-only exports, just degrade to read-only.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Some block drivers have traditionally changed their node to read-only
mode without asking the user. This behaviour has been marked deprecated
since 2.11, expecting users to provide an explicit read-only=on option.
Now that we have auto-read-only=on, enable these drivers to make use of
the option.
This is the only use of bdrv_set_read_only(), so we can make it a bit
more specific and turn it into a bdrv_apply_auto_read_only() that is
more convenient for drivers to use.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Commit e2b8247a32 introduced an error path in qemu_rbd_open() after
calling rbd_open(), but neglected to close the image again in this error
path. The error path should contain everything that the regular close
function qemu_rbd_close() contains.
This adds the missing rbd_close() call.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
If a management application builds the block graph node by node, the
protocol layer doesn't inherit its read-only option from the format
layer any more, so it must be set explicitly.
Backing files should work on read-only storage, but at the same time, a
block job like commit should be able to reopen them read-write if they
are on read-write storage. However, without option inheritance, reopen
only changes the read-only option for the root node (typically the
format layer), but not the protocol layer, so reopening fails (the
format layer wants to get write permissions, but the protocol layer is
still read-only).
A simple workaround for the problem in the management tool would be to
open the protocol layer always read-write and to make only the format
layer read-only for backing files. However, sometimes the file is
actually stored on read-only storage and we don't know whether the image
can be opened read-write (for example, for NBD it depends on the server
we're trying to connect to). This adds an option that makes QEMU try to
open the image read-write, but allows it to degrade to a read-only mode
without returning an error.
The documentation for this option is consciously phrased in a way that
allows QEMU to switch to a better model eventually: Instead of trying
when the image is first opened, making the read-only flag dynamic and
changing it automatically whenever the first BLK_PERM_WRITE user is
attached or the last one is detached would be much more useful
behaviour.
Unfortunately, this more useful behaviour is also a lot harder to
implement, and libvirt needs a solution now before it can switch to
-blockdev, so let's start with this easier approach for now.
Instead of adding a new auto-read-only option, turning the existing
read-only into an enum (with a bool alternate for compatibility) was
considered, but it complicated the implementation to the point that it
didn't seem to be worth it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The blkverify mode of Quorum only works when the number of children is
exactly two, so any attempt to add a new one must return an error.
quorum_del_child() on the other hand doesn't need any additional check
because decreasing the number of children would make it go under the
vote threshold.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The blkverify mode of Quorum can only be enabled if the number of
children is exactly two and the value of vote-threshold is also two.
If the user tries to enable it but the other settings are incorrect
then QEMU simply prints an error message to stderr and carries on
disabling the blkverify setting.
This patch makes quorum_open() fail and return an error in this case.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reported-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is a static function with only one caller, so there's no need to
keep it. Inlining the code in quorum_compare() makes it much simpler.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reported-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Taking the address of a field in a packed struct is a bad idea, because
it might not be actually aligned enough for that pointer type (and
thus cause a crash on dereference on some host architectures). Newer
versions of clang warn about this. Avoid the bug by not using the
"modify in place" byte swapping functions.
There are a few places where the in-place swap function is
used on something other than a packed struct field; we convert
those anyway, for consistency.
Patch produced with scripts/coccinelle/inplace-byteswaps.cocci.
There are other places where we take the address of a packed member
in this file for other purposes than passing it to a byteswap
function (all the calls to qemu_uuid_*()); we leave those for now.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Taking the address of a field in a packed struct is a bad idea, because
it might not be actually aligned enough for that pointer type (and
thus cause a crash on dereference on some host architectures). Newer
versions of clang warn about this. Avoid the bug by not using the
"modify in place" byte swapping functions.
There are a few places where the in-place swap function is
used on something other than a packed struct field; we convert
those anyway, for consistency.
Patch produced with scripts/coccinelle/inplace-byteswaps.cocci.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This doesn't have any practical effect at the moment because the
values of BDRV_SECTOR_SIZE, QCRYPTO_BLOCK_LUKS_SECTOR_SIZE and
QCRYPTO_BLOCK_QCOW_SECTOR_SIZE are all the same (512 bytes), but
future encryption methods could have different requirements.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Taking the address of a field in a packed struct is a bad idea, because
it might not be actually aligned enough for that pointer type (and
thus cause a crash on dereference on some host architectures). Newer
versions of clang warn about this. Avoid the bug by not using the
"modify in place" byte swapping functions.
There are a few places where the in-place swap function is
used on something other than a packed struct field; we convert
those anyway, for consistency.
This patch was produced with the following spatch script:
@@
expression E;
@@
-be16_to_cpus(&E);
+E = be16_to_cpu(E);
@@
expression E;
@@
-be32_to_cpus(&E);
+E = be32_to_cpu(E);
@@
expression E;
@@
-be64_to_cpus(&E);
+E = be64_to_cpu(E);
@@
expression E;
@@
-cpu_to_be16s(&E);
+E = cpu_to_be16(E);
@@
expression E;
@@
-cpu_to_be32s(&E);
+E = cpu_to_be32(E);
@@
expression E;
@@
-cpu_to_be64s(&E);
+E = cpu_to_be64(E);
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: John Snow <jsnow@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Taking the address of a field in a packed struct is a bad idea, because
it might not be actually aligned enough for that pointer type (and
thus cause a crash on dereference on some host architectures). Newer
versions of clang warn about this. Avoid the bug by not using the
"modify in place" byte swapping functions.
There are a few places where the in-place swap function is
used on something other than a packed struct field; we convert
those anyway, for consistency.
This patch was produced with the following spatch script:
@@
expression E;
@@
-be16_to_cpus(&E);
+E = be16_to_cpu(E);
@@
expression E;
@@
-be32_to_cpus(&E);
+E = be32_to_cpu(E);
@@
expression E;
@@
-be64_to_cpus(&E);
+E = be64_to_cpu(E);
@@
expression E;
@@
-cpu_to_be16s(&E);
+E = cpu_to_be16(E);
@@
expression E;
@@
-cpu_to_be32s(&E);
+E = cpu_to_be32(E);
@@
expression E;
@@
-cpu_to_be64s(&E);
+E = cpu_to_be64(E);
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: John Snow <jsnow@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Taking the address of a field in a packed struct is a bad idea, because
it might not be actually aligned enough for that pointer type (and
thus cause a crash on dereference on some host architectures). Newer
versions of clang warn about this. Avoid the bug by not using the
"modify in place" byte swapping functions.
There are a few places where the in-place swap function is
used on something other than a packed struct field; we convert
those anyway, for consistency.
This patch was produced with the following spatch script
(and hand-editing to fold a few resulting overlength lines):
@@
expression E;
@@
-be16_to_cpus(&E);
+E = be16_to_cpu(E);
@@
expression E;
@@
-be32_to_cpus(&E);
+E = be32_to_cpu(E);
@@
expression E;
@@
-be64_to_cpus(&E);
+E = be64_to_cpu(E);
@@
expression E;
@@
-cpu_to_be16s(&E);
+E = cpu_to_be16(E);
@@
expression E;
@@
-cpu_to_be32s(&E);
+E = cpu_to_be32(E);
@@
expression E;
@@
-cpu_to_be64s(&E);
+E = cpu_to_be64(E);
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: John Snow <jsnow@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When using the vvfat driver with a directory that contains too many files,
QEMU currently crashes. This can be triggered like this for example:
mkdir /tmp/vvfattest
cd /tmp/vvfattest
for ((x=0;x<=513;x++)); do mkdir $x; done
qemu-system-x86_64 -drive \
file.driver=vvfat,file.dir=.,read-only=on,media=cdrom
Seems like read_directory() is changing the mapping->path variable. Make
sure we use the right pointer instead.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch aims to bring the following behavior:
1. We don't load bitmaps, when started in inactive mode. It's the case
of incoming migration. In this case we wait for bitmaps migration
through migration channel (if 'dirty-bitmaps' capability is enabled) or
for invalidation (to load bitmaps from the image).
2. We don't remove persistent bitmaps on inactivation. Instead, we only
remove bitmaps after storing. This is the only way to restore bitmaps,
if we decided to resume source after [failed] migration with
'dirty-bitmaps' capability enabled (which means, that bitmaps were not
stored).
3. We load bitmaps on open and any invalidation, it's ok for all cases:
- normal open
- migration target invalidation with dirty-bitmaps capability
(bitmaps are migrating through migration channel, the are not
stored, so they should have IN_USE flag set and will be skipped
when loading. However, it would fail if bitmaps are read-only[1])
- migration target invalidation without dirty-bitmaps capability
(normal load of the bitmaps, if migrated with shared storage)
- source invalidation with dirty-bitmaps capability
(skip because IN_USE)
- source invalidation without dirty-bitmaps capability
(bitmaps were dropped, reload them)
[1]: to accurately handle this, migration of read-only bitmaps is
explicitly forbidden in this patch.
New mechanism for not storing bitmaps when migrate with dirty-bitmaps
capability is introduced: migration filed in BdrvDirtyBitmap.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Similarly to merge, it's OK to allow clear operations on disabled
bitmaps, as this condition only means that they are not recording
new writes. We are free to clear it if the user requests it.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20181002230218.13949-4-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
In prior commits that made merge transactionable, we removed the
assertion that merge cannot operate on disabled bitmaps. In addition,
we want to make sure that we are prohibiting merges to "locked" bitmaps.
Use the new user_locked function to check.
Reported-by: Eric Blake <eblake@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20181002230218.13949-3-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
Instead of both frozen and qmp_locked checks, wrap it into one check.
frozen implies the bitmap is split in two (for backup), and shouldn't
be modified. qmp_locked implies it's being used by another operation,
like being exported over NBD. In both cases it means we shouldn't allow
the user to modify it in any meaningful way.
Replace any usages where we check both frozen and qmp_locked with the
new check.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20181002230218.13949-2-jsnow@redhat.com
[w/edits Suggested-By: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>]
Signed-off-by: John Snow <jsnow@redhat.com>
This variable doesn't work as it should, because it is actually cleared
in qcow2_co_invalidate_cache() by memset(). Drop it, as the following
patch will introduce new behavior.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[Maintainer edit -- touched up error message. --js]
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Add backup parameter to bdrv_merge_dirty_bitmap() to be used then with
bdrv_restore_dirty_bitmap() if it needed to restore the bitmap after
merge operation.
This is needed to implement bitmap merge transaction action in further
commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Use more generic names to reuse the function for bitmap merge in the
following commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Move checks from qmp_x_block_dirty_bitmap_merge() to
bdrv_merge_dirty_bitmap(), to share them with dirty bitmap merge
transaction action in future commit.
Note: for now, only qmp_x_block_dirty_bitmap_merge() calls
bdrv_merge_dirty_bitmap().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
vpc_open() merely prints a warning when it finds a bad header
checksum. Turn that into a hard error.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20181017082702.5581-39-armbru@redhat.com>
[Error message capitalized for local consistency]
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Calling error_report() in a function that takes an Error ** argument
is suspicious. Convert a few that are actually warnings to
warn_report().
While there, split warnings consisting of multiple sentences to
conform to conventions spelled out in warn_report()'s contract, and
improve a rather useless warning in sheepdog.c.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Lieven <pl@kamp.de>
Cc: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20181017082702.5581-4-armbru@redhat.com>
Drop changes to "without an explicit read-only=on" warnings, because
there's a series removing them pending. Also drop a cc: to a former
Sheepdog maintainer.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
From include/qapi/error.h:
* Pass an existing error to the caller with the message modified:
* error_propagate(errp, err);
* error_prepend(errp, "Could not frobnicate '%s': ", name);
Fei Li pointed out that doing error_propagate() first doesn't work
well when @errp is &error_fatal or &error_abort: the error_prepend()
is never reached.
Since I doubt fixing the documentation will stop people from getting
it wrong, introduce error_propagate_prepend(), in the hope that it
lures people away from using its constituents in the wrong order.
Update the instructions in error.h accordingly.
Convert existing error_prepend() next to error_propagate to
error_propagate_prepend(). If any of these get reached with
&error_fatal or &error_abort, the error messages improve. I didn't
check whether that's the case anywhere.
Cc: Fei Li <fli@suse.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20181017082702.5581-2-armbru@redhat.com>
nvme_poll_queues is already protected by q->lock, and
AIO callbacks are invoked outside the AioContext lock.
So remove the acquire/release pair in nvme_handle_event.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20180814062739.19640-1-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Currently, the default values for werror and rerror have to be set
explicitly with blk_set_on_error() by the callers of blk_new(). The only
caller actually doing this is blockdev_init(), which is called for
BlockBackends created using -drive.
In particular, anonymous BlockBackends created with
-device ...,drive=<node-name> didn't get the correct default set and
instead defaulted to the integer value 0 (= BLOCKDEV_ON_ERROR_REPORT).
This is the intended default for rerror anyway, but the default for
werror should be BLOCKDEV_ON_ERROR_ENOSPC.
Set the defaults in blk_new() instead so that they apply no matter what
way the BlockBackend was created.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The default cache-clean-interval is set to 10 minutes, in order to lower
the overhead of the qcow2 caches (before the default was 0, i.e.
disabled).
* For non-Linux platforms the default is kept at 0, because
cache-clean-interval is not supported there yet.
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The caches are now recalculated upon image resizing. This is done
because the new default behavior of assigning L2 cache relatively to
the image size, implies that the cache will be adapted accordingly
after an image resize.
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The upper limit on the L2 cache size is increased from 1 MB to 32 MB
on Linux platforms, and to 8 MB on other platforms (this difference is
caused by the ability to set intervals for cache cleaning on Linux
platforms only).
This is done in order to allow default full coverage with the L2 cache
for images of up to 256 GB in size (was 8 GB). Note, that only the
needed amount to cover the full image is allocated. The value which is
changed here is just the upper limit on the L2 cache size, beyond which
it will not grow, even if the size of the image will require it to.
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Sufficient L2 cache can noticeably improve the performance when using
large images with frequent I/O.
Previously, unless 'cache-size' was specified and was large enough, the
L2 cache was set to a certain size without taking the virtual image size
into account.
Now, the L2 cache assignment is aware of the virtual size of the image,
and will cover the entire image, unless the cache size needed for that is
larger than a certain maximum. This maximum is set to 1 MB by default
(enough to cover an 8 GB image with the default cluster size) but can
be increased or decreased using the 'l2-cache-size' option. This option
was previously documented as the *maximum* L2 cache size, and this patch
makes it behave as such, instead of as a constant size. Also, the
existing option 'cache-size' can limit the sum of both L2 and refcount
caches, as previously.
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The refcount cache size does not need to be set to its minimum value in
read_cache_sizes(), as it is set to at least its minimum value in
qcow2_update_options_prepare().
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The file-posix code is used for the "file", "host_device" and
"host_cdrom" drivers, and it allows reopening images. However the only
option that is actually processed is "x-check-cache-dropped", and
changes in all other options (e.g. "filename") are silently ignored:
(qemu) qemu-io virtio0 "reopen -o file.filename=no-such-file"
While we could allow changing some of the other options, let's keep
things as they are for now but return an error if the user tries to
change any of them.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The default value of x-check-cache-dropped is false. There's no reason
to use the previous value as a default in raw_reopen_prepare() because
bdrv_reopen_queue_child() already takes care of putting the old
options in the BDRVReopenState.options QDict.
If x-check-cache-dropped was previously set but is now missing from
the reopen QDict then it should be reset to false.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Image locking errors happening at device initialization time doesn't say
which file cannot be locked, for instance,
-device scsi-disk,drive=drive-1: Failed to get shared "write" lock
Is another process using the image?
could refer to either the overlay image or its backing image.
Hoist the error_append_hint to the caller of raw_check_lock_bytes where
file name is known, and include it in the error hint.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is a rare case which the size of last compressed cluster
is larger than the cluster size, which will cause the file is
not aligned at the sector boundary.
There are three reasons to do it. First, if vmdk doesn't align at
the sector boundary, there may be many undefined behaviors,
such as, in vbox it will show VMDK: Compressed image is corrupted
'syno-vm-disk1.vmdk' (VERR_ZIP_CORRUPTED) when we try to import an
ova with unaligned vmdk. Second, all the cluster_sector is aligned
to sector, the last one should be like this, too. Third, it ease
reading with sector based I/Os.
Signed-off-by: yuchenlin <yuchenlin@synology.com>
Message-Id: <20180913082952.3675-1-yuchenlin@synology.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
- Deprecate the "enforce-config-section" machine parameter
- Re-enable the wdt_ib700, endianness and vmxnet3 qtests
- Some trivial fixes and doc update patches that crossed my way
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJbqlsyAAoJEC7Z13T+cC21RbAP/3IvGfBxuRm6rBWoghjQgbl8
KU8nPnlZUtqjxmfUTILO/h+pJ3na5MQ8hh7v8JHi+xlQ2DPkECW21DtnfdxntVjw
+b+N5Ap6J22GHyEq4HJXPWAk2rDInqkU966DvL40RiMvOTfXdg9EO0TDX0VsVgZv
BR1r7/t3T0P7hiQ0XWb9U2JchRIC+Zgk34gXZPSTpoIv89fUhzNoK5LvAA6yV1FQ
TvE8VTKJm4wkqThH1ShtbJCBKjHjW/W8LYZr3YMothcs8vGjEdEcDL4BoJZDn3bF
h4VTkU+k8lp7W9LmlnPnu1WH/5ezhzdwJTeFaPJt4U10WKJptAS4vbK03DXlds9O
9d2BOXKrima2kSr1ejSe1f0kcE8fis1XFmSuhF61Nbw6ngT5+pP2JSc1XwFazd2K
zQwV4GXBLzAGnd4F2Ec+5TKzbGFVfczxeBDiBkkVmG+XdX/UXJpkpPYGAaw7DDiK
JwKVVYIPk1ll6MAbR6qEGsvE/adHNEm8lUdjXqwgbQlIeUZ2H0hCu9lJ0X81mtoQ
WZP+nMa/87COnlPX6VPVgxM2TXQOH/UbGz/WmYzZ6/gPKTX+gfwrHQGdp7Tjl33U
KxFKWioFnoqGuyWasvTtKEK67/IlrY+w1nXuuqKJg8J2/qx1SVtx45FHkRkxkIDx
4boRpx0XUqpDVdf8VhRB
=dXgp
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/huth-gitlab/tags/pull-request-2018-09-25' into staging
- Deprecate the usage of a network backend via "name" instead of "id"
- Deprecate the "enforce-config-section" machine parameter
- Re-enable the wdt_ib700, endianness and vmxnet3 qtests
- Some trivial fixes and doc update patches that crossed my way
# gpg: Signature made Tue 25 Sep 2018 16:58:42 BST
# gpg: using RSA key 2ED9D774FE702DB5
# gpg: Good signature from "Thomas Huth <th.huth@gmx.de>"
# gpg: aka "Thomas Huth <thuth@redhat.com>"
# gpg: aka "Thomas Huth <huth@tuxfamily.org>"
# gpg: aka "Thomas Huth <th.huth@posteo.de>"
# Primary key fingerprint: 27B8 8847 EEE0 2501 18F3 EAB9 2ED9 D774 FE70 2DB5
* remotes/huth-gitlab/tags/pull-request-2018-09-25:
Revert "check: Move VMXNET3 test to common"
Revert "check: Move endianess test to common"
Revert "check: Move wdt_ib700 test to common"
tests/migration: Speed up the test on ppc64
hw/qdev-core: Fix description of instance_init
qdev: fix a typo in comment
docs: Fix some typos (most found by codespell)
trivial: Make bios files and source files non-executable
memfd: fix possible usage of the uninitialized file descriptor
hw/core/machine: Officially deprecate the enforce-config-section parameter
net/slirp: Deprecate the [hub_id name] parameter tuple
net: Deprecate the "name" parameter of -net
Makefile: Add missing dependency for qemu-deprecated.texi
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
These files can not be executed on the host, so they should not be
marked as executable.
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
When draining a block node, we recurse to its parent and for subtree
drains also to its children. A single AIO_WAIT_WHILE() is then used to
wait for bdrv_drain_poll() to become true, which depends on all of the
nodes we recursed to. However, if the respective child or parent becomes
quiescent and calls bdrv_wakeup(), only the AioWait of the child/parent
is checked, while AIO_WAIT_WHILE() depends on the AioWait of the
original node.
Fix this by using a single AioWait for all callers of AIO_WAIT_WHILE().
This may mean that the draining thread gets a few more unnecessary
wakeups because an unrelated operation got completed, but we already
wake it up when something _could_ have changed rather than only if it
has certainly changed.
Apart from that, drain is a slow path anyway. In theory it would be
possible to use wakeups more selectively and still correctly, but the
gains are likely not worth the additional complexity. In fact, this
patch is a nice simplification for some places in the code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
bdrv_drain_poll_top_level() was buggy because it didn't release the
AioContext lock of the node to be drained before calling aio_poll().
This way, callbacks called by aio_poll() would possibly take the lock a
second time and run into a deadlock with a nested AIO_WAIT_WHILE() call.
However, it turns out that the aio_poll() call isn't actually needed any
more. It was introduced in commit 91af091f92, which is effectively
reverted by this patch. The cases it was supposed to fix are now covered
by bdrv_drain_poll(), which waits for block jobs to reach a quiescent
state.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Request callbacks can do pretty much anything, including operations that
will yield from the coroutine (such as draining the backend). In that
case, a decreased in_flight would be visible to other code and could
lead to a drain completing while the callback hasn't actually completed
yet.
Note that reordering these operations forbids calling drain directly
inside an AIO callback. As Paolo explains, indirectly calling it is
okay:
- Calling it through a coroutine is okay, because then
bdrv_drained_begin() goes through bdrv_co_yield_to_drain() and you
have in_flight=2 when bdrv_co_yield_to_drain() yields, then soon
in_flight=1 when the aio_co_wake() in the AIO callback completes, then
in_flight=0 after the bottom half starts.
- Calling it through a bottom half would be okay too, as long as the AIO
callback remembers to do inc_in_flight/dec_in_flight just like
bdrv_co_yield_to_drain() and bdrv_co_drain_bh_cb() do
A few more important cases that come to mind:
- A coroutine that yields because of I/O is okay, with a sequence
similar to bdrv_co_yield_to_drain().
- A coroutine that yields with no I/O pending will correctly decrease
in_flight to zero before yielding.
- Calling more AIO from the callback won't overflow the counter just
because of mutual recursion, because AIO functions always yield at
least once before invoking the callback.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
blk_unref() first decreases the refcount of the BlockBackend and calls
blk_delete() if the refcount reaches zero. Requests can still be in
flight at this point, they are only drained during blk_delete():
At this point, arbitrary callbacks can run. If any callback takes a
temporary BlockBackend reference, it will first increase the refcount to
1 and then decrease it to 0 again, triggering another blk_delete(). This
will cause a use-after-free crash in the outer blk_delete().
Fix it by draining the BlockBackend before decreasing to refcount to 0.
Assert in blk_ref() that it never takes the first refcount (which would
mean that the BlockBackend is already being deleted).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
A bdrv_drain operation must ensure that all parents are quiesced, this
includes BlockBackends. Otherwise, callbacks called by requests that are
completed on the BDS layer, but not quite yet on the BlockBackend layer
could still create new requests.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
bdrv_do_drained_begin/end() assume that they are called with the
AioContext lock of bs held. If we call drain functions from a coroutine
with the AioContext lock held, we yield and schedule a BH to move out of
coroutine context. This means that the lock for the home context of the
coroutine is released and must be re-acquired in the bottom half.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
In qemu_laio_process_completions_and_submit, the AioContext is acquired
before the ioq_submit iteration and after qemu_laio_process_completions,
but the latter is not thread safe either.
This change avoids a number of random crashes when the Main Thread and
an IO Thread collide processing completions for the same AioContext.
This is an example of such crash:
- The IO Thread is trying to acquire the AioContext at aio_co_enter,
which evidences that it didn't lock it before:
Thread 3 (Thread 0x7fdfd8bd8700 (LWP 36743)):
#0 0x00007fdfe0dd542d in __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fdfe0dd0de6 in _L_lock_870 () at /lib64/libpthread.so.0
#2 0x00007fdfe0dd0cdf in __GI___pthread_mutex_lock (mutex=mutex@entry=0x5631fde0e6c0)
at ../nptl/pthread_mutex_lock.c:114
#3 0x00005631fc0603a7 in qemu_mutex_lock_impl (mutex=0x5631fde0e6c0, file=0x5631fc23520f "util/async.c", line=511) at util/qemu-thread-posix.c:66
#4 0x00005631fc05b558 in aio_co_enter (ctx=0x5631fde0e660, co=0x7fdfcc0c2b40) at util/async.c:493
#5 0x00005631fc05b5ac in aio_co_wake (co=<optimized out>) at util/async.c:478
#6 0x00005631fbfc51ad in qemu_laio_process_completion (laiocb=<optimized out>) at block/linux-aio.c:104
#7 0x00005631fbfc523c in qemu_laio_process_completions (s=s@entry=0x7fdfc0297670)
at block/linux-aio.c:222
#8 0x00005631fbfc5499 in qemu_laio_process_completions_and_submit (s=0x7fdfc0297670)
at block/linux-aio.c:237
#9 0x00005631fc05d978 in aio_dispatch_handlers (ctx=ctx@entry=0x5631fde0e660) at util/aio-posix.c:406
#10 0x00005631fc05e3ea in aio_poll (ctx=0x5631fde0e660, blocking=blocking@entry=true)
at util/aio-posix.c:693
#11 0x00005631fbd7ad96 in iothread_run (opaque=0x5631fde0e1c0) at iothread.c:64
#12 0x00007fdfe0dcee25 in start_thread (arg=0x7fdfd8bd8700) at pthread_create.c:308
#13 0x00007fdfe0afc34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
- The Main Thread is also processing completions from the same
AioContext, and crashes due to failed assertion at util/iov.c:78:
Thread 1 (Thread 0x7fdfeb5eac80 (LWP 36740)):
#0 0x00007fdfe0a391f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fdfe0a3a8e8 in __GI_abort () at abort.c:90
#2 0x00007fdfe0a32266 in __assert_fail_base (fmt=0x7fdfe0b84e68 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x5631fc238ccb "offset == 0", file=file@entry=0x5631fc23698e "util/iov.c", line=line@entry=78, function=function@entry=0x5631fc236adc <__PRETTY_FUNCTION__.15220> "iov_memset")
at assert.c:92
#3 0x00007fdfe0a32312 in __GI___assert_fail (assertion=assertion@entry=0x5631fc238ccb "offset == 0", file=file@entry=0x5631fc23698e "util/iov.c", line=line@entry=78, function=function@entry=0x5631fc236adc <__PRETTY_FUNCTION__.15220> "iov_memset") at assert.c:101
#4 0x00005631fc065287 in iov_memset (iov=<optimized out>, iov_cnt=<optimized out>, offset=<optimized out>, offset@entry=65536, fillc=fillc@entry=0, bytes=15515191315812405248) at util/iov.c:78
#5 0x00005631fc065a63 in qemu_iovec_memset (qiov=<optimized out>, offset=offset@entry=65536, fillc=fillc@entry=0, bytes=<optimized out>) at util/iov.c:410
#6 0x00005631fbfc5178 in qemu_laio_process_completion (laiocb=0x7fdd920df630) at block/linux-aio.c:88
#7 0x00005631fbfc523c in qemu_laio_process_completions (s=s@entry=0x7fdfc0297670)
at block/linux-aio.c:222
#8 0x00005631fbfc5499 in qemu_laio_process_completions_and_submit (s=0x7fdfc0297670)
at block/linux-aio.c:237
#9 0x00005631fbfc54ed in qemu_laio_poll_cb (opaque=<optimized out>) at block/linux-aio.c:272
#10 0x00005631fc05d85e in run_poll_handlers_once (ctx=ctx@entry=0x5631fde0e660) at util/aio-posix.c:497
#11 0x00005631fc05e2ca in aio_poll (blocking=false, ctx=0x5631fde0e660) at util/aio-posix.c:574
#12 0x00005631fc05e2ca in aio_poll (ctx=0x5631fde0e660, blocking=blocking@entry=false)
at util/aio-posix.c:604
#13 0x00005631fbfcb8a3 in bdrv_do_drained_begin (ignore_parent=<optimized out>, recursive=<optimized out>, bs=<optimized out>) at block/io.c:273
#14 0x00005631fbfcb8a3 in bdrv_do_drained_begin (bs=0x5631fe8b6200, recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, poll=<optimized out>) at block/io.c:390
#15 0x00005631fbfbcd2e in blk_drain (blk=0x5631fe83ac80) at block/block-backend.c:1590
#16 0x00005631fbfbe138 in blk_remove_bs (blk=blk@entry=0x5631fe83ac80) at block/block-backend.c:774
#17 0x00005631fbfbe3d6 in blk_unref (blk=0x5631fe83ac80) at block/block-backend.c:401
#18 0x00005631fbfbe3d6 in blk_unref (blk=0x5631fe83ac80) at block/block-backend.c:449
#19 0x00005631fbfc9a69 in commit_complete (job=0x5631fe8b94b0, opaque=0x7fdfcc1bb080)
at block/commit.c:92
#20 0x00005631fbf7d662 in job_defer_to_main_loop_bh (opaque=0x7fdfcc1b4560) at job.c:973
#21 0x00005631fc05ad41 in aio_bh_poll (bh=0x7fdfcc01ad90) at util/async.c:90
#22 0x00005631fc05ad41 in aio_bh_poll (ctx=ctx@entry=0x5631fddffdb0) at util/async.c:118
#23 0x00005631fc05e210 in aio_dispatch (ctx=0x5631fddffdb0) at util/aio-posix.c:436
#24 0x00005631fc05ac1e in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#25 0x00007fdfeaae44c9 in g_main_context_dispatch (context=0x5631fde00140) at gmain.c:3201
#26 0x00007fdfeaae44c9 in g_main_context_dispatch (context=context@entry=0x5631fde00140) at gmain.c:3854
#27 0x00005631fc05d503 in main_loop_wait () at util/main-loop.c:215
#28 0x00005631fc05d503 in main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
#29 0x00005631fc05d503 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
#30 0x00005631fbd81412 in main_loop () at vl.c:1866
#31 0x00005631fbc18ff3 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
at vl.c:4647
- A closer examination shows that s->io_q.in_flight appears to have
gone backwards:
(gdb) frame 7
#7 0x00005631fbfc523c in qemu_laio_process_completions (s=s@entry=0x7fdfc0297670)
at block/linux-aio.c:222
222 qemu_laio_process_completion(laiocb);
(gdb) p s
$2 = (LinuxAioState *) 0x7fdfc0297670
(gdb) p *s
$3 = {aio_context = 0x5631fde0e660, ctx = 0x7fdfeb43b000, e = {rfd = 33, wfd = 33}, io_q = {plugged = 0,
in_queue = 0, in_flight = 4294967280, blocked = false, pending = {sqh_first = 0x0,
sqh_last = 0x7fdfc0297698}}, completion_bh = 0x7fdfc0280ef0, event_idx = 21, event_max = 241}
(gdb) p/x s->io_q.in_flight
$4 = 0xfffffff0
Signed-off-by: Sergio Lopez <slp@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180906130225.5118-8-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
For purposes of minimum code movement, refactor the mirror_exit
callback to use the post-finalization callbacks in a trivial way.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 20180906130225.5118-7-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
[mreitz: Added comment for the mirror_exit() function]
Signed-off-by: Max Reitz <mreitz@redhat.com>
In cases where we abort the block/mirror job, there's no point in
installing the new backing chain before we finish aborting.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 20180906130225.5118-6-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Use the component callbacks; prepare, abort, and clean.
NB: prepare is only called when the job has not yet failed;
and abort can be called after prepare.
complete -> prepare -> abort -> clean
complete -> abort -> clean
During refactor, a potential problem with bdrv_drop_intermediate
was identified, the patched behavior is no worse than the pre-patch
behavior, so leave a FIXME for now to be fixed in a future patch.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180906130225.5118-5-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add support for taking and passing forward job creation flags.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180906130225.5118-4-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add support for taking and passing forward job creation flags.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180906130225.5118-3-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add support for taking and passing forward job creation flags.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180906130225.5118-2-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
The sslverify setting is supposed to turn off all TLS certificate
checks in libcurl. However because of the way we use it, it only
turns off peer certificate authenticity checks
(CURLOPT_SSL_VERIFYPEER). This patch makes it also turn off the check
that the server name in the certificate is the same as the server
you're connecting to (CURLOPT_SSL_VERIFYHOST).
We can use Google's server at 8.8.8.8 which happens to have a bad TLS
certificate to demonstrate this:
$ ./qemu-img create -q -f qcow2 -b 'json: { "file.sslverify": "off", "file.driver": "https", "file.url": "https://8.8.8.8/foo" }' /var/tmp/file.qcow2
qemu-img: /var/tmp/file.qcow2: CURL: Error opening file: SSL: no alternative certificate subject name matches target host name '8.8.8.8'
Could not open backing image to determine size.
With this patch applied, qemu-img connects to the server regardless of
the bad certificate:
$ ./qemu-img create -q -f qcow2 -b 'json: { "file.sslverify": "off", "file.driver": "https", "file.url": "https://8.8.8.8/foo" }' /var/tmp/file.qcow2
qemu-img: /var/tmp/file.qcow2: CURL: Error opening file: The requested URL returned error: 404 Not Found
(The 404 error is expected because 8.8.8.8 is not actually serving a
file called "/foo".)
Of course the default (without sslverify=off) remains to always check
the certificate:
$ ./qemu-img create -q -f qcow2 -b 'json: { "file.driver": "https", "file.url": "https://8.8.8.8/foo" }' /var/tmp/file.qcow2
qemu-img: /var/tmp/file.qcow2: CURL: Error opening file: SSL: no alternative certificate subject name matches target host name '8.8.8.8'
Could not open backing image to determine size.
Further information about the two settings is available here:
https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.htmlhttps://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYHOST.html
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Message-id: 20180914095622.19698-1-rjones@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
When we converted rbd to get rid of the older key/value-centric
encoding format, we broke compatibility with image files with backing
file strings encoded in the old format.
This leaves a bit of an ugly conundrum, and a hacky solution.
If the initial attempt to parse the "proper" options fails, it assumes
that we may have an older key/value encoded filename. Fall back to
attempting to parse the filename, and extract the required options from
it. If that fails, pass along the original error message.
We do not support mixed modern usage alongside legacy keyvalue pair
usage.
A deprecation warning has been added, although care should be taken
when actually deprecating since the impact is not limited to
commandline or qapi usage, but also opening existing images.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Message-id: 15b332e5432ad069441f7275a46080f465d789a0.1536704901.git.jcody@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Code movement to pull the conversion from Qdict to BlockdevOptionsRbd
into a helper function.
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Message-id: 5b49a980f2cde6610ab1df41bb0277d00b5db893.1536704901.git.jcody@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Rename opaque_job to job to be consistent with other job implementations.
Rename 'job', the BackupBlockJob object, to 's' to also be consistent.
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180830015734.19765-8-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Utilize the job_exit shim by not calling job_defer_to_main_loop, and
where applicable, converting the deferred callback into the job_exit
callback.
This converts backup, stream, create, and the unit tests all at once.
Most of these jobs do not see any changes to the order in which they
clean up their resources, except the test-blockjob-txn test, which
now puts down its bs before job_completed is called.
This is safe for the same reason the reordering in the mirror job is
safe, because job_completed no longer runs under two locks, making
the unref safe even if it causes a flush.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180830015734.19765-7-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Change the manual deferment to mirror_exit into the implicit
callback to job_exit and the mirror_exit callback.
This does change the order of some bdrv_unref calls and job_completed,
but thanks to the new context in which we call .exit, this is safe to
defer the possible flushing of any nodes to the job_finalize_single
cleanup stage.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 20180830015734.19765-6-jsnow@redhat.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Change the manual deferment to commit_complete into the implicit
callback to job_exit, renaming commit_complete to commit_exit.
This conversion does change the timing of when job_completed is
called to after the bdrv_replace_node and bdrv_unref calls, which
could have implications for bjob->blk which will now be put down
after this cleanup.
Kevin highlights that we did not take any permissions for that backend
at job creation time, so it is safe to reorder these operations.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180830015734.19765-5-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Jobs presently use both an Error object in the case of the create job,
and char strings in the case of generic errors elsewhere.
Unify the two paths as just j->err, and remove the extra argument from
job_completed. The integer error code for job_completed is kept for now,
to be removed shortly in a separate patch.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 20180830015734.19765-3-jsnow@redhat.com
[mreitz: Dropped a superfluous g_strdup()]
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Presently we codify the entry point for a job as the "start" callback,
but a more apt name would be "run" to clarify the idea that when this
function returns we consider the job to have "finished," except for
any cleanup which occurs in separate callbacks later.
As part of this clarification, change the signature to include an error
object and a return code. The error ptr is not yet used, and the return
code while captured, will be overwritten by actions in the job_completed
function.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180830015734.19765-2-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The generated qapi_event_send_FOO() take an Error ** argument. They
can't actually fail, because all they do with the argument is passing it
to functions that can't fail: the QObject output visitor, and the
@qmp_emit callback, which is either monitor_qapi_event_queue() or
event_test_emit().
Drop the argument, and pass &error_abort to the QObject output visitor
and @qmp_emit instead.
Suggested-by: Eric Blake <eblake@redhat.com>
Suggested-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20180815133747.25032-4-peterx@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Commit message rewritten, update to qapi-code-gen.txt corrected]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
.bdrv_close handler is optional after previous commit, no needs to keep
empty functions more.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
blockdev-mirror with the same node for source and target segfaults
today: A node is in its own backing chain, so mirror_start_job() decides
that this is an active commit. When adding the intermediate nodes with
block_job_add_bdrv(), it starts the iteration through the subchain with
the backing file of source, though, so it never reaches target and
instead runs into NULL at the base.
While we could fix that by starting with source itself, there is no
point in allowing mirroring a node into itself and I wouldn't be
surprised if this caused more problems later.
So just check for this scenario and error out.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This reinstates commit b008326744,
which was temporarily reverted for the 3.0 release so that libvirt gets
some extra time to update their command lines.
The -drive option serial was deprecated in QEMU 2.10. It's time to
remove it.
Tests need to be updated to set the serial number with -global instead
of using the -drive option.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Format drivers such as qcow2 don't allow sharing the same image between
two QEMU instances in order to prevent image corruptions, because of
metadata cache. LUKS driver don't modify metadata except for when
creating image, so it is safe to relax the permission. This makes
share-rw=on property work on virtual devices.
Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 6fccbb475b fixed a bug caused by
QEMU attempting to remove a throttle group member with no pending
requests but an active timer set. This was the result of a previous
bdrv_drained_begin() call processing the throttled requests but
leaving the timer untouched.
Although the commit does solve the problem, the situation shouldn't
happen in the first place. If we try to drain a throttle group member
which has a timer set, we should cancel the timer instead of ignoring
it.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the throttling code after an I/O request has been completed the
next one is selected from a different member using a round-robin
algorithm. This ensures that all members get a chance to finish their
pending I/O requests.
However, if a group member has its I/O limits disabled (because it's
being drained) then we should always give it priority in order to have
all its pending requests finished as soon as possible.
If we don't do this we could have a member in the process of being
drained waiting for the throttled requests of other members, for which
the I/O limits still apply.
This can have additional consequences: if we're running in qtest mode
(with QEMU_CLOCK_VIRTUAL) then timers can only fire if we advance the
clock manually, so attempting to drain a block device can hang QEMU in
the BDRV_POLL_WHILE() loop at the end of bdrv_do_drained_begin().
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For BlockBackends that are skipped in query-blockstats, we would leak
info since commit 567dcb31. Allocate info only later to avoid the memory
leak.
Fixes: CID 1394727
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
bdrv_io_plug/bdrv_io_unplug take care of keeping a nesting count,
so change s->plugged to just a bool.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20180813144320.12382-2-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
It is wrong to leave this field as 1, as nvme_close() called in the
error handling code in nvme_file_open() will use it and try to free
s->queues again.
Another problem is the cleaning ups are duplicated between the fail*
labels of nvme_init() and nvme_file_open(), which calls nvme_close().
A third problem is nvme_close() misses g_free() and
event_notifier_cleanup().
Fix all of them.
Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-Id: <20180712025420.4932-1-famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Consistent with query-block, query-blockstats should not only include
named BlockBackends, but also those that are anonymous, but belong to a
device model.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Like for query-block, the client needs to identify which BlockBackend
the returned data is for. Anonymous BlockBackends are identified by the
device model they are attached to. Add a 'qdev' field that contains the
qdev ID or QOM path of the attached device model.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
all-zero afterwards, so don't try to abuse it for zero writing. We try
to only use this if BLKDISCARDZEROES tells us that it is safe, but this
is unreliable on older kernels and a constant 0 in newer kernels. In
other words, this code path is never actually used with newer kernels,
so we don't even try to unmap while writing zeros.
This patch removes the abuse of discard for writing zeroes from
file-posix and instead adds a new function that uses interfaces that are
actually meant to deallocate and zero out at the same time. Only if
those fail, it falls back to zeroing out without unmap. We never fall
back to a discard operation any more that may or may not result in
zeros.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since 42a3e1ab36 qemu asserts when using the
vvfat driver:
git clone git://qemu.org/qemu.git
cd qemu
./configure --target-list=ppc-softmmu --enable-debug
make -j8
mkdir foo
touch foo/hello
./ppc-softmmu/qemu-system-ppc -M prep --nographic --monitor null \
-hda fat:rw:./foo
"Ctrl-C"
qemu-system-ppc: block.c:3368: bdrv_close_all: Assertion \
`((&all_bdrv_states)->tqh_first == ((void *)0))' failed.
This is because we reference bs twice in qcow_co_create(..) one time in
bdrv_open_blockdev_ref(..) and in blk_insert_bs(..) but we unref it only once
in blk_unref which leads to the reference leak.
Note that I didn't tested much QCOW after this change as I don't use it much.
Signed-off-by: KONRAD Frederic <frederic.konrad@adacore.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qstring_from_substr() takes the index of the substring's first and
last character. qstring_from_substr(s, 0, SIZE_MAX) denotes an empty
substring. Awkward.
Shift the end index one to the right. This simplifies both
qstring_from_substr() and its callers.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180727062204.10401-3-armbru@redhat.com>
In ed6e2161 ("linux-aio: properly bubble up errors from initialzation"),
I only added a bdrv_attach_aio_context callback for the bdrv_file
driver. There are several other drivers that use the shared
aio_plug callback, though, and they will trip the assertion added to
aio_get_linux_aio because they did not call aio_setup_linux_aio first.
Add the appropriate callback definition to the affected driver
definitions.
Fixes: ed6e2161 ("linux-aio: properly bubble up errors from initialization")
Reported-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180718211256.29774-1-naravamudan@digitalocean.com
Cc: Eric Blake <eblake@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: John Snow <jsnow@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <famz@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-block@nongnu.org
Cc: qemu-devel@nongnu.org
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It's annoying to see this debug message every time you use vvfat.
Disable it with the DLOG() macro by default, as it is done with the
other debug messages in this file.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Throttle groups consist of members sharing one throttling state
(including bps/iops limits). Round-robin scheduling is used to ensure
fairness. If a group member already has a timer pending then other
groups members do not schedule their own timers. The next group member
will have its turn when the existing timer expires.
A hang may occur when a group member leaves while it had a timer
scheduled. Although the code carefully removes the group member from
the round-robin list, it does not schedule the next member. Therefore
remaining members continue to wait for the removed member's timer to
expire.
This patch schedules the next request if a timer is pending.
Unfortunately the actual bug is a race condition that I've been unable
to capture in a test case.
Sometimes drive2 hangs when drive1 is removed from the throttling group:
$ qemu ... -drive if=none,id=drive1,cache=none,format=qcow2,file=data1.qcow2,iops=100,group=foo \
-device virtio-blk-pci,id=virtio-blk-pci0,drive=drive1 \
-drive if=none,id=drive2,cache=none,format=qcow2,file=data2.qcow2,iops=10,group=foo \
-device virtio-blk-pci,id=virtio-blk-pci1,drive=drive2
(guest-console1)# fio -filename /dev/vda 4k-seq-read.job
(guest-console2)# fio -filename /dev/vdb 4k-seq-read.job
(qmp) {"execute": "block_set_io_throttle", "arguments": {"device": "drive1","bps": 0,"bps_rd": 0,"bps_wr": 0,"iops": 0,"iops_rd": 0,"iops_wr": 0}}
Reported-by: Nini Gu <ngu@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180704145410.794-1-stefanha@redhat.com
RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1535914
Cc: Alberto Garcia <berto@igalia.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Adjust each caller of raw_open_common to specify if they are expecting
host and character devices or not. Tighten expectations of file types upon
open in the common code and refuse types that are not expected.
This has two effects:
(1) Character and block devices are now considered deprecated for the
'file' driver, which expects only S_IFREG, and
(2) no file-posix driver (file, host_cdrom, or host_device) can open
directories now.
I don't think there's a legitimate reason to open directories as if
they were files. This prevents QEMU from opening and attempting to probe
a directory inode, which can break in exciting ways. One of those ways
is lseek on ext4/xfs, which will return 0x7fffffffffffffff as the file
size instead of EISDIR. This can coax QEMU into responding with a
confusing "file too big" instead of "Hey, that's not a file".
See: https://bugs.launchpad.net/qemu/+bug/1739304/
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Truncation is the last to convert from open coded req handling to
reusing helpers. This time the permission check in prepare has to adapt
to the new caller: it checks a different permission bit, and doesn't
trigger the before write notifier.
Also, truncation should always trigger a bs->total_sectors update and in
turn call parent resize_cb. Update the condition in finish helper, too.
It's intended to do a duplicated bs->read_only check before calling
bdrv_co_write_req_prepare() so that we can be more informative with the
error message, as bdrv_co_write_req_prepare() doesn't have Error
parameter.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If we are growing the image and potentially using preallocation for the
new area, we need to make sure that no write requests are made to the
"preallocated" area which is [@old_size, @offset), not
[@offset, offset * 2 - @old_size).
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This brings the request handling logic inline with write and discard,
fixing write_gen, resize_cb, dirty bitmaps and image size refreshing.
The last of these issues broke iotest case 222, which is now fixed.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reuse the new bdrv_co_write_req_prepare/finish helpers. The variation
here is that discard requests don't affect bs->wr_highest_offset, and it
cannot extend the image.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Two problems exist when a write request that enlarges the image (i.e.
write beyond EOF) finishes:
1) parent is not notified about size change;
2) dirty bitmap is not resized although we try to set the dirty bits;
Fix them just like how bdrv_co_truncate works.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As a mechanical refactoring patch, this is the first step towards
unified and more correct write code paths. This is helpful because
multiple BlockDriverState fields need to be updated after modifying
image data, and it's hard to maintain in multiple places such as copy
offload, discard and truncate.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This matches the types used for bytes in the rest parts of block layer.
In the case of bdrv_co_truncate, new_bytes can be the image size which
probably doesn't fit in a 32 bit int.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Other I/O functions are already using a BdrvChild pointer in the API, so
make discard do the same. It makes it possible to initiate the same
permission checks before doing I/O, and much easier to share the
helper functions for this, which will be added and used by write,
truncate and copy range paths.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A few trace points that can help reveal what is happening in a copy
offloading I/O path.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With in one module, trace points usually have a common prefix named
after the module name. paio_submit and paio_submit_co are the only two
trace points so far in the two file protocol drivers. As we are adding
more, having a common prefix here is better so that trace points can be
enabled with a glob. Rename them.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This reverts commit b008326744.
Hold off removing this for one more QEMU release (current libvirt
release still uses it.)
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The sector size needs to be large enough to accommodate the data
structures for the log super block and log write entries. This was
previously not properly checked, which made it possible to cause
QEMU to badly misbehave.
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fleecing scheme works as follows: we want a kind of temporary snapshot
of active drive A. We create temporary image B, with B->backing = A.
Then we start backup(sync=none) from A to B. From this point, B reads
as point-in-time snapshot of A (A continues to be active drive,
accepting guest IO).
This scheme needs some additional synchronization between reads from B
and backup COW operations, otherwise, the following situation is
theoretically possible:
(assume B is qcow2, client is NBD client, reading from B)
1. client starts reading and take qcow2 mutex in qcow2_co_preadv, and
goes up to l2 table loading (assume cache miss)
2) guest write => backup COW => qcow2 write =>
try to take qcow2 mutex => waiting
3. l2 table loaded, we see that cluster is UNALLOCATED, go to
"case QCOW2_CLUSTER_UNALLOCATED" and unlock mutex before
bdrv_co_preadv(bs->backing, ...)
4) aha, mutex unlocked, backup COW continues, and we finally finish
guest write and change cluster in our active disk A
5. actually, do bdrv_co_preadv(bs->backing, ...) and read
_new updated_ data.
To avoid this, let's make backup writes serializing, to not intersect
with reads from B.
Note: we expand range of handled cases from (sync=none and
B->backing = A) to just (A in backing chain of B), to finally allow
safe reading from B during backup for all cases when A in backing chain
of B, i.e. B formally looks like point-in-time snapshot of A.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Serialized writes should be used in copy-on-write of backup(sync=none)
for image fleecing scheme.
We need to change an assert in bdrv_aligned_pwritev, added in
28de2dcd88. The assert may fail now, because call to
wait_serialising_requests here may become first call to it for this
request with serializing flag set. It occurs if the request is aligned
(otherwise, we should already set serializing flag before calling
bdrv_aligned_pwritev and correspondingly waited for all intersecting
requests). However, for aligned requests, we should not care about
outdating of previously read data, as there no such data. Therefore,
let's just update an assert to not care about aligned requests.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Pass read flags and write flags separately. This is needed to handle
coming BDRV_REQ_NO_SERIALISING clearly in following patches.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Here two things are fixed:
1. Architecture
On each recursion step, we go to the child of src or dst, only for one
of them. So, it's wrong to create tracked requests for both on each
step. It leads to tracked requests duplication.
2. Wait for serializing requests on write path independently of
BDRV_REQ_NO_SERIALISING
Before commit 9ded4a0114 "backup: Use copy offloading",
BDRV_REQ_NO_SERIALISING was used for only one case: read in
copy-on-write operation during backup. Also, the flag was handled only
on read path (in bdrv_co_preadv and bdrv_aligned_preadv).
After 9ded4a0114, flag is used for not waiting serializing operations
on backup target (in same case of copy-on-write operation). This
behavior change is unsubstantiated and potentially dangerous, let's
drop it and add additional asserts and documentation.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the virtual disk size isn't aligned to full clusters,
bdrv_co_do_copy_on_readv() may get pnum == 0 before having the full
cluster completed, which will let it run into an assertion failure:
qemu-io: block/io.c:1203: bdrv_co_do_copy_on_readv: Assertion `skip_bytes < pnum' failed.
Check for EOF, assert that we read at least as much as the read request
originally wanted to have (which is true at EOF because otherwise
bdrv_check_byte_request() would already have returned an error) and
return success early even though we couldn't copy the full cluster.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit dcf94a23b1 ('block: Don't poll in parent drain callbacks')
removed polling in bdrv_child_cb_drained_begin() on the grounds that the
original bdrv_drain() already will poll and BdrvChildRole.drained_begin
calls must not cause graph changes (and therefore must not call
aio_poll() or the recursion through the graph will break.
This reasoning is correct for calls through bdrv_do_drained_begin().
However, BdrvChildRole.drained_begin is also called when a node that is
already in a drained section (i.e. bdrv_do_drained_begin() has already
returned and therefore can't poll any more) is attached to a new parent.
In this case, we must explicitly poll to have all requests completed
before the drained new child can be attached to the parent.
In bdrv_replace_child_noperm(), we know that we're not inside the
recursion of bdrv_do_drained_begin() because graph changes are not
allowed there, and bdrv_replace_child_noperm() is a graph change. The
call of BdrvChildRole.drained_begin() must therefore be followed by a
BDRV_POLL_WHILE() that waits for the completion of requests.
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
VMDK performs a probing check in vmdk_co_create_opts() to prevent the
user from assigning non-VMDK files as a backing file, because it only
supports VMDK backing files. However, with the @backing runtime option,
it is possible to assign arbitrary nodes as backing nodes, regardless of
what the image header says. Therefore, VMDK may not just access backing
nodes assuming they are VMDK nodes -- which it does, because it needs to
compare the backing file's CID with the overlay's parentCID value, and
naturally the backing file only has a CID when it's a VMDK file.
Instead, it should report the CID of non-VMDK backing files not to match
the overlay because clearly a non-present CID does not match.
Without this change, vmdk_read_cid() reads from the backing file's
bs->file, which may be NULL (in which case we get a segfault). Also, it
interprets bs->opaque as a BDRVVmdkState and then reads from the
.desc_offset field, which usually will just return some arbitrary value
which then results in either garbage to be read, or bdrv_pread() to
return an error, both of which result in a non-matching CID to be
reported.
(In a very unlikely case, we could read something that looks like a
VMDK descriptor, and then get a CID which might actually match. But
that is highly unlikely, and the only result would be that VMDK accepts
the backing file which is not too bad (albeit unintentional).)
((And in theory, the seek to .desc_offset might leak data from another
block driver's opaque object. But then again, the user should realize
very quickly that a non-VMDK backing file does not work (because the
read will very likely fail, due to the reasons given above), so this
should not be exploitable.))
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180702210721.4847-2-mreitz@redhat.com
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
One of them is a typo. But update both to be more readable.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180702025836.20957-3-famz@redhat.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
- qcow2: Use worker threads for compression to improve performance of
'qemu-img convert -W' and compressed backup jobs
- blklogwrites: New filter driver to log write requests to an image in
the dm-log-writes format
- file-posix: Fix image locking during image creation
- crypto: Fix memory leak in error path
- Error out instead of silently truncating node names
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbPfHhAAoJEH8JsnLIjy/Wxp8P/jeEJ/jJ0N2doix5KJSvtkdP
dYx4cJxHg9tMSnmBfA49qc1ilgbzZqjbF/wH3Ko3zPsmFP0xsxx+9zE4c3iL7HA6
6E4/r3szznO8DIxdgzAIeHjZc7YPpsY3alrthT55eR/FnDyc8zofi/iUMgvTKPYA
8TnWqgi2fdaVwNy+lIRJFdhzzQNtPxOEO+0DFHaOZvQ9vlc5DGPC+Hj3Qc11GK8M
u7orkeYqQPknxy0hJKPxtWHvzCQUJMcSSs6PbuslnzOerYiQLmx+RVIwMhXfVZjV
lXe2SppAszBujtcIENhZlj1cECs7MrWTmFDcWvBA+Mh/JhFEWwykmlQHYrjCqdvw
QyWhVLgP/6jQDaBls6LADSZDxX7i0sV27DzVGN4gYUX/KcyVPEDROm90MyE8nxN0
Y5hhujTkzu94Zvd/OlKZRs4tPxmbRrd2SWy0id8Kj5/gO/+UWECOZCrPrlB5uzec
bsxHoeXLVe2/56JkCIiOw3hLILMY6gPLJgeaQjz6hu1oZJqYuoof8grVFjUCSH0C
BBlz8O4upHdPIKNRl4rmdgwasg5YIiJ7SFl4h7KMCD0elkKmo3SyTKosagvqrpVN
JvW1YuosCedcfYIKNPnZdEGWDKuTaZtcnZnM9IRY6hr8ySMw1LloFsgXsJ3tqTwe
wb1Wjm4Qlz5BaoB7o/Os
=z5AC
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches:
- qcow2: Use worker threads for compression to improve performance of
'qemu-img convert -W' and compressed backup jobs
- blklogwrites: New filter driver to log write requests to an image in
the dm-log-writes format
- file-posix: Fix image locking during image creation
- crypto: Fix memory leak in error path
- Error out instead of silently truncating node names
# gpg: Signature made Thu 05 Jul 2018 11:24:33 BST
# gpg: using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream:
file-posix: Unlock FD after creation
file-posix: Fix creation locking
block/blklogwrites: Add an option for the update interval of the log superblock
block/blklogwrites: Add an option for appending to an old log
block/blklogwrites: Change log_sector_size from int64_t to uint64_t
block/crypto: Fix memory leak in create error path
block: Don't silently truncate node names
block: Add blklogwrites
block: Move two block permission constants to the relevant enum
qcow2: add compress threads
qcow2: refactor data compression
qemu-img: allow compressed not-in-order writes
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Closing the FD does not necessarily mean that it is unlocked. Fix this
by relinquishing all permission locks before qemu_close().
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
raw_apply_lock_bytes() takes a bit mask of "permissions that are NOT
shared".
Also, make the "perm" and "shared" variables uint64_t, because I do not
particularly like using ~ on signed integers (and other permission masks
are usually uint64_t, too).
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is a way to ensure that the log superblock is periodically
updated. Before, this was only done on flush requests, which may
not be enough if the VM exits abnormally, omitting the final flush.
The default interval is 4096 write requests.
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Suggested by Kevin Wolf. May be useful when testing multiple batches
of writes or doing long-term testing involving restarts of the VM.
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This was a simple oversight when working on intermediate versions
of the original patch which introduced blklogwrites.
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Implements a block device write logging system, similar to Linux kernel
device mapper dm-log-writes. The write operations that are performed
on a block device are logged to a file or another block device. The
write log format is identical to the dm-log-writes format. Currently,
log markers are not supported.
This functionality can be used for crash consistency and fs consistency
testing. By implementing it in qemu, tests utilizing write logs can be
be used to test non-Linux drivers and older kernels.
The driver accepts an optional parameter to set the sector size used
for logging. This makes the driver require all requests to be aligned
to this sector size and also makes offsets and sizes of writes in the
log metadata to be expressed in terms of this value (the log format has
a granularity of one sector for offsets and sizes). This allows
accurate logging of writes to guest block devices that have unusual
sector sizes.
The implementation is based on the blkverify and blkdebug block
drivers.
Signed-off-by: Aapo Vienamo <aapo@tuxera.com>
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Do data compression in separate threads. This significantly improve
performance for qemu-img convert with -W (allow async writes) and -c
(compressed) options.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Make a separate function for compression to be parallelized later.
- use .avail_out field instead of .next_out to calculate size of
compressed data. It looks more natural and it allows to keep dest to
be void pointer
- set avail_out to be at least one byte less than input, to be sure
avoid inefficient compression earlier
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Bitmap lock/unlock were added to bdrv_enable_dirty_bitmap in
8b1402ce80, but some places were not updated correspondingly, which
leads to trying to take this lock twice, which is dead-lock. Fix this.
Actually, iotest 199 (about dirty bitmap postcopy migration) is broken
now, and this fixes it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20180625165745.25259-3-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
Add _locked version of bdrv_enable_dirty_bitmap, to fix dirty bitmap
migration in the following patch.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20180625165745.25259-2-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
Bug fixes and iotest exposure of fleecing via NBD (serving a
read-only point-in-time view via blockdev-backup sync:none,
as well as serving dirty bitmaps over NBD), including a new
x-dirty-bitmap parameter when opening NBD clients as the
counterpart to x-nbd-server-add-bitmap. Also a random fix
for iscsi block_status spotted by Coverity that missed other
miscellaneous trees.
- Eric Blake: nbd/server: Fix dirty bitmap logic regression
- Eric Blake: iscsi: Avoid potential for get_status overflow
- John Snow/Vladimir Sementsov-Ogievskiy: 0/2 block: formalize and test fleecing
- Eric Blake: 0/2 test NBD bitmap export
-----BEGIN PGP SIGNATURE-----
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg
iQEcBAABCAAGBQJbOtJPAAoJEKeha0olJ0NqEvwH/3FwWnlBdBvdYGgPjzGE1Atm
ofCKcyxE/2VJtxeWlZQHzs1VqSq81s7am5SdzOrIWnQekvHFcLu6/71RABiauzMd
neCvVOrXOVdktj1i1Z2Gg4BgjDmqbTDlo5ssVh/oXP0Zebi6OZFfQrB7y3cGBvui
4XI7lW9qJxt6F1FlKloXnofWRDENyo5vgdz6QjQXfauthw2T5045RIPTfiz03FCp
fbs+6K0+bKxfPdNLrqxxOZo/loYnEXbDYv6VBAIWBqztXVnMHxCqnh0YN05jwsfF
TRW0/YT8lpWarOZ1soIC6a/OGXQZbxgRhZ+Zr+Wa2jw0YNHJanU9isxi37aUqQo=
=Xivx
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2018-07-02' into staging
nbd patches for 2018-07-02
Bug fixes and iotest exposure of fleecing via NBD (serving a
read-only point-in-time view via blockdev-backup sync:none,
as well as serving dirty bitmaps over NBD), including a new
x-dirty-bitmap parameter when opening NBD clients as the
counterpart to x-nbd-server-add-bitmap. Also a random fix
for iscsi block_status spotted by Coverity that missed other
miscellaneous trees.
- Eric Blake: nbd/server: Fix dirty bitmap logic regression
- Eric Blake: iscsi: Avoid potential for get_status overflow
- John Snow/Vladimir Sementsov-Ogievskiy: 0/2 block: formalize and test fleecing
- Eric Blake: 0/2 test NBD bitmap export
# gpg: Signature made Tue 03 Jul 2018 02:33:03 BST
# gpg: using RSA key A7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>"
# gpg: aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>"
# gpg: aka "[jpeg image of size 6874]"
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2 F3AA A7A1 6B4A 2527 436A
* remotes/ericb/tags/pull-nbd-2018-07-02:
iotests: New test 223 for exporting dirty bitmap over NBD
nbd/client: Add x-dirty-bitmap to query bitmap from server
iotests: add 222 to test basic fleecing
blockdev: enable non-root nodes for backup source
iscsi: Avoid potential for get_status overflow
nbd/server: Fix dirty bitmap logic regression
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The implementation is similar to the 'qemu-img convert'. In the
beginning of the job, offloaded copy is attempted. If it fails, further
I/O will go through the existing bounce buffer code path.
Then, as Kevin pointed out, both this and qemu-img convert can benefit
from a local check if one request fails because of, for example, the
offset is beyond EOF, but another may well be accepted by the protocol
layer. This will be implemented separately.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180703023758.14422-4-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
This semantics is needed by drive-backup so implement it before using
this API there.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180703023758.14422-3-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
src may be NULL if BDRV_REQ_ZERO_WRITE flag is set, in this case only
check dst and dst->bs. This bug was introduced when moving in the
request tracking code from bdrv_co_copy_range, in 37aec7d75e.
This especially fixes the possible segfault when initializing src_bs
with a NULL src.
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180703023758.14422-2-famz@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
In order to test that the NBD server is properly advertising
dirty bitmaps, we need a bare minimum client that can request
and read the context. Since feature freeze for 3.0 is imminent,
this is the smallest workable patch, which replaces the qemu
block status report with the results of the NBD server's dirty
bitmap (making it very easy to use 'qemu-img map --output=json'
to learn where the dirty portions are). Note that the NBD
protocol defines a dirty section with the same bit but opposite
sense that normal "base:allocation" uses to report an allocated
section; so in qemu-img map output, "data":true corresponds to
clean, "data":false corresponds to dirty.
A more complete solution that allows dirty bitmaps to be queried
at the same time as normal block status will be required before
this addition can lose the x- prefix. Until then, the fact that
this replaces normal status with dirty status means actions
like 'qemu-img convert' will likely misbehave due to treating
dirty regions of the file as if they are unallocated.
The next patch adds an iotest to exercise this new code.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180702191458.28741-2-eblake@redhat.com>
Detected by Coverity: Multiplying two 32-bit int and assigning
the result to a 64-bit number is a risk of overflow. Prior to
the conversion to byte-based interfaces, the block layer took
care of ensuring that a status request never exceeded 2G in
the driver; but after that conversion, the block layer expects
drivers to deal with any size request (the driver can always
truncate the request size back down, as long as it makes
progress). So, in the off-chance that someone makes a large
request, we are at the mercy of whether iscsi_get_lba_status_task()
will cap things to at most INT_MAX / iscsilun->block_size when
it populates lbasd->num_blocks; since I could not easily audit
that, it's better to be safe than sorry by just forcing a 64-bit
multiply.
Fixes: 92809c36
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180508212718.1482663-1-eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Message-Id: <20180625124238.25339-3-f4bug@amsat.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Now that all callers of vectored I/O have been converted
to use our preferred byte-based bdrv_co_p{read,write}v(), we can
delete the unused bdrv_co_{read,write}v().
Furthermore, this gets rid of the signature difference between the
public bdrv_co_writev() and the callback .bdrv_co_writev (the
latter still exists, because some drivers still need more work
before they are fully byte-based).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the last few sector-based calls
into the block layer from the vhdx driver.
Ideally, the vhdx driver should switch to doing everything
byte-based, but that's a more invasive change that requires a
bit more auditing.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the last few sector-based calls
into the block layer from the replication driver.
Ideally, the replication driver should switch to doing everything
byte-based, but that's a more invasive change that requires a
bit more auditing.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. The qcow driver is now ready to fully utilize the
byte-based callback interface, as long as we override the default
alignment to still be 512 (needed at least for asserts present
because of encryption, but easier to do everywhere than to audit
which sub-sector requests are handled correctly, especially since
we no longer recommend qcow for new disk images).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the internals of the qcow
driver write function, by iterating over offset/bytes instead of
sector_num/nb_sectors, and with a rename of index_in_cluster and
repurposing of n to track bytes instead of sectors.
A later patch will then switch the qcow driver as a whole over
to byte-based operation.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the internals of the qcow
driver read function, by iterating over offset/bytes instead of
sector_num/nb_sectors, and with a rename of index_in_cluster and
repurposing of n to track bytes instead of sectors.
A later patch will then switch the qcow driver as a whole over
to byte-based operation.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the internal helper function
get_cluster_offset(), by changing n_start and n_end to be byte
offsets rather than sector indices within the cluster being
allocated. However, assert that these values are still
sector-aligned (at least qcrypto_block_encrypt() still wants that).
For now we get that alignment for free because we still use
sector-based driver callbacks.
A later patch will then switch the qcow driver as a whole over
to byte-based operation; but will still leave things at sector
alignments as it is not worth auditing the qcow image format
to worry about sub-sector requests.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the last few sector-based calls
into the block layer from the parallels driver.
Ideally, the parallels driver should switch to doing everything
byte-based, but that's a more invasive change that requires a
bit more auditing.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
EINTR should be checked against errno, not ret. While fixing the bug,
collect the branches with a switch block.
Also, change the return value from -ENOSTUP to -ENOSPC when the actual
issue is request range passes EOF, which should be distinguishable from
the case of error == ENOSYS by the caller, so that it could still retry
with other byte ranges, whereas it shouldn't retry anymore upon ENOSYS.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Per SCSI definition the designator_length we receive from INQUIRY is 8,
12 or at most 16, but we should be careful because the remote iscsi
target may misbehave, otherwise we could have a buffer overflow.
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Not updating src_offset will result in wrong data being written to dst
image.
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This simplifies file-posix by implementing the coroutine variants of
the discard and flush BlockDriver callbacks. These were the last
remaining users of paio_submit(), which can be removed now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
If we managed to allocate the clusters, but then failed to write the
data, there's a good chance that we'll still be able to free the
clusters again in order to avoid cluster leaks (the refcounts are
cached, so even if we can't write them out right now, we may be able to
do so when the VM is resumed after a werror=stop/enospc pause).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Eric Blake <eblake@redhat.com>
block_crypto_open_opts_init() and block_crypto_create_opts_init()
contain a virtual visit of QCryptoBlockOptions and
QCryptoBlockCreateOptions less member "format", respectively.
Change their callers to put member "format" in the QDict, so they can
use the generated visitors for these types instead.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
in_flight and tracked requests need to be tracked in every layer during
recursion. For now the only user is qemu-img convert where overlapping
requests and IOThreads don't exist, therefore this change doesn't make
much difference form user point of view, but it is incorrect as part of
the API. Fix it.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the beginning of the function, we initialize the local variable to 0,
and in the body of the function, we check the assigned values and exit
the loop immediately. So here it can never be non-zero.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This moves the code to resize an image file to the thread pool to avoid
blocking.
Creating large images with preallocation with blockdev-create is now
actually a background job instead of blocking the monitor (and most
other things) until the preallocation has completed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
When growing an image, block drivers (especially protocol drivers) may
initialise the newly added area. I/O requests to the same area need to
wait for this initialisation to be completed so that data writes don't
get overwritten and reads don't read uninitialised data.
To avoid overhead in the fast I/O path by adding new locking in the
protocol drivers and to restrict the impact to requests that actually
touch the new area, reuse the existing tracked request infrastructure in
block/io.c and mark all discard requests as serialising.
With this change, it is safe for protocol drivers to make
.bdrv_co_truncate actually asynchronous.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
This moves the bdrv_truncate() implementation from block.c to block/io.c
so it can have access to the tracked requests infrastructure.
This involves making refresh_total_sectors() public (in block_int.h).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
All callers are coroutine_fns now, so we can just directly call
preallocate_co().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_truncate() is an operation that can block (even for a quite long
time, depending on the PreallocMode) in I/O paths that shouldn't block.
Convert it to a coroutine_fn so that we have the infrastructure for
drivers to make their .bdrv_co_truncate implementation asynchronous.
This change could potentially introduce new race conditions because
bdrv_truncate() isn't necessarily executed atomically any more. Whether
this is a problem needs to be evaluated for each block driver that
supports truncate:
* file-posix/win32, gluster, iscsi, nfs, rbd, ssh, sheepdog: The
protocol drivers are trivially safe because they don't actually yield
yet, so there is no change in behaviour.
* copy-on-read, crypto, raw-format: Essentially just filter drivers that
pass the request to a child node, no problem.
* qcow2: The implementation modifies metadata, so it needs to hold
s->lock to be safe with concurrent I/O requests. In order to avoid
double locking, this requires pulling the locking out into
preallocate_co() and using qcow2_write_caches() instead of
bdrv_flush().
* qed: Does a single header update, this is fine without locking.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
If qcow2_alloc_clusters_at() returns an error, we do need to negate it
to get back the positive errno code for error_setg_errno(), but we still
need to return the negative error code.
Fixes: 772d1f973f
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Coverity can't see that qobject_input_visitor_new_flat_confused()
returns non-null when it doesn't set @local_err. Check the return
value instead, like all the other callers do.
Fixes: CID 1393615
Fixes: CID 1393616
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
laio_init() can fail for a couple of reasons, which will lead to a NULL
pointer dereference in laio_attach_aio_context().
To solve this, add a aio_setup_linux_aio() function which is called
early in raw_open_common. If this fails, propagate the error up. The
signature of aio_get_linux_aio() was not modified, because it seems
preferable to return the actual errno from the possible failing
initialization calls.
Additionally, when the AioContext changes, we need to associate a
LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context
callback and call the new aio_setup_linux_aio(), which will allocate a
new AioContext if needed, and return errors on failures. If it fails for
any reason, fallback to threaded AIO with an error message, as the
device is already in-use by the guest.
Add an assert that aio_get_linux_aio() cannot return NULL.
Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>
Message-id: 20180622193700.6523-1-naravamudan@digitalocean.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Flat unions may now have uncovered branches, so it is possible to get rid
of empty types defined for that purpose only.
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <1529311206-76847-3-git-send-email-anton.nefedov@virtuozzo.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
This patch allows the user to specify whether to use active or only
background mode for mirror block jobs. Currently, this setting will
remain constant for the duration of the entire block job.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180613181823.13618-14-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This patch implements active synchronous mirroring. In active mode, the
passive mechanism will still be in place and is used to copy all
initially dirty clusters off the source disk; but every write request
will write data both to the source and the target disk, so the source
cannot be dirtied faster than data is mirrored to the target. Also,
once the block job has converged (BLOCK_JOB_READY sent), source and
target are guaranteed to stay in sync (unless an error occurs).
Active mode is completely optional and currently disabled at runtime. A
later patch will add a way for users to enable it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-13-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This will allow us to access the block job data when the mirror block
driver becomes more complex.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-11-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This new function allows to look for a consecutively dirty area in a
dirty bitmap.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180613181823.13618-10-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This new parameter allows the caller to just query the next dirty
position without moving the iterator.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180613181823.13618-8-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
With this, the mirror_top_bs is no longer just a technically required
node in the BDS graph but actually represents the block job operation.
Also, drop MirrorBlockJob.source, as we can reach it through
mirror_top_bs->backing.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180613181823.13618-6-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This patch makes the mirror code differentiate between simply waiting
for any operation to complete (mirror_wait_for_free_in_flight_slot())
and specifically waiting for all operations touching a certain range of
the virtual disk to complete (mirror_wait_on_conflicts()).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-5-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Attach a CoQueue to each in-flight operation so if we need to wait for
any we can use it to wait instead of just blindly yielding and hoping
for some operation to wake us.
A later patch will use this infrastructure to allow requests accessing
the same area of the virtual disk to specifically wait for each other.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-4-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
In order to talk to the source BDS (and maybe in the future to the
target BDS as well) directly, we need to convert our existing AIO
requests into coroutine I/O requests.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-3-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
When converting mirror's I/O to coroutines, we are going to need a point
where these coroutines are created. mirror_perform() is going to be
that point.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180613181823.13618-2-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Removing a drive with drive_del while it is being used to run an I/O
intensive workload can cause QEMU to crash.
An AIO flush can yield at some point:
blk_aio_flush_entry()
blk_co_flush(blk)
bdrv_co_flush(blk->root->bs)
...
qemu_coroutine_yield()
and let the HMP command to run, free blk->root and give control
back to the AIO flush:
hmp_drive_del()
blk_remove_bs()
bdrv_root_unref_child(blk->root)
child_bs = blk->root->bs
bdrv_detach_child(blk->root)
bdrv_replace_child(blk->root, NULL)
blk->root->bs = NULL
g_free(blk->root) <============== blk->root becomes stale
bdrv_unref(child_bs)
bdrv_delete(child_bs)
bdrv_close()
bdrv_drained_begin()
bdrv_do_drained_begin()
bdrv_drain_recurse()
aio_poll()
...
qemu_coroutine_switch()
and the AIO flush completion ends up dereferencing blk->root:
blk_aio_complete()
scsi_aio_complete()
blk_get_aio_context(blk)
bs = blk_bs(blk)
ie, bs = blk->root ? blk->root->bs : NULL
^^^^^
stale
The problem is that we should avoid making block driver graph
changes while we have in-flight requests. Let's drain all I/O
for this BB before calling bdrv_root_unref_child().
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_drain_all_*() used bdrv_next() to iterate over all root nodes and
did a subtree drain for each of them. This works fine as long as the
graph is static, but sadly, reality looks different.
If the graph changes so that root nodes are added or removed, we would
have to compensate for this. bdrv_next() returns each root node only
once even if it's the root node for multiple BlockBackends or for a
monitor-owned block driver tree, which would only complicate things.
The much easier and more obviously correct way is to fundamentally
change the way the functions work: Iterate over all BlockDriverStates,
no matter who owns them, and drain them individually. Compensation is
only necessary when a new BDS is created inside a drain_all section.
Removal of a BDS doesn't require any action because it's gone afterwards
anyway.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the future, bdrv_drained_all_begin/end() will drain all invidiual
nodes separately rather than whole subtrees. This means that we don't
want to propagate the drain to all parents any more: If the parent is a
BDS, it will already be drained separately. Recursing to all parents is
unnecessary work and would make it an O(n²) operation.
Prepare the drain function for the changed drain_all by adding an
ignore_bds_parents parameter to the internal implementation that
prevents the propagation of the drain to BDS parents. We still (have to)
propagate it to non-BDS parents like BlockBackends or Jobs because those
are not drained separately.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Before we can introduce a single polling loop for all nodes in
bdrv_drain_all_begin(), we must make sure to run it outside of coroutine
context like we already do for bdrv_do_drained_begin().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We cannot allow aio_poll() in bdrv_drain_invoke(begin=true) until we're
done with propagating the drain through the graph and are doing the
single final BDRV_POLL_WHILE().
Just schedule the coroutine with the callback and increase bs->in_flight
to make sure that the polling phase will wait for it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_do_drained_begin() is only safe if we have a single
BDRV_POLL_WHILE() after quiescing all affected nodes. We cannot allow
that parent callbacks introduce a nested polling loop that could cause
graph changes while we're traversing the graph.
Split off bdrv_do_drained_begin_quiesce(), which only quiesces a single
node without waiting for its requests to complete. These requests will
be waited for in the BDRV_POLL_WHILE() call down the call chain.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Anything can happen inside BDRV_POLL_WHILE(), including graph
changes that may interfere with its callers (e.g. child list iteration
in recursive callers of bdrv_do_drained_begin).
Switch to a single BDRV_POLL_WHILE() call for the whole subtree at the
end of bdrv_do_drained_begin() to avoid such effects. The recursion
happens now inside the loop condition. As the graph can only change
between bdrv_drain_poll() calls, but not inside of it, doing the
recursion here is safe.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For bdrv_drain(), recursively waiting for child node requests is
pointless because we didn't quiesce their parents, so new requests could
come in anyway. Letting the function work only on a single node makes it
more consistent.
For subtree drains and drain_all, we already have the recursion in
bdrv_do_drained_begin(), so the extra recursion doesn't add anything
either.
Remove the useless code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
We already requested that block jobs be paused in .bdrv_drained_begin,
but no guarantee was made that the job was actually inactive at the
point where bdrv_drained_begin() returned.
This introduces a new callback BdrvChildRole.bdrv_drained_poll() and
uses it to make bdrv_drain_poll() consider block jobs using the node to
be drained.
For the test case to work as expected, we have to switch from
block_job_sleep_ns() to qemu_co_sleep_ns() so that the test job is even
considered active and must be waited for when draining the node.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 91af091f92 added an additional aio_poll() to BDRV_POLL_WHILE()
in order to make sure that all pending BHs are executed on drain. This
was the wrong place to make the fix, as it is useless overhead for all
other users of the macro and unnecessarily complicates the mechanism.
This patch effectively reverts said commit (the context has changed a
bit and the code has moved to AIO_WAIT_WHILE()) and instead polls in the
loop condition for drain.
The effect is probably hard to measure in any real-world use case
because actual I/O will dominate, but if I run only the initialisation
part of 'qemu-img convert' where it calls bdrv_block_status() for the
whole image to find out how much data there is copy, this phase actually
needs only roughly half the time after this patch.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
All involved nodes are already idle, we called bdrv_do_drain_begin() on
them.
The comment in the code suggested that this was not correct because the
completion of a request on one node could spawn a new request on a
different node (which might have been drained before, so we wouldn't
drain the new request). In reality, new requests to different nodes
aren't spawned out of nothing, but only in the context of a parent
request, and they aren't submitted to random nodes, but only to child
nodes. As long as we still poll for the completion of the parent request
(which we do), draining each root node separately is good enough.
Remove the additional polling code from bdrv_drain_all_begin() and
replace it with an assertion that all nodes are already idle after we
drained them separately.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
All callers pass false for the 'recursive' parameter now. Remove it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_do_drain_begin/end() implement already everything that
bdrv_drain_all_begin/end() need and currently still do manually: Disable
external events, call parent drain callbacks, call block driver
callbacks.
It also does two more things:
The first is incrementing bs->quiesce_counter. bdrv_drain_all() already
stood out in the test case by behaving different from the other drain
variants. Adding this is not only safe, but in fact a bug fix.
The second is calling bdrv_drain_recurse(). We already do that later in
the same function in a loop, so basically doing an early first iteration
doesn't hurt.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
As long as nobody keeps the other I/O thread from working, there is no
reason why bdrv_drain() wouldn't work with cross-AioContext events. The
key is that the root request we're waiting for is in the AioContext
we're polling (which it always is for bdrv_drain()) so that aio_poll()
is woken up in the end.
Add a test case that shows that it works. Remove the comment in
bdrv_drain() that claims otherwise.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The -drive option serial was deprecated in QEMU 2.10. It's time to
remove it.
Tests need to be updated to set the serial number with -global instead
of using the -drive option.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Legacy -drive supports "password-secret" parameter that isn't
available with -blockdev / blockdev-add. That's because we backed out
our first try to provide it there due to interface design doubts, in
commit 577d8c9a81, v2.9.0.
This is the second try. It brings back the parameter, except it's
named "key-secret" now.
Let's review our reasons for backing out the first try, as stated in
the commit message:
* BlockdevOptionsRbd member @password-secret isn't actually a
password, it's a key generated by Ceph.
Addressed by the rename.
* We're not sure where member @password-secret belongs (see the
previous commit).
See previous commit.
* How @password-secret interacts with settings from a configuration
file specified with @conf is undocumented.
Not actually true, the documentation for @conf says "Values in the
configuration file will be overridden by options specified via QAPI",
and we've tested this.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Parameter auth-client-required lets you configure authentication
methods. We tried to provide that in v2.9.0, but backed out due to
interface design doubts (commit 464444fcc1).
This commit is similar to what we backed out, but simpler: we use a
list of enumeration values instead of a list of objects with a member
of enumeration type.
Let's review our reasons for backing out the first try, as stated in
the commit message:
* The implementation uses deprecated rados_conf_set() key
"auth_supported". No biggie.
Fixed: we use "auth-client-required".
* The implementation makes -drive silently ignore invalid parameters
"auth" and "auth-supported.*.X" where X isn't "auth". Fixable (in
fact I'm going to fix similar bugs around parameter server), so
again no biggie.
That fix is commit 2836284db6. This commit doesn't bring the bugs
back.
* BlockdevOptionsRbd member @password-secret applies only to
authentication method cephx. Should it be a variant member of
RbdAuthMethod?
We've had time to ponder, and we decided to stick to the way Ceph
configuration works: the key configured separately, and silently
ignored if the authentication method doesn't use it.
* BlockdevOptionsRbd member @user could apply to both methods cephx
and none, but I'm not sure it's actually used with none. If it
isn't, should it be a variant member of RbdAuthMethod?
Likewise.
* The client offers a *set* of authentication methods, not a list.
Should the methods be optional members of BlockdevOptionsRbd instead
of members of list @auth-supported? The latter begs the question
what multiple entries for the same method mean. Trivial question
now that RbdAuthMethod contains nothing but @type, but less so when
RbdAuthMethod acquires other members, such the ones discussed above.
Again, we decided to stick to the way Ceph configuration works, except
we make auth-client-required a list of enumeration values instead of a
string containing keywords separated by delimiters.
* How BlockdevOptionsRbd member @auth-supported interacts with
settings from a configuration file specified with @conf is
undocumented. I suspect it's untested, too.
Not actually true, the documentation for @conf says "Values in the
configuration file will be overridden by options specified via QAPI",
and we've tested this.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Remaining uses of qobject_input_visitor_new_keyval() in the block
subsystem:
* block_crypto_open_opts_init()
Currently doesn't visit any non-string scalars, thus safe. It's
called from
- block_crypto_open_luks()
Creates the QDict with qemu_opts_to_qdict_filtered(), which
creates only string scalars, but has a TODO asking for other types.
- qcow_open()
- qcow2_open(), qcow2_co_invalidate_cache(), qcow2_reopen_prepare()
* block_crypto_create_opts_init(), called from
- block_crypto_co_create_opts_luks()
Also creates the QDict with qemu_opts_to_qdict_filtered().
* vdi_co_create_opts()
Also creates the QDict with qemu_opts_to_qdict_filtered().
Replace these uses by qobject_input_visitor_new_flat_confused() for
robustness. This adds crumpling. Right now, that's a no-op, but if
we ever extend these things in non-flat ways, crumpling will be
needed.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The following pattern occurs in the .bdrv_co_create_opts() methods of
parallels, qcow, qcow2, qed, vhdx and vpc:
qobj = qdict_crumple_for_keyval_qiv(qdict, errp);
qobject_unref(qdict);
qdict = qobject_to(QDict, qobj);
if (qdict == NULL) {
ret = -EINVAL;
goto done;
}
v = qobject_input_visitor_new_keyval(QOBJECT(qdict));
[...]
ret = 0;
done:
qobject_unref(qdict);
[...]
return ret;
If qobject_to() fails, we return failure without setting errp. That's
wrong. As far as I can tell, it cannot fail here. Clean it up
anyway, by removing the useless conversion.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The previous commit fixed -blockdev breakage due to misuse of the
qobject input visitor's keyval flavor in bdrv_file_open(). The commit
message explain why using the plain flavor would be just as wrong; it
would break -drive. Turns out we break it in three places:
nbd_open(), sd_open() and ssh_file_open(). They are even marked
FIXME. Example breakage:
$ qemu-system-x86 -drive node-name=n1,driver=nbd,server.type=inet,server.host=localhost,server.port=1234,server.numeric=off
qemu-system-x86: -drive node-name=n1,driver=nbd,server.type=inet,server.host=localhost,server.port=1234,server.numeric=off: Invalid parameter type for 'numeric', expected: boolean
Fix it the same way: replace qdict_crumple() by
qdict_crumple_for_keyval_qiv(), and switch from plain to the keyval
flavor.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Configuration flows through the block subsystem in a rather peculiar
way. Configuration made with -drive enters it as QemuOpts.
Configuration made with -blockdev / blockdev-add enters it as QAPI
type BlockdevOptions. The block subsystem uses QDict, QemuOpts and
QAPI types internally. The precise flow is next to impossible to
explain (I tried for this commit message, but gave up after wasting
several hours). What I can explain is a flaw in the BlockDriver
interface that leads to this bug:
$ qemu-system-x86_64 -blockdev node-name=n1,driver=nfs,server.type=inet,server.host=localhost,path=/foo/bar,user=1234
qemu-system-x86_64: -blockdev node-name=n1,driver=nfs,server.type=inet,server.host=localhost,path=/foo/bar,user=1234: Internal error: parameter user invalid
QMP blockdev-add is broken the same way.
Here's what happens. The block layer passes configuration represented
as flat QDict (with dotted keys) to BlockDriver methods
.bdrv_file_open(). The QDict's members are typed according to the
QAPI schema.
nfs_file_open() converts it to QAPI type BlockdevOptionsNfs, with
qdict_crumple() and a qobject input visitor.
This visitor comes in two flavors. The plain flavor requires scalars
to be typed according to the QAPI schema. That's the case here. The
keyval flavor requires string scalars. That's not the case here.
nfs_file_open() uses the latter, and promptly falls apart for members
@user, @group, @tcp-syn-count, @readahead-size, @page-cache-size,
@debug.
Switching to the plain flavor would fix -blockdev, but break -drive,
because there the scalars arrive in nfs_file_open() as strings.
The proper fix would be to replace the QDict by QAPI type
BlockdevOptions in the BlockDriver interface. Sadly, that's beyond my
reach right now.
Next best would be to fix the block layer to always pass correctly
typed QDicts to the BlockDriver methods. Also beyond my reach.
What I can do is throw another hack onto the pile: have
nfs_file_open() convert all members to string, so use of the keyval
flavor actually works, by replacing qdict_crumple() by new function
qdict_crumple_for_keyval_qiv().
The pattern "pass result of qdict_crumple() to
qobject_input_visitor_new_keyval()" occurs several times more:
* qemu_rbd_open()
Same issue as nfs_file_open(), but since BlockdevOptionsRbd has only
string members, its only a latent bug. Fix it anyway.
* parallels_co_create_opts(), qcow_co_create_opts(),
qcow2_co_create_opts(), bdrv_qed_co_create_opts(),
sd_co_create_opts(), vhdx_co_create_opts(), vpc_co_create_opts()
These work, because they create the QDict with
qemu_opts_to_qdict_filtered(), which creates only string scalars.
The function sports a TODO comment asking for better typing; that's
going to be fun. Use qdict_crumple_for_keyval_qiv() to be safe.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There are numerous QDict functions that have been introduced for and are
used only by the block layer. Move their declarations into an own
header file to reflect that.
While qdict_extract_subqdict() is in fact used outside of the block
layer (in util/qemu-config.c), it is still a function related very
closely to how the block layer works with nested QDicts, namely by
sometimes flattening them. Therefore, its declaration is put into this
header as well and util/qemu-config.c includes it with a comment stating
exactly which function it needs.
Suggested-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20180509165530.29561-7-mreitz@redhat.com>
[Copyright note tweaked, superfluous includes dropped]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Parameter "filename" is deprecated since commit 5c3ad1a6a8, v2.10.0.
Time to get rid of it.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Parameter "filename" is deprecated since commit 91589d9e5c, v2.10.0.
Time to get rid of it.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180606182449.1607-5-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
Add locks and remove comments about BQL accordingly to
dirty_bitmap_mutex definition in block_int.h.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180606182449.1607-2-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
QLIST_REMOVE does not require walking the list, and once the "bitmap"
argument is removed from bdrv_do_release_matching_dirty_bitmap_locked
the code simplifies a lot and it is worth inlining everything in the
callers of bdrv_do_release_matching_dirty_bitmap.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180326104037.6894-1-pbonzini@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
All this function is doing will be repeated by
bdrv_do_release_matching_dirty_bitmap_locked, except
resetting bm->persistent. But even that does not matter
because the bitmap will be freed.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180323164254.26487-1-pbonzini@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
When signaling a corruption on a read-only image, qcow2 already makes
fatal events non-fatal (i.e., they will not result in the image being
closed, and the image header's corrupt flag will not be set). This is
necessary because we cannot set the corrupt flag on read-only images,
and it is possible because further corruption of read-only images is
impossible.
Inactive images are effectively read-only, too, so we should do the same
for them. bdrv_is_writable() can tell us whether an image can actually
be written to, so use its result instead of !bs->read_only.
(Otherwise, the assert(!(bs->open_flags & BDRV_O_INACTIVE)) in
bdrv_co_pwritev() will fail, crashing qemu.)
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180606193702.7113-3-mreitz@redhat.com
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The throttle block filter can be reopened, and with this it is
possible to change the throttle group that the filter belongs to.
The way the code does that is the following:
- On throttle_reopen_prepare(): create a new ThrottleGroupMember
and attach it to the new throttle group.
- On throttle_reopen_commit(): detach the old ThrottleGroupMember,
delete it and replace it with the new one.
The problem with this is that by replacing the ThrottleGroupMember the
previous value of io_limits_disabled is lost, causing an assertion
failure in throttle_co_drain_end().
This problem can be reproduced by reopening a throttle node:
$QEMU -monitor stdio
-object throttle-group,id=tg0,x-iops-total=1000 \
-blockdev node-name=hd0,driver=qcow2,file.driver=file,file.filename=hd.qcow2 \
-blockdev node-name=root,driver=throttle,throttle-group=tg0,file=hd0,read-only=on
(qemu) block_stream root
block/throttle.c:214: throttle_co_drain_end: Assertion `tgm->io_limits_disabled' failed.
Since we only want to change the throttle group on reopen there's no
need to create a ThrottleGroupMember and discard the old one. It's
easier if we simply detach it from its current group and attach it to
the new one.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180608151536.7378-1-berto@igalia.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This assert may fail, because bitmap_table is not initialized. Just
drop it, as it's obvious, that bitmap_table_load sets bitmap_table
parameter only when returning zero.
Reported-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20180608101225.2575-1-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Repairing OFLAG_COPIED is usually safe because it is done after the
refcounts have been repaired. Therefore, it we did not find anyone else
referencing a data or L2 cluster, it makes no sense to not set
OFLAG_COPIED -- and the other direction (clearing OFLAG_COPIED) is
always safe, anyway, it may just induce leaks.
Furthermore, if OFLAG_COPIED is actually consistent with a wrong (leaky)
refcount, we will decrement the refcount with -r leaks, but OFLAG_COPIED
will then be wrong. qemu-img check should not produce images that are
more corrupted afterwards then they were before.
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1527085
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20180509200059.31125-2-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Looking at the qcow2 code that is riddled with error_report() calls,
this is really how it should have been from the start.
Along the way, turn the target_version/current_version comparisons at
the beginning of qcow2_downgrade() into assertions (the caller has to
make sure these conditions are met), and rephrase the error message on
using compat=1.1 to get refcount widths other than 16 bits.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180509210023.20283-3-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
When creating a file, we should take the WRITE and RESIZE permissions.
We do not need either for the creation itself, but we do need them for
clearing and resizing it. So we can take the proper permissions by
replacing O_TRUNC with an explicit truncation to 0, and by taking the
appropriate file locks between those two steps.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180509215336.31304-3-mreitz@redhat.com
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
raw_apply_lock_bytes() and raw_check_lock_bytes() currently take a
BDRVRawState *, but they only use the lock_fd field. During image
creation, we do not have a BDRVRawState, but we do have an FD; so if we
want to reuse the functions there, we should modify them to receive only
the FD.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180509215336.31304-2-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
block/sheepdog.o has a 4M static variable that is 90% of QEMU's whole .bss
section. Replace it with a heap-allocated block, and make it smaller too
since only the inode header is actually being used.
bss size goes down from 4464280 to 269976.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20180523160721.14018-3-pbonzini@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
The expression "SD_INODE_SIZE - sizeof(inode.data_vdi_id)" already has a macro
defined for the same value (though with a nicer definition using offsetof).
Replace it.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20180523160721.14018-2-pbonzini@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
* Copy offloading for qemu-img convert (iSCSI, raw, and qcow2)
If the underlying storage supports copy offloading, qemu-img convert will
use it instead of performing reads and writes. This avoids data transfers
and thus frees up storage bandwidth for other purposes. SCSI EXTENDED COPY
and Linux copy_file_range(2) are used to implement this optimization.
* Drop spurious "WARNING: I\/O thread spun for 1000 iterations" warning
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJbFSBoAAoJEJykq7OBq3PISpEIAIcMao4/rzinAWXzS+ncK9LO
6FtRVpgutpHaWX2ayySaz5n2CdR3cNMrpCI7sjY2Kw0lrdkqxPgl5n0SWD+VCl4W
7+JLz/uF0iUV8X+99e7WGAjZbm9LSlxgn5AQKfrrwyPf0ZfzoYQ5nBMcQ6xjEeQP
48j2WqJqN9/u8RBD07o11yn0+CE5g56/f12xVjR5ASVodzsAmcZ2OQRMQbM01isU
1mBekJQkDxJkt5l13Rql8+t+vWz8/9BEW2c/eIDKvoayMqYJpdfKv4DqLloIuHnc
3RkquA0zUuKtl7xEnEkH/We7fi4QPGW/vyBN7ychS/zKzZFQrXmwqrAuFSw3dKU=
=vZp+
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging
Pull request
* Copy offloading for qemu-img convert (iSCSI, raw, and qcow2)
If the underlying storage supports copy offloading, qemu-img convert will
use it instead of performing reads and writes. This avoids data transfers
and thus frees up storage bandwidth for other purposes. SCSI EXTENDED COPY
and Linux copy_file_range(2) are used to implement this optimization.
* Drop spurious "WARNING: I\/O thread spun for 1000 iterations" warning
# gpg: Signature made Mon 04 Jun 2018 12:20:08 BST
# gpg: using RSA key 9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>"
# gpg: aka "Stefan Hajnoczi <stefanha@gmail.com>"
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35 775A 9CA4 ABB3 81AB 73C8
* remotes/stefanha/tags/block-pull-request:
main-loop: drop spin_counter
qemu-img: Convert with copy offloading
block-backend: Add blk_co_copy_range
iscsi: Implement copy offloading
iscsi: Create and use iscsi_co_wait_for_task
iscsi: Query and save device designator when opening
file-posix: Implement bdrv_co_copy_range
qcow2: Implement copy offloading
raw: Implement copy offloading
raw: Check byte range uniformly
block: Introduce API for copy offloading
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
vDPA support, fix to vhost blk RO bit handling, some include path
cleanups, NFIT ACPI table.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJbEXNvAAoJECgfDbjSjVRpc8gH/R8xrcFrV+k9wwbgYcOcGb6Y
LWjseE31pqJcxRV80vLOdzYEuLStZQKQQY7xBDMlA5vdyvZxIA6FLO2IsiJSbFAk
EK8pclwhpwQAahr8BfzenabohBv2UO7zu5+dqSvuJCiMWF3jGtPAIMxInfjXaOZY
odc1zY2D2EgsC7wZZ1hfraRbISBOiRaez9BoGDKPOyBY9G1ASEgxJgleFgoBLfsK
a1XU+fDM6hAVdxftfkTm0nibyf7PWPDyzqghLqjR9WXLvZP3Cqud4p8N29mY51pR
KSTjA4FYk6Z9EVMltyBHfdJs6RQzglKjxcNGdlrvacDfyFi79fGdiosVllrjfJM=
=3+V0
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
acpi, vhost, misc: fixes, features
vDPA support, fix to vhost blk RO bit handling, some include path
cleanups, NFIT ACPI table.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# gpg: Signature made Fri 01 Jun 2018 17:25:19 BST
# gpg: using RSA key 281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>"
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>"
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream: (31 commits)
vhost-blk: turn on pre-defined RO feature bit
ACPI testing: test NFIT platform capabilities
nvdimm, acpi: support NFIT platform capabilities
tests/.gitignore: add entry for generated file
arch_init: sort architectures
ui: use local path for local headers
qga: use local path for local headers
colo: use local path for local headers
migration: use local path for local headers
usb: use local path for local headers
sd: fix up include
vhost-scsi: drop an unused include
ppc: use local path for local headers
rocker: drop an unused include
e1000e: use local path for local headers
ioapic: fix up includes
ide: use local path for local headers
display: use local path for local headers
trace: use local path for local headers
migration: drop an unused include
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
It's a BlockBackend wrapper of the BDS interface.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180601092648.24614-10-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Issue EXTENDED COPY (LID1) command to implement the copy_range API.
The parameter data construction code is modified from libiscsi's
iscsi-dd.c.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180601092648.24614-9-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This loop is repeated a growing number times. Make a helper.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20180601092648.24614-8-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The device designator data returned in INQUIRY command will be useful to
fill in source/target fields during copy offloading. Do this when
connecting to the target and save the data for later use.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180601092648.24614-7-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
With copy_file_range(2), we can implement the bdrv_co_copy_range
semantics.
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180601092648.24614-6-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The two callbacks are implemented quite similarly to the read/write
functions: bdrv_co_copy_range_from maps for read and calls into bs->file
or bs->backing depending on the allocation status; bdrv_co_copy_range_to
maps for write and calls into bs->file.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180601092648.24614-5-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Just pass down to ->file.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180601092648.24614-4-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We don't verify the request range against s->size in the I/O callbacks
except for raw_co_pwritev. This is inconsistent (especially for
raw_co_pwrite_zeroes and raw_co_pdiscard), so fix them, in the meanwhile
make the helper reusable by the coming new callbacks.
Note that in most cases the block layer already verifies the request
byte range against our reported image length, before invoking the driver
callbacks. The exception is during image creating, after
blk_set_allow_write_beyond_eof(blk, true) is called. But in that case,
the requests are not directly from the user or guest. So there is no
visible behavior change in adding the check code.
The int64_t -> uint64_t inconsistency, as shown by the type casting, is
pre-existing due to the interface.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180601092648.24614-3-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Introduce the bdrv_co_copy_range() API for copy offloading. Block
drivers implementing this API support efficient copy operations that
avoid reading each block from the source device and writing it to the
destination devices. Examples of copy offload primitives are SCSI
EXTENDED COPY and Linux copy_file_range(2).
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180601092648.24614-2-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
When pulling in headers that are in the same directory as the C file (as
opposed to one in include/), we should use its relative path, without a
directory.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Tested-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
We're ready to declare the blockdev-create job stable. This renames the
corresponding QMP command from x-blockdev-create to blockdev-create.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
This changes the x-blockdev-create QMP command so that it doesn't block
the monitor and the main loop any more, but starts a background job that
performs the image creation.
The basic job as implemented here is all that is necessary to make image
creation asynchronous and to provide a QMP interface that can be marked
stable, but it still lacks a few features that jobs usually provide: The
job will ignore pause commands and it doesn't publish more than very
basic progress yet (total-progress is 1 and current-progress advances
from 0 to 1 when the driver callbacks returns). These features can be
added later without breaking compatibility.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
So far we relied on job->ret and strerror() to produce an error message
for failed jobs. Not surprisingly, this tends to result in completely
useless messages.
This adds a Job.error field that can contain an error string for a
failing job, and a parameter to job_completed() that sets the field. As
a default, if NULL is passed, we continue to use strerror(job->ret).
All existing callers are changed to pass NULL. They can be improved in
separate patches.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
.bdrv_co_create() is supposed to return 0 on success, but vhdx could
return a positive value instead. Fix this.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
.bdrv_co_create() is supposed to return 0 on success, but vdi could
return a positive value instead. Fix this.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
MIN_REFCOUNT_CACHE_SIZE is 4 and the cluster size is guaranteed to be
at most 2MB, so the minimum refcount cache size (in bytes) is always
going to fit in a 32-bit integer.
Coverity doesn't know that, and since we're storing the result in a
uint64_t (*refcount_cache_size) it thinks that we need the 64 bits and
that we probably want to do a 64-bit multiplication to prevent the
result from being truncated.
This is a false positive in this case, but it's a fair warning.
We could do a 64-bit multiplication to get rid of it, but since we
know that a 32-bit variable is enough to store this value let's simply
reuse min_refcount_cache, make it a normal int and stop doing casts.
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
BlockJob has fields .offset and .len, which are actually misnomers today
because they are no longer tied to block device sizes, but just progress
counters. As such they make a lot of sense in generic Jobs.
This patch moves the fields to Job and renames them to .progress_current
and .progress_total to describe their function better.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
The transition to the READY state was still performed in the BlockJob
layer, in the same function that sent the BLOCK_JOB_READY QMP event.
This patch brings the state transition to the Job layer and implements
the QMP event using a notifier called from the Job layer, like we
already do for other events related to state transitions.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This doesn't actually move any transaction code to Job yet, but it
renames the type for transactions from BlockJobTxn to JobTxn and makes
them contain Jobs rather than BlockJobs
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
block_job_finish_sync() doesn't contain anything block job specific any
more, so it can be moved to Job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This moves the .complete callback that tells a READY job to complete
from BlockJobDriver to JobDriver. The wrapper function job_complete()
doesn't require anything block job specific any more and can be moved
to Job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
block_job_drain() contains a blk_drain() call which cannot be moved to
Job, so add a new JobDriver callback JobDriver.drain which has a common
implementation for all BlockJobs. In addition to this we keep the
existing BlockJobDriver.drain callback that is called by the common
drain implementation for all block jobs.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
block_job_cancel_async() did two things that were still block job
specific:
* Setting job->force. This field makes sense on the Job level, so we can
just move it. While at it, rename it to job->force_cancel to make its
purpose more obvious.
* Resetting the I/O status. This can't be moved because generic Jobs
don't have an I/O status. What the function really implements is a
user resume, except without entering the coroutine. Consequently, it
makes sense to call the .user_resume driver callback here which
already resets the I/O status.
The old block_job_cancel_async() has two separate if statements that
check job->iostatus != BLOCK_DEVICE_IO_STATUS_OK and job->user_paused.
However, the former condition always implies the latter (as is
asserted in block_job_iostatus_reset()), so changing the explicit call
of block_job_iostatus_reset() on the former condition with the
.user_resume callback on the latter condition is equivalent and
doesn't need to access any BlockJob specific state.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This moves the finalisation of a single job from BlockJob to Job.
Some part of this code depends on job transactions, and job transactions
call this code, we introduce some temporary calls from Job functions to
BlockJob ones. This will be fixed once transactions move to Job, too.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This renames the BlockJobCreateFlags constants, moves a few JOB_INTERNAL
checks to job_create() and the auto_{finalize,dismiss} fields from
BlockJob to Job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
While we already moved the state related to job pausing to Job, the
functions to do were still BlockJob only. This commit moves them over to
Job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
There is nothing block layer specific about block_job_sleep_ns(), so
move the function to Job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This commit moves some core functions for dealing with the job coroutine
from BlockJob to Job. This includes primarily entering the coroutine
(both for the first and reentering) and yielding explicitly and at pause
points.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Move the defer_to_main_loop functionality from BlockJob to Job.
The code can be simplified because we can use job->aio_context in
job_defer_to_main_loop_bh() now, instead of having to access the
BlockDriverState.
Probably taking the data->aio_context lock in addition was already
unnecessary in the old code because we didn't actually make use of
anything protected by the old AioContext except getting the new
AioContext, in case it changed between scheduling the BH and running it.
But it's certainly unnecessary now that the BDS isn't accessed at all
any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
We cannot yet move the whole logic around job cancelling to Job because
it depends on quite a few other things that are still only in BlockJob,
but we can move the cancelled field at least.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
This moves reference counting from BlockJob to Job.
In order to keep calling the BlockJob cleanup code when the job is
deleted via job_unref(), introduce a new JobDriver.free callback. Every
block job must use block_job_free() for this callback, this is asserted
in block_job_create().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
This moves BlockJob.status and the closely related functions
(block_)job_state_transition() and (block_)job_apply_verb to Job. The
two QAPI enums are renamed to JobStatus and JobVerb.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This moves the job_type field from BlockJobDriver to JobDriver.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
QAPI types aren't externally visible, so we can rename them without
causing problems. Before we add a job type to Job, rename the enum
so it can be used for more than just block jobs.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
This is the first step towards creating an infrastructure for generic
background jobs that aren't tied to a block device. For now, Job only
stores its ID and JobDriver, the rest stays in BlockJob.
The following patches will move over more parts of BlockJob to Job if
they are meaningful outside the context of a block job.
BlockJob.driver is now redundant, but this patch leaves it around to
avoid unnecessary churn. The next patches will get rid of almost all of
its uses anyway so that it can be removed later with much less churn.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
In commit 8b9ad56e9c, we removed the code that could result
in our getting to sd_prealloc()'s out_with_err_set label with a
NULL blk pointer. That makes the NULL check in the error-handling
path unnecessary, and Coverity gripes about it (CID 1390636).
Delete the redundant check.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit c22a03454 QAPIfied option parsing in the NFS block driver, but
forgot to remove all the options we processed. Therefore, we get an
error in bdrv_open_inherit(), which thinks the remaining options are
invalid. Trying to open an NFS image will result in an error like this:
Block protocol 'nfs' doesn't support the option 'server.host'
Remove all options from the QDict to make the NFS driver work again.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20180516160816.26259-1-kwolf@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Don't throw away local_err, but propagate it to errp.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20180516161034.27440-1-kwolf@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Update the rest of the filter drivers to support
BDRV_REQ_WRITE_UNCHANGED. They already forward write request flags to
their children, so we just have to announce support for it.
This patch does not cover the replication driver because that currently
does not support flags at all, and because it just grabs the WRITE
permission for its children when it can, so we should be fine just
submitting the incoming WRITE_UNCHANGED requests as normal writes.
It also does not cover format drivers for similar reasons. They all use
bdrv_format_default_perms() as their .bdrv_child_perm() implementation
so they just always grab the WRITE permission for their file children
whenever possible. In addition, it often would be difficult to
ascertain whether incoming unchanging writes end up as unchanging writes
in their files. So we just leave them as normal potentially changing
writes.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180421132929.21610-7-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We just need to forward it to quorum's children (except in case of a
rewrite because of corruption), but for that we first have to support
flags in child requests at all.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180421132929.21610-6-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180421132929.21610-5-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This flag signifies that a write request will not change the visible
disk content. With this flag set, it is sufficient to have the
BLK_PERM_WRITE_UNCHANGED permission instead of BLK_PERM_WRITE.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180421132929.21610-4-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This adds a simple copy-on-read filter driver. It relies on the already
existing COR functionality in the central block layer code, which may be
moved here once we no longer need it there.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180421132929.21610-2-mreitz@redhat.com
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The L2 and refcount caches have default sizes that can be overridden
using the l2-cache-size and refcount-cache-size (an additional
parameter named cache-size sets the combined size of both caches).
Unless forced by one of the aforementioned parameters, QEMU will set
the unspecified sizes so that the L2 cache is 4 times larger than the
refcount cache.
This is based on the premise that the refcount metadata needs to be
only a fourth of the L2 metadata to cover the same amount of disk
space. This is incorrect for two reasons:
a) The amount of disk covered by an L2 table depends solely on the
cluster size, but in the case of a refcount block it depends on
the cluster size *and* the width of each refcount entry.
The 4/1 ratio is only valid with 16-bit entries (the default).
b) When we talk about disk space and L2 tables we are talking about
guest space (L2 tables map guest clusters to host clusters),
whereas refcount blocks are used for host clusters (including
L1/L2 tables and the refcount blocks themselves). On a fully
populated (and uncompressed) qcow2 file, image size > virtual size
so there are more refcount entries than L2 entries.
Problem (a) could be fixed by adjusting the algorithm to take into
account the refcount entry width. Problem (b) could be fixed by
increasing a bit the refcount cache size to account for the clusters
used for qcow2 metadata.
However this patch takes a completely different approach and instead
of keeping a ratio between both cache sizes it assigns as much as
possible to the L2 cache and the remainder to the refcount cache.
The reason is that L2 tables are used for every single I/O request
from the guest and the effect of increasing the cache is significant
and clearly measurable. Refcount blocks are however only used for
cluster allocation and internal snapshots and in practice are accessed
sequentially in most cases, so the effect of increasing the cache is
negligible (even when doing random writes from the guest).
So, make the refcount cache as small as possible unless the user
explicitly asks for a larger one.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 9695182c2eb11b77cb319689a1ebaa4e7c9d6591.1523968389.git.berto@igalia.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Compressed clusters are not supposed to have the COPIED bit set.
"qemu-img check" detects that and prints an error message reporting
the number of the affected host cluster. This doesn't make much sense
because compressed clusters are not aligned to host clusters, so it
would be better to report the offset instead. Plus, the calculation is
wrong and it uses the raw L2 entry as if it was simply an offset.
This patch fixes the error message and reports the offset of the
compressed cluster.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-id: 0f687957feb72e80c740403191a47e607c2463fe.1523376013.git.berto@igalia.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
The backup block job directly accesses the driver field in BlockJob. Add
a wrapper for getting it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
This gets us rid of more direct accesses to BlockJob fields from the
job drivers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
All block job drivers support .set_speed and all of them duplicate the
same code to implement it. Move that code to blockjob.c and remove the
now useless callback.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Every block job has a RateLimit, and they all do the exact same thing
with it, so it should be common infrastructure. Move the struct field
for a start.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Block job drivers are not expected to mess with the internals of the
BlockJob object, so provide wrapper functions for one of the cases where
they still do it: Updating the progress counter.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
We have too many driver callback interfaces; simplify the mess
somewhat by merging the flags parameter of .bdrv_co_writev_flags()
into .bdrv_co_writev(). Note that as long as a driver doesn't set
.supported_write_flags, the flags argument will be 0 and behavior is
identical. Also note that the public function bdrv_co_writev() still
lacks a flags argument; so the driver signature is thus intentionally
slightly different. But that's not the end of the world, nor the first
time that the driver interface differs slightly from the public
interface.
Ideally, we should be rewriting all of these drivers to use modern
byte-based interfaces. But that's a more invasive patch to write
and audit, compared to the simplification done here.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Now that all drivers with aio callbacks are using the
byte-based interfaces, we can remove the sector-based versions.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the last few sector-based callbacks
in the vxhs driver.
Note that the driver was already using byte-based calls for
performing actual I/O, so this just gets rid of a round trip
of scaling; however, as I don't know if VxHS is tolerant of
non-sector AIO operations, I went with the conservative approach
of adding .bdrv_refresh_limits to override the block layer
defaults back to the pre-patch value of 512.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the last few sector-based callbacks
in the rbd driver.
Note that the driver was already using byte-based calls for
performing actual I/O, so this just gets rid of a round trip
of scaling; however, as I don't know if RBD is tolerant of
non-sector AIO operations, I went with the conservate approach
of adding .bdrv_refresh_limits to override the block layer
defaults back to the pre-patch value of 512.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the last few sector-based callbacks
in the null-co and null-aio drivers.
Note that since the null driver does nothing on writes, it trivially
supports the BDRV_REQ_FUA flag (all writes have already landed to
the same bit-bucket without needing an extra flush call). Also, since
the null driver does just as well with byte-based requests, we can
now avoid cycles wasted on read-modify-write by taking advantage of
the block layer now defaulting the alignment to 1 instead of 512.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the last few sector-based callbacks
in the file-win32 driver.
Note that the driver was already using byte-based calls for
performing actual I/O, so this just gets rid of a round trip
of scaling; however, as I don't know if Windows is tolerant of
non-sector AIO operations, I went with the conservative approach
of modifying .bdrv_refresh_limits to override the block layer
defaults back to the pre-patch value of 512.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Add new sector-based aio callbacks for read and write,
to match the fact that bdrv_aio_pdiscard is already byte-based.
Ideally, drivers should be converted to use coroutine callbacks
rather than aio; but that is not quite as trivial (and if we were
to do that conversion, the null-aio driver would disappear), so for
the short term, converting the signature but keeping things with
aio is easier. However, we CAN declare that a driver that uses
the byte-based aio interfaces now defaults to byte-based
operations, and must explicitly provide a refresh_limits override
to stick with larger alignments (making the alignment issues more
obvious directly in the drivers touched in the next few patches).
Once all drivers are converted, the sector-based aio callbacks will
be removed; in the meantime, a FIXME comment is added due to a
slight inefficiency that will be touched up as part of that later
cleanup.
Simplify some instances of 'bs->drv' into 'drv' while touching this,
since the local variable already exists to reduce typing.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
blk_get_aio_context verifies if BlockDriverState bs is not NULL,
return bdrv_get_aio_context(bs) if true or qemu_get_aio_context()
otherwise. However, bdrv_get_aio_context from block.c already does
this verification itself, also returning qemu_get_aio_context()
if bs is NULL:
AioContext *bdrv_get_aio_context(BlockDriverState *bs)
{
return bs ? bs->aio_context : qemu_get_aio_context();
}
This patch simplifies blk_get_aio_context to simply call
bdrv_get_aio_context instead of replicating the same logic.
Signed-off-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
mincore(2) checks whether pages are resident. Use it to verify that
page cache has been dropped.
You can trigger a verification failure by mmapping the image file from
another process that loads a byte from a page, forcing it to become
resident. bdrv_co_invalidate_cache() will fail while that process is
alive.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180427162312.18583-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
On Linux posix_fadvise(POSIX_FADV_DONTNEED) invalidates pages*. Use
this to drop page cache on the destination host during shared storage
migration. This way the destination host will read the latest copy of
the data and will not use stale data from the page cache.
The flow is as follows:
1. Source host writes out all dirty pages and inactivates drives.
2. QEMU_VM_EOF is sent on migration stream.
3. Destination host invalidates caches before accessing drives.
This patch enables live migration even with -drive cache.direct=off.
* Terms and conditions may apply, please see patch for details.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180427162312.18583-2-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Both the option string for the 'redundancy' option and the
SheepdogRedundancy object that is created accordingly could be leaked in
error paths. This fixes the memory leaks.
Reported by Coverity (CID 1390614 and 1390641).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20180503153509.22223-1-kwolf@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Commit b76e4458b1 made the mirror block
job respect block-job-cancel's @force flag: With that flag set, it would
now always really cancel, even post-READY.
Unfortunately, it had a side effect: Without that flag set, it would now
never cancel, not even before READY. Considering that is an
incompatible change and not noted anywhere in the commit or the
description of block-job-cancel's @force parameter, this seems
unintentional and we should revert to the previous behavior, which is to
immediately cancel the job when block-job-cancel is called before source
and target are in sync (i.e. before the READY event).
Cc: qemu-stable@nongnu.org
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1572856
Reported-by: Yanan Fu <yfu@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20180501220509.14152-2-mreitz@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Commit b76e4458b1 ("block/mirror: change
the semantic of 'force' of block-job-cancel") accidentally removed the
ratelimit in the mirror job.
Reintroduce the ratelimit but keep the block-job-cancel force=true
behavior that was added in commit
b76e4458b1.
Note that block_job_sleep_ns() returns immediately when the job is
cancelled. Therefore it's safe to unconditionally call
block_job_sleep_ns() - a cancelled job does not sleep.
This commit fixes the non-deterministic qemu-iotests 185 output. The
test relies on the ratelimit to make the job sleep until the 'quit'
command is processed. Previously the job could complete before the
'quit' command was received since there was no ratelimit.
Cc: Liang Li <liliang.opensource@gmail.com>
Cc: Jeff Cody <jcody@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180424123527.19168-1-stefanha@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
The NBD spec is proposing a relaxation of NBD_CMD_BLOCK_STATUS
where a server may have the final extent per context give a
length beyond the original request, if it can easily prove that
subsequent bytes have the same status, on the grounds that a
client can take advantage of this information for fewer block
status requests. Since qemu 2.12 as a client always sends
NBD_CMD_FLAG_REQ_ONE, and rejects a server that sends extra
length, the upstream NBD spec will probably limit this behavior
to clients that don't request REQ_ONE semantics; but it doesn't
hurt to relax qemu to always be permissive of this server
behavior, even if it continues to use REQ_ONE.
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180503222626.1303410-1-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
For convenience and clarity, make it possible to call qobject_ref() at
the time when the reference is associated with a variable, or
argument, by making qobject_ref() return the same pointer as given.
Use that to simplify the callers.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180419150145.24795-5-marcandre.lureau@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Useless change to qobject_ref_impl() dropped, commit message improved
slightly]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Now that we can safely call QOBJECT() on QObject * as well as its
subtypes, we can have macros qobject_ref() / qobject_unref() that work
everywhere instead of having to use QINCREF() / QDECREF() for QObject
and qobject_incref() / qobject_decref() for its subtypes.
The replacement is mechanical, except I broke a long line, and added a
cast in monitor_qmp_cleanup_req_queue_locked(). Unlike
qobject_decref(), qobject_unref() doesn't accept void *.
Note that the new macros evaluate their argument exactly once, thus no
need to shout them.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180419150145.24795-4-marcandre.lureau@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Rebased, semantic conflict resolved, commit message improved]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Checking reopen by existence of some bitmaps is wrong, as it may be
some other bitmaps, or on the other hand, user may remove bitmaps. This
criteria is bad. To simplify things and make behavior more predictable
let's just add a flag to remember, that we've already tried to load
bitmaps on open and do not want do it again.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20180411122606.367301-2-vsementsov@virtuozzo.com
[mreitz: Changed comment wording according to Eric Blake's suggestion]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Streaming and the commit block job only want to apply throttling when
they actually copied data instead of skipping it, so they made the
calculation of delay_ns conditional. However, delay_ns isn't reset when
skipping some sectors, so instead of not waiting, the old delay is
applied again.
Properly reset delay_ns where needed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Commit 4bfb274 added some QAPIfication of option parsing in
qemu_rbd_open(). We need to remove all the options we processed,
otherwise in bdrv_open_inherit() we will think the remaining options are
invalid.
(This needs to go in 2.12 to avoid a regression that prevents rbd
from being opened.)
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Storing the lseek() result in an int results in it overflowing when the
file is at least 2 GB big. Then, we have a 50 % chance of the result
being "negative" and thus thinking an error occurred when actually
everything went just fine.
So we should use the correct type for storing the result: off_t.
Reported-by: Daniel P. Berrange <berrange@redhat.com>
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1549231
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180228131315.30194-2-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The legacy command line interface gets the socket path from an option
called 'socket'. QAPI in contract uses SocketAddress, where the
corresponding option is called 'path'.
Fix the gluster block driver to accept both 'socket' and 'path', with
'path' being the preferred syntax.
https://bugzilla.redhat.com/show_bug.cgi?id=1545155
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20180403110810.25624-1-kwolf@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
In commit 223a23c198, we implemented a
workaround in the gluster driver to handle invalid values returned for
SEEK_DATA or SEEK_HOLE.
In some instances, these same invalid values can be seen in the posix
file handler as well - for example, it has been reported on FUSE gluster
mounts.
Calling assert() for these invalid values is overly harsh; we can safely
return -EIO and allow this case to be treated as a "learned nothing"
case (e.g., D4 / H4, as commented in the code).
This patch does the same thing that 223a23c198 did for gluster.c,
except in file-posix.c
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The legacy command line interface gets the socket path from an option
called 'socket'. QAPI in contract uses SocketAddress, where the
corresponding option is called 'path'.
Fix the gluster block driver to accept both 'socket' and 'path', with
'path' being the preferred syntax.
https://bugzilla.redhat.com/show_bug.cgi?id=1545155
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
iotests 123 and 209 fail on 32-bit platforms. The culprit:
sizeof(extent) is wrong; we want sizeof(*extent). But since
the struct is 8 bytes, it happened to work on 64-bit platforms
where the pointer is also 8 bytes (nasty).
Fixes: 78a33ab58
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180327210517.1804242-1-eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>