Commit f3c507adcd ("NVMe: Initial commit for new storage interface")
introduced the NvmeBar structure. Unfortunately in commit bdd6a90a9e
("block: Add VFIO based NVMe driver") we duplicated it.
Apparently in commit a3d9a352d4 ("block: Move NVMe constants to
a separate header") we tried to unify headers but forgot to remove
the structure declared in the block/nvme.c source file.
Do it now, and remove the structure size check which is redundant
with the header check added in commit 74e18435c0 ("hw/block/nvme:
Align I/O BAR to 4 KiB").
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200904124130.583838-3-philmd@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Fam Zheng <fam@euphon.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to use the NvmeBar structure from "block/nvme.h" in the
next commit. As a preliminary step, group all the NVMe controller
registers in the 'ctrl' field, keeping the doorbells registers
out of it.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200904124130.583838-2-philmd@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Fam Zheng <fam@euphon.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The intended behaviour was that locking=off/auto work and have no
effect (to remain compatible with file-posix), whereas locking=on would
return an error. Unfortunately, the code forgot to remove "locking" from
the options QDict, so any attempt to use the option would fail.
Replace the option parsing code for "locking" with something that is
part of the raw_runtime_opts QemuOptsList (so it is properly removed
from the QDict) and looks more like file-posix.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200907092739.9988-1-kwolf@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Some trace points are attributed to the wrong source file. Happens
when we neglect to update trace-events for code motion, or add events
in the wrong place, or misspell the file name.
Clean up with help of scripts/cleanup-trace-events.pl. Funnies
requiring manual post-processing:
* accel/tcg/cputlb.c trace points are in trace-events.
* block.c and blockdev.c trace points are in block/trace-events.
* hw/block/nvme.c uses the preprocessor to hide its trace point use
from cleanup-trace-events.pl.
* hw/tpm/tpm_spapr.c uses pseudo trace point tpm_spapr_show_buffer to
guard debug code.
* include/hw/xen/xen_common.h trace points are in hw/xen/trace-events.
* linux-user/trace-events abbreviates a tedious list of filenames to
*/signal.c.
* net/colo-compare and net/filter-rewriter.c use pseudo trace points
colo_compare_miscompare and colo_filter_rewriter_debug to guard
debug code.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20200806141334.3646302-5-armbru@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tracked down with the help of scripts/cleanup-trace-events.pl.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20200806141334.3646302-4-armbru@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Parts of the block layer treat BDS.backing_file as if it were whatever
the image header says (i.e., if it is a relative path, it is relative to
the overlay), other parts treat it like a cache for
bs->backing->bs->filename (relative paths are relative to the CWD).
Considering bs->backing->bs->filename exists, let us make it mean the
former.
Among other things, this now allows the user to specify a base when
using qemu-img to commit an image file in a directory that is not the
CWD (assuming, everything uses relative filenames).
Before this patch:
$ ./qemu-img create -f qcow2 foo/bot.qcow2 1M
$ ./qemu-img create -f qcow2 -b bot.qcow2 foo/mid.qcow2
$ ./qemu-img create -f qcow2 -b mid.qcow2 foo/top.qcow2
$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find '[...]/foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
After this patch:
$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
Image committed.
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
Image committed.
With this change, bdrv_find_backing_image() must look at whether the
user has overridden a BDS's backing file. If so, it can no longer use
bs->backing_file, but must instead compare the given filename against
the backing node's filename directly.
Note that this changes the QAPI output for a node's backing_file. We
had very inconsistent output there (sometimes what the image header
said, sometimes the actual filename of the backing image). This
inconsistent output was effectively useless, so we have to decide one
way or the other. Considering that bs->backing_file usually at runtime
contained the path to the image relative to qemu's CWD (or absolute),
this patch changes QAPI's backing_file to always report the
bs->backing->bs->filename from now on. If you want to receive the image
header information, you have to refer to full-backing-filename.
This necessitates a change to iotest 228. The interesting information
it really wanted is the image header, and it can get that now, but it
has to use full-backing-filename instead of backing_file. Because of
this patch's changes to bs->backing_file's behavior, we also need some
reference output changes.
Along with the changes to bs->backing_file, stop updating
BDS.backing_format in bdrv_backing_attach() as well. This way,
ImageInfo's backing-filename and backing-filename-format fields will
represent what the image header says and nothing else.
iotest 245 changes in behavior: With the backing node no longer
overriding the parent node's backing_file string, you can now omit the
@backing option when reopening a node with neither a default nor a
current backing file even if it used to have a backing node at some
point.
273 also changes: The base image is opened without a format layer, so
ImageInfo.backing-filename-format used to report "file" for the base
image's overlay after blockdev-snapshot. However, the image header
never says "file" anywhere, so it now reports $IMGFMT.
Signed-off-by: Max Reitz <mreitz@redhat.com>
With bdrv_filter_bs(), we can easily handle this default filter behavior
in bdrv_co_block_status().
blkdebug wants to have an additional assertion, so it keeps its own
implementation, except bdrv_co_block_status_from_file() needs to be
inlined there.
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
This includes some permission limiting (for example, we only need to
take the RESIZE permission if the base is smaller than the top).
Signed-off-by: Max Reitz <mreitz@redhat.com>
This includes some permission limiting (for example, we only need to
take the RESIZE permission for active commits where the base is smaller
than the top).
base_overlay is introduced so we can query bdrv_is_allocated_above() on
it - we cannot do that with base itself, because a filter's block_status
is the same as its child node, so if there are filters on base,
bdrv_is_allocated_above() on base would return information including
base.
Use this opportunity to rename qmp_drive_mirror()'s "source" BDS to
"target_backing_bs", because that is what it really refers to.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
query-block, query-named-block-nodes, and query-blockstats now return
any filtered child under "backing", not just bs->backing or COW
children. This is so that filters do not interrupt the reported backing
chain. This changes the output for iotest 184, as the throttled node
now appears as a backing child.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
It makes no sense to report the block stats of a purely metadata-storing
child in query-blockstats. So if the primary child does not have any
data, try to find a unique data-storing child.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
It is trivial, so we might as well do it.
Remove _filter_actual_image_size from iotest 184, so we get to see the
result in its reference output.
Signed-off-by: Max Reitz <mreitz@redhat.com>
If the top node's driver does not provide snapshot functionality and we
want to fall back to a node down the chain, we need to snapshot all
non-COW children. For simplicity's sake, just do not fall back if there
is more than one such child. Furthermore, we really only can fall back
to bs->file and bs->backing, because bdrv_snapshot_goto() has to modify
the child link (notably, set it to NULL).
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
If a node whose driver does not provide VM state functions has a
metadata child, the VM state should probably go there; if it is a
filter, the VM state should probably go there. It follows that we
should generally go down to the primary child.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Instead of looking at just bs->file and bs->backing, we should look at
all children that could end up receiving forwarded requests.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Before HEAD^, we needed this because bdrv_co_flush() by itself would
only flush bs->file. With HEAD^, bdrv_co_flush() will flush all
children on which a WRITE or WRITE_UNCHANGED permission has been taken.
Thus, vmdk no longer needs to do it itself.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
If the driver does not support .bdrv_co_flush() so bdrv_co_flush()
itself has to flush the children of the given node, it should not flush
just bs->file->bs, but in fact all children that might have been written
to (judging from the permissions taken on them).
This is a bug fix for qcow2 images with an external data file, as they
so far did not flush that data_file node.
In any case, the BLKDBG_EVENT() should be emitted on the primary child,
because that is where a blkdebug node would be if there is any.
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
The condition modified here is not about potentially filtered children,
but only about COW sources (i.e. traditional backing files).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Because of the (not so recent anymore) changes that make the stream job
independent of the base node and instead track the node above it, we
have to split that "bottom" node into two cases: The bottom COW node,
and the node directly above the base node (which may be an R/W filter
or the bottom COW node).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Places that use patterns like
if (bs->drv->is_filter && bs->file) {
... something about bs->file->bs ...
}
should be
BlockDriverState *filtered = bdrv_filter_bs(bs);
if (filtered) {
... something about @filtered ...
}
instead.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
The original purpose of bdrv_is_encrypted() was to inquire whether a BDS
can be used without the user entering a password or not. It has not
been used for that purpose for quite some time.
Actually, it is not even fit for that purpose, because to answer that
question, it would have recursively query all of the given node's
children.
So now we have to decide in which direction we want to fix
bdrv_is_encrypted(): Recursively query all children, or drop it and just
use bs->encrypted to get the current node's status?
Nowadays, its only purpose is to report through bdrv_query_image_info()
whether the given image is encrypted or not. For this purpose, it is
probably more interesting to see whether a given node itself is
encrypted or not (otherwise, a management application cannot discern for
certain which nodes are really encrypted and which just have encrypted
children).
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
In preparation of using multiple IRQ (thus multiple eventfds)
make BDRVNVMeState::irq_notifier an array (for now of a single
element, the admin queue notifier).
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-16-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As we want to do per-queue polling, extract the nvme_poll_queue()
method which operates on a single queue.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-15-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
nvme_create_queue_pair() doesn't require BlockDriverState anymore.
Replace it by BDRVNVMeState and AioContext to simplify.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-14-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
BDRV_POLL_WHILE() is defined as:
#define BDRV_POLL_WHILE(bs, cond) ({ \
BlockDriverState *bs_ = (bs); \
AIO_WAIT_WHILE(bdrv_get_aio_context(bs_), \
cond); })
As we will remove the BlockDriverState use in the next commit,
start by using the exploded version of BDRV_POLL_WHILE().
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-13-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
nvme_init_queue() doesn't require BlockDriverState anymore.
Replace it by BDRVNVMeState to simplify.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-12-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qemu_try_blockalign() is a generic API that call back to the
block driver to return its page alignment. As we call from
within the very same driver, we already know to page alignment
stored in our state. Remove indirections and use the value from
BDRVNVMeState.
This change is required to later remove the BlockDriverState
argument, to make nvme_init_queue() per hardware, and not per
block driver.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-11-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the next commit we'll get rid of qemu_try_blockalign().
To ease review, first replace qemu_try_blockalign0() by explicit
calls to qemu_try_blockalign() and memset().
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-10-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We allocate an unique chunk of memory then use it for two
different structures. By using an union, we make it clear
the data is overlapping (and we can remove the casts).
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-9-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are going to modify the code in the next commit. Renaming
the 'resp' variable to 'id' first makes the next commit easier
to review. No logical changes.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-8-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Rearrange nvme_add_io_queue() by using a common error path.
This will be proven useful in few commits where we add IRQ
notification to the IO queues.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-7-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Do not use the same error message for different failures.
Display a different error whether it is the CQ or the SQ.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-6-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use definitions instead of '0' or '1' indexes. Also this will
be useful when using multi-queues later.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-5-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As nvme_create_queue_pair() is allowed to fail, replace the
alloc() calls by try_alloc() to avoid aborting QEMU.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-4-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Avoid further processing if TRACE_NVME_SUBMIT_COMMAND_RAW is
not enabled. This is an untested intend of performance optimization.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-3-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use self-explicit SCALE_MS definition instead of magic value.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-2-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This makes nbd's connection_co yield during reconnects, so that
reconnect doesn't block the main thread. This is very important in
case of an unavailable nbd server host: connect() call may take a long
time, blocking the main thread (and due to reconnect, it will hang
again and again with small gaps of working time during pauses between
connection attempts).
Realization notes:
- We don't want to implement non-blocking connect() over non-blocking
socket, because getaddrinfo() doesn't have portable non-blocking
realization anyway, so let's just use a thread for both getaddrinfo()
and connect().
- We can't use qio_channel_socket_connect_async (which behaves
similarly and starts a thread to execute connect() call), as it's relying
on someone iterating main loop (g_main_loop_run() or something like
this), which is not always the case.
- We can't use thread_pool_submit_co API, as thread pool waits for all
threads to finish (but we don't want to wait for blocking reconnect
attempt on shutdown.
So, we just create the thread by hand. Some additional difficulties
are:
- We want our connect to avoid blocking drained sections and aio context
switches. To achieve this, we make it possible to "cancel" synchronous
wait for the connect (which is a coroutine yield actually), still,
the thread continues in background, and if successful, its result may be
reused on next reconnect attempt.
- We don't want to wait for reconnect on shutdown, so there is
CONNECT_THREAD_RUNNING_DETACHED thread state, which means that the block
layer is no longer interested in a result, and thread should close new
connected socket on finish and free the state.
How to reproduce the bug, fixed with this commit:
1. Create an image on node1:
qemu-img create -f qcow2 xx 100M
2. Start NBD server on node1:
qemu-nbd xx
3. Start vm with second nbd disk on node2, like this:
./x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
file=/work/images/cent7.qcow2 -drive file=nbd+tcp://192.168.100.2 \
-vnc :0 -qmp stdio -m 2G -enable-kvm -vga std
4. Access the vm through vnc (or some other way?), and check that NBD
drive works:
dd if=/dev/sdb of=/dev/null bs=1M count=10
- the command should succeed.
5. Now, let's trigger nbd-reconnect loop in Qemu process. For this:
5.1 Kill NBD server on node1
5.2 run "dd if=/dev/sdb of=/dev/null bs=1M count=10" in the guest
again. The command should fail and a lot of error messages about
failing disk may appear as well.
Now NBD client driver in Qemu tries to reconnect.
Still, VM works well.
6. Make node1 unavailable on NBD port, so connect() from node2 will
last for a long time:
On node1 (Note, that 10809 is just a default NBD port):
sudo iptables -A INPUT -p tcp --dport 10809 -j DROP
After some time the guest hangs, and you may check in gdb that Qemu
hangs in connect() call, issued from the main thread. This is the
BUG.
7. Don't forget to drop iptables rule from your node1:
sudo iptables -D INPUT -p tcp --dport 10809 -j DROP
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200812145237.4396-1-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: minor wording and formatting tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
The NVM Express specification generally uses 'zeroes' and not 'zeros',
so let us align with it.
Cc: Fam Zheng <fam@euphon.net>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Add missing fields in the Identify Controller and Identify Namespace
data structures to bring them in line with NVMe v1.3.
This also adds data structures and defines for SGL support which
requires a couple of trivial changes to the nvme block driver as well.
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Acked-by: Fam Zheng <fam@euphon.net>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Message-Id: <20200706061303.246057-2-its@irrelevant.dk>
Remove superfluous breaks, as there is a "return" before them.
Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <1594631107-36574-1-git-send-email-wang.yi59@zte.com.cn>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
The qcow2 driver needs the zlib dependency. While emulators
provided it through the migration code, this is not true of
the tools. Move the dependency from the qcow1 rule directly
into block_ss so that it is included unconditionally.
Fixes build with --disable-qcow1.
Reported-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Cc: qemu-block@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move typedef closer to the type check macros, to make it easier
to convert the code to OBJECT_DEFINE_TYPE() in the future.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Tested-By: Roman Bolshakov <r.bolshakov@yadro.com>
Message-Id: <20200825192110.3528606-17-ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
This function is only used by qcow2_expand_zero_clusters() to
downgrade a qcow2 image to a previous version. This would require
transforming all extended L2 entries into normal L2 entries but this
is not a simple task and there are no plans to implement this at the
moment.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <15e65112b4144381b4d8c0bdf8fb76b0d813e3d1.1594396418.git.berto@igalia.com>
[mreitz: Fixed comment style]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Traditional qcow2 images don't allow preallocation if a backing file
is set. This is because once a cluster is allocated there is no way to
tell that its data should be read from the backing file.
Extended L2 entries have individual allocation bits for each
subcluster, and therefore it is perfectly possible to have an
allocated cluster with all its subclusters unallocated.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <6d5b0f38e7dc5f2f31d8cab1cb92044e9909aece.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Now that the implementation of subclusters is complete we can finally
add the necessary options to create and read images with this feature,
which we call "extended L2 entries".
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <6476caaa73216bd05b7bb2d504a20415e1665176.1594396418.git.berto@igalia.com>
[mreitz: %s/5\.1/5.2/; fixed 302's and 303's reference output]
Signed-off-by: Max Reitz <mreitz@redhat.com>
This field allows us to indicate that the L2 metadata update does not
come from a write request with actual data but from a preallocation
request.
For traditional images this does not make any difference, but for
images with extended L2 entries this means that the clusters are
allocated normally in the L2 table but individual subclusters are
marked as unallocated.
This will allow preallocating images that have a backing file.
There is one special case: when we resize an existing image we can
also request that the new clusters are preallocated. If the image
already had a backing file then we have to hide any possible stale
data and zero out the new clusters (see commit 955c7d6687 for more
details).
In this case the subclusters cannot be left as unallocated so the L2
bitmap must be updated.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <960d4c444a4f5a870e2b47e5da322a73cd9a2f5a.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Extended L2 entries are bigger than normal L2 entries so this has an
impact on the amount of metadata needed for a qcow2 file.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <7efae2efd5e36b42d2570743a12576d68ce53685.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This works now at the subcluster level and pwrite_zeroes_alignment is
updated accordingly.
qcow2_cluster_zeroize() is turned into qcow2_subcluster_zeroize() with
the following changes:
- The request can now be subcluster-aligned.
- The cluster-aligned body of the request is still zeroized using
zero_in_l2_slice() as before.
- The subcluster-aligned head and tail of the request are zeroized
with the new zero_l2_subclusters() function.
There is just one thing to take into account for a possible future
improvement: compressed clusters cannot be partially zeroized so
zero_l2_subclusters() on the head or the tail can return -ENOTSUP.
This makes the caller repeat the *complete* request and write actual
zeroes to disk. This is sub-optimal because
1) if the head area was compressed we would still be able to use
the fast path for the body and possibly the tail.
2) if the tail area was compressed we are writing zeroes to the
head and the body areas, which are already zeroized.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <17e05e2ee7e12f10dcf012da81e83ebe27eb3bef.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The bdrv_co_pwrite_zeroes() call here fills complete clusters with
zeroes, but it can happen that some subclusters are not part of the
write request or the copy-on-write. This patch makes sure that only
the affected subclusters are overwritten.
A potential improvement would be to also fill with zeroes the other
subclusters if we can guarantee that we are not overwriting existing
data. However this would waste more disk space, so we should first
evaluate if it's really worth doing.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <b3dc97e8e2240ddb5191a4f930e8fc9653f94621.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Compressed clusters always have the bitmap part of the extended L2
entry set to 0.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <04455b3de5dfeb9d1cfe1fc7b02d7060a6e09710.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The L2 bitmap needs to be updated after each write to indicate what
new subclusters are now allocated. This needs to happen even if the
cluster was already allocated and the L2 entry was otherwise valid.
In some cases however a write operation doesn't need change the L2
bitmap (because all affected subclusters were already allocated). This
is detected in calculate_l2_meta(), and qcow2_alloc_cluster_link_l2()
is never called in those cases.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <0875620d49f44320334b6a91c73b3f301f975f38.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The offset field of an uncompressed cluster's L2 entry must be aligned
to the cluster size, otherwise it is invalid. If the cluster has no
data then it means that the offset points to a preallocation, so we
can clear the offset field without affecting the guest-visible data.
This is what 'qemu-img check' does when run in repair mode.
On traditional qcow2 images this can only happen when QCOW_OFLAG_ZERO
is set, and repairing such entries turns the clusters from ZERO_ALLOC
into ZERO_PLAIN.
Extended L2 entries have no ZERO_ALLOC clusters and no QCOW_OFLAG_ZERO
but the idea is the same: if none of the subclusters are allocated
then we can clear the offset field and leave the bitmap untouched.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <9f4ed1d0a34b0a545b032c31ecd8c14734065342.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Two things need to be taken into account here:
1) With full_discard == true the L2 entry must be cleared completely.
This also includes the L2 bitmap if the image has extended L2
entries.
2) With full_discard == false we have to make the discarded cluster
read back as zeroes. With normal L2 entries this is done with the
QCOW_OFLAG_ZERO bit, whereas with extended L2 entries this is done
with the individual 'all zeroes' bits for each subcluster.
Note however that QCOW_OFLAG_ZERO is not supported in v2 qcow2
images so, if there is a backing file, discard cannot guarantee
that the image will read back as zeroes. If this is important for
the caller it should forbid it as qcow2_co_pdiscard() does (see
80f5c01183 for more details).
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <5ef8274e628aa3ab559bfac467abf488534f2b76.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The QCOW_OFLAG_ZERO bit that indicates that a cluster reads as
zeroes is only used in standard L2 entries. Extended L2 entries use
individual 'all zeroes' bits for each subcluster.
This must be taken into account when updating the L2 entry and also
when deciding that an existing entry does not need to be updated.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <b61d61606d8c9b367bd641ab37351ddb9172799a.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The logic of this function remains pretty much the same, except that
it uses count_contiguous_subclusters(), which combines the logic of
count_contiguous_clusters() / count_contiguous_clusters_unallocated()
and checks individual subclusters.
qcow2_cluster_to_subcluster_type() is not necessary as a separate
function anymore so it's inlined into its caller.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <d2193fd48653a350d80f0eca1c67b1d9053fb2f3.1594396418.git.berto@igalia.com>
[mreitz: Initialize expected_type to anything]
Signed-off-by: Max Reitz <mreitz@redhat.com>
If an image has subclusters then there are more copy-on-write
scenarios that we need to consider. Let's say we have a write request
from the middle of subcluster #3 until the end of the cluster:
1) If we are writing to a newly allocated cluster then we need
copy-on-write. The previous contents of subclusters #0 to #3 must
be copied to the new cluster. We can optimize this process by
skipping all leading unallocated or zero subclusters (the status of
those skipped subclusters will be reflected in the new L2 bitmap).
2) If we are overwriting an existing cluster:
2.1) If subcluster #3 is unallocated or has the all-zeroes bit set
then we need copy-on-write (on subcluster #3 only).
2.2) If subcluster #3 was already allocated then there is no need
for any copy-on-write. However we still need to update the L2
bitmap to reflect possible changes in the allocation status of
subclusters #4 to #31. Because of this, this function checks
if all the overwritten subclusters are already allocated and
in this case it returns without creating a new QCowL2Meta
structure.
After all these changes l2meta_cow_start() and l2meta_cow_end()
are not necessarily cluster-aligned anymore. We need to update the
calculation of old_start and old_end in handle_dependencies() to
guarantee that no two requests try to write on the same cluster.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <4292dd56e4446d386a2fe307311737a711c00708.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
When dealing with subcluster types there is a new value called
QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC that has no equivalent in
QCow2ClusterType.
This patch handles that value in all places where subcluster types
are processed.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <bf09e2e2439a468a901bb96ace411eed9ee50295.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
In order to support extended L2 entries some functions of the qcow2
driver need to start dealing with subclusters instead of clusters.
qcow2_get_host_offset() is modified to return the subcluster type
instead of the cluster type, and all callers are updated to replace
all values of QCow2ClusterType with their QCow2SubclusterType
equivalents.
This patch only changes the data types, there are no semantic changes.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <f6c29737c295f32cbee74c903c30b01820363b34.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This function returns an integer that can be either an error code or a
cluster type (a value from the QCow2ClusterType enum).
We are going to start using subcluster types instead of cluster types
in some functions so it's better to use the exact data types instead
of integers for clarity and in order to detect errors more easily.
This patch makes qcow2_get_host_offset() return 0 on success and
puts the returned cluster type in a separate parameter. There are no
semantic changes.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <396b6eab1859a271551dcd7dcba77f8934aa3c3f.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This helper function tells us if a cluster is allocated (that is,
there is an associated host offset for it).
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <6d8771c5c79cbdc6c519875a5078e1cc85856d63.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
There are situations in which we want to know how many contiguous
subclusters of the same type there are in a given cluster. This can be
done by simply iterating over the subclusters and repeatedly calling
qcow2_get_subcluster_type() for each one of them.
However once we determined the type of a subcluster we can check the
rest efficiently by counting the number of adjacent ones (or zeroes)
in the bitmap. This is what this function does.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <db917263d568ec6ffb4a41cac3c9100f96bf6c18.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This patch adds QCow2SubclusterType, which is the subcluster-level
version of QCow2ClusterType. All QCOW2_SUBCLUSTER_* values have the
the same meaning as their QCOW2_CLUSTER_* equivalents (when they
exist). See below for details and caveats.
In images without extended L2 entries clusters are treated as having
exactly one subcluster so it is possible to replace one data type with
the other while keeping the exact same semantics.
With extended L2 entries there are new possible values, and every
subcluster in the same cluster can obviously have a different
QCow2SubclusterType so functions need to be adapted to work on the
subcluster level.
There are several things that have to be taken into account:
a) QCOW2_SUBCLUSTER_COMPRESSED means that the whole cluster is
compressed. We do not support compression at the subcluster
level.
b) There are two different values for unallocated subclusters:
QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN which means that the whole
cluster is unallocated, and QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC
which means that the cluster is allocated but the subcluster is
not. The latter can only happen in images with extended L2
entries.
c) QCOW2_SUBCLUSTER_INVALID is used to detect the cases where an L2
entry has a value that violates the specification. The caller is
responsible for handling these situations.
To prevent compatibility problems with images that have invalid
values but are currently being read by QEMU without causing side
effects, QCOW2_SUBCLUSTER_INVALID is only returned for images
with extended L2 entries.
qcow2_cluster_to_subcluster_type() is added as a separate function
from qcow2_get_subcluster_type(), but this is only temporary and both
will be merged in a subsequent patch.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <26ef38e270f25851c98b51278852b4c4a7f97e69.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Extended L2 entries are 128-bit wide: 64 bits for the entry itself and
64 bits for the subcluster allocation bitmap.
In order to support them correctly get/set_l2_entry() need to be
updated so they take the entry width into account in order to
calculate the correct offset.
This patch also adds the get/set_l2_bitmap() functions that are
used to access the bitmaps. For convenience we allow calling
get_l2_bitmap() on images without subclusters. In this case the
returned value is always 0 and has no meaning.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <6ee0f81ae3329c991de125618b3675e1e46acdbb.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
qcow2 images with subclusters have 128-bit L2 entries. The first 64
bits contain the same information as traditional images and the last
64 bits form a bitmap with the status of each individual subcluster.
Because of that we cannot assume that L2 entries are sizeof(uint64_t)
anymore. This function returns the proper value for the image.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <d34d578bd0380e739e2dde3e8dd6187d3d249fa9.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Like offset_into_cluster() and size_to_clusters(), but for
subclusters.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <3cc2390dcdef3d234d47c741b708bd8734490862.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
For a given offset, return the subcluster number within its cluster
(i.e. with 32 subclusters per cluster it returns a number between 0
and 31).
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <56e3e4ac0d827c6a2f5f259106c5ddb7c4ca2653.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This patch adds the following new fields to BDRVQcow2State:
- subclusters_per_cluster: Number of subclusters in a cluster
- subcluster_size: The size of each subcluster, in bytes
- subcluster_bits: No. of bits so 1 << subcluster_bits = subcluster_size
Images without subclusters are treated as if they had exactly one
subcluster per cluster (i.e. subcluster_size = cluster_size).
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <55bfeac86b092fa2c9d182a95cbeb479ff7eca4f.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This function will be used by the qcow2 code to check if an image has
subclusters or not.
At the moment this simply returns false. Once all patches needed for
subcluster support are ready then QEMU will be able to create and
read images with subclusters and this function will return the actual
value.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <905526221083581a1b7057bca1585487661c5c13.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The size of an L2 entry is 64 bits, but if we want to have subclusters
we need extended L2 entries. This means that we have to access L2
tables and slices differently depending on whether an image has
extended L2 entries or not.
This patch replaces all l2_slice[] accesses with calls to
get_l2_entry() and set_l2_entry().
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <9586363531fec125ba1386e561762d3e4224e9fc.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
When writing to a qcow2 file there are two functions that take a
virtual offset and return a host offset, possibly allocating new
clusters if necessary:
- handle_copied() looks for normal data clusters that are already
allocated and have a reference count of 1. In those clusters we
can simply write the data and there is no need to perform any
copy-on-write.
- handle_alloc() looks for clusters that do need copy-on-write,
either because they haven't been allocated yet, because their
reference count is != 1 or because they are ZERO_ALLOC clusters.
The ZERO_ALLOC case is a bit special because those are clusters that
are already allocated and they could perfectly be dealt with in
handle_copied() (as long as copy-on-write is performed when required).
In fact, there is extra code specifically for them in handle_alloc()
that tries to reuse the existing allocation if possible and frees them
otherwise.
This patch changes the handling of ZERO_ALLOC clusters so the
semantics of these two functions are now like this:
- handle_copied() looks for clusters that are already allocated and
which we can overwrite (NORMAL and ZERO_ALLOC clusters with a
reference count of 1).
- handle_alloc() looks for clusters for which we need a new
allocation (all other cases).
One important difference after this change is that clusters found
in handle_copied() may now require copy-on-write, but this will be
necessary anyway once we add support for subclusters.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <eb17fc938f6be7be2e8d8ff42763d2c19241f866.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We are going to need it in other places.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <65e5d9627ca2ebe7e62deaeddf60949c33067d9d.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
handle_alloc() creates a QCowL2Meta structure in order to update the
image metadata and perform the necessary copy-on-write operations.
This patch moves that code to a separate function so it can be used
from other places.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <e5bc4a648dac31972bfa7a0e554be8064be78799.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
qcow2_get_cluster_offset() takes an (unaligned) guest offset and
returns the (aligned) offset of the corresponding cluster in the qcow2
image.
In practice none of the callers need to know where the cluster starts
so this patch makes the function calculate and return the final host
offset directly. The function is also renamed accordingly.
There is a pre-existing exception with compressed clusters: in this
case the function returns the complete cluster descriptor (containing
the offset and size of the compressed data). This does not change with
this patch but it is now documented.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <ffae6cdc5ca8950e8280ac0f696dcc376cb07095.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The file_cluster_offset field of Qcow2AioTask stores a cluster-aligned
host offset. In practice this is not very useful because all users(*)
of this structure need the final host offset into the cluster, which
they calculate using
host_offset = file_cluster_offset + offset_into_cluster(s, offset)
There is no reason why Qcow2AioTask cannot store host_offset directly
and that is what this patch does.
(*) compressed clusters are the exception: in this case what
file_cluster_offset was storing was the full compressed cluster
descriptor (offset + size). This does not change with this patch
but it is documented now.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <07c4b15c644dcf06c9459f98846ac1c4ea96e26f.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Meson doesn't enjoy the same flexibility we have with Make in choosing
the include path. In particular the tracing headers are using
$(build_root)/$(<D).
In order to keep the include directives unchanged,
the simplest solution is to generate headers with patterns like
"trace/trace-audio.h" and place forwarding headers in the source tree
such that for example "audio/trace.h" includes "trace/trace-audio.h".
This patch is too ugly to be applied to the Makefiles now. It's only
a way to separate the changes to the tracing header files from the
Meson rewrite of the tracing logic.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since commit 42ac214406 (block/block-copy: refactor task creation)
block_copy_task_create calculates the area to be copied via
bdrv_dirty_bitmap_next_dirty_area, but that can return an unaligned byte
count if the image's last cluster end is not aligned to the bitmap's
granularity.
Always ALIGN_UP the resulting bytes value to satisfy block_copy_do_copy,
which requires the 'bytes' parameter to be aligned to cluster size.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
Message-Id: <20200810095523.15071-1-s.reiter@proxmox.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
When calculating the offset, the result of left shift operation will be promoted
to type int64 automatically because the left operand of + operator is uint64_t.
but the result after integer promotion may be produce an error value for us and
trigger the following asserting error.
For example, consider i=0x2000, cluster_bits=18, the result of left shift
operation will be 0x80000000. Cause argument i is of signed integer type,
the result is automatically promoted to 0xffffffff80000000 which is not
we expected
The way to trigger the assertion error:
qemu-img create -f qcow2 -o preallocation=full,cluster_size=256k tmpdisk 10G
This patch fix it by casting @i to uint64_t before doing left shift operation
Signed-off-by: Guoyi Tu <tu.guoyi@h3c.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 81ba90fe0c014f269621c283269b42ad@h3c.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
During migration, we release all bitmaps after storing them on disk, as
long as they are (1) stored on disk, (2) not read-only, and (3)
consistent.
(2) seems arbitrary, though. The reason we do not release them is
because we do not write them, as there is no need to; and then we just
forget about all bitmaps that we have not written to the file. However,
read-only persistent bitmaps are still in the file and in sync with
their in-memory representation, so we may as well release them just like
any R/W bitmap that we have updated.
It leads to actual problems, too: After migration, letting the source
continue may result in an error if there were any bitmaps on read-only
nodes (such as backing images), because those have not been released by
bdrv_inactive_all(), but bdrv_invalidate_cache_all() attempts to reload
them (which fails, because they are still present in memory).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200730120234.49288-2-mreitz@redhat.com>
Tested-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We try to go to wakeable sleep, so that, if drain begins it will break
the sleep. But what if nbd_client_co_drain_begin() already called and
s->drained is already true? We'll go to sleep, and drain will have to
wait for the whole timeout. Let's improve it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200727184751.15704-5-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
On shutdown nbd driver may be in a connecting state. We should shutdown
it as well, otherwise we may hang in
nbd_teardown_connection, waiting for conneciton_co to finish in
BDRV_POLL_WHILE(bs, s->connection_co) loop if remote server is down.
How to reproduce the dead lock:
1. Create nbd-fault-injector.conf with the following contents:
[inject-error "mega1"]
event=data
io=readwrite
when=before
2. In one terminal run nbd-fault-injector in a loop, like this:
n=1; while true; do
echo $n; ((n++));
./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
done
3. In another terminal run qemu-io in a loop, like this:
n=1; while true; do
echo $n; ((n++));
./qemu-io -c 'read 0 512' nbd://127.0.0.1:10000;
done
After some time, qemu-io will hang. Note, that this hang may be
triggered by another bug, so the whole case is fixed only together with
commit "block/nbd: allow drain during reconnect attempt".
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200727184751.15704-4-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
It should be safe to reenter qio_channel_yield() on io/channel read/write
path, so it's safe to reduce in_flight and allow attaching new aio
context. And no problem to allow drain itself: connection attempt is
not a guest request. Moreover, if remote server is down, we can hang
in negotiation, blocking drain section and provoking a dead lock.
How to reproduce the dead lock:
1. Create nbd-fault-injector.conf with the following contents:
[inject-error "mega1"]
event=data
io=readwrite
when=before
2. In one terminal run nbd-fault-injector in a loop, like this:
n=1; while true; do
echo $n; ((n++));
./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
done
3. In another terminal run qemu-io in a loop, like this:
n=1; while true; do
echo $n; ((n++));
./qemu-io -c 'read 0 512' nbd://127.0.0.1:10000;
done
After some time, qemu-io will hang trying to drain, for example, like
this:
#3 aio_poll (ctx=0x55f006bdd890, blocking=true) at
util/aio-posix.c:600
#4 bdrv_do_drained_begin (bs=0x55f006bea710, recursive=false,
parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:427
#5 bdrv_drained_begin (bs=0x55f006bea710) at block/io.c:433
#6 blk_drain (blk=0x55f006befc80) at block/block-backend.c:1710
#7 blk_unref (blk=0x55f006befc80) at block/block-backend.c:498
#8 bdrv_open_inherit (filename=0x7fffba1563bc
"nbd+tcp://127.0.0.1:10000", reference=0x0, options=0x55f006be86d0,
flags=24578, parent=0x0, child_class=0x0, child_role=0,
errp=0x7fffba154620) at block.c:3491
#9 bdrv_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
block.c:3513
#10 blk_new_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
block/block-backend.c:421
And connection_co stack like this:
#0 qemu_coroutine_switch (from_=0x55f006bf2650, to_=0x7fe96e07d918,
action=COROUTINE_YIELD) at util/coroutine-ucontext.c:302
#1 qemu_coroutine_yield () at util/qemu-coroutine.c:193
#2 qio_channel_yield (ioc=0x55f006bb3c20, condition=G_IO_IN) at
io/channel.c:472
#3 qio_channel_readv_all_eof (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
niov=1, errp=0x7fe96d729eb0) at io/channel.c:110
#4 qio_channel_readv_all (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
niov=1, errp=0x7fe96d729eb0) at io/channel.c:143
#5 qio_channel_read_all (ioc=0x55f006bb3c20, buf=0x7fe96d729d28
"\300.\366\004\360U", buflen=8, errp=0x7fe96d729eb0) at
io/channel.c:247
#6 nbd_read (ioc=0x55f006bb3c20, buffer=0x7fe96d729d28, size=8,
desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
/work/src/qemu/master/include/block/nbd.h:365
#7 nbd_read64 (ioc=0x55f006bb3c20, val=0x7fe96d729d28,
desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
/work/src/qemu/master/include/block/nbd.h:391
#8 nbd_start_negotiate (aio_context=0x55f006bdd890,
ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
outioc=0x55f006bf19f8, structured_reply=true,
zeroes=0x7fe96d729dca, errp=0x7fe96d729eb0) at nbd/client.c:904
#9 nbd_receive_negotiate (aio_context=0x55f006bdd890,
ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
outioc=0x55f006bf19f8, info=0x55f006bf1a00, errp=0x7fe96d729eb0) at
nbd/client.c:1032
#10 nbd_client_connect (bs=0x55f006bea710, errp=0x7fe96d729eb0) at
block/nbd.c:1460
#11 nbd_reconnect_attempt (s=0x55f006bf19f0) at block/nbd.c:287
#12 nbd_co_reconnect_loop (s=0x55f006bf19f0) at block/nbd.c:309
#13 nbd_connection_entry (opaque=0x55f006bf19f0) at block/nbd.c:360
#14 coroutine_trampoline (i0=113190480, i1=22000) at
util/coroutine-ucontext.c:173
Note, that the hang may be
triggered by another bug, so the whole case is fixed only together with
commit "block/nbd: on shutdown terminate connection attempt".
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200727184751.15704-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We are going to implement non-blocking version of
nbd_establish_connection, which for a while will be used only for
nbd_reconnect_attempt, not for nbd_open, so we need to call it
separately.
Refactor nbd_reconnect_attempt in a way which makes next commit
simpler.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200727184751.15704-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
When converting to qcow2 compressed format, the last step is a special
zero length compressed write, ending in a call to bdrv_co_truncate(). This
call always fails for the nbd driver since it does not implement
bdrv_co_truncate().
For block devices, which have the same limits, the call succeeds since
the file driver implements bdrv_co_truncate(). If the caller asked to
truncate to the same or smaller size with exact=false, the truncate
succeeds. Implement the same logic for nbd.
Example failing without this change:
In one shell start qemu-nbd:
$ truncate -s 1g test.tar
$ qemu-nbd --socket=/tmp/nbd.sock --persistent --format=raw --offset 1536 test.tar
In another shell convert an image to qcow2 compressed via NBD:
$ echo "disk data" > disk.raw
$ truncate -s 1g disk.raw
$ qemu-img convert -f raw -O qcow2 -c disk1.raw nbd+unix:///?socket=/tmp/nbd.sock; echo $?
1
qemu-img failed, but the conversion was successful:
$ qemu-img info nbd+unix:///?socket=/tmp/nbd.sock
image: nbd+unix://?socket=/tmp/nbd.sock
file format: qcow2
virtual size: 1 GiB (1073741824 bytes)
...
$ qemu-img check nbd+unix:///?socket=/tmp/nbd.sock
No errors were found on the image.
1/16384 = 0.01% allocated, 100.00% fragmented, 100.00% compressed clusters
Image end offset: 393216
$ qemu-img compare disk.raw nbd+unix:///?socket=/tmp/nbd.sock
Images are identical.
Fixes: https://bugzilla.redhat.com/1860627
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Message-Id: <20200727215846.395443-2-nsoffer@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[eblake: typo fixes]
Signed-off-by: Eric Blake <eblake@redhat.com>
- Improve handling of various post-copy bitmap migration scenarios. A lost
bitmap should merely mean that the next backup must be full rather than
incremental, rather than abruptly breaking the entire guest migration.
- Associated iotest improvements
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEccLMIrHEYCkn0vOqp6FrSiUnQ2oFAl8fPRkACgkQp6FrSiUn
Q2qanQf/dRTrqZ7/hs8aENySf44o0dBzOLZr+FBcrqEj2sd0c6jPzV2X5CVtnA1v
gBgKJJGLpti3mSeNQDbaXZIQrsesBAuxvJsc6vZ9npDCdMYnK/qPE3Zfw1bx12qR
cb39ba28P4izgs216h92ZACtUewnvjkxyJgN7zfmCJdNcwZINMUItAS183tSbQjn
n39Wb7a+umsRgV9HQv/6cXlQIPqFMyAOl5kkzV3evuw7EBoHFnNq4cjPrUnjkqiD
xf2pcSomaedYd37SpvoH57JxfL3z/90OBcuXhFvbqFk4FgQ63rJ32nRve2ZbIDI0
XPbohnYjYoFv6Xs/jtTzctZCbZ+jTg==
=1dmz
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/ericb/tags/pull-bitmaps-2020-07-27' into staging
bitmaps patches for 2020-07-27
- Improve handling of various post-copy bitmap migration scenarios. A lost
bitmap should merely mean that the next backup must be full rather than
incremental, rather than abruptly breaking the entire guest migration.
- Associated iotest improvements
# gpg: Signature made Mon 27 Jul 2020 21:46:17 BST
# gpg: using RSA key 71C2CC22B1C4602927D2F3AAA7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg: aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg: aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2 F3AA A7A1 6B4A 2527 436A
* remotes/ericb/tags/pull-bitmaps-2020-07-27: (24 commits)
migration: Fix typos in bitmap migration comments
iotests: Adjust which migration tests are quick
qemu-iotests/199: add source-killed case to bitmaps postcopy
qemu-iotests/199: add early shutdown case to bitmaps postcopy
qemu-iotests/199: check persistent bitmaps
qemu-iotests/199: prepare for new test-cases addition
migration/savevm: don't worry if bitmap migration postcopy failed
migration/block-dirty-bitmap: cancel migration on shutdown
migration/block-dirty-bitmap: relax error handling in incoming part
migration/block-dirty-bitmap: keep bitmap state for all bitmaps
migration/block-dirty-bitmap: simplify dirty_bitmap_load_complete
migration/block-dirty-bitmap: rename finish_lock to just lock
migration/block-dirty-bitmap: refactor state global variables
migration/block-dirty-bitmap: move mutex init to dirty_bitmap_mig_init
migration/block-dirty-bitmap: rename dirty_bitmap_mig_cleanup
migration/block-dirty-bitmap: rename state structure types
migration/block-dirty-bitmap: fix dirty_bitmap_mig_before_vm_start
qemu-iotests/199: increase postcopy period
qemu-iotests/199: change discard patterns
qemu-iotests/199: improve performance: set bitmap by discard
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Since these functions take a @qiov_offset, they must always take it into
account when working with @qiov. There are a couple of places where
they do not, but they should.
Fixes: 65cd4424b9
("block/io: bdrv_aligned_preadv: use and support qiov_offset")
Fixes: 28c4da2869
("block/io: bdrv_aligned_pwritev: use and support qiov_offset")
Reported-by: Claudio Fontana <cfontana@suse.de>
Reported-by: Bruce Rogers <brogers@suse.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200728120806.265916-2-mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Tested-by: Claudio Fontana <cfontana@suse.de>
Tested-by: Bruce Rogers <brogers@suse.com>
Make the capitalization of the hexadecimal numbers consistent for the
QCOW2 header extension constants in docs/interop/qcow2.txt.
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <1594973699-781898-2-git-send-email-andrey.shinkevich@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We should check whether the user-specified node-name actually refers to
a node. The simplest way to do that is to use bdrv_lookup_bs() instead
of bdrv_find_node() (the former wraps the latter, and produces an error
message if necessary).
Reported-by: Coverity (CID 1430268)
Fixes: ced914d0ab
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200710095037.10885-1-mreitz@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
qcow2 version 2 images don't support the zero flag for clusters, so for
write_zeroes requests, we return -ENOTSUP and get explicit zero buffer
writes. If the image doesn't have a backing file, we can do better: Just
discard the respective clusters.
This is relevant for 'qemu-img convert -O qcow2 -n', where qemu-img has
to assume that the existing target image may contain any data, so it has
to write zeroes. Without this patch, this results in a fully allocated
target image, even if the source image was empty.
Reported-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200721135520.72355-2-kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The `detect-zeroes=unmap` option may issue unaligned
`FALLOC_FL_PUNCH_HOLE` requests, raw block devices can (and will) return
`EINVAL`, qemu should then write the zeroes to the blockdev instead of
issuing an `IO_ERROR`.
The problem can be reprodced like this:
$ qemu-io -c 'write -P 0 42 1234' --image-opts driver=host_device,filename=/dev/loop0,detect-zeroes=unmap
write failed: Invalid argument
Signed-off-by: Antoine Damhet <antoine.damhet@blade-group.com>
Message-Id: <20200717135603.51180-1-antoine.damhet@blade-group.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
object_get_canonical_path_component() returns a malloced copy of a
property name on success, null on failure.
19 of its 25 callers immediately free the returned copy.
Change object_get_canonical_path_component() to return the property
name directly. Since modifying the name would be wrong, adjust the
return type to const char *.
Drop the free from the 19 callers become simpler, add the g_strdup()
to the other six.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20200714160202.3121879-4-armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Li Qiang <liq3ea@gmail.com>
bdrv_aio_cancel() calls aio_poll() on the AioContext for the given I/O
request until it has completed. ENOMEDIUM requests are special because
there is no BlockDriverState when the drive has no medium!
Define a .get_aio_context() function for BlkAioEmAIOCB requests so that
bdrv_aio_cancel() can find the AioContext where the completion BH is
pending. Without this function bdrv_aio_cancel() aborts on ENOMEDIUM
requests!
libFuzzer triggered the following assertion:
cat << EOF | qemu-system-i386 -M pc-q35-5.0 \
-nographic -monitor none -serial none \
-qtest stdio -trace ide\*
outl 0xcf8 0x8000fa24
outl 0xcfc 0xe106c000
outl 0xcf8 0x8000fa04
outw 0xcfc 0x7
outl 0xcf8 0x8000fb20
write 0x0 0x3 0x2780e7
write 0xe106c22c 0xd 0x1130c218021130c218021130c2
write 0xe106c218 0x15 0x110010110010110010110010110010110010110010
EOF
ide_exec_cmd IDE exec cmd: bus 0x56170a77a2b8; state 0x56170a77a340; cmd 0xe7
ide_reset IDEstate 0x56170a77a340
Aborted (core dumped)
(gdb) bt
#1 0x00007ffff4f93895 in abort () at /lib64/libc.so.6
#2 0x0000555555dc6c00 in bdrv_aio_cancel (acb=0x555556765550) at block/io.c:2745
#3 0x0000555555dac202 in blk_aio_cancel (acb=0x555556765550) at block/block-backend.c:1546
#4 0x0000555555b1bd74 in ide_reset (s=0x555557213340) at hw/ide/core.c:1318
#5 0x0000555555b1e3a1 in ide_bus_reset (bus=0x5555572132b8) at hw/ide/core.c:2422
#6 0x0000555555b2aa27 in ahci_reset_port (s=0x55555720eb50, port=2) at hw/ide/ahci.c:650
#7 0x0000555555b29fd7 in ahci_port_write (s=0x55555720eb50, port=2, offset=44, val=16) at hw/ide/ahci.c:360
#8 0x0000555555b2a564 in ahci_mem_write (opaque=0x55555720eb50, addr=556, val=16, size=1) at hw/ide/ahci.c:513
#9 0x000055555598415b in memory_region_write_accessor (mr=0x55555720eb80, addr=556, value=0x7fffffffb838, size=1, shift=0, mask=255, attrs=...) at softmmu/memory.c:483
Looking at bdrv_aio_cancel:
2728 /* async I/Os */
2729
2730 void bdrv_aio_cancel(BlockAIOCB *acb)
2731 {
2732 qemu_aio_ref(acb);
2733 bdrv_aio_cancel_async(acb);
2734 while (acb->refcnt > 1) {
2735 if (acb->aiocb_info->get_aio_context) {
2736 aio_poll(acb->aiocb_info->get_aio_context(acb), true);
2737 } else if (acb->bs) {
2738 /* qemu_aio_ref and qemu_aio_unref are not thread-safe, so
2739 * assert that we're not using an I/O thread. Thread-safe
2740 * code should use bdrv_aio_cancel_async exclusively.
2741 */
2742 assert(bdrv_get_aio_context(acb->bs) == qemu_get_aio_context());
2743 aio_poll(bdrv_get_aio_context(acb->bs), true);
2744 } else {
2745 abort(); <===============
2746 }
2747 }
2748 qemu_aio_unref(acb);
2749 }
Fixes: 02c50efe08 ("block: Add bdrv_aio_cancel_async")
Reported-by: Alexander Bulekov <alxndr@bu.edu>
Buglink: https://bugs.launchpad.net/qemu/+bug/1878255
Originally-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200720100141.129739-1-stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
My commit 'block/crypto: implement the encryption key management'
accidently allowed raw luks images to be shared between different
qemu processes without share-rw=on explicit override.
Fix that.
Fixes: bbfdae91fb ("block/crypto: implement the encryption key management")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1857490
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200719122059.59843-2-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200717105426.51134-4-kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For Linux block devices, being able to open the device read-write
doesn't necessarily mean that the device is actually writable (one
example is a read-only LV, as you get with lvchange -pr <device>). We
have check_hdev_writable() to check this condition and fail opening the
image read-write if it's not actually writable.
However, this check doesn't take auto-read-only into account, but
results in a hard failure instead of downgrading to read-only where
possible.
Fix this and do the writable check not based on BDRV_O_RDWR, but only
when this actually results in opening the file read-write. A second
check is inserted in raw_reconfigure_getfd() to have the same check when
dynamic auto-read-only upgrades an image file from read-only to
read-write.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200717105426.51134-3-kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We'll need to call it in raw_open_common(), so move the function to
avoid a forward declaration.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200717105426.51134-2-kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since commit a6b257a08e ('file-posix: Handle undetectable alignment'),
we assume that if we open a file with O_DIRECT and alignment probing
returns 1, we just couldn't find out the real alignment requirement
because some filesystems make the requirement only for allocated blocks.
In this case, a safe default of 4k is used.
This is too strict for NFS, which does actually allow byte-aligned
requests even with O_DIRECT. Because we can't distinguish both cases
with generic code, let's just look at the file system magic and disable
s->needs_alignment for NFS. This way, O_DIRECT can still be used on NFS
for images that are not aligned to 4k.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200716142601.111237-3-kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The vxhs code doesn't compile since v2.12.0. There's no point in fixing
and then adding CI for a config that our users have demonstrated that
they do not use; better to just remove it.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20200711065926.2204721-1-marcandre.lureau@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
- file-posix: Mitigate file fragmentation with extent size hints
- Tighten qemu-img rules on missing backing format
- qemu-img map: Don't limit block status request size
- Fix crash with virtio-scsi and iothreads
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAl8NsgMRHGt3b2xmQHJl
ZGhhdC5jb20ACgkQfwmycsiPL9Z0tA//eqauxD7cTEpwrtLNrRtpiBtMG64BBpxz
QfkURzB38bMVahHlwq3Gt7Zcov8V4V7vxK66h688Z/fhw3vmqIeVe8+P6+Y5s9FL
jil8lewHuLTa6xELeugoV7SZXH8AAh1W2fQmiR7EPiOmpSE0wf7C5IShVlX8A04E
r0n09+61qGjRIe1hNTwTtldqQEfx6UGnxQWcQb81JUPA1lZhX3cnPg/j94Bofr+m
v/DbVTfsmUtTMjc0PdU7n4DKTWu8OS5B/X0unF21rTtO//cYBrhAeY3ax2jbFBWi
CIZK8HLI5m9/HFyltql1LOsd+B5TtfnXMfSdvDh2jaVUlto7wTeTnWU1fv4wxUB5
hk7XgJo/y203ebFNHpTmW8tvLfGTP8uqCVfOEFxzjy+JHGrarlbWkwL2LMOFFAZ2
s2WcwlfqiYGFTG4+OFdhPf9qPWKSqMr+jTdZJTse64/c6+YXWHk+pP9lfYEUOgSi
OYwdQUY9uiZ1K13q5Tif2TbFvs+c118xdTgVhAV7VtfPnWc3c647dX7iaq8Szknc
IT93670Iqf/PzEj+L7XUbbLIIsAcmxD0sr7QAQEt7bfiYIDRIQLiVPyzXplETFg2
SEkvtqBovm84ct7pWQzqA6lFvr3oIFDNquR40XFGozHNnlBeNi5s7pXQnqUBLElr
wDDuEi+z5QM=
=DB0q
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches:
- file-posix: Mitigate file fragmentation with extent size hints
- Tighten qemu-img rules on missing backing format
- qemu-img map: Don't limit block status request size
- Fix crash with virtio-scsi and iothreads
# gpg: Signature made Tue 14 Jul 2020 14:24:19 BST
# gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg: issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream:
block: Avoid stale pointer dereference in blk_get_aio_context()
qemu-img: Deprecate use of -b without -F
block: Add support to warn on backing file change without format
iotests: Specify explicit backing format where sensible
qcow2: Deprecate use of qemu-img amend to change backing file
block: Error if backing file fails during creation without -u
qcow: Tolerate backing_fmt=
vmdk: Add trivial backing_fmt support
sheepdog: Add trivial backing_fmt support
block: Finish deprecation of 'qemu-img convert -n -o'
qemu-img: Flush stdout before before potential stderr messages
file-posix: Mitigate file fragmentation with extent size hints
iotests/059: Filter out disk size with more standard filter
qemu-img map: Don't limit block status request size
iotests: Simplify _filter_img_create() a bit
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
It is possible for blk_remove_bs() to race with blk_drain_all(), causing
the latter to dereference a stale blk->root pointer:
blk_remove_bs(blk)
bdrv_root_unref_child(blk->root)
child_bs = blk->root->bs
bdrv_detach_child(blk->root)
...
g_free(blk->root) <============== blk->root becomes stale
bdrv_unref(child_bs) <============ yield at some point
A blk_drain_all() can be triggered by some guest action in the
meantime, eg. on POWER, SLOF might disable bus mastering on
a virtio-scsi-pci device:
virtio_write_config()
virtio_pci_stop_ioeventfd()
virtio_bus_stop_ioeventfd()
virtio_scsi_dataplane_stop()
blk_drain_all()
blk_get_aio_context()
bs = blk->root ? blk->root->bs : NULL
^^^^^^^^^
stale
Then, depending on one's luck, QEMU either crashes with SEGV or
hits the assertion in blk_get_aio_context().
blk->root is set by blk_insert_bs() which calls bdrv_root_attach_child()
first. The blk_remove_bs() function should rollback the changes made
by blk_insert_bs() in the opposite order (or it should be documented
somewhere why this isn't the case). Clear blk->root before calling
bdrv_root_unref_child() in blk_remove_bs().
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <159430264541.389456.11925072456012783045.stgit@bahia.lan>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For now, this is a mechanical addition; all callers pass false. But
the next patch will use it to improve 'qemu-img rebase -u' when
selecting a backing file with no format.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Message-Id: <20200706203954.341758-10-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The use of 'qemu-img amend' to change qcow2 backing files is not
tested very well. In particular, our implementation has a bug where
if a new backing file is provided without a format, then the prior
format is blindly reused, even if this results in data corruption, but
this is not caught by iotests.
There are also situations where amending other options needs access to
the original backing file (for example, on a downgrade to a v2 image,
knowing whether a v3 zero cluster must be allocated or may be left
unallocated depends on knowing whether the backing file already reads
as zero), but the command line does not have a nice way to tell us
both the backing file to use for opening the image as well as the
backing file to install after the operation is complete.
Even if we do allow changing the backing file, it is redundant with
the existing ability to change backing files via 'qemu-img rebase -u'.
It is time to deprecate this support (leaving the existing behavior
intact, even if it is buggy), and at a point in the future, require
the use of only 'qemu-img rebase' for adjusting backing chain
relations, saving 'qemu-img amend' for changes unrelated to the
backing chain.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200706203954.341758-8-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow has no space in the metadata to store a backing format, and there
are existing qcow images backed both by raw or by other formats
(usually qcow) images, reliant on probing to tell the difference. On
the bright side, because we probe every time, raw files are marked as
probed and we thus forbid a commit action into the backing file where
guest-controlled contents could change the result of the probe next
time around (the iotest added here proves that).
Still, allowing the user to specify the backing format during
creation, even if we can't record it, is a good thing. This patch
blindly allows any value that resolves to a known driver, even if the
user's request is a mismatch from what probing finds; then the next
patch will further enhance things to verify that the user's request
matches what we actually probe. With this and the next patch in
place, we will finally be ready to deprecate the creation of images
where a backing format was not explicitly specified by the user.
Note that this is only for QemuOpts usage; there is no change to the
QAPI to allow a format through -blockdev.
Add a new iotest 301 just for qcow, to demonstrate the latest
behavior, and to make it easier to show the improvements made in the
next patch.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200706203954.341758-6-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vmdk already requires that if backing_file is present, that it be
another vmdk image (see vmdk_co_do_create). Meanwhile, we want to
move towards always being explicit about the backing format for other
drivers where it matters. So for convenience, make qemu-img create -F
vmdk work, while rejecting all other explicit formats (note that this
is only for QemuOpts usage; there is no change to the QAPI to allow a
format through -blockdev).
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200706203954.341758-5-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Sheepdog already requires that if backing_file is present, that it be
another sheepdog image (see sd_co_create). Meanwhile, we want to move
towards always being explicit about the backing format for other
drivers where it matters. So for convenience, make qemu-img create -F
sheepdog work, while rejecting all other explicit formats (note that
this is only for QemuOpts usage; there is no change to the QAPI to
allow a format through -blockdev).
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200706203954.341758-4-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Especially when O_DIRECT is used with image files so that the page cache
indirection can't cause a merge of allocating requests, the file will
fragment on the file system layer, with a potentially very small
fragment size (this depends on the requests the guest sent).
On Linux, fragmentation can be reduced by setting an extent size hint
when creating the file (at least on XFS, it can't be set any more after
the first extent has been allocated), basically giving raw files a
"cluster size" for allocation.
This adds a create option to set the extent size hint, and changes the
default from not setting a hint to setting it to 1 MB. The main reason
why qcow2 defaults to smaller cluster sizes is that COW becomes more
expensive, which is not an issue with raw files, so we can choose a
larger size. The tradeoff here is only potentially wasted disk space.
For qcow2 (or other image formats) over file-posix, the advantage should
even be greater because they grow sequentially without leaving holes, so
there won't be wasted space. Setting even larger extent size hints for
such images may make sense. This can be done with the new option, but
let's keep the default conservative for now.
The effect is very visible with a test that intentionally creates a
badly fragmented file with qemu-img bench (the time difference while
creating the file is already remarkable) and then looks at the number of
extents and the time a simple "qemu-img map" takes.
Without an extent size hint:
$ ./qemu-img create -f raw -o extent_size_hint=0 ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=0
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 25.848 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 19.616 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 2000000 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m1,279s
user 0m0,043s
sys 0m1,226s
With the new default extent size hint of 1 MB:
$ ./qemu-img create -f raw -o extent_size_hint=1M ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=1048576
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 11.833 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 10.155 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 178 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m0,061s
user 0m0,040s
sys 0m0,014s
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200707142329.48303-1-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When snprintf returns the same value as the buffer size, the final
byte was truncated to ensure a NUL terminator. Fortunately, such long
export names are unusual enough, with no real impact other than what
is displayed to the user.
Fixes: 5c86bdf120
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200622210355.414941-1-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
When an I/O request failed, now we only return correct
value on scsi check condition. We should also have a
default errno such as -EIO in other case.
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Message-Id: <20200701105444.3226-2-xieyongji@bytedance.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The handling of check condition was incorrect because
we would only do it after retries exceed maximum.
Fixes: 8c460269aa ("iscsi: base all handling of check condition on scsi_sense_to_errno")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Message-Id: <20200701105444.3226-1-xieyongji@bytedance.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If we want to check error after errp-function call, we need to
introduce local_err and then propagate it to errp. Instead, use
the ERRP_GUARD() macro, benefits are:
1. No need of explicit error_propagate call
2. No need of explicit local_err variable: use errp directly
3. ERRP_GUARD() leaves errp as is if it's not NULL or
&error_fatal, this means that we don't break error_abort
(we'll abort on error_set, not on error_propagate)
If we want to add some info to errp (by error_prepend() or
error_append_hint()), we must use the ERRP_GUARD() macro.
Otherwise, this info will not be added when errp == &error_fatal
(the program will exit prior to the error_append_hint() or
error_prepend() call). Fix several such cases, e.g. in nbd_read().
This commit is generated by command
sed -n '/^Network Block Device (NBD)$/,/^$/{s/^F: //p}' \
MAINTAINERS | \
xargs git ls-files | grep '\.[hc]$' | \
xargs spatch \
--sp-file scripts/coccinelle/errp-guard.cocci \
--macro-file scripts/cocci-macro-file.h \
--in-place --no-show-diff --max-width 80
Reported-by: Kevin Wolf <kwolf@redhat.com>
Reported-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Commit message tweaked]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20200707165037.1026246-8-armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[ERRP_AUTO_PROPAGATE() renamed to ERRP_GUARD(), and
auto-propagated-errp.cocci to errp-guard.cocci. Commit message
tweaked again.]
When migrate_add_blocker(blocker, &errp) is followed by
error_propagate(errp, err), we can often just as well do
migrate_add_blocker(..., errp).
Do that with this Coccinelle script:
@@
expression blocker, err, errp;
expression ret;
@@
- ret = migrate_add_blocker(blocker, &err);
- if (err) {
+ ret = migrate_add_blocker(blocker, errp);
+ if (ret < 0) {
... when != err;
- error_propagate(errp, err);
...
}
@@
expression blocker, err, errp;
@@
- migrate_add_blocker(blocker, &err);
- if (err) {
+ if (migrate_add_blocker(blocker, errp) < 0) {
... when != err;
- error_propagate(errp, err);
...
}
Double-check @err is not used afterwards. Dereferencing it would be
use after free, but checking whether it's null would be legitimate.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200707160613.848843-43-armbru@redhat.com>
Convert
visit_type_FOO(v, ..., &ptr, &err);
...
if (err) {
...
}
to
visit_type_FOO(v, ..., &ptr, errp);
...
if (!ptr) {
...
}
for functions that set @ptr to non-null / null on success / error.
Eliminate error_propagate() that are now unnecessary. Delete @err
that are now unused.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200707160613.848843-40-armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200707160613.848843-39-armbru@redhat.com>
When all we do with an Error we receive into a local variable is
propagating to somewhere else, we can just as well receive it there
right away, even when we need to keep error_propagate() for other
error paths.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200707160613.848843-38-armbru@redhat.com>
When all we do with an Error we receive into a local variable is
propagating to somewhere else, we can just as well receive it there
right away. The previous two commits did that for sufficiently simple
cases with Coccinelle. Do it for several more manually.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200707160613.848843-37-armbru@redhat.com>
When all we do with an Error we receive into a local variable is
propagating to somewhere else, we can just as well receive it there
right away. The previous commit did that with a Coccinelle script I
consider fairly trustworthy. This commit uses the same script with
the matching of return taken out, i.e. we convert
if (!foo(..., &err)) {
...
error_propagate(errp, err);
...
}
to
if (!foo(..., errp)) {
...
...
}
This is unsound: @err could still be read between afterwards. I don't
know how to express "no read of @err without an intervening write" in
Coccinelle. Instead, I manually double-checked for uses of @err.
Suboptimal line breaks tweaked manually. qdev_realize() simplified
further to placate scripts/checkpatch.pl.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200707160613.848843-36-armbru@redhat.com>
When all we do with an Error we receive into a local variable is
propagating to somewhere else, we can just as well receive it there
right away. Convert
if (!foo(..., &err)) {
...
error_propagate(errp, err);
...
return ...
}
to
if (!foo(..., errp)) {
...
...
return ...
}
where nothing else needs @err. Coccinelle script:
@rule1 forall@
identifier fun, err, errp, lbl;
expression list args, args2;
binary operator op;
constant c1, c2;
symbol false;
@@
if (
(
- fun(args, &err, args2)
+ fun(args, errp, args2)
|
- !fun(args, &err, args2)
+ !fun(args, errp, args2)
|
- fun(args, &err, args2) op c1
+ fun(args, errp, args2) op c1
)
)
{
... when != err
when != lbl:
when strict
- error_propagate(errp, err);
... when != err
(
return;
|
return c2;
|
return false;
)
}
@rule2 forall@
identifier fun, err, errp, lbl;
expression list args, args2;
expression var;
binary operator op;
constant c1, c2;
symbol false;
@@
- var = fun(args, &err, args2);
+ var = fun(args, errp, args2);
... when != err
if (
(
var
|
!var
|
var op c1
)
)
{
... when != err
when != lbl:
when strict
- error_propagate(errp, err);
... when != err
(
return;
|
return c2;
|
return false;
|
return var;
)
}
@depends on rule1 || rule2@
identifier err;
@@
- Error *err = NULL;
... when != err
Not exactly elegant, I'm afraid.
The "when != lbl:" is necessary to avoid transforming
if (fun(args, &err)) {
goto out
}
...
out:
error_propagate(errp, err);
even though other paths to label out still need the error_propagate().
For an actual example, see sclp_realize().
Without the "when strict", Coccinelle transforms vfio_msix_setup(),
incorrectly. I don't know what exactly "when strict" does, only that
it helps here.
The match of return is narrower than what I want, but I can't figure
out how to express "return where the operand doesn't use @err". For
an example where it's too narrow, see vfio_intx_enable().
Silently fails to convert hw/arm/armsse.c, because Coccinelle gets
confused by ARMSSE being used both as typedef and function-like macro
there. Converted manually.
Line breaks tidied up manually. One nested declaration of @local_err
deleted manually. Preexisting unwanted blank line dropped in
hw/riscv/sifive_e.c.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200707160613.848843-35-armbru@redhat.com>
Replace
error_setg(&err, ...);
error_propagate(errp, err);
by
error_setg(errp, ...);
Related pattern:
if (...) {
error_setg(&err, ...);
goto out;
}
...
out:
error_propagate(errp, err);
return;
When all paths to label out are that way, replace by
if (...) {
error_setg(errp, ...);
return;
}
and delete the label along with the error_propagate().
When we have at most one other path that actually needs to propagate,
and maybe one at the end that where propagation is unnecessary, e.g.
foo(..., &err);
if (err) {
goto out;
}
...
bar(..., &err);
out:
error_propagate(errp, err);
return;
move the error_propagate() to where it's needed, like
if (...) {
foo(..., &err);
error_propagate(errp, err);
return;
}
...
bar(..., errp);
return;
and transform the error_setg() as above.
In some places, the transformation results in obviously unnecessary
error_propagate(). The next few commits will eliminate them.
Bonus: the elimination of gotos will make later patches in this series
easier to review.
Candidates for conversion tracked down with this Coccinelle script:
@@
identifier err, errp;
expression list args;
@@
- error_setg(&err, args);
+ error_setg(errp, args);
... when != err
error_propagate(errp, err);
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200707160613.848843-34-armbru@redhat.com>
The previous commit used Coccinelle to convert from checking the Error
object to checking the return value. Convert a few more manually.
Also tweak control flow in places to conform to the conventional "if
error bail out" pattern.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200707160613.848843-20-armbru@redhat.com>
The previous commit enables conversion of
visit_foo(..., &err);
if (err) {
...
}
to
if (!visit_foo(..., errp)) {
...
}
for visitor functions that now return true / false on success / error.
Coccinelle script:
@@
identifier fun =~ "check_list|input_type_enum|lv_start_struct|lv_type_bool|lv_type_int64|lv_type_str|lv_type_uint64|output_type_enum|parse_type_bool|parse_type_int64|parse_type_null|parse_type_number|parse_type_size|parse_type_str|parse_type_uint64|print_type_bool|print_type_int64|print_type_null|print_type_number|print_type_size|print_type_str|print_type_uint64|qapi_clone_start_alternate|qapi_clone_start_list|qapi_clone_start_struct|qapi_clone_type_bool|qapi_clone_type_int64|qapi_clone_type_null|qapi_clone_type_number|qapi_clone_type_str|qapi_clone_type_uint64|qapi_dealloc_start_list|qapi_dealloc_start_struct|qapi_dealloc_type_anything|qapi_dealloc_type_bool|qapi_dealloc_type_int64|qapi_dealloc_type_null|qapi_dealloc_type_number|qapi_dealloc_type_str|qapi_dealloc_type_uint64|qobject_input_check_list|qobject_input_check_struct|qobject_input_start_alternate|qobject_input_start_list|qobject_input_start_struct|qobject_input_type_any|qobject_input_type_bool|qobject_input_type_bool_keyval|qobject_input_type_int64|qobject_input_type_int64_keyval|qobject_input_type_null|qobject_input_type_number|qobject_input_type_number_keyval|qobject_input_type_size_keyval|qobject_input_type_str|qobject_input_type_str_keyval|qobject_input_type_uint64|qobject_input_type_uint64_keyval|qobject_output_start_list|qobject_output_start_struct|qobject_output_type_any|qobject_output_type_bool|qobject_output_type_int64|qobject_output_type_null|qobject_output_type_number|qobject_output_type_str|qobject_output_type_uint64|start_list|visit_check_list|visit_check_struct|visit_start_alternate|visit_start_list|visit_start_struct|visit_type_.*";
expression list args;
typedef Error;
Error *err;
@@
- fun(args, &err);
- if (err)
+ if (!fun(args, &err))
{
...
}
A few line breaks tidied up manually.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200707160613.848843-19-armbru@redhat.com>
Convert uses like
opts = qemu_opts_create(..., &err);
if (err) {
...
}
to
opts = qemu_opts_create(..., errp);
if (!opts) {
...
}
Eliminate error_propagate() that are now unnecessary. Delete @err
that are now unused.
Note that we can't drop parallels_open()'s error_propagate() here. We
continue to execute it even in the converted case. It's a no-op then:
local_err is null.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Message-Id: <20200707160613.848843-8-armbru@redhat.com>
The other four drivers that support backing files (qcow, qcow2,
parallels, vmdk) all rely on the block layer to populate zeroes when
reading beyond EOF of a short backing file. We can simplify the qed
code by doing likewise.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200528094405.145708-11-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Currently this field only set by qed and qcow2. But in fact, all
backing-supporting formats (parallels, qcow, qcow2, qed, vmdk) share
these semantics: on unallocated blocks, if there is no backing file they
just memset the buffer with zeroes.
So, document this behavior for .supports_backing and drop
.unallocated_blocks_are_zero
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200528094405.145708-10-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
vhdx doesn't have .bdrv_co_block_status handler, so DATA|ALLOCATED is
always assumed for it in bdrv_co_block_status().
unallocated_blocks_are_zero is useless (it doesn't affect the only user
of the field: bdrv_co_block_status()), drop it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200528094405.145708-9-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
raw_co_block_status() in block/file-posix.c never returns 0, so
unallocated_blocks_are_zero is useless (it doesn't affect the only user
of the field: bdrv_co_block_status()). Drop it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200528094405.145708-8-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We set bdi->unallocated_blocks_are_zero = iscsilun->lbprz, but
iscsi_co_block_status doesn't return 0 in case of iscsilun->lbprz, it
returns ZERO when appropriate. So actually unallocated_blocks_are_zero
is useless (it doesn't affect the only user of the field:
bdrv_co_block_status()). Drop it now.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200528094405.145708-7-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
It's false by default, no needs to set it. We are going to drop this
variable at all, so drop it now here, it doesn't hurt.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200528094405.145708-6-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
In case when get_image_offset() returns -1, we do zero out the
corresponding chunk of qiov. So, this should be reported as ZERO.
Note that this changes visible output of "qemu-img map --output=json"
and "qemu-io -c map" commands. For qemu-img map, the change is obvious:
we just mark as zero what is really zero. For qemu-io it's less
obvious: what was unallocated now is allocated.
There is an inconsistency in understanding of unallocated regions in
Qemu: backing-supporting format-drivers return 0 block-status to report
go-to-backing logic for this area. Some protocol-drivers (iscsi) return
0 to report fs-unallocated-non-zero status (i.e., don't occupy space on
disk, read result is undefined).
BDRV_BLOCK_ALLOCATED is defined as something more close to
go-to-backing logic. Still it is calculated as ZERO | DATA, so 0 from
iscsi is treated as unallocated. It doesn't influence backing-chain
behavior, as iscsi can't have backing file. But it does influence
"qemu-io -c map".
We should solve this inconsistency at some future point. Now, let's
just make backing-not-supporting format drivers (vdi in the previous
patch and vpc now) to behave more like backing-supporting drivers
and not report 0 block-status. More over, returning ZERO status is
absolutely valid thing, and again, corresponds to how the other
format-drivers (backing-supporting) work.
After block-status update, it never reports 0, so setting
unallocated_blocks_are_zero doesn't make sense (as the only user of it
is bdrv_co_block_status and it checks unallocated_blocks_are_zero only
for unallocated areas). Drop it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200528094405.145708-5-vsementsov@virtuozzo.com>
[mreitz: qemu-io -c map as used by iotest 146 now reports everything as
allocated; in order to make the test do something useful, we
use qemu-img map --output=json now]
Signed-off-by: Max Reitz <mreitz@redhat.com>
In case of !VDI_IS_ALLOCATED[], we do zero out the corresponding chunk
of qiov. So, this should be reported as ZERO.
Note that this changes visible output of "qemu-img map --output=json"
and "qemu-io -c map" commands. For qemu-img map, the change is obvious:
we just mark as zero what is really zero. For qemu-io it's less
obvious: what was unallocated now is allocated.
There is an inconsistency in understanding of unallocated regions in
Qemu: backing-supporting format-drivers return 0 block-status to report
go-to-backing logic for this area. Some protocol-drivers (iscsi) return
0 to report fs-unallocated-non-zero status (i.e., don't occupy space on
disk, read result is undefined).
BDRV_BLOCK_ALLOCATED is defined as something more close to
go-to-backing logic. Still it is calculated as ZERO | DATA, so 0 from
iscsi is treated as unallocated. It doesn't influence backing-chain
behavior, as iscsi can't have backing file. But it does influence
"qemu-io -c map".
We should solve this inconsistency at some future point. Now, let's
just make backing-not-supporting format drivers (vdi at this patch and
vpc with the following) to behave more like backing-supporting drivers
and not report 0 block-status. More over, returning ZERO status is
absolutely valid thing, and again, corresponds to how the other
format-drivers (backing-supporting) work.
After block-status update, it never reports 0, so setting
unallocated_blocks_are_zero doesn't make sense (as the only user of it
is bdrv_co_block_status and it checks unallocated_blocks_are_zero only
for unallocated areas). Drop it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200528094405.145708-4-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The function has only one user: bdrv_co_block_status(). Inline it to
simplify reviewing of the following patches, which will finally drop
unallocated_blocks_are_zero field too.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200528094405.145708-3-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Currently the implementation only supports amending the encryption
options, unlike the qemu-img version
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200608094030.670121-14-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200608094030.670121-13-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
blockdev-amend will be used similiar to blockdev-create
to allow on the fly changes of the structure of the format based block devices.
Current plan is to first support encryption keyslot management for luks
based formats (raw and embedded in qcow2)
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20200608094030.670121-12-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Now that we have all the infrastructure in place,
wire it in the qcow2 driver and expose this to the user.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200608094030.670121-9-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This implements the encryption key management using the generic code in
qcrypto layer and exposes it to the user via qemu-img
This code adds another 'write_func' because the initialization
write_func works directly on the underlying file, and amend
works on instance of luks device.
This commit also adds a 'hack/workaround' I and Kevin Wolf (thanks)
made to make the driver both support write sharing (to avoid breaking the users),
and be safe against concurrent metadata update (the keyslots)
Eventually the write sharing for luks driver will be deprecated
and removed together with this hack.
The hack is that we ask (as a format driver) for BLK_PERM_CONSISTENT_READ
and then when we want to update the keys, we unshare that permission.
So if someone else has the image open, even readonly, encryption
key update will fail gracefully.
Also thanks to Daniel Berrange for the idea of
unsharing read, rather that write permission which allows
to avoid cases when the other user had opened the image read-only.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200608094030.670121-8-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
rename the write_func to create_write_func, and init_func to create_init_func.
This is preparation for other write_func that will be used to update the encryption keys.
No functional changes
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20200608094030.670121-7-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Some qcow2 create options can't be used for amend.
Remove them from the qcow2 create options and add generic logic to detect
such options in qemu-img
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
[mreitz: Dropped some iotests reference output hunks that became
unnecessary thanks to
"iotests: Make _filter_img_create more active"]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200625125548.870061-12-mreitz@redhat.com>
Some options are only useful for creation
(or hard to be amended, like cluster size for qcow2), while some other
options are only useful for amend, like upcoming keyslot management
options for luks
Since currently only qcow2 supports amend, move all its options
to a common macro and then include it in each action option list.
In future it might be useful to remove some options which are
not supported anyway from amend list, which currently
cause an error message if amended.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200608094030.670121-5-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
'force' option will be used for some unsafe amend operations.
This includes things like erasing last keyslot in luks based formats
(which destroys the data, unless the master key is backed up
by external means), but that _might_ be desired result.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200608094030.670121-4-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This will be used first to implement luks keyslot management.
block_crypto_amend_opts_init will be used to convert
qemu-img cmdline to QCryptoBlockAmendOptions
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20200608094030.670121-2-mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
When resizing an image with qcow2_co_truncate() using the falloc or
full preallocation modes the code assumes that both the old and new
sizes are cluster-aligned.
There are two problems with this:
1) The calculation of how many clusters are involved does not always
get the right result.
Example: creating a 60KB image and resizing it (with
preallocation=full) to 80KB won't allocate the second cluster.
2) No copy-on-write is performed, so in the previous example if
there is a backing file then the first 60KB of the first cluster
won't be filled with data from the backing file.
This patch fixes both issues.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20200617140036.20311-1-berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
ret may be > 0 on success path at this point. Fix assertion, which may
crash currently.
Fixes: 4ce5dd3e9b
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200526181347.489557-1-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
array_remove_slice() calls array_roll() with array->next - 1 as the
destination index. This is only correct for count == 1, otherwise we're
writing past the end of the array. array->next - count would be correct.
However, this is the only place ever calling array_roll(), so this
rather complicated operation isn't even necessary.
Fix the problem and simplify the code by replacing it with a single
memmove() call. array_roll() can now be removed.
Reported-by: Nathan Huckleberry <nhuck15@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200623175534.38286-3-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
FAT allows only a restricted set of characters in file names, and for
some of the illegal characters, it's actually important that we catch
them: If filenames can contain '/', the guest can construct filenames
containing "../" and escape from the assigned vvfat directory. The same
problem could arise if ".." was ever accepted as a literal filename.
Fix this by adding a check that all filenames are valid in
check_directory_consistency().
Reported-by: Nathan Huckleberry <nhuck15@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200623175534.38286-2-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QEMU block drivers are supposed to support aio_poll() from I/O
completion callback functions. This means completion processing must be
re-entrant.
The standard approach is to schedule a BH during completion processing
and cancel it at the end of processing. If aio_poll() is invoked by a
callback function then the BH will run. The BH continues the suspended
completion processing.
All of this means that request A's cb() can synchronously wait for
request B to complete. Previously the nvme block driver would hang
because it didn't process completions from nested aio_poll().
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Message-id: 20200617132201.1832152-8-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Passing around both BDRVNVMeState and NVMeQueuePair is unwieldy. Reduce
the number of function arguments by keeping the BDRVNVMeState pointer in
NVMeQueuePair. This will come in handly when a BH is introduced in a
later patch and only one argument can be passed to it.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20200617132201.1832152-7-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
There are three issues with the current NVMeRequest->busy field:
1. The busy field is accidentally accessed outside q->lock when request
submission fails.
2. Waiters on free_req_queue are not woken when a request is returned
early due to submission failure.
2. Finding a free request involves scanning all requests. This makes
request submission O(n^2).
Switch to an O(1) freelist that is always accessed under the lock.
Also differentiate between NVME_QUEUE_SIZE, the actual SQ/CQ size, and
NVME_NUM_REQS, the number of usable requests. This makes the code
simpler than using NVME_QUEUE_SIZE everywhere and having to keep in mind
that one slot is reserved.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Message-id: 20200617132201.1832152-5-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Do not access a CQE after incrementing q->cq.head and releasing q->lock.
It is unlikely that this causes problems in practice but it's a latent
bug.
The reason why it should be safe at the moment is that completion
processing is not re-entrant and the CQ doorbell isn't written until the
end of nvme_process_completion().
Make this change now because QEMU expects completion processing to be
re-entrant and later patches will do that.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20200617132201.1832152-4-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
A lot of CPU time is spent simply locking/unlocking q->lock during
polling. Check for completion outside the lock to make q->lock disappear
from the profile.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Message-id: 20200617132201.1832152-2-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
For now, we don't have persistent bitmaps in any other formats, but
that might not be true in the future. Make it obvious that our
incoming parameter is not necessarily a qcow2 image, and therefore is
limited to just the bdrv_dirty_bitmap_* API calls (rather than probing
into qcow2 internals).
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200608190821.3293867-1-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Rather than listing block/monitor from the top-level Makefile.objs, we
should instead list monitor from block/Makefile.objs.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Fixes: bb4e58c613
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200608173339.3244211-1-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 93676c88 relaxed our NBD client code to request export names up
to the NBD protocol maximum of 4096 bytes without NUL terminator, even
though the block layer can't store anything longer than 4096 bytes
including NUL terminator for display to the user. Since this means
there are some export names where we have to truncate things, we can
at least try to make the truncation a bit more obvious for the user.
Note that in spite of the truncated display name, we can still
communicate with an NBD server using such a long export name; this was
deemed nicer than refusing to even connect to such a server (since the
server may not be under our control, and since determining our actual
length limits gets tricky when nbd://host:port/export and
nbd+unix:///export?socket=/path are themselves variable-length
expansions beyond the export name but count towards the block layer
name length).
Reported-by: Xueqiang Wei <xuwei@redhat.com>
Fixes: https://bugzilla.redhat.com/1843684
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200610163741.3745251-3-eblake@redhat.com>
We have a few bdrv_*() functions that can either spawn a new coroutine
and wait for it with BDRV_POLL_WHILE() or use a fastpath if they are
alreeady running in a coroutine. All of them duplicate basically the
same code.
Factor the common code into a new function bdrv_run_co().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20200520144901.16589-1-vsementsov@virtuozzo.com
[Factor out bdrv_run_co_entry too]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In qemu_luring_poll_cb() we are not using the cqe peeked from the
CQ ring. We are using io_uring_peek_cqe() only to see if there
are cqes ready, so we can replace it with io_uring_cq_ready().
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20200519134942.118178-1-sgarzare@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
As recently documented [1], io_uring_enter(2) syscall can return an
error (errno=EINTR) if the operation was interrupted by a delivery
of a signal before it could complete.
This should happen when IORING_ENTER_GETEVENTS flag is used, for
example during io_uring_submit_and_wait() or during io_uring_submit()
when IORING_SETUP_IOPOLL is enabled.
We shouldn't have this problem for now, but it's better to prevent it.
[1] 344355ec66
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20200519133041.112138-1-sgarzare@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It's useful to know how much space can be occupied by qcow2 persistent
bitmaps, even though such metadata is unrelated to the guest-visible
data. Report this value as an additional QMP field, present when
measuring an existing image and output format that both support
bitmaps. Update iotest 178 and 190 to updated output, as well as new
coverage in 190 demonstrating non-zero values made possible with the
recently-added qemu-img bitmap command (see 3b51ab4b).
The new 'bitmaps size:' field is displayed automatically as part of
'qemu-img measure' any time it is present in QMP (that is, any time
both the source image being measured and destination format support
bitmaps, even if the measurement is 0 because there are no bitmaps
present). If the field is absent, it means that no bitmaps can be
copied (source, destination, or both lack bitmaps, including when
measuring based on size rather than on a source image). This behavior
is compatible with an upcoming patch adding 'qemu-img convert
--bitmaps': that command will fail in the same situations where this
patch omits the field.
The addition of a new field demonstrates why we should always
zero-initialize qapi C structs; while the qcow2 driver still fully
populates all fields, the raw and crypto drivers had to be tweaked to
avoid uninitialized data.
Consideration was also given towards having a 'qemu-img measure
--bitmaps' which errors out when bitmaps are not possible, and
otherwise sums the bitmaps into the existing allocation totals rather
than displaying as a separate field, as a potential convenience
factor. But this was ultimately decided to be more complexity than
necessary when the QMP interface was sufficient enough with bitmaps
remaining a separate field.
See also: https://bugzilla.redhat.com/1779904
Reported-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200521192137.1120211-3-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To be used for bitmap migration in further commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200521220648.3255-3-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Upcoming patches want to add some basic bitmap manipulation abilities
to qemu-img. But blockdev.o is too heavyweight to link into qemu-img
(among other things, it would drag in block jobs and transaction
support - qemu-img does offline manipulation, where atomicity is less
important because there are no concurrent modifications to compete
with), so it's time to split off the bare bones of what we will need
into a new file block/monitor/bitmap-qmp-cmds.o.
This is sufficient to expose 6 QMP commands for use by qemu-img (add,
remove, clear, enable, disable, merge), as well as move the three
helper functions touched in the previous patch. Regarding
MAINTAINERS, the new file is automatically part of block core, but
also makes sense as related to other dirty bitmap files.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513011648.166876-6-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Upcoming patches will enhance bitmap support in qemu-img, but in doing
so, it turns out to be nice to suppress output when persistent bitmaps
make no sense (such as on a qcow2 v2 image). Add a hook to make this
easier to query.
This patch adds a new callback .bdrv_supports_persistent_dirty_bitmap,
rather than trying to shoehorn the answer in via existing callbacks.
In particular, while it might have been possible to overload
.bdrv_co_can_store_new_dirty_bitmap to special-case a NULL input to
answer whether any persistent bitmaps are supported, that is at odds
with whether a particular bitmap can be stored (for example, even on
an image that supports persistent bitmaps but has currently filled up
the maximum number of bitmaps, attempts to store another one should
fail); and the new functionality doesn't require coroutine safety.
Similarly, we could have added one more piece of information to
.bdrv_get_info, but then again, most callers to that function tend to
already discard extraneous information, and making it a catch-all
rather than a series of dedicated scalar queries hasn't really
simplified life.
In the future, when we improve the ability to look up bitmaps through
a filter, we will probably also want to teach the block layer to
automatically let filters pass this request on through.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513011648.166876-4-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
block_copy_do_copy() is static, only used in block_copy_task_entry
with the error_is_read argument set. No need to check for it,
simplify.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200507121129.29760-3-philmd@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix when building with -Os:
CC block/block-copy.o
block/block-copy.c: In function ‘block_copy_task_entry’:
block/block-copy.c:428:38: error: ‘error_is_read’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
428 | t->call_state->error_is_read = error_is_read;
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200507121129.29760-2-philmd@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Implementations should decide the necessary permissions based on @role.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-35-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These calls have no real use for the child role yet, but it will not
harm to give one.
Notably, the bdrv_root_attach_child() call in blockjob.c is left
unmodified because there is not much the generic BlockJob object wants
from its children.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-34-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_default_perms() can decide which permission profile to use based on
the BdrvChildRole, so block drivers do not need to select it explicitly.
The blkverify driver now no longer shares the WRITE permission for the
image to verify. We thus have to adjust two places in
test-block-iothread not to take it. (Note that in theory, blkverify
should behave like quorum in this regard and share neither WRITE nor
RESIZE for both of its children. In practice, it does not really
matter, because blkverify is used only for debugging, so we might as
well keep its permissions rather liberal.)
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-30-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace child_file by child_of_bds in all remaining places (excluding
tests).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-28-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Note that some filters have secondary children, namely blkverify (the
image to be verified) and blklogwrites (the log). This patch does not
touch those children.
Note that for blkverify, the filtered child should not be format-probed.
While there is nothing enforcing this here, in practice, it will not be:
blkverify implements .bdrv_file_open. The block layer ensures (and in
fact, asserts) that BDRV_O_PROTOCOL is set for every BDS whose driver
implements .bdrv_file_open. This flag will then be bequeathed to
blkverify's children, and they will thus (by default) not be probed
either.
("By default" refers to the fact that blkverify's other child (the
non-filtered one) will have BDRV_O_PROTOCOL force-unset, because that is
what happens for all non-filtered children of non-format drivers.)
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-27-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commonly, they need to pass the BDRV_CHILD_IMAGE set as the
BdrvChildRole; but there are exceptions for drivers with external data
files (qcow2 and vmdk).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-26-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Make all parents of backing files pass the appropriate BdrvChildRole.
By doing so, we can switch their BdrvChildClass over to the generic
child_of_bds, which will do the right thing when given a correct
BdrvChildRole.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-24-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Both users (quorum and blkverify) use child_format for
not-really-filtered children, so the appropriate BdrvChildRole in both
cases is DATA. (Note that this will cause bdrv_inherited_options() to
force-allow format probing.)
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-22-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Split raw_read_options() into one function that actually just reads the
options, and another that applies them. This will allow us to detect
whether the user has specified any options before attaching the file
child (so we can decide on its role based on the options).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-21-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We plan to unify the generic .inherit_options() functions. The
resulting common function will need to decide whether to force-enable
format probing, force-disable it, or leave it as-is. To make this
decision, it will need to know whether the parent node is a format node
or not (because we never want format probing if the parent is a format
node already (except for the backing chain)).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-9-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For now, all callers (effectively) pass 0 and no callee evaluates thie
value. Later patches will change both.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-8-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For now, all callers pass 0 and no callee evaluates this value. Later
patches will change both.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-7-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For now, it is always set to 0. Later patches in this series will
ensure that all callers pass an appropriate combination of flags.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513110544.176672-6-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This structure nearly only contains parent callbacks for child state
changes. It cannot really reflect a child's role, because different
roles may overlap (as we will see when real roles are introduced), and
because parents can have custom callbacks even when the child fulfills a
standard role.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20200513110544.176672-4-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to unify child_format and child_file at some point. One of the
important things that set format drivers apart from other drivers is
that they do not expect other format nodes under them (except in the
backing chain), i.e. we must not probe formats inside of formats. That
means we need something on which to distinguish format drivers from
others, and hence this flag.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20200513110544.176672-3-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The commit, mirror, and blkreplay block nodes are filters, so they should
be marked as such.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200513110544.176672-2-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200429141126.85159-3-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is just a bandaid to keep tests/test-replication working after
bdrv_make_empty() starts to assert that we're not trying to call it on a
read-only child.
For the real solution in the future, replication should not steal the
BdrvChild from its backing file (this is never correct to do!), but
instead have its own child node references, with the appropriate
permissions.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_commit() already has a BlockBackend pointing to the BDS that we
want to empty, it just has the wrong permissions.
qemu-img commit has no BlockBackend pointing to the old backing file
yet, but introducing one is simple.
After this commit, bdrv_make_empty() is the only remaining caller of
BlockDriver.bdrv_make_empty().
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200429141126.85159-5-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[kwolf: Fixed up reference output for 098]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Two callers of BlockDriver.bdrv_make_empty() remain that should not call
this method directly. Both do not have access to a BdrvChild, but they
can use a BlockBackend, so we add this function that lets them use it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200429141126.85159-4-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If qemu in colo secondary mode is stopped, it crashes because
s->backup_job is canceled twice: First with job_cancel_sync_all()
in qemu_cleanup() and then in replication_stop().
Fix this by assigning NULL to s->backup_job when the job completes
so replication_stop() and replication_do_checkpoint() won't touch
the job.
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Message-Id: <20200511090801.7ed5d8f3@luklap>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the target is shorter than the source, mirror would copy data until
it reaches the end of the target and then fail with an I/O error when
trying to write past the end.
If the target is longer than the source, the mirror job would complete
successfully, but the target wouldn't actually be an accurate copy of
the source image (it would contain some additional garbage at the end).
Fix this by checking that both images have the same size when the job
starts.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200511135825.219437-4-kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The only way object_property_add() can fail is when a property with
the same name already exists. Since our property names are all
hardcoded, failure is a programming error, and the appropriate way to
handle it is passing &error_abort.
Same for its variants, except for object_property_add_child(), which
additionally fails when the child already has a parent. Parentage is
also under program control, so this is a programming error, too.
We have a bit over 500 callers. Almost half of them pass
&error_abort, slightly fewer ignore errors, one test case handles
errors, and the remaining few callers pass them to their own callers.
The previous few commits demonstrated once again that ignoring
programming errors is a bad idea.
Of the few ones that pass on errors, several violate the Error API.
The Error ** argument must be NULL, &error_abort, &error_fatal, or a
pointer to a variable containing NULL. Passing an argument of the
latter kind twice without clearing it in between is wrong: if the
first call sets an error, it no longer points to NULL for the second
call. ich9_pm_add_properties(), sparc32_ledma_realize(),
sparc32_dma_realize(), xilinx_axidma_realize(), xilinx_enet_realize()
are wrong that way.
When the one appropriate choice of argument is &error_abort, letting
users pick the argument is a bad idea.
Drop parameter @errp and assert the preconditions instead.
There's one exception to "duplicate property name is a programming
error": the way object_property_add() implements the magic (and
undocumented) "automatic arrayification". Don't drop @errp there.
Instead, rename object_property_add() to object_property_try_add(),
and add the obvious wrapper object_property_add().
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200505152926.18877-15-armbru@redhat.com>
[Two semantic rebase conflicts resolved]
Obviously, we should g_free the task after trace point and offset
update.
Reported-by: Coverity (CID 1428756)
Fixes: 4ce5dd3e9b
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200507183800.22626-1-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
zstd significantly reduces cluster compression time.
It provides better compression performance maintaining
the same level of the compression ratio in comparison with
zlib, which, at the moment, is the only compression
method available.
The performance test results:
Test compresses and decompresses qemu qcow2 image with just
installed rhel-7.6 guest.
Image cluster size: 64K. Image on disk size: 2.2G
The test was conducted with brd disk to reduce the influence
of disk subsystem to the test results.
The results is given in seconds.
compress cmd:
time ./qemu-img convert -O qcow2 -c -o compression_type=[zlib|zstd]
src.img [zlib|zstd]_compressed.img
decompress cmd
time ./qemu-img convert -O qcow2
[zlib|zstd]_compressed.img uncompressed.img
compression decompression
zlib zstd zlib zstd
------------------------------------------------------------
real 65.5 16.3 (-75 %) 1.9 1.6 (-16 %)
user 65.0 15.8 5.3 2.5
sys 3.3 0.2 2.0 2.0
Both ZLIB and ZSTD gave the same compression ratio: 1.57
compressed image size in both cases: 1.4G
Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
QAPI part:
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20200507082521.29210-4-dplotnikov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The patch enables processing the image compression type defined
for the image and chooses an appropriate method for image clusters
(de)compression.
Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200507082521.29210-3-dplotnikov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The patch adds some preparation parts for incompatible compression type
feature to qcow2 allowing the use different compression methods for
image clusters (de)compressing.
It is implied that the compression type is set on the image creation and
can be changed only later by image conversion, thus compression type
defines the only compression algorithm used for the image, and thus,
for all image clusters.
The goal of the feature is to add support of other compression methods
to qcow2. For example, ZSTD which is more effective on compression than ZLIB.
The default compression is ZLIB. Images created with ZLIB compression type
are backward compatible with older qemu versions.
Adding of the compression type breaks a number of tests because now the
compression type is reported on image creation and there are some changes
in the qcow2 header in size and offsets.
The tests are fixed in the following ways:
* filter out compression_type for many tests
* fix header size, feature table size and backing file offset
affected tests: 031, 036, 061, 080
header_size +=8: 1 byte compression type
7 bytes padding
feature_table += 48: incompatible feature compression type
backing_file_offset += 56 (8 + 48 -> header_change + feature_table_change)
* add "compression type" for test output matching when it isn't filtered
affected tests: 049, 060, 061, 065, 082, 085, 144, 182, 185, 198, 206,
242, 255, 274, 280
Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
QAPI part:
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20200507082521.29210-2-dplotnikov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Now that there are no clients of bdrv_has_zero_init_truncate, none of
the drivers need to worry about providing it.
What's more, this eliminates a source of some confusion: a literal
reading of the documentation as written in ceaca56f and implemented in
commit 1dcaf527 claims that a driver which returns 0 for
bdrv_has_zero_init_truncate() must not return 1 for
bdrv_has_zero_init(); this condition was violated for parallels, qcow,
and sometimes for vdi, although in practice it did not matter since
those drivers also lacked .bdrv_co_truncate.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200428202905.770727-10-eblake@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The vhdx driver uses truncation for image growth, with a special case
for blocks that already read as zero but which are only being
partially written. But with a bit of rearranging, it's just as easy
to defer the decision on whether truncation resulted in zeroes to the
actual allocation attempt, reducing the number of places that still
use bdrv_has_zero_init_truncate.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200428202905.770727-9-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The parallels driver tries to use truncation for image growth, but can
only do so when reads are guaranteed as zero. Now that we have a way
to request zero contents from truncation, we can defer the decision to
actual allocation attempts rather than up front, reducing the number
of places that still use bdrv_has_zero_init_truncate.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200428202905.770727-8-eblake@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Our .bdrv_has_zero_init_truncate can detect when the remote side
always zero fills; we can reuse that same knowledge to implement
BDRV_REQ_ZERO_WRITE by ignoring it when the server gives it to us for
free.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200428202905.770727-7-eblake@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Our .bdrv_has_zero_init_truncate always returns 1 because sheepdog
always 0-fills; we can use that same knowledge to implement
BDRV_REQ_ZERO_WRITE by ignoring it.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200428202905.770727-6-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Our .bdrv_has_zero_init_truncate always returns 1 because rbd always
0-fills; we can use that same knowledge to implement
BDRV_REQ_ZERO_WRITE by ignoring it.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200428202905.770727-5-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Our .bdrv_has_zero_init_truncate returns 1 if we detect that the OS
always 0-fills; we can use that same knowledge to implement
BDRV_REQ_ZERO_WRITE by ignoring it when the OS gives it to us for
free.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200428202905.770727-4-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When using bdrv_file, .bdrv_has_zero_init_truncate always returns 1;
therefore, we can behave just like file-posix, and always implement
BDRV_REQ_ZERO_WRITE by ignoring it since the OS gives it to us for
free (note that file-posix.c had to use an 'if' because it shared code
between regular files and block devices, but in file-win32.c,
bdrv_host_device uses a separate .bdrv_file_open).
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200428202905.770727-3-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block.c already defaults to 0 if we don't provide a callback; there's
no need to write a callback that always fails.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20200428202905.770727-2-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Calling bdrv_getlength() to get the pre-truncate file size will not
really work on block devices, because they have always the same length,
and trying to write beyond it will fail with a rather cryptic error
message.
Instead, we should use qcow2_get_last_cluster() and bdrv_getlength()
only as a fallback.
Before this patch:
$ truncate -s 1G test.img
$ sudo losetup -f --show test.img
/dev/loop0
$ sudo qemu-img create -f qcow2 -o preallocation=full /dev/loop0 64M
Formatting '/dev/loop0', fmt=qcow2 size=67108864 cluster_size=65536
preallocation=full lazy_refcounts=off refcount_bits=16
qemu-img: /dev/loop0: Could not resize image: Failed to resize refcount
structures: No space left on device
With this patch:
$ sudo qemu-img create -f qcow2 -o preallocation=full /dev/loop0 64M
Formatting '/dev/loop0', fmt=qcow2 size=67108864 cluster_size=65536
preallocation=full lazy_refcounts=off refcount_bits=16
qemu-img: /dev/loop0: Could not resize image: Failed to resize
underlying file: Preallocation mode 'full' unsupported for this
non-regular file
So as you can see, it still fails, but now the problem is missing
support on the block device level, so we at least get a better error
message.
Note that we cannot preallocate block devices on truncate by design,
because we do not know what area to preallocate. Their length is always
the same, the truncate operation does not change it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200505141801.1096763-1-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since the introduction of a backup filter node in commit 00e30f05d, the
backup block job crashes when the target image is smaller than the
source image because it will try to write after the end of the target
node without having BLK_PERM_RESIZE. (Previously, the BlockBackend layer
would have caught this and errored out gracefully.)
We can fix this and even do better than the old behaviour: Check that
source and target have the same image size at the start of the block job
and unshare BLK_PERM_RESIZE. (This permission was already unshared
before the same commit 00e30f05d, but the BlockBackend that was used to
make the restriction was removed without a replacement.) This will
immediately error out when starting the job instead of only when writing
to a block that doesn't exist in the target.
Longer target than source would technically work because we would never
write to blocks that don't exist, but semantically these are invalid,
too, because a backup is supposed to create a copy, not just an image
that starts with a copy.
Fixes: 00e30f05de
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1778593
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200430142755.315494-4-kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_get_device_name() will be an empty string with modern management
tools that don't use -drive. Use bdrv_get_device_or_node_name() instead
so that the node name is used if the BlockBackend is anonymous.
While at it, start with upper case to make the message consistent with
the rest of the function.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20200430142755.315494-3-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If we have a backup L2 table, we currently flush once after writing to
the active L2 table and again after writing to the backup table. A
single flush is enough and makes things a little less slow.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200430133007.170335-6-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If a cluster is already zeroed, we don't have to call vmdk_L2update(),
which is rather slow because it flushes the image file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200430133007.170335-5-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When overwriting a zero cluster, we must not perform copy-on-write from
the backing file, but from a zeroed buffer.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200430133007.170335-4-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
m_data must contain valid data even for zero clusters when no cluster
was allocated in the image file. Without this, zero writes segfault with
images that have zeroed_grain=on.
For zero writes, we don't want to allocate a cluster in the image file
even in compressed files.
Fixes: 524089bce4
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200430133007.170335-3-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
m_data is used for zero clusters even though valid == 0. It really only
means that a new cluster was allocated in the image file. Rename it to
reflect this.
While at it, change it from int to bool, too.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200430133007.170335-2-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After commit f01643fb8b when an image is
extended and BDRV_REQ_ZERO_WRITE is set then the new clusters are
zeroized.
The code however does not detect correctly situations when the old and
the new end of the image are within the same cluster. The problem can
be reproduced with these steps:
qemu-img create -f qcow2 backing.qcow2 1M
qemu-img create -f qcow2 -F qcow2 -b backing.qcow2 top.qcow2
qemu-img resize --shrink top.qcow2 520k
qemu-img resize top.qcow2 567k
In the last step offset - zero_start causes an integer wraparound.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20200504155217.10325-1-berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently if you attampt to create too large file with luks you
get the following error message:
Formatting 'test.luks', fmt=luks size=17592186044416 key-secret=sec0
qemu-img: test.luks: Could not resize file: File too large
While for raw format the error message is
qemu-img: test.img: The image size is too large for file format 'raw'
The reason for this is that qemu-img checks for errono of the failure,
and presents the later error when it is -EFBIG
However crypto generic code 'swallows' the errno and replaces it
with -EIO.
As an attempt to make it better, we can make luks driver,
detect -EFBIG and in this case present a better error message,
which is what this patch does
The new error message is:
qemu-img: error creating test.luks: The requested file size is too large
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534898
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
- reduce client-side fragmentation of NBD trim and status requests
- fix iotest 41 when run in deep tree
- fix socket activation in qemu-nbd
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEccLMIrHEYCkn0vOqp6FrSiUnQ2oFAl6whTUACgkQp6FrSiUn
Q2r4nAf7BtGSFMkUu6nWYeq+Ggg+Xwmz2FLAzWTK/rccGDC44c9ETzOIbWEddo6X
FHpU07VXdLW1h2M7ox8lQVo0DZEFxTRBYTPtUtjB7izfkAs4CkYeElJsZAPAZKgU
GsKqa3RM6uXubsQaXXXjMFCGlYgqi1dVkmkgtPebt7evSe0ATlTfYfd0y9gb5f9C
cbHD3CVcGKQe4ZtNcSBpTzOvXJSrBZznyCyhBO2qmVXTynt/5Ygog+Ulq3DHZsPX
UkRkTPohKA0BhXuS7wD49danlzCLiTlvswr62fAncM1+AJTbmIa+apy3SwiOkwMh
Aawq5vDtaFV+HEBKbMC0QRhgtoEe1w==
=ExlI
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2020-05-04' into staging
nbd patches for 2020-05-04
- reduce client-side fragmentation of NBD trim and status requests
- fix iotest 41 when run in deep tree
- fix socket activation in qemu-nbd
# gpg: Signature made Mon 04 May 2020 22:12:21 BST
# gpg: using RSA key 71C2CC22B1C4602927D2F3AAA7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg: aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg: aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2 F3AA A7A1 6B4A 2527 436A
* remotes/ericb/tags/pull-nbd-2020-05-04:
block/nbd-client: drop max_block restriction from discard
block/nbd-client: drop max_block restriction from block_status
iotests/041: Fix NBD socket path
tools: Fix use of fcntl(F_SETFD) during socket activation
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Silent static analyzer warning
Remove dead assignments
Support -chardev serial on macOS
Update MAINTAINERS
Some cosmetic changes
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEzS913cjjpNwuT1Fz8ww4vT8vvjwFAl6wOI4SHGxhdXJlbnRA
dml2aWVyLmV1AAoJEPMMOL0/L748p7UQAIFSNN0FrDV+K7i8qqq0X+JrS+dNOHNm
DSpOf8IaGm/BezzL6XirXBVpFxg9iB5DQVLsjP1kUggO7rbBO0blx5H5eOPhnXZj
xg60kLN16ty7NZ/WPS1G9jF4nDsjz0ZUtCXb0OXsuGJIOrsmN2r/lxdJwcjHZaqJ
RzbcCSFXlvL0g7mOakJinMJH5r/nWCiUoEYsikhP10DcvuSBoCnjr+LYV6Ef02G0
Y5lgKN2G0EAMgWTJaL3gIF27zS8QLDNll+eO+PIU5K4yo75/wRCKr4e3PpErZlf6
B+hCAAPnXCpDKw+8sK2z+9OZXUGe1hQ8LHNgNNM921C66f+vLLXpIDTAECihM4K4
0wThYlFDwT4j+PMHFNlzIobGMtb33ui8m40lepMt/YOVFqY4tr8u3MLhHkVDo2+8
sNuOOWLXAoFOYyRqgTeVJvZvMUFQqtDiftghw1BR55TyIpDWjvLYRqae5CI+MGXs
6YylZVHGzVjMVptxvivvIQ735Nq8LaKq7N8Cb7uvcbRaCki39BsxXVPZx4p6NdwN
dMndUOz/y75dNlRMDjK8l/oRFPJa/p1Yz8mZhl0uVOO6JeJhBwYmk+WkQ7g/GHZb
Rx15HnVWRu6C/Icbw4kqZYyqrgl5lykS8aAWURePdpjzKY77rY1H71FesMhjifRN
ZGgfUdWI88M4
=ibgH
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/vivier2/tags/trivial-branch-for-5.1-pull-request' into staging
trivial patches (20200504)
Silent static analyzer warning
Remove dead assignments
Support -chardev serial on macOS
Update MAINTAINERS
Some cosmetic changes
# gpg: Signature made Mon 04 May 2020 16:45:18 BST
# gpg: using RSA key CD2F75DDC8E3A4DC2E4F5173F30C38BD3F2FBE3C
# gpg: issuer "laurent@vivier.eu"
# gpg: Good signature from "Laurent Vivier <lvivier@redhat.com>" [full]
# gpg: aka "Laurent Vivier <laurent@vivier.eu>" [full]
# gpg: aka "Laurent Vivier (Red Hat) <lvivier@redhat.com>" [full]
# Primary key fingerprint: CD2F 75DD C8E3 A4DC 2E4F 5173 F30C 38BD 3F2F BE3C
* remotes/vivier2/tags/trivial-branch-for-5.1-pull-request:
hw/timer/pxa2xx_timer: Add assertion to silent static analyzer warning
hw/timer/stm32f2xx_timer: Remove dead assignment
hw/gpio/aspeed_gpio: Remove dead assignment
hw/isa/i82378: Remove dead assignment
hw/ide/sii3112: Remove dead assignment
hw/input/adb-kbd: Remove dead assignment
hw/i2c/pm_smbus: Remove dead assignment
blockdev: Remove dead assignment
block: Avoid dead assignment
Compress lines for immediate return
chardev: Add macOS to list of OSes that support -chardev serial
MAINTAINERS: Update Keith Busch's email address
elf_ops: Don't try to g_mapped_file_unref(NULL)
hw/mem/pc-dimm: Fix line over 80 characters warning
hw/mem/pc-dimm: Print slot number on error at pc_dimm_pre_plug()
MAINTAINERS: Mark the LatticeMico32 target as orphan
timer/exynos4210_mct: Remove redundant statement in exynos4210_mct_write()
display/blizzard: use extract16() for fix clang analyzer warning in blizzard_draw_line16_32()
scsi/esp-pci: add g_assert() for fix clang analyzer warning in esp_pci_io_write()
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Run block_copy iterations in parallel in aio tasks.
Changes:
- BlockCopyTask becomes aio task structure. Add zeroes field to pass
it to block_copy_do_copy
- add call state - it's a state of one call of block_copy(), shared
between parallel tasks. For now used only to keep information about
first error: is it read or not.
- convert block_copy_dirty_clusters to aio-task loop.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200429130847.28124-6-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Instead of just relying on the comment "Called only on full-dirty
region" in block_copy_task_create() let's move initial dirty area
search directly to block_copy_task_create(). Let's also use effective
bdrv_dirty_bitmap_next_dirty_area instead of looping through all
non-dirty clusters.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200429130847.28124-5-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We are going to use aio-task-pool API, so we'll need state pointer in
BlockCopyTask anyway. Add it now and use where possible.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200429130847.28124-4-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We are going to use aio-task-pool API, so tasks will be handled in
parallel. We need therefore separate allocated task on each iteration.
Introduce this logic now.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200429130847.28124-3-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We are going to use aio-task-pool API and extend in-flight request
structure to be a successor of AioTask, so rename things appropriately.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200429130847.28124-2-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
It's been a while since we got rid of the sector-based bdrv_read and
bdrv_write (commit 2e11d756); let's finish the job on a few remaining
comments.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200428213807.776655-1-eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Our comment did not actually match the code. Rewrite the comment to
be less sensitive to any future changes to qcow2-bitmap.c that might
implement scenarios that we currently reject.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200428192648.749066-4-eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We originally refused to allow resize of images with internal
snapshots because the v2 image format did not require the tracking of
snapshot size, making it impossible to safely revert to a snapshot
with a different size than the current view of the image. But the
snapshot size tracking was rectified in v3, and our recent fixes to
qemu-img amend (see 0a85af35) guarantee that we always have a valid
snapshot size. Thus, we no longer need to artificially limit image
resizes, but it does become one more thing that would prevent a
downgrade back to v2. And now that we support different-sized
snapshots, it's also easy to fix reverting to a snapshot to apply the
new size.
Upgrade iotest 61 to cover this (we previously had NO coverage of
refusal to resize while snapshots exist). Note that the amend process
can fail but still have effects: in particular, since we break things
into upgrade, resize, downgrade, a failure during resize does not roll
back changes made during upgrade, nor does failure in downgrade roll
back a resize. But this situation is pre-existing even without this
patch; and without journaling, the best we could do is minimize the
chance of partial failure by collecting all changes prior to doing any
writes - which adds a lot of complexity but could still fail with EIO.
On the other hand, we are careful that even if we have partial
modification but then fail, the image is left viable (that is, we are
careful to sequence things so that after each successful cluster
write, there may be transient leaked clusters but no corrupt
metadata). And complicating the code to make it more transaction-like
is not worth the effort: a user can always request multiple 'qemu-img
amend' changing one thing each, if they need finer-grained control
over detecting the first failure than what they get by letting qemu
decide how to sequence multiple changes.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200428192648.749066-3-eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
There are several callers that need to create a new block backend from
an existing BDS; make the task slightly easier with a common helper
routine.
Suggested-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200424190903.522087-2-eblake@redhat.com>
[mreitz: Set @ret only in error paths, see
https://lists.nongnu.org/archive/html/qemu-block/2020-04/msg01216.html]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200428192648.749066-2-eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The NBD spec was updated (see nbd.git commit 9f30fedb) so that
max_block doesn't relate to NBD_CMD_TRIM. So, drop the restriction.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200401150112.9557-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: tweak commit message to call out NBD commit]
Signed-off-by: Eric Blake <eblake@redhat.com>
The NBD spec was updated (see nbd.git commit 9f30fedb) so that
max_block doesn't relate to NBD_CMD_BLOCK_STATUS. So, drop the
restriction.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200401150112.9557-2-vsementsov@virtuozzo.com>
[eblake: tweak commit message to call out NBD commit]
Signed-off-by: Eric Blake <eblake@redhat.com>
- ran regexp "qemu_mutex_lock\(.*\).*\n.*if" to find targets
- replaced result with QEMU_LOCK_GUARD if all unlocks at function end
- replaced result with WITH_QEMU_LOCK_GUARD if unlock not at end
Signed-off-by: Daniel Brodsky <dnbrdsky@gmail.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-id: 20200404042108.389635-3-dnbrdsky@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Compress two lines into a single line if immediate return statement is found.
It also remove variables progress, val, data, ret and sock
as they are no longer needed.
Remove space between function "mixer_load" and '(' to fix the
checkpatch.pl error:-
ERROR: space prohibited between function name and open parenthesis '('
Done using following coccinelle script:
@@
local idexpression ret;
expression e;
@@
-ret =
+return
e;
-return ret;
Signed-off-by: Simran Singhal <singhalsimran0@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200401165314.GA3213@simran-Inspiron-5558>
[lv: in handle_aiocb_write_zeroes_unmap() move "int ret" inside the #ifdef]
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
The BDRV_REQ_ZERO_WRITE is currently implemented in a way that first the
image is possibly preallocated and then the zero flag is added to all
clusters. This means that a copy-on-write operation may be needed when
writing to these clusters, despite having used preallocation, negating
one of the major benefits of preallocation.
Instead, try to forward the BDRV_REQ_ZERO_WRITE to the protocol driver,
and if the protocol driver can ensure that the new area reads as zeros,
we can skip setting the zero flag in the qcow2 layer.
Unfortunately, the same approach doesn't work for metadata
preallocation, so we'll still set the zero flag there.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200424142701.67053-1-kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When extending the size of an image that has a backing file larger than
its old size, make sure that the backing file data doesn't become
visible in the guest, but the added area is properly zeroed out.
Consider the following scenario where the overlay is shorter than its
backing file:
base.qcow2: AAAAAAAA
overlay.qcow2: BBBB
When resizing (extending) overlay.qcow2, the new blocks should not stay
unallocated and make the additional As from base.qcow2 visible like
before this patch, but zeros should be read.
A similar case happens with the various variants of a commit job when an
intermediate file is short (- for unallocated):
base.qcow2: A-A-AAAA
mid.qcow2: BB-B
top.qcow2: C--C--C-
After commit top.qcow2 to mid.qcow2, the following happens:
mid.qcow2: CB-C00C0 (correct result)
mid.qcow2: CB-C--C- (before this fix)
Without the fix, blocks that previously read as zeros on top.qcow2
suddenly turn into A.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200424125448.63318-8-kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For regular files, we always get BDRV_REQ_ZERO_WRITE behaviour from the
OS, so we can advertise the flag and just ignore it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200424125448.63318-7-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The raw format driver can simply forward the flag and let its bs->file
child take care of actually providing the zeros.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200424125448.63318-6-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If BDRV_REQ_ZERO_WRITE is set and we're extending the image, calling
qcow2_cluster_zeroize() with flags=0 does the right thing: It doesn't
undo any previous preallocation, but just adds the zero flag to all
relevant L2 entries. If an external data file is in use, a write_zeroes
request to the data file is made instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200424125448.63318-5-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that node level interface bdrv_truncate() supports passing request
flags to the block driver, expose this on the BlockBackend level, too.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200424125448.63318-4-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that block drivers can support flags for .bdrv_co_truncate, expose
the parameter in the node level interfaces bdrv_co_truncate() and
bdrv_truncate().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200424125448.63318-3-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds a new BdrvRequestFlags parameter to the .bdrv_co_truncate()
driver callbacks, and a supported_truncate_flags field in
BlockDriverState that allows drivers to advertise support for request
flags in the context of truncate.
For now, we always pass 0 and no drivers declare support for any flag.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200424125448.63318-2-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The previous few commits have made this more obvious, and removed the
one exception. Time to clarify the documentation, and drop dead error
checking.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200424084338.26803-13-armbru@redhat.com>
The Error ** argument must be NULL, &error_abort, &error_fatal, or a
pointer to a variable containing NULL. Passing an argument of the
latter kind twice without clearing it in between is wrong: if the
first call sets an error, it no longer points to NULL for the second
call.
check_cache_dropped() calls error_setg() in a loop. It fails to break
the loop in one instance. If a subsequent iteration error_setg()s
again, it trips error_setv()'s assertion.
Fix it to break the loop.
Fixes: 31be8a2a97
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200422130719.28225-3-armbru@redhat.com>
Fixes the following coccinelle warnings:
$ spatch --sp-file --verbose-parsing ... \
scripts/coccinelle/remove_local_err.cocci
...
SUSPICIOUS: a \ character appears outside of a #define at ./target/ppc/translate_init.inc.c:5213
SUSPICIOUS: a \ character appears outside of a #define at ./target/ppc/translate_init.inc.c:5261
SUSPICIOUS: a \ character appears outside of a #define at ./target/microblaze/cpu.c:166
SUSPICIOUS: a \ character appears outside of a #define at ./target/microblaze/cpu.c:167
SUSPICIOUS: a \ character appears outside of a #define at ./target/microblaze/cpu.c:169
SUSPICIOUS: a \ character appears outside of a #define at ./target/microblaze/cpu.c:170
SUSPICIOUS: a \ character appears outside of a #define at ./target/microblaze/cpu.c:171
SUSPICIOUS: a \ character appears outside of a #define at ./target/microblaze/cpu.c:172
SUSPICIOUS: a \ character appears outside of a #define at ./target/microblaze/cpu.c:173
SUSPICIOUS: a \ character appears outside of a #define at ./target/i386/cpu.c:5787
SUSPICIOUS: a \ character appears outside of a #define at ./target/i386/cpu.c:5789
SUSPICIOUS: a \ character appears outside of a #define at ./target/i386/cpu.c:5800
SUSPICIOUS: a \ character appears outside of a #define at ./target/i386/cpu.c:5801
SUSPICIOUS: a \ character appears outside of a #define at ./target/i386/cpu.c:5802
SUSPICIOUS: a \ character appears outside of a #define at ./target/i386/cpu.c:5804
SUSPICIOUS: a \ character appears outside of a #define at ./target/i386/cpu.c:5805
SUSPICIOUS: a \ character appears outside of a #define at ./target/i386/cpu.c:5806
SUSPICIOUS: a \ character appears outside of a #define at ./target/i386/cpu.c:6329
SUSPICIOUS: a \ character appears outside of a #define at ./hw/sd/sdhci.c:1133
SUSPICIOUS: a \ character appears outside of a #define at ./hw/scsi/scsi-disk.c:3081
SUSPICIOUS: a \ character appears outside of a #define at ./hw/net/virtio-net.c:1529
SUSPICIOUS: a \ character appears outside of a #define at ./hw/riscv/sifive_u.c:468
SUSPICIOUS: a \ character appears outside of a #define at ./dump/dump.c:1895
SUSPICIOUS: a \ character appears outside of a #define at ./block/vhdx.c:2209
SUSPICIOUS: a \ character appears outside of a #define at ./block/vhdx.c:2215
SUSPICIOUS: a \ character appears outside of a #define at ./block/vhdx.c:2221
SUSPICIOUS: a \ character appears outside of a #define at ./block/vhdx.c:2222
SUSPICIOUS: a \ character appears outside of a #define at ./block/replication.c:172
SUSPICIOUS: a \ character appears outside of a #define at ./block/replication.c:173
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20200412223619.11284-2-f4bug@amsat.org>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
There is an overflow, the source 'datain.data[2]' is 100 bytes,
but the 'ss' is 252 bytes.This may cause a security issue because
we can access a lot of unrelated memory data.
The len for sbp copy data should take the minimum of mx_sb_len and
sb_len_wr, not the maximum.
If we use iscsi device for VM backend storage, ASAN show stack:
READ of size 252 at 0xfffd149dcfc4 thread T0
#0 0xaaad433d0d34 in __asan_memcpy (aarch64-softmmu/qemu-system-aarch64+0x2cb0d34)
#1 0xaaad45f9d6d0 in iscsi_aio_ioctl_cb /qemu/block/iscsi.c:996:9
#2 0xfffd1af0e2dc (/usr/lib64/iscsi/libiscsi.so.8+0xe2dc)
#3 0xfffd1af0d174 (/usr/lib64/iscsi/libiscsi.so.8+0xd174)
#4 0xfffd1af19fac (/usr/lib64/iscsi/libiscsi.so.8+0x19fac)
#5 0xaaad45f9acc8 in iscsi_process_read /qemu/block/iscsi.c:403:5
#6 0xaaad4623733c in aio_dispatch_handler /qemu/util/aio-posix.c:467:9
#7 0xaaad4622f350 in aio_dispatch_handlers /qemu/util/aio-posix.c:510:20
#8 0xaaad4622f350 in aio_dispatch /qemu/util/aio-posix.c:520
#9 0xaaad46215944 in aio_ctx_dispatch /qemu/util/async.c:298:5
#10 0xfffd1bed12f4 in g_main_context_dispatch (/lib64/libglib-2.0.so.0+0x512f4)
#11 0xaaad46227de0 in glib_pollfds_poll /qemu/util/main-loop.c:219:9
#12 0xaaad46227de0 in os_host_main_loop_wait /qemu/util/main-loop.c:242
#13 0xaaad46227de0 in main_loop_wait /qemu/util/main-loop.c:518
#14 0xaaad43d9d60c in qemu_main_loop /qemu/softmmu/vl.c:1662:9
#15 0xaaad4607a5b0 in main /qemu/softmmu/main.c:49:5
#16 0xfffd1a460b9c in __libc_start_main (/lib64/libc.so.6+0x20b9c)
#17 0xaaad43320740 in _start (aarch64-softmmu/qemu-system-aarch64+0x2c00740)
0xfffd149dcfc4 is located 0 bytes to the right of 100-byte region [0xfffd149dcf60,0xfffd149dcfc4)
allocated by thread T0 here:
#0 0xaaad433d1e70 in __interceptor_malloc (aarch64-softmmu/qemu-system-aarch64+0x2cb1e70)
#1 0xfffd1af0e254 (/usr/lib64/iscsi/libiscsi.so.8+0xe254)
#2 0xfffd1af0d174 (/usr/lib64/iscsi/libiscsi.so.8+0xd174)
#3 0xfffd1af19fac (/usr/lib64/iscsi/libiscsi.so.8+0x19fac)
#4 0xaaad45f9acc8 in iscsi_process_read /qemu/block/iscsi.c:403:5
#5 0xaaad4623733c in aio_dispatch_handler /qemu/util/aio-posix.c:467:9
#6 0xaaad4622f350 in aio_dispatch_handlers /qemu/util/aio-posix.c:510:20
#7 0xaaad4622f350 in aio_dispatch /qemu/util/aio-posix.c:520
#8 0xaaad46215944 in aio_ctx_dispatch /qemu/util/async.c:298:5
#9 0xfffd1bed12f4 in g_main_context_dispatch (/lib64/libglib-2.0.so.0+0x512f4)
#10 0xaaad46227de0 in glib_pollfds_poll /qemu/util/main-loop.c:219:9
#11 0xaaad46227de0 in os_host_main_loop_wait /qemu/util/main-loop.c:242
#12 0xaaad46227de0 in main_loop_wait /qemu/util/main-loop.c:518
#13 0xaaad43d9d60c in qemu_main_loop /qemu/softmmu/vl.c:1662:9
#14 0xaaad4607a5b0 in main /qemu/softmmu/main.c:49:5
#15 0xfffd1a460b9c in __libc_start_main (/lib64/libc.so.6+0x20b9c)
#16 0xaaad43320740 in _start (aarch64-softmmu/qemu-system-aarch64+0x2c00740)
Reported-by: Euler Robot <euler.robot@huawei.com>
Signed-off-by: Chen Qun <kuhn.chenqun@huawei.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200418062602.10776-1-kuhn.chenqun@huawei.com
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>