This fixes an unbounded allocation for s->unknown_header_fields.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If it returns an error, the migrated VM will not be started, but qemu
exits with an error message.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
qcow2_open() causes writes when repairing an image with the dirty flag
set and when clearing autoclear flags. It shouldn't do this when another
qemu instance is still actively working on this image file.
One effect of the bug is that images may have a cleared dirty flag while
the migration source host still has it in use with lazy refcounts
enabled, so refcounts are not accurate and the dirty flag must remain
set.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Instead of manually building a list of all options from BDRVQcowState
values just reuse the options that were used to open the image.
qcow2_open() won't fully use all of the options in the QDict, but that's
okay.
This fixes all of the driver-specific options in qcow2, except for
lazy-refcounts, which was special cased before.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
After migration has completed, we call bdrv_invalidate_cache() so that
drivers which cache some data drop their stale copy of the data and
reread it from the image file to get a new version of data that the
source modified while the migration was running.
Reloading metadata from the image file is useless, though, if the size
of the image file stays stale (this is a value that is cached for all
image formats in block.c). Reads from (meta)data after the old EOF
return only zeroes, causing image corruption.
We need to update bs->total_sectors in all layers that could potentially
have changed their size (i.e. backing files are not a concern - if they
are changed, we're in bigger trouble)
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Returning "Wrong medium type" for an image that does not have a valid
header is a bit weird. Improve the error by mentioning what format
was trying to open it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add the bdrv_open() option BDRV_O_PROTOCOL which results in passing the
call to bdrv_file_open(). Additionally, make bdrv_file_open() static and
therefore bdrv_open() the only way to call it.
Consequently, all existing calls to bdrv_file_open() have to be adjusted
to use bdrv_open() with the BDRV_O_PROTOCOL flag instead.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow bdrv_open() to handle references to existing block devices just as
bdrv_file_open() is already capable of.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Make bdrv_open() take a pointer to a BDS pointer, similarly to
bdrv_file_open(). If a pointer to a NULL pointer is given, bdrv_open()
will create a new BDS with an empty name; if the BDS pointer is not
NULL, that existing BDS will be reused (in the same way as bdrv_open()
already did).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
error_is_set(&var) is the same as var != NULL, but it takes
whole-program analysis to figure that out. Unnecessarily hard for
optimizers, static checkers, and human readers. Dumb it down to
obvious.
Gets rid of several dozen Coverity false positives.
Note that the obvious form is already used in many places.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
In the case of a metadata preallocation with a large cluster size,
qcow2_alloc_cluster_offset() can allocate nothing and returns a
NULL l2meta. This patch checks for it and link2 l2 with only valid
l2meta.
Replace 9 and 512 with BDRV_SECTOR_BITS, BDRV_SECTOR_SIZE
respectively while at the function.
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
n_start can be actually calculated from offset. The number of
sectors to be allocated(n_end - n_start) can be passed in in
num. By removing n_start and n_end, we can save two parameters.
The side effect is there is a bug in qcow2.c:preallocate() that
passes incorrect n_start to qcow2_alloc_cluster_offset() is
fixed. The bug can be triggerred by a larger cluster size than
the default value(65536), for example:
./qemu-img create -f qcow2 \
-o 'cluster_size=131072,preallocation=metadata' file.img 4G
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The QCOW2 .bdrv_make_empty implementation always returns 0 for success,
but does not actually do anything.
The proper way to not support an optional driver function stub is to
just not implement it, so let's remove the stub.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This function separates filling the BlockLimits from bdrv_open(), which
allows it to call it from other operations which may change the limits
(e.g. modifications to the backing file chain or bdrv_reopen)
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Allow specifying a reference to an existing block device (by name) for
bdrv_file_open() instead of a filename and/or options.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is a boiler-plate _nofail variant of qemu_opts_create. Remove and
use error_abort in call sites.
null/0 arguments needs to be added for the id and fail_if_exists fields
in affected callsites due to argument inconsistency between the normal and
no_fail variants.
Signed-off-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Strictly speaking, this is only required for has_zero_init() == false,
but it's easy enough to just do a cluster-aligned write that is padded
with zeros after the header.
This fixes that after 'qemu-img create' header extensions are attempted
to be parsed that are really just random leftover data.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This will let misaligned but large requests use zero clusters. This
is important because the cluster size is not guest visible.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If you open an image temporarily just because you want to check its size
or get it flushed, there's no real reason to open the whole backing file
chain.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Opening the qcow2 image with BDRV_O_NO_FLUSH prevents any flushes during
the image creation. This means that the image has not yet been flushed
to disk when qemu-img create exits. This flush is delayed until the next
operation on the image involving opening it without BDRV_O_NO_FLUSH and
closing (or directly flushing) it. For large images and/or images with a
small cluster size and preallocated metadata, this flush may take a
significant amount of time and may occur unexpectedly.
Reopening the image without BDRV_O_NO_FLUSH right before the end of
qcow2_create2() results in hoisting the potentially costly flush into
the image creation, which is expected to take some time (whereas
successive image operations may be not).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Saving the VM state is done using bdrv_pwrite. This function may perform
a read-modify-write, which in this case results in data being read from
beyond the end of the virtual disk. Since we are actually trying to
access an area which is not a part of the virtual disk, zero_beyond_eof
has to be set to false before performing the partial write, otherwise
the VM state may become corrupted.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since df2a6f29a5, bdrv_co_do_writev increases the total_sectors value of
a growable block devices on writes after the current end. This leads to
the virtual disk apparently growing in qcow2_save_vmstate, which in turn
affects the disk size captured by the internal snapshot taken directly
afterwards through e.g. the HMP savevm command. Such a "grown" snapshot
cannot be loaded after reopening the qcow2 image, since its disk size
differs from the actual virtual disk size (writing a VM state does not
actually increase the virtual disk size).
Fix this by restoring total_sectors at the end of qcow2_save_vmstate.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Evaluate the runtime overlap check options and set
BDRVQcowState.overlap_check appropriately.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add an array which assigns the option string to its corresponding
overlap check bit.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add runtime options to tune the overlap checks to be performed before
write accesses.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace the QCOW2_OL_DEFAULT macro by a variable overlap_check in
BDRVQcowState.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In qcow2_check_metadata_overlap and qcow2_pre_write_overlap_check,
change the parameter signifying the checks to perform from its current
positive form to a negative one, i.e., it will no longer explicitly
specify every check to perform but rather a mask of checks not to
perform.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In qcow2_write_compressed, if the compression fails, a normal cluster is
written to disk. This is done through bdrv_write on the qcow2 BDS
itself (using the guest offset), thus it is wrong to do a metadata
overlap check before.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The error message in qcow2_downgrade about an unsupported refcount
order is missing a space. This patch adds it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a new ImageInfoSpecificQCow2 type as a subtype of ImageInfoSpecific.
This contains the compatibility level as a string and an optional
lazy_refcounts boolean (optional means mandatory for compat >= 1.1 and
not available for compat == 0.10).
Also, add qcow2_get_specific_info, which returns this information.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QCowHeader and QCowExtension are structs that reside in the on-disk
image format, and are read and written directly via bdrv_pread()/write(),
and as such should be packed to avoid any unintentional struct padding.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add an Error ** parameter to bdrv_create and its associated functions to
allow more specific error messages.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add an Error ** parameter to bdrv_open, bdrv_file_open and associated
functions to allow more specific error messages.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add an Error ** parameter to BlockDriver.bdrv_open and
BlockDriver.bdrv_file_open to allow more specific error messages.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Implement bdrv_amend_options for compat, size, backing_file, backing_fmt
and lazy_refcounts.
Downgrading images from compat=1.1 to compat=0.10 is achieved through
handling all incompatible flags accordingly, clearing all compatible and
autoclear flags and expanding all zero clusters.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Save the image refcount order in BDRVQcowState. This will be relevant
for future code supporting different refcount orders than four and also
for code that needs to verify a certain refcount order for an opened
image.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
During savevm, the VM state is written to the active L1 of the image and
then a snapshot is taken. After that, the VM state isn't needed any more
in the active L1 and should be discarded. This is implemented by this
patch.
The impact of not discarding the VM state is that a snapshot can never
become smaller than any previous snapshot (because it would be padded
with old VM state), and more importantly that future savevm operations
cause unnecessary COWs (with associated flushes), which makes subsequent
snapshots much slower.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
The function will be used internally instead of only being called for
guest discard requests.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
For now, bdrv_get_block_status is just another name for bdrv_is_allocated.
The next patches will add more flags.
This also touches all block drivers with a mostly mechanical rename. The
sole exception is cow; because it calls cow_co_is_allocated from the read
code, we keep that function and make cow_co_get_block_status a wrapper.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Some bdrv_is_allocated callers do not expect errors, but the fallback
in qcow2.c might make other callers trip on assertion failures or
infinite loops.
Fix the callers to always look for errors.
Cc: qemu-stable@nongnu.org
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Manage BlockDriverState lifecycle with refcnt, so bdrv_delete() is no
longer public and should be called by bdrv_unref() if refcnt is
decreased to 0.
This is an identical change because effectively, there's no multiple
reference of BDS now: no caller of bdrv_ref() yet, only bdrv_new() sets
bs->refcnt to 1, so all bdrv_unref() now actually delete the BDS.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If no corruptions remain after an image repair (and no errors have been
encountered), clear the corrupt flag in qcow2_check.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The pre-write overlap check function is now called before most of the
qcow2 writes (aborting it on collision or other error).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds an incompatible bit indicating corruption to qcow2. Any image
with this bit set may not be written to unless for repairing (and
subsequently clearing the bit if the repair has been successful).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
By the time that qemu 1.7 will be released, enough time will have passed
since qemu 1.1, which is the first version to understand version 3
images, that changing the default shouldn't hurt many people any more
and the benefits of using the new format outweigh the pain.
qemu-iotests already runs with compat=1.1 by default.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
In 4146b46c42e0989cb5842e04d88ab6ccb1713a48 (block: Produce zeros when
protocols reading beyond end of file), we break qemu-iotests ./check
-qcow2 022. This happens because qcow2 temporarily sets ->growable = 1
for vmstate accesses (which are stored beyond the end of regular image
data).
We introduce the bs->zero_beyond_eof to allow qcow2_load_vmstate() to
disable ->zero_beyond_eof temporarily in addition to enable ->growable.
[Since the broken patch "block: Produce zeros when protocols reading
beyond end of file" has not been merged yet, I have applied this fix
*first* and will then apply the next patch to keep the tree bisectable.
-- Stefan]
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
By the time that qemu 1.7 will be released, enough time will have passed
since qemu 1.1, which is the first version to understand version 3
images, that changing the default shouldn't hurt many people any more
and the benefits of using the new format outweigh the pain.
qemu-iotests already runs with compat=1.1 by default.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is what QMP wants to use. The options haven't been enabled in any
release yet, so we're still free to change them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
.has_zero_init defaults to 1 for all formats and protocols.
this is a dangerous default since this means that all
new added drivers need to manually overwrite it to 0 if
they do not ensure that a device is zero initialized
after bdrv_create().
if a driver needs to explicitly set this value to
1 its easier to verify the correctness in the review process.
during review of the existing drivers it turned out
that ssh and gluster had a wrong default of 1.
both protocols support host_devices as backend
which are not by default zero initialized. this
wrong assumption will lead to possible corruption
if qemu-img convert is used to write to such a backend.
vpc and vmdk also defaulted to 1 altough they support
fixed respectively flat extends. this has to be addresses
in separate patches. both formats as well as the mentioned
ssh and gluster are turned to the default of 0 with this
patch for safety.
a similar problem with the wrong default existed for
iscsi most likely because the driver developer did
oversee the default value of 1.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This optimises the discard operation for freed clusters by batching
discard requests (both snapshot deletion and bdrv_discard end up
updating the refcounts cluster by cluster).
Note that we don't discard asynchronously, but keep s->lock held. This
is to avoid that a freed cluster is reallocated and written to while the
discard is still in flight.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Deleted snapshots are discarded in the image file by default, discard
requests take their default from the -drive discard=... option and other
places that free clusters must always be enabled explicitly.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds a refcount update reason to all callers of update_refcounts(),
so that a follow-up patch can use this information to decide whether
clusters that reach a refcount of 0 should be discarded in the image
file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This catches the situation that is described in the bug report at
https://bugs.launchpad.net/qemu/+bug/865518 and goes like this:
$ qemu-img create -f qcow2 huge.qcow2 $((1024*1024))T
Formatting 'huge.qcow2', fmt=qcow2 size=1152921504606846976 encryption=off cluster_size=65536 lazy_refcounts=off
$ qemu-io /tmp/huge.qcow2 -c "write $((1024*1024*1024*1024*1024*1024 - 1024)) 512"
Segmentation fault
With this patch applied the segfault will be avoided, however the case
will still fail, though gracefully:
$ qemu-img create -f qcow2 /tmp/huge.qcow2 $((1024*1024))T
Formatting 'huge.qcow2', fmt=qcow2 size=1152921504606846976 encryption=off cluster_size=65536 lazy_refcounts=off
qemu-img: The image size is too large for file format 'qcow2'
Note that even long before these overflow checks kick in, you get
insanely high memory usage (up to INT_MAX * sizeof(uint64_t) = 16 GB for
the L1 table), so with somewhat smaller image sizes you'll probably see
qemu aborting for a failed g_malloc().
If you need huge image sizes, you should increase the cluster size to
the maximum of 2 MB in order to get higher limits.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Compression in qcow2 requires image length to be a multiple of the
cluster size. Lift this requirement by zero-padding the final cluster
when necessary. The virtual disk size is still not cluster-aligned, so
the guest cannot access the zero sectors.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Directly pass the QEMUIOVector on instead of linearising it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Move aes.h from include/block to include/qemu to show it can be reused
by other subsystems.
Cc: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@gmail.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Instead of expecting a single l2meta, have a list of them. This allows
to still have a single I/O request for the guest data, even though
multiple l2meta may be needed in order to describe both a COW overwrite
and a new cluster allocation (typical sequential write case).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The unlock wakes up the next coroutine, but the currently running
coroutine will lock it again before it yields, so this doesn't make a
lot of sense.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Need to pass an options QDict to qcow2_open() now. This fixes a segfault
on the migration target with qcow2.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Otherwise, live migration of the top layer will miss zero clusters and
let the backing file show through. This also matches what is done in qed.
QCOW2_CLUSTER_ZERO clusters are invalid in v2 image files. Check this
directly in qcow2_get_cluster_offset instead of replicating the test
everywhere.
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
qcow2 images now accept a boolean lazy_refcounts options. Use it like
this:
-drive file=test.qcow2,lazy_refcounts=on
If the option is specified on the command line, it overrides the default
specified by the qcow2 header flags that were set when creating the
image.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It doesn't do anything yet except storing the options QDict in the
BlockDriverState.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This improves error reports for bochs, cow, qcow, qcow2, qed and vmdk
when a file with the wrong format is selected.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
One of the recent refactoring patches (commit f50f88b9) didn't take care
to initialise l2meta properly, so with zero-length writes, which don't
even enter the write loop, qemu just segfaulted.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
There's no reason for run_dependent_requests() to hold s->lock, and a
later patch will require that in fact the lock is not held.
Also, before this patch, run_dependent_requests() not only does what its
name suggests, but also removes the l2meta from the list of in-flight
requests. When changing this, it becomes an one-liner, so just inline it
completely.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is closer to where the dirty flag is really needed, and it avoids
having checks for special cases related to cluster allocation directly
in the writev loop.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Even for writes to already allocated clusters, an l2meta is allocated,
though it stays effectively unused. After this patch, only allocating
requests still have one. Each l2meta now describes an in-flight request
that writes to clusters that are not yet hooked up in the L2 table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There's no real reason to have an l2meta for normal requests that don't
allocate anything. Before we can get rid of it, we must return the host
cluster offset in a different way.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As soon as delayed COW is introduced, the l2meta struct is needed even
after completion of the request, so it can't live on the stack.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to use these events to suspend requests for testing concurrent
AIO requests. Suspending requests while they are holding the CoMutex is
rather boring for this purpose.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These are the stubs for the file reopen drivers for the qcow2 format.
There is currently nothing that needs to be done by the qcow2 driver
in reopen.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Image formats with a dirty bit, like qed and qcow2, repair dirty image
files upon open with BDRV_O_RDWR. Performing automatic repair when
qemu-img check runs is not ideal because the bdrv_open() call repairs
the image before the actual bdrv_check() call from qemu-img.c.
Fix this "double repair" since it leads to confusing output from
qemu-img check. Tell the block driver that this image is being opened
just for bdrv_check(). This skips automatic repair and qemu-img.c can
invoke it manually with bdrv_check().
Update the golden output for qemu-iotests 039 to reflect the new
qemu-img check output.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The dirty bit is cleared after image repair succeeds in qcow2_open().
Move this into qcow2_check() so that all callers benefit from this
behavior when fix mode is enabled.
This is necessary so qemu-img check can call .bdrv_check() and mark the
image clean.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Lazy refcounts is a performance optimization for qcow2 that postpones
refcount metadata updates and instead marks the image dirty. In the
case of crash or power failure the image will be left in a dirty state
and repaired next time it is opened.
Reducing metadata I/O is important for cache=writethrough and
cache=directsync because these modes guarantee that data is on disk
after each write (hence we cannot take advantage of caching updates in
RAM). Refcount metadata is not needed for guest->file block address
translation and therefore does not need to be on-disk at the time of
write completion - this is the motivation behind the lazy refcount
optimization.
The lazy refcount optimization must be enabled at image creation time:
qemu-img create -f qcow2 -o compat=1.1,lazy_refcounts=on a.qcow2 10G
qemu-system-x86_64 -drive if=virtio,file=a.qcow2,cache=writethrough
Update qemu-iotests 031 and 036 since the extension header size changes
when we add feature bit table entries.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds an incompatible feature bit to mark images that have not
been closed cleanly. When a dirty image file is opened a consistency
check and repair is performed.
Update qemu-iotests 031 and 036 since the extension header size changes
when we add feature bit table entries.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* mjt/mjt-iov2:
rewrite iov_send_recv() and move it to iov.c
cleanup qemu_co_sendv(), qemu_co_recvv() and friends
export iov_send_recv() and use it in iov_send() and iov_recv()
rename qemu_sendv to iov_send, change proto and move declarations to iov.h
change qemu_iovec_to_buf() to match other to,from_buf functions
consolidate qemu_iovec_copy() and qemu_iovec_concat() and make them consistent
allow qemu_iovec_from_buffer() to specify offset from which to start copying
consolidate qemu_iovec_memset{,_skip}() into single function and use existing iov_memset()
rewrite iov_* functions
change iov_* function prototypes to be more appropriate
virtio-serial-bus: use correct lengths in control_out() message
Conflicts:
tests/Makefile
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The DEBUG_ALLOC qcow2.h macro enables additional consistency checks
throughout the code. This makes it easier to spot corruptions that are
introduced during development. Since consistency check is an expensive
operation the DEBUG_ALLOC macro is used to compile checks out in normal
builds and qcow2_check_refcounts() calls missed the addition of a new
function argument.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The autoclear feature bits can be used for qcow2 file format features
that are safe to "drop" by old programs that do not understand the
feature. Upon opening the image file unknown autoclear feature bits are
cleared and the image file header is rewritten, but this was happening
too early in the code when critical header fields were not yet loaded.
Process autoclear feature bits after all necessary header information
has been loaded.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Writethrough does not need special-casing anymore in the qcow2 caches.
The block layer adds flushes after every guest-initiated data write,
and these will also flush the qcow2 caches to the OS.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The QED block driver already provides the functionality to not only
detect inconsistencies in images, but also fix them. However, this
functionality cannot be manually invoked with qemu-img, but the
check happens only automatically during bdrv_open().
This adds a -r switch to qemu-img check that allows manual invocation
of an image repair.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It now allows specifying offset within qiov to start from and
amount of bytes to copy. Actual implementation is just a call
to iov_to_buf().
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
qemu_iovec_concat() is currently a wrapper for
qemu_iovec_copy(), use the former (with extra
"0" arg) in a few places where it is used.
Change skip argument of qemu_iovec_copy() from
uint64_t to size_t, since size of qiov itself
is size_t, so there's no way to skip larger
sizes. Rename it to soffset, to make it clear
that the offset is applied to src.
Also change the only usage of uint64_t in
hw/9pfs/virtio-9p.c, in v9fs_init_qiov_from_pdu() -
all callers of it actually uses size_t too,
not uint64_t.
One added restriction: as for all other iovec-related
functions, soffset must point inside src.
Order of argumens is already good:
qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
int c, size_t bytes)
vs:
qemu_iovec_concat(QEMUIOVector *dst,
QEMUIOVector *src,
size_t soffset, size_t sbytes)
(note soffset is after _src_ not dst, since it applies to src;
for memset it applies to qiov).
Note that in many places where this function is used,
the previous call is qemu_iovec_reset(), which means
many callers actually want copy (replacing dst content),
not concat. So we may want to add a wrapper like
qemu_iovec_copy() with the same arguments but which
calls qemu_iovec_reset() before _concat().
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Similar to
qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
int c, size_t bytes);
the new prototype is:
qemu_iovec_from_buf(QEMUIOVector *qiov, size_t offset,
const void *buf, size_t bytes);
The processing starts at offset bytes within qiov.
This way, we may copy a bounce buffer directly to
a middle of qiov.
This is exactly the same function as iov_from_buf() from
iov.c, so use the existing implementation and rename it
to qemu_iovec_from_buf() to be shorter and to match the
utility function.
As with utility implementation, we now assert that the
offset is inside actual iovec. Nothing changed for
current callers, because `offset' parameter is new.
While at it, stop using "bounce-qiov" in block/qcow2.c
and copy decrypted data directly from cluster_data
instead of recreating a temp qiov for doing that.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
This patch combines two functions into one, and replaces
the implementation with already existing iov_memset() from
iov.c.
The new prototype of qemu_iovec_memset():
size_t qemu_iovec_memset(qiov, size_t offset, int fillc, size_t bytes)
It is different from former qemu_iovec_memset_skip(), and
I want to make other functions to be consistent with it
too: first how much to skip, second what, and 3rd how many
of it. It also returns actual number of bytes filled in,
which may be less than the requested `bytes' if qiov is
smaller than offset+bytes, in the same way iov_memset()
does.
While at it, use utility function iov_memset() from
iov.h in posix-aio-compat.c, where qiov was used.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
This check applies to all drivers, but QED lacks it.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
preallocate() will be locked. This is required because
qcow2_alloc_cluster_link_l2() assumes that it runs under a lock that it
can drop while COW is being performed.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of printing an ugly bitmask, qemu can now print a more helpful
string even for yet unknown features.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds the basic infrastructure to qcow2 to handle version 3 images.
It includes code to create v3 images, allow header updates for v3 images
and checks feature bits.
It still misses support for zero clusters, so this is not a fully
compliant implementation of v3 yet.
The default for creating new images stays at v2 for now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With this change, reading from a qcow2 image ignores all reserved bits
that are set in an L1 or L2 table entry.
Now get_cluster_offset() assigns *cluster_offset only the offset without
any other flags. The cluster type is not longer encoded in the offset,
but a positive return value in case of success.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qemu-img resize has some limitations with qcow2, but the user is only
told that "this image format does not support resize". Quite confusing,
so add some more detailed error_report() calls and change "this image
format" into "this image".
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Image files that make qemu-img info read several gigabytes into the
unknown header extensions list are bad. Just fail opening the image
if an extension claims to be larger than the header extension area.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The spec says that the length of extensions is padded to 8 bytes, not
the offset. Currently this is the same because the header size is a
multiple of 8, so this is only about compatibility with future changes
to the header size.
While touching it, move the calculation to a common place instead of
duplicating it for each header extension type.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
If we want header extensions to work as compatible extensions, we can't
destroy yet unknown header extensions when rewriting the header (e.g.
for changing the backing file). Save all unknown header extensions in a
list of blobs and include them in a new header.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In order to switch the backing file, qcow2 issues multiple write
requests that only changed a part of the image header. Any failure after
the first one would leave the header in an corrupted state. With this
patch, the whole header is written at once, so we can't fail in the
middle.
At the same time, this gives us a reusable functions that updates all
fields of the qcow2 header and not only the backing file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Li Zhi Hui <zhihuili@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It's common to wake up all waiting coroutines. Introduce the
qemu_co_queue_restart_all() function to do this instead of looping over
qemu_co_queue_next() in every caller.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The qcow2, qcow, and vmdk block drivers are based on coroutines. They have a
coroutine mutex which protects internal state. We can convert the
.bdrv_is_allocated() function to .bdrv_co_is_allocated() by holding the mutex
around the cluster lookup operation.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Looks better when reviewing these source files.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
We don't reopen the actual file, but instead invoke the close and open routines.
We specifically ignore the backing file since it's contents are read-only and
therefore immutable.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2 has a writeback metadata cache, so flushing a qcow2 image actually
consists of writing back that cache to the protocol and only then flushes the
protocol in order to get everything stable on disk.
This introduces a separate bdrv_co_flush_to_os to reflect the split.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There are two different types of flush that you can do: Flushing one level up
to the OS (i.e. writing data to the host page cache) or flushing it all the way
down to the disk. The existing functions flush to the disk, reflect this in the
function name.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If qcow2_cache_flush failed, s->lock will not be unlock.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since coroutine operation is now mandatory, convert both bdrv_discard
implementations to coroutines. For qcow2, this means taking the lock
around the operation. raw-posix remains synchronous.
The bdrv_discard callback is then unused and can be eliminated.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since coroutine operation is now mandatory, convert all bdrv_flush
implementations to coroutines. For qcow2, this means taking the lock.
Other implementations are simpler and just forward bdrv_flush to the
underlying protocol, so they can avoid the lock.
The bdrv_flush callback is then unused and can be eliminated.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If during allocation of compressed clusters the cluster was already allocated
uncompressed, fail and properly release the l2_table (the latter avoids a
failed assertion).
While at it, make it return some real error numbers instead of -1.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Block drivers now only need to provide either of .bdrv_co_flush,
.bdrv_aio_flush() or for legacy drivers .bdrv_flush(). Remove
the redundant .bdrv_flush() implementations.
[Paolo Bonzini: change raw driver to bdrv_co_flush]
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Requests depending on a failed request would end up waiting forever. This fixes
the error path to continue dependent requests even when the request has failed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
By introducing BlockDriverState compiling qcow2 with DEBUG_ALLOC and DEBUG_EXT
defined got broken.
Define a BdrvCheckResult structure locally which is now needed as the second
argument.
Also fix qcow2_read_extensions() needing BDRVQcowState.
Signed-off-by: Philipp Hahn <hahn@univention.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
error_report() prepends location, and appends a newline. The message
constructed from the arguments should not contain a newline. Fix the
obvious offenders.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
bdrv_aio_* must not call the callback before returning to its caller. In qcow2,
this could happen in some error cases. This starts the real requests processing
in a BH to avoid this situation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When not specifying a cluster size on the command line, qemu-img printed
a cluster size of 0:
Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864
encryption=off cluster_size=0
This patch adds the default cluster size to the QEMUOptionParameter list, so
that it displays the default value that is used.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Change BDRV_O_NOCACHE to only imply bypassing the host OS file cache,
but no writeback semantics. All existing callers are changed to also
specify BDRV_O_CACHE_WB to give them writeback semantics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The qcow2 driver is now declared responsible for any QCOW image that has
version 2 or greater (before this, version 3 would be detected as raw).
For everything newer than version 2, an error is reported.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
When reading a compressed cluster failed, qcow2 falsely returned success.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Requests could return success even though they failed when bdrv_aio_readv
returned NULL for a backing file read.
Reported-by: Chunqiang Tang <ctang@us.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch fixes the following bug in QCOW2. For a QCOW2 image that is larger
than its base image, when handling a read request straddling over the end of the
base image, the QCOW2 driver attempts to read beyond the end of the base image
and the request would fail.
This bug was found by Fast Virtual Disk (FVD)'s fully automated testing tool.
The following test triggered the bug.
dd if=/dev/zero of=/var/ramdisk/truth.raw count=0 bs=1 seek=1098561536
dd if=/dev/zero of=/var/ramdisk/zero-500M.raw count=0 bs=1 seek=593099264
./qemu-img create -f qcow2 -ocluster_size=65536,backing_fmt=blksim -b /var/ramdisk/zero-500M.raw /var/ramdisk/test.qcow2 1098561536
./qemu-io --auto --seed=30477694 --truth=/var/ramdisk/truth.raw --format=qcow2 --test=blksim:/var/ramdisk/test.qcow2 --verify_write=true --compare_before=false --compare_after=true --round=100000 --parallel=100 --io_size=10485760 --fail_prob=0 --cancel_prob=0 --instant_qemubh=true
Signed-off-by: Chunqiang Tang <ctang@us.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For cache=unsafe we also need to set BDRV_O_CACHE_WB, otherwise we have some
strange unsafe writethrough mode.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
This adds a bdrv_discard function to qcow2 that frees the discarded clusters.
It does not yet pass the discard on to the underlying file system driver, but
the space can be reused by future writes to the image.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
In addition this adds missing braces to the function to be consistent
with the coding style.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It doesn't really make sense for functions in qcow2.c to be named
qcow_ so convert the names to match correctly.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This changes bdrv_flush to return 0 on success and -errno in case of failure.
It's a requirement for implementing proper error handle in users of bdrv_flush.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
In order to backup snapshots, created from QCOW2 iamge, we want to copy snapshots out of QCOW2 disk to a seperate storage.
The following patch adds a new option in "qemu-img": qemu-img convert -f qcow2 -O qcow2 -s snapshot_name src_img bck_img.
Right now, it only supports to copy the full snapshot, delta snapshot is on the way.
Changes from V1: all the comments from Kevin are addressed:
Add read-only checking
Fix coding style
Change the name from bdrv_snapshot_load to bdrv_snapshot_load_tmp
Signed-off-by: Disheng Su <edison@cloud.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of doing lots of magic for setting up initial refcount blocks and stuff
create a minimal (inconsistent) image, open it and initialize the rest with
regular qcow2 functions.
This is a complete rewrite of the image creation function. The old
implementating is #ifdef'd out and will be removed by the next patch (removing
it here would have made the diff unreadable because diff tries to find
similarities when it's really a rewrite)
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The L1 table grow operation includes a size calculation that bumps up
the new L1 table size in order to anticipate the size needs of vmstate
data. This helps reduce the number of times that the L1 table has to be
grown when vmstate data is appended.
This size overhead is not necessary during image creation,
bdrv_truncate(), or snapshot goto operations. In fact, existing
qemu-iotests that exercise table growth are no longer able to trigger it
because image creation preallocates an L1 table that is too large after
changes to qcow_create2().
This patch keeps the size calculation but also adds exact growth for
callers that do not want to inflate the L1 table size unnecessarily.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 used to use bounce buffers for any AIO requests. This does not only imply
unnecessary copying, but also unbounded allocations which should be avoided.
This patch removes bounce buffers from the normal AIO write path. Encrypted
images continue to use a bounce buffer, however with constant size.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 used to use bounce buffers for any AIO requests. This does not only imply
unnecessary copying, but also unbounded allocations which should be avoided.
This patch removes bounce buffers from the normal AIO read path, and constrains
them to a constant size for encrypted images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This distinguishes between harmless leaks and real corruption. Hopefully users
better understand what qemu-img check wants to tell them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
People were wondering why qemu-img check failed after they tried to preallocate
a large qcow2 file and ran out of disk space.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_get_cluster_offset() looks up a given virtual disk offset and returns the
offset of the corresponding cluster in the image file. Errors (e.g. L2 table
can't be read) are currenctly indicated by a return value of 0, which is
unfortuately the same as for any unallocated cluster. So in effect we can't
check for errors.
This makes the old return value a by-reference parameter and returns the usual
0/-errno error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
OpenBSDs gcc is said to generate warnings for this declaration, so don't
reference bdrv_qcow2 directly, but look it up using bdrv_find_format.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This reverts commit 20d97356c9.
The BlockDriver definition should stay at the end of source files.
Conflicts:
block/qcow2.c
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This patch adds the ability to grow qcow2 images in-place using
bdrv_truncate(). This enables qemu-img resize command support for
qcow2.
Snapshots are not supported and bdrv_truncate() will return -ENOTSUP.
The notion of resizing an image with snapshots could lead to confusion:
users may expect snapshots to remain unchanged, but this is not possible
with the current qcow2 on-disk format where the header.size field is
global instead of per-snapshot. Others may expect snapshots to change
size along with the current image data. I think it is safest to not
support snapshots and perhaps add behavior later if there is a
consensus.
Backing images continue to work. If the image is now larger than its
backing image, zeroes are read when accessing beyond the end of the
backing image.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Format drivers shouldn't need to bother with things like file names, but rather
just get an open BlockDriverState for the underlying protocol. This patch
introduces this behaviour for bdrv_open implementation. For protocols which
need to access the filename to open their file/device/connection/... a new
callback bdrv_file_open is introduced which doesn't get an underlying file
opened.
For now, also some of the more obscure formats use bdrv_file_open because they
open() the file themselves instead of using the block.c functions. They need to
be fixed in later patches.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
GCC 3.3.5 generates warnings for static forward declarations of data, so
rearrange code to use static forward declarations of functions instead.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
What is known today as bdrv_open2 becomes the new bdrv_open. All remaining
callers of the old function are converted to the new one. In some places they
even know the right format, so they should have used bdrv_open2 from the
beginning.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow_create2 assumes that the new image will only need one cluster for its
refcount table initially. Obviously that's not true any more when the image is
big enough (exact value depends on the cluster size).
This patch calculates the refcount table size dynamically.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If we complete a request with a failure we need to remove it from the list of
requests that are in flight. If we don't do it, the next time the same AIOCB is
used for a cluster allocation it will create a loop in the list and qemu will
hang in an endless loop.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Returning -EIO is far from optimal, but at least it's an error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
When checking for errors, commit db89119d compares with the wrong values,
failing image creation even when there was no error. Additionally, if an
error has occured, we can't preallocate the image (it's likely broken).
This unbreaks test 023 of qemu-iotests.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The n member is not very descriptive and very hard to grep, rename it to
cur_nr_sectors to better indicate what it is used for. Also rename
nb_sectors to remaining_sectors as that is what it is used for.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Found some places that seems needs this explicitly, now that
read-write is not the default.
Signed-off-by: Naphtali Sprei <nsprei@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/qcow2.o
cc1: warnings being treated as errors
block/qcow2.c: In function 'qcow_create2':
block/qcow2.c:829: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:838: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:839: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:841: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:844: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:849: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:852: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:855: error: ignoring return value of 'write', declared with attribute warn_unused_result
make: *** [block/qcow2.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Returning 0/-errno allows it to distingush different errors classes. The
cluster offset of newly allocated clusters is now returned in the QCowL2Meta
struct.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Don't assume success but pass the bdrv_pwrite return value on.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now that we do not have to flush the backing device anymore implementing
the bdrv_aio_flush method for image formats is trivial.
[hch: forward ported to qemu mainline from a product tree]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Introduce the functions needed to change the backing file of an image. The
function is implemented for qcow2.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently qcow2 unnecessarily rounds up the length of the backing format string
to the next multiple of 8. At the same time, the array in BlockDriverState can
only hold 15 characters, so in effect backing formats with 9 characters or more
don't work (e.g. host_device).
Save the real string length and things start to work for all valid image format
names.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Images with disk size 0 may be used for
VM snapshots, but not to save normal block data.
It is possible to create such images using
qemu-img, but opening them later fails.
So even "qemu-img info image.qcow2" is not
possible for an image created with
"qemu-img create -f qcow2 image.qcow2 0".
This is fixed here.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
It was merely a workaround and the real fix is done now.
This reverts commit ef845c3bf4.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When the synchronous read and write functions were dropped, they were replaced
by generic emulation functions. Unfortunately, these emulation functions don't
provide the same semantics as the original functions did.
The original bdrv_read would mean that we read some data synchronously and that
we won't be interrupted during this read. The latter assumption is no longer
true with the emulation function which needs to use qemu_aio_poll and therefore
allows the callback of any other concurrent AIO request to be run during the
read. Which in turn means that (meta)data read earlier could have changed and
be invalid now. qcow2 is not prepared to work in this way and it's just scary
how many places there are where other requests could run.
I'm not sure yet where exactly it breaks, but you'll see breakage with virtio
on qcow2 with a backing file. Providing synchronous functions again fixes the
problem for me.
Patchworks-ID: 35437
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Problem: Our file sys-queue.h is a copy of the BSD file, but there are
some additions and it's not entirely compatible. Because of that, there have
been conflicts with system headers on BSD systems. Some hacks have been
introduced in the commits 15cc923584,
f40d753718,
96555a96d7 and
3990d09adf but the fixes were fragile.
Solution: Avoid the conflict entirely by renaming the functions and the
file. Revert the previous hacks.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
When two AIO requests write to the same cluster, and this cluster is
unallocated, currently both requests allocate a new cluster and the second one
merges the first one when it is completed. This means an cluster allocation, a
read and a cluster deallocation which cause some overhead. If we simply let the
second request wait until the first one is done, we improve overall performance
with AIO requests (specifially, qcow2/virtio combinations).
This patch maintains a list of in-flight requests that have allocated new
clusters. A second request touching the same cluster is limited so that it
either doesn't touch the allocation of the first request (so it can have a
non-overlapping allocation) or it waits for the first request to complete.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The wrong version of the preallocation patch has been applied, so this is the
remaining diff.
We can't use truncate to grow the image file to the right size because we don't
know if metadata has been written after the last data cluster. In this case
truncate would shrink the file and destroy its metadata. Write a zero sector at
the end of the virtual disk instead to ensure that the file is big enough.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This introduces a qemu-img create option for qcow2 which allows the metadata to
be preallocated, i.e. clusters are reserved in the refcount table and L1/L2
tables, but no data is written to them. Metadata is quite small, so this
happens in almost no time.
Especially with qcow2 on virtio this helps to gain a bit of performance during
the initial writes. However, as soon as create a snapshot, we're back to the
normal slow speed, obviously. So this isn't the real fix, but kind of a cheat
while we're still having trouble with qcow2 on virtio.
Note that the option is disabled by default and needs to be specified
explicitly using qemu-img create -f qcow2 -o preallocation=metadata.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The VM state offset is a concept internal to the image format. Replace
the old bdrv_{get,put}_buffer method that require an index into the
image file that is constructed from the VM state offset and an offset
into the vmstate with the bdrv_{load,save}_vmstate that just take an
offset into the VM state.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Contrary to what one could expect, the size of L1 tables is not cluster
aligned. So as we're writing whole sectors now instead of single entries,
we need to ensure that the L1 table in memory is large enough; otherwise
write would access memory after the end of the L1 table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The performance of qcow2 has improved meanwhile, so we don't need to
special-case it any more. Switch the default to write-through caching
like all other block drivers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The qcow2 source is now split into several more manageable files. During the
conversion quite some functions that were static before needed to be changed to
be global to make the source compile again.
We were lucky enough not to get name conflicts with these additional global
names, but they are not nice. This patch adds a qcow2_ prefix to all of the
global functions in qcow2.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2-snapshot.c contains the code related to snapshotting.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2-cluster.c contains all functions related to the management of guest
clusters, i.e. what the guest sees on its virtual disk. This code is about
mapping these guest clusters to host clusters in the image file using the
two-level lookup tables.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2-refcount.c contains all functions which are related to cluster
allocation and management in the image file. A large part of this is the
reference counting of these clusters.
Also a header file qcow2.h is introduced which will contain the interface of
the split qcow2 modules.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Larger cluster sizes mean less metadata. This has been discussion a few times,
let's do it now. This turns 64k clusters on by default for new images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch adds a small help text to each of the options in the block drivers
which can be displayed by using qemu-img create -f fmt -o ?
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that we have a separate aio pool structure we can remove those
aio pool details from BlockDriver.
Every driver supporting AIO now needs to declare a static AIOPool
with the aiocb size and the cancellation method. This cleans up the
current code considerably and will make it cleaner and more obvious
to support two different aio implementations behind a single
BlockDriver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Don't write each single changed refcount block entry to the disk after it is
written, but update all entries of the block and write all of them at once.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This is a preparation patch with no functional changes. It moves the allocation
of new refcounts block to a new function and makes update_cluster_refcount (for
one cluster) call update_refcount (for multiple clusters) instead the other way
round.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
There is only one (internal) user left and it can be switched to the normal
emulation provided in block.c
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Add an option to specify the cluster size of a newly created qcow2 image.
Default is 4k which is the same value that was hard-coded before.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now we can make use of the newly introduced option structures. Instead of
having bdrv_create carry more and more parameters (which are format specific in
most cases), just pass a option structure as defined by the driver itself.
bdrv_create2() contains an emulation of the old interface to simplify the
transition.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>