Sadly, some images may have more clusters than what can be represented
using a plain int. We should be prepared for that case (in
qcow2_check_refcounts() we actually were trying to catch that case, but
since size_to_clusters() truncated the returned value, that check never
did anything useful).
Cc: qemu-stable <qemu-stable@nongnu.org>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
BDRVQcowState is already used by qcow1, and gdb is always confused which
one to use. Rename the qcow2 one so they can be distinguished.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Many source files have doubled words (eg "the the", "to to",
and so on). Most of these can simply be removed, but a couple
were actual mis-spellings (eg "to to" instead of "to do").
There was even one triple word score "to to to" :-)
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Switch the qcow/qcow2 block driver over to use the generic cipher
API, this allows it to use the pluggable AES implementations,
instead of being hardcoded to use QEMU's built-in impl.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Message-Id: <1435770638-25715-10-git-send-email-berrange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a qcow[2] file is opened, if the header reports an
encryption method, this is used to set the 'crypt_method_header'
field on the BDRVQcow[2]State struct, and the 'encrypted' flag
in the BDRVState struct.
When doing I/O operations, the 'crypt_method' field on the
BDRVQcow[2]State struct is checked to determine if encryption
needs to be applied.
The crypt_method_header value is copied into crypt_method when
the bdrv_set_key() method is called.
The QEMU code which opens a block device is expected to always
do a check
if (bdrv_is_encrypted(bs)) {
bdrv_set_key(bs, ....key...);
}
If code forgets to do this, then 'crypt_method' is never set
and so when I/O is performed, QEMU writes plain text data
into a sector which is expected to contain cipher text, or
when reading, will return cipher text instead of plain
text.
Change the qcow[2] code to consult bs->encrypted when deciding
whether encryption is required, and assert(s->crypt_method)
to protect against cases where the caller forgets to set the
encryption key.
Also put an assert in the set_key methods to protect against
the case where the caller sets an encryption key on a block
device that does not have encryption
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This function never receives an invalid table pointer, so we can make
it void and remove all the error checking code.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The qcow2 L2/refcount cache contains one separate table for each cache
entry. Doing one allocation per table adds unnecessary overhead and it
also requires us to store the address of each table separately.
Since the size of the cache is constant during its lifetime, it's
better to have an array that contains all the tables using one single
allocation.
In my tests measuring freshly created caches with sizes 128MB (L2) and
32MB (refcount) this uses around 10MB of RAM less.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Refcounts may have a width of up to 64 bits, so qemu should use the same
width to represent refcount values internally.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
update_refcount() and qcow2_update_cluster_refcount() currently take a
signed addend. At least one caller passes a value directly derived from
an absolute refcount that should be reached ("l2_refcount - 1" in
expand_zero_clusters_in_l1()). Therefore, the addend should be unsigned
as well; this will be especially important for 64 bit refcounts.
Because update_refcount() then no longer knows whether the refcount
should be increased or decreased, it now requires an additional flag
which specified exactly that. The same applies to
qcow2_update_cluster_refcount().
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Refcounts can theoretically be of type uint64_t; in order to be able to
represent the full range, qcow2_get_refcount() cannot use a single
variable to represent both all refcount values and also keep some values
reserved for errors.
One solution would be to add an Error pointer parameter to
qcow2_get_refcount(); however, no caller could (currently) pass that
error message, so it would have to be emitted immediately and be
passed to the next caller by returning -EIO or something similar.
Therefore, an Error parameter does not offer any advantages here.
The solution applied by this patch is simpler to use. Because no caller
would be able to pass the error message, they would have to print it and
free it, whereas with this patch the caller only needs to pass the
returned integer (which is often a no-op from the code perspective,
because that integer will be stored in a variable "ret" which will be
returned by the fail path of many callers).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds checks for unaligned L2 table offsets and unaligned data
cluster offsets (actually the preallocated offsets for zero clusters) to
the zero cluster expansion function.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In qcow2_alloc_cluster_offset(), *num is limited to
INT_MAX >> BDRV_SECTOR_BITS by all callers. However, since remaining is
of type uint64_t, we might as well cast *num to that type before
performing the shift.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently, we have a bitmap for keeping track of which clusters have
been created during the zero cluster expansion process. This was
necessary because we need to properly increase the refcount for shared
L2 tables.
However, now we can simply take the L2 refcount and use it for the
cluster allocated for expansion. This will be the correct refcount and
therefore we don't have to remember that cluster having been allocated
any more.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Message-id: 1414404776-4919-7-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The only really time-consuming operation potentially performed by
qcow2_amend_options() is zero cluster expansion when downgrading qcow2
images from compat=1.1 to compat=0.10, so report status of that
operation and that operation only through the status CB.
For this, approximate the progress as the number of L1 entries visited
during the operation.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Message-id: 1414404776-4919-5-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Normally, discarded sectors should read back as zero. However, there are
cases in which a sector (or rather cluster) should be discarded as if
they were never written in the first place, that is, reading them should
fall through to the backing file again.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1414159063-25977-2-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
While writing an L1 table sector, qcow2_write_l1_entry() copies the
respective range from s->l1_table to the local "buf" array. The size of
s->l1_table does not have to be a multiple of L1_ENTRIES_PER_SECTOR;
thus, limit the index which is used for copying all entries to the L1
size.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Offsets taken from the L1, L2 and refcount tables are generally assumed
to be correctly aligned. However, this cannot be guaranteed if the image
has been written to by something different than qemu, thus check all
offsets taken from these tables for correct cluster alignment.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1409926039-29044-5-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
g_new(T, n) is neater than g_malloc(sizeof(T) * n). It's also safer,
for two reasons. One, it catches multiplication overflowing size_t.
Two, it returns T * rather than void *, which lets the compiler catch
more type errors.
Patch created with Coccinelle, with two manual changes on top:
* Add const to bdrv_iterate_format() to keep the types straight
* Convert the allocation in bdrv_drop_intermediate(), which Coccinelle
inexplicably misses
Coccinelle semantic patch:
@@
type T;
@@
-g_malloc(sizeof(T))
+g_new(T, 1)
@@
type T;
@@
-g_try_malloc(sizeof(T))
+g_try_new(T, 1)
@@
type T;
@@
-g_malloc0(sizeof(T))
+g_new0(T, 1)
@@
type T;
@@
-g_try_malloc0(sizeof(T))
+g_try_new0(T, 1)
@@
type T;
expression n;
@@
-g_malloc(sizeof(T) * (n))
+g_new(T, n)
@@
type T;
expression n;
@@
-g_try_malloc(sizeof(T) * (n))
+g_try_new(T, n)
@@
type T;
expression n;
@@
-g_malloc0(sizeof(T) * (n))
+g_new0(T, n)
@@
type T;
expression n;
@@
-g_try_malloc0(sizeof(T) * (n))
+g_try_new0(T, n)
@@
type T;
expression p, n;
@@
-g_realloc(p, sizeof(T) * (n))
+g_renew(T, p, n)
@@
type T;
expression p, n;
@@
-g_try_realloc(p, sizeof(T) * (n))
+g_try_renew(T, p, n)
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_alloc_cluster_offset() uses host_offset == 0 as "no preferred
offset" for the (data) cluster range to be allocated. However, this
offset is actually valid and may be allocated on images with a corrupted
refcount table or first refcount block.
In this case, the corruption prevention should normally catch that
write anyway (because it would overwrite the image header). But since 0
is a special value here, the function assumes that nothing has been
allocated at all which it asserts against.
Because this condition is not qemu's fault but rather that of a broken
image, it shouldn't throw an assertion but rather mark the image corrupt
and show an appropriate message, which this patch does by calling the
corruption check earlier than it would be called normally (before the
assertion).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Some code in the block layer makes potentially huge allocations. Failure
is not completely unexpected there, so avoid aborting qemu and handle
out-of-memory situations gracefully.
This patch addresses the allocations in the qcow2 block driver.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
This triggers if bs->drv becomes NULL in a concurrent request. This is
currently only the case when corruption prevention kicks in (i.e. at
most once per image, and after that it produces I/O errors).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
First, new_l1_size is an int64_t, whereas min_size is a uint64_t.
Therefore, during the loop which adjusts new_l1_size until it equals or
exceeds min_size, new_l1_size might overflow and become negative. The
comparison in the loop condition however will take it as an unsigned
value (because min_size is unsigned) and therefore recognize it as
exceeding min_size. Therefore, the loop is left with a negative
new_l1_size, which is not correct. This could be fixed by making
new_l1_size uint64_t.
On the other hand, however, by doing this, the while loop may take
forever. If min_size is e.g. UINT64_MAX, it will take new_l1_size
probably multiple overflows to reach the exact same value (if it reaches
it at all). Then, right after the loop, new_l1_size will be recognized
as being too big anyway.
Both problems require a ridiculously high min_size value, which is very
unlikely to occur; but both problems are also simply avoided by checking
whether min_size is sane before calculating new_l1_size (which should
still be checked separately, though).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
discard_single_l2() should not implement its own version of
qcow2_get_cluster_type(), but rather rely on this already existing
function. By doing so, it will work for compressed clusters as well
(which it did not so far).
Also, rename "old_offset" to "old_l2_entry", as both are quite different
(and the value is indeed of the latter kind).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When qcow2_get_cluster_offset() sees a zero cluster in a version 2
image, it (rightfully) returns an error. But in doing so it shouldn't
leak an L2 table cache reference.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
bs->total_sectors is not the highest possible sector number that could
be involved in a copy on write operation: VM state is after the end of
the virtual disk. This resulted in wrong values for the number of
sectors to be copied (n).
The code that checks for the end of the image isn't required any more
because the code hasn't been calling the block layer's bdrv_read() for a
long time; instead, it directly calls qcow2_readv(), which doesn't error
out on VM state sector numbers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The size in bytes is assigned to an int later, so check that instead of
the number of entries.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Before dereferencing bs->drv for a call to its member bdrv_co_readv(),
copy_sectors() should check whether that pointer is indeed valid, since
it may have been set to NULL by e.g. a concurrent write triggering the
corruption prevention mechanism.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Instead of making the backing file contents visible again after a discard
request, set the zero flag if possible (i.e. on version >= 3).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
n_start can be actually calculated from offset. The number of
sectors to be allocated(n_end - n_start) can be passed in in
num. By removing n_start and n_end, we can save two parameters.
The side effect is there is a bug in qcow2.c:preallocate() that
passes incorrect n_start to qcow2_alloc_cluster_offset() is
fixed. The bug can be triggerred by a larger cluster size than
the default value(65536), for example:
./qemu-img create -f qcow2 \
-o 'cluster_size=131072,preallocation=metadata' file.img 4G
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
if multiple sectors spanning multiple clusters are read the
function count_contiguous_clusters should ensure that the
cluster type should not change between the clusters.
Especially the for-loop should break when we have one
or more normal clusters followed by a compressed cluster.
Unfortunately the wrong macro was used in the mask to
compare the flags.
This was discovered while debugging a data corruption
issue when converting a compressed qcow2 image to raw.
qemu-img reads 2MB chunks which span multiple clusters.
CC: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace the legacy cpu_to_be64wu() with stq_be_p().
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-id: 1383669517-25598-9-git-send-email-peter.maydell@linaro.org
Signed-off-by: Anthony Liguori <aliguori@amazon.com>
In qcow2_check_metadata_overlap and qcow2_pre_write_overlap_check,
change the parameter signifying the checks to perform from its current
positive form to a negative one, i.e., it will no longer explicitly
specify every check to perform but rather a mask of checks not to
perform.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If an error occurs in l2_allocate, the allocated (but unused) L2 cluster
should be freed.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Switching the L1 table in memory should be an atomic operation, as far
as possible. Calling qcow2_free_clusters on the old L1 table on disk is
not a good idea when the old L1 table is no longer valid and the address
to the new one hasn't yet been written into the corresponding
BDRVQcowState field. To be more specific, this can lead to segfaults due
to qcow2_check_metadata_overlap trying to access the L1 table during the
free operation.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
All callers pass start = 0, and it's doubtful if any other value would
actually do what you expect. Remove the parameter.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Compressed clusters can never be contiguous, therefore the corresponding
flag does not need to be given explicitly to count_contiguous_clusters.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function is not intended to be used on compressed clusters and will
not work correctly, if used anyway, since L2E_OFFSET_MASK is not the
right mask for determining the offset of compressed clusters. Therefore,
assert that the first cluster is not compressed and always include the
compression flag in the mask of significant flags, i.e., stop the search
as soon as a compressed cluster occurs.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In expand_zero_clusters_in_l1, a new cluster is only allocated if it was
not already preallocated. On error, such preallocated clusters should
not be freed, but only the newly allocated ones.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Just returning -errno in some cases prevents
trace_qcow2_l2_allocate_done from being executed (and, in one case, also
the unused allocated L2 table from being freed). Always going down the
error path fixes this.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In l2_allocate, the fail path is executed if qcow2_cache_flush fails.
However, the L2 table has not yet been fetched from the L2 table cache.
The qcow2_cache_put in the fail path therefore basically gives an
undefined argument as the L2 table address (in this case).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since the expanded_clusters bitmap is addressed using host offsets in
the underlying image file, the correct size to use for allocating the
bitmap is not determined by the guest disk image but by the underlying
host image file.
Furthermore, this size may change during the expansion due to cluster
allocations on growable image files. In this case, the bitmap needs to
be resized as well to reflect the growth.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If qcow2_alloc_cluster_link_l2 is called with a QCowL2Meta describing a
request crossing L2 boundaries, a buffer overflow will occur. This is
impossible right now since such requests are never generated (every
request is shortened to L2 boundaries before) and probably also
completely unintended (considering the name "QCowL2Meta"), however, it
is still worth an assertion.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add functionality for expanding zero clusters. This is necessary for
downgrading the image version to one without zero cluster support.
For non-backed images, this function may also just discard zero clusters
instead of truly expanding them.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function will be used internally instead of only being called for
guest discard requests.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Since the OFLAG_COPIED checks are now executed after the refcounts have
been repaired (if repairing), it is safe to assume that they are correct
but the OFLAG_COPIED flag may be not. Therefore, if its value differs
from what it should be (considering the according refcount), that
discrepancy can be repaired by correctly setting (or clearing that flag.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The pre-write overlap check function is now called before most of the
qcow2 writes (aborting it on collision or other error).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This optimises the discard operation for freed clusters by batching
discard requests (both snapshot deletion and bdrv_discard end up
updating the refcounts cluster by cluster).
Note that we don't discard asynchronously, but keep s->lock held. This
is to avoid that a freed cluster is reallocated and written to while the
discard is still in flight.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds a refcount update reason to all callers of update_refcounts(),
so that a follow-up patch can use this information to decide whether
clusters that reach a refcount of 0 should be discarded in the image
file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>