This fixes a regression introduced in commit 250196f1. The bug leads to
data corruption, found during an Autotest run with a Fedora 8 guest.
Consider a write request whose first part is covered by an already
allocated cluster, but additional clusters need to be newly allocated.
When counting the number of clusters to allocate, the qcow2 code would
decide to do COW for all remaining clusters of the write request, even
if some of them are already allocated.
If during this COW operation another write request is issued that touches
the same cluster, it will still refer to the old cluster. When the COW
completes, the first request will update the L2 table and the second
write request will be lost. Note that the requests need not overlap, it's
enough for them to touch the same cluster.
This patch ensures that only clusters that really require COW are
considered for allocation. In this case any other request writing to the
same cluster will be an allocating write and gets serialised.
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If cache references are held while the coroutine has yielded, the cache
may get used up and abort() when it can't find a free entry.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
count_cow_clusters() tries to reuse existing functions, and all it
achieves is to make things much more complicated than they really are:
Everything needs COW, unless it's a normal cluster with refcount 1.
This patch implements the obvious way of doing this, and by using
qcow2_get_cluster_type() it gets rid of all flag magic.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This changes the still existing places that assume that the only flags
are QCOW_OFLAG_COPIED and QCOW_OFLAG_COMPRESSED to properly mask out
reserved bits.
It does not convert bdrv_check yet.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_alloc_compressed_cluster_offset() already fails if the copied flag
is set, because qcow2_write_compressed() doesn't perform COW as it would
have to do to allow this.
However, what we really want to check here is whether the cluster is
allocated or not. With internal snapshots the copied flag may not be set
on allocated clusters. Check the cluster offset instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Until now, count_contiguous_clusters() has an argument that allowed to
specify flags that should be ignored in the comparison, i.e. that are
allowed to change between contiguous clusters.
This patch changes the function so that it ignores all flags by default
now and you need to pass the flags on which it should stop.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With this change, reading from a qcow2 image ignores all reserved bits
that are set in an L1 or L2 table entry.
Now get_cluster_offset() assigns *cluster_offset only the offset without
any other flags. The cluster type is not longer encoded in the offset,
but a positive return value in case of success.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If do_alloc_cluster_offset() fails, the error handling code tried to
remove the request from the in-flight queue, to which it wasn't added
yet, resulting in a NULL pointer dereference.
m->nb_clusters really only becomes != 0 when the request is in the list.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since everything goes through the cache, callers don't use the L2 table
offset any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
If the first part of a write request is allocated, but the second isn't
and it can be allocated so that the resulting area is contiguous, handle
it at once. This is a common case for sequential writes.
After this patch, alloc_cluster_offset() only checks if the clusters are
already allocated or how many new clusters can be allocated contigouosly.
The actual cluster allocation is split off into a new function
do_alloc_cluster_offset().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
A BlockDriverState should not issue requests on itself through the
public block layer interface. Nested, or reentrant, requests are
problematic because they do I/O throttling and request tracking twice.
Features like block layer copy-on-read use request tracking to avoid
race conditions between concurrent requests. The reentrant request will
have to "wait" for its parent request to complete. But the parent is
waiting for the reentrant request to make progress so we have reached
deadlock.
The solution is for block drivers to avoid the public block layer
interfaces for reentrant requests. Instead they should call their own
internal functions if they wish to perform reentrant requests.
This is also a good opportunity to make copy_sectors() a true
coroutine_fn. That means calling bdrv_co_writev() instead of
bdrv_write(). Behavior is unchanged but we're being explicit that this
executes in coroutine context.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Unlocking during COW allows for more parallelism. One change it requires is
that buffers are dynamically allocated instead of just using a per-image
buffer.
While touching the code, drop the synchronous qcow2_read() function and replace
it by a bdrv_read() call.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If during allocation of compressed clusters the cluster was already allocated
uncompressed, fail and properly release the l2_table (the latter avoids a
failed assertion).
While at it, make it return some real error numbers instead of -1.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
QCowL2Meta::offset is not cluster aligned but only sector aligned
however nb_clusters count cluster from cluster start.
This fix range check. Note that old code have no corruption issues
related to this check cause it only cause intersection to occur
when shouldn't.
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QCow2Meta structure was inserted into list before many fields are
initialized. Currently is not a problem cause all occur in a lock
but if qcow2_alloc_clusters would in a future unlock this lock
some issues could arise.
Initializing fields before inserting fix the problem.
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Documentation states the num is measured in clusters, but its
actually measured in sectors
Signed-off-by: Devin Nakamura <devin122@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If qcow2_cache_put returns an error during cluster allocation and the
allocation fails, it must be removed from the list of in-flight allocations.
Otherwise we'd get a loop in the list when the ACB is used for the next
allocation.
Luckily, this qcow2_cache_put shouldn't fail anyway because the L2 table is
only read, so that qcow2_cache_put doesn't even involve I/O.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This fixes memory leaks that may be caused by I/O errors during L1 table growth
(can happen during save_vm) and in qemu-img check.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When copying L2 tables (this happens only with internal snapshots), the order
wasn't completely safe, so that after a crash you could end up with a L2 table
that has too low refcount, possibly leading to corruption in the long run.
This patch puts the operations in the right order: First allocate the new
L2 table and replace the reference, and only then decrease the refcount of the
old table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When reading a compressed cluster failed, qcow2 falsely returned success.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
This adds a bdrv_discard function to qcow2 that frees the discarded clusters.
It does not yet pass the discard on to the underlying file system driver, but
the space can be reused by future writes to the image.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
qcow2 calls bdrv_flush() after performing COW in order to ensure that the
L2 table change is never written before the copy is safe on disk. Now that the
L2 table is cached, we can wait with flushing until we write out the next L2
table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
cpu_to_be64w() is called with an obviously non-aligned pointer. Use
cpu_to_be64wu() instead. It fixes unaligned accesses errors on IA64
hosts.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It doesn't really make sense for functions in qcow2.c to be named
qcow_ so convert the names to match correctly.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The cache content may be destroyed after a failed read, better not use it any
more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The L1 table grow operation includes a size calculation that bumps up
the new L1 table size in order to anticipate the size needs of vmstate
data. This helps reduce the number of times that the L1 table has to be
grown when vmstate data is appended.
This size overhead is not necessary during image creation,
bdrv_truncate(), or snapshot goto operations. In fact, existing
qemu-iotests that exercise table growth are no longer able to trigger it
because image creation preallocates an L1 table that is too large after
changes to qcow_create2().
This patch keeps the size calculation but also adds exact growth for
callers that do not want to inflate the L1 table size unnecessarily.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 used to use bounce buffers for any AIO requests. This does not only imply
unnecessary copying, but also unbounded allocations which should be avoided.
This patch removes bounce buffers from the normal AIO read path, and constrains
them to a constant size for encrypted images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We always have a sync for the refcount update when a new cluster is
allocated. If we move this past the COW, we can save an additional sync.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When a new cluster was allocated, we only need a flush after the write to the
L2 table if it was a COW and we need to decrease the refcounts of the old
clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If writing the L1 table to disk failed, we need to restore its old content in
memory to avoid inconsistencies.
Reported-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_get_cluster_offset() looks up a given virtual disk offset and returns the
offset of the corresponding cluster in the image file. Errors (e.g. L2 table
can't be read) are currenctly indicated by a return value of 0, which is
unfortuately the same as for any unallocated cluster. So in effect we can't
check for errors.
This makes the old return value a by-reference parameter and returns the usual
0/-errno error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
l2_allocate has some intermediate states in which the image is inconsistent.
Change the order to write to the L1 table only after the new L2 table has
successfully been initialized.
Also reset the L2 cache in failure case, it's very likely wrong.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the L2 table was already updated in cache, but writing it to disk has
failed, we must not continue using the changed version in the cache to stay
consistent with what's on the disk.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Format drivers shouldn't need to bother with things like file names, but rather
just get an open BlockDriverState for the underlying protocol. This patch
introduces this behaviour for bdrv_open implementation. For protocols which
need to access the filename to open their file/device/connection/... a new
callback bdrv_file_open is introduced which doesn't get an underlying file
opened.
For now, also some of the more obscure formats use bdrv_file_open because they
open() the file themselves instead of using the block.c functions. They need to
be fixed in later patches.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Returning NULL on error doesn't allow distinguishing between different errors.
Change the interface to return an integer for -errno.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>