If a relatively large cluster size is chosen, the default of 1 MB L2
cache is not really appropriate. In this case, unless overridden by the
user, the default cache size should not be determined by its size in
bytes but by the number of L2 tables (clusters) it is supposed to
contain.
Note that without this patch, MIN_L2_CACHE_SIZE will effectively take
over the same role. However, providing space for just two L2 tables is
not enough to be the default.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The L2 cache must cover at least two L2 tables, because during COW two
L2 tables are accessed simultaneously.
Reported-by: Alexander Graf <agraf@suse.de>
Cc: qemu-stable <qemu-stable@nongnu.org>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Tested-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This function never receives an invalid table pointer, so we can make
it void and remove all the error checking code.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The qcow2 L2/refcount cache contains one separate table for each cache
entry. Doing one allocation per table adds unnecessary overhead and it
also requires us to store the address of each table separately.
Since the size of the cache is constant during its lifetime, it's
better to have an array that contains all the tables using one single
allocation.
In my tests measuring freshly created caches with sizes 128MB (L2) and
32MB (refcount) this uses around 10MB of RAM less.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In recent qemu versions, it is possible to override the backing file
name and format that is stored in the image file with values given at
runtime. In such cases, the temporary override could end up in the
image header if the qcow2 header was updated, while obviously correct
behaviour would be to leave the on-disk backing file path/format
unchanged.
Fix this and add a test case for it.
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1428411796-2852-1-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Since refcounts do not always have to be a uint16_t, all refcount blocks
and arrays in memory should not have a specific type (thus they become
pointers to void) and for accessing them, two helper functions are used
(a getter and a setter). Those functions are called indirectly through
function pointers in the BDRVQcowState so they may later be exchanged
for different refcount orders.
With the check and repair functions using this function, the refcount
array they are creating will be in big endian byte order; additionally,
using realloc_refcount_array() makes the size of this refcount array
always cluster-aligned. Both combined allow rebuild_refcount_structure()
to drop the bounce buffer which was used to convert parts of the
refcount array to big endian byte order and store them on disk. Instead,
those parts can now be written directly.
[ kwolf: Fixed a build failure on 32 bit and another with old glib ]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Refcounts may have a width of up to 64 bits, so qemu should use the same
width to represent refcount values internally.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
update_refcount() and qcow2_update_cluster_refcount() currently take a
signed addend. At least one caller passes a value directly derived from
an absolute refcount that should be reached ("l2_refcount - 1" in
expand_zero_clusters_in_l1()). Therefore, the addend should be unsigned
as well; this will be especially important for 64 bit refcounts.
Because update_refcount() then no longer knows whether the refcount
should be increased or decreased, it now requires an additional flag
which specified exactly that. The same applies to
qcow2_update_cluster_refcount().
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Refcounts can theoretically be of type uint64_t; in order to be able to
represent the full range, qcow2_get_refcount() cannot use a single
variable to represent both all refcount values and also keep some values
reserved for errors.
One solution would be to add an Error pointer parameter to
qcow2_get_refcount(); however, no caller could (currently) pass that
error message, so it would have to be emitted immediately and be
passed to the next caller by returning -EIO or something similar.
Therefore, an Error parameter does not offer any advantages here.
The solution applied by this patch is simpler to use. Because no caller
would be able to pass the error message, they would have to print it and
free it, whereas with this patch the caller only needs to pass the
returned integer (which is often a no-op from the code perspective,
because that integer will be stored in a variable "ret" which will be
returned by the fail path of many callers).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add two new fields regarding refcount information (the bit width of
every entry and the maximum refcount value) to the BDRVQcowState.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The only user went away five years ago with commit a9420734 ('qcow2:
Simplify image creation'). It's about time to remove it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reading the refcount of a cluster is an operation which can be useful in
all of the qcow2 code, so make that function globally available.
While touching this function, amend the comment describing the "addend"
parameter: It is (no longer, if it ever was) necessary to have it set to
-1 or 1; any value is fine.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Message-id: 1414404776-4919-6-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The only really time-consuming operation potentially performed by
qcow2_amend_options() is zero cluster expansion when downgrading qcow2
images from compat=1.1 to compat=0.10, so report status of that
operation and that operation only through the status CB.
For this, approximate the progress as the number of L1 entries visited
during the operation.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Message-id: 1414404776-4919-5-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Normally, discarded sectors should read back as zero. However, there are
cases in which a sector (or rather cluster) should be discarded as if
they were never written in the first place, that is, reading them should
fall through to the backing file again.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1414159063-25977-2-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
With BDRVQcowState.refcount_block_bits, we don't need REFCOUNT_SHIFT
anymore.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The size of a refblock entry is (in theory) variable; calculate
therefore the number of entries per refblock and the according bit shift
(1 << x == entry count) when opening an image.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Being able to set the overlap-check option to a string and then refine
it via the overlap-check.* options is a nice idea for the command line
but does not work so well for non-flattened dicts. In that case, one can
only specify either but not both, so add a field to overlap-check.*
which does the same as directly specifying overlap-check but can be used
in conjunction with the other fields in non-flattened dicts.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1408557576-14574-4-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Add a helper function for easily marking an image corrupt (on fatal
corruptions) while outputting an informative message to stderr and via
QAPI.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Message-id: 1409926039-29044-3-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Add options for specifying the size of the metadata caches. This can
either be done directly for each cache (if only one is given, the other
will be derived according to a default ratio) or combined for both.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Specifying the metadata cache sizes in clusters results in less clusters
(and much less bytes) covered for small cluster sizes and vice versa.
Using a constant byte size reduces this difference, and makes it
possible to manually specify the cache size in an easily comprehensible
unit.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Even with a limit of 64k snapshots, each snapshot could have a filename
and an ID with up to 64k, which would still lead to pretty large
allocations, which could potentially lead to qemu aborting. Limit the
total size of the snapshot table to an average of 1k per entry when
the limit of 64k snapshots is fully used. This should be plenty for any
reasonable user.
This also fixes potential integer overflows of s->snapshot_size.
Suggested-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This avoids an unbounded allocation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In order to avoid integer overflows.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If the size becomes larger than what qcow2_open() would accept, fail the
growing operation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This avoid unbounded memory allocation and fixes a potential buffer
overflow on 32 bit hosts.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
n_start can be actually calculated from offset. The number of
sectors to be allocated(n_end - n_start) can be passed in in
num. By removing n_start and n_end, we can save two parameters.
The side effect is there is a bug in qcow2.c:preallocate() that
passes incorrect n_start to qcow2_alloc_cluster_offset() is
fixed. The bug can be triggerred by a larger cluster size than
the default value(65536), for example:
./qemu-img create -f qcow2 \
-o 'cluster_size=131072,preallocation=metadata' file.img 4G
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Accoring to qcow spec, the offset fields in l1e, l2e and ref table entry
start at bit 9. The offset is cluster offset, and the smallest possible
cluster size is 512 bytes.
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since later this function will be used so improve it. The only caller of it
now is qemu-img, and it is not impacted by introduce function
bdrv_snapshot_load_tmp_by_id_or_name() that call bdrv_snapshot_load_tmp()
twice to keep old search logic. bdrv_snapshot_load_tmp_by_id_or_name() return
int to let caller know the errno, and errno will be used later.
Also fix a typo in comments of bdrv_snapshot_delete().
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Introduces the macros QCOW2_OL_CONSTANT and QCOW2_OL_ALL in addition to
the already existing QCOW2_OL_CACHED, signifying all metadata overlap
checks that can be performed in constant time (regardless of image size
etc.) and truly all available overlap checks, respectively.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add runtime options to tune the overlap checks to be performed before
write accesses.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace the QCOW2_OL_DEFAULT macro by a variable overlap_check in
BDRVQcowState.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In qcow2_check_metadata_overlap and qcow2_pre_write_overlap_check,
change the parameter signifying the checks to perform from its current
positive form to a negative one, i.e., it will no longer explicitly
specify every check to perform but rather a mask of checks not to
perform.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QCowHeader and QCowExtension are structs that reside in the on-disk
image format, and are read and written directly via bdrv_pread()/write(),
and as such should be packed to avoid any unintentional struct padding.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Snapshot creation actually already distinguish id and name since it take
a structured parameter *sn, but delete can't. Later an accurate delete
is needed in qmp_transaction abort and blockdev-snapshot-delete-sync,
so change its prototype. Also *errp is added to tip error, but return
value is kepted to let caller check what kind of error happens. Existing
caller for it are savevm, delvm and qemu-img, they are not impacted by
introducing a new function bdrv_snapshot_delete_by_id_or_name(), which
check the return value and do the operation again.
Before this patch:
For qcow2, it search id first then name to find the one to delete.
For rbd, it search name.
For sheepdog, it does nothing.
After this patch:
For qcow2, logic is the same by call it twice in caller.
For rbd, it always fails in delete with id, but still search for name
in second try, no change to user.
Some code for *errp is based on Pavel's patch.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Save the image refcount order in BDRVQcowState. This will be relevant
for future code supporting different refcount orders than four and also
for code that needs to verify a certain refcount order for an opened
image.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add functionality for expanding zero clusters. This is necessary for
downgrading the image version to one without zero cluster support.
For non-backed images, this function may also just discard zero clusters
instead of truly expanding them.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a function for emptying a cache, i.e., flushing it and marking all
elements invalid.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
During savevm, the VM state is written to the active L1 of the image and
then a snapshot is taken. After that, the VM state isn't needed any more
in the active L1 and should be discarded. This is implemented by this
patch.
The impact of not discarding the VM state is that a snapshot can never
become smaller than any previous snapshot (because it would be padded
with old VM state), and more importantly that future savevm operations
cause unnecessary COWs (with associated flushes), which makes subsequent
snapshots much slower.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
The function will be used internally instead of only being called for
guest discard requests.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Since the OFLAG_COPIED checks are now executed after the refcounts have
been repaired (if repairing), it is safe to assume that they are correct
but the OFLAG_COPIED flag may be not. Therefore, if its value differs
from what it should be (considering the according refcount), that
discrepancy can be repaired by correctly setting (or clearing that flag.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Two new functions are added; the first one checks a given range in the
image file for overlaps with metadata (main header, L1 tables, L2
tables, refcount table and blocks).
The second one should be used immediately before writing to the image
file as it calls the first function and, upon collision, marks the
image as corrupt and makes the BDS unusable, thereby preventing
further access.
Both functions take a bitmask argument specifying the structures which
should be checked for overlaps, making it possible to also check
metadata writes against colliding with other structures.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds an incompatible bit indicating corruption to qcow2. Any image
with this bit set may not be written to unless for repairing (and
subsequently clearing the bit if the repair has been successful).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The expression "1LL << 63" tries to shift the 1 into the sign bit of a
'long long', which provokes a clang sanitizer warning:
runtime error: left shift of 1 by 63 places cannot be represented in type 'long long'
Use "1ULL << 63" as the definition of QCOW_OFLAG_COPIED instead
to avoid this. For consistency, we also update the other QCOW_OFLAG
definitions to use the ULL suffix rather than LL, though only the
shift by 63 is undefined behaviour.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is what QMP wants to use. The options haven't been enabled in any
release yet, so we're still free to change them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This optimises the discard operation for freed clusters by batching
discard requests (both snapshot deletion and bdrv_discard end up
updating the refcounts cluster by cluster).
Note that we don't discard asynchronously, but keep s->lock held. This
is to avoid that a freed cluster is reallocated and written to while the
discard is still in flight.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Deleted snapshots are discarded in the image file by default, discard
requests take their default from the -drive discard=... option and other
places that free clusters must always be enabled explicitly.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds a refcount update reason to all callers of update_refcounts(),
so that a follow-up patch can use this information to decide whether
clusters that reach a refcount of 0 should be discarded in the image
file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This catches the situation that is described in the bug report at
https://bugs.launchpad.net/qemu/+bug/865518 and goes like this:
$ qemu-img create -f qcow2 huge.qcow2 $((1024*1024))T
Formatting 'huge.qcow2', fmt=qcow2 size=1152921504606846976 encryption=off cluster_size=65536 lazy_refcounts=off
$ qemu-io /tmp/huge.qcow2 -c "write $((1024*1024*1024*1024*1024*1024 - 1024)) 512"
Segmentation fault
With this patch applied the segfault will be avoided, however the case
will still fail, though gracefully:
$ qemu-img create -f qcow2 /tmp/huge.qcow2 $((1024*1024))T
Formatting 'huge.qcow2', fmt=qcow2 size=1152921504606846976 encryption=off cluster_size=65536 lazy_refcounts=off
qemu-img: The image size is too large for file format 'qcow2'
Note that even long before these overflow checks kick in, you get
insanely high memory usage (up to INT_MAX * sizeof(uint64_t) = 16 GB for
the L1 table), so with somewhat smaller image sizes you'll probably see
qemu aborting for a failed g_malloc().
If you need huge image sizes, you should increase the cluster size to
the maximum of 2 MB in order to get higher limits.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Move aes.h from include/block to include/qemu to show it can be reused
by other subsystems.
Cc: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@gmail.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Instead of expecting a single l2meta, have a list of them. This allows
to still have a single I/O request for the guest data, even though
multiple l2meta may be needed in order to describe both a COW overwrite
and a new cluster allocation (typical sequential write case).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>