We require a C99 compiler, so let's use 'bool' instead of 'int'
when dealing with boolean values. There are few enough clients
to fix them all in one pass.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJVevYFAAoJEH8JsnLIjy/W3jEP/0hiQ3rCRZ/he8s5maTdT+TR
YSeHkB5rKpz0Uopn1DMn1QrIbUVzX7dyb+uf9zQ0/xRQIzf6k8uxqU/NWrdoF3NK
qx91dGWedwnG+TEBIMbcR7nMrw4dP6kH7uPz/VWMXDHVLz0HIcD95qhKgs0mSY6J
dWqex6ACjXM68zJU5IioagU9evV80WZE1S8z7zfixxtTBx5hCaTVbwalkaCxcrXw
PbZle55rjI8B10+OzgBw0fq10nias+NTndU9CwNBboxmEtAjq8/mQ663vcWlmiFo
9a/hkda27Z5ut/0Tqk1v4uLHauylp++rrAabPBAuCFMKes6cdkddP15Q/r52aJ29
5meodQtbet1rGrM+Aq4vuSuWId71PGypEI/3URDdNfYFNISoeLLsk4lcQUu7VrDD
sRX3Jt8SI3nkIgOnhPyi7NDPmafxFt8yRt5vM8MyR5ynF8NS/2hiAc3wqnbXGjUj
a5GqDCefb1yM0R5HvksuFFt3OnXlKJQ3J+ksXNUJf9DSAZPauqWD696pcTeg8wyy
3PIGkczgUuKTVfFWd3THZxJLAo7ZuqvXBHHV8o1SeMBDxwh4FhTd8Kjvm3rUNFfl
VDox4qwZ1AcxLrxgqazKU7sD9iWBDHURRcpOoUsBys7oxQZnhcmMp1fRlEkTOyrD
HNiSNByqBrtkfeVzHlSe
=QgYk
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer core and image format patches
# gpg: Signature made Fri Jun 12 16:08:53 2015 BST using RSA key ID C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
* remotes/kevin/tags/for-upstream: (25 commits)
block: Fix reopen flag inheritance
block: Add BlockDriverState.inherits_from
block: Add list of children to BlockDriverState
queue.h: Add QLIST_FIX_HEAD_PTR()
block: Drain requests before swapping nodes in bdrv_swap()
block: Move flag inheritance to bdrv_open_inherit()
block: Use QemuOpts in bdrv_open_common()
block: Use macro for cache option names
vmdk: Use bdrv_open_image()
quorum: Use bdrv_open_image()
check-qdict: Test cases for new functions
qdict: Add qdict_{set,copy}_default()
qdict: Add qdict_array_entries()
iotests: Add tests for overriding BDRV_O_PROTOCOL
block: driver should override flags in bdrv_open()
block: Change bitmap truncate conditional to assertion
block: record new size in bdrv_dirty_bitmap_truncate
raw-posix: Fix .bdrv_co_get_block_status() for unaligned image size
vmdk: Use vmdk_find_index_in_cluster everywhere
vmdk: Fix index_in_cluster calculation in vmdk_co_get_block_status
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
When reopening an image, the block layer already takes care to reopen
bs->file as well with recalculated inherited flags. The same must happen
for any other child (most notably missing before this patch: backing
files).
If bs->file (or any other child) didn't originally inherit from bs, e.g.
because it was created separately and then only referenced, it must not
inherit flags on reopen either, so check the inherited_from field before
propagation the reopen down.
VMDK already reopened its extents manually; this code can now be
dropped.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Instead of letting every caller of bdrv_open() determine the right flags
for its child node manually and pass them to the function, pass the
parent node and the role of the newly opened child (like backing file,
protocol layer, etc.).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Besides standardising on a single interface for opening child nodes,
this patch allows the user to specify options to individual extent
nodes. Overriding file names isn't possible with this yet, so it's of
limited usefulness, but still a step forward.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Besides standardising on a single interface for opening child nodes,
this simplifies the .bdrv_open() implementation of the quorum block
driver by using block layer functionality for handling BlockdevRefs.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Image files with an unaligned image size have a final hole that starts
at EOF, i.e. in the middle of a sector. Currently, *pnum == 0 is
returned when checking the status of this sector. In qemu-img, this
triggers an assertion failure.
In order to fix this, one type for the sector that contains EOF must be
found. Treating a hole as data is safe, so this patch rounds the
calculated number of data sectors up, so that a partial sector at EOF is
treated as a full data sector.
This fixes https://bugzilla.redhat.com/show_bug.cgi?id=1229394
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Cole Robinson <crobinso@redhat.com>
It has the similar issue with b1649fae49. Since the calculation
is repeated for a few times already, introduce a function so it can be
reused.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If a relatively large cluster size is chosen, the default of 1 MB L2
cache is not really appropriate. In this case, unless overridden by the
user, the default cache size should not be determined by its size in
bytes but by the number of L2 tables (clusters) it is supposed to
contain.
Note that without this patch, MIN_L2_CACHE_SIZE will effectively take
over the same role. However, providing space for just two L2 tables is
not enough to be the default.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The L2 cache must cover at least two L2 tables, because during COW two
L2 tables are accessed simultaneously.
Reported-by: Alexander Graf <agraf@suse.de>
Cc: qemu-stable <qemu-stable@nongnu.org>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Tested-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_swap() touches the fields of a BlockDriverState that are
protected by the ThrottleGroup lock. Although those fields end up in
their original place, they are temporarily swapped in the process,
so there's a chance that an operation on a member of the same group
happening on a different thread can try to use them.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: d92dc40d7c4f1fc5cda5cbbf4ffb7a4670b79d17.1433779731.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The throttle group support use a cooperative round robin scheduling
algorithm.
The principles of the algorithm are simple:
- Each BDS of the group is used as a token in a circular way.
- The active BDS computes if a wait must be done and arms the right
timer.
- If a wait must be done the token timer will be armed so the token
will become the next active BDS.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: f0082a86f3ac01c46170f7eafe2101a92e8fde39.1433779731.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Group throttling will share ThrottleState between multiple bs.
As a consequence the ThrottleState will be accessed by multiple aio
context.
Timers are tied to their aio context so they must go out of the
ThrottleState structure.
This commit paves the way for each bs of a common ThrottleState to
have its own timer.
Signed-off-by: Benoit Canet <benoit.canet@nodalink.com>
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 6cf9ea96d8b32ae2f8769cead38f68a6a0c8c909.1433779731.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Image files with an unaligned image size have a final hole that starts
at EOF, i.e. in the middle of a sector. Currently, *pnum == 0 is
returned when checking the status of this sector. In qemu-img, this
triggers an assertion failure.
In order to fix this, one type for the sector that contains EOF must be
found. Treating a hole as data is safe, so this patch rounds the
calculated number of data sectors up, so that a partial sector at EOF is
treated as a full data sector.
This fixes https://bugzilla.redhat.com/show_bug.cgi?id=1229394
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1433840108-9996-1-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Retain the function value for now, to permit selective conversion of
its callers.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
When the argument is non-zero, qemu_opts_foreach() stops on callback
returning non-zero, and returns that value.
When the argument is zero, it doesn't stop, and returns the bit-wise
inclusive or of all the return values. Funky :)
The callers that pass zero could just as well pass one, because their
callbacks can't return anything but zero:
* qemu_add_globals()'s callback qdev_add_one_global()
* qemu_config_write()'s callback config_write_opts()
* main()'s callbacks default_driver_check(), drive_enable_snapshot(),
vnc_init_func()
Drop the parameter, and always stop.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
raw_bsd already has QEMU_BUILD_BUG_ON(BDRV_SECTOR_SIZE != 512), so iscsi
should relax.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
When a qcow[2] file is opened, if the header reports an
encryption method, this is used to set the 'crypt_method_header'
field on the BDRVQcow[2]State struct, and the 'encrypted' flag
in the BDRVState struct.
When doing I/O operations, the 'crypt_method' field on the
BDRVQcow[2]State struct is checked to determine if encryption
needs to be applied.
The crypt_method_header value is copied into crypt_method when
the bdrv_set_key() method is called.
The QEMU code which opens a block device is expected to always
do a check
if (bdrv_is_encrypted(bs)) {
bdrv_set_key(bs, ....key...);
}
If code forgets to do this, then 'crypt_method' is never set
and so when I/O is performed, QEMU writes plain text data
into a sector which is expected to contain cipher text, or
when reading, will return cipher text instead of plain
text.
Change the qcow[2] code to consult bs->encrypted when deciding
whether encryption is required, and assert(s->crypt_method)
to protect against cases where the caller forgets to set the
encryption key.
Also put an assert in the set_key methods to protect against
the case where the caller sets an encryption key on a block
device that does not have encryption
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix pointer declaration to make it consistent with the rest of the
code.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This function never receives an invalid table pointer, so we can make
it void and remove all the error checking code.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The current cache algorithm traverses the array starting always from
the beginning, so the average number of comparisons needed to perform
a lookup is proportional to the size of the array.
By using a hash of the offset as the starting point, lookups are
faster and independent from the array size.
The hash is computed using the cluster number of the table, multiplied
by 4 to make it perform better when there are collisions.
In my tests, using a cache with 2048 entries, this reduces the average
number of comparisons per lookup from 430 to 2.5.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A cache miss means that the whole array was traversed and the entry
we were looking for was not found, so there's no need to traverse it
again in order to select an entry to replace.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The current algorithm to evict entries from the cache gives always
preference to those in the lowest positions. As the size of the cache
increases, the chances of the later elements of being removed decrease
exponentially.
In a scenario with random I/O and lots of cache misses, entries in
positions 8 and higher are rarely (if ever) evicted. This can be seen
even with the default cache size, but with larger caches the problem
becomes more obvious.
Using an LRU algorithm makes the chances of being removed from the
cache independent from the position.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since all tables are now stored together, it is possible to obtain
the position of a particular table directly from its address, so the
operation becomes O(1).
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The qcow2 L2/refcount cache contains one separate table for each cache
entry. Doing one allocation per table adds unnecessary overhead and it
also requires us to store the address of each table separately.
Since the size of the cache is constant during its lifetime, it's
better to have an array that contains all the tables using one single
allocation.
In my tests measuring freshly created caches with sizes 128MB (L2) and
32MB (refcount) this uses around 10MB of RAM less.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Richard Jones caught this bug with afl fuzzer.
In fact, that's the only possible value to overflow (extent->l1_size =
0x20000000) l1_size:
l1_size = extent->l1_size * sizeof(long) => 0x80000000;
g_try_malloc returns NULL because l1_size is interpreted as negative
during type casting from 'int' to 'gsize', which yields a enormous
value. Hence, by coincidence, we get a "not too bad" behavior:
qemu-img: Could not open '/tmp/afl6.img': Could not open
'/tmp/afl6.img': Cannot allocate memory
Values larger than 0x20000000 will be refused by the validation in
vmdk_add_extent.
Values smaller than 0x20000000 will not overflow l1_size.
Cc: qemu-stable@nongnu.org
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes the bug introduced by commit c6ac36e (vmdk: Optimize cluster
allocation).
Sometimes, write_len could be larger than cluster size, because it
contains both data and marker. We must advance next_cluster_sector in
this case, otherwise the image gets corrupted.
Cc: qemu-stable@nongnu.org
Reported-by: Antoni Villalonga <qemu-list@friki.cat>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Before a freed cluster can be reused, pending discards for this cluster
must be processed.
The original assumption was that this was not a problem because discards
are only cached during discard/write zeroes operations, which are
synchronous so that no concurrent write requests can cause cluster
allocations.
However, the discard/write zeroes operation itself can allocate a new L2
table (and it has to in order to put zero flags there), so make sure we
can cope with the situation.
This fixes https://bugs.launchpad.net/bugs/1349972.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
A bit of Boolean algebra (and common sense) tells us that the
second "if" here is looking for blocks that are not allocated.
This is the opposite of the "if" that sets BDRV_BLOCK_ALLOCATED,
and thus it can use an "else".
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1431599702-10431-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
For zero write, callers pass in NULL qiov (qemu-io "write -z" or
scsi-disk "write same").
Commit fc3959e466 fixed bdrv_co_write_zeroes which is the common case
for this bug, but it still exists in bdrv_aio_write_zeroes. A simpler
fix would be in bdrv_co_do_pwritev which is the NULL dereference point
and covers both cases.
So don't access it in bdrv_co_do_pwritev in this case, use three aligned
writes.
[Initialize ret to 0 in bdrv_co_do_zero_pwritev() to avoid uninitialized
variable warning with gcc 4.9.2.
--Stefan]
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1431522721-3266-3-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This reverts commit fc3959e466.
The core write code already handles the case, so remove this
duplication.
Because commit 61007b316 moved the touched code from block.c to
block/io.c, the change is manually reverted.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1431522721-3266-2-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The following sequence
int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
for (i = 0; i < 100000; i++)
write(fd, buf, 4096);
performs 5% better if buf is aligned to 4096 bytes.
The difference is quite reliable.
On the other hand we do not want at the moment to enforce bounce
buffering if guest request is aligned to 512 bytes.
The patch changes default bounce buffer optimal alignment to
MAX(page size, 4k). 4k is chosen as maximal known sector size on real
HDD.
The justification of the performance improve is quite interesting.
From the kernel point of view each request to the disk was split
by two. This could be seen by blktrace like this:
9,0 11 1 0.000000000 11151 Q WS 312737792 + 1023 [qemu-img]
9,0 11 2 0.000007938 11151 Q WS 312738815 + 8 [qemu-img]
9,0 11 3 0.000030735 11151 Q WS 312738823 + 1016 [qemu-img]
9,0 11 4 0.000032482 11151 Q WS 312739839 + 8 [qemu-img]
9,0 11 5 0.000041379 11151 Q WS 312739847 + 1016 [qemu-img]
9,0 11 6 0.000042818 11151 Q WS 312740863 + 8 [qemu-img]
9,0 11 7 0.000051236 11151 Q WS 312740871 + 1017 [qemu-img]
9,0 5 1 0.169071519 11151 Q WS 312741888 + 1023 [qemu-img]
After the patch the pattern becomes normal:
9,0 6 1 0.000000000 12422 Q WS 314834944 + 1024 [qemu-img]
9,0 6 2 0.000038527 12422 Q WS 314835968 + 1024 [qemu-img]
9,0 6 3 0.000072849 12422 Q WS 314836992 + 1024 [qemu-img]
9,0 6 4 0.000106276 12422 Q WS 314838016 + 1024 [qemu-img]
and the amount of requests sent to disk (could be calculated counting
number of lines in the output of blktrace) is reduced about 2 times.
Both qemu-img and qemu-io are affected while qemu-kvm is not. The guest
does his job well and real requests comes properly aligned (to page).
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1431441056-26198-3-git-send-email-den@openvz.org
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The patch introduces new concept: minimal memory alignment for bounce
buffers. Original so called "optimal" value is actually minimal required
value for aligment. It should be used for validation that the IOVec
is properly aligned and bounce buffer is not required.
Though, from the performance point of view, it would be better if
bounce buffer or IOVec allocated by QEMU will be aligned stricter.
The patch does not change any alignment value yet.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1431441056-26198-2-git-send-email-den@openvz.org
CC: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is the behavior in the operating system, for example Linux's
blkdev_write_iter has the following:
if (bdev_read_only(I_BDEV(bd_inode)))
return -EPERM;
This does not apply to opening a device for read/write, when the
device only supports read-only operation. In this case any of
EACCES, EPERM or EROFS is acceptable depending on why writing is
not possible.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1431013548-22492-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Try to perform IO for the biggest continuous block possible.
All blocks abscent in the image are accounted in the same type
and preallocation is made for all of them at once.
The performance for sequential write is increased from 200 Mb/sec to
235 Mb/sec on my SSD HDD.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-28-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Plain image expansion spends a lot of time to update image file size.
This seriously affects the performance. The following simple test
qemu_img create -f parallels -o cluster_size=64k ./1.hds 64G
qemu_io -n -c "write -P 0x11 0 1024M" ./1.hds
could be improved if the format driver will pre-allocate some space
in the image file with a reasonable chunk.
This patch preallocates 128 Mb using bdrv_write_zeroes, which should
normally use fallocate() call inside. Fallback to older truncate()
could be used as a fallback using image open options thanks to the
previous patch.
The benefit is around 15%.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Karan <rkagan@parallels.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-27-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is preparational commit for tweaks in Parallels image expansion.
The idea is that enlarge via truncate by one data block is slow. It
would be much better to use fallocate via bdrv_write_zeroes and
expand by some significant amount at once.
Original idea with sequential file writing to the end of the file without
fallocate/truncate would be slower than this approach if the image is
expanded with several operations:
- each image expanding means file metadata update, i.e. filesystem
journal write. Truncate/write to newly truncated space update file
metadata twice thus truncate removal helps. With fallocate call
inside bdrv_write_zeroes file metadata is updated only once and
this should happen infrequently thus this approach is the best one
for the image expansion
- tail writes are ordered, i.e. the guest IO queue could not be sent
immediately to the host introducing additional IO delays
This patch just adds proper parameters into BDRVParallelsState and
performs options parsing in parallels_open.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-26-git-send-email-den@openvz.org
CC: Roman Kagan <rkagan@parallels.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The idea is that we do not need to immediately sync BAT to the image as
from the guest point of view there is a possibility that IO is lost
even in the physical controller until flush command was finished.
bdrv_co_flush_to_os is exactly the right place for this purpose.
Technically the patch uses loaded BAT data as a cache and performs
actual on-disk metadata updates in parallels_co_flush_to_os callback.
This patch speed ups
qemu-img create -f parallels -o cluster_size=64k ./1.hds 64G
qemu-io -f parallels -c "write -P 0x11 0 1024k" 1.hds
writing from 50-60 Mb/sec to 80-90 Mb/sec on rotational media and
from 160 Mb/sec to 190 Mb/sec on SSD disk.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-25-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
calculate offset of the BAT entry in the image file.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-24-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Try to perform IO for the biggest continuous block possible.
The performance for sequential read is increased from 220 Mb/sec to
360 Mb/sec for continous image on my SSD HDD.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-23-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The software driver must set inuse field in Parallels header to
0x746F6E59 when the image is opened in read-write mode. The presence of
this magic in the header on open forces image consistency check.
There is an unfortunate trick here. We can not check for inuse in
parallels_check as this will happen too late. It is possible to do
that for simple check, but during the fix this would always report
an error as the image was opened in BDRV_O_RDWR mode. Thus we save
the flag in BDRVParallelsState for this.
On the other hand, nothing should be done to clear inuse in
parallels_check. Generic close will do the job right.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-21-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The check is very simple at the moment. It calculates necessary stats
and fix only the following errors:
- space leak at the end of the image. This would happens due to
preallocation
- clusters outside the image are zeroed. Nothing else could be done here
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-20-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This will help to avoid forward declarations for upcoming parallels_check
Some very obvious formatting fixes were made to the moved code to make
checkpatch happy.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-19-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This metadata cache would allow to properly batch BAT updates to disk
in next patches. These updates will be properly aligned to avoid
read-modify-write transactions on block level.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-18-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This will allow to use this data as buffer to BAT update directly
without any intermediate buffers.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-17-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
deduplicate copy/paste arithmetcs
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-16-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>