Now that all bottom halves and callbacks take care of taking the
AioContext lock, we can migrate some users away from it and to a
specific QemuMutex or CoMutex.
Protect libiscsi calls with a QemuMutex. Callbacks are invoked
using bottom halves, so we don't even have to drop it around
callback invocations.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20170222180725.28611-4-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now that all bottom halves and callbacks take care of taking the
AioContext lock, we can migrate some users away from it and to a
specific QemuMutex or CoMutex.
Protect libnfs calls with a QemuMutex. Callbacks are invoked
using bottom halves, so we don't even have to drop it around
callback invocations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20170222180725.28611-3-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now that all bottom halves and callbacks take care of taking the
AioContext lock, we can migrate some users away from it and to a
specific QemuMutex or CoMutex.
Protect BDRVCURLState access with a QemuMutex.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20170222180725.28611-2-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJYsHaaAAoJEH8JsnLIjy/WvC4P/iw/WC6XySNzPMOzICrJr9fc
IHz447+0MqoIBWqGRFLEU8z4t6k8HGt2YnWKFuS1N+2gU5fSgOdp20nAA+pQvxRb
RTyQL7BNJPvFNzrEQPTpdvBt79veYsoNe5dNfSq01z/PdgxMQR4PkS96Jm8w4Jdu
20ZPfULws8+Wq1Snsxl4PdnFpY2n97OOGQRCjl50h5ypol5+fXDDzAvp/GPf6Q8B
09OTWmQ4UIr4OZqT83T1kDdRZRRMIxiRP5qTTKdh0BlJEQdBjrJ9fTClY4bMcIgl
VP/3kzkmdoqSW+4D+0L88q7vyvM3Kwc6n2PFDaRf6Lgy9ueYoPKuW9AF5vzcxhED
AI0SN9boM+4z1P75kalBaHg/BiRL7UDwpHgdanPVcENoP8IywsKzOIvdjOpqINmX
uLU3QfK+qK6x7gflhVXiX2sFzTLzZQlWu5KPjW6QfuMzUuRAeksfMDXh2GXNFVG1
igGnORyqB2nOGNAsudtMrOvjHQRSbwGsnvWcwvaX0swrxOb3oAwc3X+E4nxE/K5/
wJQLrCM9DfD3CKKlKiu+1wQXx6zqaGYIsNxLcrSylJ3TPsGuxfkx3b9GVCCQ0Y+e
YSdwWSAIS9LxR4qISJ5ehlnYDA0sRtdco0xGW6ACNE5IrgREYUcH4EJOXbxZ4NHs
BK/DVdYhFtNZ60zHeEn4
=OYgF
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches
# gpg: Signature made Fri 24 Feb 2017 18:08:26 GMT
# gpg: using RSA key 0x7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream:
tests: Use opened block node for block job tests
vvfat: Use opened node as backing file
block: Add bdrv_new_open_driver()
block: Factor out bdrv_open_driver()
block: Use BlockBackend for image probing
block: Factor out bdrv_open_child_bs()
block: Attach bs->file only during .bdrv_open()
block: Pass BdrvChild to bdrv_truncate()
mirror: Resize active commit base in mirror_run()
qcow2: Use BB for resizing in qcow2_amend_options()
blockdev: Use BlockBackend to resize in qmp_block_resize()
iotests: Fix another race in 030
qemu-img: Improve documentation for PREALLOC_MODE_FALLOC
qemu-img: Truncate before full preallocation
qemu-img: Add tests for raw image preallocation
qemu-img: Do not truncate before preallocation
qemu-iotests: redirect nbd server stdout to /dev/null
qemu-iotests: add ability to exclude certain protocols from tests
qemu-iotests: Test 137 only supports 'file' protocol
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Rbd can do readv and writev directly, so wo do not need to transform
iov to buf or vice versa any more.
Signed-off-by: tianqing <tianqing@unitedstack.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
if the passed qiov contains exactly one iov we can
pass the buffer directly.
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1487349541-10201-3-git-send-email-pl@kamp.de
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1487349541-10201-2-git-send-email-pl@kamp.de
Signed-off-by: Jeff Cody <jcody@redhat.com>
We should not try to assign a not yet opened node as the backing file,
because as soon as the permission system is added it will fail. The
just added bdrv_new_open_driver() function is the right tool to open a
file with an internal driver, use it.
In case anyone wonders whether that magic fake backing file to trigger a
special action on 'commit' actually works today: No, not for me. One
reason is that we've been adding a raw format driver on top for several
years now and raw doesn't support commit. Other reasons include that the
backing file isn't writable and the driver doesn't support reopen, and
it's also size 0 and the driver doesn't support bdrv_truncate. All of
these are easily fixable, but then 'commit' ended up in an infinite loop
deep in the vvfat code for me, so I thought I'd best leave it alone. I'm
not really sure what it was supposed to do anyway.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
The way that attaching bs->file worked was a bit unusual in that it was
the only child that would be attached to a node which is not opened yet.
Because of this, the block layer couldn't know yet which permissions the
driver would eventually need.
This patch moves the point where bs->file is attached to the beginning
of the individual .bdrv_open() implementations, so drivers already know
what they are going to do with the child. This is also more consistent
with how driver-specific children work.
For a moment, bdrv_open() gets its own BdrvChild to perform image
probing, but instead of directly assigning this BdrvChild to the BDS, it
becomes a temporary one and the node name is passed as an option to the
drivers, so that they can simply use bdrv_open_child() to create another
reference for their own use.
This duplicated child for (the not opened yet) bs is not the final
state, a follow-up patch will change the image probing code to use a
BlockBackend, which is completely independent of bs.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This is more consistent with the commit block job, and it moves the code
to a place where we already have the necessary BlockBackends to resize
the base image when bdrv_truncate() is changed to require a BdrvChild.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
In order to able to convert bdrv_truncate() to take a BdrvChild and
later to correctly check the resize permission here, we need to use a
BlockBackend for resizing the image.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Now that we are truncating the file in both PREALLOC_MODE_FULL and
PREALLOC_MODE_OFF, not truncating in PREALLOC_MODE_FALLOC looks odd.
Add a comment explaining why we do not truncate in this case.
Signed-off-by: Nir Soffer <nirsof@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In a previous commit (qemu-img: Do not truncate before preallocation) we
moved truncate to the PREALLOC_MODE_OFF branch to avoid slowdown in
posix_fallocate().
However this change is not optimal when using PREALLOC_MODE_FULL, since
knowing the final size from the beginning could allow the file system
driver to do less allocations and possibly avoid fragmentation of the
file.
Now we truncate also before doing full preallocation.
Signed-off-by: Nir Soffer <nirsof@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When using file system that does not support fallocate() (e.g. NFS <
4.2), truncating the file only when preallocation=OFF speeds up creating
raw file.
Here is example run, tested on Fedora 24 machine, creating raw file on
NFS version 3 server.
$ time ./qemu-img-master create -f raw -o preallocation=falloc mnt/test 1g
Formatting 'mnt/test', fmt=raw size=1073741824 preallocation=falloc
real 0m21.185s
user 0m0.022s
sys 0m0.574s
$ time ./qemu-img-fix create -f raw -o preallocation=falloc mnt/test 1g
Formatting 'mnt/test', fmt=raw size=1073741824 preallocation=falloc
real 0m11.601s
user 0m0.016s
sys 0m0.525s
$ time dd if=/dev/zero of=mnt/test bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 15.6627 s, 68.6 MB/s
real 0m16.104s
user 0m0.009s
sys 0m0.220s
Running with strace we can see that without this change we do one
pread() and one pwrite() for each block. With this change, we do only
one pwrite() per block.
$ strace ./qemu-img-master create -f raw -o preallocation=falloc mnt/test 8192
...
pread64(9, "\0", 1, 4095) = 1
pwrite64(9, "\0", 1, 4095) = 1
pread64(9, "\0", 1, 8191) = 1
pwrite64(9, "\0", 1, 8191) = 1
$ strace ./qemu-img-fix create -f raw -o preallocation=falloc mnt/test 8192
...
pwrite64(9, "\0", 1, 4095) = 1
pwrite64(9, "\0", 1, 8191) = 1
This happens because posix_fallocate is checking if each block is
allocated before writing a byte to the block, and when truncating the
file before preallocation, all blocks are unallocated.
Signed-off-by: Nir Soffer <nirsof@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After a visit of a complex QAPI type FOO
ov = qobject_output_visitor_new(&foo);
visit_type_FOO(ov, NULL, expr, &error_abort);
visit_complete(ov, &foo);
we can safely assume qobject_type(foo) is QTYPE_QDICT. We do in many
places, but occasionally assert qobject_type(obj) == QTYPE_QDICT.
Don't. The appropriate place to check such fundamental properties of
QAPI visitors is the test suite.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <1487363905-9480-15-git-send-email-armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
If explicit zeroing out before mirroring is required for the target image,
it moves the block job offset counter to EOF, then offset and len counters
count the image size twice. There is no harm but stats are confusing,
specifically the progress of the operation is always reported as 99% by
management tools.
The patch skips offset increase for the first "technical" pass over the
image. This should not cause any further harm.
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1486045515-8009-1-git-send-email-den@openvz.org
CC: Jeff Cody <jcody@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
This adds blockdev-add support for iscsi devices.
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
This was previously only available with -iscsi. Again, after this patch,
the -iscsi option only takes effect if an URL is given. New users are
supposed to use the new driver-specific option.
All -iscsi options have a corresponding driver-specific option for the
iscsi block driver now.
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
This was previously only available with -iscsi. Again, after this patch,
the -iscsi option only takes effect if an URL is given. New users are
supposed to use the new driver-specific option.
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
This was previously only available with -iscsi. Again, after this patch,
the -iscsi option only takes effect if an URL is given. New users are
supposed to use the new driver-specific option.
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
This splits the logic in the old parse_chap() function into a part that
parses the -iscsi options into the new driver-specific options, and
another part that actually applies those options (called apply_chap()
now).
Note that this means that username and password specified with -iscsi
only take effect when a URL is provided. This is intentional, -iscsi is
a legacy interface only supported for compatibility, new users should
use the proper driver-specific options.
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
This introduces a .bdrv_parse_filename handler for iscsi which parses an
URL if given and translates it to individual options.
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
All that CoQueue needs in order to become thread-safe is help
from an external mutex. Add this to the API.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213181244.16297-6-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-16-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-15-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This covers both file descriptor callbacks and polling callbacks,
since they execute related code.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-14-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-13-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
qed_aio_start_io and qed_aio_next_io will not have to acquire/release
the AioContext, while qed_aio_next_io_cb will. Split the functionality
and gain a little type-safety in the process.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-11-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Keep the coroutine on the same AioContext. Without this change,
there would be a race between yielding the coroutine and reentering it.
While the race cannot happen now, because the code only runs from a single
AioContext, this will change with multiqueue support in the block layer.
While doing the change, replace custom bottom half with aio_co_schedule.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-10-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In the client, read the reply headers from a coroutine, switching the
read side between the "read header" coroutine and the I/O coroutine that
reads the body of the reply.
In the server, if the server can read more requests it will create a new
"read request" coroutine as soon as a request has been read. Otherwise,
the new coroutine is created in nbd_request_put.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-8-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
qcow2_create2 calls this. Do not run a nested event loop, as that
breaks when aio_co_wake tries to queue the coroutine on the co_queue_wakeup
list of the currently running one.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213135235.12274-4-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
AioContext is fairly self contained, the only dependency is QEMUTimer but
that in turn doesn't need anything else. So move them out of block-obj-y
to avoid introducing a dependency from io/ to block-obj-y.
main-loop and its dependency iohandler also need to be moved, because
later in this series io/ will call iohandler_get_aio_context.
[Changed copyright "the QEMU team" to "other QEMU contributors" as
suggested by Daniel Berrange and agreed by Paolo.
--Stefan]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20170213135235.12274-2-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The metadata overlap checks introduced in a40f1c2add help detect
corruption in the qcow2 image by verifying that data writes don't
overlap with existing metadata sections.
The 'refcount-block' check in particular iterates over the refcount
table in order to get the addresses of all refcount blocks and check
that none of them overlap with the region where we want to write.
The problem with the refcount table is that since it always occupies
complete clusters its size is usually very big. With the default
values of cluster_size=64KB and refcount_bits=16 this table holds 8192
entries, each one of them enough to map 2GB worth of host clusters.
So unless we're using images with several TB of allocated data this
table is going to be mostly empty, and iterating over it is a waste of
CPU. If the storage backend is fast enough this can have an effect on
I/O performance.
This patch keeps the index of the last used (i.e. non-zero) entry in
the refcount table and updates it every time the table changes. The
refcount-block overlap check then uses that index instead of reading
the whole table.
In my tests with a 4GB qcow2 file stored in RAM this doubles the
amount of write IOPS.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-id: 20170201123828.4815-1-berto@igalia.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
commit 94d6a7a accidentally left the naming of runtime opts and QAPI
scheme inconsistent. As one consequence passing of parameters in the
URI is broken. Sync the naming of the runtime opts to the QAPI
scheme.
Please note that this is technically backwards incompatible with the 2.8
release, but the 2.8 release is the only version that had the wrong naming.
Furthermore release 2.8 suffered from a NULL pointer dereference during
URI parsing.
Fixes: 94d6a7a76e
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-id: 1485942829-10756-3-git-send-email-pl@kamp.de
[mreitz: Fixed commit message]
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
parse_uint_full wants to put the parsed value into the
variable passed via its second argument which is NULL.
Fixes: 94d6a7a76e
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1485942829-10756-2-git-send-email-pl@kamp.de
Signed-off-by: Max Reitz <mreitz@redhat.com>
In order to reduce the execution time, this patch optimize
the qmp_query_blockstats():
Remove the next_query_bds function.
Remove the bdrv_query_stats function.
Remove some judgement sentence.
The original qmp_query_blockstats calls next_query_bds to get
the next objects in each loops. In the next_query_bds, it checks
the query_nodes and blk. It also call bdrv_query_stats to get
the stats, In the bdrv_query_stats, it checks blk and bs each
times. This waste more times, which may stall the main loop a
bit. And if the disk is too many and donot use the dataplane
feature, this may affect the performance in main loop thread.
This patch removes that two functions, and makes the structure
clearly.
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Message-id: 1484467275-27919-3-git-send-email-douly.fnst@cn.fujitsu.com
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[mreitz: Removed duplicate info->value assignment]
Signed-off-by: Max Reitz <mreitz@redhat.com>
The bdrv_query_stats and bdrv_query_bds_stats functions need to call
each other, that increases the coupling. it also makes the program
complicated and makes some unnecessary tests.
Remove the call from bdrv_query_bds_stats to bdrv_query_stats, just
take some recursion to make it clearly.
Avoid testing whether the blk is NULL during querying the bds stats.
It is unnecessary.
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Message-id: 1484467275-27919-2-git-send-email-douly.fnst@cn.fujitsu.com
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The problem was triggered by qemu-iotests case 055. It failed when it
was comparing the compressed vmdk image with original test.img.
The cause is that buf_len in vmdk_write_extent wasn't converted to
little-endian before it was stored to disk. But later vmdk_read_extent
read it and converted it from little-endian to cpu endian.
If the cpu is big-endian like s390, the problem will happen and
the data length read by vmdk_read_extent will become invalid!
The fix is to add the conversion in vmdk_write_extent, meanwhile,
repair the endianness problem of lba field which shall also be converted
to little-endian before storing to disk.
Cc: qemu-stable@nongnu.org
Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
Signed-off-by: Jing Liu <liujbjl@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20161216052040.53067-2-haoqf@linux.vnet.ibm.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
@bs doesn't always have a device name, such as when it comes from
"qemu-img info". Report file name instead.
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20170119130759.28319-2-famz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Wrap the code that was copied repeatedly in the two functions,
sd_aio_setup and sd_aio_complete.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161129113245.32724-6-pbonzini@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Add to the list in add_aio_request and, indirectly, resend_aioreq. Inline
free_aio_req in the caller, it does not simply undo alloc_aio_req's job.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161129113245.32724-5-pbonzini@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Sheepdog's AIOCB are completely internal entities for a group of
requests and do not need dynamic allocation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161129113245.32724-4-pbonzini@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Delimit co_recv's lifetime clearly in aio_read_response.
Do a simple qemu_coroutine_enter in aio_read_response, letting
sd_co_writev call sd_write_done.
Handle nr_pending in the same way in sd_co_rw_vector,
sd_write_done and sd_co_flush_to_disk.
Remove sd_co_rw_vector's return value; just leave with no
pending requests.
[Jeff: added missing 'return' back, spotted by Paolo after
series was applied.]
Signed-off-by: Jeff Cody <jcody@redhat.com>
SheepdogAIOCB is internal to sheepdog.c, hence it is never canceled.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161129113245.32724-2-pbonzini@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
There are a number of unused trace events that
scripts/cleanup-trace-events.pl finds. The "hw/vfio/pci-quirks.c"
filename was typoed and "qapi/qapi-visit-core.c" was missing the qapi/
directory prefix.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20170126171613.1399-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
commit f57b4b5f moved qemu_iscsi_opts into vl.c. This
made them invisible for qemu-img, qemu-nbd etc.
Fixes: f57b4b5fb1
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <1485262161-18543-1-git-send-email-pl@kamp.de>
[Drop useless #ifdef. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We can get the maximum number of bytes for a single I/O transfer
from the BLKSECTGET ioctl, but we only perform this for block
devices. scsi-generic devices are represented as character devices,
and so do not issue this today. Update this, so that virtio-scsi
devices using the scsi-generic interface can return the same data.
Signed-off-by: Eric Farman <farman@linux.vnet.ibm.com>
Message-Id: <20170120162527.66075-4-farman@linux.vnet.ibm.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 6f6071745b ("raw-posix: Fetch max sectors for host block device")
introduced a routine to call the kernel BLKSECTGET ioctl, which stores the
result back to user space. However, the size of the data returned depends
on the routine handling the ioctl. The (compat_)blkdev_ioctl returns a
short, while sg_ioctl returns an int. Thus, on big-endian systems, we can
find ourselves accidentally shifting the result to a much larger value.
(On s390x, a short is 16 bits while an int is 32 bits.)
Also, the two ioctl handlers return values in different scales (block
returns sectors, while sg returns bytes), so some tweaking of the outputs
is required such that hdev_get_max_transfer_length returns a value in a
consistent set of units.
Signed-off-by: Eric Farman <farman@linux.vnet.ibm.com>
Message-Id: <20170120162527.66075-3-farman@linux.vnet.ibm.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
nb_cls_shrunk in iscsi_allocmap_update can become -1 if the
request starts and ends within the same cluster. This results
in passing -1 to bitmap_set and bitmap_clear and they don't
handle negative values properly. In the end this leads to data
corruption.
Fixes: e1123a3b40
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <1484579832-18589-1-git-send-email-pl@kamp.de>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If a migration is already in progress and somebody attempts
to add a migration blocker, this should rightly fail.
Add an errp parameter and a retcode return value to migrate_add_blocker.
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
Message-Id: <1484566314-3987-5-git-send-email-ashijeetacharya@gmail.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Acked-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Merged with recent 'Allow invtsc migration' change
Remove the "// assert(is_consistent(s))" comment in block/vvfat.c
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
Message-Id: <1484566314-3987-2-git-send-email-ashijeetacharya@gmail.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
bdrv_io_plug and bdrv_io_unplug are only called (via their
BlockBackend equivalents) after starting asynchronous I/O.
bdrv_drain is not going to be called while they are running,
because---even if a coroutine runs for some reason---it will
only drain in the next iteration of the event loop through
bdrv_co_yield_to_drain.
So this mechanism is unnecessary, get rid of it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161129113334.605-1-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
These files deal with the file protocol, not the raw format (the
file protocol is often used with other formats, and the raw
format is not forced to use the file protocol). Rename things
to make it a bit easier to follow.
Suggested-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Given that we have raw-win32.c and raw-posix.c, my initial guess at
raw_bsd.c was that it was for dealing with raw files using code
specific to the BSD operating system (beyond what raw-posix could
do). Not so - this name was chosen back in commit e1c66c6 to
distinguish that it was a BSD licensed file, in contrast to the
then-existing raw.c with an unclear and potentially unusable
license. But since it has been more than three years since the
rewrite, it's time to pick a more useful name for this file to
avoid this type of confusion to future contributors that don't know
the backstory, as none of our other files are named solely by the
license they use.
In reality, this file deals with the raw format, which is useful
with any number of protocols, while raw-{win32,posix} deal with
the file protocol (and in turn, that protocol is not limited to
use with the raw format). So rename raw_bsd to raw-format.c. We
could have also used the shorter name raw.c, except that collides
with the earlier use of that filename for a different license,
and it's better to be safe than risk license pollution.
The next patch will also rename raw-win32.c and raw-posix.c to
further distinguish the difference in roles.
It doesn't hurt that this gets rid of an underscore in the filename,
thereby making tab-completion on 'ra<TAB>' easier (now I don't have
to type the shift key, which slows things down :)
Suggested-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This enables byte granularity requests for blkverify, and at the same
time gets us rid of another user of the BDS-level AIO emulation.
The reference output of a test case must be changed because the
verification failure message reports byte offsets instead of sectors
now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This enables byte granularity requests for blkdebug, and at the same
time gets us rid of another user of the BDS-level AIO emulation.
Note that unless align=512 is specified, this can behave subtly
different from the old behaviour because bdrv_co_preadv/pwritev don't
have to perform alignment adjustments any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Make sure that all fields of the new QuorumAIOCB are zeroed when the
function returns even without explicitly setting them. This will protect
us when new fields are added, removes some explicit zero assignment and
makes the code a little nicer to read.
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Inlining the function removes some boilerplace code and replaces
recursion by a simple loop, so the code becomes somewhat easier to
understand.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This enables byte granularity requests on quorum nodes.
Note that the QMP events emitted by the driver are an external API that
we were careless enough to define as sector based. The offset and length
of requests reported in events are rounded therefore.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Replacing it with bdrv_co_pwritev() prepares us for byte granularity
requests and gets us rid of the last bdrv_aio_*() user in quorum.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This is a conversion to a more natural coroutine style and improves the
readability of the driver.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Instead of calling quorum_aio_finalize() deeply nested in what used
to be an AIO callback, do it in the same functions that allocated the
AIOCB.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This converts the quorum block driver from implementing callback-based
interfaces for read/write to coroutine-based ones. This is the first
step that will allow us further simplification of the code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
There is no point in passing the value of bs->opaque in order to
overwrite it with itself.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
The Linux AIO userspace ABI includes a ring that is shared with the
kernel. This allows userspace programs to process completions without
system calls.
Add an AioContext poll handler to check for completions in the ring.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161201192652.9509-6-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The new AioPollFn io_poll() argument to aio_set_fd_handler() and
aio_set_event_handler() is used in the next patch.
Keep this code change separate due to the number of files it touches.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161201192652.9509-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The qcow2_make_empty() function is reached during 'qemu-img commit',
in order to clear out ALL clusters of an image. However, if the
image cannot use the fast code path (true if the image is format
0.10, or if the image contains a snapshot), the cluster size is
larger than 512, and the image is larger than 2G in size, then our
choice of sector_step causes problems. Since it is not cluster
aligned, but qcow2_discard_clusters() silently ignores an unaligned
head or tail, we are leaving clusters allocated.
Enhance the testsuite to expose the flaw, and patch the problem by
ensuring our step size is aligned.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The QMP definition of BlockdevOptionsNfs:
{ 'struct': 'BlockdevOptionsNfs',
'data': { 'server': 'NFSServer',
'path': 'str',
'*user': 'int',
'*group': 'int',
'*tcp-syn-count': 'int',
'*readahead-size': 'int',
'*page-cache-size': 'int',
'*debug-level': 'int' } }
To make this consistent with other block protocols like gluster, lets
change s/debug-level/debug/
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
The QMP definition of BlockdevOptionsGluster:
{ 'struct': 'BlockdevOptionsGluster',
'data': { 'volume': 'str',
'path': 'str',
'server': ['GlusterServer'],
'*debug-level': 'int',
'*logfile': 'str' } }
But instead of 'debug-level we have exported 'debug' as the option for choosing
debug level of gluster protocol driver.
This patch fix QMP definition BlockdevOptionsGluster
s/debug-level/debug/
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJYPZu6AAoJEH8JsnLIjy/W1IAP/AwV0sWafsSMnWiz/4NVqeh3
Yk2cBtxCBmnq1y+PilLoZBdHui/RumwVuZKaShs3JA1n5CB1AjsVtEVl/6rQM7lv
yymLr32pODuf4eaGwGY09FqTiL0Erlm846zbSDjkiKbTYoKpzRv0PT2iiA6yTnjO
Mrs5nG7kEWdXPZ0ZsJyEyU3+vs7rNg+4N/VfTdPmCrV5DVBvAeCawM6JXHQNc7LV
ER6Y8W9PAu5mYqwekjAW07lPCudytAsOTrbTTO9Sv/+kZUdKEmv7ZHJrPdECCb6N
vcPOYOzKsEvvR8E0YZtuJDK9W4RTakxdlTste+TtW3VSt1Cs0zpvCFytaGuC+Kmq
mhlA4lYLDvaiNOMl09SvIjjxGI7+FO+1XsY7e4rI5PJzOKWZMFOIwQMNxE3B2qUI
dxd6izf7fzF4V5uDDwHTJ8TAiJDSAe6Bkz+vzipQtu5NARl/isbQuIPIGXPkxZln
fkCYA8/7EXrLXqd3khiRqEHS60ZtNgfm4ss8euMlWAgJAz0RLC1d/XhOIxaCQOg3
R/F9UdJAon6mfOgamZs5yzJgaPU6M90g/QipMB3Ub00VODacTiA81QUjZdEgELBB
zvhgeja7qdIvOh9r9heCCuUTmkWRRppmkrKdqFLowZ2aWosISy/UjiPTGdjDdq7Z
LsfYiRXsW94FmNXdKCqg
=3WTz
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'kwolf/tags/for-upstream' into staging
Block layer patches for 2.8.0-rc2
# gpg: Signature made Tue 29 Nov 2016 03:16:10 PM GMT
# gpg: using RSA key 0x7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* kwolf/tags/for-upstream:
docs: Specify that cache-clean-interval is only supported in Linux
qcow2: Remove stale comment
qcow2: Allow 'cache-clean-interval' in Linux only
qcow2: Make qcow2_cache_table_release() work only in Linux
Message-id: 1480436227-2211-1-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The cache-clean-interval option of qcow2 only works on Linux. However
we allow setting it in other systems regardless of whether it works or
not.
In those systems this option is not simply a no-op: it actually
invalidates perfectly valid cache tables for no good reason without
freeing their memory.
This patch forbids using that option in non-Linux systems.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are using QEMU_MADV_DONTNEED to discard the memory of individual L2
cache tables. The problem with this is that those semantics are
specific to the Linux madvise() system call. Other implementations of
madvise() (including the very Linux implementation of posix_madvise())
don't do that, so we cannot use them for the same purpose.
This patch makes the code Linux-specific and uses madvise() directly
since there's no point in going through qemu_madvise() for this.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit fa778fff wired up support to send the NBD_CMD_WRITE_ZEROES,
but forgot to inform the block layer that FUA unmapping of zeroes is
supported. Without BDRV_REQ_MAY_UNMAP listed as a supported flag,
the block layer will always insist on the NBD layer passing
NBD_CMD_FLAG_NO_HOLE, resulting in the server always allocating
things even when it was desired to let the server punch holes.
Similarly, failing to set BDRV_REQ_FUA means that the client may
send unnecessary NBD_CMD_FLUSH when it could have instead used the
NBD_CMD_FLAG_FUA bit.
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1479413642-22463-2-git-send-email-eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Discard is advisory, so rounding the requests to alignment
boundaries is never semantically wrong from the data that
the guest sees. But at least the Dell Equallogic iSCSI SANs
has an interesting property that its advertised discard
alignment is 15M, yet documents that discarding a sequence
of 1M slices will eventually result in the 15M page being
marked as discarded, and it is possible to observe which
pages have been discarded.
Between commits 9f1963b and b8d0a980, we converted the block
layer to a byte-based interface that ultimately ignores any
unaligned head or tail based on the driver's advertised
discard granularity, which means that qemu 2.7 refuses to
pass any discard request smaller than 15M down to the Dell
Equallogic hardware. This is a slight regression in behavior
compared to earlier qemu, where a guest executing discards
in power-of-2 chunks used to be able to get every page
discarded, but is now left with various pages still allocated
because the guest requests did not align with the hardware's
15M pages.
Since the SCSI specification says nothing about a minimum
discard granularity, and only documents the preferred
alignment, it is best if the block layer gives the driver
every bit of information about discard requests, rather than
rounding it to alignment boundaries early.
Rework the block layer discard algorithm to mirror the write
zero algorithm: always peel off any unaligned head or tail
and manage that in isolation, then do the bulk of the request
on an aligned boundary. The fallback when the driver returns
-ENOTSUP for an unaligned request is to silently ignore that
portion of the discard request; but for devices that can pass
the partial request all the way down to hardware, this can
result in the hardware coalescing requests and discarding
aligned pages after all.
Reported by: Peter Lieven <pl@kamp.de>
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Right now, the block layer rounds discard requests, so that
individual drivers are able to assert that discard requests
will never be unaligned. But there are some ISCSI devices
that track and coalesce multiple unaligned requests, turning it
into an actual discard if the requests eventually cover an
entire page, which implies that it is better to always pass
discard requests as low down the stack as possible.
In isolation, this patch has no semantic effect, since the
block layer currently never passes an unaligned request through.
But the block layer already has code that silently ignores
drivers that return -ENOTSUP for a discard request that cannot
be honored (as well as drivers that return 0 even when nothing
was done). But the next patch will update the block layer to
fragment discard requests, so that clients are guaranteed that
they are either dealing with an unaligned head or tail, or an
aligned core, making it similar to the block layer semantics of
write zero fragmentation.
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 443668ca rewrote the write_zeroes logic to guarantee that
an unaligned request never crosses a cluster boundary. But
in the rewrite, the new code assumed that at most one iteration
would be needed to get to an alignment boundary.
However, it is easy to trigger an assertion failure: the Linux
kernel limits loopback devices to advertise a max_transfer of
only 64k. Any operation that requires falling back to writes
rather than more efficient zeroing must obey max_transfer during
that fallback, which means an unaligned head may require multiple
iterations of the write fallbacks before reaching the aligned
boundaries, when layering a format with clusters larger than 64k
atop the protocol of file access to a loopback device.
Test case:
$ qemu-img create -f qcow2 -o cluster_size=1M file 10M
$ losetup /dev/loop2 /path/to/file
$ qemu-io -f qcow2 /dev/loop2
qemu-io> w 7m 1k
qemu-io> w -z 8003584 2093056
In fairness to Denis (as the original listed author of the culprit
commit), the faulty logic for at most one iteration is probably all
my fault in reworking his idea. But the solution is to restore what
was in place prior to that commit: when dealing with an unaligned
head or tail, iterate as many times as necessary while fragmenting
the operation at max_transfer boundaries.
Reported-by: Ed Swierk <eswierk@skyportsystems.com>
CC: qemu-stable@nongnu.org
CC: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
At the qcow2 layer, discard is only possible on a per-cluster
basis; at the moment, qcow2 silently rounds any unaligned
requests to this granularity. However, an upcoming patch will
fix a regression in the block layer ignoring too much of an
unaligned discard request, by changing the block layer to
break up a discard request at alignment boundaries; for that
to work, the block layer must know about our limits.
However, we can't go one step further by changing
qcow2_discard_clusters() to assert that requests are always
aligned, since that helper function is reached on paths
outside of the block layer.
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes a use-after-free bug introduced in commit 6349c154. We need
to use QLIST_FOREACH_SAFE() when freeing elements in the loop. Spotted
by Coverity.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1479378608-11962-1-git-send-email-kwolf@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
This puts a huge strain on the disks when there are many concurrent
migrations. With this patch we only flush twice: just before issuing
the event, and just before pivoting to the destination. If management
will complete the job close to the BLOCK_JOB_READY event, the cost of
the second flush should be small anyway.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161109162008.27287-2-pbonzini@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
libcurl will only give us as much data as there is, not more. The block
layer will deny requests beyond the end of file for us; but since this
block driver is still using a sector-based interface, we can still get
in trouble if the file size is not a multiple of 512.
While we have already made sure not to attempt transfers beyond the end
of the file, we are currently still trying to receive data from there if
the original request exceeds the file size. This patch fixes this issue
and invokes qemu_iovec_memset() on the iovec's tail.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20161025025431.24714-5-mreitz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
For some connection types (like FTP, generally), more than one socket
may be used (in FTP's case: control vs. data stream). As of commit
838ef60249 ("curl: Eliminate unnecessary
use of curl_multi_socket_all"), we have to remember all of the sockets
used by libcurl, but in fact we only did that for a single one. Since
one libcurl connection may use multiple sockets, however, we have to
remember them all.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20161025025431.24714-4-mreitz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
While commit 38bbc0a580 is correct in that
the callback is supposed to return the number of bytes handled; what it
does not mention is that libcurl will throw an error if the callback did
not "handle" all of the data passed to it.
Therefore, if the callback receives some data that it cannot handle
(either because the receive buffer has not been set up yet or because it
would not fit into the receive buffer) and we have to ignore it, we
still have to report that the data has been handled.
Obviously, this should not happen normally. But it does happen at least
for FTP connections where some data (that we do not expect) may be
generated when the connection is established.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20161025025431.24714-3-mreitz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Currently, curl defines its own constant SECTOR_SIZE. There is no
advantage over using the global BDRV_SECTOR_SIZE, so drop it.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20161025025431.24714-2-mreitz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Because TFTP does not support byte ranges, it was never usable with our
curl block driver. Since apparently nobody has ever complained loudly
enough for someone to take care of the issue until now, it seems
reasonable to assume that nobody has ever actually used it.
Therefore, it should be safe to just drop it from curl's protocol list.
[Jeff Cody: Below is additional summary pulled, with some rewording,
from followup emails between Max and Markus, to explain what
worked and what didn't]
TFTP would sometimes work, to a limited extent, for images <= the curl
"readahead" size, so long as reads started at offset zero. By default,
that readahead size is 256KB.
Reads starting at a non-zero offset would also have returned data from a
zero offset. It can become more complicated still, with mixed reads at
zero offset and non-zero offsets, due to data buffering.
In short, TFTP could only have worked before in very specific scenarios
with unrealistic expectations and constraints.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20161102175539.4375-4-mreitz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Refactor backup_start as backup_job_create, which only creates the job,
but does not automatically start it. The old interface, 'backup_start',
is not kept in favor of limiting the number of nearly-identical interfaces
that would have to be edited to keep up with QAPI changes in the future.
Callers that wish to synchronously start the backup_block_job can
instead just call block_job_start immediately after calling
backup_job_create.
Transactions are updated to use the new interface, calling block_job_start
only during the .commit phase, which helps prevent race conditions where
jobs may finish before we even finish building the transaction. This may
happen, for instance, during empty block backup jobs.
Reported-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 1478587839-9834-6-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Instead of automatically starting jobs at creation time via backup_start
et al, we'd like to return a job object pointer that can be started
manually at later point in time.
For now, add the block_job_start mechanism and start the jobs
automatically as we have been doing, with conversions job-by-job coming
in later patches.
Of note: cancellation of unstarted jobs will perform all the normal
cleanup as if the job had started, particularly abort and clean. The
only difference is that we will not emit any events, because the job
never actually started.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 1478587839-9834-5-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Add an explicit start field to specify the entrypoint. We already have
ownership of the coroutine itself AND managing the lifetime of the
coroutine, let's take control of creation of the coroutine, too.
This will allow us to delay creation of the actual coroutine until we
know we'll actually start a BlockJob in block_job_start. This avoids
the sticky question of how to "un-create" a Coroutine that hasn't been
started yet.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1478587839-9834-4-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Cleaning up after we have deferred to the main thread but before the
transaction has converged can be dangerous and result in deadlocks
if the job cleanup invokes any BH polling loops.
A job may attempt to begin cleaning up, but may induce another job to
enter its cleanup routine. The second job, part of our same transaction,
will block waiting for the first job to finish, so neither job may now
make progress.
To rectify this, allow jobs to register a cleanup operation that will
always run regardless of if the job was in a transaction or not, and
if the transaction job group completed successfully or not.
Move sensitive cleanup to this callback instead which is guaranteed to
be run only after the transaction has converged, which removes sensitive
timing constraints from said cleanup.
Furthermore, in future patches these cleanup operations will be performed
regardless of whether or not we actually started the job. Therefore,
cleanup callbacks should essentially confine themselves to undoing create
operations, e.g. setup actions taken in what is now backup_start.
Reported-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1478587839-9834-3-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
blk_eject is only used by scsi-disk and atapi, and in both cases we
only attempt to invoke blk_eject if we have a bona-fide change in
tray state.
The "issue" here is that the tray state does not generate a QMP event
unless there is a medium/BDS attached to the device, so if libvirt et al
are waiting for a tray event to occur from an empty-but-closed drive,
software opening that drive will not emit an event and libvirt will
wait forever.
Change this by modifying blk_eject to always emit an event, instead of
conditionally on a "real" backend eject.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1373264
Reported-by: Peter Krempa <pkrempa@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1478553214-497-2-git-send-email-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
It is too confusing because it sounds like a BDRVRawState variable.
Suggested-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1477565117-17230-1-git-send-email-famz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
It was from the time when none of the global functions had a qcow2_
prefix.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We make sure that the size is aligned to sector length to prevent any
round ups. Otherwise we could end up reading/writing data outside the
area specified by user. This is only needed when user supplies the size
option to avoid any surprises. It is not necessary when only offset is
set.
More over, the check made it difficult to use the offset option without
size option. The check puts unneeded restriction on the offset which had
to be aligned too. Because bdrv_getlength() returns aligned value having
unaligned offset would make the check fail.
Signed-off-by: Tomáš Golembiovský <tgolembi@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When only offset is specified but no size and the offset is greater than
the real size of the containing device an overflow occurs when parsing
the options. This overflow is harmless because we do check for this
exact situation little bit later, but it leads to an error message with
weird values. It is better to do the check is sooner and prevent the
overflow.
Signed-off-by: Tomáš Golembiovský <tgolembi@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>