update_refcount can return errors that need to be handled by the callers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
There's absolutely no problem with updating the refcounts of 0 clusters.
At least snapshot code is doing this and would fail once the result of
update_refcount isn't ignored any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If update_refcount fails, try to undo any changes made so far to avoid
inconsistencies in the image file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Returning 0/-errno allows it to distingush different errors classes. The
cluster offset of newly allocated clusters is now returned in the QCowL2Meta
struct.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Switching to 0/-errno allows it to distinguish different error cases.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Don't assume success but pass the bdrv_pwrite return value on.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Return the appropriate error value instead of always using EIO. Don't free the
L1 table on errors, we still need it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Instead of using the field 'readonly' of the BlockDriverState struct for passing the request,
pass the request in the flags parameter to the function.
Signed-off-by: Naphtali Sprei <nsprei@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Current legacy floppy detection is hardcoded based on source file
name. Make this smarter on linux by attempting a floppy specific
ioctl.
v2:
Give ioctl check higher priority than filename check
s/IDE/legacy/
v3:
Actually initialize 'prio' variable
Check for ioctl success rather than absence of specific failure
v4:
Explicitly mention that change is linux specific.
Signed-off-by: Cole Robinson <crobinso@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Current CDROM detection is hardcoded based on source file name.
Make this smarter on linux by attempting a CDROM specific ioctl.
This makes '-cdrom /dev/sr0' succeed with no media present.
v2:
Give ioctl check higher priority than filename check.
v3:
Actually initialize 'prio' variable.
Check for ioctl success rather than absence of specific failure.
v4:
Explicitly mention that change is linux specific.
Signed-off-by: Cole Robinson <crobinso@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now that we do not have to flush the backing device anymore implementing
the bdrv_aio_flush method for image formats is trivial.
[hch: forward ported to qemu mainline from a product tree]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Introduce the functions needed to change the backing file of an image. The
function is implemented for qcow2.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently the dmg image format driver simply opens the images as raw
if any kind of failure happens. This is contrarty to the behaviour
of all other image formats which just return an error and let the
block core deal with it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The disk image I created from my old laptop disk with VBoxManage
internalcommand converthd obviously was not a multiple of 1MB as when
created from scratch. This fixes QEMU refusing it. We still require the
size to be a multiple of sector size though.
It then boots correctly.
Allow opening VDI images with size not multiple of 1MB (as when converted from a raw disk).
Signed-off-by: François Revol <revol@free.fr>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/bochs.o
cc1: warnings being treated as errors
block/bochs.c: In function 'seek_to_sector':
block/bochs.c:202: error: ignoring return value of 'read', declared with attribute warn_unused_result
make: *** [block/bochs.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
We're leaking file descriptors to child processes. Set FD_CLOEXEC on file
descriptors that don't need to be passed to children to stop this misbehaviour.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
I haven't heard yet of anyone using qemu-img to copy an image to a real floppy,
but it's a valid use case.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently qcow2 unnecessarily rounds up the length of the backing format string
to the next multiple of 8. At the same time, the array in BlockDriverState can
only hold 15 characters, so in effect backing formats with 9 characters or more
don't work (e.g. host_device).
Save the real string length and things start to work for all valid image format
names.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Images with disk size 0 may be used for
VM snapshots, but not to save normal block data.
It is possible to create such images using
qemu-img, but opening them later fails.
So even "qemu-img info image.qcow2" is not
possible for an image created with
"qemu-img create -f qcow2 image.qcow2 0".
This is fixed here.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The context parameter in paio_submit isn't used anyway, so there is no reason
why block drivers should need to remember it. This also avoids passing a Linux
AIO context to paio_submit (which doesn't do any harm as long as the parameter
is unused, but it is highly confusing).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
It was merely a workaround and the real fix is done now.
This reverts commit ef845c3bf4.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We'll leave some AIO completions unhandled when we can't call the callback.
qemu_aio_process_queue() is used later to run any callbacks that are left and
can be run then.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When using Linux AIO raw still falls back to POSIX AIO sometimes, so we should
initialize it.
Not initializing it happens to work if POSIX AIO is used by another drive, or
if the format is not specified (probing the format uses POSIX AIO) or by pure
luck (e.g. it doesn't seem to happen any more with qcow2 since we have re-added
synchronous qcow2 functions).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
In case of failure, we haven't increased the refcount for the newly allocated
cluster yet. Therefore we must not free the cluster or its refcount will become
negative (and endless recursion is possible).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When the synchronous read and write functions were dropped, they were replaced
by generic emulation functions. Unfortunately, these emulation functions don't
provide the same semantics as the original functions did.
The original bdrv_read would mean that we read some data synchronously and that
we won't be interrupted during this read. The latter assumption is no longer
true with the emulation function which needs to use qemu_aio_poll and therefore
allows the callback of any other concurrent AIO request to be run during the
read. Which in turn means that (meta)data read earlier could have changed and
be invalid now. qcow2 is not prepared to work in this way and it's just scary
how many places there are where other requests could run.
I'm not sure yet where exactly it breaks, but you'll see breakage with virtio
on qcow2 with a backing file. Providing synchronous functions again fixes the
problem for me.
Patchworks-ID: 35437
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Today host_devices have a create function, so they also need a create_options
field to prevent qemu-img from complaining.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch increases the maximum qcow2 cluster size to 2 MB. Starting with 128k
clusters, L2 tables span 2 GB or more of virtual disk space, causing 32 bit
truncation and wraparound of signed integers. Therefore some variables need to
use a larger data type.
While being at reviewing data types, change some integers that are used for
array indices to unsigned. In some places they were checked against some upper
limit but not for negative values. This could avoid potential segfaults with
corrupted qcow2 images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If available, the Universally Unique Identifier library
is used by the vdi block driver.
Other parts of QEMU (vl.c) could also use it.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
In the very least, a change like this requires discussion on the list.
The naming convention is goofy and it causes a massive merge problem. Something
like this _must_ be presented on the list first so people can provide input
and cope with it.
This reverts commit 99a0949b72.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Put space between = and & when taking a pointer,
to avoid confusion with old-style "&=".
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Problem: Our file sys-queue.h is a copy of the BSD file, but there are
some additions and it's not entirely compatible. Because of that, there have
been conflicts with system headers on BSD systems. Some hacks have been
introduced in the commits 15cc923584,
f40d753718,
96555a96d7 and
3990d09adf but the fixes were fragile.
Solution: Avoid the conflict entirely by renaming the functions and the
file. Revert the previous hacks.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Instead stalling the VCPU while serving a cache flush try to do it
asynchronously. Use our good old helper thread pool to issue an
asynchronous fdatasync for raw-posix. Note that while Linux AIO
implements a fdatasync operation it is not useful for us because
it isn't actually implement in asynchronous fashion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If we are flushing the caches for our image files we only care about the
data (including the metadata required for accessing it) but not things
like timestamp updates. So try to use fdatasync instead of fsync to
implement the flush operations.
Unfortunately many operating systems still do not support fdatasync,
so we add a qemu_fdatasync wrapper that uses fdatasync if available
as per the _POSIX_SYNCHRONIZED_IO feature macro or fsync otherwise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When two AIO requests write to the same cluster, and this cluster is
unallocated, currently both requests allocate a new cluster and the second one
merges the first one when it is completed. This means an cluster allocation, a
read and a cluster deallocation which cause some overhead. If we simply let the
second request wait until the first one is done, we improve overall performance
with AIO requests (specifially, qcow2/virtio combinations).
This patch maintains a list of in-flight requests that have allocated new
clusters. A second request touching the same cluster is limited so that it
either doesn't touch the allocation of the first request (so it can have a
non-overlapping allocation) or it waits for the first request to complete.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The wrong version of the preallocation patch has been applied, so this is the
remaining diff.
We can't use truncate to grow the image file to the right size because we don't
know if metadata has been written after the last data cluster. In this case
truncate would shrink the file and destroy its metadata. Write a zero sector at
the end of the virtual disk instead to ensure that the file is big enough.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The company which made Virtual PC was Connectix.
They use the magic string "conectix" in their disk images.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch fixes linker errors when building QEMU without Linux AIO support.
It is based on suggestions from malc and Kevin Wolf.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now that do have a nicer interface to work against we can add Linux native
AIO support. It's an extremly thing layer just setting up an iocb for
the io_submit system call in the submission path, and registering an
eventfd with the qemu poll handler to do complete the iocbs directly
from there.
This started out based on Anthony's earlier AIO patch, but after
estimated 42,000 rewrites and just as many build system changes
there's not much left of it.
To enable native kernel aio use the aio=native sub-command on the
drive command line. I have also added an option to qemu-io to
test the aio support without needing a guest.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently the raw-posix.c code contains a lot of knowledge about the
asynchronous I/O scheme that is mostly implemented in posix-aio-compat.c.
All this code does not really belong here and is getting a bit in the
way of implementing native AIO on Linux.
So instead move all the guts of the AIO implementation into
posix-aio-compat.c (which might need a better name, btw).
There's now a very small interface between the AIO providers and raw-posix.c:
- an init routine is called from raw_open_common to return an AIO context
for this drive. An AIO implementation may either re-use one context
for all drives, or use a different one for each as the Linux native
AIO support will do.
- an submit routine is called from the aio_reav/writev methods to submit
an AIO request
There are no indirect calls involved in this interface as we need to
decide which one to call manually. We will only call the Linux AIO native
init function if we were requested to by vl.c, and we will only call
the native submit function if we are asked to and the request is properly
aligned. That's also the reason why the alignment check actually does
the inverse move and now goes into raw-posix.c.
The old posix-aio-compat.h headers is removed now that most of it's
content is private to posix-aio-compat.c, and instead we add a new
block/raw-posix-aio.h headers is created containing only the tiny interface
between raw-posix.c and the AIO implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This introduces a qemu-img create option for qcow2 which allows the metadata to
be preallocated, i.e. clusters are reserved in the refcount table and L1/L2
tables, but no data is written to them. Metadata is quite small, so this
happens in almost no time.
Especially with qcow2 on virtio this helps to gain a bit of performance during
the initial writes. However, as soon as create a snapshot, we're back to the
normal slow speed, obviously. So this isn't the real fix, but kind of a cheat
while we're still having trouble with qcow2 on virtio.
Note that the option is disabled by default and needs to be specified
explicitly using qemu-img create -f qcow2 -o preallocation=metadata.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
* The code for option '-static' was wrong, so image creation
always created static images.
* Static images created with qemu-img did not set header entry
blocks_allocated.
* The size of the block map must be rounded to the next multiple
of SECTOR_SIZE, otherwise the block map is only read partially
for block map sizes which are not a multiple of SECTOR_SIZE.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>