qcow2 used to use bounce buffers for any AIO requests. This does not only imply
unnecessary copying, but also unbounded allocations which should be avoided.
This patch removes bounce buffers from the normal AIO read path, and constrains
them to a constant size for encrypted images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We always have a sync for the refcount update when a new cluster is
allocated. If we move this past the COW, we can save an additional sync.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Note that the flush is omitted intentionally in qcow2_free_clusters. If
anything, we can leak clusters here if we lose the writes.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block/nbd.c: use default port number when none is specified
qemu-nbd.c: use IANA-assigned port number: 10809
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace the hardcoded handling of 512 byte alignment with bs->buffer_alignment
to handle larger sector size devices correctly.
Note that we can not rely on it to be initialize in bdrv_open, so deal
with the worst case there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The qcow file used for write support in vvfat is a temporary file,
so we can use cache=unsafe there. Without this, write support is just
too slow to be of any use.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
Allocation and deallocation of bs->opaque is not in the control of a
block driver. Therefore it should not set bs->opaque to a data structure
used by another bs, or closing the image will lead to a double free.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
vvfat tries to set the readonly flag in its open function, but nowadays
this is overwritted with the readonly=... command line option. Check in
bdrv_write if the vvfat was opened read-only and return an error in this
case.
Without this check, vvfat tries to access the qcow bs, which is NULL
without enabled write support.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
The signedness of enum types depend on the compiler implementation.
Therefore the check for negative values may or may not be meaningful.
Fix by explicitly casting to a signed integer.
Since the values are also checked earlier against event_names
table, this is an internal error. Change the 'if' to 'assert'.
This also avoids a warning with GCC flag -Wtype-limits.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This reverts commit 79368c81bf.
Conflicts:
block.c
I haven't been able to come up with a solution yet for the corruption caused by
unaligned requests from the IDE disk so revert until a solution can be written.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When a new cluster was allocated, we only need a flush after the write to the
L2 table if it was a COW and we need to decrease the refcounts of the old
clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow symbolic links which point to /dev/sgX devices.
Signed-off-by: Bernhard Kohl <bernhard.kohl@nsn.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On Linux, we have code to detect CD-ROMs using an ioctl. We shouldn't lose
anything but false positives by removing the check for a /dev/cd* path.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch allows to connect Qemu using NBD protocol to an nbd-server
using named exports.
For instance, if on the host "isoserver", in /etc/nbd-server/config, you have:
[generic]
[debian-500-ppc-netinst]
exportname = /ISO/debian-500-powerpc-netinst.iso
[Fedora-10-ppc-netinst]
exportname = /ISO/Fedora-10-ppc-netinst.iso
You can connect to it, using:
qemu -cdrom nbd:isoserver:exportname=debian-500-ppc-netinst
qemu -cdrom nbd:isoserver:exportname=Fedora-10-ppc-netinst
NOTE: you need at least nbd-server 2.9.18
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
"qemu_socket.h" includes all necessary files and
including <netinet/tcp.h> without <netinet/in.h>
could cause errors on some systems.
Signed-off-by: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Assuming that any image on a block device is not properly zero-initialized is
actually wrong: Only raw images have this problem. Any other image format
shouldn't care about it, they initialize everything properly themselves.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no need to have a second set of integral types.
Replace them by the standard types from stdint.h.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
CVE-2008-2004 described a vulnerability in QEMU whereas a malicious user could
trick the block probing code into accessing arbitrary files in a guest. To
mitigate this, we added an explicit format parameter to -drive which disabling
block probing.
Fast forward to today, and the vast majority of users do not use this parameter.
libvirt does not use this by default nor does virt-manager.
Most users want block probing so we should try to make it safer.
This patch adds some logic to the raw device which attempts to detect a write
operation to the beginning of a raw device. If the first 4 bytes happen to
match an image file that has a backing file that we support, it scrubs the
signature to all zeros. If a user specifies an explicit format parameter, this
behavior is disabled.
I contend that while a legitimate guest could write such a signature to the
header, we would behave incorrectly anyway upon the next invocation of QEMU.
This simply changes the incorrect behavior to not involve a security
vulnerability.
I've tested this pretty extensively both in the positive and negative case. I'm
not 100% confident in the block layer's ability to deal with zero sized writes
particularly with respect to the aio functions so some additional eyes would be
appreciated.
Even in the case of a single sector write, we have to make sure to invoked the
completion from a bottom half so just removing the zero sized write is not an
option.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
WIN32 is not only the system which doesn't have TCP_CORK (e.g. OS X).
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Sheepdog is a distributed storage system for QEMU. It provides highly
available block level storage volumes to VMs like Amazon EBS. This
patch adds a qemu block driver for Sheepdog.
Sheepdog features are:
- No node in the cluster is special (no metadata node, no control
node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing
The more details are available at the project site:
http://www.osrg.net/sheepdog/
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
raw_pread_aligned() retries up to two times if the block device backs
a virtual CD-ROM (a drive with media=cdrom and if=ide, scsi, xen or
none). This makes no sense. Whether retrying reads can correct read
errors can only depend on what we're reading, not on how the result
gets used. We need to check what whether we're reading from a
physical CD-ROM or floppy here.
I doubt retrying is useful even then. Left for another day.
Impact:
* Virtual CD-ROM backed by host_cdrom behaves the same.
* Virtual CD-ROM backed by file or host_device no longer retries.
* A drive backed by host_cdrom now retries even if it's not a virtual
CD-ROM.
* Any drive backed by host_floppy now retries.
While there, clean up gratuitous use of goto.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This distinguishes between harmless leaks and real corruption. Hopefully users
better understand what qemu-img check wants to tell them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
state = 0 in rules means that the rule is valid for any state. Therefore it's
impossible to have a rule that works only in the initial state. This changes
the initial state from 0 to 1 to make this possible.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Forgetting to free them means that the next instance inherits all rules and
gets its own rules only additionally.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The list head was initialized to point to the wrong list, so all actions ended
up being handled as inject-error even if they were set-state in fact.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
People were wondering why qemu-img check failed after they tried to preallocate
a large qcow2 file and ran out of disk space.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Trying to check them leads to a second error message which is more confusing
than helpful:
Can't get refcount for cluster 0: Invalid argument
ERROR cluster 0 refcount=-22 reference=1
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With corrupted images, we can easily get an cluster index that exceeds the
array size of the temporary refcount table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.
While at it, correct the wrong usage of errno.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We don't have an equivalent to mmap in the qemu block API, so read and
write the bitmap directly. At least in the dumb implementation added
in this patch this is a lot less efficient, but it means cow can also
work on windows, and over nbd or curl. And it fixes qemu-iotests testcase
012 which did not work properly due to issues with read-only mmap access.
In addition we can also get rid of the now unused get_mmap_addr function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread/pwrite instead of lseek + read/write in preparation of using the
qemu block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If writing the L1 table to disk failed, we need to restore its old content in
memory to avoid inconsistencies.
Reported-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes load_refcount_block which completely ignored the return value of
write_refcount_block and always returned -EIO for bdrv_pwrite failure.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently it would consider blocks for which get_refcount fails used. However,
it's unlikely that get_refcount would succeed for the next cluster, so it's not
really helpful. Return an error instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
get_refcount might need to load a refcount block from disk, so errors may
happen. Return the error code instead of assuming a refcount of 1 and change
the callers to respect error return values.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This changes the vpc block driver (for VHD) to read/write multiple sectors at
once instead of doing a request for each single sector.
Before this, running qemu-iotests for VPC took ages, now it's actually quite
reasonable to run it always (down from ~1 hour to 40 seconds for me).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Clean up raw-posix.c to be more consistent using BDRV_SECTOR_SIZE
instead of hard coded 512 values.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After it is done with updating refcounts in the cache, update_refcount writes
all changed entries to disk. If a refcount block allocation fails, however,
there was no change yet and therefore first_index = last_index = -1. Don't
treat -1 as a normal sector index (resulting in a 512 byte write!) but return
without updating anything in this case.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Refblock allocation code needs to take into consideration that update_refcount
will load a different refcount block into the cache, so it must initialize the
cache for a new refcount block only afterwards. Not doing this means that not
only the refcount in the wrong block is updated, but also that the caller will
work on the wrong block.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
write_refcount_block_entries used to return -EIO for any errors. Change this to
return the real error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_get_cluster_offset() looks up a given virtual disk offset and returns the
offset of the corresponding cluster in the image file. Errors (e.g. L2 table
can't be read) are currenctly indicated by a return value of 0, which is
unfortuately the same as for any unallocated cluster. So in effect we can't
check for errors.
This makes the old return value a by-reference parameter and returns the usual
0/-errno error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
l2_allocate has some intermediate states in which the image is inconsistent.
Change the order to write to the L1 table only after the new L2 table has
successfully been initialized.
Also reset the L2 cache in failure case, it's very likely wrong.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the L2 table was already updated in cache, but writing it to disk has
failed, we must not continue using the changed version in the cache to stay
consistent with what's on the disk.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Casting a pointer to an int doesn't work on 64 bit platforms. Use the %p printf
conversion specifier instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
gcc does not like passing a NULL where an int value is expected:
block/vvfat.c: In function ‘checkpoint’:
block/vvfat.c:2868: error: passing argument 2 of ‘remove_mapping’ makes
integer from pointer without a cast
Signed-off-by: Riccardo Magliocchetti <riccardo.magliocchetti@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The fix is based on a patch from Kevin Wolf. Here his comment:
"The number of blocks needs to be rounded up to cover all of the virtual hard
disk. Without this fix, we can't even open our own images if their size is not
a multiple of the block size."
While Kevin's patch addressed vdi_create, my modification also fixes
vdi_open which now accepts images with odd disk sizes.
v3:
Don't allow reading of disk images with too large disk sizes.
Neither VBoxManage nor old versions of qemu-img read such images.
This change requires rounding of odd disk sizes before we do the checks.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: François Revol <revol@free.fr>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Dmg actually does an lseek to a negative offset in the open routine,
which we replace with offset arithmetics after doing a bdrv_getlength.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API. Note that dmg actually uses the implicit file offset
a lot in dmg_open, and we had to replace it with an offset variable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When dmg_read_chunk encounters an uncompressed chunk it currently
calls read without any previous adjustment of the file postion.
This seems very wrong, and the "reference" implementation in
dmg2img does a search to the same offset as done in the various
compression cases, so do the same here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The VHD algorithm calculates a disk geometry
which is usually smaller than the requested size.
QEMU tried to round up but failed for certain sizes:
qemu-img create -f vpc disk.vpc 9437184
would create an image with 9435136 bytes
(which is too small for qemu-img convert).
Instead of hacking the geometry algorithm, the patch
increases the number of sectors until we get enough
sectors.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Even it is not very useful, users may create images of size 0.
Without the special option CONFIG_ZERO_MALLOC, qemu_mallocz
aborts execution when it is told to allocate 0 bytes,
so avoid this kind of call.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
OpenBSDs gcc is said to generate warnings for this declaration, so don't
reference bdrv_qcow2 directly, but look it up using bdrv_find_format.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This reverts commit 20d97356c9.
The BlockDriver definition should stay at the end of source files.
Conflicts:
block/qcow2.c
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This patch adds the ability to grow qcow2 images in-place using
bdrv_truncate(). This enables qemu-img resize command support for
qcow2.
Snapshots are not supported and bdrv_truncate() will return -ENOTSUP.
The notion of resizing an image with snapshots could lead to confusion:
users may expect snapshots to remain unchanged, but this is not possible
with the current qcow2 on-disk format where the header.size field is
global instead of per-snapshot. Others may expect snapshots to change
size along with the current image data. I think it is safest to not
support snapshots and perhaps add behavior later if there is a
consensus.
Backing images continue to work. If the image is now larger than its
backing image, zeroes are read when accessing beyond the end of the
backing image.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
While it's true that during regular operation free_clusters failure would be a
bug, an I/O error can always happen. There's no need to kill the VM, the worst
thing that can happen (and it will) is that we leak some clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch combines the lseek+read/write calls to use pread/pwrite
instead. This will result in fewer system calls and is already used by
AIO.
Thanks to Jan Kiszka <jan.kiszka@siemens.com> for identifying excessive
lseek and Christoph Hellwig <hch@lst.de> for confirming that this
approach should work.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The i loop iterator is shadowed by the next free cluster index. Both
using the variable name 'i' makes the code harder to read.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
VMDK is doing interesting things when it needs to open a backing file. This
patch changes that part to look more like in other drivers. The nice side
effect is that the file name isn't needed any more in the open function.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When trying to do COW, VMDK wrote the data back to the backing file. This
problem was revealed by the patch that made backing files read-only. This patch
does not only fix the problem, but also simplifies the VMDK code a bit.
This fixes the backing file qemu-iotests cases for VMDK.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Format drivers shouldn't need to bother with things like file names, but rather
just get an open BlockDriverState for the underlying protocol. This patch
introduces this behaviour for bdrv_open implementation. For protocols which
need to access the filename to open their file/device/connection/... a new
callback bdrv_file_open is introduced which doesn't get an underlying file
opened.
For now, also some of the more obscure formats use bdrv_file_open because they
open() the file themselves instead of using the block.c functions. They need to
be fixed in later patches.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We're running into various problems because the "raw" file access, which
is used internally by the various image formats is entangled with the
"raw" image format, which maps the VM view 1:1 to a file system.
This patch renames the raw file backends to the file protocol which
is treated like other protocols (e.g. nbd and http) and adds a new
"raw" image format which is just a wrapper around calls to the underlying
protocol.
The patch is surprisingly simple, besides changing the probing logical
in block.c to only look for image formats when using bdrv_open and
renaming of the old raw protocols to file there's almost nothing in there.
For creating images, a new bdrv_create_file is introduced which guesses the
protocol to use. This allows using qemu-img create -f raw (or just using the
default) for both files and host devices. Converting the other format drivers
to use this function to create their images is left for later patches.
The only issues still open are in the handling of the host devices.
Firstly in current qemu we can specifiy the host* format names
on various command line acceping images, but the new code can't
do that without adding some translation. Second the layering breaks
the no_zero_init flag in the BlockDriver used by qemu-img. I'm not
happy how this is done per-driver instead of per-state so I'll
prepare a separate patch to clean this up.
There's some more cleanup opportunity after this patch, e.g. using
separate lists and registration functions for image formats vs
protocols and maybe even host drivers, but this can be done at a
later stage.
Also there's a check for protocol in bdrv_open for the BDRV_O_SNAPSHOT
case that I don't quite understand, but which I fear won't work as
expected - possibly even before this patch.
Note that this patch requires various recent block patches from Kevin
and me, which should all be in his block queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix clang warnings:
/src/qemu/block/vvfat.c:1102:9: warning: Value stored to 'index3' during its initialization is never read
int index3=index1+1;
/src/qemu/cmd.c:290:15: warning: Value stored to 'p' during its initialization is never read
char *p = result;
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
GCC 3.3.5 generates warnings for static forward declarations of data, so
rearrange code to use static forward declarations of functions instead.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Returning NULL on error doesn't allow distinguishing between different errors.
Change the interface to return an integer for -errno.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Split up the raw_getlength into separate generic, solaris and BSD
versions to reduce the ifdef maze a bit. The BSD variant still
is a complete maze, but to clean it up properly we'd need some
people using the BSD variants to figure out what code is used
for what variant.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
What is known today as bdrv_open2 becomes the new bdrv_open. All remaining
callers of the old function are converted to the new one. In some places they
even know the right format, so they should have used bdrv_open2 from the
beginning.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow_create2 assumes that the new image will only need one cluster for its
refcount table initially. Obviously that's not true any more when the image is
big enough (exact value depends on the cluster size).
This patch calculates the refcount table size dynamically.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Block drivers can trigger a blkdebug event whenever they reach a place where it
could be useful to inject an error for testing/debugging purposes.
Rules are read from a blkdebug config file and describe which action is taken
when an event is triggered. For now this is only injecting an error (with a few
options) or changing the state (which is an integer). Rules can be declared to
be active only in a specific state; this way later rules can distiguish on
which path we came to trigger their event.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a mechanism to inject errors instead of passing requests on. With no
further patches applied, you can use it by setting inject_errno in gdb.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This isn't doing anything interesting. It creates the blkdebug block driver as
a protocol which just passes everything through to raw.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_open already takes care of this for us.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
If we complete a request with a failure we need to remove it from the list of
requests that are in flight. If we don't do it, the next time the same AIOCB is
used for a cluster allocation it will create a loop in the list and qemu will
hang in an endless loop.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Returning -EIO is far from optimal, but at least it's an error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Now that we output an error message according to the returned error code in
qemu-img, let's return the real error codes. "Input/output error" for
everything isn't helpful.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
When building with -DNDEBUG, assert(0) will not stop execution
so it must not be used for abnormal termination.
Use cpu_abort() when in CPU context, abort() otherwise.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
cleanup code is identical for error/success cases. Only difference
are goto labels.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
fail_gd error case would also free rgd_buf that was already freed
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When checking for errors, commit db89119d compares with the wrong values,
failing image creation even when there was no error. Additionally, if an
error has occured, we can't preallocate the image (it's likely broken).
This unbreaks test 023 of qemu-iotests.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The current implementation of alloc_refcount_block and grow_refcount_table has
fundamental problems regarding error handling. There are some places where an
I/O error means that the image is going to be corrupted. I have found that the
only way to fix this is to completely rewrite the thing.
In detail, the problem is that the refcount blocks itself are allocated using
alloc_refcount_noref (to avoid endless recursion when updating the refcount of
the new refcount block, which migh access just the same refcount block but its
allocation is not yet completed...). Only at the end of the refcount allocation
the refcount of the refcount block is increased. If an error happens in
between, the refcount block is in use, but has a refcount of zero and will
likely be overwritten later.
The new approach is explained in comments in the code. The trick is basically
to let new refcount blocks describe their own refcount, so their refcount will
be automatically changed when they are hooked up in the refcount table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When the refcount table grows, it doesn't only grow by one entry but reserves
some space for future refcount blocks. The algorithm to calculate the number of
entries stays the same with the fixes, so factor it out before replacing the
rest.
As Juan suggested take the opportunity to simplify the code a bit.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If a write requests crosses a L2 table boundary and all clusters until the
end of the L2 table are usable for the request, we must not look at the next
L2 entry because we already have arrived at the end of the array.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Most of these are obvious NULL-deref bug fixes, for example,
the ones in these files:
block/curl.c
net.c
slirp/misc.c
and the first one in block/vvfat.c.
The others in block/vvfat.c may not lead to an immediate segfault, but I
traced the two schedule_rename(..., strdup(path)) uses, and a failed
strdup would appear to trigger this assertion in handle_renames_and_mkdirs:
assert(commit->path);
The conversion to use qemu_strdup in envlist_to_environ is not technically
needed, but does avoid a theoretical leak in the caller when strdup fails
for one value, but later succeeds in allocating another buffer(plausible,
if one string length is much larger than the others). The caller does
not know the length of the returned list, and as such can only free
pointers until it hits the first NULL. If there are non-NULL pointers
beyond the first, their buffers would be leaked. This one is admittedly
far-fetched.
The two in linux-user/main.c are worth fixing to ensure that an
OOM error is diagnosed up front, rather than letting it provoke some
harder-to-diagnose secondary error, in case of exec failure, or worse, in
case the exec succeeds but with an invalid list of command line options.
However, considering how unlikely it is to encounter a failed strdup early
in main, this isn't a big deal. Note that adding the required uses of
qemu_strdup here and in envlist.c induce link failures because qemu_strdup
is not currently in any library they're linked with. So for now, I've
omitted those changes, as well as the fixes in target-i386/helper.c
and target-sparc/helper.c.
If you'd like to see the above discussion (or anything else)
in the commit log, just let me know and I'll be happy to adjust.
>From 9af42864fd1ea666bd25e2cecfdfae74c20aa8c7 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering@redhat.com>
Date: Mon, 8 Feb 2010 18:29:29 +0100
Subject: [PATCH] don't dereference NULL after failed strdup
Handle failing strdup by replacing each use with qemu_strdup,
so as not to dereference NULL or trigger a failing assertion.
* block/curl.c (curl_open): s/\bstrdup\b/qemu_strdup/
* block/vvfat.c (init_directories): Likewise.
(get_cluster_count_for_direntry, check_directory_consistency): Likewise.
* net.c (parse_host_src_port): Likewise.
* slirp/misc.c (fork_exec): Likewise.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Checking for return codes < 0 isn't really going to work with unsigned
types. Use signed types instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This shouldn't happen under any normal circumstances. However, it looks like
it's possible to achieve this with corrupted images. Without this patch
raw_pread is hanging in an endless loop in such cases.
The patch is not affecting growable files, for which such reads happen in
normal use cases. raw_pread_aligned already handles these cases and won't
return zero in the first place.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Win32 suffers from a very big memory leak when dealing with SCSI devices.
Each read/write request allocates memory with qemu_memalign (ie
VirtualAlloc) but frees it with qemu_free (ie free).
Pair all qemu_memalign() calls with qemu_vfree() to prevent such leaks.
Signed-off-by: Herve Poussineau <hpoussin@reactos.org>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The n member is not very descriptive and very hard to grep, rename it to
cur_nr_sectors to better indicate what it is used for. Also rename
nb_sectors to remaining_sectors as that is what it is used for.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The BDRV_O_CREAT option is unused inside qemu and partially duplicates
the bdrv_create method. Remove it, and the -C option to qemu-io which
isn't used in qemu-iotests anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Found some places that seems needs this explicitly, now that
read-write is not the default.
Signed-off-by: Naphtali Sprei <nsprei@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/qcow2.o
cc1: warnings being treated as errors
block/qcow2.c: In function 'qcow_create2':
block/qcow2.c:829: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:838: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:839: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:841: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:844: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:849: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:852: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:855: error: ignoring return value of 'write', declared with attribute warn_unused_result
make: *** [block/qcow2.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/vvfat.o
cc1: warnings being treated as errors
block/vvfat.c: In function 'commit_one_file':
block/vvfat.c:2259: error: ignoring return value of 'ftruncate', declared with attribute warn_unused_result
make: *** [block/vvfat.o] Error 1
CC block/vvfat.o
In file included from /usr/include/stdio.h:912,
from ./qemu-common.h:19,
from block/vvfat.c:27:
In function 'snprintf',
inlined from 'init_directories' at block/vvfat.c:871,
inlined from 'vvfat_open' at block/vvfat.c:1068:
/usr/include/bits/stdio2.h:65: error: call to __builtin___snprintf_chk will always overflow destination buffer
make: *** [block/vvfat.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/vmdk.o
cc1: warnings being treated as errors
block/vmdk.c: In function 'vmdk_snapshot_create':
block/vmdk.c:236: error: ignoring return value of 'ftruncate', declared with attribute warn_unused_result
block/vmdk.c: In function 'vmdk_create':
block/vmdk.c:775: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/vmdk.c:776: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/vmdk.c:778: error: ignoring return value of 'ftruncate', declared with attribute warn_unused_result
block/vmdk.c:784: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/vmdk.c:790: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/vmdk.c:807: error: ignoring return value of 'write', declared with attribute warn_unused_result
make: *** [block/vmdk.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/qcow.o
cc1: warnings being treated as errors
block/qcow.c: In function 'qcow_create':
block/qcow.c:804: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow.c:806: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow.c:811: error: ignoring return value of 'write', declared with attribute warn_unused_result
make: *** [block/qcow.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/cow.o
cc1: warnings being treated as errors
block/cow.c: In function 'cow_create':
block/cow.c:251: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/cow.c:253: error: ignoring return value of 'ftruncate', declared with attribute warn_unused_result
make: *** [block/cow.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now that qcow2_alloc_clusters can return error codes, we must handle them in
the callers of qcow2_alloc_clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
update_refcount can return errors that need to be handled by the callers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
There's absolutely no problem with updating the refcounts of 0 clusters.
At least snapshot code is doing this and would fail once the result of
update_refcount isn't ignored any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If update_refcount fails, try to undo any changes made so far to avoid
inconsistencies in the image file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Returning 0/-errno allows it to distingush different errors classes. The
cluster offset of newly allocated clusters is now returned in the QCowL2Meta
struct.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Switching to 0/-errno allows it to distinguish different error cases.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Don't assume success but pass the bdrv_pwrite return value on.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Return the appropriate error value instead of always using EIO. Don't free the
L1 table on errors, we still need it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Instead of using the field 'readonly' of the BlockDriverState struct for passing the request,
pass the request in the flags parameter to the function.
Signed-off-by: Naphtali Sprei <nsprei@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Current legacy floppy detection is hardcoded based on source file
name. Make this smarter on linux by attempting a floppy specific
ioctl.
v2:
Give ioctl check higher priority than filename check
s/IDE/legacy/
v3:
Actually initialize 'prio' variable
Check for ioctl success rather than absence of specific failure
v4:
Explicitly mention that change is linux specific.
Signed-off-by: Cole Robinson <crobinso@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Current CDROM detection is hardcoded based on source file name.
Make this smarter on linux by attempting a CDROM specific ioctl.
This makes '-cdrom /dev/sr0' succeed with no media present.
v2:
Give ioctl check higher priority than filename check.
v3:
Actually initialize 'prio' variable.
Check for ioctl success rather than absence of specific failure.
v4:
Explicitly mention that change is linux specific.
Signed-off-by: Cole Robinson <crobinso@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now that we do not have to flush the backing device anymore implementing
the bdrv_aio_flush method for image formats is trivial.
[hch: forward ported to qemu mainline from a product tree]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Introduce the functions needed to change the backing file of an image. The
function is implemented for qcow2.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently the dmg image format driver simply opens the images as raw
if any kind of failure happens. This is contrarty to the behaviour
of all other image formats which just return an error and let the
block core deal with it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The disk image I created from my old laptop disk with VBoxManage
internalcommand converthd obviously was not a multiple of 1MB as when
created from scratch. This fixes QEMU refusing it. We still require the
size to be a multiple of sector size though.
It then boots correctly.
Allow opening VDI images with size not multiple of 1MB (as when converted from a raw disk).
Signed-off-by: François Revol <revol@free.fr>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/bochs.o
cc1: warnings being treated as errors
block/bochs.c: In function 'seek_to_sector':
block/bochs.c:202: error: ignoring return value of 'read', declared with attribute warn_unused_result
make: *** [block/bochs.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
We're leaking file descriptors to child processes. Set FD_CLOEXEC on file
descriptors that don't need to be passed to children to stop this misbehaviour.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
I haven't heard yet of anyone using qemu-img to copy an image to a real floppy,
but it's a valid use case.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently qcow2 unnecessarily rounds up the length of the backing format string
to the next multiple of 8. At the same time, the array in BlockDriverState can
only hold 15 characters, so in effect backing formats with 9 characters or more
don't work (e.g. host_device).
Save the real string length and things start to work for all valid image format
names.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Images with disk size 0 may be used for
VM snapshots, but not to save normal block data.
It is possible to create such images using
qemu-img, but opening them later fails.
So even "qemu-img info image.qcow2" is not
possible for an image created with
"qemu-img create -f qcow2 image.qcow2 0".
This is fixed here.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The context parameter in paio_submit isn't used anyway, so there is no reason
why block drivers should need to remember it. This also avoids passing a Linux
AIO context to paio_submit (which doesn't do any harm as long as the parameter
is unused, but it is highly confusing).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
It was merely a workaround and the real fix is done now.
This reverts commit ef845c3bf4.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We'll leave some AIO completions unhandled when we can't call the callback.
qemu_aio_process_queue() is used later to run any callbacks that are left and
can be run then.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When using Linux AIO raw still falls back to POSIX AIO sometimes, so we should
initialize it.
Not initializing it happens to work if POSIX AIO is used by another drive, or
if the format is not specified (probing the format uses POSIX AIO) or by pure
luck (e.g. it doesn't seem to happen any more with qcow2 since we have re-added
synchronous qcow2 functions).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
In case of failure, we haven't increased the refcount for the newly allocated
cluster yet. Therefore we must not free the cluster or its refcount will become
negative (and endless recursion is possible).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When the synchronous read and write functions were dropped, they were replaced
by generic emulation functions. Unfortunately, these emulation functions don't
provide the same semantics as the original functions did.
The original bdrv_read would mean that we read some data synchronously and that
we won't be interrupted during this read. The latter assumption is no longer
true with the emulation function which needs to use qemu_aio_poll and therefore
allows the callback of any other concurrent AIO request to be run during the
read. Which in turn means that (meta)data read earlier could have changed and
be invalid now. qcow2 is not prepared to work in this way and it's just scary
how many places there are where other requests could run.
I'm not sure yet where exactly it breaks, but you'll see breakage with virtio
on qcow2 with a backing file. Providing synchronous functions again fixes the
problem for me.
Patchworks-ID: 35437
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Today host_devices have a create function, so they also need a create_options
field to prevent qemu-img from complaining.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch increases the maximum qcow2 cluster size to 2 MB. Starting with 128k
clusters, L2 tables span 2 GB or more of virtual disk space, causing 32 bit
truncation and wraparound of signed integers. Therefore some variables need to
use a larger data type.
While being at reviewing data types, change some integers that are used for
array indices to unsigned. In some places they were checked against some upper
limit but not for negative values. This could avoid potential segfaults with
corrupted qcow2 images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If available, the Universally Unique Identifier library
is used by the vdi block driver.
Other parts of QEMU (vl.c) could also use it.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
In the very least, a change like this requires discussion on the list.
The naming convention is goofy and it causes a massive merge problem. Something
like this _must_ be presented on the list first so people can provide input
and cope with it.
This reverts commit 99a0949b72.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Put space between = and & when taking a pointer,
to avoid confusion with old-style "&=".
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Problem: Our file sys-queue.h is a copy of the BSD file, but there are
some additions and it's not entirely compatible. Because of that, there have
been conflicts with system headers on BSD systems. Some hacks have been
introduced in the commits 15cc923584,
f40d753718,
96555a96d7 and
3990d09adf but the fixes were fragile.
Solution: Avoid the conflict entirely by renaming the functions and the
file. Revert the previous hacks.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Instead stalling the VCPU while serving a cache flush try to do it
asynchronously. Use our good old helper thread pool to issue an
asynchronous fdatasync for raw-posix. Note that while Linux AIO
implements a fdatasync operation it is not useful for us because
it isn't actually implement in asynchronous fashion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If we are flushing the caches for our image files we only care about the
data (including the metadata required for accessing it) but not things
like timestamp updates. So try to use fdatasync instead of fsync to
implement the flush operations.
Unfortunately many operating systems still do not support fdatasync,
so we add a qemu_fdatasync wrapper that uses fdatasync if available
as per the _POSIX_SYNCHRONIZED_IO feature macro or fsync otherwise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When two AIO requests write to the same cluster, and this cluster is
unallocated, currently both requests allocate a new cluster and the second one
merges the first one when it is completed. This means an cluster allocation, a
read and a cluster deallocation which cause some overhead. If we simply let the
second request wait until the first one is done, we improve overall performance
with AIO requests (specifially, qcow2/virtio combinations).
This patch maintains a list of in-flight requests that have allocated new
clusters. A second request touching the same cluster is limited so that it
either doesn't touch the allocation of the first request (so it can have a
non-overlapping allocation) or it waits for the first request to complete.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The wrong version of the preallocation patch has been applied, so this is the
remaining diff.
We can't use truncate to grow the image file to the right size because we don't
know if metadata has been written after the last data cluster. In this case
truncate would shrink the file and destroy its metadata. Write a zero sector at
the end of the virtual disk instead to ensure that the file is big enough.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The company which made Virtual PC was Connectix.
They use the magic string "conectix" in their disk images.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch fixes linker errors when building QEMU without Linux AIO support.
It is based on suggestions from malc and Kevin Wolf.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now that do have a nicer interface to work against we can add Linux native
AIO support. It's an extremly thing layer just setting up an iocb for
the io_submit system call in the submission path, and registering an
eventfd with the qemu poll handler to do complete the iocbs directly
from there.
This started out based on Anthony's earlier AIO patch, but after
estimated 42,000 rewrites and just as many build system changes
there's not much left of it.
To enable native kernel aio use the aio=native sub-command on the
drive command line. I have also added an option to qemu-io to
test the aio support without needing a guest.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently the raw-posix.c code contains a lot of knowledge about the
asynchronous I/O scheme that is mostly implemented in posix-aio-compat.c.
All this code does not really belong here and is getting a bit in the
way of implementing native AIO on Linux.
So instead move all the guts of the AIO implementation into
posix-aio-compat.c (which might need a better name, btw).
There's now a very small interface between the AIO providers and raw-posix.c:
- an init routine is called from raw_open_common to return an AIO context
for this drive. An AIO implementation may either re-use one context
for all drives, or use a different one for each as the Linux native
AIO support will do.
- an submit routine is called from the aio_reav/writev methods to submit
an AIO request
There are no indirect calls involved in this interface as we need to
decide which one to call manually. We will only call the Linux AIO native
init function if we were requested to by vl.c, and we will only call
the native submit function if we are asked to and the request is properly
aligned. That's also the reason why the alignment check actually does
the inverse move and now goes into raw-posix.c.
The old posix-aio-compat.h headers is removed now that most of it's
content is private to posix-aio-compat.c, and instead we add a new
block/raw-posix-aio.h headers is created containing only the tiny interface
between raw-posix.c and the AIO implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This introduces a qemu-img create option for qcow2 which allows the metadata to
be preallocated, i.e. clusters are reserved in the refcount table and L1/L2
tables, but no data is written to them. Metadata is quite small, so this
happens in almost no time.
Especially with qcow2 on virtio this helps to gain a bit of performance during
the initial writes. However, as soon as create a snapshot, we're back to the
normal slow speed, obviously. So this isn't the real fix, but kind of a cheat
while we're still having trouble with qcow2 on virtio.
Note that the option is disabled by default and needs to be specified
explicitly using qemu-img create -f qcow2 -o preallocation=metadata.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
* The code for option '-static' was wrong, so image creation
always created static images.
* Static images created with qemu-img did not set header entry
blocks_allocated.
* The size of the block map must be rounded to the next multiple
of SECTOR_SIZE, otherwise the block map is only read partially
for block map sizes which are not a multiple of SECTOR_SIZE.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
These errors come up when compiling with gcc-4.3.3 and some older headers:
/scratch/froydnj/qemu.git/block/vpc.c: In function 'vpc_create':
/scratch/froydnj/qemu.git/block/vpc.c:514: error: value computed is not used
/scratch/froydnj/qemu.git/block/vpc.c:516: error: value computed is not used
/scratch/froydnj/qemu.git/block/vpc.c:517: error: value computed is not used
/scratch/froydnj/qemu.git/block/vpc.c:566: error: value computed is not used
Use memcpy to copy the strings instead of strncpy.
Signed-off-by: Nathan Froyd <froydnj@codesourcery.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
As requested by Anthony make pthreads mandatory. This means we will always
have AIO available on posix hosts, and it will also allow enabling the I/O
thread unconditionally once it's ready.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This is a new block driver written from scratch
to support the VDI format in QEMU.
VDI is the native format used by Innotek / SUN VirtualBox.
Latest changes:
* stripped down version
(code for synchronous operations and experimental code removed)
* don't open VDI snapshot images (with uuid_link or uuid_parent)
* modified vdi_aio_cancel
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Message-Id:
Instead of storing the backing file in its own BlockDriverState, VMDK uses the
BlockDriverState of the raw image file it opened. This is wrong and breaks
functions that access the backing file or protocols. This fix replaces all
occurrences of s->hd->backing_* with bs->backing_*.
This fixes qemu-iotests failure in 020 (Commit changes to backing file).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
I used the following command to enable debugging:
perl -p -i -e 's/^\/\/#define DEBUG/#define DEBUG/g' * */* */*/*
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
In qemu-iotests, some large images are created using qemu-img.
Without checks for errors, qemu-img will just create an
empty image, and later read / write tests will fail.
With the patch, failures during image creation are detected
and reported.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The VM state offset is a concept internal to the image format. Replace
the old bdrv_{get,put}_buffer method that require an index into the
image file that is constructed from the VM state offset and an offset
into the vmstate with the bdrv_{load,save}_vmstate that just take an
offset into the VM state.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Contrary to what one could expect, the size of L1 tables is not cluster
aligned. So as we're writing whole sectors now instead of single entries,
we need to ensure that the L1 table in memory is large enough; otherwise
write would access memory after the end of the L1 table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The performance of qcow2 has improved meanwhile, so we don't need to
special-case it any more. Switch the default to write-through caching
like all other block drivers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The really time consuming part of snapshotting is to adjust the reference count
of all clusters. Currently after each adjusted cluster the refcount block is
written to disk.
Don't write each single byte immediately to disk but cache all writes to the
refcount block and write them out once we're done with the block.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When using O_DIRECT, qcow2 snapshots didn't work any more for me. In the
process of creating the snapshot, qcow2 tries to pwrite some new information
(e.g. new L1 table) which will often end up being after the old end of the
image file. Now pwrite tries to align things and reads the old contents of the
file, read returns 0 because there is nothing to read after the end of file and
pwrite is stuck in an endless loop.
This patch allows to pread beyond the end of an image file. Whenever the
given offset is after the end of the image file, the read succeeds and fills
the buffer with zeros.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Problem: It is impossible to feed filenames with the character colon because
qemu interprets such names as a protocol. For example filename scsi:0, is
interpreted as a protocol by name "scsi".
This patch allows user to espace colon characters. For example the above
filename can now be expressed either as 'scsi\:0' or as file:scsi:0
anything following the "file:" tag is interpreted verbatin. However if "file:"
tag is omitted then any colon characters in the string must be escaped using
backslash.
Here are couple of examples:
scsi\:0\:abc is a local file scsi:0:abc
http\://myweb is a local file by name http://myweb
file:scsi:0:abc is a local file scsi:0:abc
file:http://myweb is a local file by name http://myweb
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When updating the refcount blocks in update_refcount(), write complete sectors
instead of updating single entries.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When updating the L2 tables in alloc_cluster_link_l2(), write complete
sectors instead of updating single entries.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When modifying the L1 table, l2_allocate() needs to write complete sectors
instead of single entries. The L1 table is already in memory, reading it from
disk in the block layer to align the request is wasted performance.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The qcow2 source is now split into several more manageable files. During the
conversion quite some functions that were static before needed to be changed to
be global to make the source compile again.
We were lucky enough not to get name conflicts with these additional global
names, but they are not nice. This patch adds a qcow2_ prefix to all of the
global functions in qcow2.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2-snapshot.c contains the code related to snapshotting.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2-cluster.c contains all functions related to the management of guest
clusters, i.e. what the guest sees on its virtual disk. This code is about
mapping these guest clusters to host clusters in the image file using the
two-level lookup tables.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2-refcount.c contains all functions which are related to cluster
allocation and management in the image file. A large part of this is the
reference counting of these clusters.
Also a header file qcow2.h is introduced which will contain the interface of
the split qcow2 modules.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Larger cluster sizes mean less metadata. This has been discussion a few times,
let's do it now. This turns 64k clusters on by default for new images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When we open a file, we first attempt to open it read-write, then fall back
to read-only. Unfortunately we reuse the flags from the previous attempt,
so both attempts try to open the file with write permissions, and fail.
Fix by clearing the O_RDWR flag from the previous attempt.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The flags argument to raw_common_open() contain bits defined by the BDRV_O_*
namespace, not the posix O_* namespace.
Adjust to use the correct constants.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Rename raw_ioctl and raw_aio_ioctl to hdev_ioctl and hdev_aio_ioctl as they
are only used for the host device. Also only add them to the method table
for the cases where we need them (generic hdev if linux and linux CDROM)
instead of declaring stubs and always add them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a bdrv_probe_device method to all BlockDriver instances implementing
host devices to move matching of host device types into the actual drivers.
For now we keep exacly the old matching behaviour based on the devices names,
although we really should have better detetion methods based on device
information in the future.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Instead of declaring one BlockDriver for all host devices declared one
for each type: a generic one for normal disk devices, a Linux floppy
driver and a CDROM driver for Linux and FreeBSD. This gets rid of a lot
of messy ifdefs and switching based on the type in the various removal
device methods.
block.c grows a new method to find the correct host device driver based
on OS-sepcific criteria, which will later into the actual drivers in a
later patch in this series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
raw_open and hdev_open contain the same basic logic. Add a new
raw_open_common helper containing the guts of the open routine
and call it from raw_open and hdev_open.
We use the new open_flags field in BDRVRawState to allow passing
additional open flags to raw_open_common from both.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Both the Linux floppy and the FreeBSD CDROM host device need to store
the open flags so that they can re-open the device later. Store the
open flags unconditionally to remove the ifdef mess and simply the
calling conventions for the later patches in the series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
This patch adds a small help text to each of the options in the block drivers
which can be displayed by using qemu-img create -f fmt -o ?
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that we have a separate aio pool structure we can remove those
aio pool details from BlockDriver.
Every driver supporting AIO now needs to declare a static AIOPool
with the aiocb size and the cancellation method. This cleans up the
current code considerably and will make it cleaner and more obvious
to support two different aio implementations behind a single
BlockDriver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
[this one is required for [PATCH] fully split aio_pool from BlockDriver,
sorry for not sending it out earlier]
Add a qcow_aio_setup helper to qcow to shared common code between
the aio_readv and aio_writev methods. Based on the function with
the same name in qcow2.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We do need hdev_create unconditionally on all platforms so that qemu-img
create support for host device works on all platforms.
Also relax the check to allow character devices in addition to block
devices. On many Unix platforms block devices have buffered block
nodes and unbuffered character device nodes, and on FreeBSD the block
nodes don't even exist anymore. Also on Linux we do support the
/dev/sgN scsi passthrough devices through the host device driver,
and probably the old-style /dev/raw/rawN raw devices although I haven't
tested that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
raw_pread_aligned currently returns the raw return value from
lseek/read, which is always -1 in case of an error. But the
callers higher up the stack expect it to return the negated
errno just like raw_pwrite_aligned.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch converts the remaining users of bdrv_create2 to bdrv_create and
removes the now unused function.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Don't write each single changed refcount block entry to the disk after it is
written, but update all entries of the block and write all of them at once.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This is a preparation patch with no functional changes. It moves the allocation
of new refcounts block to a new function and makes update_cluster_refcount (for
one cluster) call update_refcount (for multiple clusters) instead the other way
round.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
There is only one (internal) user left and it can be switched to the normal
emulation provided in block.c
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently Qemu can read from posix I/O and NBD. This patch adds a
third protocol to the game: HTTP.
In certain situations it can be useful to access HTTP data directly,
for example if you want to try out an http provided OS image, but
don't know if you want to download it yet.
Using this patch you can now try it on on the fly. Just use it like:
qemu -cdrom http://host/path/my.iso
Signed-off-by: Alexander Graf <agraf@suse.de>
Add an option to specify the cluster size of a newly created qcow2 image.
Default is 4k which is the same value that was hard-coded before.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now we can make use of the newly introduced option structures. Instead of
having bdrv_create carry more and more parameters (which are format specific in
most cases), just pass a option structure as defined by the driver itself.
bdrv_create2() contains an emulation of the old interface to simplify the
transition.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>