This adds some new cache functions to qcow2 which can be used for caching
refcount blocks and L2 tables. When used with cache=writethrough they work
like the old caching code which is spread all over qcow2, so for this case we
have merely a cleanup.
The interesting case is with writeback caching (this includes cache=none) where
data isn't written to disk immediately but only kept in cache initially. This
leads to some form of metadata write batching which avoids the current "write
to refcount block, flush, write to L2 table" pattern for each single request
when a lot of cluster allocations happen. Instead, cache entries are only
written out if its required to maintain the right order. In the pure cluster
allocation case this means that all metadata updates for requests are done in
memory initially and on sync, first the refcount blocks are written to disk,
then fsync, then L2 tables.
This improves performance of scenarios with lots of cluster allocations
noticably (e.g. installation or after taking a snapshot).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
cpu_to_be64w() is called with an obviously non-aligned pointer. Use
cpu_to_be64wu() instead. It fixes unaligned accesses errors on IA64
hosts.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix a file descriptor leak, reported by cppcheck:
[/src/qemu/block/vvfat.c:759]: (error) Resource leak: dir
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
In addition this adds missing braces to the function to be consistent
with the coding style.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It doesn't really make sense for functions in qcow2.c to be named
qcow_ so convert the names to match correctly.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds support for the qemu-img check command. It also
introduces a dirty bit in the qed header to mark modified images as
needing a check. This bit is cleared when the image file is closed
cleanly.
If an image file is opened and it has the dirty bit set, a consistency
check will run and try to fix corrupted table offsets. These
corruptions may occur if there is power loss while an allocating write
is performed. Once the image is fixed it opens as normal again.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch implements the read/write state machine. Operations are
fully asynchronous and multiple operations may be active at any time.
Allocating writes lock tables to ensure metadata updates do not
interfere with each other. If two allocating writes need to update the
same L2 table they will run sequentially. If two allocating writes need
to update different L2 tables they will run in parallel.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds code to look up data cluster offsets in the image via
the L1/L2 tables. The L2 tables are writethrough cached in memory for
performance (each read/write requires a lookup so it is essential to
cache the tables).
With cluster lookup code in place it is possible to implement
bdrv_is_allocated() to query the number of contiguous
allocated/unallocated clusters.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch introduces the qed on-disk layout and implements image
creation. Later patches add read/write and other functionality.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add support to discard blocks in a raw image residing on an XFS filesystem
by calling the XFS_IOC_UNRESVSP64 ioctl to punch holes. Support for other
hole punching mechanisms can be added when they become available.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a new bdrv_discard method to free blocks in a mapping image, and a new
drive property to set the granularity for these discard. If no discard
granularity support is set discard support is disabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
RBD is an block driver for the distributed file system Ceph
(http://ceph.newdream.net/). This driver uses librados (which is part
of the Ceph server) for direct access to the Ceph object store and is
running entirely in userspace (Yehuda also wrote a driver for the
linux kernel, that can be used to access rbd volumes as a block
device).
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Christian Brunner <chb@muc.de>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All drivers use bs->file instead of s->hd for quite a while now, so it's time
to remove s->hd.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
When building on a 64 bit host which uses 'long' for int64_t,
GCC emits a warning:
CC block/blkverify.o
/src/qemu/block/blkverify.c: In function `blkverify_verify_readv':
/src/qemu/block/blkverify.c:304: warning: long long int format, long
unsigned int arg (arg 3)
Rework a77cffe7e9 to avoid the warning.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The cache content may be destroyed after a failed read, better not use it any
more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
This changes bdrv_flush to return 0 on success and -errno in case of failure.
It's a requirement for implementing proper error handle in users of bdrv_flush.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Move timer init functions to a new file, qemu-timer-common.c. Make other
critical timer functions inlined to preserve performance in
qemu-timer.c, also move muldiv64() (used by the inline functions)
to qemu-timer.h.
Adjust block/raw-posix.c and simpletrace.c to use get_clock() directly.
Remove a similar/duplicate definition in qemu-tool.c.
Adjust hw/omap_clk.c to include qemu-timer.h because muldiv64() is used
there.
After this change, tracing can be used also for user code and
simpletrace on Win32.
Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Adding the gcc format attribute detects a format bug
which is fixed here.
v2:
Don't use type cast. BDRV_SECTOR_SIZE is unsigned long long,
so %lld should be the correct format specifier.
Cc: Blue Swirl <blauwirbel@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In order to backup snapshots, created from QCOW2 iamge, we want to copy snapshots out of QCOW2 disk to a seperate storage.
The following patch adds a new option in "qemu-img": qemu-img convert -f qcow2 -O qcow2 -s snapshot_name src_img bck_img.
Right now, it only supports to copy the full snapshot, delta snapshot is on the way.
Changes from V1: all the comments from Kevin are addressed:
Add read-only checking
Fix coding style
Change the name from bdrv_snapshot_load to bdrv_snapshot_load_tmp
Signed-off-by: Disheng Su <edison@cloud.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of doing lots of magic for setting up initial refcount blocks and stuff
create a minimal (inconsistent) image, open it and initialize the rest with
regular qcow2 functions.
This is a complete rewrite of the image creation function. The old
implementating is #ifdef'd out and will be removed by the next patch (removing
it here would have made the diff unreadable because diff tries to find
similarities when it's really a rewrite)
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The L1 table grow operation includes a size calculation that bumps up
the new L1 table size in order to anticipate the size needs of vmstate
data. This helps reduce the number of times that the L1 table has to be
grown when vmstate data is appended.
This size overhead is not necessary during image creation,
bdrv_truncate(), or snapshot goto operations. In fact, existing
qemu-iotests that exercise table growth are no longer able to trigger it
because image creation preallocates an L1 table that is too large after
changes to qcow_create2().
This patch keeps the size calculation but also adds exact growth for
callers that do not want to inflate the L1 table size unnecessarily.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Compiling with GCC 4.6.0 20100925 produced a warning:
/src/qemu/block/qcow2-refcount.c: In function 'update_refcount':
/src/qemu/block/qcow2-refcount.c:552:13: error: variable 'dummy' set but not used [-Werror=unused-but-set-variable]
Fix by adding a dummy cast so that the result is not unused.
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Fix this compiler warning:
./block/vvfat.c:2285: error: comparison of unsigned expression >= 0 is always true
Cc: Blue Swirl <blauwirbel@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
The blkverify block driver makes investigating image format data
corruption much easier. A raw image initialized with the same contents
as the test image (e.g. qcow2 file) must be provided. The raw image
mirrors read/write operations and is used to verify that data read from
the test image is correct.
See docs/blkverify.txt for more information.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 used to use bounce buffers for any AIO requests. This does not only imply
unnecessary copying, but also unbounded allocations which should be avoided.
This patch removes bounce buffers from the normal AIO write path. Encrypted
images continue to use a bounce buffer, however with constant size.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 used to use bounce buffers for any AIO requests. This does not only imply
unnecessary copying, but also unbounded allocations which should be avoided.
This patch removes bounce buffers from the normal AIO read path, and constrains
them to a constant size for encrypted images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We always have a sync for the refcount update when a new cluster is
allocated. If we move this past the COW, we can save an additional sync.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Note that the flush is omitted intentionally in qcow2_free_clusters. If
anything, we can leak clusters here if we lose the writes.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block/nbd.c: use default port number when none is specified
qemu-nbd.c: use IANA-assigned port number: 10809
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace the hardcoded handling of 512 byte alignment with bs->buffer_alignment
to handle larger sector size devices correctly.
Note that we can not rely on it to be initialize in bdrv_open, so deal
with the worst case there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The qcow file used for write support in vvfat is a temporary file,
so we can use cache=unsafe there. Without this, write support is just
too slow to be of any use.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
Allocation and deallocation of bs->opaque is not in the control of a
block driver. Therefore it should not set bs->opaque to a data structure
used by another bs, or closing the image will lead to a double free.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
vvfat tries to set the readonly flag in its open function, but nowadays
this is overwritted with the readonly=... command line option. Check in
bdrv_write if the vvfat was opened read-only and return an error in this
case.
Without this check, vvfat tries to access the qcow bs, which is NULL
without enabled write support.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
The signedness of enum types depend on the compiler implementation.
Therefore the check for negative values may or may not be meaningful.
Fix by explicitly casting to a signed integer.
Since the values are also checked earlier against event_names
table, this is an internal error. Change the 'if' to 'assert'.
This also avoids a warning with GCC flag -Wtype-limits.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This reverts commit 79368c81bf.
Conflicts:
block.c
I haven't been able to come up with a solution yet for the corruption caused by
unaligned requests from the IDE disk so revert until a solution can be written.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When a new cluster was allocated, we only need a flush after the write to the
L2 table if it was a COW and we need to decrease the refcounts of the old
clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow symbolic links which point to /dev/sgX devices.
Signed-off-by: Bernhard Kohl <bernhard.kohl@nsn.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On Linux, we have code to detect CD-ROMs using an ioctl. We shouldn't lose
anything but false positives by removing the check for a /dev/cd* path.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch allows to connect Qemu using NBD protocol to an nbd-server
using named exports.
For instance, if on the host "isoserver", in /etc/nbd-server/config, you have:
[generic]
[debian-500-ppc-netinst]
exportname = /ISO/debian-500-powerpc-netinst.iso
[Fedora-10-ppc-netinst]
exportname = /ISO/Fedora-10-ppc-netinst.iso
You can connect to it, using:
qemu -cdrom nbd:isoserver:exportname=debian-500-ppc-netinst
qemu -cdrom nbd:isoserver:exportname=Fedora-10-ppc-netinst
NOTE: you need at least nbd-server 2.9.18
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
"qemu_socket.h" includes all necessary files and
including <netinet/tcp.h> without <netinet/in.h>
could cause errors on some systems.
Signed-off-by: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Assuming that any image on a block device is not properly zero-initialized is
actually wrong: Only raw images have this problem. Any other image format
shouldn't care about it, they initialize everything properly themselves.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no need to have a second set of integral types.
Replace them by the standard types from stdint.h.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
CVE-2008-2004 described a vulnerability in QEMU whereas a malicious user could
trick the block probing code into accessing arbitrary files in a guest. To
mitigate this, we added an explicit format parameter to -drive which disabling
block probing.
Fast forward to today, and the vast majority of users do not use this parameter.
libvirt does not use this by default nor does virt-manager.
Most users want block probing so we should try to make it safer.
This patch adds some logic to the raw device which attempts to detect a write
operation to the beginning of a raw device. If the first 4 bytes happen to
match an image file that has a backing file that we support, it scrubs the
signature to all zeros. If a user specifies an explicit format parameter, this
behavior is disabled.
I contend that while a legitimate guest could write such a signature to the
header, we would behave incorrectly anyway upon the next invocation of QEMU.
This simply changes the incorrect behavior to not involve a security
vulnerability.
I've tested this pretty extensively both in the positive and negative case. I'm
not 100% confident in the block layer's ability to deal with zero sized writes
particularly with respect to the aio functions so some additional eyes would be
appreciated.
Even in the case of a single sector write, we have to make sure to invoked the
completion from a bottom half so just removing the zero sized write is not an
option.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
WIN32 is not only the system which doesn't have TCP_CORK (e.g. OS X).
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Sheepdog is a distributed storage system for QEMU. It provides highly
available block level storage volumes to VMs like Amazon EBS. This
patch adds a qemu block driver for Sheepdog.
Sheepdog features are:
- No node in the cluster is special (no metadata node, no control
node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing
The more details are available at the project site:
http://www.osrg.net/sheepdog/
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
raw_pread_aligned() retries up to two times if the block device backs
a virtual CD-ROM (a drive with media=cdrom and if=ide, scsi, xen or
none). This makes no sense. Whether retrying reads can correct read
errors can only depend on what we're reading, not on how the result
gets used. We need to check what whether we're reading from a
physical CD-ROM or floppy here.
I doubt retrying is useful even then. Left for another day.
Impact:
* Virtual CD-ROM backed by host_cdrom behaves the same.
* Virtual CD-ROM backed by file or host_device no longer retries.
* A drive backed by host_cdrom now retries even if it's not a virtual
CD-ROM.
* Any drive backed by host_floppy now retries.
While there, clean up gratuitous use of goto.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This distinguishes between harmless leaks and real corruption. Hopefully users
better understand what qemu-img check wants to tell them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
state = 0 in rules means that the rule is valid for any state. Therefore it's
impossible to have a rule that works only in the initial state. This changes
the initial state from 0 to 1 to make this possible.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Forgetting to free them means that the next instance inherits all rules and
gets its own rules only additionally.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The list head was initialized to point to the wrong list, so all actions ended
up being handled as inject-error even if they were set-state in fact.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
People were wondering why qemu-img check failed after they tried to preallocate
a large qcow2 file and ran out of disk space.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Trying to check them leads to a second error message which is more confusing
than helpful:
Can't get refcount for cluster 0: Invalid argument
ERROR cluster 0 refcount=-22 reference=1
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With corrupted images, we can easily get an cluster index that exceeds the
array size of the temporary refcount table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.
While at it, correct the wrong usage of errno.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We don't have an equivalent to mmap in the qemu block API, so read and
write the bitmap directly. At least in the dumb implementation added
in this patch this is a lot less efficient, but it means cow can also
work on windows, and over nbd or curl. And it fixes qemu-iotests testcase
012 which did not work properly due to issues with read-only mmap access.
In addition we can also get rid of the now unused get_mmap_addr function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread/pwrite instead of lseek + read/write in preparation of using the
qemu block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If writing the L1 table to disk failed, we need to restore its old content in
memory to avoid inconsistencies.
Reported-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes load_refcount_block which completely ignored the return value of
write_refcount_block and always returned -EIO for bdrv_pwrite failure.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently it would consider blocks for which get_refcount fails used. However,
it's unlikely that get_refcount would succeed for the next cluster, so it's not
really helpful. Return an error instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
get_refcount might need to load a refcount block from disk, so errors may
happen. Return the error code instead of assuming a refcount of 1 and change
the callers to respect error return values.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This changes the vpc block driver (for VHD) to read/write multiple sectors at
once instead of doing a request for each single sector.
Before this, running qemu-iotests for VPC took ages, now it's actually quite
reasonable to run it always (down from ~1 hour to 40 seconds for me).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Clean up raw-posix.c to be more consistent using BDRV_SECTOR_SIZE
instead of hard coded 512 values.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After it is done with updating refcounts in the cache, update_refcount writes
all changed entries to disk. If a refcount block allocation fails, however,
there was no change yet and therefore first_index = last_index = -1. Don't
treat -1 as a normal sector index (resulting in a 512 byte write!) but return
without updating anything in this case.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Refblock allocation code needs to take into consideration that update_refcount
will load a different refcount block into the cache, so it must initialize the
cache for a new refcount block only afterwards. Not doing this means that not
only the refcount in the wrong block is updated, but also that the caller will
work on the wrong block.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
write_refcount_block_entries used to return -EIO for any errors. Change this to
return the real error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_get_cluster_offset() looks up a given virtual disk offset and returns the
offset of the corresponding cluster in the image file. Errors (e.g. L2 table
can't be read) are currenctly indicated by a return value of 0, which is
unfortuately the same as for any unallocated cluster. So in effect we can't
check for errors.
This makes the old return value a by-reference parameter and returns the usual
0/-errno error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
l2_allocate has some intermediate states in which the image is inconsistent.
Change the order to write to the L1 table only after the new L2 table has
successfully been initialized.
Also reset the L2 cache in failure case, it's very likely wrong.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the L2 table was already updated in cache, but writing it to disk has
failed, we must not continue using the changed version in the cache to stay
consistent with what's on the disk.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Casting a pointer to an int doesn't work on 64 bit platforms. Use the %p printf
conversion specifier instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
gcc does not like passing a NULL where an int value is expected:
block/vvfat.c: In function ‘checkpoint’:
block/vvfat.c:2868: error: passing argument 2 of ‘remove_mapping’ makes
integer from pointer without a cast
Signed-off-by: Riccardo Magliocchetti <riccardo.magliocchetti@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The fix is based on a patch from Kevin Wolf. Here his comment:
"The number of blocks needs to be rounded up to cover all of the virtual hard
disk. Without this fix, we can't even open our own images if their size is not
a multiple of the block size."
While Kevin's patch addressed vdi_create, my modification also fixes
vdi_open which now accepts images with odd disk sizes.
v3:
Don't allow reading of disk images with too large disk sizes.
Neither VBoxManage nor old versions of qemu-img read such images.
This change requires rounding of odd disk sizes before we do the checks.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: François Revol <revol@free.fr>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Dmg actually does an lseek to a negative offset in the open routine,
which we replace with offset arithmetics after doing a bdrv_getlength.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API. Note that dmg actually uses the implicit file offset
a lot in dmg_open, and we had to replace it with an offset variable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When dmg_read_chunk encounters an uncompressed chunk it currently
calls read without any previous adjustment of the file postion.
This seems very wrong, and the "reference" implementation in
dmg2img does a search to the same offset as done in the various
compression cases, so do the same here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The VHD algorithm calculates a disk geometry
which is usually smaller than the requested size.
QEMU tried to round up but failed for certain sizes:
qemu-img create -f vpc disk.vpc 9437184
would create an image with 9435136 bytes
(which is too small for qemu-img convert).
Instead of hacking the geometry algorithm, the patch
increases the number of sectors until we get enough
sectors.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Even it is not very useful, users may create images of size 0.
Without the special option CONFIG_ZERO_MALLOC, qemu_mallocz
aborts execution when it is told to allocate 0 bytes,
so avoid this kind of call.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
OpenBSDs gcc is said to generate warnings for this declaration, so don't
reference bdrv_qcow2 directly, but look it up using bdrv_find_format.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This reverts commit 20d97356c9.
The BlockDriver definition should stay at the end of source files.
Conflicts:
block/qcow2.c
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This patch adds the ability to grow qcow2 images in-place using
bdrv_truncate(). This enables qemu-img resize command support for
qcow2.
Snapshots are not supported and bdrv_truncate() will return -ENOTSUP.
The notion of resizing an image with snapshots could lead to confusion:
users may expect snapshots to remain unchanged, but this is not possible
with the current qcow2 on-disk format where the header.size field is
global instead of per-snapshot. Others may expect snapshots to change
size along with the current image data. I think it is safest to not
support snapshots and perhaps add behavior later if there is a
consensus.
Backing images continue to work. If the image is now larger than its
backing image, zeroes are read when accessing beyond the end of the
backing image.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
While it's true that during regular operation free_clusters failure would be a
bug, an I/O error can always happen. There's no need to kill the VM, the worst
thing that can happen (and it will) is that we leak some clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch combines the lseek+read/write calls to use pread/pwrite
instead. This will result in fewer system calls and is already used by
AIO.
Thanks to Jan Kiszka <jan.kiszka@siemens.com> for identifying excessive
lseek and Christoph Hellwig <hch@lst.de> for confirming that this
approach should work.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The i loop iterator is shadowed by the next free cluster index. Both
using the variable name 'i' makes the code harder to read.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
VMDK is doing interesting things when it needs to open a backing file. This
patch changes that part to look more like in other drivers. The nice side
effect is that the file name isn't needed any more in the open function.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When trying to do COW, VMDK wrote the data back to the backing file. This
problem was revealed by the patch that made backing files read-only. This patch
does not only fix the problem, but also simplifies the VMDK code a bit.
This fixes the backing file qemu-iotests cases for VMDK.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Format drivers shouldn't need to bother with things like file names, but rather
just get an open BlockDriverState for the underlying protocol. This patch
introduces this behaviour for bdrv_open implementation. For protocols which
need to access the filename to open their file/device/connection/... a new
callback bdrv_file_open is introduced which doesn't get an underlying file
opened.
For now, also some of the more obscure formats use bdrv_file_open because they
open() the file themselves instead of using the block.c functions. They need to
be fixed in later patches.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We're running into various problems because the "raw" file access, which
is used internally by the various image formats is entangled with the
"raw" image format, which maps the VM view 1:1 to a file system.
This patch renames the raw file backends to the file protocol which
is treated like other protocols (e.g. nbd and http) and adds a new
"raw" image format which is just a wrapper around calls to the underlying
protocol.
The patch is surprisingly simple, besides changing the probing logical
in block.c to only look for image formats when using bdrv_open and
renaming of the old raw protocols to file there's almost nothing in there.
For creating images, a new bdrv_create_file is introduced which guesses the
protocol to use. This allows using qemu-img create -f raw (or just using the
default) for both files and host devices. Converting the other format drivers
to use this function to create their images is left for later patches.
The only issues still open are in the handling of the host devices.
Firstly in current qemu we can specifiy the host* format names
on various command line acceping images, but the new code can't
do that without adding some translation. Second the layering breaks
the no_zero_init flag in the BlockDriver used by qemu-img. I'm not
happy how this is done per-driver instead of per-state so I'll
prepare a separate patch to clean this up.
There's some more cleanup opportunity after this patch, e.g. using
separate lists and registration functions for image formats vs
protocols and maybe even host drivers, but this can be done at a
later stage.
Also there's a check for protocol in bdrv_open for the BDRV_O_SNAPSHOT
case that I don't quite understand, but which I fear won't work as
expected - possibly even before this patch.
Note that this patch requires various recent block patches from Kevin
and me, which should all be in his block queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix clang warnings:
/src/qemu/block/vvfat.c:1102:9: warning: Value stored to 'index3' during its initialization is never read
int index3=index1+1;
/src/qemu/cmd.c:290:15: warning: Value stored to 'p' during its initialization is never read
char *p = result;
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
GCC 3.3.5 generates warnings for static forward declarations of data, so
rearrange code to use static forward declarations of functions instead.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Returning NULL on error doesn't allow distinguishing between different errors.
Change the interface to return an integer for -errno.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Split up the raw_getlength into separate generic, solaris and BSD
versions to reduce the ifdef maze a bit. The BSD variant still
is a complete maze, but to clean it up properly we'd need some
people using the BSD variants to figure out what code is used
for what variant.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
What is known today as bdrv_open2 becomes the new bdrv_open. All remaining
callers of the old function are converted to the new one. In some places they
even know the right format, so they should have used bdrv_open2 from the
beginning.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow_create2 assumes that the new image will only need one cluster for its
refcount table initially. Obviously that's not true any more when the image is
big enough (exact value depends on the cluster size).
This patch calculates the refcount table size dynamically.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Block drivers can trigger a blkdebug event whenever they reach a place where it
could be useful to inject an error for testing/debugging purposes.
Rules are read from a blkdebug config file and describe which action is taken
when an event is triggered. For now this is only injecting an error (with a few
options) or changing the state (which is an integer). Rules can be declared to
be active only in a specific state; this way later rules can distiguish on
which path we came to trigger their event.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a mechanism to inject errors instead of passing requests on. With no
further patches applied, you can use it by setting inject_errno in gdb.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This isn't doing anything interesting. It creates the blkdebug block driver as
a protocol which just passes everything through to raw.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_open already takes care of this for us.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
If we complete a request with a failure we need to remove it from the list of
requests that are in flight. If we don't do it, the next time the same AIOCB is
used for a cluster allocation it will create a loop in the list and qemu will
hang in an endless loop.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Returning -EIO is far from optimal, but at least it's an error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Now that we output an error message according to the returned error code in
qemu-img, let's return the real error codes. "Input/output error" for
everything isn't helpful.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
When building with -DNDEBUG, assert(0) will not stop execution
so it must not be used for abnormal termination.
Use cpu_abort() when in CPU context, abort() otherwise.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
cleanup code is identical for error/success cases. Only difference
are goto labels.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
fail_gd error case would also free rgd_buf that was already freed
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When checking for errors, commit db89119d compares with the wrong values,
failing image creation even when there was no error. Additionally, if an
error has occured, we can't preallocate the image (it's likely broken).
This unbreaks test 023 of qemu-iotests.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The current implementation of alloc_refcount_block and grow_refcount_table has
fundamental problems regarding error handling. There are some places where an
I/O error means that the image is going to be corrupted. I have found that the
only way to fix this is to completely rewrite the thing.
In detail, the problem is that the refcount blocks itself are allocated using
alloc_refcount_noref (to avoid endless recursion when updating the refcount of
the new refcount block, which migh access just the same refcount block but its
allocation is not yet completed...). Only at the end of the refcount allocation
the refcount of the refcount block is increased. If an error happens in
between, the refcount block is in use, but has a refcount of zero and will
likely be overwritten later.
The new approach is explained in comments in the code. The trick is basically
to let new refcount blocks describe their own refcount, so their refcount will
be automatically changed when they are hooked up in the refcount table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When the refcount table grows, it doesn't only grow by one entry but reserves
some space for future refcount blocks. The algorithm to calculate the number of
entries stays the same with the fixes, so factor it out before replacing the
rest.
As Juan suggested take the opportunity to simplify the code a bit.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If a write requests crosses a L2 table boundary and all clusters until the
end of the L2 table are usable for the request, we must not look at the next
L2 entry because we already have arrived at the end of the array.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Most of these are obvious NULL-deref bug fixes, for example,
the ones in these files:
block/curl.c
net.c
slirp/misc.c
and the first one in block/vvfat.c.
The others in block/vvfat.c may not lead to an immediate segfault, but I
traced the two schedule_rename(..., strdup(path)) uses, and a failed
strdup would appear to trigger this assertion in handle_renames_and_mkdirs:
assert(commit->path);
The conversion to use qemu_strdup in envlist_to_environ is not technically
needed, but does avoid a theoretical leak in the caller when strdup fails
for one value, but later succeeds in allocating another buffer(plausible,
if one string length is much larger than the others). The caller does
not know the length of the returned list, and as such can only free
pointers until it hits the first NULL. If there are non-NULL pointers
beyond the first, their buffers would be leaked. This one is admittedly
far-fetched.
The two in linux-user/main.c are worth fixing to ensure that an
OOM error is diagnosed up front, rather than letting it provoke some
harder-to-diagnose secondary error, in case of exec failure, or worse, in
case the exec succeeds but with an invalid list of command line options.
However, considering how unlikely it is to encounter a failed strdup early
in main, this isn't a big deal. Note that adding the required uses of
qemu_strdup here and in envlist.c induce link failures because qemu_strdup
is not currently in any library they're linked with. So for now, I've
omitted those changes, as well as the fixes in target-i386/helper.c
and target-sparc/helper.c.
If you'd like to see the above discussion (or anything else)
in the commit log, just let me know and I'll be happy to adjust.
>From 9af42864fd1ea666bd25e2cecfdfae74c20aa8c7 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering@redhat.com>
Date: Mon, 8 Feb 2010 18:29:29 +0100
Subject: [PATCH] don't dereference NULL after failed strdup
Handle failing strdup by replacing each use with qemu_strdup,
so as not to dereference NULL or trigger a failing assertion.
* block/curl.c (curl_open): s/\bstrdup\b/qemu_strdup/
* block/vvfat.c (init_directories): Likewise.
(get_cluster_count_for_direntry, check_directory_consistency): Likewise.
* net.c (parse_host_src_port): Likewise.
* slirp/misc.c (fork_exec): Likewise.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Checking for return codes < 0 isn't really going to work with unsigned
types. Use signed types instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>