vdi caches the block map. For migration to work, it would have to be
invalidated. Block migration for now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
s->lock should be unlocked before leaving add_aio_request.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
zlib.h is not a local include file, therefore it should be included
using <> instead of "".
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now when you try to migrate with qed, you get:
(qemu) migrate tcp:localhost:1025
Block format 'qed' used by device 'ide0-hd0' does not support feature 'live migration'
(qemu)
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We don't reopen the actual file, but instead invoke the close and open routines.
We specifically ignore the backing file since it's contents are read-only and
therefore immutable.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2 has a writeback metadata cache, so flushing a qcow2 image actually
consists of writing back that cache to the protocol and only then flushes the
protocol in order to get everything stable on disk.
This introduces a separate bdrv_co_flush_to_os to reflect the split.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There are two different types of flush that you can do: Flushing one level up
to the OS (i.e. writing data to the host page cache) or flushing it all the way
down to the disk. The existing functions flush to the disk, reflect this in the
function name.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The Data Offset field in the Dynamic Disk Header is an 8 byte field.
Although the specification (2006-10-11) gives an example of initializing
only the first 4 bytes, images generated by Microsoft on Windows initialize
all 8 bytes.
Failure to initialize all 8 bytes results in errors from utilities
like Citrix's vhd-util which checks specifically for the proper Data
Offset field initialization.
Signed-off-by: Charles Arnold <carnold@suse.com>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vvfat used to directly call into the qcow2 block driver instead of using the
block.c wrappers. With the coroutine conversion, this stopped working.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
First determine FAT12/16/32, then compute geometry from that for both
FDD and HDD. For 1.44MB floppies, and 2.88MB floppies using FAT16,
change to 1 sector/cluster. The default remains 2.88MB with FAT12
and 2 sectors/cluster. Both DOS and mkdosfs by default format a 2.88MB
floppy as FAT12.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The sector count is stored in the partition and hence must not include the
sectors before its start. At the same time, remove the useless special
casing for 1.44 MB floppies. This fixes fsck on VVFAT hard disks,
which otherwise tries to seek past the end of the disk.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is consistent with what "real" floppies have, so file(1)
now actually recognizes the VVFAT image as a 1.44 MB floppy.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the number of "faked sectors" + the number of sectors that are
part of a cluster does not sum up to the total number of sectors,
qemu-img convert fails. Read these spare sectors as all zeros.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When reading the address of the first free entry, you cannot
use array_get without first marking all entries as occupied.
This is visible if you change the sectors per cluster on a
floppy from 2 to 1.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix mismatching allocation and deallocation: g_free should be used to pair with
g_malloc.
Reviewed-by: Andreas Färber <afaerber@suse.de>
Reviewed_by: Ray Wang <raywang@linux.vnet.ibm.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix coding style in block/cloop.c.
Reviewed-by: Andreas Färber <afaerber@suse.de>
Reviewed_by: Ray Wang <raywang@linux.vnet.ibm.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If qcow2_cache_flush failed, s->lock will not be unlock.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Data we read from the disk isn't necessarily null terminated and may not
contain the string we're looking for. The code needs to be a bit more careful
here.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
An entry in the VDI block map will hold an offset to the actual block if
the block is allocated, or one of two specially-interpreted values if
not allocated. Using VirtualBox terminology, value VDI_IMAGE_BLOCK_FREE
(0xffffffff) represents a never-allocated block (semantically arbitrary
content). VDI_IMAGE_BLOCK_ZERO (0xfffffffe) represents a "discarded"
block (semantically zero-filled). block/vdi knows only about
VDI_IMAGE_BLOCK_FREE. Teach it about VDI_IMAGE_BLOCK_ZERO.
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This provides built-in support for iSCSI to QEMU.
This has the advantage that the iSCSI devices need not be made visible to the host, which is useful if you have very many virtual machines and very many iscsi devices.
It also has the benefit that non-root users of QEMU can access iSCSI devices across the network without requiring root privilege on the host.
This driver interfaces with the multiplatform posix library for iscsi initiator/client access to iscsi devices hosted at
git://github.com/sahlberg/libiscsi.git
The patch adds the driver to interface with the iscsi library.
It also updated the configure script to
* by default, probe is libiscsi is available and if so, build
qemu against libiscsi.
* --enable-libiscsi
Force a build against libiscsi. If libiscsi is not available
the build will fail.
* --disable-libiscsi
Do not link against libiscsi, even if it is available.
When linked with libiscsi, qemu gains support to access iscsi resources such as disks and cdrom directly, without having to make the devices visible to the host.
You can specify devices using a iscsi url of the form :
iscsi://[<username>[:<password>@]]<host>[:<port]/<target-iqn-name>/<lun>
When using authentication, the password can optionally be set with
LIBISCSI_CHAP_PASSWORD="password" to avoid it showing up in the process list
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
'ret' is unconditionally overwitten by qed_read_l1_table_sync()
Spotted by Clang Analyzer
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Spotted by Clang Analyzer
[Note this memcpy call has always been safe because the length will be 0
when the pointer is NULL]
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Since coroutine operation is now mandatory, convert both bdrv_discard
implementations to coroutines. For qcow2, this means taking the lock
around the operation. raw-posix remains synchronous.
The bdrv_discard callback is then unused and can be eliminated.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since coroutine operation is now mandatory, convert all bdrv_flush
implementations to coroutines. For qcow2, this means taking the lock.
Other implementations are simpler and just forward bdrv_flush to the
underlying protocol, so they can avoid the lock.
The bdrv_flush callback is then unused and can be eliminated.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This does the first part of the conversion to coroutines, by
wrapping bdrv_write implementations to take the mutex.
Drivers that implement bdrv_write rather than bdrv_co_writev can
then benefit from asynchronous operation (at least if the underlying
protocol supports it, which is not the case for raw-win32), even
though they still operate with a bounce buffer.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This does the first part of the conversion to coroutines, by
wrapping bdrv_read implementations to take the mutex.
Drivers that implement bdrv_read rather than bdrv_co_readv can
then benefit from asynchronous operation (at least if the underlying
protocol supports it, which is not the case for raw-win32), even
though they still operate with a bounce buffer.
raw-win32 does not need the lock, because it cannot yield.
nbd also doesn't probably, but better be safe.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The big conversion of bdrv_read/write to coroutines caused the two
homonymous callbacks in BlockDriver to become reentrant. It goes
like this:
1) bdrv_read is now called in a coroutine, and calls bdrv_read or
bdrv_pread.
2) the nested bdrv_read goes through the fast path in bdrv_rw_co_entry;
3) in the common case when the protocol is file, bdrv_co_do_readv calls
bdrv_co_readv_em (and from here goes to bdrv_co_io_em), which yields
until the AIO operation is complete;
4) if bdrv_read had been called from a bottom half, the main loop
is free to iterate again: a device model or another bottom half
can then come and call bdrv_read again.
This applies to all four of read/write/flush/discard. It would also
apply to is_allocated, but it is not used from within coroutines:
besides qemu-img.c and qemu-io.c, which operate synchronously, the
only user is the monitor. Copy-on-read will introduce a use in the
block layer, and will require converting it.
The solution is "simply" to convert all drivers to coroutines! We
just need to add a CoMutex that is taken around affected operations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move vmdk_parent_open to vmdk_open. There's another path how
vmdk_parent_open can be reached:
vmdk_parse_extents() -> vmdk_open_sparse() -> vmdk_open_vmdk4() ->
vmdk_open_desc_file().
If that can happen, however, the code is bogus. vmdk_parent_open
reads from bs->file:
if (bdrv_pread(bs->file, s->desc_offset, desc, DESC_SIZE) != DESC_SIZE) {
but it is always called with s->desc_offset == 0 and with the same
bs->file. So the data that vmdk_parent_open reads comes always from the
same place, and anyway there is only one place where it can write it,
namely bs->backing_file.
So, if it cannot happen, the patched code is okay.
It is also possible that the recursive call can happen, but only once. In
that case there would still be a bug in vmdk_open_desc_file setting
s->desc_offset = 0, but the patched code is okay.
Finally, in the case where multiple recursive calls can happen the code
would need to be rewritten anyway. It is likely that this would anyway
involve adding several parameters to vmdk_parent_open, and calling it from
vmdk_open_vmdk4.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
While vmdk_open_desc_file (touched by the patch) correctly changed -1
to -EINVAL, vmdk_open did not. Fix it directly in vmdk_parent_open.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If during allocation of compressed clusters the cluster was already allocated
uncompressed, fail and properly release the l2_table (the latter avoids a
failed assertion).
While at it, make it return some real error numbers instead of -1.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
This similarly adds support for coroutine and asynchronous discard.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Block drivers now only need to provide either of .bdrv_co_flush,
.bdrv_aio_flush() or for legacy drivers .bdrv_flush(). Remove
the redundant .bdrv_flush() implementations.
[Paolo Bonzini: change raw driver to bdrv_co_flush]
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This makes the following patch easier to review.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The raw format delegates all operations to bs->file (the protocol).
Previously this block driver exposed both sync and aio interfaces.
Since the block layer now works in terms of coroutines, expose the
coroutine interfaces and drop the others. This avoids unnecessary
emulation of sync and aio interfaces.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Block drivers only need to provide one of sync, aio, or coroutine
interfaces. Since raw-posix.c provides aio interfaces, simply drop the
synchronous interfaces since they can be emulated using aio and
coroutines.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
cppcheck reported this error:
qemu/block/qcow.c:599: error: Mismatching allocation and deallocation: cluster_data
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The unused code was detected using cppcheck.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
cppcheck reported memory leaks and mismatched g_malloc() with free()
instead of g_free().
Fix these errors.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow to resize images that reside on host devices up to the available
space. This allows to grow images after resizing the device manually or
vice versa.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QED's metadata caching strategy allows two parallel requests to race for
metadata lookup. The first one to complete will populate the metadata
cache and the second one will drop the data it just read in favor of the
cached data.
There is a use-after-free in qed_read_l2_table_cb() and
qed_commit_l2_update() where l2_table->offset was used after the
l2_table may have been freed due to a metadata lookup race. Fix this by
keeping the l2_offset in a local variable and not reaching into the
possibly freed l2_table.
Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The previous behaviour was to finish AIOCBs inside curl_aio_readv()
if the data was cached. This caused the following failed assertion
at hw/ide/pci.c:314: bmdma_cmd_writeb
"Assertion `bm->bus->dma->aiocb == ((void *)0)' failed."
By scheduling a QEMUBH and performing the completion inside the
callback, we avoid this problem.
Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The config string is variously delimited by =, @, and /, depending on the
field. Allow these characters to be escaped by preceeding them with \.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
librbd recently added async writeback and flush support. If the new
rbd_flush() call is available, call it.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Properly document the configuration string syntax and semantics. Remove
(out of date) details about the librbd implementation.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If we are reading from the default config location, ignore any failures.
It is perfectly legal for the user to specify exactly the options they need
and to not rely on any config file.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Release extent_file on error in vmdk_parse_extents. Added closing files
in freeing extents.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
nbd supports writing flags in bytes 24...27 of the header,
and uses that for the read-only flag. Add support for it
in qemu-nbd.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Those blanks violate the coding conventions, see
scripts/checkpatch.pl.
Blanks missing after colons in the changed lines were added.
This patch does not try to fix tabs, long lines and other
problems in the changed lines, therefore checkpatch.pl reports
many violations.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
QCowL2Meta::offset is not cluster aligned but only sector aligned
however nb_clusters count cluster from cluster start.
This fix range check. Note that old code have no corruption issues
related to this check cause it only cause intersection to occur
when shouldn't.
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QCow2Meta structure was inserted into list before many fields are
initialized. Currently is not a problem cause all occur in a lock
but if qcow2_alloc_clusters would in a future unlock this lock
some issues could arise.
Initializing fields before inserting fix the problem.
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix leak of s->snap in failure path. Simplify error paths for the whole
function.
Reported-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
No assignment in condition. Remove duplicate ret > 0 check.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow the client id to be specified in the config string via 'id=' so that
users can control who they authenticate as. Currently they are stuck with
the default ('admin'). This is necessary for anyone using authentication
in their environment.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The vSphere 4 exported image is streamOptimized extent, which is not
quite correctly handled. Ignore rdgOffset when RGD flag bit not set.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Haiku provides a specially formed vmdk image, which let qemu abort. It a
combination of sparse header and flat data (i.e. with not l1/l2 table at
all). The fix is turn to descriptor when sparse header is zero in field
'capacity'.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Creating streamOptimized subformat. Added subformat option
'streamOptimized', to create a image with compression enabled and each
cluster with a GrainMarker.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add support for reading/writing compressed extent.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Added flags field for compressed/streamOptimized extents, open and save
image configuration.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Factor out read/write extent code, since there will be more things to
take care of once reading/writing compressed clusters is introduced.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Enable the createType 'twoGbMaxExtentFlat'. The supporting code is
already in.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Block driver "raw" forwards most methods to the underlying block
driver. However, it doesn't implement method bdrv_media_changed().
Makes bdrv_media_changed() always return -ENOTSUP.
I believe -fda /dev/fd0 gives you raw over host_floppy, and disk
change detection (fdc register 7 bit 7) is broken. Testing my theory
requires a computer museum, though.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Requests depending on a failed request would end up waiting forever. This fixes
the error path to continue dependent requests even when the request has failed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add some notes about Linux AIO explaining why we don't use AIO in
some situations.
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Most changes were made using these commands:
git grep -la '__attribute__((packed))'|xargs perl -pi -e 's/__attribute__\(\(packed\)\)/QEMU_PACKED/'
git grep -la '__attribute__ ((packed))'|xargs perl -pi -e 's/__attribute__ \(\(packed\)\)/QEMU_PACKED/'
git grep -la '__attribute__((__packed__))'|xargs perl -pi -e 's/__attribute__\(\(__packed__\)\)/QEMU_PACKED/'
git grep -la '__attribute__ ((__packed__))'|xargs perl -pi -e 's/__attribute__ \(\(__packed__\)\)/QEMU_PACKED/'
git grep -la '__attribute((packed))'|xargs perl -pi -e 's/__attribute\(\(packed\)\)/QEMU_PACKED/'
Whitespace in linux-user/syscall_defs.h was fixed manually
to avoid warnings from scripts/checkpatch.pl.
Manual changes were also applied to hw/pc.c.
I did not fix indentation with tabs in block/vvfat.c.
The patch will show 4 errors with scripts/checkpatch.pl.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This makes the sheepdog block driver support bdrv_co_readv/writev
instead of bdrv_aio_readv/writev.
With this patch, Sheepdog network I/O becomes fully asynchronous. The
block driver yields back when send/recv returns EAGAIN, and is resumed
when the sheepdog network connection is ready for the operation.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Embed qcow_aio_read_cb into qcow_co_readv and qcow_aio_write_cb into qcow_co_writev
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
remove unused field from this structure and put some of them in qcow_aio_read_cb and qcow_aio_write_cb
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Current behaviour if a read fails is for the acb to not get finished.
This causes an infinite loop in bdrv_read_em (block.c). The read failure
never gets reported to the guest and if the error condition clears, the
process never recovers.
With this patch, when curl reports a failure we finish the acb as a
failure. This results in the guest receiving an I/O error (rather than
the read hanging indefinitely) and if the error condition subsequently
clears, retries work as expected.
The simplest test is to put an ISO on a web server you have control over
and open it with qemu-io. Then move the ISO out of the way and attempt
to read some data - you should see behaviour matching the above.
Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
commit 52b8eb6013 added a mutex,
but never initialized it. This caused a segfault.
Reported-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Documentation states the num is measured in clusters, but its
actually measured in sectors
Signed-off-by: Devin Nakamura <devin122@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
By introducing BlockDriverState compiling qcow2 with DEBUG_ALLOC and DEBUG_EXT
defined got broken.
Define a BdrvCheckResult structure locally which is now needed as the second
argument.
Also fix qcow2_read_extensions() needing BDRVQcowState.
Signed-off-by: Philipp Hahn <hahn@univention.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
SetFilePointer returns INVALID_SET_FILE_POINTER when it fails.
In addition, GetLastError must be checked.
The first call of SetFilePointer did not use INVALID_SET_FILE_POINTER,
the second call used wrong error handling.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When loading an internal snapshot whose L1 table is smaller than the current L1
table, the size of the current L1 would be shrunk to the snapshot's L1 size in
memory, but not on disk. This lead to incorrect refcount updates and eventuelly
to image corruption.
Instead of writing the new L1 size to disk, this simply retains the bigger L1
size that is currently in use and makes sure that the unused part is zeroed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Tested-by: Philipp Hahn <hahn@univention.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The purpose of AsyncContexts was to protect qcow and qcow2 against reentrancy
during an emulated bdrv_read/write (which includes a qemu_aio_wait() call and
can run AIO callbacks of different requests if it weren't for AsyncContexts).
Now both qcow and qcow2 are protected by CoMutexes and AsyncContexts can be
removed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The old qcow format is another user of the AsyncContext infrastructure.
Converting it to coroutines (and therefore CoMutexes) allows to remove
AsyncContexts.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
VHD files technically can be up to 2Tb, but virtual pc is limited
to 127G. Currently qemu-img refused to create vpc files > 127G,
but it is failing to return error when converting from a non-vpc
VHD file which is >127G. It returns success, but creates a truncated
converted image. Also, qemu-img info claims the vpc file is 127G
(and clean).
This patch detects a too-large vpc file and returns -EFBIG. Without
this patch,
=============================================================
root@ip-10-38-123-242:~/qemu-fixed# qemu-img info /mnt/140g-dynamic.vhd
image: /mnt/140g-dynamic.vhd
file format: vpc
virtual size: 127G (136899993600 bytes)
disk size: 284K
root@ip-10-38-123-242:~/qemu-fixed# qemu-img convert -f vpc -O raw /mnt/140g-dynamic.vhd /mnt/y
root@ip-10-38-123-242:~/qemu-fixed# echo $?
0
root@ip-10-38-123-242:~/qemu-fixed# qemu-img info /mnt/y
image: /mnt/y
file format: raw
virtual size: 127G (136899993600 bytes)
disk size: 0
=============================================================
(The 140G image was truncated with no warning or error.)
With the patch, I get:
=============================================================
root@ip-10-38-123-242:~/qemu-fixed# ./qemu-img info /mnt/140g-dynamic.vhd
qemu-img: Could not open '/mnt/140g-dynamic.vhd': File too large
root@ip-10-38-123-242:~/qemu-fixed# ./qemu-img convert -f vpc -O raw /mnt/140g-dynamic.vhd /mnt/y
qemu-img: Could not open '/mnt/140g-dynamic.vhd': File too large
qemu-img: Could not open '/mnt/140g-dynamic.vhd'
=============================================================
See https://bugs.launchpad.net/qemu/+bug/814222 for details.
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Callees always return 0, except for FreeBSD's cdrom_eject(), which
returns -ENOTSUP when the device is in a terminally wedged state.
The only caller is bdrv_eject(), and it maps -ENOTSUP to 0 since
commit 4be9762a.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The only caller is bdrv_set_locked(), and it ignores the value.
Callees always return 0, except for FreeBSD's cdrom_set_locked(),
which returns -ENOTSUP when the device is in a terminally wedged
state.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It's been disabled since the start (commit 19cb3738, Aug 2006), and
has been untouched except for spelling fixes and such. I don't feel
like dragging it along any further.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Avoid warnings like these by wrapping recv():
CC slirp/ip_icmp.o
/src/qemu/slirp/ip_icmp.c: In function 'icmp_receive':
/src/qemu/slirp/ip_icmp.c:418:5: error: passing argument 2 of 'recv' from incompatible pointer type [-Werror]
/usr/local/lib/gcc/i686-mingw32msvc/4.6.0/../../../../i686-mingw32msvc/include/winsock2.h:547:32: note: expected 'char *' but argument is of type 'struct icmp *'
Remove also casts used to avoid warnings.
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
In snapshotting there is no guest involved, so we can safely use a writeback
mode and do the flushes in the right place (i.e. at the very end). This
improves the time that creating/restoring an internal snapshot takes with an
image in writethrough mode.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qemu-img.c wants to count allocated file size of image. Previously it
counts a single bs->file by 'stat' or Window API. As VMDK introduces
multiple file support, the operation becomes format specific with
platform specific meanwhile.
The functions are moved to block/raw-{posix,win32}.c and qemu-img.c calls
bdrv_get_allocated_file_size to count the bs. And also added VMDK code
to count his own extents.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Conform coding style in vmdk.c to pass scripts/checkpatch.pl checks.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add create option 'format', with enums:
monolithicSparse
monolithicFlat
twoGbMaxExtentSparse
twoGbMaxExtentFlat
Each creates a subformat image file. The default is monolithicSparse.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Parse vmdk decriptor file and open mono flat image.
Read/write the flat extent.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The return type of get_cluster_offset was an offset that use 0 to denote
'not allocated', this will be no longer true for flat extents, as we see
flat extent file as a single huge cluster whose offset is 0 and length
is the whole file length.
So now we use int return value, 0 means success and otherwise offset
invalid.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Cid_update is the flag for updating CID on first write after opening the
image. This should be per image open rather than per program life cycle,
so change it from static var of vmdk_write to a field in BDRVVmdkState.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Flush all the file that referenced by the image.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There are several occurrence of magic number 0x200 as the descriptor
offset within mono sparse image file. This is not the case for images
with separate descriptor file. So a field is added to BDRVVmdkState to
hold the correct value.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Separate vmdk_open by subformats to:
* vmdk_open_vmdk3
* vmdk_open_vmdk4
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Probe as the same behavior as VMware does.
Recognize image as monolithicFlat descriptor file when the file is text
and the first effective line (not '#' leaded comment or space line) is
either 'version=1' or 'version=2'. No space or upper case charactors
accepted.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In get_whole_cluster, the offset is not aligned to cluster when reading
from backing_hd. When the first write to child is not at the cluster
boundary, wrong address data from parent is copied to child.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This introduces qemu-img create option for sheepdog which allows the
data to be fully preallocated (note that sheepdog always preallocates
metadata).
The option is disabled by default and you need to enable it like the
following:
qemu-img create sheepdog:test -o preallocation=full 1G
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On Linux x86_64 host with 32bit userspace, running
qemu or even just "qemu-img create -f qcow2 some.img 1G"
causes a kernel warning:
ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img
ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img
ioctl 00005326 is CDROM_DRIVE_STATUS,
ioctl 801c0204 is FDGETPRM.
The warning appears because the Linux compat-ioctl handler for these
ioctls only applies to block devices, while qemu also uses the ioctls on
plain files. Work around by calling fstat() the ensure the ioctls are
only used on block devices.
Signed-off-by: Johannes Stezenbach <js@sig21.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
error_report() prepends location, and appends a newline. The message
constructed from the arguments should not contain a newline. Fix the
obvious offenders.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
If qcow2_cache_put returns an error during cluster allocation and the
allocation fails, it must be removed from the list of in-flight allocations.
Otherwise we'd get a loop in the list when the ACB is used for the next
allocation.
Luckily, this qcow2_cache_put shouldn't fail anyway because the L2 table is
only read, so that qcow2_cache_put doesn't even involve I/O.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
bdrv_aio_* must not call the callback before returning to its caller. In vdi,
this could happen in some error cases. This starts the real requests processing
in a BH to avoid this situation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_aio_* must not call the callback before returning to its caller. In qcow,
this could happen in some error cases. This starts the real requests processing
in a BH to avoid this situation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_aio_* must not call the callback before returning to its caller. In qcow2,
this could happen in some error cases. This starts the real requests processing
in a BH to avoid this situation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Variable 'snap' is assigned a value that is never used.
Remove snap and the related code.
Cc: Christian Brunner <chb@muc.de>
Cc: Josh Durgin <josh.durgin@dreamhost.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When not specifying a cluster size on the command line, qemu-img printed
a cluster size of 0:
Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864
encryption=off cluster_size=0
This patch adds the default cluster size to the QEMUOptionParameter list, so
that it displays the default value that is used.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes memory leaks that may be caused by I/O errors during L1 table growth
(can happen during save_vm) and in qemu-img check.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If scheduling fails, the number of outstanding I/Os must be correct,
or there will be a hang when waiting for everything to be flushed.
Reviewed-by: Christian Brunner <chb@muc.de>
Reported-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The new format is rbd:pool/image[@snapshot][:option1=value1[:option2=value2...]]
Each option is used to configure rados, and may be any Ceph option, or "conf".
The "conf" option specifies a Ceph configuration file to read.
This allows rbd volumes from more than one Ceph cluster to be used by
specifying different monitor addresses, as well as having different
logging levels or locations for different volumes.
Reviewed-by: Christian Brunner <chb@muc.de>
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
librbd stacks on top of librados to provide access
to rbd images.
Using librbd simplifies the qemu code, and allows
qemu to use new versions of the rbd format
with few (if any) changes.
Reviewed-by: Christian Brunner <chb@muc.de>
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
use the correct way to get the size of a disk device or partition
From: Adam Hamsik <haad@netbsd.org>
Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On NetBSD a userland process is better with the character device
interface. In addition, a block device can't be opened twice; if a Xen
backend opens it, qemu can't and vice-versa.
Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The vmdk code is sloppy when handling the header descriptor during
creation of an image. Fix all header accesses in the create path to
either store native endianness or convert it when appropriate.
Reported-by: Yury Tsarev <ytsarev@novell.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Change BDRV_O_NOCACHE to only imply bypassing the host OS file cache,
but no writeback semantics. All existing callers are changed to also
specify BDRV_O_CACHE_WB to give them writeback semantics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch removes all references to signal.h when qemu-common.h is included
as they become redundant.
Signed-off-by: Alexandre Raymond <cerbere@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The .bdrv_truncate() operation resizes images and growing is easy to
implement in QED. Simply check that the new size is valid and then
update the image_size header field to reflect the new size.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
One strategy to limit the startup delay of consistency check when
opening image files is to ensure that the file is marked dirty for as
little time as possible.
QED currently marks the image dirty when the first allocating write
request is issued and clears the dirty bit again when the image is
cleanly closed. In practice that means the image is marked dirty for
most of a guest's lifetime and prone to being in a dirty state upon
crash or power failure.
It is safe to clear the dirty bit after all allocating write requests
have completed and a flush has been performed. This patch adds a timer
after the last allocating write request completes. When the timer fires
it will flush and then clear the dirty bit. The timer is set to 5
seconds and is cancelled upon arrival of a new allocating write request.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The code changed here is an unused data type name (evt_flush_occurred).
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The qed_bytes_to_clusters() function is normally used with size_t
lengths. Consistency check used it with file size length and therefore
failed on 32-bit hosts when the image file is 4 GB or more.
Make qed_bytes_to_clusters() explicitly 64-bit and update consistency
check to keep 64-bit cluster counts.
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use get_option_parameter() to instead of duplicating the loop, and
use BDRV_SECTOR_SIZE to instead of 512
Signed-off-by: Mitnick Lyu <mitnick.lyu@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Zero clusters are similar to unallocated clusters except instead of reading
their value from a backing file when one is available, the cluster is always
read as zero.
This implements read support only. At this stage, QED will never write a
zero cluster.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We also change the way the file parameter is parsed so IPv6 IP
addresses can be used, e.g.: "drive=nbd:[::1]:5000"
Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qemu now has generic bitmap functions,
so don't redefine them in sheepdog.c,
use common header instead. A small cleanup.
Here's only one function which is actually
used in sheepdog and gets replaced with
a generic one (simplified):
- static inline int test_bit(int nr, const volatile unsigned long *addr)
+ static inline int test_bit(int nr, const unsigned long *addr)
{
- return ((1UL << (nr % BITS_PER_LONG))
& ((unsigned long*)addr)[nr / BITS_PER_LONG])) != 0;
+ return 1UL & (addr[nr / BITS_PER_LONG] >> (nr & (BITS_PER_LONG-1)));
}
The body is equivalent, but the argument is not: there's
"volatile" in there. Why it is used for - I'm not sure.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Acked-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
This patch is similar to 171e3d6b99
which fixed qcow2:
Returning -EIO is far from optimal, but at least it's an error code.
In addition to read/write failures, -EIO is also returned when
decompress_cluster failed.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch is similar to 171e3d6b99
which fixed qcow2:
Returning -EIO is far from optimal, but at least it's an error code.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When copying L2 tables (this happens only with internal snapshots), the order
wasn't completely safe, so that after a crash you could end up with a L2 table
that has too low refcount, possibly leading to corruption in the long run.
This patch puts the operations in the right order: First allocate the new
L2 table and replace the reference, and only then decrease the refcount of the
old table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of just returning -ENOTSUP, generate a more detailed error.
Unfortunately we don't have a helpful text for features that we don't know yet,
so just print the feature mask. It might be useful at least if someone asks for
help.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
Acked-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The qcow2 driver is now declared responsible for any QCOW image that has
version 2 or greater (before this, version 3 would be detected as raw).
For everything newer than version 2, an error is reported.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
When reading a compressed cluster failed, qcow2 falsely returned success.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Requests could return success even though they failed when bdrv_aio_readv
returned NULL for a backing file read.
Reported-by: Chunqiang Tang <ctang@us.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch fixes the following bug in QCOW2. For a QCOW2 image that is larger
than its base image, when handling a read request straddling over the end of the
base image, the QCOW2 driver attempts to read beyond the end of the base image
and the request would fail.
This bug was found by Fast Virtual Disk (FVD)'s fully automated testing tool.
The following test triggered the bug.
dd if=/dev/zero of=/var/ramdisk/truth.raw count=0 bs=1 seek=1098561536
dd if=/dev/zero of=/var/ramdisk/zero-500M.raw count=0 bs=1 seek=593099264
./qemu-img create -f qcow2 -ocluster_size=65536,backing_fmt=blksim -b /var/ramdisk/zero-500M.raw /var/ramdisk/test.qcow2 1098561536
./qemu-io --auto --seed=30477694 --truth=/var/ramdisk/truth.raw --format=qcow2 --test=blksim:/var/ramdisk/test.qcow2 --verify_write=true --compare_before=false --compare_after=true --round=100000 --parallel=100 --io_size=10485760 --fail_prob=0 --cancel_prob=0 --instant_qemubh=true
Signed-off-by: Chunqiang Tang <ctang@us.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Error report from cppcheck:
block/vdi.c:122: error: Using sizeof for array given as function argument returns the size of pointer.
block/vdi.c:128: error: Using sizeof for array given as function argument returns the size of pointer.
Fix both by setting the correct size.
The buggy code is only used when QEMU is build without uuid support.
The bug is not critical, so there is no urgent need to apply it to
old versions of QEMU.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For cache=unsafe we also need to set BDRV_O_CACHE_WB, otherwise we have some
strange unsafe writethrough mode.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Variables l2_modified and l2_size are not really used, remove them.
Spotted by GCC 4.6.0:
CC block/qcow2-refcount.o
/src/qemu/block/qcow2-refcount.c: In function 'qcow2_update_snapshot_refcount':
/src/qemu/block/qcow2-refcount.c:708:37: error: variable 'l2_modified' set but not used [-Werror=unused-but-set-variable]
/src/qemu/block/qcow2-refcount.c:708:9: error: variable 'l2_size' set but not used [-Werror=unused-but-set-variable]
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The consistency check on open is necessary in order to fix inconsistent
table offsets left as a result of a crash mid-operation. Images with a
backing file actually flush before updating table offsets and are
therefore guaranteed to be consistent. Do not mark these images dirty.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds a bdrv_discard function to qcow2 that frees the discarded clusters.
It does not yet pass the discard on to the underlying file system driver, but
the space can be reused by future writes to the image.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
This patch parses the input filename in sd_create(), and enables us
specifying a target server to create sheepdog images.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move size after the two pointers in struct Qcow2Cache to get better
packing of struct elements on 64 bit architectures.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QED relies on the underlying filesystem to extend the file and maintain
its size. Check that images are not created on a block device.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 calls bdrv_flush() after performing COW in order to ensure that the
L2 table change is never written before the copy is safe on disk. Now that the
L2 table is cached, we can wait with flushing until we write out the next L2
table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds some new cache functions to qcow2 which can be used for caching
refcount blocks and L2 tables. When used with cache=writethrough they work
like the old caching code which is spread all over qcow2, so for this case we
have merely a cleanup.
The interesting case is with writeback caching (this includes cache=none) where
data isn't written to disk immediately but only kept in cache initially. This
leads to some form of metadata write batching which avoids the current "write
to refcount block, flush, write to L2 table" pattern for each single request
when a lot of cluster allocations happen. Instead, cache entries are only
written out if its required to maintain the right order. In the pure cluster
allocation case this means that all metadata updates for requests are done in
memory initially and on sync, first the refcount blocks are written to disk,
then fsync, then L2 tables.
This improves performance of scenarios with lots of cluster allocations
noticably (e.g. installation or after taking a snapshot).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
cpu_to_be64w() is called with an obviously non-aligned pointer. Use
cpu_to_be64wu() instead. It fixes unaligned accesses errors on IA64
hosts.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix a file descriptor leak, reported by cppcheck:
[/src/qemu/block/vvfat.c:759]: (error) Resource leak: dir
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
In addition this adds missing braces to the function to be consistent
with the coding style.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It doesn't really make sense for functions in qcow2.c to be named
qcow_ so convert the names to match correctly.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds support for the qemu-img check command. It also
introduces a dirty bit in the qed header to mark modified images as
needing a check. This bit is cleared when the image file is closed
cleanly.
If an image file is opened and it has the dirty bit set, a consistency
check will run and try to fix corrupted table offsets. These
corruptions may occur if there is power loss while an allocating write
is performed. Once the image is fixed it opens as normal again.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch implements the read/write state machine. Operations are
fully asynchronous and multiple operations may be active at any time.
Allocating writes lock tables to ensure metadata updates do not
interfere with each other. If two allocating writes need to update the
same L2 table they will run sequentially. If two allocating writes need
to update different L2 tables they will run in parallel.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds code to look up data cluster offsets in the image via
the L1/L2 tables. The L2 tables are writethrough cached in memory for
performance (each read/write requires a lookup so it is essential to
cache the tables).
With cluster lookup code in place it is possible to implement
bdrv_is_allocated() to query the number of contiguous
allocated/unallocated clusters.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch introduces the qed on-disk layout and implements image
creation. Later patches add read/write and other functionality.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add support to discard blocks in a raw image residing on an XFS filesystem
by calling the XFS_IOC_UNRESVSP64 ioctl to punch holes. Support for other
hole punching mechanisms can be added when they become available.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a new bdrv_discard method to free blocks in a mapping image, and a new
drive property to set the granularity for these discard. If no discard
granularity support is set discard support is disabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
RBD is an block driver for the distributed file system Ceph
(http://ceph.newdream.net/). This driver uses librados (which is part
of the Ceph server) for direct access to the Ceph object store and is
running entirely in userspace (Yehuda also wrote a driver for the
linux kernel, that can be used to access rbd volumes as a block
device).
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Christian Brunner <chb@muc.de>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All drivers use bs->file instead of s->hd for quite a while now, so it's time
to remove s->hd.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
When building on a 64 bit host which uses 'long' for int64_t,
GCC emits a warning:
CC block/blkverify.o
/src/qemu/block/blkverify.c: In function `blkverify_verify_readv':
/src/qemu/block/blkverify.c:304: warning: long long int format, long
unsigned int arg (arg 3)
Rework a77cffe7e9 to avoid the warning.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The cache content may be destroyed after a failed read, better not use it any
more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
This changes bdrv_flush to return 0 on success and -errno in case of failure.
It's a requirement for implementing proper error handle in users of bdrv_flush.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Move timer init functions to a new file, qemu-timer-common.c. Make other
critical timer functions inlined to preserve performance in
qemu-timer.c, also move muldiv64() (used by the inline functions)
to qemu-timer.h.
Adjust block/raw-posix.c and simpletrace.c to use get_clock() directly.
Remove a similar/duplicate definition in qemu-tool.c.
Adjust hw/omap_clk.c to include qemu-timer.h because muldiv64() is used
there.
After this change, tracing can be used also for user code and
simpletrace on Win32.
Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Adding the gcc format attribute detects a format bug
which is fixed here.
v2:
Don't use type cast. BDRV_SECTOR_SIZE is unsigned long long,
so %lld should be the correct format specifier.
Cc: Blue Swirl <blauwirbel@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In order to backup snapshots, created from QCOW2 iamge, we want to copy snapshots out of QCOW2 disk to a seperate storage.
The following patch adds a new option in "qemu-img": qemu-img convert -f qcow2 -O qcow2 -s snapshot_name src_img bck_img.
Right now, it only supports to copy the full snapshot, delta snapshot is on the way.
Changes from V1: all the comments from Kevin are addressed:
Add read-only checking
Fix coding style
Change the name from bdrv_snapshot_load to bdrv_snapshot_load_tmp
Signed-off-by: Disheng Su <edison@cloud.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of doing lots of magic for setting up initial refcount blocks and stuff
create a minimal (inconsistent) image, open it and initialize the rest with
regular qcow2 functions.
This is a complete rewrite of the image creation function. The old
implementating is #ifdef'd out and will be removed by the next patch (removing
it here would have made the diff unreadable because diff tries to find
similarities when it's really a rewrite)
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The L1 table grow operation includes a size calculation that bumps up
the new L1 table size in order to anticipate the size needs of vmstate
data. This helps reduce the number of times that the L1 table has to be
grown when vmstate data is appended.
This size overhead is not necessary during image creation,
bdrv_truncate(), or snapshot goto operations. In fact, existing
qemu-iotests that exercise table growth are no longer able to trigger it
because image creation preallocates an L1 table that is too large after
changes to qcow_create2().
This patch keeps the size calculation but also adds exact growth for
callers that do not want to inflate the L1 table size unnecessarily.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Compiling with GCC 4.6.0 20100925 produced a warning:
/src/qemu/block/qcow2-refcount.c: In function 'update_refcount':
/src/qemu/block/qcow2-refcount.c:552:13: error: variable 'dummy' set but not used [-Werror=unused-but-set-variable]
Fix by adding a dummy cast so that the result is not unused.
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Fix this compiler warning:
./block/vvfat.c:2285: error: comparison of unsigned expression >= 0 is always true
Cc: Blue Swirl <blauwirbel@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
The blkverify block driver makes investigating image format data
corruption much easier. A raw image initialized with the same contents
as the test image (e.g. qcow2 file) must be provided. The raw image
mirrors read/write operations and is used to verify that data read from
the test image is correct.
See docs/blkverify.txt for more information.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 used to use bounce buffers for any AIO requests. This does not only imply
unnecessary copying, but also unbounded allocations which should be avoided.
This patch removes bounce buffers from the normal AIO write path. Encrypted
images continue to use a bounce buffer, however with constant size.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 used to use bounce buffers for any AIO requests. This does not only imply
unnecessary copying, but also unbounded allocations which should be avoided.
This patch removes bounce buffers from the normal AIO read path, and constrains
them to a constant size for encrypted images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We always have a sync for the refcount update when a new cluster is
allocated. If we move this past the COW, we can save an additional sync.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Note that the flush is omitted intentionally in qcow2_free_clusters. If
anything, we can leak clusters here if we lose the writes.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block/nbd.c: use default port number when none is specified
qemu-nbd.c: use IANA-assigned port number: 10809
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace the hardcoded handling of 512 byte alignment with bs->buffer_alignment
to handle larger sector size devices correctly.
Note that we can not rely on it to be initialize in bdrv_open, so deal
with the worst case there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The qcow file used for write support in vvfat is a temporary file,
so we can use cache=unsafe there. Without this, write support is just
too slow to be of any use.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
Allocation and deallocation of bs->opaque is not in the control of a
block driver. Therefore it should not set bs->opaque to a data structure
used by another bs, or closing the image will lead to a double free.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
vvfat tries to set the readonly flag in its open function, but nowadays
this is overwritted with the readonly=... command line option. Check in
bdrv_write if the vvfat was opened read-only and return an error in this
case.
Without this check, vvfat tries to access the qcow bs, which is NULL
without enabled write support.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
The signedness of enum types depend on the compiler implementation.
Therefore the check for negative values may or may not be meaningful.
Fix by explicitly casting to a signed integer.
Since the values are also checked earlier against event_names
table, this is an internal error. Change the 'if' to 'assert'.
This also avoids a warning with GCC flag -Wtype-limits.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This reverts commit 79368c81bf.
Conflicts:
block.c
I haven't been able to come up with a solution yet for the corruption caused by
unaligned requests from the IDE disk so revert until a solution can be written.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When a new cluster was allocated, we only need a flush after the write to the
L2 table if it was a COW and we need to decrease the refcounts of the old
clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow symbolic links which point to /dev/sgX devices.
Signed-off-by: Bernhard Kohl <bernhard.kohl@nsn.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On Linux, we have code to detect CD-ROMs using an ioctl. We shouldn't lose
anything but false positives by removing the check for a /dev/cd* path.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch allows to connect Qemu using NBD protocol to an nbd-server
using named exports.
For instance, if on the host "isoserver", in /etc/nbd-server/config, you have:
[generic]
[debian-500-ppc-netinst]
exportname = /ISO/debian-500-powerpc-netinst.iso
[Fedora-10-ppc-netinst]
exportname = /ISO/Fedora-10-ppc-netinst.iso
You can connect to it, using:
qemu -cdrom nbd:isoserver:exportname=debian-500-ppc-netinst
qemu -cdrom nbd:isoserver:exportname=Fedora-10-ppc-netinst
NOTE: you need at least nbd-server 2.9.18
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
"qemu_socket.h" includes all necessary files and
including <netinet/tcp.h> without <netinet/in.h>
could cause errors on some systems.
Signed-off-by: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Assuming that any image on a block device is not properly zero-initialized is
actually wrong: Only raw images have this problem. Any other image format
shouldn't care about it, they initialize everything properly themselves.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no need to have a second set of integral types.
Replace them by the standard types from stdint.h.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
CVE-2008-2004 described a vulnerability in QEMU whereas a malicious user could
trick the block probing code into accessing arbitrary files in a guest. To
mitigate this, we added an explicit format parameter to -drive which disabling
block probing.
Fast forward to today, and the vast majority of users do not use this parameter.
libvirt does not use this by default nor does virt-manager.
Most users want block probing so we should try to make it safer.
This patch adds some logic to the raw device which attempts to detect a write
operation to the beginning of a raw device. If the first 4 bytes happen to
match an image file that has a backing file that we support, it scrubs the
signature to all zeros. If a user specifies an explicit format parameter, this
behavior is disabled.
I contend that while a legitimate guest could write such a signature to the
header, we would behave incorrectly anyway upon the next invocation of QEMU.
This simply changes the incorrect behavior to not involve a security
vulnerability.
I've tested this pretty extensively both in the positive and negative case. I'm
not 100% confident in the block layer's ability to deal with zero sized writes
particularly with respect to the aio functions so some additional eyes would be
appreciated.
Even in the case of a single sector write, we have to make sure to invoked the
completion from a bottom half so just removing the zero sized write is not an
option.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
WIN32 is not only the system which doesn't have TCP_CORK (e.g. OS X).
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Sheepdog is a distributed storage system for QEMU. It provides highly
available block level storage volumes to VMs like Amazon EBS. This
patch adds a qemu block driver for Sheepdog.
Sheepdog features are:
- No node in the cluster is special (no metadata node, no control
node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing
The more details are available at the project site:
http://www.osrg.net/sheepdog/
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
raw_pread_aligned() retries up to two times if the block device backs
a virtual CD-ROM (a drive with media=cdrom and if=ide, scsi, xen or
none). This makes no sense. Whether retrying reads can correct read
errors can only depend on what we're reading, not on how the result
gets used. We need to check what whether we're reading from a
physical CD-ROM or floppy here.
I doubt retrying is useful even then. Left for another day.
Impact:
* Virtual CD-ROM backed by host_cdrom behaves the same.
* Virtual CD-ROM backed by file or host_device no longer retries.
* A drive backed by host_cdrom now retries even if it's not a virtual
CD-ROM.
* Any drive backed by host_floppy now retries.
While there, clean up gratuitous use of goto.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This distinguishes between harmless leaks and real corruption. Hopefully users
better understand what qemu-img check wants to tell them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
state = 0 in rules means that the rule is valid for any state. Therefore it's
impossible to have a rule that works only in the initial state. This changes
the initial state from 0 to 1 to make this possible.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Forgetting to free them means that the next instance inherits all rules and
gets its own rules only additionally.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The list head was initialized to point to the wrong list, so all actions ended
up being handled as inject-error even if they were set-state in fact.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
People were wondering why qemu-img check failed after they tried to preallocate
a large qcow2 file and ran out of disk space.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Trying to check them leads to a second error message which is more confusing
than helpful:
Can't get refcount for cluster 0: Invalid argument
ERROR cluster 0 refcount=-22 reference=1
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With corrupted images, we can easily get an cluster index that exceeds the
array size of the temporary refcount table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>