This removes blocking network I/Os in coroutine context.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently, no one reenters the yielded coroutine. This fixes it.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes warnings about dprintf format in debug mode.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When qcow2_alloc_clusters() error handling code was introduced in commit
5d757b563d, the value of free_byte_offset
was clobbered in the error case. This patch keeps free_byte_offset at 0
so we will try to allocate clusters again next time this function is
called.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The DEBUG_ALLOC qcow2.h macro enables additional consistency checks
throughout the code. This makes it easier to spot corruptions that are
introduced during development. Since consistency check is an expensive
operation the DEBUG_ALLOC macro is used to compile checks out in normal
builds and qcow2_check_refcounts() calls missed the addition of a new
function argument.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the device we open is a SMC or SSC device, then force the use of sg. We
dont have any medium changer or tape emulation so only passthrough via
real sg or scsi-generic via iscsi would work anyway.
Forcing sg also makes qemu skip trying to read from the device to guess
the image format by reading from the device (find_image_format()).
SMC devices do not implement READ6/10/12/16 so it is not possible to
read from them (SSC have different CDBs).
With this patch I can successfully manage a SMC device wiht iscsi in
passthrough mode.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
[Added TYPE_TAPE handling - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Update iscsi to allow passthrough of SG_IO scsi commands when the iscsi
device is forced to be scsi-generic.
Implement both bdrv_ioctl() and bdrv_aio_ioctl() in the iscsi backend,
emulate the SG_IO ioctl and pass the SCSI commands across to the
iscsi target.
This allows end-to-end passthrough of SCSI all the way from the guest,
to qemu, via scsi-generic, then libiscsi all the way to the iscsi target.
To activate this you need to specify that the iscsi lun should be treated
as a scsi-generic device.
Example:
-device lsi -device scsi-generic,drive=MyISCSI \
-drive file=iscsi://10.1.1.125/iqn.ronnie.test/1,if=none,id=MyISCSI
Note, you can currently not boot a qemu guest from a scsi device.
Note,
This only works when the host is linux, since the emulation relies on
definitions of SG_IO from the scsi-generic implementation in the
linux kernel.
It should be fairly easy to re-implement some structures similar enough
for non-linux hosts to do the same style of passthrough via a fake
scsi generic layer and libiscsi if need be.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the declaration of s into the #ifdef sections that actually make
use of it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
The autoclear feature bits can be used for qcow2 file format features
that are safe to "drop" by old programs that do not understand the
feature. Upon opening the image file unknown autoclear feature bits are
cleared and the image file header is rewritten, but this was happening
too early in the code when critical header fields were not yet loaded.
Process autoclear feature bits after all necessary header information
has been loaded.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
avail_sectors should really be the number of sectors from the start of
the allocation, not from the start of the write request.
We're lucky enough that this mistake didn't cause any real bug.
avail_sectors is only used in the intialiser of QCowL2Meta:
.nb_available = MIN(requested_sectors, avail_sectors),
m->nb_available in turn is only used for COW at the end of the
allocation. A COW occurs only if the request wasn't cluster aligned,
which in turn would imply that requested_sectors was less than
avail_sectors (both in the original and in the fixed version). In this
case avail_sectors is ignored and therefore the mistake doesn't cause
any misbehaviour.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
copy_sectors() always uses the sum (cluster_offset + n_start) or
(start_sect + n_start), so if some value is added to both cluster_offset
and start_sect, and subtracted from n_start, it's cancelled out anyway.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Writethrough does not need special-casing anymore in the qcow2 caches.
The block layer adds flushes after every guest-initiated data write,
and these will also flush the qcow2 caches to the OS.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Writeback caching was added in Ceph 0.46, and writethrough will be in
0.47. These are controlled by general config options, so there's no
need to check for librbd version.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When any inconsistencies have been fixed, print the statistics and run
another check to make sure everything is correct now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The QED block driver already provides the functionality to not only
detect inconsistencies in images, but also fix them. However, this
functionality cannot be manually invoked with qemu-img, but the
check happens only automatically during bdrv_open().
This adds a -r switch to qemu-img check that allows manual invocation
of an image repair.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
is_allocated_base has complex semantics that are not really usable
outside streaming. Split the check in two parts, where the allocated
state for the top bs is moved to the caller. The resulting function
is more generally useful.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Either FIEMAP, or SEEK_DATA+SEEK_HOLE can be used to implement the
is_allocated callback for raw files. On Linux ext4, btrfs and XFS
all support it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 3948d1d4 removed the pointer argument we filled in with l2_offset
but forgot to remove the unnecessary l2_offset assignment.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Some gcc versions seem not to be able to figure out that the switch
statement covers all possible values and that c is therefore always
initialised. Add a default branch for them.
Reported-by: malc <av1474@comtv.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: malc <av1474@comtv.ru>
The same as for non-coroutine versions in previous
patches: rename arguments to be more obvious, change
type of arguments from int to size_t where appropriate,
and use common code for send and receive paths (with
one extra argument) since these are exactly the same.
Use common iov_send_recv() directly.
qemu_co_sendv(), qemu_co_recvv(), and qemu_co_recv()
are now trivial #define's merely adding one extra arg.
qemu_co_sendv() and qemu_co_recvv() callers are
converted to different argument order and extra
`iov_cnt' argument.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
It now allows specifying offset within qiov to start from and
amount of bytes to copy. Actual implementation is just a call
to iov_to_buf().
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
qemu_iovec_concat() is currently a wrapper for
qemu_iovec_copy(), use the former (with extra
"0" arg) in a few places where it is used.
Change skip argument of qemu_iovec_copy() from
uint64_t to size_t, since size of qiov itself
is size_t, so there's no way to skip larger
sizes. Rename it to soffset, to make it clear
that the offset is applied to src.
Also change the only usage of uint64_t in
hw/9pfs/virtio-9p.c, in v9fs_init_qiov_from_pdu() -
all callers of it actually uses size_t too,
not uint64_t.
One added restriction: as for all other iovec-related
functions, soffset must point inside src.
Order of argumens is already good:
qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
int c, size_t bytes)
vs:
qemu_iovec_concat(QEMUIOVector *dst,
QEMUIOVector *src,
size_t soffset, size_t sbytes)
(note soffset is after _src_ not dst, since it applies to src;
for memset it applies to qiov).
Note that in many places where this function is used,
the previous call is qemu_iovec_reset(), which means
many callers actually want copy (replacing dst content),
not concat. So we may want to add a wrapper like
qemu_iovec_copy() with the same arguments but which
calls qemu_iovec_reset() before _concat().
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Similar to
qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
int c, size_t bytes);
the new prototype is:
qemu_iovec_from_buf(QEMUIOVector *qiov, size_t offset,
const void *buf, size_t bytes);
The processing starts at offset bytes within qiov.
This way, we may copy a bounce buffer directly to
a middle of qiov.
This is exactly the same function as iov_from_buf() from
iov.c, so use the existing implementation and rename it
to qemu_iovec_from_buf() to be shorter and to match the
utility function.
As with utility implementation, we now assert that the
offset is inside actual iovec. Nothing changed for
current callers, because `offset' parameter is new.
While at it, stop using "bounce-qiov" in block/qcow2.c
and copy decrypted data directly from cluster_data
instead of recreating a temp qiov for doing that.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
This patch combines two functions into one, and replaces
the implementation with already existing iov_memset() from
iov.c.
The new prototype of qemu_iovec_memset():
size_t qemu_iovec_memset(qiov, size_t offset, int fillc, size_t bytes)
It is different from former qemu_iovec_memset_skip(), and
I want to make other functions to be consistent with it
too: first how much to skip, second what, and 3rd how many
of it. It also returns actual number of bytes filled in,
which may be less than the requested `bytes' if qiov is
smaller than offset+bytes, in the same way iov_memset()
does.
While at it, use utility function iov_memset() from
iov.h in posix-aio-compat.c, where qiov was used.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
In snapshot mode, bdrv_open creates an empty temporary file without
checking for mkstemp or close failure, and ignoring the possibility
of a buffer overrun given a surprisingly long $TMPDIR.
Change the get_tmp_filename function to return int (not void),
so that it can inform its two callers of those failures.
Also avoid the risk of buffer overrun and do not ignore mkstemp
or close failure.
Update both callers (in block.c and vvfat.c) to propagate
temp-file-creation failure to their callers.
get_tmp_filename creates and closes an empty file, while its
callers later open that presumed-existing file with O_CREAT.
The problem was that a malicious user could provoke mkstemp failure
and race to create a symlink with the selected temporary file name,
thus causing the qemu process (usually root owned) to open through
the symlink, overwriting an attacker-chosen file.
This addresses CVE-2012-2652.
http://bugzilla.redhat.com/CVE-2012-2652
Signed-off-by: Jim Meyering <meyering@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_save_vmstate and bdrv_load_vmstate should return the vmstate size
on success, and -errno on error.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* kwolf/for-anthony:
fdc-test: introduced qtest no_media_on_start and cmos qtest for floppy
fdc: fix media detection
fdc: floppy drive should be visible after start without media
qemu-iotests: mark 035 qcow2-only
qcow2: Check qcow2_alloc_clusters_at() return value
sheepdog: use heap instead of stack for BDRVSheepdogState
sheepdog: return -errno on error
sheepdog: mark image as snapshot when tag is specified
qemu-img: Explain how rebase operation can be used to perform a 'diff' operation.
qcow2: don't leak buffer for unexpected qcow_version in header
This allows using LUNs bigger than 2TB. Keep using READ10 for other
device types such as MMC.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
This is needed to avoid READ CAPACITY(16) for MMC devices.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call qemu_notify_event() after updating events. Otherwise, If we add
an event for -is-writeable but the socket is already writeable there
may be a delay before the event callback is actually triggered.
Those delays would in particular hurt performance during BIOS boot and
when the GRUB bootloader reads the kernel and initrd.
But first call out to the socket write functions directly, and only set up
the write event if the socket is full. This will happen very rarely and
this improves performance.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
When using qcow2_alloc_clusters_at(), the cluster allocation code
checked the wrong variable for an error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_create() is called in coroutine context now, so we cannot use
more stack than 1 MB in the function if we use ucontext coroutine.
This patch allocates BDRVSheepdogState, whose size is 4 MB, on the
heap in sd_create().
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On error, BlockDriver APIs should return -errno instead of -1.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When a snapshot tag is specified in the filename, the opened image is
a snapshot.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Unallocated sectors should really never be accessed by the guest,
so there's no need to copy them during the streaming process.
If they are read by the guest during streaming, guest-initiated
copy-on-read will copy them (we're in the base == NULL case, which
enables copy on read). If they are read after we disconnect the
image from the base, they will read as zeroes anyway.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes inability to make progress in streaming if the quota is set
to less than the amount of data that an I/O operation has to write.
In this case, limit->dispatched + n will always be above the quota and,
due to the "goto retry" to recheck cancellation and allocation, streaming
will livelock.
This can be reproduced with "block_job_set_speed ide0-hd0 1b". Of course,
with this patch the requested limit will not be obeyed. That could be
done with another patch that caps is_allocated's n argument by the slice
quota.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When an image is modified to point to the new backing file, the backing
file format is set to NULL, which means auto-probe. This is wrong, in
fact it is a small security problem.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The limitation on not having I/O after cancellation cannot really be
kept. Even streaming has a very small race window where you could
cancel a job and have it report completion. If this window is hit,
bdrv_change_backing_file() will yield and possibly cause accesses to
dangling pointers etc.
So, let's just assume that we cannot know exactly what will happen
after the coroutine has set busy to false. We can set a very lax
condition:
- if we cancel the job, the coroutine won't set it to false again
(and hence will not call co_sleep_ns again).
- block_job_cancel_sync will wait for the coroutine to exit, which
pretty much ensures no race.
Instead, we track the coroutine that executes the job and put very
strict conditions on what to do while it is quiescent (busy = false).
First of all, the coroutine must never set busy = false while the job
has been cancelled. Second, the coroutine can be reentered arbitrarily
while it is quiescent, so you cannot really do anything but co_sleep_ns at
that time. This condition is obeyed by the block_job_sleep_ns function.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This function abstracts the pretty complex semantics of the "busy"
member of BlockJob.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QED's opaque data includes a pointer back to the BlockDriverState.
This breaks when bdrv_append shuffles data between bs_new and bs_top.
To avoid this, add a "rebind" function that tells the driver about
the new relationship between the BlockDriverState and its opaque.
The patch also adds rebind to VVFAT for completeness, even though
it is not used with live snapshots.
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These are needed to print "info block" output correctly. QCOW2 does this
because it needs it to write the header, but QED does not, and common code
is the right place to do it.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This check applies to all drivers, but QED lacks it.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* kwolf/for-anthony:
fdc: simplify media change handling
qcow2: lock on prealloc
block: make bdrv_create adopt coroutine
qcow2: Limit COW to where it's needed
sheepdog: switch to writethrough mode if cluster doesn't support flush
preallocate() will be locked. This is required because
qcow2_alloc_cluster_link_l2() assumes that it runs under a lock that it
can drop while COW is being performed.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes a regression introduced in commit 250196f1. The bug leads to
data corruption, found during an Autotest run with a Fedora 8 guest.
Consider a write request whose first part is covered by an already
allocated cluster, but additional clusters need to be newly allocated.
When counting the number of clusters to allocate, the qcow2 code would
decide to do COW for all remaining clusters of the write request, even
if some of them are already allocated.
If during this COW operation another write request is issued that touches
the same cluster, it will still refer to the old cluster. When the COW
completes, the first request will update the L2 table and the second
write request will be lost. Note that the requests need not overlap, it's
enough for them to touch the same cluster.
This patch ensures that only clusters that really require COW are
considered for allocation. In this case any other request writing to the
same cluster will be an allocating write and gets serialised.
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is necessary for qemu to work with the older version of Sheepdog
which doesn't support SD_OP_FLUSH_VDI.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Update the configure test for libiscsi support to detect version 1.3
or later. Version 1.3 of libiscsi provides both READCAPACITY16 as well
as UNMAP commands.
Update the iscsi block layer to use READCAPACITY16 to detect the size of
the LUN instead of READCAPACITY10. This allows support for LUNs larger
than 2TB.
Update to implement bdrv_aio_discard() using the UNMAP command.
This allows us to use thin-provisioned LUNs from TGTD and other iSCSI
targets that support thin-provisioning.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
[squashed in subsequent patch from Ronnie to fix off-by-one in LBA count]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change the write flag to an operation type in RBDAIOCB, and make the
buffer optional since discard doesn't use it.
Discard is first included in librbd 0.1.2 (which is in Ceph 0.46).
If librbd is too old, leave out qemu_rbd_aio_discard entirely,
so the old behavior is preserved.
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If cache references are held while the coroutine has yielded, the cache
may get used up and abort() when it can't find a free entry.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use __APPLE__ and __MACH__ macros instead of CONFIG_COCOA to detect Mac
OS X host. The patch is based on Ben Leslie's patch:
http://patchwork.ozlabs.org/patch/97859/
Signed-off-by: Ben Leslie <benno@benno.id.au>
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Andreas Färber <andreas.faerber@web.de>
* qmp/queue/qmp:
qapi: fix qmp_balloon() conversion
qemu-iotests: add block-stream speed value test case
block: add 'speed' optional parameter to block-stream
block: change block-job-set-speed argument from 'value' to 'speed'
block: use Error mechanism instead of -errno for block_job_set_speed()
block: use Error mechanism instead of -errno for block_job_create()
Allow streaming operations to be started with an initial speed limit.
This eliminates the window of time between starting streaming and
issuing block-job-set-speed. Users should use the new optional 'speed'
parameter instead so that speed limits are in effect immediately when
the job starts.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
There are at least two different errors that can occur in
block_job_set_speed(): the job might not support setting speeds or the
value might be invalid.
Use the Error mechanism to report the error where it occurs.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
The block job API uses -errno return values internally and we convert
these to Error in the QMP functions. This is ugly because the Error
should be created at the point where we still have all the relevant
information. More importantly, it is hard to add new error cases to
this case since we quickly run out of -errno values without losing
information.
Go ahead and use Error directly and don't convert later.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
s->sock is assigned only afterwards, so we're really registering an
aio_fd_handler for file descriptor 0 here. Not exactly what we intended.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kwolf/for-anthony: (38 commits)
qemu-iotests: Fix test 031 for qcow2 v3 support
qemu-iotests: Add -o and make v3 the default for qcow2
qcow2: Zero write support
qemu-iotests: Test backing file COW with zero clusters
qemu-iotests: add a simple test for write_zeroes
qcow2: Support for feature table header extension
qcow2: Support reading zero clusters
qcow2: Version 3 images
qcow2: Ignore reserved bits in check_refcounts
qcow2: Ignore reserved bits in refcount table entries
qcow2: Simplify count_cow_clusters
qcow2: Refactor qcow2_free_any_clusters
qcow2: Ignore reserved bits in L1/L2 entries
qcow2: Fail write_compressed when overwriting data
qcow2: Ignore reserved bits in count_contiguous_clusters()
qcow2: Ignore reserved bits in get_cluster_offset
qcow2: Save disk size in snapshot header
Specification for qcow2 version 3
qcow2: Fix refcount block allocation during qcow2_alloc_cluster_at()
iotests: Resolve test failures caused by hostname
...
Instead of printing an ugly bitmask, qemu can now print a more helpful
string even for yet unknown features.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds the basic infrastructure to qcow2 to handle version 3 images.
It includes code to create v3 images, allow header updates for v3 images
and checks feature bits.
It still misses support for zero clusters, so this is not a fully
compliant implementation of v3 yet.
The default for creating new images stays at v2 for now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Also don't infer the cluster type directly from the L2 entries, but use
qcow2_get_cluster_type() to keep everything in a single place.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
count_cow_clusters() tries to reuse existing functions, and all it
achieves is to make things much more complicated than they really are:
Everything needs COW, unless it's a normal cluster with refcount 1.
This patch implements the obvious way of doing this, and by using
qcow2_get_cluster_type() it gets rid of all flag magic.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Zero clusters will add another cluster type. Refactor the open-coded
cluster type detection into a switch of QCOW2_CLUSTER_* options so that
the detection is in a single place. This makes it easier to add new
cluster types.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This changes the still existing places that assume that the only flags
are QCOW_OFLAG_COPIED and QCOW_OFLAG_COMPRESSED to properly mask out
reserved bits.
It does not convert bdrv_check yet.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_alloc_compressed_cluster_offset() already fails if the copied flag
is set, because qcow2_write_compressed() doesn't perform COW as it would
have to do to allow this.
However, what we really want to check here is whether the cluster is
allocated or not. With internal snapshots the copied flag may not be set
on allocated clusters. Check the cluster offset instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Until now, count_contiguous_clusters() has an argument that allowed to
specify flags that should be ignored in the comparison, i.e. that are
allowed to change between contiguous clusters.
This patch changes the function so that it ignores all flags by default
now and you need to pass the flags on which it should stop.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With this change, reading from a qcow2 image ignores all reserved bits
that are set in an L1 or L2 table entry.
Now get_cluster_offset() assigns *cluster_offset only the offset without
any other flags. The cluster type is not longer encoded in the offset,
but a positive return value in case of success.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This allows that different snapshots of an image can have different
sizes, which is a requirement for enabling image resizing even with
images that have internal snapshots.
We don't do the actual support for it now, but make sure that the
additional field is present and not completely ignored in all version 3
images. When trying to load a snapshot of different size, it returns
an error.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Refcount block allocation and refcount table growth rely on
s->free_cluster_index pointing to somewhere after the current
allocation. Change qcow2_alloc_cluster_at() to fulfill this
assumption.
Without this change it could happen that a newly allocated refcount
block and the allocated data block point to the same area in the image
file, causing data corruption in the long run.
This fixes a bug that became first visible after commit 250196f1.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Right now, nbd_wr_sync will hang if no data at all is available on the
socket and the other side is not going to provide any. Relax this by
making it loop only for writes or partial reads. This fixes a race
where one thread is executing qemu_aio_wait() and another is executing
main_loop_wait(). Then, the select() call in main_loop_wait() can return
stale data and call the "readable" callback with no data in the socket.
Reported-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the next patch we need to look at the return code of nbd_wr_sync.
To avoid percolating the socket_error() ugliness all around, let's
handle errors by returning negative errno values.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Someone forgot something in commit 29c1a730... Documenting the right
return value is not enough, you also need to actually return it in the
code.
This bug sometimes causes error return values even when everything has
succeeded: The new offset of the refcount block is truncated to 32 bits
and interpreted as signed. At least with small cluster sizes it's easy
to get a negative return value this way.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
If do_alloc_cluster_offset() fails, the error handling code tried to
remove the request from the in-flight queue, to which it wasn't added
yet, resulting in a NULL pointer dereference.
m->nb_clusters really only becomes != 0 when the request is in the list.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* kwolf/for-anthony: (46 commits)
qed: remove incoming live migration blocker
qed: honor BDRV_O_INCOMING for incoming live migration
migration: clear BDRV_O_INCOMING flags on end of incoming live migration
qed: add bdrv_invalidate_cache to be called after incoming live migration
blockdev: open images with BDRV_O_INCOMING on incoming live migration
block: add a function to clear incoming live migration flags
block: Add new BDRV_O_INCOMING flag to notice incoming live migration
block stream: close unused files and update ->backing_hd
qemu-iotests: Fix call syntax for qemu-io
qemu-iotests: Fix call syntax for qemu-img
qemu-iotests: Test unknown qcow2 header extensions
qemu-iotests: qcow2.py
sheepdog: fix send req helpers
sheepdog: implement SD_OP_FLUSH_VDI operation
block: bdrv_append() fixes
qed: track dirty flag status
qemu-img: add dirty flag status
qed: image fragmentation statistics
qemu-img: add image fragmentation statistics
block: document job API
...
From original commit with Patchwork-id: 31108 by
Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
"The QED image format includes a file header bit to mark images dirty.
QED normally checks dirty images on open and fixes inconsistent
metadata. This is undesirable during live migration since the dirty bit
may be set if the source host is modifying the image file. The check
should be postponed until migration completes.
Skip operations that modify the image file if the BDRV_O_INCOMING flag
is set."
Signed-off-by: Benoit Canet <benoit.canet@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The QED image is reopened to flush metadata and check consistency.
Signed-off-by: Benoit Canet <benoit.canet@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Close the now unused images that were part of the previous backing file
chain and adjust ->backing_hd, backing_filename and backing_format
properly.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=801449
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We should return if reading of the header fails.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Acked-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Flush operation is supposed to flush the write-back cache of
sheepdog cluster.
By issuing flush operation, we can assure the Guest of data
reaching the sheepdog cluster storage.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no need to do this in every implementation of set_speed
(even though there is only one right now).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Streaming can issue I/O while qcow2_close is running. This causes the
L2 caches to become very confused or, alternatively, could cause a
segfault when the streaming coroutine is reentered after closing its
block device. The fix is to cancel streaming jobs when closing their
underlying device.
The cancellation must be synchronous, on the other hand qemu_aio_wait
will not restart a coroutine that is sleeping in co_sleep. So add
a flag saying whether streaming has in-flight I/O. If the busy flag
is false, the coroutine is quiescent and, when cancelled, will not
issue any new I/O.
This protects streaming against closing, but not against deleting.
We have a reference count protecting us against concurrent deletion,
but I still added an assertion to ensure nothing bad happens.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Finally reindent all code and change goto statements to a loop.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reads and writes to the underlying file can also occur with the simple
non-vectored I/O interfaces.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vdi.c really works as if it implemented bdrv_read and bdrv_write. However,
because only vector I/O is supported by the asynchronous callbacks, it
went through extra pain to bounce-buffer the I/O. This can be handled
by the block layer now that the format is coroutine-based.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Most of the AIOCB really holds local variables that need to persist
across callback invocation. It can go away now.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now inline the former AIO callbacks into vdi_co_readv and vdi_co_writev.
While many cleanups are possible, the code now really looks synchronous.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The next step is to take code that only triggers after the first operation,
and move it at the end of vdi_aio_read_cb and vdi_aio_write_cb.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Even a basic conversion changing the bdrv_aio_readv/bdrv_aio_writev calls
to bdrv_co_readv/bdrv_co_writev, and callbacks to goto statements can
eliminate a lot of code. This is because error handling is simplified
and indirections through bottom halves can go away.
After this patch, I/O to the underlying file already happens via
coroutines, but the code still looks a lot like if asynchronous I/O was
being used.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After validation check, the 'checksum' is not written back
to footer, which leave it with zero.
This results in errors while loadding it under Microsoft's
Hyper-V environment, and also errors from utilities like
Citrix's vhd-util.
Signed-off-by: Zhang Shengju <sean_zhang@trendmicro.com.cn>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since everything goes through the cache, callers don't use the L2 table
offset any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The function usleep is not available for all supported platforms:
at least some versions of MinGW don't support it.
usleep was also declared obsolete by POSIX.1-2001.
The function g_usleep is part of glib2.0, so it is available for
all supported platforms.
Using nanosleep would also be possible but needs more code.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
If the first part of a write request is allocated, but the second isn't
and it can be allocated so that the resulting area is contiguous, handle
it at once. This is a common case for sequential writes.
After this patch, alloc_cluster_offset() only checks if the clusters are
already allocated or how many new clusters can be allocated contigouosly.
The actual cluster allocation is split off into a new function
do_alloc_cluster_offset().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
This function allows to allocate clusters at a given offset in the image
file. This is useful if you want to allocate the second part of an area
that must be contiguous.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
qemu-img resize has some limitations with qcow2, but the user is only
told that "this image format does not support resize". Quite confusing,
so add some more detailed error_report() calls and change "this image
format" into "this image".
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The L2 table cache reduces QED metadata reads that would be required
when translating LBAs to offsets into the image file. Since requests
execute in parallel it is possible to share an L2 table between multiple
requests.
There is a potential data corruption issue when an in-use L2 table is
evicted from the cache because the following situation occurs:
1. An allocating write performs an update to L2 table "A".
2. Another request needs L2 table "B" and causes table "A" to be
evicted.
3. A new read request needs L2 table "A" but it is not cached.
As a result the L2 update from #1 can overlap with the L2 fetch from #3.
We must avoid doing overlapping I/O requests here since the worst case
outcome is that the L2 fetch completes before the L2 update and yields
stale data. In that case we would effectively discard the L2 update and
lose data clusters!
Thanks to Benoît Canet <benoit.canet@gmail.com> for extensive testing
and debugging which lead to discovery of this bug.
Reported-by: Benoît Canet <benoit.canet@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Tested-by: Benoît Canet <benoit.canet@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Image files that make qemu-img info read several gigabytes into the
unknown header extensions list are bad. Just fail opening the image
if an extension claims to be larger than the header extension area.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The spec says that the length of extensions is padded to 8 bytes, not
the offset. Currently this is the same because the header size is a
multiple of 8, so this is only about compatibility with future changes
to the header size.
While touching it, move the calculation to a common place instead of
duplicating it for each header extension type.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The co_recv coroutine has two things that will try to enter it:
1. The select(2) read callback on the sheepdog socket.
2. The aio_add_request() blocking operations, including a coroutine
mutex.
This patch fixes it by setting NULL to co_recv before sending data.
In future, we should make the sheepdog driver fully coroutine-based
and simplify request handling.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If we want header extensions to work as compatible extensions, we can't
destroy yet unknown header extensions when rewriting the header (e.g.
for changing the backing file). Save all unknown header extensions in a
list of blobs and include them in a new header.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In order to switch the backing file, qcow2 issues multiple write
requests that only changed a part of the image header. Any failure after
the first one would leave the header in an corrupted state. With this
patch, the whole header is written at once, so we can't fail in the
middle.
At the same time, this gives us a reusable functions that updates all
fields of the qcow2 header and not only the backing file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The geometry calculation algorithm from the VHD spec rounds the image
size down if it doesn't exactly match a geometry. During image
conversion, this causes the image to be truncated. For dynamic images,
we already have code in place to round up instead, let's do the same for
fixed images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The Virtual Hard Disk Image Format Specification allows for three
types of hard disk formats, Fixed, Dynamic, and Differencing. Qemu
currently only supports Dynamic disks. This patch adds support for
the Fixed Disk format.
Usage:
Example 1: qemu-img create -f vpc -o type=fixed <filename> [size]
Example 2: qemu-img convert -O vpc -o type=fixed <input filename> <output filename>
While it is also allowed to specify '-o type=dynamic', the default disk type
remains Dynamic and is what is used when the type is left unspecified.
Signed-off-by: Charles Arnold <carnold@suse.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds configuration variables for iSCSI to set
initiator-name to use when logging in to the target,
which type of header-digest to negotiate with the target
and username and password for CHAP authentication.
This allows specifying a initiator-name either from the command line
-iscsi initiator-name=iqn.2004-01.com.example:test
or from a configuration file included with -readconfig
[iscsi]
initiator-name = iqn.2004-01.com.example:test
header-digest = CRC32C|CRC32C-NONE|NONE-CRC32C|NONE
user = CHAP username
password = CHAP password
If you use several different targets, you can also configure this on a per
target basis by using a group name:
[iscsi "iqn.target.name"]
...
The configuration file can be read using -readconfig.
Example :
qemu-system-i386 -drive file=iscsi://127.0.0.1/iqn.ronnie.test/1
-readconfig iscsi.conf
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Zero writes are a dedicated interface for writing regions of zeroes into
the image file. If clusters are not yet allocated it is possible to use
an efficient metadata representation which keeps the image file compact
and does not store individual zero bytes.
Implementing this for the QED image format is fairly straightforward.
The only issue is that when a zero write touches an existing cluster we
have to allocate a bounce buffer and perform a regular write.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Per-request attributes like read/write are currently implemented as bool
fields in the QEDAIOCB struct. This becomes unwiedly as the number of
attributes grows. For example, the qed_aio_setup() function would have
to take multiple bool arguments and at call sites it would be hard to
distinguish the meaning of each bool.
Instead use a flags field with bitmask constants. This will be used
when zero write support is added.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since common file operation functions lack of error detection and use
much more I/O syscalls, so change them to bdrv series functions and
reduce I/O request.
Signed-off-by: Li Zhi Hui <zhihuili@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The new block was filled with zero when it was allocated by g_malloc0,
but when it was reused later and only partially used, data from the
previously allocated block were still present and written to the new
block.
This caused the problems reported by bug #919242
(https://bugs.launchpad.net/qemu/+bug/919242).
Now the unused parts of the new block which are before and after the data
are always filled with zero, so it is no longer necessary to zero the whole
block with g_malloc0.
I also updated the copyright comment.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add support for streaming data from an intermediate section of the
image chain (see patch and documentation for details).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch implements rate-limiting for image streaming. If we've
exceeded the bandwidth quota for a 100 ms time slice we sleep the
coroutine until the next slice begins.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Most of the codebase as been converted to use glib memory allocation
functions. There are still a few instances of malloc/calloc in the
block layer and qemu-io. Replace them, especially since they do not
check the strdup/malloc/calloc return value.
Reported-by: Dr David Alan Gilbert <davidagilbert@uk.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All files under GPLv2 will get GPLv2+ changes starting tomorrow.
event_notifier.c and exec-obsolete.h were only ever touched by Red Hat
employees and can be relicensed now.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Allow sending up to 16 requests, and drive the replies to the coroutine
that did the request. The code is written to be exactly the same as
before this patch when MAX_NBD_REQUESTS == 1 (modulo the extra mutex
and state).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
qemu-nbd has a limit of slightly less than 1M per request. Work
around this in the nbd block driver.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Outside coroutines, avoid busy waiting on EAGAIN by temporarily
making the socket blocking.
The API of qemu_recvv/qemu_sendv is slightly different from
do_readv/do_writev because they do not handle coroutines. It
returns the number of bytes written before encountering an
EAGAIN. The specificity of yielding on EAGAIN is entirely in
qemu-coroutine.c.
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The caller expects psn_tab to be NULL when there are no snapshots or
an error occurs. This results in calling g_free on an invalid address.
Reported-by: Oliver Francke <Oliver@filoo.de>
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Li Zhi Hui <zhihuili@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Initially done with the following semantic patch:
@ rule1 @
expression E;
statement S;
@@
E = qemu_aio_get (...);
(
- if (E == NULL) { ... }
|
- if (E)
{ <... S ...> }
)
which however missed occurrences in linux-aio.c and posix-aio-compat.c.
Those were done by hand.
The change in vdi_aio_setup's caller was also done by hand.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Initially done with the following semantic patch:
@ rule1 @
expression E;
statement S;
@@
E =
(
bdrv_aio_readv
| bdrv_aio_writev
| bdrv_aio_flush
| bdrv_aio_discard
| bdrv_aio_ioctl
)
(...);
(
- if (E == NULL) { ... }
|
- if (E)
{ <... S ...> }
)
which however missed the occurrence in block/blkverify.c
(as it should have done), and left behind some unused
variables.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Double semicolons should be single.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Now that bdrv_co_is_allocated() is available we can use it instead of
the synchronous bdrv_is_allocated() interface. This is a follow-up that
Kevin Wolf <kwolf@redhat.com> pointed out after applying the series that
introduces bdrv_co_is_allocated().
It is safe to make cow_read() a coroutine_fn because its only caller is
a coroutine_fn.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It's common to wake up all waiting coroutines. Introduce the
qemu_co_queue_restart_all() function to do this instead of looping over
qemu_co_queue_next() in every caller.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The cow block driver does not keep internal state for cluster lookups.
This means it is safe to perform cluster lookups in coroutine context
without risk of race conditions that corrupt internal state.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It is trivial to switch from the synchronous .bdrv_is_allocated()
interface to .bdrv_co_is_allocated() since vdi_is_allocated() does not
block.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It is trivial to switch from the synchronous .bdrv_is_allocated()
interface to .bdrv_co_is_allocated() since vvfat_is_allocated() does not
block.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The qcow2, qcow, and vmdk block drivers are based on coroutines. They have a
coroutine mutex which protects internal state. We can convert the
.bdrv_is_allocated() function to .bdrv_co_is_allocated() by holding the mutex
around the cluster lookup operation.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The bdrv_qed_is_allocated() function is a synchronous wrapper around
qed_find_cluster(), which performs the cluster lookup. In order to
convert the synchronous function to a coroutine function we yield
instead of using qemu_aio_wait(). Note that QED's cache is already safe
for parallel requests so no locking is needed.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the bdrv_read() of the snapshot's L1 table fails, return the right
error code and make sure that the old L1 table is still loaded and we
don't break the BlockDriverState completely.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
First the snapshot must be deleted and only then the refcounts can be
decreased.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The refcount updates must be moved so that in the worst case we can get
cluster leaks, but refcounts may never be too low.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Besides fixing the return code, this adds some comments that make clear
how the code works and that it potentially breaks images if we fail in
the wrong place. Actually fixing this is left for the next patch.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Increase refcounts only after allocating a new L1 table has succeeded in
order to make leaks less likely. If writing the snapshot table fails,
revert in-memory state to be consistent with that on disk.
While at it, make it return the real error codes instead of -1.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
sn->id_str could be leaked before this. The rest of this patch changes
comments, fixes coding style or removes checks that are unnecessary with
g_malloc.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Failing in the middle wouldn't help with the integrity of the image, so
doing everything in a single request seems better.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Doesn't immediately fix anything as the callers don't use the return
value, but they will be fixed next.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Looks better when reviewing these source files.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since common file operation functions lack of error detection,
so change them to bdrv series functions.
Signed-off-by: Li Zhi Hui <zhihuili@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch is only to refactor some lines of codes to get better and more robust codes.
As you have seen, in qed_read_table_cb() it's nice to
use qiov->size because that function doesn't obviously use a single
struct iovec.
In other two functions, if qiov use more than one struct iovec, the existing way will get wrong nb_sectors.
To make the code more robust, it will be nicer to refactor the existing way as below.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Acked-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A BlockDriverState should not issue requests on itself through the
public block layer interface. Nested, or reentrant, requests are
problematic because they do I/O throttling and request tracking twice.
Features like block layer copy-on-read use request tracking to avoid
race conditions between concurrent requests. The reentrant request will
have to "wait" for its parent request to complete. But the parent is
waiting for the reentrant request to make progress so we have reached
deadlock.
The solution is for block drivers to avoid the public block layer
interfaces for reentrant requests. Instead they should call their own
internal functions if they wish to perform reentrant requests.
This is also a good opportunity to make copy_sectors() a true
coroutine_fn. That means calling bdrv_co_writev() instead of
bdrv_write(). Behavior is unchanged but we're being explicit that this
executes in coroutine context.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Unlocking during COW allows for more parallelism. One change it requires is
that buffers are dynamically allocated instead of just using a per-image
buffer.
While touching the code, drop the synchronous qcow2_read() function and replace
it by a bdrv_read() call.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
vvfat caches more or less everything when in writable mode. For migration
to work, it would have to be invalidated. Block migration for now when
in writable mode (default is readonly).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vdi caches the block map. For migration to work, it would have to be
invalidated. Block migration for now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
s->lock should be unlocked before leaving add_aio_request.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
zlib.h is not a local include file, therefore it should be included
using <> instead of "".
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now when you try to migrate with qed, you get:
(qemu) migrate tcp:localhost:1025
Block format 'qed' used by device 'ide0-hd0' does not support feature 'live migration'
(qemu)
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We don't reopen the actual file, but instead invoke the close and open routines.
We specifically ignore the backing file since it's contents are read-only and
therefore immutable.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2 has a writeback metadata cache, so flushing a qcow2 image actually
consists of writing back that cache to the protocol and only then flushes the
protocol in order to get everything stable on disk.
This introduces a separate bdrv_co_flush_to_os to reflect the split.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There are two different types of flush that you can do: Flushing one level up
to the OS (i.e. writing data to the host page cache) or flushing it all the way
down to the disk. The existing functions flush to the disk, reflect this in the
function name.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The Data Offset field in the Dynamic Disk Header is an 8 byte field.
Although the specification (2006-10-11) gives an example of initializing
only the first 4 bytes, images generated by Microsoft on Windows initialize
all 8 bytes.
Failure to initialize all 8 bytes results in errors from utilities
like Citrix's vhd-util which checks specifically for the proper Data
Offset field initialization.
Signed-off-by: Charles Arnold <carnold@suse.com>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vvfat used to directly call into the qcow2 block driver instead of using the
block.c wrappers. With the coroutine conversion, this stopped working.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
First determine FAT12/16/32, then compute geometry from that for both
FDD and HDD. For 1.44MB floppies, and 2.88MB floppies using FAT16,
change to 1 sector/cluster. The default remains 2.88MB with FAT12
and 2 sectors/cluster. Both DOS and mkdosfs by default format a 2.88MB
floppy as FAT12.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The sector count is stored in the partition and hence must not include the
sectors before its start. At the same time, remove the useless special
casing for 1.44 MB floppies. This fixes fsck on VVFAT hard disks,
which otherwise tries to seek past the end of the disk.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is consistent with what "real" floppies have, so file(1)
now actually recognizes the VVFAT image as a 1.44 MB floppy.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the number of "faked sectors" + the number of sectors that are
part of a cluster does not sum up to the total number of sectors,
qemu-img convert fails. Read these spare sectors as all zeros.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When reading the address of the first free entry, you cannot
use array_get without first marking all entries as occupied.
This is visible if you change the sectors per cluster on a
floppy from 2 to 1.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix mismatching allocation and deallocation: g_free should be used to pair with
g_malloc.
Reviewed-by: Andreas Färber <afaerber@suse.de>
Reviewed_by: Ray Wang <raywang@linux.vnet.ibm.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix coding style in block/cloop.c.
Reviewed-by: Andreas Färber <afaerber@suse.de>
Reviewed_by: Ray Wang <raywang@linux.vnet.ibm.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If qcow2_cache_flush failed, s->lock will not be unlock.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Data we read from the disk isn't necessarily null terminated and may not
contain the string we're looking for. The code needs to be a bit more careful
here.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
An entry in the VDI block map will hold an offset to the actual block if
the block is allocated, or one of two specially-interpreted values if
not allocated. Using VirtualBox terminology, value VDI_IMAGE_BLOCK_FREE
(0xffffffff) represents a never-allocated block (semantically arbitrary
content). VDI_IMAGE_BLOCK_ZERO (0xfffffffe) represents a "discarded"
block (semantically zero-filled). block/vdi knows only about
VDI_IMAGE_BLOCK_FREE. Teach it about VDI_IMAGE_BLOCK_ZERO.
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This provides built-in support for iSCSI to QEMU.
This has the advantage that the iSCSI devices need not be made visible to the host, which is useful if you have very many virtual machines and very many iscsi devices.
It also has the benefit that non-root users of QEMU can access iSCSI devices across the network without requiring root privilege on the host.
This driver interfaces with the multiplatform posix library for iscsi initiator/client access to iscsi devices hosted at
git://github.com/sahlberg/libiscsi.git
The patch adds the driver to interface with the iscsi library.
It also updated the configure script to
* by default, probe is libiscsi is available and if so, build
qemu against libiscsi.
* --enable-libiscsi
Force a build against libiscsi. If libiscsi is not available
the build will fail.
* --disable-libiscsi
Do not link against libiscsi, even if it is available.
When linked with libiscsi, qemu gains support to access iscsi resources such as disks and cdrom directly, without having to make the devices visible to the host.
You can specify devices using a iscsi url of the form :
iscsi://[<username>[:<password>@]]<host>[:<port]/<target-iqn-name>/<lun>
When using authentication, the password can optionally be set with
LIBISCSI_CHAP_PASSWORD="password" to avoid it showing up in the process list
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
'ret' is unconditionally overwitten by qed_read_l1_table_sync()
Spotted by Clang Analyzer
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Spotted by Clang Analyzer
[Note this memcpy call has always been safe because the length will be 0
when the pointer is NULL]
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Since coroutine operation is now mandatory, convert both bdrv_discard
implementations to coroutines. For qcow2, this means taking the lock
around the operation. raw-posix remains synchronous.
The bdrv_discard callback is then unused and can be eliminated.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since coroutine operation is now mandatory, convert all bdrv_flush
implementations to coroutines. For qcow2, this means taking the lock.
Other implementations are simpler and just forward bdrv_flush to the
underlying protocol, so they can avoid the lock.
The bdrv_flush callback is then unused and can be eliminated.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This does the first part of the conversion to coroutines, by
wrapping bdrv_write implementations to take the mutex.
Drivers that implement bdrv_write rather than bdrv_co_writev can
then benefit from asynchronous operation (at least if the underlying
protocol supports it, which is not the case for raw-win32), even
though they still operate with a bounce buffer.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This does the first part of the conversion to coroutines, by
wrapping bdrv_read implementations to take the mutex.
Drivers that implement bdrv_read rather than bdrv_co_readv can
then benefit from asynchronous operation (at least if the underlying
protocol supports it, which is not the case for raw-win32), even
though they still operate with a bounce buffer.
raw-win32 does not need the lock, because it cannot yield.
nbd also doesn't probably, but better be safe.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The big conversion of bdrv_read/write to coroutines caused the two
homonymous callbacks in BlockDriver to become reentrant. It goes
like this:
1) bdrv_read is now called in a coroutine, and calls bdrv_read or
bdrv_pread.
2) the nested bdrv_read goes through the fast path in bdrv_rw_co_entry;
3) in the common case when the protocol is file, bdrv_co_do_readv calls
bdrv_co_readv_em (and from here goes to bdrv_co_io_em), which yields
until the AIO operation is complete;
4) if bdrv_read had been called from a bottom half, the main loop
is free to iterate again: a device model or another bottom half
can then come and call bdrv_read again.
This applies to all four of read/write/flush/discard. It would also
apply to is_allocated, but it is not used from within coroutines:
besides qemu-img.c and qemu-io.c, which operate synchronously, the
only user is the monitor. Copy-on-read will introduce a use in the
block layer, and will require converting it.
The solution is "simply" to convert all drivers to coroutines! We
just need to add a CoMutex that is taken around affected operations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move vmdk_parent_open to vmdk_open. There's another path how
vmdk_parent_open can be reached:
vmdk_parse_extents() -> vmdk_open_sparse() -> vmdk_open_vmdk4() ->
vmdk_open_desc_file().
If that can happen, however, the code is bogus. vmdk_parent_open
reads from bs->file:
if (bdrv_pread(bs->file, s->desc_offset, desc, DESC_SIZE) != DESC_SIZE) {
but it is always called with s->desc_offset == 0 and with the same
bs->file. So the data that vmdk_parent_open reads comes always from the
same place, and anyway there is only one place where it can write it,
namely bs->backing_file.
So, if it cannot happen, the patched code is okay.
It is also possible that the recursive call can happen, but only once. In
that case there would still be a bug in vmdk_open_desc_file setting
s->desc_offset = 0, but the patched code is okay.
Finally, in the case where multiple recursive calls can happen the code
would need to be rewritten anyway. It is likely that this would anyway
involve adding several parameters to vmdk_parent_open, and calling it from
vmdk_open_vmdk4.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
While vmdk_open_desc_file (touched by the patch) correctly changed -1
to -EINVAL, vmdk_open did not. Fix it directly in vmdk_parent_open.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If during allocation of compressed clusters the cluster was already allocated
uncompressed, fail and properly release the l2_table (the latter avoids a
failed assertion).
While at it, make it return some real error numbers instead of -1.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
This similarly adds support for coroutine and asynchronous discard.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Block drivers now only need to provide either of .bdrv_co_flush,
.bdrv_aio_flush() or for legacy drivers .bdrv_flush(). Remove
the redundant .bdrv_flush() implementations.
[Paolo Bonzini: change raw driver to bdrv_co_flush]
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This makes the following patch easier to review.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The raw format delegates all operations to bs->file (the protocol).
Previously this block driver exposed both sync and aio interfaces.
Since the block layer now works in terms of coroutines, expose the
coroutine interfaces and drop the others. This avoids unnecessary
emulation of sync and aio interfaces.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Block drivers only need to provide one of sync, aio, or coroutine
interfaces. Since raw-posix.c provides aio interfaces, simply drop the
synchronous interfaces since they can be emulated using aio and
coroutines.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
cppcheck reported this error:
qemu/block/qcow.c:599: error: Mismatching allocation and deallocation: cluster_data
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The unused code was detected using cppcheck.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
cppcheck reported memory leaks and mismatched g_malloc() with free()
instead of g_free().
Fix these errors.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow to resize images that reside on host devices up to the available
space. This allows to grow images after resizing the device manually or
vice versa.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QED's metadata caching strategy allows two parallel requests to race for
metadata lookup. The first one to complete will populate the metadata
cache and the second one will drop the data it just read in favor of the
cached data.
There is a use-after-free in qed_read_l2_table_cb() and
qed_commit_l2_update() where l2_table->offset was used after the
l2_table may have been freed due to a metadata lookup race. Fix this by
keeping the l2_offset in a local variable and not reaching into the
possibly freed l2_table.
Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The previous behaviour was to finish AIOCBs inside curl_aio_readv()
if the data was cached. This caused the following failed assertion
at hw/ide/pci.c:314: bmdma_cmd_writeb
"Assertion `bm->bus->dma->aiocb == ((void *)0)' failed."
By scheduling a QEMUBH and performing the completion inside the
callback, we avoid this problem.
Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The config string is variously delimited by =, @, and /, depending on the
field. Allow these characters to be escaped by preceeding them with \.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
librbd recently added async writeback and flush support. If the new
rbd_flush() call is available, call it.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Properly document the configuration string syntax and semantics. Remove
(out of date) details about the librbd implementation.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If we are reading from the default config location, ignore any failures.
It is perfectly legal for the user to specify exactly the options they need
and to not rely on any config file.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Release extent_file on error in vmdk_parse_extents. Added closing files
in freeing extents.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
nbd supports writing flags in bytes 24...27 of the header,
and uses that for the read-only flag. Add support for it
in qemu-nbd.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Those blanks violate the coding conventions, see
scripts/checkpatch.pl.
Blanks missing after colons in the changed lines were added.
This patch does not try to fix tabs, long lines and other
problems in the changed lines, therefore checkpatch.pl reports
many violations.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
QCowL2Meta::offset is not cluster aligned but only sector aligned
however nb_clusters count cluster from cluster start.
This fix range check. Note that old code have no corruption issues
related to this check cause it only cause intersection to occur
when shouldn't.
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QCow2Meta structure was inserted into list before many fields are
initialized. Currently is not a problem cause all occur in a lock
but if qcow2_alloc_clusters would in a future unlock this lock
some issues could arise.
Initializing fields before inserting fix the problem.
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix leak of s->snap in failure path. Simplify error paths for the whole
function.
Reported-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
No assignment in condition. Remove duplicate ret > 0 check.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow the client id to be specified in the config string via 'id=' so that
users can control who they authenticate as. Currently they are stuck with
the default ('admin'). This is necessary for anyone using authentication
in their environment.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The vSphere 4 exported image is streamOptimized extent, which is not
quite correctly handled. Ignore rdgOffset when RGD flag bit not set.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Haiku provides a specially formed vmdk image, which let qemu abort. It a
combination of sparse header and flat data (i.e. with not l1/l2 table at
all). The fix is turn to descriptor when sparse header is zero in field
'capacity'.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Creating streamOptimized subformat. Added subformat option
'streamOptimized', to create a image with compression enabled and each
cluster with a GrainMarker.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add support for reading/writing compressed extent.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Added flags field for compressed/streamOptimized extents, open and save
image configuration.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Factor out read/write extent code, since there will be more things to
take care of once reading/writing compressed clusters is introduced.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Enable the createType 'twoGbMaxExtentFlat'. The supporting code is
already in.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Block driver "raw" forwards most methods to the underlying block
driver. However, it doesn't implement method bdrv_media_changed().
Makes bdrv_media_changed() always return -ENOTSUP.
I believe -fda /dev/fd0 gives you raw over host_floppy, and disk
change detection (fdc register 7 bit 7) is broken. Testing my theory
requires a computer museum, though.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Requests depending on a failed request would end up waiting forever. This fixes
the error path to continue dependent requests even when the request has failed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add some notes about Linux AIO explaining why we don't use AIO in
some situations.
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Most changes were made using these commands:
git grep -la '__attribute__((packed))'|xargs perl -pi -e 's/__attribute__\(\(packed\)\)/QEMU_PACKED/'
git grep -la '__attribute__ ((packed))'|xargs perl -pi -e 's/__attribute__ \(\(packed\)\)/QEMU_PACKED/'
git grep -la '__attribute__((__packed__))'|xargs perl -pi -e 's/__attribute__\(\(__packed__\)\)/QEMU_PACKED/'
git grep -la '__attribute__ ((__packed__))'|xargs perl -pi -e 's/__attribute__ \(\(__packed__\)\)/QEMU_PACKED/'
git grep -la '__attribute((packed))'|xargs perl -pi -e 's/__attribute\(\(packed\)\)/QEMU_PACKED/'
Whitespace in linux-user/syscall_defs.h was fixed manually
to avoid warnings from scripts/checkpatch.pl.
Manual changes were also applied to hw/pc.c.
I did not fix indentation with tabs in block/vvfat.c.
The patch will show 4 errors with scripts/checkpatch.pl.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This makes the sheepdog block driver support bdrv_co_readv/writev
instead of bdrv_aio_readv/writev.
With this patch, Sheepdog network I/O becomes fully asynchronous. The
block driver yields back when send/recv returns EAGAIN, and is resumed
when the sheepdog network connection is ready for the operation.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Embed qcow_aio_read_cb into qcow_co_readv and qcow_aio_write_cb into qcow_co_writev
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
remove unused field from this structure and put some of them in qcow_aio_read_cb and qcow_aio_write_cb
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Current behaviour if a read fails is for the acb to not get finished.
This causes an infinite loop in bdrv_read_em (block.c). The read failure
never gets reported to the guest and if the error condition clears, the
process never recovers.
With this patch, when curl reports a failure we finish the acb as a
failure. This results in the guest receiving an I/O error (rather than
the read hanging indefinitely) and if the error condition subsequently
clears, retries work as expected.
The simplest test is to put an ISO on a web server you have control over
and open it with qemu-io. Then move the ISO out of the way and attempt
to read some data - you should see behaviour matching the above.
Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
commit 52b8eb6013 added a mutex,
but never initialized it. This caused a segfault.
Reported-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Documentation states the num is measured in clusters, but its
actually measured in sectors
Signed-off-by: Devin Nakamura <devin122@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
By introducing BlockDriverState compiling qcow2 with DEBUG_ALLOC and DEBUG_EXT
defined got broken.
Define a BdrvCheckResult structure locally which is now needed as the second
argument.
Also fix qcow2_read_extensions() needing BDRVQcowState.
Signed-off-by: Philipp Hahn <hahn@univention.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
SetFilePointer returns INVALID_SET_FILE_POINTER when it fails.
In addition, GetLastError must be checked.
The first call of SetFilePointer did not use INVALID_SET_FILE_POINTER,
the second call used wrong error handling.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When loading an internal snapshot whose L1 table is smaller than the current L1
table, the size of the current L1 would be shrunk to the snapshot's L1 size in
memory, but not on disk. This lead to incorrect refcount updates and eventuelly
to image corruption.
Instead of writing the new L1 size to disk, this simply retains the bigger L1
size that is currently in use and makes sure that the unused part is zeroed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Tested-by: Philipp Hahn <hahn@univention.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The purpose of AsyncContexts was to protect qcow and qcow2 against reentrancy
during an emulated bdrv_read/write (which includes a qemu_aio_wait() call and
can run AIO callbacks of different requests if it weren't for AsyncContexts).
Now both qcow and qcow2 are protected by CoMutexes and AsyncContexts can be
removed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The old qcow format is another user of the AsyncContext infrastructure.
Converting it to coroutines (and therefore CoMutexes) allows to remove
AsyncContexts.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
VHD files technically can be up to 2Tb, but virtual pc is limited
to 127G. Currently qemu-img refused to create vpc files > 127G,
but it is failing to return error when converting from a non-vpc
VHD file which is >127G. It returns success, but creates a truncated
converted image. Also, qemu-img info claims the vpc file is 127G
(and clean).
This patch detects a too-large vpc file and returns -EFBIG. Without
this patch,
=============================================================
root@ip-10-38-123-242:~/qemu-fixed# qemu-img info /mnt/140g-dynamic.vhd
image: /mnt/140g-dynamic.vhd
file format: vpc
virtual size: 127G (136899993600 bytes)
disk size: 284K
root@ip-10-38-123-242:~/qemu-fixed# qemu-img convert -f vpc -O raw /mnt/140g-dynamic.vhd /mnt/y
root@ip-10-38-123-242:~/qemu-fixed# echo $?
0
root@ip-10-38-123-242:~/qemu-fixed# qemu-img info /mnt/y
image: /mnt/y
file format: raw
virtual size: 127G (136899993600 bytes)
disk size: 0
=============================================================
(The 140G image was truncated with no warning or error.)
With the patch, I get:
=============================================================
root@ip-10-38-123-242:~/qemu-fixed# ./qemu-img info /mnt/140g-dynamic.vhd
qemu-img: Could not open '/mnt/140g-dynamic.vhd': File too large
root@ip-10-38-123-242:~/qemu-fixed# ./qemu-img convert -f vpc -O raw /mnt/140g-dynamic.vhd /mnt/y
qemu-img: Could not open '/mnt/140g-dynamic.vhd': File too large
qemu-img: Could not open '/mnt/140g-dynamic.vhd'
=============================================================
See https://bugs.launchpad.net/qemu/+bug/814222 for details.
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Callees always return 0, except for FreeBSD's cdrom_eject(), which
returns -ENOTSUP when the device is in a terminally wedged state.
The only caller is bdrv_eject(), and it maps -ENOTSUP to 0 since
commit 4be9762a.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The only caller is bdrv_set_locked(), and it ignores the value.
Callees always return 0, except for FreeBSD's cdrom_set_locked(),
which returns -ENOTSUP when the device is in a terminally wedged
state.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It's been disabled since the start (commit 19cb3738, Aug 2006), and
has been untouched except for spelling fixes and such. I don't feel
like dragging it along any further.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Avoid warnings like these by wrapping recv():
CC slirp/ip_icmp.o
/src/qemu/slirp/ip_icmp.c: In function 'icmp_receive':
/src/qemu/slirp/ip_icmp.c:418:5: error: passing argument 2 of 'recv' from incompatible pointer type [-Werror]
/usr/local/lib/gcc/i686-mingw32msvc/4.6.0/../../../../i686-mingw32msvc/include/winsock2.h:547:32: note: expected 'char *' but argument is of type 'struct icmp *'
Remove also casts used to avoid warnings.
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
In snapshotting there is no guest involved, so we can safely use a writeback
mode and do the flushes in the right place (i.e. at the very end). This
improves the time that creating/restoring an internal snapshot takes with an
image in writethrough mode.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qemu-img.c wants to count allocated file size of image. Previously it
counts a single bs->file by 'stat' or Window API. As VMDK introduces
multiple file support, the operation becomes format specific with
platform specific meanwhile.
The functions are moved to block/raw-{posix,win32}.c and qemu-img.c calls
bdrv_get_allocated_file_size to count the bs. And also added VMDK code
to count his own extents.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Conform coding style in vmdk.c to pass scripts/checkpatch.pl checks.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add create option 'format', with enums:
monolithicSparse
monolithicFlat
twoGbMaxExtentSparse
twoGbMaxExtentFlat
Each creates a subformat image file. The default is monolithicSparse.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Parse vmdk decriptor file and open mono flat image.
Read/write the flat extent.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The return type of get_cluster_offset was an offset that use 0 to denote
'not allocated', this will be no longer true for flat extents, as we see
flat extent file as a single huge cluster whose offset is 0 and length
is the whole file length.
So now we use int return value, 0 means success and otherwise offset
invalid.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Cid_update is the flag for updating CID on first write after opening the
image. This should be per image open rather than per program life cycle,
so change it from static var of vmdk_write to a field in BDRVVmdkState.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Flush all the file that referenced by the image.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There are several occurrence of magic number 0x200 as the descriptor
offset within mono sparse image file. This is not the case for images
with separate descriptor file. So a field is added to BDRVVmdkState to
hold the correct value.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Separate vmdk_open by subformats to:
* vmdk_open_vmdk3
* vmdk_open_vmdk4
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Probe as the same behavior as VMware does.
Recognize image as monolithicFlat descriptor file when the file is text
and the first effective line (not '#' leaded comment or space line) is
either 'version=1' or 'version=2'. No space or upper case charactors
accepted.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In get_whole_cluster, the offset is not aligned to cluster when reading
from backing_hd. When the first write to child is not at the cluster
boundary, wrong address data from parent is copied to child.
Signed-off-by: Fam Zheng <famcool@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This introduces qemu-img create option for sheepdog which allows the
data to be fully preallocated (note that sheepdog always preallocates
metadata).
The option is disabled by default and you need to enable it like the
following:
qemu-img create sheepdog:test -o preallocation=full 1G
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On Linux x86_64 host with 32bit userspace, running
qemu or even just "qemu-img create -f qcow2 some.img 1G"
causes a kernel warning:
ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img
ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img
ioctl 00005326 is CDROM_DRIVE_STATUS,
ioctl 801c0204 is FDGETPRM.
The warning appears because the Linux compat-ioctl handler for these
ioctls only applies to block devices, while qemu also uses the ioctls on
plain files. Work around by calling fstat() the ensure the ioctls are
only used on block devices.
Signed-off-by: Johannes Stezenbach <js@sig21.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
error_report() prepends location, and appends a newline. The message
constructed from the arguments should not contain a newline. Fix the
obvious offenders.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
If qcow2_cache_put returns an error during cluster allocation and the
allocation fails, it must be removed from the list of in-flight allocations.
Otherwise we'd get a loop in the list when the ACB is used for the next
allocation.
Luckily, this qcow2_cache_put shouldn't fail anyway because the L2 table is
only read, so that qcow2_cache_put doesn't even involve I/O.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
bdrv_aio_* must not call the callback before returning to its caller. In vdi,
this could happen in some error cases. This starts the real requests processing
in a BH to avoid this situation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_aio_* must not call the callback before returning to its caller. In qcow,
this could happen in some error cases. This starts the real requests processing
in a BH to avoid this situation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_aio_* must not call the callback before returning to its caller. In qcow2,
this could happen in some error cases. This starts the real requests processing
in a BH to avoid this situation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Variable 'snap' is assigned a value that is never used.
Remove snap and the related code.
Cc: Christian Brunner <chb@muc.de>
Cc: Josh Durgin <josh.durgin@dreamhost.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When not specifying a cluster size on the command line, qemu-img printed
a cluster size of 0:
Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864
encryption=off cluster_size=0
This patch adds the default cluster size to the QEMUOptionParameter list, so
that it displays the default value that is used.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes memory leaks that may be caused by I/O errors during L1 table growth
(can happen during save_vm) and in qemu-img check.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If scheduling fails, the number of outstanding I/Os must be correct,
or there will be a hang when waiting for everything to be flushed.
Reviewed-by: Christian Brunner <chb@muc.de>
Reported-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The new format is rbd:pool/image[@snapshot][:option1=value1[:option2=value2...]]
Each option is used to configure rados, and may be any Ceph option, or "conf".
The "conf" option specifies a Ceph configuration file to read.
This allows rbd volumes from more than one Ceph cluster to be used by
specifying different monitor addresses, as well as having different
logging levels or locations for different volumes.
Reviewed-by: Christian Brunner <chb@muc.de>
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
librbd stacks on top of librados to provide access
to rbd images.
Using librbd simplifies the qemu code, and allows
qemu to use new versions of the rbd format
with few (if any) changes.
Reviewed-by: Christian Brunner <chb@muc.de>
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
use the correct way to get the size of a disk device or partition
From: Adam Hamsik <haad@netbsd.org>
Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On NetBSD a userland process is better with the character device
interface. In addition, a block device can't be opened twice; if a Xen
backend opens it, qemu can't and vice-versa.
Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The vmdk code is sloppy when handling the header descriptor during
creation of an image. Fix all header accesses in the create path to
either store native endianness or convert it when appropriate.
Reported-by: Yury Tsarev <ytsarev@novell.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Change BDRV_O_NOCACHE to only imply bypassing the host OS file cache,
but no writeback semantics. All existing callers are changed to also
specify BDRV_O_CACHE_WB to give them writeback semantics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch removes all references to signal.h when qemu-common.h is included
as they become redundant.
Signed-off-by: Alexandre Raymond <cerbere@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The .bdrv_truncate() operation resizes images and growing is easy to
implement in QED. Simply check that the new size is valid and then
update the image_size header field to reflect the new size.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
One strategy to limit the startup delay of consistency check when
opening image files is to ensure that the file is marked dirty for as
little time as possible.
QED currently marks the image dirty when the first allocating write
request is issued and clears the dirty bit again when the image is
cleanly closed. In practice that means the image is marked dirty for
most of a guest's lifetime and prone to being in a dirty state upon
crash or power failure.
It is safe to clear the dirty bit after all allocating write requests
have completed and a flush has been performed. This patch adds a timer
after the last allocating write request completes. When the timer fires
it will flush and then clear the dirty bit. The timer is set to 5
seconds and is cancelled upon arrival of a new allocating write request.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The code changed here is an unused data type name (evt_flush_occurred).
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The qed_bytes_to_clusters() function is normally used with size_t
lengths. Consistency check used it with file size length and therefore
failed on 32-bit hosts when the image file is 4 GB or more.
Make qed_bytes_to_clusters() explicitly 64-bit and update consistency
check to keep 64-bit cluster counts.
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use get_option_parameter() to instead of duplicating the loop, and
use BDRV_SECTOR_SIZE to instead of 512
Signed-off-by: Mitnick Lyu <mitnick.lyu@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Zero clusters are similar to unallocated clusters except instead of reading
their value from a backing file when one is available, the cluster is always
read as zero.
This implements read support only. At this stage, QED will never write a
zero cluster.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We also change the way the file parameter is parsed so IPv6 IP
addresses can be used, e.g.: "drive=nbd:[::1]:5000"
Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qemu now has generic bitmap functions,
so don't redefine them in sheepdog.c,
use common header instead. A small cleanup.
Here's only one function which is actually
used in sheepdog and gets replaced with
a generic one (simplified):
- static inline int test_bit(int nr, const volatile unsigned long *addr)
+ static inline int test_bit(int nr, const unsigned long *addr)
{
- return ((1UL << (nr % BITS_PER_LONG))
& ((unsigned long*)addr)[nr / BITS_PER_LONG])) != 0;
+ return 1UL & (addr[nr / BITS_PER_LONG] >> (nr & (BITS_PER_LONG-1)));
}
The body is equivalent, but the argument is not: there's
"volatile" in there. Why it is used for - I'm not sure.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Acked-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
This patch is similar to 171e3d6b99
which fixed qcow2:
Returning -EIO is far from optimal, but at least it's an error code.
In addition to read/write failures, -EIO is also returned when
decompress_cluster failed.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch is similar to 171e3d6b99
which fixed qcow2:
Returning -EIO is far from optimal, but at least it's an error code.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When copying L2 tables (this happens only with internal snapshots), the order
wasn't completely safe, so that after a crash you could end up with a L2 table
that has too low refcount, possibly leading to corruption in the long run.
This patch puts the operations in the right order: First allocate the new
L2 table and replace the reference, and only then decrease the refcount of the
old table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of just returning -ENOTSUP, generate a more detailed error.
Unfortunately we don't have a helpful text for features that we don't know yet,
so just print the feature mask. It might be useful at least if someone asks for
help.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
Acked-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The qcow2 driver is now declared responsible for any QCOW image that has
version 2 or greater (before this, version 3 would be detected as raw).
For everything newer than version 2, an error is reported.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
When reading a compressed cluster failed, qcow2 falsely returned success.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Requests could return success even though they failed when bdrv_aio_readv
returned NULL for a backing file read.
Reported-by: Chunqiang Tang <ctang@us.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch fixes the following bug in QCOW2. For a QCOW2 image that is larger
than its base image, when handling a read request straddling over the end of the
base image, the QCOW2 driver attempts to read beyond the end of the base image
and the request would fail.
This bug was found by Fast Virtual Disk (FVD)'s fully automated testing tool.
The following test triggered the bug.
dd if=/dev/zero of=/var/ramdisk/truth.raw count=0 bs=1 seek=1098561536
dd if=/dev/zero of=/var/ramdisk/zero-500M.raw count=0 bs=1 seek=593099264
./qemu-img create -f qcow2 -ocluster_size=65536,backing_fmt=blksim -b /var/ramdisk/zero-500M.raw /var/ramdisk/test.qcow2 1098561536
./qemu-io --auto --seed=30477694 --truth=/var/ramdisk/truth.raw --format=qcow2 --test=blksim:/var/ramdisk/test.qcow2 --verify_write=true --compare_before=false --compare_after=true --round=100000 --parallel=100 --io_size=10485760 --fail_prob=0 --cancel_prob=0 --instant_qemubh=true
Signed-off-by: Chunqiang Tang <ctang@us.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Error report from cppcheck:
block/vdi.c:122: error: Using sizeof for array given as function argument returns the size of pointer.
block/vdi.c:128: error: Using sizeof for array given as function argument returns the size of pointer.
Fix both by setting the correct size.
The buggy code is only used when QEMU is build without uuid support.
The bug is not critical, so there is no urgent need to apply it to
old versions of QEMU.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For cache=unsafe we also need to set BDRV_O_CACHE_WB, otherwise we have some
strange unsafe writethrough mode.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Variables l2_modified and l2_size are not really used, remove them.
Spotted by GCC 4.6.0:
CC block/qcow2-refcount.o
/src/qemu/block/qcow2-refcount.c: In function 'qcow2_update_snapshot_refcount':
/src/qemu/block/qcow2-refcount.c:708:37: error: variable 'l2_modified' set but not used [-Werror=unused-but-set-variable]
/src/qemu/block/qcow2-refcount.c:708:9: error: variable 'l2_size' set but not used [-Werror=unused-but-set-variable]
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The consistency check on open is necessary in order to fix inconsistent
table offsets left as a result of a crash mid-operation. Images with a
backing file actually flush before updating table offsets and are
therefore guaranteed to be consistent. Do not mark these images dirty.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds a bdrv_discard function to qcow2 that frees the discarded clusters.
It does not yet pass the discard on to the underlying file system driver, but
the space can be reused by future writes to the image.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
This patch parses the input filename in sd_create(), and enables us
specifying a target server to create sheepdog images.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move size after the two pointers in struct Qcow2Cache to get better
packing of struct elements on 64 bit architectures.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QED relies on the underlying filesystem to extend the file and maintain
its size. Check that images are not created on a block device.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 calls bdrv_flush() after performing COW in order to ensure that the
L2 table change is never written before the copy is safe on disk. Now that the
L2 table is cached, we can wait with flushing until we write out the next L2
table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds some new cache functions to qcow2 which can be used for caching
refcount blocks and L2 tables. When used with cache=writethrough they work
like the old caching code which is spread all over qcow2, so for this case we
have merely a cleanup.
The interesting case is with writeback caching (this includes cache=none) where
data isn't written to disk immediately but only kept in cache initially. This
leads to some form of metadata write batching which avoids the current "write
to refcount block, flush, write to L2 table" pattern for each single request
when a lot of cluster allocations happen. Instead, cache entries are only
written out if its required to maintain the right order. In the pure cluster
allocation case this means that all metadata updates for requests are done in
memory initially and on sync, first the refcount blocks are written to disk,
then fsync, then L2 tables.
This improves performance of scenarios with lots of cluster allocations
noticably (e.g. installation or after taking a snapshot).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
cpu_to_be64w() is called with an obviously non-aligned pointer. Use
cpu_to_be64wu() instead. It fixes unaligned accesses errors on IA64
hosts.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix a file descriptor leak, reported by cppcheck:
[/src/qemu/block/vvfat.c:759]: (error) Resource leak: dir
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
In addition this adds missing braces to the function to be consistent
with the coding style.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It doesn't really make sense for functions in qcow2.c to be named
qcow_ so convert the names to match correctly.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds support for the qemu-img check command. It also
introduces a dirty bit in the qed header to mark modified images as
needing a check. This bit is cleared when the image file is closed
cleanly.
If an image file is opened and it has the dirty bit set, a consistency
check will run and try to fix corrupted table offsets. These
corruptions may occur if there is power loss while an allocating write
is performed. Once the image is fixed it opens as normal again.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch implements the read/write state machine. Operations are
fully asynchronous and multiple operations may be active at any time.
Allocating writes lock tables to ensure metadata updates do not
interfere with each other. If two allocating writes need to update the
same L2 table they will run sequentially. If two allocating writes need
to update different L2 tables they will run in parallel.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds code to look up data cluster offsets in the image via
the L1/L2 tables. The L2 tables are writethrough cached in memory for
performance (each read/write requires a lookup so it is essential to
cache the tables).
With cluster lookup code in place it is possible to implement
bdrv_is_allocated() to query the number of contiguous
allocated/unallocated clusters.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch introduces the qed on-disk layout and implements image
creation. Later patches add read/write and other functionality.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add support to discard blocks in a raw image residing on an XFS filesystem
by calling the XFS_IOC_UNRESVSP64 ioctl to punch holes. Support for other
hole punching mechanisms can be added when they become available.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a new bdrv_discard method to free blocks in a mapping image, and a new
drive property to set the granularity for these discard. If no discard
granularity support is set discard support is disabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
RBD is an block driver for the distributed file system Ceph
(http://ceph.newdream.net/). This driver uses librados (which is part
of the Ceph server) for direct access to the Ceph object store and is
running entirely in userspace (Yehuda also wrote a driver for the
linux kernel, that can be used to access rbd volumes as a block
device).
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Christian Brunner <chb@muc.de>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All drivers use bs->file instead of s->hd for quite a while now, so it's time
to remove s->hd.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
When building on a 64 bit host which uses 'long' for int64_t,
GCC emits a warning:
CC block/blkverify.o
/src/qemu/block/blkverify.c: In function `blkverify_verify_readv':
/src/qemu/block/blkverify.c:304: warning: long long int format, long
unsigned int arg (arg 3)
Rework a77cffe7e9 to avoid the warning.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The cache content may be destroyed after a failed read, better not use it any
more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
This changes bdrv_flush to return 0 on success and -errno in case of failure.
It's a requirement for implementing proper error handle in users of bdrv_flush.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Move timer init functions to a new file, qemu-timer-common.c. Make other
critical timer functions inlined to preserve performance in
qemu-timer.c, also move muldiv64() (used by the inline functions)
to qemu-timer.h.
Adjust block/raw-posix.c and simpletrace.c to use get_clock() directly.
Remove a similar/duplicate definition in qemu-tool.c.
Adjust hw/omap_clk.c to include qemu-timer.h because muldiv64() is used
there.
After this change, tracing can be used also for user code and
simpletrace on Win32.
Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Adding the gcc format attribute detects a format bug
which is fixed here.
v2:
Don't use type cast. BDRV_SECTOR_SIZE is unsigned long long,
so %lld should be the correct format specifier.
Cc: Blue Swirl <blauwirbel@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In order to backup snapshots, created from QCOW2 iamge, we want to copy snapshots out of QCOW2 disk to a seperate storage.
The following patch adds a new option in "qemu-img": qemu-img convert -f qcow2 -O qcow2 -s snapshot_name src_img bck_img.
Right now, it only supports to copy the full snapshot, delta snapshot is on the way.
Changes from V1: all the comments from Kevin are addressed:
Add read-only checking
Fix coding style
Change the name from bdrv_snapshot_load to bdrv_snapshot_load_tmp
Signed-off-by: Disheng Su <edison@cloud.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of doing lots of magic for setting up initial refcount blocks and stuff
create a minimal (inconsistent) image, open it and initialize the rest with
regular qcow2 functions.
This is a complete rewrite of the image creation function. The old
implementating is #ifdef'd out and will be removed by the next patch (removing
it here would have made the diff unreadable because diff tries to find
similarities when it's really a rewrite)
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The L1 table grow operation includes a size calculation that bumps up
the new L1 table size in order to anticipate the size needs of vmstate
data. This helps reduce the number of times that the L1 table has to be
grown when vmstate data is appended.
This size overhead is not necessary during image creation,
bdrv_truncate(), or snapshot goto operations. In fact, existing
qemu-iotests that exercise table growth are no longer able to trigger it
because image creation preallocates an L1 table that is too large after
changes to qcow_create2().
This patch keeps the size calculation but also adds exact growth for
callers that do not want to inflate the L1 table size unnecessarily.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Compiling with GCC 4.6.0 20100925 produced a warning:
/src/qemu/block/qcow2-refcount.c: In function 'update_refcount':
/src/qemu/block/qcow2-refcount.c:552:13: error: variable 'dummy' set but not used [-Werror=unused-but-set-variable]
Fix by adding a dummy cast so that the result is not unused.
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Fix this compiler warning:
./block/vvfat.c:2285: error: comparison of unsigned expression >= 0 is always true
Cc: Blue Swirl <blauwirbel@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
The blkverify block driver makes investigating image format data
corruption much easier. A raw image initialized with the same contents
as the test image (e.g. qcow2 file) must be provided. The raw image
mirrors read/write operations and is used to verify that data read from
the test image is correct.
See docs/blkverify.txt for more information.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 used to use bounce buffers for any AIO requests. This does not only imply
unnecessary copying, but also unbounded allocations which should be avoided.
This patch removes bounce buffers from the normal AIO write path. Encrypted
images continue to use a bounce buffer, however with constant size.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 used to use bounce buffers for any AIO requests. This does not only imply
unnecessary copying, but also unbounded allocations which should be avoided.
This patch removes bounce buffers from the normal AIO read path, and constrains
them to a constant size for encrypted images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We always have a sync for the refcount update when a new cluster is
allocated. If we move this past the COW, we can save an additional sync.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Note that the flush is omitted intentionally in qcow2_free_clusters. If
anything, we can leak clusters here if we lose the writes.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block/nbd.c: use default port number when none is specified
qemu-nbd.c: use IANA-assigned port number: 10809
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace the hardcoded handling of 512 byte alignment with bs->buffer_alignment
to handle larger sector size devices correctly.
Note that we can not rely on it to be initialize in bdrv_open, so deal
with the worst case there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The qcow file used for write support in vvfat is a temporary file,
so we can use cache=unsafe there. Without this, write support is just
too slow to be of any use.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
Allocation and deallocation of bs->opaque is not in the control of a
block driver. Therefore it should not set bs->opaque to a data structure
used by another bs, or closing the image will lead to a double free.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
vvfat tries to set the readonly flag in its open function, but nowadays
this is overwritted with the readonly=... command line option. Check in
bdrv_write if the vvfat was opened read-only and return an error in this
case.
Without this check, vvfat tries to access the qcow bs, which is NULL
without enabled write support.
Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
The signedness of enum types depend on the compiler implementation.
Therefore the check for negative values may or may not be meaningful.
Fix by explicitly casting to a signed integer.
Since the values are also checked earlier against event_names
table, this is an internal error. Change the 'if' to 'assert'.
This also avoids a warning with GCC flag -Wtype-limits.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This reverts commit 79368c81bf.
Conflicts:
block.c
I haven't been able to come up with a solution yet for the corruption caused by
unaligned requests from the IDE disk so revert until a solution can be written.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When a new cluster was allocated, we only need a flush after the write to the
L2 table if it was a COW and we need to decrease the refcounts of the old
clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow symbolic links which point to /dev/sgX devices.
Signed-off-by: Bernhard Kohl <bernhard.kohl@nsn.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On Linux, we have code to detect CD-ROMs using an ioctl. We shouldn't lose
anything but false positives by removing the check for a /dev/cd* path.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch allows to connect Qemu using NBD protocol to an nbd-server
using named exports.
For instance, if on the host "isoserver", in /etc/nbd-server/config, you have:
[generic]
[debian-500-ppc-netinst]
exportname = /ISO/debian-500-powerpc-netinst.iso
[Fedora-10-ppc-netinst]
exportname = /ISO/Fedora-10-ppc-netinst.iso
You can connect to it, using:
qemu -cdrom nbd:isoserver:exportname=debian-500-ppc-netinst
qemu -cdrom nbd:isoserver:exportname=Fedora-10-ppc-netinst
NOTE: you need at least nbd-server 2.9.18
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
"qemu_socket.h" includes all necessary files and
including <netinet/tcp.h> without <netinet/in.h>
could cause errors on some systems.
Signed-off-by: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Assuming that any image on a block device is not properly zero-initialized is
actually wrong: Only raw images have this problem. Any other image format
shouldn't care about it, they initialize everything properly themselves.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no need to have a second set of integral types.
Replace them by the standard types from stdint.h.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
CVE-2008-2004 described a vulnerability in QEMU whereas a malicious user could
trick the block probing code into accessing arbitrary files in a guest. To
mitigate this, we added an explicit format parameter to -drive which disabling
block probing.
Fast forward to today, and the vast majority of users do not use this parameter.
libvirt does not use this by default nor does virt-manager.
Most users want block probing so we should try to make it safer.
This patch adds some logic to the raw device which attempts to detect a write
operation to the beginning of a raw device. If the first 4 bytes happen to
match an image file that has a backing file that we support, it scrubs the
signature to all zeros. If a user specifies an explicit format parameter, this
behavior is disabled.
I contend that while a legitimate guest could write such a signature to the
header, we would behave incorrectly anyway upon the next invocation of QEMU.
This simply changes the incorrect behavior to not involve a security
vulnerability.
I've tested this pretty extensively both in the positive and negative case. I'm
not 100% confident in the block layer's ability to deal with zero sized writes
particularly with respect to the aio functions so some additional eyes would be
appreciated.
Even in the case of a single sector write, we have to make sure to invoked the
completion from a bottom half so just removing the zero sized write is not an
option.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
WIN32 is not only the system which doesn't have TCP_CORK (e.g. OS X).
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Sheepdog is a distributed storage system for QEMU. It provides highly
available block level storage volumes to VMs like Amazon EBS. This
patch adds a qemu block driver for Sheepdog.
Sheepdog features are:
- No node in the cluster is special (no metadata node, no control
node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing
The more details are available at the project site:
http://www.osrg.net/sheepdog/
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
raw_pread_aligned() retries up to two times if the block device backs
a virtual CD-ROM (a drive with media=cdrom and if=ide, scsi, xen or
none). This makes no sense. Whether retrying reads can correct read
errors can only depend on what we're reading, not on how the result
gets used. We need to check what whether we're reading from a
physical CD-ROM or floppy here.
I doubt retrying is useful even then. Left for another day.
Impact:
* Virtual CD-ROM backed by host_cdrom behaves the same.
* Virtual CD-ROM backed by file or host_device no longer retries.
* A drive backed by host_cdrom now retries even if it's not a virtual
CD-ROM.
* Any drive backed by host_floppy now retries.
While there, clean up gratuitous use of goto.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This distinguishes between harmless leaks and real corruption. Hopefully users
better understand what qemu-img check wants to tell them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
state = 0 in rules means that the rule is valid for any state. Therefore it's
impossible to have a rule that works only in the initial state. This changes
the initial state from 0 to 1 to make this possible.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Forgetting to free them means that the next instance inherits all rules and
gets its own rules only additionally.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The list head was initialized to point to the wrong list, so all actions ended
up being handled as inject-error even if they were set-state in fact.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
People were wondering why qemu-img check failed after they tried to preallocate
a large qcow2 file and ran out of disk space.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Trying to check them leads to a second error message which is more confusing
than helpful:
Can't get refcount for cluster 0: Invalid argument
ERROR cluster 0 refcount=-22 reference=1
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With corrupted images, we can easily get an cluster index that exceeds the
array size of the temporary refcount table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.
While at it, correct the wrong usage of errno.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We don't have an equivalent to mmap in the qemu block API, so read and
write the bitmap directly. At least in the dumb implementation added
in this patch this is a lot less efficient, but it means cow can also
work on windows, and over nbd or curl. And it fixes qemu-iotests testcase
012 which did not work properly due to issues with read-only mmap access.
In addition we can also get rid of the now unused get_mmap_addr function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread/pwrite instead of lseek + read/write in preparation of using the
qemu block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If writing the L1 table to disk failed, we need to restore its old content in
memory to avoid inconsistencies.
Reported-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes load_refcount_block which completely ignored the return value of
write_refcount_block and always returned -EIO for bdrv_pwrite failure.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently it would consider blocks for which get_refcount fails used. However,
it's unlikely that get_refcount would succeed for the next cluster, so it's not
really helpful. Return an error instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
get_refcount might need to load a refcount block from disk, so errors may
happen. Return the error code instead of assuming a refcount of 1 and change
the callers to respect error return values.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This changes the vpc block driver (for VHD) to read/write multiple sectors at
once instead of doing a request for each single sector.
Before this, running qemu-iotests for VPC took ages, now it's actually quite
reasonable to run it always (down from ~1 hour to 40 seconds for me).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Clean up raw-posix.c to be more consistent using BDRV_SECTOR_SIZE
instead of hard coded 512 values.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After it is done with updating refcounts in the cache, update_refcount writes
all changed entries to disk. If a refcount block allocation fails, however,
there was no change yet and therefore first_index = last_index = -1. Don't
treat -1 as a normal sector index (resulting in a 512 byte write!) but return
without updating anything in this case.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Refblock allocation code needs to take into consideration that update_refcount
will load a different refcount block into the cache, so it must initialize the
cache for a new refcount block only afterwards. Not doing this means that not
only the refcount in the wrong block is updated, but also that the caller will
work on the wrong block.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
write_refcount_block_entries used to return -EIO for any errors. Change this to
return the real error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_get_cluster_offset() looks up a given virtual disk offset and returns the
offset of the corresponding cluster in the image file. Errors (e.g. L2 table
can't be read) are currenctly indicated by a return value of 0, which is
unfortuately the same as for any unallocated cluster. So in effect we can't
check for errors.
This makes the old return value a by-reference parameter and returns the usual
0/-errno error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
l2_allocate has some intermediate states in which the image is inconsistent.
Change the order to write to the L1 table only after the new L2 table has
successfully been initialized.
Also reset the L2 cache in failure case, it's very likely wrong.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the L2 table was already updated in cache, but writing it to disk has
failed, we must not continue using the changed version in the cache to stay
consistent with what's on the disk.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Casting a pointer to an int doesn't work on 64 bit platforms. Use the %p printf
conversion specifier instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
gcc does not like passing a NULL where an int value is expected:
block/vvfat.c: In function ‘checkpoint’:
block/vvfat.c:2868: error: passing argument 2 of ‘remove_mapping’ makes
integer from pointer without a cast
Signed-off-by: Riccardo Magliocchetti <riccardo.magliocchetti@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The fix is based on a patch from Kevin Wolf. Here his comment:
"The number of blocks needs to be rounded up to cover all of the virtual hard
disk. Without this fix, we can't even open our own images if their size is not
a multiple of the block size."
While Kevin's patch addressed vdi_create, my modification also fixes
vdi_open which now accepts images with odd disk sizes.
v3:
Don't allow reading of disk images with too large disk sizes.
Neither VBoxManage nor old versions of qemu-img read such images.
This change requires rounding of odd disk sizes before we do the checks.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: François Revol <revol@free.fr>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Dmg actually does an lseek to a negative offset in the open routine,
which we replace with offset arithmetics after doing a bdrv_getlength.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API. Note that dmg actually uses the implicit file offset
a lot in dmg_open, and we had to replace it with an offset variable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When dmg_read_chunk encounters an uncompressed chunk it currently
calls read without any previous adjustment of the file postion.
This seems very wrong, and the "reference" implementation in
dmg2img does a search to the same offset as done in the various
compression cases, so do the same here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The VHD algorithm calculates a disk geometry
which is usually smaller than the requested size.
QEMU tried to round up but failed for certain sizes:
qemu-img create -f vpc disk.vpc 9437184
would create an image with 9435136 bytes
(which is too small for qemu-img convert).
Instead of hacking the geometry algorithm, the patch
increases the number of sectors until we get enough
sectors.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Even it is not very useful, users may create images of size 0.
Without the special option CONFIG_ZERO_MALLOC, qemu_mallocz
aborts execution when it is told to allocate 0 bytes,
so avoid this kind of call.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use bdrv_pwrite to access the backing device instead of pread, and
convert the driver to implementing the bdrv_open method which gives
it an already opened BlockDriverState for the underlying device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use pread instead of lseek + read in preparation of using the qemu
block API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
OpenBSDs gcc is said to generate warnings for this declaration, so don't
reference bdrv_qcow2 directly, but look it up using bdrv_find_format.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This reverts commit 20d97356c9.
The BlockDriver definition should stay at the end of source files.
Conflicts:
block/qcow2.c
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
This patch adds the ability to grow qcow2 images in-place using
bdrv_truncate(). This enables qemu-img resize command support for
qcow2.
Snapshots are not supported and bdrv_truncate() will return -ENOTSUP.
The notion of resizing an image with snapshots could lead to confusion:
users may expect snapshots to remain unchanged, but this is not possible
with the current qcow2 on-disk format where the header.size field is
global instead of per-snapshot. Others may expect snapshots to change
size along with the current image data. I think it is safest to not
support snapshots and perhaps add behavior later if there is a
consensus.
Backing images continue to work. If the image is now larger than its
backing image, zeroes are read when accessing beyond the end of the
backing image.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
While it's true that during regular operation free_clusters failure would be a
bug, an I/O error can always happen. There's no need to kill the VM, the worst
thing that can happen (and it will) is that we leak some clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch combines the lseek+read/write calls to use pread/pwrite
instead. This will result in fewer system calls and is already used by
AIO.
Thanks to Jan Kiszka <jan.kiszka@siemens.com> for identifying excessive
lseek and Christoph Hellwig <hch@lst.de> for confirming that this
approach should work.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The i loop iterator is shadowed by the next free cluster index. Both
using the variable name 'i' makes the code harder to read.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
VMDK is doing interesting things when it needs to open a backing file. This
patch changes that part to look more like in other drivers. The nice side
effect is that the file name isn't needed any more in the open function.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When trying to do COW, VMDK wrote the data back to the backing file. This
problem was revealed by the patch that made backing files read-only. This patch
does not only fix the problem, but also simplifies the VMDK code a bit.
This fixes the backing file qemu-iotests cases for VMDK.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Format drivers shouldn't need to bother with things like file names, but rather
just get an open BlockDriverState for the underlying protocol. This patch
introduces this behaviour for bdrv_open implementation. For protocols which
need to access the filename to open their file/device/connection/... a new
callback bdrv_file_open is introduced which doesn't get an underlying file
opened.
For now, also some of the more obscure formats use bdrv_file_open because they
open() the file themselves instead of using the block.c functions. They need to
be fixed in later patches.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We're running into various problems because the "raw" file access, which
is used internally by the various image formats is entangled with the
"raw" image format, which maps the VM view 1:1 to a file system.
This patch renames the raw file backends to the file protocol which
is treated like other protocols (e.g. nbd and http) and adds a new
"raw" image format which is just a wrapper around calls to the underlying
protocol.
The patch is surprisingly simple, besides changing the probing logical
in block.c to only look for image formats when using bdrv_open and
renaming of the old raw protocols to file there's almost nothing in there.
For creating images, a new bdrv_create_file is introduced which guesses the
protocol to use. This allows using qemu-img create -f raw (or just using the
default) for both files and host devices. Converting the other format drivers
to use this function to create their images is left for later patches.
The only issues still open are in the handling of the host devices.
Firstly in current qemu we can specifiy the host* format names
on various command line acceping images, but the new code can't
do that without adding some translation. Second the layering breaks
the no_zero_init flag in the BlockDriver used by qemu-img. I'm not
happy how this is done per-driver instead of per-state so I'll
prepare a separate patch to clean this up.
There's some more cleanup opportunity after this patch, e.g. using
separate lists and registration functions for image formats vs
protocols and maybe even host drivers, but this can be done at a
later stage.
Also there's a check for protocol in bdrv_open for the BDRV_O_SNAPSHOT
case that I don't quite understand, but which I fear won't work as
expected - possibly even before this patch.
Note that this patch requires various recent block patches from Kevin
and me, which should all be in his block queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix clang warnings:
/src/qemu/block/vvfat.c:1102:9: warning: Value stored to 'index3' during its initialization is never read
int index3=index1+1;
/src/qemu/cmd.c:290:15: warning: Value stored to 'p' during its initialization is never read
char *p = result;
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
GCC 3.3.5 generates warnings for static forward declarations of data, so
rearrange code to use static forward declarations of functions instead.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Returning NULL on error doesn't allow distinguishing between different errors.
Change the interface to return an integer for -errno.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Split up the raw_getlength into separate generic, solaris and BSD
versions to reduce the ifdef maze a bit. The BSD variant still
is a complete maze, but to clean it up properly we'd need some
people using the BSD variants to figure out what code is used
for what variant.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
What is known today as bdrv_open2 becomes the new bdrv_open. All remaining
callers of the old function are converted to the new one. In some places they
even know the right format, so they should have used bdrv_open2 from the
beginning.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow_create2 assumes that the new image will only need one cluster for its
refcount table initially. Obviously that's not true any more when the image is
big enough (exact value depends on the cluster size).
This patch calculates the refcount table size dynamically.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Block drivers can trigger a blkdebug event whenever they reach a place where it
could be useful to inject an error for testing/debugging purposes.
Rules are read from a blkdebug config file and describe which action is taken
when an event is triggered. For now this is only injecting an error (with a few
options) or changing the state (which is an integer). Rules can be declared to
be active only in a specific state; this way later rules can distiguish on
which path we came to trigger their event.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a mechanism to inject errors instead of passing requests on. With no
further patches applied, you can use it by setting inject_errno in gdb.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This isn't doing anything interesting. It creates the blkdebug block driver as
a protocol which just passes everything through to raw.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_open already takes care of this for us.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
If we complete a request with a failure we need to remove it from the list of
requests that are in flight. If we don't do it, the next time the same AIOCB is
used for a cluster allocation it will create a loop in the list and qemu will
hang in an endless loop.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Returning -EIO is far from optimal, but at least it's an error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Now that we output an error message according to the returned error code in
qemu-img, let's return the real error codes. "Input/output error" for
everything isn't helpful.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
When building with -DNDEBUG, assert(0) will not stop execution
so it must not be used for abnormal termination.
Use cpu_abort() when in CPU context, abort() otherwise.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
cleanup code is identical for error/success cases. Only difference
are goto labels.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
fail_gd error case would also free rgd_buf that was already freed
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When checking for errors, commit db89119d compares with the wrong values,
failing image creation even when there was no error. Additionally, if an
error has occured, we can't preallocate the image (it's likely broken).
This unbreaks test 023 of qemu-iotests.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The current implementation of alloc_refcount_block and grow_refcount_table has
fundamental problems regarding error handling. There are some places where an
I/O error means that the image is going to be corrupted. I have found that the
only way to fix this is to completely rewrite the thing.
In detail, the problem is that the refcount blocks itself are allocated using
alloc_refcount_noref (to avoid endless recursion when updating the refcount of
the new refcount block, which migh access just the same refcount block but its
allocation is not yet completed...). Only at the end of the refcount allocation
the refcount of the refcount block is increased. If an error happens in
between, the refcount block is in use, but has a refcount of zero and will
likely be overwritten later.
The new approach is explained in comments in the code. The trick is basically
to let new refcount blocks describe their own refcount, so their refcount will
be automatically changed when they are hooked up in the refcount table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When the refcount table grows, it doesn't only grow by one entry but reserves
some space for future refcount blocks. The algorithm to calculate the number of
entries stays the same with the fixes, so factor it out before replacing the
rest.
As Juan suggested take the opportunity to simplify the code a bit.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If a write requests crosses a L2 table boundary and all clusters until the
end of the L2 table are usable for the request, we must not look at the next
L2 entry because we already have arrived at the end of the array.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Most of these are obvious NULL-deref bug fixes, for example,
the ones in these files:
block/curl.c
net.c
slirp/misc.c
and the first one in block/vvfat.c.
The others in block/vvfat.c may not lead to an immediate segfault, but I
traced the two schedule_rename(..., strdup(path)) uses, and a failed
strdup would appear to trigger this assertion in handle_renames_and_mkdirs:
assert(commit->path);
The conversion to use qemu_strdup in envlist_to_environ is not technically
needed, but does avoid a theoretical leak in the caller when strdup fails
for one value, but later succeeds in allocating another buffer(plausible,
if one string length is much larger than the others). The caller does
not know the length of the returned list, and as such can only free
pointers until it hits the first NULL. If there are non-NULL pointers
beyond the first, their buffers would be leaked. This one is admittedly
far-fetched.
The two in linux-user/main.c are worth fixing to ensure that an
OOM error is diagnosed up front, rather than letting it provoke some
harder-to-diagnose secondary error, in case of exec failure, or worse, in
case the exec succeeds but with an invalid list of command line options.
However, considering how unlikely it is to encounter a failed strdup early
in main, this isn't a big deal. Note that adding the required uses of
qemu_strdup here and in envlist.c induce link failures because qemu_strdup
is not currently in any library they're linked with. So for now, I've
omitted those changes, as well as the fixes in target-i386/helper.c
and target-sparc/helper.c.
If you'd like to see the above discussion (or anything else)
in the commit log, just let me know and I'll be happy to adjust.
>From 9af42864fd1ea666bd25e2cecfdfae74c20aa8c7 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering@redhat.com>
Date: Mon, 8 Feb 2010 18:29:29 +0100
Subject: [PATCH] don't dereference NULL after failed strdup
Handle failing strdup by replacing each use with qemu_strdup,
so as not to dereference NULL or trigger a failing assertion.
* block/curl.c (curl_open): s/\bstrdup\b/qemu_strdup/
* block/vvfat.c (init_directories): Likewise.
(get_cluster_count_for_direntry, check_directory_consistency): Likewise.
* net.c (parse_host_src_port): Likewise.
* slirp/misc.c (fork_exec): Likewise.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Checking for return codes < 0 isn't really going to work with unsigned
types. Use signed types instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This shouldn't happen under any normal circumstances. However, it looks like
it's possible to achieve this with corrupted images. Without this patch
raw_pread is hanging in an endless loop in such cases.
The patch is not affecting growable files, for which such reads happen in
normal use cases. raw_pread_aligned already handles these cases and won't
return zero in the first place.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Win32 suffers from a very big memory leak when dealing with SCSI devices.
Each read/write request allocates memory with qemu_memalign (ie
VirtualAlloc) but frees it with qemu_free (ie free).
Pair all qemu_memalign() calls with qemu_vfree() to prevent such leaks.
Signed-off-by: Herve Poussineau <hpoussin@reactos.org>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The n member is not very descriptive and very hard to grep, rename it to
cur_nr_sectors to better indicate what it is used for. Also rename
nb_sectors to remaining_sectors as that is what it is used for.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The BDRV_O_CREAT option is unused inside qemu and partially duplicates
the bdrv_create method. Remove it, and the -C option to qemu-io which
isn't used in qemu-iotests anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Found some places that seems needs this explicitly, now that
read-write is not the default.
Signed-off-by: Naphtali Sprei <nsprei@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/qcow2.o
cc1: warnings being treated as errors
block/qcow2.c: In function 'qcow_create2':
block/qcow2.c:829: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:838: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:839: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:841: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:844: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:849: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:852: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow2.c:855: error: ignoring return value of 'write', declared with attribute warn_unused_result
make: *** [block/qcow2.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/vvfat.o
cc1: warnings being treated as errors
block/vvfat.c: In function 'commit_one_file':
block/vvfat.c:2259: error: ignoring return value of 'ftruncate', declared with attribute warn_unused_result
make: *** [block/vvfat.o] Error 1
CC block/vvfat.o
In file included from /usr/include/stdio.h:912,
from ./qemu-common.h:19,
from block/vvfat.c:27:
In function 'snprintf',
inlined from 'init_directories' at block/vvfat.c:871,
inlined from 'vvfat_open' at block/vvfat.c:1068:
/usr/include/bits/stdio2.h:65: error: call to __builtin___snprintf_chk will always overflow destination buffer
make: *** [block/vvfat.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/vmdk.o
cc1: warnings being treated as errors
block/vmdk.c: In function 'vmdk_snapshot_create':
block/vmdk.c:236: error: ignoring return value of 'ftruncate', declared with attribute warn_unused_result
block/vmdk.c: In function 'vmdk_create':
block/vmdk.c:775: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/vmdk.c:776: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/vmdk.c:778: error: ignoring return value of 'ftruncate', declared with attribute warn_unused_result
block/vmdk.c:784: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/vmdk.c:790: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/vmdk.c:807: error: ignoring return value of 'write', declared with attribute warn_unused_result
make: *** [block/vmdk.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/qcow.o
cc1: warnings being treated as errors
block/qcow.c: In function 'qcow_create':
block/qcow.c:804: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow.c:806: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/qcow.c:811: error: ignoring return value of 'write', declared with attribute warn_unused_result
make: *** [block/qcow.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/cow.o
cc1: warnings being treated as errors
block/cow.c: In function 'cow_create':
block/cow.c:251: error: ignoring return value of 'write', declared with attribute warn_unused_result
block/cow.c:253: error: ignoring return value of 'ftruncate', declared with attribute warn_unused_result
make: *** [block/cow.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now that qcow2_alloc_clusters can return error codes, we must handle them in
the callers of qcow2_alloc_clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
update_refcount can return errors that need to be handled by the callers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
There's absolutely no problem with updating the refcounts of 0 clusters.
At least snapshot code is doing this and would fail once the result of
update_refcount isn't ignored any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If update_refcount fails, try to undo any changes made so far to avoid
inconsistencies in the image file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Returning 0/-errno allows it to distingush different errors classes. The
cluster offset of newly allocated clusters is now returned in the QCowL2Meta
struct.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Switching to 0/-errno allows it to distinguish different error cases.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Don't assume success but pass the bdrv_pwrite return value on.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Return the appropriate error value instead of always using EIO. Don't free the
L1 table on errors, we still need it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Instead of using the field 'readonly' of the BlockDriverState struct for passing the request,
pass the request in the flags parameter to the function.
Signed-off-by: Naphtali Sprei <nsprei@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Current legacy floppy detection is hardcoded based on source file
name. Make this smarter on linux by attempting a floppy specific
ioctl.
v2:
Give ioctl check higher priority than filename check
s/IDE/legacy/
v3:
Actually initialize 'prio' variable
Check for ioctl success rather than absence of specific failure
v4:
Explicitly mention that change is linux specific.
Signed-off-by: Cole Robinson <crobinso@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Current CDROM detection is hardcoded based on source file name.
Make this smarter on linux by attempting a CDROM specific ioctl.
This makes '-cdrom /dev/sr0' succeed with no media present.
v2:
Give ioctl check higher priority than filename check.
v3:
Actually initialize 'prio' variable.
Check for ioctl success rather than absence of specific failure.
v4:
Explicitly mention that change is linux specific.
Signed-off-by: Cole Robinson <crobinso@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now that we do not have to flush the backing device anymore implementing
the bdrv_aio_flush method for image formats is trivial.
[hch: forward ported to qemu mainline from a product tree]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Introduce the functions needed to change the backing file of an image. The
function is implemented for qcow2.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently the dmg image format driver simply opens the images as raw
if any kind of failure happens. This is contrarty to the behaviour
of all other image formats which just return an error and let the
block core deal with it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The disk image I created from my old laptop disk with VBoxManage
internalcommand converthd obviously was not a multiple of 1MB as when
created from scratch. This fixes QEMU refusing it. We still require the
size to be a multiple of sector size though.
It then boots correctly.
Allow opening VDI images with size not multiple of 1MB (as when converted from a raw disk).
Signed-off-by: François Revol <revol@free.fr>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
CC block/bochs.o
cc1: warnings being treated as errors
block/bochs.c: In function 'seek_to_sector':
block/bochs.c:202: error: ignoring return value of 'read', declared with attribute warn_unused_result
make: *** [block/bochs.o] Error 1
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
We're leaking file descriptors to child processes. Set FD_CLOEXEC on file
descriptors that don't need to be passed to children to stop this misbehaviour.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
I haven't heard yet of anyone using qemu-img to copy an image to a real floppy,
but it's a valid use case.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently qcow2 unnecessarily rounds up the length of the backing format string
to the next multiple of 8. At the same time, the array in BlockDriverState can
only hold 15 characters, so in effect backing formats with 9 characters or more
don't work (e.g. host_device).
Save the real string length and things start to work for all valid image format
names.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Images with disk size 0 may be used for
VM snapshots, but not to save normal block data.
It is possible to create such images using
qemu-img, but opening them later fails.
So even "qemu-img info image.qcow2" is not
possible for an image created with
"qemu-img create -f qcow2 image.qcow2 0".
This is fixed here.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The context parameter in paio_submit isn't used anyway, so there is no reason
why block drivers should need to remember it. This also avoids passing a Linux
AIO context to paio_submit (which doesn't do any harm as long as the parameter
is unused, but it is highly confusing).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
It was merely a workaround and the real fix is done now.
This reverts commit ef845c3bf4.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We'll leave some AIO completions unhandled when we can't call the callback.
qemu_aio_process_queue() is used later to run any callbacks that are left and
can be run then.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When using Linux AIO raw still falls back to POSIX AIO sometimes, so we should
initialize it.
Not initializing it happens to work if POSIX AIO is used by another drive, or
if the format is not specified (probing the format uses POSIX AIO) or by pure
luck (e.g. it doesn't seem to happen any more with qcow2 since we have re-added
synchronous qcow2 functions).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
In case of failure, we haven't increased the refcount for the newly allocated
cluster yet. Therefore we must not free the cluster or its refcount will become
negative (and endless recursion is possible).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When the synchronous read and write functions were dropped, they were replaced
by generic emulation functions. Unfortunately, these emulation functions don't
provide the same semantics as the original functions did.
The original bdrv_read would mean that we read some data synchronously and that
we won't be interrupted during this read. The latter assumption is no longer
true with the emulation function which needs to use qemu_aio_poll and therefore
allows the callback of any other concurrent AIO request to be run during the
read. Which in turn means that (meta)data read earlier could have changed and
be invalid now. qcow2 is not prepared to work in this way and it's just scary
how many places there are where other requests could run.
I'm not sure yet where exactly it breaks, but you'll see breakage with virtio
on qcow2 with a backing file. Providing synchronous functions again fixes the
problem for me.
Patchworks-ID: 35437
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Today host_devices have a create function, so they also need a create_options
field to prevent qemu-img from complaining.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch increases the maximum qcow2 cluster size to 2 MB. Starting with 128k
clusters, L2 tables span 2 GB or more of virtual disk space, causing 32 bit
truncation and wraparound of signed integers. Therefore some variables need to
use a larger data type.
While being at reviewing data types, change some integers that are used for
array indices to unsigned. In some places they were checked against some upper
limit but not for negative values. This could avoid potential segfaults with
corrupted qcow2 images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If available, the Universally Unique Identifier library
is used by the vdi block driver.
Other parts of QEMU (vl.c) could also use it.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
In the very least, a change like this requires discussion on the list.
The naming convention is goofy and it causes a massive merge problem. Something
like this _must_ be presented on the list first so people can provide input
and cope with it.
This reverts commit 99a0949b72.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Put space between = and & when taking a pointer,
to avoid confusion with old-style "&=".
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Problem: Our file sys-queue.h is a copy of the BSD file, but there are
some additions and it's not entirely compatible. Because of that, there have
been conflicts with system headers on BSD systems. Some hacks have been
introduced in the commits 15cc923584,
f40d753718,
96555a96d7 and
3990d09adf but the fixes were fragile.
Solution: Avoid the conflict entirely by renaming the functions and the
file. Revert the previous hacks.
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Instead stalling the VCPU while serving a cache flush try to do it
asynchronously. Use our good old helper thread pool to issue an
asynchronous fdatasync for raw-posix. Note that while Linux AIO
implements a fdatasync operation it is not useful for us because
it isn't actually implement in asynchronous fashion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
If we are flushing the caches for our image files we only care about the
data (including the metadata required for accessing it) but not things
like timestamp updates. So try to use fdatasync instead of fsync to
implement the flush operations.
Unfortunately many operating systems still do not support fdatasync,
so we add a qemu_fdatasync wrapper that uses fdatasync if available
as per the _POSIX_SYNCHRONIZED_IO feature macro or fsync otherwise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When two AIO requests write to the same cluster, and this cluster is
unallocated, currently both requests allocate a new cluster and the second one
merges the first one when it is completed. This means an cluster allocation, a
read and a cluster deallocation which cause some overhead. If we simply let the
second request wait until the first one is done, we improve overall performance
with AIO requests (specifially, qcow2/virtio combinations).
This patch maintains a list of in-flight requests that have allocated new
clusters. A second request touching the same cluster is limited so that it
either doesn't touch the allocation of the first request (so it can have a
non-overlapping allocation) or it waits for the first request to complete.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The wrong version of the preallocation patch has been applied, so this is the
remaining diff.
We can't use truncate to grow the image file to the right size because we don't
know if metadata has been written after the last data cluster. In this case
truncate would shrink the file and destroy its metadata. Write a zero sector at
the end of the virtual disk instead to ensure that the file is big enough.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The company which made Virtual PC was Connectix.
They use the magic string "conectix" in their disk images.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch fixes linker errors when building QEMU without Linux AIO support.
It is based on suggestions from malc and Kevin Wolf.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now that do have a nicer interface to work against we can add Linux native
AIO support. It's an extremly thing layer just setting up an iocb for
the io_submit system call in the submission path, and registering an
eventfd with the qemu poll handler to do complete the iocbs directly
from there.
This started out based on Anthony's earlier AIO patch, but after
estimated 42,000 rewrites and just as many build system changes
there's not much left of it.
To enable native kernel aio use the aio=native sub-command on the
drive command line. I have also added an option to qemu-io to
test the aio support without needing a guest.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently the raw-posix.c code contains a lot of knowledge about the
asynchronous I/O scheme that is mostly implemented in posix-aio-compat.c.
All this code does not really belong here and is getting a bit in the
way of implementing native AIO on Linux.
So instead move all the guts of the AIO implementation into
posix-aio-compat.c (which might need a better name, btw).
There's now a very small interface between the AIO providers and raw-posix.c:
- an init routine is called from raw_open_common to return an AIO context
for this drive. An AIO implementation may either re-use one context
for all drives, or use a different one for each as the Linux native
AIO support will do.
- an submit routine is called from the aio_reav/writev methods to submit
an AIO request
There are no indirect calls involved in this interface as we need to
decide which one to call manually. We will only call the Linux AIO native
init function if we were requested to by vl.c, and we will only call
the native submit function if we are asked to and the request is properly
aligned. That's also the reason why the alignment check actually does
the inverse move and now goes into raw-posix.c.
The old posix-aio-compat.h headers is removed now that most of it's
content is private to posix-aio-compat.c, and instead we add a new
block/raw-posix-aio.h headers is created containing only the tiny interface
between raw-posix.c and the AIO implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This introduces a qemu-img create option for qcow2 which allows the metadata to
be preallocated, i.e. clusters are reserved in the refcount table and L1/L2
tables, but no data is written to them. Metadata is quite small, so this
happens in almost no time.
Especially with qcow2 on virtio this helps to gain a bit of performance during
the initial writes. However, as soon as create a snapshot, we're back to the
normal slow speed, obviously. So this isn't the real fix, but kind of a cheat
while we're still having trouble with qcow2 on virtio.
Note that the option is disabled by default and needs to be specified
explicitly using qemu-img create -f qcow2 -o preallocation=metadata.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
* The code for option '-static' was wrong, so image creation
always created static images.
* Static images created with qemu-img did not set header entry
blocks_allocated.
* The size of the block map must be rounded to the next multiple
of SECTOR_SIZE, otherwise the block map is only read partially
for block map sizes which are not a multiple of SECTOR_SIZE.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
These errors come up when compiling with gcc-4.3.3 and some older headers:
/scratch/froydnj/qemu.git/block/vpc.c: In function 'vpc_create':
/scratch/froydnj/qemu.git/block/vpc.c:514: error: value computed is not used
/scratch/froydnj/qemu.git/block/vpc.c:516: error: value computed is not used
/scratch/froydnj/qemu.git/block/vpc.c:517: error: value computed is not used
/scratch/froydnj/qemu.git/block/vpc.c:566: error: value computed is not used
Use memcpy to copy the strings instead of strncpy.
Signed-off-by: Nathan Froyd <froydnj@codesourcery.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
As requested by Anthony make pthreads mandatory. This means we will always
have AIO available on posix hosts, and it will also allow enabling the I/O
thread unconditionally once it's ready.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This is a new block driver written from scratch
to support the VDI format in QEMU.
VDI is the native format used by Innotek / SUN VirtualBox.
Latest changes:
* stripped down version
(code for synchronous operations and experimental code removed)
* don't open VDI snapshot images (with uuid_link or uuid_parent)
* modified vdi_aio_cancel
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Message-Id:
Instead of storing the backing file in its own BlockDriverState, VMDK uses the
BlockDriverState of the raw image file it opened. This is wrong and breaks
functions that access the backing file or protocols. This fix replaces all
occurrences of s->hd->backing_* with bs->backing_*.
This fixes qemu-iotests failure in 020 (Commit changes to backing file).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
I used the following command to enable debugging:
perl -p -i -e 's/^\/\/#define DEBUG/#define DEBUG/g' * */* */*/*
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
In qemu-iotests, some large images are created using qemu-img.
Without checks for errors, qemu-img will just create an
empty image, and later read / write tests will fail.
With the patch, failures during image creation are detected
and reported.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The VM state offset is a concept internal to the image format. Replace
the old bdrv_{get,put}_buffer method that require an index into the
image file that is constructed from the VM state offset and an offset
into the vmstate with the bdrv_{load,save}_vmstate that just take an
offset into the VM state.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Contrary to what one could expect, the size of L1 tables is not cluster
aligned. So as we're writing whole sectors now instead of single entries,
we need to ensure that the L1 table in memory is large enough; otherwise
write would access memory after the end of the L1 table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The performance of qcow2 has improved meanwhile, so we don't need to
special-case it any more. Switch the default to write-through caching
like all other block drivers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The really time consuming part of snapshotting is to adjust the reference count
of all clusters. Currently after each adjusted cluster the refcount block is
written to disk.
Don't write each single byte immediately to disk but cache all writes to the
refcount block and write them out once we're done with the block.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When using O_DIRECT, qcow2 snapshots didn't work any more for me. In the
process of creating the snapshot, qcow2 tries to pwrite some new information
(e.g. new L1 table) which will often end up being after the old end of the
image file. Now pwrite tries to align things and reads the old contents of the
file, read returns 0 because there is nothing to read after the end of file and
pwrite is stuck in an endless loop.
This patch allows to pread beyond the end of an image file. Whenever the
given offset is after the end of the image file, the read succeeds and fills
the buffer with zeros.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Problem: It is impossible to feed filenames with the character colon because
qemu interprets such names as a protocol. For example filename scsi:0, is
interpreted as a protocol by name "scsi".
This patch allows user to espace colon characters. For example the above
filename can now be expressed either as 'scsi\:0' or as file:scsi:0
anything following the "file:" tag is interpreted verbatin. However if "file:"
tag is omitted then any colon characters in the string must be escaped using
backslash.
Here are couple of examples:
scsi\:0\:abc is a local file scsi:0:abc
http\://myweb is a local file by name http://myweb
file:scsi:0:abc is a local file scsi:0:abc
file:http://myweb is a local file by name http://myweb
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When updating the refcount blocks in update_refcount(), write complete sectors
instead of updating single entries.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When updating the L2 tables in alloc_cluster_link_l2(), write complete
sectors instead of updating single entries.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When modifying the L1 table, l2_allocate() needs to write complete sectors
instead of single entries. The L1 table is already in memory, reading it from
disk in the block layer to align the request is wasted performance.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The qcow2 source is now split into several more manageable files. During the
conversion quite some functions that were static before needed to be changed to
be global to make the source compile again.
We were lucky enough not to get name conflicts with these additional global
names, but they are not nice. This patch adds a qcow2_ prefix to all of the
global functions in qcow2.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2-snapshot.c contains the code related to snapshotting.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2-cluster.c contains all functions related to the management of guest
clusters, i.e. what the guest sees on its virtual disk. This code is about
mapping these guest clusters to host clusters in the image file using the
two-level lookup tables.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2-refcount.c contains all functions which are related to cluster
allocation and management in the image file. A large part of this is the
reference counting of these clusters.
Also a header file qcow2.h is introduced which will contain the interface of
the split qcow2 modules.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Larger cluster sizes mean less metadata. This has been discussion a few times,
let's do it now. This turns 64k clusters on by default for new images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When we open a file, we first attempt to open it read-write, then fall back
to read-only. Unfortunately we reuse the flags from the previous attempt,
so both attempts try to open the file with write permissions, and fail.
Fix by clearing the O_RDWR flag from the previous attempt.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The flags argument to raw_common_open() contain bits defined by the BDRV_O_*
namespace, not the posix O_* namespace.
Adjust to use the correct constants.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Rename raw_ioctl and raw_aio_ioctl to hdev_ioctl and hdev_aio_ioctl as they
are only used for the host device. Also only add them to the method table
for the cases where we need them (generic hdev if linux and linux CDROM)
instead of declaring stubs and always add them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a bdrv_probe_device method to all BlockDriver instances implementing
host devices to move matching of host device types into the actual drivers.
For now we keep exacly the old matching behaviour based on the devices names,
although we really should have better detetion methods based on device
information in the future.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Instead of declaring one BlockDriver for all host devices declared one
for each type: a generic one for normal disk devices, a Linux floppy
driver and a CDROM driver for Linux and FreeBSD. This gets rid of a lot
of messy ifdefs and switching based on the type in the various removal
device methods.
block.c grows a new method to find the correct host device driver based
on OS-sepcific criteria, which will later into the actual drivers in a
later patch in this series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
raw_open and hdev_open contain the same basic logic. Add a new
raw_open_common helper containing the guts of the open routine
and call it from raw_open and hdev_open.
We use the new open_flags field in BDRVRawState to allow passing
additional open flags to raw_open_common from both.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Both the Linux floppy and the FreeBSD CDROM host device need to store
the open flags so that they can re-open the device later. Store the
open flags unconditionally to remove the ifdef mess and simply the
calling conventions for the later patches in the series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
This patch adds a small help text to each of the options in the block drivers
which can be displayed by using qemu-img create -f fmt -o ?
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that we have a separate aio pool structure we can remove those
aio pool details from BlockDriver.
Every driver supporting AIO now needs to declare a static AIOPool
with the aiocb size and the cancellation method. This cleans up the
current code considerably and will make it cleaner and more obvious
to support two different aio implementations behind a single
BlockDriver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
[this one is required for [PATCH] fully split aio_pool from BlockDriver,
sorry for not sending it out earlier]
Add a qcow_aio_setup helper to qcow to shared common code between
the aio_readv and aio_writev methods. Based on the function with
the same name in qcow2.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We do need hdev_create unconditionally on all platforms so that qemu-img
create support for host device works on all platforms.
Also relax the check to allow character devices in addition to block
devices. On many Unix platforms block devices have buffered block
nodes and unbuffered character device nodes, and on FreeBSD the block
nodes don't even exist anymore. Also on Linux we do support the
/dev/sgN scsi passthrough devices through the host device driver,
and probably the old-style /dev/raw/rawN raw devices although I haven't
tested that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
raw_pread_aligned currently returns the raw return value from
lseek/read, which is always -1 in case of an error. But the
callers higher up the stack expect it to return the negated
errno just like raw_pwrite_aligned.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch converts the remaining users of bdrv_create2 to bdrv_create and
removes the now unused function.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Don't write each single changed refcount block entry to the disk after it is
written, but update all entries of the block and write all of them at once.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This is a preparation patch with no functional changes. It moves the allocation
of new refcounts block to a new function and makes update_cluster_refcount (for
one cluster) call update_refcount (for multiple clusters) instead the other way
round.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
There is only one (internal) user left and it can be switched to the normal
emulation provided in block.c
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently Qemu can read from posix I/O and NBD. This patch adds a
third protocol to the game: HTTP.
In certain situations it can be useful to access HTTP data directly,
for example if you want to try out an http provided OS image, but
don't know if you want to download it yet.
Using this patch you can now try it on on the fly. Just use it like:
qemu -cdrom http://host/path/my.iso
Signed-off-by: Alexander Graf <agraf@suse.de>
Add an option to specify the cluster size of a newly created qcow2 image.
Default is 4k which is the same value that was hard-coded before.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now we can make use of the newly introduced option structures. Instead of
having bdrv_create carry more and more parameters (which are format specific in
most cases), just pass a option structure as defined by the driver itself.
bdrv_create2() contains an emulation of the old interface to simplify the
transition.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>