Commit Graph

760 Commits

Author SHA1 Message Date
Corey Bryant
2e1e79dae7 block: Convert close calls to qemu_close
This patch converts all block layer close calls, that correspond
to qemu_open calls, to qemu_close.

Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-08-15 10:48:57 +02:00
Corey Bryant
6165f4d85d block: Convert open calls to qemu_open
This patch converts all block layer open calls to qemu_open.

Note that this adds the O_CLOEXEC flag to the changed open paths
when the O_CLOEXEC macro is defined.

Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-08-15 10:48:57 +02:00
Corey Bryant
e174082835 block: Prevent detection of /dev/fdset/ as floppy
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-08-15 10:48:57 +02:00
Anthony Liguori
53810bab3a Merge remote-tracking branch 'kwolf/for-anthony' into staging
* kwolf/for-anthony:
  qemu-iotests: skip 039 with ./check -nocache
  block: add BLOCK_O_CHECK for qemu-img check
  qcow2: mark image clean after repair succeeds
  qed: mark image clean after repair succeeds
  blockdev: flip default cache mode from writethrough to writeback
  virtio-blk: disable write cache if not negotiated
  virtio-blk: support VIRTIO_BLK_F_CONFIG_WCE
  qemu-iotests: Save some sed processes
  ahci: Fix sglist memleak in ahci_dma_rw_buf()
  ahci: Fix ahci cdrom read corruptions for reads > 128k
  virtio-blk: fix use-after-free while handling scsi commands
2012-08-11 19:48:50 -05:00
Anthony Liguori
312942619a Merge remote-tracking branch 'bonzini/scsi-next' into staging
* bonzini/scsi-next:
  scsi-disk: add support for the UNMAP command
  scsi-disk: improve out-of-range LBA detection for WRITE SAME
  scsi-disk: more assertions and resets for aiocb
  virtio-scsi: do not compare 32-bit QEMU tags against 64-bit virtio-scsi tags
  iscsi: Pick default initiator-name based on the name of the VM
  iscsi: reorganize code for parse_initiator_name
  iscsi: do not leak initiator_name
2012-08-11 17:11:23 -05:00
Stefan Hajnoczi
058f8f16db block: add BLOCK_O_CHECK for qemu-img check
Image formats with a dirty bit, like qed and qcow2, repair dirty image
files upon open with BDRV_O_RDWR.  Performing automatic repair when
qemu-img check runs is not ideal because the bdrv_open() call repairs
the image before the actual bdrv_check() call from qemu-img.c.

Fix this "double repair" since it leads to confusing output from
qemu-img check.  Tell the block driver that this image is being opened
just for bdrv_check().  This skips automatic repair and qemu-img.c can
invoke it manually with bdrv_check().

Update the golden output for qemu-iotests 039 to reflect the new
qemu-img check output.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-08-10 10:25:12 +02:00
Stefan Hajnoczi
acbe59829e qcow2: mark image clean after repair succeeds
The dirty bit is cleared after image repair succeeds in qcow2_open().
Move this into qcow2_check() so that all callers benefit from this
behavior when fix mode is enabled.

This is necessary so qemu-img check can call .bdrv_check() and mark the
image clean.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-08-10 10:25:12 +02:00
Stefan Hajnoczi
b10170aca0 qed: mark image clean after repair succeeds
The dirty bit is cleared after image repair succeeds in qed_open().
Move this into qed_check() so that all callers benefit from this
behavior when fix=true.

This is necessary so qemu-img check can call .bdrv_check() and mark the
image clean.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-08-10 10:25:12 +02:00
Ronnie Sahlberg
31459f463a iscsi: Pick default initiator-name based on the name of the VM
This patch updates the iscsi layer to automatically pick a 'unique'
initiator-name based on the name of the vm in case the user has not set
an explicit iqn-name to use.

Create a new function qemu_get_vm_name() that returns the name of the VM,
if specified.

This way we can thus create default names to use as the initiator name
based on the guest session.

If the VM is not named via the '-name' command line argument, the iscsi
initiator-name used wiull simply be

    iqn.2008-11.org.linux-kvm

If a name for the VM was specified with the '-name' option, iscsi will
use a default initiatorname of

    iqn.2008-11.org.linux-kvm:<name>

These names are just the default iscsi initiator name that qemu will
generate/use only when the user has not set an explicit initiator name
via the commandlines or config files.

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
2012-08-09 15:04:09 +02:00
Paolo Bonzini
f2ef4a6dd9 iscsi: reorganize code for parse_initiator_name
Merge the occurrences of the "iqn.2008-11.org.linux-kvm" string
to avoid duplication.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-08-08 14:51:59 +02:00
Paolo Bonzini
b93c94f7ec iscsi: do not leak initiator_name
The argument of iscsi_create_context is never freed by libiscsi,
which in fact calls strdup on it.  Avoid a leak.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-08-08 14:51:59 +02:00
Stefan Hajnoczi
bfe8043e92 qcow2: implement lazy refcounts
Lazy refcounts is a performance optimization for qcow2 that postpones
refcount metadata updates and instead marks the image dirty.  In the
case of crash or power failure the image will be left in a dirty state
and repaired next time it is opened.

Reducing metadata I/O is important for cache=writethrough and
cache=directsync because these modes guarantee that data is on disk
after each write (hence we cannot take advantage of caching updates in
RAM).  Refcount metadata is not needed for guest->file block address
translation and therefore does not need to be on-disk at the time of
write completion - this is the motivation behind the lazy refcount
optimization.

The lazy refcount optimization must be enabled at image creation time:

  qemu-img create -f qcow2 -o compat=1.1,lazy_refcounts=on a.qcow2 10G
  qemu-system-x86_64 -drive if=virtio,file=a.qcow2,cache=writethrough

Update qemu-iotests 031 and 036 since the extension header size changes
when we add feature bit table entries.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-08-06 22:39:14 +02:00
Stefan Hajnoczi
c61d0004bc qcow2: introduce dirty bit
This patch adds an incompatible feature bit to mark images that have not
been closed cleanly.  When a dirty image file is opened a consistency
check and repair is performed.

Update qemu-iotests 031 and 036 since the extension header size changes
when we add feature bit table entries.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-08-06 22:39:14 +02:00
Markus Armbruster
4480e0f924 vvfat: Do not clobber the user's geometry
vvfat creates a virtual VFAT filesystem with a certain logical
geometry that depends on its options.  It sets the "geometry hint" to
this geometry.  It is the only block driver to do this.

The geometry hint is about about *physical* geometry, and used only by
certain hard disk device models.

vvfat's hint is normally invisible for device models, because
bdrv_open() puts a raw format on top of vvfat's fat protocol.  That
raw format is where drive_init() puts the user's geometry (if any),
and where the device model gets it from.

Nobody complained, because the default physical geometry is the same
as vvfat's logical geometry:

    opts        LCHS        def. PCHS
                1024,16,63  same
    :32:        1024,16,63  same
    :16:        1024,16,63  same
    :12:          64,16,63  same

Except when you specify :floppy:

    opts        LCHS        def. PCHS
       :floppy:   80, 2,36  5,16,63
    :32:floppy:   80, 2,36  5,16,63
    :16:floppy:   80, 2,36  5,16,63
    :12:floppy:   80, 2,18  2,16,63

Silly thing to do for use with a hard disk.

However, the "raw" format can be suppressed by adding an
redundant-looking "format=vvfat" to "file=fat:FOO".  Then, vvfat's
hint clobbers the user's geometry, i.e. -drive options cyls, heads,
secs get silently ignored.  Don't do that.

No change without format=vvfat.  With it, the user's hard disk
geometry (-drive options cyls, heads, secs) is now obeyed, and the
default hard disk geometry with :floppy: now matches the one without
format=vvfat.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-17 16:48:30 +02:00
Markus Armbruster
f91cbefe2d vvfat: Fix partition table
Unless parameter ":floppy:" is given, vvfat creates a virtual image
with DOS MBR defining a single partition which holds the FAT file
system.  The size of the virtual image depends on the width of the
FAT: 32 MiB (CHS 64, 16, 63) for 12 bit FAT, 504 MiB (CHS 1024, 16,
63) for 16 and 32 bit FAT, leaving (64*16-1)*63 = 64449 and
(1024*16-1)*64 = 1032129 sectors for the partition.

However, it screws up the end of the partition in the MBR:

    FAT width param.  start CHS  end CHS     start LBA  size
        :32:          0,1,1      1023,14,63       63    1032065
        :16:          0,1,1      1023,14,55       63    1032057
        :12:          0,1,1        63,14,55       63      64377

The actual FAT file system nevertheless assumes the partition has
1032129 or 64449 sectors.  Oops.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-17 16:48:30 +02:00
Christoph Hellwig
19db9b9042 sheepdog: do not blindly memset all read buffers
Only buffers that map to unallocated blocks need to be zeroed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-17 16:48:29 +02:00
MORITA Kazutaka
cddd4ac7a2 sheepdog: always use coroutine-based network functions
This reduces some code duplication.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-17 16:48:29 +02:00
Anthony Liguori
23797df3d9 Merge remote-tracking branch 'mjt/mjt-iov2' into staging
* mjt/mjt-iov2:
  rewrite iov_send_recv() and move it to iov.c
  cleanup qemu_co_sendv(), qemu_co_recvv() and friends
  export iov_send_recv() and use it in iov_send() and iov_recv()
  rename qemu_sendv to iov_send, change proto and move declarations to iov.h
  change qemu_iovec_to_buf() to match other to,from_buf functions
  consolidate qemu_iovec_copy() and qemu_iovec_concat() and make them consistent
  allow qemu_iovec_from_buffer() to specify offset from which to start copying
  consolidate qemu_iovec_memset{,_skip}() into single function and use existing iov_memset()
  rewrite iov_* functions
  change iov_* function prototypes to be more appropriate
  virtio-serial-bus: use correct lengths in control_out() message

Conflicts:
	tests/Makefile

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
2012-07-09 12:35:06 -05:00
Anthony Liguori
715cc00ce1 Merge remote-tracking branch 'kwolf/for-anthony' into staging
* kwolf/for-anthony: (24 commits)
  block: Factor bdrv_read_unthrottled() out of guess_disk_lchs()
  qtest: Tidy up temporary files properly
  fdc: Drop broken code for user-defined floppy geometry
  fdc_test: introduce test_sense_interrupt
  fdc_test: update media_change test
  fdc: fix interrupt handling
  fdc: rewrite seek and DSKCHG bit handling
  block: introduce bdrv_swap, implement bdrv_append on top of it
  block: copy over job and dirty bitmap fields in bdrv_append
  raw: hook into blkdebug
  blkdebug: optionally tie errors to a specific sector
  blkdebug: store list of active rules
  blkdebug: pass getlength to underlying file
  blkdebug: tiny cleanup
  blkdebug: remove sync i/o events
  sheepdog: traverse pending_list from the first for each time
  sheepdog: split outstanding list into inflight and pending
  sheepdog: make sure we don't free aiocb before sending all requests
  sheepdog: use coroutine based socket functions in coroutine context
  sheepdog: restart I/O when socket becomes ready in do_co_req()
  ...
2012-07-09 10:29:40 -05:00
Paolo Bonzini
5c171afa4c raw: hook into blkdebug
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:02 +02:00
Paolo Bonzini
e4780db429 blkdebug: optionally tie errors to a specific sector
This makes blkdebug scripts more powerful, and independent of the
exact sequence of operations performed by streaming.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:02 +02:00
Paolo Bonzini
571cd43e57 blkdebug: store list of active rules
This prepares for the next patch, where some active rules may actually
not trigger depending on input to readv/writev.  Store the active rules
in a SIMPLEQ (so that it can be emptied easily with QSIMPLEQ_INIT), and
fetch the errno/once/immediately arguments from there.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:02 +02:00
Paolo Bonzini
e130225587 blkdebug: pass getlength to underlying file
This is required when using blkdebug with raw format.  Unlike qcow2/QED,
raw asks blkdebug for the length of the file, it doesn't get it from
a header.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:02 +02:00
Paolo Bonzini
368e8dd10a blkdebug: tiny cleanup
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:02 +02:00
Paolo Bonzini
820100fd15 blkdebug: remove sync i/o events
These are unused, except (by mistake more or less) in QED.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:02 +02:00
MORITA Kazutaka
7dc1cde05b sheepdog: traverse pending_list from the first for each time
The pending list can be modified in other coroutine context
sd_co_rw_vector, so we need to traverse the list from the first again
after we send the pending request.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:02 +02:00
MORITA Kazutaka
c292ee6a67 sheepdog: split outstanding list into inflight and pending
outstanding_list_head is used for both pending and inflight requests.
This patch splits it and improves readability.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:02 +02:00
MORITA Kazutaka
1d732d7d7c sheepdog: make sure we don't free aiocb before sending all requests
This patch increments the pending counter before sending requests, and
make sures that aiocb is not freed while sending them.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:01 +02:00
MORITA Kazutaka
b97564f4c5 sheepdog: use coroutine based socket functions in coroutine context
This removes blocking network I/Os in coroutine context.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:01 +02:00
MORITA Kazutaka
2dfcca3b68 sheepdog: restart I/O when socket becomes ready in do_co_req()
Currently, no one reenters the yielded coroutine.  This fixes it.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:01 +02:00
MORITA Kazutaka
1b6ac9985a sheepdog: fix dprintf format strings
This fixes warnings about dprintf format in debug mode.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:01 +02:00
Stefan Hajnoczi
206e6d8551 qcow2: preserve free_byte_offset when qcow2_alloc_bytes() fails
When qcow2_alloc_clusters() error handling code was introduced in commit
5d757b563d, the value of free_byte_offset
was clobbered in the error case.  This patch keeps free_byte_offset at 0
so we will try to allocate clusters again next time this function is
called.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:01 +02:00
Stefan Hajnoczi
b35278f754 qcow2: fix #ifdef'd qcow2_check_refcounts() callers
The DEBUG_ALLOC qcow2.h macro enables additional consistency checks
throughout the code.  This makes it easier to spot corruptions that are
introduced during development.  Since consistency check is an expensive
operation the DEBUG_ALLOC macro is used to compile checks out in normal
builds and qcow2_check_refcounts() calls missed the addition of a new
function argument.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-07-09 15:53:01 +02:00
Ronnie Sahlberg
622695a458 ISCSI: force use of sg for SMC and SSC devices
If the device we open is a SMC or SSC device, then force the use of sg. We
dont have any medium changer or tape emulation so only passthrough via
real sg or scsi-generic via iscsi would work anyway.

Forcing sg also makes qemu skip trying to read from the device to guess
the image format by reading from the device (find_image_format()).
SMC devices do not implement READ6/10/12/16 so it is not possible to
read from them (SSC have different CDBs).

With this patch I can successfully manage a SMC device wiht iscsi in
passthrough mode.

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
[Added TYPE_TAPE handling - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-07-02 10:18:41 +02:00
Ronnie Sahlberg
983924532f ISCSI: Add SCSI passthrough via scsi-generic to libiscsi
Update iscsi to allow passthrough of SG_IO scsi commands when the iscsi
device is forced to be scsi-generic.

Implement both bdrv_ioctl() and bdrv_aio_ioctl() in the iscsi backend,
emulate the SG_IO ioctl and pass the SCSI commands across to the
iscsi target.

This allows end-to-end passthrough of SCSI all the way from the guest,
to qemu, via scsi-generic, then libiscsi all the way to the iscsi target.

To activate this you need to specify that the iscsi lun should be treated
as a scsi-generic device.

Example:
    -device lsi -device scsi-generic,drive=MyISCSI \
    -drive file=iscsi://10.1.1.125/iqn.ronnie.test/1,if=none,id=MyISCSI

Note, you can currently not boot a qemu guest from a scsi device.

Note,
This only works when the host is linux, since the emulation relies on
definitions of SG_IO from the scsi-generic implementation in the
linux kernel.
It should be fairly easy to re-implement some structures similar enough
for non-linux hosts to do the same style of passthrough via a fake
scsi generic layer and libiscsi if need be.

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-07-02 10:18:41 +02:00
Kevin Wolf
94282e7146 raw-posix: Fix build without is_allocated support
Move the declaration of s into the #ifdef sections that actually make
use of it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-06-24 01:04:45 +02:00
Stefan Hajnoczi
af7b708db2 qcow2: fix autoclear image header update
The autoclear feature bits can be used for qcow2 file format features
that are safe to "drop" by old programs that do not understand the
feature.  Upon opening the image file unknown autoclear feature bits are
cleared and the image file header is rewritten, but this was happening
too early in the code when critical header fields were not yet loaded.

Process autoclear feature bits after all necessary header information
has been loaded.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:43 +02:00
Kevin Wolf
b7ab0fea37 qcow2: Fix avail_sectors in cluster allocation code
avail_sectors should really be the number of sectors from the start of
the allocation, not from the start of the write request.

We're lucky enough that this mistake didn't cause any real bug.
avail_sectors is only used in the intialiser of QCowL2Meta:

  .nb_available   = MIN(requested_sectors, avail_sectors),

m->nb_available in turn is only used for COW at the end of the
allocation. A COW occurs only if the request wasn't cluster aligned,
which in turn would imply that requested_sectors was less than
avail_sectors (both in the original and in the fixed version). In this
case avail_sectors is ignored and therefore the mistake doesn't cause
any misbehaviour.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:43 +02:00
Kevin Wolf
cdba7fee1d qcow2: Simplify calculation for COW area at the end
copy_sectors() always uses the sum (cluster_offset + n_start) or
(start_sect + n_start), so if some value is added to both cluster_offset
and start_sect, and subtracted from n_start, it's cancelled out anyway.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:43 +02:00
Paolo Bonzini
6af4e9ead4 qcow2: always operate caches in writeback mode
Writethrough does not need special-casing anymore in the qcow2 caches.
The block layer adds flushes after every guest-initiated data write,
and these will also flush the qcow2 caches to the OS.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:43 +02:00
MORITA Kazutaka
e0d93a89b9 sheepdog: add coroutine_fn markers to coroutine functions
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Josh Durgin
b11f38fcdf rbd: hook up cache options
Writeback caching was added in Ceph 0.46, and writethrough will be in
0.47. These are controlled by general config options, so there's no
need to check for librbd version.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Kevin Wolf
166acf546f qcow2: Support for fixing refcount inconsistencies
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Kevin Wolf
ccf34716ee qemu-img check: Print fixed clusters and recheck
When any inconsistencies have been fixed, print the statistics and run
another check to make sure everything is correct now.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Kevin Wolf
4534ff5426 qemu-img check -r for repairing images
The QED block driver already provides the functionality to not only
detect inconsistencies in images, but also fix them. However, this
functionality cannot be manually invoked with qemu-img, but the
check happens only automatically during bdrv_open().

This adds a -r switch to qemu-img check that allows manual invocation
of an image repair.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Paolo Bonzini
6ef228fc0d stream: move rate limiting to a separate header file
Make the code reusable.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Paolo Bonzini
188a7bbf94 stream: move is_allocated_above to block.c
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Paolo Bonzini
f9749f28b7 stream: tweak usage of bdrv_co_is_allocated
is_allocated_base has complex semantics that are not really usable
outside streaming.  Split the check in two parts, where the allocated
state for the top bs is moved to the caller.  The resulting function
is more generally useful.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Paolo Bonzini
5500316ded block: implement is_allocated for raw
Either FIEMAP, or SEEK_DATA+SEEK_HOLE can be used to implement the
is_allocated callback for raw files.  On Linux ext4, btrfs and XFS
all support it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Zhi Yong Wu
87267753a3 qcow2: fix endianness conversion
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Zhi Yong Wu
833e40858c qcow2: remove a line of unnecessary code
Commit 3948d1d4 removed the pointer argument we filled in with l2_offset
but forgot to remove the unnecessary l2_offset assignment.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Kevin Wolf
1417d7e40e qcow2: Silence false warning
Some gcc versions seem not to be able to figure out that the switch
statement covers all possible values and that c is therefore always
initialised. Add a default branch for them.

Reported-by: malc <av1474@comtv.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: malc <av1474@comtv.ru>
2012-06-15 15:52:45 +04:00
Michael Tokarev
2fc8ae1dd7 cleanup qemu_co_sendv(), qemu_co_recvv() and friends
The same as for non-coroutine versions in previous
patches: rename arguments to be more obvious, change
type of arguments from int to size_t where appropriate,
and use common code for send and receive paths (with
one extra argument) since these are exactly the same.
Use common iov_send_recv() directly.

qemu_co_sendv(), qemu_co_recvv(), and qemu_co_recv()
are now trivial #define's merely adding one extra arg.

qemu_co_sendv() and qemu_co_recvv() callers are
converted to different argument order and extra
`iov_cnt' argument.

Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2012-06-11 23:12:11 +04:00
Michael Tokarev
d5e6b1619c change qemu_iovec_to_buf() to match other to,from_buf functions
It now allows specifying offset within qiov to start from and
amount of bytes to copy.  Actual implementation is just a call
to iov_to_buf().

Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2012-06-11 23:12:11 +04:00
Michael Tokarev
1b093c480a consolidate qemu_iovec_copy() and qemu_iovec_concat() and make them consistent
qemu_iovec_concat() is currently a wrapper for
qemu_iovec_copy(), use the former (with extra
"0" arg) in a few places where it is used.

Change skip argument of qemu_iovec_copy() from
uint64_t to size_t, since size of qiov itself
is size_t, so there's no way to skip larger
sizes.  Rename it to soffset, to make it clear
that the offset is applied to src.

Also change the only usage of uint64_t in
hw/9pfs/virtio-9p.c, in v9fs_init_qiov_from_pdu() -
all callers of it actually uses size_t too,
not uint64_t.

One added restriction: as for all other iovec-related
functions, soffset must point inside src.

Order of argumens is already good:
 qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
                   int c, size_t bytes)
vs:
 qemu_iovec_concat(QEMUIOVector *dst,
                   QEMUIOVector *src,
                   size_t soffset, size_t sbytes)
(note soffset is after _src_ not dst, since it applies to src;
for memset it applies to qiov).

Note that in many places where this function is used,
the previous call is qemu_iovec_reset(), which means
many callers actually want copy (replacing dst content),
not concat.  So we may want to add a wrapper like
qemu_iovec_copy() with the same arguments but which
calls qemu_iovec_reset() before _concat().

Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2012-06-11 23:12:11 +04:00
Michael Tokarev
03396148bc allow qemu_iovec_from_buffer() to specify offset from which to start copying
Similar to
 qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
                   int c, size_t bytes);
the new prototype is:
 qemu_iovec_from_buf(QEMUIOVector *qiov, size_t offset,
                     const void *buf, size_t bytes);

The processing starts at offset bytes within qiov.

This way, we may copy a bounce buffer directly to
a middle of qiov.

This is exactly the same function as iov_from_buf() from
iov.c, so use the existing implementation and rename it
to qemu_iovec_from_buf() to be shorter and to match the
utility function.

As with utility implementation, we now assert that the
offset is inside actual iovec.  Nothing changed for
current callers, because `offset' parameter is new.

While at it, stop using "bounce-qiov" in block/qcow2.c
and copy decrypted data directly from cluster_data
instead of recreating a temp qiov for doing that.

Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2012-06-11 23:12:11 +04:00
Michael Tokarev
3d9b49254f consolidate qemu_iovec_memset{,_skip}() into single function and use existing iov_memset()
This patch combines two functions into one, and replaces
the implementation with already existing iov_memset() from
iov.c.

The new prototype of qemu_iovec_memset():
  size_t qemu_iovec_memset(qiov, size_t offset, int fillc, size_t bytes)
It is different from former qemu_iovec_memset_skip(), and
I want to make other functions to be consistent with it
too: first how much to skip, second what, and 3rd how many
of it.  It also returns actual number of bytes filled in,
which may be less than the requested `bytes' if qiov is
smaller than offset+bytes, in the same way iov_memset()
does.

While at it, use utility function iov_memset() from
iov.h in posix-aio-compat.c, where qiov was used.

Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2012-06-11 23:07:44 +04:00
Paolo Bonzini
7456e4ce8d build: move block/ objects to nested Makefile.objs
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-06-07 09:21:13 +02:00
Jim Meyering
c2d76497b6 block: prevent snapshot mode $TMPDIR symlink attack
In snapshot mode, bdrv_open creates an empty temporary file without
checking for mkstemp or close failure, and ignoring the possibility
of a buffer overrun given a surprisingly long $TMPDIR.
Change the get_tmp_filename function to return int (not void),
so that it can inform its two callers of those failures.
Also avoid the risk of buffer overrun and do not ignore mkstemp
or close failure.
Update both callers (in block.c and vvfat.c) to propagate
temp-file-creation failure to their callers.

get_tmp_filename creates and closes an empty file, while its
callers later open that presumed-existing file with O_CREAT.
The problem was that a malicious user could provoke mkstemp failure
and race to create a symlink with the selected temporary file name,
thus causing the qemu process (usually root owned) to open through
the symlink, overwriting an attacker-chosen file.

This addresses CVE-2012-2652.
http://bugzilla.redhat.com/CVE-2012-2652

Signed-off-by: Jim Meyering <meyering@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-30 10:18:20 +02:00
MORITA Kazutaka
6f3c714eb7 sheepdog: fix return value of do_load_save_vm_state
bdrv_save_vmstate and bdrv_load_vmstate should return the vmstate size
on success, and -errno on error.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-30 09:58:39 +02:00
Anthony Liguori
306761537f Merge remote-tracking branch 'kwolf/for-anthony' into staging
* kwolf/for-anthony:
  fdc-test: introduced qtest no_media_on_start and cmos qtest for floppy
  fdc: fix media detection
  fdc: floppy drive should be visible after start without media
  qemu-iotests: mark 035 qcow2-only
  qcow2: Check qcow2_alloc_clusters_at() return value
  sheepdog: use heap instead of stack for BDRVSheepdogState
  sheepdog: return -errno on error
  sheepdog: mark image as snapshot when tag is specified
  qemu-img: Explain how rebase operation can be used to perform a 'diff' operation.
  qcow2: don't leak buffer for unexpected qcow_version in header
2012-05-29 04:30:49 -05:00
Ronnie Sahlberg
f4dfa67f04 ISCSI: Switch to using READ16/WRITE16 for I/O to the LUN
This allows using LUNs bigger than 2TB.  Keep using READ10 for other
device types such as MMC.

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
2012-05-28 14:04:16 +02:00
Ronnie Sahlberg
6bcd1346bb ISCSI: Only call READCAPACITY16 for SBC devices, use READCAPACITY10 for MMC
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
2012-05-28 14:04:15 +02:00
Ronnie Sahlberg
dbfff6d776 ISCSI: get device type at connection time
This is needed to avoid READ CAPACITY(16) for MMC devices.

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-05-28 14:04:14 +02:00
Paolo Bonzini
c7b4a95202 ISCSI: change num_blocks to 64-bit
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-05-28 14:04:14 +02:00
Ronnie Sahlberg
c9b9f6824f ISCSI: redo how we set up the events
Call qemu_notify_event() after updating events.  Otherwise, If we add
an event for -is-writeable but the socket is already writeable there
may be a delay before the event callback is actually triggered.

Those delays would in particular hurt performance during BIOS boot and
when the GRUB bootloader reads the kernel and initrd.

But first call out to the socket write functions directly, and only set up
the write event if the socket is full.  This will happen very rarely and
this improves performance.

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
2012-05-28 14:04:06 +02:00
Kevin Wolf
df02179189 qcow2: Check qcow2_alloc_clusters_at() return value
When using qcow2_alloc_clusters_at(), the cluster allocation code
checked the wrong variable for an error code.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-25 18:12:54 +02:00
MORITA Kazutaka
b6fc8245e9 sheepdog: use heap instead of stack for BDRVSheepdogState
bdrv_create() is called in coroutine context now, so we cannot use
more stack than 1 MB in the function if we use ucontext coroutine.
This patch allocates BDRVSheepdogState, whose size is 4 MB, on the
heap in sd_create().

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-25 18:12:54 +02:00
MORITA Kazutaka
cb595887cc sheepdog: return -errno on error
On error, BlockDriver APIs should return -errno instead of -1.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-25 18:12:54 +02:00
MORITA Kazutaka
622b6057be sheepdog: mark image as snapshot when tag is specified
When a snapshot tag is specified in the filename, the opened image is
a snapshot.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-25 18:12:54 +02:00
Jim Meyering
b6c147622d qcow2: don't leak buffer for unexpected qcow_version in header
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-25 18:12:54 +02:00
Kevin Wolf
c44bfe4637 qcow2: Don't ignore failure to clear autoclear flags
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-14 17:02:19 +02:00
Anthony Liguori
04120e3bb0 block: fix warning introduced in efcc7a23
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
2012-05-10 09:10:42 -05:00
Paolo Bonzini
efcc7a2324 stream: do not copy unallocated sectors from the base
Unallocated sectors should really never be accessed by the guest,
so there's no need to copy them during the streaming process.
If they are read by the guest during streaming, guest-initiated
copy-on-read will copy them (we're in the base == NULL case, which
enables copy on read).  If they are read after we disconnect the
image from the base, they will read as zeroes anyway.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 11:01:59 +02:00
Paolo Bonzini
b21d677ee9 stream: fix ratelimiting corner case
This fixes inability to make progress in streaming if the quota is set
to less than the amount of data that an I/O operation has to write.

In this case, limit->dispatched + n will always be above the quota and,
due to the "goto retry" to recheck cancellation and allocation, streaming
will livelock.

This can be reproduced with "block_job_set_speed ide0-hd0 1b".  Of course,
with this patch the requested limit will not be obeyed.  That could be
done with another patch that caps is_allocated's n argument by the slice
quota.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 11:01:59 +02:00
Paolo Bonzini
f6133def92 stream: pass new base image format to bdrv_change_backing_file
When an image is modified to point to the new backing file, the backing
file format is set to NULL, which means auto-probe.  This is wrong, in
fact it is a small security problem.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 11:01:59 +02:00
Paolo Bonzini
fa4478d5c8 block: wait for job callback in block_job_cancel_sync
The limitation on not having I/O after cancellation cannot really be
kept.  Even streaming has a very small race window where you could
cancel a job and have it report completion.  If this window is hit,
bdrv_change_backing_file() will yield and possibly cause accesses to
dangling pointers etc.

So, let's just assume that we cannot know exactly what will happen
after the coroutine has set busy to false.  We can set a very lax
condition:

- if we cancel the job, the coroutine won't set it to false again
(and hence will not call co_sleep_ns again).

- block_job_cancel_sync will wait for the coroutine to exit, which
pretty much ensures no race.

Instead, we track the coroutine that executes the job and put very
strict conditions on what to do while it is quiescent (busy = false).
First of all, the coroutine must never set busy = false while the job
has been cancelled.  Second, the coroutine can be reentered arbitrarily
while it is quiescent, so you cannot really do anything but co_sleep_ns at
that time.  This condition is obeyed by the block_job_sleep_ns function.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 10:32:12 +02:00
Paolo Bonzini
4513eafe92 block: add block_job_sleep_ns
This function abstracts the pretty complex semantics of the "busy"
member of BlockJob.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 10:32:12 +02:00
Paolo Bonzini
e023b2e244 block: fix snapshot on QED
QED's opaque data includes a pointer back to the BlockDriverState.
This breaks when bdrv_append shuffles data between bs_new and bs_top.
To avoid this, add a "rebind" function that tells the driver about
the new relationship between the BlockDriverState and its opaque.

The patch also adds rebind to VVFAT for completeness, even though
it is not used with live snapshots.

Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 10:32:12 +02:00
Paolo Bonzini
469ef350e1 block: update in-memory backing file and format
These are needed to print "info block" output correctly.  QCOW2 does this
because it needs it to write the header, but QED does not, and common code
is the right place to do it.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 10:32:11 +02:00
Paolo Bonzini
5f3777945d block: push bdrv_change_backing_file error checking up from drivers
This check applies to all drivers, but QED lacks it.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 10:32:11 +02:00
Anthony Liguori
7c652c1eaf Merge remote-tracking branch 'kwolf/for-anthony' into staging
* kwolf/for-anthony:
  fdc: simplify media change handling
  qcow2: lock on prealloc
  block: make bdrv_create adopt coroutine
  qcow2: Limit COW to where it's needed
  sheepdog: switch to writethrough mode if cluster doesn't support flush
2012-05-08 09:38:41 -05:00
Zhi Yong Wu
15552c4ad3 qcow2: lock on prealloc
preallocate() will be locked. This is required because
qcow2_alloc_cluster_link_l2() assumes that it runs under a lock that it
can drop while COW is being performed.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-07 19:33:18 +02:00
Kevin Wolf
54e6814360 qcow2: Limit COW to where it's needed
This fixes a regression introduced in commit 250196f1. The bug leads to
data corruption, found during an Autotest run with a Fedora 8 guest.

Consider a write request whose first part is covered by an already
allocated cluster, but additional clusters need to be newly allocated.
When counting the number of clusters to allocate, the qcow2 code would
decide to do COW for all remaining clusters of the write request, even
if some of them are already allocated.

If during this COW operation another write request is issued that touches
the same cluster, it will still refer to the old cluster. When the COW
completes, the first request will update the L2 table and the second
write request will be lost. Note that the requests need not overlap, it's
enough for them to touch the same cluster.

This patch ensures that only clusters that really require COW are
considered for allocation. In this case any other request writing to the
same cluster will be an allocating write and gets serialised.

Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-07 19:33:18 +02:00
MORITA Kazutaka
115c2b5a68 sheepdog: switch to writethrough mode if cluster doesn't support flush
This is necessary for qemu to work with the older version of Sheepdog
which doesn't support SD_OP_FLUSH_VDI.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-07 19:33:18 +02:00
Ronnie Sahlberg
fa6acb0c2f ISCSI: Add support for thin-provisioning via discard/UNMAP and bigger LUNs
Update the configure test for libiscsi support to detect version 1.3
or later.  Version 1.3 of libiscsi provides both READCAPACITY16 as well
as UNMAP commands.

Update the iscsi block layer to use READCAPACITY16 to detect the size of
the LUN instead of READCAPACITY10. This allows support for LUNs larger
than 2TB.

Update to implement bdrv_aio_discard() using the UNMAP command.
This allows us to use thin-provisioned LUNs from TGTD and other iSCSI
targets that support thin-provisioning.

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
[squashed in subsequent patch from Ronnie to fix off-by-one in LBA count]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-05-04 10:39:18 +02:00
Josh Durgin
787f31330e rbd: add discard support
Change the write flag to an operation type in RBDAIOCB, and make the
buffer optional since discard doesn't use it.

Discard is first included in librbd 0.1.2 (which is in Ceph 0.46).
If librbd is too old, leave out qemu_rbd_aio_discard entirely,
so the old behavior is preserved.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-02 18:41:42 +02:00
Zhi Yong Wu
647cc47223 qcow2: fix the return value -ENOENT -> -EEXIST
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-02 18:39:39 +02:00
Kevin Wolf
7242411460 qcow2: Don't hold cache references across yield
If cache references are held while the coroutine has yielded, the cache
may get used up and abort() when it can't find a free entry.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-02 18:39:39 +02:00
Kevin Wolf
60651f901a qcow2: Remove unused parameter in do_alloc_cluster_offset
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-02 18:39:39 +02:00
Stefan Weil
b9531b6eed block/qcow2: Add missing GCC_FMT_ATTR to function report_unsupported()
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-02 18:39:39 +02:00
Pavel Borzenkov
83affaa622 raw-posix: Do not use CONFIG_COCOA macro
Use __APPLE__ and __MACH__ macros instead of CONFIG_COCOA to detect Mac
OS X host. The patch is based on Ben Leslie's patch:
http://patchwork.ozlabs.org/patch/97859/

Signed-off-by: Ben Leslie <benno@benno.id.au>
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Andreas Färber <andreas.faerber@web.de>
2012-05-01 00:16:58 +02:00
Anthony Liguori
a8b69b8e24 Merge remote-tracking branch 'qmp/queue/qmp' into staging
* qmp/queue/qmp:
  qapi: fix qmp_balloon() conversion
  qemu-iotests: add block-stream speed value test case
  block: add 'speed' optional parameter to block-stream
  block: change block-job-set-speed argument from 'value' to 'speed'
  block: use Error mechanism instead of -errno for block_job_set_speed()
  block: use Error mechanism instead of -errno for block_job_create()
2012-04-27 12:00:06 -05:00
Stefan Hajnoczi
c83c66c3b5 block: add 'speed' optional parameter to block-stream
Allow streaming operations to be started with an initial speed limit.
This eliminates the window of time between starting streaming and
issuing block-job-set-speed.  Users should use the new optional 'speed'
parameter instead so that speed limits are in effect immediately when
the job starts.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
2012-04-27 11:44:50 -03:00
Stefan Hajnoczi
882ec7ce53 block: change block-job-set-speed argument from 'value' to 'speed'
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
2012-04-27 11:44:50 -03:00
Stefan Hajnoczi
9e6636c72d block: use Error mechanism instead of -errno for block_job_set_speed()
There are at least two different errors that can occur in
block_job_set_speed(): the job might not support setting speeds or the
value might be invalid.

Use the Error mechanism to report the error where it occurs.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
2012-04-27 11:44:50 -03:00
Stefan Hajnoczi
fd7f8c6537 block: use Error mechanism instead of -errno for block_job_create()
The block job API uses -errno return values internally and we convert
these to Error in the QMP functions.  This is ugly because the Error
should be created at the point where we still have all the relevant
information.  More importantly, it is hard to add new error cases to
this case since we quickly run out of -errno values without losing
information.

Go ahead and use Error directly and don't convert later.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
2012-04-27 11:44:50 -03:00
Kevin Wolf
b3adf53a3a nbd: Fix uninitialised use of s->sock
s->sock is assigned only afterwards, so we're really registering an
aio_fd_handler for file descriptor 0 here. Not exactly what we intended.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-04-26 17:54:22 +02:00
Anthony Liguori
1f8bcac09a Merge remote-tracking branch 'kwolf/for-anthony' into staging
* kwolf/for-anthony: (38 commits)
  qemu-iotests: Fix test 031 for qcow2 v3 support
  qemu-iotests: Add -o and make v3 the default for qcow2
  qcow2: Zero write support
  qemu-iotests: Test backing file COW with zero clusters
  qemu-iotests: add a simple test for write_zeroes
  qcow2: Support for feature table header extension
  qcow2: Support reading zero clusters
  qcow2: Version 3 images
  qcow2: Ignore reserved bits in check_refcounts
  qcow2: Ignore reserved bits in refcount table entries
  qcow2: Simplify count_cow_clusters
  qcow2: Refactor qcow2_free_any_clusters
  qcow2: Ignore reserved bits in L1/L2 entries
  qcow2: Fail write_compressed when overwriting data
  qcow2: Ignore reserved bits in count_contiguous_clusters()
  qcow2: Ignore reserved bits in get_cluster_offset
  qcow2: Save disk size in snapshot header
  Specification for qcow2 version 3
  qcow2: Fix refcount block allocation during qcow2_alloc_cluster_at()
  iotests: Resolve test failures caused by hostname
  ...
2012-04-23 14:27:04 -05:00
Kevin Wolf
621f058940 qcow2: Zero write support
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:30 +02:00
Kevin Wolf
cfcc4c62ff qcow2: Support for feature table header extension
Instead of printing an ugly bitmask, qemu can now print a more helpful
string even for yet unknown features.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:29 +02:00
Kevin Wolf
6377af48b0 qcow2: Support reading zero clusters
This adds support for reading zero clusters in version 3 images.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:29 +02:00
Kevin Wolf
6744cbab8c qcow2: Version 3 images
This adds the basic infrastructure to qcow2 to handle version 3 images.
It includes code to create v3 images, allow header updates for v3 images
and checks feature bits.

It still misses support for zero clusters, so this is not a fully
compliant implementation of v3 yet.

The default for creating new images stays at v2 for now.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:29 +02:00
Kevin Wolf
afdf0abe77 qcow2: Ignore reserved bits in check_refcounts
Also don't infer the cluster type directly from the L2 entries, but use
qcow2_get_cluster_type() to keep everything in a single place.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:29 +02:00
Kevin Wolf
76dc9e0c8f qcow2: Ignore reserved bits in refcount table entries
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:29 +02:00
Kevin Wolf
143550a83e qcow2: Simplify count_cow_clusters
count_cow_clusters() tries to reuse existing functions, and all it
achieves is to make things much more complicated than they really are:
Everything needs COW, unless it's a normal cluster with refcount 1.

This patch implements the obvious way of doing this, and by using
qcow2_get_cluster_type() it gets rid of all flag magic.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:28 +02:00
Kevin Wolf
c7a4c37a0f qcow2: Refactor qcow2_free_any_clusters
Zero clusters will add another cluster type. Refactor the open-coded
cluster type detection into a switch of QCOW2_CLUSTER_* options so that
the detection is in a single place. This makes it easier to add new
cluster types.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:28 +02:00
Kevin Wolf
8e37f681d5 qcow2: Ignore reserved bits in L1/L2 entries
This changes the still existing places that assume that the only flags
are QCOW_OFLAG_COPIED and QCOW_OFLAG_COMPRESSED to properly mask out
reserved bits.

It does not convert bdrv_check yet.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:28 +02:00
Kevin Wolf
b0b6862e5e qcow2: Fail write_compressed when overwriting data
qcow2_alloc_compressed_cluster_offset() already fails if the copied flag
is set, because qcow2_write_compressed() doesn't perform COW as it would
have to do to allow this.

However, what we really want to check here is whether the cluster is
allocated or not. With internal snapshots the copied flag may not be set
on allocated clusters. Check the cluster offset instead.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:27 +02:00
Kevin Wolf
2bfcc4a0a0 qcow2: Ignore reserved bits in count_contiguous_clusters()
Until now, count_contiguous_clusters() has an argument that allowed to
specify flags that should be ignored in the comparison, i.e. that are
allowed to change between contiguous clusters.

This patch changes the function so that it ignores all flags by default
now and you need to pass the flags on which it should stop.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:27 +02:00
Kevin Wolf
68d000a390 qcow2: Ignore reserved bits in get_cluster_offset
With this change, reading from a qcow2 image ignores all reserved bits
that are set in an L1 or L2 table entry.

Now get_cluster_offset() assigns *cluster_offset only the offset without
any other flags. The cluster type is not longer encoded in the offset,
but a positive return value in case of success.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:27 +02:00
Kevin Wolf
90b277593d qcow2: Save disk size in snapshot header
This allows that different snapshots of an image can have different
sizes, which is a requirement for enabling image resizing even with
images that have internal snapshots.

We don't do the actual support for it now, but make sure that the
additional field is present and not completely ignored in all version 3
images. When trying to load a snapshot of different size, it returns
an error.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:57:27 +02:00
Kevin Wolf
f24423bd90 qcow2: Fix refcount block allocation during qcow2_alloc_cluster_at()
Refcount block allocation and refcount table growth rely on
s->free_cluster_index pointing to somewhere after the current
allocation. Change qcow2_alloc_cluster_at() to fulfill this
assumption.

Without this change it could happen that a newly allocated refcount
block and the allocated data block point to the same area in the image
file, causing data corruption in the long run.

This fixes a bug that became first visible after commit 250196f1.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-20 15:56:19 +02:00
Paolo Bonzini
bafbd6a1c6 aio: remove process_queue callback and qemu_aio_process_queue
Both unused after the previous patch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-19 16:37:53 +02:00
Paolo Bonzini
7fe7b68b32 nbd: do not block in nbd_wr_sync if no data at all is available
Right now, nbd_wr_sync will hang if no data at all is available on the
socket and the other side is not going to provide any.  Relax this by
making it loop only for writes or partial reads.  This fixes a race
where one thread is executing qemu_aio_wait() and another is executing
main_loop_wait().  Then, the select() call in main_loop_wait() can return
stale data and call the "readable" callback with no data in the socket.

Reported-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-04-19 16:36:43 +02:00
Paolo Bonzini
185b43386a nbd: consistently return negative errno values
In the next patch we need to look at the return code of nbd_wr_sync.
To avoid percolating the socket_error() ugliness all around, let's
handle errors by returning negative errno values.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-04-19 16:36:43 +02:00
Paolo Bonzini
fc19f8a02e nbd: consistently check for <0 or >=0
This prepares for the following patch, which changes -1 return values
to negative errno.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-04-19 16:36:43 +02:00
Paolo Bonzini
dd3e8ac413 nbd: avoid out of bounds access to recv_coroutine array
This can happen with a buggy or malicious server.

Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-04-19 16:36:42 +02:00
Kevin Wolf
2795ecf681 qcow2: Fix return value of alloc_refcount_block
Someone forgot something in commit 29c1a730... Documenting the right
return value is not enough, you also need to actually return it in the
code.

This bug sometimes causes error return values even when everything has
succeeded: The new offset of the refcount block is truncated to 32 bits
and interpreted as signed. At least with small cluster sizes it's easy
to get a negative return value this way.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
2012-04-19 16:03:27 +02:00
Kevin Wolf
8dc0a5e7a0 qcow2: Fix error handling in qcow2_alloc_cluster_offset
If do_alloc_cluster_offset() fails, the error handling code tried to
remove the request from the in-flight queue, to which it wasn't added
yet, resulting in a NULL pointer dereference.

m->nb_clusters really only becomes != 0 when the request is in the list.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-19 16:03:27 +02:00
Stefan Weil
4e35b92a51 block: Fix spelling in comment (ineffcient -> inefficient)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-19 15:48:52 +02:00
Anthony Liguori
bb5d8dd757 Merge remote-tracking branch 'kwolf/for-anthony' into staging
* kwolf/for-anthony: (46 commits)
  qed: remove incoming live migration blocker
  qed: honor BDRV_O_INCOMING for incoming live migration
  migration: clear BDRV_O_INCOMING flags on end of incoming live migration
  qed: add bdrv_invalidate_cache to be called after incoming live migration
  blockdev: open images with BDRV_O_INCOMING on incoming live migration
  block: add a function to clear incoming live migration flags
  block: Add new BDRV_O_INCOMING flag to notice incoming live migration
  block stream: close unused files and update ->backing_hd
  qemu-iotests: Fix call syntax for qemu-io
  qemu-iotests: Fix call syntax for qemu-img
  qemu-iotests: Test unknown qcow2 header extensions
  qemu-iotests: qcow2.py
  sheepdog: fix send req helpers
  sheepdog: implement SD_OP_FLUSH_VDI operation
  block: bdrv_append() fixes
  qed: track dirty flag status
  qemu-img: add dirty flag status
  qed: image fragmentation statistics
  qemu-img: add image fragmentation statistics
  block: document job API
  ...
2012-04-10 08:16:12 -05:00
Benoît Canet
50d30c2675 qed: remove incoming live migration blocker
Signed-off-by: Benoit Canet <benoit.canet@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 16:29:12 +02:00
Benoît Canet
2d1f3c2360 qed: honor BDRV_O_INCOMING for incoming live migration
From original commit with Patchwork-id: 31108 by
Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

"The QED image format includes a file header bit to mark images dirty.
QED normally checks dirty images on open and fixes inconsistent
metadata.  This is undesirable during live migration since the dirty bit
may be set if the source host is modifying the image file.  The check
should be postponed until migration completes.

Skip operations that modify the image file if the BDRV_O_INCOMING flag
is set."

Signed-off-by: Benoit Canet <benoit.canet@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 16:29:04 +02:00
Benoît Canet
c82954e529 qed: add bdrv_invalidate_cache to be called after incoming live migration
The QED image is reopened to flush metadata and check consistency.

Signed-off-by: Benoit Canet <benoit.canet@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 16:28:27 +02:00
Marcelo Tosatti
5a67a1048e block stream: close unused files and update ->backing_hd
Close the now unused images that were part of the previous backing file
chain and adjust ->backing_hd, backing_filename and backing_format
properly.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=801449

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 15:11:37 +02:00
Liu Yuan
eb09218077 sheepdog: fix send req helpers
We should return if reading of the header fails.

Cc: Kevin Wolf <kwolf@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Acked-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:41 +02:00
Liu Yuan
47622c44d0 sheepdog: implement SD_OP_FLUSH_VDI operation
Flush operation is supposed to flush the write-back cache of
sheepdog cluster.

By issuing flush operation, we can assure the Guest of data
reaching the sheepdog cluster storage.

Cc: Kevin Wolf <kwolf@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:41 +02:00
Dong Xu Wang
d68dbee80e qed: track dirty flag status
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:41 +02:00
Dong Xu Wang
11c9c615c8 qed: image fragmentation statistics
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:40 +02:00
Paolo Bonzini
9f25eccc1c block: set job->speed in block_set_speed
There is no need to do this in every implementation of set_speed
(even though there is only one right now).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:40 +02:00
Paolo Bonzini
3e914655f2 block: fix streaming/closing race
Streaming can issue I/O while qcow2_close is running.  This causes the
L2 caches to become very confused or, alternatively, could cause a
segfault when the streaming coroutine is reentered after closing its
block device.  The fix is to cancel streaming jobs when closing their
underlying device.

The cancellation must be synchronous, on the other hand qemu_aio_wait
will not restart a coroutine that is sleeping in co_sleep.  So add
a flag saying whether streaming has in-flight I/O.  If the busy flag
is false, the coroutine is quiescent and, when cancelled, will not
issue any new I/O.

This protects streaming against closing, but not against deleting.
We have a reference count protecting us against concurrent deletion,
but I still added an assertion to ensure nothing bad happens.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:40 +02:00
Paolo Bonzini
eb9566d13e vdi: change goto to loop
Finally reindent all code and change goto statements to a loop.

Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:40 +02:00
Paolo Bonzini
4eea78e634 vdi: do not create useless iovecs
Reads and writes to the underlying file can also occur with the simple
non-vectored I/O interfaces.

Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:40 +02:00
Paolo Bonzini
a7a43aa199 vdi: leave bounce buffering to block layer
vdi.c really works as if it implemented bdrv_read and bdrv_write.  However,
because only vector I/O is supported by the asynchronous callbacks, it
went through extra pain to bounce-buffer the I/O.  This can be handled
by the block layer now that the format is coroutine-based.

Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:40 +02:00
Paolo Bonzini
bfc45fc183 vdi: move aiocb fields to locals
Most of the AIOCB really holds local variables that need to persist
across callback invocation.  It can go away now.

Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:40 +02:00
Paolo Bonzini
4de659e8eb vdi: merge aio_read_cb and aio_write_cb into callers
Now inline the former AIO callbacks into vdi_co_readv and vdi_co_writev.
While many cleanups are possible, the code now really looks synchronous.

Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:40 +02:00
Paolo Bonzini
0c7bfc321b vdi: move end-of-I/O handling at the end
The next step is to take code that only triggers after the first operation,
and move it at the end of vdi_aio_read_cb and vdi_aio_write_cb.

Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:40 +02:00
Paolo Bonzini
3d46a75aa5 vdi: basic conversion to coroutines
Even a basic conversion changing the bdrv_aio_readv/bdrv_aio_writev calls
to bdrv_co_readv/bdrv_co_writev, and callbacks to goto statements can
eliminate a lot of code.  This is because error handling is simplified
and indirections through bottom halves can go away.

After this patch, I/O to the underlying file already happens via
coroutines, but the code still looks a lot like if asynchronous I/O was
being used.

Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:40 +02:00
Zhang Shengju
c088b69136 block/vpc: write checksum back to footer after check
After validation check, the 'checksum' is not written back
to footer, which leave it with zero.

This results in errors while loadding it under Microsoft's
Hyper-V environment, and also errors from utilities like
Citrix's vhd-util.

Signed-off-by: Zhang Shengju <sean_zhang@trendmicro.com.cn>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:40 +02:00
Paolo Bonzini
29cdb2513c block: push recursive flushing up from drivers
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-04-05 14:54:39 +02:00
Kevin Wolf
3948d1d487 qcow2: Remove unused parameter in get_cluster_table()
Since everything goes through the cache, callers don't use the L2 table
offset any more.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
2012-04-05 14:54:39 +02:00
Stefan Weil
fb7c8e8a2d block/curl: Replace usleep by g_usleep
The function usleep is not available for all supported platforms:
at least some versions of MinGW don't support it.

usleep was also declared obsolete by POSIX.1-2001.

The function g_usleep is part of glib2.0, so it is available for
all supported platforms.

Using nanosleep would also be possible but needs more code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
2012-04-03 09:34:34 +01:00
Kevin Wolf
250196f19c qcow2: Reduce number of I/O requests
If the first part of a write request is allocated, but the second isn't
and it can be allocated so that the resulting area is contiguous, handle
it at once. This is a common case for sequential writes.

After this patch, alloc_cluster_offset() only checks if the clusters are
already allocated or how many new clusters can be allocated contigouosly.
The actual cluster allocation is split off into a new function
do_alloc_cluster_offset().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
2012-03-12 15:14:07 +01:00
Kevin Wolf
256900b16b qcow2: Add qcow2_alloc_clusters_at()
This function allows to allocate clusters at a given offset in the image
file. This is useful if you want to allocate the second part of an area
that must be contiguous.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
2012-03-12 15:14:07 +01:00
Kevin Wolf
bf319ece56 qcow2: Factor out count_cow_clusters
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
2012-03-12 15:14:07 +01:00
Kevin Wolf
259b217310 qcow2: Add error messages in qcow2_truncate
qemu-img resize has some limitations with qcow2, but the user is only
told that "this image format does not support resize". Quite confusing,
so add some more detailed error_report() calls and change "this image
format" into "this image".

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
2012-03-12 15:14:06 +01:00
Kevin Wolf
3cce16f44d qcow2: Add some tracing
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
2012-03-12 15:14:06 +01:00
Stefan Hajnoczi
14fe292d86 qed: do not evict in-use L2 table cache entries
The L2 table cache reduces QED metadata reads that would be required
when translating LBAs to offsets into the image file.  Since requests
execute in parallel it is possible to share an L2 table between multiple
requests.

There is a potential data corruption issue when an in-use L2 table is
evicted from the cache because the following situation occurs:

  1. An allocating write performs an update to L2 table "A".

  2. Another request needs L2 table "B" and causes table "A" to be
     evicted.

  3. A new read request needs L2 table "A" but it is not cached.

As a result the L2 update from #1 can overlap with the L2 fetch from #3.
We must avoid doing overlapping I/O requests here since the worst case
outcome is that the L2 fetch completes before the L2 update and yields
stale data.  In that case we would effectively discard the L2 update and
lose data clusters!

Thanks to Benoît Canet <benoit.canet@gmail.com> for extensive testing
and debugging which lead to discovery of this bug.

Reported-by: Benoît Canet <benoit.canet@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Tested-by: Benoît Canet <benoit.canet@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-03-12 15:14:06 +01:00
Stefan Weil
75d1234103 block/vmdk: Fix warning from splint (comparision of unsigned value)
l1_entry_sectors will never be less than 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
2012-03-07 13:03:51 +00:00