These files deal with the file protocol, not the raw format (the
file protocol is often used with other formats, and the raw
format is not forced to use the file protocol). Rename things
to make it a bit easier to follow.
Suggested-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Given that we have raw-win32.c and raw-posix.c, my initial guess at
raw_bsd.c was that it was for dealing with raw files using code
specific to the BSD operating system (beyond what raw-posix could
do). Not so - this name was chosen back in commit e1c66c6 to
distinguish that it was a BSD licensed file, in contrast to the
then-existing raw.c with an unclear and potentially unusable
license. But since it has been more than three years since the
rewrite, it's time to pick a more useful name for this file to
avoid this type of confusion to future contributors that don't know
the backstory, as none of our other files are named solely by the
license they use.
In reality, this file deals with the raw format, which is useful
with any number of protocols, while raw-{win32,posix} deal with
the file protocol (and in turn, that protocol is not limited to
use with the raw format). So rename raw_bsd to raw-format.c. We
could have also used the shorter name raw.c, except that collides
with the earlier use of that filename for a different license,
and it's better to be safe than risk license pollution.
The next patch will also rename raw-win32.c and raw-posix.c to
further distinguish the difference in roles.
It doesn't hurt that this gets rid of an underscore in the filename,
thereby making tab-completion on 'ra<TAB>' easier (now I don't have
to type the shift key, which slows things down :)
Suggested-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
dmg.o was moved to block-obj-m in 5505e8b76 to become a separate module,
so that its reference to libbz2, since 6b383c08c, doesn't add an extra
library to the main executable.
Until recently, commit 06e60f70a (blockdev: Add dynamic module loading
for block drivers) moved it back to block-obj-y to simplify the design
of dynamic loading of block modules. But we don't want to lose the
feature of less library dependency on the main executable.
The solution here is to move only the bz2 related code to a separate
DSO file, and load it when dmg_open is called.
dmg_probe doesn't depend on bz2 support to work, and is the only code in
this file which can run before dmg_open.
While we are at it, fix the unhelpful cast of last argument passed to
dmg_uncompress_bz2.
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1473043845-13197-4-git-send-email-famz@redhat.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This removes our dependency to libuuid, so that the driver can always be
built.
Similar to how we handled data plane configure options, --enable-vhdx
and --disable-vhdx are also changed to a nop with a message saying it's
obsolete.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-Id: <1474432046-325-4-git-send-email-famz@redhat.com>
Modularizes the nfs block driver so that it gets dynamically loaded.
Signed-off-by: Colin Lord <clord@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1471008424-16465-5-git-send-email-clord@redhat.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Extend the current module interface to allow for block drivers to be
loaded dynamically on request. The only block drivers that can be
converted into modules are the drivers that don't perform any init
operation except for registering themselves.
In addition, only the protocol drivers are being modularized, as they
are the only ones which see significant performance benefits. The format
drivers do not generally link to external libraries, so modularizing
them is of no benefit from a performance perspective.
All the necessary module information is located in a new structure found
in module_block.h
This spoils the purpose of 5505e8b76f (block/dmg: make it modular).
Before this patch, if module build is enabled, block-dmg.so is linked to
libbz2, whereas the main binary is not. In downstream, theoretically, it
means only the qemu-block-extra package depends on libbz2, while the
main QEMU package needn't to. With this patch, we (temporarily) change
the case so that the main QEMU depends on libbz2 again.
Signed-off-by: Marc Marí <markmb@redhat.com>
Signed-off-by: Colin Lord <clord@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1471008424-16465-4-git-send-email-clord@redhat.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
[mreitz: Do a signed comparison against the length of
block_driver_modules[], so it will not cause a compile error when
empty]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Some programs that add a dependency on it will use
the block layer directly.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.fnst@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1469602913-20979-5-git-send-email-xiecl.fnst@cn.fujitsu.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
No code changes, just moved from one file to another.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch introduces block driver that implement recording
and replaying of block devices' operations.
All block completion operations are added to the queue.
Queue is flushed at checkpoints and information about processed requests
is recorded to the log. In replay phase the queue is matched with
events read from the log. Therefore block devices requests are processed
deterministically.
Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
[ kwolf: Rebased onto modified and already applied part of the series ]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a block driver that is capable of supporting any full disk
encryption format. This utilizes the previously added block
encryption code, and at this time supports the LUKS format.
The driver code is capable of supporting any format supported
by the QCryptoBlock module, so it registers one block driver
for each format. This patch only registers the "luks" driver
since the "qcow" driver is there only for back-compatibility
with existing qcow built-in encryption.
New LUKS compatible volumes can be formatted using qemu-img
with defaults for all settings.
$ qemu-img create --object secret,data=123456,id=sec0 \
-f luks -o key-secret=sec0 demo.luks 10G
Alternatively the cryptographic settings can be explicitly
set
$ qemu-img create --object secret,data=123456,id=sec0 \
-f luks -o key-secret=sec0,cipher-alg=aes-256,\
cipher-mode=cbc,ivgen-alg=plain64,hash-alg=sha256 \
demo.luks 10G
And query its size
$ qemu-img info demo.img
image: demo.img
file format: luks
virtual size: 10G (10737418240 bytes)
disk size: 132K
encrypted: yes
Note that it was not necessary to provide the password
when querying info for the volume. The password is only
required when performing I/O on the volume
All volumes created by this new 'luks' driver should be
capable of being opened by the kernel dm-crypt driver.
The only algorithms listed in the LUKS spec that are
not currently supported by this impl are sha512 and
ripemd160 hashes and cast6 cipher.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
[ kwolf - Added #include to resolve conflict with da34e65c ]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The only code change is making bdrv_dirty_bitmap_truncate public. It is
used in block.c.
Also two long lines (bdrv_get_dirty) are wrapped.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1457412306-18940-5-git-send-email-famz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Get rid of direct use of gnutls APIs in quorum blockdrv in
favour of using the crypto APIs. This avoids the need to
do conditional compilation of the quorum driver. It can
simply report an error at file open file instead if the
required hash algorithm isn't supported by QEMU.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Message-Id: <1435770638-25715-8-git-send-email-berrange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The block.c file has grown to over 6000 lines. It is time to split this
file so there are fewer conflicts and the code is easier to maintain.
Extract I/O request processing code:
* Read
* Write
* Zero writes and making the image empty
* Flush
* Discard
* ioctl
* Tracked requests and queuing
* Throttling and copy-on-read
* Block status and allocated functions
* Refreshing block limits
* Reading/writing vmstate
* qemu_blockalign() and friends
The patch simply moves code from block.c into block/io.c.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
dmg can optionally utilize libbz2, make it modular
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds support for bzip2-compressed block entries as introduced
with OS X 10.4 (source: https://en.wikipedia.org/wiki/Apple_Disk_Image).
It was tested against a 5.2G "OS X Yosemite" installation image which
stores the BLXX block in the XML property list (instead of resource
forks) and has over 5k chunks.
New configure entries are added (--enable-bzip2 / --disable-bzip2) to
control inclusion of bzip2 functionality (which requires linking against
libbz2). The help message suggests that this option is needed for DMG
files, but the tests are generic enough that other parts of QEMU can use
bzip2 if needed.
The identifiers are based on http://newosxbook.com/DMG.html.
The decompression routines are based on the zlib case, but as there is
no way to reset the decompression state (unlike zlib), memory is
allocated and deallocated for every decompression. This should not be
problematic as the decompression takes most of the time and as blocks
are typically about/over 1 MiB in size, only one allocation is done
every 2000 sectors.
Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1420566495-13284-12-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device, and automatically extends the image once the threshold
is reached or exceeded.
In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.
To fix this, this patch adds:
* A new monitor command `block-set-write-threshold', to set a mark for
a given block device.
* A new event `BLOCK_WRITE_THRESHOLD', to report if a block device
usage exceeds the threshold.
* A new `write_threshold' field into the `BlockDeviceInfo' structure,
to report the configured threshold.
This will allow the managing application to use smarter and more
efficient monitoring, greatly reducing the need of polling.
[Updated qemu-iotests 067 output to add the new 'write_threshold'
property. --Stefan]
[Changed g_assert_false() to !g_assert() to fix the build on older glib
versions. --Kevin]
Signed-off-by: Francesco Romani <fromani@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1421068273-692-1-git-send-email-fromani@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qemu-img should use QMP commands whenever possible in order to ensure
feature completeness of both online and offline image operations. As
qemu-img itself has no access to QMP (since this would basically require
just everything being linked into qemu-img), imitate QMP's
implementation of block-commit by using commit_active_start() and then
waiting for the block job to finish.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1414159063-25977-9-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
A block device consists of a frontend device model and a backend.
A block backend has a tree of block drivers doing the actual work.
The tree is managed by the block layer.
We currently use a single abstraction BlockDriverState both for tree
nodes and the backend as a whole. Drawbacks:
* Its API includes both stuff that makes sense only at the block
backend level (root of the tree) and stuff that's only for use
within the block layer. This makes the API bigger and more complex
than necessary. Moreover, it's not obvious which interfaces are
meant for device models, and which really aren't.
* Since device models keep a reference to their backend, the backend
object can't just be destroyed. But for media change, we need to
replace the tree. Our solution is to make the BlockDriverState
generic, with actual driver state in a separate object, pointed to
by member opaque. That lets us replace the tree by deinitializing
and reinitializing its root. This special need of the root makes
the data structure awkward everywhere in the tree.
The general plan is to separate the APIs into "block backend", for use
by device models, monitor and whatever other code dealing with block
backends, and "block driver", for use by the block layer and whatever
other code (if any) dealing with trees and tree nodes.
Code dealing with block backends, device models in particular, should
become completely oblivious of BlockDriverState. This should let us
clean up both APIs, and the tree data structures.
This commit is a first step. It creates a minimal "block backend"
API: type BlockBackend and functions to create, destroy and find them.
BlockBackend objects are created and destroyed exactly when root
BlockDriverState objects are created and destroyed. "Root" in the
sense of "in bdrv_states". They're not yet used for anything; that'll
come shortly.
A root BlockDriverState is created with bdrv_new_root(), so where to
create a BlockBackend is obvious. Where these roots get destroyed
isn't always as obvious.
It is obvious in qemu-img.c, qemu-io.c and qemu-nbd.c, and in error
paths of blockdev_init(), blk_connect(). That leaves destruction of
objects successfully created by blockdev_init() and blk_connect().
blockdev_init() is used only by drive_new() and qmp_blockdev_add().
Objects created by the latter are currently indestructible (see commit
48f364d "blockdev: Refuse to drive_del something added with
blockdev-add" and commit 2d246f0 "blockdev: Introduce
DriveInfo.enable_auto_del"). Objects created by the former get
destroyed by drive_del().
Objects created by blk_connect() get destroyed by blk_disconnect().
BlockBackend is reference-counted. Its reference count never exceeds
one so far, but that's going to change.
In drive_del(), the BB's reference count is surely one now. The BDS's
reference count is greater than one when something else is holding a
reference, such as a block job. In this case, the BB is destroyed
right away, but the BDS lives on until all extra references get
dropped.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch removes support for the cow file format.
Normally we do not break backwards compatibility but in this case there
is no impact and it is the most logical option. Extraordinary claims
require extraordinary evidence so I will show why removing the cow block
driver is the right thing to do.
The cow file format is the disk image format for Usermode Linux, a way
of running a Linux system in userspace. The performance of UML was
never great and it was hacky, but it enjoyed some popularity before
hardware virtualization support became mainstream.
QEMU's block/cow.c is supposed to read this image file format.
Unfortunately the file format was underspecified:
1. Earlier Linux versions used the MAXPATHLEN constant for the backing
filename field. The value of MAXPATHLEN can change, so Linux
switched to a 4096 literal but QEMU has a 1024 literal.
2. Padding was not used on the header struct (both in the Linux kernel
and in QEMU) so the struct layout varied across architectures. In
particular, i386 and x86_64 were different due to int64_t alignment
differences. Linux now uses __attribute__((packed)), QEMU does not.
Therefore:
1. QEMU cow images do not conform to the Linux cow image file format.
2. cow images cannot be shared between different host architectures.
This means QEMU cow images are useless and QEMU has not had bug reports
from users actually hitting these issues.
Let's get rid of this thing, it serves no purpose and no one will be
affected.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-id: 1410877464-20481-1-git-send-email-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is an analogue to Linux null_blk. It can be used for testing or
benchmarking block device emulation and general block layer
functionalities such as coroutines and throttling, where disk IO is not
necessary or wanted.
Use null-aio:// for AIO version, and null-co:// for coroutine version.
[Resolved conflict with Fam's async bdrv_aio_cancel() series:
1. Drop .bdrv_aio_cancel() since it is now done by block.c
2. Rename qemu_aio_release() to qemu_aio_unref()
--Stefan]
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1410415798-20673-2-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The plan is to add new accounting metrics (latency, invalid requests, failed
requests, queue depth) and block.c is overpopulated so it will be better to work
in a separate module.
Moreover the long term plan is to have statistics in each of the BDS of the graph
for metrology purpose; this means that the device model statistics must move from
the topmost BDS to the device model.
So we need to decouple the statistic code from BlockDriverState.
This is another argument for the extraction of the code in a separate module.
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Benoit Canet <benoit@irqsave.net>
CC: Fam Zheng <famz@redhat.com>
CC: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Uses the same select/WSAEventSelect scheme as main-loop.c.
WSAEventSelect() is edge-triggered, so it cannot be used
directly, but it is still used as a way to exit from a
blocking g_poll().
Before g_poll() is called, we poll sockets with a non-blocking
select() to achieve the level-triggered semantics we require:
if a socket is ready, the g_poll() is made non-blocking too.
Based on a patch from Or Goshen.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
VM Image on Archipelago volume is specified like this:
file.driver=archipelago,file.volume=<volumename>[,file.mport=<mapperd_port>[,
file.vport=<vlmcd_port>][,file.segment=<segment_name>]]
'archipelago' is the protocol.
'mport' is the port number on which mapperd is listening. This is optional
and if not specified, QEMU will make Archipelago to use the default port.
'vport' is the port number on which vlmcd is listening. This is optional
and if not specified, QEMU will make Archipelago to use the default port.
'segment' is the name of the shared memory segment Archipelago stack is using.
This is optional and if not specified, QEMU will make Archipelago to use the
default value, 'archipelago'.
Examples:
file.driver=archipelago,file.volume=my_vm_volume
file.driver=archipelago,file.volume=my_vm_volume,file.mport=123
file.driver=archipelago,file.volume=my_vm_volume,file.mport=123,
file.vport=1234
file.driver=archipelago,file.volume=my_vm_volume,file.mport=123,
file.vport=1234,file.segment=my_segment
Signed-off-by: Chrysostomos Nanakos <cnanakos@grnet.gr>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTB8hAAAoJEH8JsnLIjy/WQdgP/jEu5baA1/qKanDsS9l+81u1
/sIYSWpHDEJ0uavqTMBeyMOwkzel7SZRusIwA/d5pMqxbY6/86YJumTTozFWvtqc
IABqHtRKCxjcLdZRPbkuNAOiw6p76vSZa543o2t8OAhK2DIFy530wWXeoQEYvuJX
4pOh0lTradOrF1z6uW4ozgQ1efPppwh/iqwfWWNJVTgfnWxJk6qQaATEgkuSdsUN
Wp78UzOxLGO6JKJB6kP3LfNL0ANTYHpfH2/wkE6cW6TkSUduOm6hIBY+tb9khqYt
INOKxqFADK6EOgjvJBsZuZUtOnHK5oM921LepN/mOPAs6gKcn2j+FfqJrl3I1/5M
AXM3M0FPuijEKPGWw7pCLt7j84KJkD9a/rsKO37yRzw17fOma2Rpr4TrX43BF+5t
CGqQ7PzDJ6Fng4EXjyNDzviwXIK8xmG1tfn92tq/BUd6OuM9MCyzEGvEiGOMBoXv
w4iOV7UC+1P3TjnTBhMlBVGywSfdOJoHr9k4lXGNp0h8fPhM9rfruI3BFysxaas6
GmKbd7yvKwXOTptd3I9SB8BzVUL3CcD3FK24+cWKAl8GgyiDIWRlvBYyMp3p8Z8f
NDzcxYP6aRGsoddvpIWr3Tz89hw5wTW5u3RmNgxJUguz6HYKFbl30dpGT+96q2BN
YIAANTdPxn7BP6r3glQH
=ZaDG
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block patches
# gpg: Signature made Fri 21 Feb 2014 21:42:24 GMT using RSA key ID C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
* remotes/kevin/tags/for-upstream: (54 commits)
iotests: Mixed quorum child device specifications
quorum: Simplify quorum_open()
quorum: Add unit test.
quorum: Add quorum_open() and quorum_close().
quorum: Implement recursive .bdrv_recurse_is_first_non_filter in quorum.
quorum: Add quorum_co_flush().
quorum: Add quorum_invalidate_cache().
quorum: Add quorum_getlength().
quorum: Add quorum mechanism.
quorum: Add quorum_aio_readv.
blkverify: Extract qemu_iovec_clone() and qemu_iovec_compare() from blkverify.
quorum: Add quorum_aio_writev and its dependencies.
quorum: Create BDRVQuorumState and BlkDriver and do init.
quorum: Create quorum.c, add QuorumChildRequest and QuorumAIOCB.
check-qdict: Test termination of qdict_array_split()
check-qdict: Adjust test for qdict_array_split()
qdict: Extract non-QDicts in qdict_array_split()
qemu-config: Sections must consist of keys
qemu-iotests: Check qemu-img command line parsing
qemu-img: Allow -o help with incomplete argument list
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This patchset enables the core of the quorum mechanism.
The num_children reads are compared to get the majority version and if this
version exists more than threshold times the guest won't see the error at all.
If a block is corrupted or if an error occurs during an IO or if the quorum
cannot be established QMP events are used to report to the management.
Use gnutls's SHA-256 to compare versions.
--enable-quorum must be used to enable the feature.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Quorum is a block filter mirroring writes to num_children children.
For reads quorum reads each children and does a vote.
If more than vote_threshold versions are identical the quorum is reached and
this winning version is returned to the guest. So quorum prevents bit corruption.
For high availability purpose minority errors are reported via QMP but the guest
does not see them.
This patch creates the driver C source file and introduces the structures that
will be used in asynchronous reads and writes.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
No longer adds flags and libs for them to global variables, instead
create config-host.mak variables like FOO_CFLAGS and FOO_LIBS, which is
used as per object cflags and libs.
This removes unwanted dependencies from libcacard.
Signed-off-by: Fam Zheng <famz@redhat.com>
[Split from Fam's patch to enable modules. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds native support for accessing images on NFS
shares without the requirement to actually mount the entire
NFS share on the host.
NFS Images can simply be specified by an url of the form:
nfs://<host>/<export>/<filename>[?param=value[¶m2=value2[&...]]]
For example:
qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
You need LibNFS from Ronnie Sahlberg available at:
git://github.com/sahlberg/libnfs.git
for this to work.
During configure it is automatically probed for libnfs and support
is enabled on-the-fly. You can forbid or enforce libnfs support
with --disable-libnfs or --enable-libnfs respectively.
Due to NFS restrictions you might need to execute your binaries
as root, allow them to open priviledged ports (<1024) or specify
insecure option on the NFS server.
For additional information on ROOT vs. non-ROOT operation and URL
format + parameters see:
https://raw.github.com/sahlberg/libnfs/master/README
Supported by qemu are the uid, gid and tcp-syncnt URL parameters.
LibNFS currently support NFS version 3 only.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds support for VHDX v0 logs, as specified in Microsoft's
VHDX Specification Format v1.00:
https://www.microsoft.com/en-us/download/details.aspx?id=34750
The following support is added:
* Log parsing, and validation - validate that an existing log
is correct.
* Log search - search through an existing log, to find any valid
sequence of entries.
* Log replay and flush - replay an existing log, and flush/clear
the log when complete.
The VHDX log is a circular buffer, with elements (sectors) of 4KB.
A log entry is a variably-length number of sectors, that is
comprised of a header and 'descriptors', that describe each sector.
A log may contain multiple entries, know as a log sequence. In a log
sequence, each log entry immediately follows the previous entry, with an
incrementing sequence number. There can only ever be one active and
valid sequence in the log.
Each log entry must match the file log GUID in order to be valid (along
with other criteria). Once we have flushed all valid log entries, we
marked the file log GUID to be zero, which indicates a buffer with no
valid entries.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This moves the endian translation functions out from the vhdx.c source,
into a separate source file. In addition to the previously defined
endian functions, new endian translation functions for log support are
added as well.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds the ability to update the headers in a VHDX image, including
generating a new MS-compatible GUID.
As VHDX depends on uuid.h, VHDX is now a configurable build option. If
VHDX support is enabled, that will also enable uuid as well. The
default is to have VHDX enabled.
To enable/disable VHDX: --enable-vhdx, --disable-vhdx
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
"Incoming" function prototypes and "outgoing" function calls must match
reality. Implemented using the "struct BlockDriver" definition in
"include/block/block_int.h", and gcc errors & warnings.
v1->v2:
On 08/20/13 09:51, Kevin Wolf wrote:
> Am 18.08.2013 um 16:29 hat Paolo Bonzini geschrieben:
>> Il 16/08/2013 16:15, Laszlo Ersek ha scritto:
>>> +static int raw_reopen_prepare(BDRVReopenState *reopen_state,
>>> + BlockReopenQueue *queue, Error **errp)
>>> {
>>> - return bdrv_reopen_prepare(bs->file);
>>> + BDRVReopenState tmp = *reopen_state;
>>> +
>>> + tmp.bs = tmp.bs->file;
>>> + return bdrv_reopen_prepare(&tmp, queue, errp);
>>> }
>>
>> This should just return zero, my fault.
>
> Which is because bdrv_reopen_queue() already queues bs->file for reopen.
> The simple return 0; implementation is shared by all other format drivers
> that support reopening images.
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
backup_start() creates a block job that copies a point-in-time snapshot
of a block device to a target block device.
We call backup_do_cow() for each write during backup. That function
reads the original data from the block device before it gets
overwritten. The data is then written to the target device.
Currently backup cluster size is hardcoded to 65536 bytes.
[I made a number of changes to Dietmar's original patch and folded them
in to make code review easy. Here is the full list:
* Drop BackupDumpFunc interface in favor of a target block device
* Detect zero clusters with buffer_is_zero() and use bdrv_co_write_zeroes()
* Use 0 delay instead of 1us, like other block jobs
* Unify creation/start functions into backup_start()
* Simplify cleanup, free bitmap in backup_run() instead of cb
* function
* Use HBitmap to avoid duplicating bitmap code
* Use bdrv_getlength() instead of accessing ->total_sectors
* directly
* Delete the backup.h header file, it is no longer necessary
* Move ./backup.c to block/backup.c
* Remove #ifdefed out code
* Coding style and whitespace cleanups
* Use bdrv_add_before_write_notifier() instead of blockjob-specific hooks
* Keep our own in-flight CowRequest list instead of using block.c
tracked requests. This means a little code duplication but is much
simpler than trying to share the tracked requests list and use the
backup block size.
* Add on_source_error and on_target_error error handling.
* Use trace events instead of DPRINTF()
-- stefanha]
Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch is a pure code move patch, except following modification:
1 get_human_readable_size() is changed to static function.
2 dump_human_image_info() is renamed to bdrv_image_info_dump().
3 in qmp_query_block() and qmp_query_blockstats, use bdrv_next(bs)
instead of direct traverse of global array 'bdrv_states'.
4 collect_snapshots() and collect_image_info() are renamed, unused parameter
*fmt in collect_image_info() is removed.
5 code style fix.
To avoid conflict and tip better, macro in header file is BLOCK_QAPI_H
instead of QAPI_H. Now block.h and snapshot.h are at the same level in
include path, block_int.h and qapi.h will both include them.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All snapshot related code, except bdrv_snapshot_dump() and
bdrv_is_snapshot(), is moved to block/snapshot.c. bdrv_snapshot_dump()
will be moved to another file later. bdrv_is_snapshot() is not related
with internal snapshot. It also fixes small code style errors reported
by check script.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is the initial block driver framework for VHDX image support
(i.e. Hyper-V image file formats), that supports opening VHDX files, and
parsing the headers.
This commit does not yet enable:
- reading
- writing
- updating the header
- differencing files (images with parents)
- log replay / dirty logs (only clean images)
This is based on Microsoft's VHDX specification:
"VHDX Format Specification v0.95", published 4/12/2012
https://www.microsoft.com/en-us/download/details.aspx?id=29681
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
qemu-system-x86_64 -drive file=ssh://hostname/some/image
QEMU will ssh into 'hostname' and open '/some/image' which is made
available as a standard block device.
You can specify a username (ssh://user@host/...) and/or a port number
(ssh://host:port/...). You can also use an alternate syntax using
properties (file.user, file.host, file.port, file.path).
Current limitations:
- Authentication must be done without passwords or passphrases, using
ssh-agent. Other authentication methods are not supported.
- Uses a single connection, instead of concurrent AIO with multiple
SSH connections.
This is implemented using libssh2 on the client side. The server just
requires a regular ssh daemon with sftp-server support. Most ssh
daemons on Unix/Linux systems will work out of the box.
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
With the new support for EventNotifiers in the AIO event loop, we
can hook a completion port to every opened file and use asynchronous
I/O on them.
Wine's support is extremely inefficient, also because it really does
the I/O synchronously on regular files. (!) But it works, and it is
good to keep the Win32 and POSIX ports as similar as possible.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The Win32 implementation will only accept EventNotifiers, thus a few
drivers are disabled under Windows. EventNotifiers are a good match
for the GSource implementation, too, because the Win32 port of glib
allows to place their HANDLEs in a GPollFD.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds the implementation of a new job that mirrors a disk to
a new image while letting the guest continue using the old image.
The target is treated as a "black box" and data is copied from the
source to the target in the background. This can be used for several
purposes, including storage migration, continuous replication, and
observation of the guest I/O in an external program. It is also a
first step in replacing the inefficient block migration code that is
part of QEMU.
The job is possibly never-ending, but it is logically structured into
two phases: 1) copy all data as fast as possible until the target
first gets in sync with the source; 2) keep target in sync and
ensure that reopening to the target gets a correct (full) copy
of the source data.
The second phase is indicated by the progress in "info block-jobs"
reporting the current offset to be equal to the length of the file.
When the job is cancelled in the second phase, QEMU will run the
job until the source is clean and quiescent, then it will report
successful completion of the job.
In other words, the BLOCK_JOB_CANCELLED event means that the target
may _not_ be consistent with a past state of the source; the
BLOCK_JOB_COMPLETED event means that the target is consistent with
a past state of the source. (Note that it could already happen
that management lost the race against QEMU and got a completion
event instead of cancellation).
It is not yet possible to complete the job and switch over to the target
disk. The next patches will fix this and add many refinements to the
basic idea introduced here. These include improved error management,
some tunable knobs and performance optimizations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds the live commit coroutine. This iteration focuses on the
commit only below the active layer, and not the active layer itself.
The behaviour is similar to block streaming; the sectors are walked
through, and anything that exists above 'base' is committed back down
into base. At the end, intermediate images are deleted, and the
chain stitched together. Images are restored to their original open
flags upon completion.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds gluster as the new block backend in QEMU. This gives
QEMU the ability to boot VM images from gluster volumes. Its already
possible to boot from VM images on gluster volumes using FUSE mount, but
this patchset provides the ability to boot VM images from gluster volumes
by by-passing the FUSE layer in gluster. This is made possible by
using libgfapi routines to perform IO on gluster volumes directly.
VM Image on gluster volume is specified like this:
file=gluster[+transport]://[server[:port]]/volname/image[?socket=...]
'gluster' is the protocol.
'transport' specifies the transport type used to connect to gluster
management daemon (glusterd). Valid transport types are
tcp, unix and rdma. If a transport type isn't specified, then tcp
type is assumed.
'server' specifies the server where the volume file specification for
the given volume resides. This can be either hostname, ipv4 address
or ipv6 address. ipv6 address needs to be within square brackets [ ].
If transport type is 'unix', then 'server' field should not be specifed.
The 'socket' field needs to be populated with the path to unix domain
socket.
'port' is the port number on which glusterd is listening. This is optional
and if not specified, QEMU will send 0 which will make gluster to use the
default port. If the transport type is unix, then 'port' should not be
specified.
'volname' is the name of the gluster volume which contains the VM image.
'image' is the path to the actual VM image that resides on gluster volume.
Examples:
file=gluster://1.2.3.4/testvol/a.img
file=gluster+tcp://1.2.3.4/testvol/a.img
file=gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
file=gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
file=gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
file=gluster+tcp://server.domain.com:24007/testvol/dir/a.img
file=gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
file=gluster+rdma://1.2.3.4:24007/testvol/a.img
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>