We can get the maximum number of bytes for a single I/O transfer
from the BLKSECTGET ioctl, but we only perform this for block
devices. scsi-generic devices are represented as character devices,
and so do not issue this today. Update this, so that virtio-scsi
devices using the scsi-generic interface can return the same data.
Signed-off-by: Eric Farman <farman@linux.vnet.ibm.com>
Message-Id: <20170120162527.66075-4-farman@linux.vnet.ibm.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 6f6071745b ("raw-posix: Fetch max sectors for host block device")
introduced a routine to call the kernel BLKSECTGET ioctl, which stores the
result back to user space. However, the size of the data returned depends
on the routine handling the ioctl. The (compat_)blkdev_ioctl returns a
short, while sg_ioctl returns an int. Thus, on big-endian systems, we can
find ourselves accidentally shifting the result to a much larger value.
(On s390x, a short is 16 bits while an int is 32 bits.)
Also, the two ioctl handlers return values in different scales (block
returns sectors, while sg returns bytes), so some tweaking of the outputs
is required such that hdev_get_max_transfer_length returns a value in a
consistent set of units.
Signed-off-by: Eric Farman <farman@linux.vnet.ibm.com>
Message-Id: <20170120162527.66075-3-farman@linux.vnet.ibm.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
nb_cls_shrunk in iscsi_allocmap_update can become -1 if the
request starts and ends within the same cluster. This results
in passing -1 to bitmap_set and bitmap_clear and they don't
handle negative values properly. In the end this leads to data
corruption.
Fixes: e1123a3b40
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <1484579832-18589-1-git-send-email-pl@kamp.de>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If a migration is already in progress and somebody attempts
to add a migration blocker, this should rightly fail.
Add an errp parameter and a retcode return value to migrate_add_blocker.
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
Message-Id: <1484566314-3987-5-git-send-email-ashijeetacharya@gmail.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Acked-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Merged with recent 'Allow invtsc migration' change
Remove the "// assert(is_consistent(s))" comment in block/vvfat.c
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
Message-Id: <1484566314-3987-2-git-send-email-ashijeetacharya@gmail.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
bdrv_io_plug and bdrv_io_unplug are only called (via their
BlockBackend equivalents) after starting asynchronous I/O.
bdrv_drain is not going to be called while they are running,
because---even if a coroutine runs for some reason---it will
only drain in the next iteration of the event loop through
bdrv_co_yield_to_drain.
So this mechanism is unnecessary, get rid of it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161129113334.605-1-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
These files deal with the file protocol, not the raw format (the
file protocol is often used with other formats, and the raw
format is not forced to use the file protocol). Rename things
to make it a bit easier to follow.
Suggested-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Given that we have raw-win32.c and raw-posix.c, my initial guess at
raw_bsd.c was that it was for dealing with raw files using code
specific to the BSD operating system (beyond what raw-posix could
do). Not so - this name was chosen back in commit e1c66c6 to
distinguish that it was a BSD licensed file, in contrast to the
then-existing raw.c with an unclear and potentially unusable
license. But since it has been more than three years since the
rewrite, it's time to pick a more useful name for this file to
avoid this type of confusion to future contributors that don't know
the backstory, as none of our other files are named solely by the
license they use.
In reality, this file deals with the raw format, which is useful
with any number of protocols, while raw-{win32,posix} deal with
the file protocol (and in turn, that protocol is not limited to
use with the raw format). So rename raw_bsd to raw-format.c. We
could have also used the shorter name raw.c, except that collides
with the earlier use of that filename for a different license,
and it's better to be safe than risk license pollution.
The next patch will also rename raw-win32.c and raw-posix.c to
further distinguish the difference in roles.
It doesn't hurt that this gets rid of an underscore in the filename,
thereby making tab-completion on 'ra<TAB>' easier (now I don't have
to type the shift key, which slows things down :)
Suggested-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This enables byte granularity requests for blkverify, and at the same
time gets us rid of another user of the BDS-level AIO emulation.
The reference output of a test case must be changed because the
verification failure message reports byte offsets instead of sectors
now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This enables byte granularity requests for blkdebug, and at the same
time gets us rid of another user of the BDS-level AIO emulation.
Note that unless align=512 is specified, this can behave subtly
different from the old behaviour because bdrv_co_preadv/pwritev don't
have to perform alignment adjustments any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Make sure that all fields of the new QuorumAIOCB are zeroed when the
function returns even without explicitly setting them. This will protect
us when new fields are added, removes some explicit zero assignment and
makes the code a little nicer to read.
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Inlining the function removes some boilerplace code and replaces
recursion by a simple loop, so the code becomes somewhat easier to
understand.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This enables byte granularity requests on quorum nodes.
Note that the QMP events emitted by the driver are an external API that
we were careless enough to define as sector based. The offset and length
of requests reported in events are rounded therefore.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Replacing it with bdrv_co_pwritev() prepares us for byte granularity
requests and gets us rid of the last bdrv_aio_*() user in quorum.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This is a conversion to a more natural coroutine style and improves the
readability of the driver.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Instead of calling quorum_aio_finalize() deeply nested in what used
to be an AIO callback, do it in the same functions that allocated the
AIOCB.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This converts the quorum block driver from implementing callback-based
interfaces for read/write to coroutine-based ones. This is the first
step that will allow us further simplification of the code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
There is no point in passing the value of bs->opaque in order to
overwrite it with itself.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
The Linux AIO userspace ABI includes a ring that is shared with the
kernel. This allows userspace programs to process completions without
system calls.
Add an AioContext poll handler to check for completions in the ring.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161201192652.9509-6-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The new AioPollFn io_poll() argument to aio_set_fd_handler() and
aio_set_event_handler() is used in the next patch.
Keep this code change separate due to the number of files it touches.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161201192652.9509-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The qcow2_make_empty() function is reached during 'qemu-img commit',
in order to clear out ALL clusters of an image. However, if the
image cannot use the fast code path (true if the image is format
0.10, or if the image contains a snapshot), the cluster size is
larger than 512, and the image is larger than 2G in size, then our
choice of sector_step causes problems. Since it is not cluster
aligned, but qcow2_discard_clusters() silently ignores an unaligned
head or tail, we are leaving clusters allocated.
Enhance the testsuite to expose the flaw, and patch the problem by
ensuring our step size is aligned.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The QMP definition of BlockdevOptionsNfs:
{ 'struct': 'BlockdevOptionsNfs',
'data': { 'server': 'NFSServer',
'path': 'str',
'*user': 'int',
'*group': 'int',
'*tcp-syn-count': 'int',
'*readahead-size': 'int',
'*page-cache-size': 'int',
'*debug-level': 'int' } }
To make this consistent with other block protocols like gluster, lets
change s/debug-level/debug/
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
The QMP definition of BlockdevOptionsGluster:
{ 'struct': 'BlockdevOptionsGluster',
'data': { 'volume': 'str',
'path': 'str',
'server': ['GlusterServer'],
'*debug-level': 'int',
'*logfile': 'str' } }
But instead of 'debug-level we have exported 'debug' as the option for choosing
debug level of gluster protocol driver.
This patch fix QMP definition BlockdevOptionsGluster
s/debug-level/debug/
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJYPZu6AAoJEH8JsnLIjy/W1IAP/AwV0sWafsSMnWiz/4NVqeh3
Yk2cBtxCBmnq1y+PilLoZBdHui/RumwVuZKaShs3JA1n5CB1AjsVtEVl/6rQM7lv
yymLr32pODuf4eaGwGY09FqTiL0Erlm846zbSDjkiKbTYoKpzRv0PT2iiA6yTnjO
Mrs5nG7kEWdXPZ0ZsJyEyU3+vs7rNg+4N/VfTdPmCrV5DVBvAeCawM6JXHQNc7LV
ER6Y8W9PAu5mYqwekjAW07lPCudytAsOTrbTTO9Sv/+kZUdKEmv7ZHJrPdECCb6N
vcPOYOzKsEvvR8E0YZtuJDK9W4RTakxdlTste+TtW3VSt1Cs0zpvCFytaGuC+Kmq
mhlA4lYLDvaiNOMl09SvIjjxGI7+FO+1XsY7e4rI5PJzOKWZMFOIwQMNxE3B2qUI
dxd6izf7fzF4V5uDDwHTJ8TAiJDSAe6Bkz+vzipQtu5NARl/isbQuIPIGXPkxZln
fkCYA8/7EXrLXqd3khiRqEHS60ZtNgfm4ss8euMlWAgJAz0RLC1d/XhOIxaCQOg3
R/F9UdJAon6mfOgamZs5yzJgaPU6M90g/QipMB3Ub00VODacTiA81QUjZdEgELBB
zvhgeja7qdIvOh9r9heCCuUTmkWRRppmkrKdqFLowZ2aWosISy/UjiPTGdjDdq7Z
LsfYiRXsW94FmNXdKCqg
=3WTz
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'kwolf/tags/for-upstream' into staging
Block layer patches for 2.8.0-rc2
# gpg: Signature made Tue 29 Nov 2016 03:16:10 PM GMT
# gpg: using RSA key 0x7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* kwolf/tags/for-upstream:
docs: Specify that cache-clean-interval is only supported in Linux
qcow2: Remove stale comment
qcow2: Allow 'cache-clean-interval' in Linux only
qcow2: Make qcow2_cache_table_release() work only in Linux
Message-id: 1480436227-2211-1-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The cache-clean-interval option of qcow2 only works on Linux. However
we allow setting it in other systems regardless of whether it works or
not.
In those systems this option is not simply a no-op: it actually
invalidates perfectly valid cache tables for no good reason without
freeing their memory.
This patch forbids using that option in non-Linux systems.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are using QEMU_MADV_DONTNEED to discard the memory of individual L2
cache tables. The problem with this is that those semantics are
specific to the Linux madvise() system call. Other implementations of
madvise() (including the very Linux implementation of posix_madvise())
don't do that, so we cannot use them for the same purpose.
This patch makes the code Linux-specific and uses madvise() directly
since there's no point in going through qemu_madvise() for this.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit fa778fff wired up support to send the NBD_CMD_WRITE_ZEROES,
but forgot to inform the block layer that FUA unmapping of zeroes is
supported. Without BDRV_REQ_MAY_UNMAP listed as a supported flag,
the block layer will always insist on the NBD layer passing
NBD_CMD_FLAG_NO_HOLE, resulting in the server always allocating
things even when it was desired to let the server punch holes.
Similarly, failing to set BDRV_REQ_FUA means that the client may
send unnecessary NBD_CMD_FLUSH when it could have instead used the
NBD_CMD_FLAG_FUA bit.
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1479413642-22463-2-git-send-email-eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Discard is advisory, so rounding the requests to alignment
boundaries is never semantically wrong from the data that
the guest sees. But at least the Dell Equallogic iSCSI SANs
has an interesting property that its advertised discard
alignment is 15M, yet documents that discarding a sequence
of 1M slices will eventually result in the 15M page being
marked as discarded, and it is possible to observe which
pages have been discarded.
Between commits 9f1963b and b8d0a980, we converted the block
layer to a byte-based interface that ultimately ignores any
unaligned head or tail based on the driver's advertised
discard granularity, which means that qemu 2.7 refuses to
pass any discard request smaller than 15M down to the Dell
Equallogic hardware. This is a slight regression in behavior
compared to earlier qemu, where a guest executing discards
in power-of-2 chunks used to be able to get every page
discarded, but is now left with various pages still allocated
because the guest requests did not align with the hardware's
15M pages.
Since the SCSI specification says nothing about a minimum
discard granularity, and only documents the preferred
alignment, it is best if the block layer gives the driver
every bit of information about discard requests, rather than
rounding it to alignment boundaries early.
Rework the block layer discard algorithm to mirror the write
zero algorithm: always peel off any unaligned head or tail
and manage that in isolation, then do the bulk of the request
on an aligned boundary. The fallback when the driver returns
-ENOTSUP for an unaligned request is to silently ignore that
portion of the discard request; but for devices that can pass
the partial request all the way down to hardware, this can
result in the hardware coalescing requests and discarding
aligned pages after all.
Reported by: Peter Lieven <pl@kamp.de>
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Right now, the block layer rounds discard requests, so that
individual drivers are able to assert that discard requests
will never be unaligned. But there are some ISCSI devices
that track and coalesce multiple unaligned requests, turning it
into an actual discard if the requests eventually cover an
entire page, which implies that it is better to always pass
discard requests as low down the stack as possible.
In isolation, this patch has no semantic effect, since the
block layer currently never passes an unaligned request through.
But the block layer already has code that silently ignores
drivers that return -ENOTSUP for a discard request that cannot
be honored (as well as drivers that return 0 even when nothing
was done). But the next patch will update the block layer to
fragment discard requests, so that clients are guaranteed that
they are either dealing with an unaligned head or tail, or an
aligned core, making it similar to the block layer semantics of
write zero fragmentation.
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 443668ca rewrote the write_zeroes logic to guarantee that
an unaligned request never crosses a cluster boundary. But
in the rewrite, the new code assumed that at most one iteration
would be needed to get to an alignment boundary.
However, it is easy to trigger an assertion failure: the Linux
kernel limits loopback devices to advertise a max_transfer of
only 64k. Any operation that requires falling back to writes
rather than more efficient zeroing must obey max_transfer during
that fallback, which means an unaligned head may require multiple
iterations of the write fallbacks before reaching the aligned
boundaries, when layering a format with clusters larger than 64k
atop the protocol of file access to a loopback device.
Test case:
$ qemu-img create -f qcow2 -o cluster_size=1M file 10M
$ losetup /dev/loop2 /path/to/file
$ qemu-io -f qcow2 /dev/loop2
qemu-io> w 7m 1k
qemu-io> w -z 8003584 2093056
In fairness to Denis (as the original listed author of the culprit
commit), the faulty logic for at most one iteration is probably all
my fault in reworking his idea. But the solution is to restore what
was in place prior to that commit: when dealing with an unaligned
head or tail, iterate as many times as necessary while fragmenting
the operation at max_transfer boundaries.
Reported-by: Ed Swierk <eswierk@skyportsystems.com>
CC: qemu-stable@nongnu.org
CC: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
At the qcow2 layer, discard is only possible on a per-cluster
basis; at the moment, qcow2 silently rounds any unaligned
requests to this granularity. However, an upcoming patch will
fix a regression in the block layer ignoring too much of an
unaligned discard request, by changing the block layer to
break up a discard request at alignment boundaries; for that
to work, the block layer must know about our limits.
However, we can't go one step further by changing
qcow2_discard_clusters() to assert that requests are always
aligned, since that helper function is reached on paths
outside of the block layer.
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes a use-after-free bug introduced in commit 6349c154. We need
to use QLIST_FOREACH_SAFE() when freeing elements in the loop. Spotted
by Coverity.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1479378608-11962-1-git-send-email-kwolf@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
This puts a huge strain on the disks when there are many concurrent
migrations. With this patch we only flush twice: just before issuing
the event, and just before pivoting to the destination. If management
will complete the job close to the BLOCK_JOB_READY event, the cost of
the second flush should be small anyway.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161109162008.27287-2-pbonzini@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
libcurl will only give us as much data as there is, not more. The block
layer will deny requests beyond the end of file for us; but since this
block driver is still using a sector-based interface, we can still get
in trouble if the file size is not a multiple of 512.
While we have already made sure not to attempt transfers beyond the end
of the file, we are currently still trying to receive data from there if
the original request exceeds the file size. This patch fixes this issue
and invokes qemu_iovec_memset() on the iovec's tail.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20161025025431.24714-5-mreitz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
For some connection types (like FTP, generally), more than one socket
may be used (in FTP's case: control vs. data stream). As of commit
838ef60249 ("curl: Eliminate unnecessary
use of curl_multi_socket_all"), we have to remember all of the sockets
used by libcurl, but in fact we only did that for a single one. Since
one libcurl connection may use multiple sockets, however, we have to
remember them all.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20161025025431.24714-4-mreitz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
While commit 38bbc0a580 is correct in that
the callback is supposed to return the number of bytes handled; what it
does not mention is that libcurl will throw an error if the callback did
not "handle" all of the data passed to it.
Therefore, if the callback receives some data that it cannot handle
(either because the receive buffer has not been set up yet or because it
would not fit into the receive buffer) and we have to ignore it, we
still have to report that the data has been handled.
Obviously, this should not happen normally. But it does happen at least
for FTP connections where some data (that we do not expect) may be
generated when the connection is established.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20161025025431.24714-3-mreitz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Currently, curl defines its own constant SECTOR_SIZE. There is no
advantage over using the global BDRV_SECTOR_SIZE, so drop it.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20161025025431.24714-2-mreitz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Because TFTP does not support byte ranges, it was never usable with our
curl block driver. Since apparently nobody has ever complained loudly
enough for someone to take care of the issue until now, it seems
reasonable to assume that nobody has ever actually used it.
Therefore, it should be safe to just drop it from curl's protocol list.
[Jeff Cody: Below is additional summary pulled, with some rewording,
from followup emails between Max and Markus, to explain what
worked and what didn't]
TFTP would sometimes work, to a limited extent, for images <= the curl
"readahead" size, so long as reads started at offset zero. By default,
that readahead size is 256KB.
Reads starting at a non-zero offset would also have returned data from a
zero offset. It can become more complicated still, with mixed reads at
zero offset and non-zero offsets, due to data buffering.
In short, TFTP could only have worked before in very specific scenarios
with unrealistic expectations and constraints.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20161102175539.4375-4-mreitz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Refactor backup_start as backup_job_create, which only creates the job,
but does not automatically start it. The old interface, 'backup_start',
is not kept in favor of limiting the number of nearly-identical interfaces
that would have to be edited to keep up with QAPI changes in the future.
Callers that wish to synchronously start the backup_block_job can
instead just call block_job_start immediately after calling
backup_job_create.
Transactions are updated to use the new interface, calling block_job_start
only during the .commit phase, which helps prevent race conditions where
jobs may finish before we even finish building the transaction. This may
happen, for instance, during empty block backup jobs.
Reported-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 1478587839-9834-6-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Instead of automatically starting jobs at creation time via backup_start
et al, we'd like to return a job object pointer that can be started
manually at later point in time.
For now, add the block_job_start mechanism and start the jobs
automatically as we have been doing, with conversions job-by-job coming
in later patches.
Of note: cancellation of unstarted jobs will perform all the normal
cleanup as if the job had started, particularly abort and clean. The
only difference is that we will not emit any events, because the job
never actually started.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 1478587839-9834-5-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Add an explicit start field to specify the entrypoint. We already have
ownership of the coroutine itself AND managing the lifetime of the
coroutine, let's take control of creation of the coroutine, too.
This will allow us to delay creation of the actual coroutine until we
know we'll actually start a BlockJob in block_job_start. This avoids
the sticky question of how to "un-create" a Coroutine that hasn't been
started yet.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1478587839-9834-4-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Cleaning up after we have deferred to the main thread but before the
transaction has converged can be dangerous and result in deadlocks
if the job cleanup invokes any BH polling loops.
A job may attempt to begin cleaning up, but may induce another job to
enter its cleanup routine. The second job, part of our same transaction,
will block waiting for the first job to finish, so neither job may now
make progress.
To rectify this, allow jobs to register a cleanup operation that will
always run regardless of if the job was in a transaction or not, and
if the transaction job group completed successfully or not.
Move sensitive cleanup to this callback instead which is guaranteed to
be run only after the transaction has converged, which removes sensitive
timing constraints from said cleanup.
Furthermore, in future patches these cleanup operations will be performed
regardless of whether or not we actually started the job. Therefore,
cleanup callbacks should essentially confine themselves to undoing create
operations, e.g. setup actions taken in what is now backup_start.
Reported-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1478587839-9834-3-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
blk_eject is only used by scsi-disk and atapi, and in both cases we
only attempt to invoke blk_eject if we have a bona-fide change in
tray state.
The "issue" here is that the tray state does not generate a QMP event
unless there is a medium/BDS attached to the device, so if libvirt et al
are waiting for a tray event to occur from an empty-but-closed drive,
software opening that drive will not emit an event and libvirt will
wait forever.
Change this by modifying blk_eject to always emit an event, instead of
conditionally on a "real" backend eject.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1373264
Reported-by: Peter Krempa <pkrempa@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1478553214-497-2-git-send-email-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
It is too confusing because it sounds like a BDRVRawState variable.
Suggested-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1477565117-17230-1-git-send-email-famz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
It was from the time when none of the global functions had a qcow2_
prefix.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We make sure that the size is aligned to sector length to prevent any
round ups. Otherwise we could end up reading/writing data outside the
area specified by user. This is only needed when user supplies the size
option to avoid any surprises. It is not necessary when only offset is
set.
More over, the check made it difficult to use the offset option without
size option. The check puts unneeded restriction on the offset which had
to be aligned too. Because bdrv_getlength() returns aligned value having
unaligned offset would make the check fail.
Signed-off-by: Tomáš Golembiovský <tgolembi@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When only offset is specified but no size and the offset is greater than
the real size of the containing device an overflow occurs when parsing
the options. This overflow is harmless because we do check for this
exact situation little bit later, but it leads to an error message with
weird values. It is better to do the check is sooner and prevent the
overflow.
Signed-off-by: Tomáš Golembiovský <tgolembi@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch drops the unused parameter "BDRVSSHState" being passed into
the ssh_config() function and does code cleanup. The unused parameter
was introduced by the commit c322712.
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch frees the leaked visitor in nbd_refresh_filename() and uses
visit_free() to fix it. The leak was introduced by the commit 491d6c7.
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 3ff2f67a changed bdrv_co_flush() so that no flush is issues if
the image hasn't been dirtied since the last flush. This is not quite
correct: The condition should be that the image hasn't been dirtied
since the last _successful_ flush. This patch changes the logic
accordingly.
Without this fix, subsequent bdrv_co_flush() calls would return success
without actually doing anything even though the image is still dirty.
The difference is visible in some blkdebug test cases where error
messages incorrectly disappeared after commit 3ff2f67a.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1478300595-10090-1-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* NBD write zeroes support (Eric)
* Memory backend fixes (Haozhong)
* Atomics fix (Alex)
* New AVX512 features (Luwei)
* "make check" logging fix (Paolo)
* Chardev refactoring fallout (Paolo)
* Small checkpatch improvements (Paolo, Jeff)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQExBAABCAAbBQJYGaRPFBxwYm9uemluaUByZWRoYXQuY29tAAoJEL/70l94x66D
XKgH/RgNtosBTqJsmphkS7wACFAFOf7Uq46ajoKfB66Pt1J/++pFQg4TApPYkb7j
KlKeKmXa7hb6+Jg8325H4zGkGno4kn2dE+OnznaB1xPKwiZVAMQVzQsagsEVqpno
k/5PBVRptIiuHQKyU29Go0CxbWJBTH0O14S7rDK4YDF0YMnuT280HQOI3jdu1igV
G/Q+CMgfk+yXf6GWHE8Z9sNq7n0ha8qgruA/X3NC7+pAvEsUcAP065zwLp9weYuK
W1MU68L7Ub4tRo0SVf1HFkDUNdMv4T4hg+wpGe1GwthJWexHu9x0YAQBy60ykJb6
NtHwjLwCUWtm7AiZD/btsOJPmjk=
=+Dt/
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging
* NBD bugfix (Changlong)
* NBD write zeroes support (Eric)
* Memory backend fixes (Haozhong)
* Atomics fix (Alex)
* New AVX512 features (Luwei)
* "make check" logging fix (Paolo)
* Chardev refactoring fallout (Paolo)
* Small checkpatch improvements (Paolo, Jeff)
# gpg: Signature made Wed 02 Nov 2016 08:31:11 AM GMT
# gpg: using RSA key 0xBFFBD25F78C7AE83
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>"
# gpg: aka "Paolo Bonzini <pbonzini@redhat.com>"
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1
# Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83
* remotes/bonzini/tags/for-upstream: (30 commits)
main-loop: Suppress I/O thread warning under qtest
docs/rcu.txt: Fix minor typo
vl: exit qemu on guest panic if -no-shutdown is not set
checkpatch: allow spaces before parenthesis for 'coroutine_fn'
x86: add AVX512_4VNNIW and AVX512_4FMAPS features
slirp: fix CharDriver breakage
qemu-char: do not forward events through the mux until QEMU has started
nbd: Implement NBD_CMD_WRITE_ZEROES on client
nbd: Implement NBD_CMD_WRITE_ZEROES on server
nbd: Improve server handling of shutdown requests
nbd: Refactor conversion to errno to silence checkpatch
nbd: Support shorter handshake
nbd: Less allocation during NBD_OPT_LIST
nbd: Let client skip portions of server reply
nbd: Let server know when client gives up negotiation
nbd: Share common option-sending code in client
nbd: Send message along with server NBD_REP_ERR errors
nbd: Share common reply-sending code in server
nbd: Rename struct nbd_request and nbd_reply
nbd: Rename NbdClientSession to NBDClientSession
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Upstream NBD protocol recently added the ability to efficiently
write zeroes without having to send the zeroes over the wire,
along with a flag to control whether the client wants a hole.
The generic block code takes care of falling back to the obvious
write of lots of zeroes if we return -ENOTSUP because the server
does not have WRITE_ZEROES.
Ideally, since NBD_CMD_WRITE_ZEROES does not involve any data
over the wire, we want to support transactions that are much
larger than the normal 32M limit imposed on NBD_CMD_WRITE. But
the server may still have a limit smaller than UINT_MAX, so
until experimental NBD protocol additions for advertising various
command sizes is finalized (see [1], [2]), for now we just stick to
the same limits as normal writes.
[1] https://github.com/yoe/nbd/blob/extension-info/doc/proto.md
[2] https://sourceforge.net/p/nbd/mailman/message/35081223/
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1476469998-28592-17-git-send-email-eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Our coding convention prefers CamelCase names, and we already
have other existing structs with NBDFoo naming. Let's be
consistent, before later patches add even more structs.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1476469998-28592-6-git-send-email-eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It's better to use consistent capitalization of the namespace
used for NBD functions; we have more instances of NBD* than
Nbd*.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1476469998-28592-5-git-send-email-eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Current upstream NBD documents that requests have a 16-bit flags,
followed by a 16-bit type integer; although older versions mentioned
only a 32-bit field with masking to find flags. Since the protocol
is in network order (big-endian over the wire), the ABI is unchanged;
but dealing with the flags as a separate field rather than masking
will make it easier to add support for upcoming NBD extensions that
increase the number of both flags and commands.
Improve some comments in nbd.h based on the current upstream
NBD protocol (https://github.com/yoe/nbd/blob/master/doc/proto.md),
and touch some nearby code to keep checkpatch.pl happy.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1476469998-28592-3-git-send-email-eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
NBD is using the CoMutex in a way that wasn't anticipated. For example, if there are
N(N=26, MAX_NBD_REQUESTS=16) nbd write requests, so we will invoke nbd_client_co_pwritev
N times.
----------------------------------------------------------------------------------------
time request Actions
1 1 in_flight=1, Coroutine=C1
2 2 in_flight=2, Coroutine=C2
...
15 15 in_flight=15, Coroutine=C15
16 16 in_flight=16, Coroutine=C16, free_sema->holder=C16, mutex->locked=true
17 17 in_flight=16, Coroutine=C17, queue C17 into free_sema->queue
18 18 in_flight=16, Coroutine=C18, queue C18 into free_sema->queue
...
26 N in_flight=16, Coroutine=C26, queue C26 into free_sema->queue
----------------------------------------------------------------------------------------
Once nbd client recieves request No.16' reply, we will re-enter C16. It's ok, because
it's equal to 'free_sema->holder'.
----------------------------------------------------------------------------------------
time request Actions
27 16 in_flight=15, Coroutine=C16, free_sema->holder=C16, mutex->locked=false
----------------------------------------------------------------------------------------
Then nbd_coroutine_end invokes qemu_co_mutex_unlock what will pop coroutines from
free_sema->queue's head and enter C17. More free_sema->holder is C17 now.
----------------------------------------------------------------------------------------
time request Actions
28 17 in_flight=16, Coroutine=C17, free_sema->holder=C17, mutex->locked=true
----------------------------------------------------------------------------------------
In above scenario, we only recieves request No.16' reply. As time goes by, nbd client will
almostly recieves replies from requests 1 to 15 rather than request 17 who owns C17. In this
case, we will encounter assert "mutex->holder == self" failed since Kevin's commit 0e438cdc
"coroutine: Let CoMutex remember who holds it". For example, if nbd client recieves request
No.15' reply, qemu will stop unexpectedly:
----------------------------------------------------------------------------------------
time request Actions
29 15(most case) in_flight=15, Coroutine=C15, free_sema->holder=C17, mutex->locked=false
----------------------------------------------------------------------------------------
Per Paolo's suggestion "The simplest fix is to change it to CoQueue, which is like a condition
variable", this patch replaces CoMutex with CoQueue.
Cc: Wen Congyang <wency@cn.fujitsu.com>
Reported-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Message-Id: <1476267508-19499-1-git-send-email-xiecl.fnst@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To make it a little more obvious which functions are intended to be
public interface and which are intended to be for use only by jobs
themselves, split the interface into "public" and "private" files.
Convert blockjobs (e.g. block/backup) to using the private interface.
Leave blockdev and others on the public interface.
There are remaining uses of private state by qemu-img, and several
cases in blockdev.c and block/io.c where we grab job->blk for the
purposes of acquiring an AIOContext.
These will be corrected in future patches.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1477584421-1399-7-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
There's no reason to leave this to blockdev; we can do it in blockjobs
directly and get rid of an extra callback for most users.
All non-internal events, even those created outside of QMP, will
consistently emit events.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1477584421-1399-5-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Bubble up the internal interface to commit and backup jobs, then switch
replication tasks over to using this methodology.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1477584421-1399-4-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Add the ability to create jobs without an ID.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1477584421-1399-3-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
After introduction of qapi schema in gluster block driver code, the port
type is now string as per InetSocketAddress
{ 'struct': 'InetSocketAddress',
'data': {
'host': 'str',
'port': 'str',
'*to': 'uint16',
'*ipv4': 'bool',
'*ipv6': 'bool' } }
but the current code still treats it as QEMU_OPT_NUMBER, hence fixing port
to accept QEMU_OPT_STRING.
Suggested-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
using atoi() for converting string to int may be error prone in case if
string supplied in the argument is not a fold of numerical number,
This is not a bug because in the existing code,
static QemuOptsList runtime_tcp_opts = {
.name = "gluster_tcp",
.head = QTAILQ_HEAD_INITIALIZER(runtime_tcp_opts.head),
.desc = {
...
{
.name = GLUSTER_OPT_PORT,
.type = QEMU_OPT_NUMBER,
.help = "port number ...",
},
...
};
port type is QEMU_OPT_NUMBER, before we actually reaches atoi() port is already
defended by parse_option_number()
However It is a good practice to use function like parse_uint_full()
over atoi() to keep port self defended
Note: As now the port string to int conversion has its defence code set,
and also we understand that port argument is actually a string type,
in the follow up patch let's move port type from QEMU_OPT_NUMBER to
QEMU_OPT_STRING
[Jeff Cody: removed spurious parenthesis]
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
We already specified BDRV_O_UNMAP when opening images in 'qemu-img
commit', but didn't turn on the "unmap" in the active commit job. This
patch fixes that so that zeroed clusters in top image can be discarded
which is desired in the virt-sparsify use case, where a temporary
overlay is created and fstrim'ed before commiting back, to free space in
the original image.
This also enables it for block-commit.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1474974892-5031-1-git-send-email-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Currently, for every drive accessed via gfapi we create a new glfs
instance (call glfs_new() followed by glfs_init()) which could consume
memory in few 100 MB's, from the table below it looks like for each
instance ~300 MB VSZ was consumed
Before:
-------
Disks VSZ RSS
1 1098728 187756
2 1430808 198656
3 1764932 199704
4 2084728 202684
This patch maintains a list of pre-opened glfs objects. On adding
a new drive belonging to the same gluster volume, we just reuse the
existing glfs object by updating its refcount.
With this approch we shrink up the unwanted memory consumption and
glfs_new/glfs_init calls for accessing a disk (file) if belongs to
same volume.
From below table notice that the memory usage after adding a disk
(which will reuse the existing glfs object hence) is in negligible
compared to before.
After:
------
Disks VSZ RSS
1 1101964 185768
2 1109604 194920
3 1114012 196036
4 1114496 199868
Disks: number of -drive
VSZ: virtual memory size of the process in KiB
RSS: resident set size, the non-swapped physical memory (in kiloBytes)
VSZ and RSS are analyzed using 'ps aux' utility.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1477581890-4811-1-git-send-email-prasanna.kalever@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Add checks to see if the system compiling QEMU has support for
SEEK_HOLE/SEEK_DATA. If the system does not, we will flag that seek
data is unsupported in gluster.
Note: this is not a check on whether the gluster server itself supports
SEEK_DATA (that is already done during runtime), but rather if the
compilation environment supports SEEK_DATA.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Tested-by: Eric Blake <eblake@redhat.com>
Message-id: 00370bce5c98140d6c56ad5145635ec6551265cc.1475876377.git.jcody@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Make it a bit clearer and more readable.
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1476519973-6436-1-git-send-email-lixiubo@cmss.chinamobile.com
CC: John Snow <jsnow@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Make NFS block driver use various fine grained runtime_opts.
Set .bdrv_parse_filename() to nfs_parse_filename() and introduce two
new functions nfs_parse_filename() and nfs_parse_uri() to help parsing
the URI.
Add a new option "server" which then accepts a new struct NFSServer.
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
[ kwolf: Fixed client->path ]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Added two new options 'offset' and 'size'. This makes it possible to use
only part of the file as a device. This can be used e.g. to limit the
access only to single partition in a disk image or use a disk inside a
tar archive (like OVA).
When 'size' is specified we do our best to honour it.
Signed-off-by: Tomáš Golembiovský <tgolembi@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This makes sure that the image we are streaming into is open in
read-write mode during the operation.
Operation blockers are also set in all intermediate nodes, since they
will be removed from the chain afterwards.
Finally, this also unblocks the stream operation in backing files.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When block-commit is launched without the top parameter, it uses
internally a mirror block job. In that case all intermediate nodes
between the active and base nodes must be blocked as well.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After a successful block-commit operation all nodes between top and
base are removed from the backing chain, and top's overlay needs to
be updated to point to base. Because of that we should prevent other
block jobs from messing with them.
This patch blocks all operations in these nodes in commit_start().
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use block_job_add_bdrv() instead of blocking all operations in
backup_start() and unblocking them in backup_run().
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use block_job_add_bdrv() instead of blocking all operations in
mirror_start_job() and unblocking them in mirror_exit().
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_drain_all() doesn't allow the caller to do anything after all
pending requests have been completed but before block jobs are
resumed.
This patch splits bdrv_drain_all() into _begin() and _end() for that
purpose. It also adds aio_{disable,enable}_external() calls to disable
external clients in the meantime.
An important restriction of this split is that no new block jobs or
BlockDriverStates can be created between the bdrv_drain_all_begin()
and bdrv_drain_all_end() calls. This is not a concern now because
we'll only be using this in bdrv_reopen_multiple(), but it must be
dealt with if we ever have other uses cases in the future.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Drop the use of legacy options in favour of the InetSocketAddress
options.
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add InetSocketAddress compatibility to SSH driver.
Add a new option "server" to the SSH block driver which then accepts
a InetSocketAddress.
"host" and "port" are supported as legacy options and are mapped to
their InetSocketAddress representation.
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We have 5 options plus ("server") option which is added in the next
patch that conflict with specifying a SSH filename. We need to iterate
over all the options to check whether its key has an "server." prefix.
This iteration will help us adding the new option "server" easily.
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Version: GnuPG v2
iQEcBAABCAAGBQJYE2ULAAoJEMo1YkxqkXHGvvAH/iDPIAiwBXbndL3KhQTneSHn
ctd4I3VK1/VVTIBRJIetqETiWiAm/WoRhI9kBc/NrQxBFx3ko+fpSYFS2t6lJYnV
EX0vjTKjFhr05tOTQDH/SQtHdU5x/x2M8SsxqrCcTyLm5VDfdPeBlMBfSNMj/L2K
bwinANVEwr6LOM0h8weQ0SvOCa5MLII2p5ufGwKQmhUY5tgZvFlyPa+quDVisKoE
7CpLwWHmUQSNxUXSaru90osUJyk90wCcYxPpJN3YO1MHvpH4kG8DpZ8bnFqLAoNw
zkRdqIrlfntD+mKDqRU1y0GXxu9I4VK1UDcQyRFoSdMi2oHR+L018sQEjCYTAXo=
=n+CF
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/famz/tags/for-upstream' into staging
# gpg: Signature made Fri 28 Oct 2016 15:47:39 BST
# gpg: using RSA key 0xCA35624C6A9171C6
# gpg: Good signature from "Fam Zheng <famz@redhat.com>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 5003 7CB7 9706 0F76 F021 AD56 CA35 624C 6A91 71C6
* remotes/famz/tags/for-upstream:
aio: convert from RFifoLock to QemuRecMutex
qemu-thread: introduce QemuRecMutex
iothread: release AioContext around aio_poll
block: only call aio_poll on the current thread's AioContext
qemu-img: call aio_context_acquire/release around block job
qemu-io: acquire AioContext
block: prepare bdrv_reopen_multiple to release AioContext
replication: pass BlockDriverState to reopen_backing_file
iothread: detach all block devices before stopping them
aio: introduce qemu_get_current_aio_context
sheepdog: use BDRV_POLL_WHILE
nfs: use BDRV_POLL_WHILE
nfs: move nfs_set_events out of the while loops
block: introduce BDRV_POLL_WHILE
qed: Implement .bdrv_drain
block: change drain to look only at one child at a time
block: add BDS field to count in-flight requests
mirror: use bdrv_drained_begin/bdrv_drained_end
blockjob: introduce .drain callback for jobs
replication: interrupt failover if the main device is closed
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJYEfjrAAoJEL6G67QVEE/fdU4P/i7yBJo436OpkdgeWS8AWuFr
ptZ+Fj/weGka5GU9E3KQu36kbSgrtfcgwTHphCMXnZ0YCeKQDuM57f7LNiN6qheB
nqgJvJioLbUvLTQvCHOISM7bWOnYvASBmYtLJFtUcP/jhdOy61KaADnJ+7MbliNv
yJSW2RN+s/y9nUb+dxEpIXXUVMRa6BX+wHW3O44c1oLn6/Pe20aJeHTyDx3qiBhD
8RYXUgRZopH2bouBSzXgMQTbn/QMD/dC81WQlHKlt4swffyei2D/1pciOcuc0SXz
+SZdkTre5JB5Kd6DU8zQ6PrrIt1nPmLSptSyhQvNxm+uWNWHnFcW1s2aYmf/ikjl
4boW37ayJx09mns8yv7TerzEPbL5qJvVX8Dsnb6telkvrS9hy9S1xuIB5xHbt6/h
vwFmCdwaZoGpDDaoXRL+9k9TOI9BbEMKX33nAPDqvEXLMIf+og4fmweTKcY4XTRL
/Fdg1H71v8Ayv+r5TJOKwFg3PNNjnvqkbk1psS+aaW7dup43iaYGIKWy+VFaCufk
hPXLOtR5lUsYC2qm+nkjPIgoP7D8oZx4AGkCHbYsqzi+l1lynZH3rBIs8ggLr72o
FFk4g0sNYe1ccAa89jFEgWIQbS0N6ckUXCv12g3eyF/UIC1F35/mGGugSRnTXuc2
a/WsvgU7pGBrtqXcg7lF
=gsxL
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/berrange/tags/pull-qio-2016-10-27-1' into staging
Merge qio 2016/10/27 v1
# gpg: Signature made Thu 27 Oct 2016 13:54:03 BST
# gpg: using RSA key 0xBE86EBB415104FDF
# gpg: Good signature from "Daniel P. Berrange <dan@berrange.com>"
# gpg: aka "Daniel P. Berrange <berrange@redhat.com>"
# Primary key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF
* remotes/berrange/tags/pull-qio-2016-10-27-1:
main: set names for main loop sources created
vnc: set name for all I/O channels created
migration: set name for all I/O channels created
char: set name for all I/O channels created
nbd: set name for all I/O channels created
io: add ability to set a name for IO channels
io: Add a QIOChannelSocket cleanup test
io: set LISTEN flag explicitly for listen sockets
io: Introduce a qio_channel_set_feature() helper
io: Use qio_channel_has_feature() where applicable
io: Fix double shift usages on QIOChannel features
Conflicts:
qemu-char.c
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
aio_poll is not thread safe; for example bdrv_drain can hang if
the last in-flight I/O operation is completed in the I/O thread after
the main thread has checked bs->in_flight.
The bug remains latent as long as all of it is called within
aio_context_acquire/aio_context_release, but this will change soon.
To fix this, if bdrv_drain is called from outside the I/O thread,
signal the main AioContext through a dummy bottom half. The event
loop then only runs in the I/O thread.
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1477565348-5458-18-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
After the next patch bdrv_drain_all will have to be called without holding any
AioContext. Prepare to do this by adding an AioContext argument to
bdrv_reopen_multiple.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1477565348-5458-15-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
This will be needed in the next patch to retrieve the AioContext.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1477565348-5458-14-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
This is important when the sheepdog driver works on a BlockDriverState
that is attached to an I/O thread other than the main thread.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-Id: <1477565348-5458-11-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
This will make it possible to use nfs_get_allocated_file_size on
a file that is not in the main AioContext.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1477565348-5458-10-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
nfs_set_events only needs to be called once before entering the
while loop; afterwards, nfs_process_read and nfs_process_write
take care of it.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1477565348-5458-9-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
We want the BDS event loop to run exclusively in the iothread that
owns the BDS's AioContext. This macro will provide the synchronization
between the two event loops; for now it just wraps the common idiom
of a while loop around aio_poll.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-Id: <1477565348-5458-8-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
The "need_check_timer" is used to clear the "NEED_CHECK" flag in the
image header after a grace period once metadata update has finished. To
comply with the bdrv_drain semantics, we should make sure it remains
deleted once .bdrv_drain is called.
The change to qed_need_check_timer_cb is needed because bdrv_qed_drain
is called after s->bs has been drained, and should not operate on it;
instead it should operate on the BdrvChild-ren exclusively. Doing so
is easy because QED does not have a bdrv_co_flush_to_os callback, hence
all that is needed to flush it is to ensure writes have reached the disk.
Based on commit df9a681dc9 (which however included some unrelated
hunks, possibly due to a merge failure or an overlooked squash).
The patch was reverted because at the time bdrv_qed_drain could call
qed_plug_allocating_write_reqs while an allocating write was queued.
This however is not possible anymore after the previous patch;
.bdrv_drain is only called after all writes have completed at the
QED level, and its purpose is to trigger metadata writes in bs->file.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1477565348-5458-7-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
bdrv_requests_pending is checking children to also wait until internal
requests (such as metadata writes) have completed. However, checking
children is in general overkill. Children requests can be of two kinds:
- requests caused by an operation on bs, e.g. a bdrv_aio_write to bs
causing a write to bs->file->bs. In this case, the parent's in_flight
count will always be incremented by at least one for every request in
the child.
- asynchronous metadata writes or flushes. Such writes can be started
even if bs's in_flight count is zero, but not after the .bdrv_drain
callback has been invoked.
This patch therefore changes bdrv_drain to finish I/O in the parent
(after which the parent's in_flight will be locked to zero), call
bdrv_drain (after which the parent will not generate I/O on the child
anymore), and then wait for internal I/O in the children to complete.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-Id: <1477565348-5458-6-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Unlike tracked_requests, this field also counts throttled requests,
and remains non-zero if an AIO operation needs a BH to be "really"
completed.
With this change, it is no longer necessary to have a dummy
BdrvTrackedRequest for requests that are never serialising, and
it is no longer necessary to poll the AioContext once after
bdrv_requests_pending(bs) returns false.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-Id: <1477565348-5458-5-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Ensure that there are no changes between the last check to
bdrv_get_dirty_count and the switch to the target.
There is already a bdrv_drained_end call, we only need to ensure
that bdrv_drained_begin is not called twice.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-Id: <1477565348-5458-4-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
This is required to decouple block jobs from running in an
AioContext. With multiqueue block devices, a BlockDriverState
does not really belong to a single AioContext.
The solution is to first wait until all I/O operations are
complete; then loop in the main thread for the block job to
complete entirely.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-Id: <1477565348-5458-3-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Without this change, there is a race condition in tests/test-replication.
Depending on how fast the failover job (active commit) runs, there is a
chance of two bad things happening:
1) replication_done can be called after the secondary has been closed
and hence when the BDRVReplicationState is not valid anymore.
2) two copies of the active disk are present during the
/replication/secondary/stop test (that test runs immediately after
/replication/secondary/start, which tests failover). This causes the
corruption detector to fire.
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-Id: <1477565348-5458-2-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Drop the use of legacy options in favor of the SocketAddress
representation, even for internal use (i.e. for storing the result of
the filename parsing).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a new option "server" to the NBD block driver which accepts a
SocketAddress.
"path", "host" and "port" are still supported as legacy options and are
mapped to their corresponding SocketAddress representation.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Right now, we have four possible options that conflict with specifying
an NBD filename, and a future patch will add another one ("address").
This future option is a nested QDict that is flattened at this point,
requiring us to test each option whether its key has an "address."
prefix. Therefore, we will then need to iterate through all options
(including the "export" option which was not covered so far).
Adding this iteration logic now will simplify adding the new option
later. A nice side effect is that the user will not receive a long list
of five options which are not supposed to be specified with a filename,
but we can actually print the problematic option.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>