This patch reduce CPU usage of flush operations a bit. When we have one
flush completed we should kick only next operation. We should not start
all pending operations in the hope that they will go back to wait on
wait_queue.
Also there is a technical possibility that requests will get reordered
with the previous approach. After wakeup all requests are removed from
the wait queue. They become active and they are processed one-by-one
adding to the wait queue in the same order. Though new flush can arrive
while all requests are not put into the queue.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
Message-id: 1471457214-3994-3-git-send-email-den@openvz.org
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The following commit
commit 3ff2f67a7c
Author: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
Date: Mon Jul 18 22:39:52 2016 +0300
block: ignore flush requests when storage is clean
has introduced a regression.
There is a problem that it is still possible for 2 requests to execute
in non sequential fashion and sometimes this results in a deadlock
when bdrv_drain_one/all are called for BDS with such stalled requests.
1. Current flushed_gen and flush_started_gen is 1.
2. Request 1 enters bdrv_co_flush to with write_gen 1 (i.e. the same
as flushed_gen). It gets past flushed_gen != flush_started_gen and
sets flush_started_gen to 1 (again, the same it was before).
3. Request 1 yields somewhere before exiting bdrv_co_flush
4. Request 2 enters bdrv_co_flush with write_gen 2. It gets past
flushed_gen != flush_started_gen and sets flush_started_gen to 2.
5. Request 2 runs to completion and sets flushed_gen to 2
6. Request 1 is resumed, runs to completion and sets flushed_gen to 1.
However flush_started_gen is now 2.
From here on out flushed_gen is always != to flush_started_gen and all
further requests will wait on flush_queue. This change replaces
flush_started_gen with an explicitly tracked active flush request.
Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Message-id: 1471457214-3994-2-git-send-email-den@openvz.org
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Currently "make docker-test-mingw@fedora" has a warning like:
/tmp/qemu-test/src/block/curl.c: In function 'curl_sock_cb':
/tmp/qemu-test/src/block/curl.c:172:6: warning: format '%d' expects
argument of type 'int', but argument 4 has type 'curl_socket_t {aka long
long unsigned int}'
DPRINTF("CURL (AIO): Sock action %d on fd %d\n", action, fd);
^
cc1: all warnings being treated as errors
Cast to int to suppress it.
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-Id: <1470027888-24381-1-git-send-email-famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Store the runtime option values in the BDRVNBDState so they can later be
used in nbd_refresh_filename() without having to directly access the
options QDict which may contain values of non-string types.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Store the configuration file's filename so it can later be used in
bdrv_refresh_filename() without having to directly access the options
QDict which may contain a value of a non-string type.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Using QemuOpts will prevent qemu from crashing if the input options have
not been validated (which is the case when they are specified on the
command line or in a json: filename) and some have the wrong type.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Using QemuOpts will prevent qemu from crashing if the input options have
not been validated (which is the case when they are specified on the
command line or in a json: filename) and some have the wrong type.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It is generally not expected that io_submit() fails other than with
-EAGAIN, but corner cases like SELinux refusing I/O when permissions are
revoked are still possible. In this case, we shouldn't abort, but just
return an I/O error for the request.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1470741619-23231-1-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Stop to produce new async copy requests from mirror_iteration if
critical error (error action = BLOCK_ERROR_ACTION_REPORT) detected.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Before this patch incorrect image could be created via qemu-img
(Example: qemu-img create -f parallels -o size=4096T hack.img),
incorrect images cannot be used due to overflow in main image structure.
This patch add check of size in image creation.
After reading size it compare it with UINT32_MAX * cluster_size.
Signed-off-by: Klim Kireev <proffk@virtuozzo.mipt.ru>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Message-id: 1469639300-12155-1-git-send-email-den@openvz.org
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Dell Equallogic iSCSI SANs have a very unusual advertised geometry:
$ iscsi-inq -e 1 -c $((0xb0)) iscsi://XXX/0
wsnz:0
maximum compare and write length:1
optimal transfer length granularity:0
maximum transfer length:0
optimal transfer length:0
maximum prefetch xdread xdwrite transfer length:0
maximum unmap lba count:30720
maximum unmap block descriptor count:2
optimal unmap granularity:30720
ugavalid:1
unmap granularity alignment:0
maximum write same length:30720
which says that both the maximum and the optimal discard size
is 15M. It is not immediately apparent if the device allows
discard requests not aligned to the optimal size, nor if it
allows discards at a finer granularity than the optimal size.
I tried to find details in the SCSI Commands Reference Manual
Rev. A on what valid values of maximum and optimal sizes are
permitted, but while that document mentions a "Block Limits
VPD Page", I couldn't actually find documentation of that page
or what values it would have, or if a SCSI device has an
advertisement of its minimal unmap granularity. So it is not
obvious to me whether the Dell Equallogic device is compliance
with the SCSI specification.
Fortunately, it is easy enough to support non-power-of-2 sizing,
even if it means we are less efficient than truly possible when
targetting that device (for example, it means that we refuse to
unmap anything that is not a multiple of 15M and aligned to a
15M boundary, even if the device truly does support a smaller
granularity where unmapping actually works).
Reported-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1469129688-22848-5-git-send-email-eblake@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rather than asserting that nbdflags is within range, just give
it the correct type to begin with :) nbdflags corresponds to
the per-export portion of NBD Protocol "transmission flags", which
is 16 bits in response to NBD_OPT_EXPORT_NAME and NBD_OPT_GO.
Furthermore, upstream NBD has never passed the global flags to
the kernel via ioctl(NBD_SET_FLAGS) (the ioctl was first
introduced in NBD 2.9.22; then a latent bug in NBD 3.1 actually
tried to OR the global flags with the transmission flags, with
the disaster that the addition of NBD_FLAG_NO_ZEROES in 3.9
caused all earlier NBD 3.x clients to treat every export as
read-only; NBD 3.10 and later intentionally clip things to 16
bits to pass only transmission flags). Qemu should follow suit,
since the current two global flags (NBD_FLAG_FIXED_NEWSTYLE
and NBD_FLAG_NO_ZEROES) have no impact on the kernel's behavior
during transmission.
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1469129688-22848-3-git-send-email-eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mirror can do up to 16 in-flight requests, but actually on full copy
(the whole source disk is non-zero) in-flight is always 1. This happens
as the request is not limited in size: the data occupies maximum available
capacity of s->buf.
The patch limits the size of the request to some artificial constant
(1 Mb here), which is not that big or small. This effectively enables
back parallelism in mirror code as it was designed.
The result is important: the time to migrate 10 Gb disk is reduced from
~350 sec to 170 sec.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1468516741-82174-1-git-send-email-vsementsov@virtuozzo.com
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
The qemu-img info command has the ability to expose format
specific metadata about volumes. Wire up this facility for
the LUKS driver to report on cipher configuration and key
slot usage.
$ qemu-img info ~/VirtualMachines/demo.luks
image: /home/berrange/VirtualMachines/demo.luks
file format: luks
virtual size: 98M (102760448 bytes)
disk size: 100M
encrypted: yes
Format specific information:
ivgen alg: plain64
hash alg: sha1
cipher alg: aes-128
uuid: 6ddee74b-3a22-408c-8909-6789d4fa2594
cipher mode: xts
slots:
[0]:
active: true
iters: 572706
key offset: 4096
stripes: 4000
[1]:
active: false
key offset: 135168
[2]:
active: false
key offset: 266240
[3]:
active: false
key offset: 397312
[4]:
active: false
key offset: 528384
[5]:
active: false
key offset: 659456
[6]:
active: false
key offset: 790528
[7]:
active: false
key offset: 921600
payload offset: 2097152
master key iters: 142375
One somewhat undesirable artifact is that the data fields are
printed out in (apparently) random order. This will be addressed
later by changing the way the block layer pretty-prints the
image specific data.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 1469192015-16487-3-git-send-email-berrange@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
There are no needs to allocate more than one cluster, as we set
avail_out for deflate to one cluster.
Zlib docs (http://www.zlib.net/manual.html) says:
"deflate compresses as much data as possible, and stops when the input
buffer becomes empty or the output buffer becomes full."
So, deflate will not write more than avail_out to output buffer. If
there is not enough space in output buffer for compressed data (it may
be larger than input data) deflate just returns Z_OK. (if all data is
compressed and written to output buffer deflate returns Z_STREAM_END).
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 1468515565-81313-1-git-send-email-vsementsov@virtuozzo.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
- interrupt remapping for intel iommus
- a bunch of virtio cleanups
- fixes all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXkQsqAAoJECgfDbjSjVRpanoIAJ9JVlc1aEjt9sa0cSBcs+NQ
J7JmgU9FqFsj+4FrNTouO3AxTjHurd1UAULP1WMPD+V3JpbnHct8r6SCBLQ5EBMN
VOjYo4DwWs1g+DqnQ9WZmbadu06XvYi/yiAKNUzWfZk0MR11D0D/S5hmarNKw0Kq
tGHeTWjGeY4WqFLV7m+qB4+cqkAByn6um99UtUvgLL05RgIEIP2IEMKYZ+rXvAa9
iGUvzqlO7mbq/+LbL18kaWywa4TCwbbd2eSGWaqhX4CuB62Rl33mWTXFcfaYhkyp
Z3FgwaJ09h0lAjSVEbyAuLFMfO/BnMcsoKqwl4xc4vkn/xBCqFtgH9JcEVm3O8U=
=ge2D
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
pc, pci, virtio: new features, cleanups, fixes
- interrupt remapping for intel iommus
- a bunch of virtio cleanups
- fixes all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# gpg: Signature made Thu 21 Jul 2016 18:49:30 BST
# gpg: using RSA key 0x281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>"
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>"
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream: (57 commits)
intel_iommu: avoid unnamed fields
virtio: Update migration docs
virtio-gpu: Wrap in vmstate
virtio-gpu: Use migrate_add_blocker for virgl migration blocking
virtio-input: Wrap in vmstate
9pfs: Wrap in vmstate
virtio-serial: Wrap in vmstate
virtio-net: Wrap in vmstate
virtio-balloon: Wrap in vmstate
virtio-rng: Wrap in vmstate
virtio-blk: Wrap in vmstate
virtio-scsi: Wrap in vmstate
virtio: Migration helper function and macro
virtio-serial: Remove old migration version support
virtio-net: Remove old migration version support
virtio-scsi: Replace HandleOutput typedef
Revert "mirror: Workaround for unexpected iohandler events during completion"
virtio-scsi: Call virtio_add_queue_aio
virtio-blk: Call virtio_add_queue_aio
virtio: Introduce virtio_add_queue_aio
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This reverts commit ab27c3b5e7.
The virtio storage device host notifiers now work with
bdrv_drained_begin/end, so we don't need this hack any more.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Since the raw format driver is just passing things through, we can
do byte-based read and write if the underlying protocol does
likewise.
There's one tricky part - if we probed the image format, we document
that we restrict operations on the initial sector. It's easiest to
keep this guarantee by enforcing read-modify-write on sub-sector
operations (yes, this partially reverts commit ad82be2f).
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 1468624988-423-20-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The NBD protocol doesn't have any notion of sectors, so it is
a fairly easy conversion to use byte-based read and write.
Signed-off-by: Eric Blake <eblake@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1468624988-423-19-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now that all drivers have a byte-based .bdrv_co_pdiscard(), we
no longer need to worry about the sector-based version. We can
also relax our minimum alignment to 1 for drivers that support it.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-18-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Another step towards killing off sector-based block APIs.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-17-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Another step towards killing off sector-based block APIs.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-16-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Another step towards killing off sector-based block APIs.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-15-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Another step towards killing off sector-based block APIs.
While at it, call directly into nbd-client.c instead of having
a pointless trivial wrapper in nbd.c.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-14-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Another step towards killing off sector-based block APIs.
Unlike write_zeroes, where we can be handed unaligned requests
and must fail gracefully with -ENOTSUP for a fallback, we are
guaranteed that discard requests are always aligned because the
block layer already ignored unaligned head/tail.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-13-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Another step towards killing off sector-based block APIs.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-12-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Another step towards killing off sector-based block APIs.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-11-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
There's enough drivers with a sector-based callback that it will
be easier to switch one at a time. This patch adds a byte-based
callback, and then after all drivers are swapped, we'll drop the
sector-based callback.
[checkpatch doesn't like the space after coroutine_fn in
block_int.h, but it's consistent with the rest of the file]
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-10-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Another step towards byte-based interfaces everywhere. Replace
the sector-based driver callback .bdrv_aio_discard() with a new
byte-based .bdrv_aio_pdiscard(). Only raw-posix and RBD drivers
are affected, so it was not worth splitting into multiple patches.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-9-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The internal function converts to byte-based before calling into
RBD code; hoist the conversion to the callers so that callers
can then be switched to byte-based themselves.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-8-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The only remaining uses of paio_submit() were flush (with no
offset or count) and discard (which we are switching to byte-based);
furthermore, the similarly named paio_submit_co() is already
byte-based.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-7-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Change sector-based blk_discard(), blk_co_discard(), and
blk_aio_discard() to instead be byte-based blk_pdiscard(),
blk_co_pdiscard(), and blk_aio_pdiscard(). NBD gets a lot
simpler now that ignoring the unaligned portion of a
byte-based discard request is handled under the hood by
the block layer.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-6-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Another step towards byte-based interfaces everywhere. Replace
the sector-based bdrv_aio_discard() with a new byte-based
bdrv_aio_pdiscard(), which silently ignores any unaligned head
or tail. Driver callbacks will be converted in followup patches.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 1468624988-423-5-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
BlockRequest is the internal struct used by bdrv_aio_*. At the
moment, all such calls were sector-based, but we will eventually
convert to byte-based; start by changing the internal variables
to be byte-based. No change to behavior, although the read and
write code can now go byte-based through more of the stack.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 1468624988-423-4-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Another step towards byte-based interfaces everywhere. Replace
the sector-based bdrv_discard() with a new byte-based
bdrv_pdiscard(), which silently ignores any unaligned head
or tail.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-3-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Another step towards byte-based interfaces everywhere. Replace
the sector-based bdrv_co_discard() with a new byte-based
bdrv_co_pdiscard(), which silently ignores any unaligned head
or tail. Driver callbacks will be converted in followup patches.
By calculating the alignment outside of the loop, and clamping
the max discard to an aligned value, we can simplify the actions
done within the loop.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-2-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now that the block layer honors max_request, we don't need to
bother with an EINVAL on overlarge requests, but can instead
assert that requests are well-behaved.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1468607524-19021-7-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now that NBD relies on the block layer to fragment things, we no
longer need to track an offset argument for which fragment of
a request we are actually servicing.
While at it, use true and false instead of 0 and 1 for a bool
parameter.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1468607524-19021-6-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now that the block layer will honor max_transfer, we can simplify
our code to rely on that guarantee.
The readv code can call directly into nbd-client, just as the
writev code has done since commit 52a4650.
Interestingly enough, while qemu-io 'w 0 40m' splits into a 32M
and 8M transaction, 'w -z 0 40m' splits into two 16M and an 8M,
because the block layer caps the bounce buffer for writing zeroes
at 16M. When we later introduce support for NBD_CMD_WRITE_ZEROES,
we can get a full 32M zero write (or larger, if the client and
server negotiate that write zeroes can use a larger size than
ordinary writes).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1468607524-19021-5-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Drivers should be able to rely on the block layer honoring the
max transfer length, rather than needing to return -EINVAL
(iscsi) or manually fragment things (nbd). We already fragment
write zeroes at the block layer; this patch adds the fragmentation
for normal writes, after requests have been aligned (fragmenting
before alignment would lead to multiple unaligned requests, rather
than just the head and tail).
When fragmenting a large request where FUA was requested, but
where we know that FUA is implemented by flushing all requests
rather than the given request, then we can still get by with
only one flush. Note, however, that we need a followup patch
to the raw format driver to avoid a regression in the number of
flushes actually issued.
The return value was previously nebulous on success (sometimes
zero, sometimes the length written); since we never have a short
write, and since fragmenting may store yet another positive
value in 'ret', change the function to always return 0 on success,
matching what we do in bdrv_aligned_preadv().
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 1468607524-19021-4-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The raw format layer supports all flags via passthrough - but
it only makes sense to pass through flags that the lower layer
actually supports.
The next patch gives stronger reasoning for why this is correct.
At the moment, the raw format layer ignores the max_transfer
limit of its protocol layer, and an attempt to do the qemu-io
'w -f 0 40m' to an NBD server that lacks FUA will pass the entire
40m request to the NBD driver, which then fragments the request
itself into a 32m write, 8m write, and flush. But once the block
layer starts honoring limits and fragmenting packets, the raw
driver will hand the NBD driver two separate requests; if both
requests have BDRV_REQ_FUA set, then this would result in a 32m
write, flush, 8m write, and second flush. By having the raw
layer no longer advertise FUA support when the protocol layer
lacks it, we are back to a single flush at the block layer for
the overall 40m request.
Note that 'w -f -z 0 40m' does not currently exhibit the same
problem, because there, the fragmentation does not occur until
at the NBD layer (the raw layer has .bdrv_co_pwrite_zeroes, and
the NBD layer doesn't advertise max_pwrite_zeroes to constrain
things at the raw layer) - but the problem is latent and we
would again have too many flushes without this patch once the
NBD layer implements support for the new NBD_CMD_WRITE_ZEROES
command, if it sets max_pwrite_zeroes to the same 32m limit as
recommended by the NBD protocol.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468607524-19021-3-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Drivers should be able to rely on the block layer honoring the
max transfer length, rather than needing to return -EINVAL
(iscsi) or manually fragment things (nbd). This patch adds
the fragmentation in the block layer, after requests have been
aligned (fragmenting before alignment would lead to multiple
unaligned requests, rather than just the head and tail).
The return value was previously nebulous on success on whether
it was zero or the length read; and fragmenting may introduce
yet other non-zero values if we use the last length read. But
as at least some callers are sloppy and expect only zero on
success, it is easiest to just guarantee 0.
[Fix uninitialized ret local variable in bdrv_aligned_preadv().
--Stefan]
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 1468607524-19021-2-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch adds a way to specify multiple volfile servers to the gluster
block backend of QEMU with tcp|rdma transport types and their port numbers.
Problem:
Currently VM Image on gluster volume is specified like this:
file=gluster[+tcp]://host[:port]/testvol/a.img
Say we have three hosts in a trusted pool with replica 3 volume in action.
When the host mentioned in the command above goes down for some reason,
the other two hosts are still available. But there's currently no way
to tell QEMU about them.
Solution:
New way of specifying VM Image on gluster volume with volfile servers:
(We still support old syntax to maintain backward compatibility)
Basic command line syntax looks like:
Pattern I:
-drive driver=gluster,
volume=testvol,path=/path/a.raw,[debug=N,]
server.0.type=tcp,
server.0.host=1.2.3.4,
server.0.port=24007,
server.1.type=unix,
server.1.socket=/path/socketfile
Pattern II:
'json:{"driver":"qcow2","file":{"driver":"gluster",
"volume":"testvol","path":"/path/a.qcow2",["debug":N,]
"server":[{hostinfo_1}, ...{hostinfo_N}]}}'
driver => 'gluster' (protocol name)
volume => name of gluster volume where our VM image resides
path => absolute path of image in gluster volume
[debug] => libgfapi loglevel [(0 - 9) default 4 -> Error]
{hostinfo} => {{type:"tcp",host:"1.2.3.4"[,port=24007]},
{type:"unix",socket:"/path/sockfile"}}
type => transport type used to connect to gluster management daemon,
it can be tcp|unix
host => host address (hostname/ipv4/ipv6 addresses/socket path)
port => port number on which glusterd is listening.
socket => path to socket file
Examples:
1.
-drive driver=qcow2,file.driver=gluster,
file.volume=testvol,file.path=/path/a.qcow2,file.debug=9,
file.server.0.type=tcp,
file.server.0.host=1.2.3.4,
file.server.0.port=24007,
file.server.1.type=unix,
file.server.1.socket=/var/run/glusterd.socket
2.
'json:{"driver":"qcow2","file":{"driver":"gluster","volume":"testvol",
"path":"/path/a.qcow2","debug":9,"server":
[{"type":"tcp","host":"1.2.3.4","port":"24007"},
{"type":"unix","socket":"/var/run/glusterd.socket"}
]}}'
This patch gives a mechanism to provide all the server addresses, which are in
replica set, so in case host1 is down VM can still boot from any of the
active hosts.
This is equivalent to the backup-volfile-servers option supported by
mount.glusterfs (FUSE way of mounting gluster volume)
credits: sincere thanks to all the supporters
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-id: 1468947453-5433-6-git-send-email-prasanna.kalever@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
this patch adds 'GlusterServer' related schema in qapi/block-core.json
[Jeff: minor fix-ups of comments and formatting, per patch reviews]
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1468947453-5433-5-git-send-email-prasanna.kalever@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
gluster volfile server fetch happens through unix and/or tcp, it doesn't
support volfile fetch over rdma. The rdma code may actually mislead,
so to make sure things do not break, for now we fallback to tcp when requested
for rdma, with a warning.
If you are wondering how this worked all these days, its the gluster libgfapi
code which handles anything other than unix transport as socket/tcp, sad but
true.
Also gluster doesn't support ipv6 addresses, removing the ipv6 related
comments/docs section
[Jeff: Minor grammatical fixes in comments and commit message, per
review comments]
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1468947453-5433-4-git-send-email-prasanna.kalever@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
unified coding styles of multiline function arguments and other error functions
moved random declarations of structures and other list variables
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1468947453-5433-3-git-send-email-prasanna.kalever@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>