Deciphering the hard-coded list of integer return values from
nbd_start_negotiate() will only get more confusing when adding support
for 64-bit extended headers. Better is to name things in an enum.
Although the function in question is private to client.c, putting the
enum in a public header and including an enum-to-string conversion
will allow its use in more places in upcoming patches.
The enum is intentionally laid out so that operators like <= can be
used to group multiple modes with similar characteristics, and where
the least powerful mode has value 0, even though this patch does not
exploit that. No semantic change intended.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20230608135653.2918540-9-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Our existing use of structured replies either reads into a qiov capped
at 32M (NBD_CMD_READ) or caps allocation to 1000 bytes (see
NBD_MAX_MALLOC_PAYLOAD in block/nbd.c). But the existing length
checks are rather late; if we encounter a buggy (or malicious) server
that sends a super-large payload length, we should drop the connection
right then rather than assuming the layer on top will be careful.
This becomes more important when we permit 64-bit lengths which are
even more likely to have the potential for attempted denial of service
abuse.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-ID: <20230608135653.2918540-8-eblake@redhat.com>
Externally, libnbd exposed the 64-bit opaque marker for each client
NBD packet as the "cookie", because it was less confusing when
contrasted with 'struct nbd_handle *' holding all libnbd state. It
also avoids confusion between the noun 'handle' as a way to identify a
packet and the verb 'handle' for reacting to things like signals.
Upstream NBD changed their spec to favor the name "cookie" based on
libnbd's recommendations[1], so we can do likewise.
[1] https://github.com/NetworkBlockDevice/nbd/commit/ca4392eb2b
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20230608135653.2918540-6-eblake@redhat.com>
[eblake: typo fix]
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Part of NBD's 64-bit headers extension involves passing the client's
requested offset back as part of the reply header (one reason it
stated for this change: converting absolute offsets stored in
NBD_REPLY_TYPE_OFFSET_DATA to relative offsets within the buffer is
easier if the absolute offset of the buffer is also available). This
is a refactoring patch to pass the full request around the reply
stack, rather than just the handle, so that later patches can then
access request->from when extended headers are active. Meanwhile,
this patch enables us to now assert that simple replies are only
attempted when appropriate, and otherwise has no semantic change.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-ID: <20230608135653.2918540-5-eblake@redhat.com>
Upstream NBD now documents[1] an extension that supports 64-bit effect
lengths in requests. As part of that extension, the size of the reply
headers will change in order to permit a 64-bit length in the reply
for symmetry[2]. Additionally, where the reply header is currently 16
bytes for simple reply, and 20 bytes for structured reply; with the
extension enabled, there will only be one extended reply header, of 32
bytes, with both structured and extended modes sending identical
payloads for chunked replies.
Since we are already wired up to use iovecs, it is easiest to allow
for this change in header size by splitting each structured reply
across multiple iovecs, one for the header (which will become wider in
a future patch according to client negotiation), and the other(s) for
the chunk payload, and removing the header from the payload struct
definitions. Rename the affected functions with s/structured/chunk/
to make it obvious that the code will be reused in extended mode.
Interestingly, the client side code never utilized the packed types,
so only the server code needs to be updated.
[1] https://github.com/NetworkBlockDevice/nbd/blob/extension-ext-header/doc/proto.md
as of NBD commit e6f3b94a934
[2] Note that on the surface, this is because some future server might
permit a 4G+ NBD_CMD_READ and need to reply with that much data in one
transaction. But even though the extended reply length is widened to
64 bits, for now the NBD spec is clear that servers will not reply
with more than a maximum payload bounded by the 32-bit
NBD_INFO_BLOCK_SIZE field; allowing a client and server to mutually
agree to transactions larger than 4G would require yet another
extension.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20230608135653.2918540-4-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Assigning strlen() to a uint32_t and then asserting that it isn't too
large doesn't catch the case of an input string 4G in length.
Thankfully, the incoming strings can never be that large: if the
export name or query is reflecting a string the client got from the
server, we already guarantee that we dropped the NBD connection if the
server sent more than 32M in a single reply to our NBD_OPT_* request;
if the export name is coming from qemu, nbd_receive_negotiate()
asserted that strlen(info->name) <= NBD_MAX_STRING_SIZE; and
similarly, a query string via x->dirty_bitmap coming from the user was
bounds-checked in either qemu-nbd or by the limitations of QMP.
Still, it doesn't hurt to be more explicit in how we write our
assertions to not have to analyze whether inadvertent wraparound is
possible.
Fixes: 93676c88 ("nbd: Don't send oversize strings", v4.2.0)
Reported-by: Dr. David Alan Gilbert <dave@treblig.org>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-ID: <20230608135653.2918540-2-eblake@redhat.com>
Mechanical change running Coccinelle spatch with content
generated from the qom-cast-macro-clean-cocci-gen.py added
in the previous commit.
Suggested-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20230601093452.38972-3-philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
nbd_drained_poll() generally runs in the main thread, not whatever
iothread the NBD server coroutine is meant to run in, so it can't
directly reenter the coroutines to wake them up.
The code seems to have the right intention, it specifies the correct
AioContext when it calls qemu_aio_coroutine_enter(). However, this
functions doesn't schedule the coroutine to run in that AioContext, but
it assumes it is already called in the home thread of the AioContext.
To fix this, add a new thread-safe qio_channel_wake_read() that can be
called in the main thread to wake up the coroutine in its AioContext,
and use this in nbd_drained_poll().
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230517152834.277483-3-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
exp->common.blk cannot be NULL, nbd_export_delete() is only called (through
a bottom half) from blk_exp_unref() and in turn that can only happen
after blk_exp_add() has asserted exp->blk != NULL.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
qemu-nbd doesn't set TCP_NODELAY on the tcp socket.
Kernel waits for more data and avoids transmission of small packets.
Without TLS this is barely noticeable, but with TLS this really shows.
Booting a VM via qemu-nbd on localhost (with tls) takes more than
2 minutes on my system. tcpdump shows frequent wait periods, where no
packets get sent for a 40ms period.
Add explicit (un)corking when processing (and responding to) requests.
"TCP_CORK, &zero" after earlier "CORK, &one" will flush pending data.
VM Boot time:
main: no tls: 23s, with tls: 2m45s
patched: no tls: 14s, with tls: 15s
VM Boot time, qemu-nbd via network (same lan):
main: no tls: 18s, with tls: 1m50s
patched: no tls: 17s, with tls: 18s
Future optimization: if we could detect if there is another pending
request we could defer the uncork operation because more data would be
appended.
Signed-off-by: Florian Westphal <fw@strlen.de>
Message-Id: <20230324104720.2498-1-fw@strlen.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We have two inclusion loops:
block/block.h
-> block/block-global-state.h
-> block/block-common.h
-> block/blockjob.h
-> block/block.h
block/block.h
-> block/block-io.h
-> block/block-common.h
-> block/blockjob.h
-> block/block.h
I believe these go back to Emanuele's reorganization of the block API,
merged a few months ago in commit d7e2fe4aac.
Fortunately, breaking them is merely a matter of deleting unnecessary
includes from headers, and adding them back in places where they are
now missing.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20221221133551.3967339-2-armbru@redhat.com>
qemu/coroutine.h and qemu/lockable.h include each other.
They need each other only in macro expansions, so we could simply drop
both inclusions to break the loop, and add suitable includes to files
that expand the macros.
Instead, move a part of qemu/coroutine.h to new qemu/coroutine-core.h
so that qemu/coroutine-core.h doesn't need qemu/lockable.h, and
qemu/lockable.h only needs qemu/coroutine-core.h. Result:
qemu/coroutine.h includes qemu/lockable.h includes
qemu/coroutine-core.h.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20221221131435.3851212-5-armbru@redhat.com>
[Semantic rebase conflict with 7c10cb38cc "accel/tcg: Add debuginfo
support" resolved]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20221221131435.3851212-2-armbru@redhat.com>
Avoid mixing bdrv_* functions with blk_*, so create blk_* counterparts
for bdrv_block_status_above and bdrv_is_allocated_above.
Note that since blk_co_block_status_above only calls the g_c_w function
bdrv_common_block_status_above and is marked as coroutine_fn, call
directly bdrv_co_common_block_status_above() to avoid using a g_c_w.
Same applies to blk_co_is_allocated_above.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20221128142337.657646-5-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These functions end up calling bdrv_*() implemented as generated_co_wrapper
functions.
In addition, they also happen to be always called in coroutine context,
meaning all callers are coroutine_fn.
This means that the g_c_w function will enter the qemu_in_coroutine()
case and eventually suspend (or in other words call qemu_coroutine_yield()).
Therefore we can mark such functions coroutine_fn too.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20221128142337.657646-4-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block-export-add argument @name defaults to the value of argument
@node-name.
nbd_export_create() implements this by copying @node_name to @name.
It leaves @has_node_name false, violating the "has_node_name ==
!!node_name" invariant. Unclean. Falls apart when we elide
@has_node_name (next commit): then QAPI frees the same value twice,
once for @node_name and once @name. iotest 307 duly explodes.
Goes back to commit c62d24e906 "blockdev-nbd: Boxed argument type for
nbd-server-add" (v5.0.0). Got moved from qmp_nbd_server_add() to
nbd_export_create() (commit 56ee86261e), then copied back (commit
b6076afcab). Commit 8675cbd68b "nbd: Utilize QAPI_CLONE for type
conversion" (v5.2.0) cleaned up the copy in qmp_nbd_server_add()
noting
Second, our assignment to arg->name is fishy: the generated QAPI code
for qapi_free_NbdServerAddOptions does not visit arg->name if
arg->has_name is false, but if it DID visit it, we would have
introduced a double-free situation when arg is finally freed.
Exactly. However, the copy in nbd_export_create() remained dirty.
Clean it up. Since the value stored in member @name is not actually
used outside this function, use a local variable instead of modifying
the QAPI object.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Cc: qemu-block@nongnu.org
Message-Id: <20221104160712.3005652-10-armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Swap 'buf' and 'bytes' around for consistency with
blk_co_{pread,pwrite}(), and in preparation to implement these functions
using generated_co_wrapper.
Callers were updated using this Coccinelle script:
@@ expression blk, offset, buf, bytes, flags; @@
- blk_pread(blk, offset, buf, bytes, flags)
+ blk_pread(blk, offset, bytes, buf, flags)
@@ expression blk, offset, buf, bytes, flags; @@
- blk_pwrite(blk, offset, buf, bytes, flags)
+ blk_pwrite(blk, offset, bytes, buf, flags)
It had no effect on hw/block/nand.c, presumably due to the #if, so that
file was updated manually.
Overly-long lines were then fixed by hand.
Signed-off-by: Alberto Faria <afaria@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220705161527.1054072-4-afaria@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
For consistency with other I/O functions, and in preparation to
implement it using generated_co_wrapper.
Callers were updated using this Coccinelle script:
@@ expression blk, offset, buf, bytes; @@
- blk_pread(blk, offset, buf, bytes)
+ blk_pread(blk, offset, buf, bytes, 0)
It had no effect on hw/block/nand.c, presumably due to the #if, so that
file was updated manually.
Overly-long lines were then fixed by hand.
Signed-off-by: Alberto Faria <afaria@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220705161527.1054072-3-afaria@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
At the moment there are 2 sources of lengthy operations if configured:
* open connection, which could retry inside and
* reconnect of already opened connection
These operations could be quite lengthy and cumbersome to catch thus
it would be quite natural to add trace points for them.
This patch is based on the original downstream work made by Vladimir.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Eric Blake <eblake@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Hanna Reitz <hreitz@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
According to the NBD spec, a server that advertises
NBD_FLAG_CAN_MULTI_CONN promises that multiple client connections will
not see any cache inconsistencies: when properly separated by a single
flush, actions performed by one client will be visible to another
client, regardless of which client did the flush.
We always satisfy these conditions in qemu - even when we support
multiple clients, ALL clients go through a single point of reference
into the block layer, with no local caching. The effect of one client
is instantly visible to the next client. Even if our backend were a
network device, we argue that any multi-path caching effects that
would cause inconsistencies in back-to-back actions not seeing the
effect of previous actions would be a bug in that backend, and not the
fault of caching in qemu. As such, it is safe to unconditionally
advertise CAN_MULTI_CONN for any qemu NBD server situation that
supports parallel clients.
Note, however, that we don't want to advertise CAN_MULTI_CONN when we
know that a second client cannot connect (for historical reasons,
qemu-nbd defaults to a single connection while nbd-server-add and QMP
commands default to unlimited connections; but we already have
existing means to let either style of NBD server creation alter those
defaults). This is visible by no longer advertising MULTI_CONN for
'qemu-nbd -r' without -e, as in the iotest nbd-qemu-allocation.
The harder part of this patch is setting up an iotest to demonstrate
behavior of multiple NBD clients to a single server. It might be
possible with parallel qemu-io processes, but I found it easier to do
in python with the help of libnbd, and help from Nir and Vladimir in
writing the test.
Signed-off-by: Eric Blake <eblake@redhat.com>
Suggested-by: Nir Soffer <nsoffer@redhat.com>
Suggested-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
Message-Id: <20220512004924.417153-3-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hi all! Current logic of relying on search through backing chain is not
safe neither convenient.
Sometimes it leads to necessity of extra bitmap copying. Also, we are
going to add "snapshot-access" driver, to access some snapshot state
through NBD. And this driver is not formally a filter, and of course
it's not a COW format driver. So, searching through backing chain will
not work. Instead of widening the workaround of bitmap searching, let's
extend the interface so that user can select bitmap precisely.
Note, that checking for bitmap active status is not copied to the new
API, I don't see a reason for it, user should understand the risks. And
anyway, bitmap from other node is unrelated to this export being
read-only or read-write.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Message-Id: <20220314213226.362217-3-v.sementsov-og@mail.ru>
[eblake: Adjust S-o-b to Vladimir's new email, with permission]
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
One less qemu-specific macro. It also helps to make some headers/units
only depend on glib, and thus moved in standalone projects eventually.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
- Dan Berrange: Allow qemu-nbd to support TLS over Unix sockets
- Eric Blake: Minor cleanups related to 64-bit block operations
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEccLMIrHEYCkn0vOqp6FrSiUnQ2oFAmImtE8ACgkQp6FrSiUn
Q2ovmgf/aksDqf2eNcahs++fez+8Qi9ll5OY/qGyjnzBgsatYKjrK+xF7OnjoJox
eRX026lh81Q4EQK7oZBUnr2UCY4bncDBTI7MTLh603EV/tId5ZLwx007ERhzvtC1
mIsQHXNuO9X25LQG2eWnfunY9YztQpiT5r/g3khD2yPBqJWIvBfblzPLx6FkF7px
/WM8xEKCihmGr1Wr3b+zGYL083YkaBWCvHoR8mJt3tEFUj+Qie8XcdV0OVyI0XUj
5goIFRcpVwBE8P2nLtfUKNzEXz22cmdonOJUX7E5IvGO21k5F/HrWlHdo8JnuSUZ
t0w5L9yCxBrRpY1burz30b77J0WMCw==
=C8Dd
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2022-03-07' into staging
nbd patches for 2022-03-07
- Dan Berrange: Allow qemu-nbd to support TLS over Unix sockets
- Eric Blake: Minor cleanups related to 64-bit block operations
# gpg: Signature made Tue 08 Mar 2022 01:41:35 GMT
# gpg: using RSA key 71C2CC22B1C4602927D2F3AAA7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg: aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg: aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2 F3AA A7A1 6B4A 2527 436A
* remotes/ericb/tags/pull-nbd-2022-03-07:
qemu-io: Allow larger write zeroes under no fallback
qemu-io: Utilize 64-bit status during map
nbd/server: Minor cleanups
tests/qemu-iotests: validate NBD TLS with UNIX sockets and PSK
tests/qemu-iotests: validate NBD TLS with UNIX sockets
tests/qemu-iotests: validate NBD TLS with hostname mismatch
tests/qemu-iotests: convert NBD TLS test to use standard filters
tests/qemu-iotests: introduce filter for qemu-nbd export list
tests/qemu-iotests: expand _filter_nbd rules
tests/qemu-iotests: add QEMU_IOTESTS_REGEN=1 to update reference file
block/nbd: don't restrict TLS usage to IP sockets
qemu-nbd: add --tls-hostname option for TLS certificate validation
block/nbd: support override of hostname for TLS certificate validation
block: pass desired TLS hostname through from block driver client
crypto: mandate a hostname when checking x509 creds on a client
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Spelling fixes, grammar improvements and consistent spacing, noticed
while preparing other patches in this file.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20211203231539.3900865-2-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
In
commit a71d597b98
Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Date: Thu Jun 10 13:08:00 2021 +0300
block/nbd: reuse nbd_co_do_establish_connection() in nbd_open()
the use of the 'hostname' field from the BDRVNBDState struct was
lost, and 'nbd_connect' just hardcoded it to match the IP socket
address. This was a harmless bug at the time since we block use
with anything other than IP sockets.
Shortly though, we want to allow the caller to override the hostname
used in the TLS certificate checks. This is to allow for TLS
when doing port forwarding or tunneling. Thus we need to reinstate
the passing along of the 'hostname'.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20220304193610.3293146-3-berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Move the various memalign-related functions out of osdep.h and into
their own header, which we include only where they are used.
While we're doing this, add some brief documentation comments.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-id: 20220226180723.1706285-10-peter.maydell@linaro.org
NBDRequestData struct has unused QSIMPLEQ_ENTRY field. It seems that
this field exists since the first git commit and was never used.
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Message-Id: <20220111194313.581486-1-nsoffer@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Fixes: d9a73806 ("qemu-nbd: introduce NBDRequest", v1.1)
Signed-off-by: Eric Blake <eblake@redhat.com>
The only caller of nbd_do_establish_connection() that uses errp is
nbd_open(). The only way to cancel this call is through open_timer
timeout. And for this case, user will be more interested in description
of last failed connect rather than in
"Connection attempt cancelled by other operation".
So, let's change behavior on cancel to return previous failure error if
available.
Do the same for non-blocking failure case. In this case we still don't
have a caller that is interested in errp. But let's be consistent.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Now that the block layer supports 64-bit operations (see commit
2800637a and friends, new to v6.2), we no longer have to self-fragment
requests larger than 2G, reverting the workaround added in 890cbccb08
("nbd: Fix large trim/zero requests", v5.1.0).
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20211117170230.1128262-3-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
When a client disconnects abruptly, but did not have any pending
requests (for example, when using nbdsh without calling h.shutdown),
we used to output the following message:
$ qemu-nbd -f raw file
$ nbdsh -u 'nbd://localhost:10809' -c 'h.trim(1,0)'
qemu-nbd: Disconnect client, due to: Failed to read request: Unexpected end-of-file before all bytes were read
Then in commit f148ae7, we refactored nbd_receive_request() to use
nbd_read_eof(); when this returns 0, we regressed into tracing
uninitialized memory (if tracing is enabled) and reporting a
less-specific:
qemu-nbd: Disconnect client, due to: Request handling failed in intermediate state
Note that with Unix sockets, we have yet another error message,
unchanged by the 6.0 regression:
$ qemu-nbd -k /tmp/sock -f raw file
$ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.trim(1,0)'
qemu-nbd: Disconnect client, due to: Failed to send reply: Unable to write to socket: Broken pipe
But in all cases, the error message goes away if the client performs a
soft shutdown by using NBD_CMD_DISC, rather than a hard shutdown by
abrupt disconnect:
$ nbdsh -u 'nbd://localhost:10809' -c 'h.trim(1,0)' -c 'h.shutdown()'
This patch fixes things to avoid uninitialized memory, and in general
avoids warning about a client that does a hard shutdown when not in
the middle of a packet. A client that aborts mid-request, or which
does not read the full server's reply, can still result in warnings,
but those are indeed much more unusual situations.
CC: qemu-stable@nongnu.org
Fixes: f148ae7d36 ("nbd/server: Quiesce coroutines on context switch", v6.0.0)
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[eblake: defer unrelated typo fixes to later patch]
Message-Id: <20211117170230.1128262-2-eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
clang's sanitizer is picky: memset(NULL, x, 0) is technically
undefined behavior, even though no sane implementation of memset()
deferences the NULL. Caught by the nbd-qemu-allocation iotest.
The alternative to checking before each memset is to instead force an
allocation of 1 element instead of g_new0(type, 0)'s behavior of
returning NULL for a 0-length array.
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Fixes: 3b1f244c59 (nbd: Allow export of multiple bitmaps for one device)
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20211115223943.626416-1-eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
OK, that's a big rewrite of the logic.
Pre-patch we have an always running coroutine - connection_co. It does
reply receiving and reconnecting. And it leads to a lot of difficult
and unobvious code around drained sections and context switch. We also
abuse bs->in_flight counter which is increased for connection_co and
temporary decreased in points where we want to allow drained section to
begin. One of these place is in another file: in nbd_read_eof() in
nbd/client.c.
We also cancel reconnect and requests waiting for reconnect on drained
begin which is not correct. And this patch fixes that.
Let's finally drop this always running coroutine and go another way:
do both reconnect and receiving in request coroutines.
The detailed list of changes below (in the sequence of diff hunks).
1. receiving coroutines are woken directly from nbd_channel_error, when
we change s->state
2. nbd_co_establish_connection_cancel(): we don't have drain_begin now,
and in nbd_teardown_connection() all requests should already be
finished (and reconnect is done from request). So
nbd_co_establish_connection_cancel() is called from
nbd_cancel_in_flight() (to cancel the request that is doing
nbd_co_establish_connection()) and from reconnect_delay_timer_cb()
(previously we didn't need it, as reconnect delay only should cancel
active requests not the reconnection itself). But now reconnection
itself is done in the separate thread (we now call
nbd_client_connection_enable_retry() in nbd_open()), and we need to
cancel the requests that wait in nbd_co_establish_connection()
now).
2A. We do receive headers in request coroutine. But we also should
dispatch replies for other pending requests. So,
nbd_connection_entry() is turned into nbd_receive_replies(), which
does reply dispatching while it receives other request headers, and
returns when it receives the requested header.
3. All old staff around drained sections and context switch is dropped.
In details:
- we don't need to move connection_co to new aio context, as we
don't have connection_co anymore
- we don't have a fake "request" of connection_co (extra increasing
in_flight), so don't care with it in drain_begin/end
- we don't stop reconnection during drained section anymore. This
means that drain_begin may wait for a long time (up to
reconnect_delay). But that's an improvement and more correct
behavior see below[*]
4. In nbd_teardown_connection() we don't have to wait for
connection_co, as it is dropped. And cleanup for s->ioc and nbd_yank
is moved here from removed connection_co.
5. In nbd_co_do_establish_connection() we now should handle
NBD_CLIENT_CONNECTING_NOWAIT: if new request comes when we are in
NBD_CLIENT_CONNECTING_NOWAIT, it still should call
nbd_co_establish_connection() (who knows, maybe the connection was
already established by another thread in the background). But we
shouldn't wait: if nbd_co_establish_connection() can't return new
channel immediately the request should fail (we are in
NBD_CLIENT_CONNECTING_NOWAIT state).
6. nbd_reconnect_attempt() is simplified: it's now easier to wait for
other requests in the caller, so here we just assert that fact.
Also delay time is now initialized here: we can easily detect first
attempt and start a timer.
7. nbd_co_reconnect_loop() is dropped, we don't need it. Reconnect
retries are fully handle by thread (nbd/client-connection.c), delay
timer we initialize in nbd_reconnect_attempt(), we don't have to
bother with s->drained and friends. nbd_reconnect_attempt() now
called from nbd_co_send_request().
8. nbd_connection_entry is dropped: reconnect is now handled by
nbd_co_send_request(), receiving reply is now handled by
nbd_receive_replies(): all handled from request coroutines.
9. So, welcome new nbd_receive_replies() called from request coroutine,
that receives reply header instead of nbd_connection_entry().
Like with sending requests, only one coroutine may receive in a
moment. So we introduce receive_mutex, which is locked around
nbd_receive_reply(). It also protects some related fields. Still,
full audit of thread-safety in nbd driver is a separate task.
New function waits for a reply with specified handle being received
and works rather simple:
Under mutex:
- if current handle is 0, do receive by hand. If another handle
received - switch to other request coroutine, release mutex and
yield. Otherwise return success
- if current handle == requested handle, we are done
- otherwise, release mutex and yield
10: in nbd_co_send_request() we now do nbd_reconnect_attempt() if
needed. Also waiting in free_sema queue we now wait for one of two
conditions:
- connectED, in_flight < MAX_NBD_REQUESTS (so we can start new one)
- connectING, in_flight == 0, so we can call
nbd_reconnect_attempt()
And this logic is protected by s->send_mutex
Also, on failure we don't have to care of removed s->connection_co
11. nbd_co_do_receive_one_chunk(): now instead of yield() and wait for
s->connection_co we just call new nbd_receive_replies().
12. nbd_co_receive_one_chunk(): place where s->reply.handle becomes 0,
which means that handling of the whole reply is finished. Here we
need to wake one of coroutines sleeping in nbd_receive_replies().
If none are sleeping - do nothing. That's another behavior change: we
don't have endless recv() in the idle time. It may be considered as
a drawback. If so, it may be fixed later.
13. nbd_reply_chunk_iter_receive(): don't care about removed
connection_co, just ping in_flight waiters.
14. Don't create connection_co, enable retry in the connection thread
(we don't have own reconnect loop anymore)
15. We now need to add a nbd_co_establish_connection_cancel() call in
nbd_cancel_in_flight(), to cancel the request that is doing a
connection attempt.
[*], ok, now we don't cancel reconnect on drain begin. That's correct:
reconnect feature leads to possibility of long-running requests (up
to reconnect delay). Still, drain begin is not a reason to kill
long requests. We should wait for them.
This also means, that we can again reproduce a dead-lock, described
in 8c517de24a.
Why we are OK with it:
1. Now this is not absolutely-dead dead-lock: the vm is unfrozen
after reconnect delay. Actually 8c517de24a fixed a bug in
NBD logic, that was not described in 8c517de24a and led to
forever dead-lock. The problem was that nobody woke the free_sema
queue, but drain_begin can't finish until there is a request in
free_sema queue. Now we have a reconnect delay timer that works
well.
2. It's not a problem of the NBD driver, but of the ide code,
because it does drain_begin under the global mutex; the problem
doesn't reproduce when using scsi instead of ide.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902103805.25686-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar and comment tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
When we don't have a connection and blocking is false, we return NULL
but don't set errp. That's wrong.
We have two paths for calling nbd_co_establish_connection():
1. nbd_open() -> nbd_do_establish_connection() -> ...
but that will never set blocking=false
2. nbd_reconnect_attempt() -> nbd_co_do_establish_connection() -> ...
but that uses errp=NULL
So, we are safe with our wrong errp policy in
nbd_co_establish_connection(). Still let's fix it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210906190654.183421-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
The NBD protocol just relaxed the requirements on
NBD_OPT_LIST_META_CONTEXT:
https://github.com/NetworkBlockDevice/nbd/commit/13a4e33a87
Since listing is not stateful (unlike SET_META_CONTEXT), we don't care
if a client asks for meta contexts without first requesting structured
replies. Well-behaved clients will still ask for structured reply
first (if for no other reason than for back-compat to older servers),
but that's no reason to avoid this change.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210907173505.1499709-1-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
From clang-13:
nbd/server.c:976:22: error: variable 'bitmaps' set but not used \
[-Werror,-Wunused-but-set-variable]
which is incorrect; see //bugs.llvm.org/show_bug.cgi?id=3888.
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
We'll need a possibility of non-blocking nbd_co_establish_connection(),
so that it returns immediately, and it returns success only if a
connections was previously established in background.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-30-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
block/nbd doesn't need underlying sioc channel anymore. So, we can
update nbd/client-connection interface to return only one top-most io
channel, which is more straight forward.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-27-vsementsov@virtuozzo.com>
[eblake: squash in Vladimir's fixes for uninit usage caught by clang]
Signed-off-by: Eric Blake <eblake@redhat.com>
Now, when a thread can do negotiation and retry, it may run relatively
long. We need a mechanism to stop it, when the user is not interested
in a result any more. So, on nbd_client_connection_release() let's
shutdown the socket, and do not retry connection if thread is detached.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-22-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Add an option for a thread to retry connecting until it succeeds. We'll
use nbd/client-connection both for reconnect and for initial connection
in nbd_open(), so we need a possibility to use same NBDClientConnection
instance to connect once in nbd_open() and then use retry semantics for
reconnect.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-21-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>
Add arguments and logic to support nbd negotiation in the same thread
after successful connection.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-20-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We don't update connect_thread_func() to use QEMU_LOCK_GUARD, as it
will get more complex critical sections logic in further commit, where
QEMU_LOCK_GUARD doesn't help.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-19-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We now have bs-independent connection API, which consists of four
functions:
nbd_client_connection_new()
nbd_client_connection_release()
nbd_co_establish_connection()
nbd_co_establish_connection_cancel()
Move them to a separate file together with NBDClientConnection
structure which becomes private to the new API.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-18-vsementsov@virtuozzo.com>
[eblake: comment tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
- NBD server: Fix crashes related to switching between AioContexts
- file-posix: Workaround for discard/write_zeroes on buggy filesystems
- Follow-up fixes for the reopen vs. permission changes
- quorum: Fix error handling for flush
- block-copy: Refactor copy_range handling
- docs: Describe how to use 'null-co' block driver
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmC3iy8RHGt3b2xmQHJl
ZGhhdC5jb20ACgkQfwmycsiPL9ZP0hAAuh07CFWLzHCcRC7PBzekSPfzRYYBLDSW
EObJ1Ov4mvz8UZoP6BDJ5QVzLPhel6hXkxTd83B1D7t/Dq+yJYR0z8Kv3USpaVJ4
2U26SsoGQM8BmtVDL1Q8tQ5eDWQ4ykxNx6F2/lKBe1EH1lfaun04Xj1rNh7jpilo
nNmKMDDI1UOkH0lKDR3tqfEV0XQE7o+ZKfPlIbvYMjXk9ZPKUHfjNPGVdCLQVnqH
VJI01hF7eEx1ykSMdlC+TzNoVGG+mCBokGuW0JlUvOpX6FcGnAlxXQx8u53c1I8O
lggZV8b2IbrNBUVwXQHLLrXxjOo+u54Ct9y/gXUTAj8qai+9jVRp60Y1pnCyeIeu
DzFx10xwy04PGleb7AAZ4dT8du2+PTkuyQ9KmlvQ2U4IUcgW124CrDeO7XYr1aif
hCTJPeEDrC9YNU6AQ8rLXrYUtkumSm2zUzU5nZ//i5WH41475/vsmgP5A+Jr457A
Xu0yiI2Gqkr9CNsP9ZzMkNj03oIBhPFuGxiwibLQsy/6UVnaDYS0+rQ8FXYnF5+K
iEpgXe3vKTWxM097kzJMBTDVRMXRa75NtK7KWXMDgVpHTbcv1t1otsn+6dfv+B55
ULJM1ETsyYS0T6BqNglvdytsraSt7JgSF+ZLHbYk3KVDshwnq/0ksgSqHNNA14ca
kYTzHhMgo5w=
=gSq/
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches
- NBD server: Fix crashes related to switching between AioContexts
- file-posix: Workaround for discard/write_zeroes on buggy filesystems
- Follow-up fixes for the reopen vs. permission changes
- quorum: Fix error handling for flush
- block-copy: Refactor copy_range handling
- docs: Describe how to use 'null-co' block driver
# gpg: Signature made Wed 02 Jun 2021 14:44:15 BST
# gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg: issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream:
docs/secure-coding-practices: Describe how to use 'null-co' block driver
block-copy: refactor copy_range handling
block-copy: fix block_copy_task_entry() progress update
nbd/server: Use drained block ops to quiesce the server
block-backend: add drained_poll
block: improve permission conflict error message
block: simplify bdrv_child_user_desc()
block/vvfat: inherit child_vvfat_qcow from child_of_bds
block: improve bdrv_child_get_parent_desc()
block-backend: improve blk_root_get_parent_desc()
block: document child argument of bdrv_attach_child_common()
block/file-posix: Try other fallbacks after invalid FALLOC_FL_ZERO_RANGE
block/file-posix: Fix problem with fallocate(PUNCH_HOLE) on GPFS
block: drop BlockBackendRootState::read_only
block: drop BlockDriverState::read_only
block: consistently use bdrv_is_read_only()
block/vvfat: fix vvfat_child_perm crash
block/vvfat: child_vvfat_qcow: add .get_parent_aio_context, fix crash
qemu-io-cmds: assert that we don't have .perm requested in no-blk case
block/quorum: Provide .bdrv_co_flush instead of .bdrv_co_flush_to_disk
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Before switching between AioContexts we need to make sure that we're
fully quiesced ("nb_requests == 0" for every client) when entering the
drained section.
To do this, we set "quiescing = true" for every client on
".drained_begin" to prevent new coroutines from being created, and
check if "nb_requests == 0" on ".drained_poll". Finally, once we're
exiting the drained section, on ".drained_end" we set "quiescing =
false" and call "nbd_client_receive_next_request()" to resume the
processing of new requests.
With these changes, "blk_aio_attach()" and "blk_aio_detach()" can be
reverted to be as simple as they were before f148ae7d36.
RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1960137
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Sergio Lopez <slp@redhat.com>
Message-Id: <20210602060552.17433-3-slp@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit e50caf4a5c ("tracing: convert documentation to rST")
converted docs/devel/tracing.txt to docs/devel/tracing.rst.
We still have several references to the old file, so let's fix them
with the following command:
sed -i s/tracing.txt/tracing.rst/ $(git grep -l docs/devel/tracing.txt)
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210517151702.109066-2-sgarzare@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
When querying image extents for raw image, qemu-nbd reports holes as
zero:
$ qemu-nbd -t -r -f raw empty-6g.raw
$ qemu-img map --output json nbd://localhost
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": true, "offset": 0}]
$ qemu-img map --output json empty-6g.raw
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]
Turns out that qemu-img map reports a hole based on BDRV_BLOCK_DATA, but
nbd server reports a hole based on BDRV_BLOCK_ALLOCATED.
The NBD protocol says:
NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and
future writes to that area may cause fragmentation or encounter an
NBD_ENOSPC error); if clear, the block is allocated or the server
could not otherwise determine its status.
qemu-img manual says:
whether the sectors contain actual data or not (boolean field data;
if false, the sectors are either unallocated or stored as
optimized all-zero clusters);
To me, data=false looks compatible with NBD_STATE_HOLE. From user point
of view, getting same results from qemu-nbd and qemu-img is more
important than being more correct about allocation status.
Changing nbd server to report holes using BDRV_BLOCK_DATA makes qemu-nbd
results compatible with qemu-img map:
$ qemu-img map --output json nbd://localhost
[{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, "offset": 0}]
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Message-Id: <20210219160752.1826830-1-nsoffer@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>