After a couple of follow-up patches, this function will replace
bdrv_recurse_is_first_non_filter() in check_to_replace_node().
bdrv_recurse_is_first_non_filter() is both not sufficiently specific for
check_to_replace_node() (it allows cases that should not be allowed,
like replacing child nodes of quorum with dissenting data that have more
parents than just quorum), and it is too restrictive (it is perfectly
fine to replace filters).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200218103454.296704-7-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It is unused now. (And it was ugly because it needed to explore all BDS
chains from the top.)
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200218103454.296704-4-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Aborts when sqe fails to be set as sqes cannot be returned to the
ring. Adds slow path for short reads for older kernels
Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200120141858.587874-5-stefanha@redhat.com
Message-Id: <20200120141858.587874-5-stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Marking without waiting would not result in actual serialising behavior.
Thus, make a call bdrv_mark_request_serialising sufficient for
serialisation to happen.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1578495356-46219-3-git-send-email-pbonzini@redhat.com
Message-Id: <1578495356-46219-3-git-send-email-pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It is unused since commit 00e30f0 ("block/backup: use backup-top instead
of write notifiers", 2019-10-01), drop it to simplify the code.
While at it, drop redundant assertions on flags.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1578495356-46219-2-git-send-email-pbonzini@redhat.com
Message-Id: <1578495356-46219-2-git-send-email-pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We need some way to correlate QAPI BlockPermission values with
BLK_PERM_* flags. We could:
(1) have the same order in the QAPI definition as the the BLK_PERM_*
flags are in LSb-first order. However, then there is no guarantee
that they actually match (e.g. when someone modifies the QAPI schema
without thinking of the BLK_PERM_* definitions).
We could add static assertions, but these would break what’s good
about this solution, namely its simplicity.
(2) define the BLK_PERM_* flags based on the BlockPermission values.
But this way whenever someone were to modify the QAPI order
(perfectly sensible in theory), the BLK_PERM_* values would change.
Because these values are used for file locking, this might break
file locking between different qemu versions.
Therefore, go the slightly more cumbersome way: Add a function to
translate from the QAPI constants to the BLK_PERM_* flags.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20191108123455.39445-2-mreitz@redhat.com
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@Redhat.com>
Message-Id: <20191205174635.18758-11-vsementsov@virtuozzo.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Qemu as server currently won't accept export names larger than 256
bytes, nor create dirty bitmap names longer than 1023 bytes, so most
uses of qemu as client or server have no reason to get anywhere near
the NBD spec maximum of a 4k limit per string.
However, we weren't actually enforcing things, ignoring when the
remote side violates the protocol on input, and also having several
code paths where we send oversize strings on output (for example,
qemu-nbd --description could easily send more than 4k). Tighten
things up as follows:
client:
- Perform bounds check on export name and dirty bitmap request prior
to handing it to server
- Validate that copied server replies are not too long (ignoring
NBD_INFO_* replies that are not copied is not too bad)
server:
- Perform bounds check on export name and description prior to
advertising it to client
- Reject client name or metadata query that is too long
- Adjust things to allow full 4k name limit rather than previous
256 byte limit
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20191114024635.11363-4-eblake@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
We document that for qcow2 persistent bitmaps, the name cannot exceed
1023 bytes. It is inconsistent if transient bitmaps do not have to
abide by the same limit, and it is unlikely that any existing client
even cares about using bitmap names this long. It's time to codify
that ALL bitmaps managed by qemu (whether persistent in qcow2 or not)
have a documented maximum length.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20191114024635.11363-3-eblake@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
As long as we limit NBD names to 256 bytes (the bare minimum permitted
by the standard), stack-allocation works for parsing a name received
from the client. But as mentioned in a comment, we eventually want to
permit up to the 4k maximum of the NBD standard, which is too large
for stack allocation; so switch everything in the server to use heap
allocation. For now, there is no change in actually supported name
length.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20191114024635.11363-2-eblake@redhat.com>
[eblake: fix uninit variable compile failure]
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Make both bdrv_mark_request_serialising() and
bdrv_wait_serialising_requests() public so they can be used from block
drivers.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20191101152510.11719-2-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Fix the offset of the NSSRS field the CAP register.
From NVME 1.4, section 3 ("Controller Registers"), subsection 3.1.1
("Offset 0h: CAP – Controller Capabilities") CAP_NSSRS_SHIFT is bit 36,
not 33.
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reported-by: Javier Gonzalez <javier.gonz@samsung.com>
Message-id: 20191023073315.446534-1-its@irrelevant.dk
Reviewed-by: John Snow <jsnow@redhat.com>
[mreitz: Added John's note on the location in the specification where
this information can be found]
Signed-off-by: Max Reitz <mreitz@redhat.com>
We have two drivers (iscsi and file-posix) that (in some cases) return
success from their .bdrv_co_truncate() implementation if the block
device is larger than the requested offset, but cannot be shrunk. Some
callers do not want that behavior, so this patch adds a new parameter
that they can use to turn off that behavior.
This patch just adds the parameter and lets the block/io.c and
block/block-backend.c functions pass it around. All other callers
always pass false and none of the implementations evaluate it, so that
this patch does not change existing behavior. Future patches take care
of that.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190918095144.955-5-mreitz@redhat.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
No reason to limit buffered copy to one cluster. Let's allow up to 1
MiB.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20191022111805.3432-7-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Currently total allocation for parallel requests to block-copy instance
is unlimited. Let's limit it to 128 MiB.
For now block-copy is used only in backup, so actually we limit total
allocation for backup job.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20191022111805.3432-6-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
The only reason I can imagine for this strange code at the very-end of
bdrv_reopen_commit is the fact that bs->read_only updated after
calling drv->bdrv_reopen_commit in bdrv_reopen_commit. And in the same
time, prior to previous commit, qcow2_reopen_bitmaps_rw did a wrong
check for being writable, when actually it only need writable file
child not self.
So, as it's fixed, let's move things to correct place.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Acked-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190927122355.7344-10-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
Firstly, no reason to optimize failure path. Then, function name is
ambiguous: it checks for readonly and similar things, but someone may
think that it will ignore normal bitmaps which was just unchanged, and
this is in bad relation with the fact that we should drop IN_USE flag
for unchanged bitmaps in the image.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20190927122355.7344-5-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
We'll need reverse-foreach in the following commit, QTAILQ support it,
so move to QTAILQ.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190927122355.7344-2-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
bdrv_dirty_bitmap_next is always used in same pattern. So, split it
into _next and _first, instead of combining two functions into one and
add FOR_EACH_DIRTY_BITMAP macro.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20190916141911.5255-5-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
Add bs field to BdrvDirtyBitmap structure. Drop BlockDriverState
parameter from bitmap APIs where possible.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20190916141911.5255-3-vsementsov@virtuozzo.com
[Rebased on top of block-copy. --js]
Signed-off-by: John Snow <jsnow@redhat.com>
Drop meta bitmaps, as they are unused.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20190916141911.5255-2-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
qmp_block_dirty_bitmap_add and do_block_dirty_bitmap_remove do acquire
aio context since 0a6c86d024. But this is not enough: we also must
lock qcow2 mutex when access in-image metadata. Especially it concerns
freeing qcow2 clusters.
To achieve this, move qcow2_can_store_new_dirty_bitmap and
qcow2_remove_persistent_dirty_bitmap to coroutine context.
Since we work in coroutines in correct aio context, we don't need
context acquiring in blockdev.c anymore, drop it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20190920082543.23444-4-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
It's more comfortable to not deal with local_err.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20190920082543.23444-3-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
Drop write notifiers and use filter node instead.
= Changes =
1. Add filter-node-name argument for backup qmp api. We have to do it
in this commit, as 257 needs to be fixed.
2. There are no more write notifiers here, so is_write_notifier
parameter is dropped from block-copy paths.
3. To sync with in-flight requests at job finish we now have drained
removing of the filter, we don't need rw-lock.
4. Block-copy is now using BdrvChildren instead of BlockBackends
5. As backup-top owns these children, we also move block-copy state
into backup-top's ownership.
= Iotest changes =
56: op-blocker doesn't shoot now, as we set it on source, but then
check on filter, when trying to start second backup.
To keep the test we instead can catch another collision: both jobs will
get 'drive0' job-id, as job-id parameter is unspecified. To prevent
interleaving with file-posix locks (as they are dependent on config)
let's use another target for second backup.
Also, it's obvious now that we'd like to drop this op-blocker at all
and add a test-case for two backups from one node (to different
destinations) actually works. But not in these series.
141: Output changed: prepatch, "Node is in use" comes from bdrv_has_blk
check inside qmp_blockdev_del. But we've dropped block-copy blk
objects, so no more blk objects on source bs (job blk is on backup-top
filter bs). New message is from op-blocker, which is the next check in
qmp_blockdev_add.
257: The test wants to emulate guest write during backup. They should
go to filter node, not to original source node, of course. Therefore we
need to specify filter node name and use it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20191001131409.14202-6-vsementsov@virtuozzo.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Split block_copy_set_callbacks out of block_copy_state_new. It's needed
for further commit: block-copy will use BdrvChildren of backup-top
filter, so it will be created from backup-top filter creation function.
But callbacks will still belong to backup job and will be set in
separate.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20191001131409.14202-4-vsementsov@virtuozzo.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Move synchronization mechanism to block-copy, to be able to use one
block-copy instance from backup job and backup-top filter in parallel.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20191001131409.14202-2-vsementsov@virtuozzo.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
A block driver can provide a callback to report driver-specific
statistics.
file-posix driver now reports discard statistics
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20190923121737.83281-10-anton.nefedov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Each block_acct_done/failed call is designed to correspond to a
previous block_acct_start call, which initializes the stats cookie.
However sometimes it is not the case, e.g. some error paths might
report the same cookie twice because it is hard to accurately track if
the cookie was reported yet or not.
This patch cleans the cookie after report.
(Note: block_acct_failed/done without a previous block_acct_start at
all should be avoided. Uninitialized cookie might hold a garbage value
and there is still "< BLOCK_MAX_IOTYPE" assertion for that)
It will be particularly useful in ide code where it's hard to
keep track whether the request done its accounting or not: in the
following patch of the series, trim requests will do the accounting
separately.
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190923121737.83281-4-anton.nefedov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20190923121737.83281-3-anton.nefedov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Split block_copy to separate file, to be cleanly shared with backup-top
filter driver in further commits.
It's a clean movement, the only change is drop "static" from interface
functions.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190920142056.12778-8-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Common interface for aio task loops. To be used for improving
performance of synchronous io loops in qcow2, block-stream,
copy-on-read, and may be other places.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190916175324.18478-3-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
When iothreads are in use, the failure to grab the aio context results
in an assertion failure when trying to unlock things during blk_unref,
when trying to unlock a mutex that was not locked. In short, all
calls to nbd_export_put need to done while within the correct aio
context. But since nbd_export_put can recursively reach itself via
nbd_export_close, and recursively grabbing the context would deadlock,
we can't do the context grab directly in those functions, but must do
so in their callers.
Hoist the use of the correct aio_context from nbd_export_new() to its
caller qmp_nbd_server_add(). Then tweak qmp_nbd_server_remove(),
nbd_eject_notifier(), and nbd_esport_close_all() to grab the right
context, so that all callers during qemu now own the context before
nbd_export_put() can call blk_unref().
Remaining uses in qemu-nbd don't matter (since that use case does not
support iothreads).
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190917023917.32226-1-eblake@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Replace confusing usage:
~BDRV_SECTOR_MASK
With more clear:
(BDRV_SECTOR_SIZE - 1)
Remove BDRV_SECTOR_MASK and the unused BDRV_BLOCK_OFFSET_MASK which was
it's last user.
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Message-id: 20190827185913.27427-3-nsoffer@redhat.com
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
In job_finish_sync job_enter should be enough for a job to make some
progress and draining is a wrong tool for it. So use job_enter directly
here and drop job_drain with all related staff not used more.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Tested-by: John Snow <jsnow@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit fe0480d6 and friends added BDRV_REQ_NO_FALLBACK as a way to
avoid wasting time on a preliminary write-zero request that will later
be rewritten by actual data, if it is known that the write-zero
request will use a slow fallback; but in doing so, could not optimize
for NBD. The NBD specification is now considering an extension that
will allow passing on those semantics; this patch updates the new
protocol bits and 'qemu-nbd --list' output to recognize the bit, as
well as the new errno value possible when using the new flag; while
upcoming patches will improve the client to use the feature when
present, and the server to advertise support for it.
The NBD spec recommends (but not requires) that ENOTSUP be avoided for
all but failures of a fast zero (the only time it is mandatory to
avoid an ENOTSUP failure is when fast zero is supported but not
requested during write zeroes; the questionable use is for ENOTSUP to
other actions like a normal write request). However, clients that get
an unexpected ENOTSUP will either already be treating it the same as
EINVAL, or may appreciate the extra bit of information. We were
equally loose for returning EOVERFLOW in more situations than
recommended by the spec, so if it turns out to be a problem in
practice, a later patch can tighten handling for both error codes.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190823143726.27062-3-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[eblake: tweak commit message, also handle EOPNOTSUPP]
When creating a read-only image, we are still advertising support for
TRIM and WRITE_ZEROES to the client, even though the client should not
be issuing those commands. But seeing this requires looking across
multiple functions:
All callers to nbd_export_new() passed a single flag based solely on
whether the export allows writes. Later, we then pass a constant set
of flags to nbd_negotiate_options() (namely, the set of flags which we
always support, at least for writable images), which is then further
dynamically modified with NBD_FLAG_SEND_DF based on client requests
for structured options. Finally, when processing NBD_OPT_EXPORT_NAME
or NBD_OPT_EXPORT_GO we bitwise-or the original caller's flag with the
runtime set of flags we've built up over several functions.
Let's refactor things to instead compute a baseline of flags as soon
as possible which gets shared between multiple clients, in
nbd_export_new(), and changing the signature for the callers to pass
in a simpler bool rather than having to figure out flags. We can then
get rid of the 'myflags' parameter to various functions, and instead
refer to client for everything we need (we still have to perform a
bitwise-OR for NBD_FLAG_SEND_DF during NBD_OPT_EXPORT_NAME and
NBD_OPT_EXPORT_GO, but it's easier to see what is being computed).
This lets us quit advertising senseless flags for read-only images, as
well as making the next patch for exposing FAST_ZERO support easier to
write.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190823143726.27062-2-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[eblake: improve commit message, update iotest 223]
The NBD specification defines NBD_FLAG_CAN_MULTI_CONN, which can be
advertised when the server promises cache consistency between
simultaneous clients (basically, rules that determine what FUA and
flush from one client are able to guarantee for reads from another
client). When we don't permit simultaneous clients (such as qemu-nbd
without -e), the bit makes no sense; and for writable images, we
probably have a lot more work before we can declare that actions from
one client are cache-consistent with actions from another. But for
read-only images, where flush isn't changing any data, we might as
well advertise multi-conn support. What's more, advertisement of the
bit makes it easier for clients to determine if 'qemu-nbd -e' was in
use, where a second connection will succeed rather than hang until the
first client goes away.
This patch affects qemu as server in advertising the bit. We may want
to consider patches to qemu as client to attempt parallel connections
for higher throughput by spreading the load over those connections
when a server advertises multi-conn, but for now sticking to one
connection per nbd:// BDS is okay.
See also: https://bugzilla.redhat.com/1708300
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190815185024.7010-1-eblake@redhat.com>
[eblake: tweak blockdev-nbd.c to not request shared when writable,
fix iotest 233]
Reviewed-by: John Snow <jsnow@redhat.com>
Introduce extended variants of bdrv_co_preadv and bdrv_co_pwritev
with qiov_offset parameter.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190604161514.262241-10-vsementsov@virtuozzo.com
Message-Id: <20190604161514.262241-10-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Add handlers supporting qiov_offset parameter:
bdrv_co_preadv_part
bdrv_co_pwritev_part
bdrv_co_pwritev_compressed_part
This is used to reduce need of defining local_qiovs and hd_qiovs in all
corners of block layer code. The following patches will increase usage
of this new API part by part.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190604161514.262241-5-vsementsov@virtuozzo.com
Message-Id: <20190604161514.262241-5-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
No .bdrv_has_zero_init() implementation returns 1 if growing the file
would add non-zero areas (at least with PREALLOC_MODE_OFF), so using it
in lieu of this new function was always safe.
But on the other hand, it is possible that growing an image that is not
zero-initialized would still add a zero-initialized area, like when
using nonpreallocating truncation on a preallocated image. For callers
that care only about truncation, not about creation with potential
preallocation, this new function is useful.
Alternatively, we could have added a PreallocMode parameter to
bdrv_has_zero_init(). But the only user would have been qemu-img
convert, which does not have a plain PreallocMode value right now -- it
would have to parse the creation option to obtain it. Therefore, the
simpler solution is to let bdrv_has_zero_init() inquire the
preallocation status and add the new bdrv_has_zero_init_truncate() that
presupposes PREALLOC_MODE_OFF.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190724171239.8764-4-mreitz@redhat.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
bdrv_has_zero_init() only has meaning for newly created images or image
areas. If the mirror job itself did not create the image, it cannot
rely on bdrv_has_zero_init()'s result to carry any meaning.
This is the case for drive-mirror with mode=existing and always for
blockdev-mirror.
Note that we only have to zero-initialize the target with sync=full,
because other modes actually do not promise that the target will contain
the same data as the source after the job -- sync=top only promises to
copy anything allocated in the top layer, and sync=none will only copy
new I/O. (Which is how mirror has always handled it.)
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190724171239.8764-3-mreitz@redhat.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
It is used to do transactional movement of the bitmap (which is
possible in conjunction with merge command). Transactional bitmap
movement is needed in scenarios with external snapshot, when we don't
want to leave copy of the bitmap in the base image.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190708220502.12977-3-jsnow@redhat.com
[Edited "since" version to 4.2 --js]
Signed-off-by: John Snow <jsnow@redhat.com>
Add a public interface for get. While we're at it,
rename "bdrv_get_dirty_bitmap_locked" to "bdrv_dirty_bitmap_get_locked".
(There are more functions to rename to the bdrv_dirty_bitmap_VERB form,
but they will wait until the conclusion of this series.)
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190709232550.10724-11-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
I'm surprised it didn't come up sooner, but sometimes we have a +busy
bitmap as a source. This is dangerous from the QMP API, but if we are
the owner that marked the bitmap busy, it's safe to merge it using it as
a read only source.
It is not safe in the general case to allow users to read from in-use
bitmaps, so create an internal variant that foregoes the safety
checking.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190709232550.10724-10-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
We don't need or want a new sync mode for simple differences in
semantics. Create a new mode simply named "BITMAP" that is designed to
make use of the new Bitmap Sync Mode field.
Because the only bitmap sync mode is 'on-success', this adds no new
functionality to the backup job (yet). The old incremental backup mode
is maintained as a syntactic sugar for sync=bitmap, mode=on-success.
Add all of the plumbing necessary to support this new instruction.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190709232550.10724-6-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
In my "build everything" tree, changing qemu/main-loop.h triggers a
recompile of some 5600 out of 6600 objects (not counting tests and
objects that don't depend on qemu/osdep.h). It includes block/aio.h,
which in turn includes qemu/event_notifier.h, qemu/notify.h,
qemu/processor.h, qemu/qsp.h, qemu/queue.h, qemu/thread-posix.h,
qemu/thread.h, qemu/timer.h, and a few more.
Include qemu/main-loop.h only where it's needed. Touching it now
recompiles only some 1700 objects. For block/aio.h and
qemu/event_notifier.h, these numbers drop from 5600 to 2800. For the
others, they shrink only slightly.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190812052359.30071-21-armbru@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Some of the generated qapi-types-MODULE.h are included all over the
place. Changing a QAPI type can trigger massive recompiling. Top
scorers recompile more than 1000 out of some 6600 objects (not
counting tests and objects that don't depend on qemu/osdep.h):
6300 qapi/qapi-builtin-types.h
5700 qapi/qapi-types-run-state.h
3900 qapi/qapi-types-common.h
3300 qapi/qapi-types-sockets.h
3000 qapi/qapi-types-misc.h
3000 qapi/qapi-types-crypto.h
3000 qapi/qapi-types-job.h
3000 qapi/qapi-types-block-core.h
2800 qapi/qapi-types-block.h
1300 qapi/qapi-types-net.h
Clean up headers to include generated QAPI headers only where needed.
Impact is negligible except for hw/qdev-properties.h.
This header includes qapi/qapi-types-block.h and
qapi/qapi-types-misc.h. They are used only in expansions of property
definition macros such as DEFINE_PROP_BLOCKDEV_ON_ERROR() and
DEFINE_PROP_OFF_AUTO(). Moving their inclusion from
hw/qdev-properties.h to the users of these macros avoids pointless
recompiles. This is how other property definition macros, such as
DEFINE_PROP_NETDEV(), already work.
Improves things for some of the top scorers:
3600 qapi/qapi-types-common.h
2800 qapi/qapi-types-sockets.h
900 qapi/qapi-types-misc.h
2200 qapi/qapi-types-crypto.h
2100 qapi/qapi-types-job.h
2100 qapi/qapi-types-block-core.h
270 qapi/qapi-types-block.h
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20190812052359.30071-3-armbru@redhat.com>
Back in 2016, we discussed[1] rules for headers, and these were
generally liked:
1. Have a carefully curated header that's included everywhere first. We
got that already thanks to Peter: osdep.h.
2. Headers should normally include everything they need beyond osdep.h.
If exceptions are needed for some reason, they must be documented in
the header. If all that's needed from a header is typedefs, put
those into qemu/typedefs.h instead of including the header.
3. Cyclic inclusion is forbidden.
This patch gets include/ closer to obeying 2.
It's actually extracted from my "[RFC] Baby steps towards saner
headers" series[2], which demonstrates a possible path towards
checking 2 automatically. It passes the RFC test there.
[1] Message-ID: <87h9g8j57d.fsf@blackfin.pond.sub.org>
https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg03345.html
[2] Message-Id: <20190711122827.18970-1-armbru@redhat.com>
https://lists.nongnu.org/archive/html/qemu-devel/2019-07/msg02715.html
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-Id: <20190812052359.30071-2-armbru@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
No reason to use blocking channel for negotiation and we'll benefit in
further reconnect feature, as qio_channel reads and writes will do
qemu_coroutine_yield while waiting for io completion.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190618114328.55249-3-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Do effective copy-on-read request when we don't need data actually. It
will be used for block-stream and NBD_CMD_CACHE.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190725100550.33801-2-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[eblake: comment grammar fix]
Signed-off-by: Eric Blake <eblake@redhat.com>
bdrv_set_aio_context_ignore() can only work in the main loop:
bdrv_drained_begin() only works in the main loop and the node's (old)
AioContext; and bdrv_drained_end() really only works in the main loop
and the node's (new) AioContext (contrary to its current comment, which
is just wrong).
Consequentially, bdrv_set_aio_context_ignore() must be called from the
main loop. Luckily, assuming that we can make block graph changes only
from the main loop as well, all its callers do that already.
Note that changing a node's context in a sense is an operation that
changes the block graph, so it actually makes sense to require this
function to be called from the main loop.
Also, fix bdrv_drained_end()'s description. You can only use it from
the main loop or the node's AioContext, and in the latter case, the
whole subtree must be in the same context.
Fixes: e037c09c78
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190722133054.21781-3-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
We should never poll anywhere in bdrv_do_drained_end() (including its
recursive callees like bdrv_drain_invoke()), because it does not cope
well with graph changes. In fact, it has been written based on the
postulation that no graph changes will happen in it.
Instead, the callers that want to poll must poll, i.e. all currently
globally available wrappers: bdrv_drained_end(),
bdrv_subtree_drained_end(), bdrv_unapply_subtree_drain(), and
bdrv_drain_all_end(). Graph changes there do not matter.
They can poll simply by passing a pointer to a drained_end_counter and
wait until it reaches 0.
This patch also adds a non-polling global wrapper for
bdrv_do_drained_end() that takes a drained_end_counter pointer. We need
such a variant because now no function called anywhere from
bdrv_do_drained_end() must poll. This includes
BdrvChildRole.drained_end(), which already must not poll according to
its interface documentation, but bdrv_child_cb_drained_end() just
violates that by invoking bdrv_drained_end() (which does poll).
Therefore, BdrvChildRole.drained_end() must take a *drained_end_counter
parameter, which bdrv_child_cb_drained_end() can pass on to the new
bdrv_drained_end_no_poll() function.
Note that we now have a pattern of all drained_end-related functions
either polling or receiving a *drained_end_counter to let the caller
poll based on that.
A problem with a single poll loop is that when the drained section in
bdrv_set_aio_context_ignore() ends, some nodes in the subgraph may be in
the old contexts, while others are in the new context already. To let
the collective poll in bdrv_drained_end() work correctly, we must not
hold a lock to the old context, so that the old context can make
progress in case it is different from the current context.
(In the process, remove the comment saying that the current context is
always the old context, because it is wrong.)
In all other places, all nodes in a subtree must be in the same context,
so we can just poll that. The exception of course is
bdrv_drain_all_end(), but that always runs in the main context, so we
can just poll NULL (like bdrv_drain_all_begin() does).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These functions are not used outside of block/io.c, there is no reason
why they should be globally available.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 5cb2737e92 laid out why
bdrv_do_drained_end() must decrement the quiesce_counter after
bdrv_drain_invoke(). It did not give a very good reason why it has to
happen after bdrv_parent_drained_end(), instead only claiming symmetry
to bdrv_do_drained_begin().
It turns out that delaying it for so long is wrong.
Situation: We have an active commit job (i.e. a mirror job) from top to
base for the following graph:
filter
|
[file]
|
v
top --[backing]--> base
Now the VM is closed, which results in the job being cancelled and a
bdrv_drain_all() happening pretty much simultaneously.
Beginning the drain means the job is paused once whenever one of its
nodes is quiesced. This is reversed when the drain ends.
With how the code currently is, after base's drain ends (which means
that it will have unpaused the job once), its quiesce_counter remains at
1 while it goes to undrain its parents (bdrv_parent_drained_end()). For
some reason or another, undraining filter causes the job to be kicked
and enter mirror_exit_common(), where it proceeds to invoke
block_job_remove_all_bdrv().
Now base will be detached from the job. Because its quiesce_counter is
still 1, it will unpause the job once more. So in total, undraining
base will unpause the job twice. Eventually, this will lead to the
job's pause_count going negative -- well, it would, were there not an
assertion against this, which crashes qemu.
The general problem is that if in bdrv_parent_drained_end() we undrain
parent A, and then undrain parent B, which then leads to A detaching the
child, bdrv_replace_child_noperm() will undrain A as if we had not done
so yet; that is, one time too many.
It follows that we cannot decrement the quiesce_counter after invoking
bdrv_parent_drained_end().
Unfortunately, decrementing it before bdrv_parent_drained_end() would be
wrong, too. Imagine the above situation in reverse: Undraining A leads
to B detaching the child. If we had already decremented the
quiesce_counter by that point, bdrv_replace_child_noperm() would undrain
B one time too little; because it expects bdrv_parent_drained_end() to
issue this undrain. But bdrv_parent_drained_end() won't do that,
because B is no longer a parent.
Therefore, we have to do something else. This patch opts for
introducing a second quiesce_counter that counts how many times a
child's parent has been quiesced (though c->role->drained_*). With
that, bdrv_replace_child_noperm() just has to undrain the parent exactly
that many times when removing a child, and it will always be right.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The commit and the mirror block job must be able to drop their filter
node at any point. However, this will not be possible if any of the
BdrvChild links to them is frozen. Therefore, we need to prevent them
from ever becoming frozen.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190703172813.6868-2-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This patch is used in the 'block/stream: introduce a bottom node'
that is following. Instead of the base node, the caller may pass
the node that has the base as its backing image to the function
bdrv_is_allocated_above() with a new parameter include_base = true
and get rid of the dependency on the base that may change during
commit/stream parallel jobs. Now, if the specified base is not
found in the backing image chain, the QEMU will abort.
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 1559152576-281803-2-git-send-email-andrey.shinkevich@virtuozzo.com
[mreitz: Squashed in the following as a rebase on conflicting patches:]
Message-id: e3cf99ae-62e9-8b6e-5a06-d3c8b9363b85@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
If a block node uses bdrv_child_try_set_perm() to change the permission
it takes on its child, the result may be very short-lived. If anything
makes the block layer recalculate the permissions internally, it will
invoke the node driver's .bdrv_child_perm() implementation. The
permission/shared permissions masks that returns will then override the
values previously passed to bdrv_child_try_set_perm().
If drivers want a child edge to have specific values for the
permissions/shared permissions mask, it must return them in
.bdrv_child_perm(). Consequentially, there is no need for them to pass
the same values to bdrv_child_try_set_perm() then: It is better to have
a function that invokes .bdrv_child_perm() and calls
bdrv_child_try_set_perm() with the result. This patch adds such a
function under the name of bdrv_child_refresh_perms().
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Drop remaining users of bs->job:
1. assertions actually duplicated by assert(!bs->refcnt)
2. trace-point seems not enough reason to change stream_start to return
BlockJob pointer
3. Restricting creation of two jobs based on same bs is bad idea, as
3.1 Some jobs creates filters to be their main node, so, this check
don't actually prevent creating second job on same real node (which
will create another filter node) (but I hope it is restricted by
other mechanisms)
3.2 Even without bs->job we have two systems of permissions:
op-blockers and BLK_PERM
3.3 We may want to run several jobs on one node one day
And finally, drop bs->job pointer itself. Hurrah!
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are going to remove bs->job pointer. Drop it's usage in
blockdev_mark_auto_del: instead of looking at bs->job let's check all
jobs for references to bs.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are going to remove bs->job pointer. Drop it's usage in replication
code. Additionally we have to return job pointer from some mirror APIs.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
No header includes qemu-common.h after this commit, as prescribed by
qemu-common.h's file comment.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190523143508.25387-5-armbru@redhat.com>
[Rebased with conflicts resolved automatically, except for
include/hw/arm/xlnx-zynqmp.h hw/arm/nrf51_soc.c hw/arm/msf2-soc.c
block/qcow2-refcount.c block/qcow2-cluster.c block/qcow2-cache.c
target/arm/cpu.h target/lm32/cpu.h target/m68k/cpu.h target/mips/cpu.h
target/moxie/cpu.h target/nios2/cpu.h target/openrisc/cpu.h
target/riscv/cpu.h target/tilegx/cpu.h target/tricore/cpu.h
target/unicore32/cpu.h target/xtensa/cpu.h; bsd-user/main.c and
net/tap-bsd.c fixed up]
This fixes at least one overflow in qcow2_process_discards, which
passes 64bit region length to bdrv_pdiscard where bytes (or sectors in
the past) parameter is int since its introduction in 0b919fae.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All callers of bdrv_set_aio_context() are eliminated now, they have
moved to bdrv_try_set_aio_context() and related safe functions. Remove
bdrv_set_aio_context().
With this, we can now know that the .set_aio_ctx callback must be
present in bdrv_set_aio_context_ignore() because
bdrv_can_set_aio_context() would have returned false previously, so
instead of checking the condition, we can assert it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
So far, we only made sure that updating the AioContext of a node
affected the whole subtree. However, if a node is newly attached to a
new parent, we also need to make sure that both the subtree of the node
and the parent are in the same AioContext. This tries to move the new
child node to the parent AioContext and returns an error if this isn't
possible.
BlockBackends now actually apply their AioContext to their root node.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kenneth Heitke <kenneth.heitke@intel.com>
Reviewed-by: Klaus Birkelund Jensen <klaus.jensen@cnexlabs.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Callback-based laio_submit() and laio_cancel() were left after
rewriting Linux AIO backend to coroutines in hope that they would be
used in other code that could bypass coroutines. They can be safely
removed because they have not been used since that time.
Signed-off-by: Julia Suvorova <jusual@mail.ru>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
drv_co_block_status digs bs->file for additional, more accurate search
for hole inside region, reported as DATA by bs since 5daa74a6eb.
This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
knows, where are holes and where is data. But every block_status
request calls lseek additionally. Assume a big disk, full of
data, in any iterative copying block job (or img convert) we'll call
lseek(HOLE) on every iteration, and each of these lseeks will have to
iterate through all metadata up to the end of file. It's obviously
ineffective behavior. And for many scenarios we don't need this lseek
at all.
However, lseek is needed when we have metadata-preallocated image.
So, let's detect metadata-preallocation case and don't dig qcow2's
protocol file in other cases.
The idea is to compare allocation size in POV of filesystem with
allocations size in POV of Qcow2 (by refcounts). If allocation in fs is
significantly lower, consider it as metadata-preallocation case.
102 iotest changed, as our detector can't detect shrinked file as
metadata-preallocation, which don't seem to be wrong, as with metadata
preallocation we always have valid file length.
Two other iotests have a slight change in their QMP output sequence:
Active 'block-commit' returns earlier because the job coroutine yields
earlier on a blocking operation. This operation is loading the refcount
blocks in qcow2_detect_metadata_preallocation().
Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All block nodes and users in any connected component of the block graph
must be in the same AioContext, so changing the AioContext of one node
must not only change all of its children, but all of its parents (and
in turn their children etc.) as well.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since commit b97511c7bc, there is no reason for block drivers any more
to call these functions (see the function comment in block_int.h). They
are now just internal helper functions for bdrv_set_aio_context()
and can be made static.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Eventually, we want to make sure that all parents and all children of a
node are in the same AioContext as the node itself. This means that
changing the AioContext may fail because one of the other involved
parties (e.g. a guest device that was configured with an iothread)
cannot allow switching to a different AioContext.
Introduce a set of functions that allow to first check whether all
involved nodes can switch to a new context and only then do the actual
switch. The check recursively covers children and parents.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Cleaned up with scripts/clean-header-guards.pl.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190315145123.28030-9-armbru@redhat.com>
No one is using these functions anymore, all callers have switched to
the byte-based bdrv_pread() and bdrv_pwrite()
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The existing code to convert flag bits into strings looks a bit strange
now, and if we ever add more flags, it will look even stranger. Prevent
that from happening by making it look up the flag names in an array.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20190405191635.25740-1-mreitz@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_snapshot_dump(), bdrv_image_info_specific_dump(),
bdrv_image_info_dump() and their helpers take an fprintf()-like
callback and a FILE * to pass to it.
hmp.c passes monitor_printf() cast to fprintf_function and the current
monitor cast to FILE *.
qemu-img.c and qemu-io-cmds.c pass fprintf and stdout.
The type-punning is technically undefined behaviour, but works in
practice. Clean up: drop the callback, and call qemu_printf()
instead.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-8-armbru@redhat.com>
qsp_report() takes an fprintf()-like callback and a FILE * to pass to
it.
Its only caller hmp_sync_profile() passes monitor_fprintf() and the
current monitor cast to FILE *. monitor_fprintf() casts it right
back, and is otherwise identical to monitor_printf(). The
type-punning is ugly.
Drop the callback, and call qemu_printf() instead.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-7-armbru@redhat.com>
We know that the kernel implements a slow fallback code path for
BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
The other operations we call in the context of .bdrv_co_pwrite_zeroes
should usually be quick, so no modification should be needed for them.
If we ever notice that there are additional problematic cases, we can
still make these conditional as well.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
For qemu-img convert, we want an operation that zeroes out the whole
image if this can be done efficiently, but that returns an error
otherwise so we don't write explicit zeroes and immediately overwrite
them with the real data, potentially doubling the amount of data to be
written.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
If we reopen a BlockDriverState and there is an option that is present
in bs->options but missing from the new set of options then we have to
return an error unless the driver is able to reset it to its default
value.
This patch adds a new 'mutable_opts' field to BlockDriver. This is
a list of runtime options that can be modified during reopen. If an
option in this list is unspecified on reopen then it must be reset (or
return an error).
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch allows the user to change the backing file of an image that
is being reopened. Here's what it does:
- In bdrv_reopen_prepare(): check that the value of 'backing' points
to an existing node or is null. If it points to an existing node it
also needs to make sure that replacing the backing file will not
create a cycle in the node graph (i.e. you cannot reach the parent
from the new backing file).
- In bdrv_reopen_commit(): perform the actual node replacement by
calling bdrv_set_backing_hd().
There may be temporary implicit nodes between a BDS and its backing
file (e.g. a commit filter node). In these cases bdrv_reopen_prepare()
looks for the real (non-implicit) backing file and requires that the
'backing' option points to it. Replacing or detaching a backing file
is forbidden if there are implicit nodes in the middle.
Although x-blockdev-reopen is meant to be used like blockdev-add,
there's an important thing that must be taken into account: the only
way to set a new backing file is by using a reference to an existing
node (previously added with e.g. blockdev-add). If 'backing' contains
a dictionary with a new set of options ({"driver": "qcow2", "file": {
... }}) then it is interpreted that the _existing_ backing file must
be reopened with those options.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Children in QMP are specified with BlockdevRef / BlockdevRefOrNull,
which can contain a set of child options, a child reference, or
NULL. In optional attributes like "backing" it can also be missing.
Only the first case (set of child options) is being handled properly
by bdrv_reopen_queue(). This patch deals with all the others.
Here's how these cases should be handled when bdrv_reopen_queue() is
deciding what to do with each child of a BlockDriverState:
1) Set of child options: if the child was implicitly created (i.e
inherits_from points to the parent) then the options are removed
from the parent's options QDict and are passed to the child with
a recursive bdrv_reopen_queue() call. This case was already
working fine.
2) Child reference: there's two possibilites here.
2a) Reference to the current child: if the child was implicitly
created then it is put in the reopen queue, keeping its
current set of options (since this was a child reference
there was no way to specify a different set of options).
If the child is not implicit then it keeps its current set
of options but it is not reopened (and therefore does not
inherit any new option from the parent).
2b) Reference to a different BDS: the current child is not put
in the reopen queue at all. Passing a reference to a
different BDS can be used to replace a child, although at
the moment no driver implements this, so it results in an
error. In any case, the current child is not going to be
reopened (and might in fact disappear if it's replaced)
3) NULL: This is similar to (2b). Although no driver allows this
yet it can be used to detach the current child so it should not
be put in the reopen queue.
4) Missing option: at the moment "backing" is the only case where
this can happen. With "blockdev-add", leaving "backing" out
means that the default backing file is opened. We don't want to
open a new image during reopen, so we require that "backing" is
always present. We'll relax this requirement a bit in the next
patch. If keep_old_opts is true and "backing" is missing then
this behaves like 2a (the current child is reopened).
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The bdrv_reopen_queue() function is used to create a queue with
the BDSs that are going to be reopened and their new options. Once
the queue is ready bdrv_reopen_multiple() is called to perform the
operation.
The original options from each one of the BDSs are kept, with the new
options passed to bdrv_reopen_queue() applied on top of them.
For "x-blockdev-reopen" we want a function that behaves much like
"blockdev-add". We want to ignore the previous set of options so that
only the ones actually specified by the user are applied, with the
rest having their default values.
One of the things that we need is a way to tell bdrv_reopen_queue()
whether we want to keep the old set of options or not, and that's what
this patch does. All current callers are setting this new parameter to
true and x-blockdev-reopen will set it to false.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Our permission system is useful to define what operations are allowed
on a certain block node and includes things like BLK_PERM_WRITE or
BLK_PERM_RESIZE among others.
One of the permissions is BLK_PERM_GRAPH_MOD which allows "changing
the node that this BdrvChild points to". The exact meaning of this has
never been very clear, but it can be understood as "change any of the
links connected to the node". This can be used to prevent changing a
backing link, but it's too coarse.
This patch adds a new 'frozen' attribute to BdrvChild, which forbids
detaching the link from the node it points to, and new API to freeze
and unfreeze a backing chain.
After this change a few functions can fail, so they need additional
checks.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit a88b179f introduced the ability to set and query bitmap
persistence, but with an atypical spelling.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 20190308205845.25734-1-eblake@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
Instead of checking against busy, inconsistent, or read only directly,
use a check function with permissions bits that let us streamline the
checks without reproducing them in many places.
Included in this patch are permissions changes that simply add the
inconsistent check to existing permissions call spots, without
addressing existing bugs.
In general, this means that busy+readonly checks become BDRV_BITMAP_DEFAULT,
which checks against all three conditions. busy-only checks become
BDRV_BITMAP_ALLOW_RO.
Notably, remove allows inconsistent bitmaps, so it doesn't follow the pattern.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190301191545.8728-4-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
Add an inconsistent bit to dirty-bitmaps that allows us to report a bitmap as
persistent but potentially inconsistent, i.e. if we find bitmaps on a qcow2
that have been marked as "in use".
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190301191545.8728-2-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
These mean the same thing now. Unify them and rename the merged call
bdrv_dirty_bitmap_busy to indicate semantically what we are describing,
as well as help disambiguate from the various _locked and _unlocked
versions of bitmap helpers that refer to mutex locks.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190223000614.13894-8-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
"Frozen" was a good description a long time ago, but it isn't adequate now.
Rename the frozen predicate to has_successor to make the semantics of the
predicate more clear to outside callers.
In the process, remove some calls to frozen() that no longer semantically
make sense. For bdrv_enable_dirty_bitmap_locked and
bdrv_disable_dirty_bitmap_locked, it doesn't make sense to prohibit QEMU
internals from performing this action when we only wished to prohibit QMP
users from issuing these commands. All of the QMP API commands for bitmap
manipulation already check against user_locked() to prohibit these actions.
Several other assertions really want to check that the bitmap isn't in-use
by another operation -- use the bitmap_user_locked function for this instead,
which presently also checks for has_successor. This leaves some redundant
checks of has_successor through different helpers that are addressed in
forthcoming patches.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190223000614.13894-3-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
Provide an option to force QEMU to always keep the external data file
consistent as a standalone read-only raw image.
At the moment, this means making sure that write_zeroes requests are
forwarded to the data file instead of just updating the metadata, and
checking that no backing file is used.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_iterate_format (which is currently only used for printing out the
formats supported by the block layer) doesn't take format whitelisting
into account.
This creates a problem for tests: they enumerate supported formats to
decide which tests to enable, but then discover that QEMU doesn't let
them actually use some of those formats.
To avoid that, exclude formats that are not whitelisted from
enumeration, if whitelisting is in use. Since we have separate
whitelists for r/w and r/o, take this a parameter to
bdrv_iterate_format, and print two lists of supported formats (r/w and
r/o) in main qemu.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As with the previous patch to qemu-nbd, the nbd-server-start QMP command
also needs to be able to specify authorization when enabling TLS encryption.
First the client must create a QAuthZ object instance using the
'object-add' command:
{
'execute': 'object-add',
'arguments': {
'qom-type': 'authz-list',
'id': 'authz0',
'parameters': {
'policy': 'deny',
'rules': [
{
'match': '*CN=fred',
'policy': 'allow'
}
]
}
}
}
They can then reference this in the new 'tls-authz' parameter when
executing the 'nbd-server-start' command:
{
'execute': 'nbd-server-start',
'arguments': {
'addr': {
'type': 'inet',
'host': '127.0.0.1',
'port': '9000'
},
'tls-creds': 'tls0',
'tls-authz': 'authz0'
}
}
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Message-Id: <20190227162035.18543-3-berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Currently any client which can complete the TLS handshake is able to use
the NBD server. The server admin can turn on the 'verify-peer' option
for the x509 creds to require the client to provide a x509 certificate.
This means the client will have to acquire a certificate from the CA
before they are permitted to use the NBD server. This is still a fairly
low bar to cross.
This adds a '--tls-authz OBJECT-ID' option to the qemu-nbd command which
takes the ID of a previously added 'QAuthZ' object instance. This will
be used to validate the client's x509 distinguished name. Clients
failing the authorization check will not be permitted to use the NBD
server.
For example to setup authorization that only allows connection from a client
whose x509 certificate distinguished name is
CN=laptop.example.com,O=Example Org,L=London,ST=London,C=GB
escape the commas in the name and use:
qemu-nbd --object tls-creds-x509,id=tls0,dir=/home/berrange/qemutls,\
endpoint=server,verify-peer=yes \
--object 'authz-simple,id=auth0,identity=CN=laptop.example.com,,\
O=Example Org,,L=London,,ST=London,,C=GB' \
--tls-creds tls0 \
--tls-authz authz0 \
....other qemu-nbd args...
NB: a real shell command line would not have leading whitespace after
the line continuation, it is just included here for clarity.
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Message-Id: <20190227162035.18543-2-berrange@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: split long line in --help text, tweak 233 to show that whitespace
after ,, in identity= portion is actually okay]
Signed-off-by: Eric Blake <eblake@redhat.com>
Currently, BlockDriver.bdrv_refresh_filename() is supposed to both
refresh the filename (BDS.exact_filename) and set BDS.full_open_options.
Now that we have generic code in the central bdrv_refresh_filename() for
creating BDS.full_open_options, we can drop the latter part from all
BlockDriver.bdrv_refresh_filename() implementations.
This also means that we can drop all of the existing default code for
this from the global bdrv_refresh_filename() itself.
Furthermore, we now have to call BlockDriver.bdrv_refresh_filename()
after having set BDS.full_open_options, because the block driver's
implementation should now be allowed to depend on BDS.full_open_options
being set correctly.
Finally, with this patch we can drop the @options parameter from
BlockDriver.bdrv_refresh_filename(); also, add a comment on this
function's purpose in block/block_int.h while touching its interface.
This completely obsoletes blklogwrite's implementation of
.bdrv_refresh_filename().
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190201192935.18394-25-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Some follow-up patches will rework the way bs->full_open_options is
refreshed in bdrv_refresh_filename(). The new implementation will remove
the need for the block drivers' bdrv_refresh_filename() implementations
to set bs->full_open_options; instead, it will be generic and use static
information from each block driver.
However, by implementing bdrv_gather_child_options(), block drivers will
still be able to override the way the full_open_options of their
children are incorporated into their own.
We need to implement this function for VMDK because we have to prevent
the generic implementation from gathering the options of all children:
It is not possible to specify options for the extents through the
runtime options.
For quorum, the child names that would be used by the generic
implementation and the ones that we actually (currently) want to use
differ. See quorum_gather_child_options() for more information.
Note that both of these are cases which are not ideal: In case of VMDK
it would probably be nice to be able to specify options for all extents.
In case of quorum, the current runtime option structure is simply broken
and needs to be fixed (but that is left for another patch).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-23-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This new field can be set by block drivers to list the runtime options
they accept that may influence the contents of the respective BDS. As of
a follow-up patch, this list will be used by the common
bdrv_refresh_filename() implementation to decide which options to put
into BDS.full_open_options (and consequently whether a JSON filename has
to be created), thus freeing the drivers of having to implement that
logic themselves.
Additionally, this patch adds the field to all of the block drivers that
need it and sets it accordingly.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-22-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This function may be implemented by block drivers to derive a directory
name from a BDS. Concatenating this g_free()-able string with a relative
filename must result in a valid (not necessarily existing) filename, so
this is a function that should generally be not implemented by format
drivers, because this is protocol-specific.
If a BDS's driver does not implement this function, bdrv_dirname() will
fall through to the BDS's file if it exists. If it does not, the
exact_filename field will be used to generate a directory name.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-15-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Make bdrv_get_full_backing_filename() return an allocated string instead
of placing the result in a caller-provided buffer.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-12-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Make bdrv_get_full_backing_filename_from_filename() return an allocated
string instead of placing the result in a caller-provided buffer.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190201192935.18394-11-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Besides being safe for arbitrary path lengths, after some follow-up
patches all callers will want a freshly allocated buffer anyway.
In the meantime, path_combine_deprecated() is added which has the same
interface as path_combine() had before this patch. All callers to that
function will be converted in follow-up patches.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20190201192935.18394-10-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
If the backing file is overridden, this most probably does change the
guest-visible data of a BDS. Therefore, we will need to consider this
in bdrv_refresh_filename().
To see whether it has been overridden, we might want to compare
bs->backing_file and bs->backing->bs->filename. However,
bs->backing_file is changed by bdrv_set_backing_hd() (which is just used
to change the backing child at runtime, without modifying the image
header), so bs->backing_file most of the time simply contains a copy of
bs->backing->bs->filename anyway, so it is useless for such a
comparison.
This patch adds an auto_backing_file BDS field which contains the
backing file path as indicated by the image header, which is not changed
by bdrv_set_backing_hd().
Because of bdrv_refresh_filename() magic, however, a BDS's filename may
differ from what has been specified during bdrv_open(). Then, the
comparison between bs->auto_backing_file and bs->backing->bs->filename
may fail even though bs->backing was opened from bs->auto_backing_file.
To mitigate this, we can copy the real BDS's filename (after the whole
bdrv_open() and bdrv_refresh_filename() process) into
bs->auto_backing_file, if we know the former has been opened based on
the latter. This is only possible if no options modifying the backing
file's behavior have been specified, though. To simplify things, this
patch only copies the filename from the backing file if no options have
been specified for it at all.
Furthermore, there are cases where an overlay is created by qemu which
already contains a BDS's filename (e.g. in blockdev-snapshot-sync). We
do not need to worry about updating the overlay's bs->auto_backing_file
there, because we actually wrote a post-bdrv_refresh_filename() filename
into the image header.
So all in all, there will be false negatives where (as of a future
patch) bdrv_refresh_filename() will assume that the backing file differs
from what was specified in the image header, even though it really does
not. However, these cases should be limited to where (1) the user
actually did override something in the backing chain (e.g. by specifying
options for the backing file), or (2) the user executed a QMP command to
change some node's backing file (e.g. change-backing-file or
block-commit with @backing-file given) where the given filename does not
happen to coincide with qemu's idea of the backing BDS's filename.
Then again, (1) really is limited to -drive. With -blockdev or
blockdev-add, you have to adhere to the schema, so a user cannot give
partial "unimportant" options (e.g. by just setting backing.node-name
and leaving the rest to the image header). Therefore, trying to fix
this would mean trying to fix something for -drive only.
To improve on (2), we would need a full infrastructure to "canonicalize"
an arbitrary filename (+ options), so it can be compared against
another. That seems a bit over the top, considering that filenames
nowadays are there mostly for the user's entertainment.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-5-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Before this patch, bdrv_refresh_filename() is used in a pushing manner:
Whenever the BDS graph is modified, the parents of the modified edges
are supposed to be updated (recursively upwards). However, that is
nonviable, considering that we want child changes not to concern
parents.
Also, in the long run we want a pull model anyway: Here, we would have a
bdrv_filename() function which returns a BDS's filename, freshly
constructed.
This patch is an intermediate step. It adds bdrv_refresh_filename()
calls before every place a BDS.filename value is used. The only
exceptions are protocol drivers that use their own filename, which
clearly would not profit from refreshing that filename before.
Also, bdrv_get_encrypted_filename() is removed along the way (as a user
of BDS.filename), since it is completely unused.
In turn, all of the calls to bdrv_refresh_filename() before this patch
are removed, because we no longer have to call this function on graph
changes.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190201192935.18394-2-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
bdrv_check_perm in it's recursion checks each node in context of new
permissions for one parent, because of nature of DFS. It works well,
while children subgraph of top-most updated node is a tree, i.e. it
doesn't have any kind of loops. But if we have a loop (not oriented,
of course), i.e. we have two different ways from top-node to some
child-node, then bdrv_check_perm will do wrong thing:
top
| \
| |
v v
A B
| |
v v
node
It will once check new permissions of node in context of new A
permissions and old B permissions and once visa-versa. It's a wrong way
and may lead to corruption of permission system. We may start with
no-permissions and all-shared for both A->node and B->node relations
and finish up with non shared write permission for both ways.
The following commit will add a test, which shows this bug.
To fix this situation, let's really set BdrvChild permissions during
bdrv_check_perm procedure. And we are happy here, as check-perm is
already written in transaction manner, so we just need to restore
backed-up permissions in _abort.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of using the convenience wrapper qio_channel_read_all_eof(), use
the lower level QIOChannel API. This means duplicating some code, but
we'll need this because this coroutine yield is special: We want it to
be interruptible so that nbd_client_attach_aio_context() can correctly
reenter the coroutine.
This moves the bdrv_dec/inc_in_flight() pair into nbd_read_eof(), so
that connection_co will always sit in this exact qio_channel_yield()
call when bdrv_drain() returns.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The only caller of nbd_read_eof() is nbd_receive_reply(), so it doesn't
have to live in the header file, but can move next to its caller.
Also add the missing coroutine_fn to the function and its caller.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
After the previous patch, the only instance of this function left
is inside qemu-img.c.
qemu-img is using it inside the 'img_snapshot' function to delete
snapshots in the SNAPSHOT_DELETE case, based on a "snapshot_name"
string that refers to the tag, not ID, of the QEMUSnapshotInfo struct.
This can be verified by checking the SNAPSHOT_CREATE case that
comes shortly before SNAPSHOT_DELETE. In that case, the same
"snapshot_name" variable is being strcpy to the 'name' field
of the QEMUSnapshotInfo struct sn:
pstrcpy(sn.name, sizeof(sn.name), snapshot_name);
Based on that, it is unlikely that "snapshot_name" might contain
an "id" in SNAPSHOT_DELETE.
This patch changes SNAPSHOT_DELETE to use snapshot_find() and
snapshot_delete() instead of bdrv_snapshot_delete_by_id_or_name.
After that, there is no instances left of bdrv_snapshot_delete_by_id_or_name
in the code, so it is safe to remove it entirely.
Suggested-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Inform a user in case qcow2_get_specific_info fails to obtain
QCOW2 image specific information. This patch is preliminary to
the one "qcow2: Add list of bitmaps to ImageInfoSpecificQCow2".
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <1549638368-530182-2-git-send-email-andrey.shinkevich@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We generally do very similar things around nbd_read: error_prepend
specifying what we have tried to read, and be_to_cpu conversion of
integers.
So, it seems reasonable to move common things to helper functions,
which:
1. simplify code a bit
2. generalize nbd_read error descriptions, all starting with
"Failed to read"
3. make it more difficult to forget to convert things from BE
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190128165830.165170-1-vsementsov@virtuozzo.com>
[eblake: rename macro to DEF_NBD_READ_N and formatting tweaks;
checkpatch has false positive complaint]
Signed-off-by: Eric Blake <eblake@redhat.com>
Add a new command, returning block nodes (and their users) graph.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20181221170909.25584-2-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
The following QMP command leads to a crash when iothreads are used:
{ 'execute': 'device_del', 'arguments': {'id': 'data'} }
The backtrace involves the queue restart coroutine where
tgm->throttle_state is a NULL pointer because
throttle_group_unregister_tgm() has already been called:
(gdb) bt full
#0 0x00005585a7a3b378 in qemu_mutex_lock_impl (mutex=0xffffffffffffffd0, file=0x5585a7bb3d54 "block/throttle-groups.c", line=412) at util/qemu-thread-posix.c:64
err = <optimized out>
__PRETTY_FUNCTION__ = "qemu_mutex_lock_impl"
__func__ = "qemu_mutex_lock_impl"
#1 0x00005585a79be074 in throttle_group_restart_queue_entry (opaque=0x5585a9de4eb0) at block/throttle-groups.c:412
_f = <optimized out>
data = 0x5585a9de4eb0
tgm = 0x5585a9079440
ts = 0x0
tg = 0xffffffffffffff98
is_write = false
empty_queue = 255
This coroutine should not execute in the iothread after the throttle
group member has been unregistered!
The root cause is that the device_del code path schedules the restart
coroutine in the iothread while holding the AioContext lock. Therefore
the iothread cannot execute the coroutine until after device_del
releases the lock - by this time it's too late.
This patch adds a reference count to ThrottleGroupMember so we can
synchronously wait for restart coroutines to complete. Once they are
done it is safe to unregister the ThrottleGroupMember.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190114133257.30299-2-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We want to be able to detect whether a given qemu NBD server is
exposing the right export(s) and dirty bitmaps, at least for
regression testing. We could use 'nbd-client -l' from the upstream
NBD project to list exports, but it's annoying to rely on
out-of-tree binaries; furthermore, nbd-client doesn't necessarily
know about all of the qemu NBD extensions. Thus, we plan on adding
a new mode to qemu-nbd that merely sniffs all possible information
from the server during handshake phase, then disconnects and dumps
the information.
This patch continues the work of the previous patch, by adding the
ability to track the list of available meta contexts into
NBDExportInfo. It benefits from the recent refactoring patches
with a new nbd_list_meta_contexts() that reuses much of the same
framework as setting a meta context.
Note: a malicious server could exhaust memory of a client by feeding
an unending loop of contexts; perhaps we could place a limit on how
many we are willing to receive. But this is no different from our
earlier analysis on a server sending an unending list of exports,
and the death of a client due to memory exhaustion when the client
was going to exit soon anyways is not really a denial of service
attack.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190117193658.16413-19-eblake@redhat.com>
We want to be able to detect whether a given qemu NBD server is
exposing the right export(s) and dirty bitmaps, at least for
regression testing. We could use 'nbd-client -l' from the upstream
NBD project to list exports, but it's annoying to rely on
out-of-tree binaries; furthermore, nbd-client doesn't necessarily
know about all of the qemu NBD extensions. Thus, we plan on adding
a new mode to qemu-nbd that merely sniffs all possible information
from the server during handshake phase, then disconnects and dumps
the information.
This patch adds the low-level client code for grabbing the list
of exports. It benefits from the recent refactoring patches, in
order to share as much code as possible when it comes to doing
validation of server replies. The resulting information is stored
in an array of NBDExportInfo which has been expanded to any
description string, along with a convenience function for freeing
the list.
Note: a malicious server could exhaust memory of a client by feeding
an unending loop of exports; perhaps we should place a limit on how
many we are willing to receive. But note that a server could
reasonably be serving an export for every file in a large directory,
where an arbitrary limit in the client means we can't list anything
from such a server; the same happens if we just run until the client
fails to malloc() and thus dies by an abort(), where the limit is
no longer arbitrary but determined by available memory. Since the
client is already planning on being short-lived, it's hard to call
this a denial of service attack that would starve off other uses,
so it does not appear to be a security issue.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Message-Id: <20190117193658.16413-18-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Pass 'info' instead of three separate parameters related to info,
when requesting the server to set the meta context. Update the
NBDExportInfo struct to rename the received id field to match the
fact that we are currently overloading the field to match whatever
context the user supplied through the x-dirty-bitmap hack, as well
as adding a TODO comment to remind future patches about a desire
to request two contexts at once.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190117193658.16413-11-eblake@redhat.com>
Refactor the 'name' parameter of nbd_receive_negotiate() from
being a separate parameter into being part of the in-out 'info'.
This also spills over to a simplification of nbd_opt_go().
The main driver for this refactoring is that an upcoming patch
would like to add support to qemu-nbd to list information about
all exports available on a server, where the name(s) will be
provided by the server instead of the client. But another benefit
is that we can now allow the client to explicitly specify the
empty export name "" even when connecting to an oldstyle server
(even if qemu is no longer such a server after commit 7f7dfe2a).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190117193658.16413-10-eblake@redhat.com>
Although our compile-time environment is set up so that we always
support long files with 64-bit off_t, we have no guarantee whether
off_t is the same type as int64_t. This requires casts when
printing values, and prevents us from directly using qemu_strtoi64()
(which will be done in the next patch). Let's just flip to uint64_t
where possible, and stick to int64_t for detecting failure of
blk_getlength(); we also keep the assertions added in the previous
patch that the resulting values fit in 63 bits. The overflow check
in nbd_co_receive_request() was already sane (request->from is
validated to fit in 63 bits, and request->len is 32 bits, so the
addition can't overflow 64 bits), but rewrite it in a form easier
to recognize as a typical overflow check.
Rename the variable 'description' to keep line lengths reasonable.
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190117193658.16413-7-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
This reverts commit 72d10a9421.
The function is unused now.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
The function alters bdrv_dirty_iter_next_area(), which is wrong and
less efficient (see further commit
"block/mirror: fix and improve do_sync_target_write" for description).
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
We only have one caller that wants to export a bitmap name,
which it does right after creation of the export. But there is
still a brief window of time where an NBD client could see the
export but not the dirty bitmap, which a robust client would
have to interpret as meaning the entire image should be treated
as dirty. Better is to eliminate the window entirely, by
inlining nbd_export_bitmap() into nbd_export_new(), and refusing
to create the bitmap in the first place if the requested bitmap
can't be located.
We also no longer need logic for setting a different bitmap
name compared to the bitmap being exported.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190111194720.15671-8-eblake@redhat.com>
The existing NBD code had a weird split where nbd_export_new()
created an export but did not add it to the list of exported
names until a later nbd_export_set_name() came along and grabbed
a second reference on the object; later, the first call to
nbd_export_close() drops the second reference while removing
the export from the list. This is in part because the QAPI
NbdServerRemoveNode enum documents the possibility of adding a
mode where we could do a soft disconnect: preventing new clients,
but waiting for existing clients to gracefully quit, based on
the mode used when calling nbd_export_close().
But in spite of all that, note that we never change the name of
an NBD export while it is exposed, which means it is easier to
just inline the process of setting the name as part of creating
the export.
Inline the contents of nbd_export_set_name() and
nbd_export_set_description() into the two points in an export
lifecycle where they matter, then adjust both callers to pass
the name up front. Note that for creation, all callers pass a
non-NULL name, (passing NULL at creation was for old style
servers, but we removed support for that in commit 7f7dfe2a),
so we can add an assert and do things unconditionally; but for
cleanup, because of the dual nature of nbd_export_close(), we
still have to be careful to avoid use-after-free. Along the
way, add a comment reminding ourselves of the potential of
adding a middle mode disconnect.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190111194720.15671-5-eblake@redhat.com>
These functions are used for formatting pretty trace points. We are
going to add some in block/nbd-client, so, let's publish all these
functions at once. Note, that nbd_reply_type_lookup is already
published, and constants, "named" by these functions live in
include/block/nbd.h too.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20181102151152.288399-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Now that all callers are passing all flag changes as QDict options,
the flags parameter is no longer necessary, so we can get rid of it.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
No one is using this function anymore, so we can safely remove it.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Most callers of bdrv_reopen() only use it to switch a BlockDriverState
between read-only and read-write, so this patch adds a new function
that does just that.
We also want to get rid of the flags parameter in the bdrv_reopen()
API, so this function sets the "read-only" option and passes the
original flags (which will then be updated in bdrv_reopen_prepare()).
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Some block drivers have traditionally changed their node to read-only
mode without asking the user. This behaviour has been marked deprecated
since 2.11, expecting users to provide an explicit read-only=on option.
Now that we have auto-read-only=on, enable these drivers to make use of
the option.
This is the only use of bdrv_set_read_only(), so we can make it a bit
more specific and turn it into a bdrv_apply_auto_read_only() that is
more convenient for drivers to use.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
If a management application builds the block graph node by node, the
protocol layer doesn't inherit its read-only option from the format
layer any more, so it must be set explicitly.
Backing files should work on read-only storage, but at the same time, a
block job like commit should be able to reopen them read-write if they
are on read-write storage. However, without option inheritance, reopen
only changes the read-only option for the root node (typically the
format layer), but not the protocol layer, so reopening fails (the
format layer wants to get write permissions, but the protocol layer is
still read-only).
A simple workaround for the problem in the management tool would be to
open the protocol layer always read-write and to make only the format
layer read-only for backing files. However, sometimes the file is
actually stored on read-only storage and we don't know whether the image
can be opened read-write (for example, for NBD it depends on the server
we're trying to connect to). This adds an option that makes QEMU try to
open the image read-write, but allows it to degrade to a read-only mode
without returning an error.
The documentation for this option is consciously phrased in a way that
allows QEMU to switch to a better model eventually: Instead of trying
when the image is first opened, making the read-only flag dynamic and
changing it automatically whenever the first BLK_PERM_WRITE user is
attached or the last one is detached would be much more useful
behaviour.
Unfortunately, this more useful behaviour is also a lot harder to
implement, and libvirt needs a solution now before it can switch to
-blockdev, so let's start with this easier approach for now.
Instead of adding a new auto-read-only option, turning the existing
read-only into an enum (with a bool alternate for compatibility) was
considered, but it complicated the implementation to the point that it
didn't seem to be worth it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This patch aims to bring the following behavior:
1. We don't load bitmaps, when started in inactive mode. It's the case
of incoming migration. In this case we wait for bitmaps migration
through migration channel (if 'dirty-bitmaps' capability is enabled) or
for invalidation (to load bitmaps from the image).
2. We don't remove persistent bitmaps on inactivation. Instead, we only
remove bitmaps after storing. This is the only way to restore bitmaps,
if we decided to resume source after [failed] migration with
'dirty-bitmaps' capability enabled (which means, that bitmaps were not
stored).
3. We load bitmaps on open and any invalidation, it's ok for all cases:
- normal open
- migration target invalidation with dirty-bitmaps capability
(bitmaps are migrating through migration channel, the are not
stored, so they should have IN_USE flag set and will be skipped
when loading. However, it would fail if bitmaps are read-only[1])
- migration target invalidation without dirty-bitmaps capability
(normal load of the bitmaps, if migrated with shared storage)
- source invalidation with dirty-bitmaps capability
(skip because IN_USE)
- source invalidation without dirty-bitmaps capability
(bitmaps were dropped, reload them)
[1]: to accurately handle this, migration of read-only bitmaps is
explicitly forbidden in this patch.
New mechanism for not storing bitmaps when migrate with dirty-bitmaps
capability is introduced: migration filed in BdrvDirtyBitmap.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Instead of both frozen and qmp_locked checks, wrap it into one check.
frozen implies the bitmap is split in two (for backup), and shouldn't
be modified. qmp_locked implies it's being used by another operation,
like being exported over NBD. In both cases it means we shouldn't allow
the user to modify it in any meaningful way.
Replace any usages where we check both frozen and qmp_locked with the
new check.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20181002230218.13949-2-jsnow@redhat.com
[w/edits Suggested-By: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>]
Signed-off-by: John Snow <jsnow@redhat.com>
Add backup parameter to bdrv_merge_dirty_bitmap() to be used then with
bdrv_restore_dirty_bitmap() if it needed to restore the bitmap after
merge operation.
This is needed to implement bitmap merge transaction action in further
commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Use more generic names to reuse the function for bitmap merge in the
following commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Attributes are simple flags, associated with individual timers for their
whole lifetime. They intended to be used to mark individual timers for
special handling when they fire.
New/init functions family in timer interface updated and refactored (new
'attribute' argument added, timer_list replaced with timer_list_group+type
combinations, comments improved to avoid info duplication). Also existing
aio interface extended with attribute-enabled variants of functions,
which create/initialize timers.
Signed-off-by: Artem Pisarenko <artem.k.pisarenko@gmail.com>
Message-Id: <f47b81dbce734e9806f9516eba8ca588e6321c2f.1539764043.git.artem.k.pisarenko@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit bc37b06a5 added NBD_CMD_CACHE support, but used the wrong value
for NBD_FLAG_SEND_CACHE flag for negotiation. That commit picked bit 8,
which had already been assigned by the NBD specification to mean
NBD_FLAG_CAN_MULTI_CONN, and which was already implemented in the
Linux kernel as a part of stable userspace-kernel API since 4.10:
"bit 8, NBD_FLAG_CAN_MULTI_CONN: Indicates that the server operates
entirely without cache, or that the cache it uses is shared among all
connections to the given device. In particular, if this flag is
present, then the effects of NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
MUST be visible across all connections when the server sends its reply
to that command to the client. In the absense of this flag, clients
SHOULD NOT multiplex their commands over more than one connection to
the export.
...
bit 10, NBD_FLAG_SEND_CACHE: documents that the server understands
NBD_CMD_CACHE; however, note that server implementations exist
which support the command without advertising this bit, and
conversely that this bit does not guarantee that the command will
succeed or have an impact."
Consequences:
- a client trying to use NBD_CMD_CACHE per the NBD spec will not
see the feature as available from a qemu 3.0 server (not fatal,
clients already have to be prepared for caching to not exist)
- a client accidentally coded to the qemu 3.0 bit value instead
of following the spec may interpret NBD_CMD_CACHE as being available
when it is not (probably not fatal, the spec says the server should
gracefully fail unknown commands, and that clients of NBD_CMD_CACHE
should be prepared for failure even when the feature is advertised);
such clients are unlikely (perhaps only in unreleased Virtuozzo code),
and will disappear over time
- a client prepared to use multiple connections based on
NBD_FLAG_CAN_MULTI_CONN may cause data corruption when it assumes
that caching is consistent when in reality qemu 3.0 did not have
a consistent cache. Partially mitigated by using read-only
connections (where nothing needs to be flushed, so caching is
indeed consistent) or when using qemu-nbd with the default -e 1
(at most one client at a time); visible only when using -e 2 or
more for a writable export.
Thus the commit fixes negotiation flag in QEMU according to the
specification.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Valery Vdovin <valery.vdovin@acronis.com>
CC: Eric Blake <eblake@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: qemu-stable@nongnu.org
Message-Id: <20181004100313.4253-1-den@openvz.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: enhance commit message, add defines for unimplemented flags]
Signed-off-by: Eric Blake <eblake@redhat.com>
After the previous commit, nbd_client_new's first parameter is always
NULL. Let's drop it with all corresponding old-style negotiation code
path which is unreachable now.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20181003170228.95973-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: re-wrap short line]
Signed-off-by: Eric Blake <eblake@redhat.com>
'detect-zeroes' is one of the basic BlockdevOptions available for all
drivers, but it's not handled by bdrv_reopen_prepare(), so any attempt
to change it results in an error:
(qemu) qemu-io virtio0 "reopen -o detect-zeroes=on"
Cannot change the option 'detect-zeroes'
Since there's no reason why we shouldn't allow changing it and the
implementation is simple let's just do it.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When draining a block node, we recurse to its parent and for subtree
drains also to its children. A single AIO_WAIT_WHILE() is then used to
wait for bdrv_drain_poll() to become true, which depends on all of the
nodes we recursed to. However, if the respective child or parent becomes
quiescent and calls bdrv_wakeup(), only the AioWait of the child/parent
is checked, while AIO_WAIT_WHILE() depends on the AioWait of the
original node.
Fix this by using a single AioWait for all callers of AIO_WAIT_WHILE().
This may mean that the draining thread gets a few more unnecessary
wakeups because an unrelated operation got completed, but we already
wake it up when something _could_ have changed rather than only if it
has certainly changed.
Apart from that, drain is a slow path anyway. In theory it would be
possible to use wakeups more selectively and still correctly, but the
gains are likely not worth the additional complexity. In fact, this
patch is a nice simplification for some places in the code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Even if AIO_WAIT_WHILE() is called in the home context of the
AioContext, we still want to allow the condition to change depending on
other threads as long as they kick the AioWait. Specfically block jobs
can be running in an I/O thread and should then be able to kick a drain
in the main loop context.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
In the context of draining a BDS, the .drained_poll callback of block
jobs is called. If this returns true (i.e. there is still some activity
pending), the drain operation may call aio_poll() with blocking=true to
wait for completion.
As soon as the pending activity is completed and the job finally arrives
in a quiescent state (i.e. its coroutine either yields with busy=false
or terminates), the block job must notify the aio_poll() loop to wake
up, otherwise we get a deadlock if both are running in different
threads.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Add support for taking and passing forward job creation flags.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180906130225.5118-4-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add support for taking and passing forward job creation flags.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180906130225.5118-3-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add support for taking and passing forward job creation flags.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180906130225.5118-2-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
BDRV_REQ_MAY_UNMAP in a write_zeroes request does not only allow the
driver to unmap the blocks, but it actively requests that the blocks be
unmapped afterwards if at all possible.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This matches the types used for bytes in the rest parts of block layer.
In the case of bdrv_co_truncate, new_bytes can be the image size which
probably doesn't fit in a 32 bit int.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Other I/O functions are already using a BdrvChild pointer in the API, so
make discard do the same. It makes it possible to initiate the same
permission checks before doing I/O, and much easier to share the
helper functions for this, which will be added and used by write,
truncate and copy range paths.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Serialized writes should be used in copy-on-write of backup(sync=none)
for image fleecing scheme.
We need to change an assert in bdrv_aligned_pwritev, added in
28de2dcd88. The assert may fail now, because call to
wait_serialising_requests here may become first call to it for this
request with serializing flag set. It occurs if the request is aligned
(otherwise, we should already set serializing flag before calling
bdrv_aligned_pwritev and correspondingly waited for all intersecting
requests). However, for aligned requests, we should not care about
outdating of previously read data, as there no such data. Therefore,
let's just update an assert to not care about aligned requests.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Pass read flags and write flags separately. This is needed to handle
coming BDRV_REQ_NO_SERIALISING clearly in following patches.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Here two things are fixed:
1. Architecture
On each recursion step, we go to the child of src or dst, only for one
of them. So, it's wrong to create tracked requests for both on each
step. It leads to tracked requests duplication.
2. Wait for serializing requests on write path independently of
BDRV_REQ_NO_SERIALISING
Before commit 9ded4a0114 "backup: Use copy offloading",
BDRV_REQ_NO_SERIALISING was used for only one case: read in
copy-on-write operation during backup. Also, the flag was handled only
on read path (in bdrv_co_preadv and bdrv_aligned_preadv).
After 9ded4a0114, flag is used for not waiting serializing operations
on backup target (in same case of copy-on-write operation). This
behavior change is unsubstantiated and potentially dangerous, let's
drop it and add additional asserts and documentation.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit dcf94a23b1 ('block: Don't poll in parent drain callbacks')
removed polling in bdrv_child_cb_drained_begin() on the grounds that the
original bdrv_drain() already will poll and BdrvChildRole.drained_begin
calls must not cause graph changes (and therefore must not call
aio_poll() or the recursion through the graph will break.
This reasoning is correct for calls through bdrv_do_drained_begin().
However, BdrvChildRole.drained_begin is also called when a node that is
already in a drained section (i.e. bdrv_do_drained_begin() has already
returned and therefore can't poll any more) is attached to a new parent.
In this case, we must explicitly poll to have all requests completed
before the drained new child can be attached to the parent.
In bdrv_replace_child_noperm(), we know that we're not inside the
recursion of bdrv_do_drained_begin() because graph changes are not
allowed there, and bdrv_replace_child_noperm() is a graph change. The
call of BdrvChildRole.drained_begin() must therefore be followed by a
BDRV_POLL_WHILE() that waits for the completion of requests.
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
- qcow2: Use worker threads for compression to improve performance of
'qemu-img convert -W' and compressed backup jobs
- blklogwrites: New filter driver to log write requests to an image in
the dm-log-writes format
- file-posix: Fix image locking during image creation
- crypto: Fix memory leak in error path
- Error out instead of silently truncating node names
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbPfHhAAoJEH8JsnLIjy/Wxp8P/jeEJ/jJ0N2doix5KJSvtkdP
dYx4cJxHg9tMSnmBfA49qc1ilgbzZqjbF/wH3Ko3zPsmFP0xsxx+9zE4c3iL7HA6
6E4/r3szznO8DIxdgzAIeHjZc7YPpsY3alrthT55eR/FnDyc8zofi/iUMgvTKPYA
8TnWqgi2fdaVwNy+lIRJFdhzzQNtPxOEO+0DFHaOZvQ9vlc5DGPC+Hj3Qc11GK8M
u7orkeYqQPknxy0hJKPxtWHvzCQUJMcSSs6PbuslnzOerYiQLmx+RVIwMhXfVZjV
lXe2SppAszBujtcIENhZlj1cECs7MrWTmFDcWvBA+Mh/JhFEWwykmlQHYrjCqdvw
QyWhVLgP/6jQDaBls6LADSZDxX7i0sV27DzVGN4gYUX/KcyVPEDROm90MyE8nxN0
Y5hhujTkzu94Zvd/OlKZRs4tPxmbRrd2SWy0id8Kj5/gO/+UWECOZCrPrlB5uzec
bsxHoeXLVe2/56JkCIiOw3hLILMY6gPLJgeaQjz6hu1oZJqYuoof8grVFjUCSH0C
BBlz8O4upHdPIKNRl4rmdgwasg5YIiJ7SFl4h7KMCD0elkKmo3SyTKosagvqrpVN
JvW1YuosCedcfYIKNPnZdEGWDKuTaZtcnZnM9IRY6hr8ySMw1LloFsgXsJ3tqTwe
wb1Wjm4Qlz5BaoB7o/Os
=z5AC
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches:
- qcow2: Use worker threads for compression to improve performance of
'qemu-img convert -W' and compressed backup jobs
- blklogwrites: New filter driver to log write requests to an image in
the dm-log-writes format
- file-posix: Fix image locking during image creation
- crypto: Fix memory leak in error path
- Error out instead of silently truncating node names
# gpg: Signature made Thu 05 Jul 2018 11:24:33 BST
# gpg: using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream:
file-posix: Unlock FD after creation
file-posix: Fix creation locking
block/blklogwrites: Add an option for the update interval of the log superblock
block/blklogwrites: Add an option for appending to an old log
block/blklogwrites: Change log_sector_size from int64_t to uint64_t
block/crypto: Fix memory leak in create error path
block: Don't silently truncate node names
block: Add blklogwrites
block: Move two block permission constants to the relevant enum
qcow2: add compress threads
qcow2: refactor data compression
qemu-img: allow compressed not-in-order writes
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This allows using the two constants outside of block.c, which will
happen in a subsequent patch.
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add _locked version of bdrv_enable_dirty_bitmap, to fix dirty bitmap
migration in the following patch.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20180625165745.25259-2-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
This semantics is needed by drive-backup so implement it before using
this API there.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180703023758.14422-3-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
In order to test that the NBD server is properly advertising
dirty bitmaps, we need a bare minimum client that can request
and read the context. Since feature freeze for 3.0 is imminent,
this is the smallest workable patch, which replaces the qemu
block status report with the results of the NBD server's dirty
bitmap (making it very easy to use 'qemu-img map --output=json'
to learn where the dirty portions are). Note that the NBD
protocol defines a dirty section with the same bit but opposite
sense that normal "base:allocation" uses to report an allocated
section; so in qemu-img map output, "data":true corresponds to
clean, "data":false corresponds to dirty.
A more complete solution that allows dirty bitmaps to be queried
at the same time as normal block status will be required before
this addition can lose the x- prefix. Until then, the fact that
this replaces normal status with dirty status means actions
like 'qemu-img convert' will likely misbehave due to treating
dirty regions of the file as if they are unallocated.
The next patch adds an iotest to exercise this new code.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180702191458.28741-2-eblake@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Now that all callers of vectored I/O have been converted
to use our preferred byte-based bdrv_co_p{read,write}v(), we can
delete the unused bdrv_co_{read,write}v().
Furthermore, this gets rid of the signature difference between the
public bdrv_co_writev() and the callback .bdrv_co_writev (the
latter still exists, because some drivers still need more work
before they are fully byte-based).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This moves the code to resize an image file to the thread pool to avoid
blocking.
Creating large images with preallocation with blockdev-create is now
actually a background job instead of blocking the monitor (and most
other things) until the preallocation has completed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
When growing an image, block drivers (especially protocol drivers) may
initialise the newly added area. I/O requests to the same area need to
wait for this initialisation to be completed so that data writes don't
get overwritten and reads don't read uninitialised data.
To avoid overhead in the fast I/O path by adding new locking in the
protocol drivers and to restrict the impact to requests that actually
touch the new area, reuse the existing tracked request infrastructure in
block/io.c and mark all discard requests as serialising.
With this change, it is safe for protocol drivers to make
.bdrv_co_truncate actually asynchronous.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
This moves the bdrv_truncate() implementation from block.c to block/io.c
so it can have access to the tracked requests infrastructure.
This involves making refresh_total_sectors() public (in block_int.h).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_truncate() is an operation that can block (even for a quite long
time, depending on the PreallocMode) in I/O paths that shouldn't block.
Convert it to a coroutine_fn so that we have the infrastructure for
drivers to make their .bdrv_co_truncate implementation asynchronous.
This change could potentially introduce new race conditions because
bdrv_truncate() isn't necessarily executed atomically any more. Whether
this is a problem needs to be evaluated for each block driver that
supports truncate:
* file-posix/win32, gluster, iscsi, nfs, rbd, ssh, sheepdog: The
protocol drivers are trivially safe because they don't actually yield
yet, so there is no change in behaviour.
* copy-on-read, crypto, raw-format: Essentially just filter drivers that
pass the request to a child node, no problem.
* qcow2: The implementation modifies metadata, so it needs to hold
s->lock to be safe with concurrent I/O requests. In order to avoid
double locking, this requires pulling the locking out into
preallocate_co() and using qcow2_write_caches() instead of
bdrv_flush().
* qed: Does a single header update, this is fine without locking.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
laio_init() can fail for a couple of reasons, which will lead to a NULL
pointer dereference in laio_attach_aio_context().
To solve this, add a aio_setup_linux_aio() function which is called
early in raw_open_common. If this fails, propagate the error up. The
signature of aio_get_linux_aio() was not modified, because it seems
preferable to return the actual errno from the possible failing
initialization calls.
Additionally, when the AioContext changes, we need to associate a
LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context
callback and call the new aio_setup_linux_aio(), which will allocate a
new AioContext if needed, and return errors on failures. If it fails for
any reason, fallback to threaded AIO with an error message, as the
device is already in-use by the guest.
Add an assert that aio_get_linux_aio() cannot return NULL.
Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>
Message-id: 20180622193700.6523-1-naravamudan@digitalocean.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Handle nbd CACHE command. Just do read, without sending read data back.
Cache mechanism should be done by exported node driver chain.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20180413143156.11409-1-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: fix two missing case labels in switch statements]
Signed-off-by: Eric Blake <eblake@redhat.com>
Handle a new NBD meta namespace: "qemu", and corresponding queries:
"qemu:dirty-bitmap:<export bitmap name>".
With the new metadata context negotiated, BLOCK_STATUS query will reply
with dirty-bitmap data, converted to extents. The new public function
nbd_export_bitmap selects which bitmap to export. For now, only one bitmap
may be exported.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20180609151758.17343-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: wording tweaks, minor cleanups, additional tracing]
Signed-off-by: Eric Blake <eblake@redhat.com>
This patch allows the user to specify whether to use active or only
background mode for mirror block jobs. Currently, this setting will
remain constant for the duration of the entire block job.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180613181823.13618-14-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This new function allows to look for a consecutively dirty area in a
dirty bitmap.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180613181823.13618-10-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
bdrv_drain_all_*() used bdrv_next() to iterate over all root nodes and
did a subtree drain for each of them. This works fine as long as the
graph is static, but sadly, reality looks different.
If the graph changes so that root nodes are added or removed, we would
have to compensate for this. bdrv_next() returns each root node only
once even if it's the root node for multiple BlockBackends or for a
monitor-owned block driver tree, which would only complicate things.
The much easier and more obviously correct way is to fundamentally
change the way the functions work: Iterate over all BlockDriverStates,
no matter who owns them, and drain them individually. Compensation is
only necessary when a new BDS is created inside a drain_all section.
Removal of a BDS doesn't require any action because it's gone afterwards
anyway.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the future, bdrv_drained_all_begin/end() will drain all invidiual
nodes separately rather than whole subtrees. This means that we don't
want to propagate the drain to all parents any more: If the parent is a
BDS, it will already be drained separately. Recursing to all parents is
unnecessary work and would make it an O(n²) operation.
Prepare the drain function for the changed drain_all by adding an
ignore_bds_parents parameter to the internal implementation that
prevents the propagation of the drain to BDS parents. We still (have to)
propagate it to non-BDS parents like BlockBackends or Jobs because those
are not drained separately.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_drain_all() wants to have a single polling loop for draining the
in-flight requests of all nodes. This means that the AIO_WAIT_WHILE()
condition relies on activity in multiple AioContexts, which is polled
from the mainloop context. We must therefore call AIO_WAIT_WHILE() from
the mainloop thread and use the AioWait notification mechanism.
Just randomly picking the AioContext of any non-mainloop thread would
work, but instead of bothering to find such a context in the caller, we
can just as well accept NULL for ctx.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_do_drained_begin() is only safe if we have a single
BDRV_POLL_WHILE() after quiescing all affected nodes. We cannot allow
that parent callbacks introduce a nested polling loop that could cause
graph changes while we're traversing the graph.
Split off bdrv_do_drained_begin_quiesce(), which only quiesces a single
node without waiting for its requests to complete. These requests will
be waited for in the BDRV_POLL_WHILE() call down the call chain.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Anything can happen inside BDRV_POLL_WHILE(), including graph
changes that may interfere with its callers (e.g. child list iteration
in recursive callers of bdrv_do_drained_begin).
Switch to a single BDRV_POLL_WHILE() call for the whole subtree at the
end of bdrv_do_drained_begin() to avoid such effects. The recursion
happens now inside the loop condition. As the graph can only change
between bdrv_drain_poll() calls, but not inside of it, doing the
recursion here is safe.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We already requested that block jobs be paused in .bdrv_drained_begin,
but no guarantee was made that the job was actually inactive at the
point where bdrv_drained_begin() returned.
This introduces a new callback BdrvChildRole.bdrv_drained_poll() and
uses it to make bdrv_drain_poll() consider block jobs using the node to
be drained.
For the test case to work as expected, we have to switch from
block_job_sleep_ns() to qemu_co_sleep_ns() so that the test job is even
considered active and must be waited for when draining the node.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 91af091f92 added an additional aio_poll() to BDRV_POLL_WHILE()
in order to make sure that all pending BHs are executed on drain. This
was the wrong place to make the fix, as it is useless overhead for all
other users of the macro and unnecessarily complicates the mechanism.
This patch effectively reverts said commit (the context has changed a
bit and the code has moved to AIO_WAIT_WHILE()) and instead polls in the
loop condition for drain.
The effect is probably hard to measure in any real-world use case
because actual I/O will dominate, but if I run only the initialisation
part of 'qemu-img convert' where it calls bdrv_block_status() for the
whole image to find out how much data there is copy, this phase actually
needs only roughly half the time after this patch.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Configuration flows through the block subsystem in a rather peculiar
way. Configuration made with -drive enters it as QemuOpts.
Configuration made with -blockdev / blockdev-add enters it as QAPI
type BlockdevOptions. The block subsystem uses QDict, QemuOpts and
QAPI types internally. The precise flow is next to impossible to
explain (I tried for this commit message, but gave up after wasting
several hours). What I can explain is a flaw in the BlockDriver
interface that leads to this bug:
$ qemu-system-x86_64 -blockdev node-name=n1,driver=nfs,server.type=inet,server.host=localhost,path=/foo/bar,user=1234
qemu-system-x86_64: -blockdev node-name=n1,driver=nfs,server.type=inet,server.host=localhost,path=/foo/bar,user=1234: Internal error: parameter user invalid
QMP blockdev-add is broken the same way.
Here's what happens. The block layer passes configuration represented
as flat QDict (with dotted keys) to BlockDriver methods
.bdrv_file_open(). The QDict's members are typed according to the
QAPI schema.
nfs_file_open() converts it to QAPI type BlockdevOptionsNfs, with
qdict_crumple() and a qobject input visitor.
This visitor comes in two flavors. The plain flavor requires scalars
to be typed according to the QAPI schema. That's the case here. The
keyval flavor requires string scalars. That's not the case here.
nfs_file_open() uses the latter, and promptly falls apart for members
@user, @group, @tcp-syn-count, @readahead-size, @page-cache-size,
@debug.
Switching to the plain flavor would fix -blockdev, but break -drive,
because there the scalars arrive in nfs_file_open() as strings.
The proper fix would be to replace the QDict by QAPI type
BlockdevOptions in the BlockDriver interface. Sadly, that's beyond my
reach right now.
Next best would be to fix the block layer to always pass correctly
typed QDicts to the BlockDriver methods. Also beyond my reach.
What I can do is throw another hack onto the pile: have
nfs_file_open() convert all members to string, so use of the keyval
flavor actually works, by replacing qdict_crumple() by new function
qdict_crumple_for_keyval_qiv().
The pattern "pass result of qdict_crumple() to
qobject_input_visitor_new_keyval()" occurs several times more:
* qemu_rbd_open()
Same issue as nfs_file_open(), but since BlockdevOptionsRbd has only
string members, its only a latent bug. Fix it anyway.
* parallels_co_create_opts(), qcow_co_create_opts(),
qcow2_co_create_opts(), bdrv_qed_co_create_opts(),
sd_co_create_opts(), vhdx_co_create_opts(), vpc_co_create_opts()
These work, because they create the QDict with
qemu_opts_to_qdict_filtered(), which creates only string scalars.
The function sports a TODO comment asking for better typing; that's
going to be fun. Use qdict_crumple_for_keyval_qiv() to be safe.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There are numerous QDict functions that have been introduced for and are
used only by the block layer. Move their declarations into an own
header file to reflect that.
While qdict_extract_subqdict() is in fact used outside of the block
layer (in util/qemu-config.c), it is still a function related very
closely to how the block layer works with nested QDicts, namely by
sometimes flattening them. Therefore, its declaration is put into this
header as well and util/qemu-config.c includes it with a comment stating
exactly which function it needs.
Suggested-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20180509165530.29561-7-mreitz@redhat.com>
[Copyright note tweaked, superfluous includes dropped]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180606182449.1607-5-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
All this function is doing will be repeated by
bdrv_do_release_matching_dirty_bitmap_locked, except
resetting bm->persistent. But even that does not matter
because the bitmap will be freed.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180323164254.26487-1-pbonzini@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
This is a useful function for the whole block layer, so make it public.
At the same time, users outside of block.c probably do not need to make
use of the reopen functionality, so rename the current function to
bdrv_is_writable_after_reopen() create a new bdrv_is_writable() function
that just passes NULL to it for the reopen queue.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180606193702.7113-2-mreitz@redhat.com
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Looking at the qcow2 code that is riddled with error_report() calls,
this is really how it should have been from the start.
Along the way, turn the target_version/current_version comparisons at
the beginning of qcow2_downgrade() into assertions (the caller has to
make sure these conditions are met), and rephrase the error message on
using compat=1.1 to get refcount widths other than 16 bits.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180509210023.20283-3-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
With copy_file_range(2), we can implement the bdrv_co_copy_range
semantics.
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180601092648.24614-6-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Introduce the bdrv_co_copy_range() API for copy offloading. Block
drivers implementing this API support efficient copy operations that
avoid reading each block from the source device and writing it to the
destination devices. Examples of copy offload primitives are SCSI
EXTENDED COPY and Linux copy_file_range(2).
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180601092648.24614-2-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
BlockJob has fields .offset and .len, which are actually misnomers today
because they are no longer tied to block device sizes, but just progress
counters. As such they make a lot of sense in generic Jobs.
This patch moves the fields to Job and renames them to .progress_current
and .progress_total to describe their function better.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
The transition to the READY state was still performed in the BlockJob
layer, in the same function that sent the BLOCK_JOB_READY QMP event.
This patch brings the state transition to the Job layer and implements
the QMP event using a notifier called from the Job layer, like we
already do for other events related to state transitions.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Instead of having a 'bool ready' in BlockJob, add a function that
derives its value from the job status.
At the same time, this fixes the behaviour to match what the QAPI
documentation promises for query-block-job: 'true if the job may be
completed'. When the ready flag was introduced in commit ef6dbf1e46,
the flag never had to be reset to match the description because after
being ready, the jobs would immediately complete and disappear.
Job transactions and manual job finalisation were introduced only later.
With these changes, jobs may stay around even after having completed
(and they are not ready to be completed a second time), however their
patches forgot to reset the ready flag.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This moves the logic that implements job transactions from BlockJob to
Job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This doesn't actually move any transaction code to Job yet, but it
renames the type for transactions from BlockJobTxn to JobTxn and makes
them contain Jobs rather than BlockJobs
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This moves the .complete callback that tells a READY job to complete
from BlockJobDriver to JobDriver. The wrapper function job_complete()
doesn't require anything block job specific any more and can be moved
to Job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
block_job_drain() contains a blk_drain() call which cannot be moved to
Job, so add a new JobDriver callback JobDriver.drain which has a common
implementation for all BlockJobs. In addition to this we keep the
existing BlockJobDriver.drain callback that is called by the common
drain implementation for all block jobs.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
block_job_cancel_async() did two things that were still block job
specific:
* Setting job->force. This field makes sense on the Job level, so we can
just move it. While at it, rename it to job->force_cancel to make its
purpose more obvious.
* Resetting the I/O status. This can't be moved because generic Jobs
don't have an I/O status. What the function really implements is a
user resume, except without entering the coroutine. Consequently, it
makes sense to call the .user_resume driver callback here which
already resets the I/O status.
The old block_job_cancel_async() has two separate if statements that
check job->iostatus != BLOCK_DEVICE_IO_STATUS_OK and job->user_paused.
However, the former condition always implies the latter (as is
asserted in block_job_iostatus_reset()), so changing the explicit call
of block_job_iostatus_reset() on the former condition with the
.user_resume callback on the latter condition is equivalent and
doesn't need to access any BlockJob specific state.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This moves the finalisation of a single job from BlockJob to Job.
Some part of this code depends on job transactions, and job transactions
call this code, we introduce some temporary calls from Job functions to
BlockJob ones. This will be fixed once transactions move to Job, too.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Go through the Job layer in order to send QMP events. For the moment,
these functions only call a notifier in the BlockJob layer that sends
the existing commands.
This uses notifiers rather than JobDriver callbacks because internal
users of jobs won't receive QMP events, but might still be interested
in getting notified for the events.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This renames the BlockJobCreateFlags constants, moves a few JOB_INTERNAL
checks to job_create() and the auto_{finalize,dismiss} fields from
BlockJob to Job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Since we introduced an explicit status to block job, BlockJob.completed
is redundant because it can be derived from the status. Remove the field
from BlockJob and add a function to derive it from the status at the Job
level.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
While we already moved the state related to job pausing to Job, the
functions to do were still BlockJob only. This commit moves them over to
Job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
There is nothing block layer specific about block_job_sleep_ns(), so
move the function to Job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This commit moves some core functions for dealing with the job coroutine
from BlockJob to Job. This includes primarily entering the coroutine
(both for the first and reentering) and yielding explicitly and at pause
points.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Move the defer_to_main_loop functionality from BlockJob to Job.
The code can be simplified because we can use job->aio_context in
job_defer_to_main_loop_bh() now, instead of having to access the
BlockDriverState.
Probably taking the data->aio_context lock in addition was already
unnecessary in the old code because we didn't actually make use of
anything protected by the old AioContext except getting the new
AioContext, in case it changed between scheduling the BH and running it.
But it's certainly unnecessary now that the BDS isn't accessed at all
any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
We cannot yet move the whole logic around job cancelling to Job because
it depends on quite a few other things that are still only in BlockJob,
but we can move the cancelled field at least.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
This moves reference counting from BlockJob to Job.
In order to keep calling the BlockJob cleanup code when the job is
deleted via job_unref(), introduce a new JobDriver.free callback. Every
block job must use block_job_free() for this callback, this is asserted
in block_job_create().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
This moves BlockJob.status and the closely related functions
(block_)job_state_transition() and (block_)job_apply_verb to Job. The
two QAPI enums are renamed to JobStatus and JobVerb.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This moves the job list from BlockJob to Job. Now we can check for
duplicate IDs in job_create().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
This moves the job_type field from BlockJobDriver to JobDriver.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
QAPI types aren't externally visible, so we can rename them without
causing problems. Before we add a job type to Job, rename the enum
so it can be used for more than just block jobs.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
This is the first step towards creating an infrastructure for generic
background jobs that aren't tied to a block device. For now, Job only
stores its ID and JobDriver, the rest stays in BlockJob.
The following patches will move over more parts of BlockJob to Job if
they are meaningful outside the context of a block job.
BlockJob.driver is now redundant, but this patch leaves it around to
avoid unnecessary churn. The next patches will get rid of almost all of
its uses anyway so that it can be removed later with much less churn.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
When we call addIOThread, the epollfd created in aio_context_setup,
but not close it in the process of delIOThread, so the epollfd will leak.
Reorder the code in aio_epoll_disable and reuse it.
Signed-off-by: Jie Wang <wangjie88@huawei.com>
Message-Id: <1526517763-11108-1-git-send-email-wangjie88@huawei.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
[Mention change to aio_epoll_disable in commit message. - Fam]
Signed-off-by: Fam Zheng <famz@redhat.com>
Add BDRV_REQ_WRITE_UNCHANGED to the list of flags honored during pwrite
and pwrite_zeroes, and also add a note on when you absolutely need to
support it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180502140359.18222-1-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This flag signifies that a write request will not change the visible
disk content. With this flag set, it is sufficient to have the
BLK_PERM_WRITE_UNCHANGED permission instead of BLK_PERM_WRITE.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180421132929.21610-4-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Currently we never actually check whether the WRITE_UNCHANGED
permission has been taken for unchanging writes. But the one check that
is commented out checks both WRITE and WRITE_UNCHANGED; and considering
that WRITE_UNCHANGED is already documented as being weaker than WRITE,
we should probably explicitly document WRITE to include WRITE_UNCHANGED.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180421132929.21610-3-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The backup block job directly accesses the driver field in BlockJob. Add
a wrapper for getting it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
This gets us rid of more direct accesses to BlockJob fields from the
job drivers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
All block job drivers support .set_speed and all of them duplicate the
same code to implement it. Move that code to blockjob.c and remove the
now useless callback.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Every block job has a RateLimit, and they all do the exact same thing
with it, so it should be common infrastructure. Move the struct field
for a start.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Block job drivers are not expected to mess with the internals of the
BlockJob object, so provide wrapper functions for one of the cases where
they still do it: Updating the progress counter.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
We have too many driver callback interfaces; simplify the mess
somewhat by merging the flags parameter of .bdrv_co_writev_flags()
into .bdrv_co_writev(). Note that as long as a driver doesn't set
.supported_write_flags, the flags argument will be 0 and behavior is
identical. Also note that the public function bdrv_co_writev() still
lacks a flags argument; so the driver signature is thus intentionally
slightly different. But that's not the end of the world, nor the first
time that the driver interface differs slightly from the public
interface.
Ideally, we should be rewriting all of these drivers to use modern
byte-based interfaces. But that's a more invasive patch to write
and audit, compared to the simplification done here.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Now that all drivers with aio callbacks are using the
byte-based interfaces, we can remove the sector-based versions.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the last few sector-based callbacks
in the file-win32 driver.
Note that the driver was already using byte-based calls for
performing actual I/O, so this just gets rid of a round trip
of scaling; however, as I don't know if Windows is tolerant of
non-sector AIO operations, I went with the conservative approach
of modifying .bdrv_refresh_limits to override the block layer
defaults back to the pre-patch value of 512.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Add new sector-based aio callbacks for read and write,
to match the fact that bdrv_aio_pdiscard is already byte-based.
Ideally, drivers should be converted to use coroutine callbacks
rather than aio; but that is not quite as trivial (and if we were
to do that conversion, the null-aio driver would disappear), so for
the short term, converting the signature but keeping things with
aio is easier. However, we CAN declare that a driver that uses
the byte-based aio interfaces now defaults to byte-based
operations, and must explicitly provide a refresh_limits override
to stick with larger alignments (making the alignment issues more
obvious directly in the drivers touched in the next few patches).
Once all drivers are converted, the sector-based aio callbacks will
be removed; in the meantime, a FIXME comment is added due to a
slight inefficiency that will be touched up as part of that later
cleanup.
Simplify some instances of 'bs->drv' into 'drv' while touching this,
since the local variable already exists to reduce typing.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 8119334918 ("block: Don't
block_job_pause_all() in bdrv_drain_all()") removed the only callers of
block_job_pause/resume_all().
Pausing and resuming now happens in child_job_drained_begin/end() so
it's no longer necessary to globally pause/resume jobs.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180424085240.5798-1-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Clarify that:
- for protocols the brdv_file_open function is used instead
of bdrv_open;
- when protocol_name is set, a driver should expect
to be given only a filename and no other options.
Signed-off-by: Fabiano Rosas <farosas@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Introduce latency histogram statics for block devices.
For each accounted operation type, the latency region [0, +inf) is
divided into subregions by several points. Then, calculate
hits for each subregion.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20180309165212.97144-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
When doing drive mirror to a low speed shared storage, if there was heavy
BLK IO write workload in VM after the 'ready' event, drive mirror block job
can't be canceled immediately, it would keep running until the heavy BLK IO
workload stopped in the VM.
Libvirt depends on the current block-job-cancel semantics, which is that
when used without a flag after the 'ready' event, the command blocks
until data is in sync. However, these semantics are awkward in other
situations, for example, people may use drive mirror for realtime
backups while still wanting to use block live migration. Libvirt cannot
start a block live migration while another drive mirror is in progress,
but the user would rather abandon the backup attempt as broken and
proceed with the live migration than be stuck waiting for the current
drive mirror backup to finish.
The drive-mirror command already includes a 'force' flag, which libvirt
does not use, although it documented the flag as only being useful to
quit a job which is paused. However, since quitting a paused job has
the same effect as abandoning a backup in a non-paused job (namely, the
destination file is not in sync, and the command completes immediately),
we can just improve the documentation to make the force flag obviously
useful.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Jeff Cody <jcody@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: John Snow <jsnow@redhat.com>
Reported-by: Huaitong Han <huanhuaitong@didichuxing.com>
Signed-off-by: Huaitong Han <huanhuaitong@didichuxing.com>
Signed-off-by: Liang Li <liliangleo@didichuxing.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of automatically transitioning from PENDING to CONCLUDED, gate
the .prepare() and .commit() phases behind an explicit acknowledgement
provided by the QMP monitor if auto_finalize = false has been requested.
This allows us to perform graph changes in prepare and/or commit so that
graph changes do not occur autonomously without knowledge of the
controlling management layer.
Transactions that have reached the "PENDING" state together can all be
moved to invoke their finalization methods by issuing block_job_finalize
to any one job in the transaction.
Jobs in a transaction with mixed job->auto_finalize settings will all
remain stuck in the "PENDING" state, as if the entire transaction was
specified with auto_finalize = false. Jobs that specified
auto_finalize = true, however, will still not emit the PENDING event.
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For jobs utilizing the new manual workflow, we intend to prohibit
them from modifying the block graph until the management layer provides
an explicit ACK via block-job-finalize to move the process forward.
To distinguish this runstate from "ready" or "waiting," we add a new
"pending" event and status.
For now, the transition from PENDING to CONCLUDED/ABORTING is automatic,
but a future commit will add the explicit block-job-finalize step.
Transitions:
Waiting -> Pending: Normal transition.
Pending -> Concluded: Normal transition.
Pending -> Aborting: Late transactional failures and cancellations.
Removed Transitions:
Waiting -> Concluded: Jobs must go to PENDING first.
Verbs:
Cancel: Can be applied to a pending job.
+---------+
|UNDEFINED|
+--+------+
|
+--v----+
+---------+CREATED+-----------------+
| +--+----+ |
| | |
| +--+----+ +------+ |
+---------+RUNNING<----->PAUSED| |
| +--+-+--+ +------+ |
| | | |
| | +------------------+ |
| | | |
| +--v--+ +-------+ | |
+---------+READY<------->STANDBY| | |
| +--+--+ +-------+ | |
| | | |
| +--v----+ | |
+---------+WAITING<---------------+ |
| +--+----+ |
| | |
| +--v----+ |
+---------+PENDING| |
| +--+----+ |
| | |
+--v-----+ +--v------+ |
|ABORTING+--->CONCLUDED| |
+--------+ +--+------+ |
| |
+--v-+ |
|NULL<--------------------+
+----+
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Some jobs upon finalization may need to perform some work that can
still fail. If these jobs are part of a transaction, it's important
that these callbacks fail the entire transaction.
We allow for a new callback in addition to commit/abort/clean that
allows us the opportunity to have fairly late-breaking failures
in the transactional process.
The expected flow is:
- All jobs in a transaction converge to the PENDING state,
added in a forthcoming commit.
- Upon being finalized, either automatically or explicitly
by the user, jobs prepare to complete.
- If any job fails preparation, all jobs call .abort.
- Otherwise, they succeed and call .commit.
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For jobs that have reached their CONCLUDED state, prior to having their
last reference put down (meaning jobs that have completed successfully,
unsuccessfully, or have been canceled), allow the user to dismiss the
job's lingering status report via block-job-dismiss.
This gives management APIs the chance to conclusively determine if a job
failed or succeeded, even if the event broadcast was missed.
Note: block_job_do_dismiss and block_job_decommission happen to do
exactly the same thing, but they're called from different semantic
contexts, so both aliases are kept to improve readability.
Note 2: Don't worry about the 0x04 flag definition for AUTO_DISMISS, she
has a friend coming in a future patch to fill the hole where 0x02 is.
Verbs:
Dismiss: operates on CONCLUDED jobs only.
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Which commands ("verbs") are appropriate for jobs in which state is
also somewhat burdensome to keep track of.
As of this commit, it looks rather useless, but begins to look more
interesting the more states we add to the STM table.
A recurring theme is that no verb will apply to an 'undefined' job.
Further, it's not presently possible to restrict the "pause" or "resume"
verbs any more than they are in this commit because of the asynchronous
nature of how jobs enter the PAUSED state; justifications for some
seemingly erroneous applications are given below.
=====
Verbs
=====
Cancel: Any state except undefined.
Pause: Any state except undefined;
'created': Requests that the job pauses as it starts.
'running': Normal usage. (PAUSED)
'paused': The job may be paused for internal reasons,
but the user may wish to force an indefinite
user-pause, so this is allowed.
'ready': Normal usage. (STANDBY)
'standby': Same logic as above.
Resume: Any state except undefined;
'created': Will lift a user's pause-on-start request.
'running': Will lift a pause request before it takes effect.
'paused': Normal usage.
'ready': Will lift a pause request before it takes effect.
'standby': Normal usage.
Set-speed: Any state except undefined, though ready may not be meaningful.
Complete: Only a 'ready' job may accept a complete request.
=======
Changes
=======
(1)
To facilitate "nice" error checking, all five major block-job verb
interfaces in blockjob.c now support an errp parameter:
- block_job_user_cancel is added as a new interface.
- block_job_user_pause gains an errp paramter
- block_job_user_resume gains an errp parameter
- block_job_set_speed already had an errp parameter.
- block_job_complete already had an errp parameter.
(2)
block-job-pause and block-job-resume will no longer no-op when trying
to pause an already paused job, or trying to resume a job that isn't
paused. These functions will now report that they did not perform the
action requested because it was not possible.
iotests have been adjusted to address this new behavior.
(3)
block-job-complete doesn't worry about checking !block_job_started,
because the permission table guards against this.
(4)
test-bdrv-drain's job implementation needs to announce that it is
'ready' now, in order to be completed.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We're about to add several new states, and booleans are becoming
unwieldly and difficult to reason about. It would help to have a
more explicit bookkeeping of the state of blockjobs. To this end,
add a new "status" field and add our existing states in a redundant
manner alongside the bools they are replacing:
UNDEFINED: Placeholder, default state. Not currently visible to QMP
unless changes occur in the future to allow creating jobs
without starting them via QMP.
CREATED: replaces !!job->co && paused && !busy
RUNNING: replaces effectively (!paused && busy)
PAUSED: Nearly redundant with info->paused, which shows pause_count.
This reports the actual status of the job, which almost always
matches the paused request status. It differs in that it is
strictly only true when the job has actually gone dormant.
READY: replaces job->ready.
STANDBY: Paused, but job->ready is true.
New state additions in coming commits will not be quite so redundant:
WAITING: Waiting on transaction. This job has finished all the work
it can until the transaction converges, fails, or is canceled.
PENDING: Pending authorization from user. This job has finished all the
work it can until the job or transaction is finalized via
block_job_finalize. This implies the transaction has converged
and left the WAITING phase.
ABORTING: Job has encountered an error condition and is in the process
of aborting.
CONCLUDED: Job has ceased all operations and has a return code available
for query and may be dismissed via block_job_dismiss.
NULL: Job has been dismissed and (should) be destroyed. Should never
be visible to QMP.
Some of these states appear somewhat superfluous, but it helps define the
expected flow of a job; so some of the states wind up being synchronous
empty transitions. Importantly, jobs can be in only one of these states
at any given time, which helps code and external users alike reason about
the current condition of a job unambiguously.
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Trivial; Document what the job creation flags do,
and some general tidying.
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
model all independent jobs as single job transactions.
It's one less case we have to worry about when we add more states to the
transition machine. This way, we can just treat all job lifetimes exactly
the same. This helps tighten assertions of the STM graph and removes some
conditionals that would have been needed in the coming commits adding a
more explicit job lifetime management API.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add special state, when qmp operations on the bitmap are disabled.
It is needed during bitmap migration. "Frozen" state is not
appropriate here, because it looks like bitmap is unchanged.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180207155837.92351-5-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180207155837.92351-4-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
Minimal realization: only one extent in server answer is supported.
Flag NBD_CMD_FLAG_REQ_ONE is used to force this behavior.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20180312152126.286890-6-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks, fix min_block check and 32-bit cap, use -1
instead of errno on failure in nbd_negotiate_simple_meta_context,
ensure that block status makes progress on success]
Signed-off-by: Eric Blake <eblake@redhat.com>
Enabling bitmap successor is necessary to enable successors of bitmaps
being migrated before target vm start.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180207155837.92351-2-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
Nested BDRV_POLL_WHILE() calls can occur. Currently
assert(!wait_->wakeup) fails in AIO_WAIT_WHILE() when this happens.
This patch converts the bool wait_->need_kick flag to an unsigned
wait_->num_waiters counter.
Nesting works correctly because outer AIO_WAIT_WHILE() callers evaluate
the condition again after the inner caller completes (invoking the inner
caller counts as aio_poll() progress).
Reported-by: "fuweiwei (C)" <fuweiwei2@huawei.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180307124619.6218-1-stefanha@redhat.com
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaoqOgAAoJEH8JsnLIjy/W2MMP/1gj7CJgtSG9wIzyBHjSWQMy
ofXEgRJO9t/smfUMlH2NdrW8P2LvYcmqOsEBkLJzCtl48fPexwtI/cunzVjutXcf
VlpqKz/8uN4C9D6m8FN/5kKf65l+tnVqnCoJgwafY5uT7jmoC8LF1xO2jo8a+lJd
0Dv6RxJUQq/tDR6OvO6aW4EzbOUcD4wkLvi/uz8+ZjV1BLSLlpdudejr6W9TnJY/
EGFedbxqjPV7fIvMbodbFp0Ie8Aw0WEL8ttERboeR4jbA/o+PZVGpPtHsr/4V6QO
Pgh6vH2rGavxFzwuCWEGhlLKGx66CGqqdTknm6lNJchepCvcfoYxjOPZv9FCaMUs
enC/x43xSkCmkwBwKKxpXqu1vS5nGdMebAwRjstSIplypjv2YOwS1AiU5snaDwuk
t9Gjkw0Wka5nySuYi43H2RPXmlWbh4T8DfQ6pOyJGvXGjm8t+f5BTaMtSWn6Iq2W
F6r1UezQJBDnUbpFgsRg4AP+htPGDHgsOg7KzCCd/lBHwbjX7dkQlAYbBZZ2OBF+
wQN5olDR6jsKIy2IlARNgNweZHW5UQa1cc+7HlVNNE5tqtkjo7aWPk/LhEzBCIHg
sWG3VH2y3lQlaMzYh1v+jnGrFoq1ZJU4sbjaxvQX8czjmaQvPtbzKuZAovQ4pGwa
g0SrWP6p9yLo0LXLuXBP
=WDF4
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches
# gpg: Signature made Fri 09 Mar 2018 15:09:20 GMT
# gpg: using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream: (56 commits)
qemu-iotests: fix 203 migration completion race
iotests: Tweak 030 in order to trigger a race condition with parallel jobs
iotests: Skip test for ENOMEM error
iotests: Mark all tests executable
iotests: Test creating overlay when guest running
qemu-iotests: Test ssh image creation over QMP
qemu-iotests: Test qcow2 over file image creation with QMP
block: Fail bdrv_truncate() with negative size
file-posix: Fix no-op bdrv_truncate() with falloc preallocation
ssh: Support .bdrv_co_create
ssh: Pass BlockdevOptionsSsh to connect_to_ssh()
ssh: QAPIfy host-key-check option
ssh: Use QAPI BlockdevOptionsSsh object
sheepdog: Support .bdrv_co_create
sheepdog: QAPIfy "redundancy" create option
nfs: Support .bdrv_co_create
nfs: Use QAPI options in nfs_client_open()
rbd: Use qemu_rbd_connect() in qemu_rbd_do_create()
rbd: Assign s->snap/image_name in qemu_rbd_open()
rbd: Support .bdrv_co_create
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This adds a synchronous x-blockdev-create QMP command that can create
qcow2 images on a given node name.
We don't want to block while creating an image, so this is not the final
interface in all aspects, but BlockdevCreateOptionsQcow2 and
.bdrv_co_create() are what they actually might look like in the end. In
any case, this should be good enough to test whether we interpret
BlockdevCreateOptions as we should.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>