In the process of converting sector-based interfaces to bytes,
I'm finding it easier to represent a byte count as a 64-bit
integer at the block layer (even if we are internally capped
by SIZE_MAX or even INT_MAX for individual transactions, it's
still nicer to not have to worry about truncation/overflow
issues on as many variables). Update the signature of
bdrv_round_to_clusters() to uniformly use int64_t, matching
the signature already chosen for bdrv_is_allocated and the
fact that off_t is also a signed type, then adjust clients
according to the required fallout (even where the result could
now exceed 32 bits, no client is directly assigning the result
into a 32-bit value without breaking things into a loop first).
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Not all callers care about which BDS owns the mapping for a given
range of the file. This patch merely simplifies the callers by
consolidating the logic in the common call point, while guaranteeing
a non-NULL file to all the driver callbacks, for no semantic change.
The only caller that does not care about pnum is bdrv_is_allocated,
as invoked by vvfat; we can likewise add assertions that the rest
of the stack does not have to worry about a NULL pnum.
Furthermore, this will also set the stage for a future cleanup: when
a caller does not care about which BDS owns an offset, it would be
nice to allow the driver to optimize things to not have to return
BDRV_BLOCK_OFFSET_VALID in the first place. In the case of fragmented
allocation (for example, it's fairly easy to create a qcow2 image
where consecutive guest addresses are not at consecutive host
addresses), the current contract requires bdrv_get_block_status()
to clamp *pnum to the limit where host addresses are no longer
consecutive, but allowing a NULL file means that *pnum could be
set to the full length of known-allocated data.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Backing may be zero after failed bdrv_append in mirror_start_job,
which leads to SIGSEGV.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20170929152255.5431-1-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Backing may be zero after failed bdrv_attach_child in
bdrv_set_backing_hd, which leads to SIGSEGV.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20170928120300.58164-1-vsementsov@virtuozzo.com
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Now that we have adjusted the majority of the calls this function
makes to be byte-based, it is easier to read the code if it makes
passes over the image using bytes rather than sectors.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Some of the callers were already scaling bytes to sectors; others
can be easily converted to pass byte offsets, all in our shift
towards a consistent byte interface everywhere. Making the change
will also make it easier to write the hold-out callers to use byte
rather than sectors for their iterations; it also makes it easier
for a future dirty-bitmap patch to offload scaling over to the
internal hbitmap. Although all callers happen to pass
sector-aligned values, make the internal scaling robust to any
sub-sector requests.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Half the callers were already scaling bytes to sectors; the other
half can eventually be simplified to use byte iteration. Both
callers were already using the result as a bool, so make that
explicit. Making the change also makes it easier for a future
dirty-bitmap patch to offload scaling over to the internal hbitmap.
Remember, asking whether a byte is dirty is effectively asking
whether the entire granularity containing the byte is dirty, since
we only track dirtiness by granularity.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Thanks to recent cleanups, all callers were scaling a return value
of sectors into bytes; do the scaling internally instead.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Thanks to recent cleanups, most callers were scaling a return value
of sectors into bytes (the exception, in qcow2-bitmap, will be
converted to byte-based iteration later). Update the interface to
do the scaling internally instead.
In qcow2-bitmap, the code was specifically checking for an error
return of -1. To avoid a regression, we either have to make sure
we continue to return -1 (rather than a scaled -512) on error, or
we have to fix the caller to treat all negative values as error
rather than just one magic value. It's easy enough to make both
changes at the same time, even though either one in isolation
would work.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All callers to bdrv_dirty_iter_new() passed 0 for their initial
starting point, drop that parameter.
Most callers to bdrv_set_dirty_iter() were scaling a byte offset to
a sector number; the exception qcow2-bitmap will be converted later
to use byte rather than sector iteration. Move the scaling to occur
internally to dirty bitmap code instead, so that callers now pass
in bytes.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the context of bdrv_reopen(), we'll have to look at the state of the
graph as it will be after the reopen. This interface addition is in
preparation for the change.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
bdrv_co_get_block_status_from_file() and
bdrv_co_get_block_status_from_backing() set *file to bs->file and
bs->backing respectively, so that bdrv_co_get_block_status() can recurse
to them. Future block drivers won't have to duplicate code to implement
this.
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Manos Pitsidianakis <el13635@mail.ntua.gr>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commits 0db832f and 6cdbceb introduced the automatic insertion of filter
nodes above the top layer of mirror and commit block jobs. The
assumption made there was that since libvirt doesn't do node-level
management of the block layer yet, it shouldn't be affected by added
nodes.
This is true as far as commands issued by libvirt are concerned. It only
uses BlockBackend names to address nodes, so any operations it performs
still operate on the root of the tree as intended.
However, the assumption breaks down when you consider query commands,
which return data for the wrong node now. These commands also return
information on some child nodes (bs->file and/or bs->backing), which
libvirt does make use of, and which refer to the wrong nodes, too.
One of the consequences is that oVirt gets wrong information about the
image size and stops the VM in response as long as a mirror or commit
job is running:
https://bugzilla.redhat.com/show_bug.cgi?id=1470634
This patch fixes the problem by hiding the implicit nodes created
automatically by the mirror and commit block jobs in the output of
query-block and BlockBackend-based query-blockstats as long as the user
doesn't indicate that they are aware of those nodes by providing a node
name for them in the QMP command to start the block job.
The node-based commands query-named-block-nodes and query-blockstats
with query-nodes=true still show all nodes, including implicit ones.
This ensures that users that are capable of node-level management can
still access the full information; users that only know BlockBackends
won't use these commands.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Tested-by: Eric Blake <eblake@redhat.com>
blk_truncate() itself will pass that value to bdrv_truncate(), and all
callers of blk_truncate() just set the parameter to PREALLOC_MODE_OFF
for now.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20170613202107.10125-4-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. In the common case, allocation is unlikely to ever use
values that are not naturally sector-aligned, but it is possible
that byte-based values will let us be more precise about allocation
at the end of an unaligned file that can do byte-based access.
Changing the signature of the function to use int64_t *pnum ensures
that the compiler enforces that all callers are updated. For now,
the io.c layer still assert()s that all callers are sector-aligned,
but that can be relaxed when a later patch implements byte-based
block status. Therefore, for the most part this patch is just the
addition of scaling at the callers followed by inverse scaling at
bdrv_is_allocated(). But some code, particularly stream_run(),
gets a lot simpler because it no longer has to mess with sectors.
Leave comments where we can further simplify by switching to
byte-based iterations, once later patches eliminate the need for
sector-aligned operations.
For ease of review, bdrv_is_allocated() was tackled separately.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based. Change the internal
loop iteration of mirroring to track by bytes instead of sectors
(although we are still guaranteed that we iterate by steps that
are both sector-aligned and multiples of the granularity). Drop
the now-unused mirror_clip_sectors().
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based. Convert another internal
function, preserving all existing semantics, and adding one more
assertion that things are still sector-aligned (so that conversions
to sectors in mirror_read_complete don't need to round).
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based. Convert another internal
function (no semantic change), and add mirror_clip_bytes() as a
counterpart to mirror_clip_sectors(). Some of the conversion is
a bit tricky, requiring temporaries to convert between units; it
will be cleared up in a following patch.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Rather than having a void function that modifies its input
in-place as the output, change the signature to reduce a layer
of indirection and return the result.
Suggested-by: John Snow <jsnow@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based. Convert another internal
function (no semantic change).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based. Continue by converting an
internal structure (no semantic change), and all references to the
buffer size.
Add an assertion that our use of s->granularity >> BDRV_SECTOR_BITS
(necessary for interaction with sector-based dirty bitmaps, until
a later patch converts those to be byte-based) does not suffer from
truncation problems.
[checkpatch has a false positive on use of MIN() in this patch]
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Upcoming patches are going to switch to byte-based interfaces
instead of sector-based. Even worse, trace_backup_do_cow_enter()
had a weird mix of cluster and sector indices.
The trace interface is low enough that there are no stability
guarantees, and therefore nothing wrong with changing our units,
even in cases like trace_backup_do_cow_skip() where we are not
changing the trace output. So make the tracing uniformly use
bytes.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The user interface specifies job rate limits in bytes/second.
It's pointless to have our internal representation track things
in sectors/second, particularly since we want to move away from
sector-based interfaces.
Fix up a doc typo found while verifying that the ratelimit
code handles the scaling difference.
Repetition of expressions like 'n * BDRV_SECTOR_SIZE' will be
cleaned up later when functions are converted to iterate over
images by bytes rather than by sectors.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The lone caller that cares about a return of BDRV_BLOCK_RAW
(namely, io.c:bdrv_co_get_block_status) completely replaces the
return value, so there is no point in passing BDRV_BLOCK_DATA.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Change the 'int count' parameter in *pwrite_zeros, *pdiscard related
functions (and some others) to 'int bytes', as they both refer to bytes.
This helps with code legibility.
Signed-off-by: Manos Pitsidianakis <el13635@mail.ntua.gr>
Message-id: 20170609101808.13506-1-el13635@mail.ntua.gr
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
It protects only the list of dirty bitmaps; in the next patch we will
also protect their content.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20170605123908.18777-15-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJZLDGTAAoJEH8JsnLIjy/WjY0P/jtXF+SQKs9H695Uy+U+EUr5
9w+lYTfjHXTUx9F5AKT4OAMww/uWSG5ywG8FZD8AJZyt1B8piDrSA7sW5YYgWCV4
XD1HMzN6pasVGrchzgfBSW/K9BgUlvBEjqFrLCNVii0osa68J7vX3uJsLyRA+yPo
ijFseJaB/b0KnAPG9JqHXc4DVQgU2IhZH3ZsgE6Utb+KUrm8/veiORhWQboBbfCW
eyTF+YWee/rI+up4nKn9+Le57EUXWr3Y50BBxFZw1/4wQbv0WEhOIbl/Z4ggxNRR
VoyXGMqOIZFlGQ7bGxDawuEwGKfcOFFEHj3rhPrs7RPnod/6X8Vxm0rAr2GNZoOW
9tLMQKe4HLo3kVwX65AhzWd/WMukAd0MC/0oaULOjiAEdlWjflyEhx9H9uYefl5x
kn77ZqQqIEL4aCYRBLJn9oJV3CIZbSd6cuKC9UPuXAwD8XDWTXUzT/aDMJUnfij7
vhS1qks1Wyk37wIjTuuERwrr0K6y8e3De30NeU6LqQuzXvJ1MYSp49BA1FCgHfmT
x5RWbTSjLdjZiIo7biE2dSwdOtsxsnuxX19utqfGqOmegNU7LykRtu6mFh9kJhNe
5ladt1Mu//puZLQ/q4RH/J0pwkuS/OVrS3LBobcggPGHhUIiHhdNUI73bb7BK0+9
ME5DGLm4/c/REB3Lidr9
=kkU0
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'kwolf/tags/for-upstream' into staging
Block layer patches
# gpg: Signature made Mon 29 May 2017 03:34:59 PM BST
# gpg: using RSA key 0x7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* kwolf/tags/for-upstream:
block/file-*: *_parse_filename() and colons
block: Fix backing paths for filenames with colons
block: Tweak error message related to qemu-img amend
qemu-img: Fix leakage of options on error
qemu-img: copy *key-secret opts when opening newly created files
qemu-img: introduce --target-image-opts for 'convert' command
qemu-img: fix --image-opts usage with dd command
qemu-img: add support for --object with 'dd' command
qemu-img: Fix documentation of convert
qcow2: remove extra local_error variable
mirror: Drop permissions on s->target on completion
nvme: Add support for Controller Memory Buffers
iotests: 147: Don't test inet6 if not available
qemu-iotests: Test streaming with missing job ID
stream: fix crash in stream_start() when block_job_create() fails
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This fixes an assertion failure that was triggered by qemu-iotests 129
on some CI host, while the same test case didn't seem to fail on other
hosts.
Essentially the problem is that the blk_unref(s->target) in
mirror_exit() doesn't necessarily mean that the BlockBackend goes away
immediately. It is possible that the job completion was triggered nested
in mirror_drain(), which looks like this:
BlockBackend *target = s->target;
blk_ref(target);
blk_drain(target);
blk_unref(target);
In this case, the write permissions for s->target are retained until
after blk_drain(), which makes removing mirror_top_bs fail for the
active commit case (can't have a writable backing file in the chain
without the filter driver).
Explicitly dropping the permissions first means that the additional
reference doesn't hurt and the job can complete successfully even if
called from the nested blk_drain().
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Outside blockjob.c, block_job_unref is only used when a block job fails
to start, and block_job_ref is not used at all. The reference counting
thus is pretty well hidden. Introduce a separate function to be used
by block jobs; because block_job_ref and block_job_unref now become
static, move them earlier in blockjob.c.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20170508141310.8674-4-pbonzini@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
For one thing, this allows us to drop the error message generation from
qemu-img.c and blockdev.c and instead have it unified in
bdrv_truncate().
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20170328205129.15138-3-mreitz@redhat.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
It should be moved to the same context as source, before inserting to the
graph.
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If @bs does not have any parents, the only reference to @mirror_top_bs
will be held by the BlockJob object after the bdrv_unref() following
block_job_create(). However, if block_job_create() fails, this reference
will not exist and @mirror_top_bs will have been deleted when we
goto fail.
The issue comes back at all later entries to the fail label: We delete
the BlockJob object before rolling back our changes to the node graph.
This means that we will delete @mirror_top_bs in the process.
All in all, whenever @bs does not have any parents and we go down the
fail path we will dereference @mirror_top_bs after it has been deleted.
Fix this by invoking bdrv_unref() only when block_job_create() was
successful and by bdrv_ref()'ing @mirror_top_bs in the fail path before
deleting the BlockJob object. Finally, bdrv_unref() it at the end of the
fail path after we actually no longer need it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block/trace-events lists the parameters for mirror_yield
consistently with other mirror events (cnt just after s, like in
mirror_before_sleep; in_flight last, like in mirror_yield_in_flight).
But the callers were passing parameters in the wrong order, leading
to poor trace messages, including type truncation when there are
more than 4G dirty sectors involved. Broken since its introduction
in commit bd48bde.
While touching this, ensure that all callers use the same type
(uint64_t) for cnt, as a later patch will enable the compiler to do
stricter type-checking.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_child_set_perm alone is not very usable because the caller must
call bdrv_child_check_perm first. This is already encapsulated
conveniently in bdrv_child_try_set_perm, so remove the other prototypes
from the header and fix the one wrong caller, block/mirror.c.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want query-block to return the right filename, even if a mirror job
put a bdrv_mirror_top on top of the actual image format driver. Let
bdrv_mirror_top.bdrv_refresh_filename get the filename from its backing
file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
When adding an Error parameter, bdrv_replace_in_backing_chain() would
become nothing more than a wrapper around change_parent_backing_link().
So make the latter public, renamed as bdrv_replace_node(), and remove
bdrv_replace_in_backing_chain().
Most of the callers just remove a node from the graph that they just
inserted, so they can use &error_abort, but completion of a mirror job
with 'replaces' set can actually fail.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
mirror_top_bs must be removed from the graph again when creating the
dirty bitmap fails.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
mirror_top_bs takes write permissions on its backing file, which can
make it impossible to attach that backing file node to another parent.
However, this is exactly what needs to be done in order to remove
mirror_top_bs from the backing chain. So give up the write permission
first.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The 'replaces' option of drive-mirror can be used to mirror a Quorum
node to a new image and then let the target image replace one of the
Quorum children. In order for this graph modification to succeed, the
mirror job needs to lift its restrictions on the target node first
before actually replacing the child.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Aborting on error in bdrv_append() isn't correct. This patch fixes it
and lets the callers handle failures.
Test case 085 needs a reference output update. This is caused by the
reversed order of bdrv_set_backing_hd() and change_parent_backing_link()
in bdrv_append(): When the backing file of the new node is set, the
parent nodes are still pointing to the old top, so the backing blocker
is now initialised with the node name rather than the BlockBackend name.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Not all callers of bdrv_set_backing_hd() know for sure that attaching
the backing file will be allowed by the permission system. Return the
error from the function rather than aborting.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Management tools need to be able to know about every node in the graph
and need a way to address them. Changing the graph structure was okay
because libvirt doesn't really manage the node level yet, but future
libvirt versions need to deal with both new and old version of qemu.
This new option to blockdev-commit allows the client to set a node-name
for the automatically inserted filter driver, and at the same time
serves as a witness for a future libvirt that this version of qemu does
automatically insert a filter driver.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Management tools need to be able to know about every node in the graph
and need a way to address them. Changing the graph structure was okay
because libvirt doesn't really manage the node level yet, but future
libvirt versions need to deal with both new and old version of qemu.
This new option to blockdev-mirror allows the client to set a node-name
for the automatically inserted filter driver, and at the same time
serves as a witness for a future libvirt that this version of qemu does
automatically insert a filter driver.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Acked-by: Fam Zheng <famz@redhat.com>
The mirror block job is mainly used for two different scenarios:
Mirroring to an otherwise unused, independent target node, or for active
commit where the target node is part of the backing chain of the source.
Similarly to the commit block job patch, we need to insert a new filter
node to keep the permissions correct during active commit.
Note that one change this implies is that job->blk points to
mirror_top_bs as its root now, and mirror_top_bs (rather than the actual
source node) contains the bs->job pointer. This requires qemu-img commit
to get the job by name now rather than just taking bs->job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fam Zheng <famz@redhat.com>
Acked-by: Max Reitz <mreitz@redhat.com>
Block jobs don't actually do I/O through the the reference they create
with block_job_add_bdrv(), but they might want to use the permisssion
system to express what the block job does to intermediate nodes. This
adds permissions to block_job_add_bdrv() to provide the means to request
permissions.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Acked-by: Fam Zheng <famz@redhat.com>
This functions creates a BlockBackend internally, so the block jobs need
to tell it what they want to do with the BB.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Acked-by: Fam Zheng <famz@redhat.com>
Now that blk_insert_bs() requests the BlockBackend permissions for the
node it attaches to, it can fail. Instead of aborting, pass the errors
to the callers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Acked-by: Fam Zheng <famz@redhat.com>
We want every user to be specific about the permissions it needs, so
we'll pass the initial permissions as parameters to blk_new(). A user
only needs to call blk_set_perm() if it wants to change the permissions
after the fact.
The permissions are stored in the BlockBackend and applied whenever a
BlockDriverState should be attached in blk_insert_bs().
This does not include actually choosing the right set of permissions
everywhere yet. Instead, the usual FIXME comment is added to each place
and will be addressed in individual patches.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Acked-by: Fam Zheng <famz@redhat.com>
int64_t is in all likelihood the actual scalar type we want.
Yep, really.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1219541
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
This is more consistent with the commit block job, and it moves the code
to a place where we already have the necessary BlockBackends to resize
the base image when bdrv_truncate() is changed to require a BdrvChild.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
If explicit zeroing out before mirroring is required for the target image,
it moves the block job offset counter to EOF, then offset and len counters
count the image size twice. There is no harm but stats are confusing,
specifically the progress of the operation is always reported as 99% by
management tools.
The patch skips offset increase for the first "technical" pass over the
image. This should not cause any further harm.
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1486045515-8009-1-git-send-email-den@openvz.org
CC: Jeff Cody <jcody@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20170213135235.12274-16-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This puts a huge strain on the disks when there are many concurrent
migrations. With this patch we only flush twice: just before issuing
the event, and just before pivoting to the destination. If management
will complete the job close to the BLOCK_JOB_READY event, the cost of
the second flush should be small anyway.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20161109162008.27287-2-pbonzini@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Instead of automatically starting jobs at creation time via backup_start
et al, we'd like to return a job object pointer that can be started
manually at later point in time.
For now, add the block_job_start mechanism and start the jobs
automatically as we have been doing, with conversions job-by-job coming
in later patches.
Of note: cancellation of unstarted jobs will perform all the normal
cleanup as if the job had started, particularly abort and clean. The
only difference is that we will not emit any events, because the job
never actually started.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 1478587839-9834-5-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Add an explicit start field to specify the entrypoint. We already have
ownership of the coroutine itself AND managing the lifetime of the
coroutine, let's take control of creation of the coroutine, too.
This will allow us to delay creation of the actual coroutine until we
know we'll actually start a BlockJob in block_job_start. This avoids
the sticky question of how to "un-create" a Coroutine that hasn't been
started yet.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1478587839-9834-4-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
To make it a little more obvious which functions are intended to be
public interface and which are intended to be for use only by jobs
themselves, split the interface into "public" and "private" files.
Convert blockjobs (e.g. block/backup) to using the private interface.
Leave blockdev and others on the public interface.
There are remaining uses of private state by qemu-img, and several
cases in blockdev.c and block/io.c where we grab job->blk for the
purposes of acquiring an AIOContext.
These will be corrected in future patches.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1477584421-1399-7-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
There's no reason to leave this to blockdev; we can do it in blockjobs
directly and get rid of an extra callback for most users.
All non-internal events, even those created outside of QMP, will
consistently emit events.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1477584421-1399-5-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Bubble up the internal interface to commit and backup jobs, then switch
replication tasks over to using this methodology.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1477584421-1399-4-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Add the ability to create jobs without an ID.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1477584421-1399-3-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
We already specified BDRV_O_UNMAP when opening images in 'qemu-img
commit', but didn't turn on the "unmap" in the active commit job. This
patch fixes that so that zeroed clusters in top image can be discarded
which is desired in the virt-sparsify use case, where a temporary
overlay is created and fstrim'ed before commiting back, to free space in
the original image.
This also enables it for block-commit.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1474974892-5031-1-git-send-email-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
When block-commit is launched without the top parameter, it uses
internally a mirror block job. In that case all intermediate nodes
between the active and base nodes must be blocked as well.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use block_job_add_bdrv() instead of blocking all operations in
mirror_start_job() and unblocking them in mirror_exit().
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Ensure that there are no changes between the last check to
bdrv_get_dirty_count and the switch to the target.
There is already a bdrv_drained_end call, we only need to ensure
that bdrv_drained_begin is not called twice.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-Id: <1477565348-5458-4-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
This is required to decouple block jobs from running in an
AioContext. With multiqueue block devices, a BlockDriverState
does not really belong to a single AioContext.
The solution is to first wait until all I/O operations are
complete; then loop in the main thread for the block job to
complete entirely.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-Id: <1477565348-5458-3-git-send-email-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
HBitmap is an implementation detail of block dirty bitmap that should be hidden
from users. Introduce a BdrvDirtyBitmapIter to encapsulate the underlying
HBitmapIter.
A small difference in the interface is, before, an HBitmapIter is initialized
in place, now the new BdrvDirtyBitmapIter must be dynamically allocated because
the structure definition is in block/dirty-bitmap.c.
Two current users are converted too.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 1476395910-8697-2-git-send-email-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Auto complete mirror job in background to prevent from
blocking synchronously
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.fnst@cn.fujitsu.com>
Message-id: 1469602913-20979-7-git-send-email-xiecl.fnst@cn.fujitsu.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Stop to produce new async copy requests from mirror_iteration if
critical error (error action = BLOCK_ERROR_ACTION_REPORT) detected.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Mirror can do up to 16 in-flight requests, but actually on full copy
(the whole source disk is non-zero) in-flight is always 1. This happens
as the request is not limited in size: the data occupies maximum available
capacity of s->buf.
The patch limits the size of the request to some artificial constant
(1 Mb here), which is not that big or small. This effectively enables
back parallelism in mirror code as it was designed.
The result is important: the time to migrate 10 Gb disk is reduced from
~350 sec to 170 sec.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1468516741-82174-1-git-send-email-vsementsov@virtuozzo.com
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
- interrupt remapping for intel iommus
- a bunch of virtio cleanups
- fixes all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXkQsqAAoJECgfDbjSjVRpanoIAJ9JVlc1aEjt9sa0cSBcs+NQ
J7JmgU9FqFsj+4FrNTouO3AxTjHurd1UAULP1WMPD+V3JpbnHct8r6SCBLQ5EBMN
VOjYo4DwWs1g+DqnQ9WZmbadu06XvYi/yiAKNUzWfZk0MR11D0D/S5hmarNKw0Kq
tGHeTWjGeY4WqFLV7m+qB4+cqkAByn6um99UtUvgLL05RgIEIP2IEMKYZ+rXvAa9
iGUvzqlO7mbq/+LbL18kaWywa4TCwbbd2eSGWaqhX4CuB62Rl33mWTXFcfaYhkyp
Z3FgwaJ09h0lAjSVEbyAuLFMfO/BnMcsoKqwl4xc4vkn/xBCqFtgH9JcEVm3O8U=
=ge2D
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
pc, pci, virtio: new features, cleanups, fixes
- interrupt remapping for intel iommus
- a bunch of virtio cleanups
- fixes all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# gpg: Signature made Thu 21 Jul 2016 18:49:30 BST
# gpg: using RSA key 0x281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>"
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>"
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream: (57 commits)
intel_iommu: avoid unnamed fields
virtio: Update migration docs
virtio-gpu: Wrap in vmstate
virtio-gpu: Use migrate_add_blocker for virgl migration blocking
virtio-input: Wrap in vmstate
9pfs: Wrap in vmstate
virtio-serial: Wrap in vmstate
virtio-net: Wrap in vmstate
virtio-balloon: Wrap in vmstate
virtio-rng: Wrap in vmstate
virtio-blk: Wrap in vmstate
virtio-scsi: Wrap in vmstate
virtio: Migration helper function and macro
virtio-serial: Remove old migration version support
virtio-net: Remove old migration version support
virtio-scsi: Replace HandleOutput typedef
Revert "mirror: Workaround for unexpected iohandler events during completion"
virtio-scsi: Call virtio_add_queue_aio
virtio-blk: Call virtio_add_queue_aio
virtio: Introduce virtio_add_queue_aio
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This reverts commit ab27c3b5e7.
The virtio storage device host notifiers now work with
bdrv_drained_begin/end, so we don't need this hack any more.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Change sector-based blk_discard(), blk_co_discard(), and
blk_aio_discard() to instead be byte-based blk_pdiscard(),
blk_co_pdiscard(), and blk_aio_pdiscard(). NBD gets a lot
simpler now that ignoring the unaligned portion of a
byte-based discard request is handled under the hood by
the block layer.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-6-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
There are 2 deficiencies here:
- mirror_iteration could start several requests inside. Thus we could
simply have more in_flight requests than MAX_IN_FLIGHT.
- keeping this in mind throttling in mirror_run which is checking
s->in_flight == MAX_IN_FLIGHT is wrong.
The patch adds the check and throttling into mirror_iteration and fixes
the check in mirror_run() to be sure.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1466598927-5990-1-git-send-email-den@openvz.org
CC: Jeff Cody <jcody@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
(cherry picked from commit e648dc95c28fbca12e67be26a1fc4b9a0676c3fe)
We should not take into account zero blocks for delay calculations.
They are not read and thus IO throttling is not required. In the
other case VM migration with 16 Tb QCOW2 disk with 4 Gb of data takes
days.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1468503209-19498-9-git-send-email-den@openvz.org
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
With a bdrv_co_write_zeroes method on a target BDS and when this method
is working as indicated by the bdrv_can_write_zeroes_with_unmap(), zeroes
will not be placed into the wire. Thus the target could be very efficiently
zeroed out. This should be done with the largest chunk possible.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1468503209-19498-8-git-send-email-den@openvz.org
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
There is no need to scan allocation tables if we have mark_all_dirty flag
set. Just mark it all dirty.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1468503209-19498-7-git-send-email-den@openvz.org
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
The code inside the helper will be extended in the next patch. mirror_run
itself is overbloated at the moment.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1468503209-19498-5-git-send-email-den@openvz.org
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
CC: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
The patch also places last_pause_ns from stack in mirror_run into
MirrorBlockJob structure. This helper will be useful in next patches.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1468503209-19498-4-git-send-email-den@openvz.org
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Eric Blake <eblake@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
We keep here the sum of int fields. Thus this could easily overflow,
especially when we will start sending big requests in next patches.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1468503209-19498-3-git-send-email-den@openvz.org
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
ratelimit_calculate_delay() previously reset the accounting every time
slice, no matter how much data had been processed before. This had (at
least) two consequences:
1. The minimum speed is rather large, e.g. 5 MiB/s for commit and stream.
Not sure if there are real-world use cases where this would be a
problem. Mirroring and backup over a slow link (e.g. DSL) would
come to mind, though.
2. Tests for block job operations (e.g. cancel) were rather racy
All block jobs currently use a time slice of 100ms. That's a
reasonable value to get smooth output during regular
operation. However this also meant that the state of block jobs
changed every 100ms, no matter how low the configured limit was. On
busy hosts, qemu often transferred additional chunks until the test
case had a chance to cancel the job.
Fix the block job rate limit code to delay for more than one time
slice to address the above issues. To make it easier to handle
oversized chunks we switch the semantics from returning a delay
_before_ the current request to a delay _after_ the current
request. If necessary, this delay consists of multiple time slice
units.
Since the mirror job sends multiple chunks in one go even if the rate
limit was exceeded in between, we need to keep track of the start of
the current time slice so we can correctly re-compute the delay for
the updated amount of data.
The minimum bandwidth now is 1 data unit per time slice. The block
jobs are currently passing the amount of data transferred in sectors
and using 100ms time slices, so this translates to 5120
bytes/second. With chunk sizes usually being O(512KiB), tests have
plenty of time (O(100s)) to operate on block jobs. The chance of a
race condition now is fairly remote, except possibly on insanely
loaded systems.
Signed-off-by: Sascha Silbe <silbe@linux.vnet.ibm.com>
Message-id: 1467127721-9564-2-git-send-email-silbe@linux.vnet.ibm.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
In practice the entry argument is always known at creation time, and
it is confusing that sometimes qemu_coroutine_enter is used with a
non-NULL argument to re-enter a coroutine (this happens in
block/sheepdog.c and tests/test-coroutine.c). So pass the opaque value
at creation time, for consistency with e.g. aio_bh_new.
Mostly done with the following semantic patch:
@ entry1 @
expression entry, arg, co;
@@
- co = qemu_coroutine_create(entry);
+ co = qemu_coroutine_create(entry, arg);
...
- qemu_coroutine_enter(co, arg);
+ qemu_coroutine_enter(co);
@ entry2 @
expression entry, arg;
identifier co;
@@
- Coroutine *co = qemu_coroutine_create(entry);
+ Coroutine *co = qemu_coroutine_create(entry, arg);
...
- qemu_coroutine_enter(co, arg);
+ qemu_coroutine_enter(co);
@ entry3 @
expression entry, arg;
@@
- qemu_coroutine_enter(qemu_coroutine_create(entry), arg);
+ qemu_coroutine_enter(qemu_coroutine_create(entry, arg));
@ reentry @
expression co;
@@
- qemu_coroutine_enter(co, NULL);
+ qemu_coroutine_enter(co);
except for the aforementioned few places where the semantic patch
stumbled (as expected) and for test_co_queue, which would otherwise
produce an uninitialized variable warning.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds a new optional 'job-id' parameter to 'block-commit',
allowing the user to specify the ID of the block job to be created.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds a new optional 'job-id' parameter to 'blockdev-mirror'
and 'drive-mirror', allowing the user to specify the ID of the block
job to be created.
The HMP 'drive_mirror' command remains unchanged.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When a new job is created, the job ID is taken from the device name of
the BDS. This patch adds a new 'job_id' parameter to let the caller
provide one instead.
This patch also verifies that the ID is always unique and well-formed.
This causes problems in a couple of places where no ID is being set,
because the BDS does not have a device name.
In the case of test_block_job_start() (from test-blockjob-txn.c) we
can simply use this new 'job_id' parameter to set the missing ID.
In the case of img_commit() (from qemu-img.c) we still don't have the
API to make commit_active_start() set the job ID, so we solve it by
setting a default value. We'll get rid of this as soon as we extend
the API.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The 'id' field of the BlockJob structure will be able to hold any ID,
not only a device name. This patch updates the description of that
field and the error messages where it is being used.
Soon we'll add the ability to set an arbitrary ID when creating a
block job.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
s/target bs/to_replace/, also we check to_replace bs is not
blocked in qmp_drive_mirror() not here
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1466672241-22485-3-git-send-email-xiecl.fnst@cn.fujitsu.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
During the refactor of mirror_iteration in e5b43573,
we regressed the fix introduced in cae98cb8.
This patch re-adds IOV_MAX checking to cases where we
aren't checking alignment (and size) already.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1466625064-11280-3-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
mirror_do_read intends to return the number of sectors processed after
the starting sector, without regard to how many sectors were processed
before the starting sector due to alignment.
Clean up the comments and code to hopefully illustrate this more clearly.
This also fixes an issue in initialization where if the mirror buffer size
is initialized to smaller than the number of sectors being requested for
transfer, we report back an incorrectly large number to the caller.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1466625064-11280-2-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
trace_mirror_yield_in_flight accepts 2nd arguments in sectors while here
we pass chunks instead.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1466518157-27140-1-git-send-email-den@openvz.org
CC: Jeff Cody <jcody@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Add block_job_pause_point() calls to mark quiescent points and make sure
to complete in-flight requests when switching AioContexts.
This patch solves undefined behavior in the mirror block job when the
BDS AioContext is changed by dataplane.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1466096189-6477-8-git-send-email-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Currently, we are trying to move the backing BDS from the source to the
target in bdrv_replace_in_backing_chain() which is called from
mirror_exit(). However, mirror_complete() already tries to open the
target's backing chain with a call to bdrv_open_backing_file().
First, we should only set the target's backing BDS once. Second, the
mirroring block job has a better idea of what to set it to than the
generic code in bdrv_replace_in_backing_chain() (in fact, the latter's
conditions on when to move the backing BDS from source to target are not
really correct).
Therefore, remove that code from bdrv_replace_in_backing_chain() and
leave it to mirror_complete().
Depending on what kind of mirroring is performed, we furthermore want to
use different strategies to open the target's backing chain:
- If blockdev-mirror is used, we can assume the user made sure that the
target already has the correct backing chain. In particular, we should
not try to open a backing file if the target does not have any yet.
- If drive-mirror with mode=absolute-paths is used, we can and should
reuse the already existing chain of nodes that the source BDS is in.
In case of sync=full, no backing BDS is required; with sync=top, we
just link the source's backing BDS to the target, and with sync=none,
we use the source BDS as the target's backing BDS.
We should not try to open these backing files anew because this would
lead to two BDSs existing per physical file in the backing chain, and
we would like to avoid such concurrent access.
- If drive-mirror with mode=existing is used, we have to use the
information provided in the physical image file which means opening
the target's backing chain completely anew, just as it has been done
already.
If the target's backing chain shares images with the source, this may
lead to multiple BDSs per physical image file. But since we cannot
reliably ascertain this case, there is nothing we can do about it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20160610185750.30956-3-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
In a first step to convert the common I/O path to work on bytes rather
than sectors, this converts the copy-on-read logic that is used by
bdrv_aligned_preadv().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Commit e253f4b8 converted mirroring from sector-based bdrv_aio_*
to byte-based blk_aio_*, but failed to account for the subtle
difference in signatures (the former takes a semi-redundant length,
the latter takes a flags parameter). Since all of our flags are
currently smaller in size than BDRV_SECTOR_SIZE, it has no ill
effects until we either perform sub-sector mirroring, or we start
asserting that no unexpected flags are set. I found it while
testing new asserts when qemu-iotests 132 started warning about an
unknown flag 0x200000.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This changes the mirror block job to use the job's BlockBackend for
performing its I/O. job->bs isn't used by the mirroring code any more
afterwards.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
We had to forbid mirroring to a target BDS that already had a BB
attached because the node swapping at job completion would add a second
BB and we didn't support multiple BBs on a single BDS at the time. Now
we do, so we can lift the restriction.
As we allow additional BlockBackends for the target, we must expect
other users to be sending requests. There may no requests be in flight
during the graph modification, so we have to drain those users now.
The core part of this patch is a revert of commit 40365552.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This adds a new BlockBackend field to the BlockJob struct, which
coexists with the BlockDriverState while converting the individual jobs.
When creating a block job, a new BlockBackend is created on top of the
given BlockDriverState, and it is destroyed when the BlockJob ends. The
reference to the BDS is now held by the BlockBackend instead of calling
bdrv_ref/unref manually.
We have to be careful when we use bdrv_replace_in_backing_chain() in
block jobs because this changes the BDS that job->blk points to. At the
moment block jobs are too tightly coupled with their BDS, so that moving
a job to another BDS isn't easily possible; therefore, we need to just
manually undo this change afterwards.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This patch removes the remaining users of bs->blk, which will allow us
to have multiple BBs on top of a single BDS. In the meantime, all checks
that are currently in place to prevent the user from creating such
setups can be switched to bdrv_has_blk() instead of accessing BDS.blk.
Future patches can allow them and e.g. enable users to mirror to a block
device that already has a BlockBackend on it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Block jobs don't actually make use of the iostatus for their BDSes, but
they manage a separate block job iostatus. Still, they require that it
is enabled for the source BDS and they enable it automatically for the
target and set the error handling mode - which ends up never being used
by the job.
This patch removes all of the BDS iostatus handling from the block job,
which removes another few bs->blk accesses.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
When block job errors were introduced, we assigned the iostatus of the
target BDS "just in case". The field has never been accessible for the
user because the target isn't listed in query-block.
Before we can allow the user to have a second BlockBackend on the
target, we need to clean this up. If anything, we would want to set the
iostatus for the internal BB of the job (which we can always do later),
but certainly not for a separate BB which the job doesn't even use.
As a nice side effect, this gets us rid of another bs->blk use.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Commit 5a7e7a0ba moved mirror_exit to a BH handler but didn't add any
protection against new requests that could sneak in just before the
BH is dispatched. For example (assuming a code base at that commit):
main_loop_wait # 1
os_host_main_loop_wait
g_main_context_dispatch
aio_ctx_dispatch
aio_dispatch
...
mirror_run
bdrv_drain
(a) block_job_defer_to_main_loop
qemu_iohandler_poll
virtio_queue_host_notifier_read
...
virtio_submit_multiwrite
(b) blk_aio_multiwrite
main_loop_wait # 2
<snip>
aio_dispatch
aio_bh_poll
(c) mirror_exit
At (a) we know the BDS has no pending request. However, the same
main_loop_wait call is going to dispatch iohandlers (EventNotifier
events), which may lead to a new I/O from guest. So the invariant is
already broken at (c). Data loss.
Commit f3926945c8 made iohandler to use aio API. The order of
virtio_queue_host_notifier_read and block_job_defer_to_main_loop within
a main_loop_wait becomes unpredictable, and even worse, if the host
notifier event arrives at the next main_loop_wait call, the
unpredictable order between mirror_exit and
virtio_queue_host_notifier_read is also a trouble. As shown below, this
commit made the bug easier to trigger:
- Bug case 1:
main_loop_wait # 1
os_host_main_loop_wait
g_main_context_dispatch
aio_ctx_dispatch (qemu_aio_context)
...
mirror_run
bdrv_drain
(a) block_job_defer_to_main_loop
aio_ctx_dispatch (iohandler_ctx)
virtio_queue_host_notifier_read
...
virtio_submit_multiwrite
(b) blk_aio_multiwrite
main_loop_wait # 2
...
aio_dispatch
aio_bh_poll
(c) mirror_exit
- Bug case 2:
main_loop_wait # 1
os_host_main_loop_wait
g_main_context_dispatch
aio_ctx_dispatch (qemu_aio_context)
...
mirror_run
bdrv_drain
(a) block_job_defer_to_main_loop
main_loop_wait # 2
...
aio_ctx_dispatch (iohandler_ctx)
virtio_queue_host_notifier_read
...
virtio_submit_multiwrite
(b) blk_aio_multiwrite
aio_dispatch
aio_bh_poll
(c) mirror_exit
In both cases, (b) breaks the invariant wanted by (a) and (c).
Until then, the request loss has been silent. Later, 3f09bfbc7b added
asserts at (c) to check the invariant (in
bdrv_replace_in_backing_chain), and Max reported an assertion failure
first visible there, by doing active committing while the guest is
running bonnie++.
2.5 added bdrv_drained_begin at (a) to protect the dataplane case from
similar problems, but we never realize the main loop bug until now.
As a bandage, this patch disables iohandler's external events
temporarily together with bs->ctx.
Launchpad Bug: 1570134
Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The last sub-chunk is rounded up to the copy granularity in the target
image, resulting in a larger size than the source.
Add a function to clip the copied sectors to the end.
This undoes the "wrong" changes to tests/qemu-iotests/109.out in
e5b43573e2. The remaining two offset changes are okay.
[ kwolf: Use DIV_ROUND_UP to calculate nb_chunks now ]
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
If the drive's dirty bitmap is dirtied while the mirror operation is
running, the cache of the iterator used by the mirror code may become
stale and not contain all dirty bits.
This only becomes an issue if we are looking for contiguously dirty
chunks on the drive. In that case, we can easily detect the discrepancy
and just refresh the iterator if one occurs.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
mirror_iteration() is supposed to wait if the current chunk is subject
to a still in-flight mirroring operation. However, it mixed checking
this conflict situation with checking the dirty status of a chunk. A
simplification for the latter condition (the first chunk encountered is
always dirty) led to neglecting the former: We just skip the first chunk
and thus never test whether it conflicts with an in-flight operation.
To fix this, pull out the code which waits for in-flight operations on
the first chunk of the range to be mirrored to settle.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The only remaining users were block jobs (mirror and backup) which
unconditionally enabled WCE on the BlockBackend of the target image. As
these block jobs don't go through BlockBackend for their I/O requests,
they aren't affected by this setting anyway but always get a writeback
mode, so that call can be removed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Commit 57cb38b included qapi/error.h into qemu/osdep.h to get the
Error typedef. Since then, we've moved to include qemu/osdep.h
everywhere. Its file comment explains: "To avoid getting into
possible circular include dependencies, this file should not include
any other QEMU headers, with the exceptions of config-host.h,
compiler.h, os-posix.h and os-win32.h, all of which are doing a
similar job to this file and are under similar constraints."
qapi/error.h doesn't do a similar job, and it doesn't adhere to
similar constraints: it includes qapi-types.h. That's in excess of
100KiB of crap most .c files don't actually need.
Add the typedef to qemu/typedefs.h, and include that instead of
qapi/error.h. Include qapi/error.h in .c files that need it and don't
get it now. Include qapi-types.h in qom/object.h for uint16List.
Update scripts/clean-includes accordingly. Update it further to match
reality: replace config.h by config-target.h, add sysemu/os-posix.h,
sysemu/os-win32.h. Update the list of includes in the qemu/osdep.h
comment quoted above similarly.
This reduces the number of objects depending on qapi/error.h from "all
of them" to less than a third. Unfortunately, the number depending on
qapi-types.h shrinks only a little. More work is needed for that one.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
[Fix compilation without the spice devel packages. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The three lines are duplicated a number of times now, refactor a
function.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1454637630-10585-3-git-send-email-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
The "pnum < nb_sectors" condition in deciding whether to actually copy
data is unnecessarily strict, and the qiov initialization is
unnecessarily for bdrv_aio_write_zeroes and bdrv_aio_discard.
Rewrite mirror_iteration to fix both flaws.
The output of iotests 109 is updated because we now report the offset
and len slightly differently in mirroring progress.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1454637630-10585-2-git-send-email-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
The added parameter can be used to return the BDS pointer which the
valid offset is referring to. Its value should be ignored unless
BDRV_BLOCK_OFFSET_VALID in ret is set.
Until block drivers fill in the right value, let's clear it explicitly
right before calling .bdrv_get_block_status.
The "bs->file" condition in bdrv_co_get_block_status is kept now to keep iotest
case 102 passing, and will be fixed once all drivers return the right file
pointer.
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1453780743-16806-2-git-send-email-famz@redhat.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For bs->file, using references to existing BDSes has been possible for a
while already. This patch enables the same for bs->backing.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
bdrv_replace_in_backing_chain() asserts that not both old and new
BlockDdriverState have a BlockBackend attached to them because both
would have to end up pointing to the new BDS and we don't support more
than one BB per BDS yet.
Before we can safely allow references to existing nodes as backing
files, we need to make sure that even if a backing file has a BB on it,
this doesn't crash qemu.
There are probably also some cases with the 'replaces' option set where
drive-mirror could fail this assertion today. They are fixed with this
error check as well.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
With dataplane, the ioeventfd events could be dispatched after
mirror_run releases the dirty bitmap, but before mirror_exit actually
does the device switch, because the iothread will still be running, and
it will cause silent data loss.
Fix this by adding a bdrv_drained_begin/end pair around the window, so
that no new external request will be handled.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Add reference count to block job, meanwhile move the ownership of the
reference to job->bs from the caller (which is released in two
completion callbacks) to the block job itself. It is necessary for
block_job_complete_sync to work, because block job shouldn't live longer
than its bs, as asserted in bdrv_delete.
Now block_job_complete_sync can be simplified.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 1446765200-3054-6-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There's nothing preventing the target image from being used by other
operations during the 'drive-mirror' job, so we should block them all
until the job is done.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 82b88fd04cde918a08a6f9a4ab85626d7fd7b502.1446475331.git.berto@igalia.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
These options are only relevant for the user of a whole BDS tree (like a
guest device or a block job) and should thus be moved into the
BlockBackend.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This cleans up the mess we left behind in the mirror code after the
previous patch. Instead of using bdrv_swap(), just change pointers.
The interface change of the mirror job that callers must consider is
that after job completion, their local BDS pointers still point to the
same node now. qemu-img must change its code accordingly (which makes it
easier to understand); the other callers stays unchanged because after
completion they don't do anything with the BDS, but just with the job,
and the job is still owned by the source BDS.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Some block jobs change the block device graph on completion. This means
that the device that owns the job and originally was addressed with its
device name may no longer be what the corresponding BlockBackend points
to.
Previously, the effects of bdrv_swap() ensured that the job was (at
least partially) transferred to the target image. Events that contain
the device name could still use bdrv_get_device_name(job->bs) and get
the same result.
After removing bdrv_swap(), this won't work any more. Instead, save the
device name at job creation and use that copy for QMP events and
anything else identifying the job.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
This simplifies the code somewhat, especially when dropping whole
backing file subchains.
The exception is the mirroring code that does adventurous things with
bdrv_swap() and in order to keep it working, I had to duplicate most of
bdrv_set_backing_hd() locally. We'll get rid again of this ugliness
shortly.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
This is the final step in converting all of the BlockDriverState
pointers that block drivers use to BdrvChild.
After this patch, bs->children contains the full list of child nodes
that are referenced by a given BDS, and these children are only
referenced through BdrvChild, so that updating the pointer in there is
enough for changing edges in the graph.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Simplify memory allocation by sticking with a single API. GSlice
is not that fast anyway (tcmalloc/jemalloc are better).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
During mirror, if the target device does not support zero init, a
mirror may result in a corrupted image for sync="full" mode.
This is due to how the initial dirty bitmap is set up prior to copying
data - we did not mark sectors as dirty that are unallocated. This
means those unallocated sectors are skipped over on the target, and for
a device without zero init, invalid data may reside in those holes.
If both of the following conditions are true, then we will explicitly
mark all sectors as dirty:
1.) sync = "full"
2.) bdrv_has_zero_init(target) == false
If the target does support zero init, but a target image is passed in
with data already present (i.e. an "existing" image), it is assumed the
data present in the existing image is valid data for those sectors.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 91ed4bc5bda7e2b09eb508b07c83f4071fe0b3c9.1443705220.git.jcody@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
We use mirror+replace to fix quorum's broken child. bs/s->common.bs
is quorum, and to_replace is the broken child. The new child is target_bs.
Without this patch, the replace node can be any node, and it can be
top BDS with BB, or another quorum's child. We just check if the broken
child is part of the quorum BDS in this patch.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Message-id: 55A86486.1000404@cn.fujitsu.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This fixes a regression introduced by commit dcfb3beb ("mirror: Do zero
write on target if sectors not allocated"), which was reported to cause
aborts with the message "Co-routine re-entered recursively".
The cause for this bug is the following code in mirror_iteration_done():
if (s->common.busy) {
qemu_coroutine_enter(s->common.co, NULL);
}
This has always been ugly because - unlike most places that reenter - it
doesn't have a specific yield that it pairs with, but is more
uncontrolled. What we really mean here is "reenter the coroutine if
it's in one of the four explicit yields in mirror.c".
This used to be equivalent with s->common.busy because neither
mirror_run() nor mirror_iteration() call any function that could yield.
However since commit dcfb3beb this doesn't hold true any more:
bdrv_get_block_status_above() can yield.
So what happens is that bdrv_get_block_status_above() wants to take a
lock that is already held, so it adds itself to the queue of waiting
coroutines and yields. Instead of being woken up by the unlock function,
however, it gets woken up by mirror_iteration_done(), which is obviously
wrong.
In most cases the code actually happens to cope fairly well with such
cases, but in this specific case, the unlock must already have scheduled
the coroutine for wakeup when mirror_iteration_done() reentered it. And
then the coroutine happened to process the scheduled restarts and tried
to reenter itself recursively.
This patch fixes the problem by pairing the reenter in
mirror_iteration_done() with specific yields instead of abusing
s->common.busy.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1439455310-11263-1-git-send-email-kwolf@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
If mirror has more free buffers than IOV_MAX, preadv(2)/pwritev(2)
EINVAL failures may be encountered.
It is possible to trigger this by setting granularity to a low value
like 8192.
This patch stops appending chunks once IOV_MAX is reached.
The spurious EINVAL failure can be reproduced with a qcow2 image file
and the following QMP invocation:
qmp.command('drive-mirror', device='virtio0', target='/tmp/r7.s1',
granularity=8192, sync='full', mode='absolute-paths',
format='raw')
While the guest is running dd if=/dev/zero of=/var/tmp/foo oflag=direct
bs=4k.
Cc: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1435761950-26714-1-git-send-email-stefanha@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Limiting to sectors_per_chunk for each bdrv_is_allocated_above is slow,
because the underlying protocol driver would issue much more queries
than necessary. We should coalesce the query.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: <1436413678-7114-4-git-send-email-famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If bus_size is less than 0, the command fails.
If buf_size is 0, use DEFAULT_MIRROR_BUF_SIZE.
If buf_size % granularity is not 0, mirror_free_init() will
do dangerous things.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 5555A588.3080907@cn.fujitsu.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Before, we only yield after initializing dirty bitmap, where the QMP
command would return. That may take very long, and guest IO will be
blocked.
Add sleep points like the later mirror iterations.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1431486673-19280-1-git-send-email-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
There is job resource leak in function mirror_start_job,
although bdrv_create_dirty_bitmap is unlikely failed.
Add block_job_release for each release when needed.
Signed-off-by: Ting Wang <kathy.wangting@huawei.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1435311455-56048-1-git-send-email-kathy.wangting@huawei.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If guest discards a source cluster, mirroring with bdrv_aio_readv is overkill.
Some protocols do zero upon discard, where it's best to use
bdrv_aio_write_zeroes, otherwise, bdrv_aio_discard will be enough.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If specified as "true", it allows discarding on target sectors where source is
not allocated.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If we wish to make differential backups a feature that's easy to access,
it might be pertinent to rename the "dirty-bitmap" mode to "incremental"
to make it clear what /type/ of backup the dirty-bitmap is helping us
perform.
This is an API breaking change, but 2.4 has not yet gone live,
so we have this flexibility.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 1433463642-21840-2-git-send-email-jsnow@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In particular, don't include it into headers.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
These macros expand into error class enumeration constant, comma,
string. Unclean. Has been that way since commit 13f59ae.
The error class is always ERROR_CLASS_GENERIC_ERROR since the previous
commit.
Clean up as follows:
* Prepend every use of a QERR_ macro by ERROR_CLASS_GENERIC_ERROR, and
delete it from the QERR_ macro. No change after preprocessing.
* Rewrite error_set(ERROR_CLASS_GENERIC_ERROR, ...) into
error_setg(...). Again, no change after preprocessing.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
The mirror block job is trying to take a clever shortcut if delay_ns is
0 and skips block_job_sleep_ns() in that case. But that function must be
called in every block job iteration, because otherwise it is for example
impossible to pause the job.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We often don't need the BlockDriverState for functions
that operate on bitmaps. Remove it.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-15-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For "dirty-bitmap" sync mode, the block job will iterate through the
given dirty bitmap to decide if a sector needs backup (backup all the
dirty clusters and skip clean ones), just as allocation conditions of
"top" sync mode.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-11-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The new command pair is added to manage a user created dirty bitmap. The
dirty bitmap's name is mandatory and must be unique for the same device,
but different devices can have bitmaps with the same names.
The granularity is an optional field. If it is not specified, we will
choose a default granularity based on the cluster size if available,
clamped to between 4K and 64K to mirror how the 'mirror' code was
already choosing granularity. If we do not have cluster size info
available, we choose 64K. This code has been factored out into a helper
shared with block/mirror.
This patch also introduces the 'block_dirty_bitmap_lookup' helper,
which takes a device name and a dirty bitmap name and validates the
lookup, returning NULL and setting errp if there is a problem with
either field. This helper will be re-used in future patches in this
series.
The types added to block-core.json will be re-used in future patches
in this series, see:
'qapi: Add transaction support to block-dirty-bitmap-{add, enable, disable}'
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1429314609-29776-5-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We treat this field with a variety of different types everywhere
in the code. Now it's just uint32_t.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-4-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This field will be set for user created dirty bitmap. Also pass in an
error pointer to bdrv_create_dirty_bitmap, so when a name is already
taken on this BDS, it can report an error message. This is not global
check, two BDSes can have dirty bitmap with a common name.
Implemented bdrv_find_dirty_bitmap to find a dirty bitmap by name, will
be used later when other QMP commands want to reference dirty bitmap by
name.
Add bdrv_dirty_bitmap_make_anon. This unsets the name of dirty bitmap.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-3-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 1428069921-2957-5-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch changes block_job_pause to increase the pause counter and
block_job_resume to decrease it.
The counter will allow calling block_job_pause/block_job_resume
unconditionally on a job when we need to suspend the IO temporarily.
From now on, each block_job_resume must be paired with a block_job_pause
to keep the counter balanced.
The user pause from QMP or HMP will only trigger block_job_pause once
until it's resumed, this is achieved by adding a user_paused flag in
BlockJob.
One occurrence of block_job_resume in mirror_complete is replaced with
block_job_enter which does what is necessary.
In block_job_cancel, the cancel flag is good enough to instruct
coroutines to quit loop, so use block_job_enter to replace the unpaired
block_job_resume.
Upon block job IO error, user is notified about the entering to the
pause state, so this pause belongs to user pause, set the flag
accordingly and expect a matching QMP resume.
[Extended doc comments as suggested by Paolo Bonzini
<pbonzini@redhat.com>.
--Stefan]
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 1428069921-2957-2-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The backing_filename string in mirror_run() is only used to check
for a NULL string, so we don't need to allocate 1024 bytes (or, later,
PATH_MAX bytes), when we only need to copy the first 2 characters.
We technically only need 1 byte, as we are just checking for NULL, but
since backing_filename[] is populated by bdrv_get_backing_filename(), a
string size of 1 will always only return '\0';
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Mirror and migration use dirty bitmaps for their purposes, and since
commit [block: per caller dirty bitmap] they use their own bitmaps, not
the global one. But they use old functions bdrv_set_dirty and
bdrv_reset_dirty, which change all dirty bitmaps.
Named dirty bitmaps series by Fam and Snow are affected: mirroring and
migration will spoil all (not related to this mirroring or migration)
named dirty bitmaps.
This patch fixes this by adding bdrv_set_dirty_bitmap and
bdrv_reset_dirty_bitmap, which change concrete bitmap. Also, to prevent
such mistakes in future, old functions bdrv_(set,reset)_dirty are made
static, for internal block usage.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@parallels.com>
CC: John Snow <jsnow@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Denis V. Lunev <den@openvz.org>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417081246-3593-1-git-send-email-vsementsov@parallels.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
The mirror block job must run in the BlockDriverState AioContext so that
it works with dataplane.
Acquire the AioContext in blockdev.c so starting the block job is safe.
Note that to_replace is treated separately from other BlockDriverStates
in that it does not need to be in the same AioContext. Explicitly
acquire/release to_replace's AioContext when accessing it.
The completion code in block/mirror.c must perform BDS graph
manipulation and bdrv_reopen() from the main loop. Use
block_job_defer_to_main_loop() to achieve that.
The bdrv_drain_all() call is not allowed outside the main loop since it
could lead to lock ordering problems. Use bdrv_drain(bs) instead
because we have acquired the AioContext so nothing else can sneak in
I/O.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1413889440-32577-10-git-send-email-stefanha@redhat.com