mirrors/qemu - qemu - SynapseOS git

Author	SHA1	Message	Date
Eric Blake	3e4d0e72b7	block: Switch passthrough drivers to .bdrv_co_block_status() We are gradually moving away from sector-based interfaces, towards byte-based. Update the generic helpers, and all passthrough clients (blkdebug, commit, mirror, throttle) accordingly. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2018-03-02 18:39:07 +01:00
Eric Blake	86a3d5c688	block: Add .bdrv_co_block_status() callback We are gradually moving away from sector-based interfaces, towards byte-based. Now that the block layer exposes byte-based allocation, it's time to tackle the drivers. Add a new callback that operates on as small as byte boundaries. Subsequent patches will then update individual drivers, then finally remove .bdrv_co_get_block_status(). The new code also passes through the 'want_zero' hint, which will allow subsequent patches to further optimize callers that only care about how much of the image is allocated (want_zero is false), rather than full details about runs of zeroes and which offsets the allocation actually maps to (want_zero is true). As part of this effort, fix another part of the documentation: the claim in commit `4c41cb4` that BDRV_BLOCK_ALLOCATED is short for 'DATA \|\| ZERO' is a lie at the block layer (see commit `e88ae2264`), even though it is how the bit is computed from the driver layer. After all, there are intentionally cases where we return ZERO but not ALLOCATED at the block layer, when we know that a read sees zero because the backing file is too short. Note that the driver interface is thus slightly different than the public interface with regards to which bits will be set, and what guarantees are provided on input. We also add an assertion that any driver using the new callback will make progress (the only time pnum will be 0 is if the block layer already handled an out-of-bounds request, or if there is an error); the old driver interface did not provide this guarantee, which could lead to some inf-loops in drastic corner-case failures. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2018-03-02 18:39:07 +01:00
Fam Zheng	23d0ba9319	block: Introduce buf register API Allow block driver to map and unmap a buffer for later I/O, as a performance hint. Signed-off-by: Fam Zheng <famz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20180116060901.17413-5-famz@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2018-02-08 09:22:03 +08:00
Kevin Wolf	d736f119da	block: Allow graph changes in subtree drained section We need to remember how many of the drain sections in which a node is were recursive (i.e. subtree drain rather than node drain), so that they can be correctly applied when children are added or removed during the drained section. With this change, it is safe to modify the graph even inside a bdrv_subtree_drained_begin/end() section. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-12-22 15:05:32 +01:00
Kevin Wolf	b016558590	block: Add bdrv_subtree_drained_begin/end() bdrv_drained_begin() waits for the completion of requests in the whole subtree, but it only actually keeps its immediate bs parameter quiesced until bdrv_drained_end(). Add a version that keeps the whole subtree drained. As of this commit, graph changes cannot be allowed during a subtree drained section, but this will be fixed soon. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-12-22 15:05:32 +01:00
Kevin Wolf	0152bf400f	block: Don't notify parents in drain call chain This is in preparation for subtree drains, i.e. drained sections that affect not only a single node, but recursively all child nodes, too. Calling the parent callbacks for drain is pointless when we just came from that parent node recursively and leads to multiple increases of bs->quiesce_counter in a single drain call. Don't do it. In order for this to work correctly, the parent callback must be called for every bdrv_drain_begin/end() call, not only for the outermost one: If we have a node N with two parents A and B, recursive draining of A should cause the quiesce_counter of B to increase because its child N is drained independently of B. If now B is recursively drained, too, A must increase its quiesce_counter because N is drained independently of A only now, even if N is going from quiesce_counter 1 to 2. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-12-22 15:05:32 +01:00
Kevin Wolf	0f11516894	block: Nested drain_end must still call callbacks bdrv_do_drained_begin() restricts the call of parent callbacks and aio_disable_external() to the outermost drain section, but the block driver callbacks are always called. bdrv_do_drained_end() must match this behaviour, otherwise nodes stay drained even if begin/end calls were balanced. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-12-22 15:05:32 +01:00
Kevin Wolf	8119334918	block: Don't block_job_pause_all() in bdrv_drain_all() Block jobs are already paused using the BdrvChildRole drain callbacks, so we don't need an additional block_job_pause_all() call. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-12-22 15:05:32 +01:00
Kevin Wolf	7b6a3d3553	block: Make bdrv_drain() driver callbacks non-recursive bdrv_drained_begin() doesn't increase bs->quiesce_counter recursively and also doesn't notify other parent nodes of children, which both means that the child nodes are not actually drained, and bdrv_drained_begin() is providing useful functionality only on a single node. To keep things consistent, we also shouldn't call the block driver callbacks recursively. A proper recursive drain version that provides an actually working drained section for child nodes will be introduced later. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com>	2017-12-22 15:05:31 +01:00
Kevin Wolf	9a7e86c804	block: Assert drain_all is only called from main AioContext Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com>	2017-12-22 15:05:31 +01:00
Fam Zheng	8e77e0bceb	block: Remove unused bdrv_requests_pending Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-12-22 15:05:31 +01:00
Kevin Wolf	60369b86c4	block: Unify order in drain functions Drain requests are propagated to child nodes, parent nodes and directly to the AioContext. The order in which this happened was different between all combinations of drain/drain_all and begin/end. The correct order is to keep children only drained when their parents are also drained. This means that at the start of a drained section, the AioContext needs to be drained first, the parents second and only then the children. The correct order for the end of a drained section is the opposite. This patch changes the three other functions to follow the example of bdrv_drained_begin(), which is the only one that got it right. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-12-22 15:03:41 +01:00
Kevin Wolf	5280aa32e1	block: Don't wait for requests in bdrv_drain*_end() The device is drained, so there is no point in waiting for requests at the end of the drained section. Remove the bdrv_drain_recurse() calls there. The bdrv_drain_recurse() calls were introduced in commit `481cad48e5` in order to call the .bdrv_co_drain_end() driver callback. This is now done by a separate bdrv_drain_invoke() call. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-12-22 15:03:41 +01:00
Kevin Wolf	99c05de918	block: bdrv_drain_recurse(): Remove unused begin parameter Now that the bdrv_drain_invoke() calls are pulled up to the callers of bdrv_drain_recurse(), the 'begin' parameter isn't needed any more. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-12-22 15:03:41 +01:00
Kevin Wolf	2da9b7d456	block: Call .drain_begin only once in bdrv_drain_all_begin() bdrv_drain_all_begin() used to call the .bdrv_co_drain_begin() driver callback inside its polling loop. This means that how many times it got called for each node depended on long it had to poll the event loop. This is obviously not right and results in nodes that stay drained even after bdrv_drain_all_end(), which calls .bdrv_co_drain_begin() once per node. Fix bdrv_drain_all_begin() to call the callback only once, too. Cc: qemu-stable@nongnu.org Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-12-22 15:03:41 +01:00
Kevin Wolf	db0289b9b2	block: Make bdrv_drain_invoke() recursive This change separates bdrv_drain_invoke(), which calls the BlockDriver drain callbacks, from bdrv_drain_recurse(). Instead, the function performs its own recursion now. One reason for this is that bdrv_drain_recurse() can be called multiple times by bdrv_drain_all_begin(), but the callbacks may only be called once. The separation is necessary to fix this bug. The other reason is that we intend to go to a model where we call all driver callbacks first, and only then start polling. This is not fully achieved yet with this patch, as bdrv_drain_invoke() contains a BDRV_POLL_WHILE() loop for the block driver callbacks, which can still call callbacks for any unrelated event. It's a step in this direction anyway. Cc: qemu-stable@nongnu.org Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-12-22 15:03:41 +01:00
Kevin Wolf	02d213009d	block: Expect graph changes in bdrv_parent_drained_begin/end The .drained_begin/end callbacks can (directly or indirectly via aio_poll()) cause block nodes to be removed or the current BdrvChild to point to a different child node. Use QLIST_FOREACH_SAFE() to make sure we don't access invalid BlockDriverStates or accidentally continue iterating the parents of the new child node instead of the node we actually came from. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Tested-by: Jeff Cody <jcody@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-11-29 14:22:03 +01:00
Max Reitz	d470ad42ac	block: Guard against NULL bs->drv We currently do not guard everywhere against a NULL bs->drv where we should be doing so. Most of the places fixed here just do not care about that case at all. Some care implicitly, e.g. through a prior function call to bdrv_getlength() which would always fail for an ejected BDS. Add an assert there to make it more obvious. Other places seem to care, but do so insufficiently: Freeing clusters in a qcow2 image is an error-free operation, but it may leave the image in an unusable state anyway. Giving qcow2_free_clusters() an error code is not really viable, it is much easier to note that bs->drv may be NULL even after a successful driver call. This concerns bdrv_co_flush(), and the way the check is added to bdrv_co_pdiscard() (in every iteration instead of only once). Finally, some places employ at least an assert(bs->drv); somewhere, that may be reasonable (such as in the reopen code), but in bdrv_has_zero_init(), it is definitely not. Returning 0 there in case of an ejected BDS saves us much headache instead. Reported-by: R. Nageswara Sastry <nasastry@in.ibm.com> Buglink: https://bugs.launchpad.net/qemu/+bug/1728660 Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20171110203111.7666-4-mreitz@redhat.com Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>	2017-11-17 18:21:31 +01:00
Eric Blake	88e63df214	block: Reduce bdrv_aligned_preadv() rounding Now that bdrv_is_allocated accepts non-aligned inputs, we can remove the TODO added in commit `d6a644bb`. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Eric Blake	efa6e2ed64	block: Align block status requests Any device that has request_alignment greater than 512 should be unable to report status at a finer granularity; it may also be simpler for such devices to be guaranteed that the block layer has rounded things out to the granularity boundary (the way the block layer already rounds all other I/O out). Besides, getting the code correct for super-sector alignment also benefits us for the fact that our public interface now has byte granularity, even though none of our drivers have byte-level callbacks. Add an assertion in blkdebug that proves that the block layer never requests status of unaligned sections, similar to what it does on other requests (while still keeping the generic helper in place for when future patches add a throttle driver). Note that iotest 177 already covers this (it would fail if you use just the blkdebug.c hunk without the io.c changes). Meanwhile, we can drop assertions in callers that no longer have to pass in sector-aligned addresses. There is a mid-function scope added for 'count' and 'longret', for a couple of reasons: first, an upcoming patch will add an 'if' statement that checks whether a driver has an old- or new-style callback, and can conveniently use the same scope for less indentation churn at that time. Second, since we are trying to get rid of sector-based computations, wrapping things in a scope makes it easier to group and see what will be deleted in a final cleanup patch once all drivers have been converted to the new-style callback. Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Eric Blake	3182664220	block: Convert bdrv_get_block_status_above() to bytes We are gradually moving away from sector-based interfaces, towards byte-based. In the common case, allocation is unlikely to ever use values that are not naturally sector-aligned, but it is possible that byte-based values will let us be more precise about allocation at the end of an unaligned file that can do byte-based access. Changing the name of the function from bdrv_get_block_status_above() to bdrv_block_status_above() ensures that the compiler enforces that all callers are updated. Likewise, since it a byte interface allows an offset mapping that might not be sector aligned, split the mapping out of the return value and into a pass-by-reference parameter. For now, the io.c layer still assert()s that all uses are sector-aligned, but that can be relaxed when a later patch implements byte-based block status in the drivers. For the most part this patch is just the addition of scaling at the callers followed by inverse scaling at bdrv_block_status(), plus updates for the new split return interface. But some code, particularly bdrv_block_status(), gets a lot simpler because it no longer has to mess with sectors. Likewise, mirror code no longer computes s->granularity >> BDRV_SECTOR_BITS, and can therefore drop an assertion about alignment because the loop no longer depends on alignment (never mind that we don't really have a driver that reports sub-sector alignments, so it's not really possible to test the effect of sub-sector mirroring). Fix a neighboring assertion to use is_power_of_2 while there. For ease of review, bdrv_get_block_status() was tackled separately. Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Eric Blake	5b648c67e3	block: Switch bdrv_co_get_block_status_above() to byte-based We are gradually converting to byte-based interfaces, as they are easier to reason about than sector-based. Convert another internal type (no semantic change), and rename it to match the corresponding public function rename. Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Eric Blake	7ddb99b9dc	block: Switch bdrv_common_block_status_above() to byte-based We are gradually converting to byte-based interfaces, as they are easier to reason about than sector-based. Convert another internal function (no semantic change). Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Eric Blake	4bcd936e47	block: Switch BdrvCoGetBlockStatusData to byte-based We are gradually converting to byte-based interfaces, as they are easier to reason about than sector-based. Convert another internal type (no semantic change), and rename it to match the corresponding public function rename. Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Eric Blake	2e8bc7874b	block: Switch bdrv_co_get_block_status() to byte-based We are gradually converting to byte-based interfaces, as they are easier to reason about than sector-based. Convert another internal function (no semantic change); and as with its public counterpart, rename to bdrv_co_block_status() and split the offset return, to make the compiler enforce that we catch all uses. For now, we assert that callers and the return value still use aligned data, but ultimately, this will be the function where we hand off to a byte-based driver callback, and will eventually need to add logic to ensure we round calls according to the driver's request_alignment then touch up the result handed back to the caller, to start permitting a caller to pass unaligned offsets. Note that we are now prepared to accepts 'bytes' larger than INT_MAX; this is okay as long as we clamp things internally before violating any 32-bit limits, and makes no difference to how a client will use the information (clients looping over the entire file must already be prepared for consecutive calls to return the same status, as drivers are already free to return shorter-than-maximal status due to any other convenient split points, such as when the L2 table crosses cluster boundaries in qcow2). Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Eric Blake	237d78f8fc	block: Convert bdrv_get_block_status() to bytes We are gradually moving away from sector-based interfaces, towards byte-based. In the common case, allocation is unlikely to ever use values that are not naturally sector-aligned, but it is possible that byte-based values will let us be more precise about allocation at the end of an unaligned file that can do byte-based access. Changing the name of the function from bdrv_get_block_status() to bdrv_block_status() ensures that the compiler enforces that all callers are updated. For now, the io.c layer still assert()s that all callers are sector-aligned, but that can be relaxed when a later patch implements byte-based block status in the drivers. There was an inherent limitation in returning the offset via the return value: we only have room for BDRV_BLOCK_OFFSET_MASK bits, which means an offset can only be mapped for sector-aligned queries (or, if we declare that non-aligned input is at the same relative position modulo 512 of the answer), so the new interface also changes things to return the offset via output through a parameter by reference rather than mashed into the return value. We'll have some glue code that munges between the two styles until we finish converting all uses. For the most part this patch is just the addition of scaling at the callers followed by inverse scaling at bdrv_block_status(), coupled with the tweak in calling convention. But some code, particularly bdrv_is_allocated(), gets a lot simpler because it no longer has to mess with sectors. For ease of review, bdrv_get_block_status_above() will be tackled separately. Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Eric Blake	7286d6106f	block: Switch bdrv_make_zero() to byte-based We are gradually converting to byte-based interfaces, as they are easier to reason about than sector-based. Change the internal loop iteration of zeroing a device to track by bytes instead of sectors (although we are still guaranteed that we iterate by steps that are sector-aligned). Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Eric Blake	7cfd527525	block: Make bdrv_round_to_clusters() signature more useful In the process of converting sector-based interfaces to bytes, I'm finding it easier to represent a byte count as a 64-bit integer at the block layer (even if we are internally capped by SIZE_MAX or even INT_MAX for individual transactions, it's still nicer to not have to worry about truncation/overflow issues on as many variables). Update the signature of bdrv_round_to_clusters() to uniformly use int64_t, matching the signature already chosen for bdrv_is_allocated and the fact that off_t is also a signed type, then adjust clients according to the required fallout (even where the result could now exceed 32 bits, no client is directly assigning the result into a 32-bit value without breaking things into a loop first). Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Eric Blake	c9ce8c4da6	block: Add flag to avoid wasted work in bdrv_is_allocated() Not all callers care about which BDS owns the mapping for a given range of the file, or where the zeroes lie within that mapping. In particular, bdrv_is_allocated() cares more about finding the largest run of allocated data from the guest perspective, whether or not that data is consecutive from the host perspective, and whether or not the data reads as zero. Therefore, doing subsequent refinements such as checking how much of the format-layer allocation also satisfies BDRV_BLOCK_ZERO at the protocol layer is wasted work - in the best case, it just costs extra CPU cycles during a single bdrv_is_allocated(), but in the worst case, it results in a smaller *pnum, and forces callers to iterate through more status probes when visiting the entire file for even more extra CPU cycles. This patch only optimizes the block layer (no behavior change when want_zero is true, but skip unnecessary effort when it is false). Then when subsequent patches tweak the driver callback to be byte-based, we can also pass this hint through to the driver. Tweak BdrvCoGetBlockStatusData to declare arguments in parameter order, rather than mixing things up (minimizing padding is not necessary here). Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Eric Blake	298a1665a2	block: Allow NULL file for bdrv_get_block_status() Not all callers care about which BDS owns the mapping for a given range of the file. This patch merely simplifies the callers by consolidating the logic in the common call point, while guaranteeing a non-NULL file to all the driver callbacks, for no semantic change. The only caller that does not care about pnum is bdrv_is_allocated, as invoked by vvfat; we can likewise add assertions that the rest of the stack does not have to worry about a NULL pnum. Furthermore, this will also set the stage for a future cleanup: when a caller does not care about which BDS owns an offset, it would be nice to allow the driver to optimize things to not have to return BDRV_BLOCK_OFFSET_VALID in the first place. In the case of fragmented allocation (for example, it's fairly easy to create a qcow2 image where consecutive guest addresses are not at consecutive host addresses), the current contract requires bdrv_get_block_status() to clamp pnum to the limit where host addresses are no longer consecutive, but allowing a NULL file means that pnum could be set to the full length of known-allocated data. Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-26 14:45:57 +02:00
Manos Pitsidianakis	f8ea8dacf0	block: rename bdrv_co_drain to bdrv_co_drain_begin Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Manos Pitsidianakis <el13635@mail.ntua.gr> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-10-13 12:38:41 +01:00
Manos Pitsidianakis	481cad48e5	block: add bdrv_co_drain_end callback BlockDriverState has a bdrv_co_drain() callback but no equivalent for the end of the drain. The throttle driver (block/throttle.c) needs a way to mark the end of the drain in order to toggle io_limits_disabled correctly, thus bdrv_co_drain_end is needed. Signed-off-by: Manos Pitsidianakis <el13635@mail.ntua.gr> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-10-13 12:38:41 +01:00
Eric Blake	cb2e28780c	block: Perform copy-on-read in loop Improve our braindead copy-on-read implementation. Pre-patch, we have multiple issues: - we create a bounce buffer and perform a write for the entire request, even if the active image already has 99% of the clusters occupied, and really only needs to copy-on-read the remaining 1% of the clusters - our bounce buffer was as large as the read request, and can needlessly exhaust our memory by using double the memory of the request size (the original request plus our bounce buffer), rather than a capped maximum overhead beyond the original - if a driver has a max_transfer limit, we are bypassing the normal code in bdrv_aligned_preadv() that fragments to that limit, and instead attempt to read the entire buffer from the driver in one go, which some drivers may assert on - a client can request a large request of nearly 2G such that rounding the request out to cluster boundaries results in a byte count larger than 2G. While this cannot exceed 32 bits, it DOES have some follow-on problems: -- the call to bdrv_driver_pread() can assert for exceeding BDRV_REQUEST_MAX_BYTES, if the driver is old and lacks .bdrv_co_preadv -- if the buffer is all zeroes, the subsequent call to bdrv_co_do_pwrite_zeroes is a no-op due to a negative size, which means we did not actually copy on read Fix all of these issues by breaking up the action into a loop, where each iteration is capped to sane limits. Also, querying the allocation status allows us to optimize: when data is already present in the active layer, we don't need to bounce. Note that the code has a telling comment that copy-on-read should probably be a filter driver rather than a bolt-on hack in io.c; but that remains a task for another day. CC: qemu-stable@nongnu.org Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-06 16:28:58 +02:00
Eric Blake	d855ebcd3c	block: Add blkdebug hook for copy-on-read Make it possible to inject errors on writes performed during a read operation due to copy-on-read semantics. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-06 16:28:58 +02:00
Eric Blake	9cdcfd9f7a	block: Uniform handling of 0-length bdrv_get_block_status() Handle a 0-length block status request up front, with a uniform return value claiming the area is not allocated. Most callers don't pass a length of 0 to bdrv_get_block_status() and friends; but it definitely happens with a 0-length read when copy-on-read is enabled. While we could audit all callers to ensure that they never make a 0-length request, and then assert that fact, it was just as easy to fix things to always report success (as long as the callers are careful to not go into an infinite loop). However, we had inconsistent behavior on whether the status is reported as allocated or defers to the backing layer, depending on what callbacks the driver implements, and possibly wasting quite a few CPU cycles to get to that answer. Consistently reporting unallocated up front doesn't really hurt anything, and makes it easier both for callers (0-length requests now have well-defined behavior) and for drivers (drivers don't have to deal with 0-length requests). Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-06 16:28:58 +02:00
Eric Blake	0fdf1a4f68	dirty-bitmap: Switch bdrv_set_dirty() to bytes Both callers already had bytes available, but were scaling to sectors. Move the scaling to internal code. In the case of bdrv_aligned_pwritev(), we are now passing the exact offset rather than a rounded sector-aligned value, but that's okay as long as dirty bitmap widens start/bytes to granularity boundaries. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-06 16:28:58 +02:00
Eric Blake	765d9df962	block: Typo fix in copy_on_readv() Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-10-06 16:28:58 +02:00
Manos Pitsidianakis	f7cc69b326	block: add default implementations for bdrv_co_get_block_status() bdrv_co_get_block_status_from_file() and bdrv_co_get_block_status_from_backing() set *file to bs->file and bs->backing respectively, so that bdrv_co_get_block_status() can recurse to them. Future block drivers won't have to duplicate code to implement this. Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Manos Pitsidianakis <el13635@mail.ntua.gr> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-09-04 18:31:13 +02:00
Daniel P. Berrange	f42cf447e2	block: move trace probes into bdrv_co_preadv\|pwritev There are trace probes in bdrv_co_readv\|writev, however, the block drivers are being gradually moved over to using the bdrv_co_preadv\|pwritev functions instead. As a result some block drivers miss the current probes. Move the probes into bdrv_co_preadv\|pwritev instead, so that they are triggered by more (all?) I/O code paths. Signed-off-by: Daniel P. Berrange <berrange@redhat.com> Message-id: 20170804105036.11879-1-berrange@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-08-07 09:39:35 +01:00
Peter Maydell	718d7f4f9c	-----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJZbNpiAAoJEJykq7OBq3PIKcIIAMEJpEiWQonSJZVV4fxqcbOF dXJSVxHqrrVUrcM8NY2zGXIwcGS8RBNZG+Yx/SEZgIljoYH4NFmbvKXWS2zgHSyr LUH0M6gYlQ/1vQTXwQrkJdtmgfc3xNVrQbanbynK3+aB1S5Y6pRGauDo8SqCBWu0 uLWkhcSQbG+OHD8Go5X1kZUSdpP8yOqKrxcNLe980ghi4HPMUydL3lbs4SwNlnRt NJIpMTGzJrL+CqyakIL+/PT9RBGCo4hllPD0CgX6HETEkuojxxXaqJIG+Tzj2FeU fXkoK1YQHEHLdVXnPkpModoPylhRqIQcPXoXt+aMvoLSM+bLooYXMYryMPoPDGI= =VgCF -----END PGP SIGNATURE----- Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging # gpg: Signature made Mon 17 Jul 2017 16:40:18 BST # gpg: using RSA key 0x9CA4ABB381AB73C8 # gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" # gpg: aka "Stefan Hajnoczi <stefanha@gmail.com>" # Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35 775A 9CA4 ABB3 81AB 73C8 * remotes/stefanha/tags/block-pull-request: block: fix shadowed variable in bdrv_co_pdiscard util/aio-win32: Only select on what we are actually waiting for Signed-off-by: Peter Maydell <peter.maydell@linaro.org>	2017-07-18 13:09:51 +01:00
Denis V. Lunev	593ed6f0a3	block: fix shadowed variable in bdrv_co_pdiscard We've had a shadowed 'ret' variable, which risks returning the wrong value, introduced in commit `b9c64947`. Signed-off-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 20170710150559.30163-1-den@openvz.org CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Kevin Wolf <kwolf@redhat.com> CC: Eric Blake <eblake@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-07-17 15:58:37 +01:00
Paolo Bonzini	61124f03ab	block: invoke .bdrv_drain callback in coroutine context and from AioContext This will let the callback take a CoMutex in the next patch. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20170629132749.997-8-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-07-17 11:28:15 +08:00
Vladimir Sementsov-Ogievskiy	d6883bc968	block/dirty-bitmap: add readonly field to BdrvDirtyBitmap It will be needed in following commits for persistent bitmaps. If bitmap is loaded from read-only storage (and we can't mark it "in use" in this storage) corresponding BdrvDirtyBitmap should be read-only. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20170628120530.31251-11-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2017-07-11 17:44:58 +02:00
Eric Blake	51b0a48888	block: Make bdrv_is_allocated_above() byte-based We are gradually moving away from sector-based interfaces, towards byte-based. In the common case, allocation is unlikely to ever use values that are not naturally sector-aligned, but it is possible that byte-based values will let us be more precise about allocation at the end of an unaligned file that can do byte-based access. Changing the signature of the function to use int64_t *pnum ensures that the compiler enforces that all callers are updated. For now, the io.c layer still assert()s that all callers are sector-aligned, but that can be relaxed when a later patch implements byte-based block status. Therefore, for the most part this patch is just the addition of scaling at the callers followed by inverse scaling at bdrv_is_allocated(). But some code, particularly stream_run(), gets a lot simpler because it no longer has to mess with sectors. Leave comments where we can further simplify by switching to byte-based iterations, once later patches eliminate the need for sector-aligned operations. For ease of review, bdrv_is_allocated() was tackled separately. Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-07-10 13:18:07 +02:00
Eric Blake	c00716beb3	block: Minimize raw use of bds->total_sectors bdrv_is_allocated_above() was relying on intermediate->total_sectors, which is a field that can have stale contents depending on the value of intermediate->has_variable_length. An audit shows that we are safe (we were first calling through bdrv_co_get_block_status() which in turn calls bdrv_nb_sectors() and therefore just refreshed the current length), but it's nicer to favor our accessor functions to avoid having to repeat such an audit, even if it means refresh_total_sectors() is called more frequently. Suggested-by: John Snow <jsnow@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Manos Pitsidianakis <el13635@mail.ntua.gr> Reviewed-by: Jeff Cody <jcody@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-07-10 13:18:07 +02:00
Eric Blake	d6a644bbfe	block: Make bdrv_is_allocated() byte-based We are gradually moving away from sector-based interfaces, towards byte-based. In the common case, allocation is unlikely to ever use values that are not naturally sector-aligned, but it is possible that byte-based values will let us be more precise about allocation at the end of an unaligned file that can do byte-based access. Changing the signature of the function to use int64_t pnum ensures that the compiler enforces that all callers are updated. For now, the io.c layer still assert()s that all callers are sector-aligned on input and that pnum is sector-aligned on return to the caller, but that can be relaxed when a later patch implements byte-based block status. Therefore, this code adds usages like DIV_ROUND_UP(,BDRV_SECTOR_SIZE) to callers that still want aligned values, where the call might reasonbly give non-aligned results in the future; on the other hand, no rounding is needed for callers that should just continue to work with byte alignment. For the most part this patch is just the addition of scaling at the callers followed by inverse scaling at bdrv_is_allocated(). But some code, particularly bdrv_commit(), gets a lot simpler because it no longer has to mess with sectors; also, it is now possible to pass NULL if the caller does not care how much of the image is allocated beyond the initial offset. Leave comments where we can further simplify once a later patch eliminates the need for sector-aligned requests through bdrv_is_allocated(). For ease of review, bdrv_is_allocated_above() will be tackled separately. Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-07-10 13:18:07 +02:00
Eric Blake	e8a81e9cad	block: Drop unused bdrv_round_sectors_to_clusters() Now that the last user [mirror_iteration()] has converted to using bytes, we no longer need a function to round sectors to clusters. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-07-10 13:18:06 +02:00
Eric Blake	81c219ac6c	block: Guarantee that file is set on bdrv_get_block_status() We document that file is valid if the return is not an error and includes BDRV_BLOCK_OFFSET_VALID, but forgot to obey this contract when a driver (such as blkdebug) lacks a callback. Messed up in commit `67a0fd2` (v2.6), when we added the file parameter. Enhance qemu-iotest 177 to cover this, using a sequence that would print garbage or even SEGV, because it was dererefencing through uninitialized memory. [The resulting test output shows that we have less-than-ideal block status from the blkdebug driver, but that's a separate fix coming up soon.] Setting file on all paths that return BDRV_BLOCK_OFFSET_VALID is enough to fix the crash, but we can go one step further: always setting file, even on error, means that a broken caller that blindly dereferences file without checking for error is now more likely to get a reliable SEGV instead of randomly acting on garbage, making it easier to diagnose such buggy callers. Adding an assertion that file is set where expected doesn't hurt either. CC: qemu-stable@nongnu.org Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-07-10 13:18:05 +02:00
Eric Blake	c61e684e44	block: Exploit BDRV_BLOCK_EOF for larger zero blocks When we have a BDS with unallocated clusters, but asking the status of its underlying bs->file or backing layer encounters an end-of-file condition, we know that the rest of the unallocated area will read as zeroes. However, pre-patch, this required two separate calls to bdrv_get_block_status(), as the first call stops at the point where the underlying file ends. Thanks to BDRV_BLOCK_EOF, we can now widen the results of the primary status if the secondary status already includes BDRV_BLOCK_ZERO. In turn, this fixes a TODO mentioned in iotest 154, where we can now see that all sectors in a partial cluster at the end of a file read as zero when coupling the shorter backing file's status along with our knowledge that the remaining sectors came from an unallocated cluster. Also, note that the loop in bdrv_co_get_block_status_above() had an inefficent exit: in cases where the active layer sets BDRV_BLOCK_ZERO but does NOT set BDRV_BLOCK_ALLOCATED (namely, where we know we read zeroes merely because our unallocated clusters lie beyond the backing file's shorter length), we still ended up probing the backing layer even though we already had a good answer. Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20170505021500.19315-3-eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-06-30 21:48:06 +08:00
Eric Blake	fb0d8654ff	block: Add BDRV_BLOCK_EOF to bdrv_get_block_status() Just as the block layer already sets BDRV_BLOCK_ALLOCATED as a shortcut for subsequent operations, there are also some optimizations that are made easier if we can quickly tell that *pnum will advance us to the end of a file, via a new BDRV_BLOCK_EOF which gets set by the block layer. This just plumbs up the new bit; subsequent patches will make use of it. Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20170505021500.19315-2-eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-06-30 21:48:06 +08:00
Manos Pitsidianakis	f5a5ca7969	block: change variable names in BlockDriverState Change the 'int count' parameter in pwrite_zeros, pdiscard related functions (and some others) to 'int bytes', as they both refer to bytes. This helps with code legibility. Signed-off-by: Manos Pitsidianakis <el13635@mail.ntua.gr> Message-id: 20170609101808.13506-1-el13635@mail.ntua.gr Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>	2017-06-26 14:54:46 +02:00
Kevin Wolf	c5f1ad429c	block: Remove bdrv_aio_readv/writev/flush() These functions are unused now. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-06-26 14:51:15 +02:00
Stefan Hajnoczi	ea17c9d20d	block: use BDRV_POLL_WHILE() in bdrv_rw_vmstate() Calling aio_poll() directly may have been fine previously, but this is the future, man! The difference between an aio_poll() loop and BDRV_POLL_WHILE() is that BDRV_POLL_WHILE() releases the AioContext around aio_poll(). This allows the IOThread to run fd handlers or BHs to complete the request. Failure to release the AioContext causes deadlocks. Using BDRV_POLL_WHILE() partially fixes a 'savevm' hang with -object iothread. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-06-26 14:51:13 +02:00
Stefan Hajnoczi	dc88a467ec	block: count bdrv_co_rw_vmstate() requests Call bdrv_inc/dec_in_flight() for vmstate reads/writes. This seems unnecessary at first glance because vmstate reads/writes are done synchronously while the guest is stopped. But we need the bdrv_wakeup() in bdrv_dec_in_flight() so the main loop sees request completion. Besides, it's cleaner to count vmstate reads/writes like ordinary read/write requests. The bdrv_wakeup() partially fixes a 'savevm' hang with -object iothread. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-06-26 14:51:13 +02:00
Paolo Bonzini	3783fa3dd3	block: protect tracked_requests and flush_queue with reqs_lock Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20170605123908.18777-14-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-06-16 07:55:00 +08:00
Paolo Bonzini	47fec59941	block: access write_gen with atomics Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20170605123908.18777-13-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-06-16 07:55:00 +08:00
Paolo Bonzini	f7946da274	block: use Stat64 for wr_highest_offset Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20170605123908.18777-12-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-06-16 07:55:00 +08:00
Paolo Bonzini	850d54a2a9	block: access io_plugged with atomic ops Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20170605123908.18777-7-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-06-16 07:55:00 +08:00
Paolo Bonzini	e2a6ae7fe5	block: access wakeup with atomic ops Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20170605123908.18777-6-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-06-16 07:55:00 +08:00
Paolo Bonzini	20fc71b25c	block: access serialising_in_flight with atomic ops Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20170605123908.18777-5-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-06-16 07:55:00 +08:00
Paolo Bonzini	414c2ec358	block: access quiesce_counter with atomic ops Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20170605123908.18777-3-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-06-16 07:55:00 +08:00
Paolo Bonzini	d3faa13e5f	block: access copy_on_read with atomic ops Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20170605123908.18777-2-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-06-16 07:55:00 +08:00
Paolo Bonzini	f321dcb57f	blockjob: introduce block_job_pause/resume_all Remove use of block_job_pause/resume from outside blockjob.c, thus making them static. The new functions are used by the block layer, so place them in blockjob_int.h. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Message-id: 20170508141310.8674-5-pbonzini@redhat.com Signed-off-by: Jeff Cody <jcody@redhat.com>	2017-05-24 16:38:51 -04:00
Eric Blake	ee29d6adef	block: Simplify BDRV_BLOCK_RAW recursion Since we are already in coroutine context during the body of bdrv_co_get_block_status(), we can shave off a few layers of wrappers when recursing to query the protocol when a format driver returned BDRV_BLOCK_RAW. Note that we are already using the correct recursion later on in the same function, when probing whether the protocol layer is sparse in order to find out if we can add BDRV_BLOCK_ZERO to an existing BDRV_BLOCK_DATA\|BDRV_BLOCK_OFFSET_VALID. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20170504173745.27414-1-eblake@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-05-12 10:36:46 -04:00
Denis V. Lunev	f13ce1be35	block: fix alignment calculations in bdrv_co_do_zero_pwritev tail_padding_bytes is calculated wrong. F.e. for offset = 0 bytes = 2048 align = 512 we will have tail_padding_bytes = 512 which is definitely wrong. The patch fixes that arithmetics. Fortunately this problem is harmless, we will have 1 extra allocation and free thus there is no need to put this into stable. The problem is here from the very beginning. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Fam Zheng <famz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-04-27 16:24:01 +02:00
Fam Zheng	e914404efb	block: Remove NULL check in bdrv_co_flush Reported by Coverity. We already use bs in bdrv_inc_in_flight before checking for NULL. It is unnecessary as all callers pass non-NULL bs, so drop it. Signed-off-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-04-27 15:39:50 +02:00
Max Reitz	362b3786eb	Revert "block/io: Comment out permission assertions" This reverts commit `e3e0003a8f`. This commit was necessary for the 2.9 release because we were unable to fix the underlying issue(s) in time. However, we will be for 2.10. Signed-off-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2017-04-27 15:39:49 +02:00
Fam Zheng	178bd438af	block: Walk bs->children carefully in bdrv_drain_recurse The recursive bdrv_drain_recurse may run a block job completion BH that drops nodes. The coming changes will make that more likely and use-after-free would happen without this patch Stash the bs pointer and use bdrv_ref/bdrv_unref in addition to QLIST_FOREACH_SAFE to prevent such a case from happening. Since bdrv_unref accesses global state that is not protected by the AioContext lock, we cannot use bdrv_ref/bdrv_unref unconditionally. Fortunately the protection is not needed in IOThread because only main loop can modify a graph with the AioContext lock held. Signed-off-by: Fam Zheng <famz@redhat.com> Message-Id: <20170418143044.12187-2-famz@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Tested-by: Jeff Cody <jcody@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2017-04-18 22:56:28 +08:00
Max Reitz	e3e0003a8f	block/io: Comment out permission assertions In case of block migration, there may be writes to BlockBackends that do not have the write permission taken. Before this issue is fixed (which is not going to happen in 2.9), we therefore cannot assert that this is the case. Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Tested-by: Kevin Wolf <kwolf@redhat.com> Message-id: 20170411145050.31290-1-mreitz@redhat.com Tested-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>	2017-04-11 16:09:31 +01:00
Fam Zheng	49ca625913	block: Fix bdrv_co_flush early return bdrv_inc_in_flight and bdrv_dec_in_flight are mandatory for BDRV_POLL_WHILE to work, even for the shortcut case where flush is unnecessary. Move the if block to below bdrv_dec_in_flight, and BTW fix the variable declaration position. Signed-off-by: Fam Zheng <famz@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>	2017-04-11 20:07:15 +08:00
Fam Zheng	e92f0e1910	block: Use bdrv_coroutine_enter to start I/O coroutines BDRV_POLL_WHILE waits for the started I/O by releasing bs's ctx then polling the main context, which relies on the yielded coroutine continuing on bs->ctx before notifying qemu_aio_context with bdrv_wakeup(). Thus, using qemu_coroutine_enter to start I/O is wrong because if the coroutine is entered from main loop, co->ctx will be qemu_aio_context, as a result of the "release, poll, acquire" loop of BDRV_POLL_WHILE, race conditions happen when both main thread and the iothread access the same BDS: main loop iothread ----------------------------------------------------------------------- blockdev_snapshot aio_context_acquire(bs->ctx) virtio_scsi_data_plane_handle_cmd bdrv_drained_begin(bs->ctx) bdrv_flush(bs) bdrv_co_flush(bs) aio_context_acquire(bs->ctx).enter ... qemu_coroutine_yield(co) BDRV_POLL_WHILE() aio_context_release(bs->ctx) aio_context_acquire(bs->ctx).return ... aio_co_wake(co) aio_poll(qemu_aio_context) ... co_schedule_bh_cb() ... qemu_coroutine_enter(co) ... /* (A) bdrv_co_flush(bs) /* (B) I/O on bs / continues... / aio_context_release(bs->ctx) aio_context_acquire(bs->ctx) Note that in above case, bdrv_drained_begin() doesn't do the "release, poll, acquire" in BDRV_POLL_WHILE, because bs->in_flight == 0. Fix this by using bdrv_coroutine_enter and enter coroutine in the right context. iotests 109 output is updated because the coroutine reenter flow during mirror job complete is different (now through co_queue_wakeup, instead of the unconditional qemu_coroutine_switch before), making the end job len different. Signed-off-by: Fam Zheng <famz@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com>	2017-04-11 20:07:15 +08:00
Fam Zheng	14e9559f46	block: Make bdrv_parent_drained_begin/end public Signed-off-by: Fam Zheng <famz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com>	2017-04-11 20:07:15 +08:00
Kevin Wolf	1bf03e66fd	block: Don't check permissions for copy on read The assertion is currently failing. We can't require callers to have write permissions when all they are doing is a read, so comment it out. Add a FIXME comment in the code so that the check is re-enabled when copy on read is refactored into its own filter driver. Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>	2017-04-07 14:44:06 +02:00
Kevin Wolf	b64aa44195	block: Request block status from *file for BDRV_BLOCK_RAW This fixes bdrv_co_get_block_status() for the bdrv_mirror_top block driver, which must fall through to bs->backing instead of bs->file. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2017-03-13 12:49:33 +01:00
Kevin Wolf	c8f6d58edb	block: Assertions for resize permission This adds an assertion that ensures that the necessary resize permission has been granted before bdrv_truncate() is called. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>	2017-02-28 20:47:50 +01:00
Kevin Wolf	afa4b29323	block: Assertions for write permissions This adds assertions that ensure that the necessary write permissions have been granted before someone attempts to write to a node. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>	2017-02-28 20:47:50 +01:00
Kevin Wolf	85c97ca7a1	block: Pass BdrvChild to bdrv_aligned_preadv/pwritev and copy-on-read This is where we want to check the permissions, so we need to have the BdrvChild around where they are stored. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>	2017-02-28 20:47:50 +01:00
Paolo Bonzini	1ace7ceac5	coroutine-lock: add mutex argument to CoQueue APIs All that CoQueue needs in order to become thread-safe is help from an external mutex. Add this to the API. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20170213181244.16297-6-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-02-21 11:39:40 +00:00
Paolo Bonzini	b9e413dd37	block: explicitly acquire aiocontext in aio callbacks that need it Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Daniel P. Berrange <berrange@redhat.com> Message-id: 20170213135235.12274-16-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-02-21 11:39:39 +00:00
Paolo Bonzini	1919631e6b	block: explicitly acquire aiocontext in bottom halves that need it Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Daniel P. Berrange <berrange@redhat.com> Message-id: 20170213135235.12274-15-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-02-21 11:39:39 +00:00
Paolo Bonzini	2f47da5f7f	block: explicitly acquire aiocontext in timers that need it Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Daniel P. Berrange <berrange@redhat.com> Message-id: 20170213135235.12274-13-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-02-21 11:14:08 +00:00
Paolo Bonzini	c2b38b277a	block: move AioContext, QEMUTimer, main-loop to libqemuutil AioContext is fairly self contained, the only dependency is QEMUTimer but that in turn doesn't need anything else. So move them out of block-obj-y to avoid introducing a dependency from io/ to block-obj-y. main-loop and its dependency iohandler also need to be moved, because later in this series io/ will call iohandler_get_aio_context. [Changed copyright "the QEMU team" to "other QEMU contributors" as suggested by Daniel Berrange and agreed by Paolo. --Stefan] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20170213135235.12274-2-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-02-21 11:14:07 +00:00
Paolo Bonzini	8f90b5e91d	block: get rid of bdrv_io_unplugged_begin/end bdrv_io_plug and bdrv_io_unplug are only called (via their BlockBackend equivalents) after starting asynchronous I/O. bdrv_drain is not going to be called while they are running, because---even if a coroutine runs for some reason---it will only drain in the next iteration of the event loop through bdrv_co_yield_to_drain. So this mechanism is unnecessary, get rid of it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-id: 20161129113334.605-1-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-01-16 13:25:17 +00:00
Eric Blake	3482b9bc41	block: Pass unaligned discard requests to drivers Discard is advisory, so rounding the requests to alignment boundaries is never semantically wrong from the data that the guest sees. But at least the Dell Equallogic iSCSI SANs has an interesting property that its advertised discard alignment is 15M, yet documents that discarding a sequence of 1M slices will eventually result in the 15M page being marked as discarded, and it is possible to observe which pages have been discarded. Between commits `9f1963b` and `b8d0a980`, we converted the block layer to a byte-based interface that ultimately ignores any unaligned head or tail based on the driver's advertised discard granularity, which means that qemu 2.7 refuses to pass any discard request smaller than 15M down to the Dell Equallogic hardware. This is a slight regression in behavior compared to earlier qemu, where a guest executing discards in power-of-2 chunks used to be able to get every page discarded, but is now left with various pages still allocated because the guest requests did not align with the hardware's 15M pages. Since the SCSI specification says nothing about a minimum discard granularity, and only documents the preferred alignment, it is best if the block layer gives the driver every bit of information about discard requests, rather than rounding it to alignment boundaries early. Rework the block layer discard algorithm to mirror the write zero algorithm: always peel off any unaligned head or tail and manage that in isolation, then do the bulk of the request on an aligned boundary. The fallback when the driver returns -ENOTSUP for an unaligned request is to silently ignore that portion of the discard request; but for devices that can pass the partial request all the way down to hardware, this can result in the hardware coalescing requests and discarding aligned pages after all. Reported by: Peter Lieven <pl@kamp.de> CC: qemu-stable@nongnu.org Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2016-11-22 15:59:23 +01:00
Eric Blake	b2f95feec5	block: Let write zeroes fallback work even with small max_transfer Commit `443668ca` rewrote the write_zeroes logic to guarantee that an unaligned request never crosses a cluster boundary. But in the rewrite, the new code assumed that at most one iteration would be needed to get to an alignment boundary. However, it is easy to trigger an assertion failure: the Linux kernel limits loopback devices to advertise a max_transfer of only 64k. Any operation that requires falling back to writes rather than more efficient zeroing must obey max_transfer during that fallback, which means an unaligned head may require multiple iterations of the write fallbacks before reaching the aligned boundaries, when layering a format with clusters larger than 64k atop the protocol of file access to a loopback device. Test case: $ qemu-img create -f qcow2 -o cluster_size=1M file 10M $ losetup /dev/loop2 /path/to/file $ qemu-io -f qcow2 /dev/loop2 qemu-io> w 7m 1k qemu-io> w -z 8003584 2093056 In fairness to Denis (as the original listed author of the culprit commit), the faulty logic for at most one iteration is probably all my fault in reworking his idea. But the solution is to restore what was in place prior to that commit: when dealing with an unaligned head or tail, iterate as many times as necessary while fragmenting the operation at max_transfer boundaries. Reported-by: Ed Swierk <eswierk@skyportsystems.com> CC: qemu-stable@nongnu.org CC: Denis V. Lunev <den@openvz.org> Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2016-11-22 15:59:22 +01:00
Kevin Wolf	e6af1e0854	block: Don't mark node clean after failed flush Commit `3ff2f67a` changed bdrv_co_flush() so that no flush is issues if the image hasn't been dirtied since the last flush. This is not quite correct: The condition should be that the image hasn't been dirtied since the last _successful_ flush. This patch changes the logic accordingly. Without this fix, subsequent bdrv_co_flush() calls would return success without actually doing anything even though the image is still dirty. The difference is visible in some blkdebug test cases where error messages incorrectly disappeared after commit `3ff2f67a`. Cc: qemu-stable@nongnu.org Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Message-id: 1478300595-10090-1-git-send-email-kwolf@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2016-11-08 16:06:35 +00:00
Alberto Garcia	c0778f6693	block: Add bdrv_drain_all_{begin,end}() bdrv_drain_all() doesn't allow the caller to do anything after all pending requests have been completed but before block jobs are resumed. This patch splits bdrv_drain_all() into _begin() and _end() for that purpose. It also adds aio_{disable,enable}_external() calls to disable external clients in the meantime. An important restriction of this split is that no new block jobs or BlockDriverStates can be created between the bdrv_drain_all_begin() and bdrv_drain_all_end() calls. This is not a concern now because we'll only be using this in bdrv_reopen_multiple(), but it must be dealt with if we ever have other uses cases in the future. Signed-off-by: Alberto Garcia <berto@igalia.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2016-10-31 16:51:14 +01:00
Paolo Bonzini	c9d1a56174	block: only call aio_poll on the current thread's AioContext aio_poll is not thread safe; for example bdrv_drain can hang if the last in-flight I/O operation is completed in the I/O thread after the main thread has checked bs->in_flight. The bug remains latent as long as all of it is called within aio_context_acquire/aio_context_release, but this will change soon. To fix this, if bdrv_drain is called from outside the I/O thread, signal the main AioContext through a dummy bottom half. The event loop then only runs in the I/O thread. Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <1477565348-5458-18-git-send-email-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2016-10-28 21:50:18 +08:00
Paolo Bonzini	88b062c203	block: introduce BDRV_POLL_WHILE We want the BDS event loop to run exclusively in the iothread that owns the BDS's AioContext. This macro will provide the synchronization between the two event loops; for now it just wraps the common idiom of a while loop around aio_poll. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-Id: <1477565348-5458-8-git-send-email-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2016-10-28 21:50:18 +08:00
Paolo Bonzini	d42cf28837	block: change drain to look only at one child at a time bdrv_requests_pending is checking children to also wait until internal requests (such as metadata writes) have completed. However, checking children is in general overkill. Children requests can be of two kinds: - requests caused by an operation on bs, e.g. a bdrv_aio_write to bs causing a write to bs->file->bs. In this case, the parent's in_flight count will always be incremented by at least one for every request in the child. - asynchronous metadata writes or flushes. Such writes can be started even if bs's in_flight count is zero, but not after the .bdrv_drain callback has been invoked. This patch therefore changes bdrv_drain to finish I/O in the parent (after which the parent's in_flight will be locked to zero), call bdrv_drain (after which the parent will not generate I/O on the child anymore), and then wait for internal I/O in the children to complete. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-Id: <1477565348-5458-6-git-send-email-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2016-10-28 21:50:18 +08:00
Paolo Bonzini	9972354856	block: add BDS field to count in-flight requests Unlike tracked_requests, this field also counts throttled requests, and remains non-zero if an AIO operation needs a BH to be "really" completed. With this change, it is no longer necessary to have a dummy BdrvTrackedRequest for requests that are never serialising, and it is no longer necessary to poll the AioContext once after bdrv_requests_pending(bs) returns false. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-Id: <1477565348-5458-5-git-send-email-pbonzini@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com>	2016-10-28 21:50:18 +08:00
Kevin Wolf	cbc14ac9c3	block: Remove bdrv_aio_ioctl() It is unused now. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2016-10-27 19:05:23 +02:00
Kevin Wolf	16a389dc9e	block: Introduce .bdrv_co_ioctl() driver callback This allows drivers to implement ioctls in a coroutine-based way. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2016-10-27 19:05:23 +02:00
Kevin Wolf	61b2450414	block: Remove bdrv_ioctl() It is unused now. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2016-10-27 19:05:23 +02:00
Kevin Wolf	48af776a5b	block: Use blk_co_ioctl() for all BB level ioctls All read/write functions already have a single coroutine-based function on the BlockBackend level through which all requests go (no matter what API style the external caller used) and which passes the requests down to the block node level. This patch exports a bdrv_co_ioctl() function and uses it to extend this mode of operation to ioctls. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2016-10-27 19:05:22 +02:00
Kevin Wolf	7381e95cc2	block: Remove bdrv_aio_pdiscard() It is unused now. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2016-10-27 19:05:22 +02:00
Paolo Bonzini	fffb6e1223	block: use aio_bh_schedule_oneshot This simplifies bottom half handlers by removing calls to qemu_bh_delete and thus removing the need to stash the bottom half pointer in the opaque datum. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2016-10-07 13:34:07 +02:00
John Snow	4085f5c7a2	block: reintroduce bdrv_flush_all Commit `fe1a9cbc` moved the flush_all routine from the bdrv layer to the block-backend layer. In doing so, however, the semantics of the routine changed slightly such that flush_all now used blk_flush instead of bdrv_flush. blk_flush can fail if the attached device model reports that it is not "available," (i.e. the tray is open.) This changed the semantics of flush_all such that it can now fail for e.g. open CDROM drives. Reintroduce bdrv_flush_all to regain the old semantics without having to alter the behavior of blk_flush or blk_flush_all, which are already 'doing the right thing.' Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2016-09-29 14:13:13 +02:00
Pavel Butsykin	3ea1a09111	block/io: turn on dirty_bitmaps for the compressed writes Previously was added the assert: commit `1755da16e3` Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 18 16:49:18 2012 +0200 block: introduce new dirty bitmap functionality Now the compressed write is always in coroutine and setting the bits is done after the write, so that we can return the dirty_bitmaps for the compressed writes. Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Jeff Cody <jcody@redhat.com> CC: Markus Armbruster <armbru@redhat.com> CC: Eric Blake <eblake@redhat.com> CC: John Snow <jsnow@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2016-09-05 19:06:48 +02:00
Pavel Butsykin	35fadca80e	block: remove BlockDriver.bdrv_write_compressed There are no block drivers left that implement the old .bdrv_write_compressed interface, so it can be removed. Also now we have no need to use the bdrv_pwrite_compressed function and we can remove it entirely. Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Jeff Cody <jcody@redhat.com> CC: Markus Armbruster <armbru@redhat.com> CC: Eric Blake <eblake@redhat.com> CC: John Snow <jsnow@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2016-09-05 19:06:48 +02:00

1 2 3 4 5 ...

291 Commits