2004-08-02 01:59:26 +04:00
|
|
|
/*
|
|
|
|
* QEMU System Emulator block driver
|
2007-09-17 01:08:06 +04:00
|
|
|
*
|
2004-08-02 01:59:26 +04:00
|
|
|
* Copyright (c) 2003 Fabrice Bellard
|
2007-09-17 01:08:06 +04:00
|
|
|
*
|
2004-08-02 01:59:26 +04:00
|
|
|
* Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
|
|
* of this software and associated documentation files (the "Software"), to deal
|
|
|
|
* in the Software without restriction, including without limitation the rights
|
|
|
|
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
|
|
* copies of the Software, and to permit persons to whom the Software is
|
|
|
|
* furnished to do so, subject to the following conditions:
|
|
|
|
*
|
|
|
|
* The above copyright notice and this permission notice shall be included in
|
|
|
|
* all copies or substantial portions of the Software.
|
|
|
|
*
|
|
|
|
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
|
|
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
|
|
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
|
|
|
|
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
|
|
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
|
|
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
|
|
|
* THE SOFTWARE.
|
|
|
|
*/
|
|
|
|
#ifndef BLOCK_INT_H
|
|
|
|
#define BLOCK_INT_H
|
|
|
|
|
2014-09-05 17:46:16 +04:00
|
|
|
#include "block/accounting.h"
|
2012-12-17 21:19:44 +04:00
|
|
|
#include "block/block.h"
|
2018-02-16 19:50:12 +03:00
|
|
|
#include "block/aio-wait.h"
|
2012-12-17 21:20:00 +04:00
|
|
|
#include "qemu/queue.h"
|
2015-09-01 16:48:02 +03:00
|
|
|
#include "qemu/coroutine.h"
|
2017-06-05 15:39:00 +03:00
|
|
|
#include "qemu/stats64.h"
|
2012-12-17 21:20:00 +04:00
|
|
|
#include "qemu/timer.h"
|
2013-01-21 20:09:41 +04:00
|
|
|
#include "qemu/hbitmap.h"
|
2013-05-25 07:09:44 +04:00
|
|
|
#include "block/snapshot.h"
|
2013-09-02 16:14:39 +04:00
|
|
|
#include "qemu/throttle.h"
|
block: block-status cache for data regions
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform. The
main difference is that this time it is implemented as part of the
general block layer code.
The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the
implementation is outside of qemu, there is little we can do about its
performance.
We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts. So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.
To address this, we want to cache data regions. Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.
(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine. While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)
We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210812084148.14458-3-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[hreitz: Added `local_file == bs` assertion, as suggested by Vladimir]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-08-12 11:41:44 +03:00
|
|
|
#include "qemu/rcu.h"
|
2007-11-11 05:51:17 +03:00
|
|
|
|
2012-07-27 12:05:22 +04:00
|
|
|
#define BLOCK_FLAG_LAZY_REFCOUNTS 8
|
2007-09-17 01:59:02 +04:00
|
|
|
|
2012-07-27 12:05:22 +04:00
|
|
|
#define BLOCK_OPT_SIZE "size"
|
|
|
|
#define BLOCK_OPT_ENCRYPT "encryption"
|
2017-06-23 19:24:06 +03:00
|
|
|
#define BLOCK_OPT_ENCRYPT_FORMAT "encrypt.format"
|
2012-07-27 12:05:22 +04:00
|
|
|
#define BLOCK_OPT_COMPAT6 "compat6"
|
2016-05-03 12:43:30 +03:00
|
|
|
#define BLOCK_OPT_HWVERSION "hwversion"
|
2012-07-27 12:05:22 +04:00
|
|
|
#define BLOCK_OPT_BACKING_FILE "backing_file"
|
|
|
|
#define BLOCK_OPT_BACKING_FMT "backing_fmt"
|
|
|
|
#define BLOCK_OPT_CLUSTER_SIZE "cluster_size"
|
|
|
|
#define BLOCK_OPT_TABLE_SIZE "table_size"
|
|
|
|
#define BLOCK_OPT_PREALLOC "preallocation"
|
|
|
|
#define BLOCK_OPT_SUBFMT "subformat"
|
|
|
|
#define BLOCK_OPT_COMPAT_LEVEL "compat"
|
|
|
|
#define BLOCK_OPT_LAZY_REFCOUNTS "lazy_refcounts"
|
2013-01-30 03:26:52 +04:00
|
|
|
#define BLOCK_OPT_ADAPTER_TYPE "adapter_type"
|
2013-11-07 18:56:38 +04:00
|
|
|
#define BLOCK_OPT_REDUNDANCY "redundancy"
|
qemu-img create: add 'nocow' option
Add 'nocow' option so that users could have a chance to set NOCOW flag to
newly created files. It's useful on btrfs file system to enhance performance.
Btrfs has low performance when hosting VM images, even more when the guest
in those VM are also using btrfs as file system. One way to mitigate this bad
performance is to turn off COW attributes on VM files. Generally, there are
two ways to turn off NOCOW on btrfs: a) by mounting fs with nodatacow, then
all newly created files will be NOCOW. b) per file. Add the NOCOW file
attribute. It could only be done to empty or new files.
This patch tries the second way, according to the option, it could add NOCOW
per file.
For most block drivers, since the create file step is in raw-posix.c, so we
can do setting NOCOW flag ioctl in raw-posix.c only.
But there are some exceptions, like block/vpc.c and block/vdi.c, they are
creating file by calling qemu_open directly. For them, do the same setting
NOCOW flag ioctl work in them separately.
[Fixed up 082.out due to the new 'nocow' creation option
--Stefan]
Signed-off-by: Chunyan Liu <cyliu@suse.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2014-06-30 10:29:58 +04:00
|
|
|
#define BLOCK_OPT_NOCOW "nocow"
|
file-posix: Mitigate file fragmentation with extent size hints
Especially when O_DIRECT is used with image files so that the page cache
indirection can't cause a merge of allocating requests, the file will
fragment on the file system layer, with a potentially very small
fragment size (this depends on the requests the guest sent).
On Linux, fragmentation can be reduced by setting an extent size hint
when creating the file (at least on XFS, it can't be set any more after
the first extent has been allocated), basically giving raw files a
"cluster size" for allocation.
This adds a create option to set the extent size hint, and changes the
default from not setting a hint to setting it to 1 MB. The main reason
why qcow2 defaults to smaller cluster sizes is that COW becomes more
expensive, which is not an issue with raw files, so we can choose a
larger size. The tradeoff here is only potentially wasted disk space.
For qcow2 (or other image formats) over file-posix, the advantage should
even be greater because they grow sequentially without leaving holes, so
there won't be wasted space. Setting even larger extent size hints for
such images may make sense. This can be done with the new option, but
let's keep the default conservative for now.
The effect is very visible with a test that intentionally creates a
badly fragmented file with qemu-img bench (the time difference while
creating the file is already remarkable) and then looks at the number of
extents and the time a simple "qemu-img map" takes.
Without an extent size hint:
$ ./qemu-img create -f raw -o extent_size_hint=0 ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=0
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 25.848 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 19.616 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 2000000 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m1,279s
user 0m0,043s
sys 0m1,226s
With the new default extent size hint of 1 MB:
$ ./qemu-img create -f raw -o extent_size_hint=1M ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=1048576
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 11.833 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 10.155 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 178 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m0,061s
user 0m0,040s
sys 0m0,014s
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200707142329.48303-1-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-07-07 17:23:29 +03:00
|
|
|
#define BLOCK_OPT_EXTENT_SIZE_HINT "extent_size_hint"
|
2015-02-13 12:20:53 +03:00
|
|
|
#define BLOCK_OPT_OBJECT_SIZE "object_size"
|
2015-02-19 01:40:49 +03:00
|
|
|
#define BLOCK_OPT_REFCOUNT_BITS "refcount_bits"
|
2019-01-14 18:57:27 +03:00
|
|
|
#define BLOCK_OPT_DATA_FILE "data_file"
|
2019-02-22 16:29:38 +03:00
|
|
|
#define BLOCK_OPT_DATA_FILE_RAW "data_file_raw"
|
qcow2: introduce compression type feature
The patch adds some preparation parts for incompatible compression type
feature to qcow2 allowing the use different compression methods for
image clusters (de)compressing.
It is implied that the compression type is set on the image creation and
can be changed only later by image conversion, thus compression type
defines the only compression algorithm used for the image, and thus,
for all image clusters.
The goal of the feature is to add support of other compression methods
to qcow2. For example, ZSTD which is more effective on compression than ZLIB.
The default compression is ZLIB. Images created with ZLIB compression type
are backward compatible with older qemu versions.
Adding of the compression type breaks a number of tests because now the
compression type is reported on image creation and there are some changes
in the qcow2 header in size and offsets.
The tests are fixed in the following ways:
* filter out compression_type for many tests
* fix header size, feature table size and backing file offset
affected tests: 031, 036, 061, 080
header_size +=8: 1 byte compression type
7 bytes padding
feature_table += 48: incompatible feature compression type
backing_file_offset += 56 (8 + 48 -> header_change + feature_table_change)
* add "compression type" for test output matching when it isn't filtered
affected tests: 049, 060, 061, 065, 082, 085, 144, 182, 185, 198, 206,
242, 255, 274, 280
Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
QAPI part:
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20200507082521.29210-2-dplotnikov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-05-07 11:25:18 +03:00
|
|
|
#define BLOCK_OPT_COMPRESSION_TYPE "compression_type"
|
2020-07-10 19:13:13 +03:00
|
|
|
#define BLOCK_OPT_EXTL2 "extended_l2"
|
2009-05-18 18:42:10 +04:00
|
|
|
|
2014-11-20 18:27:11 +03:00
|
|
|
#define BLOCK_PROBE_BUF_SIZE 512
|
|
|
|
|
2015-11-09 13:16:46 +03:00
|
|
|
enum BdrvTrackedRequestType {
|
|
|
|
BDRV_TRACKED_READ,
|
|
|
|
BDRV_TRACKED_WRITE,
|
|
|
|
BDRV_TRACKED_DISCARD,
|
2018-06-26 15:23:23 +03:00
|
|
|
BDRV_TRACKED_TRUNCATE,
|
2015-11-09 13:16:46 +03:00
|
|
|
};
|
|
|
|
|
block: introduce BDRV_MAX_LENGTH
We are going to modify block layer to work with 64bit requests. And
first step is moving to int64_t type for both offset and bytes
arguments in all block request related functions.
It's mostly safe (when widening signed or unsigned int to int64_t), but
switching from uint64_t is questionable.
So, let's first establish the set of requests we want to work with.
First signed int64_t should be enough, as off_t is signed anyway. Then,
obviously offset + bytes should not overflow.
And most interesting: (offset + bytes) being aligned up should not
overflow as well. Aligned to what alignment? First thing that comes in
mind is bs->bl.request_alignment, as we align up request to this
alignment. But there is another thing: look at
bdrv_mark_request_serialising(). It aligns request up to some given
alignment. And this parameter may be bdrv_get_cluster_size(), which is
often a lot greater than bs->bl.request_alignment.
Note also, that bdrv_mark_request_serialising() uses signed int64_t for
calculations. So, actually, we already depend on some restrictions.
Happily, bdrv_get_cluster_size() returns int and
bs->bl.request_alignment has 32bit unsigned type, but defined to be a
power of 2 less than INT_MAX. So, we may establish, that INT_MAX is
absolute maximum for any kind of alignment that may occur with the
request.
Note, that bdrv_get_cluster_size() is not documented to return power
of 2, still bdrv_mark_request_serialising() behaves like it is.
Also, backup uses bdi.cluster_size and is not prepared to it not being
power of 2.
So, let's establish that Qemu supports only power-of-2 clusters and
alignments.
So, alignment can't be greater than 2^30.
Finally to be safe with calculations, to not calculate different
maximums for different nodes (depending on cluster size and
request_alignment), let's simply set QEMU_ALIGN_DOWN(INT64_MAX, 2^30)
as absolute maximum bytes length for Qemu. Actually, it's not much less
than INT64_MAX.
OK, then, let's apply it to block/io.
Let's consider all block/io entry points of offset/bytes:
4 bytes/offset interface functions: bdrv_co_preadv_part(),
bdrv_co_pwritev_part(), bdrv_co_copy_range_internal() and
bdrv_co_pdiscard() and we check them all with bdrv_check_request().
We also have one entry point with only offset: bdrv_co_truncate().
Check the offset.
And one public structure: BdrvTrackedRequest. Happily, it has only
three external users:
file-posix.c: adopted by this patch
write-threshold.c: only read fields
test-write-threshold.c: sets obviously small constant values
Better is to make the structure private and add corresponding
interfaces.. Still it's not obvious what kind of interface is needed
for file-posix.c. Let's keep it public but add corresponding
assertions.
After this patch we'll convert functions in block/io.c to int64_t bytes
and offset parameters. We can assume that offset/bytes pair always
satisfy new restrictions, and make
corresponding assertions where needed. If we reach some offset/bytes
point in block/io.c missing bdrv_check_request() it is considered a
bug. As well, if block/io.c modifies a offset/bytes request, expanding
it more then aligning up to request_alignment, it's a bug too.
For all io requests except for discard we keep for now old restriction
of 32bit request length.
iotest 206 output error message changed, as now test disk size is
larger than new limit. Add one more test case with new maximum disk
size to cover too-big-L1 case.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201203222713.13507-5-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-12-04 01:27:13 +03:00
|
|
|
/*
|
|
|
|
* That is not quite good that BdrvTrackedRequest structure is public,
|
|
|
|
* as block/io.c is very careful about incoming offset/bytes being
|
|
|
|
* correct. Be sure to assert bdrv_check_request() succeeded after any
|
|
|
|
* modification of BdrvTrackedRequest object out of block/io.c
|
|
|
|
*/
|
2013-06-24 19:13:10 +04:00
|
|
|
typedef struct BdrvTrackedRequest {
|
|
|
|
BlockDriverState *bs;
|
2013-12-03 18:31:25 +04:00
|
|
|
int64_t offset;
|
2021-02-03 17:14:15 +03:00
|
|
|
int64_t bytes;
|
2015-11-09 13:16:46 +03:00
|
|
|
enum BdrvTrackedRequestType type;
|
2013-12-04 20:08:50 +04:00
|
|
|
|
2013-12-04 19:43:44 +04:00
|
|
|
bool serialising;
|
2013-12-04 20:08:50 +04:00
|
|
|
int64_t overlap_offset;
|
2021-02-03 17:14:15 +03:00
|
|
|
int64_t overlap_bytes;
|
2013-12-04 20:08:50 +04:00
|
|
|
|
2013-06-24 19:13:10 +04:00
|
|
|
QLIST_ENTRY(BdrvTrackedRequest) list;
|
|
|
|
Coroutine *co; /* owner, used for deadlock detection */
|
|
|
|
CoQueue wait_queue; /* coroutines blocked on this request */
|
2013-12-13 16:04:35 +04:00
|
|
|
|
|
|
|
struct BdrvTrackedRequest *waiting_for;
|
2013-06-24 19:13:10 +04:00
|
|
|
} BdrvTrackedRequest;
|
|
|
|
|
2020-12-11 21:39:19 +03:00
|
|
|
int bdrv_check_request(int64_t offset, int64_t bytes, Error **errp);
|
block: introduce BDRV_MAX_LENGTH
We are going to modify block layer to work with 64bit requests. And
first step is moving to int64_t type for both offset and bytes
arguments in all block request related functions.
It's mostly safe (when widening signed or unsigned int to int64_t), but
switching from uint64_t is questionable.
So, let's first establish the set of requests we want to work with.
First signed int64_t should be enough, as off_t is signed anyway. Then,
obviously offset + bytes should not overflow.
And most interesting: (offset + bytes) being aligned up should not
overflow as well. Aligned to what alignment? First thing that comes in
mind is bs->bl.request_alignment, as we align up request to this
alignment. But there is another thing: look at
bdrv_mark_request_serialising(). It aligns request up to some given
alignment. And this parameter may be bdrv_get_cluster_size(), which is
often a lot greater than bs->bl.request_alignment.
Note also, that bdrv_mark_request_serialising() uses signed int64_t for
calculations. So, actually, we already depend on some restrictions.
Happily, bdrv_get_cluster_size() returns int and
bs->bl.request_alignment has 32bit unsigned type, but defined to be a
power of 2 less than INT_MAX. So, we may establish, that INT_MAX is
absolute maximum for any kind of alignment that may occur with the
request.
Note, that bdrv_get_cluster_size() is not documented to return power
of 2, still bdrv_mark_request_serialising() behaves like it is.
Also, backup uses bdi.cluster_size and is not prepared to it not being
power of 2.
So, let's establish that Qemu supports only power-of-2 clusters and
alignments.
So, alignment can't be greater than 2^30.
Finally to be safe with calculations, to not calculate different
maximums for different nodes (depending on cluster size and
request_alignment), let's simply set QEMU_ALIGN_DOWN(INT64_MAX, 2^30)
as absolute maximum bytes length for Qemu. Actually, it's not much less
than INT64_MAX.
OK, then, let's apply it to block/io.
Let's consider all block/io entry points of offset/bytes:
4 bytes/offset interface functions: bdrv_co_preadv_part(),
bdrv_co_pwritev_part(), bdrv_co_copy_range_internal() and
bdrv_co_pdiscard() and we check them all with bdrv_check_request().
We also have one entry point with only offset: bdrv_co_truncate().
Check the offset.
And one public structure: BdrvTrackedRequest. Happily, it has only
three external users:
file-posix.c: adopted by this patch
write-threshold.c: only read fields
test-write-threshold.c: sets obviously small constant values
Better is to make the structure private and add corresponding
interfaces.. Still it's not obvious what kind of interface is needed
for file-posix.c. Let's keep it public but add corresponding
assertions.
After this patch we'll convert functions in block/io.c to int64_t bytes
and offset parameters. We can assume that offset/bytes pair always
satisfy new restrictions, and make
corresponding assertions where needed. If we reach some offset/bytes
point in block/io.c missing bdrv_check_request() it is considered a
bug. As well, if block/io.c modifies a offset/bytes request, expanding
it more then aligning up to request_alignment, it's a bug too.
For all io requests except for discard we keep for now old restriction
of 32bit request length.
iotest 206 output error message changed, as now test disk size is
larger than new limit. Add one more test case with new maximum disk
size to cover too-big-L1 case.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201203222713.13507-5-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-12-04 01:27:13 +03:00
|
|
|
|
2004-08-02 01:59:26 +04:00
|
|
|
struct BlockDriver {
|
|
|
|
const char *format_name;
|
|
|
|
int instance_size;
|
2013-10-02 16:33:48 +04:00
|
|
|
|
2017-07-13 18:30:25 +03:00
|
|
|
/* set to true if the BlockDriver is a block filter. Block filters pass
|
2019-05-31 16:23:11 +03:00
|
|
|
* certain callbacks that refer to data (see block.c) to their bs->file
|
|
|
|
* or bs->backing (whichever one exists) if the driver doesn't implement
|
|
|
|
* them. Drivers that do not wish to forward must implement them and return
|
|
|
|
* -ENOTSUP.
|
|
|
|
* Note that filters are not allowed to modify data.
|
|
|
|
*
|
|
|
|
* Filters generally cannot have more than a single filtered child,
|
|
|
|
* because the data they present must at all times be the same as
|
|
|
|
* that on their filtered child. That would be impossible to
|
|
|
|
* achieve for multiple filtered children.
|
|
|
|
* (And this filtered child must then be bs->file or bs->backing.)
|
2017-07-13 18:30:25 +03:00
|
|
|
*/
|
2014-03-03 22:11:34 +04:00
|
|
|
bool is_filter;
|
2020-05-13 14:05:12 +03:00
|
|
|
/*
|
|
|
|
* Set to true if the BlockDriver is a format driver. Format nodes
|
|
|
|
* generally do not expect their children to be other format nodes
|
|
|
|
* (except for backing files), and so format probing is disabled
|
|
|
|
* on those children.
|
|
|
|
*/
|
|
|
|
bool is_format;
|
2020-02-18 13:34:41 +03:00
|
|
|
/*
|
|
|
|
* Return true if @to_replace can be replaced by a BDS with the
|
|
|
|
* same data as @bs without it affecting @bs's behavior (that is,
|
|
|
|
* without it being visible to @bs's parents).
|
|
|
|
*/
|
|
|
|
bool (*bdrv_recurse_can_replace)(BlockDriverState *bs,
|
|
|
|
BlockDriverState *to_replace);
|
2013-10-02 16:33:48 +04:00
|
|
|
|
2004-08-02 01:59:26 +04:00
|
|
|
int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char *filename);
|
2009-06-15 16:04:22 +04:00
|
|
|
int (*bdrv_probe_device)(const char *filename);
|
2013-03-18 19:40:51 +04:00
|
|
|
|
|
|
|
/* Any driver implementing this callback is expected to be able to handle
|
|
|
|
* NULL file names in its .bdrv_open() implementation */
|
2013-03-15 21:47:22 +04:00
|
|
|
void (*bdrv_parse_filename)(const char *filename, QDict *options, Error **errp);
|
2013-09-24 19:07:04 +04:00
|
|
|
/* Drivers not implementing bdrv_parse_filename nor bdrv_open should have
|
|
|
|
* this field set to true, except ones that are defined only by their
|
|
|
|
* child's bs.
|
|
|
|
* An example of the last type will be the quorum block driver.
|
|
|
|
*/
|
|
|
|
bool bdrv_needs_filename;
|
2012-09-20 23:13:19 +04:00
|
|
|
|
2020-05-28 12:44:04 +03:00
|
|
|
/*
|
|
|
|
* Set if a driver can support backing files. This also implies the
|
|
|
|
* following semantics:
|
|
|
|
*
|
|
|
|
* - Return status 0 of .bdrv_co_block_status means that corresponding
|
|
|
|
* blocks are not allocated in this layer of backing-chain
|
|
|
|
* - For such (unallocated) blocks, read will:
|
|
|
|
* - fill buffer with zeros if there is no backing file
|
|
|
|
* - read from the backing file otherwise, where the block layer
|
|
|
|
* takes care of reading zeros beyond EOF if backing file is short
|
|
|
|
*/
|
2014-06-04 17:09:35 +04:00
|
|
|
bool supports_backing;
|
|
|
|
|
2012-09-20 23:13:19 +04:00
|
|
|
/* For handling image reopen for split or non-split files */
|
|
|
|
int (*bdrv_reopen_prepare)(BDRVReopenState *reopen_state,
|
|
|
|
BlockReopenQueue *queue, Error **errp);
|
|
|
|
void (*bdrv_reopen_commit)(BDRVReopenState *reopen_state);
|
2020-02-28 15:44:46 +03:00
|
|
|
void (*bdrv_reopen_commit_post)(BDRVReopenState *reopen_state);
|
2012-09-20 23:13:19 +04:00
|
|
|
void (*bdrv_reopen_abort)(BDRVReopenState *reopen_state);
|
2015-11-16 17:34:59 +03:00
|
|
|
void (*bdrv_join_options)(QDict *options, QDict *old_options);
|
2012-09-20 23:13:19 +04:00
|
|
|
|
2013-09-05 16:22:29 +04:00
|
|
|
int (*bdrv_open)(BlockDriverState *bs, QDict *options, int flags,
|
|
|
|
Error **errp);
|
2018-03-13 01:07:53 +03:00
|
|
|
|
|
|
|
/* Protocol drivers should implement this instead of bdrv_open */
|
2013-09-05 16:22:29 +04:00
|
|
|
int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags,
|
|
|
|
Error **errp);
|
2004-09-18 23:32:11 +04:00
|
|
|
void (*bdrv_close)(BlockDriverState *bs);
|
2020-06-25 15:55:45 +03:00
|
|
|
|
|
|
|
|
2018-01-09 18:50:57 +03:00
|
|
|
int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
|
2018-01-18 15:43:45 +03:00
|
|
|
Error **errp);
|
2020-03-26 04:12:17 +03:00
|
|
|
int coroutine_fn (*bdrv_co_create_opts)(BlockDriver *drv,
|
|
|
|
const char *filename,
|
2018-01-09 18:50:57 +03:00
|
|
|
QemuOpts *opts,
|
|
|
|
Error **errp);
|
2020-06-25 15:55:45 +03:00
|
|
|
|
|
|
|
int coroutine_fn (*bdrv_co_amend)(BlockDriverState *bs,
|
|
|
|
BlockdevAmendOptions *opts,
|
|
|
|
bool force,
|
|
|
|
Error **errp);
|
|
|
|
|
|
|
|
int (*bdrv_amend_options)(BlockDriverState *bs,
|
|
|
|
QemuOpts *opts,
|
|
|
|
BlockDriverAmendStatusCB *status_cb,
|
|
|
|
void *cb_opaque,
|
|
|
|
bool force,
|
|
|
|
Error **errp);
|
|
|
|
|
2005-12-18 21:28:15 +03:00
|
|
|
int (*bdrv_make_empty)(BlockDriverState *bs);
|
2014-07-18 22:24:56 +04:00
|
|
|
|
2019-02-01 22:29:28 +03:00
|
|
|
/*
|
|
|
|
* Refreshes the bs->exact_filename field. If that is impossible,
|
|
|
|
* bs->exact_filename has to be left empty.
|
|
|
|
*/
|
|
|
|
void (*bdrv_refresh_filename)(BlockDriverState *bs);
|
2014-07-18 22:24:56 +04:00
|
|
|
|
2019-02-01 22:29:26 +03:00
|
|
|
/*
|
|
|
|
* Gathers the open options for all children into @target.
|
|
|
|
* A simple format driver (without backing file support) might
|
|
|
|
* implement this function like this:
|
|
|
|
*
|
|
|
|
* QINCREF(bs->file->bs->full_open_options);
|
|
|
|
* qdict_put(target, "file", bs->file->bs->full_open_options);
|
|
|
|
*
|
|
|
|
* If not specified, the generic implementation will simply put
|
|
|
|
* all children's options under their respective name.
|
|
|
|
*
|
|
|
|
* @backing_overridden is true when bs->backing seems not to be
|
|
|
|
* the child that would result from opening bs->backing_file.
|
|
|
|
* Therefore, if it is true, the backing child's options should be
|
|
|
|
* gathered; otherwise, there is no need since the backing child
|
|
|
|
* is the one implied by the image header.
|
|
|
|
*
|
|
|
|
* Note that ideally this function would not be needed. Every
|
|
|
|
* block driver which implements it is probably doing something
|
|
|
|
* shady regarding its runtime option structure.
|
|
|
|
*/
|
|
|
|
void (*bdrv_gather_child_options)(BlockDriverState *bs, QDict *target,
|
|
|
|
bool backing_overridden);
|
|
|
|
|
2019-02-01 22:29:18 +03:00
|
|
|
/*
|
|
|
|
* Returns an allocated string which is the directory name of this BDS: It
|
|
|
|
* will be used to make relative filenames absolute by prepending this
|
|
|
|
* function's return value to them.
|
|
|
|
*/
|
|
|
|
char *(*bdrv_dirname)(BlockDriverState *bs, Error **errp);
|
|
|
|
|
2006-08-01 20:21:11 +04:00
|
|
|
/* aio */
|
block: Support byte-based aio callbacks
We are gradually moving away from sector-based interfaces, towards
byte-based. Add new sector-based aio callbacks for read and write,
to match the fact that bdrv_aio_pdiscard is already byte-based.
Ideally, drivers should be converted to use coroutine callbacks
rather than aio; but that is not quite as trivial (and if we were
to do that conversion, the null-aio driver would disappear), so for
the short term, converting the signature but keeping things with
aio is easier. However, we CAN declare that a driver that uses
the byte-based aio interfaces now defaults to byte-based
operations, and must explicitly provide a refresh_limits override
to stick with larger alignments (making the alignment issues more
obvious directly in the drivers touched in the next few patches).
Once all drivers are converted, the sector-based aio callbacks will
be removed; in the meantime, a FIXME comment is added due to a
slight inefficiency that will be touched up as part of that later
cleanup.
Simplify some instances of 'bs->drv' into 'drv' while touching this,
since the local variable already exists to reduce typing.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-04-24 22:25:01 +03:00
|
|
|
BlockAIOCB *(*bdrv_aio_preadv)(BlockDriverState *bs,
|
|
|
|
uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags,
|
|
|
|
BlockCompletionFunc *cb, void *opaque);
|
|
|
|
BlockAIOCB *(*bdrv_aio_pwritev)(BlockDriverState *bs,
|
|
|
|
uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags,
|
|
|
|
BlockCompletionFunc *cb, void *opaque);
|
2014-10-07 15:59:14 +04:00
|
|
|
BlockAIOCB *(*bdrv_aio_flush)(BlockDriverState *bs,
|
2014-10-07 15:59:15 +04:00
|
|
|
BlockCompletionFunc *cb, void *opaque);
|
2016-07-16 02:22:57 +03:00
|
|
|
BlockAIOCB *(*bdrv_aio_pdiscard)(BlockDriverState *bs,
|
2017-06-09 13:18:08 +03:00
|
|
|
int64_t offset, int bytes,
|
2014-10-07 15:59:15 +04:00
|
|
|
BlockCompletionFunc *cb, void *opaque);
|
2006-08-01 20:21:11 +04:00
|
|
|
|
2011-07-14 19:27:13 +04:00
|
|
|
int coroutine_fn (*bdrv_co_readv)(BlockDriverState *bs,
|
|
|
|
int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
|
2017-08-31 13:54:56 +03:00
|
|
|
|
|
|
|
/**
|
|
|
|
* @offset: position in bytes to read at
|
|
|
|
* @bytes: number of bytes to read
|
|
|
|
* @qiov: the buffers to fill with read data
|
|
|
|
* @flags: currently unused, always 0
|
|
|
|
*
|
|
|
|
* @offset and @bytes will be a multiple of 'request_alignment',
|
|
|
|
* but the length of individual @qiov elements does not have to
|
|
|
|
* be a multiple.
|
|
|
|
*
|
|
|
|
* @bytes will always equal the total size of @qiov, and will be
|
|
|
|
* no larger than 'max_transfer'.
|
|
|
|
*
|
|
|
|
* The buffer in @qiov may point directly to guest memory.
|
|
|
|
*/
|
2016-04-25 12:25:18 +03:00
|
|
|
int coroutine_fn (*bdrv_co_preadv)(BlockDriverState *bs,
|
|
|
|
uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags);
|
2019-06-04 19:15:06 +03:00
|
|
|
int coroutine_fn (*bdrv_co_preadv_part)(BlockDriverState *bs,
|
|
|
|
uint64_t offset, uint64_t bytes,
|
|
|
|
QEMUIOVector *qiov, size_t qiov_offset, int flags);
|
2011-07-14 19:27:13 +04:00
|
|
|
int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
|
2016-03-10 15:39:55 +03:00
|
|
|
int64_t sector_num, int nb_sectors, QEMUIOVector *qiov, int flags);
|
2017-08-31 13:54:56 +03:00
|
|
|
/**
|
|
|
|
* @offset: position in bytes to write at
|
|
|
|
* @bytes: number of bytes to write
|
|
|
|
* @qiov: the buffers containing data to write
|
|
|
|
* @flags: zero or more bits allowed by 'supported_write_flags'
|
|
|
|
*
|
|
|
|
* @offset and @bytes will be a multiple of 'request_alignment',
|
|
|
|
* but the length of individual @qiov elements does not have to
|
|
|
|
* be a multiple.
|
|
|
|
*
|
|
|
|
* @bytes will always equal the total size of @qiov, and will be
|
|
|
|
* no larger than 'max_transfer'.
|
|
|
|
*
|
|
|
|
* The buffer in @qiov may point directly to guest memory.
|
|
|
|
*/
|
2016-04-25 12:25:18 +03:00
|
|
|
int coroutine_fn (*bdrv_co_pwritev)(BlockDriverState *bs,
|
|
|
|
uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags);
|
2019-06-04 19:15:06 +03:00
|
|
|
int coroutine_fn (*bdrv_co_pwritev_part)(BlockDriverState *bs,
|
|
|
|
uint64_t offset, uint64_t bytes,
|
|
|
|
QEMUIOVector *qiov, size_t qiov_offset, int flags);
|
2016-03-10 15:39:55 +03:00
|
|
|
|
2012-02-07 17:27:25 +04:00
|
|
|
/*
|
|
|
|
* Efficiently zero a region of the disk image. Typically an image format
|
|
|
|
* would use a compact metadata representation to implement this. This
|
block: Honor BDRV_REQ_FUA during write_zeroes
The block layer has a couple of cases where it can lose
Force Unit Access semantics when writing a large block of
zeroes, such that the request returns before the zeroes
have been guaranteed to land on underlying media.
SCSI does not support FUA during WRITESAME(10/16); FUA is only
supported if it falls back to WRITE(10/16). But where the
underlying device is new enough to not need a fallback, it
means that any upper layer request with FUA semantics was
silently ignoring BDRV_REQ_FUA.
Conversely, NBD has situations where it can support FUA but not
ZERO_WRITE; when that happens, the generic block layer fallback
to bdrv_driver_pwritev() (or the older bdrv_co_writev() in qemu
2.6) was losing the FUA flag.
The problem of losing flags unrelated to ZERO_WRITE has been
latent in bdrv_co_do_write_zeroes() since commit aa7bfbff, but
back then, it did not matter because there was no FUA flag. It
became observable when commit 93f5e6d8 paved the way for flags
that can impact correctness, when we should have been using
bdrv_co_writev_flags() with modified flags. Compare to commit
9eeb6dd, which got flag manipulation right in
bdrv_co_do_zero_pwritev().
Symptoms: I tested with qemu-io with default writethrough cache
(which is supposed to use FUA semantics on every write), and
targetted an NBD client connected to a server that intentionally
did not advertise NBD_FLAG_SEND_FUA. When doing 'write 0 512',
the NBD client sent two operations (NBD_CMD_WRITE then
NBD_CMD_FLUSH) to get the fallback FUA semantics; but when doing
'write -z 0 512', the NBD client sent only NBD_CMD_WRITE.
The fix is do to a cleanup bdrv_co_flush() at the end of the
operation if any step in the middle relied on a BDS that does
not natively support FUA for that step (note that we don't
need to flush after every operation, if the operation is broken
into chunks based on bounce-buffer sizing). Each BDS gains a
new flag .supported_zero_flags, which parallels the use of
.supported_write_flags but only when accessing a zero write
operation (the flags MUST be different, because of SCSI having
different semantics based on WRITE vs. WRITESAME; and also
because BDRV_REQ_MAY_UNMAP only makes sense on zero writes).
Also fix some documentation to describe -ENOTSUP semantics,
particularly since iscsi depends on those semantics.
Down the road, we may want to add a driver where its
.bdrv_co_pwritev() honors all three of BDRV_REQ_FUA,
BDRV_REQ_ZERO_WRITE, and BDRV_REQ_MAY_UNMAP, and advertise
this via bs->supported_write_flags for blocks opened by that
driver; such a driver should NOT supply .bdrv_co_write_zeroes
nor .supported_zero_flags. But none of the drivers touched
in this patch want to do that (the act of writing zeroes is
different enough from normal writes to deserve a second
callback).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-04 01:39:07 +03:00
|
|
|
* function pointer may be NULL or return -ENOSUP and .bdrv_co_writev()
|
|
|
|
* will be called instead.
|
2012-02-07 17:27:25 +04:00
|
|
|
*/
|
2016-06-02 00:10:03 +03:00
|
|
|
int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs,
|
2017-06-09 13:18:08 +03:00
|
|
|
int64_t offset, int bytes, BdrvRequestFlags flags);
|
2016-07-16 02:22:58 +03:00
|
|
|
int coroutine_fn (*bdrv_co_pdiscard)(BlockDriverState *bs,
|
2017-06-09 13:18:08 +03:00
|
|
|
int64_t offset, int bytes);
|
2017-05-07 03:05:43 +03:00
|
|
|
|
2018-06-01 12:26:39 +03:00
|
|
|
/* Map [offset, offset + nbytes) range onto a child of @bs to copy from,
|
|
|
|
* and invoke bdrv_co_copy_range_from(child, ...), or invoke
|
|
|
|
* bdrv_co_copy_range_to() if @bs is the leaf child to copy data from.
|
|
|
|
*
|
|
|
|
* See the comment of bdrv_co_copy_range for the parameter and return value
|
|
|
|
* semantics.
|
|
|
|
*/
|
|
|
|
int coroutine_fn (*bdrv_co_copy_range_from)(BlockDriverState *bs,
|
|
|
|
BdrvChild *src,
|
|
|
|
uint64_t offset,
|
|
|
|
BdrvChild *dst,
|
|
|
|
uint64_t dst_offset,
|
|
|
|
uint64_t bytes,
|
2018-07-09 19:37:17 +03:00
|
|
|
BdrvRequestFlags read_flags,
|
|
|
|
BdrvRequestFlags write_flags);
|
2018-06-01 12:26:39 +03:00
|
|
|
|
|
|
|
/* Map [offset, offset + nbytes) range onto a child of bs to copy data to,
|
|
|
|
* and invoke bdrv_co_copy_range_to(child, src, ...), or perform the copy
|
|
|
|
* operation if @bs is the leaf and @src has the same BlockDriver. Return
|
|
|
|
* -ENOTSUP if @bs is the leaf but @src has a different BlockDriver.
|
|
|
|
*
|
|
|
|
* See the comment of bdrv_co_copy_range for the parameter and return value
|
|
|
|
* semantics.
|
|
|
|
*/
|
|
|
|
int coroutine_fn (*bdrv_co_copy_range_to)(BlockDriverState *bs,
|
|
|
|
BdrvChild *src,
|
|
|
|
uint64_t src_offset,
|
|
|
|
BdrvChild *dst,
|
|
|
|
uint64_t dst_offset,
|
|
|
|
uint64_t bytes,
|
2018-07-09 19:37:17 +03:00
|
|
|
BdrvRequestFlags read_flags,
|
|
|
|
BdrvRequestFlags write_flags);
|
2018-06-01 12:26:39 +03:00
|
|
|
|
2017-05-07 03:05:43 +03:00
|
|
|
/*
|
2017-10-12 06:46:57 +03:00
|
|
|
* Building block for bdrv_block_status[_above] and
|
|
|
|
* bdrv_is_allocated[_above]. The driver should answer only
|
block: Add .bdrv_co_block_status() callback
We are gradually moving away from sector-based interfaces, towards
byte-based. Now that the block layer exposes byte-based allocation,
it's time to tackle the drivers. Add a new callback that operates
on as small as byte boundaries. Subsequent patches will then update
individual drivers, then finally remove .bdrv_co_get_block_status().
The new code also passes through the 'want_zero' hint, which will
allow subsequent patches to further optimize callers that only care
about how much of the image is allocated (want_zero is false),
rather than full details about runs of zeroes and which offsets the
allocation actually maps to (want_zero is true). As part of this
effort, fix another part of the documentation: the claim in commit
4c41cb4 that BDRV_BLOCK_ALLOCATED is short for 'DATA || ZERO' is a
lie at the block layer (see commit e88ae2264), even though it is
how the bit is computed from the driver layer. After all, there
are intentionally cases where we return ZERO but not ALLOCATED at
the block layer, when we know that a read sees zero because the
backing file is too short. Note that the driver interface is thus
slightly different than the public interface with regards to which
bits will be set, and what guarantees are provided on input.
We also add an assertion that any driver using the new callback will
make progress (the only time pnum will be 0 is if the block layer
already handled an out-of-bounds request, or if there is an error);
the old driver interface did not provide this guarantee, which
could lead to some inf-loops in drastic corner-case failures.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-02-13 23:26:41 +03:00
|
|
|
* according to the current layer, and should only need to set
|
|
|
|
* BDRV_BLOCK_DATA, BDRV_BLOCK_ZERO, BDRV_BLOCK_OFFSET_VALID,
|
|
|
|
* and/or BDRV_BLOCK_RAW; if the current layer defers to a backing
|
|
|
|
* layer, the result should be 0 (and not BDRV_BLOCK_ZERO). See
|
|
|
|
* block.h for the overall meaning of the bits. As a hint, the
|
|
|
|
* flag want_zero is true if the caller cares more about precise
|
|
|
|
* mappings (favor accurate _OFFSET_VALID/_ZERO) or false for
|
|
|
|
* overall allocation (favor larger *pnum, perhaps by reporting
|
|
|
|
* _DATA instead of _ZERO). The block layer guarantees input
|
|
|
|
* clamped to bdrv_getlength() and aligned to request_alignment,
|
|
|
|
* as well as non-NULL pnum, map, and file; in turn, the driver
|
|
|
|
* must return an error or set pnum to an aligned non-zero value.
|
2021-08-12 11:41:45 +03:00
|
|
|
*
|
|
|
|
* Note that @bytes is just a hint on how big of a region the
|
|
|
|
* caller wants to inspect. It is not a limit on *pnum.
|
|
|
|
* Implementations are free to return larger values of *pnum if
|
|
|
|
* doing so does not incur a performance penalty.
|
|
|
|
*
|
|
|
|
* block/io.c's bdrv_co_block_status() will utilize an unclamped
|
|
|
|
* *pnum value for the block-status cache on protocol nodes, prior
|
|
|
|
* to clamping *pnum for return to its caller.
|
2017-05-07 03:05:43 +03:00
|
|
|
*/
|
block: Add .bdrv_co_block_status() callback
We are gradually moving away from sector-based interfaces, towards
byte-based. Now that the block layer exposes byte-based allocation,
it's time to tackle the drivers. Add a new callback that operates
on as small as byte boundaries. Subsequent patches will then update
individual drivers, then finally remove .bdrv_co_get_block_status().
The new code also passes through the 'want_zero' hint, which will
allow subsequent patches to further optimize callers that only care
about how much of the image is allocated (want_zero is false),
rather than full details about runs of zeroes and which offsets the
allocation actually maps to (want_zero is true). As part of this
effort, fix another part of the documentation: the claim in commit
4c41cb4 that BDRV_BLOCK_ALLOCATED is short for 'DATA || ZERO' is a
lie at the block layer (see commit e88ae2264), even though it is
how the bit is computed from the driver layer. After all, there
are intentionally cases where we return ZERO but not ALLOCATED at
the block layer, when we know that a read sees zero because the
backing file is too short. Note that the driver interface is thus
slightly different than the public interface with regards to which
bits will be set, and what guarantees are provided on input.
We also add an assertion that any driver using the new callback will
make progress (the only time pnum will be 0 is if the block layer
already handled an out-of-bounds request, or if there is an error);
the old driver interface did not provide this guarantee, which
could lead to some inf-loops in drastic corner-case failures.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-02-13 23:26:41 +03:00
|
|
|
int coroutine_fn (*bdrv_co_block_status)(BlockDriverState *bs,
|
|
|
|
bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
|
|
|
|
int64_t *map, BlockDriverState **file);
|
2011-07-14 19:27:13 +04:00
|
|
|
|
2021-02-05 19:37:11 +03:00
|
|
|
/*
|
|
|
|
* This informs the driver that we are no longer interested in the result
|
|
|
|
* of in-flight requests, so don't waste the time if possible.
|
|
|
|
*
|
|
|
|
* One example usage is to avoid waiting for an nbd target node reconnect
|
2021-04-21 10:58:58 +03:00
|
|
|
* timeout during job-cancel with force=true.
|
2021-02-05 19:37:11 +03:00
|
|
|
*/
|
|
|
|
void (*bdrv_cancel_in_flight)(BlockDriverState *bs);
|
|
|
|
|
2011-11-15 01:09:45 +04:00
|
|
|
/*
|
|
|
|
* Invalidate any cached meta-data.
|
|
|
|
*/
|
2018-03-01 19:36:18 +03:00
|
|
|
void coroutine_fn (*bdrv_co_invalidate_cache)(BlockDriverState *bs,
|
|
|
|
Error **errp);
|
2015-12-22 16:07:08 +03:00
|
|
|
int (*bdrv_inactivate)(BlockDriverState *bs);
|
2011-11-15 01:09:45 +04:00
|
|
|
|
2016-03-14 10:44:53 +03:00
|
|
|
/*
|
|
|
|
* Flushes all data for all layers by calling bdrv_co_flush for underlying
|
|
|
|
* layers, if needed. This function is needed for deterministic
|
|
|
|
* synchronization of the flush finishing callback.
|
|
|
|
*/
|
|
|
|
int coroutine_fn (*bdrv_co_flush)(BlockDriverState *bs);
|
|
|
|
|
2020-01-31 00:39:04 +03:00
|
|
|
/* Delete a created file. */
|
|
|
|
int coroutine_fn (*bdrv_co_delete_file)(BlockDriverState *bs,
|
|
|
|
Error **errp);
|
|
|
|
|
2011-11-10 20:25:44 +04:00
|
|
|
/*
|
|
|
|
* Flushes all data that was already written to the OS all the way down to
|
2016-12-02 22:48:54 +03:00
|
|
|
* the disk (for example file-posix.c calls fsync()).
|
2011-11-10 20:25:44 +04:00
|
|
|
*/
|
|
|
|
int coroutine_fn (*bdrv_co_flush_to_disk)(BlockDriverState *bs);
|
|
|
|
|
2011-11-10 21:10:11 +04:00
|
|
|
/*
|
|
|
|
* Flushes all internal caches to the OS. The data may still sit in a
|
|
|
|
* writeback cache of the host OS, but it will survive a crash of the qemu
|
|
|
|
* process.
|
|
|
|
*/
|
|
|
|
int coroutine_fn (*bdrv_co_flush_to_os)(BlockDriverState *bs);
|
|
|
|
|
2018-03-13 01:07:53 +03:00
|
|
|
/*
|
|
|
|
* Drivers setting this field must be able to work with just a plain
|
|
|
|
* filename with '<protocol_name>:' as a prefix, and no other options.
|
|
|
|
* Options may be extracted from the filename by implementing
|
|
|
|
* bdrv_parse_filename.
|
|
|
|
*/
|
2006-08-01 20:21:11 +04:00
|
|
|
const char *protocol_name;
|
2019-09-18 12:51:40 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Truncate @bs to @offset bytes using the given @prealloc mode
|
|
|
|
* when growing. Modes other than PREALLOC_MODE_OFF should be
|
|
|
|
* rejected when shrinking @bs.
|
|
|
|
*
|
|
|
|
* If @exact is true, @bs must be resized to exactly @offset.
|
|
|
|
* Otherwise, it is sufficient for @bs (if it is a host block
|
|
|
|
* device and thus there is no way to resize it) to be at least
|
|
|
|
* @offset bytes in length.
|
|
|
|
*
|
|
|
|
* If @exact is true and this function fails but would succeed
|
|
|
|
* with @exact = false, it should return -ENOTSUP.
|
|
|
|
*/
|
block: Convert .bdrv_truncate callback to coroutine_fn
bdrv_truncate() is an operation that can block (even for a quite long
time, depending on the PreallocMode) in I/O paths that shouldn't block.
Convert it to a coroutine_fn so that we have the infrastructure for
drivers to make their .bdrv_co_truncate implementation asynchronous.
This change could potentially introduce new race conditions because
bdrv_truncate() isn't necessarily executed atomically any more. Whether
this is a problem needs to be evaluated for each block driver that
supports truncate:
* file-posix/win32, gluster, iscsi, nfs, rbd, ssh, sheepdog: The
protocol drivers are trivially safe because they don't actually yield
yet, so there is no change in behaviour.
* copy-on-read, crypto, raw-format: Essentially just filter drivers that
pass the request to a child node, no problem.
* qcow2: The implementation modifies metadata, so it needs to hold
s->lock to be safe with concurrent I/O requests. In order to avoid
double locking, this requires pulling the locking out into
preallocate_co() and using qcow2_write_caches() instead of
bdrv_flush().
* qed: Does a single header update, this is fine without locking.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-21 18:54:35 +03:00
|
|
|
int coroutine_fn (*bdrv_co_truncate)(BlockDriverState *bs, int64_t offset,
|
2019-09-18 12:51:40 +03:00
|
|
|
bool exact, PreallocMode prealloc,
|
2020-04-24 15:54:39 +03:00
|
|
|
BdrvRequestFlags flags, Error **errp);
|
block: Avoid unecessary drv->bdrv_getlength() calls
The block layer generally keeps the size of an image cached in
bs->total_sectors so that it doesn't have to perform expensive
operations to get the size whenever it needs it.
This doesn't work however when using a backend that can change its size
without qemu being aware of it, i.e. passthrough of removable media like
CD-ROMs or floppy disks. For this reason, the caching is disabled when a
removable device is used.
It is obvious that checking whether the _guest_ device has removable
media isn't the right thing to do when we want to know whether the size
of the host backend can change. To make things worse, non-top-level
BlockDriverStates never have any device attached, which makes qemu
assume they are removable, so drv->bdrv_getlength() is always called on
the protocol layer. In the case of raw-posix, this causes unnecessary
lseek() system calls, which turned out to be rather expensive.
This patch completely changes the logic and disables bs->total_sectors
caching only for certain block driver types, for which a size change is
expected: host_cdrom and host_floppy on POSIX, host_device on win32; also
the raw format in case it sits on top of one of these protocols, but in
the common case the nested bdrv_getlength() call on the protocol driver
will use the cache again and avoid an expensive drv->bdrv_getlength()
call.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-29 15:18:58 +04:00
|
|
|
|
2006-08-01 20:21:11 +04:00
|
|
|
int64_t (*bdrv_getlength)(BlockDriverState *bs);
|
block: Avoid unecessary drv->bdrv_getlength() calls
The block layer generally keeps the size of an image cached in
bs->total_sectors so that it doesn't have to perform expensive
operations to get the size whenever it needs it.
This doesn't work however when using a backend that can change its size
without qemu being aware of it, i.e. passthrough of removable media like
CD-ROMs or floppy disks. For this reason, the caching is disabled when a
removable device is used.
It is obvious that checking whether the _guest_ device has removable
media isn't the right thing to do when we want to know whether the size
of the host backend can change. To make things worse, non-top-level
BlockDriverStates never have any device attached, which makes qemu
assume they are removable, so drv->bdrv_getlength() is always called on
the protocol layer. In the case of raw-posix, this causes unnecessary
lseek() system calls, which turned out to be rather expensive.
This patch completely changes the logic and disables bs->total_sectors
caching only for certain block driver types, for which a size change is
expected: host_cdrom and host_floppy on POSIX, host_device on win32; also
the raw format in case it sits on top of one of these protocols, but in
the common case the nested bdrv_getlength() call on the protocol driver
will use the cache again and avoid an expensive drv->bdrv_getlength()
call.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-29 15:18:58 +04:00
|
|
|
bool has_variable_length;
|
2011-07-12 15:56:39 +04:00
|
|
|
int64_t (*bdrv_get_allocated_file_size)(BlockDriverState *bs);
|
2017-07-05 15:57:30 +03:00
|
|
|
BlockMeasureInfo *(*bdrv_measure)(QemuOpts *opts, BlockDriverState *in_bs,
|
|
|
|
Error **errp);
|
block: Avoid unecessary drv->bdrv_getlength() calls
The block layer generally keeps the size of an image cached in
bs->total_sectors so that it doesn't have to perform expensive
operations to get the size whenever it needs it.
This doesn't work however when using a backend that can change its size
without qemu being aware of it, i.e. passthrough of removable media like
CD-ROMs or floppy disks. For this reason, the caching is disabled when a
removable device is used.
It is obvious that checking whether the _guest_ device has removable
media isn't the right thing to do when we want to know whether the size
of the host backend can change. To make things worse, non-top-level
BlockDriverStates never have any device attached, which makes qemu
assume they are removable, so drv->bdrv_getlength() is always called on
the protocol layer. In the case of raw-posix, this causes unnecessary
lseek() system calls, which turned out to be rather expensive.
This patch completely changes the logic and disables bs->total_sectors
caching only for certain block driver types, for which a size change is
expected: host_cdrom and host_floppy on POSIX, host_device on win32; also
the raw format in case it sits on top of one of these protocols, but in
the common case the nested bdrv_getlength() call on the protocol driver
will use the cache again and avoid an expensive drv->bdrv_getlength()
call.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-29 15:18:58 +04:00
|
|
|
|
2016-07-22 11:17:42 +03:00
|
|
|
int coroutine_fn (*bdrv_co_pwritev_compressed)(BlockDriverState *bs,
|
|
|
|
uint64_t offset, uint64_t bytes, QEMUIOVector *qiov);
|
2019-06-04 19:15:06 +03:00
|
|
|
int coroutine_fn (*bdrv_co_pwritev_compressed_part)(BlockDriverState *bs,
|
|
|
|
uint64_t offset, uint64_t bytes, QEMUIOVector *qiov,
|
|
|
|
size_t qiov_offset);
|
2016-07-22 11:17:42 +03:00
|
|
|
|
2007-09-17 01:08:06 +04:00
|
|
|
int (*bdrv_snapshot_create)(BlockDriverState *bs,
|
2006-08-06 01:31:00 +04:00
|
|
|
QEMUSnapshotInfo *sn_info);
|
2007-09-17 01:08:06 +04:00
|
|
|
int (*bdrv_snapshot_goto)(BlockDriverState *bs,
|
2006-08-06 01:31:00 +04:00
|
|
|
const char *snapshot_id);
|
snapshot: distinguish id and name in snapshot delete
Snapshot creation actually already distinguish id and name since it take
a structured parameter *sn, but delete can't. Later an accurate delete
is needed in qmp_transaction abort and blockdev-snapshot-delete-sync,
so change its prototype. Also *errp is added to tip error, but return
value is kepted to let caller check what kind of error happens. Existing
caller for it are savevm, delvm and qemu-img, they are not impacted by
introducing a new function bdrv_snapshot_delete_by_id_or_name(), which
check the return value and do the operation again.
Before this patch:
For qcow2, it search id first then name to find the one to delete.
For rbd, it search name.
For sheepdog, it does nothing.
After this patch:
For qcow2, logic is the same by call it twice in caller.
For rbd, it always fails in delete with id, but still search for name
in second try, no change to user.
Some code for *errp is based on Pavel's patch.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2013-09-11 10:04:33 +04:00
|
|
|
int (*bdrv_snapshot_delete)(BlockDriverState *bs,
|
|
|
|
const char *snapshot_id,
|
|
|
|
const char *name,
|
|
|
|
Error **errp);
|
2007-09-17 01:08:06 +04:00
|
|
|
int (*bdrv_snapshot_list)(BlockDriverState *bs,
|
2006-08-06 01:31:00 +04:00
|
|
|
QEMUSnapshotInfo **psn_info);
|
2010-09-22 06:58:41 +04:00
|
|
|
int (*bdrv_snapshot_load_tmp)(BlockDriverState *bs,
|
2013-12-04 13:10:54 +04:00
|
|
|
const char *snapshot_id,
|
|
|
|
const char *name,
|
|
|
|
Error **errp);
|
2006-08-06 01:31:00 +04:00
|
|
|
int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
|
2019-02-08 18:06:06 +03:00
|
|
|
ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
|
|
|
|
Error **errp);
|
2019-09-23 15:17:37 +03:00
|
|
|
BlockStatsSpecific *(*bdrv_get_specific_stats)(BlockDriverState *bs);
|
2006-08-01 20:21:11 +04:00
|
|
|
|
2016-06-09 17:24:44 +03:00
|
|
|
int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
|
|
|
|
QEMUIOVector *qiov,
|
|
|
|
int64_t pos);
|
|
|
|
int coroutine_fn (*bdrv_load_vmstate)(BlockDriverState *bs,
|
|
|
|
QEMUIOVector *qiov,
|
|
|
|
int64_t pos);
|
2009-04-05 23:10:55 +04:00
|
|
|
|
2010-01-12 14:55:17 +03:00
|
|
|
int (*bdrv_change_backing_file)(BlockDriverState *bs,
|
|
|
|
const char *backing_file, const char *backing_fmt);
|
|
|
|
|
2006-08-19 15:45:59 +04:00
|
|
|
/* removable device specific */
|
2015-10-19 18:53:11 +03:00
|
|
|
bool (*bdrv_is_inserted)(BlockDriverState *bs);
|
2012-02-03 22:24:53 +04:00
|
|
|
void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
|
2011-09-06 20:58:47 +04:00
|
|
|
void (*bdrv_lock_medium)(BlockDriverState *bs, bool locked);
|
2007-09-17 12:09:54 +04:00
|
|
|
|
2007-12-24 19:10:43 +03:00
|
|
|
/* to control generic scsi devices */
|
2014-10-07 15:59:14 +04:00
|
|
|
BlockAIOCB *(*bdrv_aio_ioctl)(BlockDriverState *bs,
|
2009-03-28 20:28:41 +03:00
|
|
|
unsigned long int req, void *buf,
|
2014-10-07 15:59:15 +04:00
|
|
|
BlockCompletionFunc *cb, void *opaque);
|
2016-10-20 16:07:27 +03:00
|
|
|
int coroutine_fn (*bdrv_co_ioctl)(BlockDriverState *bs,
|
|
|
|
unsigned long int req, void *buf);
|
2007-12-24 19:10:43 +03:00
|
|
|
|
2009-05-18 18:42:10 +04:00
|
|
|
/* List of options for creating images, terminated by name == NULL */
|
2014-06-05 13:20:51 +04:00
|
|
|
QemuOptsList *create_opts;
|
2020-06-25 15:55:39 +03:00
|
|
|
|
|
|
|
/* List of options for image amend */
|
|
|
|
QemuOptsList *amend_opts;
|
|
|
|
|
2019-03-12 19:48:48 +03:00
|
|
|
/*
|
|
|
|
* If this driver supports reopening images this contains a
|
|
|
|
* NULL-terminated list of the runtime options that can be
|
|
|
|
* modified. If an option in this list is unspecified during
|
|
|
|
* reopen then it _must_ be reset to its default value or return
|
|
|
|
* an error.
|
|
|
|
*/
|
|
|
|
const char *const *mutable_opts;
|
2009-03-28 20:55:10 +03:00
|
|
|
|
2010-06-29 14:37:54 +04:00
|
|
|
/*
|
|
|
|
* Returns 0 for completed check, -errno for internal errors.
|
|
|
|
* The check results are stored in result.
|
|
|
|
*/
|
2018-03-01 19:36:19 +03:00
|
|
|
int coroutine_fn (*bdrv_co_check)(BlockDriverState *bs,
|
|
|
|
BdrvCheckResult *result,
|
|
|
|
BdrvCheckMode fix);
|
2009-04-22 03:11:50 +04:00
|
|
|
|
2015-11-18 11:52:54 +03:00
|
|
|
void (*bdrv_debug_event)(BlockDriverState *bs, BlkdebugEvent event);
|
2010-03-15 19:27:00 +03:00
|
|
|
|
2012-12-06 17:32:58 +04:00
|
|
|
/* TODO Better pass a option string/QDict/QemuOpts to add any rule? */
|
|
|
|
int (*bdrv_debug_breakpoint)(BlockDriverState *bs, const char *event,
|
|
|
|
const char *tag);
|
2013-11-20 06:01:54 +04:00
|
|
|
int (*bdrv_debug_remove_breakpoint)(BlockDriverState *bs,
|
|
|
|
const char *tag);
|
2012-12-06 17:32:58 +04:00
|
|
|
int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag);
|
|
|
|
bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag);
|
|
|
|
|
2014-07-16 19:48:16 +04:00
|
|
|
void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
|
2013-12-11 22:26:16 +04:00
|
|
|
|
2010-07-28 13:26:29 +04:00
|
|
|
/*
|
|
|
|
* Returns 1 if newly created images are guaranteed to contain only
|
|
|
|
* zeros, 0 otherwise.
|
|
|
|
*/
|
|
|
|
int (*bdrv_has_zero_init)(BlockDriverState *bs);
|
2009-11-30 18:54:15 +03:00
|
|
|
|
2014-05-08 18:34:37 +04:00
|
|
|
/* Remove fd handlers, timers, and other event loop callbacks so the event
|
|
|
|
* loop is no longer in use. Called with no in-flight requests and in
|
|
|
|
* depth-first traversal order with parents before child nodes.
|
|
|
|
*/
|
|
|
|
void (*bdrv_detach_aio_context)(BlockDriverState *bs);
|
|
|
|
|
|
|
|
/* Add fd handlers, timers, and other event loop callbacks so I/O requests
|
|
|
|
* can be processed again. Called with no in-flight requests and in
|
|
|
|
* depth-first traversal order with child nodes before parent nodes.
|
|
|
|
*/
|
|
|
|
void (*bdrv_attach_aio_context)(BlockDriverState *bs,
|
|
|
|
AioContext *new_context);
|
|
|
|
|
2014-07-04 14:04:33 +04:00
|
|
|
/* io queue for linux-aio */
|
|
|
|
void (*bdrv_io_plug)(BlockDriverState *bs);
|
|
|
|
void (*bdrv_io_unplug)(BlockDriverState *bs);
|
|
|
|
|
2015-02-16 14:47:54 +03:00
|
|
|
/**
|
|
|
|
* Try to get @bs's logical and physical block size.
|
|
|
|
* On success, store them in @bsz and return zero.
|
|
|
|
* On failure, return negative errno.
|
|
|
|
*/
|
|
|
|
int (*bdrv_probe_blocksizes)(BlockDriverState *bs, BlockSizes *bsz);
|
|
|
|
/**
|
|
|
|
* Try to get @bs's geometry (cyls, heads, sectors)
|
|
|
|
* On success, store them in @geo and return 0.
|
|
|
|
* On failure return -errno.
|
|
|
|
* Only drivers that want to override guest geometry implement this
|
|
|
|
* callback; see hd_geometry_guess().
|
|
|
|
*/
|
|
|
|
int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo);
|
|
|
|
|
2015-11-09 13:16:53 +03:00
|
|
|
/**
|
2017-09-23 14:14:10 +03:00
|
|
|
* bdrv_co_drain_begin is called if implemented in the beginning of a
|
2017-09-23 14:14:09 +03:00
|
|
|
* drain operation to drain and stop any internal sources of requests in
|
|
|
|
* the driver.
|
|
|
|
* bdrv_co_drain_end is called if implemented at the end of the drain.
|
|
|
|
*
|
|
|
|
* They should be used by the driver to e.g. manage scheduled I/O
|
|
|
|
* requests, or toggle an internal state. After the end of the drain new
|
|
|
|
* requests will continue normally.
|
2015-11-09 13:16:53 +03:00
|
|
|
*/
|
2017-09-23 14:14:10 +03:00
|
|
|
void coroutine_fn (*bdrv_co_drain_begin)(BlockDriverState *bs);
|
2017-09-23 14:14:09 +03:00
|
|
|
void coroutine_fn (*bdrv_co_drain_end)(BlockDriverState *bs);
|
2015-11-09 13:16:53 +03:00
|
|
|
|
2016-05-10 10:36:37 +03:00
|
|
|
void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
|
|
|
|
Error **errp);
|
|
|
|
void (*bdrv_del_child)(BlockDriverState *parent, BdrvChild *child,
|
|
|
|
Error **errp);
|
|
|
|
|
2016-12-15 15:04:20 +03:00
|
|
|
/**
|
|
|
|
* Informs the block driver that a permission change is intended. The
|
|
|
|
* driver checks whether the change is permissible and may take other
|
|
|
|
* preparations for the change (e.g. get file system locks). This operation
|
|
|
|
* is always followed either by a call to either .bdrv_set_perm or
|
|
|
|
* .bdrv_abort_perm_update.
|
|
|
|
*
|
|
|
|
* Checks whether the requested set of cumulative permissions in @perm
|
|
|
|
* can be granted for accessing @bs and whether no other users are using
|
|
|
|
* permissions other than those given in @shared (both arguments take
|
|
|
|
* BLK_PERM_* bitmasks).
|
|
|
|
*
|
|
|
|
* If both conditions are met, 0 is returned. Otherwise, -errno is returned
|
|
|
|
* and errp is set to an error describing the conflict.
|
|
|
|
*/
|
|
|
|
int (*bdrv_check_perm)(BlockDriverState *bs, uint64_t perm,
|
|
|
|
uint64_t shared, Error **errp);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Called to inform the driver that the set of cumulative set of used
|
|
|
|
* permissions for @bs has changed to @perm, and the set of sharable
|
|
|
|
* permission to @shared. The driver can use this to propagate changes to
|
|
|
|
* its children (i.e. request permissions only if a parent actually needs
|
|
|
|
* them).
|
|
|
|
*
|
|
|
|
* This function is only invoked after bdrv_check_perm(), so block drivers
|
|
|
|
* may rely on preparations made in their .bdrv_check_perm implementation.
|
|
|
|
*/
|
|
|
|
void (*bdrv_set_perm)(BlockDriverState *bs, uint64_t perm, uint64_t shared);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Called to inform the driver that after a previous bdrv_check_perm()
|
|
|
|
* call, the permission update is not performed and any preparations made
|
|
|
|
* for it (e.g. taken file locks) need to be undone.
|
|
|
|
*
|
|
|
|
* This function can be called even for nodes that never saw a
|
|
|
|
* bdrv_check_perm() call. It is a no-op then.
|
|
|
|
*/
|
|
|
|
void (*bdrv_abort_perm_update)(BlockDriverState *bs);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Returns in @nperm and @nshared the permissions that the driver for @bs
|
|
|
|
* needs on its child @c, based on the cumulative permissions requested by
|
|
|
|
* the parents in @parent_perm and @parent_shared.
|
|
|
|
*
|
|
|
|
* If @c is NULL, return the permissions for attaching a new child for the
|
2020-05-13 14:05:16 +03:00
|
|
|
* given @child_class and @role.
|
2017-09-14 13:47:11 +03:00
|
|
|
*
|
|
|
|
* If @reopen_queue is non-NULL, don't return the currently needed
|
|
|
|
* permissions, but those that will be needed after applying the
|
|
|
|
* @reopen_queue.
|
2016-12-15 15:04:20 +03:00
|
|
|
*/
|
|
|
|
void (*bdrv_child_perm)(BlockDriverState *bs, BdrvChild *c,
|
2020-05-13 14:05:16 +03:00
|
|
|
BdrvChildRole role,
|
2017-09-14 13:47:11 +03:00
|
|
|
BlockReopenQueue *reopen_queue,
|
2016-12-15 15:04:20 +03:00
|
|
|
uint64_t parent_perm, uint64_t parent_shared,
|
|
|
|
uint64_t *nperm, uint64_t *nshared);
|
|
|
|
|
block: Make it easier to learn which BDS support bitmaps
Upcoming patches will enhance bitmap support in qemu-img, but in doing
so, it turns out to be nice to suppress output when persistent bitmaps
make no sense (such as on a qcow2 v2 image). Add a hook to make this
easier to query.
This patch adds a new callback .bdrv_supports_persistent_dirty_bitmap,
rather than trying to shoehorn the answer in via existing callbacks.
In particular, while it might have been possible to overload
.bdrv_co_can_store_new_dirty_bitmap to special-case a NULL input to
answer whether any persistent bitmaps are supported, that is at odds
with whether a particular bitmap can be stored (for example, even on
an image that supports persistent bitmaps but has currently filled up
the maximum number of bitmaps, attempts to store another one should
fail); and the new functionality doesn't require coroutine safety.
Similarly, we could have added one more piece of information to
.bdrv_get_info, but then again, most callers to that function tend to
already discard extraneous information, and making it a catch-all
rather than a series of dedicated scalar queries hasn't really
simplified life.
In the future, when we improve the ability to look up bitmaps through
a filter, we will probably also want to teach the block layer to
automatically let filters pass this request on through.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200513011648.166876-4-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2020-05-13 04:16:42 +03:00
|
|
|
bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
|
2019-09-20 11:25:43 +03:00
|
|
|
bool (*bdrv_co_can_store_new_dirty_bitmap)(BlockDriverState *bs,
|
2019-09-20 11:25:42 +03:00
|
|
|
const char *name,
|
2019-09-20 11:25:43 +03:00
|
|
|
uint32_t granularity,
|
2019-09-20 11:25:42 +03:00
|
|
|
Error **errp);
|
2019-09-20 11:25:43 +03:00
|
|
|
int (*bdrv_co_remove_persistent_dirty_bitmap)(BlockDriverState *bs,
|
|
|
|
const char *name,
|
|
|
|
Error **errp);
|
2017-06-28 15:05:13 +03:00
|
|
|
|
2018-01-16 09:08:56 +03:00
|
|
|
/**
|
|
|
|
* Register/unregister a buffer for I/O. For example, when the driver is
|
|
|
|
* interested to know the memory areas that will later be used in iovs, so
|
|
|
|
* that it can do IOMMU mapping with VFIO etc., in order to get better
|
|
|
|
* performance. In the case of VFIO drivers, this callback is used to do
|
|
|
|
* DMA mapping for hot buffers.
|
|
|
|
*/
|
|
|
|
void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size);
|
|
|
|
void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host);
|
2010-04-13 13:29:33 +04:00
|
|
|
QLIST_ENTRY(BlockDriver) list;
|
2019-02-01 22:29:25 +03:00
|
|
|
|
|
|
|
/* Pointer to a NULL-terminated array of names of strong options
|
|
|
|
* that can be specified for bdrv_open(). A strong option is one
|
|
|
|
* that changes the data of a BDS.
|
|
|
|
* If this pointer is NULL, the array is considered empty.
|
|
|
|
* "filename" and "driver" are always considered strong. */
|
|
|
|
const char *const *strong_runtime_opts;
|
2004-08-02 01:59:26 +04:00
|
|
|
};
|
|
|
|
|
2019-06-04 19:15:06 +03:00
|
|
|
static inline bool block_driver_can_compress(BlockDriver *drv)
|
|
|
|
{
|
|
|
|
return drv->bdrv_co_pwritev_compressed ||
|
|
|
|
drv->bdrv_co_pwritev_compressed_part;
|
|
|
|
}
|
|
|
|
|
2013-10-24 14:06:56 +04:00
|
|
|
typedef struct BlockLimits {
|
2016-06-24 01:37:24 +03:00
|
|
|
/* Alignment requirement, in bytes, for offset/length of I/O
|
|
|
|
* requests. Must be a power of 2 less than INT_MAX; defaults to
|
|
|
|
* 1 for drivers with modern byte interfaces, and to 512
|
|
|
|
* otherwise. */
|
|
|
|
uint32_t request_alignment;
|
|
|
|
|
2016-07-21 22:34:48 +03:00
|
|
|
/* Maximum number of bytes that can be discarded at once (since it
|
|
|
|
* is signed, it must be < 2G, if set). Must be multiple of
|
2016-06-24 01:37:21 +03:00
|
|
|
* pdiscard_alignment, but need not be power of 2. May be 0 if no
|
|
|
|
* inherent 32-bit limit */
|
|
|
|
int32_t max_pdiscard;
|
|
|
|
|
2016-07-21 22:34:48 +03:00
|
|
|
/* Optimal alignment for discard requests in bytes. A power of 2
|
|
|
|
* is best but not mandatory. Must be a multiple of
|
|
|
|
* bl.request_alignment, and must be less than max_pdiscard if
|
|
|
|
* that is set. May be 0 if bl.request_alignment is good enough */
|
2016-06-24 01:37:21 +03:00
|
|
|
uint32_t pdiscard_alignment;
|
2013-10-24 14:06:56 +04:00
|
|
|
|
2016-07-21 22:34:48 +03:00
|
|
|
/* Maximum number of bytes that can zeroized at once (since it is
|
|
|
|
* signed, it must be < 2G, if set). Must be multiple of
|
2016-06-24 01:37:20 +03:00
|
|
|
* pwrite_zeroes_alignment. May be 0 if no inherent 32-bit limit */
|
2016-06-02 00:10:02 +03:00
|
|
|
int32_t max_pwrite_zeroes;
|
2013-10-24 14:06:56 +04:00
|
|
|
|
2016-07-21 22:34:48 +03:00
|
|
|
/* Optimal alignment for write zeroes requests in bytes. A power
|
|
|
|
* of 2 is best but not mandatory. Must be a multiple of
|
|
|
|
* bl.request_alignment, and must be less than max_pwrite_zeroes
|
|
|
|
* if that is set. May be 0 if bl.request_alignment is good
|
|
|
|
* enough */
|
2016-06-02 00:10:02 +03:00
|
|
|
uint32_t pwrite_zeroes_alignment;
|
2013-11-27 14:07:04 +04:00
|
|
|
|
2016-07-21 22:34:48 +03:00
|
|
|
/* Optimal transfer length in bytes. A power of 2 is best but not
|
|
|
|
* mandatory. Must be a multiple of bl.request_alignment, or 0 if
|
|
|
|
* no preferred size */
|
2016-06-24 01:37:19 +03:00
|
|
|
uint32_t opt_transfer;
|
|
|
|
|
2016-07-21 22:34:48 +03:00
|
|
|
/* Maximal transfer length in bytes. Need not be power of 2, but
|
|
|
|
* must be multiple of opt_transfer and bl.request_alignment, or 0
|
|
|
|
* for no 32-bit limit. For now, anything larger than INT_MAX is
|
|
|
|
* clamped down. */
|
2016-06-24 01:37:19 +03:00
|
|
|
uint32_t max_transfer;
|
2014-10-27 12:18:44 +03:00
|
|
|
|
2021-06-03 11:34:23 +03:00
|
|
|
/* Maximal hardware transfer length in bytes. Applies whenever
|
|
|
|
* transfers to the device bypass the kernel I/O scheduler, for
|
|
|
|
* example with SG_IO. If larger than max_transfer or if zero,
|
|
|
|
* blk_get_max_hw_transfer will fall back to max_transfer.
|
|
|
|
*/
|
|
|
|
uint64_t max_hw_transfer;
|
|
|
|
|
2016-06-24 01:37:24 +03:00
|
|
|
/* memory alignment, in bytes so that no bounce buffer is needed */
|
2015-05-12 17:30:55 +03:00
|
|
|
size_t min_mem_alignment;
|
|
|
|
|
2016-06-24 01:37:24 +03:00
|
|
|
/* memory alignment, in bytes, for bounce buffer */
|
2013-11-28 13:23:32 +04:00
|
|
|
size_t opt_mem_alignment;
|
2015-07-09 12:56:44 +03:00
|
|
|
|
|
|
|
/* maximum number of iovec elements */
|
|
|
|
int max_iov;
|
2013-10-24 14:06:56 +04:00
|
|
|
} BlockLimits;
|
|
|
|
|
2014-05-23 17:29:42 +04:00
|
|
|
typedef struct BdrvOpBlocker BdrvOpBlocker;
|
|
|
|
|
2014-06-20 23:57:33 +04:00
|
|
|
typedef struct BdrvAioNotifier {
|
|
|
|
void (*attached_aio_context)(AioContext *new_context, void *opaque);
|
|
|
|
void (*detach_aio_context)(void *opaque);
|
|
|
|
|
|
|
|
void *opaque;
|
2016-06-16 19:56:26 +03:00
|
|
|
bool deleted;
|
2014-06-20 23:57:33 +04:00
|
|
|
|
|
|
|
QLIST_ENTRY(BdrvAioNotifier) list;
|
|
|
|
} BdrvAioNotifier;
|
|
|
|
|
2020-05-13 14:05:13 +03:00
|
|
|
struct BdrvChildClass {
|
2017-03-06 18:20:51 +03:00
|
|
|
/* If true, bdrv_replace_node() doesn't change the node this BdrvChild
|
|
|
|
* points to. */
|
2017-01-17 15:39:34 +03:00
|
|
|
bool stay_at_node;
|
|
|
|
|
2018-05-29 18:17:45 +03:00
|
|
|
/* If true, the parent is a BlockDriverState and bdrv_next_all_states()
|
|
|
|
* will return it. This information is used for drain_all, where every node
|
|
|
|
* will be drained separately, so the drain only needs to be propagated to
|
|
|
|
* non-BDS parents. */
|
|
|
|
bool parent_is_bds;
|
|
|
|
|
2020-05-13 14:05:18 +03:00
|
|
|
void (*inherit_options)(BdrvChildRole role, bool parent_is_format,
|
2020-05-13 14:05:17 +03:00
|
|
|
int *child_flags, QDict *child_options,
|
2015-04-29 18:29:39 +03:00
|
|
|
int parent_flags, QDict *parent_options);
|
2016-03-22 14:05:35 +03:00
|
|
|
|
2016-02-24 17:13:35 +03:00
|
|
|
void (*change_media)(BdrvChild *child, bool load);
|
|
|
|
void (*resize)(BdrvChild *child);
|
|
|
|
|
2016-02-26 12:22:16 +03:00
|
|
|
/* Returns a name that is supposedly more useful for human users than the
|
|
|
|
* node name for identifying the node in question (in particular, a BB
|
|
|
|
* name), or NULL if the parent can't provide a better name. */
|
2017-04-05 18:18:24 +03:00
|
|
|
const char *(*get_name)(BdrvChild *child);
|
2016-02-26 12:22:16 +03:00
|
|
|
|
2017-01-17 17:56:16 +03:00
|
|
|
/* Returns a malloced string that describes the parent of the child for a
|
|
|
|
* human reader. This could be a node-name, BlockBackend name, qdev ID or
|
|
|
|
* QOM path of the device owning the BlockBackend, job type and ID etc. The
|
|
|
|
* caller is responsible for freeing the memory. */
|
2017-04-05 18:18:24 +03:00
|
|
|
char *(*get_parent_desc)(BdrvChild *child);
|
2017-01-17 17:56:16 +03:00
|
|
|
|
2016-03-22 14:05:35 +03:00
|
|
|
/*
|
|
|
|
* If this pair of functions is implemented, the parent doesn't issue new
|
|
|
|
* requests after returning from .drained_begin() until .drained_end() is
|
|
|
|
* called.
|
|
|
|
*
|
2018-06-29 19:01:31 +03:00
|
|
|
* These functions must not change the graph (and therefore also must not
|
|
|
|
* call aio_poll(), which could change the graph indirectly).
|
|
|
|
*
|
block: Do not poll in bdrv_do_drained_end()
We should never poll anywhere in bdrv_do_drained_end() (including its
recursive callees like bdrv_drain_invoke()), because it does not cope
well with graph changes. In fact, it has been written based on the
postulation that no graph changes will happen in it.
Instead, the callers that want to poll must poll, i.e. all currently
globally available wrappers: bdrv_drained_end(),
bdrv_subtree_drained_end(), bdrv_unapply_subtree_drain(), and
bdrv_drain_all_end(). Graph changes there do not matter.
They can poll simply by passing a pointer to a drained_end_counter and
wait until it reaches 0.
This patch also adds a non-polling global wrapper for
bdrv_do_drained_end() that takes a drained_end_counter pointer. We need
such a variant because now no function called anywhere from
bdrv_do_drained_end() must poll. This includes
BdrvChildRole.drained_end(), which already must not poll according to
its interface documentation, but bdrv_child_cb_drained_end() just
violates that by invoking bdrv_drained_end() (which does poll).
Therefore, BdrvChildRole.drained_end() must take a *drained_end_counter
parameter, which bdrv_child_cb_drained_end() can pass on to the new
bdrv_drained_end_no_poll() function.
Note that we now have a pattern of all drained_end-related functions
either polling or receiving a *drained_end_counter to let the caller
poll based on that.
A problem with a single poll loop is that when the drained section in
bdrv_set_aio_context_ignore() ends, some nodes in the subgraph may be in
the old contexts, while others are in the new context already. To let
the collective poll in bdrv_drained_end() work correctly, we must not
hold a lock to the old context, so that the old context can make
progress in case it is different from the current context.
(In the process, remove the comment saying that the current context is
always the old context, because it is wrong.)
In all other places, all nodes in a subtree must be in the same context,
so we can just poll that. The exception of course is
bdrv_drain_all_end(), but that always runs in the main context, so we
can just poll NULL (like bdrv_drain_all_begin() does).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-19 12:26:14 +03:00
|
|
|
* If drained_end() schedules background operations, it must atomically
|
|
|
|
* increment *drained_end_counter for each such operation and atomically
|
|
|
|
* decrement it once the operation has settled.
|
|
|
|
*
|
2016-03-22 14:05:35 +03:00
|
|
|
* Note that this can be nested. If drained_begin() was called twice, new
|
|
|
|
* I/O is allowed only after drained_end() was called twice, too.
|
|
|
|
*/
|
|
|
|
void (*drained_begin)(BdrvChild *child);
|
block: Do not poll in bdrv_do_drained_end()
We should never poll anywhere in bdrv_do_drained_end() (including its
recursive callees like bdrv_drain_invoke()), because it does not cope
well with graph changes. In fact, it has been written based on the
postulation that no graph changes will happen in it.
Instead, the callers that want to poll must poll, i.e. all currently
globally available wrappers: bdrv_drained_end(),
bdrv_subtree_drained_end(), bdrv_unapply_subtree_drain(), and
bdrv_drain_all_end(). Graph changes there do not matter.
They can poll simply by passing a pointer to a drained_end_counter and
wait until it reaches 0.
This patch also adds a non-polling global wrapper for
bdrv_do_drained_end() that takes a drained_end_counter pointer. We need
such a variant because now no function called anywhere from
bdrv_do_drained_end() must poll. This includes
BdrvChildRole.drained_end(), which already must not poll according to
its interface documentation, but bdrv_child_cb_drained_end() just
violates that by invoking bdrv_drained_end() (which does poll).
Therefore, BdrvChildRole.drained_end() must take a *drained_end_counter
parameter, which bdrv_child_cb_drained_end() can pass on to the new
bdrv_drained_end_no_poll() function.
Note that we now have a pattern of all drained_end-related functions
either polling or receiving a *drained_end_counter to let the caller
poll based on that.
A problem with a single poll loop is that when the drained section in
bdrv_set_aio_context_ignore() ends, some nodes in the subgraph may be in
the old contexts, while others are in the new context already. To let
the collective poll in bdrv_drained_end() work correctly, we must not
hold a lock to the old context, so that the old context can make
progress in case it is different from the current context.
(In the process, remove the comment saying that the current context is
always the old context, because it is wrong.)
In all other places, all nodes in a subtree must be in the same context,
so we can just poll that. The exception of course is
bdrv_drain_all_end(), but that always runs in the main context, so we
can just poll NULL (like bdrv_drain_all_begin() does).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-19 12:26:14 +03:00
|
|
|
void (*drained_end)(BdrvChild *child, int *drained_end_counter);
|
2017-02-08 13:28:52 +03:00
|
|
|
|
2018-03-22 16:11:20 +03:00
|
|
|
/*
|
|
|
|
* Returns whether the parent has pending requests for the child. This
|
|
|
|
* callback is polled after .drained_begin() has been called until all
|
|
|
|
* activity on the child has stopped.
|
|
|
|
*/
|
|
|
|
bool (*drained_poll)(BdrvChild *child);
|
|
|
|
|
2017-05-04 19:52:38 +03:00
|
|
|
/* Notifies the parent that the child has been activated/inactivated (e.g.
|
|
|
|
* when migration is completing) and it can start/stop requesting
|
|
|
|
* permissions and doing I/O on it. */
|
2017-05-04 19:52:37 +03:00
|
|
|
void (*activate)(BdrvChild *child, Error **errp);
|
2017-05-04 19:52:38 +03:00
|
|
|
int (*inactivate)(BdrvChild *child);
|
2017-05-04 19:52:37 +03:00
|
|
|
|
2017-02-08 13:28:52 +03:00
|
|
|
void (*attach)(BdrvChild *child);
|
|
|
|
void (*detach)(BdrvChild *child);
|
2017-06-29 20:32:21 +03:00
|
|
|
|
|
|
|
/* Notifies the parent that the filename of its child has changed (e.g.
|
|
|
|
* because the direct child was removed from the backing chain), so that it
|
|
|
|
* can update its reference. */
|
|
|
|
int (*update_filename)(BdrvChild *child, BlockDriverState *new_base,
|
|
|
|
const char *filename, Error **errp);
|
2019-05-06 20:17:56 +03:00
|
|
|
|
|
|
|
bool (*can_set_aio_ctx)(BdrvChild *child, AioContext *ctx,
|
|
|
|
GSList **ignore, Error **errp);
|
2019-05-06 20:17:59 +03:00
|
|
|
void (*set_aio_ctx)(BdrvChild *child, AioContext *ctx, GSList **ignore);
|
2021-04-28 18:17:33 +03:00
|
|
|
|
|
|
|
AioContext *(*get_parent_aio_context)(BdrvChild *child);
|
2015-04-08 14:43:47 +03:00
|
|
|
};
|
|
|
|
|
2020-05-13 14:05:24 +03:00
|
|
|
extern const BdrvChildClass child_of_bds;
|
2015-04-08 14:43:47 +03:00
|
|
|
|
2015-06-15 14:24:19 +03:00
|
|
|
struct BdrvChild {
|
2015-04-08 14:49:41 +03:00
|
|
|
BlockDriverState *bs;
|
2015-04-27 14:46:22 +03:00
|
|
|
char *name;
|
2020-05-13 14:05:13 +03:00
|
|
|
const BdrvChildClass *klass;
|
2020-05-13 14:05:15 +03:00
|
|
|
BdrvChildRole role;
|
2016-02-24 17:13:35 +03:00
|
|
|
void *opaque;
|
2016-12-14 19:24:36 +03:00
|
|
|
|
|
|
|
/**
|
|
|
|
* Granted permissions for operating on this BdrvChild (BLK_PERM_* bitmask)
|
|
|
|
*/
|
|
|
|
uint64_t perm;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Permissions that can still be granted to other users of @bs while this
|
|
|
|
* BdrvChild is still attached to it. (BLK_PERM_* bitmask)
|
|
|
|
*/
|
|
|
|
uint64_t shared_perm;
|
|
|
|
|
2019-03-12 19:48:40 +03:00
|
|
|
/*
|
|
|
|
* This link is frozen: the child can neither be replaced nor
|
|
|
|
* detached from the parent.
|
|
|
|
*/
|
|
|
|
bool frozen;
|
|
|
|
|
block: Introduce BdrvChild.parent_quiesce_counter
Commit 5cb2737e925042e6c7cd3fb0b01313950b03cddf laid out why
bdrv_do_drained_end() must decrement the quiesce_counter after
bdrv_drain_invoke(). It did not give a very good reason why it has to
happen after bdrv_parent_drained_end(), instead only claiming symmetry
to bdrv_do_drained_begin().
It turns out that delaying it for so long is wrong.
Situation: We have an active commit job (i.e. a mirror job) from top to
base for the following graph:
filter
|
[file]
|
v
top --[backing]--> base
Now the VM is closed, which results in the job being cancelled and a
bdrv_drain_all() happening pretty much simultaneously.
Beginning the drain means the job is paused once whenever one of its
nodes is quiesced. This is reversed when the drain ends.
With how the code currently is, after base's drain ends (which means
that it will have unpaused the job once), its quiesce_counter remains at
1 while it goes to undrain its parents (bdrv_parent_drained_end()). For
some reason or another, undraining filter causes the job to be kicked
and enter mirror_exit_common(), where it proceeds to invoke
block_job_remove_all_bdrv().
Now base will be detached from the job. Because its quiesce_counter is
still 1, it will unpause the job once more. So in total, undraining
base will unpause the job twice. Eventually, this will lead to the
job's pause_count going negative -- well, it would, were there not an
assertion against this, which crashes qemu.
The general problem is that if in bdrv_parent_drained_end() we undrain
parent A, and then undrain parent B, which then leads to A detaching the
child, bdrv_replace_child_noperm() will undrain A as if we had not done
so yet; that is, one time too many.
It follows that we cannot decrement the quiesce_counter after invoking
bdrv_parent_drained_end().
Unfortunately, decrementing it before bdrv_parent_drained_end() would be
wrong, too. Imagine the above situation in reverse: Undraining A leads
to B detaching the child. If we had already decremented the
quiesce_counter by that point, bdrv_replace_child_noperm() would undrain
B one time too little; because it expects bdrv_parent_drained_end() to
issue this undrain. But bdrv_parent_drained_end() won't do that,
because B is no longer a parent.
Therefore, we have to do something else. This patch opts for
introducing a second quiesce_counter that counts how many times a
child's parent has been quiesced (though c->role->drained_*). With
that, bdrv_replace_child_noperm() just has to undrain the parent exactly
that many times when removing a child, and it will always be right.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-19 12:26:09 +03:00
|
|
|
/*
|
|
|
|
* How many times the parent of this child has been drained
|
2020-05-13 14:05:13 +03:00
|
|
|
* (through klass->drained_*).
|
block: Introduce BdrvChild.parent_quiesce_counter
Commit 5cb2737e925042e6c7cd3fb0b01313950b03cddf laid out why
bdrv_do_drained_end() must decrement the quiesce_counter after
bdrv_drain_invoke(). It did not give a very good reason why it has to
happen after bdrv_parent_drained_end(), instead only claiming symmetry
to bdrv_do_drained_begin().
It turns out that delaying it for so long is wrong.
Situation: We have an active commit job (i.e. a mirror job) from top to
base for the following graph:
filter
|
[file]
|
v
top --[backing]--> base
Now the VM is closed, which results in the job being cancelled and a
bdrv_drain_all() happening pretty much simultaneously.
Beginning the drain means the job is paused once whenever one of its
nodes is quiesced. This is reversed when the drain ends.
With how the code currently is, after base's drain ends (which means
that it will have unpaused the job once), its quiesce_counter remains at
1 while it goes to undrain its parents (bdrv_parent_drained_end()). For
some reason or another, undraining filter causes the job to be kicked
and enter mirror_exit_common(), where it proceeds to invoke
block_job_remove_all_bdrv().
Now base will be detached from the job. Because its quiesce_counter is
still 1, it will unpause the job once more. So in total, undraining
base will unpause the job twice. Eventually, this will lead to the
job's pause_count going negative -- well, it would, were there not an
assertion against this, which crashes qemu.
The general problem is that if in bdrv_parent_drained_end() we undrain
parent A, and then undrain parent B, which then leads to A detaching the
child, bdrv_replace_child_noperm() will undrain A as if we had not done
so yet; that is, one time too many.
It follows that we cannot decrement the quiesce_counter after invoking
bdrv_parent_drained_end().
Unfortunately, decrementing it before bdrv_parent_drained_end() would be
wrong, too. Imagine the above situation in reverse: Undraining A leads
to B detaching the child. If we had already decremented the
quiesce_counter by that point, bdrv_replace_child_noperm() would undrain
B one time too little; because it expects bdrv_parent_drained_end() to
issue this undrain. But bdrv_parent_drained_end() won't do that,
because B is no longer a parent.
Therefore, we have to do something else. This patch opts for
introducing a second quiesce_counter that counts how many times a
child's parent has been quiesced (though c->role->drained_*). With
that, bdrv_replace_child_noperm() just has to undrain the parent exactly
that many times when removing a child, and it will always be right.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-19 12:26:09 +03:00
|
|
|
* Usually, this is equal to bs->quiesce_counter (potentially
|
|
|
|
* reduced by bdrv_drain_all_count). It may differ while the
|
|
|
|
* child is entering or leaving a drained section.
|
|
|
|
*/
|
|
|
|
int parent_quiesce_counter;
|
|
|
|
|
2015-04-08 14:49:41 +03:00
|
|
|
QLIST_ENTRY(BdrvChild) next;
|
2015-09-17 14:18:23 +03:00
|
|
|
QLIST_ENTRY(BdrvChild) next_parent;
|
2015-06-15 14:24:19 +03:00
|
|
|
};
|
2015-04-08 14:49:41 +03:00
|
|
|
|
block: block-status cache for data regions
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform. The
main difference is that this time it is implemented as part of the
general block layer code.
The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the
implementation is outside of qemu, there is little we can do about its
performance.
We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts. So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.
To address this, we want to cache data regions. Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.
(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine. While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)
We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210812084148.14458-3-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[hreitz: Added `local_file == bs` assertion, as suggested by Vladimir]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-08-12 11:41:44 +03:00
|
|
|
/*
|
|
|
|
* Allows bdrv_co_block_status() to cache one data region for a
|
|
|
|
* protocol node.
|
|
|
|
*
|
|
|
|
* @valid: Whether the cache is valid (should be accessed with atomic
|
|
|
|
* functions so this can be reset by RCU readers)
|
|
|
|
* @data_start: Offset where we know (or strongly assume) is data
|
|
|
|
* @data_end: Offset where the data region ends (which is not necessarily
|
|
|
|
* the start of a zeroed region)
|
|
|
|
*/
|
|
|
|
typedef struct BdrvBlockStatusCache {
|
|
|
|
struct rcu_head rcu;
|
|
|
|
|
|
|
|
bool valid;
|
|
|
|
int64_t data_start;
|
|
|
|
int64_t data_end;
|
|
|
|
} BdrvBlockStatusCache;
|
|
|
|
|
2004-08-02 01:59:26 +04:00
|
|
|
struct BlockDriverState {
|
2017-02-13 16:52:35 +03:00
|
|
|
/* Protected by big QEMU lock or read-only after opening. No special
|
|
|
|
* locking needed during I/O...
|
|
|
|
*/
|
2010-02-14 14:39:18 +03:00
|
|
|
int open_flags; /* flags used to open the file, re-used for re-open */
|
2016-06-24 01:37:26 +03:00
|
|
|
bool encrypted; /* if true, the media is encrypted */
|
|
|
|
bool sg; /* if true, the device is a /dev/sg* */
|
|
|
|
bool probed; /* if true, format was probed rather than specified */
|
2017-05-02 19:35:37 +03:00
|
|
|
bool force_share; /* if true, always allow all shared permissions */
|
2017-07-18 18:24:05 +03:00
|
|
|
bool implicit; /* if true, this filter node was automatically inserted */
|
2016-06-24 01:37:26 +03:00
|
|
|
|
2006-08-19 15:45:59 +04:00
|
|
|
BlockDriver *drv; /* NULL means no media */
|
2004-08-02 01:59:26 +04:00
|
|
|
void *opaque;
|
|
|
|
|
2014-05-08 18:34:37 +04:00
|
|
|
AioContext *aio_context; /* event loop used for fd handlers, timers, etc */
|
2014-06-20 23:57:33 +04:00
|
|
|
/* long-running tasks intended to always use the same AioContext as this
|
|
|
|
* BDS may register themselves in this list to be notified of changes
|
|
|
|
* regarding this BDS's context */
|
|
|
|
QLIST_HEAD(, BdrvAioNotifier) aio_notifiers;
|
2016-06-16 19:56:26 +03:00
|
|
|
bool walking_aio_notifiers; /* to make removal during iteration safe */
|
2014-05-08 18:34:37 +04:00
|
|
|
|
2015-01-22 16:03:30 +03:00
|
|
|
char filename[PATH_MAX];
|
block: Leave BDS.backing_{file,format} constant
Parts of the block layer treat BDS.backing_file as if it were whatever
the image header says (i.e., if it is a relative path, it is relative to
the overlay), other parts treat it like a cache for
bs->backing->bs->filename (relative paths are relative to the CWD).
Considering bs->backing->bs->filename exists, let us make it mean the
former.
Among other things, this now allows the user to specify a base when
using qemu-img to commit an image file in a directory that is not the
CWD (assuming, everything uses relative filenames).
Before this patch:
$ ./qemu-img create -f qcow2 foo/bot.qcow2 1M
$ ./qemu-img create -f qcow2 -b bot.qcow2 foo/mid.qcow2
$ ./qemu-img create -f qcow2 -b mid.qcow2 foo/top.qcow2
$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find '[...]/foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
After this patch:
$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
Image committed.
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
Image committed.
With this change, bdrv_find_backing_image() must look at whether the
user has overridden a BDS's backing file. If so, it can no longer use
bs->backing_file, but must instead compare the given filename against
the backing node's filename directly.
Note that this changes the QAPI output for a node's backing_file. We
had very inconsistent output there (sometimes what the image header
said, sometimes the actual filename of the backing image). This
inconsistent output was effectively useless, so we have to decide one
way or the other. Considering that bs->backing_file usually at runtime
contained the path to the image relative to qemu's CWD (or absolute),
this patch changes QAPI's backing_file to always report the
bs->backing->bs->filename from now on. If you want to receive the image
header information, you have to refer to full-backing-filename.
This necessitates a change to iotest 228. The interesting information
it really wanted is the image header, and it can get that now, but it
has to use full-backing-filename instead of backing_file. Because of
this patch's changes to bs->backing_file's behavior, we also need some
reference output changes.
Along with the changes to bs->backing_file, stop updating
BDS.backing_format in bdrv_backing_attach() as well. This way,
ImageInfo's backing-filename and backing-filename-format fields will
represent what the image header says and nothing else.
iotest 245 changes in behavior: With the backing node no longer
overriding the parent node's backing_file string, you can now omit the
@backing option when reopening a node with neither a default nor a
current backing file even if it used to have a backing node at some
point.
273 also changes: The base image is opened without a format layer, so
ImageInfo.backing-filename-format used to report "file" for the base
image's overlay after blockdev-snapshot. However, the image header
never says "file" anywhere, so it now reports $IMGFMT.
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-08-01 21:34:11 +03:00
|
|
|
/*
|
|
|
|
* If not empty, this image is a diff in relation to backing_file.
|
|
|
|
* Note that this is the name given in the image header and
|
|
|
|
* therefore may or may not be equal to .backing->bs->filename.
|
|
|
|
* If this field contains a relative path, it is to be resolved
|
|
|
|
* relatively to the overlay's location.
|
|
|
|
*/
|
|
|
|
char backing_file[PATH_MAX];
|
|
|
|
/*
|
|
|
|
* The backing filename indicated by the image header. Contrary
|
|
|
|
* to backing_file, if we ever open this file, auto_backing_file
|
|
|
|
* is replaced by the resulting BDS's filename (i.e. after a
|
|
|
|
* bdrv_refresh_filename() run).
|
|
|
|
*/
|
block: Add BDS.auto_backing_file
If the backing file is overridden, this most probably does change the
guest-visible data of a BDS. Therefore, we will need to consider this
in bdrv_refresh_filename().
To see whether it has been overridden, we might want to compare
bs->backing_file and bs->backing->bs->filename. However,
bs->backing_file is changed by bdrv_set_backing_hd() (which is just used
to change the backing child at runtime, without modifying the image
header), so bs->backing_file most of the time simply contains a copy of
bs->backing->bs->filename anyway, so it is useless for such a
comparison.
This patch adds an auto_backing_file BDS field which contains the
backing file path as indicated by the image header, which is not changed
by bdrv_set_backing_hd().
Because of bdrv_refresh_filename() magic, however, a BDS's filename may
differ from what has been specified during bdrv_open(). Then, the
comparison between bs->auto_backing_file and bs->backing->bs->filename
may fail even though bs->backing was opened from bs->auto_backing_file.
To mitigate this, we can copy the real BDS's filename (after the whole
bdrv_open() and bdrv_refresh_filename() process) into
bs->auto_backing_file, if we know the former has been opened based on
the latter. This is only possible if no options modifying the backing
file's behavior have been specified, though. To simplify things, this
patch only copies the filename from the backing file if no options have
been specified for it at all.
Furthermore, there are cases where an overlay is created by qemu which
already contains a BDS's filename (e.g. in blockdev-snapshot-sync). We
do not need to worry about updating the overlay's bs->auto_backing_file
there, because we actually wrote a post-bdrv_refresh_filename() filename
into the image header.
So all in all, there will be false negatives where (as of a future
patch) bdrv_refresh_filename() will assume that the backing file differs
from what was specified in the image header, even though it really does
not. However, these cases should be limited to where (1) the user
actually did override something in the backing chain (e.g. by specifying
options for the backing file), or (2) the user executed a QMP command to
change some node's backing file (e.g. change-backing-file or
block-commit with @backing-file given) where the given filename does not
happen to coincide with qemu's idea of the backing BDS's filename.
Then again, (1) really is limited to -drive. With -blockdev or
blockdev-add, you have to adhere to the schema, so a user cannot give
partial "unimportant" options (e.g. by just setting backing.node-name
and leaving the rest to the image header). Therefore, trying to fix
this would mean trying to fix something for -drive only.
To improve on (2), we would need a full infrastructure to "canonicalize"
an arbitrary filename (+ options), so it can be compared against
another. That seems a bit over the top, considering that filenames
nowadays are there mostly for the user's entertainment.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-5-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-01 22:29:08 +03:00
|
|
|
char auto_backing_file[PATH_MAX];
|
2009-03-28 20:55:10 +03:00
|
|
|
char backing_format[16]; /* if non-zero and backing_file exists */
|
2006-08-19 15:45:59 +04:00
|
|
|
|
2014-07-18 22:24:56 +04:00
|
|
|
QDict *full_open_options;
|
2015-01-22 16:03:30 +03:00
|
|
|
char exact_filename[PATH_MAX];
|
2014-07-18 22:24:56 +04:00
|
|
|
|
2015-06-17 15:55:21 +03:00
|
|
|
BdrvChild *backing;
|
2015-06-16 15:19:22 +03:00
|
|
|
BdrvChild *file;
|
2010-04-14 16:17:38 +04:00
|
|
|
|
2013-10-24 14:06:56 +04:00
|
|
|
/* I/O Limits */
|
|
|
|
BlockLimits bl;
|
|
|
|
|
2020-12-16 09:16:57 +03:00
|
|
|
/*
|
|
|
|
* Flags honored during pread
|
|
|
|
*/
|
|
|
|
unsigned int supported_read_flags;
|
2018-05-02 17:03:59 +03:00
|
|
|
/* Flags honored during pwrite (so far: BDRV_REQ_FUA,
|
|
|
|
* BDRV_REQ_WRITE_UNCHANGED).
|
|
|
|
* If a driver does not support BDRV_REQ_WRITE_UNCHANGED, those
|
|
|
|
* writes will be issued as normal writes without the flag set.
|
|
|
|
* This is important to note for drivers that do not explicitly
|
|
|
|
* request a WRITE permission for their children and instead take
|
|
|
|
* the same permissions as their parent did (this is commonly what
|
|
|
|
* block filters do). Such drivers have to be aware that the
|
|
|
|
* parent may have taken a WRITE_UNCHANGED permission only and is
|
|
|
|
* issuing such requests. Drivers either must make sure that
|
|
|
|
* these requests do not result in plain WRITE accesses (usually
|
|
|
|
* by supporting BDRV_REQ_WRITE_UNCHANGED, and then forwarding
|
|
|
|
* every incoming write request as-is, including potentially that
|
|
|
|
* flag), or they have to explicitly take the WRITE permission for
|
|
|
|
* their children. */
|
2016-05-04 01:39:06 +03:00
|
|
|
unsigned int supported_write_flags;
|
2016-06-02 00:10:03 +03:00
|
|
|
/* Flags honored during pwrite_zeroes (so far: BDRV_REQ_FUA,
|
2018-05-02 17:03:59 +03:00
|
|
|
* BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED) */
|
block: Honor BDRV_REQ_FUA during write_zeroes
The block layer has a couple of cases where it can lose
Force Unit Access semantics when writing a large block of
zeroes, such that the request returns before the zeroes
have been guaranteed to land on underlying media.
SCSI does not support FUA during WRITESAME(10/16); FUA is only
supported if it falls back to WRITE(10/16). But where the
underlying device is new enough to not need a fallback, it
means that any upper layer request with FUA semantics was
silently ignoring BDRV_REQ_FUA.
Conversely, NBD has situations where it can support FUA but not
ZERO_WRITE; when that happens, the generic block layer fallback
to bdrv_driver_pwritev() (or the older bdrv_co_writev() in qemu
2.6) was losing the FUA flag.
The problem of losing flags unrelated to ZERO_WRITE has been
latent in bdrv_co_do_write_zeroes() since commit aa7bfbff, but
back then, it did not matter because there was no FUA flag. It
became observable when commit 93f5e6d8 paved the way for flags
that can impact correctness, when we should have been using
bdrv_co_writev_flags() with modified flags. Compare to commit
9eeb6dd, which got flag manipulation right in
bdrv_co_do_zero_pwritev().
Symptoms: I tested with qemu-io with default writethrough cache
(which is supposed to use FUA semantics on every write), and
targetted an NBD client connected to a server that intentionally
did not advertise NBD_FLAG_SEND_FUA. When doing 'write 0 512',
the NBD client sent two operations (NBD_CMD_WRITE then
NBD_CMD_FLUSH) to get the fallback FUA semantics; but when doing
'write -z 0 512', the NBD client sent only NBD_CMD_WRITE.
The fix is do to a cleanup bdrv_co_flush() at the end of the
operation if any step in the middle relied on a BDS that does
not natively support FUA for that step (note that we don't
need to flush after every operation, if the operation is broken
into chunks based on bounce-buffer sizing). Each BDS gains a
new flag .supported_zero_flags, which parallels the use of
.supported_write_flags but only when accessing a zero write
operation (the flags MUST be different, because of SCSI having
different semantics based on WRITE vs. WRITESAME; and also
because BDRV_REQ_MAY_UNMAP only makes sense on zero writes).
Also fix some documentation to describe -ENOTSUP semantics,
particularly since iscsi depends on those semantics.
Down the road, we may want to add a driver where its
.bdrv_co_pwritev() honors all three of BDRV_REQ_FUA,
BDRV_REQ_ZERO_WRITE, and BDRV_REQ_MAY_UNMAP, and advertise
this via bs->supported_write_flags for blocks opened by that
driver; such a driver should NOT supply .bdrv_co_write_zeroes
nor .supported_zero_flags. But none of the drivers touched
in this patch want to do that (the act of writing zeroes is
different enough from normal writes to deserve a second
callback).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-04 01:39:07 +03:00
|
|
|
unsigned int supported_zero_flags;
|
2020-04-24 15:54:39 +03:00
|
|
|
/*
|
|
|
|
* Flags honoured during truncate (so far: BDRV_REQ_ZERO_WRITE).
|
|
|
|
*
|
|
|
|
* If BDRV_REQ_ZERO_WRITE is given, the truncate operation must make sure
|
|
|
|
* that any added space reads as all zeros. If this can't be guaranteed,
|
|
|
|
* the operation must fail.
|
|
|
|
*/
|
|
|
|
unsigned int supported_truncate_flags;
|
2011-11-29 15:42:20 +04:00
|
|
|
|
2014-01-24 00:31:32 +04:00
|
|
|
/* the following member gives a name to every node on the bs graph. */
|
|
|
|
char node_name[32];
|
|
|
|
/* element of the list of named nodes building the graph */
|
|
|
|
QTAILQ_ENTRY(BlockDriverState) node_list;
|
2016-01-29 18:36:11 +03:00
|
|
|
/* element of the list of all BlockDriverStates (all_bdrv_states) */
|
|
|
|
QTAILQ_ENTRY(BlockDriverState) bs_list;
|
2016-01-29 18:36:12 +03:00
|
|
|
/* element of the list of monitor-owned BDS */
|
|
|
|
QTAILQ_ENTRY(BlockDriverState) monitor_list;
|
2013-08-23 05:14:46 +04:00
|
|
|
int refcnt;
|
2011-11-17 17:40:27 +04:00
|
|
|
|
2014-05-23 17:29:42 +04:00
|
|
|
/* operation blockers */
|
|
|
|
QLIST_HEAD(, BdrvOpBlocker) op_blockers[BLOCK_OP_TYPE_MAX];
|
|
|
|
|
2015-04-09 19:47:50 +03:00
|
|
|
/* The node that this node inherited default options from (and a reopen on
|
|
|
|
* which can affect this node by changing these defaults). This is always a
|
|
|
|
* parent node of this node. */
|
|
|
|
BlockDriverState *inherits_from;
|
2015-04-08 14:49:41 +03:00
|
|
|
QLIST_HEAD(, BdrvChild) children;
|
2015-09-17 14:18:23 +03:00
|
|
|
QLIST_HEAD(, BdrvChild) parents;
|
2015-04-08 14:49:41 +03:00
|
|
|
|
2013-03-15 13:35:02 +04:00
|
|
|
QDict *options;
|
2015-05-08 17:15:03 +03:00
|
|
|
QDict *explicit_options;
|
2014-05-18 02:58:19 +04:00
|
|
|
BlockdevDetectZeroesOptions detect_zeroes;
|
2014-05-23 17:29:47 +04:00
|
|
|
|
|
|
|
/* The error object in use for blocking operations on backing_hd */
|
|
|
|
Error *backing_blocker;
|
block: add event when disk usage exceeds threshold
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device, and automatically extends the image once the threshold
is reached or exceeded.
In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.
To fix this, this patch adds:
* A new monitor command `block-set-write-threshold', to set a mark for
a given block device.
* A new event `BLOCK_WRITE_THRESHOLD', to report if a block device
usage exceeds the threshold.
* A new `write_threshold' field into the `BlockDeviceInfo' structure,
to report the configured threshold.
This will allow the managing application to use smarter and more
efficient monitoring, greatly reducing the need of polling.
[Updated qemu-iotests 067 output to add the new 'write_threshold'
property. --Stefan]
[Changed g_assert_false() to !g_assert() to fix the build on older glib
versions. --Kevin]
Signed-off-by: Francesco Romani <fromani@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1421068273-692-1-git-send-email-fromani@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-01-12 16:11:13 +03:00
|
|
|
|
2017-02-13 16:52:35 +03:00
|
|
|
/* Protected by AioContext lock */
|
|
|
|
|
|
|
|
/* If we are reading a disk image, give its size in sectors.
|
2017-04-20 15:25:55 +03:00
|
|
|
* Generally read-only; it is written to by load_snapshot and
|
|
|
|
* save_snaphost, but the block layer is quiescent during those.
|
2017-02-13 16:52:35 +03:00
|
|
|
*/
|
|
|
|
int64_t total_sectors;
|
|
|
|
|
block: add event when disk usage exceeds threshold
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device, and automatically extends the image once the threshold
is reached or exceeded.
In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.
To fix this, this patch adds:
* A new monitor command `block-set-write-threshold', to set a mark for
a given block device.
* A new event `BLOCK_WRITE_THRESHOLD', to report if a block device
usage exceeds the threshold.
* A new `write_threshold' field into the `BlockDeviceInfo' structure,
to report the configured threshold.
This will allow the managing application to use smarter and more
efficient monitoring, greatly reducing the need of polling.
[Updated qemu-iotests 067 output to add the new 'write_threshold'
property. --Stefan]
[Changed g_assert_false() to !g_assert() to fix the build on older glib
versions. --Kevin]
Signed-off-by: Francesco Romani <fromani@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1421068273-692-1-git-send-email-fromani@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-01-12 16:11:13 +03:00
|
|
|
/* threshold limit for writes, in bytes. "High water mark". */
|
|
|
|
uint64_t write_threshold_offset;
|
2015-10-23 06:08:09 +03:00
|
|
|
|
2017-06-05 15:39:03 +03:00
|
|
|
/* Writing to the list requires the BQL _and_ the dirty_bitmap_mutex.
|
|
|
|
* Reading from the list can be done with either the BQL or the
|
2017-06-05 15:39:05 +03:00
|
|
|
* dirty_bitmap_mutex. Modifying a bitmap only requires
|
|
|
|
* dirty_bitmap_mutex. */
|
2017-06-05 15:39:03 +03:00
|
|
|
QemuMutex dirty_bitmap_mutex;
|
2017-02-13 16:52:35 +03:00
|
|
|
QLIST_HEAD(, BdrvDirtyBitmap) dirty_bitmaps;
|
|
|
|
|
2017-06-05 15:39:00 +03:00
|
|
|
/* Offset after the highest byte written to */
|
|
|
|
Stat64 wr_highest_offset;
|
|
|
|
|
2017-06-05 15:38:50 +03:00
|
|
|
/* If true, copy read backing sectors into image. Can be >1 if more
|
|
|
|
* than one client has requested copy-on-read. Accessed with atomic
|
|
|
|
* ops.
|
|
|
|
*/
|
|
|
|
int copy_on_read;
|
|
|
|
|
2017-06-05 15:38:53 +03:00
|
|
|
/* number of in-flight requests; overall and serialising.
|
|
|
|
* Accessed with atomic ops.
|
|
|
|
*/
|
|
|
|
unsigned int in_flight;
|
|
|
|
unsigned int serialising_in_flight;
|
|
|
|
|
2017-06-05 15:38:55 +03:00
|
|
|
/* counter for nested bdrv_io_plug.
|
|
|
|
* Accessed with atomic ops.
|
|
|
|
*/
|
|
|
|
unsigned io_plugged;
|
|
|
|
|
2017-02-13 16:52:35 +03:00
|
|
|
/* do we need to tell the quest if we have a volatile write cache? */
|
|
|
|
int enable_write_cache;
|
|
|
|
|
2017-06-05 15:38:51 +03:00
|
|
|
/* Accessed with atomic ops. */
|
2015-10-23 06:08:09 +03:00
|
|
|
int quiesce_counter;
|
2017-12-18 18:05:48 +03:00
|
|
|
int recursive_quiesce_counter;
|
|
|
|
|
2017-06-05 15:39:01 +03:00
|
|
|
unsigned int write_gen; /* Current data generation */
|
2017-06-05 15:39:02 +03:00
|
|
|
|
|
|
|
/* Protected by reqs_lock. */
|
|
|
|
CoMutex reqs_lock;
|
|
|
|
QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
|
|
|
|
CoQueue flush_queue; /* Serializing flush queue */
|
|
|
|
bool active_flush_req; /* Flush request in flight? */
|
|
|
|
|
|
|
|
/* Only read/written by whoever has set active_flush_req to true. */
|
|
|
|
unsigned int flushed_gen; /* Flushed write generation */
|
2019-07-03 20:28:02 +03:00
|
|
|
|
|
|
|
/* BdrvChild links to this node may never be frozen */
|
|
|
|
bool never_freeze;
|
block: block-status cache for data regions
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform. The
main difference is that this time it is implemented as part of the
general block layer code.
The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the
implementation is outside of qemu, there is little we can do about its
performance.
We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts. So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.
To address this, we want to cache data regions. Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.
(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine. While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)
We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210812084148.14458-3-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[hreitz: Added `local_file == bs` assertion, as suggested by Vladimir]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-08-12 11:41:44 +03:00
|
|
|
|
|
|
|
/* Lock for block-status cache RCU writers */
|
|
|
|
CoMutex bsc_modify_lock;
|
|
|
|
/* Always non-NULL, but must only be dereferenced under an RCU read guard */
|
|
|
|
BdrvBlockStatusCache *block_status_cache;
|
2004-08-02 01:59:26 +04:00
|
|
|
};
|
|
|
|
|
2015-10-19 18:53:24 +03:00
|
|
|
struct BlockBackendRootState {
|
|
|
|
int open_flags;
|
|
|
|
BlockdevDetectZeroesOptions detect_zeroes;
|
|
|
|
};
|
|
|
|
|
block/mirror: Fix target backing BDS
Currently, we are trying to move the backing BDS from the source to the
target in bdrv_replace_in_backing_chain() which is called from
mirror_exit(). However, mirror_complete() already tries to open the
target's backing chain with a call to bdrv_open_backing_file().
First, we should only set the target's backing BDS once. Second, the
mirroring block job has a better idea of what to set it to than the
generic code in bdrv_replace_in_backing_chain() (in fact, the latter's
conditions on when to move the backing BDS from source to target are not
really correct).
Therefore, remove that code from bdrv_replace_in_backing_chain() and
leave it to mirror_complete().
Depending on what kind of mirroring is performed, we furthermore want to
use different strategies to open the target's backing chain:
- If blockdev-mirror is used, we can assume the user made sure that the
target already has the correct backing chain. In particular, we should
not try to open a backing file if the target does not have any yet.
- If drive-mirror with mode=absolute-paths is used, we can and should
reuse the already existing chain of nodes that the source BDS is in.
In case of sync=full, no backing BDS is required; with sync=top, we
just link the source's backing BDS to the target, and with sync=none,
we use the source BDS as the target's backing BDS.
We should not try to open these backing files anew because this would
lead to two BDSs existing per physical file in the backing chain, and
we would like to avoid such concurrent access.
- If drive-mirror with mode=existing is used, we have to use the
information provided in the physical image file which means opening
the target's backing chain completely anew, just as it has been done
already.
If the target's backing chain shares images with the source, this may
lead to multiple BDSs per physical image file. But since we cannot
reliably ascertain this case, there is nothing we can do about it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20160610185750.30956-3-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2016-06-10 21:57:47 +03:00
|
|
|
typedef enum BlockMirrorBackingMode {
|
|
|
|
/* Reuse the existing backing chain from the source for the target.
|
|
|
|
* - sync=full: Set backing BDS to NULL.
|
|
|
|
* - sync=top: Use source's backing BDS.
|
|
|
|
* - sync=none: Use source as the backing BDS. */
|
|
|
|
MIRROR_SOURCE_BACKING_CHAIN,
|
|
|
|
|
|
|
|
/* Open the target's backing chain completely anew */
|
|
|
|
MIRROR_OPEN_BACKING_CHAIN,
|
|
|
|
|
|
|
|
/* Do not change the target's backing BDS after job completion */
|
|
|
|
MIRROR_LEAVE_BACKING_CHAIN,
|
|
|
|
} BlockMirrorBackingMode;
|
|
|
|
|
2014-12-02 20:32:41 +03:00
|
|
|
|
|
|
|
/* Essential block drivers which must always be statically linked into qemu, and
|
|
|
|
* which therefore can be accessed without using bdrv_find_format() */
|
|
|
|
extern BlockDriver bdrv_file;
|
|
|
|
extern BlockDriver bdrv_raw;
|
|
|
|
extern BlockDriver bdrv_qcow2;
|
|
|
|
|
2016-06-20 22:31:46 +03:00
|
|
|
int coroutine_fn bdrv_co_preadv(BdrvChild *child,
|
2020-12-11 21:39:33 +03:00
|
|
|
int64_t offset, int64_t bytes, QEMUIOVector *qiov,
|
2016-03-08 15:47:47 +03:00
|
|
|
BdrvRequestFlags flags);
|
2019-06-04 19:15:11 +03:00
|
|
|
int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
|
block/io: support int64_t bytes in bdrv_co_p{read,write}v_part()
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, prepare bdrv_co_preadv_part() and bdrv_co_pwritev_part() and their
remaining dependencies now.
bdrv_pad_request() is updated simultaneously, as pointer to bytes passed
to it both from bdrv_co_pwritev_part() and bdrv_co_preadv_part().
So, all callers of bdrv_pad_request() are updated to pass 64bit bytes.
bdrv_pad_request() is already good for 64bit requests, add
corresponding assertion.
Look at bdrv_co_preadv_part() and bdrv_co_pwritev_part().
Type is widening, so callers are safe. Let's look inside the functions.
In bdrv_co_preadv_part() and bdrv_aligned_pwritev() we only pass bytes
to other already int64_t interfaces (and some obviously safe
calculations), it's OK.
In bdrv_co_do_zero_pwritev() aligned_bytes may become large now, still
it's passed to bdrv_aligned_pwritev which supports int64_t bytes.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-15-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2020-12-11 21:39:32 +03:00
|
|
|
int64_t offset, int64_t bytes,
|
2019-06-04 19:15:11 +03:00
|
|
|
QEMUIOVector *qiov, size_t qiov_offset, BdrvRequestFlags flags);
|
2016-06-20 22:31:46 +03:00
|
|
|
int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
|
2020-12-11 21:39:33 +03:00
|
|
|
int64_t offset, int64_t bytes, QEMUIOVector *qiov,
|
2016-03-08 15:47:48 +03:00
|
|
|
BdrvRequestFlags flags);
|
2019-06-04 19:15:11 +03:00
|
|
|
int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
|
block/io: support int64_t bytes in bdrv_co_p{read,write}v_part()
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, prepare bdrv_co_preadv_part() and bdrv_co_pwritev_part() and their
remaining dependencies now.
bdrv_pad_request() is updated simultaneously, as pointer to bytes passed
to it both from bdrv_co_pwritev_part() and bdrv_co_preadv_part().
So, all callers of bdrv_pad_request() are updated to pass 64bit bytes.
bdrv_pad_request() is already good for 64bit requests, add
corresponding assertion.
Look at bdrv_co_preadv_part() and bdrv_co_pwritev_part().
Type is widening, so callers are safe. Let's look inside the functions.
In bdrv_co_preadv_part() and bdrv_aligned_pwritev() we only pass bytes
to other already int64_t interfaces (and some obviously safe
calculations), it's OK.
In bdrv_co_do_zero_pwritev() aligned_bytes may become large now, still
it's passed to bdrv_aligned_pwritev which supports int64_t bytes.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-15-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2020-12-11 21:39:32 +03:00
|
|
|
int64_t offset, int64_t bytes,
|
2019-06-04 19:15:11 +03:00
|
|
|
QEMUIOVector *qiov, size_t qiov_offset, BdrvRequestFlags flags);
|
2016-03-08 15:47:47 +03:00
|
|
|
|
2019-04-22 17:58:30 +03:00
|
|
|
static inline int coroutine_fn bdrv_co_pread(BdrvChild *child,
|
|
|
|
int64_t offset, unsigned int bytes, void *buf, BdrvRequestFlags flags)
|
|
|
|
{
|
|
|
|
QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
|
|
|
|
|
|
|
|
return bdrv_co_preadv(child, offset, bytes, &qiov, flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int coroutine_fn bdrv_co_pwrite(BdrvChild *child,
|
|
|
|
int64_t offset, unsigned int bytes, void *buf, BdrvRequestFlags flags)
|
|
|
|
{
|
|
|
|
QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
|
|
|
|
|
|
|
|
return bdrv_co_pwritev(child, offset, bytes, &qiov, flags);
|
|
|
|
}
|
|
|
|
|
2018-03-28 19:29:18 +03:00
|
|
|
extern unsigned int bdrv_drain_all_count;
|
2017-12-18 18:05:48 +03:00
|
|
|
void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
|
|
|
|
void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
|
|
|
|
|
2020-10-21 17:58:43 +03:00
|
|
|
bool coroutine_fn bdrv_make_request_serialising(BdrvTrackedRequest *req,
|
|
|
|
uint64_t align);
|
2019-11-01 18:25:09 +03:00
|
|
|
BdrvTrackedRequest *coroutine_fn bdrv_co_get_self_request(BlockDriverState *bs);
|
2019-11-01 18:25:08 +03:00
|
|
|
|
2012-05-28 11:27:54 +04:00
|
|
|
int get_tmp_filename(char *filename, int size);
|
raw: Prohibit dangerous writes for probed images
If the user neglects to specify the image format, QEMU probes the
image to guess it automatically, for convenience.
Relying on format probing is insecure for raw images (CVE-2008-2004).
If the guest writes a suitable header to the device, the next probe
will recognize a format chosen by the guest. A malicious guest can
abuse this to gain access to host files, e.g. by crafting a QCOW2
header with backing file /etc/shadow.
Commit 1e72d3b (April 2008) provided -drive parameter format to let
users disable probing. Commit f965509 (March 2009) extended QCOW2 to
optionally store the backing file format, to let users disable backing
file probing. QED has had a flag to suppress probing since the
beginning (2010), set whenever a raw backing file is assigned.
All of these additions that allow to avoid format probing have to be
specified explicitly. The default still allows the attack.
In order to fix this, commit 79368c8 (July 2010) put probed raw images
in a restricted mode, in which they wouldn't be able to overwrite the
first few bytes of the image so that they would identify as a different
image. If a write to the first sector would write one of the signatures
of another driver, qemu would instead zero out the first four bytes.
This patch was later reverted in commit 8b33d9e (September 2010) because
it didn't get the handling of unaligned qiov members right.
Today's block layer that is based on coroutines and has qiov utility
functions makes it much easier to get this functionality right, so this
patch implements it.
The other differences of this patch to the old one are that it doesn't
silently write something different than the guest requested by zeroing
out some bytes (it fails the request instead) and that it doesn't
maintain a list of signatures in the raw driver (it calls the usual
probe function instead).
Note that this change doesn't introduce new breakage for false positive
cases where the guest legitimately writes data into the first sector
that matches the signatures of an image format (e.g. for nested virt):
These cases were broken before, only the failure mode changes from
corruption after the next restart (when the wrong format is probed) to
failing the problematic write request.
Also note that like in the original patch, the restrictions only apply
if the image format has been guessed by probing. Explicitly specifying a
format allows guests to write anything they like.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1416497234-29880-8-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2014-11-20 18:27:12 +03:00
|
|
|
BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size,
|
|
|
|
const char *filename);
|
2005-12-18 21:28:15 +03:00
|
|
|
|
2017-05-22 22:52:16 +03:00
|
|
|
void bdrv_parse_filename_strip_prefix(const char *filename, const char *prefix,
|
|
|
|
QDict *options);
|
|
|
|
|
block: Leave BDS.backing_{file,format} constant
Parts of the block layer treat BDS.backing_file as if it were whatever
the image header says (i.e., if it is a relative path, it is relative to
the overlay), other parts treat it like a cache for
bs->backing->bs->filename (relative paths are relative to the CWD).
Considering bs->backing->bs->filename exists, let us make it mean the
former.
Among other things, this now allows the user to specify a base when
using qemu-img to commit an image file in a directory that is not the
CWD (assuming, everything uses relative filenames).
Before this patch:
$ ./qemu-img create -f qcow2 foo/bot.qcow2 1M
$ ./qemu-img create -f qcow2 -b bot.qcow2 foo/mid.qcow2
$ ./qemu-img create -f qcow2 -b mid.qcow2 foo/top.qcow2
$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find '[...]/foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
After this patch:
$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
Image committed.
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
Image committed.
With this change, bdrv_find_backing_image() must look at whether the
user has overridden a BDS's backing file. If so, it can no longer use
bs->backing_file, but must instead compare the given filename against
the backing node's filename directly.
Note that this changes the QAPI output for a node's backing_file. We
had very inconsistent output there (sometimes what the image header
said, sometimes the actual filename of the backing image). This
inconsistent output was effectively useless, so we have to decide one
way or the other. Considering that bs->backing_file usually at runtime
contained the path to the image relative to qemu's CWD (or absolute),
this patch changes QAPI's backing_file to always report the
bs->backing->bs->filename from now on. If you want to receive the image
header information, you have to refer to full-backing-filename.
This necessitates a change to iotest 228. The interesting information
it really wanted is the image header, and it can get that now, but it
has to use full-backing-filename instead of backing_file. Because of
this patch's changes to bs->backing_file's behavior, we also need some
reference output changes.
Along with the changes to bs->backing_file, stop updating
BDS.backing_format in bdrv_backing_attach() as well. This way,
ImageInfo's backing-filename and backing-filename-format fields will
represent what the image header says and nothing else.
iotest 245 changes in behavior: With the backing node no longer
overriding the parent node's backing_file string, you can now omit the
@backing option when reopening a node with neither a default nor a
current backing file even if it used to have a backing node at some
point.
273 also changes: The base image is opened without a format layer, so
ImageInfo.backing-filename-format used to report "file" for the base
image's overlay after blockdev-snapshot. However, the image header
never says "file" anywhere, so it now reports $IMGFMT.
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-08-01 21:34:11 +03:00
|
|
|
bool bdrv_backing_overridden(BlockDriverState *bs);
|
|
|
|
|
2011-11-03 12:57:25 +04:00
|
|
|
|
2014-06-20 23:57:33 +04:00
|
|
|
/**
|
|
|
|
* bdrv_add_aio_context_notifier:
|
|
|
|
*
|
|
|
|
* If a long-running job intends to be always run in the same AioContext as a
|
|
|
|
* certain BDS, it may use this function to be notified of changes regarding the
|
|
|
|
* association of the BDS to an AioContext.
|
|
|
|
*
|
|
|
|
* attached_aio_context() is called after the target BDS has been attached to a
|
|
|
|
* new AioContext; detach_aio_context() is called before the target BDS is being
|
|
|
|
* detached from its old AioContext.
|
|
|
|
*/
|
|
|
|
void bdrv_add_aio_context_notifier(BlockDriverState *bs,
|
|
|
|
void (*attached_aio_context)(AioContext *new_context, void *opaque),
|
|
|
|
void (*detach_aio_context)(void *opaque), void *opaque);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* bdrv_remove_aio_context_notifier:
|
|
|
|
*
|
|
|
|
* Unsubscribe of change notifications regarding the BDS's AioContext. The
|
|
|
|
* parameters given here have to be the same as those given to
|
|
|
|
* bdrv_add_aio_context_notifier().
|
|
|
|
*/
|
|
|
|
void bdrv_remove_aio_context_notifier(BlockDriverState *bs,
|
|
|
|
void (*aio_context_attached)(AioContext *,
|
|
|
|
void *),
|
|
|
|
void (*aio_context_detached)(void *),
|
|
|
|
void *opaque);
|
|
|
|
|
2016-10-27 13:49:05 +03:00
|
|
|
/**
|
|
|
|
* bdrv_wakeup:
|
|
|
|
* @bs: The BlockDriverState for which an I/O operation has been completed.
|
|
|
|
*
|
|
|
|
* Wake up the main thread if it is waiting on BDRV_POLL_WHILE. During
|
|
|
|
* synchronous I/O on a BlockDriverState that is attached to another
|
|
|
|
* I/O thread, the main thread lets the I/O thread's event loop run,
|
|
|
|
* waiting for the I/O operation to complete. A bdrv_wakeup will wake
|
|
|
|
* up the main thread if necessary.
|
|
|
|
*
|
|
|
|
* Manual calls to bdrv_wakeup are rarely necessary, because
|
|
|
|
* bdrv_dec_in_flight already calls it.
|
|
|
|
*/
|
|
|
|
void bdrv_wakeup(BlockDriverState *bs);
|
|
|
|
|
2009-06-15 16:04:22 +04:00
|
|
|
#ifdef _WIN32
|
|
|
|
int is_windows_drive(const char *filename);
|
|
|
|
#endif
|
|
|
|
|
2012-03-30 15:17:13 +04:00
|
|
|
/**
|
|
|
|
* stream_start:
|
2016-07-05 17:28:59 +03:00
|
|
|
* @job_id: The id of the newly-created job, or %NULL to use the
|
|
|
|
* device name of @bs.
|
2012-03-30 15:17:13 +04:00
|
|
|
* @bs: Block device to operate on.
|
|
|
|
* @base: Block device that will become the new base, or %NULL to
|
|
|
|
* flatten the whole backing file chain onto @bs.
|
2016-07-05 17:28:52 +03:00
|
|
|
* @backing_file_str: The file name that will be written to @bs as the
|
|
|
|
* the new backing file if the job completes. Ignored if @base is %NULL.
|
2018-09-06 16:02:12 +03:00
|
|
|
* @creation_flags: Flags that control the behavior of the Job lifetime.
|
|
|
|
* See @BlockJobCreateFlags
|
2012-04-25 19:51:03 +04:00
|
|
|
* @speed: The maximum speed, in bytes per second, or 0 for unlimited.
|
2012-09-28 19:22:59 +04:00
|
|
|
* @on_error: The action to take upon error.
|
2020-12-16 09:16:54 +03:00
|
|
|
* @filter_node_name: The node name that should be assigned to the filter
|
|
|
|
* driver that the stream job inserts into the graph above
|
|
|
|
* @bs. NULL means that a node name should be autogenerated.
|
2012-04-25 19:51:00 +04:00
|
|
|
* @errp: Error object.
|
2012-03-30 15:17:13 +04:00
|
|
|
*
|
|
|
|
* Start a streaming operation on @bs. Clusters that are unallocated
|
|
|
|
* in @bs, but allocated in any image between @base and @bs (both
|
|
|
|
* exclusive) will be written to @bs. At the end of a successful
|
|
|
|
* streaming job, the backing file of @bs will be changed to
|
2016-07-05 17:28:52 +03:00
|
|
|
* @backing_file_str in the written image and to @base in the live
|
|
|
|
* BlockDriverState.
|
2012-03-30 15:17:13 +04:00
|
|
|
*/
|
2016-07-05 17:28:59 +03:00
|
|
|
void stream_start(const char *job_id, BlockDriverState *bs,
|
|
|
|
BlockDriverState *base, const char *backing_file_str,
|
2020-12-16 09:17:00 +03:00
|
|
|
BlockDriverState *bottom,
|
2018-09-06 16:02:12 +03:00
|
|
|
int creation_flags, int64_t speed,
|
2020-12-16 09:16:54 +03:00
|
|
|
BlockdevOnError on_error,
|
|
|
|
const char *filter_node_name,
|
|
|
|
Error **errp);
|
2012-01-18 18:40:44 +04:00
|
|
|
|
2012-09-27 21:29:13 +04:00
|
|
|
/**
|
|
|
|
* commit_start:
|
2016-07-05 17:29:00 +03:00
|
|
|
* @job_id: The id of the newly-created job, or %NULL to use the
|
|
|
|
* device name of @bs.
|
2013-12-16 10:45:30 +04:00
|
|
|
* @bs: Active block device.
|
|
|
|
* @top: Top block device to be committed.
|
|
|
|
* @base: Block device that will be written into, and become the new top.
|
2018-09-06 16:02:10 +03:00
|
|
|
* @creation_flags: Flags that control the behavior of the Job lifetime.
|
|
|
|
* See @BlockJobCreateFlags
|
2012-09-27 21:29:13 +04:00
|
|
|
* @speed: The maximum speed, in bytes per second, or 0 for unlimited.
|
|
|
|
* @on_error: The action to take upon error.
|
block: extend block-commit to accept a string for the backing file
On some image chains, QEMU may not always be able to resolve the
filenames properly, when updating the backing file of an image
after a block commit.
For instance, certain relative pathnames may fail, or drives may
have been specified originally by file descriptor (e.g. /dev/fd/???),
or a relative protocol pathname may have been used.
In these instances, QEMU may lack the information to be able to make
the correct choice, but the user or management layer most likely does
have that knowledge.
With this extension to the block-commit api, the user is able to change
the backing file of the overlay image as part of the block-commit
operation.
This allows the change to be 'safe', in the sense that if the attempt
to write the overlay image metadata fails, then the block-commit
operation returns failure, without disrupting the guest.
If the commit top is the active layer, then specifying the backing
file string will be treated as an error (there is no overlay image
to modify in that case).
If a backing file string is not specified in the command, the backing
file string to use is determined in the same manner as it was
previously.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2014-06-25 23:40:10 +04:00
|
|
|
* @backing_file_str: String to use as the backing file in @top's overlay
|
2017-02-20 20:10:05 +03:00
|
|
|
* @filter_node_name: The node name that should be assigned to the filter
|
|
|
|
* driver that the commit job inserts into the graph above @top. NULL means
|
|
|
|
* that a node name should be autogenerated.
|
2012-09-27 21:29:13 +04:00
|
|
|
* @errp: Error object.
|
|
|
|
*
|
|
|
|
*/
|
2016-07-05 17:29:00 +03:00
|
|
|
void commit_start(const char *job_id, BlockDriverState *bs,
|
2018-09-06 16:02:10 +03:00
|
|
|
BlockDriverState *base, BlockDriverState *top,
|
|
|
|
int creation_flags, int64_t speed,
|
2016-10-27 19:06:58 +03:00
|
|
|
BlockdevOnError on_error, const char *backing_file_str,
|
2017-02-20 20:10:05 +03:00
|
|
|
const char *filter_node_name, Error **errp);
|
2013-12-16 10:45:30 +04:00
|
|
|
/**
|
|
|
|
* commit_active_start:
|
2016-07-05 17:29:00 +03:00
|
|
|
* @job_id: The id of the newly-created job, or %NULL to use the
|
|
|
|
* device name of @bs.
|
2013-12-16 10:45:30 +04:00
|
|
|
* @bs: Active block device to be committed.
|
|
|
|
* @base: Block device that will be written into, and become the new top.
|
2016-10-27 19:06:57 +03:00
|
|
|
* @creation_flags: Flags that control the behavior of the Job lifetime.
|
|
|
|
* See @BlockJobCreateFlags
|
2013-12-16 10:45:30 +04:00
|
|
|
* @speed: The maximum speed, in bytes per second, or 0 for unlimited.
|
|
|
|
* @on_error: The action to take upon error.
|
2017-02-20 20:10:05 +03:00
|
|
|
* @filter_node_name: The node name that should be assigned to the filter
|
|
|
|
* driver that the commit job inserts into the graph above @bs. NULL means that
|
|
|
|
* a node name should be autogenerated.
|
2013-12-16 10:45:30 +04:00
|
|
|
* @cb: Completion function for the job.
|
|
|
|
* @opaque: Opaque pointer value passed to @cb.
|
2016-07-27 10:01:47 +03:00
|
|
|
* @auto_complete: Auto complete the job.
|
2017-04-21 15:27:04 +03:00
|
|
|
* @errp: Error object.
|
2013-12-16 10:45:30 +04:00
|
|
|
*
|
|
|
|
*/
|
2019-06-06 18:41:29 +03:00
|
|
|
BlockJob *commit_active_start(const char *job_id, BlockDriverState *bs,
|
|
|
|
BlockDriverState *base, int creation_flags,
|
|
|
|
int64_t speed, BlockdevOnError on_error,
|
|
|
|
const char *filter_node_name,
|
|
|
|
BlockCompletionFunc *cb, void *opaque,
|
|
|
|
bool auto_complete, Error **errp);
|
2012-10-18 18:49:23 +04:00
|
|
|
/*
|
|
|
|
* mirror_start:
|
2016-07-05 17:28:57 +03:00
|
|
|
* @job_id: The id of the newly-created job, or %NULL to use the
|
|
|
|
* device name of @bs.
|
2012-10-18 18:49:23 +04:00
|
|
|
* @bs: Block device to operate on.
|
|
|
|
* @target: Block device to write to.
|
2014-06-27 20:25:25 +04:00
|
|
|
* @replaces: Block graph node name to replace once the mirror is done. Can
|
|
|
|
* only be used when full mirroring is selected.
|
2018-09-06 16:02:11 +03:00
|
|
|
* @creation_flags: Flags that control the behavior of the Job lifetime.
|
|
|
|
* See @BlockJobCreateFlags
|
2012-10-18 18:49:23 +04:00
|
|
|
* @speed: The maximum speed, in bytes per second, or 0 for unlimited.
|
2013-01-21 20:09:46 +04:00
|
|
|
* @granularity: The chosen granularity for the dirty bitmap.
|
2013-01-22 12:03:13 +04:00
|
|
|
* @buf_size: The amount of data that can be in flight at one time.
|
2012-10-18 18:49:23 +04:00
|
|
|
* @mode: Whether to collapse all images in the chain to the target.
|
block/mirror: Fix target backing BDS
Currently, we are trying to move the backing BDS from the source to the
target in bdrv_replace_in_backing_chain() which is called from
mirror_exit(). However, mirror_complete() already tries to open the
target's backing chain with a call to bdrv_open_backing_file().
First, we should only set the target's backing BDS once. Second, the
mirroring block job has a better idea of what to set it to than the
generic code in bdrv_replace_in_backing_chain() (in fact, the latter's
conditions on when to move the backing BDS from source to target are not
really correct).
Therefore, remove that code from bdrv_replace_in_backing_chain() and
leave it to mirror_complete().
Depending on what kind of mirroring is performed, we furthermore want to
use different strategies to open the target's backing chain:
- If blockdev-mirror is used, we can assume the user made sure that the
target already has the correct backing chain. In particular, we should
not try to open a backing file if the target does not have any yet.
- If drive-mirror with mode=absolute-paths is used, we can and should
reuse the already existing chain of nodes that the source BDS is in.
In case of sync=full, no backing BDS is required; with sync=top, we
just link the source's backing BDS to the target, and with sync=none,
we use the source BDS as the target's backing BDS.
We should not try to open these backing files anew because this would
lead to two BDSs existing per physical file in the backing chain, and
we would like to avoid such concurrent access.
- If drive-mirror with mode=existing is used, we have to use the
information provided in the physical image file which means opening
the target's backing chain completely anew, just as it has been done
already.
If the target's backing chain shares images with the source, this may
lead to multiple BDSs per physical image file. But since we cannot
reliably ascertain this case, there is nothing we can do about it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20160610185750.30956-3-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2016-06-10 21:57:47 +03:00
|
|
|
* @backing_mode: How to establish the target's backing chain after completion.
|
2019-07-24 20:12:30 +03:00
|
|
|
* @zero_target: Whether the target should be explicitly zero-initialized
|
2012-10-18 18:49:28 +04:00
|
|
|
* @on_source_error: The action to take upon error reading from the source.
|
|
|
|
* @on_target_error: The action to take upon error writing to the target.
|
2015-06-08 08:56:08 +03:00
|
|
|
* @unmap: Whether to unmap target where source sectors only contain zeroes.
|
2017-02-20 20:10:05 +03:00
|
|
|
* @filter_node_name: The node name that should be assigned to the filter
|
|
|
|
* driver that the mirror job inserts into the graph above @bs. NULL means that
|
|
|
|
* a node name should be autogenerated.
|
2018-06-13 21:18:22 +03:00
|
|
|
* @copy_mode: When to trigger writes to the target.
|
2012-10-18 18:49:23 +04:00
|
|
|
* @errp: Error object.
|
|
|
|
*
|
|
|
|
* Start a mirroring operation on @bs. Clusters that are allocated
|
2016-09-14 14:03:38 +03:00
|
|
|
* in @bs will be written to @target until the job is cancelled or
|
2012-10-18 18:49:23 +04:00
|
|
|
* manually completed. At the end of a successful mirroring job,
|
|
|
|
* @bs will be switched to read from @target.
|
|
|
|
*/
|
2016-07-05 17:28:57 +03:00
|
|
|
void mirror_start(const char *job_id, BlockDriverState *bs,
|
|
|
|
BlockDriverState *target, const char *replaces,
|
2018-09-06 16:02:11 +03:00
|
|
|
int creation_flags, int64_t speed,
|
|
|
|
uint32_t granularity, int64_t buf_size,
|
block/mirror: Fix target backing BDS
Currently, we are trying to move the backing BDS from the source to the
target in bdrv_replace_in_backing_chain() which is called from
mirror_exit(). However, mirror_complete() already tries to open the
target's backing chain with a call to bdrv_open_backing_file().
First, we should only set the target's backing BDS once. Second, the
mirroring block job has a better idea of what to set it to than the
generic code in bdrv_replace_in_backing_chain() (in fact, the latter's
conditions on when to move the backing BDS from source to target are not
really correct).
Therefore, remove that code from bdrv_replace_in_backing_chain() and
leave it to mirror_complete().
Depending on what kind of mirroring is performed, we furthermore want to
use different strategies to open the target's backing chain:
- If blockdev-mirror is used, we can assume the user made sure that the
target already has the correct backing chain. In particular, we should
not try to open a backing file if the target does not have any yet.
- If drive-mirror with mode=absolute-paths is used, we can and should
reuse the already existing chain of nodes that the source BDS is in.
In case of sync=full, no backing BDS is required; with sync=top, we
just link the source's backing BDS to the target, and with sync=none,
we use the source BDS as the target's backing BDS.
We should not try to open these backing files anew because this would
lead to two BDSs existing per physical file in the backing chain, and
we would like to avoid such concurrent access.
- If drive-mirror with mode=existing is used, we have to use the
information provided in the physical image file which means opening
the target's backing chain completely anew, just as it has been done
already.
If the target's backing chain shares images with the source, this may
lead to multiple BDSs per physical image file. But since we cannot
reliably ascertain this case, there is nothing we can do about it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20160610185750.30956-3-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2016-06-10 21:57:47 +03:00
|
|
|
MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
|
2019-07-24 20:12:30 +03:00
|
|
|
bool zero_target,
|
block/mirror: Fix target backing BDS
Currently, we are trying to move the backing BDS from the source to the
target in bdrv_replace_in_backing_chain() which is called from
mirror_exit(). However, mirror_complete() already tries to open the
target's backing chain with a call to bdrv_open_backing_file().
First, we should only set the target's backing BDS once. Second, the
mirroring block job has a better idea of what to set it to than the
generic code in bdrv_replace_in_backing_chain() (in fact, the latter's
conditions on when to move the backing BDS from source to target are not
really correct).
Therefore, remove that code from bdrv_replace_in_backing_chain() and
leave it to mirror_complete().
Depending on what kind of mirroring is performed, we furthermore want to
use different strategies to open the target's backing chain:
- If blockdev-mirror is used, we can assume the user made sure that the
target already has the correct backing chain. In particular, we should
not try to open a backing file if the target does not have any yet.
- If drive-mirror with mode=absolute-paths is used, we can and should
reuse the already existing chain of nodes that the source BDS is in.
In case of sync=full, no backing BDS is required; with sync=top, we
just link the source's backing BDS to the target, and with sync=none,
we use the source BDS as the target's backing BDS.
We should not try to open these backing files anew because this would
lead to two BDSs existing per physical file in the backing chain, and
we would like to avoid such concurrent access.
- If drive-mirror with mode=existing is used, we have to use the
information provided in the physical image file which means opening
the target's backing chain completely anew, just as it has been done
already.
If the target's backing chain shares images with the source, this may
lead to multiple BDSs per physical image file. But since we cannot
reliably ascertain this case, there is nothing we can do about it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20160610185750.30956-3-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2016-06-10 21:57:47 +03:00
|
|
|
BlockdevOnError on_source_error,
|
2012-10-18 18:49:28 +04:00
|
|
|
BlockdevOnError on_target_error,
|
2018-06-13 21:18:22 +03:00
|
|
|
bool unmap, const char *filter_node_name,
|
|
|
|
MirrorCopyMode copy_mode, Error **errp);
|
2012-10-18 18:49:23 +04:00
|
|
|
|
2013-06-24 19:13:11 +04:00
|
|
|
/*
|
2016-11-08 09:50:38 +03:00
|
|
|
* backup_job_create:
|
2016-07-05 17:28:58 +03:00
|
|
|
* @job_id: The id of the newly-created job, or %NULL to use the
|
|
|
|
* device name of @bs.
|
2013-06-24 19:13:11 +04:00
|
|
|
* @bs: Block device to operate on.
|
|
|
|
* @target: Block device to write to.
|
|
|
|
* @speed: The maximum speed, in bytes per second, or 0 for unlimited.
|
2013-07-26 22:39:04 +04:00
|
|
|
* @sync_mode: What parts of the disk image should be copied to the destination.
|
2019-07-29 23:35:52 +03:00
|
|
|
* @sync_bitmap: The dirty bitmap if sync_mode is 'bitmap' or 'incremental'
|
|
|
|
* @bitmap_mode: The bitmap synchronization policy to use.
|
qapi: backup: add perf.use-copy-range parameter
Experiments show, that copy_range is not always making things faster.
So, to make experimentation simpler, let's add a parameter. Some more
perf parameters will be added soon, so here is a new struct.
For now, add new backup qmp parameter with x- prefix for the following
reasons:
- We are going to add more performance parameters, some will be
related to the whole block-copy process, some only to background
copying in backup (ignored for copy-before-write operations).
- On the other hand, we are going to use block-copy interface in other
block jobs, which will need performance options as well.. And it
should be the same structure or at least somehow related.
So, there are too much unclean things about how the interface and now
we need the new options mostly for testing. Let's keep them
experimental for a while.
In do_backup_common() new x-perf parameter handled in a way to
make further options addition simpler.
We add use-copy-range with default=true, and we'll change the default
in further patch, after moving backup to use block-copy.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-2-vsementsov@virtuozzo.com>
[mreitz: s/5\.2/6.0/]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-17 00:46:43 +03:00
|
|
|
* @perf: Performance options. All actual fields assumed to be present,
|
|
|
|
* all ".has_*" fields are ignored.
|
2013-06-24 19:13:11 +04:00
|
|
|
* @on_source_error: The action to take upon error reading from the source.
|
|
|
|
* @on_target_error: The action to take upon error writing to the target.
|
2016-10-27 19:06:57 +03:00
|
|
|
* @creation_flags: Flags that control the behavior of the Job lifetime.
|
|
|
|
* See @BlockJobCreateFlags
|
2013-06-24 19:13:11 +04:00
|
|
|
* @cb: Completion function for the job.
|
|
|
|
* @opaque: Opaque pointer value passed to @cb.
|
2015-11-06 02:13:17 +03:00
|
|
|
* @txn: Transaction that this job is part of (may be NULL).
|
2013-06-24 19:13:11 +04:00
|
|
|
*
|
2016-11-08 09:50:38 +03:00
|
|
|
* Create a backup operation on @bs. Clusters in @bs are written to @target
|
2013-06-24 19:13:11 +04:00
|
|
|
* until the job is cancelled or manually completed.
|
|
|
|
*/
|
2016-11-08 09:50:38 +03:00
|
|
|
BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
|
|
|
|
BlockDriverState *target, int64_t speed,
|
|
|
|
MirrorSyncMode sync_mode,
|
|
|
|
BdrvDirtyBitmap *sync_bitmap,
|
2019-07-29 23:35:52 +03:00
|
|
|
BitmapSyncMode bitmap_mode,
|
2016-11-08 09:50:38 +03:00
|
|
|
bool compress,
|
block/backup: use backup-top instead of write notifiers
Drop write notifiers and use filter node instead.
= Changes =
1. Add filter-node-name argument for backup qmp api. We have to do it
in this commit, as 257 needs to be fixed.
2. There are no more write notifiers here, so is_write_notifier
parameter is dropped from block-copy paths.
3. To sync with in-flight requests at job finish we now have drained
removing of the filter, we don't need rw-lock.
4. Block-copy is now using BdrvChildren instead of BlockBackends
5. As backup-top owns these children, we also move block-copy state
into backup-top's ownership.
= Iotest changes =
56: op-blocker doesn't shoot now, as we set it on source, but then
check on filter, when trying to start second backup.
To keep the test we instead can catch another collision: both jobs will
get 'drive0' job-id, as job-id parameter is unspecified. To prevent
interleaving with file-posix locks (as they are dependent on config)
let's use another target for second backup.
Also, it's obvious now that we'd like to drop this op-blocker at all
and add a test-case for two backups from one node (to different
destinations) actually works. But not in these series.
141: Output changed: prepatch, "Node is in use" comes from bdrv_has_blk
check inside qmp_blockdev_del. But we've dropped block-copy blk
objects, so no more blk objects on source bs (job blk is on backup-top
filter bs). New message is from op-blocker, which is the next check in
qmp_blockdev_add.
257: The test wants to emulate guest write during backup. They should
go to filter node, not to original source node, of course. Therefore we
need to specify filter node name and use it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20191001131409.14202-6-vsementsov@virtuozzo.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-01 16:14:09 +03:00
|
|
|
const char *filter_node_name,
|
qapi: backup: add perf.use-copy-range parameter
Experiments show, that copy_range is not always making things faster.
So, to make experimentation simpler, let's add a parameter. Some more
perf parameters will be added soon, so here is a new struct.
For now, add new backup qmp parameter with x- prefix for the following
reasons:
- We are going to add more performance parameters, some will be
related to the whole block-copy process, some only to background
copying in backup (ignored for copy-before-write operations).
- On the other hand, we are going to use block-copy interface in other
block jobs, which will need performance options as well.. And it
should be the same structure or at least somehow related.
So, there are too much unclean things about how the interface and now
we need the new options mostly for testing. Let's keep them
experimental for a while.
In do_backup_common() new x-perf parameter handled in a way to
make further options addition simpler.
We add use-copy-range with default=true, and we'll change the default
in further patch, after moving backup to use block-copy.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-2-vsementsov@virtuozzo.com>
[mreitz: s/5\.2/6.0/]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-17 00:46:43 +03:00
|
|
|
BackupPerf *perf,
|
2016-11-08 09:50:38 +03:00
|
|
|
BlockdevOnError on_source_error,
|
|
|
|
BlockdevOnError on_target_error,
|
|
|
|
int creation_flags,
|
|
|
|
BlockCompletionFunc *cb, void *opaque,
|
2018-04-19 17:09:52 +03:00
|
|
|
JobTxn *txn, Error **errp);
|
2013-06-24 19:13:11 +04:00
|
|
|
|
2016-03-08 15:47:46 +03:00
|
|
|
BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
|
|
|
|
const char *child_name,
|
2020-05-13 14:05:13 +03:00
|
|
|
const BdrvChildClass *child_class,
|
2020-05-13 14:05:15 +03:00
|
|
|
BdrvChildRole child_role,
|
2016-12-14 19:24:36 +03:00
|
|
|
uint64_t perm, uint64_t shared_perm,
|
|
|
|
void *opaque, Error **errp);
|
2016-03-08 15:47:46 +03:00
|
|
|
void bdrv_root_unref_child(BdrvChild *child);
|
|
|
|
|
2020-03-10 14:38:25 +03:00
|
|
|
void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
|
|
|
|
uint64_t *shared_perm);
|
|
|
|
|
2019-05-22 20:03:46 +03:00
|
|
|
/**
|
|
|
|
* Sets a BdrvChild's permissions. Avoid if the parent is a BDS; use
|
|
|
|
* bdrv_child_refresh_perms() instead and make the parent's
|
|
|
|
* .bdrv_child_perm() implementation return the correct values.
|
|
|
|
*/
|
2016-12-15 15:04:20 +03:00
|
|
|
int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
|
|
|
|
Error **errp);
|
|
|
|
|
2019-05-22 20:03:46 +03:00
|
|
|
/**
|
|
|
|
* Calls bs->drv->bdrv_child_perm() and updates the child's permission
|
|
|
|
* masks with the result.
|
|
|
|
* Drivers should invoke this function whenever an event occurs that
|
|
|
|
* makes their .bdrv_child_perm() implementation return different
|
|
|
|
* values than before, but which will not result in the block layer
|
|
|
|
* automatically refreshing the permissions.
|
|
|
|
*/
|
|
|
|
int bdrv_child_refresh_perms(BlockDriverState *bs, BdrvChild *c, Error **errp);
|
|
|
|
|
2020-02-18 13:34:41 +03:00
|
|
|
bool bdrv_recurse_can_replace(BlockDriverState *bs,
|
|
|
|
BlockDriverState *to_replace);
|
|
|
|
|
2020-05-13 14:05:29 +03:00
|
|
|
/*
|
|
|
|
* Default implementation for BlockDriver.bdrv_child_perm() that can
|
|
|
|
* be used by block filters and image formats, as long as they use the
|
|
|
|
* child_of_bds child class and set an appropriate BdrvChildRole.
|
|
|
|
*/
|
|
|
|
void bdrv_default_perms(BlockDriverState *bs, BdrvChild *c,
|
2020-05-13 14:05:44 +03:00
|
|
|
BdrvChildRole role, BlockReopenQueue *reopen_queue,
|
2020-05-13 14:05:29 +03:00
|
|
|
uint64_t perm, uint64_t shared,
|
|
|
|
uint64_t *nperm, uint64_t *nshared);
|
|
|
|
|
2016-03-22 20:38:44 +03:00
|
|
|
const char *bdrv_get_parent_name(const BlockDriverState *bs);
|
2017-01-24 16:21:41 +03:00
|
|
|
void blk_dev_change_media_cb(BlockBackend *blk, bool load, Error **errp);
|
2014-10-07 15:59:25 +04:00
|
|
|
bool blk_dev_has_removable_media(BlockBackend *blk);
|
2016-01-29 22:49:10 +03:00
|
|
|
bool blk_dev_has_tray(BlockBackend *blk);
|
2014-10-07 15:59:25 +04:00
|
|
|
void blk_dev_eject_request(BlockBackend *blk, bool force);
|
|
|
|
bool blk_dev_is_tray_open(BlockBackend *blk);
|
|
|
|
bool blk_dev_is_medium_locked(BlockBackend *blk);
|
|
|
|
|
2017-09-25 17:55:25 +03:00
|
|
|
void bdrv_set_dirty(BlockDriverState *bs, int64_t offset, int64_t bytes);
|
2015-04-28 16:27:50 +03:00
|
|
|
|
2015-11-09 13:16:54 +03:00
|
|
|
void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out);
|
2018-10-29 23:23:14 +03:00
|
|
|
void bdrv_restore_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap *backup);
|
2019-07-29 23:35:53 +03:00
|
|
|
bool bdrv_dirty_bitmap_merge_internal(BdrvDirtyBitmap *dest,
|
|
|
|
const BdrvDirtyBitmap *src,
|
|
|
|
HBitmap **backup, bool lock);
|
2015-11-09 13:16:54 +03:00
|
|
|
|
2016-10-27 13:48:52 +03:00
|
|
|
void bdrv_inc_in_flight(BlockDriverState *bs);
|
|
|
|
void bdrv_dec_in_flight(BlockDriverState *bs);
|
|
|
|
|
2016-01-29 18:36:12 +03:00
|
|
|
void blockdev_close_all_bdrv_states(void);
|
|
|
|
|
2020-12-11 21:39:34 +03:00
|
|
|
int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, int64_t src_offset,
|
|
|
|
BdrvChild *dst, int64_t dst_offset,
|
|
|
|
int64_t bytes,
|
2018-07-09 19:37:17 +03:00
|
|
|
BdrvRequestFlags read_flags,
|
|
|
|
BdrvRequestFlags write_flags);
|
2020-12-11 21:39:34 +03:00
|
|
|
int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, int64_t src_offset,
|
|
|
|
BdrvChild *dst, int64_t dst_offset,
|
|
|
|
int64_t bytes,
|
2018-07-09 19:37:17 +03:00
|
|
|
BdrvRequestFlags read_flags,
|
|
|
|
BdrvRequestFlags write_flags);
|
2018-06-01 12:26:39 +03:00
|
|
|
|
2018-06-26 14:55:20 +03:00
|
|
|
int refresh_total_sectors(BlockDriverState *bs, int64_t hint);
|
|
|
|
|
2020-03-08 12:24:40 +03:00
|
|
|
void bdrv_set_monitor_owned(BlockDriverState *bs);
|
|
|
|
BlockDriverState *bds_tree_init(QDict *bs_opts, Error **errp);
|
|
|
|
|
2020-03-26 04:12:18 +03:00
|
|
|
/**
|
|
|
|
* Simple implementation of bdrv_co_create_opts for protocol drivers
|
|
|
|
* which only support creation via opening a file
|
|
|
|
* (usually existing raw storage device)
|
|
|
|
*/
|
|
|
|
int coroutine_fn bdrv_co_create_opts_simple(BlockDriver *drv,
|
|
|
|
const char *filename,
|
|
|
|
QemuOpts *opts,
|
|
|
|
Error **errp);
|
|
|
|
extern QemuOptsList bdrv_create_opts_simple;
|
|
|
|
|
2020-05-13 04:16:43 +03:00
|
|
|
BdrvDirtyBitmap *block_dirty_bitmap_lookup(const char *node,
|
|
|
|
const char *name,
|
|
|
|
BlockDriverState **pbs,
|
|
|
|
Error **errp);
|
|
|
|
BdrvDirtyBitmap *block_dirty_bitmap_merge(const char *node, const char *target,
|
|
|
|
BlockDirtyBitmapMergeSourceList *bms,
|
|
|
|
HBitmap **backup, Error **errp);
|
|
|
|
BdrvDirtyBitmap *block_dirty_bitmap_remove(const char *node, const char *name,
|
|
|
|
bool release,
|
|
|
|
BlockDriverState **bitmap_bs,
|
|
|
|
Error **errp);
|
|
|
|
|
2019-05-31 16:23:11 +03:00
|
|
|
BdrvChild *bdrv_cow_child(BlockDriverState *bs);
|
|
|
|
BdrvChild *bdrv_filter_child(BlockDriverState *bs);
|
|
|
|
BdrvChild *bdrv_filter_or_cow_child(BlockDriverState *bs);
|
|
|
|
BdrvChild *bdrv_primary_child(BlockDriverState *bs);
|
2019-06-12 16:06:37 +03:00
|
|
|
BlockDriverState *bdrv_skip_implicit_filters(BlockDriverState *bs);
|
|
|
|
BlockDriverState *bdrv_skip_filters(BlockDriverState *bs);
|
|
|
|
BlockDriverState *bdrv_backing_chain_next(BlockDriverState *bs);
|
2019-05-31 16:23:11 +03:00
|
|
|
|
|
|
|
static inline BlockDriverState *child_bs(BdrvChild *child)
|
|
|
|
{
|
|
|
|
return child ? child->bs : NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline BlockDriverState *bdrv_cow_bs(BlockDriverState *bs)
|
|
|
|
{
|
|
|
|
return child_bs(bdrv_cow_child(bs));
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline BlockDriverState *bdrv_filter_bs(BlockDriverState *bs)
|
|
|
|
{
|
|
|
|
return child_bs(bdrv_filter_child(bs));
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline BlockDriverState *bdrv_filter_or_cow_bs(BlockDriverState *bs)
|
|
|
|
{
|
|
|
|
return child_bs(bdrv_filter_or_cow_child(bs));
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs)
|
|
|
|
{
|
|
|
|
return child_bs(bdrv_primary_child(bs));
|
|
|
|
}
|
|
|
|
|
2020-10-28 11:07:34 +03:00
|
|
|
/**
|
|
|
|
* End all quiescent sections started by bdrv_drain_all_begin(). This is
|
|
|
|
* needed when deleting a BDS before bdrv_drain_all_end() is called.
|
|
|
|
*
|
|
|
|
* NOTE: this is an internal helper for bdrv_close() *only*. No one else
|
|
|
|
* should call it.
|
|
|
|
*/
|
|
|
|
void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
|
|
|
|
|
block: block-status cache for data regions
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform. The
main difference is that this time it is implemented as part of the
general block layer code.
The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the
implementation is outside of qemu, there is little we can do about its
performance.
We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts. So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.
To address this, we want to cache data regions. Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.
(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine. While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)
We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210812084148.14458-3-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[hreitz: Added `local_file == bs` assertion, as suggested by Vladimir]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-08-12 11:41:44 +03:00
|
|
|
/**
|
|
|
|
* Check whether the given offset is in the cached block-status data
|
|
|
|
* region.
|
|
|
|
*
|
|
|
|
* If it is, and @pnum is not NULL, *pnum is set to
|
|
|
|
* `bsc.data_end - offset`, i.e. how many bytes, starting from
|
|
|
|
* @offset, are data (according to the cache).
|
|
|
|
* Otherwise, *pnum is not touched.
|
|
|
|
*/
|
|
|
|
bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* If [offset, offset + bytes) overlaps with the currently cached
|
|
|
|
* block-status region, invalidate the cache.
|
|
|
|
*
|
|
|
|
* (To be used by I/O paths that cause data regions to be zero or
|
|
|
|
* holes.)
|
|
|
|
*/
|
|
|
|
void bdrv_bsc_invalidate_range(BlockDriverState *bs,
|
|
|
|
int64_t offset, int64_t bytes);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Mark the range [offset, offset + bytes) as a data region.
|
|
|
|
*/
|
|
|
|
void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
|
|
|
|
|
2004-08-02 01:59:26 +04:00
|
|
|
#endif /* BLOCK_INT_H */
|