/*
 * Block driver for RAW files (posix)
 *
 * Copyright (c) 2006 Fabrice Bellard
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
 */

#include "qemu/osdep.h"
#include "qemu-common.h"
#include "qapi/error.h"
#include "qemu/cutils.h"
#include "qemu/error-report.h"
#include "block/block_int.h"
#include "qemu/module.h"
#include "qemu/option.h"
#include "qemu/units.h"
#include "trace.h"
#include "block/thread-pool.h"
#include "qemu/iov.h"
#include "block/raw-aio.h"
#include "qapi/qmp/qdict.h"
#include "qapi/qmp/qstring.h"

#include "scsi/pr-manager.h"
#include "scsi/constants.h"

#if defined(__APPLE__) && (__MACH__)
#include <sys/ioctl.h>

#if defined(HAVE_HOST_BLOCK_DEVICE)
#include <paths.h>
#include <sys/param.h>
#include <sys/mount.h>
#include <IOKit/IOKitLib.h>
#include <IOKit/IOBSD.h>
#include <IOKit/storage/IOMediaBSDClient.h>
#include <IOKit/storage/IOMedia.h>
#include <IOKit/storage/IOCDMedia.h>
//#include <IOKit/storage/IOCDTypes.h>
#include <IOKit/storage/IODVDMedia.h>
#include <CoreFoundation/CoreFoundation.h>
#endif /* defined(HAVE_HOST_BLOCK_DEVICE) */
#endif

#ifdef __sun__
#define _POSIX_PTHREAD_SEMANTICS 1
#include <sys/dkio.h>
#endif
#ifdef __linux__
#include <sys/ioctl.h>
#include <sys/param.h>
#include <sys/syscall.h>
#include <sys/vfs.h>
#include <linux/cdrom.h>
#include <linux/fd.h>
#include <linux/fs.h>
#include <linux/hdreg.h>
#include <linux/magic.h>
#include <scsi/sg.h>
#ifdef __s390__
#include <asm/dasd.h>
#endif
#ifndef FS_NOCOW_FL
#define FS_NOCOW_FL 0x00800000 /* Do not cow file */
#endif
#endif
#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
#include <linux/falloc.h>
#endif

#if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
#include <sys/disk.h>
#include <sys/cdio.h>
#endif

#ifdef __OpenBSD__
#include <sys/ioctl.h>
#include <sys/disklabel.h>
#include <sys/dkio.h>
#endif

#ifdef __NetBSD__
#include <sys/ioctl.h>
#include <sys/disklabel.h>
#include <sys/dkio.h>
#include <sys/disk.h>
#endif

#ifdef __DragonFly__
#include <sys/ioctl.h>
#include <sys/diskslice.h>
#endif

/* OS X does not have O_DSYNC */
#ifndef O_DSYNC
#ifdef O_SYNC
#define O_DSYNC O_SYNC
#elif defined(O_FSYNC)
#define O_DSYNC O_FSYNC
#endif
#endif

/* Approximate O_DIRECT with O_DSYNC if O_DIRECT isn't available */
#ifndef O_DIRECT
#define O_DIRECT O_DSYNC
#endif

#define FTYPE_FILE 0
#define FTYPE_CD   1

#define MAX_BLOCKSIZE 4096

/* POSIX file locking bytes. Libvirt takes byte 0, so we start from higher
 * bytes, leaving a few more bytes for its future use. */
#define RAW_LOCK_PERM_BASE   100
#define RAW_LOCK_SHARED_BASE 200

typedef struct BDRVRawState {
    int fd;
    bool use_lock;
    int type;
    int open_flags;
    size_t buf_align;

    /* The current permissions. */
    uint64_t perm;
    uint64_t shared_perm;

    /* The perms bits whose corresponding bytes are already locked in
     * s->fd. */
    uint64_t locked_perm;
    uint64_t locked_shared_perm;

    uint64_t aio_max_batch;

    int perm_change_fd;
    int perm_change_flags;
    BDRVReopenState *reopen_state;

    bool has_discard:1;
    bool has_write_zeroes:1;
    bool discard_zeroes:1;
    bool use_linux_aio:1;
    bool use_linux_io_uring:1;
    int page_cache_inconsistent; /* errno from fdatasync failure */
    bool has_fallocate;
    bool needs_alignment;
    bool force_alignment;
    bool drop_cache;
    bool check_cache_dropped;
    struct {
        uint64_t discard_nb_ok;
        uint64_t discard_nb_failed;
        uint64_t discard_bytes_ok;
    } stats;

    PRManager *pr_mgr;
} BDRVRawState;

typedef struct BDRVRawReopenState {
    int open_flags;
    bool drop_cache;
    bool check_cache_dropped;
} BDRVRawReopenState;

static int fd_open(BlockDriverState *bs)
{
    BDRVRawState *s = bs->opaque;

    /* this is just to ensure s->fd is sane (it's called by io ops) */
    if (s->fd >= 0) {
        return 0;
    }
    return -EIO;
}

static int64_t raw_getlength(BlockDriverState *bs);

typedef struct RawPosixAIOData {
    BlockDriverState *bs;
    int aio_type;
    int aio_fildes;

    off_t aio_offset;
    uint64_t aio_nbytes;

    union {
        struct {
            struct iovec *iov;
            int niov;
        } io;
        struct {
            uint64_t cmd;
            void *buf;
        } ioctl;
        struct {
            int aio_fd2;
            off_t aio_offset2;
        } copy_range;
        struct {
            PreallocMode prealloc;
            Error **errp;
        } truncate;
    };
} RawPosixAIOData;

#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
static int cdrom_reopen(BlockDriverState *bs);
#endif

/*
 * Elide EAGAIN and EACCES details when failing to lock, as this
 * indicates that the specified file region is already locked by
 * another process, which is considered a common scenario.
 */
#define raw_lock_error_setg_errno(errp, err, fmt, ...)              \
    do {                                                            \
        if ((err) == EAGAIN || (err) == EACCES) {                   \
            error_setg((errp), (fmt), ## __VA_ARGS__);              \
        } else {                                                    \
            error_setg_errno((errp), (err), (fmt), ## __VA_ARGS__); \
        }                                                           \
    } while (0)

#if defined(__NetBSD__)
static int raw_normalize_devicepath(const char **filename, Error **errp)
{
    static char namebuf[PATH_MAX];
    const char *dp, *fname;
    struct stat sb;

    fname = *filename;
    dp = strrchr(fname, '/');
    if (lstat(fname, &sb) < 0) {
        error_setg_file_open(errp, errno, fname);
        return -errno;
    }

    if (!S_ISBLK(sb.st_mode)) {
        return 0;
    }

    if (dp == NULL) {
        snprintf(namebuf, PATH_MAX, "r%s", fname);
    } else {
        snprintf(namebuf, PATH_MAX, "%.*s/r%s",
            (int)(dp - fname), fname, dp + 1);
    }
    *filename = namebuf;
    warn_report("%s is a block device, using %s", fname, *filename);

    return 0;
}
#else
static int raw_normalize_devicepath(const char **filename, Error **errp)
{
    return 0;
}
#endif

/*
 * Get logical block size via ioctl. On success store it in @sector_size_p.
 */
static int probe_logical_blocksize(int fd, unsigned int *sector_size_p)
{
    unsigned int sector_size;
    bool success = false;
    int i;

    errno = ENOTSUP;
    static const unsigned long ioctl_list[] = {
#ifdef BLKSSZGET
        BLKSSZGET,
#endif
#ifdef DKIOCGETBLOCKSIZE
        DKIOCGETBLOCKSIZE,
#endif
#ifdef DIOCGSECTORSIZE
        DIOCGSECTORSIZE,
#endif
    };

    /* Try a few ioctls to get the right size */
    for (i = 0; i < (int)ARRAY_SIZE(ioctl_list); i++) {
        if (ioctl(fd, ioctl_list[i], &sector_size) >= 0) {
            *sector_size_p = sector_size;
            success = true;
        }
    }

    return success ? 0 : -errno;
}

/**
 * Get physical block size of @fd.
 * On success, store it in @blk_size and return 0.
 * On failure, return -errno.
 */
static int probe_physical_blocksize(int fd, unsigned int *blk_size)
{
#ifdef BLKPBSZGET
    if (ioctl(fd, BLKPBSZGET, blk_size) < 0) {
        return -errno;
    }
    return 0;
#else
    return -ENOTSUP;
#endif
}

/*
 * Returns true if no alignment restrictions are necessary even for files
 * opened with O_DIRECT.
 *
 * raw_probe_alignment() probes the required alignment and assumes that 1 means
 * the probing failed, so it falls back to a safe default of 4k. This can be
 * avoided if we know that byte alignment is okay for the file.
 */
static bool dio_byte_aligned(int fd)
{
#ifdef __linux__
    struct statfs buf;
    int ret;

    ret = fstatfs(fd, &buf);
    if (ret == 0 && buf.f_type == NFS_SUPER_MAGIC) {
        return true;
    }
#endif
    return false;
}

static bool raw_needs_alignment(BlockDriverState *bs)
{
    BDRVRawState *s = bs->opaque;

    if ((bs->open_flags & BDRV_O_NOCACHE) != 0 && !dio_byte_aligned(s->fd)) {
        return true;
    }

    return s->force_alignment;
}

/* Check if read is allowed with given memory buffer and length.
 *
 * This function is used to check O_DIRECT memory buffer and request alignment.
 */
static bool raw_is_io_aligned(int fd, void *buf, size_t len)
{
    ssize_t ret = pread(fd, buf, len, 0);

    if (ret >= 0) {
        return true;
    }

#ifdef __linux__
    /* The Linux kernel returns EINVAL for misaligned O_DIRECT reads. Ignore
     * other errors (e.g. real I/O error), which could happen on a failed
     * drive, since we only care about probing alignment.
     */
    if (errno != EINVAL) {
        return true;
    }
#endif

    return false;
}

2015-02-16 14:47:55 +03:00
|
|
|
static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
|
|
|
|
{
|
|
|
|
BDRVRawState *s = bs->opaque;
|
|
|
|
char *buf;
|
2019-10-13 05:11:45 +03:00
|
|
|
size_t max_align = MAX(MAX_BLOCKSIZE, qemu_real_host_page_size);
|
file-posix: Handle undetectable alignment
In some cases buf_align or request_alignment cannot be detected:
1. With Gluster, buf_align cannot be detected since the actual I/O is
done on Gluster server, and qemu buffer alignment does not matter.
Since we don't have alignment requirement, buf_align=1 is the best
value.
2. With local XFS filesystem, buf_align cannot be detected if reading
from unallocated area. In this we must align the buffer, but we don't
know what is the correct size. Using the wrong alignment results in
I/O error.
3. With Gluster backed by XFS, request_alignment cannot be detected if
reading from unallocated area. In this case we need to use the
correct alignment, and failing to do so results in I/O errors.
4. With NFS, the server does not use direct I/O, so both buf_align cannot
be detected. In this case we don't need any alignment so we can use
buf_align=1 and request_alignment=1.
These cases seems to work when storage sector size is 512 bytes, because
the current code starts checking align=512. If the check succeeds
because alignment cannot be detected we use 512. But this does not work
for storage with 4k sector size.
To determine if we can detect the alignment, we probe first with
align=1. If probing succeeds, maybe there are no alignment requirement
(cases 1, 4) or we are probing unallocated area (cases 2, 3). Since we
don't have any way to tell, we treat this as undetectable alignment. If
probing with align=1 fails with EINVAL, but probing with one of the
expected alignments succeeds, we know that we found a working alignment.
Practically the alignment requirements are the same for buffer
alignment, buffer length, and offset in file. So in case we cannot
detect buf_align, we can use request alignment. If we cannot detect
request alignment, we can fallback to a safe value. To use this logic,
we probe first request alignment instead of buf_align.
Here is a table showing the behaviour with current code (the value in
parenthesis is the optimal value).
Case Sector buf_align (opt) request_alignment (opt) result
======================================================================
1 512 512 (1) 512 (512) OK
1 4096 512 (1) 4096 (4096) FAIL
----------------------------------------------------------------------
2 512 512 (512) 512 (512) OK
2 4096 512 (4096) 4096 (4096) FAIL
----------------------------------------------------------------------
3 512 512 (1) 512 (512) OK
3 4096 512 (1) 512 (4096) FAIL
----------------------------------------------------------------------
4 512 512 (1) 512 (1) OK
4 4096 512 (1) 512 (1) OK
Same cases with this change:
Case Sector buf_align (opt) request_alignment (opt) result
======================================================================
1 512 512 (1) 512 (512) OK
1 4096 4096 (1) 4096 (4096) OK
----------------------------------------------------------------------
2 512 512 (512) 512 (512) OK
2 4096 4096 (4096) 4096 (4096) OK
----------------------------------------------------------------------
3 512 4096 (1) 4096 (512) OK
3 4096 4096 (1) 4096 (4096) OK
----------------------------------------------------------------------
4 512 4096 (1) 4096 (1) OK
4 4096 4096 (1) 4096 (1) OK
I tested that provisioning VMs and copying disks on local XFS and
Gluster with 4k bytes sector size work now, resolving bugs [1],[2].
I tested also on XFS, NFS, Gluster with 512 bytes sector size.
[1] https://bugzilla.redhat.com/1737256
[2] https://bugzilla.redhat.com/1738657
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-08-13 21:21:03 +03:00
    size_t alignments[] = {1, 512, 1024, 2048, 4096};

    /* For SCSI generic devices the alignment is not really used.
       With buffered I/O, we don't have any restrictions. */
    if (bdrv_is_sg(bs) || !s->needs_alignment) {
        bs->bl.request_alignment = 1;
        s->buf_align = 1;
        return;
    }

    bs->bl.request_alignment = 0;
    s->buf_align = 0;
    /* Let's try to use the logical blocksize for the alignment. */
    if (probe_logical_blocksize(fd, &bs->bl.request_alignment) < 0) {
        bs->bl.request_alignment = 0;
    }

#ifdef __linux__
    /*
     * The XFS ioctl definitions are shipped in extra packages that might
     * not always be available. Since we just need the XFS_IOC_DIOINFO ioctl
     * here, we simply use our own definition instead:
     */
    struct xfs_dioattr {
        uint32_t d_mem;
        uint32_t d_miniosz;
        uint32_t d_maxiosz;
    } da;
    if (ioctl(fd, _IOR('X', 30, struct xfs_dioattr), &da) >= 0) {
        bs->bl.request_alignment = da.d_miniosz;
        /* The kernel returns wrong information for d_mem */
        /* s->buf_align = da.d_mem; */
    }
#endif
file-posix: Handle undetectable alignment

In some cases buf_align or request_alignment cannot be detected:

1. With Gluster, buf_align cannot be detected since the actual I/O is
   done on the Gluster server, and qemu's buffer alignment does not
   matter. Since we have no alignment requirement, buf_align=1 is the
   best value.
2. With a local XFS filesystem, buf_align cannot be detected if reading
   from an unallocated area. In this case we must align the buffer, but
   we don't know the correct size. Using the wrong alignment results in
   an I/O error.
3. With Gluster backed by XFS, request_alignment cannot be detected if
   reading from an unallocated area. In this case we need to use the
   correct alignment, and failing to do so results in I/O errors.
4. With NFS, the server does not use direct I/O, so neither buf_align
   nor request_alignment can be detected. In this case we don't need
   any alignment, so we can use buf_align=1 and request_alignment=1.

These cases seem to work when the storage sector size is 512 bytes,
because the current code starts checking at align=512. If the check
succeeds because the alignment cannot be detected, we use 512. But this
does not work for storage with a 4k sector size.

To determine whether we can detect the alignment, we probe first with
align=1. If probing succeeds, either there is no alignment requirement
(cases 1, 4) or we are probing an unallocated area (cases 2, 3). Since
we have no way to tell which, we treat this as undetectable alignment.
If probing with align=1 fails with EINVAL, but probing with one of the
expected alignments succeeds, we know that we found a working alignment.

In practice the alignment requirements are the same for buffer
alignment, buffer length, and offset in the file. So if we cannot
detect buf_align, we can use the request alignment. If we cannot detect
the request alignment, we can fall back to a safe value. To use this
logic, we probe the request alignment first instead of buf_align.

Here is a table showing the behaviour with the current code (the value
in parentheses is the optimal value):

Case Sector buf_align (opt) request_alignment (opt) result
======================================================================
1    512    512   (1)       512   (512)             OK
1    4096   512   (1)       4096  (4096)            FAIL
----------------------------------------------------------------------
2    512    512   (512)     512   (512)             OK
2    4096   512   (4096)    4096  (4096)            FAIL
----------------------------------------------------------------------
3    512    512   (1)       512   (512)             OK
3    4096   512   (1)       512   (4096)            FAIL
----------------------------------------------------------------------
4    512    512   (1)       512   (1)               OK
4    4096   512   (1)       512   (1)               OK

Same cases with this change:

Case Sector buf_align (opt) request_alignment (opt) result
======================================================================
1    512    512   (1)       512   (512)             OK
1    4096   4096  (1)       4096  (4096)            OK
----------------------------------------------------------------------
2    512    512   (512)     512   (512)             OK
2    4096   4096  (4096)    4096  (4096)            OK
----------------------------------------------------------------------
3    512    4096  (1)       4096  (512)             OK
3    4096   4096  (1)       4096  (4096)            OK
----------------------------------------------------------------------
4    512    4096  (1)       4096  (1)               OK
4    4096   4096  (1)       4096  (1)               OK

I tested that provisioning VMs and copying disks on local XFS and
Gluster with 4k sector size work now, resolving bugs [1],[2].
I also tested on XFS, NFS, and Gluster with 512-byte sector size.

[1] https://bugzilla.redhat.com/1737256
[2] https://bugzilla.redhat.com/1738657

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-08-13 21:21:03 +03:00
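The probing order described in the commit message can be sketched as a small pure helper. This is a hypothetical illustration, not the qemu code: `pick_alignment()` and the mock probes are made up for this sketch, and `probe(align)` stands in for an O_DIRECT read attempt with the given alignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch: walk the candidate alignments in increasing
 * order. A success at align=1 means the probe cannot distinguish "no
 * requirement" from "unallocated area", so the caller's fallback is
 * used instead.
 */
static size_t pick_alignment(bool (*probe)(size_t),
                             const size_t *aligns, size_t n,
                             size_t fallback)
{
    for (size_t i = 0; i < n; i++) {
        if (probe(aligns[i])) {
            /* align == 1 means "undetectable": use the fallback value */
            return aligns[i] != 1 ? aligns[i] : fallback;
        }
    }
    return 0; /* no working alignment found */
}

/* Mock probe: storage with a 4k sector size rejects smaller alignments */
static bool probe_4k(size_t align)
{
    return align % 4096 == 0;
}

/* Mock probe: no alignment requirement at all (e.g. buffered I/O) */
static bool probe_any(size_t align)
{
    (void)align;
    return true;
}
```

With the alignments array from the code below, `probe_4k` yields 4096 (the first candidate the storage accepts), while `probe_any` succeeds immediately at align=1 and therefore falls back to the supplied safe value.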

    /*
     * If we could not get the sizes so far, we can only guess them. First try
     * to detect request alignment, since it is more likely to succeed. Then
     * try to detect buf_align, which cannot be detected in some cases (e.g.
     * Gluster). If buf_align cannot be detected, we fall back to the value of
     * request_alignment.
     */
    if (!bs->bl.request_alignment) {
        int i;
        size_t align;
        buf = qemu_memalign(max_align, max_align);
        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
            align = alignments[i];
            if (raw_is_io_aligned(fd, buf, align)) {
                /* Fallback to safe value. */
                bs->bl.request_alignment = (align != 1) ? align : max_align;
                break;
            }
        }
        qemu_vfree(buf);
    }

    if (!s->buf_align) {
        int i;
        size_t align;
        buf = qemu_memalign(max_align, 2 * max_align);
        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
            align = alignments[i];
            if (raw_is_io_aligned(fd, buf + align, max_align)) {
                /* Fallback to request_alignment. */
                s->buf_align = (align != 1) ? align : bs->bl.request_alignment;
                break;
            }
        }
        qemu_vfree(buf);
    }

    if (!s->buf_align || !bs->bl.request_alignment) {
        error_setg(errp, "Could not find working O_DIRECT alignment");
        error_append_hint(errp, "Try cache.direct=off\n");
    }
}

static int check_hdev_writable(int fd)
{
#if defined(BLKROGET)
    /* Linux block devices can be configured "read-only" using blockdev(8).
     * This is independent of device node permissions and therefore open(2)
     * with O_RDWR succeeds. Actual writes fail with EPERM.
     *
     * bdrv_open() is supposed to fail if the disk is read-only. Explicitly
     * check for read-only block devices so that Linux block devices behave
     * properly.
     */
    struct stat st;
    int readonly = 0;

    if (fstat(fd, &st)) {
        return -errno;
    }

    if (!S_ISBLK(st.st_mode)) {
        return 0;
    }

    if (ioctl(fd, BLKROGET, &readonly) < 0) {
        return -errno;
    }

    if (readonly) {
        return -EACCES;
    }
#endif /* defined(BLKROGET) */
    return 0;
}

static void raw_parse_flags(int bdrv_flags, int *open_flags, bool has_writers)
{
    bool read_write = false;
    assert(open_flags != NULL);

    *open_flags |= O_BINARY;
    *open_flags &= ~O_ACCMODE;

    if (bdrv_flags & BDRV_O_AUTO_RDONLY) {
        read_write = has_writers;
    } else if (bdrv_flags & BDRV_O_RDWR) {
        read_write = true;
    }

    if (read_write) {
        *open_flags |= O_RDWR;
    } else {
        *open_flags |= O_RDONLY;
    }

    /* Use O_DSYNC for write-through caching, no flags for write-back caching,
     * and O_DIRECT for no caching. */
    if ((bdrv_flags & BDRV_O_NOCACHE)) {
        *open_flags |= O_DIRECT;
    }
}
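The cache-mode comment above maps qemu cache settings onto open(2) flags; only the O_DIRECT case appears in this function. A standalone sketch of the full mapping (the enum and helper names are hypothetical, invented for this illustration):

```c
#define _GNU_SOURCE /* for O_DIRECT with glibc */
#include <assert.h>
#include <fcntl.h>

/*
 * Hypothetical sketch of the cache-mode-to-open-flags mapping:
 * O_DSYNC for write-through, no extra flags for write-back, and
 * O_DIRECT to bypass the host page cache entirely.
 */
enum cache_mode {
    CACHE_WRITEBACK,    /* host page cache, no forced flush on write */
    CACHE_WRITETHROUGH, /* flush data to stable storage on every write */
    CACHE_NONE,         /* bypass the host page cache */
};

static int cache_open_flags(enum cache_mode mode)
{
    switch (mode) {
    case CACHE_WRITETHROUGH:
        return O_DSYNC;
    case CACHE_NONE:
        return O_DIRECT;
    default:
        return 0;
    }
}
```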

static void raw_parse_filename(const char *filename, QDict *options,
                               Error **errp)
{
    bdrv_parse_filename_strip_prefix(filename, "file:", options);
}

static QemuOptsList raw_runtime_opts = {
    .name = "raw",
    .head = QTAILQ_HEAD_INITIALIZER(raw_runtime_opts.head),
    .desc = {
        {
            .name = "filename",
            .type = QEMU_OPT_STRING,
            .help = "File name of the image",
        },
        {
            .name = "aio",
            .type = QEMU_OPT_STRING,
            .help = "host AIO implementation (threads, native, io_uring)",
        },
        {
            .name = "aio-max-batch",
            .type = QEMU_OPT_NUMBER,
            .help = "AIO max batch size (0 = auto handled by AIO backend, default: 0)",
        },
        {
            .name = "locking",
            .type = QEMU_OPT_STRING,
            .help = "file locking mode (on/off/auto, default: auto)",
        },
scsi, file-posix: add support for persistent reservation management

It is a common requirement for virtual machines to send persistent
reservations, but this currently requires either running QEMU with
CAP_SYS_RAWIO, or using out-of-tree patches that let an unprivileged
QEMU bypass Linux's filter on SG_IO commands.

As an alternative mechanism, the next patches will introduce a
privileged helper to run persistent reservation commands without
expanding QEMU's attack surface unnecessarily.

The helper is invoked through a "pr-manager" QOM object, to which
file-posix.c passes SG_IO requests for PERSISTENT RESERVE OUT and
PERSISTENT RESERVE IN commands. For example:

    $ qemu-system-x86_64
        -device virtio-scsi \
        -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock
        -drive if=none,id=hd,driver=raw,file.filename=/dev/sdb,file.pr-manager=helper0
        -device scsi-block,drive=hd

or:

    $ qemu-system-x86_64
        -device virtio-scsi \
        -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock
        -blockdev node-name=hd,driver=raw,file.driver=host_device,file.filename=/dev/sdb,file.pr-manager=helper0
        -device scsi-block,drive=hd

Multiple pr-manager implementations are conceivable and possible, though
only one is implemented right now. For example, a pr-manager could:

- talk directly to the multipath daemon from a privileged QEMU
  (i.e. QEMU links to libmpathpersist); this makes reservations work
  properly with multipath, but still requires CAP_SYS_RAWIO
- use the Linux IOC_PR_* ioctls (they require CAP_SYS_ADMIN though)
- more interestingly, implement reservations directly in QEMU
  through file system locks or a shared database (e.g. sqlite)

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-21 19:58:56 +03:00
        {
            .name = "pr-manager",
            .type = QEMU_OPT_STRING,
            .help = "id of persistent reservation manager object (default: none)",
        },
#if defined(__linux__)
        {
            .name = "drop-cache",
            .type = QEMU_OPT_BOOL,
            .help = "invalidate page cache during live migration (default: on)",
        },
#endif
        {
            .name = "x-check-cache-dropped",
            .type = QEMU_OPT_BOOL,
            .help = "check that page cache was dropped on live migration (default: off)"
        },
        { /* end of list */ }
    },
};

static const char *const mutable_opts[] = { "x-check-cache-dropped", NULL };

static int raw_open_common(BlockDriverState *bs, QDict *options,
                           int bdrv_flags, int open_flags,
                           bool device, Error **errp)
{
    BDRVRawState *s = bs->opaque;
    QemuOpts *opts;
    Error *local_err = NULL;
    const char *filename = NULL;
    const char *str;
    BlockdevAioOptions aio, aio_default;
    int fd, ret;
    struct stat st;
    OnOffAuto locking;

    opts = qemu_opts_create(&raw_runtime_opts, NULL, 0, &error_abort);
    if (!qemu_opts_absorb_qdict(opts, options, errp)) {
        ret = -EINVAL;
        goto fail;
    }

    filename = qemu_opt_get(opts, "filename");

    ret = raw_normalize_devicepath(&filename, errp);
    if (ret != 0) {
        goto fail;
    }

    if (bdrv_flags & BDRV_O_NATIVE_AIO) {
        aio_default = BLOCKDEV_AIO_OPTIONS_NATIVE;
#ifdef CONFIG_LINUX_IO_URING
    } else if (bdrv_flags & BDRV_O_IO_URING) {
        aio_default = BLOCKDEV_AIO_OPTIONS_IO_URING;
#endif
    } else {
        aio_default = BLOCKDEV_AIO_OPTIONS_THREADS;
    }

    aio = qapi_enum_parse(&BlockdevAioOptions_lookup,
                          qemu_opt_get(opts, "aio"),
                          aio_default, &local_err);
    if (local_err) {
        error_propagate(errp, local_err);
        ret = -EINVAL;
        goto fail;
    }

    s->use_linux_aio = (aio == BLOCKDEV_AIO_OPTIONS_NATIVE);
#ifdef CONFIG_LINUX_IO_URING
    s->use_linux_io_uring = (aio == BLOCKDEV_AIO_OPTIONS_IO_URING);
#endif

    s->aio_max_batch = qemu_opt_get_number(opts, "aio-max-batch", 0);

    locking = qapi_enum_parse(&OnOffAuto_lookup,
                              qemu_opt_get(opts, "locking"),
                              ON_OFF_AUTO_AUTO, &local_err);
    if (local_err) {
        error_propagate(errp, local_err);
        ret = -EINVAL;
        goto fail;
    }
    switch (locking) {
    case ON_OFF_AUTO_ON:
        s->use_lock = true;
file-posix: Do runtime check for ofd lock API

It is reported that on Windows Subsystem for Linux, OFD lock operations
fail with -EINVAL. In other words, a QEMU binary built with system
headers that export F_OFD_SETLK doesn't necessarily run in an
environment that actually supports it:

    $ qemu-system-aarch64 ... -drive file=test.vhdx,if=none,id=hd0 \
        -device virtio-blk-pci,drive=hd0
    qemu-system-aarch64: -drive file=test.vhdx,if=none,id=hd0: Failed to unlock byte 100
    qemu-system-aarch64: -drive file=test.vhdx,if=none,id=hd0: Failed to unlock byte 100
    qemu-system-aarch64: -drive file=test.vhdx,if=none,id=hd0: Failed to lock byte 100

As a matter of fact this is not WSL specific. It can happen when running
a QEMU compiled against a newer glibc on an older kernel, such as in
a containerized environment.

Let's do a runtime check to cope with that.

Reported-by: Andrew Baumann <Andrew.Baumann@microsoft.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2017-08-11 14:44:47 +03:00
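Such a runtime check can be sketched roughly as follows. This is a hypothetical helper, similar in spirit to `qemu_has_ofd_lock()` but not the qemu implementation; it assumes a POSIX host where `F_OFD_GETLK` may be missing from old headers (compile time) or rejected with EINVAL by an old kernel (runtime).

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical sketch of a runtime probe for OFD lock support.
 * F_OFD_GETLK only queries lock state, so it never takes a lock;
 * an EINVAL return means the kernel does not know the command.
 */
static bool probe_ofd_lock(int fd)
{
#ifdef F_OFD_GETLK
    struct flock fl;

    memset(&fl, 0, sizeof(fl));  /* OFD commands require l_pid == 0 */
    fl.l_whence = SEEK_SET;
    fl.l_type = F_WRLCK;
    return fcntl(fd, F_OFD_GETLK, &fl) == 0 || errno != EINVAL;
#else
    (void)fd;
    return false;                /* headers too old to even try */
#endif
}
```

The result can be cached once at startup, which is what the warning path below relies on: when the probe fails, qemu falls back to POSIX locks and warns that they can be lost unexpectedly.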
        if (!qemu_has_ofd_lock()) {
            warn_report("File lock requested but OFD locking syscall is "
                        "unavailable, falling back to POSIX file locks");
            error_printf("Due to the implementation, locks can be lost "
                         "unexpectedly.\n");
2017-08-11 14:44:47 +03:00
|
|
|
}
|
2017-05-02 19:35:56 +03:00
|
|
|
break;
|
|
|
|
case ON_OFF_AUTO_OFF:
|
|
|
|
s->use_lock = false;
|
|
|
|
break;
|
|
|
|
case ON_OFF_AUTO_AUTO:
|
2017-08-11 14:44:47 +03:00
|
|
|
s->use_lock = qemu_has_ofd_lock();
|
2017-05-02 19:35:56 +03:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
abort();
|
|
|
|
}
|
|
|
|
|
scsi, file-posix: add support for persistent reservation management
It is a common requirement for virtual machines to send persistent
reservations, but this currently requires either running QEMU with
CAP_SYS_RAWIO, or using out-of-tree patches that let an unprivileged
QEMU bypass Linux's filter on SG_IO commands.
As an alternative mechanism, the next patches will introduce a
privileged helper to run persistent reservation commands without
expanding QEMU's attack surface unnecessarily.
The helper is invoked through a "pr-manager" QOM object, to which
file-posix.c passes SG_IO requests for PERSISTENT RESERVE OUT and
PERSISTENT RESERVE IN commands. For example:
$ qemu-system-x86_64
-device virtio-scsi \
-object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock
-drive if=none,id=hd,driver=raw,file.filename=/dev/sdb,file.pr-manager=helper0
-device scsi-block,drive=hd
or:
$ qemu-system-x86_64
-device virtio-scsi \
-object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock
-blockdev node-name=hd,driver=raw,file.driver=host_device,file.filename=/dev/sdb,file.pr-manager=helper0
-device scsi-block,drive=hd
Multiple pr-manager implementations are conceivable, though
only one is implemented right now. For example, a pr-manager could:
- talk directly to the multipath daemon from a privileged QEMU
(i.e. QEMU links to libmpathpersist); this makes reservation work
properly with multipath, but still requires CAP_SYS_RAWIO
- use the Linux IOC_PR_* ioctls (they require CAP_SYS_ADMIN though)
- more interestingly, implement reservations directly in QEMU
through file system locks or a shared database (e.g. sqlite)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-21 19:58:56 +03:00
|
|
|
str = qemu_opt_get(opts, "pr-manager");
|
|
|
|
if (str) {
|
|
|
|
s->pr_mgr = pr_manager_lookup(str, &local_err);
|
|
|
|
if (local_err) {
|
|
|
|
error_propagate(errp, local_err);
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-03-07 19:49:41 +03:00
|
|
|
s->drop_cache = qemu_opt_get_bool(opts, "drop-cache", true);
|
2018-04-27 19:23:12 +03:00
|
|
|
s->check_cache_dropped = qemu_opt_get_bool(opts, "x-check-cache-dropped",
|
|
|
|
false);
|
|
|
|
|
2012-09-20 23:13:21 +04:00
|
|
|
s->open_flags = open_flags;
|
2019-03-02 00:15:11 +03:00
|
|
|
raw_parse_flags(bdrv_flags, &s->open_flags, false);
|
2006-08-01 20:21:11 +04:00
|
|
|
|
2009-06-15 15:53:38 +04:00
|
|
|
s->fd = -1;
|
2020-07-01 17:22:43 +03:00
|
|
|
fd = qemu_open(filename, s->open_flags, errp);
|
2018-10-08 18:27:18 +03:00
|
|
|
ret = fd < 0 ? -errno : 0;
|
|
|
|
|
|
|
|
if (ret < 0) {
|
2013-04-02 12:47:40 +04:00
|
|
|
if (ret == -EROFS) {
|
2006-08-19 15:45:59 +04:00
|
|
|
ret = -EACCES;
|
2013-04-02 12:47:40 +04:00
|
|
|
}
|
|
|
|
goto fail;
|
2006-08-19 15:45:59 +04:00
|
|
|
}
|
2006-08-01 20:21:11 +04:00
|
|
|
s->fd = fd;
|
2009-08-20 18:58:19 +04:00
|
|
|
|
2020-07-17 13:54:25 +03:00
|
|
|
/* Check s->open_flags rather than bdrv_flags due to auto-read-only */
|
|
|
|
if (s->open_flags & O_RDWR) {
|
|
|
|
ret = check_hdev_writable(s->fd);
|
|
|
|
if (ret < 0) {
|
|
|
|
error_setg_errno(errp, -ret, "The device is not writable");
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-05-02 19:35:56 +03:00
|
|
|
s->perm = 0;
|
|
|
|
s->shared_perm = BLK_PERM_ALL;
|
|
|
|
|
2009-08-20 18:58:35 +04:00
|
|
|
#ifdef CONFIG_LINUX_AIO
|
2016-09-08 16:09:01 +03:00
|
|
|
/* Currently Linux does AIO only for files opened with O_DIRECT */
|
linux-aio: properly bubble up errors from initialization
laio_init() can fail for a couple of reasons, which will lead to a NULL
pointer dereference in laio_attach_aio_context().
To solve this, add an aio_setup_linux_aio() function, which is called
early in raw_open_common. If this fails, propagate the error up. The
signature of aio_get_linux_aio() was not modified, because it seems
preferable to return the actual errno from the possible failing
initialization calls.
Additionally, when the AioContext changes, we need to associate a
LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context
callback and call the new aio_setup_linux_aio(), which will allocate a
new AioContext if needed, and return errors on failures. If it fails for
any reason, fall back to threaded AIO with an error message, as the
device is already in-use by the guest.
Add an assert that aio_get_linux_aio() cannot return NULL.
Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>
Message-id: 20180622193700.6523-1-naravamudan@digitalocean.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-22 22:37:00 +03:00
|
|
|
if (s->use_linux_aio) {
|
|
|
|
if (!(s->open_flags & O_DIRECT)) {
|
|
|
|
error_setg(errp, "aio=native was specified, but it requires "
|
|
|
|
"cache.direct=on, which was not specified.");
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
if (!aio_setup_linux_aio(bdrv_get_aio_context(bs), errp)) {
|
|
|
|
error_prepend(errp, "Unable to use native AIO: ");
|
|
|
|
goto fail;
|
|
|
|
}
|
2015-03-17 15:45:21 +03:00
|
|
|
}
|
2015-07-23 15:48:34 +03:00
|
|
|
#else
|
2016-09-08 16:09:01 +03:00
|
|
|
if (s->use_linux_aio) {
|
2015-12-15 13:35:36 +03:00
|
|
|
error_setg(errp, "aio=native was specified, but is not supported "
|
|
|
|
"in this build.");
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto fail;
|
2015-07-23 15:48:34 +03:00
|
|
|
}
|
|
|
|
#endif /* !defined(CONFIG_LINUX_AIO) */
|
2009-08-20 18:58:19 +04:00
|
|
|
|
2020-01-20 17:18:51 +03:00
|
|
|
#ifdef CONFIG_LINUX_IO_URING
|
|
|
|
if (s->use_linux_io_uring) {
|
|
|
|
if (!aio_setup_linux_io_uring(bdrv_get_aio_context(bs), errp)) {
|
|
|
|
error_prepend(errp, "Unable to use io_uring: ");
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
if (s->use_linux_io_uring) {
|
|
|
|
error_setg(errp, "aio=io_uring was specified, but is not supported "
|
|
|
|
"in this build.");
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
#endif /* !defined(CONFIG_LINUX_IO_URING) */
|
|
|
|
|
2013-11-22 16:39:47 +04:00
|
|
|
s->has_discard = true;
|
2013-11-22 16:39:57 +04:00
|
|
|
s->has_write_zeroes = true;
|
2013-11-22 16:39:55 +04:00
|
|
|
|
|
|
|
if (fstat(s->fd, &st) < 0) {
|
2014-12-02 20:32:53 +03:00
|
|
|
ret = -errno;
|
2013-11-22 16:39:55 +04:00
|
|
|
error_setg_errno(errp, errno, "Could not stat file");
|
|
|
|
goto fail;
|
|
|
|
}
|
2018-07-10 20:00:40 +03:00
|
|
|
|
|
|
|
if (!device) {
|
2021-02-22 14:16:32 +03:00
|
|
|
if (!S_ISREG(st.st_mode)) {
|
|
|
|
error_setg(errp, "'%s' driver requires '%s' to be a regular file",
|
|
|
|
bs->drv->format_name, bs->filename);
|
2018-07-10 20:00:40 +03:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto fail;
|
|
|
|
} else {
|
|
|
|
s->discard_zeroes = true;
|
|
|
|
s->has_fallocate = true;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
if (!(S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode))) {
|
2021-02-22 14:16:32 +03:00
|
|
|
error_setg(errp, "'%s' driver requires '%s' to be either "
|
|
|
|
"a character or block device",
|
|
|
|
bs->drv->format_name, bs->filename);
|
2018-07-10 20:00:40 +03:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto fail;
|
|
|
|
}
|
2013-11-22 16:39:55 +04:00
|
|
|
}
|
2018-07-10 20:00:40 +03:00
|
|
|
|
2013-11-22 16:39:56 +04:00
|
|
|
if (S_ISBLK(st.st_mode)) {
|
|
|
|
#ifdef BLKDISCARDZEROES
|
|
|
|
unsigned int arg;
|
|
|
|
if (ioctl(s->fd, BLKDISCARDZEROES, &arg) == 0 && arg) {
|
|
|
|
s->discard_zeroes = true;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#ifdef __linux__
|
|
|
|
/* On Linux 3.10, BLKDISCARD leaves stale data in the page cache. Do
|
|
|
|
* not rely on the contents of discarded blocks unless using O_DIRECT.
|
2013-11-22 16:39:57 +04:00
|
|
|
* Same for BLKZEROOUT.
|
2013-11-22 16:39:56 +04:00
|
|
|
*/
|
|
|
|
if (!(bs->open_flags & BDRV_O_NOCACHE)) {
|
|
|
|
s->discard_zeroes = false;
|
2013-11-22 16:39:57 +04:00
|
|
|
s->has_write_zeroes = false;
|
2013-11-22 16:39:56 +04:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
}
|
2014-10-21 18:03:03 +04:00
|
|
|
#ifdef __FreeBSD__
|
|
|
|
if (S_ISCHR(st.st_mode)) {
|
|
|
|
/*
|
|
|
|
* The file is a char device (disk), which on FreeBSD isn't behind
|
|
|
|
* a pager, so force all requests to be aligned. This is needed
|
|
|
|
* so QEMU makes sure all IO operations on the device are aligned
|
|
|
|
* to sector size, or else FreeBSD will reject them with EINVAL.
|
|
|
|
*/
|
2021-11-16 13:14:31 +03:00
|
|
|
s->force_alignment = true;
|
2014-10-21 18:03:03 +04:00
|
|
|
}
|
|
|
|
#endif
|
2021-11-16 13:14:31 +03:00
|
|
|
s->needs_alignment = raw_needs_alignment(bs);
|
2013-11-22 16:39:55 +04:00
|
|
|
|
2019-03-22 15:45:23 +03:00
|
|
|
bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK;
|
2020-04-24 15:54:44 +03:00
|
|
|
if (S_ISREG(st.st_mode)) {
|
|
|
|
/* When extending regular files, we get zeros from the OS */
|
|
|
|
bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
|
|
|
|
}
|
2013-04-02 12:47:40 +04:00
|
|
|
ret = 0;
|
|
|
|
fail:
|
2020-07-17 13:54:26 +03:00
|
|
|
if (ret < 0 && s->fd != -1) {
|
|
|
|
qemu_close(s->fd);
|
|
|
|
}
|
2014-04-11 21:16:36 +04:00
|
|
|
if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
|
|
|
|
unlink(filename);
|
|
|
|
}
|
2013-04-02 12:47:40 +04:00
|
|
|
qemu_opts_del(opts);
|
|
|
|
return ret;
|
2006-08-01 20:21:11 +04:00
|
|
|
}
|
|
|
|
|
2013-09-05 16:22:29 +04:00
|
|
|
static int raw_open(BlockDriverState *bs, QDict *options, int flags,
|
|
|
|
Error **errp)
|
2009-06-15 15:53:38 +04:00
|
|
|
{
|
|
|
|
BDRVRawState *s = bs->opaque;
|
|
|
|
|
|
|
|
s->type = FTYPE_FILE;
|
2018-07-10 20:00:40 +03:00
|
|
|
return raw_open_common(bs, options, flags, 0, false, errp);
|
2009-06-15 15:53:38 +04:00
|
|
|
}
|
|
|
|
|
2017-05-02 19:35:56 +03:00
|
|
|
typedef enum {
|
|
|
|
RAW_PL_PREPARE,
|
|
|
|
RAW_PL_COMMIT,
|
|
|
|
RAW_PL_ABORT,
|
|
|
|
} RawPermLockOp;
|
|
|
|
|
|
|
|
#define PERM_FOREACH(i) \
|
|
|
|
for ((i) = 0; (1ULL << (i)) <= BLK_PERM_ALL; i++)
|
|
|
|
|
|
|
|
/* Lock bytes indicated by @perm_lock_bits and @shared_perm_lock_bits in the
|
|
|
|
* file; if @unlock == true, also unlock the unneeded bytes.
|
|
|
|
* @shared_perm_lock_bits is the mask of all permissions that are NOT shared.
|
|
|
|
*/
|
file-posix: Skip effectiveless OFD lock operations
If we know we've already locked the bytes, don't do it again; similarly
don't unlock a byte if we haven't locked it. This doesn't change the
behavior, but fixes a corner case explained below.
Libvirt had an error handling bug that an image can get its (ownership,
file mode, SELinux) permissions changed (RHBZ 1584982) by mistake behind
QEMU. Specifically, an image in use by a Libvirt VM has:
$ ls -lhZ b.img
-rw-r--r--. qemu qemu system_u:object_r:svirt_image_t:s0:c600,c690 b.img
Trying to attach it a second time won't work because of image locking.
And after the error, it becomes:
$ ls -lhZ b.img
-rw-r--r--. root root system_u:object_r:virt_image_t:s0 b.img
Then, we won't be able to do OFD lock operations with the existing fd.
In other words, the code such as in blk_detach_dev:
blk_set_perm(blk, 0, BLK_PERM_ALL, &error_abort);
can abort() QEMU because of these environmental changes.
This patch is an easy fix to the problem, and the change is reasonable
regardless, so do it.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-10-11 10:21:33 +03:00
|
|
|
static int raw_apply_lock_bytes(BDRVRawState *s, int fd,
|
2017-05-02 19:35:56 +03:00
|
|
|
uint64_t perm_lock_bits,
|
|
|
|
uint64_t shared_perm_lock_bits,
|
|
|
|
bool unlock, Error **errp)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
int i;
|
2018-10-11 10:21:33 +03:00
|
|
|
uint64_t locked_perm, locked_shared_perm;
|
|
|
|
|
|
|
|
if (s) {
|
|
|
|
locked_perm = s->locked_perm;
|
|
|
|
locked_shared_perm = s->locked_shared_perm;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* We don't have the previous bits, just lock/unlock for each of the
|
|
|
|
* requested bits.
|
|
|
|
*/
|
|
|
|
if (unlock) {
|
|
|
|
locked_perm = BLK_PERM_ALL;
|
|
|
|
locked_shared_perm = BLK_PERM_ALL;
|
|
|
|
} else {
|
|
|
|
locked_perm = 0;
|
|
|
|
locked_shared_perm = 0;
|
|
|
|
}
|
|
|
|
}
|
2017-05-02 19:35:56 +03:00
|
|
|
|
|
|
|
PERM_FOREACH(i) {
|
|
|
|
int off = RAW_LOCK_PERM_BASE + i;
|
2018-10-11 10:21:33 +03:00
|
|
|
uint64_t bit = (1ULL << i);
|
|
|
|
if ((perm_lock_bits & bit) && !(locked_perm & bit)) {
|
2018-05-10 00:53:34 +03:00
|
|
|
ret = qemu_lock_fd(fd, off, 1, false);
|
2017-05-02 19:35:56 +03:00
|
|
|
if (ret) {
|
2021-01-13 19:44:47 +03:00
|
|
|
raw_lock_error_setg_errno(errp, -ret, "Failed to lock byte %d",
|
|
|
|
off);
|
2017-05-02 19:35:56 +03:00
|
|
|
return ret;
|
2018-10-11 10:21:33 +03:00
|
|
|
} else if (s) {
|
|
|
|
s->locked_perm |= bit;
|
2017-05-02 19:35:56 +03:00
|
|
|
}
|
2018-10-11 10:21:33 +03:00
|
|
|
} else if (unlock && (locked_perm & bit) && !(perm_lock_bits & bit)) {
|
2018-05-10 00:53:34 +03:00
|
|
|
ret = qemu_unlock_fd(fd, off, 1);
|
2017-05-02 19:35:56 +03:00
|
|
|
if (ret) {
|
2021-01-13 19:44:47 +03:00
|
|
|
error_setg_errno(errp, -ret, "Failed to unlock byte %d", off);
|
2017-05-02 19:35:56 +03:00
|
|
|
return ret;
|
2018-10-11 10:21:33 +03:00
|
|
|
} else if (s) {
|
|
|
|
s->locked_perm &= ~bit;
|
2017-05-02 19:35:56 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
PERM_FOREACH(i) {
|
|
|
|
int off = RAW_LOCK_SHARED_BASE + i;
|
2018-10-11 10:21:33 +03:00
|
|
|
uint64_t bit = (1ULL << i);
|
|
|
|
if ((shared_perm_lock_bits & bit) && !(locked_shared_perm & bit)) {
|
2018-05-10 00:53:34 +03:00
|
|
|
ret = qemu_lock_fd(fd, off, 1, false);
|
2017-05-02 19:35:56 +03:00
|
|
|
if (ret) {
|
2021-01-13 19:44:47 +03:00
|
|
|
raw_lock_error_setg_errno(errp, -ret, "Failed to lock byte %d",
|
|
|
|
off);
|
2017-05-02 19:35:56 +03:00
|
|
|
return ret;
|
2018-10-11 10:21:33 +03:00
|
|
|
} else if (s) {
|
|
|
|
s->locked_shared_perm |= bit;
|
2017-05-02 19:35:56 +03:00
|
|
|
}
|
2018-10-11 10:21:33 +03:00
|
|
|
} else if (unlock && (locked_shared_perm & bit) &&
|
|
|
|
!(shared_perm_lock_bits & bit)) {
|
2018-05-10 00:53:34 +03:00
|
|
|
ret = qemu_unlock_fd(fd, off, 1);
|
2017-05-02 19:35:56 +03:00
|
|
|
if (ret) {
|
2021-01-13 19:44:47 +03:00
|
|
|
error_setg_errno(errp, -ret, "Failed to unlock byte %d", off);
|
2017-05-02 19:35:56 +03:00
|
|
|
return ret;
|
2018-10-11 10:21:33 +03:00
|
|
|
} else if (s) {
|
|
|
|
s->locked_shared_perm &= ~bit;
|
2017-05-02 19:35:56 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Check "unshared" bytes implied by @perm and ~@shared_perm in the file. */
|
2018-05-10 00:53:34 +03:00
|
|
|
static int raw_check_lock_bytes(int fd, uint64_t perm, uint64_t shared_perm,
|
2017-05-02 19:35:56 +03:00
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
PERM_FOREACH(i) {
|
|
|
|
int off = RAW_LOCK_SHARED_BASE + i;
|
|
|
|
uint64_t p = 1ULL << i;
|
|
|
|
if (perm & p) {
|
2018-05-10 00:53:34 +03:00
|
|
|
ret = qemu_lock_fd_test(fd, off, 1, true);
|
2017-05-02 19:35:56 +03:00
|
|
|
if (ret) {
|
|
|
|
char *perm_name = bdrv_perm_names(p);
|
2021-01-13 19:44:47 +03:00
|
|
|
|
|
|
|
raw_lock_error_setg_errno(errp, -ret,
|
|
|
|
"Failed to get \"%s\" lock",
|
|
|
|
perm_name);
|
2017-05-02 19:35:56 +03:00
|
|
|
g_free(perm_name);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
PERM_FOREACH(i) {
|
|
|
|
int off = RAW_LOCK_PERM_BASE + i;
|
|
|
|
uint64_t p = 1ULL << i;
|
|
|
|
if (!(shared_perm & p)) {
|
2018-05-10 00:53:34 +03:00
|
|
|
ret = qemu_lock_fd_test(fd, off, 1, true);
|
2017-05-02 19:35:56 +03:00
|
|
|
if (ret) {
|
|
|
|
char *perm_name = bdrv_perm_names(p);
|
2021-01-13 19:44:47 +03:00
|
|
|
|
|
|
|
raw_lock_error_setg_errno(errp, -ret,
|
|
|
|
"Failed to get shared \"%s\" lock",
|
|
|
|
perm_name);
|
2017-05-02 19:35:56 +03:00
|
|
|
g_free(perm_name);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int raw_handle_perm_lock(BlockDriverState *bs,
|
|
|
|
RawPermLockOp op,
|
|
|
|
uint64_t new_perm, uint64_t new_shared,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
BDRVRawState *s = bs->opaque;
|
|
|
|
int ret = 0;
|
|
|
|
Error *local_err = NULL;
|
|
|
|
|
|
|
|
if (!s->use_lock) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bdrv_get_flags(bs) & BDRV_O_INACTIVE) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (op) {
|
|
|
|
case RAW_PL_PREPARE:
|
2019-03-29 14:04:54 +03:00
|
|
|
if ((s->perm | new_perm) == s->perm &&
|
|
|
|
(s->shared_perm & new_shared) == s->shared_perm)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We are going to unlock bytes; this should not fail. If it fails due
|
|
|
|
* to some fs-dependent, permission-unrelated reason (which sometimes
|
|
|
|
* occurs on NFS and leads to an abort in bdrv_replace_child), we
|
|
|
|
* can't prevent such errors by any check here. And we ignore them
|
|
|
|
* anyway in ABORT and COMMIT.
|
|
|
|
*/
|
|
|
|
return 0;
|
|
|
|
}
|
2018-10-11 10:21:34 +03:00
|
|
|
ret = raw_apply_lock_bytes(s, s->fd, s->perm | new_perm,
|
2017-05-02 19:35:56 +03:00
|
|
|
~s->shared_perm | ~new_shared,
|
|
|
|
false, errp);
|
|
|
|
if (!ret) {
|
2018-10-11 10:21:34 +03:00
|
|
|
ret = raw_check_lock_bytes(s->fd, new_perm, new_shared, errp);
|
2017-05-02 19:35:56 +03:00
|
|
|
if (!ret) {
|
|
|
|
return 0;
|
|
|
|
}
|
2018-09-25 08:05:01 +03:00
|
|
|
error_append_hint(errp,
|
|
|
|
"Is another process using the image [%s]?\n",
|
|
|
|
bs->filename);
|
2017-05-02 19:35:56 +03:00
|
|
|
}
|
|
|
|
/* fall through to unlock bytes. */
|
|
|
|
case RAW_PL_ABORT:
|
2018-10-11 10:21:34 +03:00
|
|
|
raw_apply_lock_bytes(s, s->fd, s->perm, ~s->shared_perm,
|
2018-05-10 00:53:34 +03:00
|
|
|
true, &local_err);
|
2017-05-02 19:35:56 +03:00
|
|
|
if (local_err) {
|
|
|
|
/* Theoretically the above call only unlocks bytes and it cannot
|
|
|
|
* fail. Something weird happened, report it.
|
|
|
|
*/
|
2018-11-01 09:29:09 +03:00
|
|
|
warn_report_err(local_err);
|
2017-05-02 19:35:56 +03:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
case RAW_PL_COMMIT:
|
2018-10-11 10:21:34 +03:00
|
|
|
raw_apply_lock_bytes(s, s->fd, new_perm, ~new_shared,
|
2018-05-10 00:53:34 +03:00
|
|
|
true, &local_err);
|
2017-05-02 19:35:56 +03:00
|
|
|
if (local_err) {
|
|
|
|
/* Theoretically the above call only unlocks bytes and it cannot
|
|
|
|
* fail. Something weird happened, report it.
|
|
|
|
*/
|
2018-11-01 09:29:09 +03:00
|
|
|
warn_report_err(local_err);
|
2017-05-02 19:35:56 +03:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
static int raw_reconfigure_getfd(BlockDriverState *bs, int flags,
                                 int *open_flags, uint64_t perm, bool force_dup,
                                 Error **errp)
{
    BDRVRawState *s = bs->opaque;
    int fd = -1;
    int ret;
    bool has_writers = perm &
        (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED | BLK_PERM_RESIZE);
    int fcntl_flags = O_APPEND | O_NONBLOCK;
#ifdef O_NOATIME
    fcntl_flags |= O_NOATIME;
#endif

    *open_flags = 0;
    if (s->type == FTYPE_CD) {
        *open_flags |= O_NONBLOCK;
    }

    raw_parse_flags(flags, open_flags, has_writers);

#ifdef O_ASYNC
    /* Not all operating systems have O_ASYNC, and those that don't
     * will not let us track the state into rs->open_flags (typically
     * you achieve the same effect with an ioctl, for example I_SETSIG
     * on Solaris). But we do not use O_ASYNC, so that's fine.
     */
    assert((s->open_flags & O_ASYNC) == 0);
#endif

    if (!force_dup && *open_flags == s->open_flags) {
        /* We're lucky, the existing fd is fine */
        return s->fd;
    }

    if ((*open_flags & ~fcntl_flags) == (s->open_flags & ~fcntl_flags)) {
        /* dup the original fd */
        fd = qemu_dup(s->fd);
        if (fd >= 0) {
            ret = fcntl_setfl(fd, *open_flags);
            if (ret) {
                qemu_close(fd);
                fd = -1;
            }
        }
    }

    /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
    if (fd == -1) {
        const char *normalized_filename = bs->filename;
        ret = raw_normalize_devicepath(&normalized_filename, errp);
        if (ret >= 0) {
            fd = qemu_open(normalized_filename, *open_flags, errp);
            if (fd == -1) {
                return -1;
            }
        }
    }

    if (fd != -1 && (*open_flags & O_RDWR)) {
        ret = check_hdev_writable(fd);
        if (ret < 0) {
            qemu_close(fd);
            error_setg_errno(errp, -ret, "The device is not writable");
            return -1;
        }
    }

    return fd;
}

static int raw_reopen_prepare(BDRVReopenState *state,
                              BlockReopenQueue *queue, Error **errp)
{
    BDRVRawState *s;
    BDRVRawReopenState *rs;
    QemuOpts *opts;
    int ret;

    assert(state != NULL);
    assert(state->bs != NULL);

    s = state->bs->opaque;

    state->opaque = g_new0(BDRVRawReopenState, 1);
    rs = state->opaque;

    /* Handle options changes */
    opts = qemu_opts_create(&raw_runtime_opts, NULL, 0, &error_abort);
    if (!qemu_opts_absorb_qdict(opts, state->options, errp)) {
        ret = -EINVAL;
        goto out;
    }

    rs->drop_cache = qemu_opt_get_bool_del(opts, "drop-cache", true);
    rs->check_cache_dropped =
        qemu_opt_get_bool_del(opts, "x-check-cache-dropped", false);

    /* This driver's reopen function doesn't currently allow changing
     * other options, so let's put them back in the original QDict and
     * bdrv_reopen_prepare() will detect changes and complain. */
    qemu_opts_to_qdict(opts, state->options);

    /*
     * As part of reopen prepare we also want to create a new fd via
     * raw_reconfigure_getfd(). But it wants the updated "perm", while in
     * bdrv_reopen_multiple() the .bdrv_reopen_prepare() callback is called
     * prior to the permission update. Happily, the permission update is
     * always a part (a separate stage) of bdrv_reopen_multiple(), so we can
     * rely on this fact and reconfigure the fd in raw_check_perm().
     */
    s->reopen_state = state;
    ret = 0;

out:
    qemu_opts_del(opts);
    return ret;
}

static void raw_reopen_commit(BDRVReopenState *state)
{
    BDRVRawReopenState *rs = state->opaque;
    BDRVRawState *s = state->bs->opaque;

    s->drop_cache = rs->drop_cache;
    s->check_cache_dropped = rs->check_cache_dropped;
    s->open_flags = rs->open_flags;
    g_free(state->opaque);
    state->opaque = NULL;

    assert(s->reopen_state == state);
    s->reopen_state = NULL;
}


static void raw_reopen_abort(BDRVReopenState *state)
{
    BDRVRawReopenState *rs = state->opaque;
    BDRVRawState *s = state->bs->opaque;

    /* nothing to do if NULL, we didn't get far enough */
    if (rs == NULL) {
        return;
    }

    g_free(state->opaque);
    state->opaque = NULL;

    assert(s->reopen_state == state);
    s->reopen_state = NULL;
}

static int hdev_get_max_hw_transfer(int fd, struct stat *st)
{
#ifdef BLKSECTGET
    if (S_ISBLK(st->st_mode)) {
        unsigned short max_sectors = 0;
        if (ioctl(fd, BLKSECTGET, &max_sectors) == 0) {
            return max_sectors * 512;
        }
    } else {
        int max_bytes = 0;
        if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
            return max_bytes;
        }
    }
    return -errno;
#else
    return -ENOSYS;
#endif
}

static int hdev_get_max_segments(int fd, struct stat *st)
{
#ifdef CONFIG_LINUX
    char buf[32];
    const char *end;
    char *sysfspath = NULL;
    int ret;
    int sysfd = -1;
    long max_segments;

    if (S_ISCHR(st->st_mode)) {
        if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
            return ret;
        }
        return -ENOTSUP;
    }

    if (!S_ISBLK(st->st_mode)) {
        return -ENOTSUP;
    }

    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
                                major(st->st_rdev), minor(st->st_rdev));
    sysfd = open(sysfspath, O_RDONLY);
    if (sysfd == -1) {
        ret = -errno;
        goto out;
    }
    do {
        ret = read(sysfd, buf, sizeof(buf) - 1);
    } while (ret == -1 && errno == EINTR);
    if (ret < 0) {
        ret = -errno;
        goto out;
    } else if (ret == 0) {
        ret = -EIO;
        goto out;
    }
    buf[ret] = 0;
    /* The file ends with '\n'; pass 'end' to accept that. */
    ret = qemu_strtol(buf, &end, 10, &max_segments);
    if (ret == 0 && end && *end == '\n') {
        ret = max_segments;
    }

out:
    if (sysfd != -1) {
        close(sysfd);
    }
    g_free(sysfspath);
    return ret;
#else
    return -ENOTSUP;
#endif
}

static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
{
    BDRVRawState *s = bs->opaque;
    struct stat st;

    s->needs_alignment = raw_needs_alignment(bs);
    raw_probe_alignment(bs, s->fd, errp);

    bs->bl.min_mem_alignment = s->buf_align;
    bs->bl.opt_mem_alignment = MAX(s->buf_align, qemu_real_host_page_size);

    /*
     * Maximum transfers are best effort, so it is okay to ignore any
     * errors. That said, based on the man page errors in fstat would be
     * very much unexpected; the only possible case seems to be ENOMEM.
     */
    if (fstat(s->fd, &st)) {
        return;
    }

#if defined(__APPLE__) && (__MACH__)
    struct statfs buf;

    if (!fstatfs(s->fd, &buf)) {
        bs->bl.opt_transfer = buf.f_iosize;
        bs->bl.pdiscard_alignment = buf.f_bsize;
    }
#endif

    if (bs->sg || S_ISBLK(st.st_mode)) {
        int ret = hdev_get_max_hw_transfer(s->fd, &st);

        if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
            bs->bl.max_hw_transfer = ret;
        }

        ret = hdev_get_max_segments(s->fd, &st);
        if (ret > 0) {
            bs->bl.max_hw_iov = ret;
        }
    }
}

static int check_for_dasd(int fd)
{
#ifdef BIODASDINFO2
    struct dasd_information2_t info = {0};

    return ioctl(fd, BIODASDINFO2, &info);
#else
    return -1;
#endif
}

/**
 * Try to get @bs's logical and physical block size.
 * On success, store them in @bsz and return zero.
 * On failure, return negative errno.
 */
static int hdev_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
{
    BDRVRawState *s = bs->opaque;
    int ret;

    /* If DASD, get blocksizes */
    if (check_for_dasd(s->fd) < 0) {
        return -ENOTSUP;
    }
    ret = probe_logical_blocksize(s->fd, &bsz->log);
    if (ret < 0) {
        return ret;
    }
    return probe_physical_blocksize(s->fd, &bsz->phys);
}

/**
 * Try to get @bs's geometry: cyls, heads, sectors.
 * On success, store them in @geo and return 0.
 * On failure return -errno.
 * (Allows block driver to assign default geometry values that guest sees)
 */
#ifdef __linux__
static int hdev_probe_geometry(BlockDriverState *bs, HDGeometry *geo)
{
    BDRVRawState *s = bs->opaque;
    struct hd_geometry ioctl_geo = {0};

    /* If DASD, get its geometry */
    if (check_for_dasd(s->fd) < 0) {
        return -ENOTSUP;
    }
    if (ioctl(s->fd, HDIO_GETGEO, &ioctl_geo) < 0) {
        return -errno;
    }
    /* HDIO_GETGEO may return success even though geo contains zeros
       (e.g. certain multipath setups) */
    if (!ioctl_geo.heads || !ioctl_geo.sectors || !ioctl_geo.cylinders) {
        return -ENOTSUP;
    }
    /* Do not return a geometry for partition */
    if (ioctl_geo.start != 0) {
        return -ENOTSUP;
    }
    geo->heads = ioctl_geo.heads;
    geo->sectors = ioctl_geo.sectors;
    geo->cylinders = ioctl_geo.cylinders;

    return 0;
}
#else /* __linux__ */
static int hdev_probe_geometry(BlockDriverState *bs, HDGeometry *geo)
{
    return -ENOTSUP;
}
#endif

#if defined(__linux__)
static int handle_aiocb_ioctl(void *opaque)
{
    RawPosixAIOData *aiocb = opaque;
    int ret;

    do {
        ret = ioctl(aiocb->aio_fildes, aiocb->ioctl.cmd, aiocb->ioctl.buf);
    } while (ret == -1 && errno == EINTR);
    if (ret == -1) {
        return -errno;
    }

    return 0;
}
#endif /* linux */

static int handle_aiocb_flush(void *opaque)
{
    RawPosixAIOData *aiocb = opaque;
    BDRVRawState *s = aiocb->bs->opaque;
    int ret;

    if (s->page_cache_inconsistent) {
        return -s->page_cache_inconsistent;
    }

    ret = qemu_fdatasync(aiocb->aio_fildes);
    if (ret == -1) {
        trace_file_flush_fdatasync_failed(errno);

        /* There is no clear definition of the semantics of a failing fsync(),
         * so we may have to assume the worst. The sad truth is that this
         * assumption is correct for Linux. Some pages are now probably marked
         * clean in the page cache even though they are inconsistent with the
         * on-disk contents. The next fdatasync() call would succeed, but no
         * further writeback attempt will be made. We can't get back to a state
         * in which we know what is on disk (we would have to rewrite
         * everything that was touched since the last fdatasync() at least), so
         * make bdrv_flush() fail permanently. Given that the behaviour isn't
         * really defined, I have little hope that other OSes are doing better.
         *
         * Obviously, this doesn't affect O_DIRECT, which bypasses the page
         * cache. */
        if ((s->open_flags & O_DIRECT) == 0) {
            s->page_cache_inconsistent = errno;
        }
        return -errno;
    }
    return 0;
}

#ifdef CONFIG_PREADV

static bool preadv_present = true;

static ssize_t
qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t offset)
{
    return preadv(fd, iov, nr_iov, offset);
}

static ssize_t
qemu_pwritev(int fd, const struct iovec *iov, int nr_iov, off_t offset)
{
    return pwritev(fd, iov, nr_iov, offset);
}

#else

static bool preadv_present = false;

static ssize_t
qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t offset)
{
    return -ENOSYS;
}

static ssize_t
qemu_pwritev(int fd, const struct iovec *iov, int nr_iov, off_t offset)
{
    return -ENOSYS;
}

#endif

static ssize_t handle_aiocb_rw_vector(RawPosixAIOData *aiocb)
{
    ssize_t len;

    do {
        if (aiocb->aio_type & QEMU_AIO_WRITE)
            len = qemu_pwritev(aiocb->aio_fildes,
                               aiocb->io.iov,
                               aiocb->io.niov,
                               aiocb->aio_offset);
        else
            len = qemu_preadv(aiocb->aio_fildes,
                              aiocb->io.iov,
                              aiocb->io.niov,
                              aiocb->aio_offset);
    } while (len == -1 && errno == EINTR);

    if (len == -1) {
        return -errno;
    }
    return len;
}

/*
 * Read/writes the data to/from a given linear buffer.
 *
 * Returns the number of bytes handled or -errno in case of an error. Short
 * reads are only returned if the end of the file is reached.
 */
static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf)
{
    ssize_t offset = 0;
    ssize_t len;

    while (offset < aiocb->aio_nbytes) {
        if (aiocb->aio_type & QEMU_AIO_WRITE) {
            len = pwrite(aiocb->aio_fildes,
                         (const char *)buf + offset,
                         aiocb->aio_nbytes - offset,
                         aiocb->aio_offset + offset);
        } else {
            len = pread(aiocb->aio_fildes,
                        buf + offset,
                        aiocb->aio_nbytes - offset,
                        aiocb->aio_offset + offset);
        }
        if (len == -1 && errno == EINTR) {
            continue;
        } else if (len == -1 && errno == EINVAL &&
                   (aiocb->bs->open_flags & BDRV_O_NOCACHE) &&
                   !(aiocb->aio_type & QEMU_AIO_WRITE) &&
                   offset > 0) {
            /* O_DIRECT pread() may fail with EINVAL when offset is unaligned
             * after a short read. Assume that O_DIRECT short reads only occur
             * at EOF. Therefore this is a short read, not an I/O error.
             */
            break;
        } else if (len == -1) {
            offset = -errno;
            break;
        } else if (len == 0) {
            break;
        }
        offset += len;
    }

    return offset;
}

static int handle_aiocb_rw(void *opaque)
{
    RawPosixAIOData *aiocb = opaque;
    ssize_t nbytes;
    char *buf;

    if (!(aiocb->aio_type & QEMU_AIO_MISALIGNED)) {
        /*
         * If there is just a single buffer, and it is properly aligned
         * we can just use plain pread/pwrite without any problems.
         */
        if (aiocb->io.niov == 1) {
            nbytes = handle_aiocb_rw_linear(aiocb, aiocb->io.iov->iov_base);
            goto out;
        }
        /*
         * We have more than one iovec, and all are properly aligned.
         *
         * Try preadv/pwritev first and fall back to linearizing the
         * buffer if it's not supported.
         */
        if (preadv_present) {
            nbytes = handle_aiocb_rw_vector(aiocb);
            if (nbytes == aiocb->aio_nbytes ||
                (nbytes < 0 && nbytes != -ENOSYS)) {
                goto out;
            }
            preadv_present = false;
        }

        /*
         * XXX(hch): short read/write. no easy way to handle the remainder
         * using these interfaces. For now retry using plain
         * pread/pwrite?
         */
    }

    /*
     * Ok, we have to do it the hard way, copy all segments into
     * a single aligned buffer.
     */
    buf = qemu_try_blockalign(aiocb->bs, aiocb->aio_nbytes);
    if (buf == NULL) {
        nbytes = -ENOMEM;
        goto out;
    }

    if (aiocb->aio_type & QEMU_AIO_WRITE) {
        char *p = buf;
        int i;

        for (i = 0; i < aiocb->io.niov; ++i) {
            memcpy(p, aiocb->io.iov[i].iov_base, aiocb->io.iov[i].iov_len);
            p += aiocb->io.iov[i].iov_len;
        }
        assert(p - buf == aiocb->aio_nbytes);
    }

    nbytes = handle_aiocb_rw_linear(aiocb, buf);
    if (!(aiocb->aio_type & QEMU_AIO_WRITE)) {
        char *p = buf;
        size_t count = aiocb->aio_nbytes, copy;
        int i;

        for (i = 0; i < aiocb->io.niov && count; ++i) {
            copy = count;
            if (copy > aiocb->io.iov[i].iov_len) {
                copy = aiocb->io.iov[i].iov_len;
            }
            memcpy(aiocb->io.iov[i].iov_base, p, copy);
            assert(count >= copy);
            p += copy;
            count -= copy;
        }
        assert(count == 0);
    }
    qemu_vfree(buf);

out:
    if (nbytes == aiocb->aio_nbytes) {
        return 0;
    } else if (nbytes >= 0 && nbytes < aiocb->aio_nbytes) {
        if (aiocb->aio_type & QEMU_AIO_WRITE) {
            return -EINVAL;
        } else {
            iov_memset(aiocb->io.iov, aiocb->io.niov, nbytes,
                       0, aiocb->aio_nbytes - nbytes);
            return 0;
        }
    } else {
        assert(nbytes < 0);
        return nbytes;
    }
}

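The bounce-buffer fallback above gathers every iovec segment into one contiguous buffer before a plain pwrite(), and scatters data back into the segments after a pread(). The gather step in isolation (iov_gather is an illustrative name, not part of QEMU):

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/* Gather step of the bounce-buffer fallback: copy every iovec segment
 * into one contiguous buffer, in order. The caller must provide a dst
 * buffer at least as large as the sum of all segment lengths. Returns
 * the total number of bytes copied. */
static size_t iov_gather(char *dst, const struct iovec *iov, int niov)
{
    char *p = dst;
    int i;

    for (i = 0; i < niov; i++) {
        memcpy(p, iov[i].iov_base, iov[i].iov_len);
        p += iov[i].iov_len;
    }
    return p - dst;
}
```

The scatter direction additionally has to cap each copy at the remaining byte count, because a short read may end mid-segment.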
#if defined(CONFIG_FALLOCATE) || defined(BLKZEROOUT) || defined(BLKDISCARD)
static int translate_err(int err)
{
    if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP ||
        err == -ENOTTY) {
        err = -ENOTSUP;
    }
    return err;
}
#endif

#ifdef CONFIG_FALLOCATE
static int do_fallocate(int fd, int mode, off_t offset, off_t len)
{
    do {
        if (fallocate(fd, mode, offset, len) == 0) {
            return 0;
        }
    } while (errno == EINTR);
    return translate_err(-errno);
}
#endif

static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
{
    int ret = -ENOTSUP;
    BDRVRawState *s = aiocb->bs->opaque;

    if (!s->has_write_zeroes) {
        return -ENOTSUP;
    }

#ifdef BLKZEROOUT
    /* The BLKZEROOUT implementation in the kernel doesn't set
     * BLKDEV_ZERO_NOFALLBACK, so we can't call this if we have to avoid slow
     * fallbacks. */
    if (!(aiocb->aio_type & QEMU_AIO_NO_FALLBACK)) {
        do {
            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
            if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
                return 0;
            }
        } while (errno == EINTR);

        ret = translate_err(-errno);
        if (ret == -ENOTSUP) {
            s->has_write_zeroes = false;
        }
    }
#endif

    return ret;
}

static int handle_aiocb_write_zeroes(void *opaque)
{
    RawPosixAIOData *aiocb = opaque;
#ifdef CONFIG_FALLOCATE
    BDRVRawState *s = aiocb->bs->opaque;
    int64_t len;
#endif

    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
        return handle_aiocb_write_zeroes_block(aiocb);
    }

#ifdef CONFIG_FALLOCATE_ZERO_RANGE
    if (s->has_write_zeroes) {
        int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
                               aiocb->aio_offset, aiocb->aio_nbytes);
        if (ret == -ENOTSUP) {
            s->has_write_zeroes = false;
        } else if (ret == 0 || ret != -EINVAL) {
            return ret;
        }
        /*
         * Note: Some file systems do not like unaligned byte ranges, and
         * return EINVAL in such a case, though they should not do it according
         * to the man-page of fallocate(). Thus we simply ignore this return
         * value and try the other fallbacks instead.
         */
    }
#endif

#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
    if (s->has_discard && s->has_fallocate) {
        int ret = do_fallocate(s->fd,
                               FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                               aiocb->aio_offset, aiocb->aio_nbytes);
        if (ret == 0) {
            ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
            if (ret == 0 || ret != -ENOTSUP) {
                return ret;
            }
            s->has_fallocate = false;
        } else if (ret == -EINVAL) {
            /*
             * Some file systems like older versions of GPFS do not like un-
             * aligned byte ranges, and return EINVAL in such a case, though
             * they should not do it according to the man-page of fallocate().
             * Warn about the bad filesystem and try the final fallback instead.
             */
            warn_report_once("Your file system is misbehaving: "
                             "fallocate(FALLOC_FL_PUNCH_HOLE) returned EINVAL. "
                             "Please report this bug to your file system "
                             "vendor.");
        } else if (ret != -ENOTSUP) {
            return ret;
        } else {
            s->has_discard = false;
        }
    }
#endif

#ifdef CONFIG_FALLOCATE
    /* Last resort: we are trying to extend the file with zeroed data. This
     * can be done via fallocate(fd, 0) */
    len = bdrv_getlength(aiocb->bs);
    if (s->has_fallocate && len >= 0 && aiocb->aio_offset >= len) {
        int ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
        if (ret == 0 || ret != -ENOTSUP) {
            return ret;
        }
        s->has_fallocate = false;
    }
#endif

    return -ENOTSUP;
}

static int handle_aiocb_write_zeroes_unmap(void *opaque)
{
    RawPosixAIOData *aiocb = opaque;
    BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;

    /* First try to write zeros and unmap at the same time */

#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
    int ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                           aiocb->aio_offset, aiocb->aio_nbytes);
    switch (ret) {
    case -ENOTSUP:
    case -EINVAL:
    case -EBUSY:
        break;
    default:
        return ret;
    }
#endif

    /* If we couldn't manage to unmap while guaranteed that the area reads as
     * all-zero afterwards, just write zeroes without unmapping */
    return handle_aiocb_write_zeroes(aiocb);
}

#ifndef HAVE_COPY_FILE_RANGE
static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
                             off_t *out_off, size_t len, unsigned int flags)
{
#ifdef __NR_copy_file_range
    return syscall(__NR_copy_file_range, in_fd, in_off, out_fd,
                   out_off, len, flags);
#else
    errno = ENOSYS;
    return -1;
#endif
}
#endif

static int handle_aiocb_copy_range(void *opaque)
{
    RawPosixAIOData *aiocb = opaque;
    uint64_t bytes = aiocb->aio_nbytes;
    off_t in_off = aiocb->aio_offset;
    off_t out_off = aiocb->copy_range.aio_offset2;

    while (bytes) {
        ssize_t ret = copy_file_range(aiocb->aio_fildes, &in_off,
                                      aiocb->copy_range.aio_fd2, &out_off,
                                      bytes, 0);
        trace_file_copy_file_range(aiocb->bs, aiocb->aio_fildes, in_off,
                                   aiocb->copy_range.aio_fd2, out_off, bytes,
                                   0, ret);
        if (ret == 0) {
            /* No progress (e.g. when beyond EOF), let the caller fall back to
             * buffer I/O. */
            return -ENOSPC;
        }
        if (ret < 0) {
            switch (errno) {
            case ENOSYS:
                return -ENOTSUP;
            case EINTR:
                continue;
            default:
                return -errno;
            }
        }
        bytes -= ret;
    }
    return 0;
}

static int handle_aiocb_discard(void *opaque)
{
    RawPosixAIOData *aiocb = opaque;
    int ret = -ENOTSUP;
    BDRVRawState *s = aiocb->bs->opaque;

    if (!s->has_discard) {
        return -ENOTSUP;
    }

    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
#ifdef BLKDISCARD
        do {
            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
            if (ioctl(aiocb->aio_fildes, BLKDISCARD, range) == 0) {
                return 0;
            }
        } while (errno == EINTR);

        ret = translate_err(-errno);
#endif
    } else {
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
        ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                           aiocb->aio_offset, aiocb->aio_nbytes);
        ret = translate_err(ret);
#elif defined(__APPLE__) && (__MACH__)
        fpunchhole_t fpunchhole;
        fpunchhole.fp_flags = 0;
        fpunchhole.reserved = 0;
        fpunchhole.fp_offset = aiocb->aio_offset;
        fpunchhole.fp_length = aiocb->aio_nbytes;
        if (fcntl(s->fd, F_PUNCHHOLE, &fpunchhole) == -1) {
            ret = errno == ENODEV ? -ENOTSUP : -errno;
        } else {
            ret = 0;
        }
#endif
    }

    if (ret == -ENOTSUP) {
        s->has_discard = false;
    }
    return ret;
}

/*
 * Help alignment probing by allocating the first block.
 *
 * When reading with direct I/O from unallocated area on Gluster backed by XFS,
 * reading succeeds regardless of request length. In this case we fallback to
 * safe alignment which is not optimal. Allocating the first block avoids this
 * fallback.
 *
 * fd may be opened with O_DIRECT, but we don't know the buffer alignment or
 * request alignment, so we use safe values.
 *
 * Returns: 0 on success, -errno on failure. Since this is an optimization,
 * caller may ignore failures.
 */
static int allocate_first_block(int fd, size_t max_size)
{
    size_t write_size = (max_size < MAX_BLOCKSIZE)
        ? BDRV_SECTOR_SIZE
        : MAX_BLOCKSIZE;
    size_t max_align = MAX(MAX_BLOCKSIZE, qemu_real_host_page_size);
    void *buf;
    ssize_t n;
    int ret;

    buf = qemu_memalign(max_align, write_size);
    memset(buf, 0, write_size);

    do {
        n = pwrite(fd, buf, write_size, 0);
    } while (n == -1 && errno == EINTR);

    ret = (n == -1) ? -errno : 0;

    qemu_vfree(buf);
    return ret;
}

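Outside of QEMU's allocator helpers, the same "write one zeroed, alignment-friendly block at offset 0" idea can be sketched with plain POSIX calls. This is a hedged stand-in, not the QEMU code: write_first_block is an illustrative name, and the 4096 alignment stands in for MAX(MAX_BLOCKSIZE, page size):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch of allocate_first_block(): write one zeroed block at offset 0
 * from an aligned buffer (safe for O_DIRECT fds), retrying on EINTR.
 * Returns 0 on success, -errno on failure. */
static int write_first_block(int fd, size_t block_size)
{
    void *buf = NULL;
    ssize_t n;
    int ret;

    /* 4096 stands in for MAX(MAX_BLOCKSIZE, host page size) */
    if (posix_memalign(&buf, 4096, block_size) != 0) {
        return -ENOMEM;
    }
    memset(buf, 0, block_size);

    do {
        n = pwrite(fd, buf, block_size, 0);
    } while (n == -1 && errno == EINTR);

    ret = (n == -1) ? -errno : 0;
    free(buf);
    return ret;
}
```

As in the original, a failure here is only a lost optimization, so callers may ignore the return value.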
static int handle_aiocb_truncate(void *opaque)
{
    RawPosixAIOData *aiocb = opaque;
    int result = 0;
    int64_t current_length = 0;
    char *buf = NULL;
    struct stat st;
    int fd = aiocb->aio_fildes;
    int64_t offset = aiocb->aio_offset;
    PreallocMode prealloc = aiocb->truncate.prealloc;
    Error **errp = aiocb->truncate.errp;

    if (fstat(fd, &st) < 0) {
        result = -errno;
        error_setg_errno(errp, -result, "Could not stat file");
        return result;
    }

    current_length = st.st_size;
    if (current_length > offset && prealloc != PREALLOC_MODE_OFF) {
        error_setg(errp, "Cannot use preallocation for shrinking files");
        return -ENOTSUP;
    }

    switch (prealloc) {
#ifdef CONFIG_POSIX_FALLOCATE
    case PREALLOC_MODE_FALLOC:
        /*
         * Truncating before posix_fallocate() makes it about twice slower on
         * file systems that do not support fallocate(), trying to check if a
         * block is allocated before allocating it, so don't do that here.
         */
        if (offset != current_length) {
            result = -posix_fallocate(fd, current_length,
                                      offset - current_length);
            if (result != 0) {
                /* posix_fallocate() doesn't set errno. */
                error_setg_errno(errp, -result,
                                 "Could not preallocate new data");
            } else if (current_length == 0) {
                /*
                 * posix_fallocate() uses fallocate() if the filesystem
                 * supports it, or fallback to manually writing zeroes. If
                 * fallocate() was used, unaligned reads from the fallocated
                 * area in raw_probe_alignment() will succeed, hence we need to
                 * allocate the first block.
                 *
                 * Optimize future alignment probing; ignore failures.
                 */
                allocate_first_block(fd, offset);
            }
        } else {
            result = 0;
        }
        goto out;
#endif
    case PREALLOC_MODE_FULL:
    {
        int64_t num = 0, left = offset - current_length;
        off_t seek_result;

        /*
         * Knowing the final size from the beginning could allow the file
         * system driver to do less allocations and possibly avoid
         * fragmentation of the file.
         */
        if (ftruncate(fd, offset) != 0) {
            result = -errno;
            error_setg_errno(errp, -result, "Could not resize file");
            goto out;
        }

        buf = g_malloc0(65536);

        seek_result = lseek(fd, current_length, SEEK_SET);
        if (seek_result < 0) {
            result = -errno;
            error_setg_errno(errp, -result,
                             "Failed to seek to the old end of file");
            goto out;
        }

        while (left > 0) {
            num = MIN(left, 65536);
            result = write(fd, buf, num);
            if (result < 0) {
                if (errno == EINTR) {
                    continue;
                }
                result = -errno;
                error_setg_errno(errp, -result,
                                 "Could not write zeros for preallocation");
                goto out;
            }
            left -= result;
        }
        if (result >= 0) {
            result = fsync(fd);
            if (result < 0) {
                result = -errno;
                error_setg_errno(errp, -result,
                                 "Could not flush file to disk");
                goto out;
            }
        }
        goto out;
    }
    case PREALLOC_MODE_OFF:
        if (ftruncate(fd, offset) != 0) {
            result = -errno;
            error_setg_errno(errp, -result, "Could not resize file");
        } else if (current_length == 0 && offset > current_length) {
            /* Optimize future alignment probing; ignore failures. */
            allocate_first_block(fd, offset);
        }
        return result;
    default:
        result = -ENOTSUP;
        error_setg(errp, "Unsupported preallocation mode: %s",
                   PreallocMode_str(prealloc));
        return result;
    }

out:
    if (result < 0) {
        if (ftruncate(fd, current_length) < 0) {
            error_report("Failed to restore old file length: %s",
                         strerror(errno));
        }
    }

    g_free(buf);
    return result;
}

static int coroutine_fn raw_thread_pool_submit(BlockDriverState *bs,
                                               ThreadPoolFunc func, void *arg)
{
    /* @bs can be NULL, bdrv_get_aio_context() returns the main context then */
    ThreadPool *pool = aio_get_thread_pool(bdrv_get_aio_context(bs));
    return thread_pool_submit_co(pool, func, arg);
}

static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
                                   uint64_t bytes, QEMUIOVector *qiov, int type)
{
    BDRVRawState *s = bs->opaque;
    RawPosixAIOData acb;

    if (fd_open(bs) < 0)
        return -EIO;

    /*
     * When using O_DIRECT, the request must be aligned to be able to use
     * either the libaio or io_uring interface. If not, fall back to the
     * regular thread pool read/write code, which emulates this for us if we
     * set QEMU_AIO_MISALIGNED.
     */
    if (s->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)) {
        type |= QEMU_AIO_MISALIGNED;
#ifdef CONFIG_LINUX_IO_URING
    } else if (s->use_linux_io_uring) {
        LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
        assert(qiov->size == bytes);
        return luring_co_submit(bs, aio, s->fd, offset, qiov, type);
#endif
#ifdef CONFIG_LINUX_AIO
    } else if (s->use_linux_aio) {
        LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
        assert(qiov->size == bytes);
        return laio_co_submit(bs, aio, s->fd, offset, qiov, type,
                              s->aio_max_batch);
#endif
    }

    acb = (RawPosixAIOData) {
        .bs             = bs,
        .aio_fildes     = s->fd,
        .aio_type       = type,
        .aio_offset     = offset,
        .aio_nbytes     = bytes,
        .io             = {
            .iov            = qiov->iov,
            .niov           = qiov->niov,
        },
    };

    assert(qiov->size == bytes);
    return raw_thread_pool_submit(bs, handle_aiocb_rw, &acb);
}

static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset,
                                      int64_t bytes, QEMUIOVector *qiov,
                                      BdrvRequestFlags flags)
{
    return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_READ);
}

static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
                                       int64_t bytes, QEMUIOVector *qiov,
                                       BdrvRequestFlags flags)
{
    assert(flags == 0);
    return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
}

static void raw_aio_plug(BlockDriverState *bs)
{
    BDRVRawState __attribute__((unused)) *s = bs->opaque;
#ifdef CONFIG_LINUX_AIO
    if (s->use_linux_aio) {
        LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
        laio_io_plug(bs, aio);
    }
#endif
#ifdef CONFIG_LINUX_IO_URING
    if (s->use_linux_io_uring) {
        LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
        luring_io_plug(bs, aio);
    }
#endif
}

static void raw_aio_unplug(BlockDriverState *bs)
{
    BDRVRawState __attribute__((unused)) *s = bs->opaque;
#ifdef CONFIG_LINUX_AIO
    if (s->use_linux_aio) {
        LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
        laio_io_unplug(bs, aio, s->aio_max_batch);
    }
#endif
#ifdef CONFIG_LINUX_IO_URING
    if (s->use_linux_io_uring) {
        LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
        luring_io_unplug(bs, aio);
    }
#endif
}

static int raw_co_flush_to_disk(BlockDriverState *bs)
|
2009-09-04 21:01:49 +04:00
|
|
|
{
|
|
|
|
BDRVRawState *s = bs->opaque;
|
2018-10-25 16:18:58 +03:00
|
|
|
RawPosixAIOData acb;
|
2018-06-21 20:07:32 +03:00
|
|
|
int ret;
|
2009-09-04 21:01:49 +04:00
|
|
|
|
2018-06-21 20:07:32 +03:00
|
|
|
ret = fd_open(bs);
|
|
|
|
if (ret < 0) {
|
|
|
|
return ret;
|
|
|
|
}
|
2009-09-04 21:01:49 +04:00
|
|
|
|
2018-10-25 16:18:58 +03:00
|
|
|
acb = (RawPosixAIOData) {
|
|
|
|
.bs = bs,
|
|
|
|
.aio_fildes = s->fd,
|
|
|
|
.aio_type = QEMU_AIO_FLUSH,
|
|
|
|
};
|
|
|
|
|
2020-01-20 17:18:51 +03:00
|
|
|
#ifdef CONFIG_LINUX_IO_URING
|
|
|
|
if (s->use_linux_io_uring) {
|
|
|
|
LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
|
|
|
|
return luring_co_submit(bs, aio, s->fd, 0, NULL, QEMU_AIO_FLUSH);
|
|
|
|
}
|
|
|
|
#endif
|
2018-10-25 16:18:58 +03:00
|
|
|
return raw_thread_pool_submit(bs, handle_aiocb_flush, &acb);
|
2009-09-04 21:01:49 +04:00
|
|
|
}

static void raw_aio_attach_aio_context(BlockDriverState *bs,
                                       AioContext *new_context)
{
    BDRVRawState __attribute__((unused)) *s = bs->opaque;
#ifdef CONFIG_LINUX_AIO
    if (s->use_linux_aio) {
        Error *local_err = NULL;
        if (!aio_setup_linux_aio(new_context, &local_err)) {
            error_reportf_err(local_err, "Unable to use native AIO, "
                                         "falling back to thread pool: ");
            s->use_linux_aio = false;
        }
    }
#endif
#ifdef CONFIG_LINUX_IO_URING
    if (s->use_linux_io_uring) {
        Error *local_err = NULL;
        if (!aio_setup_linux_io_uring(new_context, &local_err)) {
            error_reportf_err(local_err, "Unable to use linux io_uring, "
                                         "falling back to thread pool: ");
            s->use_linux_io_uring = false;
        }
    }
#endif
}

static void raw_close(BlockDriverState *bs)
{
    BDRVRawState *s = bs->opaque;

    if (s->fd >= 0) {
        qemu_close(s->fd);
        s->fd = -1;
    }
}

/**
 * Truncates the given regular file @fd to @offset and, when growing, fills the
 * new space according to @prealloc.
 *
 * Returns: 0 on success, -errno on failure.
 */
static int coroutine_fn
raw_regular_truncate(BlockDriverState *bs, int fd, int64_t offset,
                     PreallocMode prealloc, Error **errp)
{
    RawPosixAIOData acb;

    acb = (RawPosixAIOData) {
        .bs             = bs,
        .aio_fildes     = fd,
        .aio_type       = QEMU_AIO_TRUNCATE,
        .aio_offset     = offset,
        .truncate       = {
            .prealloc       = prealloc,
            .errp           = errp,
        },
    };

    return raw_thread_pool_submit(bs, handle_aiocb_truncate, &acb);
}

static int coroutine_fn raw_co_truncate(BlockDriverState *bs, int64_t offset,
                                        bool exact, PreallocMode prealloc,
                                        BdrvRequestFlags flags, Error **errp)
{
    BDRVRawState *s = bs->opaque;
    struct stat st;
    int ret;

    if (fstat(s->fd, &st)) {
        ret = -errno;
        error_setg_errno(errp, -ret, "Failed to fstat() the file");
        return ret;
    }

    if (S_ISREG(st.st_mode)) {
        /* Always resizes to the exact @offset */
        return raw_regular_truncate(bs, s->fd, offset, prealloc, errp);
    }

    if (prealloc != PREALLOC_MODE_OFF) {
        error_setg(errp, "Preallocation mode '%s' unsupported for this "
                   "non-regular file", PreallocMode_str(prealloc));
        return -ENOTSUP;
    }

    if (S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode)) {
        int64_t cur_length = raw_getlength(bs);

        if (offset != cur_length && exact) {
            error_setg(errp, "Cannot resize device files");
            return -ENOTSUP;
        } else if (offset > cur_length) {
            error_setg(errp, "Cannot grow device files");
            return -EINVAL;
        }
    } else {
        error_setg(errp, "Resizing this file is not supported");
        return -ENOTSUP;
    }

    return 0;
}

#ifdef __OpenBSD__
static int64_t raw_getlength(BlockDriverState *bs)
{
    BDRVRawState *s = bs->opaque;
    int fd = s->fd;
    struct stat st;

    if (fstat(fd, &st))
        return -errno;
    if (S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode)) {
        struct disklabel dl;

        if (ioctl(fd, DIOCGDINFO, &dl))
            return -errno;
        return (uint64_t)dl.d_secsize *
            dl.d_partitions[DISKPART(st.st_rdev)].p_size;
    } else
        return st.st_size;
}
#elif defined(__NetBSD__)
static int64_t raw_getlength(BlockDriverState *bs)
{
    BDRVRawState *s = bs->opaque;
    int fd = s->fd;
    struct stat st;

    if (fstat(fd, &st))
        return -errno;
    if (S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode)) {
        struct dkwedge_info dkw;

        if (ioctl(fd, DIOCGWEDGEINFO, &dkw) != -1) {
            return dkw.dkw_size * 512;
        } else {
            struct disklabel dl;

            if (ioctl(fd, DIOCGDINFO, &dl))
                return -errno;
            return (uint64_t)dl.d_secsize *
                dl.d_partitions[DISKPART(st.st_rdev)].p_size;
        }
    } else
        return st.st_size;
}
#elif defined(__sun__)
static int64_t raw_getlength(BlockDriverState *bs)
{
    BDRVRawState *s = bs->opaque;
    struct dk_minfo minfo;
    int ret;
    int64_t size;

    ret = fd_open(bs);
    if (ret < 0) {
        return ret;
    }

    /*
     * Use the DKIOCGMEDIAINFO ioctl to read the size.
     */
    ret = ioctl(s->fd, DKIOCGMEDIAINFO, &minfo);
    if (ret != -1) {
        return minfo.dki_lbsize * minfo.dki_capacity;
    }

    /*
     * There are reports that lseek on some devices fails, but
     * irc discussion said that contingency on contingency was overkill.
     */
    size = lseek(s->fd, 0, SEEK_END);
    if (size < 0) {
        return -errno;
    }
    return size;
}
#elif defined(CONFIG_BSD)
static int64_t raw_getlength(BlockDriverState *bs)
{
    BDRVRawState *s = bs->opaque;
    int fd = s->fd;
    int64_t size;
    struct stat sb;
#if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
    int reopened = 0;
#endif
    int ret;

    ret = fd_open(bs);
    if (ret < 0)
        return ret;

#if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
again:
#endif
    if (!fstat(fd, &sb) && (S_IFCHR & sb.st_mode)) {
        size = 0;
#ifdef DIOCGMEDIASIZE
        if (ioctl(fd, DIOCGMEDIASIZE, (off_t *)&size)) {
            size = 0;
        }
#endif
#ifdef DIOCGPART
        if (size == 0) {
            struct partinfo pi;
            if (ioctl(fd, DIOCGPART, &pi) == 0) {
                size = pi.media_size;
            }
        }
#endif
#if defined(DKIOCGETBLOCKCOUNT) && defined(DKIOCGETBLOCKSIZE)
        if (size == 0) {
            uint64_t sectors = 0;
            uint32_t sector_size = 0;

            if (ioctl(fd, DKIOCGETBLOCKCOUNT, &sectors) == 0
               && ioctl(fd, DKIOCGETBLOCKSIZE, &sector_size) == 0) {
                size = sectors * sector_size;
            }
        }
#endif
        if (size == 0) {
            size = lseek(fd, 0LL, SEEK_END);
        }
        if (size < 0) {
            return -errno;
        }
#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
        switch(s->type) {
        case FTYPE_CD:
            /* XXX FreeBSD acd returns UINT_MAX sectors for an empty drive */
            if (size == 2048LL * (unsigned)-1)
                size = 0;
            /* XXX no disc?  maybe we need to reopen... */
            if (size <= 0 && !reopened && cdrom_reopen(bs) >= 0) {
                reopened = 1;
                goto again;
            }
        }
#endif
    } else {
        size = lseek(fd, 0, SEEK_END);
        if (size < 0) {
            return -errno;
        }
    }
    return size;
}
#else
static int64_t raw_getlength(BlockDriverState *bs)
{
    BDRVRawState *s = bs->opaque;
    int ret;
    int64_t size;

    ret = fd_open(bs);
    if (ret < 0) {
        return ret;
    }

    size = lseek(s->fd, 0, SEEK_END);
    if (size < 0) {
        return -errno;
    }
    return size;
}
#endif

static int64_t raw_get_allocated_file_size(BlockDriverState *bs)
{
    struct stat st;
    BDRVRawState *s = bs->opaque;

    if (fstat(s->fd, &st) < 0) {
        return -errno;
    }
    return (int64_t)st.st_blocks * 512;
}
|
|
|
|
|
2018-06-21 19:23:16 +03:00
|
|
|
static int coroutine_fn
|
|
|
|
raw_co_create(BlockdevCreateOptions *options, Error **errp)
|
2006-08-01 20:21:11 +04:00
|
|
|
{
|
2018-01-16 18:04:21 +03:00
|
|
|
BlockdevCreateOptionsFile *file_opts;
|
2018-07-04 17:47:51 +03:00
|
|
|
Error *local_err = NULL;
|
2006-08-01 20:21:11 +04:00
|
|
|
int fd;
|
2018-07-04 17:47:50 +03:00
|
|
|
uint64_t perm, shared;
|
2009-07-11 18:43:37 +04:00
|
|
|
int result = 0;
|
2006-08-01 20:21:11 +04:00
|
|
|
|
2018-01-16 18:04:21 +03:00
|
|
|
/* Validate options and set default values */
|
|
|
|
assert(options->driver == BLOCKDEV_DRIVER_FILE);
|
|
|
|
file_opts = &options->u.file;
|
2014-03-06 01:41:38 +04:00
|
|
|
|
2018-01-16 18:04:21 +03:00
|
|
|
if (!file_opts->has_nocow) {
|
|
|
|
file_opts->nocow = false;
|
|
|
|
}
|
|
|
|
if (!file_opts->has_preallocation) {
|
|
|
|
file_opts->preallocation = PREALLOC_MODE_OFF;
|
2014-09-10 13:05:48 +04:00
|
|
|
}
|
file-posix: Mitigate file fragmentation with extent size hints
Especially when O_DIRECT is used with image files so that the page cache
indirection can't cause a merge of allocating requests, the file will
fragment on the file system layer, with a potentially very small
fragment size (this depends on the requests the guest sent).
On Linux, fragmentation can be reduced by setting an extent size hint
when creating the file (at least on XFS, it can't be set any more after
the first extent has been allocated), basically giving raw files a
"cluster size" for allocation.
This adds a create option to set the extent size hint, and changes the
default from not setting a hint to setting it to 1 MB. The main reason
why qcow2 defaults to smaller cluster sizes is that COW becomes more
expensive, which is not an issue with raw files, so we can choose a
larger size. The tradeoff here is only potentially wasted disk space.
For qcow2 (or other image formats) over file-posix, the advantage should
even be greater because they grow sequentially without leaving holes, so
there won't be wasted space. Setting even larger extent size hints for
such images may make sense. This can be done with the new option, but
let's keep the default conservative for now.
The effect is very visible with a test that intentionally creates a
badly fragmented file with qemu-img bench (the time difference while
creating the file is already remarkable) and then looks at the number of
extents and the time a simple "qemu-img map" takes.
Without an extent size hint:
$ ./qemu-img create -f raw -o extent_size_hint=0 ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=0
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 25.848 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 19.616 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 2000000 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m1,279s
user 0m0,043s
sys 0m1,226s
With the new default extent size hint of 1 MB:
$ ./qemu-img create -f raw -o extent_size_hint=1M ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=1048576
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 11.833 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 10.155 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 178 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m0,061s
user 0m0,040s
sys 0m0,014s
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200707142329.48303-1-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-07-07 17:23:29 +03:00
|
|
|
if (!file_opts->has_extent_size_hint) {
|
|
|
|
file_opts->extent_size_hint = 1 * MiB;
|
|
|
|
}
|
|
|
|
if (file_opts->extent_size_hint > UINT32_MAX) {
|
|
|
|
result = -EINVAL;
|
|
|
|
error_setg(errp, "Extent size hint is too large");
|
|
|
|
goto out;
|
|
|
|
}
|
2006-08-01 20:21:11 +04:00
|
|
|
|
2018-01-16 18:04:21 +03:00
|
|
|
/* Create file */
|
2020-07-01 17:22:43 +03:00
|
|
|
fd = qemu_create(file_opts->filename, O_RDWR | O_BINARY, 0644, errp);
|
2009-07-11 18:43:37 +04:00
|
|
|
if (fd < 0) {
|
|
|
|
result = -errno;
|
2014-09-10 13:05:48 +04:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2018-05-10 00:53:35 +03:00
|
|
|
/* Take permissions: We want to discard everything, so we need
|
|
|
|
* BLK_PERM_WRITE; and truncation to the desired size requires
|
|
|
|
* BLK_PERM_RESIZE.
|
|
|
|
* On the other hand, we cannot share the RESIZE permission
|
|
|
|
* because we promise that after this function, the file has the
|
|
|
|
* size given in the options. If someone else were to resize it
|
|
|
|
* concurrently, we could not guarantee that.
|
|
|
|
* Note that after this function, we can no longer guarantee that
|
|
|
|
* the file is not touched by a third party, so it may be resized
|
|
|
|
* then. */
|
|
|
|
perm = BLK_PERM_WRITE | BLK_PERM_RESIZE;
|
|
|
|
shared = BLK_PERM_ALL & ~BLK_PERM_RESIZE;
|
|
|
|
|
|
|
|
/* Step one: Take locks */
|
file-posix: Skip effectiveless OFD lock operations
If we know we've already locked the bytes, don't do it again; similarly
don't unlock a byte if we haven't locked it. This doesn't change the
behavior, but fixes a corner case explained below.
Libvirt had an error handling bug that an image can get its (ownership,
file mode, SELinux) permissions changed (RHBZ 1584982) by mistake behind
QEMU. Specifically, an image in use by Libvirt VM has:
$ ls -lhZ b.img
-rw-r--r--. qemu qemu system_u:object_r:svirt_image_t:s0:c600,c690 b.img
Trying to attach it a second time won't work because of image locking.
And after the error, it becomes:
$ ls -lhZ b.img
-rw-r--r--. root root system_u:object_r:virt_image_t:s0 b.img
Then, we won't be able to do OFD lock operations with the existing fd.
In other words, the code such as in blk_detach_dev:
blk_set_perm(blk, 0, BLK_PERM_ALL, &error_abort);
can abort() QEMU, out of environmental changes.
This patch is an easy fix to this and the change is regardlessly
reasonable, so do it.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-10-11 10:21:33 +03:00
|
|
|
result = raw_apply_lock_bytes(NULL, fd, perm, ~shared, false, errp);
|
2018-05-10 00:53:35 +03:00
|
|
|
if (result < 0) {
|
|
|
|
goto out_close;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Step two: Check that nobody else has taken conflicting locks */
|
|
|
|
result = raw_check_lock_bytes(fd, perm, shared, errp);
|
|
|
|
if (result < 0) {
|
2018-09-25 08:05:01 +03:00
|
|
|
error_append_hint(errp,
|
|
|
|
"Is another process using the image [%s]?\n",
|
|
|
|
file_opts->filename);
|
2018-07-04 17:47:51 +03:00
|
|
|
goto out_unlock;
|
2018-05-10 00:53:35 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Clear the file by truncating it to 0 */
|
2018-06-21 19:23:16 +03:00
|
|
|
result = raw_regular_truncate(NULL, fd, 0, PREALLOC_MODE_OFF, errp);
|
2018-05-10 00:53:35 +03:00
|
|
|
if (result < 0) {
|
2018-07-04 17:47:51 +03:00
|
|
|
goto out_unlock;
|
2018-05-10 00:53:35 +03:00
|
|
|
}
|
|
|
|
|
2018-01-16 18:04:21 +03:00
|
|
|
if (file_opts->nocow) {
|
qemu-img create: add 'nocow' option
Add 'nocow' option so that users could have a chance to set NOCOW flag to
newly created files. It's useful on btrfs file system to enhance performance.
Btrfs has low performance when hosting VM images, even more when the guest
in those VM are also using btrfs as file system. One way to mitigate this bad
performance is to turn off COW attributes on VM files. Generally, there are
two ways to turn off NOCOW on btrfs: a) by mounting fs with nodatacow, then
all newly created files will be NOCOW. b) per file. Add the NOCOW file
attribute. It could only be done to empty or new files.
This patch tries the second way, according to the option, it could add NOCOW
per file.
For most block drivers, since the create file step is in raw-posix.c, so we
can do setting NOCOW flag ioctl in raw-posix.c only.
But there are some exceptions, like block/vpc.c and block/vdi.c, they are
creating file by calling qemu_open directly. For them, do the same setting
NOCOW flag ioctl work in them separately.
[Fixed up 082.out due to the new 'nocow' creation option
--Stefan]
Signed-off-by: Chunyan Liu <cyliu@suse.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2014-06-30 10:29:58 +04:00
|
|
|
#ifdef __linux__
|
2014-09-10 13:05:48 +04:00
|
|
|
/* Set NOCOW flag to solve performance issue on fs like btrfs.
|
|
|
|
* This is an optimisation. The FS_IOC_SETFLAGS ioctl return value
|
|
|
|
* will be ignored since any failure of this operation should not
|
|
|
|
* block the left work.
|
|
|
|
*/
|
|
|
|
int attr;
|
|
|
|
if (ioctl(fd, FS_IOC_GETFLAGS, &attr) == 0) {
|
|
|
|
attr |= FS_NOCOW_FL;
|
|
|
|
ioctl(fd, FS_IOC_SETFLAGS, &attr);
|
qemu-img create: add 'nocow' option
Add 'nocow' option so that users could have a chance to set NOCOW flag to
newly created files. It's useful on btrfs file system to enhance performance.
Btrfs has low performance when hosting VM images, even more when the guest
in those VM are also using btrfs as file system. One way to mitigate this bad
performance is to turn off COW attributes on VM files. Generally, there are
two ways to turn off NOCOW on btrfs: a) by mounting fs with nodatacow, then
all newly created files will be NOCOW. b) per file. Add the NOCOW file
attribute. It could only be done to empty or new files.
This patch tries the second way, according to the option, it could add NOCOW
per file.
For most block drivers, since the create file step is in raw-posix.c, so we
can do setting NOCOW flag ioctl in raw-posix.c only.
But there are some exceptions, like block/vpc.c and block/vdi.c, they are
creating file by calling qemu_open directly. For them, do the same setting
NOCOW flag ioctl work in them separately.
[Fixed up 082.out due to the new 'nocow' creation option
--Stefan]
Signed-off-by: Chunyan Liu <cyliu@suse.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2014-06-30 10:29:58 +04:00
|
|
|
}
|
2014-09-10 13:05:48 +04:00
|
|
|
#endif
|
|
|
|
}
|
file-posix: Mitigate file fragmentation with extent size hints
Especially when O_DIRECT is used with image files so that the page cache
indirection can't cause a merge of allocating requests, the file will
fragment on the file system layer, with a potentially very small
fragment size (this depends on the requests the guest sent).
On Linux, fragmentation can be reduced by setting an extent size hint
when creating the file (at least on XFS, it can't be set any more after
the first extent has been allocated), basically giving raw files a
"cluster size" for allocation.
This adds a create option to set the extent size hint, and changes the
default from not setting a hint to setting it to 1 MB. The main reason
why qcow2 defaults to smaller cluster sizes is that COW becomes more
expensive, which is not an issue with raw files, so we can choose a
larger size. The tradeoff here is only potentially wasted disk space.
For qcow2 (or other image formats) over file-posix, the advantage should
even be greater because they grow sequentially without leaving holes, so
there won't be wasted space. Setting even larger extent size hints for
such images may make sense. This can be done with the new option, but
let's keep the default conservative for now.
The effect is very visible with a test that intentionally creates a
badly fragmented file with qemu-img bench (the time difference while
creating the file is already remarkable) and then looks at the number of
extents and the time a simple "qemu-img map" takes.
Without an extent size hint:
$ ./qemu-img create -f raw -o extent_size_hint=0 ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=0
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 25.848 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 19.616 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 2000000 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m1,279s
user 0m0,043s
sys 0m1,226s
With the new default extent size hint of 1 MB:
$ ./qemu-img create -f raw -o extent_size_hint=1M ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=1048576
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 11.833 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 10.155 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 178 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m0,061s
user 0m0,040s
sys 0m0,014s
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200707142329.48303-1-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-07-07 17:23:29 +03:00
|
|
|
#ifdef FS_IOC_FSSETXATTR
|
|
|
|
/*
|
|
|
|
* Try to set the extent size hint. Failure is not fatal, and a warning is
|
|
|
|
* only printed if the option was explicitly specified.
|
|
|
|
*/
|
|
|
|
{
|
|
|
|
struct fsxattr attr;
|
|
|
|
result = ioctl(fd, FS_IOC_FSGETXATTR, &attr);
|
|
|
|
if (result == 0) {
|
|
|
|
attr.fsx_xflags |= FS_XFLAG_EXTSIZE;
|
|
|
|
attr.fsx_extsize = file_opts->extent_size_hint;
|
|
|
|
result = ioctl(fd, FS_IOC_FSSETXATTR, &attr);
|
|
|
|
}
|
|
|
|
if (result < 0 && file_opts->has_extent_size_hint &&
|
|
|
|
file_opts->extent_size_hint)
|
|
|
|
{
|
|
|
|
warn_report("Failed to set extent size hint: %s",
|
|
|
|
strerror(errno));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
2014-09-10 13:05:48 +04:00
|
|
|
|
2018-05-10 00:53:35 +03:00
|
|
|
/* Resize and potentially preallocate the file to the desired
|
|
|
|
* final size */
|
2018-06-21 19:23:16 +03:00
|
|
|
result = raw_regular_truncate(NULL, fd, file_opts->size,
|
|
|
|
file_opts->preallocation, errp);
|
2017-06-13 23:20:57 +03:00
|
|
|
if (result < 0) {
|
2018-07-04 17:47:51 +03:00
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
out_unlock:
|
file-posix: Skip effectiveless OFD lock operations
If we know we've already locked the bytes, don't do it again; similarly
don't unlock a byte if we haven't locked it. This doesn't change the
behavior, but fixes a corner case explained below.
Libvirt had an error handling bug that an image can get its (ownership,
file mode, SELinux) permissions changed (RHBZ 1584982) by mistake behind
QEMU. Specifically, an image in use by Libvirt VM has:
$ ls -lhZ b.img
-rw-r--r--. qemu qemu system_u:object_r:svirt_image_t:s0:c600,c690 b.img
Trying to attach it a second time won't work because of image locking.
And after the error, it becomes:
$ ls -lhZ b.img
-rw-r--r--. root root system_u:object_r:virt_image_t:s0 b.img
Then, we won't be able to do OFD lock operations with the existing fd.
In other words, the code such as in blk_detach_dev:
blk_set_perm(blk, 0, BLK_PERM_ALL, &error_abort);
can abort() QEMU, out of environmental changes.
This patch is an easy fix to this and the change is regardlessly
reasonable, so do it.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    raw_apply_lock_bytes(NULL, fd, 0, 0, true, &local_err);
    if (local_err) {
        /* The above call should not fail, and if it does, that does
         * not mean the whole creation operation has failed. So
         * report it to the user for their convenience, but do not report
         * it to the caller. */
        warn_report_err(local_err);
    }

out_close:
    if (qemu_close(fd) != 0 && result == 0) {
        result = -errno;
        error_setg_errno(errp, -result, "Could not close the new file");
    }

out:
    return result;
}

static int coroutine_fn raw_co_create_opts(BlockDriver *drv,
                                           const char *filename,
                                           QemuOpts *opts,
                                           Error **errp)
{
    BlockdevCreateOptions options;
    int64_t total_size = 0;
file-posix: Mitigate file fragmentation with extent size hints
Especially when O_DIRECT is used with image files so that the page cache
indirection can't cause a merge of allocating requests, the file will
fragment on the file system layer, with a potentially very small
fragment size (this depends on the requests the guest sent).
On Linux, fragmentation can be reduced by setting an extent size hint
when creating the file (at least on XFS, it can't be set any more after
the first extent has been allocated), basically giving raw files a
"cluster size" for allocation.
This adds a create option to set the extent size hint, and changes the
default from not setting a hint to setting it to 1 MB. The main reason
why qcow2 defaults to smaller cluster sizes is that COW becomes more
expensive, which is not an issue with raw files, so we can choose a
larger size. The tradeoff here is only potentially wasted disk space.
For qcow2 (or other image formats) over file-posix, the advantage should
even be greater because they grow sequentially without leaving holes, so
there won't be wasted space. Setting even larger extent size hints for
such images may make sense. This can be done with the new option, but
let's keep the default conservative for now.
The effect is very visible with a test that intentionally creates a
badly fragmented file with qemu-img bench (the time difference while
creating the file is already remarkable) and then looks at the number of
extents and the time a simple "qemu-img map" takes.
Without an extent size hint:
$ ./qemu-img create -f raw -o extent_size_hint=0 ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=0
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 25.848 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 19.616 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 2000000 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m1,279s
user 0m0,043s
sys 0m1,226s
With the new default extent size hint of 1 MB:
$ ./qemu-img create -f raw -o extent_size_hint=1M ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=1048576
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 11.833 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 10.155 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 178 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m0,061s
user 0m0,040s
sys 0m0,014s
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200707142329.48303-1-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    int64_t extent_size_hint = 0;
    bool has_extent_size_hint = false;
    bool nocow = false;
    PreallocMode prealloc;
    char *buf = NULL;
    Error *local_err = NULL;

    /* Skip file: protocol prefix */
    strstart(filename, "file:", &filename);

    /* Read out options */
    total_size = ROUND_UP(qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0),
                          BDRV_SECTOR_SIZE);
    if (qemu_opt_get(opts, BLOCK_OPT_EXTENT_SIZE_HINT)) {
        has_extent_size_hint = true;
        extent_size_hint =
            qemu_opt_get_size_del(opts, BLOCK_OPT_EXTENT_SIZE_HINT, -1);
    }
    nocow = qemu_opt_get_bool(opts, BLOCK_OPT_NOCOW, false);
    buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
    prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
                               PREALLOC_MODE_OFF, &local_err);
    g_free(buf);
    if (local_err) {
        error_propagate(errp, local_err);
        return -EINVAL;
    }

    options = (BlockdevCreateOptions) {
        .driver = BLOCKDEV_DRIVER_FILE,
        .u.file = {
            .filename = (char *) filename,
            .size = total_size,
            .has_preallocation = true,
            .preallocation = prealloc,
            .has_nocow = true,
            .nocow = nocow,
            .has_extent_size_hint = has_extent_size_hint,
            .extent_size_hint = extent_size_hint,
        },
    };
    return raw_co_create(&options, errp);
}

static int coroutine_fn raw_co_delete_file(BlockDriverState *bs,
                                           Error **errp)
{
    struct stat st;
    int ret;

    if (!(stat(bs->filename, &st) == 0) || !S_ISREG(st.st_mode)) {
        error_setg_errno(errp, ENOENT, "%s is not a regular file",
                         bs->filename);
        return -ENOENT;
    }

    ret = unlink(bs->filename);
    if (ret < 0) {
        ret = -errno;
        error_setg_errno(errp, -ret, "Error when deleting file %s",
                         bs->filename);
    }

    return ret;
}
raw-posix: The SEEK_HOLE code is flawed, rewrite it
On systems where SEEK_HOLE in a trailing hole seeks to EOF (Solaris,
but not Linux), try_seek_hole() reports trailing data instead.
Additionally, unlikely lseek() failures are treated badly:
* When SEEK_HOLE fails, try_seek_hole() reports trailing data. For
-ENXIO, there's in fact a trailing hole. Can happen only when
something truncated the file since we opened it.
* When SEEK_HOLE succeeds, SEEK_DATA fails, and SEEK_END succeeds,
then try_seek_hole() reports a trailing hole. This is okay only
when SEEK_DATA failed with -ENXIO (which means the non-trailing hole
found by SEEK_HOLE has since become trailing somehow). For other
failures (unlikely), it's wrong.
* When SEEK_HOLE succeeds, SEEK_DATA fails, SEEK_END fails (unlikely),
then try_seek_hole() reports bogus data [-1,start), which its caller
raw_co_get_block_status() turns into zero sectors of data. Could
theoretically lead to infinite loops in code that attempts to scan
data vs. hole forward.
Rewrite from scratch, with very careful comments.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
/*
 * Find allocation range in @bs around offset @start.
 * May change underlying file descriptor's file offset.
 * If @start is not in a hole, store @start in @data, and the
 * beginning of the next hole in @hole, and return 0.
 * If @start is in a non-trailing hole, store @start in @hole and the
 * beginning of the next non-hole in @data, and return 0.
 * If @start is in a trailing hole or beyond EOF, return -ENXIO.
 * If we can't find out, return a negative errno other than -ENXIO.
 */
static int find_allocation(BlockDriverState *bs, off_t start,
                           off_t *data, off_t *hole)
{
#if defined SEEK_HOLE && defined SEEK_DATA
    BDRVRawState *s = bs->opaque;
    off_t offs;

    /*
     * SEEK_DATA cases:
     * D1. offs == start: start is in data
     * D2. offs > start: start is in a hole, next data at offs
     * D3. offs < 0, errno = ENXIO: either start is in a trailing hole
     *     or start is beyond EOF
     *     If the latter happens, the file has been truncated behind
     *     our back since we opened it. All bets are off then.
     *     Treating like a trailing hole is simplest.
     * D4. offs < 0, errno != ENXIO: we learned nothing
     */
    offs = lseek(s->fd, start, SEEK_DATA);
    if (offs < 0) {
        return -errno;          /* D3 or D4 */
    }

    if (offs < start) {
        /* This is not a valid return by lseek(). We are safe to just return
         * -EIO in this case, and we'll treat it like D4. */
        return -EIO;
    }

    if (offs > start) {
        /* D2: in hole, next data at offs */
        *hole = start;
        *data = offs;
        return 0;
    }

    /* D1: in data, end not yet known */

    /*
     * SEEK_HOLE cases:
     * H1. offs == start: start is in a hole
     *     If this happens here, a hole has been dug behind our back
     *     since the previous lseek().
     * H2. offs > start: either start is in data, next hole at offs,
     *     or start is in trailing hole, EOF at offs
     *     Linux treats trailing holes like any other hole: offs ==
     *     start. Solaris seeks to EOF instead: offs > start (blech).
     *     If that happens here, a hole has been dug behind our back
     *     since the previous lseek().
     * H3. offs < 0, errno = ENXIO: start is beyond EOF
     *     If this happens, the file has been truncated behind our
     *     back since we opened it. Treat it like a trailing hole.
     * H4. offs < 0, errno != ENXIO: we learned nothing
     *     Pretend we know nothing at all, i.e. "forget" about D1.
     */
    offs = lseek(s->fd, start, SEEK_HOLE);
    if (offs < 0) {
        return -errno;          /* D1 and (H3 or H4) */
    }

    if (offs < start) {
        /* This is not a valid return by lseek(). We are safe to just return
         * -EIO in this case, and we'll treat it like H4. */
        return -EIO;
    }

    if (offs > start) {
        /*
         * D1 and H2: either in data, next hole at offs, or it was in
         * data but is now in a trailing hole. In the latter case,
         * all bets are off. Treating it as if there was data all
         * the way to EOF is safe, so simply do that.
         */
        *data = start;
        *hole = offs;
        return 0;
    }

    /* D1 and H1 */
    return -EBUSY;
#else
    return -ENOTSUP;
#endif
}
/*
 * Returns the allocation status of the specified offset.
 *
 * The block layer guarantees 'offset' and 'bytes' are within bounds.
 *
 * 'pnum' is set to the number of bytes (including and immediately following
 * the specified offset) that are known to be in the same
 * allocated/unallocated state.
 *
 * 'bytes' is a soft cap for 'pnum'. If the information is free, 'pnum' may
 * well exceed it.
 */
static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
                                            bool want_zero,
                                            int64_t offset,
                                            int64_t bytes, int64_t *pnum,
                                            int64_t *map,
                                            BlockDriverState **file)
{
    off_t data = 0, hole = 0;
    int ret;
block/file-posix: Unaligned O_DIRECT block-status
Currently, qemu crashes whenever someone queries the block status of an
unaligned image tail of an O_DIRECT image:
$ echo > foo
$ qemu-img map --image-opts driver=file,filename=foo,cache.direct=on
Offset Length Mapped to File
qemu-img: block/io.c:2093: bdrv_co_block_status: Assertion `*pnum &&
QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset'
failed.
This is because bdrv_co_block_status() checks that the result returned
by the driver's implementation is aligned to the request_alignment, but
file-posix can fail to do so, which is actually mentioned in a comment
there: "[...] possibly including a partial sector at EOF".
Fix this by rounding up those partial sectors.
There are two possible alternative fixes:
(1) We could refuse to open unaligned image files with O_DIRECT
altogether. That sounds reasonable until you realize that qcow2
does necessarily not fill up its metadata clusters, and that nobody
runs qemu-img create with O_DIRECT. Therefore, unpreallocated qcow2
files usually have an unaligned image tail.
(2) bdrv_co_block_status() could ignore unaligned tails. It actually
throws away everything past the EOF already, so that sounds
reasonable.
Unfortunately, the block layer knows file lengths only with a
granularity of BDRV_SECTOR_SIZE, so bdrv_co_block_status() usually
would have to guess whether its file length information is inexact
or whether the driver is broken.
Fixing what raw_co_block_status() returns is the safest thing to do.
There seems to be no other block driver that sets request_alignment and
does not make sure that it always returns aligned values.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    assert(QEMU_IS_ALIGNED(offset | bytes, bs->bl.request_alignment));

    ret = fd_open(bs);
    if (ret < 0) {
        return ret;
    }

    if (!want_zero) {
        *pnum = bytes;
        *map = offset;
        *file = bs;
        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
    }

    ret = find_allocation(bs, offset, &data, &hole);
raw-posix: The SEEK_HOLE code is flawed, rewrite it
On systems where SEEK_HOLE in a trailing hole seeks to EOF (Solaris,
but not Linux), try_seek_hole() reports trailing data instead.
Additionally, unlikely lseek() failures are treated badly:
* When SEEK_HOLE fails, try_seek_hole() reports trailing data. For
-ENXIO, there's in fact a trailing hole. Can happen only when
something truncated the file since we opened it.
* When SEEK_HOLE succeeds, SEEK_DATA fails, and SEEK_END succeeds,
then try_seek_hole() reports a trailing hole. This is okay only
when SEEK_DATA failed with -ENXIO (which means the non-trailing hole
found by SEEK_HOLE has since become trailing somehow). For other
failures (unlikely), it's wrong.
* When SEEK_HOLE succeeds, SEEK_DATA fails, SEEK_END fails (unlikely),
then try_seek_hole() reports bogus data [-1,start), which its caller
raw_co_get_block_status() turns into zero sectors of data. Could
theoretically lead to infinite loops in code that attempts to scan
data vs. hole forward.
Rewrite from scratch, with very careful comments.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2014-11-17 13:18:34 +03:00
|
|
|
if (ret == -ENXIO) {
|
|
|
|
/* Trailing hole */
|
2018-02-13 23:26:44 +03:00
|
|
|
*pnum = bytes;
|
2014-11-17 13:18:34 +03:00
|
|
|
ret = BDRV_BLOCK_ZERO;
|
|
|
|
} else if (ret < 0) {
|
|
|
|
/* No info available, so pretend there are no holes */
|
2018-02-13 23:26:44 +03:00
|
|
|
*pnum = bytes;
|
2014-11-17 13:18:34 +03:00
|
|
|
ret = BDRV_BLOCK_DATA;
|
2018-02-13 23:26:44 +03:00
|
|
|
} else if (data == offset) {
|
|
|
|
/* On a data extent, compute bytes to the end of the extent,
|
2015-06-09 11:55:08 +03:00
|
|
|
* possibly including a partial sector at EOF. */
|
2021-08-12 11:41:46 +03:00
|
|
|
*pnum = hole - offset;
|
block/file-posix: Unaligned O_DIRECT block-status
Currently, qemu crashes whenever someone queries the block status of an
unaligned image tail of an O_DIRECT image:
$ echo > foo
$ qemu-img map --image-opts driver=file,filename=foo,cache.direct=on
Offset Length Mapped to File
qemu-img: block/io.c:2093: bdrv_co_block_status: Assertion `*pnum &&
QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset'
failed.
This is because bdrv_co_block_status() checks that the result returned
by the driver's implementation is aligned to the request_alignment, but
file-posix can fail to do so, which is actually mentioned in a comment
there: "[...] possibly including a partial sector at EOF".
Fix this by rounding up those partial sectors.
There are two possible alternative fixes:
(1) We could refuse to open unaligned image files with O_DIRECT
altogether. That sounds reasonable until you realize that qcow2
does not necessarily fill up its metadata clusters, and that nobody
runs qemu-img create with O_DIRECT. Therefore, unpreallocated qcow2
files usually have an unaligned image tail.
(2) bdrv_co_block_status() could ignore unaligned tails. It actually
throws away everything past the EOF already, so that sounds
reasonable.
Unfortunately, the block layer knows file lengths only with a
granularity of BDRV_SECTOR_SIZE, so bdrv_co_block_status() usually
would have to guess whether its file length information is inexact
or whether the driver is broken.
Fixing what raw_co_block_status() returns is the safest thing to do.
There seems to be no other block driver that sets request_alignment and
does not make sure that it always returns aligned values.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-15 07:15:40 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We are not allowed to return partial sectors, though, so
|
|
|
|
* round up if necessary.
|
|
|
|
*/
|
|
|
|
if (!QEMU_IS_ALIGNED(*pnum, bs->bl.request_alignment)) {
|
|
|
|
int64_t file_length = raw_getlength(bs);
|
|
|
|
if (file_length > 0) {
|
|
|
|
/* Ignore errors, this is just a safeguard */
|
|
|
|
assert(hole == file_length);
|
|
|
|
}
|
|
|
|
*pnum = ROUND_UP(*pnum, bs->bl.request_alignment);
|
|
|
|
}
|
|
|
|
|
2014-11-17 13:18:34 +03:00
|
|
|
ret = BDRV_BLOCK_DATA;
|
2012-05-09 18:49:58 +04:00
|
|
|
} else {
|
2018-02-13 23:26:44 +03:00
|
|
|
/* On a hole, compute bytes to the beginning of the next extent. */
|
|
|
|
assert(hole == offset);
|
2021-08-12 11:41:46 +03:00
|
|
|
*pnum = data - offset;
|
2014-11-17 13:18:34 +03:00
|
|
|
ret = BDRV_BLOCK_ZERO;
|
2012-05-09 18:49:58 +04:00
|
|
|
}
|
2018-02-13 23:26:44 +03:00
|
|
|
*map = offset;
|
2016-01-26 06:58:51 +03:00
|
|
|
*file = bs;
|
2018-02-13 23:26:44 +03:00
|
|
|
return ret | BDRV_BLOCK_OFFSET_VALID;
|
2012-05-09 18:49:58 +04:00
|
|
|
}
|
|
|
|
|
2018-04-27 19:23:12 +03:00
|
|
|
#if defined(__linux__)
|
|
|
|
/* Verify that the file is not in the page cache */
|
|
|
|
static void check_cache_dropped(BlockDriverState *bs, Error **errp)
|
|
|
|
{
|
|
|
|
const size_t window_size = 128 * 1024 * 1024;
|
|
|
|
BDRVRawState *s = bs->opaque;
|
|
|
|
void *window = NULL;
|
|
|
|
size_t length = 0;
|
|
|
|
unsigned char *vec;
|
|
|
|
size_t page_size;
|
|
|
|
off_t offset;
|
|
|
|
off_t end;
|
|
|
|
|
|
|
|
/* mincore(2) page status information requires 1 byte per page */
|
|
|
|
page_size = sysconf(_SC_PAGESIZE);
|
|
|
|
vec = g_malloc(DIV_ROUND_UP(window_size, page_size));
|
|
|
|
|
|
|
|
end = raw_getlength(bs);
|
|
|
|
|
|
|
|
for (offset = 0; offset < end; offset += window_size) {
|
|
|
|
void *new_window;
|
|
|
|
size_t new_length;
|
|
|
|
size_t vec_end;
|
|
|
|
size_t i;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/* Unmap previous window if size has changed */
|
|
|
|
new_length = MIN(end - offset, window_size);
|
|
|
|
if (new_length != length) {
|
|
|
|
munmap(window, length);
|
|
|
|
window = NULL;
|
|
|
|
length = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
new_window = mmap(window, new_length, PROT_NONE, MAP_PRIVATE,
|
|
|
|
s->fd, offset);
|
|
|
|
if (new_window == MAP_FAILED) {
|
|
|
|
error_setg_errno(errp, errno, "mmap failed");
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
window = new_window;
|
|
|
|
length = new_length;
|
|
|
|
|
|
|
|
ret = mincore(window, length, vec);
|
|
|
|
if (ret < 0) {
|
|
|
|
error_setg_errno(errp, errno, "mincore failed");
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
vec_end = DIV_ROUND_UP(length, page_size);
|
|
|
|
for (i = 0; i < vec_end; i++) {
|
|
|
|
if (vec[i] & 0x1) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2020-04-22 16:07:07 +03:00
|
|
|
if (i < vec_end) {
|
|
|
|
error_setg(errp, "page cache still in use!");
|
|
|
|
break;
|
|
|
|
}
|
2018-04-27 19:23:12 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (window) {
|
|
|
|
munmap(window, length);
|
|
|
|
}
|
|
|
|
|
|
|
|
g_free(vec);
|
|
|
|
}
|
|
|
|
#endif /* __linux__ */
|
|
|
|
|
2018-04-27 19:23:11 +03:00
|
|
|
static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
BDRVRawState *s = bs->opaque;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = fd_open(bs);
|
|
|
|
if (ret < 0) {
|
|
|
|
error_setg_errno(errp, -ret, "The file descriptor is not open");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2019-03-07 19:49:41 +03:00
|
|
|
if (!s->drop_cache) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2018-04-27 19:23:11 +03:00
|
|
|
if (s->open_flags & O_DIRECT) {
|
|
|
|
return; /* No host kernel page cache */
|
|
|
|
}
|
|
|
|
|
|
|
|
#if defined(__linux__)
|
|
|
|
/* This sets the scene for the next syscall... */
|
|
|
|
ret = bdrv_co_flush(bs);
|
|
|
|
if (ret < 0) {
|
|
|
|
error_setg_errno(errp, -ret, "flush failed");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Linux does not invalidate pages that are dirty, locked, or mmapped by a
|
|
|
|
* process. These limitations are okay because we just fsynced the file,
|
|
|
|
* we don't use mmap, and the file should not be in use by other processes.
|
|
|
|
*/
|
|
|
|
ret = posix_fadvise(s->fd, 0, 0, POSIX_FADV_DONTNEED);
|
|
|
|
if (ret != 0) { /* the return value is a positive errno */
|
|
|
|
error_setg_errno(errp, ret, "fadvise failed");
|
|
|
|
return;
|
|
|
|
}
|
2018-04-27 19:23:12 +03:00
|
|
|
|
|
|
|
if (s->check_cache_dropped) {
|
|
|
|
check_cache_dropped(bs, errp);
|
|
|
|
}
|
2018-04-27 19:23:11 +03:00
|
|
|
#else /* __linux__ */
|
|
|
|
/* Do nothing. Live migration to a remote host with cache.direct=off is
|
|
|
|
* unsupported on other host operating systems. Cache consistency issues
|
|
|
|
* may occur but no error is reported here, partly because that's the
|
|
|
|
* historical behavior and partly because it's hard to differentiate valid
|
|
|
|
* configurations that should not cause errors.
|
|
|
|
*/
|
|
|
|
#endif /* !__linux__ */
|
|
|
|
}
|
|
|
|
|
2019-09-23 15:17:36 +03:00
|
|
|
static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
|
|
|
|
{
|
|
|
|
if (ret) {
|
|
|
|
s->stats.discard_nb_failed++;
|
|
|
|
} else {
|
|
|
|
s->stats.discard_nb_ok++;
|
|
|
|
s->stats.discard_bytes_ok += nbytes;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-06-21 20:07:32 +03:00
|
|
|
static coroutine_fn int
|
block: use int64_t instead of int in driver discard handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert driver discard handlers bytes parameter to int64_t.
The only caller of all updated functions is bdrv_co_pdiscard in
block/io.c. It is already prepared to work with 64bit requests, but
pass at most max(bs->bl.max_pdiscard, INT_MAX) to the driver.
Let's look at all updated functions:
blkdebug: all calculations are still OK, thanks to
bdrv_check_qiov_request().
both rule_check and bdrv_co_pdiscard are 64bit
blklogwrites: pass to blk_loc_writes_co_log which is 64bit
blkreplay, copy-on-read, filter-compress: pass to bdrv_co_pdiscard, OK
copy-before-write: pass to bdrv_co_pdiscard which is 64bit and to
cbw_do_copy_before_write which is 64bit
file-posix: one handler calls raw_account_discard(), which is 64bit, and both
handlers call raw_do_pdiscard(). Update raw_do_pdiscard(), which passes
to RawPosixAIOData::aio_nbytes, which is 64bit (and calls
raw_account_discard())
gluster: somehow, third argument of glfs_discard_async is size_t.
Let's set max_pdiscard accordingly.
iscsi: iscsi_allocmap_set_invalid is 64bit,
!is_byte_request_lun_aligned is 64bit.
list.num is uint32_t. Let's clarify max_pdiscard and
pdiscard_alignment.
mirror_top: pass to bdrv_mirror_top_do_write() which is
64bit
nbd: protocol limitation. max_pdiscard is already set strict enough,
keep it as is for now.
nvme: buf.nlb is uint32_t and we do shift. So, add corresponding limits
to nvme_refresh_limits().
preallocate: pass to bdrv_co_pdiscard() which is 64bit.
rbd: pass to qemu_rbd_start_co() which is 64bit.
qcow2: calculations are still OK, thanks to bdrv_check_qiov_request(),
qcow2_cluster_discard() is 64bit.
raw-format: raw_adjust_offset() is 64bit, bdrv_co_pdiscard too.
throttle: pass to bdrv_co_pdiscard() which is 64bit and to
throttle_group_co_io_limits_intercept() which is 64bit as well.
test-block-iothread: bytes argument is unused
Great! Now all drivers are prepared to handle 64bit discard requests,
or else have explicit max_pdiscard limits.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-11-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-03 13:28:06 +03:00
|
|
|
raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
|
|
|
|
bool blkdev)
|
2010-12-17 13:41:15 +03:00
|
|
|
{
|
|
|
|
BDRVRawState *s = bs->opaque;
|
2018-10-25 16:18:58 +03:00
|
|
|
RawPosixAIOData acb;
|
2019-09-23 15:17:36 +03:00
|
|
|
int ret;
|
2018-10-25 16:18:58 +03:00
|
|
|
|
|
|
|
acb = (RawPosixAIOData) {
|
|
|
|
.bs = bs,
|
|
|
|
.aio_fildes = s->fd,
|
|
|
|
.aio_type = QEMU_AIO_DISCARD,
|
|
|
|
.aio_offset = offset,
|
|
|
|
.aio_nbytes = bytes,
|
|
|
|
};
|
2010-12-17 13:41:15 +03:00
|
|
|
|
2018-10-25 16:18:58 +03:00
|
|
|
if (blkdev) {
|
|
|
|
acb.aio_type |= QEMU_AIO_BLKDEV;
|
|
|
|
}
|
|
|
|
|
2019-09-23 15:17:36 +03:00
|
|
|
ret = raw_thread_pool_submit(bs, handle_aiocb_discard, &acb);
|
|
|
|
raw_account_discard(s, bytes, ret);
|
|
|
|
return ret;
|
2018-10-25 16:18:58 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static coroutine_fn int
|
2021-09-03 13:28:06 +03:00
|
|
|
raw_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
|
2018-10-25 16:18:58 +03:00
|
|
|
{
|
|
|
|
return raw_do_pdiscard(bs, offset, bytes, false);
|
2010-12-17 13:41:15 +03:00
|
|
|
}
|
2009-05-18 18:42:10 +04:00
|
|
|
|
2018-10-25 16:18:58 +03:00
|
|
|
static int coroutine_fn
|
block: use int64_t instead of int in driver write_zeroes handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert driver write_zeroes handlers bytes parameter to int64_t.
The only caller of all updated functions is bdrv_co_do_pwrite_zeroes().
bdrv_co_do_pwrite_zeroes() itself is of course OK with widening of
callee parameter type. Also, bdrv_co_do_pwrite_zeroes()'s
max_write_zeroes is limited to INT_MAX. So, updated functions all are
safe, they will not get "bytes" larger than before.
Still, let's look through all updated functions, and add assertions to
the ones which are actually unprepared to values larger than INT_MAX.
For these drivers also set explicit max_pwrite_zeroes limit.
Let's go:
blkdebug: calculations can't overflow, thanks to
bdrv_check_qiov_request() in generic layer. rule_check() and
bdrv_co_pwrite_zeroes() both have 64bit argument.
blklogwrites: pass to blk_log_writes_co_log() with 64bit argument.
blkreplay, copy-on-read, filter-compress: pass to
bdrv_co_pwrite_zeroes() which is OK
copy-before-write: Calls cbw_do_copy_before_write() and
bdrv_co_pwrite_zeroes, both have 64bit argument.
file-posix: both handlers call raw_do_pwrite_zeroes, which is updated.
In raw_do_pwrite_zeroes() calculations are OK due to
bdrv_check_qiov_request(), bytes go to RawPosixAIOData::aio_nbytes
which is uint64_t.
Check also where that uint64_t gets handed:
handle_aiocb_write_zeroes_block() passes a uint64_t[2] to
ioctl(BLKZEROOUT), handle_aiocb_write_zeroes() calls do_fallocate()
which takes off_t (and we compile to always have 64-bit off_t), as
does handle_aiocb_write_zeroes_unmap. All look safe.
gluster: bytes go to GlusterAIOCB::size which is int64_t and to
glfs_zerofill_async works with off_t.
iscsi: Aha, here we deal with iscsi_writesame16_task() that has
uint32_t num_blocks argument and iscsi_writesame16_task() has
uint16_t argument. Make comments, add assertions and clarify
max_pwrite_zeroes calculation.
iscsi_allocmap_() functions already has int64_t argument
is_byte_request_lun_aligned is simple to update, do it.
mirror_top: pass to bdrv_mirror_top_do_write which has uint64_t
argument
nbd: Aha, here we have protocol limitation, and NBDRequest::len is
uint32_t. max_pwrite_zeroes is cleanly set to 32bit value, so we are
OK for now.
nvme: Again, protocol limitation. And no inherent limit for
write-zeroes at all. But from code that calculates cdw12 it's obvious
that we do have limit and alignment. Let's clarify it. Also,
obviously the code is not prepared to handle bytes=0. Let's handle
this case too.
trace events already 64bit
preallocate: pass to handle_write() and bdrv_co_pwrite_zeroes(), both
64bit.
rbd: pass to qemu_rbd_start_co() which is 64bit.
qcow2: offset + bytes and alignment still works good (thanks to
bdrv_check_qiov_request()), so tail calculation is OK
qcow2_subcluster_zeroize() has 64bit argument, should be OK
trace events updated
qed: qed_co_request wants int nb_sectors. Also in code we have size_t
used for request length which may be 32bit. So, let's just keep
INT_MAX as a limit (aligning it down to pwrite_zeroes_alignment) and
don't care.
raw-format: Is OK. raw_adjust_offset and bdrv_co_pwrite_zeroes are both
64bit.
throttle: Both throttle_group_co_io_limits_intercept() and
bdrv_co_pwrite_zeroes() are 64bit.
vmdk: pass to vmdk_pwritev which is 64bit
quorum: pass to quorum_co_pwritev() which is 64bit
Hooray!
At this point all block drivers are prepared to support 64bit
write-zero requests, or have explicitly set max_pwrite_zeroes.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-8-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: use <= rather than < in assertions relying on max_pwrite_zeroes]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-03 13:28:03 +03:00
|
|
|
raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int64_t bytes,
|
2018-10-25 16:18:58 +03:00
|
|
|
BdrvRequestFlags flags, bool blkdev)
|
2013-11-22 16:39:55 +04:00
|
|
|
{
|
|
|
|
BDRVRawState *s = bs->opaque;
|
2018-10-25 16:18:58 +03:00
|
|
|
RawPosixAIOData acb;
|
|
|
|
ThreadPoolFunc *handler;
|
|
|
|
|
2019-11-01 18:25:10 +03:00
|
|
|
#ifdef CONFIG_FALLOCATE
|
|
|
|
if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
|
|
|
|
BdrvTrackedRequest *req;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is a workaround for a bug in the Linux XFS driver,
|
|
|
|
* where writes submitted through the AIO interface will be
|
|
|
|
* discarded if they happen beyond a concurrently running
|
|
|
|
* fallocate() that increases the file length (i.e., both the
|
|
|
|
* write and the fallocate() happen beyond the EOF).
|
|
|
|
*
|
|
|
|
* To work around it, we extend the tracked request for this
|
|
|
|
* zero write until INT64_MAX (effectively infinity), and mark
|
|
|
|
* it as serializing.
|
|
|
|
*
|
|
|
|
* We have to enable this workaround for all filesystems and
|
|
|
|
* AIO modes (not just XFS with aio=native), because for
|
|
|
|
* remote filesystems we do not know the host configuration.
|
|
|
|
*/
|
|
|
|
|
|
|
|
req = bdrv_co_get_self_request(bs);
|
|
|
|
assert(req);
|
|
|
|
assert(req->type == BDRV_TRACKED_WRITE);
|
|
|
|
assert(req->offset <= offset);
|
|
|
|
assert(req->offset + req->bytes >= offset + bytes);
|
|
|
|
|
block: introduce BDRV_MAX_LENGTH
We are going to modify block layer to work with 64bit requests. And
first step is moving to int64_t type for both offset and bytes
arguments in all block request related functions.
It's mostly safe (when widening signed or unsigned int to int64_t), but
switching from uint64_t is questionable.
So, let's first establish the set of requests we want to work with.
First signed int64_t should be enough, as off_t is signed anyway. Then,
obviously offset + bytes should not overflow.
And most interesting: (offset + bytes) being aligned up should not
overflow as well. Aligned to what alignment? First thing that comes in
mind is bs->bl.request_alignment, as we align up request to this
alignment. But there is another thing: look at
bdrv_mark_request_serialising(). It aligns request up to some given
alignment. And this parameter may be bdrv_get_cluster_size(), which is
often a lot greater than bs->bl.request_alignment.
Note also, that bdrv_mark_request_serialising() uses signed int64_t for
calculations. So, actually, we already depend on some restrictions.
Happily, bdrv_get_cluster_size() returns int and
bs->bl.request_alignment has 32bit unsigned type, but defined to be a
power of 2 less than INT_MAX. So, we may establish, that INT_MAX is
absolute maximum for any kind of alignment that may occur with the
request.
Note, that bdrv_get_cluster_size() is not documented to return power
of 2, still bdrv_mark_request_serialising() behaves like it is.
Also, backup uses bdi.cluster_size and is not prepared to it not being
power of 2.
So, let's establish that Qemu supports only power-of-2 clusters and
alignments.
So, alignment can't be greater than 2^30.
Finally to be safe with calculations, to not calculate different
maximums for different nodes (depending on cluster size and
request_alignment), let's simply set QEMU_ALIGN_DOWN(INT64_MAX, 2^30)
as absolute maximum bytes length for Qemu. Actually, it's not much less
than INT64_MAX.
OK, then, let's apply it to block/io.
Let's consider all block/io entry points of offset/bytes:
4 bytes/offset interface functions: bdrv_co_preadv_part(),
bdrv_co_pwritev_part(), bdrv_co_copy_range_internal() and
bdrv_co_pdiscard() and we check them all with bdrv_check_request().
We also have one entry point with only offset: bdrv_co_truncate().
Check the offset.
And one public structure: BdrvTrackedRequest. Happily, it has only
three external users:
file-posix.c: adopted by this patch
write-threshold.c: only read fields
test-write-threshold.c: sets obviously small constant values
Better is to make the structure private and add corresponding
interfaces.. Still it's not obvious what kind of interface is needed
for file-posix.c. Let's keep it public but add corresponding
assertions.
After this patch we'll convert functions in block/io.c to int64_t bytes
and offset parameters. We can assume that offset/bytes pair always
satisfy new restrictions, and make
corresponding assertions where needed. If we reach some offset/bytes
point in block/io.c missing bdrv_check_request() it is considered a
bug. As well, if block/io.c modifies a offset/bytes request, expanding
it more then aligning up to request_alignment, it's a bug too.
For all io requests except for discard we keep for now old restriction
of 32bit request length.
iotest 206 output error message changed, as now test disk size is
larger than new limit. Add one more test case with new maximum disk
size to cover too-big-L1 case.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201203222713.13507-5-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-12-04 01:27:13 +03:00
|
|
|
req->bytes = BDRV_MAX_LENGTH - req->offset;
|
|
|
|
|
2020-12-11 21:39:19 +03:00
|
|
|
bdrv_check_request(req->offset, req->bytes, &error_abort);
|
2019-11-01 18:25:10 +03:00
|
|
|
|
2020-10-21 17:58:43 +03:00
|
|
|
bdrv_make_request_serialising(req, bs->bl.request_alignment);
|
2019-11-01 18:25:10 +03:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2018-10-25 16:18:58 +03:00
|
|
|
acb = (RawPosixAIOData) {
|
|
|
|
.bs = bs,
|
|
|
|
.aio_fildes = s->fd,
|
|
|
|
.aio_type = QEMU_AIO_WRITE_ZEROES,
|
|
|
|
.aio_offset = offset,
|
|
|
|
.aio_nbytes = bytes,
|
|
|
|
};
|
|
|
|
|
|
|
|
if (blkdev) {
|
|
|
|
acb.aio_type |= QEMU_AIO_BLKDEV;
|
|
|
|
}
|
2019-03-22 15:45:23 +03:00
|
|
|
if (flags & BDRV_REQ_NO_FALLBACK) {
|
|
|
|
acb.aio_type |= QEMU_AIO_NO_FALLBACK;
|
|
|
|
}
|
2013-11-22 16:39:55 +04:00
|
|
|
|
2018-07-26 12:28:30 +03:00
|
|
|
if (flags & BDRV_REQ_MAY_UNMAP) {
|
2018-10-25 16:18:58 +03:00
|
|
|
acb.aio_type |= QEMU_AIO_DISCARD;
|
|
|
|
handler = handle_aiocb_write_zeroes_unmap;
|
|
|
|
} else {
|
|
|
|
handler = handle_aiocb_write_zeroes;
|
2013-11-22 16:39:55 +04:00
|
|
|
}
|
2018-07-26 12:28:30 +03:00
|
|
|
|
2018-10-25 16:18:58 +03:00
|
|
|
return raw_thread_pool_submit(bs, handler, &acb);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int coroutine_fn raw_co_pwrite_zeroes(
|
|
|
|
BlockDriverState *bs, int64_t offset,
|
block: use int64_t instead of int in driver write_zeroes handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert driver write_zeroes handlers bytes parameter to int64_t.
The only caller of all updated function is bdrv_co_do_pwrite_zeroes().
bdrv_co_do_pwrite_zeroes() itself is of course OK with widening of the
callee parameter type. Also, bdrv_co_do_pwrite_zeroes()'s
max_write_zeroes is limited to INT_MAX. So, the updated functions are all
safe; they will not get "bytes" larger than before.
Still, let's look through all updated functions, and add assertions to
the ones which are actually unprepared for values larger than INT_MAX.
For these drivers also set an explicit max_pwrite_zeroes limit.
Let's go:
blkdebug: calculations can't overflow, thanks to
bdrv_check_qiov_request() in generic layer. rule_check() and
bdrv_co_pwrite_zeroes() both have 64bit argument.
blklogwrites: pass to blk_log_writes_co_log() with 64bit argument.
blkreplay, copy-on-read, filter-compress: pass to
bdrv_co_pwrite_zeroes() which is OK
copy-before-write: Calls cbw_do_copy_before_write() and
bdrv_co_pwrite_zeroes, both have 64bit argument.
file-posix: both handler calls raw_do_pwrite_zeroes, which is updated.
In raw_do_pwrite_zeroes() calculations are OK due to
bdrv_check_qiov_request(), bytes go to RawPosixAIOData::aio_nbytes
which is uint64_t.
Check also where that uint64_t gets handed:
handle_aiocb_write_zeroes_block() passes a uint64_t[2] to
ioctl(BLKZEROOUT), handle_aiocb_write_zeroes() calls do_fallocate()
which takes off_t (and we compile to always have 64-bit off_t), as
does handle_aiocb_write_zeroes_unmap. All look safe.
gluster: bytes go to GlusterAIOCB::size which is int64_t and to
glfs_zerofill_async works with off_t.
iscsi: Aha, here we deal with iscsi_writesame16_task(), which has a
uint32_t num_blocks argument, while iscsi_writesame10_task() has a
uint16_t argument. Make comments, add assertions and clarify
max_pwrite_zeroes calculation.
iscsi_allocmap_*() functions already have an int64_t argument;
is_byte_request_lun_aligned() is simple to update, so do it.
mirror_top: pass to bdrv_mirror_top_do_write which has uint64_t
argument
nbd: Aha, here we have protocol limitation, and NBDRequest::len is
uint32_t. max_pwrite_zeroes is cleanly set to 32bit value, so we are
OK for now.
nvme: Again, protocol limitation. And no inherent limit for
write-zeroes at all. But from the code that calculates cdw12 it's
obvious that we do have a limit and alignment. Let's clarify it. Also,
obviously the code is not prepared to handle bytes=0. Let's handle
this case too.
trace events already 64bit
preallocate: pass to handle_write() and bdrv_co_pwrite_zeroes(), both
64bit.
rbd: pass to qemu_rbd_start_co() which is 64bit.
qcow2: offset + bytes and alignment still works good (thanks to
bdrv_check_qiov_request()), so tail calculation is OK
qcow2_subcluster_zeroize() has 64bit argument, should be OK
trace events updated
qed: qed_co_request wants int nb_sectors. Also in code we have size_t
used for request length which may be 32bit. So, let's just keep
INT_MAX as a limit (aligning it down to pwrite_zeroes_alignment) and
don't care.
raw-format: Is OK. raw_adjust_offset and bdrv_co_pwrite_zeroes are both
64bit.
throttle: Both throttle_group_co_io_limits_intercept() and
bdrv_co_pwrite_zeroes() are 64bit.
vmdk: pass to vmdk_pwritev which is 64bit
quorum: pass to quorum_co_pwritev() which is 64bit
Hooray!
At this point all block drivers are prepared to support 64bit
write-zero requests, or have explicitly set max_pwrite_zeroes.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-8-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: use <= rather than < in assertions relying on max_pwrite_zeroes]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-03 13:28:03 +03:00
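The fragmentation contract described in the commit message above (the generic layer never hands a driver more than its advertised max_pwrite_zeroes, so narrowing inside the driver is provably safe) can be sketched in isolation. The names below are illustrative, not QEMU's actual API:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Illustrative limit a driver with a 32-bit backend would advertise. */
static const int64_t max_pwrite_zeroes = INT_MAX;

/* Driver handler: relies on the generic layer's fragmentation promise,
 * so narrowing the 64-bit bytes value to a 32-bit length is safe. */
static int driver_pwrite_zeroes(int64_t offset, int64_t bytes)
{
    assert(bytes <= max_pwrite_zeroes);
    uint32_t len = (uint32_t)bytes;
    (void)offset;
    (void)len;
    return 0;
}

/* Generic layer: fragments a 64-bit request at max_pwrite_zeroes. */
static int do_pwrite_zeroes(int64_t offset, int64_t bytes)
{
    while (bytes > 0) {
        int64_t num = bytes < max_pwrite_zeroes ? bytes : max_pwrite_zeroes;
        int ret = driver_pwrite_zeroes(offset, num);
        if (ret < 0) {
            return ret;
        }
        offset += num;
        bytes -= num;
    }
    return 0;
}
```

With this split, a request covering a whole multi-gigabyte disk simply loops over INT_MAX-sized pieces inside the generic layer.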
|
|
|
int64_t bytes, BdrvRequestFlags flags)
|
2018-10-25 16:18:58 +03:00
|
|
|
{
|
|
|
|
return raw_do_pwrite_zeroes(bs, offset, bytes, flags, false);
|
2013-11-22 16:39:55 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-09-23 15:17:37 +03:00
|
|
|
static BlockStatsSpecificFile get_blockstats_specific_file(BlockDriverState *bs)
|
|
|
|
{
|
|
|
|
BDRVRawState *s = bs->opaque;
|
|
|
|
return (BlockStatsSpecificFile) {
|
|
|
|
.discard_nb_ok = s->stats.discard_nb_ok,
|
|
|
|
.discard_nb_failed = s->stats.discard_nb_failed,
|
|
|
|
.discard_bytes_ok = s->stats.discard_bytes_ok,
|
|
|
|
};
|
|
|
|
}
|
|
|
|
|
|
|
|
static BlockStatsSpecific *raw_get_specific_stats(BlockDriverState *bs)
|
|
|
|
{
|
|
|
|
BlockStatsSpecific *stats = g_new(BlockStatsSpecific, 1);
|
|
|
|
|
|
|
|
stats->driver = BLOCKDEV_DRIVER_FILE;
|
|
|
|
stats->u.file = get_blockstats_specific_file(bs);
|
|
|
|
|
|
|
|
return stats;
|
|
|
|
}
|
|
|
|
|
2021-03-15 21:03:38 +03:00
|
|
|
#if defined(HAVE_HOST_BLOCK_DEVICE)
|
2019-09-23 15:17:37 +03:00
|
|
|
static BlockStatsSpecific *hdev_get_specific_stats(BlockDriverState *bs)
|
|
|
|
{
|
|
|
|
BlockStatsSpecific *stats = g_new(BlockStatsSpecific, 1);
|
|
|
|
|
|
|
|
stats->driver = BLOCKDEV_DRIVER_HOST_DEVICE;
|
|
|
|
stats->u.host_device = get_blockstats_specific_file(bs);
|
|
|
|
|
|
|
|
return stats;
|
|
|
|
}
|
2021-03-15 21:03:38 +03:00
|
|
|
#endif /* HAVE_HOST_BLOCK_DEVICE */
|
2019-09-23 15:17:37 +03:00
|
|
|
|
2014-06-05 13:21:01 +04:00
|
|
|
static QemuOptsList raw_create_opts = {
|
|
|
|
.name = "raw-create-opts",
|
|
|
|
.head = QTAILQ_HEAD_INITIALIZER(raw_create_opts.head),
|
|
|
|
.desc = {
|
|
|
|
{
|
|
|
|
.name = BLOCK_OPT_SIZE,
|
|
|
|
.type = QEMU_OPT_SIZE,
|
|
|
|
.help = "Virtual disk size"
|
|
|
|
},
|
qemu-img create: add 'nocow' option
Add a 'nocow' option so that users have a chance to set the NOCOW flag on
newly created files. It's useful on the btrfs file system to enhance
performance.
Btrfs has low performance when hosting VM images, even more so when the
guests in those VMs are also using btrfs as their file system. One way to
mitigate this poor performance is to turn off COW on VM files. Generally,
there are two ways to turn off COW on btrfs: a) by mounting the fs with
nodatacow, so that all newly created files will be NOCOW; b) per file, by
adding the NOCOW file attribute. This can only be done for empty or new
files.
This patch takes the second way: according to the option, it adds NOCOW
per file.
For most block drivers, since the file creation step is in raw-posix.c, we
can do the NOCOW-setting ioctl in raw-posix.c only.
But there are some exceptions, like block/vpc.c and block/vdi.c, which
create the file by calling qemu_open directly. For them, do the same
NOCOW-setting ioctl work separately.
[Fixed up 082.out due to the new 'nocow' creation option
--Stefan]
Signed-off-by: Chunyan Liu <cyliu@suse.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2014-06-30 10:29:58 +04:00
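The per-file approach described in the commit message boils down to one ioctl sequence on Linux; a minimal sketch with error handling reduced to returning -errno (the flag only takes effect on btrfs, and only for empty or new files):

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Best-effort NOCOW: read the inode attribute flags, OR in
 * FS_NOCOW_FL, and write them back. It is harmless to attempt this
 * on non-btrfs file systems; callers may ignore a failure. */
static int set_nocow(int fd)
{
    int attr;

    if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0) {
        return -errno;
    }
    attr |= FS_NOCOW_FL;
    if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0) {
        return -errno;
    }
    return 0;
}
```

Since the flag is a best-effort optimization, a create path would typically log and continue rather than fail the whole image creation when this returns an error.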
|
|
|
{
|
|
|
|
.name = BLOCK_OPT_NOCOW,
|
|
|
|
.type = QEMU_OPT_BOOL,
|
|
|
|
.help = "Turn off copy-on-write (valid only on btrfs)"
|
|
|
|
},
|
2014-09-10 13:05:48 +04:00
|
|
|
{
|
|
|
|
.name = BLOCK_OPT_PREALLOC,
|
|
|
|
.type = QEMU_OPT_STRING,
|
2019-05-24 10:58:47 +03:00
|
|
|
.help = "Preallocation mode (allowed values: off"
|
|
|
|
#ifdef CONFIG_POSIX_FALLOCATE
|
|
|
|
", falloc"
|
|
|
|
#endif
|
|
|
|
", full)"
|
2014-09-10 13:05:48 +04:00
|
|
|
},
|
file-posix: Mitigate file fragmentation with extent size hints
Especially when O_DIRECT is used with image files so that the page cache
indirection can't cause a merge of allocating requests, the file will
fragment on the file system layer, with a potentially very small
fragment size (this depends on the requests the guest sent).
On Linux, fragmentation can be reduced by setting an extent size hint
when creating the file (at least on XFS, it can't be set any more after
the first extent has been allocated), basically giving raw files a
"cluster size" for allocation.
This adds a create option to set the extent size hint, and changes the
default from not setting a hint to setting it to 1 MB. The main reason
why qcow2 defaults to smaller cluster sizes is that COW becomes more
expensive, which is not an issue with raw files, so we can choose a
larger size. The tradeoff here is only potentially wasted disk space.
For qcow2 (or other image formats) over file-posix, the advantage should
even be greater because they grow sequentially without leaving holes, so
there won't be wasted space. Setting even larger extent size hints for
such images may make sense. This can be done with the new option, but
let's keep the default conservative for now.
The effect is very visible with a test that intentionally creates a
badly fragmented file with qemu-img bench (the time difference while
creating the file is already remarkable) and then looks at the number of
extents and the time a simple "qemu-img map" takes.
Without an extent size hint:
$ ./qemu-img create -f raw -o extent_size_hint=0 ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=0
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 25.848 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 19.616 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 2000000 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m1,279s
user 0m0,043s
sys 0m1,226s
With the new default extent size hint of 1 MB:
$ ./qemu-img create -f raw -o extent_size_hint=1M ~/tmp/test.raw 10G
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=1048576
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 11.833 seconds.
$ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 10.155 seconds.
$ filefrag ~/tmp/test.raw
/home/kwolf/tmp/test.raw: 178 extents found
$ time ./qemu-img map ~/tmp/test.raw
Offset Length Mapped to File
0 0x1e8480000 0 /home/kwolf/tmp/test.raw
real 0m0,061s
user 0m0,040s
sys 0m0,014s
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200707142329.48303-1-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-07-07 17:23:29 +03:00
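On Linux, setting such an extent size hint at create time goes through the fsxattr ioctl interface; a hedged sketch of that call sequence (as the commit message notes, XFS honors the hint only before the first extent is allocated):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Apply an extent size hint to a freshly created file: fetch the
 * current fsxattr, set the EXTSIZE flag and the hint in bytes, and
 * write it back. A hint of 0 leaves allocation behavior unchanged. */
static int set_extent_size_hint(int fd, uint32_t hint_bytes)
{
    struct fsxattr attr;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &attr) < 0) {
        return -errno;
    }
    attr.fsx_xflags |= FS_XFLAG_EXTSIZE;
    attr.fsx_extsize = hint_bytes;
    if (ioctl(fd, FS_IOC_FSSETXATTR, &attr) < 0) {
        return -errno;
    }
    return 0;
}
```

File systems that do not support the ioctl simply fail it, which a caller can treat as non-fatal when the hint was not explicitly requested.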
|
|
|
{
|
|
|
|
.name = BLOCK_OPT_EXTENT_SIZE_HINT,
|
|
|
|
.type = QEMU_OPT_SIZE,
|
|
|
|
.help = "Extent size hint for the image file, 0 to disable"
|
|
|
|
},
|
2014-06-05 13:21:01 +04:00
|
|
|
{ /* end of list */ }
|
|
|
|
}
|
2009-05-18 18:42:10 +04:00
|
|
|
};
|
|
|
|
|
2017-05-02 19:35:56 +03:00
|
|
|
static int raw_check_perm(BlockDriverState *bs, uint64_t perm, uint64_t shared,
|
|
|
|
Error **errp)
|
|
|
|
{
|
2019-03-08 17:40:40 +03:00
|
|
|
BDRVRawState *s = bs->opaque;
|
2021-04-28 18:17:58 +03:00
|
|
|
int input_flags = s->reopen_state ? s->reopen_state->flags : bs->open_flags;
|
2019-03-08 17:40:40 +03:00
|
|
|
int open_flags;
|
|
|
|
int ret;
|
|
|
|
|
2021-04-28 18:17:58 +03:00
|
|
|
/* We may need a new fd if auto-read-only switches the mode */
|
|
|
|
ret = raw_reconfigure_getfd(bs, input_flags, &open_flags, perm,
|
|
|
|
false, errp);
|
|
|
|
if (ret < 0) {
|
|
|
|
return ret;
|
|
|
|
} else if (ret != s->fd) {
|
|
|
|
Error *local_err = NULL;
|
|
|
|
|
2019-03-08 17:40:40 +03:00
|
|
|
/*
|
2021-04-28 18:17:58 +03:00
|
|
|
* Fail already in check_perm() if we can't get a working O_DIRECT
|
|
|
|
* alignment with the new fd.
|
2019-03-08 17:40:40 +03:00
|
|
|
*/
|
2021-04-28 18:17:58 +03:00
|
|
|
raw_probe_alignment(bs, ret, &local_err);
|
|
|
|
if (local_err) {
|
|
|
|
error_propagate(errp, local_err);
|
|
|
|
return -EINVAL;
|
2019-03-08 17:40:40 +03:00
|
|
|
}
|
2021-04-28 18:17:58 +03:00
|
|
|
|
|
|
|
s->perm_change_fd = ret;
|
|
|
|
s->perm_change_flags = open_flags;
|
2019-03-08 17:40:40 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Prepare permissions on old fd to avoid conflicts between old and new,
|
|
|
|
* but keep everything locked that new will need. */
|
|
|
|
ret = raw_handle_perm_lock(bs, RAW_PL_PREPARE, perm, shared, errp);
|
|
|
|
if (ret < 0) {
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Copy locks to the new fd */
|
2020-12-07 14:44:06 +03:00
|
|
|
if (s->perm_change_fd && s->use_lock) {
|
2019-03-08 17:40:40 +03:00
|
|
|
ret = raw_apply_lock_bytes(NULL, s->perm_change_fd, perm, ~shared,
|
|
|
|
false, errp);
|
|
|
|
if (ret < 0) {
|
|
|
|
raw_handle_perm_lock(bs, RAW_PL_ABORT, 0, 0, NULL);
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
fail:
|
2021-04-28 18:17:58 +03:00
|
|
|
if (s->perm_change_fd) {
|
2019-03-08 17:40:40 +03:00
|
|
|
qemu_close(s->perm_change_fd);
|
|
|
|
}
|
|
|
|
s->perm_change_fd = 0;
|
|
|
|
return ret;
|
2017-05-02 19:35:56 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void raw_set_perm(BlockDriverState *bs, uint64_t perm, uint64_t shared)
|
|
|
|
{
|
|
|
|
BDRVRawState *s = bs->opaque;
|
2019-03-08 17:40:40 +03:00
|
|
|
|
|
|
|
/* For reopen, we have already switched to the new fd (.bdrv_set_perm is
|
|
|
|
* called after .bdrv_reopen_commit) */
|
|
|
|
if (s->perm_change_fd && s->fd != s->perm_change_fd) {
|
|
|
|
qemu_close(s->fd);
|
|
|
|
s->fd = s->perm_change_fd;
|
2019-05-22 20:03:45 +03:00
|
|
|
s->open_flags = s->perm_change_flags;
|
2019-03-08 17:40:40 +03:00
|
|
|
}
|
|
|
|
s->perm_change_fd = 0;
|
|
|
|
|
2017-05-02 19:35:56 +03:00
|
|
|
raw_handle_perm_lock(bs, RAW_PL_COMMIT, perm, shared, NULL);
|
|
|
|
s->perm = perm;
|
|
|
|
s->shared_perm = shared;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void raw_abort_perm_update(BlockDriverState *bs)
|
|
|
|
{
|
2019-03-08 17:40:40 +03:00
|
|
|
BDRVRawState *s = bs->opaque;
|
|
|
|
|
|
|
|
/* For reopen, .bdrv_reopen_abort is called afterwards and will close
|
|
|
|
* the file descriptor. */
|
2021-04-28 18:17:58 +03:00
|
|
|
if (s->perm_change_fd) {
|
2019-03-08 17:40:40 +03:00
|
|
|
qemu_close(s->perm_change_fd);
|
|
|
|
}
|
|
|
|
s->perm_change_fd = 0;
|
|
|
|
|
2017-05-02 19:35:56 +03:00
|
|
|
raw_handle_perm_lock(bs, RAW_PL_ABORT, 0, 0, NULL);
|
|
|
|
}
|
|
|
|
|
2018-07-09 19:37:17 +03:00
|
|
|
static int coroutine_fn raw_co_copy_range_from(
|
2021-09-03 13:28:01 +03:00
|
|
|
BlockDriverState *bs, BdrvChild *src, int64_t src_offset,
|
|
|
|
BdrvChild *dst, int64_t dst_offset, int64_t bytes,
|
2018-07-09 19:37:17 +03:00
|
|
|
BdrvRequestFlags read_flags, BdrvRequestFlags write_flags)
|
2018-06-01 12:26:43 +03:00
|
|
|
{
|
2018-07-09 19:37:17 +03:00
|
|
|
return bdrv_co_copy_range_to(src, src_offset, dst, dst_offset, bytes,
|
|
|
|
read_flags, write_flags);
|
2018-06-01 12:26:43 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static int coroutine_fn raw_co_copy_range_to(BlockDriverState *bs,
|
2018-07-09 19:37:17 +03:00
|
|
|
BdrvChild *src,
|
2021-09-03 13:28:01 +03:00
|
|
|
int64_t src_offset,
|
2018-07-09 19:37:17 +03:00
|
|
|
BdrvChild *dst,
|
2021-09-03 13:28:01 +03:00
|
|
|
int64_t dst_offset,
|
|
|
|
int64_t bytes,
|
2018-07-09 19:37:17 +03:00
|
|
|
BdrvRequestFlags read_flags,
|
|
|
|
BdrvRequestFlags write_flags)
|
2018-06-01 12:26:43 +03:00
|
|
|
{
|
2018-10-25 16:18:58 +03:00
|
|
|
RawPosixAIOData acb;
|
2018-06-01 12:26:43 +03:00
|
|
|
BDRVRawState *s = bs->opaque;
|
|
|
|
BDRVRawState *src_s;
|
|
|
|
|
|
|
|
assert(dst->bs == bs);
|
|
|
|
if (src->bs->drv->bdrv_co_copy_range_to != raw_co_copy_range_to) {
|
|
|
|
return -ENOTSUP;
|
|
|
|
}
|
|
|
|
|
|
|
|
src_s = src->bs->opaque;
|
2018-07-02 05:58:34 +03:00
|
|
|
if (fd_open(src->bs) < 0 || fd_open(dst->bs) < 0) {
|
2018-06-01 12:26:43 +03:00
|
|
|
return -EIO;
|
|
|
|
}
|
2018-10-25 16:18:58 +03:00
|
|
|
|
|
|
|
acb = (RawPosixAIOData) {
|
|
|
|
.bs = bs,
|
|
|
|
.aio_type = QEMU_AIO_COPY_RANGE,
|
|
|
|
.aio_fildes = src_s->fd,
|
|
|
|
.aio_offset = src_offset,
|
|
|
|
.aio_nbytes = bytes,
|
|
|
|
.copy_range = {
|
|
|
|
.aio_fd2 = s->fd,
|
|
|
|
.aio_offset2 = dst_offset,
|
|
|
|
},
|
|
|
|
};
|
|
|
|
|
|
|
|
return raw_thread_pool_submit(bs, handle_aiocb_copy_range, &acb);
|
2018-06-01 12:26:43 +03:00
|
|
|
}
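The handle_aiocb_copy_range worker submitted above is, on Linux, ultimately backed by copy_file_range(2). A self-contained sketch of that underlying call; the demo helper and temp-file names are for illustration only:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Offload a byte-range copy between two fds to the kernel. */
static long long do_copy_range(int fd_in, long long off_in,
                               int fd_out, long long off_out, size_t bytes)
{
    off_t in = off_in, out = off_out;
    ssize_t ret = copy_file_range(fd_in, &in, fd_out, &out, bytes, 0);
    return ret < 0 ? -errno : ret;
}

/* Self-check helper: copy "hello" between two unlinked temp files. */
static int demo_copy_range(void)
{
    char src[] = "/tmp/cfr-src-XXXXXX", dst[] = "/tmp/cfr-dst-XXXXXX";
    int in = mkstemp(src), out = mkstemp(dst);
    char buf[6] = {0};
    int ok = -1;

    if (in < 0 || out < 0) {
        return -1;
    }
    unlink(src);
    unlink(dst);
    if (pwrite(in, "hello", 5, 0) == 5 &&
        do_copy_range(in, 0, out, 0, 5) == 5 &&
        pread(out, buf, 5, 0) == 5 && strcmp(buf, "hello") == 0) {
        ok = 0;
    }
    close(in);
    close(out);
    return ok;
}
```

Like the real worker, the sketch can return a short count, so a production caller would loop until the requested range is fully copied.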
|
|
|
|
|
2014-12-02 20:32:41 +03:00
|
|
|
BlockDriver bdrv_file = {
|
2010-04-08 00:30:24 +04:00
|
|
|
.format_name = "file",
|
|
|
|
.protocol_name = "file",
|
2009-04-07 21:57:09 +04:00
|
|
|
.instance_size = sizeof(BDRVRawState),
|
2013-09-24 19:07:04 +04:00
|
|
|
.bdrv_needs_filename = true,
|
2009-04-07 21:57:09 +04:00
|
|
|
.bdrv_probe = NULL, /* no probe for protocols */
|
2014-03-06 01:41:37 +04:00
|
|
|
.bdrv_parse_filename = raw_parse_filename,
|
2010-04-14 16:17:38 +04:00
|
|
|
.bdrv_file_open = raw_open,
|
2012-09-20 23:13:25 +04:00
|
|
|
.bdrv_reopen_prepare = raw_reopen_prepare,
|
|
|
|
.bdrv_reopen_commit = raw_reopen_commit,
|
|
|
|
.bdrv_reopen_abort = raw_reopen_abort,
|
2009-04-07 21:57:09 +04:00
|
|
|
.bdrv_close = raw_close,
|
2018-01-16 18:04:21 +03:00
|
|
|
.bdrv_co_create = raw_co_create,
|
2018-01-18 15:43:45 +03:00
|
|
|
.bdrv_co_create_opts = raw_co_create_opts,
|
2013-06-28 14:47:42 +04:00
|
|
|
.bdrv_has_zero_init = bdrv_has_zero_init_1,
|
2018-02-13 23:26:44 +03:00
|
|
|
.bdrv_co_block_status = raw_co_block_status,
|
2018-04-27 19:23:11 +03:00
|
|
|
.bdrv_co_invalidate_cache = raw_co_invalidate_cache,
|
2016-06-02 00:10:10 +03:00
|
|
|
.bdrv_co_pwrite_zeroes = raw_co_pwrite_zeroes,
|
2020-01-31 00:39:04 +03:00
|
|
|
.bdrv_co_delete_file = raw_co_delete_file,
|
2007-09-17 12:09:54 +04:00
|
|
|
|
2016-06-03 18:36:27 +03:00
|
|
|
.bdrv_co_preadv = raw_co_preadv,
|
|
|
|
.bdrv_co_pwritev = raw_co_pwritev,
|
2018-06-21 20:07:32 +03:00
|
|
|
.bdrv_co_flush_to_disk = raw_co_flush_to_disk,
|
|
|
|
.bdrv_co_pdiscard = raw_co_pdiscard,
|
2018-06-01 12:26:43 +03:00
|
|
|
.bdrv_co_copy_range_from = raw_co_copy_range_from,
|
|
|
|
.bdrv_co_copy_range_to = raw_co_copy_range_to,
|
2011-11-29 15:42:20 +04:00
|
|
|
.bdrv_refresh_limits = raw_refresh_limits,
|
2014-07-04 14:04:34 +04:00
|
|
|
.bdrv_io_plug = raw_aio_plug,
|
|
|
|
.bdrv_io_unplug = raw_aio_unplug,
|
linux-aio: properly bubble up errors from initialization
laio_init() can fail for a couple of reasons, which will lead to a NULL
pointer dereference in laio_attach_aio_context().
To solve this, add a aio_setup_linux_aio() function which is called
early in raw_open_common. If this fails, propagate the error up. The
signature of aio_get_linux_aio() was not modified, because it seems
preferable to return the actual errno from the possible failing
initialization calls.
Additionally, when the AioContext changes, we need to associate a
LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context
callback and call the new aio_setup_linux_aio(), which will allocate a
new AioContext if needed, and return errors on failures. If it fails for
any reason, fallback to threaded AIO with an error message, as the
device is already in-use by the guest.
Add an assert that aio_get_linux_aio() cannot return NULL.
Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>
Message-id: 20180622193700.6523-1-naravamudan@digitalocean.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-22 22:37:00 +03:00
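laio_init() essentially wraps io_setup(2), whose errors (for example -EAGAIN once fs.aio-max-nr is exhausted) are exactly what the patch above surfaces to the caller. A raw-syscall sketch of that failure mode, with a helper name of my own invention:

```c
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Attempt Linux AIO context setup and report the errno immediately,
 * instead of deferring the failure; tears the context down again on
 * success so the sketch has no lasting side effects. */
static long try_laio_init(unsigned nr_events)
{
    unsigned long ctx = 0;   /* stand-in for the kernel's aio_context_t */
    long ret = syscall(SYS_io_setup, nr_events, &ctx);

    if (ret == 0) {
        syscall(SYS_io_destroy, ctx);
        return 0;
    }
    return -errno;
}
```

Checking the result at open time, as the patch does, is what lets the block layer fall back to the thread pool with a clear error message instead of dereferencing a NULL state later.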
|
|
|
.bdrv_attach_aio_context = raw_aio_attach_aio_context,
|
2008-12-12 19:41:40 +03:00
|
|
|
|
block: Convert .bdrv_truncate callback to coroutine_fn
bdrv_truncate() is an operation that can block (even for a quite long
time, depending on the PreallocMode) in I/O paths that shouldn't block.
Convert it to a coroutine_fn so that we have the infrastructure for
drivers to make their .bdrv_co_truncate implementation asynchronous.
This change could potentially introduce new race conditions because
bdrv_truncate() isn't necessarily executed atomically any more. Whether
this is a problem needs to be evaluated for each block driver that
supports truncate:
* file-posix/win32, gluster, iscsi, nfs, rbd, ssh, sheepdog: The
protocol drivers are trivially safe because they don't actually yield
yet, so there is no change in behaviour.
* copy-on-read, crypto, raw-format: Essentially just filter drivers that
pass the request to a child node, no problem.
* qcow2: The implementation modifies metadata, so it needs to hold
s->lock to be safe with concurrent I/O requests. In order to avoid
double locking, this requires pulling the locking out into
preallocate_co() and using qcow2_write_caches() instead of
bdrv_flush().
* qed: Does a single header update, this is fine without locking.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-21 18:54:35 +03:00
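The "quite long time, depending on the PreallocMode" above comes from what the protocol driver does underneath the truncate. A simplified sketch mapping the "off" and "falloc" modes from raw_create_opts onto POSIX calls ("full", which writes zeroes, is omitted; helper names are mine):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* "off": just extend the inode; "falloc": also reserve the blocks,
 * which is where a large truncate can start to take real time. */
static int truncate_with_prealloc(int fd, off_t size, int falloc)
{
    if (falloc) {
        int ret = posix_fallocate(fd, 0, size);
        return ret ? -ret : 0;
    }
    return ftruncate(fd, size) < 0 ? -errno : 0;
}

/* Self-check helper: run the truncate on an unlinked temp file and
 * report the resulting size (or a negative error). */
static long long demo_prealloc(off_t size, int falloc)
{
    char path[] = "/tmp/prealloc-demo-XXXXXX";
    int fd = mkstemp(path);
    long long end;

    if (fd < 0) {
        return -errno;
    }
    unlink(path);
    int ret = truncate_with_prealloc(fd, size, falloc);
    end = ret == 0 ? (long long)lseek(fd, 0, SEEK_END) : ret;
    close(fd);
    return end;
}
```

Because the preallocating variants can block for seconds on big images, running them from a coroutine rather than the main loop is exactly the infrastructure this commit prepares.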
|
|
|
.bdrv_co_truncate = raw_co_truncate,
|
2006-08-01 20:21:11 +04:00
|
|
|
.bdrv_getlength = raw_getlength,
|
2013-11-22 16:39:55 +04:00
|
|
|
.bdrv_get_info = raw_get_info,
|
2011-07-12 15:56:39 +04:00
|
|
|
.bdrv_get_allocated_file_size
|
|
|
|
= raw_get_allocated_file_size,
|
2019-09-23 15:17:37 +03:00
|
|
|
.bdrv_get_specific_stats = raw_get_specific_stats,
|
2017-05-02 19:35:56 +03:00
|
|
|
.bdrv_check_perm = raw_check_perm,
|
|
|
|
.bdrv_set_perm = raw_set_perm,
|
|
|
|
.bdrv_abort_perm_update = raw_abort_perm_update,
|
2014-06-05 13:21:01 +04:00
|
|
|
.create_opts = &raw_create_opts,
|
2019-03-12 19:48:48 +03:00
|
|
|
.mutable_opts = mutable_opts,
|
2006-08-01 20:21:11 +04:00
|
|
|
};
|
|
|
|
|
2006-08-19 15:45:59 +04:00
|
|
|
/***********************************************/
|
|
|
|
/* host device */
|
|
|
|
|
2021-03-15 21:03:38 +03:00
|
|
|
#if defined(HAVE_HOST_BLOCK_DEVICE)
|
|
|
|
|
2011-11-10 22:40:06 +04:00
|
|
|
#if defined(__APPLE__) && defined(__MACH__)
|
2015-11-21 03:17:48 +03:00
|
|
|
static kern_return_t GetBSDPath(io_iterator_t mediaIterator, char *bsdPath,
|
|
|
|
CFIndex maxPathSize, int flags);
|
2016-03-21 18:41:28 +03:00
|
|
|
static char *FindEjectableOpticalMedia(io_iterator_t *mediaIterator)
|
2006-08-19 15:45:59 +04:00
|
|
|
{
|
2016-03-21 18:41:28 +03:00
|
|
|
kern_return_t kernResult = KERN_FAILURE;
|
2006-08-19 15:45:59 +04:00
|
|
|
mach_port_t masterPort;
|
|
|
|
CFMutableDictionaryRef classesToMatch;
|
2016-03-21 18:41:28 +03:00
|
|
|
const char *matching_array[] = {kIODVDMediaClass, kIOCDMediaClass};
|
|
|
|
char *mediaType = NULL;
|
2006-08-19 15:45:59 +04:00
|
|
|
|
|
|
|
kernResult = IOMasterPort(MACH_PORT_NULL, &masterPort);
|
|
|
|
if (kernResult != KERN_SUCCESS) {
|
|
|
|
printf("IOMasterPort returned %d\n", kernResult);
|
|
|
|
}
|
2007-09-17 12:09:54 +04:00
|
|
|
|
2016-03-21 18:41:28 +03:00
|
|
|
int index;
|
|
|
|
for (index = 0; index < ARRAY_SIZE(matching_array); index++) {
|
|
|
|
classesToMatch = IOServiceMatching(matching_array[index]);
|
|
|
|
if (classesToMatch == NULL) {
|
|
|
|
error_report("IOServiceMatching returned NULL for %s",
|
|
|
|
matching_array[index]);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
CFDictionarySetValue(classesToMatch, CFSTR(kIOMediaEjectableKey),
|
|
|
|
kCFBooleanTrue);
|
|
|
|
kernResult = IOServiceGetMatchingServices(masterPort, classesToMatch,
|
|
|
|
mediaIterator);
|
|
|
|
if (kernResult != KERN_SUCCESS) {
|
|
|
|
error_report("Note: IOServiceGetMatchingServices returned %d",
|
|
|
|
kernResult);
|
|
|
|
continue;
|
|
|
|
}
|
2007-09-17 12:09:54 +04:00
|
|
|
|
2016-03-21 18:41:28 +03:00
|
|
|
/* If a match was found, leave the loop */
|
|
|
|
if (*mediaIterator != 0) {
|
2018-12-13 19:27:26 +03:00
|
|
|
trace_file_FindEjectableOpticalMedia(matching_array[index]);
|
2016-03-21 18:41:28 +03:00
|
|
|
mediaType = g_strdup(matching_array[index]);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return mediaType;
|
2006-08-19 15:45:59 +04:00
|
|
|
}
|
|
|
|
|
2015-11-21 03:17:48 +03:00
|
|
|
kern_return_t GetBSDPath(io_iterator_t mediaIterator, char *bsdPath,
|
|
|
|
CFIndex maxPathSize, int flags)
|
2006-08-19 15:45:59 +04:00
|
|
|
{
|
|
|
|
io_object_t nextMedia;
|
|
|
|
kern_return_t kernResult = KERN_FAILURE;
|
|
|
|
*bsdPath = '\0';
|
|
|
|
nextMedia = IOIteratorNext(mediaIterator);
|
|
|
|
if (nextMedia) {
|
|
|
|
CFTypeRef bsdPathAsCFString;
|
|
|
|
bsdPathAsCFString = IORegistryEntryCreateCFProperty(nextMedia, CFSTR(kIOBSDNameKey), kCFAllocatorDefault, 0);
|
|
|
|
if (bsdPathAsCFString) {
|
|
|
|
size_t devPathLength;
|
|
|
|
strcpy(bsdPath, _PATH_DEV);
|
2015-11-21 03:17:48 +03:00
|
|
|
if (flags & BDRV_O_NOCACHE) {
|
|
|
|
strcat(bsdPath, "r");
|
|
|
|
}
|
2006-08-19 15:45:59 +04:00
|
|
|
devPathLength = strlen(bsdPath);
|
|
|
|
if (CFStringGetCString(bsdPathAsCFString, bsdPath + devPathLength, maxPathSize - devPathLength, kCFStringEncodingASCII)) {
|
|
|
|
kernResult = KERN_SUCCESS;
|
|
|
|
}
|
|
|
|
CFRelease(bsdPathAsCFString);
|
|
|
|
}
|
|
|
|
IOObjectRelease(nextMedia);
|
|
|
|
}
|
2007-09-17 12:09:54 +04:00
|
|
|
|
2006-08-19 15:45:59 +04:00
|
|
|
return kernResult;
|
|
|
|
}
|
|
|
|
|
2016-03-21 18:41:28 +03:00
|
|
|
/* Sets up a real cdrom for use in QEMU */
|
|
|
|
static bool setup_cdrom(char *bsd_path, Error **errp)
|
|
|
|
{
|
|
|
|
int index, num_of_test_partitions = 2, fd;
|
|
|
|
char test_partition[MAXPATHLEN];
|
|
|
|
bool partition_found = false;
|
|
|
|
|
|
|
|
/* look for a working partition */
|
|
|
|
for (index = 0; index < num_of_test_partitions; index++) {
|
|
|
|
snprintf(test_partition, sizeof(test_partition), "%ss%d", bsd_path,
|
|
|
|
index);
|
2020-07-01 17:22:43 +03:00
|
|
|
fd = qemu_open(test_partition, O_RDONLY | O_BINARY | O_LARGEFILE, NULL);
|
2016-03-21 18:41:28 +03:00
|
|
|
if (fd >= 0) {
|
|
|
|
partition_found = true;
|
|
|
|
qemu_close(fd);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* if a working partition on the device was not found */
|
|
|
|
if (partition_found == false) {
|
|
|
|
error_setg(errp, "Failed to find a working partition on disc");
|
|
|
|
} else {
|
2018-12-13 19:27:26 +03:00
|
|
|
trace_file_setup_cdrom(test_partition);
|
2016-03-21 18:41:28 +03:00
|
|
|
pstrcpy(bsd_path, MAXPATHLEN, test_partition);
|
|
|
|
}
|
|
|
|
return partition_found;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Prints directions on mounting and unmounting a device */
|
|
|
|
static void print_unmounting_directions(const char *file_name)
|
|
|
|
{
|
|
|
|
error_report("If device %s is mounted on the desktop, unmount"
|
|
|
|
" it first before using it in QEMU", file_name);
|
|
|
|
error_report("Command to unmount device: diskutil unmountDisk %s",
|
|
|
|
file_name);
|
|
|
|
error_report("Command to mount device: diskutil mountDisk %s", file_name);
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* defined(__APPLE__) && defined(__MACH__) */
|
2006-08-19 15:45:59 +04:00
|
|
|
|
2009-06-15 16:04:22 +04:00
|
|
|
static int hdev_probe_device(const char *filename)
|
|
|
|
{
|
|
|
|
struct stat st;
|
|
|
|
|
|
|
|
/* allow a dedicated CD-ROM driver to match with a higher priority */
|
|
|
|
if (strstart(filename, "/dev/cdrom", NULL))
|
|
|
|
return 50;
|
|
|
|
|
|
|
|
if (stat(filename, &st) >= 0 &&
|
|
|
|
(S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode))) {
|
|
|
|
return 100;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-03-08 03:39:41 +04:00
|
|
|
static void hdev_parse_filename(const char *filename, QDict *options,
|
|
|
|
Error **errp)
|
|
|
|
{
|
2017-05-22 22:52:16 +03:00
|
|
|
bdrv_parse_filename_strip_prefix(filename, "host_device:", options);
|
2014-03-08 03:39:41 +04:00
|
|
|
}
|
|
|
|
|
2015-06-23 13:45:00 +03:00
|
|
|
static bool hdev_is_sg(BlockDriverState *bs)
|
|
|
|
{
|
|
|
|
|
|
|
|
#if defined(__linux__)
|
|
|
|
|
2016-10-20 15:50:12 +03:00
|
|
|
BDRVRawState *s = bs->opaque;
|
2015-06-23 13:45:00 +03:00
|
|
|
struct stat st;
|
|
|
|
struct sg_scsi_id scsiid;
|
|
|
|
int sg_version;
|
2016-10-20 15:50:12 +03:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (stat(bs->filename, &st) < 0 || !S_ISCHR(st.st_mode)) {
|
|
|
|
return false;
|
|
|
|
}
|
2015-06-23 13:45:00 +03:00
|
|
|
|
2016-10-20 15:50:12 +03:00
|
|
|
ret = ioctl(s->fd, SG_GET_VERSION_NUM, &sg_version);
|
|
|
|
if (ret < 0) {
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = ioctl(s->fd, SG_GET_SCSI_ID, &scsiid);
|
|
|
|
if (ret >= 0) {
|
2018-12-13 19:27:26 +03:00
|
|
|
trace_file_hdev_is_sg(scsiid.scsi_type, sg_version);
|
2015-06-23 13:45:00 +03:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2013-09-05 16:22:29 +04:00
|
|
|
static int hdev_open(BlockDriverState *bs, QDict *options, int flags,
|
|
|
|
Error **errp)
|
2006-08-19 15:45:59 +04:00
|
|
|
{
|
|
|
|
BDRVRawState *s = bs->opaque;
|
2013-02-05 15:28:33 +04:00
|
|
|
int ret;
|
2008-09-22 23:17:18 +04:00
|
|
|
|
2011-11-10 22:40:06 +04:00
|
|
|
#if defined(__APPLE__) && defined(__MACH__)
|
block: Document -drive problematic code and bugs
-blockdev and blockdev_add convert their arguments via QObject to
BlockdevOptions for qmp_blockdev_add(), which converts them back to
QObject, then to a flattened QDict. The QDict's members are typed
according to the QAPI schema.
-drive converts its argument via QemuOpts to a (flat) QDict. This
QDict's members are all QString.
Thus, the QType of a flat QDict member depends on whether it comes
from -drive or -blockdev/blockdev_add, except when the QAPI type maps
to QString, which is the case for 'str' and enumeration types.
The block layer core extracts generic configuration from the flat
QDict, and the block driver extracts driver-specific configuration.
Both commonly do so by converting (parts of) the flat QDict to
QemuOpts, which turns all values into strings. Not exactly elegant,
but correct.
However, a few places access the flat QDict directly:
* Most of them access members that are always QString. Correct.
* bdrv_open_inherit() accesses a boolean, carefully. Correct.
* nfs_config() uses a QObject input visitor. Correct only because the
visited type contains nothing but QStrings.
* nbd_config() and ssh_config() use a QObject input visitor, and the
visited types contain non-QStrings: InetSocketAddress members
@numeric, @to, @ipv4, @ipv6. -drive works as long as you don't try
to use them (they're all optional). @to is ignored anyway.
Reproducer:
-drive driver=ssh,server.host=h,server.port=22,server.ipv4,path=p
-drive driver=nbd,server.type=inet,server.data.host=h,server.data.port=22,server.data.ipv4
both fail with "Invalid parameter type for 'data.ipv4', expected: boolean"
Add suitable comments to all these places. Mark the buggy ones FIXME.
"Fortunately", -drive's driver-specific options are entirely
undocumented.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-id: 1490895797-29094-5-git-send-email-armbru@redhat.com
[mreitz: Fixed two typos]
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2017-03-30 20:43:12 +03:00
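The typing mismatch the caution below warns about can be shown with a toy tagged value (this is not QEMU's QObject API, just an illustration): the same flat-dict key arrives as a real boolean from -blockdev but as the string "on" from -drive, so code reading the flat QDict directly must accept both shapes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef enum { TOY_QSTRING, TOY_QBOOL } ToyQType;

typedef struct {
    ToyQType type;
    const char *str;  /* valid when type == TOY_QSTRING */
    bool b;           /* valid when type == TOY_QBOOL */
} ToyQObject;

/* Return 0/1 for a recognized boolean, -1 for unparsable input,
 * accepting both the -blockdev (bool) and -drive (string) shapes. */
static int toy_get_bool(const ToyQObject *v)
{
    if (v->type == TOY_QBOOL) {
        return v->b ? 1 : 0;
    }
    if (strcmp(v->str, "on") == 0 || strcmp(v->str, "true") == 0) {
        return 1;
    }
    if (strcmp(v->str, "off") == 0 || strcmp(v->str, "false") == 0) {
        return 0;
    }
    return -1;
}
```

Code that only ever handles the bool case works with -blockdev and silently misbehaves with -drive, which is precisely the class of bug the FIXME comments in the commit mark.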
|
|
|
/*
|
|
|
|
* Caution: while qdict_get_str() is fine, getting non-string types
|
|
|
|
* would require more care. When @options come from -blockdev or
|
|
|
|
* blockdev_add, its members are typed according to the QAPI
|
|
|
|
* schema, but when they come from -drive, they're all QString.
|
|
|
|
*/
|
2015-06-23 13:45:00 +03:00
|
|
|
const char *filename = qdict_get_str(options, "filename");
|
2016-03-21 18:41:28 +03:00
|
|
|
char bsd_path[MAXPATHLEN] = "";
|
|
|
|
bool error_occurred = false;
|
|
|
|
|
|
|
|
/* If using a real cdrom */
|
|
|
|
if (strcmp(filename, "/dev/cdrom") == 0) {
|
|
|
|
char *mediaType = NULL;
|
|
|
|
kern_return_t ret_val;
|
|
|
|
io_iterator_t mediaIterator = 0;
|
|
|
|
|
|
|
|
mediaType = FindEjectableOpticalMedia(&mediaIterator);
|
|
|
|
if (mediaType == NULL) {
|
|
|
|
error_setg(errp, "Please make sure your CD/DVD is in the optical"
|
|
|
|
" drive");
|
|
|
|
error_occurred = true;
|
|
|
|
goto hdev_open_Mac_error;
|
|
|
|
}
|
2015-06-23 13:45:00 +03:00
|
|
|
|
2016-03-21 18:41:28 +03:00
|
|
|
ret_val = GetBSDPath(mediaIterator, bsd_path, sizeof(bsd_path), flags);
|
|
|
|
if (ret_val != KERN_SUCCESS) {
|
|
|
|
error_setg(errp, "Could not get BSD path for optical drive");
|
|
|
|
error_occurred = true;
|
|
|
|
goto hdev_open_Mac_error;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If a real optical drive was not found */
|
|
|
|
if (bsd_path[0] == '\0') {
|
|
|
|
error_setg(errp, "Failed to obtain bsd path for optical drive");
|
|
|
|
error_occurred = true;
|
|
|
|
goto hdev_open_Mac_error;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If using a cdrom disc and finding a partition on the disc failed */
|
|
|
|
if (strncmp(mediaType, kIOCDMediaClass, 9) == 0 &&
|
|
|
|
setup_cdrom(bsd_path, errp) == false) {
|
|
|
|
print_unmounting_directions(bsd_path);
|
|
|
|
error_occurred = true;
|
|
|
|
goto hdev_open_Mac_error;
|
2006-08-19 15:45:59 +04:00
|
|
|
}
|
2007-09-17 12:09:54 +04:00
|
|
|
|
2017-04-28 00:58:17 +03:00
|
|
|
qdict_put_str(options, "filename", bsd_path);
|
2016-03-21 18:41:28 +03:00
|
|
|
|
|
|
|
hdev_open_Mac_error:
|
|
|
|
g_free(mediaType);
|
|
|
|
if (mediaIterator) {
|
|
|
|
IOObjectRelease(mediaIterator);
|
|
|
|
}
|
|
|
|
if (error_occurred) {
|
|
|
|
return -ENOENT;
|
|
|
|
}
|
2006-08-19 15:45:59 +04:00
|
|
|
}
|
2016-03-21 18:41:28 +03:00
|
|
|
#endif /* defined(__APPLE__) && defined(__MACH__) */
|
2006-08-19 15:45:59 +04:00
|
|
|
|
|
|
|
s->type = FTYPE_FILE;
|
2009-06-15 15:53:38 +04:00
|
|
|
|
error: Eliminate error_propagate() with Coccinelle, part 1
When all we do with an Error we receive into a local variable is
propagating to somewhere else, we can just as well receive it there
right away. Convert
if (!foo(..., &err)) {
...
error_propagate(errp, err);
...
return ...
}
to
if (!foo(..., errp)) {
...
...
return ...
}
where nothing else needs @err. Coccinelle script:
@rule1 forall@
identifier fun, err, errp, lbl;
expression list args, args2;
binary operator op;
constant c1, c2;
symbol false;
@@
if (
(
- fun(args, &err, args2)
+ fun(args, errp, args2)
|
- !fun(args, &err, args2)
+ !fun(args, errp, args2)
|
- fun(args, &err, args2) op c1
+ fun(args, errp, args2) op c1
)
)
{
... when != err
when != lbl:
when strict
- error_propagate(errp, err);
... when != err
(
return;
|
return c2;
|
return false;
)
}
@rule2 forall@
identifier fun, err, errp, lbl;
expression list args, args2;
expression var;
binary operator op;
constant c1, c2;
symbol false;
@@
- var = fun(args, &err, args2);
+ var = fun(args, errp, args2);
... when != err
if (
(
var
|
!var
|
var op c1
)
)
{
... when != err
when != lbl:
when strict
- error_propagate(errp, err);
... when != err
(
return;
|
return c2;
|
return false;
|
return var;
)
}
@depends on rule1 || rule2@
identifier err;
@@
- Error *err = NULL;
... when != err
Not exactly elegant, I'm afraid.
The "when != lbl:" is necessary to avoid transforming
if (fun(args, &err)) {
goto out
}
...
out:
error_propagate(errp, err);
even though other paths to label out still need the error_propagate().
For an actual example, see sclp_realize().
Without the "when strict", Coccinelle transforms vfio_msix_setup(),
incorrectly. I don't know what exactly "when strict" does, only that
it helps here.
The match of return is narrower than what I want, but I can't figure
out how to express "return where the operand doesn't use @err". For
an example where it's too narrow, see vfio_intx_enable().
Silently fails to convert hw/arm/armsse.c, because Coccinelle gets
confused by ARMSSE being used both as typedef and function-like macro
there. Converted manually.
Line breaks tidied up manually. One nested declaration of @local_err
deleted manually. Preexisting unwanted blank line dropped in
hw/riscv/sifive_e.c.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200707160613.848843-35-armbru@redhat.com>
    ret = raw_open_common(bs, options, flags, 0, true, errp);
    if (ret < 0) {
#if defined(__APPLE__) && defined(__MACH__)
        if (*bsd_path) {
            filename = bsd_path;
        }
        /* if a physical device experienced an error while being opened */
        if (strncmp(filename, "/dev/", 5) == 0) {
            print_unmounting_directions(filename);
        }
#endif /* defined(__APPLE__) && defined(__MACH__) */
        return ret;
    }

    /* Since this does ioctl the device must be already opened */
    bs->sg = hdev_is_sg(bs);

    return ret;
}

#if defined(__linux__)
static int coroutine_fn
hdev_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
{
    BDRVRawState *s = bs->opaque;
    RawPosixAIOData acb;
    int ret;

    ret = fd_open(bs);
    if (ret < 0) {
        return ret;
    }

scsi, file-posix: add support for persistent reservation management
It is a common requirement for virtual machine to send persistent
reservations, but this currently requires either running QEMU with
CAP_SYS_RAWIO, or using out-of-tree patches that let an unprivileged
QEMU bypass Linux's filter on SG_IO commands.
As an alternative mechanism, the next patches will introduce a
privileged helper to run persistent reservation commands without
expanding QEMU's attack surface unnecessarily.
The helper is invoked through a "pr-manager" QOM object, to which
file-posix.c passes SG_IO requests for PERSISTENT RESERVE OUT and
PERSISTENT RESERVE IN commands. For example:
$ qemu-system-x86_64
-device virtio-scsi \
-object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock
-drive if=none,id=hd,driver=raw,file.filename=/dev/sdb,file.pr-manager=helper0
-device scsi-block,drive=hd
or:
$ qemu-system-x86_64
-device virtio-scsi \
-object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock
-blockdev node-name=hd,driver=raw,file.driver=host_device,file.filename=/dev/sdb,file.pr-manager=helper0
-device scsi-block,drive=hd
Multiple pr-manager implementations are conceivable and possible, though
only one is implemented right now. For example, a pr-manager could:
- talk directly to the multipath daemon from a privileged QEMU
(i.e. QEMU links to libmpathpersist); this makes reservation work
properly with multipath, but still requires CAP_SYS_RAWIO
- use the Linux IOC_PR_* ioctls (they require CAP_SYS_ADMIN though)
- more interestingly, implement reservations directly in QEMU
through file system locks or a shared database (e.g. sqlite)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    if (req == SG_IO && s->pr_mgr) {
        struct sg_io_hdr *io_hdr = buf;
        if (io_hdr->cmdp[0] == PERSISTENT_RESERVE_OUT ||
            io_hdr->cmdp[0] == PERSISTENT_RESERVE_IN) {
            return pr_manager_execute(s->pr_mgr, bdrv_get_aio_context(bs),
                                      s->fd, io_hdr);
        }
    }

    acb = (RawPosixAIOData) {
        .bs         = bs,
        .aio_type   = QEMU_AIO_IOCTL,
        .aio_fildes = s->fd,
        .aio_offset = 0,
        .ioctl      = {
            .buf    = buf,
            .cmd    = req,
        },
    };

    return raw_thread_pool_submit(bs, handle_aiocb_ioctl, &acb);
}
#endif /* linux */

static coroutine_fn int
block: use int64_t instead of int in driver discard handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert driver discard handlers bytes parameter to int64_t.
The only caller of all updated function is bdrv_co_pdiscard in
block/io.c. It is already prepared to work with 64bit requests, but
pass at most max(bs->bl.max_pdiscard, INT_MAX) to the driver.
Let's look at all updated functions:
blkdebug: all calculations are still OK, thanks to
bdrv_check_qiov_request().
both rule_check and bdrv_co_pdiscard are 64bit
blklogwrites: pass to blk_loc_writes_co_log which is 64bit
blkreplay, copy-on-read, filter-compress: pass to bdrv_co_pdiscard, OK
copy-before-write: pass to bdrv_co_pdiscard which is 64bit and to
cbw_do_copy_before_write which is 64bit
file-posix: one handler calls raw_account_discard() is 64bit and both
handlers calls raw_do_pdiscard(). Update raw_do_pdiscard, which pass
to RawPosixAIOData::aio_nbytes, which is 64bit (and calls
raw_account_discard())
gluster: somehow, third argument of glfs_discard_async is size_t.
Let's set max_pdiscard accordingly.
iscsi: iscsi_allocmap_set_invalid is 64bit,
!is_byte_request_lun_aligned is 64bit.
list.num is uint32_t. Let's clarify max_pdiscard and
pdiscard_alignment.
mirror_top: pass to bdrv_mirror_top_do_write() which is
64bit
nbd: protocol limitation. max_pdiscard is alredy set strict enough,
keep it as is for now.
nvme: buf.nlb is uint32_t and we do shift. So, add corresponding limits
to nvme_refresh_limits().
preallocate: pass to bdrv_co_pdiscard() which is 64bit.
rbd: pass to qemu_rbd_start_co() which is 64bit.
qcow2: calculations are still OK, thanks to bdrv_check_qiov_request(),
qcow2_cluster_discard() is 64bit.
raw-format: raw_adjust_offset() is 64bit, bdrv_co_pdiscard too.
throttle: pass to bdrv_co_pdiscard() which is 64bit and to
throttle_group_co_io_limits_intercept() which is 64bit as well.
test-block-iothread: bytes argument is unused
Great! Now all drivers are prepared to handle 64bit discard requests,
or else have explicit max_pdiscard limits.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-11-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
hdev_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
{
    BDRVRawState *s = bs->opaque;
    int ret;

    ret = fd_open(bs);
    if (ret < 0) {
        raw_account_discard(s, bytes, ret);
        return ret;
    }
    return raw_do_pdiscard(bs, offset, bytes, true);
}

static coroutine_fn int hdev_co_pwrite_zeroes(BlockDriverState *bs,
block: use int64_t instead of int in driver write_zeroes handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert driver write_zeroes handlers bytes parameter to int64_t.
The only caller of all updated function is bdrv_co_do_pwrite_zeroes().
bdrv_co_do_pwrite_zeroes() itself is of course OK with widening of
callee parameter type. Also, bdrv_co_do_pwrite_zeroes()'s
max_write_zeroes is limited to INT_MAX. So, updated functions all are
safe, they will not get "bytes" larger than before.
Still, let's look through all updated functions, and add assertions to
the ones which are actually unprepared to values larger than INT_MAX.
For these drivers also set explicit max_pwrite_zeroes limit.
Let's go:
blkdebug: calculations can't overflow, thanks to
bdrv_check_qiov_request() in generic layer. rule_check() and
bdrv_co_pwrite_zeroes() both have 64bit argument.
blklogwrites: pass to blk_log_writes_co_log() with 64bit argument.
blkreplay, copy-on-read, filter-compress: pass to
bdrv_co_pwrite_zeroes() which is OK
copy-before-write: Calls cbw_do_copy_before_write() and
bdrv_co_pwrite_zeroes, both have 64bit argument.
file-posix: both handler calls raw_do_pwrite_zeroes, which is updated.
In raw_do_pwrite_zeroes() calculations are OK due to
bdrv_check_qiov_request(), bytes go to RawPosixAIOData::aio_nbytes
which is uint64_t.
Check also where that uint64_t gets handed:
handle_aiocb_write_zeroes_block() passes a uint64_t[2] to
ioctl(BLKZEROOUT), handle_aiocb_write_zeroes() calls do_fallocate()
which takes off_t (and we compile to always have 64-bit off_t), as
does handle_aiocb_write_zeroes_unmap. All look safe.
gluster: bytes go to GlusterAIOCB::size which is int64_t and to
glfs_zerofill_async works with off_t.
iscsi: Aha, here we deal with iscsi_writesame16_task() that has
uint32_t num_blocks argument and iscsi_writesame16_task() has
uint16_t argument. Make comments, add assertions and clarify
max_pwrite_zeroes calculation.
iscsi_allocmap_() functions already has int64_t argument
is_byte_request_lun_aligned is simple to update, do it.
mirror_top: pass to bdrv_mirror_top_do_write which has uint64_t
argument
nbd: Aha, here we have protocol limitation, and NBDRequest::len is
uint32_t. max_pwrite_zeroes is cleanly set to 32bit value, so we are
OK for now.
nvme: Again, protocol limitation. And no inherent limit for
write-zeroes at all. But from code that calculates cdw12 it's obvious
that we do have limit and alignment. Let's clarify it. Also,
obviously the code is not prepared to handle bytes=0. Let's handle
this case too.
trace events already 64bit
preallocate: pass to handle_write() and bdrv_co_pwrite_zeroes(), both
64bit.
rbd: pass to qemu_rbd_start_co() which is 64bit.
qcow2: offset + bytes and alignment still works good (thanks to
bdrv_check_qiov_request()), so tail calculation is OK
qcow2_subcluster_zeroize() has 64bit argument, should be OK
trace events updated
qed: qed_co_request wants int nb_sectors. Also in code we have size_t
used for request length which may be 32bit. So, let's just keep
INT_MAX as a limit (aligning it down to pwrite_zeroes_alignment) and
don't care.
raw-format: Is OK. raw_adjust_offset and bdrv_co_pwrite_zeroes are both
64bit.
throttle: Both throttle_group_co_io_limits_intercept() and
bdrv_co_pwrite_zeroes() are 64bit.
vmdk: pass to vmdk_pwritev which is 64bit
quorum: pass to quorum_co_pwritev() which is 64bit
Hooray!
At this point all block drivers are prepared to support 64bit
write-zero requests, or have explicitly set max_pwrite_zeroes.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-8-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: use <= rather than < in assertions relying on max_pwrite_zeroes]
Signed-off-by: Eric Blake <eblake@redhat.com>
    int64_t offset, int64_t bytes, BdrvRequestFlags flags)
{
    int rc;

    rc = fd_open(bs);
    if (rc < 0) {
        return rc;
    }

    return raw_do_pwrite_zeroes(bs, offset, bytes, flags, true);
}

static BlockDriver bdrv_host_device = {
    .format_name         = "host_device",
    .protocol_name       = "host_device",
    .instance_size       = sizeof(BDRVRawState),
    .bdrv_needs_filename = true,
    .bdrv_probe_device   = hdev_probe_device,
    .bdrv_parse_filename = hdev_parse_filename,
    .bdrv_file_open      = hdev_open,
    .bdrv_close          = raw_close,
    .bdrv_reopen_prepare = raw_reopen_prepare,
    .bdrv_reopen_commit  = raw_reopen_commit,
    .bdrv_reopen_abort   = raw_reopen_abort,
    .bdrv_co_create_opts = bdrv_co_create_opts_simple,
    .create_opts         = &bdrv_create_opts_simple,
    .mutable_opts        = mutable_opts,
    .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
    .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes,

    .bdrv_co_preadv          = raw_co_preadv,
    .bdrv_co_pwritev         = raw_co_pwritev,
    .bdrv_co_flush_to_disk   = raw_co_flush_to_disk,
    .bdrv_co_pdiscard        = hdev_co_pdiscard,
    .bdrv_co_copy_range_from = raw_co_copy_range_from,
    .bdrv_co_copy_range_to   = raw_co_copy_range_to,
    .bdrv_refresh_limits     = raw_refresh_limits,
    .bdrv_io_plug            = raw_aio_plug,
    .bdrv_io_unplug          = raw_aio_unplug,
    .bdrv_attach_aio_context = raw_aio_attach_aio_context,

block: Convert .bdrv_truncate callback to coroutine_fn
bdrv_truncate() is an operation that can block (even for a quite long
time, depending on the PreallocMode) in I/O paths that shouldn't block.
Convert it to a coroutine_fn so that we have the infrastructure for
drivers to make their .bdrv_co_truncate implementation asynchronous.
This change could potentially introduce new race conditions because
bdrv_truncate() isn't necessarily executed atomically any more. Whether
this is a problem needs to be evaluated for each block driver that
supports truncate:
* file-posix/win32, gluster, iscsi, nfs, rbd, ssh, sheepdog: The
protocol drivers are trivially safe because they don't actually yield
yet, so there is no change in behaviour.
* copy-on-read, crypto, raw-format: Essentially just filter drivers that
pass the request to a child node, no problem.
* qcow2: The implementation modifies metadata, so it needs to hold
s->lock to be safe with concurrent I/O requests. In order to avoid
double locking, this requires pulling the locking out into
preallocate_co() and using qcow2_write_caches() instead of
bdrv_flush().
* qed: Does a single header update, this is fine without locking.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    .bdrv_co_truncate = raw_co_truncate,
    .bdrv_getlength   = raw_getlength,
    .bdrv_get_info    = raw_get_info,
    .bdrv_get_allocated_file_size
                      = raw_get_allocated_file_size,
    .bdrv_get_specific_stats = hdev_get_specific_stats,
    .bdrv_check_perm = raw_check_perm,
    .bdrv_set_perm   = raw_set_perm,
    .bdrv_abort_perm_update = raw_abort_perm_update,
    .bdrv_probe_blocksizes = hdev_probe_blocksizes,
    .bdrv_probe_geometry = hdev_probe_geometry,

    /* generic scsi device */
#ifdef __linux__
    .bdrv_co_ioctl = hdev_co_ioctl,
#endif
};

#if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
static void cdrom_parse_filename(const char *filename, QDict *options,
                                 Error **errp)
{
    bdrv_parse_filename_strip_prefix(filename, "host_cdrom:", options);
}
#endif

#ifdef __linux__
static int cdrom_open(BlockDriverState *bs, QDict *options, int flags,
                      Error **errp)
{
    BDRVRawState *s = bs->opaque;

    s->type = FTYPE_CD;

    /* open will not fail even if no CD is inserted, so add O_NONBLOCK */
    return raw_open_common(bs, options, flags, O_NONBLOCK, true, errp);
}

static int cdrom_probe_device(const char *filename)
{
    int fd, ret;
    int prio = 0;
    struct stat st;

    fd = qemu_open(filename, O_RDONLY | O_NONBLOCK, NULL);
    if (fd < 0) {
        goto out;
    }
    ret = fstat(fd, &st);
    if (ret == -1 || !S_ISBLK(st.st_mode)) {
        goto outc;
    }

    /* Attempt to detect via a CDROM specific ioctl */
    ret = ioctl(fd, CDROM_DRIVE_STATUS, CDSL_CURRENT);
    if (ret >= 0)
        prio = 100;

outc:
    qemu_close(fd);
out:
    return prio;
}

static bool cdrom_is_inserted(BlockDriverState *bs)
{
    BDRVRawState *s = bs->opaque;
    int ret;

    ret = ioctl(s->fd, CDROM_DRIVE_STATUS, CDSL_CURRENT);
    return ret == CDS_DISC_OK;
}

static void cdrom_eject(BlockDriverState *bs, bool eject_flag)
{
    BDRVRawState *s = bs->opaque;

    if (eject_flag) {
        if (ioctl(s->fd, CDROMEJECT, NULL) < 0)
            perror("CDROMEJECT");
    } else {
        if (ioctl(s->fd, CDROMCLOSETRAY, NULL) < 0)
            perror("CDROMEJECT");
    }
}

static void cdrom_lock_medium(BlockDriverState *bs, bool locked)
{
    BDRVRawState *s = bs->opaque;

    if (ioctl(s->fd, CDROM_LOCKDOOR, locked) < 0) {
        /*
         * Note: an error can happen if the distribution automatically
         * mounts the CD-ROM
         */
        /* perror("CDROM_LOCKDOOR"); */
    }
}

static BlockDriver bdrv_host_cdrom = {
    .format_name         = "host_cdrom",
    .protocol_name       = "host_cdrom",
    .instance_size       = sizeof(BDRVRawState),
    .bdrv_needs_filename = true,
    .bdrv_probe_device   = cdrom_probe_device,
    .bdrv_parse_filename = cdrom_parse_filename,
    .bdrv_file_open      = cdrom_open,
    .bdrv_close          = raw_close,
    .bdrv_reopen_prepare = raw_reopen_prepare,
    .bdrv_reopen_commit  = raw_reopen_commit,
    .bdrv_reopen_abort   = raw_reopen_abort,
    .bdrv_co_create_opts = bdrv_co_create_opts_simple,
    .create_opts         = &bdrv_create_opts_simple,
    .mutable_opts        = mutable_opts,
    .bdrv_co_invalidate_cache = raw_co_invalidate_cache,

    .bdrv_co_preadv          = raw_co_preadv,
    .bdrv_co_pwritev         = raw_co_pwritev,
    .bdrv_co_flush_to_disk   = raw_co_flush_to_disk,
    .bdrv_refresh_limits     = raw_refresh_limits,
    .bdrv_io_plug            = raw_aio_plug,
    .bdrv_io_unplug          = raw_aio_unplug,
    .bdrv_attach_aio_context = raw_aio_attach_aio_context,
.bdrv_co_truncate = raw_co_truncate,
|
block: Avoid unecessary drv->bdrv_getlength() calls
The block layer generally keeps the size of an image cached in
bs->total_sectors so that it doesn't have to perform expensive
operations to get the size whenever it needs it.
This doesn't work however when using a backend that can change its size
without qemu being aware of it, i.e. passthrough of removable media like
CD-ROMs or floppy disks. For this reason, the caching is disabled when a
removable device is used.
It is obvious that checking whether the _guest_ device has removable
media isn't the right thing to do when we want to know whether the size
of the host backend can change. To make things worse, non-top-level
BlockDriverStates never have any device attached, which makes qemu
assume they are removable, so drv->bdrv_getlength() is always called on
the protocol layer. In the case of raw-posix, this causes unnecessary
lseek() system calls, which turned out to be rather expensive.
This patch completely changes the logic and disables bs->total_sectors
caching only for certain block driver types, for which a size change is
expected: host_cdrom and host_floppy on POSIX, host_device on win32; also
the raw format in case it sits on top of one of these protocols, but in
the common case the nested bdrv_getlength() call on the protocol driver
will use the cache again and avoid an expensive drv->bdrv_getlength()
call.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    .bdrv_getlength      = raw_getlength,
    .has_variable_length = true,
    .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,

    /* removable device support */
    .bdrv_is_inserted   = cdrom_is_inserted,
    .bdrv_eject         = cdrom_eject,
    .bdrv_lock_medium   = cdrom_lock_medium,

    /* generic scsi device */
    .bdrv_co_ioctl      = hdev_co_ioctl,
};
#endif /* __linux__ */

#if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
static int cdrom_open(BlockDriverState *bs, QDict *options, int flags,
                      Error **errp)
{
    BDRVRawState *s = bs->opaque;
    int ret;

    s->type = FTYPE_CD;

    ret = raw_open_common(bs, options, flags, 0, true, errp);
    if (ret) {
        return ret;
    }

    /* make sure the door isn't locked at this time */
    ioctl(s->fd, CDIOCALLOW);
    return 0;
}

static int cdrom_probe_device(const char *filename)
{
    if (strstart(filename, "/dev/cd", NULL) ||
        strstart(filename, "/dev/acd", NULL))
        return 100;
    return 0;
}

static int cdrom_reopen(BlockDriverState *bs)
{
    BDRVRawState *s = bs->opaque;
    int fd;

    /*
     * Force reread of possibly changed/newly loaded disc,
     * FreeBSD seems to not notice sometimes...
     */
    if (s->fd >= 0)
        qemu_close(s->fd);
    fd = qemu_open(bs->filename, s->open_flags, NULL);
    if (fd < 0) {
        s->fd = -1;
        return -EIO;
    }
    s->fd = fd;

    /* make sure the door isn't locked at this time */
    ioctl(s->fd, CDIOCALLOW);
    return 0;
}

static bool cdrom_is_inserted(BlockDriverState *bs)
{
    return raw_getlength(bs) > 0;
}

static void cdrom_eject(BlockDriverState *bs, bool eject_flag)
{
    BDRVRawState *s = bs->opaque;

    if (s->fd < 0)
        return;

    (void) ioctl(s->fd, CDIOCALLOW);

    if (eject_flag) {
        if (ioctl(s->fd, CDIOCEJECT) < 0)
            perror("CDIOCEJECT");
    } else {
        if (ioctl(s->fd, CDIOCCLOSE) < 0)
            perror("CDIOCCLOSE");
    }

    cdrom_reopen(bs);
}

static void cdrom_lock_medium(BlockDriverState *bs, bool locked)
{
    BDRVRawState *s = bs->opaque;

    if (s->fd < 0)
        return;
    if (ioctl(s->fd, (locked ? CDIOCPREVENT : CDIOCALLOW)) < 0) {
        /*
         * Note: an error can happen if the distribution automatically
         * mounts the CD-ROM
         */
        /* perror("CDROM_LOCKDOOR"); */
    }
}

static BlockDriver bdrv_host_cdrom = {
    .format_name         = "host_cdrom",
    .protocol_name       = "host_cdrom",
    .instance_size       = sizeof(BDRVRawState),
    .bdrv_needs_filename = true,
    .bdrv_probe_device   = cdrom_probe_device,
    .bdrv_parse_filename = cdrom_parse_filename,
    .bdrv_file_open      = cdrom_open,
    .bdrv_close          = raw_close,
    .bdrv_reopen_prepare = raw_reopen_prepare,
    .bdrv_reopen_commit  = raw_reopen_commit,
    .bdrv_reopen_abort   = raw_reopen_abort,
    .bdrv_co_create_opts = bdrv_co_create_opts_simple,
    .create_opts         = &bdrv_create_opts_simple,
    .mutable_opts        = mutable_opts,

    .bdrv_co_preadv          = raw_co_preadv,
    .bdrv_co_pwritev         = raw_co_pwritev,
    .bdrv_co_flush_to_disk   = raw_co_flush_to_disk,
    .bdrv_refresh_limits     = raw_refresh_limits,
    .bdrv_io_plug            = raw_aio_plug,
    .bdrv_io_unplug          = raw_aio_unplug,
    .bdrv_attach_aio_context = raw_aio_attach_aio_context,

block: Convert .bdrv_truncate callback to coroutine_fn
bdrv_truncate() is an operation that can block (even for a quite long
time, depending on the PreallocMode) in I/O paths that shouldn't block.
Convert it to a coroutine_fn so that we have the infrastructure for
drivers to make their .bdrv_co_truncate implementation asynchronous.
This change could potentially introduce new race conditions because
bdrv_truncate() isn't necessarily executed atomically any more. Whether
this is a problem needs to be evaluated for each block driver that
supports truncate:
* file-posix/win32, gluster, iscsi, nfs, rbd, ssh, sheepdog: The
protocol drivers are trivially safe because they don't actually yield
yet, so there is no change in behaviour.
* copy-on-read, crypto, raw-format: Essentially just filter drivers that
pass the request to a child node, no problem.
* qcow2: The implementation modifies metadata, so it needs to hold
s->lock to be safe with concurrent I/O requests. In order to avoid
double locking, this requires pulling the locking out into
preallocate_co() and using qcow2_write_caches() instead of
bdrv_flush().
* qed: Does a single header update, this is fine without locking.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-21 18:54:35 +03:00
    .bdrv_co_truncate = raw_co_truncate,

block: Avoid unnecessary drv->bdrv_getlength() calls
The block layer generally keeps the size of an image cached in
bs->total_sectors so that it doesn't have to perform expensive
operations to get the size whenever it needs it.
This doesn't work however when using a backend that can change its size
without qemu being aware of it, i.e. passthrough of removable media like
CD-ROMs or floppy disks. For this reason, the caching is disabled when a
removable device is used.
It is obvious that checking whether the _guest_ device has removable
media isn't the right thing to do when we want to know whether the size
of the host backend can change. To make things worse, non-top-level
BlockDriverStates never have any device attached, which makes qemu
assume they are removable, so drv->bdrv_getlength() is always called on
the protocol layer. In the case of raw-posix, this causes unnecessary
lseek() system calls, which turned out to be rather expensive.
This patch completely changes the logic and disables bs->total_sectors
caching only for certain block driver types, for which a size change is
expected: host_cdrom and host_floppy on POSIX, host_device on win32; also
the raw format in case it sits on top of one of these protocols, but in
the common case the nested bdrv_getlength() call on the protocol driver
will use the cache again and avoid an expensive drv->bdrv_getlength()
call.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-29 15:18:58 +04:00
    .bdrv_getlength      = raw_getlength,
    .has_variable_length = true,
    .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,

    /* removable device support */
    .bdrv_is_inserted = cdrom_is_inserted,
    .bdrv_eject       = cdrom_eject,
    .bdrv_lock_medium = cdrom_lock_medium,
};

#endif /* __FreeBSD__ */

#endif /* HAVE_HOST_BLOCK_DEVICE */

static void bdrv_file_init(void)
{
    /*
     * Register all the drivers.  Note that order is important, the driver
     * registered last will get probed first.
     */
    bdrv_register(&bdrv_file);
#if defined(HAVE_HOST_BLOCK_DEVICE)
    bdrv_register(&bdrv_host_device);
#ifdef __linux__
    bdrv_register(&bdrv_host_cdrom);
#endif
#if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
    bdrv_register(&bdrv_host_cdrom);
#endif
#endif /* HAVE_HOST_BLOCK_DEVICE */
}

block_init(bdrv_file_init);