This simplifies following commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201203222713.13507-3-vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We should not set overlap_bytes:
1. Don't worry: it is calculated by bdrv_mark_request_serialising() and
will be equal to or greater than bytes anyway.
2. If the request was already aligned up to some greater alignment,
than we may break things: we reduce overlap_bytes, and further
bdrv_mark_request_serialising() may not help, as it will not restore
old bigger alignment.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201203222713.13507-2-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The scenario is that when accessing a volume on an NFS filesystem
without supporting the file lock, Qemu will complain "Failed to lock
byte 100", even when setting the file.locking = off.
We should do file lock related operations only when the file.locking is
enabled, otherwise, the syscall of 'fcntl' will return non-zero.
Signed-off-by: Li Feng <fengli@smartx.com>
Message-Id: <1607341446-85506-1-git-send-email-fengli@smartx.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is a relatively new feature in libfuse (available since 3.8.0,
which was released in November 2019), so we have to add a dedicated
check whether it is available before making use of it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201027190600.192171-7-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This allows allocating areas after the (old) EOF as part of a growing
resize, writing zeroes, and discarding.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201027190600.192171-6-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These will behave more like normal files in that writes beyond the EOF
will automatically grow the export size.
As an optimization, keep the RESIZE permission for growable exports so
we do not have to take it for every post-EOF write. (This permission is
not released when the export is destroyed, because at that point the
BlockBackend is destroyed altogether anyway.)
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201027190600.192171-5-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This makes the export actually useful instead of only producing errors
whenever it is accessed.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201027190600.192171-4-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block-export-add type=fuse allows mounting block graph nodes via FUSE on
some existing regular file. That file should then appears like a raw
disk image, and accesses to it result in accesses to the exported BDS.
Right now, we only implement the necessary block export functions to set
it up and shut it down. We do not implement any access functions, so
accessing the mount point only results in errors. This will be
addressed by a followup patch.
We keep a hash table of exported mount points, because we want to be
able to detect when users try to use a mount point twice. This is
because we invoke stat() to check whether the given mount point is a
regular file, but if that file is served by ourselves (because it is
already used as a mount point), then this stat() would have to be served
by ourselves, too, which is impossible to do while we (as the caller)
are waiting for it to settle. Therefore, keep track of mount point
paths to at least catch the most obvious instances of that problem.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201027190600.192171-3-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace manual lock()/unlock() calls with lock guard macros
(QEMU_LOCK_GUARD/WITH_QEMU_LOCK_GUARD) in block/iscsi.
Signed-off-by: Gan Qixin <ganqixin@huawei.com>
Message-Id: <20201203075055.127773-5-ganqixin@huawei.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace manual lock()/unlock() calls with lock guard macros
(QEMU_LOCK_GUARD/WITH_QEMU_LOCK_GUARD) in block/throttle-groups.
Signed-off-by: Gan Qixin <ganqixin@huawei.com>
Message-Id: <20201203075055.127773-4-ganqixin@huawei.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace manual lock()/unlock() calls with lock guard macros
(QEMU_LOCK_GUARD/WITH_QEMU_LOCK_GUARD) in block/curl.
Signed-off-by: Gan Qixin <ganqixin@huawei.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20201203075055.127773-3-ganqixin@huawei.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace manual lock()/unlock() calls with lock guard macros
(QEMU_LOCK_GUARD/WITH_QEMU_LOCK_GUARD) in block/accounting.
Signed-off-by: Gan Qixin <ganqixin@huawei.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20201203075055.127773-2-ganqixin@huawei.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Change to "expects a THING" where that's an obvious improvement
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20201113082626.2725812-11-armbru@redhat.com>
Do not validate input with g_return_val_if(). This API is intended for
checking programming errors and is compiled out with -DG_DISABLE_CHECKS.
Use an explicit if statement for input validation so it cannot
accidentally be compiled out.
Suggested-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201118091644.199527-5-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
By making libvhost-user a subproject, check it builds
standalone (without the global QEMU cflags etc).
Note that the library still relies on QEMU include/qemu/atomic.h and
linux_headers/.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20201125100640.366523-6-marcandre.lureau@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Commit 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
introduced a subtle change to code in zero_in_l2_slice:
It swapped the order of
1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
2. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
3. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
To
1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
2. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
3. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
It seems harmless, however the call to qcow2_free_any_clusters can
trigger a cache flush which can mark the L2 table as clean, and
assuming that this was the last write to it, a stale version of it
will remain on the disk.
Now we have a valid L2 entry pointing to a freed cluster. Oops.
Fixes: 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[ kwolf: Fixed to restore the correct original order from before
205fa50750; added comments like in discard_in_l2_slice(). ]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20201124092815.39056-1-kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
- quorum: Fix crash with rewrite-corrupted and without read-write user
- io_uring: do not use pointer after free
- file-posix: Use fallback path for -EBUSY from FALLOC_FL_PUNCH_HOLE
- iotests: Fix failure on Python 3.9 due to use of a deprecated function
- char-stdio: Fix QMP default for 'signal'
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAl+zt1URHGt3b2xmQHJl
ZGhhdC5jb20ACgkQfwmycsiPL9ZxyxAAu8GIOeAb7atQvc+KpeBTUG4A+tfAXkC+
iUYdIpFeWWgmGf7myu3nlaAkeTDk6qHalmzkGRHi3yhX4eNIh5Sdff1YwPcZwf+q
GLIqFFTW0z1Bd36N8G7Mkf04nKX4QTHqp6THHtSt9jNs56h5OP3axPXVA/3v9y8B
4ZAkOOvwnwO+U94crhy5y5pX/Vwafv/Dz4DH9hEupE+EI9AuzjZLBrS+sgkxjhmu
gvHpDSqm6NXwWQA5a24J6NzCy3n/Fw/rqmnoOrN8eRz+4DSCMVDnTDDEMFLa/UoK
Ci7AqWfG/MnQ4GrGsOx80KJhAFLTmI60vfnUizKtEjL/HJyK5PDyM+VxHz+P/Tkq
4hqQsHEsll4mAQiKCrrKOOXhn+YC4DhY/5O1EzEfhqfUjI+BFE9iC7LuqQevwKPL
gytup7eoZjIHMtnKwY1B2ApAqHtodswjHkefcjEcvSlhqGi/BvwuWmeYlFXmA3r0
YO8fvbYJrwHwJy7CzMb5Rgs2461QGERmXoCsBxLAiqXU9rhpOZ6gKXIjjlYojZ8M
W0kqbaccTRPuhooFdEQ9RTPSkX7AX2bI0nOoPxfz3YD/siw35YwnUkJqvQbckvJd
vpPkCL5jt3d9sfO0z1xjSH2ey9bevSReYpCsk+kIZl7V2XoDAW0Nbi0Td3pW4j6x
dEkg/+sjF+o=
=0pFF
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Patches for 5.2.0-rc2:
- quorum: Fix crash with rewrite-corrupted and without read-write user
- io_uring: do not use pointer after free
- file-posix: Use fallback path for -EBUSY from FALLOC_FL_PUNCH_HOLE
- iotests: Fix failure on Python 3.9 due to use of a deprecated function
- char-stdio: Fix QMP default for 'signal'
# gpg: Signature made Tue 17 Nov 2020 11:43:17 GMT
# gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg: issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream:
iotests/081: Test rewrite-corrupted without WRITE
iotests/081: Filter image format after testdir
quorum: Require WRITE perm with rewrite-corrupted
io_uring: do not use pointer after free
file-posix: allow -EBUSY errors during write zeros on raw block devices
iotests: Replace deprecated ConfigParser.readfp()
char-stdio: Fix QMP default for 'signal'
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Using rewrite-corrupted means quorum may issue writes to its children
just from receiving read requests from its parents. Thus, it must take
the WRITE permission when rewrite-corrupted is used.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201113211718.261671-2-mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Even though only the pointer value is only printed, it is untidy
and Coverity complains.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20201113154102.1460459-1-pbonzini@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On Linux, fallocate(fd, FALLOC_FL_PUNCH_HOLE) when it is used on a block device,
without O_DIRECT can return -EBUSY if it races with another write to the same page.
Since this is rare and discard is not a critical operation, ignore this error
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201111153913.41840-2-mlevitsk@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no "version 2" of the "Lesser" General Public License.
It is either "GPL version 2.0" or "Lesser GPL version 2.1".
This patch replaces all occurrences of "Lesser GPL version 2" with
"Lesser GPL version 2.1" in comment section.
This patch contains all the files, whose maintainer I could not get
from ‘get_maintainer.pl’ script.
Signed-off-by: Chetan Pant <chetan4windows@gmail.com>
Message-Id: <20201023124424.20177-1-chetan4windows@gmail.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
[thuth: Adapted exec.c and qdev-monitor.c to new location]
Signed-off-by: Thomas Huth <thuth@redhat.com>
The --enable/disable-vhost-user-blk-server options were implemented in
./configure. There has been confusion about them and part of the problem
is that the shell syntax used for setting the default value is not easy
to read. Move the option over to meson where the conditions are easier
to understand:
have_vhost_user_blk_server = (targetos == 'linux')
if get_option('vhost_user_blk_server').enabled()
if targetos != 'linux'
error('vhost_user_blk_server requires linux')
endif
elif get_option('vhost_user_blk_server').disabled() or not have_system
have_vhost_user_blk_server = false
endif
This patch does not change behavior.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201110171121.1265142-2-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
There have some code style problems be found when read the block driver code.
So I fixes some problems of this error, ERROR: "foo* bar" should be "foo *bar".
Signed-off-by: Liyang Shi <shiliyang@huawei.com>
Reported-by: Euler Robot <euler.robot@huawei.com>
Message-Id: <3211f389-6d22-46c1-4a16-e6a2ba66f070@huawei.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
These compiling errors are fixed:
../block/nfs.c:27:10: fatal error: poll.h: No such file or directory
27 | #include <poll.h>
| ^~~~~~~~
compilation terminated.
../block/nfs.c:63:5: error: unknown type name 'blkcnt_t'
63 | blkcnt_t st_blocks;
| ^~~~~~~~
../block/nfs.c: In function 'nfs_client_open':
../block/nfs.c:550:27: error: 'struct _stat64' has no member named 'st_blocks'
550 | client->st_blocks = st.st_blocks;
| ^
../block/nfs.c: In function 'nfs_get_allocated_file_size':
../block/nfs.c:751:41: error: 'struct _stat64' has no member named 'st_blocks'
751 | return (task.ret < 0 ? task.ret : st.st_blocks * 512);
| ^
../block/nfs.c: In function 'nfs_reopen_prepare':
../block/nfs.c:805:31: error: 'struct _stat64' has no member named 'st_blocks'
805 | client->st_blocks = st.st_blocks;
| ^
../block/nfs.c: In function 'nfs_get_allocated_file_size':
../block/nfs.c:752:1: error: control reaches end of non-void function [-Werror=return-type]
752 | }
| ^
On msys2/mingw, there is no st_blocks in struct _stat64 yet, we disable the usage of it
on msys2/mingw, and create a typedef long long blkcnt_t; for further implementation
Signed-off-by: Yonggang Luo <luoyonggang@gmail.com>
Message-Id: <20201105123116.674-2-luoyonggang@gmail.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The QCowL2Meta structure is used to store information about a part of
a write request that touches clusters that need changes in their L2
entries. This happens with newly-allocated clusters or subclusters.
This structure has changed a bit since it was first created and its
current documentation is not quite up-to-date.
A write request can span a region consisting of a combination of
clusters of different types, and qcow2_alloc_host_offset() can
repeatedly call handle_copied() and handle_alloc() to add more
clusters to the mix as long as they all are contiguous on the image
file.
Because of this a write request has a list of QCowL2Meta structures,
one for each part of the request that needs changes in the L2
metadata.
Each one of them spans nb_clusters and has two copy-on-write regions
located immediately before and after the middle region touched by that
part of the write request. Even when those regions themselves are
empty their offsets must be correct because they are used to know the
location of the middle region.
This was not always the case but it is not a problem anymore
because the only two places where QCowL2Meta structures are created
(calculate_l2_meta() and qcow2_co_truncate()) ensure that the
copy-on-write regions are correctly defined, and so do assertions like
the ones in perform_cow().
The conditional initialization of the 'written_to' variable is
therefore unnecessary and is removed by this patch.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201007161323.4667-1-berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The "qemu-common.h" include is not used, remove it.
Reported-by: Euler Robot <euler.robot@huawei.com>
Signed-off-by: AlexChen <alex.chen@huawei.com>
Message-Id: <5F8FFB94.3030209@huawei.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Lots of fixes all over the place.
virtio-mem and virtio-iommu patches are kind of fixes but
it seems better to just make them behave sanely than
try to educate users about the limitations ...
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAl+i9YMPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpySQH/Ru/sxB9PncR1HsqSf0HC0tt/EMKgyZTXEwQ
FITcjkCvBDS98a1VUvvZbjzTEDEZNnoUv94MjdLeBoptJ7GtK6nPoI6Ke0p1Zqbe
mlY2BCb0FpN8FE+mthjAI03mhw6o8Qo/OPtyISQzUxCVVqUHL5TRAVAQdeidoK8n
RBQ4WogwM/h7wI0d9GGgSxAON8IRQnBYImtzJieBb6zeScwKVFTWI1tqBdOyFN0/
AhzQiNZuhZ7a1XGJIsxmWB1NK2kcXNJuOF0ANh4coIHR0JzmH3xRy+Jnf5e3dYsw
LI23DUZPSTJJXAwKPucyTG7RTX8F55N9DVHC9KDRD6Ntq1oreJ4=
=pcbN
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
pc,pci,vhost,virtio: fixes
Lots of fixes all over the place.
virtio-mem and virtio-iommu patches are kind of fixes but
it seems better to just make them behave sanely than
try to educate users about the limitations ...
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# gpg: Signature made Wed 04 Nov 2020 18:40:03 GMT
# gpg: using RSA key 5D09FD0871C8F85B94CA8A0D281F0DB8D28D5469
# gpg: issuer "mst@redhat.com"
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [full]
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>" [full]
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream: (31 commits)
contrib/vhost-user-blk: fix get_config() information leak
block/export: fix vhost-user-blk get_config() information leak
block/export: make vhost-user-blk config space little-endian
configure: introduce --enable-vhost-user-blk-server
libvhost-user: follow QEMU comment style
vhost-blk: set features before setting inflight feature
Revert "vhost-blk: set features before setting inflight feature"
net: Add vhost-vdpa in show_netdevs()
vhost-vdpa: Add qemu_close in vhost_vdpa_cleanup
vfio: Don't issue full 2^64 unmap
virtio-iommu: Set supported page size mask
vfio: Set IOMMU page size as per host supported page size
memory: Add interface to set iommu page size mask
virtio-iommu: Add notify_flag_changed() memory region callback
virtio-iommu: Add replay() memory region callback
virtio-iommu: Call memory notifiers in attach/detach
virtio-iommu: Add memory notifiers for map/unmap
virtio-iommu: Store memory region in endpoint struct
virtio-iommu: Fix virtio_iommu_mr()
hw/smbios: Fix leaked fd in save_opt_one() error path
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Refuse get_config() requests in excess of sizeof(struct virtio_blk_config).
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201027173528.213464-5-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
VIRTIO 1.0 devices have little-endian configuration space. The
vhost-user-blk-server.c code already uses little-endian for virtqueue
processing but not for the configuration space fields. Fix this so the
vhost-user-blk export works on big-endian hosts.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201027173528.213464-4-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Make it possible to compile out the vhost-user-blk server. It is enabled
by default on Linux.
Note that vhost-user-server.c depends on libvhost-user, which requires
CONFIG_LINUX. The CONFIG_VHOST_USER dependency was erroneous since that
option controls vhost-user frontends (previously known as "master") and
not device backends (previously known as "slave").
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201027173528.213464-3-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The Completion Queue Command Identifier is a 16-bit value,
so nvme_submit_command() is unlikely to work on big-endian
hosts, as the relevant bits are truncated.
Fix by using the correct byte-swap function.
Fixes: bdd6a90a9e ("block: Add VFIO based NVMe driver")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20201029093306.1063879-25-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
qemu_vfio_pci_map_bar() calls mmap(), and mmap(2) states:
'offset' must be a multiple of the page size as returned
by sysconf(_SC_PAGE_SIZE).
In commit f68453237b we started to use an offset of 4K which
broke this contract on Aarch64 arch.
Fix by mapping at offset 0, and and accessing doorbells at offset=4K.
Fixes: f68453237b ("block/nvme: Map doorbells pages write-only")
Reported-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-24-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Make sure iov's va and size are properly aligned on the
host page size.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-23-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
In preparation of 64kB host page support, let's change the size
and alignment of the prp_list_pages so that the VFIO DMA MAP succeeds
with 64kB host page size. We align on the host page size.
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-22-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
In preparation of 64kB host page support, let's change the size
and alignment of the queue so that the VFIO DMA MAP succeeds.
We align on the host page size.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-21-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
In preparation of 64kB host page support, let's change the size
and alignment of the IDENTIFY command response buffer so that
the VFIO DMA MAP succeeds. We align on the host page size.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-20-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
While trying to simplify the code using a macro, we forgot
the 12-bit shift... Correct that.
Fixes: fad1eb6886 ("block/nvme: Use register definitions from 'block/nvme.h'")
Reported-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-19-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Commit bdd6a90a9e ("block: Add VFIO based NVMe driver")
sets the request_alignment in nvme_refresh_limits().
For consistency, also set it during initialization.
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-18-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
As all commands use the ADMIN queue, it is pointless to pass
it as argument each time. Remove the argument, and rename the
function as nvme_admin_cmd_sync() to make this new behavior
clearer.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20201029093306.1063879-17-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
We don't need to dereference from BDRVNVMeState each time.
Use a NVMeQueuePair pointer on the admin queue.
The nvme_init() becomes easier to review, matching the style
of nvme_add_io_queue().
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-16-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
From the specification chapter 3.1.8 "AQA - Admin Queue Attributes"
the Admin Submission Queue Size field is a 0’s based value:
Admin Submission Queue Size (ASQS):
Defines the size of the Admin Submission Queue in entries.
Enabling a controller while this field is cleared to 00h
produces undefined results. The minimum size of the Admin
Submission Queue is two entries. The maximum size of the
Admin Submission Queue is 4096 entries.
This is a 0’s based value.
This bug has never been hit because the device initialization
uses a single command synchronously :)
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-15-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Replace magic values by definitions, and simplifiy since the
number of queues will never reach 64K.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-14-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Just for consistency, following the example documented since
commit e3fe3988d7 ("error: Document Error API usage rules"),
return a boolean value indicating an error is set or not.
Directly pass errp as the local_err is not requested in our
case. This simplifies a bit nvme_create_queue_pair().
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-12-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Just for consistency, following the example documented since
commit e3fe3988d7 ("error: Document Error API usage rules"),
return a boolean value indicating an error is set or not.
Directly pass errp as the local_err is not requested in our
case.
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20201029093306.1063879-11-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
We can not have negative queue count/size/index, use unsigned type.
Rename 'nr_queues' as 'queue_count' to match the spec naming.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-10-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
To be able to use some definitions in structure declarations,
move them earlier. No logical change.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-9-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-8-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
What we want to trace is the block driver state and the queue index.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-7-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
As we want to enable multiple queues, report the event
in each nvme_poll_queue() call, rather than once in
the callback calling nvme_poll_queues().
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-6-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Controllers have different capabilities and report them in the
CAP register. We are particularly interested by the page size
limits.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-5-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Instead of displaying warning on stderr, use warn_report()
which also displays it on the monitor.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-4-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Use the same format used for the hw/vfio/ trace events.
Suggested-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20201029093306.1063879-3-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
We should use printf format specifier "%u" instead of "%d" for
argument of type "unsigned int".
In addition, fix two error format problems found by checkpatch.pl:
ERROR: space required after that ',' (ctx:VxV)
+ fprintf(stderr,"%s attributes=0x%02x begin=%u size=%d\n",
^
ERROR: line over 90 characters
+ fprintf(stderr, "%d, %s (%u, %d)\n", i, commit->path ? commit->path : "(null)", commit->param.rename.cluster, commit->action);
Reported-by: Euler Robot <euler.robot@huawei.com>
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Message-Id: <5FA12620.6030705@huawei.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow the server to expose an additional metacontext to be requested
by savvy clients. qemu-nbd adds a new option -A to expose the
qemu:allocation-depth metacontext through NBD_CMD_BLOCK_STATUS; this
can also be set via QMP when using block-export-add.
qemu as client is hacked into viewing the key aspects of this new
context by abusing the already-experimental x-dirty-bitmap option to
collapse all depths greater than 2, which results in a tri-state value
visible in the output of 'qemu-img map --output=json' (yes, that means
x-dirty-bitmap is now a bit of a misnomer, but I didn't feel like
renaming it as it would introduce a needless break of back-compat,
even though we make no compat guarantees with x- members):
unallocated (depth 0) => "zero":false, "data":true
local (depth 1) => "zero":false, "data":false
backing (depth 2+) => "zero":true, "data":true
libnbd as client is probably a nicer way to get at the information
without having to decipher such hacks in qemu as client. ;)
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20201027050556.269064-11-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
When checking for allocation across a chain, it's already easy to
count the depth within the chain at which the allocation is found.
Instead of throwing that information away, return it to the caller.
Existing callers only cared about allocated/non-allocated, but having
a depth available will be used by NBD in the next patch.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20201027050556.269064-9-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[eblake: rebase to master]
Signed-off-by: Eric Blake <eblake@redhat.com>
If a BDS gets deleted during blk_drain_all(), it might miss a
call to bdrv_do_drained_end(). This means missing a call to
aio_enable_external() and the AIO context remains disabled for
ever. This can cause a device to become irresponsive and to
disrupt the guest execution, ie. hang, loop forever or worse.
This scenario is quite easy to encounter with virtio-scsi
on POWER when punching multiple blockdev-create QMP commands
while the guest is booting and it is still running the SLOF
firmware. This happens because SLOF disables/re-enables PCI
devices multiple times via IO/MEM/MASTER bits of PCI_COMMAND
register after the initial probe/feature negotiation, as it
tends to work with a single device at a time at various stages
like probing and running block/network bootloaders without
doing a full reset in-between. This naturally generates many
dataplane stops and starts, and thus many drain sections that
can race with blockdev_create_run(). In the end, SLOF bails
out.
It is somehow reproducible on x86 but it requires to generate
articial dataplane start/stop activity with stop/cont QMP
commands. In this case, seabios ends up looping for ever,
waiting for the virtio-scsi device to send a response to
a command it never received.
Add a helper that pairs all previously called bdrv_do_drained_begin()
with a bdrv_do_drained_end() and call it from bdrv_close().
While at it, update the "/bdrv-drain/graph-change/drain_all"
test in test-bdrv-drain so that it can catch the issue.
BugId: https://bugzilla.redhat.com/show_bug.cgi?id=1874441
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <160346526998.272601.9045392804399803158.stgit@bahia.lan>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since commit c8bb23cbdb when a write
request results in a new allocation QEMU first tries to see if the
rest of the cluster outside the written area contains only zeroes.
In that case, instead of doing a normal copy-on-write operation and
writing explicit zero buffers to disk, the code zeroes the whole
cluster efficiently using pwrite_zeroes() with BDRV_REQ_NO_FALLBACK.
This improves performance very significantly but it only happens when
we are writing to an area that was completely unallocated before. Zero
clusters (QCOW2_CLUSTER_ZERO_*) are treated like normal clusters and
are therefore slower to allocate.
This happens because the code uses bdrv_is_allocated_above() rather
bdrv_block_status_above(). The former is not as accurate for this
purpose but it is faster. However in the case of qcow2 the underlying
call does already report zero clusters just fine so there is no reason
why we cannot use that information.
After testing 4KB writes on an image that only contains zero clusters
this patch results in almost five times more IOPS.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <6d77cab968c501c44d6e1089b9bc91b04170b49e.1603731354.git.berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If a BlockDriverState supports backing files but has none then any
unallocated area reads back as zeroes.
bdrv_co_block_status() is only reporting this is if want_zero is true,
but this is an inexpensive test and there is no reason not to do it in
all cases.
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <66fa0914a0e2b727ab6d1b63ca773d7cd29a9a9e.1603731354.git.berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_is_allocated_above wrongly handles short backing files: it reports
after-EOF space as UNALLOCATED which is wrong, as on read the data is
generated on the level of short backing file (if all overlays have
unallocated areas at that place).
Reusing bdrv_common_block_status_above fixes the issue and unifies code
path.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20200924194003.22080-5-vsementsov@virtuozzo.com
[Fix s/has/have/ as suggested by Eric Blake. Fix s/area/areas/.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We are going to reuse bdrv_common_block_status_above in
bdrv_is_allocated_above. bdrv_is_allocated_above may be called with
include_base == false and still bs == base (for ex. from img_rebase()).
So, support this corner case.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20200924194003.22080-4-vsementsov@virtuozzo.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In order to reuse bdrv_common_block_status_above in
bdrv_is_allocated_above, let's support include_base parameter.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20200924194003.22080-3-vsementsov@virtuozzo.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_co_block_status_above has several design problems with handling
short backing files:
1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
which produces these after-EOF zeros is inside requested backing
sequence.
2. With want_zero=false, it may return pnum=0 prior to actual EOF,
because of EOF of short backing file.
Fix these things, making logic about short backing files clearer.
With fixed bdrv_block_status_above we also have to improve is_zero in
qcow2 code, otherwise iotest 154 will fail, because with this patch we
stop to merge zeros of different types (produced by fully unallocated
in the whole backing chain regions vs produced by short backing files).
Note also, that this patch leaves for another day the general problem
around block-status: misuse of BDRV_BLOCK_ALLOCATED as is-fs-allocated
vs go-to-backing.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20200924194003.22080-2-vsementsov@virtuozzo.com
[Fix s/comes/come/ as suggested by Eric Blake
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Allow the number of queues to be configured using --export
vhost-user-blk,num-queues=N. This setting should match the QEMU --device
vhost-user-blk-pci,num-queues=N setting but QEMU vhost-user-blk.c lowers
its own value if the vhost-user-blk backend offers fewer queues than
QEMU.
The vhost-user-blk-server.c code is already capable of multi-queue. All
virtqueue processing runs in the same AioContext. No new locking is
needed.
Add the num-queues=N option and set the VIRTIO_BLK_F_MQ feature bit.
Note that the feature bit only announces the presence of the num_queues
configuration space field. It does not promise that there is more than 1
virtqueue, so we can set it unconditionally.
I tested multi-queue by running a random read fio test with numjobs=4 on
an -smp 4 guest. After the benchmark finished the guest /proc/interrupts
file showed activity on all 4 virtio-blk MSI-X. The /sys/block/vda/mq/
directory shows that Linux blk-mq has 4 queues configured.
An automated test is included in the next commit.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20201001144604.559733-2-stefanha@redhat.com
[Fixed accidental tab characters as suggested by Markus Armbruster
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Make it possible to specify the iothread where the export will run. By
default the block node can be moved to other AioContexts later and the
export will follow. The fixed-iothread option forces strict behavior
that prevents changing AioContext while the export is active. See the
QAPI docs for details.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200929125516.186715-5-stefanha@redhat.com
[Fix stray '#' character in block-export.json and add missing "(since:
5.2)" as suggested by Eric Blake.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Block exports are used by softmmu, qemu-storage-daemon, and qemu-nbd.
They are not used by other programs and are not otherwise needed in
libblock.
Undo the recent move of blockdev-nbd.c from blockdev_ss into block_ss.
Since bdrv_close_all() (libblock) calls blk_exp_close_all()
(libblockdev) a stub function is required..
Make qemu-nbd.c use signal handling utility functions instead of
duplicating the code. This helps because os-posix.c is in libblockdev
and it depends on a qemu_system_killed() symbol that qemu-nbd.c lacks.
Once we use the signal handling utility functions we also end up
providing the necessary symbol.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20200929125516.186715-4-stefanha@redhat.com
[Fixed s/ndb/nbd/ typo in commit description as suggested by Eric Blake
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Don't compile contrib/libvhost-user/libvhost-user.c again. Instead build
the static library once and then reuse it throughout QEMU.
Also switch from CONFIG_LINUX to CONFIG_VHOST_USER, which is what the
vhost-user tools (vhost-user-gpu, etc) do.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-14-stefanha@redhat.com
[Added CONFIG_LINUX again because libvhost-user doesn't build on macOS.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Headers used by other subsystems are located in include/. Also add the
vhost-user-server and vhost-user-blk-server headers to MAINTAINERS.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-13-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Use the new QAPI block exports API instead of defining our own QOM
objects.
This is a large change because the lifecycle of VuBlockDev needs to
follow BlockExportDriver. QOM properties are replaced by QAPI options
objects.
VuBlockDev is renamed VuBlkExport and contains a BlockExport field.
Several fields can be dropped since BlockExport already has equivalents.
The file names and meson build integration will be adjusted in a future
patch. libvhost-user should probably be built as a static library that
is linked into QEMU instead of as a .c file that results in duplicate
compilation.
The new command-line syntax is:
$ qemu-storage-daemon \
--blockdev file,node-name=drive0,filename=test.img \
--export vhost-user-blk,node-name=drive0,id=export0,unix-socket=/tmp/vhost-user-blk.sock
Note that unix-socket is optional because we may wish to accept chardevs
too in the future.
Markus noted that supported address families are not explicit in the
QAPI schema. It is unlikely that support for more address families will
be added since file descriptor passing is required and few address
families support it. If a new address family needs to be added, then the
QAPI 'features' syntax can be used to advertize them.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20200924151549.913737-12-stefanha@redhat.com
[Skip test on big-endian host architectures because this device doesn't
support them yet (as already mentioned in a code comment).
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Propagate the flush return value since errors are possible.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-11-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The vu_client_trip() coroutine is leaked during AioContext switching. It
is also unsafe to destroy the vu_dev in panic_cb() since its callers
still access it in some cases.
Rework the lifecycle to solve these safety issues.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-10-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The device panic notifier callback is not used. Drop it.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-7-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Only one struct is needed per request. Drop req_data and the separate
VuBlockReq instance. Instead let vu_queue_pop() allocate everything at
once.
This fixes the req_data memory leak in vu_block_virtio_process_req().
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200924151549.913737-6-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
By making use of libvhost-user, block device drive can be shared to
the connected vhost-user client. Only one client can connect to the
server one time.
Since vhost-user-server needs a block drive to be created first, delay
the creation of this object.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20200918080912.321299-6-coiby.xu@gmail.com
[Shorten "vhost_user_blk_server" string to "vhost_user_blk" to avoid the
following compiler warning:
../block/export/vhost-user-blk-server.c:178:50: error: ‘%s’ directive output truncated writing 21 bytes into a region of size 20 [-Werror=format-truncation=]
and fix "Invalid size %ld ..." ssize_t format string arguments for
32-bit hosts.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
this fixes non-TCG builds broken recently by replay reverse debugging.
Stub the needed functions in stub/, splitting roughly between functions
needed only by system emulation, by system emulation and tools,
and by everyone. This includes duplicating some code in replay/, and
puts the logic for non-replay related events in the replay/ module (+
the stubs), so this should be revisited in the future.
Surprisingly, only _one_ qtest was affected by this, ide-test.c, which
resulted in a buzz as the bh events were never delivered, and the bh
never executed.
Many other subsystems _should_ have been affected.
This fixes the immediate issue, however a better way to group replay
functionality to TCG-only code could be developed in the long term.
Signed-off-by: Claudio Fontana <cfontana@suse.de>
Message-Id: <20201013192123.22632-4-cfontana@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This thread from a little over a year ago:
http://lists.wpkg.org/pipermail/sheepdog/2019-March/thread.html
states that sheepdog is no longer actively developed. The only mentioned
users are some companies who are said to have it for legacy reasons with
plans to replace it by Ceph. There is talk about cutting out existing
features to turn it into a simple demo of how to write a distributed
block service. There is no evidence of anyone working on that idea:
https://github.com/sheepdog/sheepdog/commits/master
No real commits to git since Jan 2018, and before then just some minor
technical debt cleanup.
There is essentially no activity on the mailing list aside from
patches to QEMU that get CC'd due to our MAINTAINERS entry.
Fedora packages for sheepdog failed to build from upstream source
because of the more strict linker that no longer merges duplicate
global symbols. Fedora patches it to add the missing "extern"
annotations and presumably other distros do to, but upstream source
remains broken.
There is only basic compile testing, no functional testing of the
driver.
Since there are no build pre-requisites the sheepdog driver is currently
enabled unconditionally. This would result in configure issuing a
deprecation warning by default for all users. Thus the configure default
is changed to disable it, requiring users to pass --enable-sheepdog to
build the driver.
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20201002113243.2347710-3-berrange@redhat.com>
Reviewed-by: Neal Gompa <ngompa13@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fedora 32 gcc 10 seems to give false positives:
Compiling C object libblock.fa.p/block_vmdk.c.o
../block/vmdk.c: In function ‘vmdk_parse_extents’:
../block/vmdk.c:587:5: error: ‘extent’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
587 | g_free(extent->l1_table);
| ^~~~~~~~~~~~~~~~~~~~~~~~
../block/vmdk.c:754:17: note: ‘extent’ was declared here
754 | VmdkExtent *extent;
| ^~~~~~
../block/vmdk.c:620:11: error: ‘extent’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
620 | ret = vmdk_init_tables(bs, extent, errp);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../block/vmdk.c:598:17: note: ‘extent’ was declared here
598 | VmdkExtent *extent;
| ^~~~~~
../block/vmdk.c:1178:39: error: ‘extent’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
1178 | extent->flat_start_offset = flat_offset << 9;
| ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
../block/vmdk.c: In function ‘vmdk_open_vmdk4’:
../block/vmdk.c:581:22: error: ‘extent’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
581 | extent->l2_cache =
| ~~~~~~~~~~~~~~~~~^
582 | g_malloc(extent->entry_size * extent->l2_size * L2_CACHE_SIZE);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../block/vmdk.c:872:17: note: ‘extent’ was declared here
872 | VmdkExtent *extent;
| ^~~~~~
../block/vmdk.c: In function ‘vmdk_open’:
../block/vmdk.c:620:11: error: ‘extent’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
620 | ret = vmdk_init_tables(bs, extent, errp);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../block/vmdk.c:598:17: note: ‘extent’ was declared here
598 | VmdkExtent *extent;
| ^~~~~~
cc1: all warnings being treated as errors
make: *** [Makefile.ninja:884: libblock.fa.p/block_vmdk.c.o] Error 1
fix them by assigning a default value.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Fam Zheng <fam@euphon.net>
Message-Id: <20200930155859.303148-2-borntraeger@de.ibm.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
In a recent commit 12c75e20a2 we've improved
nbd_co_reconnect_loop() to not make drain wait for additional sleep.
Similarly, we shouldn't try to connect, if previous sleep was
interrupted by drain begin, otherwise drain_begin will have to wait for
the whole connection attempt.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200903190301.367620-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
reconnect-delay has a design flaw: we handle it in the same loop where
we do connection attempt. So, reconnect-delay may be exceeded by
unpredictable time of connection attempt.
Let's instead use separate timer.
How to reproduce the bug:
1. Create an image on node1:
qemu-img create -f qcow2 xx 100M
2. Start NBD server on node1:
qemu-nbd xx
3. On node2 start qemu-io:
./build/qemu-io --image-opts \
driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=15
4. Type 'read 0 512' in qemu-io interface to check that connection
works
Be careful: you should make steps 5-7 in a short time, less than 15
seconds.
5. Kill nbd server on node1
6. Run 'read 0 512' in qemu-io interface again, to be sure that nbd
client goes to reconnect loop.
7. On node1 run the following command
sudo iptables -A INPUT -p tcp --dport 10809 -j DROP
This will make the connect() call of qemu-io at node2 take a long time.
And you'll see that read command in qemu-io will hang for a long time,
more than 15 seconds specified by reconnect-delay parameter. It's the
bug.
8. Don't forget to drop iptables rule on node1:
sudo iptables -D INPUT -p tcp --dport 10809 -j DROP
Important note: Step [5] is necessary to reproduce _this_ bug. If we
miss step [5], the read command (step 6) will hang for a long time and
this commit doesn't help, because there will be not long connect() to
unreachable host, but long sendmsg() to unreachable host, which should
be fixed by enabling and adjusting keep-alive on the socket, which is a
thing for further patch set.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200903190301.367620-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Don't use nbd_client_detach_aio_context() driver handler where we want
to finalize the connection. We should directly use
qio_channel_detach_aio_context() in such cases. Driver handler may (and
will) contain another things, unrelated to the qio channel.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200903190301.367620-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We pause reconnect process during drained section. So, if we have some
requests, waiting for reconnect we should cancel them, otherwise they
deadlock the drained section.
How to reproduce:
1. Create an image:
qemu-img create -f qcow2 xx 100M
2. Start NBD server:
qemu-nbd xx
3. Start vm with second nbd disk on node2, like this:
./build/x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
file=/work/images/cent7.qcow2 -drive \
driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=60 \
-vnc :0 -m 2G -enable-kvm -vga std
4. Access the vm through vnc (or some other way?), and check that NBD
drive works:
dd if=/dev/sdb of=/dev/null bs=1M count=10
- the command should succeed.
5. Now, kill the nbd server, and run dd in the guest again:
dd if=/dev/sdb of=/dev/null bs=1M count=10
Now Qemu is trying to reconnect, and dd-generated requests are waiting
for the connection (they will wait up to 60 seconds (see
reconnect-delay option above) and than fail). But suddenly, vm may
totally hang in the deadlock. You may need to increase reconnect-delay
period to catch the dead-lock.
VM doesn't respond because drain dead-lock happens in cpu thread with
global mutex taken. That's not good thing by itself and is not fixed
by this commit (true way is using iothreads). Still this commit fixes
drain dead-lock itself.
Note: probably, we can instead continue to reconnect during drained
section. To achieve this, we may move negotiation to the connect thread
to make it independent of bs aio context. But expanding drained section
doesn't seem good anyway. So, let's now fix the bug the simplest way.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200903190301.367620-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Saving icount as a parameters of the snapshot allows navigation between
them in the execution replay scenario.
This information can be used for finding a specific snapshot for proceeding
the recorded execution to the specific moment of the time.
E.g., 'reverse step' action (introduced in one of the following patches)
needs to load the nearest snapshot which is prior to the current moment
of time.
This patch also updates snapshot test which verifies qemu monitor output.
Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Acked-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
--
v4 changes:
- squashed format update with test output update
v7 changes:
- introduced the spaces between the fields in snapshot info output
- updated the test to match new field widths
Message-Id: <160174518865.12451.14327573383978752463.stgit@pasha-ThinkPad-X280>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch introduces the icount field for saving within the snapshot.
It is required for navigation between the snapshots in record/replay mode.
Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Acked-by: Kevin Wolf <kwolf@redhat.com>
--
v7 changes:
- also fix the test which checks qcow2 snapshot extra data
Message-Id: <160174518284.12451.2301137308458777398.stgit@pasha-ThinkPad-X280>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like for read/write in a previous commit, drop extra indirection layer,
generate directly bdrv_readv_vmstate() and bdrv_writev_vmstate().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-8-vsementsov@virtuozzo.com>
Now that we are not maintaining boilerplate code for coroutine
wrappers, there is no more sense in keeping the extra indirection layer
of bdrv_prwv(). Let's drop it and instead generate pure bdrv_preadv()
and bdrv_pwritev().
Currently, bdrv_pwritev() and bdrv_preadv() are returning bytes on
success, auto generated functions will instead return zero, as their
_co_ prototype. Still, it's simple to make the conversion safe: the
only external user of bdrv_pwritev() is test-bdrv-drain, and it is
comfortable enough with bdrv_co_pwritev() instead. So prototypes are
moved to local block/coroutines.h. Next, the only internal use is
bdrv_pread() and bdrv_pwrite(), which are modified to return bytes on
success.
Of course, it would be great to convert bdrv_pread() and bdrv_pwrite()
to return 0 on success. But this requires audit (and probably
conversion) of all their users, let's leave it for another day
refactoring.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-7-vsementsov@virtuozzo.com>
Use code generation implemented in previous commit to generated
coroutine wrappers in block.c and block/io.c
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-6-vsementsov@virtuozzo.com>
We have a very frequent pattern of creating a coroutine from a function
with several arguments:
- create a structure to pack parameters
- create _entry function to call original function taking parameters
from struct
- do different magic to handle completion: set ret to NOT_DONE or
EINPROGRESS or use separate bool field
- fill the struct and create coroutine from _entry function with this
struct as a parameter
- do coroutine enter and BDRV_POLL_WHILE loop
Let's reduce code duplication by generating coroutine wrappers.
This patch adds scripts/block-coroutine-wrapper.py together with some
friends, which will generate functions with declared prototypes marked
by the 'generated_co_wrapper' specifier.
The usage of new code generation is as follows:
1. define the coroutine function somewhere
int coroutine_fn bdrv_co_NAME(...) {...}
2. declare in some header file
int generated_co_wrapper bdrv_NAME(...);
with same list of parameters (generated_co_wrapper is
defined in "include/block/block.h").
3. Make sure the block_gen_c declaration in block/meson.build
mentions the file with your marker function.
Still, no function is now marked, this work is for the following
commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200924185414.28642-5-vsementsov@virtuozzo.com>
[Added encoding='utf-8' to open() calls as requested by Vladimir. Fixed
typo and grammar issues pointed out by Eric Blake. Removed clang-format
dependency that caused build test issues.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We are going to keep coroutine-wrappers code (structure-packing
parameters, BDRV_POLL wrapper functions) in separate auto-generated
files. So, we'll need a header with declaration of original _co_
functions, for those which are static now. As well, we'll need
declarations for wrapper functions. Do these declarations now, as a
preparation step.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-4-vsementsov@virtuozzo.com>
Most of our coroutine wrappers already follow this convention:
We have 'coroutine_fn bdrv_co_<something>(<normal argument list>)' as
the core function, and a wrapper 'bdrv_<something>(<same argument
list>)' which does parameter packing and calls bdrv_run_co().
The only outsiders are the bdrv_prwv_co and
bdrv_common_block_status_above wrappers. Let's refactor them to behave
as the others, it simplifies further conversion of coroutine wrappers.
This patch adds an indirection layer, but it will be compensated by
a further commit, which will drop bdrv_co_prwv together with the
is_write logic, to keep the read and write paths separate.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-3-vsementsov@virtuozzo.com>
Use self-explicit SCALE_MS definition instead of magic value
(missed in similar commit e4f310fe7f).
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200922083821.578519-7-philmd@redhat.com>
Use the NVMe register definitions from "block/nvme.h" which
ease a bit reviewing the code while matching the datasheet.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200922083821.578519-6-philmd@redhat.com>
NVMeRegs only contains NvmeBar. Simplify the code by using NvmeBar
directly.
This triggers a checkpatch.pl error:
ERROR: Use of volatile is usually wrong, please add a comment
#30: FILE: block/nvme.c:691:
+ volatile NvmeBar *regs;
This is a false positive as in our case we are using I/O registers,
so the 'volatile' use is justified.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200922083821.578519-5-philmd@redhat.com>
We only access the I/O register in nvme_init().
Remove the reference in BDRVNVMeState and reduce its scope.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200922083821.578519-4-philmd@redhat.com>
Per the datasheet sections 3.1.13/3.1.14:
"The host should not read the doorbell registers."
As we don't need read access, map the doorbells with write-only
permission. We keep a reference to this mapped address in the
BDRVNVMeState structure.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200922083821.578519-3-philmd@redhat.com>
Pages are currently mapped READ/WRITE. To be able to use different
protections, add a new argument to qemu_vfio_pci_map_bar().
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200922083821.578519-2-philmd@redhat.com>
We overlooked these in 02b1ecfa10
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20200928162333.14998-1-berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The 'writable' option is a basic option that will probably be applicable
to most if not all export types that we will implement. Move it from NBD
to the generic BlockExport layer.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-26-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds a simple QMP command to query the list of block exports.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-25-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Every export type will need a BlockBackend, so creating it centrally in
blk_exp_add() instead of the .create driver callback avoids duplication.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-24-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Every block export has a BlockBackend representing the disk that is
exported. It should live in BlockExport therefore.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-23-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Clients may want to know when an export has finally disappeard
(block-export-del returns earlier than that in the general case), so add
a QAPI event for it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200924152717.287415-22-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Implement a new QMP command block-export-del and make nbd-server-remove
a wrapper around it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-21-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The reference owned by the user/monitor that is created when adding the
export and dropped when removing it was tied to the 'exports' list in
nbd/server.c. Every block export will have a user reference, so move it
to the block export level and tie it to the 'block_exports' list in
block/export/export.c instead. This is necessary for introducing a QMP
command for removing exports.
Note that exports are present in block_exports even after the user has
requested shutdown. This is different from NBD's exports where exports
are immediately removed on a shutdown request, even if they are still in
the process of shutting down. In order to avoid that the user still
interacts with an export that is shutting down (and possibly removes it
a second time), we need to remember if the user actually still owns it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-20-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We'll need an id to identify block exports in monitor commands. This
adds one.
Note that this is different from the 'name' option in the NBD server,
which is the externally visible export name. While block export ids need
to be unique in the whole process, export names must be unique only for
the same server. Different export types or (potentially in the future)
multiple NBD servers can have the same export name externally, but still
need different block export ids internally.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-19-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds a function to shut down all block exports, and another one to
shut down the block exports of a single type. The latter is used for now
when stopping the NBD server. As soon as we implement support for
multiple NBD servers, we'll need a per-server list of exports and it
will be replaced by a function using that.
As a side effect, the BlockExport layer has a list tracking all existing
exports now. closed_exports loses its only user and can go away.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-18-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of letting the driver allocate and return the BlockExport
object, allocate it already in blk_exp_add() and pass it. This allows us
to initialise the generic part before calling into the driver so that
the driver can just use these values instead of having to parse the
options a second time.
For symmetry, move freeing the BlockExport to blk_exp_unref().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-17-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Every block export needs a block node to export, so add a 'node-name'
option to BlockExportOptions and remove the replaced option 'device'
from BlockExportOptionsNbd.
To maintain compatibility in nbd-server-add, BlockExportOptionsNbd needs
to be wrapped by a new type NbdServerAddOptions that adds 'device' back
because nbd-server-add doesn't use the BlockExportOptions base type at
all (so even without changing it to a 'node-name' option in
block-export-add, this compatibility code would be necessary).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-16-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-15-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Having a refcount makes sense for all types of block exports. It is also
a prerequisite for keeping a list of all exports at the BlockExport
level.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-14-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is a QMP equivalent of qemu-nbd's --shared option, limiting the
maximum number of clients that can attach at the same time.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200924152717.287415-9-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
nbd-server-add tries to be convenient and adds two questionable
features that we don't want to share in block-export-add, even for NBD
exports:
1. When requesting a writable export of a read-only device, the export
is silently downgraded to read-only. This should be an error in the
context of block-export-add.
2. When using a BlockBackend name, unplugging the device from the guest
will automatically stop the NBD server, too. This may sometimes be
what you want, but it could also be very surprising. Let's keep
things explicit with block-export-add. If the user wants to stop the
export, they should tell us so.
Move these things into the nbd-server-add QMP command handler so that
they apply only there.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-8-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to have a common set of commands for all types of block exports.
Currently, this is only NBD, but we're going to add more types.
This patch adds the basic BlockExport and BlockExportDriver structs and
a QMP command block-export-add that creates a new export based on the
given BlockExportOptions.
qmp_nbd_server_add() becomes a wrapper around qmp_block_export_add().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200924152717.287415-5-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The name BlockExport will be used for the struct containing the runtime
state of block exports, so change the name of export creation options.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200924152717.287415-4-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move all block export related types and commands from block-core to the
new QAPI module block-export.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200924152717.287415-3-kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use self-explicit NANOSECONDS_PER_SECOND definition instead
of magic value.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200921110145.520944-1-philmd@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Only qemu-system-FOO and qemu-storage-daemon provide QMP
monitors, therefore such declarations and definitions are
irrelevant for user-mode emulation.
Restricting the query-uuid command to machine.json pulls less
QAPI-generated code into user-mode.
Acked-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200913195348.1064154-6-philmd@redhat.com>
[Commit message tweaked]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
clang's C11 atomic_fetch_*() functions only take a C11 atomic type
pointer argument. QEMU uses direct types (int, etc) and this causes a
compiler error when a QEMU code calls these functions in a source file
that also included <stdatomic.h> via a system header file:
$ CC=clang CXX=clang++ ./configure ... && make
../util/async.c:79:17: error: address argument to atomic operation must be a pointer to _Atomic type ('unsigned int *' invalid)
Avoid using atomic_*() names in QEMU's atomic.h since that namespace is
used by <stdatomic.h>. Prefix QEMU's APIs with 'q' so that atomic.h
and <stdatomic.h> can co-exist. I checked /usr/include on my machine and
searched GitHub for existing "qatomic_" users but there seem to be none.
This patch was generated using:
$ git grep -h -o '\<atomic\(64\)\?_[a-z0-9_]\+' include/qemu/atomic.h | \
sort -u >/tmp/changed_identifiers
$ for identifier in $(</tmp/changed_identifiers); do
sed -i "s%\<$identifier\>%q$identifier%g" \
$(git grep -I -l "\<$identifier\>")
done
I manually fixed line-wrap issues and misaligned rST tables.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200923105646.47864-1-stefanha@redhat.com>
Currently at startup if using cache=none on a filesystem lacking
O_DIRECT such as tmpfs, at startup QEMU prints
qemu-system-x86_64: -drive file=/tmp/foo.img,cache=none: file system may not support O_DIRECT
qemu-system-x86_64: -drive file=/tmp/foo.img,cache=none: Could not open '/tmp/foo.img': Invalid argument
while at QMP level the hint is missing, so QEMU reports just
"error": {
"class": "GenericError",
"desc": "Could not open '/tmp/foo.img': Invalid argument"
}
which is close to useless for the end user trying to figure out what
they did wrong.
With this change at startup QEMU prints
qemu-system-x86_64: -drive file=/tmp/foo.img,cache=none: Unable to open '/tmp/foo.img': filesystem does not support O_DIRECT
while at the QMP level QEMU reports a massively more informative
"error": {
"class": "GenericError",
"desc": "Unable to open '/tmp/foo.img': filesystem does not support O_DIRECT"
}
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
We want to introduce a new version of qemu_open() that uses an Error
object for reporting problems and make this it the preferred interface.
Rename the existing method to release the namespace for the new impl.
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Commit 19ae9ae014 ("block/rbd: Add support for ceph namespaces")
introduced namespace support for RBD, but we forgot to add the
new 'namespace' options to qemu_rbd_strong_runtime_opts[].
The 'namespace' is used to identify the image, so it is a strong
option since it can changes the data of a BDS.
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1821528
Fixes: 19ae9ae014 ("block/rbd: Add support for ceph namespaces")
Cc: Florian Florensa <fflorensa@online.net>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20200914190553.74871-1-sgarzare@redhat.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
qcow2_alloc_cluster_offset() takes an (unaligned) guest offset and
returns the (aligned) offset of the corresponding cluster in the qcow2
image.
In practice none of the callers need to know where the cluster starts
so this patch makes the function calculate and return the final host
offset directly. The function is also renamed accordingly.
See 388e581615 for a similar change to qcow2_get_cluster_offset().
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <9bfef50ec9200d752413be4fc2aeb22a28378817.1599833007.git.berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This function preallocates metadata structures and then extends the
image to its new size, but that new size calculation is wrong because
it doesn't take into account that the host_offset variable is always
cluster-aligned.
This problem can be reproduced with preallocation=metadata when the
original size is not cluster-aligned but the new size is. In this case
the final image size will be shorter than expected.
qemu-img create -f qcow2 img.qcow2 31k
qemu-img resize --preallocation=metadata img.qcow2 128k
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <adeb8b059917b141d5f5b3bd2a016262d3052c79.1599833007.git.berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
[mreitz: Mark compat=0.10 unsupported for iotest 125]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Introduced by d85f4222b4,
These were seemingly never used at all.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20200806211345.2925343-3-jsnow@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This saw its last use in 4bfb274165.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20200806211345.2925343-2-jsnow@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This function checks the current status of a (sub)cluster in order to
see if an unaligned 'write zeroes' request can be done efficiently by
simply updating the L2 metadata and without having to write actual
zeroes to disk.
If the situation does not allow using the fast path then the function
returns -ENOTSUP and the caller falls back to writing zeroes.
If can happen however that the aforementioned check returns an actual
error code so in this case we should pass it to the caller.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20200909123739.719-1-berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This function takes an L2 entry and a number of clusters to free.
Although in principle it can free any type of cluster (using the L2
entry to determine its type) in practice the API is broken because
compressed clusters have a variable size and there is no way to free
more than one without having the L2 entry of each one of them.
The good news all callers are passing nb_clusters=1 so we can simply
get rid of that parameter.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <77cea0f4616f921d37e971b3c5b18a2faa24b173.1599573989.git.berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
If qcow2_alloc_cluster_offset() or qcow2_alloc_cluster_link_l2() fail
then this function simply returns the error code, potentially leaking
the QCowL2Meta structure and leaving stale items in s->cluster_allocs.
A second problem is that this function calls qcow2_free_any_clusters()
on failure but passing a host cluster offset instead of an L2 entry.
Luckily for normal uncompressed clusters a raw offset also works like
a valid L2 entry so it works just the same, but we should be using
qcow2_free_clusters() instead.
This patch fixes both problems by using qcow2_handle_l2meta().
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <cd3a6b9abd43f9c0b60be413d760f0cacc67eb66.1599573989.git.berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
block/vhdx uses qemu block layer where sector size is always 512 bytes.
This may have issues with 4K logical sector sized vhdx image.
For e.g qemu-img convert on such images fails with following assert:
$qemu-img convert -f vhdx -O raw 4KTest1.vhdx test.raw
qemu-img: util/iov.c:388: qiov_slice: Assertion `offset + len <=
qiov->size' failed.
Aborted
This patch adds an check to return ENOTSUP for vhdx images which
have logical sector size other than 512 bytes.
Signed-off-by: Swapnil Ingle <swapnil.ingle@nutanix.com>
Message-Id: <1596794594-44531-1-git-send-email-swapnil.ingle@nutanix.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The current text corresponds to an earlier, simpler version of this
function and it does not explain how it works now.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <bb5bd06f07c5a05b0818611de0d06ec5b66c8df3.1599150873.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
In the past, when a new cluster was allocated the l2meta structure was
a variable in the stack so it was necessary to have a way to tell
whether it had been initialized and contained valid data or not. The
nb_clusters field was used for this purpose. Since commit f50f88b9fe
this is no longer the case, l2meta (nowadays a pointer to a list) is
only allocated when needed and nb_clusters is guaranteed to be > 0 so
this check is unnecessary.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <ab0b67c29c7ba26e598db35f12aa5ab5982539c1.1599150873.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
When a write request needs to allocate new clusters (or change the L2
bitmap of existing ones) a QCowL2Meta structure is created so the L2
metadata can be later updated and any copy-on-write can be performed
if necessary.
A write request can span a region consisting of an arbitrary
combination of previously unallocated and allocated clusters, and if
the unallocated ones can be put contiguous to the existing ones then
QEMU will do so in order to minimize the number of write operations.
In practice this means that a write request has not just one but a
number of QCowL2Meta structures. All of them are added to the
cluster_allocs list that is stored in BDRVQcow2State and is used to
detect overlapping requests. After the write request finishes all its
associated QCowL2Meta are removed from that list. calculate_l2_meta()
takes care of creating and putting those structures in the list, and
qcow2_handle_l2meta() takes care of removing them.
The problem is that the error path in handle_alloc() also tries to
remove an item in that list, a remnant from the time when this was
handled there (that code would not even be correct anymore because
it only removes one struct and not all the ones from the same write
request).
This can trigger a double removal of the same item from the list,
causing a crash. This is not easy to reproduce in practice because
it requires that do_alloc_cluster_offset() fails after a successful
previous allocation during the same write request, but it can be
reproduced with the included test case.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <3440a1c4d53c4fe48312b478c96accb338cbef7c.1599150873.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This patch replaces instances of sizeof(uint64_t) in the qcow2 driver
with macros that indicate what those sizes are actually referring to.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20200828110828.13833-1-berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
If we remove the child with the highest index from the quorum,
decrement s->next_child_index. This way we get stable children
names as long as we only remove the last child.
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Fixes: https://bugs.launchpad.net/bugs/1881231
Reviewed-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <5d5f930424c1c770754041aa8ad6421dc4e2b58e.1596536719.git.lukasstraub2@web.de>
Signed-off-by: Max Reitz <mreitz@redhat.com>
- qemu-img create: Fail gracefully when backing file is an empty string
- Fixes related to filter block nodes ("Deal with filters" series)
- block/nvme: Various cleanups required to use multiple queues
- block/nvme: Use NvmeBar structure from "block/nvme.h"
- file-win32: Fix "locking" option
- iotests: Allow running from different directory
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAl9Z7bcRHGt3b2xmQHJl
ZGhhdC5jb20ACgkQfwmycsiPL9ZS6w//bos+A0RfRRF0YFWkIBLQWxqzKcGvMJ8W
XWv3mFzd47UaDgRYwVnCC3CR6bLYEINISngZ3geA4jI1+w7AtYKDOO0HN32dUg+D
ZrNMn02701CA6qkmpxJ+yjsrl9ltR3jYe0me4Wr39Pvdexa2pl/e+M4Vas6FhkYL
ghAwNThypscGCrFjAlz3ru2Sc/K+sPWrGoqkzr+SWvsm9wy4vb8aLxr8Yy50x/zc
CqALS9SQ/YA93BCVi9CzPkVyV3ioA0kg/y38WvLtAQ9GZ3m/ekMro3WvdYsRsFCN
LGXsuwFig+U7Kd7lJrCS9TLnlTJstNGqPq9jEoV5cThPvGknFfMvVOzRmmP7tzqT
YRcPRy39z44OoLKa3kyg3aF38BTxt+9gPqBnivKMr9j9EecMvPsXXHRvF+lP+LsP
j753Ih561hX6FurcjX8pc9GOM2cQA0GjlyL77UTTAmLZyFXP/8e55oQbBuYTylc/
Xlvmc/T+yEGiEGTnK+FxgDAiUaxbCCM9cDVStJjTvsIq43dwXb48g1onDsGZ5eDf
j9lmAD6TJxHNOB5ErNsDPODf4/D1wJ9t9WVF8UZp9ArfPHRdxMzT7Q4LvetaDmVl
+hQC9cgTq8Qd8LwSqbKEYua4L6iGbmLAT7/N6htq5L1eVLg76/tLg/tKSwh/vKAY
yzPmyHaVK84=
=gaaW
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches:
- qemu-img create: Fail gracefully when backing file is an empty string
- Fixes related to filter block nodes ("Deal with filters" series)
- block/nvme: Various cleanups required to use multiple queues
- block/nvme: Use NvmeBar structure from "block/nvme.h"
- file-win32: Fix "locking" option
- iotests: Allow running from different directory
# gpg: Signature made Thu 10 Sep 2020 10:11:19 BST
# gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg: issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream: (65 commits)
block/qcow2-cluster: Add missing "fallthrough" annotation
block/nvme: Pair doorbell registers
block/nvme: Use generic NvmeBar structure
block/nvme: Group controller registers in NVMeRegs structure
file-win32: Fix "locking" option
iotests: Allow running from different directory
iotests: Test committing to overridden backing
iotests: Add test for commit in sub directory
iotests: Add filter mirror test cases
iotests: Add filter commit test cases
iotests: Let complete_and_wait() work with commit
iotests: Test that qcow2's data-file is flushed
block: Leave BDS.backing_{file,format} constant
block: Inline bdrv_co_block_status_from_*()
blockdev: Fix active commit choice
block: Drop backing_bs()
qemu-img: Use child access functions
nbd: Use CAF when looking for dirty bitmap
commit: Deal with filters
backup: Deal with filters
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
When compiling with -Werror=implicit-fallthrough, the compiler currently
complains:
../../devel/qemu/block/qcow2-cluster.c: In function ‘cluster_needs_new_alloc’:
../../devel/qemu/block/qcow2-cluster.c:1320:12: error: this statement may fall
through [-Werror=implicit-fallthrough=]
if (l2_entry & QCOW_OFLAG_COPIED) {
^
../../devel/qemu/block/qcow2-cluster.c:1323:5: note: here
case QCOW2_CLUSTER_UNALLOCATED:
^~~~
It's quite obvious that the fallthrough is intended here, so let's add
a comment to silence the compiler warning.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20200908070028.193298-1-thuth@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For each queue doorbell registers are paired as:
- Submission Queue Tail Doorbell
- Completion Queue Head Doorbell
Reflect that in the NVMeRegs structure, and adapt
nvme_create_queue_pair() accordingly.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200904124130.583838-4-philmd@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Fam Zheng <fam@euphon.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit f3c507adcd ("NVMe: Initial commit for new storage interface")
introduced the NvmeBar structure. Unfortunately in commit bdd6a90a9e
("block: Add VFIO based NVMe driver") we duplicated it.
Apparently in commit a3d9a352d4 ("block: Move NVMe constants to
a separate header") we tried to unify headers but forgot to remove
the structure declared in the block/nvme.c source file.
Do it now, and remove the structure size check which is redundant
with the header check added in commit 74e18435c0 ("hw/block/nvme:
Align I/O BAR to 4 KiB").
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200904124130.583838-3-philmd@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Fam Zheng <fam@euphon.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to use the NvmeBar structure from "block/nvme.h" in the
next commit. As a preliminary step, group all the NVMe controller
registers in the 'ctrl' field, keeping the doorbells registers
out of it.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200904124130.583838-2-philmd@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Fam Zheng <fam@euphon.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The intended behaviour was that locking=off/auto work and have no
effect (to remain compatible with file-posix), whereas locking=on would
return an error. Unfortunately, the code forgot to remove "locking" from
the options QDict, so any attempt to use the option would fail.
Replace the option parsing code for "locking" with something that is
part of the raw_runtime_opts QemuOptsList (so it is properly removed
from the QDict) and looks more like file-posix.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20200907092739.9988-1-kwolf@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Some trace points are attributed to the wrong source file. Happens
when we neglect to update trace-events for code motion, or add events
in the wrong place, or misspell the file name.
Clean up with help of scripts/cleanup-trace-events.pl. Funnies
requiring manual post-processing:
* accel/tcg/cputlb.c trace points are in trace-events.
* block.c and blockdev.c trace points are in block/trace-events.
* hw/block/nvme.c uses the preprocessor to hide its trace point use
from cleanup-trace-events.pl.
* hw/tpm/tpm_spapr.c uses pseudo trace point tpm_spapr_show_buffer to
guard debug code.
* include/hw/xen/xen_common.h trace points are in hw/xen/trace-events.
* linux-user/trace-events abbreviates a tedious list of filenames to
*/signal.c.
* net/colo-compare and net/filter-rewriter.c use pseudo trace points
colo_compare_miscompare and colo_filter_rewriter_debug to guard
debug code.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20200806141334.3646302-5-armbru@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tracked down with the help of scripts/cleanup-trace-events.pl.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20200806141334.3646302-4-armbru@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Parts of the block layer treat BDS.backing_file as if it were whatever
the image header says (i.e., if it is a relative path, it is relative to
the overlay), other parts treat it like a cache for
bs->backing->bs->filename (relative paths are relative to the CWD).
Considering bs->backing->bs->filename exists, let us make it mean the
former.
Among other things, this now allows the user to specify a base when
using qemu-img to commit an image file in a directory that is not the
CWD (assuming, everything uses relative filenames).
Before this patch:
$ ./qemu-img create -f qcow2 foo/bot.qcow2 1M
$ ./qemu-img create -f qcow2 -b bot.qcow2 foo/mid.qcow2
$ ./qemu-img create -f qcow2 -b mid.qcow2 foo/top.qcow2
$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find '[...]/foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
After this patch:
$ ./qemu-img commit -b mid.qcow2 foo/top.qcow2
Image committed.
$ ./qemu-img commit -b foo/mid.qcow2 foo/top.qcow2
qemu-img: Did not find 'foo/mid.qcow2' in the backing chain of 'foo/top.qcow2'
$ ./qemu-img commit -b $PWD/foo/mid.qcow2 foo/top.qcow2
Image committed.
With this change, bdrv_find_backing_image() must look at whether the
user has overridden a BDS's backing file. If so, it can no longer use
bs->backing_file, but must instead compare the given filename against
the backing node's filename directly.
Note that this changes the QAPI output for a node's backing_file. We
had very inconsistent output there (sometimes what the image header
said, sometimes the actual filename of the backing image). This
inconsistent output was effectively useless, so we have to decide one
way or the other. Considering that bs->backing_file usually at runtime
contained the path to the image relative to qemu's CWD (or absolute),
this patch changes QAPI's backing_file to always report the
bs->backing->bs->filename from now on. If you want to receive the image
header information, you have to refer to full-backing-filename.
This necessitates a change to iotest 228. The interesting information
it really wanted is the image header, and it can get that now, but it
has to use full-backing-filename instead of backing_file. Because of
this patch's changes to bs->backing_file's behavior, we also need some
reference output changes.
Along with the changes to bs->backing_file, stop updating
BDS.backing_format in bdrv_backing_attach() as well. This way,
ImageInfo's backing-filename and backing-filename-format fields will
represent what the image header says and nothing else.
iotest 245 changes in behavior: With the backing node no longer
overriding the parent node's backing_file string, you can now omit the
@backing option when reopening a node with neither a default nor a
current backing file even if it used to have a backing node at some
point.
273 also changes: The base image is opened without a format layer, so
ImageInfo.backing-filename-format used to report "file" for the base
image's overlay after blockdev-snapshot. However, the image header
never says "file" anywhere, so it now reports $IMGFMT.
Signed-off-by: Max Reitz <mreitz@redhat.com>
With bdrv_filter_bs(), we can easily handle this default filter behavior
in bdrv_co_block_status().
blkdebug wants to have an additional assertion, so it keeps its own
implementation, except bdrv_co_block_status_from_file() needs to be
inlined there.
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
This includes some permission limiting (for example, we only need to
take the RESIZE permission if the base is smaller than the top).
Signed-off-by: Max Reitz <mreitz@redhat.com>
This includes some permission limiting (for example, we only need to
take the RESIZE permission for active commits where the base is smaller
than the top).
base_overlay is introduced so we can query bdrv_is_allocated_above() on
it - we cannot do that with base itself, because a filter's block_status
is the same as its child node, so if there are filters on base,
bdrv_is_allocated_above() on base would return information including
base.
Use this opportunity to rename qmp_drive_mirror()'s "source" BDS to
"target_backing_bs", because that is what it really refers to.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
query-block, query-named-block-nodes, and query-blockstats now return
any filtered child under "backing", not just bs->backing or COW
children. This is so that filters do not interrupt the reported backing
chain. This changes the output for iotest 184, as the throttled node
now appears as a backing child.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
It makes no sense to report the block stats of a purely metadata-storing
child in query-blockstats. So if the primary child does not have any
data, try to find a unique data-storing child.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
It is trivial, so we might as well do it.
Remove _filter_actual_image_size from iotest 184, so we get to see the
result in its reference output.
Signed-off-by: Max Reitz <mreitz@redhat.com>
If the top node's driver does not provide snapshot functionality and we
want to fall back to a node down the chain, we need to snapshot all
non-COW children. For simplicity's sake, just do not fall back if there
is more than one such child. Furthermore, we really only can fall back
to bs->file and bs->backing, because bdrv_snapshot_goto() has to modify
the child link (notably, set it to NULL).
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
If a node whose driver does not provide VM state functions has a
metadata child, the VM state should probably go there; if it is a
filter, the VM state should probably go there. It follows that we
should generally go down to the primary child.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Instead of looking at just bs->file and bs->backing, we should look at
all children that could end up receiving forwarded requests.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Before HEAD^, we needed this because bdrv_co_flush() by itself would
only flush bs->file. With HEAD^, bdrv_co_flush() will flush all
children on which a WRITE or WRITE_UNCHANGED permission has been taken.
Thus, vmdk no longer needs to do it itself.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
If the driver does not support .bdrv_co_flush() so bdrv_co_flush()
itself has to flush the children of the given node, it should not flush
just bs->file->bs, but in fact all children that might have been written
to (judging from the permissions taken on them).
This is a bug fix for qcow2 images with an external data file, as they
so far did not flush that data_file node.
In any case, the BLKDBG_EVENT() should be emitted on the primary child,
because that is where a blkdebug node would be if there is any.
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
The condition modified here is not about potentially filtered children,
but only about COW sources (i.e. traditional backing files).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Because of the (not so recent anymore) changes that make the stream job
independent of the base node and instead track the node above it, we
have to split that "bottom" node into two cases: The bottom COW node,
and the node directly above the base node (which may be an R/W filter
or the bottom COW node).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Places that use patterns like
if (bs->drv->is_filter && bs->file) {
... something about bs->file->bs ...
}
should be
BlockDriverState *filtered = bdrv_filter_bs(bs);
if (filtered) {
... something about @filtered ...
}
instead.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
The original purpose of bdrv_is_encrypted() was to inquire whether a BDS
can be used without the user entering a password or not. It has not
been used for that purpose for quite some time.
Actually, it is not even fit for that purpose, because to answer that
question, it would have recursively query all of the given node's
children.
So now we have to decide in which direction we want to fix
bdrv_is_encrypted(): Recursively query all children, or drop it and just
use bs->encrypted to get the current node's status?
Nowadays, its only purpose is to report through bdrv_query_image_info()
whether the given image is encrypted or not. For this purpose, it is
probably more interesting to see whether a given node itself is
encrypted or not (otherwise, a management application cannot discern for
certain which nodes are really encrypted and which just have encrypted
children).
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
In preparation of using multiple IRQ (thus multiple eventfds)
make BDRVNVMeState::irq_notifier an array (for now of a single
element, the admin queue notifier).
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-16-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As we want to do per-queue polling, extract the nvme_poll_queue()
method which operates on a single queue.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-15-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
nvme_create_queue_pair() doesn't require BlockDriverState anymore.
Replace it by BDRVNVMeState and AioContext to simplify.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-14-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
BDRV_POLL_WHILE() is defined as:
#define BDRV_POLL_WHILE(bs, cond) ({ \
BlockDriverState *bs_ = (bs); \
AIO_WAIT_WHILE(bdrv_get_aio_context(bs_), \
cond); })
As we will remove the BlockDriverState use in the next commit,
start by using the exploded version of BDRV_POLL_WHILE().
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-13-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
nvme_init_queue() doesn't require BlockDriverState anymore.
Replace it by BDRVNVMeState to simplify.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-12-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qemu_try_blockalign() is a generic API that call back to the
block driver to return its page alignment. As we call from
within the very same driver, we already know to page alignment
stored in our state. Remove indirections and use the value from
BDRVNVMeState.
This change is required to later remove the BlockDriverState
argument, to make nvme_init_queue() per hardware, and not per
block driver.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-11-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the next commit we'll get rid of qemu_try_blockalign().
To ease review, first replace qemu_try_blockalign0() by explicit
calls to qemu_try_blockalign() and memset().
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-10-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We allocate an unique chunk of memory then use it for two
different structures. By using an union, we make it clear
the data is overlapping (and we can remove the casts).
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-9-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are going to modify the code in the next commit. Renaming
the 'resp' variable to 'id' first makes the next commit easier
to review. No logical changes.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-8-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Rearrange nvme_add_io_queue() by using a common error path.
This will be proven useful in few commits where we add IRQ
notification to the IO queues.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-7-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Do not use the same error message for different failures.
Display a different error whether it is the CQ or the SQ.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-6-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use definitions instead of '0' or '1' indexes. Also this will
be useful when using multi-queues later.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-5-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As nvme_create_queue_pair() is allowed to fail, replace the
alloc() calls by try_alloc() to avoid aborting QEMU.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-4-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Avoid further processing if TRACE_NVME_SUBMIT_COMMAND_RAW is
not enabled. This is an untested intend of performance optimization.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-3-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use self-explicit SCALE_MS definition instead of magic value.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200821195359.1285345-2-philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This makes nbd's connection_co yield during reconnects, so that
reconnect doesn't block the main thread. This is very important in
case of an unavailable nbd server host: connect() call may take a long
time, blocking the main thread (and due to reconnect, it will hang
again and again with small gaps of working time during pauses between
connection attempts).
Realization notes:
- We don't want to implement non-blocking connect() over non-blocking
socket, because getaddrinfo() doesn't have portable non-blocking
realization anyway, so let's just use a thread for both getaddrinfo()
and connect().
- We can't use qio_channel_socket_connect_async (which behaves
similarly and starts a thread to execute connect() call), as it's relying
on someone iterating main loop (g_main_loop_run() or something like
this), which is not always the case.
- We can't use thread_pool_submit_co API, as thread pool waits for all
threads to finish (but we don't want to wait for blocking reconnect
attempt on shutdown.
So, we just create the thread by hand. Some additional difficulties
are:
- We want our connect to avoid blocking drained sections and aio context
switches. To achieve this, we make it possible to "cancel" synchronous
wait for the connect (which is a coroutine yield actually), still,
the thread continues in background, and if successful, its result may be
reused on next reconnect attempt.
- We don't want to wait for reconnect on shutdown, so there is
CONNECT_THREAD_RUNNING_DETACHED thread state, which means that the block
layer is no longer interested in a result, and thread should close new
connected socket on finish and free the state.
How to reproduce the bug, fixed with this commit:
1. Create an image on node1:
qemu-img create -f qcow2 xx 100M
2. Start NBD server on node1:
qemu-nbd xx
3. Start vm with second nbd disk on node2, like this:
./x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
file=/work/images/cent7.qcow2 -drive file=nbd+tcp://192.168.100.2 \
-vnc :0 -qmp stdio -m 2G -enable-kvm -vga std
4. Access the vm through vnc (or some other way?), and check that NBD
drive works:
dd if=/dev/sdb of=/dev/null bs=1M count=10
- the command should succeed.
5. Now, let's trigger nbd-reconnect loop in Qemu process. For this:
5.1 Kill NBD server on node1
5.2 run "dd if=/dev/sdb of=/dev/null bs=1M count=10" in the guest
again. The command should fail and a lot of error messages about
failing disk may appear as well.
Now NBD client driver in Qemu tries to reconnect.
Still, VM works well.
6. Make node1 unavailable on NBD port, so connect() from node2 will
last for a long time:
On node1 (Note, that 10809 is just a default NBD port):
sudo iptables -A INPUT -p tcp --dport 10809 -j DROP
After some time the guest hangs, and you may check in gdb that Qemu
hangs in connect() call, issued from the main thread. This is the
BUG.
7. Don't forget to drop iptables rule from your node1:
sudo iptables -D INPUT -p tcp --dport 10809 -j DROP
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200812145237.4396-1-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: minor wording and formatting tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
The NVM Express specification generally uses 'zeroes' and not 'zeros',
so let us align with it.
Cc: Fam Zheng <fam@euphon.net>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Add missing fields in the Identify Controller and Identify Namespace
data structures to bring them in line with NVMe v1.3.
This also adds data structures and defines for SGL support which
requires a couple of trivial changes to the nvme block driver as well.
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Acked-by: Fam Zheng <fam@euphon.net>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Message-Id: <20200706061303.246057-2-its@irrelevant.dk>
Remove superfluous breaks, as there is a "return" before them.
Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <1594631107-36574-1-git-send-email-wang.yi59@zte.com.cn>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
The qcow2 driver needs the zlib dependency. While emulators
provided it through the migration code, this is not true of
the tools. Move the dependency from the qcow1 rule directly
into block_ss so that it is included unconditionally.
Fixes build with --disable-qcow1.
Reported-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Cc: qemu-block@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move typedef closer to the type check macros, to make it easier
to convert the code to OBJECT_DEFINE_TYPE() in the future.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Tested-By: Roman Bolshakov <r.bolshakov@yadro.com>
Message-Id: <20200825192110.3528606-17-ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
This function is only used by qcow2_expand_zero_clusters() to
downgrade a qcow2 image to a previous version. This would require
transforming all extended L2 entries into normal L2 entries but this
is not a simple task and there are no plans to implement this at the
moment.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <15e65112b4144381b4d8c0bdf8fb76b0d813e3d1.1594396418.git.berto@igalia.com>
[mreitz: Fixed comment style]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Traditional qcow2 images don't allow preallocation if a backing file
is set. This is because once a cluster is allocated there is no way to
tell that its data should be read from the backing file.
Extended L2 entries have individual allocation bits for each
subcluster, and therefore it is perfectly possible to have an
allocated cluster with all its subclusters unallocated.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <6d5b0f38e7dc5f2f31d8cab1cb92044e9909aece.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Now that the implementation of subclusters is complete we can finally
add the necessary options to create and read images with this feature,
which we call "extended L2 entries".
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <6476caaa73216bd05b7bb2d504a20415e1665176.1594396418.git.berto@igalia.com>
[mreitz: %s/5\.1/5.2/; fixed 302's and 303's reference output]
Signed-off-by: Max Reitz <mreitz@redhat.com>
This field allows us to indicate that the L2 metadata update does not
come from a write request with actual data but from a preallocation
request.
For traditional images this does not make any difference, but for
images with extended L2 entries this means that the clusters are
allocated normally in the L2 table but individual subclusters are
marked as unallocated.
This will allow preallocating images that have a backing file.
There is one special case: when we resize an existing image we can
also request that the new clusters are preallocated. If the image
already had a backing file then we have to hide any possible stale
data and zero out the new clusters (see commit 955c7d6687 for more
details).
In this case the subclusters cannot be left as unallocated so the L2
bitmap must be updated.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <960d4c444a4f5a870e2b47e5da322a73cd9a2f5a.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Extended L2 entries are bigger than normal L2 entries so this has an
impact on the amount of metadata needed for a qcow2 file.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <7efae2efd5e36b42d2570743a12576d68ce53685.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
This works now at the subcluster level and pwrite_zeroes_alignment is
updated accordingly.
qcow2_cluster_zeroize() is turned into qcow2_subcluster_zeroize() with
the following changes:
- The request can now be subcluster-aligned.
- The cluster-aligned body of the request is still zeroized using
zero_in_l2_slice() as before.
- The subcluster-aligned head and tail of the request are zeroized
with the new zero_l2_subclusters() function.
There is just one thing to take into account for a possible future
improvement: compressed clusters cannot be partially zeroized so
zero_l2_subclusters() on the head or the tail can return -ENOTSUP.
This makes the caller repeat the *complete* request and write actual
zeroes to disk. This is sub-optimal because
1) if the head area was compressed we would still be able to use
the fast path for the body and possibly the tail.
2) if the tail area was compressed we are writing zeroes to the
head and the body areas, which are already zeroized.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <17e05e2ee7e12f10dcf012da81e83ebe27eb3bef.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The bdrv_co_pwrite_zeroes() call here fills complete clusters with
zeroes, but it can happen that some subclusters are not part of the
write request or the copy-on-write. This patch makes sure that only
the affected subclusters are overwritten.
A potential improvement would be to also fill with zeroes the other
subclusters if we can guarantee that we are not overwriting existing
data. However this would waste more disk space, so we should first
evaluate if it's really worth doing.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <b3dc97e8e2240ddb5191a4f930e8fc9653f94621.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Compressed clusters always have the bitmap part of the extended L2
entry set to 0.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <04455b3de5dfeb9d1cfe1fc7b02d7060a6e09710.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The L2 bitmap needs to be updated after each write to indicate what
new subclusters are now allocated. This needs to happen even if the
cluster was already allocated and the L2 entry was otherwise valid.
In some cases however a write operation doesn't need change the L2
bitmap (because all affected subclusters were already allocated). This
is detected in calculate_l2_meta(), and qcow2_alloc_cluster_link_l2()
is never called in those cases.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <0875620d49f44320334b6a91c73b3f301f975f38.1594396418.git.berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>