Commit Graph

80 Commits

Author SHA1 Message Date
Kevin Wolf
06dc9bd571 file-posix: Avoid aio_worker() for QEMU_AIO_FLUSH
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf
46ee0f462b file-posix: Avoid aio_worker() for QEMU_AIO_DISCARD
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf
7154d8ae66 file-posix: Avoid aio_worker() for QEMU_AIO_WRITE_ZEROES
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf
58a209c437 file-posix: Avoid aio_worker() for QEMU_AIO_COPY_RANGE
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf
29cb4c01e7 file-posix: Avoid aio_worker() for QEMU_AIO_TRUNCATE
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf
5d5de25005 file-posix: Factor out raw_thread_pool_submit()
Getting the thread pool of the AioContext of a block node and scheduling
some work in it is an operation that is already done twice, and we'll
get more instances. Factor it out into a separate function.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf
d57c44d00f file-posix: Reorganise RawPosixAIOData
RawPosixAIOData contains a lot of fields for several separate operations
that are to be processed in a worker thread and that need different
parameters. The struct is currently rather unorganised, with unions that
cover some, but not all operations, and even one #define for field names
instead of a union.

Clean this up to have some common fields and a single union. As a side
effect, on x86_64 the struct shrinks from 72 to 48 bytes.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Max Reitz
577a133988 file-posix: Fix shared locks on reopen commit
s->locked_shared_perm is the set of bits locked in the file, which is
the inverse of the permissions actually shared.  So we need to pass them
as they are to raw_apply_lock_bytes() instead of inverting them again.

Reported-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-11-19 14:32:01 +01:00
Fam Zheng
f2e3af29b7 file-posix: Drop s->lock_fd
The lock_fd field is not strictly necessary because transferring locked
bytes from old fd to the new one shouldn't fail anyway. This spares the
user one fd per image.

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-11-12 17:46:57 +01:00
Fam Zheng
2996ffad3a file-posix: Skip effectiveless OFD lock operations
If we know we've already locked the bytes, don't do it again; similarly
don't unlock a byte if we haven't locked it. This doesn't change the
behavior, but fixes a corner case explained below.

Libvirt had an error handling bug that an image can get its (ownership,
file mode, SELinux) permissions changed (RHBZ 1584982) by mistake behind
QEMU. Specifically, an image in use by Libvirt VM has:

    $ ls -lhZ b.img
    -rw-r--r--. qemu qemu system_u:object_r:svirt_image_t:s0:c600,c690 b.img

Trying to attach it a second time won't work because of image locking.
And after the error, it becomes:

    $ ls -lhZ b.img
    -rw-r--r--. root root system_u:object_r:virt_image_t:s0 b.img

Then, we won't be able to do OFD lock operations with the existing fd.
In other words, the code such as in blk_detach_dev:

    blk_set_perm(blk, 0, BLK_PERM_ALL, &error_abort);

can abort() QEMU, out of environmental changes.

This patch is an easy fix to this and the change is regardlessly
reasonable, so do it.

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-11-12 17:46:57 +01:00
Fam Zheng
db0754df88 file-posix: Use error API properly
Use error_report for situations that affect user operation (i.e.  we're
actually returning error), and warn_report/warn_report_err when some
less critical error happened but the user operation can still carry on.

For raw_normalize_devicepath, add Error parameter to propagate to
its callers.

Suggested-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-11-12 17:46:57 +01:00
Kevin Wolf
64107dc044 file-posix: Support auto-read-only option
If read-only=off, but auto-read-only=on is given, open the file
read-write if we have the permissions, but instead of erroring out for
read-only files, just degrade to read-only.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2018-11-05 15:09:55 +01:00
Alberto Garcia
8d3245750b file-posix: Forbid trying to change unsupported options during reopen
The file-posix code is used for the "file", "host_device" and
"host_cdrom" drivers, and it allows reopening images. However the only
option that is actually processed is "x-check-cache-dropped", and
changes in all other options (e.g. "filename") are silently ignored:

   (qemu) qemu-io virtio0 "reopen -o file.filename=no-such-file"

While we could allow changing some of the other options, let's keep
things as they are for now but return an error if the user tries to
change any of them.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-10-01 12:51:12 +02:00
Alberto Garcia
589f20dccd file-posix: x-check-cache-dropped should default to false on reopen
The default value of x-check-cache-dropped is false. There's no reason
to use the previous value as a default in raw_reopen_prepare() because
bdrv_reopen_queue_child() already takes care of putting the old
options in the BDRVReopenState.options QDict.

If x-check-cache-dropped was previously set but is now missing from
the reopen QDict then it should be reset to false.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-10-01 12:51:11 +02:00
Fam Zheng
b857431d2a file-posix: Include filename in locking error message
Image locking errors happening at device initialization time doesn't say
which file cannot be locked, for instance,

    -device scsi-disk,drive=drive-1: Failed to get shared "write" lock
    Is another process using the image?

could refer to either the overlay image or its backing image.

Hoist the error_append_hint to the caller of raw_check_lock_bytes where
file name is known, and include it in the error hint.

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-10-01 12:51:11 +02:00
Kevin Wolf
34fa110e42 file-posix: Fix write_zeroes with unmap on block devices
The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
all-zero afterwards, so don't try to abuse it for zero writing. We try
to only use this if BLKDISCARDZEROES tells us that it is safe, but this
is unreliable on older kernels and a constant 0 in newer kernels. In
other words, this code path is never actually used with newer kernels,
so we don't even try to unmap while writing zeros.

This patch removes the abuse of discard for writing zeroes from
file-posix and instead adds a new function that uses interfaces that are
actually meant to deallocate and zero out at the same time. Only if
those fail, it falls back to zeroing out without unmap. We never fall
back to a discard operation any more that may or may not result in
zeros.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-30 15:35:37 +02:00
Fam Zheng
a1c81f4f16 file-posix: Handle EINTR in preallocation=full write
Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-30 15:35:37 +02:00
Nishanth Aravamudan
042b757cc7 block/file-posix: add bdrv_attach_aio_context callback for host dev and cdrom
In ed6e2161 ("linux-aio: properly bubble up errors from initialzation"),
I only added a bdrv_attach_aio_context callback for the bdrv_file
driver. There are several other drivers that use the shared
aio_plug callback, though, and they will trip the assertion added to
aio_get_linux_aio because they did not call aio_setup_linux_aio first.
Add the appropriate callback definition to the affected driver
definitions.

Fixes: ed6e2161 ("linux-aio: properly bubble up errors from initialization")
Reported-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180718211256.29774-1-naravamudan@digitalocean.com
Cc: Eric Blake <eblake@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: John Snow <jsnow@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <famz@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-block@nongnu.org
Cc: qemu-devel@nongnu.org
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-07-24 14:27:41 +01:00
John Snow
230ff73904 file-posix: specify expected filetypes
Adjust each caller of raw_open_common to specify if they are expecting
host and character devices or not. Tighten expectations of file types upon
open in the common code and refuse types that are not expected.

This has two effects:

(1) Character and block devices are now considered deprecated for the
    'file' driver, which expects only S_IFREG, and
(2) no file-posix driver (file, host_cdrom, or host_device) can open
    directories now.

I don't think there's a legitimate reason to open directories as if
they were files. This prevents QEMU from opening and attempting to probe
a directory inode, which can break in exciting ways. One of those ways
is lseek on ext4/xfs, which will return 0x7fffffffffffffff as the file
size instead of EISDIR. This can coax QEMU into responding with a
confusing "file too big" instead of "Hey, that's not a file".

See: https://bugs.launchpad.net/qemu/+bug/1739304/
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-12 18:24:08 +02:00
Peter Maydell
7851f1a706 Block layer patches:
- Copy offloading fixes for when the copy increases the image size
 - Temporary revert of the removal of deprecated -drive options
 - Fix request serialisation in the image fleecing scenario
 - Fix copy-on-read crash with unaligned image size
 - Fix another drain crash
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbRNLQAAoJEH8JsnLIjy/WOaQQALlZk01JohETuwGG6HGl0LdI
 jEEm+N0J+BlGOVjoGU67OKGidUCl5WvBsQTlyYkmlaToGuk/njWxCa/GA6+iNRnt
 MDq7Ovr8uZI3D+0Fuc6xg/6NBiLkukgh0Q9gMWkzn3jaNWzO2WcTr8WXwepvP6sj
 YtPhEQOXTT3sXf/MFY8ig7qRrZ6f7LFOoKu7LMnrD+QWDo8TY3QLZaxP9OUFHH7S
 A6J0LIfuRZlq79a7SgrRkCR2ddtgYyBQ+zD7PD5kf1vLW4+dOhDOutQEsZCMCPgR
 ft99kNhrZcJGN6n2r8/oVcvRkw5c4I1JPgakm/GoW/NllfPMebuPospKaS4wiJnB
 zI4YOtmco4Mfxkw/wK+Ep/bPCpxEF43uDcpPiEjsNADrdLq0eKnPn5ctwSyWlGvn
 ayQWxDoKoYckn/ccjtLxJ2xPws8433cTXrBdIKnJadWxi3iRNzlIKHRuEfXf9zQt
 G+Nq7ruysT9TPf9ifuCHcZnTsi3SLYLsjCj7pAgBkazBYE2cCI3eKN8kxsDJi7qv
 cWzFCpwE28pHRJ6FwtdzBVkNcfTlC/XopR1M66OzYZlLqR/4hbNhyHL3hBV+yfrM
 fC7mPi81ttI6e+JAgC6K8t3Ey242MjSzUYa7pJUNws7RpqUhfhr6EXXbBceJKsVW
 F8qKZoiIEK7wDacUiEiE
 =FXOo
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches:

- Copy offloading fixes for when the copy increases the image size
- Temporary revert of the removal of deprecated -drive options
- Fix request serialisation in the image fleecing scenario
- Fix copy-on-read crash with unaligned image size
- Fix another drain crash

# gpg: Signature made Tue 10 Jul 2018 16:37:52 BST
# gpg:                using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream: (24 commits)
  block: Use common write req handling in truncate
  block: Fix bdrv_co_truncate overlap check
  block: Use common req handling in copy offloading
  block: Use common req handling for discard
  block: Fix handling of image enlarging write
  block: Extract common write req handling
  block: Use uint64_t for BdrvTrackedRequest byte fields
  block: Use BdrvChild to discard
  block: Add copy offloading trace points
  block: Prefix file driver trace points with "file_"
  Revert "block: Remove deprecated -drive geometry options"
  Revert "block: Remove deprecated -drive option addr"
  Revert "block: Remove deprecated -drive option serial"
  Revert "block: Remove dead deprecation warning code"
  block/blklogwrites: Make sure the log sector size is not too small
  qapi/block-core.json: Add missing documentation for blklogwrites log-append option
  block/backup: fix fleecing scheme: use serialized writes
  block: add BDRV_REQ_SERIALISING flag
  block: split flags in copy_range
  block/io: fix copy_range
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-07-10 17:28:29 +01:00
Fam Zheng
ecc983a507 block: Add copy offloading trace points
A few trace points that can help reveal what is happening in a copy
offloading I/O path.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:01:52 +02:00
Fam Zheng
f8a30874ca block: Prefix file driver trace points with "file_"
With in one module, trace points usually have a common prefix named
after the module name. paio_submit and paio_submit_co are the only two
trace points so far in the two file protocol drivers. As we are adding
more, having a common prefix here is better so that trace points can be
enabled with a glob. Rename them.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:01:51 +02:00
Vladimir Sementsov-Ogievskiy
67b51fb998 block: split flags in copy_range
Pass read flags and write flags separately. This is needed to handle
coming BDRV_REQ_NO_SERIALISING clearly in following patches.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 13:04:25 +02:00
Fam Zheng
9f850f67ad file-posix: Fix fd_open check in raw_co_copy_range_to
One of them is a typo. But update both to be more readable.

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180702025836.20957-3-famz@redhat.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-07-09 19:43:24 +02:00
Max Reitz
7c20c808a5 file-posix: Unlock FD after creation
Closing the FD does not necessarily mean that it is unlocked.  Fix this
by relinquishing all permission locks before qemu_close().

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-05 11:07:58 +02:00
Max Reitz
d815efcaf0 file-posix: Fix creation locking
raw_apply_lock_bytes() takes a bit mask of "permissions that are NOT
shared".

Also, make the "perm" and "shared" variables uint64_t, because I do not
particularly like using ~ on signed integers (and other permission masks
are usually uint64_t, too).

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-05 11:07:58 +02:00
Fam Zheng
c436e3d014 file-posix: Fix EINTR handling
EINTR should be checked against errno, not ret. While fixing the bug,
collect the branches with a switch block.

Also, change the return value from -ENOSTUP to -ENOSPC when the actual
issue is request range passes EOF, which should be distinguishable from
the case of error == ENOSYS by the caller, so that it could still retry
with other byte ranges, whereas it shouldn't retry anymore upon ENOSYS.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Kevin Wolf
33d70fb6fa file-posix: Implement co versions of discard/flush
This simplifies file-posix by implementing the coroutine variants of
the discard and flush BlockDriver callbacks. These were the last
remaining users of paio_submit(), which can be removed now.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2018-06-29 14:20:56 +02:00
Kevin Wolf
93f4e2ff4b file-posix: Make .bdrv_co_truncate asynchronous
This moves the code to resize an image file to the thread pool to avoid
blocking.

Creating large images with preallocation with blockdev-create is now
actually a background job instead of blocking the monitor (and most
other things) until the preallocation has completed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29 14:20:56 +02:00
Kevin Wolf
061ca8a368 block: Convert .bdrv_truncate callback to coroutine_fn
bdrv_truncate() is an operation that can block (even for a quite long
time, depending on the PreallocMode) in I/O paths that shouldn't block.
Convert it to a coroutine_fn so that we have the infrastructure for
drivers to make their .bdrv_co_truncate implementation asynchronous.

This change could potentially introduce new race conditions because
bdrv_truncate() isn't necessarily executed atomically any more. Whether
this is a problem needs to be evaluated for each block driver that
supports truncate:

* file-posix/win32, gluster, iscsi, nfs, rbd, ssh, sheepdog: The
  protocol drivers are trivially safe because they don't actually yield
  yet, so there is no change in behaviour.

* copy-on-read, crypto, raw-format: Essentially just filter drivers that
  pass the request to a child node, no problem.

* qcow2: The implementation modifies metadata, so it needs to hold
  s->lock to be safe with concurrent I/O requests. In order to avoid
  double locking, this requires pulling the locking out into
  preallocate_co() and using qcow2_write_caches() instead of
  bdrv_flush().

* qed: Does a single header update, this is fine without locking.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29 14:20:56 +02:00
Nishanth Aravamudan
ed6e216171 linux-aio: properly bubble up errors from initialization
laio_init() can fail for a couple of reasons, which will lead to a NULL
pointer dereference in laio_attach_aio_context().

To solve this, add a aio_setup_linux_aio() function which is called
early in raw_open_common. If this fails, propagate the error up. The
signature of aio_get_linux_aio() was not modified, because it seems
preferable to return the actual errno from the possible failing
initialization calls.

Additionally, when the AioContext changes, we need to associate a
LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context
callback and call the new aio_setup_linux_aio(), which will allocate a
new AioContext if needed, and return errors on failures. If it fails for
any reason, fallback to threaded AIO with an error message, as the
device is already in-use by the guest.

Add an assert that aio_get_linux_aio() cannot return NULL.

Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>
Message-id: 20180622193700.6523-1-naravamudan@digitalocean.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-27 13:06:34 +01:00
Max Reitz
b8cf1913a9 block/file-posix: File locking during creation
When creating a file, we should take the WRITE and RESIZE permissions.
We do not need either for the creation itself, but we do need them for
clearing and resizing it.  So we can take the proper permissions by
replacing O_TRUNC with an explicit truncation to 0, and by taking the
appropriate file locks between those two steps.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180509215336.31304-3-mreitz@redhat.com
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-11 16:18:45 +02:00
Max Reitz
d0a96155de block/file-posix: Pass FD to locking helpers
raw_apply_lock_bytes() and raw_check_lock_bytes() currently take a
BDRVRawState *, but they only use the lock_fd field.  During image
creation, we do not have a BDRVRawState, but we do have an FD; so if we
want to reuse the functions there, we should modify them to receive only
the FD.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180509215336.31304-2-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-11 16:18:45 +02:00
Fam Zheng
1efad060d7 file-posix: Implement bdrv_co_copy_range
With copy_file_range(2), we can implement the bdrv_co_copy_range
semantics.

Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180601092648.24614-6-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-01 14:41:48 +01:00
Stefan Hajnoczi
31be8a2a97 block/file-posix: add x-check-page-cache=on|off option
mincore(2) checks whether pages are resident.  Use it to verify that
page cache has been dropped.

You can trigger a verification failure by mmapping the image file from
another process that loads a byte from a page, forcing it to become
resident.  bdrv_co_invalidate_cache() will fail while that process is
alive.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180427162312.18583-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-05-11 16:43:05 +01:00
Stefan Hajnoczi
dd577a26ff block/file-posix: implement bdrv_co_invalidate_cache() on Linux
On Linux posix_fadvise(POSIX_FADV_DONTNEED) invalidates pages*.  Use
this to drop page cache on the destination host during shared storage
migration.  This way the destination host will read the latest copy of
the data and will not use stale data from the page cache.

The flow is as follows:

1. Source host writes out all dirty pages and inactivates drives.
2. QEMU_VM_EOF is sent on migration stream.
3. Destination host invalidates caches before accessing drives.

This patch enables live migration even with -drive cache.direct=off.

* Terms and conditions may apply, please see patch for details.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180427162312.18583-2-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-05-11 16:43:05 +01:00
Max Reitz
82b45e0a0b block/file-posix: Fix fully preallocated truncate
Storing the lseek() result in an int results in it overflowing when the
file is at least 2 GB big.  Then, we have a 50 % chance of the result
being "negative" and thus thinking an error occurred when actually
everything went just fine.

So we should use the correct type for storing the result: off_t.

Reported-by: Daniel P. Berrange <berrange@redhat.com>
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1549231
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180228131315.30194-2-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-04-03 17:39:37 +02:00
Jeff Cody
a03083a017 block: handle invalid lseek returns gracefully
In commit 223a23c198, we implemented a
workaround in the gluster driver to handle invalid values returned for
SEEK_DATA or SEEK_HOLE.

In some instances, these same invalid values can be seen in the posix
file handler as well - for example, it has been reported on FUSE gluster
mounts.

Calling assert() for these invalid values is overly harsh; we can safely
return -EIO and allow this case to be treated as a "learned nothing"
case (e.g., D4 / H4, as commented in the code).

This patch does the same thing that 223a23c198 did for gluster.c,
except in file-posix.c

Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-04-03 15:25:17 +02:00
Kevin Wolf
89b259eeaa file-posix: Fix no-op bdrv_truncate() with falloc preallocation
If bdrv_truncate() is called, but the requested size is the same as
before, don't call posix_fallocate(), which returns -EINVAL for length
zero and would therefore make bdrv_truncate() fail.

The problem can be triggered by creating a zero-sized raw image with
'falloc' preallocation mode.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2018-03-09 15:17:48 +01:00
Kevin Wolf
927f11e131 file-posix: Support .bdrv_co_create
This adds the .bdrv_co_create driver callback to file, which enables
image creation over QMP.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2018-03-09 15:17:47 +01:00
Stefan Hajnoczi
efc75e2a4c block: rename .bdrv_create() to .bdrv_co_create_opts()
BlockDriver->bdrv_create() has been called from coroutine context since
commit 5b7e1542cf ("block: make
bdrv_create adopt coroutine").

Make this explicit by renaming to .bdrv_co_create_opts() and add the
coroutine_fn annotation.  This makes it obvious to block driver authors
that they may yield, use CoMutex, or other coroutine_fn APIs.
bdrv_co_create is reserved for the QAPI-based version that Kevin is
working on.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20170705102231.20711-2-stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-03-02 18:39:07 +01:00
Eric Blake
a290f08590 file-posix: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the file protocol driver accordingly.

In want_zero mode, we continue to report fine-grained hole
information (the caller wants as much mapping detail as possible);
but when not in that mode, the caller prefers larger *pnum and
merely cares about what offsets are allocated at this layer, rather
than where the holes live.  Since holes still read as zeroes at
this layer (rather than deferring to a backing layer), we can take
the shortcut of skipping lseek(), and merely state that all bytes
are allocated.

We can also drop redundant bounds checks that are already
guaranteed by the block layer.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-03-02 18:39:07 +01:00
Eric Blake
e24d813b29 block: Simplify bdrv_can_write_zeroes_with_unmap()
We don't need the can_write_zeroes_with_unmap field in
BlockDriverInfo, because it is redundant information with
supported_zero_flags & BDRV_REQ_MAY_UNMAP.  Note that
BlockDriverInfo and supported_zero_flags are both per-device
settings, rather than global state about the driver as a
whole, which means one or both of these bits of information
can already be conditional.  Let's audit how they were set:

crypto: always setting can_write_ to false is pointless (the
struct starts life zero-initialized), no use of supported_

nbd: just recently fixed to set can_write_ if supported_
includes MAY_UNMAP (thus this commit effectively reverts
bca80059e and solves the problem mentioned there in a more
global way)

file-posix, iscsi, qcow2: can_write_ is conditional, while
supported_ was unconditional; but passing MAY_UNMAP would
fail with ENOTSUP if the condition wasn't met

qed: can_write_ is unconditional, but pwrite_zeroes lacks
support for MAY_UNMAP and supported_ is not set. Perhaps
support can be added later (since it would be similar to
qcow2), but for now claiming false is no real loss

all other drivers: can_write_ is not set, and supported_ is
either unset or a passthrough

Simplify the code by moving the conditional into
supported_zero_flags for all drivers, then dropping the
now-unused BDI field.  For callers that relied on
bdrv_can_write_zeroes_with_unmap(), we return the same
per-device settings for drivers that had conditions (no
observable change in behavior there); and can now return
true (instead of false) for drivers that support passthrough
(for example, the commit driver) which gives those drivers
the same fix as nbd just got in bca80059e.  For callers that
relied on supported_zero_flags, we now have a few more places
that can avoid a wasted call to pwrite_zeroes() that will
just fail with ENOTSUP.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180126193439.20219-1-eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-02-09 12:32:44 -06:00
Markus Armbruster
922a01a013 Move include qemu/option.h from qemu-common.h to actual users
qemu-common.h includes qemu/option.h, but most places that include the
former don't actually need the latter.  Drop the include, and add it
to the places that actually need it.

While there, drop superfluous includes of both headers, and
separate #include from file comment with a blank line.

This cleanup makes the number of objects depending on qemu/option.h
drop from 4545 (out of 4743) to 284 in my "build everything" tree.

Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20180201111846.21846-20-armbru@redhat.com>
[Semantic conflict with commit bdd6a90a9e in block/nvme.c resolved]
2018-02-09 13:52:16 +01:00
Markus Armbruster
452fcdbc49 Include qapi/qmp/qdict.h exactly where needed
This cleanup makes the number of objects depending on qapi/qmp/qdict.h
drop from 4550 (out of 4743) to 368 in my "build everything" tree.
For qapi/qmp/qobject.h, the number drops from 4552 to 390.

While there, separate #include from file comment with a blank line.

Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20180201111846.21846-13-armbru@redhat.com>
2018-02-09 13:52:15 +01:00
Fam Zheng
97ec9117c3 file-posix: Clear out first sector in hdev_create
People get surprised when, after "qemu-img create -f raw /dev/sdX", they
still see qcow2 with "qemu-img info", if previously the bdev had a qcow2
header. While this is natural because raw doesn't need to write any
magic bytes during creation, hdev_create is free to clear out the first
sector to make sure the stale qcow2 header doesn't cause such confusion.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2017-09-26 14:46:23 +02:00
Paolo Bonzini
7c9e527659 scsi, file-posix: add support for persistent reservation management
It is a common requirement for virtual machine to send persistent
reservations, but this currently requires either running QEMU with
CAP_SYS_RAWIO, or using out-of-tree patches that let an unprivileged
QEMU bypass Linux's filter on SG_IO commands.

As an alternative mechanism, the next patches will introduce a
privileged helper to run persistent reservation commands without
expanding QEMU's attack surface unnecessarily.

The helper is invoked through a "pr-manager" QOM object, to which
file-posix.c passes SG_IO requests for PERSISTENT RESERVE OUT and
PERSISTENT RESERVE IN commands.  For example:

  $ qemu-system-x86_64
      -device virtio-scsi \
      -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock
      -drive if=none,id=hd,driver=raw,file.filename=/dev/sdb,file.pr-manager=helper0
      -device scsi-block,drive=hd

or:

  $ qemu-system-x86_64
      -device virtio-scsi \
      -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock
      -blockdev node-name=hd,driver=raw,file.driver=host_device,file.filename=/dev/sdb,file.pr-manager=helper0
      -device scsi-block,drive=hd

Multiple pr-manager implementations are conceivable and possible, though
only one is implemented right now.  For example, a pr-manager could:

- talk directly to the multipath daemon from a privileged QEMU
  (i.e. QEMU links to libmpathpersist); this makes reservation work
  properly with multipath, but still requires CAP_SYS_RAWIO

- use the Linux IOC_PR_* ioctls (they require CAP_SYS_ADMIN though)

- more interestingly, implement reservations directly in QEMU
  through file system locks or a shared database (e.g. sqlite)

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-09-22 01:06:51 +02:00
Marc-André Lureau
f7abe0ecd4 qapi: Change data type of the FOO_lookup generated for enum FOO
Currently, a FOO_lookup is an array of strings terminated by a NULL
sentinel.

A future patch will generate enums with "holes".  NULL-termination
will cease to work then.

To prepare for that, store the length in the FOO_lookup by wrapping it
in a struct and adding a member for the length.

The sentinel will be dropped next.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20170822132255.23945-13-marcandre.lureau@redhat.com>
[Basically redone]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <1503564371-26090-16-git-send-email-armbru@redhat.com>
[Rebased]
2017-09-04 13:09:13 +02:00
Markus Armbruster
977c736f80 qapi: Mechanically convert FOO_lookup[...] to FOO_str(...)
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <1503564371-26090-14-git-send-email-armbru@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
2017-09-04 13:09:13 +02:00
Markus Armbruster
5b5f825d44 qapi: Generate FOO_str() macro for QAPI enum FOO
The next commit will put it to use.  May look pointless now, but we're
going to change the FOO_lookup's type, and then it'll help.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <1503564371-26090-13-git-send-email-armbru@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
2017-09-04 13:09:13 +02:00