Commit Graph

4261 Commits

Author SHA1 Message Date
Vladimir Sementsov-Ogievskiy
54277a2aab block/qed: add missed coroutine_fn markers
qed_read_table and qed_write_table use coroutine-only interfaces but
are not marked coroutine_fn. Happily, they are called only from
coroutine context, so we only need to add missed markers.

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Alberto Garcia
065abf9f2b commit: Make base read-only if there is an early failure
You can reproduce this by passing an invalid filter-node-name (like
"1234") to block-commit. In this case the base image is put in
read-write mode but is never reset back to read-only.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy
f4326aefcf block/stream: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy
08b6261f34 block/commit: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy
607dbdc4e0 block/backup: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy
a4072543cc block/parallels: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy
696e8cb292 block/qed: use buffer-based io
Move to _co_ versions of io functions qed_read_table() and
qed_write_table(), as we use qemu_co_mutex_unlock()
anyway.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy
4ed3e0c486 block/qcow: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy
b00cb15bda block/qcow2: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Alberto Garcia
e1f4a37a49 qcow2: Fix error handling in the compression code
This patch fixes a few things in the way error codes are handled in
the qcow2 compression code:

a) qcow2_co_pwritev_compressed() expects qcow2_co_compress() to only
   return -1 or -2 on failure, but this is not correct. Since the
   change from qcow2_compress() to qcow2_co_compress() in commit
   ceb029cd6f the new code can also return -EINVAL (although
   there does not seem to exist any code path that would cause that
   error in the current implementation).

b) -1 and -2 are ad-hoc error codes defined in qcow2_compress().
   This patch replaces them with standard constants from errno.h.

c) Both qcow2_compress() and qcow2_co_do_compress() return a negative
   value on failure, but qcow2_co_pwritev_compressed() stores the
   value in an unsigned data type.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Kevin Wolf
db04524f82 qcow2: Fix qcow2_make_empty() with external data file
make_completely_empty() is an optimisated path for bdrv_make_empty()
where completely new metadata is created inside the image file instead
of going through all clusters and discarding them. For an external data
file, however, we actually need to do discard operations on the data
file; just overwriting the qcow2 file doesn't get rid of the data.

The necessary slow path with an explicit discard operation already
exists for other cases. Use it for external data files, too.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-04-30 15:29:00 +02:00
Kevin Wolf
718c0fce2f qcow2: Fix full preallocation with external data file
preallocate_co() already gave the data file the full size without
forwarding the requested preallocation mode to the protocol. When
bdrv_co_truncate() was called later with the preallocation mode, the
file didn't actually grow any more, so the data file stayed unallocated
even if full preallocation was requested.

Pass the right preallocation mode to preallocate_co() and remove the
second bdrv_co_truncate() to fix this. As a side effect, the ugly
one-byte write in preallocate_co() is replaced with a truncate call,
now leaving the last block unallocated on the protocol level as it
should be.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-04-30 15:29:00 +02:00
Kevin Wolf
360bd07471 qcow2: Add errp to preallocate_co()
We'll add a bdrv_co_truncate() call in the next patch which can return
an Error that we don't want to discard. So add an errp parameter to
preallocate_co().

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-04-30 15:29:00 +02:00
Kevin Wolf
f29fbf7c6b qcow2: Avoid COW during metadata preallocation
Limiting the allocation to INT_MAX bytes isn't particularly clever
because it means that the final cluster will be a partial cluster which
will be completed through a COW operation. This results in unnecessary
data read and write requests which lead to an unwanted non-sparse
filesystem block for metadata preallocation.

Align the maximum allocation size down to the cluster size to avoid this
situation.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-04-30 15:29:00 +02:00
Eric Blake
de38b5005e qemu-img: Saner printing of large file sizes
Disk sizes close to INT64_MAX cause overflow, for some pretty
ridiculous output:

  $ ./nbdkit -U - memory size=$((2**63 - 512)) --run 'qemu-img info $nbd'
  image: nbd+unix://?socket=/tmp/nbdkitHSAzNz/socket
  file format: raw
  virtual size: -8388607T (9223372036854775296 bytes)
  disk size: unavailable

But there's no reason to have two separate implementations of integer
to human-readable abbreviation, where one has overflow and stops at
'T', while the other avoids overflow and goes all the way to 'E'. With
this patch, the output now claims 8EiB instead of -8388607T, which
really is the correct rounding of largest file size supported by qemu
(we could go 511 bytes larger if we used byte-accurate sizing instead
of rounding up to the next sector boundary, but that wouldn't change
the human-readable result).

Quite a few iotests need updates to expected output to match.

Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Tested-by: Max Reitz <mreitz@redhat.com>
2019-04-30 15:29:00 +02:00
Stefano Garzarella
0cb98af218 block/vhdx: Use IEC binary prefixes for size constants
Using IEC binary prefixes in order to make the code more readable,
with the exception of DEFAULT_LOG_SIZE because it's passed to
stringify().

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Stefano Garzarella
e9991e29ea block/vhdx: Remove redundant IEC binary prefixes definition
IEC binary prefixes are already defined in "qemu/units.h",
so we can remove redundant definitions in "block/vhdx.h".

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Sam Eiderman
7502be838e vmdk: Set vmdk parent backing_format to vmdk
Commit b69864e5a ("vmdk: Support version=3 in VMDK descriptor files")
fixed the probe function to correctly guess vmdk descriptors with
version=3.

This solves the issue where vmdk snapshot with parent vmdk descriptor
containing "version=3" would be treated as raw instead vmdk.

In the future case where a new vmdk version is introduced, we will again
experience this issue, even if the user will provide "-f vmdk" it will
only apply to the tip image and not to the underlying "misprobed" parent
image.

The code in vmdk.c already assumes that the backing file of vmdk must be
vmdk (see vmdk_is_cid_valid which returns 0 if backing file is not
vmdk).

So let's make it official by supplying the backing_format as vmdk.

Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-By: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Shmuel Eiderman <shmuel.eiderman@oracle.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <fam@euphon.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Zhengui li
126734c4f7 vpc: unlock Coroutine lock to make IO submit Concurrently
Concurrent IO becomes serial IO because of the qemu Coroutine lock,
which reduce IO performance severely.

So unlock Coroutine lock before bdrv_co_pwritev and
bdrv_co_preadv to fix it.

Signed-off-by: Zhengui li <lizhengui@huawei.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Markus Armbruster
e1ce7d747b block/qapi: Clean up how we print to monitor or stdout
bdrv_snapshot_dump(), bdrv_image_info_specific_dump(),
bdrv_image_info_dump() and their helpers take an fprintf()-like
callback and a FILE * to pass to it.

hmp.c passes monitor_printf() cast to fprintf_function and the current
monitor cast to FILE *.

qemu-img.c and qemu-io-cmds.c pass fprintf and stdout.

The type-punning is technically undefined behaviour, but works in
practice.  Clean up: drop the callback, and call qemu_printf()
instead.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-8-armbru@redhat.com>
2019-04-18 22:18:59 +02:00
Markus Armbruster
6b3048cee0 block/ssh: Do not report read/write/flush errors to the user
Callbacks ssh_co_readv(), ssh_co_writev(), ssh_co_flush() report
errors to the user with error_printf().  They shouldn't, it's their
caller's job.  Replace by a suitable trace point.  While there, drop
the unreachable !s->sftp case.

Perhaps we should convert this part of the block driver interface to
Error, so block drivers can pass more detail to their callers.  Not
today.

Cc: "Richard W.M. Jones" <rjones@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: qemu-block@nongnu.org
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190417190641.26814-3-armbru@redhat.com>
2019-04-17 21:21:49 +02:00
Kevin Wolf
93e32b3e20 qcow2: Fix preallocation bdrv_pwrite to wrong file
With an external data file, preallocate_co() must write the final byte
to the external data file, not to the qcow2 image file.

This is harmless for preallocation of newly created images (only the
qcow2 file size is increased to the virtual disk size while it should be
much smaller), but with preallocated resize, it could in theory cause
visible corruption if the metadata of the image is larger than the data
(e.g. lots of bitmaps).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-04-16 16:23:24 +02:00
Alberto Garcia
20509c4b8b block: freeze the backing chain earlier in stream_start()
Commit 6585493369 added code to freeze
the backing chain from 'top' to 'base' for the duration of the
block-stream job.

The problem is that the freezing happens too late in stream_start():
during the bdrv_reopen_set_read_only() call earlier in that function
another job can jump in and remove the base image. If that happens we
have an invalid chain and QEMU crashes.

This patch puts the bdrv_freeze_backing_chain() call at the beginning
of the function.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-02 12:04:44 +02:00
Vladimir Sementsov-Ogievskiy
696aaaed57 block/file-posix: do not fail on unlock bytes
bdrv_replace_child() calls bdrv_check_perm() with error_abort on
loosening permissions. However file-locking operations may fail even
in this case, for example on NFS. And this leads to Qemu crash.

Let's avoid such errors. Note, that we ignore such things anyway on
permission update commit and abort.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-02 12:04:44 +02:00
Stefano Garzarella
de23e72bb7 block/gluster: limit the transfer size to 512 MiB
Several versions of GlusterFS (3.12? -> 6.0.1) fail when the
transfer size is greater or equal to 1024 MiB, so we are
limiting the transfer size to 512 MiB to avoid this rare issue.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1691320
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-02 12:04:44 +02:00
Eric Blake
75d34eb98c nbd/client: Trace server noncompliance on structured reads
Just as we recently added a trace for a server sending block status
that doesn't match the server's advertised minimum block alignment,
let's do the same for read chunks.  But since qemu 3.1 is such a
server (because it advertised 512-byte alignment, but when serving a
file that ends in data but is not sector-aligned, NBD_CMD_READ would
detect a mid-sector change between data and hole at EOF and the
resulting read chunks are unaligned), we don't want to change our
behavior of otherwise tolerating unaligned reads.

Note that even though we fixed the server for 4.0 to advertise an
actual block alignment (which gets rid of the unaligned reads at EOF
for posix files), we can still trigger it via other means:

$ qemu-nbd --image-opts driver=blkdebug,align=512,image.driver=file,image.filename=/path/to/non-aligned-file

Arguably, that is a bug in the blkdebug block status function, for
leaking a block status that is not aligned. It may also be possible to
observe issues with a backing layer with smaller alignment than the
active layer, although so far I have been unable to write a reliable
iotest for that scenario.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190330165349.32256-1-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-04-01 08:58:04 -05:00
Eric Blake
4841211e0d block: Add bdrv_get_request_alignment()
The next patch needs access to a device's minimum permitted
alignment, since NBD wants to advertise this to clients. Add
an accessor function, borrowing from blk_get_max_transfer()
for accessing a backend's block limits.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190329042750.14704-6-eblake@redhat.com>
2019-04-01 08:46:52 -05:00
Eric Blake
9cf638508c nbd/client: Support qemu-img convert from unaligned size
If an NBD server advertises a size that is not a multiple of a sector,
the block layer rounds up that size, even though we set info.size to
the exact byte value sent by the server. The block layer then proceeds
to let us read or query block status on the hole that it added past
EOF, which the NBD server is unlikely to be happy with. Fortunately,
qemu as a server never advertizes an unaligned size, so we generally
don't run into this problem; but the nbdkit server makes it easy to
test:

$ printf %1000d 1 > f1
$ ~/nbdkit/nbdkit -fv file f1 & pid=$!
$ qemu-img convert -f raw nbd://localhost:10809 f2
$ kill $pid
$ qemu-img compare f1 f2

Pre-patch, the server attempts a 1024-byte read, which nbdkit
rightfully rejects as going beyond its advertised 1000 byte size; the
conversion fails and the output files differ (not even the first
sector is copied, because qemu-img does not follow ddrescue's habit of
trying smaller reads to get as much information as possible in spite
of errors). Post-patch, the client's attempts to read (and query block
status, for new enough nbdkit) are properly truncated to the server's
length, with sane handling of the hole the block layer forced on
us. Although f2 ends up as a larger file (1024 bytes instead of 1000),
qemu-img compare shows the two images to have identical contents for
display to the guest.

I didn't add iotests coverage since I didn't want to add a dependency
on nbdkit in iotests. I also did NOT patch write, trim, or write
zeroes - these commands continue to fail (usually with ENOSPC, but
whatever the server chose), because we really can't write to the end
of the file, and because 'qemu-img convert' is the most common case
where we care about being tolerant (which is read-only). Perhaps we
could truncate the request if the client is writing zeros to the tail,
but that seems like more work, especially if the block layer is fixed
in 4.1 to track byte-accurate sizing (in which case this patch would
be reverted as unnecessary).

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190329042750.14704-5-eblake@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
2019-04-01 08:32:44 -05:00
Eric Blake
a62a85ef5c nbd/client: Report offsets in bdrv_block_status
It is desirable for 'qemu-img map' to have the same output for a file
whether it is served over file or nbd protocols. However, ever since
we implemented block status for NBD (2.12), the NBD protocol forgot to
inform the block layer that as the final layer in the chain, the
offset is valid; without an offset, the human-readable form of
qemu-img map gives up with the unhelpful:

$ nbdkit -U - data data="1" size=512 --run 'qemu-img map $nbd'
Offset          Length          Mapped to       File
qemu-img: File contains external, encrypted or compressed clusters.

The --output=json form always works, because it is reporting the
lower-level bdrv_block_status results directly rather than trying to
filter out sparse ranges for human consumption - but now it also
shows the offset member.

With this patch, the human output changes to:

Offset          Length          Mapped to       File
0               0x200           0               nbd+unix://?socket=/tmp/nbdkitOxeoLa/socket

This change is observable to several iotests.

Fixes: 78a33ab5
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190329042750.14704-4-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-03-30 20:52:29 -05:00
Eric Blake
7da537f70d nbd/client: Lower min_block for block-status, unaligned size
We have a latent bug in our NBD client code, tickled by the brand new
nbdkit 1.11.10 block status support:

$ nbdkit --filter=log --filter=truncate -U - \
           data data="1" size=511 truncate=64K logfile=/dev/stdout \
           --run 'qemu-img convert $nbd /var/tmp/out'
...
qemu-img: block/io.c:2122: bdrv_co_block_status: Assertion `*pnum && QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset' failed.

The culprit? Our implementation of .bdrv_co_block_status can return
unaligned block status for any server that operates with a lower
actual alignment than what we tell the block layer in
request_alignment, in violation of the block layer's constraints. To
date, we've been unable to trip the bug, because qemu as NBD server
always advertises block sizing (at which point it is a server bug if
the server sends unaligned status - although qemu 3.1 is such a server
and I've sent separate patches for 4.0 both to get the server to obey
the spec, and to let the client to tolerate server oddities at EOF).

But nbdkit does not (yet) advertise block sizing, and therefore is not
in violation of the spec for returning block status at whatever
boundaries it wants, and those unaligned results can occur anywhere
rather than just at EOF. While we are still wise to avoid sending
sub-sector read/write requests to a server of unknown origin, we MUST
consider that a server telling us block status without an advertised
block size is correct.  So, we either have to munge unaligned answers
from the server into aligned ones that we hand back to the block
layer, or we have to tell the block layer about a smaller alignment.

Similarly, if the server advertises an image size that is not
sector-aligned, we might as well assume that the server intends to let
us access those tail bytes, and therefore supports a minimum block
size of 1, regardless of whether the server supports block status
(although we still need more patches to fix the problem that with an
unaligned image, we can send read or block status requests that exceed
EOF to the server). Again, qemu as server cannot trip this problem
(because it rounds images to sector alignment), but nbdkit advertised
unaligned size even before it gained block status support.

Solve both alignment problems at once by using better heuristics on
what alignment to report to the block layer when the server did not
give us something to work with. Note that very few NBD servers
implement block status (to date, only qemu and nbdkit are known to do
so); and as the NBD spec mentioned block sizing constraints prior to
documenting block status, it can be assumed that any future
implementations of block status are aware that they must advertise
block size if they want a minimum size other than 1.

We've had a long history of struggles with picking the right alignment
to use in the block layer, as evidenced by the commit message of
fd8d372d (v2.12) that introduced the current choice of forced 512-byte
alignment.

There is no iotest coverage for this fix, because qemu can't provoke
it, and I didn't want to make test 241 dependent on nbdkit.

Fixes: fd8d372d
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190329042750.14704-3-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
2019-03-30 20:52:19 -05:00
Eric Blake
737d3f5244 nbd-client: Work around server BLOCK_STATUS misalignment at EOF
The NBD spec is clear that a server that advertises a minimum block
size should reply to NBD_CMD_BLOCK_STATUS with extents aligned
accordingly. However, we know that the qemu NBD server implementation
has had a corner-case bug where it is not compliant with the spec,
present since the introduction of NBD_CMD_BLOCK_STATUS in qemu 2.12
(and unlikely to be patched in time for 4.0). Namely, when qemu is
serving a file that is not a multiple of 512 bytes, it rounds the size
advertised over NBD up to the next sector boundary (someday, I'd like
to fix that to be byte-accurate, but it's a much bigger audit not
appropriate for this release); yet if the final sector contains data
prior to EOF, lseek(SEEK_HOLE) will point to the implicit hole
mid-sector which qemu then reported over NBD.

We are well within our rights to hang up on a server that can't follow
the spec, but it is more useful to try and keep the connection alive
in spite of the problem. Do so by tracing a message about the problem,
and then either truncating the request back to an aligned boundary (if
it covered more than the final sector) or widening it out to the full
boundary with a forced status of data (since truncating would result
in 0 bytes, but we have to make progress, and valid since data is a
default-safe answer). And in practice, since the problem only happens
on a sector that starts with data and ends with a hole, we are going
to want to read that full sector anyway (where qemu as the server
fills in the tail beyond EOF with appropriate NUL bytes).

Easy reproduction:
$ printf %1000d 1 > file
$ qemu-nbd -f raw -t file & pid=$!
$ qemu-img map --output=json -f raw nbd://localhost:10809
qemu-img: Could not read file metadata: Invalid argument
$ kill $pid

where the patched version instead succeeds with:
[{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true}]

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190326171317.4036-1-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-03-30 10:06:08 -05:00
Eric Blake
ebd82cd872 nbd: Permit simple error to NBD_CMD_BLOCK_STATUS
The NBD spec is clear that when structured replies are active, a
simple error reply is acceptable to any command except for
NBD_CMD_READ.  However, we were mistakenly requiring structured errors
for NBD_CMD_BLOCK_STATUS, and hanging up on a server that gave a
simple error (since qemu does not behave as such a server, we didn't
notice the problem until now).  Broken since its introduction in
commit 78a33ab5 (v2.12).

Noticed while debugging a separate failure reported by nbdkit while
working out its initial implementation of BLOCK_STATUS, although it
turns out that nbdkit also chose to send structured error replies for
BLOCK_STATUS, so I had to manually provoke the situation by hacking
qemu's server to send a simple error reply:

| diff --git i/nbd/server.c w/nbd/server.c
| index fd013a2817a..833288d7c45 100644
| 00--- i/nbd/server.c
| +++ w/nbd/server.c
| @@ -2269,6 +2269,8 @@ static coroutine_fn int nbd_handle_request(NBDClient *client,
|                                        "discard failed", errp);
|
|      case NBD_CMD_BLOCK_STATUS:
| +        return nbd_co_send_simple_reply(client, request->handle, ENOMEM,
| +                                        NULL, 0, errp);
|          if (!request->len) {
|              return nbd_send_generic_reply(client, request->handle, -EINVAL,
|                                            "need non-zero length", errp);
|

Signed-off-by: Eric Blake <eblake@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
Message-Id: <20190325190104.30213-3-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-03-30 10:06:08 -05:00
Eric Blake
b29f3a3d2a nbd: Don't lose server's error to NBD_CMD_BLOCK_STATUS
When the server replies with a (structured [*]) error to
NBD_CMD_BLOCK_STATUS, without any extent information sent first, the
client code was blindly throwing away the server's error code and
instead telling the caller that EIO occurred.  This has been broken
since its introduction in 78a33ab5 (v2.12, where we should have called:
   error_setg(&local_err, "Server did not reply with any status extents");
   nbd_iter_error(&iter, false, -EIO, &local_err);
to declare the situation as a non-fatal error if no earlier error had
already been flagged, rather than just blindly slamming iter.err and
iter.ret), although it is more noticeable since commit 7f86068d, which
actually tries hard to preserve the server's code thanks to a separate
iter.request_ret.

[*] The spec is clear that the server is also permitted to reply with
a simple error, but that's a separate fix.

I was able to provoke this scenario with a hack to the server, then
seeing whether ENOMEM makes it back to the caller:

| diff --git a/nbd/server.c b/nbd/server.c
| index fd013a2817a..29c7995de02 100644
| --- a/nbd/server.c
| +++ b/nbd/server.c
| @@ -2269,6 +2269,8 @@ static coroutine_fn int nbd_handle_request(NBDClient *client,
|                                        "discard failed", errp);
|
|      case NBD_CMD_BLOCK_STATUS:
| +        return nbd_send_generic_reply(client, request->handle, -ENOMEM,
| +                                      "no status for you today", errp);
|          if (!request->len) {
|              return nbd_send_generic_reply(client, request->handle, -EINVAL,
|                                            "need non-zero length", errp);
| --

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190325190104.30213-2-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-03-30 10:06:08 -05:00
Eric Blake
a39286dd61 nbd: Tolerate some server non-compliance in NBD_CMD_BLOCK_STATUS
The NBD spec states that NBD_CMD_FLAG_REQ_ONE (which we currently
always use) should not reply with an extent larger than our request,
and that the server's response should be exactly one extent. Right
now, that means that if a server sends more than one extent, we treat
the server as broken, fail the block status request, and disconnect,
which prevents all further use of the block device. But while good
software should be strict in what it sends, it should be tolerant in
what it receives.

While trying to implement NBD_CMD_BLOCK_STATUS in nbdkit, we
temporarily had a non-compliant server sending too many extents in
spite of REQ_ONE. Oddly enough, 'qemu-img convert' with qemu 3.1
failed with a somewhat useful message:
  qemu-img: Protocol error: invalid payload for NBD_REPLY_TYPE_BLOCK_STATUS

which then disappeared with commit d8b4bad8, on the grounds that an
error message flagged only at the time of coroutine teardown is
pointless, and instead we should rely on the actual failed API to
report an error - in other words, the 3.1 behavior was masking the
fact that qemu-img was not reporting an error. That has since been
fixed in the previous patch, where qemu-img convert now fails with:
  qemu-img: error while reading block status of sector 0: Invalid argument

But even that is harsh.  Since we already partially relaxed things in
commit acfd8f7a to tolerate a server that exceeds the cap (although
that change was made prior to the NBD spec actually putting a cap on
the extent length during REQ_ONE - in fact, the NBD spec change was
BECAUSE of the qemu behavior prior to that commit), it's not that much
harder to argue that we should also tolerate a server that sends too
many extents.  But at the same time, it's nice to trace when we are
being tolerant of server non-compliance, in order to help server
writers fix their implementations to be more portable (if they refer
to our traces, rather than just stderr).

Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190323212639.579-3-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-03-30 10:06:08 -05:00
Kevin Wolf
738301e117 file-posix: Support BDRV_REQ_NO_FALLBACK for zero writes
We know that the kernel implements a slow fallback code path for
BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
The other operations we call in the context of .bdrv_co_pwrite_zeroes
should usually be quick, so no modification should be needed for them.
If we ever notice that there are additional problematic cases, we can
still make these conditional as well.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
2019-03-26 11:37:51 +01:00
Kevin Wolf
80f5c33ff3 block: Advertise BDRV_REQ_NO_FALLBACK in filter drivers
Filter drivers that support .bdrv_co_pwrite_zeroes can safely advertise
BDRV_REQ_NO_FALLBACK because they just forward the request flags to
their child node.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
2019-03-26 11:37:51 +01:00
Kevin Wolf
fe0480d629 block: Add BDRV_REQ_NO_FALLBACK
For qemu-img convert, we want an operation that zeroes out the whole
image if this can be done efficiently, but that returns an error
otherwise so we don't write explicit zeroes and immediately overwrite
them with the real data, potentially doubling the amount of data to be
written.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
2019-03-26 11:37:51 +01:00
Kevin Wolf
48ce986096 block: Remove error messages in bdrv_make_zero()
There is only a single caller of bdrv_make_zero(), which is qemu-img
convert. If the function fails, we just fall back to a different method
of zeroing out blocks on the target image. There is no good reason to
print error messages on stderr when the higher level operation will
actually succeed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
2019-03-26 11:37:51 +01:00
Markus Armbruster
a9779a3ab0 trace-events: Delete unused trace points
Tracked down with cleanup-trace-events.pl.  Funnies requiring manual
post-processing:

* block.c and blockdev.c trace points are in block/trace-events.

* hw/block/nvme.c uses the preprocessor to hide its trace point use
  from cleanup-trace-events.pl.

* include/hw/xen/xen_common.h trace points are in hw/xen/trace-events.

* net/colo-compare and net/filter-rewriter.c use pseudo trace points
  colo_compare_udp_miscompare and colo_filter_rewriter_debug to guard
  debug code.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20190314180929.27722-5-armbru@redhat.com
Message-Id: <20190314180929.27722-5-armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-03-22 16:18:07 +00:00
Markus Armbruster
500016e5db trace-events: Shorten file names in comments
We spell out sub/dir/ in sub/dir/trace-events' comments pointing to
source files.  That's because when trace-events got split up, the
comments were moved verbatim.

Delete the sub/dir/ part from these comments.  Gets rid of several
misspellings.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20190314180929.27722-3-armbru@redhat.com
Message-Id: <20190314180929.27722-3-armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-03-22 16:18:07 +00:00
Alberto Garcia
782b9d06bf block: Make bdrv_{copy_on_read,crypto_luks,replication} static
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-19 15:49:29 +01:00
Sam Eiderman
b69864e5a8 vmdk: Support version=3 in VMDK descriptor files
Commit 509d39aa22 added support for read
only VMDKs of version 3.

This commit fixes the probe function to correctly handle descriptors of
version 3.

This commit has two effects:
    1. We no longer need to supply '-f vmdk' when pointing to descriptor
       files of version 3 in qemu/qemu-img command line arguments.
    2. This fixes the scenario where a VMDK points to a parent version 3
       descriptor file which is being probed as "raw" instead of "vmdk".

Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Shmuel Eiderman <shmuel.eiderman@oracle.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-19 15:49:29 +01:00
Kevin Wolf
a0cf83639c qcow2: Fix data file error condition in qcow2_co_create()
We were trying to check whether bdrv_open_blockdev_ref() returned
success, but accidentally checked the wrong variable. Spotted by
Coverity (CID 1399703).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2019-03-19 15:49:29 +01:00
Sergio Lopez
5e771752a1 mirror: Confirm we're quiesced only if the job is paused or cancelled
While child_job_drained_begin() calls to job_pause(), the job doesn't
actually transition between states until it runs again and reaches a
pause point. This means bdrv_drained_begin() may return with some jobs
using the node still having 'busy == true'.

As a consequence, block_job_detach_aio_context() may get into a
deadlock, waiting for the job to be actually paused, while the coroutine
servicing the job is yielding and doesn't get the opportunity to get
scheduled again. This situation can be reproduced by issuing a
'block-commit' immediately followed by a 'device_del'.

To ensure bdrv_drained_begin() only returns when the jobs have been
paused, we change mirror_drained_poll() to only confirm it's quiesced
when job->paused == true and there aren't any in-flight requests, except
if we reached that point by a drained section initiated by the
mirror/commit job itself.

The other block jobs shouldn't need any changes, as the default
drained_poll() behavior is to only confirm it's quiesced if the job is
not busy or completed.

Signed-off-by: Sergio Lopez <slp@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-19 15:49:29 +01:00
Peter Maydell
dbbc277510 Pull request
* Add 'drop-cache=on|off' option to file-posix.c.  The default is on.
    Disabling the option fixes a QEMU 3.0.0 performance regression when live
    migrating on the same host with cache.direct=off.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJciOSEAAoJEJykq7OBq3PIVSUIAI6r2Mgoi+no4nle8Jf2nZ+W
 EnQXnNEFyJA0lKRtqQ2UILD9udVdKd/L1PZu5k/Il/Ralto9Yf3+62brekI7rsss
 c3Qusu4LUK6jom2RslRjRIaJ9GilQi/jWezKV/O0VlcsMVemgVHX008EIR+ea1U4
 H0/u2kfu04PciKQ5MR2+6aacu9bfmyH1yM2no+aMN5dDu/38PV6JEsf0Zl2agowg
 opGepJ7YiDQsxH9IBXrbfm38mBrrY0K2vFzAb9BzTHfBPotGMNIZNJNM2FChRfoM
 sTjOIpZz3NDwPEUPQPZxp+7YKRFFYfse1oHtpyh4n1rMQksB019SCGlP9TBhrF0=
 =CH5G
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging

Pull request

 * Add 'drop-cache=on|off' option to file-posix.c.  The default is on.
   Disabling the option fixes a QEMU 3.0.0 performance regression when live
   migrating on the same host with cache.direct=off.

# gpg: Signature made Wed 13 Mar 2019 11:07:48 GMT
# gpg:                using RSA key 9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" [full]
# gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>" [full]
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35  775A 9CA4 ABB3 81AB 73C8

* remotes/stefanha/tags/block-pull-request:
  file-posix: add drop-cache=on|off option

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-03-14 09:34:51 +00:00
Peter Maydell
523a2a42c3 Pull request
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+ber27ys35W+dsvQfe+BBqr8OQ4FAlyIFSwACgkQfe+BBqr8
 OQ7wghAAm16eCEr57oTO7QXR3y8uVFsKqXBn9cNH6nbrFp2PUQSglwMDKBls1Z5m
 olF23X/JaqSlSmkL9BBuzDZ6Up+kkHKuxPq4/5RKXfiDI0pr3R0eqts0COAlaN9q
 Bew3ipj99m8gzMi2093AW4+Ob0N3658fuDTGLe1M1Uoy7CEg1QJ7rVOBBEui7vIl
 RbZ8l/Zmb4ldNpB3lnE4Nh9ue8fy0RAj3Nai161nCnNeXNF/VzD3Ye8bojSBbnux
 PIMX6/RWmykX4feIf9QP8apDpxX4HkyuPq5EdwT9PD8PwdyXPAXZtsYUNCuNtQuk
 n5VKFVgFYgqUclBeVHmrMYPU4K4iCFQp4/Fua7wzPEC0iG05NiiDv91oVkEJCp3L
 ManHeuGfNLCcXaIntKZhuJl1cK8yMM3yDww6/pPTehrPjcyvKa0NOqhQBExektcD
 R6q7maJRzFaxSxdcs+Zzuog9zESvH1mlJxQCKzeYhAP0kkxInyTELE/Vbx37xuqR
 RFfZYyVQ6x87Q/sxHx4EMiV97WUM8elZOQdSEC/okt5WUUNpgIu0WF9nSQ1VKZ8C
 CZmv5xh9ogfwvB/kOm6IVwNkLvVagJQcLwddORI5LLXLbSIUcuwVSuyMp/7iDtQ/
 hnHkGs2mIJ2JUYbSSNsSJNs6oTurn8eTFCeGoYKJgd9l4QxaThU=
 =ekU+
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/jnsnow/tags/bitmaps-pull-request' into staging

Pull request

# gpg: Signature made Tue 12 Mar 2019 20:23:08 GMT
# gpg:                using RSA key F9B7ABDBBCACDF95BE76CBD07DEF8106AAFC390E
# gpg: Good signature from "John Snow (John Huston) <jsnow@redhat.com>" [full]
# Primary key fingerprint: FAEB 9711 A12C F475 812F  18F2 88A9 064D 1835 61EB
#      Subkey fingerprint: F9B7 ABDB BCAC DF95 BE76  CBD0 7DEF 8106 AAFC 390E

* remotes/jnsnow/tags/bitmaps-pull-request: (22 commits)
  tests/qemu-iotests: add bitmap resize test 246
  block/qcow2-bitmap: Allow resizes with persistent bitmaps
  block/qcow2-bitmap: Don't check size for IN_USE bitmap
  docs/interop/qcow2: Improve bitmap flag in_use specification
  bitmaps: Fix typo in function name
  block/dirty-bitmaps: implement inconsistent bit
  block/dirty-bitmaps: disallow busy bitmaps as merge source
  block/dirty-bitmaps: prohibit removing readonly bitmaps
  block/dirty-bitmaps: prohibit readonly bitmaps for backups
  block/dirty-bitmaps: add block_dirty_bitmap_check function
  block/dirty-bitmap: add inconsistent status
  block/dirty-bitmaps: add inconsistent bit
  iotests: add busy/recording bit test to 124
  blockdev: remove unused paio parameter documentation
  block/dirty-bitmaps: move comment block
  block/dirty-bitmaps: unify qmp_locked and user_locked calls
  block/dirty-bitmap: explicitly lock bitmaps with successors
  nbd: change error checking order for bitmaps
  block/dirty-bitmap: change semantics of enabled predicate
  block/dirty-bitmap: remove set/reset assertions against enabled bit
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

# Conflicts:
#	tests/qemu-iotests/group
2019-03-13 17:30:34 +00:00
Stefan Hajnoczi
f357fcd890 file-posix: add drop-cache=on|off option
Commit dd577a26ff ("block/file-posix:
implement bdrv_co_invalidate_cache() on Linux") introduced page cache
invalidation so that cache.direct=off live migration is safe on Linux.

The invalidation takes a significant amount of time when the file is
large and present in the page cache.  Normally this is not the case for
cross-host live migration but it can happen when migrating between QEMU
processes on the same host.

On same-host migration we don't need to invalidate pages for correctness
anyway, so an option to skip page cache invalidation is useful.  I
investigated optimizing invalidation and detecting same-host migration,
but both are hard to achieve so a user-visible option will suffice.

As a bonus this option means that the cache invalidation feature will
now be detectable by libvirt via QMP schema introspection.

Suggested-by: Neil Skrypuch <neil@tembosocial.com>
Tested-by: Neil Skrypuch <neil@tembosocial.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190307164941.3322-1-stefanha@redhat.com
Message-Id: <20190307164941.3322-1-stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-03-13 10:54:55 +00:00
Alberto Garcia
5019aece2a block: Remove the AioContext parameter from bdrv_reopen_multiple()
This parameter has been unused since 1a63a90750

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Alberto Garcia
8a2ce0bc1e block: Add a 'mutable_opts' field to BlockDriver
If we reopen a BlockDriverState and there is an option that is present
in bs->options but missing from the new set of options then we have to
return an error unless the driver is able to reset it to its default
value.

This patch adds a new 'mutable_opts' field to BlockDriver. This is
a list of runtime options that can be modified during reopen. If an
option in this list is unspecified on reopen then it must be reset (or
return an error).

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Alberto Garcia
077e8e2018 block: Add 'keep_old_opts' parameter to bdrv_reopen_queue()
The bdrv_reopen_queue() function is used to create a queue with
the BDSs that are going to be reopened and their new options. Once
the queue is ready bdrv_reopen_multiple() is called to perform the
operation.

The original options from each one of the BDSs are kept, with the new
options passed to bdrv_reopen_queue() applied on top of them.

For "x-blockdev-reopen" we want a function that behaves much like
"blockdev-add". We want to ignore the previous set of options so that
only the ones actually specified by the user are applied, with the
rest having their default values.

One of the things that we need is a way to tell bdrv_reopen_queue()
whether we want to keep the old set of options or not, and that's what
this patch does. All current callers are setting this new parameter to
true and x-blockdev-reopen will set it to false.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00