Locking was introduced in QEMU 2.7 to address the deprecation of
readdir_r(3) in glibc 2.24. It turns out that the frontend code is
the worst place to handle a critical section with a pthread mutex:
the code runs in a coroutine on behalf of the QEMU mainloop and then
yields control, waiting for the fsdev backend to process the request
in a worker thread. If the client resends another readdir request for
the same fid before the previous one finally unlocked the mutex, we're
deadlocked.
This never bit us because the linux client serializes readdir requests
for the same fid, but it is quite easy to demonstrate with a custom
client.
A good solution could be to narrow the critical section in the worker
thread code and to return a copy of the dirent to the frontend, but
this causes quite some changes in both 9p.c and codir.c. So, instead
of that, in order for people to easily backport the fix to older QEMU
versions, let's simply use a CoMutex since all the users for this
sit in coroutines.
Fixes: 7cde47d4a8 ("9p: add locking to V9fsDir")
Reviewed-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Message-Id: <158981894794.109297.3530035833368944254.stgit@bahia.lan>
Signed-off-by: Greg Kurz <groug@kaod.org>
linux/limits.h should be included for the XATTR_SIZE_MAX definition used
by v9fs_xattrcreate.
Fixes: 3b79ef2cf4 ("9pfs: limit xattr size in xattrcreate")
Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Message-Id: <20200515203015.7090-2-dan@dlrobertson.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Devices may have component devices and buses.
Device realization may fail. Realization is recursive: a device's
realize() method realizes its components, and device_set_realized()
realizes its buses (which should in turn realize the devices on that
bus, except bus_set_realized() doesn't implement that, yet).
When realization of a component or bus fails, we need to roll back:
unrealize everything we realized so far. If any of these unrealizes
failed, the device would be left in an inconsistent state. Must not
happen.
device_set_realized() lets it happen: it ignores errors in the roll
back code starting at label child_realize_fail.
Since realization is recursive, unrealization must be recursive, too.
But how could a partly failed unrealize be rolled back? We'd have to
re-realize, which can fail. This design is fundamentally broken.
device_set_realized() does not roll back at all. Instead, it keeps
unrealizing, ignoring further errors.
It can screw up even for a device with no buses: if the lone
dc->unrealize() fails, it still unregisters vmstate, and calls
listeners' unrealize() callback.
bus_set_realized() does not roll back either. Instead, it stops
unrealizing.
Fortunately, no unrealize method can fail, as we'll see below.
To fix the design error, drop parameter @errp from all the unrealize
methods.
Any unrealize method that uses @errp now needs an update. This leads
us to unrealize() methods that can fail. Merely passing it to another
unrealize method cannot cause failure, though. Here are the ones that
do other things with @errp:
* virtio_serial_device_unrealize()
Fails when qbus_set_hotplug_handler() fails, but still does all the
other work. On failure, the device would stay realized with its
resources completely gone. Oops. Can't happen, because
qbus_set_hotplug_handler() can't actually fail here. Pass
&error_abort to qbus_set_hotplug_handler() instead.
* hw/ppc/spapr_drc.c's unrealize()
Fails when object_property_del() fails, but all the other work is
already done. On failure, the device would stay realized with its
vmstate registration gone. Oops. Can't happen, because
object_property_del() can't actually fail here. Pass &error_abort
to object_property_del() instead.
* spapr_phb_unrealize()
Fails and bails out when remove_drcs() fails, but other work is
already done. On failure, the device would stay realized with some
of its resources gone. Oops. remove_drcs() fails only when
chassis_from_bus()'s object_property_get_uint() fails, and it can't
here. Pass &error_abort to remove_drcs() instead.
Therefore, no unrealize method can fail before this patch.
device_set_realized()'s recursive unrealization via bus uses
object_property_set_bool(). Can't drop @errp there, so pass
&error_abort.
We similarly unrealize with object_property_set_bool() elsewhere,
always ignoring errors. Pass &error_abort instead.
Several unrealize methods no longer handle errors from other unrealize
methods: virtio_9p_device_unrealize(),
virtio_input_device_unrealize(), scsi_qdev_unrealize(), ...
Much of the deleted error handling looks wrong anyway.
One unrealize methods no longer ignore such errors:
usb_ehci_pci_exit().
Several realize methods no longer ignore errors when rolling back:
v9fs_device_realize_common(), pci_qdev_unrealize(),
spapr_phb_realize(), usb_qdev_realize(), vfio_ccw_realize(),
virtio_device_realize().
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200505152926.18877-17-armbru@redhat.com>
If delivery of some 9pfs response fails for some reason, log the
error message by mentioning the 9P protocol reply type, not by
client's request type. The latter could be misleading that the
error occurred already when handling the request input.
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Message-Id: <ad0e5a9b6abde52502aa40b30661d29aebe1590a.1589132512.git.qemu_oss@crudebyte.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
QEMU's local 9pfs server passes through O_NOATIME from the client. If
the QEMU process doesn't have permissions to use O_NOATIME (namely, it
does not own the file nor have the CAP_FOWNER capability), the open will
fail. This causes issues when from the client's point of view, it
believes it has permissions to use O_NOATIME (e.g., a process running as
root in the virtual machine). Additionally, overlayfs on Linux opens
files on the lower layer using O_NOATIME, so in this case a 9pfs mount
can't be used as a lower layer for overlayfs (cf.
dabfe19719/vmtest/onoatimehack.c
and https://github.com/NixOS/nixpkgs/issues/54509).
Luckily, O_NOATIME is effectively a hint, and is often ignored by, e.g.,
network filesystems. open(2) notes that O_NOATIME "may not be effective
on all filesystems. One example is NFS, where the server maintains the
access time." This means that we can honor it when possible but fall
back to ignoring it.
Acked-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Message-Id: <e9bee604e8df528584693a4ec474ded6295ce8ad.1587149256.git.osandov@fb.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
The common fsdev options are set by qemu_fsdev_add() before it calls
the backend specific option parsing code. In the case of "proxy" this
means "writeout" or "readonly" were simply ignored. This has been
broken from the beginning.
Reported-by: Stéphane Graber <stgraber@ubuntu.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Message-Id: <158349633705.1237488.8895481990204796135.stgit@bahia.lan>
This will provide the following virtual files by the 9pfs
synth driver:
- /ReadDirDir/ReadDirFile99
- /ReadDirDir/ReadDirFile98
...
- /ReadDirDir/ReadDirFile1
- /ReadDirDir/ReadDirFile0
This virtual directory and its virtual 100 files will be
used by the upcoming 9pfs readdir tests.
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Message-Id: <5408c28c8de25dd575b745cef63bf785305ccef2.1579567020.git.qemu_oss@crudebyte.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
A good 9p client sends T_readdir with "count" parameter that's sufficiently
smaller than client's initially negotiated msize (maximum message size).
We perform a check for that though to avoid the server to be interrupted
with a "Failed to encode VirtFS reply type 41" transport error message by
bad clients. This count value constraint uses msize - 11, because 11 is the
header size of R_readdir.
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Message-Id: <3990d3891e8ae2074709b56449e96ab4b4b93b7d.1579567020.git.qemu_oss@crudebyte.com>
[groug: added comment ]
Signed-off-by: Greg Kurz <groug@kaod.org>
A client establishes a session by sending a Tversion request along with a
'msize' parameter which client uses to suggest server a maximum message
size ever to be used for communication (for both requests and replies)
between client and server during that session. If client suggests a 'msize'
smaller than 4096 then deny session by server immediately with an error
response (Rlerror for "9P2000.L" clients or Rerror for "9P2000.u" clients)
instead of replying with Rversion.
So far any msize submitted by client with Tversion was simply accepted by
server without any check. Introduction of some minimum msize makes sense,
because e.g. a msize < 7 would not allow any subsequent 9p operation at
all, because 7 is the size of the header section common by all 9p message
types.
A substantial higher value of 4096 was chosen though to prevent potential
issues with some message types. E.g. Rreadlink may yield up to a size of
PATH_MAX which is usually 4096, and like almost all 9p message types,
Rreadlink is not allowed to be truncated by the 9p protocol. This chosen
size also prevents a similar issue with Rreaddir responses (provided client
always sends adequate 'count' parameter with Treaddir), because even though
directory entries retrieval may be split up over several T/Rreaddir
messages; a Rreaddir response must not truncate individual directory entries
though. So msize should be large enough to return at least one directory
entry with the longest possible file name supported by host. Most file
systems support a max. file name length of 255. Largest known file name
lenght limit would be currently ReiserFS with max. 4032 bytes, which is
also covered by this min. msize value because 4032 + 35 < 4096.
Furthermore 4096 is already the minimum msize of the Linux kernel's 9pfs
client.
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Message-Id: <8ceecb7fb9fdbeabbe55c04339349a36929fb8e3.1579567019.git.qemu_oss@crudebyte.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
* Cleanups (Philippe)
* virtio-scsi fix (Pan Nengyuan)
* Tweak Skylake-v3 model id (Kashyap)
* x86 UCODE_REV support and nested live migration fix (myself)
* Advisory mode for pvpanic (Zhenwei)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJeK1CXAAoJEL/70l94x66DXtkIAI6W5wEY0Yme4M9Q5mGc0RV8
uscPLg0wsg88u6xne8ucCiGymvDREym2ii/aVI0Hi5ish84ZMdCrdck9cd+llpMf
+a3slL26AKlOW8WtYSuyAE1RdLFXngeXdwal5KtWPEExJorkDUPTbwhBzQduQK1a
myoHHcbwdd/96v7FvKnfG8jM6KZtHPQQ0i6+6fX4PN44jaULQNjze8GIrRBEwqw5
uCKJFQPBXiVcxKjH5/kzI1vl2hLJbF2ZGVEzX/U8OPZwyGPHIkWquURo8lvUTPfb
ySlNTUTV2CyrN65TBRXQp/mJi44WvME5Jxlf5rNLBaYXPpL0zhmILKn5X5ya4U0=
=TD0Y
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging
* Register qdev properties as class properties (Marc-André)
* Cleanups (Philippe)
* virtio-scsi fix (Pan Nengyuan)
* Tweak Skylake-v3 model id (Kashyap)
* x86 UCODE_REV support and nested live migration fix (myself)
* Advisory mode for pvpanic (Zhenwei)
# gpg: Signature made Fri 24 Jan 2020 20:16:23 GMT
# gpg: using RSA key BFFBD25F78C7AE83
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" [full]
# gpg: aka "Paolo Bonzini <pbonzini@redhat.com>" [full]
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1
# Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83
* remotes/bonzini/tags/for-upstream: (58 commits)
build-sys: clean up flags included in the linker command line
target/i386: Add the 'model-id' for Skylake -v3 CPU models
qdev: use object_property_help()
qapi/qmp: add ObjectPropertyInfo.default-value
qom: introduce object_property_help()
qom: simplify qmp_device_list_properties()
vl: print default value in object help
qdev: register properties as class properties
qdev: move instance properties to class properties
qdev: rename DeviceClass.props
qdev: set properties with device_class_set_props()
object: return self in object_ref()
object: release all props
object: add object_class_property_add_link()
object: express const link with link property
object: add direct link flag
object: rename link "child" to "target"
object: check strong flag with &
object: do not free class properties
object: add object_property_set_default
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Use virtio_delete_queue to make it more clear.
Signed-off-by: Pan Nengyuan <pannengyuan@huawei.com>
Message-Id: <20200117060927.51996-3-pannengyuan@huawei.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
v->vq forgot to cleanup in virtio_9p_device_unrealize, the memory leak
stack is as follow:
Direct leak of 14336 byte(s) in 2 object(s) allocated from:
#0 0x7f819ae43970 (/lib64/libasan.so.5+0xef970) ??:?
#1 0x7f819872f49d (/lib64/libglib-2.0.so.0+0x5249d) ??:?
#2 0x55a3a58da624 (./x86_64-softmmu/qemu-system-x86_64+0x2c14624) /mnt/sdb/qemu/hw/virtio/virtio.c:2327
#3 0x55a3a571bac7 (./x86_64-softmmu/qemu-system-x86_64+0x2a55ac7) /mnt/sdb/qemu/hw/9pfs/virtio-9p-device.c:209
#4 0x55a3a58e7bc6 (./x86_64-softmmu/qemu-system-x86_64+0x2c21bc6) /mnt/sdb/qemu/hw/virtio/virtio.c:3504
#5 0x55a3a5ebfb37 (./x86_64-softmmu/qemu-system-x86_64+0x31f9b37) /mnt/sdb/qemu/hw/core/qdev.c:876
Reported-by: Euler Robot <euler.robot@huawei.com>
Signed-off-by: Pan Nengyuan <pannengyuan@huawei.com>
Message-Id: <20200117060927.51996-2-pannengyuan@huawei.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Acked-by: Greg Kurz <groug@kaod.org>
'out' label in v9fs_xattr_write() and 'out_nofid' label in
v9fs_complete_rename() can be replaced by appropriate return
calls.
CC: Greg Kurz <groug@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Acked-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Greg Kurz <groug@kaod.org>
init_in_iov_from_pdu might not be able to allocate the full buffer size
requested, which comes from the client and could be larger than the
transport has available at the time of the request. Specifically, this
can happen with read operations, with the client requesting a read up to
the max allowed, which might be more than the transport has available at
the time.
Today the implementation of init_in_iov_from_pdu throws an error, both
Xen and Virtio.
Instead, change the V9fsTransport interface so that the size becomes a
pointer and can be limited by the implementation of
init_in_iov_from_pdu.
Change both the Xen and Virtio implementations to set the size to the
size of the buffer they managed to allocate, instead of throwing an
error. However, if the allocated buffer size is less than P9_IOHDRSZ
(the size of the header) still throw an error as the case is unhandable.
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
CC: groug@kaod.org
CC: anthony.perard@citrix.com
CC: roman@zededa.com
CC: qemu_oss@crudebyte.com
[groug: fix 32-bit build]
Signed-off-by: Greg Kurz <groug@kaod.org>
local_unlinkat_common() is supposed to always return -1 on error.
This is being done by jumps to the 'err_out' label, which is
a 'return ret' call, and 'ret' is initialized with -1.
Unfortunately there is a condition in which the function will
return 0 on error: in a case where flags == AT_REMOVEDIR, 'ret'
will be 0 when reaching
map_dirfd = openat_dir(...)
And, if map_dirfd == -1 and errno != ENOENT, the existing 'err_out'
jump will execute 'return ret', when ret is still set to zero
at that point.
This patch fixes it by changing all 'err_out' labels by
'return -1' calls, ensuring that the function will always
return -1 on error conditions. 'ret' can be left unintialized
since it's now being used just to store the result of 'unlinkat'
calls.
CC: Greg Kurz <groug@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
[groug: changed prefix in title to be "9p: local:"]
Signed-off-by: Greg Kurz <groug@kaod.org>
There is a possible memory leak while local_link return -1 without free
odirpath and oname.
Reported-by: Euler Robot <euler.robot@huawei.com>
Signed-off-by: Jaijun Chen <chenjiajun8@huawei.com>
Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
Reviewed-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Mostly, Error ** is for returning error from the function, so the
callee sets it. However error_append_security_model_hint and
error_append_socket_sockfd_hint get already filled errp
parameter. They don't change the pointer itself, only change the
internal state of referenced Error object. So we can make it Error
*const * errp, to stress the behavior. It will also help coccinelle
script (in future) to distinguish such cases from common errp usage.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Acked-by: Greg Kurz <groug@kaod.org>
Message-Id: <20191205174635.18758-9-vsementsov@virtuozzo.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Commit message replaced]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Some filesystems may return 0s in statfs (trivially, a FUSE filesystem
can do so). QEMU should handle this gracefully and just behave the
same as if statfs failed.
Signed-off-by: Dan Schatzberg <dschatzberg@fb.com>
Acked-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Use variable length suffixes for inode remapping instead of the fixed
16 bit size prefixes before. With this change the inode numbers on guest
will typically be much smaller (e.g. around >2^1 .. >2^7 instead of >2^48
with the previous fixed size inode remapping.
Additionally this solution is more efficient, since inode numbers in
practice can take almost their entire 64 bit range on guest as well, so
there is less likely a need for generating and tracking additional suffixes,
which might also be beneficial for nested virtualization where each level of
virtualization would shift up the inode bits and increase the chance of
expensive remapping actions.
The "Exponential Golomb" algorithm is used as basis for generating the
variable length suffixes. The algorithm has a parameter k which controls the
distribution of bits on increasing indeces (minimum bits at low index vs.
maximum bits at high index). With k=0 the generated suffixes look like:
Index Dec/Bin -> Generated Suffix Bin
1 [1] -> [1] (1 bits)
2 [10] -> [010] (3 bits)
3 [11] -> [110] (3 bits)
4 [100] -> [00100] (5 bits)
5 [101] -> [10100] (5 bits)
6 [110] -> [01100] (5 bits)
7 [111] -> [11100] (5 bits)
8 [1000] -> [0001000] (7 bits)
9 [1001] -> [1001000] (7 bits)
10 [1010] -> [0101000] (7 bits)
11 [1011] -> [1101000] (7 bits)
12 [1100] -> [0011000] (7 bits)
...
65533 [1111111111111101] -> [1011111111111111000000000000000] (31 bits)
65534 [1111111111111110] -> [0111111111111111000000000000000] (31 bits)
65535 [1111111111111111] -> [1111111111111111000000000000000] (31 bits)
Hence minBits=1 maxBits=31
And with k=5 they would look like:
Index Dec/Bin -> Generated Suffix Bin
1 [1] -> [000001] (6 bits)
2 [10] -> [100001] (6 bits)
3 [11] -> [010001] (6 bits)
4 [100] -> [110001] (6 bits)
5 [101] -> [001001] (6 bits)
6 [110] -> [101001] (6 bits)
7 [111] -> [011001] (6 bits)
8 [1000] -> [111001] (6 bits)
9 [1001] -> [000101] (6 bits)
10 [1010] -> [100101] (6 bits)
11 [1011] -> [010101] (6 bits)
12 [1100] -> [110101] (6 bits)
...
65533 [1111111111111101] -> [0011100000000000100000000000] (28 bits)
65534 [1111111111111110] -> [1011100000000000100000000000] (28 bits)
65535 [1111111111111111] -> [0111100000000000100000000000] (28 bits)
Hence minBits=6 maxBits=28
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
stat_to_qid attempts via qid_path_prefixmap to map unique files (which are
identified by 64 bit inode nr and 32 bit device id) to a 64 QID path value.
However this implementation makes some assumptions about inode number
generation on the host.
If qid_path_prefixmap fails, we still have 48 bits available in the QID
path to fall back to a less memory efficient full mapping.
Signed-off-by: Antonios Motakis <antonios.motakis@huawei.com>
[CS: - Rebased to https://github.com/gkurz/qemu/commits/9p-next
(SHA1 7fc4c49e91).
- Updated hash calls to new xxhash API.
- Removed unnecessary parantheses in qpf_lookup_func().
- Removed unnecessary g_malloc0() result checks.
- Log error message when running out of prefixes in
qid_path_fullmap().
- Log warning message about potential degraded performance in
qid_path_prefixmap().
- Wrapped qpf_table initialization to dedicated qpf_table_init()
function.
- Fixed typo in comment. ]
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
'warn' (default): Only log an error message (once) on host if more than one
device is shared by same export, except of that just ignore this config
error though. This is the default behaviour for not breaking existing
installations implying that they really know what they are doing.
'forbid': Like 'warn', but except of just logging an error this
also denies access of guest to additional devices.
'remap': Allows to share more than one device per export by remapping
inodes from host to guest appropriately. To support multiple devices on the
9p share, and avoid qid path collisions we take the device id as input to
generate a unique QID path. The lowest 48 bits of the path will be set
equal to the file inode, and the top bits will be uniquely assigned based
on the top 16 bits of the inode and the device id.
Signed-off-by: Antonios Motakis <antonios.motakis@huawei.com>
[CS: - Rebased to https://github.com/gkurz/qemu/commits/9p-next
(SHA1 7fc4c49e91).
- Added virtfs option 'multidevs', original patch simply did the inode
remapping without being asked.
- Updated hash calls to new xxhash API.
- Updated docs for new option 'multidevs'.
- Fixed v9fs_do_readdir() not having remapped inodes.
- Log error message when running out of prefixes in
qid_path_prefixmap().
- Fixed definition of QPATH_INO_MASK.
- Wrapped qpp_table initialization to dedicated qpp_table_init()
function.
- Dropped unnecessary parantheses in qpp_lookup_func().
- Dropped unnecessary g_malloc0() result checks. ]
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
[groug: - Moved "multidevs" parsing to the local backend.
- Added hint to invalid multidevs option error.
- Turn "remap" into "x-remap". ]
Signed-off-by: Greg Kurz <groug@kaod.org>
The QID path should uniquely identify a file. However, the
inode of a file is currently used as the QID path, which
on its own only uniquely identifies files within a device.
Here we track the device hosting the 9pfs share, in order
to prevent security issues with QID path collisions from
other devices.
We only print a warning for now but a subsequent patch will
allow users to have finer control over the desired behaviour.
Failing the I/O will be one the proposed behaviour, so we
also change stat_to_qid() to return an error here in order to
keep other patches simpler.
Signed-off-by: Antonios Motakis <antonios.motakis@huawei.com>
[CS: - Assign dev_id to export root's device already in
v9fs_device_realize_common(), not postponed in
stat_to_qid().
- error_report_once() if more than one device was
shared by export.
- Return -ENODEV instead of -ENOSYS in stat_to_qid().
- Fixed typo in log comment. ]
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
[groug, changed to warning, updated message and changelog]
Signed-off-by: Greg Kurz <groug@kaod.org>
It is more convenient to use the return value of the function to notify
errors, rather than to be tied up setting up the &local_err boilerplate.
Signed-off-by: Greg Kurz <groug@kaod.org>
There is no need for signedness on these QID fields for 9p.
Signed-off-by: Antonios Motakis <antonios.motakis@huawei.com>
[CS: - Also make QID type unsigned.
- Adjust donttouch_stat() to new types.
- Adjust trace-events to new types. ]
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Express the complex conditions in Kconfig rather than Makefiles, since Kconfig
is better suited at expressing dependencies and detecting contradictions.
Cc: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In my "build everything" tree, changing hw/qdev-properties.h triggers
a recompile of some 2700 out of 6600 objects (not counting tests and
objects that don't depend on qemu/osdep.h).
Many places including hw/qdev-properties.h (directly or via hw/qdev.h)
actually need only hw/qdev-core.h. Include hw/qdev-core.h there
instead.
hw/qdev.h is actually pointless: all it does is include hw/qdev-core.h
and hw/qdev-properties.h, which in turn includes hw/qdev-core.h.
Replace the remaining uses of hw/qdev.h by hw/qdev-properties.h.
While there, delete a few superfluous inclusions of hw/qdev-core.h.
Touching hw/qdev-properties.h now recompiles some 1200 objects.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20190812052359.30071-22-armbru@redhat.com>
In my "build everything" tree, changing qemu/main-loop.h triggers a
recompile of some 5600 out of 6600 objects (not counting tests and
objects that don't depend on qemu/osdep.h). It includes block/aio.h,
which in turn includes qemu/event_notifier.h, qemu/notify.h,
qemu/processor.h, qemu/qsp.h, qemu/queue.h, qemu/thread-posix.h,
qemu/thread.h, qemu/timer.h, and a few more.
Include qemu/main-loop.h only where it's needed. Touching it now
recompiles only some 1700 objects. For block/aio.h and
qemu/event_notifier.h, these numbers drop from 5600 to 2800. For the
others, they shrink only slightly.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190812052359.30071-21-armbru@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
In my "build everything" tree, changing hw/hw.h triggers a recompile
of some 2600 out of 6600 objects (not counting tests and objects that
don't depend on qemu/osdep.h).
The previous commits have left only the declaration of hw_error() in
hw/hw.h. This permits dropping most of its inclusions. Touching it
now recompiles less than 200 objects.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-Id: <20190812052359.30071-19-armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
A Xen public header have been imported into QEMU (by
f65eadb639 "xen: import ring.h from xen"), but there are other header
that depends on ring.h which come from the system when building QEMU.
This patch resolves the issue of having headers from the system
importing a different copie of ring.h.
This patch is prompt by the build issue described in the previous
patch: 'Revert xen/io/ring.h of "Clean up a few header guard symbols"'
ring.h and the new imported headers are moved to
"include/hw/xen/interface" as those describe interfaces with a guest.
The imported headers are cleaned up a bit while importing them: some
part of the file that QEMU doesn't use are removed (description
of how to make hypercall in grant_table.h have been removed).
Other cleanup:
- xen-mapcache.c and xen-legacy-backend.c don't need grant_table.h.
- xenfb.c doesn't need event_channel.h.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Message-Id: <20190621105441.3025-3-anthony.perard@citrix.com>
No header includes qemu-common.h after this commit, as prescribed by
qemu-common.h's file comment.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190523143508.25387-5-armbru@redhat.com>
[Rebased with conflicts resolved automatically, except for
include/hw/arm/xlnx-zynqmp.h hw/arm/nrf51_soc.c hw/arm/msf2-soc.c
block/qcow2-refcount.c block/qcow2-cluster.c block/qcow2-cache.c
target/arm/cpu.h target/lm32/cpu.h target/m68k/cpu.h target/mips/cpu.h
target/moxie/cpu.h target/nios2/cpu.h target/openrisc/cpu.h
target/riscv/cpu.h target/tilegx/cpu.h target/tricore/cpu.h
target/unicore32/cpu.h target/xtensa/cpu.h; bsd-user/main.c and
net/tap-bsd.c fixed up]
Some trace points are attributed to the wrong source file. Happens
when we neglect to update trace-events for code motion, or add events
in the wrong place, or misspell the file name.
Clean up with help of cleanup-trace-events.pl. Same funnies as in the
previous commit, of course. Manually shorten its change to
linux-user/trace-events to */signal.c.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20190314180929.27722-6-armbru@redhat.com
Message-Id: <20190314180929.27722-6-armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We spell out sub/dir/ in sub/dir/trace-events' comments pointing to
source files. That's because when trace-events got split up, the
comments were moved verbatim.
Delete the sub/dir/ part from these comments. Gets rid of several
misspellings.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20190314180929.27722-3-armbru@redhat.com
Message-Id: <20190314180929.27722-3-armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20190123065618.3520-42-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The make_device_config.sh script is replaced by minikconf, which
is modified to support the same command line as its predecessor.
The roots of the parsing are default-configs/*.mak, Kconfig.host and
hw/Kconfig. One difference with make_device_config.sh is that all symbols
have to be defined in a Kconfig file, including those coming from the
configure script. This is the reason for the Kconfig.host file introduced
in the previous patch. Whenever a file in default-configs/*.mak used
$(...) to refer to a config-host.mak symbol, this is replaced by a
Kconfig dependency; this part must be done already in this patch
for bisectability.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Acked-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20190123065618.3520-28-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The Kconfig files were generated mostly with this script:
for i in `grep -ho CONFIG_[A-Z0-9_]* default-configs/* | sort -u`; do
set fnord `git grep -lw $i -- 'hw/*/Makefile.objs' `
shift
if test $# = 1; then
cat >> $(dirname $1)/Kconfig << EOF
config ${i#CONFIG_}
bool
EOF
git add $(dirname $1)/Kconfig
else
echo $i $*
fi
done
sed -i '$d' hw/*/Kconfig
for i in hw/*; do
if test -d $i && ! test -f $i/Kconfig; then
touch $i/Kconfig
git add $i/Kconfig
fi
done
Whenever a symbol is referenced from multiple subdirectories, the
script prints the list of directories that reference the symbol.
These symbols have to be added manually to the Kconfig files.
Kconfig.host and hw/Kconfig were created manually.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20190123065618.3520-27-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The VIRTIO_9P || VIRTFS && XEN condition can be computed in hw/Makefile.objs,
removing an "if" from hw/9pfs/Makefile.objs.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...and xen_backend.h to xen-legacy-backend.h
Rather than attempting to convert the existing backend infrastructure to
be QOM compliant (which would be hard to do in an incremental fashion),
subsequent patches will introduce a completely new framework for Xen PV
backends. Hence it is necessary to re-name parts of existing code to avoid
name clashes. The re-named 'legacy' infrastructure will be removed once all
backends have been ported to the new framework.
This patch is purely cosmetic. No functional change.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Anthony Perard <anthony.perard@citrix.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
The "handle" fsdev backend was deprecated in QEMU 2.12.0 with:
commit db3b3c7281
Author: Greg Kurz <groug@kaod.org>
Date: Mon Jan 8 11:18:23 2018 +0100
9pfs: deprecate handle backend
This backend raise some concerns:
- doesn't support symlinks
- fails +100 tests in the PJD POSIX file system test suite [1]
- requires the QEMU process to run with the CAP_DAC_READ_SEARCH
capability, which isn't recommended for security reasons
This backend should not be used and wil be removed. The 'local'
backend is the recommended alternative.
[1] https://www.tuxera.com/community/posix-test-suite/
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
It has passed the two release cooling period without any complaint.
Remove it now.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Because it is a recommended coding practice (see HACKING).
Signed-off-by: Greg Kurz <groug@kaod.org>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Because it is a recommended coding practice (see HACKING).
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
When using the 9P2000.u version of the protocol, the following shell
command line in the guest can cause QEMU to crash:
while true; do rm -rf aa; mkdir -p a/b & touch a/b/c & mv a aa; done
With 9P2000.u, file renaming is handled by the WSTAT command. The
v9fs_wstat() function calls v9fs_complete_rename(), which calls
v9fs_fix_path() for every fid whose path is affected by the change.
The involved calls to v9fs_path_copy() may race with any other access
to the fid path performed by some worker thread, causing a crash like
shown below:
Thread 12 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
0x0000555555a25da2 in local_open_nofollow (fs_ctx=0x555557d958b8, path=0x0,
flags=65536, mode=0) at hw/9pfs/9p-local.c:59
59 while (*path && fd != -1) {
(gdb) bt
#0 0x0000555555a25da2 in local_open_nofollow (fs_ctx=0x555557d958b8,
path=0x0, flags=65536, mode=0) at hw/9pfs/9p-local.c:59
#1 0x0000555555a25e0c in local_opendir_nofollow (fs_ctx=0x555557d958b8,
path=0x0) at hw/9pfs/9p-local.c:92
#2 0x0000555555a261b8 in local_lstat (fs_ctx=0x555557d958b8,
fs_path=0x555556b56858, stbuf=0x7fff84830ef0) at hw/9pfs/9p-local.c:185
#3 0x0000555555a2b367 in v9fs_co_lstat (pdu=0x555557d97498,
path=0x555556b56858, stbuf=0x7fff84830ef0) at hw/9pfs/cofile.c:53
#4 0x0000555555a1e9e2 in v9fs_stat (opaque=0x555557d97498)
at hw/9pfs/9p.c:1083
#5 0x0000555555e060a2 in coroutine_trampoline (i0=-669165424, i1=32767)
at util/coroutine-ucontext.c:116
#6 0x00007fffef4f5600 in __start_context () at /lib64/libc.so.6
#7 0x0000000000000000 in ()
(gdb)
The fix is to take the path write lock when calling v9fs_complete_rename(),
like in v9fs_rename().
Impact: DoS triggered by unprivileged guest users.
Fixes: CVE-2018-19489
Cc: P J P <ppandit@redhat.com>
Reported-by: zhibin hu <noirfate@gmail.com>
Reviewed-by: Prasad J Pandit <pjp@fedoraproject.org>
Signed-off-by: Greg Kurz <groug@kaod.org>
Recent commit 5b76ef50f6 fixed a race where v9fs_co_open2() could
possibly overwrite a fid path with v9fs_path_copy() while it is being
accessed by some other thread, ie, use-after-free that can be detected
by ASAN with a custom 9p client.
It turns out that the same can happen at several locations where
v9fs_path_copy() is used to set the fid path. The fix is again to
take the write lock.
Fixes CVE-2018-19364.
Cc: P J P <ppandit@redhat.com>
Reported-by: zhibin hu <noirfate@gmail.com>
Reviewed-by: Prasad J Pandit <pjp@fedoraproject.org>
Signed-off-by: Greg Kurz <groug@kaod.org>
The assumption that the fid cannot be used by any other operation is
wrong. At least, nothing prevents a misbehaving client to create a
file with a given fid, and to pass this fid to some other operation
at the same time (ie, without waiting for the response to the creation
request). The call to v9fs_path_copy() performed by the worker thread
after the file was created can race with any access to the fid path
performed by some other thread. This causes use-after-free issues that
can be detected by ASAN with a custom 9p client.
Unlike other operations that only read the fid path, v9fs_co_open2()
does modify it. It should hence take the write lock.
Cc: P J P <ppandit@redhat.com>
Reported-by: zhibin hu <noirfate@gmail.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Calling error_report() from within a function that takes an Error **
argument is suspicious. qemu_fsdev_add() does that, and its caller
fsdev_init_func() then fails without setting an error. Its caller
main(), via qemu_opts_foreach(), is fine with it, but clean it up
anyway.
Cc: Greg Kurz <groug@kaod.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Greg Kurz <groug@kaod.org>
Message-Id: <20181017082702.5581-32-armbru@redhat.com>
Calling error_report() in a function that takes an Error ** argument
is suspicious. 9p-handle.c's handle_parse_opts() does that, and then
fails without setting an error. Wrong. Its caller crashes when it
tries to report the error:
$ qemu-system-x86_64 -nodefaults -fsdev id=foo,fsdriver=handle
qemu-system-x86_64: -fsdev id=foo,fsdriver=handle: warning: handle backend is deprecated
qemu-system-x86_64: -fsdev id=foo,fsdriver=handle: fsdev: No path specified
Segmentation fault (core dumped)
Screwed up when commit 91cda4e8f3 (v2.12.0) converted the function to
Error. Fix by calling error_setg() instead of error_report().
Fixes: 91cda4e8f3
Cc: Greg Kurz <groug@kaod.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20181017082702.5581-9-armbru@redhat.com>
From include/qapi/error.h:
* Pass an existing error to the caller with the message modified:
* error_propagate(errp, err);
* error_prepend(errp, "Could not frobnicate '%s': ", name);
Fei Li pointed out that doing error_propagate() first doesn't work
well when @errp is &error_fatal or &error_abort: the error_prepend()
is never reached.
Since I doubt fixing the documentation will stop people from getting
it wrong, introduce error_propagate_prepend(), in the hope that it
lures people away from using its constituents in the wrong order.
Update the instructions in error.h accordingly.
Convert existing error_prepend() next to error_propagate to
error_propagate_prepend(). If any of these get reached with
&error_fatal or &error_abort, the error messages improve. I didn't
check whether that's the case anywhere.
Cc: Fei Li <fli@suse.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20181017082702.5581-2-armbru@redhat.com>
Comparisons of mode_t with -1 require an explicit cast, since mode_t
is unsigned on Darwin.
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
strchrnul is a GNU extension and thus unavailable on a number of targets.
In the review for a commit removing strchrnul from 9p, I was asked to
create a qemu_strchrnul helper to factor out this functionality.
Do so, and use it in a number of other places in the code base that inlined
the replacement pattern in a place where strchrnul could be used.
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Acked-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
As with unlinkat, these flags come from the client and need to
be translated to their host values. The protocol values happen
to match linux, but that need not be true in general.
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
The 9p-local code previously relied on P9_DOTL_AT_REMOVEDIR and AT_REMOVEDIR
having the same numerical value and deferred any errorchecking to the
syscall itself. However, while the former assumption is true on Linux,
it is not true in general. 9p-handle did this properly however. Move
the translation code to the generic 9p server code and add an error
if unrecognized flags are passed.
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Both `stbuf` and `local_ioc_getversion` where unused when
FS_IOC_GETVERSION was not defined, causing a compiler warning.
Reorganize the code to avoid this warning.
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
If the size returned from llistxattr/lgetxattr is 0, we skipped
the malloc call, leaving xattr.value uninitialized. However, this
value is later passed to `g_free` without any further checks,
causing an error. Fix that by always calling g_malloc unconditionally.
If `size` is 0, it will return NULL, which is safe to pass to g_free.
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
These functions will need custom implementations on Darwin. Since the
implementation is very similar among all of them, and 9p-util already
has the _nofollow version of fgetxattrat, let's move them all there.
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
In the review of
9p: Avoid warning if FS_IOC_GETVERSION is not defined
Grep Kurz noted this error path was failing to set errp.
Fix that.
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
[added local: to commit title, Greg Kurz]
Signed-off-by: Greg Kurz <groug@kaod.org>
The size to pass to the `connect` call is the size of the entire
`struct sockaddr_un`. Passing anything shorter than this causes errors
on darwin.
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
This is only half of the work, because the proxy devices (virtio-*-pci,
virtio-*-ccw, etc.) are still included unconditionally. It is still a
move in the right direction.
Based-on: <20180522194943.24871-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that helpers are available in xen_backend, use them throughout all
Xen PV backends.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Anthony Perard <anthony.perard@citrix.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Don't print the tv_nsec part of atime and mtime, to stay below the 10
argument limit of trace events.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Leak found thanks to ASAN:
Direct leak of 8 byte(s) in 1 object(s) allocated from:
#0 0x55995789ac90 in __interceptor_malloc (/home/elmarco/src/qemu/build/x86_64-softmmu/qemu-system-x86_64+0x1510c90)
#1 0x7f0a91190f0c in g_malloc /home/elmarco/src/gnome/glib/builddir/../glib/gmem.c:94
#2 0x5599580a281c in v9fs_path_copy /home/elmarco/src/qemu/hw/9pfs/9p.c:196:17
#3 0x559958f9ec5d in coroutine_trampoline /home/elmarco/src/qemu/util/coroutine-ucontext.c:116:9
#4 0x7f0a8766ebbf (/lib64/libc.so.6+0x50bbf)
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
lhs/rhs doesn't tell much about how argument are handled, dst/src is
and const arguments is clearer in my mind. Use g_memdup() while at it.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
qemu-common.h includes qemu/option.h, but most places that include the
former don't actually need the latter. Drop the include, and add it
to the places that actually need it.
While there, drop superfluous includes of both headers, and
separate #include from file comment with a blank line.
This cleanup makes the number of objects depending on qemu/option.h
drop from 4545 (out of 4743) to 284 in my "build everything" tree.
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20180201111846.21846-20-armbru@redhat.com>
[Semantic conflict with commit bdd6a90a9e in block/nvme.c resolved]
This cleanup makes the number of objects depending on qapi/error.h
drop from 1910 (out of 4743) to 1612 in my "build everything" tree.
While there, separate #include from file comment with a blank line,
and drop a useless comment on why qemu/osdep.h is included first.
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20180201111846.21846-5-armbru@redhat.com>
[Semantic conflict with commit 34e304e975 resolved, OSX breakage fixed]
The idea is to send a victim request that will possibly block in the
server and to send a flush request to cancel the victim request.
This patch adds two test to verifiy that:
- the server does not reply to a victim request that was actually
cancelled
- the server replies to the flush request after replying to the
victim request if it could not cancel it
9p request cancellation reference:
http://man.cat-v.org/plan_9/5/flush
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
(groug, change the test to only write a single byte to avoid
any alignment or endianess consideration)
Trivial test of a successful write.
Signed-off-by: Greg Kurz <groug@kaod.org>
(groug, handle potential overflow when computing request size,
add missing g_free(buf),
backend handles one written byte at a time to validate
the server doesn't do short-reads)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The purpose of virtio-9p-test is to test the virtio-9p device, especially
the 9p server state machine. We don't really care what fsdev backend we're
using. Moreover, if we want to be able to test the flush request or a
device reset with in-flights I/O, it is close to impossible to achieve
with a physical backend because we cannot ask it reliably to put an I/O
on hold at a specific point in time.
Fortunately, we can do that with the synthetic backend, which allows to
register callbacks on read/write accesses to a specific file. This will
be used by a later patch to test the 9P flush request.
The walk request test is converted to using the synth backend.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
# Background
I was investigating spurious non-deterministic EINTR returns from
various 9p file system operations in a Linux guest served from the
qemu 9p server.
## EINTR, ERESTARTSYS and the linux kernel
When a signal arrives that the Linux kernel needs to deliver to user-space
while a given thread is blocked (in the 9p case waiting for a reply to its
request in 9p_client_rpc -> wait_event_interruptible), it asks whatever
driver is currently running to abort its current operation (in the 9p case
causing the submission of a TFLUSH message) and return to user space.
In these situations, the error message reported is generally ERESTARTSYS.
If the userspace processes specified SA_RESTART, this means that the
system call will get restarted upon completion of the signal handler
delivery (assuming the signal handler doesn't modify the process state
in complicated ways not relevant here). If SA_RESTART is not specified,
ERESTARTSYS gets translated to EINTR and user space is expected to handle
the restart itself.
## The 9p TFLUSH command
The 9p TFLUSH commands requests that the server abort an ongoing operation.
The man page [1] specifies:
```
If it recognizes oldtag as the tag of a pending transaction, it should
abort any pending response and discard that tag.
[...]
When the client sends a Tflush, it must wait to receive the corresponding
Rflush before reusing oldtag for subsequent messages. If a response to the
flushed request is received before the Rflush, the client must honor the
response as if it had not been flushed, since the completed request may
signify a state change in the server
```
In particular, this means that the server must not send a reply with the
orignal tag in response to the cancellation request, because the client is
obligated to interpret such a reply as a coincidental reply to the original
request.
# The bug
When qemu receives a TFlush request, it sets the `cancelled` flag on the
relevant pdu. This flag is periodically checked, e.g. in
`v9fs_co_name_to_path`, and if set, the operation is aborted and the error
is set to EINTR. However, the server then violates the spec, by returning
to the client an Rerror response, rather than discarding the message
entirely. As a result, the client is required to assume that said Rerror
response is a result of the original request, not a result of the
cancellation and thus passes the EINTR error back to user space.
This is not the worst thing it could do, however as discussed above, the
correct error code would have been ERESTARTSYS, such that user space
programs with SA_RESTART set get correctly restarted upon completion of
the signal handler.
Instead, such programs get spurious EINTR results that they were not
expecting to handle.
It should be noted that there are plenty of user space programs that do not
set SA_RESTART and do not correctly handle EINTR either. However, that is
then a userspace bug. It should also be noted that this bug has been
mitigated by a recent commit to the Linux kernel [2], which essentially
prevents the kernel from sending Tflush requests unless the process is about
to die (in which case the process likely doesn't care about the response).
Nevertheless, for older kernels and to comply with the spec, I believe this
change is beneficial.
# Implementation
The fix is fairly simple, just skipping notification of a reply if
the pdu was previously cancelled. We do however, also notify the transport
layer that we're doing this, so it can clean up any resources it may be
holding. I also added a new trace event to distinguish
operations that caused an error reply from those that were cancelled.
One complication is that we only omit sending the message on EINTR errors in
order to avoid confusing the rest of the code (which may assume that a
client knows about a fid if it sucessfully passed it off to pud_complete
without checking for cancellation status). This does mean that if the server
acts upon the cancellation flag, it always needs to set err to EINTR. I
believe this is true of the current code.
[1] https://9fans.github.io/plan9port/man/man9/flush.html
[2] https://github.com/torvalds/linux/commit/9523feac272ccad2ad8186ba4fcc891
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
[groug, send a zero-sized reply instead of detaching the buffer]
Signed-off-by: Greg Kurz <groug@kaod.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
No good reasons to do this outside of v9fs_device_realize_common().
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
This backend raise some concerns:
- doesn't support symlinks
- fails +100 tests in the PJD POSIX file system test suite [1]
- requires the QEMU process to run with the CAP_DAC_READ_SEARCH
capability, which isn't recommended for security reasons
This backend should not be used and wil be removed. The 'local'
backend is the recommended alternative.
[1] https://www.tuxera.com/community/posix-test-suite/
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
This patch changes some error messages in the backend init code and
convert backends to propagate QEMU Error objects instead of calling
error_report().
One notable improvement is that the local backend now provides a more
detailed error report when it fails to open the shared directory.
Signed-off-by: Greg Kurz <groug@kaod.org>
This patch changes some error messages in the backend opts parsing
code and convert backends to propagate QEMU Error objects instead
of calling error_report().
Signed-off-by: Greg Kurz <groug@kaod.org>
If we receive an unsupported request id, we first decide to
return -ENOTSUPP to the client, but since the request id
causes is_read_only_op() to return false, we change the
error to be -EROFS if the fsdev is read-only. This doesn't
make sense since we don't know what the client asked for.
This patch ensures that -EROFS can only be returned if the
request id is supported.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
The return value of v9fs_mark_fids_unreclaim() is then propagated to
pdu_complete(). It should be a negative errno, not -1.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
9p back-end first queries the size of an extended attribute,
allocates space for it via g_malloc() and then retrieves its
value into allocated buffer. Race between querying attribute
size and retrieving its could lead to memory bytes disclosure.
Use g_malloc0() to avoid it.
Reported-by: Tuomas Tynkkynen <tuomas.tynkkynen@iki.fi>
Signed-off-by: Prasad J Pandit <pjp@fedoraproject.org>
Signed-off-by: Greg Kurz <groug@kaod.org>
v9fs_do_readdir_with_stat() should check for a maximum buffer size
before an attempt to marshal gathered data. Otherwise, buffers assumed
as misconfigured and the transport would be broken.
The patch brings v9fs_do_readdir_with_stat() in conformity with
v9fs_do_readdir() behavior.
Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
[groug, regression caused my commit 8d37de41ca # 2.10]
Signed-off-by: Greg Kurz <groug@kaod.org>
The third parameter of v9fs_co_name_to_path() must not contain `/'
character.
The issue is most likely related to 9p2000.u protocol only.
Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
[groug, regression caused by commit f57f587857 # 2.10]
Signed-off-by: Greg Kurz <groug@kaod.org>
If the client is using 9p2000.u, the following occurs:
$ cd ${virtfs_shared_dir}
$ mkdir -p a/b/c
$ ls a/b
ls: cannot access 'a/b/a': No such file or directory
ls: cannot access 'a/b/b': No such file or directory
a b c
instead of the expected:
$ ls a/b
c
This is a regression introduced by commit f57f5878578a;
local_name_to_path() now resolves ".." and "." in paths,
and v9fs_do_readdir_with_stat()->stat_to_v9stat() then
copies the basename of the resulting path to the response.
With the example above, this means that "." and ".." are
turned into "b" and "a" respectively...
stat_to_v9stat() currently assumes it is passed a full
canonicalized path and uses it to do two different things:
1) to pass it to v9fs_co_readlink() in case the file is a symbolic
link
2) to set the name field of the V9fsStat structure to the basename
part of the given path
It only has two users: v9fs_stat() and v9fs_do_readdir_with_stat().
v9fs_stat() really needs 1) and 2) to be performed since it starts
with the full canonicalized path stored in the fid. It is different
for v9fs_do_readdir_with_stat() though because the name we want to
put into the V9fsStat structure is the d_name field of the dirent
actually (ie, we want to keep the "." and ".." special names). So,
we only need 1) in this case.
This patch hence adds a basename argument to stat_to_v9stat(), to
be used to set the name field of the V9fsStat structure, and moves
the basename logic to v9fs_stat().
Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
(groug, renamed old name argument to path and updated changelog)
Signed-off-by: Greg Kurz <groug@kaod.org>
Since fchmodat(2) on Linux doesn't support AT_SYMLINK_NOFOLLOW, we have to
implement it using workarounds. There are two different ways, depending on
whether the system supports O_PATH or not.
In the case O_PATH is supported, we rely on the behavhior of openat(2)
when passing O_NOFOLLOW | O_PATH and the file is a symbolic link. Even
if openat_file() already adds O_NOFOLLOW to the flags, this patch makes
it explicit that we need both creation flags to obtain the expected
behavior.
This is only cleanup, no functional change.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
(note this is how other functions also handle the errors).
hw/9pfs/9p.c:948:18: warning: Loss of sign in implicit conversion
offset = err;
^~~
Reported-by: Clang Static Analyzer
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Greg Kurz <groug@kaod.org>
Nothing in fsdev/ or hw/9pfs/ depends on pci; it should rather depend
on CONFIG_VIRTFS and CONFIG_VIRTIO/CONFIG_XEN only.
Acked-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
This function has to ensure it doesn't follow a symlink that could be used
to escape the virtfs directory. This could be easily achieved if fchmodat()
on linux honored the AT_SYMLINK_NOFOLLOW flag as described in POSIX, but
it doesn't. There was a tentative to implement a new fchmodat2() syscall
with the correct semantics:
https://patchwork.kernel.org/patch/9596301/
but it didn't gain much momentum. Also it was suggested to look at an O_PATH
based solution in the first place.
The current implementation covers most use-cases, but it notably fails if:
- the target path has access rights equal to 0000 (openat() returns EPERM),
=> once you've done chmod(0000) on a file, you can never chmod() again
- the target path is UNIX domain socket (openat() returns ENXIO)
=> bind() of UNIX domain sockets fails if the file is on 9pfs
The solution is to use O_PATH: openat() now succeeds in both cases, and we
can ensure the path isn't a symlink with fstat(). The associated entry in
"/proc/self/fd" can hence be safely passed to the regular chmod() syscall.
The previous behavior is kept for older systems that don't have O_PATH.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Zhi Yong Wu <zhiyong.wu@ucloud.cn>
Acked-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
With the move of some docs/ to docs/devel/ on ac06724a71,
no references were updated.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Convert all uses of error_report("warning:"... to use warn_report()
instead. This helps standardise on a single method of printing warnings
to the user.
All of the warnings were changed using these two commands:
find ./* -type f -exec sed -i \
's|error_report(".*warning[,:] |warn_report("|Ig' {} +
Indentation fixed up manually afterwards.
The test-qdev-global-props test case was manually updated to ensure that
this patch passes make check (as the test cases are case sensitive).
Signed-off-by: Alistair Francis <alistair.francis@xilinx.com>
Suggested-by: Thomas Huth <thuth@redhat.com>
Cc: Jeff Cody <jcody@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Lieven <pl@kamp.de>
Cc: Josh Durgin <jdurgin@redhat.com>
Cc: "Richard W.M. Jones" <rjones@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Peter Crosthwaite <crosthwaite.peter@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Greg Kurz <groug@kaod.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Peter Chubb <peter.chubb@nicta.com.au>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alexander Graf <agraf@suse.de>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Greg Kurz <groug@kaod.org>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed by: Peter Chubb <peter.chubb@data61.csiro.au>
Acked-by: Max Reitz <mreitz@redhat.com>
Acked-by: Marcel Apfelbaum <marcel@redhat.com>
Message-Id: <e1cfa2cd47087c248dd24caca9c33d9af0c499b0.1499866456.git.alistair.francis@xilinx.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Contrary to what is written in the comment, a buggy guest can misconfigure
the transport buffers and pdu_marshal() may return an error. If this ever
happens, it is up to the transport layer to handle the situation (9P is
transport agnostic).
This fixes Coverity issue CID1348518.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Implement xen_9pfs_disconnect by unbinding the event channels. On
xen_9pfs_free, call disconnect if any event channels haven't been
disconnected.
If the frontend misconfigured the buffers set the backend to "Closing"
and disconnect it. Misconfigurations include requesting a read of more
bytes than available on the ring buffer, or claiming to be writing more
data than available on the ring buffer.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
The 9P protocol is transport agnostic: if the guest misconfigured the
buffers, the best we can do is to set the broken flag on the device.
Signed-off-by: Greg Kurz <groug@kaod.org>
The 9p spec at http://man.cat-v.org/plan_9/5/intro reads:
"Each 9P message begins with a four-byte size field specify-
ing the length in bytes of the complete message including
the four bytes of the size field itself. The next byte is
the message type, one of the constants in the enumeration in
the include file <fcall.h>. The next two bytes are an iden-
tifying tag, described below."
ie, each message starts with a 7-byte long header.
The core 9P code already assumes this pretty much everywhere. This patch
does the following:
- makes the assumption explicit in the common 9p.h header, since it isn't
related to the transport
- open codes the header size in handle_9p_output() and hardens the sanity
check on the space needed for the reply message
Signed-off-by: Greg Kurz <groug@kaod.org>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
If the guest sends a malformed request, we end up with a dangling pointer
in V9fsVirtioState. This doesn't seem to cause any bug, but let's remove
this side effect anyway.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
I found these pattern via grepping the source tree. I don't have a
coccinelle script for it!
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
In mapped security modes, files are created with very restrictive
permissions (600 for files and 700 for directories). This makes
file sharing between virtual machines and users on the host rather
complicated. Imagine eg. a group of users that need to access data
produced by processes on a virtual machine. Giving those users access
to the data will be difficult since the group access mode is always 0.
This patch makes the default mode for both files and directories
configurable. Existing setups that don't know about the new parameters
keep using the current secure behavior.
Signed-off-by: Tobias Schramm <tobleminer@gmail.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Commit a0e640a8 introduced a path processing error.
Pass fstatat the dirpath based path component instead
of the entire path.
Signed-off-by: Bruce Rogers <brogers@suse.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
When using the mapped-file security, credentials are stored in a metadata
directory located in the parent directory. This is okay for all paths with
the notable exception of the root path, since we don't want and probably
can't create a metadata directory above the virtfs directory on the host.
This patch introduces a dedicated metadata file, sitting in the virtfs root
for this purpose. It relies on the fact that the "." name necessarily refers
to the virtfs root.
As for the metadata directory, we don't want the client to see this file.
The current code only cares for readdir() but there are many other places
to fix actually. The filtering logic is hence put in a separate function.
Before:
# ls -ld
drwxr-xr-x. 3 greg greg 4096 May 5 12:49 .
# chown root.root .
chown: changing ownership of '.': Is a directory
# ls -ld
drwxr-xr-x. 3 greg greg 4096 May 5 12:49 .
After:
# ls -ld
drwxr-xr-x. 3 greg greg 4096 May 5 12:49 .
# chown root.root .
# ls -ld
drwxr-xr-x. 3 root root 4096 May 5 12:50 .
and from the host:
ls -al .virtfs_metadata_root
-rwx------. 1 greg greg 26 May 5 12:50 .virtfs_metadata_root
$ cat .virtfs_metadata_root
virtfs.uid=0
virtfs.gid=0
Reported-by: Leo Gaspard <leo@gaspard.io>
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Leo Gaspard <leo@gaspard.io>
[groug: work around a patchew false positive in
local_set_mapped_file_attrat()]
The logic to open a path currently sits between local_open_nofollow() and
the relative_openat_nofollow() helper, which has no other user.
For the sake of clarity, this patch moves all the code of the helper into
its unique caller. While here we also:
- drop the code to skip leading "/" because the backend isn't supposed to
pass anything but relative paths without consecutive slashes. The assert()
is kept because we really don't want a buggy backend to pass an absolute
path to openat().
- use strchrnul() to get a simpler code. This is ok since virtfs is for
linux+glibc hosts only.
- don't dup() the initial directory and add an assert() to ensure we don't
return the global mountfd to the caller. BTW, this would mean that the
caller passed an empty path, which isn't supposed to happen either.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
[groug: fixed typos in changelog]
When using the mapped-file security mode, the creds of a path /foo/bar
are stored in the /foo/.virtfs_metadata/bar file. This is okay for all
paths unless they end with '.' or '..', because we cannot create the
corresponding file in the metadata directory.
This patch ensures that '.' and '..' are resolved in all paths.
The core code only passes path elements (no '/') to the backend, with
the notable exception of the '/' path, which refers to the virtfs root.
This patch preserves the current behavior of converting it to '.' so
that it can be passed to "*at()" syscalls ('/' would mean the host root).
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
These v9fs_co_name_to_path() call sites have always been around. I guess
no care was taken to check the return value because the name_to_path
operation could never fail at the time. This is no longer true: the
handle and synth backends can already fail this operation, and so will the
local backend soon.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
The utimensat() and futimens() syscalls have been around for ages (ie,
glibc 2.6 and linux 2.6.22), and the decision was already taken to
switch to utimensat() anyway when fixing CVE-2016-9602 in 2.9.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
When trying to remove a file from a directory, both created in non-mapped
mode, the file remains and EBADF is returned to the guest.
This is a regression introduced by commit "df4938a6651b 9pfs: local:
unlinkat: don't follow symlinks" when fixing CVE-2016-9602. It changed the
way we unlink the metadata file from
ret = remove("$dir/.virtfs_metadata/$name");
if (ret < 0 && errno != ENOENT) {
/* Error out */
}
/* Ignore absence of metadata */
to
fd = openat("$dir/.virtfs_metadata")
unlinkat(fd, "$name")
if (ret < 0 && errno != ENOENT) {
/* Error out */
}
/* Ignore absence of metadata */
If $dir was created in non-mapped mode, openat() fails with ENOENT and
we pass -1 to unlinkat(), which fails in turn with EBADF.
We just need to check the return of openat() and ignore ENOENT, in order
to restore the behaviour we had with remove().
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
[groug: rewrote the comments as suggested by Eric]
Only pdu_complete() needs to notify the client that a request has completed.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
These bits aren't related to the transport so let's move them to the core
code.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
This allows us to remove lots of includes of migration/migration.h
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Use the common utility function, which contains checks on return values
and first calls F_GETFD as recommended by POSIX.1-2001, instead of
manually calling fcntl.
CID: 1374831
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
CC: anthony.perard@citrix.com
CC: groug@kaod.org
CC: aneesh.kumar@linux.vnet.ibm.com
CC: Eric Blake <eblake@redhat.com>
When using the mapped-file security mode, we shouldn't let the client mess
with the metadata. The current code already tries to hide the metadata dir
from the client by skipping it in local_readdir(). But the client can still
access or modify it through several other operations. This can be used to
escalate privileges in the guest.
Affected backend operations are:
- local_mknod()
- local_mkdir()
- local_open2()
- local_symlink()
- local_link()
- local_unlinkat()
- local_renameat()
- local_rename()
- local_name_to_path()
Other operations are safe because they are only passed a fid path, which
is computed internally in local_name_to_path().
This patch converts all the functions listed above to fail and return
EINVAL when being passed the name of the metadata dir. This may look
like a poor choice for errno, but there's no such thing as an illegal
path name on Linux and I could not think of anything better.
This fixes CVE-2017-7493.
Reported-by: Leo Gaspard <leo@gaspard.io>
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJY/5EdAAoJEIlPj0hw4a6QHj8P/1D3iZq8vKyLvnkioYC00ao8
dhuCFV1Dk0dl/QXvDXxNAHFqmHp2PzHeHF2V19bTbUh9QnY3WH9aSsMQ7YVPDzLb
6ZezbxXd1l8+IOrWh+ch+x6jiUpRAeHGjhSpFxQEmH5THBmPtLHBFhAeeGpVq6CT
MPFlN0mr1snyMMbK2aKCVTHgh5Wip83/t/kijIq4RmPwXdJbwxx55PL8jxC/NZHq
/NbAZzAT149fmItfBNnsRTfNoDs7fSmgM9VelYsnJNHs6qkYYfewxys4vudePG8n
ZcjCXMv9vdCSTfXfYY9nxhfGCgkkRSvBWrzARRIUPxTXGAmEU52LJrccpG9AEFRZ
Kgz1JYr0y+u/g+RsCJvAglvmciawXgZH8GIR4sFl2iv4u6cx8PxNRgjTdt7YX4Je
V7+scmOLB0U5wkI7PUcPV1v2fwKJHFfMdyik257otfMCJiTCnWGJfCZGR4JAdy3C
NHuG4eIj2XLjZ7IQIo3mg+EfdCWdfVMKtXj0MAFsP6Kcr6sY/ef5sx/wB61k3NU1
Y4ctI/LAOONsjlC2yzJ4aZR4LgyFcVcwx8FKOK7niCcCgVLYGWpyiHU3kFKYlmKM
FKIlGPPOK8WuRV9NUGDI9A8XENyFXGyiR2xGCotfgHj+FSSqRzhKH4ecfgNimJVY
wjTzZmBa68pLHdXaJ+SV
=OfeB
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/sstabellini/tags/xen-20170421-v2-tag' into staging
Xen 2017/04/21 + fix
# gpg: Signature made Tue 25 Apr 2017 19:10:37 BST
# gpg: using RSA key 0x894F8F4870E1AE90
# gpg: Good signature from "Stefano Stabellini <stefano.stabellini@eu.citrix.com>"
# gpg: aka "Stefano Stabellini <sstabellini@kernel.org>"
# Primary key fingerprint: D04E 33AB A51F 67BA 07D3 0AEA 894F 8F48 70E1 AE90
* remotes/sstabellini/tags/xen-20170421-v2-tag: (21 commits)
move xen-mapcache.c to hw/i386/xen/
move xen-hvm.c to hw/i386/xen/
move xen-common.c to hw/xen/
add xen-9p-backend to MAINTAINERS under Xen
xen/9pfs: build and register Xen 9pfs backend
xen/9pfs: send responses back to the frontend
xen/9pfs: implement in/out_iov_from_pdu and vmarshal/vunmarshal
xen/9pfs: receive requests from the frontend
xen/9pfs: connect to the frontend
xen/9pfs: introduce Xen 9pfs backend
9p: introduce a type for the 9p header
xen: import ring.h from xen
configure: use pkg-config for obtaining xen version
xen: additionally restrict xenforeignmemory operations
xen: use libxendevice model to restrict operations
xen: use 5 digit xen versions
xen: use libxendevicemodel when available
configure: detect presence of libxendevicemodel
xen: create wrappers for all other uses of xc_hvm_XXX() functions
xen: rename xen_modified_memory() to xen_hvm_modified_memory()
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Once a request is completed, xen_9pfs_push_and_notify gets called. In
xen_9pfs_push_and_notify, update the indexes (data has already been
copied to the sg by the common code) and send a notification to the
frontend.
Schedule the bottom-half to check if we already have any other requests
pending.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
CC: anthony.perard@citrix.com
CC: jgross@suse.com
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Greg Kurz <groug@kaod.org>
Implement xen_9pfs_init_in/out_iov_from_pdu and
xen_9pfs_pdu_vmarshal/vunmarshall by creating new sg pointing to the
data on the ring.
This is safe as we only handle one request per ring at any given time.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
CC: anthony.perard@citrix.com
CC: jgross@suse.com
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Greg Kurz <groug@kaod.org>
Upon receiving an event channel notification from the frontend, schedule
the bottom half. From the bottom half, read one request from the ring,
create a pdu and call pdu_submit to handle it.
For now, only handle one request per ring at a time.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
CC: anthony.perard@citrix.com
CC: jgross@suse.com
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Greg Kurz <groug@kaod.org>
Write the limits of the backend to xenstore. Connect to the frontend.
Upon connection, allocate the rings according to the protocol
specification.
Initialize a QEMUBH to schedule work upon receiving an event channel
notification from the frontend.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
CC: anthony.perard@citrix.com
CC: jgross@suse.com
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Greg Kurz <groug@kaod.org>
Introduce the Xen 9pfs backend: add struct XenDevOps to register as a
Xen backend and add struct V9fsTransport to register as v9fs transport.
All functions are empty stubs for now.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
CC: anthony.perard@citrix.com
CC: jgross@suse.com
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Greg Kurz <groug@kaod.org>
Use the new type in virtio-9p-device.
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
CC: anthony.perard@citrix.com
CC: jgross@suse.com
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Greg Kurz <groug@kaod.org>
The local backend was recently converted to using "at*()" syscalls in order
to ensure all accesses happen below the shared directory. This requires that
we only pass relative paths, otherwise the dirfd argument to the "at*()"
syscalls is ignored and the path is treated as an absolute path in the host.
This is actually the case for paths in all fids, with the notable exception
of the root fid, whose path is "/". This causes the following backend ops to
act on the "/" directory of the host instead of the virtfs shared directory
when the export root is involved:
- lstat
- chmod
- chown
- utimensat
ie, chmod /9p_mount_point in the guest will be converted to chmod / in the
host for example. This could cause security issues with a privileged QEMU.
All "*at()" syscalls are being passed an open file descriptor. In the case
of the export root, this file descriptor points to the path in the host that
was passed to -fsdev.
The fix is thus as simple as changing the path of the export root fid to be
"." instead of "/".
This is CVE-2017-7471.
Cc: qemu-stable@nongnu.org
Reported-by: Léo Gaspard <leo@gaspard.io>
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The migration blocker survives a device reset: if the guest mounts a 9p
share and then gets rebooted with system_reset, it will be unmigratable
until it remounts and umounts the 9p share again.
This happens because the migration blocker is supposed to be cleared when
we put the last reference on the root fid, but virtfs_reset() wrongly calls
free_fid() instead of put_fid().
This patch fixes virtfs_reset() so that it honor the way fids are supposed
to be manipulated: first get a reference and later put it back when you're
done.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Li Qiang <liqiang6-s@360.cn>
If a client tries to flush the same outstanding request several times, only
the first flush completes. Subsequent ones keep waiting for the request
completion in v9fs_flush() and, therefore, leak a PDU. This will cause QEMU
to hang when draining active PDUs the next time the device is reset.
Let have each flush request wake up the next one if any. The last waiter
frees the cancelled PDU.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
The v9fs_create() and v9fs_lcreate() functions are used to create a file
on the backend and to associate it to a fid. The fid shouldn't be already
in-use, otherwise both functions may silently leak a file descriptor or
allocated memory. The current code doesn't check that.
This patch ensures that the fid isn't already associated to anything
before using it.
Signed-off-by: Li Qiang <liqiang6-s@360.cn>
(reworded the changelog, Greg Kurz)
Signed-off-by: Greg Kurz <groug@kaod.org>
Replies from the virtfs proxy are made up of a fixed-size header (8 bytes)
and a payload of variable size (maximum 64kb). When receiving a reply,
the proxy backend first reads the whole header and then unmarshals it.
If the header is okay, it then does the same operation with the payload.
Since the proxy backend uses a pre-allocated buffer which has enough room
for a header and the maximum payload size, marshalling should never fail
with fixed size arguments. Any error here is likely to result from a more
serious corruption in QEMU and we'd better dump core right away.
This patch adds error checks where they are missing and converts the
associated error paths into assertions.
This should also address Coverity's complaints CID 1348519 and CID 1348520,
about not always checking the return value of proxy_unmarshal().
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
According to the 9P spec [*], when a client wants to cancel a pending I/O
request identified by a given tag (uint16), it must send a Tflush message
and wait for the server to respond with a Rflush message before reusing this
tag for another I/O. The server may still send a completion message for the
I/O if it wasn't actually cancelled but the Rflush message must arrive after
that.
QEMU hence waits for the flushed PDU to complete before sending the Rflush
message back to the client.
If a client sends 'Tflush tag oldtag' and tag == oldtag, QEMU will then
allocate a PDU identified by tag, find it in the PDU list and wait for
this same PDU to complete... i.e. wait for a completion that will never
happen. This causes a tag and ring slot leak in the guest, and a PDU
leak in QEMU, all of them limited by the maximal number of PDUs (128).
But, worse, this causes QEMU to hang on device reset since v9fs_reset()
wants to drain all pending I/O.
This insane behavior is likely to denote a bug in the client, and it would
deserve an Rerror message to be sent back. Unfortunately, the protocol
allows it and requires all flush requests to suceed (only a Tflush response
is expected).
The only option is to detect when we have to handle a self-referencing
flush request and report success to the client right away.
[*] http://man.cat-v.org/plan_9/5/flush
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Greg Kurz <groug@kaod.org>
We should pass O_NOFOLLOW otherwise openat() will follow symlinks and make
QEMU vulnerable.
While here, we also fix local_unlinkat_common() to use openat_dir() for
the same reasons (it was a leftover in the original patchset actually).
This fixes CVE-2016-9602.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
When O_PATH is used with O_DIRECTORY, it only acts as an optimization: the
openat() syscall simply finds the name in the VFS, and doesn't trigger the
underlying filesystem.
On systems that don't define O_PATH, because they have glibc version 2.13
or older for example, we can safely omit it. We don't want to deactivate
O_PATH globally though, in case it is used without O_DIRECTORY. The is done
with a dedicated macro.
Systems without O_PATH may thus fail to resolve names that involve
unreadable directories, compared to newer systems succeeding, but such
corner case failure is our only option on those older systems to avoid
the security hole of chasing symlinks inappropriately.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
(added last paragraph to changelog as suggested by Eric Blake)
Signed-off-by: Greg Kurz <groug@kaod.org>
The name argument can never be an empty string, and dirfd always point to
the containing directory of the file name. AT_EMPTY_PATH is hence useless
here. Also it breaks build with glibc version 2.13 and older.
It is actually an oversight of a previous tentative patch to implement this
function. We can safely drop it.
Reported-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Signed-off-by: Greg Kurz <groug@kaod.org>
Tested-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Eric Blake <eblake@redhat.com>
If we cannot open the given path, we can return right away instead of
passing -1 to fstatfs() and close(). This will make Coverity happy.
(Coverity issue CID1371729)
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Daniel P. berrange <berrange@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Coverity issue CID1371731
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
This was spotted by Coverity as a fd leak. This is certainly true, but also
local_remove() would always return without doing anything, unless the fd is
zero, which is very unlikely.
(Coverity issue CID1371732)
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Now that the all callbacks have been converted to use "at" syscalls, we
can drop this code.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The local_open2() callback is vulnerable to symlink attacks because it
calls:
(1) open() which follows symbolic links for all path elements but the
rightmost one
(2) local_set_xattr()->setxattr() which follows symbolic links for all
path elements
(3) local_set_mapped_file_attr() which calls in turn local_fopen() and
mkdir(), both functions following symbolic links for all path
elements but the rightmost one
(4) local_post_create_passthrough() which calls in turn lchown() and
chmod(), both functions also following symbolic links
This patch converts local_open2() to rely on opendir_nofollow() and
mkdirat() to fix (1), as well as local_set_xattrat(),
local_set_mapped_file_attrat() and local_set_cred_passthrough() to
fix (2), (3) and (4) respectively. Since local_open2() already opens
a descriptor to the target file, local_set_cred_passthrough() is
modified to reuse it instead of opening a new one.
The mapped and mapped-file security modes are supposed to be identical,
except for the place where credentials and file modes are stored. While
here, we also make that explicit by sharing the call to openat().
This partly fixes CVE-2016-9602.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The local_mkdir() callback is vulnerable to symlink attacks because it
calls:
(1) mkdir() which follows symbolic links for all path elements but the
rightmost one
(2) local_set_xattr()->setxattr() which follows symbolic links for all
path elements
(3) local_set_mapped_file_attr() which calls in turn local_fopen() and
mkdir(), both functions following symbolic links for all path
elements but the rightmost one
(4) local_post_create_passthrough() which calls in turn lchown() and
chmod(), both functions also following symbolic links
This patch converts local_mkdir() to rely on opendir_nofollow() and
mkdirat() to fix (1), as well as local_set_xattrat(),
local_set_mapped_file_attrat() and local_set_cred_passthrough() to
fix (2), (3) and (4) respectively.
The mapped and mapped-file security modes are supposed to be identical,
except for the place where credentials and file modes are stored. While
here, we also make that explicit by sharing the call to mkdirat().
This partly fixes CVE-2016-9602.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The local_mknod() callback is vulnerable to symlink attacks because it
calls:
(1) mknod() which follows symbolic links for all path elements but the
rightmost one
(2) local_set_xattr()->setxattr() which follows symbolic links for all
path elements
(3) local_set_mapped_file_attr() which calls in turn local_fopen() and
mkdir(), both functions following symbolic links for all path
elements but the rightmost one
(4) local_post_create_passthrough() which calls in turn lchown() and
chmod(), both functions also following symbolic links
This patch converts local_mknod() to rely on opendir_nofollow() and
mknodat() to fix (1), as well as local_set_xattrat() and
local_set_mapped_file_attrat() to fix (2) and (3) respectively.
A new local_set_cred_passthrough() helper based on fchownat() and
fchmodat_nofollow() is introduced as a replacement to
local_post_create_passthrough() to fix (4).
The mapped and mapped-file security modes are supposed to be identical,
except for the place where credentials and file modes are stored. While
here, we also make that explicit by sharing the call to mknodat().
This partly fixes CVE-2016-9602.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The local_symlink() callback is vulnerable to symlink attacks because it
calls:
(1) symlink() which follows symbolic links for all path elements but the
rightmost one
(2) open(O_NOFOLLOW) which follows symbolic links for all path elements but
the rightmost one
(3) local_set_xattr()->setxattr() which follows symbolic links for all
path elements
(4) local_set_mapped_file_attr() which calls in turn local_fopen() and
mkdir(), both functions following symbolic links for all path
elements but the rightmost one
This patch converts local_symlink() to rely on opendir_nofollow() and
symlinkat() to fix (1), openat(O_NOFOLLOW) to fix (2), as well as
local_set_xattrat() and local_set_mapped_file_attrat() to fix (3) and
(4) respectively.
This partly fixes CVE-2016-9602.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The local_chown() callback is vulnerable to symlink attacks because it
calls:
(1) lchown() which follows symbolic links for all path elements but the
rightmost one
(2) local_set_xattr()->setxattr() which follows symbolic links for all
path elements
(3) local_set_mapped_file_attr() which calls in turn local_fopen() and
mkdir(), both functions following symbolic links for all path
elements but the rightmost one
This patch converts local_chown() to rely on open_nofollow() and
fchownat() to fix (1), as well as local_set_xattrat() and
local_set_mapped_file_attrat() to fix (2) and (3) respectively.
This partly fixes CVE-2016-9602.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The local_chmod() callback is vulnerable to symlink attacks because it
calls:
(1) chmod() which follows symbolic links for all path elements
(2) local_set_xattr()->setxattr() which follows symbolic links for all
path elements
(3) local_set_mapped_file_attr() which calls in turn local_fopen() and
mkdir(), both functions following symbolic links for all path
elements but the rightmost one
We would need fchmodat() to implement AT_SYMLINK_NOFOLLOW to fix (1). This
isn't the case on linux unfortunately: the kernel doesn't even have a flags
argument to the syscall :-\ It is impossible to fix it in userspace in
a race-free manner. This patch hence converts local_chmod() to rely on
open_nofollow() and fchmod(). This fixes the vulnerability but introduces
a limitation: the target file must readable and/or writable for the call
to openat() to succeed.
It introduces a local_set_xattrat() replacement to local_set_xattr()
based on fsetxattrat() to fix (2), and a local_set_mapped_file_attrat()
replacement to local_set_mapped_file_attr() based on local_fopenat()
and mkdirat() to fix (3). No effort is made to factor out code because
both local_set_xattr() and local_set_mapped_file_attr() will be dropped
when all users have been converted to use the "at" versions.
This partly fixes CVE-2016-9602.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The local_link() callback is vulnerable to symlink attacks because it calls:
(1) link() which follows symbolic links for all path elements but the
rightmost one
(2) local_create_mapped_attr_dir()->mkdir() which follows symbolic links
for all path elements but the rightmost one
This patch converts local_link() to rely on opendir_nofollow() and linkat()
to fix (1), mkdirat() to fix (2).
This partly fixes CVE-2016-9602.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
When using the mapped-file security model, we also have to create a link
for the metadata file if it exists. In case of failure, we should rollback.
That's what this patch does.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The local_rename() callback is vulnerable to symlink attacks because it
uses rename() which follows symbolic links in all path elements but the
rightmost one.
This patch simply transforms local_rename() into a wrapper around
local_renameat() which is symlink-attack safe.
This partly fixes CVE-2016-9602.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The local_renameat() callback is currently a wrapper around local_rename()
which is vulnerable to symlink attacks.
This patch rewrites local_renameat() to have its own implementation, based
on local_opendir_nofollow() and renameat().
This partly fixes CVE-2016-9602.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The local_lstat() callback is vulnerable to symlink attacks because it
calls:
(1) lstat() which follows symbolic links in all path elements but the
rightmost one
(2) getxattr() which follows symbolic links in all path elements
(3) local_mapped_file_attr()->local_fopen()->openat(O_NOFOLLOW) which
follows symbolic links in all path elements but the rightmost
one
This patch converts local_lstat() to rely on opendir_nofollow() and
fstatat(AT_SYMLINK_NOFOLLOW) to fix (1), fgetxattrat_nofollow() to
fix (2).
A new local_fopenat() helper is introduced as a replacement to
local_fopen() to fix (3). No effort is made to factor out code
because local_fopen() will be dropped when all users have been
converted to call local_fopenat().
This partly fixes CVE-2016-9602.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>