Currently we created a thread pool (With 64 max threads per pool) for
each virtqueue. We hoped that this will provide us with better scalability
and performance.
But in practice, we are getting better numbers in most of the cases
when we don't create a thread pool at all and a single thread per
virtqueue receives the request and processes it.
Hence, I am proposing that we switch to no thread pool by default
(equivalent of --thread-pool-size=0). This will provide out of
box better performance to most of the users. In fact other users
have confirmed that not using a thread pool gives them better
numbers. So why not use this as default. It can be changed when
somebody can fix the issues with thread pool performance.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20210210182744.27324-2-vgoyal@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
This patch adds basic support for FUSE_HANDLE_KILLPRIV_V2. virtiofsd
can enable/disable this by specifying option "-o killpriv_v2/no_killpriv_v2".
By default this is enabled as long as client supports it
Enabling this option helps with performance in write path. Without this
option, currently every write is first preceeded with a getxattr() operation
to find out if security.capability is set. (Write is supposed to clear
security.capability). With this option enabled, server is signing up for
clearing security.capability on every WRITE and also clearing suid/sgid
subject to certain rules. This gets rid of extra getxattr() call for every
WRITE and improves performance. This is true when virtiofsd is run with
option -o xattr.
What does enabling FUSE_HANDLE_KILLPRIV_V2 mean for file server implementation.
It needs to adhere to following rules. Thanks to Miklos for this summary.
- clear "security.capability" on write, truncate and chown unconditionally
- clear suid/sgid in case of following. Note, sgid is cleared only if
group executable bit is set.
o setattr has FATTR_SIZE and FATTR_KILL_SUIDGID set.
o setattr has FATTR_UID or FATTR_GID
o open has O_TRUNC and FUSE_OPEN_KILL_SUIDGID
o create has O_TRUNC and FUSE_OPEN_KILL_SUIDGID flag set.
o write has FUSE_WRITE_KILL_SUIDGID
>From Linux VFS client perspective, here are the requirements.
- caps are always cleared on chown/write/truncate
- suid is always cleared on chown, while for truncate/write it is cleared
only if caller does not have CAP_FSETID.
- sgid is always cleared on chown, while for truncate/write it is cleared
only if caller does not have CAP_FSETID as well as file has group execute
permission.
virtiofsd implementation has not changed much to adhere to above ruls. And
reason being that current assumption is that we are running on Linux
and on top of filesystems like ext4/xfs which already follow above rules.
On write, truncate, chown, seucurity.capability is cleared. And virtiofsd
drops CAP_FSETID if need be and that will lead to clearing of suid/sgid.
But if virtiofsd is running on top a filesystem which breaks above assumptions,
then it will have to take extra actions to emulate above. That's a TODO
for later when need arises.
Note: create normally is supposed to be called only when file does not
exist. So generally there should not be any question of clearing
setuid/setgid. But it is possible that after client checks that
file is not present, some other client creates file on server
and this race can trigger sending FUSE_CREATE. In that case, if
O_TRUNC is set, we should clear suid/sgid if FUSE_OPEN_KILL_SUIDGID
is also set.
v3:
- Resolved conflicts due to lo_inode_open() changes.
- Moved capability code in lo_do_open() so that both lo_open() and
lo_create() can benefit from common code.
- Dropped changes to kernel headers as these are part of qemu already.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20210208224024.43555-3-vgoyal@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Change error code handling slightly in lo_setattr(). Right now we seem
to jump to out_err and assume that "errno" is valid and use that to
send reply.
But if caller has to do some other operations before jumping to out_err,
then it does the dance of first saving errno to saverr and the restore
errno before jumping to out_err. This makes it more confusing.
I am about to make more changes where caller will have to do some
work after error before jumping to out_err. I found it easier to
change the convention a bit. That is caller saves error in "saverr"
before jumping to out_err. And out_err uses "saverr" to send error
back and does not rely on "errno" having actual error.
v3: Resolved conflicts in lo_setattr() due to lo_inode_open() changes.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20210208224024.43555-2-vgoyal@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Follow the inclusive terminology from the "Conscious Language in your
Open Source Projects" guidelines [*] and replace the words "whitelist"
appropriately.
[*] https://github.com/conscious-lang/conscious-lang-docs/blob/main/faq.md
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210205171817.2108907-3-philmd@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
pthread_rwlock_rdlock() and pthread_rwlock_wrlock() can fail if a
deadlock condition is detected or the current thread already owns
the lock. They can also fail, like pthread_rwlock_unlock(), if the
mutex wasn't properly initialized. None of these are ever expected
to happen with fv_VuDev::vu_dispatch_rwlock.
Some users already check the return value and assert, some others
don't. Introduce rdlock/wrlock/unlock wrappers that just do the
former and use them everywhere for improved consistency and
robustness.
This is just cleanup. It doesn't fix any actual issue.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210203182434.93870-1-groug@kaod.org>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
This is how linux restarts some system calls after SIGSTOP/SIGCONT.
This is needed to avoid virtiofsd termination when resuming execution
under GDB for example.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210201193305.136390-1-groug@kaod.org>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
This is how glibc implements lseek(2) on POWER.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1917692
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210121171540.1449777-1-groug@kaod.org>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
A well-behaved FUSE client does not attempt to open special files with
FUSE_OPEN because they are handled on the client side (e.g. device nodes
are handled by client-side device drivers).
The check to prevent virtiofsd from opening special files is missing in
a few cases, most notably FUSE_OPEN. A malicious client can cause
virtiofsd to open a device node, potentially allowing the guest to
escape. This can be exploited by a modified guest device driver. It is
not exploitable from guest userspace since the guest kernel will handle
special files inside the guest instead of sending FUSE requests.
This patch fixes this issue by introducing the lo_inode_open() function
to check the file type before opening it. This is a short-term solution
because it does not prevent a compromised virtiofsd process from opening
device nodes on the host.
Restructure lo_create() to try O_CREAT | O_EXCL first. Note that O_CREAT
| O_EXCL does not follow symlinks, so O_NOFOLLOW masking is not
necessary here. If the file exists and the user did not specify O_EXCL,
open it via lo_do_open().
Reported-by: Alex Xu <alex@alxu.ca>
Fixes: CVE-2020-35517
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20210204150208.367837-4-stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
lo_do_lookup() finds an existing inode or allocates a new one. It
increments nlookup so that the inode stays alive until the client
releases it.
Existing callers don't need the struct lo_inode so the function doesn't
return it. Extend the function to optionally return the inode. The next
commit will need it.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Message-Id: <20210204150208.367837-3-stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Both lo_open() and lo_create() have similar code to open a file. Extract
a common lo_do_open() function from lo_open() that will be used by
lo_create() in a later commit.
Since lo_do_open() does not otherwise need fuse_req_t req, convert
lo_add_fd_mapping() to use struct lo_data *lo instead.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20210204150208.367837-2-stefanha@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
The 'ch' will be NULL in the following stack:
send_notify_iov()->fuse_send_msg()->virtio_send_msg(), and
this may lead to NULL pointer dereferenced in virtio_send_msg().
But send_notify_iov() was never called, so remove the useless code
about send_notify_iov() to fix this problem.
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Message-Id: <20201214121615.29967-1-alex.chen@huawei.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Miklos confirms it's *only* the FUSE_FORGET request that the client can
use for decrementing "lo_inode.nlookup".
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Fixes: 1222f01555
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Message-Id: <20201208073936.8629-1-lersek@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Currently lo_flush() is written in such a way that it expects to receive
a FLUSH requests on a regular file (and not directories). For example,
we call lo_fi_fd() which searches lo->fd_map. If we open directories
using opendir(), we keep don't keep track of these in lo->fd_map instead
we keep them in lo->dir_map. So we expect lo_flush() to be called on
regular files only.
Even linux fuse client calls FLUSH only for regular files and not
directories. So put a check for filetype and return EBADF if
lo_flush() is called on a non-regular file.
Reported-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20201211142544.GB3285@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
If remote posix locks are not enabled (lo->posix_lock == false), then disable
code paths taken to initialize inode->posix_lock hash table and corresponding
destruction and search etc.
lo_getlk() and lo_setlk() have been modified to return ENOSYS if daemon
does not support posix lock but client still sends a lock/unlock request.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20201207183021.22752-3-vgoyal@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
We setup per inode hash table ->posix_lock to support remote posix locks.
But we forgot to initialize this table for root inode.
Laszlo managed to trigger an issue where he sent a FUSE_FLUSH request for
root inode and lo_flush() found inode with inode->posix_lock NULL and
accessing this table crashed virtiofsd.
May be we can get rid of initializing this hash table for directory
objects completely. But that optimization is for another day.
Reported-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20201207195539.GB3107@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
The current timestamp format doesn't help me visually notice small jumps
in time ("small" as defined on human scale, such as a few seconds or a few
ten seconds). Replace it with a local time format where such differences
stand out.
Before:
> [13316826770337] [ID: 00000004] unique: 62, opcode: RELEASEDIR (29), nodeid: 1, insize: 64, pid: 1
> [13316826778175] [ID: 00000004] unique: 62, success, outsize: 16
> [13316826781156] [ID: 00000004] virtio_send_msg: elem 0: with 1 in desc of length 16
> [15138279317927] [ID: 00000001] virtio_loop: Got VU event
> [15138279504884] [ID: 00000001] fv_queue_set_started: qidx=1 started=0
> [15138279519034] [ID: 00000003] fv_queue_thread: kill event on queue 1 - quitting
> [15138280876463] [ID: 00000001] fv_remove_watch: TODO! fd=9
> [15138280897381] [ID: 00000001] virtio_loop: Waiting for VU event
> [15138280946834] [ID: 00000001] virtio_loop: Got VU event
> [15138281175421] [ID: 00000001] virtio_loop: Waiting for VU event
> [15138281182387] [ID: 00000001] virtio_loop: Got VU event
> [15138281189474] [ID: 00000001] virtio_loop: Waiting for VU event
> [15138309321936] [ID: 00000001] virtio_loop: Unexpected poll revents 11
> [15138309434150] [ID: 00000001] virtio_loop: Exit
(Notice how you don't (easily) notice the gap in time after
"virtio_send_msg", and especially the amount of time passed is hard to
estimate.)
After:
> [2020-12-08 06:43:22.58+0100] [ID: 00000004] unique: 51, opcode: RELEASEDIR (29), nodeid: 1, insize: 64, pid: 1
> [2020-12-08 06:43:22.58+0100] [ID: 00000004] unique: 51, success, outsize: 16
> [2020-12-08 06:43:22.58+0100] [ID: 00000004] virtio_send_msg: elem 0: with 1 in desc of length 16
> [2020-12-08 06:43:29.34+0100] [ID: 00000001] virtio_loop: Got VU event
> [2020-12-08 06:43:29.34+0100] [ID: 00000001] fv_queue_set_started: qidx=1 started=0
> [2020-12-08 06:43:29.34+0100] [ID: 00000003] fv_queue_thread: kill event on queue 1 - quitting
> [2020-12-08 06:43:29.34+0100] [ID: 00000001] fv_remove_watch: TODO! fd=9
> [2020-12-08 06:43:29.34+0100] [ID: 00000001] virtio_loop: Waiting for VU event
> [2020-12-08 06:43:29.34+0100] [ID: 00000001] virtio_loop: Got VU event
> [2020-12-08 06:43:29.34+0100] [ID: 00000001] virtio_loop: Waiting for VU event
> [2020-12-08 06:43:29.34+0100] [ID: 00000001] virtio_loop: Got VU event
> [2020-12-08 06:43:29.34+0100] [ID: 00000001] virtio_loop: Waiting for VU event
> [2020-12-08 06:43:29.37+0100] [ID: 00000001] virtio_loop: Unexpected poll revents 11
> [2020-12-08 06:43:29.37+0100] [ID: 00000001] virtio_loop: Exit
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Message-Id: <20201208055043.31548-1-lersek@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Right now we create a thread pool and main thread hands over the request
to thread in thread pool to process. Number of threads in thread pool
can be managed by option --thread-pool-size.
In tests we have noted that many of the workloads are getting better
performance if we don't use a thread pool at all and process all
the requests in the context of a thread receiving the request.
Hence give user an option to be able to run virtiofsd without using
a thread pool.
To implement this, I have used existing option --thread-pool-size. This
option defines how many maximum threads can be in the thread pool.
Thread pool size zero freezes thead pool. I can't see why will one
start virtiofsd with a frozen thread pool (hence frozen file system).
So I am redefining --thread-pool-size=0 to mean, don't use a thread pool.
Instead process the request in the context of thread receiving request
from the queue.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20201109143548.GA1479853@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
This allows to get rid of a check for older GCC version (which was a bit
bogus too since it was falling back on c++ version..)
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20201210134752.780923-7-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes, with the changes
to the following files manually reverted:
contrib/libvhost-user/libvhost-user-glib.h
contrib/libvhost-user/libvhost-user.c
contrib/libvhost-user/libvhost-user.h
contrib/plugins/hotblocks.c
contrib/plugins/hotpages.c
contrib/plugins/howvec.c
contrib/plugins/lockstep.c
linux-user/mips64/cpu_loop.c
linux-user/mips64/signal.c
linux-user/sparc64/cpu_loop.c
linux-user/sparc64/signal.c
linux-user/x86_64/cpu_loop.c
linux-user/x86_64/signal.c
target/s390x/gen-features.c
tests/fp/platform.h
tests/migration/s390x/a-b-bios.c
tests/plugin/bb.c
tests/plugin/empty.c
tests/plugin/insn.c
tests/plugin/mem.c
tests/test-rcu-simpleq.c
tests/test-rcu-slist.c
tests/test-rcu-tailq.c
tests/uefi-test-tools/UefiTestToolsPkg/BiosTablesTest/BiosTablesTest.c
contrib/plugins/, tests/plugin/, and tests/test-rcu-slist.c appear not
to include osdep.h intentionally. The remaining reverts are the same
as in commit bbfff19688.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20201113061216.2483385-1-armbru@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Alexander Bulekov <alxndr@bu.edu>
By making libvhost-user a subproject, check it builds
standalone (without the global QEMU cflags etc).
Note that the library still relies on QEMU include/qemu/atomic.h and
linux_headers/.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20201125100640.366523-6-marcandre.lureau@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In main func, strdup lo.source may fail. So check whether strdup
lo.source return NULL before using it.
Signed-off-by: Haotian Li <lihaotian9@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Message-Id: <f1e48ca8-d6de-d901-63c8-4f4024bda518@huawei.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
In main func, func lo_map_reserve is called without NULL check.
If reallocing new_elems fails in func lo_map_grow, the func
lo_map_reserve may return NULL. We should check whether
lo_map_reserve returns NULL before using it.
Signed-off-by: Haotian Li <lihaotian9@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Message-Id: <48887813-1c95-048c-6d10-48e3dd2bac71@huawei.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
In fuse_bufvec_advance func, calling fuse_bufvec_current func
may return NULL, so we should check whether buf is NULL before
using it.
Signed-off-by: Haotian Li <lihaotian9@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Message-Id: <29fc87c2-b87c-4c34-40d4-75381f228849@huawei.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Contrary to what the check (and warning) in lo_init() claims, we can
announce submounts just fine even without statx() -- the check is based
on comparing both the mount ID and st_dev of parent and child. Without
statx(), we will not have the mount ID; but we always have st_dev.
The only problems we have (without statx() and its mount ID) are:
(1) Mounting the same device twice may lead to both trees being treated
as exactly the same tree by virtiofsd. But that is a problem that
is completely independent of mirroring host submounts in the guest.
Both submount roots will still show the FUSE_SUBMOUNT flag, because
their st_dev still differs from their respective parent.
(2) There is only one exception to (1), and that is if you mount a
device inside a mount of itself: Then, its st_dev will be the same
as that of its parent, and so without a mount ID, virtiofsd will not
be able to recognize the nested mount's root as a submount.
However, thanks to virtiofsd then treating both trees as exactly the
same tree, it will be caught up in a loop when the guest tries to
examine the nested submount, so the guest will always see nothing
but an ELOOP there. Therefore, this case is just fully broken
without statx(), whether we check for submounts (based on st_dev) or
not.
All in all, checking for submounts works well even without comparing the
mount ID (i.e., without statx()). The only concern is an edge case
that, without statx() mount IDs, is utterly broken anyway.
Thus, drop said check in lo_init().
Reported-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201103164135.169325-1-mreitz@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
The option `libexecdir` is relative to `prefix` (see
https://mesonbuild.com/Builtin-options.html), so we have to be aware
of this when creating 50-qemu-gpu.json and
50-qemu-virtiofsd.json. Otherwise, tools like libvirt will not be able
to find the executable.
Fixes: 16bf7a3326 ("configure: move directory options from config-host.mak to meson")
Signed-off-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Message-Id: <20201103112333.24734-1-mhartmay@linux.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Whenever we encounter a directory with an st_dev or mount ID that
differs from that of its parent, we set the FUSE_ATTR_SUBMOUNT flag so
the guest can create a submount for it.
We only need to do so in lo_do_lookup(). The following functions return
a fuse_attr object:
- lo_create(), though fuse_reply_create(): Calls lo_do_lookup().
- lo_lookup(), though fuse_reply_entry(): Calls lo_do_lookup().
- lo_mknod_symlink(), through fuse_reply_entry(): Calls lo_do_lookup().
- lo_link(), through fuse_reply_entry(): Creating a link cannot create a
submount, so there is no need to check for it.
- lo_getattr(), through fuse_reply_attr(): Announcing submounts when the
node is first detected (at lookup) is sufficient. We do not need to
return the submount attribute later.
- lo_do_readdir(), through fuse_add_direntry_plus(): Calls
lo_do_lookup().
Make announcing submounts optional, so submounts are only announced to
the guest with the announce_submounts option. Some users may prefer the
current behavior, so that the guest learns nothing about the host mount
structure.
(announce_submounts is force-disabled when the guest does not present
the FUSE_SUBMOUNTS capability, or when there is no statx().)
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201102161859.156603-6-mreitz@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Using st_dev is not sufficient to uniquely identify a mount: You can
mount the same device twice, but those are still separate trees, and
e.g. by mounting something else inside one of them, they may differ.
Using statx(), we can get a mount ID that uniquely identifies a mount.
If that is available, add it to the lo_inode key.
Most of this patch is taken from Miklos's mail here:
https://marc.info/?l=fuse-devel&m=160062521827983
(virtiofsd-use-mount-id.patch attachment)
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201102161859.156603-5-mreitz@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
fuse_entry_param is converted to fuse_attr on the line (by
fill_entry()), so it should have a member that mirrors fuse_attr.flags.
fill_entry() should then copy this fuse_entry_param.attr_flags to
fuse_attr.flags.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201102161859.156603-3-mreitz@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
FUSE_SUBMOUNTS is a pure indicator by the kernel to signal that it
supports submounts. It does not check its state in the init reply, so
there is nothing for fuse_lowlevel.c to do but to check its existence
and copy it into fuse_conn_info.capable.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201102161859.156603-2-mreitz@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
The commit 88fc107956 disabled remote
posix locks by default. But the --help message still says it is enabled
by default. So fix it to output no_posix_lock.
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Message-Id: <20201027081558.29904-1-zhangjiachen.jaycee@bytedance.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Since commit 6f5fd83788, vu_init() can fail if malloc() returns NULL.
This fixes the following Coverity warning:
CID 1435958 (#1 of 1): Unchecked return value (CHECKED_RETURN)
Fixes: 6f5fd83788 ("libvhost-user: support many virtqueues")
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20201102092339.2034297-1-philmd@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
On ppc, and some other archs, it looks like syslog ends up using 'send'
rather than 'sendto'.
Reference: https://github.com/kata-containers/kata-containers/issues/1050
Reported-by: amulmek1@in.ibm.com
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20201102150750.34565-1-dgilbert@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
This reverts the following commits due to their basis on a bogus
linux kernel header update:
c93a656f7b ("tests/acceptance: Add virtiofs_submounts.py")
45ced7ca2f ("tests/acceptance/boot_linux: Accept SSH pubkey")
08dce386e7 ("virtiofsd: Announce sub-mount points")
eba8b096c1 ("virtiofsd: Store every lo_inode's parent_dev")
ede24b6be7 ("virtiofsd: Add fuse_reply_attr_with_flags()")
e2577435d3 ("virtiofsd: Add attr_flags to fuse_entry_param")
2f10415abf ("virtiofsd: Announce FUSE_ATTR_FLAGS")
97d741cc96 ("linux/fuse.h: Pull in from Linux")
Cc: Max Reitz <mreitz@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 160385090886.20017.13382256442750027666.stgit@gimli.home
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Whenever we encounter a directory with an st_dev that differs from that
of its parent, we set the FUSE_ATTR_SUBMOUNT flag so the guest can
create a submount for it.
Make this behavior optional, so submounts are only announced to the
guest with the announce_submounts option. Some users may prefer the
current behavior, so that the guest learns nothing about the host mount
structure.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200909184028.262297-7-mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Manual merge
We want to detect mount points in the shared tree. We report them to
the guest by setting the FUSE_ATTR_SUBMOUNT flag in fuse_attr.flags, but
because the FUSE client will create a submount for every directory that
has this flag set, we must do this only for the actual mount points.
We can detect mount points by comparing a directory's st_dev with its
parent's st_dev. To be able to do so, we need to store the parent's
st_dev in the lo_inode object.
Note that mount points need not necessarily be directories; a single
file can be a mount point as well. However, for the sake of simplicity
let us ignore any non-directory mount points for now.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200909184028.262297-6-mreitz@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
The plain fuse_reply_attr() function does not allow setting
fuse_attr.flags, so add this new function that does.
Make fuse_reply_attr() a wrapper around it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200909184028.262297-5-mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
fuse_entry_param is converted to fuse_attr on the line (by
fill_entry()), so it should have a member that mirrors fuse_attr.flags.
fill_entry() should then copy this fuse_entry_param.attr_flags to
fuse_attr.flags.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200909184028.262297-4-mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
The fuse_attr.flags field is currently just initialized to 0, which is
valid. Thus, there is no reason not to always announce FUSE_ATTR_FLAGS
(when the kernel supports it).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20200909184028.262297-3-mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
The mapping rule system implemented in the last few patches is
extremely flexible, but not easy to use. Add a simple
'map' type as a sprinkling of sugar to make it easy.
e.g.
-o xattrmap=":map::user.virtiofs.:"
would be sufficient to prefix all xattr's
or
-o xattrmap=":map:trusted.:user.virtiofs.:"
would just prefix 'trusted.' xattr's and leave
everything else alone.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20201023165812.36028-6-dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Map xattr names coming from the server, i.e. the host filesystem;
currently this is only from listxattr.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20201023165812.36028-4-dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Map xattr names originating at the client; from get/set/remove xattr.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20201023165812.36028-3-dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Add an option to define mappings of xattr names so that
the client and server filesystems see different views.
This can be used to have different SELinux mappings as
seen by the guest, to run the virtiofsd with less privileges
(e.g. in a case where it can't set trusted/system/security
xattrs but you want the guest to be able to), or to isolate
multiple users of the same name; e.g. trusted attributes
used by stacking overlayfs.
A mapping engine is used with 3 simple rules; the rules can
be combined to allow most useful mapping scenarios.
The ruleset is defined by -o xattrmap='rules...'.
This patch doesn't use the rule maps yet.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20201023165812.36028-2-dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
virtiofsd cannot run in a container because CAP_SYS_ADMIN is required to
create namespaces.
Introduce a weaker sandbox mode that is sufficient in container
environments because the container runtime already sets up namespaces.
Use chroot to restrict path traversal to the shared directory.
virtiofsd loses the following:
1. Mount namespace. The process chroots to the shared directory but
leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
syscalls.
2. Pid namespace. This should be fine because virtiofsd is the only
process running in the container.
3. Network namespace. This should be fine because seccomp already
rejects the connect(2) syscall, but an additional layer of security
is lost. Container runtime-specific network security policies can be
used drop network traffic (except for the vhost-user UNIX domain
socket).
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201008085534.16070-1-stefanha@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Just noticed that although help message says default log level is INFO,
it is actually 0 (EMRGE) and no mesage will be shown when error occurs.
It's better to follow help message.
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Message-Id: <20201008110148.2757734-1-misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Since installation is not part of Makefiles anymore, Make need not
know the directories anymore. Meson already knows them through
built-in options, do everything using them instead of the config_host
dictionary.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow vu_message_read to be replaced by one which will make use of the
QIOChannel functions. Thus reading vhost-user message won't stall the
guest. For slave channel, we still use the default vu_message_read.
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20200918080912.321299-2-coiby.xu@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In order to prevent /proc/self/fd escapes a temporary directory is
created where /proc/self/fd is bind-mounted. This doesn't work on
read-only file systems.
Avoid the temporary directory by bind-mounting /proc/self/fd over /proc.
This does not affect other processes since we remounted / with MS_REC |
MS_SLAVE. /proc must exist and virtiofsd does not use it so it's safe to
do this.
Path traversal can be tested with the following function:
static void test_proc_fd_escape(struct lo_data *lo)
{
int fd;
int level = 0;
ino_t last_ino = 0;
fd = lo->proc_self_fd;
for (;;) {
struct stat st;
if (fstat(fd, &st) != 0) {
perror("fstat");
return;
}
if (last_ino && st.st_ino == last_ino) {
fprintf(stderr, "inode number unchanged, stopping\n");
return;
}
last_ino = st.st_ino;
fprintf(stderr, "Level %d dev %lu ino %lu\n", level,
(unsigned long)st.st_dev,
(unsigned long)last_ino);
fd = openat(fd, "..", O_PATH | O_DIRECTORY | O_NOFOLLOW);
level++;
}
}
Before and after this patch only Level 0 is displayed. Without
/proc/self/fd bind-mount protection it is possible to traverse parent
directories.
Fixes: 397ae982f4 ("virtiofsd: jail lo->proc_self_fd")
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Jens Freimann <jfreimann@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201006095826.59813-1-stefanha@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Tested-by: Jens Freimann <jfreimann@redhat.com>
Reviewed-by: Jens Freimann <jfreimann@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Since fcb4f59c87 qemu_get_local_state_pathname relies on the
init_exec_dir, and virtiofsd asserts because we never set it.
Set it.
Reported-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20201002124015.44820-1-dgilbert@redhat.com>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
If you like running QEMU as a normal user (very common for TCG runs)
but you have to run virtiofsd as a root user you run into connection
problems. Adding support for an optional --socket-group allows the
users to keep using the command line.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200925125147.26943-2-alex.bennee@linaro.org>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
dgilbert: Split long line