If the time pointer is null, then write permission
on the file is also sufficient.
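
A minimal sketch of that rule, as a userland model rather than the
actual kernel check. The struct, the function name, and the choice of
EACCES versus EPERM (which follows the usual utimes convention) are
assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of the caller's relationship to the file. */
struct times_access {
	bool is_owner;		/* caller owns the file */
	bool is_privileged;	/* caller has superuser-like privilege */
	bool can_write;		/* caller has write permission on the file */
};

/*
 * Return 0 if the caller may set the file's timestamps, else an errno.
 * times_is_null models a null time pointer, i.e. "set to now".
 */
static int
check_set_times(const struct times_access *a, bool times_is_null)
{
	if (a->is_owner || a->is_privileged)
		return 0;
	/* With a null time pointer, write permission is also enough. */
	if (times_is_null && a->can_write)
		return 0;
	return times_is_null ? EACCES : EPERM;
}
```

A group member with write permission but no ownership passes only in
the null-pointer case.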
From FreeBSD.
Should fix PR kern/57246 "NFS group permissions regression"
- If there's a prior concurrent close, it must have interrupted this
open.
- If there's a new concurrent close, it must wait until this open has
released device_lock before it can revoke.
There seems to be a bug here but I'm not sure what it is yet:
https://mail-index.netbsd.org/current-users/2022/08/09/msg042800.html
https://syzkaller.appspot.com/bug?id=47c67ab6d3a87514d0707882a9ad6671beaa8642
The decision to actually invoke d_close is serialized under
device_lock, so it should not be possible for more than one process
to close at the same time, but syzbot and kre found a way for
sd_closing to be false later in spec_close. Let's make sure it's
false when we're making what should be the exclusive decision to
close.
We can't assert !sd_opened before cancel and spec_io_drain, because
those are necessary to interrupt and wait for pending opens that
might later set sd_opened, but we can assert !sd_opened afterward
because once sd_closing is true nothing should set sd_opened.
If specified, when revoking a device node or closing its last open
node, specfs will:
1. Call d_cancel, which should return promptly without blocking.
2. Wait for all concurrent d_read/write/ioctl/&c. to drain.
3. Call d_close.
Otherwise, specfs will:
1. Call d_close.
2. Wait for all concurrent d_read/write/ioctl/&c. to drain.
This fallback is problematic because often parts of d_close rely on
concurrent devsw operations to have completed already, so it is up to
each driver to have its own mechanism for waiting, and the extra step
in (2) is almost redundant. But it is still important to ensure that
devsw operations are not active by the time a module tries to invoke
devsw_detach, because only d_open is protected against that.
The signature of d_cancel matches d_close, mostly so we don't raise
questions about `why is this different?'; the lwp argument is not
useful but we should remove it from open/cancel/close all at the same
time.
The only way d_cancel should fail, if it does at all, is with ENODEV,
meaning the driver doesn't support cancelling outstanding I/O, and
will take responsibility for that in d_close. I would make it return
void and only have bdev_cancel and cdev_cancel possibly return ENODEV
so specfs can detect whether a driver supports it, but this would
break the pattern around devsw operation types.
Drivers are allowed to omit it from struct bdevsw, struct cdevsw --
if so, it is as if they used a function that just returns ENODEV.
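
The two orderings above can be sketched as a userland model. This is
not the specfs code: struct toy_devsw, the one-argument signatures
(the real d_cancel takes the same arguments as d_close), the trace
buffer, and the ex_* driver functions are all hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct cdevsw; only the members used here. */
struct toy_devsw {
	int	(*d_cancel)(int);	/* may be NULL if the driver omits it */
	int	(*d_close)(int);
};

/* Trace of the calls made, so the ordering can be inspected. */
static char toy_trace[16];
static size_t toy_ntrace;

static void
toy_trace_reset(void)
{
	memset(toy_trace, 0, sizeof(toy_trace));
	toy_ntrace = 0;
}

static void
toy_note(char c)
{
	if (toy_ntrace < sizeof(toy_trace) - 1)
		toy_trace[toy_ntrace++] = c;
}

/* Example driver entry points and a stand-in for the drain step. */
static int ex_cancel(int dev) { (void)dev; toy_note('C'); return 0; }
static int ex_close(int dev) { (void)dev; toy_note('X'); return 0; }
static void toy_drain(int dev) { (void)dev; toy_note('D'); }

/* An omitted d_cancel behaves like one that just returns ENODEV. */
static int
toy_cancel(const struct toy_devsw *sw, int dev)
{
	if (sw->d_cancel == NULL)
		return ENODEV;
	return sw->d_cancel(dev);
}

/* Last close or revoke: pick the ordering based on d_cancel. */
static int
toy_last_close(const struct toy_devsw *sw, int dev)
{
	int error;

	if (toy_cancel(sw, dev) == 0) {
		/* Preferred: cancel, drain, then close. */
		toy_drain(dev);
		return sw->d_close(dev);
	}
	/* Fallback: close first, then drain. */
	error = sw->d_close(dev);
	toy_drain(dev);
	return error;
}
```

With ex_cancel installed the trace reads "CDX"; with d_cancel left
NULL it reads "XD".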
XXX kernel ABI change to struct bdevsw/cdevsw requires bump
Previously, it was possible for spec_node_lookup_by_dev to handle a
specnode that a concurrent spec_node_destroy is about to remove from
the hash table and then free, as soon as spec_node_lookup_by_dev
releases device_lock.
Now, the ordering is:
1. Remove specnode from hash table in spec_node_revoke. At this
point, no _new_ vnode references are possible (other than possibly
one acquired by vcache_vget under v_interlock), but there may be
existing ones.
2. Mark vnode reclaimed so vcache_vget will fail.
3. The last vrele (or equivalent logic in vcache_vget) will then free
the specnode in spec_node_destroy.
This way, _if_ a thread in spec_node_lookup_by_dev finds a specnode
in the hash table under device_lock/v_interlock, _then_ it will not
be freed until the thread completes vcache_vget.
This change requires calling spec_node_revoke unconditionally for
device special nodes, not just for active ones. Might introduce
slightly more contention on device_lock but not much because we
already have to take it in this path anyway a little later in
spec_node_destroy.
vdevgone relies on this to ensure that if there is a concurrent
revoke in progress, it will wait for that revoke to finish -- that
way, it can guarantee all I/O operations have completed and the
device is closed.
- Revoke is used to invalidate all prior access control checks when
device permissions are changing, so it must wait for .d_open to exit
so any new access must go through new access control checks.
- Revoke is used by vdevgone in xyz_detach to wait until all use of
the driver's data structures have completed before xyz_detach frees
them.
So we need to make sure spec_close waits for .d_open too.
Otherwise, bdev/cdev_close could have cancelled all _existing_ opens,
and waited for them to complete (and freed resources used by them) --
but a new one could start, and hang (e.g., a tty), at the same time
spec_close tries to drain all pending I/O operations, one of which
(the new open) is now hanging indefinitely.
Preventing the new open from even starting until bdev/cdev_close is
finished and all I/O operations have drained avoids this deadlock.
This is not quite correct. We _should_ require the caller to hold a
vnode lock around spec_node_getmountedfs, and an exclusive vnode lock
around spec_node_setmountedfs, so that it is only necessary to check
whether revoke has already happened, not hold an I/O reference.
Unfortunately, various callers in various file systems don't follow
this sensible rule. So let's at least make sure the vnode can't be
revoked in spec_node_setmountedfs, while we're in bdev_ioctl, and
leave a comment explaining what the sorry state of affairs is and how
to fix it later.
New kind of I/O reference on specdevs, sd_iocnt. This could be done
with psref instead; I chose a reference count instead for now because
we already have to take a per-object lock anyway, v_interlock, for
vdead_check, so another atomic is not likely to hurt much more. We
can always change the mechanism inside spec_io_enter/exit/drain later
on.
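
A rough userland model of the enter/exit/drain pattern, assuming C11
atomics in place of v_interlock and vdead_check. The toy_* names, the
dead flag, the ENXIO return value, and the yield loop in drain are
illustrative assumptions, not the kernel's implementation.

```c
#include <errno.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for per-device state; not the kernel's struct specdev. */
struct toy_specdev {
	atomic_uint	iocnt;	/* I/O operations in flight */
	atomic_bool	dead;	/* set once revoke has begun */
};

static void
toy_specdev_init(struct toy_specdev *sd)
{
	atomic_init(&sd->iocnt, 0);
	atomic_init(&sd->dead, false);
}

/* Take an I/O reference, or fail if the device is being revoked. */
static int
toy_io_enter(struct toy_specdev *sd)
{
	/* Count first, then check, so drain cannot miss us. */
	atomic_fetch_add(&sd->iocnt, 1);
	if (atomic_load(&sd->dead)) {
		atomic_fetch_sub(&sd->iocnt, 1);
		return ENXIO;
	}
	return 0;
}

/* Release an I/O reference taken by toy_io_enter. */
static void
toy_io_exit(struct toy_specdev *sd)
{
	atomic_fetch_sub(&sd->iocnt, 1);
}

/* Mark the device dead so new toy_io_enter calls fail. */
static void
toy_revoke_begin(struct toy_specdev *sd)
{
	atomic_store(&sd->dead, true);
}

/* Wait until every outstanding I/O reference has been released. */
static void
toy_io_drain(struct toy_specdev *sd)
{
	while (atomic_load(&sd->iocnt) != 0)
		sched_yield();
}
```

Each devsw call would be bracketed by toy_io_enter and toy_io_exit, so
once toy_revoke_begin and toy_io_drain have returned, no bracketed
call is still running and none can start.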
Make sure every access to vp->v_rdev or vp->v_specnode and every call
to a devsw operation is protected either:
- by the vnode lock (with vdead_check if we unlocked/relocked),
- by positive sd_opencnt,
- by spec_io_enter/exit, or
- by sd_opencnt management in open/close.
The vnode lock is held, so the vnode cannot be revoked without also
changing v_op so subsequent uses under the vnode lock will go to
deadfs's VOP_FDISCARD instead (which is genfs_eopnotsupp).
Annoying as it is that .d_open and .d_close can run at the same time,
it is also necessary for tty semantics, where open can block
indefinitely, and it is the responsibility of close (called via
revoke) to interrupt it.
The sections are now:
1. Acquire open reference.
1a (intermezzo). Set VV_ISTTY.
2. Drop the vnode lock to call .d_open and autoload modules if
necessary.
3. Handle concurrent revoke if it happened, or release open reference
if .d_open failed.
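
The three sections can be sketched as a single-threaded model. All
names here are hypothetical; the VV_ISTTY step is omitted, and failing
with EBADF and dropping the reference after a concurrent revoke is an
assumption about what "handle" means, not taken from spec_open.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a device vnode and its specnode state. */
struct toy_node {
	unsigned opencnt;	/* open references */
	bool locked;		/* models the vnode lock */
	bool revoked;		/* set by a concurrent revoke */
};

typedef int (*toy_open_fn)(struct toy_node *);

static int
toy_spec_open(struct toy_node *n, toy_open_fn d_open)
{
	int error;

	/* 1. Acquire open reference, with the vnode lock held. */
	assert(n->locked);
	if (n->revoked)
		return EBADF;
	n->opencnt++;

	/* 2. Drop the vnode lock to call .d_open, which may block. */
	n->locked = false;
	error = d_open(n);
	n->locked = true;

	/* 3. Handle concurrent revoke, or release on failure. */
	if (error == 0 && n->revoked)
		error = EBADF;
	if (error != 0)
		n->opencnt--;
	return error;
}

/* Example .d_open behaviours used to exercise the three outcomes. */
static int ex_open_ok(struct toy_node *n) { assert(!n->locked); return 0; }
static int ex_open_fail(struct toy_node *n) { (void)n; return ENXIO; }
static int
ex_open_revoked(struct toy_node *n)
{
	n->revoked = true;	/* pretend revoke ran while we were unlocked */
	return 0;
}
```

Only a successful open on a node that is still live keeps its
reference; both other outcomes leave opencnt as they found it.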
No functional change. Sprinkle comments about problems.
There is no need for it to serialize opens, because they are already
serialized by sd_opencnt which for block devices is always either 0
or 1.
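
As a sketch of why the count itself serializes block device opens:
if a second open is refused while the count is nonzero, at most one
open is ever in progress. The toy_* names and the EBUSY return value
are assumptions for illustration only.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for the per-device open count of a block device. */
struct toy_bdev {
	unsigned opencnt;	/* always 0 or 1 */
};

/* Claim the single open slot, or refuse if it is already taken. */
static int
toy_bdev_acquire(struct toy_bdev *bd)
{
	if (bd->opencnt != 0)
		return EBUSY;
	bd->opencnt = 1;
	return 0;
}

/* Give the slot back on close or on a failed open. */
static void
toy_bdev_release(struct toy_bdev *bd)
{
	assert(bd->opencnt == 1);
	bd->opencnt = 0;
}
```

In the kernel the check and the update would happen under
device_lock; this model leaves locking out.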
There's not obviously any other reason why the vnode lock should be
held across bdev_open, other than that it might be nice to avoid
dropping it if not necessary. For character devices we always have
to drop the vnode lock because open might hang indefinitely, when
opening a tty, which is not allowed while holding the vnode lock.
D_MCLOSE was introduced a few years ago by mistake for audio(4),
which should have used -- and now does use -- fd_clone to create
per-open state. The semantics was originally to call close once
every time the device node is closed, not only for the last close.
Nothing uses it any more, and it complicates reasoning about the
system, so let's simplify it away.