This moves the vnode-specific logic from sys_descrip.c into
vfs_vnode.c, like we did for fo_seek.
XXX kernel revbump -- struct fileops API and ABI change
.d_cancel, .d_strategy, .d_read, .d_write, .d_ioctl, and .d_discard
are only ever used between successful .d_open return and entry to
.d_close. .d_open doesn't return until sc is nonnull and sc_state is
RUNNING, and dkwedge_detach waits for the last .d_close before
setting sc_state to DEAD. So there is no possibility for sc to be
null or for sc_state to be anything other than RUNNING or DYING.
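A minimal sketch of what this allows, assuming the dk(4) names
(dkwedge_lookup, DKW_STATE_*, dkminphys) from the surrounding code;
the body is illustrative, not the committed change:

        static int
        dkread(dev_t dev, struct uio *uio, int flags)
        {
                struct dkwedge_softc *sc = dkwedge_lookup(dev);

                /* Guaranteed by the open/close protocol above. */
                KASSERT(sc != NULL);
                KASSERT(sc->sc_state == DKW_STATE_RUNNING ||
                    sc->sc_state == DKW_STATE_DYING);

                return physio(dkstrategy, NULL, dev, B_READ, dkminphys, uio);
        }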
There is a small functional change here but only in the event of a
race: in the short window between when dkwedge_detach is entered, and
when .d_close runs, any I/O operations (read, write, ioctl, &c.) may
be issued that would have failed with ENXIO before.
This shouldn't matter for anything: disk I/O operations are supposed
to complete reasonably promptly, and these operations _could_ have
begun milliseconds prior, before dkwedge_detach was entered, so it's
not a significant distinction.
Notes:
- .d_open must still contend with trying to open a nonexistent wedge,
of course.
- .d_close must also contend with closing a nonexistent wedge, in
case there were two calls to open in quick succession and the first
failed while the second hadn't yet determined it would fail.
- .d_size and .d_dump are used from ddb without any open/close.
For non-directory vnodes:
- reading f_offset requires a shared or exclusive vnode lock
- writing f_offset requires an exclusive vnode lock
For directory vnodes, access (read or write) requires either:
- a shared vnode lock AND f_lock, or
- an exclusive vnode lock.
This way, two files for the same underlying directory vnode can still
do VOP_READDIR in parallel, but if two readdir(2) or lseek(2) calls
run in parallel on the same file, the load and store of f_offset is
atomic (otherwise, e.g., on 32-bit systems it might be torn and lead
to corrupt offsets).
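A minimal sketch of the directory-vnode path under these rules, as a
fragment (vn_lock, VOP_UNLOCK, and f_lock are the existing vfs
primitives; the local variables are illustrative):

        /* Shared vnode lock plus f_lock covers reading f_offset. */
        vn_lock(vp, LK_SHARED | LK_RETRY);
        mutex_enter(&fp->f_lock);
        offset = fp->f_offset;
        mutex_exit(&fp->f_lock);

        /* ... VOP_READDIR(vp, ...) starting at offset ... */

        mutex_enter(&fp->f_lock);
        fp->f_offset = newoffset;       /* store is atomic under f_lock */
        mutex_exit(&fp->f_lock);
        VOP_UNLOCK(vp);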
There is still a potential problem: the _whole transaction_ of
readdir(2) may not be atomic. For example, if thread A and thread B
read n bytes of directory content, thread A might get bytes [0,n) and
thread B might get bytes [n,2n) but f_offset might end up at n
instead of 2n once both operations complete. (However, f_offset
wouldn't be some corrupt garbled number like n & 0xffffffff00000000.)
Fixing this would require either:
(a) using an exclusive vnode lock in vn_readdir,
(b) introducing a new lock that serializes vn_readdir on the same
file (but not necessarily the same vnode), or
(c) proving it is safe to hold f_lock across VOP_READDIR, VOP_SEEK,
and VOP_GETATTR.
New test case that reflects the fix in PR kern/57260. Most of the
work on this case was done by riastradh@, who supplied the basis for
it in the ticket and provided further guidance.
All the members these use are stable after initialization, except for
the wedge size, which dkwedge_size reads safely as a snapshot without
requiring any locking in the caller.
This way, opening dkN or rdkN will wait if attach or detach is still
in progress, and vdevgone will wake up such pending opens and make
them fail. So it is no longer possible for a wedge to be detached
after dkopen has already started using it.
For now, we use a custom .d_devtounit function that looks up the
autoconf unit number via the dkwedges array, which conceivably may
use an independent unit numbering system -- nothing guarantees they
match up. (In practice they will mostly match up, but concurrent
wedge creation could lead to different numbering.) Eventually this
should be changed so the two numbering systems match, which would let
us delete the new dkunit function and just use dev_minor_unit like
many other drivers can.
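A hedged sketch of such a .d_devtounit hook, assuming the dkwedges
array and dkwedges_lock from dk.c; the bounds checks and locking
details are illustrative:

        static int
        dkunit(dev_t dev)
        {
                int mn = minor(dev);
                device_t dv = NULL;
                struct dkwedge_softc *sc;

                rw_enter(&dkwedges_lock, RW_READER);
                if (mn >= 0 && mn < ndkwedges && (sc = dkwedges[mn]) != NULL)
                        dv = sc->sc_dev;
                rw_exit(&dkwedges_lock);

                /* Map the dkwedges slot to its autoconf unit number. */
                return dv == NULL ? -1 : device_unit(dv);
        }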
The first step is to decide whether we can detach (if forced, yes; if
not forced, only if not already open), and prevent new opens if so.
There's no need to start closing open instances at this point --
we're just making a decision to detach, and preventing new opens by
transitioning state that dkopen will respect[*].
The second step is to force all open instances to close. This is
done by vdevgone. By the time vdevgone returns, there can be no open
instances, so if there _were_ any, closing them via vdevgone will
have passed through dklastclose.
After that point, there can be no opens and no I/O operations, so
dk_openmask must already be zero and the bufq must be empty.
Thus, there's no need to have an explicit call to dklastclose (via
dkwedge_cleanup_parent) before or after making the decision to
detach.
[*] Currently access to this state is racy: nothing serializes
dkwedge_detach's state transition with dkopen's test. TBD in a
separate commit shortly.
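A hedged sketch of the two steps, using the real vdevgone(9) and
bdevsw_lookup_major/cdevsw_lookup_major APIs; the state names and
surrounding logic are illustrative:

        /* 1. Decide to detach and prevent new opens. */
        if (!force && sc->sc_dk.dk_openmask != 0)
                return EBUSY;
        sc->sc_state = DKW_STATE_DYING;         /* dkopen checks this[*] */

        /* 2. Force all open instances to close. */
        bmaj = bdevsw_lookup_major(&dk_bdevsw);
        cmaj = cdevsw_lookup_major(&dk_cdevsw);
        vdevgone(bmaj, unit, unit, VBLK);
        vdevgone(cmaj, unit, unit, VCHR);

        /* No opens and no I/O can remain at this point. */
        KASSERT(sc->sc_dk.dk_openmask == 0);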
1. Set a flag sc_iostop under the lock sc_iolock so dkwedge_detach
and dkstart don't race over it.
2. Decline to schedule the callout if sc_iostop is set. The callout
is already only ever scheduled while the lock is held.
3. Use callout_halt to wait for any concurrent callout to complete.
At this point, it can't reschedule itself.
Without this change, the callout could be concurrently rescheduling
itself as we issue callout_stop, leading to use-after-free later.
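A hedged sketch of the three steps, assuming the sc_iolock/sc_iostop
names above and a restart callout named sc_restart_ch (the callout
name is an assumption):

        /* In dkwedge_detach: */
        mutex_enter(&sc->sc_iolock);
        sc->sc_iostop = true;                   /* 1. set under sc_iolock */
        mutex_exit(&sc->sc_iolock);
        callout_halt(&sc->sc_restart_ch, NULL); /* 3. wait for the callout */

        /* In dkstart, where the callout is scheduled: */
        KASSERT(mutex_owned(&sc->sc_iolock));
        if (!sc->sc_iostop)                     /* 2. decline if stopping */
                callout_schedule(&sc->sc_restart_ch, hz / 2);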
This way, dkclose is guaranteed that dkopen, dkread, dkwrite,
dkioctl, &c., have all returned before it runs. For block opens,
setting d_cancel also guarantees that any buffered writes are flushed
with vinvalbuf before dkclose is called.
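A hedged sketch of what setting d_cancel looks like in the block devsw
table (entries abridged; the handler name dkcancel is an assumption):

        const struct bdevsw dk_bdevsw = {
                .d_open = dkopen,
                .d_cancel = dkcancel,           /* interrupts pending I/O */
                .d_close = dkclose,
                .d_strategy = dkstrategy,
                .d_ioctl = dkioctl,
                /* ... remaining members elided ... */
                .d_flag = D_DISK | D_MPSAFE,
        };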
This can happen with dk(4), which allows wedges to have their size
increased without destroying and recreating the device instance.
Drivers which allow concurrent disk_set_info and disk_ioctl must
serialize disk_set_info with dk_openlock.
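A minimal sketch of the required pattern, using the real
disk_set_info(9) and the dk_openlock member of struct disk (the softc
layout is illustrative):

        mutex_enter(&sc->sc_dk.dk_openlock);
        disk_set_info(sc->sc_dev, &sc->sc_dk, NULL);
        mutex_exit(&sc->sc_dk.dk_openlock);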
Follows the pattern of most drivers, and will be necessary for
referencing dk_cd in dk_bdevsw and dk_cdevsw soon, to prevent
open/detach races.
No functional change intended.
Rules:
1. Only ever increases, never decreases.
(Decreases require removing and readding the wedge.)
2. Increases are serialized by dk_openlock.
3. Reads can happen unlocked in any context where the softc is valid.
Access is gathered into dkwedge_size* subroutines -- don't touch
sc_size outside these. For now, we use rwlock(9) to keep the
reasoning simple. This should be done with atomics on 64-bit
platforms and a seqlock on 32-bit platforms to avoid contention.
However, we can do that in a later change.
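A hedged sketch of the rwlock(9)-based subroutines under these rules
(the lock member name sc_sizelock is an assumption):

        static uint64_t
        dkwedge_size(struct dkwedge_softc *sc)
        {
                uint64_t size;

                /* Rule 3: unlocked callers get a consistent snapshot. */
                rw_enter(&sc->sc_sizelock, RW_READER);
                size = sc->sc_size;
                rw_exit(&sc->sc_sizelock);

                return size;
        }

        static void
        dkwedge_size_increase(struct dkwedge_softc *sc, uint64_t size)
        {
                /* Rule 2: increases are serialized by dk_openlock. */
                KASSERT(mutex_owned(&sc->sc_dk.dk_openlock));

                rw_enter(&sc->sc_sizelock, RW_WRITER);
                KASSERT(size >= sc->sc_size);   /* rule 1: never decreases */
                sc->sc_size = size;
                rw_exit(&sc->sc_sizelock);
        }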
ENXIO is `device not configured', meaning there is no such device.
ENODEV is `operation not supported by device', meaning the device is
there but refuses the operation, like writing to a read-only medium.
Exception: For undefined ioctl commands, it's not ENODEV _or_ ENXIO,
but rather ENOTTY, because why make any of this obvious when you
could make it obscure Unix lore?
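For example, an ioctl handler's default case (sketch; the defined
command cases are elided):

        static int
        dkioctl(dev_t dev, u_long cmd, void *data, int flag, struct lwp *l)
        {
                switch (cmd) {
                /* ... defined commands ... */
                default:
                        return ENOTTY;  /* undefined ioctl, per the lore */
                }
        }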
This is not great -- we shouldn't be choosing the unit number here
anyway; we should just let autoconf do it for us -- but it's better
than potentially blocking holders of dk_openlock or dk_rawlock (which
are sometimes held while waiting for dkwedges_lock) on memory
allocation.
We only enter dklastclose if the wedge is open (sc->sc_dk.dk_openmask
!= 0), which can happen only if dkfirstopen has succeeded, in which
case we hold a dk_rawopens reference to the parent that prevents
anyone else from closing it. Hence sc->sc_parent->dk_rawopens > 0.
On open, sc->sc_parent->dk_rawvp is set to nonnull, and it is only
reset to null on close. Hence if the parent is still open, as it
must be here, sc->sc_parent->dk_rawvp must be nonnull.
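Hedged sketch: the invariants argued above could be asserted at the
top of dklastclose (illustrative placement):

        KASSERT(sc->sc_parent->dk_rawopens > 0);
        KASSERT(sc->sc_parent->dk_rawvp != NULL);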
* privsep: keep resources open rather than open/close
* dhcp6: OPTION_NTP_SERVER is now preferred over OPTION_SNTP_SERVER
* Misc bug fixes mainly around privsep for many platforms.
* Fix for reading some BSD routing table entries.
* Fix reading authtokens from config.
Big new release, mainly around better privsep process management,
which allows us to detect when the privsep processes exit
unexpectedly.