These are things of the form #define foofs_op genfs_op, or #define
foofs_op genfs_eopnotsupp, or similar. They serve no purpose besides
obfuscation, and have gotten cut-and-pasted all over everywhere.
Part 3; cvs randomly didn't commit all the files the first time, still
hunting down the files it skipped.
Add GENFS_SPECOP_ENTRIES and GENFS_FIFOOP_ENTRIES macros that contain
the portion of the vnode ops table declaration that is
(conservatively) the same in every fs. Use these in every fs that
supports devices and/or fifos with separate ops tables.
Note that ptyfs works differently (it has one type of vnode with
open-coded dispatch to the specfs code, which I haven't changed in
this commit) and rump/librump/rumpvfs/rumpfs.c has an indirect dynamic
dispatch that already does more or less the same thing, which I also
haven't changed.
Also note that this anticipates a few bits in the next changeset here
and there, and adds missing but unreachable calls in some cases (e.g.
most fses weren't defining whiteout on devices and fifos, but it isn't
reachable there), and it changes parsepath on devices and fifos to
genfs_badop from genfs_parsepath (but it's not reachable there
either).
It appears that devices in kernfs were missing kqfilter, so it's
possible that trying to use kqueue on /kern/rootdev will make it
explode.
And finally, note that the ops declaration tables aren't
order-dependent (other than that vop_default_desc has to come first).
Otherwise this wouldn't work.
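A minimal userland sketch of why the macro approach works, assuming
toy stand-ins for the real table types: every toy_* and TOY_* name
below is hypothetical and is not the kernel's API. The shared entries
live in one macro, and the table is resolved by descriptor rather than
by position, so the macro can sit anywhere after the default entry.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real kernel types and names differ. */
typedef int (*toy_op_t)(void);
enum toy_desc { TOY_DEFAULT, TOY_OPEN, TOY_READ, TOY_KQFILTER, TOY_NDESC };
struct toy_entry { enum toy_desc desc; toy_op_t op; };

static int toy_error(void)         { return -1; }
static int toy_spec_open(void)     { return 1; }
static int toy_spec_kqfilter(void) { return 3; }
static int toy_foofs_read(void)    { return 42; }

/* Hypothetical analogue of GENFS_SPECOP_ENTRIES: entries shared by every fs. */
#define TOY_SPECOP_ENTRIES \
	{ TOY_OPEN, toy_spec_open }, \
	{ TOY_KQFILTER, toy_spec_kqfilter }

static const struct toy_entry foofs_specop_entries[] = {
	{ TOY_DEFAULT, toy_error },	/* the default has to come first */
	{ TOY_READ, toy_foofs_read },	/* fs-specific entry */
	TOY_SPECOP_ENTRIES,		/* shared entries, position irrelevant */
	{ TOY_NDESC, NULL }		/* terminator */
};

/* Build a dispatch vector indexed by descriptor, not by table position. */
static void
toy_build(const struct toy_entry *tbl, toy_op_t vec[TOY_NDESC])
{
	assert(tbl[0].desc == TOY_DEFAULT);
	for (int i = 0; i < TOY_NDESC; i++)
		vec[i] = tbl[0].op;	/* unlisted ops fall back to the default */
	for (; tbl->op != NULL; tbl++)
		vec[tbl->desc] = tbl->op;
}
```

Forgetting an entry in one fs (as with kqfilter in kernfs) becomes much
harder once the shared part comes from a single macro.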
Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.
Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.
This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.
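A toy model of the new calling contract, for illustration only:
toy_vn_open, struct toy_vnode and the hard-wired path are hypothetical
and do not reproduce the real vn_open signature. It shows the two
outcomes on success and the EOPNOTSUPP failure for callers that pass
NULL for the extra return values.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct toy_vnode { const char *name; };

static struct toy_vnode toy_regular = { "regular" };

/*
 * On success, exactly one of *ret_vp (a vnode) or *ret_fd (a file
 * descriptor) is produced; *ret_domove says whether the lower layer
 * asked for a move (EMOVEFD-like) or a dup (EDUPFD-like).
 */
static int
toy_vn_open(const char *path, struct toy_vnode **ret_vp,
    bool *ret_domove, int *ret_fd)
{
	assert(ret_vp != NULL);		/* the vnode slot is mandatory */

	if (strcmp(path, "/dev/stderr") == 0) {
		/* The lower layer wants to hand back a descriptor. */
		if (ret_domove == NULL || ret_fd == NULL)
			return EOPNOTSUPP;	/* caller can't cope */
		*ret_vp = NULL;
		*ret_domove = false;		/* dup rather than move */
		*ret_fd = 2;
		return 0;
	}

	/* Ordinary case: a vnode comes back and no descriptor. */
	*ret_vp = &toy_regular;
	if (ret_fd != NULL)
		*ret_fd = -1;
	return 0;
}
```

A caller that only understands vnodes passes NULL, NULL and gets a
clean error instead of a magic errno leaking out.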
Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)
This change requires a kernel bump. Ride the one from an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)
- Move namei_getcomponent to genfs_vnops.c and call it genfs_parsepath.
- Add a parsepath entry to every vnode ops table.
VOP_PARSEPATH takes a directory vnode to be searched and a complete
following path and chooses how much of that path to consume. To begin
with, all parsepath calls are genfs_parsepath, which locates the first
'/' as always.
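The default behavior can be sketched in userland as follows. This is
an illustration only: toy_parsepath is a hypothetical name and a
simplified interface, not the real vnode op. It takes just the
remaining path string and says how much of it the next component
consumes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given the remaining path, return the length of the next component:
 * everything up to, but not including, the first '/' or the
 * terminating NUL.
 */
static size_t
toy_parsepath(const char *name)
{
	size_t len = 0;

	assert(name != NULL);
	while (name[len] != '\0' && name[len] != '/')
		len++;
	return len;
}
```

A file system that wanted to consume more than one component at a time
would return a larger length from its own implementation.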
Note that the call doesn't take the whole struct componentname, only
the string. The other bits of struct componentname should not be
needed and there's no reason to cause potential complications by
exposing them.
The poorly named uvm.h is generally supposed to be for uvm-internal
users only.
- Narrow it to files that actually need it -- mostly files that need
to query whether curlwp is the pagedaemon, which should maybe be
exposed by an external header.
- Use uvm_extern.h where feasible and uvm_*.h for things not exposed
by it. We should split up uvm_extern.h but this will serve for now
to reduce the uvm.h dependencies.
- Use uvm_stat.h and #ifdef UVMHIST uvm.h for files that use
UVMHIST(ubchist), since ubchist is declared in uvm.h but the
reference evaporates if UVMHIST is not defined, so we reduce header
file dependencies.
- Make uvm_device.h and uvm_swap.h independently includable while
here.
ok chs@
Reproducer:
A: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
rmdir("c/d/e"); rmdir("c/d"); }
B: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
rename("c", "c/d/e"); }
C: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
rename("c/d/e", "c"); }
Deadlock:
- A holds c and wants to lock d; and either
- B holds . and d and wants to lock c, or
- C holds . and d and wants to lock c.
The problem with these is that genfs_rename_enter_separate in B or C
tried the lock order .->d->c->e (in A/B, fdvp->tdvp->fvp->tvp; in A/C,
tdvp->fdvp->tvp->fvp), which violates the ancestor->descendant order
.->c->d->e.
The resolution is to change B to do fdvp->fvp->tdvp->tvp and C to do
tdvp->tvp->fdvp->fvp. But there's an edge case: tvp and fvp might be
the same (hard links), and we can't detect that until after we've
looked them both up -- and in some file systems (I'm looking at you,
ufs), there is no mere lookup operation, only lookup-and-lock, so we
can't even hold the lock on one of tvp or fvp when we look up the
other one if there's a chance they might be the same.
Fortunately the cases
(a) tvp = fvp
(b) tvp or fvp is a directory
are mutually exclusive as long as directories cannot be hard-linked.
In case (a) we can just defer locking {tvp, fvp} until the end, because
it can't possibly have {fdvp or fvp, tdvp or tvp} as descendants. In
case (b) we can just lock them in the order fdvp->fvp->tdvp->tvp or
tdvp->tvp->fdvp->fvp if the first one of {fvp, tvp} is a directory,
because it can't possibly coincide with the second one of {fvp, tvp}.
With this change, we can now prove that the locking order is consistent
with the ancestor->descendant partial ordering. Where two nodes are
incommensurate under that partial ordering, they are only ever locked
by rename and there is only ever one rename at a time.
Proof:
- For same-directory renames, genfs_rename_enter_common locks the
directory first and then the children. The order
directory->child[i] is consistent with ancestor->descendant and
child[0]/child[1] are incommensurate.
- For cross-directory renames:
. While a rename is in progress and the fs-wide rename lock is held,
directories can be created or removed but not changed, so the
outcome of gro_genealogy -- which, given fdvp and tdvp, returns
the node N relating fdvp/N/.../tdvp or null if there is none --
can only transition from finding N to not finding N, if one of
the directories is removed while any of the vnodes are unlocked.
Merely creating directories cannot change the ancestry of tdvp,
and concurrent renames are not possible.
Thus, if a gro_genealogy determined the operation to have the
form fdvp/N/.../tdvp, then it might cease to have that form, but
only because tdvp was removed which will harmlessly cause the
rename to fail later on. Similarly, if gro_genealogy determined
the operation _not_ to have the form fdvp/N/.../tdvp then it
can't begin to have that form until after the rename has
completed.
The lock order is:
=> for fdvp/.../tdvp:
1. lock fdvp
2. lookup(/lock/unlock) fvp (consistent with fdvp->fvp)
3. lock fvp if a directory (consistent with fdvp->fvp)
4. lock tdvp (consistent with fdvp->tdvp and possibly fvp->tdvp)
5. lookup(/lock/unlock) tvp (consistent with tdvp->tvp)
6. lock fvp if a nondirectory (fvp->t* or fvp->fdvp is impossible)
7. lock tvp if not fvp (tvp->f* is impossible unless tvp=fvp)
=> for incommensurate fdvp & tdvp, or for tdvp/.../fdvp:
1. lock tdvp
2. lookup(/lock/unlock) tvp (consistent with tdvp->tvp)
3. lock tvp if a directory (consistent with tdvp->tvp)
4. lock fdvp (either incommensurate with tdvp and/or tvp, or
consistent with tdvp(->tvp)->fdvp)
5. lookup(/lock/unlock) fvp (consistent with fdvp->fvp)
6. lock tvp if a nondirectory (tvp->f* or tvp->tdvp is impossible)
7. lock fvp if not tvp (fvp->t* is impossible unless fvp=tvp)
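The two orderings above can be summarized by a small toy function. It
is a sketch under simplifying assumptions: all toy_* and TOY_* names
are hypothetical, both fvp and tvp are assumed to exist, and only the
lock steps are modelled (the lookup steps 2 and 5 are omitted).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_node { TOY_FDVP, TOY_FVP, TOY_TDVP, TOY_TVP };

/*
 * Fill out[] with the order in which the nodes get locked and return
 * how many entries were written.  fdvp_above_tdvp selects the
 * fdvp/.../tdvp shape; otherwise the tdvp-first shape is used.
 */
static size_t
toy_lock_order(bool fdvp_above_tdvp, bool fvp_isdir, bool tvp_isdir,
    bool same, enum toy_node out[4])
{
	enum toy_node dvp1 = fdvp_above_tdvp ? TOY_FDVP : TOY_TDVP;
	enum toy_node vp1 = fdvp_above_tdvp ? TOY_FVP : TOY_TVP;
	enum toy_node dvp2 = fdvp_above_tdvp ? TOY_TDVP : TOY_FDVP;
	enum toy_node vp2 = fdvp_above_tdvp ? TOY_TVP : TOY_FVP;
	bool vp1_isdir = fdvp_above_tdvp ? fvp_isdir : tvp_isdir;
	size_t n = 0;

	/* Cases (a) and (b) are mutually exclusive: no hard-linked dirs. */
	assert(!(same && (fvp_isdir || tvp_isdir)));

	out[n++] = dvp1;		/* step 1: first parent */
	if (vp1_isdir)
		out[n++] = vp1;		/* step 3: first child, if a directory */
	out[n++] = dvp2;		/* step 4: second parent */
	if (!vp1_isdir)
		out[n++] = vp1;		/* step 6: first child, if a nondirectory */
	if (!same)
		out[n++] = vp2;		/* step 7: second child, unless same node */
	return n;
}
```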
Deadlocks found by hannken@; resolution worked out with dholland@.
XXX I think we could improve concurrency somewhat -- with a likely
big win for applications like tar and rsync that create many files
with temporary names and then rename them to the permanent one in the
same directory -- by making vfs_renamelock a reader/writer lock: any
number of same-directory renames, or exactly one cross-directory
rename, at any one time.
- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.
- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.
- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.
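A toy sketch of the cached variant, assuming that "rate limit to hz"
means at most one refresh per clock tick. All toy_* names are
hypothetical, and the real function differs (for one thing, this toy
returns the total directly).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NCPU 4

static int64_t toy_percpu[TOY_NCPU];	/* per-CPU counts */
static int64_t toy_total;		/* last summed total */
static unsigned toy_lasttick;		/* tick of the last summation */
static bool toy_valid;			/* has a total been computed yet? */

/*
 * Return the total count.  If cached_ok is true and a total was
 * already computed during the current tick, reuse it instead of
 * walking every CPU again.
 */
static int64_t
toy_count_sync(bool cached_ok, unsigned curtick)
{
	int64_t sum = 0;

	if (cached_ok && toy_valid && toy_lasttick == curtick)
		return toy_total;	/* cheap path */

	for (int i = 0; i < TOY_NCPU; i++)
		sum += toy_percpu[i];	/* expensive path: touch all CPUs */
	assert(sum >= 0);		/* totals are never negative */

	toy_total = sum;
	toy_lasttick = curtick;
	toy_valid = true;
	return sum;
}
```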
cached value will do, or if the very latest total must be fetched. It can
be called thousands of times a second and fetching the totals impacts not
only the calling LWP but other CPUs doing unrelated activity in the VM
system.
parameters can't change part way through a search: move the "uobj" and
"flags" arguments over to uvm_page_array_init() and store those with the
array.
- With that, detect when it's not possible to find any more pages in the
tree with the given search parameters, and avoid repeated tree lookups if
the caller loops over uvm_page_array_fill_and_peek().
- Make PGO_LOCKED getpages imply PGO_NOBUSY and remove the latter. Mark
pages busy only when there's actually I/O to do.
- When doing COW on a uvm_object, don't mess with neighbouring pages. In
all likelihood they're already entered.
- Don't mess with neighbouring VAs that have existing mappings as replacing
those mappings with same can be quite costly.
- Don't enqueue pages for neighbour faults unless not enqueued already, and
don't activate centre pages unless uvmpdpol says it's useful.
Also:
- Make PGO_LOCKED getpages on UAOs work more like vnodes: do gang lookup in
the radix tree, and don't allocate new pages.
- Fix many assertion failures around faults/loans with tmpfs.
which relied on taking extra vnode refs.
Having benchmarked various experimental changes over the past few months, it
seems that it's better to avoid vnode refs as much as possible. cwdi_lock
as a RW lock already did that to some extent for getcwd() and will permit
the same for namei() too.
when allocating a PID.
- Per above, proc_free_pid() no longer decrements nprocs. It's now done
in proc_free() right after proc_free_pid().
- Ensure nprocs is accessed using atomics everywhere.
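A toy illustration of the last two points using C11 atomics. The
toy_* names are hypothetical, and the kernel uses its own atomic
primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int toy_nprocs;	/* number of processes, atomics only */

/* Releases only the PID slot; it no longer touches the count. */
static void
toy_proc_free_pid(int pid)
{
	(void)pid;
}

/* The count is dropped here, right after the PID is released. */
static void
toy_proc_free(int pid)
{
	int old;

	toy_proc_free_pid(pid);
	old = atomic_fetch_sub(&toy_nprocs, 1);
	assert(old > 0);	/* must not go negative */
	(void)old;
}

/* Readers also go through an atomic load. */
static int
toy_nprocs_get(void)
{
	return atomic_load(&toy_nprocs);
}
```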
returned by DIOCGPARTINFO if it's bigger than DEV_BSIZE and less
than MAXBSIZE (MAXPHYS)
fixes panic "buf mem pool index 8" in buf_mempoolidx() when the
disklabel contains bsize 128KB and something reads the block device -
buffer cache can't allocate bufs bigger than MAXPHYS
and getcwd():
- push vnode locking back as far as possible.
- do most lookups directly in the namecache, avoiding vnode locks & refs.
- don't block new refs to vnodes across VOP_INACTIVE().
- get shared locks for VOP_LOOKUP() if the file system supports it.
- correct lock types for VOP_ACCESS() / VOP_GETATTR() in a few places.
Possible future enhancements:
- make the lookups lockless.
- support dotdot lookups by being lockless and inferring absence of chroot.
- maybe make it work for layered file systems.
- avoid vnode references at the root & cwd.
parallel, where the relevant pages are already in-core. Proposed on
tech-kern.
Temporarily disabled on MP architectures with __HAVE_UNLOCKED_PMAP until
adjustments are made to their pmaps.
automate installation of sysctl nodes.
Note that there are still a number of device and pseudo-device modules
that create entries tied to individual device units, rather than to the
module itself. These are not changed.