Commit Graph

1366 Commits

Author SHA1 Message Date
dholland cba87bb8e7 Abolish all the silly indirection macros for initializing vnode ops tables.
These are things of the form #define foofs_op genfs_op, or #define
foofs_op genfs_eopnotsupp, or similar. They serve no purpose besides
obfuscation, and have gotten cutpasted all over everywhere.

Part 3; cvs randomly didn't commit all the files the first time, still
hunting down the files it skipped.
2021-07-19 01:33:53 +00:00
dholland 4171507047 Abolish all the silly indirection macros for initializing vnode ops tables.
These are things of the form #define foofs_op genfs_op, or #define
foofs_op genfs_eopnotsupp, or similar. They serve no purpose besides
obfuscation, and have gotten cutpasted all over everywhere.
2021-07-18 23:57:13 +00:00
dholland d819c3614f Use macros for the canned parts of device and fifo vnode op tables.
Add GENFS_SPECOP_ENTRIES and GENFS_FIFOOP_ENTRIES macros that contain
the portion of the vnode ops table declaration that is
(conservatively) the same in every fs. Use these in every fs that
supports devices and/or fifos with separate ops tables.

Note that ptyfs works differently (it has one type of vnode with
open-coded dispatch to the specfs code, which I haven't changed in
this commit) and rump/librump/rumpvfs/rumpfs.c has an indirect dynamic
dispatch that already does more or less the same thing, which I also
haven't changed.

Also note that this anticipates a few bits in the next changeset here
and there, and adds missing but unreachable calls in some cases (e.g.
most fses weren't defining whiteout on devices and fifos, but it isn't
reachable there), and it changes parsepath on devices and fifos to
genfs_badop from genfs_parsepath (but it's not reachable there
either).

It appears that devices in kernfs were missing kqfilter, so it's
possible that if you try to use kqueue on /kern/rootdev that it'll
explode.

And finally note that the ops declaration tables aren't
order-dependent. (Other than vop_default_desc has to come first.)
Otherwise this wouldn't work.
2021-07-18 23:56:12 +00:00
dholland 37ce85c5f7 Fix perms on /kern/{r,}rootdev. 2021-07-06 03:23:03 +00:00
dholland 285b06fd69 Add missing VOP_KQFILTER to kernfs.
Not sure if lack of it can be used for local DoS or not, but best to
fix.
2021-07-06 03:22:44 +00:00
dholland 723d09ce8e Add containment for the cloning devices hack in vn_open.
Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)
2021-06-29 22:40:53 +00:00
dholland c6c16cd073 - Add a new vnode op: VOP_PARSEPATH.
- Move namei_getcomponent to genfs_vnops.c and call it genfs_parsepath.
 - Add a parsepath entry to every vnode ops table.

VOP_PARSEPATH takes a directory vnode to be searched and a complete
following path and chooses how much of that path to consume. To begin
with, all parsepath calls are genfs_parsepath, which locates the first
'/' as always.

Note that the call doesn't take the whole struct componentname, only
the string. The other bits of struct componentname should not be
needed and there's no reason to cause potential complications by
exposing them.
2021-06-29 22:34:05 +00:00
chs e3beb37645 VOP_BMAP() may be called via ioctl(FIOGETBMAP) on any vnode that applications
can open.  change various pseudo-fs *_bmap methods return an error instead of
panic.

Reported-by: syzbot+8289a3eaf2ba60958c87@syzkaller.appspotmail.com
2021-06-28 17:52:12 +00:00
hannken 9decf88a36 Make sure fdesc_lookup() never returns VNON vnodes.
Should fix PR kern/56130 (fdescfs create nodes with wrong major number)
2021-05-01 15:08:14 +00:00
riastradh a1daa84551 Fix procfs environ node. 2020-12-28 22:36:16 +00:00
mlelstv 1bc006b1c2 When reading from a block device, queue parallel block requests to
fill a buffer with breadn.
2020-12-25 09:28:56 +00:00
thorpej 8cda207fa4 Use sel{record,remove}_knote(). 2020-12-19 21:54:42 +00:00
riastradh 9fc453562f Round of uvm.h cleanup.
The poorly named uvm.h is generally supposed to be for uvm-internal
users only.

- Narrow it to files that actually need it -- mostly files that need
  to query whether curlwp is the pagedaemon, which should maybe be
  exposed by an external header.

- Use uvm_extern.h where feasible and uvm_*.h for things not exposed
  by it.  We should split up uvm_extern.h but this will serve for now
  to reduce the uvm.h dependencies.

- Use uvm_stat.h and #ifdef UVMHIST uvm.h for files that use
  UVMHIST(ubchist), since ubchist is declared in uvm.h but the
  reference evaporates if UVMHIST is not defined, so we reduce header
  file dependencies.

- Make uvm_device.h and uvm_swap.h independently includable while
  here.

ok chs@
2020-09-05 16:30:10 +00:00
riastradh 44afc3b3f9 genfs_rename: Fix deadlocks in cross-directory cyclic rename.
Reproducer:

A: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
    rmdir("c/d/e"); rmdir("c/d"); }
B: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
    rename("c", "c/d/e"); }
C: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
    rename("c/d/e", "c"); }

Deadlock:

- A holds c and wants to lock d; and either
- B holds . and d and wants to lock c, or
- C holds . and d and wants to lock c.

The problem with these is that genfs_rename_enter_separate in B or C
tried lock order .->d->c->e (in A/B, fdvp->tdvp->fvp->tvp; in A/C,
tdvp->fdvp->tvp->fvp) which violates the ancestor->descendant order
.->c->d->e.

The resolution is to change B to do fdvp->fvp->tdvp->tvp and C to do
tdvp->tvp->fdvp->fvp.  But there's an edge case: tvp and fvp might be
the same (hard links), and we can't detect that until after we've
looked them both up -- and in some file systems (I'm looking at you,
ufs), there is no mere lookup operation, only lookup-and-lock, so we
can't even hold the lock on one of tvp or fvp when we look up the
other one if there's a chance they might be the same.

Fortunately the cases
(a) tvp = fvp
(b) tvp or fvp is a directory
are mutually exclusive as long as directories cannot be hard-linked.
In case (a) we can just defer locking {tvp, fvp} until the end, because
it can't possibly have {fdvp or fvp, tdvp or tvp} as descendants.  In
case (b) we can just lock them in the order fdvp->fvp->tdvp->tvp or
tdvp->tvp->fdvp->fvp if the first one of {fvp, tvp} is a directory,
because it can't possibly coincide with the second one of {fvp, tvp}.

With this change, we can now prove that the locking order is consistent
with the ancestor->descendant partial ordering.  Where two nodes are
incommensurate under that partial ordering, they are only ever locked
by rename and there is only ever one rename at a time.

Proof:

- For same-directory renames, genfs_rename_enter_common locks the
  directory first and then the children.  The order
  directory->child[i] is consistent with ancestor->descendant and
  child[0]/child[1] are incommensurate.

- For cross-directory renames:

  . While a rename is in progress and the fs-wide rename lock is held,
    directories can be created or removed but not changed, so the
    outcome of gro_genealogy -- which, given fdvp and tdvp, returns
    the node N relating fdvp/N/.../tdvp or null if there is none --
    can only transition from finding N to not finding N, if one of
    the directories is removed while any of the vnodes are unlocked.
    Merely creating directories cannot change the ancestry of tdvp,
    and concurrent renames are not possible.

    Thus, if a gro_genealogy determined the operation to have the
    form fdvp/N/.../tdvp, then it might cease to have that form, but
    only because tdvp was removed which will harmlessly cause the
    rename to fail later on.  Similarly, if gro_genealogy determined
    the operation _not_ to have the form fdvp/N/.../tdvp then it
    can't begin to have that form until after the rename has
    completed.

    The lock order is,

    => for fdvp/.../tdvp:
       1. lock fdvp
       2. lookup(/lock/unlock) fvp (consistent with fdvp->fvp)
       3. lock fvp if a directory (consistent with fdvp->fvp)
       4. lock tdvp (consistent with fdvp->tdvp and possibly fvp->tdvp)
       5. lookup(/lock/unlock) tvp (consistent with tdvp->tvp)
       6. lock fvp if a nondirectory (fvp->t* or fvp->fdvp is impossible)
       7. lock tvp if not fvp (tvp->f* is impossible unless tvp=fvp)

    => for incommensurate fdvp & tdvp, or for tdvp/.../fdvp:
       1. lock tdvp
       2. lookup(/lock/unlock) tvp (consistent with tdvp->tvp)
       3. lock tvp if a directory (consistent with tdvp->tvp)
       4. lock fdvp (either incommensurate with tdvp and/or tvp, or
          consistent with tdvp(->tvp)->fdvp)
       5. lookup(/lock/unlock) fvp (consistent with fdvp->fvp)
       6. lock tvp if a nondirectory (tvp->f* or tvp->tdvp is impossible)
       7. lock fvp if not tvp (fvp->t* is impossible unless fvp=tvp)

Deadlocks found by hannken@; resolution worked out with dholland@.

XXX I think we could improve concurrency somewhat -- with a likely
big win for applications like tar and rsync that create many files
with temporary names and then rename them to the permanent one in the
same directory -- by making vfs_renamelock a reader/writer lock: any
number of same-directory renames, or exactly one cross-directory
rename, at any one time.
2020-09-05 02:47:03 +00:00
simonb bf74807839 Remove trailing \n from UVMHIST_LOG() format strings. 2020-08-19 07:29:00 +00:00
chs 19303cecfc centralize calls from UVM to radixtree into a few functions.
in those functions, assert that the object lock is held in
the correct mode.
2020-08-14 09:06:14 +00:00
rin 3123ec52cf Output offsets in hex for UVMHIST. 2020-08-10 11:09:15 +00:00
christos ad1efe3529 accmode should be accmode_t 2020-08-07 18:14:21 +00:00
christos 79e3c74f8e Introduce genfs_pathconf() and use it for the default case in all filesystems. 2020-06-27 17:29:17 +00:00
ad 2806b3da8b genfs_putpages(): when building a cluster make use of pages in the in the
existing uvm_page_array.
2020-06-14 00:25:22 +00:00
ad ba90a6ba38 Counter tweaks:
- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
  each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
  for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one().  It has no users and doesn't save a whole lot.
  For the cheap option, give cpu_count_sync() a boolean parameter indicating
  that a cached value is okay, and rate limit the updates for cached values
  to hz.
2020-06-11 22:21:05 +00:00
ad 4b8a875ae2 uvm_availmem(): give it a boolean argument to specify whether a recent
cached value will do, or if the very latest total must be fetched.  It can
be called thousands of times a second and fetching the totals impacts not
only the calling LWP but other CPUs doing unrelated activity in the VM
system.
2020-06-11 19:20:42 +00:00
rin 46b290a0c3 struct statvfs is too large for stack. Use malloc(9) instead.
XXX
Switch to kmem(9) for entire this file.

Frame size, e.g. for m68k, becomes:
    3292 --> 12
2020-05-31 08:38:54 +00:00
bouyer ecb2afc2be Add need-flags for kernfs.
Compile Xen kernfs support only if kernfs is compiled in the kernel.
Should fix MODULAR build.
2020-05-26 10:37:24 +00:00
ad 4bfe043955 - Alter the convention for uvm_page_array slightly, so the basic search
parameters can't change part way through a search: move the "uobj" and
  "flags" arguments over to uvm_page_array_init() and store those with the
  array.

- With that, detect when it's not possible to find any more pages in the
  tree with the given search parameters, and avoid repeated tree lookups if
  the caller loops over uvm_page_array_fill_and_peek().
2020-05-25 21:15:10 +00:00
ad 0eaaa024ea Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.
2020-05-23 23:42:41 +00:00
christos 44ed7d42f1 Fix EPERM vs EACCES on chtimes (thanks @hannken) 2020-05-20 17:06:15 +00:00
christos 10d4c4e928 remove debugging, it is just clutter. 2020-05-18 19:55:42 +00:00
christos e59d6517f7 Fix EPERM vs EACCES return. 2020-05-18 19:42:16 +00:00
ad ff872804dc Start trying to reduce cache misses on vm_page during fault processing.
- Make PGO_LOCKED getpages imply PGO_NOBUSY and remove the latter.  Mark
  pages busy only when there's actually I/O to do.

- When doing COW on a uvm_object, don't mess with neighbouring pages.  In
  all likelyhood they're already entered.

- Don't mess with neighbouring VAs that have existing mappings as replacing
  those mappings with same can be quite costly.

- Don't enqueue pages for neighbour faults unless not enqueued already, and
  don't activate centre pages unless uvmpdpol says its useful.

Also:

- Make PGO_LOCKED getpages on UAOs work more like vnodes: do gang lookup in
  the radix tree, and don't allocate new pages.

- Fix many assertion failures around faults/loans with tmpfs.
2020-05-17 19:38:16 +00:00
christos 9aa2a9c323 Add ACL support for FFS. From FreeBSD. 2020-05-16 18:31:45 +00:00
riastradh 8f400c1021 Put forward declaration a little further forward to unbreak build. 2020-04-29 07:18:24 +00:00
thorpej 91da5a2e36 If the procfs mount is marked as linux-compat, then allow proc lookup
by any LWP ID in the proc, not just the canonical PID.
2020-04-29 01:56:54 +00:00
christos 2f4aa83fe6 Allow root to access and modify system space extended attributes.
XXX: this routine should not be using the string, but the attribute namespace.
I have fixed this in the ACL code.
2020-04-25 22:28:47 +00:00
ad e88c11f417 Revert the changes made in February to make cwdinfo use mostly lockless,
which relied on taking extra vnode refs.

Having benchmarked various experimental changes over the past few months it
seems that it's better to avoid vnode refs as much as possible.  cwdi_lock
as a RW lock already did that to some extent for getcwd() and will permit
the same for namei() too.
2020-04-21 21:42:47 +00:00
martin 0a1cb03168 Add missing include of <sys/atomic.h> to fix the build 2020-04-20 13:30:34 +00:00
htodd 7bceb35060 Sort include files. 2020-04-20 05:22:28 +00:00
htodd 6b104688ac Add missing include to fix build. 2020-04-20 05:11:00 +00:00
thorpej a29147fa13 - Only increment nprocs when we're creating a new process, not just
when allocating a PID.
- Per above, proc_free_pid() no longer decrements nprocs.  It's now done
  in proc_free() right after proc_free_pid().
- Ensure nprocs is accessed using atomics everywhere.
2020-04-19 20:31:59 +00:00
jdolecek 5dec3f0781 when determining I/O block size for VBLK device, only use pi_bsize
returned by DIOCGPARTINFO if it's bigger than DEV_BSIZE and less
than MAXBSIZE (MAXPHYS)

fixes panic "buf mem pool index 8" in buf_mempoolidx() when the
disklabel contains bsize 128KB and something reads the block device -
buffer cache can't allocate bufs bigger than MAXPHYS
2020-04-13 20:02:27 +00:00
ad 23bf88000c Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed().  Signature matches
FreeBSD.
2020-04-13 19:23:17 +00:00
jdolecek 8d1e4d9c00 switch to kmem_zalloc() instead of malloc() for struct kernfs_mount 2020-04-07 08:35:49 +00:00
jdolecek 8a3ee72648 switch KERNFS_ALLOCENTRY() to use kmem_zalloc() instead of malloc() 2020-04-07 08:14:42 +00:00
ad c90f9c8c81 Merge the remaining changes from the ad-namecache branch, affecting namei()
and getcwd():

- push vnode locking back as far as possible.
- do most lookups directly in the namecache, avoiding vnode locks & refs.
- don't block new refs to vnodes across VOP_INACTIVE().
- get shared locks for VOP_LOOKUP() if the file system supports it.
- correct lock types for VOP_ACCESS() / VOP_GETATTR() in a few places.

Possible future enhancements:

- make the lookups lockless.
- support dotdot lookups by being lockless and inferring absence of chroot.
- maybe make it work for layered file systems.
- avoid vnode references at the root & cwd.
2020-04-04 20:49:30 +00:00
ad 1d7848ad43 Process concurrent page faults on individual uvm_objects / vm_amaps in
parallel, where the relevant pages are already in-core.  Proposed on
tech-kern.

Temporarily disabled on MP architectures with __HAVE_UNLOCKED_PMAP until
adjustments are made to their pmaps.
2020-03-22 18:32:41 +00:00
pgoyette 3a65820412 Finish the transition to SYSCTL_SETUP by removing local sysctllog
in favor of the one provided by the module infrastructure.
2020-03-21 16:30:39 +00:00
ad 1912643ff9 Tweak the March 14th change to make page waits interlocked by pg->interlock.
Remove unneeded changes and only deal with the PQ_WANTED flag, to exclude
possible bugs.
2020-03-17 18:31:38 +00:00
pgoyette 9120d4511b Use the module subsystem's ability to process SYSCTL_SETUP() entries to
automate installation of sysctl nodes.

Note that there are still a number of device and pseudo-device modules
that create entries tied to individual device units, rather than to the
module itself.  These are not changed.
2020-03-16 21:20:09 +00:00
ad 94e054a199 Update a comment. 2020-03-14 21:47:41 +00:00
ad da3ef92bf6 Make uvm_pagemarkdirty() responsible for putting vnodes onto the syncer
work list.  Proposed on tech-kern@.
2020-03-14 20:45:23 +00:00