Commit Graph

1354 Commits

Author SHA1 Message Date
riastradh 9fc453562f Round of uvm.h cleanup.
The poorly named uvm.h is generally supposed to be for uvm-internal
users only.

- Narrow it to files that actually need it -- mostly files that need
  to query whether curlwp is the pagedaemon, which should maybe be
  exposed by an external header.

- Use uvm_extern.h where feasible and uvm_*.h for things not exposed
  by it.  We should split up uvm_extern.h but this will serve for now
  to reduce the uvm.h dependencies.

- Use uvm_stat.h and #ifdef UVMHIST uvm.h for files that use
  UVMHIST(ubchist), since ubchist is declared in uvm.h but the
  reference evaporates if UVMHIST is not defined, so we reduce header
  file dependencies.

- Make uvm_device.h and uvm_swap.h independently includable while
  here.

ok chs@
2020-09-05 16:30:10 +00:00
riastradh 44afc3b3f9 genfs_rename: Fix deadlocks in cross-directory cyclic rename.
Reproducer:

A: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
    rmdir("c/d/e"); rmdir("c/d"); }
B: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
    rename("c", "c/d/e"); }
C: for (;;) { mkdir("c", 0600); mkdir("c/d", 0600); mkdir("c/d/e", 0600);
    rename("c/d/e", "c"); }

Deadlock:

- A holds c and wants to lock d; and either
- B holds . and d and wants to lock c, or
- C holds . and d and wants to lock c.

The problem with these is that genfs_rename_enter_separate in B or C
tried lock order .->d->c->e (in A/B, fdvp->tdvp->fvp->tvp; in A/C,
tdvp->fdvp->tvp->fvp) which violates the ancestor->descendant order
.->c->d->e.

The resolution is to change B to do fdvp->fvp->tdvp->tvp and C to do
tdvp->tvp->fdvp->fvp.  But there's an edge case: tvp and fvp might be
the same (hard links), and we can't detect that until after we've
looked them both up -- and in some file systems (I'm looking at you,
ufs), there is no mere lookup operation, only lookup-and-lock, so we
can't even hold the lock on one of tvp or fvp when we look up the
other one if there's a chance they might be the same.

Fortunately the cases
(a) tvp = fvp
(b) tvp or fvp is a directory
are mutually exclusive as long as directories cannot be hard-linked.
In case (a) we can just defer locking {tvp, fvp} until the end, because
it can't possibly have {fdvp or fvp, tdvp or tvp} as descendants.  In
case (b) we can just lock them in the order fdvp->fvp->tdvp->tvp or
tdvp->tvp->fdvp->fvp if the first one of {fvp, tvp} is a directory,
because it can't possibly coincide with the second one of {fvp, tvp}.

With this change, we can now prove that the locking order is consistent
with the ancestor->descendant partial ordering.  Where two nodes are
incommensurate under that partial ordering, they are only ever locked
by rename and there is only ever one rename at a time.

Proof:

- For same-directory renames, genfs_rename_enter_common locks the
  directory first and then the children.  The order
  directory->child[i] is consistent with ancestor->descendant and
  child[0]/child[1] are incommensurate.

- For cross-directory renames:

  . While a rename is in progress and the fs-wide rename lock is held,
    directories can be created or removed but not changed, so the
    outcome of gro_genealogy -- which, given fdvp and tdvp, returns
    the node N relating fdvp/N/.../tdvp or null if there is none --
    can only transition from finding N to not finding N, if one of
    the directories is removed while any of the vnodes are unlocked.
    Merely creating directories cannot change the ancestry of tdvp,
    and concurrent renames are not possible.

    Thus, if a gro_genealogy determined the operation to have the
    form fdvp/N/.../tdvp, then it might cease to have that form, but
    only because tdvp was removed which will harmlessly cause the
    rename to fail later on.  Similarly, if gro_genealogy determined
    the operation _not_ to have the form fdvp/N/.../tdvp then it
    can't begin to have that form until after the rename has
    completed.

    The lock order is,

    => for fdvp/.../tdvp:
       1. lock fdvp
       2. lookup(/lock/unlock) fvp (consistent with fdvp->fvp)
       3. lock fvp if a directory (consistent with fdvp->fvp)
       4. lock tdvp (consistent with fdvp->tdvp and possibly fvp->tdvp)
       5. lookup(/lock/unlock) tvp (consistent with tdvp->tvp)
       6. lock fvp if a nondirectory (fvp->t* or fvp->fdvp is impossible)
       7. lock tvp if not fvp (tvp->f* is impossible unless tvp=fvp)

    => for incommensurate fdvp & tdvp, or for tdvp/.../fdvp:
       1. lock tdvp
       2. lookup(/lock/unlock) tvp (consistent with tdvp->tvp)
       3. lock tvp if a directory (consistent with tdvp->tvp)
       4. lock fdvp (either incommensurate with tdvp and/or tvp, or
          consistent with tdvp(->tvp)->fdvp)
       5. lookup(/lock/unlock) fvp (consistent with fdvp->fvp)
       6. lock tvp if a nondirectory (tvp->f* or tvp->tdvp is impossible)
       7. lock fvp if not tvp (fvp->t* is impossible unless fvp=tvp)

Deadlocks found by hannken@; resolution worked out with dholland@.

XXX I think we could improve concurrency somewhat -- with a likely
big win for applications like tar and rsync that create many files
with temporary names and then rename them to the permanent one in the
same directory -- by making vfs_renamelock a reader/writer lock: any
number of same-directory renames, or exactly one cross-directory
rename, at any one time.
2020-09-05 02:47:03 +00:00
simonb bf74807839 Remove trailing \n from UVMHIST_LOG() format strings. 2020-08-19 07:29:00 +00:00
chs 19303cecfc centralize calls from UVM to radixtree into a few functions.
in those functions, assert that the object lock is held in
the correct mode.
2020-08-14 09:06:14 +00:00
rin 3123ec52cf Output offsets in hex for UVMHIST. 2020-08-10 11:09:15 +00:00
christos ad1efe3529 accmode should be accmode_t 2020-08-07 18:14:21 +00:00
christos 79e3c74f8e Introduce genfs_pathconf() and use it for the default case in all filesystems. 2020-06-27 17:29:17 +00:00
ad 2806b3da8b genfs_putpages(): when building a cluster make use of pages in the in the
existing uvm_page_array.
2020-06-14 00:25:22 +00:00
ad ba90a6ba38 Counter tweaks:
- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
  each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
  for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one().  It has no users and doesn't save a whole lot.
  For the cheap option, give cpu_count_sync() a boolean parameter indicating
  that a cached value is okay, and rate limit the updates for cached values
  to hz.
2020-06-11 22:21:05 +00:00
ad 4b8a875ae2 uvm_availmem(): give it a boolean argument to specify whether a recent
cached value will do, or if the very latest total must be fetched.  It can
be called thousands of times a second and fetching the totals impacts not
only the calling LWP but other CPUs doing unrelated activity in the VM
system.
2020-06-11 19:20:42 +00:00
rin 46b290a0c3 struct statvfs is too large for stack. Use malloc(9) instead.
XXX
Switch to kmem(9) for entire this file.

Frame size, e.g. for m68k, becomes:
    3292 --> 12
2020-05-31 08:38:54 +00:00
bouyer ecb2afc2be Add need-flags for kernfs.
Compile Xen kernfs support only if kernfs is compiled in the kernel.
Should fix MODULAR build.
2020-05-26 10:37:24 +00:00
ad 4bfe043955 - Alter the convention for uvm_page_array slightly, so the basic search
parameters can't change part way through a search: move the "uobj" and
  "flags" arguments over to uvm_page_array_init() and store those with the
  array.

- With that, detect when it's not possible to find any more pages in the
  tree with the given search parameters, and avoid repeated tree lookups if
  the caller loops over uvm_page_array_fill_and_peek().
2020-05-25 21:15:10 +00:00
ad 0eaaa024ea Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.
2020-05-23 23:42:41 +00:00
christos 44ed7d42f1 Fix EPERM vs EACCES on chtimes (thanks @hannken) 2020-05-20 17:06:15 +00:00
christos 10d4c4e928 remove debugging, it is just clutter. 2020-05-18 19:55:42 +00:00
christos e59d6517f7 Fix EPERM vs EACCES return. 2020-05-18 19:42:16 +00:00
ad ff872804dc Start trying to reduce cache misses on vm_page during fault processing.
- Make PGO_LOCKED getpages imply PGO_NOBUSY and remove the latter.  Mark
  pages busy only when there's actually I/O to do.

- When doing COW on a uvm_object, don't mess with neighbouring pages.  In
  all likelyhood they're already entered.

- Don't mess with neighbouring VAs that have existing mappings as replacing
  those mappings with same can be quite costly.

- Don't enqueue pages for neighbour faults unless not enqueued already, and
  don't activate centre pages unless uvmpdpol says its useful.

Also:

- Make PGO_LOCKED getpages on UAOs work more like vnodes: do gang lookup in
  the radix tree, and don't allocate new pages.

- Fix many assertion failures around faults/loans with tmpfs.
2020-05-17 19:38:16 +00:00
christos 9aa2a9c323 Add ACL support for FFS. From FreeBSD. 2020-05-16 18:31:45 +00:00
riastradh 8f400c1021 Put forward declaration a little further forward to unbreak build. 2020-04-29 07:18:24 +00:00
thorpej 91da5a2e36 If the procfs mount is marked as linux-compat, then allow proc lookup
by any LWP ID in the proc, not just the canonical PID.
2020-04-29 01:56:54 +00:00
christos 2f4aa83fe6 Allow root to access and modify system space extended attributes.
XXX: this routine should not be using the string, but the attribute namespace.
I have fixed this in the ACL code.
2020-04-25 22:28:47 +00:00
ad e88c11f417 Revert the changes made in February to make cwdinfo use mostly lockless,
which relied on taking extra vnode refs.

Having benchmarked various experimental changes over the past few months it
seems that it's better to avoid vnode refs as much as possible.  cwdi_lock
as a RW lock already did that to some extent for getcwd() and will permit
the same for namei() too.
2020-04-21 21:42:47 +00:00
martin 0a1cb03168 Add missing include of <sys/atomic.h> to fix the build 2020-04-20 13:30:34 +00:00
htodd 7bceb35060 Sort include files. 2020-04-20 05:22:28 +00:00
htodd 6b104688ac Add missing include to fix build. 2020-04-20 05:11:00 +00:00
thorpej a29147fa13 - Only increment nprocs when we're creating a new process, not just
when allocating a PID.
- Per above, proc_free_pid() no longer decrements nprocs.  It's now done
  in proc_free() right after proc_free_pid().
- Ensure nprocs is accessed using atomics everywhere.
2020-04-19 20:31:59 +00:00
jdolecek 5dec3f0781 when determining I/O block size for VBLK device, only use pi_bsize
returned by DIOCGPARTINFO if it's bigger than DEV_BSIZE and less
than MAXBSIZE (MAXPHYS)

fixes panic "buf mem pool index 8" in buf_mempoolidx() when the
disklabel contains bsize 128KB and something reads the block device -
buffer cache can't allocate bufs bigger than MAXPHYS
2020-04-13 20:02:27 +00:00
ad 23bf88000c Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed().  Signature matches
FreeBSD.
2020-04-13 19:23:17 +00:00
jdolecek 8d1e4d9c00 switch to kmem_zalloc() instead of malloc() for struct kernfs_mount 2020-04-07 08:35:49 +00:00
jdolecek 8a3ee72648 switch KERNFS_ALLOCENTRY() to use kmem_zalloc() instead of malloc() 2020-04-07 08:14:42 +00:00
ad c90f9c8c81 Merge the remaining changes from the ad-namecache branch, affecting namei()
and getcwd():

- push vnode locking back as far as possible.
- do most lookups directly in the namecache, avoiding vnode locks & refs.
- don't block new refs to vnodes across VOP_INACTIVE().
- get shared locks for VOP_LOOKUP() if the file system supports it.
- correct lock types for VOP_ACCESS() / VOP_GETATTR() in a few places.

Possible future enhancements:

- make the lookups lockless.
- support dotdot lookups by being lockless and inferring absence of chroot.
- maybe make it work for layered file systems.
- avoid vnode references at the root & cwd.
2020-04-04 20:49:30 +00:00
ad 1d7848ad43 Process concurrent page faults on individual uvm_objects / vm_amaps in
parallel, where the relevant pages are already in-core.  Proposed on
tech-kern.

Temporarily disabled on MP architectures with __HAVE_UNLOCKED_PMAP until
adjustments are made to their pmaps.
2020-03-22 18:32:41 +00:00
pgoyette 3a65820412 Finish the transition to SYSCTL_SETUP by removing local sysctllog
in favor of the one provided by the module infrastructure.
2020-03-21 16:30:39 +00:00
ad 1912643ff9 Tweak the March 14th change to make page waits interlocked by pg->interlock.
Remove unneeded changes and only deal with the PQ_WANTED flag, to exclude
possible bugs.
2020-03-17 18:31:38 +00:00
pgoyette 9120d4511b Use the module subsystem's ability to process SYSCTL_SETUP() entries to
automate installation of sysctl nodes.

Note that there are still a number of device and pseudo-device modules
that create entries tied to individual device units, rather than to the
module itself.  These are not changed.
2020-03-16 21:20:09 +00:00
ad 94e054a199 Update a comment. 2020-03-14 21:47:41 +00:00
ad da3ef92bf6 Make uvm_pagemarkdirty() responsible for putting vnodes onto the syncer
work list.  Proposed on tech-kern@.
2020-03-14 20:45:23 +00:00
ad 5972ba1600 Make page waits (WANTED vs BUSY) interlocked by pg->interlock. Gets RW
locks out of the equation for sleep/wakeup, and allows observing+waiting
for busy pages when holding only a read lock.  Proposed on tech-kern.
2020-03-14 20:23:51 +00:00
ad 6ba8fa570a Unused variable. 2020-03-14 19:07:22 +00:00
ad 16d4fad635 - Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
  any priority boost gained earlier from blocking.
2020-03-14 18:08:38 +00:00
ad 01f564d8c8 OR into bp->b_cflags; don't overwrite. 2020-03-14 15:31:29 +00:00
ad bf79731039 Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.
2020-02-27 22:12:53 +00:00
ad 1316228274 v_interlock -> vmobjlock 2020-02-24 20:49:51 +00:00
ad bada6a544d v_interlock -> vmobjlock 2020-02-24 20:44:25 +00:00
ad 926b25e154 Merge from ad-namecache:
- Have a stab at clustering the members of vnode_t and vnode_impl_t in a
  more cache-conscious way.  With that done, go back to adjusting v_usecount
  with atomics and keep vi_lock directly in vnode_impl_t (saves KVA).

- Allow VOP_LOCK(LK_NONE) for the benefit of VFS_VGET() and VFS_ROOT().
  Make sure LK_UPGRADE always comes with LK_NOWAIT.

- Make cwdinfo use mostly lockless.
2020-02-23 22:14:03 +00:00
ad d2a0ebb67a UVM locking changes, proposed on tech-kern:
- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart.  v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap.  Others to follow later.
2020-02-23 15:46:38 +00:00
riastradh 16a0e8da32 Use vn_bwrite, not genfs_nullop, for VOP_BWRITE.
VOP_BWRITE is responsible for calling biodone; can't just leave it
hanging.

XXX pullup
2020-02-20 15:48:05 +00:00
chs 5232c510c9 remove the aiodoned thread. I originally added this to provide a thread context
for doing page cache iodone work, but since then biodone() has changed to
hand off all iodone work to a softint thread, so we no longer need the
special-purpose aiodoned thread.
2020-02-18 20:23:17 +00:00
riastradh b26fba762e Use specfs vnops for specnodes in kernfs.
While here, don't filter out rootdev and rrootdev merely because
they're not cached.

Fixes the elusive /kern/rootdev and /kern/rrootdev nodes, which only
appeared sometimes when they felt like it, and fixes operations on
/kern/rootdev and /kern/rrootdev always returning EOPNOTSUPP.

We didn't seem to have a single PR for these issues but the following
PRs are all relevant:

PR bin/13564
PR kern/38265
PR kern/38778
PR kern/45974

XXX pullup-9, pullup-8, pullup-7, pullup-6, pullup-5, pullup-4, pullup-3, pullup-2, pullup-1.4T...
2020-02-04 04:19:24 +00:00