Commit Graph

10725 Commits

Author SHA1 Message Date
ad
0b2f97d6bd For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.
2019-12-08 20:35:23 +00:00
ad
2210aab771 Revert previous. No performance gain worth the potential headaches
with buffers in these contexts.
2019-12-08 19:52:37 +00:00
ad
51eccb6fe4 Adjustment to previous: if we're going to toss the buffer, then wake
everybody.
2019-12-08 19:49:25 +00:00
ad
8dc9961386 - Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.
2019-12-08 19:26:05 +00:00
ad
d9135670c1 Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy) 2019-12-08 19:23:51 +00:00
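
The cv_signal()-for-cv_broadcast() swap above is the standard thundering-herd fix for the buffer busy flag: only one waiter is woken per release, and it either claims the buffer or hands the wakeup on when it in turn releases it. A minimal sketch of a release path, assuming the usual condvar(9) pattern (bufcache_lock, BC_BUSY and bp->b_busy as in the NetBSD buffer cache; the real release code is more involved):

	/*
	 * Sketch only: releasing a busy buffer.  cv_broadcast() would wake
	 * every thread sleeping on bp->b_busy and have them all contend for
	 * the buffer; cv_signal() wakes just one, which either takes the
	 * buffer or passes the wakeup along on its own release.
	 */
	mutex_enter(&bufcache_lock);
	bp->b_cflags &= ~BC_BUSY;
	cv_signal(&bp->b_busy);		/* was cv_broadcast(&bp->b_busy) */
	mutex_exit(&bufcache_lock);
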
maxv
cbaecaa959 Use the inlines; it is actually fine, since the compiler drops the inlines
if the caller is kmsan-instrumented, forcing a white-listing of the memory
access.
2019-12-08 11:53:54 +00:00
ad
9b2ceb0ad8 mi_switch: move an over-eager KASSERT defeated by kernel preemption.
Discovered during automated testing.
2019-12-07 21:14:36 +00:00
kamil
61223bbb2b Revert the in_interrupt() change to again use the x86-specific code
This is a prerequisite for kMSan and upcoming kernel changes.

Discussed with <maxv>
2019-12-07 19:50:33 +00:00
ad
9079e8e09e mi_switch: move LOCKDEBUG_BARRIER later to accommodate holding two locks
on entry.
2019-12-07 17:36:33 +00:00
ad
4477d28d73 Make it possible to call mi_switch() and immediately switch to another CPU.
This seems to take about 3us on my Intel system.  Two changes required:

- Have the caller of mi_switch() be responsible for calling spc_lock().
- Avoid using l->l_cpu in mi_switch().

While here:

- Add a couple of calls to membar_enter()
- Have the idle LWP set itself to LSIDL, to match softint_thread().
- Remove unused return value from mi_switch().
2019-12-06 21:36:10 +00:00
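
A heavily hedged sketch of the new calling convention described above: the caller, not mi_switch() itself, takes the scheduler lock, which is what lets the switched-away LWP be picked up by another CPU immediately. spc_lock() is assumed to be the inline that enters the CPU's spc_mutex; the real protocol around lwp_lock() and the return path is more involved than shown.

	/* Sketch: 'l' is the LWP giving up the CPU. */
	struct cpu_info *ci = curcpu();

	spc_lock(ci);		/* assumption: enters ci's scheduler mutex */
	mi_switch(l);		/* no longer returns a value */
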
ad
4c55fef0a6 sched_tick(): don't try to optimise something that's called 10 times a
second; it's a fine way to introduce bugs (and I did).  Use the MI
interface for rescheduling which always does the correct thing.
2019-12-06 18:33:19 +00:00
ad
a730537ede softint_trigger (slow case): set RESCHED_IDLE too just to be consistent.
No functional change.
2019-12-06 18:15:57 +00:00
kamil
0dd7df5e3c Correct the signals in siglist+sigmask passed in kinfo_lwp
Compute the union of LWP and PROC pending signals correctly.
2019-12-06 17:41:43 +00:00
maxv
9bef5cce95 cast to proper type 2019-12-06 16:54:47 +00:00
maxv
48d18df02a Fix a bunch of unimportant "Local variable hides global variable" warnings
from the LGTM bot.
2019-12-06 08:35:21 +00:00
maxv
be264b1266 Minor changes, reported by the LGTM bot. 2019-12-06 07:27:06 +00:00
riastradh
6f17d02bf7 Switch psz_ev_excl to static evcnt. 2019-12-05 03:21:29 +00:00
riastradh
d5dccc2571 Restore psz_lock just for the event count.
Cost of mutex_enter/exit is negligible compared to the xcall we just
did, so this is not going to meaningfully affect performance.
2019-12-05 03:21:17 +00:00
riastradh
de3acc9d56 Allow equality in this assertion.
This can happen if we lose the race mentioned in percpu_cpu_swap.
2019-12-05 03:21:08 +00:00
wiz
7b4936e883 Fix typo in comment (typlogy) 2019-12-04 09:34:12 +00:00
riastradh
f9764fd750 Disable rngtest on output of cprng_strong.
We already do a self-test for correctness of Hash_DRBG output;
applying rngtest to it does nothing but give everyone warning fatigue
about spurious rngtest failures.
2019-12-04 05:36:34 +00:00
ad
dece39714a - Add some more failsafes to the CPU topology stuff, and build a 3rd
  circular list of peer CPUs in other packages, so we might scroll through
  them in the scheduler when looking to distribute or steal jobs.

- Fold the run queue data structure into spc_schedstate.  Makes kern_runq.c
  a far more pleasant place to work.

- Remove the code in sched_nextlwp() that tries to steal jobs from other
  CPUs.  It's not needed, because we do the very same thing in the idle LWP
  anyway.  Outside the VM system this was one of the main causes of L3
  cache misses I saw during builds.  On my machine, this change yields a
  60%-70% drop in time on the "hackbench" benchmark (there's clearly a bit
  more going on here, but basically being less aggressive helps).
2019-12-03 22:28:41 +00:00
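
A hedged sketch of what "scrolling through" the third list might look like when the scheduler goes hunting for work; the field name ci_sibling[], the relation index CPUREL_PEER, and the helper below are assumptions, not quotes of the actual code:

	/* Visit one CPU from each other package, wrapping back to ourselves. */
	struct cpu_info *ci = curcpu();
	struct cpu_info *peer = ci->ci_sibling[CPUREL_PEER];	/* assumed field */

	while (peer != ci) {
		if (try_steal_job_from(peer))	/* hypothetical helper */
			break;
		peer = peer->ci_sibling[CPUREL_PEER];
	}
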
riastradh
4ec73a10b2 Use __insn_barrier to enforce ordering in l_ncsw loops.
(Only need ordering observable by interruption, not by other CPUs.)
2019-12-03 15:20:59 +00:00
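
The loops in question follow the usual "watch l_ncsw to detect preemption" idiom; the compiler barrier stops the compiler from caching the counter across the blocking call, which is all that is needed because the value only changes when this LWP is switched out and back in. A minimal sketch (the blocking call is hypothetical, and the local's type simply mirrors whatever l_ncsw is):

	struct lwp *l = curlwp;
	u_int ncsw;

	for (;;) {
		ncsw = l->l_ncsw;
		__insn_barrier();
		do_something_that_may_sleep();	/* hypothetical */
		__insn_barrier();
		if (l->l_ncsw == ncsw)
			break;	/* not context-switched meanwhile; state still valid */
		/* We slept or were preempted: revalidate and retry. */
	}
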
martin
536d4955dc Stopgap hack to unbreak the build: #ifdef __HAVE_ATOMIC64_LOADSTORE
the event counter update. From rmind@
2019-12-03 13:30:52 +00:00
riastradh
a0c864ecf3 Rip out pserialize(9) logic now that the RCU patent has expired.
pserialize_perform() is now basically just xc_barrier(XC_HIGHPRI).
No more tentacles throughout the scheduler.  Simplify the psz read
count for diagnostic assertions by putting it unconditionally into
cpu_info.

From rmind@, tidied up by me.
2019-12-03 05:07:48 +00:00
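
Going by the commit message, the core of the simplification is that a publish/retire cycle now only needs a high-priority cross-call barrier: once every CPU has run the (empty) cross-call, no CPU can still be inside a read section that saw the old data. A sketch of that idea under an illustrative name:

	/* Illustrative stand-in for the simplified pserialize_perform(). */
	static void
	wait_for_readers(void)
	{
		/*
		 * xc_barrier(XC_HIGHPRI) returns only after every CPU has
		 * processed a high-priority cross-call, i.e. has passed
		 * through a point outside any pserialize read section.
		 */
		xc_barrier(XC_HIGHPRI);
	}
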
ad
5d954ab634 Take the basic CPU topology information we already collect, and use it
to make circular lists of CPU siblings in the same core, and in the
same package.  Nothing fancy, just enough to have a bit of fun in the
scheduler trying out different tactics.
2019-12-02 23:22:43 +00:00
riastradh
53ecfc3aad Restore xcall(9) fast path using atomic_load/store_*.
While here, fix a bug that was formerly in xcall(9): a missing
acquire operation in the xc_wait fast path so that all memory
operations in the xcall on remote CPUs will happen before any memory
operations on the issuing CPU after xc_wait returns.

All stores of xc->xc_donep are done with atomic_store_release so that
we can safely use atomic_load_acquire to read it outside the lock.
However, this fast path only works on platforms with cheap 64-bit
atomic load/store, so conditionalize it on __HAVE_ATOMIC64_LOADSTORE.
(Under the lock, no need for atomic loads since nobody else will be
issuing stores.)

For review, here's the relevant diff from the old version of the fast
path, from before it was removed and some other things changed in the
file:

diff --git a/sys/kern/subr_xcall.c b/sys/kern/subr_xcall.c
index 45a877aa90e0..b6bfb6455291 100644
--- a/sys/kern/subr_xcall.c
+++ b/sys/kern/subr_xcall.c
@@ -84,6 +84,7 @@ __KERNEL_RCSID(0, "$NetBSD: subr_xcall.c,v 1.27 2019/10/06 15:11:17 uwe Exp $");
 #include <sys/evcnt.h>
 #include <sys/kthread.h>
 #include <sys/cpu.h>
+#include <sys/atomic.h>

 #ifdef _RUMPKERNEL
 #include "rump_private.h"
@@ -334,10 +353,12 @@ xc_wait(uint64_t where)
 		xc = &xc_low_pri;
 	}

+#ifdef __HAVE_ATOMIC64_LOADSTORE
 	/* Fast path, if already done. */
-	if (xc->xc_donep >= where) {
+	if (atomic_load_acquire(&xc->xc_donep) >= where) {
 		return;
 	}
+#endif

 	/* Slow path: block until awoken. */
 	mutex_enter(&xc->xc_lock);
@@ -422,7 +443,11 @@ xc_thread(void *cookie)
 		(*func)(arg1, arg2);

 		mutex_enter(&xc->xc_lock);
+#ifdef __HAVE_ATOMIC64_LOADSTORE
+		atomic_store_release(&xc->xc_donep, xc->xc_donep + 1);
+#else
 		xc->xc_donep++;
+#endif
 	}
 	/* NOTREACHED */
 }
@@ -462,7 +487,6 @@ xc__highpri_intr(void *dummy)
 	 * Lock-less fetch of function and its arguments.
 	 * Safe since it cannot change at this point.
 	 */
-	KASSERT(xc->xc_donep < xc->xc_headp);
 	func = xc->xc_func;
 	arg1 = xc->xc_arg1;
 	arg2 = xc->xc_arg2;
@@ -475,7 +499,13 @@ xc__highpri_intr(void *dummy)
 	 * cross-call has been processed - notify waiters, if any.
 	 */
 	mutex_enter(&xc->xc_lock);
-	if (++xc->xc_donep == xc->xc_headp) {
+	KASSERT(xc->xc_donep < xc->xc_headp);
+#ifdef __HAVE_ATOMIC64_LOADSTORE
+	atomic_store_release(&xc->xc_donep, xc->xc_donep + 1);
+#else
+	xc->xc_donep++;
+#endif
+	if (xc->xc_donep == xc->xc_headp) {
 		cv_broadcast(&xc->xc_busy);
 	}
 	mutex_exit(&xc->xc_lock);
2019-12-01 20:56:39 +00:00
ad
2b25ff6d06 Back out previous temporarily - seeing unusual lookup failures. Will
come back to it.
2019-12-01 18:31:19 +00:00
kamil
82c05df197 Switch in_interrupt() in KCOV to cpu_intr_p()
This makes KCOV more MI-friendly and removes the x86-specific in_interrupt()
implementation.
2019-12-01 17:41:11 +00:00
ad
10fb14e25f Init kern_runq and kern_synch before booting secondary CPUs. 2019-12-01 17:08:31 +00:00
ad
ca7481a7dd Back out the fastpath change in xc_wait(). It's going to be done differently. 2019-12-01 17:06:00 +00:00
ad
e398df6f78 Make the fast path in xc_wait() depend on _LP64 for now. Needs 64-bit
load/store.  To be revisited.
2019-12-01 16:32:01 +00:00
ad
57eb66c673 Fix false sharing problems with cpu_info. Identified with tprof(8).
This was a very nice win in my tests on a 48 CPU box.

- Reorganise cpu_data slightly according to usage.
- Put cpu_onproc into struct cpu_info alongside ci_curlwp (now is ci_onproc).
- On x86, put some items in their own cache lines according to usage, like
  the IPI bitmask and ci_want_resched.
2019-12-01 15:34:44 +00:00
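
The general technique, as a minimal sketch rather than the real cpu_info layout: fields that are written by different CPUs, or at very different rates, are pushed onto separate cache lines with __cacheline_aligned so that a hot store to one does not keep invalidating the line holding the others.

	struct frob_percpu {			/* illustrative, not cpu_info */
		/* Written constantly by the owning CPU. */
		struct lwp	*fp_onproc __cacheline_aligned;

		/* Written by remote CPUs: IPI bits, resched requests. */
		uint32_t	fp_ipis __cacheline_aligned;
		uint32_t	fp_want_resched;

		/* Read-mostly data can share a line. */
		uint32_t	fp_flags __cacheline_aligned;
	};
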
ad
c242783135 Fix a longstanding problem with LWP limits. When changing the user's
LWP count, we must use the process credentials because that's what
the accounting entity is tied to.

Reported-by: syzbot+d193266676f635661c62@syzkaller.appspotmail.com
2019-12-01 15:27:58 +00:00
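
The essence of the fix, sketched under assumptions: the uid charged for the new LWP must come from the process credentials (p->p_cred), not from the creating LWP's own credentials, because that is the entity the resource accounting is tied to. chglwpcnt() is assumed here to be the per-uid LWP counter helper; treat the exact calls as illustrative.

	/* Charge the LWP to the user the process runs as. */
	uid_t uid = kauth_cred_getuid(p->p_cred);	/* not l->l_cred */

	if (__predict_false(chglwpcnt(uid, 1) >
	    p->p_rlimit[RLIMIT_NTHR].rlim_cur)) {
		(void)chglwpcnt(uid, -1);
		return EAGAIN;
	}
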
ad
4bc8197e77 If the system is not up and running yet, just run the function locally. 2019-12-01 14:20:00 +00:00
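
This is the usual early-boot special case for cross-calls: before the secondary CPUs and the xcall threads exist, a broadcast can simply call the function directly on the current CPU. A hedged sketch; the condition actually tested by the commit may differ:

	if (__predict_false(cold || !mp_online)) {
		/* Too early for cross-calls: run it right here. */
		(*func)(arg1, arg2);
		return 0;		/* nothing for xc_wait() to wait on */
	}
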
ad
94bb47e411 Regen for VOP_LOCK & LK_UPGRADE/LK_DOWNGRADE. 2019-12-01 13:58:52 +00:00
ad
fb0bbaf12a Minor vnode locking changes:
- Stop using atomics to manipulate v_usecount.  It was a mistake to begin
  with.  It doesn't work as intended unless the XLOCK bit is incorporated in
  v_usecount and we don't have that any more.  When I introduced this 10+
  years ago it was to reduce pressure on v_interlock but it doesn't do that,
  it just makes stuff disappear from lockstat output and introduces problems
  elsewhere.  We could do atomic usecounts on vnodes but there has to be a
  well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
  when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
  struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
  vnode if it's already on the correct list and was requeued recently (less
  than a second ago).

Kernel build before and after:

119.63s real  1453.16s user  2742.57s system
115.29s real  1401.52s user  2690.94s system
2019-12-01 13:56:29 +00:00
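
Two of the bullet points above, sketched under assumptions (vip standing in for the vnode's vnode_impl): the usecount goes back to being manipulated only with v_interlock held, and the vnode lock is allocated with rw_obj_alloc() so it no longer shares cache lines with the rest of struct vnode.

	/* Reference counting under the interlock again, no atomics. */
	mutex_enter(vp->v_interlock);
	vp->v_usecount++;
	mutex_exit(vp->v_interlock);

	/* At vnode allocation time: the lock object lives outside the vnode. */
	vip->vi_lock = rw_obj_alloc();
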
ad
566436656a namecache changes:
- Delete the per-entry lock, and borrow the associated vnode's v_interlock
  instead.  We need to acquire it during lookup anyway.  We can revisit this
  in the future but for now it's a stepping stone, and works within the
  quite limited context of what we have (BSD namecache/lookup design).

- Implement an idea that Mateusz Guzik (mjg@FreeBSD.org) gave me.  In
  cache_reclaim(), we don't need to lock out all of the CPUs to garbage
  collect entries.  All we need to do is observe their locks unheld at least
  once: then we know they are not in the critical section, and no longer
  have visibility of the entries about to be garbage collected.

- The above makes it safe for sysctl to take only namecache_lock to get stats,
  and we can remove all the crap dealing with per-CPU locks.

- For lockstat, make namecache_lock static now that we have __cacheline_aligned.

- Avoid false sharing - don't write back to nc_hittime unless it has changed.
  Put a comment in place explaining this.  Pretty sure this was there in
  2008/2009 but someone removed it (understandably, the code looks weird).

- Use a mutex to protect the garbage collection queue instead of atomics, and
  adjust the low water mark up so that cache_reclaim() isn't doing so much
  work at once.
2019-12-01 13:39:53 +00:00
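
A hedged sketch of the idea credited to Mateusz Guzik above: cache_reclaim() does not need to hold all of the per-CPU locks at once, it only needs evidence that every CPU has been outside its lookup critical section at least once after the entries were unlinked. The per-CPU lock accessor below is hypothetical.

	CPU_INFO_ITERATOR cii;
	struct cpu_info *ci;

	/* The doomed entries are already unlinked from the hash. */
	for (CPU_INFO_FOREACH(cii, ci)) {
		kmutex_t *lock = cpu_nc_lock(ci);	/* hypothetical accessor */

		/* Acquire+release proves this CPU has since been outside
		 * (or was never inside) its critical section. */
		mutex_enter(lock);
		mutex_exit(lock);
	}
	/* Now nobody can still see the unlinked entries: free them. */
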
ad
036b61e0aa PR port-sparc/54718 (sparc install hangs since recent scheduler changes)
- sched_tick: cpu_need_resched is no longer the correct thing to do here.
  All we need to do is OR the request into the local ci_want_resched.

- sched_resched_cpu: we need to set RESCHED_UPREEMPT even on softint LWPs,
  especially in the !__HAVE_FAST_SOFTINTS case, because the LWP with the
  LP_INTR flag could be running via softint_overlay() - i.e. it has been
  temporarily borrowed from a user process, and it needs to notice the
  resched after it has stopped running softints.
2019-12-01 13:20:42 +00:00
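
A minimal sketch of the sched_tick() half of the fix, going only by the wording above: on the CPU's own clock tick it is enough to OR the preemption request into ci_want_resched, with no cpu_need_resched() call and no IPI. The flag choice and the atomic op are assumptions about the surrounding code.

	/* sched_tick(), running on 'ci' itself: just record the request. */
	atomic_or_uint(&ci->ci_want_resched, RESCHED_UPREEMPT);
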
maxv
890f284aec Add KCSAN instrumentation for atomic_{load,store}_*. 2019-12-01 08:15:58 +00:00
ad
b8a643bb0c VOP_UNLOCK + vrele -> vput 2019-11-30 20:45:49 +00:00
ad
2936562466 Back out previous. It works on amd64 under stress test but not
evbarm-aarch64 for some reason.  Will revisit.
2019-11-30 14:21:16 +00:00
ad
c5154900d0 A couple more tweaks to avoid reading the lock word. 2019-11-29 20:50:54 +00:00
riastradh
915952b159 Largely eliminate the MD rwlock.h header file.
This was full of definitions that have been obsolete for over a
decade.  The file still remains for __HAVE_RW_STUBS but that's all.
Used only internally in kern_rwlock.c now, not by <sys/rwlock.h>.
2019-11-29 20:04:52 +00:00
ad
874c8f508b Get rid of MUTEX_RECEIVE/MUTEX_GIVE. 2019-11-29 19:44:59 +00:00
ad
c7d1277ea0 Don't try to kpreempt a CPU hog unless __HAVE_PREEMPTION (oops). 2019-11-29 18:29:45 +00:00
ad
79ec83719f Don't try to IPI other CPUs early on. Fixes a crash on sparc64. Thanks
to martin@ for diagnosing.
2019-11-27 20:31:13 +00:00
ad
781a39cc41 Remove some unneeded memory barriers and reads of the lock word.
Prodded by Mateusz Guzik.
2019-11-25 20:16:22 +00:00
ad
65e19688a4 port-sparc/54718 (sparc install hangs since recent scheduler changes) 2019-11-25 17:24:59 +00:00
riastradh
8cfdd6f56f Use cprng_strong, not cprng_fast, for sysctl kern.arnd. 2019-11-25 15:19:54 +00:00