This seems to take about 3us on my Intel system. Two changes required:
- Have the caller of mi_switch() be responsible for calling spc_lock().
- Avoid using l->l_cpu in mi_switch().
While here:
- Add a couple of calls to membar_enter()
- Have the idle LWP set itself to LSIDL, to match softint_thread().
- Remove unused return value from mi_switch().
We already do a self-test for correctness of Hash_DRBG output;
applying rngtest to it does nothing but give everyone warning fatigue
about spurious rngtest failures.
circular list of peer CPUs in other packages, so we might scroll through
them in the scheduler when looking to distribute or steal jobs.
- Fold the run queue data structure into spc_schedstate. Makes kern_runq.c
a far more pleasant place to work.
- Remove the code in sched_nextlwp() that tries to steal jobs from other
CPUs. It's not needed, because we do the very same thing in the idle LWP
anyway. Outside the VM system this was one of the main causes of L3
cache misses I saw during builds. On my machine, this change yields a
60%-70% drop in time on the "hackbench" benchmark (there's clearly a bit
more going on here, but basically being less aggressive helps).
pserialize_perform() is now basically just xc_barrier(XC_HIGHPRI).
No more tentacles throughout the scheduler. Simplify the psz read
count for diagnostic assertions by putting it unconditionally into
cpu_info.
From rmind@, tidied up by me.
to make circular lists of CPU siblings in the same core, and in the
same package. Nothing fancy, just enough to have a bit of fun in the
scheduler trying out different tactics.
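
As an illustration only, a circular sibling list of this kind can be
sketched in plain C. All names here (struct cpu_node, cpu_ring_init,
cpu_ring_insert, cpu_ring_count) are hypothetical and are not the
kernel's; the sketch just shows a ring of one growing as CPUs that share
a core or package are attached, and a walk that visits each sibling once.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU structure; not the kernel's. */
struct cpu_node {
	int		 cn_id;
	struct cpu_node	*cn_sibling;	/* next CPU in the same ring */
};

/* A CPU with no known siblings forms a ring of one. */
static void
cpu_ring_init(struct cpu_node *cn, int id)
{
	cn->cn_id = id;
	cn->cn_sibling = cn;
}

/* Splice 'cn' (a ring of one) into the ring that contains 'peer'. */
static void
cpu_ring_insert(struct cpu_node *peer, struct cpu_node *cn)
{
	cn->cn_sibling = peer->cn_sibling;
	peer->cn_sibling = cn;
}

/* Walk the ring exactly once, starting and ending at 'cn'. */
static size_t
cpu_ring_count(const struct cpu_node *cn)
{
	const struct cpu_node *p = cn;
	size_t n = 0;

	do {
		n++;
		p = p->cn_sibling;
	} while (p != cn);
	return n;
}
```

Because the list is circular, a scheduler can start the walk from any
CPU and stop when it gets back to where it began, with no separate head.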
While here, fix a bug that was formerly in xcall(9): a missing
acquire operation in the xc_wait fast path so that all memory
operations in the xcall on remote CPUs will happen before any memory
operations on the issuing CPU after xc_wait returns.
All stores of xc->xc_donep are done with atomic_store_release so that
we can safely use atomic_load_acquire to read it outside the lock.
However, this fast path only works on platforms with cheap 64-bit
atomic load/store, so conditionalize it on __HAVE_ATOMIC64_LOADSTORE.
(Under the lock, no need for atomic loads since nobody else will be
issuing stores.)
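
For readers who want to see the pairing in isolation, here is a minimal
userland sketch using C11 <stdatomic.h>. It assumes that
atomic_store_release and atomic_load_acquire correspond to C11 release
stores and acquire loads; struct xc_state, xc_mark_done and xc_done_fast
are hypothetical names, and the locking and condvar handling of the real
code are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical miniature of the xcall state; field names echo the diff. */
struct xc_state {
	_Atomic uint64_t xc_donep;	/* number of calls completed */
	uint64_t	 xc_headp;	/* number of calls issued */
};

/*
 * Completion side.  In the kernel this runs with xc_lock held, so the
 * read of the old value does not race with another store.  The release
 * store publishes everything the cross-call did before it.
 */
static void
xc_mark_done(struct xc_state *xc)
{
	uint64_t done;

	done = atomic_load_explicit(&xc->xc_donep, memory_order_relaxed);
	assert(done < xc->xc_headp);
	atomic_store_explicit(&xc->xc_donep, done + 1, memory_order_release);
}

/*
 * Waiter fast path: true if the call identified by 'where' has already
 * completed.  The acquire load pairs with the release store above, so
 * the caller's later memory operations are ordered after the xcall's.
 */
static bool
xc_done_fast(struct xc_state *xc, uint64_t where)
{
	return atomic_load_explicit(&xc->xc_donep,
	    memory_order_acquire) >= where;
}
```

When xc_done_fast() returns false, the caller would fall back to the
slow path and block under the lock.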
For review, here's the relevant diff from the old version of the fast
path, from before it was removed and some other things changed in the
file:
diff --git a/sys/kern/subr_xcall.c b/sys/kern/subr_xcall.c
index 45a877aa90e0..b6bfb6455291 100644
--- a/sys/kern/subr_xcall.c
+++ b/sys/kern/subr_xcall.c
@@ -84,6 +84,7 @@ __KERNEL_RCSID(0, "$NetBSD: subr_xcall.c,v 1.27 2019/10/06 15:11:17 uwe Exp $");
#include <sys/evcnt.h>
#include <sys/kthread.h>
#include <sys/cpu.h>
+#include <sys/atomic.h>
#ifdef _RUMPKERNEL
#include "rump_private.h"
@@ -334,10 +353,12 @@ xc_wait(uint64_t where)
xc = &xc_low_pri;
}
+#ifdef __HAVE_ATOMIC64_LOADSTORE
/* Fast path, if already done. */
- if (xc->xc_donep >= where) {
+ if (atomic_load_acquire(&xc->xc_donep) >= where) {
return;
}
+#endif
/* Slow path: block until awoken. */
mutex_enter(&xc->xc_lock);
@@ -422,7 +443,11 @@ xc_thread(void *cookie)
(*func)(arg1, arg2);
mutex_enter(&xc->xc_lock);
+#ifdef __HAVE_ATOMIC64_LOADSTORE
+ atomic_store_release(&xc->xc_donep, xc->xc_donep + 1);
+#else
xc->xc_donep++;
+#endif
}
/* NOTREACHED */
}
@@ -462,7 +487,6 @@ xc__highpri_intr(void *dummy)
* Lock-less fetch of function and its arguments.
* Safe since it cannot change at this point.
*/
- KASSERT(xc->xc_donep < xc->xc_headp);
func = xc->xc_func;
arg1 = xc->xc_arg1;
arg2 = xc->xc_arg2;
@@ -475,7 +499,13 @@ xc__highpri_intr(void *dummy)
* cross-call has been processed - notify waiters, if any.
*/
mutex_enter(&xc->xc_lock);
- if (++xc->xc_donep == xc->xc_headp) {
+ KASSERT(xc->xc_donep < xc->xc_headp);
+#ifdef __HAVE_ATOMIC64_LOADSTORE
+ atomic_store_release(&xc->xc_donep, xc->xc_donep + 1);
+#else
+ xc->xc_donep++;
+#endif
+ if (xc->xc_donep == xc->xc_headp) {
cv_broadcast(&xc->xc_busy);
}
mutex_exit(&xc->xc_lock);
This was a very nice win in my tests on a 48 CPU box.
- Reorganise cpu_data slightly according to usage.
- Put cpu_onproc into struct cpu_info alongside ci_curlwp (it is now ci_onproc).
- On x86, put some items in their own cache lines according to usage, like
the IPI bitmask and ci_want_resched.
- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.
- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.
- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.
- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).
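
The requeue throttling in the last item can be sketched roughly as
follows. This is not the vnode code: struct lru_item, enum lru_list and
lru_requeue are hypothetical, and 'hz' stands for the number of clock
ticks in one second.

```c
#include <stdbool.h>

/* Hypothetical names throughout; this is not the vnode LRU code. */
enum lru_list { LRU_FREE, LRU_HOLD };

struct lru_item {
	enum lru_list	li_list;	/* list the item is currently on */
	unsigned int	li_stamp;	/* tick count at the last requeue */
};

/*
 * Move 'li' to the list 'want' unless it is already there and was
 * requeued less than 'hz' ticks ago.  Returns true if the item was
 * requeued, false if the request was skipped and nothing was written.
 * Unsigned subtraction keeps the age check valid across tick wraparound.
 */
static bool
lru_requeue(struct lru_item *li, enum lru_list want, unsigned int now,
    unsigned int hz)
{
	if (li->li_list == want && now - li->li_stamp < hz)
		return false;

	/* A real implementation would take the LRU lock and relink here. */
	li->li_list = want;
	li->li_stamp = now;
	return true;
}
```

Skipping the requeue means the common case neither takes the list lock
nor dirties the cache line that holds the list heads.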
Kernel build before and after:
119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system
- Delete the per-entry lock, and borrow the associated vnode's v_interlock
instead. We need to acquire it during lookup anyway. We can revisit this
in the future but for now it's a stepping stone, and works within the
quite limited context of what we have (BSD namecache/lookup design).
- Implement an idea that Mateusz Guzik (mjg@FreeBSD.org) gave me. In
cache_reclaim(), we don't need to lock out all of the CPUs to garbage
collect entries. All we need to do is observe their locks unheld at least
once: then we know they are not in the critical section, and no longer
have visibility of the entries about to be garbage collected.
- The above makes it safe for sysctl to take only namecache_lock to get stats,
and we can remove all the crap dealing with per-CPU locks.
- For lockstat, make namecache_lock static now that we have __cacheline_aligned.
- Avoid false sharing - don't write back to nc_hittime unless it has changed.
Put a comment in place explaining this. Pretty sure this was there in
2008/2009 but someone removed it (understandably, the code looks weird).
- Use a mutex to protect the garbage collection queue instead of atomics, and
adjust the low water mark up so that cache_reclaim() isn't doing so much
work at once.
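
The "observe their locks unheld" idea can be sketched in userland C as
below. It is only one possible reading of the description: it assumes
that taking and immediately dropping each per-CPU lock counts as seeing
it unheld, and it uses a toy C11 spin lock in place of kernel mutexes.
struct percpu_lock, pl_enter, pl_exit and observe_all_unheld are
hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy spin lock standing in for a per-CPU kernel mutex (hypothetical). */
struct percpu_lock {
	atomic_flag pl_held;
};

static void
pl_enter(struct percpu_lock *pl)
{
	while (atomic_flag_test_and_set_explicit(&pl->pl_held,
	    memory_order_acquire))
		continue;	/* spin until the current holder leaves */
}

static void
pl_exit(struct percpu_lock *pl)
{
	atomic_flag_clear_explicit(&pl->pl_held, memory_order_release);
}

/*
 * Visit each per-CPU lock in turn instead of holding them all at once.
 * Acquiring a lock proves that no lookup on that CPU is inside its
 * critical section at that instant; any lookup that starts afterwards
 * can no longer find entries that were unlinked before this pass began.
 * Returns the number of locks observed.
 */
static size_t
observe_all_unheld(struct percpu_lock *locks, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		pl_enter(&locks[i]);
		pl_exit(&locks[i]);
	}
	return i;
}
```

Only one lock is ever held at a time, so lookups on the other CPUs keep
running while the pass is in progress.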
- sched_tick: cpu_need_resched is no longer the correct thing to do here.
All we need to do is OR the request into the local ci_want_resched.
- sched_resched_cpu: we need to set RESCHED_UPREEMPT even on softint LWPs,
especially in the !__HAVE_FAST_SOFTINTS case, because the LWP with the
LP_INTR flag could be running via softint_overlay() - i.e. it has been
temporarily borrowed from a user process, and it needs to notice the
resched after it has stopped running softints.
This was full of definitions that have been obsolete for over a
decade. The file still remains for __HAVE_RW_STUBS but that's all.
Used only internally in kern_rwlock.c now, not by <sys/rwlock.h>.