This was a very nice win in my tests on a 48 CPU box.
- Reorganise cpu_data slightly according to usage.
- Put cpu_onproc into struct cpu_info alongside ci_curlwp (it is now ci_onproc).
- On x86, put some items in their own cache lines according to usage, like
the IPI bitmask and ci_want_resched.
Changes:
- membar_producer();
  *p = v;
    =>
  atomic_store_release(p, v);
  (Effectively like using membar_exit instead of membar_producer,
  which is what we should have been doing all along so that stores by
  the `reader' can't affect earlier loads by the writer, such as
  KASSERT(p->refcnt == 0) in the writer and atomic_inc(&p->refcnt) in
  the reader.)
- p = *pp;
  if (p != NULL) membar_datadep_consumer();
    =>
  p = atomic_load_consume(pp);
  (Only makes a difference on DEC Alpha. As long as lists generally
  have at least one element, this is not likely to make a big
  difference, and keeps the code simpler and clearer.)
No other functional change intended. While here, annotate each
synchronizing load and store with its counterpart in a comment.
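
The store-release / load-consume pairing described above can be sketched in
userland with C11 <stdatomic.h>. This is only an illustration: the C11 calls
stand in for the kernel's atomic_store_release() and atomic_load_consume(),
and struct node, head, node_publish() and node_read() are hypothetical names
that do not come from the change itself.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type, for illustration only. */
struct node {
	int value;
};

/* Shared pointer: written by the publisher, read by readers. */
static _Atomic(struct node *) head;

/*
 * Writer: initialise the node, then publish it with a store-release so
 * that the initialisation is ordered before the pointer becomes visible.
 * Pairs with the load-consume in node_read().
 */
void
node_publish(struct node *n, int value)
{
	n->value = value;
	atomic_store_explicit(&head, n, memory_order_release);
}

/*
 * Reader: fetch the pointer with a load-consume so that the dependent
 * load of n->value is ordered after the pointer load.
 * Pairs with the store-release in node_publish().
 * Returns fallback if nothing has been published yet.
 */
int
node_read(int fallback)
{
	struct node *n = atomic_load_explicit(&head, memory_order_consume);

	return n != NULL ? n->value : fallback;
}
```

Each side carries a comment naming its counterpart, in the same spirit as the
annotations mentioned above.
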
Don't call kpreempt_disable() / kpreempt_enable() to make sure we're not
preempted while using the value of curcpu(). Instead, observe the value of
l_ncsw before and after the check to see if we have been preempted. If
we have been preempted, then we need to retry the read.
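
A minimal userland sketch of the retry described above, assuming a counter
that is bumped on every context switch. struct lwp_model, check_unpreempted()
and the preempt_once helper are hypothetical names used only to model the
idea; they are not the kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an LWP with a context switch counter. */
struct lwp_model {
	atomic_ulong l_ncsw;	/* bumped on every context switch */
};

/*
 * Run check() and repeat it if the context switch count changed while it
 * ran, since the CPU-local value it examined may then be stale.
 */
bool
check_unpreempted(struct lwp_model *l, bool (*check)(void *), void *arg)
{
	unsigned long ncsw;
	bool result;

	do {
		ncsw = atomic_load(&l->l_ncsw);
		result = check(arg);
	} while (atomic_load(&l->l_ncsw) != ncsw);

	return result;
}

/* Example check that simulates one preemption on its first call. */
struct preempt_once {
	struct lwp_model *lwp;
	int calls;
};

bool
check_with_one_preemption(void *arg)
{
	struct preempt_once *p = arg;

	if (p->calls++ == 0)
		atomic_fetch_add(&p->lwp->l_ncsw, 1);
	return true;
}
```

With the example check, the first pass sees the counter change and the check
runs a second time before the result is returned.
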
- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.
- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.
- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.
- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).
Kernel build before and after:
119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system
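
The "do not requeue if already on the correct list and requeued recently"
rule from the LRU item above could look roughly like this. It is a sketch
under assumptions: the list names, struct vnode_model, lru_requeue() and the
tick-based clock are all hypothetical and only model the decision, not the
list manipulation.

```c
#include <stdbool.h>

/* Hypothetical list identifiers, for illustration only. */
enum lru_list { LRU_FREE, LRU_HOLD, LRU_VRELE };

/* Hypothetical per-vnode LRU bookkeeping. */
struct vnode_model {
	enum lru_list cur_list;		/* list the vnode is currently on */
	unsigned int last_requeue;	/* tick count at the last requeue */
};

/*
 * Requeue vp onto the wanted list unless it is already there and was
 * requeued less than one second (hz ticks) ago.  Unsigned subtraction
 * keeps the age comparison correct across tick counter wraparound.
 * Returns true if the vnode was (re)queued, false if the work was skipped.
 */
bool
lru_requeue(struct vnode_model *vp, enum lru_list want,
    unsigned int now, unsigned int hz)
{
	if (vp->cur_list == want && now - vp->last_requeue < hz)
		return false;	/* right list, touched recently: skip */

	vp->cur_list = want;
	vp->last_requeue = now;
	return true;
}
```

Skipping the requeue avoids touching the shared list heads on every
reference, which is where the contention would otherwise come from.
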
- Delete the per-entry lock, and borrow the associated vnode's v_interlock
instead. We need to acquire it during lookup anyway. We can revisit this
in the future but for now it's a stepping stone, and works within the
quite limited context of what we have (BSD namecache/lookup design).
- Implement an idea that Mateusz Guzik (mjg@FreeBSD.org) gave me. In
cache_reclaim(), we don't need to lock out all of the CPUs to garbage
collect entries. All we need to do is observe their locks unheld at least
once: then we know they are not in the critical section, and no longer
have visibility of the entries about to be garbage collected.
- The above makes it safe for sysctl to take only namecache_lock to get stats,
and we can remove all the crap dealing with per-CPU locks.
- For lockstat, make namecache_lock a static now that we have __cacheline_aligned.
- Avoid false sharing - don't write back to nc_hittime unless it has changed.
Put a comment in place explaining this. Pretty sure this was there in
2008/2009 but someone removed it (understandably, the code looks weird).
- Use a mutex to protect the garbage collection queue instead of atomics, and
adjust the low water mark up so that cache_reclaim() isn't doing so much
work at once.
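
The cache_reclaim() idea above (observe each per-CPU lock unheld at least
once rather than locking out every CPU) can be modelled as follows. This is
a rough sketch, assuming the entries to be collected have already been made
unreachable to new lookups. The flag array, NCPU_MODEL and the function
names are hypothetical, and plain atomic flags stand in for the real locks.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPU_MODEL 4	/* hypothetical CPU count */

/* True while the corresponding CPU is inside a lookup critical section. */
static atomic_bool cpu_lock_held[NCPU_MODEL];

/* A reader on the given CPU enters its critical section. */
void
reader_enter(size_t cpu)
{
	atomic_store(&cpu_lock_held[cpu], true);
}

/* A reader on the given CPU leaves its critical section. */
void
reader_exit(size_t cpu)
{
	atomic_store(&cpu_lock_held[cpu], false);
}

/*
 * Visit each CPU in turn and spin until its lock is seen unheld once.
 * No lock is ever taken here.  A reader that enters after being observed
 * can no longer find the unlinked entries, so once every CPU has been
 * visited the entries can be freed.
 * Returns the number of CPUs observed.
 */
size_t
wait_for_readers(void)
{
	size_t i;

	for (i = 0; i < NCPU_MODEL; i++) {
		while (atomic_load(&cpu_lock_held[i]))
			continue;	/* reader still in its section */
	}
	return i;
}
```
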
- sched_tick: cpu_need_resched is no longer the correct thing to do here.
All we need to do is OR the request into the local ci_want_resched.
- sched_resched_cpu: we need to set RESCHED_UPREEMPT even on softint LWPs,
especially in the !__HAVE_FAST_SOFTINTS case, because the LWP with the
LP_INTR flag could be running via softint_overlay() - i.e. it has been
temporarily borrowed from a user process, and it needs to notice the
resched after it has stopped running softints.
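
For the sched_tick item above, "OR the request into the local
ci_want_resched" might be sketched like this. The flag values, struct
cpu_model and want_resched_or() are hypothetical, and the use of an atomic
OR is an assumption made for the sketch; the real definitions live in the
kernel.

```c
#include <stdatomic.h>

/* Hypothetical flag values, for illustration only. */
#define RESCHED_KPREEMPT_MODEL	0x01u
#define RESCHED_UPREEMPT_MODEL	0x02u

/* Hypothetical stand-in for the relevant part of struct cpu_info. */
struct cpu_model {
	atomic_uint ci_want_resched;
};

/*
 * OR a reschedule request into the CPU's pending flags.
 * Returns the subset of flags that this call newly set, so a caller can
 * tell whether the request was already pending.
 */
unsigned int
want_resched_or(struct cpu_model *ci, unsigned int flags)
{
	unsigned int old = atomic_fetch_or(&ci->ci_want_resched, flags);

	return flags & ~old;
}
```
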
- Don't do the calculation if there is a CRC error.
- If we get any kind of error during a refresh, retry up to three times.
- Add event counters to report what's going on.
- Re-do the signalling to be a little more forgiving and efficient.
- If bus reset fails during probe, try a second time.
- Spread out kernel threads for many busses to avoid thundering herd effect.
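
The "retry up to three times" and event counter items above can be sketched
together. All names here are hypothetical, and the sketch assumes that
"retry up to three times" means one initial attempt plus at most three
retries.

```c
#define REFRESH_MAX_RETRIES 3	/* retries allowed after the first attempt */

/* Hypothetical event counters reporting what happened. */
struct refresh_stats {
	unsigned int attempts;	/* refreshes started */
	unsigned int errors;	/* refreshes that failed */
};

/*
 * Call refresh() until it succeeds or the retry budget is used up,
 * counting every attempt and every failure.
 * Returns 0 on success, otherwise the error from the last attempt.
 */
int
refresh_with_retry(int (*refresh)(void *), void *arg,
    struct refresh_stats *st)
{
	int error = 0;

	for (int attempt = 0; attempt <= REFRESH_MAX_RETRIES; attempt++) {
		st->attempts++;
		error = refresh(arg);
		if (error == 0)
			break;
		st->errors++;
	}
	return error;
}

/* Example refresh that fails while *arg (failures left) is positive. */
int
flaky_refresh(void *arg)
{
	int *failures_left = arg;

	if (*failures_left > 0) {
		(*failures_left)--;
		return -1;
	}
	return 0;
}
```
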
Since binutils 2.15, nm(1) cannot be used for character devices.
We worked around this by a local patch:
http://cvsweb.netbsd.org/bsdweb.cgi/src/gnu/dist/binutils/binutils/Attic/bucomm.c?r1=1.1.1.2&hideattic=0#rev1.2
With the recent update of binutils, 'nm /dev/ksyms' broke again.
This is due to a consistency check involving file size reported by
stat(2), which is always zero for character devices. So, skip this
check if file size is zero.