better job now at keeping all physical CPUs busy, while using the extra
threads to help out. In particular, during preempt() if we're using SMT,
try to find a better CPU to run on and teleport curlwp there.
- Change the CPU topology stuff so it can work on asymmetric systems. This
mainly entails rearranging one of the CPU lists so it makes sense in all
configurations.
- Add a parameter to cpu_topology_set() to note that a CPU is "slow", for
systems that mix fast and slow CPUs, like the Rockchip RK3399. Extend the
SMT awareness to try to handle that situation too (keep fast CPUs busy,
use slow CPUs as helpers).
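
A rough user-space sketch of the fast/slow preference (struct cpu_model,
pick_cpu() and their fields are invented for illustration, not the kernel
interfaces): prefer an idle fast CPU, and fall back to an idle slow CPU
only as a helper.

    #include <stdbool.h>
    #include <stdio.h>

    struct cpu_model {
            int     id;
            bool    slow;           /* e.g. a little core on big.LITTLE */
            int     nrunning;       /* jobs currently assigned */
    };

    /* Prefer an idle fast CPU; fall back to an idle slow CPU as a helper. */
    static struct cpu_model *
    pick_cpu(struct cpu_model *cpus, int ncpu)
    {
            struct cpu_model *best = NULL;

            for (int i = 0; i < ncpu; i++) {
                    struct cpu_model *ci = &cpus[i];
                    if (ci->nrunning != 0)
                            continue;
                    if (!ci->slow)
                            return ci;      /* idle fast CPU: take it */
                    if (best == NULL)
                            best = ci;      /* remember an idle slow CPU */
            }
            return best;                    /* NULL if everything is busy */
    }

    int
    main(void)
    {
            struct cpu_model cpus[] = {
                    { 0, false, 1 }, { 1, false, 1 },       /* fast, busy */
                    { 2, true,  0 }, { 3, true,  0 },       /* slow, idle */
            };
            struct cpu_model *ci = pick_cpu(cpus, 4);

            if (ci != NULL)
                    printf("picked cpu%d (%s)\n", ci->id,
                        ci->slow ? "slow" : "fast");
            return 0;
    }
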
where curcpu() is defined as curlwp->l_cpu:
- mi_switch(): undo the ~2007ish optimisation to unlock curlwp before
calling cpu_switchto(). It's not safe to let other actors mess with the
LWP (in particular l->l_cpu) while it's still context switching. This
removes l->l_ctxswtch (see the sketch after this list).
- Move the LP_RUNNING flag into l->l_flag and rename to LW_RUNNING since
it's now covered by the LWP's lock.
- Ditch lwp_exit_switchaway() and just call mi_switch() instead. Everything
is in cache anyway so it wasn't buying much by trying to avoid saving old
state. This means cpu_switchto() will never be called with prevlwp ==
NULL.
- Remove some KERNEL_LOCK handling which hasn't been needed for years.
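
A loose user-space analogue of the locking change (struct task and its
functions are made up; the kernel uses the LWP's scheduler lock, not a
pthread mutex): keep the object locked across the whole switch, rather
than dropping the lock early and publishing a separate "still switching"
flag (the old l_ctxswtch) that other threads then have to spin on.

    #include <pthread.h>

    struct task {
            pthread_mutex_t lock;
            int             cpu;            /* analogous to l->l_cpu */
            int             running;        /* analogous to LW_RUNNING */
    };

    /*
     * The "switch": all per-task state changes happen inside one critical
     * section, so no other actor can observe or modify a half-switched task.
     * (The scheme this replaces, roughly: set an "in flight" flag, unlock,
     * switch, clear the flag, with other code spinning on that flag.)
     */
    void
    task_switch(struct task *t, int newcpu)
    {
            pthread_mutex_lock(&t->lock);
            t->running = 0;
            t->cpu = newcpu;
            t->running = 1;
            pthread_mutex_unlock(&t->lock);
    }

    /* Another actor (e.g. a migration request) just takes the same lock. */
    int
    task_get_cpu(struct task *t)
    {
            pthread_mutex_lock(&t->lock);
            int cpu = t->cpu;
            pthread_mutex_unlock(&t->lock);
            return cpu;
    }
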
on a different CPU in the same CPU core as the parent, because both parent
and child share lots of state. (I want to come back later and do
something different for _lwp_create() and maybe execve().)
- Remove the runqueue evcnts, which are racy and impose a penalty for very
little payoff.
- Break out of the loop in sched_takecpu() as soon as we have a CPU that can
run the LWP. There's no need to look at all CPUs.
- Clear SPCF_IDLE in sched_enqueue() so we know sooner that the CPU is not idle.
- Add some simple SMT awareness. Try to keep as many different cores loaded
up with jobs as possible before we start to make use of SMT. Have SMT
"secondaries" function more as helpers to their respective primaries.
This isn't enforced; it's an effort at herding/encouraging things to go in
the right direction (for one thing, because we support processor sets and
those can be configured any way you like). Seen at work with "top -1".
- Don't allow sched_balance() to run any faster than the clock interrupt,
because it causes terrible cache contention. Need to look into this in
more detail because it's still not ideal.
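
A minimal sketch of the rate limit, with invented names (the caller passes
the current tick of a counter bumped by the clock interrupt): the balancer
only does its expensive work if the tick has advanced since its last run.

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic unsigned int balance_lasttick;

    /* Return true at most roughly once per clock tick. */
    bool
    balance_due(unsigned int now)
    {
            /*
             * Record that balancing ran at tick `now`.  If that tick was
             * already recorded, someone balanced during this tick: skip.
             */
            return atomic_exchange(&balance_lasttick, now) != now;
    }
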
circular list of peer CPUs in other packages, so we might scroll through
them in the scheduler when looking to distribute or steal jobs.
- Fold the run queue data structure into spc_schedstate, which makes
kern_runq.c a far more pleasant place to work (a struct sketch follows
below).
- Remove the code in sched_nextlwp() that tries to steal jobs from other
CPUs. It's not needed, because we do the very same thing in the idle LWP
anyway. Outside the VM system this was one of the main causes of L3
cache misses I saw during builds. On my machine, this change yields a
60%-70% drop in time on the "hackbench" benchmark (there's clearly a bit
more going on here, but basically being less aggressive helps).
This was a very nice win in my tests on a 48 CPU box.
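
A before/after struct sketch of the folding; the field and type names and
the priority count are invented here, not the real spc_schedstate layout.

    #include <stdint.h>

    #define PRI_COUNT       224     /* assumption for the example */

    struct example_runqueue {
            uint32_t        bitmap[PRI_COUNT / 32]; /* non-empty queues */
            void            *queues[PRI_COUNT];     /* list head per prio */
            unsigned int    count;                  /* total runnable jobs */
    };

    /* Old shape: scheduler-private data hidden behind a pointer. */
    struct schedstate_old {
            void    *spc_sched_info;        /* separately kmem-allocated */
    };

    /* New shape: the run queue is simply part of the per-CPU state. */
    struct schedstate_new {
            struct example_runqueue spc_queue;
            unsigned int            spc_flags;
    };
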
- Reorganise cpu_data slightly according to usage.
- Put cpu_onproc into struct cpu_info alongside ci_curlwp (now is ci_onproc).
- On x86, put some items in their own cache lines according to usage, like
the IPI bitmask and ci_want_resched (see the alignment sketch after this
list).
- sched_tick: cpu_need_resched is no longer the correct thing to do here.
All we need to do is OR the request into the local ci_want_resched.
- sched_resched_cpu: we need to set RESCHED_UPREEMPT even on softint LWPs,
especially in the !__HAVE_FAST_SOFTINTS case, because the LWP with the
LP_INTR flag could be running via softint_overlay() - i.e. it has been
temporarily borrowed from a user process, and it needs to notice the
resched after it has stopped running softints.
- Adapt to cpu_need_resched() changes. Avoid lost & duplicate IPIs and ASTs.
sched_resched_cpu() and sched_resched_lwp() contain the logic for this
(see the resched sketch after this list).
- Changes for LSIDL to make the locking scheme match the intended design.
- Reduce lock contention and false sharing further.
- Numerous small bugfixes, including some corrections for SCHED_FIFO/RT.
- Use setrunnable() in more places, and merge cut & pasted code.
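
An illustrative C11 sketch of the cache-line separation; the struct, the
field names and the 64-byte line size are assumptions, not the real
struct cpu_info layout.  Fields that other CPUs write (resched and IPI
requests) get their own lines so they don't ping-pong the line holding
read-mostly data.

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE_SIZE 64      /* assumption for the example */

    struct percpu_example {
            /* Read-mostly data, grouped together. */
            void            *curlwp;
            uint64_t        idle_count;

            /* Written by other CPUs: isolated on its own cache line. */
            alignas(CACHE_LINE_SIZE) uint32_t want_resched;

            /* Also remotely written, also isolated. */
            alignas(CACHE_LINE_SIZE) uint32_t ipi_pending;
    };
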
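
And a sketch of the resched bookkeeping described above (the flag values
and the fake_cpu/send_ipi names are invented; only RESCHED_UPREEMPT comes
from the log): OR the request into the per-CPU word and send the IPI only
when this call set a bit that wasn't already pending, so the same request
is never signalled twice.

    #include <stdatomic.h>
    #include <stdio.h>

    #define RESCHED_UPREEMPT        0x01    /* user preemption wanted */
    #define RESCHED_KPREEMPT        0x02    /* kernel preemption wanted */

    struct fake_cpu {
            _Atomic unsigned int    want_resched;
            int                     id;
    };

    static void
    send_ipi(struct fake_cpu *ci)
    {
            /* Stand-in for the real cross-call / IPI mechanism. */
            printf("IPI -> cpu%d\n", ci->id);
    }

    static void
    resched_cpu(struct fake_cpu *ci, unsigned int flags)
    {
            unsigned int old = atomic_fetch_or(&ci->want_resched, flags);

            /* Only notify if we added a bit that wasn't already set. */
            if ((old & flags) != flags)
                    send_ipi(ci);
    }

    int
    main(void)
    {
            struct fake_cpu cpu1 = { .id = 1 };

            resched_cpu(&cpu1, RESCHED_UPREEMPT);   /* sends an IPI */
            resched_cpu(&cpu1, RESCHED_UPREEMPT);   /* already set: no IPI */
            return 0;
    }
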
kmem_alloc() with KM_SLEEP
kmem_zalloc() with KM_SLEEP
percpu_alloc()
pserialize_create()
psref_class_create()
All of these paths include an assertion that the allocation has not failed,
so callers should not assert that again.
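
The call-site pattern, modelled in user space (xalloc() is a made-up
stand-in for the KM_SLEEP paths above, aborting where the kernel would
sleep and retry): the single assertion lives inside the allocator, so the
callers do not repeat it.

    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sleeping-style allocator: never returns NULL to the caller. */
    static void *
    xalloc(size_t len)
    {
            void *p = malloc(len);

            assert(p != NULL);      /* the single, central check */
            memset(p, 0, len);
            return p;
    }

    struct widget {
            int     state;
    };

    static struct widget *
    widget_create(void)
    {
            /* No NULL check or extra assertion here: xalloc() cannot fail. */
            struct widget *w = xalloc(sizeof(*w));

            w->state = 1;
            return w;
    }
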
for averages. Otherwise the decisions can be heavily biased by rounding
errors.
Add sysctl kern.sched_average_weight to change the weight of
historical data; the default is 50%.
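
A minimal sketch of the fixed-point weighted average, with invented names
and an assumed 8-bit fraction; with plain integer averaging the same loop
would stay stuck at zero because every step rounds down to nothing.

    #include <stdio.h>

    #define FIXPT_SHIFT     8                       /* 8 fractional bits */
    #define TO_FIXPT(x)     ((x) << FIXPT_SHIFT)

    static unsigned int average_weight = 50;        /* % given to history */

    /* avg and the result are fixed point; sample is a plain integer. */
    static unsigned int
    mix_average(unsigned int avg, unsigned int sample)
    {
            return (avg * average_weight +
                TO_FIXPT(sample) * (100 - average_weight)) / 100;
    }

    int
    main(void)
    {
            unsigned int avg = 0;

            /* Feed in a constant load of 1; the average approaches 1.0. */
            for (int i = 0; i < 16; i++)
                    avg = mix_average(avg, 1);
            printf("avg = %u/%u\n", avg, 1U << FIXPT_SHIFT);
            return 0;
    }
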
char *cpu_name(struct cpu_info *);
and use it when setting up the runq event counters, avoiding an 8 byte
kmem(4) allocation for each CPU. There are more places the CPU name is
used that can be converted to using this new interface, but that can
and will be done as future work.
As discussed with rmind.
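
A small user-space model of the point (all names invented): format each
CPU's name once into storage owned by the CPU structure and hand out a
pointer, instead of allocating and formatting a fresh copy at every call
site that needs "cpuN".

    #include <stdio.h>

    #define MAXCPUS         8
    #define CPUNAMELEN      16

    struct fake_cpu_info {
            int     index;
            char    name[CPUNAMELEN];       /* filled in once at attach */
    };

    static struct fake_cpu_info cpus[MAXCPUS];

    static const char *
    cpu_name_of(struct fake_cpu_info *ci)
    {
            return ci->name;                /* no per-call allocation */
    }

    int
    main(void)
    {
            for (int i = 0; i < MAXCPUS; i++) {
                    cpus[i].index = i;
                    snprintf(cpus[i].name, sizeof(cpus[i].name), "cpu%d", i);
            }
            /* e.g. attach event counters using the shared name string */
            printf("counter group: %s\n", cpu_name_of(&cpus[2]));
            return 0;
    }
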
- Addresses the issue described in PR/38828.
- Some simplification in threading and sleepq subsystems.
- Eliminates pmap_collect() and, as a side effect, allows pmap optimisations.
- Eliminates XS_CTL_DATA_ONSTACK in scsipi code.
- Avoids a few scans of the LWP list and thus potentially long holds of
proc_lock.
- Cuts ~1.5k lines of code. Reduces amd64 kernel size by ~4k.
- Removes __SWAP_BROKEN cases.
Tested on x86, mips, acorn32 (thanks <mpumford>) and partly tested on
acorn26 (thanks to <bjh21>).
Discussed on <tech-kern>, reviewed by <ad>.
- Change minimal time-quantum to ~20 ms.
- Thus remove unneeded pool in M2, and unused sched_lwp_exit().
- Do not increase l_slptime twice for SCHED_4BSD (regression fix).
mi_switch(), migration for LSONPROC is now performed via idle loop.
Handles/fixes on-CPU case in lwp_migrate(), misc.
Closes PR/38169, idea of migration via idle loop by Andrew Doran.