Since they are practically never used (only when prehistoric code
uses simple_lock()), their efficiency doesn't matter that much and
we can simply adapt the versions from x86 lock.h.
Improve the CPU scheduler for a host MP system with multithreaded
access. The old scheduler had a global freelist which caused a
cache crisis with multiple host threads trying to schedule a virtual
CPU simultaneously.
The rump scheduler is different from a normal thread scheduler, so
it has different requirements. First, we schedule a CPU for a
thread (which we get from the host scheduler) instead of scheduling
a thread onto a CPU. Second, scheduling points are at every
entry/exit to/from the rump kernel, including (but not limited to)
syscall entry points and hypercalls. This means scheduling happens
a lot more frequently than in a normal kernel.
For every lwp, cache the previously used CPU. When scheduling,
attempt to reuse the same CPU. If we get it, we can use it directly
without any memory barriers or expensive locks. If the CPU is
taken, migrate. Use a lock/wait only in the slowpath. Be very
wary of walking the entire CPU array because that does not lead to
a happy cacher.
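To make the fast/slow path split concrete, here is a minimal sketch.
The struct names, fields, function names, the GCC __sync builtins and
the migration policy are all illustrative assumptions, not the actual
scheduler code; the real fast path is cheaper still, since it avoids
even the barrier implied by the CAS used below.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NVCPU 2

struct rcpu {
    pthread_mutex_t rc_mtx;     /* slowpath wait lock */
    pthread_cond_t  rc_cv;      /* slowpath wait condvar */
    volatile int    rc_busy;    /* 1 while a thread owns this vcpu */
} vcpus[NVCPU];

struct lwp {
    struct rcpu *l_prevcpu;     /* vcpu used on the previous entry */
    struct rcpu *l_cpu;         /* vcpu we are currently scheduled on */
};

void
rcpu_init(void)
{
    for (int i = 0; i < NVCPU; i++) {
        pthread_mutex_init(&vcpus[i].rc_mtx, NULL);
        pthread_cond_init(&vcpus[i].rc_cv, NULL);
    }
}

static bool
rcpu_tryget(struct rcpu *rc)
{
    return __sync_bool_compare_and_swap(&rc->rc_busy, 0, 1);
}

void
rump_schedule_cpu(struct lwp *l)
{
    struct rcpu *rc = l->l_prevcpu;

    /* Fast path: the vcpu we used last time is free, so take it. */
    if (rc != NULL && rcpu_tryget(rc)) {
        l->l_cpu = rc;
        return;
    }

    /*
     * Slow path: migrate.  Try the other vcpus once and, only if all
     * of them are taken, sleep on one.  A smarter heuristic would
     * avoid walking the whole array (the "unhappy cacher" case).
     */
    for (int i = 0; i < NVCPU; i++) {
        if (rcpu_tryget(&vcpus[i])) {
            rc = &vcpus[i];
            goto out;
        }
    }
    rc = &vcpus[0];
    pthread_mutex_lock(&rc->rc_mtx);
    while (!rcpu_tryget(rc))
        pthread_cond_wait(&rc->rc_cv, &rc->rc_mtx);
    pthread_mutex_unlock(&rc->rc_mtx);
 out:
    l->l_prevcpu = l->l_cpu = rc;   /* remember for the next entry */
}

void
rump_unschedule_cpu(struct lwp *l)
{
    struct rcpu *rc = l->l_cpu;

    l->l_cpu = NULL;
    __sync_lock_release(&rc->rc_busy);  /* rc_busy = 0, release semantics */

    pthread_mutex_lock(&rc->rc_mtx);    /* wake a possible slowpath sleeper */
    pthread_cond_signal(&rc->rc_cv);
    pthread_mutex_unlock(&rc->rc_mtx);
}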
The migration algorithm could probably benefit from improved
heuristics and tuning. Even so, with the new scheduler an
application which has two threads making rlimit syscalls in a tight
loop experiences an almost 400% speedup. The exact speedup is difficult
to pinpoint, though, since the old scheduler caused very jittery
results due to cache contention. Also, the rump version is now
70% faster than the counterpart which calls the host kernel.
These locks are used only in unicpu configurations (i.e. RUMP_NCPU==1),
but are massively faster than the multiprocessor versions since the
fast path does not have to perform any cache coherency operations.
_Applications_ with lock-happy kernel paths, i.e. _not_ lock
microbenchmarks, measure up to tens of percent speedup on my Core2
Duo. Any globally atomic state required by normal locks/atomic ops
implies a hideous speed penalty even on the fast path.
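A rough sketch of why the unicpu fast path can be this cheap follows.
The ukmutex_* names and the lwp handling are invented for illustration
and are not the real rump locking interface; the point is the
invariant that with RUMP_NCPU==1 only the thread holding the single
virtual CPU executes rump kernel code, so lock state needs only plain
loads and stores.

#include <sched.h>
#include <stddef.h>

struct lwp;                     /* opaque for the sketch */

struct ukmutex {
    struct lwp *mtx_owner;      /* NULL when the mutex is free */
};

/*
 * Safe only because, with a single virtual CPU, no two threads run
 * rump kernel code concurrently: mtx_owner is read and written with
 * plain, non-atomic memory operations and no barriers.
 */
void
ukmutex_enter(struct ukmutex *mtx, struct lwp *l)
{
    while (mtx->mtx_owner != NULL) {
        /*
         * Slow path: the owner went to sleep holding the mutex.
         * Real code would release the virtual CPU and sleep until
         * ukmutex_exit() wakes it (otherwise the owner could never
         * run again); yielding the host thread is only a stand-in.
         */
        sched_yield();
    }
    mtx->mtx_owner = l;         /* fast path: one plain store, no atomics */
}

void
ukmutex_exit(struct ukmutex *mtx)
{
    mtx->mtx_owner = NULL;      /* plain store; waking sleepers omitted */
}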
While this requires a unicpu configuration, it should be noted that
we are talking about a virtual unicpu configuration. The host can
have as many processors as it desires, and the speed benefit of
virtual unicpu is still there. It's pretty obvious that in terms
of scalability simple workload partitioning and replication into
multiple kernels wins hands down over complicated locking or
lockless algorithms which depend on globally atomic state.
Support waiting on a condition variable with the CPU scheduler lock
as the interlock. This is applicable in cases where the actual interlock
is the CPU the currently running thread is scheduled on. Borrowing
the scheduler lock as the mutex mandated by pthread_cond_wait()
does away with the need for an additional mutex. This both optimizes
runtime execution and simplifies code, as the extra lock typically
leads to quite some trickery to avoid the dungeon collapsing due
to zaps from the wand of deadlock.
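Roughly, the idea looks as follows; cv_wait_schedlock() and
curcpu_sched() are invented names for this sketch, not the real rump
interfaces.

#include <pthread.h>

/*
 * Per-vcpu scheduler state; rc_mtx is the scheduler lock that the
 * currently running thread already owns by virtue of being scheduled
 * on this CPU.
 */
struct rcpu_sched {
    pthread_mutex_t rc_mtx;
};

static struct rcpu_sched onecpu = { PTHREAD_MUTEX_INITIALIZER };

/* stand-in accessor for "the CPU this thread is scheduled on" */
static struct rcpu_sched *
curcpu_sched(void)
{
    return &onecpu;
}

/*
 * Wait on cv using the scheduler lock of the current CPU as the
 * interlock.  Because the waited-on condition is already serialized
 * by that CPU, no separate mutex (and none of the lock-ordering
 * gymnastics that come with it) is needed.  The caller must hold the
 * scheduler lock on entry, as pthread_cond_wait() requires.
 */
void
cv_wait_schedlock(pthread_cond_t *cv)
{
    pthread_cond_wait(cv, &curcpu_sched()->rc_mtx);
}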
Speed up /dev node creation, cutting bootstrap time down: 14ms -> 12ms.
Further hashing etc. did not seem to have
any noticeable effect.
(without /dev node creation bootstrap time is 8ms, so it's still
the bottleneck)
Lower desiredvnodes: the standard kernel default reserves a large
amount of memory, which is not desirable in a rump kernel where the
typical usage is minimal.
Maybe I should write a few lines to autoscale desiredvnodes up to
a hard limit after the soft limit is reached?
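A sketch of that autoscaling idea; the limits and the
vnode_can_alloc() hook are purely illustrative, not existing code.

/* small initial soft limit instead of the memory-hungry default */
static int desiredvnodes = 64;
static const int hardvnodes = 4096;     /* illustrative hard cap */
static int numvnodes;                   /* vnodes currently allocated */

/* return nonzero if the caller may allocate another vnode */
int
vnode_can_alloc(void)
{
    if (numvnodes < desiredvnodes)
        return 1;

    /* soft limit hit: scale it up, but never past the hard cap */
    if (desiredvnodes < hardvnodes) {
        desiredvnodes *= 2;
        if (desiredvnodes > hardvnodes)
            desiredvnodes = hardvnodes;
        return 1;
    }
    return 0;   /* hard cap reached: recycle an existing vnode instead */
}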