components which are too bloaty to be included in rumpkern (where
bloaty means "can easily be left out without anyone missing them"),
but generally do not require the support of the dev/fs/net factions
to function. As the first one, add ksems. librumpcrypto will
migrate here too once I get my timeslice to deal with the setlists,
as, most likely, will tty support.
What I really wanted from this commit was support for proc_specificdata.
TODO: make creating a new process actually use kern_proc and
maybe even add an interface which starts a process with
"any pid you don't like"
was autoload *not* working with an alternate path. This revision
makes the code twice as good in the sense that it now also works
in case you *do* want it to work.
through an in-rumpkernel hypermemory allocator which knows it should
kick the pagedaemon and block in case ``waitok'' memory allocation
fails.
This allows us to recover from some out-of-memory situations.
Realworld'istically speaking (as opposed to whatever "should be"
theory), these OOM situations will happen extremely rarely, if
ever, when our hypervisor is a regular process. Speculatively,
though, this should be useful for other types of hosts.
issues remaining:
* the hypervisor does not know how to reclaim kernel memory (and
for the reason I stated above, I'm not sure if it makes sense
to teach the current implementation about that)
* vfs memory (buffers, vm object pages etc.) is not reclaimed
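To sketch the allocator's retry loop (host_malloc(), pagedaemon_kick()
and pagedaemon_waitspace() are stand-in names here, not the actual
hypercall interface):

    #include <stdbool.h>
    #include <stddef.h>

    void *host_malloc(size_t);          /* hypercall to the host */
    void pagedaemon_kick(void);         /* wake the pagedaemon */
    void pagedaemon_waitspace(void);    /* block until memory is freed */

    void *
    hyper_alloc(size_t len, bool waitok)
    {
            void *rv;

            for (;;) {
                    if ((rv = host_malloc(len)) != NULL)
                            return rv;
                    if (!waitok)
                            return NULL;    /* caller handles failure */

                    /* waitok: reclaim and retry instead of failing */
                    pagedaemon_kick();
                    pagedaemon_waitspace();
            }
    }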
Since they are practically never used (only when prehistoric code
uses simple_lock()), their efficiency doesn't matter that much and
we can simply adapt the versions from x86 lock.h.
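The primitive in question is just a test-and-set spinlock; in sketch
form, with a compiler builtin standing in for the inline assembly the
real lock.h uses:

    typedef volatile unsigned char __cpu_simple_lock_t;
    #define __SIMPLELOCK_UNLOCKED   0
    #define __SIMPLELOCK_LOCKED     1

    static inline void
    __cpu_simple_lock(__cpu_simple_lock_t *lp)
    {
            /* spin on test-and-set; efficiency is a non-issue here */
            while (__sync_lock_test_and_set(lp, __SIMPLELOCK_LOCKED))
                    continue;
    }

    static inline void
    __cpu_simple_unlock(__cpu_simple_lock_t *lp)
    {
            __sync_lock_release(lp);    /* stores __SIMPLELOCK_UNLOCKED */
    }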
access. The old scheduler had a global freelist which caused a
cache crisis with multiple host threads trying to schedule a virtual
CPU simultaneously.
The rump scheduler is different from a normal thread scheduler, so
it has different requirements. First, we schedule a CPU for a
thread (which we get from the host scheduler) instead of scheduling
a thread onto a CPU. Second, scheduling points are at every
entry/exit to/from the rump kernel, including (but not limited to)
syscall entry points and hypercalls. This means scheduling happens
a lot more frequently than in a normal kernel.
For every lwp, cache the previously used CPU. When scheduling,
attempt to reuse the same CPU. If we get it, we can use it directly
without any memory barriers or expensive locks. If the CPU is
taken, migrate. Use a lock/wait only in the slowpath. Be very
wary of walking the entire CPU array because that does not lead to
a happy cacher.
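In sketch form (the names are illustrative, not the actual
scheduler.c interfaces):

    #include <stdbool.h>

    struct rumpcpu;
    struct lwp {
            struct rumpcpu *l_prevcpu;  /* CPU used on previous entry */
    };

    bool cpu_trytake(struct rumpcpu *);     /* nonblocking claim */
    struct rumpcpu *migrate(struct lwp *);  /* slowpath: lock/wait */

    struct rumpcpu *
    schedule_cpu(struct lwp *l)
    {
            struct rumpcpu *rc = l->l_prevcpu;

            /* fastpath: our previous CPU is free, take it directly */
            if (cpu_trytake(rc))
                    return rc;

            /* slowpath: CPU taken, migrate and cache the new CPU */
            rc = migrate(l);
            l->l_prevcpu = rc;
            return rc;
    }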
The migration algorithm could probably benefit from improved
heuristics and tuning. Even as such, with the new scheduler an
application which has two threads making rlimit syscalls in a tight
loop experiences almost 400% speedup. The exact speedup is difficult
to pinpoint, though, since the old scheduler caused very jittery
results due to cache contention. Also, the rump version is now
70% faster than the counterpart which calls the host kernel.
unicpu configurations (i.e. RUMP_NCPU==1), but are massively faster
than the multiprocessor versions since the fast path does not have
to perform any cache coherent operations. _Applications_ with
lock-happy kernel paths, i.e. _not_ lock microbenchmarks, measure
up to tens of percent of speedup on my Core2 Duo. Any globally
atomic state required by normal locks/atomic ops implies a hideous
speed penalty even for the fast path.
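To sketch why the unicpu fast path is cheap (hypothetical upmtx
names): with one virtual CPU the running thread implicitly excludes
everyone else, so taking a free mutex is a plain load and store.

    struct lwp;
    extern struct lwp *curlwp;          /* currently running thread */

    struct upmtx {
            struct lwp *upm_owner;      /* NULL when the mutex is free */
    };

    void upmutex_block(struct upmtx *); /* slowpath: sleep until free */

    void
    upmutex_enter(struct upmtx *upm)
    {
            /*
             * We hold the one and only virtual CPU, so nothing can
             * race us here: no atomic ops, no cache coherency traffic.
             */
            if (upm->upm_owner == NULL) {
                    upm->upm_owner = curlwp;
                    return;
            }

            /* contended: unschedule and sleep; acquisition happens
             * in the slowpath */
            upmutex_block(upm);
    }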
While this requires a unicpu configuration, it should be noted that
we are talking about a virtual unicpu configuration. The host can
have as many processors as it desires, and the speed benefit of
virtual unicpu is still there. It's pretty obvious that, in terms
of scalability, simple workload partitioning and replication into
multiple kernels wins hands down over complicated locking or
lockless algorithms which depend on globally atomic state.
interlock. This is applicable in cases where the actual interlock
is the CPU the currently running thread is scheduled on. Borrowing
the scheduler lock as the mutex mandated by pthread_cond_wait()
does away with the need for an additional mutex. This both
optimizes runtime execution and simplifies code, as the extra lock
typically led to quite some trickery to avoid the dungeon collapsing
due to zaps from the wand of deadlock.
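At the hypervisor level the pattern looks roughly like this
(hypothetical vcpu names; the real code lives in the rumpuser layer):

    #include <pthread.h>

    struct vcpu {
            pthread_mutex_t vc_schedmtx;    /* scheduler lock */
            pthread_cond_t  vc_cv;          /* event waited for */
            int             vc_event;
    };

    /*
     * Waiter: we are scheduled on vc, so the scheduler lock is
     * already ours to take; no separate condvar mutex needed.
     */
    void
    vcpu_waitevent(struct vcpu *vc)
    {
            pthread_mutex_lock(&vc->vc_schedmtx);
            while (!vc->vc_event)
                    pthread_cond_wait(&vc->vc_cv, &vc->vc_schedmtx);
            vc->vc_event = 0;
            pthread_mutex_unlock(&vc->vc_schedmtx);
    }

    void
    vcpu_sendevent(struct vcpu *vc)
    {
            pthread_mutex_lock(&vc->vc_schedmtx);
            vc->vc_event = 1;
            pthread_cond_signal(&vc->vc_cv);
            pthread_mutex_unlock(&vc->vc_schedmtx);
    }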