Allocate/Prefetch one cache-line ahead of the one we're about to deal with.
This reduces the chances of the cpu stalling while waiting for the cache
to flush a dirty line in order to satisfy the Allocate/Prefetch request.
registers in any trap/interrupt exception frame found.
- Slight tweak to more accurately detect the correct call-site when
looking for a function's prologue.
- use struct vm_page_md for attaching pv entries to struct vm_page
- change pseg_set()'s return value to indicate whether the spare page
was used as an L2 or L3 PTP.
- use a pool for pv entries instead of malloc().
- put PTPs on a list attached to the pmap so we can free them
more efficiently (by just walking the list) in pmap_destroy().
- use the new pmap_remove_all() interface to avoid flushing the cache and TLB
for each pmap_remove() that's done as we are tearing down an address space.
- in pmap_enter(), handle replacing an existing mapping more efficiently
than just calling pmap_remove() on it. also, skip flushing the
TSB and TLB if there was no previous mapping, since there can't be
anything we need to flush. also, preload the TSB if we're pre-setting
the mod/ref bits.
- allocate hardware contexts like the MIPS pmap:
allocate them all sequentially without reuse, then once we run out
just invalidate all user TLB entries and flush the entire L1 dcache.
- fix pmap_extract() for the case where the va is not page-aligned and
nothing is mapped there.
- fix calculation of TSB size. it was comparing physmem (which is
in units of pages) to constants that only make sense if they are
in units of bytes.
- avoid sleeping in pmap_enter(), instead let the caller do it.
- use pmap_kenter_pa() instead of pmap_enter() where appropriate.
- remove code to handle impossible cases in various functions.
- tweak asm code to pipeline a little better.
- remove many unnecessary spls and membars.
- lots of code cleanup.
- no doubt other stuff that I've forgotten.
the result of all this is that a fork+exit microbenchmark is 34% faster
and a fork+exec+exit microbenchmark is 28% faster.
cpu_idle() and new cpu_switch() to replace the old cpu_switch()
which did the lot. Runs leaner without overly blocking interrupts.
Includes cleanup of the RAS code to make use of callee-saved registers.
Benchmarks on DX4 @ 100MHz reveal a slight performance improvement
but probably not statistically signficant. More TBD to verify this.
Changes passed a pounding on Athlon @ 1GHz too.