instead.
With this change, we no longer need to save the current interrupt level
in the switchframe. This is no great loss since both cpu_switch and
cpu_switchto are always called at splsched, so the process' spl is
effectively saved somewhere in the callstack.
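For illustration, a hypothetical switchframe after this change would carry
only callee-saved state (the real layout lives in the port's frame.h; the
field names here are invented):

    /* Hypothetical sketch only, not the actual arm32 switchframe. */
    struct switchframe {
            unsigned int    sf_r4;  /* callee-saved registers */
            unsigned int    sf_r5;
            unsigned int    sf_r6;
            unsigned int    sf_r7;
            unsigned int    sf_pc;  /* address at which to resume */
    };                              /* note: no sf_spl member */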
This fixes an evbarm problem reported by Allen Briggs:
an lwp gets into sa_switch -> mi_switch with newl != NULL
when it's the last element on the runqueue, so it
hits the second branch of:
    if (newl == NULL) {
            retval = cpu_switch(l, NULL);
    } else {
            remrunqueue(newl);
            cpu_switchto(l, newl);
            retval = 0;
    }
mi_switch calls remrunqueue() and cpu_switchto()
cpu_switchto unlocks the sched lock
cpu_switchto drops CPU priority
softclock is received
schedcpu is called from softclock
schedcpu hits the first if () {} block here:
    if (l->l_priority >= PUSER) {
            if (l->l_stat == LSRUN &&
                (l->l_flag & L_INMEM) &&
                (l->l_priority / PPQ) != (l->l_usrpri / PPQ)) {
                    remrunqueue(l);
                    l->l_priority = l->l_usrpri;
                    setrunqueue(l);
            } else
                    l->l_priority = l->l_usrpri;
    }
Since mi_switch has already run remrunqueue(), the LWP has been
removed from the run queue but not yet put back on any queue, so
this remrunqueue() panics.
- Use the "clz" instruction to pick a run-queue, instead of using the
ffs-by-table-lookup method.
- Use strd instead of stmia where possible.
- Use multiple ldr instructions instead of ldmia where possible.
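A minimal C sketch of the clz-based pick (illustrative only; the real code
is assembly, and the function name here is invented). __builtin_clz()
compiles to a single clz on CPUs that have it, and the x & -x trick
isolates the lowest set bit, so the lookup table goes away:

    #include <stdint.h>

    /*
     * Pick the lowest set bit (the highest-priority run-queue) of the
     * run-queue bitmask.  clz on the isolated lowest bit replaces the
     * ffs-by-table-lookup method.  Caller guarantees whichqs != 0,
     * since __builtin_clz(0) is undefined.
     */
    static inline int
    runqueue_pick(uint32_t whichqs)
    {
            return 31 - __builtin_clz(whichqs & -whichqs);
    }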
the test, as spl0 is actually a macro for splx(0). The code now calls
splx(0).
(Note: building with the #ifdef fixed caused the build to fail on a
GENERIC acorn32 kernel.)
* Define a new "MMU type", ARM_MMU_SA1. While the SA-1's MMU is basically
compatible with the generic, the SA-1 cache does not have a write-through
mode, and it is useful to know have an indication of this.
* Add a new PMAP_NEEDS_PTE_SYNC indicator, and try to evaluate it at
compile time. We evaluate it like so:
- If SA-1-style MMU is the only type configured -> 1
- If SA-1-style MMU is not configured -> 0
- Otherwise, defer to a run-time variable.
If PMAP_NEEDS_PTE_SYNC might evaluate to true (SA-1 only or run-time
check), then we also define PMAP_INCLUDE_PTE_SYNC so that e.g. assembly
code can include the necessary run-time support. PMAP_INCLUDE_PTE_SYNC
largely replaces the ARM32_PMAP_NEEDS_PTE_SYNC manual setting Steve
included with the original new pmap.
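A sketch of how that evaluation might be expressed (only ARM_MMU_SA1,
PMAP_NEEDS_PTE_SYNC, and PMAP_INCLUDE_PTE_SYNC are from the text;
ARM_MMU_OTHERS stands in for the count of non-SA-1 MMU classes configured):

    #if ARM_MMU_SA1 == 1 && ARM_MMU_OTHERS == 0
    #define PMAP_NEEDS_PTE_SYNC     1       /* SA-1 style MMU is the only type */
    #define PMAP_INCLUDE_PTE_SYNC
    #elif ARM_MMU_SA1 == 0
    #define PMAP_NEEDS_PTE_SYNC     0       /* no SA-1 style MMU configured */
    #else
    #define PMAP_NEEDS_PTE_SYNC     pmap_needs_pte_sync     /* run-time check */
    #define PMAP_INCLUDE_PTE_SYNC
    #endif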
* In the new pmap, make pmap_pte_init_generic() check to see if the CPU
has a write-back cache. If so, init the PT cache mode to C=1,B=0 to get
write-through mode. Otherwise, init the PT cache mode to C=1,B=1. (The
overall pattern is sketched after this list.)
* Add a new pmap_pte_init_arm8(). Old pmap, same as generic. New pmap,
sets page table cacheability to 0 (ARM8 has a write-back cache, but
flushing it is quite expensive).
* In the new pmap, make pmap_pte_init_arm9() reset the PT cache mode to
C=1,B=0, because the write-back check in generic gets it wrong for ARM9:
we use write-through mode all the time on ARM9 right now. (What this
really tells me is that the test for write-through cache is less than
perfect, but we can fix that later.)
* Add a new pmap_pte_init_sa1(). Old pmap, same as generic. New pmap,
does generic initialization, then resets page table cache mode to
C=1,B=1, since C=1,B=0 does not produce write-through on the SA-1.
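The pattern across these pmap_pte_init_*() variants might be sketched as
follows (a hypothetical rendering; names and signatures are invented, and
C/B are the cacheable/bufferable bits of an L2 descriptor):

    #define PT_C    0x2             /* C bit: cacheable */
    #define PT_B    0x1             /* B bit: bufferable */

    static int pt_cache_mode;       /* cache mode used for page tables */

    static void
    pte_init_generic_sketch(int cache_is_writeback)
    {
            if (cache_is_writeback)
                    pt_cache_mode = PT_C;           /* C=1,B=0: write-through */
            else
                    pt_cache_mode = PT_C | PT_B;    /* C=1,B=1 */
    }

    static void
    pte_init_sa1_sketch(void)
    {
            pte_init_generic_sketch(1);
            /* C=1,B=0 is not write-through on SA-1, so override it. */
            pt_cache_mode = PT_C | PT_B;
    }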
Some features of the new pmap are:
- It allows L1 descriptor tables to be shared efficiently between
multiple processes. A typical "maxusers 32" kernel, where NPROC is set
to 532, requires 35 L1s. A "maxusers 2" kernel runs quite happily
with just 4 L1s. This completely solves the problem of running out
of contiguous physical memory for allocating new L1s at runtime on a
busy system.
- Much improved cache/TLB management "smarts". This change ripples
out to encompass the low-level context switch code, which is also
much smarter about when to flush the cache/TLB, and when not to.
- Faster allocation of L2 page tables and associated metadata thanks,
in part, to the pool_cache enhancements recently contributed to
NetBSD by Wasabi Systems.
- Faster VM space teardown due to accurate reference tracking of L2
page tables.
- Better/faster cache-alias tracking.
The new pmap is enabled by adding options ARM32_PMAP_NEW to the kernel
config file, and making the necessary changes to the port-specific
initarm() function. Several ports have already been converted and will
be committed shortly.
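For example, on a converted port the option is a one-line kernel config
addition (the matching initarm() changes are port-specific and not shown):

    options         ARM32_PMAP_NEW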
In particular, use r8 to hold the old process, and r7 for medium-term
scratch, saving r0-r3 for things we don't need saved over function
calls. This gets rid of five register-to-register MOVs.
and hence save fewer into the PCB. This should give me enough free
registers in cpu_switch to tidy things up and support MULTIPROCESSOR
properly. While we're here, make the stacked registers into an
APCS stack frame, so that DDB backtraces through cpu_switch() will
work.
This also affects cpu_fork(), which has to fabricate a switchframe and
PCB for the new process.
GCC produces almost exactly the same instructions as the hand-assembled
versions, albeit in a different order. It even found one place where it
could shave an instruction off. Its insistence on creating a stack frame
might slow things down marginally, but not, I think, enough to matter.
add rd, pc, #foo - . - 8 -> adr rd, foo
ldr rd, [pc, #foo - . - 8] -> ldr rd, foo
Also, when saving the return address for a function pointer call, use
"mov lr, pc" just before the call unless the return address is somewhere
other than just after the call site.
Finally, a few obvious little micro-optimisations like using LDR directly
rather than ADR followed by LDR, and loading directly into PC rather than
bouncing via R0.
and boot multi-user on a single-processor machine. Many of these changes
are wildly inappropriate for actual multi-processor operation, and correcting
this will be my next task.
* Save an instruction in the transition from idle to have-process-to-
switch-to, and eliminate two instructions that cause datadep-stalls
on StrongARM and XScale (one in each idle block).
* Rearrange some other instructions to avoid datadep-stalls on StrongARM
and XScale.
* Since cpu_do_powersave == 0 is by far the common case, avoid a
pipeline flush by reordering the two idle blocks.
the CPU's "sleep" function in the idle loop.
* Default all CPUs to not use powersave, except for the PDA processors
(SA11x0 and PXA2x0).
This significantly reduces interrupt latency in high-performance
applications (and was good for squeezing another ~10% out of an XScale
IOP on a Gig-E benchmark).
nathanw_sa branch.
* In switch_exit(), set the outgoing-proc register to NULL (rather than
proc0) so that we actually use the "exiting process" optimization in
cpu_switch().
Also add correct locking when freeing pages in pmap_destroy() (fix from potr).
This now means that arm32 kernels can be built with LOCKDEBUG enabled (only
tested on cats, though).
pass. Rather than providing a whole slew of cache operations that
aren't ever used, distill them down to some useful primitives:
    icache_sync_all         Synchronize I-cache
    icache_sync_range       Synchronize I-cache range
    dcache_wbinv_all        Write-back and Invalidate D-cache
    dcache_wbinv_range      Write-back and Invalidate D-cache range
    dcache_inv_range        Invalidate D-cache range
    dcache_wb_range         Write-back D-cache range
    idcache_wbinv_all       Write-back and Invalidate D-cache,
                            Invalidate I-cache
    idcache_wbinv_range     Write-back and Invalidate D-cache,
                            Invalidate I-cache range
Note: This does not yet include an overhaul of the actual asm files
that implement the primitives. Instead, we've provided a safe default
for each CPU type, and the individual CPU types can now be optimized
one at a time.
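As a rough illustration of the shape this takes, the primitives above can
be dispatched through a per-CPU-type table of function pointers. This C
sketch is hypothetical; the structure and member names are invented, not
the actual NetBSD interface:

    typedef unsigned long vaddr_t;  /* kernel virtual address */
    typedef unsigned long vsize_t;  /* size of a kernel VA range */

    /*
     * One instance of this table is filled in per CPU type, so each
     * primitive can be optimized for its cache architecture.
     */
    struct cache_ops {
            void    (*co_icache_sync_all)(void);
            void    (*co_icache_sync_range)(vaddr_t, vsize_t);
            void    (*co_dcache_wbinv_all)(void);
            void    (*co_dcache_wbinv_range)(vaddr_t, vsize_t);
            void    (*co_dcache_inv_range)(vaddr_t, vsize_t);
            void    (*co_dcache_wb_range)(vaddr_t, vsize_t);
            void    (*co_idcache_wbinv_all)(void);
            void    (*co_idcache_wbinv_range)(vaddr_t, vsize_t);
    };

    extern struct cache_ops cache_ops;      /* filled in at attach time */

    #define icache_sync_range(va, sz) \
            cache_ops.co_icache_sync_range((va), (sz))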
switch_exit() only needs to take one parameter; it loads the value of proc0
into R1 itself.
Fix up some comments to reflect the real state of things.
Tweak a couple of bits of asm to avoid a load delay.
Remove excess code for setting curpcb and curproc.