* Save an instruction in the transition from idle to have-process-to-
switch-to, and eliminate two instructions that cause datadep-stalls
on StrongARM and XScale (one in each idle block).
* Rearrange some other instructions to avoid datadep-stalls on StrongARM
and XScale.
* Since cpu_do_powersave == 0 is by far the common case, avoid a
pipeline flush by reordering the two idle blocks.
the CPU's "sleep" function in the idle loop.
* Default all CPUs to not use powersave, except for the PDA processors
(SA11x0 and PXA2x0).
This significantly reduces interrupt latency in high-performance
applications (and was good to squeeze another ~10% out of an XScale
IOP on a Gig-E benchmark).
of the range are aligned to a cacheline boundary, then do a dcache-inv
operation, rather than a dcache-wbinv operation.
XXX It could be a little smarter (align using wbinv, inv, then finish
up using wbinv), but even this simple change is good for a nearly 40%
improvement in my test case on XScale.
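
A minimal sketch of the alignment test described above, not the committed
code. The names pick_cache_op, OP_INV and OP_WBINV are hypothetical, and
the cache line size is passed in as a parameter rather than read from the
CPU:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the real code calls the CPU's dcache functions. */
enum cache_op { OP_INV, OP_WBINV };

/*
 * Choose invalidate-only when both ends of [start, start + len) sit on
 * a cache line boundary; otherwise fall back to write-back+invalidate
 * so that unrelated data sharing a partial line is not lost.
 * line_size is assumed to be a power of two.
 */
static enum cache_op
pick_cache_op(uintptr_t start, size_t len, size_t line_size)
{
	uintptr_t mask = (uintptr_t)line_size - 1;

	assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

	if (((start | (start + len)) & mask) == 0)
		return OP_INV;
	return OP_WBINV;
}
```

The "smarter" variant mentioned in the XXX note would instead split the
range into an unaligned head, an aligned middle, and an unaligned tail.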
map contains "coherent" (non-cached in ARM-land) mappings.
* Set ARM32_DMAMAP_COHERENT in the map at the start of a load operation,
and clear it in _bus_dmamap_load_buffer() if we encounter any cacheable
mappings.
* In _bus_dmamap_sync(), if the map is marked COHERENT, skip any cache
flushing.
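
The flag handling in the two items above can be sketched roughly as
follows. This is an illustration only: the structures, the DMAMAP_COHERENT
value and the sketch_* functions are hypothetical stand-ins, and the flush
counter exists only to make the effect observable:

```c
#include <stdbool.h>
#include <stddef.h>

#define DMAMAP_COHERENT	0x01	/* stand-in for ARM32_DMAMAP_COHERENT */

struct sketch_seg {
	bool cacheable;		/* true if the mapping is cached */
};

struct sketch_map {
	int flags;
	int flushes;		/* counts cache operations performed */
};

/* Start optimistic: mark the map coherent, then clear on any cached segment. */
static void
sketch_load(struct sketch_map *map, const struct sketch_seg *segs, size_t nsegs)
{
	size_t i;

	map->flags |= DMAMAP_COHERENT;
	for (i = 0; i < nsegs; i++) {
		if (segs[i].cacheable)
			map->flags &= ~DMAMAP_COHERENT;
	}
}

/* Skip all cache work when every mapping in the map is non-cached. */
static void
sketch_sync(struct sketch_map *map)
{
	if (map->flags & DMAMAP_COHERENT)
		return;
	map->flushes++;		/* a real implementation flushes here */
}
```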
cache line allocation policy on XScale CPUs: in pmap_enter(), if the
pmap is the kernel pmap, clear the X-bit in the PTE, thus disabling
read/write-allocate for managed kernel mappings.
Yes, this is ugly. But it makes userland code run with r/w-allocate,
which is a huge improvement on systems with low core memory performance.
This version works on both 26-bit and 32-bit machines. For large copies,
it's up to three times as fast as the old arm32 version and five times as
fast as the old arm26 version. For small copies it seems to be even faster
(getrusage() is apparently over ten times faster on an ARM610).
Hooray for Allen!
counters. These counters do not exist on all CPUs, but where they
do exist, can be used for counting events such as dcache misses that
would otherwise be difficult or impossible to instrument by code
inspection or hardware simulation.
pmc(9) is meant to be a general interface. Initially, the Intel XScale
counters are the only ones supported.
page tables.
- pmap_enter(): if making a mapping for the same PA rw->ro, write-back
the cache before doing so.
- pmap_clearbit(): if revoking REF on a page, make sure to wbinv the
cache if the page has write permission, else inv the cache if the page's
PTE is valid (XXX we actually wbinv in this case, as well, due to lack
of idcache_inv_range()). Only flush the TLB if the PTE changed.
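
The pmap_clearbit() decision above, written out as a small sketch. The
enum, the function name and the have_inv_range parameter are all
hypothetical; the parameter models the XXX note, where the lack of
idcache_inv_range() forces a wbinv in the invalidate case too:

```c
#include <stdbool.h>

enum sketch_cache_op { SKETCH_NONE, SKETCH_INV, SKETCH_WBINV };

/*
 * Pick the cache operation when the REF bit is revoked on a page:
 * writable pages may hold dirty lines, so write back and invalidate;
 * read-only pages with a valid PTE only need an invalidate.
 */
static enum sketch_cache_op
sketch_ref_revoke_op(bool writable, bool pte_valid, bool have_inv_range)
{
	if (writable)
		return SKETCH_WBINV;
	if (pte_valid)
		return have_inv_range ? SKETCH_INV : SKETCH_WBINV;
	return SKETCH_NONE;
}
```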
nathanw_sa branch.
* In switch_exit(), set the outgoing-proc register to NULL (rather than
proc0) so that we actually use the "exiting process" optimization in
cpu_switch().
A new "arm32_dma_range" structure now describes a DMA window, with
a system address base, bus address base, and length. In addition to
providing info about which memory regions are legal for DMA, the new
structure provides address translation support, as well.
As before, if a tag does not list any ranges, then all addresses are
considered valid, and no DMA address translation is performed.
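
As an illustration of the window check and translation described above,
here is a minimal sketch. The structure and field names are hypothetical
and do not reproduce the real arm32_dma_range definition; addresses are
modelled as 32-bit values:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one DMA window. */
struct sketch_dma_range {
	uint32_t sysbase;	/* system (physical) address base */
	uint32_t busbase;	/* bus address base */
	uint32_t len;		/* window length in bytes */
};

/*
 * Translate a system address to a bus address.  With no ranges, every
 * address is valid and passes through untranslated.  Returns false if
 * ranges exist and none of them contains paddr.
 */
static bool
sketch_translate(const struct sketch_dma_range *ranges, size_t nranges,
    uint32_t paddr, uint32_t *busaddr)
{
	size_t i;

	if (nranges == 0) {
		*busaddr = paddr;
		return true;
	}
	for (i = 0; i < nranges; i++) {
		const struct sketch_dma_range *r = &ranges[i];

		if (paddr >= r->sysbase && paddr - r->sysbase < r->len) {
			*busaddr = paddr - r->sysbase + r->busbase;
			return true;
		}
	}
	return false;
}
```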
This allows us to remove a large chunk of code which was duplicated and
tweaked slightly (to do the address translation) from the stock ARM
bus_dma in the XScale IOP and ARM Integrator ports.
Test compiled on all ARM platforms, test booted on Intel IQ80321 and Shark.
into platform-specific initialization code, giving platform-specific
code control over which free list a given chunk of memory gets put
onto.
Changes are essentially mechanical. Test compiled for all ARM
platforms, test booted on Intel IQ80321 and Shark.
Discussed some time ago on port-arm.
the virtual address for each DMA segment, just cache a pointer to the
original buffer/buftype used to load the DMA map, and use that. This
lets us shrink the bus_dma_segment_t down from 12 bytes to 8, and the
cache flushing is also more efficient.
Tested on an i80321 -- changes to others are mechanical.
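
To illustrate the size change, here is a sketch of a segment with and
without a per-segment virtual address, assuming 32-bit fields. The
structure and field names are hypothetical, not the real
bus_dma_segment_t or bus_dmamap layout:

```c
#include <stdint.h>

/* Before: each segment carries its own virtual address (12 bytes). */
struct sketch_seg_old {
	uint32_t addr;		/* DMA address */
	uint32_t len;		/* length in bytes */
	uint32_t vaddr;		/* virtual address, used for cache flushing */
};

/* After: the segment holds only address and length (8 bytes). */
struct sketch_seg_new {
	uint32_t addr;
	uint32_t len;
};

/* The map remembers what it was loaded from, once per map. */
struct sketch_map {
	const void *origbuf;	/* original buffer passed to the load */
	int buftype;		/* kind of buffer: linear, mbuf, uio, ... */
};
```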
be properly used by any misc. cloning device. While here, correct
a comment to indicate that "open" is the only entry point and that
everything else is handled with fileops.
mappings bus_dma(9) states: "In the event that the DMA handle contains
a valid mapping, the mapping will be unloaded via the same mechanism
used by bus_dmamap_unload()." And some drivers do mean to skip the
unload step.