in question, whereas the ARM code was using it to hold the model
identification. To fix this, rename:
ci_cpuid -> ci_arm_cpuid
ci_cputype -> ci_arm_cputype (for consistency)
ci_cpurev -> ci_arm_cpurev (ditto)
ci_cpunum -> ci_cpuid
This makes top(1) give correct CPU numbers in its "STATE" column (all 0 for
now).
and boot multi-user on a single-processor machine. Many of these changes
are wildly inappropriate for actual multi-processor operation, and correcting
this will be my next task.
This merge changes the device switch tables from a static array to
tables dynamically generated by config(8).
- All device switches are defined as constant structures in the device
drivers.
- The new grammar ``device-major'' is introduced to ``files'' (see the
example below):
	device-major <prefix> char <num> [block <num>] [<rules>]
- All device major numbers must be listed in the port-dependent majors.<arch>
using this grammar.
- Added a new naming convention: a device switch must be named
<prefix>_[bc]devsw so that the device switch tables can be auto-generated.
- Backward compatibility for loading a block/character device switch via
the LKM framework is broken. This is necessary to be able to convert
between block/character device majors and device names at runtime.
- The restriction on assigning device majors via LKM is completely removed.
There is no need to reserve LKM entries for dynamically loaded device
switches.
- At compile time, the list of device major numbers is packed into the
kernel, and the LKM framework refers to it to assign device major numbers
dynamically.
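For illustration, a hypothetical driver ``foo'' (names and numbers made up)
could be listed in the port's majors.<arch> as:

	device-major	foo	char 100 block 30	foo

and the driver would then export constant device switches named to match
the new convention, roughly:

	/* sketch only; the actual members are the driver's entry points */
	const struct bdevsw foo_bdevsw = { /* ... block entry points ... */ };
	const struct cdevsw foo_cdevsw = { /* ... character entry points ... */ };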
the 4M super-section that the PTP will map, not some random 1M
chunk of it. This gives the PTP hint code a much better chance
of working properly, and allows us to tidy up the code that
flushes a PTP from the cache in pmap_destroy().
to do uncached memory access during VM operations (which can be
quite expensive on some CPUs).
We currently write-back PTEs as soon as they're modified; there is
some room for optimization (to write them back in larger chunks).
For PTEs in the APTE space (i.e. PTEs for pmaps that describe another
process's address space), PTEs must also be evicted from the cache
completely (PTEs in PTE space will be evicted during a context switch).
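In sketch form, the immediate write-back amounts to something like this
(macro name and exact form assumed, not quoted from the diff):

	/*
	 * Push a just-modified PTE out of the data cache so the MMU,
	 * which does not snoop the cache on these CPUs, sees the update.
	 */
	#define	PTE_SYNC(pte)						\
		cpu_dcache_wb_range((vaddr_t)(pte), sizeof(pt_entry_t))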
* Save an instruction in the transition from idle to have-process-to-
switch-to, and eliminate two instructions that cause datadep-stalls
on StrongARM and XScale (one in each idle block).
* Rearrange some other instructions to avoid datadep-stalls on StrongARM
and XScale.
* Since cpu_do_powersave == 0 is by far the common case, avoid a
pipeline flush by reordering the two idle blocks.
the CPU's "sleep" function in the idle loop.
* Default all CPUs to not use powersave, except for the PDA processors
(SA11x0 and PXA2x0).
This significantly reduces interrupt latency in high-performance
applications (and was good to squeeze another ~10% out of an XScale
IOP on a Gig-E benchmark).
of the range are aligned to a cacheline boundary, then do a dcache-inv
operation, rather than a dcache-wbinv operation.
XXX It could be a little smarter (align using wbinv, inv, then finish
up using wbinv), but even this simple change is good for a nearly 40%
improvement in my test case on XScale.
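A sketch of the test (helper names follow the kernel's cpufunc style, but
are assumed here):

	/* PREREAD: invalidate-only is safe when no cacheline is shared */
	if (((va | len) & arm_dcache_align_mask) == 0)
		cpu_dcache_inv_range(va, len);	/* cheap: no write-back pass */
	else
		cpu_dcache_wbinv_range(va, len);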
map contains "coherent" (non-cached in ARM-land) mappings.
* Set ARM32_DMAMAP_COHERENT in the map at the start of a load operation,
and clear it in _bus_dmamap_load_buffer() if we encounter any cacheable
mappings.
* In _bus_dmamap_sync(), if the map is marked COHERENT, skip any cache
flushing.
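Roughly, the sync-time short-circuit looks like this (a sketch based on the
description above, not the literal code):

	void
	_bus_dmamap_sync(bus_dma_tag_t t, bus_dmamap_t map, bus_addr_t offset,
	    bus_size_t len, int ops)
	{
		/* fully non-cached map: the cache holds nothing stale */
		if (map->_dm_flags & ARM32_DMAMAP_COHERENT)
			return;
		/* ... otherwise do the usual wb/inv processing ... */
	}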
cache line allocation policy on XScale CPUs: in pmap_enter(), if the
pmap is the kernel pmap, clear the X-bit in the PTE, thus disabling
read/write-allocate for managed kernel mappings.
Yes, this is ugly. But it makes userland code run with r/w-allocate,
which is a huge improvement on systems with low core memory performance.
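In sketch form (the PTE bit name here is hypothetical):

	/* pmap_enter(): deny read/write-allocate to managed kernel mappings */
	if (pm == pmap_kernel())
		npte &= ~L2_X_BIT;	/* hypothetical name for the X-bit */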
This version works on both 26-bit and 32-bit machines. For large copies,
it's up to three times as fast as the old arm32 version and five times as
fast as the old arm26 version. For small copies it seems to be even faster
(getrusage() is apparently over ten times faster on an ARM610).
Hooray for Allen!
counters. These counters do not exist on all CPUs but, where they do
exist, they can be used for counting events such as dcache misses that
would otherwise be difficult or impossible to instrument by code
inspection or hardware simulation.
pmc(9) is meant to be a general interface. Initially, the Intel XScale
counters are the only ones supported.
page tables.
- pmap_enter(): if making a mapping for the same PA rw->ro, write-back
the cache before doing so.
- pmap_clearbit(): if revoking REF on a page, make sure to wbinv the
cache if the page has write permission, else inv the cache if the page's
PTE is valid (XXX we actually wbinv in this case, as well, due to lack
of idcache_inv_range()). Only flush the TLB if the PTE changed.
nathanw_sa branch.
* In switch_exit(), set the outgoing-proc register to NULL (rather than
proc0) so that we actually use the "exiting process" optimization in
cpu_switch().
A new "arm32_dma_range" structure now describes a DMA window, with
a system address base, bus address base, and length. In addition to
providing info about which memory regions are legal for DMA, the new
structure provides address translation support, as well.
As before, if a tag does not list any ranges, then all addresses are
considered valid, and no DMA address translation is performed.
This allows us to remove a large chunk of code which was duplicated and
tweaked slightly (to do the address translation) from the stock ARM
bus_dma in the XScale IOP and ARM Integrator ports.
Test compiled on all ARM platforms, test booted on Intel IQ80321 and Shark.
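The window description is roughly (a sketch reconstructed from the summary
above, not a verbatim copy of the header):

	struct arm32_dma_range {
		bus_addr_t	dr_sysbase;	/* CPU (system) base address */
		bus_addr_t	dr_busbase;	/* same memory as seen by the bus */
		bus_size_t	dr_len;		/* length of the window */
	};

so a system address inside a window translates to the bus address
dr_busbase + (sysaddr - dr_sysbase).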
into platform-specific initialization code, giving platform-specific
code control over which free list a given chunk of memory gets put
onto.
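For example, platform initialization code could feed a memory chunk to a
particular free list like this (free-list constant illustrative):

	/* register this chunk with UVM, naming its free list explicitly */
	uvm_page_physload(atop(start), atop(end),
	    atop(avail_start), atop(avail_end), VM_FREELIST_DEFAULT);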
Changes are essentially mechanical. Test compiled for all ARM
platforms, test booted on Intel IQ80321 and Shark.
Discussed some time ago on port-arm.
the virtual address for each DMA segment, just cache a pointer to the
original buffer/buftype used to load the DMA map, and use that. This
lets us shrink the bus_dma_segment_t down from 12 bytes to 8, and the
cache flushing is also more efficient.
Tested on an i80321 -- changes to others are mechanical.
be properly used by any misc. cloning device. While here, correct
a comment to indicate that "open" is the only entry point and that
everything else is handled with fileops.
mappings bus_dma(9) states: "In the event that the DMA handle contains
a valid mapping, the mapping will be unloaded via the same mechanism
used by bus_dmamap_unload()." And some drivers do mean to skip the
unload step.
- implement SIMPLEQ_REMOVE(head, elm, type, field). whilst it's O(n),
this mirrors the functionality of SLIST_REMOVE() (the other
singly-linked list type) and FreeBSD's STAILQ_REMOVE() (see the usage
sketch after this list)
- remove the unnecessary elm arg from SIMPLEQ_REMOVE_HEAD().
this mirrors the functionality of SLIST_REMOVE_HEAD() (the other
singly-linked list type) and FreeBSD's STAILQ_REMOVE_HEAD()
- remove notes about SIMPLEQ not supporting arbitrary element removal
- use SIMPLEQ_FOREACH() instead of home-grown for loops
- use SIMPLEQ_EMPTY() appropriately
- use SIMPLEQ_*() instead of accessing sqh_first,sqh_last,sqe_next directly
- reorder manual page; be consistent about how the types are listed
- other minor cleanups
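A small usage sketch against the updated API (the element type here is
hypothetical):

	#include <sys/queue.h>
	#include <stdio.h>

	struct entry {
		int value;
		SIMPLEQ_ENTRY(entry) link;
	};

	SIMPLEQ_HEAD(entryhead, entry);

	int
	main(void)
	{
		struct entryhead head = SIMPLEQ_HEAD_INITIALIZER(head);
		struct entry a = { .value = 1 }, b = { .value = 2 }, *e;

		SIMPLEQ_INSERT_TAIL(&head, &a, link);
		SIMPLEQ_INSERT_TAIL(&head, &b, link);

		SIMPLEQ_FOREACH(e, &head, link)	/* instead of a home-grown loop */
			printf("%d\n", e->value);

		SIMPLEQ_REMOVE(&head, &b, entry, link);	/* new: O(n) arbitrary removal */
		SIMPLEQ_REMOVE_HEAD(&head, link);	/* new: no elm argument */

		if (SIMPLEQ_EMPTY(&head))
			printf("empty\n");
		return 0;
	}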
Also add correct locking when freeing pages in pmap_destroy() (fix from potr).
This means arm32 kernels can now be built with LOCKDEBUG enabled (only
tested on cats, though).
* pmap_protect(): write back the range when doing a r/w -> r/o
transition. (Still leave the block concerned with this in
pmap_clean_page() disabled, for now.)
* pmap_pte_init_xscale(): Disable read/write-allocate for now, until
we figure out why sometimes cache lines of NULs get deposited into
file data. Also, make sure ECC protection of page table access is
disabled for now.
* xscale_setup_minidata(): Make sure the mini-data cache is configured
write-back with read/write-allocate.
internally). Move arm/iomd/pms* to arm/iomd/opms*. Mechanical change,
tested by cross-compiling a kernel from i386.
Approved by christos.
XXX: What are arm/arm32/conf.c and arm/include/conf.h good for?
cache mode. Add a new XSCALE_CACHE_WRITE_THROUGH option for people who
are paranoid about the cache-related errata (you *do* have to line up
the planets correctly to trip them, but having the option is useful).
file, <arm/cpuconf.h>, which pulls in "opt_cputypes.h" and then defines
the following:
* CPU_NTYPES -- how many CPU types are configured into the kernel. What
you really want to know is "== 1" or "> 1".
* Defines ARM_ARCH_2, ARM_ARCH_3, ARM_ARCH_4, ARM_ARCH_5, depending
on which ARM architecture versions are configured (based on CPU_*
options). Also defines ARM_NARCH to determine how many architecture
versions are configured.
* Defines ARM_MMU_MEMC, ARM_MMU_GENERIC, ARM_MMU_XSCALE depending on
which classes of ARM MMUs are configured into the kernel, and ARM_NMMUS
to determine how many MMU classes are configured.
Remove the needless inclusion of "opt_cputypes.h" in several places.
Convert remaining users to <arm/cpuconf.h>.
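The pattern, as an illustrative excerpt (not the literal header):

	/* derive ARM_ARCH_* and ARM_NARCH from the configured CPU_* options */
	#include "opt_cputypes.h"

	#if defined(CPU_SA110) || defined(CPU_SA1100) || defined(CPU_SA1110)
	#define	ARM_ARCH_4	1
	#else
	#define	ARM_ARCH_4	0
	#endif

	#if defined(CPU_XSCALE_80200) || defined(CPU_XSCALE_80321)
	#define	ARM_ARCH_5	1
	#else
	#define	ARM_ARCH_5	0
	#endif

	/* ... ARM_ARCH_2 and ARM_ARCH_3 are derived the same way ... */
	#define	ARM_NARCH	(ARM_ARCH_2 + ARM_ARCH_3 + ARM_ARCH_4 + ARM_ARCH_5)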
read/write-allocate line allocation policy.
On the i80321, this improves nearly every lmbench benchmark, dramatically
so for the ones that are sensitive to memory bandwidth (100-300% improvement
for these).
cache, as well. The mini-data cache is 2-way, so src and dst won't
clobber each other, and the smallness of the cache doesn't matter,
since we access each page once sequentially.
While we still have to do the initial clean of the source page, this
saves another 4K of main D$ pollution, and also means we don't have
to do 2 cache passes after the copy is complete (i.e. we can skip the
invalidation of the source page in the main cache, since it's no longer
there).
vs. XScale. Use the mini-data cache for the destination on XScale,
thus saving tossing out 4K of possible-useful data from the main
data cache each time.
This significantly improves every test in lmbench.
and pte_l2_s_cache_mode. The cache-meaningful bits are different
for these descriptor types on some processor models.
* Add pte_*_cache_mask, corresponding to each above, which has a mask
of the cache-meaningful bits, and define those for generic and XScale
MMU classes. Note, the L2_S_CACHE_MASK_xscale definition requires
use of the Extended Small Page L2 descriptor (the "X" bit overlaps
with AP bits otherwise).
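Code that builds a PTE can then stay MMU-class-agnostic; in sketch form:

	/* replace the cache-control field with the per-class cache mode */
	npte = (npte & ~pte_l2_s_cache_mask) | pte_l2_s_cache_mode;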
1. Generic (compatible with ARM6)
2. XScale (can be used as generic, but also has certain nifty extensions).
Define abstract PTE bit definitions for each MMU class. If only one MMU
class is configured into the kernel (based on CPU_* options), then we
get the constants for that MMU class. Otherwise we indirect through
variables set up via set_cpufuncs().
XXX The XScale bits are currently the same as the generic bits. Baby steps.
Significant cleanup, here, including better PTE bit names.
* Add XScale PTE extensions (ECC enable, write-allocate cache mode).
* Mechanical changes everywhere else to update for new pte.h. While
doing this, two bugs (as a result of typos) were fixed in
arm/arm32/bus_dma.c
evbarm/integrator/int_bus_dma.c
the lower bits; UVM provides us page-aligned addresses for
everything. For the paranoid, we'll leave KDASSERT()'s in
that check for this if the kernel is built with DEBUG.
Low-hanging fruit that shaves some cycles.
pmap.h and give them more descriptive names and better comments:
* PT_M -> PVF_MOD (page is modified)
* PT_H -> PVF_REF (page is referenced)
* PT_W -> PVF_WIRED (mapping is wired)
* PT_Wr -> PVF_WRITE (mapping is writable)
* PT_NC -> PVF_NC (mapping is non-cacheable; multiple mappings)
* Don't refer to VA 0, instead refer to a new variable: vector_page
* Delete the old zero_page_*() functions, replacing them with a new
one: vector_page_setprot().
* When manipulating vector page mappings in user pmaps, only do so if
the vector page is below KERNEL_BASE (if it's above KERNEL_BASE, the
vector page is mapped by the kernel pmap).
* Add a new function, arm32_vector_init(), which takes the virtual
address of the vector page (which MUST be valid when the function
is called) and a bitmask of vectors the kernel is going to take
over, and performs all vector page initialization, including setting
the V bit in the CPU Control register ("relocate vectors to high
address"), if necessary.
when the part being queried was mapped with a section (!), giving weird
results, and had become a mess of goto's.
Complete rewrite; cleaned up the `goto' jungle entirely ... ripped out all
goto's. The resulting code is much easier to read and might even show a
small performance gain.
I/O processors:
* The i80200 and the i80321 have the same CPU ID, so split the
CPU_XSCALE option into CPU_XSCALE_80200 and CPU_XSCALE_80321
options, and don't let them both be defined at the same time.
XXX May want to revisit this in the future.
* Split some registers common between the i80200 and i80321 into
<arm/xscale/xscalereg.h>.
* Rename a few existing functions.
pmap_copy_page() will never have any mappings. Therefore, it
is unnecessary to do a cache clean for that page.
Add assertions under #ifdef DEBUG to check this invariant.
This shaves some cycles off the frequently-called pmap_zero_page()
and pmap_copy_page() (no need to look up the dst page's vm_page
structure, and one less function call to clean the page).
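The DEBUG check is roughly (field names assumed from the arm pmap of the
time):

	#ifdef DEBUG
		struct vm_page *pg = PHYS_TO_VM_PAGE(phys);

		if (pg->mdpage.pvh_list != NULL)
			panic("pmap_zero_page: page has mappings");
	#endif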
become ippp (ISDN ppp) and irip (ISDN raw IP). The character devices are now
called: /dev/isdn (isdnd <-> kernel communication), /dev/isdnctl (dialing
and other control), /dev/isdntrc* (tracing), /dev/isdnbchan* (raw B channel
access, i.e. for user land PPP) and /dev/isdntel* (telephone devices, i.e.
for answering machines).
issue an instruction that caused the late abort handler to be called for
which the kernel had no support built in.
It now panics only when this happens in the kernel; otherwise it delivers a
SIGSEGV signal to the process.
but not used, resulting in a compiler error. Splitting the declaration
from the initialisation solves this.
It would be better not to declare the flag at all when ARMFPE isn't enabled,
but that would just add to the #ifdef jungle.