(Note: the memcmp/memset improvements also benefit non-Xscale CPUs.)
memcmp() - Compare 32 bits at a time if possible. Special-case 6-byte
comparisons, for the benefit of the network stack.
memset() - More loop unrolling, plus use of the 'strd' instruction,
results in a > 100% speedup on Xscale.
memcpy() - Big-endian support, unrolled loops, 'strd/ldrd/pld', plus
special cases for very common length/alignment combinations
(at least in the kernel). Benchmarks show ~50% improvement on
Xscale.
memmove() - Big-endian support. Use fast memcpy(), above, if the regions
don't overlap. Otherwise unchanged.
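
For illustration only, the memmove() dispatch described above can be
sketched in portable C. This is not the ARM assembly from this change:
the name sketch_memmove() is hypothetical, the standard memcpy() stands
in for the fast memcpy(), and the byte loops stand in for the unchanged
overlapping-copy path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: hand non-overlapping regions to memcpy(),
 * otherwise copy byte by byte in whichever direction is safe.
 */
void *sketch_memmove(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uintptr_t da = (uintptr_t)dst;
    uintptr_t sa = (uintptr_t)src;
    /* Distance between the two start addresses. */
    uintptr_t gap = (da > sa) ? da - sa : sa - da;

    if (len == 0 || gap == 0)
        return dst;

    /* The regions cannot overlap if they start at least len bytes apart. */
    if (gap >= len)
        return memcpy(dst, src, len);

    if (da < sa) {
        /* Destination is below source: copy forwards. */
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];
    } else {
        /* Destination is above source: copy backwards. */
        for (size_t i = len; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

Comparing the start-address distance against the length avoids computing
an end address, so the check cannot wrap around at the top of the address
space.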