Costs another couple MOV instructions, but we can't skimp on this --
there's no red zone below sp for interrupts on arm, so we can't touch
anything there. So just use fp to save sp and then adjust sp itself,
rather than using fp as a temporary register to point just below sp.
Should fix PR port-arm/55598 -- previously the ChaCha self-test
failed 33/10000 trials triggered by sysctl during running system;
with the patch it has failed 0/10000 trials.
(Presumably it happened more often at boot time, leading to 5/26
failures in the test bed, because we just enabled interrupts and some
devices are starting to deliver interrupts.)
Swapped the two halves (only gcc does that, I think) and wrote j,i
backwards, oops.
(I don't have a big-endian arm clang build handy to test; hoping this
works.)
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers.
Needed because GCC and Clang disagree on the ordering of lanes,
depending on whether it's 64-bit big-endian, 32-bit big-endian, or
little-endian -- and, bizarrely, both of them disagree with the
architectural numbering of lanes.
Experimented with using
static const uint8_t x8[16] = {...};
uint8x16_t x = vld1q_u8(x8);
which doesn't require knowing anything about the ordering of lanes,
but this generates considerably worse code and apparently confuses
GCC into not recognizing the constant value of x8.
Fix some clang mistakes while here too.
cgd performance is not as good as I was hoping (~4% improvement over
chacha_ref.c) but it should improve substantially more if we let the
cgd worker thread keep fpu state so we don't have to pay the cost of
isb and zero-the-fpu on every 512-byte cgd block.
However, disable use of (V)TBL on armv7/aarch32 for now, because for
some reason GCC spills things to the stack despite having plenty of
free registers, which hurts performance more than it helps at least
on ARM Cortex-A8.
The 4-blocks-at-a-time assembly helper is disabled for now; adapting
it to armv7 is going to be a little annoying with only 16 128-bit
vector registers.
(Should also do a fifth block in the integer registers for 320 bytes
at a time.)
This way we can use it for cprng_fast early on. ChaCha is easy
because there's no data formats that must be preserved from call to
call but vary from implementation to implementation -- we could even
make it a sysctl knob to dynamically select it with negligible cost.
(In contrast, different AES implementations use different expanded
key formats which must be preserved from aes_setenckey to aes_enc,
for example, which means a considerably greater burden on dynamic
selection that's not really worth it.)
Shouldn't substantively hurt performance -- the comparison that has
been moved into the loop was essentially the former loop condition --
and may improve performance by reducing code size since there's only
one inline call to chacha_permute instead of two.