cprng_strong may sleep on an adaptive lock (via entropy_extract),
which invalidates percpu(9) references.
Discovered by stumbling upon this panic in a test run:
panic: kernel diagnostic assertion "(cprng == percpu_getref(cprng_fast_percpu)) && (percpu_putref(cprng_fast_percpu), true)" failed: file "/home/riastradh/netbsd/current/src/sys/rump/librump/rumpkern/../../../crypto/cprng_fast/cprng_fast.c", line 117
XXX pullup-10
Discovered by code inspection. Remarkably, a combination of errors
made this fail to be a stack buffer overrun. Verified by booting
with ARMv8.0-AES disabled and with the self-test artificially made to
fail.
- Rename: CPUID_PN -> CPUID_PSN
CPUID_CFLUSH -> CPUID_CLFSH
CPUID_SBF -> CPUID_PBE
CPUID_LZCNT -> CPUID_ABM
CPUID_P1GB -> CPUID_PAGE1GB
CPUID2_PCLMUL -> CPUID2_PCLMULQDQ
CPUID2_CID -> CPUID2_CNXTID
CPUID2_xTPR -> CPUID2_XTPR
CPUID2_AES -> CPUID2_AESNI
To match the x86 specification and the other OSes.
- Remove: CPUID_B10, CPUID_B20, CPUID_IA64. They do not exist.
Costs another couple MOV instructions, but we can't skimp on this --
there's no red zone below sp for interrupts on arm, so we can't touch
anything there. So just use fp to save sp and then adjust sp itself,
rather than using fp as a temporary register to point just below sp.
Should fix PR port-arm/55598 -- previously the ChaCha self-test
failed 33/10000 trials triggered by sysctl during running system;
with the patch it has failed 0/10000 trials.
(Presumably it happened more often at boot time, leading to 5/26
failures in the test bed, because we just enabled interrupts and some
devices are starting to deliver interrupts.)
...which is how the kernel runs. Switch to using __SOFTFP__ for
consistency with how it gets exposed to C, although I'm not sure how
to get it defined automagically in the toolchain for .S files so
that's set manually in files.aesneon for now.
GCC 8 miscompiles aes_ccm_tag() for m68k with optimization level -O[12],
which results in failure in aes_ccm_selftest():
| aes_ccm_selftest: tag 0: 8 bytes @ 0x4d3e38
| 03 80 5f 08 22 6f cb fe | .._."o..
| aes_ccm_selftest: verify 0 failed
| ...
| WARNING: module error: built-in module aes_ccm failed its MODULE_CMD_INIT, error 5
This is observed for amiga (A1200, 68060), mac68k (Quadra 840AV, 68040),
and luna68k (nono, 68030 emulator). However, it is not for sun3 (TME, 68020
emulator) and sun2 (TME, 68010 emulator). At the moment, it is unclear
whether this is due to differences b/w 68010-20 vs 68030-60, or something
wrong with TME.
Swapped the two halves (only gcc does that, I think) and wrote j,i
backwards, oops.
(I don't have a big-endian arm clang build handy to test; hoping this
works.)
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers.
Needed because GCC and Clang disagree on the ordering of lanes,
depending on whether it's 64-bit big-endian, 32-bit big-endian, or
little-endian -- and, bizarrely, both of them disagree with the
architectural numbering of lanes.
Experimented with using
static const uint8_t x8[16] = {...};
uint8x16_t x = vld1q_u8(x8);
which doesn't require knowing anything about the ordering of lanes,
but this generates considerably worse code and apparently confuses
GCC into not recognizing the constant value of x8.
Fix some clang mistakes while here too.
Gives a modest speed boost on rk3399 (Cortex-A53/A72), around 20% in
cgd tests, for parallelizable operations like CBC decryption; same
improvement should probably carry over to rpi4 CPU which lacks
ARMv8.0-AES.
cgd performance is not as good as I was hoping (~4% improvement over
chacha_ref.c) but it should improve substantially more if we let the
cgd worker thread keep fpu state so we don't have to pay the cost of
isb and zero-the-fpu on every 512-byte cgd block.