Commit Graph

27 Commits

Author SHA1 Message Date
jmcneill 3f729ba586 Make aes and chacha prints debug only. 2022-11-05 17:36:33 +00:00
jmcneill 4a48ef14f2 Fix detection of NEON features. ID_AA64PFR0_EL1_ADV_SIMD_NONE means SIMD
is not available, and any other value means it is.
2020-10-10 08:24:10 +00:00
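
The detection rule in this commit can be sketched in C as follows. This is a minimal illustration, not the kernel's code: the function name `has_advsimd` and the local macro names are hypothetical, and the field position (bits 23:20 of ID_AA64PFR0_EL1, with 0xf meaning "not implemented") is taken from the Arm architecture manual rather than from this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* AdvSIMD field of ID_AA64PFR0_EL1: bits [23:20] (assumed from the Arm ARM). */
#define ADVSIMD_SHIFT	20
#define ADVSIMD_MASK	UINT64_C(0xf)
/* Hypothetical local name for the "not implemented" encoding. */
#define ADVSIMD_NONE	UINT64_C(0xf)

/* Return true unless the field says Advanced SIMD is absent. */
static bool
has_advsimd(uint64_t id_aa64pfr0)
{
	uint64_t field = (id_aa64pfr0 >> ADVSIMD_SHIFT) & ADVSIMD_MASK;

	/* Compare against NONE only; every other value means "present". */
	return field != ADVSIMD_NONE;
}

/* Small self-check over the three interesting encodings. */
static void
advsimd_selfcheck(void)
{
	assert(has_advsimd(UINT64_C(0x0) << ADVSIMD_SHIFT));
	assert(has_advsimd(UINT64_C(0x1) << ADVSIMD_SHIFT));
	assert(!has_advsimd(UINT64_C(0xf) << ADVSIMD_SHIFT));
}
```

Testing for inequality with the NONE value, rather than equality with one known "implemented" value, keeps the check correct for encodings such as the half-precision variant.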
jakllsch 3eade4a405 Acknowledge clang warning for NEON cipher code on aarch64eb
We've already made the nonportable vector initializations portable; the
code works on aarch64eb.
2020-09-08 17:35:27 +00:00
jakllsch b762c4de07 use correct condition 2020-09-08 17:17:32 +00:00
jakllsch 9cb9f9bc98 Fix vgetq_lane_u32 for aarch64eb with GCC
Fixes NEON AES on aarch64eb
2020-09-07 18:06:13 +00:00
jakllsch ee45e31caf Use a working macro to detect big endian aarch64.
Fixes aarch64eb NEON ChaCha.
2020-09-07 18:05:17 +00:00
riastradh 3a2006068f Adjust sp, not fp, to allocate a 32-byte temporary.
Costs another couple MOV instructions, but we can't skimp on this --
there's no red zone below sp for interrupts on arm, so we can't touch
anything there.  So just use fp to save sp and then adjust sp itself,
rather than using fp as a temporary register to point just below sp.

Should fix PR port-arm/55598 -- previously the ChaCha self-test
failed 33/10000 trials triggered by sysctl on a running system;
with the patch it has failed 0/10000 trials.

(Presumably it happened more often at boot time, leading to 5/26
failures in the test bed, because we just enabled interrupts and some
devices are starting to deliver interrupts.)
2020-08-23 16:39:06 +00:00
riastradh 062ecd5ff2 Fix some clang neon intrinsics.
Compile-tested only, with -Wno-nonportable-vector-initializers.  Need
to address -- and test -- this stuff properly but this is progress.
2020-08-09 02:49:38 +00:00
riastradh 6e727d4c03 Use vshlq_n_s32 rather than vsliq_n_s32 with zero destination.
Not sure why I reached for vsliq_n_s32 at first -- probably so I
wouldn't have to deal with a new intrinsic in arm_neon.h!
2020-08-09 02:48:38 +00:00
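
Why the two intrinsics are interchangeable here can be shown with a scalar model of one 32-bit lane. This is only a sketch: `sli32` and `shl32` are hypothetical helper names, and unsigned arithmetic is used in place of the signed lanes of the `_s32` intrinsics because the resulting bit patterns are the same and unsigned shifts avoid undefined behaviour in C.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scalar model of shift-left-and-insert on one lane: the shifted source
 * replaces the upper bits, and the low n bits of the destination survive.
 * Requires 0 <= n <= 31.
 */
static uint32_t
sli32(uint32_t dst, uint32_t src, unsigned n)
{
	uint32_t keep = (UINT32_C(1) << n) - 1;

	return (src << n) | (dst & keep);
}

/* Scalar model of a plain shift left by an immediate. */
static uint32_t
shl32(uint32_t src, unsigned n)
{
	return src << n;
}

/* With a zero destination there is nothing to keep, so SLI equals SHL. */
static void
sli_selfcheck(void)
{
	for (unsigned n = 0; n < 32; n++)
		assert(sli32(0, UINT32_C(0x80000001), n) ==
		    shl32(UINT32_C(0x80000001), n));
}
```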
riastradh 43f5649092 Fix mistake in big-endian arm clang.
Swapped the two halves (only gcc does that, I think) and wrote j,i
backwards, oops.

(I don't have a big-endian arm clang build handy to test; hoping this
works.)
2020-08-09 01:59:04 +00:00
riastradh 18ff0ad8d5 Fix ARM NEON implementations of AES and ChaCha on big-endian ARM.
New macros such as VQ_N_U32(a,b,c,d) for NEON vector initializers.
Needed because GCC and Clang disagree on the ordering of lanes,
depending on whether it's 64-bit big-endian, 32-bit big-endian, or
little-endian -- and, bizarrely, both of them disagree with the
architectural numbering of lanes.

Experimented with using

        static const uint8_t x8[16] = {...};

        uint8x16_t x = vld1q_u8(x8);

which doesn't require knowing anything about the ordering of lanes,
but this generates considerably worse code and apparently confuses
GCC into not recognizing the constant value of x8.

Fix some clang mistakes while here too.
2020-08-08 14:47:01 +00:00
riastradh 143bed0ba5 Issue three more swaps to save eight stores.
Reduces code size and yields a small (~2%) cgd throughput boost.

Remove duplicate comment while here.
2020-07-29 14:23:59 +00:00
riastradh 7a8eb9a111 Implement 4-way vectorization of ChaCha for armv7 NEON.
cgd performance is not as good as I was hoping (~4% improvement over
chacha_ref.c) but it should improve substantially more if we let the
cgd worker thread keep fpu state so we don't have to pay the cost of
isb and zero-the-fpu on every 512-byte cgd block.
2020-07-28 20:08:48 +00:00
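
The idea of 4-way vectorization can be sketched in portable C, assuming each vector lane carries the same state word from a different ChaCha block. The `vec4` type and `quarterround4` name are hypothetical stand-ins (a real NEON version would use `uint32x4_t` and intrinsics or assembly); the check values are the quarter-round test vector from RFC 7539, section 2.1.1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 4 x 32-bit vector register. */
typedef struct { uint32_t lane[4]; } vec4;

/* Rotate left; callers pass 0 < n < 32. */
static uint32_t
rol32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter round applied to four independent blocks at once. */
static void
quarterround4(vec4 *a, vec4 *b, vec4 *c, vec4 *d)
{
	for (int i = 0; i < 4; i++) {
		a->lane[i] += b->lane[i];
		d->lane[i] = rol32(d->lane[i] ^ a->lane[i], 16);
		c->lane[i] += d->lane[i];
		b->lane[i] = rol32(b->lane[i] ^ c->lane[i], 12);
		a->lane[i] += b->lane[i];
		d->lane[i] = rol32(d->lane[i] ^ a->lane[i], 8);
		c->lane[i] += d->lane[i];
		b->lane[i] = rol32(b->lane[i] ^ c->lane[i], 7);
	}
}

/* Load the RFC 7539 quarter-round vector into every lane and verify it. */
static void
quarterround4_selfcheck(void)
{
	vec4 a, b, c, d;

	for (int i = 0; i < 4; i++) {
		a.lane[i] = UINT32_C(0x11111111);
		b.lane[i] = UINT32_C(0x01020304);
		c.lane[i] = UINT32_C(0x9b8d6f43);
		d.lane[i] = UINT32_C(0x01234567);
	}
	quarterround4(&a, &b, &c, &d);
	for (int i = 0; i < 4; i++) {
		assert(a.lane[i] == UINT32_C(0xea2a92f4));
		assert(b.lane[i] == UINT32_C(0xcb1cf8ce));
		assert(c.lane[i] == UINT32_C(0x4581472e));
		assert(d.lane[i] == UINT32_C(0x5881c4bb));
	}
}
```

Because the lanes never interact inside the quarter round, the loop body maps directly onto one vector instruction per line.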
riastradh 783ffb04d5 Fix big-endian build with appropriate casts around vrev32q_u8. 2020-07-28 20:05:33 +00:00
riastradh 48a3032d8a Fix typo in comment. 2020-07-28 15:42:41 +00:00
riastradh 3cca5606cd Note that VSRI seems to hurt here. 2020-07-27 20:58:56 +00:00
riastradh d4cf8df3e4 Take advantage of REV32 and TBL for 16-bit and 8-bit rotations.
However, disable use of (V)TBL on armv7/aarch32 for now, because for
some reason GCC spills things to the stack despite having plenty of
free registers, which hurts performance more than it helps at least
on ARM Cortex-A8.
2020-07-27 20:58:06 +00:00
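
The reason a byte shuffle can replace a shift-and-or rotation is that rotating a 32-bit word by 16 or 8 bits only moves whole bytes. The scalar sketch below models this for a single word; `perm_bytes` and the two index tables are hypothetical names, with byte 0 taken as the least significant byte. The 16-bit table corresponds to swapping the two halfwords (what REV32 does on 16-bit elements), and the 8-bit table is the kind of index vector a TBL lookup would use.

```c
#include <assert.h>
#include <stdint.h>

/* Output byte i is taken from input byte idx[i]; byte 0 is least significant. */
static const uint8_t rot16_idx[4] = { 2, 3, 0, 1 };
static const uint8_t rot8_idx[4] = { 3, 0, 1, 2 };

/* Reference rotate left; callers pass 0 < n < 32. */
static uint32_t
rol32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/* Permute the four bytes of x according to idx. */
static uint32_t
perm_bytes(uint32_t x, const uint8_t idx[4])
{
	uint32_t out = 0;

	for (int i = 0; i < 4; i++) {
		uint32_t byte = (x >> (8 * idx[i])) & 0xff;

		out |= byte << (8 * i);
	}
	return out;
}

/* The permutations must agree with the shift-based rotations. */
static void
perm_selfcheck(void)
{
	const uint32_t x = UINT32_C(0x11223344);

	assert(perm_bytes(x, rot16_idx) == rol32(x, 16));
	assert(perm_bytes(x, rot8_idx) == rol32(x, 8));
}
```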
riastradh 74648be169 Add RCSIDs to the AES and ChaCha .S sources. 2020-07-27 20:57:23 +00:00
riastradh 57324de2aa Align critical-path loops in AES and ChaCha. 2020-07-27 20:53:22 +00:00
riastradh f7b532dd9f Enable ChaCha NEON code on armv7 too.
The 4-blocks-at-a-time assembly helper is disabled for now; adapting
it to armv7 is going to be a little annoying with only 16 128-bit
vector registers.

(Should also do a fifth block in the integer registers for 320 bytes
at a time.)
2020-07-27 20:51:29 +00:00
riastradh 94438f5f6b Use <aarch64/asm.h> rather than copying things from it here.
Vestige from userland build on netbsd-9 during development.
2020-07-27 20:50:25 +00:00
riastradh f0c5022fb5 Simplify ChaCha selection and allow it to be used much earlier.
This way we can use it for cprng_fast early on.  ChaCha is easy
because there are no data formats that must be preserved from call to
call but vary from implementation to implementation -- we could even
make it a sysctl knob to dynamically select it with negligible cost.

(In contrast, different AES implementations use different expanded
key formats which must be preserved from aes_setenckey to aes_enc,
for example, which means a considerably greater burden on dynamic
selection that's not really worth it.)
2020-07-27 20:49:10 +00:00
riastradh 2c5bc5a38a Reduce some duplication.
Shouldn't substantively hurt performance -- the comparison that has
been moved into the loop was essentially the former loop condition --
and may improve performance by reducing code size since there's only
one inline call to chacha_permute instead of two.
2020-07-27 20:48:18 +00:00
riastradh 15cabcb36d New sysctl subtree kern.crypto.
kern.crypto.aes.selected (formerly hw.aes_impl)
kern.crypto.chacha.selected (formerly hw.chacha_impl)

XXX Should maybe deduplicate creation of kern.crypto.
2020-07-27 20:45:15 +00:00
riastradh 11faae69bf Implement ChaCha with NEON on ARM.
XXX Needs performance measurement.
XXX Needs adaptation to arm32 neon which has half the registers.
2020-07-25 22:51:57 +00:00
riastradh ba0c8ad577 Implement ChaCha with SSE2 on x86 machines.
Slightly disappointed that it only doubles, rather than quadruples,
throughput on my Ivy Bridge laptop.  Worth investigating.
2020-07-25 22:49:20 +00:00
riastradh fa79152618 New ChaCha API in kernel.
This will enable us to adopt MD vectorized implementations of ChaCha.
2020-07-25 22:46:34 +00:00