If CTR_EL0.DIC=1, Icache invalidation is not required.
If CTR_EL0.IDC=1, Dcache clean before Icache invalidation is not required.
If CLIDR_EL1.LoC is 0, or if CLIDR_EL1.LoUIS and CLIDR_EL1.LoUU are both 0, Dcache clean is not required either.
SEE ALSO ARMARM, "CTR_EL0 Cache Type Register", and "CLIDR_EL1 Cache Level ID Register"
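As a rough sketch of how these checks combine (not the committed code;
bit positions are taken from the ARM ARM and should be double-checked):

    #include <stdbool.h>
    #include <stdint.h>

    #define CTR_EL0_IDC    (1ULL << 28)   /* Dcache clean not required */
    #define CTR_EL0_DIC    (1ULL << 29)   /* Icache invalidation not required */
    #define CLIDR_LOUU(x)  (((x) >> 27) & 7)
    #define CLIDR_LOC(x)   (((x) >> 24) & 7)
    #define CLIDR_LOUIS(x) (((x) >> 21) & 7)

    static bool
    icache_inval_needed(uint64_t ctr)
    {
        return (ctr & CTR_EL0_DIC) == 0;
    }

    static bool
    dcache_clean_needed(uint64_t ctr, uint64_t clidr)
    {
        if (ctr & CTR_EL0_IDC)
            return false;   /* IDC=1: no clean before Icache invalidation */
        if (CLIDR_LOC(clidr) == 0)
            return false;   /* no level of coherence to clean to */
        if (CLIDR_LOUIS(clidr) == 0 && CLIDR_LOUU(clidr) == 0)
            return false;   /* no level of unification to clean to */
        return true;
    }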
Fix "make tags" to actually build a tags file:
- Use !commands() instead of !target(), so that the rule actually works
- Write to ${.OBJDIR}/tags for read-only source (don't know why ${.TARGET}
isn't sufficient).
- Only match *.[cly] from ${.ALLSRCS} - just excluding *.h causes failures
because of ${targ}: subdir-${targ} in bsd.subdir.mk.
Thanks to uwe@ for assistance.
Make default (wide) and non-wide behavior match. If the character
argument has (only) attributes set, use them with the default line
character.
In the wide case don't do the fallback in hline - it just calls
hline_set, which needs to do it anyway. Fix the latter to check the
wcwidth of the right character and to avoid division by zero.
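A rough sketch of the intended fallback (names are illustrative, not
the libcurses diff):

    #include <wchar.h>

    #define DEFAULT_HLINE  L'-'   /* stand-in for the default line character */

    /* Pick the character to draw with and a safe width for it. */
    static wchar_t
    pick_hline_char(wchar_t requested, int *width)
    {
        wchar_t ch = requested;
        int w;

        if (ch == L'\0')          /* attribute-only argument */
            ch = DEFAULT_HLINE;
        w = wcwidth(ch);
        if (w <= 0)               /* avoid dividing by zero later */
            w = 1;
        *width = w;
        return ch;
    }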
Forgot to consult the AAPCS before committing this earlier -- oops!
While here, take advantage of the 32 aarch64 simd registers to avoid
all stack spills.
Unlike `.arch_extension crypto', this works with clang; both work
with gas, so we'll go with this.
Clang still can't handle aes_armv8_64.S yet -- it gets confused by
dup and mov on lanes, but this makes progress.
Apparently on arm, .align is actually an alias for .p2align, taking a
power of two rather than a number of bytes, so aes_armv8_64.o was
bloated to 32KB with obscene alignment when it only needed to be
barely past 4KB.
Do the same for the x86 aes_ni_64.S -- even though .align takes a
number of bytes rather than a power of two on x86, let's just stay
away from the temptations of the evil .align directive.
We won't use it on any other systems, and it doesn't build without
NEON anyway. Verified earmv7hf GENERIC, aarch64 GENERIC64, and
earmv6 RPI2 all build with this.
Convert copystr() to a single C implementation shared by all
architectures, replacing the MD assembly versions.
Notes:
- On alpha and ia64 the function is kept but gets renamed locally to avoid
symbol collision. This is because on these two arches I am not sure
whether the ASM callers rely on fixed registers, so I prefer to keep the
ASM body for now.
- On Vax, only the symbol is removed, because the body is used from other
functions.
- On RISC-V, this change fixes a bug: copystr() was just a wrapper around
strlcpy(), and strlcpy() makes the operation less safe, because it runs
strlen() on the source and may read beyond its size (see the sketch
below).
- The kASan, kCSan and kMSan wrappers are removed, because now that
copystr() is in C, the compiler transformations are applied to it,
without the need for manual wrappers.
Could test on amd64 only, but should be fine.
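Roughly, the shared C version looks like this (a sketch, not
necessarily the exact committed code):

    #include <sys/types.h>
    #include <sys/errno.h>

    /*
     * Copy a NUL-terminated kernel string into a kernel buffer of at
     * most len bytes, reporting how many bytes were copied.  Never
     * reads past len on the source -- the property the strlcpy()-based
     * wrapper lacked.
     */
    int
    copystr(const void *kfaddr, void *kdaddr, size_t len, size_t *done)
    {
        const char *src = kfaddr;
        char *dst = kdaddr;
        size_t i;

        for (i = 0; i < len; i++) {
            dst[i] = src[i];
            if (src[i] == '\0') {
                if (done != NULL)
                    *done = i + 1;   /* count the NUL */
                return 0;
            }
        }
        if (done != NULL)
            *done = i;
        return ENAMETOOLONG;
    }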
gcc does a lousy job at compiling 128-bit NEON intrinsics on arm32;
hand-writing it made it about 12x faster, by avoiding a zillion loads
and stores to spill everything and the kitchen sink onto the stack.
(But gcc does fine on aarch64, presumably because it has twice as
many registers and doesn't have to deal with q2=d4/d5 overlapping.)
This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.
Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.
Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.
I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.
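The intended call discipline looks roughly like this (function names
are illustrative, not the actual source; header path from memory):

    #include <x86/fpu.h>   /* fpu_kern_enter(), fpu_kern_leave() */

    /* Compiled with -msse -msse2; may use __m128i locals internally. */
    void aes_sse2_enc_impl(const void *key, void *out, const void *in);

    /* The only way in: enter the kernel FPU section first. */
    void
    aes_sse2_enc(const void *key, void *out, const void *in)
    {
        fpu_kern_enter();
        aes_sse2_enc_impl(key, out, in);
        fpu_kern_leave();
    }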
This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by
(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.
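As a small illustration of the packing (variable names are mine, not
the aes_sse2 source):

    #include <emmintrin.h>   /* userland illustration; the kernel code
                              * defines its own intrinsics, see below */
    #include <stdint.h>

    /* Stack the aes_ct64 pair (q[0], q[4]) into one 128-bit register:
     * q[0] in the low 64 bits, q[4] in the high 64 bits. */
    static inline __m128i
    pack_pair(uint64_t q0, uint64_t q4)
    {
        return _mm_set_epi64x((long long)q4, (long long)q0);
    }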
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.
Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge -- expedient for now, but we should
fix it properly.
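For concreteness, the local definitions meant here look something like
the following (an illustrative sketch; the real file defines many more
operations):

    /* SSE2 vector type and two sample operations defined directly on
     * GCC/Clang vector extensions, for use where <emmintrin.h> is not
     * available, e.g. in the kernel build.  Sketch only. */
    typedef long long __m128i __attribute__((__vector_size__(16), __may_alias__));
    typedef long long __v2di __attribute__((__vector_size__(16)));

    static inline __m128i
    _mm_xor_si128(__m128i a, __m128i b)
    {
        return (__m128i)((__v2di)a ^ (__v2di)b);
    }

    static inline __m128i
    _mm_add_epi64(__m128i a, __m128i b)
    {
        return (__m128i)((__v2di)a + (__v2di)b);
    }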
Adiantum is a wide-block cipher, built out of AES, XChaCha12,
Poly1305, and NH, defined in
Paul Crowley and Eric Biggers, `Adiantum: length-preserving
encryption for entry-level processors', IACR Transactions on
Symmetric Cryptology 2018(4), pp. 39--61.
Adiantum provides better security than a narrow-block cipher with CBC
or XTS, because every bit of each sector affects every other bit,
whereas with CBC each block of plaintext only affects the following
blocks of ciphertext in the disk sector, and with XTS each block of
plaintext only affects its own block of ciphertext and nothing else.
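For intuition, the hash-encrypt-hash structure is roughly as follows
(my paraphrase of the paper, with keys elided; see the paper for the
exact key and nonce derivation). Split the plaintext as P = P_L || P_R,
with P_R the last 16 bytes and T the tweak:

    P_M = P_R + H(T, P_L)      (addition mod 2^128; H = Poly1305(NH(...)))
    C_M = AES-256(P_M)
    C_L = P_L xor XChaCha12(nonce = C_M)
    C_R = C_M - H(T, C_L)      (subtraction mod 2^128)
    C   = C_L || C_R

Flipping any plaintext bit changes C_M, and with it the stream applied
to P_L and the hash subtracted to get C_R, so the whole ciphertext
changes.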
Adiantum generally provides much better performance than
constant-time AES-CBC or AES-XTS software does without hardware
support, and performance comparable to or better than the
variable-time (i.e., leaky) AES-CBC and AES-XTS software we had
before. (Note: Adiantum also uses AES as a subroutine, but only once
per disk sector. It takes only a small fraction of the time spent by
Adiantum, so there's relatively little performance impact to using
constant-time AES software over using variable-time AES software for
it.)
Adiantum naturally scales to essentially arbitrary disk sector sizes;
sizes >= 1024 bytes take the most advantage of Adiantum's design for
performance, so 4096-byte sectors would be a natural choice if we
taught cgd to change the disk sector size. (However, it's a
different cipher for each disk sector size, so it _must_ be a cgd
parameter.)
The paper presents a similar construction HPolyC. The salient
difference is that HPolyC uses Poly1305 directly, whereas Adiantum
uses Poly1305(NH(...)). NH is annoying because it requires a
1072-byte key, which means the test vectors are ginormous, and
changing keys is costly; HPolyC avoids these shortcomings by using
Poly1305 directly, but HPolyC is measurably slower, costing about
1.5x what Adiantum costs on 4096-byte sectors.
For the purposes of cgd, we will reuse each key for many messages,
and there will be very few keys in total (one per cgd volume) so --
except for the annoying verbosity of test vectors -- the tradeoff
weighs in favour of Adiantum, especially if we teach cgd to do
>>512-byte sectors.
For now, everything that Adiantum needs beyond what's already in the
kernel is gathered into a single file, including NH, Poly1305, and
XChaCha12. We can split those out -- and reuse them, and provide MD
tuned implementations, and so on -- as needed; this is just a first
pass to get Adiantum implemented for experimentation.