This covers a lot of CPUs -- particularly lower-end CPUs over the
past decade which lack AES-NI.
Derived from Mike Hamburg's public domain vpaes software; see
<https://crypto.stanford.edu/vpaes/> for details.
Ensure that there are no paths into files compiled with -msse -msse2
at all except via fpu_kern_enter.
I didn't run into a practical problem with this, but let's not leave
a ticking time bomb for subsequent toolchain changes in case the mere
declaration of local __m128i variables causes trouble.
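To make the invariant concrete, here is the shape of a wrapper living in ordinary (non-SSE) code; aes_sse2_xform is a hypothetical name for illustration, while fpu_kern_enter/fpu_kern_leave are the real NetBSD primitives:

    #include <stddef.h>
    #include <stdint.h>

    void fpu_kern_enter(void);  /* NetBSD kernel FPU-use brackets */
    void fpu_kern_leave(void);

    /* Hypothetical routine in a file compiled with -msse -msse2. */
    extern void aes_sse2_xform(uint8_t *, size_t);

    /* Every path into the SSE2 code goes through a wrapper like this,
     * compiled without -msse, so the FPU state is always saved first. */
    static void
    aes_sse2_xform_wrapped(uint8_t *buf, size_t len)
    {
            fpu_kern_enter();
            aes_sse2_xform(buf, len);
            fpu_kern_leave();
    }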
This should work on essentially all x86 CPUs of the last two decades,
and may improve throughput over the portable C aes_ct implementation
from BearSSL by
(a) reducing the number of vector operations in sequence, and
(b) batching four rather than two blocks in parallel.
Derived from BearSSL's aes_ct64 implementation, adjusted so that where
aes_ct64 uses 64-bit q[0],...,q[7], aes_sse2 uses (q[0], q[4]), ...,
(q[3], q[7]), each tuple representing a pair of 64-bit quantities
stacked in a single 128-bit register. This translation was done very
naively, and mostly reduces the cost of ShiftRows and data movement
without doing anything to address the S-box or (Inv)MixColumns, which
spread all 64-bit quantities across separate registers and ignore the
upper halves.
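As a sketch of the packing (illustrative code, not the aes_sse2 source itself), each tuple becomes one register:

    #include <emmintrin.h>
    #include <stdint.h>

    /* Put aes_ct64's q[i] in the low 64-bit lane and q[i+4] in the
     * high lane of a single SSE2 register. */
    static inline __m128i
    aes_sse2_pack(uint64_t lo, uint64_t hi)
    {
            return _mm_set_epi64x((long long)hi, (long long)lo);
    }

With both halves co-resident in one register, ShiftRows and the load/store traffic touch half as many registers, which is where the savings noted above come from.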
Unfortunately, SSE2 -- which is all that is guaranteed on all amd64
CPUs -- doesn't have PSHUFB, which would help out a lot more. For
example, vpaes relies on that. Perhaps there are enough CPUs out
there with PSHUFB but not AES-NI to make it worthwhile to import or
adapt vpaes too.
Note: This includes local definitions of various Intel compiler
intrinsics for gcc and clang, in terms of their __builtin_* &c.,
because the necessary header files are not available during the
kernel build. This is a kludge that we should fix properly later;
for now it is merely expedient.
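For example, a local definition along these lines (a sketch modeled on how gcc's own <emmintrin.h> defines it; the my_* names are ours, not the kernel's) stands in for _mm_xor_si128:

    /* 128-bit vector types via the compiler's vector extensions. */
    typedef long long my_m128i
        __attribute__((__vector_size__(16), __may_alias__));
    typedef unsigned long long my_v2du
        __attribute__((__vector_size__(16)));

    static inline my_m128i
    my_mm_xor_si128(my_m128i a, my_m128i b)
    {
            return (my_m128i)((my_v2du)a ^ (my_v2du)b);
    }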
Adiantum is a wide-block cipher, built out of AES, XChaCha12,
Poly1305, and NH, defined in
Paul Crowley and Eric Biggers, `Adiantum: length-preserving
encryption for entry-level processors', IACR Transactions on
Symmetric Cryptology 2018(4), pp. 39--61.
Adiantum provides better security than a narrow-block cipher with CBC
or XTS, because every bit of each sector affects every other bit,
whereas with CBC each block of plaintext only affects the following
blocks of ciphertext in the disk sector, and with XTS each block of
plaintext only affects its own block of ciphertext and nothing else.
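For orientation, the paper's HBSH construction that Adiantum instantiates looks roughly like the following sketch. The helper signatures are assumptions for illustration, key derivation and the XChaCha nonce formatting are elided, and the 16-byte arithmetic is addition mod 2^128 (little-endian byte order assumed here):

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed primitives -- illustrative signatures, not the kernel API. */
    extern void hash_nh_poly1305(uint8_t h[16], const uint8_t *tweak,
        size_t tweaklen, const uint8_t *msg, size_t msglen);
    extern void aes_enc_block(uint8_t out[16], const uint8_t in[16]);
    extern void xchacha12_xor(uint8_t *buf, size_t len,
        const uint8_t nonce_seed[16]);

    static void
    add128(uint8_t x[16], const uint8_t y[16])  /* x += y mod 2^128 */
    {
            unsigned c = 0;
            for (int i = 0; i < 16; i++) {
                    unsigned t = x[i] + y[i] + c;
                    x[i] = (uint8_t)t;
                    c = t >> 8;
            }
    }

    static void
    sub128(uint8_t x[16], const uint8_t y[16])  /* x -= y mod 2^128 */
    {
            unsigned b = 0;
            for (int i = 0; i < 16; i++) {
                    unsigned t = 0x100u + x[i] - y[i] - b;
                    x[i] = (uint8_t)t;
                    b = 1u - (t >> 8);
            }
    }

    /* One sector P = PL || PR (PR = last 16 bytes), tweak T. */
    static void
    adiantum_encrypt_sketch(uint8_t *p, size_t len, const uint8_t *t,
        size_t tlen)
    {
            uint8_t *pl = p, *pr = p + len - 16, h[16];
            size_t ll = len - 16;

            hash_nh_poly1305(h, t, tlen, pl, ll);
            add128(pr, h);                  /* PM = PR + H(T, PL) */
            aes_enc_block(pr, pr);          /* CM = AES(PM): one AES call */
            xchacha12_xor(pl, ll, pr);      /* CL = PL ^ XChaCha12(...CM) */
            hash_nh_poly1305(h, t, tlen, pl, ll);
            sub128(pr, h);                  /* CR = CM - H(T, CL) */
    }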
Adiantum generally provides much better performance than
constant-time AES-CBC or AES-XTS software does without hardware
support, and performance comparable to or better than the
variable-time (i.e., leaky) AES-CBC and AES-XTS software we had
before. (Note: Adiantum also uses AES as a subroutine, but only once
per disk sector. It takes only a small fraction of the time spent by
Adiantum, so there's relatively little performance impact to using
constant-time AES software over using variable-time AES software for
it.)
Adiantum naturally scales to essentially arbitrary disk sector sizes;
sizes >= 1024 bytes take the most advantage of Adiantum's design for
performance, so 4096-byte sectors would be a natural choice if we
taught cgd to change the disk sector size. (However, it's a
different cipher for each disk sector size, so it _must_ be a cgd
parameter.)
The paper presents a similar construction, HPolyC. The salient
difference is that HPolyC uses Poly1305 directly, whereas Adiantum
uses Poly1305(NH(...)). NH is annoying because it requires a
1072-byte key, which means the test vectors are ginormous, and
changing keys is costly; HPolyC avoids these shortcomings by using
Poly1305 directly, but HPolyC is measurably slower, costing about
1.5x what Adiantum costs on 4096-byte sectors.
For the purposes of cgd, we will reuse each key for many messages,
and there will be very few keys in total (one per cgd volume) so --
except for the annoying verbosity of test vectors -- the tradeoff
weighs in the favour of Adiantum, especially if we teach cgd to do
>>512-byte sectors.
For now, everything that Adiantum needs beyond what's already in the
kernel is gathered into a single file, including NH, Poly1305, and
XChaCha12. We can split those out -- and reuse them, and provide
MD-tuned implementations, and so on -- as needed; this is just a first
pass to get Adiantum implemented for experimentation.
Different AES implementations prefer different variations on it, but
some of them -- notably VIA -- require the standard key schedule to
be available and don't provide hardware support for computing it
themselves. So adapt BearSSL's logic to generate the standard key
schedule (and decryption keys, with InvMixColumns), rather than the
bitsliced key schedule that BearSSL uses natively.
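For concreteness, the standard key schedule in question is just FIPS 197 key expansion; a minimal AES-128 sketch (with the S-box table assumed given, and the decryption keys' InvMixColumns pass omitted):

    #include <stdint.h>

    extern const uint8_t aes_sbox[256];   /* FIPS 197 S-box, assumed given */

    /* Expand a 16-byte key into the 44 32-bit round-key words that
     * hardware like VIA's expects to find laid out in memory. */
    static void
    aes128_expand(uint32_t rk[44], const uint8_t key[16])
    {
            static const uint32_t rcon[10] = {
                    0x01000000, 0x02000000, 0x04000000, 0x08000000,
                    0x10000000, 0x20000000, 0x40000000, 0x80000000,
                    0x1b000000, 0x36000000,
            };

            for (int i = 0; i < 4; i++)
                    rk[i] = ((uint32_t)key[4*i] << 24) |
                        ((uint32_t)key[4*i + 1] << 16) |
                        ((uint32_t)key[4*i + 2] << 8) | key[4*i + 3];
            for (int i = 4; i < 44; i++) {
                    uint32_t t = rk[i - 1];

                    if (i % 4 == 0) {
                            t = (t << 8) | (t >> 24);           /* RotWord */
                            t = ((uint32_t)aes_sbox[t >> 24] << 24) |
                                ((uint32_t)aes_sbox[(t >> 16) & 0xff] << 16) |
                                ((uint32_t)aes_sbox[(t >> 8) & 0xff] << 8) |
                                aes_sbox[t & 0xff];             /* SubWord */
                            t ^= rcon[i/4 - 1];
                    }
                    rk[i] = rk[i - 4] ^ t;
            }
    }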
Limited to amd64 for now. In principle, AES-NI should work in 32-bit
mode, and there may even be some 32-bit-only CPUs that support
AES-NI, but that requires work to adapt the assembly.
1. Rip out old variable-time reference implementation.
2. Replace it by BearSSL's constant-time 32-bit logic.
=> Obtained from commit dda1f8a0c46e15b4a235163470ff700b2f13dcc5.
=> We could conditionally adopt the 64-bit logic too, which would
likely give a modest performance boost on 64-bit platforms
without AES-NI, but that's a bit more trouble.
3. Select the AES implementation at boot-time; allow an MD override.
=> Use self-tests to verify basic correctness at boot.
=> The implementation selection policy is rather rudimentary at
the moment but it is isolated to one place so it's easy to
change later on.
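Concretely, the policy can be as simple as walking a table of candidate ops vectors; a hypothetical sketch (illustrative names, not necessarily the kernel's structures):

    #include <stddef.h>
    #include <stdint.h>

    struct aes_impl {
            const char *ai_name;
            int (*ai_probe)(void);      /* 0 if usable on this CPU */
            int (*ai_selftest)(void);   /* 0 if the KAT vectors pass */
            void (*ai_enc)(const void *ctx, const uint8_t in[16],
                uint8_t out[16]);
    };

    extern const struct aes_impl aes_ni_impl, aes_sse2_impl, aes_ct_impl;

    static const struct aes_impl *const aes_impls[] = {
            &aes_ni_impl, &aes_sse2_impl, &aes_ct_impl,  /* best first */
    };

    /* Called once at boot; MD code can override the result. */
    static const struct aes_impl *
    aes_select(void)
    {
            for (size_t i = 0; i < sizeof(aes_impls)/sizeof(aes_impls[0]); i++)
                    if (aes_impls[i]->ai_probe() == 0 &&
                        aes_impls[i]->ai_selftest() == 0)
                            return aes_impls[i];
            return NULL;    /* caller falls back as it sees fit */
    }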
This (a) plugs a host of timing attacks on, e.g., cgd, and (b) paves
the way to take advantage of CPU support for AES -- both things we
should've done a decade ago. Downside: Computing AES now takes 2-3x
the CPU time, but that's what the coming hardware support is for.
Rudimentary measurement of performance impact done by:
mount -t tmpfs tmpfs /tmp
dd if=/dev/zero of=/tmp/disk bs=1m count=512
vnconfig -cv vnd0 /tmp/disk
cgdconfig -s cgd0 /dev/vnd0 aes-cbc 256 < /dev/zero
dd if=/dev/rcgd0d of=/dev/null bs=64k
dd if=/dev/zero of=/dev/rcgd0d bs=64k
The AES-CBC encryption performance impact is closer to 3x because it
is inherently sequential; the AES-CBC decryption impact is closer to
2x because the bitsliced AES logic can process two blocks at once.
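The asymmetry follows from the CBC equations: encryption computes C[i] = E(P[i] ^ C[i-1]), so each call must wait for the previous ciphertext, while decryption computes P[i] = D(C[i]) ^ C[i-1], in which every D(C[i]) is independent. A sketch of pairwise CBC decryption around a hypothetical two-block bitsliced core (odd-tail handling elided):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical bitsliced core decrypting two blocks at once. */
    extern void aes_dec2(uint8_t b0[16], uint8_t b1[16]);

    static void
    xor16(uint8_t *x, const uint8_t *y)
    {
            for (int i = 0; i < 16; i++)
                    x[i] ^= y[i];
    }

    static void
    cbc_dec_pairs(uint8_t *buf, size_t nblk, const uint8_t iv[16])
    {
            uint8_t prev[16], c0[16], c1[16];

            memcpy(prev, iv, 16);
            for (size_t i = 0; i + 1 < nblk; i += 2) {
                    uint8_t *b0 = buf + 16*i, *b1 = b0 + 16;

                    memcpy(c0, b0, 16);   /* save ciphertext for chaining */
                    memcpy(c1, b1, 16);
                    aes_dec2(b0, b1);     /* both D() run in parallel */
                    xor16(b0, prev);      /* P[i]   = D(C[i])   ^ C[i-1] */
                    xor16(b1, c0);        /* P[i+1] = D(C[i+1]) ^ C[i]   */
                    memcpy(prev, c1, 16);
            }
    }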
Discussed on tech-kern:
https://mail-index.NetBSD.org/tech-kern/2020/06/18/msg026505.html
Benefits:
- larger seeds -- a 128-bit key alone is not enough for `128-bit security'
- better resistance to timing side channels than AES
- a better-understood security story (https://eprint.iacr.org/2018/349)
- no loss in compliance with US government standards that nobody ever
got fired for choosing, at least in the US-dominated western world
- no dirty endianness tricks
- self-tests
Drawbacks:
- performance hit: throughput is reduced to about 1/3 in naive measurements
=> possible to mitigate by using hardware SHA-256 instructions
=> all you really need is 32 bytes to seed a userland PRNG anyway
=> if we just used ChaCha this would go away...
XXX pullup-7
XXX pullup-8
XXX pullup-9
implementation. Rewrite pseudodevice code to use cprng_strong(9).
The new pseudodevice is cloning, so each caller gets bits from a stream
generated with its own key. Users of /dev/urandom get their generators
keyed on a "best effort" basis -- the kernel will rekey generators
whenever the entropy pool hits the high water mark -- while users of
/dev/random get their generators rekeyed every time key-length bits
are output.
The underlying cprng_strong API can use AES-256 or AES-128, but we use
AES-128 because of concerns about related-key attacks on AES-256. This
improves performance (and reduces entropy pool depletion) significantly
for users of /dev/urandom but does cause users of /dev/random to rekey
twice as often.
Also fixes various bugs (including some missing locking and a reseed-counter
overflow in the CTR_DRBG code) found while testing this.
For long reads, this generator is approximately 20 times as fast as the
old generator (dd with bs=64K yields 53MB/sec on a 2GHz Core2 instead of
2.5MB/sec) and also uses a separate mutex per instance so concurrency
is greatly improved. For reads of typical key sizes for modern
cryptosystems (16-32 bytes) performance is about the same as the old
code: a little better for 32 bytes, a little worse for 16 bytes.
<20111022023242.BA26F14A158@mail.netbsd.org>. This change includes
the following:
An initial cleanup and minor reorganization of the entropy pool
code in sys/dev/rnd.c and sys/dev/rndpool.c. Several bugs are
fixed. Some effort is made to accumulate entropy more quickly at
boot time.
A generic interface, "rndsink", is added, for stream generators to
request that they be re-keyed with good quality entropy from the pool
as soon as it is available.
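In shape, a sink is just a registered callback plus its argument; a hypothetical sketch of the idea (illustrative declarations, not the exact kernel ones):

    #include <stddef.h>

    /* A generator registers one of these; the entropy pool invokes the
     * callback with len bytes of fresh seed material once available. */
    struct rndsink {
            void (*rs_callback)(void *arg, const void *seed, size_t len);
            void *rs_arg;
    };

    void rndsink_request(struct rndsink *);   /* queue a rekey request */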
The arc4random()/arc4randbytes() implementation in libkern is
adjusted to use the rndsink interface for rekeying, which helps
address the problem of low-quality keys at boot time.
An implementation of the FIPS 140-2 statistical tests for random
number generator quality is provided (libkern/rngtest.c). This
is based on Greg Rose's implementation from Qualcomm.
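As a flavor of what these tests check, the FIPS 140-2 monobit test counts the one bits in a 20,000-bit sample and requires the count to fall strictly between 9,725 and 10,275; a minimal sketch:

    #include <stdint.h>

    /* FIPS 140-2 monobit test over a 2500-byte (20,000-bit) sample.
     * Returns 0 on pass, -1 on failure. */
    static int
    fips_monobit(const uint8_t buf[2500])
    {
            int ones = 0;

            for (int i = 0; i < 2500; i++)
                    for (int b = 0; b < 8; b++)
                            ones += (buf[i] >> b) & 1;
            return (ones > 9725 && ones < 10275) ? 0 : -1;
    }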
A new random stream generator, nist_ctr_drbg, is provided. It is
based on an implementation of the NIST SP800-90 CTR_DRBG by
Henric Jungheim. This generator uses AES in a modified counter
mode to generate a backtracking-resistant random stream.
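Roughly, the generate step runs AES in counter mode and then replaces its own key and counter with further cipher output, which is what provides backtracking resistance. A heavily simplified sketch (no derivation function, additional input, or reseed counter; aes128_enc is an assumed primitive, not the kernel's interface):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    extern void aes128_enc(uint8_t out[16], const uint8_t in[16],
        const uint8_t key[16]);             /* assumed primitive */

    struct ctr_drbg {
            uint8_t key[16], v[16];
    };

    static void
    incr128(uint8_t v[16])                  /* V = V + 1 mod 2^128 */
    {
            for (int i = 15; i >= 0 && ++v[i] == 0; i--)
                    continue;
    }

    static void
    drbg_generate(struct ctr_drbg *d, uint8_t *out, size_t len)
    {
            uint8_t blk[16], nk[16], nv[16];

            while (len > 0) {               /* counter-mode keystream */
                    size_t n = len < 16 ? len : 16;

                    incr128(d->v);
                    aes128_enc(blk, d->v, d->key);
                    memcpy(out, blk, n);
                    out += n; len -= n;
            }
            /* Update: derive a fresh key and counter from the old state,
             * so a later state compromise reveals nothing about output
             * already produced (backtracking resistance). */
            incr128(d->v); aes128_enc(nk, d->v, d->key);
            incr128(d->v); aes128_enc(nv, d->v, d->key);
            memcpy(d->key, nk, 16);
            memcpy(d->v, nv, 16);
    }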
An abstraction layer, "cprng", is provided for in-kernel consumers
of randomness. The arc4random/arc4randbytes API is deprecated for
in-kernel use. It is replaced by "cprng_strong". The current
cprng_fast implementation wraps the existing arc4random
implementation. The current cprng_strong implementation wraps the
new CTR_DRBG implementation. Both interfaces are rekeyed from
the entropy pool automatically at intervals justifiable from best
current cryptographic practice.
In some quick tests, cprng_fast() is about the same speed as
the old arc4randbytes(), and cprng_strong() is about 20% faster
than rnd_extract_data(). Performance is expected to improve.
The AES code in src/crypto/rijndael is no longer an optional
kernel component, as it is required by cprng_strong, which is
not an optional kernel component.
The entropy pool output is subjected to the rngtest tests at
startup time; if it fails, the system will reboot. There is
approximately a 3/10000 chance of a false positive from these
tests. Entropy pool _input_ from hardware random numbers is
subjected to the rngtest tests at attach time, as well as the
FIPS continuous-output test, to detect bad or stuck hardware
RNGs; if any are detected, they are detached, but the system
continues to run.
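The continuous-output test is the simplest of these: each block delivered by the hardware is compared with its predecessor, and an exact repeat is treated as a failure. A minimal sketch (16-byte blocks assumed here; the required block size is a parameter of the test):

    #include <stdint.h>
    #include <string.h>

    /* FIPS continuous-output test: a hardware RNG must never emit the
     * same block twice in a row. Returns 0 if ok, -1 on a repeat. */
    static int
    fips_continuous(uint8_t prev[16], const uint8_t cur[16])
    {
            int bad = (memcmp(prev, cur, 16) == 0);

            memcpy(prev, cur, 16);
            return bad ? -1 : 0;
    }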
A problem with rndctl(8) is fixed -- data structures with
pointers in arrays are no longer passed to userspace (this
was not a security problem, but rather a major issue for
compat32). A new kernel will require a new rndctl.
The sysctl kern.arandom() and kern.urandom() nodes are hooked
up to the new generators, but the /dev/*random pseudodevices
are not, yet.
Manual pages for the new kernel interfaces are forthcoming.
XXX: We still install rmd160.h and sha2.h in /usr/include/crypto, unlike
the other hash functions which get installed in /usr/include for compatibility.
the old one. Rename the functions/structures from cast_* to cast128_*.
Adapt the KAME IPsec to use the new CAST-128 code, which has a simpler
API and smaller footprint.
Define an attribute for each crypto algorithm, and use that attribute
to select the files that implement the algorithm.
* Give the "wlan" attribute a dependency on the "arc4" attribute.
* Give the "cgd" pseudo-device the "des", "blowfish", "cast128", and
"rijndael" attributes.
* Use the new attribute-as-option-dependencies feature of config(8) to
give the IPSEC_ESP option dependencies on the "des", "blowfish", "cast128",
and "rijndael" attributes.
to const BF_KEY * vars, and I chose to ``fix'' it in this file
rather than elsewhere in the framework because, although the other
fix was more appropriate, nothing seems to use the code in this
file and hence the risk of disrupting other people was lower. In
the future, the more appropriate change would be to change blowfish.h
and bf_enc.c to have functions with signatures:
BF_encrypt(BF_LONG *, const BF_KEY *);
BF_decrypt(BF_LONG *, const BF_KEY *);
This also involved updating the in-kernel DES functions to correspond
to the versions in our in-tree OpenSSL, because the des_SPtrans table
has changed; the asm code will not work with the old permutation table!
C and i386 asm code for the DES, 3DES, and Blowfish CBC modes is also
included; it is not currently built as the ESP processing in esp_core.c
splits the CBC operation and the cipher transform apart. Hopefully that
will be fixed as there is a substantial performance improvement to be had
from doing so. It will remain necessary to use the C version of the
Blowfish CBC function on some i386 machines, however, as the asm version
uses bswapl, which only 486 and later processors have. The DES CBC code
doesn't have this problem.
Finally, change esp_core.c to use the ecb3_encrypt function instead of
calling ecb_encrypt three times; this improves performance a bit, in
particular in the asm case.
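The point of ecb3_encrypt is that 3DES-EDE is three single-DES passes over the same block, E(k1) then D(k2) then E(k3), and a fused routine loads and stores the block once instead of three times. Schematically, with a hypothetical single-DES helper:

    #include <stdint.h>

    /* Hypothetical single-DES primitive: one ECB pass over a block. */
    extern void des_pass(uint8_t blk[8], const void *ks, int encrypt);

    /* 3DES-EDE as three single-DES passes. A fused ecb3 routine does
     * exactly this work, but keeps blk in registers throughout and
     * pays the call overhead once. */
    static void
    ede3(uint8_t blk[8], const void *ks1, const void *ks2, const void *ks3)
    {
            des_pass(blk, ks1, 1);  /* encrypt with k1 */
            des_pass(blk, ks2, 0);  /* decrypt with k2 */
            des_pass(blk, ks3, 1);  /* encrypt with k3 */
    }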
word32 pointers to access data stored in word8 arrays:
* align transformation tables on 32-bit boundaries,
* align key schedule on 32-bit boundary,
* align temporaries on 32-bit boundaries,
* align plaintext and ciphertext used in round transformations on 32-bit
boundaries.
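A common way to arrange such alignment, as a small sketch:

    #include <stdint.h>

    /* Overlay the byte array and the 32-bit array in a union, so the
     * compiler aligns the whole object for the stricter member and
     * the word32 view is always safe. */
    union aes_block {
            uint8_t b[16];      /* word8 view */
            uint32_t w[4];      /* word32 view, 32-bit aligned */
    };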
- include string.h (instead of sys/systm.h) on userland compilation, to
make compilation under src/regress/sys/crypto happier. From minoura.
- (blowfish) KNF.
- use u_int32_t for unsigned 32-bit integer quantities.
- s/unsigned long/BF_LONG/ (BF_LONG = u_int32_t) where appropriate.
- prototype cleanup - due to *BSD code sharing, we are still using __P().
Part of PR 10918; sync with KAME.
arc4 implementation by Kalle Kaukonen has been added.
define "wlan" in files.
XXX: only awi depends on wlan for now.
Allow authentication for adhoc (IBSS) mode.
Disable adhoc mode without bssid (mediaopt adhoc,flag0) for FH radio.
FH cannot work without synchronization by beacons.
Align IP header for Ethernet encapsulation (IFF_FLAG0) mode.
Print available access points for IFF_DEBUG.