Primary goals:
1. Use cryptography primitives designed and vetted by cryptographers.
2. Be honest about entropy estimation.
3. Propagate full entropy as soon as possible.
4. Simplify the APIs.
5. Reduce overhead of rnd_add_data and cprng_strong.
6. Reduce side channels of HWRNG data and human input sources.
7. Improve visibility of operation with sysctl and event counters.
Caveat: rngtest is no longer used generically for RND_TYPE_RNG
rndsources. Hardware RNG devices should have hardware-specific
health tests. For example, checking for two repeated 256-bit outputs
works to detect AMD's 2019 RDRAND bug. Not all hardware RNGs are
necessarily designed to produce exactly uniform output.
ENTROPY POOL
- A Keccak sponge, with test vectors, replaces the old LFSR/SHA-1
kludge as the cryptographic primitive.
- `Entropy depletion' is available for testing purposes with a sysctl
knob kern.entropy.depletion; otherwise it is disabled, and once the
system reaches full entropy it is assumed to stay there as far as
modern cryptography is concerned.
- No `entropy estimation' based on sample values. Such `entropy
estimation' is a contradiction in terms, dishonest to users, and a
potential source of side channels. It is the responsibility of the
driver author to study the entropy of the process that generates
the samples.
- Per-CPU gathering pools avoid contention on a global queue.
- Entropy is occasionally consolidated into the global pool -- as soon as
it's ready, if we've never reached full entropy, and with a rate
limit afterward. Operators can force consolidation now by running
sysctl -w kern.entropy.consolidate=1.
- rndsink(9) API has been replaced by an epoch counter which changes
whenever entropy is consolidated into the global pool.
. Usage: Cache entropy_epoch() when you seed. If entropy_epoch()
  has changed when you're about to use whatever you seeded, reseed.
  (A sketch follows at the end of this list.)
. Epoch is never zero, so initialize cache to 0 if you want to reseed
on first use.
. Epoch is -1 iff we have never reached full entropy -- in other
words, the old rnd_initial_entropy is (entropy_epoch() != -1) --
but it is better if you check for changes rather than for -1, so
that if the system estimated its own entropy incorrectly, entropy
consolidation has the opportunity to prevent future compromise.
- Sysctls and event counters provide operator visibility into what's
happening:
. kern.entropy.needed - bits of entropy short of full entropy
. kern.entropy.pending - bits known to be pending in per-CPU pools,
can be consolidated with sysctl -w kern.entropy.consolidate=1
. kern.entropy.epoch - number of times consolidation has happened,
never 0, and -1 iff we have never reached full entropy
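The epoch pattern above, as a hedged C sketch -- entropy_epoch() is the
interface described above, but the consumer structure and its reseed
routine are invented for illustration:

    #include <sys/entropy.h>        /* entropy_epoch() */

    struct mything {
            unsigned mt_epoch;      /* 0 forces a reseed on first use,
                                       since the epoch is never 0 */
            uint8_t mt_key[32];
    };

    static void
    mything_reseed_if_stale(struct mything *m)
    {
            unsigned epoch = entropy_epoch();

            if (m->mt_epoch != epoch) {
                    mything_reseed(m);  /* invented: draw fresh key material */
                    m->mt_epoch = epoch;
            }
    }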
CPRNG_STRONG
- A cprng_strong instance is now a collection of per-CPU NIST
  Hash_DRBGs. There are only two in the system: user_cprng for
  /dev/urandom and sysctl kern.?random, and kern_cprng for kernel
  users which may need to operate in interrupt context up to IPL_VM;
  a usage sketch follows this list.
(Calling cprng_strong in interrupt context does not strike me as a
particularly good idea, so I added an event counter to see whether
anything actually does.)
- Event counters provide operator visibility into when reseeding
happens.
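A hedged usage sketch of drawing from kern_cprng (the flags argument is
passed as 0 here, on the assumption that no special behavior is wanted):

    #include <sys/cprng.h>

    uint8_t key[32];

    /* Fill key from the kernel's per-CPU Hash_DRBGs. */
    cprng_strong(kern_cprng, key, sizeof(key), 0);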
INTEL RDRAND/RDSEED, VIA C3 RNG (CPU_RNG)
- Unwired for now; will be rewired in a subsequent commit.
KLEAK was a nice feature and served its purpose; it allowed us to detect
dozens of info leaks on the kernel->userland boundary, and thanks to it we
tackled a good part of the infoleak problem 1.5 years ago.
Nowadays, however, we have kMSan, which can detect uninitialized memory in
the kernel. kMSan supersedes KLEAK: it can detect what KLEAK was able to
detect, but in addition, (1) it operates in all of the kernel and not just
the kernel->userland boundary, (2) it requires no user interaction, and (3)
it is deterministic and not statistical.
That makes kMSan the feature of choice to detect info leaks nowadays;
people interested in detecting info leaks should boot a kMSan kernel and
just wait for the magic to happen.
KLEAK was a good ride, and a fun project, but now it is time for it to go.
Discussed with several people, including Thomas Barabosch.
Keep the internal ptrace function names intact, as this code is shared
with core_elf32.c, which already references ptrace(2)-specific symbols.
No functional change intended.
Remove all #ifdef COREDUMP conditional compilation. Now, the
coredump module is completely separated from the emulation modules, and
they can all be independently loaded and unloaded.
Welcome to 9.99.18 !
kMSan detects uninitialized memory used by the kernel at run time, and
just like kASan and kCSan, it is an excellent feature. It has already
detected 38 uninitialized variables in the kernel during my testing,
which I have since discreetly fixed.
We use two shadows:
- "shad", to track uninitialized memory with a bit granularity (1:1).
Each bit set to 1 in the shad corresponds to one uninitialized bit of
real kernel memory.
- "orig", to track the origin of the memory with a 4-byte granularity
(1:1). Each uint32_t cell in the orig indicates the origin of the
associated uint32_t of real kernel memory.
The memory consumption of these shadows is significant, so at least 4GB of
RAM is recommended to run kMSan.
The compiler inserts calls to specific __msan_* functions on each memory
access, to manage both the shad and the orig and detect uninitialized
memory accesses that change the execution flow (like an "if" on an
uninitialized variable).
We mark several types of memory buffers as uninitialized (stack, pools,
kmem, malloc, uvm_km), and check each buffer passed to copyout, copyoutstr,
bwrite, if_transmit_lock and DMA operations, to detect uninitialized memory
that leaves the system. This allows us to detect kernel info leaks in a way
that is more efficient and also more user-friendly than KLEAK.
Unlike kASan, kMSan requires comprehensive coverage, i.e., we cannot
tolerate having even one non-instrumented function, because this could
cause false positives. kMSan cannot instrument ASM functions, so I converted
most of them to __asm__ inlines, which kMSan is able to instrument. Those
that remain receive special treatment.
Again unlike kASan, kMSan uses a TLS, so we must context-switch this
TLS during interrupts. We use different contexts depending on the interrupt
level.
The orig tracks precisely the origin of a buffer. We use a special encoding
for the orig values, and pack together in each uint32_t cell of the orig:
- a code designating the type of memory (Stack, Pool, etc), and
- a compressed pointer, which points either (1) to a string containing
the name of the variable associated with the cell, or (2) to an area
in the kernel .text section which we resolve to a symbol name + offset.
This encoding avoids consuming extra memory to associate information
with each cell, and produces precise output that can tell, for example,
the name of an uninitialized variable on the stack, the
function in which it was pushed on the stack, and the function where we
accessed this uninitialized variable.
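A hypothetical illustration of such an encoding -- the type codes, base
address, and bit split here are invented, not kMSan's actual layout:

    #define MSAN_PTR_BASE   0xffffffff80000000UL  /* assumed base address */
    #define MSAN_TYPE_STACK 0x1U
    #define MSAN_TYPE_POOL  0x2U

    /* Pack a 2-bit type code with a 30-bit compressed pointer. */
    static inline uint32_t
    msan_orig_encode(uint32_t type, uintptr_t addr)
    {
        return (type << 30) |
            ((uint32_t)(addr - MSAN_PTR_BASE) & 0x3fffffffU);
    }

    static inline uintptr_t
    msan_orig_decode_ptr(uint32_t orig)
    {
        return MSAN_PTR_BASE + (orig & 0x3fffffffU);
    }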
kMSan is available with LLVM, but not with GCC.
The code is organized in a way that is similar to kASan and kCSan, so
architectures other than amd64 can be supported.
kCSan detects race conditions at runtime. It is a variation of TSan
that is easy to implement and more suited to kernel internals, albeit
theoretically less precise than TSan's happens-before.
We do basically two things:
- On every KCSAN_NACCESSES (=2000) memory accesses, we create a cell
describing the access, and delay the calling CPU (10ms).
- On all memory accesses, we check whether the memory we're
  reading/writing is already referenced by a cell.
The combination of the two means that, if for example cpu0 does a read that
is selected and cpu1 does a write at the same address, kCSan will fire,
because cpu1's write collides with cpu0's read cell.
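A hedged sketch of that scheme -- the cell functions and the counter are
invented, and the real implementation keeps per-CPU state:

    #define KCSAN_NACCESSES 2000

    static void
    kcsan_access(uintptr_t addr, size_t size, bool write)
    {
        static unsigned counter;    /* per-CPU in reality */
        struct kcsan_cell *cell;

        if (++counter % KCSAN_NACCESSES == 0) {
            /* Selected access: publish a cell, then wait for a collision. */
            cell = kcsan_cell_create(addr, size, write);
            DELAY(10000);           /* 10ms */
            kcsan_cell_destroy(cell);
        } else if ((cell = kcsan_cell_lookup(addr, size, write)) != NULL) {
            kcsan_report(cell, addr, size, write);  /* racing access */
        }
    }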
The coverage of the instrumentation is the same as that of kASan. Also, the
code is organized in a way similar to kASan, so it is easy to add support
for more architectures than amd64. kCSan is compatible with KCOV.
Reviewed by Kamil.
The KCOV driver implements collection of code coverage inside the kernel.
It can be enabled on a per-process basis from userland, allowing the kernel
program counter to be collected during syscalls triggered by the same
process.
The device is oriented towards kernel fuzzers, in particular syzkaller.
Currently the only supported coverage type is -fsanitize-coverage=trace-pc.
The KCOV driver was initially developed in Linux. A driver based on the
same concept was then implemented in FreeBSD and OpenBSD.
Documentation is borrowed from OpenBSD and ATF tests from FreeBSD.
This patch has been prepared by Siddharth Muralee, improved by <maxv>
and polished by myself before importing into the mainline tree.
All ATF tests pass.
Pass -DACPI_MISALIGNMENT_NOT_SUPPORTED when kUBSan is enabled. This option
is dedicated to alignment-sensitive CPUs in acpica. It was originally
designed for Itanium CPUs, but nowadays it's wanted for aarch64 as well.
Define it in acpica code under kUBSan in order to pacify Undefined Behavior
reports on all ports (in particular x86). The number of reports is now
halved with this patch applied. The remaining alignment alarms in acpica
will be addressed in the future.
Patch contributed by <Akul Pillai>
So far, only basic panic() and null-deref nodes are added.
with options DEBUG, one can now use:
# sysctl -w kern.crashme_enable=1
# sysctl -w kern.crashme.panic=1
# sysctl -w kern.crashme.null_deref=1
to trigger a crash. crashme_enable must be set to 1 before any
of the nodes will be writeable.
supports dynamic addition/removal of crashme nodes.
(obsoletes kern.panic_now, which will be removed later.)
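A hedged sketch of adding a node dynamically -- the field and function
names follow my reading of the interface; treat them as assumptions:

    #include <sys/crashme.h>

    static int
    mydrv_crash(int flags)
    {
        panic("crashme: mydrv");
    }

    static crashme_node mydrv_node = {
        .cn_name = "mydrv",
        .cn_longname = "panic from mydrv",
        .cn_fn = mydrv_crash,
    };

    /* at attach time ... */
    crashme_add(&mydrv_node);
    /* ... and at detach time */
    crashme_remove(&mydrv_node);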
threadpool(9) provides shared pools of kernel threads running at
specific priorities, with support for unbound pools and per-CPU pools.
Written by riastradh@, and based on the May 2014 draft, with a few changes
by me:
- Working on the assumption that relatively few priorities will actually
  be used, reduce the memory footprint by using linked lists rather than
  2 large (and mostly empty) tables. The performance impact is essentially
nil, since these lists are consulted only when pools are created (and
destroyed, for DIAGNOSTIC checks), and the lists will have at most 225
entries.
- Make threadpool job object, which the caller must allocate storage for,
really opaque.
- Use typedefs for the threadpool types, to reduce the verbosity of the
API somewhat.
- Fix a bunch of pool / worker thread / job object lifecycle bugs.
Also include an ATF unit test, written by me, that exercises the basics
of the API: it loads a kernel module exposing several sysctls, which the
test script uses to create and destroy threadpools, schedule a basic
job, and verify that it ran.
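A hedged usage sketch -- the job body and names are invented; note that
the job function must mark itself done under the job lock:

    #include <sys/threadpool.h>

    static kmutex_t mylock;
    static struct threadpool_job myjob;

    static void
    myjob_fn(struct threadpool_job *job)
    {
        do_the_work();              /* invented workload */
        mutex_enter(&mylock);
        threadpool_job_done(job);
        mutex_exit(&mylock);
    }

    void
    example(void)
    {
        struct threadpool *pool;

        mutex_init(&mylock, MUTEX_DEFAULT, IPL_NONE);
        threadpool_job_init(&myjob, myjob_fn, &mylock, "myjob");
        if (threadpool_get(&pool, PRI_NONE) != 0)
            return;
        threadpool_schedule_job(pool, &myjob);
        /* ... once the job has completed: */
        threadpool_job_destroy(&myjob);
        threadpool_put(pool, PRI_NONE);
    }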
And thus NetBSD 8.99.29 has arrived.
thmap(9) is a concurrent map that combines elements of hashing and a
radix trie. It supports lock-free lookups and concurrent
inserts/deletes. It is designed to be optimal as a general-purpose
*concurrent* associative array.
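A hedged usage sketch per the upstream API (the key and value here are
invented):

    #include "thmap.h"

    void
    example(void)
    {
        thmap_t *map = thmap_create(0, NULL, 0);
        int val = 42, *p;

        thmap_put(map, "key", 3, &val); /* returns existing value, if any */
        p = thmap_get(map, "key", 3);   /* lock-free lookup */
        (void)p;
        thmap_del(map, "key", 3);
        thmap_destroy(map);
    }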
Upstream: https://github.com/rmind/thmap
Discussed on tech-kern@
KLEAK works by tainting memory sources with marker values, letting the data
travel through the kernel, and scanning the kernel<->user frontier for
these marker values. Combined with compiler instrumentation and rotation
of the markers, it is able to yield relevant results with little effort.
We taint the pools and the stack, and scan copyout/copyoutstr. KLEAK is
supported on amd64 only for now, but it is not complicated to add more
architectures (just a matter of having the address of .text, and a stack
unwinder).
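A hedged illustration of the frontier scan -- the names are invented,
and the real scanner also rotates the marker and records stack traces:

    #include <sys/systm.h>          /* memcmp */

    static uint32_t kleak_marker;   /* rotated periodically */

    static void
    kleak_scan(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        size_t i;

        for (i = 0; i + sizeof(kleak_marker) <= len; i++) {
            if (memcmp(&p[i], &kleak_marker, sizeof(kleak_marker)) == 0)
                kleak_note_leak(buf, len, i);   /* invented reporter */
        }
    }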
A userland tool is provided that allows executing a command in rounds
and monitoring the leaks generated all the while.
KLEAK has already directly detected 12 kernel info leaks, and prompted
changes that in total fixed 25+ leaks.
Based on an idea developed jointly with Thomas Barabosch (of Fraunhofer
FKIE).
The MD parts of KASAN move into machine/asan.h, which contains the MD
functions. We use an include rather than a plain C file, because we want
GCC to optimize/inline some functions into one single block.
The amd64 MD parts of KASAN are moved accordingly.
The naming convention we use is:
kasan_*
a generic kasan object, declared in subr_asan.c
kasan_md_*
an MD kasan object, declared in machine/asan.h, and used
in subr_asan.c
__md_*
an MD object, declared in machine/asan.h, and not used
outside
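For illustration, a sketch of how the names divide -- the shadow
arithmetic is invented; only the placement of the names matters:

    /* machine/asan.h */
    static inline int8_t *
    kasan_md_addr_to_shad(const void *addr)
    {
        return __md_shadow_base + (((vaddr_t)addr - __md_shadow_off) >> 3);
    }

    /* subr_asan.c */
    static void
    kasan_shadow_check(const void *addr, size_t size)
    {
        int8_t *shad = kasan_md_addr_to_shad(addr);

        if (*shad != 0)
            kasan_report(addr, size);   /* invented reporter */
    }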
Overall this makes it easier to add KASAN support on more architectures.
Discussed with several people.
KMEM_REDZONE is not very efficient and cannot detect read overflows. KASAN
can, and will be used instead.
KMEM_POISON is enabled along with KMEM_GUARD, but it is redundant, since
the latter can detect read UAFs contrary to the former. In fact maybe
KMEM_GUARD should be retired too, because there are many cases where it
doesn't apply.
Simplifies the code.
This change:
* Removes "options PERFCTRS", the associated includes, and the associated
ifdefs. In doing so, it removes several XXXSMPs in the MI code, which is
good.
* Removes the PMC code of ARM XSCALE.
* Removes all the pmc.h files. They were all empty, except for ARM XSCALE.
* Reorders the x86 PMC code not to rely on the legacy pmc.h file. The
definitions are put in sysarch.h.
* Removes the kern/sys_pmc.c file, and along with it, the sys_pmc_control
and sys_pmc_get_info syscalls. They are marked as OBSOL in kern,
netbsd32 and rump.
* Removes the pmc_evid_t and pmc_ctr_t types.
* Removes all the associated man pages. The sets are marked as obsolete.
Split the ptrace(2) code into three parts:
1 - ptrace(2) syscall for native emulation
2 - common ptrace(2) syscall code (shared with compat_netbsd32)
3 - support routines that are shared with PROCFS and/or KTRACE
* Add module glue for #1 and #2. Both modules will be built-in to the
kernel if "options PTRACE" is included in the config file (this is
the default, defined in sys/conf/std).
* Mark the ptrace(2) syscall as modular in syscalls.master (generated
files will be committed shortly).
* Conditionalize all remaining portions of PTRACE code on a new kernel
option PTRACE_HOOKS.
XXX Instead of PROCFS depending on 'options PTRACE', we should probably
just add a procfs attribute to the sys/kern/sys_process.c file's
entry in files.kern, and add PROCFS to the "#if defineds" for
process_domem(). It's really confusing to have two different ways
of requiring this file.
zlib is now built as a separate module, rather than included in kernels
with KDTRACE_HOOKS defined. Update
the dtrace_fbt module to depend on the zlib module.
Bump kernel version to avoid module mismatch.
Welcome to 7.99.38 !
Discussed on tech-kern:
https://mail-index.netbsd.org/tech-kern/2016/01/24/msg020069.html
API is still experimental and likely to change. (Obvious changes:
either remove extra arguments everywhere, or shrink psref_target to a
single bit, at the expense of possibly valuable diagnostic checks.)
Should do some real testing before we use this in anger!
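A hedged usage sketch against the draft API -- the object layout and
lookup are invented, and the signatures may change as noted above:

    #include <sys/psref.h>
    #include <sys/pserialize.h>

    struct psref_class *myclass;    /* psref_class_create("my", IPL_SOFTNET) */

    struct myobj {
        struct psref_target mo_target;
        /* ... */
    };

    void
    example(struct myobj *obj)
    {
        struct psref psref;
        int s;

        s = pserialize_read_enter();
        /* ... look up obj in a pserialize-protected structure ... */
        psref_acquire(&psref, &obj->mo_target, myclass);
        pserialize_read_exit(s);

        /* obj stays alive; we may sleep, but must stay on this CPU. */

        psref_release(&psref, &obj->mo_target, myclass);
    }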
For monolithic kernels, both modules will be compiled as "built-ins",
while modular environments will be able to load the SYSVSEM, SYSVSHM,
and SYSVMSG code independent from the rest of compat.
This is a necessary precursor step to making the "STD" SYSV* code
into a separate module.
Tested in both monolithic and modular environments with no errors
seen.
- move the syncer into kern/vfs_subr.c.
- change the syncer to process the mountlist and VFS_SYNC as appropriate.
- use an API for mount points similar to the API for vnodes:
- vfs_syncer_add_to_worklist(struct mount *mp) to add
- vfs_syncer_remove_from_worklist(struct mount *mp) to remove a mount.
No objections on tech-kern@
Merge common disk driver functionality in ld.c with dksubr.c.
Adjust the two previous users of dk_intf (cgd and xbd) to
the changes.
This file was missing from the commit.