- Adapt to cpu_need_resched() changes. Avoid lost & duplicate IPIs and ASTs.
sched_resched_cpu() and sched_resched_lwp() contain the logic for this.
- Changes for LSIDL to make the locking scheme match the intended design.
- Reduce lock contention and false sharing further.
- Numerous small bugfixes, including some corrections for SCHED_FIFO/RT.
- Use setrunnable() in more places, and merge cut & pasted code.
- Increase the maximum number of clusters from 32 to 64 for large systems.
kcpuset_t could potentially be used here but that's an excursion I don't
want to go on right now. uint32_t -> uint64_t is very simple.
- In the case of a non-blocking select/poll, or where we won't block
because there are events ready to report, stop registering interest in
the back-end objects early.
- Change the wmesg for poll back to "poll".
- Don't drop the kernel priority boost early, as that hurts interactive
  response. It should only be dropped on final return to user.
- Clear l_dopreempt with atomics and add some comments around concurrency.
- Hold proc_lock over the lightning bolt and loadavg calc, no reason not to.
- cpu_did_preempt() is useless - don't call it. Will remove soon.
- Avoid false sharing.
- Make the turnstile hash function more suitable.
- Increase turnstile hash table size.
- Make amends by having only one set of system wide sleep queue hash locks.
Split the coredump code out into its own module and remove all #ifdef
COREDUMP conditional compilation. Now, the coredump module is completely
separated from the emulation modules, and they can all be independently
loaded and unloaded.
Welcome to 9.99.18!
Changing it as the comment suggests would be a terrible idea due to the
common usage of this variable.
Returning only 32 or 64 bits also seems to be the purpose of KERN_URND,
so that functionality is already present.
We keep the fsanitize flag on subr_kcov.c, which means that kMSan will instrument KCOV.
We add a bunch of __nomsan attributes to reduce this instrumentation, but
it does not remove it completely. That's fine.
Add support for the Kernel Memory Sanitizer (kMSan). It detects uninitialized
memory used by the kernel at run time, and just like kASan and kCSan, it
is an excellent feature. It has already detected 38 uninitialized variables
in the kernel during my testing, which I have since discreetly fixed.
We use two shadows:
- "shad", to track uninitialized memory with a bit granularity (1:1).
Each bit set to 1 in the shad corresponds to one uninitialized bit of
real kernel memory.
- "orig", to track the origin of the memory with a 4-byte granularity
(1:1). Each uint32_t cell in the orig indicates the origin of the
associated uint32_t of real kernel memory.
The memory consumption of these shadows is significant, so at least 4GB of
RAM is recommended to run kMSan.
The compiler inserts calls to specific __msan_* functions on each memory
access, to manage both the shad and the orig and detect uninitialized
memory accesses that change the execution flow (like an "if" on an
uninitialized variable).
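For illustration, a minimal sketch of the class of bug this catches
(hypothetical code, not from the tree):

    void
    example(void)
    {
        int dirty;          /* never initialized */

        if (dirty)          /* kMSan fires: the branch depends on
                             * uninitialized memory */
            printf("oops\n");
    }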
We mark as uninit several types of memory buffers (stack, pools, kmem,
malloc, uvm_km), and check each buffer passed to copyout, copyoutstr,
bwrite, if_transmit_lock and DMA operations, to detect uninitialized memory
that leaves the system. This allows us to detect kernel info leaks in a way
that is more efficient and also more user-friendly than KLEAK.
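A typical leak of this kind looks like the following sketch (hypothetical
struct and names, assuming the usual copyout(9) signature): the
compiler-inserted padding bytes are never initialized and escape to
userland.

    struct foo {
        uint8_t  f_type;    /* 3 bytes of padding follow */
        uint32_t f_value;
    };

    int
    leak(void *uaddr)
    {
        struct foo f;

        f.f_type = 1;
        f.f_value = 42;
        /* kMSan fires here: the padding bytes are uninitialized */
        return copyout(&f, uaddr, sizeof(f));
    }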
Unlike kASan, kMSan requires comprehensive coverage, i.e. we cannot
tolerate having even one non-instrumented function, because this could
cause false positives. kMSan cannot instrument ASM functions, so I
converted most of them to __asm__ inlines, which kMSan is able to
instrument. Those that remain receive special treatment.
Again unlike kASan, kMSan uses a TLS, so we must context-switch this TLS
during interrupts. We use different contexts depending on the interrupt
level.
The orig tracks precisely the origin of a buffer. We use a special encoding
for the orig values, and pack together in each uint32_t cell of the orig:
- a code designating the type of memory (Stack, Pool, etc), and
- a compressed pointer, which points either (1) to a string containing
the name of the variable associated with the cell, or (2) to an area
in the kernel .text section which we resolve to a symbol name + offset.
This encoding allows us to avoid consuming extra memory to associate
information with each cell, and produces precise output that can tell,
for example, the name of an uninitialized variable on the stack, the
function in which it was pushed on the stack, and the function where we
accessed this uninitialized variable.
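As a sketch of the idea only (the actual field widths, type codes and
compression scheme in the kMSan code may differ):

    /* Hypothetical layout: a 4-bit type code plus a 28-bit pointer
     * compressed against the kernel base address. */
    static inline uint32_t
    kmsan_orig_encode(uint32_t type, vaddr_t ptr)
    {
        return (type << 28) | ((uint32_t)(ptr - KERNBASE) & 0x0fffffff);
    }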
kMSan is available with LLVM, but not with GCC.
The code is organized in a way that is similar to kASan and kCSan, so
architectures other than amd64 can be supported.
_lwp_self() remains invariant as necessary for the locking in the
dynamic linker. Otherwise if a process creates a thread and forks from
it, the main thread of the parent would share the LWP ID of the main
thread of the child, even though they have different origins.
Partial fix for pkg/54192.
The label is searched for at every 4-byte offset and can be found at an
unaligned location. Before any operations on it, copy it to a properly
aligned local copy on the stack.
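The fix amounts to something like this sketch (variable names
hypothetical):

    struct disklabel dl;    /* properly aligned local copy */

    /* dlp may point at any 4-byte offset within the scanned sector */
    memcpy(&dl, dlp, sizeof(dl));
    /* now operate on &dl instead of the unaligned dlp */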
This is a missing part of the following change:
revision 1.108
date: 2011-01-18 20:52:24 +0100; author: matt; state: Exp; lines: +2 -1;
Make struct disklabel 8 byte aligned. This increases its size by 4 bytes
on ILP32 platforms so add code in sys_ioctl (and netbsd32_ioctl) to deal
with the older/smaller disklabel size. This change makes disklabel the
same for both ILP32 and LP64 platforms.
OK by <martin>
Add support for the Kernel Concurrency Sanitizer (kCSan). It allows us
to detect race conditions at runtime. It is a variation of TSan that is
easy to implement and more suited to kernel internals, albeit theoretically
less precise than TSan's happens-before.
We do basically two things:
- On every KCSAN_NACCESSES (=2000) memory accesses, we create a cell
describing the access, and delay the calling CPU (10ms).
- On all memory accesses, we verify if the memory we're reading/writing
is referenced in a cell already.
The combination of the two means that, if for example cpu0 does a read that
is selected and cpu1 does a write at the same address, kCSan will fire,
because cpu1's write collides with cpu0's read cell.
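Schematically, each instrumented access does something like the following
(function names hypothetical, not the actual implementation):

    static volatile unsigned kcsan_count;

    static void
    kcsan_access(uintptr_t addr, size_t size, bool write)
    {
        /* check the access against cells installed by other CPUs */
        if (kcsan_match_cell(addr, size, write))
            kcsan_report(addr, size, write);

        /* every KCSAN_NACCESSES accesses, become the watcher */
        if (atomic_inc_uint_nv(&kcsan_count) % KCSAN_NACCESSES == 0) {
            kcsan_install_cell(addr, size, write);
            DELAY(10000);   /* delay the calling CPU 10ms */
            kcsan_remove_cell(addr);
        }
    }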
The coverage of the instrumentation is the same as that of kASan. Also, the
code is organized in a way similar to kASan, so it is easy to add support
for more architectures than amd64. kCSan is compatible with KCOV.
Reviewed by Kamil.
Fix a race condition that caused PT_GET_SIGINFO to return incorrect
information when multiple signals were delivered concurrently
to different LWPs. Add a regression test that verifies that when 50
threads concurrently use pthread_kill() on themselves, the debugger
receives all signals with correct information.
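The core of the test is roughly the following sketch (simplified; the
actual ATF test and its choice of signal may differ):

    static void *
    raise_one(void *arg)
    {
        /* each of the 50 threads signals itself; the debugger must
         * observe every signal with the correct lwpid in siginfo */
        pthread_kill(pthread_self(), SIGTRAP);
        return NULL;
    }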
The kernel uses separate signal queues for each LWP. However,
the signal context used to implement PT_GET_SIGINFO is stored in 'struct
proc' and therefore common to all LWPs in the process. Previously,
this member was filled in kpsignal2(), i.e. when the signal was sent.
This meant that if another LWP managed to send another signal
concurrently, the data was overwritten before the process was stopped.
As a result, PT_GET_SIGINFO did not report the correct LWP and signal
(it could even report a different signal than wait()). This can be
reproduced quite reliably with 20 LWPs, though it can also occur with 10.
This patch moves setting of signal context to issignal(), just before
the process is actually stopped. The data is taken from per-LWP
or per-process signal queue. The added test confirms that the debugger
correctly receives all signals, and PT_GET_SIGINFO reports both correct
LWP and signal number.
Reviewed by kamil.
There are several ways to deal with these warnings:
1. this one: add a void * cast (which I think is the least intrusive)
2. add pragmas to elide the warning
3. add intermediate inline conversion functions
4. change the called function prototypes, adding unused arguments and
converting some of the pointer arguments to void *.
5. make the functions variadic (which defeats the purpose of checking)
6. pass command line flags to elide the warning
I did try 3 and 4, and I was not pleased with the result (sys_ptrace_common.c):
(3) added too much code and defines, and (4) made the regular use clumsy.
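Option 1, the fix applied here, amounts to a sketch like this (hypothetical
handler name, assuming the sy_call_t convention from sys/systm.h):

    /* the handler has a more specific prototype than the table expects */
    int ptrace_foo(struct lwp *, struct ptrace_foo_args *, register_t *);

    /* the intermediate void * cast silences the incompatible
     * function pointer warning */
    sy_call_t *entry = (sy_call_t *)(void *)ptrace_foo;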
sigswitch() can be called from exit1() through:
ttywait()->ttysleep()->cv_timedwait_sig()->sleepq_block()->issignal()->sigswitch()
lwp_exit() called for the last LWP triggers exit1(), and this causes a panic.
The debugger-related signals have short-circuit demise paths in
eventswitch() and other functions, before calling sigswitch().
This change restores the original behavior, but there is an open question
whether the kernel crash is a red herring caused by misbehavior of ttywait().
This should fix PR kern/54618 by David H. Gutteridge.
A NULL pointer could be passed as the second argument, and the compiler is
free to perform optimizations knowing that this argument is never NULL.
In this particular case it was harmless, but still good to fix.
Reported-by: syzbot+6f504255accb795eb6b7@syzkaller.appspotmail.com
For the PTRACE_LWP_EXIT event, the eventswitch() call is triggered from
lwp_exit(). When the program status is set to PS_WEXIT, do not try to
demise in place by calling lwp_exit(), as that causes a panic. In this
scenario, bail out from the function and resume the lwp_exit() procedure.
In the case of sigswitching away in issignal() and continuing execution on
PT_CONTINUE (or an equivalent call), there is a time window in which another
thread could cause the process state to change to PS_STOPPING.
In the current logic, a thread would receive signal 0 (no signal) and exit
from issignal(), returning to userland and never finishing the process of
stopping all LWPs. This causes waitpid() to hang waiting for SIGCHLD and
the callout polling for the state of the process to spin in an infinite loop.
Instead of prompting for a returned signal from a debugger, repeat the
issignal() loop; this will cause the PS_STOPPING flag to be checked again
and the thread to sigswitch away in the scenario of stopping the process.
Make the function static, as it is now local to kern_sig.c.
Rename the 'relock' argument to 'proc_lock_held', as it is more descriptive.
This was suggested by mjg@freebsd. While there, this flips the users between
true<->false.
Add additional KASSERT(9) calls here to validate whether proc_lock is used
accordingly.
This field is not needed, as it duplicated p_opptr, which is already safe
to use unless proven otherwise.
eventswitch() already contained a check for != initproc (pid1).
Ride ABI bump for 9.99.16.
Running VOP_STRATEGY() while a file system suspends may deadlock on
suspension of this file system.
Add fstrans type LAZY and use it for VOP_STRATEGY().
Addresses PR kern/53624 (dom0 freeze on domU exit), which is still there.
Add a new ptrace(2) operation, PT_STOP. It works like:
- kill(SIGSTOP) for unstopped tracee
- ptrace(PT_CONTINUE,SIGSTOP) for stopped tracee
The child will be stopped, and it will always be possible to wait for it
(with wait(2)-like calls).
For a stopped tracee, kill(SIGSTOP) has no effect. PT_CONTINUE+SIGSTOP cannot
be used on an unstopped process (EBUSY).
This operation is modeled after PT_KILL, which is similar but for the SIGKILL
case. While there, allow PT_KILL on an unstopped traced child.
This operation is useful for an abnormal exit of a debugger from a signal
handler, usually followed by waitpid(2) and ptrace(PT_DETACH).
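Typical use by a debugger bailing out of a signal handler would look
roughly like this sketch:

    ptrace(PT_STOP, pid, NULL, 0);          /* works stopped or not */
    waitpid(pid, &status, 0);               /* collect the stop */
    ptrace(PT_DETACH, pid, (void *)1, 0);   /* detach, resuming at PC */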
Stop threads racing over which one emits its event signal first and
overwriting the signal from another thread.
This fixes missed-in-action signals.
NetBSD truss can now reliably report all TRAP_SCE/SCX/etc events without
reports of missed ones.
This was one of the reasons why a debuggee with multiple threads misbehaved
under a debugger.
This change is v.2 of the previously reverted commit for the same fix.
This version contains a recovery path that stops triggering the event
SIGTRAP for a detached debugger.
friendly methods for sys/conf.h that needs it.
only one alias per return type and first function is needed, though they
can be stubbed to existing code. The only cost is the symbol itself; the
codegen is the same.
An alternative approach would be to check the value in settime1(), but
it would result in multiple checks for a valid tv_nsec, as there are
settime1() users that need to check the ranges earlier.
Reported-by: syzbot+96e5ce2c2c704d96c2f0@syzkaller.appspotmail.com
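The range check in question is the standard timespec validation, roughly:

    /* reject invalid nanoseconds before any arithmetic on the value */
    if (ts->tv_nsec < 0 || ts->tv_nsec >= 1000000000L)
        return EINVAL;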
The condition would be rechecked again later, after subtracting the start
time, with most invalid inputs rejected. In corner cases the current code
can accept certain invalid inputs that will pass the later checks and
behave like valid ones (due to signed integer overflow).
Reported-by: syzbot+3a4a07b62558bbbd3baa@syzkaller.appspotmail.com
Register the DMA buffers in the map, and check the buffer on each
bus_dmamap_sync. This allows us to
find DMA buffer overflows and UAFs, which couldn't be found before because
the device accesses to memory are outside of KASAN's control.
Once a thread has been stopped with ptrace(2), a userland process must not
be able to unstop it, deliberately or by accident.
This was a Windows-style behavior that made thread tracing fragile.
sizeof(pid) and sizeof(lwp) are unlikely to ever change, and the check can
be confusing.
The assert has been moved to ATF t_ptrace_wait.c r.1.132.
Requested by <christos>
Solves kernel panic in NetBSD 8.1 amd64 on VirtualBox 6.0.12 r133076.
Triggered with an NVMe controller without any actual discs behind it:
nvme0 at pci0 dev 14 function 0: vendor 80ee product 4e56 (rev. 0x00)
nvme0: NVMe 1.2
nvme0: interrupting at ioapic0 pin 22
nvme0: ORCL-VBOX-NVME-VER12, firmware 1.0, serial VB1234-56789
ld0 at nvme0 nsid 1
ld0: 0, 0 cyl, 16 head, 63 sec, 1 bytes/sect x 0 sectors
Code path is reached 4 times during normal boot, each time after wd0a
is already mounted; this patch avoids a crash with a dirty filesystem.
Storing struct ptrace_state information inside struct proc was vulnerable
to synchronization bugs, as multiple events emitted at the same time
overwrote one another.
Cache the original parent process id in p_oppid. Reusing p_opptr here is
in theory prone to a slight race condition.
Change the semantics of PT_GET_PROCESS_STATE, returning EINVAL for calls
asking for the value in cases where no appropriate event has been
registered.
Add an alternative approach to check the ptrace_state information, directly
from the siginfo_t value returned from PT_GET_SIGINFO. The original
PT_GET_PROCESS_STATE approach is kept for compat with older NetBSD and
OpenBSD. New code is recommended to keep using PT_GET_PROCESS_STATE.
Add a couple of compile-time asserts for assumptions in the code.
No functional change intended in existing ptrace(2) software.
All ATF ptrace(2) and ATF GDB tests pass.
This change improves reliability of the threading ptrace(2) code.
Use the value rather than discarding it after assignment. The issue was
introduced from the [pgoyette-compat] branch work.
Welcome to 9.99.14!!! (Module hook routine prototype changed.)
Found by the lgtm bot, reported via private email from maxv@.
The new member is called f_mntfromlabel and it is the dkw_wname
of the corresponding wedge. This is now used by df -W to display
the mountpoint name as NAME=
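From userland the member can be read via statvfs(2); a minimal sketch,
assuming the file system is mounted from a wedge:

    #include <sys/statvfs.h>
    #include <stdio.h>

    struct statvfs sv;

    if (statvfs("/mnt", &sv) == 0)
        printf("NAME=%s\n", sv.f_mntfromlabel);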