Add interface in ptrace(2) to track thread (LWP) events:
- birth,
- termination.
The purpose of this interface is to keep track of the current thread state
in a tracee and apply e.g. per-thread hardware-assisted watchpoints.
This interface reuses the EVENT_MASK and PROCESS_STATE interface, and
shares it with PTRACE_FORK, PTRACE_VFORK and PTRACE_VFORK_DONE.
Change the following structure:
typedef struct ptrace_state {
	int	pe_report_event;
	pid_t	pe_other_pid;
} ptrace_state_t;
to
typedef struct ptrace_state {
	int	pe_report_event;
	union {
		pid_t	_pe_other_pid;
		lwpid_t	_pe_lwp;
	} _option;
} ptrace_state_t;
#define pe_other_pid _option._pe_other_pid
#define pe_lwp _option._pe_lwp
This keeps the size of ptrace_state_t unchanged, as both pid_t and lwpid_t
are int32_t-like integers. The change does not break existing prebuilt
software and requires minimal, if any, source-code changes. In summary, it
should be binary compatible and should not break the build of existing
software.
Introduce a new siginfo(5) type for LWP events under the SIGTRAP signal:
TRAP_LWP. This helps debuggers distinguish the exact source of a SIGTRAP.
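A minimal sketch of debugger-side usage (assuming the request names
PT_SET_EVENT_MASK / PT_GET_PROCESS_STATE and the ptrace_event_t field
pe_set_event; error handling omitted):

/* Sketch: subscribe to LWP events in an attached, stopped tracee and
 * identify which event fired after the next SIGTRAP stop. */
#include <sys/types.h>
#include <sys/ptrace.h>
#include <stdio.h>

static void
watch_lwp_events(pid_t pid)
{
	ptrace_event_t pe;
	ptrace_state_t st;

	pe.pe_set_event = PTRACE_LWP_CREATE | PTRACE_LWP_EXIT;
	ptrace(PT_SET_EVENT_MASK, pid, &pe, sizeof(pe));
	/* ... PT_CONTINUE the tracee, wait(2) for a SIGTRAP stop ... */
	ptrace(PT_GET_PROCESS_STATE, pid, &st, sizeof(st));
	if (st.pe_report_event == PTRACE_LWP_CREATE)
		printf("LWP %d was created\n", st.pe_lwp);
	else if (st.pe_report_event == PTRACE_LWP_EXIT)
		printf("LWP %d exited\n", st.pe_lwp);
}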
Add two basic t_ptrace_wait* tests:
lwp_create1:
Verify that 1 LWP creation is intercepted by ptrace(2) with
EVENT_MASK set to PTRACE_LWP_CREATE
lwp_exit1:
Verify that 1 LWP exit is intercepted by ptrace(2) with
EVENT_MASK set to PTRACE_LWP_EXIT
All tests are passing.
This change rides the previous kernel ABI bump to 7.99.59 for PTRACE_VFORK{,_DONE}.
Sponsored by <The NetBSD Foundation>
PTRACE_VFORK is supposed to be used to track vfork(2)-like events, where a
parent spawns a new child process and stops until the child exits or calls
exec().
Currently PTRACE_VFORK is a stub.
PTRACE_VFORK_DONE notifies a debugger that a parent has resumed after a
vfork(2)-like action.
PTRACE_VFORK_DONE raises SIGTRAP with TRAP_CHLD.
Sponsored by <The NetBSD Foundation>
The kernel raises SIGTRAP if EVENT_MASK (ptrace_event) enables PTRACE_FORK.
This new si_code helps debuggers distinguish the exact source of the signal
delivered to a debugger.
Another purpose of TRAP_CHLD is to retain the same behavior inside the
NetBSD kernel for process child traps while providing an interface to
monitor them.
The exact event and extended properties of a process child trap can be
retrieved with PT_GET_PROCESS_STATE.
There is no behavior change for existing software.
This si_code value is a NetBSD extension.
Sponsored by <The NetBSD Foundation>
This removes dead code introduced with the following commit:
date: 2012-07-27 22:52:49 +0200; author: christos; state: Exp; lines: +8 -2;
revert racy vfork() parent-blocking-before-child-execs-or-exits code.
ok rmind
This interface is designed to read signal information emitted to a tracee
and to fake that signal with a new value.
This functionality is required to distinguish the types of events that
occurred in the tracee and were intercepted by a debugger.
These accessors introduce a new structure type ptrace_siginfo:
/*
* Signal Information structure
*/
typedef struct ptrace_siginfo {
	siginfo_t	psi_siginfo;	/* signal information structure */
	lwpid_t		psi_lwpid;	/* destination LWP of the signal;
					 * value 0 means the whole process
					 * (route signal to all LWPs) */
} ptrace_siginfo_t;
Include <sys/siginfo.h> in <sys/ptrace.h> in order not to break existing
software due to the unknown symbol siginfo_t.
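A minimal sketch of the intended use (assuming the accessor requests are
named PT_GET_SIGINFO and PT_SET_SIGINFO; error handling omitted):

/* Sketch: read the pending signal information from a stopped tracee,
 * classify it, and rewrite ("fake") the signal before delivery. */
#include <sys/types.h>
#include <sys/ptrace.h>
#include <signal.h>

static void
rewrite_signal(pid_t pid)
{
	struct ptrace_siginfo psi;

	ptrace(PT_GET_SIGINFO, pid, &psi, sizeof(psi));
	if (psi.psi_siginfo.si_signo == SIGTRAP &&
	    psi.psi_siginfo.si_code == TRAP_LWP) {
		/* an LWP event, not a breakpoint */
	}
	psi.psi_siginfo.si_signo = SIGINT;	/* fake the signal value */
	ptrace(PT_SET_SIGINFO, pid, &psi, sizeof(psi));
}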
This interface has been proposed to the tech-kern@ mailing list.
Sponsored by <The NetBSD Foundation>
On exec() events under a debugger, generate the SIGTRAP signal with the
TRAP_EXEC property. This allows a tracer to distinguish exec() events
easily.
Sponsored by <The NetBSD Foundation>
its timestamps.
As this changes storage structures for data passed between kernel and
userland, welcome to 7.99.55!
XXX Output routines still use microsecond resolution when printf()ing.
XXX Possible future feature would be addition of option to use
XXX getbintime(9) for less time-critical histories.
Increment v_holdcnt to prevent the vnode from disappearing while
vcache_vget() waits for a stable state.
Now v_usecount tracks the number of successful references.
up using the old key until vcache_rekey_exit changes the key to the new one.
Add an assertion that the temporary key is different from the current one.
for averages. Otherwise the decisions can be heavily biased by rounding
errors.
Add sysctl kern.sched_average_weight to change the weight of
historical data, the default is 50%.
to lock this vnode's v_interlock -> vdrain_lock; another vnode sharing
the v_interlock may lock in this order.
While here, restore fstrans_start_nowait arg to FSTRANS_LAZY.
Fixes a deadlock seen recently on some pbulk environments.
Add new ptrace(2) calls:
- PT_COUNT_WATCHPOINTS - count the number of available hardware watchpoints
- PT_READ_WATCHPOINT - read struct ptrace_watchpoint from the kernel state
- PT_WRITE_WATCHPOINT - write new struct ptrace_watchpoint state, this
includes enabling and disabling watchpoints
The ptrace_watchpoint structure contains MI and MD parts:
typedef struct ptrace_watchpoint {
	int		pw_index;	/* HW Watchpoint ID (count from 0) */
	lwpid_t		pw_lwpid;	/* LWP described */
	struct mdpw	pw_md;		/* MD fields */
} ptrace_watchpoint_t;
For example amd64 defines MD as follows:
struct mdpw {
	void	*md_address;
	int	md_condition;
	int	md_length;
};
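A minimal sketch of programming a watchpoint with these calls (the
md_condition and md_length values below are illustrative, not real amd64
encodings; error handling omitted):

/* Sketch: program hardware watchpoint 0 of LWP `lwp` in the stopped
 * tracee `pid` to watch 4 bytes at `addr`. */
#include <sys/types.h>
#include <sys/ptrace.h>
#include <string.h>

static int
set_watchpoint(pid_t pid, lwpid_t lwp, void *addr)
{
	struct ptrace_watchpoint pw;

	memset(&pw, 0, sizeof(pw));
	pw.pw_index = 0;		/* hardware watchpoint 0 */
	pw.pw_lwpid = lwp;		/* LWP to watch */
	pw.pw_md.md_address = addr;	/* MD: watched address */
	pw.pw_md.md_condition = 1;	/* MD: e.g. break on data write */
	pw.pw_md.md_length = 4;		/* MD: watch 4 bytes */
	return ptrace(PT_WRITE_WATCHPOINT, pid, &pw, sizeof(pw));
}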
These calls are protected with the __HAVE_PTRACE_WATCHPOINTS guard.
Tested on amd64, initial support added for i386 and XEN.
Sponsored by <The NetBSD Foundation>
of the lists. Speeds up namei on cached vnodes by ~3 percent.
Merge "vrele_thread" into "vdrain_thread" so we have one thread
working on the lrulists. Adapt vfs_drainvnodes() to always wait
for a complete cycle of vdrain_thread().
always equal to "desiredvnodes" and move its definition
from sys/vnode.h to sys/vnode_impl.h.
Extend vfs_drainvnodes() to also wait for deferred vrele to flush
and replace the call to vrele_flush() with a call to vfs_drainvnodes().
This allows us to return EEXIST instead of EPERM for higher secure levels.
My use case was to stop npfctl complaining that it could not load bpfjit
on ERLITE when it was compiled into the kernel.
It then went on to complain that NPF performance would be degraded,
but this is clearly not the case.
When called from vrecycle() or vgone() there is a window where the refcount
is greater than zero and another thread could get and release a reference
that would miss VOP_INACTIVE() as the refcount doesn't drop to zero.
Adjust test fs/puffs/t_basic: test that the VOP_INACTIVE count is greater than zero.
- Make vrecycle() more robust by checking v_usecount first and preventing
further references across vn_lock(). Fixes a deadlock where one thread
starts unmount, second thread locks a directory and allocates a vnode
and first thread tries to vrecycle() the directory.
First thread holds vfs_busy and wants vnode, second thread holds vnode
and wants vfs_busy.
- With these fixes in place change cleanvnode() to use vget()/vrecycle()
to reclaim the vnode.
xc_lowpri and xc_thread are racy and xc_wait may return during/before
executing all xcall callbacks, resulting in a kernel panic at worst.
xc_lowpri serializes multiple jobs with a mutex and a cv. Once all xcall
callbacks are done, xc_wait returns and xc_lowpri accepts the next job.
The problem is that the counter of finished xcall callbacks is incremented
*before* an xcall callback is actually executed (see xc_tailp++ in
xc_thread), so xc_lowpri accepts the next job before all xcall callbacks
complete and the next job begins to run its xcall callbacks.
Even worse, the counter is global and shared between jobs, so when an
xcall callback of the next job completes, the shared counter is
incremented, which makes xc_wait of the previous job conclude that all of
its xcall callbacks are done; it then returns during/before their
execution.
How to fix: there are actually two counters of finished xcall callbacks
for low-priority xcalls, for (I guess) historical reasons: xc_tailp and
xc_low_pri.xc_donep. xc_low_pri.xc_donep is incremented correctly, while
xc_tailp is incremented wrongly, i.e., before an xcall callback executes.
We can fix the issue by dropping xc_tailp and using only
xc_low_pri.xc_donep.
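A minimal sketch of the corrected ordering (illustrative only, with
simplified state and locking; not the actual subr_xcall.c code):

/* Sketch: the done counter is bumped only after the callback has run,
 * so xc_wait() cannot observe a job as finished before its callbacks
 * actually execute. */
#include <sys/types.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

typedef void (*xcfunc_t)(void *, void *);

struct xc_state {			/* hypothetical per-class state */
	kmutex_t	xc_lock;
	kcondvar_t	xc_busy;
	uint64_t	xc_headp;	/* jobs submitted */
	uint64_t	xc_donep;	/* jobs finished */
	xcfunc_t	xc_func;
	void		*xc_arg1, *xc_arg2;
};

static void
xc_thread_sketch(void *cookie)
{
	struct xc_state *xc = cookie;
	xcfunc_t func;
	void *arg1, *arg2;

	for (;;) {
		mutex_enter(&xc->xc_lock);
		while (xc->xc_headp == xc->xc_donep)
			cv_wait(&xc->xc_busy, &xc->xc_lock);
		func = xc->xc_func;
		arg1 = xc->xc_arg1;
		arg2 = xc->xc_arg2;
		mutex_exit(&xc->xc_lock);

		(*func)(arg1, arg2);	/* run the cross call */

		mutex_enter(&xc->xc_lock);
		xc->xc_donep++;		/* mark it done only now */
		cv_broadcast(&xc->xc_busy);
		mutex_exit(&xc->xc_lock);
	}
}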
PR kern/51632
various sysctl/procfs interfaces that allow it to be interrogated.
(This is rather than the temporary parent's pid when a process is
being traced and has been reparented.)
XXX The ppid in elf32 core files has not been similarly adjusted,
XXX Should it be ?
Only change it when we are being permanently reparented to init. Since
p_ppid is only used as a cached value to retrieve the parent's process id
from userland, this change makes it correct at all times. Idea from kre@
Revert specialized logic from getpid/getppid now that it is not needed.
before recursing into lower blocks, to make sure that it will be removed after
all its referenced blocks are removed
fixes the 'ffs_blkfree_common: freeing free block' panic triggered by
ufs_truncate_retry() when just the upper indirect block registration
failed and the code tried to free the lower blocks again after a wapbl
flush
problem found by hannken@, thank you
Revert 1.264 - that was intended to fix 51600, but didn't, it just
hid the problem, and caused 51606. This fixes 51606.
Handle waiting on a process that has been detached from its parent
because of being ptrace'd by some other process. This fixes 51600.
("handle" here means that the wait() hangs, or with WNOHANG, returns 0;
we cannot actually wait on a process that is not currently an attached
child.)
Note: the detached process waiting is not yet perfect (it fails to
take account of options like WALLSIG and WALTSIG) - support for those
(that is, ignoring a detached child that one of those options will
later cause to be ignored when the process is re-attached) is still
to come.
For now, other than when waiting for a specific process ID, when
a process does a wait() sys call (any of them), has no applicable
children attached that can be returned, and has at least one detached
child, we do a linear search of all processes to look for a
suitable detached child. This is likely to be slow - but very rare.
Eventually it might be better to keep a list of detached children
per process.
- Move _VFS_VNODE_PRIVATE protected operations into vnode_impl.h.
- Move struct vnode_impl definition and operations into vnode_impl.h.
- Include vnode_impl.h where we include vnode.h with _VFS_VNODE_PRIVATE defined.
- Get rid of _VFS_VNODE_PRIVATE.
- Rename struct vcache_node to vnode_impl, start its fields with vi_.
- Rename enum vcache_state to vnode_state, start its elements with VS_.
- Rename macros VN_TO_VP and VP_TO_VN to VIMPL_TO_VNODE and VNODE_TO_VIMPL.
- Add typedef struct vnode_impl vnode_impl_t.
1 - ptrace(2) syscall for native emulation
2 - common ptrace(2) syscall code (shared with compat_netbsd32)
3 - support routines that are shared with PROCFS and/or KTRACE
* Add module glue for #1 and #2. Both modules will be built-in to the
kernel if "options PTRACE" is included in the config file (this is
the default, defined in sys/conf/std).
* Mark the ptrace(2) syscall as modular in syscalls.master (generated
files will be committed shortly).
* Conditionalize all remaining portions of PTRACE code on a new kernel
option PTRACE_HOOKS.
XXX Instead of PROCFS depending on 'options PTRACE', we should probably
just add a procfs attribute to the sys/kern/sys_process.c file's
entry in files.kern, and add PROCFS to the "#if defineds" for
process_domem(). It's really confusing to have two different ways
of requiring this file.
succeed; change wapbl_register_deallocation() to return EAGAIN
rather than panic when code hits the limit
callers changed to either loop calling ffs_truncate() using the new
utility ufs_truncate_retry() if their semantics require it, or to
just ignore the failure; remove ufs_wapbl_truncate()
this fixes a possible user-triggerable panic during truncate, and
resolves a WAPBL performance issue with truncates of large files
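A minimal sketch of the retry idiom the callers now use (kernel-side,
signatures simplified; not the actual ufs_truncate_retry() code):

/* Sketch: keep truncating while the journal fills up;
 * wapbl_register_deallocation() now returns EAGAIN instead of
 * panicking, and ffs_truncate() propagates that to its caller. */
#include <sys/param.h>
#include <sys/vnode.h>
#include <sys/kauth.h>

int ffs_truncate(struct vnode *, off_t, int, kauth_cred_t);

static int
truncate_retry_sketch(struct vnode *vp, off_t newlen, kauth_cred_t cred)
{
	int error;

	for (;;) {
		error = ffs_truncate(vp, newlen, 0, cred);
		if (error != EAGAIN)
			return error;	/* success or a hard error */
		/* the journal flushed pending deallocations; retry */
	}
}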
PR kern/47146 and kern/49175
Marking up some zeroes with a type suffix, while not marking others in
the very same function, does nothing but place a cognitive burden on the
reader.
Spelling "clear bits" as "&~" is actually not uncommon (and some say
is more readable).
It is unexpected for an unprivileged process to gain privs by
typing to root's tty:
$ cat installer
#!/bin/sh
whoami
/usr/sbin/sti /dev/tty whoami\\n
$ su unprivileged -c ./installer
unprivileged
$ whoami
root
change anything, since kernel processes use the shared kernel map instead
of the one they are given here. For consistency though, it is better to
make sure UVM will not be tempted to access machine-dependent reserved
areas (e.g., the PTE space on x86).
rather than including in kernels with KDTRACE_HOOKS defined. Update
the dtrace_fbt module to depend on the zlib module.
Bump kernel version to avoid module mismatch.
Welcome to 7.99.38 !
zero is hugely flawed. It is easy to demonstrate that one can trick UVM
into choosing a NULL hint after the user_va0_disable check from uvm_map.
Such a bypass allows kernel NULL pointer dereferences to be exploitable on
architectures with a shared userland<->kernel VA, like amd64.
Fix this by increasing the limit of the vm space made available for
userland processes. This way, UVM will never choose a NULL hint, since it
would be outside of the vm space.
The user_va0_disable sysctl still controls this feature.
buffer being written.)
There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.
Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.
Spotted by coypu.
relocating them. The text is allocated as RWX, and then mprotected to RW.
There is a bug that prevents us from doing RW->RX on amd64 and perhaps
sparc64. On x86, the pmap waits for the page to fault before granting it
the X permission. But in the trap handler, such a page is considered as
belonging to kernel_map, while it actually belongs to module_map. The
kernel then finds out the page is not present in kernel_map, and panics.
In all cases, module_map is non-pageable, so even if the trap were handled
properly, it still wouldn't work.
Therefore, there is a small window in which the segment is RWX. But that's
fine enough, for now.
XXX Since everything has (or should have) been switched to dev_t, we
XXX could probably remove the check for
XXX
XXX ca->ca_devsize >= sizeof(struct device)
XXX
XXX But someone ought to check on that first!
Reviewed by riastradh@
kevent validates that ident is a valid fd by getting the file. one sad
quirk: uint64 to int32 truncation can lead to false positives, and then
later in the array sizing code, very big mallocs panic the kernel.
add a check that the ident isn't larger than INT_MAX in the fd case.
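a sketch of the added check (illustrative, not the exact diff):

/* Sketch: in the fd-backed filter case, reject idents that do not
 * fit in an int before treating them as file descriptors. */
#include <sys/event.h>
#include <limits.h>
#include <errno.h>

static int
check_fd_ident(const struct kevent *kev)
{
	switch (kev->filter) {
	case EVFILT_READ:
	case EVFILT_WRITE:
		if (kev->ident > INT_MAX)
			return EBADF;	/* cannot be a valid fd */
		break;
	default:
		break;
	}
	return 0;
}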
reported by Tim Newsham
up the module segments into one big RWX chunk. Split this chunk into two
different text and data+bss+rodata chunks. The latter is made non-
executable. This also provides some kind of ASLR, since the chunks are
not necessarily contiguous.
by Andy Doran. Also document the get/set pshared thread calls as not
implemented, and add a skeleton implementation that is disabled.
XXX: document _sched_protect(2).
module names both in the built-in list and in the list of previously
"pushed" modules.
While here, delay allocating the new 'struct module' until we've passed
the duplicate-name checks.
detaching devices at shutdown time with RB_POWERDOWN.
When detaching wd(4), put the drive in standby before detach
for DETACH_POWEROFF.
Fix PR kern/51252
"pushed" by the boot loader. The boot loader pushes the module
name for the root file system (unless the root file system is ffs)
even if the file system module is built into the kernel. When
this happens, we get a lot of "redefined symbol" error messages.
This fix does not alter the behavior of pushing the file system
name. It simply avoids the redefined symbol errors by detecting
that the module is already built-in to the kernel and not trying
to load another copy.
While here, differentiate the error message text between "failed
to load" and "failed to fetch_info" conditions.
Addresses PR kern/50357
Having a pointer to an interface in an mbuf isn't safe if we remove big
kernel locks; an interface object (ifnet) can be destroyed at any time
during packet processing, and accessing such an object via a pointer is
racy. Instead we have to get the object from the interface collection
(ifindex2ifnet) via an interface index (if_index) that is stored in the
mbuf instead of a pointer.
The change provides two APIs: m_{get,put}_rcvif_psref that use psref(9)
for sleep-able critical sections and m_{get,put}_rcvif that use
pserialize(9) for other critical sections. The change also adds another
API, m_get_rcvif_NOMPSAFE, which is NOT MP-safe and exists only for the
transition period, i.e., it is intended for places that are not planned
to be MP-ified soon.
The change adds some psref overhead to performance-sensitive paths;
however, the overhead is not serious, 2% down at worst.
Proposed on tech-kern and tech-net.
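A minimal sketch of the psref variant (kernel-side, within a packet
processing path; details illustrative):

/* Sketch: resolve the receiving interface from the mbuf's stored
 * if_index inside a psref critical section, then release it. */
#include <sys/mbuf.h>
#include <sys/psref.h>
#include <net/if.h>

static void
input_sketch(struct mbuf *m)
{
	struct psref psref;
	struct ifnet *ifp;

	ifp = m_get_rcvif_psref(m, &psref);
	if (ifp == NULL) {
		m_freem(m);	/* the interface is already gone */
		return;
	}
	/* ... use ifp; it cannot be destroyed while referenced ... */
	m_put_rcvif_psref(ifp, &psref);
}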
The API is used to set (or reset) the received interface of an mbuf.
These functions are the counterpart of m_get_rcvif, which will come in
another commit; they hide the internals of rcvif operations and reduce
the diff of the upcoming change.
No functional change.
as documented in sysctl(7):
0 - ptrace does not affect mprotect
1 - (default) mprotect is disabled for processes that start executing from
the debugger (being traced)
2 - mprotect restrictions are relaxed for traced processes
mprotect settings so that debuggers can write to the text segment of
traced processes in order to insert breakpoints. Turned off by default.
Ok: chuq (for now)
compat32, which we deal with properly). It would be possible to get
those working too, but it is not worth the code complexity.
This makes binaries compiled with -mcmodel=medlow (and ancient binaries)
work again on sparc64, smoothing the upgrade path.
ok: christos
sockets sitting in the accept filter can consume the entire listen queue,
such that the application is never able to handle any connections. Handle
this by simply passing through the oldest queued cxn when the queue is full.
This is fair because the longer a cxn lingers in the queue (stays connected
but does not meet the requirements of the filter for passage) the more likely
it is to be passed through, at which point the application can dispose of it.
Works because none of our accept filters actually allocate private state
per-cxn. If they did, we'd have to fix the API bug that there is presently
no way to tell an accf to finish/deallocate for a single cxn (accf_destroy
kills off the entire filter instance for a given listen socket).
Remainder of fix for PR kern/51135: if there is an entropy source
that can produce arbitrarily much data, as in rump, then nothing
should ever block indefinitely waiting for data.