Commit Graph

9751 Commits

Author SHA1 Message Date
maya
f6be953d31 use a bound string copy 2017-01-15 01:47:24 +00:00
maya
8341f84221 use a bound string copy 2017-01-15 01:28:14 +00:00
kamil
c52f1ed048 Fix generation of PTRACE_LWP_EXIT event
Set p_lwp_exited instead of p_lwp_created for PTRACE_LWP_EXIT.

This makes the lwp_exit1 ATF test pass.

Sponsored by <The NetBSD Foundation>
2017-01-14 19:32:10 +00:00
kamil
6413a1acf0 Introduce PTRACE_LWP_{CREATE,EXIT} in ptrace(2) and TRAP_LWP in siginfo(5)
Add interface in ptrace(2) to track thread (LWP) events:
 - birth,
 - termination.

The purpose of this interface is to keep track of the current thread state in
a tracee and to apply, e.g., per-thread hardware-assisted watchpoints.

This interface reuses the EVENT_MASK and PROCESS_STATE interface, and
shares it with PTRACE_FORK, PTRACE_VFORK and PTRACE_VFORK_DONE.

Change the following structure:

typedef struct ptrace_state {
        int     pe_report_event;
        pid_t   pe_other_pid;
} ptrace_state_t;

to

typedef struct ptrace_state {
        int     pe_report_event;
        union {
                pid_t   _pe_other_pid;
                lwpid_t _pe_lwp;
        } _option;
} ptrace_state_t;

#define pe_other_pid    _option._pe_other_pid
#define pe_lwp          _option._pe_lwp

This keeps the size of ptrace_state_t unchanged, as both pid_t and lwpid_t
are defined as int32_t-like integers. This change does not break existing
prebuilt software and requires minimal source-code changes. In summary,
this change should be binary compatible and shouldn't break the build of
existing software.


Introduce new siginfo(5) type for LWP events under the SIGTRAP signal:
TRAP_LWP. This change will help debuggers to distinguish exact source of
SIGTRAP.


Add two basic t_ptrace_wait* tests:
lwp_create1:
    Verify that 1 LWP creation is intercepted by ptrace(2) with
    EVENT_MASK set to PTRACE_LWP_CREATE

lwp_exit1:
    Verify that 1 LWP termination is intercepted by ptrace(2) with
    EVENT_MASK set to PTRACE_LWP_EXIT

All tests are passing.


Surfing the previous kernel ABI bump to 7.99.59 for PTRACE_VFORK{,_DONE}.

Sponsored by <The NetBSD Foundation>
2017-01-14 06:36:52 +00:00
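
The binary-compatibility argument in 6413a1acf0 can be illustrated with a minimal sketch. It assumes that pid_t and lwpid_t are both 32-bit integers; the sketch_* typedefs, the struct tags and set_lwp_event() are hypothetical stand-ins, not the kernel's own definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-ins for pid_t and lwpid_t, both 32-bit integers. */
typedef int32_t sketch_pid_t;
typedef int32_t sketch_lwpid_t;

/* Layout before the change. */
struct old_ptrace_state {
	int		pe_report_event;
	sketch_pid_t	pe_other_pid;
};

/* Layout after the change: the second member becomes a union. */
struct new_ptrace_state {
	int		pe_report_event;
	union {
		sketch_pid_t	_pe_other_pid;
		sketch_lwpid_t	_pe_lwp;
	} _option;
};

/* Checked at compile time: same size, same offset of the second member. */
_Static_assert(sizeof(struct old_ptrace_state) ==
    sizeof(struct new_ptrace_state), "size must not change");
_Static_assert(offsetof(struct old_ptrace_state, pe_other_pid) ==
    offsetof(struct new_ptrace_state, _option), "offset must not change");

/* Accessor macros keep existing source that uses pe_other_pid building. */
#define pe_other_pid	_option._pe_other_pid
#define pe_lwp		_option._pe_lwp

/* Hypothetical helper: record an LWP event through the new accessor. */
void
set_lwp_event(struct new_ptrace_state *ps, int event, sketch_lwpid_t lid)
{
	ps->pe_report_event = event;
	ps->pe_lwp = lid;
}
```

Because both union members have the same integer type in this sketch, code that still reads pe_other_pid sees the same bytes that pe_lwp wrote.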
kamil
0e96af0f53 Add support for PTRACE_VFORK_DONE and stub for PTRACE_VFORK in ptrace(2)
PTRACE_VFORK is supposed to be used to track vfork(2)-like events, when a
parent creates a new child process and stops until the child exits or calls
exec().
Currently PTRACE_VFORK is a stub.

PTRACE_VFORK_DONE notifies a debugger that a parent has
resumed after a vfork(2)-like action.
PTRACE_VFORK_DONE throws SIGTRAP with TRAP_CHLD.

Sponsored by <The NetBSD Foundation>
2017-01-13 23:00:35 +00:00
hannken
cfa69dcf1b Add file-local iterator variant vfs_vnode_iterator_next1() that
waits for vnodes to become reclaimed and use it from vflush().
2017-01-13 10:10:32 +00:00
christos
d8dfcd6c2a regen 2017-01-13 06:18:31 +00:00
christos
a1a8fc3617 const police! 2017-01-13 06:11:27 +00:00
hannken
0365dd0e1a Adapt to the recent vnode changes. 2017-01-11 14:52:02 +00:00
joerg
6ff696c6b4 Add ddb command to find a vnode by the address of its lock.
This makes it much easier to convert lockstat traces into understandable
data.
2017-01-11 12:17:34 +00:00
hannken
e2f2c94b67 Move vnode member v_lock as vi_lock to vnode_impl.h. 2017-01-11 09:08:58 +00:00
hannken
dcc198a3f8 Move vnode member v_mntvnodes as vi_mntvnodes to vnode_impl.h.
Add an ugly hack so pstat.c may still traverse the list.
2017-01-11 09:07:57 +00:00
hannken
6e1af6b1d7 Move vnode members v_synclist_slot and v_synclist as vi_synclist_slot and
vi_synclist to vnode_impl.h.
2017-01-11 09:06:57 +00:00
hannken
2b4a4af133 Move vnode members v_dnclist and v_nclist as vi_dnclist and
vi_nclist to vnode_impl.h.
2017-01-11 09:04:37 +00:00
pgoyette
4869ce0a43 Use membar_{producer,consumer}() to ensure proper access to the "ready"
flag.
2017-01-10 22:08:14 +00:00
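
The "ready" flag pattern from 4869ce0a43 can be sketched in portable C11. This is not the kernel code: the release and acquire fences stand in as rough analogues of membar_producer() and membar_consumer(), and payload, publish() and consume() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;		/* data protected by the flag */
static atomic_bool ready;	/* zero-initialized, i.e. false */

/* Writer: store the data first, then publish the flag. */
void
publish(int value)
{
	payload = value;
	/* Rough analogue of membar_producer(): earlier stores stay first. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader: check the flag first, then read the data. */
bool
consume(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_relaxed))
		return false;
	/* Rough analogue of membar_consumer(): later loads stay later. */
	atomic_thread_fence(memory_order_acquire);
	*out = payload;
	return true;
}
```

Without the two barriers a reader could observe the flag as set and still read a stale payload.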
pgoyette
5a30768de5 Rework the sysctl initialization to avoid creating new nodes from
within the helper function.  This should avoid the "locking against
myself" error reported earlier.
2017-01-10 00:50:57 +00:00
kamil
687ff8a6ad Introduce new si_code for SIGTRAP: TRAP_CHLD - process child trap
The SIGTRAP signal is thrown from the kernel if EVENT_MASK (ptrace_event)
enables PTRACE_FORK. This new si_code helps debuggers to distinguish the
exact source of signal delivered for a debugger.

Another purpose of TRAP_CHLD is to retain the same behavior inside the
NetBSD kernel for process child traps and have an interface to monitor it.

Retrieving exact event and extended properties of process child trap is
available with PT_GET_PROCESS_STATE.

There is no behavior change for existing software.

This si_code value is NetBSD extension.

Sponsored by <The NetBSD Foundation>
2017-01-10 00:48:37 +00:00
christos
69f0023338 If we had an error, don't do the debug checks because they will most certainly
fail and we'll panic.
2017-01-09 14:25:52 +00:00
kamil
e6f79d077f Cleanup dead code after revert of racy vfork(2) commit
This removes dead code introduced with the following commit:

date: 2012-07-27 22:52:49 +0200;  author: christos;  state: Exp;  lines: +8 -2;
revert racy vfork() parent-blocking-before-child-execs-or-exits code.
ok rmind
2017-01-09 00:31:30 +00:00
christos
f896811791 fix build without ddb. 2017-01-08 19:49:25 +00:00
kamil
e4281b2073 Introduce new ptrace(2) interface: PT_SET_SIGINFO and PT_GET_SIGINFO
This interface is designed to read the signal information emitted to a tracee
and to fake this signal with a new value.

This functionality is required to distinguish the types of events that occurred
in the tracee and were intercepted by a debugger.

These accessors introduce a new structure type ptrace_siginfo:
/*
 * Signal Information structure
 */
typedef struct ptrace_siginfo {
       siginfo_t       psi_siginfo;    /* signal information structure */
       lwpid_t         psi_lwpid;      /* destination LWP of the signal
                                        * value 0 means the whole process
                                        * (route signal to all LWPs) */
} ptrace_siginfo_t;

Include <sys/siginfo.h> in <sys/ptrace.h> in order to not break existing
software due to unknown symbol siginfo_t.

This interface has been proposed to the tech-kern@ mailing list.

Sponsored by <The NetBSD Foundation>
2017-01-06 22:53:17 +00:00
kamil
239e90be56 Introduce new SIGTRAP code: TRAP_EXEC
On exec() events under a debugger generate the SIGTRAP signal with
TRAP_EXEC property. This allows tracer to distinguish exec() events easily.

Sponsored by <The NetBSD Foundation>
2017-01-06 22:42:58 +00:00
pgoyette
a62101788a Use the new magic BINTIME_SCALE_* macros instead of magic numbers.
No functional change.
2017-01-05 23:29:14 +00:00
hannken
592be9ae45 Name all "vnode_impl_t" variables "vip".
No functional change.
2017-01-05 10:05:11 +00:00
pgoyette
c9b6361b98 By popular demand, update kernhist to use bintime(9) as the basis for
its timestamps.

As this changes storage structures for data passed between kernel and
userland, welcome to 7.99.55!

XXX Output routines still use microsecond resolution when printf()ing.

XXX Possible future feature would be addition of option to use
XXX getbintime(9) for less time-critical histories.
2017-01-05 03:40:33 +00:00
pgoyette
c42fba4183 Actually initialize the sysctl stuff for kernhist! Missed this file
in earlier commits.
2017-01-05 03:22:20 +00:00
hannken
78a3dd75dc Expand struct vcache to individual variables (vcache.* -> vcache_*).
No functional change.
2017-01-04 17:13:50 +00:00
pgoyette
5a0b3ff699 Rearrange the sysctl export structure for better alignment. 2017-01-04 01:05:58 +00:00
hannken
8b7bed0d14 Now that v_usecount tracks valid references add some "v_usecount == 1"
assertions.
2017-01-02 10:36:58 +00:00
hannken
e0f81f2c02 Change vcache_*vget() to increment v_usecount on success only.
Increment v_holdcnt to prevent the vnode from disappearing while
vcache_vget() waits for a stable state.

Now v_usecount tracks the number of successful references.
2017-01-02 10:35:00 +00:00
hannken
998709c439 Rename vget() to vcache_vget() and vcache_tryvget() respectively and
move the definitions to sys/vnode_impl.h.

No functional change intended.

Welcome to 7.99.54
2017-01-02 10:33:28 +00:00
pgoyette
c2efd8c96e Provide a sysctl method of exporting the kernel history data.
XXX vmstat will be updated soon to use the sysctl rather than grovelling
XXX through kvm.
2017-01-01 23:58:47 +00:00
pgoyette
c129bbe940 Remove some extraneous whitespace 2016-12-28 06:25:40 +00:00
hannken
3b04d6a086 It is wrong to block the vnode during vcache_rekey. The vnode may be looked
up using the old key until vcache_rekey_exit changes the key to the new one.

Add an assertion that the temporary key is different from the current one.
2016-12-27 11:59:36 +00:00
maya
441aa9cf25 Revert previous commit (to r1.117)
Superfluous warnings in simple userland programs is not a valid reason to
break a security model.
2016-12-27 09:34:44 +00:00
pgoyette
ee1d5b993e Decouple BIOHIST from other users of KERNHIST. 2016-12-27 04:12:34 +00:00
pgoyette
d05a55c879 #include giohist.h from proper location 2016-12-26 23:49:53 +00:00
pgoyette
6a7e4606d5 Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.
2016-12-26 23:15:15 +00:00
pgoyette
7f0851cee1 Add a BIOHIST option. As mentioned on tech-kern. 2016-12-26 23:12:33 +00:00
mlelstv
46f58a90c6 When balancing threads over multiple CPUs, use fixpoint arithmetic
for averages. Otherwise the decisions can be heavily biased by rounding
errors.

Add sysctl kern.sched_average_weight to change the weight of
historical data, the default is 50%.
2016-12-22 14:11:58 +00:00
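
The rounding bias mentioned in 46f58a90c6 can be shown with a small sketch. It assumes a weighted average with the weight given in percent and 8 fractional bits; FIX_SHIFT, avg_update_int() and avg_update_fix() are hypothetical names, not the scheduler's actual code.

```c
#include <stdint.h>

#define FIX_SHIFT 8	/* hypothetical: 8 fractional bits */

/*
 * Plain integer average; weight_pct is the weight of historical data
 * in percent (0..100). Small samples are rounded away.
 */
uint32_t
avg_update_int(uint32_t avg, uint32_t sample, uint32_t weight_pct)
{
	return (avg * weight_pct + sample * (100 - weight_pct)) / 100;
}

/*
 * Fixed-point average: avg_fix is scaled by 2^FIX_SHIFT, so fractions
 * survive between updates. Values must stay small enough not to
 * overflow 32 bits in the multiplication.
 */
uint32_t
avg_update_fix(uint32_t avg_fix, uint32_t sample, uint32_t weight_pct)
{
	uint32_t sample_fix = sample << FIX_SHIFT;

	return (avg_fix * weight_pct + sample_fix * (100 - weight_pct)) / 100;
}

/* Convert a fixed-point value back to an integer, rounding to nearest. */
uint32_t
fix_to_int(uint32_t v_fix)
{
	return (v_fix + (1u << (FIX_SHIFT - 1))) >> FIX_SHIFT;
}
```

With a 50% weight and a constant sample of 1, the integer version stays at 0 forever, while the fixed-point version moves through 128, 192 and 224 (in units of 1/256) towards 1.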
hannken
0d2ece78cb Restructure vdrain_vrele(). While it is not possible for another thread
to lock this vnode's v_interlock -> vdrain_lock, another vnode sharing the
v_interlock may lock in this order.
While here, restore fstrans_start_nowait arg to FSTRANS_LAZY.

Fixes a deadlock seen recently on some pbulk environments.
2016-12-20 10:02:21 +00:00
cherry
28fcb4a4b5 panic() must be able to take varargs - in userspace testing too. 2016-12-19 13:02:14 +00:00
dholland
b79a953f51 typo in comment 2016-12-18 05:43:20 +00:00
riastradh
51beee07d0 Fix return value of nommap. 2016-12-16 23:35:04 +00:00
kamil
241cf91ddc Add support for hardware assisted watchpoints/breakpoints API in ptrace(2)
Add new ptrace(2) calls:
 - PT_COUNT_WATCHPOINTS - count the number of available hardware watchpoints
 - PT_READ_WATCHPOINT   - read struct ptrace_watchpoint from the kernel state
 - PT_WRITE_WATCHPOINT  - write new struct ptrace_watchpoint state, this
                          includes enabling and disabling watchpoints

The ptrace_watchpoint structure contains MI and MD parts:

typedef struct ptrace_watchpoint {
	int		pw_index;	/* HW Watchpoint ID (count from 0) */
	lwpid_t		pw_lwpid;	/* LWP described */
	struct mdpw	pw_md;		/* MD fields */
} ptrace_watchpoint_t;

For example amd64 defines MD as follows:
struct mdpw {
	void	*md_address;
	int	 md_condition;
	int	 md_length;
};

These calls are protected with the __HAVE_PTRACE_WATCHPOINTS guard.

Tested on amd64, initial support added for i386 and XEN.

Sponsored by <The NetBSD Foundation>
2016-12-15 12:04:17 +00:00
hannken
4349535165 Change the freelists to lrulists, all vnodes are always on one
of the lists.  Speeds up namei on cached vnodes by ~3 percent.

Merge "vrele_thread" into "vdrain_thread" so we have one thread
working on the lrulists.  Adapt vfs_drainvnodes() to always wait
for a complete cycle of vdrain_thread().
2016-12-14 15:49:35 +00:00
hannken
70ec436e39 Move vnode members "v_freelisthd" and "v_freelist" from "struct vnode"
to "struct vnode_impl" and rename to "vi_lrulisthd" and "vi_lrulist".

No functional change intended.

Welcome to 7.99.48
2016-12-14 15:48:54 +00:00
hannken
13fa9cae25 Remove the "target" argument from vfs_drainvnodes() as it is
always equal to "desiredvnodes" and move its definition
from sys/vnode.h to sys/vnode_impl.h.

Extend vfs_drainvnodes() to also wait for deferred vrele to flush
and replace the call to vrele_flush() with a call to vfs_drainvnodes().
2016-12-14 15:46:57 +00:00
nat
f1631e52a4 Add functions to access device flags. This restores simultaneous audio
open/close.

OK hannken@ christos@
2016-12-09 19:13:47 +00:00
roy
89a9eb7b34 When loading a kernel, test if it's already loaded before authorizing.
This allows us to return EEXIST instead of EPERM for higher secure levels.

My use case was to stop npfctl complaining that it could not load bpfjit
on ERLITE when it was compiled into the kernel.
It then went on to complain that NPF performance would be degraded,
but this is clearly not the case.
2016-12-09 13:06:41 +00:00
christos
cdadd9e0af Avoid duplicate definition when statically linking libc+ssp and rumpkern+ssp. 2016-12-06 02:55:42 +00:00
christos
cf786e11e4 set the signal flag when the signal was sent to every lwp, not to just an
individual one.
2016-12-05 22:07:16 +00:00
christos
1d6d63b6d6 PR/51685: Kamil Rytarowski: Fill sigcontext info in kpsignal2 so that the
debugger/core-dump signal info gets filled in in all code paths (including
the lwp_kill one).
2016-12-04 16:40:43 +00:00
christos
840d624913 Add missing ktrkuser 2016-12-03 22:28:16 +00:00
hannken
f3e32599e8 - Change vcache_reclaim() to always call VOP_INACTIVE() before VOP_RECLAIM().
  When called from vrecycle() or vgone() there is a window where the refcount
  is greater than zero and another thread could get and release a reference
  that would miss VOP_INACTIVE() as the refcount doesn't drop to zero.

  Adjust test fs/puffs/t_basic:  test VOP_INACTIVE count being greater than zero.

- Make vrecycle() more robust by checking v_usecount first and preventing
  further references across vn_lock().  Fixes a deadlock where one thread
  starts unmount, second thread locks a directory and allocates a vnode
  and first thread tries to vrecycle() the directory.
  First thread holds vfs_busy and wants vnode, second thread holds vnode
  and wants vfs_busy.

- With these fixes in place change cleanvnode() to use vget()/vrecycle()
  to reclaim the vnode.
2016-12-01 14:49:03 +00:00
ozaki-r
6f15561386 Fix a race condition of low priority xcall
xc_lowpri and xc_thread are racy and xc_wait may return during/before
executing all xcall callbacks, resulting in a kernel panic at worst.

xc_lowpri serializes multiple jobs by a mutex and a cv. If all xcall
callbacks are done, xc_wait returns and also xc_lowpri accepts a next job.

The problem is that a counter that counts the number of finished xcall
callbacks is incremented *before* actually executing a xcall callback
(see xc_tailp++ in xc_thread). So xc_lowpri accepts a next job before
all xcall callbacks complete and a next job begins to run its xcall callbacks.

Even worse the counter is global and shared between jobs, so if a xcall
callback of the next job completes, the shared counter is incremented,
which misleads xc_wait of the previous job into believing that all xcall
callbacks of the previous job are done, so xc_wait of the previous job returns during/before
executing its xcall callbacks.

How to fix: there are actually two counters that count the number of finished
xcall callbacks for low priority xcall for historical reasons (I guess):
xc_tailp and xc_low_pri.xc_donep. xc_low_pri.xc_donep is incremented correctly
while xc_tailp is incremented wrongly, i.e., before executing a xcall callback.
We can fix the issue by dropping xc_tailp and using only xc_low_pri.xc_donep.

PR kern/51632
2016-11-21 00:54:21 +00:00
christos
ecb08d7cca Add FALLTHROUGH comment 2016-11-19 19:06:12 +00:00
pgoyette
f48fa2dcc1 By popular request, don't bother initializing a static pointer to NULL. 2016-11-18 02:37:33 +00:00
pgoyette
fdd49fc76c Use compile-time initialization for the list head, and make sure that
the sysctllog is also initialized before being used.
2016-11-17 08:06:49 +00:00
pgoyette
a1889144f5 Initialize the bufq code right before we're ready to load the strategy
modules.
2016-11-16 12:31:33 +00:00
pgoyette
219154eeef Define a new module class for the bufq_strategy modules. These need to
be loaded and initialized before autoconfigure runs, since some devices
(like disks and floppy drives) want to call bufq_alloc().
2016-11-16 10:42:14 +00:00
pgoyette
556c690963 Modularize the various bufq strategies 2016-11-16 00:46:46 +00:00
kre
75973081c3 Return the "true" parent's pid as the parent pid (ppid) via the
various sysctl/procfs interfaces that allow it to be interrogated.
(This is rather than the temporary parent's pid when a process is
being traced and has been reparented.)

XXX The ppid in elf32 core files has not been similarly adjusted,
XXX Should it be ?
2016-11-14 08:55:51 +00:00
christos
931a19e8b1 Make p_ppid contain the original parent's pid even for traced processes.
Only change it when we are being permanently reparented to init. Since
p_ppid is only used as a cached value to retrieve the parent's process id
from userland, this change makes it correct at all times. Idea from kre@
Revert specialized logic from getpid/getppid now that it is not needed.
2016-11-13 15:25:01 +00:00
christos
f19994519e back to using SIGSTOP.. 2016-11-12 20:03:17 +00:00
christos
cf7cb04d80 PR/51624: Return the original parent for a traced process. 2016-11-12 19:42:47 +00:00
christos
711ad24258 kern/51621: When attaching to a child send it a SIGTRAP not a SIGSTOP like
Linux and FreeBSD do.
2016-11-11 17:10:04 +00:00
njoly
a9422942bd Adjust clock_nanosleep(2) to not copyout remaining time struct if
TIMER_ABSTIME flag is set.

Ok Christos.
2016-11-11 15:29:36 +00:00
jdolecek
86e8a3aae2 during truncate with wapbl, register deallocation for upper indirect block
before recursing into lower blocks, to make sure that it will be removed after
all its referenced blocks are removed

fixes 'ffs_blkfree_common: freeing free block' panic triggered by
ufs_truncate_retry() when just the upper indirect block registration failed,
code tried to free the lower blocks again after wapbl flush

problem found by hannken@, thank you
2016-11-10 20:56:32 +00:00
christos
b2924f399d GC WOPTSCHECKED, define macros for the select opts and all the valid opts.
The linux compat flags are not part of X/Open.
2016-11-10 17:07:14 +00:00
ozaki-r
8db944330d Add a new sanity check to psref
It checks if a target being acquired is already acquired with
the same psref. It is usable but not lightweight, so enabled
only if DEBUG.
2016-11-09 09:00:46 +00:00
kre
b6732360dd PR kern/51600 ; PR standards/51606
Revert 1.264 - that was intended to fix 51600, but didn't, it just
hid the problem, and caused 51606.  This fixes 51606.

Handle waiting on a process that has been detached from its parent
because of being ptrace'd by some other process.  This fixes 51600.
("handle" here means that the wait() hangs, or with WNOHANG, returns 0,
we cannot actually wait on a process that is not currently an attached
child.)

Note: the detached process waiting is not yet perfect (it fails to
take account of options like WALLSIG and WALTSIG) - support for those
(that is, ignoring a detached child that one of those options will
later cause to be ignored when the process is re-attached) is still missing.

For now, other than when waiting for a specific process ID, when
a process does a wait() sys call (any of them), has no applicable
children attached that can be returned, and has at least one detached
child, then we do a linear search of all processes to look for a
suitable detached child.  This is likely to be slow - but very rare.
Eventually it might be better to keep a list of detached children
per process.
2016-11-09 00:30:17 +00:00
christos
678541356f Return 0 if WNOHANG and no kids. 2016-11-05 02:59:22 +00:00
christos
9b5ab01589 deduplicate the complex lock reparent dance. 2016-11-04 18:14:04 +00:00
christos
e8fde31e58 Cleanup old parent from zombies too. Fixes repeatable panic when we try
to signal the already freed zombie parent after the child exits.
2016-11-04 18:12:06 +00:00
kamil
f26cf4cb48 Prefer modern simple past tense and past participle of catch
The "catched" form is obsolete and nonstandard, prefer "caught".
2016-11-03 22:08:30 +00:00
christos
7bfe2974a7 Fix wrong WIFCONTINUED() status. 2016-11-03 20:58:25 +00:00
hannken
30572e03fd Add a function to print the fields of a vnode including its implementation
and use it from vprint() and vfs_vnode_print().

Move vstate_name() to vfs_subr.c.
2016-11-03 11:04:21 +00:00
hannken
175d720a94 Split sys/vnode.h into sys/vnode.h and sys/vnode_impl.h
- Move _VFS_VNODE_PRIVATE protected operations into vnode_impl.h.
- Move struct vnode_impl definition and operations into vnode_impl.h.
- Include vnode_impl.h where we include vnode.h with _VFS_VNODE_PRIVATE defined.
- Get rid of _VFS_VNODE_PRIVATE.
2016-11-03 11:03:31 +00:00
hannken
4f55676a14 Prepare the split of sys/vnode.h into sys/vnode.h and sys/vnode_impl.h
- Rename struct vcache_node to vnode_impl, start its fields with vi_.
- Rename enum vcache_state to vnode_state, start its elements with VS_.
- Rename macros VN_TO_VP and VP_TO_VN to VIMPL_TO_VNODE and VNODE_TO_VIMPL.
- Add typedef struct vnode_impl vnode_impl_t.
2016-11-03 11:02:09 +00:00
pgoyette
18cd37a864 Remove ptrace_do{,fp}regs - they are a duplicate of process_* routines
which are still in sys_ptrace_common.c.
2016-11-03 03:57:05 +00:00
pgoyette
032607b8f0 Regenerate files for modularization of ptrace(2) 2016-11-02 00:14:11 +00:00
pgoyette
a60b99094c * Split sys/kern/sys_process.c into three parts:
        1 - ptrace(2) syscall for native emulation
        2 - common ptrace(2) syscall code (shared with compat_netbsd32)
        3 - support routines that are shared with PROCFS and/or KTRACE

* Add module glue for #1 and #2.  Both modules will be built-in to the
  kernel if "options PTRACE" is included in the config file (this is
  the default, defined in sys/conf/std).

* Mark the ptrace(2) syscall as modular in syscalls.master (generated
  files will be committed shortly).

* Conditionalize all remaining portions of PTRACE code on a new kernel
  option PTRACE_HOOKS.

XXX Instead of PROCFS depending on 'options PTRACE', we should probably
    just add a procfs attribute to the sys/kern/sys_process.c file's
    entry in files.kern, and add PROCFS to the "#if defineds" for
    process_domem().  It's really confusing to have two different ways
    of requiring this file.
2016-11-02 00:11:59 +00:00
maxv
e18421c86e The mbuf is freed by the protocol even on error, so always NULL the pointer
instead of double-freeing it. Indirectly pointed out by Mootja.
2016-10-31 15:27:24 +00:00
maxv
a8d918182b Memory leak, found by Mootja. By the way, we probably shouldn't be
returning -1 here.
2016-10-31 15:08:45 +00:00
maxv
bee122aa97 Memory leak, found by Mootja. It is easily triggerable from userland. 2016-10-31 15:05:05 +00:00
christos
6f53bbe9e7 Fix arg64 computation for compat_netbsd32 2016-10-28 23:44:32 +00:00
jdolecek
b695bc874e reorganize ffs_truncate()/ffs_indirtrunc() to be able to partially
succeed; change wapbl_register_deallocation() to return EAGAIN
rather than panic when code hits the limit

callers changed to either loop calling ffs_truncate() using new
utility ufs_truncate_retry() if their semantics requires it, or
just ignore the failure; remove ufs_wapbl_truncate()

this fixes possible user-triggerable panic during truncate, and
resolves WAPBL performance issue with truncates of large files

PR kern/47146 and kern/49175
2016-10-28 20:38:12 +00:00
jdolecek
71a8e131fb fixup comment 2016-10-28 20:17:27 +00:00
ozaki-r
8941dc1184 Fix an assertion in _psref_held
The assertion, psref->psref_lwp == curlwp, is valid only if the target
is held by the caller.

Reviewed by riastradh@
2016-10-28 07:27:52 +00:00
skrll
f2ef31cb48 PR kern/51514: ptrace(2) fails for 32-bit process on 64-bit kernel
Updated from the original patch in the PR by me.
2016-10-19 09:44:00 +00:00
skrll
a857ba2662 KNF 2016-10-15 09:09:55 +00:00
skrll
855e4d5be4 Trailing whitespace 2016-10-14 08:38:31 +00:00
skrll
07111ed295 KNF 2016-10-14 08:37:05 +00:00
uwe
c9ab2a37ec Revert to revision 1.249 to undo changes from PR 49636.
Marking up some zeroes with a type suffix, while not marking others in
the very same function does nothing but places cognitive burden on the
reader.

Spelling "clear bits" as "&~" is actually not uncommon (and some say
is more readable).
2016-10-13 19:10:23 +00:00
dholland
d81762cbc9 foo & ~bar, not foo &~ bar. From Henning Petersen in PR 49636. 2016-10-10 01:22:51 +00:00
dholland
a6c9b0f9c4 PR 49636 Henning Petersen: use "0L" to return 0 from a function returning
long, and test its returned value against "0L" instead of "0".

This is not especially necessary, but it's also harmless.
2016-10-10 01:22:08 +00:00
christos
192a00203a Hide MFREE now that it is not being used anymore and provide some debugging
for the location of the last free for debugging kernels.
2016-10-04 14:13:21 +00:00
christos
da90486716 more MFREE -> m_free 2016-10-02 19:26:46 +00:00
jdolecek
9e58801f20 drop wl_mtx mutex during call to pool_get() with PR_WAITOK
pointed out by riastradh
2016-10-02 16:52:27 +00:00
jdolecek
407be399a4 fix off-by-one in wapbl_write_revocations() - when exiting the write loop,
wd gets set to next unwritten record, not last written one as code assumed;
'lost head!' KASSERT is not triggered any more
2016-10-02 16:44:02 +00:00
jdolecek
c69152bd80 wapbl_write_revocations(): fix use-after-free when writing more than one
block worth of revocations, introduced in previous commit; discovered by
Brad Harder on current-users
2016-10-02 14:38:46 +00:00
jdolecek
16c7d9d735 allocate wapbl dealloc registration structures via pool, so that there is more
flexibility with limit handling
2016-10-01 13:15:45 +00:00
christos
c0e5049c21 Require exact credential match; this way even if we su to the original user
that created the session, we won't match his credentials.
2016-10-01 04:42:54 +00:00
christos
4b39133eee Weaken the test a bit to still allow non-root to use TIOCSTI; we need to have
the same creds as the session leader process for the tty session.
2016-10-01 03:46:00 +00:00
christos
f08a5ec0bf Only allow root to use TIOCSTI. Don't eat the kauth error number.
It is unexpected for an unprivileged process to gain privs by
typing to root's tty:

$ cat installer
#!/bin/sh
whoami
/usr/sbin/sti /dev/tty whoami\\n

$ su unprivileged -c ./installer
unprivileged
$ whoami
root
2016-09-29 21:46:32 +00:00
christos
e771ba939e Introduce and use PROC_PTRSZ() to handle differing pointer size 64->32
emulation.
2016-09-29 20:40:53 +00:00
christos
d18e278dd0 Allow sparc kernels to build with SSP by using a constant PAGE_SIZE... 2016-09-29 18:47:35 +00:00
skrll
cf96d30a9f Trailing whitespace 2016-09-23 14:16:32 +00:00
skrll
7b000a7783 Add netbsd32_clock_getcpuclockid2 and netbsd32_wait6 functions 2016-09-23 14:09:39 +00:00
jdolecek
e3cebdd8d5 misplaced comment 2016-09-22 16:22:29 +00:00
jdolecek
d6c67f4b63 store the number of block records per block into wl as wl_brperjblock,
so that it's visible it's same value everywhere; no functional change
2016-09-22 16:20:56 +00:00
maxv
2e04133cf9 This is just a temporary stack that holds fake arguments, and that gets
remapped as RW in sys_execve. Still, in this small window, it does not need
to be executable.
2016-09-17 12:09:22 +00:00
maxv
654592fc2b Use VM_MAXUSER_ADDRESS for proc0, not VM_MAX_ADDRESS. It normally does not
change anything, since kernel processes use the shared kernel map instead
of the one they are given here. For consistency though, it is better to
make sure UVM will not be tempted to access machine-dependent reserved
areas (e.g., the PTE space on x86).
2016-09-17 12:00:34 +00:00
christos
f9ab0c061b move aslr stuff to the aslr section 2016-09-17 02:29:11 +00:00
pgoyette
06402e0a42 Move kern_ctf.c into the dtrace_fbt module (the only place it is used)
rather than including in kernels with KDTRACE_HOOKS defined.  Update
the dtrace_fbt module to depend on the zlib module.

Bump kernel version to avoid module mismatch.

Welcome to 7.99.38 !
2016-09-16 03:10:45 +00:00
christos
cbcfdd13ce oops removed too much 2016-09-15 18:40:34 +00:00
christos
406ea0ab88 Add debugging. 2016-09-15 17:45:44 +00:00
christos
4fddba2c93 m68k binaries load @ pagesize. unbreak. 2016-09-15 17:44:16 +00:00
martin
abb6b48937 Allow emulations to override the creation of ktrace records for posting
signals. In compat_netbsd32 use this to write the 32bit version of
the records, so a 32bit userland kdump is happy.
2016-09-13 07:39:45 +00:00
martin
1766e4eee1 Make the ktrace record written by do_sys_sendmsg/do_sys_recvmsg overridable
by the caller. Use this in compat_netbsd32 to log the 32bit version, so
the 32bit userland kdump is happy.
2016-09-13 07:01:07 +00:00
dholland
273d65f9c5 Build fix for when COREDUMP is turned off, from Ray Phillips in PR 51460. 2016-09-05 17:42:57 +00:00
christos
262a6229a0 don't forget to destroy a cv 2016-09-05 14:13:50 +00:00
christos
1fcaa19698 vsize_t is not always u_long :-) 2016-09-03 12:20:58 +00:00
hannken
b3aa7e069f siggetinfo: use TAILQ_FOREACH_SAFE as the element gets removed from the list. 2016-08-21 15:24:17 +00:00
hannken
7139aab724 Remove now obsolete operation vcache_remove().
Welcome to 7.99.36
2016-08-20 12:37:06 +00:00
hannken
2ec6f651c5 Change vcache_reclaim() to remove vnode from vnode cache once the
vnode was reclaimed from the file system.
2016-08-20 12:33:57 +00:00
hannken
113946c517 Rename vclean() to vcache_reclaim().
No functional change.
2016-08-20 12:31:37 +00:00
christos
7ebc13ffe3 tidy up messages and indentation 2016-08-13 12:05:49 +00:00
maxv
e727235220 The way the kernel tries to prevent a userland process from allocating page
zero is hugely flawed. It is easy to demonstrate that one can trick UVM
into choosing a NULL hint after the user_va0_disable check from uvm_map.
Such a bypass allows kernel NULL pointer dereferences to be exploitable on
architectures with a shared userland<->kernel VA, like amd64.

Fix this by increasing the limit of the vm space made available for
userland processes. This way, UVM will never choose a NULL hint, since it
would be outside of the vm space.

The user_va0_disable sysctl still controls this feature.
2016-08-06 15:13:13 +00:00
christos
c10c4abe0f Realtime signal support from GSoC 2016, Charles Cui. 2016-08-04 06:43:43 +00:00
christos
7549563373 Print the parent module that asked for the builtin to be loaded and failed.
XXX: if a driver is built-in why can't it ask for a filesystem module to
be loaded?
2016-08-04 06:13:15 +00:00
martin
90b40fe3e2 kobj_machdep() needs a chance to modify the loaded code, so move the code
to protect it read-only a bit later.
2016-08-02 12:23:08 +00:00
maxv
607912eebd Don't fail if a module does not have a data or rodata section. Small
modules don't have data.
2016-08-01 15:41:05 +00:00
dholland
585fe4a842 typo in comment 2016-07-31 20:34:04 +00:00
dholland
28ccf570bf In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.
2016-07-31 04:05:32 +00:00
christos
b265873d52 Fix reversed test. 2016-07-30 15:38:17 +00:00
skrll
ac3daeaa4c Bump size of scratchstr - some KASSERTMSGs exceed 256 characters 2016-07-27 09:57:26 +00:00
maxv
ece8cd54ab Split the data+bss+rodata segment in two data+bss and rodata segments. The
latter is made read-only.
2016-07-20 13:36:19 +00:00
maxv
d2c6c6c84f Change the protection of the kernel modules segments once we are done
relocating them. The text is allocated as RWX, and then mprotected to RW.

There is a bug that prevents us from doing RW->RX on amd64 and perhaps
sparc64. On x86, the pmap waits for the page to fault before granting it
the X permission. But in the trap handler, such a page is considered as
belonging to kernel_map, while it actually belongs to module_map. The
kernel then finds out the page is not present in kernel_map, and panics.
In all cases, module_map is non pageable, so even if the trap were handled
properly, it still wouldn't work.

Therefore, there is a small window in which the segment is RWX. But that's
fine enough, for now.
2016-07-20 13:11:58 +00:00
msaitoh
207265e875 Print number of attach errors regardless of AB_QUIET and AB_SILENT. 2016-07-19 07:44:03 +00:00
pgoyette
8665318c03 Also, don't hard-code the function name in the message; use __func__ 2016-07-15 01:17:47 +00:00
pgoyette
abd2da1923 As suggested by christos@, use KASSERTMSG() 2016-07-15 01:13:10 +00:00
pgoyette
097a241ddc Remove a call to panic() which duplicates the subsequent KASSERT()!
XXX Since everything has (or should have) been switched to dev_t, we
XXX could probably remove the check for
XXX
XXX	ca->ca_devsize >= sizeof(struct device)
XXX
XXX But someone ought to check on that first!

Reviewed by riastradh@
2016-07-14 21:57:06 +00:00
christos
20a2c0a7f7 make sure we cleanup properly when fd is too big. 2016-07-14 18:16:51 +00:00
christos
1c128d4498 From tedu at openbsd:
kevent validates that ident is a valid fd by getting the file. one sad
quirk: uint64 to int32 truncation can lead to false positives, and then
later in the array sizing code, very big mallocs panic the kernel.
add a check that the ident isn't larger than INT_MAX in the fd case.
reported by Tim Newsham
2016-07-14 06:22:17 +00:00
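The truncation described above can be illustrated with a minimal userland sketch. It is not the kernel's code: the helper names `sketch_ident_is_valid_fd` and `sketch_truncate_ident` are hypothetical, and the only facts taken from the commit message are that the ident is 64 bits wide, the descriptor is a 32-bit int, and the fix rejects idents larger than INT_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether a 64-bit kevent ident may be
 * treated as a file descriptor. The comparison happens before any
 * narrowing, so the high bits cannot be silently dropped.
 */
static bool
sketch_ident_is_valid_fd(uint64_t ident)
{
    return ident <= (uint64_t)INT_MAX;
}

/*
 * What an unchecked narrowing does: only the low 32 bits survive,
 * so a huge ident can alias a small, valid descriptor number.
 */
static int
sketch_truncate_ident(uint64_t ident)
{
    return (int)(uint32_t)ident;
}
```

For example, the ident 0x100000003 narrows to descriptor 3, which may well be open, while the range check rejects it before any lookup or array sizing takes place.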
njoly
84b8b47bee In dosetrlimit() round stack hard limit just like soft one.
Avoid cases where hard limit becomes smaller than soft limit.
2016-07-13 09:52:00 +00:00
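A rough sketch of why both limits need the same rounding, under stated assumptions: the rounding is taken to be "up to a page boundary", the page size is fixed at 4096 bytes, and `sketch_rlimit`, `sketch_round_page` and `sketch_round_stack_limit` are hypothetical names rather than the identifiers used in dosetrlimit().

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */

struct sketch_rlimit {
    uint64_t cur;   /* soft limit */
    uint64_t max;   /* hard limit */
};

/* Round v up to the next multiple of the page size (no overflow handling). */
static uint64_t
sketch_round_page(uint64_t v)
{
    return (v + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE * SKETCH_PAGE_SIZE;
}

/* Round both limits the same way so that cur <= max still holds afterwards. */
static void
sketch_round_stack_limit(struct sketch_rlimit *lim)
{
    assert(lim->cur <= lim->max);
    lim->cur = sketch_round_page(lim->cur);
    lim->max = sketch_round_page(lim->max);
    assert(lim->cur <= lim->max);
}
```

With a soft limit of 5000 and a hard limit of 6000, rounding only the soft limit would yield 8192 against an unchanged 6000. Rounding both yields 8192 for each, which keeps the soft limit within the hard limit.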
msaitoh
6399f1a6ef KNF. No functional change. 2016-07-11 07:42:13 +00:00
maxv
6c1bb9a544 When loading a module from VFS and from the bootloader, the kernel packs
up the module segments into one big RWX chunk. Split this chunk into two
different text and data+bss+rodata chunks. The latter is made non-
executable. This also provides some kind of ASLR, since the chunks are
not necessarily contiguous.
2016-07-09 07:25:00 +00:00
maxv
e169fdcc18 Force the kernel to dynamically reallocate the preloaded modules. 2016-07-08 08:55:48 +00:00
msaitoh
8bc54e5be6 KNF. Remove extra spaces. No functional change. 2016-07-07 06:55:38 +00:00
ozaki-r
058d974b09 Add HASH_PSLIST (pslist(9)) type for hashinit() 2016-07-06 05:20:48 +00:00
pgoyette
9969e67634 Don't declare module_verbose_on or module_autoload_on static. It is useful
for these variables to be global, so they can be modified by ddb(4)
(entered via "boot -d") early in startup.
2016-07-04 23:55:54 +00:00
maxv
99d1152db6 Make the execution flow canonical instead of jumping back and forth, and
complete the userland check.
2016-07-04 07:56:07 +00:00
knakahara
850de3d9a9 revert kern_softint.c:r1.42 (which was an incorrect fix)
gif(4) has violated softint(9) contract. That is fixed by previous 2 commits.
see:
    https://mail-index.netbsd.org/tech-kern/2016/01/12/msg019993.html
2016-07-04 04:20:14 +00:00
christos
65120c51fa regen 2016-07-03 14:26:47 +00:00
christos
7cf7644fc7 GSoC 2016 Charles Cui: Implement thread priority protection based on work
by Andy Doran. Also document the get/set pshared thread calls as not
implemented, and add a skeleton implementation that is disabled.
XXX: document _sched_protect(2).
2016-07-03 14:24:58 +00:00
maxv
5852a7fce9 Ensure the restartable atomic sequence is in userland, for real. 2016-07-01 12:49:22 +00:00
christos
4cfa4299d0 PR/51277: Fix compat32 coredumping that broke with the aux vector note
addition.
2016-06-27 01:46:04 +00:00
pgoyette
1cd7e75a4d Simplify insertion of newly-activated modules into the list. There's no
good reason to treat modules without dependencies differently from those
which do require other modules.
2016-06-24 23:04:09 +00:00
skrll
70699ce203 Fix UVMHIST builds for kernels that don't include usb 2016-06-23 07:32:12 +00:00
pgoyette
b491c2af6f When importing modules from the boot loader we should check for duplicate
module names both in the built-in list and in the list of previously
"pushed" modules.

While here, delay allocating the new 'struct module' until we've passed
the duplicate-name checks.
2016-06-23 04:41:03 +00:00
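The duplicate-name check described above could look roughly like the following userland sketch. The lists are modelled as plain arrays of strings, and `sketch_name_in_list` and `sketch_module_is_duplicate` are hypothetical names. The kernel's actual data structures and functions are not shown in the commit message and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if name matches any of the n entries in list. */
static bool
sketch_name_in_list(const char *name, const char *const *list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, list[i]) == 0)
            return true;
    }
    return false;
}

/*
 * A module pushed by the boot loader is a duplicate if its name is
 * already present among the built-in modules or among the modules
 * pushed earlier. Only after this check passes would a new module
 * structure be allocated.
 */
static bool
sketch_module_is_duplicate(const char *name,
    const char *const *builtin, size_t nbuiltin,
    const char *const *pushed, size_t npushed)
{
    assert(name != NULL);
    return sketch_name_in_list(name, builtin, nbuiltin) ||
        sketch_name_in_list(name, pushed, npushed);
}
```

Running the check before allocation means the duplicate path has nothing to free, which is the point of the "delay allocating" remark.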
skrll
0895dad130 KNF. Sort includes 2016-06-22 07:44:02 +00:00
christos
f4c1c0d146 put back commented out name resolution code that was gc'ed after previous
refactoring.
2016-06-20 19:14:35 +00:00
knakahara
69c0ff04b9 apply if_start_lock() to L2 callers which call ifp->if_start() of device drivers 2016-06-20 08:30:58 +00:00
bouyer
01a30830e3 Add a new config_detach() flag, DETACH_POWEROFF, which is set when
detaching devices at shutdown time with RB_POWERDOWN.
When detaching wd(4), put the drive in standby before detach
for DETACH_POWEROFF.
Fix PR kern/51252
2016-06-19 09:35:06 +00:00
pgoyette
1e522d74f9 Check for duplicate module names before loading modules that were
"pushed" by the boot loader.  The boot loader pushes the module
name for the root file system (unless the root file system is ffs)
even if the file system module is built into the kernel.  When
this happens, we get a lot of "redefined symbol" error messages.

This fix does not alter the behavior of pushing the file system
name.  It simply avoids the redefined symbol errors by detecting
that the module is already built-in to the kernel and not trying
to load another copy.

While here, differentiate the error message text between "failed
to load" and "failed to fetch_info" conditions.

Addresses PR kern/50357
2016-06-16 23:09:44 +00:00
ozaki-r
e1135cd9b9 Use curlwp_bind and curlwp_bindx instead of open-coding LP_BOUND 2016-06-16 02:38:40 +00:00
christos
ea2913a0a2 GSoC 2016: Charles Cui: Add timer related macros
    _POSIX_CPUTIME
    _POSIX_THREAD_CPUTIME
    _POSIX_DELAYTIMER_MAX
2016-06-10 23:29:20 +00:00
christos
0196f35dd1 GSoC 2016: Charles Cui: add SEM_NSEMS_MAX 2016-06-10 23:24:33 +00:00
ozaki-r
fe6d427551 Avoid storing a pointer of an interface in a mbuf
Having a pointer of an interface in a mbuf isn't safe if we remove big
kernel locks; an interface object (ifnet) can be destroyed anytime in any
packet processing and accessing such object via a pointer is racy. Instead
we have to get an object from the interface collection (ifindex2ifnet) via
an interface index (if_index) that is stored in an mbuf instead of a
pointer.

The change provides two APIs: m_{get,put}_rcvif_psref that use psref(9)
for sleep-able critical sections and m_{get,put}_rcvif that use
pserialize(9) for other critical sections. The change also adds another
API called m_get_rcvif_NOMPSAFE, which is NOT MP-safe and exists for the
transition period, i.e., it is intended for places that are not planned
to be MP-ified soon.

The change adds some overhead due to psref to performance sensitive paths,
however the overhead is not serious, 2% down at worst.

Proposed on tech-kern and tech-net.
2016-06-10 13:31:43 +00:00
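The index-instead-of-pointer idea can be sketched in a single-threaded userland model. Everything here is hypothetical (`sketch_ifnet`, `sketch_mbuf`, `sketch_index2ifnet`, `sketch_get_rcvif`), and the sketch deliberately leaves out the psref(9) and pserialize(9) protection that the real m_{get,put}_rcvif APIs provide. It only shows why a stored index fails safely where a stored pointer would dangle.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_IF 8 /* assumed table size */

struct sketch_ifnet {
    unsigned index;
    const char *name;
};

struct sketch_mbuf {
    unsigned rcvif_index;   /* 0 means "no receiving interface" */
};

/* Interface collection, indexed by interface index. */
static struct sketch_ifnet *sketch_index2ifnet[SKETCH_MAX_IF];

static void
sketch_if_attach(struct sketch_ifnet *ifp)
{
    assert(ifp->index > 0 && ifp->index < SKETCH_MAX_IF);
    sketch_index2ifnet[ifp->index] = ifp;
}

static void
sketch_if_detach(struct sketch_ifnet *ifp)
{
    sketch_index2ifnet[ifp->index] = NULL;
}

/* Look the interface up on every use; NULL if it has gone away. */
static struct sketch_ifnet *
sketch_get_rcvif(const struct sketch_mbuf *m)
{
    if (m->rcvif_index == 0 || m->rcvif_index >= SKETCH_MAX_IF)
        return NULL;
    return sketch_index2ifnet[m->rcvif_index];
}
```

After a detach, the lookup returns NULL and the caller can drop the packet. In the kernel, the reference obtained from the lookup must additionally be held under psref or pserialize for as long as the interface is used.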
ozaki-r
d938d837b3 Introduce m_set_rcvif and m_reset_rcvif
The API is used to set (or reset) a received interface of a mbuf.
They are counterpart of m_get_rcvif, which will come in another
commit, hide internal of rcvif operation, and reduce the diff of
the upcoming change.

No functional change.
2016-06-10 13:27:10 +00:00
christos
b035b9b913 fix variable name 2016-06-09 00:17:45 +00:00
christos
3aa7fc217c ignore EACCES 2016-06-08 23:55:24 +00:00
palle
3958153370 Added missing "it" to comment in start_init() 2016-06-04 21:10:56 +00:00
pgoyette
7bdbb58b22 Add a new kern.messages sysctl to allow kernel message verbosity to be
altered after boot.

Fixes PR kern/46539 using patch submitted by Nat Sloss.
2016-05-31 05:44:19 +00:00
pgoyette
b847d6b87c Compare names of duplicate symbols properly, so we correctly return
an error status.

Fixes PR kern/45125 with patch supplied by Akinobu Mita
2016-05-31 03:57:04 +00:00
martin
5fc637a54d David Binderman in PR kern/51189: simplify loop conditions 2016-05-30 11:24:40 +00:00
christos
ff63d49891 fix compilation without PAX_MPROTECT 2016-05-27 16:35:16 +00:00
hannken
40d12c0185 Use vnode state to replace VI_MARKER, VI_CHANGING, VI_XLOCK and VI_CLEAN.
Presented on tech-kern@
2016-05-26 11:09:55 +00:00
hannken
1e17b1e3c2 Add vnode state and supporting operations and diagnostics.
Presented on tech-kern@
2016-05-26 11:08:44 +00:00
hannken
c9685569a3 Merge the vnode and its corresponding vcache_node into one
vcache_node structure.

Print the vcache_node part in vprint() and vfs_vnode_print().

Presented on tech-kern@
2016-05-26 11:07:33 +00:00
wiz
692b4b1e95 Consistent indent. 2016-05-25 20:49:00 +00:00
christos
5763e378f2 Give 0,1,2 for security.pax.mprotect.ptrace and make it default to 1
as documented in sysctl(7):
0 - ptrace does not affect mprotect
1 - (default) mprotect is disabled for processes that start executing from
    the debugger (being traced)
2 - mprotect restrictions are relaxed for traced processes
2016-05-25 20:07:54 +00:00
christos
19ea743456 Introduce security.pax.mprotect.ptrace sysctl which can be used to bypass
mprotect settings so that debuggers can write to the text segment of traced
processes so that they can insert breakpoints. Turned off by default.
Ok: chuq (for now)
2016-05-25 17:43:58 +00:00
christos
cd1c56e89e randomize the location of the rtld. 2016-05-25 17:25:32 +00:00
martin
f3944df18c Effectively disable aslr for non-topdown-VA binaries (unless they are
compat32, which we deal with properly). It would be possible to get
those working too, but it is not worth the code complexity.

This makes binaries compiled with -mcmodel=medlow (and ancient binaries)
work again on sparc64, smoothing the upgrade path.

ok: christos
2016-05-24 17:30:01 +00:00
christos
9d95ecedc7 Add a note for the auxv array so we can find our load location from a
core file of a PIE binary.
2016-05-24 00:49:55 +00:00
tls
1331d5da97 Fix a longstanding problem with accept filters noticed by Timo Buhrmester:
sockets sitting in the accept filter can consume the entire listen queue,
such that the application is never able to handle any connections.  Handle
this by simply passing through the oldest queued cxn when the queue is full.

This is fair because the longer a cxn lingers in the queue (stays connected
but does not meet the requirements of the filter for passage) the more likely
it is to be passed through, at which point the application can dispose of it.

Works because none of our accept filters actually allocate private state
per-cxn.  If they did, we'd have to fix the API bug that there is presently
no way to tell an accf to finish/deallocate for a single cxn (accf_destroy
kills off the entire filter instance for a given listen socket).
2016-05-23 13:54:34 +00:00
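The "pass through the oldest queued cxn when the queue is full" policy can be modelled with a small ring buffer. This is a sketch under assumptions, not the socket code: connections are represented by non-negative integer ids, the queue limit `SKETCH_HELD_MAX` is arbitrary, and `sketch_held_queue` and `sketch_held_enqueue` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_HELD_MAX 4   /* assumed listen queue limit */

struct sketch_held_queue {
    int conn[SKETCH_HELD_MAX];  /* ids of connections held by the filter */
    size_t head;                /* position of the oldest entry */
    size_t count;               /* number of held connections */
};

/*
 * Queue a new connection behind the accept filter. If the queue is
 * already full, release the oldest held connection to the application
 * and return its id; otherwise return -1.
 */
static int
sketch_held_enqueue(struct sketch_held_queue *q, int conn)
{
    int released = -1;

    assert(conn >= 0);
    if (q->count == SKETCH_HELD_MAX) {
        released = q->conn[q->head];
        q->head = (q->head + 1) % SKETCH_HELD_MAX;
        q->count--;
    }
    q->conn[(q->head + q->count) % SKETCH_HELD_MAX] = conn;
    q->count++;
    return released;
}
```

Because the released connection is always the one that has waited longest, a connection that never satisfies the filter eventually reaches the application instead of occupying a queue slot indefinitely.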
christos
b039ee7763 reduce #ifdef mess caused by PaX 2016-05-22 14:26:09 +00:00
christos
2b0df44082 Account for the VA hole differently (simpler) 2016-05-22 01:09:09 +00:00
riastradh
b93e5db80e Use rnd_getmore as intended. No more essay needed here.
Workaround for buffering got pushed into rnd_getmore, closer to the
actual cause of the problem.
2016-05-21 15:33:40 +00:00
riastradh
77ebf39786 Ask on-demand entropy sources to produce enough data to fill buffer.
Remainder of fix for PR kern/51135: if there is an entropy source
that can produce arbitrarily much data, as in rump, then nothing
should ever block indefinitely waiting for data.
2016-05-21 15:27:15 +00:00
christos
142afa09a8 fix for ILP32. 2016-05-19 21:39:15 +00:00
riastradh
950b6c0b3d Replace deprecated disabled code by comment
describing what it intends to do, and why it won't work yet

From coypu.
2016-05-19 18:32:29 +00:00
hannken
a68d62d64c Keep the old vcache node on rekey. Change its key and remove the
new vcache node now used as placeholder only.
2016-05-19 14:50:18 +00:00
hannken
ed5aa2cef9 Change "ISSET(vp->v_iflag, VI_XLOCK)" to "vdead_check(vp, VDEAD_NOWAIT)". 2016-05-19 14:48:28 +00:00
hannken
4222e592dd Add VFS_VNODE_PRIVATE protected operations vnalloc_marker() to create,
vnfree_marker() to destroy and vnis_marker() to test for marker vnodes.

Make operations vnalloc() and vnfree() local to vfs_vnode.c.
2016-05-19 14:47:33 +00:00
christos
f2f81db6f6 Hook to clamp the random value for mmap for machines that don't have enough
VA bits.
2016-05-17 00:38:50 +00:00
christos
2a096139aa only print debugging info if we are actually going to change the permission. 2016-05-14 17:04:09 +00:00