through all LWPs and duplicate locking overhead.
- Move sched_pstats() from soft-interrupt context to process 0 main loop.
Avoids blocking effect on real-time threads. Mostly fixes PR/38792.
Note: it might be worth to move the loop above PRI_PGDAEMON. Also,
sched_pstats() might be cleaned-up slightly.
really need are:
0) provide VOP_OP in the alternate RUMP_VOP_OP namespace
and for each op:
1) schedule rump cpu
2) call VOP_OP
3) unschedule rump cpu
While here, take the opportunity to get rid of _t lossage in the
rump-exported interfaces.
a second time at the end of the file. Adjust whitespace for the
sheer functional joy of it.
(i hope i didn't ruin someone's joke by missing a humorous implication
that all vnode operations are considered a little special)
vdesc_transports from vnodeop_desc until we have a "not not yet"
situation.
Ride 5.99.27 bump (full build still in progress. i wanted to get
this in as soon as possible to most effectively ride the bump.)
debug threaded live apps: Add an optional lwpid in PT_STEP and PT_CONTINUE to
indicate which lwp to operate on, and implement the glue required to make it
work.
also only use 16 u_shorts instead of 32 ints. Also add panic()
calls for under- and overflow of the ks_active members under
DIAGNOSTIC. The MAXBUCKET constant ended up in sys/mallocvar.h
and not sys/param.h, as the latter caused build problems.
Ride the kernel revision bump of my previous change.
per size, and make vmstat report this information under the "Memory
statistics by type" display, which is only printed when the kernel
has been compiled with KMEMSTATS defined, like this:
Memory statistics by type Type Kern
Type InUse MemUse HighUse Limit Requests Limit Limit Size(s)
wapbl 15 4192K 4192K 78644K 376426 0 0 32:0,256:3,512:6,131072:1,262144:2,524288:3
Since struct malloc_type is user-visible and is changed, bump kernel
revision to 5.99.26.
While it is true that malloc(9) is in general on the path of slowly
being replaced by kmem(9) (kmem_alloc/kmem_free), there remains a
lot of points of usage of malloc/free, and this could aid in finding
any leaks. (It helped finding the leak fixed in PR#42661.)
This was discussed with and somewhat hestitantly OKed by rmind@
driver/attach/data typically present and once some locking is grown
in here, these routines can be made to fail or succeed a component
attachment/detachment atomically.
respect the alignment in the ELF phdr.
Also, for correctness, use the maximum alignment of the PT_LOAD
sections rather than just the first one found.
Also, use more meaningful types.
consistently across the code.
- Re-do note parsing code to read the section headers instead of the program
headers because the new binutils merge all the note sections in one program
header. This fixes all the pax note parsing which has been broken for all
binaries built with the new binutils.
- Add diagnostics to the note parsing code to detect malformed binaries.
- Allocate and free note scratch space only once, not once per note.
it'll get spammy.
XXX: this should probably be printed iff the toplevel module is
not being autoloaded (i.e. there is a human to interpret the error).
Otherwise disabled dependencies give a misleading EPERM.
at boot.
Add a ksyms_mod_foreach() function to iterate a callback function over the
set of elf symbols for a specific module (netbsd included).
Add kern_ctf.c and mod_ctf_get() to allow the retrieval and decompression
of CTF sections for a specific module.
instead of using linksets directly. This has two implications:
1) It is now possible to "unload" a builtin module provided it is
not busy. This is useful e.g. to disable a kernel feature as
an immediate workaround to a security problem. To re-initialize
the module, modload -f <name> is required.
2) It is possible to use builtin modules which were linked at
runtime with an external linker (dlopen + rump).
conditionally build DEV_BSIZE adjustments for kernel. fsck_ffs shares
the same code but accesses physical blocks.
Also compute correct block numbers for each physical sector.
to the total free memory available to the system, use the smallest value
between VM_MAXUSER_ADDRESS and total free memory (having a RSS limit
bigger than VM_MAXUSER_ADDRESS has no real meaning).
Fix a possible int overflow when ptoa(uvmexp.free) is bigger than 4GB
with a 32 bits vaddr_t.
This change is similar to the one made in rev 1.144 of uvm/uvm_glue.c.
and non-const types, and the kernel uses both const and non-const
PMF qualifiers and device suspensors, so change the pmf_qual_t and
device_suspensor_t typedefs from "pointers to const" to non-pointer,
non-const types.
Also change access to filesystem blocks to be done by fragment instead
of by physical block. Fragments are the fundamental blocks of the
filesystem.
For a theoretical filesystem that accesses the disk in smaller units
than stored in mp->mnt_fs_bshift, the assumption might be wrong. But
this will also break other subsystems. The value mp->mnt_dev_bshift
which formerly represents the physical sector size is currently only
virtual in NetBSD (always DEV_BSIZE).
than 0. This is still not the intent of PIE, but it allows them to
run with VA 0 disabled.
(The PAX_ASLR stuff which should deal with this needs work.)
CV: ----------------------------------------------------------------------
DTrace adds a pointer to the lwp and proc structures which it uses to
manage its state. These are opaque from the kernel perspective to keep
the kernel free of CDDL code. The state arenas are kmem_alloced and freed
as proccesses and threads are created and destoyed.
Also add a check for trap06 (privileged/illegal instruction) so that
DTrace can check for D scripts that may have triggered the trap so it
can clean up after them and resume normal operation.
Ok with core@.
out of the way before trying to get a unit number. If we cannot
get a unit number, we call config_devfree(), which expects for
fields such as dv_flags, dv_cfattach, and dv_private to be initialized.
on the amount of physical memory and limited by NMBCLUSTERS if present.
Architectures without direct mapping also limit it based on the kmem_map
size, which is used as backing store. On i386 and ARM, the maximum KVA
used for mbuf clusters is limited to 64MB by default.
The old default limits and limits based on GATEWAY have been removed.
key_registered_sb_max is hard-wired to a value derived from 2048
clusters.
Invert the sense of the bit to mark if LOCKDEBUG is enabled to
disabled.
This will help my fellow developers spot "use before initialised"
problems that hppa picks up very well.
but fix the !LOCKDEBUG case by defining the "no debug" bits to zero so
they have no effect on lock stubs.
Invert the sense of the bit to mark if LOCKDEBUG is enabled to disabled.
This will help my fellow developers spot "use before initialised" problems
that hppa picks up very well.
It has to be done differently, because the semantics of mtx_owner in the non-
LOCKDEBUG case can vary significantly between archs, and thus it is not
possible to simply flip a bit to 1.
Ok core@, as at least i386 is unbootable right now.
seminit() calls exithook_establish(). exithook_establish() uses the exec_lock.
exec_lock is initialzed by exec_init(1).
Call exec_init(1) before seminit().
into subr_device.c instead of having them in subr_autoconf.c.
Since none of the copyrights in subr_autoconf.c really match the
history of device accessors, I took the liberty of slapping (c)
2006 TNF onto subr_device.c.
well have triggered the recursive panic.
Fix the comment for panic() to reflect now-current reality: the code
was already changed never to sync() on panic(), now we avoid dumping
core on a recursive panic.
__assert -> kern_assert
__sigtimedwait1 -> sigtimedwait1
__wdstart -> wdstart1
The rest are MD and/or shared with userspace, so they will require
a little more involvement than what is available for this quick
"ride the 5.99.24 bump" action.
#if NBPFILTER is no longer required in the client. This change
doesn't yet add support for loading bpf as a module, since drivers
can register before bpf is attached. However, callers of bpf can
now be modularized.
Dynamically loadable bpf could probably be done fairly easily with
coordination from the stub driver and the real driver by registering
attachments in the stub before the real driver is loaded and doing
a handoff. ... and I'm not going to ponder the depths of unload
here.
Tested with i386/MONOLITHIC, modified MONOLITHIC without bpf and rump.
priority level where the kernel accesses alldevs is IPL_VM, where
some hardware interrupt handlers call config_deactivate(9). Lower
the IPL of alldevs_mtx from IPL_HIGH to IPL_VM, accordingly.
subroutines config_alldevs_enter() and config_alldevs_exit(). This
change amounts to textual substitution. No functional change intended.
We do not collect garbage in device_lookup(), so there is no use dumping
it: get rid of the garbage list. Do not call config_dump_garbage().
In device_lookup_private(), call device_lookup() instead of duplicating
the code from device_lookup().
(for an unprivileged user) to force vfs modules to remain loaded
forever. Also, it's possible for an admin with fat fingers to have
to curse out loud (a lot) and reboot.
.. or at least fix things as much as seems to be possible without
involving 1000 zorkmids. do_sys_mount() takes either struct vfsops
(which hopefully came properly referenced) or a userspace string
for file system type. The standard in-kernel calling convention
of "do_sys_mount(l, vfs_getopsbyname("nfs"), NULL," is not to be
considered healthy, kosher, or even tasty (although if vfs_getopsbyname()
fails the whole thing *currently* fails without the program counter
pointing to hyperspace).
Another process could be vget()ing the vnode and bump v_usecount while
getcleanvnode() is vclean()ing it (as vclean drops the interlock).
vget() will then wait for VI_XLOCK or VI_FREEING to clear; and we could test
this assertion while the other process is still slepping. We could even
end up in ungetnewvnode() before this other process got a chance to run.
char *cpu_name(struct cpu_info *);
and use it when setting up the runq event counters, avoiding an 8 byte
kmem(4) allocation for each cpu. there are more places the cpuname is
used that can be converted to using this new interface, but that can
and will be done as future work.
as discussed with rmind.
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has creeped
into the tree since then. Nuke the macros and convert all callsites
to lowercase.
no functional change
duplicating code.
Per suggestions by rmind@:
Simplify some code that used "empty statements," ";".
Don't collect garbage in device_lookup{,_private}(), since they
are called in interrupt context from certain drivers.
Make config_collect_garbage() KASSERT() that it does not run in
interrupt or software-interrupt context.
initialized.
Ignore also pool_allocator_lock while the system is in cold state.
When the system has left cold state, uvm_init() should have
also initialized the pool subsystem and the mutexes are
ready to use.
This failed on archs where a mutex isn't initialized to a zero
value.
Defer allocation of pool log to the logging action, if allocation
fails, it will be retried the next time something is logged.
Clear pool log on allocation so that ddb doesn't crash when showing
so far unused log entries.
'veriexec_op_lock'. Triggering a panic is possible in the path from
veriexec_openchk() (easily repeatable). The two switch cases at the
bottom of the function are going to panic anyway, but they might as well
panic as they're intended to as opposed to tripping over a locking
violation...
from when I first implemented it as "metahook."
Cleanup a lot of the mess by unifying variable names, add struct member
prefixes, adjust comments, etc.
No functional change intended.
as we don't have a process context to authorize on. Instead, traverse the
file descriptor table of each process -- as we already do in one case.
Introduce a "marker" we can use to mark files we've seen in an iteration, as
the same file can be referenced more than once.
Hopefully this availability of filtering by process also makes life easier
for those who are interested in implementing process "containers" etc.
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567
Allocate ksiginfo on stack since it is safe and sigget() assumes that it is
not allocated from pool (pending signals via sigput()/sigget() "mill" should
be dynamically allocated, however). Might be useful to revisit later.
Likely the cause of PR/40750 and indirect cause of PR/39283.
collecting garbage in two phases: in the first stage, with
alldevs_mtx held, gather all of the objects to be freed onto a
list. Drop alldevs_mtx, and in the second stage, free all the
collected objects.
Also per rmind@'s suggestion, remove KASSERT(!mutex_owned(&alldevs_mtx))
throughout, it is not useful.
Find a free unit number and allocate it for a new device_t atomically.
Before, two threads would sometimes find the same free unit number
and race to allocate it. The loser panicked. Now there is no
race.
In support of the changes above, extract some new subroutines that
are private to this module: config_unit_nextfree(), config_unit_alloc(),
config_devfree(), config_dump_garbage().
Delete all of the #ifdef __BROKEN_CONFIG_UNIT_USAGE code. Only
the sun3 port still depends on __BROKEN_CONFIG_UNIT_USAGE, it's
not hard for the port to do without, and port-sun3@ had fair warning
that it was going away (>1 week, or a few years' warning, depending
how far back you look!).
Only sleep once within each pipe_read/pipe_write call.
If there is no data/space available after we wakeup return ERESTART so
then the 'fd' number is validated again.
A simple broadcast of the cvs is then enough to evict the correct threads
when close() is called from an active thread.
Only one fd needs clobbering, not all fds that reference the pipe.
This may be what ad@ realised when he tried to add the same code to
sockets. Unfixes part of PR/26567.
For each syscall, add a flag for the return value or an argument indicating
that it is a 64-bit argument. Also include the number of 64-bit arguments.
In theory this could get most of the code in compat/netbsd32/netbsd32_netbsd.c
but not at the moment due to multiply defined structures.
open files is rather gross - the poll map isn't required to be dense.
Instead limit to a much larger value (1000 + dt_nfiles) so that user
programs cannot allocate indefinite sized blocks of kvm.
If the limit is exceeded, then return EINVAL instead of silently truncating
the list.
(The silent truncation in select isn't quite as bad - although even there
any high bits that are set ought to generate an EBADF response.)
Move the code that converts ERESTART and EWOULDBLOCK into common code.
Effectively fixes PR/17507 since the new limit is unlikely to be detected.
struct timeval passed to todr_gettime(9) and todr_settime(9).
We no longer have an ancient and volatile struct timeval `time'
global since we have switched to MI timercounter(9) on all port.
XXX1: some of these RTC drivers still assume 32bit time_t
XXX2: some of these should be rewritten to use todr_[gs]ettime_ymdhms()
XXX3: todr(9) man page doesn't mention todr_[gs]ettime_ymdhms()
check the signal number to be in the allowed range. An invalid
signal number could crash the kernel by overflowing the sigset_t
array.
More checks would be good, and SIGEV_THREAD shouldn't be dropped
silently, but this fixes at least the local DOS vulnerability.
-an invalid signal number passed to mq_notify(2) could crash the kernel
on delivery -- add a boundary check
-mq_receive(2) from an empty queue crashed the kernel by NULL dereference
in timeout calculation -- handle the NULL case
-likewise for mq_send(2) to a full queue
-a user could set mq_maxmsg (the maximal number of messages in a queue)
to a huge value on mq_open(O_CREAT) and later use up all kernel
memory by mq_send(2) -- add a sysctl'able limit which defaults
to 16*mq_def_maxmsg
(mq_notify(2) should get some more checks, and SIGEV_* values other
than SIGEV_SIGNAL should be handled somehow, but this doesn't look
security critical)
do drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.
illegal. I examined all places where lbolt is referenced to make
sure there were pointer aliases of it passed to tsleep, but put a
KASSERT in m/ltsleep() just to be sure.
can sleep on the vnode lock, while vget is sleeping on the
VI_INACTNOW flag (or the vget caller is looping on vget returning failure
because of the VI_INACTNOW flag). With layered FSes, the upper and lower
vnodes share the same lock, so the vget() caller above can be already
holding the vnode lock.
Fix by dropping VI_INACTNOW before sleeping on the vnode lock in
vrelel(), and check the ref count again once we have the lock. If the
vnode has more than one reference, donc VOP_INACTIVE it.
Fix PR kern/42318 and PR kern/42377
patch tested by Hisashi T Fujinaka, Joachim König, Stephen Borrill and
Matthias Scheler.
Part 1 of N:
There is not an MI ordering of interrupt priority levels,
so use == IPL_HIGH and != IPL_HIGH instead of >= IPL_HIGH
and < IPL_HIGH. Ignore 'cold' and always use curcpu(),
since cpu_info_primary is MD.
Other changes:
There is no need to create symbols named _spldebug_* and
strong aliases to them. Just use symbols spldebug_*,
instead. Use a temporary variable instead of repeat
cpu_index(9) calls. KASSERT() that cpu_index(9) is <
MAXCPUS.
This allows use of subr_disk_mbr on all archs. Default to it for
the rump disk component. No functional change for regular kernels.
(The other option would've been to include dkbad in disklabels
everywhere, but arguably this approach has less possible side-effects,
especially given that wedges and related magic will take over the
world any second now).
support, i.e. move vfs functionality to a separate module
(kern_module_vfs.c)
* make module proplist size an MI constant (now 8k) instead of PAGE_SIZE
* change some error values to something else than the karmic EINVAL
I applied this patch with Coccinelle's semantic patch tool, spatch(1).
I installed Coccinelle from pkgsrc: devel/coccinelle/. I wrote
tailq.spatch and kdefs.h (see below) and ran this command,
spatch -debug -macro_file_builtins ./kdefs.h -outplace \
-sp_file sys/kern/tailq.spatch sys/kern/subr_autoconf.c
which wrote the transformed source file to /tmp/subr_autoconf.c. Then I
used indent(1) to fix the indentation.
::::::::::::::::::::
::: tailq.spatch :::
::::::::::::::::::::
@@
identifier I, N;
expression H;
statement S;
iterator name TAILQ_FOREACH;
@@
- for (I = TAILQ_FIRST(H); I != NULL; I = TAILQ_NEXT(I, N)) S
+ TAILQ_FOREACH(I, H, N) S
:::::::::::::::
::: kdefs.h :::
:::::::::::::::
#define MAXUSERS 64
#define _KERNEL
#define _KERNEL_OPT
#define i386
/*
* Tail queue definitions.
*/
#define _TAILQ_HEAD(name, type, qual) \
struct name { \
qual type *tqh_first; /* first element */ \
qual type *qual *tqh_last; /* addr of last next element */ \
}
#define TAILQ_HEAD(name, type) _TAILQ_HEAD(name, struct type,)
#define TAILQ_HEAD_INITIALIZER(head) \
{ NULL, &(head).tqh_first }
#define _TAILQ_ENTRY(type, qual) \
struct { \
qual type *tqe_next; /* next element */ \
qual type *qual *tqe_prev; /* address of previous next element */\
}
#define TAILQ_ENTRY(type) _TAILQ_ENTRY(struct type,)
#define PMF_FN_PROTO1 pmf_qual_t
#define PMF_FN_ARGS1 pmf_qual_t qual
#define PMF_FN_CALL1 qual
#define PMF_FN_PROTO , pmf_qual_t
#define PMF_FN_ARGS , pmf_qual_t qual
#define PMF_FN_CALL , qual
#define __KERNEL_RCSID(a, b)
the system into config_deactivate(dev): deactivate dev and all of
its descendants. Block all interrupts while calling each device's
activation hook, ca_activate. Now it is possible to simplify or
to delete several device-activation hooks throughout the system.
Do not deactivate a driver while detaching it! If the driver was
already deactivated (because of accidental/emergency removal), let
the driver cope with the knowledge that DVF_ACTIVE has been cleared.
Otherwise, let the driver access the underlying hardware (so that
it can flush caches, restore original register settings, et cetera)
until it exits its device-detachment hook.
Let multiple readers and writers simultaneously access the system's
device_t list, alldevs, from either interrupt or thread context:
postpone changing alldevs linkages and freeing autoconf device
structures until a garbage-collection phase that runs after all
readers & writers have left the list.
Give device iterators (deviter(9)) a consistent view of alldevs no
matter whether device_t's are added and deleted during iteration:
keep a global alldevs generation number. When an iterator enters
alldevs, record the current generation number in the iterator and
increase the global number. When a device_t is created, label it
with the current global generation number. When a device_t is
deleted, add a second label, the current global generation number.
During iteration, compare a device_t's added- and deleted-generation
with the iterator's generation and skip a device_t that was deleted
before the iterator entered the list or added after the iterator
entered the list.
The alldevs generation number is never 0. The garbage collector
reaps device_t's whose delete-generation number is non-zero.
Make alldevs private to sys/kern/subr_autoconf.c. Use deviter(9)
to access it.
reference while we were getting the v_interlock.
vget(): attempt prevent it from returning a clean vnode:
if the vnode is being inactivated (by vrelel()), wait for
vrelel() to complete (or return EBUSY if we can't wait), and return
ENOENT if the vnode has been vclean'ed by vrelel()
Fix kern/41147 in a better way, hopefully fix other related race conditions.
of transitions to IPL_HIGH from lower IPLs. SPLDEBUG is only available
on i386 and Xen kernels, today.
'options SPLDEBUG' adds instrumentation to spllower() and splraise() as
well as routines to start/stop debugging and to record IPL transitions:
spldebug_start(), spldebug_stop(), spldebug_raise(), spldebug_lower().
meta-data of this pool takes more space than the actual data..
- Reduce lowat/hiwat to 1..8, since intensity is very low.
- Remove unused pew_next_free from pmf_event_workitem_t.