table actually match state in NetBSD 0.9 (checked against sys/mount.h
rev. 1.11).
The array is not to be modified from now on, comment updated accordingly.
that fails, just try to recycle a vnode. If we can't allocate or
recycle, issue a warning, sleep a bit, and try the whole thing
again.
This prevents us from blocking forever if we want to use a very large
number of vnodes, but don't have {memory,kva} resources from which to
allocate them.
an spl-protected "interrupt safe map" list, simply require that callers
of uvm_fault() never call us in interrupt context (MD code must make
the assertion), and check for interrupt-safe maps in uvmfault_lookup()
before we lock the map.
has VXLOCK set - it's already being vgoned, most likely by one of our
callers. If we call vgone, we can end up sleeping against ourself
with VXLOCK set - we'll start the race for root.
Pointed out by Love <lha@stacken.kth.se> on tech-kern. Analysis from
Artur Grabowski <art@openbsd.org> via Love.
Should resolve PR kern/13077
is supposed to point directly to struct mbuf or struct sockaddr in kernel
space as appropriate, rather than being a pointer to memory in userland.
This is to be used by compat/* when emulation needs to wrap
send{to|msg}(2)/recv{from|msg}(2) and modify the passed struct
sockaddr.
The end we want to do selwakeup() on is not necessarily same as the one
we send SIGIO to. Make pipeselwakeup() accept two parameters and update
callers accordingly. This change fixes behaviour for code, which does
select(2)s on the write end waiting for reader (watched on gv, the problem
manifestated itself as a too long delay before the document was displayed).
Clearly separate the resource free code for FreeBSD
and NetBSD case in pipeclose(), so that it's a bit clearer what's going on.
Also LK_DRAIN the lock before the memory is returned to pipe_pool.
Add missing wakeup() in pipe_write() for PIPE_WANTCLOSE case.
used to make ELF binaries unmatched by any signature check to be run under
NetBSD 'emulation'. This causes problems like kern/12253.
The old behaviour is available with option EXEC_ELF_CATCHALL.
struct socket so_state field to decide if we need to send asynchronous
notifications. This makes possible to request notification on write but
not on read, and vice versa.
This is used in Linux emulation code, because when async I/O is requested,
Linux does not send SIGIO to write end of sockets, and it never send any
SIGIO to any end of pipes. Il Linux emulation code, we then set SB_ASYNC
only on the read end of sockets, and on no end for pipes.
for FreeBSD project. Besides huge speed boost compared with socketpair-based
pipes, this implementation also uses pagable kernel memory instead of mbufs.
Significant differences to FreeBSD version:
* uses uvm_loan() facility for direct write
* async/SIGIO handling correct also for sync writer, async reader
* limits settable via sysctl, amountpipekva and nbigpipes available via sysctl
* pipes are unidirectional - this is enforced on file descriptor level
for now only, the code would be updated to take advantage of it
eventually
* uses lockmgr(9)-based locks instead of home brew variant
* scatter-gather write is handled correctly for direct write case, data
is transferred by PIPE_DIRECT_CHUNK bytes maximum, to avoid running out of kva
All FreeBSD/NetBSD specific code is within appropriate #ifdef, in preparation
to feed changes back to FreeBSD tree.
This pipe implementation is optional for now, add 'options NEW_PIPE'
to your kernel config to use it.
MNT_NOSUID, just check MNT_NOSUID to clear the S{U,G}ID bits
in the attributes for the vnode we're about to exec.
We now check P_TRACED right before we would actually perform
the s{u,g}id function in the exec code.
This closes a race condition between exec of a setuid binary
and ptrace(2).
between creation of a file descriptor and close(2) when using kernel
assisted threads. What we do is stick descriptors in the table, but
mark them as "larval". This causes essentially everything to treat
it as a non-existent descriptor, except for fdalloc(), which sees a
filled slot so that it won't (incorrectly) allocate it again. When
a descriptor is fully constructed, the code that has constructed it
marks it as "mature" (which actually clears the "larval" flag), and
things continue to work as normal.
While here, gather all the code that gets a descriptor from the table
into a fd_getfile() function, and call it, rather than having the
same (sometimes incorrect) code copied all over the place.
fdexpand(). The former will return ENOSPC if there is not space
in the current filedesc table. The latter performs the expansion
of the filedesc table. This means that fdalloc() won't ever block,
and it gives callers an opportunity to clean up before the
potentially-blocking fdexpand() call.
Update all fdalloc() callers to deal with the need-to-fdexpand() case.
Rewrite unp_externalize() to use fdalloc() and fdexpand() in a
safe way, using an algorithm suggested by Bill Sommerfeld:
- Use a temporary array of integers to hold the new filedesc table
indexes. This allows us to repeat the loop if necessary.
- Loop through the array of file *'s, assigning them to filedesc table
slots. If fdalloc() indicates expansion is necessary, undo the
assignments we've done so far, expand, and retry the whole process.
- Once all file *'s have been assigned to slots, update the f_msgcount
and unp_rights counters.
- Right before we return, copy the temporary integer array to the message
buffer, and trim the length as before.
Note that once locking is added to the filedesc array, this entire
operation will be `atomic', in that the lock will be held while
file *'s are assigned to embryonic table slots, thus preventing anything
else from using them.
descriptor array, which may have blocked. Change callers of
fdalloc() to restart whatever they\'re doing if this condition
happens. (XXX unp_externalize() needs some work, but that will
be tackled later.)
Change finishdup() to close the descriptor in the `new\' slot if
one exists, and change sys_dup2() accordingly.
Closes a race condition when using kernel-assisted user threads.
While here, garbage-collect UF_MAPPED -- it is not used anywhere.
The ISO C standard says in 6.10.3.3 that if the result of using the
'##' operator "is not a valid preprocessing token, the behaviour is
undefined." Gcc 3.0 warns about this.
unions `union_elem: ...', and use c99 syntax `.union_elem = ...' only
where necessary.
in this case, there's no need to tag elf_probe_func because that's the
first union element, and therefore, the implicit case. only specifically
mention ecoff_probe_func where necessary.
if we decide to not use this c99 feature for now, at least there's now
less stuff to rip out.
Previously, we passed __FILE__ and __LINE__ on all pool_get/pool_set calls.
This change results in a measured 1.2% performance improvement in
ping-flood packets-per-second as reported by ping(8).
that the caller allocate the pool_item_header when it allocates the
pool page, so we can avoid a locking pitfall (sleeping with a simple
lock held).
Also revive pool_prime(), as there are some letigimate uses of it,
but in doing so, eliminate some of the bogosities of the old version
(i.e. don't do an implicit "setlowat", just prime the pool, and incr
the minpages for each additional page we add, and compute the number
of pages to prime in a way that callers would expect).
to 512. Apparently, there are ELF binaries with more than 128 section
headers - an example is one of Linux Word Perfect 8 utilities.
This fixes kern/12455 by Mark Davies.
flag.
EMUL_BSD_ASYNCIO_PIPE notes that the emulated binaries expect the original
BSD pipe behavior for asynchronous I/O, which is to fire SIGIO on read() and
write(). OSes without this flag do not expect any SIGIO to be fired on
read() and write() for pipes, even when async I/O was requested. As far as
we know, the OSes that need EMUL_BSD_ASYNCIO_PIPE are NetBSD, OSF/1 and
Darwin.
EMUL_BSD_ASYNCIO_PIPE notes that the emulated binaries expect the original
BSD pipe behavior for asynchronous I/O, which is to fire SIGIO on read() and
write(). OSes without this flag do not expect any SIGIO to be fired on
read() and write() for pipes, even when async I/O was requested. As far as
we know, the OSes that need EMUL_BSD_ASYNCIO_PIPE are NetBSD, OSF/1 and
Darwin.
EMUL_NO_SIGIO_ON_READ notes that the emulated binaries that requested
asynchrnous I/O expect the reader process to be notified by a SIGIO, but
not the writer process. OSes without this flag expect the reader and the
writer to be notified when some data has arrived or when some data have been
read. As far as we know, the OSes that need EMUL_NO_SIGIO_ON_READ are Linux
and SunOS.
SPINLOCK_SPIN_HOOK, so that we actually check for
pending IPIs on the Alpha more than once. Also,
when we call alpha_ipi_process(), make sure to go
to splipi().
vfs_busy'ing just before the dounmount() call. This is to avoid
sleeping with the mountlist_slock held -- but we must acquire
syncer_lock before vfs_busy because the syncer itself uses
syncer_lock -> vfs_busy locking order.
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.
Use a relative path (../..) instead of /sys.
Enhance the sed expression to work with .'s in paths.
Quote sed expressions in single quotes rather than double
quotes unless there's a good reason otherwise.
mappings (vnode -> name) in the reverse mapping hash table. Without
this option, there is no change; only directories will be entered to
speed up getcwd. This is an option because it will cause getcwd
to hit longer hash chains, and at the moment its usefulness is
still limited.
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-setable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
(only if cmd exited successfully) use tmp file as input to sed pipeline.
This works around two issues:
(1) a pathological case where the script would fail in ... interesting
ways if the command being executed closed its stdout. (Certain
commands are used only for their side effects, but not their output,
and doing some testing on my own i got into hot water when one
of my mods caused a command to close its output).
(2) the fact that genassym would succeed even when the command in
fact failed (because the last cmd in the pipeline is the one whose
exit status would be reported).
space is already torn down in uvmspace_free() when the vmspace
refrence count reaches 0. Move the shmexit() call into uvmspace_free().
Note that there is a beneficial side-effect of deferring the unmap
to uvmspace_free() -- on systems where TLB invalidations are
particularly expensive, the unmapping of the address space won't
have to cause TLB invalidations; uvmspace_free() is going to be
run in a context other than the exiting process's, so the "pmap is
active" test will evaluate to FALSE in the pmap module.
is disconnected by RST right before accept(2). fixes PR 10698/12027.
checked with SUSv2, XNET 5.2, and Stevens (unix network programming
vol 1 2nd ed) section 5.11.
on memory shortage. Instead, use the same wait/nowait condition with the
item requested, and just cleanup and return failure if we can't allocate
page header while we aren't allowed to wait.
do not return junk data in mbuf (= sockaddr on accept(2)'s 2nd arg).
set the length zero.
behavior checked with bsdi and freebsd.
partial solution to PR 12027 and 10698 (need more investigation).
and number of ops, not touch anything - vnode_if.sh now generated
proper offset numbers; vfs_op_check() is only defined and called for DEBUG
kernels
constify extern declaration of vfs_op_descs[]
g/c vfs_opv_numops, use VNODE_OPS_COUNT instead
make vfs_opv_init_explicit() and vfs_opv_init_default() static
then don't need to be patched at runtime
add new define VNODE_OPS_COUNT (to vnode_if.h) so that the number is known
at compile-time
make stuff const, it now can be
This structures are actually modified at kernel init time by vfs_op_init.
XXX - looks like the state after initialization is pretty const and with
some magic in the generator script (and appropriate changes to vfs_op_init)
it could be made const.
wrap this all up in a CHECKSIGS() macro. Also, in psignal1(),
signotify() SRUN and SIDL processes if __HAVE_AST_PERPROC is defined.
Per discussion w/ mycroft.
between write i/os in a disk-based filesystem vs. the disk block being
freed by a truncation, allocated to a new file, and written again with
different data. if the disk driver reorders the requests and does
the second i/o first, the old data will clobber the new, corrupting
the new file.
are done inside of wakeup which is holding the sched lock. Printf can cause
wakeup to get called again (pty redirection of console message) which will
panic with sched lock already held.
This isn't a long term fix as not being able to printf vs. sched lock should
be cleaned up better but this avoids continual panics with lockdebug running
and an xterm -C.
only signal handler array sharable between threads
move other random signal stuff from struct proc to struct sigctx
This addresses kern/10981 by Matthew Orgass.
* __HAVE_SYSCALL_INTERN. If this is defined, e_syscall is replaced by
e_syscall_intern, which is called at key places in the kernel. This can be
used to set a MD syscall handler pointer. This obsoletes and replaces the
*_HAS_SEPARATED_SYSCALL flags.
* __HAVE_MINIMAL_EMUL. If this is defined, certain (deprecated) elements in
struct emul are omitted.
passed it down to the appropriate usrreq function, and this
allows usage for contexts that need to be explicitly different
from curproc (like in the NFS code when binding to a reserved port).
defined, call addupc_intr() directly from statclock() in the system time case,
using the same P_OWEUPC path if the copyin/copyout fails.
Use this in i386 to remove profiling code from the normal userret() path.
* Make the syscallnames[] table const.
* Add a separator between the #include section and the syscalls section, so
that #if/#else/#endif can be handled differently in the two.
* Add support for rounding up the size of the sysent table.
constructed objects in the pool allocator, similar to caching
of constructed objects in the Solaris SLAB allocator.
This implementation is a separate API (pool_cache_*()) layered
on top of pools to keep the caching complexity out of the way
of pools that won't benefit from it.
While we're here, allow pool items to be as large as the pool
page size.
*_emul_path variables
change macros CHECK_ALT_{CREAT|EXIST} to use that, 'root' doesn't need
to be passed explicitly any more and *_CHECK_ALT_{CREAT|EXIST} are removed
change explicit emul_find() calls in probe functions to get the emulation
path from the checked exec switch entry's emulation
remove no longer needed header files
add e_flags and e_syscall to struct emul; these are unsed and empty for now
* move all exec-type specific information from struct emul to execsw[] and
provide single struct emul per emulation
* elf:
- kern/exec_elf32.c:probe_funcs[] is gone, execsw[] how has one entry
per emulation and contains pointer to respective probe function
- interp is allocated via MALLOC() rather than on stack
- elf_args structure is allocated via MALLOC() rather than malloc()
* ecoff: the per-emulation hooks moved from alpha and mips specific code
to OSF1 and Ultrix compat code as appropriate, execsw[] has one entry per
emulation supporting ecoff with appropriate probe function
* the makecmds/probe functions don't set emulation, pointer to emulation is
part of appropriate execsw[] entry
* constify couple of structures