loadable drivers and user controlled attach/detach of devices.
An outline was given in
http://mail-index.NetBSD.org/tech-kern/2004/08/11/0000.html
To cite the relevant parts:
-Add a "child detached" and a "rescan" method (both optional)
to the device driver. (This is added to the "cfattach" for now
because this is under the driver writer's control. Logically
it belongs more to the "cfdriver", but this is automatically
generated now.)
The "child detached" is called by the autoconf framework
during config_detach(), after the child's ca_detach()
function was called but before the device data structure
is freed.
The "rescan" is called explicitely, either after a driver LKM
was loaded, or on user request (see the "control device" below).
-Add a field to the device instance where the "locators" (in
terms of the autoconf framework), which describe the actual
location of the device relatively to the parent bus, can be
stored. This can be used by the "child detached" function
for easier bookkeeping (no need to lookup by device instance
pointer). (An idea for the future is to use this for generation
of optimized kernel config files - like DEC's "doconfig".)
-Pass the locators tuple describing a device's location to
various autoconf functions to support the previous. And since
locators do only make sense in relation to an "interface
attribute", pass this as well.
-Add helper functions to add/remove supplemental "cfdata"
arrays. Needed for driver LKMs.
There is some code duplication which will hopefully resolved
when all "submatch"-style functions are changed to accept the
locator argument.
Some more cleanup can take place when config(8) issues more
information about locators, in particular the length and default
values. To be done later.
* Rather than using mnt_maxsymlinklen to indicate that a file systems returns
d_type fields(!), add a new internal flag, IMNT_DTYPE.
Add 3 new elements to ufsmount:
* um_maxsymlinklen, replaces mnt_maxsymlinklen (which never should have existed
in the first place).
* um_dirblksiz, which tracks the current directory block size, eliminating the
FS-specific checks littered throughout the code. This may be used later to
make the block size variable.
* um_maxfilesize, which is the maximum file size, possibly adjusted lower due
to implementation issues.
Sync some bug fixes from FFS into ext2fs, particularly:
* ffs_lookup.c 1.21, 1.28, 1.33, 1.48
* ffs_inode.c 1.43, 1.44, 1.45, 1.66, 1.67
* ffs_vnops.c 1.84, 1.85, 1.86
Clean up some crappy pointer frobnication.
* Process A is closing one file descriptor belonging to a device. In doing so,
ffs_update() is called and starts writing a block synchronously. (Note: This
leaves the vnode locked. It also has other instances -- stdin, et al -- of
the same device open, so v_usecount is definitely non-zero.)
* Process B does a revoke() on the device. The revoke() has to wait for the
vnode to be unlocked because ffs_update() is still in progress.
* Process C tries to open() the device. It wedges in checkalias() repeatedly
calling vget() because it returns EBUSY immediately.
To fix, this:
* checkalias() now uses LK_SLEEPFAIL rather than LK_NOWAIT. Therefore it will
wait for the vnode to become unlocked, but it will recheck that it is on the
hash list, in case it was in the process of being revoke()d or was revoke()d
again before we were woken up.
* Since we're relying on the vnode lock to tell us that the vnode hasn't been
removed from the hash list *anyway*, I have moved the code to remove it into
the DOCLOSE section of vclean(), inside the vnode lock.
In the example at hand, process A was sh(1), process B was a child of init(8),
and process C was syslogd(8).
from Stephan Uphoff, FreeBSD PR/69964.
(http://www.freebsd.org/cgi/query-pr.cgi?pr=69964)
> The LK_WANT_EXCL and LK_WANT_UPGRADE bits act as mini-locks and can block
> other threads.
> Normally this is not a problem since the mini locks are upgraded to full loc
> and the release of the locks will unblock the other threads.
> However if a thread reset the bits without optaining a full lock
> other threads are not awoken.
> This can happens if obtaining the full lock fails because of a LK_SLEEPFAIL,
> or a signal (if lock priority includes PCATCH .. don't think this is used).
ensure that no one else have the same lock.
a patch from Stephan Uphoff, FreeBSD PR/69934.
(http://www.freebsd.org/cgi/query-pr.cgi?pr=69934)
> Upgrading a lock does not play well together with acquiring
> an exclusive lock and can lead to two threads being
> granted exclusive access.
>
> Problematic sequence:
> Thread A acquires a previous unlocked lock in shared mode.
> Thread B tries to acquire the same lock in exclusive mode
> and blocks.
> Thread A upgrades its lock - waking up thread B.
> Thread B wakes up and also acquires the same lock as it only checks
> if the lock is not shared or if someone wants to upgrade the lock
> and not if someone already upgraded the lock to an exclusive lock.
long, not an int, and this causes "problems" on LP64be machines
(sparc64, etc). Assign the value to a temporary int and instrument
that instead. Should be fine until someone wants a message buffer
larger than two gigabytes.
This areas is called the comm pages. It is used to provide fast access to
several data and functions.
The comm pages are mapped starting at 0xffff800 (address chosed so that
absolute branch can be used, so it can be accessed even when dynamic linking
is not ready). NetBSD has the user stack here, so we need to provide a
Darwin-specific stack setup routine which sets the top of the stack at
0xbfff0000.
This implementation is not complete but it does enough to get MacOS X.3
starting again (static binaries run, dynamic binaries still have an issue).
in the comm pages functions, we only implement bcopy, pthread_self and
memcpy.
TODO:
- clean up the powerpc specific code from MD parts
- for now we map only one page to avoid a crash, we want two pages.
- write all the comm functions.
want more flexible namecache handling.
it just looks up a dnlc entry and vget() the result vnode.
ie. no automatic entry removal, no automatic vnode locking.
discussed on tech-kern@.
doesn't advance while we're waiting on the lock. In fact, try to take
the lock even before blocking interrupts: the lock is locking "lasttime"
against other callers of cc_microtime(), not against the clock routines,
and if we take a clock interrupt while waiting for the lock, that's one
we don't have to take after the computations, but before returning to
the caller, and that makes the data a little fresher to the caller.
Moreover, inverting the order of splXXX() and simple_lock() permits us
to unblock interrupts before doing the long division.
With this, finally, performance of "ntpd" on my MP i386 seems to be no
worse than on non-MP i386, so this may fix PR kern/24207.
for consistency with M_FREE() and m_freem(). Affected files:
sys/mbuf.h
kern/uipc_socket2.c
kern/uipc_mbuf.c
net/if_ethersubr.c
netatalk/ddp_input.c
nfs/nfs_socket.c
a ktraced file descriptor that has already been invalidated. Change
all ktrace functions to propagate the error from ktrwrite() and
check for it. Thanks to Pavel Cahyna for finding this and giving
a perfect bug report.
[should be pulled up for 2.0]
doing copy-on-write.
- Change VFS_SNAPSHOT() to return the snapshot vnode locked.
- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().
L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>
overloading "usec". The counter isn't counting micro-seconds, and using
the same variable to mean two different things is false economy: with
this change, the compiled object is 72 bytes smaller on i386, and the
code is easier to understand, to boot.
We do an MGETHDR)() for each mbuf "packet" of the input chain, to hold
the socket address prepended to that "packet". If those MGETHDR()s
ever failed, we would leak all the successfully-allocated mbuf
headers. Leak noted by Yamamoto-san (yamt@NetBSD.org); thanks for catching it!
Add socketbuf invariant-checking macros to sbappendaddrchain(), and
replace a stray bcopy() with memcpy(), also as suggested by Yamamoto-san.
Fix a potential race condition when reallocating storage for file descriptors
(even for non-SMP kernels).
Add missing locks for `struct file' ref count updates.
Introduce new socket-layer function sbappendaddrchain() to
sys/kern/uipc_socket2.c: like sbappendaddr(), only takes a chain of
records and appends the entire chain in one pass. sbappendaddrchain()
also takes an `sbprio' argument, which indicates the caller requires
special `reliable' handling of the socket-buffer. `sbprio' is
described in sys/sys/socketvar.h, although (for now) the different
levels are not yet implemented.
Rework sys/netipsec/key.c PF_KEY DUMP responses to build a chain of
mbuf records, one record per dump response. Unicast the entire chain
to the requestor, with all-or-none semantics.
Changed files;
sys/socketvar.h kern/uipc_socket2.c netipsec/key.c
Reviewed by:
Jason Thorpe, Thor Lancelot Simon, post to tech-kern.
Todo: request pullup to 2.0 branch. Post-2.0, rework sysctl() API for
dumps to use new record-chain constructors. Actually implement
the distinct service levels in sbappendaddrchain() so we can use them
to make PF_KEY ACQUIRE messages more reliable.
- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.
Welcome to 2.0F.
Approved by: Jason R. Thorpe <thorpej@netbsd.org>
Add a new explicit `struct proc *p' argument to socreate(), sosend().
Use that argument instead of curproc. Follow-on changes to pass that
argument to socreate(), sosend(), and (*so->so_send)() calls.
These changes reviewed and independently recoded by Matt Thomas.
Changes to soreceive() and (*dom->dom_exernalize() from Matt Thomas:
pass soreceive()'s struct uio* uio->uio_procp to unp_externalize().
Eliminate curproc from unp_externalize. Also, now soreceive() uses
its uio->uio_procp value, pass that same value downward to
((pr->pru_usrreq)() calls for consistency, instead of (struct proc * )0.
Similar changes in sys/nfs to eliminate (most) uses of curproc,
either via the req-> r_procp field of a struct nfsreq *req argument,
or by passing down new explicit struct proc * arguments.
Reviewed by: Matt Thomas, posted to tech-kern.
NB: The (*pr->pru_usrreq)() change should be tested on more (all!) protocols.
so that it doesn't fire when called twice in the same microsecond,
which can lead to large error accumulation.
Appears to fix "repeated gettimeofday() goes backwards" on a fast
alpha and i386 box.
attempt is terminated, so if a process needs frequent timer interrupts
it can't ever connect() to a machine far away.
Bug found by Erik Lundgren, bugfix (for the same problem) is similar to
the way FreeBSD solved the same problem.
As a side effect, the new connect() behaviour conformes to Posix.
further deprecate struct timezone usage by changing `tzp' argument to
gettimeofday() to void *; align utimes(2) declaration by changing `times`
argument from struct timeval * to struct timeval[2]. From Murray
Armfield in PR standards/25331.
In due curse, reflect these changes in futimes(2), lutimes(2), and
settimeofday(2).
to pool_init. Untouched pools are ones that either in arch-specific
code, or aren't initialiased during initial system startup.
Convert struct session, ucred and lockf to pools.
list of file system types currently supported by the kernel.
Previously there wasn't an easy way to determine this.
(Code shamelessly cribbed from subr_disk.c::sysctl_hw_disknames().)
Use LIST_FOREACH() appropriately.
and unkillable processes.
1. Introduce new SBSIZE resource limit from FreeBSD to limit socket buffer
size resource.
2. make sokvareserve interruptible, so processes ltsleeping on it can be
killed.
would be good) mostly copied from sysctl(3). This takes care of the
top-level, most of kern.* and hw.* (modulo the ath and bge stuff), and
all of proc.*.
If you don't want the added rodata in your kernel, use "options
SYSCTL_NO_DESCR" in your kernel config.
setup function to set the description, even if the node has been
instantiated elsewhere. Or not, depending on the other that the setup
functions are called.
namely, allow following operations.
- purge only an entry specified by a component name.
- purge only child entries.
- purge only parent entries.
no objections on tech-kern@.
Add an initializer for them: KSI_INIT_EMPTY
Add a predicate for them: KSI_EMPTY_P
Don't bother storing empty ksiginfo_t's since they have no information.
Change uses of KSI_INIT to KSI_INIT_EMPTY where no other information other
than the signo is being filled in.
not blocked. Otherwise (it if it blocked or the hanlder is set to SIG_IGN)
reset the signal back to its default settings so that a coredump can be
generated.
other semantics from an earlier incarnation.
Call kcont_init() from init_main before device autoconfiguration,
so kcont is availble to device drivers if required.
Also ensure the kthread process runs any pending continuations once
the kthread is finally up and running. For now, use a non-null timeout
to poll the queue periodically. Draining any pending requests just
before the kthread enters its ltsleep()/kc_run loop is cleaner, but
this is the version I tested with an early-in-boot kcont request.)
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.
Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.
Reviewed by Thor Simon.
size 0.
This way, individual ports can circumvent sigcode mapping
by setting sigcode/esigcode.
(would be better to clean up the __HAVE_SIGINFO/COMPAT_XX
stuff, but it is not a good moment now)
- factor out common code.
- don't stop searching before the target.
- touch the correct object.
- validate the argument before the loop otherwise we need to roll back.
- move per VP data into struct sadata_vp referenced from l->l_savp
* VP id
* lock on VP data
* LWP on VP
* recently blocked LWP on VP
* queue of LWPs woken which ran on this VP before sleep
* faultaddr
* LWP cache for upcalls
* upcall queue
- add current concurrency and requested concurrency variables
- make process exit run LWP on all VPs
- make signal delivery consider all VPs
- make timer events consider all VPs
- add sa_newsavp to allocate new sadata_vp structure
- add sa_increaseconcurrency to prepare new VP
- make sys_sa_setconcurrency request new VP or wakeup idle VP
- make sa_yield lower current concurrency
- set sa_cpu = VP id in upcalls
- maintain cached LWPs per VP
be updated. (This is needed to be compatible with how pre-SIGINFO signals
operated. If you siglongjmp out of a signal handler, the SS_ONSTACK state
needs to be cleared. This commit restores that functionality).
re-used by another cpu immediately. in that case, lwp_exit2() will
access freed memory. to fix this:
- remove curlwp from p_lwps in exit1() rather than letting lwp_exit2() do so.
- add assertions to ensure freed proc has no lwps.
kern/24329 from me and kern/24574 from Havard Eidnes.
In addition to current one (i.e., don't wast so large part of the page),
- if the header fitsin the page without wasting any items, put it there.
- don't put the header in the page if it may consume rather big item.
For example, on i386, header is now allocated in the page for the pools
like fdescpl or sigapl, and allocated off the page for the pools like
buf1k or buf2k.
kern.proc, kern.proc2, kern.lwp, and kern.buf.
Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communcating buffer information to userland.
Convert systat to the new kern.buf method.
Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.
Oh, and fix a typo in a comment.
and then allocate it on demand. Rename some common symbols (__bss_start,
_edata, _end, __start_link_set_*, __stop_link_set_*) so that ".<module>"
is appended to them. This shrinks an amd64 kernel by 20KB of BSS.
of using on-stack memory, so that this wouldn't eventually cause kernel
panic if the process get swapped out and another process runs kqueue_scan()
problem pointed out in kern/24220 by Stephan Uphoff
called with every buffer written through spec_strategy().
Used by fss(4). Future file-system-internal snapshots will need them too.
Welcome to 1.6ZK
Approved by: Jason R. Thorpe <thorpej@netbsd.org>
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.
This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark nd stay there. I've adjusted it to use min(page defecit,
buffer memory / 16), which still isn't quite right but seems better.
Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.
This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).
To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current defecit?
the latter is not a appropriate place to do so and it broke vfork.
- deactivate pmap before calling cpu_exit() to keep a balance of
pmap_activate/deactivate.
and uncontrolled growth.
The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.
We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.
With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.
Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
PR#23470, with minor updates by me. This is only the syscall support
from that PR, for now.
Changes: port over fix from FreeBSD for multicast address generation.
Changed bcopy to memcpy. For now, #ifdef notyet the portions of
kern_uuid.c that are meant to be used by (currently nonexistent) other
things in the kernel. Added syscall to COMPAT_FREEBSD as well, though
that's currently not useful, as any program new enough to use this call
also uses other syscalls we don't (yet) emulate.
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.
sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.
See comments in rev 1.106 for more detail.
VOP_STRATEGY(bp) is replaced by one of two new functions:
- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.
DEV_STRATEGY(bp) is used only for block-to-block device situations.
VOP_STRATEGY(bp) is replaced by one of two new functions:
- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.
DEV_STRATEGY(bp) is used only for block-to-block device situations.
From PR kern/13702 from Charles Carvalho. Tested on alpha and
i386 with a Laipac TF10 PPS-capable GPS. The com.c change was
copied wholesale from Charles' z8530tty.c patch.
"rv".
In sysctl_destroyv(), deal with deleting alias nodes, and pass a token
size_t to sysctl_destroy().
In sysctl_free(), check that "node" has not reached "rnode", not that
"pnode" has.
In sysctl_realloc(), don't bother setting sysctl_clen...the value is
unchanged.
- delete ktrsyscall32()
- add a check #ifdef _LP64 to do the conversion if P_32 is set to the
standard ktrsyscall()
- add a couple of similar _LP64/P_32 checks to the systrace code.
this should get systrace working for 32 bit apps as well as complete
ktrace support for "trace_enter/trace_exit" using platforms such as amd64.
XXX: systrace isn't supported on sparc64 currently... (it doesn't use
trace_enter/trace_exit, or have it's own calls to systrace_xxx()...)
suspending.
Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.
Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).
When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.
idle pool pages to be returned to the system immediately upon becoming
de-fragmented.
Also, in pool_do_put(), don't free back an idle page unless we are over
our minimum page claim.
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.
Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...
0.5%, based on some quick measurements on a number of workstations and
small fileservers (including my home fileserver running simultaneous
builds of the NetBSD source tree and several NetBSD kernels). This
brings the hit rate on my machines from below 70% to above 90%. We
should be able to tune this as we run, by tracking the hit rate and
increasing the size of the cache if memory permits.
Some systems will still require significantly larger cache sizes. Some
ports -- notably the 64-bit ones -- probably should use more than 1% of
physmem as the default due to the larger size of struct vnode.
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.
Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.
For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.
Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.
Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.
process context ('reaper').
From within the exiting process context:
* deactivate pmap and free vmspace while we can still block
* introduce MD cpu_lwp_free() - this cleans all MD-specific context (such
as FPU state), and is the last potentially blocking operation;
all of cpu_wait(), and most of cpu_exit(), is now folded into cpu_lwp_free()
* process is now immediatelly marked as zombie and made available for pickup
by parent; the remaining last lwp continues the exit as fully detached
* MI (rather than MD) code bumps uvmexp.swtch, cpu_exit() is now same
for both 'process' and 'lwp' exit
uvm_lwp_exit() is modified to never block; the u-area memory is now
always just linked to the list of available u-areas. Introduce (blocking)
uvm_uarea_drain(), which is called to release the excessive u-area memory;
this is called by parent within wait4(), or by pagedaemon on memory shortage.
uvm_uarea_free() is now private function within uvm_glue.c.
MD process/lwp exit code now always calls lwp_exit2() immediatelly after
switching away from the exiting lwp.
g/c now unneeded routines and variables, including the reaper kernel thread
an offset between ss_sp and struct sa_stackinfo_t (located in struct
__pthread_st) when calling sa_register. The kernel increments the
sast_gen counter in struct sastack when an upcall stack is used.
libpthread increments the sasi_stackgen counter in struct
sa_stackinfo_t when an upcall stack is freed. The kernel compares the
two counters to decide if a stack is free or in use.
- add struct sa_stackinfo_t with sasi_stackgen to count stack use in
userland
- add sast_gen to struct sastack to count stack use in kernel
- add SA_FLAG_STACKINFO to enable the stackinfo_offset argument in the
sa_register syscall
- add sa_stackinfo_offset to struct sadata for offset between ss_sp
and struct sa_stackinfo_t
- add ssize_t stackinfo_offset argument to sa_register, initialize
struct sadata's sa_stackinfo_offset from it if SA_FLAG_STACKINFO is
set
- add sa_getstack, sa_getstack0, sa_stackused and sa_setstackfree
functions to find/use/free upcall stacks and use these where
appropriate
- don't record stack for upcall in sa_upcall0
- pass sau to sa_switchcall instead of l2 (l2 = curlwp in sa_switchcall)
- add sa_vp_blocker to struct sadata to pass recently blocked lwp to
sa_switchcall
- delay finding a stack for blocked upcalls to sa_switchcall
- add sa_stacknext to struct sadata pointing to next most likely free
upcall stack; also g/c sa_stackslist in struct sadata and sast_list
in struct sastack
- add L_SA_WOKEN flag: LWP is on sa_woken queue
- add L_SA_RECYCLE flag: LWP should be recycled in sa_setwoken
- replace l_upcallstack with L_SA_WOKEN/L_SA_RECYCLE/L_SA_BLOCKING
flags
- g/c now unused sast_blocker in struct sastack
- make sa_switchcall, sa_upcall0 and sa_upcall_getstate static in
kern_sa.c
- call sa_upcall_userret only once in userret
- split sa_makeupcalls out of sa_upcall_userret and use to process
the sa_upcalls queue
- on process exit: mark LWPs sleeping in saunblock interruptible; also
there are no LWPs sleeping on l->l_upcallstack anymore; also clear
sa_wokenq_head to prevent unblocked upcalls
additional changes:
- cleanup timerupcall sa_vp == curlwp check
- add check in sa_yield if we didn't block on our way here and we
wouldn't any longer be the LWP on the VP
- invalidate sa_vp_ofaultaddr after resolving pagefault
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.
This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.
On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.
need the data in the mbuf later and m_clget() changes some fields
overlaid to regular mbuf data. Instead, rearange code a bit, create
data into a new allocated buffer and and use MEXTADD to attach it to
the mbuf, if the mbuf internal space is not sufficient.
This fixes a crash on sparc64 (and probably all other archs where
sizeof(int) != sizeof(struct file *)) when running
regress/sys/kern/unfdpass.
Idea for solution from Matt Thomas, with additional input from YAMAMOTO
Takashi.
being requested so that (1) the uniprocessor case and the
multiprocessor case are more similar and (2) so that we return ENOENT
when a non-existent processor is requested (which is both more
sensible and follows the general order of things anyway).
domainname. Note that there's no need to copy rnode since we're not
changing any of it, nor protecting anything from change.
Thanks to martin for initial work.
fit what it does.
The softsignal feature is used in Darwin to trace processes. When the
traced process gets a signal, this raises an exception. The debugger will
receive the exception message, use ptrace with PT_THUPDATE to pass the
signal to the child or discard it, and then it will send a reply to the
exception message, to resume the child.
With the hook at the beginnng of kpsignal2, we are in the context of the
signal sender, which can be the kill(1) command, for instance. We cannot
afford to sleep until the debugger tells us if the signal should be
delivered or not.
Therefore, the hook to generate the Mach exception must be in the traced
process context. That was we can sleep awaiting for the debugger opinion
about the signal, this is not a problem. The hook is hence located into
issignal, at the place where normally SIGCHILD is sent to the debugger,
whereas the traced process is stopped. If the hook returns 0, we bypass
thoses operations, the Mach exception mecanism will take care of notifying
the debugger (through a Mach exception), and stop the faulting thread.
exec case, as the emulation already has the ability to intercept that
with the e_proc_exec hook. It is the responsability of the emulation to
take appropriaye action about lwp_emuldata in e_proc_exec.
Patch reviewed by Christos.
to a 2-clause licence (retaining UCB clauses (1) and (2)), per PR
22409 from Joel Baker, approved by Theo de Raadt, and ratified by
myself - the only discrepancy being the handling of the original
clause 3 in src/usr.sbin/yppoll/yppoll.c.
Uses a hook in spec_strategy() to save data written from a mounted
file system to its block device and a hook in dounmount().
Not enabled by default in any kernel config.
Approved by: Frank van der Linden <fvdl@netbsd.org>
correctly, free original instead of incremented pointer, copy results for
n = -2 case too, so top shows correct stats.
Additionaly, rearange code for better readability (from Andrew).
Gone are the old kern_sysctl(), cpu_sysctl(), hw_sysctl(),
vfs_sysctl(), etc, routines, along with sysctl_int() et al. Now all
nodes are registered with the tree, and nodes can be added (or
removed) easily, and I/O to and from the tree is handled generically.
Since the nodes are registered with the tree, the mapping from name to
number (and back again) can now be discovered, instead of having to be
hard coded. Adding new nodes to the tree is likewise much simpler --
the new infrastructure handles almost all the work for simple types,
and just about anything else can be done with a small helper function.
All existing nodes are where they were before (numerically speaking),
so all existing consumers of sysctl information should notice no
difference.
PS - I'm sorry, but there's a distinct lack of documentation at the
moment. I'm working on sysctl(3/8/9) right now, and I promise to
watch out for buses.
so that a specific emulation has the oportunity to filter out some signals.
if sigfilter returns 0, then no signal is sent by kpsignal2().
There is another place where signals can be generated: trapsignal. Since this
function is already an emulation hook, no call to the sigfilter hook was
introduced in trapsignal.
This is needed to emulate the softsignal feature in COMPAT_DARWIN (signals
sent as Mach exception messages)
accepted. However, this time this behavor is not the default. Instead
it must enabled by using the LOCAL_CONNWAIT socket option on either the
connecting or accepting socket.