doing a context switch. use this on sparc and sparc64 to avoid trying
to access user memory (writing the register windows back to the stack)
in this case (since it's both unnecessary and wrong).
compare fd and fdp->fd_lastfile in fdrelease(), so change the test to a
more explicit one. Spotted by Matt Thomas.
Should fix the panic reported by Matthias Scheler.
Hence, make find_last_set return -1 in such situation, and initialize it
such. Otherwise, with 0 meaning two things, it confused the F_CLOSEM
fcntl which could end up looping indifintely (PR#28929 by Brian Marcotte).
However, this change enlightens another bug in fdcopy(), where more entries
than needed were cleared in the new file descriptor table, so the memset()
call there is fixed too.
Analyzed with the help of Greg Oster.
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.
How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
and just passes it on to the file system functions. This avoids opening and
closing the device several times.
Mentioned on tech-kern some time ago, IIRC. I've been running this for a
long time.
do not leak siginfo structures.
Note that in the cases of trap signals and timer events, losing this
information could be very bad; right now it will cause us to spin until the
process is SIGKILLed.
"Needs work."
if it's specified, don't use free items as storage for internal state.
so that we can use pools for non memory backed objects.
inspired from solaris's KMC_NOTOUCH.
the code block, so that the purpose is more clear
avoid NULL pointer dereference in lkmunreserve() called on lkm device
close when ksym_addsymtab() fails for the temporary symbol table
(sanity change only, this can never happen at the moment)
* only add the symbol table for the current module if the LKM_E_LOAD
hook returns success; otherwise we overwrite the LKM_E_LOAD error,
which may ultimately lead to (incorrectly) allowing the module load
* only delete the sumbol table for current module if we actually added
the symbol table; avoids deleting symbol table of previously loaded
module when the same module is loaded twice, when the second load
fails with EEXIST
fixes PR kern/28803 by Jens Kessmeier
header files, so that they don't become out of sync (again).
- Use bitmask_snprintf() instead of hand-rolled code.
- Always check array bounds before dereferencing print arrays.
- Order arguments in the vnode printing functions consistently.
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat
do { ... } while(/*CONSTCOND*/0)
so that they can be used unadorned in if/else blocks, etc. This means
that you now *have* to put a ; at the end of the "call" to these
macros.
the code that `knows' about /dev/[pt]tyXX names (the BSD ptys) into a separate
file. Make an interface to be used by the tty creating provider. The code
to enable old PTY searching via ptm is enabled via COMPAT_BSDPTY, and it
is turned on by default on all kernels that have compatibility options enabled.
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.
In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.
The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.
FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
- Add a booted_wedge variable that indicates the wedge that was booted
from. If this is NULL, booted_partition is consulted.
- Adjust setroot() and its support routines for root-on-wedges. Could
use some tidy-up, but this works for now.
necessary information to create the pseudo-device instance. Pseudo-device
device's will reference this cfdata, just as normal devices reference
their corresponding cfdata.
Welcome to 2.99.10.
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon told that this helps for interactive performance when there is heavy
disk activity in PR#27057.
a proclist and call the specified function for each of them.
primarily to fix a procfs locking problem, but i think that it's useful for
others as well.
while i'm here, introduce PROCLIST_FOREACH macro, which is similar to
LIST_FOREACH but skips marker entries which are used by proclist_foreach_call.
kernel message buffer/log. Its off by default and can be switched on in the
kernel configuration on build time, be set as a variable in ddb and be set
using sysctl.
This adds the sysctl value
ddb.tee_msgbuf = 0
by default.
The functionality is especially added and aimed for developers who are not
blessed with a serial console and wish to keep all their ddb output in the
log. Specifying /l as a modifier to some selected commands will also put
the output in the log but not all commands provide one nor has the same
meaning for all commands.
This feature could in the future also be implemented as an ddb command but
that could lead to more bloat allthough maybe easier for non developpers to
use when mailing their backtraces from kernel crashes.
segment should succeed even if the segment would be marked removed; use this
to implement the Linux-compatible semantics of shmat(2)
this fixes the old Linux VMware3 graphics problem with local display,
and possibly other local Linux X clients using MIT-SHM
for Linux-compatible shmat() behaviour - shmat() for the removed shared memory
segment must work from all callers, the shared memory id could be passed e.g.
to native X server via MIT-SHM
temporarily remove the functionality, the Linux-compatible semantics
will be reimplemented differently
the reset condition are processed properly; this fixes PR#26687 by
Jan Schaumann
many thanks to Mark Davies, who tracked the offending change down
and helped test patches
while here, g/c unused sigtrapmask and rearrange some code to pre-r1.190 form
for better readability
calls to ensure that the vnode lock state is as expected when the VOP
call is made. Modify vnode_if.src to set the expected state according
to the documenting lock table for each VOP. Modify vnode_if.sh to emit
the checks.
Notes:
- The checks are only performed if the vnode has the VLOCKSWORK bit
set. Some file systems (e.g. specfs) don't even bother with vnode
locks, so of course the checks will fail.
- We can't actually run with VNODE_LOCKDEBUG because there are so many
vnode locking problems, not the least of which is the "use SHARED for
VOP_READ()" issue, which screws things up for the entire call chain.
Inspired by similar changes in OpenBSD, but implemented differently.
- Clean up the namespace of this module and enable the encode/decode
functions and printing functions.
- Move the code that actually generates the UUID out of the system call
routine and into its own function.
> Call dom_dispose() for any SCM_RIGHTS message that went through the
> read path rather than recv. Previously, if an fd was passed via
> sendmsg() but was consumed by the receiver via read() the ref count
> was incremented and never decremented and so the ref count would
> never reach zero even when there was no long any processes holding
> the file open (this was especially bad for locked fds).
loadable drivers and user controlled attach/detach of devices.
An outline was given in
http://mail-index.NetBSD.org/tech-kern/2004/08/11/0000.html
To cite the relevant parts:
-Add a "child detached" and a "rescan" method (both optional)
to the device driver. (This is added to the "cfattach" for now
because this is under the driver writer's control. Logically
it belongs more to the "cfdriver", but this is automatically
generated now.)
The "child detached" is called by the autoconf framework
during config_detach(), after the child's ca_detach()
function was called but before the device data structure
is freed.
The "rescan" is called explicitely, either after a driver LKM
was loaded, or on user request (see the "control device" below).
-Add a field to the device instance where the "locators" (in
terms of the autoconf framework), which describe the actual
location of the device relatively to the parent bus, can be
stored. This can be used by the "child detached" function
for easier bookkeeping (no need to lookup by device instance
pointer). (An idea for the future is to use this for generation
of optimized kernel config files - like DEC's "doconfig".)
-Pass the locators tuple describing a device's location to
various autoconf functions to support the previous. And since
locators do only make sense in relation to an "interface
attribute", pass this as well.
-Add helper functions to add/remove supplemental "cfdata"
arrays. Needed for driver LKMs.
There is some code duplication which will hopefully resolved
when all "submatch"-style functions are changed to accept the
locator argument.
Some more cleanup can take place when config(8) issues more
information about locators, in particular the length and default
values. To be done later.
* Rather than using mnt_maxsymlinklen to indicate that a file systems returns
d_type fields(!), add a new internal flag, IMNT_DTYPE.
Add 3 new elements to ufsmount:
* um_maxsymlinklen, replaces mnt_maxsymlinklen (which never should have existed
in the first place).
* um_dirblksiz, which tracks the current directory block size, eliminating the
FS-specific checks littered throughout the code. This may be used later to
make the block size variable.
* um_maxfilesize, which is the maximum file size, possibly adjusted lower due
to implementation issues.
Sync some bug fixes from FFS into ext2fs, particularly:
* ffs_lookup.c 1.21, 1.28, 1.33, 1.48
* ffs_inode.c 1.43, 1.44, 1.45, 1.66, 1.67
* ffs_vnops.c 1.84, 1.85, 1.86
Clean up some crappy pointer frobnication.
* Process A is closing one file descriptor belonging to a device. In doing so,
ffs_update() is called and starts writing a block synchronously. (Note: This
leaves the vnode locked. It also has other instances -- stdin, et al -- of
the same device open, so v_usecount is definitely non-zero.)
* Process B does a revoke() on the device. The revoke() has to wait for the
vnode to be unlocked because ffs_update() is still in progress.
* Process C tries to open() the device. It wedges in checkalias() repeatedly
calling vget() because it returns EBUSY immediately.
To fix, this:
* checkalias() now uses LK_SLEEPFAIL rather than LK_NOWAIT. Therefore it will
wait for the vnode to become unlocked, but it will recheck that it is on the
hash list, in case it was in the process of being revoke()d or was revoke()d
again before we were woken up.
* Since we're relying on the vnode lock to tell us that the vnode hasn't been
removed from the hash list *anyway*, I have moved the code to remove it into
the DOCLOSE section of vclean(), inside the vnode lock.
In the example at hand, process A was sh(1), process B was a child of init(8),
and process C was syslogd(8).
from Stephan Uphoff, FreeBSD PR/69964.
(http://www.freebsd.org/cgi/query-pr.cgi?pr=69964)
> The LK_WANT_EXCL and LK_WANT_UPGRADE bits act as mini-locks and can block
> other threads.
> Normally this is not a problem since the mini locks are upgraded to full loc
> and the release of the locks will unblock the other threads.
> However if a thread reset the bits without optaining a full lock
> other threads are not awoken.
> This can happens if obtaining the full lock fails because of a LK_SLEEPFAIL,
> or a signal (if lock priority includes PCATCH .. don't think this is used).
ensure that no one else have the same lock.
a patch from Stephan Uphoff, FreeBSD PR/69934.
(http://www.freebsd.org/cgi/query-pr.cgi?pr=69934)
> Upgrading a lock does not play well together with acquiring
> an exclusive lock and can lead to two threads being
> granted exclusive access.
>
> Problematic sequence:
> Thread A acquires a previous unlocked lock in shared mode.
> Thread B tries to acquire the same lock in exclusive mode
> and blocks.
> Thread A upgrades its lock - waking up thread B.
> Thread B wakes up and also acquires the same lock as it only checks
> if the lock is not shared or if someone wants to upgrade the lock
> and not if someone already upgraded the lock to an exclusive lock.
long, not an int, and this causes "problems" on LP64be machines
(sparc64, etc). Assign the value to a temporary int and instrument
that instead. Should be fine until someone wants a message buffer
larger than two gigabytes.
This areas is called the comm pages. It is used to provide fast access to
several data and functions.
The comm pages are mapped starting at 0xffff800 (address chosed so that
absolute branch can be used, so it can be accessed even when dynamic linking
is not ready). NetBSD has the user stack here, so we need to provide a
Darwin-specific stack setup routine which sets the top of the stack at
0xbfff0000.
This implementation is not complete but it does enough to get MacOS X.3
starting again (static binaries run, dynamic binaries still have an issue).
in the comm pages functions, we only implement bcopy, pthread_self and
memcpy.
TODO:
- clean up the powerpc specific code from MD parts
- for now we map only one page to avoid a crash, we want two pages.
- write all the comm functions.
want more flexible namecache handling.
it just looks up a dnlc entry and vget() the result vnode.
ie. no automatic entry removal, no automatic vnode locking.
discussed on tech-kern@.
doesn't advance while we're waiting on the lock. In fact, try to take
the lock even before blocking interrupts: the lock is locking "lasttime"
against other callers of cc_microtime(), not against the clock routines,
and if we take a clock interrupt while waiting for the lock, that's one
we don't have to take after the computations, but before returning to
the caller, and that makes the data a little fresher to the caller.
Moreover, inverting the order of splXXX() and simple_lock() permits us
to unblock interrupts before doing the long division.
With this, finally, performance of "ntpd" on my MP i386 seems to be no
worse than on non-MP i386, so this may fix PR kern/24207.
for consistency with M_FREE() and m_freem(). Affected files:
sys/mbuf.h
kern/uipc_socket2.c
kern/uipc_mbuf.c
net/if_ethersubr.c
netatalk/ddp_input.c
nfs/nfs_socket.c
a ktraced file descriptor that has already been invalidated. Change
all ktrace functions to propagate the error from ktrwrite() and
check for it. Thanks to Pavel Cahyna for finding this and giving
a perfect bug report.
[should be pulled up for 2.0]
doing copy-on-write.
- Change VFS_SNAPSHOT() to return the snapshot vnode locked.
- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().
L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>
overloading "usec". The counter isn't counting micro-seconds, and using
the same variable to mean two different things is false economy: with
this change, the compiled object is 72 bytes smaller on i386, and the
code is easier to understand, to boot.
We do an MGETHDR)() for each mbuf "packet" of the input chain, to hold
the socket address prepended to that "packet". If those MGETHDR()s
ever failed, we would leak all the successfully-allocated mbuf
headers. Leak noted by Yamamoto-san (yamt@NetBSD.org); thanks for catching it!
Add socketbuf invariant-checking macros to sbappendaddrchain(), and
replace a stray bcopy() with memcpy(), also as suggested by Yamamoto-san.
Fix a potential race condition when reallocating storage for file descriptors
(even for non-SMP kernels).
Add missing locks for `struct file' ref count updates.
Introduce new socket-layer function sbappendaddrchain() to
sys/kern/uipc_socket2.c: like sbappendaddr(), only takes a chain of
records and appends the entire chain in one pass. sbappendaddrchain()
also takes an `sbprio' argument, which indicates the caller requires
special `reliable' handling of the socket-buffer. `sbprio' is
described in sys/sys/socketvar.h, although (for now) the different
levels are not yet implemented.
Rework sys/netipsec/key.c PF_KEY DUMP responses to build a chain of
mbuf records, one record per dump response. Unicast the entire chain
to the requestor, with all-or-none semantics.
Changed files;
sys/socketvar.h kern/uipc_socket2.c netipsec/key.c
Reviewed by:
Jason Thorpe, Thor Lancelot Simon, post to tech-kern.
Todo: request pullup to 2.0 branch. Post-2.0, rework sysctl() API for
dumps to use new record-chain constructors. Actually implement
the distinct service levels in sbappendaddrchain() so we can use them
to make PF_KEY ACQUIRE messages more reliable.
- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.
Welcome to 2.0F.
Approved by: Jason R. Thorpe <thorpej@netbsd.org>
Add a new explicit `struct proc *p' argument to socreate(), sosend().
Use that argument instead of curproc. Follow-on changes to pass that
argument to socreate(), sosend(), and (*so->so_send)() calls.
These changes reviewed and independently recoded by Matt Thomas.
Changes to soreceive() and (*dom->dom_exernalize() from Matt Thomas:
pass soreceive()'s struct uio* uio->uio_procp to unp_externalize().
Eliminate curproc from unp_externalize. Also, now soreceive() uses
its uio->uio_procp value, pass that same value downward to
((pr->pru_usrreq)() calls for consistency, instead of (struct proc * )0.
Similar changes in sys/nfs to eliminate (most) uses of curproc,
either via the req-> r_procp field of a struct nfsreq *req argument,
or by passing down new explicit struct proc * arguments.
Reviewed by: Matt Thomas, posted to tech-kern.
NB: The (*pr->pru_usrreq)() change should be tested on more (all!) protocols.
so that it doesn't fire when called twice in the same microsecond,
which can lead to large error accumulation.
Appears to fix "repeated gettimeofday() goes backwards" on a fast
alpha and i386 box.
attempt is terminated, so if a process needs frequent timer interrupts
it can't ever connect() to a machine far away.
Bug found by Erik Lundgren, bugfix (for the same problem) is similar to
the way FreeBSD solved the same problem.
As a side effect, the new connect() behaviour conformes to Posix.
further deprecate struct timezone usage by changing `tzp' argument to
gettimeofday() to void *; align utimes(2) declaration by changing `times`
argument from struct timeval * to struct timeval[2]. From Murray
Armfield in PR standards/25331.
In due curse, reflect these changes in futimes(2), lutimes(2), and
settimeofday(2).
to pool_init. Untouched pools are ones that either in arch-specific
code, or aren't initialiased during initial system startup.
Convert struct session, ucred and lockf to pools.
list of file system types currently supported by the kernel.
Previously there wasn't an easy way to determine this.
(Code shamelessly cribbed from subr_disk.c::sysctl_hw_disknames().)
Use LIST_FOREACH() appropriately.
and unkillable processes.
1. Introduce new SBSIZE resource limit from FreeBSD to limit socket buffer
size resource.
2. make sokvareserve interruptible, so processes ltsleeping on it can be
killed.
would be good) mostly copied from sysctl(3). This takes care of the
top-level, most of kern.* and hw.* (modulo the ath and bge stuff), and
all of proc.*.
If you don't want the added rodata in your kernel, use "options
SYSCTL_NO_DESCR" in your kernel config.
setup function to set the description, even if the node has been
instantiated elsewhere. Or not, depending on the other that the setup
functions are called.
namely, allow following operations.
- purge only an entry specified by a component name.
- purge only child entries.
- purge only parent entries.
no objections on tech-kern@.
Add an initializer for them: KSI_INIT_EMPTY
Add a predicate for them: KSI_EMPTY_P
Don't bother storing empty ksiginfo_t's since they have no information.
Change uses of KSI_INIT to KSI_INIT_EMPTY where no other information other
than the signo is being filled in.
not blocked. Otherwise (it if it blocked or the hanlder is set to SIG_IGN)
reset the signal back to its default settings so that a coredump can be
generated.
other semantics from an earlier incarnation.
Call kcont_init() from init_main before device autoconfiguration,
so kcont is availble to device drivers if required.
Also ensure the kthread process runs any pending continuations once
the kthread is finally up and running. For now, use a non-null timeout
to poll the queue periodically. Draining any pending requests just
before the kthread enters its ltsleep()/kc_run loop is cleaner, but
this is the version I tested with an early-in-boot kcont request.)
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.
Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.
Reviewed by Thor Simon.
size 0.
This way, individual ports can circumvent sigcode mapping
by setting sigcode/esigcode.
(would be better to clean up the __HAVE_SIGINFO/COMPAT_XX
stuff, but it is not a good moment now)
- factor out common code.
- don't stop searching before the target.
- touch the correct object.
- validate the argument before the loop otherwise we need to roll back.
- move per VP data into struct sadata_vp referenced from l->l_savp
* VP id
* lock on VP data
* LWP on VP
* recently blocked LWP on VP
* queue of LWPs woken which ran on this VP before sleep
* faultaddr
* LWP cache for upcalls
* upcall queue
- add current concurrency and requested concurrency variables
- make process exit run LWP on all VPs
- make signal delivery consider all VPs
- make timer events consider all VPs
- add sa_newsavp to allocate new sadata_vp structure
- add sa_increaseconcurrency to prepare new VP
- make sys_sa_setconcurrency request new VP or wakeup idle VP
- make sa_yield lower current concurrency
- set sa_cpu = VP id in upcalls
- maintain cached LWPs per VP
be updated. (This is needed to be compatible with how pre-SIGINFO signals
operated. If you siglongjmp out of a signal handler, the SS_ONSTACK state
needs to be cleared. This commit restores that functionality).
re-used by another cpu immediately. in that case, lwp_exit2() will
access freed memory. to fix this:
- remove curlwp from p_lwps in exit1() rather than letting lwp_exit2() do so.
- add assertions to ensure freed proc has no lwps.
kern/24329 from me and kern/24574 from Havard Eidnes.
In addition to current one (i.e., don't wast so large part of the page),
- if the header fitsin the page without wasting any items, put it there.
- don't put the header in the page if it may consume rather big item.
For example, on i386, header is now allocated in the page for the pools
like fdescpl or sigapl, and allocated off the page for the pools like
buf1k or buf2k.
kern.proc, kern.proc2, kern.lwp, and kern.buf.
Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communcating buffer information to userland.
Convert systat to the new kern.buf method.
Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.
Oh, and fix a typo in a comment.
and then allocate it on demand. Rename some common symbols (__bss_start,
_edata, _end, __start_link_set_*, __stop_link_set_*) so that ".<module>"
is appended to them. This shrinks an amd64 kernel by 20KB of BSS.