state into global and per-CPU scheduler state:
- Global state: sched_qs (run queues), sched_whichqs (bitmap
of non-empty run queues), sched_slpque (sleep queues).
NOTE: These may collectively move into a struct schedstate
at some point in the future.
- Per-CPU state, struct schedstate_percpu: spc_runtime
(time process on this CPU started running), spc_flags
(replaces struct proc's p_schedflags), and
spc_curpriority (usrpri of processes on this CPU).
- Every platform must now supply a struct cpu_info and
a curcpu() macro. Simplify existing cpu_info declarations
where appropriate.
- All references to per-CPU scheduler state now made through
curcpu(). NOTE: this will likely be adjusted in the future
after further changes to struct proc are made.
Tested on i386 and Alpha. Changes are mostly mechanical, but apologies
in advance if it doesn't compile on a particular platform.
db_stack_trace_print(__builtin_frame_address(0),...), to printf() the
stack trace to the message bufffer and console. Idea from SunOS/Solaris.
Useful when dumping fails.
errors from ps(1) and some other kernel grovellers, and return some
data that has previously only been accessable with /dev/kmem read
access. The sysctls are:
+ KERN_PROC2 - return an array of fixed sized "struct kinfo_proc2"
structures that contain most of the useful user-level data in
"struct proc" and "struct user". The sysctl also takes the size of
each element, so that if "struct kinfo_proc2" grows over time old
binaries will still be able to request a fixed size amount of data.
+ KERN_PROC_ARGS - return the argv or envv for a particular process id.
envv will only be returned if the process has the same user id as the
requestor or if the requestor is root.
+ KERN_FSCALE - return the current kernel fixpt scale factor.
+ KERN_CCPU - return the scheduler exponential decay value.
+ KERN_CP_TIME - return cpu time state counters.
With input and suggestions from many people on tech-kern.
which indicates that the process is actually running on a
processor. Test against SONPROC as appropriate rather than
combinations of SRUN and curproc. Update all context switch code
to properly set SONPROC when the process becomes the current
process on the CPU.
Change #define's of the form
#define panic(a) printf(a)
to
#define \
panic(a) printf(a)
to prevent ctags(1) from detecting there is a tag.
Otherwise, the tags file claims panic() is in subr_extent.c
instead of subr_prf.c.
a set of flags ("flags"). Two flags are defined, UPDATE_WAIT and
UPDATE_DIROP.
Under the old semantics, VOP_UPDATE would block if waitfor were set,
under the assumption that directory operations should be done
synchronously. At least LFS and FFS+softdep do not make this
assumption; FFS+softdep got around the problem by enclosing all relevant
calls to VOP_UPDATE in a "if(!DOINGSOFTDEP(vp))", while LFS simply
ignored waitfor, one of the reasons why NFS-serving an LFS filesystem
did not work properly.
Under the new semantics, the UPDATE_DIROP flag is a hint to the
fs-specific update routine that the call comes from a dirop routine, and
should be wait for, or not, accordingly.
Closes PR#8996.
contains the values __SIMPLELOCK_LOCKED and __SIMPLELOCK_UNLOCKED, which
replace the old SIMPLELOCK_LOCKED and SIMPLELOCK_UNLOCKED. These files
are also required to supply inline functions __cpu_simple_lock(),
__cpu_simple_lock_try(), and __cpu_simple_unlock() if locking is to be
supported on that platform (i.e. if MULTIPROCESSOR is defined in the
_KERNEL case). Change these functions to take an int * (&alp->lock_data)
rather than the struct simplelock * itself.
These changes make it possible for userland to use the locking primitives
by including <machine/lock.h>.
MALLOC()/FREE().
- In ktrgenio():
- Don't allocate the entire size of the I/O for the temporary
buffer used to write the data to the trace file. Instead,
do it in page-sized chunks.
- As in uiomove(), preempt the process if we are hogging the CPU.
- If writing to the trace file errors, abort rather than continuing
to loop through the buffer.
From Artur Grabowski <art@stacken.kth.se>, with some additional cleanup
by me.
symlinks, and thus can operate on symlinks. remove a bogus comment in
chflags(1) that claims symlinks do not have file flags.
XXX: todo -- make chflags(1) use lchflags(2) when given the right options.
getnewvnode() has been changed to virtually guarantee that we'll have more
vnodes than "desired", so previously there would always be more vnodes
than namecache entries. this fixes PR 9792.
* Remove the casts to vaddr_t from the round_page() and trunc_page() macros to
make them type-generic, which is necessary i.e. to operate on file offsets
without truncating them.
* In due course, cast pointer arguments to these macros to an appropriate
integral type (paddr_t, vaddr_t).
Originally done by Chuck Silvers, updated by myself.
timeout()/untimeout() API:
- Clients supply callout handle storage, thus eliminating problems of
resource allocation.
- Insertion and removal of callouts is constant time, important as
this facility is used quite a lot in the kernel.
The old timeout()/untimeout() API has been removed from the kernel.
in vfs_detach(). vfs_done may free global filesystem's resources,
typically those allocated in respective filesystem's init function.
Needed so those filesystems which went in via LKM have a chance to
clean after themselves before unloading. This fixes random panics
when LKM for filesystem using pools was loaded and unloaded several
times.
For each leaf filesystem, add appropriate vfs_done routine.
metadata. If you do call it, there's actually a fair chance that it
will panic because its metadata dependencies were not cleared in
the VOP_FSYNC above (with FSYNC_DATAONLY).
mode and ownership bits are flushed to disk before the vnode is
reclaimed.
The check, introduced in the softdep merge, assumes that if no blocks
are dirty, no file data *or metadata* needs to be flushed to disk. This
is true of ffs, but is not true of lfs, and may not be true of other
filesystems.
Tested by myself and Bill Squier <groo@cs.stevens-tech.edu>.
disk space rather than doing it in timeout handler. This fixes long
standing bug that accounting file can't be put on NFS file system (so,
e.g, we couldn't turn on accounting on diskless system).
between protocol handlers.
ipsec socket pointers, ipsec decryption/auth information, tunnel
decapsulation information are in my mind - there can be several other usage.
at this moment, we use this for ipsec socket pointer passing. this will
avoid reuse of m->m_pkthdr.rcvif in ipsec code.
due to the change, MHLEN will be decreased by sizeof(void *) - for example,
for i386, MHLEN was 100 bytes, but is now 96 bytes.
we may want to increase MSIZE from 128 to 256 for some of our architectures.
take caution if you use it for keeping some data item for long period
of time - use extra caution on M_PREPEND() or m_adj(), as they may result
in loss of m->m_pkthdr.aux pointer (and mbuf leak).
this will bump kernel version.
(as discussed in tech-net, tested in kame tree)
the change constitutes binary compatibility issue hen sizeof(long) !=4.
there's no way to be backward compatible, and only guys affected
are IPv6 userland tools.
From: =?iso-8859-1?Q?G=F6ran_Bengtson?= <goeran@cdg.chalmers.se>
amount of physical memory, divide it by 4, and then allow machine
dependent code to place upper and lower bounds on the size. Export
the computed value to userspace via the new "vm.nkmempages" sysctl.
NKMEMCLUSTERS is now deprecated and will generate an error if you
attempt to use it. The new option, should you choose to use it,
is called NKMEMPAGES, and two new options NKMEMPAGES_MIN and
NKMEMPAGES_MAX allow the user to configure the bounds in the kernel
config file.
1) fix typo preventing compilation (missing comma).
2) in SLOCK_WHERE, display cpu number in the MP case.
3) the folowing race condition was observed in _simple_lock:
cpu 1 releases lock,
cpu 0 grabs lock
cpu 1 sees it's already locked.
cpu 1 sees that lock_holder== "cpu 1"
cpu 1 assumes that it already holds it and barfs.
cpu 0 sets lock_holder == "cpu 0"
Fix: set lock_holder to LK_NOCPU in _simple_unlock().
cause a core to drop, and whether the core dropped, or, if it did
not, why not (i.e. error number). Logs process ID, name, signal that
hit it, and whether the core dump was successful.
logging only happens if kern_logsigexit is non-zero, and it can be
changed by the new sysctl(3) value KERN_LOGSIGEXIT. The name of this
sysctl and its function are taken from FreeBSD, at the suggestion
of Greg Woods in PR 6224. Default behavior is zero for a normal
kernel, and one for a kernel compiled with DIAGNOSTIC.
to go to the inversion list is incomplete. If the cylinders are equal
block numbers must be checked.
This caused lockups if some buffers with the same cylinder were cycling
through the list, as it may happen with softdep enabled.
Fixes PR #9197.
- If RB_ASKNAME, only dumpdv holds the results asked interactively.
Examie dumpspec only when !RB_ASKNAME. This allows us to override
dumps on none in kernel config file by booting kernel with RB_ASKNAME.
- Slightly rearrange code so that it more matches to comment.
execute certain functions when a process does an exec(). Currently
uses a global list. Could possibly be done using a per-process list,
but this needs more thought.
until all device driver discovery threads have had a chance to do their
work. This in turn blocks initproc's exec of init(8) until root is
mounted and process start times and CWD info has been fixed up.
Addresses kern/9247.
The rule is that you don't get to call scheduler-related functions (e.g.
wakeup()) above the clock interrupt. Going to statclock unnecessarily
hoses e.g. serial interrupts on the SPARC.
be issued/completed in order; that is, provide a barrier for I/O queues.
- Change the buffer driver queue links to a TAILQ, rather than using
a home-grown equivalent. Provide BUFQ_*() macros to manipulate buffer
queues; these deal with the barrier provided by B_ORDERED.
- Update disksort() accordingly, and provide 3 versions:
- disksort_cylinder(): historical disksort(), which keys on
b_cylinder (and b_blkno for the case when b_cylinder matches).
- disksort_blkno(): sorts only on b_blkno. Essentially the
same as disksort_cylinder(), but with fewer comparisons.
- disksort_tail(): requests are simply inserted into the queue
at the tail. This is provided as an option so that drivers
can simply have a pointer to the appropriate sort function.
Note that disksort() now pays attention to B_ORDERED.
if you com* at pcmcia?, and com3 and com4 as pcmcia cards, and removed
and reinserted the card that was com3, it would become com5. if you then
removed and reinserted com4, it would become com6. etc.) Now, instead
of incrementing FSTATE_STAR configuration entries for a driver when
a cloning instance is attached, leave it alone, and scan the device softc
array (starting at the first cloning unit number) for units which are
available for use. This wastes a tiny bit of time (can require a linear
scan of the softc table for the device), but device attachment should be
relatively infrequent and the number of units of each type of device
is never particularly large anyway.
(Previously buffers could be marked dirty by the cleaner, and possibly by
other means.)
Also check for softdep mount in vfs_shutdown before trying to bawrite
buffers, since other filesystems don't need it and lfs doesn't bawrite.
(This fragment reviewed by fvdl.)
Partially addresses PR#8964.
due to massive changes in KAME side.
- IPv6 output goes through nd6_output
- faith can capture IPv4 packets as well - you can run IPv4-to-IPv6 translator
using heavily modified DNS servers
- per-interface statistics (required for IPv6 MIB)
- interface autoconfig is revisited
- udp input handling has a big change for mapped address support.
- introduce in4_cksum() for non-overwriting checksumming
- introduce m_pulldown()
- neighbor discovery cleanups/improvements
- netinet/in.h strictly conforms to RFC2553 (no extra defs visible to userland)
- IFA_STATS is fixed a bit (not tested)
- and more more more.
TODO:
- cleanup os-independency #ifdef
- avoid rcvif dual use (for IPsec) to help ifdetach
(sorry for jumbo commit, I can't separate this any more...)
uncleanly due to a lost connection, it would hang in closef() waiting
for the usecount to go back to 1.
An audit of FILE_USE() vs FILE_UNUSE() usage led me to discover some
incorrect error-path code..
In sys_fcntl(), avoid leaking a file descriptor usecount in an error
case of F_SETFL; don't return, instead go to "out" to clean up. I
suspect that the F_SETFL would fail because vop_fcntl is not
implemented in deadfs.
just for reference purposes.
This commit includes 1.4 -> 1.4.1 sync for kame branch.
The branch does not compile at all (due to the lack of ALTQ and some other
source code). Please do not try to modify the branch, this is just for
referenre purposes.
synchronization to latest KAME will take place on HEAD branch soon.
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).
are handled as arrays; that is, a truncated old value is returned, alongside
with ENOMEM, if the buffer is too small.
- in all int, quad, and single struct cases, and all specials handled inside
this file, oldlenp semantics are now as documented in the manual page, that
is, a NULL oldp, but non-NULL oldlenp returns the needed size
[I had to change the oldlenp handling, so I thought I should make it as
advertized. Formerly, the subroutines would not know when a NULL oldlenp
was passed, do the work anyway, and the value would be thrown away.]
This is needed as a first step to make gethostname() and getdomainname()
conform to its own manual page and SUSV2. (See pr 7836 by Simon Burge)
default, as the copyright on the main file (ffs_softdep.c) is such
that is has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.
Bump version number to 1.4O
be used uninitialized when name[0] != PROC_CURPROC and
proclists[0]->pd_list == NULL; actually, this can never happen
(proclists[0] == &allproc), but the compiler can not know this, so it
complains
recover from failures to accept a socket successfully. Problem suggested
by this:
> It would appear (from two "panic: closef: count < 0" failures in less
> than 12 hours) that Darren's fix to accept(2) for lost file descriptors
> isn't quite correct. His fix inserts a call to closef() to handle one
> of several possible error conditions. However everywhere else in the
> socket code in the same file where falloc() cleanup is necessary the
> function used is ffree().
core filename format, which allow to change the name of the core dump,
and to relocate it in a directory. Credits to Bill Sommerfeld for giving me
the idea :)
The default core filename format can be changed by options DEFCORENAME and/or
kern.defcorename
Create a new sysctl tree, proc, which holds per-process values (for now
the corename format, and resources limits). Process is designed by its pid
at the second level name. These values are inherited on fork, and the corename
fomat is reset to defcorename on suid/sgid exec.
Create a p_sugid() function, to take appropriate actions on suid/sgid
exec (for now set the P_SUGID flag and reset the per-proc corename).
Adjust dosetrlimit() to allow changing limits of one proc by another, with
credential controls.
- Call configure() after setting up proc0.
- Call initclocks() from configure(), after cpu_configure(). Once the
clocks are running, clear `cold'. Then run interrupt-driven
autoconfiguration.
Tree structure:
- sys/arch/sh3: sh3 generic code
As commented, in-chip device drivers are put into sys/arch/sh3/dev.
- sys/arch/evbsh3: sh3 evaluation boards (pure sh3 CPU, no fancy external HW)
- sys/arch/mmeye: Brains mmEye, www.brains.co.jp
MI source code includes couple of #ifdef for sh3-coff support.
(sh3 uses coff or elf)
Needs some more improvements, especialy in sys/arch/sh3/conf/files.sh3,
to compile the tree (due to last minute tree structure change).
approximation of reality if the MD code doesn't. This variable is the
equivalent of "tickfix" for the non-NTP path.
This allows an alpha kernel (where hz=1024) with "options NTP" to
synch up quite nicely (as opposed to having an frequency error of
~560ppm, which is outside the capture range of the PLL).
If the entry is found in name cache, cache_lookup() does all the
necessary locking now, simplifying the interface and making the
code easier to follow and maintain.
The code now also removes the entry from cache when it's either invalid
(vget() fails) or the vnode has been recycled while waiting for the lock.
In that case, unlock/relock of the directory vnode has been eliminated too.
Both changes could lead to sligh performace improvement in same cases.
Furthermore, obscure bug has been found and eliminated for ISDOTDOT in the
lockparent && ISLASTCN case: if the vget() succeded and the re-lock
of the directory vnode not, we returned the error with the '..' vnode still
locked.
For simplicity, cache_lookup() now returns 0 if the positive entry was found
in cache, -1 if not found and ENOENT or error returned by the locking
functions in any other case.
Many thanks to Bill Studenmund and especially Charles Hannum
for invaluable advices and code to get this right.
Tested by: jdolecek
Rewieved by: wrstuden, mycroft
detect a little earlier if we've dup-put'd. Otherwise, underflow occurs,
and subsequent allocations simply hang or fail (it thinks the hardlimit
has been reached).
by the Single UNIX Specification version 2, rather than the SVR2-derived
types. While I was here, I did a namespace sweep to expose the constants
and strucutures, and structure members described by SUSv2; documentation
updates coming shortly.
Fixes kern/8158.
that is priority is rasied. Add a new spllowersoftclock() to provide the
atomic drop-to-softclock semantics that the old splsoftclock() provided,
and update calls accordingly.
This fixes a problem with using the "rnd" pseudo-device from within
interrupt context to extract random data (e.g. from within the softnet
interrupt) where doing so would incorrectly unblock interrupts (causing
all sorts of lossage).
XXX 4 platforms do not have priority-raising capability: newsmips, sparc,
XXX sparc64, and VAX. This platforms still have this bug until their
XXX spl*() functions are fixed.
allocations should fail if the pool is at its hard limit.
Document flag in pool(9).
Use it in mbuf.h for the first allocate call for M_GET, M_GETHDR, and
MCLGET, so that m_reclaim gets called even for blocking allocations.
mbufs since you might overwriting valuable data. (think of
m_copy'ed data from a TCP re-transmission queue. Since those
might be in clusters and referenced in two sockets).
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().
Reviewed by: thorpej
Tested by: wrstuden
Schroder <perseant@hitl.washington.edu>, unlock the mounted on
vnode before we call VFS_ROOT so that we cover the case where the new
root vnode shares a lock with the mounted-on vnode. Note that we have
asserted vfs_busy on the new fs before unlocking, so no other process can
steal the mount out from under us.
The problem was due to an interaction between the doomed unmounts done by
amd and getnewvnode.
I convinced myself that it's ok for getnewvnode() to do a sleeping vfs_busy().
Tested with multiple builds running while another process attempted to unmount
/usr once a second.
too. Remove some needless code duplication by adding a "drain" argument
to the ACQUIRE() macro (compiler can [and does] optimize the constant
conditional).
locking primitive directly to lock it, since those will never attempt
to call printf() to display debugging information (and thus deadlock
on recursion into the kprintf_slock).
- Now compatible with MULTIPROCESSOR (requires other changes not yet
committed, but which will be later today).
- In addition to tracking simple locks, track exclusive spin locks.
- Count spin locks like we do sleep locks (in the cpu_info for this
CPU).
- Lock debug lists are now TAILQs, so as to make the locking order
more obvious when dumping the list.
Also, some suggestions from Bill Sommerfeld:
- SIMPLELOCK_LOCKED and SIMPLELOCK_UNLOCKED constants, which may be
defined in <machine/lock.h> (default to 1 and 0, respectively). This
makes it easier to support architectures which use test-and-clear
rather than test-and-set.
- Add __attribute__((__aligned__)) to the `lock_data' member of the
simplelock structure. This makes it easier to support architectures
which can only perform atomic operations on very-well-aligned memory
locations. NOTE: This changes the size of struct simplelock, and
will cause a version bump.
first in line for the specified identifier. For use in places where
you don't want a Thundering Herd.
While here, add an optimization to wakeup() suggested by Ross Harvey.
calls to reflect this. Also, block statclock rather than softclock during
in the proclist locking functions, to address a problem reported on
current-users by Sean Doran.
write lock when doing PID allocation, and during the process exit path.
Use a read lock every where else, including within schedcpu() (interrupt
context). Note that holding the write lock implies blocking schedcpu()
from running (blocks softclock).
PID allocation is now MP-safe.
Note this actually fixes a bug on single processor systems that was probably
extremely difficult to tickle; it was possible that schedcpu() would run
off a bad pointer if the right clock interrupt happened to come in the
middle of a LIST_INSERT_HEAD() or LIST_REMOVE() to/from allproc.
and PID allocation MP-safe. A new process state is added: SDEAD. This
state indicates that a process is dead, but not yet a zombie (has not
yet been processed by the process reaper).
SDEAD processes exist on both the zombproc list (via p_list) and deadproc
(via p_hash; the proc has been removed from the pidhash earlier in the exit
path). When the reaper deals with a process, it changes the state to
SZOMB, so that wait4 can process it.
Add a P_ZOMBIE() macro, which treats a proc in SZOMB or SDEAD as a zombie,
and update various parts of the kernel to reflect the new state.
body of reaper(), right before the call to uvm_exit(). cpu_wait() must
be done before uvm_exit() because the resources it frees might be located
in the PCB.
remove simplelockrecurse, lockpausetime and PAUSE():
none of these serve any purpose anymore.
in the LOCKDEBUG functions, expand the splhigh() region to
cover the entire function. without this there can still be races.
- When the exit signal is specified to be 0, don't just assume they
meant SIGCHLD. In the Linux world, this appears to mean "don't deliver
an exit signal at all".
- Simplify P_EXITSIG(); don't check against initproc here, just change
the exit signal to SIGCHLD if reparenting to initproc.
A very simple clone(2) test program now works, and the MpegTV package
starts, but doesn't run properly yet (I believe there is a separate
bug which keeps it from working properly).
getnewvnode now checks this bit, and it if's set makes sure a vnode's not
locked before removing it from the free list.
Closes PR 7954 by Alan Barrett <apb@iafrica.com>.
Fix and document naming convention for vnode variables (always use
lvp/lvpp and uvp/uvpp instead of a hash of cvp, vpp, dvpp, pvp, pvpp).
Delete old stale #if 0'ed code at the end.
Change error path code in getcwd_getcache() slightly (merge common
cleanup code; shouldn't affect behavior any).
mp->mnt_flags & MNT_MWAIT is replaced by mp->mnt_wcnt, and a new mount
flag MNT_GONE is created (reusing the same bit).
In insmntque(), add DIAGNOSTIC check to fail if the filesystem vnode
is being moved to is in the process of being unmounted.
getnewvnode() now protects the list of vnodes active on mp with
vfs_busy()/vfs_unbusy().
To avoid generating spurious errors during a doomed unmount, change
the "wait for unmount to finish" protocol between dounmount() and
vfs_busy(). In vfs_busy(), instead of only sleeping once, sleep until
either MNT_UNMOUNT is clear or MNT_GONE is set; also, maintain a count
of waiters in mp->mnt_wcnt so that dounmount() knows when it's safe to
free mp.
tested by running a "while :; do mount /d1; umount -f /d1; done" loop
against multiple find(1) processes.
(Sorry for a big commit, I can't separate this into several pieces...)
Pls check sys/netinet6/TODO and sys/netinet6/IMPLEMENTATION for details.
- sys/kern: do not assume single mbuf, accept chained mbuf on passing
data from userland to kernel (or other way round).
- "midway" ATM card: ATM PVC pseudo device support, like those done in ALTQ
package (ftp://ftp.csl.sony.co.jp/pub/kjc/).
- sys/netinet/tcp*: IPv4/v6 dual stack tcp support.
- sys/netinet/{ip6,icmp6}.h, sys/net/pfkeyv2.h: IETF document assumes those
file to be there so we patch it up.
- sys/netinet: IPsec additions are here and there.
- sys/netinet6/*: most of IPv6 code sits here.
- sys/netkey: IPsec key management code
- dev/pci/pcidevs: regen
In my understanding no code here is subject to export control so it
should be safe.
listen/accept (PR_LISTEN flag in protosw) and detect obvious faults in
parameters passed. It is still possible for the address used for copying
the socket information to become invalid between that check and the copyout
so close the connection's allocated fd if the copyout fails so that we can
return EFAULT without allocating an fd and the application not knowing about
it. Ideally we'd be able to queue the connection back up so a later accept
could retrieve it but unfortunately that's not possible.
which use uvm_vslock() should now test the return value. If it's not
KERN_SUCCESS, wiring the pages failed, so the operation which is using
uvm_vslock() should error out.
XXX We currently just EFAULT a failed uvm_vslock(). We may want to do
more about translating error codes in the future.
the process (i.e. pre-Reno behavior). The 4.4BSD behavior (introduced
in Reno) caused transient errors to stick incorrectly.
This is from PR #7640 (Havard Eidnes), cross-checked w/ FreeBSD, where
Bill Fenner committed the same fix (as described in a comment in the
Vat sources, by Van Jacobsen).
looking up a kernel address, check to see if the address is on this
"interrupt-safe" list. If so, return failure immediately. This prevents
a locking screw if a page fault is taken on an interrupt-safe map in or
out of interrupt context.
has PAGEABLE and INTRSAFE flags. PAGEABLE now really means "pageable",
not "allocate vm_map_entry's from non-static pool", so update all map
creations to reflect that. INTRSAFE maps are maps that are used in
interrupt context (e.g. kmem_map, mb_map), and thus use the static
map entry pool (XXX as does kernel_map, for now). This will eventually
change now these maps are locked, as well.
vslocking here?! copyout() on its own seems to suffice just about everwhere
else, and it's not like the process is going to exit; it's in a system
call!
serious race condition in sosend(). Upon closer inspection, the appropriate
flags are checked within splsoftnet() for soreceive(), so no change needed
there. Also a little KNFing in sosend().
the child inherits the stack pointer from the parent (traditional
behavior). Like the signal stack, the stack area is secified as
a low address and a size; machine-dependent code accounts for stack
direction.
This is required for clone(2).
parent, specified at fork time. Specify a new flag to wait4(2), WALTSIG,
to wait for processes which use an alternate exit signal.
This is required for clone(2).
-a subregion start was ignored if all previous allocations were before
the subregion, reported by Lennart Augustsson in PR kern/7539
-an existing allocation which overlaps the beginning of the subregion
was ignored (ie overlapped) if is is not the last allocation
given buffer, and if necessary, reducing the display width of the
number to fit in the buffer by increasing the units (from kilobytes
(2^10) through to exabytes (2^60)).
count is 0, wait for use count to drain before finishing the close.
This is necessary in order for multiple processes to safely share file
descriptor tables.
* add arguments describing the vnode and ecoff header of the executable
being set up to the [onz]magic setup functions.
* export the stack setup function and the [onz]magic setup functions.
* call the MD ecoff hook _before_ the [onz]magic and stack setup
functions, and bail out early if the MD hook sets up vmcmds.
- Initialize mbpool and mclpool with msize and mclbytes, respectively,
so that those values may be patched and have an actual affect on the
next system reboot.
- Set low water marks on mbpool (default: 16) and mclpool (default: 8).
This should be of great help for diskless systems, which need to allocate
mbufs in order to clean dirty pages; the low water marks increase the
chances of this being possible to do in memory starvation situations.
- Add support for getting/setting some mbuf-related parameters via sysctl.
* msize and mclsize (read-only)
* nmbclusters (read-only unless the platform has direct-mapped pool pages,
in which case the value can be increased).
* mblowat and mcllowat (read/write)
conf/param.c, and move the initialisation of the sb_max
variable from kern/uipc_socket2.c to conf/param.c. Now
everthing that includes sys/socketvar.h doesn't get
recompiled when SB_MAX's value changes.
The problem is that if "sl" is a symbolic link, a lookup on "sl/"
will be flagged as the last component. Thus VOP_LOOKUP will lock
the parent directory if LOCKPARENT is set. In order for the symbolic
link to be resolved, this lock needs to be released. namei() would
test for this by checking if ni_pathlen == 1, which it wouldn't as
"/" is left in the name, and namei() would not unlock the parent.
The next call to lookup() to resolve the symbolic link would fail
as the parent was still locked.
initialized). This lock also protects the "next drain candidate" pointer.
XXX There is still one locking protocol problem, which should not be
a problem in practice, but is still marked as an issue in the code anyhow.
a data structure after it was freed. This wasn't actually a problem,
and the change caused the wrong pool_item_header to be freed
in the non-PR_PHINPAGE case.
Call configure() directly immediately after config_init().
This causes autoconfiguration to happen at the same time as before, but
creates some kernel submaps earlier, so that e.g. mbinit() can now
allocate memory.
- Protect userspace from unnecessary header inclusions (as noted on
current-users).
- Some const poisioning.
- GREATLY simplify the locking protocol, and fix potential deadlock
scenarios. In particular, assume that the back-end page allocator
provides its own locking mechanism (this is currently true for all
such allocators in the NetBSD kernel). Doing so allows us to simply
use one spin lock for serialized access to all r/w members of the pool
descriptor. The spin lock is released before calling the back-end
allocator, and re-acquired upon return from it.
- Fix a problem in pr_rmpage() where a data structure was referenced
after it was freed.
- Minor tweak to page manaement. Migrate both idle and empty pages
to the end of the page list. As soon as a page becomes un-empty
(by a pool_put()), place it at the head of the page list, and set
curpage to point to it. This reduces fragmentation as well as the
time required to find a non-empty page as soon as curpage becomes
empty again.
- Use mono_time throughout, and protect access to it w/ splclock().
- In pool_reclaim(), if freeing an idle page would reduce the number
of allocatable items to below the low water mark, don't.
NMBCLUSTERS for the mbuf cluster pool. On platforms which use direct-mapped
segments for pool pages (MIPS and Alpha), this makes NMBCLUSTERS actually
meaningful (such ports don't even allocate mb_map, as it is not used to
map mbuf cluster pages).
Improve the message logged at a maximum rate of once per second. The
new message: "WARNING: mclpool limit reached; increase NMBCLUSTERS".
In the back-end pool page allocator, remove the message about mb_map
being full. The message was not necessarily correct as the allocator
may have been starved for pages, rather than for space in the map. Also,
the hard limit on the mbuf cluster pool will be reached before the map
fills (the last cluster will always fit into the map), so the message
is redundant.
Add a comment in mbinit() about considering setting low water marks on
the mbuf and mbuf cluster pools.
- Add support for hard limits, with optional rate-limited logging of
a warning message when the pool limit is reached. (This will be used
to fix a bug in mbuf cluster allocation on the MIPS and Alpha ports.)
- Fix some locking protocol errors. This required splitting pr_flags
into pr_flags (which is protected by the spin lock) and pr_roflags (which
are `read only' flags, set when the pool is initialized, and never changed
again; these do not need to be protected by a mutex).
- Make the low water support actually mean something. When a low water
mark is set, add free items to the pool until the low water mark is
reached. When an item allocation causes the number of free items to
drop below the low water mark, make the pool catch up to it. This can
make the pool allocator more useful for several applications (e.g.
pmap `pv entry' management) and more robust for others (for e.g. mbuf
and mbuf cluster allocation, so that the pagedaemon can use NFS to clean
pages on diskless systems without completely running dry on buffers to
receive packets in during extreme memory shoratages).
- Add a comment where we sleep waiting for more pages for the back-end
page allocator. Specifically, instead of sleeping potentially forever,
perhaps we should just wake up once a second to try allocating a page
again. XXX Revisit this soon.
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.
same uid or by root.
This code is from FreeBSD. (Whilst it was originally obtained from OpenBSD,
FreeBSD fixed it to work with multicast. To quote the commit message:
- Don't bother checking for conflicting sockets if we're binding to a
multicast address.
- Don't return an error if we're binding to INADDR_ANY, the conflicting
socket is bound to INADDR_ANY, and the conflicting socket has
SO_REUSEPORT set.
)