correctly, free original instead of incremented pointer, copy results for
n = -2 case too, so top shows correct stats.
Additionaly, rearange code for better readability (from Andrew).
Gone are the old kern_sysctl(), cpu_sysctl(), hw_sysctl(),
vfs_sysctl(), etc, routines, along with sysctl_int() et al. Now all
nodes are registered with the tree, and nodes can be added (or
removed) easily, and I/O to and from the tree is handled generically.
Since the nodes are registered with the tree, the mapping from name to
number (and back again) can now be discovered, instead of having to be
hard coded. Adding new nodes to the tree is likewise much simpler --
the new infrastructure handles almost all the work for simple types,
and just about anything else can be done with a small helper function.
All existing nodes are where they were before (numerically speaking),
so all existing consumers of sysctl information should notice no
difference.
PS - I'm sorry, but there's a distinct lack of documentation at the
moment. I'm working on sysctl(3/8/9) right now, and I promise to
watch out for buses.
so that a specific emulation has the oportunity to filter out some signals.
if sigfilter returns 0, then no signal is sent by kpsignal2().
There is another place where signals can be generated: trapsignal. Since this
function is already an emulation hook, no call to the sigfilter hook was
introduced in trapsignal.
This is needed to emulate the softsignal feature in COMPAT_DARWIN (signals
sent as Mach exception messages)
accepted. However, this time this behavor is not the default. Instead
it must enabled by using the LOCAL_CONNWAIT socket option on either the
connecting or accepting socket.
- always wait for unblocked upcall if we have to continue a blocked
thread.
=> removes wakeup from sys_sa_stacks when a stack is returned.
=> avoids extra sa_unblockyield syscall when unblocked upcall is
delivered before blocked upcall is processed.
=> avoids double pagefault if we continued a thread before the
pagefault was resolved.
=> avoids losing unblocked state if we continued a thread after
skipping the unblocked upcall.
- use splay tree for the pagefault check if the thread was running on
an upcall stack.
=> removes the limitation that all upcall stacks need to be
adjoining and that all upcall stacks have to be loaded with the
1st sys_sa_stacks call.
=> enables keeping information associated with a stack in the kernel
which makes it simpler to find out which LWP is using a stack.
=> allows increasing the SA_MAXNUMSTACKS without having to
allocate an array of that size.
symbols, and made it impossible for the kernel to use that value, and
correctly find symbols from LKMs.
o Allow LKM users to use DDB to debug the entry function of a LKM by
loading the symbol table with the temporary name /lkmtemp/ before calling
it, and then renaming it once we know the module name.
Approved by ragge@.
copyin() or copyout().
uvm_useracc() tells us whether the mapping permissions allow access to
the desired part of an address space, and many callers assume that
this is the same as knowing whether an attempt to access that part of
the address space will succeed. however, access to user space can
fail for reasons other than insufficient permission, most notably that
paging in any non-resident data can fail due to i/o errors. most of
the callers of uvm_useracc() make the above incorrect assumption. the
rest are all misguided optimizations, which optimize for the case
where an operation will fail. we'd rather optimize for operations
succeeding, in which case we should just attempt the access and handle
failures due to insufficient permissions the same way we handle i/o
errors. since there appear to be no good uses of uvm_useracc(), we'll
just remove it.
(1) split the single list of pages allocated to a pool into three lists:
completely full, partially full, and completely empty.
there is no longer any need to traverse any list looking for a
certain type of page.
(2) replace the 8-element hash table for out-of-page page headers
with a splay tree.
these two changes (together with the recent enhancements to the wait code)
give us linear scaling for a fork+exit microbenchmark.
mbuf chains which are recycled (e.g., ICMP reflection, loopback
interface). A consensus was reached that such recycled packets should
behave (more-or-less) the same way if a new chain had been allocated
and the contents copied to that chain.
Some packet tags may in future be marked as "persistent" (e.g., for
mandatory access controls) and should persist across such deletion.
NetBSD as yet hos no persistent tags, so m_tag_delete_nonpersistent()
just deletes all tags. This should not be relied upon.
This should fix PR 23418 which was also reported by Thomas Klausner and
Ian Fry (who also provided core dumps for analysis - thanks!).
Also g/c sa_yieldcall since it's now safe to put LWPs back into the cache.
Also return stacks in failure case.
of the sibling list so that find_stopped_child can be optimised to avoid
traversing the entire sibling list - helps when a process has a lot of
children.
- Modify locking in pfind() and pgfind() to that the caller can rely on the
result being valid, allow caller to request that zombies be findable.
- Rename pfind() to p_find() to ensure we break binary compatibility.
- Remove svr4_pfind since p_find willnow do the job.
- Modify some of the SMP locking of the proc lists - signals are still stuffed.
Welcome to 1.6ZF
specifically, don't keep a stale pointer in fd_ofiles.
it isn't needed anymore as fd allocation is now done using bitmaps.
- clean up dupfdopen() a little.
- don't call fd_used() unnecessarily.
the not-entered symbols will be found anyway but via a linear-search.
This only happens if something is wrong when linking the kernel.
Fixes problems reported on port-hp700.
Remove p_raslock and rename p_lwplock p_lock (one lock is enough).
Simplify window test when adding a ras and correct test on VM_MAXUSER_ADDRESS.
Avoid unpredictable branch in i386 locore.S
(pad fields left in struct proc to avoid kernel bump)
combined. Also prepare for adding VP repossession later.
- kern_sa.c: sa_yield/sa_switch: detect if there are pending unblocked
upcalls.
- kern_sa.c: sa_unblock_userret/sa_setwoken: queue LWPs about to invoke
an unblocked upcall on the sa_wokenq. put queued LWPs in a state where
they can be put in the cache. notify LWP on the VP about pending
upcalls.
- kern_sa.c: sa_upcall_userret: check sa_wokenq for pending upcalls,
generate unblocked upcalls with multiple event sas
- kern_sa.c: sa_vp_repossess/sa_vp_donate: g/c, restore original
sa_vp_repossess
General idea: only consider the LWP on the VP for signal delivery, all
other LWPs are either asleep or running from waking up until repossessing
the VP.
- in kern_sig.c:kpsignal2: handle all states the LWP on the VP can be in
- in kern_sig.c:proc_stop: only try to stop the LWP on the VP. All other
LWPs will suspend in sa_vp_repossess() until the VP-LWP donates the VP.
Restore original behaviour (before SA-specific hacks were added) for
non-SA processes.
- in kern_sig.c:proc_unstop: only return the LWP on the VP
- handle sa_yield as case 0 in sa_switch instead of clearing L_SA, add an
L_SA_YIELD flag
- replace sa_idle by L_SA_IDLE flag since it was either NULL or == sa_vp
Also don't output itimerfire overrun warning if the process is already
exiting.
Also g/c sa_woken because it's not used.
Also g/c some #if 0 code.
we pass via sigctx, so that it guaranteed that the memory wouldn't be
paged out at the time the signal arrives
potential problem pointed out by YAMAMOTO Takashi
generate unblocked upcalls in sa_unblock_userret(), before signal
delivery/p_userret handling in userret().
Also defer getting state for preempted upcalls because on some ports
preemption can happen between sa_unblock_userret() and sa_upcall_userret().
its state is saved:
- don't sa_putcachelwp() in sa_vp_repossess/sa_vp_donate
- only defer saving the event LWP's state
- sa_putcachelwp() after the interrupted LWP's state is saved
-obey ELF_LINK_ADDR in ELF_load_file()
-set ELF_LINK_ADDR in the probe() function if needed
-make ELF_NULL_ADDR the default, so that probe() functions dont need
to set it explicitely
-allocate buffer for interpreter name only if needed
initialized. Update the txp(4) to compensate.
- Statically initialize the TCP timer callout handles in the tcpcb
template. We still use callout_setfunc(), but that call is now much
less expensive. Add a comment that the compiler is likely to unroll
the loop (so don't sweat that it's there).
separately from the bufpages, so that it would be possible to eventually
make their limits changeable in runtime
make static all local variables which do not need to be exported to other
kernel parts
set using a pointer, to save couple bytes in struct sigctx
also fix fallout from recent lwp_wakeup() change, where we failed to properly
detect if tsleep() returned as result of lwp_wakeup() or signal outside
our wait set; could have caused problems for threaded apps using sigwait(2)
et.al.
static binary: otool). Dynamic binaires have a pointer to the Mach-O
header on the top of the stack, static binaries don't have this, and
having it produced a crash.
One bugfix: the EXEC_MACHO code assumes that entry = NULL means that
the entry point has not been found in the load commands seen so far.
Therefore we need to initialized entry to NULL if we want a static binary
to discover it. (dynamic binaries were forced to iscover it because when
the intepreter load command is found, entry is updated whatever its
value was before).
One hack: Both COMPAT_MACH and COMPAT_DARWIN are willing to run Mach-O
binaries. COMPAT_MACH fails for dynamic binaries because it cannot find
the interpreter in /emul/mach. For static binaires, it will accept them
(and for Darwin static binaries, this will cause a failure). Until we
rite a test for matchinf Darwin static binaries, just swap the order of
COMPAT_MACH and COMPAT_DARWIN in the exec switch so that COMPAT_DARWIN
is tried first (this will have the advantage of speeding up program
startup). EXECSW_PRIO_{FIRST_LAST} does not seem to work...
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.
From FreeBSD with slight modifications.
Approved by: Frank van der Linden <fvdl@netbsd.org>
mv MNT_GONE, MNT_UNMOUNT and MNT_WANTRDWR to this field
additonally add mnt_writeopcountupper and mnt_writeopcountlower fields
in preparation for pending write suspension support work
bump kernel version to 1.6ZD
<sys/bootblock.h>:
* Added definitions for the Master Boot Record (MBR) used by
a variety of systems (primarily i386), including the format
of the BIOS Parameter Block (BPB).
This information was cribbed from a variety of sources
including <sys/disklabel_mbr.h> which this is a superset of.
As part of this, some data structure elements and #defines
were renamed to be more "namespace friendly" and consistent
with other bootblocks and MBR documentation.
Update all uses of the old names to the new names.
<sys/disklabel_mbr.h>:
* Deprecated in favor of <sys/bootblock.h> (the latter is more
"host tool" friendly).
amd64 & i386:
* Renamed /usr/mdec/bootxx_dosfs to /usr/mdec/bootxx_msdos, to
be consistent with the naming convention of the msdosfs tools.
* Removed /usr/mdec/bootxx_ufs, as it's equivalent to bootxx_ffsv1
and it's confusing to have two functionally equivalent bootblocks,
especially given that "ufs" has multiple meanings (it could be
a synonym for "ffs", or the group of ffs/lfs/ext2fs file systems).
* Rework pbr.S (the first sector of bootxx_*):
+ Ensure that BPB (bytes 11..89) and the partition table
(bytes 446..509) do not contain code.
+ Add support for booting from FAT partitions if BOOT_FROM_FAT
is defined. (Only set for bootxx_msdos).
+ Remove "dummy" partition 3; if people want to installboot(8)
these to the start of the disk they can use fdisk(8) to
create a real MBR partition table...
+ Compile with TERSE_ERROR so it fits because of the above.
Whilst this is less user friendly, I feel it's important
to have a valid partition table and BPB in the MBR/PBR.
* Renamed /usr/mdec/biosboot to /usr/mdec/boot, to be consistent
with other platforms.
* Enable SUPPORT_DOSFS in /usr/mdec/boot (stage2), so that
we can boot off FAT partitions.
* Crank version of /usr/mdec/boot to 3.1, and fix some of the other
entries in the version file.
installboot(8) (i386):
* Read the existing MBR of the filesystem and retain the BIOS
Parameter Block (BPB) in bytes 11..89 and the MBR partition
table in bytes 446..509. (Previously installboot(8) would
trash those two sections of the MBR.)
mbrlabel(8):
* Use sys/lib/libkern/xlat_mbr_fstype.c instead of homegrown code
to map the MBR partition type to the NetBSD disklabel type.
Test built "make release" for i386, and new bootblocks verified to work
(even off FAT!).
Right now the only flag is used to indicate if a ksiginfo_t is a
result of a trap. Add a predicate macro to test for this flag.
* Add initialization macros for ksiginfo_t's.
* Add accssor macro for ksi_trap. Expands to 0 if the ksiginfo_t was
not the result of a trap. This matches the sigcontext trapcode semantics.
* In kpsendsig(), use KSI_TRAP_P() to select the lwp that gets the signal.
Inspired by Matthias Drochner's fix to kpsendsig(), but correctly handles
the case of non-trap-generated signals that have a > 0 si_code.
This patch fixes a signal delivery problem with threaded programs noted by
Matthias Drochner on tech-kern.
As discussed on tech-kern. Reviewed and OK's by Christos.
* introduce fsetown(), fgetown(), fownsignal() - this sets/retrieves/signals
the owner of descriptor, according to appropriate sematics
of TIOCSPGRP/FIOSETOWN/SIOCSPGRP/TIOCGPGRP/FIOGETOWN/SIOCGPGRP ioctl; use
these routines instead of custom code where appropriate
* make every place handling TIOCSPGRP/TIOCGPGRP handle also FIOSETOWN/FIOGETOWN
properly, and remove the translation of FIO[SG]OWN to TIOC[SG]PGRP
in sys_ioctl() & sys_fcntl()
* also remove the socket-specific hack in sys_ioctl()/sys_fcntl() and
pass the ioctls down to soo_ioctl() as any other ioctl
change discussed on tech-kern@
during a read(v)/write(v). If that happeded, we either passed a NULL
pointer or a pointer to something uninitialized as iov to ktrace,
or we got a memory leak. In read/write (w/o -v) we zero-initialize
the iov now to limit damage, in the -v calls the (dynamically allocated)
pointer is checked after the I/O.
- prevent BLOCKED upcalls on double page faults and during upcalls
- make libpthread handle blocked threads which hold locks
- prevent UNBLOCKED upcalls from overtaking their BLOCKED upcall
this adds a new syscall sa_unblockyield
see also http://mail-index.netbsd.org/tech-kern/2003/09/15/0020.html
- add simple lock for the list
- make get function remove the item from the list
- eliminate superfluous functions
thanks to enami and matt for the feedback.
the information there.
TODO:
1. since timer stuff gets called from an interrupt context, we could
preallocate ksiginfo_t's from the pool, so we don't need a kmem
pool.
2. probably the sa signal delivery syscall can be changed to take
a ksiginfo_t so we can use only one pool.
3. maybe when we add realtime signal support, add a resource limit
on the number of ksiginfo_t's a process can allocate.
and exit(): we have to lookup the entry in the private copy
again, otherwise the wrong list is manipulated.
should fix a panic on postgres shutdown reported by Marc Recht
being here, improve sone debug messages
shminfo.shmseg, in view of the fact that only few processes utilize a
significant fraction of it:
-turn the table into a linked list, elements allocated from a pool(9)
-On fork(), just bump a refcount instead of copying the list; it will
be decremented on exit() and exec(). Only copy if an attach or detach
takes place in between, which is rarely the case.
into compiled LKMs. Check this information when LKM is loaded into kernel
and refuse LKMs not matching currently running kernel. Provide LMFORCE
ioctl to skip this check for those feeling adventurous.
as discussed on tech-kern@, thanks to feedback from Bill Studenmund and others
every field; some need to stay around.
Fixes a bug where by calling shutdown() on a socket with knotes
will cause the kernel to panic when the kernel closes the socket.
Other access, such as calling kevent() may also trigger the panic.
Debugged with help from Jason and Allen. Patch reviewed by same plus
Itojun and Matt Thomas.
This problem seems to be the same one that FreeBSD saw in their PR
number 54331.
Kernel version _not_ bumped as we will piggyback the bump earlier today.
and the subsequent namei(): inform the kernel portion of
valid filenames and then disallow symlink lookups for
those filenames by means of a hook in namei().
with suggestions from provos@
also, add (currently unused) seqnr field to struct
systrace_replace, from provos@
and make the stack and heap non-executable by default. the changes
fall into two basic catagories:
- pmap and trap-handler changes. these are all MD:
= alpha: we already track per-page execute permission with the (software)
PG_EXEC bit, so just have the trap handler pay attention to it.
= i386: use a new GDT segment for %cs for processes that have no
executable mappings above a certain threshold (currently the
bottom of the stack). track per-page execute permission with
the last unused PTE bit.
= powerpc/ibm4xx: just use the hardware exec bit.
= powerpc/oea: we already track per-page exec bits, but the hardware only
implements non-exec mappings at the segment level. so track the
number of executable mappings in each segment and turn on the no-exec
segment bit iff the count is 0. adjust the trap handler to deal.
= sparc (sun4m): fix our use of the hardware protection bits.
fix the trap handler to recognize text faults.
= sparc64: split the existing unified TSB into data and instruction TSBs,
and only load TTEs into the appropriate TSB(s) for the permissions.
fix the trap handler to check for execute permission.
= not yet implemented: amd64, hppa, sh5
- changes in all the emulations that put a signal trampoline on the stack.
instead, we now put the trampoline into a uvm_aobj and map that into
the process separately.
originally from openbsd, adapted for netbsd by me.
- Write label to all netbsd (type 169) mbr partitions (even if they don't
already have a label).
- Update any label found in sector LABELSECTOR and sector 0.
Latter change makes DIOCWDINFO (etc) work on raidframe (fixing bin/22529).
The patch below (hopefully) improves some signaling problems
found by Nathan.
It also contains some cleanup of the sa_upcall_userret() function
removing any sleep calls using PCATCH.
Unblocked threads now only use an upcall stack after they
acquire the virtual CPU.
This prevents unblocked threads from stealing all available
upcall stacks.
Tested by Nick Hudson.
If FAST_IPSEC is configured, attach fast-ipsec transforms after
autoconfiguring devices (perhaps including crypto hardware)
but before starting up network-device packet input.
pseudo-device to init_main(), so the framework is ready for
registration requests at autoconfiguration time.
Thanks to Quentin Garnier for confirming the change was required, and
for testing a similar fix.
Lock setting/clearing of tp->t_oproc to guarantee concurrent opens can't
both suceed and that code in tty.c can't get a NULL t_oproc if the value
is re-read after being checked.
There are still MP issues with pt_flags, pt_send and pt_unctl.
Maybe problems that require TTY_LOCK() to be taken before calling std tty
functions.
to make users of the callout facility able to cooperate to work around the
race caused by the callout code lowering interrupt priority level when
invoking callout handlers, something which allows other code to run before
the callout handler gets to it's spl*() call.
This is to enable the workaround for the TCP code found in PR#20390 to be
applied.
This should be backed out once a more comprehensive fix can be put in
place.
(enabled by defining __HAVE_BIGENDIAN_BITOPS in <machine/types.h>). The
default is still LSB ordering. This change will allow the powerpc MD
implementations of setrunqueue/remrunqueue to be nuked.
used in an ad hoc way by a couple of eval board ports, so might as well
tidy it up a little and add some formality. (And, yes, I need to use it
in another eval board port.)
* Remove the "lwp *" argument that was added to vget(). Turns out
that nothing actually used it!
* Remove the "lwp *" arguments that were added to VFS_ROOT(), VFS_VGET(),
and VFS_FHTOVP(); all they did was pass it to vget() (which, as noted
above, didn't use it).
* Remove all of the "lwp *" arguments to internal functions that were added
just to appease the above.
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.
Bump the kernel rev up to 1.6V
whether there is anything to do - almost as if it were a predicate
test outside of a condition wait. This prevents returning to userland
when tsleep() has woken up spuriously, such as from a signal that was
caught and then removed by a tracing process.
Kills off some double-stops in GDB due to signals as well as a couple
of pthread__idle assertions when detaching from a process.
XXX stopping inside tsleep, via CURSIG(), is evil.
is set to 0, the function will always return 0 (no packets/events
are permitted)." Before this patch, ppsratecheck returned 1 once
a second when maxpps was 0.
1. sa_len was not properly checked.
2. sa_family was not properly checked [even used as an array index!]
3. we only know about inet4 and inet6, so make sure that the corresponding
data is valid before using it.
4. keep reference counts of addresses used (is that necessary?)
out the writing of an lwp's registers to a separate function. XXX Although
not really the correct way to do this, make the thread that caused the
coredump has it's register set written first so GDB is happy. (this is a
bridge until TRT is done).