don't forget to set m_len to 0. Otherwise whatever will compute the size
of this chain (including s_split() itself if called again on this chain)
will get it wrong, leading to various issues.
Bug exposed by the NFS server code with linux clients using TCP mounts.
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.
Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.
thr0 accept(fd, ...)
thr1 close(fd)
provided is "too large" (log10(2^64) = 19).
(It can still overflow if the input value is close to 2^64 but I don't
consider this a problem.)
fixes nonsense displayed as "total memory" on boot
Call the detach routine for every device in the device tree, starting
with the leaves and moving toward the root, expecting that each
(pseudo-)device driver will use the opportunity to gracefully commit
outstandings transactions to the underlying (pseudo-)device and to
relinquish control of the hardware to the system BIOS.
Detaching devices is not suitable for every shutdown: in an emergency,
or if the system state is inconsistent, we should resort to a fast,
simple shutdown that uses only the pmf(9) shutdown hooks and the
(deprecated) shutdownhooks. For now, if the flag RB_NOSYNC is set in
boothowto, opt for the fast, simple shutdown.
Add a device flag, DVF_DETACH_SHUTDOWN, that indicates by its presence
that it is safe to detach a device during shutdown. Introduce macros
CFATTACH_DECL3() and CFATTACH_DECL3_NEW() for creating autoconf
attachments with default device flags. Add DVF_DETACH_SHUTDOWN
to configuration attachments for atabus(4), atw(4) at cardbus(4),
cardbus(4), cardslot(4), com(4) at isa(4), elanpar(4), elanpex(4),
elansc(4), gpio(4), npx(4) at isa(4), nsphyter(4), pci(4), pcib(4),
pcmcia(4), ppb(4), sip(4), wd(4), and wdc(4) at isa(4).
Add a device-detachment "reason" flag, DETACH_SHUTDOWN, that tells the
autoconf code and a device driver that the reason for detachment is
system shutdown.
Add a sysctl, kern.detachall, that tells the system to try to detach
every device at shutdown, regardless of any device's DVF_DETACH_SHUTDOWN
flag. The default for kern.detachall is 0. SET IT TO 1, PLEASE, TO
HELP TEST AND DEBUG DEVICE DETACHMENT AT SHUTDOWN.
This is a work in progress. In future work, I aim to treat
pseudo-devices more thoroughly, and to gracefully tear down a stack of
(pseudo-)disk drivers and filesystems, including cgd(4), vnd(4), and
raid(4) instances at shutdown.
Also commit some changes that are not easily untangled from the rest:
(1) begin to simplify device_t locking: rename struct pmf_private to
device_lock, and incorporate device_lock into struct device.
(2) #include <sys/device.h> in sys/pmf.h in order to get some
definitions that it needs. Stop unnecessarily #including <sys/device.h>
in sys/arch/x86/include/pic.h to keep the amd64, xen, and i386 releases
building.
This will be used to support TLS. The MD method must match the ELF TLS spec
for that CPU architecture (if there is a spec).
At this time it is only implemented for i386, where it means setting the
per-thread base address for %gs. Please implement this for your platform!
signals (i.e. SA_KILL), just if SIGKILL (or SIGCONT). Improve comments.
Make some functions static, remove unused sigrealloc() prototype.
Fixes PR/39814. Similar patch reviewed by <ad>.
address space available to processes. this limit exists in most other
modern unix variants, and like most of them, our defaults are unlimited.
remove the old mmap / rlimit.datasize hack.
- adds the VMCMD_STACK flag to all the stack-creation vmcmd callers.
it is currently unused, but was added a few years ago.
- add a pair of new process size values to kinfo_proc2{}. one is the
total size of the process memory map, and the other is the total size
adjusted for unused stack space (since most processes have a lot of
this...)
- patch sh, and csh to notice RLIMIT_AS. (in some cases, the alias
RLIMIT_VMEM was already present and used if availble.)
- patch ps, top and systat to notice the new k_vm_vsize member of
kinfo_proc2{}.
- update irix, svr4, svr4_32, linux and osf1 emulations to support
this information. (freebsd could be done, but that it's best left
as part of the full-update of compat/freebsd.)
this addresses PR 7897. it also gives correct memory usage values,
which have never been entirely correct (since mmap), and have been
very incorrect since jemalloc() was enabled.
tested on i386 and sparc64, build tested on several other platforms.
thanks to many folks for feedback and testing but most espcially
chuq and yamt for critical suggestions that lead to this patch not
having a special ugliness i wasn't happy with anyway :-)
There are also sigtimedwait(2) et al. to catch signals without invoking
a signal handler. Fixes PR kern/41076 by Matteo Beccati (the first
test case, where the signal is sent before sigwaitinfo(2) gets called).
Fix numerous problems:
1. LDT updates are not atomic.
2. Number of processes running with private LDTs and/or I/O bitmaps
is not capped. System with high maxprocs can be paniced.
3. LDTR can be leaked over context switch.
4. GDT slot allocations can race, giving the same LDT slot to two procs.
5. Incomplete interrupt/trap frames can be stacked.
6. In some rare cases segment faults are not handled correctly.
There are still about 1600 left, but they have ',' or /* ... */
in the actual variable definitions - which my awk script doesn't handle.
There are also many that need () -> (void).
(The script does handle misordered arguments.)
via SCM_RIGHTS messages are dealt with:
1. unp_gc: make this a kthread.
2. unp_detach: go not call unp_gc directly. instead, wake up unp_gc kthread.
3. unp_scan: do not close files here. instead, put them on a global list
for unp_gc to close, along with a per-file "deferred close count". if
file is already enqueued for close, just increment deferred close count.
this eliminates the recursive calls.
3. unp_gc: scan files on global deferred close list. close each file N
times, as specified by deferred close count in file. continue processing
list until it becomes empty (closing may cause additional files to be
queued for close).
4. unp_gc: add additional bit to mark files we are scanning. set during
initial scan of global file list that currently clears FMARK/FDEFER.
during later scans, never examine / garbage collect descriptors that
we have not marked during the earlier scan. do not proceed with this
initial scan until all deferred closes have been processed. be careful
with locking to ensure no races are introduced between deferred close
and file scan.
5. unp_gc: use dummy file_t to mark position in list when scanning. allow
us to drop filelist_lock. in turn allows us to eliminate kmem_alloc()
and safely close files, etc.
6. prohibit transfer of descriptors within SCM_RIGHTS messages if
(num_files_in_transit > maxfiles / unp_rights_ratio)
7. fd_allocfile: ensure recycled filse don't get scanned.
this is 97% work done by andrew doran, with a couple of minor bug fixes
and a lot of testing by yours truly.
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep
Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.
PR kern/40361 WAPBL locking panic in -current
PR kern/40361 WAPBL locking panic in -current
PR kern/40470 WAPBL corrupts ext2fs
PR kern/40562 busy loop in ffs_sync when unmounting a file system
PR kern/40525 panic: ffs_valloc: dup alloc
- A fix for an issue that can lead to "ffs_valloc: dup" due to dirty cg
buffers being invalidated. Problem discovered and patch by dholland@.
- If the syncer fails to lazily sync a vnode due to lock contention,
retry 1 second later instead of 30 seconds later.
- Flush inode atime updates every ~10 seconds (this makes most sense with
logging). Presently they didn't hit the disk for read-only files or
devices until the file system was unmounted. It would be better to trickle
the updates out but that would require more extensive changes.
- Fix issues with file system corruption, busy looping and other nasty
problems when logging and non-logging file systems are intermixed,
with one being the root file system.
- For logging, do not flush metadata on an inode-at-a-time basis if the sync
has been requested by ioflush. Previously, we could try hundreds of log
sync operations a second due to inode update activity, causing the syncer
to fall behind and metadata updates to be serialized across the entire
file system. Instead, burst out metadata and log flushes at a minimum
interval of every 10 seconds on an active file system (happens more often
if the log becomes full). Note this does not change the operation of
fsync() etc.
- With the flush issue fixed, re-enable concurrent metadata updates in
vfs_wapbl.c.
backend, perform all calls through a syscall table. This makes it
possible to make system calls to non-local rump kernels.
(requires a bit support code. it's written but quite messy currently)
- reimplement vmem sanity checks with less code duplication.
- reimplement ddb vmem-related commands in a more consistent ways.
remove automatic whatis.
it caused the return from the enclosing function to break, as well as the
ssp return on i386. To fix both issues, split configure in two pieces
the one before calling ssp_init and the one after, and move the ssp_init()
call back in main. Put ssp_init() in its own file, and compile this new file
with -fno-stack-protector. Tested on amd64.
XXX: If we want to have ssp kernels working on 5.0, this change needs to
be pulled up.
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clase of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.
those struct members but there is no reason to rely on that.
While here, I rewrite the loop using an usual idiom. It shaves
both source and object code.
- vfs_syscalls.c rev. 1.342 fails to invert condition correcly when
then-clause and else-clause is swapped. Since then, revoke(2) fails
if it is issued by file owner.
- Probably since rev. 1.160 of genfs_vnops.c, revoke(2) fails if it is
applied to non-device file and drops kernel into ddb.
specs_open routine. If devsw_open fail, get driver name with devsw_getname
routine and autoload module.
For now only dm drivervcan be loaded, other pseudo drivers needs more work.
Ok by ad@.
- Cache kva.
- Convert to use mutex_obj_alloc().
- Make better use of pool_cache.
Also:
Disable direct transfers for the moment. I believe there may be a bug that
can cause transfers to stall when switching between direct/buffered access.
I think this has most recently been run into on 'denver' but I have seen it
as far back as 3.1.
(As an aside, direct is a not a clear win on modern systems with large cache
and high TLB invalidation overhead. Particularly so on MP systems, although
micro benchmarks may report otherwise because they typically do not tax the
system. Anyone want to write a decent benchmark?)
quick-running or non-threaded rump jobs, where the rehash algorithm
does not have a chance to run. For other cases it doesn't make
much difference, since the size will grow or decrease when the
rehash algorithm runs for the first time (t=10*hz currently).
type/status/etc inquiries. (PR kern/37915)
This is clearly a design problem in tty, but we need a cheap fix now.
The problem is that ttyinput() tries to pull a spinlock which
is already held on calls to t_oproc.
The workaround is based on the fact that within wscons code, the
wsdisplay_emulinput() function is only called directly from
wsdisplaystart(). So we can be sure that the tty lock is held,
and use an inofficial entry point in ttc.c which avoids the locking.
These ate certainly more assumptions than needed by the fix
proposed in the PR, but it doesn't affect (and slow down) other
tty drivers.
devmajor_t/devminor_t, as proposed on tech-kern.
This avoids 64-bit arithmetics and 64-bit printf formats in parts
of the kernel where it is not really useful, and helps clarity.
to parse and generate the compat name and basename (e.g. __stat50
and stat). Use this to autogenerate __RENAME()'s to the rump_syscalls
header so that they can be called e.g. rump_sys_socket() instead
of rump_sys___socket30().
disk_read_sector() wants DEV_BLKSIZE blkno's BUT sectorsize unit lengths
specified... how `logical'.
Real fixup pending on discussion on tech-kern/source-changes.
magic libc symbol. This also allows to bid farewell to subr_prf2.c
and merge the contents back to subr_prf.c. The host kernel bridging
is now done via rumpuser_putchar().
partitions on optical media like CD/DVD/BD but also on all other media if
there is no NetBSD disklabel or MBR label.
Also fix cd's readdisklabel arguments so the ioctl's arrive at the right
device (!) and update its default label to make more sense.
- It doesn't work and a dead system that can't be reset from the console is
worse than a system that has painced and rebooted. If you can make it work
reliably please do so.
- If the system is paniced there is every reason to suspect VM structures
and the contents of the buffer cache.
while ironically trying to preserve the same during copy. Would only have
occurred if a multithreaded program expanded the descriptor table and,
within a tiny window of exposure, another thread in the program tried to
access descriptor zero.
- Convert to use kmem_alloc/kmem_free.
in the root of the tree being modified, rather than in the system default
tree. This permits module compat_netbsd32 to initialize its shadow tree
at load time.
Discussed on tech-kern, with no objections.
Addresses my PR kern/40167
embedding the address of its xxx_mountroot() in swapnetbsd.c. This
permits booting of kernels with hard-wired filesystem type even if the
filesystem is in a loadable module (ie, not linked into the kernel
image).
Discussed on current-users. Tested on amd64 and i386 with both hard-
wired and '?' filesystem times, and on both modular and monolithic
kernels.
Thanks to pooka@ for code review and suggestions.
Addresses my PR kern/40167
somewhere in the system. If it is, wait for it to complete before tearing
it down. The caller commits to not trigger the interrupt again once
disestablish is set in motion.
- Output a .bss section and make all the symbols relative to it, instead
of making them absolute.
- Output a single load section, no need for two.
'gdb /dev/ksyms' still doesn't work because ksyms doesn't do mmap yet.
phases, so move the initialization of the ksyms mutex back into main via
a function called ksyms_init. Rename the existing (but quite different)
ksyms_init* variations into ksyms_addsyms_elf() and ksyms_addsyms_explicit()
and adapt machdep code accordingly.
security.curtain=1
If the kauth call failed, we'd silently continue the loop, but the error
code would remain and eventually "leak" to userspace. Reset the error to
zero when continuing.
Tested by snj@ and myself. Okay snj@.
into modules. By and large this commit:
- shuffles header files and ifdefs
- splits code out where necessary to be modular
- adds module glue for each of the components
- adds/replaces hooks for things that can be installed at runtime
own file, subr_exec_fd.c (they're used only by exec).
After this change, the kernel source modules are in a partitioned
enough state to allow building a system without vfs at all.
Their FINI routine may legitimately succeed even though the module is likely
to be used soon again, for example: exec_script. Add a MODULE_CMD_AUTOUNLOAD
to query whether a module wants to avoid autounload.
can unload requisite modules with only one pass.
- If loading a requisite module, scan the global queue before checking the
file system to see if it exists. If it's already present we don't care.
Merge wapbl_replay_get_inodes into wapbl_replay_prescan. Change the
logic to determine the head: It doesn't make sense to update it if the
last inode record seen was not the beginning of the journal, as the
beginning of the journal might not be 0, so always update inodeshead.
transactions. The initial prescan has already sorted out what blocks are
in the journal and removed any revoced blocks, so the hash table is
authorative.
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.
- mutex_enter() from ksyms_getval() could panic due to a change made
in revision 1.40. Fix it.
- Replace the p-tree with a binary search of global symbols. Saves about
250kB of wired memory on i386 and allows for faster lookups within
module symbol tables.
- Split 4.3BSD ifioctl stuff into its own file.
- Remove some ifdefs that include small fragments of vfs compat code
which are difficult to relocate elsewhere.
unused kernel modules.
- Try to unload any autoloaded kernel modules 10 seconds after their
load was successful.
- Keep a counter to track module load/unload events.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.
doshutdownhooks(9): shutdown hooks registered by shutdownhook_establish(9)
expect to be called with interrupts disabled, but shutdown hooks
registered with pmf_device_register1(9) expect to be called with
interrupts enabled. So I have made two changes:
1 Do not call pmf_system_shutdown() from doshutdownhooks(). Instead,
change every call to doshutdownhooks() to a call to doshutdownhooks()
followed by a call to pmf_system_shutdown(). No functional change
is intended by this change.
2 Make i386 re-enable interrupts briefly while it calls
pmf_system_shutdown(). I leave it to others either to fix the
other ports, or to factor out some MI shutdown code, as joerg@
suggests, and fix that. Note that a functional change *is* intended
by this change.
I hope that this patch will stop us from flip-flopping between
calling doshutdownhooks() and pmf_system_shutdown() sometimes with
and sometimes without interrupts enabled.
for now. This will prevent signals from waking them. Adjust
exit_lwps() to explicitly add LW_SINTR to all of them, so that
the process exit code can wake them up.
This is needed as threads in both of these wait channels die once
they are woken. So they aren't interruptable in the typical sense.
I am now able to suspend & resume firefox successfully now.
affinity (cpu_lock protects these operations now).
- Disallow setting of state of CPU to to offline, if there are bound LWPs,
which have no CPU to migrate.
- Disallow setting of affinity for the LWP(s), if all CPUs in the dynamic
CPU-set are offline.
- sched_setaffinity: fix invalid check of kcpuset_isset().
- Rename cpu_setonline() to cpu_setstate().
Should fix PR/39349.
1) Since we want to check for upcalls only once, take LW_SA_UPCALL
out of the while(l->l_flags & LW_USERRET) loop.
2) since the goal is to keep SA code out of userret() (and especially
all the emulations that include userret() but will never do SA),
ALWAYS set LW_SA_UPCALL when we set SAVP_FLAG_NOUPCALLS. Drop the
test for it in lwp_userret() since it will never be set bare.
3) Adapt sa_upcall_userret() to clear LW_SA_UPCALL if it's no longer
needed. If we have gained upcalls since sa_yield(), we will deliver
them next time around.
Tested by skrll at.
This may need more work to prevent warning messages during
"make cleandir" when the commands in "!=" assignments are executed
even though tools may not have been built.
set the RB_ASKNAME flag and prompt users for the init path, rather than
panicking with "no init".
- when prompting for the init path, support the special strings
"halt", "reboot", and "ddb", as well as a prompt for the root device.
Dissussed and no objection on tech-kern. Changes summary by apb@.
- Make ksyms MT safe.
- Fix deadlock from an operation like "modload foo.lkm < /dev/ksyms".
- Fix uninitialized structure members.
- Reduce memory footprint for loaded modules.
- Export ksyms structures for kernel grovellers like savecore.
- Some KNF.
first lwp in the list is the last created and in the firefox and gtk-gnash
case this is usually a zombie, so the status in ps was ZLl. This now picks
the lwp in order ONPROC > RUN > SLEEP > STOP > SUSPENDED > IDL > DEAD > ZOMB
and breaks ties using cpticks.
walk the list, we're looking for a vp to do something with. We do
this in the signal code and in the timer code. The signal code already
runs with proc::p_lock held, so it's a very natural lock to use. The
timer code, however, calls into the sa timer code with a spinlock held.
Since proc::p_lock is an adaptable mutex, we can sleep to get it. Sleeping
with a spinlock is BAD. So proc::p_lock is _not_ the right lock there,
and something like sadata::sa_mutex would be best.
Address this difficulty by noting that both uses actually just read
the list. Changing the list of VPs is rare - once one's added, it stays
until the process ends. So make the locking protocol that to write the
list you have to hold both proc::p_lock and sadata::sa_mutex (taken
in that order). Thus holding either one individually grants read access.
This removes a case where we could sleep with timer_lock, a spinlock at
IPL_SCHED (!!), while trying to get p_lock. If that ever happened, we'd
pretty much be dead. So don't do that!
This fixes a merge botch from how I handled our gaining p_lock - p_lock
should not have simply replaced p_smutex.
While here, tweak the sa_unblock_userret() code for the case
when the blessed vp is actually running (on another CPU). Make its
resched RESCHED_IMMED so we whack the CPU. Addresses a hang I've
observed in starting firefox on occasion when I see one thread running
in userland and another thread sitting in lwpublk, which means it's on
the list of threads for which we need an unblocked upcall. This list is
one on which things should NOT linger.
- Remove remaining #ifdef INET.
- Avoid holding locks so we don't need to do KM_NOSLEEP allocations.
- Use a rwlock to protect the accept filter list.
- Make it safe to unload accept filter modules.
- Minor KNF.
mail: http://mail-index.netbsd.org/source-changes/2008/10/10/msg211109.html
* Scary-looking socket locking stubs (changed to KASSERT of locked)
* depends on INET inappropriately (though now you must add new
accept filter names to the uipc_accf.c line in conf/files if
you aren't using dataready or httpready)
* New code uses MALLOC/FREE -- changed to kmem_alloc/kmem_free;
could be pool_cache, these are all fixed-size allocations.
We need to verify that this works as expected with protocols with per-socket
locking, like PF_LOCAL. I'm a little concerned about the case where the
lock on the listen socket isn't the same lock as on the eventual connected
socket.
Typecasting quad_t * to long * and using atomic_add_long can't
possibly be expected to work!
Another fine error caught by the gcc type-punning warning. That
really really should be on by default in the kernel.
static simplelock initializer (and simplelock too). The fastpath
is still lockless, so doesn't make a difference in terms of
performance.
Also fixes a hanging bug if the once routine returned an error.
It does not retry after an error occurs, as I can't really imagine
fruitful semantics for that.
- Change minimal time-quantum to ~20 ms.
- Thus remove unneeded pool in M2, and unused sched_lwp_exit().
- Do not increase l_slptime twice for SCHED_4BSD (regression fix).
UDF. Future users could be msdosfs, ufs, nilfs2 (when ready), cd9660 etc.
Note that its not the same as UFS's DIRHASH support; UFS would need a good
cleanup/splitout of directory operations to adopt to this new directory
hashing support since most directory operations are interweaved with the
vnops itself. This is a TODO.
Let's look for threads in the TARGET process, not in the
debugger process (gdb). Noticed when a KASSERT fired while
running gdb on a threaded app.
I will adjust wrstuden-revivesa-base-3 to include this change.
free a buffer. This will fail if the buffer owner has a chance to
modify the BC_DONTFREE flag before putphysbuf() examines it.
Fix by removing get/putphysbuf() and BC_DONTFREE. Physio_done() now
has an explicit test for a buffer coming from the call of physio().
Observed by Lars Nordlund when writing a DVD with growisofs, see PR kern/39536.
Reviewed by: Jason Thorpe <thorpej@netbsd.org>
fssvar.h: struct device * -> device_t.
fss.c: establish unmount hook on first attach, remove on last detach.
vfs_syscalls.c: remove the call of fss_umount_hook().
vfs_trans.c: destroy cow handlers on unmount as fstrans_unmount() will be
called before vfs_hooks.
- make sure that sigput always initializes ksi
- initialize si_code with SI_NOINFO instead of lying (SI_USER)
- if a process is being ktraced, make siginfo available
- always pass the available siginfo to ktrace, even if it has SI_NOINFO
without releasing locks.
* Corrected a panic caused by veriexec_file_verify() not setting the
returned struct veriexec_file_entry **vfep in all cases.
Thanks to Stathis Kamperis for finding the issues and testing the fixes.
data which can be read, as expected. Before, the call fell through a
"case" statement and was forwarded to the slave side, returning the
data which can be read by the slave.
The new behaviour also matches Linux and OSF/1.
in pmf_deregister, don't constantly realloc. just shift everything closer
to the front. and then if empty, free. When adding, add space for 4 more
entries.
Instead of n * sizeof(type) use C99 sizeof(type [n]).