filesystem is mounted. Synchronize access to the number with a
mutex. When a struct mount, mp, is allocated, assign the current
generation number to mp->mnt_gen. Introduce vfs_unmount_forceone()
that forcefully unmounts the most recently mounted filesystem.
Refactor: extract vfs_shutdown1() from vfs_shutdown(). Extract
vfs_sync_all() from vfs_shutdown1().
Print more progress indications while we're unmounting all of the
filesystems during shutdown.
We increase the reference count on mp before calling dounmount(mp),
but we do not decrease it if dounmount(mp) fails, and neither does
dounmount(mp). So decrease the reference count if dounmount(mp)
fails.
Change the loop terminating condition in vfs_unmountall1() to (mp
!= (void *)&mountlist) from !CIRCLEQ_EMPTY(&mountlist), because we
may not ever empty the list, especially if we're not forcing the
filesystems to unmount.
the other routines of the same spirit.
Adjust file-system code to use it.
Keep vaccess() for KPI compatibility and to keep element of least
surprise. A "diagnostic" message warning that vaccess() is deprecated will
be printed when it's used (obviously, only in DIAGNOSTIC kernels).
No objections on tech-kern@:
http://mail-index.netbsd.org/tech-kern/2009/06/21/msg005310.html
by making the pps count time stamp and the update
time stamp u_int64.
The time delta between two PPS events can now
be correctly calculated avoiding any unaccounted
for wraps with 32-bit counters.
When I originally wrote this, I was going for maximum flexibility.
However, after a private discussion with dholland@, I see how this
will cause problems with the future world order of namei whenever
that might be. At the moment, I don't need the extra flexibility,
but if something comes up this may have to be revisited.
mq_send() should fail due to permissions. Noted by Stathis Kamperis!
- Check for empty message queue name (POSIX does not allow this for regular
files, and it's weird), check for DTYPE_MQUEUE, fix permission check in
mq_unlink(), clean up.
these commits move all path handling into module_do_load() from
kobj_load_file(). This way the final path used to load a module
is available for loading <module>.plist, which will store parameters
for a module. The end goal of this project is good support for
MODULAR device drivers.
- Avoid atomics in more places.
- Remove the per-descriptor mutex, and just use filedesc_t::fd_lock.
It was only being used to synchronize close, and in any case we needed
to take fd_lock to free the descriptor slot.
- Optimize certain paths for the <NDFDFILE case.
- Sprinkle more comments and assertions.
- Cache more stuff in filedesc_t.
- Fix numerous minor bugs spotted along the way.
- Restructure how the open files array is maintained, for clarity and so
that we can eliminate the membar_consumer() call in fd_getfile(). This is
mostly syntactic sugar; the main functional change is that fd_nfiles now
lives alongside the open file array.
Some measurements with libmicro:
- simple file syscalls are like close() are between 1 to 10% faster.
- some nice improvements, e.g. poll(1000) which is ~50% faster.
- Fix a preemption bug in CURCPU_IDLE_P() that can lead to a bogus
assertion failure on DEBUG kernels.
- Fix MP/preemption races with timecounter detachment.
- Fix a preemption bug in CURCPU_IDLE_P() that can lead to a bogus
assertion failure on DEBUG kernels.
- Fix MP/preemption races with timecounter detachment.
this fixes the following deadlock.
a thread doing getcleanvnode:
pick a vnode
acqure v_interlock
v_usecount++
call vclean
now, another thread doing cache_lookup:
picks the vnode
vtryget succeed
vn_lock succeed
now in vclean:
set VI_XLOCK (too late to be noticed by the competing thread)
wait on the vnode lock (this might violate locking order)
the use of a flag bit was suggested by Andrew Doran. PR/41374.
Without it, on ports where splhigh() is inline, the compiler will optimise
the second SOFTINT_PENDING test in softint_schedule(). A dissasembly
of softint_schedule() with and without the volatile sh_flags confirm this
on sparc.
Because of this there is a race that could lead to the softhand_t
being enqueued twice on si_q, leading to a corrupted queue and
some handler being SOFTINT_PENDING but never called.
Should fix PR kern/38637
way they can be included without having to include DDB.
(arguably all print routines should be behind #ifdef DEBUGPRINT
and options DDB should define that macro, but I'll tackle that later)
a new struct mount-allocation routine, vfs_mountalloc(9). Documentation
updates will follow.
Attention: Synchronization Oversight Committee! In mount_domount(),
I postpone the call mutex_enter(&mp->mnt_updating) until right before
the VFS_MOUNT(9) call because (1) that looks to me like the earliest
possible opportunity for mp to become visible to any other LWP, because
it was just kmem_zalloc(9)'d and (2) it made extracting the common code
much easier. Tell me if my reasoning is faulty.
proc_enterpgrp() with proc_leavepgrp() to free process group and/or
session without proc_lock held.
- Rename SESSHOLD() and SESSRELE() to to proc_sesshold() and
proc_sessrele(). The later releases proc_lock now.
Quick OK by <ad>.
sometimes do not serve as memory barriers, allowing memory references to
bleed outside of critical sections. It's possible that this is the
reason for pkgbuild's longstanding crashiness.
For rwlocks, always enable the explicit membars. They were disabled only
on x86, and since they are not in the fast-path it's not a big deal.
TODO: convert these to an atomic_membar_foo() or similar that does ordering
between regular data references and atomic references.
- Add interrupt shielding (direct hardware interrupts away from the
specified CPUs). Not documented just yet but will be soon.
- Redo /dev/cpu time_t compat so no kernel changes are needed.
x86:
- Make intr_establish, intr_disestablish safe to use when !cold.
- Distribute hardware interrupts among the CPUs, instead of directing
everything to the boot CPU.
- Add MD code for interrupt sheilding. This works in most cases but there is
a bug where delivery is not accepted by an LAPIC after redistribution. It
also needs re-balancing to make things fair after interrupts are turned
back on for a CPU.
over and over to detach all of the devices. Stop when we cannot detach
even a single device in a cycle. Call shutdown hooks on all of the
devices that remain attached.
This is another step toward the detach/unmount cycle that will help us
tear down arbitrary stacks of filesystems, ccd(4), raid(4), and vnd(4).
filesystem at all, false otherwise. This will support tearing down
stacks of filesystems, ccd(4), raid(4), and vnd(4).
Change the misleading variable name 'allerror' to 'any_error'. Make it
a bool.
don't forget to set m_len to 0. Otherwise whatever will compute the size
of this chain (including s_split() itself if called again on this chain)
will get it wrong, leading to various issues.
Bug exposed by the NFS server code with linux clients using TCP mounts.
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.
Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.
thr0 accept(fd, ...)
thr1 close(fd)
provided is "too large" (log10(2^64) = 19).
(It can still overflow if the input value is close to 2^64 but I don't
consider this a problem.)
fixes nonsense displayed as "total memory" on boot
Call the detach routine for every device in the device tree, starting
with the leaves and moving toward the root, expecting that each
(pseudo-)device driver will use the opportunity to gracefully commit
outstandings transactions to the underlying (pseudo-)device and to
relinquish control of the hardware to the system BIOS.
Detaching devices is not suitable for every shutdown: in an emergency,
or if the system state is inconsistent, we should resort to a fast,
simple shutdown that uses only the pmf(9) shutdown hooks and the
(deprecated) shutdownhooks. For now, if the flag RB_NOSYNC is set in
boothowto, opt for the fast, simple shutdown.
Add a device flag, DVF_DETACH_SHUTDOWN, that indicates by its presence
that it is safe to detach a device during shutdown. Introduce macros
CFATTACH_DECL3() and CFATTACH_DECL3_NEW() for creating autoconf
attachments with default device flags. Add DVF_DETACH_SHUTDOWN
to configuration attachments for atabus(4), atw(4) at cardbus(4),
cardbus(4), cardslot(4), com(4) at isa(4), elanpar(4), elanpex(4),
elansc(4), gpio(4), npx(4) at isa(4), nsphyter(4), pci(4), pcib(4),
pcmcia(4), ppb(4), sip(4), wd(4), and wdc(4) at isa(4).
Add a device-detachment "reason" flag, DETACH_SHUTDOWN, that tells the
autoconf code and a device driver that the reason for detachment is
system shutdown.
Add a sysctl, kern.detachall, that tells the system to try to detach
every device at shutdown, regardless of any device's DVF_DETACH_SHUTDOWN
flag. The default for kern.detachall is 0. SET IT TO 1, PLEASE, TO
HELP TEST AND DEBUG DEVICE DETACHMENT AT SHUTDOWN.
This is a work in progress. In future work, I aim to treat
pseudo-devices more thoroughly, and to gracefully tear down a stack of
(pseudo-)disk drivers and filesystems, including cgd(4), vnd(4), and
raid(4) instances at shutdown.
Also commit some changes that are not easily untangled from the rest:
(1) begin to simplify device_t locking: rename struct pmf_private to
device_lock, and incorporate device_lock into struct device.
(2) #include <sys/device.h> in sys/pmf.h in order to get some
definitions that it needs. Stop unnecessarily #including <sys/device.h>
in sys/arch/x86/include/pic.h to keep the amd64, xen, and i386 releases
building.
This will be used to support TLS. The MD method must match the ELF TLS spec
for that CPU architecture (if there is a spec).
At this time it is only implemented for i386, where it means setting the
per-thread base address for %gs. Please implement this for your platform!
signals (i.e. SA_KILL), just if SIGKILL (or SIGCONT). Improve comments.
Make some functions static, remove unused sigrealloc() prototype.
Fixes PR/39814. Similar patch reviewed by <ad>.
address space available to processes. this limit exists in most other
modern unix variants, and like most of them, our defaults are unlimited.
remove the old mmap / rlimit.datasize hack.
- adds the VMCMD_STACK flag to all the stack-creation vmcmd callers.
it is currently unused, but was added a few years ago.
- add a pair of new process size values to kinfo_proc2{}. one is the
total size of the process memory map, and the other is the total size
adjusted for unused stack space (since most processes have a lot of
this...)
- patch sh, and csh to notice RLIMIT_AS. (in some cases, the alias
RLIMIT_VMEM was already present and used if availble.)
- patch ps, top and systat to notice the new k_vm_vsize member of
kinfo_proc2{}.
- update irix, svr4, svr4_32, linux and osf1 emulations to support
this information. (freebsd could be done, but that it's best left
as part of the full-update of compat/freebsd.)
this addresses PR 7897. it also gives correct memory usage values,
which have never been entirely correct (since mmap), and have been
very incorrect since jemalloc() was enabled.
tested on i386 and sparc64, build tested on several other platforms.
thanks to many folks for feedback and testing but most espcially
chuq and yamt for critical suggestions that lead to this patch not
having a special ugliness i wasn't happy with anyway :-)
There are also sigtimedwait(2) et al. to catch signals without invoking
a signal handler. Fixes PR kern/41076 by Matteo Beccati (the first
test case, where the signal is sent before sigwaitinfo(2) gets called).
Fix numerous problems:
1. LDT updates are not atomic.
2. Number of processes running with private LDTs and/or I/O bitmaps
is not capped. System with high maxprocs can be paniced.
3. LDTR can be leaked over context switch.
4. GDT slot allocations can race, giving the same LDT slot to two procs.
5. Incomplete interrupt/trap frames can be stacked.
6. In some rare cases segment faults are not handled correctly.
There are still about 1600 left, but they have ',' or /* ... */
in the actual variable definitions - which my awk script doesn't handle.
There are also many that need () -> (void).
(The script does handle misordered arguments.)
via SCM_RIGHTS messages are dealt with:
1. unp_gc: make this a kthread.
2. unp_detach: go not call unp_gc directly. instead, wake up unp_gc kthread.
3. unp_scan: do not close files here. instead, put them on a global list
for unp_gc to close, along with a per-file "deferred close count". if
file is already enqueued for close, just increment deferred close count.
this eliminates the recursive calls.
3. unp_gc: scan files on global deferred close list. close each file N
times, as specified by deferred close count in file. continue processing
list until it becomes empty (closing may cause additional files to be
queued for close).
4. unp_gc: add additional bit to mark files we are scanning. set during
initial scan of global file list that currently clears FMARK/FDEFER.
during later scans, never examine / garbage collect descriptors that
we have not marked during the earlier scan. do not proceed with this
initial scan until all deferred closes have been processed. be careful
with locking to ensure no races are introduced between deferred close
and file scan.
5. unp_gc: use dummy file_t to mark position in list when scanning. allow
us to drop filelist_lock. in turn allows us to eliminate kmem_alloc()
and safely close files, etc.
6. prohibit transfer of descriptors within SCM_RIGHTS messages if
(num_files_in_transit > maxfiles / unp_rights_ratio)
7. fd_allocfile: ensure recycled filse don't get scanned.
this is 97% work done by andrew doran, with a couple of minor bug fixes
and a lot of testing by yours truly.
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep
Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.
PR kern/40361 WAPBL locking panic in -current
PR kern/40361 WAPBL locking panic in -current
PR kern/40470 WAPBL corrupts ext2fs
PR kern/40562 busy loop in ffs_sync when unmounting a file system
PR kern/40525 panic: ffs_valloc: dup alloc
- A fix for an issue that can lead to "ffs_valloc: dup" due to dirty cg
buffers being invalidated. Problem discovered and patch by dholland@.
- If the syncer fails to lazily sync a vnode due to lock contention,
retry 1 second later instead of 30 seconds later.
- Flush inode atime updates every ~10 seconds (this makes most sense with
logging). Presently they didn't hit the disk for read-only files or
devices until the file system was unmounted. It would be better to trickle
the updates out but that would require more extensive changes.
- Fix issues with file system corruption, busy looping and other nasty
problems when logging and non-logging file systems are intermixed,
with one being the root file system.
- For logging, do not flush metadata on an inode-at-a-time basis if the sync
has been requested by ioflush. Previously, we could try hundreds of log
sync operations a second due to inode update activity, causing the syncer
to fall behind and metadata updates to be serialized across the entire
file system. Instead, burst out metadata and log flushes at a minimum
interval of every 10 seconds on an active file system (happens more often
if the log becomes full). Note this does not change the operation of
fsync() etc.
- With the flush issue fixed, re-enable concurrent metadata updates in
vfs_wapbl.c.
backend, perform all calls through a syscall table. This makes it
possible to make system calls to non-local rump kernels.
(requires a bit support code. it's written but quite messy currently)
- reimplement vmem sanity checks with less code duplication.
- reimplement ddb vmem-related commands in a more consistent ways.
remove automatic whatis.
it caused the return from the enclosing function to break, as well as the
ssp return on i386. To fix both issues, split configure in two pieces
the one before calling ssp_init and the one after, and move the ssp_init()
call back in main. Put ssp_init() in its own file, and compile this new file
with -fno-stack-protector. Tested on amd64.
XXX: If we want to have ssp kernels working on 5.0, this change needs to
be pulled up.
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clase of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.
those struct members but there is no reason to rely on that.
While here, I rewrite the loop using an usual idiom. It shaves
both source and object code.
- vfs_syscalls.c rev. 1.342 fails to invert condition correcly when
then-clause and else-clause is swapped. Since then, revoke(2) fails
if it is issued by file owner.
- Probably since rev. 1.160 of genfs_vnops.c, revoke(2) fails if it is
applied to non-device file and drops kernel into ddb.
specs_open routine. If devsw_open fail, get driver name with devsw_getname
routine and autoload module.
For now only dm drivervcan be loaded, other pseudo drivers needs more work.
Ok by ad@.
- Cache kva.
- Convert to use mutex_obj_alloc().
- Make better use of pool_cache.
Also:
Disable direct transfers for the moment. I believe there may be a bug that
can cause transfers to stall when switching between direct/buffered access.
I think this has most recently been run into on 'denver' but I have seen it
as far back as 3.1.
(As an aside, direct is a not a clear win on modern systems with large cache
and high TLB invalidation overhead. Particularly so on MP systems, although
micro benchmarks may report otherwise because they typically do not tax the
system. Anyone want to write a decent benchmark?)
quick-running or non-threaded rump jobs, where the rehash algorithm
does not have a chance to run. For other cases it doesn't make
much difference, since the size will grow or decrease when the
rehash algorithm runs for the first time (t=10*hz currently).
type/status/etc inquiries. (PR kern/37915)
This is clearly a design problem in tty, but we need a cheap fix now.
The problem is that ttyinput() tries to pull a spinlock which
is already held on calls to t_oproc.
The workaround is based on the fact that within wscons code, the
wsdisplay_emulinput() function is only called directly from
wsdisplaystart(). So we can be sure that the tty lock is held,
and use an inofficial entry point in ttc.c which avoids the locking.
These ate certainly more assumptions than needed by the fix
proposed in the PR, but it doesn't affect (and slow down) other
tty drivers.
devmajor_t/devminor_t, as proposed on tech-kern.
This avoids 64-bit arithmetics and 64-bit printf formats in parts
of the kernel where it is not really useful, and helps clarity.
to parse and generate the compat name and basename (e.g. __stat50
and stat). Use this to autogenerate __RENAME()'s to the rump_syscalls
header so that they can be called e.g. rump_sys_socket() instead
of rump_sys___socket30().
disk_read_sector() wants DEV_BLKSIZE blkno's BUT sectorsize unit lengths
specified... how `logical'.
Real fixup pending on discussion on tech-kern/source-changes.
magic libc symbol. This also allows to bid farewell to subr_prf2.c
and merge the contents back to subr_prf.c. The host kernel bridging
is now done via rumpuser_putchar().
partitions on optical media like CD/DVD/BD but also on all other media if
there is no NetBSD disklabel or MBR label.
Also fix cd's readdisklabel arguments so the ioctl's arrive at the right
device (!) and update its default label to make more sense.
- It doesn't work and a dead system that can't be reset from the console is
worse than a system that has painced and rebooted. If you can make it work
reliably please do so.
- If the system is paniced there is every reason to suspect VM structures
and the contents of the buffer cache.
while ironically trying to preserve the same during copy. Would only have
occurred if a multithreaded program expanded the descriptor table and,
within a tiny window of exposure, another thread in the program tried to
access descriptor zero.
- Convert to use kmem_alloc/kmem_free.
in the root of the tree being modified, rather than in the system default
tree. This permits module compat_netbsd32 to initialize its shadow tree
at load time.
Discussed on tech-kern, with no objections.
Addresses my PR kern/40167
embedding the address of its xxx_mountroot() in swapnetbsd.c. This
permits booting of kernels with hard-wired filesystem type even if the
filesystem is in a loadable module (ie, not linked into the kernel
image).
Discussed on current-users. Tested on amd64 and i386 with both hard-
wired and '?' filesystem times, and on both modular and monolithic
kernels.
Thanks to pooka@ for code review and suggestions.
Addresses my PR kern/40167
somewhere in the system. If it is, wait for it to complete before tearing
it down. The caller commits to not trigger the interrupt again once
disestablish is set in motion.
- Output a .bss section and make all the symbols relative to it, instead
of making them absolute.
- Output a single load section, no need for two.
'gdb /dev/ksyms' still doesn't work because ksyms doesn't do mmap yet.
phases, so move the initialization of the ksyms mutex back into main via
a function called ksyms_init. Rename the existing (but quite different)
ksyms_init* variations into ksyms_addsyms_elf() and ksyms_addsyms_explicit()
and adapt machdep code accordingly.
security.curtain=1
If the kauth call failed, we'd silently continue the loop, but the error
code would remain and eventually "leak" to userspace. Reset the error to
zero when continuing.
Tested by snj@ and myself. Okay snj@.