VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.
- change all occurences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55
Adresses PR kern/38762 panic: vwakeup: neg numoutput
No objections from tech-kern@.
- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.
- Simplify locking in some pmap(9) modules by removing P->V locking.
- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).
- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.
- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.
Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.
to store disk quota usage and limits, integrated with ffs
metadata. Usage is checked by fsck_ffs (no more quotacheck)
and is covered by the WAPBL journal. Enabled with kernel
option QUOTA2 (added where QUOTA was enabled in kernel config files),
turned on with tunefs(8) on a per-filesystem
basis. mount_mfs(8) can also turn quotas on.
See http://mail-index.netbsd.org/tech-kern/2011/02/19/msg010025.html
for details.
version or the 32bit compat layout in execve1. Introduce a new function
copyin_psstrings for reading it back from userland and converting it to
the native layout. Refactor procfs to share most of the code with the
kern.proc_args sysctl handler.
This material is based upon work partially supported by
The NetBSD Foundation under a contract with Joerg Sonnenberger.
lower vnode before passing down the VOP_REVOKE(). This way VOP_REVOKE()
on a layered file system always inactivates and closes the lower vnode.
Should finally fix PR kern/43456.
high as the upper vnode count before passing down the VOP_REVOKE().
This way vclean() check for active (vp->v_usecount > 1) vnodes gets it right.
Should fix PR kern/43456.
pathbuf object passed to namei as work space instead. (For now a pnbuf
pointer appears in struct nameidata, to support certain unclean things
that haven't been fixed yet, but it will be going away in the future.)
This removes the need for the SAVENAME and HASBUF namei flags.
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.
Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).
The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.
has busy pages and wants the wapbl lock as reader from wapbl_begin(),
another thread has the wapbl lock as reader and waits for a page from
the first thread. Now a third thread calls wapbl_flush() and wants the
wapbl lock as writer.
Move the wapbl_begin() up to a point where genfs_getpages() has no busy
pages yet.
the earlier change caused data corruption by freeing pages
without invaliding their mappings. instead of the trylock/retry,
just take the genfs-node lock before calling VOP_GETPAGES()
and pass a new flag to tell it that we're already holding this lock.
and set VI_WRMAPDIRTY) after we have busied the pages rather than
before. this prevents other threads calling genfs_do_putpages() from
marking the vnode clean again while we're in the process of creating
new writable mappings, since such threads will wait for the page(s) to
become unbusy before proceeding.
fixes the problem recently reported by hannken@ on tech-kern.
in genfs_do_putpages() and uao_put().
Use 'v_uobj.uo_npages' to check for an empty memq.
Put some assertions where these marker pages may not appear.
Ok: YAMAMOTO Takashi <yamt@netbsd.org>
vnode that may disappear before the caller has a chance to reference it.
Reference the vnode while the specfs cache is locked.
Welcome to 5.99.37.
No objections on tech-kern.
before removing a node from the hash chain.
Release the hash list lock before calling getnewvnode() and check the
hash list again like other file systems do.
Take v_interlock before calling vget().
the reality (remove duplicate one in nullfs, merge some differences from
it), KNF, improve and update some comments, add few KASSERT()s, remove
unused declarations, avoid double inclusion of headers, misc.
No functional changes.
in the vnode. All LK_* flags move from sys/lock.h to sys/vnode.h. Calls
to vlockmgr() in file systems get replaced with VOP_LOCK() or VOP_UNLOCK().
Welcome to 5.99.34.
Discussed on tech-kern.
Rename real routines to proc_find() and pgrp_find(), remove PFIND_* flags
and have consistent behaviour. Provide proc_find_raw() for special cases.
Fix memory leak in sysctl_proc_corename().
COMPAT_LINUX: rework ptrace() locking, minimise differences between
different versions per-arch.
Note: while this change adds some formal cosmetics for COMPAT_DARWIN and
COMPAT_IRIX - locking there is utterly broken (for ages).
Fixes PR/43176.
- VOP_LOCK(vp, flags): Limit the set of allowed flags to LK_EXCLUSIVE,
LK_SHARED and LK_NOWAIT. LK_INTERLOCK is no longer allowed as it
makes no sense here.
- VOP_ISLOCKED(vp): Remove the for some time unused return value
LK_EXCLOTHER. Mark this operation as "diagnostic only".
Making a lock decision based on this operation is no longer allowed.
Discussed on tech-kern.
authors from having to get down on their knees and pray they won't
get POGA'd(*) again.
This plugs componentname leaks in at least smbfs and buggy puffs
servers (buggy servers shouldn't be able to leak kernel memory).
*) principle of greatest astonishment
again: if (...) goto err;
void *ptr = alloc();
if (...) goto again;
if (...) goto err1;
...
err1: if (ptr) free(ptr);
err:
return;
This leaks memory if exited with "goto again; -> goto err;".
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has creeped
into the tree since then. Nuke the macros and convert all callsites
to lowercase.
no functional change
- Addresses the issue described in PR/38828.
- Some simplification in threading and sleepq subsystems.
- Eliminates pmap_collect() and, as a side note, allows pmap optimisations.
- Eliminates XS_CTL_DATA_ONSTACK in scsipi code.
- Avoids few scans on LWP list and thus potentially long holds of proc_lock.
- Cuts ~1.5k lines of code. Reduces amd64 kernel size by ~4k.
- Removes __SWAP_BROKEN cases.
Tested on x86, mips, acorn32 (thanks <mpumford>) and partly tested on
acorn26 (thanks to <bjh21>).
Discussed on <tech-kern>, reviewed by <ad>.
Don't try to load a driver module if the driver is already exist but just
not attached. [bc]dev_open() could return ENXIO even if the driver exists.
XXX: Maybe this should be handled by helper functions for
XXX: module_autoload() calls on demand.
process doing statvfs(!), just report 0. The code had some kernel
panicking bug after the descriptor code update, the functionality
is more like a bunny rabbit hat than anything useful, and I can't
bother to figure out what the invariants in the new descriptor code
are.
fixes PR kern/41534 and kern/41786
the other routines of the same spirit.
Adjust file-system code to use it.
Keep vaccess() for KPI compatibility and to keep element of least
surprise. A "diagnostic" message warning that vaccess() is deprecated will
be printed when it's used (obviously, only in DIAGNOSTIC kernels).
No objections on tech-kern@:
http://mail-index.netbsd.org/tech-kern/2009/06/21/msg005310.html
- Avoid atomics in more places.
- Remove the per-descriptor mutex, and just use filedesc_t::fd_lock.
It was only being used to synchronize close, and in any case we needed
to take fd_lock to free the descriptor slot.
- Optimize certain paths for the <NDFDFILE case.
- Sprinkle more comments and assertions.
- Cache more stuff in filedesc_t.
- Fix numerous minor bugs spotted along the way.
- Restructure how the open files array is maintained, for clarity and so
that we can eliminate the membar_consumer() call in fd_getfile(). This is
mostly syntactic sugar; the main functional change is that fd_nfiles now
lives alongside the open file array.
Some measurements with libmicro:
- simple file syscalls are like close() are between 1 to 10% faster.
- some nice improvements, e.g. poll(1000) which is ~50% faster.
the security checks when mounting a device (VOP_ACCESS() + kauth(9) call)).
Proposed with no objections on tech-kern@:
http://mail-index.netbsd.org/tech-kern/2009/04/20/msg004859.html
The vnode is always expected to be locked, so no locking is done outside
the file-system code.
proc_enterpgrp() with proc_leavepgrp() to free process group and/or
session without proc_lock held.
- Rename SESSHOLD() and SESSRELE() to to proc_sesshold() and
proc_sessrele(). The later releases proc_lock now.
Quick OK by <ad>.
There are still about 1600 left, but they have ',' or /* ... */
in the actual variable definitions - which my awk script doesn't handle.
There are also many that need () -> (void).
(The script does handle misordered arguments.)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep
Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.
PR kern/40361 WAPBL locking panic in -current
PR kern/40361 WAPBL locking panic in -current
PR kern/40470 WAPBL corrupts ext2fs
PR kern/40562 busy loop in ffs_sync when unmounting a file system
PR kern/40525 panic: ffs_valloc: dup alloc
- A fix for an issue that can lead to "ffs_valloc: dup" due to dirty cg
buffers being invalidated. Problem discovered and patch by dholland@.
- If the syncer fails to lazily sync a vnode due to lock contention,
retry 1 second later instead of 30 seconds later.
- Flush inode atime updates every ~10 seconds (this makes most sense with
logging). Presently they didn't hit the disk for read-only files or
devices until the file system was unmounted. It would be better to trickle
the updates out but that would require more extensive changes.
- Fix issues with file system corruption, busy looping and other nasty
problems when logging and non-logging file systems are intermixed,
with one being the root file system.
- For logging, do not flush metadata on an inode-at-a-time basis if the sync
has been requested by ioflush. Previously, we could try hundreds of log
sync operations a second due to inode update activity, causing the syncer
to fall behind and metadata updates to be serialized across the entire
file system. Instead, burst out metadata and log flushes at a minimum
interval of every 10 seconds on an active file system (happens more often
if the log becomes full). Note this does not change the operation of
fsync() etc.
- With the flush issue fixed, re-enable concurrent metadata updates in
vfs_wapbl.c.
could cause a bad pointer dereference in the debug printing when
credentials with values of NOCRED or FSCRED were passed to kauth.
I don't see any way to set such a flag, I think its just a debug
thing that could be enabled at compile time by somebody who knew
how, hence the comment rather than a real fix.
specs_open routine. If devsw_open fail, get driver name with devsw_getname
routine and autoload module.
For now only dm drivervcan be loaded, other pseudo drivers needs more work.
Ok by ad@.
and wants to busy a page while another thread calls VOP_PUTPAGES on the same
vnode, takes pages busy and wants to start a wapbl transaction.
Reviewed by: Jason Thorpe <thorpej@netbsd.org>
Ignore procs with zero or all LSZOMB LWPs. Get a non-LSZOMB LWP to perform
operations against as part of the deal.
procfs really needs to be updated to support multi-threading fully.
Hi Antti!
Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.
OK'd by core@, releng@.
run through copy-on-write. Call fscow_run() with valid data where possible.
The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.
- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.
- Always run copy-on-write on buffers returned from ffs_balloc().
- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.
Welcome to 4.99.63
Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>
and DVD's behave like floppy discs. Writing is supported upto and including
version 2.01; version 2.50 and 2.60 will follow.
Also extending the UDF implementation to support symbolic links and
hardlinks.
Added are the mmcformat(8) tool to format rewritable CD/DVD discs and
newfs_udf(8).
Limitations:
all operations can be performed on the file system though the
sheduling is currently optimised for archiving workloads.
mv(1)/rename(2) is currently only implemented for non-directories.
Make VFS hooks dynamic while we're here and say farewell to VFS_ATTACH and
VFS_HOOKS_ATTACH linksets.
As a consequence, most of the file systems can now be loaded as new style
modules.
Quick sanity check by ad@.
Simplify the mount locking. Remove all the crud to deal with recursion on
the mount lock, and crud to deal with unmount as another weirdo lock.
Hopefully this will once and for all fix the deadlocks with this. With this
commit there are two locks on each mount:
- krwlock_t mnt_unmounting. This is used to prevent unmount across critical
sections like getnewvnode(). It's only ever read locked with rw_tryenter(),
and is only ever write locked in dounmount(). A write hold can't be taken
on this lock if the current LWP could hold a vnode lock.
- kmutex_t mnt_updating. This is taken by threads updating the mount, for
example when going r/o -> r/w, and is only present to serialize updates.
In order to take this lock, a read hold must first be taken on
mnt_unmounting, and the two need to be held across the operation.
One effect of this change: previously if an unmount failed, we would make a
half hearted attempt to back out of it gracefully, but that was unlikely to
work in a lot of cases. Now while an unmount that will be aborted is in
progress, new file operations within the mount will fail instead of being
delayed. That is unlikely to be a problem though, because if the admin
requests unmount of a file system then s(he) has made a decision to deny
access to the resource.
The previous fix worked, but it opened a window where mounts could have
disappeared from mountlist while the caller was traversing it using
vfs_trybusy(). Fix that.
The symptom was that sometimes file systems would occasionally not appear
in output from 'df' or 'mount' if the system was busy. Resolution:
- Make mount locks work somewhat like vm_map locks.
- vfs_trybusy() now only fails if the mount is gone, or if someone is
unmounting the file system. Simple contention on mnt_lock doesn't
cause it to fail.
- vfs_busy() will wait even if the file system is being unmounted.
we no longer need to guard against access from hardware interrupt handlers.
Additionally, if cloning a process with CLONE_SIGHAND, arrange to have the
child process share the parent's lock so that signal state may be kept in
sync. Partially addresses PR kern/37437.
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:
- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.
- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.
- The system spends less time at IPL_SCHED, and there is less lock activity.
- Socket layer becomes MP safe.
- Unix protocols become MP safe.
- Allows protocol processing interrupts to safely block on locks.
- Fixes a number of race conditions.
With much feedback from matt@ and plunky@.
- Do reference counting for 'struct mount'. Each vnode associated with a
mount takes a reference, and in turn the mount takes a reference to the
vfsops.
- Now that mounts are reference counted, replace the overcomplicated mount
locking inherited from 4.4BSD with a recursable rwlock.
Introduce a per-FS rename lock and new vfsops to manipulate it.
Get this lock while renaming. Also add another relookup() in do_sys_rename,
which is a hack to kludge around some of the worst deficiencies of
ufs_rename.
reviewed-by: pooka (and an earlier rev by ad)
posted on tech-kern with no objections.
shutdown). There are still problems with device access and a PR will be
filed.
- Kill checkalias(). Allow multiple vnodes to reference a single device.
- Don't play dangerous tricks with block vnodes to ensure that only one
vnode can describe a block device. Instead, prohibit concurrent opens of
block devices. As a bonus remove the unreliable code that prevents
multiple file system mounts on the same device. It's no longer needed.
- Track opens by vnode and by device. Issue cdev_close() when the last open
goes away, instead of abusing vnode::v_usecount to tell if the device is
open.
continue. A thread trying to clean out the extant layer vnode needs to
acquire the shared lock (i.e. the lower vnode's lock), which our caller
already holds. To allow the cleaning to succeed the current thread must make
progress. So, for a brief time more than one vnode in a layered file system
may refer to a single vnode in the lower file system.
- Add a KAUTH_PROCESS_SCHEDULER action, to handle scheduler related
requests, and add specific requests for set/get scheduler policy and
set/get scheduler parameters.
- Add a KAUTH_PROCESS_KEVENT_FILTER action, to handle kevent(2) related
requests.
- Add a KAUTH_DEVICE_TTY_STI action to handle requests to TIOCSTI.
- Add requests for the KAUTH_PROCESS_CANSEE action, indicating what
process information is being looked at (entry itself, args, env,
open files).
- Add requests for the KAUTH_PROCESS_RLIMIT action indicating set/get.
- Add requests for the KAUTH_PROCESS_CORENAME action indicating set/get.
- Make bsd44 secmodel code handle the newly added rqeuests appropriately.
All of the above make it possible to issue finer-grained kauth(9) calls in
many places, removing some KAUTH_GENERIC_ISSUSER requests.
- Remove the "CAN" from KAUTH_PROCESS_CAN{KTRACE,PROCFS,PTRACE,SIGNAL}.
Discussed with christos@ and yamt@.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().
Welcome to 4.99.39.
Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.
quick consensus on tech-kern
- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.
- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.
- Replace the hand-made reader/writer lock with a krwlock.
- Keep `vn_cow_*' functions and mark as obsolete.
- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.
Reviewed by: Jason Thorpe <thorpej@netbsd.org>
knew what it was supposed to be used for and wrstuden gave a go-ahead
* while rototilling, convert file systems which went easily to
use VFS_PROTOS() instead of manually prototyping the methods
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.
Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.
reviewed by tech-kern & wrstuden
instead of the result from getcwd(). The works around locking
panics caused by namei calling VOP_READLINK while holding on to a
directory lock and getcwd() trying to acquire that lock. The real
fix would be to get rid of getcwd() calls within VOPs (not locking
safe), but that's not a viable option in the netbsd-4 timeframe.
Suggestion for workaround from David Holland.
fs code is a kernel buffer, pass though the length of the buffer as well.
Since the length of the userspace buffer isn'it (yet) passed through the mount
system call, add a field to the vfsops structure containing the default length.
Split sys_mount() for calls from compat code.
Ride one of the recent kernel version changes - old fs LKMs will load, but
sys_mount() will reject any attempt to use them.
MNT_FORCE is given
* decrease cargo cult index by getting rid of commented sections
with mntflushbuf() in them - AFAICT the call was removed from our
kernel over 13 years ago with the 4.4BSDlite import
an init method. So get rid of it and #ifdef _LKM and just always
init in the init method. Give malloc types the same treatment.
Makes file systems nicer to work with in linksetless environments
and fixes a few LKM discrepancies.
vmspace information fails.
Return the nice value properly to userland via the /proc/<pid>/stat entry.
Use vm sizes from vmspace, rather than rusage structs, for the same
reasons as mentioned previously - see the comment in
kvm_proc.c::kvm_getproc2() about rusage values and zombie processes.
+ in /proc/<pid>/statm emulation, use the memory values from vmspace,
rather than struct rusage, since the rusage values appear to be 0 for
all processes except zombies. cf dsl's comment in
kvm_proc.c::kvm_getproc2()
+ in /proc/<pid>/stat, instead of returning the tv_sec value, return the
number of ticks we've had (roughly equivalent to the Linux jiffies).
Calculate these values from the tv_usec values.
Also:
+ enclose CPU_INFO_ITERATOR and CPU_INFO_FOREACH usage in #ifdef
MULTIPROCESSOR, at the request of Nick Hudson
Together, these changes allow htop to work on NetBSD.
/proc/stat
/proc/loadavg and
/proc/<pid>/statm.
These are only present when -o linux is specified as a mount option
to procfs.
Factor out some common code so that it can be used by a number of
functions.
XXX The values returned in the statm emulation need to be verified.
function that does the work, genfs_do_putpages(), now takes as an argument
a pointer to the page that would be waited on, if PGO_BUSYWAIT were not set.
This allows a consumer, e.g. lfs_putpages(), to perform an action outside
the scope of UVM before sleeping on the page in question.
to skip unnecessary flushing when layered file system vnodes are recycled.
this also prevents a deadlock with the dodgy LFS putpages routine.
fixes the non-LFS part of PR 36150.
corresponding flags.
Revert softdep_trackbufs() to its state before vn_start_write() was added.
Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.
Welcome to 4.99.17
write the whole map in one shot so that we don't have to deal with the
map changing under us. Fixes the linux emulated jdk-1.6 where it was
losing the last map entry and could not find the stack on startup.
- Drop the target's vm_map lock before calling uiomove(). We could
deadlock if inspecting /proc/curproc/map.
- If the vm_map might have changed, restart the operation, but give
up after 250 retries if the map keeps changing. XXX This is not
ideal.