different is that in order to avoid issues with the WAPBL journal lock
the wrong locking had to be changed to different wrong locking. This
is now moot.
I have not hand-validated that the current two copies of rename are
equivalent, or that the locking fixes merged with the old rename
produce code that is textually identical (modulo WAPBL calls that do
nothing when WAPBL is turned off) to the WAPBL rename... but I did
this check when preparing my previous round of rename patches last
year and all updates since have been applied to both.
fix the non-wapbl rename; that will be coming soon. This patch also
leaves a lot of the older locking-related code around in #if 0 blocks,
and there's a lot of leftover redundant logic. All that will be going
away later.
Relates to at least these PRs:
PR kern/24887
PR kern/41417
PR kern/42093
PR kern/43626
and possibly others.
sys/stdarg.h and expect compiler to provide proper builtins, defaulting
to the GCC interface. lint still has a special fallback.
Reduce abuse of _BSD_VA_LIST_ by defining __va_list by default and
derive va_list as required by standards.
the inode in the guts of ufs. Now, in VOPs where i_crap is used it is
used (directly) only immediately on entry to the VOP call and then
passed around by reference.
Except for rename, which needs explicit sorting out. The code in
ufs_wapbl_rename is unchanged in behavior but I'm increasingly
inclined to think it's wrong.
the vnode when lookup returns and fished out again later.
1. Create struct ufs_lookup_results to hold these.
2. Call the ufs_lookup_results instance in struct inode "i_crap" to be
clear about exactly what's going on, and to distinguish the lookup
results from respectable members of struct inode.
3. Update references to these members in the directory access
subroutines.
4. Include preliminary infrastructure for checking that the i_crap
being used is still valid when it's used. This doesn't actually do
anything yet.
5. Update the way ufs_wapbl_rename manipulates these elements to use
the new data structures. I have not changed the manipulation; it may
or may not be correct but I continue to suspect that it is not.
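
A minimal userland sketch of the layout described in steps 1 and 2. The field names inside the results structure and the toy_* functions are illustrative assumptions, not the actual kernel definitions; only the names struct ufs_lookup_results and i_crap come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Side results of a directory lookup; field names are assumptions. */
struct ufs_lookup_results {
	int32_t ulr_count;	/* size of free slot found, if any */
	int32_t ulr_offset;	/* offset of the entry in the directory */
};

/* Toy stand-in for struct inode; only the members relevant here. */
struct inode {
	uint64_t i_number;
	struct ufs_lookup_results i_crap;	/* transient lookup state */
};

/* Directory subroutine: works only from the results it is handed. */
int32_t
toy_dir_entry_offset(const struct ufs_lookup_results *ulr)
{
	return ulr->ulr_offset;
}

/* VOP-like entry point: touch i_crap on entry, then pass by reference. */
int32_t
toy_vop_remove(struct inode *dp)
{
	struct ufs_lookup_results *ulr = &dp->i_crap;

	return toy_dir_entry_offset(ulr);
}
```

Keeping the direct i_crap access at the top of the operation makes it easier to later check, in one place, that the saved results are still valid.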
The word of the day is "stigmergy".
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.
- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55
Addresses PR kern/38762 panic: vwakeup: neg numoutput
No objections from tech-kern@.
filesystem in which format extended attributes shall be listed.
There are currently two formats:
- NUL-terminated strings, used for listxattr(2), this is the default.
- one byte length-prefixed, non NUL-terminated strings, used for
extattr_list_file(2), which is obtained by setting the
EXTATTR_LIST_PREFIXLEN flag to VOP_LISTEXTATTR(9)
This approach avoids the need for converting the list back and forth, except
in libperfuse, since FUSE uses NUL-terminated strings, and the kernel may
have requested EXTATTR_LIST_PREFIXLEN.
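
For illustration, the conversion between the two formats can be sketched as below. This is a standalone sketch, not the libperfuse code; the function name prefixlen_to_nul and its error convention are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert a list of one-byte length-prefixed names (no NUL terminators)
 * into a list of NUL-terminated strings. Both encodings spend exactly
 * one extra byte per name, so the output has the same size as the input.
 * Returns the number of bytes written, or (size_t)-1 on malformed input
 * or insufficient output space.
 */
size_t
prefixlen_to_nul(const unsigned char *in, size_t inlen,
    char *out, size_t outlen)
{
	size_t ipos = 0, opos = 0;

	while (ipos < inlen) {
		size_t namelen = in[ipos++];

		if (namelen > inlen - ipos)
			return (size_t)-1;	/* name runs past the input */
		if (namelen + 1 > outlen - opos)
			return (size_t)-1;	/* output buffer too small */
		memcpy(out + opos, in + ipos, namelen);
		out[opos + namelen] = '\0';
		ipos += namelen;
		opos += namelen + 1;
	}
	return opos;
}
```

Because the sizes match, a caller can size the output buffer from the input length alone.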
Modify lsextattr(8) so that it does not expect each attribute name to be
prefixed by its length. This enables extattr_list_(file|link|fd) to
return a buffer matching its documentation. This also makes the interface
similar to what Linux and FUSE do, which is nice for interoperability.
Note that since we had no EA implementation supporting listing, we do
not break anything.
for UFS1).
Remove kernel option for EA backing store autocreation and do it by
default. Add a sysctl so that autocreated attribute size can be modified.
ubc_zerorange(struct uvm_object *, off_t, size_t, int) changing
the first argument to a uvm_object and adding a flags argument.
Modify tmpfs_reg_resize() to zero the backing store (aobj) instead
of the vnode. ubc_purge() no longer panics when unmounting tmpfs.
Keep uvm_vnp_zerorange() until the next kernel version bump.
- autocreate attribute backing file for new attributes
- autoload attributes when issuing extattrctl start
- when autoloading attributes, do not display garbage warning when looking
up entries that got ENOENT
- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.
- Simplify locking in some pmap(9) modules by removing P->V locking.
- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).
- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.
- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.
Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.
is suspended extends the suspension until the vnode gets unlocked by
the caller of ffs_snapshot().
Resuming the file system before expunging all snapshots and syncing the
snapshot creates races and deadlocks with journaling file systems at least.
system as this is sufficient for the remaining operations.
Reduces the time the file system is suspended and should make this time
independent of the number of snapshots already present.
- Replace the ugly sync loop in ffs_full_fsync() and ffs_vfs_fsync() with
vflushbuf(). This loop is a relic of softdeps and not needed anymore.
- Add ffs_spec_fsync() for device nodes on ffs file systems that calls
spec_fsync() like all other file systems do and then updates the ctime.
Discussed on tech-kern.
Should fix PRs:
PR #41192 wapbl diagnostic panic during cgdconfig
PR #41977 kernel diagnostic assertion "rw_lock_held(&wl->wl_rwlock)" failed
PR #42149 wapbl locking panic if watching DVD
PR #42551 Lockdebug assert in wapbl when running zpool
already prevented). File systems are no longer responsible to check this.
Clean up and add asserts (note that dvp == vp cannot happen in vop_link).
OK dholland@
parse quota plists; as well as a getfsquota() function to retrieve quotas
for a single id from a single filesystem (whatever filesystem this is:
a local quota-enabled fs or NFS). This is built on functions getufsquota()
(for local filesystems with UFS-like quotas) and getnfsquota();
which are also available to userland programs.
move functions from quota2_subr.c to libquota or libprop as appropriate,
and adjust in-tree quota tools.
move some declarations from kernel headers to either sys/quota.h or
quota/quota.h as appropriate. ufs/ufs/quota.h still installed because
it's needed by other installed ufs headers.
ufs/ufs/quota1.h still installed as a quick&dirty way to get code
using the old quotactl() to compile (just include ufs/ufs/quota1.h instead of
ufs/ufs/quota.h - old code won't compile without this change and this is
on purpose).
Discussed on tech-kern@ and tech-net@ (long thread, but not much about
libquota itself ...)
to store disk quota usage and limits, integrated with ffs
metadata. Usage is checked by fsck_ffs (no more quotacheck)
and is covered by the WAPBL journal. Enabled with kernel
option QUOTA2 (added where QUOTA was enabled in kernel config files),
turned on with tunefs(8) on a per-filesystem
basis. mount_mfs(8) can also turn quotas on.
See http://mail-index.netbsd.org/tech-kern/2011/02/19/msg010025.html
for details.
"FSS_UNLINK_ON_CREATE" to unlink the backing store before
the snapshot gets created.
With this change dump(8) no longer dumps the zero-sized, but named
snapshot it is working on. Same applies to fsck_ffs(8).
- No need to take the snapshot lock while the file system is suspended.
- Allow ffs_copyonwrite() one level of recursion with snapshots locked.
- Do the block address lookup with snapshots locked.
- Take the snapshot lock while removing a snapshot from the list.
While hunting deadlocks change the transaction scope for ffs_snapremove().
We could deadlock from UFS_WAPBL_BEGIN() with a buffer held.
a "wapbl_flush: current transaction too big to flush" panic when
creating or removing snapshots on larger logging disks.
Addresses PR #44568 (WAPBL doesn't play nice with snapshots).
parent dir) associated with SAVESTART in relookup().
Check all call sites to make sure that SAVESTART wasn't set while
calling relookup(); if it was, adjust the refcount behavior. Remove
related references to SAVESTART.
The only code that was reaching the extra ref was msdosfs_rename,
where the refcount behavior was already fairly broken and/or gross;
repair it.
Add a dummy 4th argument to relookup to make sure code that hasn't
been inspected won't compile. (This will go away next time the
relookup semantics change, which they will.)
so that they get reused with an invalid pointer to a mount structure.
As a workaround, free the vnodes used to create the in-filesystem journal
immediately.
pathbuf object passed to namei as work space instead. (For now a pnbuf
pointer appears in struct nameidata, to support certain unclean things
that haven't been fixed yet, but it will be going away in the future.)
This removes the need for the SAVENAME and HASBUF namei flags.
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.
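
The create / init / lookup / destroy sequence can be sketched with a userland mock. Everything prefixed toy_ is hypothetical and stands in for the kernel interfaces; the mock only models who owns the path storage and for how long.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userland mock; the toy_* names are hypothetical, not the kernel API. */
struct toy_pathbuf {
	char *pb_path;		/* private copy of the path string */
};

struct toy_nameidata {
	struct toy_pathbuf *ni_pathbuf;	/* work space owned by the caller */
};

struct toy_pathbuf *
toy_pathbuf_create(const char *path)
{
	struct toy_pathbuf *pb = malloc(sizeof(*pb));

	if (pb == NULL)
		return NULL;
	pb->pb_path = malloc(strlen(path) + 1);
	if (pb->pb_path == NULL) {
		free(pb);
		return NULL;
	}
	strcpy(pb->pb_path, path);
	return pb;
}

void
toy_pathbuf_destroy(struct toy_pathbuf *pb)
{
	free(pb->pb_path);
	free(pb);
}

/* Stand-in for NDINIT: the nameidata borrows the pathbuf. */
void
toy_ndinit(struct toy_nameidata *nd, struct toy_pathbuf *pb)
{
	nd->ni_pathbuf = pb;
}

/* Stand-in for namei: here it only reports the path length. */
size_t
toy_namei(const struct toy_nameidata *nd)
{
	return strlen(nd->ni_pathbuf->pb_path);
}
```

The point of the shape is that the caller, not the lookup code, owns the buffer, so the lookup never needs flags telling it whether to keep or free the name.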
Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).
The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.
the earlier change caused data corruption by freeing pages
without invalidating their mappings. instead of the trylock/retry,
just take the genfs-node lock before calling VOP_GETPAGES()
and pass a new flag to tell it that we're already holding this lock.
Note: there are a billion ways to make the kernel panic by trying
to mount a garbage file system and I don't imagine we'll ever get
close to fixing even half of them. However, for this one failing
gracefully is a bonus since Xen DomU only does 32k MAXBSIZE and
the 64k MAXBSIZE file systems are out there (PR port-xen/43727).
Tested by compiling sys/rump with CPPFLAGS+=-DMAXPHYS=32768 (all
tests in tests/fs still pass). I don't know how we're going to
translate this into an easy regression test, though. Maybe with
a hacked newfs?
in genfs_do_putpages() and uao_put().
Use 'v_uobj.uo_npages' to check for an empty memq.
Put some assertions where these marker pages may not appear.
Ok: YAMAMOTO Takashi <yamt@netbsd.org>
in the vnode. All LK_* flags move from sys/lock.h to sys/vnode.h. Calls
to vlockmgr() in file systems get replaced with VOP_LOCK() or VOP_UNLOCK().
Welcome to 5.99.34.
Discussed on tech-kern.
- VOP_LOCK(vp, flags): Limit the set of allowed flags to LK_EXCLUSIVE,
LK_SHARED and LK_NOWAIT. LK_INTERLOCK is no longer allowed as it
makes no sense here.
- VOP_ISLOCKED(vp): Remove the return value LK_EXCLOTHER, which has been
unused for some time. Mark this operation as "diagnostic only".
Making a lock decision based on this operation is no longer allowed.
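
A small sketch of the flag restriction on VOP_LOCK(). The TOY_LK_* values are made up for the example (the real definitions live in sys/vnode.h), and the rule that exactly one of shared/exclusive must be requested is an assumption of this sketch, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values only; not the kernel's actual constants. */
#define TOY_LK_SHARED		0x01
#define TOY_LK_EXCLUSIVE	0x02
#define TOY_LK_NOWAIT		0x04
#define TOY_LK_INTERLOCK	0x08	/* no longer accepted */

/* Return true if flags is a combination the lock operation accepts. */
bool
toy_vop_lock_flags_ok(int flags)
{
	const int allowed = TOY_LK_SHARED | TOY_LK_EXCLUSIVE | TOY_LK_NOWAIT;
	const int mode = flags & (TOY_LK_SHARED | TOY_LK_EXCLUSIVE);

	if ((flags & ~allowed) != 0)
		return false;	/* e.g. TOY_LK_INTERLOCK */
	/* Assumption: exactly one lock mode must be requested. */
	return mode == TOY_LK_SHARED || mode == TOY_LK_EXCLUSIVE;
}
```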
Discussed on tech-kern.
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).
* XXX: Get extra reference to LFS vfsops. This prevents unload,
* but also prevents kernel panic due to text being unloaded
* from below lfs_writerd. When lfs_writerd can exit, remove
* this!!!
*/
Could use UFS_OPS, but:
1) the lfs kernel module depends on full ffs already anyway
2) lfs is being split from ufs, so this will automatically
go away soon
3) chances of anyone wanting an lfs-only kernel are pretty slim
4) i'm too lazy to figure out how to test ffs_snapgone() is
still called properly if I change the call ;)
new helper function.
Use this information to query physical sector sizes for WAPBL
instead of hardcoded defaults.
No longer limits physical sector sizes to 512 bytes.
- drop the notion of frags (LFS fragments) vs fsb (FFS fragments)
The code uses a complicated unity function that just makes the
code difficult to understand.
- support larger sector sizes. Fix disk address computations
to use DEV_BSIZE in the kernel as required by device drivers
and to use sector sizes in userland.
- Fix several locking bugs in lfs_bio.c and lfs_subr.c.
allocated to extend the file to the new size. Releasing all pages
may release pages that contain previously-written data not yet flushed
to disk. Should fix PR kern/35704
- {ffs,lfs,ext2fs}_truncate(): Even if the inode's size is the same as
the new length, call uvm_vnp_setsize(). *_truncate() may have been
called by *_write() in the error path (e.g. block allocation failure
because of quota or file system full), and at this point v_writesize
has been set to the desired size of the file and not reverted to the
old size. Not adjusting v_writesize to the real size causes
genfs_do_io() to write to disk past the real end of the file.
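
A toy userland model of the truncate change. The toy_* structures and functions are hypothetical stand-ins; the sketch assumes, as a simplification, that the set-size helper resets both the size and the write size.

```c
#include <assert.h>

/* Toy stand-ins; only the fields needed to show the stale write size. */
struct toy_vnode {
	long v_size;		/* current file size as seen by the VM layer */
	long v_writesize;	/* size a write in progress is extending to */
};

struct toy_inode {
	long i_size;
	struct toy_vnode *i_vnode;
};

/* Simplified model: reset both sizes to the given value. */
void
toy_uvm_vnp_setsize(struct toy_vnode *vp, long size)
{
	vp->v_size = size;
	vp->v_writesize = size;
}

int
toy_truncate(struct toy_inode *ip, long length)
{
	if (ip->i_size == length) {
		/*
		 * Size unchanged, but a failed write may have left
		 * v_writesize pointing past the real end of file.
		 */
		toy_uvm_vnp_setsize(ip->i_vnode, length);
		return 0;
	}
	ip->i_size = length;
	toy_uvm_vnp_setsize(ip->i_vnode, length);
	return 0;
}
```

In the first test case below, the vnode starts with a write size of 200 for a 100-byte file, as it would after a failed extending write, and the same-size truncate brings it back to 100.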
Unlike other filesystems this has some side issues because
the shift values are stored in the superblock and because
userland utilities share the same fsbtodb macros.
-> the kernel now ignores the value stored in the superblock.
-> the macro adaption is only done for defined(_KERNEL) code.
getcleanvnode() sets v_type to VNON after releasing v_interlock.
So the thread doing quotaon(), quotaoff() or qsync() could vget()
a vnode which is being recycled in getcleanvnode(), after it has
been cleaned and v_interlock released, but before v_type has been
reset, leading to KASSERT(vp->v_usecount == 1) firing in
getnewvnode(), or qsync() dereferencing a NULL pointer as in
PR kern/42205.
Fix by using the same tests as other ffs functions traversing the mount
list: also check for VTOI(vp) == NULL, and VI_XLOCK in addition
to VI_CLEAN.
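
The combined test can be sketched as a predicate. The flag values and the toy_vnode layout are illustrative assumptions; v_data being NULL models VTOI(vp) == NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values only. */
#define TOY_VI_CLEAN	0x1	/* vnode has been cleaned */
#define TOY_VI_XLOCK	0x2	/* vnode is being recycled */

struct toy_vnode {
	int v_iflag;
	void *v_data;		/* file system private data (the inode) */
};

/* Return true if a mount-list walker should skip this vnode. */
bool
toy_quota_skip_vnode(const struct toy_vnode *vp)
{
	if (vp->v_data == NULL)
		return true;	/* no inode attached any more */
	return (vp->v_iflag & (TOY_VI_XLOCK | TOY_VI_CLEAN)) != 0;
}
```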
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.
no functional change
reference while we were getting the v_interlock.
vget(): attempt to prevent it from returning a clean vnode:
if the vnode is being inactivated (by vrelel()), wait for
vrelel() to complete (or return EBUSY if we can't wait), and return
ENOENT if the vnode has been vclean'ed by vrelel()
Fix kern/41147 in a better way, hopefully fix other related race conditions.
hack is ffs_sync().
- Use the generic lock operations for ffs.
- Change ffs_sync() to omit the vnode lock while suspending.
Reviewed by: Antti Kantee <pooka@netbsd.org>
vput(vp);
error = VFS_VGET(vp->v_mount, ...);
just isn't right. Because of vnode caching this *probably* never bit
anyone, except maybe under very heavy load, but still.
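
One way to avoid the pattern is to fetch the mount pointer while the reference is still held. The sketch below is a userland model with hypothetical toy_* names; it is not necessarily how the actual change was written.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mount {
	int mnt_id;
};

struct toy_vnode {
	struct toy_mount *v_mount;
	int v_usecount;
};

/* Drop a reference; on the last one the vnode is recycled. */
void
toy_vput(struct toy_vnode *vp)
{
	if (--vp->v_usecount == 0)
		vp->v_mount = NULL;	/* models the vnode being reused */
}

/* Stand-in for VFS_VGET(): fails if handed a stale mount pointer. */
int
toy_vfs_vget(struct toy_mount *mp)
{
	return mp == NULL ? -1 : mp->mnt_id;
}

int
toy_reget(struct toy_vnode *vp)
{
	/* Read v_mount before releasing the vnode, not after. */
	struct toy_mount *mp = vp->v_mount;

	toy_vput(vp);
	return toy_vfs_vget(mp);
}
```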
Note that the race also exists between 2 nfs clients, one of them doing the rm.
In ufs_ihashget(), vget() can return a vnode that has been vclean'ed because
vget() can sleep. After vget returns, check that vp is still connected with
ip, and that ip still points to the inode we want. This fixes the NULL
pointer dereference in ufs_fhtovp() I've been seeing on an NFS server.
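
The re-check after a possibly sleeping vget() can be modelled as below. The structures and the function name are hypothetical simplifications; a failed check would mean the caller has to retry the hash lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode;

struct toy_vnode {
	struct toy_inode *v_data;	/* NULL once the vnode is cleaned */
};

struct toy_inode {
	unsigned long i_number;
	struct toy_vnode *i_vnode;
};

/*
 * After the reference has been obtained, confirm that vp and ip still
 * point at each other and that ip is still the inode we asked for.
 */
bool
toy_ihash_still_valid(const struct toy_vnode *vp,
    const struct toy_inode *ip, unsigned long inum)
{
	return vp->v_data == ip && ip->i_vnode == vp &&
	    ip->i_number == inum;
}
```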
XXX I have no idea why using vput() instead of
vlockmgr(vp->v_vnlock, LK_RELEASE); vrele(vp); does not work.
> Fix bug introduced in revision 1.174(*) where a NULL fspec with an MNT_UPDATE
> command would always return EINVAL. This broke fsck on root, where fsck'ing
> a dirty root would always return an error causing rc to result in a reboot.
(*) This is "Apply the NFS exports list rototill patch" change
in ext2fs_vfsops.c rev 1.91.
> Change ffs_mount, in MNT_UPDATE case, to check dev_t's for equality
> instead of just vnode pointers. Fixes erroneous "does not match mounted
> device" errors from mount(8) in the presence of MFS /dev, init.root, &c.
Fixes LOCKDEBUG panic which is the same one mentioned in PR kern/41078
on trying to mount_ext2fs against a raw device, while that panic
seems to have another root cause around module_autoload() in
sys/miscfs/specfs/spec_vnops.c:spec_open().
command would always return EINVAL. This broke fsck on root, where fsck'ing
a dirty root would always return an error causing rc to result in a reboot.
check_console, veriexecclose, veriexec_delete, veriexec_file_add,
emul_find_root, coff_load_shlib (sh3 version), coff_load_shlib,
compat_20_sys_statfs, compat_20_netbsd32_statfs,
ELFNAME2(netbsd32,probe_noteless), darwin_sys_statfs,
ibcs2_sys_statfs, ibcs2_sys_statvfs, linux_sys_uselib,
osf1_sys_statfs, sunos_sys_statfs, sunos32_sys_statfs,
ultrix_sys_statfs, do_sys_mount, fss_create_files (3 of 4),
adosfs_mount, cd9660_mount, coda_ioctl, coda_mount, ext2fs_mount,
ffs_mount, filecore_mount, hfs_mount, lfs_mount, msdosfs_mount,
ntfs_mount, sysvbfs_mount, udf_mount, union_mount, sys_chflags,
sys_lchflags, sys_chmod, sys_lchmod, sys_chown, sys_lchown,
sys___posix_chown, sys___posix_lchown, sys_link, do_sys_pstatvfs,
sys_quotactl, sys_revoke, sys_truncate, do_sys_utimes, sys_extattrctl,
sys_extattr_set_file, sys_extattr_set_link, sys_extattr_get_file,
sys_extattr_get_link, sys_extattr_delete_file,
sys_extattr_delete_link, sys_extattr_list_file, sys_extattr_list_link,
sys_setxattr, sys_lsetxattr, sys_getxattr, sys_lgetxattr,
sys_listxattr, sys_llistxattr, sys_removexattr, sys_lremovexattr
All have been scrutinized (several times, in fact) and compile-tested,
but not all have been explicitly tested in action.
XXX: While I haven't (intentionally) changed the use or nonuse of
XXX: TRYEMULROOT in any of these places, I'm not convinced all the
XXX: uses are correct; an audit might be desirable.
the other routines of the same spirit.
Adjust file-system code to use it.
Keep vaccess() for KPI compatibility and to follow the principle of least
surprise. A "diagnostic" message warning that vaccess() is deprecated will
be printed when it's used (obviously, only in DIAGNOSTIC kernels).
No objections on tech-kern@:
http://mail-index.netbsd.org/tech-kern/2009/06/21/msg005310.html
operations, specifically quota and block allocation from reserved space.
Modify ufs_quotactl() to accommodate passing "mp" earlier by vfs_busy()ing
it a little bit higher.
Mailing list reference:
http://mail-index.netbsd.org/tech-kern/2009/04/26/msg004936.html
Note that the umapfs request mentioned in this thread was NOT added as
there is still on-going discussion regarding the proper implementation.
the security checks when mounting a device (VOP_ACCESS() + kauth(9) call)).
Proposed with no objections on tech-kern@:
http://mail-index.netbsd.org/tech-kern/2009/04/20/msg004859.html
The vnode is always expected to be locked, so no locking is done outside
the file-system code.
panic described in PR kern/40948.
As usual, all the error branches in rename live based on an unholy
amalgamation of prayer and the blood of cute, furry and tasty
quadrupeds, so I won't even attempt to audit the rest.
And this wapbl rename really really needs to be merged with the
standard rename. That should be a fun PhD thesis topic ....
- atime updates were not being synced.
ffs_sync:
- In some cases the sync vnode was acting like the now-dead /usr/sbin/update.
It was examining vnodes that it should have ignored.
- It would find dirty inodes and try to flush them. Often ffs_fsync()
cheerfully ignored the flush request due to the fsync bug. Such inodes
remained dirty and were repeatedly re-examined by the syncer until
vnode reclaim or system shutdown.
- We were marking our place in the per-mount vnode list even though in
most cases there was no flush to perform. While not a bug, this wasted
CPU cycles because a TAILQ_NEXT would have sufficed.
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep
Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.
PR kern/40361 WAPBL locking panic in -current
PR kern/40470 WAPBL corrupts ext2fs
PR kern/40562 busy loop in ffs_sync when unmounting a file system
PR kern/40525 panic: ffs_valloc: dup alloc
- A fix for an issue that can lead to "ffs_valloc: dup" due to dirty cg
buffers being invalidated. Problem discovered and patch by dholland@.
- If the syncer fails to lazily sync a vnode due to lock contention,
retry 1 second later instead of 30 seconds later.
- Flush inode atime updates every ~10 seconds (this makes most sense with
logging). Previously they didn't hit the disk for read-only files or
devices until the file system was unmounted. It would be better to trickle
the updates out but that would require more extensive changes.
- Fix issues with file system corruption, busy looping and other nasty
problems when logging and non-logging file systems are intermixed,
with one being the root file system.
- For logging, do not flush metadata on an inode-at-a-time basis if the sync
has been requested by ioflush. Previously, we could try hundreds of log
sync operations a second due to inode update activity, causing the syncer
to fall behind and metadata updates to be serialized across the entire
file system. Instead, burst out metadata and log flushes at a minimum
interval of every 10 seconds on an active file system (happens more often
if the log becomes full). Note this does not change the operation of
fsync() etc.
- With the flush issue fixed, re-enable concurrent metadata updates in
vfs_wapbl.c.
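
The timing rules in the list above can be summarised in a small sketch. The function names, the constants, and the use of whole seconds are assumptions made for illustration; the real syncer and log code are structured differently.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FLUSH_INTERVAL	10	/* minimum seconds between bursts */
#define TOY_SYNC_DELAY		30	/* old retry delay, in seconds */
#define TOY_SYNC_RETRY		1	/* retry delay after lock contention */

/* Delay before the syncer retries a vnode it could not lock. */
int
toy_next_sync_delay(bool lock_contended)
{
	return lock_contended ? TOY_SYNC_RETRY : TOY_SYNC_DELAY;
}

/*
 * Decide whether metadata and the log should be flushed now: at most
 * once per interval, unless the log has filled up.
 */
bool
toy_should_flush(long now, long last_flush, bool log_full)
{
	if (log_full)
		return true;
	return now - last_flush >= TOY_FLUSH_INTERVAL;
}
```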
have pages busied and are trying to get the genfs node lock.
This causes a lock order reversal described in PR kern/40389.
This is not a proper fix and only a workaround for NetBSD 5.0.
problem first reported by simonb, patch tested by rmind