Patch by Slava Semushin <slava.semushin@gmail.com>
Again, this was tested by comparing obj files from a pristine and a patched
source tree against an i386/ALL kernel, and also for src/sbin/fsck_ffs,
src/sbin/fsdb and src/usr.sbin/makefs. Only changes in assert() line numbers
were detected in 'objdump -d' output.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.
Implemented for file systems of type ffs.
The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.
Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.
Welcome to 4.99.9 (new vfs op vfs_suspendctl).
- finish implementing splraiseipl (and makeiplcookie).
http://mail-index.NetBSD.org/tech-kern/2006/07/01/0000.html
- complete workqueue(9) and fix its ipl problem, which is reported
to cause audio skipping.
- fix netbt (at least compilation problems) for some ports.
- fix PR/33218.
- LOCKPARENT is no longer relevant for lookup(), relookup() or VOP_LOOKUP().
these now always return the parent vnode locked. namei() works as before.
lookup() and various other paths no longer acquire vnode locks in the
wrong order via vrele(). fixes PR 32535.
as a nice side effect, path lookup is also up to 25% faster.
- the above allows us to get rid of PDIRUNLOCK.
- also get rid of WANTPARENT (just use LOCKPARENT and unlock it).
- remove an assumption in layer_node_find() that all file systems implement
a recursive VOP_LOCK() (unionfs doesn't).
- require that all file systems supply vfs_vptofh and vfs_fhtovp routines.
fill in eopnotsupp() for file systems that don't support being exported
and remove the checks for NULL. (layerfs calls these without checking.)
- in union_lookup1(), don't change refcounts in the ISDOTDOT case, just
adjust which vnode is locked. fixes PR 33374.
- apply fixes for ufs_rename() from ufs_vnops.c rev. 1.61 to ext2fs_rename().
loops where vnodes can get removed or added during the loops. This could
lead to panic's on unmount since nodes are skipped or otherwise
TAILQ_NEXT(0xdeadbeef, ...) was dereferenced.
After a rmdir()ed directory has been truncated, force an update of
the directory's inode after queuing the dirrem that will decrement
the parent directory's link count. This will force the update of
the parent directory's actual link to actually be scheduled. Without
this change the parent directory's actual link count would not be
updated until ufs_inactive() cleared the inode of the newly removed
directory, which might be deferred indefinitely. ufs_inactive()
will not be called as long as any process holds a reference to the
removed directory, and ufs_inactive() will not clear the inode if
the link count is non-zero, which could be the result of an earlier
system crash.
[plus description about problems woth background fsck solved
by this; irrelevant to NetBSD]
For me, the good effect is at least that I'm getting less filesystem
inconsistencies after a crash.
Approved by christos quite a while ago.
vnodes were synced and processed backwards. This meant that the last
accessed node was processed first and the earlierst last.
An extra benefit is the removal of the ugly hack from the Berkly days on
LFS.
In the proces, i've also replaced the various variations hand written loops
by the TAILQ_FOREACH() macro's.
LFCNWRAPSTOP and LFCNWRAPGO.
Be less verbose about the various looping checks: use log() rather than
printf(), and only log anything if we are really looping ("count = 2" is
not an error condition).
Allow dirops sleeping on available space to be interruptible.
instead of just vnode pointers. Fixes erroneous "does not match mounted
device" errors from mount(8) in the presence of MFS /dev, init.root, &c.
No objections on tech-kern.
if we ourselves hold the lock. This prevents e.g. mknod from hanging
indefinitely.
Also, always use the return value from VOP_ISLOCKED to determine whether
we hold the lock or someone else does, rather than looking into the lock
structure ourselves.
* Mark being-deleted files in the Ifile so we can finish deleting them
at fs mount time.
* Flag the Ifile with "cleaner must clean" when writers are waiting for
the cleaner, rather than relying solely on the cleaner's estimation of
whether it should clean or not.
* Note partial segments written by a user agent (in particular,
fsck_lfs) so that repeated rolls forward don't interfere with one
another.
* Add a new fcntl, LFCNPASS, that allows the log to wrap exactly once,
for better testing of the validity of checkpoints.
* Keep track of the on-disk nlink count when cleaning, so that we don't
partially complete directory operations while cleaning.
* Ensure that every single Ifile inode write represents a consistent
view of the filesystem. In particular, the accounting for the segment
we are writing the inode into must be correct, and the accounting for
the segment that inode used to reside in must be correct. Rather than
just rewriting the inode if we wrote it wrong, rewrite the necessary
ifile blocks before writing the inode so we never write it wrong.
* Don't unmark any VDIROP vnodes if we haven't written them to disk,
avoiding yet another problem with the "wait for the cleaner" error
return from lfs_putpages().
Also, move the last callback to an aiodone call, so we no longer do any
memory management from interrupt context.
as the lfs compat_30_fhandle, g/c the latter.
Add an alias for the LFCNIFILEFH fcntl, so that binaries compiled in the
meantime (with too large lfs_fhandle) continue to work.
This makes vfs_cleanerd work again after the kernel checks filehandle size
more strictly (problem reported by Kurt Schreiner on current-users).
While touching all vptofh/fhtovp functions, get rid of VFS_MAXFIDSIZ,
version the getfh(2) syscall and explicitly pass the size available in
the filehandle from userland.
Discussed on tech-kern, with lots of help from yamt (thanks!).
particular, the caller can now choose whether to wait for the condition
to be met, and if the caller of LFCNWRAPSTOP dies or otherwise closes
the descriptor, the filesystem is started again. Updated the ckckp
regression test to use the new semantics.
dump_lfs(8) now uses the fcntls to implement LFS-style snapshotting through
the -X flag, addressing PR#33457 albeit not using fss(4). Fixed a couple
other problems with dump_lfs that manifested themselves during testing.
the precision of getnanotime() is not suitable for file timestamps.
esp. when it's nfs-exported.
- introduce vfs_timestamp().
(the name is from freebsd. currently merely a wrapper of nanotime())
- for ufs-like filesystems, use it rather than getnanotime().
XXX check other filesystems.
posted for it even if the vnode is locked. This will deadlock with wmesg
"softgetdbuf" if it gets a BMSAFEMAP dependency as here we have "bp == nbp"
and try to get a buffer we already own.
Approved by: Frank van der Linden <fvdl@netbsd.org>
- struct timeval time is gone
time.tv_sec -> time_second
- struct timeval mono_time is gone
mono_time.tv_sec -> time_uptime
- access to time via
{get,}{micro,nano,bin}time()
get* versions are fast but less precise
- support NTP nanokernel implementation (NTP API 4)
- further reading:
Timecounter Paper: http://phk.freebsd.dk/pubs/timecounter.pdf
NTP Nanokernel: http://www.eecis.udel.edu/~mills/ntp/html/kern.html
many inodes are cleaned at once. Make sure that we write all the pages
on vnodes that are being flushed, even if we don't think there's room;
drain v_numoutput before lfs_vflush() completes.
Also, don't allow a vnode that is in the process of being cleaned to be
chosen by getnewvnode(); this avoids a segment accounting panic in the case
that a large number of inodes are fed to lfs_markv() all at once.
notion of "how many segments are reserved for the cleaner" from that of
"how many segments are not counted in lfs_bfree". The default value
used for existing filesystems is the same as the previous implicit value
of (lfs_minfreeseg / 2 + 1), modulo some sanity checking.
Count pending dirops on a per-filesystem basis, since once we start
writing them we can't stop until we're done. This seems to help stave off
the "no clean segments" panic in the case of filling the filesystem with
directories and small files (e.g. simultaneously unpacking more copies of
pkgsrc than will fit).
inode that makes those changes valid is either written to disk by
lfs_writeinode() or discarded by lfs_vfree().
A couple of locking fixes are also included as well.
Move the stop for LFCNWRAPSTOP to the point at which writing at segment 0
is really about to commence, since this is what the test expects (and
incidentally what a snapshotting utility wants as well).
More correctly reconstruct the on-disk state at every checkpoint, rather
than relying on the entire state at the point of wrapping to be accurate
(that is only true the first time we wrap). Add a "make abort" target to
make rerunning the test more convenient when it has failed and we're done
analyzing the failure.
where segment 0 is being considered for writing. This allows for automated
checkpoint vailidity scanning, and could be used (in conjunction with the
existing LFCNREWIND) for e.g. snapshot dumps as well.
Include a regression test that does such scanning.
When writing the Ifile, loop through the dirty block list three times to
make sure that the checkpoint is always consistent (the first and second
times the Ifile blocks can cross a segment boundary; not so the third time
unless the segments are very small). Discovered by using the aforementioned
regression test.
My understanding is that the CLRSIG() is supposed to clear the signal
that was sent to the syncer process to prevent it from being delivered
to the syncer process in case unmounting fails, so that the syncer process
does not die while the filesystem is still mounted. The typical scenario
is, the syncher process is tsleep()ing in the kernel, and waking up when
it needs to do work. If someone sends a signal to it, eg. kill -TERM
the mfs process, then the kernel will try to unmount the mfs filesystem
before delivering the signal to the process. If that unmount fails, then
we should not really kill the process because that will hang the mount.
So we call CLRSIG() to stop the signal from being delivered.
So the first call to issignal() will return the signal number that was
sent to the syncer process (unless someone malicious was able to send
a lower numbered signal between the time tsleep() returned and we called
issignal()... something that is not really easy to do). But you are
right, we should not be calling it many times as a side effect of this
macro.
Rewrite CLRSIG() clear all the signals and call issignal() the correct
number of times.
explicitly (especially since we didn't know about VFREEING at all before),
but notice the EBUSY return from vget() instead.
Fix some more MP locking protocol issues, most of which were pointed out by
Christian Ehrhardt this morning on tech-kern.
instead of bytes for the index, and never search below fs->lfs_freehd.
Fix a bug in the previous version of the search (an erroneous assumption
that ino_t was signed).
Free the bitmap when we unmount the filesystem.
The writer daemon, if it does not need to flush the whole filesystem,
now only writes the vnodes for which the pagedaemon has requested pageouts
(although it does not pay attention to the page ranges the pagedaemon
supplies).
by Michel Oey, in which an aged LFS writes up to an extra Ifile block for
every file created; and paves the way for the truncation of the Ifile when
many files are deleted.
* Correct (weak) segment lock assertions in lfs_fragextend and lfs_putpages.
* Keep IN_MODIFIED set if we run out of avail in lfs_putpages.
* Don't try to (re)write buffers on a VBLK vnode; fixes a panic I found
while running with an LFS root.
* Raise priority of LFCNSEGWAIT to PVFS; PUSER is way too low for
something the pagedaemon is relying on.
- remove GOP_SIZE_READ/GOP_SIZE_WRITE flags.
they have not been used since the change.
- ufs_balloc_range: remove code which has been no-op since the change.
thanks Konrad Schroder for explaining the original intention of the code.
- ffs_gop_size: don't extend past eof, in the case of GOP_SIZE_MEM.
otherwise genfs_getpages end up to allocate pages past eof unnecessarily.
* Acknowledge that sometimes there are more dirty pages to be written to
disk than clean segments. When we reach the danger line,
lfs_gop_write() now returns EAGAIN. The caller of VOP_PUTPAGES(), if
it holds the segment lock, drops it and waits for the cleaner to make
room before continuing.
* Note and avoid a three-way deadlock in lfs_putpages (a writer holding
a page busy blocks on the cleaner while the cleaner blocks on the
segment lock while lfs_putpages blocks on the page).
- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.
> ----------------------------
> revision 1.54
> date: 2001/08/26 01:25:12; author: iedowse; state: Exp; lines: +30 -12
> When compacting directories, ufs_direnter() always trusted DIRSIZ()
> to supply the number of bytes to be bcopy()'d to move an entry. If
> d_ino == 0 however, DIRSIZ() is not guaranteed to return a sensible
> length, so ufs_direnter could end up corrupting a directory during
> compaction. In practice I believe this can only happen after fsck_ffs
> has fixed a previously-corrupted directory.
>
> We now deal with any mid-block unused entries specially to avoid
> using DIRSIZ() or bcopy() on such entries. We also ensure that the
> variables 'dsize' and 'spacefree' contain meaningful values at all
> times. Add a few comments to describe better this intricate piece
> of code.
>
> The special handling of mid-block unused entries makes the dirhash-
> specific bugfix in the previous revision (1.53) now uncecessary,
> so this change removes it.
>
> Reviewed by: mckusick
> ----------------------------
> revision 1.53
> date: 2001/08/22 01:35:17; author: iedowse; state: Exp; lines: +2 -2
> When compressing directory blocks, the dirhash code didn't check
> that the directory entry was in use before attempting to find it
> in the hash structures to change its offset. Normally, unused
> entries do not need to be moved, but fsck can leave behind some
> unused entries that do. A dirhash sanity panic resulted when the
> entry to be moved was not found. Add a check that stops entries
> with d_ino == 0 from being passed to ufsdirhash_move().
- for structure fields that are conditionally present,
make those fields always present.
- for functions which are conditionally inline, make them never inline.
- remove some other functions which are conditionally defined but
don't actually do anything anymore.
- make a lock-debugging function conditional on only LOCKDEBUG.
as discussed on tech-kern some time back.
otherwise, once the corresponding bit in the inode bitmap is cleared,
an unrelated inode with the same inode number can be allocated and
ufs_ihashget() picks a stale in-core vnode for it.
PR/32301 by Matthias Scheler.
- rather than embedding bufq_state in driver softc,
have a pointer to the former.
- move bufq related functions from kern/subr_disk.c to kern/subr_bufq.c.
- rename method to strategy for consistency.
- move some definitions which don't need to be exposed to the rest of kernel
from sys/bufq.h to sys/bufq_impl.h.
(is it better to move it to kern/ or somewhere?)
- fix some obvious breakage in dev/qbus/ts.c. (not tested)
- Remove all NFS related stuff from file system specific code.
- Drop the vfs_checkexp hook and generalize it in the new nfs_check_export
function, thus removing redundancy from all file systems.
- Move all NFS export-related stuff from kern/vfs_subr.c to the new
file sys/nfs/nfs_export.c. The former was becoming large and its code
is always compiled, regardless of the build options. Using the latter,
the code is only compiled in when NFSSERVER is enabled. While doing this,
also make some functions in nfs_subs.c conditional to NFSSERVER.
- Add a new command in nfssvc(2), called NFSSVC_SETEXPORTSLIST, that takes a
path and a set of export entries. At the moment it can only clear the
exports list or append entries, one by one, but it is done in a way that
allows setting the whole set of entries atomically in the future (see the
comment in mountd_set_exports_list or in doc/TODO).
- Change mountd(8) to use the nfssvc(2) system call instead of mount(2) so
that it becomes file system agnostic. In fact, all this whole thing was
done to remove a 'XXX' block from this utility!
- Change the mount*, newfs and fsck* userland utilities to not deal with NFS
exports initialization; done internally by the kernel when initializing
the NFS support for each file system.
- Implement an interface for VFS (called VFS hooks) so that several kernel
subsystems can run arbitrary code upon receipt of specific VFS events.
At the moment, this only provides support for unmount and is used to
destroy NFS exports lists from the file systems being unmounted, though it
has room for extension.
Thanks go to yamt@, chs@, thorpej@, wrstuden@ and others for their comments
and advice in the development of this patch.
if the filesystem is not compiled in the kernel still links. Probably
a better solution is to use weak symbols.
- move the filesystem-specific itime macros to the filesystem header files.
from macros to real functions. Original patch and review from chuq.
Note: ext2fs only keeps seconds in the on-disk inode, and msdosfs does not
have enough precision for all fields, so this is not very useful for those
two.
has been written or not individually by (ab)using b_resid
in pcbp as a bitmap.
- add a comment to explain why it's needed.
PR/15364. reviewed by Chuck Silvers.
backing file per attribute type indexed by inode number to hold the extended
attributes.
This is working pretty well on my test systems, except for the "autostart"
feature. I need someone with a better handle on the VFS locking protocol
to go over that.
This is a work-in-progress. There are parts of this that could be re-factored
allowing this approach to be used on other types of file systems.
Adapted from FreeBSD.
to prevent unnecessary block allocations in the case that
page size > block size.
- ufs_balloc_range: use VM_PROT_WRITE+PGO_NOBLOCKALLOC rather than
VM_PROT_READ.