direct and indirect block pointers are not valid in the case of shortlinks.
while i'm here, move duplicated code in lfs_vget/fastvget into a new
function, lfs_vinit.
Note however that blocks can be added to the Ifile even when the segment
block is held because of inodes' atime. Do not panic with "dirty blocks"
if these blocks are present.
be expanded to cover other per-fs and subsystem-wide data as well.
Fix a case of IN_MODIFIED being set without updating lfs_uinodes, resulting
in a "lfs_uinodes < 0" panic.
Fix a deadlock in lfs_putpages arising from the need to busy all pages in a
block; unbusy any that had already been busied before starting over.
always true) and accompanying dead code.
- When constructing write clusters in lfs_writeseg, if the block we are
about to add is itself a cluster from GOP_WRITE, don't put a cluster
in a cluster, just write the GOP_WRITE cluster on its own. This seems
to represent a slight performance gain on my test machine.
- Charge someone's rusage for writes on LFSes. It's difficult to tell
who the "right" process to charge is; just charge whoever triggered
the write.
where the cleaner is trying to write, instead of tying up the "live"
buffers (or pages).
Fix a bug in the LFS_UBC case where oversized buffers would not be
checksummed correctly, causing uncleanable segments.
Make sure that wakeup(fs->lfs_iocount) is done if fs->lfs_iocount is 1
as well as 0, since we wait in some places for it to drop to 1.
Activate all pages that make it into lfs_gop_write without the segment
lock held, since they must have been dirtied very recently, even if
PG_DELWRI is not set.
actually happens.
Add a new fcntl call that will write the minimum necessary to checkpoint
(i.e., for on-disk directory structure to be consistent, not including
updates to file data) so that the cleaner can clean segments more quickly
without sacrificing three-way commit for cleaning.
either as a mysterious UVM error or as "panic: dirty bufs". Verify
maximum size in lfs_malloc.
Teach lfs_updatemeta and lfs_shellsort about oversized cluster blocks from
lfs_gop_write.
When unwiring pages in lfs_gop_write, deactivate them, under the theory
that the pagedaemon wanted to free them last we knew.
(there are still some details to work out) but expect that to go
away soon. To support these basic changes (creation of lfs_putpages,
lfs_gop_write, mods to lfs_balloc) several other changes were made, to
wit:
* Create a writer daemon kernel thread whose purpose is to handle page
writes for the pagedaemon, but which also takes over some of the
functions of lfs_check(). This thread is started the first time an
LFS is mounted.
* Add a "flags" parameter to GOP_SIZE. Current values are
GOP_SIZE_READ, meaning that the call should return the size of the
in-core version of the file, and GOP_SIZE_WRITE, meaning that it
should return the on-disk size. One of GOP_SIZE_READ or
GOP_SIZE_WRITE must be specified.
* Instead of using malloc(...M_WAITOK) for everything, reserve enough
resources to get by and use malloc(...M_NOWAIT), using the reserves if
necessary. Use the pool subsystem for structures small enough that
this is feasible. This also obsoletes LFS_THROTTLE.
And a few that are not strictly necessary:
* Moves the LFS inode extensions off onto a separately allocated
structure; getting closer to LFS as an LKM. "Welcome to 1.6O."
* Unified GOP_ALLOC between FFS and LFS.
* Update LFS copyright headers to correct values.
* Actually cast to unsigned in lfs_shellsort, like the comment says.
* Keep track of which segments were empty before the previous
checkpoint; any segments that pass two checkpoints both dirty and
empty can be summarily cleaned. Do this. Right now lfs_segclean
still works, but this should be turned into an effectless
compatibility syscall.
malloc types into a structure, a pointer to which is passed around,
instead of an int constant. Allow the limit to be adjusted when the
malloc type is defined, or with a function call, as suggested by
Jonathan Stone.
(backout parts of rev.1.40)
otherwise, directory structures can be corrupted because checkpoints can
occur via eg. lfs_vflush before parent directory is written.
- move calls to softdep_setup_pagecache() (which can sleep to allocate
memory) outside the softdep lock.
- replace the softdep_flush_indir() hack (which tries to find another
vnode to fsync when we are holding lots of buffer-cache buffers locked
for long periods of time) with softdep_trackbufs() (which just kicks
the syncer and sleeps under the same circumstances). the former method
had a lock-ordering problem which would occasionally deadlock.
- relax the assertion in softdep_sync_metadata() which says that we should
never see D_ALLOCDIRECT deps for VREG vnodes. it's ok to see those
attached to indirect blocks.
also, there's no need to splbio() while allocating the buffer headers
to which pagecache dependencies are attached, so remove that.
fixes all the problems in PR 19288.
(backout rev.1.74)
it seems that there's no need to do it (anymore?) and LFS has trouble with it.
(VNON vnodes marked VDIROP will never reclaimed)
ok'ed by Frank van der Linden.
try to reclaim them.
(workaround for deadlock noted in the comment in lfs_reserveavail)
- in lfs_rename, mark vnodes which are being moved as well as directry vnodes.
- don't wait for locking buf in lfs_bwrite_ext to avoid deadlocks.
- skip lfs_reserve when we're doing dirop.
reserve more (for lfs_truncate) in set_dirop instead.
this mostly solves PR 18972. (and hopefully PR 19196)
mark inode IN_CLEANING rather then IN_MODIFIED.
otherwise cleaned (indirect) blocks belongs to the inode isn't written
until next sync.
- add assertions.
reading blocks that isn't written yet.
it's needed because we'll update metadatas in lfs_updatemeta
before data pointed by them is actually written to disk.
XXX should be solved with fake inode/indirect blocks instead?
kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals
kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)
based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe
This is the bulk of PR #17345
The general approach is to use a run time deteriminable value
for DIRBLKSIZ. Additional allowances are included for using
MAXSYMLINKLEN with FS_42INODEFMT and a shift in the cylinder group
cluster summary count array. Support is added for managing
the Apple UFS volume label.
This merge changes the device switch tables from static array to
dynamically generated by config(8).
- All device switches is defined as a constant structure in device drivers.
- The new grammer ``device-major'' is introduced to ``files''.
device-major <prefix> char <num> [block <num>] [<rules>]
- All device major numbers must be listed up in port dependent majors.<arch>
by using this grammer.
- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.
- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.
- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.
- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.
exist on an on-disk inode, we keep a record of its size in struct inode,
which is updated when we write the block to disk. The cleaner routines
thus have ready access to what size is the correct size for this block,
on disk.
Fixed a related bug: if a file with fragments is being cleaned
(fragments being cleaned) at the same time it is being extended beyond
NDADDR blocks, we could write a bogus FINFO record that has a frag in the
middle; when it was cleaned this would give back bogus file data. Don't
write the indirect blocks in this case, since there is no need.
lfs_fragextend and lfs_truncate no longer require the seglock, but instead
take a shared lock, which the seglock locks exclusively.
processes don't have to wait for one another to finish (e.g., nfsd seems
to be a little happier now, though I haven't measured the difference).
Synchronous checkpoints, however, must always wait for all i/o to finish.
Take the contents of the callback functions and have them run in thread
context instead (aiodoned thread). lfs_iocount no longer has to be
protected in splbio(), and quite a bit less of the segment construction
loop needs to be in splbio() as well.
If lfs_markv is handed a block that is not the correct size according to
the inode, refuse to process it. (Formerly it was extended to the "correct"
size.) This is possibly more prone to deadlock, but less prone to corruption.
lfs_segclean now outright refuses to clean segments that appear to have live
bytes in them. Again this may be more prone to deadlock but avoids
corruption.
Replace ufsspec_close and ufsfifo_close with LFS equivalents; this means
that no UFS functions need to know about LFS_ITIMES any more. Remove
the reference from ufs/inode.h.
Tested on i386, test-compiled on alpha.
this enables one to recover data from a failing disk (where the read failure
is a hardware problem) while avoiding corrupting the fs further (in the case
where the read failure is due to a misconfiguration).
as well as bi_daddr. This lets the cleaner have an idea of what the size
of this block was at the time it was written without having to refer to
a segment header (e.g., in the file coalescing case).
Tested on i386.
enough to be useful, and broadening it so that it did would have meant
that operations possibly requiring synchronous disk activity would have
to be done in splbio(). This clearly was not going to work.
Worked around this in the LFS case by having lfs_cluster_callback put an
extra hold on the vnode before calling biodone(), and taking the hold
off without HOLDRELE's problematic list swapping. lfs_vunref() will take
care of that---in thread context---on the next write if need be.
Also, ensure that the list walking in lfs_{writevnodes,segunlock,gather}
takes into account the possibility that the list may change
underneath it (possibly because it itself deleted an element).
Tested on i386, test-compiled on alpha.
I found while making sure there weren't any new ones.
* Make the write clusters keep track of the buffers whose blocks they contain.
This should make it possible to (1) write clusters using a page mapping
instead of malloc, if desired, and (2) schedule blocks for rewriting
(somewhere else) if a write error occurs. Code is present to use
pagemove() to construct the clusters but that is untested and will go away
anyway in favor of page mapping.
* DEBUG now keeps a log of Ifile writes, so that any lingering instances of
the "dirty bufs" problem can be properly debugged.
* Keep track of whether the Ifile has been dirtied by various routines that
can be called by lfs_segwrite, and loop on that until it is clean, for
a checkpoint. Checkpoints need to be squeaky clean.
* Warn the user (once) if the Ifile grows larger than is reasonable for their
buffer cache. Both lfs_mountfs and lfs_unmount check since the Ifile can
grow.
* If an inode is not found in a disk block, try rereading the block, under
the assumption that the block was copied to a cluster and then freed.
* Protect WRITEINPROG() with splbio() to fix a hang in lfs_update.