- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked (see the sketch at the
end of this entry). this flag is very similar to PG_RELEASED, but
unlike PG_RELEASED, PG_PAGEOUT can be cleared if the pageout fails
due to e.g. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.
The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than what a 1.5 kernel took.
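as a sketch of the PG_PAGEOUT handling mentioned above, unbusying a
page now amounts to roughly the following (the helper name is
hypothetical and locking is omitted):

    void
    uvm_page_unbusy_one(struct vm_page *pg)   /* hypothetical helper */
    {
            if (pg->flags & PG_WANTED)
                    wakeup(pg);
            if (pg->flags & (PG_RELEASED|PG_PAGEOUT)) {
                    /* free the page on behalf of whoever marked it */
                    pg->flags &= ~(PG_BUSY|PG_WANTED|PG_RELEASED|PG_PAGEOUT);
                    uvm_pagefree(pg);
            } else {
                    pg->flags &= ~(PG_BUSY|PG_WANTED);
            }
    }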
adjusted via sysctl. file systems that have hash tables which are
sized based on the value of this variable now resize those hash tables
using the new value. the max number of FFS softdeps is also recalculated.
convert various file systems to use the <sys/queue.h> macros for
their hash tables.
the lower layer needs to have control over that flag.
that didn't solve the whole problem that it was trying to solve anyway.
(the issue is that if we have created mappings to the lower layer,
we need to get rid of those when we copy the file to the upper layer.)
we'll have to figure out some other way to handle this.
between creation of a file descriptor and close(2) when using kernel
assisted threads. What we do is stick descriptors in the table, but
mark them as "larval". This causes essentially everything to treat
them as non-existent descriptors, except for fdalloc(), which sees a
filled slot so that it won't (incorrectly) allocate it again. When
a descriptor is fully constructed, the code that has constructed it
marks it as "mature" (which actually clears the "larval" flag), and
things continue to work as normal.
While here, gather all the code that gets a descriptor from the table
into an fd_getfile() function, and call it, rather than having the
same (sometimes incorrect) code copied all over the place.
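Roughly, the check that fd_getfile() centralizes is the following (a
sketch; the FILE_IS_USABLE() macro, which hides the "larval" flag
test, is quoted from memory):

    struct file *
    fd_getfile(struct filedesc *fdp, int fd)
    {
            struct file *fp;

            if ((u_int)fd >= fdp->fd_nfiles ||
                (fp = fdp->fd_ofiles[fd]) == NULL)
                    return (NULL);
            /* larval descriptors are treated as non-existent */
            if (!FILE_IS_USABLE(fp))
                    return (NULL);
            return (fp);
    }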
to resolve a write fault. fixes PR 13201.
also, be sure to allocate blocks for write faults to holes even if
the page is already in memory. fixes PR 13189.
- We need to skip PGO_DONTCARE pages also.
- ``npages'' returned by VOP_GETPAGES for the lower vp doesn't count
  those pages in this case, so just looping ``npages'' times is
  insufficient. Loop while there are real pages instead.
callers and appropriate routines to cope. This makes fo_stat more
consistent with the rest of the fileops routines and also makes
fo_stat match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.
the mapping is:
VM_PAGER_OK 0
VM_PAGER_BAD <unused>
VM_PAGER_FAIL <unused>
VM_PAGER_PEND 0 (see below)
VM_PAGER_ERROR EIO
VM_PAGER_AGAIN EAGAIN
VM_PAGER_UNLOCK EBUSY
VM_PAGER_REFAULT ERESTART
for async i/o requests, it used to be possible for the request to
be converted to sync, and the pager would return VM_PAGER_OK or VM_PAGER_PEND
to indicate whether the caller should perform post-i/o cleanup.
this is no longer allowed; pagers must now return 0 to indicate that
the async i/o was successfully started, and the caller never needs to
worry about doing the post-i/o cleanup.
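expressed as code, the mapping above is just the following (purely
illustrative, since the VM_PAGER_* constants are going away):

    static int
    pager_rv_to_errno(int rv)       /* illustrative helper only */
    {
            switch (rv) {
            case VM_PAGER_OK:
            case VM_PAGER_PEND:     /* async start now returns plain 0 */
                    return (0);
            case VM_PAGER_AGAIN:
                    return (EAGAIN);
            case VM_PAGER_UNLOCK:
                    return (EBUSY);
            case VM_PAGER_REFAULT:
                    return (ERESTART);
            case VM_PAGER_ERROR:
            default:                /* VM_PAGER_BAD/FAIL are unused */
                    return (EIO);
            }
    }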
setattr calls on underlying vnodes the same as sockets and just return 0.
This whole thing needs to be gutted and replaced with either fall throughs
to specfs (the attr forwarding is just bizarre and leads to weird crap like
the above truncation problems), or better yet a real cloning device node.
which we disallow creation of page cache pages) and its on-disk EOF
(which marks the offset at which there is not (yet) data on disk that
we need to read when creating pages). for requests with PGO_PASTEOF,
the in-memory EOF may be much larger than the on-disk EOF.
- in genfs_getpages(), unbusy any pages that we don't free in the error path.
- in genfs_putpages(), if we get a bmap error, record that in the master buf.
- in the cases where we skip over the i/o loop, increment npages by ridx
so that when the cleanup code starts processing the pgs array at index 0
it'll actually process all of the pages.
- process the PG_RELEASED flag when unbusying pages.
- add some missing MP locking.
- use MIN() and MAX() instead of min() and max(), since the latter are
functions which take arguments of type "int" but we call them with
values of type "off_t", so the values could be truncated (see the
example after this list).
- in the PGO_PASTEOF case, use the larger of the current file size and the
end of the requested range of pages as the file size for this request.
this fixes some problems with sparse writes at large offsets.
- in genfs_getpages() don't start read-ahead if we get an error on the
sync read, and always start read-ahead after the range of the sync read
if we do any at all.
- off-by-one error in genfs_size().
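the min()/max() truncation mentioned above, in miniature (assuming
the libkern versions that take plain integer arguments):

    off_t big = 0x100000000LL;  /* 4GB, does not fit in an int */

    (void)min(big, 1);  /* "big" is silently truncated; wrong result */
    (void)MIN(big, 1);  /* type-generic macro; evaluates to 1 */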
`struct vmspace' has a new field `vm_minsaddr' which is the user TOS.
PS_STRINGS is deprecated in favor of curproc->p_pstr which is derived
from `vm_minsaddr'.
Bump the kernel version number.
in the non-MULTIPROCESSOR case (LOCKDEBUG requires it). Scheduler
lock is held upon entry to mi_switch() and cpu_switch(), and
cpu_switch() releases the lock before returning.
Largely from Bill Sommerfeld, with some minor bug fixes and
machine-dependent code hacking from me.
int lf_advlock __P((struct lockf **,
off_t, caddr_t, int, struct flock *, int));
to
int lf_advlock __P((struct vop_advlock_args *, struct lockf **, off_t));
This matches common usage and is also compatible with a similar change
in FreeBSD (though they use u_quad_t as last arg).
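A caller such as ufs_advlock() then reduces to roughly this (inode
field names quoted from memory):

    int
    ufs_advlock(void *v)
    {
            struct vop_advlock_args *ap = v;
            struct inode *ip = VTOI(ap->a_vp);

            return (lf_advlock(ap, &ip->i_lockf, ip->i_size));
    }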
<vm/pglist.h> -> <uvm/uvm_pglist.h>
<vm/vm_inherit.h> -> <uvm/uvm_inherit.h>
<vm/vm_kern.h> -> into <uvm/uvm_extern.h>
<vm/vm_object.h> -> nothing
<vm/vm_pager.h> -> into <uvm/uvm_pager.h>
also includes a bunch of <vm/vm_page.h> include removals (due to redundancy
with <vm/vm.h>), and a scattering of other similar headers.
with procfs's cmdline - from the PR:
The cmdline implementation in procfs is bogus. It's possible that
part of the fix is a workaround of a UVM problem - that is, when
(internally) accessing the top of the process VM (the end of the
args) a request for I/O of a PAGE_SIZE'd block starting at less
than a PAGE_SIZE from the end of the mem space returns EINVAL
rather than the data that is available. Whether this is a bug
in UVM or not depends upon how it is defined to work, and I was
unable to determine that. (Simon Burge found that problem, and
provided the basis of the workaround/fix).
Then, the cmdline function is unable to read more than one
page of args, and a good thing too, as the way it is written
attempting to get more than that would reference into lala land.
And, on an attempt to read a lot of data when the above is
fixed, most of the data won't be returned, only the final block
of any read.
Tested on alpha, pmax, i386 and sparc.
a set of flags ("flags"). Two flags are defined, UPDATE_WAIT and
UPDATE_DIROP.
Under the old semantics, VOP_UPDATE would block if waitfor were set,
under the assumption that directory operations should be done
synchronously. At least LFS and FFS+softdep do not make this
assumption; FFS+softdep got around the problem by enclosing all relevant
calls to VOP_UPDATE in a "if(!DOINGSOFTDEP(vp))", while LFS simply
ignored waitfor, one of the reasons why NFS-serving an LFS filesystem
did not work properly.
Under the new semantics, the UPDATE_DIROP flag is a hint to the
fs-specific update routine that the call comes from a dirop routine, and
should be waited for, or not, accordingly.
Closes PR#8996.
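A leaf filesystem's update routine under the new semantics might look
like this sketch (the xxx_* helpers are hypothetical; the a_flags
field name follows the description above):

    int
    xxx_update(void *v)
    {
            struct vop_update_args *ap = v;
            int waitfor = (ap->a_flags & UPDATE_WAIT) != 0;

            /*
             * UPDATE_DIROP is only a hint that the call comes from a
             * directory operation; the filesystem decides whether
             * that implies a synchronous write.
             */
            if (ap->a_flags & UPDATE_DIROP)
                    waitfor |= xxx_dirops_are_sync();

            return (xxx_write_inode(ap->a_vp, ap->a_access, ap->a_modify,
                waitfor));
    }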
in vfs_detach(). vfs_done may free the filesystem's global resources,
typically those allocated in the respective filesystem's init function.
Needed so that filesystems which went in via LKM have a chance to
clean up after themselves before unloading. This fixes random panics
when an LKM for a filesystem using pools was loaded and unloaded several
times.
For each leaf filesystem, add appropriate vfs_done routine.
UFS code, and I forgot to rename the "ihash" variable, causing
weird effects, because 3/4 of the UFS hash table would become
unreachable after procfs was loaded as an LKM.
process is about to exec a sugid binary.
To speed things up, use hashing for vnode allocation, like other filesystems
do. This avoids walking the whole procfs node list in the revoke case too.
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.
Bump version number to 1.4O
not set, unlock the vnode before calling the device's close routine and
relock it after it returns. tty close routines will sleep waiting for
buffers to drain, which oftentimes won't happen as the other side needs
to grab the vnode lock first.
Make all unmount routines lock the device vnode before calling VOP_CLOSE().
Problem turned out to be due to improper handling of reads beyond EOF:
they should just return without error with the uio unchanged, and the
caller will recognize this as a zero-byte return (EOF).
The previous fix to protect directory reads against bogus uio_offset
values returned EINVAL, which broke mount -o union, which only
union'ed in the lower directory if the upper directory cleanly
returned EOF.
While we're here, protect kernfs as well.
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().
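genfs_fcntl() itself is about as simple as it sounds (a sketch; the
vop_fcntl_args field name is quoted from memory):

    int
    genfs_fcntl(void *v)
    {
            struct vop_fcntl_args *ap = v;

            if (ap->a_command == F_SETFL)
                    return (0);
            return (EOPNOTSUPP);
    }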
Reviewed by: thorpej
Tested by: wrstuden
exists is bogus. The goal here is to produce a synthetic link count
which won't confuse fts and similar routines which "know" that
directories with a link count of 2 don't have subdirectories (and
thus, they can avoid having to stat every entry in the directory
looking for subdirectories which aren't there).
We know that non-UNIX filesystem implementations may return a link
count of `1' for directories with an indeterminate number of
subdirectories; if either the upper or lower layer returns a link
count of `1', return a link count of 1. If both layers return a link
count of 2, return a link count of 2; otherwise, return the sum of the
link count of both layers.
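In code, the rule reads roughly as follows (variable names are just
for illustration):

    /* synthesize st_nlink for a union directory */
    if (upper_nlink == 1 || lower_nlink == 1)
            nlink = 1;      /* indeterminate number of subdirectories */
    else if (upper_nlink == 2 && lower_nlink == 2)
            nlink = 2;      /* no subdirectories in either layer */
    else
            nlink = upper_nlink + lower_nlink;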
Also, fix PR7430: unionfs ignores read-only mounts. Check for
MNT_RDONLY in union_lookup (more-or-less as in layer_lookup) as well
as union_access() and union_setattr().
Note that a read-only union layer may still cause side effects on the
underlying filesystems... Most notably, we'll still attempt to create
shadow directories in the upper layer. Also, of course, we'll
side-effect atimes in the lower layer.
"panic: lockmgr: using decommisioned lock"
(only if DIAGNOSTIC)
The problem turned out to be due to the way LK_DRAIN was (not) handled
in union_lock; it just got passed through to the lock on the upper
vnode (which got marked as decommissioned, instead of that happening
to the union vnode). When the upper vnode was next locked (typically
when it was released), it went kaboom.
and PID allocation MP-safe. A new process state is added: SDEAD. This
state indicates that a process is dead, but not yet a zombie (has not
yet been processed by the process reaper).
SDEAD processes exist on both the zombproc list (via p_list) and deadproc
(via p_hash; the proc has been removed from the pidhash earlier in the exit
path). When the reaper deals with a process, it changes the state to
SZOMB, so that wait4 can process it.
Add a P_ZOMBIE() macro, which treats a proc in SZOMB or SDEAD as a zombie,
and update various parts of the kernel to reflect the new state.
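The macro itself is just the following (quoted from memory; see
<sys/proc.h>):

    #define P_ZOMBIE(p) \
            ((p)->p_stat == SZOMB || (p)->p_stat == SDEAD)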
getnewvnode now checks this bit, and if it's set, makes sure a vnode's not
locked before removing it from the free list.
Closes PR 7954 by Alan Barrett <apb@iafrica.com>.
Update coda to new struct lock in struct vnode.
make fdescfs, kernfs, portalfs, and procfs actually lock their vnodes.
It's not that hard.
Make unionfs set v_vnlock = NULL so any overlayed fs will call its
VOP_LOCK.
the functionality of nullfs. The latter is now just a mount & unmount
routine, and a few tables. umapfs borrows most of this infrastructure.
Both fs's are now nfs-exportable.
All layered fs's share a common format for private mount & private
vnode structs (which a particular fs can extend).
Also add genfs_noerr_rele(), a vnode op which will vrele/vput
operand vnodes appropriately.
umap_unlock()) so as to not explicitly depend on nullfs being compiled
into the kernel.
umap_bypass won't be too slow as there are no credentials in these two ops
to need mapping.
count is 0, wait for use count to drain before finishing the close.
This is necessary in order for multiple processes to safely share file
descriptor tables.
deadlock in VOP_FSYNC() if the unreferenced vnode picked for
reclamation happened to be stacked on top of a vnode the process
already had locked. This could happen if the same filesystem was
accessed both through a union mount and directly; it seemed to happen
most frequently when the direct access was through NFS.
Avoid this deadlock by changing vinvalbuf to pass a new FSYNC_RECLAIM
flag bit to VOP_FSYNC() to indicate that a reclaim is in progress and
only a `shallow' fsync is necessary.
Do nothing in *_fsync() in umapfs, nullfs, and unionfs when
FSYNC_RECLAIM is set; the underlying vnodes will shortly be released
in *_reclaim and may be reclaimed (and fsync'ed) later.
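For the layered filesystems the change amounts to an early return,
roughly as in this sketch (LAYERTOLOWERVP() stands in for each fs's
own macro, and the a_flags field name and VOP_FSYNC() argument list
are quoted from memory):

    int
    layer_fsync(void *v)    /* stands in for null/umap/union fsync */
    {
            struct vop_fsync_args *ap = v;

            if (ap->a_flags & FSYNC_RECLAIM) {
                    /*
                     * A reclaim is in progress; the underlying vnode
                     * will shortly be released and can be fsync'ed
                     * when it is reclaimed itself, so do nothing.
                     */
                    return (0);
            }
            return (VOP_FSYNC(LAYERTOLOWERVP(ap->a_vp), ap->a_cred,
                ap->a_flags, ap->a_p));
    }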
Fix group mapping so members of group 0 get other group-ids mapped as well.
Avoid rename panic by checking (*this_vp_p) against NULLVP before
dereferencing it (same change as to NULLFS some time ago).
array of ARG_MAX size. ARG_MAX is currently 256k, which causes a rather
serious stack overflow (kernel stacks are not very large, usually 8k).
Fixes memory corruption problems observed after accessing /proc/1/cmdline
during tests. The problem in my case manifested itself as massive lossage
in ffs_sync(), resulting in a crash, and sometimes, pooched file systems.
XXX This could, and probably should, be rewritten to use a much smaller
temporary buffer, and a loop around uiomove().
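Such a rewrite could look roughly like this (procfs_read_args() is a
made-up helper standing in for however the argument data is actually
fetched from the target process):

    char buf[512];          /* small temporary buffer instead of ARG_MAX */
    size_t len;
    int error = 0;

    while (uio->uio_resid > 0 && error == 0) {
            len = procfs_read_args(p, buf, sizeof(buf), uio->uio_offset);
            if (len == 0)
                    break;          /* reached the end of the args */
            error = uiomove(buf, len, uio);
    }
    return (error);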
- Don't error out on P_SYSTEM or SZOMB processes; instead, do what ps(1)
would do, i.e. print the p_comm in parentheses.
- Use uvm_io() (or procfs_rwmem() if !UVM) to read the target process's
psstrings and argument vector. Using copyin() is problematic, because
it operates on the current process! That is, the old code would
always get the `cmdline' of the process reading the file, not that of
the target process.