- IO_SYNC: don't bother to flush dirty pages before copying data from
user buffer.
- IO_APPEND: don't invalidate pages blindly. PR/28472 from Brian Marcotte.
- as far as i understand the code, it shouldn't be necessary
because nfs_request can't return without removing its request
and r->r_lwp is either curlwp or NULL.
- even if it's necessary, leaking requests is not the correct way
to recover from the condition.
- nfs_request: add a related assertion.
received won't be stuck in nfs_receive.
- nfs_rcvlock: check exceptions before sleeping on the lock.
- nfs_rcvunlock: use cv_broadcast rather than cv_signal to ensure that
lwps which received its reply get woken up.
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a processes emulation root is saved in the cwdi structure
during process exec.
If the emulation root the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root dircetory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
- don't use SAVESTART in calls to relookup() from unionfs,
just vref() the desired vnode when we need to.
- fix locking and refcounting in the unionfs EEXIST error cases.
- release any vnode locks before calling VFS_ROOT(), vfs_busy() is enough.
this allows us to simplify union_root() and fix PR 3006.
- union_lock() doesn't handle shared lock requests correctly,
so convert them to exclusive instead. fixes PR 34775.
- in relookup(), avoid reusing "dp" for different purposes,
the error handling wasn't right. (actually just get rid of dp.)
also, change relookup() to ignore LOCKLEAF and always return the
vnode locked since the callers already expect this.
returned from nfs_namei() is not locked in this case. fixes PR 35542.
also, apply the same fix here as was made in rename_files() to handle
the case when dvp and vp are the same.
by Slava Semushin <slava.semushin@gmail.com>.
To verify that no nasty side effects of duplicate includes (or their
removal) have an effect here, I've compiled an i386/ALL kernel with
and without the patch, and the only difference in the resulting .o
files was in shifted line numbers in some assert() calls.
The comparison of the .o files was based on the output of "objdump -D".
Thanks to martin@ for the input on testing.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.
Implemented for file systems of type ffs.
The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.
Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.
Welcome to 4.99.9 (new vfs op vfs_suspendctl).
- LOCKPARENT is no longer relevant for lookup(), relookup() or VOP_LOOKUP().
these now always return the parent vnode locked. namei() works as before.
lookup() and various other paths no longer acquire vnode locks in the
wrong order via vrele(). fixes PR 32535.
as a nice side effect, path lookup is also up to 25% faster.
- the above allows us to get rid of PDIRUNLOCK.
- also get rid of WANTPARENT (just use LOCKPARENT and unlock it).
- remove an assumption in layer_node_find() that all file systems implement
a recursive VOP_LOCK() (unionfs doesn't).
- require that all file systems supply vfs_vptofh and vfs_fhtovp routines.
fill in eopnotsupp() for file systems that don't support being exported
and remove the checks for NULL. (layerfs calls these without checking.)
- in union_lookup1(), don't change refcounts in the ISDOTDOT case, just
adjust which vnode is locked. fixes PR 33374.
- apply fixes for ufs_rename() from ufs_vnops.c rev. 1.61 to ext2fs_rename().
loops where vnodes can get removed or added during the loops. This could
lead to panic's on unmount since nodes are skipped or otherwise
TAILQ_NEXT(0xdeadbeef, ...) was dereferenced.
vnodes were synced and processed backwards. This meant that the last
accessed node was processed first and the earlierst last.
An extra benefit is the removal of the ugly hack from the Berkly days on
LFS.
In the proces, i've also replaced the various variations hand written loops
by the TAILQ_FOREACH() macro's.
be set. Linux NFS servers (at least) reset suid/sgid bits if a write
happens afterwards. Add a comment why this is done.
This fixes system builds on diskless systems for me where suid bits
were missing after install(1).
Approved by yamt.
that callers are not responsible for initializing the fields. Store the name
inside the struct instead of maintaining a pointer to external storage, or
leaked memory (nfs case).
While touching all vptofh/fhtovp functions, get rid of VFS_MAXFIDSIZ,
version the getfh(2) syscall and explicitly pass the size available in
the filehandle from userland.
Discussed on tech-kern, with lots of help from yamt (thanks!).
make sillyrename (try to) use LINK operation rather than RENAME.
PR/33861 from Jed Davis. he provided the almost same patch.
according to him, it also happen to be what opensolaris does in this case.
from the PR:
> In nfs_rename(), if the destination appears to exist and is "in use"
> (this check is apparently satisfied even if the file isn't in use by
> anything except the rename itself), it will sillyrename it, then delete
> the sillyrenamed file even if the rename fails -- for instance, because
> the "from" file no longer exists on the server.
> mkdir a b; touch a/x; perl -e 'fork(); rename("a/x","b/x") or die "$!\n"'
>
> Afterwards, neither a/x nor b/x will exist.
> 1) Lookup of b/x; fails with NOENT.
> 2) Rename from a/x to b/x; succeeds.
> 3) Lookup of b/x; fails with NOENT.
> 4) Rename from b/x to b/.nfsA23a3; succeeds.
> 5) Rename from a/x to b/x; fails with NOENT.
> 6) Remove of b/.nfsA23a3; succeeds.
- struct timeval time is gone
time.tv_sec -> time_second
- struct timeval mono_time is gone
mono_time.tv_sec -> time_uptime
- access to time via
{get,}{micro,nano,bin}time()
get* versions are fast but less precise
- support NTP nanokernel implementation (NTP API 4)
- further reading:
Timecounter Paper: http://phk.freebsd.dk/pubs/timecounter.pdf
NTP Nanokernel: http://www.eecis.udel.edu/~mills/ntp/html/kern.html
My understanding is that the CLRSIG() is supposed to clear the signal
that was sent to the syncer process to prevent it from being delivered
to the syncer process in case unmounting fails, so that the syncer process
does not die while the filesystem is still mounted. The typical scenario
is, the syncher process is tsleep()ing in the kernel, and waking up when
it needs to do work. If someone sends a signal to it, eg. kill -TERM
the mfs process, then the kernel will try to unmount the mfs filesystem
before delivering the signal to the process. If that unmount fails, then
we should not really kill the process because that will hang the mount.
So we call CLRSIG() to stop the signal from being delivered.
So the first call to issignal() will return the signal number that was
sent to the syncer process (unless someone malicious was able to send
a lower numbered signal between the time tsleep() returned and we called
issignal()... something that is not really easy to do). But you are
right, we should not be calling it many times as a side effect of this
macro.
Rewrite CLRSIG() clear all the signals and call issignal() the correct
number of times.
- remove GOP_SIZE_READ/GOP_SIZE_WRITE flags.
they have not been used since the change.
- ufs_balloc_range: remove code which has been no-op since the change.
thanks Konrad Schroder for explaining the original intention of the code.
- ffs_gop_size: don't extend past eof, in the case of GOP_SIZE_MEM.
otherwise genfs_getpages end up to allocate pages past eof unnecessarily.
- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.
- don't bother to take nfs_sndlock when doing nfsrv_rcv.
unlike client, we never reconnect.
- nfsrv_getstream: fix the case that m_split sleeps.
- free socket in nfsrv_slpderef rather than nfsrv_zapsock.
fix race with nfssvc_nfsd.
- while i'm here, remove NFSD_WAITING and NFSD_REQINPROG
as they are redundant.
- some comments and assertions.
- use LK_CANRECURSE instead of LK_RECURSEFAIL.
PR/32435 from Valeriy E. Ushakov.
- panic explicitly if the parent directory has been revoked.
add an XXX comment.
Solves a crash when mounting NFS shares. (The proc parameter used before
the conversion to lwp's was NULL too, so the addition of 'l->l_proc' in the
code was extra.)
shortcut to the process of the passed lwp paniced the kernel since lwp
could/can be passwd as NULL in VOP_WRITE().
This was happening when ktracing to NFS. The function ktrwrite() set the
uio_lwp to NULL and then calls VOP_WRITE() with this argument. nfs_write()
then accessed lwp *l->l_proc wich paniced.
Thanks to David Laight for his help on tracking it down.
nfs errors are chosen to be the same as errno, some of them are not and
it is better for portability to do the conversion anyway. Also a server
can return a bad error number that can cause the server to crash, because
it can have the high bits that are used internally set. This was the case
with amd. Finally nfs_request() should return a valid errno, because we
can return a bogus value to userland. Thanks to rpaulo for debugging this.
- Remove all NFS related stuff from file system specific code.
- Drop the vfs_checkexp hook and generalize it in the new nfs_check_export
function, thus removing redundancy from all file systems.
- Move all NFS export-related stuff from kern/vfs_subr.c to the new
file sys/nfs/nfs_export.c. The former was becoming large and its code
is always compiled, regardless of the build options. Using the latter,
the code is only compiled in when NFSSERVER is enabled. While doing this,
also make some functions in nfs_subs.c conditional to NFSSERVER.
- Add a new command in nfssvc(2), called NFSSVC_SETEXPORTSLIST, that takes a
path and a set of export entries. At the moment it can only clear the
exports list or append entries, one by one, but it is done in a way that
allows setting the whole set of entries atomically in the future (see the
comment in mountd_set_exports_list or in doc/TODO).
- Change mountd(8) to use the nfssvc(2) system call instead of mount(2) so
that it becomes file system agnostic. In fact, all this whole thing was
done to remove a 'XXX' block from this utility!
- Change the mount*, newfs and fsck* userland utilities to not deal with NFS
exports initialization; done internally by the kernel when initializing
the NFS support for each file system.
- Implement an interface for VFS (called VFS hooks) so that several kernel
subsystems can run arbitrary code upon receipt of specific VFS events.
At the moment, this only provides support for unmount and is used to
destroy NFS exports lists from the file systems being unmounted, though it
has room for extension.
Thanks go to yamt@, chs@, thorpej@, wrstuden@ and others for their comments
and advice in the development of this patch.
of curproc (where uio->uio_procp should be used?). Don't do this
for nfs_commit(), because yamt says it is possibly wrong.
2. nfs_doio() does not use struct proc; remove it and the code to compute it.
3. use copyin_proc() and copyout_proc() instead of copyin() and copyout().
4. check return of copyout_proc(). and mark return from copyin_proc() XXX
5. Eliminate check p == curproc assertion check from nfs_write;
nfs_read does not have it and we might be called in a different
process context anyway (PR 20138).
into the "vfsops" link set.
- Use VFS_ATTACH() where vfsops are declared for individual file systems.
- In vfsinit(), traverse the "vfsops" link set, rather than vfs_list_initial[].
this means we can no longer look at the vnode size to determine how many
pages to request in a fault, which is good since for NFS the size can change
out from under us on the server anyway. there's also a new flag UBC_UNMAP
for ubc_release(), so that the file system code can make the decision about
whether to cache mappings for files being used as executables.
for i/o requests which are expected not to fail due to permission
to mimic unix file open semantics (READ, WRITE, COMMIT),
try two credentials. namely, the file owner's one and open time one.
remember which credential worked in per-file basis and try it first
next time to minimize number of retries.
ideas from Chuck Silvers. PR/23716 and PR/24987.
* Rather than using mnt_maxsymlinklen to indicate that a file systems returns
d_type fields(!), add a new internal flag, IMNT_DTYPE.
Add 3 new elements to ufsmount:
* um_maxsymlinklen, replaces mnt_maxsymlinklen (which never should have existed
in the first place).
* um_dirblksiz, which tracks the current directory block size, eliminating the
FS-specific checks littered throughout the code. This may be used later to
make the block size variable.
* um_maxfilesize, which is the maximum file size, possibly adjusted lower due
to implementation issues.
Sync some bug fixes from FFS into ext2fs, particularly:
* ffs_lookup.c 1.21, 1.28, 1.33, 1.48
* ffs_inode.c 1.43, 1.44, 1.45, 1.66, 1.67
* ffs_vnops.c 1.84, 1.85, 1.86
Clean up some crappy pointer frobnication.
- catchup to in_ifaddr -> in_ifaddrhead rename.
XXX the address on the top of in_ifaddrhead is likely 127.0.0.1.
using it to construct the verifier doesn't make much sense.
maybe it's better to use some uuid or ip_randomid-like method.
- "intrusive" dirops now have more chances to get benefits from dnlc.
- fixes a deadlock due to vnode locking order inversion.
nfs_create and others: purge stale dnlc entries
as nfs_lookup() no longer does it automatically.
for consistency with M_FREE() and m_freem(). Affected files:
sys/mbuf.h
kern/uipc_socket2.c
kern/uipc_mbuf.c
net/if_ethersubr.c
netatalk/ddp_input.c
nfs/nfs_socket.c
- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.
Welcome to 2.0F.
Approved by: Jason R. Thorpe <thorpej@netbsd.org>