loops where vnodes can get removed or added during the loops. This could
lead to panic's on unmount since nodes are skipped or otherwise
TAILQ_NEXT(0xdeadbeef, ...) was dereferenced.
vnodes were synced and processed backwards. This meant that the last
accessed node was processed first and the earlierst last.
An extra benefit is the removal of the ugly hack from the Berkly days on
LFS.
In the proces, i've also replaced the various variations hand written loops
by the TAILQ_FOREACH() macro's.
be set. Linux NFS servers (at least) reset suid/sgid bits if a write
happens afterwards. Add a comment why this is done.
This fixes system builds on diskless systems for me where suid bits
were missing after install(1).
Approved by yamt.
that callers are not responsible for initializing the fields. Store the name
inside the struct instead of maintaining a pointer to external storage, or
leaked memory (nfs case).
While touching all vptofh/fhtovp functions, get rid of VFS_MAXFIDSIZ,
version the getfh(2) syscall and explicitly pass the size available in
the filehandle from userland.
Discussed on tech-kern, with lots of help from yamt (thanks!).
make sillyrename (try to) use LINK operation rather than RENAME.
PR/33861 from Jed Davis. he provided the almost same patch.
according to him, it also happen to be what opensolaris does in this case.
from the PR:
> In nfs_rename(), if the destination appears to exist and is "in use"
> (this check is apparently satisfied even if the file isn't in use by
> anything except the rename itself), it will sillyrename it, then delete
> the sillyrenamed file even if the rename fails -- for instance, because
> the "from" file no longer exists on the server.
> mkdir a b; touch a/x; perl -e 'fork(); rename("a/x","b/x") or die "$!\n"'
>
> Afterwards, neither a/x nor b/x will exist.
> 1) Lookup of b/x; fails with NOENT.
> 2) Rename from a/x to b/x; succeeds.
> 3) Lookup of b/x; fails with NOENT.
> 4) Rename from b/x to b/.nfsA23a3; succeeds.
> 5) Rename from a/x to b/x; fails with NOENT.
> 6) Remove of b/.nfsA23a3; succeeds.
- struct timeval time is gone
time.tv_sec -> time_second
- struct timeval mono_time is gone
mono_time.tv_sec -> time_uptime
- access to time via
{get,}{micro,nano,bin}time()
get* versions are fast but less precise
- support NTP nanokernel implementation (NTP API 4)
- further reading:
Timecounter Paper: http://phk.freebsd.dk/pubs/timecounter.pdf
NTP Nanokernel: http://www.eecis.udel.edu/~mills/ntp/html/kern.html
My understanding is that the CLRSIG() is supposed to clear the signal
that was sent to the syncer process to prevent it from being delivered
to the syncer process in case unmounting fails, so that the syncer process
does not die while the filesystem is still mounted. The typical scenario
is, the syncher process is tsleep()ing in the kernel, and waking up when
it needs to do work. If someone sends a signal to it, eg. kill -TERM
the mfs process, then the kernel will try to unmount the mfs filesystem
before delivering the signal to the process. If that unmount fails, then
we should not really kill the process because that will hang the mount.
So we call CLRSIG() to stop the signal from being delivered.
So the first call to issignal() will return the signal number that was
sent to the syncer process (unless someone malicious was able to send
a lower numbered signal between the time tsleep() returned and we called
issignal()... something that is not really easy to do). But you are
right, we should not be calling it many times as a side effect of this
macro.
Rewrite CLRSIG() clear all the signals and call issignal() the correct
number of times.
- remove GOP_SIZE_READ/GOP_SIZE_WRITE flags.
they have not been used since the change.
- ufs_balloc_range: remove code which has been no-op since the change.
thanks Konrad Schroder for explaining the original intention of the code.
- ffs_gop_size: don't extend past eof, in the case of GOP_SIZE_MEM.
otherwise genfs_getpages end up to allocate pages past eof unnecessarily.
- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.
- don't bother to take nfs_sndlock when doing nfsrv_rcv.
unlike client, we never reconnect.
- nfsrv_getstream: fix the case that m_split sleeps.
- free socket in nfsrv_slpderef rather than nfsrv_zapsock.
fix race with nfssvc_nfsd.
- while i'm here, remove NFSD_WAITING and NFSD_REQINPROG
as they are redundant.
- some comments and assertions.
- use LK_CANRECURSE instead of LK_RECURSEFAIL.
PR/32435 from Valeriy E. Ushakov.
- panic explicitly if the parent directory has been revoked.
add an XXX comment.
Solves a crash when mounting NFS shares. (The proc parameter used before
the conversion to lwp's was NULL too, so the addition of 'l->l_proc' in the
code was extra.)
shortcut to the process of the passed lwp paniced the kernel since lwp
could/can be passwd as NULL in VOP_WRITE().
This was happening when ktracing to NFS. The function ktrwrite() set the
uio_lwp to NULL and then calls VOP_WRITE() with this argument. nfs_write()
then accessed lwp *l->l_proc wich paniced.
Thanks to David Laight for his help on tracking it down.
nfs errors are chosen to be the same as errno, some of them are not and
it is better for portability to do the conversion anyway. Also a server
can return a bad error number that can cause the server to crash, because
it can have the high bits that are used internally set. This was the case
with amd. Finally nfs_request() should return a valid errno, because we
can return a bogus value to userland. Thanks to rpaulo for debugging this.
- Remove all NFS related stuff from file system specific code.
- Drop the vfs_checkexp hook and generalize it in the new nfs_check_export
function, thus removing redundancy from all file systems.
- Move all NFS export-related stuff from kern/vfs_subr.c to the new
file sys/nfs/nfs_export.c. The former was becoming large and its code
is always compiled, regardless of the build options. Using the latter,
the code is only compiled in when NFSSERVER is enabled. While doing this,
also make some functions in nfs_subs.c conditional to NFSSERVER.
- Add a new command in nfssvc(2), called NFSSVC_SETEXPORTSLIST, that takes a
path and a set of export entries. At the moment it can only clear the
exports list or append entries, one by one, but it is done in a way that
allows setting the whole set of entries atomically in the future (see the
comment in mountd_set_exports_list or in doc/TODO).
- Change mountd(8) to use the nfssvc(2) system call instead of mount(2) so
that it becomes file system agnostic. In fact, all this whole thing was
done to remove a 'XXX' block from this utility!
- Change the mount*, newfs and fsck* userland utilities to not deal with NFS
exports initialization; done internally by the kernel when initializing
the NFS support for each file system.
- Implement an interface for VFS (called VFS hooks) so that several kernel
subsystems can run arbitrary code upon receipt of specific VFS events.
At the moment, this only provides support for unmount and is used to
destroy NFS exports lists from the file systems being unmounted, though it
has room for extension.
Thanks go to yamt@, chs@, thorpej@, wrstuden@ and others for their comments
and advice in the development of this patch.
of curproc (where uio->uio_procp should be used?). Don't do this
for nfs_commit(), because yamt says it is possibly wrong.
2. nfs_doio() does not use struct proc; remove it and the code to compute it.
3. use copyin_proc() and copyout_proc() instead of copyin() and copyout().
4. check return of copyout_proc(). and mark return from copyin_proc() XXX
5. Eliminate check p == curproc assertion check from nfs_write;
nfs_read does not have it and we might be called in a different
process context anyway (PR 20138).
into the "vfsops" link set.
- Use VFS_ATTACH() where vfsops are declared for individual file systems.
- In vfsinit(), traverse the "vfsops" link set, rather than vfs_list_initial[].
this means we can no longer look at the vnode size to determine how many
pages to request in a fault, which is good since for NFS the size can change
out from under us on the server anyway. there's also a new flag UBC_UNMAP
for ubc_release(), so that the file system code can make the decision about
whether to cache mappings for files being used as executables.
for i/o requests which are expected not to fail due to permission
to mimic unix file open semantics (READ, WRITE, COMMIT),
try two credentials. namely, the file owner's one and open time one.
remember which credential worked in per-file basis and try it first
next time to minimize number of retries.
ideas from Chuck Silvers. PR/23716 and PR/24987.
* Rather than using mnt_maxsymlinklen to indicate that a file systems returns
d_type fields(!), add a new internal flag, IMNT_DTYPE.
Add 3 new elements to ufsmount:
* um_maxsymlinklen, replaces mnt_maxsymlinklen (which never should have existed
in the first place).
* um_dirblksiz, which tracks the current directory block size, eliminating the
FS-specific checks littered throughout the code. This may be used later to
make the block size variable.
* um_maxfilesize, which is the maximum file size, possibly adjusted lower due
to implementation issues.
Sync some bug fixes from FFS into ext2fs, particularly:
* ffs_lookup.c 1.21, 1.28, 1.33, 1.48
* ffs_inode.c 1.43, 1.44, 1.45, 1.66, 1.67
* ffs_vnops.c 1.84, 1.85, 1.86
Clean up some crappy pointer frobnication.
- catchup to in_ifaddr -> in_ifaddrhead rename.
XXX the address on the top of in_ifaddrhead is likely 127.0.0.1.
using it to construct the verifier doesn't make much sense.
maybe it's better to use some uuid or ip_randomid-like method.
- "intrusive" dirops now have more chances to get benefits from dnlc.
- fixes a deadlock due to vnode locking order inversion.
nfs_create and others: purge stale dnlc entries
as nfs_lookup() no longer does it automatically.
for consistency with M_FREE() and m_freem(). Affected files:
sys/mbuf.h
kern/uipc_socket2.c
kern/uipc_mbuf.c
net/if_ethersubr.c
netatalk/ddp_input.c
nfs/nfs_socket.c
- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.
Welcome to 2.0F.
Approved by: Jason R. Thorpe <thorpej@netbsd.org>
warning is triggered pervasively, so print it only once per boot.
(The callers who pass NULL r_procps should soon be fixed to pass a
valid struct proc* ).
otherwise we can be stuck in soreceive forever.
the problem is pointed by Minoura Makoto. PR/25662
- clear r_rexmit on reconnect and clear r_rtt and R_TIMING on retransmit
so that the above (and soft mounts) happy.
Add a new explicit `struct proc *p' argument to socreate(), sosend().
Use that argument instead of curproc. Follow-on changes to pass that
argument to socreate(), sosend(), and (*so->so_send)() calls.
These changes reviewed and independently recoded by Matt Thomas.
Changes to soreceive() and (*dom->dom_exernalize() from Matt Thomas:
pass soreceive()'s struct uio* uio->uio_procp to unp_externalize().
Eliminate curproc from unp_externalize. Also, now soreceive() uses
its uio->uio_procp value, pass that same value downward to
((pr->pru_usrreq)() calls for consistency, instead of (struct proc * )0.
Similar changes in sys/nfs to eliminate (most) uses of curproc,
either via the req-> r_procp field of a struct nfsreq *req argument,
or by passing down new explicit struct proc * arguments.
Reviewed by: Matt Thomas, posted to tech-kern.
NB: The (*pr->pru_usrreq)() change should be tested on more (all!) protocols.