otherwise generate a UVM trap or access random memory. This is due to
the dereference of vp->v_specmountpoint, which is really
vp->v_specinfo->si_mountpoint. The field v_specinfo is multiplexed with
other structs, such as struct socket, in the vun union in struct vnode.
The patch adds a sanity check for accessing the specinfo fields by only
allowing VBLK nodes to be passed. In theory VCHR could also be valid since
it's also a special node, though mounting is only done on VBLK, so be strict.
Ok'd by yamt.
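A minimal sketch of the kind of check described above, assuming simplified
stand-ins for the kernel structures. The helper name check_not_mounted() and
the reduced struct layouts are hypothetical and only illustrate why the vnode
type must be tested before the union member is dereferenced:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; not the real definitions. */
enum vtype { VNON, VREG, VDIR, VBLK, VCHR, VSOCK };

struct mount { int mnt_dummy; };
struct specinfo { struct mount *si_mountpoint; };
struct socket { int so_dummy; };

struct vnode {
	enum vtype v_type;
	union {				/* valid member depends on v_type */
		struct socket *vu_socket;
		struct specinfo *vu_specinfo;
	} v_un;
};

/*
 * Hypothetical helper: EINVAL unless vp is a block device, EBUSY if
 * something is already mounted on it, 0 otherwise.
 */
static int
check_not_mounted(const struct vnode *vp)
{
	if (vp->v_type != VBLK)
		return EINVAL;		/* v_un does not hold a specinfo */
	if (vp->v_un.vu_specinfo->si_mountpoint != NULL)
		return EBUSY;
	return 0;
}
```

Without the type test, a VSOCK vnode would have its socket pointer
reinterpreted as a specinfo pointer, which is the random memory access the
message refers to.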
- for structure fields that are conditionally present,
make those fields always present.
- for functions which are conditionally inline, make them never inline.
- remove some other functions which are conditionally defined but
don't actually do anything anymore.
- make a lock-debugging function conditional on only LOCKDEBUG.
as discussed on tech-kern some time back.
otherwise, once the corresponding bit in the inode bitmap is cleared,
an unrelated inode with the same inode number can be allocated and
ufs_ihashget() picks a stale in-core vnode for it.
PR/32301 by Matthias Scheler.
- Remove all NFS related stuff from file system specific code.
- Drop the vfs_checkexp hook and generalize it in the new nfs_check_export
function, thus removing redundancy from all file systems.
- Move all NFS export-related stuff from kern/vfs_subr.c to the new
file sys/nfs/nfs_export.c. The former was becoming large and its code
is always compiled, regardless of the build options. Using the latter,
the code is only compiled in when NFSSERVER is enabled. While doing this,
also make some functions in nfs_subs.c conditional on NFSSERVER.
- Add a new command in nfssvc(2), called NFSSVC_SETEXPORTSLIST, that takes a
path and a set of export entries. At the moment it can only clear the
exports list or append entries, one by one, but it is done in a way that
allows setting the whole set of entries atomically in the future (see the
comment in mountd_set_exports_list or in doc/TODO).
- Change mountd(8) to use the nfssvc(2) system call instead of mount(2) so
that it becomes file system agnostic. In fact, this whole thing was
done to remove a 'XXX' block from this utility!
- Change the mount*, newfs and fsck* userland utilities to not deal with NFS
exports initialization; this is now done internally by the kernel when
initializing the NFS support for each file system.
- Implement an interface for VFS (called VFS hooks) so that several kernel
subsystems can run arbitrary code upon receipt of specific VFS events.
At the moment, this only provides support for unmount and is used to
destroy NFS exports lists from the file systems being unmounted, though it
has room for extension.
Thanks go to yamt@, chs@, thorpej@, wrstuden@ and others for their comments
and advice in the development of this patch.
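The VFS hooks idea can be sketched roughly as follows. This is only an
illustration of a callback list run on unmount; the names hook_establish(),
hooks_run_unmount() and struct vfs_hook are assumptions made for the example
and are not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>

struct mount { const char *mnt_name; };

/* One registered callback; names are illustrative, not the kernel API. */
typedef void (*unmount_hook_fn)(struct mount *);

struct vfs_hook {
	unmount_hook_fn vh_unmount;
	struct vfs_hook *vh_next;
};

static struct vfs_hook *hook_list;

/* Register a subsystem's unmount callback. */
static void
hook_establish(struct vfs_hook *vh, unmount_hook_fn fn)
{
	vh->vh_unmount = fn;
	vh->vh_next = hook_list;
	hook_list = vh;
}

/* Called from the unmount path: notify every registered subsystem. */
static void
hooks_run_unmount(struct mount *mp)
{
	for (struct vfs_hook *vh = hook_list; vh != NULL; vh = vh->vh_next)
		vh->vh_unmount(mp);
}

/* Example subscriber: an NFS-like subsystem dropping its export list. */
static int exports_destroyed;

static void
nfs_like_unmount(struct mount *mp)
{
	(void)mp;
	exports_destroyed++;
}
```

The point of the indirection is that the unmount code does not need to know
which subsystems care about the event.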
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.
- Handle non-regular (!VREG) files correctly.
- Remove (no longer needed) FINGERPRINT_NOENTRY.
* We now use hash tables instead of a list to store the in-kernel
fingerprints.
* Fingerprint method handling has been made more flexible; it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.
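As a rough illustration of replacing a list with a hash table for fingerprint
lookups, here is a small chained hash table. The key (an inode number), the
bucket count and all names are assumptions for the sketch; the message does
not say how the real table is keyed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 64			/* assumed table size */

struct fp_entry {
	uint64_t fe_inode;		/* assumed lookup key */
	const char *fe_type;		/* method name, e.g. "sha256" */
	struct fp_entry *fe_next;	/* collision chain */
};

static struct fp_entry *fp_table[NBUCKETS];

static size_t
fp_hash(uint64_t inode)
{
	return (size_t)(inode % NBUCKETS);
}

static void
fp_insert(struct fp_entry *fe)
{
	size_t b = fp_hash(fe->fe_inode);

	fe->fe_next = fp_table[b];
	fp_table[b] = fe;
}

/* Only one bucket's chain is walked, not the whole set of entries. */
static struct fp_entry *
fp_lookup(uint64_t inode)
{
	struct fp_entry *fe;

	for (fe = fp_table[fp_hash(inode)]; fe != NULL; fe = fe->fe_next)
		if (fe->fe_inode == inode)
			return fe;
	return NULL;
}
```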
and just passes it on to the file system functions. This avoids opening and
closing the device several times.
Mentioned on tech-kern some time ago, IIRC. I've been running this for a
long time.
header files, so that they don't become out of sync (again).
- Use bitmask_snprintf() instead of hand-rolled code.
- Always check array bounds before dereferencing print arrays.
- Order arguments in the vnode printing functions consistently.
calls to ensure that the vnode lock state is as expected when the VOP
call is made. Modify vnode_if.src to set the expected state according
to the documenting lock table for each VOP. Modify vnode_if.sh to emit
the checks.
Notes:
- The checks are only performed if the vnode has the VLOCKSWORK bit
set. Some file systems (e.g. specfs) don't even bother with vnode
locks, so of course the checks will fail.
- We can't actually run with VNODE_LOCKDEBUG because there are so many
vnode locking problems, not the least of which is the "use SHARED for
VOP_READ()" issue, which screws things up for the entire call chain.
Inspired by similar changes in OpenBSD, but implemented differently.
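A sketch of what such a generated check might look like, under the assumption
that the expected state is known per VOP and that the vnode's lock state can
be queried. The flag value, the lockstate enum, and the counter used instead
of a panic are all made up for the example:

```c
#include <assert.h>

#define VLOCKSWORK 0x0001	/* flag value assumed for the sketch */

enum lockstate { VN_UNLOCKED, VN_LOCKED };

struct vnode {
	int v_flag;
	enum lockstate v_lockstate;	/* stand-in for querying the real lock */
};

static int lock_violations;	/* a debug kernel might panic or log instead */

static void
vop_check_lock(const struct vnode *vp, enum lockstate expected)
{
	if ((vp->v_flag & VLOCKSWORK) == 0)
		return;		/* file system does not use vnode locks */
	if (vp->v_lockstate != expected)
		lock_violations++;
}

/* A generated VOP wrapper would run the check before calling the fs. */
static int
vop_read_sketch(struct vnode *vp)
{
	/* assumed entry state for this example: vnode locked */
	vop_check_lock(vp, VN_LOCKED);
	return 0;
}
```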
* Rather than using mnt_maxsymlinklen to indicate that a file system returns
d_type fields(!), add a new internal flag, IMNT_DTYPE.
Add 3 new elements to ufsmount:
* um_maxsymlinklen, replaces mnt_maxsymlinklen (which never should have existed
in the first place).
* um_dirblksiz, which tracks the current directory block size, eliminating the
FS-specific checks littered throughout the code. This may be used later to
make the block size variable.
* um_maxfilesize, which is the maximum file size, possibly adjusted lower due
to implementation issues.
Sync some bug fixes from FFS into ext2fs, particularly:
* ffs_lookup.c 1.21, 1.28, 1.33, 1.48
* ffs_inode.c 1.43, 1.44, 1.45, 1.66, 1.67
* ffs_vnops.c 1.84, 1.85, 1.86
Clean up some crappy pointer frobnication.
* Process A is closing one file descriptor belonging to a device. In doing so,
ffs_update() is called and starts writing a block synchronously. (Note: This
leaves the vnode locked. It also has other instances -- stdin, et al -- of
the same device open, so v_usecount is definitely non-zero.)
* Process B does a revoke() on the device. The revoke() has to wait for the
vnode to be unlocked because ffs_update() is still in progress.
* Process C tries to open() the device. It wedges in checkalias() repeatedly
calling vget() because it returns EBUSY immediately.
To fix this:
* checkalias() now uses LK_SLEEPFAIL rather than LK_NOWAIT. Therefore it will
wait for the vnode to become unlocked, but it will recheck that it is on the
hash list, in case it was in the process of being revoke()d or was revoke()d
again before we were woken up.
* Since we're relying on the vnode lock to tell us that the vnode hasn't been
removed from the hash list *anyway*, I have moved the code to remove it into
the DOCLOSE section of vclean(), inside the vnode lock.
In the example at hand, process A was sh(1), process B was a child of init(8),
and process C was syslogd(8).
to pool_init. Untouched pools are ones that are either in arch-specific
code, or aren't initialised during initial system startup.
Convert struct session, ucred and lockf to pools.
list of file system types currently supported by the kernel.
Previously there wasn't an easy way to determine this.
(Code shamelessly cribbed from subr_disk.c::sysctl_hw_disknames().)
Use LIST_FOREACH() appropriately.
called with every buffer written through spec_strategy().
Used by fss(4). Future file-system-internal snapshots will need them too.
Welcome to 1.6ZK
Approved by: Jason R. Thorpe <thorpej@netbsd.org>
suspending.
Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.
Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).
When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.
This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.
On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.
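To illustrate the sub-page granularity, here is a sketch that maps a
requested buffer size to one of several power-of-two pools. The smallest pool
size (1 KiB), the number of pools and the function names are assumptions for
the example, not values taken from the change:

```c
#include <assert.h>
#include <stddef.h>

#define MIN_BUF_SHIFT 10	/* assumed smallest pool: 1 KiB */
#define NPOOLS 7		/* assumed range: 1 KiB .. 64 KiB */

/* Item size served by pool number idx. */
static size_t
buf_pool_size(int idx)
{
	return (size_t)1 << (MIN_BUF_SHIFT + idx);
}

/* Smallest pool whose items can hold size bytes, or -1 if none can. */
static int
buf_pool_index(size_t size)
{
	for (int i = 0; i < NPOOLS; i++)
		if (buf_pool_size(i) >= size)
			return i;
	return -1;
}
```

With a 4 KiB page, a 1 KiB fragment would be served from pool 0 rather than
consuming a whole page. A buffer freed back to pool 0 cannot directly satisfy
a request that needs pool 2, which is the LRU caveat described above.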
Gone are the old kern_sysctl(), cpu_sysctl(), hw_sysctl(),
vfs_sysctl(), etc, routines, along with sysctl_int() et al. Now all
nodes are registered with the tree, and nodes can be added (or
removed) easily, and I/O to and from the tree is handled generically.
Since the nodes are registered with the tree, the mapping from name to
number (and back again) can now be discovered, instead of having to be
hard coded. Adding new nodes to the tree is likewise much simpler --
the new infrastructure handles almost all the work for simple types,
and just about anything else can be done with a small helper function.
All existing nodes are where they were before (numerically speaking),
so all existing consumers of sysctl information should notice no
difference.
PS - I'm sorry, but there's a distinct lack of documentation at the
moment. I'm working on sysctl(3/8/9) right now, and I promise to
watch out for buses.
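A toy model of how registering nodes in a tree makes the name-to-number
mapping discoverable. The node structure and name_to_mib() are hypothetical
and much simpler than the real infrastructure; the numbers used are the
traditional values of CTL_KERN, KERN_OSTYPE, KERN_OSRELEASE and CTL_HW:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node layout: a name, a number, and child nodes. */
struct node {
	const char *name;
	int num;
	const struct node *children;
	size_t nchildren;
};

static const struct node kern_children[] = {
	{ "ostype", 1, NULL, 0 },
	{ "osrelease", 2, NULL, 0 },
};

static const struct node root_children[] = {
	{ "kern", 1, kern_children, 2 },
	{ "hw", 6, NULL, 0 },
};

/* Walk a dotted name through the tree; return MIB length or -1. */
static int
name_to_mib(const char *name, int *mib, int maxlen)
{
	const struct node *level = root_children;
	size_t nlevel = sizeof(root_children) / sizeof(root_children[0]);
	int depth = 0;

	while (*name != '\0') {
		size_t len = strcspn(name, ".");
		const struct node *found = NULL;

		for (size_t i = 0; i < nlevel; i++) {
			if (strlen(level[i].name) == len &&
			    strncmp(level[i].name, name, len) == 0) {
				found = &level[i];
				break;
			}
		}
		if (found == NULL || depth == maxlen)
			return -1;
		mib[depth++] = found->num;
		level = found->children;
		nlevel = found->nchildren;
		name += len;
		if (*name == '.')
			name++;
	}
	return depth;
}
```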
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then syncs the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.
From FreeBSD with slight modifications.
Approved by: Frank van der Linden <fvdl@netbsd.org>
mv MNT_GONE, MNT_UNMOUNT and MNT_WANTRDWR to this field
additionally add mnt_writeopcountupper and mnt_writeopcountlower fields
in preparation for pending write suspension support work
bump kernel version to 1.6ZD
* Remove the "lwp *" argument that was added to vget(). Turns out
that nothing actually used it!
* Remove the "lwp *" arguments that were added to VFS_ROOT(), VFS_VGET(),
and VFS_FHTOVP(); all they did was pass it to vget() (which, as noted
above, didn't use it).
* Remove all of the "lwp *" arguments to internal functions that were added
just to appease the above.
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.
Bump the kernel rev up to 1.6V
1. sa_len was not properly checked.
2. sa_family was not properly checked [even used as an array index!]
3. we only know about inet4 and inet6, so make sure that the corresponding
data is valid before using it.
4. keep reference counts of addresses used (is that necessary?)
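The first three checks can be sketched as below. The header struct, the
family constants and export_addr_ok() are illustrative stand-ins so the
example does not depend on any platform's struct sockaddr; the expected
lengths are the usual sizes of sockaddr_in (16) and sockaddr_in6 (28):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the leading fields of a BSD-style struct sockaddr. */
struct sa_hdr {
	uint8_t sa_len;
	uint8_t sa_family;
};

/* Illustrative family numbers and table size. */
enum { XAF_INET = 2, XAF_INET6 = 24, XAF_MAX = 32 };

/* Expected address length per family; 0 means "family not supported". */
static const uint8_t expected_len[XAF_MAX] = {
	[XAF_INET] = 16,
	[XAF_INET6] = 28,
};

/* Return 1 if the address header is safe to use, 0 otherwise. */
static int
export_addr_ok(const struct sa_hdr *sa, size_t buflen)
{
	if (buflen < sizeof(*sa))
		return 0;		/* header itself is truncated */
	if (sa->sa_len > buflen)
		return 0;		/* claims more data than supplied */
	if (sa->sa_family >= XAF_MAX)
		return 0;		/* never index the table out of range */
	if (expected_len[sa->sa_family] == 0)
		return 0;		/* only inet4 and inet6 are known */
	return sa->sa_len == expected_len[sa->sa_family];
}
```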
- Under chroot it displays only the visible filesystems with appropriate paths.
- The statfs f_mntonname gets adjusted to contain the real path from root.
- While there, fixed a bug in ext2fs, locking problems with vfs_getfsstat(),
and factored out some of the vfsop statfs() code to copy_statfs_info(). This
fixes the problem where some filesystems forgot to set fsid.
- Made coda look more like a normal fs.
malloc types into a structure, a pointer to which is passed around,
instead of an int constant. Allow the limit to be adjusted when the
malloc type is defined, or with a function call, as suggested by
Jonathan Stone.
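A userland sketch of the idea of a malloc type carried as a structure with an
adjustable limit. The struct, the MT_DEFINE_LIMIT macro and the mt_*()
functions are invented for this example and are not the kernel interface:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-type accounting record. */
struct malloc_type_sketch {
	const char *mt_desc;
	size_t mt_limit;	/* maximum bytes this type may hold */
	size_t mt_inuse;	/* bytes currently allocated */
};

/* Set the limit at definition time ... */
#define MT_DEFINE_LIMIT(var, desc, limit) \
	static struct malloc_type_sketch var = { (desc), (limit), 0 }

/* ... or later with a function call. */
static void
mt_setlimit(struct malloc_type_sketch *mt, size_t limit)
{
	mt->mt_limit = limit;
}

static void *
mt_alloc(size_t size, struct malloc_type_sketch *mt)
{
	void *p;

	if (mt->mt_inuse > mt->mt_limit ||
	    size > mt->mt_limit - mt->mt_inuse)
		return NULL;	/* would exceed this type's limit */
	p = malloc(size);
	if (p != NULL)
		mt->mt_inuse += size;
	return p;
}

static void
mt_free(void *p, size_t size, struct malloc_type_sketch *mt)
{
	free(p);
	mt->mt_inuse -= size;
}

MT_DEFINE_LIMIT(M_EXAMPLE, "example", 100);
```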
kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals
kqueue is supported by all writable filesystems in the NetBSD tree
(with the exception of Coda) and all device drivers supporting poll(2)
based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe
This merge changes the device switch tables from static array to
dynamically generated by config(8).
- All device switches are defined as constant structures in device drivers.
- The new grammar ``device-major'' is introduced to ``files''.
device-major <prefix> char <num> [block <num>] [<rules>]
- All device major numbers must be listed in port dependent majors.<arch>
by using this grammar.
- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.
- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.
- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.
- At compile time, the list of device major numbers is packed into the kernel
and the LKM framework will refer to it to assign device major numbers
dynamically.
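The conversion between device name and major number mentioned above could be
modelled with a small table like the one below. The table contents are
hypothetical numbers chosen for the example, and the lookup functions are
illustrative rather than the generated kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row per device-major line; -1 means "no such major". */
struct devsw_conv_sketch {
	const char *d_name;
	int d_bmajor;
	int d_cmajor;
};

static const struct devsw_conv_sketch conv_table[] = {
	{ "wd", 0, 3 },		/* hypothetical numbers */
	{ "tty", -1, 1 },
};

#define NCONV (sizeof(conv_table) / sizeof(conv_table[0]))

/* Name to character major, or -1 if the name is unknown. */
static int
name_to_cmajor(const char *name)
{
	for (size_t i = 0; i < NCONV; i++)
		if (strcmp(conv_table[i].d_name, name) == 0)
			return conv_table[i].d_cmajor;
	return -1;
}

/* Character major to name, or NULL if the major is unassigned. */
static const char *
cmajor_to_name(int cmajor)
{
	for (size_t i = 0; i < NCONV; i++)
		if (conv_table[i].d_cmajor == cmajor)
			return conv_table[i].d_name;
	return NULL;
}
```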
enough to be useful, and broadening it so that it did would have meant
that operations possibly requiring synchronous disk activity would have
to be done in splbio(). This clearly was not going to work.
Worked around this in the LFS case by having lfs_cluster_callback put an
extra hold on the vnode before calling biodone(), and taking the hold
off without HOLDRELE's problematic list swapping. lfs_vunref() will take
care of that---in thread context---on the next write if need be.
Also, ensure that the list walking in lfs_{writevnodes,segunlock,gather}
takes into account the possibility that the list may change
underneath it (possibly because it itself deleted an element).
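The list-walking concern is the usual one: if the loop body may unlink the
current element, the successor has to be fetched before the body runs. A
generic sketch of that pattern, using a hand-rolled singly linked list with
made-up names rather than the LFS code:

```c
#include <assert.h>
#include <stddef.h>

struct item {
	int dirty;
	struct item *next;
};

/* Unlink every clean item; return how many were removed. */
static int
remove_clean(struct item **head)
{
	struct item **pp = head;
	struct item *it, *next;
	int removed = 0;

	for (it = *head; it != NULL; it = next) {
		next = it->next;	/* fetch before it may be unlinked */
		if (!it->dirty) {
			*pp = next;
			it->next = NULL;
			removed++;
		} else {
			pp = &it->next;
		}
	}
	return removed;
}
```

This only covers the case where the walker itself deletes an element. If
another context can change the list, the walk also needs locking or a
restart, which this sketch does not attempt.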
Tested on i386, test-compiled on alpha.
first. This is necessary to avoid warnings with -fshort-enums. Casting
to an int really should be enough, but turns out not to be.
This change will be documented in doc/HACKS.
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:
* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.
From art@openbsd.org.
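A toy model of the arrangement described in the list above: pools reference a
shared allocator structure, the allocator links its pools, and a failed page
allocation triggers a drain of sibling pools. All names, the idle-page
counter and the fake KVA budget are assumptions made for the sketch:

```c
#include <assert.h>
#include <stddef.h>

struct pool;

/* Backend allocator shared by several pools. */
struct pool_allocator {
	void *(*pa_alloc)(struct pool *);
	struct pool *pa_list;		/* all pools using this allocator */
};

struct pool {
	struct pool_allocator *pr_alloc;
	struct pool *pr_next;
	int pr_idle_pages;		/* pages that could be given back */
};

static int kva_pages_free;		/* stand-in for free KVA space */
static char fake_page[1];

static void *
fake_page_alloc(struct pool *pp)
{
	(void)pp;
	if (kva_pages_free == 0)
		return NULL;
	kva_pages_free--;
	return fake_page;
}

static void
pool_init_sketch(struct pool *pp, struct pool_allocator *pa)
{
	pp->pr_alloc = pa;
	pp->pr_idle_pages = 0;
	pp->pr_next = pa->pa_list;	/* link onto the allocator's list */
	pa->pa_list = pp;
}

/* Release idle pages; report whether anything was actually freed. */
static int
pool_reclaim_sketch(struct pool *pp)
{
	int n = pp->pr_idle_pages;

	pp->pr_idle_pages = 0;
	kva_pages_free += n;
	return n > 0;
}

/* On failure, drain sibling pools one by one and retry. */
static void *
pool_page_alloc(struct pool *pp)
{
	struct pool_allocator *pa = pp->pr_alloc;
	void *page = pa->pa_alloc(pp);

	for (struct pool *q = pa->pa_list; page == NULL && q != NULL;
	    q = q->pr_next) {
		if (q != pp && pool_reclaim_sketch(q))
			page = pa->pa_alloc(pp);
	}
	return page;
}
```

Having the reclaim step report success lets the drain loop stop as soon as a
retry succeeds instead of emptying every pool.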
VOP_PUTPAGES() just because the vnode has no pages. layered filesystems
will want to pass these calls on through to the underlying filesystem,
and non-layered filesystems may need to remove the vnode from the
syncer queues. fix up MP locking and add some locking assertions.
fixes PRs 12284 and 14640.