go through VFS_ROOT() and allow to fetch it without locking it.
This allows us to call the cache flush operations also for the root
vnode and most notably fixes e.g. a "No such file or directory"
for a psshfs root directory ls -l when a file was locally deleted
and remotely re-created.
Also fix some sloppy programming in root node fetch (mostly cosmetic).
Actually, we do need separate "no references in file server" and
"noref + inactive" flags if we wish to correctly support unix open
file semantics and optimize away pre-reclaim cache flushes. So,
add PNODE_DYING which stands for norefs + inactive.
"noref + inactive" flags if we wish to correctly support unix open
file semantics and optimize away pre-reclaim cache flushes. So,
add PNODE_DYING which stands for norefs + inactive.
the kernel it has 0 references to the node in question. In other
words, this can be used to avoid inactive(), or, if the file server
does not implement inactive, prompt reclaim for removed nodes.
as we give a reference to userspace for the puffs_node for the
duration of the poll call. So reference count puffs_node separately
from the parent vnode. vref()/vrele() is not possible due to a possible
surprise visit from VOP_INACTIVE.
and other information instead of always using VDIR. To make this
possible without races, require all root node information already
in puffs_mount() and nuke puffs_start2() and the associated start
operation completely.
requested/inspired by Tobias Nygren
file system value for the size of device special files, as that
comes from specfs instead of the "host" file system. Therefore,
take care that getattr doesn't override the value of vp->v_size.
file server only if the op was still waiting for fetch (as opposed
to waiting for the response). Also, properly flag the possible
following inactive as an op for which we do not want to wait for
the response from the file server.
for nodes upon return from the userspace. Currently it can be used
to indicate that the file server should be notified of "inactive"
in case the file server has opted to not receive inactive every
time the reference count for a vnode drops to zero. (inactive is
a common event, almost never requires any action and must be executed
sychronously, so it is wasteful).
While doing this, cleanup the release-relock nonsense from the
vntouser*() arguments. It was never enabled and the whole LOCKEDVP()
concept was very broken to begin with.
all metadata info cached in the kernel while we're setattr'ing in
any case. Solves problems such as truncate (via extend-by-write)
+ chmod resulting in EPERM because the file was already read-only
when the actual truncate was flushed out of the kernel in fsync.
when unmounting the file system in case of a certain timing (and
possibly some other conditions), a thread would wait on a condition
variable, while another thread broadcast the cv and immediately
proceeded to destroy it. The result was a system frozen completely
solid shorly after the process waiting for the cv woke up. So
introduce reference counting to synchronize destruction of the
resources in unmount.
I was able to repeat the problem only on my laptop in some special
cases, so I do not know how common it was. Ironically, killing
the file server process violently instead of unmount() didn't have
this problem because it never entered the unmount path from two
directions.
it. It might be that the file server is either crashing or just
returning consistent errors. uiomove() would handle the error,
but if the pages weren't faulted in, memset() to the unfaultable
ubc window would cause a kernel page fault.
to skip unnecessary flushing when layered file system vnodes are recycled.
this also prevents a deadlock with the dodgy LFS putpages routine.
fixes the non-LFS part of PR 36150.
break other implementations so lookup the physical number instead of
indexing it. Choosing random numbers here is legal according to the specs,
but not a logical choice and most likely done as a wierd kind of copy
protection.
Rogue implementation found to use this
*Microsoft CDIMAGE UDF
corresponding flags.
Revert softdep_trackbufs() to its state before vn_start_write() was added.
Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.
Welcome to 4.99.17
again. This is useful until locking is further developed and basically
any deadlocks can be solved by killing appropriate processes.
Thanks especially to Tommi Kyntola and Antti Louko for sitting down
with me and discussing resource ownership and locking strategies
in implementing this.
workaround for the problem analyzed more deeply in kern/30831. In
short, the problem is keeping the vnode on the mount point vnode
list during reclaim. If reclaim happens to sleep (as is a possibility
with smbfs due to calling vrele() and therefore possibly VOP_INACTIVE),
code going through the entire mountpoint vnode list will hit
half-reclaimed vnodes.
from putop. even though there's only one user currently, makes code
more readable
* move "delta" to a standard parameter in vntouser and get rid of the
specialcase vntouser_delta
bunch of bugs.
* park structures are now always allocated from a pool instead of a
mixed stack/malloc allocation
* get rid of the whole adjbuf concept, always just alloc the maximal
amount of memory to satisfy a request
* little regression: don't allow interrupting wait from file system
to userspace; this had problems already before, but now the problems
really started to shine through. I'll try to make this work again
some day.
* fix bmap to return a sensible value in runp
Remove offset argument, we should now find an HFS+ volume in any
of its standard places.
Based on work from and test image provided by Pelle Johansson.
kernel and flush it out all at once instead of continuous updating
* add support for delivering notifications to the file server about
when a page was written to (but disabled by default for now). the
file server can use this to request flushing or invalidating the
kernel page cache
(hfslib_open_volume) We are only interested in the catalgo and
extents header, so read the first 512 bytes, not the whole first
extent. Also makes mounting a lot faster.
possible to recover the system by just killing processes in case
a file server manages to recurse into itself either by fault of
file server implementation or by pilot error. The downside is that
the code is extremely hard to follow and practically screams out
for newlock2 (in addition to screaming "bug here"). The whole
PCATCH nonsense and induced megacomplexity can hopefully be avoided
in the future by tweaking other parts of the implementation.
positive values of errno and 0. Otherwise it can return internal values
such as EJUSTRETURN and screw things up.
thanks to Bill for reminding me to revisit this
Since 1.40, which introduced support for non-DEV_BSIZE media,
mounting msdos floppy returned ENOTTY.
This is because floppy driver does not support DIOCGPART or DIOCWEDGEINFO
ioctl.
Those ioctls should not be a requirement for mounting msdosfs.
This patch is made by Christian Biere.
VFS_ROOT() if the cookies match. Without this fix, if the root
vnode was reclaimed, doing lookups for dotdot from the root vnode
was possible. In practice this occured only through getcwd.
the user file server can't really keep up and just writing and writing
may result in kernel memory exhaustion. this lossage is also partially
due to the stupid way mtime + size info is handled currently, but that
should change soon (*knock knock* ;)
* score a few debug printfs
for putpages, as it assumes a vnode doesn't have any pages. For
mounts using the page cache this is simply not true. Rather,
prevent opening a regular file in write-mode. That way a vnode
can never have dirty pages which would need to be flushed (i.e.
written).
- don't use SAVESTART in calls to relookup() from unionfs,
just vref() the desired vnode when we need to.
- fix locking and refcounting in the unionfs EEXIST error cases.
- release any vnode locks before calling VFS_ROOT(), vfs_busy() is enough.
this allows us to simplify union_root() and fix PR 3006.
- union_lock() doesn't handle shared lock requests correctly,
so convert them to exclusive instead. fixes PR 34775.
- in relookup(), avoid reusing "dp" for different purposes,
the error handling wasn't right. (actually just get rid of dp.)
also, change relookup() to ignore LOCKLEAF and always return the
vnode locked since the callers already expect this.
Patch by Slava Semushin <slava.semushin@gmail.com>
Again, this was tested by comparing obj files from a pristine and a patched
source tree against an i386/ALL kernel, and also for src/sbin/fsck_ffs,
src/sbin/fsdb and src/usr.sbin/makefs. Only changes in assert() line numbers
were detected in 'objdump -d' output.
also to a place less panicy, namely fifo_fsync (because currently the
metadata information is update when the node is changed. This will
probably change soon, though).
servers. This is still pretty much on the level "if it breaks ...".
It should work for single-threaded servers which handle one operation
from start to finish in one go. Also, it does not yet totally
correctly synchronize metadata and data in some cases. So needless
to say, it needs improvement, but it is possible that will have to
wait for some lock revampage.
since DIOCGPART is going to fail. Unfortunately there is no way to get the
geometry information we need from the wedge; it would be nice for wedges
to support a geometry ioctl. The values we cannot retrieve are marked with
XXX.
- Add a lot more debugging.
completely different file system, we still might re-enter the same
puffs fs in case we execute something on the other file system,
which wants to get a new vnode and ends up recycling a puffs vnode
for the purpose. In this case the fs server will sleep in the
kernel until it itself handles the operation .... which of course
is a slightly unlikely event.
After analyzing the path from getcleanvnode() to the vnode cemetary,
identify that fsync and putpages (strategy) are the ones in danger
of striking a deadlock deal. Abuse the vnode flag VXLOCK to tell
them "this vnode is irreversably going to meet its maker, don't
care about user server return values" (failure is not acceptable
down the vgonel() path) and issue the respective operations as
Fire-And-Forget (FAF) operations. no wait -> no deadlock.
This of course is a "fix" skating on thin ice. A better, more
generic solution is already in sight, but will take more effort to
implement.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.
Implemented for file systems of type ffs.
The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.
Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.
Welcome to 4.99.9 (new vfs op vfs_suspendctl).
fixes the case where a directory lookup is done in a directory has never
been visted/listed; the search optimalisation that searches the directory
from where it left behind the last time would never reach the initial
offset of zero since it would always have at least processed one entry.
of rolling around VOP_FSYNC(). The user server will be given the
VFS_SYNC instruction and it can do its own equivalent of VOP_FSYNC()
if it pleases, no need for the kernel to explicitly issue #{vnodes}
FSYNCs.
kernel caching. Currently supported are only flushing the name
cache for a directory or flushing the name cache for the entire fs.
Also, get rid of PNODE_INACTIVE status, since it was racy and
essentially didn't work. All this on top of being useless in the
first place ....
kernel->server calls at that time and it allows a window where
operations use an incorrect root node cookie.
XXX: there's still a (very much smaller and biglock safe) race, but
that's going to be solved by some more thorough restructuring
from under someone waiting for the fs server response in puffs_unmount()
if the descriptor was closed during the response wait (such as bug
leading to a crash in fs implementation unmount()).
- LOCKPARENT is no longer relevant for lookup(), relookup() or VOP_LOOKUP().
these now always return the parent vnode locked. namei() works as before.
lookup() and various other paths no longer acquire vnode locks in the
wrong order via vrele(). fixes PR 32535.
as a nice side effect, path lookup is also up to 25% faster.
- the above allows us to get rid of PDIRUNLOCK.
- also get rid of WANTPARENT (just use LOCKPARENT and unlock it).
- remove an assumption in layer_node_find() that all file systems implement
a recursive VOP_LOCK() (unionfs doesn't).
- require that all file systems supply vfs_vptofh and vfs_fhtovp routines.
fill in eopnotsupp() for file systems that don't support being exported
and remove the checks for NULL. (layerfs calls these without checking.)
- in union_lookup1(), don't change refcounts in the ISDOTDOT case, just
adjust which vnode is locked. fixes PR 33374.
- apply fixes for ufs_rename() from ufs_vnops.c rev. 1.61 to ext2fs_rename().
from under us while waiting for syncer_lock and before we got to vfs_busy.
This happens easily e.g. when the userspace server loses its will to
live in VOP_RECLAIM, which is called from vflush() in VFS_UNMOUNT. We
get two competing unmounters. When the first one finishes, it releases
syncer_lock. Now the second one tries to vfs_busy(), but is greeted
with garbage in *mp.
XXX: Technically this is a more general issue and should be fixed
elsewhere, but it's hard to trigger it with normal file systems
unless they are unmounted "simultaneously" twice and are dirty
enough for flushing to take a while. So make a note about it in
the little black book next to the poems and postpone the crusade
for now.
mmap and therefore execution of binaries starting to work, some
speed improvements with large file I/O also. caching semantics
and error case handling most likely need revisiting.
predictable. This solves a problem that may appear when serving a tmpfs
over NFS: if the server reboots, newly allocated files should have
different file handles; otherwise the remote clients could access files
they were not supposed to touch.
not persistent across reboots but neither are those of MFS, which we are
trying to replace. We should probably warn the user somehow, but not
prevent him doing this if he really wants to.
While here add a "reply" to the code-style change item.
binaries which cast the returned values to 64-bits and fail due to sign
expansion. More details are provided in the big comment in tmpfs.h that
describes how the new tmpfs_dircookie works.
This is a rather ugly hack that shall be fixed with a cleaner solution,
but this resolves the problem in an effective way.
Fixes kern PR/32034.
reference to the parent directory's vnode instead of its smbnode to
avoid a use-after-free bug causing a panic when a smbfs mount is
forcefully unmounted.
Keep trying to flush the vnode list for the mount while some are still
busy and we are making progress towards making them not busy. This
stops attempts to unmount idle smbfs mounts failing with EBUSY.
The easiest way to reproduce the above problem, from what I have seen is:
1) Assume /s is a smbfs mount point.
2) mount /s
3) stat /s/foo/1
4) umount /s
Returns error because the file system is busy.
5) Shutdown the machine: panic in smbfs_reclaim because vrele
accesses already-released memory.
array's contents and returning all the requested pages. Otherwise there
are problems (accessing invalid memory) when the a_m vector is passed
uninitialized as the NFS server code does. Fixes PR kern/34959.
Note that this is not a "real" fix. While this makes tmpfs's getpages
operation consistent with the behavior of other file systems, it does
not resolve the different semantics between uvn_get and uao_get as
described in PR kern/32166. I'm adding a comment in the code mentioning
exactly this so that it can be reviewed when this last problem is
addressed.
it, not the mtime of the file itself. This fixes the problems exposed when
unpacking software under a tmpfs and trying to build it because dependencies
were not calculated properly (e.g. autoconf 2.60 as reported by tls@).
loops where vnodes can get removed or added during the loops. This could
lead to panic's on unmount since nodes are skipped or otherwise
TAILQ_NEXT(0xdeadbeef, ...) was dereferenced.
all waiters *before* trying to get the syncer lock necessary for
dounmount(). This prevents a deadlock if the userspace server dies
while the syncer is running.
getnewvnode() while holding on to any vnode lock deadlocks the
system if the file system is being forcibly unmounted.
Normal file systems don't trigger this problem because of two reaons:
1) they don't hold on to vnode locks while idling who-knows-where, so
the race doesn't trigger
2) they aren't usually unmounted with FORCE; puffs is, in case "someone"
manages to make a crashy userspace server
Nevertheless, a real solution is slowly being braised.
It contains the VFS attachment and userspace message-passing interface.
This work was initially started and completed for Google SoC 2005
and tweaked to work a bit better in the past few weeks. While
being far from complete, it is functional enough to be able and
stable to host a fairly general-purpose in-memory file system in
userspace. Even so, puffs should be considered experimental and
no binary compatibility for interfaces or crash-freedom or zero
security implications should be relied upon just yet.
The GSoC project was mentored by William Studenmund and the final
review for the code was done by Christos.
vnodes were synced and processed backwards. This meant that the last
accessed node was processed first and the earlierst last.
An extra benefit is the removal of the ugly hack from the Berkly days on
LFS.
In the proces, i've also replaced the various variations hand written loops
by the TAILQ_FOREACH() macro's.
the udf_verbose variable. So when something goes wrong, it can be examined
on the spot without needing to reboot a new kernel and possibly loosing
state.
. get rid of struct adirent which didn't match struct dirent anymore
. fix cookies, move all the code handling them to the end of the function
Includes many minor changes to the code of this function.
"Add a 3rd entry in the cache, which keeps the end position
from just before extending a file.
This has the desired effect of keeping the write speed constant."
And yes, that helps a lot copying large files... always at full speed
now. This closes my PR kern/30868 "Poor performance copying large files
on msdosfs".
Also remove a 2 if-statements testing the same condition, combine them.
All that from Rhialto, thank you very much.
If you perform this request on a directory with exactly 50 files
(plus '.' and '..' which brings the total to 52 objects), the first
reply for the SMB server completely satisfies the query (server
side is Windows 2000 Professional).
The smbfs client then performs a TRANS2_FIND_NEXT2 using the last
file name as the resume key. The response returns a SearchCount
of zero (ctx->f_ecnt == 0) and an EndOfSearch code of zero.
Any attempt to get more entries with calls to TRANS2_FIND_NEXT2
result in Badfid (bad file descriptor). I suspect the return of
SearchCount of zero means that end-of-search has been reached and
the Sid is now closed.
The solution is to set "SMB_RDD_EOF | SMB_RDD_NOCLOSE" after getting
back a zero SearchCount, I've tested this in the field on a quite
a few systems, aggressively accessing Windows shares over smbfs
and it appears flawless.
I was initially concerned about the possibility of resource exhaustion
on the Windows server. I was afraid by not officially closing the
search, it would leave a resource hung-up and over time, exhaust
some sort of "open search table" limit. I've since convinced myself
this is NOT the case.
Windows needs to be able to handle clients that come and go over
time. If the search is not closed, Windows will close it if it
finds it needs more resources. I've testing this on directory
searches descending into 10's of thousands of folders, with 100's
of thousands of files.
it was initialised quite late due to its reliance on disc data the mount
process could have stopped before initialising and thus could panic again
only now for uninitialising an not initialised pool! *sigh*
the same memory block allocated as before and it bombs out on its
descriptor pool allready being initialised. It turns out that the pool was
not allways destroyed. This fix ought to clean it up whatever the cause of
the mishap that results in a reject.
253 of the superblock be zero. Searching the net failed to find any
justification for checking these bytes; all available references say
that they are part of the boot code and not BOOTSIG2 and BOOTSIG3.
Modify the MSDOS 7.1 bootsector definition to have 420 bytes of boot
code and no BOOTSIG[23], rather than 418 bytes of boot code, to follow
available references and apparent Windows practice. A test build
showed that these defines are not used other than in the check removed
by this commit.
Patch tested on netbsd-3, and enabled mounting of a 4 GB CF formatted
under Windows XP and then in a digital camera. The CF was previously
unmountable.
Concept approved on tech-kern by christos@ and martin@.
While touching all vptofh/fhtovp functions, get rid of VFS_MAXFIDSIZ,
version the getfh(2) syscall and explicitly pass the size available in
the filehandle from userland.
Discussed on tech-kern, with lots of help from yamt (thanks!).
allocated space was 2048 bytes, but when adding 1024 to the variable
`unix_name' to split the allocated space in half it effectively starts just
OUTSIDE the allocated space. This ought to fix memory corruption bugs when
using UDF.
This is a routine to revisit one day.
- struct timeval time is gone
time.tv_sec -> time_second
- struct timeval mono_time is gone
mono_time.tv_sec -> time_uptime
- access to time via
{get,}{micro,nano,bin}time()
get* versions are fast but less precise
- support NTP nanokernel implementation (NTP API 4)
- further reading:
Timecounter Paper: http://phk.freebsd.dk/pubs/timecounter.pdf
NTP Nanokernel: http://www.eecis.udel.edu/~mills/ntp/html/kern.html
- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.
The code supports read access to all media types that CD/DVD type drives
can recognize including DVD-RAM and BD- drives as well as harddisc partions
and vnd devices. UDF versions upto the latest 2.60 are to be supported
though due to lack of test media version 2.50 and 2.60 are not implemented
yet though easy to add. Both open and closed media are supported.
Write access is planned and in preparation. To facilitate this some hooks
are present in the code that are not strictly needed in a read-only
implementation but which allow writing to be added more easily.
Implemented and tested media types are CD-ROM, CD-R, CD-RW, CD-MRW,
DVD-ROM, DVD*R, DVD*RW, DVD+MRW but the same code can also read DVD-RAM,
HD-DVD and BluRay discs. Also vnd devices have been tested with several
sector sizes.
Discs created and written by UDFclient, Nero's InCD and Roxio's
DirectCD/Drag2Disc read fine.
Let it get recycled along with all of the others. This is important
as if the root vnode has already been reclaimed, then we get a panic
when we try to vget it.
This addresses PR: kern/31382
- for structure fields that are conditionally present,
make those fields always present.
- for functions which are conditionally inline, make them never inline.
- remove some other functions which are conditionally defined but
don't actually do anything anymore.
- make a lock-debugging function conditional on only LOCKDEBUG.
as discussed on tech-kern some time back.
- make sure that kernel only files don't compile in userland using #error
- XXX: some kernel only files still get installed.
- XXX: some files used in userland, don't get installed.
because VOP_UPDATE() usually succeeded, spec_close() was not usually
called. Only skip the spec_close() step if VOP_UPDATE() returns
an error result. Now /dev/watchdog works as expected when /dev/
is a tmpfs; previously, it was impossible to disarm a user-tickled
watchdog.
per yamt's suggestion. Previously, if /dev/ was mounted on a tmpfs,
block device buffers were never flushed to disk. Trying to unmount
a dirty filesystem (umount /dev/wd0e, say) caused an endless stream
of vflushbuf warnings, because tmpfs_bwrite was not flushing buffers.
The fix told to me by yamt solves the problem.
ptyfs_write() rather than setting a flag and updating these times
through ptyfs_itimes() at some indeterminate time in the future.
However, just use the "time" variable to set the times instead of
using a potentially expensive call to nanotime(). A HZ resolution
on these timestamps is more than enough.
(Possibly incomplete) fix for PR kern/31430.
OK'd be christos@.