uncouple it from the timespec layout. Also, change return value
to zero for "timeout didn't expire" and non-zero for "timeout
expired". This decouples the interface from errno assignments.
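
The return convention described above can be sketched as follows. This
is only an illustration on top of POSIX threads: the function name
sketch_cv_timedwait and its parameter list are hypothetical, not the
interface changed by this commit.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical wrapper (name and signature are illustrative only).
 * The absolute deadline is passed as plain integers, so callers do not
 * depend on the layout of struct timespec.
 *
 * Returns 0 if the wait ended before the deadline ("timeout didn't
 * expire") and non-zero if the deadline passed ("timeout expired").
 * Callers never see an errno value.  In this sketch any error other
 * than ETIMEDOUT is also reported as 0.
 */
int
sketch_cv_timedwait(pthread_cond_t *cv, pthread_mutex_t *mtx,
    int64_t sec, int64_t nsec)
{
	struct timespec ts;
	int rv;

	assert(nsec >= 0 && nsec < 1000000000);

	ts.tv_sec = (time_t)sec;
	ts.tv_nsec = (long)nsec;

	/* mtx must be held by the caller, as with any condvar wait. */
	rv = pthread_cond_timedwait(cv, mtx, &ts);
	return rv == ETIMEDOUT;
}
```

A caller then tests the result as a boolean instead of comparing it
against a particular errno constant.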
* split them into levels
* allow only one per level to be active at a time
* fire softints only when we are unscheduling from a CPU instead
of immediately in softint_schedule(). this will later morph
into return from interrupt, but that part isn't done yet.
- Addresses the issue described in PR/38828.
- Some simplification in threading and sleepq subsystems.
- Eliminates pmap_collect() and, as a side note, allows pmap optimisations.
- Eliminates XS_CTL_DATA_ONSTACK in scsipi code.
- Avoids a few scans on LWP list and thus potentially long holds of proc_lock.
- Cuts ~1.5k lines of code. Reduces amd64 kernel size by ~4k.
- Removes __SWAP_BROKEN cases.
Tested on x86, mips, acorn32 (thanks <mpumford>) and partly tested on
acorn26 (thanks to <bjh21>).
Discussed on <tech-kern>, reviewed by <ad>.
all-sink and make sure each separate thread in rump has its own
lwp. Happy-go-lucky callers will get scheduled a temporary lwp
on entry, while true lwp connoisseurs may request a stable lwp
for their purposes. Some more love may be required later down the
road, but for now different threads will stop stepping on each other's
toes.
for kernel code which has been written to avoid MP contention by
using cpu-local storage (most prominently, select and pool_cache).
Instead of always assuming rump_cpu, the scheduler must now be run
(and unrun) on all entry points into rump. Likewise, rumpuser
unruns and re-runs the scheduler around each potentially blocking
operation. As an optimization, I modified some locking primitives
to try to get the lock without blocking before releasing the cpu.
Also, ltsleep was modified to assume that it is never called without
the biglock held and made to use the biglock as the sleep interlock.
Otherwise there is just too much drama with deadlocks. If some
kernel code wants to call ltsleep without the biglock, then, *snif*,
it's no longer supported in rump and should be modified to support
newstyle locks anyway.
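
The locking optimization mentioned above (try the lock without
blocking before giving up the cpu) might look roughly like this. The
names sketch_unschedule/sketch_schedule are hypothetical stand-ins for
the scheduler hooks and only count calls here; this is not the
committed code.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for "unrun" and "run" of the rump scheduler. */
int sketch_cpu_held = 1;
int sketch_unschedule_count;

void
sketch_unschedule(void)
{
	sketch_cpu_held = 0;
	sketch_unschedule_count++;
}

void
sketch_schedule(void)
{
	sketch_cpu_held = 1;
}

void
sketch_mutex_enter(pthread_mutex_t *mtx)
{
	assert(mtx != NULL);

	/* Fast path: lock is free, so keep the virtual cpu. */
	if (pthread_mutex_trylock(mtx) == 0)
		return;

	/*
	 * Slow path: we are about to block, so release the virtual
	 * cpu first and take it back once the lock has been acquired.
	 */
	sketch_unschedule();
	pthread_mutex_lock(mtx);
	sketch_schedule();
}
```

In the uncontended case the scheduler is not touched at all, which is
the point of the optimization.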
of the external interfaces and namespacing the internal ones to
"rumppriv", put the external ones in a "rump_pub" namespace. While
this requires adjusting all of the external callers of these
interfaces, it is the right thing to do in the long run, since it
clarifies the structure.
us to control and wrap all entry points from "userspace" into rump.
This in turn is necessary for the upcoming rump cpu scheduler.
For each interface "foo" a public wrapper called "rump_foo" is
created. It calls the internal implementation "rumppriv_foo". In
case foo is to be called from inside of rump kernel space, the
private interface "rumppriv_foo" is used -- the userspace wrapper
prototypes are not even exported into the rump kernel namespace.
Needless to say, the rump kernel internal interfaces are not exported
for users.
Now, three classes of interfaces fight for control of rump:
+ the noble local control interfaces (which this commit addresses)
+ the insidious rump system calls (which are generated from syscalls.master)
+ and the evil vnode interfaces (which are generated from vnode_if.src)
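
The rump_foo/rumppriv_foo wrapper scheme described above, as a minimal
sketch. "foo" is the placeholder interface from the text; the
sketch_enter/sketch_exit hooks and the body of rumppriv_foo are
assumptions made up for illustration.

```c
#include <assert.h>

/* Hypothetical entry/exit hooks; 1 while the caller is "inside" rump. */
int sketch_in_kernel;

static void
sketch_enter(void)
{
	assert(!sketch_in_kernel);
	sketch_in_kernel = 1;
}

static void
sketch_exit(void)
{
	assert(sketch_in_kernel);
	sketch_in_kernel = 0;
}

/*
 * Internal implementation.  Code inside the rump kernel calls this
 * directly; it is not exported to users.
 */
int
rumppriv_foo(int arg)
{
	assert(sketch_in_kernel);
	return arg + 1;		/* placeholder for the real work */
}

/*
 * Public wrapper.  Every call from "userspace" passes through here,
 * which gives one place to control and wrap the entry point.
 */
int
rump_foo(int arg)
{
	int rv;

	sketch_enter();
	rv = rumppriv_foo(arg);
	sketch_exit();
	return rv;
}
```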
reference counting and not release nodes based just on puffs'
impression of whether they are free.
This also allows us to reclaim vnodes already in inactive if the
file system so desires. Some file systems, most notably ffs, change
file state already in inactive. This could lead to a deadlock in
the middle of inactive and reclaim if some other puffs operation
was processed in between (as exposed by haad's open(at) test
program).
Also, properly thread the componentname from lookup to the actual
vnode operation. This required changes to the rump componentname
routines. Yes, the rename case is truly mindbogglingly disgusting.
Puke for yourself.
* drop async transfer requests on the floor (no, this does not make
anything work, but it's the easiest way to prevent a receive pipe
transfer request from hanging everything. one tiny bugstep at a time ...)
since ltsleep abuses "while (!mutex_tryenter()) continue;" for NOT
releasing the kernel biglock before sleeping, we cannot do a normal
mutex_enter() in the wakeup path, or otherwise we might be in a
situation where the sleeper holds the kernel lock and wants the
sleepermutex (and will not back down) and the wakeupper holds the
sleepermutex and wants the kernel lock. So introduce kernel lock
backdown to the wakeup path.
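
A generic sketch of lock backdown, assuming POSIX threads. The helper
name is hypothetical, and which of the two locks plays which role in
the actual wakeup path is an assumption; the point is only that one
side never blocks on the second lock while holding the first.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * Hypothetical helper: return with both locks held, but never block
 * on "second" while holding "first".  If "second" is busy, back down
 * by releasing "first", let the other thread run, and start over.
 * In the terms of the text, the caller would be the wakeup path.
 */
void
sketch_lock_with_backdown(pthread_mutex_t *first, pthread_mutex_t *second)
{
	assert(first != second);

	for (;;) {
		pthread_mutex_lock(first);
		if (pthread_mutex_trylock(second) == 0)
			return;		/* both locks held */

		/* Back down so the holder of "second" can make progress. */
		pthread_mutex_unlock(first);
		sched_yield();
	}
}
```

Since the sleeper "will not back down", only one side needs this
treatment to break the cycle.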
vfs in nature, and therefore it belongs here (can't load a firmware
from a file system without file system support, right?). Rename
rump_cwdi to cwdi0, since firmload depends on that name (naughty
firmload).
which take softnet_lock and might run before the lock is actually
initialized. Also, soinit() itself already calls soinit2(), so no
need to call it twice.
- Separate the suser part of the bsd44 secmodel into its own secmodel
and directory, pending even more cleanups. For revision history
purposes, the original location of the files was
src/sys/secmodel/bsd44/secmodel_bsd44_suser.c
src/sys/secmodel/bsd44/suser.h
- Add a man-page for secmodel_suser(9) and update the one for
secmodel_bsd44(9).
- Add a "secmodel" module class and use it. Userland program and
documentation updated.
- Manage secmodel count (nsecmodels) through the module framework.
This eliminates the need for secmodel_{,de}register() calls in
secmodel code.
- Prepare for secmodel modularization by adding relevant module bits.
The secmodels don't allow auto unload. The bsd44 secmodel depends
on the suser and securelevel secmodels. The overlay secmodel depends
on the bsd44 secmodel. As the module class is only cosmetic, and to
prevent ambiguity, the bsd44 and overlay secmodels are prefixed with
"secmodel_".
- Adapt the overlay secmodel to recent changes (mainly vnode scope).
- Stop using link-sets for the sysctl node(s) creation.
- Keep sysctl variables under nodes of their relevant secmodels. In
other words, don't create duplicates for the suser/securelevel
secmodels under the bsd44 secmodel, as the latter is merely used
for "grouping".
- For the suser and securelevel secmodels, "advertise presence" in
relevant sysctl nodes (sysctl.security.models.{suser,securelevel}).
- Get rid of the LKM preprocessor stuff.
- As secmodels are now modules, there's no need for an explicit call
to secmodel_start(); it's handled by the module framework. That
said, the module framework was adjusted to properly load secmodels
early during system startup.
- Adapt rump to changes: Instead of using empty stubs for securelevel,
simply use the suser secmodel. Also replace secmodel_start() with a
call to secmodel_suser_start().
- 5.99.20.
Testing was done on i386 ("release" build). Separated module_init()
changes were tested on sparc and sparc64 as well by martin@ (thanks!).
Mailing list reference:
http://mail-index.netbsd.org/tech-kern/2009/09/25/msg006135.html
The pfsync interface exposes changes in pf(4) over a pseudo-interface, and can
be used to synchronise different pf instances.
This work was part of my 2009 GSoC
No objection on tech-net@
for orphaned sections to using PROVIDE. What this means is that
unless a rump component internally references that symbol, it will
not be included in the component shared library, and hence cannot
be referenced when the component is loaded. Add a workaround which
works both with 2.16 and 2.19: force a reference to the __start
symbol internally and hence retain it in the resulting library.
since that opens a race window for non-mpsafe code, so do it after.
Additionally, we cannot call mutex_enter() for sleepermtx, since
ltsleep/mtsleep should not block (i.e. release kernel lock) before
actually blocking, so busyloop in mutex_tryenter(). Finally, when
waking up, take kernel lock back only *after* releasing sleepermtx
to avoid deadlock against another thread holding the kernel lock
and wanting sleepermtx.
(yes, it's functionally a device instead of a networking domain,
since it provides and is accessed through /dev/nsmb instead of
being accessed through sockets)
vnode pager.
It would have been nice to keep a separate version:
* it has helped find file system bugs which the kernel pager
treated as non-errors
* it does not contain extra payload unnecessary in userspace
However, getting the details of the pager implementation correct
with all the flags, offsets and block/page size special cases is
*EXTREMELY* difficult (chuq > god).
On the plus side, LFS write now works for file data too instead of
just metadata. Also, maybe being able to singlestep the genfs
vnode pager in the comfort of userspace will allow more people to
understand how the behemoth functions.
Instead of doing actual page remapping, which we can't portably
do in userspace without extensive trickery (read: signals), simply
allocate the kva window with new physical backing, copy page
contents, return, and copy contents back in mapout. Since the
pages are locked during the mapping cycle, we can do this without
hazard.
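
The copy-based mapin/mapout described above can be sketched like this.
The struct and function names are hypothetical and the page size is
fixed at 4096 for simplicity; this is not the rump implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for a page and its backing memory. */
struct sketch_page {
	unsigned char data[SKETCH_PAGE_SIZE];
};

/*
 * "Map in": allocate a contiguous window with new backing memory and
 * copy the contents of each page into it.  Returns NULL if the
 * allocation fails.
 */
void *
sketch_mapin(struct sketch_page **pgs, size_t npages)
{
	unsigned char *win;
	size_t i;

	assert(pgs != NULL);

	win = malloc(npages * SKETCH_PAGE_SIZE);
	if (win == NULL)
		return NULL;
	for (i = 0; i < npages; i++)
		memcpy(win + i * SKETCH_PAGE_SIZE, pgs[i]->data,
		    SKETCH_PAGE_SIZE);
	return win;
}

/*
 * "Map out": copy the window contents back into the pages and free
 * the window.  This is only safe because the pages are assumed to be
 * locked between mapin and mapout, so nobody else modifies them.
 */
void
sketch_mapout(void *window, struct sketch_page **pgs, size_t npages)
{
	unsigned char *win = window;
	size_t i;

	assert(win != NULL && pgs != NULL);

	for (i = 0; i < npages; i++)
		memcpy(pgs[i]->data, win + i * SKETCH_PAGE_SIZE,
		    SKETCH_PAGE_SIZE);
	free(win);
}
```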
* add lots of stubbies necessary for new stuff coming soon
introduce a new and improved "etfs" interface, which can be used
to register host files accessible from rump fs namespace. This
new interface is not restricted to block devices, and neither does
it require the same pathname in host namespace and rump namespace.
Therefore, the same host file can be represented both as a char
and block device in rump namespace.
* adjust rumpblk to make the above possible
* improve rumpfs: nodes are now created properly and not implicitly
tied to the vnode lifecycle
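
To illustrate the etfs idea above: a tiny registry keyed on the path in
rump namespace, so that the host path may differ and the same host file
may be registered under several keys with different types. All names
below (sketch_etfs_*) are hypothetical and do not reflect the actual
interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum sketch_etfs_type { SKETCH_ETFS_REG, SKETCH_ETFS_BLK, SKETCH_ETFS_CHR };

struct sketch_etfs {
	const char *key;	/* path as seen in rump namespace */
	const char *hostpath;	/* path of the backing host file */
	enum sketch_etfs_type type;
};

#define SKETCH_MAXETFS 16
static struct sketch_etfs sketch_tab[SKETCH_MAXETFS];
static size_t sketch_netfs;

/* Look up a registration by its rump namespace path, or return NULL. */
const struct sketch_etfs *
sketch_etfs_lookup(const char *key)
{
	size_t i;

	assert(key != NULL);

	for (i = 0; i < sketch_netfs; i++)
		if (strcmp(sketch_tab[i].key, key) == 0)
			return &sketch_tab[i];
	return NULL;
}

/*
 * Register a host file under "key".  The strings are not copied, so
 * they must stay valid for as long as the entry exists.  Returns 0 on
 * success or an errno value on failure.
 */
int
sketch_etfs_register(const char *key, const char *hostpath,
    enum sketch_etfs_type type)
{
	assert(key != NULL && hostpath != NULL);

	if (sketch_etfs_lookup(key) != NULL)
		return EEXIST;
	if (sketch_netfs == SKETCH_MAXETFS)
		return ENOSPC;

	sketch_tab[sketch_netfs].key = key;
	sketch_tab[sketch_netfs].hostpath = hostpath;
	sketch_tab[sketch_netfs].type = type;
	sketch_netfs++;
	return 0;
}
```

Registering one host file twice, once as a block device and once as a
char device under different keys, gives the dual representation
mentioned in the text.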
tested with a DEBUG+DIAGNOSTIC+LOCKDEBUG kernel. To summarise NiLFS, I'll
repeat my posting to tech-kern here:
NiLFS stands for New implementation of Logging File System; LFS done
right they claim :) It is at version 2 now and is being developed by NTT, the
Japanese telecom company and recently put into the linux source tree. See
http://www.nilfs.org. The on-disc format is not completely frozen and I expect
at least one minor revision to come in time.
The benefits of NiLFS are built-in fine-grained checkpointing, persistent
snapshots, multiple mounts and very large file and media support. Every
checkpoint can be transformed into a snapshot and v.v. It is said to perform
very well on flash media since it is not overwriting pieces apart from an
incidental update of the superblock, but that might change. It is accompanied
by a cleaner to clean up the segments and recover lost space.
My work is not a port of the linux code; it's a new implementation. Porting the
code would be more work since it's very linux oriented and was never written to
be ported outside linux. The goal is to be fully interchangeable. The code is non
intrusive to other parts of the kernel. It is also very light-weight.
The current state of the code is read-only access to both clean and dirty
NiLFS partitions. On mounting a dirty partition it rolls forward the log to
the last checkpoint. Full read-write support is however planned!
Just as the linux code, mount_nilfs allows for the `head' to be mounted
read/write and allows multiple read-only snapshots/checkpoint mounts next to
it.
By allowing a read-write mount at a different snapshot, it should eventually
be possible to revert to a previous state; i.e. try to upgrade a system and
be able to revert to the exact state prior to the upgrade.
Compared to other FS's it's pretty light-weight, suitable for embedded use and
on flash media. The read-only code is currently 17kb object code on
NetBSD/i386. I doubt the read-write code will surpass 50 or 60kb. Compare
this to FFS being 156kb, UDF being 84kb and NFS being 130kb. Run-time memory
usage is most likely not very different from other uses though maybe a bit
higher than FFS.
and raidframe. Raidframe works well enough to configure a raid in
the rump kernel, but the usage is "interesting" (pending some other
changes/cleanup from other parts in my tree).
These are not built by default yet.
are present. This works in userspace as opposed to relying on link
sets, which fail miserably. Later, when the networking stack
becomes modularized, we can move to a dynamic scheme like with file
systems.
Also, this change allows us to do proper autoconfig, namely attach
the loopback interface iff it is present.
component, but due to ifdef happiness permeating the sources, it's
a compile decision for now, so netinet pulls in both inet and inet6.
One issue, one single issue: the loopback interface still needs to
be created for IPv6 to work. I have patches to take care of it
automatically if the appropriate component (net) is present, but
they require a bit more testing before commit.