and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.
Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).
The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.
vm_page *) "reverse" lookup code from uvm_page.h to uvm_page.c, to
help migration to not do that.
Likewise move per-page metadata (struct vm_page *) -> physical
address "forward" conversion code into *.c too. This is called
only low-layer VM and MD code.
do, this effectively allows changing the uid of proc0 without
running into KASSERT problems in uidinfo code (although I'm not
quite so sure changing proc0's uid is the right thing to do ...).
problem reported by njoly
This incarnation is written in the user namespace as opposed to
the previous one which was done in kernel namespace. Also, rump
does all the handshaking now instead of excepting an application
to come up with the user namespace socket.
There's still a lot to do, including making code "a bit" more
robust, actually running different clients in a different process
inside the kernel and splitting the client side library from librump.
I'm committing this now so that I don't lose it, plus it generally
works as long as you don't use it in unexcepted ways: i've tested
ifconfig(8), route(8), envstat(8) and sysctl(8).
Introduce rump_pub_syscall() as the generic interface for making
system calls with already marshalled arguments. So it's kinda like
syscall(2), except it also remembered to breathe instead of having
to figure out how to deal with 64bit values.
"ifconfig create". As previously, virt<n> interfaces with the
host's /dev/tap<n> (I guess it could be made explicit with
"ifconfig media", but leave it this way for now).
pagedaemon. This mimics normal kernel behaviour where pmap_kentered
mappings are not tracked for references. Without this change the
vnode pager's clustering could cause one page to be released by
the pagedaemon, and the rest of the pages in the pageout cluster
made unlikely candidates to be released soon.
1. Fix inverted node order, so that negative value from comparison operator
would represent lower (left) node, and positive - higher (right) node.
2. Add an argument (i.e. "context"), passed to comparison operators.
3. Change rb_tree_insert_node() to return a node - either inserted one or
already existing one.
4. Amend the interface to manipulate the actual object, instead of the
rb_node (in a similar way as Patricia-tree interface does).
5. Update all RB-tree users accordingly.
XXX: Perhaps rename rb.h to rbtree.h, since cleaning-up..
1-3 address the PR/43488 by Jeremy Huddleston.
Passes RB-tree regression tests.
Reviewed by: matt@, christos@
* page out vnode objects
* drain kmem/kernel_map
As long as there is a reasonable memory hardlimit (>600kB or so),
a rump kernel can now survive file system metadata access for an
arbitrary size file system (provided, of course, that the file
system does not use wired kernel memory for metadata ...).
Data handling still needs a little give&take finetuning. The
general problem is that a single vm object can easily be the owner
of all vm pages in a rump kernel. now, if a thread wants to allocate
memory while holding that object locked, there's very little the
pagedaemon can do to avoid deadlock. but I think the problem can
be solved by making an object release a page when it wants to
allocate a page if a) the system is short on memory and b) too many
pages belong to the object. that still doesn't take care of the
pathological situation where 1000 threads hold an object with 1
page of memory locked and try to allocate more. but then again,
running 1000 threads with <1MB of memory is an unlikely scenario.
and ultimately, I call upon the fundamental interaction which is
the basis of why any operating works: luck.
we are short of memory.
There are still some funnies left to iron out. For example, with
a certain file system / memory size configuration it's still not
possible to create enough files to make the file system run out of
inodes before the kernel runs out of memory. Also, with some other
configurations disk access slows down gargantually (though i'm sure
there are >0 buffers available). Anyway, it ~works for now and
it's by no means worse than what it was before.
number currently attached. Deals with a SNAFU in my commit earlier
today which would cause softints established early to lack a
softint context on non-bootstrap CPUs.
structure itself and allocating the backing page directly from the
hypervisor.
* initial write to a large tmpfs file is almost 2x faster
* truncating the file to 0 length after write is over 50% faster
* rewrite of the file is just slightly faster (indicating that
kmem does a good job with caching, as expected)
rump. These move the management of the pid/lwpid space from the
application into the kernel, make code more robust, and make it
possible to attach multiple lwp's to non-proc0 processes.
some idea of how they should be done. This change essentially
moves the responsibility of pid/lwpid management from the application
side into the rump kernel. It also introduces clear rules on what
happens when, i.e. introduces semantics (these semantics will be
documented on the man page, and more importantly in atf tests).
our SCSIPI driver stack. Currently we pretend to be a single CD
controller with an optional host file as the image, but I guess
the sky's the limit.
dmesg porn:
NetBSD 5.99.39 (RUMP-ROAST) #0: Mon Aug 23 11:38:16 CEST 2010
pooka@pain-rustique.localhost:/usr/allsrc/src/sys/rump/librump/rumpkern
total memory = unlimited (host limit)
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "rumpclk" frequency 100 Hz quality 0
root file system type: rumpfs
mainbus0 (root)
scsitest0 at mainbus0
scsibus0 at scsitest0: 2 targets, 1 lun per target
cd0 at scsibus0 target 1 lun 0: <RUMPHOBO, It's a LIE, 0.00> cdrom removable
advance the "first" pointer. This problem triggered only if the
bus was filled in the first round, since the first pointer is at
the end-of-bus only for the bootstrap round.
a CPU. This fixes some heavy-load problems with the pool code when
rump kernels essentially lied and caused the pool code not to do
a proper backdown from the fastpath when a context switch happened
when taking a lock.
because they aren't softint threads. This fixes callouts in
situations where there is nothing else happening in the rump kernel
(i.e. no threads executed which would trigger the softints when
they unschedule).
this more closely when he wakes up. Normally I wouldn't be in such
a huge rush, but due to atf bug #53 the whole test run breaks now.
At least with the KASSERT removed all tests pass again.
in genfs_do_putpages() and uao_put().
Use 'v_uobj.uo_npages' to check for an empty memq.
Put some assertions where these marker pages may not appear.
Ok: YAMAMOTO Takashi <yamt@netbsd.org>
still at the original value and not the schedstate one. This makes
select not miss wakeups in cases where there was a lot of selecting
going on (which is not all that common in a rump kernel).
that we can attach a power management handler. The handler prevents
a suspend if the watchdog is active, to be consistent with other
watchdog drivers.
As discussed on tech-kern.
with rumpfs_puffs for prehistoric reasons which are no longer valid
(namely, only fs components existed back then and there was no /dev
support in rump fs namespace).
accounting. Use wired memory (which can be limited) for meta-data, and
kmem(9) for string allocations.
Close PR/31944. Fix PR/38361 while here. OK ad@.
our window to it. This fixes cases like opening a window at offsets
[8,32] to a file, which would cause host file offset [0,32-8] to
be mapped, i.e. [0,16] inside the window. Obviously, access to
the entire in-window [0,24] range should have been mapped (and
after this fix it is).
kernel ABI (i.e. not i386 or amd64). Due to the "half function,
half macro, all noodles" nature of pmap.h, it's too entangling and
too brittle to keep up with an ifdeffy MI implementation.
the rump kernel by specifying RUMP_MEMLIMIT. In case allocation
over that limit is attempted, essentially pool reclaim and uvm_wait()
is done. The default is to allow to allocate as much as the host
will give.
XXX: uvm_km_alloc and malloc(9) do not currently conform. the
former is easy, the latter requires kern_malloc.c (rump malloc is
currently directly relegated to host malloc).
components which are too bloaty to be included in rumpkern (where
bloaty means "can be easily left out without anyone missing"), but
generally do not require the support of the dev/fs/net factions to
function. As the first one, add ksems. librumpcrypto will migrate
here too once I get my timeslice to deal with the setlists, as most
likely will tty support.
really wanted from this commit was the support for proc_specificdata.
TODO: make creating a new process actually use kern_proc and
maybe even add an interface which starts a process with
"any pid you don't like"
was autoload *not* working with an alternate path. This revision
make the code double as good in the sense that it now works also
in case you *do* want it to work.
through an in-rumpkernel hypermemory allocator which knows it should
kick the pagedaemon and block in case ``waitok'' memory allocation
fails.
This allows us to recover from some out-of-memory situations.
Realworld'istically speaking (as opposed to whatever "should be"
theory), these OOM situations will happen extremely rarely if ever
when our hypervisor is a regular process. Speculatively, this
should be useful for other types of hosts.
issues remaining:
* the hypervisor does not know how to reclaim kernel memory (and
for the reason I stated above, I'm not sure if it makes sense
to teach the current implementation about that)
* vfs memory (buffers, vm object pages etc.) is not reclaimed
Since they are practically never used (only when prehistoric code
uses simple_lock()), their efficiency doesn't matter that much and
we can simply adapt the versions from x86 lock.h.
access. The old scheduler had a global freelist which caused a
cache crisis with multiple host threads trying to schedule a virtual
CPU simultaneously.
The rump scheduler is different from a normal thread scheduler, so
it has different requirements. First, we schedule a CPU for a
thread (which we get from the host scheduler) instead of scheduling
a thread onto a CPU. Second, scheduling points are at every
entry/exit to/from the rump kernel, including (but not limited to)
syscall entry points and hypercalls. This means scheduling happens
a lot more frequently than in a normal kernel.
For every lwp, cache the previously used CPU. When scheduling,
attempt to reuse the same CPU. If we get it, we can use it directly
without any memory barriers or expensive locks. If the CPU is
taken, migrate. Use a lock/wait only in the slowpath. Be very
wary of walking the entire CPU array because that does not lead to
a happy cacher.
The migration algorithm could probably benefit from improved
heuristics and tuning. Even as such, with the new scheduler an
application which has two threads making rlimit syscalls in a tight
loop experiences almost 400% speedup. The exact speedup is difficult
to pinpoint, though, since the old scheduler caused very jittery
results due to cache contention. Also, the rump version is now
70% faster than the counterpart which calls the host kernel.
unicpu configurations (i.e. RUMP_NCPU==1), but are massively faster
than the multiprocessor versions since the fast path does not have
to perform any cache coherent operations. _Applications_ with
lock-happy kernel paths, i.e. _not_ lock microbenchmarks, measure
up to tens of percents speedup on my Core2 Duo. Every globally
atomic state required by normal locks/atomic ops implies a hideous
speed penalty even for the fast path.
While this requires a unicpu configuration, it should be noted that
we are talking about a virtual unicpu configuration. The host can
have as many processors as it desires, and the speed benefit of
virtual unicpu is still there. It's pretty obvious that in terms
of scalability simple workload partitioning and replication into
multiple kernels wins hands down over complicated locking or
locklessing algorithms which depend on globally atomic state.