it may return space already in use as free space under some condition.
The symptom of the bug is that exec fails if stack is unlimited on
topdown VM kernel.
uvm_swap_free() being called with a zero slot; this might have been
the reason for crashes with sysvshm and heavy swapping.
(PR kern/22752 by Tom Spindler)
Confirmed by Chuck Silvers.
the `# swap page in use' and `# swap page only' counters. However, at the
time of swap device removal we can no longer figure out how many of the
bad swap pages are actually also `swap only' pages.
So, on swap I/O errors arrange things to not include the bad swap pages in
the `swpgonly' counter as follows: uvm_swap_markbad() decrements `swpgonly'
by the number of bad pages, and the various VM object deallocation routines
do not decrement `swpgonly' for swap slots marked as SWSLOT_BAD.
and make the stack and heap non-executable by default. the changes
fall into two basic catagories:
- pmap and trap-handler changes. these are all MD:
= alpha: we already track per-page execute permission with the (software)
PG_EXEC bit, so just have the trap handler pay attention to it.
= i386: use a new GDT segment for %cs for processes that have no
executable mappings above a certain threshold (currently the
bottom of the stack). track per-page execute permission with
the last unused PTE bit.
= powerpc/ibm4xx: just use the hardware exec bit.
= powerpc/oea: we already track per-page exec bits, but the hardware only
implements non-exec mappings at the segment level. so track the
number of executable mappings in each segment and turn on the no-exec
segment bit iff the count is 0. adjust the trap handler to deal.
= sparc (sun4m): fix our use of the hardware protection bits.
fix the trap handler to recognize text faults.
= sparc64: split the existing unified TSB into data and instruction TSBs,
and only load TTEs into the appropriate TSB(s) for the permissions.
fix the trap handler to check for execute permission.
= not yet implemented: amd64, hppa, sh5
- changes in all the emulations that put a signal trampoline on the stack.
instead, we now put the trampoline into a uvm_aobj and map that into
the process separately.
originally from openbsd, adapted for netbsd by me.
* Remove the "lwp *" argument that was added to vget(). Turns out
that nothing actually used it!
* Remove the "lwp *" arguments that were added to VFS_ROOT(), VFS_VGET(),
and VFS_FHTOVP(); all they did was pass it to vget() (which, as noted
above, didn't use it).
* Remove all of the "lwp *" arguments to internal functions that were added
just to appease the above.
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.
Bump the kernel rev up to 1.6V
http://mail-index.netbsd.org/source-changes/2003/05/08/0068.html
There were some side-effects that I didn't anticipate, and fixing them
is proving to be more difficult than I thought, do just eject for now.
Maybe one day we can look at this again.
Fixes PR kern/21517.
space is advertised to UVM by making virtual_avail and virtual_end
first-class exported variables by UVM. Machine-dependent code is
responsible for initializing them before main() is called. Anything
that steals KVA must adjust these variables accordingly.
This reduces the number of instances of this info from 3 to 1, and
simplifies the pmap(9) interface by removing the pmap_virtual_space()
function call, and removing two arguments from pmap_steal_memory().
This also eliminates some kludges such as having to burn kernel_map
entries on space used by the kernel and stolen KVA.
This also eliminates use of VM_{MIN,MAX}_KERNEL_ADDRESS from MI code,
this giving MD code greater flexibility over the bounds of the managed
kernel virtual address space if a given port's specific platforms can
vary in this regard (this is especially true of the evb* ports).
first step towards per-device MAXPHYS, and has the beneficial side effect
of allowing clustering to MAXPHYS even on systems that need to run with
a reduced MAXBSIZE to get more metadata buffers.
* Remove DEFAULT_PAGE_SIZE. We don't use PAGE_SIZE the way Mach did.
* In uvm_setpagesize(), if we are called with uvmexp.pagesize == 0,
then assert that PAGE_SIZE != 0 (i.e. a constant), and set uvmexp.pagesize
accordingly.
* Provide defaults for MIN_PAGE_SIZE and MAX_PAGE_SIZE if not defined
by <machine/vmparam.h>. If PAGE_SIZE is not a constant, MIN_PAGE_SIZE
and MAX_PAGE_SIZE must be provided.
* If MIN_PAGE_SIZE and MAX_PAGE_SIZE are not equal (i.e. PAGE_SIZE may
not be a constant in all configurations), then ensure that PAGE_SIZE
and friends expand to variable references for LKMs.
* User allocates ZFOD region, but does not actually touch the buffer
to fault in the pages.
* In a loop, user writes this buffer to a network socket, triggering
sosend_loan().
* uvm_loan() calls uvm_loanzero() once for each page in the loaned
region (since the pages have not yet faulted in). This causes a
page to be allocated and zero'd. The result is the kernel spends
a lot of time allocating and zero'ing pages.
This fixes creates a special object which owns a single zero'd page.
This single zero'd page is used to satisfy all loans of non-resident
ZFOD mappings.
Thanks to Allen Briggs for discovering the problem and for providing
an initial patch.
previous entry. (not if the current entry starts at the end of the new
space; that case doesn't take into account if the new space had a specified
alignment).
means that the dynamic linker gets mapped in at the top of available
user virtual memory (typically just below the stack), shared libraries
get mapped downwards from that point, and calls to mmap() that don't
specify a preferred address will get mapped in below those.
This means that the heap and the mmap()ed allocations will grow
towards each other, allowing one or the other to grow larger than
before. Previously, the heap was limited to MAXDSIZ by the placement
of the dynamic linker (and the process's rlimits) and the space
available to mmap was hobbled by this reservation.
This is currently only enabled via an *option* for the i386 platform
(though other platforms are expected to follow). Add "options
USE_TOPDOWN_VM" to your kernel config file, rerun config, and rebuild
your kernel to take advantage of this.
Note that the pmap_prefer() interface has not yet been modified to
play nicely with this, so those platforms require a bit more work
(most notably the sparc) before they can use this new memory
arrangement.
This change also introduces a VM_DEFAULT_ADDRESS() macro that picks
the appropriate default address based on the size of the allocation or
the size of the process's text segment accordingly. Several drivers
and the SYSV SHM address assignment were changed to use this instead
of each one picking their own "default".
(there are still some details to work out) but expect that to go
away soon. To support these basic changes (creation of lfs_putpages,
lfs_gop_write, mods to lfs_balloc) several other changes were made, to
wit:
* Create a writer daemon kernel thread whose purpose is to handle page
writes for the pagedaemon, but which also takes over some of the
functions of lfs_check(). This thread is started the first time an
LFS is mounted.
* Add a "flags" parameter to GOP_SIZE. Current values are
GOP_SIZE_READ, meaning that the call should return the size of the
in-core version of the file, and GOP_SIZE_WRITE, meaning that it
should return the on-disk size. One of GOP_SIZE_READ or
GOP_SIZE_WRITE must be specified.
* Instead of using malloc(...M_WAITOK) for everything, reserve enough
resources to get by and use malloc(...M_NOWAIT), using the reserves if
necessary. Use the pool subsystem for structures small enough that
this is feasible. This also obsoletes LFS_THROTTLE.
And a few that are not strictly necessary:
* Moves the LFS inode extensions off onto a separately allocated
structure; getting closer to LFS as an LKM. "Welcome to 1.6O."
* Unified GOP_ALLOC between FFS and LFS.
* Update LFS copyright headers to correct values.
* Actually cast to unsigned in lfs_shellsort, like the comment says.
* Keep track of which segments were empty before the previous
checkpoint; any segments that pass two checkpoints both dirty and
empty can be summarily cleaned. Do this. Right now lfs_segclean
still works, but this should be turned into an effectless
compatibility syscall.
we read-lock the map and call uvm_map_lookup_entry() instead of simply
walking from the header to the next and to the next, etc.
Dumping from sparsely populated amaps could cause faults that would
result in amaps being split, which (in turn) resulted in the core
dumping routines dumping some regions of memory twice. This makes the
core file too large, the headers not match, gdb not work properly,
and so on.
Addresses PR 19260.
malloc types into a structure, a pointer to which is passed around,
instead of an int constant. Allow the limit to be adjusted when the
malloc type is defined, or with a function call, as suggested by
Jonathan Stone.
request may contain PGO_DONTCARE and nfs_getpages may unbusy them on error.
Fix is provided in PR#20028 by YAMAMOTO Takashi. (and same one is approved
by chuq while ago in private mail). It was my fault to forget to commit.
allocated ppref data to zero in the case of an amap that has empty
space at the front.
Don't set anything in the ppref array if "len" is zero.
Many thanks to Sami Kantoluoto for providing gdb access to a machine
that would reliably crash with problems related to the above, and to
Stephan Thesing for corroborating that the patch properly addressed
the problem.
Note that the ar_pageoff (and related variables) types must be changed
soon. The use of "int" here is not theoretically sufficient.
to sleep. Define UVM_KMF_NOWAIT in terms of UVM_FLAG_NOWAIT.
From Manuel Bouyer. Fixes a problem where any mapping with
read protection was created in a "nowait" context, causing
spurious failures.
uvm_map(). Change uvm_map() to honnor UVM_KMF_NOWAIT. For this, change
amap_extend() to take a flags parameter instead of just boolean for
direction, and introduce AMAP_EXTEND_FORWARDS and AMAP_EXTEND_NOWAIT flags
(AMAP_EXTEND_BACKWARDS is still defined as 0x0, to keep the code easier to
read).
Add a flag parameter to uvm_mapent_alloc().
This solves a problem a pool_get(PR_NOWAIT) could trigger a pool_get(PR_WAITOK)
in uvm_mapent_alloc().
Thanks to Chuck Silvers, enami tsugutomo, Andrew Brown and Jason R Thorpe
for feedback.
backed by physical pages (ie. because it reused a previously-freed one),
so that we can skip a bunch of useless work in that case.
this fixes the underlying problem behind PR 18543, and also speeds up fork()
quite a bit (eg. 7% on my pc, 1% on my ultra2) when we get a cache hit.
delay freeing the old am_ppref so that if we bail early due to
malloc() failures, valid ppref data hasn't been freed for no reason.
Based on comments from enami.
with:
Case #1 -- adjust offset: The slot offset in the aref can be
decremented to cover the required size addition.
Case #2 -- move pages and adjust offset: The slot offset is not large
enough, but the amap contains enough inactive space *after* the mapped
pages to make up the difference, so active slots are slid to the "end"
of the amap, and the slot offset is, again, adjusted to cover the
required size addition. This optimizes for hitting case #1 again on
the next small extension.
Case #3 -- reallocate, move pages, and adjust offset: There is not
enough inactive space in the amap, so the arrays are reallocated, and
the active pages are copied again to the "end" of the amap, and the
slot offset is adjusted to cover the required size. This also
optimizes for hitting case #1 on the next backwards extension.
This provides the missing piece in the "forward extension of
vm_map_entries" logic, so the merge failure counters have been
removed.
Not many applications will make any use of this at this time (except
for jvms and perhaps gcc3), but a "top-down" memory allocator will use
it extensively.
backwards and forwards) if the previous entry was backed by an amap.
Fixes pr kern/18789, where netscape 7 + a java applet actually manage
to incur forward and bimerges in userspace.
Code reviewed by fvdl and thorpej.
kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals
kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)
based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe
allocations can be merged either forwards or backwards, meaning no new
entries will be added to the list, and some can even be merged in both
directions, resulting in a surplus entry.
This code typically reduces the number of map entries in the
kernel_map by an order of magnitude or more. It also makes possible
recovery from the pathological case of "5000 processes created and
then killed", which leaves behind a large number of map entries.
The only forward merge case not covered is the instance of an amap
that has to be extended backwards (WIP). Note that this only affects
processes, not the kernel (the kernel doesn't use amaps), and that
merge opportunities like this come up *very* rarely, if at all. Eg,
after being up for eight days, I see only three failures in this
regard, and even those are most likely due to programs I'm developing
to exercise this case.
Code reviewed by thorpej, matt, christos, mrg, chuq, chuck, perry,
tls, and probably others. I'd like to thank my mother, the Hollywood
Foreign Press...
tearing down a vm_map. use this to skip the pmap_update()
at the end of all the removes, which allows pmaps to optimize
pmap tear-down. also, use the new pmap_remove_all() hook to
let the pmap implemenation know what we're up to.
return failure if swap is full and there are no free physical pages.
have malloc() use this flag if M_CANFAIL is passed to it.
use M_CANFAIL to allow amap_extend() to fail when memory is scarce.
this should prevent most of the remaining hangs in low-memory situations.