failed because we failed to acquire some resource needed to initiate
the pageout (such as failing to lock an indirect buffer) rather than
a hard i/o error. in this case we just want to reactivate the page(s)
so that we'll try to write them again later.
while I'm here, clean up some DIAGNOSTIC code.
space is already torn down in uvmspace_free() when the vmspace
refrence count reaches 0. Move the shmexit() call into uvmspace_free().
Note that there is a beneficial side-effect of deferring the unmap
to uvmspace_free() -- on systems where TLB invalidations are
particularly expensive, the unmapping of the address space won't
have to cause TLB invalidations; uvmspace_free() is going to be
run in a context other than the exiting process's, so the "pmap is
active" test will evaluate to FALSE in the pmap module.
entry in the map. the old code would walk around the end of the linked list,
through the header entry, and keep going from the first map entry until it
found a gap in the map, at which point it would return an error. if the map
had no gaps then it would loop forever. reported by k-abe@cs.utah.edu.
while I'm here, clean up this function a bit.
also, use MIN() instead of min(), since the latter takes arguments of
type "int" but we're passing it values of type "vaddr_t", which can be
a larger size.
Mach VM's now. Specific changes:
- Pages now need not have all of their mappings removed before being
put on the inactive list. They only need to have the "referenced"
attribute cleared. This makes putting pages onto the inactive list
much more efficient. In order to eliminate redundant clearings of
"refrenced", callers of uvm_pagedeactivate() must now do this
themselves.
- When checking the "modified" attribute for a page (for clearing
PG_CLEAN), make sure to only do it if PG_CLEAN is currently set on
the page (saves a potentially expensive pmap operation).
- When scanning the inactive list, if a page is referenced, reactivate
it (this part was actually added in uvm_pdaemon.c,v 1.27). This
now works properly now that pages on the inactive list are allowed to
have mappings.
- When scanning the inactive list and considering a page for freeing,
remove all mappings, and then check the "modified" attribute if the
page is marked PG_CLEAN.
- When scanning the active list, if the page was referenced since its
last sweep by the scanner, don't deactivate it. (This part was
actually added in uvm_pdaemon.c,v 1.28.)
These changes greatly improve interactive performance during
moderate to high memory and I/O load.
amap_free(): Assert that the amap is locked.
amap_share_protect(): Assert that the amap is locked.
amap_wipeout(): Assert that the amap is locked.
uvm_anfree(): Assert that the anon has a reference count of 0 and is
not locked.
uvm_anon_lockloanpg(): Assert that the anon is locked.
anon_pagein(): Assert that the anon is locked.
uvmfault_anonget(): Assert that the anon is locked.
uvm_pagealloc_strat(): Assert that the uobj or the anon is locked
And fix the problems these have uncovered:
amap_cow_now(): Lock the new anon after allocating it, and unref and
unlock it (rather than lock!) before freeing it in case
of an error condition. This should fix a problem reported
by Dan Carosone using cdrecord on an i386 MP kernel.
uvm_fault(): Case1B -- Lock the new anon afer allocating it, and unlock
it later when we unlock the old anon.
Case2 -- Lock the new anon after allocating it, and unlock
it later by passing it to uvmfault_unlockall() (we set anon
to NULL if we're not doing a promote fault).
pending i/os to complete before returning even if PGO_CLEANIT is not
specified. this fixes two races:
(1) NFS write rpcs vs. setattr operations which truncate the file.
if the truncate doesn't wait for pending writes to complete then
a later write rpc completion can undo the effect of the truncate.
this problem has been reported by several people.
(2) write i/os in disk-based filesystem vs. the disk block being
freed by a truncation, allocated to a new file, and written
again with different data. if the disk driver reorders the requests
and does the second i/o first, the old data will clobber the new,
corrupting the new file. I haven't heard of anyone experiencing
this problem yet, but it's fixed now anyway.
doesn't have the exec bit set, we need to have PROT_EXEC set
in order for some expected mmap/mprotect behavior to work, so
do the last bit slightly differently: if udv_attach() fails, and
the protection (NOT maxprot) doens't include PROT_EXEC, then clear
PROT_EXEC from maxprot and try udv_attach() again.
Sigh, mmap really needs to be rototilled.
in the mmap() call. maxprot is used to create device mappings,
and always including PROT_EXEC causes the mapping to fail on the Alpha
when mapping a non-RAM offset of /dev/mem (which may be sparse, so
instruction fetch from there is disallowed).
use queue.h macros and KASSERT().
address amap offsets in pages instead of bytes.
make amap_ref() and amap_unref() take an amap, offset and length
instead of a vm_map_entry_t.
improve whitespace and comments.
devices will actually be notified if this is the last close.
this allows raidframe swap devices to be marked clean.
also, move the corresponding vref() into swap_on() for symmetry
and improve some comments.
it and free it as appropriate. Activate p2's new address space once
it references p1's.
- uvm_fork(): Make sure the child's vmspace is NULL before calling
uvmspace_share() (the child doens't have one already in this case).
These changes do not change the behavior for the current use of
uvmspace_share() (vfork(2)), but make it possible for an already
running process (such as a kernel thread) to properly attach to
another process's address space.
to the contents of the hint in the map, and the hint saved in the
map only if the two values match. When an unconditional save is
required, the "check" value passed should be map->hint (and the
compiler will optimize the test away). When deleting a map entry,
the new SAVE_HINT() will only change the hint if the entry being
deleted was the hint value (thus preserving any meaningful hint
that may have been there previously, rather than stomping on it).
- Add a missing hint update when deleting the map entry in
uvm_map_entry_unlink(). This is the fix for kern/11125, from
ITOH Yasufumi <itohy@netbsd.org>.
`struct vmspace' has a new field `vm_minsaddr' which is the user TOS.
PS_STRINGS is deprecated in favor of curproc->p_pstr which is derived
from `vm_minsaddr'.
Bump the kernel version number.
that the page being zero'd was not completed and that page zeroing
should be aborted. This may be used by machine-dependent code doing
slow page access to reduce the latency of running a process that has
become runnable while in the middle of doing a slow page zero.
routine. Works similarly fto pmap_prefer(), but allows callers
to specify a minimum power-of-two alignment of the region.
How we ever got along without this for so long is beyond me.
When it wasn't (which could happen on a 4Mb machine with 32kb pages),
uvm_pagealloc_strat could refuse to allocate user memory, while the pagedaemon
didn't think it was worth freeing any more, resulting in the system seizing up.
rlimit in sbrk. Slightly modified from a patch from Artur Grabowski.
- Rearrange code slightly, partially from Artur Grabowski.
- Only adjust vm_dsize if the grow or shrink actually succeeds.
<vm/vm_extern.h> merged into <uvm/uvm_extern.h>
<vm/vm_page.h> merged into <uvm/uvm_page.h>
<vm/pmap.h> has become <uvm/uvm_pmap.h>
this leaves just <vm/vm.h> in NetBSD.
<vm/pglist.h> -> <uvm/uvm_pglist.h>
<vm/vm_inherit.h> -> <uvm/uvm_inherit.h>
<vm/vm_kern.h> -> into <uvm/uvm_extern.h>
<vm/vm_object.h> -> nothing
<vm/vm_pager.h> -> into <uvm/uvm_pager.h>
also includes a bunch of <vm/vm_page.h> include removals (due to redudancy
with <vm/vm.h>), and a scattering of other similar headers.
"off_t" and the return value is a "paddr_t" to allow mappings
at offsets past 2^31 bytes. Somewhat inspired by FreeBSD, which
only changed the offset to a "vm_offset_t".
Includes updates for the i386, pc532 and sh3 mmmmap from Jason Thorpe.
doing a cpu_set_kpc(), just pass the entry point and argument all
the way down the fork path starting with fork1(). In order to
avoid special-casing the normal fork in every cpu_fork(), MI code
passes down child_return() and the child process pointer explicitly.
This fixes a race condition on multiprocessor systems; a CPU could
grab the newly created processes (which has been placed on a run queue)
before cpu_set_kpc() would be performed.
state into global and per-CPU scheduler state:
- Global state: sched_qs (run queues), sched_whichqs (bitmap
of non-empty run queues), sched_slpque (sleep queues).
NOTE: These may collectively move into a struct schedstate
at some point in the future.
- Per-CPU state, struct schedstate_percpu: spc_runtime
(time process on this CPU started running), spc_flags
(replaces struct proc's p_schedflags), and
spc_curpriority (usrpri of processes on this CPU).
- Every platform must now supply a struct cpu_info and
a curcpu() macro. Simplify existing cpu_info declarations
where appropriate.
- All references to per-CPU scheduler state now made through
curcpu(). NOTE: this will likely be adjusted in the future
after further changes to struct proc are made.
Tested on i386 and Alpha. Changes are mostly mechanical, but apologies
in advance if it doesn't compile on a particular platform.
which indicates that the process is actually running on a
processor. Test against SONPROC as appropriate rather than
combinations of SRUN and curproc. Update all context switch code
to properly set SONPROC when the process becomes the current
process on the CPU.
uvm_map_pageable(map, ...) implies unlocking passed map, just before the
function call.
- If we bail out before calling the uvm_map_pageable, unlock the map
by ourself to prevent a panic ``locking against myself''. The panic is,
for example, caused when cdrecord is invoked with too large fifo size.
set up quite a few regular ones (at every fork!), so put interrupt-
safe map setup in the slow path with a __predict_false().
uvm_map_reference(): __predict_false() the check for NULL map.
uvm_map_deallocate(): Likewise.
- Make page free lists have two actual queues: known-zero pages and
pages with unknown contents.
- Implement uvm_pageidlezero(). This function attempts to zero up to
the target number of pages until the target has been reached (currently
target is `all free pages') or until whichqs becomes non-zero (indicating
that a process is ready to run).
- Define a new hook for the pmap module for pre-zero'ing pages. This is
used to zero the pages using uncached access. This allows us to zero
as many pages as we want without polluting the cache.
In order to use this feature, each platform must add the appropropriate
glue in their idle loop.
one pmap and activating another. this isn't actually necessary (since
pmap_activate() and pmap_deactivate() affect only user-level mappings,
which cannot be accessed from interrupts anyway), and pmap_activate()
is very slow on old sun4c sparcs so we can't block interrupts for this long.
this fixes PR 8322.
uvm_page_init() has completed, add a boolean uvm.page_init_done,
and test against that. Use this same boolean (rather than
pmap_initialized) in pmap_growkernel() to determine if we are
being called via uvm_page_init() to grow the kernel address space.
This fixes a problem on some i386 configurations where pmap_init()
itself was needing to have the kernel page table grown, and since
pmap_initialized was not yet set to TRUE, pmap_growkernel() was
choosing the wrong code path.
Fix tested by Havard Eidnes.
Add a new type voff_t (defined as a synonym for off_t) to describe offsets
into uvm objects, and update the appropriate interfaces to use it, the
most visible effect being the ability to mmap() file offsets beyond
the range of a vaddr_t.
Originally by Chuck Silvers; blame me for problems caused by merging this
into non-UBC.
amount of physical memory, divide it by 4, and then allow machine
dependent code to place upper and lower bounds on the size. Export
the computed value to userspace via the new "vm.nkmempages" sysctl.
NKMEMCLUSTERS is now deprecated and will generate an error if you
attempt to use it. The new option, should you choose to use it,
is called NKMEMPAGES, and two new options NKMEMPAGES_MIN and
NKMEMPAGES_MAX allow the user to configure the bounds in the kernel
config file.
default, as the copyright on the main file (ffs_softdep.c) is such
that is has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.
Bump version number to 1.4O
value (KERN_SUCCESS or KERN_RESOURCE_SHORTAGE) indicating if it succeeded
or failed. Change the `wired' and `access_type' arguments to a single
`flags' argument, which includes the access type, and flags:
PMAP_WIRED the old `wired' boolean
PMAP_CANFAIL pmap_enter() is allowed to fail
If PMAP_CANFAIL is not specified, the pmap should behave as it always
has in the face of a drastic resource shortage: fall over dead.
Change the fault handler to deal with failure (which indicates resource
shortage) by unlocking everything, waiting for the pagedaemon to free
more memory, then retrying the fault.
not set, unlock the vnode before calling the device's close routine and
relock it after it returns. tty close routines will sleep waiting for
buffers to drain, which won't happen often times as the other side needs
to grab the vnode lock first.
Make all unmount routines lock the device vnode before calling VOP_CLOSE().
calls to reflect this. Also, block statclock rather than softclock during
in the proclist locking functions, to address a problem reported on
current-users by Sean Doran.
- Fix some locking bugs; a couple of places would return an error condition
without unlocking the map.
- Deal with maps marked WIREFUTURE; if making an entry VM_PROT_NONE ->
anything else, and it is not already marked as wired, wire it.
of some functions. Use these flags in uvm_map_pageable() to determine
if the map is locked on entry (replaces an already present boolean_t
argument `islocked'), and if the function should return with the map
still locked.
pages.
XXX This should be handled better in the future, probably by marking the
XXX page as released, and making uvm_pageunwire() free the page when
XXX the wire count on a released page reaches zero.
* Implement MADV_DONTNEED: deactivate pages in the specified range,
semantics similar to Solaris's MADV_DONTNEED.
* Add MADV_FREE: free pages and swap resources associated with the
specified range, causing the range to be reloaded from backing
store (vnodes) or zero-fill (anonymous), semantics like FreeBSD's
MADV_FREE and like Digital UNIX's MADV_DONTNEED (isn't it SO GREAT
that madvise(2) isn't standardized!?)
As part of this, move the non-map-modifying advice handling out of
uvm_map_advise(), and into sys_madvise().
As another part, implement general amap cleaning in uvm_map_clean(), and
change uvm_map_clean() to only push dirty pages to disk if PGO_CLEANIT
is set in its flags (and update sys___msync13() accordingly). XXX Add
a patchable global "amap_clean_works", defaulting to 1, which can disable
the amap cleaning code, just in case problems are unearthed; this gives
a developer/user a quick way to recover and send a bug report (e.g. boot
into DDB and change the value).
XXX Still need to implement a real uao_flush().
XXX Need to update the manual page.
With these changes, rebuilding libc will automatically cause the new
malloc(3) to use MADV_FREE to actually release pages and swap resources
when it decides that can be done.
* Nothing currently uses this return value.
* It's arguably an abstraction violation.
Fix amap_unadd()'s API to be consistent w/ amap_add()'s: rather than
take a vm_amap * and a slot number, take a vm_aref * and an offset.
It's now actually possible to use amap_unadd() to remove an anon from
an amap.
> XXX (in)sanity check. We don't do proper datasize checking
> XXX for anonymous (or private writable) mmap(). However,
> XXX know that if we're trying to allocate more than the amount
> XXX remaining under our current data size limit, _that_ should
> XXX be disallowed.
This is one link on the chain of lossage known as PR#7897. It's
definitely not the right fix, but it's better than nothing.
sub-structure malloc() failed, it was quite likely that the function
would return success incorrectly. This is this direct cause of the bug
reported in PR#7897. (Thanks to chs for helping to track it down.)
- rather than treating MAP_COPY like MAP_PRIVATE by sheer virtue of it not
being MAP_SHARED, actually convert the MAP_COPY flag into MAP_PRIVATE.
- return EINVAL if MAP_SHARED and MAP_PRIVATE are both included in flags.
which use uvm_vslock() should now test the return value. If it's not
KERN_SUCCESS, wiring the pages failed, so the operation which is using
uvm_vslock() should error out.
XXX We currently just EFAULT a failed uvm_vslock(). We may want to do
more about translating error codes in the future.
pmap_change_wiring(...,FALSE) unless the map entry claims the address
is unwired. This fixes the following scenario, as described on
tech-kern@netbsd.org on Wed 6/16/1999 12:25:23:
- User mlock(2)'s a buffer, to guarantee it will never become
non-resident while he is using it.
- User then does physio to that buffer. Physio calls uvm_vslock()
to lock down the pages and ensure that page faults do not happen
while the I/O is in progress (possibly in interrupt context).
- Physio does the I/O.
- Physio calls uvm_vsunlock(). This calls uvm_fault_unwire().
>>> HERE IS WHERE THE PROBLEM OCCURS <<<
uvm_fault_unwire() calls pmap_change_wiring(..., FALSE),
which now gives the pmap free reign to recycle the mapping
information for that page, which is illegal; the mapping is
still wired (due to the mlock(2)), but now access of the
page could cause a non-protection page fault (disallowed).
NOTE: This could eventually lead to a panic when the user
subsequently munlock(2)'s the buffer and the mapping info
has been recycled for use by another mapping!
the map be at least read-locked to call this function. This requirement
will be taken advantage of in a future commit.
* Write a uvm_fault_unwire() wrapper which read-locks the map and calls
uvm_fault_unwire_locked().
* Update the comments describing the locking contraints of uvm_fault_wire()
and uvm_fault_unwire().
semantics. That is, regardless of the number of mlock/mlockall calls,
an munlock/munlockall actually unlocks the region (i.e. sets wiring count
to 0).
Add a comment describing why uvm_map_pageable() should not be used for
transient page wirings (e.g. for physio) -- note, it's currently only
(ab)used in this way by a few pieces of code which are known to be
broken, i.e. the Amiga and Atari pmaps, and i386 and pc532 if PMAP_NEW is
not used. The i386 GDT code uses uvm_map_pageable(), but in a safe
way, and could be trivially converted to use uvm_fault_wire() instead.
* Provide POSIX 1003.1b mlockall(2) and munlockall(2) system calls.
MCL_CURRENT is presently implemented. MCL_FUTURE is not fully
implemented. Also, the same one-unlock-for-every-lock caveat
currently applies here as it does to mlock(2). This will be
addressed in a future commit.
* Provide the mincore(2) system call, with the same semantics as
Solaris.
* Clean up the error recovery in uvm_map_pageable().
* Fix a bug where a process would hang if attempting to mlock a
zero-fill region where none of the pages in that region are resident.
[ This fix has been submitted for inclusion in 1.4.1 ]
looking up a kernel address, check to see if the address is on this
"interrupt-safe" list. If so, return failure immediately. This prevents
a locking screw if a page fault is taken on an interrupt-safe map in or
out of interrupt context.
setting recursive has no effect! The kernel lock manager doesn't allow
an exclusive recursion into a shared lock. This situation must simply
be avoided. The only place where this might be a problem is the (ab)use
of uvm_map_pageable() in the Utah-derived pmaps for m68k (they should
either toss the iffy scheme they use completely, or use something like
uvm_fault_wire()).
In addition, once we have looped over uvm_fault_wire(), only upgrade to
an exclusive (write) lock if we need to modify the map again (i.e.
wiring a page failed).
don't unlock a kernel map (!!!) and then relock it later; a recursive lock,
as it used in the user map case, is fine. Also, don't change map entries
while only holding a read lock on the map. Instead, if we fail to wire
a page, clear recursive locking, and upgrade back to a write lock before
dropping the wiring count on the remaining map entries.
locks (and thus, never shared locks). Move the "set/clear recursive"
functions to uvm_map.c, which is the only placed they're used (and
they should go away anyhow). Delete some unused cruft.
right access_type to pass to uvm_fault_wire(). This way, if the entry has
VM_PROT_WRITE, and the entry is marked COW, the copy will happen immediately
in uvm_fault(), as if the access were performed.
access_type to pmap_enter() to ensure that when these mappings are accessed,
possibly in interrupt context, that they won't cause mod/ref emulation
page faults.
has PAGEABLE and INTRSAFE flags. PAGEABLE now really means "pageable",
not "allocate vm_map_entry's from non-static pool", so update all map
creations to reflect that. INTRSAFE maps are maps that are used in
interrupt context (e.g. kmem_map, mb_map), and thus use the static
map entry pool (XXX as does kernel_map, for now). This will eventually
change now these maps are locked, as well.
ensure we don't take mod/ref emulation faults in an interrupt context
(e.g. during the i/o operation). This is safe because:
- For a pageout operation, the page is already known to be
modified, and the pagedaemon will pmap_clear_modify() after
the pageout has completed.
- For a pagein operation, pagers must already pmap_clear_modify()
after the pagein operation is complete, because the i/o may have
been done with e.g. programmed i/o.
XXX It would be nice to know the i/o direction so that we can call
XXX pmap_enter() with only the protection and access_type necessary.
to uvm_fault_wire(), to guarantee that the kernel stacks will not
cause even a mod/ref emulation fault.
- uvm_vslock(): pass VM_PROT_NONE until this function is updated.
which can be used in an interrupt context. Use pmap_kenter*() and
pmap_kremove() only for mappings owned by these objects.
Fixes some locking protocol issues related to MP support, and eliminates
all of the pmap_enter vs. pmap_kremove inconsistencies.
are still owned by the object which is paging, and so the test for a kernel
object in uvm_unmap_remove() will cause pmap_remove() to be used instead
of pmap_kremove().
This was a MAJOR source of pmap_remove() vs pmap_kremove() inconsistency
(which caused the busted kernel pmap statistics, and a cause of much
locking hair on MP systems).
level directly, instead of making the caller wrap the calls in
splimp()/splx().
- Add a comment documenting that interrupts that cause memory allocation
must be blocked while the free page queue is locked.
Since interrupts must be blocked while this lock is asserted, tying them
together like this helps to prevent mistakes.
end of the mappable kernel virtual address space. Previously, it would
get called more often than necessary, because the caller only new what
was requested.
Also, export uvm_maxkaddr so that uvm_pageboot_alloc() can grow the
kernel pmap if necessary, as well. Note that pmap_growkernel() must
now be able to handle being called before pmap_init().
releasing any swap resources. if we don't do this, we can
end up with a clean, swap-backed page, which is illegal.
tracked down by Bill Sommerfeld, fixes PR 7578.
the child inherits the stack pointer from the parent (traditional
behavior). Like the signal stack, the stack area is secified as
a low address and a size; machine-dependent code accounts for stack
direction.
This is required for clone(2).
uvmspace_fork().
pmap_fork() is used to "fork a pmap", that is copy data from one pmap
to the other that is NOT related to actual mappings in the pmap, but is
otherwise logically coupled to the address space.