backwards and forwards) if the previous entry was backed by an amap.
Fixes PR kern/18789, where Netscape 7 plus a Java applet actually manage
to incur forward merges and bimerges in userspace.
Code reviewed by fvdl and thorpej.
kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals
kqueue is supported by all writable filesystems in the NetBSD tree
(with the exception of Coda) and by all device drivers supporting poll(2)
based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe
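As an illustration of the API (not part of the commit), a minimal userland
consumer that watches a directory for writes; the path is arbitrary:

	#include <sys/types.h>
	#include <sys/event.h>
	#include <sys/time.h>
	#include <err.h>
	#include <fcntl.h>
	#include <stdio.h>

	int
	main(void)
	{
		struct kevent ev;
		int kq, fd;

		if ((kq = kqueue()) == -1)
			err(1, "kqueue");
		if ((fd = open("/tmp", O_RDONLY)) == -1)
			err(1, "open");
		/* register interest: writes to the directory */
		EV_SET(&ev, fd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
		    NOTE_WRITE, 0, 0);
		if (kevent(kq, &ev, 1, NULL, 0, NULL) == -1)
			err(1, "kevent register");
		/* block until the directory changes */
		if (kevent(kq, NULL, 0, &ev, 1, NULL) == -1)
			err(1, "kevent wait");
		printf("fd %d changed (fflags %#x)\n", (int)ev.ident,
		    (unsigned)ev.fflags);
		return 0;
	}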
allocations can be merged either forwards or backwards, meaning no new
entries will be added to the list, and some can even be merged in both
directions, resulting in a surplus entry.
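A toy model of the idea (illustrative C only; the real code operates on
struct vm_map_entry and checks many more compatibility conditions):

	#include <stdio.h>
	#include <stdlib.h>

	struct entry {				/* stand-in for a map entry */
		struct entry *next;
		unsigned long start, end;	/* [start, end) */
	};

	/*
	 * Satisfy an allocation of [start, end) by extending prev
	 * (a backward merge); if the following entry then abuts,
	 * absorb it too (a bimerge), yielding a surplus entry.
	 */
	static int
	try_merge(struct entry *prev, unsigned long start, unsigned long end)
	{
		struct entry *nxt;

		if (prev == NULL || prev->end != start)
			return 0;
		prev->end = end;			/* backward merge */
		if ((nxt = prev->next) != NULL && nxt->start == end) {
			prev->end = nxt->end;		/* bimerge */
			prev->next = nxt->next;
			free(nxt);			/* surplus entry */
		}
		return 1;
	}

	int
	main(void)
	{
		struct entry *b = malloc(sizeof(*b));
		struct entry *a = malloc(sizeof(*a));

		if (a == NULL || b == NULL)
			return 1;
		b->start = 0x3000; b->end = 0x4000; b->next = NULL;
		a->start = 0x1000; a->end = 0x2000; a->next = b;
		if (try_merge(a, 0x2000, 0x3000))	/* fills the gap */
			printf("one entry: [%#lx, %#lx)\n", a->start, a->end);
		free(a);
		return 0;
	}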
This code typically reduces the number of map entries in the
kernel_map by an order of magnitude or more. It also makes possible
recovery from the pathological case of "5000 processes created and
then killed", which leaves behind a large number of map entries.
The only forward merge case not covered is the instance of an amap
that has to be extended backwards (WIP). Note that this only affects
processes, not the kernel (the kernel doesn't use amaps), and that
merge opportunities like this come up *very* rarely, if at all. Eg,
after being up for eight days, I see only three failures in this
regard, and even those are most likely due to programs I'm developing
to exercise this case.
Code reviewed by thorpej, matt, christos, mrg, chuq, chuck, perry,
tls, and probably others. I'd like to thank my mother, the Hollywood
Foreign Press...
tearing down a vm_map. use this to skip the pmap_update()
at the end of all the removes, which allows pmaps to optimize
pmap tear-down. also, use the new pmap_remove_all() hook to
let the pmap implementation know what we're up to.
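Roughly, the teardown path becomes (a sketch, not the literal uvm_map.c
code):

	pmap_remove_all(map->pmap);	/* hint: entire pmap is going away */
	for (entry = map->header.next; entry != &map->header;
	     entry = entry->next)
		pmap_remove(map->pmap, entry->start, entry->end);
	/* no pmap_update() here: the pmap is being torn down anyway */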
return failure if swap is full and there are no free physical pages.
have malloc() use this flag if M_CANFAIL is passed to it.
use M_CANFAIL to allow amap_extend() to fail when memory is scarce.
this should prevent most of the remaining hangs in low-memory situations.
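The resulting usage pattern, sketched (M_TEMP is just an example type):

	buf = malloc(size, M_TEMP, M_WAITOK | M_CANFAIL);
	if (buf == NULL)
		return ENOMEM;	/* swap full, no free pages: fail, don't hang */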
This merge changes the device switch tables from static arrays to
tables dynamically generated by config(8).
- All device switches are defined as constant structures in device drivers.
- The new grammar ``device-major'' is introduced to ``files''.
	device-major <prefix> char <num> [block <num>] [<rules>]
- All device major numbers must be listed in the port-dependent majors.<arch>
  by using this grammar.
- Added a new naming convention:
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.
- Backward compatibility for loading block/character device
  switches via the LKM framework is broken. This is necessary to convert
  between block/character device majors and device names at runtime.
- The restriction on assigning device majors via LKM is completely removed.
  We no longer need to reserve LKM entries for dynamic loading of device
  switches.
- At compile time, the list of device major numbers is packed into the
  kernel, and the LKM framework refers to it to assign device major numbers
  dynamically.
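A hypothetical example of these conventions, with a made-up driver
``mydev'' and made-up major numbers; the majors.<arch> entry would be

	device-major	mydev	char 99 block 14	mydev

and the driver would correspondingly export constant structures named
mydev_bdevsw and mydev_cdevsw, per the <prefix>_[bc]devsw rule.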
the page is still loaned to an anon, we should put the page back on a
paging queue. this is because while pages loaned to the kernel really
do need to stay resident (since the kernel is accessing the physical
memory directly), pages loaned to anons can be paged out just fine.
(the page will be paged out twice, first to the object and then again
to the anon, but after that the page can be reused.)
-pass vm_physseg* instead of a physseg index, and a PFN (int) instead
of a physical address (this could be taken even further)
-simplify detection of boundary crossing and behave more intelligently
in this case
-take stuff out of the inner loops, or put into "#ifdef DEBUG"
(because we move along physsegs we don't need to check that the
pages are physically contiguous)
-make the "simple" and "contiguous" branches look more uniform; at
least the outer loops might coalesce one day
Makoto Fujiwara <makoto@ki.nu> and Manuel Bouyer <bouyer@netbsd.org>.
Help from Allen Briggs, Jason Thorpe, and Matt Thomas.
We need to call cpu_cache_probe() early in boot (machdep.c).
Add 603 info for completeness, and use NBPG not PAGESIZE, as the
latter relies on uvm being set up (cpu_subr.c).
Let uvm_page_recolor() be called before uvm has been set up; just
note the page coloring value (uvm_page.c).
obey the preferences expressed by freelist assignment,
to avoid wasting valuable "low memory" on devices which
don't really need it.
comments:
-I'm not sure searching the physsegs within a freelist
beginning with the biggest is the right thing. This is
what the "memory steal" code in uvm_page.c does, so
keep it consistent.
-There seems to be some confusion whether the upper
address limit passed is inclusive or not. Stay on
the safe side, possibly leaving one page out.
-The boundary/pagemask check can be simplified; also, some
arguments passed are only used for diagnostic checks.
-Integration with UVM_PAGE_TRKOWN???
no alignment / boundary / nsegs restrictions apply.
This one doesn't insist on a contiguous range, and it honours the "waitok"
flag, thus succeeds in situations which were hopeless with the existing one.
(A solution which searches for a minimum number of contiguous ranges using
some best-fit or so algorithm would be expensive to implement; I believe the
"either-or" done here does reflect the current use by bus_dma quite well.)
Now agp memory allocation is robust for me. (tested on i810)
we can't simply reuse the pointer to the page. Instead, we need to
acquire it again. So, rearrange the loop like genfs_putpages() does.
Reviewed by chuq.
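The resulting loop shape, sketched (uvm names; locking and the actual
work on the page are elided):

	for (off = start; off < end; off += PAGE_SIZE) {
		/*
		 * re-look the page up on each pass; a sleep may have
		 * invalidated any pointer held across iterations.
		 */
		pg = uvm_pagelookup(uobj, off);
		if (pg == NULL)
			continue;
		/* ... operate on pg ... */
	}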
This makes `tail -<N> <FILE> | cat > file' work correctly, where <FILE> is
a regular file larger than 10 Mbytes (which makes tail map part of the file)
and <N> is big enough to produce output larger than 8 kbytes (which makes
the pipe use the page loan facility). Problem reported by FUKAUMI Naoki on
a Japanese local mailing list.
uvm_swap_stats(). This is done in order to allow COMPAT_* swapctl()
emulation to use it directly without going through sys_swapctl().
The problem with using sys_swapctl() there is that it involves
copying the swapent array to the stackgap, and this array's size
is not known at build time. Hence it would not be possible to
ensure it would fit in the stackgap in any case.
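The split-out entry point has roughly this shape (signature from memory
and may differ in detail):

	void	uvm_swap_stats(int cmd, struct swapent *sep, int sec,
		    register_t *retval);

so compat code can fill a swapent array it sized and allocated itself,
with no staging through the stackgap.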
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:
* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers (see
  the sketch after this list).
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.
From art@openbsd.org.
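A sketch of the resulting interface, from memory (the pool name and item
type are made up; exact fields and arguments may differ):

	struct pool_allocator {
		void	*(*pa_alloc)(struct pool *, int flags);
		void	(*pa_free)(struct pool *, void *);
		unsigned int pa_pagesz;
		/* plus internal state linking all pools that share it */
	};

	pool_init(&mypool, sizeof(struct myobj), 0, 0, 0, "myobjpl",
	    &pool_allocator_nointr);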
just skip that page. this situation can arise legitimately when a file
with a wired mapping is truncated so that a wired page is no longer
part of the file.
from VM_FAULT_WIRE in that when the pages being wired are faulted in,
the simulated fault is at the maximum protection allowed for the mapping
instead of the current protection. use this in uvm_map_pageable{,_all}()
to fix the problem where writing via ptrace() to shared libraries that
are also mapped with wired mappings in another process causes a
diagnostic panic when the wired mapping is removed.
this is a really obscure problem so it deserves some more explanation.
ptrace() writing to another process ends up down in uvm_map_extract(),
which for MAP_PRIVATE mappings (such as shared libraries) will cause
the amap to be copied or created. then the amap is made shared
(ie. the AMAP_SHARED flag is set) between the kernel and the ptrace()d
process so that the kernel can modify pages in the amap and have the
ptrace()d process see the changes. then when the page being modified
is actually faulted on, the object page (from the shared library vnode)
is copied to a new anon page and inserted into the shared amap.
to make all the processes sharing the amap actually see the new anon
page instead of the vnode page that was there before, we need to
invalidate all the pmap-level mappings of the vnode page in the pmaps
of the processes sharing the amap, but we don't have a good way of
doing this. the amap doesn't keep track of the vm_maps which map it.
so all we can do at this point is to remove all the mappings of the
page with pmap_page_protect(), but this has the unfortunate side-effect
of removing wired mappings as well. removing wired mappings with
pmap_page_protect() is a legitimate operation; it can happen when a file
with a wired mapping is truncated. so the pmap has no way of knowing
whether a request to remove a wired mapping is normal or due to
this weird situation. so the pmap has to remove the weird mapping.
the process being ptrace()d goes away and life continues. then,
much later when we go to unwire or remove the wired vm_map mapping,
we discover that the pmap mapping has been removed when it should
still be there, and we panic.
so where did we go wrong? the problem is that we don't have any way
to update just the pmap mappings that need to be updated in this
scenario. we could invent a mechanism to do this, but that is much
more complicated than this change and it doesn't seem like the right
way to go in the long run either.
the real underlying problem here is that wired pmap mappings just
aren't a good concept. one of the original properties of the pmap
design was supposed to be that all the information in the pmap could
be thrown away at any time and the VM system could regenerate it all
through fault processing, but wired pmap mappings don't allow that.
a better design for UVM would not require wired pmap mappings,
and Chuck C. and I are talking about this, but it won't be done
anytime soon, so this change will do for now.
this change has the effect of causing MAP_PRIVATE mappings to be
copied to anonymous memory when they are mlock()d, so that uvm_fault()
doesn't need to copy these pages later when called from ptrace(), thus
avoiding the call to pmap_page_protect() and the panic that results
from this when the mlock()d region is unlocked or freed. note that
this change doesn't help the case where the wired mapping is MAP_SHARED.
discussed at great length with Chuck Cranor.
fixes PRs 10363, 12554, 12604, 13041, 13487, 14580 and 14853.
we need to make sure that vnode pages are written to disk at least once,
otherwise processes could gain access to whatever data was previously stored
in disk blocks which are freshly allocated to a file.
uobject and uanon pointers rather than at the PQ_ANON flag to determine
which lock to hold, since PQ_ANON can be clear even when the anon's lock
is the one which we should hold (if the page was loaned from an object
and then freed by the object).
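The rule, sketched (simple_lock-era field names):

	if (pg->uobject != NULL)
		lock = &pg->uobject->vmobjlock;	/* object's lock */
	else
		lock = &pg->uanon->an_lock;	/* anon's, even if PQ_ANON is clear */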
if the vec pointer is valid rather than using uvm_useracc().
uvm_useracc() just tells you if the permissions of a user mapping allow
the desired access, not whether faulting on that mapping will succeed.
will be allocated for the respective usage types when there is contention
for memory.
replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names
and sysctl names.
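For example, with the new names (values purely illustrative):

	sysctl -w vm.anonmin=10
	sysctl -w vm.execmin=5
	sysctl -w vm.filemin=10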
- fix the loaned case in uvm_pagefree().
- redo uvmexp.swpgonly accounting to work with page loaning.
add an assertion before each place we adjust uvmexp.swpgonly.
- fix uvm_km_pgremove() to always free any swap space associated with
the range being removed.
- get rid of UVM_LOAN_WIRED flag. instead, we just make sure that
pages loaned to the kernel are never on the page queues.
this allows us to assert that pages are not loaned and wired
at the same time.
- add yet more assertions.
(either the current protection or the max protection) that reference
vnodes associated with a file system mounted with the NOEXEC option.
uvm_mmap(): Don't allow PROT_EXEC mappings to be established for vnodes
associated with a file system mounted with the NOEXEC option.
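A sketch of the check (illustrative, not the literal code):

	if ((prot & VM_PROT_EXECUTE) != 0 && vp->v_mount != NULL &&
	    (vp->v_mount->mnt_flag & MNT_NOEXEC) != 0)
		return EACCES;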
- Add a new vnode flag, VEXECMAP, which indicates that a vnode has
  executable mappings. Stop overloading VTEXT for this purpose (VTEXT
  also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).
VEXECMAP suggested by Chuq Silvers.