Commit Graph

148 Commits

Author SHA1 Message Date
perry a2cd732268 Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete. 2005-12-24 19:12:23 +00:00
christos 95e1ffb156 merge ktrace-lwp. 2005-12-11 12:16:03 +00:00
atatat 420d91208b Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code.  I know it's not the prettiest code, but it seems to work rather
well in spite of itself.
2005-06-09 02:19:59 +00:00
christos efb6943313 - add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.
2005-05-29 22:24:14 +00:00
yamt 6b2d8b66a4 merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
  save some resources like pv_entry.  also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.
2005-04-01 11:59:21 +00:00
chs c92634930b fix validation of new values when setting vm.{hi,low}water. fixes PR 29651. 2005-03-31 02:34:10 +00:00
perry da8abec863 nuke trailing whitespace 2005-02-26 21:34:55 +00:00
tls abcbeb46d9 Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there.  Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all?  Consider a machine with two filesystems, one
with a much larger blocksize than the other.  If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_.  Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
2005-01-10 15:29:50 +00:00
dbj b2c6a6a4ea also define bioops if FFS is not defined. 2004-12-23 20:11:28 +00:00
jrf ac053b9ef0 Fix previous commit, got bufcache and bufmem messages reversed. 2004-12-05 06:12:54 +00:00
jrf cd5b5ced44 Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.
2004-12-05 06:00:20 +00:00
christos dfa8d84485 PR/25749: Peter Postma: Missing splx() in kernel. 2004-11-13 19:16:18 +00:00
enami 4fa5fd9ed6 - Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
  pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
  of curve from 0 to lowater mark.
Simon told that this helps for interactive performance when there is heavy
disk activity in PR#27057.
2004-10-04 01:24:18 +00:00
enami 51718e92ee Factor out code to set watermark and ensure high > low. 2004-10-04 00:46:05 +00:00
enami 682c3c9443 - Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).
2004-10-03 08:47:48 +00:00
enami ba25820566 x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful.  also, I guess lowater
# XXX: is better than 0 as origin.
2004-10-03 08:30:09 +00:00
enami 778d21de43 Cheap test first. 2004-10-03 08:17:54 +00:00
yamt 3362d4ed5b fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.
2004-09-18 16:37:12 +00:00
yamt 2bf1a4ef17 - add missing function prototypes.
- fix prototype mismatches.
2004-09-18 16:01:03 +00:00
yamt d08391b2a3 buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.
2004-09-08 10:20:15 +00:00
hannken 7a5be5a9ff - Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
  Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
  to copy-on-write.
  Avoids deadlocks/panics where to clean pages the copy-on-write needs
  to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>
2004-06-20 18:55:58 +00:00
thorpej 3183ea47c2 When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.
2004-06-20 18:29:47 +00:00
thorpej bbbb3183d6 Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.
2004-06-20 18:17:09 +00:00
atatat 5b22e79ada Remaining sysctl descriptions under kern subtree 2004-05-25 04:30:32 +00:00
yamt ab195ed32f bio_doread: vp is always non-NULL here. 2004-04-25 12:41:12 +00:00
christos 6bd1d6d4db Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.
2004-04-21 01:05:31 +00:00
simonb 1c13fd358f Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
  currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
  random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
2004-03-26 00:31:55 +00:00
simonb 07056cd3d1 More white space nits. 2004-03-25 23:17:16 +00:00
simonb c67d420cbf White-space nit. 2004-03-25 08:22:31 +00:00
atatat 19af35fd0d Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.
2004-03-24 15:34:46 +00:00
dan 5819919614 micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work
2004-02-22 01:00:41 +00:00
atatat caea20e952 Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communcating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little.  There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.
2004-02-19 03:56:30 +00:00
yamt 0e9e078e22 - raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().
2004-02-16 09:34:15 +00:00
tls eb9b96577c Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers!  The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit".  In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark nd stay there.  I've adjusted it to use min(page defecit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc".  Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled.  Two adjustments seem to solve this:  1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current defecit?
2004-02-11 17:36:31 +00:00
tls aeaf748ff2 Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache.  This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%.  The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
2004-01-30 11:32:16 +00:00
dan c6ba3edf9d Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required.  Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.
2004-01-27 11:35:23 +00:00
hannken b1cb363c11 Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.
VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp)  Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp)      Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.
2004-01-25 18:02:03 +00:00
yamt ce0a402d3c bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well. 2004-01-19 11:57:42 +00:00
enami 9e2ac76ac4 Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.
2004-01-15 09:03:26 +00:00
yamt 8c55727694 reset i/o priority in geteblk() as well. 2004-01-10 14:43:05 +00:00
yamt 7266a95907 store a i/o priority hint in struct buf for buffer queue discipline. 2004-01-10 14:39:50 +00:00
thorpej 4aeba6790d Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.
2004-01-09 19:01:01 +00:00
tls e4758a97ae Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%.  This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache.  We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory.  However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...
2004-01-09 06:26:15 +00:00
tls 28364b01be Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease().  Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly.  If anything,
it may be a bit more aggressive than intended.  On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds.  Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system?  This
is not obvious to me (I must be looking in the wrong place).  Also,
buf_mrelease() is also called from brelse() in some cases.  Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)?  Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out.  Jason proposed
the use of pool_reclaim as a way to fix it.
2004-01-08 23:41:14 +00:00
atatat 5efc584023 Expose the buf_map symbol so that pmap(1) can find it.
Split the sysctl setup routine into two routines, one for each
"subtree".  Perhaps it's a little pedantic, but it's cleaner.  Also,
assert that the "kern" and "vm" nodes exist.
2004-01-06 13:51:09 +00:00
pk 90cc172b86 bufpool_page_free: pass `buf_map' to uvm_km_free(). 2004-01-04 16:17:13 +00:00
pk dc6d5d0dd1 getnewbuf: return buffer locked. 2003-12-31 14:37:17 +00:00
thorpej 7e958083b1 Consistently use ANSI-style function decls. 2003-12-30 20:40:39 +00:00
pk 70f20a1217 Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes).  It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms.  Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.
2003-12-30 12:33:13 +00:00
dbj 076b9a1a1e when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency
2003-12-02 04:18:19 +00:00