Commit Graph

128 Commits

hannken
7a5be5a9ff - Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
  Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
  to copy-on-write.
  Avoids deadlocks/panics where the copy-on-write needs to allocate
  pages for its VOP_PUTPAGES() in order to clean pages.

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>
2004-06-20 18:55:58 +00:00
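
A minimal sketch of the L_COWINPROGRESS guard, assuming the flag is
tested and set on curlwp as the message describes; the function name
and body are illustrative, not the committed code:

    #include <sys/buf.h>
    #include <sys/proc.h>

    /* Guard against copy-on-write recursion via a per-LWP flag. */
    static int
    copy_on_write(struct buf *bp)
    {
        struct lwp *l = curlwp;

        if (l->l_flag & L_COWINPROGRESS)
            return 0;           /* already copying; don't recurse */

        l->l_flag |= L_COWINPROGRESS;
        /* ... copy the affected blocks to the snapshot ... */
        l->l_flag &= ~L_COWINPROGRESS;
        return 0;
    }
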
thorpej
3183ea47c2 When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.
2004-06-20 18:29:47 +00:00
thorpej
bbbb3183d6 Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.
2004-06-20 18:17:09 +00:00
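
The same effect is available through the standard pool API; a sketch,
with the pool name illustrative:

    #include <sys/pool.h>

    extern struct pool bufpool;     /* illustrative buffer cache pool */

    void
    bufpool_limit(void)
    {
        /* Cache at most one idle page in the pool; anything beyond
         * that is returned to the system, matching what
         * PR_IMMEDRELEASE achieved. */
        pool_sethiwat(&bufpool, 1);
    }
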
atatat
5b22e79ada Remaining sysctl descriptions under kern subtree 2004-05-25 04:30:32 +00:00
yamt
ab195ed32f bio_doread: vp is always non-NULL here. 2004-04-25 12:41:12 +00:00
christos
6bd1d6d4db Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.
2004-04-21 01:05:31 +00:00
simonb
1c13fd358f Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
  currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
  random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
2004-03-26 00:31:55 +00:00
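
A sketch of the serviced check, with illustrative variable names
(bufmem and the high water mark are in bytes); the probabilistic
growth scheme itself is described in the 2004-01-30 entry below:

    #include <sys/param.h>
    #include <sys/systm.h>

    extern u_long bufmem, bufmem_hiwater;   /* illustrative names */

    static int
    buf_lotsfree(void)
    {
        int try, thresh;

        /* Early out: at or above the high water mark there is never
         * "lots free", and the call to random() is skipped. */
        if (bufmem >= bufmem_hiwater)
            return 0;

        /* Widen before scaling: 16 * bufmem can overflow 32 bits and
         * erroneously report room above the high water mark. */
        try = random() & 0x0f;
        thresh = (int)((16ULL * bufmem) / bufmem_hiwater);
        return try >= thresh;
    }
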
simonb
07056cd3d1 More white space nits. 2004-03-25 23:17:16 +00:00
simonb
c67d420cbf White-space nit. 2004-03-25 08:22:31 +00:00
atatat
19af35fd0d Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.
2004-03-24 15:34:46 +00:00
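
For reference, a hedged sketch of a call under the reworked API,
modeled on the vm.bufmem node added in the 2004-03-26 entry above;
the flag combination and description string are illustrative:

    #include <sys/sysctl.h>

    extern u_int bufmem;    /* type chosen to match CTLTYPE_INT here */

    SYSCTL_SETUP(sysctl_vm_buf_setup, "sysctl vm.bufmem setup")
    {
        sysctl_createv(clog, 0, NULL, NULL,
            CTLFLAG_PERMANENT | CTLFLAG_READONLY,
            CTLTYPE_INT, "bufmem",
            SYSCTL_DESCR("Amount of kernel memory used by buffers"),
            NULL, 0, &bufmem, 0,
            CTL_VM, VM_BUFMEM, CTL_EOL);
    }
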
dan
5819919614 micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work
2004-02-22 01:00:41 +00:00
atatat
caea20e952 Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little.  There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.
2004-02-19 03:56:30 +00:00
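
The macros are plausibly just width-bridging casts; a sketch:

    /* Casting through uintptr_t keeps the pointer/integer round trip
     * well-defined on both ILP32 and LP64 kernels, so one structure
     * layout can serve 32-bit and 64-bit userland alike. */
    #define PTRTOUINT64(p)  ((u_int64_t)(uintptr_t)(p))
    #define UINT64TOPTR(u)  ((void *)(uintptr_t)(u))
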
yamt
0e9e078e22 - raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().
2004-02-16 09:34:15 +00:00
tls
eb9b96577c Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers!  The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit".  In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there.  I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc".  Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled.  Two adjustments seem to solve this:  1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
2004-02-11 17:36:31 +00:00
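
A sketch of the adjusted calculation, following the reasoning above;
names and structure are illustrative rather than the committed code:

    #include <sys/param.h>
    #include <sys/buf.h>
    #include <uvm/uvm_extern.h>

    extern u_long bufmem;                           /* bytes */
    extern TAILQ_HEAD(bqueues, buf) bufqueues[];    /* BQ_AGE et al. */

    static u_long
    buf_canrelease(void)
    {
        u_long ninvalid = 0, pagedeficit;
        int pdelta;
        struct buf *bp;

        /* The AGE list is effectively "delayed free": always offer
         * back at least the memory it holds. */
        TAILQ_FOREACH(bp, &bufqueues[BQ_AGE], b_freelist)
            ninvalid += bp->b_bufsize;

        /* freetarg and free are in pages; convert to bytes before
         * comparing with byte counts. */
        pdelta = uvmexp.freetarg - uvmexp.free;
        pagedeficit = (pdelta > 0) ? ptob(pdelta) : 0;

        return MAX(ninvalid, MIN(pagedeficit, bufmem / 16));
    }
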
tls
aeaf748ff2 Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache.  This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%.  The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
2004-01-30 11:32:16 +00:00
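
A sketch of the buf_drain() placement described in the last paragraph;
the loop shape is illustrative of the pagedaemon, not its real code:

    #include <sys/buf.h>
    #include <uvm/uvm_extern.h>

    void
    pagedaemon_loop(void)
    {
        int bufcnt;

        for (;;) {
            /* ... sleep until a page shortage wakes us ... */

            /* Drain the buffer cache first, while uvmexp.free and
             * uvmexp.freetarg still reflect the pre-scan state. */
            bufcnt = uvmexp.freetarg - uvmexp.free;
            if (bufcnt < 0)
                bufcnt = 0;
            buf_drain(bufcnt << PAGE_SHIFT);    /* amount in bytes */

            /* ... then run the page scan ... */
        }
    }
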
dan
c6ba3edf9d Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required.  Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.
2004-01-27 11:35:23 +00:00
hannken
b1cb363c11 Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.
VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp)  Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp)      Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.
2004-01-25 18:02:03 +00:00
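
A sketch of the two call styles after the change; the surrounding
function is illustrative, and the two calls are alternatives, not a
sequence:

    #include <sys/buf.h>
    #include <sys/conf.h>
    #include <sys/vnode.h>

    static void
    strategy_examples(struct vnode *vp, struct buf *bp)
    {
        /* File system I/O: call the strategy routine of vp for bp. */
        (void)VOP_STRATEGY(vp, bp);

        /* Block-to-block device situations: call the d_strategy
         * routine of bp->b_dev for bp, bypassing any vnode. */
        DEV_STRATEGY(bp);
    }
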
yamt
ce0a402d3c bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well. 2004-01-19 11:57:42 +00:00
enami
9e2ac76ac4 Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.
2004-01-15 09:03:26 +00:00
yamt
8c55727694 reset i/o priority in geteblk() as well. 2004-01-10 14:43:05 +00:00
yamt
7266a95907 store an i/o priority hint in struct buf for buffer queue discipline. 2004-01-10 14:39:50 +00:00
thorpej
4aeba6790d Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.
2004-01-09 19:01:01 +00:00
tls
e4758a97ae Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%.  This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache.  We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory.  However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...
2004-01-09 06:26:15 +00:00
tls
28364b01be Add a pool_reclaim() call on the pool to which we just pool_put() a buffer in
buf_mrelease().  Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly.  If anything,
it may be a bit more aggressive than intended.  On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds.  Running "find / -type d -exec ls -l {} \;" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system?  This
is not obvious to me (I must be looking in the wrong place).  Also,
buf_mrelease() is called from brelse() in some cases.  Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)?  Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out.  Jason proposed
the use of pool_reclaim as a way to fix it.
2004-01-08 23:41:14 +00:00
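
A sketch of the fix, where bufpool_for_size() is a hypothetical helper
mapping a buffer size to the pool it came from:

    #include <sys/pool.h>

    struct pool *bufpool_for_size(size_t);  /* hypothetical helper */

    static void
    buf_mrelease(void *addr, size_t size)
    {
        struct pool *pp = bufpool_for_size(size);

        pool_put(pp, addr);
        /* Without this, pages freed to the pool stay cached there
         * and are never available for any other use in the system. */
        pool_reclaim(pp);
    }
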
atatat
5efc584023 Expose the buf_map symbol so that pmap(1) can find it.
Split the sysctl setup routine into two routines, one for each
"subtree".  Perhaps it's a little pedantic, but it's cleaner.  Also,
assert that the "kern" and "vm" nodes exist.
2004-01-06 13:51:09 +00:00
pk
90cc172b86 bufpool_page_free: pass `buf_map' to uvm_km_free(). 2004-01-04 16:17:13 +00:00
pk
dc6d5d0dd1 getnewbuf: return buffer locked. 2003-12-31 14:37:17 +00:00
thorpej
7e958083b1 Consistently use ANSI-style function decls. 2003-12-30 20:40:39 +00:00
pk
70f20a1217 Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes).  It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms.  Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.
2003-12-30 12:33:13 +00:00
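
A sketch of the pool-based scheme: one pool per power-of-two buffer
size, so e.g. a 1k fragment costs 1k of pool memory rather than a
whole page.  The pool count, sizes, and names are illustrative:

    #include <sys/pool.h>

    #define NMEMPOOLS   7   /* illustrative: 1k up to 64k */

    static struct pool bmempools[NMEMPOOLS];

    static void
    bufpool_init(void)
    {
        int i;

        for (i = 0; i < NMEMPOOLS; i++) {
            size_t size = 1 << (i + 10);    /* 1k, 2k, 4k, ... */

            /* Real code would give each pool a distinct name and
             * possibly a dedicated backend allocator. */
            pool_init(&bmempools[i], size, 0, 0, 0, "bufmem", NULL);
        }
    }

Allocation then rounds each request up to the nearest pool size; the
LRU caveat above arises because a recycled buffer's memory may return
to a different pool than the one a new request draws from.
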
dbj
076b9a1a1e When DEBUG is defined and debug_verify_freelist != 0,
perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency.
2003-12-02 04:18:19 +00:00
dbj
2162bce654 add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
 buffer from its freelist without having to know exactly which
 freelist it is on.
2003-12-02 03:36:33 +00:00
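
The trick the comment documents looks roughly like this, following
the classic vfs_bio.c shape (bufqueues/BQUEUES as in that file; note
the deliberate assumption about how TAILQ_REMOVE expands):

    #include <sys/buf.h>
    #include <sys/systm.h>

    void
    bremfree(struct buf *bp)
    {
        struct bqueues *dp = NULL;

        /*
         * TAILQ_REMOVE() only dereferences the head to patch
         * tqh_last when the removed element is the last one, so
         * search for the owning freelist only in that case.
         */
        if (bp->b_freelist.tqe_next == NULL) {
            for (dp = bufqueues; dp < &bufqueues[BQUEUES]; dp++)
                if (dp->tqh_last == &bp->b_freelist.tqe_next)
                    break;
            if (dp == &bufqueues[BQUEUES])
                panic("bremfree: lost dp");
        }
        TAILQ_REMOVE(dp, bp, b_freelist);
    }
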
dbj
84865d5d4f protect a few uses of buf's b_flags with b_interlock 2003-11-08 04:22:35 +00:00
yamt
4e746c95f7 in getblk(), don't call allocbuf() for B_LOCKED buffers.
LFS misses the total size of B_LOCKED buffers (locked_queue_bytes) when
getblk() re-sizes them.

XXX maybe needs a better fix.
2003-09-24 10:44:44 +00:00
yamt
1c9095a5b6 buffer with B_CALL shouldn't be brelse'ed. assert it. 2003-09-07 11:59:40 +00:00
yamt
059404deaf bremfree needs bqueue_slock held. assert it. 2003-09-07 11:57:43 +00:00
agc
aad01611e7 Move UCB-licensed code from 4-clause to 3-clause licence.
Patches provided by Joel Baker in PR 22364, verified by myself.
2003-08-07 16:26:28 +00:00
yamt
e5655297db remove B_NEEDCOMMIT as it's no longer used. 2003-04-09 12:55:50 +00:00
thorpej
eb14e86676 Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it.  This fixes a few places where either b_dep or b_interlock were
not properly initialized.
2003-02-25 20:35:31 +00:00
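
A plausible shape for the macro, covering exactly the two fields the
message names:

    #define BUF_INIT(bp)                            \
    do {                                            \
        LIST_INIT(&(bp)->b_dep);                    \
        simple_lock_init(&(bp)->b_interlock);       \
    } while (/*CONSTCOND*/ 0)
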
pk
1262bf7cb5 bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.
2003-02-06 11:46:49 +00:00
pk
9df517d22e In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.
2003-02-06 11:22:35 +00:00
pk
408ae56abd Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.
2003-02-06 09:46:46 +00:00
pk
338f31f581 Make the buffer cache code MP-safe. 2003-02-05 21:38:38 +00:00
thorpej
e0d8d366df Merge the nathanw_sa branch. 2003-01-18 10:06:22 +00:00
gehenna
77a6b82b27 Merge the gehenna-devsw branch into the trunk.
This merge changes the device switch tables from static arrays to
tables dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

	device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
  using this grammar.

- Added the new naming convention.
  The name of the device switch must be <prefix>_[bc]devsw for auto-generation
  of device switch tables.

- Backward compatibility for loading block/character device switches via
  the LKM framework is broken. This is necessary to convert from
  block/character device major to device name at runtime and vice versa.

- The restriction on assigning device majors via LKM is completely removed.
  We no longer need to reserve LKM entries for dynamically loaded device
  switches.

- At compile time, the list of device major numbers is packed into the
  kernel, and the LKM framework refers to it when assigning device major
  numbers dynamically.
2002-09-06 13:18:43 +00:00
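
A sketch of the pieces a driver supplies under the new scheme; the
prefix "foo", the major numbers, and the member list are illustrative:

    /* In the port's majors.<arch> file:
     *     device-major    foo    char 99 block 14    foo
     */

    /* Constant device switch, named <prefix>_bdevsw per the new
     * convention so config(8) can generate the tables. */
    const struct bdevsw foo_bdevsw = {
        .d_open = fooopen,
        .d_close = fooclose,
        .d_strategy = foostrategy,
        .d_ioctl = fooioctl,
        .d_dump = foodump,
        .d_psize = foopsize,
        .d_flag = D_DISK,
    };
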
matt
48bbf5f234 Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly.  Use *_FOREACH whenever possible.
2002-09-04 01:32:31 +00:00
hannken
815491c0b3 Remove the old device buffer queue interface.
Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>
2002-08-30 15:43:36 +00:00
thorpej
139cdc3125 Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.
2002-08-25 20:21:33 +00:00
matt
0cb85bc7b9 Eliminate commons. 2002-05-12 23:06:27 +00:00
chs
4d4825010d fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success.  but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.
2002-03-16 23:49:59 +00:00
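
A sketch of the resulting biowait() condition; the body is
illustrative of the pre-MP-safe code of the period:

    #include <sys/buf.h>
    #include <sys/errno.h>
    #include <sys/proc.h>

    int
    biowait(struct buf *bp)
    {
        int s = splbio();

        /* Both B_DONE and B_DELWRI mean no I/O is outstanding, so
         * neither should be slept on. */
        while ((bp->b_flags & (B_DONE | B_DELWRI)) == 0)
            tsleep(bp, PRIBIO + 1, "biowait", 0);
        splx(s);

        if (bp->b_flags & B_ERROR)
            return bp->b_error ? bp->b_error : EIO;
        return 0;
    }
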
thorpej
a180cee23b Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map).  Try to deal with this:

* Group all information about the backend allocator for a pool in a
  separate structure.  The pool references this structure, rather than
  the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
  to become available, but will still fail if it cannot allocate KVA
  space for the pages.  If this happens, carefully drain all pools using
  the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
  some pages, and use that information to make draining easier and more
  efficient.
* Get rid of PR_URGENT.  There was only one use of it, and it could be
  dealt with by the caller.

From art@openbsd.org.
2002-03-08 20:48:27 +00:00
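
A sketch of the grouped allocator and its use; the member list is
abridged and illustrative of the pool(9) interface of the period:

    #include <sys/pool.h>
    #include <sys/queue.h>

    struct pool_allocator {
        void    *(*pa_alloc)(struct pool *, int);
        void     (*pa_free)(struct pool *, void *);
        unsigned int pa_pagesz;
        /* internal: all pools sharing this backend, so siblings can
         * be drained when KVA runs out */
        TAILQ_HEAD(, pool) pa_list;
    };

    extern struct pool bufpool;

    void
    bufpool_setup(void)
    {
        /* pool_init() now names an allocator rather than taking
         * alloc and free functions individually. */
        pool_init(&bufpool, sizeof(struct buf), 0, 0, 0, "bufpl",
            &pool_allocator_nointr);
    }
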