NetBSD

Author	SHA1	Message	Date
hannken	7a5be5a9ff	- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when doing copy-on-write. - Change VFS_SNAPSHOT() to return the snapshot vnode locked. - Make the IO path for copy-on-write and snapshot-read more lightweight. Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs to copy-on-write. Avoids deadlocks/panics where to clean pages the copy-on-write needs to allocate pages for its VOP_PUTPAGES(). L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>	2004-06-20 18:55:58 +00:00
thorpej	3183ea47c2	When initializing the buffer cache memory pools where the size <= PAGE_SIZE, also use the standard allocator on systems that use a direct-mapped memory segment for mapping pool pages.	2004-06-20 18:29:47 +00:00
thorpej	bbbb3183d6	Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high water mark of 1, which will have the same effect. Pointed out back in January by YAMAMOTO Takashi.	2004-06-20 18:17:09 +00:00
atatat	5b22e79ada	Remaining sysctl descriptions under kern subtree	2004-05-25 04:30:32 +00:00
yamt	ab195ed32f	bio_doread: vp is always non-NULL here.	2004-04-25 12:41:12 +00:00
christos	6bd1d6d4db	Replace the statfs() family of system calls with statvfs(). Retain binary compatibility.	2004-04-21 01:05:31 +00:00
simonb	1c13fd358f	Give buf_lotsfree() a bit of a service: - Fix a 32-bit overflow that could erroneously return true even if the currently allocated buffer memory was greater than the high water mark. - Add an early check for bufmem > hiwater to avoid a needless call to random(). - Sprinkle some comments. Add a vm.bufmem sysctl so the current bufmem value can be easily queried from userland. Reviewed by Thor Simon.	2004-03-26 00:31:55 +00:00
simonb	07056cd3d1	More white space nits.	2004-03-25 23:17:16 +00:00
simonb	c67d420cbf	White-space nit.	2004-03-25 08:22:31 +00:00
atatat	19af35fd0d	Tango on sysctl_createv() and flags. The flags have all been renamed, and sysctl_createv() now uses more arguments.	2004-03-24 15:34:46 +00:00
dan	5819919614	micro-optimisation - if we're going to return 0, do so before doing other unnecessary work	2004-02-22 01:00:41 +00:00
atatat	caea20e952	Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by kern.proc, kern.proc2, kern.lwp, and kern.buf. Define more MIB for kern.buf so that specific buffers can be selected (only all/all is supported right now), and use a 32/64 bit agnostic structure for communicating buffer information to userland. Convert systat to the new kern.buf method. Clean up the vm.buf* handling a little. There's no actual need to record the dynamically assigned OIDs, since sysctl_data can tell us what we're looking at. Oh, and fix a typo in a comment.	2004-02-19 03:56:30 +00:00
yamt	0e9e078e22	- raise ipl when calling buf_canrelease() because it traverses buffer queue. - correct/add comments on buf_canrelease().	2004-02-16 09:34:15 +00:00
tls	eb9b96577c	Fix bug noted by yamt@netbsd.org: the UVM free target is in pages, so the last change has us comparing pages to bytes instead of pages to buffers! The consequence was to try to free radically less memory than UVM wanted us to -- though always at least one buffer, which is probably why the results weren't dire. This does suggest that buf_canrelease() could be a lot more conservative about how much to release than "2 * page deficit". In fact, serious trouble seems to ensue if it's not -- when anything else on the system demands enough pages, we slam down to the low water mark and stay there. I've adjusted it to use min(page deficit, buffer memory / 16), which still isn't quite right but seems better. Another change: consider the case of an infinite loop that does "tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs, all the dead metadata will go on the AGE list -- and, until we hit the high-water mark, stay there, at which point it may be slowly recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree() to return 0 if there's anything on the AGE list; 2) whack buf_canrelease() to count the memory used by the AGE list and always return at least that much. This basically turns the AGE list into a "delayed free" list, since we can't entirely eliminate it as we can't free pool items from interrupt context (e.g. from biodone()). To consider: with the bookkeeping corrected, should buf_drain() move back to the _end_ of the pagedaemon, and should the calculation then try to give back at least the current deficit?	2004-02-11 17:36:31 +00:00
tls	aeaf748ff2	Buffer cache fixes to avoid thrashing between high and low water marks and uncontrolled growth. The key fix is from Dan Carasone, who noticed that buf_canfree() was counting in _bytes_ but freeing in _buffers_, which caused the instant drop to lowater observed by some users. We now control the rate of growth; the probability of getting a new allocation is inversely proportional to the current size of the cache. This idea is from a long-ago conversation with Kirk McKusick and, if memory serves, was used for the file-system cache in some other BSD variant at some point in history. With growth and shrinkage more or less dealt with, we return the default maximum cache size to 15%. The default _minimum_ cache size is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was chosen when the maximum size was 30% of memory. Finally, after observing the behaviour of the pagedaemon and the buffer cache drainer under pathological workloads (e.g. a benchmark that steps through 75% of available memory backwards) I have moved the call to buf_drain() to the beginning of the pagedaemon from the end; if the pagedaemon bogs down, it still won't get run as often as it should, but at least this way it will see the state of the free count and free target _before_ the scan step does its thing.	2004-01-30 11:32:16 +00:00
dan	c6ba3edf9d	Reduce the default BUFCACHE to 10% for now. Too many users are tripping over this getting too large, and suffering other performance problems due to the lack of good backpressure shrinking the bufcache when other memory is required. Again, this tunable should be revisited when the backpressure mechanism has been improved. sysctl vm.bufcache can be used to manually tune those rare machines that might need more than this. See comments in rev 1.106 for more detail.	2004-01-27 11:35:23 +00:00
hannken	b1cb363c11	Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern. VOP_STRATEGY(bp) is replaced by one of two new functions: - VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp. - DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp. DEV_STRATEGY(bp) is used only for block-to-block device situations.	2004-01-25 18:02:03 +00:00
yamt	ce0a402d3c	bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.	2004-01-19 11:57:42 +00:00
enami	9e2ac76ac4	Obviously, sizeof(u_int) is not enough to copy struct buf. Prevents ``sysctl -a'' from dumping core.	2004-01-15 09:03:26 +00:00
yamt	8c55727694	reset i/o priority in geteblk() as well.	2004-01-10 14:43:05 +00:00
yamt	7266a95907	store an i/o priority hint in struct buf for buffer queue discipline.	2004-01-10 14:39:50 +00:00
thorpej	4aeba6790d	Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim() on those pools; it is no longer necessary.	2004-01-09 19:01:01 +00:00
tls	e4758a97ae	Change BUFCACHE (default hard limit on physmem consumption by metadata cache) from 30% to 20%. This seems to significantly smooth the oscillation between "almost no memory available" and "UVM free target available" caused by the current sudden, heavy backpressure on the metadata cache. We should revisit this again once the backpressure mechanism is better tuned; ideally, the hard limit should almost never come into play, because the metadata cache should gradually give back pages as buffers hit the AGE list and as the page cache demands them, rather than giving back a big slug of pages all at once when UVM decides it's in a hurry and fires off the page daemon. Just how well this adjustment works is likely to vary significantly from machine to machine depending on I/O mix, filesystem frag size, and total memory. However, 20% seems to be quite a bit better than 30% on several systems I've tested and is, coincidentally, more than enough to cache the entire metadata working set of the AnonCVS server with 100 clients, which is a useful worst-case stake in the ground...	2004-01-09 06:26:15 +00:00
tls	28364b01be	Add pool_reclaim() on pool to which we just pool_put() a buffer in buf_mrelease(). Without this, though the pages are returned to the relevant pool, they are never available for any other use in the system. Now the backpressure on the physical size of the buffer cache through the buf_drain() call in the pagedaemon works correctly. If anything, it may be a bit more aggressive than intended. On my 256MB system, with vm.bufcache set to the default 30% of physmem, a kernel with this fix can do 5 simultaneous config/makedep/builds of different NetBSD kernels in 1313 seconds; with the "traditional" buffer cache code it requires 1320 seconds. Running "find / -type d -exec ls -l {}" while the build is going demonstrates that the backpressure is working correctly: free memory oscillates slowly between close to none and the UVM target free, and vmstat -m shows a large number of releases for the buffer pools. For future work: how is "bufpl" memory returned to the system? This is not obvious to me (I must be looking in the wrong place). Also, buf_mrelease() is also called from brelse() in some cases. Would it be better to add a pool flag causing automatic release of full pages as they become available (not fragmented)? Jason Thorpe proposed this and it seems more elegant than cleaning the _entire_ pool only upon memory pressure. Greg Oster did a lot of the work of figuring this out. Jason proposed the use of pool_reclaim as a way to fix it.	2004-01-08 23:41:14 +00:00
atatat	5efc584023	Expose the buf_map symbol so that pmap(1) can find it. Split the sysctl setup routine into two routines, one for each "subtree". Perhaps it's a little pedantic, but it's cleaner. Also, assert that the "kern" and "vm" nodes exist.	2004-01-06 13:51:09 +00:00
pk	90cc172b86	bufpool_page_free: pass `buf_map' to uvm_km_free().	2004-01-04 16:17:13 +00:00
pk	dc6d5d0dd1	getnewbuf: return buffer locked.	2003-12-31 14:37:17 +00:00
thorpej	7e958083b1	Consistently use ANSI-style function decls.	2003-12-30 20:40:39 +00:00
pk	70f20a1217	Replace the traditional buffer memory management -- based on fixed per buffer virtual memory reservation and a private pool of memory pages -- by a scheme based on memory pools. This allows better utilization of memory because buffers can now be allocated with a granularity finer than the system's native page size (useful for filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation of virtual to physical memory mappings (due to the former fixed virtual address reservation) resulting in better utilization of MMU resources on some platforms. Finally, the scheme is more flexible by allowing run-time decisions on the amount of memory to be used for buffers. On the other hand, the effectiveness of the LRU queue for buffer recycling may be somewhat reduced compared to the traditional method since, due to the nature of the pool based memory allocation, the actual least recently used buffer may release its memory to a pool different from the one needed by a newly allocated buffer. However, this effect will kick in only if the system is under memory pressure.	2003-12-30 12:33:13 +00:00
dbj	076b9a1a1e	when ifdef DEBUG and debug_verify_freelist != 0 then perform an expensive search of the buffer freelists in brelse and bremfree to verify consistency	2003-12-02 04:18:19 +00:00
dbj	2162bce654	add explanatory comment in bremfree: We break the TAILQ abstraction in order to efficiently remove a buffer from its freelist without having to know exactly which freelist it is on.	2003-12-02 03:36:33 +00:00
dbj	84865d5d4f	protect a few uses of buf's b_flags with b_interlock	2003-11-08 04:22:35 +00:00
yamt	4e746c95f7	in getblk(), don't call allocbuf() for B_LOCKED buffers. LFS misses total size of B_LOCKED buffer (locked_queue_bytes) when getblk() re-size them. XXX maybe needs a better fix.	2003-09-24 10:44:44 +00:00
yamt	1c9095a5b6	buffer with B_CALL shouldn't be brelse'ed. assert it.	2003-09-07 11:59:40 +00:00
yamt	059404deaf	bremfree needs bqueue_slock held. assert it.	2003-09-07 11:57:43 +00:00
agc	aad01611e7	Move UCB-licensed code from 4-clause to 3-clause licence. Patches provided by Joel Baker in PR 22364, verified by myself.	2003-08-07 16:26:28 +00:00
yamt	e5655297db	remove B_NEEDCOMMIT as it's no longer used.	2003-04-09 12:55:50 +00:00
thorpej	eb14e86676	Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and use it. This fixes a few places where either b_dep or b_interlock were not properly initialized.	2003-02-25 20:35:31 +00:00
pk	1262bf7cb5	bdwrite(): remove check for MFS major device number (why was 255 changed to 4096?). In any case, bdevsw_lookup() will take care of it.	2003-02-06 11:46:49 +00:00
pk	9df517d22e	In getnewbuf(), release the buffer queue lock before calling bawrite() and re-acquire it afterward.	2003-02-06 11:22:35 +00:00
pk	408ae56abd	Require that bdirty() be called at splbio() and with the buffer interlock held. This is essentially just a helper routine called from biodone() through ffs softdep's I/O completion, to re-queue the buffer.	2003-02-06 09:46:46 +00:00
pk	338f31f581	Make the buffer cache code MP-safe.	2003-02-05 21:38:38 +00:00
thorpej	e0d8d366df	Merge the nathanw_sa branch.	2003-01-18 10:06:22 +00:00
gehenna	77a6b82b27	Merge the gehenna-devsw branch into the trunk. This merge changes the device switch tables from static array to dynamically generated by config(8). - All device switches are defined as a constant structure in device drivers. - The new grammar ``device-major'' is introduced to ``files''. device-major <prefix> char <num> [block <num>] [<rules>] - All device major numbers must be listed up in port dependent majors.<arch> by using this grammar. - Added the new naming convention. The name of the device switch must be <prefix>_[bc]devsw for auto-generation of device switch tables. - The backward compatibility of loading block/character device switch by LKM framework is broken. This is necessary to convert from block/character device major to device name in runtime and vice versa. - The restriction to assign device major by LKM is completely removed. We don't need to reserve LKM entries for dynamic loading of device switch. - In compile time, device major numbers list is packed into the kernel and the LKM framework will refer it to assign device major number dynamically.	2002-09-06 13:18:43 +00:00
matt	48bbf5f234	Use the queue macros from <sys/queue.h> instead of referring to the queue members directly. Use *_FOREACH whenever possible.	2002-09-04 01:32:31 +00:00
hannken	815491c0b3	Remove the old device buffer queue interface. Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>	2002-08-30 15:43:36 +00:00
thorpej	139cdc3125	Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these variables unsigned, and update places where their values are printed.	2002-08-25 20:21:33 +00:00
matt	0cb85bc7b9	Eliminate commons.	2002-05-12 23:06:27 +00:00
chs	4d4825010d	fix bread() to return errors from reading past the end of the device. back in rev. 1.51, bread() and breadn() were changed to assume that if B_DONE is set on a buffer returned by bio_doread(), that the buffer must have already been in the cache, and thus the overall bread() should return success. but if the requested buffer is not in the cache and is past the end of the device, bounds_check_with_label() will set B_ERROR on the buffer and the caller will call biodone(), which will cause bread() to think the buffer was already in the cache and thus return success. to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE and B_DELWRI as indicators that it doesn't need to sleep waiting for an i/o to complete.	2002-03-16 23:49:59 +00:00
thorpej	a180cee23b	Pool deals fairly well with physical memory shortage, but it doesn't deal with shortages of the VM maps where the backing pages are mapped (usually kmem_map). Try to deal with this: * Group all information about the backend allocator for a pool in a separate structure. The pool references this structure, rather than the individual fields. * Change the pool_init() API accordingly, and adjust all callers. * Link all pools using the same backend allocator on a list. * The backend allocator is responsible for waiting for physical memory to become available, but will still fail if it cannot allocate KVA space for the pages. If this happens, carefully drain all pools using the same backend allocator, so that some KVA space can be freed. * Change pool_reclaim() to indicate if it actually succeeded in freeing some pages, and use that information to make draining easier and more efficient. * Get rid of PR_URGENT. There was only one use of it, and it could be dealt with by the caller. From art@openbsd.org.	2002-03-08 20:48:27 +00:00
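
The growth-control policy described in the 2004-01-30, 2004-02-11 and 2004-03-26 entries can be illustrated with a small sketch. This is not the NetBSD source: the function name cache_may_grow(), its parameters, the 256-step scaling and the linear fall-off between the water marks are all assumptions made for illustration. It only shows the shape of the checks the messages mention: refuse growth while the AGE list holds memory, test against the high water mark before drawing a random number, and do the arithmetic in 64 bits so it cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch (not the NetBSD implementation).
 * Returns 1 if the cache may allocate new buffer memory, 0 if the
 * caller should recycle an existing buffer instead.
 *
 *   bufmem     bytes currently held by the buffer cache
 *   lowater    low water mark in bytes (assumed name)
 *   hiwater    high water mark in bytes (assumed name)
 *   age_bytes  bytes sitting on the AGE list
 *   rnd        a random value supplied by the caller
 */
static int
cache_may_grow(uint64_t bufmem, uint64_t lowater, uint64_t hiwater,
    uint64_t age_bytes, uint32_t rnd)
{
	assert(lowater < hiwater);

	/* Anything on the AGE list should be reused before growing. */
	if (age_bytes > 0)
		return 0;

	/* Below the low water mark, always allow growth. */
	if (bufmem < lowater)
		return 1;

	/* Early exit above the high water mark: no random draw needed. */
	if (bufmem > hiwater)
		return 0;

	/*
	 * Between the marks, allow growth with a probability that
	 * shrinks as the cache fills. 64-bit operands keep the
	 * multiplication from wrapping on large memory sizes.
	 */
	uint64_t span = hiwater - lowater;
	uint64_t used = bufmem - lowater;
	uint64_t threshold = (used * 256) / span;	/* 0..256 */

	return (uint64_t)(rnd & 0xff) >= threshold;
}
```

In this sketch the caller supplies the random value, so the decision is reproducible for a given input; a kernel version would draw it internally.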

128 Commits