as necessary:
* Implement a new mbuf utility routine, m_copyup(), is is like
m_pullup(), except that it always prepends and copies, rather
than only doing so if the desired length is larger than m->m_len.
m_copyup() also allows an offset into the destination mbuf, which
allows space for packet headers, in the forwarding case.
* Add *_HDR_ALIGNED_P() macros for IP, IPv6, ICMP, and IGMP. These
macros expand to 1 if __NO_STRICT_ALIGNMENT is defined, so that
architectures which do not have strict alignment constraints don't
pay for the test or visit the new align-if-needed path.
* Use the new macros to check if a header needs to be aligned, or to
assert that it already is, as appropriate.
Note: This code is still somewhat experimental. However, the new
code path won't be visited if individual device drivers continue
to guarantee that packets are delivered to layer 3 already properly
aligned (which are rules that are already in use).
still in the chroot. If not, teleport the lookup to the chroot
and log. Closes an assisted-jail escape method pointed out by
xs@kittenz.org. Patch from xs@kittenz.org and myself
It's not even built if the option isn't present.
* Use cdev_decl() to generate prototypes for the devsw functions.
* Minor whitespace cleanup.
* Nuke the SYSTR_CLONE ioctl from orbit; instead, just clone it in
systraceopen(), like we do with svr4_net.
trying to MSG_PEEK for more than the socket can hold. The second is that
before sleeping waiting for more data, upcall the protocol telling it you
have just received data so it can kick itself to re-fill the just drained
socket buffer.
and an error occurs, make sure the socket doesn't retain a partial
copy by dropping the rest of the record.
This would otherwise trigger a panic("receive 1a") under DIAGNOSTIC.
Fixes PR#16990, suggested fix adapted.
Reviewed by Matt Thomas.
- implement SIMPLEQ_REMOVE(head, elm, type, field). whilst it's O(n),
this mirrors the functionality of SLIST_REMOVE() (the other
singly-linked list type) and FreeBSD's STAILQ_REMOVE()
- remove the unnecessary elm arg from SIMPLEQ_REMOVE_HEAD().
this mirrors the functionality of SLIST_REMOVE_HEAD() (the other
singly-linked list type) and FreeBSD's STAILQ_REMOVE_HEAD()
- remove notes about SIMPLEQ not supporting arbitrary element removal
- use SIMPLEQ_FOREACH() instead of home-grown for loops
- use SIMPLEQ_EMPTY() appropriately
- use SIMPLEQ_*() instead of accessing sqh_first,sqh_last,sqe_next directly
- reorder manual page; be consistent about how the types are listed
- other minor cleanups
enough to be useful, and broadening it so that it did would have meant
that operations possibly requiring synchronous disk activity would have
to be done in splbio(). This clearly was not going to work.
Worked around this in the LFS case by having lfs_cluster_callback put an
extra hold on the vnode before calling biodone(), and taking the hold
off without HOLDRELE's problematic list swapping. lfs_vunref() will take
care of that---in thread context---on the next write if need be.
Also, ensure that the list walking in lfs_{writevnodes,segunlock,gather}
takes into account the possibility that the list may change
underneath it (possibly because it itself deleted an element).
Tested on i386, test-compiled on alpha.
by default, and can be enabled by adding the SOSEND_LOAN option to your
kernel config. The SOSEND_COUNTERS option can be used to provide some
instrumentation.
Use of this option, combined with an application that does large enough
writes, gets us zero-copy on the TCP and UDP transmit path.
closed, open those fds to /dev/null.
XXX: This needs to be fixed in a better way. The kernel should not need to
know about /dev/null or special case 0, 1, 2.
This generates more useful information of a process who catches SIGINFO,
rather than always printing "runnable" (the process is marked runnable
because of the signal).
Inspired by the behavior of BSD/OS.
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
on the <bsd-api-discuss@wasabisystems.com> mailing list. PT_IO
is a more general inferior I/D space I/O mechanism. FreeBSD and
OpenBSD have also added PT_IO.
From lha@stacken.kth.se, kern/15945.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.
Changes:
* MP locking changes (mostly FreeBSD specific)
XXXSMP the MP locking macros are noops on NetBSD for now
* kevent fix (FreeBSD rev. 1.87): when the last reader/writer
disconnects, ensure that anybody who is waiting for the kevent
on the other end of the pipe gets EV_EOF
* kill __P
first. This is necessary to avoid warnings with -fshort-enums. Casting
to an int really should be enough, but turns out not to be.
This change will be documented in doc/HACKS.
m_reclaim() to match the drain hook signature. This allows us to
delete m_retry() and m_retryhdr(), as the pool allocator will now
perform the reclaimation step for us.
From art@openbsd.org.
and the latter, while there was some code tested the bit, was woefully
incomplete and also unused by anything. Besides, PR_STATIC functionality
could be better handled by backend allocators anyhow.
From art@openbsd.org
pool_set_drain_hook(). This hook is called in three cases:
* When a pool has hit the hard limit, just before either erroring
out or sleeping.
* When a backend allocator fails to allocate memory.
* Just before trying to reclaim pages in pool_reclaim().
This hook requests the client to try and free some items back to
the pool.
From art@openbsd.org.
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:
* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot callocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.
From art@openbsd.org.
header to distinguish between o32, n32 and n64 ABIs. We now use this.
This suppress the need of the mips_option test, which had some fake positive.
This also removes the mandatory ordering of n32 vs o32 in the exec switch
(exec_conf.c)
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.
rounded up to respect boundary limits, adjust newstart and last before
skiping to the next region. Otherwise we may check the same candidate
region against the start of the next region, no the one immediatly following
the hole, leading to corrupted map.
This fixes the panic seen on sparc64 with scsi drivers, and probably fixes
PR 15489.
easy, convenient dropping into DDB at the "root device: " prompt.
Useful if your console can't do it w/o actually taking an interrupt
and you want to, say, look at the boot messages.
as an added measure to make sure that we can execute a binary.
These default to (1) if elf_machdep.h does not override them.
On Sun2, ELF32_EHDR_FLAGS_OK() checks for the presense of EF_M68000,
since the 68010 cannot run binaries for the 68020-and-up.
was successfully changed. previously, successfully viewing the
current value would flush the cache :-/
- similarly, don't change hostid and sb_max unless the value was
successfully changed
ltsleep() is calling CURSIG() which can call issignal() and issignal()
could not deal with being called from a locked context. This happens
when a process receives SIGTTIN, and issignal() calls psignal() to
post SIGCHLD to the parent.
XXX: It is really messy to have issignal() handle the job control
functionality and the whole signal interlocking protocol needs to
be re-designed. For now this fix (provided by enami) does the trick.
I've been running with this fix for weeks, and atatat has stress-tested
the kernel running ~30 make kernels...
expecting pmap_kenter_pa() to be used to replace an existing mapping,
plus it just seems like a bad idea to keep around mappings of pages
that may be freed and reused.
uint32_t namei_hash(const char *p, const char **ep)
which determines the equivalent MI hash32_str() hash for p.
If *ep != NULL, calculate the hash to the character before ep.
If *ep == NULL, calculate the has to the first / or NUL found, and
point *ep to that location.
- Use namei_hash() to calculate cn_hash in lookup() and relookup().
Hash distribution goes from 35-40% to 55-70%, with similar profiled
time spent in cache_lookup() and cache_enter() on my P3-600.
- Use namei_hash() to calculate cn_hash in nfs_readdirplusrpc(),
insetad of homegrown code (that differed from that in lookup() !)
namei_hash() has better spread and is faster than previous code
(which used a non-constant multiplication).
in f_flag of struct file
for now, keep former f_iflags of struct file as _f_spare0, it will be g/c'ed
when struct file will be changed (this will happen soon)
VOP_PUTPAGES() just because the vnode has no pages. layered filesystems
will want to pass these calls on through to the underlying filesystem,
and non-layered filesystems may need to remove the vnode from the
syncer queues. fix up MP locking and add some locking assertions.
fixes PRs 12284 and 14640.
(__HAVE_PTRACE_MACHDEP) and procfs (__HAVE_PROCFS_MACHDEP).
These changes will allow platforms like x86 (XMM) and PowerPC
(AltiVec) to export extended register sets in a sane manner.
* Use __HAVE_PTRACE_MACHDEP to export x86 XMM registers (standard
FP + SSE/SSE2) using PT_{GET,SET}XMMREGS (in the machdep
ptrace request space).
* Use __HAVE_PROCFS_MACHDEP to export x86 XMM registers via
/proc/N/xmmregs in procfs.
case when the requested memory size can't ever be granted - instead
of panic, malloc(9) would return failure (NULL).
Note kernel code should do proper bound checking, rather than
depend on M_CANFAIL. This flag is only supposed to be used in very
special cases, where common bound checking is not appropriate.
Discussed on tech-kern@, name ``M_CANFAIL'' suggested by Chuck Cranor.
is freed prematurely the check won't be triggered immediatelly, probably
since the memory is likely to be reused fast; but it _would_ be triggered
eventually
at all, it's only needed in LKM case
use #if defined(LKM) || defined(_LKM) condition for netbsd32_execve.c,
to DTRT when either compiled statically into kernel with LKM support,
or compiled as a LKM
* return EINVAL if specified current limit exceeds specified hard limit.
This behaviour is required by SUSv2 (noted by Giles Lean on tech-kern)
* return EINVAL if an attempt is made to lower stack size limit below
current usage; this addresses bin/3045 by Jason Thorpe, and conforms to SUSv2
- replace opt_kgdb_machdep.h with opt_kgdb.h
- defparam opt_kgdb.h:
KGDB_DEV KGDB_DEVNAME KGDB_DEVADDR KGDB_DEVRATE KGDB_DEVMODE
- move from opt_ddbparam.h to opt_ddb.h:
DDB_FROMCONSOLE DDB_ONPANIC DDB_HISTORY_SIZE DDB_BREAK_CHAR SYMTAB_SPACE
- replace KGDBDEV with KGDB_DEV
- replace KGDBADDR with KGDB_DEVADDR
- replace KGDBMODE with KGDB_DEVMODE
- replace KGDBRATE with KGDB_DEVRATE
- use `9600' instead of `0x2580' for 9600 baud rate
- use correct quotes for options KGDB_DEVNAME="\"com\""
- use correct quotes for options KGDB_DEV="17*256+0"
- remove unnecessary dependancy on Makefile for kgdb_stub.o
- minor whitespace cleanup
This is a followup to PR/14558.
- itimerfix(9) limited the number of seconds to 100M, before I changed
it to 1000M for PR/14558.
- nanosleep(2) documents a limit of 1000M seconds.
- setitimer(2), select(2), and other library functions that indirectly
use setitimer(2) for example alarm(3) don't specify a limit.
So it only seems appropriate that any positive number of seconds in
struct timeval should be accepted by any code that uses itimerfix(9)
directly, except nanosleep(2) which should check for 1000M seconds
manually. This changes makes the manual pages of select(2), nanosleep(2),
setitimer(2), and alarm(3) consistent with the code.
pages loaned to the kernel. this implies that we also need to
call pmap_kremove() before uvm_km_free().
other general cleanup: remove argument names from prototypes,
rename some variables, etc.
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).
VEXECMAP suggested by Chuq Silvers.
THAT accurate and microtime(9) is painlessly slow on i386 currently.
This speeds up small transfers much. The gain for large transfers
is less significant, but notable too.
Bottleneck was found by Andreas Persson (Re: kern/14246).
Performance improvement with PIII on 661 Mhz according to hbench (with
PIPE_MINDIRECT=8192):
buffersize before after
512 17 49
1024 33 110
2048 52 143
4096 77 163
8192 142 190
64K 577 662
128K 372 392
vnode, we should not attempt to remove the namecache entry. this is because
vget() can sleep (eg. if VXLOCK is set because the vnode is being reclaimed),
and so multiple threads can end up in this context at the same time.
if this happens, each thread ends up removing the cache entry, but
the code to remove the entry assumes that the entry is still valid.
so we should just leave the (now stale) entry in the cache.
if another thread finds the entry again before it is reused,
that thread will notice that the entry is stale and remove it safely.
fixes PR 14042.
This is activated by defining POOL_SUBPAGE to the size of the new allocation
unit, and makes pools much more efficient on machines with obscenely large
pages. It might even make four-megabyte arm26 systems usable.
woken-up thread is guaranteed to pass the buck to the next guy before
going back to sleep, and the rest of the lockmgr() code doesn't do that.
from Bill Sommerfeld. fixes PR 14097.
overflow on LP64 architectures. This fixes kern/10070 by Juergen Weiss.
Fix tested on NetBSD/alpha by Bernd Ernesti, on NetBSD/sparc64
by David Brownlee and Eduardo Horvath.
(not just EPIPE), so that the higher-level code would note partial
write has happened and DTRT if the write was interrupted due to
e.g. delivery of signal.
This fixes kern/14087 by Frank van der Linden.
Much thanks to Frank for extensive help with debugging this, and review
of the fix.
Note: EPIPE/SIGPIPE delivery behaviour was retained - they're delivered
even if the write was partially successful.
not do short writes unless when using non-blocking I/O.
This fixes kern/13744 by Geoff C. Wing.
Note this partially undoes rev. 1.5 change. Upon closer examination,
it's been apparent that hbench-OS expectations were not actually justified.
are only wired if this flag is present (i.e. they are not wired by default now)
loaned pages are unloaned via new uvm_unloan(), uvm_unloananon() and
uvm_unloanpage() are no longer exported
adjust uvm_unloanpage() to unwire the pages if UVM_LOAN_WIRED is specified
mark uvm_loanuobj() and uvm_loanzero() static also in function implementation
kern/sys_pipe.c: uvm_unloanpage() --> uvm_unloan()
ALWAYS call uvm_unloanpage() in cleanup - it's necessary even
in pipe_loan_free() case, since uvm_km_free() doesn't seem
to implicitly unloan the loaned pages
format specific.
Struct emul has a e_setregs hook back, which points to emulation-specific
setregs function. es_setregs of struct execsw now only points to
optional executable-specific setup function (this is only used for
ECOFF).
checks root privs, and a lower part that does the actual job. The lower part
will be called by the upcoming clockctl driver. Approved by Christos
Also fixed a few cosmetic things
- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.
The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.
adjusted via sysctl. file systems that have hash tables which are
sized based on the value of this variable now resize those hash tables
using the new value. the max number of FFS softdeps is also recalculated.
convert various file systems to use the <sys/queue.h> macros for
their hash tables.
"earliest" firing callout in a bucket. This allows us to skip
the scan up the bucket if no callouts are due in the bucket.
A cheap O(1) hint update is done at callout insertion (if new callout
is earlier than hint) and removal (is bucket empty). A thorough
refresh of the hint is done when the bucket is traversed.
This doesn't matter much on machines with small values of hz
(e.g. i386), but on systems with large values of hz (e.g. Alpha),
it has a definite positive effect.
Also, keep the callwheel stats in evcnts, so that you can view them
with "vmstat -e".
guard pages. Can only debug one malloc type at a time, and nothing
larger than 1 page. But can be useful for debugging certain types
of "data modified on freelist" type problems.
Modified from code in OpenBSD.
the stack, so that it can be modified.
- pass the error code in the exit code in addition to aborting.
- kill the second exit1() call; it does not make any sense.
ctor/dtor feature, it's still faster to allocate from the cache groups
than it is from the pool (cache groups are analogous to "magazines"
in the Solaris SLAB allocator).
of some selective pieces. This fixes problem with NEW_PIPE in kernels
with DEBUG option, reported via e-mail by Chuck Silvers.
sys_pipe(): g/c fdp, provide it at the chunk of FreeBSD code where it's used