the same file multiple times because of recursive loading (i.e. libx requires
liby and libz, and liby requires libz, so libz would be loaded twice).
This is probably suboptimal, but it enables /bin/sh to load on the PowerPC,
so it's a good interim solution until we figure out precisely how things should
work.
I'm not sure whether this makes the excessive recursive check useless or not.
macho_hdr, argc, *argv, NULL, *envp, NULL, progname, NULL,
*progname, **argv, **envp
Where progname is a pointer to the program name as given in the first
argument to execve(), and macho_hdr a pointer to the Mach-O header at
the beginning of the executable file.
and friends should either be made first-class citizens and moved
to an include file (systm.h perhaps), or nuked completely, but
not be redefined in a lot of files.
that can be used to block a process after fork(2) or exec(2) calls. The
new process is created in the SSTOP state and is never scheduled for running.
This feature is designed so that it is easy to attach to the process using gdb
before it has done anything.
It also works with sproc, kthread_create, clone...
in the event that it needs to use a special VM range (x86_64 falls
into this category). We fall back onto kernel_map if machine-dependent
code doesn't create a special map.
- disk_unbusy() gets a new parameter to tell the IO direction.
- struct disk_sysctl gets 4 new members for read/write bytes/transfers.
when processing hw.diskstats, add the read&write bytes/transfers for
the old combined stats to attempt to keep backwards compatibility.
unfortunately, due to multiple bugs, this will cause new kernels and old
vmstat/iostat/systat programs to fail. however, the next time this is
changed it will not fail again.
this is just the kernel portion.
kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals
kqueue is supported by all writable filesystems in the NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)
based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe
removing has no next element, verify that the queue head agrees that
the current element is the last one. (this is how I found the recent
ppc pmap bugs).
children nodes have reached their final state before augmenting the
parent. This fixes an obscure inconsistency of the space in the
vm_map tree that gcc 3.2 triggers when compiling isp.o on alpha. (this
only led to some leaked space). from art@openbsd.org
with privilege elevation no suid or sgid binaries are necessary any
longer. Applications can be executed completely unprivileged. Systrace
raises the privileges for a single system call depending on the
configured policy.
Idea from discussions with Perry Metzger, Dug Song and Marcus Watts.
Approved by christos and thorpej.
now carries the name of the attachment (e.g. "tlp_pci" or "audio"),
and cfattach structures are registered at boot time on a per-driver
basis. The cfdriver and cfattach pointers are cached in the device
structure when attached.
devices have been discovered. All finalizer routines are iteratively
invoked until all of them report that they have done no work.
Use this hook to fix a latent bug in RAIDframe autoconfiguration of
RAID sets exposed by the rework of SCSI device discovery.
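The "invoke until nobody did any work" loop described above can be
sketched in user space as follows. This is only an illustration of the
iteration, not the kernel code; finalizer_fn, struct finalizer,
run_finalizers() and countdown_finalizer() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical finalizer type: returns nonzero if it did any work. */
typedef int (*finalizer_fn)(void *arg);

struct finalizer {
	finalizer_fn fn;
	void *arg;
};

/*
 * Invoke every finalizer repeatedly until one full pass completes in
 * which none of them reports having done work.  Returns the number of
 * passes made, including the final idle pass.
 */
static int
run_finalizers(struct finalizer *tab, size_t n)
{
	int passes = 0, did_work;
	size_t i;

	do {
		did_work = 0;
		for (i = 0; i < n; i++)
			if ((*tab[i].fn)(tab[i].arg))
				did_work = 1;
		passes++;
	} while (did_work);
	return passes;
}

/* Demo finalizer: has work to do while the counter is positive. */
static int
countdown_finalizer(void *arg)
{
	int *left = arg;

	if (*left <= 0)
		return 0;
	(*left)--;
	return 1;
}
```

A finalizer that discovers new work on a later pass (because another
finalizer made a device available) is picked up automatically, which is
presumably why the loop runs to a fixed point rather than once.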
This is the bulk of PR #17345
The general approach is to use a run time determinable value
for DIRBLKSIZ. Additional allowances are included for using
MAXSYMLINKLEN with FS_42INODEFMT and a shift in the cylinder group
cluster summary count array. Support is added for managing
the Apple UFS volume label.
a vector of indices into the cfdata table to specify potential parents,
record the interface attributes that devices have and add a new "parent
spec" structure which lists the iattr, as well as optionally listing
specific parent device instances.
See:
http://mail-index.netbsd.org/tech-kern/2002/09/25/0014.html
...for a detailed description.
While here, const poison some things, as suggested by Matt Thomas.
This is done by adding an extra argument to mi_switch() and
cpu_switch() which specifies the new process. If NULL is passed,
then the new function chooseproc() is invoked to wait for a new
process to appear on the run queue.
Also provides an opportunity for optimisations if "switching to self".
Also added are C versions of the setrunqueue() and remrunqueue()
low-level primitives if __HAVE_MD_RUNQUEUE is not defined by MD code.
All these changes are contingent upon the __HAVE_CHOOSEPROC flag being
defined by MD code to indicate that cpu_switch() supports the changes.
memory fault handler. IRIX uses irix_vm_fault, and all other emulations
use NULL, which means to use uvm_fault.
- While we are there, explicitly set to NULL the uninitialized fields in
struct emul: e_fault and e_sysctl on most ports
- e_fault is used by the trap handler, for now only on mips. In order to avoid
intrusive modifications in UVM, the function pointed to by e_fault does not
have exactly the same prototype as uvm_fault:
int uvm_fault __P((struct vm_map *, vaddr_t, vm_fault_t, vm_prot_t));
int e_fault __P((struct proc *, vaddr_t, vm_fault_t, vm_prot_t));
- In IRIX share groups, all the VM space is shared, except one page.
This forces us to have different VM spaces and synchronize modifications
to the VM space across share group members. We need an IRIX specific hook
to the page fault handler in order to propagate VM space modifications
caused by page faults.
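A trap handler honouring the hook might dispatch roughly as below. This
is a stand-alone sketch, not the mips trap code: the demo_* types, the
p_map and p_emul fields, handle_fault() and demo_emul_fault() are
stand-ins invented so the example compiles outside the kernel. Only the
"NULL means uvm_fault" rule and the two argument lists come from the
text above.

```c
#include <stddef.h>

struct proc;
struct emul;

/* Stand-in types; the kernel uses vaddr_t, vm_fault_t and vm_prot_t. */
typedef unsigned long demo_vaddr_t;
typedef int demo_fault_t;
typedef int demo_prot_t;

struct vm_map { int faults; };

struct emul {
	/* NULL means "use uvm_fault". */
	int (*e_fault)(struct proc *, demo_vaddr_t, demo_fault_t, demo_prot_t);
};

struct proc {
	struct vm_map *p_map;		/* hypothetical shortcut to the map */
	const struct emul *p_emul;
};

static int demo_hook_calls;

/* Stub standing in for the real uvm_fault(): just counts faults. */
static int
uvm_fault(struct vm_map *map, demo_vaddr_t va, demo_fault_t ft, demo_prot_t pr)
{
	(void)va; (void)ft; (void)pr;
	map->faults++;
	return 0;
}

/* Emulation hook: takes the proc, not the map, as its first argument. */
static int
demo_emul_fault(struct proc *p, demo_vaddr_t va, demo_fault_t ft, demo_prot_t pr)
{
	demo_hook_calls++;
	/* An emulation would propagate the change to its peers here. */
	return uvm_fault(p->p_map, va, ft, pr);
}

/* Use the emulation's handler if it has one, else plain uvm_fault(). */
static int
handle_fault(struct proc *p, demo_vaddr_t va, demo_fault_t ft, demo_prot_t pr)
{
	if (p->p_emul->e_fault != NULL)
		return (*p->p_emul->e_fault)(p, va, ft, pr);
	return uvm_fault(p->p_map, va, ft, pr);
}
```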
with humanize_number(3). In reality, the kernel version should be changed
to more closely match the userland version so that there are no prototype
conflicts.
This merge changes the device switch tables from static array to
dynamically generated by config(8).
- All device switches are defined as constant structures in device drivers.
- The new grammar ``device-major'' is introduced to ``files''.
device-major <prefix> char <num> [block <num>] [<rules>]
- All device major numbers must be listed in port dependent majors.<arch>
by using this grammar.
- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.
- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name at run time and vice versa.
- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.
- At compile time, the list of device major numbers is packed into the kernel
and the LKM framework will refer to it to assign device major numbers dynamically.
counters. These counters do not exist on all CPUs, but where they
do exist, can be used for counting events such as dcache misses that
would otherwise be difficult or impossible to instrument by code
inspection or hardware simulation.
pmc(9) is meant to be a general interface. Initially, the Intel XScale
counters are the only ones supported.
- avoid race conditions by having seqno in ioctl
- better uid/gid tracking
- "replace" policy to replace args
- less diffs, as many of local changes were fed back to openbsd already
due to the 1st item, it was impossible for us to provide backward-compatibility
(new kernel + old bin/systrace won't work). upgrade both.
date: 2002/06/11 18:59:22; author: provos; state: Exp; lines: +48 -42
SPLAY_{INSERT,REMOVE} have return values now that can be used for error
checking
date: 2002/06/11 22:09:52; author: provos; state: Exp; lines: +6 -5
have rb_remove return the right value, too.
gets reset properly when the old parent exits before the child. A flag
is set in old parent process when the child is reparented in ptrace(2).
If it's set when process is exiting, all running processes have their
'old parent process' pointer checked and reset if appropriate. Also
change to use 'struct proc *' pointer directly, rather than pid_t.
This fixes security/14444 by David Sainty.
Reviewed by Christos Zoulas.
be properly used by any misc. cloning device. While here, correct
a comment to indicate that "open" is the only entry point and that
everything else is handled with fileops.
One basic struct, a function to setup a queue with a specific strategy and
three macros to put buf's into the queue, get and remove the next buf or
get the next buf without removal.
The BUFQ_XXX interface will be removed in the future.
The B_ORDERED flag is no longer supported.
Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>
* struct sigacts gets a new sigact_sigdesc structure, which has the
sigaction and the trampoline/version. Version 0 means "legacy kernel
provided trampoline". Other versions are coordinated with machine-
dependent code in libc.
* sigaction1() grows two more arguments -- the trampoline pointer and
the trampoline version.
* A new __sigaction_sigtramp() system call is provided to register a
trampoline along with a signal handler.
* The handler is no longer passed to sendsig() functions. Instead,
sendsig() looks up the handler by peeking in the sigacts for the
process getting the signal (since it has to look in there for the
trampoline anyway).
* Native sendsig() functions now select the appropriate trampoline and
its arguments based on the trampoline version in the sigacts.
Changes to libc to use the new facility will be checked in later. Kernel
version not bumped; we will ride the 1.6C bump made recently.
* Keep pointers to the first and last mbufs of the last record in the
socket buffer.
* Use the sb_lastrecord pointer in the sbappend*() family of functions
to avoid traversing the packet chain to find the last record.
* Add a new sbappend_stream() function for stream protocols which
guarantee that there will never be more than one record in the
socket buffer. This function uses the sb_mbtail pointer to perform
the data insertion. Make TCP use sbappend_stream().
On a profiling run, this makes sbappend of a TCP transmission using
a 1M socket buffer go from 50% of the time to .02% of the time.
Thanks to Bill Sommerfeld and YAMAMOTO Takashi for their debugging
assistance!
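The effect of keeping a tail pointer can be illustrated with a small
stand-alone sketch. struct chunk and struct stream_buf are hypothetical
stand-ins for an mbuf and a socket buffer; only the idea of appending
through a cached tail pointer (the role sb_mbtail plays) is taken from
the description above.

```c
#include <stddef.h>

struct chunk {			/* stand-in for an mbuf */
	struct chunk *next;
	int len;
};

struct stream_buf {		/* stand-in for a socket buffer */
	struct chunk *head;
	struct chunk *tail;	/* plays the role of sb_mbtail */
	int cc;			/* total bytes queued */
};

/*
 * Append in O(1): link the new chunk after the cached tail instead of
 * walking the chain from the head to find the end.
 */
static void
stream_append(struct stream_buf *sb, struct chunk *c)
{
	c->next = NULL;
	if (sb->tail == NULL)
		sb->head = c;
	else
		sb->tail->next = c;
	sb->tail = c;
	sb->cc += c->len;
}
```

Without the tail pointer each append has to walk every chunk already
queued, so filling a large buffer costs time quadratic in the number of
chunks, which is consistent with the profile numbers quoted above.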
as necessary:
* Implement a new mbuf utility routine, m_copyup(), which is like
m_pullup(), except that it always prepends and copies, rather
than only doing so if the desired length is larger than m->m_len.
m_copyup() also allows an offset into the destination mbuf, which
allows space for packet headers, in the forwarding case.
* Add *_HDR_ALIGNED_P() macros for IP, IPv6, ICMP, and IGMP. These
macros expand to 1 if __NO_STRICT_ALIGNMENT is defined, so that
architectures which do not have strict alignment constraints don't
pay for the test or visit the new align-if-needed path.
* Use the new macros to check if a header needs to be aligned, or to
assert that it already is, as appropriate.
Note: This code is still somewhat experimental. However, the new
code path won't be visited if individual device drivers continue
to guarantee that packets are delivered to layer 3 already properly
aligned (which are rules that are already in use).
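The shape of the alignment test and the align-if-needed path can be
sketched as follows. DEMO_HDR_ALIGNED_P, struct demo_hdr and
demo_align_hdr() are hypothetical, and the 4-byte alignment mask is an
assumption made for the example; the copy into a scratch header merely
stands in for what m_copyup() does with mbufs.

```c
#include <stdint.h>
#include <string.h>

/*
 * On platforms without strict alignment constraints the test folds to
 * a constant, so the compiler can drop the slow path entirely.
 */
#ifdef __NO_STRICT_ALIGNMENT
#define DEMO_HDR_ALIGNED_P(p)	1
#else
#define DEMO_HDR_ALIGNED_P(p)	((((uintptr_t)(p)) & 3) == 0)
#endif

struct demo_hdr {
	uint32_t src;
	uint32_t dst;
};

/*
 * Return a pointer through which the header can be read safely.  If
 * the data is already aligned it is used in place; otherwise it is
 * copied into the caller's aligned scratch header.
 */
static const struct demo_hdr *
demo_align_hdr(const void *p, struct demo_hdr *scratch, int *copied)
{
	if (DEMO_HDR_ALIGNED_P(p)) {
		*copied = 0;
		return p;
	}
	memcpy(scratch, p, sizeof(*scratch));
	*copied = 1;
	return scratch;
}
```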
sysconf(_SC_CLK_TCK) return hz will work.
In detail:
__times13() returns values scaled by hz.
times() returns values scaled by 100.
<sys/times.h> renames times() to __times13().
_SC_CLK_TCK has changed from 3 to 39.
sysconf(3) returns 100.
sysconf(39) returns hz.
CLK_TCK is defined as sysconf(39).
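From the application side, the portable pattern is to scale times(2)
results by whatever sysconf(_SC_CLK_TCK) reports at run time rather
than by a hard-coded constant. A minimal sketch, with
ticks_to_seconds() and user_cpu_seconds() as hypothetical helper names:

```c
#include <sys/times.h>
#include <unistd.h>

/* Convert clock ticks as returned by times(2) into seconds. */
static double
ticks_to_seconds(clock_t ticks)
{
	long tck = sysconf(_SC_CLK_TCK);

	if (tck <= 0)
		return -1.0;	/* tick rate unavailable */
	return (double)ticks / (double)tck;
}

/* User CPU time consumed by the calling process, in seconds. */
static double
user_cpu_seconds(void)
{
	struct tms t;

	if (times(&t) == (clock_t)-1)
		return -1.0;
	return ticks_to_seconds(t.tms_utime);
}
```

Code written this way does not care whether the pair in use reports
values scaled by 100 or by hz, as long as times() and sysconf() agree.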
It's not even built if the option isn't present.
* Use cdev_decl() to generate prototypes for the devsw functions.
* Minor whitespace cleanup.
* Nuke the SYSTR_CLONE ioctl from orbit; instead, just clone it in
systraceopen(), like we do with svr4_net.
- implement SIMPLEQ_REMOVE(head, elm, type, field). whilst it's O(n),
this mirrors the functionality of SLIST_REMOVE() (the other
singly-linked list type) and FreeBSD's STAILQ_REMOVE()
- remove the unnecessary elm arg from SIMPLEQ_REMOVE_HEAD().
this mirrors the functionality of SLIST_REMOVE_HEAD() (the other
singly-linked list type) and FreeBSD's STAILQ_REMOVE_HEAD()
- remove notes about SIMPLEQ not supporting arbitrary element removal
- use SIMPLEQ_FOREACH() instead of home-grown for loops
- use SIMPLEQ_EMPTY() appropriately
- use SIMPLEQ_*() instead of accessing sqh_first,sqh_last,sqe_next directly
- reorder manual page; be consistent about how the types are listed
- other minor cleanups
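Why arbitrary removal from a singly-linked tail queue is O(n) can be
seen in a hand-rolled equivalent. The sketch below deliberately does
not use <sys/queue.h>; struct node, struct queue and the queue_*()
functions are hypothetical and only mimic the general shape of such a
queue (a first pointer plus the address of the last next pointer).

```c
#include <stddef.h>

struct node {
	struct node *next;
	int value;
};

struct queue {
	struct node *first;
	struct node **last;	/* address of the final next pointer */
};

static void
queue_init(struct queue *q)
{
	q->first = NULL;
	q->last = &q->first;
}

static void
queue_insert_tail(struct queue *q, struct node *n)
{
	n->next = NULL;
	*q->last = n;
	q->last = &n->next;
}

/* O(1): the queue must not be empty. */
static void
queue_remove_head(struct queue *q)
{
	if ((q->first = q->first->next) == NULL)
		q->last = &q->first;
}

/*
 * O(n): with no back pointers the predecessor has to be found by
 * walking from the head.  elm must be on the queue.
 */
static void
queue_remove(struct queue *q, struct node *elm)
{
	struct node *cur;

	if (q->first == elm) {
		queue_remove_head(q);
		return;
	}
	for (cur = q->first; cur->next != elm; cur = cur->next)
		continue;
	if ((cur->next = elm->next) == NULL)
		q->last = &cur->next;
}
```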
variant also zeroes the counters after copying them). In ifunit, add
support for dealing with all-numeric ifnames by treating them as an ifindex
which is used to look up the interface.
- unify sparc_bbinfo (1064 bytes, with 256 block entries)
and sun68k_bbinfo (296 bytes, with 64 block entries)
into shared_bbinfo (512 bytes, with 118 block entries),
which will be also shared by future bbinfo-using platforms
(including macppc)
- add datestamp to *_BBINFO_MAGIC strings, to prevent installboot vs
bootxx version skew.
- add macppc support
*/bootxx.c:
- migrate to new shared_bbinfo structure
installboot:
- add macppc support (still needs applepartmap support and testing)
- improve and add some more warnings & errors to installboot
- implement shared_bbinfo_clearboot() and shared_bbinfo_setboot(), which
perform the majority of the work for bbinfo-using back-ends
(rather than replicating that across multiple back-ends).
to be more generic than ``bbinfo definitions for Sun-based systems''.
Other platforms can store bbinfo-style information here, and possibly
other platform-specific boot information that needs to be accessible
by foreign platforms in tools such as /usr/sbin/installboot.
by default, and can be enabled by adding the SOSEND_LOAN option to your
kernel config. The SOSEND_COUNTERS option can be used to provide some
instrumentation.
Use of this option, combined with an application that does large enough
writes, gets us zero-copy on the TCP and UDP transmit path.
so that they're more useful for arbitrary types of external storage:
* Add an "mbuf *" argument to (*ext_free)(). If non-NULL, (*ext_free)()
is expected to free the mbuf itself. This allows (*ext_free)() to use
the mbuf for bookkeeping (e.g. deferring the work to a helper thread).
If the "mbuf *" argument is NULL, we are assumed to be in a context
which is safe for performing the destructor operation *now*.
* Adjust MEXTREMOVE() and MFREE() routines for above change.
* Update "ade" and "ti" drivers for new semantics.
closed, open those fds to /dev/null.
XXX: This needs to be fixed in a better way. The kernel should not need to
know about /dev/null or special case 0, 1, 2.
future direction: nuke /usr/include/sys/sha1.h, it shouldn't be there as
we don't provide libkern to userland.
This mirrors the same change for md5.h made by itojun on 2000/12/11.
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
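How a layered ioctl path can use a dedicated "not mine" value might
look like the sketch below. The values -3 and -4 are the ones given
above; everything else (the DEMO_* names, the two-level dispatch, and
the final translation of a leftover EPASSTHROUGH into ENOTTY) is an
assumption made for illustration, not a description of the actual
kernel code.

```c
#include <errno.h>

#define DEMO_ERESTART		(-3)	/* restart the system call */
#define DEMO_EPASSTHROUGH	(-4)	/* command not handled here */

#define DEMO_CMD_GET	1
#define DEMO_CMD_SET	2

static int demo_value;

/*
 * Lower layer: handles the commands it knows and passes anything else
 * through with a value that cannot be mistaken for ERESTART or -1.
 */
static int
demo_driver_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case DEMO_CMD_GET:
		*data = demo_value;
		return 0;
	case DEMO_CMD_SET:
		demo_value = *data;
		return 0;
	default:
		return DEMO_EPASSTHROUGH;
	}
}

/*
 * Top level: if no layer claimed the command, report it to the caller
 * as an ordinary errno (ENOTTY is assumed here).
 */
static int
demo_sys_ioctl(int cmd, int *data)
{
	int error = demo_driver_ioctl(cmd, data);

	if (error == DEMO_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```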
on the <bsd-api-discuss@wasabisystems.com> mailing list. PT_IO
is a more general inferior I/D space I/O mechanism. FreeBSD and
OpenBSD have also added PT_IO.
From lha@stacken.kth.se, kern/15945.
become ippp (ISDN ppp) and irip (ISDN raw IP). The character devices are now
called: /dev/isdn (isdnd <-> kernel communication), /dev/isdnctl (dialing
and other control), /dev/isdntrc* (tracing), /dev/isdnbchan* (raw B channel
access, i.e. for user land PPP) and /dev/isdntel* (telephone devices, i.e.
for answering machines).
m_reclaim() to match the drain hook signature. This allows us to
delete m_retry() and m_retryhdr(), as the pool allocator will now
perform the reclamation step for us.
From art@openbsd.org.
and the latter, while there was some code that tested the bit, was woefully
incomplete and also unused by anything. Besides, PR_STATIC functionality
could be better handled by backend allocators anyhow.
From art@openbsd.org
pool_set_drain_hook(). This hook is called in three cases:
* When a pool has hit the hard limit, just before either erroring
out or sleeping.
* When a backend allocator fails to allocate memory.
* Just before trying to reclaim pages in pool_reclaim().
This hook requests the client to try and free some items back to
the pool.
From art@openbsd.org.
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:
* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.
From art@openbsd.org.
as an added measure to make sure that we can execute a binary.
These default to (1) if elf_machdep.h does not override them.
On Sun2, ELF32_EHDR_FLAGS_OK() checks for the presence of EF_M68000,
since the 68010 cannot run binaries for the 68020-and-up.
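The default-unless-overridden arrangement can be sketched like this.
Only the macro name and its default of (1) come from the text above;
struct demo_ehdr, DEMO_PORT_OVERRIDE, DEMO_EF_REQUIRED and
demo_elf_check() are hypothetical, and the flag value is made up.

```c
#include <errno.h>
#include <stdint.h>

struct demo_ehdr {
	uint32_t e_flags;
};

/*
 * A port that cares about e_flags would define the macro first, in its
 * machine-dependent header.  The flag bit used here is hypothetical.
 */
#ifdef DEMO_PORT_OVERRIDE
#define DEMO_EF_REQUIRED	0x1
#define ELF32_EHDR_FLAGS_OK(eh)	(((eh)->e_flags & DEMO_EF_REQUIRED) != 0)
#endif

/* Everyone else gets the permissive default. */
#ifndef ELF32_EHDR_FLAGS_OK
#define ELF32_EHDR_FLAGS_OK(eh)	(1)
#endif

/* Reject the binary early if the port says its flags are unsuitable. */
static int
demo_elf_check(const struct demo_ehdr *eh)
{
	(void)eh;	/* unused when the default macro is in effect */
	if (!ELF32_EHDR_FLAGS_OK(eh))
		return ENOEXEC;
	return 0;
}
```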
on disks in a generic way. Implement these ioctls for SCSI disks.
This is not fully fleshed-out yet, but it allows people to experiment
with disk caches more easily.
* NOTE: Do not protect this header against multiple inclusion. Doing
* so can have subtle side-effects due to header file inclusion order
* and testing of e.g. _POSIX_SOURCE vs. _POSIX_C_SOURCE. Instead,
* protect each CPP macro that we want to supply.
counter overflow. Fixes kern/5080 by David A. Holland.
Also move f_usecount & f_iflags to better place, and make f_type int.
Note: the maximum possible number of references to a struct file is
maxfiles + unp_rights == 2 * INT_MAX
uint32_t namei_hash(const char *p, const char **ep)
which determines the equivalent MI hash32_str() hash for p.
If *ep != NULL, calculate the hash to the character before ep.
If *ep == NULL, calculate the hash to the first / or NUL found, and
point *ep to that location.
- Use namei_hash() to calculate cn_hash in lookup() and relookup().
Hash distribution goes from 35-40% to 55-70%, with similar profiled
time spent in cache_lookup() and cache_enter() on my P3-600.
- Use namei_hash() to calculate cn_hash in nfs_readdirplusrpc(),
instead of homegrown code (that differed from that in lookup() !)
namei_hash() has better spread and is faster than previous code
(which used a non-constant multiplication).
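The two calling conventions of namei_hash() can be illustrated with a
stand-alone sketch. demo_namei_hash() is hypothetical, and the hash
step used (start at 5381, multiply by the constant 33, add the byte) is
an assumption chosen for the example; the real function is only
required to match hash32_str().

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_HASH_INIT	5381u	/* assumed seed, for illustration */

/*
 * If *ep is non-NULL, hash the bytes from p up to (not including) *ep.
 * If *ep is NULL, hash up to the first '/' or NUL and store a pointer
 * to that terminator in *ep.
 */
static uint32_t
demo_namei_hash(const char *p, const char **ep)
{
	uint32_t hash = DEMO_HASH_INIT;

	if (*ep != NULL) {
		for (; p < *ep; p++)
			hash = hash * 33 + (unsigned char)*p;
	} else {
		for (; *p != '\0' && *p != '/'; p++)
			hash = hash * 33 + (unsigned char)*p;
		*ep = p;
	}
	return hash;
}
```

A caller that already knows where the component ends passes that end
in; a caller walking a path passes NULL and gets the component boundary
back for free.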
in f_flag of struct file
for now, keep former f_iflags of struct file as _f_spare0, it will be g/c'ed
when struct file will be changed (this will happen soon)
uint32_t hash32_buf(const void *buf, size_t len, uint32_t ihash)
return 32 bit hash of buf, size len,
seeded with initial hash of ihash (usually HASH32_BUF_INIT).
this hash may use a different algorithm to hash32_str() and
hash32_strn().
uint32_t hash32_str(const void *buf, uint32_t ihash)
return 32 bit hash of buf, which is a NUL terminated ascii string,
seeded with initial hash of ihash (usually HASH32_STR_INIT).
this hash may use a different algorithm to hash32_buf()
but must use the same algorithm as hash32_strn().
uint32_t hash32_strn(const void *buf, size_t len, uint32_t ihash)
return 32 bit hash of buf, which is a NUL terminated ascii string
up to a maximum of len bytes,
seeded with initial hash of ihash (usually HASH32_STR_INIT).
this hash may use a different algorithm to hash32_buf()
but must use the same algorithm as hash32_str().
As discussed on tech-kern@netbsd.org.
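The requirement that hash32_str() and hash32_strn() share one algorithm
means a bounded hash of a prefix must equal the unbounded hash of the
same bytes. A sketch with hypothetical demo_* versions follows; the
multiply-by-33 step and the seed value are assumptions, not taken from
the interface description.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_HASH32_STR_INIT	5381u	/* assumed seed */

/* Hash a NUL terminated string, continuing from ihash. */
static uint32_t
demo_hash32_str(const void *buf, uint32_t ihash)
{
	const unsigned char *s = buf;

	while (*s != '\0')
		ihash = ihash * 33 + *s++;
	return ihash;
}

/* Same algorithm, but stop after at most len bytes. */
static uint32_t
demo_hash32_strn(const void *buf, size_t len, uint32_t ihash)
{
	const unsigned char *s = buf;

	while (len-- != 0 && *s != '\0')
		ihash = ihash * 33 + *s++;
	return ihash;
}
```

With this particular step function the seed also allows chaining:
hashing "def" seeded with the hash of "abc" gives the hash of "abcdef".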
(__HAVE_PTRACE_MACHDEP) and procfs (__HAVE_PROCFS_MACHDEP).
These changes will allow platforms like x86 (XMM) and PowerPC
(AltiVec) to export extended register sets in a sane manner.
* Use __HAVE_PTRACE_MACHDEP to export x86 XMM registers (standard
FP + SSE/SSE2) using PT_{GET,SET}XMMREGS (in the machdep
ptrace request space).
* Use __HAVE_PROCFS_MACHDEP to export x86 XMM registers via
/proc/N/xmmregs in procfs.
case when the requested memory size can't ever be granted - instead
of panic, malloc(9) would return failure (NULL).
Note kernel code should do proper bound checking, rather than
depend on M_CANFAIL. This flag is only supposed to be used in very
special cases, where common bound checking is not appropriate.
Discussed on tech-kern@, name ``M_CANFAIL'' suggested by Chuck Cranor.
going to be used in the kernel interfaces to userland, so that we don't
break the existing ABI. struct ucred has now been modified to have
u_int32_t for cr_ref and cr_ngroups.
with the option USE_KERNEL_RCSIDS. (On a.out, these strings are actually
allocated memory and loaded; on ELF, they exist in a non-loaded file section.)
pages loaned to the kernel. this implies that we also need to
call pmap_kremove() before uvm_km_free().
other general cleanup: remove argument names from prototypes,
rename some variables, etc.
and non-standard inttype-like types, pull in <sys/types.h> if
_KERNEL or _STANDALONE and <inttypes.h> otherwise, and use standard
inttype types.
Discussed with and OK'd by Christos.
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).
VEXECMAP suggested by Chuq Silvers.
on PAGE_SIZE. The overhead of setting up Page Loan is pretty much constant
regardless of page size, so it makes more sense to use a fixed constant.
According to hbench, the overhead of Page Loan setup is still significantly
bigger than the performance gain for 4096 byte buffers on i386
(PIII/600MHz). The difference is smaller on 386DX, but Page Loan is
still not faster for this case.
Also, there is some other code out there which expects 4KB writes
to not block even for 'blocking' write, since it works this
way on some other operating systems.
Partially addresses kern/14246 by Andreas Persson.
This is activated by defining POOL_SUBPAGE to the size of the new allocation
unit, and makes pools much more efficient on machines with obscenely large
pages. It might even make four-megabyte arm26 systems usable.
format specific.
Struct emul has an e_setregs hook back, which points to emulation-specific
setregs function. es_setregs of struct execsw now only points to
optional executable-specific setup function (this is only used for
ECOFF).
time-related system calls through ioctls. For instance, if a user daemon is able
to write to /dev/clockctl, then it is able to use the CLOCKCTL_SETTIMEOFDAY
ioctl on it, which will be equivalent to a settimeofday.
Approved by Christos
- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.
The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.
adjusted via sysctl. file systems that have hash tables which are
sized based on the value of this variable now resize those hash tables
using the new value. the max number of FFS softdeps is also recalculated.
convert various file systems to use the <sys/queue.h> macros for
their hash tables.
"earliest" firing callout in a bucket. This allows us to skip
the scan up the bucket if no callouts are due in the bucket.
A cheap O(1) hint update is done at callout insertion (if new callout
is earlier than hint) and removal (is bucket empty). A thorough
refresh of the hint is done when the bucket is traversed.
This doesn't matter much on machines with small values of hz
(e.g. i386), but on systems with large values of hz (e.g. Alpha),
it has a definite positive effect.
Also, keep the callwheel stats in evcnts, so that you can view them
with "vmstat -e".
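The hint mechanism can be sketched outside the kernel as below.
struct demo_callout, struct demo_bucket and the bucket_*() functions
are hypothetical; the sketch covers the cheap update on insertion and
the thorough refresh on traversal, and leaves out the separate
removal path.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_callout {
	struct demo_callout *next;
	uint64_t c_time;	/* tick at which the callout fires */
	int fired;
};

struct demo_bucket {
	struct demo_callout *head;
	uint64_t hint;		/* earliest c_time, UINT64_MAX if empty */
};

static void
bucket_init(struct demo_bucket *bk)
{
	bk->head = NULL;
	bk->hint = UINT64_MAX;
}

/* Insertion: O(1) hint update if the new callout is earlier. */
static void
bucket_insert(struct demo_bucket *bk, struct demo_callout *c)
{
	c->fired = 0;
	c->next = bk->head;
	bk->head = c;
	if (c->c_time < bk->hint)
		bk->hint = c->c_time;
}

/*
 * Fire everything due at or before now.  Returns the number of
 * callouts examined, which is 0 when the hint lets us skip the scan.
 */
static int
bucket_run(struct demo_bucket *bk, uint64_t now)
{
	struct demo_callout **cp, *c;
	uint64_t earliest = UINT64_MAX;
	int examined = 0;

	if (bk->hint > now)
		return 0;		/* nothing due in this bucket */
	cp = &bk->head;
	while ((c = *cp) != NULL) {
		examined++;
		if (c->c_time <= now) {
			*cp = c->next;	/* unlink and fire */
			c->fired = 1;
		} else {
			if (c->c_time < earliest)
				earliest = c->c_time;
			cp = &c->next;
		}
	}
	bk->hint = earliest;		/* thorough refresh */
	return examined;
}
```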
guard pages. Can only debug one malloc type at a time, and nothing
larger than 1 page. But can be useful for debugging certain types
of "data modified on freelist" type problems.
Modified from code in OpenBSD.
ctor/dtor feature, it's still faster to allocate from the cache groups
than it is from the pool (cache groups are analogous to "magazines"
in the Solaris SLAB allocator).
data area is not to be written to. This is the case for mbufs with
external storage which is either a non-cluster or a cluster referenced
by multiple mbufs.
Change M_LEADINGSPACE() and M_TRAILINGSPACE() to use M_READONLY(),
rather than their own testing for M_EXT. Previously, M_LEADINGSPACE()
treated all M_EXT mbufs as read-only (which causes an extra mbuf to
be needlessly allocated when sending large TCP packets), and
M_TRAILINGSPACE() previously did not treat any external storage as
read-only (could lead to data corruption of external storage buffers!).
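The read-only rule stated above (external storage that is either not a
cluster or is a cluster with more than one reference) can be written
down as a predicate. The DEMO_* flags, struct demo_mbuf and
demo_trailingspace() are hypothetical simplifications; real mbufs track
references to external storage differently.

```c
#define DEMO_M_EXT	0x01	/* has external storage */
#define DEMO_M_CLUSTER	0x02	/* external storage is a cluster */

struct demo_mbuf {
	int flags;
	int ext_refs;	/* mbufs referencing the external storage */
	int room;	/* free bytes after the data */
};

/* Data area must not be written to. */
#define DEMO_M_READONLY(m)						\
	(((m)->flags & DEMO_M_EXT) != 0 &&				\
	 (((m)->flags & DEMO_M_CLUSTER) == 0 || (m)->ext_refs > 1))

/* Writable trailing space: none at all if the storage is read-only. */
static int
demo_trailingspace(const struct demo_mbuf *m)
{
	return DEMO_M_READONLY(m) ? 0 : m->room;
}
```

A privately held cluster stays writable under this test, which is what
avoids the needless extra allocation, while shared or foreign storage
is never scribbled on.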
arrange things as needed. Unfortunately, the check in sockargs()
has to stay, since 4.3BSD bind(2), connect(2) and sendto(2) were
not versioned at the time :(
This code was tested to pass regression tests.
Here is why:
kernel bcopy and userland bcopy semantics were never the same. bcopy
in the kernel did not traditionally handle overlap.
ovbcopy in the kernel was the traditional "overlapping bcopy".
Let's take a step back here. The point of the macros was to provide
legacy interfaces so we could transition to mem* without disrupting
large parts of the code still being repeatedly merged, like the KAME
merges in net*/. Having purged the last ovbcopy from the kernel,
replacing them all with memmove, we didn't need ovbcopy any more so we
didn't need a macro.
Now, by leaving bcopy as memcpy, we make it clear that if you are
purging bcopys, you should replace them with memcpys. If we used
memmoves everywhere, it would lose very painstaking optimizations made
in the original code during which the ovbcopy/bcopy distinction was
held. Making bcopy into memmove is BAD BAD BAD.
It has been argued we should add an ovbcopy->memmove macro, but that
is precisely what we do not want -- if someone needs ovbcopy, what
they really want to write is memmove, not ovbcopy. We don't want NEW code
with ovbcopy, having laboriously gotten rid of it.
In fact, the bcopy/bzero/bcmps in the kernel should all be purged. We
held off on doing net*/ to make the kame merge easier, and similarly
held off on some other places, but the time has come.
Anyway, for all these reasons, bcopy is changed back to memcpy.
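For anyone doing the conversion, the distinction comes down to whether
source and destination can overlap. A small user-space illustration
using only the standard <string.h> calls; shift_right() and copy_out()
are hypothetical helpers:

```c
#include <stddef.h>
#include <string.h>

/*
 * Open a one-byte gap at the front of buf by sliding len bytes up.
 * Source and destination overlap, so this must be memmove(); it is
 * the case ovbcopy used to cover.  buf needs room for len + 1 bytes.
 */
static void
shift_right(char *buf, size_t len)
{
	memmove(buf + 1, buf, len);
}

/*
 * Copy between two distinct buffers.  No overlap is possible, so
 * memcpy() is correct and may be faster; it is what a plain bcopy
 * should turn into.
 */
static void
copy_out(char *dst, const char *src, size_t len)
{
	memcpy(dst, src, len);
}
```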
is supposed to point directly to struct mbuf or struct sockaddr in kernel
space as appropriate, rather than being a pointer to memory in userland.
This is to be used by compat/* when emulation needs to wrap
send{to|msg}(2)/recv{from|msg}(2) and modify the passed struct
sockaddr.