once the 'address' has been copied into an mbuf.
Add extra flags for 'struct msghdr.msg_flags' to indicate that the address
and control are already in mbufs, and that the uio structure is in userspace
for sending data, rename sendit() to do_sys_sendmsg() to ensure no old code
passes in random flags.
Changes to compat code to use new functions - removing some stackgap use.
Fix a 'use after free' in compat_43_sys_recvmsg.
I ***THINK*** the code that converts 'cmsg' formatted data is borked!
svr4_stream.c ought to be generated from svr4_32_stream.c during the build.
- Allow aio_suspend() to be ended early by a signal
- Fix reference counting on LIO structures (remove hack)
- Use two global pools for AIO structures
- Minor cleanups
Patch provided by <ad>. Some additional modifications by me.
Reviewed by <yamt>.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.
from doc/BRANCHES:
idle lwp, and some changes depending on it.
1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.
Bug fixes:
- Fix crash reported by Scott Ellis on current-users@.
- Fix race conditions in enforcing the Veriexec rename and remove
policies. These are NOT security issues.
- Fix memory leak in rename handling when overwriting a monitored
file.
- Fix table deletion logic.
- Don't prevent query requests if not in learning mode.
KPI updates:
- fileassoc_table_run() now takes a cookie to pass to the callback.
- veriexec_table_add() was removed, it is now done internally. As a
result, there's no longer a need for VERIEXEC_TABLESIZE.
- veriexec_report() was removed, it is now internal.
- Perform sanity checks on the entry type, and enforce default type
in veriexec_file_add() rather than in veriexecctl.
- Add veriexec_flush(), used to delete all Veriexec tables, and
veriexec_dump(), used to fill an array with all Veriexec entries.
New features:
- Add a '-k' flag to veriexecctl, to keep the filenames in the kernel
database. This allows Veriexec to produce slightly more accurate
logs under certain circumstances. In the future, this can be either
replaced by vnode->pathname translation, or combined with it.
- Add a VERIEXEC_DUMP ioctl, to dump the entire Veriexec database.
This can be used to recover a database if the file was lost.
Example usage:
# veriexecctl dump > /etc/signatures
Note that only entries with the filename kept (that is, were loaded
with the '-k' flag) will be dumped.
Idea from Brett Lymn.
- Add a VERIEXEC_FLUSH ioctl, to delete all Veriexec entries. Sample
usage:
# veriexecctl flush
- Add a 'veriexec_flags' rc(8) variable, and make its default have
the '-k' flag. On systems using the default signatures file
(generaetd from running 'veriexecgen' with no arguments), this will
use additional 32kb of kernel memory on average.
- Add a '-e' flag to veriexecctl, to evaluate the fingerprint during
load. This is done automatically for files marked as 'untrusted'.
Misc. stuff:
- The code for veriexecctl was massively simplified as a result of
eliminating the need for VERIEXEC_TABLESIZE, and now uses a single
pass of the signatures file, making the loading somewhat faster.
- Lots of minor fixes found using the (still under development)
Veriexec regression testsuite.
- Some of the messages Veriexec prints were improved.
- Various documentation fixes.
All relevant man-pages were updated to reflect the above changes.
Binary compatibility with existing veriexecctl binaries is maintained.
once, and prior to passing it to the caller of sys_wait4() and at the same
time as adding it to the parent.
Commands like:
time sh -c 'i=0; while [ $i -lt 1000 ]; do i=$(expr $i + 1); done'
now give same output.
Please note, that <tech-kern> people should note about
file names before commit. Otherwise, function may fail
with errno set to EDIRTY, and return -1. ;)
and 'rusage' without having to copy data to/from stackgap buffers.
The old split (find_stopped_child) could be removed.
amd64 seems to run netbsd32, linux and linux32 emulations. sparc64 compiles.
avoid an indirect function call by comparing the family, length,
and bytes [dom->dom_sa_cmpofs, dom->dom_sa_cmpofs + dom->dom_sa_cmplen),
corresponding to the the sockaddrs' "address" members.
For ISO, actually use sockaddr_iso_cmp, for a change. Thanks to
yamt@ for pointing out my error.
- Set a lower priority for AIO-worker thread, because current could cause
interactivity problems (eg. with qemu - thanks <xtraeme> for testing).
Mark it as XXX for now - after priority model change, this should
be reconsidered anyway.
- Do not copyout() with lock held in sys_aio_cancel().
- Fix a leak of the lock in aio_process().
- Check for any error of cv_wait_sig().
- Cache p->p_aio in aio_exit().
Thanks <ad> for catching the issues!
route_in6, struct route_iso), replacing all caches with a struct
route.
The principle benefit of this change is that all of the protocol
families can benefit from route cache-invalidation, which is
necessary for correct routing. Route-cache invalidation fixes an
ancient PR, kern/3508, at long last; it fixes various other PRs,
also.
Discussions with and ideas from Joerg Sonnenberger influenced this
work tremendously. Of course, all design oversights and bugs are
mine.
DETAILS
1 I added to each address family a pool of sockaddrs. I have
introduced routines for allocating, copying, and duplicating,
and freeing sockaddrs:
struct sockaddr *sockaddr_alloc(sa_family_t af, int flags);
struct sockaddr *sockaddr_copy(struct sockaddr *dst,
const struct sockaddr *src);
struct sockaddr *sockaddr_dup(const struct sockaddr *src, int flags);
void sockaddr_free(struct sockaddr *sa);
sockaddr_alloc() returns either a sockaddr from the pool belonging
to the specified family, or NULL if the pool is exhausted. The
returned sockaddr has the right size for that family; sa_family
and sa_len fields are initialized to the family and sockaddr
length---e.g., sa_family = AF_INET and sa_len = sizeof(struct
sockaddr_in). sockaddr_free() puts the given sockaddr back into
its family's pool.
sockaddr_dup() and sockaddr_copy() work analogously to strdup()
and strcpy(), respectively. sockaddr_copy() KASSERTs that the
family of the destination and source sockaddrs are alike.
The 'flags' argumet for sockaddr_alloc() and sockaddr_dup() is
passed directly to pool_get(9).
2 I added routines for initializing sockaddrs in each address
family, sockaddr_in_init(), sockaddr_in6_init(), sockaddr_iso_init(),
etc. They are fairly self-explanatory.
3 structs route_in6 and route_iso are no more. All protocol families
use struct route. I have changed the route cache, 'struct route',
so that it does not contain storage space for a sockaddr. Instead,
struct route points to a sockaddr coming from the pool the sockaddr
belongs to. I added a new method to struct route, rtcache_setdst(),
for setting the cache destination:
int rtcache_setdst(struct route *, const struct sockaddr *);
rtcache_setdst() returns 0 on success, or ENOMEM if no memory is
available to create the sockaddr storage.
It is now possible for rtcache_getdst() to return NULL if, say,
rtcache_setdst() failed. I check the return value for NULL
everywhere in the kernel.
4 Each routing domain (struct domain) has a list of live route
caches, dom_rtcache. rtflushall(sa_family_t af) looks up the
domain indicated by 'af', walks the domain's list of route caches
and invalidates each one.
- Add POSIX defined system variables and constants of AIO_LISTIO_MAX and
AIO_MAX values. Both with _POSIX_ASYNCHRONOUS_IO, provide them in
sysconf(3) and getconf(1) interfaces.
- Clean up sysconf(3) for handling sysctl nodes dynamically.
I think it existed to cache the numbers in kernel memory of a zombie when
proc->p_stats was part of the 'u' area - so got freed earlier and wouldn't
(easily) be accessible from a separate process. However since both the
p_ru and p_stats fields are freed at the same time it is no longer needed.
Ride the recent 4.99.19 version change.
Seems to be quite stable. Some work still left to do.
Please note, that syscalls are not yet MP-safe, because
of the file and vnode subsystems.
Reviewed by: <tech-kern>, <ad>
which can either be copied directly to userspace, or converted then copied.
Saves replicating a lot of code in the compat functions (esp. for
getvfsstat) at a cast of an extra function call in the non-emulated case -
which is unlikely to be measurable given the other costs of the actions
involved (even on vax).
Remove dofhstat() and dofhstatvfs() (and the last caller).
Remove some redundant stackgap_init() calls.
to the real root. Rather that do the check inside lookup() - where it
applies to to every ".." in a pathname, explicitly check the start of
the caller-supplied buffers and any absolute symbolic links.
Note that in the latter case the re-search from the real root is supressed.
Should fix PR kern/36225
emulation lookups.
If doing a lookup relative to the emulation root, prepend the emulation root
to the traced filename.
While here pass the filename length through to the ktrace code since namei()
knows the length and ktr_namei() would have to call strlen().
Note: that if namei() is being called during execve processing, the emulation
root name isn't available and "/emul/???" is used. Also namei() has to use
strlen() to get the lenght on the emulatoon root - even though it is a
compile-time constant string.
Since this is called in fork() it does rather need to give the child
process the parent's emulation root.
This means that (for example) an emulated shell will, by default, run
programs from the emulation root.
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a processes emulation root is saved in the cwdi structure
during process exec.
If the emulation root the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root dircetory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
to skip unnecessary flushing when layered file system vnodes are recycled.
this also prevents a deadlock with the dodgy LFS putpages routine.
fixes the non-LFS part of PR 36150.
the "mountpoint" vnode twice due to an error branch.
thanks go to Gert Doering for reporting the problem and testing the fix
and to Juergen Hannken-Illjes for much of the analysis work leading to
the discovery of the problem cause
corresponding flags.
Revert softdep_trackbufs() to its state before vn_start_write() was added.
Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.
Welcome to 4.99.17
Make sure that if we pull a buffer off of the read queue that it really
is a read request. Lower in this routine we base which queue we
dequeue the request from on its read/write state. Thus if a write
op ever ended up on the read queue, we'd explode (dereference NULL).
- cv_wait and friends: after resuming execution, check to see if we have
been restarted as a result of cv_signal. If we have, but cannot take
the wakeup (because of eg a pending Unix signal or timeout) then try to
ensure that another LWP sees it. This is necessary because there may
be multiple waiters, and at least one should take the wakeup if possible.
Prompted by a discussion with pooka@.
- typedef struct lwp lwp_t;
- int -> bool, struct lwp -> lwp_t in a few places.
- Better detect simple cycles of threads calling _lwp_wait and return
EDEADLK. Does not handle deeper cycles like t1 -> t2 -> t3 -> t1.
- If there are multiple threads in _lwp_wait, then make sure that
targeted waits take precedence over waits for any LWP to exit.
- When checking for deadlock, also count the number of zombies currently
in the process as potentially reapable. Whenever a zombie is murdered,
kick all waiters to make them check again for deadlock.
- Add more comments.
Also, while here:
- LOCK_ASSERT -> KASSERT in some places
- lwp_free: change boolean arguments to type 'bool'.
- proc_free: let lwp_free spin waiting for the last LWP to exit, there's
no reason to do it here.
- Don't bother sorting the sleep queues, since user space controls the
order of removal.
- Change setrunnable(t) to lwp_unsleep(t). No functional change from the
perspective of user applications.
- Minor cosmetic changes.
so it's not worth counting.
- lwp_wakeup: set LW_UNPARKED on the target. Ensures that _lwp_park will
always be awoken even if another system call eats the wakeup, e.g. as a
result of an intervening signal. To deal with this correctly for other
system calls will require a different approach.
- _lwp_unpark, _lwp_unpark_all: use setrunnable if the LWP is not parked
on the same sync queue: (1) simplifies the code a bit as there no point
doing anything special for this case (2) makes it possible for p_smutex
to be replaced by p_mutex and (3) restores the guarantee that the 'hint'
argument really is just a hint.
separate functions that don't do the copyout.
This allows all the compat_xxx versions to convert the 'struct stat' to
the correct format without using the 'stackgap'.
The stackgap isn't at all LWP friendly, and needs to be removed from
any compat functions that might involve threads (inc. clone()).
The code is still binary compatible with existing LKMs.
as possible. For that, split out a function which does the allocation
of a softc (without linking it into global structures) and a function
which inserts the device into the global alldevs lists and the per-driver
cd_devs.
There is a little semantic change involved: the pseudo-device code didn't
interpret FSTATE_STAR as such, for no good reason. This looks harmless;
I'll modify driver frontends as I find ways to test.
Get config_makeroom() out of the public namespace - that's clearly an
internal of autoconf which drivers can't be allowed to deal with.
length separately.
Use this for foreign-endianess labels in wedge autodiscovery, and
calculate the checksum of those before we swap various fields in the
label.
with newlock2 merge:
Replace the Mach-derived boolean_t type with the C99 bool type. A
future commit will replace use of TRUE and FALSE with true and false.
On i386 (at least) gcc manages two generate two forwards branches which are not
usually taken for the old code, and one forwards branch that is usually taken
for my 'improved version'. Since (IIRC) both athlon and P4 will predict
forwards branches 'not taken' the old code is likely to be faster :-(
Faster variants exist, especially ones using the cmov instruction.
option SYSCALL_STATS counts the number of times each system call is made
option SYSCALL_TIMES counts the amount of time spent in each system call
Currently the counting hooks have only been added to the i386 system call
handler, and the time spent in interrupts is not subtracted.
It ought also be possible to add the times to the processes profiling
counters in order to get a more accurate user/system/interrupt split.
The counts themselves are readable via the sysctl interface.
(depracted some time ago) 'struct kinfo_proc' returned by sysctl.
Move the definitions to sys/syctl.h and rename in order to ensure all the
users are located.
parentheses in return statements.
Cosmetic: don't open-code TAILQ_FOREACH().
Cosmetic: change types of variables to avoid oodles of casts: in
in6_src.c, avoid casts by changing several route_in6 pointers
to struct route pointers. Remove unnecessary casts to caddr_t
elsewhere.
Pave the way for eliminating address family-specific route caches:
soon, struct route will not embed a sockaddr, but it will hold
a reference to an external sockaddr, instead. We will set the
destination sockaddr using rtcache_setdst(). (I created a stub
for it, but it isn't used anywhere, yet.) rtcache_free() will
free the sockaddr. I have extracted from rtcache_free() a helper
subroutine, rtcache_clear(). rtcache_clear() will "forget" a
cached route, but it will not forget the destination by releasing
the sockaddr. I use rtcache_clear() instead of rtcache_free()
in rtcache_update(), because rtcache_update() is not supposed
to forget the destination.
Constify:
1 Introduce const accessor for route->ro_dst, rtcache_getdst().
2 Constify the 'dst' argument to ifnet->if_output(). This
led me to constify a lot of code called by output routines.
3 Constify the sockaddr argument to protosw->pr_ctlinput. This
led me to constify a lot of code called by ctlinput routines.
4 Introduce const macros for converting from a generic sockaddr
to family-specific sockaddrs, e.g., sockaddr_in: satocsin6,
satocsin, et cetera.
P_*/L_* naming convention, and rename the in-kernel flags to avoid
conflict. (P_ -> PK_, L_ -> LW_ ). Add back the (now unused) LSDEAD
constant.
Restores source compatibility with pre-newlock2 tools like ps or top.
Reviewed by Andrew Doran.
process was reparented. Change proc_free() to copy the rusage to a buffer
on the stack if required, so it can be passed both to the debugger and
to the real parent process.
Fixes kern/35582 (kernel panics with gdb).
fileassoc_table_add() was removed from the KPI and made internal. From now
fileassoc(9) will manage the optimal table size internally.
Input from and okay yamt@.
- don't use SAVESTART in calls to relookup() from unionfs,
just vref() the desired vnode when we need to.
- fix locking and refcounting in the unionfs EEXIST error cases.
- release any vnode locks before calling VFS_ROOT(), vfs_busy() is enough.
this allows us to simplify union_root() and fix PR 3006.
- union_lock() doesn't handle shared lock requests correctly,
so convert them to exclusive instead. fixes PR 34775.
- in relookup(), avoid reusing "dp" for different purposes,
the error handling wasn't right. (actually just get rid of dp.)
also, change relookup() to ignore LOCKLEAF and always return the
vnode locked since the callers already expect this.
implementation and meant to be used by security models to hook credential
related operations (init, fork, copy, free -- hooked in kauth_cred_alloc(),
kauth_proc_fork(), kauth_cred_clone(), and kauth_cred_free(), respectively)
and document it.
Add specificdata to credentials, and routines to register/deregister new
"keys", as well as set/get routines. This allows security models to add
their own private data to a kauth_cred_t.
The above two, combined, allow security models to control inheritance of
their own private data in credentials which is a requirement for doing
stuff like, I dunno, capabilities?
As with previous commit, we want to allow the secmodel code to control
the credential inheritance, etc., so we need it started earlier (also
before proc0_init()).
Since we'll soon want to be able to control the inheritance of credentials,
kauth(9) needs to be ready for use much sooner -- at least before the call
to proc0_init().
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.
Implemented for file systems of type ffs.
The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.
Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.
Welcome to 4.99.9 (new vfs op vfs_suspendctl).
undocumented) and change logic in kauth_authorize_action() to only
allow an action if it wasn't explicitly allowed/denied and there are no
secmodels loaded.
Okay yamt@.
tail instead of an explicit check to add to the head for an empty
queue. Apparently TAILQ_INSERT_HEAD happens to work for a
non-initialized head and does implicit initialization so that
TAILQ_INSERT_TAIL works after that.