1) don't set a clock when panicing during early boot
2) if the filesystem time is newer than the rtc time (by at least 2 days) then
revert to the filesystem time.
3) use x86 style messaging.
We still use a threshold of 2 days of gain or loss in time to warn though.
listeners to always return "allow".
The idea is that it's not entirely unlikely that some vendors, or users,
will decide to load the security model as an LKM, and that can only
happen after at least mounting local file-systems. If we would not have
this fast-path, all authorization requests would be denied.
okay christos@
Members of the thread group must die without reporting to the parent and
without going to zombie stage. We do that by reparenting to init before
catching a SIGKILL. The parent will not see the child death.
The thread group leader must report the exit status, even if it exits
because of another thread calling exit_group(). We do that by storing the
exit status in struct linux_emuldata_shared, and the exit hook has the
duty of setting struct proc's p_xstat for the thread group leader.
2) For exit/fork/exec hooks, move the NPTL specific code to separate functions
that are shared between COMPAT_LINUX and COMPAT_LINUX32
3) Fix LINUX_CLONE_PARENT_SETTID semantics
that callers are not responsible for initializing the fields. Store the name
inside the struct instead of maintaining a pointer to external storage, or
leaked memory (nfs case).
user supplied a bad name or anamelen parameter to accept(2).
If bad paramaters were suplied and a copyout() failed, the
struct file was cleaned up but not the associated socket. This
could leave sockets in CLOSE_WAIT that could never be closed.
Attached diff let's call kauth_register_scope() with a NULL default
listener. from tn2127:
"callback is the address of the listener callback function for this
scope; this becomes the scope's default listener. This parameter may be
NULL, in which case a callback that always returns KAUTH_RESULT_DEFER is
assumed."
fileassoc.diff adds a fileassoc_table_run() routine that allows you to
pass a callback to be called with every entry on a given mount.
veriexec.diff adds some raw device access policies: if raw disk is
opened at strict level 1, all fingerprints on this disk will be
invalidated as a safety measure. level 2 will not allow opening disk
for raw writing if we monitor it, and prevent raw writes to memory.
level 3 will not allow opening any disk for raw writing.
both update all relevant documentation.
veriexec concept is okay blymn@.
- remove the support of variable-sized filehandle from compat version of
syscalls. (strictly speaking, it breaks abi. i don't think it's a problem
because this feature is short-lived and there are no affected in-tree
filesystems.)
- unify vfs_copyinfh_alloc and vfs_copyinfh_alloc_size.
- vfs_copyinfh_alloc_size: check fhsize strictly.
- reduce code duplication between compat and current syscalls.
If we cannot write on the slave side, always return EWOULDBLOCK in the
non-blocking case, because we don't know that the buffer we started
writing is actually in a system call boundary.
sysctl(9) flags CTLFLAG_READONLY[12]. luckily they're not documented
so it's only half regression.
only two knobs used them; proc.curproc.corename (check added in the
existing handler; its CTLFLAG_ANYWRITE, yay) and net.inet.ip.forwsrcrt,
that got its own handler now too.
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.
- When freeing, test the value of cr_refcnt from inside the lock perimiter.
- Change some uint16_t/uint32_t types to u_int.
- KASSERT(cr_refcnt > 0) in appropriate places.
- KASSERT(cr_refcnt == 1) when changing the credential.
for two kauth_cred_t rather than kauth_cred_t and struct proc *.
advise against using it in the man-page; it should be used only in cases
where we either don't have an object-specific op or when we can't easily
use one.
struct actually keeps the start of the UTC
time scale and not the boot time. the relationship
is: utc-time = up-time + timebase.
background: when doing an ACPI sleep the uptime
freezes and on wakeup the tc_setclock() leads to
a new timebasebin - this had no relationship with
a boottime as the structure was previously called.
discussed on tech-kern@
anomalies (moving boottime, uptime describing running time)
where discovered by Arnaud Lacombe.
- only set at boot
- only tracking delta of set-time operations
-> will keep boottime stable across ACPI sleeps
uptime(1) will report the time since last boot
introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.
this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.
as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.
also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.
tons of input from yamt@, wrstuden@, martin@, and christos@.
- FHANDLE_SIZE_MAX: refuse unreasonable size allocation, esp. when
it's a user-specified value.
- FHANDLE_SIZE_MIN: pad small filehandles with zero for compatibility.
XXX it might be better to push this into filesystem dependent code so that
new filesystems can choose smaller handles.
- restructure code so that it doesn't try to allocate user-specified
unbound amount of memory.
- don't ignore copyout failure in the case of E2BIG.
- rename vfs_copyinfh to vfs_copyinfh_alloc for consistency.
- don't initialize it needlessly.
- fix the poll code the same way the select code was fixed, so that it
computes the remaining time to sleep properly.
While touching all vptofh/fhtovp functions, get rid of VFS_MAXFIDSIZ,
version the getfh(2) syscall and explicitly pass the size available in
the filehandle from userland.
Discussed on tech-kern, with lots of help from yamt (thanks!).
select(2) timeouts. Introduced via timecounter branch from a
tvtohz() conversion.
The left over timeout was not decremented when re-starting
the sleep in select.
on alpha (which were broken for more than a year appearently and noone
noticed). (The other archs didn't suffer because their pmap_kenter_pa()
doesn't support non-executable mappings.)
With sock_loan_thresh=4096, sb_lowat==sb_hiwat, and sowritable will never
be true (even if only a single byte is pending). Some programs (like screen)
expect select() to return that a socket is writable on a socket when there
is space to write to it. XXX: What is the right thing to do here?
EPROTONOSUPPORT instead of EAFNOSUPPORT.
from pavel@ with a little bit of clean up from myself.
XXX: netbsd32 (and perhaps other emulations) should be able
XXX: to call the standard socket calls for this i think, but
XXX: revisit this at another time.
2. implement solaris-like kmem_alloc/free api, using #1.
(note: this implementation is backed by kernel_map, thus can't be
used from interrupt context.)
the precision of getnanotime() is not suitable for file timestamps.
esp. when it's nfs-exported.
- introduce vfs_timestamp().
(the name is from freebsd. currently merely a wrapper of nanotime())
- for ufs-like filesystems, use it rather than getnanotime().
XXX check other filesystems.
so I remove the assertion uid >= 0 && uid <= UID_MAX. This squashes
a bug where Quagga would panic my machine by passing a UID outside
the range [0, UID_MAX].
AFAICT, this restores the historical (pre-kauth) behavior.
It is likely that GIDs do not satisfy the assertion gid >= 0 &&
gid <= GID_MAX, so remove that, too.
Patch from elad.
supported. This adds further differentiation between which argument to
socket(2) caused the error. No longer are invalid domain (address family)
errors classified as ENOPROTOSUPPORT errors. This should make socket(2)
conform to current POSIX and X/Open standards. Fixes PR/33676.
- struct timeval time is gone
time.tv_sec -> time_second
- struct timeval mono_time is gone
mono_time.tv_sec -> time_uptime
- access to time via
{get,}{micro,nano,bin}time()
get* versions are fast but less precise
- support NTP nanokernel implementation (NTP API 4)
- further reading:
Timecounter Paper: http://phk.freebsd.dk/pubs/timecounter.pdf
NTP Nanokernel: http://www.eecis.udel.edu/~mills/ntp/html/kern.html
code and not trying to use temporary solutions.
Lots of comments and help from YAMAMOTO Takashi, also thanks to the PaX
author for being quick to recognize that something fishy's going on. :)
Hook up in mmap/vmcmd rather than (ugh!) uvm_map_protect().
Next time I suggest to commit a temporary solution just revoke my
commit bit.