This underscores the need to replace this error-prone cpp API by
unconditionally defined {pre,post}atomic_membar_*.
This change should only remove unnecessary membar_producers on x86.
the file action list) by limiting the maximum number of file actions to
twice the current file descriptor limit.
Fix a few bugs in the support functions and document the new limit.
From Maxime Villard.
in subr_cprng and get rid of SYSCTL_PRIVATE namespace leak macro.
Fixes ping(8) when run against a standalone rump kernel due to appearance
of the kern.urandom sysctl node (in case someone was wondering ...)
describe the process memory layout).
Fudge the a.out core code to not dump the entire contents.
I'm not sue that anything can read a.out core files - more progress might
be made on such dumps by converting the a.out file to elf!
of the fp save area to all the process_read_fpregs() and
process_write_fpregs() functions.
None of the functions have been modified to use the new parameters.
The size is set for all the writes, but some of the arch-specific reads
just pass NULL.
The amd64 (and i386) need variable sized fp register save areas in order
to support AVX and other enhanced register areas.
These functions are rarely called - so the extra argument won't matter.
Change the interface to ELFNAMEEND(coredump_savenote) so that the caller
doesn't need to know the type of the elf note header.
Simplifies the calling code somewhat.
'fast path' size on the first path matches the actual size on the second)
save all the notes (mostly the cpu registers for all the LWPs) in
malloced memory on the first pass.
Sanity check that the number of memory segments matches written matches
the count obtained earlier. If gcore() is used they could differ.
(Not sure that returning ENOMEM is ideal, but it is better than a crash.)
- Add some extra comments.
- Add some XXX comments because the process state might not be stable,
- Add uvm_coredump_count_segs() to simplify the calling code.
- uvm code now only returns non-empty sections/segments.
- Put the 'iocookie' into the 'cookie' block passed to uvm_coredump_walkmap()
instead of passing it through as an additional parameter.
amd64 can still generate core dumps that gdb can read.
from 'void *' to the actual type 'struct coredump_iostate *'.
In most of the code the contents of the structure are still unknown.
This just stops the wrong type of pointer being passed to the 'void *'
parameter.
I hope I've found everything, amd64 GENERIC and i386 GENERIC & ALL compile.
Otherwise things which treat syscall parameters as register_t (like
ktrace) will encounter garbage for parameters which are of smaller size
than register_t. Using memset is probably not the most optimal way,
but oh well.
as well use __strong_alias instead of __weak_alias. Some toolchains
such as the cygwin pecoff one get weak aliases a bit wrong, so avoiding
unnecessary weak alises helps there.
- No need to always defer layer vnodes, if we get the vnode lock it
is safe to inactivate.
- Always use VOP_LOCK(), it makes no sense to use vn_lock() here.
- No need to drop v_interlock for VOP_LOCK(... LK_NOWAIT).
the vnode lock but before it calls VOP_INACTIVE().
Should fix the race between layer_node_find() trying to vget(, LK_NOWAIT)
a locked vnode when vrelel() marked it as changing and wants its lock.
PR kern/48411 (repeatable SMP crashes in amd64-current)
Adjust pcu(9) to this xcall(9) change. This may fix the problems after
x86 FPU was converted to use PCU, since it avoids heavy contention at the
lower levels (particularly, IPL_SOFTNET). This is a good illustration why
software interrupts should generally avoid any blocking on locks.
process just did a setuid() call, the lwp might not have had a chance to
refresh l->l_cred (still has LPR_CRMOD), and we don't want to bother spending
time syncing the creds of a dying lwp. Should fix the problem with hald
people have been observing.
set while a vnode changes state from active to inactive or from active
or inactive to clean and protects "vclean(); vrelel()" and "vrelel()"
against "vget()".
Presented on tech-kern.
- VC_XLOCK/VC_MASK are not used anymore, remove.
- If we get a reference while cleaning, there is no need to retry as
these reference and this vnode will disappear soon.
- Make sure we run inside a fstrans transaction to prevent deadlocks
against vget().
vrecycle():
- don't even try to recycle a vnode currently cleaning.
- Make these defines and functions private to vfs_vnode.c:
VC_MASK, VC_LOCK, DOCLOSE, VI_IANCTREDO and VI_INACTNOW
vclean() and vrelel()
- Remove the long time unused lwp argument from vrecycle().
- Remove vtryget(), it is responsible for ugly hacks and doesn't
look that effective.
Presented on tech-kern.
Welcome to 6.99.25
that the address length returned is correct, not always 106. Note that
we do things slightly differently than linux and explain why. Unit-tests
to come.
where accept(2) was called on a unix socket that called connect(2) and
then close(2), before the connection was accepted, return the empty
sockaddr_un.
- Fix the length of the empty sockaddr_un socket so that it reflects reality.
unp->unp_conn will be NULL in PRU_ACCEPT, as called from
sys_accept->so_accept. This will cause the usrreq to return with
no error, leaving the mbuf gotten from m_get() with an uninitialized
length, containing junk from a previous call. Initialize m_len to
be 0 to handle this case. This is yet another reason why Beverly's
idea of setting m_len = 0 in m_get() makes a lot of sense. Arguably
this could be an error, since the data we return now has 0 family
and length.
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.
Welcome to 6.99.24
Discussed on tech-kern@ some time ago.
Reviewed by: David Holland <dholland@netbsd.org>
syscall stub symbol names. This allows running standalone programs in
OS-less environments such as directly on a Xen DomU backed only by a
libc and a rump kernel.
local syscall entry points. The wrapper is now so big that it doesn't
get inlined (original intent for having it close to the entry points),
and autogenerating a regular function just loses in flexibility.
A vnode lock is required to access or modify this field.
Lock/unlock the vnode when clearing v_mountedhere.
Reviewed by: David Holland <dholland@netbsd.org>
Should fix PR #48135 (Bad locking for umount)
- Do rnd_init_softint as early as possible in main, after configure2,
and before networking is initialized.
- Initialize the rnd_wakeup softint in rnd_init_softint, not lazily in
rnd_schedule_wakeup.
ok tls
This reverts
sys/dev/rnd_private.h -> r1.1
sys/kern/init_main.c -> r1.450
sys/kern/kern_rndq.c -> r1.14
sys/kern/kern_rndsink.c -> r1.2
Parts of these changes will be added back, and the rndsource
callbacks will be fixed to avoid the lock recursion bug that
motivated the stop-gaps in the first place.
ok tls
Otherwise, rndsink_request takes rndsinks_lock and calls
rnd_extract_data, which synchronously calls rndsinks_distribute,
which takes rndsinks_lock -> boom.
This is a stop-gap on a stop-gap on a stop-gap; we really ought to
back out all of these stop-gaps, make bcm2835_rng call rnd_add_data
asynchronously to work around the original symptom, and design a real
solution when we have time to sort this mess out properly.
* elf_check_header() already ensures eh.e_phnum > MAXPHNUM, so do not
test it again at the call site
* is_dyn == true implies a successfull call to elf_check_header(eh, ET_DYN),
so no need to call elf_check_header(eh, ET_EXEC)
From Maxime Villard.
hardware RNGs using the polling mode of operation:
1) Initialize the rng subsystem soft interrupts as early in kernel startup
as seems safe (we have no MI guarantee that softints are working at all
until configure2() returns, AFAICT).
This should have the rnd subsystem able to process events via softint
before the network subsystem (a notorious early user of entropy) starts.
2) Remove the shortcut calls to rnd_process_events() from
rnd_schedule_process(), with the result that until the softint is installed
rnd_process_events() is a NOP.
3) Directly call rnd_process_events() in rnd_extract_data(),
rnd_maybe_extract(), and rnd_init_softint(). This should suck up any
samples actually collected as early as possible.
current CPU, and use it if a CPU is taken offline
-add a bool argument to pcu_discard which tells whether the internal
"LWP has used the coprocessor" flag should be set or reset. The flag
is reported by pcu_used_p(). If set, future accesses should use the
state stored in the PCB. If reset, it should be reset to default.
The former case is useful for setmcontext().
With that, it should not be necessary anymore to manage the "FPU used"
state by an additional MD variable.
approved by matt
was sent with zero file descriptors in it. Otherwise, a zero-length
temporary storage would be allocated which triggers panic on DIAGNOSTIC
kernels (but is harmless for release kernels).
reviewed by Taylor R Campbell
The type daddr_t is not available for all systems (e.g. Linux systems with
musl libc), and exposing it will just cause an unnecessary compilation
failure even if the type is not used.
- Push /dev/random `information-theoretic' accounting into cprng(9).
- Use percpu(9) for the per-CPU CPRNGs.
- Use atomics with correct memory barriers for lazy CPRNG creation.
- Remove /dev/random file kmem grovelling from fstat(1).
can take flags (M_WAITOK), and allocate large messages if needed. It also
returns the allocated pointer instead of copying the data to the passed
pointer. Implement sbcreatecontrol() using that.
consttime_memequal is the same as the old consttime_bcmp.
explicit_memset is to memset as explicit_bzero was to bcmp.
Passes amd64 release and i386/ALL, but I'm sure I missed some spots,
so please let me know.
rndsink(9):
- Simplify API.
- Simplify locking scheme.
- Add a man page.
- Avoid races in destruction.
- Avoid races in requesting entropy now and scheduling entropy later.
Periodic distribution of entropy to sinks reduces the need for the
last one, but this way we don't need to rely on periodic distribution
(e.g., in a future tickless NetBSD).
rndsinks_lock should probably eventually merge with the rndpool lock,
but we'll put that off for now.
cprng(9):
- Make struct cprng_strong opaque.
- Move rndpseudo.c parts that futz with cprng guts to subr_cprng.c.
- Fix kevent locking. (Is kevent locking documented anywhere?)
- Stub out rump cprng further until we can rumpify rndsink instead.
- Strip code to grovel through struct cprng_strong in fstat.
The "threshold" value was being inappropriately used to limit how many
bytes could be output even after the estimator said enough bytes had
been put in to meet our minimum security guarantee.
This fixes a panic observed with the automatic test harness and by
msaitoh, where it was not possible to extract the full estimate's worth
of bytes even holding the pool lock across the estimate and extract
calls.
soft interrupt driven operation.
Add a polling mode of operation -- now we can ask hardware random number
generators to top us up just when we need it (bcm2835_rng and amdpm
converted as examples).
Fix a stall noticed with repeated reads from /dev/random while testing.
can named the same as those on other platforms.
For example, proc:::exec-success, not proc:::exec_success.
Implementation follows the same basic principle as FreeBSD's; add
another field to the SDT_PROBE_DEFINE macro which is the name
as exposed to userland.
pps_ref_event() allows capturing PPS time stamps
that are not generated at precisely 1Hz (e. g.
by reading a precision clock via callout()).
This extension allows clock drivers to supply PPS
time-stamps and drive the kernel NTP PLL
without the overhead of interrupt-handling and
-processing.
it to 0. Some callers (e.g. nanosleep1()) expect *start to always be
initialized and would use random values from stack otherwise.
While there, remove an always-true conditionnal.
backwards, in low-entropy conditions there was a time interval in which
/dev/urandom could still output bits on an unacceptably short key. Output
from /dev/random was *NOT* impacted.
Eliminate the flag in question -- it's safest to always fill the requested
key buffer with output from the entropy-pool, even if we let the caller
know we couldn't provide bytes with the full entropy it requested.
Advisory will be updated soon with a full worst-case analysis of the
/dev/urandom output path in the presence of either variant of the
SA-2013-003 bug. Fortunately, because a large amount of other input
is mixed in before users can obtain any output, it doesn't look as dangerous
in practice as I'd feared it might be.
A type specifier of the form
enum identifier
without an enumerator list shall only appear after the type it
specifies is complete.
which means that we cannot pass an "enum vtype" argument to
kauth_access_action() without fully specifying the type first.
Unfortunately there is a complicated include file loop which
makes that difficult, so convert this minimal function into a
macro (and capitalize it).
(ok elad@)
so we need to retry if curlwp took a context switch during the call.
Otherwise, CPU-local invariants can get screwed up:
panic: kernel diagnostic assertion "cur->pcg_avail == cur->pcg_size" failed
This is (was) very easy to reproduce by just running:
while : ; do RUMP_NCPU=32 ./a.out ; done
where a.out only calls rump_init(). But, any situation there's contention
and a pool doesn't have emptygroups would do.
and {.tv_sec = 0, .tv_nsec=0} means do not block at all. Add a comment
saying so. The code incorrectly treats them both as an infinite timeout,
and that is not fixed by this commit.
- Don't leave garbage in the control buffer if allocating file
descriptors fails in unp_externalize.
- Scrub the space between CMSG_LEN and CMSG_SPACE to avoid kernel
memory disclosure in unp_externalize.
- Don't read past cmsg_len when closing file descriptors that
couldn't get delivered, in free_rights.
ok christos
To retrieve a spec_node, two new lookup functions (by device or by mount)
are implemented. Both return a referenced vnode, for an opened block device
the opened vnode is returned so further diagnostic checks "vp == ... sd_bdevvp"
will not fire. Otherwise any vnode matching the criteria gets returned.
No objections on tech-kern.
Welcome to 6.99.17
defined in ddb/db_panic.c and declared in ddb/ddbvar.h. No functional
change.
The copyright years in db_panic.c are the years in which changes were
made to the code that has now been moved to db_panic.c. No pre-NetBSD
copyright notice is needed because revision 1.12 of subr_prf.c had only
the trivial "#ifdef DDB \\ Debugger(); \\ #endif"
pass memory for vmem structs into the initialization function and
do away with the static pool of vmem structs.
remove special bootstrapping of the quantum cache pools of the kmem_va_arena
as memory for pool_caches is allocated via pool_allocator_meta which is
fully operational at this point.
and process them in bulk, but, always declare hardware RNGs and VM system
sources as "fast" since in these cases efficiency is important and data
will be abundant.
before we had ever had any entropy, if something else has consumed the
entropy that triggered the immediate reseed, we can reseed with as little
as sizeof(int) bytes of entropy.
pass memory for vmem structs into the initialization functions and
do away with the static pools for this.
factor out the vmem internal structures into a private header.
remove special bootstrapping of the kmem_va_arena as all necessary memory
comes from pool_allocator_meta wich is fully operational at this point.
from pserialize(9) and mutex / condvar.
The fast paths (fstrans_start/fstrans_done on a file system not
suspended or suspending and fscow_run with no change pending) now
run without locks or other atomic operations. Suspension and cow
handler insertion and removal is done with mutex / condvars.
The API remains unchanged.
freeing the VA; otherwise there is a window when it can be re-used while
stale TLB entries may be present.
- physmap_fill: use MIN() instead of min(), since vsize_t is used.
- Add RCS ID comment while here and prevent physmap.h inclusion in userland.
from lists of pages or ranges of virtual address. By using these physical
maps, the kernel can avoid mapping physical I/O in the kernel's address space
in most cases.
the *at system calls on November 18th of last year. Reasons to revert
it include:
- it is incorrect in a whole variety of ways (but fortunately, one
of them is that the missing and improper permission checks have
no net effect);
- it was committed without review or discussion;
- core ruled that all the new O_* flags pertaining to the *at calls
needed to wait until their semantics could be clarified.
manu was asked to revert it on these grounds but has ignored the request.
I have left O_SEARCH defined and visible and made open() explicitly
ignore it. This way, most code that tries to use it will continue to
build and run. I've also arranged lib/libc/c063/t_o_search.c so that
the tests that make use of the O_SEARCH semantics will disappear until
O_SEARCH comes back, and fixed some mistakes and/or incorrect hacks
that were causing some of these to succeed despite the broken O_SEARCH
implementation.
finishes (without blocking). Problem reported by hannken@, thanks!
- pserialize_read_enter: use splsoftserial(), not splsoftclock().
- pserialize_perform: add xcall(9) barrier as interrupts may be coalesced.