Revert rev 1.122:
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/uvm/uvm_bio.c#rev1.122
If this commit is applied to NFS client, changes to files in client
side are sometimes invisible in server side, which results in file
corruption.
Demonstrated by test code provided by Anthony Mallet:
https://mail-index.netbsd.org/current-users/2020/10/17/msg039708.html
Whether the test case above passes or not depends on architectures
and size of NFS I/O specified by -r and -w options of mount_nfs(8)
(the default size is 32KB for x86 and 8KB for other archs).
Whereas it fails on amd64 and i386 with the default size, it passes
on other archs (aarch64, arm, alpha, m68k, and powerpc at least) with
their default. On most ports, it fails with some I/O sizes.
However, the condition for failure is still unclear; whereas it fails
with 2KB I/O size on amiga (m68k, 8KB page), it passes with same I/O
size on alpha (8KB page). It may depends on some VM parameters or
details in pmap implementation, or some race conditions are involved.
Great thanks to Anthony Mallet for providing the test code, and sorry
everyone for breakage.
ubc_fault_page(): Ignore PG_RDONLY flag and always pmap_enter() the page
with the permissions of the original access_type.
It is the file system's responsibility to allocate blocks that is being
modified by write(), before calling into UBC to fill the pages for that
range. KASSERT() is added there to confirm that no clean page is mapped
writable.
Fix infinite loop in uvm_fault_internal(), observed on 16KB-page systems,
where it continues to try to make a partially-backed page writable.
No regression in ATF and KASSERT() does not fire on several architectures,
as far as I can see.
Fix suggested by chs. Thanks!
tmpfs calls pmap_kenter_pa(9) with virtual address with page offset
Bisectioning revealed that the failure starts with this commit:
sys/fs/tmpfs/tmpfs_vnops.c rev 1.142:
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/fs/tmpfs/tmpfs_vnops.c#rev1.142
by which tmpfs became to use UBC_FAULTBUSY flag for ubc_uiomove(9).
If this flag is specified, pmap_kenter_pa(9) is called with virtual
address with page offset via ubc_alloc(9):
https://nxr.netbsd.org/xref/src/sys/uvm/uvm_bio.c#616
Most ports seem to neglect silently page offset of va argument for
pmap_kenter_pa(9). However, it causes KASSERT failure correctly for
powerpc/booke. So, truncate page offset there.
Now, tmpfs works just fine on evbppc-booke, and I've confirmed that
no new failures are detected by ATF.
Discussed with chs@. Thanks!
of (1 << UBC_WINSHIFT, MAX_PAGE_SIZE)
given that default UBC_WINSHIFT is 13, this changes behaviour only
for mips and powerpc (BookE/OEA), which will now have twice as much
memory used for UBC windows; if this ever becomes a problem, it's
possible to reduce ubc_nwins in MD code similar to what is done on sparc
this eliminates variable-length arrays in ubc_fault(),
ubc_uiomove(), and ubc_zerorange() so that the stack usage can be
determined and checked in compile time
it works basically the same way as !direct minus temporary mappings, and
there are no concurrency issues.
- ubc_alloc_direct(): In the PGO_OVERWRITE case blocks are allocated
beforehand. Avoid waking or activating pages unless needed.
or one too few pages can be mapped.
- In ubc_release() with UBC_FAULTBUSY, chances are that pages are newly
allocated and freshly enqueued, so avoid uvm_pageactivate() if possible
- Keep track of the pages mapped in ubc_alloc() in an array on the stack,
and use this to avoid calling pmap_extract() in ubc_release().
Also problems with tmpfs+nfs noted by hannken@.
Don't pass PGO_ALLPAGES to pgo_get, and ignore PGO_DONTCARE in the
!PGO_LOCKED case. In uao_get() have uvm_pagealloc() take care of page
zeroing and release busy pages on error.
- Add new flag UBC_ISMAPPED which tells ubc_uiomove() the object is mmap()ed
somewhere. Use it to decide whether to do direct-mapped copy, rather than
poking around directly in the vnode in ubc_uiomove(), which is ugly and
doesn't work for tmpfs. It would be nicer to contain all this in UVM but
the filesystem provides the needed locking here (VV_MAPPED) and to
reinvent that would suck more.
- Rename UBC_UNMAP_FLAG() to UBC_VNODE_FLAGS(). Pass in UBC_ISMAPPED where
appropriate.
- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.
- Reduce unnecessary page scan in putpages esp. when an object has a ton of
pages cached but only a few of them are dirty.
- Reduce the number of pmap operations by tracking page dirtiness more
precisely in uvm layer.
of page interlocks. Require that the page interlock be held over calls to
uvm_pageactivate(), uvm_pagewire() and similar.
- Solve the concurrency problem with page replacement state. Rather than
updating the global state synchronously, set an intended state on
individual pages (active, inactive, enqueued, dequeued) while holding the
page interlock. After the interlock is released put the pages on a 128
entry per-CPU queue for their state changes to be made real in batch.
This results in in a ~400 fold decrease in contention on my test system.
Proposed on tech-kern but modified to use the page interlock rather than
atomics to synchronise as it's much easier to maintain that way, and
cheaper.
lock for use of the pagedaemon policy code. Discussed on tech-kern.
PR kern/54209: NetBSD 8 large memory performance extremely low
PR kern/54210: NetBSD-8 processes presumably not exiting
PR kern/54727: writing a large file causes unreasonable system behaviour
genfs_getpages() return unallocated pages using the zero page and
PG_RDONLY; the old code relied on fault logic to get it allocated, which
the direct case can't rely on
instead just allocate the blocks right away; pass PGO_JOURNALLOCKED
so that code wouldn't try to take wapbl lock, this code path is called
with it already held
this should fix KASSERT() due to PG_RDONLY on write with wapbl
towards resolution of PR kern/53124
as non-direct code - otherwise the code tries to acquire the wapbl
lock again in genfs_getpages(), and panic due to locking against itself
towards PR kern/53124
to bother pdaemon with PG_BUSY pages; also clear the PG_FAKE and PG_CLEAN after
we are done with the write
this does not make any difference on my machine, but maybe it might fix
the machine check panic on Martin's alpha
while here remove UBC_PARTIALOK handling from ubc_zeropage_direct(), just to be sure
it works exactly the same as the non-direct one
to map pages into kernel
this improves performance of UBC-based (read(2)/write(2)) I/O especially
for cached block I/O - sequential read on my NVMe goes from 1.7 GB/s to 1.9 GB/s
for non-cached, and from 2.2 GB/s to 5.6 GB/s for cached read
the new code is conditional now and off for now, so that it can be tested further;
can be turned on by adjusting ubc_direct variable to true
part of fix for PR kern/53124
since the protection code is not applied: the pages are manually kentered
as RW.
But fix it anyway, so that "pmap 0" does not say the map is executable.
in PR kern/52639, as well as some general cleaning-up...
(As proposed on tech-kern@ with additional changes and enhancements.)
Details of changes:
* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)
* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.
* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.
* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."
* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.
* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).
* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).
* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.
* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.
[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".
[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
kmem_alloc() with KM_SLEEP
kmem_zalloc() with KM_SLEEP
percpu_alloc()
pserialize_create()
psref_class_create()
all of these paths include an assertion that the allocation has not failed,
so callers should not assert that again.
PRIxOFF (and PRIdOFF) to print off_t's in a way that matches the
arch's definition of off_t.
In the meantime fall back on %jx and an (intmax_t) cast. Ugly.
(And the way it is written is even uglier...)
Why is there no PRI[xd]OFF ? How are off_t's intended to be printed?
If a PRIxOFF gets added in some appropriate place, the XXX lines in this
commit can go away.
(I understand not having PRI[xd]VOFF).
is to provide routines that do as KASSERT(9) says: append a message
to the panic format string when the assertion triggers, with optional
arguments.
Fix call sites to reflect the new definition.
Discussed on tech-kern@. See
http://mail-index.netbsd.org/tech-kern/2011/09/07/msg011427.html