temporary zbookmark_phys_t using kmem_alloc() rather than stack;
this recuses several times usually, and this saves 2x
sizeof(zbookmark_phys_t) == 64 bytes per recursion
part of fix for PR kern/55402 by Frank Kardel
on-stack - this function is called recursively, and the 120 bytes per call
add up; also remove unused variable
part of fix for PR kern/55402 by Frank Kardel
this should generally slightly improve performance on MP systems, and
specifically for xbd(4) storage avoids slow unaligned I/O buffer handling
this change requires updated kernel, to allow up to SPA_MAXBLOCKSHIFT item
size for pools
fixes PR kern/55397 by Frank Kardel
cached value will do, or if the very latest total must be fetched. It can
be called thousands of times a second and fetching the totals impacts not
only the calling LWP but other CPUs doing unrelated activity in the VM
system.
This logic correctly uses strncpy(3) to fully initialize a fixed-width field, and also ensures
NUL-termination on the next line as other users of the field expect.
Add -Werror=stringop-truncation to prevent build failure, when run with MKSANITIZER=yes.
Error was reported when build.sh was run with MKSANITIZER=yes flag.
Reviewed by: kamil@
Operation zfs_znode.c::zfs_zget_cleaner() depends on this
zil_commit() as a barrier to guarantee the znode cannot
get freed before its log entries are resolved.
Currently x86_patch_window_open is a big problem, because it is a perfect
function to inject/modify executable code with ROP.
- Remove x86_patch_window_open(), along with its x86_patch_window_close()
counterpart.
- Introduce a read-only link-set of hotpatch descriptor structures,
which reference a maximum of two read-only hotpatch sources.
- Modify x86_hotpatch() to open a window and call the new
x86_hotpatch_apply() function in a hard-coded manner.
- Modify x86_hotpatch() to take a name and a selector, and have
x86_hotpatch_apply() resolve the descriptor from the name and the
source from the selector, before hotpatching.
- Move the error handling in a separate x86_hotpatch_cleanup() function,
that gets called after we closed the window.
The resulting implementation is a bit complex and non-obvious. But it
gains the following properties: the code executed in the hotpatch window
is strictly hard-coded (no callback and no possibility to execute your own
code in the window) and the pointers this code accesses are strictly
read-only (no possibility to forge pointers to hotpatch an area that was
not designated as hotpatchable at compile-time, and no possibility to
choose what bytes to write other than the maximum of two read-only
templates that were designated as valid for the given destination at
compile-time).
With current CPUs this slightly improves a situation that is already
pretty bad by definition on x86. Assuming CET however, this change closes
a big hole and is kinda great.
The only ~problem there is, is that dtrace-fbt tries to hotpatch random
places with random bytes, and there is just no way to make it safe.
However dtrace is only in a module, that is rarely used and never compiled
into the kernel, so it's not a big problem; add a shitty & vulnerable
independent hotpatch window in it, and leave big XXXs. It looks like fbt
is going to collapse soon anyway.
- Don't use a static buffer for the result.
- kauth_cred_getgroups refuses to return more than the actual number
of groups, so passing NGROUPS_MAX generally doesn't work.
To avoid patching zfs, just expose struct kauth_cred::cr_groups
directly, with __KAUTH_PRIVATE. Unclear why the official API only
exposes it via memcpy or copyout anyway.
This makes unprivileged zfs operations work, by anyone with access to
/dev/zfs (which is conventionally mode 777, and which we should maybe
set it to by default; zfs has its own ACL system, zfs allow).
It looks like this is a false positive, since the section of code triggering the error
external/cddl/osnet/dist/lib/libdtrace/common/dt_proc.c:400:42:
is only accessed after "err" is initialized.
Error was reported when build.sh was run with MKLIBCSANITIZER=yes flag.
Reviewed by: kamil@
1. Issue zil_commit only if we're actually updating something --
there's no need to commit if we're unlinking the file or if
there's no atime update being applied.
2. Issue zil_commit only if the zfs has sync=always set -- for
sync=standard there's no need for us to commit anything here since
no application asked for an explicit sync.
Speeds up untarring base.tgz on top of itself by a factor of about
2x, and speeds up rm by a factor of about 10x, on my system with an
SSD SLOG over SATA. Histogram of unlink, rmdir, and rename timing
shows dramatic reduction in latency for most samples.
(To be fair, this was not an improvement over zfs; issuing the
unnecessary zil_commit was a self-inflicted performance wound.)
struct disk_sectoralign {
/* First aligned sector number. */
uint32_t dsa_firstaligned;
/* Number of sectors per aligned unit. */
uint32_t dsa_alignment;
};
- Teach wd(4) to get it from ATA.
- Teach cgd(4) to pass it through from the underlying disk.
- Teach dk(4) to pass it through with adjustments.
- Teach zpool (zfs) to take advantage of it.
=> XXX zpool doesn't seem to understand when the vdev's starting
sector is misaligned.
Missing:
- ccd(4) and raidframe(4) support -- these should support _using_
DIOCGSECTORALIGN to decide where to start putting ccd or raid
stripes on disk, and these should perhaps _implement_
DIOCGSECTORALIGN by reporting the stripe/interleave factor.
- sd(4) support -- I don't know any obvious way to get it from SCSI,
but if any SCSI wizards know better than I, please feel free to
teach sd(4) about it!
- any ld(4) attachments -- might be worth teaching the ld drivers for
nvme and various raid controllers to get the aligned sector size
There's some duplicate logic here for now. I'm doing it this way,
rather than gathering the logic into a new disklabel_sectoralign
function or something, so that this change is limited to adding a new
ioctl, without any new kernel symbols, in order to make it easy to
pull up to netbsd-9 without worrying about the module ABI.
Detected by UBSan and already fixed upstream.
Cherry-pick:
From aa0218d6a12814fac50b287214f9f3b0b99e11b1 Mon Sep 17 00:00:00 2001
From: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Tue, 7 Jan 2014 23:24:37 +0100
Subject: [PATCH] Fix nvlist 'Bus Error' for Sparc
The mis-aligned memory accesses in nvpair_native_embedded() and
nvpair_native_embedded_array() will cause a 'Bus Error' for
architectures such as Sparc which not fully byte addressible.
To avoid this issue care is taken to avoid dereferencing the
potentially mis-aligned packed nvlist_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
This is needed, among other things, to swap on zvols.
Attempting to swap on zvols currently deadlocks but that's a separate
issue that needs to be fixed too!
- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.
`Toxic' means dtrace forbids D scripts from even attempting to read
or write at them.
Previously we considered [0, VM_MIN_KERNEL_ADDRESS) toxic, but
VM_MIN_KERNEL_ADDRESS is only the minimum address of the kernel map;
the direct-mapped region lies below it, and with PMAP_MAP_POOLPAGE we
allocate virtual pages for pool backing directly from physical pages
through the direct-mapped region. Also, this did not consider I/O
mappings to be toxic, which they probably should be.
Instead, treat:
[0, AARCH64_KSEG_START)
and
[VM_KERNEL_IO_ADDRESS, 0xfff...ff)
as toxic. (The upper bound for 0xfff...ff ought to be inclusive, not
exclusive, but I think we'll need another mechanism for expressing
that to dtrace!)
aarch64 fbt_invop doesn't actually use the argument, but it would
make more sense for it to be the return value and/or first argument
register. Certainly it's not `eax'!
Add GCC_NO_STRINGOP_TRUNCATION dwarf.c to prevent build failure.
Error was reported when build.sh was run with MKLIBCSANITIZER=yes flag.
Reviewed by: kamil@
- Reduce unnecessary page scan in putpages esp. when an object has a ton of
pages cached but only a few of them are dirty.
- Reduce the number of pmap operations by tracking page dirtiness more
precisely in uvm layer.
lock for use of the pagedaemon policy code. Discussed on tech-kern.
PR kern/54209: NetBSD 8 large memory performance extremely low
PR kern/54210: NetBSD-8 processes presumably not exiting
PR kern/54727: writing a large file causes unreasonable system behaviour
have been converted to GCC_NO_FORMAT_TRUNCATION rather than
GCC_NO_STRINGOP_TRUNCATION which is what happened. This might unbreak
the build (olr at least get it further).
GCC_NO_FORMAT_TRUNCATION -Wno-format-truncation (GCC 7/8)
GCC_NO_STRINGOP_TRUNCATION -Wno-stringop-truncation (GCC 8)
GCC_NO_STRINGOP_OVERFLOW -Wno-stringop-overflow (GCC 8)
GCC_NO_CAST_FUNCTION_TYPE -Wno-cast-function-type (GCC 8)
use these to turn off warnings for most GCC-8 complaints. many
of these are false positives, most of the real bugs are already
commited, or are yet to come.
we plan to introduce versions of (some?) of these that use the
"-Wno-error=" form, which still displays the warnings but does
not make it an error, and all of the above will be re-considered
as either being "fix me" (warning still displayed) or "warning
is wrong."
zfs_netbsd_{create,mknod,link,etc..} that call functions called
zfs_{create,mknod,link,etc..}. These later functions may return a
error code along with a *vpp that is NULL. This situation was not
handled by the zfs_netbsd_* functions and would result in a panic in a
number of cases. The simplest to trigger it was filling up a dataset
or pool resulting in a over quota condition. An attempt to create
another file, or directory at that point would panic.
by /sbin/{zfs,zpool,mount_zfs}. The general effect is to move them
from /usr/lib to /lib. Compatibility links are installed in /usr/lib
and nothing that is installed, say in /usr/pkg, appears to break.
With this, it is possible to have a /var and /usr mount using ZFS
legacy mounting early on in the boot process.
Run tested on amd64 and i386 and compile tested on evbarm.
in the ZFS properties of the dataset and a simple man page for
mount_zfs. With this, it is possible to put ZFS filesystems in
/etc/fstab as file system type zfs.
Add a rc.d script that kicks the module ZFS load mostly before
mountall runs simular to what LVM does. This allows for any legacy
mounts to be specified in critical_local_filesystems and allows for
ZFS pools on top of cgd (probably among other things). Introduce a
rc.conf variable called zfs which needs to be set to YES, in the usual
manor of things, to get zvols and ZFS dataset support rather then just
assume that 'zfs mount' does that in mountall. Fix a problem in
mountall if ZFS is not compiled into the system.
remove now unneeded files from "external/cddl/osnet/sys/sys/".
- sys/sys/bitmap.h -> dist/uts/common/sys/bitmap.h
- sys/sys/callb.h -> dist/uts/common/sys/callb.h
Stop including "cpupart.h", not needed for build.
GENRIC kernel below 2**15-1 (from ~34000 to ~29800).
Remove the type attributes "volatile", "const" and "restrict",
for DTRACE these attributes are of little value.
FreeBSD splits "zfs_context.h" into:
"lib/libzpool/common/sys/zfs_context.h" for user space
"uts/common/fs/zfs/sys/zfs_context.h" for kernel space
Do the same here, move and sync "sys/sys/zfs_context.h" to
"dist/lib/libzpool/common/sys/zfs_context.h" and
"dist/uts/common/fs/zfs/sys/zfs_context.h".
Change "Makefile.zfs" to search includes from "dist/lib"
before "dist/uts" so we get the right include file.
Adapt "usr.sbin/fstyp/Makefile" to get the right include file.
zfs_suspend_fs()/zfs_resume_fs() and get rid of dead "z_sa_hdl == NULL"
znodes before vfs_resume() to keep the vnode cache consistent.
Live rollback should work now.
PR port-xen/54273 ("zpool create pool xbd2" panics DOMU kernel)
the latter implicitly calls cv_signal() on error.
This leads to "tq_waiting > 0" with "tq_running == 0" and the
taskq stalls.
Change task_executor() to increment and decrement "tq_waiting"
and always check and run the queue after cv_timedwait().
Use mstohz(), fix timeout and sort includes.
Addresses PR port-xen/54273: "zpool create pool xbd2" panics DOMU kernel
Change zvol_size_changed() to initialize "zv->zv_volsize"
and initialize only "dg_secsize" and "dg_secperunit".
Calling disk_set_info() will initialize the remaining
parts of the geometry.
Set "doread" in zvol_strategy() to make reading from
device possible.
Reorganize/add disk_busy()/disk_unbusy() instrumentation.
Redo zvol_ioctl() to implement DIOCGWEDGEINFO and let
disk_ioctl() process the remaining ioctls.
- Put device major numbers into "dev_info_t".
- Fix an off-by-one in zvol_create_minor().
- When creating a node handle existing nodes
and add owner read/write permission.
- When removing nodes remove now empty directories.
- Defer spa_config_load() until root is mounted.
- Restore the config path to "/etc/zfs/zpool.cache".
- Module "zfs" is type MODULE_CLASS_VFS and no longer depends on "rootvnode".
- Module "solaris" no longer depends on "mp_online".
- Fix rump component registration to not detach "/dev/zfs" if
it didn't attach it.
Solaris upstream. FreeBSD already replaced it with a glue to their
taskqueue API.
Replace it with a glue component that queues Solaris taskq requests to
threadpool jobs.
like FreeBSD and Illumos do.
Use "f_fsid" for "va_fsid" and cheat NFSD to export snapshots under
".zfs" by setting these snaphots "f_fsidx" to the parents "f_fsidx".