432 lines
17 KiB
Plaintext
432 lines
17 KiB
Plaintext
$NetBSD: storage,v 1.18 2016/09/21 20:32:47 jdolecek Exp $
|
|
|
|
NetBSD Storage Roadmap
|
|
======================
|
|
|
|
This is a small roadmap document, and deals with the storage and file
|
|
systems side of the operating system. It discusses elements, projects,
|
|
and goals that are under development or under discussion; and it is
|
|
divided into three categories based on perceived priority.
|
|
|
|
The following elements, projects, and goals are considered strategic
|
|
priorities for the project:
|
|
|
|
1. Improving iscsi
|
|
2. nfsv4 support
|
|
3. A better journaling file system solution
|
|
4. Getting zfs working for real
|
|
5. Seamless full-disk encryption
|
|
6. Finish tls-maxphys
|
|
|
|
The following elements, projects, and goals are not strategic
|
|
priorities but are still important undertakings worth doing:
|
|
|
|
7. nvme support
|
|
8. lfs64
|
|
9. Per-process namespaces
|
|
10. lvm tidyup
|
|
11. Flash translation layer
|
|
12. Shingled disk support
|
|
13. ext3/ext4 support
|
|
14. Port hammer from Dragonfly
|
|
15. afs maintenance
|
|
16. execute-in-place
|
|
17. extended attributes for acl and capability storage
|
|
|
|
The following elements, projects, and goals are perhaps less pressing;
|
|
this doesn't mean one shouldn't work on them but the expected payoff
|
|
is perhaps less than for other things:
|
|
|
|
18. coda maintenance
|
|
|
|
|
|
Explanations
|
|
============
|
|
|
|
1. Improving iscsi
|
|
------------------
|
|
|
|
Both the existing iscsi target and initiator are fairly bad code, and
|
|
neither works terribly well. Fixing this is fairly important as iscsi
|
|
is where it's at for remote block devices. Note that there appears to
|
|
be no compelling reason to move the target to the kernel or otherwise
|
|
make major architectural changes.
|
|
|
|
- As of November 2015 nobody is known to be working on this.
|
|
- There is currently no clear timeframe or release target.
|
|
- Contact agc for further information.
|
|
|
|
|
|
2. nfsv4 support
|
|
----------------
|
|
|
|
nfsv4 is at this point the de facto standard for FS-level (as opposed
|
|
to block-level) network volumes in production settings. The legacy nfs
|
|
code currently in NetBSD only supports nfsv2 and nfsv3.
|
|
|
|
The intended plan is to port FreeBSD's nfsv4 code, which also includes
|
|
nfsv2 and nfsv3 support, and eventually transition to it completely,
|
|
dropping our current nfs code. (Which is kind of a mess.) So far the
|
|
only step that has been taken is to import the code from FreeBSD. The
|
|
next step is to update that import (since it was done a while ago now)
|
|
and then work on getting it to configure and compile.
|
|
|
|
- As of November 2015 nobody is working on this, and a volunteer to
|
|
take charge is urgently needed.
|
|
- There is no clear timeframe or release target, although having an
|
|
experimental version ready for -8 would be great.
|
|
- Contact dholland for further information.
|
|
|
|
|
|
3. A better journaling file system solution
|
|
-------------------------------------------
|
|
|
|
WAPBL, the journaling FFS that NetBSD rolled out some time back, has a
|
|
critical problem: it does not address the historic ffs behavior of
|
|
allowing stale on-disk data to leak into user files in crashes. And
|
|
because it runs faster, this happens more often and with more data.
|
|
This situation is both a correctness and a security liability. Fixing
|
|
it has turned out to be difficult. It is not really clear what the
|
|
best option at this point is:
|
|
|
|
+ Fixing WAPBL (e.g. to flush newly allocated/newly written blocks to
|
|
disk early) has been examined by several people who know the code base
|
|
and judged difficult. Also, some other problems have come to light
|
|
more recently; e.g. PR 50725, PR 47146, and a problem where truncating
|
|
large sparse files takes ~forever in PR 49175. Also see PR 45676. Still,
|
|
it might be the best way forward.
|
|
|
|
+ There is another journaling FFS; the Harvard one done by Margo
|
|
Seltzer's group some years back. We have a copy of this, but as it was
|
|
written in BSD/OS circa 1999 it needs a lot of merging, and then will
|
|
undoubtedly also need a certain amount of polishing to be ready for
|
|
production use. It does record-based rather than block-based
|
|
journaling and does not share the stale data problem.
|
|
|
|
+ We could bring back softupdates (in the softupdates-with-journaling
|
|
form found today in FreeBSD) -- this code is even more complicated
|
|
than the softupdates code we removed back in 2009, and it's not clear
|
|
that it's any more robust either. However, it would solve the stale
|
|
data problem if someone wanted to port it over. It isn't clear that
|
|
this would be any less work than getting the Harvard journaling FFS
|
|
running... or than writing a whole new file system either.
|
|
|
|
+ We could write a whole new journaling file system. (That is, not
|
|
FFS. Doing a new journaling FFS implementation is probably not
|
|
sensible relative to merging the Harvard journaling FFS.) This is a
|
|
big project.
|
|
|
|
Right now it is not clear which of these avenues is the best way
|
|
forward. Given the general manpower shortage, it may be that the best
|
|
way is whatever looks best to someone who wants to work on the
|
|
problem.
|
|
|
|
- There has been some interest in the Harvard journaling FFS but no
|
|
significant progress. Nobody is known to be working on or particularly
|
|
interested in porting softupdates-with-journaling. And, while
|
|
dholland has been mumbling for some time about a plan for a
|
|
specific new file system to solve this problem, there isn't any
|
|
realistic prospect of significant progress on that in the
|
|
foreseeable future, and nobody else is known to have or be working
|
|
on even that much.
|
|
- There is no clear timeframe or release target; but given that WAPBL
|
|
has been disabled by default for new installs in -7 this problem
|
|
can reasonably be said to have become critical.
|
|
- jdolecek is working on fixing WAPBL, goal is to get WAPBL fixed
|
|
enough to be safe to re-enable as default for -8
|
|
- Contact joerg or martin regarding WAPBL; contact dholland regarding
|
|
the Harvard journaling FFS.
|
|
|
|
|
|
4. Getting zfs working for real
|
|
-------------------------------
|
|
|
|
ZFS has been almost working for years now. It is high time we got it
|
|
really working. One of the things this entails is updating the ZFS
|
|
code, as what we have is rather old. The Illumos version is probably
|
|
what we want for this.
|
|
|
|
- There has been intermittent work on zfs, but as of November 2015
|
|
nobody is known to be actively working on it
|
|
- There is no clear timeframe or release target.
|
|
- Contact riastradh or ?? for further information.
|
|
|
|
|
|
5. Seamless full-disk encryption
|
|
--------------------------------
|
|
|
|
(This is only sort of a storage issue.) We have cgd, and it is
|
|
believed to still be cryptographically suitable, at least for the time
|
|
being. However, we don't have any of the following things:
|
|
|
|
+ An easy way to install a machine with full-disk encryption. It
|
|
should really just be a checkbox item in sysinst, or not much more
|
|
than that.
|
|
|
|
+ Ideally, also an easy way to turn on full-disk encryption for a
|
|
machine that's already been installed, though this is harder.
|
|
|
|
+ A good story for booting off a disk that is otherwise encrypted;
|
|
obviously one cannot encrypt the bootblocks, but it isn't clear where
|
|
in boot the encrypted volume should take over, or how to make a best
|
|
effort at protecting the unencrypted elements needed to boot. (At
|
|
least, in the absence of something like UEFI secure boot combined with
|
|
an cryptographic oracle to sign your bootloader image so UEFI will
|
|
accept it.) There's also the question of how one runs cgdconfig(8) and
|
|
where the cgdconfig binary comes from.
|
|
|
|
+ A reasonable way to handle volume passphrases. MacOS apparently uses
|
|
login passwords for this (or as passphrases for secondary keys, or
|
|
something) and this seems to work well enough apart from the somewhat
|
|
surreal experience of sometimes having to log in twice. However, it
|
|
will complicate the bootup story.
|
|
|
|
Given the increasing regulatory-level importance of full-disk
|
|
encryption, this is at least a de facto requirement for using NetBSD
|
|
on laptops in many circumstances.
|
|
|
|
- As of November 2015 nobody is known to be working on this.
|
|
- There is no clear timeframe or release target.
|
|
- Contact dholland for further information.
|
|
|
|
|
|
6. Finish tls-maxphys
|
|
---------------------
|
|
|
|
The tls-maxphys branch changes MAXPHYS (the maximum size of a single
|
|
I/O request) from a global fixed constant to a value that's probed
|
|
separately for each particular I/O channel based on its
|
|
capabilities. Large values are highly desirable for e.g. feeding large
|
|
disk arrays but do not work with all hardware.
|
|
|
|
The code is nearly done and just needs more testing and support in
|
|
more drivers.
|
|
|
|
- As of November 2015 nobody is known to be working on this.
|
|
- There is no clear timeframe or release target.
|
|
- Contact tls for further information.
|
|
|
|
|
|
7. nvme suppport
|
|
----------------
|
|
|
|
nvme ("NVM Express") is a hardware interface standard for PCI-attached
|
|
SSDs. NetBSD now has a driver for these.
|
|
|
|
Driver is now MPSAFE and uses bufq fcfs (i.e. no disksort()) already,
|
|
so the most obvious software bottlenecks were treated. It still needs
|
|
more testing on real hardware, and it may be good to investigate some further
|
|
optimizations, such as DragonFly pbuf(9) or something similar.
|
|
|
|
Semi-relatedly, it is also time for scsipi to become MPSAFE.
|
|
|
|
- As of May 2016 a port of OpenBSD's driver has been commited. This
|
|
will be in -8.
|
|
- The nvme driver is a backend to ld(4) which is MPSAFE, but we still
|
|
need to attend to I/O path bottlenecks. Better instrumentation
|
|
is needed.
|
|
- Flush cache commands via DIOCCACHESYNC is currently implemented using polled
|
|
commands for simplicity, limiting speed to about 10 milliseconds due to use
|
|
of delay(9); investigate if it's worth changing this to a cv to avoid
|
|
the delay, especially for journalled/heavy fsync scenarios
|
|
- NVMe controllers supports write cache administration via GET/SET FEATURE, but
|
|
driver doesn't currently implement the cache ioctls, leading to somewhat
|
|
ugly dkctl(1) output; it would be fairly simple to add this, but would
|
|
require small changes to ld(4) attachment code
|
|
- There is no clear timeframe or release target for these points.
|
|
- Contact msaitoh or agc for further information.
|
|
|
|
|
|
8. lfs64
|
|
--------
|
|
|
|
LFS currently only supports volumes up to 2 TB. As LFS is of interest
|
|
for use on shingled disks (which are larger than 2 TB) and also for
|
|
use on disk arrays (ditto) this is something of a problem. A 64-bit
|
|
version of LFS for large volumes is in the works.
|
|
|
|
- As of November 2015 dholland is working on this.
|
|
- It is close to being ready for at least experimental use and is
|
|
expected to be in 8.0.
|
|
- Responsible: dholland
|
|
|
|
|
|
9. Per-process namespaces
|
|
-------------------------
|
|
|
|
Support for per-process variation of the file system namespace enables
|
|
a number of things; more flexible chroots, for example, and also
|
|
potentially more efficient pkgsrc builds. dholland thought up a
|
|
somewhat hackish but low-footprint way to implement this.
|
|
|
|
- As of November 2015 dholland is working on this.
|
|
- It is scheduled to be in 8.0.
|
|
- Responsible: dholland
|
|
|
|
|
|
10. lvm tidyup
|
|
--------------
|
|
|
|
[agc says someone should look at our lvm stuff; XXX fill this in]
|
|
|
|
- As of November 2015 nobody is known to be working on this.
|
|
- There is no clear timeframe or release target.
|
|
- Contact agc for further information.
|
|
|
|
|
|
11. Flash translation layer
|
|
---------------------------
|
|
|
|
SSDs ship with firmware called a "flash translation layer" that
|
|
arbitrates between the block device software expects to see and the
|
|
raw flash chips. FTLs handle wear leveling, lifetime management, and
|
|
also internal caching, striping, and other performance concerns. While
|
|
NetBSD has a file system for raw flash (chfs), it seems that given
|
|
things NetBSD is often used for it ought to come with a flash
|
|
translation layer as well.
|
|
|
|
Note that this is an area where writing your own is probably a bad
|
|
plan; it is a complicated area with a lot of prior art that's also
|
|
reportedly full of patent mines. There are a couple of open FTL
|
|
implementations that we might be able to import.
|
|
|
|
- As of November 2015 nobody is known to be working on this.
|
|
- There is no clear timeframe or release target.
|
|
- Contact dholland for further information.
|
|
|
|
|
|
12. Shingled disk support
|
|
-------------------------
|
|
|
|
Shingled disks (or more technically, disks with "shingled magnetic
|
|
recording" or SMR) can only write whole tracks at once. Thus, to
|
|
operate effectively they require translation support similar to the
|
|
flash translation layers found in SSDs. The nature and structure of
|
|
shingle translation layers is still being researched; however, at some
|
|
point we will want to support these things in NetBSD.
|
|
|
|
- As of November 2015 one of dholland's coworkers is looking at this.
|
|
- There is no clear timeframe or release target.
|
|
- Contact dholland for further information.
|
|
|
|
|
|
13. ext3/ext4 support
|
|
---------------------
|
|
|
|
We would like to be able to read and write Linux ext3fs and ext4fs
|
|
volumes. (We can already read clean ext3fs volumes as they're the same
|
|
as ext2fs, modulo volume features our ext2fs code does not support;
|
|
but we can't write them.)
|
|
|
|
Ideally someone would write ext3 and/or ext4 code, whether integrated
|
|
with or separate from the ext2 code we already have. It might also
|
|
make sense to port or wrap the Linux ext3 or ext4 code so it can be
|
|
loaded as a GPL'd kernel module; it isn't clear if that would be more
|
|
or less work than doing an implementation.
|
|
|
|
Note however that implementing ext3 has already defeated several
|
|
people; this is a harder project than it looks.
|
|
|
|
- GSoc 2016 brought support for extents, and also ro support for dir
|
|
hashes; jdolecek also implemented several frequently used ext4 features
|
|
so most contemporary ext filesystems should be possible to mount
|
|
read-write
|
|
- still need rw dir_nhash and xattr (semi-easy), and eventually journalling
|
|
(hard)
|
|
- There is no clear timeframe or release target.
|
|
- jdolecek is working on improving ext3/ext4 support (particularily
|
|
journalling)
|
|
|
|
|
|
14. Port hammer from Dragonfly
|
|
------------------------------
|
|
|
|
While the motivation for and role of hammer isn't perhaps super
|
|
persuasive, it would still be good to have it. Porting it from
|
|
Dragonfly is probably not that painful (compared to, say, zfs) but as
|
|
the Dragonfly and NetBSD VFS layers have diverged in different
|
|
directions from the original 4.4BSD, may not be entirely trivial
|
|
either.
|
|
|
|
- As of November 2015 nobody is known to be working on this.
|
|
- There is no clear timeframe or release target.
|
|
- There probably isn't any particular person to contact; for VFS
|
|
concerns contact dholland or hannken.
|
|
|
|
|
|
15. afs maintenance
|
|
-------------------
|
|
|
|
AFS needs periodic care and feeding to continue working as NetBSD
|
|
changes, because the kernel-level bits aren't kept in the NetBSD tree
|
|
and don't get updated with other things. This is an ongoing issue that
|
|
always seems to need more manpower than it gets. It might make sense
|
|
to import some of the kernel AFS code, or maybe even just some of the
|
|
glue layer that it uses, in order to keep it more current.
|
|
|
|
- jakllsch sometimes works on this.
|
|
- We would like every release to have working AFS by the time it's
|
|
released.
|
|
- Contact jakllsch or gendalia about AFS; for VFS concerns contact
|
|
dholland or hannken.
|
|
|
|
|
|
16. execute-in-place
|
|
--------------------
|
|
|
|
It is likely that the future includes non-volatile storage (so-called
|
|
"nvram") that looks like RAM from the perspective of software. Most
|
|
importantly: the storage is memory-mapped rather than looking like a
|
|
disk controller. There are a number of things NetBSD ought to have to
|
|
be ready for this, of which probably the most important is
|
|
"execute-in-place": when an executable is run from such storage, and
|
|
mapped into user memory with mmap, the storage hardware pages should
|
|
be able to appear directly in user memory. Right now they get
|
|
gratuitously copied into RAM, which is slow and wasteful. There are
|
|
also other reasons (e.g. embedded device ROMs) to want execute-in-
|
|
place support.
|
|
|
|
Note that at the implementation level this is a UVM issue rather than
|
|
strictly a storage issue.
|
|
|
|
Also note that one does not need access to nvram hardware to work on
|
|
this issue; given the performance profiles touted for nvram
|
|
technologies, a plain RAM disk like md(4) is sufficient both
|
|
structurally and for performance analysis.
|
|
|
|
- As of November 2015 nobody is known to be working on this. Some
|
|
time back, uebayasi wrote some preliminary patches, but they were
|
|
rejected by the UVM maintainers.
|
|
- There is no clear timeframe or release target.
|
|
- Contact dholland for further information.
|
|
|
|
|
|
17. use extended attributes for ACL and capability storage
|
|
----------------------------------------------------------
|
|
|
|
Currently there is some support for extended attributes in ffs,
|
|
but nothing really uses it. I would be nice if we came up with
|
|
a standard format to store ACL's and capabilities like Linux has.
|
|
The various tools must be modified to understand this and be able
|
|
to copy them if requested. Also tools to manipulate the data will
|
|
need to be written.
|
|
|
|
18. coda maintenance
|
|
--------------------
|
|
|
|
Coda only sort of works. [And I think it's behind relative to
|
|
upstream, or something of the sort; XXX fill this in.] Also the code
|
|
appears to have an ugly incestuous relationship with FFS. This should
|
|
really be cleaned up. That or maybe it's time to remove Coda.
|
|
|
|
- As of November 2015 nobody is known to be working on this.
|
|
- There is no clear timeframe or release target.
|
|
- There isn't anyone in particular to contact.
|
|
- Circa 2012 christos made it work read-write and split it
|
|
into modules. Since then christos has not tested it.
|
|
|
|
Alistair Crooks, David Holland
|
|
Fri Nov 20 02:17:53 EST 2015
|
|
Sun May 1 16:50:42 EDT 2016 (some updates)
|
|
|