spares. While here, observe that we were actually doing one more
stripe than we thought we were, and correct that too (it didn't matter
for non-RAID5_RS, but it definitely does for RAID5_RS). Add some
bounds-checking at the beginning to handle the case where the number
of stripes in the set is smaller than the sliding reconstruction window.
XXX: this problem likely needs to be fixed for PARITY_DECLUSTERING too.
cover the old sleep/wakeup points for adding_hot_spare and waitForReconCond.
convert all remaining simple_lock's to kmutexes (they're not used or compiled
right now... even with all options enabled) and remove the support for them.
this leaves just a pair of tsleep()/wakeup() calls using old scheduling APIs.
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.
Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).
The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.
- add two new members to the component label:
u_int numBlocksHi
u_int partitionSizeHi
and store the top 32 bits of the real number of blocks and
partition size. modify rf_print_component_label(),
rf_does_it_fit(), rf_AutoConfigureDisks() and
rf_ReconstructFailedDiskBasic().
- call disk_blocksize() after disk_attach() [ from mlelstv ]
- shift the block number relative to DEV_BSHIFT in raidstart()
and InitBP() so that accesses work for non 512-byte devices.
[ from mlelstv ]
- update rf_getdisksize() to use the new getdisksize() [ from
mlelstv. this part needs a separate change for netbsd-5. ]
reviewed by: oster, christos and darrenr
Drastically reduces the amount of time spent rewriting parity after an
unclean shutdown by keeping better track of which regions might have had
outstanding writes. Enabled by default; can be disabled on a per-set
basis, or tuned, with the new raidctl(8) commands.
Discussed on tech-kern@ to a general air of approval; exhortations to
commit from mrg@, christos@, and others.
Thanks to Google for their sponsorship, oster@ for mentoring the
project, assorted developers for trying very hard to break it, and
probably more I'm forgetting.
we need to account for that. Failure to do so means we can end up
waiting forever for writes we think are outstanding, but which have
already completed.
Addresses the RAIDframe part of PR#40569. Thanks to Matthias Scheler
for reporting the issue and verifying the fix.
Reconmap used to have one pointer for every reconstruction unit. This
does not scale well in the land of 1TB disks, where some 100MB+ of
"status pointers" are required for typical configurations. Convert
the reconstruction code to use a "sliding status window" which will
scale nicely regardless of the number of stripes/reconstruction units
in the RAID set. Convert the main reconstruction loop to rebuild the
array in chunks rather than in one big lump.
As part of these changes, introduce a function to kick any waiters on
the head separation callback list, and use that in the main
reconstruction event queue to wake up the waiters if things have
stalled. (I believe this may fix a race condition that could occur at
at least at the very end of a disk during reconstruction under heavy
IO load.)
Thanks to Brian Buhrow for all his help, support, and patience in
testing these changes.
for that disk have stopped, since this will bump us out of the normal
reconstruction loop prematurely.
Fixes the (mostly cosmetic) bug where the reconstruction
status values stop updating, and from raidctl it appears that
reconstruction has totally stalled (which it actually hasn't -- the
reconstruction does complete properly, but not in the normal way).
needed to keep track of the kernel process that opened a device in
order to close it with the right credentials. Flash forward to today
where curlwp is now quite sufficient.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.
quick consensus on tech-kern
that tells whether the given path is in user space or kernel space, so it
can tell NDINIT().
While the raidframe calls were ok, both ccd(4) and cgd(4) were passing
pointers to user space data, which leads to strange error on i386, as
reported by Jukka Salmi on current-users.
The issue has been there since last august, I'm actually a bit surprised
that no one in the meantime has used ccd(4) or cgd(4) on an arch where it
would have simply faulted.
whatever reason), return 0 instead of the default
RF_RECON_READ_STOPPED. Returning RF_RECON_READ_STOPPED would result
in rf_ContinueReconstructFailedDisk() thinking that the given
component was "done" and breaking out of the main reconstruction loop
far too early. Reconstruction still worked correctly as long as there
were no errors, but RAIDframe wouldn't be in a position to properly
handle read/write errors during reconstruction.
This fixes the "raidctl's progress bar spins at 0% until
reconstruction finishes" problem.
determine if we are willing to wait for memory to come from the
diskqueuedata (dqd) and bufpool pools. Cleanup the mess related to
code calling rf_CreateDiskQueueData() with different expectations
(and/or blatent disregard) of what might happen if there were
insufficient pool resources.
that occurs during a reconstruction. We go from zero error handling
and likely panicing if something goes amiss, to gracefully bailing and
leaving the system in the best, usable state possible.
- introduce rf_DrainReconEventQueue() to allow easy cleaning of the
reconstruction event queue
- change how we cleanup the floating recon buffers in
rf_FreeReconControl(). Detect the end of the list rather
than traversing according to a count.
- keep track of the number of pending reconstruction writes. In the
event of a read error, use this to wait long enough for the pending
writes to (hopefully) drain.
- more cleanup is still needed on this code, but I didn't want to
start mixing major functional changes with minor cleanups.
XXX: There is a known issue with pool items left outstanding due to
the IO failure, and this can show up in the form of a panic at the
tail end of a shutdown. This problem is much less severe than before
these changes, and the hope/plan is that this problem will go away
once this code gets overhauled again.
used in non-simulation code, and thus is just wasting space (and
making the code more confusing to read!). Turf the switch, left-shift
the indentation of code, and nuke 'state' field of struct RF_RaidReconDesc_s.
No real functional changes.
such that we don't actually hold a simplelock while we are doing
a pool_get(), but that we still effectively protecting critical code.
This should fix all of the outstanding LOCKDEBUG warnings related to
rebuilding RAID sets.
rf_PrintUserStats() was mean for the simulator, and doesn't provide
any real info in kernel-space, especially for reconstructs.
Reconstructing actually renders the stats even more useless, since it
resets them all to zero before the reconstruct starts!
- since rf_PrintUserStats() is no longer used, nuke it along with the
routines that feed it. Nothing was using this code, and if we ever
need it again, we know where to find it.
by RAIDframe. Convert all other RAIDframe global pools to use pools
defined within this new structure.
- Introduce rf_pool_init(), used for initializing a single pool in
RAIDframe. Teach each of the configuration routines to use
rf_pool_init().
- Cleanup a few pool-related comments.
- Cleanup revent initialization and #defines.
- Add a missing pool_destroy() for the reconbuffer pool.
(Saves another 1K off of an i386 GENERIC kernel, and makes
stuff a lot more readable)
This removes 3 more RF_PANIC()'s (but we'll currently still panic if any of these cases occur).
fix up a few printf's.
XXX: still needs more cleanup and testing (and be taught to not panic).