that occurs during a reconstruction. We go from zero error handling
and likely panicing if something goes amiss, to gracefully bailing and
leaving the system in the best, usable state possible.
- introduce rf_DrainReconEventQueue() to allow easy cleaning of the
reconstruction event queue
- change how we cleanup the floating recon buffers in
rf_FreeReconControl(). Detect the end of the list rather
than traversing according to a count.
- keep track of the number of pending reconstruction writes. In the
event of a read error, use this to wait long enough for the pending
writes to (hopefully) drain.
- more cleanup is still needed on this code, but I didn't want to
start mixing major functional changes with minor cleanups.
XXX: There is a known issue with pool items left outstanding due to
the IO failure, and this can show up in the form of a panic at the
tail end of a shutdown. This problem is much less severe than before
these changes, and the hope/plan is that this problem will go away
once this code gets overhauled again.
desc->dagList is set to NULL before continuing. If we don't,
there's a danger that we'll try to re-free these items later.
(This should fix a panic reported to me via private communciation.)
used in non-simulation code, and thus is just wasting space (and
making the code more confusing to read!). Turf the switch, left-shift
the indentation of code, and nuke 'state' field of struct RF_RaidReconDesc_s.
No real functional changes.
render the RAID set completely dead. Instead, we retry the IO a
maximum of RF_RETRY_THRESHOLD times (currently '5'), and then just
return an IO error if the IO fails. This should reduce the damage
caused by having multiple disks appear to fail when the culprit is
really something else (power, controllers, etc.)
- If DIOCGDINFO failed with ENOTTY, don't print an error message; wedges
don't support that ioctl. Clean up the error message.
- If DIOCGDINFO fails, don't proceed to examine an invalid disklabel
structure.
As well, when we do detect some sort of an error, we should be doing a
biodone() here. Thanks to yamt for noting the missing biodone(), as
that led to discovery of the additional lossage.
1) Introduce functions to allocate and free the emergency IO buffers.
2) Make sure we free any allocated emergency buffers in the event that
we bail out during configuration, or when we unconfigure an array.
3) if we run out of memory trying to allocate a given type of buffer,
don't continue to try to allocate more of those buffers.
(Partially addresses PR#25787)
write paths within RAIDframe. They also resolve the "panics with
RAID 5 sets with more than 3 components" issue which was present
(briefly) in the commits which were previously supposed to address
the malloc() issue.
With this new code the 5-component RAID 5 set panics are now gone.
It is also now also possible to swap to RAID 5.
The changes made are:
1) Introduce rf_AllocStripeBuffer() and rf_FreeStripeBuffer() to
allocate/free one stripe's worth of space. rf_AllocStripeBuffer() is
used in rf_MapUnaccessedPortionOfStripe() where it is not sufficient to
allocate memory using just rf_AllocBuffer(). rf_FreeStripeBuffer() is
called from rf_FreeRaidAccDesc(), well after the DAG is finished.
2) Add a set of emergency "stripe buffers" to struct RF_Raid_s.
Arrange for their initialization in rf_Configure(). In low-memory
situations these buffers will be returned by rf_AllocStripeBuffer()
and re-populated by rf_FreeStripeBuffer().
3) Move RF_VoidPointerListElem_t *iobufs from the dagHeader into
into struct RF_RaidAccessDesc_s. This is more consistent with the
original code, and will not result in items being freed "too early".
4) Add a RF_RaidAccessDesc_t *desc to RF_DagHeader_s so that we have a
way to find desc->iobufs.
5) Arrange for desc in the DagHeader to be initialized in InitHdrNode().
6) Don't cleanup iobufs in rf_FreeDAG() -- the freeing is now delayed
until rf_FreeRaidAccDesc() (which is how the original code handled the
allocList, and for which there seem to be some subtle, undocumented
assumptions).
7) Rename rf_AllocBuffer2() to be rf_AllocBuffer() and remove the
former rf_AllocBuffer(). Fix all callers of rf_AllocBuffer().
(This was how it was *supposed* to be after the last time these
changes were made, before they were backed out).
8) Remove RF_IOBufHeader and all references to it.
9) Remove desc->cleanupList and all references to it.
Fixes PR#20191
RAID5 sets with more than 3 drives. Still need to figure out why
the original changes were losing, but need the version in tree reliable
first!
Huge THANKS to Juergen Hannken-Illjes for helping track down
the changes that were causing the lossage.
sufficient to clobber this nasty little bug. The behaviour observed
was a panic when doing a 'raidctl -f' on a component when DAGs were
in flight for the given RAID set. Unfortunatly, the faulty behaviour
was very intermittent, and it was difficult to not only reliably
reproduce the bug (nor determine when it was fixed!) but also to even
figure out what might be the cause of the problem.
The real issue was that ci_vp for the failed component was being
set to NULL in rf_FailDisk(), but with DAGs still in flight, some
of them were still expecting to use ci_vp to determine where to
read to/write from!
The fix is to call rf_SuspendNewRequestsAndWait() from rf_FailDisk()
to make sure the RAID set is quiet and all IOs have completed before
mucking with ci_vp and other data structures. rf_ResumeNewRequests()
is then used to continue on as usual.
If raidPtr->numFailures isn't initialized properly, then all sorts of
whacky things can happen, including incorrect DAGs being generated.
(Triggering this problem is a little esoteric, which is why this bug has
been in hiding for so long -- I only saw it after rebooting with a
degraded RAID 5 set that was autoconfigured, rebuilding the failed
componennt, and then failing the component while IO was happening to
the RAID set.)
rf_dagutils.h... missed this one from yesterday. sorry folks :( ]
Change signature of rf_AllocBuffer() to take a dag_h and buffer size
instead of an PDA and an alloclist. This lets us do the vple dance
inside of rf_AllocBuffer().
Cleanup usage of rf_AllocIOBuffer() and use rf_AllocBuffer() instead.
Fix all uses of rf_AllocBuffer() to conform to the new way of doing
things.
instead of an PDA and an alloclist. This lets us do the vple dance
inside of rf_AllocBuffer().
Cleanup usage of rf_AllocIOBuffer() and use rf_AllocBuffer() instead.
Fix all uses of rf_AllocBuffer() to conform to the new way of doing
things.
used in the event that we can't malloc a buffer of the appropriate
size in the traditional way. rf_AllocIOBuffer() and rf_FreeIOBuffer()
deal with allocating/freeing these structures. These buffers are
stored in a list on the 'iobuf' list. iobuf_count keeps track of how
many buffers are available, and numEmergencyBuffers is the effective
"high-water" mark for the freelist. The buffers allocated by
rf_AllocIOBuffer() are stripe-unit sized, which is the maximum
size requested by any of the callers.
Add an iobufs entry to RF_DagHeader_s. Use it for keeping track of
buffers that get allocated from the free-list.
Add a "generic list" pool (VoidPointerListElement Pool) for elements
used to maintain a list of allocated memory. [It is somewhat less
than ideal to add another little pool to handle this...]
Teach rf_AllocBuffer() to use the new rf_AllocIOBuffer(). Modify
other Mallocs to use rf_AllocIOBuffer(), and to update dag_h->iobufs as
appropriate.
Update rf_FreeDAG() to handle cleanup of dag_h->iobufs.
While here, add some missing pool_destroy() calls for a number of pools.
With these changes, it should (in theory) be possible to swap on
RAID 5 sets again. That said, I've not had any success there yet --
but the last issue I saw at least wasn't in RAIDframe. :-}
[There is room for this code to become a bit more consise, but I
wanted to do a checkpoint here with something known to work :) ]
for RF_DagNode_t's. Scale the structure size based on RF_MAXCOL.
Use the new allocation method in InitNode(). Note that we can't get
rid of the mallocs in there until we can prove that this new
allocation method is a strict upper bound. Unless someone tries
running a RAID set with 40 components, the mallocs here shouldn't
shouldn't be an issue. (and if someone does make a set with 40 components
they will run into other issues with other constants long before
then)
- Pull rf_FreePhysDiskAddr() out from under a #ifdef, since we're now
going to use it.
- Add a pda_cleanup_list into the DAG header. Use it in rf_FreeDAG() to
cleanup any PDA's that get allocated but have no "easy" way of being
located and freed when the DAG completes.
- numStripeUnitsAccessed is a per-stripe value, and has a maximum
value equal to the number of colums (thus limited by RF_MAXCOL).
Use this knowledge to set a high-bound on overlappingPDAs, and stuff
it on the stack instead of malloc'ing it all the time! This costs us
a whopping 40 bytes on the stack, but saves a malloc() and a free().
elements from the pools.
Re-work rf_SelectAlgorithm() to get rid of all the 8 malloc's, and to
use the new functions to get/put these 'support structures'. I'm not
overly happy with some of the variable names, but them's the breaks.
In the process of changing things, fix a bug:
- in the case where we can't create a dag, free asmh_b and blockFuncs
too!!
[if you were able to look at the source code related to these changes,
and comprehend what was going on without having your eyes bleed or
getting dizzy, please contact me... I'm sure I'll have more code
which would benefit by you having a look at it before I commit it :) ]
such that we don't actually hold a simplelock while we are doing
a pool_get(), but that we still effectively protecting critical code.
This should fix all of the outstanding LOCKDEBUG warnings related to
rebuilding RAID sets.
Provide rf_AllocDAGNode() and rf_FreeDAGNode() to handle
allocation/freeing.
- Introduce a "nodes" linked list of RF_DagNode_t's into the DAG header.
Initialize nodes in InitHdrNode(). Arrange for nodes cleanup in rf_FreeDAG().
- Add a "list_next" to RF_DagNode_t to keep track of nodes on the
above "nodes" list. (This is distinct from the "next" field of
RF_DagNode_t, which keeps track of the firing order of nodes.)
"list_next" gets used in the cleanup routines, and in traversing
through a set of nodes that belong to a particular set of nodes
(e.g. those belonging to xorNodes for a given DAG).
- use rf_AllocDAGNode() instead of mallocs of variable-sized arrays of
RF_DagNode_t's. Mostly mechanical changes to convert the DAG construction
from "access nodes via an array index" to "access nodes via a 'nextnode'
pointer".
- rework a couple of tricky spots where assumptions about the node order
was being abused.
- performance remains consistent with performance before these changes.
[Thanks to Simon Burge (simonb at you.know.where) for looking over
the mechanical changes to make sure I didn't biff anything.]
rf_PrintUserStats() was mean for the simulator, and doesn't provide
any real info in kernel-space, especially for reconstructs.
Reconstructing actually renders the stats even more useless, since it
resets them all to zero before the reconstruct starts!
- since rf_PrintUserStats() is no longer used, nuke it along with the
routines that feed it. Nothing was using this code, and if we ever
need it again, we know where to find it.
by RAIDframe. Convert all other RAIDframe global pools to use pools
defined within this new structure.
- Introduce rf_pool_init(), used for initializing a single pool in
RAIDframe. Teach each of the configuration routines to use
rf_pool_init().
- Cleanup a few pool-related comments.
- Cleanup revent initialization and #defines.
- Add a missing pool_destroy() for the reconbuffer pool.
(Saves another 1K off of an i386 GENERIC kernel, and makes
stuff a lot more readable)
by RAIDframe. Convert all other RAIDframe global pools to use pools
defined within this new structure.
- Introduce rf_pool_init(), used for initializing a single pool in
RAIDframe. Teach each of the configuration routines to use
rf_pool_init().
- Cleanup a few pool-related comments.
- Cleanup revent initialization and #defines.
- Add a missing pool_destroy() for the reconbuffer pool.
(Saves another 1K off of an i386 GENERIC kernel, and makes
stuff a lot more readable)
rf_AllocBuffer() is available, so use it to get buffer space instead
of the previous RF_Malloc() bits. Saves a few bytes, but more
importantly makes the code much more readable.
- introduce RF_MIN_*'s, as necessary. These will indicate the
low-water mark for pools as well as the pool_prime() value.
- add pool_setlowat() for the critical pools.
- pool_prime() and pool_setlowat() the raidframe_cbufpool.
- re-order some pool_prime()'s and pool_sethiwat()'s for clarity.
This removes 3 more RF_PANIC()'s (but we'll currently still panic if any of these cases occur).
fix up a few printf's.
XXX: still needs more cleanup and testing (and be taught to not panic).
- remove callbackArg2 from RF_CallbackDesc_s -- it is only ever set,
never read.
- now that this is done, all callbacks should only take a single argument,
and we can simplify things further.
- change function signature of rf_LookupRUStatus(). The last argument
is now a pointer to a new PSS, in case one is needed. Rather than
having rf_LookupRUStatus() allocate a new PSS, we pre-allocate one
beforehand, where necessary, just in case.
- change callers of rf_lookupRUStatus() to deal with the new way of
calling rf_lookupRUStatus().
[no improvement or worsening of parity rebuild/initialization performance.]
[For the record: The mcpair mutex is being used to protect mcpair->flag.
mcpair gets allocated before each call to rf_DispatchDAG(), so there is no
other process/thread that could be mucking with it. It is only used to
detect the completion of a given parity unit, and rf_DispatchDAG()
only uses it to setup the callback argument for rf_MCPairWakeupFunc()
which will be called when the IO completes. The code after the call
to rf_DispatchDAG() sits and waits for a 'wakeup' on mcpair->cond
(rf_MCPairWakeupFunc() does that). If mcpair->flag is 0 when
rf_DispatchDAG() completes, then rf_MCPairWakeupFunc() hasn't been
called yet (the IO hasn't completed). If it is 1, then the IO is
already done, and we continue on our merry way without sleeping.
Thus, we don't need to hold any lock on mcpair while calling
rf_DispatchDAG().]
memory. Since we only now ever "return(0)", just return (void)
instead.
Cleanup all uses of rf_ShutdownCreate() to not worry about
it ever failing. Shaves another 600 bytes off of an i386 GENERIC kernel.
dynamically allocated variable-sized array (dagArray). Convert code
to use the new linked list stuff instead of the array stuff (the ratio
of one dagList per stripe still applies). The big advantage is in
being able to more efficiently allocate the dagLists on-the-fly, and
not have to know the size(s) of the array beforehand.
VOP_STRATEGY(bp) is replaced by one of two new functions:
- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.
DEV_STRATEGY(bp) is used only for block-to-block device situations.
give it back if we don't need it. If we don't allocate it before
we take our lock, LOCKDEBUG (rightfully) complains that we're trying
to grab something from the pool with PR_WAITOK. This code (and the
PR_WAITOK in particular) really needs to be revisited at some point.
skrueger-at-europe-dot-com. (It turns out that the mutex used to
serve two different purposes, not just one, and for its current use,
it's actually miss-named. Will fix that some other time.)
Collapse the related variables down to zero. That means 'flags' is 0
as well. Nuke the extraction macros, a bunch of the variables, and replace
'flags' as well.
The compiler already knew that these chunks of code
could never be reached (since lu_flag was always 0), so it
already ignored them.
No functional changes.
rf_enableAtomicRMW changes.]
Cleanup rf_enableAtomicRMW and its use. According to the comments, we
can't set this to anything other than zero anyway. Shaves off another
900 bytes. lu_flag's days are numbered now, as are the middle
parameters of RF_CREATE_PARAM3.
can't set this to anything other than zero anyway. Shaves off another
900 bytes. lu_flag's days are numbered now, as are the middle
parameters of RF_CREATE_PARAM3.
debugging printf, and in rf_netbsdkintf.c. We can do the calculations
inside of RF_DEBUG_RECON for the one debugging printf, and only
perform the percentCompleted calculation "on demand" in the
rf_netbsdkintf.c case. Shaves a few more bytes off an i386 GENERIC
kernel, and ever-so-slightly decreases the amount of work performed
during a reconstruct.
rf_DecrAccessesCountState wasn't in the correct spot in
RF_AccessState_e. Following up on that has resulted in one other
correction. Changing orderings of these states is tricky, and
shouldn't be attempted without some thorough analysis. For the
changes committed, the following analysis is offerred:
1) RAIDframe uses a little state machine to take care of building,
executing, and processing the DAGs used to direct IO.
2) The rf_DecrAccessesCountState state is handled by the function
rf_State_DecrAccessCount(). The purpose of this state is to
decrement the number of "accesses-in-flight".
3) rf_Cleanup_State is handled by rf_State_Cleanup(). Its job is to
do general cleanup of DAG arrays and any stripe locks.
4) DefaultStates[] in rf_layout.c indicates that the right spot
for rf_DecrAccessesCountState is just before rf_Cleanup_State.
Analysis of code for both states indicates that the order doesn't
matter too much, although rf_State_DecrAccessCount() should probably
take place *after* rf_State_Cleanup() to be more correct.
5) Comments in rf_State_ProcessDAG() indicates that the next state
should be rf_Cleanup_State. However: it attempts to get there by using
desc->state++;
which actually takes it to just rf_DecrAccessesCountState! This turned
out to be OK before, since rf_Cleanup_State would follow right after,
and all would be taken careof (albeit in arguably the "less correct"
order).
6) With the current ordering, if we head directly to rf_Cleanup_State
(as we do, for example, if multiple components fail in a RAID 5 set),
then we'll actually miss going trough rf_DecrAccessesCountState), and
could end up never being able to reach quiescence! Perhaps not too
big of a deal, given that the RAID set is pretty much toast by that
point at which such a drastic state change happens, but might as well
have this correct.
The changes made are:
1) Since having rf_State_DecrAccessCount() come after
rf_State_Cleanup() is just fine, change rf_layout.c to reflect that
rf_DecrAccessesCountState comes after rf_Cleanup_State (i.e. they swap
positions in the state list). This means that going to
rf_Cleanup_State after bailing on a failed DAG access will do all the
right things -- the state will get cleaned up, and then the access
counts will get decremented properly. The comment in
rf_State_ProcessDAG() is now actually correct -- the next state *will*
be rf_Cleanup_State.
2) Move rf_DecrAccessesCountState in RF_AccessState_e to just after
rf_CleanupState. This puts RF_AccessState_e in sync with
DefaultStates[]. Fortunately, these states are rarely referred to by
name, and so this change ends up being mostly cosmetic -- it really
only fixes cleanup behaviour for the recent "Failed to create a DAG"
changes.
~forever. This requires a number of things:
1) If we can't create a DAG, set desc->numStripes to 0 in
rf_SelectAlgorithm. This will ensure that we don't attempt to free
any dagArray[] elements in rf_StateCleanup.
2) Modify rf_State_CreateDAG() to not panic in the event of a DAG
failure. Instead, set the bp->b_flags and bp->b_error, and set things
up to skip to rf_State_Cleanup().
3) Need to mark desc->status as "bad" so that we actually stop looking
for a different DAG. (which we won't find... no matter how many times
we try).
4) rf_State_LastState() will then do the biodone(), and return EIO for
the IO in question.
5) Remove some " || 1 "'s from ProcessNode(). These were for
debugging, and we don't need the failure notices spewing
over and over again as the failing DAGs are processed.
6) Needed to change
if (asmap->numDataFailed + asmap->numParityFailed > 1)
to
if ((asmap->numDataFailed + asmap->numParityFailed > 1) ||
(raidPtr->numFailures > 1)){
in rf_raid5.c so that it doesn't try to return
rf_CreateNonRedundantWriteDAG as the creation function.
7) Note that we can't apply the above change to the RAID 1 code as
with the silly "fake 2-D" RAID 1 sets, it is possible to have 2 failed
components in the RAID 1 set, and that would stop them from working.
(I really don't know why/how those "fake 2-D" RAID 1 sets even work
with all the "single-fault" assumptions present in the rest of the
code.)
8) Needed to protect rf_RAID0DagSelect() in a similar way -- it should
return NULL as the createFunc.
9) No point printing out "Multiple disks failed..." a zillion times.
RF_DAG_RETURN_DAG
RF_DAG_RETURN_ASM
RF_DAG_TEST_ACCESS
and the code that goes with them. A couple more of these
can probably go too, but I might need them in a bit.
bp->b_proc for mapping userspace buffers to kernelspace in the
original rf_kintf.c. That means bp isn't of any use in RF_BZERO()
for us, and the macro can be replaced with just the memset().
No functional changes.
was just an accident in the first place. Cleanup function decls and
a few comments. [ok.. so I wasn't going to fix this many.. but once
you're on a roll....]
datastructures allowed! Punt.
accessTraceBufCount, rf_accessTraceBufSize, and
rf_stopCollectingTraces are similarly declared, initialized, and then
never changed. Punt.
rf_ShutdownAccessTrace() now does nothing. Remove it, and the
callback setup stuff from rf_ConfigureAccessTrace().
can't fail. Simplify life in rf_BootRaidframe(), and then nuke
rf_lkmgr_mutex_init(). Cleanup rf_threadstuff.h a bit more too.
rf_threadstuff.c is about to Go Away.
(other than NULL when raidPtr is initialized). That means
SignalReconDone() never does anything useful. Bye-bye!
Say good-bye to recon_done_procs and recon_done_procs_mutex (and its
initializer) as well.
Mash DO_RAID_COND in rf_driver.c out of existance.
- Nuke (already #if 0'ed) _rf_create_managed_lkmgr_mutex() while we're
busy here.
simplify DO_INIT in rf_engine.c
here" department.
remove _rf_init_threadgroup() and rf_destroy_threadgroup() which were
already #if 0'ed.
rf_cond_destroy() does nothing. Nuke it, and all callers.
rf_cond_init() doesn't deserve to be a separate function any more.
Fix up the remaining 3 callers, and nuke rf_cond_init().
Another 0.4K goes "poof", but still no functionality lost!
rf_mutex_init(m)
now. The rest of the fluff is no longer needed.
It also cannot fail, so error checking on rf_create_managed_mutex()
is just wasting space.
Nuke the #define's associated with rf_create_managed_mutex().
Convert rf_create_managed_mutex(listp,m) to just rf_mutex_init(m).
Remove wasteful "error checking" and simplify all instances where this
is called. (another 0.3K saved in the binary, but the real savings
is in code readability!)
neither of these ever fail, no need to have a return value. That
makes all the "error detection" on these functions completely
unneeded. But since we're here, if we don't have a return value, then
why not make these macros? My.. look how things keep shrinking, with
no loss in functionality!
clean_rad() were -- these days they only serve to clutter things up.
Remove the functions, and put the 2 lines of actual useful initialization
into rf_AllocMCPair().
to make things look far more complicated than they really are. It was
also impossible for any of the mutex/cond initializations in
init_rad() to actually fail, making the "error detection code"
unneeded. Collapse the little work done by init_rad into
rf_AllocRaidAccDesc(), and nuke init_rad() and clean_rad(). Save
another 0.25K in GENERIC.
[To be accurate/complete, init_rad() and clean_rad() *ARE* used in the
simulator version of RAIDframe. But we're so far removed from that
now that there is no point pretending otherwise.]
- all freelists converted to pools
- initialization of structure members in certain cases where
code was relying on specific allocation and usage properties
to keep structures in a "known state" (that doesn't work with
pools!).
- make most pool_get() be "PR_WAITOK" until they can be analyzed
further, and/or have proper error handling added.
- all RF_Mallocs zero the space returned, so there is no difference
between RF_Calloc and RF_Malloc. In fact, all the RF_Calloc()'s
do is tend to do is get things horribly confused.
Make RF_Malloc() the "general memory allocator", with
RF_MallocAndAdd() the "general memory allocator with
allocation list".
- some of these RF_Malloc's et al. are destined to disappear.
- remove rf_rdp_freelist entirely (it's not used anywhere!)
- remove: #include "rf_freelist.h"
- to the files that were relying on the above, add: #include "rf_general.h"
- add: #include "rf_debugMem.h" to rf_shutdown.h to make it happy
about the loss of: #include "rf_freelist.h".
This shrinks an i386 GENERIC kernel by approx 5K. RAIDframe now
weighs in at about 162K on i386.
of strenuous agreement, and some general agreement, this commit is
going ahead because it's now starting to block some other changes I
wish to make.]
Remove most of the support for the concept of "rows" from RAIDframe.
While the "row" interface has been exported to the world, RAIDframe
internals have really only supported a single row, even though they
have feigned support of multiple rows.
Nothing changes in configuration land -- config files still need to
specify a single row, etc. All auto-config structures remain fully
forward/backwards compatible.
The only visible difference to the average user should be a
reduction in the size of a GENERIC kernel (i386) by 4.5K. For those
of us trolling through RAIDframe kernel code, a lot of the driver
configuration code has become a LOT easier to read.