determine if we are willing to wait for memory to come from the
diskqueuedata (dqd) and bufpool pools. Cleanup the mess related to
code calling rf_CreateDiskQueueData() with different expectations
(and/or blatent disregard) of what might happen if there were
insufficient pool resources.
that occurs during a reconstruction. We go from zero error handling
and likely panicing if something goes amiss, to gracefully bailing and
leaving the system in the best, usable state possible.
- introduce rf_DrainReconEventQueue() to allow easy cleaning of the
reconstruction event queue
- change how we cleanup the floating recon buffers in
rf_FreeReconControl(). Detect the end of the list rather
than traversing according to a count.
- keep track of the number of pending reconstruction writes. In the
event of a read error, use this to wait long enough for the pending
writes to (hopefully) drain.
- more cleanup is still needed on this code, but I didn't want to
start mixing major functional changes with minor cleanups.
XXX: There is a known issue with pool items left outstanding due to
the IO failure, and this can show up in the form of a panic at the
tail end of a shutdown. This problem is much less severe than before
these changes, and the hope/plan is that this problem will go away
once this code gets overhauled again.
desc->dagList is set to NULL before continuing. If we don't,
there's a danger that we'll try to re-free these items later.
(This should fix a panic reported to me via private communciation.)
used in non-simulation code, and thus is just wasting space (and
making the code more confusing to read!). Turf the switch, left-shift
the indentation of code, and nuke 'state' field of struct RF_RaidReconDesc_s.
No real functional changes.
render the RAID set completely dead. Instead, we retry the IO a
maximum of RF_RETRY_THRESHOLD times (currently '5'), and then just
return an IO error if the IO fails. This should reduce the damage
caused by having multiple disks appear to fail when the culprit is
really something else (power, controllers, etc.)
- If DIOCGDINFO failed with ENOTTY, don't print an error message; wedges
don't support that ioctl. Clean up the error message.
- If DIOCGDINFO fails, don't proceed to examine an invalid disklabel
structure.
As well, when we do detect some sort of an error, we should be doing a
biodone() here. Thanks to yamt for noting the missing biodone(), as
that led to discovery of the additional lossage.
1) Introduce functions to allocate and free the emergency IO buffers.
2) Make sure we free any allocated emergency buffers in the event that
we bail out during configuration, or when we unconfigure an array.
3) if we run out of memory trying to allocate a given type of buffer,
don't continue to try to allocate more of those buffers.
(Partially addresses PR#25787)
write paths within RAIDframe. They also resolve the "panics with
RAID 5 sets with more than 3 components" issue which was present
(briefly) in the commits which were previously supposed to address
the malloc() issue.
With this new code the 5-component RAID 5 set panics are now gone.
It is also now also possible to swap to RAID 5.
The changes made are:
1) Introduce rf_AllocStripeBuffer() and rf_FreeStripeBuffer() to
allocate/free one stripe's worth of space. rf_AllocStripeBuffer() is
used in rf_MapUnaccessedPortionOfStripe() where it is not sufficient to
allocate memory using just rf_AllocBuffer(). rf_FreeStripeBuffer() is
called from rf_FreeRaidAccDesc(), well after the DAG is finished.
2) Add a set of emergency "stripe buffers" to struct RF_Raid_s.
Arrange for their initialization in rf_Configure(). In low-memory
situations these buffers will be returned by rf_AllocStripeBuffer()
and re-populated by rf_FreeStripeBuffer().
3) Move RF_VoidPointerListElem_t *iobufs from the dagHeader into
into struct RF_RaidAccessDesc_s. This is more consistent with the
original code, and will not result in items being freed "too early".
4) Add a RF_RaidAccessDesc_t *desc to RF_DagHeader_s so that we have a
way to find desc->iobufs.
5) Arrange for desc in the DagHeader to be initialized in InitHdrNode().
6) Don't cleanup iobufs in rf_FreeDAG() -- the freeing is now delayed
until rf_FreeRaidAccDesc() (which is how the original code handled the
allocList, and for which there seem to be some subtle, undocumented
assumptions).
7) Rename rf_AllocBuffer2() to be rf_AllocBuffer() and remove the
former rf_AllocBuffer(). Fix all callers of rf_AllocBuffer().
(This was how it was *supposed* to be after the last time these
changes were made, before they were backed out).
8) Remove RF_IOBufHeader and all references to it.
9) Remove desc->cleanupList and all references to it.
Fixes PR#20191
RAID5 sets with more than 3 drives. Still need to figure out why
the original changes were losing, but need the version in tree reliable
first!
Huge THANKS to Juergen Hannken-Illjes for helping track down
the changes that were causing the lossage.
sufficient to clobber this nasty little bug. The behaviour observed
was a panic when doing a 'raidctl -f' on a component when DAGs were
in flight for the given RAID set. Unfortunatly, the faulty behaviour
was very intermittent, and it was difficult to not only reliably
reproduce the bug (nor determine when it was fixed!) but also to even
figure out what might be the cause of the problem.
The real issue was that ci_vp for the failed component was being
set to NULL in rf_FailDisk(), but with DAGs still in flight, some
of them were still expecting to use ci_vp to determine where to
read to/write from!
The fix is to call rf_SuspendNewRequestsAndWait() from rf_FailDisk()
to make sure the RAID set is quiet and all IOs have completed before
mucking with ci_vp and other data structures. rf_ResumeNewRequests()
is then used to continue on as usual.
If raidPtr->numFailures isn't initialized properly, then all sorts of
whacky things can happen, including incorrect DAGs being generated.
(Triggering this problem is a little esoteric, which is why this bug has
been in hiding for so long -- I only saw it after rebooting with a
degraded RAID 5 set that was autoconfigured, rebuilding the failed
componennt, and then failing the component while IO was happening to
the RAID set.)
rf_dagutils.h... missed this one from yesterday. sorry folks :( ]
Change signature of rf_AllocBuffer() to take a dag_h and buffer size
instead of an PDA and an alloclist. This lets us do the vple dance
inside of rf_AllocBuffer().
Cleanup usage of rf_AllocIOBuffer() and use rf_AllocBuffer() instead.
Fix all uses of rf_AllocBuffer() to conform to the new way of doing
things.
instead of an PDA and an alloclist. This lets us do the vple dance
inside of rf_AllocBuffer().
Cleanup usage of rf_AllocIOBuffer() and use rf_AllocBuffer() instead.
Fix all uses of rf_AllocBuffer() to conform to the new way of doing
things.
used in the event that we can't malloc a buffer of the appropriate
size in the traditional way. rf_AllocIOBuffer() and rf_FreeIOBuffer()
deal with allocating/freeing these structures. These buffers are
stored in a list on the 'iobuf' list. iobuf_count keeps track of how
many buffers are available, and numEmergencyBuffers is the effective
"high-water" mark for the freelist. The buffers allocated by
rf_AllocIOBuffer() are stripe-unit sized, which is the maximum
size requested by any of the callers.
Add an iobufs entry to RF_DagHeader_s. Use it for keeping track of
buffers that get allocated from the free-list.
Add a "generic list" pool (VoidPointerListElement Pool) for elements
used to maintain a list of allocated memory. [It is somewhat less
than ideal to add another little pool to handle this...]
Teach rf_AllocBuffer() to use the new rf_AllocIOBuffer(). Modify
other Mallocs to use rf_AllocIOBuffer(), and to update dag_h->iobufs as
appropriate.
Update rf_FreeDAG() to handle cleanup of dag_h->iobufs.
While here, add some missing pool_destroy() calls for a number of pools.
With these changes, it should (in theory) be possible to swap on
RAID 5 sets again. That said, I've not had any success there yet --
but the last issue I saw at least wasn't in RAIDframe. :-}
[There is room for this code to become a bit more consise, but I
wanted to do a checkpoint here with something known to work :) ]