.\" $NetBSD: raidctl.8,v 1.28 2002/01/21 11:40:20 wiz Exp $
.\"
.\" Copyright (c) 1998 The NetBSD Foundation, Inc.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to The NetBSD Foundation
.\" by Greg Oster
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the NetBSD
.\" Foundation, Inc. and its contributors.
.\" 4. Neither the name of The NetBSD Foundation nor the names of its
.\" contributors may be used to endorse or promote products derived
.\" from this software without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.\"
.\" Copyright (c) 1995 Carnegie-Mellon University.
.\" All rights reserved.
.\"
.\" Author: Mark Holland
.\"
.\" Permission to use, copy, modify and distribute this software and
.\" its documentation is hereby granted, provided that both the copyright
.\" notice and this permission notice appear in all copies of the
.\" software, derivative works or modified versions, and any portions
.\" thereof, and that both notices appear in supporting documentation.
.\"
.\" CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
.\" CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
.\" FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
.\"
.\" Carnegie Mellon requests users of this software to return to
.\"
.\" Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
.\" School of Computer Science
.\" Carnegie Mellon University
.\" Pittsburgh PA 15213-3890
.\"
.\" any improvements or extensions that they make and grant Carnegie the
.\" rights to redistribute these changes.
.\"
.Dd July 10, 2001
.Dt RAIDCTL 8
.Os
.Sh NAME
.Nm raidctl
.Nd configuration utility for the RAIDframe disk driver
.Sh SYNOPSIS
.Nm ""
.Op Fl v
.Fl a Ar component Ar dev
.Nm ""
.Op Fl v
.Fl A Op yes | no | root
.Ar dev
.Nm ""
.Op Fl v
.Fl B Ar dev
.Nm ""
.Op Fl v
.Fl c Ar config_file Ar dev
.Nm ""
.Op Fl v
.Fl C Ar config_file Ar dev
.Nm ""
.Op Fl v
.Fl f Ar component Ar dev
.Nm ""
.Op Fl v
.Fl F Ar component Ar dev
.Nm ""
.Op Fl v
.Fl g Ar component Ar dev
.Nm ""
.Op Fl v
.Fl G Ar dev
.Nm ""
.Op Fl v
.Fl i Ar dev
.Nm ""
.Op Fl v
.Fl I Ar serial_number Ar dev
.Nm ""
.Op Fl v
.Fl p Ar dev
.Nm ""
.Op Fl v
.Fl P Ar dev
.Nm ""
.Op Fl v
.Fl r Ar component Ar dev
.Nm ""
.Op Fl v
.Fl R Ar component Ar dev
.Nm ""
.Op Fl v
.Fl s Ar dev
.Nm ""
.Op Fl v
.Fl S Ar dev
.Nm ""
.Op Fl v
.Fl u Ar dev
.Sh DESCRIPTION
.Nm ""
is the user-land control program for
.Xr raid 4 ,
the RAIDframe disk device.
.Nm ""
is primarily used to dynamically configure and unconfigure RAIDframe disk
devices. For more information about the RAIDframe disk device, see
.Xr raid 4 .
.Pp
This document assumes the reader has at least rudimentary knowledge of
RAID and RAID concepts.
.Pp
The command-line options for
.Nm
are as follows:
.Bl -tag -width indent
.It Fl a Ar component Ar dev
Add
.Ar component
as a hot spare for the device
.Ar dev .
.It Fl A Ic yes Ar dev
Make the RAID set auto-configurable. The RAID set will be
automatically configured at boot
.Ar before
the root file system is
mounted. Note that all components of the set must be of type RAID in the
disklabel.
.It Fl A Ic no Ar dev
Turn off auto-configuration for the RAID set.
.It Fl A Ic root Ar dev
Make the RAID set auto-configurable, and also mark the set as being
eligible to be the root partition. A RAID set configured this way
will
.Ar override
the use of the boot disk as the root device. All components of the
set must be of type RAID in the disklabel. Note that the kernel being
booted must currently reside on a non-RAID set.
.It Fl B Ar dev
Initiate a copyback of reconstructed data from a spare disk to
its original disk. This is performed after a component has failed,
and the failed drive has been reconstructed onto a spare drive.
.It Fl c Ar config_file Ar dev
Configure the RAIDframe device
.Ar dev
according to the configuration given in
.Ar config_file .
A description of the contents of
.Ar config_file
is given later.
.It Fl C Ar config_file Ar dev
As for
.Fl c ,
but forces the configuration to take place. This is required the
first time a RAID set is configured.
.It Fl f Ar component Ar dev
This marks the specified
.Ar component
as having failed, but does not initiate a reconstruction of that
component.
.It Fl F Ar component Ar dev
Fails the specified
.Ar component
of the device, and immediately begins a reconstruction of the failed
disk onto an available hot spare. This is one of the mechanisms used to start
the reconstruction process if a component does have a hardware failure.
.It Fl g Ar component Ar dev
Get the component label for the specified component.
.It Fl G Ar dev
Generate the configuration of the RAIDframe device in a format suitable for
use with
.Nm
.Fl c
or
.Fl C .
.It Fl i Ar dev
Initialize the RAID device. In particular, initialize (re-write) the parity on
the selected device. This
.Ar MUST
be done for
.Ar all
RAID sets before the RAID device is labeled and before
file systems are created on the RAID device.
.It Fl I Ar serial_number Ar dev
Initialize the component labels on each component of the device.
.Ar serial_number
is used as one of the keys in determining whether a
particular set of components belong to the same RAID set. While not
strictly enforced, different serial numbers should be used for
different RAID sets. This step
.Ar MUST
be performed when a new RAID set is created.
.It Fl p Ar dev
Check the status of the parity on the RAID set. Displays a status
message, and returns successfully if the parity is up-to-date.
.It Fl P Ar dev
Check the status of the parity on the RAID set, and initialize
(re-write) the parity if the parity is not known to be up-to-date.
This is normally used after a system crash (and before a
.Xr fsck 8 )
to ensure the integrity of the parity.
.It Fl r Ar component Ar dev
Remove the spare disk specified by
.Ar component
from the set of available spare components.
.It Fl R Ar component Ar dev
Fails the specified
.Ar component ,
if necessary, and immediately begins a reconstruction back to
.Ar component .
This is useful for reconstructing back onto a component after
it has been replaced following a failure.
.It Fl s Ar dev
Display the status of the RAIDframe device for each of the components
and spares.
.It Fl S Ar dev
Check the status of parity re-writing, component reconstruction, and
component copyback. The output indicates the amount of progress
achieved in each of these areas.
.It Fl u Ar dev
Unconfigure the RAIDframe device.
.It Fl v
Be more verbose. For operations such as reconstructions, parity
re-writing, and copybacks, provide a progress indicator.
.El
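.Pp
As a hedged illustration of
.Fl G ,
the sketch below saves the configuration of a running set to a file
so that it can later be passed to
.Fl c .
The helper name
.Sq save_raid_config
and the output file name are hypothetical, and the sketch assumes that
the generated configuration is written to standard output:
.Bd -literal -offset indent
# save_raid_config is a hypothetical helper, not part of raidctl.
# $1 = RAID device, $2 = file to receive the configuration
save_raid_config() {
    raidctl -G "$1" > "$2"
}
.Ed
.Pp
For example,
.Sq save_raid_config raid0 raid0.conf.saved
would write the current configuration of raid0 to raid0.conf.saved.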
.Pp
The device used by
.Nm
is specified by
.Ar dev .
.Ar dev
may be either the full name of the device (e.g. /dev/rraid0d
on the i386 architecture, or /dev/rraid0c on all others),
or simply raid0 (for /dev/rraid0d or /dev/rraid0c, as appropriate).
.Ss Configuration file
The format of the configuration file is complex, and
only an abbreviated treatment is given here. In the configuration
files, a
.Sq #
indicates the beginning of a comment.
.Pp
There are 4 required sections of a configuration file, and 2
optional sections. Each section begins with a
.Sq START ,
followed by
the section name, and the configuration parameters associated with that
section. The first section is the
.Sq array
section, and it specifies
the number of rows, columns, and spare disks in the RAID set. For
example:
.Bd -literal -offset indent
START array
1 3 0
.Ed
.Pp
indicates an array with 1 row, 3 columns, and 0 spare disks. Note
that although multi-dimensional arrays may be specified, they are
.Ar NOT
supported in the driver.
.Pp
The second section, the
.Sq disks
section, specifies the actual
components of the device. For example:
.Bd -literal -offset indent
START disks
/dev/sd0e
/dev/sd1e
/dev/sd2e
.Ed
.Pp
specifies the three component disks to be used in the RAID device. If
any of the specified drives cannot be found when the RAID device is
configured, then they will be marked as
.Sq failed ,
and the system will
operate in degraded mode. Note that it is
.Ar imperative
that the order of the components in the configuration file does not
change between configurations of a RAID device. Changing the order
of the components will result in data loss if the set is configured
with the
.Fl C
option. In normal circumstances, the RAID set will not configure if
only
.Fl c
is specified, and the components are out-of-order.
.Pp
The next section, which is the
.Sq spare
section, is optional, and, if
present, specifies the devices to be used as
.Sq hot spares
-- devices
which are on-line, but are not actively used by the RAID driver unless
one of the main components fails. A simple
.Sq spare
section might be:
.Bd -literal -offset indent
START spare
/dev/sd3e
.Ed
.Pp
for a configuration with a single spare component. If no spare drives
are to be used in the configuration, then the
.Sq spare
section may be omitted.
.Pp
The next section is the
.Sq layout
section. This section describes the
general layout parameters for the RAID device, and provides such
information as sectors per stripe unit, stripe units per parity unit,
stripe units per reconstruction unit, and the parity configuration to
use. This section might look like:
.Bd -literal -offset indent
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5
.Ed
.Pp
The sectors per stripe unit specifies, in blocks, the interleave
factor; i.e. the number of contiguous sectors to be written to each
component for a single stripe. Appropriate selection of this value
(32 in this example) is the subject of much research in RAID
architectures. The stripe units per parity unit and
stripe units per reconstruction unit are normally each set to 1.
While certain values above 1 are permitted, a discussion of valid
values and the consequences of using anything other than 1 is outside
the scope of this document. The last value in this section (5 in this
example) indicates the parity configuration desired. Valid entries
include:
.Bl -tag -width inde
.It 0
RAID level 0. No parity, only simple striping.
.It 1
RAID level 1. Mirroring. The parity is the mirror.
.It 4
RAID level 4. Striping across components, with parity stored on the
last component.
.It 5
RAID level 5. Striping across components, parity distributed across
all components.
.El
.Pp
There are other valid entries here, including those for Even-Odd
parity, RAID level 5 with rotated sparing, Chained declustering,
and Interleaved declustering, but as of this writing the code for
those parity operations has not been tested with
.Nx .
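.Pp
As a rough worked example of the stripe unit arithmetic, the layout
below repeats the values used above with the sizes spelled out in
comments.
It assumes 512-byte sectors (the blocksize shown in the status output
later in this document) and a 3-column RAID 5 set; other sector sizes
or column counts change the figures accordingly.
.Bd -literal -offset indent
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
# 32 sectors x 512 bytes = 16384 bytes (16 KB) per stripe unit.
# With 3 columns at RAID level 5, each full stripe holds 2 data
# stripe units and 1 parity unit: 2 x 16 KB = 32 KB of data.
32 1 1 5
.Ed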
.Pp
The next required section is the
.Sq queue
section. This is most often
specified as:
.Bd -literal -offset indent
START queue
fifo 100
.Ed
.Pp
where the queuing method is specified as fifo (first-in, first-out),
and the size of the per-component queue is limited to 100 requests.
Other queuing methods may also be specified, but a discussion of them
is beyond the scope of this document.
.Pp
The final section, the
.Sq debug
section, is optional. For more details
on this the reader is referred to the RAIDframe documentation
discussed in the
.Sx HISTORY
section.
.Pp
See
.Sx EXAMPLES
for a more complete configuration file example.
.Sh FILES
.Bl -tag -width /dev/XXrXraidX -compact
.It Pa /dev/{,r}raid*
.Cm raid
device special files.
.El
.Sh EXAMPLES
It is highly recommended that, before using the RAID driver for real
file systems, the system administrator(s) become quite familiar
with the use of
.Nm "" ,
and that they understand how the component reconstruction process
works. The examples in this section will focus on configuring a
number of different RAID sets of varying degrees of redundancy.
By working through these examples, administrators should be able to
develop a good feel for how to configure a RAID set, and how to
initiate reconstruction of failed components.
.Pp
In the following examples
.Sq raid0
will be used to denote the RAID device. Depending on the
architecture,
.Sq /dev/rraid0c
or
.Sq /dev/rraid0d
may be used in place of
.Sq raid0 .
.Ss Initialization and Configuration
The initial step in configuring a RAID set is to identify the components
that will be used in the RAID set. All components should be the same
size. Each component should have a disklabel type of
.Dv FS_RAID ,
and a typical disklabel entry for a RAID component
might look like:
.Bd -literal -offset indent
f: 1800000 200495 RAID # (Cyl. 405*- 4041*)
.Ed
.Pp
While
.Dv FS_BSDFFS
will also work as the component type, the type
.Dv FS_RAID
is preferred for RAIDframe use, as it is required for features such as
auto-configuration. As part of the initial configuration of each RAID
set, each component will be given a
.Sq component label .
A
.Sq component label
contains important information about the component, including a
user-specified serial number, the row and column of that component in
the RAID set, the redundancy level of the RAID set, a 'modification
counter', and whether the parity information (if any) on that
component is known to be correct. Component labels are an integral
part of the RAID set, since they are used to ensure that components
are configured in the correct order, and used to keep track of other
vital information about the RAID set. Component labels are also
required for the auto-detection and auto-configuration of RAID sets at
boot time. For a component label to be considered valid, that
particular component label must be in agreement with the other
component labels in the set. For example, the serial number,
.Sq modification counter ,
number of rows and number of columns must all
be in agreement. If any of these are different, then the component is
not considered to be part of the set. See
.Xr raid 4
for more information about component labels.
.Pp
Once the components have been identified, and the disks have
appropriate labels,
.Nm ""
is then used to configure the
.Xr raid 4
device. To configure the device, a configuration
file which looks something like:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 3 1
START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e
START spare
/dev/sd4e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
32 1 1 5
START queue
fifo 100
.Ed
.Pp
is created. The above configuration file specifies a RAID 5
set consisting of the components /dev/sd1e, /dev/sd2e, and /dev/sd3e,
with /dev/sd4e available as a
.Sq hot spare
in case one of
the three main drives should fail. A RAID 0 set would be specified in
a similar way:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 4 0
START disks
/dev/sd10e
/dev/sd11e
/dev/sd12e
/dev/sd13e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
64 1 1 0
START queue
fifo 100
.Ed
.Pp
In this case, devices /dev/sd10e, /dev/sd11e, /dev/sd12e, and /dev/sd13e
are the components that make up this RAID set. Note that there are no
hot spares for a RAID 0 set, since there is no way to recover data if
any of the components fail.
.Pp
For a RAID 1 (mirror) set, the following configuration might be used:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0
START disks
/dev/sd20e
/dev/sd21e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1
START queue
fifo 100
.Ed
.Pp
In this case, /dev/sd20e and /dev/sd21e are the two components of the
mirror set. While no hot spares have been specified in this
configuration, they easily could be, just as they were specified in
the RAID 5 case above. Note as well that RAID 1 sets are currently
limited to only 2 components. At present, n-way mirroring is not
possible.
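.Pp
For illustration, a variant of the mirror configuration above with one
hot spare might look like the following sketch.
The spare device /dev/sd22e is a hypothetical name chosen for this
example; only the spare count in the
.Sq array
section and the added
.Sq spare
section differ from the previous file.
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 1
START disks
/dev/sd20e
/dev/sd21e
START spare
# /dev/sd22e is a hypothetical spare device
/dev/sd22e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1
START queue
fifo 100
.Ed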
.Pp
The first time a RAID set is configured, the
.Fl C
option must be used:
.Bd -literal -offset indent
raidctl -C raid0.conf raid0
.Ed
.Pp
where
.Sq raid0.conf
is the name of the RAID configuration file. The
.Fl C
forces the configuration to succeed, even if any of the component
labels are incorrect. The
.Fl C
option should not be used lightly in
situations other than initial configurations: if
the system is refusing to configure a RAID set, there is probably a
very good reason for it. After the initial configuration is done (and
appropriate component labels are added with the
.Fl I
option) then raid0 can be configured normally with:
.Bd -literal -offset indent
raidctl -c raid0.conf raid0
.Ed
.Pp
When the RAID set is configured for the first time, it is
necessary to initialize the component labels, and to initialize the
parity on the RAID set. Initializing the component labels is done with:
.Bd -literal -offset indent
raidctl -I 112341 raid0
.Ed
.Pp
where
.Sq 112341
is a user-specified serial number for the RAID set. This
initialization step is
.Ar required
for all RAID sets. As well, using different
serial numbers between RAID sets is
.Ar strongly encouraged ,
as using the same serial number for all RAID sets will only serve to
decrease the usefulness of the component label checking.
.Pp
Initializing the RAID set is done via the
.Fl i
option. This initialization
.Ar MUST
be done for
.Ar all
RAID sets, since among other things it verifies that the parity (if
any) on the RAID set is correct. Since this initialization may be
quite time-consuming, the
.Fl v
option may also be used in conjunction with
.Fl i :
.Bd -literal -offset indent
raidctl -iv raid0
.Ed
.Pp
This will give more verbose output on the
status of the initialization:
.Bd -literal -offset indent
Initiating re-write of parity
Parity Re-write status:
10% |**** | ETA: 06:03 /
.Ed
.Pp
The output provides a
.Sq Percent Complete
in both a numeric and graphical format, as well as an estimated time
to completion of the operation.
.Pp
Since it is the parity that provides the
.Sq redundancy
part of RAID, it is critical that the parity be correct
as much of the time as possible. If the parity is not correct, then there is no
guarantee that data will not be lost if a component fails.
.Pp
Once the parity is known to be correct,
it is then safe to perform
.Xr disklabel 8 ,
.Xr newfs 8 ,
or
.Xr fsck 8
on the device or its file systems, and then to mount the file systems
for use.
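.Pp
The first-time steps described above can be collected into one
sequence.
The following is only a sketch: the function name
.Sq raid_first_time_setup
is hypothetical, and each step is attempted only if the previous one
succeeded.
.Bd -literal -offset indent
# Hypothetical helper for the initial setup of a new RAID set.
# $1 = configuration file, $2 = serial number, $3 = RAID device
raid_first_time_setup() {
    raidctl -C "$1" "$3" &&    # force the initial configuration
    raidctl -I "$2" "$3" &&    # initialize the component labels
    raidctl -iv "$3"           # initialize parity, with progress
}
.Ed
.Pp
Invoked as
.Sq raid_first_time_setup raid0.conf 112341 raid0 ,
this runs the same three commands shown earlier in this section.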
.Pp
Under certain circumstances (e.g. the additional component has not
arrived, or data is being migrated off of a disk destined to become a
component) it may be desirable to configure a RAID 1 set with only
a single component. This can be achieved by configuring the set with
a physically existing component (as either the first or second
component) and with a
.Sq fake
component. In the following:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0
START disks
/dev/sd6e
/dev/sd0e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1
START queue
fifo 100
.Ed
.Pp
/dev/sd0e is the real component, and will be the second disk of a RAID 1
set. The component /dev/sd6e, which must exist, but have no physical
device associated with it, is simply used as a placeholder.
Configuration (using
.Fl C
and
.Fl I Ar 12345
as above) proceeds normally, but initialization of the RAID set will
have to wait until all physical components are present. After
configuration, this set can be used normally, but will be operating
in degraded mode. Once a second physical component is obtained, it
can be hot-added, the existing data mirrored, and normal operation
resumed.
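.Pp
One possible way to carry out that last step is sketched below, using
the hot spare mechanism described elsewhere in this document.
The helper name and the new disk /dev/sd1e are hypothetical, and the
sketch assumes that
.Fl s
reports the placeholder under the name /dev/sd6e; check the status
output for the actual name of the failed component before using it.
.Bd -literal -offset indent
# Hypothetical helper: hot-add a new disk and rebuild onto it.
# $1 = new component, $2 = failed placeholder, $3 = RAID device
mirror_onto_new_disk() {
    raidctl -a "$1" "$3" &&    # add the new disk as a hot spare
    raidctl -F "$2" "$3" &&    # reconstruct the placeholder onto it
    raidctl -S "$3"            # report reconstruction progress
}
.Ed
.Pp
For the configuration above this might be invoked as
.Sq mirror_onto_new_disk /dev/sd1e /dev/sd6e raid0 .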
.Ss Maintenance of the RAID set
After the parity has been initialized for the first time, the command:
.Bd -literal -offset indent
raidctl -p raid0
.Ed
.Pp
can be used to check the current status of the parity. To check the
parity and rebuild it if necessary (for example, after an unclean
shutdown) the command:
.Bd -literal -offset indent
raidctl -P raid0
.Ed
.Pp
is used. Note that re-writing the parity can be done while
other operations on the RAID set are taking place (e.g. while doing a
.Xr fsck 8
on a file system on the RAID set). However, for maximum effectiveness
of the RAID set, the parity should be known to be correct before any
data on the set is modified.
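.Pp
Because
.Fl p
returns successfully only when the parity is up-to-date, its exit
status can drive a simple check.
The sketch below is one way this might be scripted; the function name
.Sq check_parity
is hypothetical.
.Bd -literal -offset indent
# Hypothetical helper: rewrite the parity only when it is not
# known to be up-to-date.
# $1 = RAID device
check_parity() {
    if raidctl -p "$1"; then
        echo "$1: parity is up to date"
    else
        raidctl -P "$1"
    fi
}
.Ed
.Pp
Note that
.Fl P
alone performs the same check before re-writing, so the wrapper mainly
serves to make the outcome explicit in a script.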
.Pp
To see how the RAID set is doing, the following command can be used to
show the RAID set's status:
.Bd -literal -offset indent
raidctl -s raid0
.Ed
.Pp
The output will look something like:
.Bd -literal -offset indent
Components:
/dev/sd1e: optimal
/dev/sd2e: optimal
/dev/sd3e: optimal
Spares:
/dev/sd4e: spare
Component label for /dev/sd1e:
Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
Version: 2 Serial Number: 13432 Mod Counter: 65
Clean: No Status: 0
sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
RAID Level: 5 blocksize: 512 numBlocks: 1799936
Autoconfig: No
Last configured as: raid0
Component label for /dev/sd2e:
Row: 0 Column: 1 Num Rows: 1 Num Columns: 3
Version: 2 Serial Number: 13432 Mod Counter: 65
Clean: No Status: 0
sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
RAID Level: 5 blocksize: 512 numBlocks: 1799936
Autoconfig: No
Last configured as: raid0
Component label for /dev/sd3e:
Row: 0 Column: 2 Num Rows: 1 Num Columns: 3
Version: 2 Serial Number: 13432 Mod Counter: 65
Clean: No Status: 0
sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
RAID Level: 5 blocksize: 512 numBlocks: 1799936
Autoconfig: No
Last configured as: raid0
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
This indicates that all is well with the RAID set. Of importance here
are the component lines which read
.Sq optimal ,
and the
.Sq Parity status
line which indicates that the parity is up-to-date. Note that if
there are file systems open on the RAID set, the individual components
will not be
.Sq clean
but the set as a whole can still be clean.
.Pp
To check the component label of /dev/sd1e, the following is used:
.Bd -literal -offset indent
raidctl -g /dev/sd1e raid0
.Ed
.Pp
The output of this command will look something like:
.Bd -literal -offset indent
Component label for /dev/sd1e:
Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
Version: 2 Serial Number: 13432 Mod Counter: 65
Clean: No Status: 0
sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
RAID Level: 5 blocksize: 512 numBlocks: 1799936
Autoconfig: No
Last configured as: raid0
.Ed
.Ss Dealing with Component Failures
If for some reason
(perhaps to test reconstruction) it is necessary to pretend a drive
has failed, the following will perform that function:
.Bd -literal -offset indent
raidctl -f /dev/sd2e raid0
.Ed
.Pp
The system will then be performing all operations in degraded mode,
where missing data is re-computed from existing data and the parity.
In this case, obtaining the status of raid0 will return (in part):
.Bd -literal -offset indent
Components:
/dev/sd1e: optimal
/dev/sd2e: failed
/dev/sd3e: optimal
Spares:
/dev/sd4e: spare
.Ed
.Pp
Note that with the use of
.Fl f
a reconstruction has not been started. To both fail the disk and
start a reconstruction, the
.Fl F
option must be used:
.Bd -literal -offset indent
raidctl -F /dev/sd2e raid0
.Ed
.Pp
The
.Fl f
option may be used first, and then the
.Fl F
option used later, on the same disk, if desired.
Immediately after the reconstruction is started, the status will report:
.Bd -literal -offset indent
Components:
/dev/sd1e: optimal
/dev/sd2e: reconstructing
/dev/sd3e: optimal
Spares:
/dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 10% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
This indicates that a reconstruction is in progress. To find out how
the reconstruction is progressing the
.Fl S
option may be used. This will indicate the progress in terms of the
percentage of the reconstruction that is completed. When the
reconstruction is finished the
.Fl s
option will show:
.Bd -literal -offset indent
Components:
/dev/sd1e: optimal
/dev/sd2e: spared
/dev/sd3e: optimal
Spares:
/dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
At this point there are at least two options. First, if /dev/sd2e is
known to be good (i.e. the failure was either caused by
.Fl f
or
.Fl F ,
or the failed disk was replaced), then a copyback of the data can
be initiated with the
.Fl B
option. In this example, this would copy the entire contents of
/dev/sd4e to /dev/sd2e. Once the copyback procedure is complete, the
status of the device would be (in part):
.Bd -literal -offset indent
Components:
/dev/sd1e: optimal
/dev/sd2e: optimal
/dev/sd3e: optimal
Spares:
/dev/sd4e: spare
.Ed
.Pp
and the system is back to normal operation.
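.Pp
The copyback step can be sketched as follows.
The helper name is hypothetical;
.Fl v
is added so that a progress indicator is shown, and the status is
displayed afterwards to confirm the state of the components.
.Bd -literal -offset indent
# Hypothetical helper: copy reconstructed data back, then show status.
# $1 = RAID device
copyback_and_status() {
    raidctl -v -B "$1" && raidctl -s "$1"
}
.Ed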
.Pp
The second option after the reconstruction is to simply use /dev/sd4e
in place of /dev/sd2e in the configuration file. For example, the
configuration file (in part) might now look like:
.Bd -literal -offset indent
START array
1 3 0
START disks
/dev/sd1e
/dev/sd4e
/dev/sd3e
.Ed
.Pp
This can be done as /dev/sd4e is completely interchangeable with
/dev/sd2e at this point. Note that extreme care must be taken when
changing the order of the drives in a configuration. This is one of
the few instances where the devices and/or their orderings can be
changed without loss of data! In general, the ordering of components
in a configuration file should
.Ar never
be changed.
.Pp
If a component fails and there are no hot spares
available on-line, the status of the RAID set might (in part) look like:
.Bd -literal -offset indent
Components:
/dev/sd1e: optimal
/dev/sd2e: failed
/dev/sd3e: optimal
No spares.
.Ed
.Pp
In this case there are a number of options. The first option is to add a hot
spare using:
.Bd -literal -offset indent
raidctl -a /dev/sd4e raid0
.Ed
.Pp
After the hot add, the status would then be:
.Bd -literal -offset indent
Components:
/dev/sd1e: optimal
/dev/sd2e: failed
/dev/sd3e: optimal
Spares:
/dev/sd4e: spare
.Ed
.Pp
Reconstruction could then take place using
.Fl F
as described above.
.Pp
A second option is to rebuild directly onto /dev/sd2e. Once the disk
containing /dev/sd2e has been replaced, one can simply use:
.Bd -literal -offset indent
raidctl -R /dev/sd2e raid0
.Ed
.Pp
to rebuild the /dev/sd2e component. As the rebuilding is in progress,
the status will be:
.Bd -literal -offset indent
Components:
/dev/sd1e: optimal
/dev/sd2e: reconstructing
/dev/sd3e: optimal
No spares.
.Ed
.Pp
and when completed, will be:
.Bd -literal -offset indent
Components:
/dev/sd1e: optimal
/dev/sd2e: optimal
/dev/sd3e: optimal
No spares.
.Ed
.Pp
In circumstances where a particular component is completely
unavailable after a reboot, a special component name will be used to
indicate the missing component. For example:
.Bd -literal -offset indent
Components:
/dev/sd2e: optimal
component1: failed
No spares.
.Ed
.Pp
indicates that the second component of this RAID set was not detected
at all by the auto-configuration code. The name
.Sq component1
can be used anywhere a normal component name would be used. For
example, to add a hot spare to the above set, and rebuild to that hot
spare, the following could be done:
.Bd -literal -offset indent
raidctl -a /dev/sd3e raid0
raidctl -F component1 raid0
.Ed
.Pp
at which point the data missing from
.Sq component1
would be reconstructed onto /dev/sd3e.
.Pp
When more than one component is marked as
.Sq failed
due to a non-component hardware failure (e.g. loss of power to two
components, adapter problems, termination problems, or cabling issues) it
is quite possible to recover the data on the RAID set. The first
thing to be aware of is that the first disk to fail will almost certainly
be out-of-sync with the remainder of the array. If any IO was
performed between the time the first component is considered
.Sq failed
and when the second component is considered
.Sq failed ,
then the first component to fail will
.Ar not
contain correct data, and should be ignored. When the second
component is marked as failed, however, the RAID device will
(currently) panic the system. At this point the data on the RAID set
(not including the first failed component) is still self-consistent,
and will be in no worse state of repair than had the power gone out in
the middle of a write to a file system on a non-RAID device.
The problem, however, is that the component labels may now have 3
different 'modification counters' (one value on the first component
that failed, one value on the second component that failed, and a
third value on the remaining components). In such a situation, the
RAID set will not autoconfigure, and can only be forcibly re-configured
with the
.Fl C
option. To recover the RAID set, one must first remedy whatever physical
problem caused the multiple-component failure. After that is done,
the RAID set can be restored by forcibly configuring the RAID set
.Ar without
the component that failed first. For example, if /dev/sd1e and
/dev/sd2e fail (in that order) in a RAID set of the following
configuration:
.Bd -literal -offset indent
START array
1 4 0
START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e
/dev/sd4e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5
START queue
fifo 100
.Ed
.Pp
then the following configuration (say "recover_raid0.conf")
.Bd -literal -offset indent
START array
1 4 0
START disks
/dev/sd6e
/dev/sd2e
/dev/sd3e
/dev/sd4e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5
START queue
fifo 100
.Ed
.Pp
(where /dev/sd6e has no physical device) can be used with
.Bd -literal -offset indent
raidctl -C recover_raid0.conf raid0
.Ed
.Pp
to force the configuration of raid0. A
.Bd -literal -offset indent
raidctl -I 12345 raid0
.Ed
.Pp
will be required in order to synchronize the component labels.
At this point the file systems on the RAID set can then be checked and
corrected. To complete the re-construction of the RAID set,
/dev/sd1e is simply hot-added back into the array, and reconstructed
as described earlier.
.Ss RAID on RAID
RAID sets can be layered to create more complex and much larger RAID
sets. A RAID 0 set, for example, could be constructed from four RAID
5 sets. The following configuration file shows such a setup:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 4 0
START disks
/dev/raid1e
/dev/raid2e
/dev/raid3e
/dev/raid4e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
128 1 1 0
START queue
fifo 100
.Ed
.Pp
A similar configuration file might be used for a RAID 0 set
constructed from components on RAID 1 sets. In such a configuration,
the mirroring provides a high degree of redundancy, while the striping
provides additional speed benefits.
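.Pp
As an illustration of that second arrangement, a RAID 0 set striped
across two mirrors might be described as in the sketch below.
It assumes that raid1 and raid2 are already-configured RAID 1 sets,
each with an
.Sq e
partition of the same size; those device names are examples only.
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0
START disks
# raid1 and raid2 are assumed to be existing RAID 1 sets
/dev/raid1e
/dev/raid2e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
128 1 1 0
START queue
fifo 100
.Ed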
.Ss Auto-configuration and Root on RAID
RAID sets can also be auto-configured at boot. To make a set
auto-configurable, simply prepare the RAID set as above, and then do
a:
.Bd -literal -offset indent
raidctl -A yes raid0
.Ed
.Pp
to turn on auto-configuration for that set. To turn off
auto-configuration, use:
.Bd -literal -offset indent
raidctl -A no raid0
.Ed
.Pp
RAID sets which are auto-configurable will be configured before the
root file system is mounted. These RAID sets are thus available for
use as a root file system, or for any other file system. A primary
advantage of using the auto-configuration is that RAID components
become more independent of the disks they reside on. For example,
SCSI IDs can change, but auto-configured sets will always be
configured correctly, even if the SCSI IDs of the component disks
have become scrambled.
.Pp
Having a system's root file system
.Pq Pa /
on a RAID set is also allowed,
with the
.Sq a
partition of such a RAID set being used for
.Pa / .
To use raid0a as the root file system, simply use:
.Bd -literal -offset indent
raidctl -A root raid0
.Ed
.Pp
To return raid0a to be just an auto-configuring set, simply use the
.Fl A Ar yes
arguments.
.Pp
Note that kernels can only be directly read from RAID 1 components on
alpha and pmax architectures. On those architectures, the
.Dv FS_RAID
file system is recognized by the bootblocks, and will properly load the
kernel directly from a RAID 1 component. For other architectures, or
to support the root file system on other RAID sets, some other
mechanism must be used to get a kernel booting. For example, a small
partition containing only the secondary boot-blocks and an alternate
kernel (or two) could be used. Once a kernel is booting, however, and
an auto-configuring RAID set is found that is eligible to be root,
then that RAID set will be auto-configured and used as the root
device. If two or more RAID sets claim to be root devices, then the
user will be prompted to select the root device. At this time, RAID
0, 1, 4, and 5 sets are all supported as root devices.
.Pp
A typical RAID 1 setup with root on RAID might be as follows:
.Bl -enum
.It
wd0a - a small partition, which contains a complete, bootable, basic
.Nx
installation.
.It
wd1a - also contains a complete, bootable, basic
.Nx
installation.
.It
wd0e and wd1e - a RAID 1 set, raid0, used for the root file system.
.It
wd0f and wd1f - a RAID 1 set, raid1, which will be used only for
swap space.
.It
wd0g and wd1g - a RAID 1 set, raid2, used for
.Pa /usr ,
.Pa /home ,
or other data, if desired.
.It
wd0h and wd1h - a RAID 1 set, raid3, if desired.
.El
.Pp
RAID sets raid0, raid1, and raid2 are all marked as
auto-configurable. raid0 is marked as being a root file system.
When new kernels are installed, the kernel is not only copied to
.Pa / ,
but also to wd0a and wd1a. The kernel on wd0a is required, since that
is the kernel the system boots from. The kernel on wd1a is also
required, since that will be the kernel used should wd0 fail. The
important point here is to have redundant copies of the kernel
available, in the event that one of the drives fails.
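.Pp
For the layout above, marking the sets and refreshing the redundant
kernels could be sketched as follows.
The mount point
.Pa /mnt
and the kernel path
.Pa /netbsd
are assumptions for illustration:
.Bd -literal -offset indent
raidctl -A root raid0
raidctl -A yes raid1
raidctl -A yes raid2
# copy the newly installed kernel to each boot partition
mount /dev/wd0a /mnt && cp /netbsd /mnt/netbsd && umount /mnt
mount /dev/wd1a /mnt && cp /netbsd /mnt/netbsd && umount /mnt
.Ed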
.Pp
There is no requirement that the root file system be on the same disk
as the kernel. For example, obtaining the kernel from wd0a, and using
sd0e and sd1e for raid0, and the root file system, is fine. It
.Ar is
critical, however, that there be multiple kernels available, in the
event of media failure.
.Pp
Multi-layered RAID devices (such as a RAID 0 set made
up of RAID 1 sets) are
.Ar not
supported as root devices or auto-configurable devices at this point.
(Multi-layered RAID devices
.Ar are
supported in general, however, as mentioned earlier.) Note that in
order to enable component auto-detection and auto-configuration of
RAID devices, the line:
.Bd -literal -offset indent
options RAID_AUTOCONFIG
.Ed
.Pp
must be in the kernel configuration file. See
.Xr raid 4
for more details.
.Ss Unconfiguration
The final operation performed by
.Nm
is to unconfigure a
.Xr raid 4
device. This is accomplished via a simple:
.Bd -literal -offset indent
raidctl -u raid0
.Ed
.Pp
at which point the device is ready to be reconfigured.
.Ss Performance Tuning
Selection of the various parameter values which result in the best
performance can be quite tricky, and often requires a bit of
trial-and-error to get those values most appropriate for a given system.
A whole range of factors come into play, including:
.Bl -enum
.It
Types of components (e.g. SCSI vs. IDE) and their bandwidth
.It
Types of controller cards and their bandwidth
.It
Distribution of components among controllers
.It
IO bandwidth
.It
File system access patterns
.It
CPU speed
.El
.Pp
As with most performance tuning, benchmarking under real-life loads
may be the only way to measure expected performance. Understanding
some of the underlying technology is also useful in tuning. The goal
of this section is to provide pointers to those parameters which may
make significant differences in performance.
.Pp
For a RAID 1 set, a SectPerSU value of 64 or 128 is typically
sufficient. Since data in a RAID 1 set is arranged in a linear
fashion on each component, selecting an appropriate stripe size is
somewhat less critical than it is for a RAID 5 set. However, a stripe
size that is too small will cause large IOs to be broken up into a
number of smaller ones, hurting performance. At the same time, a
large stripe size may cause problems with concurrent accesses to
stripes, which may also affect performance. Thus values in the range
of 32 to 128 are often the most effective.
.Pp
Tuning RAID 5 sets is trickier. In the best case, IO is presented to
the RAID set one stripe at a time. Since the entire stripe is
available at the beginning of the IO, the parity of that stripe can
be calculated before the stripe is written, and then the stripe data
and parity can be written in parallel. When the amount of data being
written is less than a full stripe worth, the
.Sq small write
problem occurs. Since a
.Sq small write
means only a portion of the stripe on the components is going to
change, the data (and parity) on the components must be updated
slightly differently. First, the
.Sq old parity
and
.Sq old data
must be read from the components. Then the new parity is constructed,
using the new data to be written, and the old data and old parity.
Finally, the new data and new parity are written. All this extra data
shuffling results in a serious loss of performance, and is typically 2
to 4 times slower than a full stripe write (or read). To combat this
problem in the real world, it may be useful to ensure that stripe
sizes are small enough that a
.Sq large IO
from the system will use exactly one large stripe write. As is seen
later, there are some file system dependencies which may come into play
here as well.
.Pp
Since the size of a
.Sq large IO
is often (currently) only 32K or 64K, on a 5-drive RAID 5 set it may
be desirable to select a SectPerSU value of 16 blocks (8K) or 32
blocks (16K). Since there are 4 data stripe units per stripe, the maximum
data per stripe is 64 blocks (32K) or 128 blocks (64K). Again,
empirical measurement will provide the best indicators of which
values will yield better performance.
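.Pp
The arithmetic above can be reproduced with a small shell sketch.
The function name is hypothetical, and the calculation assumes
512-byte sectors and a RAID 5 set in which all but one column of each
stripe holds data:
.Bd -literal -offset indent
#!/bin/sh
# Print the amount of data (in KB) held by one full stripe.
# $1 = SectPerSU, $2 = number of components in the RAID 5 set
full_stripe_kb() {
    sect_per_su=$1
    num_col=$2
    data_cols=$((num_col - 1))   # one stripe unit per stripe is parity
    echo $((sect_per_su * data_cols * 512 / 1024))
}

full_stripe_kb 16 5   # prints 32
full_stripe_kb 32 5   # prints 64
.Ed
.Pp
Comparing the printed value with the size of a typical large IO gives
a starting point for benchmarking, not a final answer.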
.Pp
The parameters used for the file system are also critical to good
performance. For
.Xr newfs 8 ,
for example, increasing the block size to 32K or 64K may improve
performance dramatically. As well, changing the cylinders-per-group
parameter from 16 to 32 or higher is often not only necessary for
larger file systems, but may also have positive performance
implications.
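.Pp
As an illustration only, a file system with 32K blocks and 4K
fragments could be created as shown below.
The sizes and the device name are assumptions to be validated by
benchmarking, not recommendations:
.Bd -literal -offset indent
newfs -b 32768 -f 4096 /dev/rraid0e
.Ed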
.Ss Summary
Despite the length of this man-page, configuring a RAID set is a
relatively straight-forward process. All that needs to be done is to
follow these steps:
.Bl -enum
.It
Use
.Xr disklabel 8
to create the components (of type RAID).
.It
Construct a RAID configuration file: e.g.
.Sq raid0.conf
.It
Configure the RAID set with:
.Bd -literal -offset indent
raidctl -C raid0.conf raid0
.Ed
.Pp
.It
Initialize the component labels with:
.Bd -literal -offset indent
raidctl -I 123456 raid0
.Ed
.Pp
.It
Initialize other important parts of the set with:
.Bd -literal -offset indent
raidctl -i raid0
.Ed
.Pp
.It
Get the default label for the RAID set:
.Bd -literal -offset indent
disklabel raid0 > /tmp/label
.Ed
.Pp
.It
Edit the label:
.Bd -literal -offset indent
vi /tmp/label
.Ed
.Pp
.It
Put the new label on the RAID set:
.Bd -literal -offset indent
disklabel -R -r raid0 /tmp/label
.Ed
.Pp
.It
Create the file system:
.Bd -literal -offset indent
newfs /dev/rraid0e
.Ed
.Pp
.It
Mount the file system:
.Bd -literal -offset indent
mount /dev/raid0e /mnt
.Ed
.Pp
.It
Use:
.Bd -literal -offset indent
raidctl -c raid0.conf raid0
.Ed
.Pp
to re-configure the RAID set the next time it is needed, or put
raid0.conf into /etc where it will automatically be started by
the /etc/rc scripts.
.El
.Sh SEE ALSO
.Xr ccd 4 ,
.Xr raid 4 ,
.Xr rc 8
.Sh HISTORY
RAIDframe is a framework for rapid prototyping of RAID structures
developed by the folks at the Parallel Data Laboratory at Carnegie
Mellon University (CMU).
A more complete description of the internals and functionality of
RAIDframe is found in the paper "RAIDframe: A Rapid Prototyping Tool
for RAID Systems", by William V. Courtright II, Garth Gibson, Mark
Holland, LeAnn Neal Reilly, and Jim Zelenka, and published by the
Parallel Data Laboratory of Carnegie Mellon University.
.Pp
The
.Nm
command first appeared as a program in CMU's RAIDframe v1.1 distribution. This
version of
.Nm
is a complete re-write, and first appeared in
.Nx 1.4 .
.Sh COPYRIGHT
.Bd -literal
The RAIDframe Copyright is as follows:
Copyright (c) 1994-1996 Carnegie-Mellon University.
All rights reserved.
Permission to use, copy, modify and distribute this software and
its documentation is hereby granted, provided that both the copyright
notice and this permission notice appear in all copies of the
software, derivative works or modified versions, and any portions
thereof, and that both notices appear in supporting documentation.
CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
Carnegie Mellon requests users of this software to return to
Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
School of Computer Science
Carnegie Mellon University
Pittsburgh PA 15213-3890
any improvements or extensions that they make and grant Carnegie the
rights to redistribute these changes.
.Ed
.Sh WARNINGS
Certain RAID levels (1, 4, 5, 6, and others) can protect against some
data loss due to component failure. However, the loss of two
components of a RAID 4 or 5 system, or the loss of a single component
of a RAID 0 system will result in the entire file system being lost.
RAID is
.Ar NOT
a substitute for good backup practices.
.Pp
Recomputation of parity
.Ar MUST
be performed whenever there is a chance that it may have been
compromised. This includes after system crashes, or before a RAID
device has been used for the first time. Failure to keep parity
correct will be catastrophic should a component ever fail -- it is
better to use RAID 0 and get the additional space and speed, than it
is to use parity, but not keep the parity correct. At least with RAID
0 there is no perception of increased data security.
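.Pp
A minimal sketch of checking parity after a crash, assuming the set is
raid0, might be:
.Bd -literal -offset indent
# report whether the parity is known to be up-to-date
raidctl -p raid0
# check the parity, and re-write it if it is not up-to-date
raidctl -P raid0
.Ed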
.Sh BUGS
Hot-spare removal is currently not available.