.\" $NetBSD: raid.4,v 1.35 2008/05/02 18:11:05 martin Exp $
.\"
.\" Copyright (c) 1998 The NetBSD Foundation, Inc.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to The NetBSD Foundation
.\" by Greg Oster
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.\"
.\" Copyright (c) 1995 Carnegie-Mellon University.
.\" All rights reserved.
.\"
.\" Author: Mark Holland
.\"
.\" Permission to use, copy, modify and distribute this software and
.\" its documentation is hereby granted, provided that both the copyright
.\" notice and this permission notice appear in all copies of the
.\" software, derivative works or modified versions, and any portions
.\" thereof, and that both notices appear in supporting documentation.
.\"
.\" CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
.\" CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
.\" FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
.\"
.\" Carnegie Mellon requests users of this software to return to
.\"
.\" Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
.\" School of Computer Science
.\" Carnegie Mellon University
.\" Pittsburgh PA 15213-3890
.\"
.\" any improvements or extensions that they make and grant Carnegie the
.\" rights to redistribute these changes.
.\"
.Dd August 6, 2007
.Dt RAID 4
.Os
.Sh NAME
.Nm raid
.Nd RAIDframe disk driver
.Sh SYNOPSIS
.Cd options RAID_AUTOCONFIG
.Cd options RAID_DIAGNOSTIC
.Cd options RF_ACC_TRACE=n
.Cd options RF_DEBUG_MAP=n
.Cd options RF_DEBUG_PSS=n
.Cd options RF_DEBUG_QUEUE=n
.Cd options RF_DEBUG_QUIESCE=n
.Cd options RF_DEBUG_RECON=n
.Cd options RF_DEBUG_STRIPELOCK=n
.Cd options RF_DEBUG_VALIDATE_DAG=n
.Cd options RF_DEBUG_VERIFYPARITY=n
.Cd options RF_INCLUDE_CHAINDECLUSTER=n
.Cd options RF_INCLUDE_EVENODD=n
.Cd options RF_INCLUDE_INTERDECLUSTER=n
.Cd options RF_INCLUDE_PARITY_DECLUSTERING=n
.Cd options RF_INCLUDE_PARITY_DECLUSTERING_DS=n
.Cd options RF_INCLUDE_PARITYLOGGING=n
.Cd options RF_INCLUDE_RAID5_RS=n
.Pp
.Cd "pseudo-device raid" Op Ar count
.Sh DESCRIPTION
The
.Nm
driver provides RAID 0, 1, 4, and 5 (and more!) capabilities to
.Nx .
This
document assumes that the reader has at least some familiarity with RAID
and RAID concepts. The reader is also assumed to know how to configure
disks and pseudo-devices into kernels, how to generate kernels, and how
to partition disks.
.Pp
RAIDframe provides a number of different RAID levels including:
.Bl -tag -width indent
.It RAID 0
provides simple data striping across the components.
.It RAID 1
provides mirroring.
.It RAID 4
provides data striping across the components, with parity
stored on a dedicated drive (in this case, the last component).
.It RAID 5
provides data striping across the components, with parity
distributed across all the components.
.El
.Pp
RAIDframe supports a wide variety of other RAID levels.
The kernel configuration file options to enable them are briefly outlined
at the end of this section.
.Pp
Depending on the parity level configured, the device driver can
support the failure of component drives. The number of failures
allowed depends on the parity level selected. If the driver is able
to handle drive failures, and a drive does fail, then the system is
operating in "degraded mode". In this mode, all missing data must be
reconstructed from the data and parity present on the other
components. This results in much slower data accesses, but
does mean that a failure need not bring the system to a complete halt.
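.Pp
As a rough illustration of degraded-mode reconstruction, assume a
single-parity set (RAID 4 or RAID 5) with three data components and
bytewise XOR parity.
The byte values below are invented for the example:
.Bd -literal -offset indent
D0 = 0x0f   D1 = 0x35   D2 = 0xc3
P  = D0 xor D1 xor D2 = 0xf9

# the component holding D1 fails; its data is recomputed as
D1 = D0 xor D2 xor P = 0x0f xor 0xc3 xor 0xf9 = 0x35
.Ed
.Pp
Each read of the missing component therefore involves reading the
surviving components of the stripe, which is why degraded-mode
accesses are slower.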
.Pp
The RAID driver supports and enforces the use of
.Sq component labels .
A
.Sq component label
contains important information about the component, including a
user-specified serial number, the row and column of that component in
the RAID set, and whether the data (and parity) on the component is
.Sq clean .
The component label currently lives at the half-way point of the
.Sq reserved section
located at the beginning of each component.
This
.Sq reserved section
is RF_PROTECTED_SECTORS in length (64 blocks or 32Kbytes) and the
component label is currently 1Kbyte in size.
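.Pp
Assuming 512-byte sectors (which is what 64 blocks in 32Kbytes implies),
the figures above work out roughly as follows.
The offsets are derived from this description rather than read from the
driver source:
.Bd -literal -offset indent
reserved section = 64 sectors * 512 bytes = 32768 bytes
label offset     = 32768 / 2              = 16384 bytes (sector 32)
label extent     = bytes 16384 through 17407 (1024 bytes)
.Ed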
.Pp
If the driver determines that the component labels are very inconsistent with
respect to each other (e.g. two or more serial numbers do not match)
or that the component label is not consistent with its assigned place
in the set (e.g. the component label claims the component should be
the 3rd one in a 6-disk set, but the RAID set has it as the 3rd component
in a 5-disk set) then the device will fail to configure. If the
driver determines that exactly one component label seems to be
incorrect, and the RAID set is being configured as a set that supports
a single failure, then the RAID set will be allowed to configure, but
the incorrectly labeled component will be marked as
.Sq failed ,
and the RAID set will begin operation in degraded mode.
If all of the components are consistent among themselves, the RAID set
will configure normally.
.Pp
Component labels are also used to support the auto-detection and
autoconfiguration of RAID sets. A RAID set can be flagged as
autoconfigurable, in which case it will be configured automatically
during the kernel boot process. RAID file systems which are
automatically configured are also eligible to be the root file system.
There is currently only limited support (alpha, amd64, i386, pmax,
sparc, sparc64, and vax architectures)
for booting a kernel directly from a RAID 1 set, and no support for
booting from any other RAID sets. To use a RAID set as the root
file system, a kernel is usually obtained from a small non-RAID
partition, after which any autoconfiguring RAID set can be used for the
root file system. See
.Xr raidctl 8
for more information on autoconfiguration of RAID sets.
Note that with autoconfiguration of RAID sets, it is no longer
necessary to hard-code SCSI IDs of drives.
The autoconfiguration code will
correctly configure a device even after any number of the components
have had their device IDs changed or device names changed.
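.Pp
A minimal sketch of flagging an existing set for autoconfiguration,
assuming an already configured device named raid0 (the device name is
only a placeholder; see
.Xr raidctl 8
for the authoritative option descriptions):
.Bd -literal -offset indent
# mark raid0 as autoconfigurable at boot
raidctl -A yes raid0
# additionally make it eligible to be the root file system
raidctl -A root raid0
# turn autoconfiguration off again
raidctl -A no raid0
.Ed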
.Pp
The driver supports
.Sq hot spares ,
disks which are on-line, but are not
actively used in an existing file system. Should a disk fail, the
driver is capable of reconstructing the failed disk onto a hot spare
or back onto a replacement drive.
If the components are hot swappable, the failed disk can then be
removed, a new disk put in its place, and a copyback operation
performed. The copyback operation, as its name indicates, will copy
the reconstructed data from the hot spare to the previously failed
(and now replaced) disk. Hot spares can also be hot-added using
.Xr raidctl 8 .
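.Pp
The sequence above might look as follows with
.Xr raidctl 8 .
The names raid0, /dev/sd1e, and /dev/sd3e are hypothetical placeholders:
.Bd -literal -offset indent
# hot-add /dev/sd3e as a spare
raidctl -a /dev/sd3e raid0
# fail /dev/sd1e and start reconstruction onto a spare
raidctl -F /dev/sd1e raid0
# after the failed disk has been replaced, copy the data back
raidctl -B raid0
# check component and spare status at any point
raidctl -s raid0
.Ed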
.Pp
If a component cannot be detected when the RAID device is configured,
that component will be simply marked as
.Sq failed .
.Pp
The user-land utility for doing all
.Nm
configuration and other operations
is
.Xr raidctl 8 .
Most importantly,
.Xr raidctl 8
must be used with the
.Fl i
option to initialize all RAID sets. In particular, this
initialization includes re-building the parity data. This rebuilding
of parity data is also required when either a) a new RAID device is
brought up for the first time or b) a RAID device has been shut down
uncleanly. By using the
.Fl P
option to
.Xr raidctl 8 ,
and performing this on-demand recomputation of all parity
before doing a
.Xr fsck 8
or a
.Xr newfs 8 ,
file system integrity and parity integrity can be ensured. It bears
repeating that parity recomputation is
.Em required
before any file systems are created or used on the RAID device. If the
parity is not correct, then missing data cannot be correctly recovered.
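.Pp
As a sketch, with raid0 standing in for the actual device name:
.Bd -literal -offset indent
# initialize (rewrite) parity on a newly configured set
raidctl -i raid0
# after an unclean shutdown: check parity and rewrite it if needed
raidctl -P raid0
.Ed
.Pp
Only once parity is known to be correct should
.Xr fsck 8
or
.Xr newfs 8
be run against the device.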
.Pp
RAID levels may be combined in a hierarchical fashion. For example, a RAID 0
device can be constructed out of a number of RAID 5 devices (which, in turn,
may be constructed out of the physical disks, or of other RAID devices).
.Pp
The first step to using the
.Nm
driver is to ensure that it is suitably configured in the kernel. This is
done by adding a line similar to:
.Bd -unfilled -offset indent
pseudo-device raid 4 # RAIDframe disk device
.Ed
.Pp
to the kernel configuration file. The
.Sq count
argument (
.Sq 4 ,
in this case) specifies the number of RAIDframe drivers to configure.
To turn on component auto-detection and autoconfiguration of RAID
sets, simply add:
.Bd -unfilled -offset indent
options RAID_AUTOCONFIG
.Ed
.Pp
to the kernel configuration file.
.Pp
All component partitions must be of the type
.Dv FS_BSDFFS
(e.g. 4.2BSD) or
.Dv FS_RAID .
The use of the latter is strongly encouraged, and is required if
autoconfiguration of the RAID set is desired. Since RAIDframe leaves
room for disklabels, RAID components can be simply raw disks, or
partitions which use an entire disk.
.Pp
A more detailed treatment of actually using a
.Nm
device is found in
.Xr raidctl 8 .
It is highly recommended that the steps to reconstruct, copyback, and
re-compute parity are well understood by the system administrator(s)
.Em before
a component failure. Doing the wrong thing when a component fails may
result in data loss.
.Pp
Additional internal consistency checking can be enabled by specifying:
.Bd -unfilled -offset indent
options RAID_DIAGNOSTIC
.Ed
.Pp
These assertions are disabled by default in order to improve
performance.
.Pp
RAIDframe supports an access tracing facility for tracking both
requests made and performance of various parts of the RAID system
as each request is processed.
To enable this tracing the following option may be specified:
.Bd -unfilled -offset indent
options RF_ACC_TRACE=1
.Ed
.Pp
For extensive debugging there are a number of kernel options which
will aid in performing extra diagnosis of various parts of the
RAIDframe sub-systems.
Note that in order to make full use of these options it is often
necessary to enable one or more debugging options as listed in
.Pa src/sys/dev/raidframe/rf_options.h .
These options are typically useful only to people who wish
to debug various parts of RAIDframe.
The options include:
.Pp
For debugging the code which maps RAID addresses to physical
addresses:
.Bd -unfilled -offset indent
options RF_DEBUG_MAP=1
.Ed
.Pp
Parity stripe status debugging is enabled with:
.Bd -unfilled -offset indent
options RF_DEBUG_PSS=1
.Ed
.Pp
Additional debugging for queuing is enabled with:
.Bd -unfilled -offset indent
options RF_DEBUG_QUEUE=1
.Ed
.Pp
Problems with non-quiescent file systems should be easier to debug if
the following is enabled:
.Bd -unfilled -offset indent
options RF_DEBUG_QUIESCE=1
.Ed
.Pp
Stripelock debugging is enabled with:
.Bd -unfilled -offset indent
options RF_DEBUG_STRIPELOCK=1
.Ed
.Pp
Additional diagnostic checks during reconstruction are enabled with:
.Bd -unfilled -offset indent
options RF_DEBUG_RECON=1
.Ed
.Pp
Validation of the DAGs (Directed Acyclic Graphs) used to describe an
I/O access can be performed when the following is enabled:
.Bd -unfilled -offset indent
options RF_DEBUG_VALIDATE_DAG=1
.Ed
.Pp
Additional diagnostics during parity verification are enabled with:
.Bd -unfilled -offset indent
options RF_DEBUG_VERIFYPARITY=1
.Ed
.Pp
There are a number of less commonly used RAID levels supported by
RAIDframe.
These additional RAID types should be considered experimental, and
may not be ready for production use.
The various types and the options to enable them are shown here:
.Pp
For Even-Odd parity:
.Bd -unfilled -offset indent
options RF_INCLUDE_EVENODD=1
.Ed
.Pp
For RAID level 5 with rotated sparing:
.Bd -unfilled -offset indent
options RF_INCLUDE_RAID5_RS=1
.Ed
.Pp
For Parity Logging (highly experimental):
.Bd -unfilled -offset indent
options RF_INCLUDE_PARITYLOGGING=1
.Ed
.Pp
For Chain Declustering:
.Bd -unfilled -offset indent
options RF_INCLUDE_CHAINDECLUSTER=1
.Ed
.Pp
For Interleaved Declustering:
.Bd -unfilled -offset indent
options RF_INCLUDE_INTERDECLUSTER=1
.Ed
.Pp
For Parity Declustering:
.Bd -unfilled -offset indent
options RF_INCLUDE_PARITY_DECLUSTERING=1
.Ed
.Pp
For Parity Declustering with Distributed Spares:
.Bd -unfilled -offset indent
options RF_INCLUDE_PARITY_DECLUSTERING_DS=1
.Ed
.Pp
The reader is referred to the RAIDframe documentation mentioned in the
.Sx HISTORY
section for more detail on these various RAID configurations.
.Sh WARNINGS
Certain RAID levels (1, 4, 5, 6, and others) can protect against some
data loss due to component failure. However, the loss of two
components of a RAID 4 or 5 system, or the loss of a single component
of a RAID 0 system, will result in all file systems on that RAID
device being lost.
RAID is
.Em NOT
a substitute for good backup practices.
.Pp
Recomputation of parity
.Em MUST
be performed whenever there is a chance that it may have been
compromised. This includes after system crashes, or before a RAID
device has been used for the first time. Failure to keep parity
correct will be catastrophic should a component ever fail. It is
better to use RAID 0 and get the additional space and speed than it
is to use parity but not keep the parity correct. At least with RAID
0 there is no perception of increased data security.
.Sh FILES
.Bl -tag -width /dev/XXrXraidX -compact
.It Pa /dev/{,r}raid*
.Nm
device special files.
.El
.Sh SEE ALSO
.Xr config 1 ,
.Xr sd 4 ,
.Xr MAKEDEV 8 ,
.Xr fsck 8 ,
.Xr mount 8 ,
.Xr newfs 8 ,
.Xr raidctl 8
.Sh HISTORY
The
.Nm
driver in
.Nx
is a port of RAIDframe, a framework for rapid prototyping of RAID
structures developed by the folks at the Parallel Data Laboratory at
Carnegie Mellon University (CMU). RAIDframe, as originally distributed
by CMU, provides a RAID simulator for a number of different
architectures, and a user-level device driver and a kernel device
driver for Digital Unix. The
.Nm
driver is a kernelized version of RAIDframe v1.1.
.Pp
A more complete description of the internals and functionality of
RAIDframe is found in the paper "RAIDframe: A Rapid Prototyping Tool
for RAID Systems", by William V. Courtright II, Garth Gibson, Mark
Holland, LeAnn Neal Reilly, and Jim Zelenka, and published by the
Parallel Data Laboratory of Carnegie Mellon University.
The
.Nm
driver first appeared in
.Nx 1.4 .
.Sh COPYRIGHT
.Bd -unfilled
The RAIDframe Copyright is as follows:
.Pp
Copyright (c) 1994-1996 Carnegie-Mellon University.
All rights reserved.
.Pp
Permission to use, copy, modify and distribute this software and
its documentation is hereby granted, provided that both the copyright
notice and this permission notice appear in all copies of the
software, derivative works or modified versions, and any portions
thereof, and that both notices appear in supporting documentation.
.Pp
CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
.Pp
Carnegie Mellon requests users of this software to return to
.Pp
Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
School of Computer Science
Carnegie Mellon University
Pittsburgh PA 15213-3890
.Pp
any improvements or extensions that they make and grant Carnegie the
rights to redistribute these changes.
.Ed