2003-08-07 14:30:41 +04:00
|
|
|
.\" $NetBSD: 3.t,v 1.3 2003/08/07 10:30:54 agc Exp $
|
1998-01-09 09:54:57 +03:00
|
|
|
.\"
|
1994-06-19 04:07:16 +04:00
|
|
|
.\" Copyright (c) 1986, 1993
|
|
|
|
.\" The Regents of the University of California. All rights reserved.
|
|
|
|
.\"
|
|
|
|
.\" Redistribution and use in source and binary forms, with or without
|
|
|
|
.\" modification, are permitted provided that the following conditions
|
|
|
|
.\" are met:
|
|
|
|
.\" 1. Redistributions of source code must retain the above copyright
|
|
|
|
.\" notice, this list of conditions and the following disclaimer.
|
|
|
|
.\" 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
.\" notice, this list of conditions and the following disclaimer in the
|
|
|
|
.\" documentation and/or other materials provided with the distribution.
|
2003-08-07 14:30:41 +04:00
|
|
|
.\" 3. Neither the name of the University nor the names of its contributors
|
1994-06-19 04:07:16 +04:00
|
|
|
.\" may be used to endorse or promote products derived from this software
|
|
|
|
.\" without specific prior written permission.
|
|
|
|
.\"
|
|
|
|
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
.\" SUCH DAMAGE.
|
|
|
|
.\"
|
|
|
|
.\" @(#)3.t 8.1 (Berkeley) 6/8/93
|
|
|
|
.\"
|
|
|
|
.ds RH New file system
|
|
|
|
.NH
|
|
|
|
New file system organization
|
|
|
|
.PP
|
|
|
|
In the new file system organization (as in the
|
|
|
|
old file system organization),
|
|
|
|
each disk drive contains one or more file systems.
|
|
|
|
A file system is described by its super-block,
|
|
|
|
located at the beginning of the file system's disk partition.
|
|
|
|
Because the super-block contains critical data,
|
|
|
|
it is replicated to protect against catastrophic loss.
|
|
|
|
This is done when the file system is created;
|
|
|
|
since the super-block data does not change,
|
|
|
|
the copies need not be referenced unless a head crash
|
|
|
|
or other hard disk error causes the default super-block
|
|
|
|
to be unusable.
|
|
|
|
.PP
|
|
|
|
To insure that it is possible to create files as large as
|
|
|
|
$2 sup 32$ bytes with only two levels of indirection,
|
|
|
|
the minimum size of a file system block is 4096 bytes.
|
|
|
|
The size of file system blocks can be any power of two
|
|
|
|
greater than or equal to 4096.
|
|
|
|
The block size of a file system is recorded in the
|
|
|
|
file system's super-block
|
|
|
|
so it is possible for file systems with different block sizes
|
|
|
|
to be simultaneously accessible on the same system.
|
|
|
|
The block size must be decided at the time that
|
|
|
|
the file system is created;
|
|
|
|
it cannot be subsequently changed without rebuilding the file system.
|
|
|
|
.PP
|
|
|
|
The new file system organization divides a disk partition
|
|
|
|
into one or more areas called
|
|
|
|
.I "cylinder groups".
|
|
|
|
A cylinder group is comprised of one or more consecutive
|
|
|
|
cylinders on a disk.
|
|
|
|
Associated with each cylinder group is some bookkeeping information
|
|
|
|
that includes a redundant copy of the super-block,
|
|
|
|
space for inodes,
|
|
|
|
a bit map describing available blocks in the cylinder group,
|
|
|
|
and summary information describing the usage of data blocks
|
|
|
|
within the cylinder group.
|
|
|
|
The bit map of available blocks in the cylinder group replaces
|
|
|
|
the traditional file system's free list.
|
|
|
|
For each cylinder group a static number of inodes
|
|
|
|
is allocated at file system creation time.
|
|
|
|
The default policy is to allocate one inode for each 2048
|
|
|
|
bytes of space in the cylinder group, expecting this
|
|
|
|
to be far more than will ever be needed.
|
|
|
|
.PP
|
|
|
|
All the cylinder group bookkeeping information could be
|
|
|
|
placed at the beginning of each cylinder group.
|
|
|
|
However if this approach were used,
|
|
|
|
all the redundant information would be on the top platter.
|
|
|
|
A single hardware failure that destroyed the top platter
|
|
|
|
could cause the loss of all redundant copies of the super-block.
|
|
|
|
Thus the cylinder group bookkeeping information
|
|
|
|
begins at a varying offset from the beginning of the cylinder group.
|
|
|
|
The offset for each successive cylinder group is calculated to be
|
|
|
|
about one track further from the beginning of the cylinder group
|
|
|
|
than the preceding cylinder group.
|
|
|
|
In this way the redundant
|
|
|
|
information spirals down into the pack so that any single track, cylinder,
|
|
|
|
or platter can be lost without losing all copies of the super-block.
|
|
|
|
Except for the first cylinder group,
|
|
|
|
the space between the beginning of the cylinder group
|
|
|
|
and the beginning of the cylinder group information
|
|
|
|
is used for data blocks.\(dg
|
|
|
|
.FS
|
|
|
|
\(dg While it appears that the first cylinder group could be laid
|
|
|
|
out with its super-block at the ``known'' location,
|
|
|
|
this would not work for file systems
|
|
|
|
with blocks sizes of 16 kilobytes or greater.
|
|
|
|
This is because of a requirement that the first 8 kilobytes of the disk
|
|
|
|
be reserved for a bootstrap program and a separate requirement that
|
|
|
|
the cylinder group information begin on a file system block boundary.
|
|
|
|
To start the cylinder group on a file system block boundary,
|
|
|
|
file systems with block sizes larger than 8 kilobytes
|
|
|
|
would have to leave an empty space between the end of
|
|
|
|
the boot block and the beginning of the cylinder group.
|
|
|
|
Without knowing the size of the file system blocks,
|
|
|
|
the system would not know what roundup function to use
|
|
|
|
to find the beginning of the first cylinder group.
|
|
|
|
.FE
|
|
|
|
.NH 2
|
|
|
|
Optimizing storage utilization
|
|
|
|
.PP
|
|
|
|
Data is laid out so that larger blocks can be transferred
|
|
|
|
in a single disk transaction, greatly increasing file system throughput.
|
|
|
|
As an example, consider a file in the new file system
|
|
|
|
composed of 4096 byte data blocks.
|
|
|
|
In the old file system this file would be composed of 1024 byte blocks.
|
|
|
|
By increasing the block size, disk accesses in the new file
|
|
|
|
system may transfer up to four times as much information per
|
|
|
|
disk transaction.
|
|
|
|
In large files, several
|
|
|
|
4096 byte blocks may be allocated from the same cylinder so that
|
|
|
|
even larger data transfers are possible before requiring a seek.
|
|
|
|
.PP
|
|
|
|
The main problem with
|
|
|
|
larger blocks is that most UNIX
|
|
|
|
file systems are composed of many small files.
|
|
|
|
A uniformly large block size wastes space.
|
|
|
|
Table 1 shows the effect of file system
|
|
|
|
block size on the amount of wasted space in the file system.
|
|
|
|
The files measured to obtain these figures reside on
|
|
|
|
one of our time sharing
|
|
|
|
systems that has roughly 1.2 gigabytes of on-line storage.
|
|
|
|
The measurements are based on the active user file systems containing
|
|
|
|
about 920 megabytes of formatted space.
|
|
|
|
.KF
|
|
|
|
.DS B
|
|
|
|
.TS
|
|
|
|
box;
|
|
|
|
l|l|l
|
|
|
|
a|n|l.
|
|
|
|
Space used % waste Organization
|
|
|
|
_
|
|
|
|
775.2 Mb 0.0 Data only, no separation between files
|
|
|
|
807.8 Mb 4.2 Data only, each file starts on 512 byte boundary
|
|
|
|
828.7 Mb 6.9 Data + inodes, 512 byte block UNIX file system
|
|
|
|
866.5 Mb 11.8 Data + inodes, 1024 byte block UNIX file system
|
|
|
|
948.5 Mb 22.4 Data + inodes, 2048 byte block UNIX file system
|
|
|
|
1128.3 Mb 45.6 Data + inodes, 4096 byte block UNIX file system
|
|
|
|
.TE
|
|
|
|
Table 1 \- Amount of wasted space as a function of block size.
|
|
|
|
.DE
|
|
|
|
.KE
|
|
|
|
The space wasted is calculated to be the percentage of space
|
|
|
|
on the disk not containing user data.
|
|
|
|
As the block size on the disk
|
|
|
|
increases, the waste rises quickly, to an intolerable
|
|
|
|
45.6% waste with 4096 byte file system blocks.
|
|
|
|
.PP
|
|
|
|
To be able to use large blocks without undue waste,
|
|
|
|
small files must be stored in a more efficient way.
|
|
|
|
The new file system accomplishes this goal by allowing the division
|
|
|
|
of a single file system block into one or more
|
|
|
|
.I "fragments".
|
|
|
|
The file system fragment size is specified
|
|
|
|
at the time that the file system is created;
|
|
|
|
each file system block can optionally be broken into
|
|
|
|
2, 4, or 8 fragments, each of which is addressable.
|
|
|
|
The lower bound on the size of these fragments is constrained
|
|
|
|
by the disk sector size,
|
|
|
|
typically 512 bytes.
|
|
|
|
The block map associated with each cylinder group
|
|
|
|
records the space available in a cylinder group
|
|
|
|
at the fragment level;
|
|
|
|
to determine if a block is available, aligned fragments are examined.
|
|
|
|
Figure 1 shows a piece of a map from a 4096/1024 file system.
|
|
|
|
.KF
|
|
|
|
.DS B
|
|
|
|
.TS
|
|
|
|
box;
|
|
|
|
l|c c c c.
|
|
|
|
Bits in map XXXX XXOO OOXX OOOO
|
|
|
|
Fragment numbers 0-3 4-7 8-11 12-15
|
|
|
|
Block numbers 0 1 2 3
|
|
|
|
.TE
|
|
|
|
Figure 1 \- Example layout of blocks and fragments in a 4096/1024 file system.
|
|
|
|
.DE
|
|
|
|
.KE
|
|
|
|
Each bit in the map records the status of a fragment;
|
|
|
|
an ``X'' shows that the fragment is in use,
|
|
|
|
while a ``O'' shows that the fragment is available for allocation.
|
|
|
|
In this example,
|
|
|
|
fragments 0\-5, 10, and 11 are in use,
|
|
|
|
while fragments 6\-9, and 12\-15 are free.
|
|
|
|
Fragments of adjoining blocks cannot be used as a full block,
|
|
|
|
even if they are large enough.
|
|
|
|
In this example,
|
|
|
|
fragments 6\-9 cannot be allocated as a full block;
|
|
|
|
only fragments 12\-15 can be coalesced into a full block.
|
|
|
|
.PP
|
|
|
|
On a file system with a block size of 4096 bytes
|
|
|
|
and a fragment size of 1024 bytes,
|
|
|
|
a file is represented by zero or more 4096 byte blocks of data,
|
|
|
|
and possibly a single fragmented block.
|
|
|
|
If a file system block must be fragmented to obtain
|
|
|
|
space for a small amount of data,
|
|
|
|
the remaining fragments of the block are made
|
|
|
|
available for allocation to other files.
|
|
|
|
As an example consider an 11000 byte file stored on
|
|
|
|
a 4096/1024 byte file system.
|
|
|
|
This file would uses two full size blocks and one
|
|
|
|
three fragment portion of another block.
|
|
|
|
If no block with three aligned fragments is
|
|
|
|
available at the time the file is created,
|
|
|
|
a full size block is split yielding the necessary
|
|
|
|
fragments and a single unused fragment.
|
|
|
|
This remaining fragment can be allocated to another file as needed.
|
|
|
|
.PP
|
|
|
|
Space is allocated to a file when a program does a \fIwrite\fP
|
|
|
|
system call.
|
|
|
|
Each time data is written to a file, the system checks to see if
|
|
|
|
the size of the file has increased*.
|
|
|
|
.FS
|
|
|
|
* A program may be overwriting data in the middle of an existing file
|
|
|
|
in which case space would already have been allocated.
|
|
|
|
.FE
|
|
|
|
If the file needs to be expanded to hold the new data,
|
|
|
|
one of three conditions exists:
|
|
|
|
.IP 1)
|
|
|
|
There is enough space left in an already allocated
|
|
|
|
block or fragment to hold the new data.
|
|
|
|
The new data is written into the available space.
|
|
|
|
.IP 2)
|
|
|
|
The file contains no fragmented blocks (and the last
|
|
|
|
block in the file
|
|
|
|
contains insufficient space to hold the new data).
|
|
|
|
If space exists in a block already allocated,
|
|
|
|
the space is filled with new data.
|
|
|
|
If the remainder of the new data contains more than
|
|
|
|
a full block of data, a full block is allocated and
|
|
|
|
the first full block of new data is written there.
|
|
|
|
This process is repeated until less than a full block
|
|
|
|
of new data remains.
|
|
|
|
If the remaining new data to be written will
|
|
|
|
fit in less than a full block,
|
|
|
|
a block with the necessary fragments is located,
|
|
|
|
otherwise a full block is located.
|
|
|
|
The remaining new data is written into the located space.
|
|
|
|
.IP 3)
|
|
|
|
The file contains one or more fragments (and the
|
|
|
|
fragments contain insufficient space to hold the new data).
|
|
|
|
If the size of the new data plus the size of the data
|
|
|
|
already in the fragments exceeds the size of a full block,
|
|
|
|
a new block is allocated.
|
|
|
|
The contents of the fragments are copied
|
|
|
|
to the beginning of the block
|
|
|
|
and the remainder of the block is filled with new data.
|
|
|
|
The process then continues as in (2) above.
|
|
|
|
Otherwise, if the new data to be written will
|
|
|
|
fit in less than a full block,
|
|
|
|
a block with the necessary fragments is located,
|
|
|
|
otherwise a full block is located.
|
|
|
|
The contents of the existing fragments
|
|
|
|
appended with the new data
|
|
|
|
are written into the allocated space.
|
|
|
|
.PP
|
|
|
|
The problem with expanding a file one fragment at a
|
|
|
|
a time is that data may be copied many times as a
|
|
|
|
fragmented block expands to a full block.
|
|
|
|
Fragment reallocation can be minimized
|
|
|
|
if the user program writes a full block at a time,
|
|
|
|
except for a partial block at the end of the file.
|
|
|
|
Since file systems with different block sizes may reside on
|
|
|
|
the same system,
|
|
|
|
the file system interface has been extended to provide
|
|
|
|
application programs the optimal size for a read or write.
|
|
|
|
For files the optimal size is the block size of the file system
|
|
|
|
on which the file is being accessed.
|
|
|
|
For other objects, such as pipes and sockets,
|
|
|
|
the optimal size is the underlying buffer size.
|
|
|
|
This feature is used by the Standard
|
|
|
|
Input/Output Library,
|
|
|
|
a package used by most user programs.
|
|
|
|
This feature is also used by
|
|
|
|
certain system utilities such as archivers and loaders
|
|
|
|
that do their own input and output management
|
|
|
|
and need the highest possible file system bandwidth.
|
|
|
|
.PP
|
|
|
|
The amount of wasted space in the 4096/1024 byte new file system
|
|
|
|
organization is empirically observed to be about the same as in the
|
|
|
|
1024 byte old file system organization.
|
|
|
|
A file system with 4096 byte blocks and 512 byte fragments
|
|
|
|
has about the same amount of wasted space as the 512 byte
|
|
|
|
block UNIX file system.
|
|
|
|
The new file system uses less space
|
|
|
|
than the 512 byte or 1024 byte
|
|
|
|
file systems for indexing information for
|
|
|
|
large files and the same amount of space
|
|
|
|
for small files.
|
|
|
|
These savings are offset by the need to use
|
|
|
|
more space for keeping track of available free blocks.
|
|
|
|
The net result is about the same disk utilization
|
|
|
|
when a new file system's fragment size
|
|
|
|
equals an old file system's block size.
|
|
|
|
.PP
|
|
|
|
In order for the layout policies to be effective,
|
|
|
|
a file system cannot be kept completely full.
|
|
|
|
For each file system there is a parameter, termed
|
|
|
|
the free space reserve, that
|
|
|
|
gives the minimum acceptable percentage of file system
|
|
|
|
blocks that should be free.
|
|
|
|
If the number of free blocks drops below this level
|
|
|
|
only the system administrator can continue to allocate blocks.
|
|
|
|
The value of this parameter may be changed at any time,
|
|
|
|
even when the file system is mounted and active.
|
|
|
|
The transfer rates that appear in section 4 were measured on file
|
|
|
|
systems kept less than 90% full (a reserve of 10%).
|
|
|
|
If the number of free blocks falls to zero,
|
|
|
|
the file system throughput tends to be cut in half,
|
|
|
|
because of the inability of the file system to localize
|
|
|
|
blocks in a file.
|
|
|
|
If a file system's performance degrades because
|
|
|
|
of overfilling, it may be restored by removing
|
|
|
|
files until the amount of free space once again
|
|
|
|
reaches the minimum acceptable level.
|
|
|
|
Access rates for files created during periods of little
|
|
|
|
free space may be restored by moving their data once enough
|
|
|
|
space is available.
|
|
|
|
The free space reserve must be added to the
|
|
|
|
percentage of waste when comparing the organizations given
|
|
|
|
in Table 1.
|
|
|
|
Thus, the percentage of waste in
|
|
|
|
an old 1024 byte UNIX file system is roughly
|
|
|
|
comparable to a new 4096/512 byte file system
|
|
|
|
with the free space reserve set at 5%.
|
|
|
|
(Compare 11.8% wasted with the old file system
|
|
|
|
to 6.9% waste + 5% reserved space in the
|
|
|
|
new file system.)
|
|
|
|
.NH 2
|
|
|
|
File system parameterization
|
|
|
|
.PP
|
|
|
|
Except for the initial creation of the free list,
|
|
|
|
the old file system ignores the parameters of the underlying hardware.
|
|
|
|
It has no information about either the physical characteristics
|
|
|
|
of the mass storage device,
|
|
|
|
or the hardware that interacts with it.
|
|
|
|
A goal of the new file system is to parameterize the
|
|
|
|
processor capabilities and
|
|
|
|
mass storage characteristics
|
|
|
|
so that blocks can be allocated in an
|
|
|
|
optimum configuration-dependent way.
|
|
|
|
Parameters used include the speed of the processor,
|
|
|
|
the hardware support for mass storage transfers,
|
|
|
|
and the characteristics of the mass storage devices.
|
|
|
|
Disk technology is constantly improving and
|
|
|
|
a given installation can have several different disk technologies
|
|
|
|
running on a single processor.
|
|
|
|
Each file system is parameterized so that it can be
|
|
|
|
adapted to the characteristics of the disk on which
|
|
|
|
it is placed.
|
|
|
|
.PP
|
|
|
|
For mass storage devices such as disks,
|
|
|
|
the new file system tries to allocate new blocks
|
|
|
|
on the same cylinder as the previous block in the same file.
|
|
|
|
Optimally, these new blocks will also be
|
|
|
|
rotationally well positioned.
|
|
|
|
The distance between ``rotationally optimal'' blocks varies greatly;
|
|
|
|
it can be a consecutive block
|
|
|
|
or a rotationally delayed block
|
|
|
|
depending on system characteristics.
|
|
|
|
On a processor with an input/output channel that does not require
|
|
|
|
any processor intervention between mass storage transfer requests,
|
|
|
|
two consecutive disk blocks can often be accessed
|
|
|
|
without suffering lost time because of an intervening disk revolution.
|
|
|
|
For processors without input/output channels,
|
|
|
|
the main processor must field an interrupt and
|
|
|
|
prepare for a new disk transfer.
|
|
|
|
The expected time to service this interrupt and
|
|
|
|
schedule a new disk transfer depends on the
|
|
|
|
speed of the main processor.
|
|
|
|
.PP
|
|
|
|
The physical characteristics of each disk include
|
|
|
|
the number of blocks per track and the rate at which
|
|
|
|
the disk spins.
|
|
|
|
The allocation routines use this information to calculate
|
|
|
|
the number of milliseconds required to skip over a block.
|
|
|
|
The characteristics of the processor include
|
|
|
|
the expected time to service an interrupt and schedule a
|
|
|
|
new disk transfer.
|
|
|
|
Given a block allocated to a file,
|
|
|
|
the allocation routines calculate the number of blocks to
|
|
|
|
skip over so that the next block in the file will
|
|
|
|
come into position under the disk head in the expected
|
|
|
|
amount of time that it takes to start a new
|
|
|
|
disk transfer operation.
|
|
|
|
For programs that sequentially access large amounts of data,
|
|
|
|
this strategy minimizes the amount of time spent waiting for
|
|
|
|
the disk to position itself.
|
|
|
|
.PP
|
|
|
|
To ease the calculation of finding rotationally optimal blocks,
|
|
|
|
the cylinder group summary information includes
|
|
|
|
a count of the available blocks in a cylinder
|
|
|
|
group at different rotational positions.
|
|
|
|
Eight rotational positions are distinguished,
|
|
|
|
so the resolution of the
|
|
|
|
summary information is 2 milliseconds for a typical 3600
|
|
|
|
revolution per minute drive.
|
|
|
|
The super-block contains a vector of lists called
|
|
|
|
.I "rotational layout tables".
|
|
|
|
The vector is indexed by rotational position.
|
|
|
|
Each component of the vector
|
|
|
|
lists the index into the block map for every data block contained
|
|
|
|
in its rotational position.
|
|
|
|
When looking for an allocatable block,
|
|
|
|
the system first looks through the summary counts for a rotational
|
|
|
|
position with a non-zero block count.
|
|
|
|
It then uses the index of the rotational position to find the appropriate
|
|
|
|
list to use to index through
|
|
|
|
only the relevant parts of the block map to find a free block.
|
|
|
|
.PP
|
|
|
|
The parameter that defines the
|
|
|
|
minimum number of milliseconds between the completion of a data
|
|
|
|
transfer and the initiation of
|
|
|
|
another data transfer on the same cylinder
|
|
|
|
can be changed at any time,
|
|
|
|
even when the file system is mounted and active.
|
|
|
|
If a file system is parameterized to lay out blocks with
|
|
|
|
a rotational separation of 2 milliseconds,
|
|
|
|
and the disk pack is then moved to a system that has a
|
|
|
|
processor requiring 4 milliseconds to schedule a disk operation,
|
|
|
|
the throughput will drop precipitously because of lost disk revolutions
|
|
|
|
on nearly every block.
|
|
|
|
If the eventual target machine is known,
|
|
|
|
the file system can be parameterized for it
|
|
|
|
even though it is initially created on a different processor.
|
|
|
|
Even if the move is not known in advance,
|
|
|
|
the rotational layout delay can be reconfigured after the disk is moved
|
|
|
|
so that all further allocation is done based on the
|
|
|
|
characteristics of the new host.
|
|
|
|
.NH 2
|
|
|
|
Layout policies
|
|
|
|
.PP
|
|
|
|
The file system layout policies are divided into two distinct parts.
|
|
|
|
At the top level are global policies that use file system
|
|
|
|
wide summary information to make decisions regarding
|
|
|
|
the placement of new inodes and data blocks.
|
|
|
|
These routines are responsible for deciding the
|
|
|
|
placement of new directories and files.
|
|
|
|
They also calculate rotationally optimal block layouts,
|
|
|
|
and decide when to force a long seek to a new cylinder group
|
|
|
|
because there are insufficient blocks left
|
|
|
|
in the current cylinder group to do reasonable layouts.
|
|
|
|
Below the global policy routines are
|
|
|
|
the local allocation routines that use a locally optimal scheme to
|
|
|
|
lay out data blocks.
|
|
|
|
.PP
|
|
|
|
Two methods for improving file system performance are to increase
|
|
|
|
the locality of reference to minimize seek latency
|
|
|
|
as described by [Trivedi80], and
|
|
|
|
to improve the layout of data to make larger transfers possible
|
|
|
|
as described by [Nevalainen77].
|
|
|
|
The global layout policies try to improve performance
|
|
|
|
by clustering related information.
|
|
|
|
They cannot attempt to localize all data references,
|
|
|
|
but must also try to spread unrelated data
|
|
|
|
among different cylinder groups.
|
|
|
|
If too much localization is attempted,
|
|
|
|
the local cylinder group may run out of space
|
|
|
|
forcing the data to be scattered to non-local cylinder groups.
|
|
|
|
Taken to an extreme,
|
|
|
|
total localization can result in a single huge cluster of data
|
|
|
|
resembling the old file system.
|
|
|
|
The global policies try to balance the two conflicting
|
|
|
|
goals of localizing data that is concurrently accessed
|
|
|
|
while spreading out unrelated data.
|
|
|
|
.PP
|
|
|
|
One allocatable resource is inodes.
|
|
|
|
Inodes are used to describe both files and directories.
|
|
|
|
Inodes of files in the same directory are frequently accessed together.
|
|
|
|
For example, the ``list directory'' command often accesses
|
|
|
|
the inode for each file in a directory.
|
|
|
|
The layout policy tries to place all the inodes of
|
|
|
|
files in a directory in the same cylinder group.
|
|
|
|
To ensure that files are distributed throughout the disk,
|
|
|
|
a different policy is used for directory allocation.
|
|
|
|
A new directory is placed in a cylinder group that has a greater
|
|
|
|
than average number of free inodes,
|
|
|
|
and the smallest number of directories already in it.
|
|
|
|
The intent of this policy is to allow the inode clustering policy
|
|
|
|
to succeed most of the time.
|
|
|
|
The allocation of inodes within a cylinder group is done using a
|
|
|
|
next free strategy.
|
|
|
|
Although this allocates the inodes randomly within a cylinder group,
|
|
|
|
all the inodes for a particular cylinder group can be read with
|
|
|
|
8 to 16 disk transfers.
|
|
|
|
(At most 16 disk transfers are required because a cylinder
|
|
|
|
group may have no more than 2048 inodes.)
|
|
|
|
This puts a small and constant upper bound on the number of
|
|
|
|
disk transfers required to access the inodes
|
|
|
|
for all the files in a directory.
|
|
|
|
In contrast, the old file system typically requires
|
|
|
|
one disk transfer to fetch the inode for each file in a directory.
|
|
|
|
.PP
|
|
|
|
The other major resource is data blocks.
|
|
|
|
Since data blocks for a file are typically accessed together,
|
|
|
|
the policy routines try to place all data
|
|
|
|
blocks for a file in the same cylinder group,
|
|
|
|
preferably at rotationally optimal positions in the same cylinder.
|
|
|
|
The problem with allocating all the data blocks
|
|
|
|
in the same cylinder group is that large files will
|
|
|
|
quickly use up available space in the cylinder group,
|
|
|
|
forcing a spill over to other areas.
|
|
|
|
Further, using all the space in a cylinder group
|
|
|
|
causes future allocations for any file in the cylinder group
|
|
|
|
to also spill to other areas.
|
|
|
|
Ideally none of the cylinder groups should ever become completely full.
|
|
|
|
The heuristic solution chosen is to
|
|
|
|
redirect block allocation
|
|
|
|
to a different cylinder group
|
|
|
|
when a file exceeds 48 kilobytes,
|
|
|
|
and at every megabyte thereafter.*
|
|
|
|
.FS
|
|
|
|
* The first spill over point at 48 kilobytes is the point
|
|
|
|
at which a file on a 4096 byte block file system first
|
|
|
|
requires a single indirect block. This appears to be
|
|
|
|
a natural first point at which to redirect block allocation.
|
|
|
|
The other spillover points are chosen with the intent of
|
|
|
|
forcing block allocation to be redirected when a
|
|
|
|
file has used about 25% of the data blocks in a cylinder group.
|
|
|
|
In observing the new file system in day to day use, the heuristics appear
|
|
|
|
to work well in minimizing the number of completely filled
|
|
|
|
cylinder groups.
|
|
|
|
.FE
|
|
|
|
The newly chosen cylinder group is selected from those cylinder
|
|
|
|
groups that have a greater than average number of free blocks left.
|
|
|
|
Although big files tend to be spread out over the disk,
|
|
|
|
a megabyte of data is typically accessible before
|
|
|
|
a long seek must be performed,
|
|
|
|
and the cost of one long seek per megabyte is small.
|
|
|
|
.PP
|
|
|
|
The global policy routines call local allocation routines with
|
|
|
|
requests for specific blocks.
|
|
|
|
The local allocation routines will
|
|
|
|
always allocate the requested block
|
|
|
|
if it is free, otherwise it
|
|
|
|
allocates a free block of the requested size that is
|
|
|
|
rotationally closest to the requested block.
|
|
|
|
If the global layout policies had complete information,
|
|
|
|
they could always request unused blocks and
|
|
|
|
the allocation routines would be reduced to simple bookkeeping.
|
|
|
|
However, maintaining complete information is costly;
|
|
|
|
thus the implementation of the global layout policy
|
|
|
|
uses heuristics that employ only partial information.
|
|
|
|
.PP
|
|
|
|
If a requested block is not available, the local allocator uses
|
|
|
|
a four level allocation strategy:
|
|
|
|
.IP 1)
|
|
|
|
Use the next available block rotationally closest
|
|
|
|
to the requested block on the same cylinder. It is assumed
|
|
|
|
here that head switching time is zero. On disk
|
|
|
|
controllers where this is not the case, it may be possible
|
|
|
|
to incorporate the time required to switch between disk platters
|
|
|
|
when constructing the rotational layout tables. This, however,
|
|
|
|
has not yet been tried.
|
|
|
|
.IP 2)
|
|
|
|
If there are no blocks available on the same cylinder,
|
|
|
|
use a block within the same cylinder group.
|
|
|
|
.IP 3)
|
|
|
|
If that cylinder group is entirely full,
|
|
|
|
quadratically hash the cylinder group number to choose
|
|
|
|
another cylinder group to look for a free block.
|
|
|
|
.IP 4)
|
|
|
|
Finally if the hash fails, apply an exhaustive search
|
|
|
|
to all cylinder groups.
|
|
|
|
.PP
|
|
|
|
Quadratic hash is used because of its speed in finding
|
|
|
|
unused slots in nearly full hash tables [Knuth75].
|
|
|
|
File systems that are parameterized to maintain at least
|
|
|
|
10% free space rarely use this strategy.
|
|
|
|
File systems that are run without maintaining any free
|
|
|
|
space typically have so few free blocks that almost any
|
|
|
|
allocation is random;
|
|
|
|
the most important characteristic of
|
|
|
|
the strategy used under such conditions is that the strategy be fast.
|
|
|
|
.ds RH Performance
|
|
|
|
.sp 2
|
|
|
|
.ne 1i
|