mirror of https://github.com/postgres/postgres
Add some real documentation about the overall filesystem layout used by
a Postgres database. Update page.sgml to match 8.0 tuple header layout.
parent c7866f6645
commit 7f4b5a003b
@ -0,0 +1,161 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/filelayout.sgml,v 1.1 2004/11/12 21:50:53 tgl Exp $
-->

<chapter id="file-layout">

<title>Database File Layout</title>

<abstract>
<para>
A description of the database physical storage layout.
</para>
</abstract>

<para>
This section provides an overview of the physical format used by
<productname>PostgreSQL</productname> databases.
</para>

<para>
All the data needed for a database cluster is stored within the cluster's data
directory, commonly referred to as <varname>PGDATA</> (after the name of the
environment variable that can be used to define it). A common location for
<varname>PGDATA</> is <filename>/var/lib/pgsql/data</>. Multiple clusters,
managed by different postmasters, can exist on the same machine.
</para>

<para>
The <varname>PGDATA</> directory contains several subdirectories and control
files, as shown in <xref linkend="pgdata-contents-table">. In addition to
these required items, the cluster configuration files
<filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
<filename>pg_ident.conf</filename> are traditionally stored in
<varname>PGDATA</> (although beginning in
<productname>PostgreSQL</productname> 8.0 it is possible to keep them
elsewhere).
</para>

<table tocentry="1" id="pgdata-contents-table">
<title>Contents of <varname>PGDATA</></title>
<tgroup cols="2">
<thead>
<row>
<entry>Item</entry>
<entry>Description</entry>
</row>
</thead>

<tbody>

<row>
<entry><filename>PG_VERSION</></entry>
<entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
</row>

<row>
<entry><filename>base</></entry>
<entry>Subdirectory containing per-database subdirectories</entry>
</row>

<row>
<entry><filename>global</></entry>
<entry>Subdirectory containing cluster-wide tables, such as
<structname>pg_database</></entry>
</row>

<row>
<entry><filename>pg_clog</></entry>
<entry>Subdirectory containing transaction commit status data</entry>
</row>

<row>
<entry><filename>pg_subtrans</></entry>
<entry>Subdirectory containing subtransaction status data</entry>
</row>

<row>
<entry><filename>pg_tblspc</></entry>
<entry>Subdirectory containing symbolic links to tablespaces</entry>
</row>

<row>
<entry><filename>pg_xlog</></entry>
<entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
</row>

<row>
<entry><filename>postmaster.opts</></entry>
<entry>A file recording the command-line options the postmaster was
last started with</entry>
</row>

<row>
<entry><filename>postmaster.pid</></entry>
<entry>A lock file recording the current postmaster PID and shared memory
segment ID (not present after postmaster shutdown)</entry>
</row>

</tbody>
</tgroup>
</table>

<para>
For each database in the cluster there is a subdirectory within
<varname>PGDATA</><filename>/base</>, named after the database's OID in
<structname>pg_database</>. This subdirectory is the default location
for the database's files; in particular, its system catalogs are stored
there.
</para>

<para>
Each table and index is stored in a separate file, named after the table
or index's <firstterm>filenode</> number, which can be found in
<structname>pg_class</>.<structfield>relfilenode</>.
</para>

<caution>
<para>
Note that while a table's filenode often matches its OID, this is
<emphasis>not</> necessarily the case; some operations, like
<command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some forms
of <command>ALTER TABLE</>, can change the filenode while preserving the OID.
Avoid assuming that filenode and table OID are the same.
</para>
</caution>

<para>
When a table or index exceeds 1GB, it is divided into gigabyte-sized
<firstterm>segments</>. The first segment's file name is the same as the
filenode; subsequent segments are named filenode.1, filenode.2, etc.
This arrangement avoids problems on platforms that have file size limitations.
The contents of tables and indexes are discussed further in
<xref linkend="page">.
</para>
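
The segment naming rule described above can be sketched in a few lines of Python. This is only an illustration, not server code: the function name is hypothetical, and the segment size is assumed to be exactly 2^30 bytes.

```python
# Assumption: a segment holds exactly 1 GB (2**30 bytes).
SEGMENT_SIZE = 1024 ** 3


def segment_file_name(filenode: int, byte_offset: int) -> str:
    """Return the name of the file that holds byte_offset of a relation.

    Hypothetical helper for illustration; the server does not expose it.
    """
    if byte_offset < 0:
        raise ValueError("byte_offset must be non-negative")
    segment = byte_offset // SEGMENT_SIZE
    # The first segment is named after the filenode alone;
    # later segments get a numeric suffix: filenode.1, filenode.2, ...
    if segment == 0:
        return str(filenode)
    return "{}.{}".format(filenode, segment)
```

For example, with a filenode of 16384, the byte at offset 0 would be looked up in the file 16384 and the byte at offset 2^30 in 16384.1.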

<para>
A table that has columns with potentially large entries will have an
associated <firstterm>TOAST</> table, which is used for out-of-line storage of
field values that are too large to keep in the table rows proper.
<structname>pg_class</>.<structfield>reltoastrelid</> links from a table to
its TOAST table, if any.
</para>

<para>
Tablespaces make the scenario more complicated. Each non-default tablespace
has a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>
directory, which points to the physical tablespace directory (as specified in
its <command>CREATE TABLESPACE</> command). The symbolic link is named after
the tablespace's OID. Inside the physical tablespace directory there is
a subdirectory for each database that has elements in the tablespace, named
after the database's OID. Tables within that directory follow the filenode
naming scheme. The <literal>pg_default</> tablespace is not accessed through
<filename>pg_tblspc</>, but corresponds to
<varname>PGDATA</><filename>/base</>. Similarly, the <literal>pg_global</>
tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to
<varname>PGDATA</><filename>/global</>.
</para>
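
The directory rules above can be summarized as a small path-building sketch in Python. The function and its parameters are hypothetical; it assumes files in <filename>global</> sit directly in that directory (no per-database subdirectory), and it builds the path through the <filename>pg_tblspc</> symbolic link rather than resolving the link target.

```python
from pathlib import PurePosixPath
from typing import Union


def relation_path(pgdata: str, tablespace: Union[str, int],
                  database_oid: int, filenode: int) -> PurePosixPath:
    """Build the path of a relation's first segment file (illustrative only).

    tablespace is "pg_default", "pg_global", or the OID of another tablespace.
    """
    root = PurePosixPath(pgdata)
    if tablespace == "pg_global":
        # Cluster-wide tables: assumed to live directly in PGDATA/global.
        return root / "global" / str(filenode)
    if tablespace == "pg_default":
        # Default location: PGDATA/base/<database OID>/<filenode>.
        return root / "base" / str(database_oid) / str(filenode)
    if isinstance(tablespace, bool) or not isinstance(tablespace, int):
        raise ValueError("tablespace must be 'pg_default', 'pg_global', or an OID")
    # Other tablespaces: go through the symbolic link named after the
    # tablespace OID, then the per-database subdirectory.
    return root / "pg_tblspc" / str(tablespace) / str(database_oid) / str(filenode)
```

The OIDs passed in would come from <structname>pg_database</> and <structname>pg_class</>.<structfield>relfilenode</>; any concrete numbers used when trying this out are made up.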

</chapter>
@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.38 2004/06/07 04:04:47 tgl Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.39 2004/11/12 21:50:53 tgl Exp $ -->

<!entity history SYSTEM "history.sgml">
<!entity info SYSTEM "info.sgml">
@ -74,6 +74,7 @@
<!entity arch-dev SYSTEM "arch-dev.sgml">
<!entity bki SYSTEM "bki.sgml">
<!entity catalogs SYSTEM "catalogs.sgml">
<!entity filelayout SYSTEM "filelayout.sgml">
<!entity geqo SYSTEM "geqo.sgml">
<!entity gist SYSTEM "gist.sgml">
<!entity indexcost SYSTEM "indexcost.sgml">
@ -1,10 +1,10 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/page.sgml,v 1.18 2004/07/21 22:31:18 tgl Exp $
$PostgreSQL: pgsql/doc/src/sgml/page.sgml,v 1.19 2004/11/12 21:50:53 tgl Exp $
-->

<chapter id="page">

<title>Page Files</title>
<title>Database Page Layout</title>

<abstract>
<para>
@ -14,11 +14,15 @@ A description of the database file page format.

<para>
This section provides an overview of the page format used by
<productname>PostgreSQL</productname> tables and indexes. (Index
access methods need not use this page format. At present, all index
methods do use this basic format, but the data kept on index metapages
usually doesn't follow the item layout rules exactly.) TOAST tables
and sequences are formatted just like a regular table.
<productname>PostgreSQL</productname> tables and indexes.<footnote>
<para>
Actually, index access methods need not use this page format.
All the existing index methods do use this basic format,
but the data kept on index metapages usually doesn't follow
the item layout rules.
</para>
</footnote>
TOAST tables and sequences are formatted just like a regular table.
</para>

<para>
@ -31,14 +35,22 @@ an item is a row; in an index, an item is an index entry.
</para>

<para>
Every table and index is stored as an array of <firstterm>pages</> of a
fixed size (usually 8K, although a different page size can be selected
when compiling the server). In a table, all the pages are logically
equivalent, so a particular item (row) can be stored in any page. In
indexes, the first page is generally reserved as a <firstterm>metapage</>
holding control information, and there may be different types of pages
within the index, depending on the index access method.
</para>

<xref linkend="page-table"> shows the basic layout of a page.
<para>
<xref linkend="page-table"> shows the overall layout of a page.
There are five parts to each page.

</para>

<table tocentry="1" id="page-table">
<title>Sample Page Layout</title>
<title>Overall Page Layout</title>
<titleabbrev>Page Layout</titleabbrev>
<tgroup cols="2">
<thead>
@ -60,12 +72,14 @@ free space pointers.</entry>

<row>
<entry>ItemPointerData</entry>
<entry>Array of (offset,length) pairs pointing to the actual items.</entry>
<entry>Array of (offset,length) pairs pointing to the actual items.
4 bytes per item.</entry>
</row>

<row>
<entry>Free space</entry>
<entry>The unallocated space. All new rows are allocated from here, generally from the end.</entry>
<entry>The unallocated space. New item pointers are allocated from the start
of this area, new items from the end.</entry>
</row>

<row>
@ -74,7 +88,7 @@ free space pointers.</entry>
</row>

<row>
<entry>Special Space</entry>
<entry>Special space</entry>
<entry>Index access method specific data. Different methods store different
data. Empty in ordinary tables.</entry>
</row>
@ -87,13 +101,24 @@ data. Empty in ordinary tables.</entry>

The first 20 bytes of each page consist of a page header
(PageHeaderData). Its format is detailed in <xref
linkend="pageheaderdata-table">. The first two fields deal with WAL
related stuff. This is followed by three 2-byte integer fields
linkend="pageheaderdata-table">. The first two fields track the most
recent WAL entry related to this page. They are followed by three 2-byte
integer fields
(<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
and <structfield>pd_special</structfield>). These represent byte offsets to
the start
and <structfield>pd_special</structfield>). These contain byte offsets
from the page start to the start
of unallocated space, to the end of unallocated space, and to the start of
the special space.
The last 2 bytes of the page header,
<structfield>pd_pagesize_version</structfield>, store both the page size
and a version indicator. Beginning with
<productname>PostgreSQL</productname> 8.0 the version number is 2;
<productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
prior releases used version number 0.
(The basic page layout and header format have not changed in these versions,
but the layout of heap row headers has.) The page size
is basically only present as a cross-check; there is no support for having
more than one page size in an installation.

</para>
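
As a rough illustration of the header description above, the following Python sketch decodes the 20-byte header from a raw page image. Several things here are assumptions rather than statements from this text: the page is taken to be in the byte order of the machine running the script, the 12 bytes of WAL-related fields are treated as opaque, and the page size and version are assumed to occupy the high and low byte of <structfield>pd_pagesize_version</structfield> respectively. The function name and the derived values are hypothetical conveniences.

```python
import struct

PAGE_HEADER_SIZE = 20   # size of the page header, per the text
ITEM_ID_SIZE = 4        # each item pointer takes 4 bytes, per the text

# 12 opaque bytes of WAL-related fields, then four 2-byte unsigned integers.
# "=" selects native byte order without padding (an assumption about the file).
HEADER_FORMAT = "=12sHHHH"


def parse_page_header(page: bytes) -> dict:
    """Decode the fixed page header from the start of a page image."""
    wal, lower, upper, special, size_version = struct.unpack_from(
        HEADER_FORMAT, page, 0)
    return {
        "wal_fields": wal,
        "pd_lower": lower,        # start of unallocated space
        "pd_upper": upper,        # end of unallocated space
        "pd_special": special,    # start of special space
        # Assumed encoding: size in the high byte, version in the low byte.
        "page_size": size_version & 0xFF00,
        "layout_version": size_version & 0x00FF,
        # Derived values, not stored on the page:
        "free_space": upper - lower,
        "item_count": (lower - PAGE_HEADER_SIZE) // ITEM_ID_SIZE,
    }
```

On an 8192-byte table page written by an 8.0 server, one would expect <literal>page_size</literal> to come out as 8192, <literal>layout_version</literal> as 2, and <structfield>pd_special</structfield> to equal the page size.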
@ -156,25 +181,12 @@ data. Empty in ordinary tables.</entry>
<filename>src/include/storage/bufpage.h</filename>.
</para>

<para>
Special space is a region at the end of the page that is allocated at page
initialization time and contains information specific to an access method.
The last 2 bytes of the page header,
<structfield>pd_pagesize_version</structfield>, store both the page size
and a version indicator. Beginning with
<productname>PostgreSQL</productname> 7.3 the version number is 1; prior
releases used version number 0. (The basic page layout and header format
has not changed, but the layout of heap row headers has.) The page size
is basically only present as a cross-check; there is no support for having
more than one page size in an installation.
</para>

<para>

Following the page header are item identifiers
(<type>ItemIdData</type>), each requiring four bytes.
An item identifier contains a byte-offset to
the start of an item, its length in bytes, and a set of attribute bits
the start of an item, its length in bytes, and a few attribute bits
which affect its interpretation.
New item identifiers are allocated
as needed from the beginning of the unallocated space.
@ -203,16 +215,18 @@ data. Empty in ordinary tables.</entry>
<para>

The final section is the <quote>special section</quote> which may
contain anything the access method wishes to store. Ordinary tables
do not use this at all (indicated by setting
<structfield>pd_special</> to equal the pagesize).
contain anything the access method wishes to store. For example,
b-tree indexes store links to the page's left and right siblings,
as well as some other data relevant to the index structure.
Ordinary tables do not use a special section at all (indicated by setting
<structfield>pd_special</> to equal the page size).

</para>

<para>

All table rows are structured the same way. There is a fixed-size
header (occupying 23 bytes on most machines), followed by an optional null
All table rows are structured in the same way. There is a fixed-size
header (occupying 27 bytes on most machines), followed by an optional null
bitmap, an optional object ID field, and the user data. The header is
detailed
in <xref linkend="heaptupleheaderdata-table">. The actual user data
@ -258,7 +272,7 @@ data. Empty in ordinary tables.</entry>
<entry>t_cmin</entry>
<entry>CommandId</entry>
<entry>4 bytes</entry>
<entry>insert CID stamp (overlays with t_xmax)</entry>
<entry>insert CID stamp</entry>
</row>
<row>
<entry>t_xmax</entry>
@ -276,7 +290,7 @@ data. Empty in ordinary tables.</entry>
<entry>t_xvac</entry>
<entry>TransactionId</entry>
<entry>4 bytes</entry>
<entry>XID for VACUUM operation moving row version</entry>
<entry>XID for VACUUM operation moving a row version</entry>
</row>
<row>
<entry>t_ctid</entry>
@ -294,7 +308,7 @@ data. Empty in ordinary tables.</entry>
<entry>t_infomask</entry>
<entry>uint16</entry>
<entry>2 bytes</entry>
<entry>various flags</entry>
<entry>various flag bits</entry>
</row>
<row>
<entry>t_hoff</entry>
@ -314,9 +328,10 @@ data. Empty in ordinary tables.</entry>
<para>

Interpreting the actual data can only be done with information obtained
from other tables, mostly <firstterm>pg_attribute</firstterm>. The
particular fields are <structfield>attlen</structfield> and
<structfield>attalign</structfield>. There is no way to directly get a
from other tables, mostly <structname>pg_attribute</structname>. The
key values needed to identify field locations are
<structfield>attlen</structfield> and <structfield>attalign</structfield>.
There is no way to directly get a
particular attribute, except when there are only fixed width fields and no
NULLs. All this trickery is wrapped up in the functions
<firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
@ -329,10 +344,11 @@ data. Empty in ordinary tables.</entry>
whether the field is NULL according to the null bitmap. If it is, go to
the next. Then make sure you have the right alignment. If the field is a
fixed width field, then all the bytes are simply placed. If it's a
variable length field (attlen == -1) then it's a bit more complicated,
using the variable length structure <type>varattrib</type>.
Depending on the flags, the data may be either inline, compressed or in
another table (TOAST).
variable length field (attlen = -1) then it's a bit more complicated.
All variable-length datatypes share the common header structure
<type>varattrib</type>, which includes the total length of the stored
value and some flag bits. Depending on the flags, the data may be either
inline or in another table (TOAST); it might be compressed, too.

</para>
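
The field-walking procedure described above can be sketched for the simple case of fixed-width fields. This Python fragment is only an illustration and not how <firstterm>heap_getattr</firstterm> is implemented: the function name is hypothetical, offsets are counted from the start of the user data (assumed to be maximally aligned), and the mapping of <structfield>attalign</structfield> codes to byte boundaries (c=1, s=2, i=4, d=8) is an assumption about typical platforms.

```python
# Assumed mapping from attalign codes to alignment boundaries in bytes.
ALIGNMENT = {"c": 1, "s": 2, "i": 4, "d": 8}


def fixed_width_offsets(attributes, is_null):
    """Compute the offset of each field within a row's user data.

    attributes: list of (attlen, attalign) pairs, in column order.
    is_null:    list of booleans taken from the row's null bitmap.
    Returns a list with one offset per field, or None for NULL fields.
    """
    offsets = []
    position = 0
    for (attlen, attalign), null in zip(attributes, is_null):
        if null:
            # A NULL field stores nothing; just move on to the next one.
            offsets.append(None)
            continue
        if attlen < 0:
            # Variable-length fields need the length stored in the data
            # itself, which this sketch does not read.
            raise ValueError("variable-length field not handled in this sketch")
        align = ALIGNMENT[attalign]
        # Round the position up to the field's alignment boundary.
        position = (position + align - 1) // align * align
        offsets.append(position)
        position += attlen
    return offsets
```

The sketch also shows why a field cannot be located directly in general: every offset depends on which earlier fields are NULL and on how much padding their alignment required.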
</chapter>
@ -1,5 +1,5 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.64 2004/04/20 01:11:49 momjian Exp $
$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.65 2004/11/12 21:50:53 tgl Exp $
-->

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [
@ -235,6 +235,7 @@ $PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.64 2004/04/20 01:11:49 momjian
&geqo;
&indexcost;
&gist;
&filelayout;
&page;
&bki;