diff --git a/doc/src/sgml/filelayout.sgml b/doc/src/sgml/filelayout.sgml
new file mode 100644
index 0000000000..8b7381078a
--- /dev/null
+++ b/doc/src/sgml/filelayout.sgml
@@ -0,0 +1,161 @@
+
+
+
+
+Database File Layout
+
+
+
+A description of the database physical storage layout.
+
+
+
+
+This section provides an overview of the physical format used by
+PostgreSQL databases.
+
+
+
+All the data needed for a database cluster is stored within the cluster's data
+directory, commonly referred to as PGDATA (after the name of the
+environment variable that can be used to define it). A common location for
+PGDATA is /var/lib/pgsql/data. Multiple clusters,
+managed by different postmasters, can exist on the same machine.
+
+
+
+The PGDATA directory contains several subdirectories and control
+files, as shown in . In addition to
+these required items, the cluster configuration files
+postgresql.conf, pg_hba.conf, and
+pg_ident.conf are traditionally stored in
+PGDATA (although beginning in
+PostgreSQL 8.0 it is possible to keep them
+elsewhere).
+
+
+
+Contents of PGDATA
+
+
+
+
+Item
+
+Description
+
+
+
+
+
+
+ PG_VERSION
+ A file containing the major version number of PostgreSQL
+
+
+
+ base
+ Subdirectory containing per-database subdirectories
+
+
+
+ global
+ Subdirectory containing cluster-wide tables, such as
+ pg_database
+
+
+
+ pg_clog
+ Subdirectory containing transaction commit status data
+
+
+
+ pg_subtrans
+ Subdirectory containing subtransaction status data
+
+
+
+ pg_tblspc
+ Subdirectory containing symbolic links to tablespaces
+
+
+
+ pg_xlog
+ Subdirectory containing WAL (Write Ahead Log) files
+
+
+
+ postmaster.opts
+ A file recording the command-line options the postmaster was
+last started with
+
+
+
+ postmaster.pid
+ A lock file recording the current postmaster PID and shared memory
+segment ID (not present after postmaster shutdown)
+
+
+
+
+
+
+
+For each database in the cluster there is a subdirectory within
+PGDATA/base, named after the database's OID in
+pg_database. This subdirectory is the default location
+for the database's files; in particular, its system catalogs are stored
+there.
+
+
+
+Each table and index is stored in a separate file, named after the table
+or index's filenode number, which can be found in
+pg_class.relfilenode.
+
+
+
+
+Note that while a table's filenode often matches its OID, this is
+not necessarily the case; some operations, like
+TRUNCATE, REINDEX, CLUSTER and some forms
+of ALTER TABLE, can change the filenode while preserving the OID.
+Avoid assuming that filenode and table OID are the same.
+
+
+
+
+When a table or index exceeds 1 GB, it is divided into gigabyte-sized
+segments. The first segment's file name is the same as the
+filenode; subsequent segments are named filenode.1, filenode.2, etc.
+This arrangement avoids problems on platforms that have file size limitations.
+The contents of tables and indexes are discussed further in
+.
+
+
+
+A table that has columns with potentially large entries will have an
+associated TOAST table, which is used for out-of-line storage of
+field values that are too large to keep in the table rows proper.
+pg_class.reltoastrelid links from a table to
+its TOAST table, if any.
+
+
+
+Tablespaces make the scenario more complicated. Each non-default tablespace
+has a symbolic link inside the PGDATA/pg_tblspc
+directory, which points to the physical tablespace directory (as specified in
+its CREATE TABLESPACE command). The symbolic link is named after
+the tablespace's OID. Inside the physical tablespace directory there is
+a subdirectory for each database that has elements in the tablespace, named
+after the database's OID. Tables within that directory follow the filenode
+naming scheme. The pg_default tablespace is not accessed through
+pg_tblspc, but corresponds to
+PGDATA/base. Similarly, the pg_global
+tablespace is not accessed through pg_tblspc, but corresponds to
+PGDATA/global.
+
+
+
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index d8e5b30ab2..427b4739ec 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -1,4 +1,4 @@
-
+
@@ -74,6 +74,7 @@
+
diff --git a/doc/src/sgml/page.sgml b/doc/src/sgml/page.sgml
index ebafa46598..8f2388af6a 100644
--- a/doc/src/sgml/page.sgml
+++ b/doc/src/sgml/page.sgml
@@ -1,10 +1,10 @@
-Page Files
+Database Page Layout
@@ -14,11 +14,15 @@ A description of the database file page format.
This section provides an overview of the page format used by
-PostgreSQL tables and indexes. (Index
-access methods need not use this page format. At present, all index
-methods do use this basic format, but the data kept on index metapages
-usually doesn't follow the item layout rules exactly.) TOAST tables
-and sequences are formatted just like a regular table.
+PostgreSQL tables and indexes.
+
+ Actually, index access methods need not use this page format.
+ All the existing index methods do use this basic format,
+ but the data kept on index metapages usually doesn't follow
+ the item layout rules.
+
+
+TOAST tables and sequences are formatted just like a regular table.
@@ -31,14 +35,22 @@ an item is a row; in an index, an item is an index entry.
+Every table and index is stored as an array of pages of a
+fixed size (usually 8 kB, although a different page size can be selected
+when compiling the server). In a table, all the pages are logically
+equivalent, so a particular item (row) can be stored in any page. In
+indexes, the first page is generally reserved as a metapage
+holding control information, and there may be different types of pages
+within the index, depending on the index access method.
+
- shows the basic layout of a page.
+
+ shows the overall layout of a page.
There are five parts to each page.
-
-Sample Page Layout
+Overall Page Layout
Page Layout
@@ -60,12 +72,14 @@ free space pointers.
ItemPointerData
-Array of (offset,length) pairs pointing to the actual items.
+Array of (offset,length) pairs pointing to the actual items.
+4 bytes per item.
Free space
-The unallocated space. All new rows are allocated from here, generally from the end.
+The unallocated space. New item pointers are allocated from the start
+of this area, new items from the end.
@@ -74,7 +88,7 @@ free space pointers.
-Special Space
+Special space
Index access method specific data. Different methods store different
data. Empty in ordinary tables.
@@ -87,13 +101,24 @@ data. Empty in ordinary tables.
The first 20 bytes of each page consists of a page header
(PageHeaderData). Its format is detailed in . The first two fields deal with WAL
- related stuff. This is followed by three 2-byte integer fields
+ linkend="pageheaderdata-table">. The first two fields track the most
+ recent WAL entry related to this page. They are followed by three 2-byte
+ integer fields
(pd_lower, pd_upper,
- and pd_special). These represent byte offsets to
- the start
+ and pd_special). These contain byte offsets
+ from the page start to the start
of unallocated space, to the end of unallocated space, and to the start of
the special space.
+ The last 2 bytes of the page header,
+ pd_pagesize_version, store both the page size
+ and a version indicator. Beginning with
+ PostgreSQL 8.0 the version number is 2;
+ PostgreSQL 7.3 and 7.4 used version number 1;
+ prior releases used version number 0.
+ (The basic page layout and header format have not changed in these versions,
+ but the layout of heap row headers has.) The page size
+ is present only as a cross-check; there is no support for having
+ more than one page size in an installation.
@@ -156,25 +181,12 @@ data. Empty in ordinary tables.
src/include/storage/bufpage.h.
-
- Special space is a region at the end of the page that is allocated at page
- initialization time and contains information specific to an access method.
- The last 2 bytes of the page header,
- pd_pagesize_version, store both the page size
- and a version indicator. Beginning with
- PostgreSQL 7.3 the version number is 1; prior
- releases used version number 0. (The basic page layout and header format
- has not changed, but the layout of heap row headers has.) The page size
- is basically only present as a cross-check; there is no support for having
- more than one page size in an installation.
-
-
Following the page header are item identifiers
(ItemIdData), each requiring four bytes.
An item identifier contains a byte-offset to
- the start of an item, its length in bytes, and a set of attribute bits
+ the start of an item, its length in bytes, and a few attribute bits
which affect its interpretation.
New item identifiers are allocated
as needed from the beginning of the unallocated space.
@@ -203,16 +215,18 @@ data. Empty in ordinary tables.
The final section is the special section
which may
- contain anything the access method wishes to store. Ordinary tables
- do not use this at all (indicated by setting
- pd_special to equal the pagesize).
+ contain anything the access method wishes to store. For example,
+ b-tree indexes store links to the page's left and right siblings,
+ as well as some other data relevant to the index structure.
+ Ordinary tables do not use a special section at all (indicated by setting
+ pd_special to equal the page size).
- All table rows are structured the same way. There is a fixed-size
- header (occupying 23 bytes on most machines), followed by an optional null
+ All table rows are structured in the same way. There is a fixed-size
+ header (occupying 27 bytes on most machines), followed by an optional null
bitmap, an optional object ID field, and the user data. The header is
detailed
in . The actual user data
@@ -258,7 +272,7 @@ data. Empty in ordinary tables.
t_cmin
CommandId
4 bytes
- insert CID stamp (overlays with t_xmax)
+ insert CID stamp
t_xmax
@@ -276,7 +290,7 @@ data. Empty in ordinary tables.
t_xvac
TransactionId
4 bytes
- XID for VACUUM operation moving row version
+ XID for VACUUM operation moving a row version
t_ctid
@@ -294,7 +308,7 @@ data. Empty in ordinary tables.
t_infomask
uint16
2 bytes
- various flags
+ various flag bits
t_hoff
@@ -314,9 +328,10 @@ data. Empty in ordinary tables.
Interpreting the actual data can only be done with information obtained
- from other tables, mostly pg_attribute. The
- particular fields are attlen and
- attalign. There is no way to directly get a
+ from other tables, mostly pg_attribute. The
+ key values needed to identify field locations are
+ attlen and attalign.
+ There is no way to directly get a
particular attribute, except when there are only fixed width fields and no
NULLs. All this trickery is wrapped up in the functions
heap_getattr, fastgetattr
@@ -329,10 +344,11 @@ data. Empty in ordinary tables.
whether the field is NULL according to the null bitmap. If it is, go to
the next. Then make sure you have the right alignment. If the field is a
fixed width field, then all the bytes are simply placed. If it's a
- variable length field (attlen == -1) then it's a bit more complicated,
- using the variable length structure varattrib.
- Depending on the flags, the data may be either inline, compressed or in
- another table (TOAST).
+ variable length field (attlen = -1) then it's a bit more complicated.
+ All variable-length data types share the common header structure
+ varattrib, which includes the total length of the stored
+ value and some flag bits. Depending on the flags, the data may be either
+ inline or in another table (TOAST); it might be compressed, too.
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index 159a9f3ca2..ca0d55c3b4 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -1,5 +1,5 @@