Force default wal_sync_method to be fdatasync on Linux.
Recent versions of the Linux system header files cause xlogdefs.h to believe that open_datasync should be the default sync method, whereas formerly fdatasync was the default on Linux. open_datasync is a bad choice, first because it doesn't actually outperform fdatasync (in fact the reverse), and second because we try to use O_DIRECT with it, causing failures on certain filesystems (e.g., ext4 with data=journal option). This part of the patch is largely per a proposal from Marti Raudsepp. More extensive changes are likely to follow in HEAD, but this is as much change as we want to back-patch.

Also clean up confusing code and incorrect documentation surrounding the fsync_writethrough option. Those changes shouldn't result in any actual behavioral change, but I chose to back-patch them anyway to keep the branches looking similar in this area.

In 9.0 and HEAD, also do some copy-editing on the WAL Reliability documentation section.

Back-patch to all supported branches, since any of them might get used on modern Linux versions.
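As a rough illustration of the selection rule described in the message above, the following standalone sketch mimics the preprocessor cascade that the patch adds to xlogdefs.h. The numeric values of the SYNC_METHOD_* constants, the simulated OPEN_DATASYNC_FLAG and HAVE_FDATASYNC definitions, and the helper functions sync_method_name() and default_sync_method_name() are assumptions made for this example only; they are not taken from the PostgreSQL source.

```c
#include <stdio.h>

/* Illustrative stand-ins for the SYNC_METHOD_* constants (values assumed). */
#define SYNC_METHOD_FSYNC       0
#define SYNC_METHOD_FDATASYNC   1
#define SYNC_METHOD_OPEN_DSYNC  2

/*
 * Simulate what a build on a recent Linux system would see: both an
 * O_DSYNC-style open flag and fdatasync() are reported as available.
 */
#define OPEN_DATASYNC_FLAG 1
#define HAVE_FDATASYNC     1

/* The override the patch places in the Linux port header. */
#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC

/* Same cascade as the patched header: a platform override wins first. */
#if defined(PLATFORM_DEFAULT_SYNC_METHOD)
#define DEFAULT_SYNC_METHOD PLATFORM_DEFAULT_SYNC_METHOD
#elif defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN_DSYNC
#elif defined(HAVE_FDATASYNC)
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
#else
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC
#endif

/* Hypothetical helper: map a method constant to its wal_sync_method name. */
const char *
sync_method_name(int method)
{
    switch (method)
    {
        case SYNC_METHOD_FSYNC:
            return "fsync";
        case SYNC_METHOD_FDATASYNC:
            return "fdatasync";
        case SYNC_METHOD_OPEN_DSYNC:
            return "open_datasync";
        default:
            return "unknown";
    }
}

/* Hypothetical helper: name of the default chosen at compile time. */
const char *
default_sync_method_name(void)
{
    return sync_method_name(DEFAULT_SYNC_METHOD);
}
```

Removing the PLATFORM_DEFAULT_SYNC_METHOD line in this sketch makes the cascade fall through to the OPEN_DATASYNC_FLAG branch, which is the behavior the commit message describes on recent Linux headers before the patch.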
parent e620ee35b2
commit 576477e73c
@ -1460,7 +1460,7 @@ SET ENABLE_SEQSCAN TO OFF;
|
|||||||
<para>
|
<para>
|
||||||
While turning off <varname>fsync</varname> is often a performance
|
While turning off <varname>fsync</varname> is often a performance
|
||||||
benefit, this can result in unrecoverable data corruption in
|
benefit, this can result in unrecoverable data corruption in
|
||||||
the event of an unexpected system shutdown or crash. Thus it
|
the event of a power failure or system crash. Thus it
|
||||||
is only advisable to turn off <varname>fsync</varname> if
|
is only advisable to turn off <varname>fsync</varname> if
|
||||||
you can easily recreate your entire database from external
|
you can easily recreate your entire database from external
|
||||||
data.
|
data.
|
||||||
@ -1468,10 +1468,11 @@ SET ENABLE_SEQSCAN TO OFF;
|
|||||||
|
|
||||||
<para>
|
<para>
|
||||||
Examples of safe circumstances for turning off
|
Examples of safe circumstances for turning off
|
||||||
<varname>fsync</varname> include the initial loading a new
|
<varname>fsync</varname> include the initial loading of a new
|
||||||
database cluster from a backup file, using a database cluster
|
database cluster from a backup file, using a database cluster
|
||||||
for processing statistics on an hourly basis which is then
|
for processing a batch of data after which the database
|
||||||
recreated, or for a reporting read-only database clone which
|
will be thrown away and recreated,
|
||||||
|
or for a read-only database clone which
|
||||||
gets recreated frequently and is not used for failover. High
|
gets recreated frequently and is not used for failover. High
|
||||||
quality hardware alone is not a sufficient justification for
|
quality hardware alone is not a sufficient justification for
|
||||||
turning off <varname>fsync</varname>.
|
turning off <varname>fsync</varname>.
|
||||||
@ -1554,12 +1555,12 @@ SET ENABLE_SEQSCAN TO OFF;
|
|||||||
</listitem>
|
</listitem>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>
|
<para>
|
||||||
<literal>fsync_writethrough</> (call <function>fsync()</> at each commit, forcing write-through of any disk write cache)
|
<literal>fsync</> (call <function>fsync()</> at each commit)
|
||||||
</para>
|
</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>
|
<para>
|
||||||
<literal>fsync</> (call <function>fsync()</> at each commit)
|
<literal>fsync_writethrough</> (call <function>fsync()</> at each commit, forcing write-through of any disk write cache)
|
||||||
</para>
|
</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
<listitem>
|
<listitem>
|
||||||
@ -1569,16 +1570,15 @@ SET ENABLE_SEQSCAN TO OFF;
|
|||||||
</listitem>
|
</listitem>
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
<para>
|
<para>
|
||||||
Not all of these choices are available on all platforms.
|
|
||||||
The <literal>open_</>* options also use <literal>O_DIRECT</> if available.
|
The <literal>open_</>* options also use <literal>O_DIRECT</> if available.
|
||||||
|
Not all of these choices are available on all platforms.
|
||||||
The default is the first method in the above list that is supported
|
The default is the first method in the above list that is supported
|
||||||
by the platform. The default is not necessarily ideal; it might be
|
by the platform, except that <literal>fdatasync</> is the default on
|
||||||
|
Linux. The default is not necessarily ideal; it might be
|
||||||
necessary to change this setting or other aspects of your system
|
necessary to change this setting or other aspects of your system
|
||||||
configuration in order to create a crash-safe configuration or
|
configuration in order to create a crash-safe configuration or
|
||||||
achieve optimal performance.
|
achieve optimal performance.
|
||||||
These aspects are discussed in <xref linkend="wal-reliability">.
|
These aspects are discussed in <xref linkend="wal-reliability">.
|
||||||
The utility <filename>src/tools/fsync</> in the PostgreSQL source tree
|
|
||||||
can do performance testing of various fsync methods.
|
|
||||||
This parameter can only be set in the <filename>postgresql.conf</>
|
This parameter can only be set in the <filename>postgresql.conf</>
|
||||||
file or on the server command line.
|
file or on the server command line.
|
||||||
</para>
|
</para>
|
||||||
@ -1686,21 +1686,20 @@ SET ENABLE_SEQSCAN TO OFF;
|
|||||||
When the commit data for a transaction is flushed to disk, any
|
When the commit data for a transaction is flushed to disk, any
|
||||||
additional commits ready at that time are also flushed out.
|
additional commits ready at that time are also flushed out.
|
||||||
<varname>commit_delay</varname> adds a time delay, set in
|
<varname>commit_delay</varname> adds a time delay, set in
|
||||||
microseconds, before writing some commit records to the WAL
|
microseconds, before a transaction attempts to
|
||||||
buffer and flushing the buffer out to disks. A nonzero delay
|
flush the WAL buffer out to disk. A nonzero delay can allow more
|
||||||
can allow more transactions to be committed with only one call
|
transactions to be committed with only one flush operation, if
|
||||||
to the active <varname>wal_sync_method</varname>, if
|
|
||||||
system load is high enough that additional transactions become
|
system load is high enough that additional transactions become
|
||||||
ready to commit within the given interval. But the delay is
|
ready to commit within the given interval. But the delay is
|
||||||
just wasted if no other transactions become ready to
|
just wasted if no other transactions become ready to
|
||||||
commit. Therefore, the delay is only performed if at least
|
commit. Therefore, the delay is only performed if at least
|
||||||
<varname>commit_siblings</varname> other transactions are
|
<varname>commit_siblings</varname> other transactions are
|
||||||
active at the instant that a server process has written its
|
active at the instant that a server process has written its
|
||||||
commit record. The default is zero (no delay). Since
|
commit record.
|
||||||
all pending commit data flushes are written at every flush
|
The default <varname>commit_delay</> is zero (no delay).
|
||||||
regardless of this setting, it is rare that adding delay to
|
Since all pending commit data will be written at every flush
|
||||||
that by increasing this parameter will actually improve commit
|
regardless of this setting, it is rare that adding delay
|
||||||
performance.
|
by increasing this parameter will actually improve performance.
|
||||||
</para>
|
</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
@ -27,7 +27,7 @@
|
|||||||
</para>
|
</para>
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
While forcing data periodically to the disk platters might seem like
|
While forcing data to the disk platters periodically might seem like
|
||||||
a simple operation, it is not. Because disk drives are dramatically
|
a simple operation, it is not. Because disk drives are dramatically
|
||||||
slower than main memory and CPUs, several layers of caching exist
|
slower than main memory and CPUs, several layers of caching exist
|
||||||
between the computer's main memory and the disk platters.
|
between the computer's main memory and the disk platters.
|
||||||
@ -48,7 +48,7 @@
|
|||||||
some later time. Such caches can be a reliability hazard because the
|
some later time. Such caches can be a reliability hazard because the
|
||||||
memory in the disk controller cache is volatile, and will lose its
|
memory in the disk controller cache is volatile, and will lose its
|
||||||
contents in a power failure. Better controller cards have
|
contents in a power failure. Better controller cards have
|
||||||
<firstterm>battery-backed unit</> (<acronym>BBU</>) caches, meaning
|
<firstterm>battery-backup units</> (<acronym>BBU</>s), meaning
|
||||||
the card has a battery that
|
the card has a battery that
|
||||||
maintains power to the cache in case of system power loss. After power
|
maintains power to the cache in case of system power loss. After power
|
||||||
is restored the data will be written to the disk drives.
|
is restored the data will be written to the disk drives.
|
||||||
@ -57,15 +57,10 @@
|
|||||||
<para>
|
<para>
|
||||||
And finally, most disk drives have caches. Some are write-through
|
And finally, most disk drives have caches. Some are write-through
|
||||||
while some are write-back, and the same concerns about data loss
|
while some are write-back, and the same concerns about data loss
|
||||||
exist for write-back drive caches as exist for disk controller
|
exist for write-back drive caches as for disk controller
|
||||||
caches. Consumer-grade IDE and SATA drives are particularly likely
|
caches. Consumer-grade IDE and SATA drives are particularly likely
|
||||||
to have write-back caches that will not survive a power failure,
|
to have write-back caches that will not survive a power failure. Many
|
||||||
though <acronym>ATAPI-6</> introduced a drive cache flush command
|
solid-state drives (SSD) also have volatile write-back caches.
|
||||||
(<command>FLUSH CACHE EXT</>) that some file systems use, e.g.
|
|
||||||
<acronym>ZFS</>, <acronym>ext4</>. (The SCSI command
|
|
||||||
<command>SYNCHRONIZE CACHE</> has long been available.) Many
|
|
||||||
solid-state drives (SSD) also have volatile write-back caches, and
|
|
||||||
many do not honor cache flush commands by default.
|
|
||||||
</para>
|
</para>
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
@ -81,7 +76,7 @@
|
|||||||
a <literal>*</> next to <literal>Write cache</>. <command>hdparm -W</>
|
a <literal>*</> next to <literal>Write cache</>. <command>hdparm -W</>
|
||||||
can be used to turn off write caching. SCSI drives can be queried
|
can be used to turn off write caching. SCSI drives can be queried
|
||||||
using <ulink url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</></ulink>.
|
using <ulink url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</></ulink>.
|
||||||
for SCSI drives. Use <command>sdparm --get=WCE</command> to check
|
Use <command>sdparm --get=WCE</command> to check
|
||||||
whether the write cache is enabled and <command>sdparm --clear=WCE</>
|
whether the write cache is enabled and <command>sdparm --clear=WCE</>
|
||||||
to disable it.
|
to disable it.
|
||||||
</para>
|
</para>
|
||||||
@ -107,35 +102,40 @@
|
|||||||
<listitem>
|
<listitem>
|
||||||
<para>
|
<para>
|
||||||
On <productname>Windows</>, if <varname>wal_sync_method</> is
|
On <productname>Windows</>, if <varname>wal_sync_method</> is
|
||||||
<literal>open_datasync</> (the default), write caching is disabled
|
<literal>open_datasync</> (the default), write caching can be disabled
|
||||||
by unchecking <literal>My Computer\Open\{select disk drive}\Properties\Hardware\Properties\Policies\Enable write caching on the disk</>.
|
by unchecking <literal>My Computer\Open\<replaceable>disk drive</>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</>.
|
||||||
Alternatively, set <varname>wal_sync_method</varname> to <literal>fsync</> or <literal>fsync_writethrough</>, which never do write caching.
|
Alternatively, set <varname>wal_sync_method</varname> to
|
||||||
|
<literal>fsync</> or <literal>fsync_writethrough</>, which prevent
|
||||||
|
write caching.
|
||||||
</para>
|
</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
|
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>
|
<para>
|
||||||
On <productname>MacOS X</productname>, write caching can be disabled by
|
On <productname>Mac OS X</productname>, write caching can be prevented by
|
||||||
setting <varname>wal_sync_method</> to <literal>fsync_writethrough</>.
|
setting <varname>wal_sync_method</> to <literal>fsync_writethrough</>.
|
||||||
</para>
|
</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
Many file systems that use write barriers (e.g. <acronym>ZFS</>,
|
Recent SATA drives (those following <acronym>ATAPI-6</> or later)
|
||||||
<acronym>ext4</>) internally use <command>FLUSH CACHE EXT</> or
|
offer a drive cache flush command (<command>FLUSH CACHE EXT</>),
|
||||||
<command>SYNCHRONIZE CACHE</> commands to flush data to the platters on
|
while SCSI drives have long supported a similar command
|
||||||
write-back-enabled drives. Unfortunately, such write barrier file
|
<command>SYNCHRONIZE CACHE</>. These commands are not directly
|
||||||
systems behave suboptimally when combined with battery-backed unit
|
accessible to <productname>PostgreSQL</>, but some file systems
|
||||||
|
(e.g., <acronym>ZFS</>, <acronym>ext4</>) can use them to flush
|
||||||
|
data to the platters on write-back-enabled drives. Unfortunately, such
|
||||||
|
file systems behave suboptimally when combined with battery-backup unit
|
||||||
(<acronym>BBU</>) disk controllers. In such setups, the synchronize
|
(<acronym>BBU</>) disk controllers. In such setups, the synchronize
|
||||||
command forces all data from the BBU to the disks, eliminating much
|
command forces all data from the controller cache to the disks,
|
||||||
of the benefit of the BBU. You can run the utility
|
eliminating much of the benefit of the BBU. You can run the utility
|
||||||
<filename>src/tools/fsync</> in the PostgreSQL source tree to see
|
<filename>src/tools/fsync</> in the PostgreSQL source tree to see
|
||||||
if you are affected. If you are affected, the performance benefits
|
if you are affected. If you are affected, the performance benefits
|
||||||
of the BBU cache can be regained by turning off write barriers in
|
of the BBU can be regained by turning off write barriers in
|
||||||
the file system or reconfiguring the disk controller, if that is
|
the file system or reconfiguring the disk controller, if that is
|
||||||
an option. If write barriers are turned off, make sure the battery
|
an option. If write barriers are turned off, make sure the battery
|
||||||
remains active; a faulty battery can potentially lead to data loss.
|
remains functional; a faulty battery can potentially lead to data loss.
|
||||||
Hopefully file system and disk controller designers will eventually
|
Hopefully file system and disk controller designers will eventually
|
||||||
address this suboptimal behavior.
|
address this suboptimal behavior.
|
||||||
</para>
|
</para>
|
||||||
@ -148,6 +148,8 @@
|
|||||||
ensure data integrity. Avoid disk controllers that have non-battery-backed
|
ensure data integrity. Avoid disk controllers that have non-battery-backed
|
||||||
write caches. At the drive level, disable write-back caching if the
|
write caches. At the drive level, disable write-back caching if the
|
||||||
drive cannot guarantee the data will be written before shutdown.
|
drive cannot guarantee the data will be written before shutdown.
|
||||||
|
If you use SSDs, be aware that many of these do not honor cache flush
|
||||||
|
commands by default.
|
||||||
You can test for reliable I/O subsystem behavior using <ulink
|
You can test for reliable I/O subsystem behavior using <ulink
|
||||||
url="http://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>.
|
url="http://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>.
|
||||||
</para>
|
</para>
|
||||||
@ -157,16 +159,17 @@
|
|||||||
operations themselves. Disk platters are divided into sectors,
|
operations themselves. Disk platters are divided into sectors,
|
||||||
commonly 512 bytes each. Every physical read or write operation
|
commonly 512 bytes each. Every physical read or write operation
|
||||||
processes a whole sector.
|
processes a whole sector.
|
||||||
When a write request arrives at the drive, it might be for 512 bytes,
|
When a write request arrives at the drive, it might be for some multiple
|
||||||
1024 bytes, or 8192 bytes, and the process of writing could fail due
|
of 512 bytes (<productname>PostgreSQL</> typically writes 8192 bytes, or
|
||||||
|
16 sectors, at a time), and the process of writing could fail due
|
||||||
to power loss at any time, meaning some of the 512-byte sectors were
|
to power loss at any time, meaning some of the 512-byte sectors were
|
||||||
written, and others were not. To guard against such failures,
|
written while others were not. To guard against such failures,
|
||||||
<productname>PostgreSQL</> periodically writes full page images to
|
<productname>PostgreSQL</> periodically writes full page images to
|
||||||
permanent WAL storage <emphasis>before</> modifying the actual page on
|
permanent WAL storage <emphasis>before</> modifying the actual page on
|
||||||
disk. By doing this, during crash recovery <productname>PostgreSQL</> can
|
disk. By doing this, during crash recovery <productname>PostgreSQL</> can
|
||||||
restore partially-written pages. If you have a battery-backed disk
|
restore partially-written pages from WAL. If you have a battery-backed disk
|
||||||
controller or file-system software that prevents partial page writes
|
controller or file-system software that prevents partial page writes
|
||||||
(e.g., ZFS), you can turn off this page imaging by turning off the
|
(e.g., ZFS), you can safely turn off this page imaging by turning off the
|
||||||
<xref linkend="guc-full-page-writes"> parameter.
|
<xref linkend="guc-full-page-writes"> parameter.
|
||||||
</para>
|
</para>
|
||||||
</sect1>
|
</sect1>
|
||||||
|
@ -260,12 +260,13 @@ static bool looks_like_temp_rel_name(const char *name);
|
|||||||
int
|
int
|
||||||
pg_fsync(int fd)
|
pg_fsync(int fd)
|
||||||
{
|
{
|
||||||
#ifndef HAVE_FSYNC_WRITETHROUGH_ONLY
|
/* #if is to skip the sync_method test if there's no need for it */
|
||||||
if (sync_method != SYNC_METHOD_FSYNC_WRITETHROUGH)
|
#if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
|
||||||
return pg_fsync_no_writethrough(fd);
|
if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
|
||||||
|
return pg_fsync_writethrough(fd);
|
||||||
else
|
else
|
||||||
#endif
|
#endif
|
||||||
return pg_fsync_writethrough(fd);
|
return pg_fsync_no_writethrough(fd);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@ -157,7 +157,7 @@
|
|||||||
#wal_sync_method = fsync # the default is the first option
|
#wal_sync_method = fsync # the default is the first option
|
||||||
# supported by the operating system:
|
# supported by the operating system:
|
||||||
# open_datasync
|
# open_datasync
|
||||||
# fdatasync
|
# fdatasync (default on Linux)
|
||||||
# fsync
|
# fsync
|
||||||
# fsync_writethrough
|
# fsync_writethrough
|
||||||
# open_sync
|
# open_sync
|
||||||
|
@ -123,12 +123,12 @@ typedef uint32 TimeLineID;
|
|||||||
#endif
|
#endif
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
#if defined(OPEN_DATASYNC_FLAG)
|
#if defined(PLATFORM_DEFAULT_SYNC_METHOD)
|
||||||
|
#define DEFAULT_SYNC_METHOD PLATFORM_DEFAULT_SYNC_METHOD
|
||||||
|
#elif defined(OPEN_DATASYNC_FLAG)
|
||||||
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN_DSYNC
|
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN_DSYNC
|
||||||
#elif defined(HAVE_FDATASYNC)
|
#elif defined(HAVE_FDATASYNC)
|
||||||
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
|
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
|
||||||
#elif defined(HAVE_FSYNC_WRITETHROUGH_ONLY)
|
|
||||||
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC_WRITETHROUGH
|
|
||||||
#else
|
#else
|
||||||
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC
|
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC
|
||||||
#endif
|
#endif
|
||||||
|
@ -12,3 +12,11 @@
|
|||||||
* to have a kernel version test here.
|
* to have a kernel version test here.
|
||||||
*/
|
*/
|
||||||
#define HAVE_LINUX_EIDRM_BUG
|
#define HAVE_LINUX_EIDRM_BUG
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Set the default wal_sync_method to fdatasync. With recent Linux versions,
|
||||||
|
* xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
|
||||||
|
* perform better and (b) causes outright failures on ext4 data=journal
|
||||||
|
* filesystems, because those don't support O_DIRECT.
|
||||||
|
*/
|
||||||
|
#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
|
||||||
|
@ -34,15 +34,19 @@
|
|||||||
/* Must be here to avoid conflicting with prototype in windows.h */
|
/* Must be here to avoid conflicting with prototype in windows.h */
|
||||||
#define mkdir(a,b) mkdir(a)
|
#define mkdir(a,b) mkdir(a)
|
||||||
|
|
||||||
#define HAVE_FSYNC_WRITETHROUGH
|
|
||||||
#define HAVE_FSYNC_WRITETHROUGH_ONLY
|
|
||||||
#define ftruncate(a,b) chsize(a,b)
|
#define ftruncate(a,b) chsize(a,b)
|
||||||
/*
|
|
||||||
* Even though we don't support 'fsync' as a wal_sync_method,
|
/* Windows doesn't have fsync() as such, use _commit() */
|
||||||
* we do fsync() a few other places where _commit() is just fine.
|
|
||||||
*/
|
|
||||||
#define fsync(fd) _commit(fd)
|
#define fsync(fd) _commit(fd)
|
||||||
|
|
||||||
|
/*
|
||||||
|
* For historical reasons, we allow setting wal_sync_method to
|
||||||
|
* fsync_writethrough on Windows, even though it's really identical to fsync
|
||||||
|
* (both code paths wind up at _commit()).
|
||||||
|
*/
|
||||||
|
#define HAVE_FSYNC_WRITETHROUGH
|
||||||
|
#define FSYNC_WRITETHROUGH_IS_FSYNC
|
||||||
|
|
||||||
#define USES_WINSOCK
|
#define USES_WINSOCK
|
||||||
|
|
||||||
/* defines for dynamic linking on Win32 platform */
|
/* defines for dynamic linking on Win32 platform */
|
||||||
|
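To make the interaction between the two Windows macros and the rewritten pg_fsync() easier to follow, here is a minimal standalone sketch, assuming a port that defines both HAVE_FSYNC_WRITETHROUGH and FSYNC_WRITETHROUGH_IS_FSYNC as the Windows port header does after this patch. The function sketch_pg_fsync(), the two stub functions, and their return codes are hypothetical and exist only to show which branch gets compiled; the real functions perform I/O and are not reproduced here.

```c
#include <stdio.h>

/* Illustrative stand-ins for the sync method constants (values assumed). */
enum
{
    SYNC_METHOD_FSYNC,
    SYNC_METHOD_FSYNC_WRITETHROUGH
};

/* Simulate a Windows-like port: writethrough is accepted but equals fsync. */
#define HAVE_FSYNC_WRITETHROUGH
#define FSYNC_WRITETHROUGH_IS_FSYNC

/* Pretend the user selected fsync_writethrough. */
int sync_method = SYNC_METHOD_FSYNC_WRITETHROUGH;

/* Hypothetical stubs: return a tag instead of touching the file. */
int
stub_fsync_no_writethrough(int fd)
{
    (void) fd;
    return 1;
}

int
stub_fsync_writethrough(int fd)
{
    (void) fd;
    return 2;
}

/* Same shape as the patched pg_fsync(): test sync_method only when needed. */
int
sketch_pg_fsync(int fd)
{
#if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
    if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
        return stub_fsync_writethrough(fd);
    else
#endif
        return stub_fsync_no_writethrough(fd);
}
```

With both macros defined, the conditional block is compiled out, so every call takes the plain fsync path regardless of sync_method. On a platform that defines only HAVE_FSYNC_WRITETHROUGH, the test on sync_method would be compiled in and the writethrough stub would be reachable.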