Force default wal_sync_method to be fdatasync on Linux.

Recent versions of the Linux system header files cause xlogdefs.h to believe that open_datasync should be the default sync method, whereas formerly fdatasync was the default on Linux. open_datasync is a bad choice, first because it doesn't actually outperform fdatasync (in fact the reverse), and second because we try to use O_DIRECT with it, causing failures on certain filesystems (e.g., ext4 with data=journal option). This part of the patch is largely per a proposal from Marti Raudsepp. More extensive changes are likely to follow in HEAD, but this is as much change as we want to back-patch.

Also clean up confusing code and incorrect documentation surrounding the fsync_writethrough option. Those changes shouldn't result in any actual behavioral change, but I chose to back-patch them anyway to keep the branches looking similar in this area.

In 9.0 and HEAD, also do some copy-editing on the WAL Reliability documentation section.

Back-patch to all supported branches, since any of them might get used on modern Linux versions.
parent e620ee35b2
commit 576477e73c
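
The selection rule described in the commit message can be sketched as a small stand-alone C fragment. This is only an illustration under stated assumptions: the enum values, and the `1` values given to `OPEN_DATASYNC_FLAG` and `HAVE_FDATASYNC`, are hypothetical stand-ins for what the real headers and configure script provide, and `sync_method_name()` and `default_sync_method_name()` are helpers invented for this sketch. Only the `#if` chain mirrors the patched xlogdefs.h.

```c
/* Hypothetical stand-in for the real definitions; only the names mirror the patch. */
enum sync_method
{
    SYNC_METHOD_FSYNC,
    SYNC_METHOD_FDATASYNC,
    SYNC_METHOD_OPEN_DSYNC,
    SYNC_METHOD_FSYNC_WRITETHROUGH
};

/* Assumption: simulate a recent Linux build where both facilities are detected. */
#define OPEN_DATASYNC_FLAG 1
#define HAVE_FDATASYNC 1

/* The override that the Linux port header supplies after this patch. */
#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC

/* Selection chain: a platform override wins over the capability tests. */
#if defined(PLATFORM_DEFAULT_SYNC_METHOD)
#define DEFAULT_SYNC_METHOD PLATFORM_DEFAULT_SYNC_METHOD
#elif defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN_DSYNC
#elif defined(HAVE_FDATASYNC)
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
#else
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC
#endif

/* Hypothetical helper: map a method to its wal_sync_method spelling. */
static const char *
sync_method_name(enum sync_method m)
{
    switch (m)
    {
        case SYNC_METHOD_FSYNC:
            return "fsync";
        case SYNC_METHOD_FDATASYNC:
            return "fdatasync";
        case SYNC_METHOD_OPEN_DSYNC:
            return "open_datasync";
        case SYNC_METHOD_FSYNC_WRITETHROUGH:
            return "fsync_writethrough";
    }
    return "unknown";
}

/* Hypothetical helper: name of the compile-time default. */
static const char *
default_sync_method_name(void)
{
    return sync_method_name(DEFAULT_SYNC_METHOD);
}
```

With the override removed, the same chain would fall through to `open_datasync` in this simulated environment, which is the behavior the patch is meant to avoid on Linux.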
@ -1460,7 +1460,7 @@ SET ENABLE_SEQSCAN TO OFF;
 <para>
 While turning off <varname>fsync</varname> is often a performance
 benefit, this can result in unrecoverable data corruption in
-the event of an unexpected system shutdown or crash. Thus it
+the event of a power failure or system crash. Thus it
 is only advisable to turn off <varname>fsync</varname> if
 you can easily recreate your entire database from external
 data.
@ -1468,10 +1468,11 @@ SET ENABLE_SEQSCAN TO OFF;

 <para>
 Examples of safe circumstances for turning off
-<varname>fsync</varname> include the initial loading a new
+<varname>fsync</varname> include the initial loading of a new
 database cluster from a backup file, using a database cluster
-for processing statistics on an hourly basis which is then
-recreated, or for a reporting read-only database clone which
+for processing a batch of data after which the database
+will be thrown away and recreated,
+or for a read-only database clone which
 gets recreated frequently and is not used for failover. High
 quality hardware alone is not a sufficient justification for
 turning off <varname>fsync</varname>.
@ -1554,12 +1555,12 @@ SET ENABLE_SEQSCAN TO OFF;
 </listitem>
 <listitem>
 <para>
-<literal>fsync_writethrough</> (call <function>fsync()</> at each commit, forcing write-through of any disk write cache)
+<literal>fsync</> (call <function>fsync()</> at each commit)
 </para>
 </listitem>
 <listitem>
 <para>
-<literal>fsync</> (call <function>fsync()</> at each commit)
+<literal>fsync_writethrough</> (call <function>fsync()</> at each commit, forcing write-through of any disk write cache)
 </para>
 </listitem>
 <listitem>
@ -1569,16 +1570,15 @@ SET ENABLE_SEQSCAN TO OFF;
 </listitem>
 </itemizedlist>
 <para>
-Not all of these choices are available on all platforms.
 The <literal>open_</>* options also use <literal>O_DIRECT</> if available.
+Not all of these choices are available on all platforms.
 The default is the first method in the above list that is supported
-by the platform. The default is not necessarily ideal; it might be
+by the platform, except that <literal>fdatasync</> is the default on
+Linux. The default is not necessarily ideal; it might be
 necessary to change this setting or other aspects of your system
 configuration in order to create a crash-safe configuration or
 achieve optimal performance.
 These aspects are discussed in <xref linkend="wal-reliability">.
-The utility <filename>src/tools/fsync</> in the PostgreSQL source tree
-can do performance testing of various fsync methods.
 This parameter can only be set in the <filename>postgresql.conf</>
 file or on the server command line.
 </para>
@ -1686,21 +1686,20 @@ SET ENABLE_SEQSCAN TO OFF;
 When the commit data for a transaction is flushed to disk, any
 additional commits ready at that time are also flushed out.
 <varname>commit_delay</varname> adds a time delay, set in
-microseconds, before writing some commit records to the WAL
-buffer and flushing the buffer out to disks. A nonzero delay
-can allow more transactions to be committed with only one call
-to the active <varname>wal_sync_method</varname>, if
+microseconds, before a transaction attempts to
+flush the WAL buffer out to disk. A nonzero delay can allow more
+transactions to be committed with only one flush operation, if
 system load is high enough that additional transactions become
 ready to commit within the given interval. But the delay is
 just wasted if no other transactions become ready to
 commit. Therefore, the delay is only performed if at least
 <varname>commit_siblings</varname> other transactions are
 active at the instant that a server process has written its
-commit record. The default is zero (no delay). Since
-all pending commit data flushes are written at every flush
-regardless of this setting, it is rare that adding delay to
-that by increasing this parameter will actually improve commit
-performance.
+commit record.
+The default <varname>commit_delay</> is zero (no delay).
+Since all pending commit data will be written at every flush
+regardless of this setting, it is rare that adding delay
+by increasing this parameter will actually improve performance.
 </para>
 </listitem>
 </varlistentry>

@ -27,7 +27,7 @@
 </para>

 <para>
-While forcing data periodically to the disk platters might seem like
+While forcing data to the disk platters periodically might seem like
 a simple operation, it is not. Because disk drives are dramatically
 slower than main memory and CPUs, several layers of caching exist
 between the computer's main memory and the disk platters.
@ -48,7 +48,7 @@
 some later time. Such caches can be a reliability hazard because the
 memory in the disk controller cache is volatile, and will lose its
 contents in a power failure. Better controller cards have
-<firstterm>battery-backed unit</> (<acronym>BBU</>) caches, meaning
+<firstterm>battery-backup units</> (<acronym>BBU</>s), meaning
 the card has a battery that
 maintains power to the cache in case of system power loss. After power
 is restored the data will be written to the disk drives.
@ -57,15 +57,10 @@
 <para>
 And finally, most disk drives have caches. Some are write-through
 while some are write-back, and the same concerns about data loss
-exist for write-back drive caches as exist for disk controller
+exist for write-back drive caches as for disk controller
 caches. Consumer-grade IDE and SATA drives are particularly likely
-to have write-back caches that will not survive a power failure,
-though <acronym>ATAPI-6</> introduced a drive cache flush command
-(<command>FLUSH CACHE EXT</>) that some file systems use, e.g.
-<acronym>ZFS</>, <acronym>ext4</>. (The SCSI command
-<command>SYNCHRONIZE CACHE</> has long been available.) Many
-solid-state drives (SSD) also have volatile write-back caches, and
-many do not honor cache flush commands by default.
+to have write-back caches that will not survive a power failure. Many
+solid-state drives (SSD) also have volatile write-back caches.
 </para>

 <para>
@ -81,7 +76,7 @@
 a <literal>*</> next to <literal>Write cache</>. <command>hdparm -W</>
 can be used to turn off write caching. SCSI drives can be queried
 using <ulink url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</></ulink>.
-for SCSI drives. Use <command>sdparm --get=WCE</command> to check
+Use <command>sdparm --get=WCE</command> to check
 whether the write cache is enabled and <command>sdparm --clear=WCE</>
 to disable it.
 </para>
@ -107,35 +102,40 @@
 <listitem>
 <para>
 On <productname>Windows</>, if <varname>wal_sync_method</> is
-<literal>open_datasync</> (the default), write caching is disabled
-by unchecking <literal>My Computer\Open\{select disk drive}\Properties\Hardware\Properties\Policies\Enable write caching on the disk</>.
-Alternatively, set <varname>wal_sync_method</varname> to <literal>fsync</> or <literal>fsync_writethrough</>, which never do write caching.
+<literal>open_datasync</> (the default), write caching can be disabled
+by unchecking <literal>My Computer\Open\<replaceable>disk drive</>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</>.
+Alternatively, set <varname>wal_sync_method</varname> to
+<literal>fsync</> or <literal>fsync_writethrough</>, which prevent
+write caching.
 </para>
 </listitem>

 <listitem>
 <para>
-On <productname>MacOS X</productname>, write caching can be disabled by
+On <productname>Mac OS X</productname>, write caching can be prevented by
 setting <varname>wal_sync_method</> to <literal>fsync_writethrough</>.
 </para>
 </listitem>
 </itemizedlist>

 <para>
-Many file systems that use write barriers (e.g. <acronym>ZFS</>,
-<acronym>ext4</>) internally use <command>FLUSH CACHE EXT</> or
-<command>SYNCHRONIZE CACHE</> commands to flush data to the platters on
-write-back-enabled drives. Unfortunately, such write barrier file
-systems behave suboptimally when combined with battery-backed unit
+Recent SATA drives (those following <acronym>ATAPI-6</> or later)
+offer a drive cache flush command (<command>FLUSH CACHE EXT</>),
+while SCSI drives have long supported a similar command
+<command>SYNCHRONIZE CACHE</>. These commands are not directly
+accessible to <productname>PostgreSQL</>, but some file systems
+(e.g., <acronym>ZFS</>, <acronym>ext4</>) can use them to flush
+data to the platters on write-back-enabled drives. Unfortunately, such
+file systems behave suboptimally when combined with battery-backup unit
 (<acronym>BBU</>) disk controllers. In such setups, the synchronize
-command forces all data from the BBU to the disks, eliminating much
-of the benefit of the BBU. You can run the utility
+command forces all data from the controller cache to the disks,
+eliminating much of the benefit of the BBU. You can run the utility
 <filename>src/tools/fsync</> in the PostgreSQL source tree to see
 if you are affected. If you are affected, the performance benefits
-of the BBU cache can be regained by turning off write barriers in
+of the BBU can be regained by turning off write barriers in
 the file system or reconfiguring the disk controller, if that is
 an option. If write barriers are turned off, make sure the battery
-remains active; a faulty battery can potentially lead to data loss.
+remains functional; a faulty battery can potentially lead to data loss.
 Hopefully file system and disk controller designers will eventually
 address this suboptimal behavior.
 </para>
@ -148,6 +148,8 @@
 ensure data integrity. Avoid disk controllers that have non-battery-backed
 write caches. At the drive level, disable write-back caching if the
 drive cannot guarantee the data will be written before shutdown.
+If you use SSDs, be aware that many of these do not honor cache flush
+commands by default.
 You can test for reliable I/O subsystem behavior using <ulink
 url="http://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>.
 </para>
@ -157,16 +159,17 @@
 operations themselves. Disk platters are divided into sectors,
 commonly 512 bytes each. Every physical read or write operation
 processes a whole sector.
-When a write request arrives at the drive, it might be for 512 bytes,
-1024 bytes, or 8192 bytes, and the process of writing could fail due
+When a write request arrives at the drive, it might be for some multiple
+of 512 bytes (<productname>PostgreSQL</> typically writes 8192 bytes, or
+16 sectors, at a time), and the process of writing could fail due
 to power loss at any time, meaning some of the 512-byte sectors were
-written, and others were not. To guard against such failures,
+written while others were not. To guard against such failures,
 <productname>PostgreSQL</> periodically writes full page images to
 permanent WAL storage <emphasis>before</> modifying the actual page on
 disk. By doing this, during crash recovery <productname>PostgreSQL</> can
-restore partially-written pages. If you have a battery-backed disk
+restore partially-written pages from WAL. If you have a battery-backed disk
 controller or file-system software that prevents partial page writes
-(e.g., ZFS), you can turn off this page imaging by turning off the
+(e.g., ZFS), you can safely turn off this page imaging by turning off the
 <xref linkend="guc-full-page-writes"> parameter.
 </para>
 </sect1>

@ -260,12 +260,13 @@ static bool looks_like_temp_rel_name(const char *name);
 int
 pg_fsync(int fd)
 {
-#ifndef HAVE_FSYNC_WRITETHROUGH_ONLY
-    if (sync_method != SYNC_METHOD_FSYNC_WRITETHROUGH)
-        return pg_fsync_no_writethrough(fd);
+    /* #if is to skip the sync_method test if there's no need for it */
+#if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
+    if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
+        return pg_fsync_writethrough(fd);
     else
 #endif
-        return pg_fsync_writethrough(fd);
+        return pg_fsync_no_writethrough(fd);
 }

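
The rewritten dispatch in `pg_fsync()` can be exercised in isolation with a sketch like the one below. It is a stand-alone approximation, not the code from the tree: `sketch_pg_fsync()`, the two `stub_*` functions, and the `last_path` variable are hypothetical names introduced here, the enum is a stand-in for the real constants, and `HAVE_FSYNC_WRITETHROUGH` is defined by hand to simulate a platform with a distinct write-through primitive.

```c
/* Hypothetical stand-in for the real sync method constants. */
enum sync_method
{
    SYNC_METHOD_FSYNC,
    SYNC_METHOD_FDATASYNC,
    SYNC_METHOD_FSYNC_WRITETHROUGH
};

/* Assumption: simulate a platform with a separate write-through call. */
#define HAVE_FSYNC_WRITETHROUGH 1

static int sync_method = SYNC_METHOD_FDATASYNC;

/* Records which branch ran, so the dispatch can be observed without real I/O. */
static const char *last_path = "none";

/* Stub standing in for the write-through flush. */
static int
stub_fsync_writethrough(int fd)
{
    (void) fd;
    last_path = "writethrough";
    return 0;
}

/* Stub standing in for the ordinary fsync() path. */
static int
stub_fsync_no_writethrough(int fd)
{
    (void) fd;
    last_path = "plain";
    return 0;
}

/* Same shape as the patched function: test sync_method only when it matters. */
static int
sketch_pg_fsync(int fd)
{
#if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
    if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
        return stub_fsync_writethrough(fd);
    else
#endif
        return stub_fsync_no_writethrough(fd);
}
```

Defining `FSYNC_WRITETHROUGH_IS_FSYNC` as well (as the Windows port header does in this patch) compiles the test away, so every call takes the plain path regardless of `sync_method`.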
@ -157,7 +157,7 @@
 #wal_sync_method = fsync # the default is the first option
 # supported by the operating system:
 # open_datasync
-# fdatasync
+# fdatasync (default on Linux)
 # fsync
 # fsync_writethrough
 # open_sync

@ -123,12 +123,12 @@ typedef uint32 TimeLineID;
 #endif
 #endif

-#if defined(OPEN_DATASYNC_FLAG)
+#if defined(PLATFORM_DEFAULT_SYNC_METHOD)
+#define DEFAULT_SYNC_METHOD PLATFORM_DEFAULT_SYNC_METHOD
+#elif defined(OPEN_DATASYNC_FLAG)
 #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN_DSYNC
 #elif defined(HAVE_FDATASYNC)
 #define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
-#elif defined(HAVE_FSYNC_WRITETHROUGH_ONLY)
-#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC_WRITETHROUGH
 #else
 #define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC
 #endif

@ -12,3 +12,11 @@
  * to have a kernel version test here.
  */
 #define HAVE_LINUX_EIDRM_BUG
+
+/*
+ * Set the default wal_sync_method to fdatasync. With recent Linux versions,
+ * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
+ * perform better and (b) causes outright failures on ext4 data=journal
+ * filesystems, because those don't support O_DIRECT.
+ */
+#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC

@ -34,15 +34,19 @@
 /* Must be here to avoid conflicting with prototype in windows.h */
 #define mkdir(a,b) mkdir(a)

-#define HAVE_FSYNC_WRITETHROUGH
-#define HAVE_FSYNC_WRITETHROUGH_ONLY
 #define ftruncate(a,b) chsize(a,b)
-/*
- * Even though we don't support 'fsync' as a wal_sync_method,
- * we do fsync() a few other places where _commit() is just fine.
- */
+
+/* Windows doesn't have fsync() as such, use _commit() */
 #define fsync(fd) _commit(fd)
+
+/*
+ * For historical reasons, we allow setting wal_sync_method to
+ * fsync_writethrough on Windows, even though it's really identical to fsync
+ * (both code paths wind up at _commit()).
+ */
+#define HAVE_FSYNC_WRITETHROUGH
+#define FSYNC_WRITETHROUGH_IS_FSYNC

 #define USES_WINSOCK

 /* defines for dynamic linking on Win32 platform */