Users sometimes get concerned whe they start the server and it
emits a few messages and then doesn't emit any more messages for
a long time. Generally, what's happening is either that the
system is taking a long time to apply WAL, or it's taking a
long time to reset unlogged relations, or it's taking a long
time to fsync the data directory, but it's not easy to tell
which is the case.
To fix that, add a new 'log_startup_progress_interval' setting,
by default 10s. When an operation that is known to be potentially
long-running takes more than this amount of time, we'll log a
status update each time this interval elapses.
To avoid undesirable log chatter, don't log anything about WAL
replay when in standby mode.
Nitin Jadhav and Robert Haas, reviewed by Amul Sul, Bharath
Rupireddy, Justin Pryzby, Michael Paquier, and Álvaro Herrera.
Discussion: https://postgr.es/m/CA+TgmoaHQrgDFOBwgY16XCoMtXxsrVGFB2jNCvb7-ubuEe1MGg@mail.gmail.com
Discussion: https://postgr.es/m/CAMm1aWaHF7VE69572_OLQ+MgpT5RUiUDgF1x5RrtkJBLdpRj3Q@mail.gmail.com
All the code that runs in the startup process to write WAL records
before that's allowed generally is now consecutive, so there's no
reason to shut the facility to write WAL locally off and then turn
it on again three times in a row.
Unfortunately, this requires a slight kludge in the checkpointer,
which needs to separately enable writing WAL in order to write the
checkpoint record. Because that code might run in the same process
as StartupXLOG() if we are in single-user mode, we must save/restore
the state of the LocalXLogInsertAllowed flag. Hopefully, we'll be
able to eliminate this wart in further refactoring, but it's
not too bad anyway.
Amul Sul, with modifications by me.
Discussion: http://postgr.es/m/CAAJ_b97fysj6sRSQEfOHj-y8Jfd5uPqOgO74qast89B4WfD+TA@mail.gmail.com
This does a better job grouping related operations together, since
all of the WAL records that we need to write prior to allowing WAL
writes generally and written by a single uninterrupted stretch of code.
Since CleanupAfterArchiveRecovery() just (1) runs recovery_end_command,
(2) removes non-parent xlog files, and (3) archives any final partial
segment, this should be safe, because all of those things are pretty
much unrelated to the WAL record written by XLogReportParameters().
Amul Sul, per a suggestion from me
Discussion: http://postgr.es/m/CAAJ_b97fysj6sRSQEfOHj-y8Jfd5uPqOgO74qast89B4WfD+TA@mail.gmail.com
CreateOverwriteContrecordRecord(), UpdateFullPageWrites(),
PerformRecoveryXLogAction(), and CleanupAfterArchiveRecovery()
are moved somewhat later in StartupXLOG(). This is preparatory
work for a future patch that wants to allow recovery to end at one
time and only later start to allow WAL writes. To do that, it's
necessary to separate code that has to do with allowing WAL writes
from other things that need to happen simply because recovery is
ending, such as initializing shared memory data structures that
depend on information that might not be accurate before redo is
complete.
This commit does not achieve that goal, but it is a step in that
direction. For example, there are a few different bits of code that
write things into WAL once we have finished recovery, and with this
change, those bits of code are closer to each other than previously,
with fewer unrelated bits of code interspersed.
Robert Haas and Amul Sul
Discussion: http://postgr.es/m/CAAJ_b97abMuq=470Wahun=aS1PHTSbStHtrjjPaD-C0YQ1AqVw@mail.gmail.com
Create a new function PerformRecoveryXLogAction() and move the
code which either writes an end-of-recovery record or requests a
checkpoint there.
Also create a new function CleanupAfterArchiveRecovery() to
perform a few tasks that we want to do after we've actually exited
archive recovery but before we start accepting new WAL writes.
More refactoring of this file is planned, but this commit is
just straightforward code movement to make StartupXLOG() a
little bit shorter and a little bit easier to understand.
Robert Haas and Amul Sul
Discussion: http://postgr.es/m/CAAJ_b97abMuq=470Wahun=aS1PHTSbStHtrjjPaD-C0YQ1AqVw@mail.gmail.com
Commit 0668719801 changed XLogPageRead() so that it validated the page
header, if invalid page header was found reset the error message and
retried reading the page, to fix the scenario where streaming standby
got stuck at a continuation record. This change hid the error message
about invalid page header, which would make it harder for users to
investigate what the actual issue was found in WAL.
To fix the issue, this commit makes XLogPageRead() report the error
message when invalid page header is found.
When not in standby mode, an invalid page header should cause recovery
to end, not retry reading the page, so XLogPageRead() doesn't need to
validate the page header for the retry. Instead, ReadPageInternal() should
be responsible for the validation in that case. Therefore this commit
changes XLogPageRead() so that if not in standby mode it doesn't validate
the page header for the retry.
Reported-by: Yugo Nagata
Author: Yugo Nagata, Kyotaro Horiguchi
Reviewed-by: Ranier Vilela, Fujii Masao
Discussion: https://postgr.es/m/20210718045505.32f463ed6c227111038d8ae4@sraoss.co.jp
Some specific logic is done at the end of recovery when involving 2PC
transactions:
1) Call RecoverPreparedTransactions(), to recover the state of 2PC
transactions into memory (re-acquire locks, etc.).
2) ShutdownRecoveryTransactionEnvironment(), to move back to normal
operations, mainly cleaning up recovery locks and KnownAssignedXids
(including any 2PC transaction tracked previously).
3) Switch XLogCtl->SharedRecoveryState to RECOVERY_STATE_DONE, which is
the tipping point for any process calling RecoveryInProgress() to check
if the cluster is still in recovery or not.
Any snapshot taken between steps 2) and 3) would be empty, causing any
transaction relying on a snapshot at this point to potentially corrupt
data as there could still be some 2PC transactions to track, with
RecentXmin moving backwards on successive calls to GetSnapshotData() in
the same transaction.
As SharedRecoveryState is the point to take into account to know if it
is safe to discard KnownAssignedXids, this commit moves step 2) after
step 3), so as we can never finish with empty snapshots.
This exists since the introduction of hot standby, so backpatch all the
way down. The window with incorrect snapshots is extremely small, but I
have seen it when running 023_pitr_prepared_xact.pl, as did buildfarm
member fairywren. Thomas Munro also found it independently. Special
thanks to Andres Freund for taking the time to analyze this issue.
Reported-by: Thomas Munro, Michael Paquier
Analyzed-by: Andres Freund
Discussion: https://postgr.es/m/20210422203603.fdnh3fu2mmfp2iov@alap3.anarazel.de
Backpatch-through: 9.6
Commit ff9f111bce24 added some test code that's unportable and doesn't
add meaningful coverage. Remove it rather than try and get it to work
everywhere.
While at it, fix a typo in a log message added by the aforementioned
commit.
Backpatch to 14.
Discussion: https://postgr.es/m/3000074.1632947632@sss.pgh.pa.us
Physical replication always ships WAL segment files to replicas once
they are complete. This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they are not rewinding.
This causes standbys to stop following the primary after the latter
crashes:
LOG: invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be. A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary. But that's pretty labor intensive, and I bet many users
would just give up and re-clone the standby instead.
A fix for this problem was already attempted in commit 515e3d84a0b5, but
it only addressed the case for the scenario of WAL archiving, so
streaming replication would still be a problem (as well as other things
such as taking a filesystem-level backup while the server is down after
having crashed), and it had performance scalability problems too; so it
had to be reverted.
This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts. With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.
Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.
A new TAP test that exercises this is added, but the portability of it
is yet to be seen.
This has been wrong since the introduction of physical replication, so
backpatch all the way back. In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
This reverts commit 515e3d84a0b5 and equivalent commits in back
branches. This solution to the problem has a number of problems, so
we'll try again with a different approach.
Per note from Andres Freund
Discussion: https://postgr.es/m/20210831042949.52eqp5xwbxgrfank@alap3.anarazel.de
WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files. Instead, as soon as the last WAL page of
the segment is written, the archive status file is created, and the
archiver may process it. If PostgreSQL crashes before it is able to
write and flush the rest of the record (in the next WAL segment), the
wrong version of the first segment file lingers in the archive, which
causes operations such as point-in-time restores to fail.
To fix this, keep track of records that span across segments and ensure
that segments are only marked ready-for-archival once such records have
been completely written to disk.
This has always been wrong, so backpatch all the way back.
Author: Nathan Bossart <bossartn@amazon.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
This commit ensures that the wait interval in the replay delay loop
waiting for an amount of time defined by recovery_min_apply_delay is
correctly handled on reload, recalculating the delay if this GUC value
is updated, based on the timestamp of the commit record being replayed.
The previous behavior would be problematic for example with replay
still waiting even if the delay got reduced or just cancelled. If the
apply delay was increased to a larger value, the wait would have just
respected the old value set, finishing earlier.
Author: Soumyadeep Chakraborty, Ashwin Agrawal
Reviewed-by: Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/CAE-ML+93zfr-HLN8OuxF0BjpWJ17O5dv1eMvSE5jsj9jpnAXZA@mail.gmail.com
Backpatch-through: 9.6
These have been unrelated since bgwriter and checkpointer were split into two
processes in 806a2aee379. As there several pending patches (shared memory
stats, extending the set of tracked IO / buffer statistics) that are made a
bit more awkward by the grouping, split them. Done separately to make
reviewing easier.
This does *not* change the contents of pg_stat_bgwriter or move fields out of
bgwriter/checkpointer stats that arguably do not belong in either. However
pgstat_fetch_global() was renamed and split into
pgstat_fetch_stat_checkpointer() and pgstat_fetch_stat_bgwriter().
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20210405092914.mmxqe7j56lsjfsej@alap3.anarazel.de
Start up the checkpointer and bgwriter during crash recovery (except in
--single mode), as we do for replication. This wasn't done back in
commit cdd46c76 out of caution. Now it seems like a better idea to make
the environment as similar as possible in both cases. There may also be
some performance advantages.
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q%40mail.gmail.com
They are used in code that runs both during normal operation and during
WAL replay, and needs to behave differently during replay. Move them to
xlogutils.c, because that's where we have other helper functions used by
redo routines.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/b3b71061-4919-e882-4857-27e370ab134a%40iki.fi
It should always be the case that the last checkpoint record is still
readable, because otherwise, a crash would leave us in a situation
from which we can't recover. Therefore the test removed by this patch
should always succeed. For it to fail, either there has to be a serious
bug in the code someplace, or the user has to be manually modifying
pg_wal while crash recovery is running. If it's the first one, we
should fix the bug. If it's the second one, they should stop, or
anyway they're doing so at their own risk. In neither case does
a full checkpoint instead of an end-of-recovery record seem like a
clear winner. Furthermore, rarely-taken code paths are particularly
vulnerable to bugs, so let's simplify by getting rid of this one.
Discussion: http://postgr.es/m/CA+TgmoYmw==TOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA@mail.gmail.com
CheckpointLock was removed in commit d18e75664a, and commit ce197e91d0
updated a leftover comment in CreateCheckPoint, but there was another
copy of it in CreateRestartPoint still.
Buildfarm shows that this test has a further failure mode when a
checkpoint starts earlier than expected, so we detect a "checkpoint
completed" line that's not the one we want. Change the config to try
and prevent this.
Per buildfarm
While at it, update one comment that was forgotten in commit
d18e75664a2f.
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20210729.162038.534808353849568395.horikyota.ntt@gmail.com
Commit 2c03216d83 changed XLOG_FPI_FOR_HINT records so that they always
included full-page images even when full_page_writes was disabled. However,
in this setting, they don't need to do that because hint bit updates don't
need to be protected from torn writes.
Therefore, this commit makes XLOG_FPI_FOR_HINT records honor full_page_writes
setting. That is, XLOG_FPI_FOR_HINT records may include no full-page images
if full_page_writes is disabled, and WAL replay of them does nothing.
Reported-by: Zhang Wenjie
Author: Kyotaro Horiguchi
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/tencent_60F11973A111EED97A8596FFECC4A91ED405@qq.com
When some slots are invalidated due to the max_slot_wal_keep_size limit,
the old segment horizon should move forward to stay within the limit.
However, in commit c6550776394e we forgot to call KeepLogSeg again to
recompute the horizon after invalidating replication slots. In cases
where other slots remained, the limits would be recomputed eventually
for other reasons, but if all slots were invalidated, the limits would
not move at all afterwards. Repair.
Backpatch to 13 where the feature was introduced.
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reported-by: Marcin Krupowicz <mk@071.ovh>
Discussion: https://postgr.es/m/17103-004130e8f27782c9@postgresql.org
As of v14, pg_depend contains almost 7000 "pin" entries recording
the OIDs of built-in objects. This is a fair amount of bloat for
every database, and it adds time to pg_depend lookups as well as
initdb. We can get rid of all of those entries in favor of an OID
range check, i.e. "OIDs below FirstUnpinnedObjectId are pinned".
(template1 and the public schema are exceptions. Those exceptions
are now wired into IsPinnedObject() instead of initdb's code for
filling pg_depend, but it's the same amount of cruft either way.)
The contents of pg_shdepend are modified likewise.
Discussion: https://postgr.es/m/3737988.1618451008@sss.pgh.pa.us
This concerns pg_stop_backup() and BASE_BACKUP, when waiting for the
WAL segments required for a backup to be archived. This simplifies a
bit the handling of the wait event used in this code path.
Author: Bharath Rupireddy
Reviewed-by: Michael Paquier, Stephen Frost
Discussion: https://postgr.es/m/CALj2ACU4AdPCq6NLfcA-ZGwX7pPCK5FgEj-CAU0xCKzkASSy_A@mail.gmail.com
The logic is implemented so as there can be a choice in the compression
used when building a WAL record, and an extra per-record bit is used to
track down if a block is compressed with PGLZ, LZ4 or nothing.
wal_compression, the existing parameter, is changed to an enum with
support for the following backward-compatible values:
- "off", the default, to not use compression.
- "pglz" or "on", to compress FPWs with PGLZ.
- "lz4", the new mode, to compress FPWs with LZ4.
Benchmarking has showed that LZ4 outclasses easily PGLZ. ZSTD would be
also an interesting choice, but going just with LZ4 for now makes the
patch minimalistic as toast compression is already able to use LZ4, so
there is no need to worry about any build-related needs for this
implementation.
Author: Andrey Borodin, Justin Pryzby
Reviewed-by: Dilip Kumar, Michael Paquier
Discussion: https://postgr.es/m/3037310D-ECB7-4BF1-AF20-01C10BB33A33@yandex-team.ru
The previous commit addressed the chief consequences of a race condition
between InstallXLogFileSegment() and KeepFileRestoredFromArchive(). Fix
three lesser consequences. A spurious durable_rename_excl() LOG message
remained possible. KeepFileRestoredFromArchive() wasted the proceeds of
WAL recycling and preallocation. Finally, XLogFileInitInternal() could
return a descriptor for a file that KeepFileRestoredFromArchive() had
already unlinked. That felt like a recipe for future bugs.
Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
Before a restartpoint finishes PreallocXlogFiles(), a startup process
KeepFileRestoredFromArchive() call can unlink the preallocated segment.
If a CHECKPOINT sql command had elicited the restartpoint experiencing
the race condition, that sql command failed. Moreover, the restartpoint
omitted its log_checkpoints message and some inessential resource
reclamation. Prevent the ERROR by skipping open() of the segment.
Since these consequences are so minor, no back-patch.
Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
Only initdb used it. initdb refuses to operate on a non-empty directory
and generally does not cope with pre-existing files of other kinds.
Hence, use the opportunity to simplify.
Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
Infrequently, the mismatch caused log_checkpoints messages and
TRACE_POSTGRESQL_CHECKPOINT_DONE() to witness an "added" count too high
by one. Since that consequence is so minor, no back-patch.
Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
Since commit c24dcd0cfd, we have been using pg_pread() to read the WAL
file, which doesn't change the seek position (unless we fall back to
the implementation in src/port/pread.c). Update comment accordingly.
Backpatch-through: 12, where we started to use pg_pread()
This only happens if (1) the new standby has no WAL available locally,
(2) the new standby is starting from the old timeline, (3) the promotion
happened in the WAL segment from which the new standby is starting,
(4) the timeline history file for the new timeline is available from
the archive but the WAL files for are not (i.e. this is a race),
(5) the WAL files for the new timeline are available via streaming,
and (6) recovery_target_timeline='latest'.
Commit ee994272ca50f70b53074f0febaec97e28f83c4e introduced this
logic and was an improvement over the previous code, but it mishandled
this case. If recovery_target_timeline='latest' and restore_command is
set, validateRecoveryParameters() can change recoveryTargetTLI to be
different from receiveTLI. If streaming is then tried afterward,
expectedTLEs gets initialized with the history of the wrong timeline.
It's supposed to be a list of entries explaining how to get to the
target timeline, but in this case it ends up with a list of entries
explaining how to get to the new standby's original timeline, which
isn't right.
Dilip Kumar and Robert Haas, reviewed by Kyotaro Horiguchi.
Discussion: http://postgr.es/m/CAFiTN-sE-jr=LB8jQuxeqikd-Ux+jHiXyh4YDiZMPedgQKup0g@mail.gmail.com
If a promotion is triggered while recovery is paused, the paused state ends
and promotion continues. But previously in that case
pg_get_wal_replay_pause_state() returned 'paused' wrongly while a promotion
was ongoing.
This commit changes a standby promotion so that it marks the recovery
pause state as 'not paused' when it's triggered, to fix the issue.
Author: Fujii Masao
Reviewed-by: Dilip Kumar, Kyotaro Horiguchi
Discussion: https://postgr.es/m/f706876c-4894-0ba5-6f4d-79803eeea21b@oss.nttdata.com
Also "make reformat-dat-files".
The only change worthy of note is that pgindent messed up the formatting
of launcher.c's struct LogicalRepWorkerId, which led me to notice that
that struct wasn't used at all anymore, so I just took it out.
This set of commits has some bugs with known fixes, but at this late
stage in the release cycle it seems best to revert and resubmit next
time, along with some new automated test coverage for this whole area.
Commits reverted:
dc88460c: Doc: Review for "Optionally prefetch referenced data in recovery."
1d257577: Optionally prefetch referenced data in recovery.
f003d9f8: Add circular WAL decoding buffer.
323cbe7c: Remove read_page callback from XLogReader.
Remove the new GUC group WAL_RECOVERY recently added by a55a9847, as the
corresponding section of config.sgml is now reverted.
Discussion: https://postgr.es/m/CAOuzzgrn7iKnFRsB4MHp3UisEQAGgZMbk_ViTN4HV4-Ksq8zCg%40mail.gmail.com
Introduce a new GUC recovery_prefetch, disabled by default. When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Better mechanisms will follow in later work on the I/O subsystem.
The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.
The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (parts)
Reviewed-by: Andres Freund <andres@anarazel.de> (parts)
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (parts)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com>
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
Teach xlogreader.c to decode its output into a circular buffer, to
support optimizations based on looking ahead.
* XLogReadRecord() works as before, consuming records one by one, and
allowing them to be examined via the traditional XLogRecGetXXX()
macros.
* An alternative new interface XLogNextRecord() is added that returns
pointers to DecodedXLogRecord structs that can be examined directly.
* XLogReadAhead() provides a second cursor that lets you see
further ahead, as long as data is available and there is enough space
in the decoding buffer. This returns DecodedXLogRecord pointers to the
caller, but also adds them to a queue of records that will later be
consumed by XLogNextRecord()/XLogReadRecord().
The buffer's size is controlled with wal_decode_buffer_size. The buffer
could potentially be placed into shared memory, for future projects.
Large records that don't fit in the circular buffer are called
"oversized" and allocated separately with palloc().
Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
Previously, the XLogReader module would fetch new input data using a
callback function. Redesign the interface so that it tells the caller
to insert more data with a special return value instead. This API suits
later patches for prefetching, encryption and maybe other future
projects that would otherwise require continually extending the callback
interface.
As incidental cleanup work, move global variables readOff, readLen and
readSegNo inside XlogReaderState.
Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (parts of earlier version)
Reviewed-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
Previously if hot standby was enabled, archive recovery exited with
an error when it found WAL generated with wal_level=minimal.
But if hot standby was disabled, it just reported a warning and
continued in that case. Which could lead to data loss or errors
during normal operation. A warning was emitted, but users could
easily miss that and not notice this serious situation until
they encountered the actual errors.
To improve this situation, this commit changes archive recovery
so that it exits with FATAL error when it finds WAL generated with
wal_level=minimal whatever the setting of hot standby. This enables
users to notice the serious situation soon.
The FATAL error is thrown if archive recovery starts from a base
backup taken before wal_level is changed to minimal. When archive
recovery exits with the error, if users have a base backup taken
after setting wal_level to higher than minimal, they can recover
the database by starting archive recovery from that newer backup.
But note that if such backup doesn't exist, there is no easy way to
complete archive recovery, which may make the database server
unstartable and users may lose whole database. The commit adds
the note about this risk into the document.
Even in the case of unstartable database server, previously by just
disabling hot standby users could avoid the error during archive
recovery, forcibly start up the server and salvage data from it.
But note that this commit makes this procedure unavailable at all.
Author: Takamichi Osumi
Reviewed-by: Laurenz Albe, Kyotaro Horiguchi, David Steele, Fujii Masao
Discussion: https://postgr.es/m/OSBPR01MB4888CBE1DA08818FD2D90ED8EDF90@OSBPR01MB4888.jpnprd01.prod.outlook.com
While looking at Robert Foggia's report, I noticed a passel of
other issues in the same area:
* The scheme for backslash-quoting newlines in pathnames is just
wrong; it will misbehave if the last ordinary character in a pathname
is a backslash. I'm not sure why we're bothering to allow newlines
in tablespace paths, but if we're going to do it we should do it
without introducing other problems. Hence, backslashes themselves
have to be backslashed too.
* The author hadn't read the sscanf man page very carefully, because
this code would drop any leading whitespace from the path. (I doubt
that a tablespace path with leading whitespace could happen in
practice; but if we're bothering to allow newlines in the path, it
sure seems like leading whitespace is little less implausible.) Using
sscanf for the task of finding the first space is overkill anyway.
* While I'm not 100% sure what the rationale for escaping both \r and
\n is, if the idea is to allow Windows newlines in the file then this
code failed, because it'd throw an error if it saw \r followed by \n.
* There's no cross-check for an incomplete final line in the map file,
which would be a likely apparent symptom of the improper-escaping
bug.
On the generation end, aside from the escaping issue we have:
* If needtblspcmapfile is true then do_pg_start_backup will pass back
escaped strings in tablespaceinfo->path values, which no caller wants
or is prepared to deal with. I'm not sure if there's a live bug from
that, but it looks like there might be (given the dubious assumption
that anyone actually has newlines in their tablespace paths).
* It's not being very paranoid about the possibility of random stuff
in the pg_tblspc directory. IMO we should ignore anything without an
OID-like name.
The escaping rule change doesn't seem back-patchable: it'll require
doubling of backslashes in the tablespace_map file, which is basically
a basebackup format change. The odds of that causing trouble are
considerably more than the odds of the existing bug causing trouble.
The rest of this seems somewhat unlikely to cause problems too,
so no back-patch.
Robert Foggia of Trustwave reported that read_tablespace_map()
fails to prevent an overrun of its on-stack input buffer.
Since the tablespace map file is presumed trustworthy, this does
not seem like an interesting security vulnerability, but still
we should fix it just in the name of robustness.
While here, document that pg_basebackup's --tablespace-mapping option
doesn't work with tar-format output, because it doesn't. To make it
work, we'd have to modify the tablespace_map file within the tarball
sent by the server, which might be possible but I'm not volunteering.
(Less-painful solutions would require changing the basebackup protocol
so that the source server could adjust the map. That's not very
appetizing either.)
Previously, the code and documentation seem to have essentially
assumed than a call to pg_wal_replay_pause() would take place
immediately, but that's not the case, because we only check for a
pause in certain places. This means that a tool that uses this
function and then wants to do something else afterward that is
dependent on the pause having taken effect doesn't know how long it
needs to wait to be sure that no more WAL is going to be replayed.
To avoid that, add a new function pg_get_wal_replay_pause_state()
which returns either 'not paused', 'paused requested', or 'paused'.
After calling pg_wal_replay_pause() the status will immediate change
from 'not paused' to 'pause requested'; when the startup process
has noticed this, the status will change to 'pause'. For backward
compatibility, pg_is_wal_replay_paused() still exists and returns
the same thing as before: true if a pause has been requested,
whether or not it has taken effect yet; and false if not.
The documentation is updated to clarify.
To improve the changes that a pause request is quickly confirmed
effective, adjust things so that WaitForWALToBecomeAvailable will
swiftly reach a call to recoveryPausesHere() when a pause request
is made.
Dilip Kumar, reviewed by Simon Riggs, Kyotaro Horiguchi, Yugo Nagata,
Masahiko Sawada, and Bharath Rupireddy.
Discussion: http://postgr.es/m/CAFiTN-vcLLWEm8Zr%3DYK83rgYrT9pbC8VJCfa1kY9vL3AUPfu6g%40mail.gmail.com
This commit adds new GUC track_wal_io_timing. When this is enabled,
the total amounts of time XLogWrite writes and issue_xlog_fsync syncs
WAL data to disk are counted in pg_stat_wal. This information would be
useful to check how much WAL write and sync affect the performance.
Enabling track_wal_io_timing will make the server query the operating
system for the current time every time WAL is written or synced,
which may cause significant overhead on some platforms. To avoid such
additional overhead in the server with track_io_timing enabled,
this commit introduces track_wal_io_timing as a separate parameter from
track_io_timing.
Note that WAL write and sync activity by walreceiver has not been tracked yet.
This commit makes the server also track the numbers of times XLogWrite
writes and issue_xlog_fsync syncs WAL data to disk, in pg_stat_wal,
regardless of the setting of track_wal_io_timing. This counters can be
used to calculate the WAL write and sync time per request, for example.
Bump PGSTAT_FILE_FORMAT_ID.
Bump catalog version.
Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston, Fujii Masao
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com