Fix race conditions in newly-added test.

Buildfarm has been failing sporadically on the new test.  I was able to
reproduce this by adding a random 0-10 s delay in the walreceiver, just
before it connects to the primary. There's a race condition where node_3
is promoted before it has fully caught up with node_1, leading to diverged
timelines. When node_1 is later reconfigured as a standby following node_3,
it fails to catch up:

LOG:  primary server contains no more WAL on requested timeline 1
LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/30000A0

That's the situation where you'd need to use pg_rewind, but in this case
it happens while we are still setting up the actual pg_rewind scenario we
want to test. Change the test to wait until node_3 is connected and fully
caught up before promoting it, so that we get a clean, controlled
failover.
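
For the record, on branches that have the PostgresNode wait_for_catchup and
lsn helpers (v10 and later), the same "wait, then fail over" step can be
written more compactly. This is only a rough sketch, reusing the $node_1 and
$node_3 objects from this test and assuming node_3 shows up as
application_name=node_3 in pg_stat_replication:

# Sketch only (v10+ branches): block until node_3 has replayed everything
# node_1 has written, then perform the controlled failover.
my $insert_lsn = $node_1->lsn('insert');
$node_1->wait_for_catchup('node_3', 'replay', $insert_lsn);
$node_1->stop('fast');
$node_3->promote;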

Also rewrite some of the comments, for clarity. The existing comments
detailed what each step in the test did, but didn't give a good overview
of the situation the steps were trying to create.

For reasons I don't understand, the test setup had to be written slightly
differently in 9.6 and 9.5 than in later versions. The 9.5/9.6 version
needed node 1 to be reinitialized from backup, whereas in later versions
it could be shut down and reconfigured to be a standby. But even 9.5 should
support "clean switchover", where the primary makes sure that pending WAL is
replicated to the standby on shutdown. It would be nice to figure out what's
going on there, but that's independent of pg_rewind and the scenario that
this test exercises.
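
For comparison, the later-branch (v12+) form of that setup is roughly the
following; recovery.conf is gone there, so a standby.signal file plus
primary_conninfo in postgresql.conf are enough. This is only a sketch, reusing
the $node_1 and $node_3_connstr variables from the test below, not the
committed code:

# Sketch only (v12+ branches): no rmtree/init_from_backup needed, just shut
# node_1 down and repoint it at node_3 as a standby.
$node_1->stop('fast');
$node_1->append_conf('postgresql.conf',
	qq(primary_conninfo = '$node_3_connstr application_name=node_1'));
$node_1->append_conf('standby.signal', '');  # empty signal file puts node_1 in standby mode
$node_1->start;

On those branches recovery_target_timeline already defaults to 'latest', so it
doesn't need to be set explicitly.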

Discussion: https://www.postgresql.org/message-id/b0a3b95b-82d2-6089-6892-40570f8c5e60%40iki.fi
Heikki Linnakangas 2020-12-04 18:20:18 +02:00
parent 89cdf1b65e
commit a075c84f2c
1 changed file with 34 additions and 15 deletions


@@ -34,6 +34,7 @@ use TestLib;
use Test::More tests => 3;
use File::Copy;
use File::Path qw(rmtree);
my $tmp_folder = TestLib::tempdir;
@@ -50,53 +51,69 @@ $node_1->safe_psql('postgres', 'CREATE TABLE public.foo (t TEXT)');
$node_1->safe_psql('postgres', 'CREATE TABLE public.bar (t TEXT)');
$node_1->safe_psql('postgres', "INSERT INTO public.bar VALUES ('in both')");
#
# Create node_2 and node_3 as standbys following node_1
#
my $backup_name = 'my_backup';
$node_1->backup($backup_name);
# Create streaming standby from backup
my $node_2 = get_new_node('node_2');
$node_2->init_from_backup($node_1, $backup_name,
has_streaming => 1);
$node_2->start;
# Create streaming standby from backup
my $node_3 = get_new_node('node_3');
$node_3->init_from_backup($node_1, $backup_name,
has_streaming => 1);
$node_3->start;
# Wait until node 3 has connected and caught up
my $until_lsn =
$node_1->safe_psql('postgres', "SELECT pg_current_xlog_location();");
my $caughtup_query =
"SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
$node_3->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for standby to catch up";
#
# Swap the roles of node_1 and node_3, so that node_1 follows node_3.
#
$node_1->stop('fast');
# Promote node_3
$node_3->promote;
# reconfigure node_1 as a standby following node_3
rmtree $node_1->data_dir;
$node_1->init_from_backup($node_1, $backup_name);
my $node_3_connstr = $node_3->connstr;
unlink($node_1->data_dir . '/recovery.conf');
$node_1->append_conf('recovery.conf', qq(
standby_mode=on
primary_conninfo='$node_3_connstr application_name=node_1'
recovery_target_timeline='latest'
));
$node_1->start();
# also reconfigure node_2 to follow node_3
unlink($node_2->data_dir . '/recovery.conf');
$node_2->append_conf('recovery.conf', qq(
standby_mode=on
primary_conninfo='$node_3_connstr application_name=node_2'
recovery_target_timeline='latest'
));
$node_2->restart();
#
# Promote node_1, to create a split-brain scenario.
#
# make sure node_1 is fully caught up with node_3 first
$until_lsn =
$node_3->safe_psql('postgres', "SELECT pg_current_xlog_location();");
$caughtup_query =
"SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
$node_1->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for standby to catch up";
$node_1->promote;
@@ -104,9 +121,11 @@ $node_1->promote;
$node_1->poll_query_until('postgres', "SELECT pg_is_in_recovery() <> true");
$node_3->poll_query_until('postgres', "SELECT pg_is_in_recovery() <> true");
#
# We now have a split-brain with two primaries. Insert a row on both to
# demonstratively create a split brain. After the rewind, we should only
# see the insert on 1, as the insert on node 3 is rewound away.
#
$node_1->safe_psql('postgres', "INSERT INTO public.foo (t) VALUES ('keep this')");
# Insert more rows in node 1, to bump up the XID counter. Otherwise, if