Just as backends must clean up their shared memory state (releasing
lwlocks, buffer pins, etc.) before exiting, they must also perform
any similar cleanups related to dynamic shared memory segments they
have mapped before unmapping those segments. So add a mechanism to
ensure that.
Existing on_shmem_exit hooks include both "user level" cleanup such
as transaction abort and removal of leftover temporary relations and
also "low level" cleanup that forcibly released leftover shared
memory resources. On-detach callbacks should run after the first
group but before the second group, so create a new before_shmem_exit
function for registering the early callbacks and keep on_shmem_exit
for the regular callbacks. (An earlier draft of this patch added an
additional argument to on_shmem_exit, but that had a much larger
footprint and probably a substantially higher risk of breaking third
party code for no real gain.)
Patch by me, reviewed by KaiGai Kohei and Andres Freund.
Previously, lookups of non-existent user names could report "Success";
they now report "User does not exist", by resetting errno. This also
centralizes the user name lookup code in libpgport.
Report and analysis by Nicolas Marchildon; patch by me
If a tuple was locked by transaction A, and transaction B updated it,
the new version of the tuple created by B would be locked by A, yet
visible only to B; due to an oversight in HeapTupleSatisfiesUpdate, the
lock held by A wouldn't get checked if transaction B later deleted (or
key-updated) the new version of the tuple. This might cause referential
integrity checks to give false positives (that is, allow deletes that
should have been rejected).
This is an easy oversight to have made, because prior to improved tuple
locks in commit 0ac5ad5134 it wasn't possible to have tuples created by
our own transaction that were also locked by remote transactions, and so
locks weren't even considered in that code path.
It is recommended that foreign keys be rechecked manually in bulk after
installing this update, in case some referenced rows went missing while
their referencing rows remain.
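For example, a query of roughly this shape (table and column names here
are only illustrative: orders.customer_id referencing customers.id) can
be used to look for referencing rows whose referenced row is gone:

    SELECT o.*
    FROM orders o
    LEFT JOIN customers c ON c.id = o.customer_id
    WHERE o.customer_id IS NOT NULL   -- ignore NULL FKs, which are legal
      AND c.id IS NULL;               -- referenced row is missing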
Per bug reported by Daniel Wood in
CAPweHKe5QQ1747X2c0tA=5zf4YnS2xcvGf13Opd-1Mq24rF1cQ@mail.gmail.com
Tuple freezing was broken in connection with MultiXactIds; commit
8e53ae025d tried to fix it, but didn't go far enough. As noted by
Noah Misch, freezing a tuple whose Xmax is a multi containing an aborted
update might cause locks in the multi to be ignored by later
transactions. This is because the code assumed that a multixact above
the multixact cutoff point could not have any lock-only member older
than the Xid cutoff point, an assumption that is easily defeated in
READ COMMITTED transactions.
The fix for this involves creating a new MultiXactId when necessary.
But this cannot be done during WAL replay, and moreover multixact
examination requires using CLOG access routines which are not supposed
to be used during WAL replay either; so tuple freezing cannot be done
with the old freeze WAL record. Therefore, separate the freezing
computation from its execution, and change the WAL record to carry all
necessary information. At WAL replay time, it's easy to re-execute
freezing because we don't need to re-compute the new infomask/Xmax
values but just take them from the WAL record.
While at it, restructure the coding to ensure all page changes occur in
a single critical section without much room for failures. The previous
coding did not use a critical section, and offered no explanation of
why that was acceptable.
In replication scenarios using the 9.3 branch, standby servers must be
upgraded before their master, so that they are prepared to deal with the
new WAL record once the master is upgraded; failure to do so will cause
WAL replay to die with a PANIC message. Later upgrade of the standby
will allow the process to continue where it left off, so there's no
disruption of the data in the standby in any case. Standbys know how to
deal with the old WAL record, so it's okay to keep the master running
the old code for a while.
In master, the old freeze WAL record is gone, for cleanliness' sake;
there's no compatibility concern there.
Backpatch to 9.3, where the original bug was introduced and where the
previous fix was backpatched.
Álvaro Herrera and Andres Freund
When locale is "ja_JP.SJIS", nl_langinfo(CODESET) returns "SHIFT_JIS"
on some platforms, at least on RedHat Linux. So the encoding/locale
match table (encoding_match_list) needs an entry for it. Otherwise the
client encoding is set to SQL_ASCII.
Back patch to all supported branches.
This fixes a problem noted as a followup to bug #8648: if a query has a
semantically-empty target list, e.g. SELECT * FROM zero_column_table,
ruleutils.c will dump it as a syntactically-empty target list, which was
not allowed. There doesn't seem to be any reliable way to fix this by
hacking ruleutils (note in particular that the originally zero-column table
might since have had columns added to it); and even if we had such a fix,
it would do nothing for existing dump files that might contain bad syntax.
The best bet seems to be to relax the syntactic restriction.
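With the restriction relaxed, a statement such as the following (using
the zero-column table from the example above) is now accepted instead
of failing with a syntax error:

    SELECT * FROM zero_column_table;   -- expands to zero output columns
    SELECT FROM zero_column_table;     -- syntactically empty target list, now legal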
Also, add parse-analysis errors for SELECT DISTINCT with no columns (after
*-expansion) and RETURNING with no columns. These cases previously
produced unexpected behavior because the parsed Query looked like it had
no DISTINCT or RETURNING clause, respectively. If anyone ever offers
a plausible use-case for this, we could work a bit harder on making the
situation distinguishable.
Arguably this is a bug fix that should be back-patched, but I'm worried
that there may be client apps or PLs that expect "SELECT ;" to throw a
syntax error. The issue doesn't seem important enough to risk changing
behavior in minor releases.
Fix an oversight in commit b3aaf9081a1a95c245fd605dcf02c91b3a5c3a29: we do
indeed need to process the planner's append_rel_list when copying RTE
subqueries, because if any of them were flattenable UNION ALL subqueries,
the append_rel_list shows which subquery RTEs were pulled up out of which
other ones. Without this, UNION ALL subqueries aren't correctly inserted
into the update plans for inheritance child tables after the first one,
typically resulting in no update happening for those child table(s).
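A sketch of the kind of statement affected (all table names here are
hypothetical; parent is an inheritance parent with child tables):

    UPDATE parent SET val = sub.val
    FROM (SELECT id, val FROM src1
          UNION ALL
          SELECT id, val FROM src2) sub   -- flattenable UNION ALL subquery
    WHERE parent.id = sub.id;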
Per report from Victor Yegorov.
Experimentation with this case also exposed a fault in commit
a7b965382cf0cb30aeacb112572718045e6d4be7: if an inherited UPDATE/DELETE
was proven totally dummy by constraint exclusion, we might arrive at
add_rtes_to_flat_rtable with root->simple_rel_array being NULL. This
should be interpreted as not having any RelOptInfos. I chose to code
the guard as a check against simple_rel_array_size, so as to also
provide some protection against indexing off the end of the array.
Back-patch to 9.2 where the faulty code was added.
The original performs too poorly; in some scenarios it shows up far too
prominently in profiles. Try to make it a bit smarter to avoid excessive
cost. In particular, give it a maximum size, and keep its entries
sorted in LRU order; once the maximum size is reached, evict the oldest
entry to keep the cache from growing too large.
Per complaint from Andres Freund in connection with new tuple freezing
code.
This prevents a possible longjmp out of the signal handler if a timeout
or SIGINT occurs while something within the handler has transiently set
ImmediateInterruptOK. For safety we must hold off the timeout or cancel
error until we're back in mainline, or at least till we reach the end of
the signal handler when ImmediateInterruptOK was true at entry. This
syncs these functions with the logic now present in handle_sig_alarm.
AFAICT there is no live bug here in 9.0 and up, because I don't think we
currently can wait for any heavyweight lock inside these functions, and
there is no other code (except read-from-client) that will turn on
ImmediateInterruptOK. However, that was not true pre-9.0: in older
branches ProcessIncomingNotify might block trying to lock pg_listener, and
then a SIGINT could lead to undesirable control flow. It might be all
right anyway given the relatively narrow code ranges in which NOTIFY
interrupts are enabled, but for safety's sake I'm back-patching this.
Serious oversight in commit 16e1b7a1b7f7ffd8a18713e83c8cd72c9ce48e07:
we should not allow an interrupt to take control away from mainline code
except when ImmediateInterruptOK is set. Just to be safe, let's adopt
the same save-clear-restore dance that's been used for many years in
HandleCatchupInterrupt and HandleNotifyInterrupt, so that nothing bad
happens if a timeout handler invokes code that tests or even manipulates
ImmediateInterruptOK.
Per report of "stuck spinlock" failures from Christophe Pettus, though
many other symptoms are possible. Diagnosis by Andres Freund.
WAL records of hint bit updates are useful to tools that want to examine
which pages have been modified. In particular, this is required to make
the pg_rewind tool safe (without checksums).
This can also be used to test how much extra WAL-logging would occur if
you enabled checksums, without actually enabling them (which you can't
currently do without re-initdb'ing).
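Assuming the new setting is exposed as the GUC wal_log_hints (a
postmaster-level setting, so a restart is required), it could be enabled
with something like the following, or by editing postgresql.conf
directly:

    ALTER SYSTEM SET wal_log_hints = on;   -- assumes ALTER SYSTEM is available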
Sawada Masahiko, docs by Samrat Revagade. Reviewed by Dilip Kumar, with
further changes by me.
The operation that removes the remaining dead tuples from the page must
be WAL-logged before the setting of the VM bit. Otherwise, if you replay
the WAL to between those two records, you end up with the VM bit set, but
the dead tuples are still there.
Backpatch to 9.3, where this bug was introduced.
This can be used to mark custom built binaries with an extra version
string such as a git describe identifier or distribution package release
version.
From: Oskari Saarenmaa <os@ohmu.fi>
Set min_recovery_apply_delay to force a delay in recovery apply for commit and
restore point WAL records. Other records are replayed immediately. Delay is
measured between WAL record time and local standby time.
Robert Haas, Fabrízio de Royes Mello and Simon Riggs
Detailed review by Mitsumasa Kondo
Integer overflow caused a negative percentage and negative remaining time to be shown, like this:
239300000 of 3800000000 tuples (-48%) done (elapsed 226.86 s, remaining -696.10 s).
We had coverage for functions returning setof a named composite type,
but not for anonymous records, which is a somewhat different code path.
In view of a recent crash report from Sergey Konoplev, this seems worth
testing, though I doubt there's any deterministic bug here today.
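For instance, a call of an anonymous-record-returning function needs a
column definition list at the call site; a minimal sketch (function and
column names are only illustrative):

    CREATE FUNCTION anon_rows() RETURNS SETOF record
    LANGUAGE sql AS
    $$ SELECT 1::int, 'one'::text UNION ALL SELECT 2, 'two' $$;

    SELECT * FROM anon_rows() AS t(id int, label text);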
Make the COPY test, which loads most of the large static tables used in
the tests, also explicitly ANALYZE those tables. This allows us to get
rid of various ad-hoc, and rather redundant, ANALYZE commands that had
gotten stuck into various test scripts over time to ensure we got
consistent plan choices. (We could have done a database-wide ANALYZE,
but that would cause stats to get attached to the small static tables
too, which results in plan changes compared to the historical behavior.
I'm not sure that's a good idea, so not going that far for now.)
Back-patch to 9.0, since 9.0 and 9.1 are currently sometimes failing
regression tests for lack of an "ANALYZE tenk1" in the subselect test.
There's no need for this in 8.4 since we didn't print any plans back
then.
There's no real point in not doing this. It doesn't cost anything
in performance or space. So let's go wild.
Andres Freund, with substantial editing as to style by me.
The test only needs the one table to be vacuumed. Vacuuming the
database may affect other tests.
Per gripe from Tom Lane. Back-patch to 9.3, where the test was added.
When this reloption is set and wal_level=logical is configured,
we'll record the CIDs stamped by inserts, updates, and deletes to
the table just as we would for an actual catalog table. This will
allow logical decoding to use historical MVCC snapshots to access
such tables just as they access ordinary catalog tables.
Replication solutions built around the logical decoding machinery
will likely need to set this option for their configuration
tables; it might also be needed by extensions which perform table
access in their output functions.
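The reloption isn't named above; assuming it is spelled
user_catalog_table, such a configuration table might be declared like
this (table and column names are only illustrative):

    CREATE TABLE replication_config (node_name text, setting text)
      WITH (user_catalog_table = true);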
Andres Freund, reviewed by myself and others.
When wal_level=logical, we'll log columns from the old tuple as
configured by the REPLICA IDENTITY facility added in commit
07cacba983. This makes it possible for
a properly-configured logical replication solution to correctly
follow table updates even if they change the chosen key columns,
or, with REPLICA IDENTITY FULL, even if the table has no key at
all. Note that updates which do not modify the replica identity
column won't log anything extra, making the choice of a good key
(i.e. one that will rarely be changed) important to performance
when wal_level=logical is configured.
Each insert, update, or delete to a catalog table will also log
the CMIN and/or CMAX values stamped by the current transaction.
This is necessary because logical decoding will require access to
historical snapshots of the catalog in order to decode some data
types, and the CMIN/CMAX values that we may need in order to judge
row visibility may have been overwritten by the time we need them.
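For instance, the replica identity can be chosen per table with the
syntax from commit 07cacba983 (table and index names here are only
illustrative):

    ALTER TABLE my_table REPLICA IDENTITY USING INDEX my_table_pkey;
    -- or log the entire old tuple, for tables without a suitable key:
    ALTER TABLE my_table REPLICA IDENTITY FULL;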
Andres Freund, reviewed in various versions by myself, Heikki
Linnakangas, KONDO Mitsumasa, and many others.
An expression such as WHERE (... x IN (SELECT ...) ...) IN (SELECT ...)
could produce an invalid plan that results in a crash at execution time,
if the planner attempts to flatten the outer IN into a semi-join.
This happens because convert_testexpr() was not expecting any nested
SubLinks and would wrongly replace any PARAM_SUBLINK Params belonging
to the inner SubLink. (I think the comment denying that this case could
happen was wrong when written; it's certainly been wrong for quite a long
time, since very early versions of the semijoin flattening logic.)
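A concrete instance of that shape (tables and columns are hypothetical;
s2.b is boolean, so the outer IN compares boolean values):

    SELECT *
    FROM t
    WHERE (t.x IN (SELECT a FROM s1)) IN (SELECT b FROM s2);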
Per report from Teodor Sigaev. Back-patch to all supported branches.
SQL-standard TABLE() is a subset of UNNEST(); they deal with arrays and
other collection types. This feature, however, deals with set-returning
functions. Use a different syntax for this feature to keep open the
possibility of implementing the standard TABLE().
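To illustrate the distinction: SQL-standard UNNEST (and TABLE()) expand
a collection value such as an array, whereas this feature is about
set-returning functions in FROM. A sketch:

    SELECT * FROM UNNEST(ARRAY[1,2,3]) AS u(x);    -- unnesting a collection
    SELECT * FROM generate_series(1,3) AS g(x);    -- a set-returning function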
Previous commit e5de601267 modified dblink
to ensure the client encoding matched the server's. However, the added
PQsetClientEncoding() call added significant overhead. Restore the
original performance in the common case where the client encoding
already matches the server encoding, by doing nothing in that case.
Applies to all active
branches.
Issue reported and work sponsored by Zonar Systems.
This sets up ECDH key exchange, when compiling against OpenSSL that
supports EC. Then the ECDHE-RSA and ECDHE-ECDSA cipher suites can be
used for SSL connections. The latter one means that EC keys are now
usable.
The reason for EC key exchange is that it's faster than DHE and it
allows going to higher security levels, where RSA will be horribly slow.
There is also a new GUC option, ssl_ecdh_curve, that specifies the curve
name used for ECDH. It defaults to "prime256v1", which is the most
common curve in use in HTTPS.
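For example, to select a different curve than the default (assuming the
OpenSSL build supports it; this could equally be set in postgresql.conf):

    ALTER SYSTEM SET ssl_ecdh_curve = 'secp384r1';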
From: Marko Kreen <markokr@gmail.com>
Reviewed-by: Adrian Klaver <adrian.klaver@gmail.com>
The query ID is the internal hash identifier of the statement,
which was not previously exposed in the pg_stat_statements view.
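Assuming the new column is exposed as queryid, it can be used alongside
the existing counters, e.g.:

    SELECT queryid, calls, total_time, query
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 5;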
Daniel Farina, Sameer Thakur and Peter Geoghegan, reviewed by me.
By default, OpenSSL (and SSL/TLS in general) lets the client cipher
order take priority. This is OK for browsers, where the cipher lists are
carefully tuned, but few PostgreSQL client libraries make the cipher order
configurable. So it makes sense to have the cipher order in
postgresql.conf take priority over client defaults.
This patch adds the setting "ssl_prefer_server_ciphers" that can be
turned on so that server cipher order is preferred. Per discussion,
this now defaults to on.
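For example, a site that wants the old behavior back (client cipher
order wins) could turn the setting off, either in postgresql.conf or,
if available, with:

    ALTER SYSTEM SET ssl_prefer_server_ciphers = off;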
From: Marko Kreen <markokr@gmail.com>
Reviewed-by: Adrian Klaver <adrian.klaver@gmail.com>