MinMaxExpr invokes the btree comparison function for its input datatype,
so it's only leakproof if that function is. Many such functions are
indeed leakproof, but others are not, and we should not just assume that
they are. Hence, adjust contain_leaked_vars to verify the leakproofness
of the referenced function explicitly.
I didn't add a regression test because it would need to depend on
some particular comparison function being leaky, and that's a moving
target, per discussion.
This has been wrong all along, so back-patch to supported branches.
Discussion: https://postgr.es/m/31042.1546194242@sss.pgh.pa.us
It's important for link commands to list *.o input files before -l
switches for libraries, as library code may not get pulled into the link
unless referenced by an earlier command-line entry. This is certainly
necessary for static libraries (.a style). Apparently on some platforms
it is also necessary for shared libraries, as reported by Donald Dong.
We often put -l switches for within-tree libraries into LDFLAGS, meaning
that link commands that list *.o files after LDFLAGS are hazardous.
Most of our link commands got this right, but a few did not. In
particular, places that relied on gmake's default implicit link rule
failed, because that puts LDFLAGS first. Fix that by overriding the
built-in rule with our own. The implicit link rules in
src/makefiles/Makefile.* for single-.o-file shared libraries mostly
got this wrong too, so fix them. I also changed the link rules for the
backend and a couple of other places for consistency, even though they
are not (currently) at risk because they aren't adding any -l switches
to LDFLAGS.
Arguably, the real problem here is that we're abusing LDFLAGS by
putting -l switches in it and we should stop doing that. But changing
that would be quite invasive, so I'm not eager to do so.
Perhaps this is a candidate for back-patching, but so far it seems
that problems can only be exhibited in test code we don't normally
build, and at least some of the problems are new in HEAD anyway.
So I'll refrain for now.
Donald Dong and Tom Lane
Discussion: https://postgr.es/m/CAKABAquXn-BF-vBeRZxhzvPyfMqgGuc74p8BmQZyCFDpyROBJQ@mail.gmail.com
This removes a portion of infrastructure introduced by fe0a0b5 to allow
compilation of Postgres in environments where no strong random source is
available, meaning that there is no linking to OpenSSL and no
/dev/urandom (Windows having its own CryptoAPI). No systems shipped
this century lack /dev/urandom, and the buildfarm is actually not
testing this switch at all, so just remove it. This simplifies
particularly some backend code which included a fallback implementation
using shared memory, and removes a set of alternate regression output
files from pgcrypto.
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/20181230063219.GG608@paquier.xyz
fe0a0b5, which has added a stronger random source in Postgres, has
introduced a thinko when creating a padding message which gets encrypted
for Elgamal. The padding message cannot have zeros, which are replaced
by random bytes. However if pg_strong_random() failed, the message
would finish by being considered in correct shape for encryption with
zeros.
Author: Tom Lane
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/20186.1546188423@sss.pgh.pa.us
Backpatch-through: 10
The function name pg_stop_backup() has been included for ages in some
log messages when stopping the backup, which is confusing for base
backups taken with the replication protocol because this function is
never called. Some other comments and messages in this area are
improved while on it.
The new wording is based on input and suggestions from several people,
all listed below.
Author: Michael Paquier
Reviewed-by: Peter Eisentraut, Álvaro Herrera, Tom Lane
Discussion: https://postgr.es/m/20181221040510.GA12599@paquier.xyz
Detect it the way pg_ctl's wait_for_postmaster() does. When pg_regress
spawned a postmaster that failed startup, we were detecting that only
with "pg_regress: postmaster did not respond within 60 seconds".
Back-patch to 9.4 (all supported versions).
Reviewed by Tom Lane.
Discussion: https://postgr.es/m/20181231172922.GA199150@gust.leadboat.com
Mark pg_lsn and oidvector comparison functions as leakproof. Per
discussion, these clearly are leakproof so we might as well mark them so.
On the other hand, remove leakproof markings from name comparison
functions other than equal/not-equal. Now that these depend on
varstr_cmp, they can't be considered leakproof if text comparison isn't.
(This was my error in commit 586b98fdf.)
While at it, add some opr_sanity queries to catch cases where related
functions do not have the same volatility and leakproof markings.
This would clearly be bogus for commutator or negator pairs. In the
domain of btree comparison functions, we do have some exceptions,
because text equality is leakproof but inequality comparisons are not.
That's odd on first glance but is reasonable (for now anyway) given
the much greater complexity of the inequality code paths.
Discussion: https://postgr.es/m/20181231172551.GA206480@gust.leadboat.com
In commit 8b08f7d4820f I added member relationId to IndexStmt struct.
I'm now not sure why; DefineIndex doesn't need it, since the relation
OID is passed as a separate argument anyway. Remove it.
Also remove a redundant assignment to the relationId argument (it wasn't
redundant when added by commit e093dcdd285, but should have been removed
in commit 5f173040e3), and use relationId instead of stmt->relation when
locking the relation in the second phase of CREATE INDEX CONCURRENTLY,
which is not only confusing but it means we resolve the name twice for
no reason.
While rearranging code in tidpath.c, I overlooked the fact that we ought
to check restriction_is_securely_promotable when trying to use a join
clause as a TID qual. Since tideq itself is leakproof, this wouldn't
really allow any interesting leak AFAICT, but it still seems like we
had better check it.
For consistency with the corresponding logic in indxpath.c, also
check rinfo->pseudoconstant. I'm not sure right now that it's
possible for that to be set in a join clause, but if it were,
a match couldn't be made anyway.
Up to now we've not worried much about joins where the join key is a
relation's CTID column, reasoning that storing a table's CTIDs in some
other table would be pretty useless. However, there are use-cases for
this sort of query involving self-joins, so that argument doesn't really
hold water.
With larger relations, a merge or hash join is desirable. We had a btree
opclass for type "tid", allowing merge joins on CTID, but no hash opclass
so that hash joins weren't possible. Add the missing infrastructure.
This also potentially enables hash aggregation on "tid", though the
use-cases for that aren't too clear.
Discussion: https://postgr.es/m/1853.1545453106@sss.pgh.pa.us
Up to now we've not worried much about joins where the join key is a
relation's CTID column, reasoning that storing a table's CTIDs in some
other table would be pretty useless. However, there are use-cases for
this sort of query involving self-joins, so that argument doesn't really
hold water.
This patch allows generating plans for joins on CTID that use a nestloop
with inner TidScan, similar to what we might do with an index on the join
column. This is the most efficient way to join when the outer side of
the nestloop is expected to yield relatively few rows.
This change requires upgrading tidpath.c and the generated TidPaths
to work with RestrictInfos instead of bare qual clauses, but that's
long-postponed technical debt anyway.
Discussion: https://postgr.es/m/17443.1545435266@sss.pgh.pa.us
Doing this requires an assumption that the invoked btree comparison
function is immutable. We could check that explicitly, but in other
places such as contain_mutable_functions we just assume that it's true,
so we may as well do likewise here. (If the comparison function's
behavior isn't immutable, the sort order in indexes built with it would
be unstable, so it seems certainly wrong for it not to be so.)
Vik Fearing
Discussion: https://postgr.es/m/c6e8504c-4c43-35fa-6c8f-3c0b80a912cc@2ndquadrant.com
PL/pgSQL provides a set of callbacks which can be used for extra
instrumentation of functions written in this language called at function
setup, begin and end, as well as statement begin and end. When calling
a routine, a trigger, or an event trigger, statement callbacks are not
getting called for the top-level statement block leading to an
inconsistent handling compared to the other statements. This
inconsistency can potentially complicate extensions doing
instrumentation work on top of PL/pgSQL, so this commit makes sure that
all statement blocks, including the top-level one, go through the
correct corresponding callbacks.
Author: Pavel Stehule
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAFj8pRArEANsaUjo5in9_iQt0vKf9ecwDAmsdN_EBwL13ps12A@mail.gmail.com
Previously we just set the seed based on process ID and start timestamp.
Both those values are directly available within the session, and can
be found out or guessed by other users too, making the session's series
of random(3) values fairly predictable. Up to now, our backend-internal
uses of random(3) haven't seemed security-critical, but commit 88bdbd3f7
added one that potentially is: when using log_statement_sample_rate, a
user might be able to predict which of his SQL statements will get logged.
To improve this situation, upgrade the per-process seed initialization
method to use pg_strong_random() if available, greatly reducing the
predictability of the initial seed value. This adds a few tens of
microseconds to process start time, but since backend startup time is
at least a couple of milliseconds, that seems an acceptable price.
This means that pg_strong_random() needs to be able to run without
reliance on any backend infrastructure, since it will be invoked
before any of that is up. It was safe for that already, but adjust
comments and #include commands to make it clearer.
Discussion: https://postgr.es/m/3859.1545849900@sss.pgh.pa.us
Previously, the SQL random() function depended on libc's random(3),
and setseed() invoked srandom(3). This results in interference between
these functions and backend-internal uses of random(3). We'd never paid
too much mind to that, but in the wake of commit 88bdbd3f7 which added
log_statement_sample_rate, the interference arguably has a security
consequence: if log_statement_sample_rate is active then an unprivileged
user could probably control which if any of his SQL commands get logged,
by issuing setseed() at the right times. That seems bad.
To fix this reliably, we need random() and setseed() to use their own
private random state variable. Standard random(3) isn't amenable to such
usage, so let's switch to pg_erand48(). It's hard to say whether that's
more or less "random" than any particular platform's version of random(3),
but it does have a wider seed value and a longer period than are required
by POSIX, so we can hope that this isn't a big downgrade. Also, we should
now have uniform behavior of random() across platforms, which is worth
something.
While at it, upgrade the per-process seed initialization method to use
pg_strong_random() if available, greatly reducing the predictability
of the initial seed value. (I'll separately do something similar for
the internal uses of random().)
In addition to forestalling the possible security problem, this has a
benefit in the other direction, which is that we can now document
setseed() as guaranteeing a reproducible sequence of random() values.
Previously, because of the possibility of internal calls of random(3),
we could not promise any such thing.
Discussion: https://postgr.es/m/3859.1545849900@sss.pgh.pa.us
Get rid of the multiplier and addend variables in favor of hard-wired
constants. Do the multiply-and-add using uint64 arithmetic, rather
than manually combining several narrower multiplications and additions.
Make _dorand48 return the full-width new random value, and have its
callers use that directly (after suitable masking) rather than
reconstructing what they need from the unsigned short[] representation.
On my machine, this is good for a nearly factor-of-2 speedup of
pg_erand48(), probably mostly from needing just one call of ldexp()
rather than three. The wins for the other functions are smaller
but measurable. While none of the existing call sites are really
performance-critical, a cycle saved is a cycle earned; and besides
the machine code is smaller this way (at least on x86_64).
Patch by me, but the original idea to optimize this by switching
to int64 arithmetic is from Fabien Coelho.
Discussion: https://postgr.es/m/1551.1546018192@sss.pgh.pa.us
POSIX specifies that jrand48() returns a signed 32-bit value (in the
range [-2^31, 2^31)), but our code was returning an unsigned 32-bit
value (in the range [0, 2^32)). This doesn't actually matter to any
existing call site, because they all cast the "long" result to int32
or uint32; but it will doubtless bite somebody in the future.
To fix, cast the arithmetic result to int32 explicitly before the
compiler widens it to long (if widening is needed).
While at it, upgrade this file's far-short-of-project-style comments.
Had there been some peer pressure to document pg_jrand48() properly,
maybe this thinko wouldn't have gotten committed to begin with.
Backpatch to v10 where pg_jrand48() was added, just in case somebody
back-patches a fix that uses it and depends on the standard behavior.
Discussion: https://postgr.es/m/17235.1545951602@sss.pgh.pa.us
The documentation of ON DELETE and ON UPDATE uses the term "action",
which is also used on the ALTER TABLE page for other purposes. This
commit renames the term to "referential_action", which is more
consistent with the SQL specification. The new term is now used on the
documentation of both CREATE TABLE and ALTER TABLE for consistency.
Reported-by: Brigitte Blanc-Lafay
Author: Lætitia Avrot
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/CAB_COdiHEVVs0uB+uYCjjYUwQ4YFFekppq+Xqv6qAM8+cd42gA@mail.gmail.com
Isolation test suite of GIN predicate locking was criticized for being too slow,
especially under Valgrind. This commit is intended to accelerate it. Tests are
simplified in the following ways.
1) Amount of data is reduced. We're now close to the minimal amount of data,
which produces at least one posting tree and at least two pages of entry
tree.
2) Three isolation tests are merged into one.
3) Only one tuple is queried from posting tree. So, locking of index is the
same, but tuple locks are not propagated to relation lock. Also, it is
faster.
4) Test cases itself are simplified. Now each test case run just one INSERT
and one SELECT involving GIN, which either conflict or not.
Discussion: https://postgr.es/m/20181204000740.ok2q53nvkftwu43a%40alap3.anarazel.de
Reported-by: Andres Freund
Tested-by: Andrew Dunstan
Author: Alexander Korotkov
Backpatch-through: 11
According to README we acquire predicate locks on entry tree leafs and posting
tree roots. However, when ginFindLeafPage() is going to lock leaf in exclusive
mode, then it checks root for conflicts regardless whether it's a entry or
posting tree. Assuming that we never place predicate lock on entry tree root
(excluding corner case when root is leaf), this check is redundant. This
commit removes this check. Now, root conflict checking is controlled by
separate argument of ginFindLeafPage().
Discussion: https://postgr.es/m/CAPpHfdv7rrDyy%3DMgsaK-L9kk0AH7az0B-mdC3w3p0FSb9uoyEg%40mail.gmail.com
Author: Alexander Korotkov
Backpatch-through: 11
Inheritance trees can include temporary tables if the parent is
permanent, which makes possible the presence of multiple temporary
children from different sessions. Trying to issue a TRUNCATE on the
parent in this scenario causes a failure, so similarly to any other
queries just ignore such cases, which makes TRUNCATE work
transparently.
This makes truncation behave similarly to any other DML query working on
the parent table with queries which need to be work on the children. A
set of isolation tests is added to cover basic cases.
Reported-by: Zhou Digoal
Author: Amit Langote, Michael Paquier
Discussion: https://postgr.es/m/15565-ce67a48d0244436a@postgresql.org
Backpatch-through: 9.4
While it seems OK to not be concerned about fsync() failure for a
pre-existing signal file, it's not OK to not even check for open()
failure. This at least causes complaints from static analyzers,
and I think on some platforms passing -1 to fsync() or close() might
trigger assertion-type failures. Also add (void) casts to make clear
that we're ignoring fsync's result intentionally.
Oversights in commit 2dedf4d9a, noted by Coverity.
I made a frontend fprintf() format use %m, forgetting that that's only
safe in HEAD not the back branches; prior to 96bf88d52 and d6c55de1f,
it would work on glibc platforms but not elsewhere. Revert to using
%s ... strerror(errno) as the code did before.
We could have left HEAD as-is, but for code consistency across branches,
I chose to apply this patch there too.
Per Coverity and a few buildfarm members.
This fixes two issues with the completion of ALTER TABLE and ALTER INDEX
after SET STATISTICS is typed, trying to suggest schema objects while
the grammar only allows integers.
The tab completion of ALTER INDEX is made smarter by handling properly
more patterns. COLUMN is an optional keyword, but as no column numbers
can be suggested yet as possible input simply adjust the completion so
as no incorrect queries are generated.
Author: Michael Paquier
Reviewed-by: Tatsuro Yamada
Discussion: https://postgr.es/m/20181219092255.GC680@paquier.xyz
At the end of recovery for the post-promotion process, a new history
file is created followed by the last partial segment of the previous
timeline. Based on the timing, the archiver would first try to archive
the last partial segment and then the history file. This can delay the
detection of a new timeline taken, particularly depending on the time it
takes to transfer the last partial segment as it delays the moment the
history file of the new timeline gets archived. This can cause promoted
standbys to use the same timeline as one already taken depending on the
circumstances if multiple instances look at archives at the same
location.
This commit changes the order of archiving so as history files are
archived in priority over other file types, which reduces the likelihood
of the same timeline being taken (still not reducing the window to
zero), and it makes the archiver behave more consistently with the
startup process doing its post-promotion business.
Author: David Steele
Reviewed-by: Michael Paquier, Kyotaro Horiguchi
Discussion: https://postgr.es/m/929068cf-69e1-bba2-9dc0-e05986aed471@pgmasters.net
Backpatch-through: 9.5
COPY can skip writing WAL when loading data on a table which has been
created in the same transaction as the one loading the data, however
this cannot work on views or foreign table as this would result in
trying to flush relation files which do not exist. So disable the
optimization so as commands are able to work the same way with any
configuration of wal_level.
Tests are added to cover the different cases, which need to have
wal_level set to minimal to allow the problem to show up, and that is
not the default configuration.
Reported-by: Luis M. Carril, Etsuro Fujita
Author: Amit Langote, Michael Paquier
Reviewed-by: Etsuro Fujita
Discussion: https://postgr.es/m/15552-c64aa14c5c22f63c@postgresql.org
Backpatch-through: 10, where support for COPY on views has been added,
while v11 has added support for COPY on foreign tables.
In passing, move the list of parameters where it can be used for both
CREATE TABLE and ALTER TABLE, and reorder it alphabetically.
Author: Dagfinn Ilmari Mannsåker
Discussion: https://postgr.es/m/d8j1s77kdbb.fsf@dalvik.ping.uio.no
013ebc0a7b implements so-called GiST microvacuum. That is gistgettuple() marks
index tuples as dead when kill_prior_tuple is set. Later, when new tuple
insertion claims page space, those dead index tuples are physically deleted
from page. When this deletion is replayed on standby, it might conflict with
read-only queries. But 013ebc0a7b doesn't handle this. That may lead to
disappearance of some tuples from read-only snapshots on standby.
This commit implements resolving of conflicts between replay of GiST microvacuum
and standby queries. On the master we implement new WAL record type
XLOG_GIST_DELETE, which comprises necessary information. On stable releases
we've to be tricky to keep WAL compatibility. Information required for conflict
processing is just appended to data of XLOG_GIST_PAGE_UPDATE record. So,
PostgreSQL version, which doesn't know about conflict processing, will just
ignore that.
Reported-by: Andres Freund
Diagnosed-by: Andres Freund
Discussion: https://postgr.es/m/20181212224524.scafnlyjindmrbe6%40alap3.anarazel.de
Author: Alexander Korotkov
Backpatch-through: 9.6
The SQL spec says that sql_identifier is a domain over varchar,
but it also says that that domain is supposed to represent the set
of valid identifiers for the implementation, in particular applying
a length limit matching the implementation's identifier length limit.
We were declaring sql_identifier as just "character varying", thus
duplicating what the spec says about base type, but entirely failing
at the rest of it.
Instead, let's declare sql_identifier as a domain over type "name".
(We can drop the COLLATE "C" added by commit 6b0faf723, since that's
now implicit in "name".) With the recent improvements to name's
comparison support, there's not a lot of functional difference between
name and varchar. So although in principle this is a spec deviation,
it's a pretty minor one. And correctly enforcing PG's name length limit
is a good thing; on balance this seems closer to the intent of the spec
than what we had.
But that's all just language-lawyering. The *real* reason to do this is
that it makes sql_identifier columns exposed by information_schema views
be just direct representations of the underlying "name" catalog columns,
eliminating a semantic mismatch that was disastrous for performance of
typical queries on the information_schema. In combination with the
recent change to allow dropping no-op CoerceToDomain nodes, this allows
(for example) queries such as
select ... from information_schema.tables where table_name = 'foo';
to produce an indexscan rather than a seqscan on pg_class.
Discussion: https://postgr.es/m/CAFj8pRBUCX4LZ2rA2BbEkdD6NN59mgx+BLo1gO08Wod4RLtcTg@mail.gmail.com
information_schema output columns that are declared as being type
sql_identifier are supposed to conform to the implementation's rules
for valid identifiers, in particular the identifier length limit.
Several places potentially violated this limit by concatenating a
function's name and OID. (The OID is added to ensure name uniqueness
within a schema, since the spec doesn't expect function name overloading.)
Simply truncating the concatenation result to fit in "name" won't do,
since losing part of the OID might wind up giving non-unique results.
Instead, let's truncate the function name as necessary.
The most practical way to do that is to do it in a C function; the
information_schema.sql script doesn't have easy access to the value
of NAMEDATALEN, nor does it have an easy way to truncate on the basis
of resulting byte-length rather than number of characters.
(There are still a couple of places that cast concatenation results to
sql_identifier, but as far as I can see they are guaranteed not to produce
over-length strings, at least with the normal value of NAMEDATALEN.)
Discussion: https://postgr.es/m/23817.1545283477@sss.pgh.pa.us
For probably bogus reasons, we acquire only AccessShareLock on the
partition when we try to detach it from its parent partitioned table.
This can cause ugly things to happen if another transaction is doing
any sort of DDL to the partition concurrently.
Upgrade that lock to ShareUpdateExclusiveLock, which per discussion
seems to be the minimum needed.
Reported by Robert Haas.
Discussion: https://postgr.es/m/CA+TgmoYruJQ+2qnFLtF1xQtr71pdwgfxy3Ziy-TxV28M6pEmyA@mail.gmail.com
"$user" in a search_path string is replaced by CURRENT_USER not
SESSION_USER. (It actually was SESSION_USER in the initial implementation,
but we changed it shortly later, and evidently forgot to fix the docs to
match.)
Noted by antonov@stdpr.ru
Discussion: https://postgr.es/m/159151fb45d490c8d31ea9707e9ba99d@stdpr.ru
Using the full width of the CPU's native word size shouldn't cost
anything in typical cases. When working with large bitmapsets,
this halves the number of operations needed for many common BMS
operations. On the right sort of test case, a measurable improvement
is obtained.
David Rowley
Discussion: https://postgr.es/m/CAKJS1f9EGBd2h-VkXvb=51tf+X46zMX5T8h-KYgXEV_u2zmLUw@mail.gmail.com
When a partition is detached from its parent, we acquire locks on all
attached indexes to also detach them ... but we release those locks
immediately. This is a violation of the policy of keeping locks on user
objects to the end of the transaction. Bug introduced in 8b08f7d4820f.
It's unclear that there are any ill effects possible, but it's clearly
wrong nonetheless. It's likely that bad behavior *is* possible, but
mostly because the relation that the index is for is only locked with
AccessShareLock, which is an older bug that shall be fixed separately.
While touching that line of code, close the index opened with
index_open() using index_close() instead of relation_close().
No difference in practice, but let's be consistent.
Unearthed by Robert Haas.
Discussion: https://postgr.es/m/CA+TgmoYruJQ+2qnFLtF1xQtr71pdwgfxy3Ziy-TxV28M6pEmyA@mail.gmail.com