Since partitioned tables in pg12 do not have toast tables, trying to set
the toast OID confuses pg_upgrade. Have pg_dump omit those values to
avoid the problem.
Per Andres Freund and buildfarm members crake and snapper
Discussion: https://postgr.es/m/20190306204104.yle5jfbnqkcwykni@alap3.anarazel.de
Similarly to B-tree, GiST index access method gets support of INCLUDE
attributes. These attributes aren't used for tree navigation and aren't
present in non-leaf pages. But they are present in leaf pages and can be
fetched during index-only scan.
The point of having INCLUDE attributes in GiST indexes is slightly different
from the point of having them in B-tree. The main point of INCLUDE attributes
in B-tree is to define UNIQUE constraint over part of attributes enabled for
index-only scan. In GiST the main point of INCLUDE attributes is to use
index-only scan for attributes, whose data types don't have GiST opclasses.
Discussion: https://postgr.es/m/73A1A452-AD5F-40D4-BD61-978622FF75C1%40yandex-team.ru
Author: Andrey Borodin, with small changes by me
Reviewed-by: Andreas Karlsson
This allows a login to require both that the cn of the certificate
matches (like authentication type cert) *and* that another
authentication method (such as password or kerberos) succeeds as well.
The old value of clientcert=1 maps to the new clientcert=verify-ca,
clientcert=0 maps to the new clientcert=no-verify, and the new option
erify-full will add the validation of the CN.
Author: Julian Markwort, Marius Timmer
Reviewed by: Magnus Hagander, Thomas Munro
This adds a column that counts how many checksum failures have occurred
on files belonging to a specific database. Both checksum failures
during normal backend processing and those created when a base backup
detects a checksum failure are counted.
Author: Magnus Hagander
Reviewed by: Julien Rouhaud
When the timezone is UTC, timestamptz and timestamp are binary coercible
in both directions. See b8a18ad485 and
c22ecc6562 for the previous attempt in
this problem space. Skip the table rewrite; for now, continue to
needlessly rewrite any index on an affected column.
Reviewed by Simon Riggs and Tom Lane.
Discussion: https://postgr.es/m/20190226061450.GA1665944@rfd.leadboat.com
This fixes two sets of issues related to the use of transient files in
the backend:
1) OpenTransientFile() has been used in some code paths with read-write
flags while read-only is sufficient, so switch those calls to be
read-only where necessary. These have been reported by Joe Conway.
2) When opening transient files, it is up to the caller to close the
file descriptors opened. In error code paths, CloseTransientFile() gets
called to clean up things before issuing an error. However in normal
exit paths, a lot of callers of CloseTransientFile() never actually
reported errors, which could leave a file descriptor open without
knowing about it. This is an issue I complained about a couple of
times, but never had the courage to write and submit a patch, so here we
go.
Note that one frontend code path is impacted by this commit so as an
error is issued when fetching control file data, making backend and
frontend to be treated consistently.
Reported-by: Joe Conway, Michael Paquier
Author: Michael Paquier
Reviewed-by: Álvaro Herrera, Georgios Kokolatos, Joe Conway
Discussion: https://postgr.es/m/20190301023338.GD1348@paquier.xyz
Discussion: https://postgr.es/m/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com
Certain libxml2 versions (such as the 2.7.6 commonly seen in older
distributions, but apparently only on x86_64) contain a bug that causes
xmlCopyNode, when called on a XML_DOCUMENT_NODE, to return a node that
xmlFreeNode crashes on. Arrange to call xmlFreeDoc instead of
xmlFreeNode for those nodes.
Per buildfarm members lapwing and grison.
Author: Pavel Stehule, light editing by Álvaro.
Discussion: https://postgr.es/m/20190308024436.GA2374@alvherre.pgsql
Use Getopt::Long in preference to hand-rolled option parsing code.
Also, remove "-I .../backend/catalog" switch from the Makefile
invocations. That's been unnecessary for some time, and leaving it
there gives the false impression it's needed in manual invocations.
John Naylor (extracted from a larger but more controversial patch)
Discussion: https://postgr.es/m/CACPNZCsHdcQN2jQ1=ptbi1Co2Nj3aHgRCUMk62=ThgWNabPY+Q@mail.gmail.com
tuple_data_split() lacked the type of the first argument, and
heap_page_item_attrs() has reversed the first and second argument,
with the bytea argument using an incorrect name.
Author: Laurenz Albe
Discussion: https://postgr.es/m/8f9ab7b16daf623e87eeef5203a4ffc0dece8dfd.camel@cybertec.at
Backpatch-through: 9.6
When 2dedf4d9 has integrated recovery.conf into postgresql.conf, it also
changed pg_basebackup -R in the way recovery configuration is
generated. However this implementation forgot the fact that
pg_basebackup needs to keep compatibility with older server versions as
well.
Reported-by: Devrim Gündüz
Author: Sergei Kornilov, Michael Paquier
Discussion: https://postgr.es/m/3458f7cd12d74acd90180a671c8d5a081d60e162.camel@gunduz.org
Correctly process nodes of more types than previously. In some cases,
nodes were being ignored (nothing was output); in other cases, trying to
return them resulted in errors about unrecognized nodes. In yet other
cases, necessary escaping (of XML special characters) was not being
done. Fix all those (as far as the authors could find) and add
regression tests cases verifying the new behavior.
I (Álvaro) was of two minds about backpatching these changes. They do
seem bugfixes that would benefit most users of the affected functions;
but on the other hand it would change established behavior in minor
releases, so it seems prudent not to.
Authors: Pavel Stehule, Markus Winand, Chapman Flack
Discussion:
https://postgr.es/m/CAFj8pRA6J25CtAZ2TuRvxK3gat7-bBUYh0rfE2yM7Hj9GD14Dg@mail.gmail.comhttps://postgr.es/m/8BDB0627-2105-4564-AA76-7849F028B96E@winand.at
The elephant in the room as pointed out by Chapman Flack, not fixed in
this commit, is that we still have XMLTABLE operating on XPath 1.0
instead of the standard-mandated XQuery (or even its subset XPath 2.0).
Fixing that is a major undertaking, however.
When we introduced separate ProjectSetPath nodes for application of
set-returning functions in v10, we inadvertently broke some cases where
we're supposed to recognize that the result of a subquery is known to be
empty (contain zero rows). That's because IS_DUMMY_REL was just looking
for a childless AppendPath without allowing for a ProjectSetPath being
possibly stuck on top. In itself, this didn't do anything much worse
than produce slightly worse plans for some corner cases.
Then in v11, commit 11cf92f6e rearranged things to allow the scan/join
targetlist to be applied directly to partial paths before they get
gathered. But it inserted a short-circuit path for dummy relations
that was a little too short: it failed to insert a ProjectSetPath node
at all for a targetlist containing set-returning functions, resulting in
bogus "set-valued function called in context that cannot accept a set"
errors, as reported in bug #15669 from Madelaine Thibaut.
The best way to fix this mess seems to be to reimplement IS_DUMMY_REL
so that it drills down through any ProjectSetPath nodes that might be
there (and it seems like we'd better allow for ProjectionPath as well).
While we're at it, make it look at rel->pathlist not cheapest_total_path,
so that it gives the right answer independently of whether set_cheapest
has been done lately. That dependency looks pretty shaky in the context
of code like apply_scanjoin_target_to_paths, and even if it's not broken
today it'd certainly bite us at some point. (Nastily, unsafe use of the
old coding would almost always work; the hazard comes down to possibly
looking through a dangling pointer, and only once in a blue moon would
you find something there that resulted in the wrong answer.)
It now looks like it was a mistake for IS_DUMMY_REL to be a macro: if
there are any extensions using it, they'll continue to use the old
inadequate logic until they're recompiled, after which they'll fail
to load into server versions predating this fix. Hopefully there are
few such extensions.
Having fixed IS_DUMMY_REL, the special path for dummy rels in
apply_scanjoin_target_to_paths is unnecessary as well as being wrong,
so we can just drop it.
Also change a few places that were testing for partitioned-ness of a
planner relation but not using IS_PARTITIONED_REL for the purpose; that
seems unsafe as well as inconsistent, plus it required an ugly hack in
apply_scanjoin_target_to_paths.
In passing, save a few cycles in apply_scanjoin_target_to_paths by
skipping processing of pre-existing paths for partitioned rels,
and do some cosmetic cleanup and comment adjustment in that function.
I renamed IS_DUMMY_PATH to IS_DUMMY_APPEND with the intention of breaking
any code that might be using it, since in almost every case that would
be wrong; IS_DUMMY_REL is what to be using instead.
In HEAD, also make set_dummy_rel_pathlist static (since it's no longer
used from outside allpaths.c), and delete is_dummy_plan, since it's no
longer used anywhere.
Back-patch as appropriate into v11 and v10.
Tom Lane and Julien Rouhaud
Discussion: https://postgr.es/m/15669-02fb3296cca26203@postgresql.org
We still require AccessExclusiveLock on the partition itself, because
otherwise an insert that violates the newly-imposed partition
constraint could be in progress at the same time that we're changing
that constraint; only the lock level on the parent relation is
weakened.
To make this safe, we have to cope with (at least) three separate
problems. First, relevant DDL might commit while we're in the process
of building a PartitionDesc. If so, find_inheritance_children() might
see a new partition while the RELOID system cache still has the old
partition bound cached, and even before invalidation messages have
been queued. To fix that, if we see that the pg_class tuple seems to
be missing or to have a null relpartbound, refetch the value directly
from the table. We can't get the wrong value, because DETACH PARTITION
still requires AccessExclusiveLock throughout; if we ever want to
change that, this will need more thought. In testing, I found it quite
difficult to hit even the null-relpartbound case; the race condition
is extremely tight, but the theoretical risk is there.
Second, successive calls to RelationGetPartitionDesc might not return
the same answer. The query planner will get confused if lookup up the
PartitionDesc for a particular relation does not return a consistent
answer for the entire duration of query planning. Likewise, query
execution will get confused if the same relation seems to have a
different PartitionDesc at different times. Invent a new
PartitionDirectory concept and use it to ensure consistency. This
ensures that a single invocation of either the planner or the executor
sees the same view of the PartitionDesc from beginning to end, but it
does not guarantee that the planner and the executor see the same
view. Since this allows pointers to old PartitionDesc entries to
survive even after a relcache rebuild, also postpone removing the old
PartitionDesc entry until we're certain no one is using it.
For the most part, it seems to be OK for the planner and executor to
have different views of the PartitionDesc, because the executor will
just ignore any concurrently added partitions which were unknown at
plan time; those partitions won't be part of the inheritance
expansion, but invalidation messages will trigger replanning at some
point. Normally, this happens by the time the very next command is
executed, but if the next command acquires no locks and executes a
prepared query, it can manage not to notice until a new transaction is
started. We might want to tighten that up, but it's material for a
separate patch. There would still be a small window where a query
that started just after an ATTACH PARTITION command committed might
fail to notice its results -- but only if the command starts before
the commit has been acknowledged to the user. All in all, the warts
here around serializability seem small enough to be worth accepting
for the considerable advantage of being able to add partitions without
a full table lock.
Although in general the consequences of new partitions showing up
between planning and execution are limited to the query not noticing
the new partitions, run-time partition pruning will get confused in
that case, so that's the third problem that this patch fixes.
Run-time partition pruning assumes that indexes into the PartitionDesc
are stable between planning and execution. So, add code so that if
new partitions are added between plan time and execution time, the
indexes stored in the subplan_map[] and subpart_map[] arrays within
the plan's PartitionedRelPruneInfo get adjusted accordingly. There
does not seem to be a simple way to generalize this scheme to cope
with partitions that are removed, mostly because they could then get
added back again with different bounds, but it works OK for added
partitions.
This code does not try to ensure that every backend participating in
a parallel query sees the same view of the PartitionDesc. That
currently doesn't matter, because we never pass PartitionDesc
indexes between backends. Each backend will ignore the concurrently
added partitions which it notices, and it doesn't matter if different
backends are ignoring different sets of concurrently added partitions.
If in the future that matters, for example because we allow writes in
parallel query and want all participants to do tuple routing to the same
set of partitions, the PartitionDirectory concept could be improved to
share PartitionDescs across backends. There is a draft patch to
serialize and restore PartitionDescs on the thread where this patch
was discussed, which may be a useful place to start.
Patch by me. Thanks to Alvaro Herrera, David Rowley, Simon Riggs,
Amit Langote, and Michael Paquier for discussion, and to Alvaro
Herrera for some review.
Discussion: http://postgr.es/m/CA+Tgmobt2upbSocvvDej3yzokd7AkiT+PvgFH+a9-5VV1oJNSQ@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoZE0r9-cyA-aY6f8WFEROaDLLL7Vf81kZ8MtFCkxpeQSw@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com
This clause is used to indicate the passing mode of a XML document, but
we were doing it wrong: we accepted BY REF and ignored it, and rejected
BY VALUE as a syntax error. The reality, however, is that documents are
always passed BY VALUE, so rejecting that clause was silly. Change
things so that we accept BY VALUE.
BY REF continues to be accepted, and continues to be ignored.
Author: Chapman Flack
Reviewed-by: Pavel Stehule
Discussion: https://postgr.es/m/5C297BB7.9070509@anastigmatix.net
Before commit 3fa2bb31 this type appeared in the catalogs to
select which of several block storage mechanisms each relation
used.
New features under development propose to revive the concept of
different block storage managers for new kinds of data accessed
via bufmgr.c, but don't need to put references to them in the
catalogs. So, avoid useless maintenance work on this type by
dropping it. Update some regression tests that were referencing
it where any type would do.
Discussion: https://postgr.es/m/CA%2BhUKG%2BDE0mmiBZMtZyvwWtgv1sZCniSVhXYsXkvJ_Wo%2B83vvw%40mail.gmail.com
Until now the the slot to store the conflicting tuple, and the result
of the ON CONFLICT SET, where reused between partitions. That
necessitated changing slots descriptor when switching partitions.
Besides the overhead of switching descriptors on a slot (which
requires memory allocations and prevents JITing), that's importantly
also problematic for tableam. There individual partitions might belong
to different tableams, needing different kinds of slots.
In passing also fix ExecOnConflictUpdate to clear the existing slot at
exit. Otherwise that slot could continue to hold a pin till the query
ends, which could be far too long if the input data set is large, and
there's no further conflicts. While previously also problematic, it's
now more important as there will be more such slots when partitioned.
Author: Andres Freund
Reviewed-By: Robert Haas, David Rowley
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
In a complete brown paper bag moment, I forgot to include equalfuncs
in my previous fix of copy/out/readfuncs. Thanks Tom for noticing.
Discussion: https://postgr.es/m/1659.1551903210@sss.pgh.pa.us
This includes a catversion bump, as IntoClause is theoretically
speaking part of storable rules. In practice I don't think that can
happen, but there's no reason to be stingy here.
Per buildfarm member calliphoridae.
This adds pg_dump support for table AMs in a similar manner to how
tablespaces are handled. That is, instead of specifying the AM for
every CREATE TABLE etc, emit SET default_table_access_method
statements. That makes it easier to change the AM for all/most tables
in a dump, and allows restore to succeed even if some AM is not
available.
This increases the dump archive version, as a tables/matview's AM
needs to be tracked therein.
Author: Dimitri Dolgov, Andres Freund
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.dehttps://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de
This introduces the concept of table access methods, i.e. CREATE
ACCESS METHOD ... TYPE TABLE and
CREATE TABLE ... USING (storage-engine).
No table access functionality is delegated to table AMs as of this
commit, that'll be done in following commits.
Subsequent commits will incrementally abstract table access
functionality to be routed through table access methods. That change
is too large to be reviewed & committed at once, so it'll be done
incrementally.
Docs will be updated at the end, as adding them incrementally would
likely make them less coherent, and definitely is a lot more work,
without a lot of benefit.
Table access methods are specified similar to index access methods,
i.e. pg_am.amhandler returns, as INTERNAL, a pointer to a struct with
callbacks. In contrast to index AMs that struct needs to live as long
as a backend, typically that's achieved by just returning a pointer to
a constant struct.
Psql's \d+ now displays a table's access method. That can be disabled
with HIDE_TABLEAM=true, which is mainly useful so regression tests can
be run against different AMs. It's quite possible that this behaviour
still needs to be fine tuned.
For now it's not allowed to set a table AM for a partitioned table, as
we've not resolved how partitions would inherit that. Disallowing
allows us to introduce, if we decide that's the way forward, such a
behaviour without a compatibility break.
Catversion bumped, to add the heap table AM and references to it.
Author: Haribabu Kommi, Andres Freund, Alvaro Herrera, Dimitri Golgov and others
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.dehttps://postgr.es/m/20160812231527.GA690404@alvherre.pgsqlhttps://postgr.es/m/20190107235616.6lur25ph22u5u5av@alap3.anarazel.dehttps://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de
I broke/typoed this in 4da597edf1. Astonishingly this mostly
doesn't cause breakage, except when trying to change the tuple
descriptor of a slot (because TTS_FLAG_FIXED is assumed to be set).
Author: Andres Freund
The original 200 default value was set back in f425b605f4 when the cost
delay settings were first added. Hardware has improved quite a bit since
then and we've also made improvements such as sorting buffers during
checkpoints (9cd00c457e) which should result in less random writes.
This low default value was reportedly causing problems with badly
configured servers and in the absence of a native method to remove
excessive bloat from tables without incurring an AccessExclusiveLock, this
often made cleaning up the damage caused by badly configured auto-vacuums
difficult.
It seems more likely that someone will notice that auto-vacuum is running
too quickly than too slowly, so let's go all out and multiple the default
value for the setting by 10. With the default vacuum_cost_page_dirty and
autovacuum_vacuum_cost_delay (assuming a page size of 8192 bytes), this
allows autovacuum a theoretical maximum dirty write rate of around 39MB/s
instead of just 3.9MB/s.
Author: David Rowley
Discussion: https://postgr.es/m/CAKJS1f_YbXC2qTMPyCbmsPiKvZYwpuQNQMohiRXLj1r=8_rYvw@mail.gmail.com
This test fails if the containing directory contains a funny character
such as a space or some perl metacharacter. To avoid that, we check for
files names using readdir and a regex, rather than using a glob pattern.
Discussion: https://postgr.es/m/CAM6_UM6dGdU39PKAC24T+HD9ouy0jLN9vH6163K8QEEzr__iZw@mail.gmail.com
Author: Fabien COELHO
Reviewed-by: Raúl Marín Rodríguez
Scanning an index in physical order is faster than walking it in logical
order, because sequential I/O is faster than random I/O. The idea and code
structure is borrowed from B-tree vacuum code.
Patch by Andrey Borodin, with changes by me. Based on early work by
Konstantin Kuznetsov, although the patch has been rewritten multiple times
since his original version.
Discussion: https://www.postgresql.org/message-id/1B9FAC6F-FA19-4A24-8C1B-F4F574844892%40yandex-team.ru
For some reason the dump test with names with high bits set fails on
Msys2 (although not Msys1). Disable the tests for now, so that other
tests can run.
This is the only test that fails when run as an admin user. The reason
is that when Postgres is started via pg_ctl its admin privileges are
lowered. However, this test called 'postgres -D datadir' directly,
resulting in a failure. Replace that by calling pg_ctl and then checking
the result for the expected failure, and the logfile for the expected
error message.
Currently only frogmouth in the buildfarm uses the 32bit params, and
it's not able to build past release 10, so put those last, saving
substantial configure time on more modern systems. Even if we get a
modern 32 bit Windows system at some stage we should probably prefer the
64 bit interface here these days.
The implementation of readdir() in src/port/ which gets used by MSVC has
been added in 399a36a, and since the beginning it considers all errors
on the first file lookup as ENOENT, setting errno accordingly and
letting the routine caller think that the directory is empty. While
this is normally enough for the case of the backend, this can confuse
callers of this routine on Windows as all errors would map to the same
behavior. So, for example, even permission errors would be thought as
having an empty directory, while there could be contents in it.
This commit changes the error handling so as readdir() gets a behavior
similar to native implementations: force errno=0 when seeing
ERROR_FILE_NOT_FOUND as error and consider other errors as plain
failures.
While looking at the patch, I noticed that MinGW does not enforce
errno=0 when looking at the first file, but it gets enforced on the next
file lookups. A comment related to that was incorrect in the code.
Reported-by: Yuri Kurenkov
Diagnosed-by: Yuri Kurenkov, Grigory Smolkin
Author: Konstantin Knizhnik
Reviewed-by: Andrew Dunstan, Michael Paquier
Discussion: https://postgr.es/m/2cad7829-8d66-e39c-b937-ac825db5203d@postgrespro.ru
Backpatch-through: 9.4
The test crashes and burns quite badly, for some reason, but even if it
didn't it wouldn't work, since Windows doesn't let you rename a file
held by a running process.
StoreIndexTuple was a loop over index_getattr, which is O(N^2)
if the index columns are variable-width, and the performance
impact is already quite visible at ten columns. The obvious
move is to replace that with a call to index_deform_tuple ...
but that's *also* a loop over index_getattr. Improve it to
be essentially a clone of heap_deform_tuple.
(There are a few other places that loop over all index columns
with index_getattr, and perhaps should be changed likewise,
but most of them don't seem performance-critical. Anyway, the
rest would mostly only be interested in the index key columns,
which there aren't likely to be so many of. Wide index tuples
are a new thing with INCLUDE.)
Konstantin Knizhnik
Discussion: https://postgr.es/m/e06b2d27-04fc-5c0e-bb8c-ecd72aa24959@postgrespro.ru
Commit f092de05 added a test for pg_dumpall --exclude-database including
the wildcard pattern '*dump*' which matches some files in the source
directory. The test library on msys uses the shell which expands this
and thus the program gets incorrect arguments. This doesn't happen if
the pattern doesn't match any files, so here the pattern is set to
'*dump_test*' which is such a pattern.
Per buildfarm animal jacana.
Previously, rewriteTargetListIU() generated a list of attribute
numbers from the targetlist, which were passed to rewriteValuesRTE(),
which expected them to contain the same number of entries as there are
columns in the VALUES RTE, and to be in the same order. That was fine
when the target relation was a table, but for an updatable view it
could be broken in at least three different ways ---
rewriteTargetListIU() could insert additional targetlist entries for
view columns with defaults, the view columns could be in a different
order from the columns of the underlying base relation, and targetlist
entries could be merged together when assigning to elements of an
array or composite type. As a result, when recursing to the base
relation, the list of attribute numbers generated from the rewritten
targetlist could no longer be relied upon to match the columns of the
VALUES RTE. We got away with that prior to 41531e42d3 because it used
to always be the case that rewriteValuesRTE() did nothing for the
underlying base relation, since all DEFAULTS had already been replaced
when it was initially invoked for the view, but that was incorrect
because it failed to apply defaults from the base relation.
Fix this by examining the targetlist entries more carefully and
picking out just those that are simple Vars referencing the VALUES
RTE. That's sufficient for the purposes of rewriteValuesRTE(), which
is only responsible for dealing with DEFAULT items in the VALUES
RTE. Any DEFAULT item in the VALUES RTE that doesn't have a matching
simple-Var-assignment in the targetlist is an error which we complain
about, but in theory that ought to be impossible.
Additionally, move this code into rewriteValuesRTE() to give a clearer
separation of concerns between the 2 functions. There is no need for
rewriteTargetListIU() to know about the details of the VALUES RTE.
While at it, fix the comment for rewriteValuesRTE() which claimed that
it doesn't support array element and field assignments --- that hasn't
been true since a3c7a993d5 (9.6 and later).
Back-patch to all supported versions, with minor differences for the
pre-9.6 branches, which don't support array element and field
assignments to the same column in multi-row VALUES lists.
Reviewed by Amit Langote.
Discussion: https://postgr.es/m/15623-5d67a46788ec8b7f@postgresql.org