This is essentially the tableam version of heapam_fetch(),
i.e. fetching a tuple identified by a tid, performing visibility
checks.
Note that this is different from table_index_fetch_tuple(), which is for
index lookups. It therefore has to handle a tid pointing to an earlier
version of a tuple if the AM uses an optimization like heap's HOT. Add
comments to that end.
This commit removes the stats_relation argument from heap_fetch, as
it's been unused for a long time.
Author: Andres Freund
Reviewed-By: Haribabu Kommi
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
Previously those directly performed a heap_insert(). Use
table_insert() instead. The input slot of those routines is not of
the target relation - we could fix that by copying if necessary, but
that'd not be beneficial for performance. As those codepaths don't
access any AM specific tuple fields (say xmin/xmax), there's no need
to use an AM specific slot.
Author: Andres Freund
Reviewed-By: Haribabu Kommi
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
Now that the ordering of DROP messages ought to be stable everywhere,
we should not need these kluges of hiding DETAIL output just to avoid
unstable ordering. Hiding it is not great for test coverage, so
let's undo that where possible.
In a small number of places, it's necessary to leave it in, for
example because the output might include a variable pg_temp_nnn
schema name. I also left things alone in places where the details
would depend on other regression test scripts, e.g. plpython_drop.sql.
Perhaps buildfarm experience will show this to be a bad idea,
but if so I'd like to know why.
Discussion: https://postgr.es/m/E1h6eep-0001Mw-Vd@gemulon.postgresql.org
Commit 8aa9dd74b didn't quite finish the job in this area after all,
because DROP ROLE has a code path distinct from DROP OWNED BY, and
it was still reporting dependent objects in whatever order the index
scan returned them in.
Buildfarm experience shows that index ordering of equal-keyed objects is
significantly less stable than before in the wake of using heap TIDs as
tie-breakers. So if we try to hide the unstable ordering by suppressing
DETAIL reports, we're just going to end up having to do that for every
DROP that reports multiple objects. That's not great from a coverage
or problem-detection standpoint, and it's something we'll inevitably
forget in future patches, leading to more iterations of fixing-an-
unstable-result. So let's just bite the bullet and sort here too.
Discussion: https://postgr.es/m/E1h6eep-0001Mw-Vd@gemulon.postgresql.org
It doesn't make sense to consider the possibility that there will only
be one candidate split point when choosing among split points to find
the split with the lowest penalty. This is a vestige of an earlier
version of the patch that became commit fab25024.
Issue spotted while rereviewing coverage of the nbtree patch series
using gcov.
The code would do "PQclear(res)" twice if lo_unlink failed, evidently
due to careless thinking about how far out a "break" would break.
Remove the extra PQclear and adjust the loop logic so that we'll fall
out of both levels of loop after an error, as was clearly the intent.
Spotted by Coverity. I have no idea why it took this long to notice,
since the bug has been there since commit 67ccbb080. Accordingly,
back-patch to all supported branches.
Since its introduction in 19dc233c, current_logfiles has been assigned
the same permissions as a log file, which can be enforced with
log_file_mode. This setup can lead to incompatibility problems with
group access permissions as current_logfiles is not located in the log
directory, but at the root of the data folder. Hence, if group
permissions are used but log_file_mode is more restrictive, a backup
with a user in the group having read access could fail even if the log
directory is located outside of the data folder.
Per discussion with the folks mentioned below, we have concluded that
current_logfiles should not be treated as a log file as it only stores
metadata related to log files, and that it should use the same
permissions as all other files in the data directory. This solution
has the merit of being simple and fixes all the interaction problems
between group access and log_file_mode.
Author: Haribabu Kommi
Reviewed-by: Stephen Frost, Robert Haas, Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/CAJrrPGcEotF1P7AWoeQyD3Pqr-0xkQg_Herv98DjbaMj+naozw@mail.gmail.com
Backpatch-through: 11, where group access was added.
Add command variants COMMIT AND CHAIN and ROLLBACK AND CHAIN, which
start new transactions with the same transaction characteristics as the
just finished one, per SQL standard.
Support for transaction chaining in PL/pgSQL is also added. This
functionality is especially useful when running COMMIT in a loop in
PL/pgSQL.
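As a rough illustration of the new syntax (transaction modes chosen
arbitrarily):

    BEGIN ISOLATION LEVEL REPEATABLE READ, READ ONLY;
    SELECT 1;
    COMMIT AND CHAIN;  -- immediately starts a new transaction
    SELECT 2;          -- still REPEATABLE READ, READ ONLY
    COMMIT;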
Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
Discussion: https://www.postgresql.org/message-id/flat/28536681-324b-10dc-ade8-ab46f7645a5a@2ndquadrant.com
This adds new, required, table AM callbacks for insert/delete/update
and lock_tuple. To be able to reasonably use those, the EvalPlanQual
mechanism had to be adapted, moving more logic into the AM.
Previously both delete/update/lock call-sites and the EPQ mechanism had
to have awareness of the specific tuple format to be able to fetch the
latest version of a tuple. Obviously that needs to be abstracted
away. To do so, move the logic that finds the latest row version into
the AM. lock_tuple has a new flag argument,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last
version, rather than the current one. It'd have been possible to do
so via a separate callback as well, but finding the last version
usually also necessitates locking the newest version, making it
sensible to combine the two. This replaces the previous use of
EvalPlanQualFetch(). Additionally HeapTupleUpdated, which previously
signaled either a concurrent update or delete, is now split into two,
to avoid callers needing AM specific knowledge to differentiate.
The move of finding the latest row version into tuple_lock means that
encountering a row concurrently moved into another partition will now
raise an error about "tuple to be locked" rather than "tuple to be
updated/deleted" - which is accurate, as that always happens when
locking rows. While possibly slightly less helpful for users, it seems
like an acceptable trade-off.
As part of this commit HTSU_Result has been renamed to TM_Result, and
its members have been expanded to differentiate between updating and
deleting. HeapUpdateFailureData has been renamed to TM_FailureData.
The interface to speculative insertion is changed so nodeModifyTable.c
does not have to set the speculative token itself anymore. Instead
there's a version of tuple_insert, tuple_insert_speculative, that
performs the speculative insertion (without requiring a flag to signal
that fact), and the speculative insertion is either made permanent
with table_complete_speculative(succeeded = true) or aborted with
table_complete_speculative(succeeded = false).
Note that multi_insert is not yet routed through tableam, nor is
COPY. Changing multi_insert requires changes to copy.c that are large
enough that they are better done separately.
Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED
VIEW are also only going to be adjusted in a later commit.
Author: Andres Freund and Haribabu Kommi
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
Previously we were using the SQL:2003 definition, which doesn't allow
this, but that creates a serious dump/restore gotcha: there is no
setting of xmloption that will allow all valid XML data. Hence,
switch to the 2006 definition.
Since libxml doesn't accept <!DOCTYPE> directives in the mode we
use for CONTENT parsing, the implementation is to detect <!DOCTYPE>
in the input and switch to DOCUMENT parsing mode. This should not
cost much, because <!DOCTYPE> should be close to the front of the
input if it's there at all. It's possible that this causes the
error messages for malformed input to be slightly different than
they were before, if said input includes <!DOCTYPE>; but that does
not seem like a big problem.
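A minimal illustration of the new behavior (assuming a build with
libxml support):

    SET xmloption = content;
    SELECT '<!DOCTYPE doc [<!ELEMENT doc EMPTY>]><doc/>'::xml;
    -- previously rejected in CONTENT mode, now parsed as a document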
In passing, buy back a few cycles in parsing of large XML documents
by not doing strlen() of the whole input in parse_xml_decl().
Back-patch because dump/restore failures are not nice. This change
shouldn't break any cases that worked before, so it seems safe to
back-patch.
Chapman Flack (revised a bit by me)
Discussion: https://postgr.es/m/CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com
Suppress 3 lines of unstable DETAIL output from a DROP ROLE statement in
event_trigger.sql. This is further cleanup for commit dd299df8.
Note that the event_trigger test instability issue is very similar to
the recently suppressed foreign_data test instability issue. Both
issues involve DETAIL output for a DROP ROLE statement that needed to be
changed as part of dd299df8.
Per buildfarm member macaque.
Teach nbtree forward index scans to check the high key before moving to
the right sibling page in the hope of finding that it isn't actually
necessary to do so. The new check may indicate that the scan definitely
cannot find matching tuples to the right, ending the scan immediately.
We already opportunistically force a similar "continuescan oriented"
key check of the final non-pivot tuple when it's clear that it cannot be
returned to the scan due to being dead-to-all. The new high key check
is complementary.
The new approach for forward scans is more effective than checking the
final non-pivot tuple, especially with composite indexes and non-unique
indexes. The improvements to the logic for picking a split point added
by commit fab25024 make it likely that relatively dissimilar high keys
will appear on a page. A distinguishing key value that can only appear
on non-pivot tuples on the right sibling page will often be present in
leaf page high keys.
Since forcing the final item to be key checked no longer makes any
difference in the case of forward scans, the existing extra key check is
now only used for backwards scans. Backward scans continue to
opportunistically check the final non-pivot tuple, which is actually the
first non-pivot tuple on the page (not the last).
Note that even pg_upgrade'd v3 indexes make use of this optimization.
Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com
Previously there was basically no coverage for UPDATEs encountering
deleted rows, and no coverage for DELETE having to perform EPQ. That's
problematic for an upcoming commit in which EPQ is taught to integrate
with tableams. Also, there was no test where an UPDATE encounters a row
that has been UPDATEd into another partition.
Author: Andres Freund
This option is consistent with what pg_dump, pg_rewind and
pg_basebackup provide. It is useful for saving I/O effort when testing
things, and is not meant to be used in a production environment.
Author: Michael Paquier
Reviewed-by: Michael Banck, Fabien Coelho, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de
An offline cluster can now work with more modes in pg_checksums:
- --enable enables checksums in a cluster, updating all blocks with a
correct checksum, and updating the control file at the end.
- --disable disables checksums in a cluster, updating only the control
file.
- --check is an extra option able to verify checksums for a cluster, and
the default used if no mode is specified.
When running --enable or --disable, the data folder gets fsync'd for
durability, and then it is followed by a control file update and flush
to keep the operation consistent should the tool be interrupted, killed
or the host unplugged. If no mode is specified in the options, then
--check is used for compatibility with older versions of pg_checksums
(named pg_verify_checksums in v11 where it was introduced).
Author: Michael Banck, Michael Paquier
Reviewed-by: Fabien Coelho, Magnus Hagander, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de
Postpone most of the effort of constructing PartitionedRelPruneInfos
until after we have found out whether run-time pruning is needed at all.
This costs very little duplicated effort (basically just an extra
find_base_rel() call per partition) and saves quite a bit when we
can't do run-time pruning.
Also, merge the first loop (for building relid_subpart_map) into
the second loop, since we don't need the map to be valid during
that loop.
Amit Langote
Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp
This is almost a straight revert of commit fff518d, which itself was a
revert of 7d3bf73ac.
It turns out that commit 8aa9dd74, which sorted dependent objects before
deletion in DROP OWNED BY, was not sufficient to make all remaining
unstable DETAIL output stable. Unstable DETAIL output from DROP ROLE
was not affected, because that happens to use a different code path. It
doesn't seem worthwhile to fix the other code path at this time.
Discussion: https://postgr.es/m/6226.1553274783@sss.pgh.pa.us
I (tgl) remain dubious that it's a good idea for PartitionDirectory
to hold a pin on a relcache entry throughout planning, rather than
copying the data or using some kind of refcount scheme. However, it's
certainly the responsibility of the PartitionDirectory code to ensure
that what it's handing back is a stable data structure, not that of
its caller. So this is a pretty clear oversight in commit 898e5e329,
and one that can cost a lot of performance when there are many
partitions.
Amit Langote (extracted from a much larger patch set)
Discussion: https://postgr.es/m/CA+TgmoY3bRmGB6-DUnoVy5fJoreiBJ43rwMrQRCdPXuKt4Ykaw@mail.gmail.com
Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp
There were more large constants that needed UINT64CONST. And one variable
was declared as "int", when it needed to be uint64. These bugs were only
visible on 32-bit systems; clearly I should've tested on one, given that
this code does a lot of work with 64-bit integers.
Also, in the "huge distances" test, the code created some values with
random distances between them, but the test logic didn't take into account
the possibility that the random distance was exactly 1. That never actually
happens with the seed we're using, but let's be tidy.
Buildfarm member 'woodlouse' failed one of the tests, and I'm not sure
which test failed. Better to print the names of the tests, so that the
failing test's name will appear in regression.diffs on failure.
To do this, we scan the GiST index twice. In the first pass we make
note of empty leaf pages and internal pages. In the second pass we scan
through the internal pages, looking for downlinks to the empty pages.
Deleting internal pages is still not supported; as in nbtree, the last
child of an internal page is never deleted. That means that if you have
a workload where new keys are always inserted into a different area than
where old keys are removed, the index will still grow without bound. But
the rate of growth will be an order of magnitude slower than before.
Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/B1E4DF12-6CD3-4706-BDBD-BF3283328F60@yandex-team.ru
The set is implemented as a B-tree, with a compact representation of
leaf items using the Simple-8b algorithm, so that clusters of nearby
values use less memory.
The IntegerSet isn't used for anything yet, aside from the test code, but
we have two patches in the works that would benefit from this: A patch to
allow GiST vacuum to delete empty pages, and a patch to reduce heap
VACUUM's memory usage, by storing the list of dead TIDs more efficiently
and lifting the 1 GB limit on its size.
This includes a unit test module, in src/test/modules/test_integerset.
It can be used to verify correctness, as a regression test, but if you
run it manually, it can also print memory usage and execution time of
some of the tests.
Author: Heikki Linnakangas, Andrey Borodin
Reviewed-by: Julien Rouhaud
Discussion: https://www.postgresql.org/message-id/b5e82599-1966-5783-733c-1a947ddb729f@iki.fi
This adds a flag "deterministic" to collations. If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal. That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.
This functionality is only supported with the ICU provider. At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.
The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).
This patch makes changes in three areas:
- CREATE COLLATION DDL changes and system catalog changes to support
this new flag.
- Many executor nodes and auxiliary code are extended to track
collations. Previously, this code would just throw away collation
information, because the eventually-called user-defined functions
didn't use it since they only cared about equality, which didn't
need collation information.
- String data type functions that do equality comparisons and hashing
are changed to take the (non-)deterministic flag into account. For
comparison, this just means skipping various shortcuts and tie
breakers that use byte-wise comparison. For hashing, we first need
to convert the input string to a canonical "sort key" using the ICU
analogue of strxfrm().
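For example, a case-insensitive collation can now be created like this
(the ICU locale string is illustrative):

    CREATE COLLATION case_insensitive
      (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
    SELECT 'ABC' = 'abc' COLLATE case_insensitive;  -- true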
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
Trying to call the function with the top-most parent of a partition tree
was leading to a crash. In this case the correct result is to return
the top-most parent itself.
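Assuming this refers to pg_partition_root(), a sketch of the problem
case (table names illustrative):

    CREATE TABLE parted (a int) PARTITION BY LIST (a);
    CREATE TABLE parted_1 PARTITION OF parted FOR VALUES IN (1);
    SELECT pg_partition_root('parted');  -- now returns 'parted'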
Reported-by: Álvaro Herrera
Author: Michael Paquier
Reviewed-by: Amit Langote
Discussion: https://postgr.es/m/20190322032612.GA323@alvherre.pgsql
When DefineIndex recurses to create constraints on partitions, it needs
to use the value returned by index_constraint_create to set up partition
dependencies. However, in the course of fixing the DEPENDENCY_INTERNAL_AUTO
mess, commit 1d92a0c9f7dd introduced some code to that function that
clobbered the return value, causing the recorded OID to be of the wrong
object. Close examination of pg_depend after creating the tables leads
to indescribable objects :-( My sin (in commit bdc3d7fa2376, while
preparing for DDL deparsing in event triggers) was to use a variable
name for the return value that's typically used for throwaway objects in
dependency-setting calls ("referenced"). Fix by changing the variable
names to match established practice (the return value is "myself" rather
than "referenced").
The pg_upgrade test notices the problem (in an indirect way: the pg_dump
outputs are in different order), but only if you create the objects in a
specific way that wasn't being used in the existing tests. Add a stanza
to leave some objects around that shows the bug.
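The scenario exercised is roughly of this shape (object names
illustrative, not the actual test objects):

    CREATE TABLE idxpart (a int PRIMARY KEY) PARTITION BY RANGE (a);
    CREATE TABLE idxpart1 PARTITION OF idxpart
      FOR VALUES FROM (0) TO (100);
    -- the pg_depend entries tying the partition's index/constraint to
    -- its parent previously recorded the wrong referenced object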
Catversion bump because preexisting databases might have bogus pg_depend
entries.
Discussion: https://postgr.es/m/20190318204235.GA30360@alvherre.pgsql
These commands allow the argument type list to be omitted if there is
just one object that matches by name. However, if that syntax was
used with DROP IF EXISTS and there was more than one match, you got
a "function ... does not exist, skipping" notice message rather than a
truthful complaint about the ambiguity. This was basically due to
poor factorization and a rats-nest of logic, so refactor the relevant
lookup code to make it cleaner.
Note that this amounts to narrowing the scope of which sorts of
error conditions IF EXISTS will bypass. Per discussion, we only
intend it to skip no-such-object cases, not multiple-possible-matches
cases.
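For instance, with DROP FUNCTION (one of the affected commands; error
wording approximate):

    CREATE FUNCTION ambig(int) RETURNS int LANGUAGE sql AS 'SELECT $1';
    CREATE FUNCTION ambig(text) RETURNS text LANGUAGE sql AS 'SELECT $1';
    DROP FUNCTION IF EXISTS ambig;
    -- before: NOTICE:  function ambig() does not exist, skipping
    -- now:    ERROR:   function name "ambig" is not unique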
Per bug #15572 from Ash Marath. Although this definitely seems like
a bug, it's not clear that people would thank us for changing the
behavior in minor releases, so no back-patch.
David Rowley, reviewed by Julien Rouhaud and Pavel Stehule
Discussion: https://postgr.es/m/15572-ed1b9ed09503de8a@postgresql.org
LDAP servers can be advertised on a network with RFC 2782 DNS SRV
records. The OpenLDAP command-line tools automatically try to find
servers that way, if no server name is provided by the user. Teach
PostgreSQL to do the same using OpenLDAP's support functions, when
building with OpenLDAP.
For now, we assume that HAVE_LDAP_INITIALIZE (an OpenLDAP extension
available since OpenLDAP 2.0 and also present in Apple LDAP) implies
that you also have ldap_domain2hostlist() (which arrived in the same
OpenLDAP version and is also present in Apple LDAP).
Author: Thomas Munro
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/CAEepm=2hAnSfhdsd6vXsM6VZVN0br-FbAZ-O+Swk18S5HkCP=A@mail.gmail.com
This finishes a task we left undone in commit f1ad067fc, by extending
the delete-in-descending-OID-order rule to deletions triggered by
DROP OWNED BY. We've coped with machine-dependent deletion orders
one time too many, and the new issues caused by Peter G's recent
nbtree hacking seem like the last straw.
Discussion: https://postgr.es/m/E1h6eep-0001Mw-Vd@gemulon.postgresql.org
Unstable sort order related to changes to nbtree from commit dd299df8
can cause two lines of DETAIL output to be in opposite-of-expected
order. Suppress the output using the same VERBOSITY hack that is used
elsewhere in the foreign_data tests.
Note that the same foreign_data.out DETAIL output was mechanically
updated by commit dd299df8. Only a few such changes were required,
though.
Per buildfarm member batfish.
Discussion: https://postgr.es/m/CAH2-WzkCQ_MtKeOpzozj7QhhgP1unXsK8o9DMAFvDqQFEPpkYQ@mail.gmail.com
I unnecessarily removed this check in 3de241dba86f because I
misunderstood what the final representation of constraints across a
partitioning hierarchy was to be. Put it back (in both branches).
Discussion: https://postgr.es/m/201901222145.t6wws6t6vrcu@alvherre.pgsql
Teach contrib/amcheck's bt_index_parent_check() function to take
advantage of the uniqueness property of heapkeyspace indexes in support
of a new verification option: non-pivot tuples (non-highkey tuples on
the leaf level) can optionally be re-found using a new search for each,
that starts from the root page. If a tuple cannot be re-found, report
that the index is corrupt.
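A sketch of how the new option is invoked (index name illustrative):

    SELECT bt_index_parent_check('some_table_pkey',
                                 heapallindexed => true,
                                 rootdescend => true);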
The new "rootdescend" verification option is exhaustive, and can
therefore make a call to bt_index_parent_check() take a lot longer.
Re-finding tuples during verification is mostly intended as an option
for backend developers, since the corruption scenarios that it alone is
uniquely capable of detecting seem fairly far-fetched.
For example, "rootdescend" verification is much more likely to detect
corruption of the least significant byte of a key from a pivot tuple in
the root page of a B-Tree that already has at least three levels.
Typically, only a few tuples on a cousin leaf page are at risk of
"getting overlooked" by index scans in this scenario. The corrupt key
in the root page is only slightly corrupt: corrupt enough to give wrong
answers to some queries, and yet not corrupt enough to allow the problem
to be detected without verifying agreement between the leaf page and the
root page, skipping at least one internal page level. The existing
bt_index_parent_check() checks never cross more than a single level.
Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-Wz=yTWnVu+HeHGKb2AGiADL9eprn-cKYAto4MkKOuiGtRQ@mail.gmail.com