The limitations that it is not allowed to create/attach a foreign table
as a partition of an indexed partitioned table were not documented.
Reported-By: Stepan Yankevych
Author: Etsuro Fujita
Reviewed-By: Amit Langote
Backpatch-through: 11 where partitioned index was introduced
Discussion: https://postgr.es/m/1553869152.858391073.5f8m3n0x@frv53.fwdcdn.com
There was some confusion over the format of the error message returned
from the server during GSSAPI startup; specifically, it was expected
that a length would be returned when, in reality, at this early stage in
the startup sequence, no length is returned from the server as part of
an error message.
Correct the client-side code for dealing with error messages sent by the
server during startup by simply reading what's available into our
buffer, after we've discovered it's an error message, and then reporting
back what was returned.
In passing, also add in documentation of the environment variable
PGGSSENCMODE which was missed previously, and adjust the code to look
for the PGGSSENCMODE variable (the environment variable change was
missed in the prior GSSMODE -> GSSENCMODE commit).
Error-handling issue discovered by Peter Eisentraut, the rest were items
discovered during testing of the error handling.
Since 11, it is possible to use a non-superuser role when using an
online source cluster with pg_rewind as long as the role has proper
permissions to execute on the source all the functions used by
pg_rewind, and the documentation stated that a superuser is necessary.
Let's add at the same time all the details needed to create such a
role.
A second confusion which comes a lot from users is that it is necessary
to issue a checkpoint on a freshly-promoted standby so as its control
file has up-to-date timeline information which is used by pg_rewind to
validate the operation. Let's document that properly. This is
back-patched down to 9.5 where pg_rewind has been introduced.
Author: Michael Paquier
Reviewed-by: Magnus Hagander
Discussion: https://postgr.es/m/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb+Bu=h1m=rLdn=g@mail.gmail.com
Backpatch-through: 9.5
This is intended for use mostly in test scripts for external tools,
which could do without cross-PG-version variations in error message
wording. Of course, the SQLSTATE isn't guaranteed stable either, but
it should be more so than the error message text.
Note: there's a bit of an ABI change for libpq here, but it seems
OK because if somebody compiles against a newer version of libpq-fe.h,
and then tries to pass PQERRORS_SQLSTATE to PQsetErrorVerbosity()
of an older libpq library, it will be accepted and then act like
PQERRORS_DEFAULT, thanks to the way the tests in pqBuildErrorMessage3
have historically been phrased. That seems acceptable.
Didier Gautheron, reviewed by Dagfinn Ilmari Mannsåker
Discussion: https://postgr.es/m/CAJRYxuKyj4zA+JGVrtx8OWAuBfE-_wN4sUMK4H49EuPed=mOBw@mail.gmail.com
The previous convention that stdout was selected by default when nothing
is specified was just too error-prone.
After a suggestion from Andrew Gierth.
Author: Euler Taveira
Reviewed-by: Yoshikazu Imai, José Arthur Benetasso Villanova
Discussion: https://postgr.es/m/87sgwrmhdv.fsf@news-spur.riddles.org.uk
This commit adds a new reloption, vacuum_index_cleanup, which
controls whether index cleanup is performed for a particular
relation by default. It also adds a new option to the VACUUM
command, INDEX_CLEANUP, which can be used to override the
reloption. If neither the reloption nor the VACUUM option is
used, the default is true, as before.
Masahiko Sawada, reviewed and tested by Nathan Bossart, Alvaro
Herrera, Kyotaro Horiguchi, Darafei Praliaskouski, and me.
The wording of the documentation is mostly due to me.
Discussion: http://postgr.es/m/CAD21AoAt5R3DNUZSjOoXDUY=naYPUOuffVsRzuTYMz29yLzQCA@mail.gmail.com
This adds documentation about the user oriented parts of table access
methods (i.e. the default_table_access_method GUC and the USING clause
for CREATE TABLE etc), adds a basic chapter about the table access
method interface, and adds a note to storage.sgml that it's contents
don't necessarily apply for non-builtin AMs.
Author: Haribabu Kommi and Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
Query planning is affected by a number of configuration options, and it
may be crucial to know which of those options were set to non-default
values. With this patch you can say EXPLAIN (SETTINGS ON) to include
that information in the query plan. Only options affecting planning,
with values different from the built-in default are printed.
This patch also adds auto_explain.log_settings option, providing the
same capability in auto_explain module.
Author: Tomas Vondra
Reviewed-by: Rafia Sabih, John Naylor
Discussion: https://postgr.es/m/e1791b4c-df9c-be02-edc5-7c8874944be0@2ndquadrant.com
This is useful to obtain a view of the different transaction types in an
application, regardless of the durations of the statements each runs.
Author: Adrien Nayrat
Reviewed-by: Masahiko Sawada, Hayato Kuroda, Andres Freund
On both the frontend and backend, prepare for GSSAPI encryption
support by moving common code for error handling into a separate file.
Fix a TODO for handling multiple status messages in the process.
Eliminate the OIDs, which have not been needed for some time.
Add frontend and backend encryption support functions. Keep the
context initiation for authentication-only separate on both the
frontend and backend in order to avoid concerns about changing the
requested flags to include encryption support.
In postmaster, pull GSSAPI authorization checking into a shared
function. Also share the initiator name between the encryption and
non-encryption codepaths.
For HBA, add "hostgssenc" and "hostnogssenc" entries that behave
similarly to their SSL counterparts. "hostgssenc" requires either
"gss", "trust", or "reject" for its authentication.
Similarly, add a "gssencmode" parameter to libpq. Supported values are
"disable", "require", and "prefer". Notably, negotiation will only be
attempted if credentials can be acquired. Move credential acquisition
into its own function to support this behavior.
Add a simple pg_stat_gssapi view similar to pg_stat_ssl, for monitoring
if GSSAPI authentication was used, what principal was used, and if
encryption is being used on the connection.
Finally, add documentation for everything new, and update existing
documentation on connection security.
Thanks to Michael Paquier for the Windows fixes.
Author: Robbie Harwood, with changes to the read/write functions by me.
Reviewed in various forms and at different times by: Michael Paquier,
Andres Freund, David Steele.
Discussion: https://www.postgresql.org/message-id/flat/jlg1tgq1ktm.fsf@thriss.redhat.com
Previously, while primary keys could be made on partitioned tables, it
was not possible to define foreign keys that reference those primary
keys. Now it is possible to do that.
Author: Álvaro Herrera
Reviewed-by: Amit Langote, Jesper Pedersen
Discussion: https://postgr.es/m/20181102234158.735b3fevta63msbj@alvherre.pgsql
This uses the progress reporting infrastructure added by c16dc1aca5e0,
adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY.
There are two pieces to this: one is index-AM-agnostic, and the other is
AM-specific. The latter is fairly elaborate for btrees, including
reportage for parallel index builds and the separate phases that btree
index creation uses; other index AMs, which are much simpler in their
building procedures, have simplistic reporting only, but that seems
sufficient, at least for non-concurrent builds.
The index-AM-agnostic part is fairly complete, providing insight into
the CONCURRENTLY wait phases as well as block-based progress during the
index validation table scan. (The index validation index scan requires
patching each AM, which has not been included here.)
Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada
Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
This adds a new option to pg_checksums called -P/--progress, showing
every second some information about the computation state of an
operation for --check and --enable (--disable only updates the control
file and is quick). This requires a pre-scan of the data folder so as
the total size of checksummable items can be calculated, and then it
gets compared to the amount processed.
Similarly to what is done for pg_rewind and pg_basebackup, the
information printed in the progress report consists of the current
amount of data computed and the total amount of data to compute. This
could be extended later on.
Author: Michael Banck, Bernd Helmle
Reviewed-by: Fabien Coelho, Michael Paquier
Discussion: https://postgr.es/m/1535719851.1286.17.camel@credativ.de
On at least ZFS, it can be beneficial to create new WAL files every
time and not to bother zero-filling them. Since it's not clear which
other filesystems might benefit from one or both of those things,
add individual GUCs to control those two behaviors independently and
make only very general statements in the docs.
Author: Jerry Jelinek, with some adjustments by Thomas Munro
Reviewed-by: Alvaro Herrera, Andres Freund, Tomas Vondra, Robert Haas and others
Discussion: https://postgr.es/m/CACPQ5Fo00QR7LNAcd1ZjgoBi4y97%2BK760YABs0vQHH5dLdkkMA%40mail.gmail.com
Remove the code that supported zipfian distribution parameters less
than 1.0, as it had undocumented performance hazards, and it's not
clear that the case is useful enough to justify either fixing or
documenting those hazards.
Also, since the code path for parameter > 1.0 could perform badly
for values very close to 1.0, establish a minimum allowed value
of 1.001. This solution seems superior to the previous vague
documentation warning about small values not performing well.
Fabien Coelho, per a gripe from Tomas Vondra
Discussion: https://postgr.es/m/b5e172e9-ad22-48a3-86a3-589afa20e8f7@2ndquadrant.com
This unifies the various ad hoc logging (message printing, error
printing) systems used throughout the command-line programs.
Features:
- Program name is automatically prefixed.
- Message string does not end with newline. This removes a common
source of inconsistencies and omissions.
- Additionally, a final newline is automatically stripped, simplifying
use of PQerrorMessage() etc., another common source of mistakes.
- I converted error message strings to use %m where possible.
- As a result of the above several points, more translatable message
strings can be shared between different components and between
frontends and backend, without gratuitous punctuation or whitespace
differences.
- There is support for setting a "log level". This is not meant to be
user-facing, but can be used internally to implement debug or
verbose modes.
- Lazy argument evaluation, so no significant overhead if logging at
some level is disabled.
- Some color in the messages, similar to gcc and clang. Set
PG_COLOR=auto to try it out. Some colors are predefined, but can be
customized by setting PG_COLORS.
- Common files (common/, fe_utils/, etc.) can handle logging much more
simply by just using one API without worrying too much about the
context of the calling program, requiring callbacks, or having to
pass "progname" around everywhere.
- Some programs called setvbuf() to make sure that stderr is
unbuffered, even on Windows. But not all programs did that. This
is now done centrally.
Soft goals:
- Reduces vertical space use and visual complexity of error reporting
in the source code.
- Encourages more deliberate classification of messages. For example,
in some cases it wasn't clear without analyzing the surrounding code
whether a message was meant as an error or just an info.
- Concepts and terms are vaguely aligned with popular logging
frameworks such as log4j and Python logging.
This is all just about printing stuff out. Nothing affects program
flow (e.g., fatal exits). The uses are just too varied to do that.
Some existing code had wrappers that do some kind of print-and-exit,
and I adapted those.
I tried to keep the output mostly the same, but there is a lot of
historical baggage to unwind and special cases to consider, and I
might not always have succeeded. One significant change is that
pg_rewind used to write all error messages to stdout. That is now
changed to stderr.
Reviewed-by: Donald Dong <xdong@csumb.edu>
Reviewed-by: Arthur Zakirov <a.zakirov@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/6a609b43-4f57-7348-6480-bd022f924310@2ndquadrant.com
This commit makes existing GIN operator classes jsonb_ops and json_path_ops
support "jsonb @@ jsonpath" and "jsonb @? jsonpath" operators. Basic idea is
to extract statements of following form out of jsonpath.
key1.key2. ... .keyN = const
The rest of jsonpath is rechecked from heap.
Catversion is bumped.
Discussion: https://postgr.es/m/fcc6fc6a-b497-f39a-923d-aa34d0c588e8%402ndQuadrant.com
Author: Nikita Glukhov, Alexander Korotkov
Reviewed-by: Jonathan Katz, Pavel Stehule
This moves sample scan support to below tableam. It's not optional as
there is, in contrast to e.g. bitmap heap scans, no alternative way to
perform tablesample queries. If an AM can't deal with the block based
API, it will have to throw an ERROR.
The tableam callbacks for this are block based, but given the current
TsmRoutine interface, that seems to be required.
The new interface doesn't require TsmRoutines to perform visibility
checks anymore - that requires the TsmRoutine to know details about
the AM, which we want to avoid. To continue to allow taking the
returned number of tuples account SampleScanState now has a donetuples
field (which previously e.g. existed in SystemRowsSamplerData), which
is only incremented after the visibility check succeeds.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
This is an SQL-standard feature that allows creating columns that are
computed from expressions rather than assigned, similar to a view or
materialized view but on a column basis.
This implements one kind of generated column: stored (computed on
write). Another kind, virtual (computed on read), is planned for the
future, and some room is left for it.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/b151f851-4019-bdb1-699e-ebab07d2f40a@2ndquadrant.com
This commit reorders the paragraphs of the Notes section in order of
importance, and clarifies better the safe uses of pg_checksums for
replication setups.
Author: Fabien Coelho
Discussion: https://postgr.es/m/alpine.DEB.2.21.1903231404280.18811@lancre
Especially, warn about the hazards of mishandling the backup_label
file. Adjust a couple of server messages to be more clear about
the hazards associated with removing backup_label files, too.
David Steele and Robert Haas, reviewed by Laurenz Albe, Martín
Marqués, Peter Eisentraut, and Magnus Hagander.
Discussion: http://postgr.es/m/7d85c387-000e-16f0-e00b-50bf83c22127@pgmasters.net
This adds the CONCURRENTLY option to the REINDEX command. A REINDEX
CONCURRENTLY on a specific index creates a new index (like CREATE
INDEX CONCURRENTLY), then renames the old index away and the new index
in place and adjusts the dependencies, and then drops the old
index (like DROP INDEX CONCURRENTLY). The REINDEX command also has
the capability to run its other variants (TABLE, DATABASE) with the
CONCURRENTLY option (but not SYSTEM).
The reindexdb command gets the --concurrently option.
Author: Michael Paquier, Andreas Karlsson, Peter Eisentraut
Reviewed-by: Andres Freund, Fujii Masao, Jim Nasby, Sergei Kornilov
Discussion: https://www.postgresql.org/message-id/flat/60052986-956b-4478-45ed-8bd119e9b9cf%402ndquadrant.com#74948a1044c56c5e817a5050f554ddee
To support building indexes over tables of different AMs, the scans to
do so need to be routed through the table AM. While moving a fair
amount of code, nearly all the changes are just moving code to below a
callback.
Currently the range based interface wouldn't make much sense for non
block based table AMs. But that seems aceptable for now.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
The makefile rule for the (rarely used) plain-text output postgres.txt
was still written to use lynx, but in
96b8b8b6f9d8de4af01a77797273ad88c7a8e32e, where the INSTALL file was
switched to pandoc, the rest of the makefile support for lynx was
removed, so this was broken. Rewrite the rule to also use pandoc for
postgres.txt.
Introduce a third extended statistic type, supported by the CREATE
STATISTICS command - MCV lists, a generalization of the statistic
already built and used for individual columns.
Compared to the already supported types (n-distinct coefficients and
functional dependencies), MCV lists are more complex, include column
values and allow estimation of much wider range of common clauses
(equality and inequality conditions, IS NULL, IS NOT NULL etc.).
Similarly to the other types, a new pseudo-type (pg_mcv_list) is used.
Author: Tomas Vondra
Reviewed-by: Dean Rasheed, David Rowley, Mark Dilger, Alvaro Herrera
Discussion: https://postgr.es/m/dfdac334-9cf2-2597-fb27-f0fb3753f435@2ndquadrant.com
Column references are not allowed in default expressions and partition
bound expressions, and are restricted as such once the transformation of
their expressions is done. However, trying to use more complex column
references can lead to confusing error messages. For example, trying to
use a two-field column reference name for default expressions and
partition bounds leads to "missing FROM-clause entry for table", which
makes no sense in their respective context.
In order to make the errors generated more useful, this commit adds more
verbose messages when transforming column references depending on the
context. This has a little consequence though: for example an
expression using an aggregate with a column reference as argument would
cause an error to be generated for the column reference, while the
aggregate was the problem reported before this commit because column
references get transformed first.
The confusion exists for default expressions for a long time, and the
problem is new as of v12 for partition bounds. Still per the lack of
complaints on the matter no backpatch is done.
The patch has been written by Amit Langote and me, and Tom Lane has
provided the improvement of the documentation for default expressions on
the CREATE TABLE page.
Author: Amit Langote, Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/20190326020853.GM2558@paquier.xyz
Partial revert of commit 6260cc550b0e, "pgbench: add \cset and \gset
commands".
While \gset is widely considered a useful and necessary tool for user-
defined benchmarks, \cset does not have as much value, and its
implementation was considered "not to be up to project standards"
(though I, Álvaro, can't quite understand exactly how). Therefore,
remove \cset.
Author: Fabien Coelho
Discussion: https://postgr.es/m/alpine.DEB.2.21.1903230716030.18811@lancre
Discussion: https://postgr.es/m/201901101900.mv7zduch6sad@alvherre.pgsql
This uses the same progress reporting infrastructure added in commit
c16dc1aca5e01e6acaadfcf38f5fc964a381dc62 and extends it to these
additional cases. We lack the ability to track the internal progress
of sorts and index builds so the information reported is
coarse-grained for some parts of the operation, but it still seems
like a significant improvement over having nothing at all.
Tatsuro Yamada, reviewed by Thomas Munro, Masahiko Sawada, Michael
Paquier, Jeff Janes, Alvaro Herrera, Rafia Sabih, and by me. A fair
amount of polishing also by me.
Discussion: http://postgr.es/m/59A77072.3090401@lab.ntt.co.jp
Add command variants COMMIT AND CHAIN and ROLLBACK AND CHAIN, which
start new transactions with the same transaction characteristics as the
just finished one, per SQL standard.
Support for transaction chaining in PL/pgSQL is also added. This
functionality is especially useful when running COMMIT in a loop in
PL/pgSQL.
Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
Discussion: https://www.postgresql.org/message-id/flat/28536681-324b-10dc-ade8-ab46f7645a5a@2ndquadrant.com
Previously we were using the SQL:2003 definition, which doesn't allow
this, but that creates a serious dump/restore gotcha: there is no
setting of xmloption that will allow all valid XML data. Hence,
switch to the 2006 definition.
Since libxml doesn't accept <!DOCTYPE> directives in the mode we
use for CONTENT parsing, the implementation is to detect <!DOCTYPE>
in the input and switch to DOCUMENT parsing mode. This should not
cost much, because <!DOCTYPE> should be close to the front of the
input if it's there at all. It's possible that this causes the
error messages for malformed input to be slightly different than
they were before, if said input includes <!DOCTYPE>; but that does
not seem like a big problem.
In passing, buy back a few cycles in parsing of large XML documents
by not doing strlen() of the whole input in parse_xml_decl().
Back-patch because dump/restore failures are not nice. This change
shouldn't break any cases that worked before, so it seems safe to
back-patch.
Chapman Flack (revised a bit by me)
Discussion: https://postgr.es/m/CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com
This is an option consistent with what pg_dump, pg_rewind and
pg_basebackup provide which is useful for leveraging the I/O effort when
testing things, not to be used in a production environment.
Author: Michael Paquier
Reviewed-by: Michael Banck, Fabien Coelho, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de
An offline cluster can now work with more modes in pg_checksums:
- --enable enables checksums in a cluster, updating all blocks with a
correct checksum, and updating the control file at the end.
- --disable disables checksums in a cluster, updating only the control
file.
- --check is an extra option able to verify checksums for a cluster, and
the default used if no mode is specified.
When running --enable or --disable, the data folder gets fsync'd for
durability, and then it is followed by a control file update and flush
to keep the operation consistent should the tool be interrupted, killed
or the host unplugged. If no mode is specified in the options, then
--check is used for compatibility with older versions of pg_checksums
(named pg_verify_checksums in v11 where it was introduced).
Author: Michael Banck, Michael Paquier
Reviewed-by: Fabien Coelho, Magnus Hagander, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de
This adds a flag "deterministic" to collations. If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal. That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.
This functionality is only supported with the ICU provider. At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.
The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).
This patch makes changes in three areas:
- CREATE COLLATION DDL changes and system catalog changes to support
this new flag.
- Many executor nodes and auxiliary code are extended to track
collations. Previously, this code would just throw away collation
information, because the eventually-called user-defined functions
didn't use it since they only cared about equality, which didn't
need collation information.
- String data type functions that do equality comparisons and hashing
are changed to take the (non-)deterministic flag into account. For
comparison, this just means skipping various shortcuts and tie
breakers that use byte-wise comparison. For hashing, we first need
to convert the input string to a canonical "sort key" using the ICU
analogue of strxfrm().
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
LDAP servers can be advertised on a network with RFC 2782 DNS SRV
records. The OpenLDAP command-line tools automatically try to find
servers that way, if no server name is provided by the user. Teach
PostgreSQL to do the same using OpenLDAP's support functions, when
building with OpenLDAP.
For now, we assume that HAVE_LDAP_INITIALIZE (an OpenLDAP extension
available since OpenLDAP 2.0 and also present in Apple LDAP) implies
that you also have ldap_domain2hostlist() (which arrived in the same
OpenLDAP version and is also present in Apple LDAP).
Author: Thomas Munro
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/CAEepm=2hAnSfhdsd6vXsM6VZVN0br-FbAZ-O+Swk18S5HkCP=A@mail.gmail.com
Teach contrib/amcheck's bt_index_parent_check() function to take
advantage of the uniqueness property of heapkeyspace indexes in support
of a new verification option: non-pivot tuples (non-highkey tuples on
the leaf level) can optionally be re-found using a new search for each,
that starts from the root page. If a tuple cannot be re-found, report
that the index is corrupt.
The new "rootdescend" verification option is exhaustive, and can
therefore make a call to bt_index_parent_check() take a lot longer.
Re-finding tuples during verification is mostly intended as an option
for backend developers, since the corruption scenarios that it alone is
uniquely capable of detecting seem fairly far-fetched.
For example, "rootdescend" verification is much more likely to detect
corruption of the least significant byte of a key from a pivot tuple in
the root page of a B-Tree that already has at least three levels.
Typically, only a few tuples on a cousin leaf page are at risk of
"getting overlooked" by index scans in this scenario. The corrupt key
in the root page is only slightly corrupt: corrupt enough to give wrong
answers to some queries, and yet not corrupt enough to allow the problem
to be detected without verifying agreement between the leaf page and the
root page, skipping at least one internal page level. The existing
bt_index_parent_check() checks never cross more than a single level.
Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-Wz=yTWnVu+HeHGKb2AGiADL9eprn-cKYAto4MkKOuiGtRQ@mail.gmail.com
Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique. This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".
Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added. This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes. This can increase fan-out,
especially in a multi-column index. Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now. A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.
Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved). Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3. contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes. These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.
A later patch will enhance the logic used by nbtree to pick a split
point. This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at. Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.
The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits. The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised. However, there should be a compatibility note in the v12
release notes.
Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com