This rectifies an oversight in commit 8224de4f4, by adding a new
property 'can_include' for pg_indexam_has_property, and adjusting the
results of pg_index_column_has_property to give more appropriate
results for INCLUDEd columns.
The atomics code asserts proper alignment in various places. That's
mainly because the alignment of 64bit integers is not sufficient for
atomic operations on all platforms. Some ABIs only have four byte
alignment, but don't have atomic behavior when crossing page
boundaries.
The flags code isn't affected by that however, as the type alignment
always is sufficient for atomic operations. Nevertheless the code
asserted alignment requirements. Before 8c3debbb it was only broken on
hppa, after it probably affect further platforms.
Thus remove the assertions for pg_atomic_flag operators.
Per buildfarm animal pademelon.
Discussion: https://postgr.es/m/7223.1523124425@sss.pgh.pa.us
Backpatch: 9.5-
Under EXEC BACKEND we also need to be going through the group privileges
setup since we do support that on Unixy systems, so add that to
SubPostmasterMain().
Under Windows, we need to simply return true from
GetDataDirectoryCreatePerm(), but that wasn't happening due to a missing
#else clause.
Per buildfarm.
Allow the cluster to be optionally init'd with read access for the
group.
This means a relatively non-privileged user can perform a backup of the
cluster without requiring write privileges, which enhances security.
The mode of PGDATA is used to determine whether group permissions are
enabled for directory and file creates. This method was chosen as it's
simple and works well for the various utilities that write into PGDATA.
Changing the mode of PGDATA manually will not automatically change the
mode of all the files contained therein. If the user would like to
enable group access on an existing cluster then changing the mode of all
the existing files will be required. Note that pg_upgrade will
automatically change the mode of all migrated files if the new cluster
is init'd with the -g option.
Tests are included for the backend and all the utilities which operate
on the PG data directory to ensure that the correct mode is set based on
the data directory permissions.
Author: David Steele <david@pgmasters.net>
Reviewed-By: Michael Paquier, with discussion amongst many others.
Discussion: https://postgr.es/m/ad346fe6-b23e-59f1-ecb7-0e08390ad629%40pgmasters.net
Consolidate directory and file create permissions for tools which work
with the PG data directory by adding a new module (common/file_perm.c)
that contains variables (pg_file_create_mode, pg_dir_create_mode) and
constants to initialize them (0600 for files and 0700 for directories).
Convert mkdir() calls in the backend to MakePGDirectory() if the
original call used default permissions (always the case for regular PG
directories).
Add tests to make sure permissions in PGDATA are set correctly by the
tools which modify the PG data directory.
Authors: David Steele <david@pgmasters.net>,
Adam Brightwell <adam.brightwell@crunchydata.com>
Reviewed-By: Michael Paquier, with discussion amongst many others.
Discussion: https://postgr.es/m/ad346fe6-b23e-59f1-ecb7-0e08390ad629%40pgmasters.net
Existing partition pruning is only able to work at plan time, for query
quals that appear in the parsed query. This is good but limiting, as
there can be parameters that appear later that can be usefully used to
further prune partitions.
This commit adds support for pruning subnodes of Append which cannot
possibly contain any matching tuples, during execution, by evaluating
Params to determine the minimum set of subnodes that can possibly match.
We support more than just simple Params in WHERE clauses. Support
additionally includes:
1. Parameterized Nested Loop Joins: The parameter from the outer side of the
join can be used to determine the minimum set of inner side partitions to
scan.
2. Initplans: Once an initplan has been executed we can then determine which
partitions match the value from the initplan.
Partition pruning is performed in two ways. When Params external to the plan
are found to match the partition key we attempt to prune away unneeded Append
subplans during the initialization of the executor. This allows us to bypass
the initialization of non-matching subplans meaning they won't appear in the
EXPLAIN or EXPLAIN ANALYZE output.
For parameters whose value is only known during the actual execution
then the pruning of these subplans must wait. Subplans which are
eliminated during this stage of pruning are still visible in the EXPLAIN
output. In order to determine if pruning has actually taken place, the
EXPLAIN ANALYZE must be viewed. If a certain Append subplan was never
executed due to the elimination of the partition then the execution
timing area will state "(never executed)". Whereas, if, for example in
the case of parameterized nested loops, the number of loops stated in
the EXPLAIN ANALYZE output for certain subplans may appear lower than
others due to the subplan having been scanned fewer times. This is due
to the list of matching subnodes having to be evaluated whenever a
parameter which was found to match the partition key changes.
This commit required some additional infrastructure that permits the
building of a data structure which is able to perform the translation of
the matching partition IDs, as returned by get_matching_partitions, into
the list index of a subpaths list, as exist in node types such as
Append, MergeAppend and ModifyTable. This allows us to translate a list
of clauses into a Bitmapset of all the subpath indexes which must be
included to satisfy the clause list.
Author: David Rowley, based on an earlier effort by Beena Emerson
Reviewers: Amit Langote, Robert Haas, Amul Sul, Rajkumar Raghuwanshi,
Jesper Pedersen
Discussion: https://postgr.es/m/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A@mail.gmail.com
This works very much like the existing bms_last_member function, only it
traverses through the Bitmapset in the opposite direction from the most
significant bit down to the least significant bit. A special prevbit value of
-1 may be used to have the function determine the most significant bit. This
is useful for starting a loop. When there are no members less than prevbit,
the function returns -2 to indicate there are no more members.
Author: David Rowley
Discussion: https://postgr.es/m/CAKJS1f-K=3d5MDASNYFJpUpc20xcBnAwNC1-AOeunhn0OtkWbQ@mail.gmail.com
When an update moves a row between partitions (supported since
2f178441044b), our normal logic for following update chains in READ
COMMITTED mode doesn't work anymore. Cross partition updates are
modeled as an delete from the old and insert into the new
partition. No ctid chain exists across partitions, and there's no
convenient space to introduce that link.
Not throwing an error in a partitioned context when one would have
been thrown without partitioning is obviously problematic. This commit
introduces infrastructure to detect when a tuple has been moved, not
just plainly deleted. That allows to throw an error when encountering
a deletion that's actually a move, while attempting to following a
ctid chain.
The row deleted as part of a cross partition update is marked by
pointing it's t_ctid to an invalid block, instead of self as a normal
update would. That was deemed to be the least invasive and most
future proof way to represent the knowledge, given how few infomask
bits are there to be recycled (there's also some locking issues with
using infomask bits).
External code following ctid chains should be updated to check for
moved tuples. The most likely consequence of not doing so is a missed
error.
Author: Amul Sul, editorialized by me
Reviewed-By: Amit Kapila, Pavan Deolasee, Andres Freund, Robert Haas
Discussion: http://postgr.es/m/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw@mail.gmail.com
This patch introduces INCLUDE clause to index definition. This clause
specifies a list of columns which will be included as a non-key part in
the index. The INCLUDE columns exist solely to allow more queries to
benefit from index-only scans. Also, such columns don't need to have
appropriate operator classes. Expressions are not supported as INCLUDE
columns since they cannot be used in index-only scans.
Index access methods supporting INCLUDE are indicated by amcaninclude flag
in IndexAmRoutine. For now, only B-tree indexes support INCLUDE clause.
In B-tree indexes INCLUDE columns are truncated from pivot index tuples
(tuples located in non-leaf pages and high keys). Therefore, B-tree indexes
now might have variable number of attributes. This patch also provides
generic facility to support that: pivot tuples contain number of their
attributes in t_tid.ip_posid. Free 13th bit of t_info is used for indicating
that. This facility will simplify further support of index suffix truncation.
The changes of above are backward-compatible, pg_upgrade doesn't need special
handling of B-tree indexes for that.
Bump catalog version
Author: Anastasia Lubennikova with contribition by Alexander Korotkov and me
Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes,
David Rowley, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/flat/56168952.4010101@postgrespro.ru
Jsonb has a complex nature so there isn't best-for-everything way to convert it
to tsvector for full text search. Current to_tsvector(json(b)) suggests to
convert only string values, but it's possible to index keys, numerics and even
booleans value. To solve that json(b)_to_tsvector has a second required
argument contained a list of desired types of json fields. Second argument is
a jsonb scalar or array right now with possibility to add new options in a
future.
Bump catalog version
Author: Dmitry Dolgov with some editorization by me
Reviewed by: Teodor Sigaev
Discussion: https://www.postgresql.org/message-id/CA+q6zcXJQbS1b4kJ_HeAOoOc=unfnOrUEL=KGgE32QKDww7d8g@mail.gmail.com
We need to wait for the initial sync of all subscriptions. On
some (faster?) machines, this didn't make a difference, but
the (slower?) buildfarm machines are upset.
Update the built-in logical replication system to make use of the
previously added logical decoding for TRUNCATE support. Add the
required truncate callback to pgoutput and a new logical replication
protocol message.
Publications get a new attribute to determine whether to replicate
truncate actions. When updating a publication via pg_dump from an older
version, this is not set, thus preserving the previous behavior.
Author: Simon Riggs <simon@2ndquadrant.com>
Author: Marco Nenciarini <marco.nenciarini@2ndquadrant.it>
Author: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Reviewed-by: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Add a new WAL record type for TRUNCATE, which is only used when
wal_level >= logical. (For physical replication, TRUNCATE is already
replicated via SMGR records.) Add new callback for logical decoding
output plugins to receive TRUNCATE actions.
Author: Simon Riggs <simon@2ndquadrant.com>
Author: Marco Nenciarini <marco.nenciarini@2ndquadrant.it>
Author: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Reviewed-by: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Hash index searches acquire predicate locks on the primary
page of a bucket. It acquires a lock on both the old and new buckets
for scans that happen concurrently with page splits. During a bucket
split, a predicate lock is copied from the primary page of an old
bucket to the primary page of a new bucket.
Author: Shubham Barai, Amit Kapila
Reviewed by: Amit Kapila, Alexander Korotkov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/flat/CALxAEPvNsM2GTiXdRgaaZ1Pjd1bs+sxfFsf7Ytr+iq+5JJoYXA@mail.gmail.com
The tests added as part of 9fdb675fc5 yield differing results
depending on endianess, causing buildfarm failures. As the differences
are expected, split the hash partitioning tests into a different file
and maintain alternative output. The separate file is so the amount of
duplicated output is reduced.
David produced the alternative output without a machine to test on, so
it's possible this'll require a buildfarm cycle or two to get right.
Author: David Rowley
Discussion: https://postgr.es/m/CAKJS1f-6f4c2Qhuipe-GY7BKmFd0FMBobRnLS7hVCoAmTszsBg@mail.gmail.com
The atomics fallback implementation for pg_atomic_flag was broken,
returning the inverted value from pg_atomic_test_set_flag(). This was
unnoticed because a) atomic flags were unused until recently b) the
test code wasn't run when the fallback implementation was in
use (because it didn't allow to test for some edge cases).
Fix the bug, and improve the fallback so it has the same behaviour as
the non-fallback implementation in the problematic edge cases. That
breaks ABI compatibility in the back branches when fallbacks are in
use, but given they were broken until now...
Author: Andres Freund
Reported-by: Daniel Gustafsson
Discussion:
https://postgr.es/m/FB948276-7B32-4B77-83E6-D00167F8EEB4@yesql.sehttps://postgr.es/m/20180406233854.uni2h3mbnveczl32@alap3.anarazel.de
Backpatch: 9.5-, where the atomics abstraction was introduced.
Also enable this for postgres_fdw.
Etsuro Fujita, based on an earlier patch by Amit Langote. The larger
patch series of which this is a part has been reviewed by Amit
Langote, David Fetter, Maksim Milyutin, Álvaro Herrera, Stephen Frost,
and me. Minor documentation changes to the final version by me.
Discussion: http://postgr.es/m/29906a26-da12-8c86-4fb9-d8f88442f2b9@lab.ntt.co.jp
This has no functional impact whatsoever. However, it causes
these unnecessary quote marks to disappear from the generated
postgres.bki file, making it easier to verify that the upcoming
bootstrap data conversion patch doesn't change the generated file.
These tests accummulated almost a gigabyte of data during the test which
was then removed at the end. Instead, remove output that's no longer
needed between the individual tests, to keep the total disk usage down
lower.
Author: Michael Banck
Add a new module backend/partitioning/partprune.c, implementing a more
sophisticated algorithm for partition pruning. The new module uses each
partition's "boundinfo" for pruning instead of constraint exclusion,
based on an idea proposed by Robert Haas of a "pruning program": a list
of steps generated from the query quals which are run iteratively to
obtain a list of partitions that must be scanned in order to satisfy
those quals.
At present, this targets planner-time partition pruning, but there exist
further patches to apply partition pruning at execution time as well.
This commit also moves some definitions from include/catalog/partition.h
to a new file include/partitioning/partbounds.h, in an attempt to
rationalize partitioning related code.
Authors: Amit Langote, David Rowley, Dilip Kumar
Reviewers: Robert Haas, Kyotaro Horiguchi, Ashutosh Bapat, Jesper Pedersen.
Discussion: https://postgr.es/m/098b9c71-1915-1a2a-8d52-1a7a50ce79e8@lab.ntt.co.jp
This provides a newer version of adminpack which works with the newly
added default roles to support GRANT'ing to non-superusers access to
read and write files, along with related functions (unlinking files,
getting file length, renaming/removing files, scanning the log file
directory) which are supported through adminpack.
Note that new versions of the functions are required because an
environment might have an updated version of the library but still have
the old adminpack 1.0 catalog definitions (where EXECUTE is GRANT'd to
PUBLIC for the functions).
This patch also removes the long-deprecated alternative names for
functions that adminpack used to include and which are now included in
the backend, in adminpack v1.1. Applications using the deprecated names
should be updated to use the backend functions instead. Existing
installations which continue to use adminpack v1.0 should continue to
function until/unless adminpack is upgraded.
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20171231191939.GR2416%40tamriel.snowman.net
This patch adds new default roles named 'pg_read_server_files',
'pg_write_server_files', 'pg_execute_server_program' which
allow an administrator to GRANT to a non-superuser role the ability to
access server-side files or run programs through PostgreSQL (as the user
the database is running as). Having one of these roles allows a
non-superuser to use server-side COPY to read, write, or with a program,
and to use file_fdw (if installed by a superuser and GRANT'd USAGE on
it) to read from files or run a program.
The existing misc file functions are also changed to allow a user with
the 'pg_read_server_files' default role to read any files on the
filesystem, matching the privileges given to that role through COPY and
file_fdw from above.
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20171231191939.GR2416%40tamriel.snowman.net
This removes the explicit superuser checks in the various file-access
functions in the backend, specifically pg_ls_dir(), pg_read_file(),
pg_read_binary_file(), and pg_stat_file(). Instead, EXECUTE is REVOKE'd
from public for these, meaning that only a superuser is able to run them
by default, but access to them can be GRANT'd to other roles.
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20171231191939.GR2416%40tamriel.snowman.net
The previous coding inadvertently checked the constraints for the
partitioned table rather than the target partition, which could
lead to data in a partition that fails to satisfy some constraint
on that partition. This problem seems to date back to when
table partitioning was introduced; prior to that, there was only
one target table for a COPY, so the problem didn't occur, and the
code just didn't get updated.
Etsuro Fujita, reviewed by Amit Langote and Ashutosh Bapat
Discussion: https://postgr.es/message-id/5ABA4074.1090500%40lab.ntt.co.jp
Etsuro Fujita. The larger patch series of which this is a part has
been reviewed by Amit Langote, David Fetter, Maksim Milyutin,
Álvaro Herrera, Stephen Frost, and me.
Discussion: http://postgr.es/m/5A95487E.9050808@lab.ntt.co.jp
We don't actually need the insert-or-update logic, so it's clearer to
have separate functions for the inserting and updating.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
In case the subscription is removed before the worker is fully started,
give a specific error message instead of the generic "cache lookup"
error.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
This makes it possible to turn checksums on in a live cluster, without
the previous need for dump/reload or logical replication (and to turn it
off).
Enabling checkusm starts a background process in the form of a
launcher/worker combination that goes through the entire database and
recalculates checksums on each and every page. Only when all pages have
been checksummed are they fully enabled in the cluster. Any failure of
the process will revert to checksums off and the process has to be
started.
This adds a new WAL record that indicates the state of checksums, so
the process works across replicated clusters.
Authors: Magnus Hagander and Daniel Gustafsson
Review: Tomas Vondra, Michael Banck, Heikki Linnakangas, Andrey Borodin
A normal SQL command run inside PL/pgSQL acquires a snapshot, but SET
TRANSACTION does not work anymore if a snapshot is set. So we have to
handle this separately.
Reviewed-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
If we are not going to save the plan, then we need to unset expr->plan
after we are done, also in error cases. Otherwise, we get a dangling
pointer next time around.
This is not the ideal solution. It would be better if we could convince
SPI not to associate a cached plan with a resource owner, and then we
could just save the plan in all cases. But that would require bigger
surgery.
Reported-by: Pavel Stehule <pavel.stehule@gmail.com>