346 Commits

Author SHA1 Message Date
Nathan Bossart
4dbdb82513 Fix cache lookup hazards introduced by ff9618e82a.
ff9618e82a introduced has_partition_ancestor_privs(), which is used
to check whether a user has MAINTAIN on any partition ancestors.
This involves syscache lookups, and presently this function does
not take any relation locks, so it is likely subject to the same
kind of cache lookup failures that were fixed by 19de0ab23c.

To fix this problem, this commit partially reverts ff9618e82a.
Specifically, it removes the partition-related changes, including
the has_partition_ancestor_privs() function mentioned above.  This
means that MAINTAIN on a partitioned table is no longer sufficient
to perform maintenance commands on its partitions.  This is more
like how privileges for maintenance commands work on supported
versions.  Privileges are checked for each partition, so a command
that flows down to all partitions might refuse to process them
(e.g., if the current user doesn't have MAINTAIN on the partition).

In passing, adjust a few related comments and error messages, and
add a test for the privilege checks for CLUSTER on a partitioned
table.

Reviewed-by: Michael Paquier, Jeff Davis
Discussion: https://postgr.es/m/20230613211246.GA219055%40nathanxps13
2023-06-22 15:48:20 -07:00
Jeff Davis
2fcc7ee7af Revert "Fix search_path to a safe value during maintenance operations."
This reverts commit 05e17373517114167d002494e004fa0aa32d1fd1.
2023-06-10 08:11:41 -07:00
Jeff Davis
05e1737351 Fix search_path to a safe value during maintenance operations.
While executing maintenance operations (ANALYZE, CLUSTER, REFRESH
MATERIALIZED VIEW, REINDEX, or VACUUM), set search_path to
'pg_catalog, pg_temp' to prevent inconsistent behavior.

Functions that are used for functional indexes, in index expressions,
or in materialized views and depend on a different search path must be
declared with CREATE FUNCTION ... SET search_path='...'.

This change addresses a security risk introduced in commit 60684dd834,
where a role with MAINTAIN privileges on a table may be able to
escalate privileges to the table owner. That commit is not yet part of
any release, so no need to backpatch.

Discussion: https://postgr.es/m/e44327179e5c9015c8dda67351c04da552066017.camel%40j-davis.com
Reviewed-by: Greg Stark
Reviewed-by: Nathan Bossart
2023-06-09 11:20:47 -07:00
Jeff Davis
ff9618e82a Fix MAINTAIN privileges for toast tables and partitions.
Commit 60684dd8 left loose ends when it came to maintaining toast
tables or partitions.

For toast tables, simply skip the privilege check if the toast table
is an indirect target of the maintenance command, because the main
table privileges have already been checked.

For partitions, allow the maintenance command if the user has the
MAINTAIN privilege on the partition or any parent.

Also make CLUSTER emit "skipping" messages when the user doesn't have
privileges, similar to VACUUM.

Author: Nathan Bossart
Reported-by: Pavel Luzanov
Reviewed-by: Pavel Luzanov, Ted Yu
Discussion: https://postgr.es/m/20230113231339.GA2422750@nathanxps13
2023-01-14 00:16:23 -08:00
Bruce Momjian
c8e1ba736b Update copyright for 2023
Backpatch-through: 11
2023-01-02 15:00:37 -05:00
Peter Geoghegan
4ce3afb82e Refactor how VACUUM passes around its XID cutoffs.
Use a dedicated struct for the XID/MXID cutoffs used by VACUUM, such as
FreezeLimit and OldestXmin.  This state is initialized in vacuum.c, and
then passed around by code from vacuumlazy.c to heapam.c freezing
related routines.  The new convention is that everybody works off of the
same cutoff state, which is passed around via pointers to const.

Also simplify some of the logic for dealing with frozen xmin in
heap_prepare_freeze_tuple: add dedicated "xmin_already_frozen" state to
clearly distinguish xmin XIDs that we're going to freeze from those that
were already frozen from before.  That way the routine's xmin handling
code is symmetrical with the existing xmax handling code.  This is
preparation for an upcoming commit that will add page level freezing.

Also refactor the control flow within FreezeMultiXactId(), while adding
stricter sanity checks.  We now test OldestXmin directly, instead of
using FreezeLimit as an inexact proxy for OldestXmin.  This is further
preparation for the page level freezing work, which will make the
function's caller cede control of page level freezing to the function
where appropriate (where heap_prepare_freeze_tuple sees a tuple that
happens to contain a MultiXactId in its xmax).

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/CAH2-WznS9TxXmz2_=SY+SyJyDFbiOftKofM9=aDo68BbXNBUMA@mail.gmail.com
2022-12-22 09:37:59 -08:00
Jeff Davis
60684dd834 Add grantable MAINTAIN privilege and pg_maintain role.
Allows VACUUM, ANALYZE, REINDEX, REFRESH MATERIALIZED VIEW, CLUSTER,
and LOCK TABLE.

Effectively reverts 4441fc704d. Instead of creating separate
privileges for VACUUM, ANALYZE, and other maintenance commands, group
them together under a single MAINTAIN privilege.

Author: Nathan Bossart
Discussion: https://postgr.es/m/20221212210136.GA449764@nathanxps13
Discussion: https://postgr.es/m/45224.1670476523@sss.pgh.pa.us
2022-12-13 17:33:28 -08:00
Peter Geoghegan
b6074846ce Simplify vacuum_set_xid_limits() signature.
Pass VACUUM parameters (VacuumParams state) to vacuum_set_xid_limits()
directly, rather than passing most individual VacuumParams fields as
separate arguments.

Also make vacuum_set_xid_limits() output parameter symbol names match
those used by its vacuumlazy.c caller.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=TE7gW5DgSahDkf0UEZigFGAoHNNN6EvSrdzC=Kn+hrA@mail.gmail.com
2022-11-23 11:10:06 -08:00
Peter Eisentraut
afbfc02983 Refactor ownercheck functions
Instead of dozens of mostly-duplicate pg_foo_ownercheck() functions,
write one common function object_ownercheck() that can handle almost
all of them.  We already have all the information we need, such as
which system catalog corresponds to which catalog table and which
column is the owner column.

Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://www.postgresql.org/message-id/flat/95c30f96-4060-2f48-98b5-a4392d3b6066@enterprisedb.com
2022-11-13 08:12:37 +01:00
Tom Lane
0a20ff54f5 Split up guc.c for better build speed and ease of maintenance.
guc.c has grown to be one of our largest .c files, making it
a bottleneck for compilation.  It's also acquired a bunch of
knowledge that'd be better kept elsewhere, because of our not
very good habit of putting variable-specific check hooks here.
Hence, split it up along these lines:

* guc.c itself retains just the core GUC housekeeping mechanisms.
* New file guc_funcs.c contains the SET/SHOW interfaces and some
  SQL-accessible functions for GUC manipulation.
* New file guc_tables.c contains the data arrays that define the
  built-in GUC variables, along with some already-exported constant
  tables.
* GUC check/assign/show hook functions are moved to the variable's
  home module, whenever that's clearly identifiable.  A few hard-
  to-classify hooks ended up in commands/variable.c, which was
  already a home for miscellaneous GUC hook functions.

To avoid cluttering a lot more header files with #include "guc.h",
I also invented a new header file utils/guc_hooks.h and put all
the GUC hook functions' declarations there, regardless of their
originating module.  That allowed removal of #include "guc.h"
from some existing headers.  The fallout from that (hopefully
all caught here) demonstrates clearly why such inclusions are
best minimized: there are a lot of files that, for example,
were getting array.h at two or more levels of remove, despite
not having any connection at all to GUCs in themselves.

There is some very minor code beautification here, such as
renaming a couple of inconsistently-named hook functions
and improving some comments.  But mostly this just moves
code from point A to point B and deals with the ensuing
needs for #include adjustments and exporting a few functions
that previously weren't exported.

Patch by me, per a suggestion from Andres Freund; thanks also
to Michael Paquier for the idea to invent guc_funcs.c.

Discussion: https://postgr.es/m/587607.1662836699@sss.pgh.pa.us
2022-09-13 11:11:45 -04:00
Robert Haas
b0a55e4329 Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.

Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".

Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.

On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.

Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.

Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.

Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 11:39:09 -04:00
Noah Misch
a117cebd63 Make relation-enumerating operations be security-restricted operations.
When a feature enumerates relations and runs functions associated with
all found relations, the feature's user shall not need to trust every
user having permission to create objects.  BRIN-specific functionality
in autovacuum neglected to account for this, as did pg_amcheck and
CLUSTER.  An attacker having permission to create non-temp objects in at
least one schema could execute arbitrary SQL functions under the
identity of the bootstrap superuser.  CREATE INDEX (not a
relation-enumerating operation) and REINDEX protected themselves too
late.  This change extends to the non-enumerating amcheck interface.
Back-patch to v10 (all supported versions).

Sergey Shinderuk, reviewed (in earlier versions) by Alexander Lakhin.
Reported by Alexander Lakhin.

Security: CVE-2022-1552
2022-05-09 08:35:08 -07:00
Alvaro Herrera
3f19e176ae
Have CLUSTER ignore partitions not owned by caller
If a partitioned table has partitions owned by roles other than the
owner of the partitioned table, don't include them in the to-be-
clustered list.  This is similar to what VACUUM FULL does (except we do
it sooner, because there is no reason to postpone it).  Add a simple
test to verify that only owned partitions are clustered.

While at it, change memory context switch-and-back to occur once per
partition instead of outside of the loop.

Author: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/20220411140609.GF26620@telsasoft.com
2022-04-14 22:11:06 +02:00
Michael Paquier
b940918dc8 Remove "recheck" argument from check_index_is_clusterable()
The last usage of this argument in this routine can be tracked down to
7e2f9062, aka 11 years ago.  Getting rid of this argument can also be an
advantage for extensions calling check_index_is_clusterable(), as it
removes any need to worry about the meaning of what a recheck would be
at this level.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20220411140609.GF26620@telsasoft.com
2022-04-13 15:32:35 +09:00
Peter Geoghegan
0b018fabaa Set relfrozenxid to oldest extant XID seen by VACUUM.
When VACUUM set relfrozenxid before now, it set it to whatever value was
used to determine which tuples to freeze -- the FreezeLimit cutoff.
This approach was very naive.  The relfrozenxid invariant only requires
that new relfrozenxid values be <= the oldest extant XID remaining in
the table (at the point that the VACUUM operation ends), which in
general might be much more recent than FreezeLimit.

VACUUM now carefully tracks the oldest remaining XID/MultiXactId as it
goes (the oldest remaining values _after_ lazy_scan_prune processing).
The final values are set as the table's new relfrozenxid and new
relminmxid in pg_class at the end of each VACUUM.  The oldest XID might
come from a tuple's xmin, xmax, or xvac fields.  It might even come from
one of the table's remaining MultiXacts.

Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM (FreezeLimit still acts as a lower bound on the final value that
aggressive VACUUM can set relfrozenxid to).  Since standard VACUUMs
still make no guarantees about advancing relfrozenxid, they might as
well set relfrozenxid to a value from well before FreezeLimit when the
opportunity presents itself.  In general standard VACUUMs may now set
relfrozenxid to any value > the original relfrozenxid and <= OldestXmin.

Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
2022-04-03 09:57:21 -07:00
Alvaro Herrera
cfdd03f45e
Allow CLUSTER on partitioned tables
This is essentially the same as applying VACUUM FULL to a partitioned
table, which has been supported since commit 3c3bb99330aa (March 2017).
While there's no great use case in applying CLUSTER to partitioned
tables, we don't have any strong reason not to allow it either.

For now, partitioned indexes cannot be marked clustered, so an index
must always be specified.

While at it, rename some variables that were RangeVars during the
development that led to 8bc717cb8878 but never made it that way to the
source tree; there's no need to perpetuate names that have always been
more confusing than helpful.

Author: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/20201028003312.GU9241@telsasoft.com
Discussion: https://postgr.es/m/20200611153502.GT14879@telsasoft.com
2022-04-02 19:08:34 +02:00
Peter Geoghegan
efa4a9462a Consolidate VACUUM xid cutoff logic.
Push the logic for determining whether or not a VACUUM operation will be
aggressive down into vacuum_set_xid_limits().  This makes the function's
signature significantly simpler, and seems clearer overall.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
2022-02-11 18:26:15 -08:00
Michael Paquier
410aa248e5 Fix various typos, grammar and code style in comments and docs
This fixes a set of issues that have accumulated over the past months
(or years) in various code areas.  Most fixes are related to some recent
additions, as of the development of v15.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20220124030001.GQ23027@telsasoft.com
2022-01-25 09:40:04 +09:00
Bruce Momjian
27b77ecf9f Update copyright for 2022
Backpatch-through: 10
2022-01-07 19:04:57 -05:00
Amit Kapila
29b5905470 Fix toast rewrites in logical decoding.
Commit 325f2ec555 introduced pg_class.relwrite to skip operations on
tables created as part of a heap rewrite during DDL. It links such
transient heaps to the original relation OID via this new field in
pg_class but forgot to do anything about toast tables. So, logical
decoding was not able to skip operations on internally created toast
tables. This leads to an error when we tried to decode the WAL for the
next operation for which it appeared that there is a toast data where
actually it didn't have any toast data.

To fix this, we set pg_class.relwrite for internally created toast tables
as well which allowed skipping operations on them during logical decoding.

Author: Bertrand Drouvot
Reviewed-by: David Zhang, Amit Kapila
Backpatch-through: 11, where it was introduced
Discussion: https://postgr.es/m/b5146fb1-ad9e-7d6e-f980-98ed68744a7c@amazon.com
2021-08-25 09:53:07 +05:30
Daniel Gustafsson
069d33d0c5 Emit namespace in the post-copy errmsg
During a VACUUM or CLUSTER command, the initial output emits a
fully qualified relation path with namespace.  The post-action
errmsg only emitted the relation name however, which may lead
to hard to parse output when using multiple jobs with vacuumdb
as the output from different jobs may be interleaved.  Include
the full path in the post-action errmsg to be consistent with
the initial errmsg.

Author: Mike Fiedler <miketheman@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://postgr.es/m/CAMerE0oz+8G-aORZL_BJcPxnBqewZAvND4bSUysjz+r-oT1BxQ@mail.gmail.com
2021-08-16 20:06:54 +02:00
Michael Paquier
b0483263dd Add support for SET ACCESS METHOD in ALTER TABLE
The logic used to support a change of access method for a table is
similar to changes for tablespace or relation persistence, requiring a
table rewrite with an exclusive lock of the relation changed.  Table
rewrites done in ALTER TABLE already go through the table AM layer when
scanning tuples from the old relation and inserting them into the new
one, making this implementation straight-forward.

Note that partitioned tables are not supported as these have no access
methods defined.

Author: Justin Pryzby, Jeff Davis
Reviewed-by: Michael Paquier, Vignesh C
Discussion: https://postgr.es/m/20210228222530.GD20769@telsasoft.com
2021-07-28 10:10:44 +09:00
Michael Paquier
609b0652af Fix typos and grammar in documentation and code comments
Comment fixes are applied on HEAD, and documentation improvements are
applied on back-branches where needed.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20210408164008.GJ6592@telsasoft.com
Backpatch-through: 9.6
2021-04-09 13:53:07 +09:00
Michael Paquier
a3dc926009 Refactor option handling of CLUSTER, REINDEX and VACUUM
This continues the work done in b5913f6.  All the options of those
commands are changed to use hex values rather than enums to reduce the
risk of compatibility bugs when introducing new options.  Each option
set is moved into a new structure that can be extended with more
non-boolean options (this was already the case of VACUUM).  The code of
REINDEX is restructured so as manual REINDEX commands go through a
single routine from utility.c, like VACUUM, to ease the allocation
handling of option parameters when a command needs to go through
multiple transactions.

This can be used as a base infrastructure for future patches related to
those commands, including reindex filtering and tablespace support.

Per discussion with people mentioned below, as well as Alvaro Herrera
and Peter Eisentraut.

Author: Michael Paquier, Justin Pryzby
Reviewed-by: Alexey Kondratov, Justin Pryzby
Discussion: https://postgr.es/m/X8riynBLwxAD9uKk@paquier.xyz
2021-01-18 14:03:10 +09:00
Bruce Momjian
ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Michael Paquier
b5913f6120 Refactor CLUSTER and REINDEX grammar to use DefElem for option lists
This changes CLUSTER and REINDEX so as a parenthesized grammar becomes
possible for options, while unifying the grammar parsing rules for
option lists with the existing ones.

This is a follow-up of the work done in 873ea9e for VACUUM, ANALYZE and
EXPLAIN.  This benefits REINDEX for a potential backend-side filtering
for collatable-sensitive indexes and TABLESPACE, while CLUSTER would
benefit from the latter.

Author: Alexey Kondratov, Justin Pryzby
Discussion: https://postgr.es/m/8a8f5f73-00d3-55f8-7583-1375ca8f6a91@postgrespro.ru
2020-12-03 10:13:21 +09:00
Andres Freund
94bc27b576 Centralize horizon determination for temp tables, fixing bug due to skew.
This fixes a bug in the edge case where, for a temp table, heap_page_prune()
can end up with a different horizon than heap_vacuum_rel(). Which can trigger
errors like "ERROR: cannot freeze committed xmax ...".

The bug was introduced due to interaction of a7212be8b9e "Set cutoff xmin more
aggressively when vacuuming a temporary table." with dc7420c2c92 "snapshot
scalability: Don't compute global horizons while building snapshots.".

The problem is caused by lazy_scan_heap() assuming that the only reason its
HeapTupleSatisfiesVacuum() call would return HEAPTUPLE_DEAD is if the tuple is
a HOT tuple, or if the tuple's inserting transaction has aborted since the
heap_page_prune() call. But after a7212be8b9e that was also possible in other
cases for temp tables, because heap_page_prune() uses a different visibility
test after dc7420c2c92.

The fix is fairly simple: Move the special case logic for temp tables from
vacuum_set_xid_limits() to the infrastructure introduced in dc7420c2c92. That
ensures that the horizon used for pruning is at least as aggressive as the one
used by lazy_scan_heap(). The concrete horizon used for temp tables is
slightly different than the logic in dc7420c2c92, but should always be as
aggressive as before (see comments).

A significant benefit to centralizing the logic procarray.c is that now the
more aggressive horizons for temp tables does not just apply to VACUUM but
also to e.g. HOT pruning and the nbtree killtuples logic.

Because isTopLevel is not needed by vacuum_set_xid_limits() anymore, I
undid the the related changes from a7212be8b9e.

This commit also adds an isolation test ensuring that the more aggressive
vacuuming and pruning of temp tables keeps working.

Debugged-By: Amit Kapila <amit.kapila16@gmail.com>
Debugged-By: Tom Lane <tgl@sss.pgh.pa.us>
Debugged-By: Ashutosh Sharma <ashu.coek88@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20201014203103.72oke6hqywcyhx7s@alap3.anarazel.de
Discussion: https://postgr.es/m/20201015083735.derdzysdtqdvxshp@alap3.anarazel.de
2020-10-28 18:02:31 -07:00
Tom Lane
a7212be8b9 Set cutoff xmin more aggressively when vacuuming a temporary table.
Since other sessions aren't allowed to look into a temporary table
of our own session, we do not need to worry about the global xmin
horizon when setting the vacuum XID cutoff.  Indeed, if we're not
inside a transaction block, we may set oldestXmin to be the next
XID, because there cannot be any in-doubt tuples in a temp table,
nor any tuples that are dead but still visible to some snapshot of
our transaction.  (VACUUM, of course, is never inside a transaction
block; but we need to test that because CLUSTER shares the same code.)

This approach allows us to always clean out a temp table completely
during VACUUM, independently of concurrent activity.  Aside from
being useful in its own right, that simplifies building reproducible
test cases.

Discussion: https://postgr.es/m/3490536.1598629609@sss.pgh.pa.us
2020-09-01 18:40:43 -04:00
Michael Paquier
8ef9451f58 Refactor cluster.c to use new routine get_index_isclustered()
This new cache lookup routine has been introduced in a40caf5, and more
code paths can directly use it.

Note that in cluster_rel(), the code was returning immediately if the
tuple's entry in pg_index for the clustered index was not valid.  This
commit changes the code so as a lookup error is raised instead,
something that could not happen from the start as we check for the
existence of the index beforehand, while holding an exclusive lock on
the parent table.

Author: Justin Pryzby
Reviewed-by: Álvaro Herrera, Michael Paquier
Discussion: https://postgr.es/m/20200202161718.GI13621@telsasoft.com
2020-04-06 11:44:23 +09:00
Noah Misch
c6b92041d3 Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.

Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node.  Make relcache.c retain entries for certain
dropped relations until end of transaction.

Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
Future servers accept older WAL, so this bump is discretionary.

Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem.  Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.

Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-04-04 12:25:34 -07:00
Noah Misch
de9396326e Revert "Skip WAL for new relfilenodes, under wal_level=minimal."
This reverts commit cb2fd7eac285b1b0a24eeb2b8ed4456b66c5a09f.  Per
numerous buildfarm members, it was incompatible with parallel query, and
a test case assumed LP64.  Back-patch to 9.5 (all supported versions).

Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com
2020-03-22 09:24:09 -07:00
Noah Misch
cb2fd7eac2 Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.

Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node.  Make relcache.c retain entries for certain
dropped relations until end of transaction.

Back-patch to 9.5 (all supported versions).  This introduces a new WAL
record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC.  As
always, update standby systems before master systems.  This changes
sizeof(RelationData) and sizeof(IndexStmt), breaking binary
compatibility for affected extensions.  (The most recent commit to
affect the same class of extensions was
089e4d405d0f3b94c74a2c6a54357a84a681754b.)

Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem.  Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.

Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-03-21 09:38:26 -07:00
Noah Misch
e629a01f69 During heap rebuild, lock any TOAST index until end of transaction.
swap_relation_files() calls toast_get_valid_index() to find and lock
this index, just before swapping with the rebuilt TOAST index.  The
latter function releases the lock before returning.  Potential for
mischief is low; a concurrent session can issue ALTER INDEX ... SET
(fillfactor = ...), which is not alarming.  Nonetheless, changing
pg_class.relfilenode without a lock is unconventional.  Back-patch to
9.5 (all supported versions), because another fix needs this.

Discussion: https://postgr.es/m/20191226001521.GA1772687@rfd.leadboat.com
2020-03-21 09:38:26 -07:00
Tom Lane
faade5d4c6 Update obsolete comment.
Noted by Justin Pryzby, though I chose to just rip out the stale text,
as it's in no way relevant to this particular function.

Discussion: https://postgr.es/m/20200212182337.GZ1412@telsasoft.com
2020-02-15 15:22:40 -05:00
Bruce Momjian
7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Michael Paquier
7854e07f25 Revert "Rename files and headers related to index AM"
This follows multiple complains from Peter Geoghegan, Andres Freund and
Alvaro Herrera that this issue ought to be dug more before actually
happening, if it happens.

Discussion: https://postgr.es/m/20191226144606.GA5659@alvherre.pgsql
2019-12-27 08:09:00 +09:00
Michael Paquier
8ce3aa9b59 Rename files and headers related to index AM
The following renaming is done so as source files related to index
access methods are more consistent with table access methods (the
original names used for index AMs ware too generic, and could be
confused as including features related to table AMs):
- amapi.h -> indexam.h.
- amapi.c -> indexamapi.c.  Here we have an equivalent with
backend/access/table/tableamapi.c.
- amvalidate.c -> indexamvalidate.c.
- amvalidate.h -> indexamvalidate.h.
- genam.c -> indexgenam.c.
- genam.h -> indexgenam.h.

This has been discussed during the development of v12 when table AM was
worked on, but the renaming never happened.

Author: Michael Paquier
Reviewed-by: Fabien Coelho, Julien Rouhaud
Discussion: https://postgr.es/m/20191223053434.GF34339@paquier.xyz
2019-12-25 10:23:39 +09:00
Amit Kapila
14aec03502 Make the order of the header file includes consistent in backend modules.
Similar to commits 7e735035f2 and dddf4cdc33, this commit makes the order
of header file inclusion consistent for backend modules.

In the passing, removed a couple of duplicate inclusions.

Author: Vignesh C
Reviewed-by: Kuntal Ghosh and Amit Kapila
Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
2019-11-12 08:30:16 +05:30
Robert Haas
8b94dab066 Split tuptoaster.c into three separate files.
detoast.c/h contain functions required to detoast a datum, partially
or completely, plus a few other utility functions for examining the
size of toasted datums.

toast_internals.c/h contain functions that are used internally to the
TOAST subsystem but which (mostly) do not need to be accessed from
outside.

heaptoast.c/h contains code that is intrinsically specific to the
heap AM, either because it operates on HeapTuples or is based on the
layout of a heap page.

detoast.c and toast_internals.c are placed in
src/backend/access/common rather than src/backend/access/heap.  At
present, both files still have dependencies on the heap, but that will
be improved in a future commit.

Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro,
Andres Freund, and Álvaro Herrera.

Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
2019-09-05 13:15:10 -04:00
Michael Paquier
66bde49d96 Fix inconsistencies and typos in the tree, take 10
This addresses some issues with unnecessary code comments, fixes various
typos in docs and comments, and removes some orphaned structures and
definitions.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/9aabc775-5494-b372-8bcb-4dfc0bd37c68@gmail.com
2019-08-13 13:53:41 +09:00
Tom Lane
d97b714a21 Avoid using lcons and list_delete_first where it's easy to do so.
Formerly, lcons was about the same speed as lappend, but with the new
List implementation, that's not so; with a long List, data movement
imposes an O(N) cost on lcons and list_delete_first, but not lappend.

Hence, invent list_delete_last with semantics parallel to
list_delete_first (but O(1) cost), and change various places to use
lappend and list_delete_last where this can be done without much
violence to the code logic.

There are quite a few places that construct result lists using lcons not
lappend.  Some have semantic rationales for that; I added comments about
it to a couple that didn't have them already.  In many such places though,
I think the coding is that way only because back in the dark ages lcons
was faster than lappend.  Hence, switch to lappend where this can be done
without causing semantic changes.

In ExecInitExprRec(), this results in aggregates and window functions that
are in the same plan node being executed in a different order than before.
Generally, the executions of such functions ought to be independent of
each other, so this shouldn't result in visibly different query results.
But if you push it, as one regression test case does, you can show that
the order is different.  The new order seems saner; it's closer to
the order of the functions in the query text.  And we never documented
or promised anything about this, anyway.

Also, in gistfinishsplit(), don't bother building a reverse-order list;
it's easy now to iterate backwards through the original list.

It'd be possible to go further towards removing uses of lcons and
list_delete_first, but it'd require more extensive logic changes,
and I'm not convinced it's worth it.  Most of the remaining uses
deal with queues that probably never get long enough to be worth
sweating over.  (Actually, I doubt that any of the changes in this
patch will have measurable performance effects either.  But better
to have good examples than bad ones in the code base.)

Patch by me, thanks to David Rowley and Daniel Gustafsson for review.

Discussion: https://postgr.es/m/21272.1563318411@sss.pgh.pa.us
2019-07-17 11:15:34 -04:00
Tom Lane
8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Tom Lane
be76af171c Initial pgindent run for v12.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.

Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-05-22 12:55:34 -04:00
Andres Freund
fdc7efcc30 Allow pg_class xid & multixid horizons to not be set.
This allows table AMs that don't need these horizons. This was already
documented in the tableam relation_set_new_filenode callback, but an
assert prevented if from actually working (the test AM code contained
the change itself). Defang the asserts in the general code, and move
the stronger ones into heap AM.

Relatedly, after CLUSTER/VACUUM, we'd always assign a relfrozenxid /
relminmxid. Change the table_relation_copy_for_cluster() interface to
allow the AM to overwrite the horizons that get set on the pg_class
entry.  This'd also in the future allow AMs like heap to compute a
relfrozenxid during rewrite that's the table's actual minimum rather
than a pre-determined value.  Arguably it'd have been better to move
the whole computation / setting of those values into the callback, but
it seems likely that for other reasons it'd be better to be able to
use one value to vacuum/cluster multiple tables (e.g. a toast's
horizon shouldn't be different than the table's).

Reported-By: Heikki Linnakangas
Author: Andres Freund
Discussion: https://postgr.es/m/9a7fb9cc-2419-5db7-8840-ddc10c93f122@iki.fi
2019-04-23 21:42:12 -07:00
Andres Freund
d25f519107 tableam: relation creation, VACUUM FULL/CLUSTER, SET TABLESPACE.
This moves the responsibility for:
- creating the storage necessary for a relation, including creating a
  new relfilenode for a relation with existing storage
- non-transactional truncation of a relation
- VACUUM FULL / CLUSTER's rewrite of a table
below tableam.

This is fairly straight forward, with a bit of complexity smattered in
to move the computation of xid / multixid horizons below the AM, as
they don't make sense for every table AM.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-28 20:01:43 -07:00
Robert Haas
6f97457e0d Add progress reporting for CLUSTER and VACUUM FULL.
This uses the same progress reporting infrastructure added in commit
c16dc1aca5e01e6acaadfcf38f5fc964a381dc62 and extends it to these
additional cases.  We lack the ability to track the internal progress
of sorts and index builds so the information reported is
coarse-grained for some parts of the operation, but it still seems
like a significant improvement over having nothing at all.

Tatsuro Yamada, reviewed by Thomas Munro, Masahiko Sawada, Michael
Paquier, Jeff Janes, Alvaro Herrera, Rafia Sabih, and by me.  A fair
amount of polishing also by me.

Discussion: http://postgr.es/m/59A77072.3090401@lab.ntt.co.jp
2019-03-25 10:59:04 -04:00
Andres Freund
c2fe139c20 tableam: Add and use scan APIs.
Too allow table accesses to be not directly dependent on heap, several
new abstractions are needed. Specifically:

1) Heap scans need to be generalized into table scans. Do this by
   introducing TableScanDesc, which will be the "base class" for
   individual AMs. This contains the AM independent fields from
   HeapScanDesc.

   The previous heap_{beginscan,rescan,endscan} et al. have been
   replaced with a table_ version.

   There's no direct replacement for heap_getnext(), as that returned
   a HeapTuple, which is undesirable for a other AMs. Instead there's
   table_scan_getnextslot().  But note that heap_getnext() lives on,
   it's still used widely to access catalog tables.

   This is achieved by new scan_begin, scan_end, scan_rescan,
   scan_getnextslot callbacks.

2) The portion of parallel scans that's shared between backends need
   to be able to do so without the user doing per-AM work. To achieve
   that new parallelscan_{estimate, initialize, reinitialize}
   callbacks are introduced, which operate on a new
   ParallelTableScanDesc, which again can be subclassed by AMs.

   As it is likely that several AMs are going to be block oriented,
   block oriented callbacks that can be shared between such AMs are
   provided and used by heap. table_block_parallelscan_{estimate,
   intiialize, reinitialize} as callbacks, and
   table_block_parallelscan_{nextpage, init} for use in AMs. These
   operate on a ParallelBlockTableScanDesc.

3) Index scans need to be able to access tables to return a tuple, and
   there needs to be state across individual accesses to the heap to
   store state like buffers. That's now handled by introducing a
   sort-of-scan IndexFetchTable, which again is intended to be
   subclassed by individual AMs (for heap IndexFetchHeap).

   The relevant callbacks for an AM are index_fetch_{end, begin,
   reset} to create the necessary state, and index_fetch_tuple to
   retrieve an indexed tuple.  Note that index_fetch_tuple
   implementations need to be smarter than just blindly fetching the
   tuples for AMs that have optimizations similar to heap's HOT - the
   currently alive tuple in the update chain needs to be fetched if
   appropriate.

   Similar to table_scan_getnextslot(), it's undesirable to continue
   to return HeapTuples. Thus index_fetch_heap (might want to rename
   that later) now accepts a slot as an argument. Core code doesn't
   have a lot of call sites performing index scans without going
   through the systable_* API (in contrast to loads of heap_getnext
   calls and working directly with HeapTuples).

   Index scans now store the result of a search in
   IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
   target is not generally a HeapTuple anymore that seems cleaner.

To be able to sensible adapt code to use the above, two further
callbacks have been introduced:

a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
   slots capable of holding a tuple of the AMs
   type. table_slot_callbacks() and table_slot_create() are based
   upon that, but have additional logic to deal with views, foreign
   tables, etc.

   While this change could have been done separately, nearly all the
   call sites that needed to be adapted for the rest of this commit
   also would have been needed to be adapted for
   table_slot_callbacks(), making separation not worthwhile.

b) tuple_satisfies_snapshot checks whether the tuple in a slot is
   currently visible according to a snapshot. That's required as a few
   places now don't have a buffer + HeapTuple around, but a
   slot (which in heap's case internally has that information).

Additionally a few infrastructure changes were needed:

I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
   internally uses a slot to keep track of tuples. While
   systable_getnext() still returns HeapTuples, and will so for the
   foreseeable future, the index API (see 1) above) now only deals with
   slots.

The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.

Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 12:46:41 -07:00
Andres Freund
8586bf7ed8 tableam: introduce table AM infrastructure.
This introduces the concept of table access methods, i.e. CREATE
  ACCESS METHOD ... TYPE TABLE and
  CREATE TABLE ... USING (storage-engine).
No table access functionality is delegated to table AMs as of this
commit, that'll be done in following commits.

Subsequent commits will incrementally abstract table access
functionality to be routed through table access methods. That change
is too large to be reviewed & committed at once, so it'll be done
incrementally.

Docs will be updated at the end, as adding them incrementally would
likely make them less coherent, and definitely is a lot more work,
without a lot of benefit.

Table access methods are specified similar to index access methods,
i.e. pg_am.amhandler returns, as INTERNAL, a pointer to a struct with
callbacks. In contrast to index AMs that struct needs to live as long
as a backend, typically that's achieved by just returning a pointer to
a constant struct.

Psql's \d+ now displays a table's access method. That can be disabled
with HIDE_TABLEAM=true, which is mainly useful so regression tests can
be run against different AMs.  It's quite possible that this behaviour
still needs to be fine tuned.

For now it's not allowed to set a table AM for a partitioned table, as
we've not resolved how partitions would inherit that. Disallowing
allows us to introduce, if we decide that's the way forward, such a
behaviour without a compatibility break.

Catversion bumped, to add the heap table AM and references to it.

Author: Haribabu Kommi, Andres Freund, Alvaro Herrera, Dimitri Golgov and others
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
    https://postgr.es/m/20190107235616.6lur25ph22u5u5av@alap3.anarazel.de
    https://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de
2019-03-06 09:54:38 -08:00
Tom Lane
f09346a9c6 Refactor planner's header files.
Create a new header optimizer/optimizer.h, which exposes just the
planner functions that can be used "at arm's length", without need
to access Paths or the other planner-internal data structures defined
in nodes/relation.h.  This is intended to provide the whole planner
API seen by most of the rest of the system; although FDWs still need
to use additional stuff, and more thought is also needed about just
what selfuncs.c should rely on.

The main point of doing this now is to limit the amount of new
#include baggage that will be needed by "planner support functions",
which I expect to introduce later, and which will be in relevant
datatype modules rather than anywhere near the planner.

This commit just moves relevant declarations into optimizer.h from
other header files (a couple of which go away because everything
got moved), and adjusts #include lists to match.  There's further
cleanup that could be done if we want to decide that some stuff
being exposed by optimizer.h doesn't belong in the planner at all,
but I'll leave that for another day.

Discussion: https://postgr.es/m/11460.1548706639@sss.pgh.pa.us
2019-01-29 15:48:51 -05:00
Andres Freund
346ed70b0a Rename RelationData.rd_amroutine to rd_indam.
The upcoming table AM support makes rd_amroutine to generic, as its
only about index AMs. The new name makes that clear, and is shorter to
boot.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-01-21 17:36:55 -08:00