When detaching partition in concurrent mode, it's possible for partition
descriptors to not match the set that was recently seen when the plan
was made, causing an assertion failure or (in production builds) failure
to construct a working plan. The case that was reported involves
prepared statements, but I think it may be possible to hit this bug
without that too.
The problem is that CreatePartitionPruneState is constructing a
PartitionPruneState under the assumption that new partitions can be
added, but never removed, but it turns out that this isn't true: a
prepared statement gets replanned when the DETACH CONCURRENTLY session
sends out its invalidation message, but if the invalidation message
arrives after ExecInitAppend started, we would build a partition
descriptor without the partition, and then CreatePartitionPruneState
would refuse to work with it.
CreatePartitionPruneState already contains code to deal with the new
descriptor having more partitions than before (and behaving for the
extra partitions as if they had been pruned), but doesn't have code to
deal with less partitions than before, and it is naïve about the case
where the number of partitions is the same. We could simply add that a
new stanza for less partitions than before, and in simple testing it
works to do that; but it's possible to press the test scripts even
further and hit the case where one partition is added and a partition is
removed quickly enough that we see the same number of partitions, but
they don't actually match, causing hangs during execution.
To cope with both these problems, we now memcmp() the arrays of
partition OIDs, and do a more elaborate mapping (relying on the fact
that both OID arrays are in partition-bounds order) if they're not
identical.
This fix was already pushed in backbranches earlier.
Reported-by: yajun Hu <1026592243@qq.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/18377-e0324601cfebdfe5@postgresql.org
When detaching partition in concurrent mode, it's possible for partition
descriptors to not match the set that was recently seen when the plan
was made, causing an assertion failure or (in production builds) failure
to construct a working plan. The case that was reported involves
prepared statements, but I think it may be possible to hit this bug
without that too.
The problem is that CreatePartitionPruneState is constructing a
PartitionPruneState under the assumption that new partitions can be
added, but never removed, but it turns out that this isn't true: a
prepared statement gets replanned when the DETACH CONCURRENTLY session
sends out its invalidation message, but if the invalidation message
arrives after ExecInitAppend started, we would build a partition
descriptor without the partition, and then CreatePartitionPruneState
would refuse to work with it.
CreatePartitionPruneState already contains code to deal with the new
descriptor having more partitions than before (and behaving for the
extra partitions as if they had been pruned), but doesn't have code to
deal with less partitions than before, and it is naïve about the case
where the number of partitions is the same. We could simply add that a
new stanza for less partitions than before, and in simple testing it
works to do that; but it's possible to press the test scripts even
further and hit the case where one partition is added and a partition is
removed quickly enough that we see the same number of partitions, but
they don't actually match, causing hangs during execution.
To cope with both these problems, we now memcmp() the arrays of
partition OIDs, and do a more elaborate mapping (relying on the fact
that both OID arrays are in partition-bounds order) if they're not
identical.
Backpatch to 14, where DETACH CONCURRENTLY appeared.
Reported-by: yajun Hu <1026592243@qq.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/18377-e0324601cfebdfe5@postgresql.org
This allows MERGE commands to include WHEN NOT MATCHED BY SOURCE
actions, which operate on rows that exist in the target relation, but
not in the data source. These actions can execute UPDATE, DELETE, or
DO NOTHING sub-commands.
This is in contrast to already-supported WHEN NOT MATCHED actions,
which operate on rows that exist in the data source, but not in the
target relation. To make this distinction clearer, such actions may
now be written as WHEN NOT MATCHED BY TARGET.
Writing WHEN NOT MATCHED without specifying BY SOURCE or BY TARGET is
equivalent to writing WHEN NOT MATCHED BY TARGET.
Dean Rasheed, reviewed by Alvaro Herrera, Ted Yu and Vik Fearing.
Discussion: https://postgr.es/m/CAEZATCWqnKGc57Y_JanUBHQXNKcXd7r=0R4NEZUVwP+syRkWbA@mail.gmail.com
This allows a RETURNING clause to be appended to a MERGE query, to
return values based on each row inserted, updated, or deleted. As with
plain INSERT, UPDATE, and DELETE commands, the returned values are
based on the new contents of the target table for INSERT and UPDATE
actions, and on its old contents for DELETE actions. Values from the
source relation may also be returned.
As with INSERT/UPDATE/DELETE, the output of MERGE ... RETURNING may be
used as the source relation for other operations such as WITH queries
and COPY commands.
Additionally, a special function merge_action() is provided, which
returns 'INSERT', 'UPDATE', or 'DELETE', depending on the action
executed for each row. The merge_action() function can be used
anywhere in the RETURNING list, including in arbitrary expressions and
subqueries, but it is an error to use it anywhere outside of a MERGE
query's RETURNING list.
Dean Rasheed, reviewed by Isaac Morland, Vik Fearing, Alvaro Herrera,
Gurjeet Singh, Jian He, Jeff Davis, Merlin Moncure, Peter Eisentraut,
and Wolfgang Walther.
Discussion: http://postgr.es/m/CAEZATCWePEGQR5LBn-vD6SfeLZafzEm2Qy_L_Oky2=qw2w3Pzg@mail.gmail.com
as determined by include-what-you-use (IWYU)
While IWYU also suggests to *add* a bunch of #include's (which is its
main purpose), this patch does not do that. In some cases, a more
specific #include replaces another less specific one.
Some manual adjustments of the automatic result:
- IWYU currently doesn't know about includes that provide global
variable declarations (like -Wmissing-variable-declarations), so
those includes are being kept manually.
- All includes for port(ability) headers are being kept for now, to
play it safe.
- No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the
patch from exploding in size.
Note that this patch touches just *.c files, so nothing declared in
header files changes in hidden ways.
As a small example, in src/backend/access/transam/rmgr.c, some IWYU
pragma annotations are added to handle a special case there.
Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
This allows the target relation of MERGE to be an auto-updatable or
trigger-updatable view, and includes support for WITH CHECK OPTION,
security barrier views, and security invoker views.
A trigger-updatable view must have INSTEAD OF triggers for every type
of action (INSERT, UPDATE, and DELETE) mentioned in the MERGE command.
An auto-updatable view must not have any INSTEAD OF triggers. Mixing
auto-update and trigger-update actions (i.e., having a partial set of
INSTEAD OF triggers) is not supported.
Rule-updatable views are also not supported, since there is no
rewriter support for non-SELECT rules with MERGE operations.
Dean Rasheed, reviewed by Jian He and Alvaro Herrera.
Discussion: https://postgr.es/m/CAEZATCVcB1g0nmxuEc-A+gGB0HnfcGQNGYH7gS=7rq0u0zOBXA@mail.gmail.com
This could only affect HASH partitioned tables with at least 2 partition
key columns.
If partition pruning was delayed until execution and the query contained
an IS NULL qual on one of the partitioned keys, and some subsequent
partitioned key was being compared to a non-Const, then this could result
in a crash due to the incorrect keyno being used to calculate the
stateidx for the expression evaluation code.
Here we fix this by properly skipping partitioned keys which have a
nullkey set. Effectively, this must be the same as what's going on
inside perform_pruning_base_step().
Sergei Glukhov also provided a patch, but that's not what's being used
here.
Reported-by: Sergei Glukhov
Reviewed-by: tender wang, Sergei Glukhov
Discussion: https://postgr.es/m/d05b26fa-af54-27e1-f693-6c31590802fa@postgrespro.ru
Backpatch-through: 11, where runtime partition pruning was added.
This reverts commit ec386948948c and its fixup 589bb816499e.
This change was intended to support query planning avoiding acquisition
of locks on partitions that were going to be pruned; however, the
overall project took a different direction at [1] and this bit is no
longer needed. Put things back the way they were as agreed in [2], to
avoid unnecessary complexity.
Discussion: [1] https://postgr.es/m/4191508.1674157166@sss.pgh.pa.us
Discussion: [2] https://postgr.es/m/20230502175409.kcoirxczpdha26wt@alvherre.pgsql
This provides a very simple way to see the generic plan for a
parameterized query. Without this, it's necessary to define
a prepared statement and temporarily change plan_cache_mode,
which is a bit tedious.
One thing that's a bit of a hack perhaps is that we disable
execution-time partition pruning when the GENERIC_PLAN option
is given. That's because the pruning code may attempt to
fetch the value of one of the parameters, which would fail.
Laurenz Albe, reviewed by Julien Rouhaud, Christoph Berg,
Michel Pelletier, Jim Jones, and myself
Discussion: https://postgr.es/m/0a29b954b10b57f0d135fe12aa0909bd41883eb0.camel@cybertec.at
In ExecInitPartitionInfo(), the Assert when building the WITH CHECK
OPTION list for the new partition assumed that the command would be an
INSERT or UPDATE, but it can also be a MERGE. This can be triggered by
a MERGE into a partitioned table with RLS checks to enforce.
Fix, and back-patch to v15, where MERGE was introduced.
Discussion: https://postgr.es/m/CAEZATCWWFtQmW67F3XTyMU5Am10Oxa_b8oe0x%2BNu5Mo%2BCdRErg%40mail.gmail.com
Commit 927f453a9 disallowed batching added by commit b663a4136 to be
used for the inserts performed as part of cross-partition updates of
partitioned tables, mainly because the previous code in
nodeModifyTable.c couldn't handle pending inserts into foreign-table
partitions that are also UPDATE target partitions. But we don't have
such a limitation anymore (cf. commit ffbb7e65a), so let's allow for
this by removing from execPartition.c the restriction added by commit
927f453a9 that batching is only allowed if the query command type is
CMD_INSERT.
In postgres_fdw, since commit 86dc90056 changed it to effectively
disable cross-partition updates in the case where a foreign-table
partition chosen to insert rows into is also an UPDATE target partition,
allow batching in the case where a foreign-table partition chosen to
do so is *not* also an UPDATE target partition. This is enabled by the
"batch_size" option added by commit b663a4136, which is disabled by
default.
This patch also adjusts the test case added by commit 927f453a9 to
confirm that the inserts performed as part of a cross-partition update
of a partitioned table indeed uses batching.
Amit Langote, reviewed and/or tested by Georgios Kokolatos, Zhihong Yu,
Bharath Rupireddy, Hou Zhijie, Vignesh C, and me.
Discussion: http://postgr.es/m/CA%2BHiwqH1Lz1yJmPs%3DaD-pzd_HLLynLHvq5iYeT9mB0bBV7oJ6w%40mail.gmail.com
In commit ffbb7e65a, I added a ModifyTableState member to ResultRelInfo
to save the owning ModifyTableState for use by nodeModifyTable.c when
performing batch inserts, but as pointed out by Tom Lane, that changed
the array stride of es_result_relations, and that would break any
previously-compiled extension code that accesses that array. Fix by
removing that member from ResultRelInfo and instead adding a List member
at the end of EState to save such ModifyTableStates.
Per report from Tom Lane. Back-patch to v14, like the previous commit;
I chose to apply the patch to HEAD as well, to make back-patching easy.
Discussion: http://postgr.es/m/4065383.1669395453%40sss.pgh.pa.us
ri_RootToPartitionMap is currently only initialized for tuple routing
target partitions, though a future commit will need the ability to use
it even for the non-partition child tables, so make adjustments to the
decouple it from the partitioning code.
Also, make it lazily initialized via ExecGetRootToChildMap(), making
that function its preferred access path. Existing third-party code
accessing it directly should no longer do so; consequently, it's been
renamed to ri_RootToChildMap, which also makes it consistent with
ri_ChildToRootMap.
ExecGetRootToChildMap() houses the logic of setting the map appropriately
depending on whether a given child relation is partition or not.
To support this, also add a separate entry point for TupleConversionMap
creation that receives an AttrMap. No new code here, just split an
existing function in two.
Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqEYUhDXSK5BTvG_xk=eaAEJCD4GS3C6uH7ybBvv+Z_Tmg@mail.gmail.com
The planner will now add a given PartitioPruneInfo to
PlannedStmt.partPruneInfos instead of directly to the
Append/MergeAppend plan node. What gets set instead in the
latter is an index field which points to the list element
of PlannedStmt.partPruneInfos containing the PartitioPruneInfo
belonging to the plan node.
A later commit will make AcquireExecutorLocks() do the initial
partition pruning to determine a minimal set of partitions to be
locked when validating a plan tree and it will need to consult the
PartitioPruneInfos referenced therein to do so. It would be better
for the PartitioPruneInfos to be accessible directly than requiring
a walk of the plan tree to find them, which is easier when it can be
done by simply iterating over PlannedStmt.partPruneInfos.
Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com
When it's given as true, return a 0 in the position of the missing
column rather than raising an error.
This is currently unused, but it allows us to reimplement column
permission checking in a subsequent commit. It seems worth breaking
into a separate commit because it affects unrelated code.
Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqFfiai=qBxPDTjaio_ZcaqUKh+FC=prESrB8ogZgFNNNQ@mail.gmail.com
Commit b663a4136, which allowed FDWs to INSERT rows in bulk, added to
nodeModifyTable.c code to flush pending inserts to the foreign-table
result relation(s) before completing processing of the ModifyTable node,
but the code failed to take into account the case where the INSERT query
has modifying CTEs, leading to incorrect results.
Also, that commit failed to flush pending inserts before firing BEFORE
ROW triggers so that rows are visible to such triggers.
In that commit we scanned through EState's
es_tuple_routing_result_relations or es_opened_result_relations list to
find the foreign-table result relations to which pending inserts are
flushed, but that would be inefficient in some cases. So to fix, 1) add
a List member to EState to record the insert-pending result relations,
and 2) modify nodeModifyTable.c so that it adds the foreign-table result
relation to the list in ExecInsert() if appropriate, and flushes pending
inserts properly using the list where needed.
While here, fix a copy-and-pasteo in a comment in ExecBatchInsert(),
which was added by that commit.
Back-patch to v14 where that commit appeared.
Discussion: https://postgr.es/m/CAPmGK16qutyCmyJJzgQOhfBq%3DNoGDqTB6O0QBZTihrbqre%2BoxA%40mail.gmail.com
In a similar effort to f01592f91, here we're targetting fixing the
warnings where we've deemed the shadowing variable to serve a close enough
purpose to the shadowed variable just to reuse the shadowed version and
not declare the shadowing variable at all.
By my count, this takes the warning count from 106 down to 71.
Author: Justin Pryzby
Discussion: https://postgr.es/m/20220825020839.GT2342@telsasoft.com
The standard way to check for list emptiness is to compare the
List pointer to NIL; our list code goes out of its way to ensure
that that is the only representation of an empty list. (An
acceptable alternative is a plain boolean test for non-null
pointer, but explicit mention of NIL is usually preferable.)
Various places didn't get that memo and expressed the condition
with list_length(), which might not be so bad except that there
were such a variety of ways to check it exactly: equal to zero,
less than or equal to zero, less than one, yadda yadda. In the
name of code readability, let's standardize all those spellings
as "list == NIL" or "list != NIL". (There's probably some
microscopic efficiency gain too, though few of these look to be
at all performance-critical.)
A very small number of cases were left as-is because they seemed
more consistent with other adjacent list_length tests that way.
Peter Smith, with bikeshedding from a number of us
Discussion: https://postgr.es/m/CAHut+PtQYe+ENX5KrONMfugf0q6NHg4hR5dAhqEXEc2eefFeig@mail.gmail.com
Here we add code which detects when ExecFindPartition() continually finds
the same partition and add a caching layer to improve partition lookup
performance for such cases.
Both RANGE and LIST partitioned tables traditionally require a binary
search for the set of Datums that a partition needs to be found for. This
binary search is commonly visible in profiles when bulk loading into a
partitioned table. Here we aim to reduce the overhead of bulk-loading
into partitioned tables for cases where many consecutive tuples belong to
the same partition and make the performance of this operation closer to
what it is with a traditional non-partitioned table.
When we find the same partition 16 times in a row, the next search will
result in us simply just checking if the current set of values belongs to
the last found partition. For LIST partitioning we record the index into
the PartitionBoundInfo's datum array. This allows us to check if the
current Datum is the same as the Datum that was last looked up. This
means if any given LIST partition supports storing multiple different
Datum values, then the caching only works when we find the same value as
we did the last time. For RANGE partitioning we simply check if the given
Datums are in the same range as the previously found partition.
We store the details of the cached partition in PartitionDesc (i.e.
relcache) so that the cached values are maintained over multiple
statements.
No caching is done for HASH partitions. The majority of the cost in HASH
partition lookups are in the hashing function(s), which would also have to
be executed if we were to try to do caching for HASH partitioned tables.
Since most of the cost is already incurred, we just don't bother. We also
don't do any caching for LIST partitions when we continually find the
values being looked up belong to the DEFAULT partition. We've no
corresponding index in the PartitionBoundInfo's datum array for this case.
We also don't cache when we find the given values match to a LIST
partitioned table's NULL partition. This is so cheap that there's no
point in doing any caching for this. We also don't cache for a RANGE
partitioned table's DEFAULT partition.
There have been a number of different patches submitted to improve
partition lookups. Hou, Zhijie submitted a patch to detect when the value
belonging to the partition key column(s) were constant and added code to
cache the partition in that case. Amit Langote then implemented an idea
suggested by me to remember the last found partition and start to check if
the current values work for that partition. The final patch here was
written by me and was done by taking many of the ideas I liked from the
patches in the thread and redesigning other aspects.
Discussion: https://postgr.es/m/OS0PR01MB571649B27E912EA6CC4EEF03942D9%40OS0PR01MB5716.jpnprd01.prod.outlook.com
Author: Amit Langote, Hou Zhijie, David Rowley
Reviewed-by: Amit Langote, Hou Zhijie
Modify the subroutines called by RI trigger functions that want to check
if a given referenced value exists in the referenced relation to simply
scan the foreign key constraint's unique index, instead of using SPI to
execute
SELECT 1 FROM referenced_relation WHERE ref_key = $1
This saves a lot of work, especially when inserting into or updating a
referencing relation.
This rewrite allows to fix a PK row visibility bug caused by a partition
descriptor hack which requires ActiveSnapshot to be set to come up with
the correct set of partitions for the RI query running under REPEATABLE
READ isolation. We now set that snapshot indepedently of the snapshot
to be used by the PK index scan, so the two no longer interfere. The
buggy output in src/test/isolation/expected/fk-snapshot.out of the
relevant test case added by commit 00cb86e75d6d has been corrected.
(The bug still exists in branch 14, however, but this fix is too
invasive to backpatch.)
Author: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Li Japin <japinli@hotmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Discussion: https://postgr.es/m/CA+HiwqGkfJfYdeq5vHPh6eqPKjSbfpDDY+j-kXYFePQedtSLeg@mail.gmail.com
* Move the execution pruning initialization steps that are common
between both ExecInitAppend() and ExecInitMergeAppend() into a new
function ExecInitPartitionPruning() defined in execPartition.c.
Those steps include creation of a PartitionPruneState to be used for
all instances of pruning and determining the minimal set of child
subplans that need to be initialized by performing initial pruning if
needed, and finally adjusting the subplan_map arrays in the
PartitionPruneState to reflect the new set of subplans remaining
after initial pruning if it was indeed performed.
ExecCreatePartitionPruneState() is no longer exported out of
execPartition.c and has been renamed to CreatePartitionPruneState()
as a local sub-routine of ExecInitPartitionPruning().
* Likewise, ExecFindInitialMatchingSubPlans() that was in charge of
performing initial pruning no longer needs to be exported. In fact,
since it would now have the same body as the more generally named
ExecFindMatchingSubPlans(), except differing in the value of
initial_prune passed to the common subroutine
find_matching_subplans_recurse(), it seems better to remove it and add
an initial_prune argument to ExecFindMatchingSubPlans().
* Add an ExprContext field to PartitionPruneContext to remove the
implicit assumption in the runtime pruning code that the ExprContext to
use to compute pruning expressions that need one can always rely on the
PlanState providing it. A future patch will allow runtime pruning (at
least the initial pruning steps) to be performed without the
corresponding PlanState yet having been created, so this will help.
Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqEYCpEqh2LMDOp9mT+4-QoVe8HgFMKBjntEMCTZLpcCCA@mail.gmail.com
MERGE performs actions that modify rows in the target table using a
source table or query. MERGE provides a single SQL statement that can
conditionally INSERT/UPDATE/DELETE rows -- a task that would otherwise
require multiple PL statements. For example,
MERGE INTO target AS t
USING source AS s
ON t.tid = s.sid
WHEN MATCHED AND t.balance > s.delta THEN
UPDATE SET balance = t.balance - s.delta
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED AND s.delta > 0 THEN
INSERT VALUES (s.sid, s.delta)
WHEN NOT MATCHED THEN
DO NOTHING;
MERGE works with regular tables, partitioned tables and inheritance
hierarchies, including column and row security enforcement, as well as
support for row and statement triggers and transition tables therein.
MERGE is optimized for OLTP and is parameterizable, though also useful
for large scale ETL/ELT. MERGE is not intended to be used in preference
to existing single SQL commands for INSERT, UPDATE or DELETE since there
is some overhead. MERGE can be used from PL/pgSQL.
MERGE does not support targetting updatable views or foreign tables, and
RETURNING clauses are not allowed either. These limitations are likely
fixable with sufficient effort. Rewrite rules are also not supported,
but it's not clear that we'd want to support them.
Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Simon Riggs <simon.riggs@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions)
Reviewed-by: Peter Geoghegan <pg@bowt.ie> (earlier versions)
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions)
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Discussion: https://postgr.es/m/CANP8+jKitBSrB7oTgT9CY2i1ObfOt36z0XMraQc+Xrz8QB0nXA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkJdBuxj9PO=2QaO9-3h3xGbQPZ34kJH=HukRekwM-GZg@mail.gmail.com
Discussion: https://postgr.es/m/20201231134736.GA25392@alvherre.pgsql
Also "make reformat-dat-files".
The only change worthy of note is that pgindent messed up the formatting
of launcher.c's struct LogicalRepWorkerId, which led me to notice that
that struct wasn't used at all anymore, so I just took it out.
It's unusual to have any resjunk columns in an ON CONFLICT ... UPDATE
list, but it can happen when MULTIEXPR_SUBLINK SubPlans are present.
If it happens, the ON CONFLICT UPDATE code path would end up storing
tuples that include the values of the extra resjunk columns. That's
fairly harmless in the short run, but if new columns are added to
the table then the values would become accessible, possibly leading
to malfunctions if they don't match the datatypes of the new columns.
This had escaped notice through a confluence of missing sanity checks,
including
* There's no cross-check that a tuple presented to heap_insert or
heap_update matches the table rowtype. While it's difficult to
check that fully at reasonable cost, we can easily add assertions
that there aren't too many columns.
* The output-column-assignment cases in execExprInterp.c lacked
any sanity checks on the output column numbers, which seems like
an oversight considering there are plenty of assertion checks on
input column numbers. Add assertions there too.
* We failed to apply nodeModifyTable's ExecCheckPlanOutput() to
the ON CONFLICT UPDATE tlist. That wouldn't have caught this
specific error, since that function is chartered to ignore resjunk
columns; but it sure seems like a bad omission now that we've seen
this bug.
In HEAD, the right way to fix this is to make the processing of
ON CONFLICT UPDATE tlists work the same as regular UPDATE tlists
now do, that is don't add "SET x = x" entries, and use
ExecBuildUpdateProjection to evaluate the tlist and combine it with
old values of the not-set columns. This adds a little complication
to ExecBuildUpdateProjection, but allows removal of a comparable
amount of now-dead code from the planner.
In the back branches, the most expedient solution seems to be to
(a) use an output slot for the ON CONFLICT UPDATE projection that
actually matches the target table, and then (b) invent a variant of
ExecBuildProjectionInfo that can be told to not store values resulting
from resjunk columns, so it doesn't try to store into nonexistent
columns of the output slot. (We can't simply ignore the resjunk columns
altogether; they have to be evaluated for MULTIEXPR_SUBLINK to work.)
This works back to v10. In 9.6, projections work much differently and
we can't cheaply give them such an option. The 9.6 version of this
patch works by inserting a JunkFilter when it's necessary to get rid
of resjunk columns.
In addition, v11 and up have the reverse problem when trying to
perform ON CONFLICT UPDATE on a partitioned table. Through a
further oversight, adjust_partition_tlist() discarded resjunk columns
when re-ordering the ON CONFLICT UPDATE tlist to match a partition.
This accidentally prevented the storing-bogus-tuples problem, but
at the cost that MULTIEXPR_SUBLINK cases didn't work, typically
crashing if more than one row has to be updated. Fix by preserving
resjunk columns in that routine. (I failed to resist the temptation
to add more assertions there too, and to do some minor code
beautification.)
Per report from Andres Freund. Back-patch to all supported branches.
Security: CVE-2021-32028
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
Arrange to do some things on-demand, rather than immediately during
executor startup, because there's a fair chance of never having to do
them at all:
* Don't open result relations' indexes until needed.
* Don't initialize partition tuple routing, nor the child-to-root
tuple conversion map, until needed.
This wins in UPDATEs on partitioned tables when only some of the
partitions will actually receive updates; with larger partition
counts the savings is quite noticeable. Also, we can remove some
sketchy heuristics in ExecInitModifyTable about whether to set up
tuple routing.
Also, remove execPartition.c's private hash table tracking which
partitions were already opened by the ModifyTable node. Instead
use the hash added to ModifyTable itself by commit 86dc90056.
To allow lazy computation of the conversion maps, we now set
ri_RootResultRelInfo in all child ResultRelInfos. We formerly set it
only in some, not terribly well-defined, cases. This has user-visible
side effects in that now more error messages refer to the root
relation instead of some partition (and provide error data in the
root's column order, too). It looks to me like this is a strict
improvement in consistency, so I don't have a problem with the
output changes visible in this commit.
Extracted from a larger patch, which seemed to me to be too messy
to push in one commit.
Amit Langote, reviewed at different times by Heikki Linnakangas and
myself
Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
This patch makes two closely related sets of changes:
1. For UPDATE, the subplan of the ModifyTable node now only delivers
the new values of the changed columns (i.e., the expressions computed
in the query's SET clause) plus row identity information such as CTID.
ModifyTable must re-fetch the original tuple to merge in the old
values of any unchanged columns. The core advantage of this is that
the changed columns are uniform across all tables of an inherited or
partitioned target relation, whereas the other columns might not be.
A secondary advantage, when the UPDATE involves joins, is that less
data needs to pass through the plan tree. The disadvantage of course
is an extra fetch of each tuple to be updated. However, that seems to
be very nearly free in context; even worst-case tests don't show it to
add more than a couple percent to the total query cost. At some point
it might be interesting to combine the re-fetch with the tuple access
that ModifyTable must do anyway to mark the old tuple dead; but that
would require a good deal of refactoring and it seems it wouldn't buy
all that much, so this patch doesn't attempt it.
2. For inherited UPDATE/DELETE, instead of generating a separate
subplan for each target relation, we now generate a single subplan
that is just exactly like a SELECT's plan, then stick ModifyTable
on top of that. To let ModifyTable know which target relation a
given incoming row refers to, a tableoid junk column is added to
the row identity information. This gets rid of the horrid hack
that was inheritance_planner(), eliminating O(N^2) planning cost
and memory consumption in cases where there were many unprunable
target relations.
Point 2 of course requires point 1, so that there is a uniform
definition of the non-junk columns to be returned by the subplan.
We can't insist on uniform definition of the row identity junk
columns however, if we want to keep the ability to have both
plain and foreign tables in a partitioning hierarchy. Since
it wouldn't scale very far to have every child table have its
own row identity column, this patch includes provisions to merge
similar row identity columns into one column of the subplan result.
In particular, we can merge the whole-row Vars typically used as
row identity by FDWs into one column by pretending they are type
RECORD. (It's still okay for the actual composite Datums to be
labeled with the table's rowtype OID, though.)
There is more that can be done to file down residual inefficiencies
in this patch, but it seems to be committable now.
FDW authors should note several API changes:
* The argument list for AddForeignUpdateTargets() has changed, and so
has the method it must use for adding junk columns to the query. Call
add_row_identity_var() instead of manipulating the parse tree directly.
You might want to reconsider exactly what you're adding, too.
* PlanDirectModify() must now work a little harder to find the
ForeignScan plan node; if the foreign table is part of a partitioning
hierarchy then the ForeignScan might not be the direct child of
ModifyTable. See postgres_fdw for sample code.
* To check whether a relation is a target relation, it's no
longer sufficient to compare its relid to root->parse->resultRelation.
Instead, check it against all_result_relids or leaf_result_relids,
as appropriate.
Amit Langote and Tom Lane
Discussion: https://postgr.es/m/CA+HiwqHpHdqdDn48yCEhynnniahH78rwcrv1rEX65-fsZGBOLQ@mail.gmail.com
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
A cross-partition update on a partitioned table is implemented as a
delete followed by an insert. With foreign partitions, this was however
causing issues, because the FDW and core may disagree on when to enable
batching. postgres_fdw was only allowing batching for plain inserts
(CMD_INSERT) while core was trying to batch the insert component of the
cross-partition update. Fix by restricting core to apply batching only
to plain CMD_INSERT queries.
It's possible to allow batching for cross-partition updates, but that
will require more extensive changes, so better to leave that for a
separate patch.
Author: Amit Langote
Reviewed-by: Tomas Vondra, Takayuki Tsunakawa
Discussion: https://postgr.es/m/20200628151002.7x5laxwpgvkyiu3q@development
If a cross-partition UPDATE violates a constraint on the target partition,
and the columns in the new partition are in different physical order than
in the parent, the error message can reveal columns that the user does not
have SELECT permission on. A similar bug was fixed earlier in commit
804b6b6db4.
The cause of the bug is that the callers of the
ExecBuildSlotValueDescription() function got confused when constructing
the list of modified columns. If the tuple was routed from a parent, we
converted the tuple to the parent's format, but the list of modified
columns was grabbed directly from the child's RTE entry.
ExecUpdateLockMode() had a similar issue. That lead to confusion on which
columns are key columns, leading to wrong tuple lock being taken on tables
referenced by foreign keys, when a row is updated with INSERT ON CONFLICT
UPDATE. A new isolation test is added for that corner case.
With this patch, the ri_RangeTableIndex field is no longer set for
partitions that don't have an entry in the range table. Previously, it was
set to the RTE entry of the parent relation, but that was confusing.
NOTE: This modifies the ResultRelInfo struct, replacing the
ri_PartitionRoot field with ri_RootResultRelInfo. That's a bit risky to
backpatch, because it breaks any extensions accessing the field. The
change that ri_RangeTableIndex is not set for partitions could potentially
break extensions, too. The ResultRelInfos are visible to FDWs at least,
and this patch required small changes to postgres_fdw. Nevertheless, this
seem like the least bad option. I don't think these fields widely used in
extensions; I don't think there are FDWs out there that uses the FDW
"direct update" API, other than postgres_fdw. If there is, you will get a
compilation error, so hopefully it is caught quickly.
Backpatch to 11, where support for both cross-partition UPDATEs, and unique
indexes on partitioned tables, were added.
Reviewed-by: Amit Langote
Security: CVE-2021-3393
perform_pruning_combine_step() was not taught about the number of
partition indexes used in hash partitioning; more embarrassingly,
get_matching_hash_bounds() also had it wrong. These errors are masked
in the common case where all the partitions have the same modulus
and no partition is missing. However, with missing or unequal-size
partitions, we could erroneously prune some partitions that need
to be scanned, leading to silently wrong query answers.
While a minimal-footprint fix for this could be to export
get_partition_bound_num_indexes and make the incorrect functions use it,
I'm of the opinion that that function should never have existed in the
first place. It's not reasonable data structure design that
PartitionBoundInfoData lacks any explicit record of the length of
its indexes[] array. Perhaps that was all right when it could always
be assumed equal to ndatums, but something should have been done about
it as soon as that stopped being true. Putting in an explicit
"nindexes" field makes both partition_bounds_equal() and
partition_bounds_copy() simpler, safer, and faster than before,
and removes explicit knowledge of the number-of-partition-indexes
rules from some other places too.
This change also makes get_hash_partition_greatest_modulus obsolete.
I left that in place in case any external code uses it, but no core
code does anymore.
Per bug #16840 from Michał Albrycht. Back-patch to v11 where the
hash partitioning code came in. (In the back branches, add the new
field at the end of PartitionBoundInfoData to minimize ABI risks.)
Discussion: https://postgr.es/m/16840-571a22976f829ad4@postgresql.org
Extends the FDW API to allow batching inserts into foreign tables. That
is usually much more efficient than inserting individual rows, due to
high latency for each round-trip to the foreign server.
It was possible to implement something similar in the regular FDW API,
but it was inconvenient and there were issues with reporting the number
of actually inserted rows etc. This extends the FDW API with two new
functions:
* GetForeignModifyBatchSize - allows the FDW picking optimal batch size
* ExecForeignBatchInsert - inserts a batch of rows at once
Currently, only INSERT queries support batching. Support for DELETE and
UPDATE may be added in the future.
This also implements batching for postgres_fdw. The batch size may be
specified using "batch_size" option both at the server and table level.
The initial patch version was written by me, but it was rewritten and
improved in many ways by Takayuki Tsunakawa.
Author: Takayuki Tsunakawa
Reviewed-by: Tomas Vondra, Amit Langote
Discussion: https://postgr.es/m/20200628151002.7x5laxwpgvkyiu3q@development
Invent a new flag bit HASH_STRINGS to specify C-string hashing, which
was formerly the default; and add assertions insisting that exactly
one of the bits HASH_STRINGS, HASH_BLOBS, and HASH_FUNCTION be set.
This is in hopes of preventing recurrences of the type of oversight
fixed in commit a1b8aa1e4 (i.e., mistakenly omitting HASH_BLOBS).
Also, when HASH_STRINGS is specified, insist that the keysize be
more than 8 bytes. This is a heuristic, but it should catch
accidental use of HASH_STRINGS for integer or pointer keys.
(Nearly all existing use-cases set the keysize to NAMEDATALEN or
more, so there's little reason to think this restriction should
be problematic.)
Tweak hash_create() to insist that the HASH_ELEM flag be set, and
remove the defaults it had for keysize and entrysize. Since those
defaults were undocumented and basically useless, no callers
omitted HASH_ELEM anyway.
Also, remove memset's zeroing the HASHCTL parameter struct from
those callers that had one. This has never been really necessary,
and while it wasn't a bad coding convention it was confusing that
some callers did it and some did not. We might as well save a few
cycles by standardizing on "not".
Also improve the documentation for hash_create().
In passing, improve reinit.c's usage of a hash table by storing
the key as a binary Oid rather than a string; and, since that's
a temporary hash table, allocate it in CurrentMemoryContext for
neatness.
Discussion: https://postgr.es/m/590625.1607878171@sss.pgh.pa.us
Store the tuple conversion map to convert a tuple from a child table's
format to the root format in a new ri_ChildToRootMap field in
ResultRelInfo. It is initialized if transition tuple capture for FOR
STATEMENT triggers or INSERT tuple routing on a partitioned table is
needed. Previously, ModifyTable kept the maps in the per-subplan
ModifyTableState->mt_per_subplan_tupconv_maps array, or when tuple
routing was used, in
ResultRelInfo->ri_Partitioninfo->pi_PartitionToRootMap. The new field
replaces both of those.
Now that the child-to-root tuple conversion map is always available in
ResultRelInfo (when needed), remove the TransitionCaptureState.tcs_map
field. The callers of Exec*Trigger() functions no longer need to set or
save it, which is much less confusing and bug-prone. Also, as a future
optimization, this will allow us to delay creating the map for a given
result relation until the relation is actually processed during
execution.
Author: Amit Langote
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqHtCWLdK-LO%3DNEsvOdHx%2B7yv4mE_zYK0i3BH7dXb-wxog%40mail.gmail.com
Since there is only one place that actually needs the partition check
expression, namely ExecPartitionCheck, it's better to fetch it from
the relcache there. In this way we will never fetch it at all if
the query never has use for it, and we still fetch it just once when
we do need it.
The reason for taking an interest in this is that if the relcache
doesn't already have the check expression cached, fetching it
requires obtaining AccessShareLock on the partition root. That
means that operations that look like they should only touch the
partition itself will also take a lock on the root. In particular
we observed that TRUNCATE on a partition may take a lock on the
partition's root, contributing to a deadlock situation in parallel
pg_restore.
As written, this patch does have a small cost, which is that we
are microscopically reducing efficiency for the case where a partition
has an empty check expression. ExecPartitionCheck will be called,
and will go through the motions of setting up and checking an empty
qual, where before it would not have been called at all. We could
avoid that by adding a separate boolean flag to track whether there
is a partition expression to test. However, this case only arises
for a default partition with no siblings, which surely is not an
interesting case in practice. Hence adding complexity for it
does not seem like a good trade-off.
Amit Langote, per a suggestion by me
Discussion: https://postgr.es/m/VI1PR03MB31670CA1BD9625C3A8C5DD05EB230@VI1PR03MB3167.eurprd03.prod.outlook.com
Partitioning tuple route code assumes that the partition chosen while
descending the partition hierarchy is always the correct one. This is
true except when the partition is the default partition and another
partition has been added concurrently: the partition constraint changes
and we don't recheck it. This can lead to tuples mistakenly being added
to the default partition that should have been rejected.
Fix by rechecking the default partition constraint while descending the
hierarchy.
An isolation test based on the reproduction steps described by Hao Wu
(with tweaks for extra coverage) is included.
Backpatch to 12, where this bug came in with 898e5e3290a7.
Reported by: Hao Wu <hawu@vmware.com>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CA+HiwqFqBmcSSap4sFnCBUEL_VfOMmEKaQ3gwUhyfa4c7J_-nA@mail.gmail.com
Discussion: https://postgr.es/m/DM5PR0501MB3910E97A9EDFB4C775CF3D75A42F0@DM5PR0501MB3910.namprd05.prod.outlook.com
Since we no longer require AccessExclusiveLock to add a partition,
the executor may see that a partitioned table has more partitions
than the planner saw. ExecCreatePartitionPruneState's code for
matching up the partition lists in such cases was faulty, and would
misbehave if the planner had successfully pruned any partitions from
the query. (Thus, trouble would occur only if a partition addition
happens concurrently with a query that uses both static and dynamic
partition pruning.) This led to an Assert failure in debug builds,
and probably to crashes or query misbehavior in production builds.
To repair the bug, just explicitly skip zeroes in the plan's
relid_map[] list. I also made some cosmetic changes to make the code
more readable (IMO anyway). Also, convert the cross-checking Assert
to a regular test-and-elog, since it's now apparent that this logic
is more fragile than one would like.
Currently, there's no way to repeatably exercise this code, except
with manual use of a debugger to stop the backend between planning
and execution. Hence, no test case in this patch. We oughta do
something about that testability gap, but that's for another day.
Amit Langote and Tom Lane, per report from Justin Pryzby. Oversight
in commit 898e5e329; backpatch to v12 where that appeared.
Discussion: https://postgr.es/m/20200802181131.GA27754@telsasoft.com
All errors of SQLSTATE class 23 should include the name of an object
associated with the error in separate fields of the error report message.
We do this so that applications need not try to extract them from the
possibly-localized human-readable text of the message.
Reported-by: Chris Bandy
Author: Chris Bandy
Reviewed-by: Amit Kapila and Amit Langote
Discussion: https://postgr.es/m/0aa113a3-3c7f-db48-bcd8-f9290b2269ae@gmail.com
The need for this was removed by
8b9e9644dc6a9bd4b7a97950e6212f63880cf18b.
A number of files now need to include utils/acl.h or
parser/parse_node.h explicitly where they previously got it indirectly
somehow.
Since parser/parse_node.h already includes nodes/parsenodes.h, the
latter is then removed where the former was added. Also, remove
nodes/pg_list.h from objectaddress.h, since that's included via
nodes/parsenodes.h.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/7601e258-26b2-8481-36d0-dc9dca6f28f1%402ndquadrant.com