Some of the wrapper functions didn't match the callback names. Many of
them due to staying "consistent" with historic naming of the wrapped
functionality. We decided that for most cases it's more important to
be for tableam to be consistent going forward, than with the past.
The one exception is beginscan/endscan/... because it'd have looked
odd to have systable_beginscan/endscan/... with a different naming
scheme, and changing the systable_* APIs would have caused way too
much churn (including breaking a lot of external users).
Author: Ashwin Agrawal, with some small additions by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CALfoeiugyrXZfX7n0ORCa4L-m834dzmaE8eFdbNR6PMpetU4Ww@mail.gmail.com
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.
Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
The right way for IsCatalogRelation/Class to behave is to return true
for OIDs less than FirstBootstrapObjectId (not FirstNormalObjectId),
without any of the ad-hoc fooling around with schema membership.
The previous code was wrong because (1) it claimed that
information_schema tables were not catalog relations but their toast
tables were, which is silly; and (2) if you dropped and recreated
information_schema, which is a supported operation, the behavior
changed. That's even sillier. With this definition, "catalog
relations" are exactly the ones traceable to the postgres.bki data,
which seems like what we want.
With this simplification, we don't actually need access to the pg_class
tuple to identify a catalog relation; we only need its OID. Hence,
replace IsCatalogClass with "IsCatalogRelationOid(oid)". But keep
IsCatalogRelation as a convenience function.
This allows fixing some arguably-wrong semantics in contrib/sepgsql and
ReindexRelationConcurrently, which were using an IsSystemNamespace test
where what they really should be using is IsCatalogRelationOid. The
previous coding failed to protect toast tables of system catalogs, and
also was not on board with the general principle that user-created tables
do not become catalogs just by virtue of being renamed into pg_catalog.
We can also get rid of a messy hack in ReindexMultipleTables.
While we're at it, also rename IsSystemNamespace to IsCatalogNamespace,
because the previous name invited confusion with the more expansive
semantics used by IsSystemRelation/Class.
Also improve the comments in catalog.c.
There are a few remaining places in replication-related code that are
special-casing OIDs below FirstNormalObjectId. I'm inclined to think
those are wrong too, and if there should be any special case it should
just extend to FirstBootstrapObjectId. But first we need to debate
whether a FOR ALL TABLES publication should include information_schema.
Discussion: https://postgr.es/m/21697.1557092753@sss.pgh.pa.us
Discussion: https://postgr.es/m/15150.1557257111@sss.pgh.pa.us
Some messages related to foreign servers were reporting the server name
without quotes, or not at all; our style is to have all names be quoted,
and the server name already appears quoted in a few other messages, so
just add quotes and make them all consistent.
Remove an extra "s" in other messages (typos introduced by myself in
f56f8f8da6af).
Project style is to check the success of SearchSysCacheN and friends
by applying HeapTupleIsValid to the result. A tiny minority of calls
creatively did it differently. Bring them into line with the rest.
This is just cosmetic, since HeapTupleIsValid is indeed just a null
check at the moment ... but that may not be true forever, and in any
case it puts a mental burden on readers who may wonder why these
call sites are not like the rest.
Back-patch to v11 just to keep the branches in sync. (The bulk of these
errors seem to have originated in v11 or v12, though a few are old.)
Per searching to see if anyplace else had made the same error
repaired in 62148c352.
Most of these stem from d25f519107 "tableam: relation creation, VACUUM
FULL/CLUSTER, SET TABLESPACE.".
1) To pass data to the relation_set_new_filenode()
RelationSetNewRelfilenode() was made to update RelationData.rd_rel
directly. That's not OK however, as it makes the relcache entries
temporarily inconsistent. Which among other scenarios is a problem
if a REINDEX targets an index on pg_class - the
CatalogTupleUpdate() in RelationSetNewRelfilenode(). Presumably
that was introduced because other places in the code do so - while
those aren't "good practice" they don't appear to be actively
buggy (e.g. because system tables may not be targeted).
I (Andres) should have caught this while reviewing and signficantly
evolving the code in that commit, mea culpa.
Fix that by instead passing in the new RelFileNode as separate
argument to relation_set_new_filenode() and rely on the relcache to
update the catalog entry. Also revert that the
RelationMapUpdateMap() call was changed to immediate, and undo some
other more unnecessary changes.
2) Document that the relation_set_new_filenode cannot rely on the
whole relcache entry to be valid. It might be worthwhile to
refactor the code to never have to rely on that, but given the way
heap_create() is currently coded, that'd be a large change.
3) ATExecSetTableSpace() shouldn't do FlushRelationBuffers() itself. A
table AM might not use shared buffers at all. Move to
index_copy_data() and heapam_relation_copy_data().
4) heapam_relation_set_new_filenode() previously sometimes accessed
rel->rd_rel->relpersistence rather than the `persistence`
argument. Code movement mistake.
5) Previously heapam_relation_set_new_filenode() re-opened the smgr
relation to create the init for, if necesary. Instead have
RelationCreateStorage() return the SMgrRelation and use it to
create the init fork.
6) Add a note about the danger of modifying the relcache directly to
ATExecSetTableSpace() - it's currently not a bug because there's a
check ERRORing for catalog tables.
Regression tests and assertion improvements that together trigger the
bug described in 1) will be added in a later commit, as there is a
related bug on all branches.
Reported-By: Michael Paquier
Diagnosed-By: Tom Lane and Andres Freund
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20190418011430.GA19133@paquier.xyz
Fix DefineIndex so that it doesn't attempt to pass down a to-be-reused
index relfilenode to a child index creation, and fix TryReuseIndex
to not think that reuse is sensible for a partitioned index.
In v11, this fixes a problem where ALTER TABLE on a partitioned table
could assign the same relfilenode to several different child indexes,
causing very nasty catalog corruption --- in fact, attempting to DROP
the partitioned table then leads not only to a database crash, but to
inability to restart because the same crash will recur during WAL replay.
Either of these two changes would be enough to prevent the failure, but
since neither action could possibly be sane, let's put in both changes
for future-proofing.
In HEAD, no such bug manifests, but that's just an accidental consequence
of having changed the pg_class representation of partitioned indexes to
have relfilenode = 0. Both of these changes still seem like smart
future-proofing.
This is only a stop-gap because the code for ALTER TABLE on a partitioned
table with a no-op type change still leaves a great deal to be desired.
As the added regression tests show, it gets things wrong for comments on
child indexes/constraints, and it is regenerating child indexes it doesn't
have to. However, fixing those problems will take more work which may not
get back-patched into v11. We need a fix for the corruption problem now.
Per bug #15672 from Jianing Yang.
Patch by me, regression test cases based on work by Amit Langote,
who also did a lot of the investigative work.
Discussion: https://postgr.es/m/15672-b9fa7db32698269f@postgresql.org
When an existing index in a partition is attached to a new index on
its parent, we forgot to set the "relispartition" flag correctly, which
meant that it was not possible to find the index in various operations,
such as adding a foreign key constraint that references that partitioned
table. One of four places that was assigning the parent index was
forgetting to do that, so fix by shifting responsibility of updating the
flag to the routine that changes the parent.
Author: Amit Langote, Álvaro Herrera
Reported-by: Hubert "depesz" Lubaczewski
Discussion: https://postgr.es/m/CA+HiwqHMsRtRYRWYTWavKJ8x14AFsv7bmAV46mYwnfD3vy8goQ@mail.gmail.com
Commit ca4103025dfe left a few loose ends. The most important one
(broken pg_dump output) is already fixed by virtue of commit
3b23552ad8bb, but some things remained:
* When ALTER TABLE rewrites tables, the indexes must remain in the
tablespace they were originally in. This didn't work because
index recreation during ALTER TABLE runs manufactured SQL (yuck),
which runs afoul of default_tablespace in competition with the parent
relation tablespace. To fix, reset default_tablespace to the empty
string temporarily, and add the TABLESPACE clause as appropriate.
* Setting a partitioned rel's tablespace to the database default is
confusing; if it worked, it would direct the partitions to that
tablespace regardless of default_tablespace. But in reality it does
not work, and making it work is a larger project. Therefore, throw
an error when this condition is detected, to alert the unwary.
Add some docs and tests, too.
Author: Álvaro Herrera
Discussion: https://postgr.es/m/CAKJS1f_1c260nOt_vBJ067AZ3JXptXVRohDVMLEBmudX1YEx-A@mail.gmail.com
Up to now, DefineIndex() was responsible for adding attnotnull constraints
to the columns of a primary key, in any case where it hadn't been
convenient for transformIndexConstraint() to mark those columns as
is_not_null. It (or rather its minion index_check_primary_key) did this
by executing an ALTER TABLE SET NOT NULL command for the target table.
The trouble with this solution is that if we're creating the index due
to ALTER TABLE ADD PRIMARY KEY, and the outer ALTER TABLE has additional
sub-commands, the inner ALTER TABLE's operations executed at the wrong
time with respect to the outer ALTER TABLE's operations. In particular,
the inner ALTER would perform a validation scan at a point where the
table's storage might be inconsistent with its catalog entries. (This is
on the hairy edge of being a security problem, but AFAICS it isn't one
because the inner scan would only be interested in the tuples' null
bitmaps.) This can result in unexpected failures, such as the one seen
in bug #15580 from Allison Kaptur.
To fix, let's remove the attempt to do SET NOT NULL from DefineIndex(),
reducing index_check_primary_key's role to verifying that the columns are
already not null. (It shouldn't ever see such a case, but it seems wise
to keep the check for safety.) Instead, make transformIndexConstraint()
generate ALTER TABLE SET NOT NULL subcommands to be executed ahead of
the ADD PRIMARY KEY operation in every case where it can't force the
column to be created already-not-null. This requires only minor surgery
in parse_utilcmd.c, and it makes for a much more satisfying spec for
transformIndexConstraint(): it's no longer having to take it on faith
that someone else will handle addition of NOT NULL constraints.
To make that work, we have to move the execution of AT_SetNotNull into
an ALTER pass that executes ahead of AT_PASS_ADD_INDEX. I moved it to
AT_PASS_COL_ATTRS, and put that after AT_PASS_ADD_COL to avoid failure
when the column is being added in the same command. This incidentally
fixes a bug in the only previous usage of AT_PASS_COL_ATTRS, for
AT_SetIdentity: it didn't work either for a newly-added column.
Playing around with this exposed a separate bug in ALTER TABLE ONLY ...
ADD PRIMARY KEY for partitioned tables. The intent of the ONLY modifier
in that context is to prevent doing anything that would require holding
lock for a long time --- but the implied SET NOT NULL would recurse to
the child partitions, and do an expensive validation scan for any child
where the column(s) were not already NOT NULL. To fix that, invent a
new ALTER subcommand AT_CheckNotNull that just insists that a child
column be already NOT NULL, and apply that, not AT_SetNotNull, when
recursing to children in this scenario. This results in a slightly laxer
definition of ALTER TABLE ONLY ... SET NOT NULL for partitioned tables,
too: that command will now work as long as all children are already NOT
NULL, whereas before it just threw up its hands if there were any
partitions.
In passing, clean up the API of generateClonedIndexStmt(): remove a
useless argument, ensure that the output argument is not left undefined,
update the header comment.
A small side effect of this change is that no-such-column errors in ALTER
TABLE ADD PRIMARY KEY now produce a different message that includes the
table name, because they are now detected by the SET NOT NULL step which
has historically worded its error that way. That seems fine to me, so
I didn't make any effort to avoid the wording change.
The basic bug #15580 is of very long standing, and these other bugs
aren't new in v12 either. However, this is a pretty significant change
in the way ALTER TABLE ADD PRIMARY KEY works. On balance it seems best
not to back-patch, at least not till we get some more confidence that
this patch has no new bugs.
Patch by me, but thanks to Jie Zhang for a preliminary version.
Discussion: https://postgr.es/m/15580-d1a6de5a3d65da51@postgresql.org
Discussion: https://postgr.es/m/1396E95157071C4EBBA51892C5368521017F2E6E63@G08CNEXMBPEKD02.g08.fujitsu.local
The foreign-key-checking loop in ATRewriteTables failed to ignore
relations without storage (e.g., partitioned tables), unlike the
initial loop. This accidentally worked as long as RI_Initial_Check
succeeded, which it does in most practical cases (including all the
ones exercised in the existing regression tests :-(). However, if
that failed, as for instance when there are permissions issues,
then we entered the slow fire-the-trigger-on-each-tuple path.
And that would try to read from the referencing relation, and fail
if it lacks storage.
A second problem, recently introduced in HEAD, was that this loop
had been broken by sloppy refactoring for the tableam API changes.
Repair both issues, and add a regression test case so we have some
coverage on this code path. Back-patch as needed to v11.
(It looks like this code could do with additional bulletproofing,
but let's get a working test case in place first.)
Hadi Moshayedi, Tom Lane, Andres Freund
Discussion: https://postgr.es/m/CAK=1=WrnNmBbe5D9sm3t0a6dnAq3cdbF1vXY816j1wsMqzC8bw@mail.gmail.com
Discussion: https://postgr.es/m/19030.1554574075@sss.pgh.pa.us
Discussion: https://postgr.es/m/20190325180405.jytoehuzkeozggxx%40alap3.anarazel.de
Previously it was allowed to set default_table_access_method to an
empty string. That makes sense for default_tablespace, where that was
copied from, as it signals falling back to the database's default
tablespace. As there is no equivalent for table AMs, forbid that.
Also make sure to throw a usable error when creating a table using an
index AM, by using get_am_type_oid() to implement get_table_am_oid()
instead of a separate copy. Previously we'd error out only later, in
GetTableAmRoutine().
Thirdly remove GetTableAmRoutineByAmId() - it was only used in an
earlier version of 8586bf7ed8.
Add tests for the above (some for index AMs as well).
Previously, while primary keys could be made on partitioned tables, it
was not possible to define foreign keys that reference those primary
keys. Now it is possible to do that.
Author: Álvaro Herrera
Reviewed-by: Amit Langote, Jesper Pedersen
Discussion: https://postgr.es/m/20181102234158.735b3fevta63msbj@alvherre.pgsql
This replaces the previous calls of heap_sync() in places using
bulk-insert. By passing in the flags used for bulk-insert the AM can
decide (first at insert time and then during the finish call) which of
the optimizations apply to it, and what operations are necessary to
finish a bulk insert operation.
Also change HEAP_INSERT_* flags to TABLE_INSERT, and rename hi_options
to ti_options.
These changes are made even in copy.c, which hasn't yet been converted
to tableam. There's no harm in doing so.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
This is an SQL-standard feature that allows creating columns that are
computed from expressions rather than assigned, similar to a view or
materialized view but on a column basis.
This implements one kind of generated column: stored (computed on
write). Another kind, virtual (computed on read), is planned for the
future, and some room is left for it.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/b151f851-4019-bdb1-699e-ebab07d2f40a@2ndquadrant.com
This adds the CONCURRENTLY option to the REINDEX command. A REINDEX
CONCURRENTLY on a specific index creates a new index (like CREATE
INDEX CONCURRENTLY), then renames the old index away and the new index
in place and adjusts the dependencies, and then drops the old
index (like DROP INDEX CONCURRENTLY). The REINDEX command also has
the capability to run its other variants (TABLE, DATABASE) with the
CONCURRENTLY option (but not SYSTEM).
The reindexdb command gets the --concurrently option.
Author: Michael Paquier, Andreas Karlsson, Peter Eisentraut
Reviewed-by: Andres Freund, Fujii Masao, Jim Nasby, Sergei Kornilov
Discussion: https://www.postgresql.org/message-id/flat/60052986-956b-4478-45ed-8bd119e9b9cf%402ndquadrant.com#74948a1044c56c5e817a5050f554ddee
This moves the responsibility for:
- creating the storage necessary for a relation, including creating a
new relfilenode for a relation with existing storage
- non-transactional truncation of a relation
- VACUUM FULL / CLUSTER's rewrite of a table
below tableam.
This is fairly straight forward, with a bit of complexity smattered in
to move the computation of xid / multixid horizons below the AM, as
they don't make sense for every table AM.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
ALTER INDEX .. ATTACH PARTITION fails if the partitioned table where the
index is defined contains more dropped columns than its partition, with
this message:
ERROR: incorrect attribute map
The cause was that one caller of CompareIndexInfo was passing the number
of attributes of the partition rather than the parent, which confused
the length check. Repair.
This can cause pg_upgrade to fail when used on such a database. Leave
some more objects around after regression tests, so that the case is
detected by pg_upgrade test suite.
Remove some spurious empty lines noticed while looking for other cases
of the same problem.
Discussion: https://postgr.es/m/20190326213924.GA2322@alvherre.pgsql
If existing CHECK or NOT NULL constraints preclude the presence
of nulls, we need not look to see whether any are present.
Sergei Kornilov, reviewed by Stephen Frost, Ildar Musin, David Rowley,
and by me.
Discussion: http://postgr.es/m/81911511895540@web58j.yandex.ru
Too allow table accesses to be not directly dependent on heap, several
new abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for a other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on,
it's still used widely to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends need
to be able to do so without the user doing per-AM work. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block oriented,
block oriented callbacks that can be shared between such AMs are
provided and used by heap. table_block_parallelscan_{estimate,
intiialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state across individual accesses to the heap to
store state like buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
Similar to table_scan_getnextslot(), it's undesirable to continue
to return HeapTuples. Thus index_fetch_heap (might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
To be able to sensible adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AMs
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
also would have been needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.dehttps://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
When the timezone is UTC, timestamptz and timestamp are binary coercible
in both directions. See b8a18ad4850ea5ad7884aa6ab731fd392e73b4ad and
c22ecc6562aac895f0f0529707d7bdb460fd2a49 for the previous attempt in
this problem space. Skip the table rewrite; for now, continue to
needlessly rewrite any index on an affected column.
Reviewed by Simon Riggs and Tom Lane.
Discussion: https://postgr.es/m/20190226061450.GA1665944@rfd.leadboat.com
We still require AccessExclusiveLock on the partition itself, because
otherwise an insert that violates the newly-imposed partition
constraint could be in progress at the same time that we're changing
that constraint; only the lock level on the parent relation is
weakened.
To make this safe, we have to cope with (at least) three separate
problems. First, relevant DDL might commit while we're in the process
of building a PartitionDesc. If so, find_inheritance_children() might
see a new partition while the RELOID system cache still has the old
partition bound cached, and even before invalidation messages have
been queued. To fix that, if we see that the pg_class tuple seems to
be missing or to have a null relpartbound, refetch the value directly
from the table. We can't get the wrong value, because DETACH PARTITION
still requires AccessExclusiveLock throughout; if we ever want to
change that, this will need more thought. In testing, I found it quite
difficult to hit even the null-relpartbound case; the race condition
is extremely tight, but the theoretical risk is there.
Second, successive calls to RelationGetPartitionDesc might not return
the same answer. The query planner will get confused if lookup up the
PartitionDesc for a particular relation does not return a consistent
answer for the entire duration of query planning. Likewise, query
execution will get confused if the same relation seems to have a
different PartitionDesc at different times. Invent a new
PartitionDirectory concept and use it to ensure consistency. This
ensures that a single invocation of either the planner or the executor
sees the same view of the PartitionDesc from beginning to end, but it
does not guarantee that the planner and the executor see the same
view. Since this allows pointers to old PartitionDesc entries to
survive even after a relcache rebuild, also postpone removing the old
PartitionDesc entry until we're certain no one is using it.
For the most part, it seems to be OK for the planner and executor to
have different views of the PartitionDesc, because the executor will
just ignore any concurrently added partitions which were unknown at
plan time; those partitions won't be part of the inheritance
expansion, but invalidation messages will trigger replanning at some
point. Normally, this happens by the time the very next command is
executed, but if the next command acquires no locks and executes a
prepared query, it can manage not to notice until a new transaction is
started. We might want to tighten that up, but it's material for a
separate patch. There would still be a small window where a query
that started just after an ATTACH PARTITION command committed might
fail to notice its results -- but only if the command starts before
the commit has been acknowledged to the user. All in all, the warts
here around serializability seem small enough to be worth accepting
for the considerable advantage of being able to add partitions without
a full table lock.
Although in general the consequences of new partitions showing up
between planning and execution are limited to the query not noticing
the new partitions, run-time partition pruning will get confused in
that case, so that's the third problem that this patch fixes.
Run-time partition pruning assumes that indexes into the PartitionDesc
are stable between planning and execution. So, add code so that if
new partitions are added between plan time and execution time, the
indexes stored in the subplan_map[] and subpart_map[] arrays within
the plan's PartitionedRelPruneInfo get adjusted accordingly. There
does not seem to be a simple way to generalize this scheme to cope
with partitions that are removed, mostly because they could then get
added back again with different bounds, but it works OK for added
partitions.
This code does not try to ensure that every backend participating in
a parallel query sees the same view of the PartitionDesc. That
currently doesn't matter, because we never pass PartitionDesc
indexes between backends. Each backend will ignore the concurrently
added partitions which it notices, and it doesn't matter if different
backends are ignoring different sets of concurrently added partitions.
If in the future that matters, for example because we allow writes in
parallel query and want all participants to do tuple routing to the same
set of partitions, the PartitionDirectory concept could be improved to
share PartitionDescs across backends. There is a draft patch to
serialize and restore PartitionDescs on the thread where this patch
was discussed, which may be a useful place to start.
Patch by me. Thanks to Alvaro Herrera, David Rowley, Simon Riggs,
Amit Langote, and Michael Paquier for discussion, and to Alvaro
Herrera for some review.
Discussion: http://postgr.es/m/CA+Tgmobt2upbSocvvDej3yzokd7AkiT+PvgFH+a9-5VV1oJNSQ@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoZE0r9-cyA-aY6f8WFEROaDLLL7Vf81kZ8MtFCkxpeQSw@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com
This introduces the concept of table access methods, i.e. CREATE
ACCESS METHOD ... TYPE TABLE and
CREATE TABLE ... USING (storage-engine).
No table access functionality is delegated to table AMs as of this
commit, that'll be done in following commits.
Subsequent commits will incrementally abstract table access
functionality to be routed through table access methods. That change
is too large to be reviewed & committed at once, so it'll be done
incrementally.
Docs will be updated at the end, as adding them incrementally would
likely make them less coherent, and definitely is a lot more work,
without a lot of benefit.
Table access methods are specified similar to index access methods,
i.e. pg_am.amhandler returns, as INTERNAL, a pointer to a struct with
callbacks. In contrast to index AMs that struct needs to live as long
as a backend, typically that's achieved by just returning a pointer to
a constant struct.
Psql's \d+ now displays a table's access method. That can be disabled
with HIDE_TABLEAM=true, which is mainly useful so regression tests can
be run against different AMs. It's quite possible that this behaviour
still needs to be fine tuned.
For now it's not allowed to set a table AM for a partitioned table, as
we've not resolved how partitions would inherit that. Disallowing
allows us to introduce, if we decide that's the way forward, such a
behaviour without a compatibility break.
Catversion bumped, to add the heap table AM and references to it.
Author: Haribabu Kommi, Andres Freund, Alvaro Herrera, Dimitri Golgov and others
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.dehttps://postgr.es/m/20160812231527.GA690404@alvherre.pgsqlhttps://postgr.es/m/20190107235616.6lur25ph22u5u5av@alap3.anarazel.dehttps://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de
In preparation for abstracting table storage, convert trigger.c to
track tuples in slots. Which also happens to make code calling
triggers simpler.
As the calling interface for triggers themselves is not changed in
this patch, HeapTuples still are extracted from the slot at that
time. But that's handled solely inside trigger.c, not visible to
callers. It's quite likely that we'll want to revise the external
trigger interface, but that's a separate large project.
As part of this work the slots used for old/new/return tuples are
moved from EState into ResultRelInfo, as different updated tables
might need different slots. The slots are now also now created
on-demand, which is good both from an efficiency POV, but also makes
the modifying code simpler.
Author: Andres Freund, Amit Khandekar and Ashutosh Bapat
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
After the introduction of tuple table slots all table AMs need to
support returning the table oid of the tuple stored in a slot created
by said AM. It does not make sense to re-implement that in every AM,
therefore move handling of table OIDs into the TupleTableSlot
structure itself. It's possible that we, at a later date, might want
to get rid of HeapTupleData.t_tableOid entirely, but doing so before
the abstractions for table AMs are integrated turns out to be too
hard, so delay that for now.
Similarly, every AM needs to support the concept of a tuple
identifier (tid / item pointer) for its tuples. It's quite possible
that we'll generalize the exact form of a tid at a future point (to
allow for things like index organized tables), but for now many parts
of the code know about tids, so there's not much point in abstracting
tids away. Therefore also move into slot (rather than providing API to
set/get the tid associated with the tuple in a slot).
Once table AM includes insert/updating/deleting tuples, the
responsibility to set the correct tid after such an action will move
into that. After that change, code doing such modifications, should
not have to deal with HeapTuples directly anymore.
Author: Andres Freund, Haribabu Kommi and Ashutosh Bapat
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
This is similar in spirit to the existing partbounds.c file in the
same directory, except that there's a lot less code in the new file
created by this commit. Pending work in this area proposes to add a
bunch more code related to PartitionDescs, though, and this will give
us a good place to put it.
Discussion: http://postgr.es/m/CA+TgmoZUwPf_uanjF==gTGBMJrn8uCq52XYvAEorNkLrUdoawg@mail.gmail.com
The original setup for dependencies of partitioned objects had
serious problems:
1. It did not verify that a drop cascading to a partition-child object
also cascaded to at least one of the object's partition parents. Now,
normally a child object would share all its dependencies with one or
another parent (e.g. a child index's opclass dependencies would be shared
with the parent index), so that this oversight is usually harmless.
But if some dependency failed to fit this pattern, the child could be
dropped while all its parents remain, creating a logically broken
situation. (It's easy to construct artificial cases that break it,
such as attaching an unrelated extension dependency to the child object
and then dropping the extension. I'm not sure if any less-artificial
cases exist.)
2. Management of partition dependencies during ATTACH/DETACH PARTITION
was complicated and buggy; for example, after detaching a partition
table it was possible to create cases where a formerly-child index
should be dropped and was not, because the correct set of dependencies
had not been reconstructed.
Less seriously, because multiple partition relationships were
represented identically in pg_depend, there was an order-of-traversal
dependency on which partition parent was cited in error messages.
We also had some pre-existing order-of-traversal hazards for error
messages related to internal and extension dependencies. This is
cosmetic to users but causes testing problems.
To fix#1, add a check at the end of the partition tree traversal
to ensure that at least one partition parent got deleted. To fix#2,
establish a new policy that partition dependencies are in addition to,
not instead of, a child object's usual dependencies; in this way
ATTACH/DETACH PARTITION need not cope with adding or removing the
usual dependencies.
To fix the cosmetic problem, distinguish between primary and secondary
partition dependency entries in pg_depend, by giving them different
deptypes. (They behave identically except for having different
priorities for being cited in error messages.) This means that the
former 'I' dependency type is replaced with new 'P' and 'S' types.
This also fixes a longstanding bug that after handling an internal
dependency by recursing to the owning object, findDependentObjects
did not verify that the current target was now scheduled for deletion,
and did not apply the current recursion level's objflags to it.
Perhaps that should be back-patched; but in the back branches it
would only matter if some concurrent transaction had removed the
internal-linkage pg_depend entry before the recursive call found it,
or the recursive call somehow failed to find it, both of which seem
unlikely.
Catversion bump because the contents of pg_depend change for
partitioning relationships.
Patch HEAD only. It's annoying that we're not fixing #2 in v11,
but there seems no practical way to do so given that the problem
is exactly a poor choice of what entries to put in pg_depend.
We can't really fix that while staying compatible with what's
in pg_depend in existing v11 installations.
Discussion: https://postgr.es/m/CAH2-Wzkypv1R+teZrr71U23J578NnTBt2X8+Y=Odr4pOdW1rXg@mail.gmail.com
After commit 123cc697a8eb, we remove redundant FK action triggers during
partition ATTACH by merely deleting the catalog tuple, but that's wrong:
it should use performDeletion() instead. Repair, and make the comments
more explicit.
Per code review from Tom Lane.
Discussion: https://postgr.es/m/18885.1549642539@sss.pgh.pa.us
We can't allow these pseudo-types to be used as table column types,
because storing an anonymous record value in a table would result
in data that couldn't be understood by other sessions. However,
it seems like there's no harm in allowing the case in a column
definition list that's specifying what a function-returning-record
returns. The data involved is all local to the current session,
so we should be just as able to resolve its actual tuple type as
we are for the function-returning-record's top-level tuple output.
Elvis Pranskevichus, with cosmetic changes by me
Discussion: https://postgr.es/m/11038447.kQ5A9Uj5xi@hammer.magicstack.net
Create a new header optimizer/optimizer.h, which exposes just the
planner functions that can be used "at arm's length", without need
to access Paths or the other planner-internal data structures defined
in nodes/relation.h. This is intended to provide the whole planner
API seen by most of the rest of the system; although FDWs still need
to use additional stuff, and more thought is also needed about just
what selfuncs.c should rely on.
The main point of doing this now is to limit the amount of new
#include baggage that will be needed by "planner support functions",
which I expect to introduce later, and which will be in relevant
datatype modules rather than anywhere near the planner.
This commit just moves relevant declarations into optimizer.h from
other header files (a couple of which go away because everything
got moved), and adjusts #include lists to match. There's further
cleanup that could be done if we want to decide that some stuff
being exposed by optimizer.h doesn't belong in the planner at all,
but I'll leave that for another day.
Discussion: https://postgr.es/m/11460.1548706639@sss.pgh.pa.us
Before this change FunctionCallInfoData, the struct arguments etc for
V1 function calls are stored in, always had space for
FUNC_MAX_ARGS/100 arguments, storing datums and their nullness in two
arrays. For nearly every function call 100 arguments is far more than
needed, therefore wasting memory. Arg and argnull being two separate
arrays also guarantees that to access a single argument, two
cachelines have to be touched.
Change the layout so there's a single variable-length array with pairs
of value / isnull. That drastically reduces memory consumption for
most function calls (on x86-64 a two argument function now uses
64bytes, previously 936 bytes), and makes it very likely that argument
value and its nullness are on the same cacheline.
Arguments are stored in a new NullableDatum struct, which, due to
padding, needs more memory per argument than before. But as usually
far fewer arguments are stored, and individual arguments are cheaper
to access, that's still a clear win. It's likely that there's other
places where conversion to NullableDatum arrays would make sense,
e.g. TupleTableSlots, but that's for another commit.
Because the function call information is now variable-length
allocations have to take the number of arguments into account. For
heap allocations that can be done with SizeForFunctionCallInfoData(),
for on-stack allocations there's a new LOCAL_FCINFO(name, nargs) macro
that helps to allocate an appropriately sized and aligned variable.
Some places with stack allocation function call information don't know
the number of arguments at compile time, and currently variably sized
stack allocations aren't allowed in postgres. Therefore allow for
FUNC_MAX_ARGS space in these cases. They're not that common, so for
now that seems acceptable.
Because of the need to allocate FunctionCallInfo of the appropriate
size, older extensions may need to update their code. To avoid subtle
breakages, the FunctionCallInfoData struct has been renamed to
FunctionCallInfoBaseData. Most code only references FunctionCallInfo,
so that shouldn't cause much collateral damage.
This change is also a prerequisite for more efficient expression JIT
compilation (by allocating the function call information on the stack,
allowing LLVM to optimize it away); previously the size of the call
information caused problems inside LLVM's optimizer.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20180605172952.x34m5uz6ju6enaem@alap3.anarazel.de
There were two flags used to track the access to temporary tables and
to the temporary namespace of a session which are used to restrict
PREPARE TRANSACTION, however the first control flag is a concept
included in the second. This removes the flag for temporary table
tracking, keeping around only the one at namespace level.
Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/20190118053126.GH1883@paquier.xyz
Previously, only literals were allowed. This change allows general
expressions, including functions calls, which are evaluated at the
time the DDL command is executed.
Besides offering some more functionality, it simplifies the parser
structures and removes some inconsistencies in how the literals were
handled.
Author: Kyotaro Horiguchi, Tom Lane, Amit Langote
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/9f88b5e0-6da2-5227-20d0-0d7012beaa1c@lab.ntt.co.jp/
We were failing to set conislocal correctly for constraints in
partitions after partition detach, leading to those constraints becoming
undroppable. Fix by setting the flag correctly. Existing databases
might contain constraints with the conislocal wrongly set to false, for
partitions that were detached; this situation should be fixable by
applying an UPDATE on pg_constraint to set conislocal true. This
problem should otherwise be innocuous and should disappear across a
dump/restore or pg_upgrade.
Secondarily, when constraint drop was attempted in a partitioned table,
ATExecDropConstraint would try to recurse to partitions after doing
performDeletion() of the constraint in the partitioned table itself; but
since the constraint in the partitions are dropped by the initial call
of performDeletion() (because of following dependencies), the recursion
step would fail since it would not find the constraint, causing the
whole operation to fail. Fix by preventing recursion.
Reported-by: Amit Langote
Diagnosed-by: Amit Langote
Author: Amit Langote, Álvaro Herrera
Discussion: https://postgr.es/m/f2b8ead5-4131-d5a8-8016-2ea0a31250af@lab.ntt.co.jp
Detaching a partition from a partitioned table that's constrained by
foreign keys requires additional action triggers on the referenced side;
otherwise, DELETE/UPDATE actions there fail to notice rows in the table
that was partition, and so are incorrectly allowed through. With this
commit, those triggers are now created. Conversely, when a table that
has a foreign key is attached as a partition to a table that also has
the same foreign key, those action triggers are no longer needed, so we
remove them.
Add a minimal test case verifying (part of) this.
Authors: Amit Langote, Álvaro Herrera
Discussion: https://postgr.es/m/f2b8ead5-4131-d5a8-8016-2ea0a31250af@lab.ntt.co.jp
Most of these had been obsoleted by 568d4138c / the SnapshotNow
removal.
This is is preparation for moving most of tqual.[ch] into either
snapmgr.h or heapam.h, which in turn is in preparation for pluggable
table AMs.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
When creating a foreign key in a partitioned table, if some partitions
already have equivalent constraints, we wastefully create duplicates of
the constraints instead of attaching to the existing ones. That's
inconsistent with the de-duplication that is applied when a table is
attached as a partition. To fix, reuse the FK-cloning code instead of
having a separate code path.
Backpatch to Postgres 11. This is a subtle behavior change, but surely
a welcome one since there's no use in having duplicate foreign keys.
Discovered by Álvaro Herrera while thinking about a different problem
reported by Jesper Pedersen (bug #15587).
Author: Álvaro Herrera
Discussion: https://postgr.es/m/201901151935.zfadrzvyof4k@alvherre.pgsql
My commit 3de241dba86f introduced some code to create a clone of a
foreign key to a partition, but I put it in pg_constraint.c because it
was too close to the contents of the pg_constraint row. With the
previous commit that split out the constraint tuple deconstruction into
its own routine, it makes more sense to have the FK-cloning function in
tablecmds.c, mostly because its static subroutine can then be used by a
future bugfix.
My initial posting of this patch had this routine as static in
tablecmds.c, but sadly this function is already part of the Postgres 11
ABI as exported from pg_constraint.c, so keep it as exported also just
to avoid breaking any possible users of it.