Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the systable_inplace_update_begin()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reported by Smolkin Grigory.  Reviewed by Nitin Motiani, (in earlier
versions) Heikki Linnakangas, and (in earlier versions) Alexander
Lakhin.

Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bG2mgqecErqR5gGGOA@mail.gmail.com
Author: Noah Misch
Date:   2024-09-24 15:25:18 -07:00
Commit: a07e03fd8f (parent: dbf3f974ee)

16 changed files with 801 additions and 155 deletions


@@ -153,3 +153,14 @@ The following infomask bits are applicable:
We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
is set.
Reading inplace-updated columns
-------------------------------
Inplace updates create an exception to the rule that tuple data won't change
under a reader holding a pin. A reader of a heap_fetch() result tuple may
witness a torn read. Current inplace-updated fields are aligned and are no
wider than four bytes, and current readers don't need consistency across
fields. Hence, they get by with just fetching each field once. XXX such a
caller may also read a value that has not reached WAL; see
systable_inplace_update_finish().


@@ -47,7 +47,6 @@
#include "storage/predicate.h"
#include "storage/procarray.h"
#include "utils/datum.h"
#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/spccache.h"
@@ -6023,61 +6022,167 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
}
/*
* heap_inplace_update - update a tuple "in place" (ie, overwrite it)
* heap_inplace_lock - protect inplace update from concurrent heap_update()
*
* Overwriting violates both MVCC and transactional safety, so the uses
* of this function in Postgres are extremely limited. Nonetheless we
* find some places to use it.
* Evaluate whether the tuple's state is compatible with a no-key update.
* Current transaction rowmarks are fine, as is KEY SHARE from any
* transaction. If compatible, return true with the buffer exclusive-locked,
* and the caller must release that by calling
* heap_inplace_update_and_unlock(), calling heap_inplace_unlock(), or raising
* an error. Otherwise, return false after blocking transactions, if any,
* have ended.
*
* The tuple cannot change size, and therefore it's reasonable to assume
* that its null bitmap (if any) doesn't change either. So we just
* overwrite the data portion of the tuple without touching the null
* bitmap or any of the header fields.
* Since this is intended for system catalogs and SERIALIZABLE doesn't cover
* DDL, this doesn't guarantee any particular predicate locking.
*
* tuple is an in-memory tuple structure containing the data to be written
* over the target tuple. Also, tuple->t_self identifies the target tuple.
* One could modify this to return true for tuples with delete in progress.
* All inplace updaters take a lock that conflicts with DROP; if an explicit
* "DELETE FROM pg_class" is in progress, we'll wait for it like we would an
* update.
*
* Note that the tuple updated here had better not come directly from the
* syscache if the relation has a toast relation as this tuple could
* include toast values that have been expanded, causing a failure here.
* Readers of inplace-updated fields expect changes to those fields are
* durable. For example, vac_truncate_clog() reads datfrozenxid from
* pg_database tuples via catalog snapshots. A future snapshot must not
* return a lower datfrozenxid for the same database OID (lower in the
* FullTransactionIdPrecedes() sense). We achieve that since no update of a
* tuple can start while we hold a lock on its buffer. In cases like
* BEGIN;GRANT;CREATE INDEX;COMMIT we're inplace-updating a tuple visible only
* to this transaction. ROLLBACK then is one case where it's okay to lose
* inplace updates. (Restoring relhasindex=false on ROLLBACK is fine, since
* any concurrent CREATE INDEX would have blocked, then inplace-updated the
* committed tuple.)
*
* In principle, we could avoid waiting by overwriting every tuple in the
* updated tuple chain. Reader expectations permit updating a tuple only if
* it's aborted, is the tail of the chain, or we already updated the tuple
* referenced in its t_ctid. Hence, we would need to overwrite the tuples in
* order from tail to head. That would imply either (a) mutating all tuples
* in one critical section or (b) accepting a chance of partial completion.
* Partial completion of a relfrozenxid update would have the weird
* consequence that the table's next VACUUM could see the table's relfrozenxid
* move forward between vacuum_get_cutoffs() and finishing.
*/
bool
heap_inplace_lock(Relation relation,
HeapTuple oldtup_ptr, Buffer buffer)
{
HeapTupleData oldtup = *oldtup_ptr; /* minimize diff vs. heap_update() */
TM_Result result;
bool ret;
Assert(BufferIsValid(buffer));
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/*----------
* Interpret HeapTupleSatisfiesUpdate() like heap_update() does, except:
*
* - wait unconditionally
* - no tuple locks
* - don't recheck header after wait: simpler to defer to next iteration
* - don't try to continue even if the updater aborts: likewise
* - no crosscheck
*/
result = HeapTupleSatisfiesUpdate(&oldtup, GetCurrentCommandId(false),
buffer);
if (result == TM_Invisible)
{
/* no known way this can happen */
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg_internal("attempted to overwrite invisible tuple")));
}
else if (result == TM_SelfModified)
{
/*
* CREATE INDEX might reach this if an expression is silly enough to
* call e.g. SELECT ... FROM pg_class FOR SHARE. C code of other SQL
* statements might get here after a heap_update() of the same row, in
* the absence of an intervening CommandCounterIncrement().
*/
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("tuple to be updated was already modified by an operation triggered by the current command")));
}
else if (result == TM_BeingModified)
{
TransactionId xwait;
uint16 infomask;
xwait = HeapTupleHeaderGetRawXmax(oldtup.t_data);
infomask = oldtup.t_data->t_infomask;
if (infomask & HEAP_XMAX_IS_MULTI)
{
LockTupleMode lockmode = LockTupleNoKeyExclusive;
MultiXactStatus mxact_status = MultiXactStatusNoKeyUpdate;
int remain;
bool current_is_member;
if (DoesMultiXactIdConflict((MultiXactId) xwait, infomask,
lockmode, &current_is_member))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
ret = false;
MultiXactIdWait((MultiXactId) xwait, mxact_status, infomask,
relation, &oldtup.t_self, XLTW_Update,
&remain);
}
else
ret = true;
}
else if (TransactionIdIsCurrentTransactionId(xwait))
ret = true;
else if (HEAP_XMAX_IS_KEYSHR_LOCKED(infomask))
ret = true;
else
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
ret = false;
XactLockTableWait(xwait, relation, &oldtup.t_self,
XLTW_Update);
}
}
else
{
ret = (result == TM_Ok);
if (!ret)
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
}
/*
* GetCatalogSnapshot() relies on invalidation messages to know when to
* take a new snapshot. COMMIT of xwait is responsible for sending the
* invalidation. We're not acquiring heavyweight locks sufficient to
* block if not yet sent, so we must take a new snapshot to ensure a later
* attempt has a fair chance. While we don't need this if xwait aborted,
* don't bother optimizing that.
*/
if (!ret)
InvalidateCatalogSnapshot();
return ret;
}
/*
* heap_inplace_update_and_unlock - core of systable_inplace_update_finish
*
* The tuple cannot change size, and therefore its header fields and null
* bitmap (if any) don't change either.
*/
void
heap_inplace_update(Relation relation, HeapTuple tuple)
heap_inplace_update_and_unlock(Relation relation,
HeapTuple oldtup, HeapTuple tuple,
Buffer buffer)
{
Buffer buffer;
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
HeapTupleHeader htup;
HeapTupleHeader htup = oldtup->t_data;
uint32 oldlen;
uint32 newlen;
/*
* For now, we don't allow parallel updates. Unlike a regular update,
* this should never create a combo CID, so it might be possible to relax
* this restriction, but not without more thought and testing. It's not
* clear that it would be useful, anyway.
*/
if (IsInParallelMode())
ereport(ERROR,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
INJECTION_POINT("inplace-before-pin");
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(&(tuple->t_self)));
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
page = (Page) BufferGetPage(buffer);
offnum = ItemPointerGetOffsetNumber(&(tuple->t_self));
if (PageGetMaxOffsetNumber(page) >= offnum)
lp = PageGetItemId(page, offnum);
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(ERROR, "invalid lp");
htup = (HeapTupleHeader) PageGetItem(page, lp);
oldlen = ItemIdGetLength(lp) - htup->t_hoff;
Assert(ItemPointerEquals(&oldtup->t_self, &tuple->t_self));
oldlen = oldtup->t_len - htup->t_hoff;
newlen = tuple->t_len - tuple->t_data->t_hoff;
if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
elog(ERROR, "wrong tuple length");
@@ -6089,6 +6194,19 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
(char *) tuple->t_data + tuple->t_data->t_hoff,
newlen);
/*----------
* XXX A crash here can allow datfrozenxid to get ahead of relfrozenxid:
*
* ["D" is a VACUUM (ONLY_DATABASE_STATS)]
* ["R" is a VACUUM tbl]
* D: vac_update_datfrozenxid() -> systable_beginscan(pg_class)
* D: systable_getnext() returns pg_class tuple of tbl
* R: memcpy() into pg_class tuple of tbl
* D: raise pg_database.datfrozenxid, XLogInsert(), finish
* [crash]
* [recovery restores datfrozenxid w/o relfrozenxid]
*/
MarkBufferDirty(buffer);
/* XLOG stuff */
@@ -6109,23 +6227,35 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
PageSetLSN(page, recptr);
PageSetLSN(BufferGetPage(buffer), recptr);
}
END_CRIT_SECTION();
UnlockReleaseBuffer(buffer);
heap_inplace_unlock(relation, oldtup, buffer);
/*
* Send out shared cache inval if necessary. Note that because we only
* pass the new version of the tuple, this mustn't be used for any
* operations that could change catcache lookup keys. But we aren't
* bothering with index updates either, so that's true a fortiori.
*
* XXX ROLLBACK discards the invalidation. See test inplace-inval.spec.
*/
if (!IsBootstrapProcessingMode())
CacheInvalidateHeapTuple(relation, tuple, NULL);
}
/*
* heap_inplace_unlock - reverse of heap_inplace_lock
*/
void
heap_inplace_unlock(Relation relation,
HeapTuple oldtup, Buffer buffer)
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
#define FRM_NOOP 0x0001
#define FRM_INVALIDATE_XMAX 0x0002
#define FRM_RETURN_IS_XID 0x0004


@@ -20,6 +20,7 @@
#include "postgres.h"
#include "access/genam.h"
#include "access/heapam.h"
#include "access/relscan.h"
#include "access/tableam.h"
#include "access/transam.h"
@@ -29,6 +30,7 @@
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/acl.h"
#include "utils/injection_point.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/rls.h"
@@ -757,3 +759,138 @@ systable_endscan_ordered(SysScanDesc sysscan)
UnregisterSnapshot(sysscan->snapshot);
pfree(sysscan);
}
/*
* systable_inplace_update_begin --- update a row "in place" (overwrite it)
*
* Overwriting violates both MVCC and transactional safety, so the uses of
* this function in Postgres are extremely limited. Nonetheless we find some
* places to use it. Standard flow:
*
* ... [any slow preparation not requiring oldtup] ...
* systable_inplace_update_begin([...], &tup, &inplace_state);
* if (!HeapTupleIsValid(tup))
* elog(ERROR, [...]);
* ... [buffer is exclusive-locked; mutate "tup"] ...
* if (dirty)
* systable_inplace_update_finish(inplace_state, tup);
* else
* systable_inplace_update_cancel(inplace_state);
*
* The first several params duplicate the systable_beginscan() param list.
* "oldtupcopy" is an output parameter, assigned NULL if the key ceases to
* find a live tuple. (In PROC_IN_VACUUM, that is a low-probability transient
* condition.) If "oldtupcopy" gets non-NULL, you must pass output parameter
* "state" to systable_inplace_update_finish() or
* systable_inplace_update_cancel().
*/
void
systable_inplace_update_begin(Relation relation,
Oid indexId,
bool indexOK,
Snapshot snapshot,
int nkeys, const ScanKeyData *key,
HeapTuple *oldtupcopy,
void **state)
{
int retries = 0;
SysScanDesc scan;
HeapTuple oldtup;
/*
* For now, we don't allow parallel updates. Unlike a regular update,
* this should never create a combo CID, so it might be possible to relax
* this restriction, but not without more thought and testing. It's not
* clear that it would be useful, anyway.
*/
if (IsInParallelMode())
ereport(ERROR,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
/*
* Accept a snapshot argument, for symmetry, but this function advances
* its snapshot as needed to reach the tail of the updated tuple chain.
*/
Assert(snapshot == NULL);
Assert(IsInplaceUpdateRelation(relation) || !IsSystemRelation(relation));
/* Loop for an exclusive-locked buffer of a non-updated tuple. */
for (;;)
{
TupleTableSlot *slot;
BufferHeapTupleTableSlot *bslot;
CHECK_FOR_INTERRUPTS();
/*
* Processes issuing heap_update (e.g. GRANT) at maximum speed could
* drive us to this error. A hostile table owner has stronger ways to
* damage their own table, so that's minor.
*/
if (retries++ > 10000)
elog(ERROR, "giving up after too many tries to overwrite row");
INJECTION_POINT("inplace-before-pin");
scan = systable_beginscan(relation, indexId, indexOK, snapshot,
nkeys, unconstify(ScanKeyData *, key));
oldtup = systable_getnext(scan);
if (!HeapTupleIsValid(oldtup))
{
systable_endscan(scan);
*oldtupcopy = NULL;
return;
}
slot = scan->slot;
Assert(TTS_IS_BUFFERTUPLE(slot));
bslot = (BufferHeapTupleTableSlot *) slot;
if (heap_inplace_lock(scan->heap_rel,
bslot->base.tuple, bslot->buffer))
break;
systable_endscan(scan);
}
*oldtupcopy = heap_copytuple(oldtup);
*state = scan;
}
/*
* systable_inplace_update_finish --- second phase of inplace update
*
* The tuple cannot change size, and therefore its header fields and null
* bitmap (if any) don't change either.
*/
void
systable_inplace_update_finish(void *state, HeapTuple tuple)
{
SysScanDesc scan = (SysScanDesc) state;
Relation relation = scan->heap_rel;
TupleTableSlot *slot = scan->slot;
BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
HeapTuple oldtup = bslot->base.tuple;
Buffer buffer = bslot->buffer;
heap_inplace_update_and_unlock(relation, oldtup, tuple, buffer);
systable_endscan(scan);
}
/*
* systable_inplace_update_cancel --- abandon inplace update
*
* This is an alternative to making a no-op update.
*/
void
systable_inplace_update_cancel(void *state)
{
SysScanDesc scan = (SysScanDesc) state;
Relation relation = scan->heap_rel;
TupleTableSlot *slot = scan->slot;
BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
HeapTuple oldtup = bslot->base.tuple;
Buffer buffer = bslot->buffer;
heap_inplace_unlock(relation, oldtup, buffer);
systable_endscan(scan);
}


@@ -2792,7 +2792,9 @@ index_update_stats(Relation rel,
{
Oid relid = RelationGetRelid(rel);
Relation pg_class;
ScanKeyData key[1];
HeapTuple tuple;
void *state;
Form_pg_class rd_rel;
bool dirty;
@@ -2826,33 +2828,12 @@ index_update_stats(Relation rel,
pg_class = table_open(RelationRelationId, RowExclusiveLock);
/*
* Make a copy of the tuple to update. Normally we use the syscache, but
* we can't rely on that during bootstrap or while reindexing pg_class
* itself.
*/
if (IsBootstrapProcessingMode() ||
ReindexIsProcessingHeap(RelationRelationId))
{
/* don't assume syscache will work */
TableScanDesc pg_class_scan;
ScanKeyData key[1];
ScanKeyInit(&key[0],
Anum_pg_class_oid,
BTEqualStrategyNumber, F_OIDEQ,
ObjectIdGetDatum(relid));
pg_class_scan = table_beginscan_catalog(pg_class, 1, key);
tuple = heap_getnext(pg_class_scan, ForwardScanDirection);
tuple = heap_copytuple(tuple);
table_endscan(pg_class_scan);
}
else
{
/* normal case, use syscache */
tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
}
ScanKeyInit(&key[0],
Anum_pg_class_oid,
BTEqualStrategyNumber, F_OIDEQ,
ObjectIdGetDatum(relid));
systable_inplace_update_begin(pg_class, ClassOidIndexId, true, NULL,
1, key, &tuple, &state);
if (!HeapTupleIsValid(tuple))
elog(ERROR, "could not find tuple for relation %u", relid);
@@ -2915,11 +2896,12 @@ index_update_stats(Relation rel,
*/
if (dirty)
{
heap_inplace_update(pg_class, tuple);
systable_inplace_update_finish(state, tuple);
/* the above sends a cache inval message */
}
else
{
systable_inplace_update_cancel(state);
/* no need to change tuple, but force relcache inval anyway */
CacheInvalidateRelcacheByTuple(tuple);
}


@@ -29,6 +29,7 @@
#include "catalog/toasting.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "utils/fmgroids.h"
#include "utils/rel.h"
#include "utils/syscache.h"
@@ -333,21 +334,36 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
*/
class_rel = table_open(RelationRelationId, RowExclusiveLock);
reltup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relOid));
if (!HeapTupleIsValid(reltup))
elog(ERROR, "cache lookup failed for relation %u", relOid);
((Form_pg_class) GETSTRUCT(reltup))->reltoastrelid = toast_relid;
if (!IsBootstrapProcessingMode())
{
/* normal case, use a transactional update */
reltup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relOid));
if (!HeapTupleIsValid(reltup))
elog(ERROR, "cache lookup failed for relation %u", relOid);
((Form_pg_class) GETSTRUCT(reltup))->reltoastrelid = toast_relid;
CatalogTupleUpdate(class_rel, &reltup->t_self, reltup);
}
else
{
/* While bootstrapping, we cannot UPDATE, so overwrite in-place */
heap_inplace_update(class_rel, reltup);
ScanKeyData key[1];
void *state;
ScanKeyInit(&key[0],
Anum_pg_class_oid,
BTEqualStrategyNumber, F_OIDEQ,
ObjectIdGetDatum(relOid));
systable_inplace_update_begin(class_rel, ClassOidIndexId, true,
NULL, 1, key, &reltup, &state);
if (!HeapTupleIsValid(reltup))
elog(ERROR, "cache lookup failed for relation %u", relOid);
((Form_pg_class) GETSTRUCT(reltup))->reltoastrelid = toast_relid;
systable_inplace_update_finish(state, reltup);
}
heap_freetuple(reltup);


@@ -1651,7 +1651,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
Relation pgdbrel;
HeapTuple tup;
ScanKeyData scankey;
SysScanDesc scan;
void *inplace_state;
Form_pg_database datform;
int notherbackends;
int npreparedxacts;
@@ -1789,24 +1789,6 @@ dropdb(const char *dbname, bool missing_ok, bool force)
*/
pgstat_drop_database(db_id);
/*
* Get the pg_database tuple to scribble on. Note that this does not
* directly rely on the syscache to avoid issues with flattened toast
* values for the in-place update.
*/
ScanKeyInit(&scankey,
Anum_pg_database_datname,
BTEqualStrategyNumber, F_NAMEEQ,
CStringGetDatum(dbname));
scan = systable_beginscan(pgdbrel, DatabaseNameIndexId, true,
NULL, 1, &scankey);
tup = systable_getnext(scan);
if (!HeapTupleIsValid(tup))
elog(ERROR, "cache lookup failed for database %u", db_id);
datform = (Form_pg_database) GETSTRUCT(tup);
/*
* Except for the deletion of the catalog row, subsequent actions are not
* transactional (consider DropDatabaseBuffers() discarding modified
@@ -1818,8 +1800,17 @@ dropdb(const char *dbname, bool missing_ok, bool force)
* modification is durable before performing irreversible filesystem
* operations.
*/
ScanKeyInit(&scankey,
Anum_pg_database_datname,
BTEqualStrategyNumber, F_NAMEEQ,
CStringGetDatum(dbname));
systable_inplace_update_begin(pgdbrel, DatabaseNameIndexId, true,
NULL, 1, &scankey, &tup, &inplace_state);
if (!HeapTupleIsValid(tup))
elog(ERROR, "cache lookup failed for database %u", db_id);
datform = (Form_pg_database) GETSTRUCT(tup);
datform->datconnlimit = DATCONNLIMIT_INVALID_DB;
heap_inplace_update(pgdbrel, tup);
systable_inplace_update_finish(inplace_state, tup);
XLogFlush(XactLastRecEnd);
/*
@@ -1827,8 +1818,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
* the row will be gone, but if we fail, dropdb() can be invoked again.
*/
CatalogTupleDelete(pgdbrel, &tup->t_self);
systable_endscan(scan);
heap_freetuple(tup);
/*
* Drop db-specific replication slots.


@@ -946,25 +946,18 @@ EventTriggerOnLogin(void)
{
Relation pg_db = table_open(DatabaseRelationId, RowExclusiveLock);
HeapTuple tuple;
void *state;
Form_pg_database db;
ScanKeyData key[1];
SysScanDesc scan;
/*
* Get the pg_database tuple to scribble on. Note that this does
* not directly rely on the syscache to avoid issues with
* flattened toast values for the in-place update.
*/
/* Fetch a copy of the tuple to scribble on */
ScanKeyInit(&key[0],
Anum_pg_database_oid,
BTEqualStrategyNumber, F_OIDEQ,
ObjectIdGetDatum(MyDatabaseId));
scan = systable_beginscan(pg_db, DatabaseOidIndexId, true,
NULL, 1, key);
tuple = systable_getnext(scan);
tuple = heap_copytuple(tuple);
systable_endscan(scan);
systable_inplace_update_begin(pg_db, DatabaseOidIndexId, true,
NULL, 1, key, &tuple, &state);
if (!HeapTupleIsValid(tuple))
elog(ERROR, "could not find tuple for database %u", MyDatabaseId);
@@ -980,13 +973,15 @@ EventTriggerOnLogin(void)
* that avoids possible waiting on the row-level lock. Second,
* that avoids dealing with TOAST.
*
* It's known that changes made by heap_inplace_update() may
* be lost due to concurrent normal updates. However, we are
* OK with that. The subsequent connections will still have a
* chance to set "dathasloginevt" to false.
* Changes made by inplace update may be lost due to
* concurrent normal updates; see inplace-inval.spec. However,
* we are OK with that. The subsequent connections will still
* have a chance to set "dathasloginevt" to false.
*/
heap_inplace_update(pg_db, tuple);
systable_inplace_update_finish(state, tuple);
}
else
systable_inplace_update_cancel(state);
table_close(pg_db, RowExclusiveLock);
heap_freetuple(tuple);
}


@@ -1417,7 +1417,9 @@ vac_update_relstats(Relation relation,
{
Oid relid = RelationGetRelid(relation);
Relation rd;
ScanKeyData key[1];
HeapTuple ctup;
void *inplace_state;
Form_pg_class pgcform;
bool dirty,
futurexid,
@@ -1428,7 +1430,12 @@ vac_update_relstats(Relation relation,
rd = table_open(RelationRelationId, RowExclusiveLock);
/* Fetch a copy of the tuple to scribble on */
ctup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
ScanKeyInit(&key[0],
Anum_pg_class_oid,
BTEqualStrategyNumber, F_OIDEQ,
ObjectIdGetDatum(relid));
systable_inplace_update_begin(rd, ClassOidIndexId, true,
NULL, 1, key, &ctup, &inplace_state);
if (!HeapTupleIsValid(ctup))
elog(ERROR, "pg_class entry for relid %u vanished during vacuuming",
relid);
@@ -1536,7 +1543,9 @@ vac_update_relstats(Relation relation,
/* If anything changed, write out the tuple. */
if (dirty)
heap_inplace_update(rd, ctup);
systable_inplace_update_finish(inplace_state, ctup);
else
systable_inplace_update_cancel(inplace_state);
table_close(rd, RowExclusiveLock);
@@ -1588,6 +1597,7 @@ vac_update_datfrozenxid(void)
bool bogus = false;
bool dirty = false;
ScanKeyData key[1];
void *inplace_state;
/*
* Restrict this task to one backend per database. This avoids race
@@ -1711,20 +1721,18 @@ vac_update_datfrozenxid(void)
relation = table_open(DatabaseRelationId, RowExclusiveLock);
/*
* Get the pg_database tuple to scribble on. Note that this does not
* directly rely on the syscache to avoid issues with flattened toast
* values for the in-place update.
* Fetch a copy of the tuple to scribble on. We could check the syscache
* tuple first. If that concluded !dirty, we'd avoid waiting on
* concurrent heap_update() and would avoid exclusive-locking the buffer.
* For now, don't optimize that.
*/
ScanKeyInit(&key[0],
Anum_pg_database_oid,
BTEqualStrategyNumber, F_OIDEQ,
ObjectIdGetDatum(MyDatabaseId));
scan = systable_beginscan(relation, DatabaseOidIndexId, true,
NULL, 1, key);
tuple = systable_getnext(scan);
tuple = heap_copytuple(tuple);
systable_endscan(scan);
systable_inplace_update_begin(relation, DatabaseOidIndexId, true,
NULL, 1, key, &tuple, &inplace_state);
if (!HeapTupleIsValid(tuple))
elog(ERROR, "could not find tuple for database %u", MyDatabaseId);
@@ -1758,7 +1766,9 @@ vac_update_datfrozenxid(void)
newMinMulti = dbform->datminmxid;
if (dirty)
heap_inplace_update(relation, tuple);
systable_inplace_update_finish(inplace_state, tuple);
else
systable_inplace_update_cancel(inplace_state);
heap_freetuple(tuple);
table_close(relation, RowExclusiveLock);


@@ -233,5 +233,14 @@ extern SysScanDesc systable_beginscan_ordered(Relation heapRelation,
extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
ScanDirection direction);
extern void systable_endscan_ordered(SysScanDesc sysscan);
extern void systable_inplace_update_begin(Relation relation,
Oid indexId,
bool indexOK,
Snapshot snapshot,
int nkeys, const ScanKeyData *key,
HeapTuple *oldtupcopy,
void **state);
extern void systable_inplace_update_finish(void *state, HeapTuple tuple);
extern void systable_inplace_update_cancel(void *state);
#endif /* GENAM_H */


@@ -337,7 +337,13 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
bool follow_updates,
Buffer *buffer, struct TM_FailureData *tmfd);
extern void heap_inplace_update(Relation relation, HeapTuple tuple);
extern bool heap_inplace_lock(Relation relation,
HeapTuple oldtup_ptr, Buffer buffer);
extern void heap_inplace_update_and_unlock(Relation relation,
HeapTuple oldtup, HeapTuple tuple,
Buffer buffer);
extern void heap_inplace_unlock(Relation relation,
HeapTuple oldtup, Buffer buffer);
extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
const struct VacuumCutoffs *cutoffs,
HeapPageFreeze *pagefrz,


@@ -9,20 +9,20 @@ step b1: BEGIN;
step grant1:
GRANT TEMP ON DATABASE isolation_regression TO regress_temp_grantee;
step vac2: VACUUM (FREEZE);
step vac2: VACUUM (FREEZE); <waiting ...>
step snap3:
INSERT INTO frozen_witness
SELECT datfrozenxid FROM pg_database WHERE datname = current_catalog;
step c1: COMMIT;
step vac2: <... completed>
step cmp3:
SELECT 'datfrozenxid retreated'
FROM pg_database
WHERE datname = current_catalog
AND age(datfrozenxid) > (SELECT min(age(x)) FROM frozen_witness);
?column?
----------------------
datfrozenxid retreated
(1 row)
?column?
--------
(0 rows)


@@ -14,15 +14,16 @@ relhasindex
f
(1 row)
step addk2: ALTER TABLE intra_grant_inplace ADD PRIMARY KEY (c);
step addk2: ALTER TABLE intra_grant_inplace ADD PRIMARY KEY (c); <waiting ...>
step c1: COMMIT;
step addk2: <... completed>
step read2:
SELECT relhasindex FROM pg_class
WHERE oid = 'intra_grant_inplace'::regclass;
relhasindex
-----------
f
t
(1 row)
@@ -58,8 +59,9 @@ relhasindex
f
(1 row)
step addk2: ALTER TABLE intra_grant_inplace ADD PRIMARY KEY (c);
step addk2: ALTER TABLE intra_grant_inplace ADD PRIMARY KEY (c); <waiting ...>
step r3: ROLLBACK;
step addk2: <... completed>
starting permutation: b2 sfnku2 addk2 c2
step b2: BEGIN;
@@ -98,7 +100,7 @@ f
step addk2: ALTER TABLE intra_grant_inplace ADD PRIMARY KEY (c);
step c2: COMMIT;
starting permutation: b3 sfu3 b1 grant1 read2 addk2 r3 c1 read2
starting permutation: b3 sfu3 b1 grant1 read2 as3 addk2 r3 c1 read2
step b3: BEGIN ISOLATION LEVEL READ COMMITTED;
step sfu3:
SELECT relhasindex FROM pg_class
@@ -122,17 +124,19 @@ relhasindex
f
(1 row)
step addk2: ALTER TABLE intra_grant_inplace ADD PRIMARY KEY (c);
step as3: LOCK TABLE intra_grant_inplace IN ACCESS SHARE MODE;
step addk2: ALTER TABLE intra_grant_inplace ADD PRIMARY KEY (c); <waiting ...>
step r3: ROLLBACK;
step grant1: <... completed>
step c1: COMMIT;
step addk2: <... completed>
step read2:
SELECT relhasindex FROM pg_class
WHERE oid = 'intra_grant_inplace'::regclass;
relhasindex
-----------
f
t
(1 row)


@@ -42,5 +42,4 @@ step cmp3 {
}
# XXX extant bug
permutation snap3 b1 grant1 vac2(c1) snap3 c1 cmp3


@@ -48,6 +48,7 @@ step sfu3 {
SELECT relhasindex FROM pg_class
WHERE oid = 'intra_grant_inplace'::regclass FOR UPDATE;
}
step as3 { LOCK TABLE intra_grant_inplace IN ACCESS SHARE MODE; }
step r3 { ROLLBACK; }
# Additional heap_update()
@@ -73,7 +74,7 @@ step keyshr5 {
teardown { ROLLBACK; }
# XXX extant bugs: permutation comments refer to planned post-bugfix behavior
# XXX extant bugs: permutation comments refer to planned future LockTuple()
permutation
b1
@@ -117,6 +118,7 @@ permutation
b1
grant1(r3) # acquire LockTuple(), await sfu3 xmax
read2
as3 # XXX temporary until patch adds locking to addk2
addk2(c1) # block in LockTuple() behind grant1
r3 # unblock grant1; addk2 now awaits grant1 xmax
c1


@@ -40,4 +40,301 @@ step read1:
SELECT reltuples = -1 AS reltuples_unknown
FROM pg_class WHERE oid = 'vactest.orig50'::regclass;
ERROR: could not create unique index "pg_class_oid_index"
reltuples_unknown
-----------------
f
(1 row)
starting permutation: begin2 grant2 vac1 c2 vac3 mkrels3 read1
mkrels
------
(1 row)
injection_points_attach
-----------------------
(1 row)
step begin2: BEGIN;
step grant2: GRANT SELECT ON TABLE vactest.orig50 TO PUBLIC;
step vac1: VACUUM vactest.orig50; -- wait during inplace update <waiting ...>
step c2: COMMIT;
step vac3: VACUUM pg_class;
step mkrels3:
SELECT vactest.mkrels('intruder', 1, 100); -- repopulate LP_UNUSED
SELECT injection_points_detach('inplace-before-pin');
SELECT injection_points_wakeup('inplace-before-pin');
mkrels
------
(1 row)
injection_points_detach
-----------------------
(1 row)
injection_points_wakeup
-----------------------
(1 row)
step vac1: <... completed>
step read1:
REINDEX TABLE pg_class; -- look for duplicates
SELECT reltuples = -1 AS reltuples_unknown
FROM pg_class WHERE oid = 'vactest.orig50'::regclass;
reltuples_unknown
-----------------
f
(1 row)
starting permutation: begin2 grant2 vac1 r2 vac3 mkrels3 read1
mkrels
------
(1 row)
injection_points_attach
-----------------------
(1 row)
step begin2: BEGIN;
step grant2: GRANT SELECT ON TABLE vactest.orig50 TO PUBLIC;
step vac1: VACUUM vactest.orig50; -- wait during inplace update <waiting ...>
step r2: ROLLBACK;
step vac3: VACUUM pg_class;
step mkrels3:
SELECT vactest.mkrels('intruder', 1, 100); -- repopulate LP_UNUSED
SELECT injection_points_detach('inplace-before-pin');
SELECT injection_points_wakeup('inplace-before-pin');
mkrels
------
(1 row)
injection_points_detach
-----------------------
(1 row)
injection_points_wakeup
-----------------------
(1 row)
step vac1: <... completed>
step read1:
REINDEX TABLE pg_class; -- look for duplicates
SELECT reltuples = -1 AS reltuples_unknown
FROM pg_class WHERE oid = 'vactest.orig50'::regclass;
reltuples_unknown
-----------------
f
(1 row)
starting permutation: begin2 grant2 vac1 c2 revoke2 grant2 vac3 mkrels3 read1
mkrels
------
(1 row)
injection_points_attach
-----------------------
(1 row)
step begin2: BEGIN;
step grant2: GRANT SELECT ON TABLE vactest.orig50 TO PUBLIC;
step vac1: VACUUM vactest.orig50; -- wait during inplace update <waiting ...>
step c2: COMMIT;
step revoke2: REVOKE SELECT ON TABLE vactest.orig50 FROM PUBLIC;
step grant2: GRANT SELECT ON TABLE vactest.orig50 TO PUBLIC;
step vac3: VACUUM pg_class;
step mkrels3:
SELECT vactest.mkrels('intruder', 1, 100); -- repopulate LP_UNUSED
SELECT injection_points_detach('inplace-before-pin');
SELECT injection_points_wakeup('inplace-before-pin');
mkrels
------
(1 row)
injection_points_detach
-----------------------
(1 row)
injection_points_wakeup
-----------------------
(1 row)
step vac1: <... completed>
step read1:
REINDEX TABLE pg_class; -- look for duplicates
SELECT reltuples = -1 AS reltuples_unknown
FROM pg_class WHERE oid = 'vactest.orig50'::regclass;
reltuples_unknown
-----------------
f
(1 row)
starting permutation: vac1 begin2 grant2 revoke2 mkrels3 c2 read1
mkrels
------
(1 row)
injection_points_attach
-----------------------
(1 row)
step vac1: VACUUM vactest.orig50; -- wait during inplace update <waiting ...>
step begin2: BEGIN;
step grant2: GRANT SELECT ON TABLE vactest.orig50 TO PUBLIC;
step revoke2: REVOKE SELECT ON TABLE vactest.orig50 FROM PUBLIC;
step mkrels3:
SELECT vactest.mkrels('intruder', 1, 100); -- repopulate LP_UNUSED
SELECT injection_points_detach('inplace-before-pin');
SELECT injection_points_wakeup('inplace-before-pin');
mkrels
------
(1 row)
injection_points_detach
-----------------------
(1 row)
injection_points_wakeup
-----------------------
(1 row)
step c2: COMMIT;
step vac1: <... completed>
step read1:
REINDEX TABLE pg_class; -- look for duplicates
SELECT reltuples = -1 AS reltuples_unknown
FROM pg_class WHERE oid = 'vactest.orig50'::regclass;
reltuples_unknown
-----------------
f
(1 row)
starting permutation: begin2 grant2 vac1 r2 grant2 revoke2 vac3 mkrels3 read1
mkrels
------
(1 row)
injection_points_attach
-----------------------
(1 row)
step begin2: BEGIN;
step grant2: GRANT SELECT ON TABLE vactest.orig50 TO PUBLIC;
step vac1: VACUUM vactest.orig50; -- wait during inplace update <waiting ...>
step r2: ROLLBACK;
step grant2: GRANT SELECT ON TABLE vactest.orig50 TO PUBLIC;
step revoke2: REVOKE SELECT ON TABLE vactest.orig50 FROM PUBLIC;
step vac3: VACUUM pg_class;
step mkrels3:
SELECT vactest.mkrels('intruder', 1, 100); -- repopulate LP_UNUSED
SELECT injection_points_detach('inplace-before-pin');
SELECT injection_points_wakeup('inplace-before-pin');
mkrels
------
(1 row)
injection_points_detach
-----------------------
(1 row)
injection_points_wakeup
-----------------------
(1 row)
step vac1: <... completed>
step read1:
REINDEX TABLE pg_class; -- look for duplicates
SELECT reltuples = -1 AS reltuples_unknown
FROM pg_class WHERE oid = 'vactest.orig50'::regclass;
reltuples_unknown
-----------------
f
(1 row)
starting permutation: begin2 grant2 vac1 c2 revoke2 vac3 mkrels3 read1
mkrels
------
(1 row)
injection_points_attach
-----------------------
(1 row)
step begin2: BEGIN;
step grant2: GRANT SELECT ON TABLE vactest.orig50 TO PUBLIC;
step vac1: VACUUM vactest.orig50; -- wait during inplace update <waiting ...>
step c2: COMMIT;
step revoke2: REVOKE SELECT ON TABLE vactest.orig50 FROM PUBLIC;
step vac3: VACUUM pg_class;
step mkrels3:
SELECT vactest.mkrels('intruder', 1, 100); -- repopulate LP_UNUSED
SELECT injection_points_detach('inplace-before-pin');
SELECT injection_points_wakeup('inplace-before-pin');
mkrels
------
(1 row)
injection_points_detach
-----------------------
(1 row)
injection_points_wakeup
-----------------------
(1 row)
step vac1: <... completed>
step read1:
REINDEX TABLE pg_class; -- look for duplicates
SELECT reltuples = -1 AS reltuples_unknown
FROM pg_class WHERE oid = 'vactest.orig50'::regclass;
reltuples_unknown
-----------------
f
(1 row)
@@ -32,12 +32,9 @@ setup
CREATE TABLE vactest.orig50 ();
SELECT vactest.mkrels('orig', 51, 100);
}
# XXX DROP causes an assertion failure; adopt DROP once fixed
teardown
{
--DROP SCHEMA vactest CASCADE;
DO $$BEGIN EXECUTE 'ALTER SCHEMA vactest RENAME TO schema' || oid FROM pg_namespace where nspname = 'vactest'; END$$;
DROP SCHEMA vactest CASCADE;
DROP EXTENSION injection_points;
}
@@ -56,11 +53,13 @@ step read1 {
FROM pg_class WHERE oid = 'vactest.orig50'::regclass;
}
# Transactional updates of the tuple vac1 is waiting to inplace-update.
session s2
step grant2 { GRANT SELECT ON TABLE vactest.orig50 TO PUBLIC; }
step revoke2 { REVOKE SELECT ON TABLE vactest.orig50 FROM PUBLIC; }
step begin2 { BEGIN; }
step c2 { COMMIT; }
step r2 { ROLLBACK; }
# Non-blocking actions.
session s3
@@ -74,10 +73,69 @@ step mkrels3 {
}
# XXX extant bug
# target gains a successor at the last moment
permutation
vac1(mkrels3) # reads pg_class tuple T0 for vactest.orig50, xmax invalid
grant2 # T0 becomes eligible for pruning, T1 is successor
vac3 # T0 becomes LP_UNUSED
mkrels3 # T0 reused; vac1 wakes and overwrites the reused T0
mkrels3 # vac1 wakes, scans to T1
read1
# target already has a successor, which commits
permutation
begin2
grant2 # T0.t_ctid = T1
vac1(mkrels3) # reads T0 for vactest.orig50
c2 # T0 becomes eligible for pruning
vac3 # T0 becomes LP_UNUSED
mkrels3 # vac1 wakes, scans to T1
read1
# target already has a successor, which becomes LP_UNUSED at the last moment
permutation
begin2
grant2 # T0.t_ctid = T1
vac1(mkrels3) # reads T0 for vactest.orig50
r2 # T1 becomes eligible for pruning
vac3 # T1 becomes LP_UNUSED
mkrels3 # reuse T1; vac1 scans to T0
read1
# target already has a successor, which becomes LP_REDIRECT at the last moment
permutation
begin2
grant2 # T0.t_ctid = T1, non-HOT due to filled page
vac1(mkrels3) # reads T0
c2
revoke2 # HOT update to T2
grant2 # HOT update to T3
vac3 # T1 becomes LP_REDIRECT
mkrels3 # reuse T2; vac1 scans to T3
read1
# waiting for updater to end
permutation
vac1(c2) # reads pg_class tuple T0 for vactest.orig50, xmax invalid
begin2
grant2 # T0.t_ctid = T1, non-HOT due to filled page
revoke2 # HOT update to T2
mkrels3 # vac1 awakes briefly, then waits for s2
c2
read1
# Another LP_UNUSED. This time, do change the live tuple. Final live tuple
# body is identical to original, at a different TID.
permutation
begin2
grant2 # T0.t_ctid = T1, non-HOT due to filled page
vac1(mkrels3) # reads T0
r2 # T1 becomes eligible for pruning
grant2 # T0.t_ctid = T2; T0 becomes eligible for pruning
revoke2 # T2.t_ctid = T3; T2 becomes eligible for pruning
vac3 # T0, T1 & T2 become LP_UNUSED
mkrels3 # reuse T0, T1 & T2; vac1 scans to T3
read1
# Another LP_REDIRECT. Compared to the earlier test, omit the last grant2.
# Hence, final live tuple body is identical to original, at a different TID.
permutation begin2 grant2 vac1(mkrels3) c2 revoke2 vac3 mkrels3 read1
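
Outside the harness, the waiting behavior these permutations exercise can be sketched with two interactive sessions (table name `tab` is hypothetical; the wait happens while VACUUM tries to inplace-update `tab`'s pg_class tuple):

```sql
-- Session 1: transactional pg_class update via heap_update()
BEGIN;
GRANT SELECT ON TABLE tab TO PUBLIC;  -- new pg_class tuple version, uncommitted

-- Session 2: VACUUM needs an inplace update of the same pg_class tuple
VACUUM tab;  -- now blocks, like a DELETE/UPDATE waiting on the uncommitted change

-- Session 1:
COMMIT;  -- session 2's VACUUM wakes and applies its update to the successor tuple
```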