Minor editing for README-SSI.
Fix some grammatical issues, try to clarify a couple of proofs, make the terminology more consistent.
Parent: e2a0cb1a80
Commit: a3290f655e
@@ -3,11 +3,11 @@ src/backend/storage/lmgr/README-SSI
 Serializable Snapshot Isolation (SSI) and Predicate Locking
 ===========================================================
 
-This is currently sitting in the lmgr directory because about 90% of
-the code is an implementation of predicate locking, which is required
-for SSI, rather than being directly related to SSI itself. When
-another use for predicate locking justifies the effort to tease these
-two things apart, this README file should probably be split.
+This code is in the lmgr directory because about 90% of it is an
+implementation of predicate locking, which is required for SSI,
+rather than being directly related to SSI itself. When another use
+for predicate locking justifies the effort to tease these two things
+apart, this README file should probably be split.
 
 
 Credits
@@ -151,11 +151,11 @@ transactions.
 SSI Algorithm
 -------------
 
-Serializable transaction in PostgreSQL are implemented using
+As of 9.1, serializable transactions in PostgreSQL are implemented using
 Serializable Snapshot Isolation (SSI), based on the work of Cahill
 et al. Fundamentally, this allows snapshot isolation to run as it
-has, while monitoring for conditions which could create a serialization
-anomaly.
+previously did, while monitoring for conditions which could create a
+serialization anomaly.
 
 SSI is based on the observation [2] that each snapshot isolation
 anomaly corresponds to a cycle that contains a "dangerous structure"
@@ -168,8 +168,10 @@ SSI works by watching for this dangerous structure, and rolling
 back a transaction when needed to prevent any anomaly. This means it
 only needs to track rw-conflicts between concurrent transactions, not
 wr- and ww-dependencies. It also means there is a risk of false
-positives, because not every dangerous structure corresponds to an
-actual serialization failure.
+positives, because not every dangerous structure is embedded in an
+actual cycle. The number of false positives is low in practice, so
+this represents an acceptable tradeoff for keeping the detection
+overhead low.
 
 The PostgreSQL implementation uses two additional optimizations:
 
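As a rough illustration of the dangerous structure discussed in the hunk above (a pivot transaction with both a rw-conflict in and a rw-conflict out), the following Python sketch flags pivots in a set of rw-conflict edges. The function name `find_pivots` and the (reader, writer) edge representation are assumptions made for this example; they are not taken from the PostgreSQL source.

```python
from collections import defaultdict


def find_pivots(rw_conflicts):
    """Return the transactions that have both a rw-conflict in and out.

    rw_conflicts is an iterable of (reader, writer) pairs, each meaning
    "reader has a rw-conflict out to writer". This representation is a
    hypothetical one chosen for illustration.
    """
    conflicts_in = defaultdict(set)   # writer -> readers pointing at it
    conflicts_out = defaultdict(set)  # reader -> writers it points at
    for reader, writer in rw_conflicts:
        conflicts_out[reader].add(writer)
        conflicts_in[writer].add(reader)
    # A pivot sits in the middle of two consecutive rw-conflict edges.
    return sorted(t for t in conflicts_out if t in conflicts_in)


# T1 -> T2 -> T3 is a dangerous structure with pivot T2, even though
# no edge leads back to T1 and so there is no actual cycle. This is
# the kind of false positive the text describes.
edges = [("T1", "T2"), ("T2", "T3")]
pivots = find_pivots(edges)
```

Only rw-conflict edges are inspected here; wr- and ww-dependencies are never consulted, which is what keeps the detection overhead low at the cost of occasional false positives.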
@@ -182,11 +184,12 @@ The PostgreSQL implementation uses two additional optimizations:
 one. Proof:
 
 - Because there is a cycle, there must be some transaction T0 that
-precedes Tin in the serial order. (T0 might be the same as Tout).
+precedes Tin in the cycle. (T0 might be the same as Tout.)
 
-- The dependency between T0 and Tin can't be a rw-conflict,
+- The edge between T0 and Tin can't be a rw-conflict or ww-dependency,
 because Tin was read-only, so it must be a wr-dependency.
-Those can only occur if T0 committed before Tin started.
+Those can only occur if T0 committed before Tin took its snapshot,
+else Tin would have ignored T0's output.
 
 - Because Tout must commit before any other transaction in the
 cycle, it must commit before T0 commits -- and thus before Tin
@@ -258,8 +261,8 @@ full serializable transactions under either strategy. Practical
 implementations of predicate locking generally involve acquiring
 locks against data as it is accessed, using multiple granularities
 (tuple, page, table, etc.) with escalation as needed to keep the lock
-count to a number which can be tracked within RAM structures, and
-this was used in PostgreSQL. Coarse granularities can cause some
+count to a number which can be tracked within RAM structures. This
+approach was used in PostgreSQL. Coarse granularities can cause some
 false positive indications of conflict. The number of false positives
 can be influenced by plan choice.
 
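The escalation idea in the hunk above can be sketched as follows. This is a minimal Python model, not PostgreSQL's implementation: the class name, the two-level tuple/page scheme, and the `max_tuple_locks_per_page` threshold are all assumptions chosen for illustration.

```python
class PredicateLockTable:
    """Toy model of tuple-to-page lock promotion for one transaction."""

    def __init__(self, max_tuple_locks_per_page=2):
        # Hypothetical threshold; real limits are configured differently.
        self.max_tuple_locks_per_page = max_tuple_locks_per_page
        self.tuple_locks = {}    # (relation, page) -> set of tuple ids
        self.page_locks = set()  # set of (relation, page)

    def lock_tuple(self, relation, page, tuple_id):
        key = (relation, page)
        # A tuple already covered by a page lock needs no lock of its own.
        if key in self.page_locks:
            return
        held = self.tuple_locks.setdefault(key, set())
        held.add(tuple_id)
        # Too many fine-grained locks: replace them with one page lock.
        if len(held) > self.max_tuple_locks_per_page:
            del self.tuple_locks[key]
            self.page_locks.add(key)


table = PredicateLockTable(max_tuple_locks_per_page=2)
for tid in (1, 2, 3):
    table.lock_tuple("accounts", 7, tid)
```

After the third call the three tuple entries are gone and a single page entry remains, so the lock count stays bounded. The coarser lock also covers tuples that were never read, which is one source of the false positives mentioned above.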
@@ -276,7 +279,7 @@ Hellerstein, Stonebraker and Hamilton paper [3], along with the
 locking papers referenced from that and the Cahill papers.
 
 Because the SIREAD locks don't block, traditional locking techniques
-were be modified. Intent locking (locking higher level objects
+have to be modified. Intent locking (locking higher level objects
 before locking lower level objects) doesn't work with non-blocking
 "locks" (which are, in some respects, more like flags than locks).
 
@@ -284,10 +287,10 @@ A configurable amount of shared memory is reserved at postmaster
 start-up to track predicate locks. This size cannot be changed
 without a restart.
 
-* To prevent resource exhaustion, multiple fine-grained locks may
+To prevent resource exhaustion, multiple fine-grained locks may
 be promoted to a single coarser-grained lock as needed.
 
-* An attempt to acquire an SIREAD lock on a tuple when the same
+An attempt to acquire an SIREAD lock on a tuple when the same
 transaction already holds an SIREAD lock on the page or the relation
 will be ignored. Likewise, an attempt to lock a page when the
 relation is locked will be ignored, and the acquisition of a coarser
@@ -306,8 +309,8 @@ Predicate locks will be acquired for the heap based on the following:
 will be locked, whether or not it meets selection criteria; except
 that there is no need to acquire an SIREAD lock on a tuple when the
 transaction already holds a write lock on any tuple representing the
-row, since a rw-dependency would also create a ww-dependency which
-has more aggressive enforcement and will thus prevent any anomaly.
+row, since a rw-conflict would also create a ww-dependency which
+has more aggressive enforcement and thus will prevent any anomaly.
 
 * Modifying a heap tuple creates a rw-conflict with any transaction
 that holds a SIREAD lock on that tuple, or on the page or relation
@@ -341,13 +344,13 @@ need not generate a conflict, although an update which "moves" a row
 into the scan must generate a conflict. While correctness allows
 false positives, they should be minimized for performance reasons.
 
-Several optimizations are possible, though not all implemented yet:
+Several optimizations are possible, though not all are implemented yet:
 
 * An index scan which is just finding the right position for an
-index insertion or deletion needs not acquire a predicate lock.
+index insertion or deletion need not acquire a predicate lock.
 
 * An index scan which is comparing for equality on the entire key
-for a unique index needs not acquire a predicate lock as long as a key
+for a unique index need not acquire a predicate lock as long as a key
 is found corresponding to a visible tuple which has not been modified
 by another transaction -- there are no "between or around" gaps to
 cover.
@@ -362,6 +365,9 @@ x = 1 AND x = 2), then no predicate lock is needed.
 
 Other index AM implementation considerations:
 
+* For an index AM that doesn't have support for predicate locking,
+we just acquire a predicate lock on the whole index for any search.
+
 * B-tree index searches acquire predicate locks only on the
 index *leaf* pages needed to lock the appropriate index range. If,
 however, a search discovers that no root page has yet been created, a
@@ -395,8 +401,8 @@ tracking SIREAD locks.
 any length of time; lock information is written to the tuples
 involved in the transactions.
 * In PostgreSQL, existing lock structures have pointers to
-memory which is related to a connection. SIREAD locks need to persist
-past the end of the originating transaction and even the connection
+memory which is related to a session. SIREAD locks need to persist
+past the end of the originating transaction and even the session
 which ran it.
 * PostgreSQL needs to be able to tolerate a large number of
 transactions executing while one long-running transaction stays open
@@ -411,7 +417,8 @@ isolation level distinct from snapshot isolation.
 in the papers.
 
 5. PostgreSQL doesn't assign a transaction number to a database
-transaction until and unless necessary.
+transaction until and unless necessary (normally, when the transaction
+attempts to modify data).
 
 6. PostgreSQL has pluggable data types with user-definable
 operators, as well as pluggable index types, not all of which are
@@ -453,42 +460,46 @@ versions of the row, based on the following proof that any additional
 serialization failures we would get from that would be false
 positives:
 
-o If transaction T1 reads a row (thus acquiring a predicate
-lock on it) and a second transaction T2 updates that row, must a
-third transaction T3 which updates the new version of the row have a
-rw-conflict in from T1 to prevent anomalies? In other words, does it
-matter whether this edge T1 -> T3 is there?
+o If transaction T1 reads a row version (thus acquiring a
+predicate lock on it) and a second transaction T2 updates that row
+version (thus creating a rw-conflict graph edge from T1 to T2), must a
+third transaction T3 which re-updates the new version of the row also
+have a rw-conflict in from T1 to prevent anomalies? In other words,
+does it matter whether we recognize the edge T1 -> T3?
 
 o If T1 has a conflict in, it certainly doesn't. Adding the
 edge T1 -> T3 would create a dangerous structure, but we already had
-one from the edge T1 -> T2, so we would have aborted something
-anyway.
+one from the edge T1 -> T2, so we would have aborted something anyway.
+(T2 has already committed, else T3 could not have updated its output;
+but we would have aborted either T1 or T1's predecessor(s). Hence
+no cycle involving T1 and T3 can survive.)
 
 o Now let's consider the case where T1 doesn't have a
-conflict in. If that's the case, for this edge T1 -> T3 to make a
-difference, T3 must have a rw-conflict out that induces a cycle in
-the dependency graph, i.e. a conflict out to some transaction
-preceding T1 in the serial order. (A conflict out to T1 would work
-too, but that would mean T1 has a conflict in and we would have
-rolled back.)
+rw-conflict in. If that's the case, for this edge T1 -> T3 to make a
+difference, T3 must have a rw-conflict out that induces a cycle in the
+dependency graph, i.e. a conflict out to some transaction preceding T1
+in the graph. (A conflict out to T1 itself would be problematic too,
+but that would mean T1 has a conflict in, the case we already
+eliminated.)
 
 o So now we're trying to figure out if there can be an
 rw-conflict edge T3 -> T0, where T0 is some transaction that precedes
-T1. For T0 to precede T1, there has to be has to be some edge, or
-sequence of edges, from T0 to T1. At least the last edge has to be a
-wr-dependency or ww-dependency rather than a rw-conflict, because T1
-doesn't have a rw-conflict in. And that gives us enough information
-about the order of transactions to see that T3 can't have a
-rw-dependency to T0:
+T1. For T0 to precede T1, there has to be some edge, or sequence of
+edges, from T0 to T1. At least the last edge has to be a wr-dependency
+or ww-dependency rather than a rw-conflict, because T1 doesn't have a
+rw-conflict in. And that gives us enough information about the order
+of transactions to see that T3 can't have a rw-conflict to T0:
 - T0 committed before T1 started (the wr/ww-dependency implies this)
 - T1 started before T2 committed (the T1->T2 rw-conflict implies this)
-- T2 committed before T3 started (otherwise, T3 would be aborted
+- T2 committed before T3 started (otherwise, T3 would get aborted
 because of an update conflict)
 
 o That means T0 committed before T3 started, and therefore
 there can't be a rw-conflict from T3 to T0.
 
-o In both cases, we didn't need the T1 -> T3 edge.
+o So in all cases, we don't need the T1 -> T3 edge to
+recognize cycles. Therefore it's not necessary for T1's SIREAD lock
+on the original tuple version to cover later versions as well.
 
 * Predicate locking in PostgreSQL starts at the tuple level
 when possible. Multiple fine-grained locks are promoted to a single
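The timing argument in the T1/T2/T3 proof of the hunk above can be written as a chain of inequalities. The notation start(T) and commit(T) is introduced here only for illustration and does not appear in the README itself.

```latex
\begin{align*}
\mathrm{commit}(T_0) &< \mathrm{start}(T_1)
  && \text{(wr/ww-dependency } T_0 \to T_1\text{)} \\
\mathrm{start}(T_1) &< \mathrm{commit}(T_2)
  && \text{(rw-conflict } T_1 \to T_2\text{)} \\
\mathrm{commit}(T_2) &< \mathrm{start}(T_3)
  && \text{(otherwise } T_3 \text{ aborts on an update conflict)} \\
\Longrightarrow\quad \mathrm{commit}(T_0) &< \mathrm{start}(T_3)
\end{align*}
```

A rw-conflict from T3 to T0 would require T3 to have read data without seeing a write by T0, which can only happen if T0 had not yet committed when T3 took its snapshot. The last line rules that out, so the edge T3 -> T0 cannot exist.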
@@ -520,10 +531,12 @@ NULL to indicate no conflict and a self-reference to indicate
 multiple conflicts or conflicts with committed transactions, we use a
 list of rw-conflicts. With the more complete information, false
 positives are reduced and we have sufficient data for more aggressive
-clean-up and other optimizations.
+clean-up and other optimizations:
+
 o We can avoid ever rolling back a transaction until and
 unless there is a pivot where a transaction on the conflict *out*
 side of the pivot committed before either of the other transactions.
+
 o We can avoid ever rolling back a transaction when the
 transaction on the conflict *in* side of the pivot is explicitly or
 implicitly READ ONLY unless the transaction on the conflict *out*
@@ -531,6 +544,7 @@ side of the pivot committed before the READ ONLY transaction acquired
 its snapshot. (An implicit READ ONLY transaction is one which
 committed without writing, even though it was not explicitly declared
 to be READ ONLY.)
+
 o We can more aggressively clean up conflicts, predicate
 locks, and SSI transaction information.
 
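The two rollback-avoidance rules listed in the hunks above can be sketched as a single predicate. This is a simplified Python model under stated assumptions: the `Txn` class, the integer logical timestamps, and the function name `needs_rollback` are hypothetical and do not mirror PostgreSQL's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Txn:
    name: str
    snapshot_time: int                 # logical time the snapshot was taken
    commit_time: Optional[int] = None  # None while still in progress
    read_only: bool = False            # explicitly or implicitly READ ONLY


def needs_rollback(t_in: Txn, pivot: Txn, t_out: Txn) -> bool:
    """Decide whether the structure t_in -> pivot -> t_out forces a rollback."""
    # Rule 1: t_out must have committed before either other transaction.
    if t_out.commit_time is None:
        return False
    for other in (t_in, pivot):
        if other.commit_time is not None and other.commit_time <= t_out.commit_time:
            return False
    # Rule 2: a READ ONLY t_in only matters if t_out committed before
    # t_in acquired its snapshot.
    if t_in.read_only and t_out.commit_time >= t_in.snapshot_time:
        return False
    return True


reader = Txn("Tin", snapshot_time=1, read_only=True)
middle = Txn("Tpivot", snapshot_time=2)
writer = Txn("Tout", snapshot_time=3, commit_time=5)
decision = needs_rollback(reader, middle, writer)
```

Here `decision` is `False`: Tout committed after the read-only Tin took its snapshot, so the second rule lets all three transactions proceed. Checks of this kind are possible because a list of rw-conflicts is kept rather than a single pointer per transaction.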
@@ -543,7 +557,7 @@ overlapping transaction dependencies.
 until the conditions are right for it to start in the "opt out" state
 described above. We add a DEFERRABLE state to transactions, which is
 specified and maintained in a way similar to READ ONLY. It is
-ignored for transactions which are not SERIALIZABLE and READ ONLY.
+ignored for transactions that are not SERIALIZABLE and READ ONLY.
 
 * When a transaction must be rolled back, we pick among the
 active transactions such that an immediate retry will not fail again
@@ -593,8 +607,8 @@ might never be touched, or should we keep adding returned items to
 the end of the available list?
 
 
-Footnotes
----------
+References
+----------
 
 [1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
 Search for serial execution to find the relevant section.