Reduce use of heavyweight locking inside hash AM.
Avoid using LockPage(rel, 0, lockmode) to protect against changes to the
bucket mapping.  Instead, an exclusive buffer content lock is now viewed as
sufficient permission to modify the metapage, and a shared buffer content
lock is used when such modifications need to be prevented.  This more relaxed
locking regimen makes it possible that, while we're busy getting a
heavyweight lock on the bucket we intend to search or insert into, a bucket
split might occur underneath us.  To compensate for that possibility, we use
a loop-and-retry system: release the metapage content lock, acquire the
heavyweight lock on the target bucket, and then reacquire the metapage
content lock and check that the bucket mapping has not changed.  Normally it
hasn't, and we're done.  But if by chance it has, we simply unlock the
metapage, release the heavyweight lock we acquired previously, lock the new
bucket, and loop around again.  Even in the worst case we cannot loop very
many times here, since we don't split the same bucket again until we've split
all the other buckets, and 2^N gets big pretty fast.

This results in greatly improved concurrency, because we're effectively
replacing two lwlock acquire-and-release cycles in exclusive mode (on one of
the lock manager locks) with a single acquire-and-release cycle in shared
mode (on the metapage buffer content lock).  Testing shows that it's still
not quite as good as btree; for that, we'd probably have to find some way of
getting rid of the heavyweight bucket locks as well, which does not appear
straightforward.

Patch by me, review by Jeff Janes.
commit 76837c1507
parent 038f3a0509
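For orientation, the loop-and-retry pattern described in the commit message
looks roughly like the condensed sketch below, assembled from the hunks this
patch adds to _hash_doinsert() and _hash_first().  The _hash_hashkey2bucket()
call and the surrounding variables (rel, metabuf, metap, hashkey) come from
the enclosing functions and are not part of the hunks shown here; error
handling and unrelated setup are omitted.

    Bucket      bucket;
    BlockNumber blkno;
    BlockNumber oldblkno = InvalidBlockNumber;
    bool        retry = false;

    for (;;)
    {
        /* Compute the target bucket number, and convert to block number. */
        bucket = _hash_hashkey2bucket(hashkey,
                                      metap->hashm_maxbucket,
                                      metap->hashm_highmask,
                                      metap->hashm_lowmask);
        blkno = BUCKET_TO_BLKNO(metap, bucket);

        /* Release the metapage content lock, but keep the pin. */
        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

        /*
         * If the previous iteration locked what is still the correct target
         * bucket, we are done.  Otherwise, drop any old lock and lock what
         * now appears to be the correct bucket.
         */
        if (retry)
        {
            if (oldblkno == blkno)
                break;
            _hash_droplock(rel, oldblkno, HASH_SHARE);
        }
        _hash_getlock(rel, blkno, HASH_SHARE);

        /*
         * Reacquire the metapage content lock; a split may have moved the
         * target bucket while we were waiting for the heavyweight lock, in
         * which case the next iteration recomputes and retries.
         */
        _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
        oldblkno = blkno;
        retry = true;
    }

The worst-case number of iterations is bounded, because a given bucket is not
split again until every other bucket has been split once, as noted above.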
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -132,15 +132,6 @@ long-term locking since there is a (small) risk of deadlock, which we must
 be able to detect.  Buffer context locks are used for short-term access
 control to individual pages of the index.
 
-We define the following lmgr locks for a hash index:
-
-LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
-mapping.  A process attempting to enlarge the hash table by splitting a
-bucket must exclusive-lock this lock before modifying the metapage data
-representing the mapping.  Processes intending to access a particular
-bucket must share-lock this lock until they have acquired lock on the
-correct target bucket.
-
 LockPage(rel, page), where page is the page number of a hash bucket page,
 represents the right to split or compact an individual bucket.  A process
 splitting a bucket must exclusive-lock both old and new halves of the
@@ -150,7 +141,10 @@ insertions must share-lock the bucket they are scanning or inserting into.
 (It is okay to allow concurrent scans and insertions.)
 
 The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.
+These are available for possible future refinements.  LockPage(rel, 0)
+is also currently undefined (it was previously used to represent the right
+to modify the hash-code-to-bucket mapping, but it is no longer needed for
+that purpose).
 
 Note that these lock definitions are conceptually distinct from any sort
 of lock on the pages whose numbers they share.  A process must also obtain
@@ -165,9 +159,7 @@ hash index code, since a process holding one of these locks could block
 waiting for an unrelated lock held by another process.  If that process
 then does something that requires exclusive lock on the bucket, we have
 deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.  This also forces the page-zero lock
-to be an lmgr lock, because as we'll see below it is held while attempting
-to acquire a bucket lock, and so it could also participate in a deadlock.
+can be detected and recovered from.
 
 Processes must obtain read (share) buffer context lock on any hash index
 page while reading it, and write (exclusive) lock while modifying it.
@@ -195,24 +187,30 @@ track of available overflow pages.
 
 The reader algorithm is:
 
-    share-lock page 0 (to prevent active split)
-    read/sharelock meta page
-    compute bucket number for target hash key
-    release meta page
-    share-lock bucket page (to prevent split/compact of this bucket)
-    release page 0 share-lock
+    pin meta page and take buffer content lock in shared mode
+    loop:
+        compute bucket number for target hash key
+        release meta page buffer content lock
+        if (correct bucket page is already locked)
+            break
+        release any existing bucket page lock (if a concurrent split happened)
+        take heavyweight bucket lock
+        retake meta page buffer content lock in shared mode
 -- then, per read request:
-    read/sharelock current page of bucket
+    release pin on metapage
+    read current page of bucket and take shared buffer content lock
     step to next page if necessary (no chaining of locks)
     get tuple
-    release current page
+    release buffer content lock and pin on current page
 -- at scan shutdown:
     release bucket share-lock
 
-By holding the page-zero lock until lock on the target bucket is obtained,
-the reader ensures that the target bucket calculation is valid (otherwise
-the bucket might be split before the reader arrives at it, and the target
-entries might go into the new bucket).  Holding the bucket sharelock for
+We can't hold the metapage lock while acquiring a lock on the target bucket,
+because that might result in an undetected deadlock (lwlocks do not participate
+in deadlock detection).  Instead, we relock the metapage after acquiring the
+bucket page lock and check whether the bucket has been split.  If not, we're
+done.  If so, we release our previously-acquired lock and repeat the process
+using the new bucket number.  Holding the bucket sharelock for
 the remainder of the scan prevents the reader's current-tuple pointer from
 being invalidated by splits or compactions.  Notice that the reader's lock
 does not prevent other buckets from being split or compacted.
@@ -229,22 +227,26 @@ as it was before.
 
 The insertion algorithm is rather similar:
 
-    share-lock page 0 (to prevent active split)
-    read/sharelock meta page
-    compute bucket number for target hash key
-    release meta page
-    share-lock bucket page (to prevent split/compact of this bucket)
-    release page 0 share-lock
+    pin meta page and take buffer content lock in shared mode
+    loop:
+        compute bucket number for target hash key
+        release meta page buffer content lock
+        if (correct bucket page is already locked)
+            break
+        release any existing bucket page lock (if a concurrent split happened)
+        take heavyweight bucket lock in shared mode
+        retake meta page buffer content lock in shared mode
 -- (so far same as reader)
-    read/exclusive-lock current page of bucket
+    release pin on metapage
+    pin current page of bucket and take exclusive buffer content lock
     if full, release, read/exclusive-lock next page; repeat as needed
     >> see below if no space in any page of bucket
     insert tuple at appropriate place in page
-    write/release current page
-    release bucket share-lock
-    read/exclusive-lock meta page
+    mark current page dirty and release buffer content lock and pin
+    release heavyweight share-lock
+    pin meta page and take buffer content lock in shared mode
     increment tuple count, decide if split needed
-    write/release meta page
+    mark meta page dirty and release buffer content lock and pin
     done if no split needed, else enter Split algorithm below
 
 To speed searches, the index entries within any individual index page are
@@ -269,26 +271,23 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-    exclusive-lock page 0 (assert the right to begin a split)
-    read/exclusive-lock meta page
+    pin meta page and take buffer content lock in exclusive mode
     check split still needed
-    if split not needed anymore, drop locks and exit
+    if split not needed anymore, drop buffer content lock and pin and exit
     decide which bucket to split
     Attempt to X-lock old bucket number (definitely could fail)
     Attempt to X-lock new bucket number (shouldn't fail, but...)
-    if above fail, drop locks and exit
+    if above fail, drop locks and pin and exit
     update meta page to reflect new number of buckets
-    write/release meta page
-    release X-lock on page 0
+    mark meta page dirty and release buffer content lock and pin
 -- now, accesses to all other buckets can proceed.
     Perform actual split of bucket, moving tuples as needed
     >> see below about acquiring needed extra space
     Release X-locks of old and new buckets
 
-Note the page zero and metapage locks are not held while the actual tuple
-rearrangement is performed, so accesses to other buckets can proceed in
-parallel; in fact, it's possible for multiple bucket splits to proceed
-in parallel.
+Note the metapage lock is not held while the actual tuple rearrangement is
+performed, so accesses to other buckets can proceed in parallel; in fact,
+it's possible for multiple bucket splits to proceed in parallel.
 
 Split's attempt to X-lock the old bucket number could fail if another
 process holds S-lock on it.  We do not want to wait if that happens, first
@@ -316,20 +315,20 @@ go-round.
 The fourth operation is garbage collection (bulk deletion):
 
     next bucket := 0
-    read/sharelock meta page
+    pin metapage and take buffer content lock in exclusive mode
     fetch current max bucket number
-    release meta page
+    release meta page buffer content lock and pin
     while next bucket <= max bucket do
         Acquire X lock on target bucket
         Scan and remove tuples, compact free space as needed
         Release X lock
         next bucket ++
     end loop
-    exclusive-lock meta page
+    pin metapage and take buffer content lock in exclusive mode
     check if number of buckets changed
-    if so, release lock and return to for-each-bucket loop
+    if so, release content lock and pin and return to for-each-bucket loop
     else update metapage tuple count
-    write/release meta page
+    mark meta page dirty and release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits.  If a split occurs,
 tuples relocated into the new bucket will be visited twice by the scan,
@@ -360,25 +359,25 @@ overflow page to the free pool.
 
 Obtaining an overflow page:
 
-    read/exclusive-lock meta page
+    take metapage content lock in exclusive mode
     determine next bitmap page number; if none, exit loop
-    release meta page lock
-    read/exclusive-lock bitmap page
+    release meta page content lock
+    pin bitmap page and take content lock in exclusive mode
     search for a free page (zero bit in bitmap)
     if found:
         set bit in bitmap
-        write/release bitmap page
-        read/exclusive-lock meta page
+        mark bitmap page dirty and release content lock
+        take metapage buffer content lock in exclusive mode
         if first-free-bit value did not change,
-            update it and write meta page
-        release meta page
+            update it and mark meta page dirty
+        release meta page buffer content lock
         return page number
     else (not found):
-        release bitmap page
+        release bitmap page buffer content lock
         loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
     extend index to add another overflow page; update meta information
-    write/release meta page
+    mark meta page dirty and release buffer content lock
     return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -428,17 +427,17 @@ algorithm is:
 
     delink overflow page from bucket chain
     (this requires read/update/write/release of fore and aft siblings)
-    read/share-lock meta page
+    pin meta page and take buffer content lock in shared mode
     determine which bitmap page contains the free space bit for page
-    release meta page
-    read/exclusive-lock bitmap page
+    release meta page buffer content lock
+    pin bitmap page and take buffer content lock in exclusive mode
     update bitmap bit
-    write/release bitmap page
+    mark bitmap page dirty and release buffer content lock and pin
     if page number is less than what we saw as first-free-bit in meta:
-        read/exclusive-lock meta page
+        retake meta page buffer content lock in exclusive mode
         if page number is still less than first-free-bit,
-            update first-free-bit field and write meta page
-        release meta page
+            update first-free-bit field and mark meta page dirty
+        release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
 changing the first-free-bit field (hashm_firstfree).  It is possible that
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -32,6 +32,8 @@ _hash_doinsert(Relation rel, IndexTuple itup)
     Buffer      metabuf;
     HashMetaPage metap;
     BlockNumber blkno;
+    BlockNumber oldblkno = InvalidBlockNumber;
+    bool        retry = false;
     Page        page;
     HashPageOpaque pageopaque;
     Size        itemsz;
@@ -49,12 +51,6 @@ _hash_doinsert(Relation rel, IndexTuple itup)
     itemsz = MAXALIGN(itemsz);  /* be safe, PageAddItem will do this but we
                                  * need to be consistent */
 
-    /*
-     * Acquire shared split lock so we can compute the target bucket safely
-     * (see README).
-     */
-    _hash_getlock(rel, 0, HASH_SHARE);
-
     /* Read the metapage */
     metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
     metap = HashPageGetMeta(BufferGetPage(metabuf));
@@ -74,6 +70,11 @@ _hash_doinsert(Relation rel, IndexTuple itup)
                 (unsigned long) HashMaxItemSize((Page) metap)),
              errhint("Values larger than a buffer page cannot be indexed.")));
 
+    /*
+     * Loop until we get a lock on the correct target bucket.
+     */
+    for (;;)
+    {
     /*
      * Compute the target bucket number, and convert to block number.
      */
@@ -84,15 +85,30 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 
     blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-    /* release lock on metapage, but keep pin since we'll need it again */
+    /* Release metapage lock, but keep pin. */
     _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 
     /*
-     * Acquire share lock on target bucket; then we can release split lock.
+     * If the previous iteration of this loop locked what is still the
+     * correct target bucket, we are done.  Otherwise, drop any old lock
+     * and lock what now appears to be the correct bucket.
      */
+    if (retry)
+    {
+        if (oldblkno == blkno)
+            break;
+        _hash_droplock(rel, oldblkno, HASH_SHARE);
+    }
     _hash_getlock(rel, blkno, HASH_SHARE);
 
-    _hash_droplock(rel, 0, HASH_SHARE);
+    /*
+     * Reacquire metapage lock and check that no bucket split has taken
+     * place while we were awaiting the bucket lock.
+     */
+    _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+    oldblkno = blkno;
+    retry = true;
+    }
 
     /* Fetch the primary bucket page for the bucket */
     buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -57,9 +57,9 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 /*
  * _hash_getlock() -- Acquire an lmgr lock.
  *
- * 'whichlock' should be zero to acquire the split-control lock, or the
- * block number of a bucket's primary bucket page to acquire the per-bucket
- * lock.  (See README for details of the use of these locks.)
+ * 'whichlock' should be the block number of a bucket's primary bucket page
+ * to acquire the per-bucket lock.  (See README for details of the use of
+ * these locks.)
  *
  * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
  */
@@ -507,21 +507,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
     uint32      lowmask;
 
     /*
-     * Obtain the page-zero lock to assert the right to begin a split (see
-     * README).
-     *
-     * Note: deadlock should be impossible here. Our own backend could only be
-     * holding bucket sharelocks due to stopped indexscans; those will not
-     * block other holders of the page-zero lock, who are only interested in
-     * acquiring bucket sharelocks themselves.  Exclusive bucket locks are
-     * only taken here and in hashbulkdelete, and neither of these operations
-     * needs any additional locks to complete.  (If, due to some flaw in this
-     * reasoning, we manage to deadlock anyway, it's okay to error out; the
-     * index will be left in a consistent state.)
+     * Write-lock the meta page.  It used to be necessary to acquire a
+     * heavyweight lock to begin a split, but that is no longer required.
      */
-    _hash_getlock(rel, 0, HASH_EXCLUSIVE);
-
-    /* Write-lock the meta page */
     _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
     _hash_checkpage(rel, metabuf, LH_META_PAGE);
@@ -663,9 +651,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
     /* Write out the metapage and drop lock, but keep pin */
     _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
 
-    /* Release split lock; okay for other splits to occur now */
-    _hash_droplock(rel, 0, HASH_EXCLUSIVE);
-
     /* Relocate records to the new bucket */
     _hash_splitbucket(rel, metabuf, old_bucket, new_bucket,
                       start_oblkno, start_nblkno,
@@ -682,9 +667,6 @@ fail:
 
     /* We didn't write the metapage, so just drop lock */
     _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
-
-    /* Release split lock */
-    _hash_droplock(rel, 0, HASH_EXCLUSIVE);
 }
 
 
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -125,6 +125,8 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
     uint32      hashkey;
     Bucket      bucket;
     BlockNumber blkno;
+    BlockNumber oldblkno = InvalidBlockNumber;
+    bool        retry = false;
     Buffer      buf;
     Buffer      metabuf;
     Page        page;
@@ -184,16 +186,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 
     so->hashso_sk_hash = hashkey;
 
-    /*
-     * Acquire shared split lock so we can compute the target bucket safely
-     * (see README).
-     */
-    _hash_getlock(rel, 0, HASH_SHARE);
-
     /* Read the metapage */
     metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
     metap = HashPageGetMeta(BufferGetPage(metabuf));
 
+    /*
+     * Loop until we get a lock on the correct target bucket.
+     */
+    for (;;)
+    {
     /*
      * Compute the target bucket number, and convert to block number.
      */
@@ -204,15 +205,33 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 
     blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-    /* done with the metapage */
-    _hash_relbuf(rel, metabuf);
+    /* Release metapage lock, but keep pin. */
+    _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 
     /*
-     * Acquire share lock on target bucket; then we can release split lock.
+     * If the previous iteration of this loop locked what is still the
+     * correct target bucket, we are done.  Otherwise, drop any old lock
+     * and lock what now appears to be the correct bucket.
      */
+    if (retry)
+    {
+        if (oldblkno == blkno)
+            break;
+        _hash_droplock(rel, oldblkno, HASH_SHARE);
+    }
     _hash_getlock(rel, blkno, HASH_SHARE);
 
-    _hash_droplock(rel, 0, HASH_SHARE);
+    /*
+     * Reacquire metapage lock and check that no bucket split has taken
+     * place while we were awaiting the bucket lock.
+     */
+    _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+    oldblkno = blkno;
+    retry = true;
+    }
+
+    /* done with the metapage */
+    _hash_dropbuf(rel, metabuf);
 
     /* Update scan opaque state to show we have lock on the bucket */
     so->hashso_bucket = bucket;