Avoid parallel nbtree index scan hangs with SAOPs.

Commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution, made
parallel index scans work with the new design for arrays via explicit
scheduling of primitive index scans.  A backend that successfully
scheduled the scan's next primitive index scan saved its backend local
array keys in shared memory.  Any backend could pick up the scheduled
primitive scan within _bt_first.  This scheme decouples scheduling a
primitive scan from starting the scan (by performing another descent of
the index via a _bt_search call from _bt_first) to make things robust.

The scheme had a deadlock hazard, at least when the leader process
participated in the scan.  _bt_parallel_seize had a code path that made
backends that were not in an immediate position to start a scheduled
primitive index scan wait for some other backend to do so instead.
Under the right circumstances, the leader process could wait here
forever: the leader would wait for any other backend to start the
primitive scan, while every worker was busy waiting on the leader to
consume tuples from the scan's tuple queue.

To fix, don't wait for a scheduled primitive index scan to be started by
some other eligible backend from within _bt_parallel_seize (when the
calling backend isn't in a position to do so itself).  Return false
instead, while recording that the scan has a scheduled primitive index
scan in backend local state.  This leaves the backend in the same state
as the existing case where a backend schedules (or tries to schedule)
another primitive index scan from within _bt_advance_array_keys, before
calling _bt_parallel_seize.  _bt_parallel_seize already handles that
case by returning false without waiting, and without unsetting the
backend local state.  Leaving the backend in this state enables it to
start a previously scheduled primitive index scan once it gets back to
_bt_first.

Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.

Matthias van de Meent, with tweaks by me.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reported-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzmMGaPa32u9x_FvEbPTUkP5e95i=QxR8054nvCRydP-sw@mail.gmail.com
Backpatch: 17-, where nbtree SAOP execution was enhanced.
This commit is contained in:
Peter Geoghegan 2024-09-17 11:10:35 -04:00
parent 89f908a6d0
commit d8adfc18be

View File

@ -584,7 +584,10 @@ btparallelrescan(IndexScanDesc scan)
* or _bt_parallel_done().
*
* The return value is true if we successfully seized the scan and false
* if we did not. The latter case occurs if no pages remain.
* if we did not. The latter case occurs when no pages remain, or when
* another primitive index scan is scheduled that caller's backend cannot
* start just yet (only backends that call from _bt_first are capable of
* starting primitive index scans, which they indicate by passing first=true).
*
* If the return value is true, *pageno returns the next or current page
* of the scan (depending on the scan direction). An invalid block number
@ -595,10 +598,6 @@ btparallelrescan(IndexScanDesc scan)
* scan will return false.
*
* Callers should ignore the value of pageno if the return value is false.
*
* Callers that are in a position to start a new primitive index scan must
* pass first=true (all other callers pass first=false). We just return false
* for first=false callers that require another primitive index scan.
*/
bool
_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
@ -615,13 +614,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
{
/*
* Initialize array related state when called from _bt_first, assuming
* that this will either be the first primitive index scan for the
* scan, or a previous explicitly scheduled primitive scan.
*
* Note: so->needPrimScan is only set when a scheduled primitive index
* scan is set to be performed in caller's worker process. It should
* not be set here by us for the first primitive scan, nor should we
* ever set it for a parallel scan that has no array keys.
* that this will be the first primitive index scan for the scan
*/
so->needPrimScan = false;
so->scanBehind = false;
@ -629,8 +622,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
else
{
/*
* Don't attempt to seize the scan when backend requires another
* primitive index scan unless we're in a position to start it now
* Don't attempt to seize the scan when it requires another primitive
* index scan, since caller's backend cannot start it right now
*/
if (so->needPrimScan)
return false;
@ -652,12 +645,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
{
Assert(so->numArrayKeys);
/*
* If we can start another primitive scan right away, do so.
* Otherwise just wait.
*/
if (first)
{
/* Can start scheduled primitive scan right away, so do so */
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
for (int i = 0; i < so->numArrayKeys; i++)
{
@ -667,11 +657,25 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
array->cur_elem = btscan->btps_arrElems[i];
skey->sk_argument = array->elem_values[array->cur_elem];
}
so->needPrimScan = true;
so->scanBehind = false;
*pageno = InvalidBlockNumber;
exit_loop = true;
}
else
{
/*
* Don't attempt to seize the scan when it requires another
* primitive index scan, since caller's backend cannot start
* it right now
*/
status = false;
}
/*
* Either way, update backend local state to indicate that a
* pending primitive scan is required
*/
so->needPrimScan = true;
so->scanBehind = false;
}
else if (btscan->btps_pageStatus != BTPARALLEL_ADVANCING)
{
@ -730,6 +734,7 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
void
_bt_parallel_done(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
bool status_changed = false;
@ -738,6 +743,13 @@ _bt_parallel_done(IndexScanDesc scan)
if (parallel_scan == NULL)
return;
/*
* Should not mark parallel scan done when there's still a pending
* primitive index scan
*/
if (so->needPrimScan)
return;
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
@ -746,6 +758,7 @@ _bt_parallel_done(IndexScanDesc scan)
* already
*/
SpinLockAcquire(&btscan->btps_mutex);
Assert(btscan->btps_pageStatus != BTPARALLEL_NEED_PRIMSCAN);
if (btscan->btps_pageStatus != BTPARALLEL_DONE)
{
btscan->btps_pageStatus = BTPARALLEL_DONE;