diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index a204ad4af0..9ae596ab23 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -1,68 +1,175 @@ -$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.1.1.1 1996/07/09 06:21:12 scrappy Exp $ +$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.2 2000/07/21 06:42:32 tgl Exp $ This directory contains a correct implementation of Lehman and Yao's -btree management algorithm that supports concurrent access for Postgres. +high-concurrency B-tree management algorithm (P. Lehman and S. Yao, +Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions +on Database Systems, Vol 6, No. 4, December 1981, pp 650-670). + We have made the following changes in order to incorporate their algorithm into Postgres: - + The requirement that all btree keys be unique is too onerous, - but the algorithm won't work correctly without it. As a result, - this implementation adds an OID (guaranteed to be unique) to - every key in the index. This guarantees uniqueness within a set - of duplicates. Space overhead is four bytes. ++ The requirement that all btree keys be unique is too onerous, + but the algorithm won't work correctly without it. Fortunately, it is + only necessary that keys be unique on a single tree level, because L&Y + only use the assumption of key uniqueness when re-finding a key in a + parent node (to determine where to insert the key for a split page). + Therefore, we can use the link field to disambiguate multiple + occurrences of the same user key: only one entry in the parent level + will be pointing at the page we had split. (Indeed we need not look at + the real "key" at all, just at the link field.) We can distinguish + items at the leaf level in the same way, by examining their links to + heap tuples; we'd never have two items for the same heap tuple. 
- For this reason, when we're passed an index tuple to store by the - common access method code, we allocate a larger one and copy the - supplied tuple into it. No Postgres code outside of the btree - access method knows about this xid or sequence number. ++ Lehman and Yao assume that the key range for a subtree S is described + by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent + node. This does not work for nonunique keys (for example, if we have + enough equal keys to spread across several leaf pages, there *must* be + some equal bounding keys in the first level up). Therefore we assume + Ki <= v <= Ki+1 instead. A search that finds exact equality to a + bounding key in an upper tree level must descend to the left of that + key to ensure it finds any equal keys in the preceding page. An + insertion that sees the high key of its target page is equal to the key + to be inserted has a choice whether or not to move right, since the new + key could go on either page. (Currently, we try to find a page where + there is room for the new key without a split.) - + Lehman and Yao don't require read locks, but assume that in- - memory copies of tree nodes are unshared. Postgres shares - in-memory buffers among backends. As a result, we do page- - level read locking on btree nodes in order to guarantee that - no record is modified while we are examining it. This reduces - concurrency but guaranteees correct behavior. ++ Lehman and Yao don't require read locks, but assume that in-memory + copies of tree nodes are unshared. Postgres shares in-memory buffers + among backends. As a result, we do page-level read locking on btree + nodes in order to guarantee that no record is modified while we are + examining it. This reduces concurrency but guaranteees correct + behavior. An advantage is that when trading in a read lock for a + write lock, we need not re-read the page after getting the write lock. 
+ Since we're also holding a pin on the shared buffer containing the + page, we know that buffer still contains the page and is up-to-date. - + Read locks on a page are held for as long as a scan has a pointer - to the page. However, locks are always surrendered before the - sibling page lock is acquired (for readers), so we remain deadlock- - free. I will do a formal proof if I get bored anytime soon. ++ We support the notion of an ordered "scan" of an index as well as + insertions, deletions, and simple lookups. A scan in the forward + direction is no problem, we just use the right-sibling pointers that + L&Y require anyway. (Thus, once we have descended the tree to the + correct start point for the scan, the scan looks only at leaf pages + and never at higher tree levels.) To support scans in the backward + direction, we also store a "left sibling" link much like the "right + sibling". (This adds an extra step to the L&Y split algorithm: while + holding the write lock on the page being split, we also lock its former + right sibling to update that page's left-link. This is safe since no + writer of that page can be interested in acquiring a write lock on our + page.) A backwards scan has one additional bit of complexity: after + following the left-link we must account for the possibility that the + left sibling page got split before we could read it. So, we have to + move right until we find a page whose right-link matches the page we + came from. + ++ Read locks on a page are held for as long as a scan has a pointer + to the page. However, locks are always surrendered before the + sibling page lock is acquired (for readers), so we remain deadlock- + free. I will do a formal proof if I get bored anytime soon. + NOTE: nbtree.c arranges to drop the read lock, but not the buffer pin, + on the current page of a scan before control leaves nbtree. 
When we + come back to resume the scan, we have to re-grab the read lock and + then move right if the current item moved (see _bt_restscan()). + ++ Lehman and Yao fail to discuss what must happen when the root page + becomes full and must be split. Our implementation is to split the + root in the same way that any other page would be split, then construct + a new root page holding pointers to both of the resulting pages (which + now become siblings on level 2 of the tree). The new root page is then + installed by altering the root pointer in the meta-data page (see + below). This works because the root is not treated specially in any + other way --- in particular, searches will move right using its link + pointer if the link is set. Therefore, searches will find the data + that's been moved into the right sibling even if they read the metadata + page before it got updated. This is the same reasoning that makes a + split of a non-root page safe. The locking considerations are similar too. + ++ Lehman and Yao assume fixed-size keys, but we must deal with + variable-size keys. Therefore there is not a fixed maximum number of + keys per page; we just stuff in as many as will fit. When we split a + page, we try to equalize the number of bytes, not items, assigned to + each of the resulting pages. Note we must include the incoming item in + this calculation, otherwise it is possible to find that the incoming + item doesn't fit on the split page where it needs to go! In addition, the following things are handy to know: - + Page zero of every btree is a meta-data page. This page stores - the location of the root page, a pointer to a list of free - pages, and other stuff that's handy to know. ++ Page zero of every btree is a meta-data page. This page stores + the location of the root page, a pointer to a list of free + pages, and other stuff that's handy to know. (Currently, we + never shrink btree indexes so there are never any free pages.) 
- + This algorithm doesn't really work, since it requires ordered - writes, and UNIX doesn't support ordered writes. ++ The algorithm assumes we can fit at least three items per page + (a "high key" and two real data items). Therefore it's unsafe + to accept items larger than 1/3rd page size. Larger items would + work sometimes, but could cause failures later on depending on + what else gets put on their page. - + There's one other case where we may screw up in this - implementation. When we start a scan, we descend the tree - to the key nearest the one in the qual, and once we get there, - position ourselves correctly for the qual type (eg, <, >=, etc). - If we happen to step off a page, decide we want to get back to - it, and fetch the page again, and if some bad person has split - the page and moved the last tuple we saw off of it, then the - code complains about botched concurrency in an elog(WARN, ...) - and gives up the ghost. This is the ONLY violation of Lehman - and Yao's guarantee of correct behavior that I am aware of in - this code. ++ This algorithm doesn't guarantee btree consistency after a kernel crash + or hardware failure. To do that, we'd need ordered writes, and UNIX + doesn't support ordered writes (short of fsync'ing every update, which + is too high a price). Rebuilding corrupted indexes during restart + seems more attractive. + ++ On deletions, we need to adjust the position of active scans on + the index. The code in nbtscan.c handles this. We don't need to + do this for insertions or splits because _bt_restscan can find the + new position of the previously-found item. NOTE that nbtscan.c + only copes with deletions issued by the current backend. This + essentially means that concurrent deletions are not supported, but + that's true already in the Lehman and Yao algorithm. nbtscan.c + exists only to support VACUUM and allow it to delete items while + it's scanning the index. 
+ +Notes about data representation: + ++ The right-sibling link required by L&Y is kept in the page "opaque + data" area, as is the left-sibling link and some flags. + ++ We also keep a parent link in the opaque data, but this link is not + very trustworthy because it is not updated when the parent page splits. + Thus, it points to some page on the parent level, but possibly a page + well to the left of the page's actual current parent. In most cases + we do not need this link at all. Normally we return to a parent page + using a stack of entries that are made as we descend the tree, as in L&Y. + There is exactly one case where the stack will not help: concurrent + root splits. If an inserter process needs to split what had been the + root when it started its descent, but finds that that page is no longer + the root (because someone else split it meanwhile), then it uses the + parent link to move up to the next level. This is OK because we do fix + the parent link in a former root page when splitting it. This logic + will work even if the root is split multiple times (even up to creation + of multiple new levels) before an inserter returns to it. The same + could not be said of finding the new root via the metapage, since that + would work only for a single level of added root. + ++ The Postgres disk block data format (an array of items) doesn't fit + Lehman and Yao's alternating-keys-and-pointers notion of a disk page, + so we have to play some games. + ++ On a page that is not rightmost in its tree level, the "high key" is + kept in the page's first item, and real data items start at item 2. + The link portion of the "high key" item goes unused. A page that is + rightmost has no "high key", so data items start with the first item. + Putting the high key at the left, rather than the right, may seem odd, + but it avoids moving the high key as we add data items. 
+ ++ On a leaf page, the data items are simply links to (TIDs of) tuples + in the relation being indexed, with the associated key values. + ++ On a non-leaf page, the data items are down-links to child pages with + bounding keys. The key in each data item is the *lower* bound for + keys on that child page, so logically the key is to the left of that + downlink. The high key (if present) is the upper bound for the last + downlink. The first data item on each such page has no lower bound + --- or lower bound of minus infinity, if you prefer. The comparison + routines must treat it accordingly. The actual key stored in the + item is irrelevant, and need not be stored at all. This arrangement + corresponds to the fact that an L&Y non-leaf page has one more pointer + than key. Notes to operator class implementors: - With this implementation, we require the user to supply us with - a procedure for pg_amproc. This procedure should take two keys - A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B, - respectively. See the contents of that relation for the btree - access method for some samples. - -Notes to mao for implementation document: - - On deletions, we need to adjust the position of active scans on - the index. The code in nbtscan.c handles this. We don't need to - do this for splits because of the way splits are handled; if they - happen behind us, we'll automatically go to the next page, and if - they happen in front of us, we're not affected by them. For - insertions, if we inserted a tuple behind the current scan location - on the current scan page, we move one space ahead. ++ With this implementation, we require the user to supply us with + a procedure for pg_amproc. This procedure should take two keys + A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B, + respectively. See the contents of that relation for the btree + access method for some samples. 
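Reader's note on the operator class paragraph at the end of the README hunk: the comparison contract it describes (return < 0, 0, or > 0 for A < B, A = B, A > B) can be sketched in plain C as below. This is only an illustration, not part of the patch. The function name `example_int4_cmp` and its plain-C signature are hypothetical; an actual support procedure is registered in pg_amproc and invoked through the backend's function-call interface, which is not shown here.

```c
#include <stdint.h>

/*
 * Hypothetical three-way comparison for 32-bit integer keys.
 * Name and signature are assumptions made for illustration only.
 * Returns a negative value, zero, or a positive value when
 * a < b, a == b, or a > b respectively.
 */
int
example_int4_cmp(int32_t a, int32_t b)
{
	/*
	 * Compare explicitly instead of returning (a - b): the subtraction
	 * can overflow for operands of opposite sign and yield the wrong sign.
	 */
	if (a < b)
		return -1;
	if (a > b)
		return 1;
	return 0;
}
```

Only the sign of the result matters to the caller, so returning -1/0/1 is one valid choice among many. The explicit comparisons are preferred here over subtraction purely to avoid integer overflow.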
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index 7d65c63dc8..6be8e97b50 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.59 2000/06/08 22:36:52 momjian Exp $ + * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.60 2000/07/21 06:42:32 tgl Exp $ * *------------------------------------------------------------------------- */ @@ -19,53 +19,76 @@ #include "access/nbtree.h" -static InsertIndexResult _bt_insertonpg(Relation rel, Buffer buf, BTStack stack, int keysz, ScanKey scankey, BTItem btitem, BTItem afteritem); -static Buffer _bt_split(Relation rel, Size keysz, ScanKey scankey, - Buffer buf, OffsetNumber firstright); -static OffsetNumber _bt_findsplitloc(Relation rel, Size keysz, ScanKey scankey, - Page page, OffsetNumber start, - OffsetNumber maxoff, Size llimit); +typedef struct +{ + /* context data for _bt_checksplitloc */ + Size newitemsz; /* size of new item to be inserted */ + bool non_leaf; /* T if splitting an internal node */ + + bool have_split; /* found a valid split? 
*/ + + /* these fields valid only if have_split is true */ + bool newitemonleft; /* new item on left or right of best split */ + OffsetNumber firstright; /* best split point */ + int best_delta; /* best size delta so far */ +} FindSplitData; + + +static TransactionId _bt_check_unique(Relation rel, BTItem btitem, + Relation heapRel, Buffer buf, + ScanKey itup_scankey); +static InsertIndexResult _bt_insertonpg(Relation rel, Buffer buf, + BTStack stack, + int keysz, ScanKey scankey, + BTItem btitem, + OffsetNumber afteritem); +static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright, + OffsetNumber newitemoff, Size newitemsz, + BTItem newitem, bool newitemonleft, + OffsetNumber *itup_off, BlockNumber *itup_blkno); +static OffsetNumber _bt_findsplitloc(Relation rel, Page page, + OffsetNumber newitemoff, + Size newitemsz, + bool *newitemonleft); +static void _bt_checksplitloc(FindSplitData *state, OffsetNumber firstright, + int leftfree, int rightfree, + bool newitemonleft, Size firstrightitemsz); +static Buffer _bt_getstackbuf(Relation rel, BTStack stack); static void _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf); -static OffsetNumber _bt_pgaddtup(Relation rel, Buffer buf, int keysz, ScanKey itup_scankey, Size itemsize, BTItem btitem, BTItem afteritem); -static bool _bt_goesonpg(Relation rel, Buffer buf, Size keysz, ScanKey scankey, BTItem afteritem); -static void _bt_updateitem(Relation rel, Size keysz, Buffer buf, BTItem oldItem, BTItem newItem); -static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum, int keysz, ScanKey scankey); -static int32 _bt_tuplecompare(Relation rel, Size keysz, ScanKey scankey, - IndexTuple tuple1, IndexTuple tuple2); +static void _bt_pgaddtup(Relation rel, Page page, + Size itemsize, BTItem btitem, + OffsetNumber itup_off, const char *where); +static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum, + int keysz, ScanKey scankey); /* * _bt_doinsert() -- Handle insertion of a 
single btitem in the tree. * * This routine is called by the public interface routines, btbuild - * and btinsert. By here, btitem is filled in, and has a unique - * (xid, seqno) pair. + * and btinsert. By here, btitem is filled in, including the TID. */ InsertIndexResult -_bt_doinsert(Relation rel, BTItem btitem, bool index_is_unique, Relation heapRel) +_bt_doinsert(Relation rel, BTItem btitem, + bool index_is_unique, Relation heapRel) { + IndexTuple itup = &(btitem->bti_itup); + int natts = rel->rd_rel->relnatts; ScanKey itup_scankey; - IndexTuple itup; BTStack stack; Buffer buf; - BlockNumber blkno; - int natts = rel->rd_rel->relnatts; InsertIndexResult res; - Buffer buffer; - - itup = &(btitem->bti_itup); /* we need a scan key to do our search, so build one */ itup_scankey = _bt_mkscankey(rel, itup); +top: /* find the page containing this key */ - stack = _bt_search(rel, natts, itup_scankey, &buf); + stack = _bt_search(rel, natts, itup_scankey, &buf, BT_WRITE); /* trade in our read lock for a write lock */ LockBuffer(buf, BUFFER_LOCK_UNLOCK); LockBuffer(buf, BT_WRITE); -l1: - /* * If the page was split between the time that we surrendered our read * lock and acquired our write lock, then this page may no longer be @@ -73,141 +96,31 @@ l1: * need to move right in the tree. See Lehman and Yao for an * excruciatingly precise description. */ - buf = _bt_moveright(rel, buf, natts, itup_scankey, BT_WRITE); - blkno = BufferGetBlockNumber(buf); - /* if we're not allowing duplicates, make sure the key isn't */ - /* already in the node */ + /* + * If we're not allowing duplicates, make sure the key isn't + * already in the index. 
XXX this belongs somewhere else, likely + */ if (index_is_unique) { - OffsetNumber offset, - maxoff; - Page page; + TransactionId xwait; - page = BufferGetPage(buf); - maxoff = PageGetMaxOffsetNumber(page); + xwait = _bt_check_unique(rel, btitem, heapRel, buf, itup_scankey); - offset = _bt_binsrch(rel, buf, natts, itup_scankey, BT_DESCENT); - - /* make sure the offset we're given points to an actual */ - /* key on the page before trying to compare it */ - if (!PageIsEmpty(page) && offset <= maxoff) + if (TransactionIdIsValid(xwait)) { - TupleDesc itupdesc; - BTItem cbti; - HeapTupleData htup; - BTPageOpaque opaque; - Buffer nbuf; - BlockNumber nblkno; - bool chtup = true; - - itupdesc = RelationGetDescr(rel); - nbuf = InvalidBuffer; - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - - /* - * _bt_compare returns 0 for (1,NULL) and (1,NULL) - this's - * how we handling NULLs - and so we must not use _bt_compare - * in real comparison, but only for ordering/finding items on - * pages. - vadim 03/24/97 - * - * while ( !_bt_compare (rel, itupdesc, page, natts, - * itup_scankey, offset) ) - */ - while (_bt_isequal(itupdesc, page, offset, natts, itup_scankey)) - { /* they're equal */ - - /* - * Have to check is inserted heap tuple deleted one (i.e. - * just moved to another place by vacuum)! - */ - if (chtup) - { - htup.t_self = btitem->bti_itup.t_tid; - heap_fetch(heapRel, SnapshotDirty, &htup, &buffer); - if (htup.t_data == NULL) /* YES! */ - break; - /* Live tuple was inserted */ - ReleaseBuffer(buffer); - chtup = false; - } - cbti = (BTItem) PageGetItem(page, PageGetItemId(page, offset)); - htup.t_self = cbti->bti_itup.t_tid; - heap_fetch(heapRel, SnapshotDirty, &htup, &buffer); - if (htup.t_data != NULL) /* it is a duplicate */ - { - TransactionId xwait = - (TransactionIdIsValid(SnapshotDirty->xmin)) ? - SnapshotDirty->xmin : SnapshotDirty->xmax; - - /* - * If this tuple is being updated by other transaction - * then we have to wait for its commit/abort. 
- */ - ReleaseBuffer(buffer); - if (TransactionIdIsValid(xwait)) - { - if (nbuf != InvalidBuffer) - _bt_relbuf(rel, nbuf, BT_READ); - _bt_relbuf(rel, buf, BT_WRITE); - XactLockTableWait(xwait); - buf = _bt_getbuf(rel, blkno, BT_WRITE); - goto l1;/* continue from the begin */ - } - elog(ERROR, "Cannot insert a duplicate key into unique index %s", RelationGetRelationName(rel)); - } - /* htup null so no buffer to release */ - /* get next offnum */ - if (offset < maxoff) - offset = OffsetNumberNext(offset); - else - { /* move right ? */ - if (P_RIGHTMOST(opaque)) - break; - if (!_bt_isequal(itupdesc, page, P_HIKEY, - natts, itup_scankey)) - break; - - /* - * min key of the right page is the same, ooh - so - * many dead duplicates... - */ - nblkno = opaque->btpo_next; - if (nbuf != InvalidBuffer) - _bt_relbuf(rel, nbuf, BT_READ); - for (nbuf = InvalidBuffer;;) - { - nbuf = _bt_getbuf(rel, nblkno, BT_READ); - page = BufferGetPage(nbuf); - maxoff = PageGetMaxOffsetNumber(page); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - offset = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - if (!PageIsEmpty(page) && offset <= maxoff) - { /* Found some key */ - break; - } - else - { /* Empty or "pseudo"-empty page - get next */ - nblkno = opaque->btpo_next; - _bt_relbuf(rel, nbuf, BT_READ); - nbuf = InvalidBuffer; - if (nblkno == P_NONE) - break; - } - } - if (nbuf == InvalidBuffer) - break; - } - } - if (nbuf != InvalidBuffer) - _bt_relbuf(rel, nbuf, BT_READ); + /* Have to wait for the other guy ... */ + _bt_relbuf(rel, buf, BT_WRITE); + XactLockTableWait(xwait); + /* start over... 
*/ + _bt_freestack(stack); + goto top; } } /* do the insertion */ - res = _bt_insertonpg(rel, buf, stack, natts, itup_scankey, - btitem, (BTItem) NULL); + res = _bt_insertonpg(rel, buf, stack, natts, itup_scankey, btitem, 0); /* be tidy */ _bt_freestack(stack); @@ -217,32 +130,178 @@ l1: } /* + * _bt_check_unique() -- Check for violation of unique index constraint + * + * Returns NullTransactionId if there is no conflict, else an xact ID we + * must wait for to see if it commits a conflicting tuple. If an actual + * conflict is detected, no return --- just elog(). + */ +static TransactionId +_bt_check_unique(Relation rel, BTItem btitem, Relation heapRel, + Buffer buf, ScanKey itup_scankey) +{ + TupleDesc itupdesc = RelationGetDescr(rel); + int natts = rel->rd_rel->relnatts; + OffsetNumber offset, + maxoff; + Page page; + BTPageOpaque opaque; + Buffer nbuf = InvalidBuffer; + bool chtup = true; + + page = BufferGetPage(buf); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + maxoff = PageGetMaxOffsetNumber(page); + + /* + * Find first item >= proposed new item. Note we could also get + * a pointer to end-of-page here. + */ + offset = _bt_binsrch(rel, buf, natts, itup_scankey); + + /* + * Scan over all equal tuples, looking for live conflicts. + */ + for (;;) + { + HeapTupleData htup; + Buffer buffer; + BTItem cbti; + BlockNumber nblkno; + + /* + * _bt_compare returns 0 for (1,NULL) and (1,NULL) - this's + * how we handling NULLs - and so we must not use _bt_compare + * in real comparison, but only for ordering/finding items on + * pages. - vadim 03/24/97 + * + * make sure the offset points to an actual key + * before trying to compare it... + */ + if (offset <= maxoff) + { + if (! _bt_isequal(itupdesc, page, offset, natts, itup_scankey)) + break; /* we're past all the equal tuples */ + + /* + * Have to check is inserted heap tuple deleted one (i.e. + * just moved to another place by vacuum)! 
We only need to + * do this once, but don't want to do it at all unless + * we see equal tuples, so as not to slow down unequal case. + */ + if (chtup) + { + htup.t_self = btitem->bti_itup.t_tid; + heap_fetch(heapRel, SnapshotDirty, &htup, &buffer); + if (htup.t_data == NULL) /* YES! */ + break; + /* Live tuple is being inserted, so continue checking */ + ReleaseBuffer(buffer); + chtup = false; + } + + cbti = (BTItem) PageGetItem(page, PageGetItemId(page, offset)); + htup.t_self = cbti->bti_itup.t_tid; + heap_fetch(heapRel, SnapshotDirty, &htup, &buffer); + if (htup.t_data != NULL) /* it is a duplicate */ + { + TransactionId xwait = + (TransactionIdIsValid(SnapshotDirty->xmin)) ? + SnapshotDirty->xmin : SnapshotDirty->xmax; + + /* + * If this tuple is being updated by other transaction + * then we have to wait for its commit/abort. + */ + ReleaseBuffer(buffer); + if (TransactionIdIsValid(xwait)) + { + if (nbuf != InvalidBuffer) + _bt_relbuf(rel, nbuf, BT_READ); + /* Tell _bt_doinsert to wait... */ + return xwait; + } + /* + * Otherwise we have a definite conflict. + */ + elog(ERROR, "Cannot insert a duplicate key into unique index %s", + RelationGetRelationName(rel)); + } + /* htup null so no buffer to release */ + } + + /* + * Advance to next tuple to continue checking. 
+ */ + if (offset < maxoff) + offset = OffsetNumberNext(offset); + else + { + /* If scankey == hikey we gotta check the next page too */ + if (P_RIGHTMOST(opaque)) + break; + if (!_bt_isequal(itupdesc, page, P_HIKEY, + natts, itup_scankey)) + break; + nblkno = opaque->btpo_next; + if (nbuf != InvalidBuffer) + _bt_relbuf(rel, nbuf, BT_READ); + nbuf = _bt_getbuf(rel, nblkno, BT_READ); + page = BufferGetPage(nbuf); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + maxoff = PageGetMaxOffsetNumber(page); + offset = P_FIRSTDATAKEY(opaque); + } + } + + if (nbuf != InvalidBuffer) + _bt_relbuf(rel, nbuf, BT_READ); + + return NullTransactionId; +} + +/*---------- * _bt_insertonpg() -- Insert a tuple on a particular page in the index. * * This recursive procedure does the following things: * - * + if necessary, splits the target page. - * + finds the right place to insert the tuple (taking into - * account any changes induced by a split). + * + finds the right place to insert the tuple. + * + if necessary, splits the target page (making sure that the + * split is equitable as far as post-insert free space goes). * + inserts the tuple. * + if the page was split, pops the parent stack, and finds the * right place to insert the new child pointer (by walking * right using information stored in the parent stack). - * + invoking itself with the appropriate tuple for the right + * + invokes itself with the appropriate tuple for the right * child page on the parent. * * On entry, we must have the right buffer on which to do the * insertion, and the buffer must be pinned and locked. On return, * we will have dropped both the pin and the write lock on the buffer. * + * If 'afteritem' is >0 then the new tuple must be inserted after the + * existing item of that number, noplace else. If 'afteritem' is 0 + * then the procedure finds the exact spot to insert it by searching. + * (keysz and scankey parameters are used ONLY if afteritem == 0.) 
+ * + * NOTE: if the new key is equal to one or more existing keys, we can + * legitimately place it anywhere in the series of equal keys --- in fact, + * if the new key is equal to the page's "high key" we can place it on + * the next page. If it is equal to the high key, and there's not room + * to insert the new tuple on the current page without splitting, then + * we move right hoping to find more free space and avoid a split. + * Ordinarily, though, we'll insert it before the existing equal keys + * because of the way _bt_binsrch() works. + * * The locking interactions in this code are critical. You should * grok Lehman and Yao's paper before making any changes. In addition, * you need to understand how we disambiguate duplicate keys in this * implementation, in order to be able to find our location using * L&Y "move right" operations. Since we may insert duplicate user - * keys, and since these dups may propogate up the tree, we use the + * keys, and since these dups may propagate up the tree, we use the * 'afteritem' parameter to position ourselves correctly for the * insertion on internal pages. + *---------- */ static InsertIndexResult _bt_insertonpg(Relation rel, @@ -251,17 +310,16 @@ _bt_insertonpg(Relation rel, int keysz, ScanKey scankey, BTItem btitem, - BTItem afteritem) + OffsetNumber afteritem) { InsertIndexResult res; Page page; BTPageOpaque lpageop; - BlockNumber itup_blkno; OffsetNumber itup_off; + BlockNumber itup_blkno; + OffsetNumber newitemoff; OffsetNumber firstright = InvalidOffsetNumber; Size itemsz; - bool do_split = false; - bool keys_equal = false; page = BufferGetPage(buf); lpageop = (BTPageOpaque) PageGetSpecialPointer(page); @@ -285,355 +343,117 @@ _bt_insertonpg(Relation rel, (PageGetPageSize(page) - sizeof(PageHeaderData) - MAXALIGN(sizeof(BTPageOpaqueData))) /3 - sizeof(ItemIdData)); /* - * If we have to insert item on the leftmost page which is the first - * page in the chain of duplicates then: 1. if scankey == hikey (i.e. 
- * - new duplicate item) then insert it here; 2. if scankey < hikey - * then: 2.a if there is duplicate key(s) here - we force splitting; - * 2.b else - we may "eat" this page from duplicates chain. + * Determine exactly where new item will go. */ - if (lpageop->btpo_flags & BTP_CHAIN) + if (afteritem > 0) { - OffsetNumber maxoff = PageGetMaxOffsetNumber(page); - ItemId hitemid; - BTItem hitem; - - Assert(!P_RIGHTMOST(lpageop)); - hitemid = PageGetItemId(page, P_HIKEY); - hitem = (BTItem) PageGetItem(page, hitemid); - if (maxoff > P_HIKEY && - !_bt_itemcmp(rel, keysz, scankey, hitem, - (BTItem) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY)), - BTEqualStrategyNumber)) - elog(FATAL, "btree: bad key on the page in the chain of duplicates"); - - if (!_bt_skeycmp(rel, keysz, scankey, page, hitemid, - BTEqualStrategyNumber)) - { - if (!P_LEFTMOST(lpageop)) - elog(FATAL, "btree: attempt to insert bad key on the non-leftmost page in the chain of duplicates"); - if (!_bt_skeycmp(rel, keysz, scankey, page, hitemid, - BTLessStrategyNumber)) - elog(FATAL, "btree: attempt to insert higher key on the leftmost page in the chain of duplicates"); - if (maxoff > P_HIKEY) /* have duplicate(s) */ - { - firstright = P_FIRSTKEY; - do_split = true; - } - else -/* "eat" page */ - { - Buffer pbuf; - Page ppage; - - itup_blkno = BufferGetBlockNumber(buf); - itup_off = PageAddItem(page, (Item) btitem, itemsz, - P_FIRSTKEY, LP_USED); - if (itup_off == InvalidOffsetNumber) - elog(FATAL, "btree: failed to add item"); - lpageop->btpo_flags &= ~BTP_CHAIN; - pbuf = _bt_getstackbuf(rel, stack, BT_WRITE); - ppage = BufferGetPage(pbuf); - PageIndexTupleDelete(ppage, stack->bts_offset); - pfree(stack->bts_btitem); - stack->bts_btitem = _bt_formitem(&(btitem->bti_itup)); - ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), - itup_blkno, P_HIKEY); - _bt_wrtbuf(rel, buf); - res = _bt_insertonpg(rel, pbuf, stack->bts_parent, - keysz, scankey, stack->bts_btitem, - NULL); - 
ItemPointerSet(&(res->pointerData), itup_blkno, itup_off); - return res; - } - } - else - { - keys_equal = true; - if (PageGetFreeSpace(page) < itemsz) - do_split = true; - } + newitemoff = afteritem + 1; } - else if (PageGetFreeSpace(page) < itemsz) - do_split = true; - else if (PageGetFreeSpace(page) < 3 * itemsz + 2 * sizeof(ItemIdData)) + else { - OffsetNumber offnum = (P_RIGHTMOST(lpageop)) ? P_HIKEY : P_FIRSTKEY; - OffsetNumber maxoff = PageGetMaxOffsetNumber(page); - - if (offnum < maxoff) /* can't split unless at least 2 items... */ + /* + * If we will need to split the page to put the item here, + * check whether we can put the tuple somewhere to the right, + * instead. Keep scanning until we find enough free space or + * reach the last page where the tuple can legally go. + */ + while (PageGetFreeSpace(page) < itemsz && + !P_RIGHTMOST(lpageop) && + _bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0) { - ItemId itid; - BTItem previtem, - chkitem; - Size maxsize; - Size currsize; + /* step right one page */ + BlockNumber rblkno = lpageop->btpo_next; - /* find largest group of identically-keyed items on page */ - itid = PageGetItemId(page, offnum); - previtem = (BTItem) PageGetItem(page, itid); - maxsize = currsize = (ItemIdGetLength(itid) + sizeof(ItemIdData)); - for (offnum = OffsetNumberNext(offnum); - offnum <= maxoff; offnum = OffsetNumberNext(offnum)) - { - itid = PageGetItemId(page, offnum); - chkitem = (BTItem) PageGetItem(page, itid); - if (!_bt_itemcmp(rel, keysz, scankey, - previtem, chkitem, - BTEqualStrategyNumber)) - { - if (currsize > maxsize) - maxsize = currsize; - currsize = 0; - previtem = chkitem; - } - currsize += (ItemIdGetLength(itid) + sizeof(ItemIdData)); - } - if (currsize > maxsize) - maxsize = currsize; - /* Decide to split if largest group is > 1/2 page size */ - maxsize += sizeof(PageHeaderData) + - MAXALIGN(sizeof(BTPageOpaqueData)); - if (maxsize >= PageGetPageSize(page) / 2) - do_split = true; + _bt_relbuf(rel, buf, 
BT_WRITE); + buf = _bt_getbuf(rel, rblkno, BT_WRITE); + page = BufferGetPage(buf); + lpageop = (BTPageOpaque) PageGetSpecialPointer(page); } + /* + * This is it, so find the position... + */ + newitemoff = _bt_binsrch(rel, buf, keysz, scankey); } - if (do_split) + /* + * Do we need to split the page to fit the item on it? + */ + if (PageGetFreeSpace(page) < itemsz) { Buffer rbuf; - Page rpage; - BTItem ritem; - BlockNumber rbknum; - BTPageOpaque rpageop; - Buffer pbuf; - Page ppage; - BTPageOpaque ppageop; BlockNumber bknum = BufferGetBlockNumber(buf); - BTItem lowLeftItem; - OffsetNumber maxoff; - bool shifted = false; - bool left_chained = (lpageop->btpo_flags & BTP_CHAIN) ? true : false; - bool is_root = lpageop->btpo_flags & BTP_ROOT; + BlockNumber rbknum; + bool is_root = P_ISROOT(lpageop); + bool newitemonleft; - /* - * Instead of splitting leaf page in the chain of duplicates by - * new duplicate, insert it into some right page. - */ - if ((lpageop->btpo_flags & BTP_CHAIN) && - (lpageop->btpo_flags & BTP_LEAF) && keys_equal) - { - rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE); - rpage = BufferGetPage(rbuf); - rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage); - - /* - * some checks - */ - if (!P_RIGHTMOST(rpageop)) /* non-rightmost page */ - { /* If we have the same hikey here then - * it's yet another page in chain. 
*/ - if (_bt_skeycmp(rel, keysz, scankey, rpage, - PageGetItemId(rpage, P_HIKEY), - BTEqualStrategyNumber)) - { - if (!(rpageop->btpo_flags & BTP_CHAIN)) - elog(FATAL, "btree: lost page in the chain of duplicates"); - } - else if (_bt_skeycmp(rel, keysz, scankey, rpage, - PageGetItemId(rpage, P_HIKEY), - BTGreaterStrategyNumber)) - elog(FATAL, "btree: hikey is out of order"); - else if (rpageop->btpo_flags & BTP_CHAIN) - - /* - * If hikey > scankey then it's last page in chain and - * BTP_CHAIN must be OFF - */ - elog(FATAL, "btree: lost last page in the chain of duplicates"); - } - else -/* rightmost page */ - Assert(!(rpageop->btpo_flags & BTP_CHAIN)); - _bt_relbuf(rel, buf, BT_WRITE); - return (_bt_insertonpg(rel, rbuf, stack, keysz, - scankey, btitem, afteritem)); - } - - /* - * If after splitting un-chained page we'll got chain of pages - * with duplicates then we want to know 1. on which of two pages - * new btitem will go (current _bt_findsplitloc is quite bad); 2. - * what parent (if there's one) thinking about it (remember about - * deletions) - */ - else if (!(lpageop->btpo_flags & BTP_CHAIN)) - { - OffsetNumber start = (P_RIGHTMOST(lpageop)) ? P_HIKEY : P_FIRSTKEY; - Size llimit; - - maxoff = PageGetMaxOffsetNumber(page); - llimit = PageGetPageSize(page) - sizeof(PageHeaderData) - - MAXALIGN(sizeof(BTPageOpaqueData)) - +sizeof(ItemIdData); - llimit /= 2; - firstright = _bt_findsplitloc(rel, keysz, scankey, - page, start, maxoff, llimit); - - if (_bt_itemcmp(rel, keysz, scankey, - (BTItem) PageGetItem(page, PageGetItemId(page, start)), - (BTItem) PageGetItem(page, PageGetItemId(page, firstright)), - BTEqualStrategyNumber)) - { - if (_bt_skeycmp(rel, keysz, scankey, page, - PageGetItemId(page, firstright), - BTLessStrategyNumber)) - - /* - * force moving current items to the new page: new - * item will go on the current page. 
- */ - firstright = start; - else - - /* - * new btitem >= firstright, start item == firstright - * - new chain of duplicates: if this non-leftmost - * leaf page and parent item < start item then force - * moving all items to the new page - current page - * will be "empty" after it. - */ - { - if (!P_LEFTMOST(lpageop) && - (lpageop->btpo_flags & BTP_LEAF)) - { - ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), - bknum, P_HIKEY); - pbuf = _bt_getstackbuf(rel, stack, BT_WRITE); - if (_bt_itemcmp(rel, keysz, scankey, - stack->bts_btitem, - (BTItem) PageGetItem(page, - PageGetItemId(page, start)), - BTLessStrategyNumber)) - { - firstright = start; - shifted = true; - } - _bt_relbuf(rel, pbuf, BT_WRITE); - } - } - } /* else - no new chain if start item < - * firstright one */ - } + /* Choose the split point */ + firstright = _bt_findsplitloc(rel, page, + newitemoff, itemsz, + &newitemonleft); /* split the buffer into left and right halves */ - rbuf = _bt_split(rel, keysz, scankey, buf, firstright); + rbuf = _bt_split(rel, buf, firstright, + newitemoff, itemsz, btitem, newitemonleft, + &itup_off, &itup_blkno); - /* which new page (left half or right half) gets the tuple? 
*/ - if (_bt_goesonpg(rel, buf, keysz, scankey, afteritem)) - { - /* left page */ - itup_off = _bt_pgaddtup(rel, buf, keysz, scankey, - itemsz, btitem, afteritem); - itup_blkno = BufferGetBlockNumber(buf); - } - else - { - /* right page */ - itup_off = _bt_pgaddtup(rel, rbuf, keysz, scankey, - itemsz, btitem, afteritem); - itup_blkno = BufferGetBlockNumber(rbuf); - } - - maxoff = PageGetMaxOffsetNumber(page); - if (shifted) - { - if (maxoff > P_FIRSTKEY) - elog(FATAL, "btree: shifted page is not empty"); - lowLeftItem = (BTItem) NULL; - } - else - { - if (maxoff < P_FIRSTKEY) - elog(FATAL, "btree: un-shifted page is empty"); - lowLeftItem = (BTItem) PageGetItem(page, - PageGetItemId(page, P_FIRSTKEY)); - if (_bt_itemcmp(rel, keysz, scankey, lowLeftItem, - (BTItem) PageGetItem(page, PageGetItemId(page, P_HIKEY)), - BTEqualStrategyNumber)) - lpageop->btpo_flags |= BTP_CHAIN; - } - - /* + /*---------- * By here, * - * + our target page has been split; + the original tuple has been - * inserted; + we have write locks on both the old (left half) - * and new (right half) buffers, after the split; and + we have - * the key we want to insert into the parent. + * + our target page has been split; + * + the original tuple has been inserted; + * + we have write locks on both the old (left half) + * and new (right half) buffers, after the split; and + * + we know the key we want to insert into the parent + * (it's the "high key" on the left child page). * - * Do the parent insertion. We need to hold onto the locks for the - * child pages until we locate the parent, but we can release them - * before doing the actual insertion (see Lehman and Yao for the - * reasoning). + * We're ready to do the parent insertion. We need to hold onto the + * locks for the child pages until we locate the parent, but we can + * release them before doing the actual insertion (see Lehman and Yao + * for the reasoning). 
+ * + * Here we have to do something Lehman and Yao don't talk about: + * deal with a root split and construction of a new root. If our + * stack is empty then we have just split a node on what had been + * the root level when we descended the tree. If it is still the + * root then we perform a new-root construction. If it *wasn't* + * the root anymore, use the parent pointer to get up to the root + * level that someone constructed meanwhile, and find the right + * place to insert as for the normal case. + *---------- */ -l_spl: ; - if (stack == (BTStack) NULL) + if (is_root) { - if (!is_root) /* if this page was not root page */ - { - elog(DEBUG, "btree: concurrent ROOT page split"); - stack = (BTStack) palloc(sizeof(BTStackData)); - stack->bts_blkno = lpageop->btpo_parent; - stack->bts_offset = InvalidOffsetNumber; - stack->bts_btitem = (BTItem) palloc(sizeof(BTItemData)); - /* bts_btitem will be initialized below */ - stack->bts_parent = NULL; - goto l_spl; - } + Assert(stack == (BTStack) NULL); /* create a new root node and release the split buffers */ _bt_newroot(rel, buf, rbuf); } else { - ScanKey newskey; InsertIndexResult newres; BTItem new_item; - OffsetNumber upditem_offset = P_HIKEY; - bool do_update = false; - bool update_in_place = true; - bool parent_chained; + BTStackData fakestack; + BTItem ritem; + Buffer pbuf; - /* form a index tuple that points at the new right page */ - rbknum = BufferGetBlockNumber(rbuf); - rpage = BufferGetPage(rbuf); - rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage); - - /* - * By convention, the first entry (1) on every non-rightmost - * page is the high key for that page. In order to get the - * lowest key on the new right page, we actually look at its - * second (2) entry. 
- */ - - if (!P_RIGHTMOST(rpageop)) + /* Set up a phony stack entry if we haven't got a real one */ + if (stack == (BTStack) NULL) { - ritem = (BTItem) PageGetItem(rpage, - PageGetItemId(rpage, P_FIRSTKEY)); - if (_bt_itemcmp(rel, keysz, scankey, - ritem, - (BTItem) PageGetItem(rpage, - PageGetItemId(rpage, P_HIKEY)), - BTEqualStrategyNumber)) - rpageop->btpo_flags |= BTP_CHAIN; + elog(DEBUG, "btree: concurrent ROOT page split"); + stack = &fakestack; + stack->bts_blkno = lpageop->btpo_parent; + stack->bts_offset = InvalidOffsetNumber; + /* bts_btitem will be initialized below */ + stack->bts_parent = NULL; } - else - ritem = (BTItem) PageGetItem(rpage, - PageGetItemId(rpage, P_HIKEY)); - /* get a unique btitem for this key */ + /* get high key from left page == lowest key on new right page */ + ritem = (BTItem) PageGetItem(page, + PageGetItemId(page, P_HIKEY)); + + /* form an index tuple that points at the new right page */ new_item = _bt_formitem(&(ritem->bti_itup)); - + rbknum = BufferGetBlockNumber(rbuf); ItemPointerSet(&(new_item->bti_itup.t_tid), rbknum, P_HIKEY); /* @@ -642,192 +462,39 @@ l_spl: ; * Oops - if we were moved right then we need to change stack * item! We want to find parent pointing to where we are, * right ? - vadim 05/27/97 - */ - ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), - bknum, P_HIKEY); - pbuf = _bt_getstackbuf(rel, stack, BT_WRITE); - ppage = BufferGetPage(pbuf); - ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage); - parent_chained = ((ppageop->btpo_flags & BTP_CHAIN)) ? true : false; - - if (parent_chained && !left_chained) - elog(FATAL, "nbtree: unexpected chained parent of unchained page"); - - /* - * If the key of new_item is < than the key of the item in the - * parent page pointing to the left page (stack->bts_btitem), - * we have to update the latter key; otherwise the keys on the - * parent page wouldn't be monotonically increasing after we - * inserted the new pointer to the right page (new_item). 
This - * only happens if our left page is the leftmost page and a - * new minimum key had been inserted before, which is not - * reflected in the parent page but didn't matter so far. If - * there are duplicate keys and this new minimum key spills - * over to our new right page, we get an inconsistency if we - * don't update the left key in the parent page. * - * Also, new duplicates handling code require us to update parent - * item if some smaller items left on the left page (which is - * possible in splitting leftmost page) and current parent - * item == new_item. - vadim 05/27/97 + * Interestingly, this means we didn't *really* need to stack + * the parent key at all; all we really care about is the + * saved block and offset as a starting point for our search... */ - if (_bt_itemcmp(rel, keysz, scankey, - stack->bts_btitem, new_item, - BTGreaterStrategyNumber) || - (!shifted && - _bt_itemcmp(rel, keysz, scankey, - stack->bts_btitem, new_item, - BTEqualStrategyNumber) && - _bt_itemcmp(rel, keysz, scankey, - lowLeftItem, new_item, - BTLessStrategyNumber))) - { - do_update = true; + ItemPointerSet(&(stack->bts_btitem.bti_itup.t_tid), + bknum, P_HIKEY); - /* - * figure out which key is leftmost (if the parent page is - * rightmost, too, it must be the root) - */ - if (P_RIGHTMOST(ppageop)) - upditem_offset = P_HIKEY; - else - upditem_offset = P_FIRSTKEY; - if (!P_LEFTMOST(lpageop) || - stack->bts_offset != upditem_offset) - elog(FATAL, "btree: items are out of order (leftmost %d, stack %u, update %u)", - P_LEFTMOST(lpageop), stack->bts_offset, upditem_offset); - } + pbuf = _bt_getstackbuf(rel, stack); - if (do_update) - { - if (shifted) - elog(FATAL, "btree: attempt to update parent for shifted page"); - - /* - * Try to update in place. If out parent page is chained - * then we must forse insertion. 
- */ - if (!parent_chained && - MAXALIGN(IndexTupleDSize(lowLeftItem->bti_itup)) == - MAXALIGN(IndexTupleDSize(stack->bts_btitem->bti_itup))) - { - _bt_updateitem(rel, keysz, pbuf, - stack->bts_btitem, lowLeftItem); - _bt_wrtbuf(rel, buf); - _bt_wrtbuf(rel, rbuf); - } - else - { - update_in_place = false; - PageIndexTupleDelete(ppage, upditem_offset); - - /* - * don't write anything out yet--we still have the - * write lock, and now we call another _bt_insertonpg - * to insert the correct key. First, make a new item, - * using the tuple data from lowLeftItem. Point it to - * the left child. Update it on the stack at the same - * time. - */ - pfree(stack->bts_btitem); - stack->bts_btitem = _bt_formitem(&(lowLeftItem->bti_itup)); - ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), - bknum, P_HIKEY); - - /* - * Unlock the children before doing this - */ - _bt_wrtbuf(rel, buf); - _bt_wrtbuf(rel, rbuf); - - /* - * A regular _bt_binsrch should find the right place - * to put the new entry, since it should be lower than - * any other key on the page. Therefore set afteritem - * to NULL. - */ - newskey = _bt_mkscankey(rel, &(stack->bts_btitem->bti_itup)); - newres = _bt_insertonpg(rel, pbuf, stack->bts_parent, - keysz, newskey, stack->bts_btitem, - NULL); - - pfree(newres); - pfree(newskey); - - /* - * we have now lost our lock on the parent buffer, and - * need to get it back. 
- */ - pbuf = _bt_getstackbuf(rel, stack, BT_WRITE); - } - } - else - { - _bt_wrtbuf(rel, buf); - _bt_wrtbuf(rel, rbuf); - } - - newskey = _bt_mkscankey(rel, &(new_item->bti_itup)); - - afteritem = stack->bts_btitem; - if (parent_chained && !update_in_place) - { - ppage = BufferGetPage(pbuf); - ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage); - if (ppageop->btpo_flags & BTP_CHAIN) - elog(FATAL, "btree: unexpected BTP_CHAIN flag in parent after update"); - if (P_RIGHTMOST(ppageop)) - elog(FATAL, "btree: chained parent is RIGHTMOST after update"); - maxoff = PageGetMaxOffsetNumber(ppage); - if (maxoff != P_FIRSTKEY) - elog(FATAL, "btree: FIRSTKEY was unexpected in parent after update"); - if (_bt_skeycmp(rel, keysz, newskey, ppage, - PageGetItemId(ppage, P_FIRSTKEY), - BTLessEqualStrategyNumber)) - elog(FATAL, "btree: parent FIRSTKEY is >= duplicate key after update"); - if (!_bt_skeycmp(rel, keysz, newskey, ppage, - PageGetItemId(ppage, P_HIKEY), - BTEqualStrategyNumber)) - elog(FATAL, "btree: parent HIGHKEY is not equal duplicate key after update"); - afteritem = (BTItem) NULL; - } - else if (left_chained && !update_in_place) - { - ppage = BufferGetPage(pbuf); - ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage); - if (!P_RIGHTMOST(ppageop) && - _bt_skeycmp(rel, keysz, newskey, ppage, - PageGetItemId(ppage, P_HIKEY), - BTGreaterStrategyNumber)) - afteritem = (BTItem) NULL; - } - if (afteritem == (BTItem) NULL) - { - rbuf = _bt_getbuf(rel, ppageop->btpo_next, BT_WRITE); - _bt_relbuf(rel, pbuf, BT_WRITE); - pbuf = rbuf; - } + /* Now we can write and unlock the children */ + _bt_wrtbuf(rel, rbuf); + _bt_wrtbuf(rel, buf); + /* Recursively update the parent */ newres = _bt_insertonpg(rel, pbuf, stack->bts_parent, - keysz, newskey, new_item, - afteritem); + 0, NULL, new_item, stack->bts_offset); /* be tidy */ pfree(newres); - pfree(newskey); pfree(new_item); } } else { - itup_off = _bt_pgaddtup(rel, buf, keysz, scankey, - itemsz, btitem, afteritem); + 
_bt_pgaddtup(rel, page, itemsz, btitem, newitemoff, "page"); + itup_off = newitemoff; itup_blkno = BufferGetBlockNumber(buf); - - _bt_relbuf(rel, buf, BT_WRITE); + /* Write out the updated page and release pin/lock */ + _bt_wrtbuf(rel, buf); } - /* by here, the new tuple is inserted */ + /* by here, the new tuple is inserted at itup_blkno/itup_off */ res = (InsertIndexResult) palloc(sizeof(InsertIndexResultData)); ItemPointerSet(&(res->pointerData), itup_blkno, itup_off); @@ -838,12 +505,19 @@ l_spl: ; * _bt_split() -- split a page in the btree. * * On entry, buf is the page to split, and is write-locked and pinned. - * Returns the new right sibling of buf, pinned and write-locked. The - * pin and lock on buf are maintained. + * firstright is the item index of the first item to be moved to the + * new right page. newitemoff etc. tell us about the new item that + * must be inserted along with the data from the old page. + * + * Returns the new right sibling of buf, pinned and write-locked. + * The pin and lock on buf are maintained. *itup_off and *itup_blkno + * are set to the exact location where newitem was inserted. 
*/ static Buffer -_bt_split(Relation rel, Size keysz, ScanKey scankey, - Buffer buf, OffsetNumber firstright) +_bt_split(Relation rel, Buffer buf, OffsetNumber firstright, + OffsetNumber newitemoff, Size newitemsz, BTItem newitem, + bool newitemonleft, + OffsetNumber *itup_off, BlockNumber *itup_blkno) { Buffer rbuf; Page origpage; @@ -860,7 +534,6 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, BTItem item; OffsetNumber leftoff, rightoff; - OffsetNumber start; OffsetNumber maxoff; OffsetNumber i; @@ -869,8 +542,8 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, leftpage = PageGetTempPage(origpage, sizeof(BTPageOpaqueData)); rightpage = BufferGetPage(rbuf); - _bt_pageinit(rightpage, BufferGetPageSize(rbuf)); _bt_pageinit(leftpage, BufferGetPageSize(buf)); + _bt_pageinit(rightpage, BufferGetPageSize(rbuf)); /* init btree private data */ oopaque = (BTPageOpaque) PageGetSpecialPointer(origpage); @@ -879,106 +552,130 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, /* if we're splitting this page, it won't be the root when we're done */ oopaque->btpo_flags &= ~BTP_ROOT; - oopaque->btpo_flags &= ~BTP_CHAIN; lopaque->btpo_flags = ropaque->btpo_flags = oopaque->btpo_flags; lopaque->btpo_prev = oopaque->btpo_prev; - ropaque->btpo_prev = BufferGetBlockNumber(buf); lopaque->btpo_next = BufferGetBlockNumber(rbuf); + ropaque->btpo_prev = BufferGetBlockNumber(buf); ropaque->btpo_next = oopaque->btpo_next; + /* + * Must copy the original parent link into both new pages, even though + * it might be quite obsolete by now. We might need it if this level + * is or recently was the root (see README). + */ lopaque->btpo_parent = ropaque->btpo_parent = oopaque->btpo_parent; /* * If the page we're splitting is not the rightmost page at its level - * in the tree, then the first (0) entry on the page is the high key + * in the tree, then the first entry on the page is the high key * for the page. We need to copy that to the right half. 
Otherwise - * (meaning the rightmost page case), we should treat the line - * pointers beginning at zero as user data. - * - * We leave a blank space at the start of the line table for the left - * page. We'll come back later and fill it in with the high key item - * we get from the right key. + * (meaning the rightmost page case), all the items on the right half + * will be user data. */ + rightoff = P_HIKEY; - leftoff = P_FIRSTKEY; - ropaque->btpo_next = oopaque->btpo_next; if (!P_RIGHTMOST(oopaque)) { - /* splitting a non-rightmost page, start at the first data item */ - start = P_FIRSTKEY; - itemid = PageGetItemId(origpage, P_HIKEY); itemsz = ItemIdGetLength(itemid); item = (BTItem) PageGetItem(origpage, itemid); - if (PageAddItem(rightpage, (Item) item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber) + if (PageAddItem(rightpage, (Item) item, itemsz, rightoff, + LP_USED) == InvalidOffsetNumber) elog(FATAL, "btree: failed to add hikey to the right sibling"); - rightoff = P_FIRSTKEY; + rightoff = OffsetNumberNext(rightoff); + } + + /* + * The "high key" for the new left page will be the first key that's + * going to go into the new right page. This might be either the + * existing data item at position firstright, or the incoming tuple. 
+ */ + leftoff = P_HIKEY; + if (!newitemonleft && newitemoff == firstright) + { + /* incoming tuple will become first on right page */ + itemsz = newitemsz; + item = newitem; } else { - /* splitting a rightmost page, "high key" is the first data item */ - start = P_HIKEY; - - /* the new rightmost page will not have a high key */ - rightoff = P_HIKEY; + /* existing item at firstright will become first on right page */ + itemid = PageGetItemId(origpage, firstright); + itemsz = ItemIdGetLength(itemid); + item = (BTItem) PageGetItem(origpage, itemid); } + if (PageAddItem(leftpage, (Item) item, itemsz, leftoff, + LP_USED) == InvalidOffsetNumber) + elog(FATAL, "btree: failed to add hikey to the left sibling"); + leftoff = OffsetNumberNext(leftoff); + + /* + * Now transfer all the data items to the appropriate page + */ maxoff = PageGetMaxOffsetNumber(origpage); - if (firstright == InvalidOffsetNumber) - { - Size llimit = PageGetFreeSpace(leftpage) / 2; - firstright = _bt_findsplitloc(rel, keysz, scankey, - origpage, start, maxoff, llimit); - } - - for (i = start; i <= maxoff; i = OffsetNumberNext(i)) + for (i = P_FIRSTDATAKEY(oopaque); i <= maxoff; i = OffsetNumberNext(i)) { itemid = PageGetItemId(origpage, i); itemsz = ItemIdGetLength(itemid); item = (BTItem) PageGetItem(origpage, itemid); + /* does new item belong before this one? 
*/ + if (i == newitemoff) + { + if (newitemonleft) + { + _bt_pgaddtup(rel, leftpage, newitemsz, newitem, leftoff, + "left sibling"); + *itup_off = leftoff; + *itup_blkno = BufferGetBlockNumber(buf); + leftoff = OffsetNumberNext(leftoff); + } + else + { + _bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff, + "right sibling"); + *itup_off = rightoff; + *itup_blkno = BufferGetBlockNumber(rbuf); + rightoff = OffsetNumberNext(rightoff); + } + } + /* decide which page to put it on */ if (i < firstright) { - if (PageAddItem(leftpage, (Item) item, itemsz, leftoff, - LP_USED) == InvalidOffsetNumber) - elog(FATAL, "btree: failed to add item to the left sibling"); + _bt_pgaddtup(rel, leftpage, itemsz, item, leftoff, + "left sibling"); leftoff = OffsetNumberNext(leftoff); } else { - if (PageAddItem(rightpage, (Item) item, itemsz, rightoff, - LP_USED) == InvalidOffsetNumber) - elog(FATAL, "btree: failed to add item to the right sibling"); + _bt_pgaddtup(rel, rightpage, itemsz, item, rightoff, + "right sibling"); rightoff = OffsetNumberNext(rightoff); } } - /* - * Okay, page has been split, high key on right page is correct. Now - * set the high key on the left page to be the min key on the right - * page. - */ - - if (P_RIGHTMOST(ropaque)) - itemid = PageGetItemId(rightpage, P_HIKEY); - else - itemid = PageGetItemId(rightpage, P_FIRSTKEY); - itemsz = ItemIdGetLength(itemid); - item = (BTItem) PageGetItem(rightpage, itemid); - - /* - * We left a hole for the high key on the left page; fill it. The - * modal crap is to tell the page manager to put the new item on the - * page and not screw around with anything else. Whoever designed - * this interface has presumably crawled back into the dung heap they - * came from. No one here will admit to it. 
- */ - - PageManagerModeSet(OverwritePageManagerMode); - if (PageAddItem(leftpage, (Item) item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber) - elog(FATAL, "btree: failed to add hikey to the left sibling"); - PageManagerModeSet(ShufflePageManagerMode); + /* cope with possibility that newitem goes at the end */ + if (i <= newitemoff) + { + if (newitemonleft) + { + _bt_pgaddtup(rel, leftpage, newitemsz, newitem, leftoff, + "left sibling"); + *itup_off = leftoff; + *itup_blkno = BufferGetBlockNumber(buf); + leftoff = OffsetNumberNext(leftoff); + } + else + { + _bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff, + "right sibling"); + *itup_off = rightoff; + *itup_blkno = BufferGetBlockNumber(rbuf); + rightoff = OffsetNumberNext(rightoff); + } + } /* * By here, the original data page has been split into two new halves, @@ -992,14 +689,10 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, PageRestoreTempPage(leftpage, origpage); - /* write these guys out */ - _bt_wrtnorelbuf(rel, rbuf); - _bt_wrtnorelbuf(rel, buf); - /* * Finally, we need to grab the right sibling (if any) and fix the * prev pointer there. We are guaranteed that this is deadlock-free - * since no other writer will be moving holding a lock on that page + * since no other writer will be holding a lock on that page * and trying to move left, and all readers release locks on a page * before trying to fetch its neighbors. */ @@ -1020,87 +713,214 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, } /* - * _bt_findsplitloc() -- find a safe place to split a page. + * _bt_findsplitloc() -- find an appropriate place to split a page. * - * In order to guarantee the proper handling of searches for duplicate - * keys, the first duplicate in the chain must either be the first - * item on the page after the split, or the entire chain must be on - * one of the two pages. That is, - * [1 2 2 2 3 4 5] - * must become - * [1] [2 2 2 3 4 5] - * or - * [1 2 2 2] [3 4 5] - * but not - * [1 2 2] [2 3 4 5]. 
- * However, - * [2 2 2 2 2 3 4] - * may be split as - * [2 2 2 2] [2 3 4]. + * The idea here is to equalize the free space that will be on each split + * page, *after accounting for the inserted tuple*. (If we fail to account + * for it, we might find ourselves with too little room on the page that + * it needs to go into!) + * + * We are passed the intended insert position of the new tuple, expressed as + * the offsetnumber of the tuple it must go in front of. (This could be + * maxoff+1 if the tuple is to go at the end.) + * + * We return the index of the first existing tuple that should go on the + * righthand page, plus a boolean indicating whether the new tuple goes on + * the left or right page. The bool is necessary to disambiguate the case + * where firstright == newitemoff. */ static OffsetNumber _bt_findsplitloc(Relation rel, - Size keysz, - ScanKey scankey, Page page, - OffsetNumber start, - OffsetNumber maxoff, - Size llimit) + OffsetNumber newitemoff, + Size newitemsz, + bool *newitemonleft) { - OffsetNumber i; - OffsetNumber saferight; - ItemId nxtitemid, - safeitemid; - BTItem safeitem, - nxtitem; - Size nbytes; + BTPageOpaque opaque; + OffsetNumber offnum; + OffsetNumber maxoff; + ItemId itemid; + FindSplitData state; + int leftspace, + rightspace, + dataitemtotal, + dataitemstoleft; - if (start >= maxoff) - elog(FATAL, "btree: cannot split if start (%d) >= maxoff (%d)", - start, maxoff); - saferight = start; - safeitemid = PageGetItemId(page, saferight); - nbytes = ItemIdGetLength(safeitemid) + sizeof(ItemIdData); - safeitem = (BTItem) PageGetItem(page, safeitemid); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); - i = OffsetNumberNext(start); + state.newitemsz = newitemsz; + state.non_leaf = ! 
P_ISLEAF(opaque); + state.have_split = false; - while (nbytes < llimit) + /* Total free space available on a btree page, after fixed overhead */ + leftspace = rightspace = + PageGetPageSize(page) - sizeof(PageHeaderData) - + MAXALIGN(sizeof(BTPageOpaqueData)) + + sizeof(ItemIdData); + + /* The right page will have the same high key as the old page */ + if (!P_RIGHTMOST(opaque)) { - /* check the next item on the page */ - nxtitemid = PageGetItemId(page, i); - nbytes += (ItemIdGetLength(nxtitemid) + sizeof(ItemIdData)); - nxtitem = (BTItem) PageGetItem(page, nxtitemid); - - /* - * Test against last known safe item: if the tuple we're looking - * at isn't equal to the last safe one we saw, then it's our new - * safe tuple. - */ - if (!_bt_itemcmp(rel, keysz, scankey, - safeitem, nxtitem, BTEqualStrategyNumber)) - { - safeitem = nxtitem; - saferight = i; - } - if (i < maxoff) - i = OffsetNumberNext(i); - else - break; + itemid = PageGetItemId(page, P_HIKEY); + rightspace -= (int) (ItemIdGetLength(itemid) + sizeof(ItemIdData)); } + /* Count up total space in data items without actually scanning 'em */ + dataitemtotal = rightspace - (int) PageGetFreeSpace(page); + /* - * If the chain of dups starts at the beginning of the page and - * extends past the halfway mark, we can split it in the middle. + * Scan through the data items and calculate space usage for a split + * at each possible position. XXX we could probably stop somewhere + * near the middle... 
*/ + dataitemstoleft = 0; + maxoff = PageGetMaxOffsetNumber(page); - if (saferight == start) - saferight = i; + for (offnum = P_FIRSTDATAKEY(opaque); + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + Size itemsz; + int leftfree, + rightfree; - if (saferight == maxoff && (maxoff - start) > 1) - saferight = start + (maxoff - start) / 2; + itemid = PageGetItemId(page, offnum); + itemsz = ItemIdGetLength(itemid) + sizeof(ItemIdData); - return saferight; + /* + * We have to allow for the current item becoming the high key of + * the left page; therefore it counts against left space. + */ + leftfree = leftspace - dataitemstoleft - (int) itemsz; + rightfree = rightspace - (dataitemtotal - dataitemstoleft); + if (offnum < newitemoff) + _bt_checksplitloc(&state, offnum, leftfree, rightfree, + false, itemsz); + else if (offnum > newitemoff) + _bt_checksplitloc(&state, offnum, leftfree, rightfree, + true, itemsz); + else + { + /* need to try it both ways!! */ + _bt_checksplitloc(&state, offnum, leftfree, rightfree, + false, newitemsz); + _bt_checksplitloc(&state, offnum, leftfree, rightfree, + true, itemsz); + } + + dataitemstoleft += itemsz; + } + + if (! state.have_split) + elog(FATAL, "_bt_findsplitloc: can't find a feasible split point for %s", + RelationGetRelationName(rel)); + *newitemonleft = state.newitemonleft; + return state.firstright; +} + +static void +_bt_checksplitloc(FindSplitData *state, OffsetNumber firstright, + int leftfree, int rightfree, + bool newitemonleft, Size firstrightitemsz) +{ + if (newitemonleft) + leftfree -= (int) state->newitemsz; + else + rightfree -= (int) state->newitemsz; + /* + * If we are not on the leaf level, we will be able to discard the + * key data from the first item that winds up on the right page. + */ + if (state->non_leaf) + rightfree += (int) firstrightitemsz - + (int) (sizeof(BTItemData) + sizeof(ItemIdData)); + /* + * If feasible split point, remember best delta. 
+ */ + if (leftfree >= 0 && rightfree >= 0) + { + int delta = leftfree - rightfree; + + if (delta < 0) + delta = -delta; + if (!state->have_split || delta < state->best_delta) + { + state->have_split = true; + state->newitemonleft = newitemonleft; + state->firstright = firstright; + state->best_delta = delta; + } + } +} + +/* + * _bt_getstackbuf() -- Walk back up the tree one step, and find the item + * we last looked at in the parent. + * + * This is possible because we save a bit image of the last item + * we looked at in the parent, and the update algorithm guarantees + * that if items above us in the tree move, they only move right. + * + * Also, re-set bts_blkno & bts_offset if changed. + */ +static Buffer +_bt_getstackbuf(Relation rel, BTStack stack) +{ + BlockNumber blkno; + Buffer buf; + OffsetNumber start, + offnum, + maxoff; + Page page; + ItemId itemid; + BTItem item; + BTPageOpaque opaque; + + blkno = stack->bts_blkno; + buf = _bt_getbuf(rel, blkno, BT_WRITE); + page = BufferGetPage(buf); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + maxoff = PageGetMaxOffsetNumber(page); + + start = stack->bts_offset; + /* + * _bt_insertonpg set bts_offset to InvalidOffsetNumber in the + * case of concurrent ROOT page split. Also, watch out for + * possibility that page has a high key now when it didn't before. + */ + if (start < P_FIRSTDATAKEY(opaque)) + start = P_FIRSTDATAKEY(opaque); + + for (;;) + { + /* see if it's on this page */ + for (offnum = start; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + itemid = PageGetItemId(page, offnum); + item = (BTItem) PageGetItem(page, itemid); + if (BTItemSame(item, &stack->bts_btitem)) + { + /* Return accurate pointer to where link is now */ + stack->bts_blkno = blkno; + stack->bts_offset = offnum; + return buf; + } + } + /* by here, the item we're looking for moved right at least one page */ + if (P_RIGHTMOST(opaque)) + elog(FATAL, "_bt_getstackbuf: my bits moved right off the end of the world!" 
+ "\n\tRecreate index %s.", RelationGetRelationName(rel)); + + blkno = opaque->btpo_next; + _bt_relbuf(rel, buf, BT_WRITE); + buf = _bt_getbuf(rel, blkno, BT_WRITE); + page = BufferGetPage(buf); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + maxoff = PageGetMaxOffsetNumber(page); + start = P_FIRSTDATAKEY(opaque); + } } /* @@ -1116,9 +936,9 @@ _bt_findsplitloc(Relation rel, * graph. * * On entry, lbuf (the old root) and rbuf (its new peer) are write- - * locked. We don't drop the locks in this routine; that's done by - * the caller. On exit, a new root page exists with entries for the - * two new children. The new root page is neither pinned nor locked. + * locked. On exit, a new root page exists with entries for the + * two new children. The new root page is neither pinned nor locked, and + * we have also written out lbuf and rbuf and dropped their pins/locks. */ static void _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf) @@ -1140,52 +960,52 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf) rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE); rootpage = BufferGetPage(rootbuf); rootbknum = BufferGetBlockNumber(rootbuf); - _bt_pageinit(rootpage, BufferGetPageSize(rootbuf)); /* set btree special data */ rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage); rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE; rootopaque->btpo_flags |= BTP_ROOT; - /* - * Insert the internal tuple pointers. - */ - lbkno = BufferGetBlockNumber(lbuf); rbkno = BufferGetBlockNumber(rbuf); lpage = BufferGetPage(lbuf); rpage = BufferGetPage(rbuf); + /* + * Make sure pages in old root level have valid parent links --- we will + * need this in _bt_insertonpg() if a concurrent root split happens (see + * README). + */ ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_parent = ((BTPageOpaque) PageGetSpecialPointer(rpage))->btpo_parent = rootbknum; /* - * step over the high key on the left page while building the left - * page pointer. 
+ * Create downlink item for left page (old root). Since this will be + * the first item in a non-leaf page, it implicitly has minus-infinity + * key value, so we need not store any actual key in it. */ - itemid = PageGetItemId(lpage, P_FIRSTKEY); - itemsz = ItemIdGetLength(itemid); - item = (BTItem) PageGetItem(lpage, itemid); - new_item = _bt_formitem(&(item->bti_itup)); + itemsz = sizeof(BTItemData); + new_item = (BTItem) palloc(itemsz); + new_item->bti_itup.t_info = itemsz; ItemPointerSet(&(new_item->bti_itup.t_tid), lbkno, P_HIKEY); /* - * insert the left page pointer into the new root page. the root page - * is the rightmost page on its level so the "high key" item is the - * first data item. + * Insert the left page pointer into the new root page. The root page + * is the rightmost page on its level so there is no "high key" in it; + * the two items will go into positions P_HIKEY and P_FIRSTKEY. */ if (PageAddItem(rootpage, (Item) new_item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber) elog(FATAL, "btree: failed to add leftkey to new root page"); pfree(new_item); /* - * the right page is the rightmost page on the second level, so the - * "high key" item is the first data item on that page as well. + * Create downlink item for right page. The key for it is obtained from + * the "high key" position in the left page. 
*/ - itemid = PageGetItemId(rpage, P_HIKEY); + itemid = PageGetItemId(lpage, P_HIKEY); itemsz = ItemIdGetLength(itemid); - item = (BTItem) PageGetItem(rpage, itemid); + item = (BTItem) PageGetItem(lpage, itemid); new_item = _bt_formitem(&(item->bti_itup)); ItemPointerSet(&(new_item->bti_itup.t_tid), rbkno, P_HIKEY); @@ -1196,497 +1016,101 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf) elog(FATAL, "btree: failed to add rightkey to new root page"); pfree(new_item); - /* write and let go of the root buffer */ + /* write and let go of the new root buffer */ _bt_wrtbuf(rel, rootbuf); /* update metadata page with new root block number */ _bt_metaproot(rel, rootbknum, 0); - _bt_wrtbuf(rel, lbuf); + /* update and release new sibling, and finally the old root */ _bt_wrtbuf(rel, rbuf); + _bt_wrtbuf(rel, lbuf); } /* * _bt_pgaddtup() -- add a tuple to a particular page in the index. * - * This routine adds the tuple to the page as requested, and keeps the - * write lock and reference associated with the page's buffer. It is - * an error to call pgaddtup() without a write lock and reference. If - * afteritem is non-null, it's the item that we expect our new item - * to follow. Otherwise, we do a binary search for the correct place - * and insert the new item there. - */ -static OffsetNumber -_bt_pgaddtup(Relation rel, - Buffer buf, - int keysz, - ScanKey itup_scankey, - Size itemsize, - BTItem btitem, - BTItem afteritem) -{ - OffsetNumber itup_off; - OffsetNumber first; - Page page; - BTPageOpaque opaque; - BTItem chkitem; - - page = BufferGetPage(buf); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - first = P_RIGHTMOST(opaque) ? 
P_HIKEY : P_FIRSTKEY;
-
-	if (afteritem == (BTItem) NULL)
-		itup_off = _bt_binsrch(rel, buf, keysz, itup_scankey, BT_INSERTION);
-	else
-	{
-		itup_off = first;
-
-		do
-		{
-			chkitem = (BTItem) PageGetItem(page, PageGetItemId(page, itup_off));
-			itup_off = OffsetNumberNext(itup_off);
-		} while (!BTItemSame(chkitem, afteritem));
-	}
-
-	if (PageAddItem(page, (Item) btitem, itemsize, itup_off, LP_USED) == InvalidOffsetNumber)
-		elog(FATAL, "btree: failed to add item to the page");
-
-	/* write the buffer, but hold our lock */
-	_bt_wrtnorelbuf(rel, buf);
-
-	return itup_off;
-}
-
-/*
- *	_bt_goesonpg() -- Does a new tuple belong on this page?
- *
- *		This is part of the complexity introduced by allowing duplicate
- *		keys into the index.  The tuple belongs on this page if:
- *
- *				+ there is no page to the right of this one; or
- *				+ it is less than the high key on the page; or
- *				+ the item it is to follow ("afteritem") appears on this
- *				  page.
- */
-static bool
-_bt_goesonpg(Relation rel,
-			 Buffer buf,
-			 Size keysz,
-			 ScanKey scankey,
-			 BTItem afteritem)
-{
-	Page		page;
-	ItemId		hikey;
-	BTPageOpaque opaque;
-	BTItem		chkitem;
-	OffsetNumber offnum,
-				maxoff;
-	bool		found;
-
-	page = BufferGetPage(buf);
-
-	/* no right neighbor? */
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	if (P_RIGHTMOST(opaque))
-		return true;
-
-	/*
-	 * this is a non-rightmost page, so it must have a high key item.
-	 *
-	 * If the scan key is < the high key (the min key on the next page), then
-	 * it for sure belongs here.
-	 */
-	hikey = PageGetItemId(page, P_HIKEY);
-	if (_bt_skeycmp(rel, keysz, scankey, page, hikey, BTLessStrategyNumber))
-		return true;
-
-	/*
-	 * If the scan key is > the high key, then it for sure doesn't belong
-	 * here.
-	 */
-
-	if (_bt_skeycmp(rel, keysz, scankey, page, hikey, BTGreaterStrategyNumber))
-		return false;
-
-	/*
-	 * If we have no adjacency information, and the item is equal to the
-	 * high key on the page (by here it is), then the item does not belong
-	 * on this page.
-	 *
-	 * Now it's not true in all cases.		- vadim 06/10/97
-	 */
-
-	if (afteritem == (BTItem) NULL)
-	{
-		if (opaque->btpo_flags & BTP_LEAF)
-			return false;
-		if (opaque->btpo_flags & BTP_CHAIN)
-			return true;
-		if (_bt_skeycmp(rel, keysz, scankey, page,
-						PageGetItemId(page, P_FIRSTKEY),
-						BTEqualStrategyNumber))
-			return true;
-		return false;
-	}
-
-	/* damn, have to work for it.  i hate that. */
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	/*
-	 * Search the entire page for the afteroid.  We need to do this,
-	 * rather than doing a binary search and starting from there, because
-	 * if the key we're searching for is the leftmost key in the tree at
-	 * this level, then a binary search will do the wrong thing.  Splits
-	 * are pretty infrequent, so the cost isn't as bad as it could be.
-	 */
-
-	found = false;
-	for (offnum = P_FIRSTKEY;
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		chkitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
-
-		if (BTItemSame(chkitem, afteritem))
-		{
-			found = true;
-			break;
-		}
-	}
-
-	return found;
-}
-
-/*
- *	_bt_tuplecompare() -- compare two IndexTuples,
- *						  return -1, 0, or +1
- *
- */
-static int32
-_bt_tuplecompare(Relation rel,
-				 Size keysz,
-				 ScanKey scankey,
-				 IndexTuple tuple1,
-				 IndexTuple tuple2)
-{
-	TupleDesc	tupDes;
-	int			i;
-	int32		compare = 0;
-
-	tupDes = RelationGetDescr(rel);
-
-	for (i = 1; i <= (int) keysz; i++)
-	{
-		ScanKey		entry = &scankey[i - 1];
-		Datum		attrDatum1,
-					attrDatum2;
-		bool		isFirstNull,
-					isSecondNull;
-
-		attrDatum1 = index_getattr(tuple1, i, tupDes, &isFirstNull);
-		attrDatum2 = index_getattr(tuple2, i, tupDes, &isSecondNull);
-
-		/* see comments about NULLs handling in btbuild */
-		if (isFirstNull)		/* attr in tuple1 is NULL */
-		{
-			if (isSecondNull)	/* attr in tuple2 is NULL too */
-				compare = 0;
-			else
-				compare = 1;	/* NULL ">" not-NULL */
-		}
-		else if (isSecondNull)	/* attr in tuple1 is NOT_NULL and */
-		{						/* attr in tuple2 is NULL */
-			compare = -1;		/* not-NULL "<" NULL */
-		}
-		else
-		{
-			compare = DatumGetInt32(FunctionCall2(&entry->sk_func,
-												  attrDatum1, attrDatum2));
-		}
-
-		if (compare != 0)
-			break;				/* done when we find unequal attributes */
-	}
-
-	return compare;
-}
-
-/*
- *	_bt_itemcmp() -- compare two BTItems using a requested
- *					 strategy (<, <=, =, >=, >)
- *
- */
-bool
-_bt_itemcmp(Relation rel,
-			Size keysz,
-			ScanKey scankey,
-			BTItem item1,
-			BTItem item2,
-			StrategyNumber strat)
-{
-	int32		compare;
-
-	compare = _bt_tuplecompare(rel, keysz, scankey,
-							   &(item1->bti_itup),
-							   &(item2->bti_itup));
-
-	switch (strat)
-	{
-		case BTLessStrategyNumber:
-			return (bool) (compare < 0);
-		case BTLessEqualStrategyNumber:
-			return (bool) (compare <= 0);
-		case BTEqualStrategyNumber:
-			return (bool) (compare == 0);
-		case
BTGreaterEqualStrategyNumber: - return (bool) (compare >= 0); - case BTGreaterStrategyNumber: - return (bool) (compare > 0); - } - - elog(ERROR, "_bt_itemcmp: bogus strategy %d", (int) strat); - return false; -} - -/* - * _bt_updateitem() -- updates the key of the item identified by the - * oid with the key of newItem (done in place if - * possible) + * This routine adds the tuple to the page as requested. It does + * not affect pin/lock status, but you'd better have a write lock + * and pin on the target buffer! Don't forget to write and release + * the buffer afterwards, either. * + * The main difference between this routine and a bare PageAddItem call + * is that this code knows that the leftmost data item on a non-leaf + * btree page doesn't need to have a key. Therefore, it strips such + * items down to just the item header. CAUTION: this works ONLY if + * we insert the items in order, so that the given itup_off does + * represent the final position of the item! */ static void -_bt_updateitem(Relation rel, - Size keysz, - Buffer buf, - BTItem oldItem, - BTItem newItem) +_bt_pgaddtup(Relation rel, + Page page, + Size itemsize, + BTItem btitem, + OffsetNumber itup_off, + const char *where) { - Page page; - OffsetNumber maxoff; - OffsetNumber i; - ItemPointerData itemPtrData; - BTItem item; - IndexTuple oldIndexTuple, - newIndexTuple; - int first; + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); + BTItemData truncitem; - page = BufferGetPage(buf); - maxoff = PageGetMaxOffsetNumber(page); - - /* locate item on the page */ - first = P_RIGHTMOST((BTPageOpaque) PageGetSpecialPointer(page)) - ? P_HIKEY : P_FIRSTKEY; - i = first; - do + if (! 
P_ISLEAF(opaque) && itup_off == P_FIRSTDATAKEY(opaque)) { - item = (BTItem) PageGetItem(page, PageGetItemId(page, i)); - i = OffsetNumberNext(i); - } while (i <= maxoff && !BTItemSame(item, oldItem)); - - /* this should never happen (in theory) */ - if (!BTItemSame(item, oldItem)) - elog(FATAL, "_bt_getstackbuf was lying!!"); - - /* - * It's defined by caller (_bt_insertonpg) - */ - - /* - * if(IndexTupleDSize(newItem->bti_itup) > - * IndexTupleDSize(item->bti_itup)) { elog(NOTICE, "trying to - * overwrite a smaller value with a bigger one in _bt_updateitem"); - * elog(ERROR, "this is not good."); } - */ - - oldIndexTuple = &(item->bti_itup); - newIndexTuple = &(newItem->bti_itup); - - /* keep the original item pointer */ - ItemPointerCopy(&(oldIndexTuple->t_tid), &itemPtrData); - CopyIndexTuple(newIndexTuple, &oldIndexTuple); - ItemPointerCopy(&itemPtrData, &(oldIndexTuple->t_tid)); + memcpy(&truncitem, btitem, sizeof(BTItemData)); + truncitem.bti_itup.t_info = sizeof(BTItemData); + btitem = &truncitem; + itemsize = sizeof(BTItemData); + } + if (PageAddItem(page, (Item) btitem, itemsize, itup_off, + LP_USED) == InvalidOffsetNumber) + elog(FATAL, "btree: failed to add item to the %s for %s", + where, RelationGetRelationName(rel)); } /* * _bt_isequal - used in _bt_doinsert in check for duplicates. * + * This is very similar to _bt_compare, except for NULL handling. * Rule is simple: NOT_NULL not equal NULL, NULL not_equal NULL too. 
*/ static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum, int keysz, ScanKey scankey) { - Datum datum; BTItem btitem; IndexTuple itup; - ScanKey entry; - AttrNumber attno; - int32 result; int i; - bool null; + + /* Better be comparing to a leaf item */ + Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page))); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); itup = &(btitem->bti_itup); for (i = 1; i <= keysz; i++) { - entry = &scankey[i - 1]; + ScanKey entry = &scankey[i - 1]; + AttrNumber attno; + Datum datum; + bool isNull; + int32 result; + attno = entry->sk_attno; Assert(attno == i); - datum = index_getattr(itup, attno, itupdesc, &null); + datum = index_getattr(itup, attno, itupdesc, &isNull); - /* NULLs are not equal */ - if (entry->sk_flags & SK_ISNULL || null) + /* NULLs are never equal to anything */ + if (entry->sk_flags & SK_ISNULL || isNull) return false; result = DatumGetInt32(FunctionCall2(&entry->sk_func, - entry->sk_argument, datum)); + entry->sk_argument, + datum)); + if (result != 0) return false; } - /* by here, the keys are equal */ + /* if we get here, the keys are equal */ return true; } - -#ifdef NOT_USED -/* - * _bt_shift - insert btitem on the passed page after shifting page - * to the right in the tree. - * - * NOTE: tested for shifting leftmost page only, having btitem < hikey. 
- */
-static InsertIndexResult
-_bt_shift(Relation rel, Buffer buf, BTStack stack, int keysz,
-		  ScanKey scankey, BTItem btitem, BTItem hikey)
-{
-	InsertIndexResult res;
-	int			itemsz;
-	Page		page;
-	BlockNumber bknum;
-	BTPageOpaque pageop;
-	Buffer		rbuf;
-	Page		rpage;
-	BTPageOpaque rpageop;
-	Buffer		pbuf;
-	Page		ppage;
-	BTPageOpaque ppageop;
-	Buffer		nbuf;
-	Page		npage;
-	BTPageOpaque npageop;
-	BlockNumber nbknum;
-	BTItem		nitem;
-	OffsetNumber afteroff;
-
-	btitem = _bt_formitem(&(btitem->bti_itup));
-	hikey = _bt_formitem(&(hikey->bti_itup));
-
-	page = BufferGetPage(buf);
-
-	/* grab new page */
-	nbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
-	nbknum = BufferGetBlockNumber(nbuf);
-	npage = BufferGetPage(nbuf);
-	_bt_pageinit(npage, BufferGetPageSize(nbuf));
-	npageop = (BTPageOpaque) PageGetSpecialPointer(npage);
-
-	/* copy content of the passed page */
-	memmove((char *) npage, (char *) page, BufferGetPageSize(buf));
-
-	/* re-init old (passed) page */
-	_bt_pageinit(page, BufferGetPageSize(buf));
-	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* init old page opaque */
-	pageop->btpo_flags = npageop->btpo_flags;	/* restore flags */
-	pageop->btpo_flags &= ~BTP_CHAIN;
-	if (_bt_itemcmp(rel, keysz, scankey, hikey, btitem, BTEqualStrategyNumber))
-		pageop->btpo_flags |= BTP_CHAIN;
-	pageop->btpo_prev = npageop->btpo_prev;		/* restore prev */
-	pageop->btpo_next = nbknum; /* next points to the new page */
-	pageop->btpo_parent = npageop->btpo_parent;
-
-	/* init shifted page opaque */
-	npageop->btpo_prev = bknum = BufferGetBlockNumber(buf);
-
-	/* shifted page is ok, populate old page */
-
-	/* add passed hikey */
-	itemsz = IndexTupleDSize(hikey->bti_itup)
-		+ (sizeof(BTItemData) - sizeof(IndexTupleData));
-	itemsz = MAXALIGN(itemsz);
-	if (PageAddItem(page, (Item) hikey, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
-		elog(FATAL, "btree: failed to add hikey in _bt_shift");
-	pfree(hikey);
-
-	/* add btitem */
-	itemsz = IndexTupleDSize(btitem->bti_itup)
-		+ (sizeof(BTItemData) - sizeof(IndexTupleData));
-	itemsz = MAXALIGN(itemsz);
-	if (PageAddItem(page, (Item) btitem, itemsz, P_FIRSTKEY, LP_USED) == InvalidOffsetNumber)
-		elog(FATAL, "btree: failed to add firstkey in _bt_shift");
-	pfree(btitem);
-	nitem = (BTItem) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
-	btitem = _bt_formitem(&(nitem->bti_itup));
-	ItemPointerSet(&(btitem->bti_itup.t_tid), bknum, P_HIKEY);
-
-	/* ok, write them out */
-	_bt_wrtnorelbuf(rel, nbuf);
-	_bt_wrtnorelbuf(rel, buf);
-
-	/* fix btpo_prev on right sibling of old page */
-	if (!P_RIGHTMOST(npageop))
-	{
-		rbuf = _bt_getbuf(rel, npageop->btpo_next, BT_WRITE);
-		rpage = BufferGetPage(rbuf);
-		rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-		rpageop->btpo_prev = nbknum;
-		_bt_wrtbuf(rel, rbuf);
-	}
-
-	/* get parent pointing to the old page */
-	ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
-				   bknum, P_HIKEY);
-	pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
-	ppage = BufferGetPage(pbuf);
-	ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
-
-	_bt_relbuf(rel, nbuf, BT_WRITE);
-	_bt_relbuf(rel, buf, BT_WRITE);
-
-	/* re-set parent' pointer - we shifted our page to the right ! */
-	nitem = (BTItem) PageGetItem(ppage,
-								 PageGetItemId(ppage, stack->bts_offset));
-	ItemPointerSet(&(nitem->bti_itup.t_tid), nbknum, P_HIKEY);
-	ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), nbknum, P_HIKEY);
-	_bt_wrtnorelbuf(rel, pbuf);
-
-	/*
-	 * Now we want insert into the parent pointer to our old page. It has
-	 * to be inserted before the pointer to new page. You may get problems
-	 * here (in the _bt_goesonpg and/or _bt_pgaddtup), but may be not - I
-	 * don't know. It works if old page is leftmost (nitem is NULL) and
-	 * btitem < hikey and it's all what we need currently.
- vadim - * 05/30/97 - */ - nitem = NULL; - afteroff = P_FIRSTKEY; - if (!P_RIGHTMOST(ppageop)) - afteroff = OffsetNumberNext(afteroff); - if (stack->bts_offset >= afteroff) - { - afteroff = OffsetNumberPrev(stack->bts_offset); - nitem = (BTItem) PageGetItem(ppage, PageGetItemId(ppage, afteroff)); - nitem = _bt_formitem(&(nitem->bti_itup)); - } - res = _bt_insertonpg(rel, pbuf, stack->bts_parent, - keysz, scankey, btitem, nitem); - pfree(btitem); - - ItemPointerSet(&(res->pointerData), nbknum, P_HIKEY); - - return res; -} - -#endif diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index 1a623698f5..40604dbc25 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -9,7 +9,7 @@ * * * IDENTIFICATION - * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.36 2000/04/12 17:14:49 momjian Exp $ + * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.37 2000/07/21 06:42:32 tgl Exp $ * * NOTES * Postgres btree pages look like ordinary relation pages. The opaque @@ -90,7 +90,7 @@ _bt_metapinit(Relation rel) metad.btm_version = BTREE_VERSION; metad.btm_root = P_NONE; metad.btm_level = 0; - memmove((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad)); + memcpy((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad)); op = (BTPageOpaque) PageGetSpecialPointer(pg); op->btpo_flags = BTP_META; @@ -102,52 +102,6 @@ _bt_metapinit(Relation rel) UnlockRelation(rel, AccessExclusiveLock); } -#ifdef NOT_USED -/* - * _bt_checkmeta() -- Verify that the metadata stored in a btree are - * reasonable. 
- */ -void -_bt_checkmeta(Relation rel) -{ - Buffer metabuf; - Page metap; - BTMetaPageData *metad; - BTPageOpaque op; - int nblocks; - - /* if the relation is empty, this is init time; don't complain */ - if ((nblocks = RelationGetNumberOfBlocks(rel)) == 0) - return; - - metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ); - metap = BufferGetPage(metabuf); - op = (BTPageOpaque) PageGetSpecialPointer(metap); - if (!(op->btpo_flags & BTP_META)) - { - elog(ERROR, "Invalid metapage for index %s", - RelationGetRelationName(rel)); - } - metad = BTPageGetMeta(metap); - - if (metad->btm_magic != BTREE_MAGIC) - { - elog(ERROR, "Index %s is not a btree", - RelationGetRelationName(rel)); - } - - if (metad->btm_version != BTREE_VERSION) - { - elog(ERROR, "Version mismatch on %s: version %d file, version %d code", - RelationGetRelationName(rel), - metad->btm_version, BTREE_VERSION); - } - - _bt_relbuf(rel, metabuf, BT_READ); -} - -#endif - /* * _bt_getroot() -- Get the root page of the btree. * @@ -157,11 +111,15 @@ _bt_checkmeta(Relation rel) * standard class of race conditions exists here; I think I covered * them all in the Hopi Indian rain dance of lock requests below. * - * We pass in the access type (BT_READ or BT_WRITE), and return the - * root page's buffer with the appropriate lock type set. Reference - * count on the root page gets bumped by ReadBuffer. The metadata - * page is unlocked and unreferenced by this process when this routine - * returns. + * The access type parameter (BT_READ or BT_WRITE) controls whether + * a new root page will be created or not. If access = BT_READ, + * and no root page exists, we just return InvalidBuffer. For + * BT_WRITE, we try to create the root page if it doesn't exist. + * NOTE that the returned root page will have only a read lock set + * on it even if access = BT_WRITE! + * + * On successful return, the root page is pinned and read-locked. + * The metadata page is not locked or pinned on exit. 
*/ Buffer _bt_getroot(Relation rel, int access) @@ -178,78 +136,71 @@ _bt_getroot(Relation rel, int access) metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ); metapg = BufferGetPage(metabuf); metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg); - Assert(metaopaque->btpo_flags & BTP_META); metad = BTPageGetMeta(metapg); - if (metad->btm_magic != BTREE_MAGIC) - { + if (!(metaopaque->btpo_flags & BTP_META) || + metad->btm_magic != BTREE_MAGIC) elog(ERROR, "Index %s is not a btree", RelationGetRelationName(rel)); - } if (metad->btm_version != BTREE_VERSION) - { - elog(ERROR, "Version mismatch on %s: version %d file, version %d code", + elog(ERROR, "Version mismatch on %s: version %d file, version %d code", RelationGetRelationName(rel), metad->btm_version, BTREE_VERSION); - } /* if no root page initialized yet, do it */ if (metad->btm_root == P_NONE) { + /* If access = BT_READ, caller doesn't want us to create root yet */ + if (access == BT_READ) + { + _bt_relbuf(rel, metabuf, BT_READ); + return InvalidBuffer; + } - /* turn our read lock in for a write lock */ - _bt_relbuf(rel, metabuf, BT_READ); - metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE); - metapg = BufferGetPage(metabuf); - metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg); - Assert(metaopaque->btpo_flags & BTP_META); - metad = BTPageGetMeta(metapg); + /* trade in our read lock for a write lock */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + LockBuffer(metabuf, BT_WRITE); /* * Race condition: if someone else initialized the metadata * between the time we released the read lock and acquired the - * write lock, above, we want to avoid doing it again. + * write lock, above, we must avoid doing it again. */ - if (metad->btm_root == P_NONE) { /* * Get, initialize, write, and leave a lock of the appropriate * type on the new root page. Since this is the first page in - * the tree, it's a leaf. + * the tree, it's a leaf as well as the root. 
*/ - rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE); rootblkno = BufferGetBlockNumber(rootbuf); rootpg = BufferGetPage(rootbuf); + metad->btm_root = rootblkno; metad->btm_level = 1; + _bt_pageinit(rootpg, BufferGetPageSize(rootbuf)); rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg); rootopaque->btpo_flags |= (BTP_LEAF | BTP_ROOT); _bt_wrtnorelbuf(rel, rootbuf); - /* swap write lock for read lock, if appropriate */ - if (access != BT_WRITE) - { - LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK); - LockBuffer(rootbuf, BT_READ); - } + /* swap write lock for read lock */ + LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK); + LockBuffer(rootbuf, BT_READ); - /* okay, metadata is correct */ + /* okay, metadata is correct, write and release it */ _bt_wrtbuf(rel, metabuf); } else { - /* * Metadata initialized by someone else. In order to * guarantee no deadlocks, we have to release the metadata * page and start all over again. */ - _bt_relbuf(rel, metabuf, BT_WRITE); return _bt_getroot(rel, access); } @@ -259,22 +210,21 @@ _bt_getroot(Relation rel, int access) rootblkno = metad->btm_root; _bt_relbuf(rel, metabuf, BT_READ); /* done with the meta page */ - rootbuf = _bt_getbuf(rel, rootblkno, access); + rootbuf = _bt_getbuf(rel, rootblkno, BT_READ); } /* * Race condition: If the root page split between the time we looked * at the metadata page and got the root buffer, then we got the wrong - * buffer. + * buffer. Release it and try again. */ - rootpg = BufferGetPage(rootbuf); rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg); - if (!(rootopaque->btpo_flags & BTP_ROOT)) - { + if (! P_ISROOT(rootopaque)) + { /* it happened, try again */ - _bt_relbuf(rel, rootbuf, access); + _bt_relbuf(rel, rootbuf, BT_READ); return _bt_getroot(rel, access); } @@ -283,7 +233,6 @@ _bt_getroot(Relation rel, int access) * count is correct, and we have no lock set on the metadata page. * Return the root block. 
*/ - return rootbuf; } @@ -291,33 +240,38 @@ _bt_getroot(Relation rel, int access) * _bt_getbuf() -- Get a buffer by block number for read or write. * * When this routine returns, the appropriate lock is set on the - * requested buffer its reference count is correct. + * requested buffer and its reference count has been incremented + * (ie, the buffer is "locked and pinned"). */ Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access) { Buffer buf; - Page page; if (blkno != P_NEW) { + /* Read an existing block of the relation */ buf = ReadBuffer(rel, blkno); LockBuffer(buf, access); } else { + Page page; /* - * Extend bufmgr code is unclean and so we have to use locking + * Extend the relation by one page. + * + * Extend bufmgr code is unclean and so we have to use extra locking * here. */ LockPage(rel, 0, ExclusiveLock); buf = ReadBuffer(rel, blkno); + LockBuffer(buf, access); UnlockPage(rel, 0, ExclusiveLock); - blkno = BufferGetBlockNumber(buf); + + /* Initialize the new page before returning it */ page = BufferGetPage(buf); _bt_pageinit(page, BufferGetPageSize(buf)); - LockBuffer(buf, access); } /* ref count and lock type are correct */ @@ -326,6 +280,8 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access) /* * _bt_relbuf() -- release a locked buffer. + * + * Lock and pin (refcount) are both dropped. */ void _bt_relbuf(Relation rel, Buffer buf, int access) @@ -337,9 +293,15 @@ _bt_relbuf(Relation rel, Buffer buf, int access) /* * _bt_wrtbuf() -- write a btree page to disk. * - * This routine releases the lock held on the buffer and our reference - * to it. It is an error to call _bt_wrtbuf() without a write lock - * or a reference to the buffer. + * This routine releases the lock held on the buffer and our refcount + * for it. It is an error to call _bt_wrtbuf() without a write lock + * and a pin on the buffer. + * + * NOTE: actually, the buffer manager just marks the shared buffer page + * dirty here, the real I/O happens later. 
Since we can't persuade the + * Unix kernel to schedule disk writes in a particular order, there's not + * much point in worrying about this. The most we can say is that all the + * writes will occur before commit. */ void _bt_wrtbuf(Relation rel, Buffer buf) @@ -353,7 +315,9 @@ _bt_wrtbuf(Relation rel, Buffer buf) * our reference or lock. * * It is an error to call _bt_wrtnorelbuf() without a write lock - * or a reference to the buffer. + * and a pin on the buffer. + * + * See above NOTE. */ void _bt_wrtnorelbuf(Relation rel, Buffer buf) @@ -389,10 +353,10 @@ _bt_pageinit(Page page, Size size) * we split the root page, we record the new parent in the metadata page * for the relation. This routine does the work. * - * No direct preconditions, but if you don't have the a write lock on + * No direct preconditions, but if you don't have the write lock on * at least the old root page when you call this, you're making a big * mistake. On exit, metapage data is correct and we no longer have - * a reference to or lock on the metapage. + * a pin or lock on the metapage. */ void _bt_metaproot(Relation rel, BlockNumber rootbknum, int level) @@ -416,127 +380,8 @@ _bt_metaproot(Relation rel, BlockNumber rootbknum, int level) } /* - * _bt_getstackbuf() -- Walk back up the tree one step, and find the item - * we last looked at in the parent. - * - * This is possible because we save a bit image of the last item - * we looked at in the parent, and the update algorithm guarantees - * that if items above us in the tree move, they only move right. - * - * Also, re-set bts_blkno & bts_offset if changed and - * bts_btitem (it may be changed - see _bt_insertonpg). + * Delete an item from a btree. It had better be a leaf item... 
*/ -Buffer -_bt_getstackbuf(Relation rel, BTStack stack, int access) -{ - Buffer buf; - BlockNumber blkno; - OffsetNumber start, - offnum, - maxoff; - OffsetNumber i; - Page page; - ItemId itemid; - BTItem item; - BTPageOpaque opaque; - BTItem item_save; - int item_nbytes; - - blkno = stack->bts_blkno; - buf = _bt_getbuf(rel, blkno, access); - page = BufferGetPage(buf); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - maxoff = PageGetMaxOffsetNumber(page); - - if (stack->bts_offset == InvalidOffsetNumber || - maxoff >= stack->bts_offset) - { - - /* - * _bt_insertonpg set bts_offset to InvalidOffsetNumber in the - * case of concurrent ROOT page split - */ - if (stack->bts_offset == InvalidOffsetNumber) - i = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - else - { - itemid = PageGetItemId(page, stack->bts_offset); - item = (BTItem) PageGetItem(page, itemid); - - /* if the item is where we left it, we're done */ - if (BTItemSame(item, stack->bts_btitem)) - { - pfree(stack->bts_btitem); - item_nbytes = ItemIdGetLength(itemid); - item_save = (BTItem) palloc(item_nbytes); - memmove((char *) item_save, (char *) item, item_nbytes); - stack->bts_btitem = item_save; - return buf; - } - i = OffsetNumberNext(stack->bts_offset); - } - - /* if the item has just moved right on this page, we're done */ - for (; - i <= maxoff; - i = OffsetNumberNext(i)) - { - itemid = PageGetItemId(page, i); - item = (BTItem) PageGetItem(page, itemid); - - /* if the item is where we left it, we're done */ - if (BTItemSame(item, stack->bts_btitem)) - { - stack->bts_offset = i; - pfree(stack->bts_btitem); - item_nbytes = ItemIdGetLength(itemid); - item_save = (BTItem) palloc(item_nbytes); - memmove((char *) item_save, (char *) item, item_nbytes); - stack->bts_btitem = item_save; - return buf; - } - } - } - - /* by here, the item we're looking for moved right at least one page */ - for (;;) - { - blkno = opaque->btpo_next; - if (P_RIGHTMOST(opaque)) - elog(FATAL, "my bits moved right off the 
end of the world!\ -\n\tRecreate index %s.", RelationGetRelationName(rel)); - - _bt_relbuf(rel, buf, access); - buf = _bt_getbuf(rel, blkno, access); - page = BufferGetPage(buf); - maxoff = PageGetMaxOffsetNumber(page); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - - /* if we have a right sibling, step over the high key */ - start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - - /* see if it's on this page */ - for (offnum = start; - offnum <= maxoff; - offnum = OffsetNumberNext(offnum)) - { - itemid = PageGetItemId(page, offnum); - item = (BTItem) PageGetItem(page, itemid); - if (BTItemSame(item, stack->bts_btitem)) - { - stack->bts_offset = offnum; - stack->bts_blkno = blkno; - pfree(stack->bts_btitem); - item_nbytes = ItemIdGetLength(itemid); - item_save = (BTItem) palloc(item_nbytes); - memmove((char *) item_save, (char *) item, item_nbytes); - stack->bts_btitem = item_save; - return buf; - } - } - } -} - void _bt_pagedel(Relation rel, ItemPointer tid) { diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index b174d30317..072d400070 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -12,7 +12,7 @@ * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION - * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.61 2000/07/14 22:17:33 tgl Exp $ + * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.62 2000/07/21 06:42:32 tgl Exp $ * *------------------------------------------------------------------------- */ @@ -26,6 +26,7 @@ #include "executor/executor.h" #include "miscadmin.h" + bool BuildingBtree = false; /* see comment in btbuild() */ bool FastBuild = true; /* use sort/build instead of insertion * build */ @@ -206,8 +207,8 @@ btbuild(PG_FUNCTION_ARGS) * btree pages - NULLs greater NOT_NULLs and NULL = NULL is TRUE. 
* Sure, it's just rule for placing/finding items and no more - * keytest'll return FALSE for a = 5 for items having 'a' isNULL. - * Look at _bt_skeycmp, _bt_compare and _bt_itemcmp for how it - * works. - vadim 03/23/97 + * Look at _bt_compare for how it works. + * - vadim 03/23/97 * * if (itup->t_info & INDEX_NULL_MASK) { pfree(itup); continue; } */ @@ -321,14 +322,6 @@ btinsert(PG_FUNCTION_ARGS) /* generate an index tuple */ itup = index_formtuple(RelationGetDescr(rel), datum, nulls); itup->t_tid = *ht_ctid; - - /* - * See comments in btbuild. - * - * if (itup->t_info & INDEX_NULL_MASK) - * PG_RETURN_POINTER((InsertIndexResult) NULL); - */ - btitem = _bt_formitem(itup); res = _bt_doinsert(rel, btitem, rel->rd_uniqueindex, heapRel); @@ -357,10 +350,10 @@ btgettuple(PG_FUNCTION_ARGS) if (ItemPointerIsValid(&(scan->currentItemData))) { - /* * Restore scan position using heap TID returned by previous call - * to btgettuple(). _bt_restscan() locks buffer. + * to btgettuple(). _bt_restscan() re-grabs the read lock on + * the buffer, too. */ _bt_restscan(scan); res = _bt_next(scan, dir); @@ -369,8 +362,9 @@ btgettuple(PG_FUNCTION_ARGS) res = _bt_first(scan, dir); /* - * Save heap TID to use it in _bt_restscan. Unlock buffer before - * leaving index ! + * Save heap TID to use it in _bt_restscan. Then release the read + * lock on the buffer so that we aren't blocking other backends. + * NOTE: we do keep the pin on the buffer! 
*/ if (res) { @@ -419,22 +413,6 @@ btrescan(PG_FUNCTION_ARGS) so = (BTScanOpaque) scan->opaque; - /* we don't hold a read lock on the current page in the scan */ - if (ItemPointerIsValid(iptr = &(scan->currentItemData))) - { - ReleaseBuffer(so->btso_curbuf); - so->btso_curbuf = InvalidBuffer; - ItemPointerSetInvalid(iptr); - } - - /* and we don't hold a read lock on the last marked item in the scan */ - if (ItemPointerIsValid(iptr = &(scan->currentMarkData))) - { - ReleaseBuffer(so->btso_mrkbuf); - so->btso_mrkbuf = InvalidBuffer; - ItemPointerSetInvalid(iptr); - } - if (so == NULL) /* if called from btbeginscan */ { so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData)); @@ -446,6 +424,21 @@ btrescan(PG_FUNCTION_ARGS) scan->flags = 0x0; } + /* we aren't holding any read locks, but gotta drop the pins */ + if (ItemPointerIsValid(iptr = &(scan->currentItemData))) + { + ReleaseBuffer(so->btso_curbuf); + so->btso_curbuf = InvalidBuffer; + ItemPointerSetInvalid(iptr); + } + + if (ItemPointerIsValid(iptr = &(scan->currentMarkData))) + { + ReleaseBuffer(so->btso_mrkbuf); + so->btso_mrkbuf = InvalidBuffer; + ItemPointerSetInvalid(iptr); + } + /* * Reset the scan keys. Note that keys ordering stuff moved to * _bt_first. 
- vadim 05/05/97 @@ -472,7 +465,7 @@ btmovescan(IndexScanDesc scan, Datum v) so = (BTScanOpaque) scan->opaque; - /* we don't hold a read lock on the current page in the scan */ + /* we aren't holding any read locks, but gotta drop the pin */ if (ItemPointerIsValid(iptr = &(scan->currentItemData))) { ReleaseBuffer(so->btso_curbuf); @@ -480,7 +473,6 @@ btmovescan(IndexScanDesc scan, Datum v) ItemPointerSetInvalid(iptr); } -/* scan->keyData[0].sk_argument = v; */ so->keyData[0].sk_argument = v; } @@ -496,7 +488,7 @@ btendscan(PG_FUNCTION_ARGS) so = (BTScanOpaque) scan->opaque; - /* we don't hold any read locks */ + /* we aren't holding any read locks, but gotta drop the pins */ if (ItemPointerIsValid(iptr = &(scan->currentItemData))) { if (BufferIsValid(so->btso_curbuf)) @@ -534,7 +526,7 @@ btmarkpos(PG_FUNCTION_ARGS) so = (BTScanOpaque) scan->opaque; - /* we don't hold any read locks */ + /* we aren't holding any read locks, but gotta drop the pin */ if (ItemPointerIsValid(iptr = &(scan->currentMarkData))) { ReleaseBuffer(so->btso_mrkbuf); @@ -542,7 +534,7 @@ btmarkpos(PG_FUNCTION_ARGS) ItemPointerSetInvalid(iptr); } - /* bump pin on current buffer */ + /* bump pin on current buffer for assignment to mark buffer */ if (ItemPointerIsValid(&(scan->currentItemData))) { so->btso_mrkbuf = ReadBuffer(scan->relation, @@ -566,7 +558,7 @@ btrestrpos(PG_FUNCTION_ARGS) so = (BTScanOpaque) scan->opaque; - /* we don't hold any read locks */ + /* we aren't holding any read locks, but gotta drop the pin */ if (ItemPointerIsValid(iptr = &(scan->currentItemData))) { ReleaseBuffer(so->btso_curbuf); @@ -579,7 +571,6 @@ btrestrpos(PG_FUNCTION_ARGS) { so->btso_curbuf = ReadBuffer(scan->relation, BufferGetBlockNumber(so->btso_mrkbuf)); - scan->currentItemData = scan->currentMarkData; so->curHeapIptr = so->mrkHeapIptr; } @@ -603,6 +594,9 @@ btdelete(PG_FUNCTION_ARGS) PG_RETURN_VOID(); } +/* + * Restore scan position when btgettuple is called to continue a scan. 
+ */ static void _bt_restscan(IndexScanDesc scan) { @@ -618,7 +612,12 @@ _bt_restscan(IndexScanDesc scan) BTItem item; BlockNumber blkno; - LockBuffer(buf, BT_READ); /* lock buffer first! */ + /* + * Get back the read lock we were holding on the buffer. + * (We still have a reference-count pin on it, though.) + */ + LockBuffer(buf, BT_READ); + page = BufferGetPage(buf); maxoff = PageGetMaxOffsetNumber(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page); @@ -631,43 +630,40 @@ _bt_restscan(IndexScanDesc scan) */ if (!ItemPointerIsValid(&target)) { - ItemPointerSetOffsetNumber(&(scan->currentItemData), - OffsetNumberPrev(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)); + ItemPointerSetOffsetNumber(current, + OffsetNumberPrev(P_FIRSTDATAKEY(opaque))); return; } - if (maxoff >= offnum) + /* + * The item we were on may have moved right due to insertions. + * Find it again. + */ + for (;;) { - - /* - * if the item is where we left it or has just moved right on this - * page, we're done - */ + /* Check for item on this page */ for (; offnum <= maxoff; offnum = OffsetNumberNext(offnum)) { item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); - if (item->bti_itup.t_tid.ip_blkid.bi_hi == \ - target.ip_blkid.bi_hi && \ - item->bti_itup.t_tid.ip_blkid.bi_lo == \ - target.ip_blkid.bi_lo && \ + if (item->bti_itup.t_tid.ip_blkid.bi_hi == + target.ip_blkid.bi_hi && + item->bti_itup.t_tid.ip_blkid.bi_lo == + target.ip_blkid.bi_lo && item->bti_itup.t_tid.ip_posid == target.ip_posid) { current->ip_posid = offnum; return; } } - } - /* - * By here, the item we're looking for moved right at least one page - */ - for (;;) - { + /* + * By here, the item we're looking for moved right at least one page + */ if (P_RIGHTMOST(opaque)) - elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!\ -\n\tRecreate index %s.", RelationGetRelationName(rel)); + elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!" 
+ "\n\tRecreate index %s.", RelationGetRelationName(rel)); blkno = opaque->btpo_next; _bt_relbuf(rel, buf, BT_READ); @@ -675,23 +671,8 @@ _bt_restscan(IndexScanDesc scan) page = BufferGetPage(buf); maxoff = PageGetMaxOffsetNumber(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page); - - /* see if it's on this page */ - for (offnum = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - offnum <= maxoff; - offnum = OffsetNumberNext(offnum)) - { - item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); - if (item->bti_itup.t_tid.ip_blkid.bi_hi == \ - target.ip_blkid.bi_hi && \ - item->bti_itup.t_tid.ip_blkid.bi_lo == \ - target.ip_blkid.bi_lo && \ - item->bti_itup.t_tid.ip_posid == target.ip_posid) - { - ItemPointerSet(current, blkno, offnum); - so->btso_curbuf = buf; - return; - } - } + offnum = P_FIRSTDATAKEY(opaque); + ItemPointerSet(current, blkno, offnum); + so->btso_curbuf = buf; } } diff --git a/src/backend/access/nbtree/nbtscan.c b/src/backend/access/nbtree/nbtscan.c index 37469365bc..5d48895c1a 100644 --- a/src/backend/access/nbtree/nbtscan.c +++ b/src/backend/access/nbtree/nbtscan.c @@ -8,22 +8,25 @@ * * * IDENTIFICATION - * $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.31 2000/04/12 17:14:49 momjian Exp $ + * $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.32 2000/07/21 06:42:32 tgl Exp $ * * * NOTES * Because we can be doing an index scan on a relation while we update * it, we need to avoid missing data that moves around in the index. - * The routines and global variables in this file guarantee that all - * scans in the local address space stay correctly positioned. This - * is all we need to worry about, since write locking guarantees that - * no one else will be on the same page at the same time as we are. 
+ * Insertions and page splits are no problem because _bt_restscan() + * can figure out where the current item moved to, but if a deletion + * happens at or before the current scan position, we'd better do + * something to stay in sync. + * + * The routines in this file handle the problem for deletions issued + * by the current backend. Currently, that's all we need, since + * deletions are only done by VACUUM and it gets an exclusive lock. * * The scheme is to manage a list of active scans in the current backend. - * Whenever we add or remove records from an index, or whenever we - * split a leaf page, we check the list of active scans to see if any - * has been affected. A scan is affected only if it is on the same - * relation, and the same page, as the update. + * Whenever we remove a record from an index, we check the list of active + * scans to see if any has been affected. A scan is affected only if it + * is on the same relation, and the same page, as the update. * *------------------------------------------------------------------------- */ @@ -111,7 +114,7 @@ _bt_dropscan(IndexScanDesc scan) /* * _bt_adjscans() -- adjust all scans in the scan list to compensate - * for a given deletion or insertion + * for a given deletion */ void _bt_adjscans(Relation rel, ItemPointer tid) @@ -153,7 +156,7 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno) { page = BufferGetPage(buf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); - start = P_RIGHTMOST(opaque) ? 
P_HIKEY : P_FIRSTKEY; + start = P_FIRSTDATAKEY(opaque); if (ItemPointerGetOffsetNumber(current) == start) ItemPointerSetInvalid(&(so->curHeapIptr)); else @@ -165,7 +168,6 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno) */ LockBuffer(buf, BT_READ); _bt_step(scan, &buf, BackwardScanDirection); - so->btso_curbuf = buf; if (ItemPointerIsValid(current)) { Page pg = BufferGetPage(buf); @@ -183,10 +185,9 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno) && ItemPointerGetBlockNumber(current) == blkno && ItemPointerGetOffsetNumber(current) >= offno) { - page = BufferGetPage(so->btso_mrkbuf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); - start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; + start = P_FIRSTDATAKEY(opaque); if (ItemPointerGetOffsetNumber(current) == start) ItemPointerSetInvalid(&(so->mrkHeapIptr)); diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c index 54c15b2f6a..49aec3b23d 100644 --- a/src/backend/access/nbtree/nbtsearch.c +++ b/src/backend/access/nbtree/nbtsearch.c @@ -1,14 +1,14 @@ /*------------------------------------------------------------------------- * - * btsearch.c + * nbtsearch.c * search code for postgres btrees. 
* + * * Portions Copyright (c) 1996-2000, PostgreSQL, Inc * Portions Copyright (c) 1994, Regents of the University of California * - * * IDENTIFICATION - * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsearch.c,v 1.60 2000/05/30 04:24:33 tgl Exp $ + * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsearch.c,v 1.61 2000/07/21 06:42:32 tgl Exp $ * *------------------------------------------------------------------------- */ @@ -19,102 +19,96 @@ #include "access/nbtree.h" +static RetrieveIndexResult _bt_endpoint(IndexScanDesc scan, ScanDirection dir); -static BTStack _bt_searchr(Relation rel, int keysz, ScanKey scankey, - Buffer *bufP, BTStack stack_in); -static int32 _bt_compare(Relation rel, TupleDesc itupdesc, Page page, - int keysz, ScanKey scankey, OffsetNumber offnum); -static bool - _bt_twostep(IndexScanDesc scan, Buffer *bufP, ScanDirection dir); -static RetrieveIndexResult - _bt_endpoint(IndexScanDesc scan, ScanDirection dir); /* - * _bt_search() -- Search for a scan key in the index. + * _bt_search() -- Search the tree for a particular scankey, + * or more precisely for the first leaf page it could be on. * - * This routine is actually just a helper that sets things up and - * calls a recursive-descent search routine on the tree. + * Return value is a stack of parent-page pointers. *bufP is set to the + * address of the leaf-page buffer, which is read-locked and pinned. + * No locks are held on the parent pages, however! + * + * NOTE that the returned buffer is read-locked regardless of the access + * parameter. However, access = BT_WRITE will allow an empty root page + * to be created and returned. When access = BT_READ, an empty index + * will result in *bufP being set to InvalidBuffer. 
*/ BTStack -_bt_search(Relation rel, int keysz, ScanKey scankey, Buffer *bufP) +_bt_search(Relation rel, int keysz, ScanKey scankey, + Buffer *bufP, int access) { - *bufP = _bt_getroot(rel, BT_READ); - return _bt_searchr(rel, keysz, scankey, bufP, (BTStack) NULL); -} + BTStack stack_in = NULL; -/* - * _bt_searchr() -- Search the tree recursively for a particular scankey. - */ -static BTStack -_bt_searchr(Relation rel, - int keysz, - ScanKey scankey, - Buffer *bufP, - BTStack stack_in) -{ - BTStack stack; - OffsetNumber offnum; - Page page; - BTPageOpaque opaque; - BlockNumber par_blkno; - BlockNumber blkno; - ItemId itemid; - BTItem btitem; - BTItem item_save; - int item_nbytes; - IndexTuple itup; + /* Get the root page to start with */ + *bufP = _bt_getroot(rel, access); - /* if this is a leaf page, we're done */ - page = BufferGetPage(*bufP); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - if (opaque->btpo_flags & BTP_LEAF) - return stack_in; + /* If index is empty and access = BT_READ, no root page is created. */ + if (! BufferIsValid(*bufP)) + return (BTStack) NULL; - /* - * Find the appropriate item on the internal page, and get the child - * page that it points to. - */ + /* Loop iterates once per level descended in the tree */ + for (;;) + { + Page page; + BTPageOpaque opaque; + OffsetNumber offnum; + ItemId itemid; + BTItem btitem; + IndexTuple itup; + BlockNumber blkno; + BlockNumber par_blkno; + BTStack new_stack; - par_blkno = BufferGetBlockNumber(*bufP); - offnum = _bt_binsrch(rel, *bufP, keysz, scankey, BT_DESCENT); - itemid = PageGetItemId(page, offnum); - btitem = (BTItem) PageGetItem(page, itemid); - itup = &(btitem->bti_itup); - blkno = ItemPointerGetBlockNumber(&(itup->t_tid)); + /* if this is a leaf page, we're done */ + page = BufferGetPage(*bufP); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + if (P_ISLEAF(opaque)) + break; - /* - * We need to save the bit image of the index entry we chose in the - * parent page on a stack. 
In case we split the tree, we'll use this - * bit image to figure out what our real parent page is, in case the - * parent splits while we're working lower in the tree. See the paper - * by Lehman and Yao for how this is detected and handled. (We use - * unique OIDs to disambiguate duplicate keys in the index -- Lehman - * and Yao disallow duplicate keys). - */ + /* + * Find the appropriate item on the internal page, and get the + * child page that it points to. + */ + offnum = _bt_binsrch(rel, *bufP, keysz, scankey); + itemid = PageGetItemId(page, offnum); + btitem = (BTItem) PageGetItem(page, itemid); + itup = &(btitem->bti_itup); + blkno = ItemPointerGetBlockNumber(&(itup->t_tid)); + par_blkno = BufferGetBlockNumber(*bufP); - item_nbytes = ItemIdGetLength(itemid); - item_save = (BTItem) palloc(item_nbytes); - memmove((char *) item_save, (char *) btitem, item_nbytes); - stack = (BTStack) palloc(sizeof(BTStackData)); - stack->bts_blkno = par_blkno; - stack->bts_offset = offnum; - stack->bts_btitem = item_save; - stack->bts_parent = stack_in; + /* + * We need to save the bit image of the index entry we chose in the + * parent page on a stack. In case we split the tree, we'll use this + * bit image to figure out what our real parent page is, in case the + * parent splits while we're working lower in the tree. See the paper + * by Lehman and Yao for how this is detected and handled. (We use the + * child link to disambiguate duplicate keys in the index -- Lehman + * and Yao disallow duplicate keys.) 
+ */ + new_stack = (BTStack) palloc(sizeof(BTStackData)); + new_stack->bts_blkno = par_blkno; + new_stack->bts_offset = offnum; + memcpy(&new_stack->bts_btitem, btitem, sizeof(BTItemData)); + new_stack->bts_parent = stack_in; - /* drop the read lock on the parent page and acquire one on the child */ - _bt_relbuf(rel, *bufP, BT_READ); - *bufP = _bt_getbuf(rel, blkno, BT_READ); + /* drop the read lock on the parent page, acquire one on the child */ + _bt_relbuf(rel, *bufP, BT_READ); + *bufP = _bt_getbuf(rel, blkno, BT_READ); - /* - * Race -- the page we just grabbed may have split since we read its - * pointer in the parent. If it has, we may need to move right to its - * new sibling. Do that. - */ + /* + * Race -- the page we just grabbed may have split since we read its + * pointer in the parent. If it has, we may need to move right to its + * new sibling. Do that. + */ + *bufP = _bt_moveright(rel, *bufP, keysz, scankey, BT_READ); - *bufP = _bt_moveright(rel, *bufP, keysz, scankey, BT_READ); + /* okay, all set to move down a level */ + stack_in = new_stack; + } - /* okay, all set to move down a level */ - return _bt_searchr(rel, keysz, scankey, bufP, stack); + return stack_in; } /* @@ -133,7 +127,7 @@ _bt_searchr(Relation rel, * * On entry, we have the buffer pinned and a lock of the proper type. * If we move right, we release the buffer and lock and acquire the - * same on the right sibling. + * same on the right sibling. Return value is the buffer we stop at. 
*/ Buffer _bt_moveright(Relation rel, @@ -144,231 +138,81 @@ _bt_moveright(Relation rel, { Page page; BTPageOpaque opaque; - ItemId hikey; - BlockNumber rblkno; - int natts = rel->rd_rel->relnatts; page = BufferGetPage(buf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); - /* if we're on a rightmost page, we don't need to move right */ - if (P_RIGHTMOST(opaque)) - return buf; - - /* by convention, item 0 on non-rightmost pages is the high key */ - hikey = PageGetItemId(page, P_HIKEY); - /* - * If the scan key that brought us to this page is >= the high key + * If the scan key that brought us to this page is > the high key * stored on the page, then the page has split and we need to move - * right. + * right. (If the scan key is equal to the high key, we might or + * might not need to move right; have to scan the page first anyway.) + * It could even have split more than once, so scan as far as needed. */ - - if (_bt_skeycmp(rel, keysz, scankey, page, hikey, - BTGreaterEqualStrategyNumber)) + while (!P_RIGHTMOST(opaque) && + _bt_compare(rel, keysz, scankey, page, P_HIKEY) > 0) { - /* move right as long as we need to */ - do - { - OffsetNumber offmax = PageGetMaxOffsetNumber(page); + /* step right one page */ + BlockNumber rblkno = opaque->btpo_next; - /* - * If this page consists of all duplicate keys (hikey and - * first key on the page have the same value), then we don't - * need to step right. - * - * NOTE for multi-column indices: we may do scan using keys not - * for all attrs. But we handle duplicates using all attrs in - * _bt_insert/_bt_spool code. And so we've to compare scankey - * with _last_ item on this page to do not lose "good" tuples - * if number of attrs > keysize. Example: (2,0) - last items - * on this page, (2,1) - first item on next page (hikey), our - * scankey is x = 2. Scankey == (2,1) because of we compare - * first attrs only, but we shouldn't to move right of here. 
- - * vadim 04/15/97 - * - * Also, if this page is not LEAF one (and # of attrs > keysize) - * then we can't move too. - vadim 10/22/97 - */ - - if (_bt_skeycmp(rel, keysz, scankey, page, hikey, - BTEqualStrategyNumber)) - { - if (opaque->btpo_flags & BTP_CHAIN) - { - Assert((opaque->btpo_flags & BTP_LEAF) || offmax > P_HIKEY); - break; - } - if (offmax > P_HIKEY) - { - if (natts == keysz) /* sanity checks */ - { - if (_bt_skeycmp(rel, keysz, scankey, page, - PageGetItemId(page, P_FIRSTKEY), - BTEqualStrategyNumber)) - elog(FATAL, "btree: BTP_CHAIN flag was expected in %s (access = %s)", - RelationGetRelationName(rel), access ? "bt_write" : "bt_read"); - if (_bt_skeycmp(rel, keysz, scankey, page, - PageGetItemId(page, offmax), - BTEqualStrategyNumber)) - elog(FATAL, "btree: unexpected equal last item"); - if (_bt_skeycmp(rel, keysz, scankey, page, - PageGetItemId(page, offmax), - BTLessStrategyNumber)) - elog(FATAL, "btree: unexpected greater last item"); - /* move right */ - } - else if (!(opaque->btpo_flags & BTP_LEAF)) - break; - else if (_bt_skeycmp(rel, keysz, scankey, page, - PageGetItemId(page, offmax), - BTLessEqualStrategyNumber)) - break; - } - } - - /* step right one page */ - rblkno = opaque->btpo_next; - _bt_relbuf(rel, buf, access); - buf = _bt_getbuf(rel, rblkno, access); - page = BufferGetPage(buf); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - hikey = PageGetItemId(page, P_HIKEY); - - } while (!P_RIGHTMOST(opaque) - && _bt_skeycmp(rel, keysz, scankey, page, hikey, - BTGreaterEqualStrategyNumber)); + _bt_relbuf(rel, buf, access); + buf = _bt_getbuf(rel, rblkno, access); + page = BufferGetPage(buf); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); } + return buf; } -/* - * _bt_skeycmp() -- compare a scan key to a particular item on a page using - * a requested strategy (<, <=, =, >=, >). - * - * We ignore the unique OIDs stored in the btree item here. 
Those - * numbers are intended for use internally only, in repositioning a - * scan after a page split. They do not impose any meaningful ordering. - * - * The comparison is A B, where A is the scan key and B is the - * tuple pointed at by itemid on page. - */ -bool -_bt_skeycmp(Relation rel, - Size keysz, - ScanKey scankey, - Page page, - ItemId itemid, - StrategyNumber strat) -{ - BTItem item; - IndexTuple indexTuple; - TupleDesc tupDes; - int i; - int32 compare = 0; - - item = (BTItem) PageGetItem(page, itemid); - indexTuple = &(item->bti_itup); - - tupDes = RelationGetDescr(rel); - - for (i = 1; i <= (int) keysz; i++) - { - ScanKey entry = &scankey[i - 1]; - Datum attrDatum; - bool isNull; - - Assert(entry->sk_attno == i); - attrDatum = index_getattr(indexTuple, - entry->sk_attno, - tupDes, - &isNull); - - /* see comments about NULLs handling in btbuild */ - if (entry->sk_flags & SK_ISNULL) /* key is NULL */ - { - if (isNull) - compare = 0; /* NULL key "=" NULL datum */ - else - compare = 1; /* NULL key ">" not-NULL datum */ - } - else if (isNull) /* key is NOT_NULL and item is NULL */ - { - compare = -1; /* not-NULL key "<" NULL datum */ - } - else - compare = DatumGetInt32(FunctionCall2(&entry->sk_func, - entry->sk_argument, - attrDatum)); - - if (compare != 0) - break; /* done when we find unequal attributes */ - } - - switch (strat) - { - case BTLessStrategyNumber: - return (bool) (compare < 0); - case BTLessEqualStrategyNumber: - return (bool) (compare <= 0); - case BTEqualStrategyNumber: - return (bool) (compare == 0); - case BTGreaterEqualStrategyNumber: - return (bool) (compare >= 0); - case BTGreaterStrategyNumber: - return (bool) (compare > 0); - } - - elog(ERROR, "_bt_skeycmp: bogus strategy %d", (int) strat); - return false; -} - /* * _bt_binsrch() -- Do a binary search for a key on a particular page. * - * The scankey we get has the compare function stored in the procedure - * entry of each data struct. 
We invoke this regproc to do the - * comparison for every key in the scankey. _bt_binsrch() returns - * the OffsetNumber of the first matching key on the page, or the - * OffsetNumber at which the matching key would appear if it were - * on this page. (NOTE: in particular, this means it is possible to - * return a value 1 greater than the number of keys on the page, if - * the scankey is > all keys on the page.) + * The scankey we get has the compare function stored in the procedure + * entry of each data struct. We invoke this regproc to do the + * comparison for every key in the scankey. * - * By the time this procedure is called, we're sure we're looking - * at the right page -- don't need to walk right. _bt_binsrch() has - * no lock or refcount side effects on the buffer. + * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first + * key >= given scankey. (NOTE: in particular, this means it is possible + * to return a value 1 greater than the number of keys on the page, + * if the scankey is > all keys on the page.) + * + * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber + * of the last key < given scankey. (Since _bt_compare treats the first + * data key of such a page as minus infinity, there will be at least one + * key < scankey, so the result always points at one of the keys on the + * page.) This key indicates the right place to descend to be sure we + * find all leaf keys >= given scankey. + * + * This procedure is not responsible for walking right, it just examines + * the given page. _bt_binsrch() has no lock or refcount side effects + * on the buffer. 
*/ OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz, - ScanKey scankey, - int srchtype) + ScanKey scankey) { TupleDesc itupdesc; Page page; BTPageOpaque opaque; OffsetNumber low, high; - bool haveEq; - int natts = rel->rd_rel->relnatts; int32 result; itupdesc = RelationGetDescr(rel); page = BufferGetPage(buf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); - /* by convention, item 1 on any non-rightmost page is the high key */ - low = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - + low = P_FIRSTDATAKEY(opaque); high = PageGetMaxOffsetNumber(page); /* * If there are no keys on the page, return the first available slot. * Note this covers two cases: the page is really empty (no keys), or * it contains only a high key. The latter case is possible after - * vacuuming. + * vacuuming. This can never happen on an internal page, however, + * since they are never empty (an internal page must have children). */ if (high < low) return low; @@ -376,11 +220,9 @@ _bt_binsrch(Relation rel, /* * Binary search to find the first key on the page >= scan key. Loop * invariant: all slots before 'low' are < scan key, all slots at or - * after 'high' are >= scan key. Also, haveEq is true if the tuple at - * 'high' is == scan key. We can fall out when high == low. + * after 'high' are >= scan key. We can fall out when high == low. */ high++; /* establish the loop invariant for high */ - haveEq = false; while (high > low) { @@ -388,175 +230,77 @@ _bt_binsrch(Relation rel, /* We have low <= mid < high, so mid points at a real slot */ - result = _bt_compare(rel, itupdesc, page, keysz, scankey, mid); + result = _bt_compare(rel, keysz, scankey, page, mid); if (result > 0) low = mid + 1; else - { high = mid; - haveEq = (result == 0); - } } /*-------------------- * At this point we have high == low, but be careful: they could point - * past the last slot on the page. 
We also know that haveEq is true - * if and only if there is an equal key (in which case high&low point - * at the first equal key). + * past the last slot on the page. * * On a leaf page, we always return the first key >= scan key * (which could be the last slot + 1). *-------------------- */ - - if (opaque->btpo_flags & BTP_LEAF) + if (P_ISLEAF(opaque)) return low; /*-------------------- - * On a non-leaf page, there are special cases: - * - * For an insertion (srchtype != BT_DESCENT and natts == keysz) - * always return first key >= scan key (which could be off the end). - * - * For a standard search (srchtype == BT_DESCENT and natts == keysz) - * return the first equal key if one exists, else the last lesser key - * if one exists, else the first slot on the page. - * - * For a partial-match search (srchtype == BT_DESCENT and natts > keysz) - * return the last lesser key if one exists, else the first slot. - * - * Old comments: - * For multi-column indices, we may scan using keys - * not for all attrs. But we handle duplicates using all attrs - * in _bt_insert/_bt_spool code. And so while searching on - * internal pages having number of attrs > keysize we want to - * point at the last item < the scankey, not at the first item - * = the scankey (!!!), and let _bt_moveright decide later - * whether to move right or not (see comments and example - * there). Note also that INSERTions are not affected by this - * code (since natts == keysz for inserts). - vadim 04/15/97 + * On a non-leaf page, return the last key < scan key. + * There must be one if _bt_compare() is playing by the rules. *-------------------- */ - - if (haveEq) - { - - /* - * There is an equal key. We return either the first equal key - * (which we just found), or the last lesser key. - * - * We need not check srchtype != BT_DESCENT here, since if that is - * true then natts == keysz by assumption. 
- */ - if (natts == keysz) - return low; /* return first equal key */ - } - else - { - - /* - * There is no equal key. We return either the first greater key - * (which we just found), or the last lesser key. - */ - if (srchtype != BT_DESCENT) - return low; /* return first greater key */ - } - - - if (low == (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)) - return low; /* there is no prior item */ + Assert(low > P_FIRSTDATAKEY(opaque)); return OffsetNumberPrev(low); } -/* +/*---------- * _bt_compare() -- Compare scankey to a particular tuple on the page. * + * keysz: number of key conditions to be checked (might be less than the + * total length of the scan key!) + * page/offnum: location of btree item to be compared to. + * * This routine returns: * <0 if scankey < tuple at offnum; * 0 if scankey == tuple at offnum; * >0 if scankey > tuple at offnum. + * NULLs in the keys are treated as sortable values. Therefore + * "equality" does not necessarily mean that the item should be + * returned to the caller as a matching key! * - * -- Old comments: - * In order to avoid having to propagate changes up the tree any time - * a new minimal key is inserted, the leftmost entry on the leftmost - * page is less than all possible keys, by definition. - * - * -- New ones: - * New insertion code (fix against updating _in_place_ if new minimal - * key has bigger size than old one) may delete P_HIKEY entry on the - * root page in order to insert new minimal key - and so this definition - * does not work properly in this case and breaks key' order on root - * page. BTW, this propagation occures only while page' splitting, - * but not "any time a new min key is inserted" (see _bt_insertonpg). - * - vadim 12/05/96 + * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be + * "minus infinity": this routine will always claim it is less than the + * scankey. The actual key value stored (if any, which there probably isn't) + * does not matter. 
This convention allows us to implement the Lehman and + * Yao convention that the first down-link pointer is before the first key. + * See backend/access/nbtree/README for details. + *---------- */ -static int32 +int32 _bt_compare(Relation rel, - TupleDesc itupdesc, - Page page, int keysz, ScanKey scankey, + Page page, OffsetNumber offnum) { - Datum datum; + TupleDesc itupdesc = RelationGetDescr(rel); + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); BTItem btitem; IndexTuple itup; - BTPageOpaque opaque; - ScanKey entry; - AttrNumber attno; - int32 result; int i; - bool null; /* - * If this is a leftmost internal page, and if our comparison is with - * the first key on the page, then the item at that position is by - * definition less than the scan key. - * - * - see new comments above... + * Force result ">" if target item is first data item on an internal + * page --- see NOTE above. */ - - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - - if (!(opaque->btpo_flags & BTP_LEAF) - && P_LEFTMOST(opaque) - && offnum == P_HIKEY) - { - - /* - * we just have to believe that this will only be called with - * offnum == P_HIKEY when P_HIKEY is the OffsetNumber of the first - * actual data key (i.e., this is also a rightmost page). there - * doesn't seem to be any code that implies that the leftmost page - * is normally missing a high key as well as the rightmost page. - * but that implies that this code path only applies to the root - * -- which seems unlikely.. - * - * - see new comments above... - */ - if (!P_RIGHTMOST(opaque)) - elog(ERROR, "_bt_compare: invalid comparison to high key"); - -#ifdef NOT_USED - - /* - * We just have to belive that right answer will not break - * anything. I've checked code and all seems to be ok. See new - * comments above... - * - * -- Old comments If the item on the page is equal to the scankey, - * that's okay to admit. We just can't claim that the first key - * on the page is greater than anything. 
- */ - - if (_bt_skeycmp(rel, keysz, scankey, page, PageGetItemId(page, offnum), - BTEqualStrategyNumber)) - return 0; + if (! P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque)) return 1; -#endif - } btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); itup = &(btitem->bti_itup); @@ -568,37 +312,45 @@ _bt_compare(Relation rel, * they be in order. If you think about how multi-key ordering works, * you'll understand why this is. * - * We don't test for violation of this condition here. + * We don't test for violation of this condition here, however. The + * initial setup for the index scan had better have gotten it right + * (see _bt_first). */ - for (i = 1; i <= keysz; i++) + for (i = 0; i < keysz; i++) { - entry = &scankey[i - 1]; - attno = entry->sk_attno; - datum = index_getattr(itup, attno, itupdesc, &null); + ScanKey entry = &scankey[i]; + Datum datum; + bool isNull; + int32 result; + + datum = index_getattr(itup, entry->sk_attno, itupdesc, &isNull); /* see comments about NULLs handling in btbuild */ - if (entry->sk_flags & SK_ISNULL) /* key is NULL */ + if (entry->sk_flags & SK_ISNULL) /* key is NULL */ { - if (null) + if (isNull) result = 0; /* NULL "=" NULL */ else result = 1; /* NULL ">" NOT_NULL */ } - else if (null) /* key is NOT_NULL and item is NULL */ + else if (isNull) /* key is NOT_NULL and item is NULL */ { result = -1; /* NOT_NULL "<" NULL */ } else + { result = DatumGetInt32(FunctionCall2(&entry->sk_func, - entry->sk_argument, datum)); + entry->sk_argument, + datum)); + } /* if the keys are unequal, return the difference */ if (result != 0) return result; } - /* by here, the keys are equal */ + /* if we get here, the keys are equal */ return 0; } @@ -606,10 +358,10 @@ _bt_compare(Relation rel, * _bt_next() -- Get the next item in a scan. * * On entry, we have a valid currentItemData in the scan, and a - * read lock on the page that contains that item. We do not have - * the page pinned. We return the next item in the scan. 
On - * exit, we have the page containing the next item locked but not - * pinned. + * read lock and pin count on the page that contains that item. + * We return the next item in the scan, or NULL if no more. + * On successful exit, the page containing the new item is locked + * and pinned; on NULL exit, no lock or pin is held. */ RetrieveIndexResult _bt_next(IndexScanDesc scan, ScanDirection dir) @@ -618,7 +370,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir) Buffer buf; Page page; OffsetNumber offnum; - RetrieveIndexResult res; ItemPointer current; BTItem btitem; IndexTuple itup; @@ -629,10 +380,9 @@ _bt_next(IndexScanDesc scan, ScanDirection dir) so = (BTScanOpaque) scan->opaque; current = &(scan->currentItemData); - Assert(BufferIsValid(so->btso_curbuf)); - /* we still have the buffer pinned and locked */ buf = so->btso_curbuf; + Assert(BufferIsValid(buf)); do { @@ -640,7 +390,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir) if (!_bt_step(scan, &buf, dir)) return (RetrieveIndexResult) NULL; - /* by here, current is the tuple we want to return */ + /* current is the next candidate tuple to return */ offnum = ItemPointerGetOffsetNumber(current); page = BufferGetPage(buf); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); @@ -648,17 +398,16 @@ _bt_next(IndexScanDesc scan, ScanDirection dir) if (_bt_checkkeys(scan, itup, &keysok)) { + /* tuple passes all scan key conditions, so return it */ Assert(keysok == so->numberOfKeys); - res = FormRetrieveIndexResult(current, &(itup->t_tid)); - - /* remember which buffer we have pinned and locked */ - so->btso_curbuf = buf; - return res; + return FormRetrieveIndexResult(current, &(itup->t_tid)); } + /* This tuple doesn't pass, but there might be more that do */ } while (keysok >= so->numberOfFirstKeys || (keysok == ((Size) -1) && ScanDirectionIsBackward(dir))); + /* No more items, so close down the current-item info */ ItemPointerSetInvalid(current); so->btso_curbuf = InvalidBuffer; _bt_relbuf(rel, 
buf, BT_READ); @@ -680,14 +429,10 @@ RetrieveIndexResult _bt_first(IndexScanDesc scan, ScanDirection dir) { Relation rel; - TupleDesc itupdesc; Buffer buf; Page page; - BTPageOpaque pop; BTStack stack; - OffsetNumber offnum, - maxoff; - bool offGmax = false; + OffsetNumber offnum; BTItem btitem; IndexTuple itup; ItemPointer current; @@ -698,7 +443,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) int32 result; BTScanOpaque so; Size keysok; - bool strategyCheck; ScanKey scankeys = 0; int keysCount = 0; @@ -784,20 +528,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) return _bt_endpoint(scan, dir); } - itupdesc = RelationGetDescr(rel); - current = &(scan->currentItemData); - /* * Okay, we want something more complicated. What we'll do is use the * first item in the scan key passed in (which has been correctly * ordered to take advantage of index ordering) to position ourselves * at the right place in the scan. */ - /* _bt_orderkeys disallows it, but it's place to add some code latter */ scankeys = (ScanKey) palloc(keysCount * sizeof(ScanKeyData)); for (i = 0; i < keysCount; i++) { j = nKeyIs[i]; + /* _bt_orderkeys disallows it, but it's place to add some code latter */ if (so->keyData[j].sk_flags & SK_ISNULL) { pfree(nKeyIs); @@ -812,234 +553,213 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) if (nKeyIs) pfree(nKeyIs); - stack = _bt_search(rel, keysCount, scankeys, &buf); - _bt_freestack(stack); - - blkno = BufferGetBlockNumber(buf); - page = BufferGetPage(buf); + current = &(scan->currentItemData); /* - * This will happen if the tree we're searching is entirely empty, or - * if we're doing a search for a key that would appear on an entirely - * empty internal page. In either case, there are no matching tuples - * in the index. + * Use the manufactured scan key to descend the tree and position + * ourselves on the target leaf page. 
*/ + stack = _bt_search(rel, keysCount, scankeys, &buf, BT_READ); - if (PageIsEmpty(page)) + /* don't need to keep the stack around... */ + _bt_freestack(stack); + + if (! BufferIsValid(buf)) { + /* Only get here if index is completely empty */ ItemPointerSetInvalid(current); so->btso_curbuf = InvalidBuffer; - _bt_relbuf(rel, buf, BT_READ); pfree(scankeys); return (RetrieveIndexResult) NULL; } - maxoff = PageGetMaxOffsetNumber(page); - pop = (BTPageOpaque) PageGetSpecialPointer(page); - /* - * Now _bt_moveright doesn't move from non-rightmost leaf page if - * scankey == hikey and there is only hikey there. It's good for - * insertion, but we need to do work for scan here. - vadim 05/27/97 - */ + /* remember which buffer we have pinned */ + so->btso_curbuf = buf; + blkno = BufferGetBlockNumber(buf); + page = BufferGetPage(buf); - while (maxoff == P_HIKEY && !P_RIGHTMOST(pop) && - _bt_skeycmp(rel, keysCount, scankeys, page, - PageGetItemId(page, P_HIKEY), - BTGreaterEqualStrategyNumber)) - { - /* step right one page */ - blkno = pop->btpo_next; - _bt_relbuf(rel, buf, BT_READ); - buf = _bt_getbuf(rel, blkno, BT_READ); - page = BufferGetPage(buf); - if (PageIsEmpty(page)) - { - ItemPointerSetInvalid(current); - so->btso_curbuf = InvalidBuffer; - _bt_relbuf(rel, buf, BT_READ); - pfree(scankeys); - return (RetrieveIndexResult) NULL; - } - maxoff = PageGetMaxOffsetNumber(page); - pop = (BTPageOpaque) PageGetSpecialPointer(page); - } - - - /* find the nearest match to the manufactured scan key on the page */ - offnum = _bt_binsrch(rel, buf, keysCount, scankeys, BT_DESCENT); - - if (offnum > maxoff) - { - offnum = maxoff; - offGmax = true; - } + offnum = _bt_binsrch(rel, buf, keysCount, scankeys); ItemPointerSet(current, blkno, offnum); - /* - * Now find the right place to start the scan. Result is the value - * we're looking for minus the value we're looking at in the index. 
+ /*---------- + * At this point we are positioned at the first item >= scan key, + * or possibly at the end of a page on which all the existing items + * are < scan key and we know that everything on later pages is + * >= scan key. We could step forward in the latter case, but that'd + * be a waste of time if we want to scan backwards. So, it's now time to + * examine the scan strategy to find the exact place to start the scan. + * + * Note: if _bt_step fails (meaning we fell off the end of the index + * in one direction or the other), we either return NULL (no matches) or + * call _bt_endpoint() to set up a scan starting at that index endpoint, + * as appropriate for the desired scan type. + * + * it's yet other place to add some code latter for is(not)null ... + *---------- */ - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); - - /* it's yet other place to add some code latter for is(not)null */ - - strat = strat_total; - switch (strat) + switch (strat_total) { case BTLessStrategyNumber: - if (result <= 0) + /* + * Back up one to arrive at last item < scankey + */ + if (!_bt_step(scan, &buf, BackwardScanDirection)) { - do - { - if (!_bt_twostep(scan, &buf, BackwardScanDirection)) - break; - - offnum = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); - } while (result <= 0); - + pfree(scankeys); + return (RetrieveIndexResult) NULL; } break; case BTLessEqualStrategyNumber: - if (result >= 0) + /* + * We need to find the last item <= scankey, so step forward + * till we find one > scankey, then step back one. 
+ */ + if (offnum > PageGetMaxOffsetNumber(page)) { - do + if (!_bt_step(scan, &buf, ForwardScanDirection)) { - if (!_bt_twostep(scan, &buf, ForwardScanDirection)) - break; - - offnum = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); - } while (result >= 0); + pfree(scankeys); + return _bt_endpoint(scan, dir); + } } - if (result < 0) - _bt_twostep(scan, &buf, BackwardScanDirection); - break; - - case BTEqualStrategyNumber: - if (result != 0) + for (;;) + { + offnum = ItemPointerGetOffsetNumber(current); + page = BufferGetPage(buf); + result = _bt_compare(rel, keysCount, scankeys, page, offnum); + if (result < 0) + break; + if (!_bt_step(scan, &buf, ForwardScanDirection)) + { + pfree(scankeys); + return _bt_endpoint(scan, dir); + } + } + if (!_bt_step(scan, &buf, BackwardScanDirection)) { - _bt_relbuf(scan->relation, buf, BT_READ); - so->btso_curbuf = InvalidBuffer; - ItemPointerSetInvalid(&(scan->currentItemData)); pfree(scankeys); return (RetrieveIndexResult) NULL; } - else if (ScanDirectionIsBackward(dir)) + break; + + case BTEqualStrategyNumber: + /* + * Make sure we are on the first equal item; might have to step + * forward if currently at end of page. + */ + if (offnum > PageGetMaxOffsetNumber(page)) + { + if (!_bt_step(scan, &buf, ForwardScanDirection)) + { + pfree(scankeys); + return (RetrieveIndexResult) NULL; + } + offnum = ItemPointerGetOffsetNumber(current); + page = BufferGetPage(buf); + } + result = _bt_compare(rel, keysCount, scankeys, page, offnum); + if (result != 0) + goto nomatches; /* no equal items! */ + /* + * If a backward scan was specified, need to start with last + * equal item not first one. 
+ */ + if (ScanDirectionIsBackward(dir)) { do { - if (!_bt_twostep(scan, &buf, ForwardScanDirection)) - break; - + if (!_bt_step(scan, &buf, ForwardScanDirection)) + { + pfree(scankeys); + return _bt_endpoint(scan, dir); + } offnum = ItemPointerGetOffsetNumber(current); page = BufferGetPage(buf); - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); + result = _bt_compare(rel, keysCount, scankeys, page, offnum); } while (result == 0); - - if (result < 0) - _bt_twostep(scan, &buf, BackwardScanDirection); + if (!_bt_step(scan, &buf, BackwardScanDirection)) + elog(ERROR, "_bt_first: equal items disappeared?"); } break; case BTGreaterEqualStrategyNumber: - if (offGmax) + /* + * We want the first item >= scankey, which is where we are... + * unless we're not anywhere at all... + */ + if (offnum > PageGetMaxOffsetNumber(page)) { - if (result < 0) + if (!_bt_step(scan, &buf, ForwardScanDirection)) { - Assert(!P_RIGHTMOST(pop) && maxoff == P_HIKEY); - if (!_bt_step(scan, &buf, ForwardScanDirection)) - { - _bt_relbuf(scan->relation, buf, BT_READ); - so->btso_curbuf = InvalidBuffer; - ItemPointerSetInvalid(&(scan->currentItemData)); - pfree(scankeys); - return (RetrieveIndexResult) NULL; - } + pfree(scankeys); + return (RetrieveIndexResult) NULL; } - else if (result > 0) - { /* Just remember: _bt_binsrch() returns - * the OffsetNumber of the first matching - * key on the page, or the OffsetNumber at - * which the matching key WOULD APPEAR IF - * IT WERE on this page. No key on this - * page, but offnum from _bt_binsrch() - * greater maxoff - have to move right. 
- - * vadim 12/06/96 */ - _bt_twostep(scan, &buf, ForwardScanDirection); - } - } - else if (result < 0) - { - do - { - if (!_bt_twostep(scan, &buf, BackwardScanDirection)) - break; - - page = BufferGetPage(buf); - offnum = ItemPointerGetOffsetNumber(current); - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); - } while (result < 0); - - if (result > 0) - _bt_twostep(scan, &buf, ForwardScanDirection); } break; case BTGreaterStrategyNumber: - /* offGmax helps as above */ - if (result >= 0 || offGmax) + /* + * We want the first item > scankey, so make sure we are on + * an item and then step over any equal items. + */ + if (offnum > PageGetMaxOffsetNumber(page)) { - do + if (!_bt_step(scan, &buf, ForwardScanDirection)) { - if (!_bt_twostep(scan, &buf, ForwardScanDirection)) - break; - - offnum = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); - } while (result >= 0); + pfree(scankeys); + return (RetrieveIndexResult) NULL; + } + offnum = ItemPointerGetOffsetNumber(current); + page = BufferGetPage(buf); + } + result = _bt_compare(rel, keysCount, scankeys, page, offnum); + while (result == 0) + { + if (!_bt_step(scan, &buf, ForwardScanDirection)) + { + pfree(scankeys); + return (RetrieveIndexResult) NULL; + } + offnum = ItemPointerGetOffsetNumber(current); + page = BufferGetPage(buf); + result = _bt_compare(rel, keysCount, scankeys, page, offnum); } break; } - pfree(scankeys); /* okay, current item pointer for the scan is right */ offnum = ItemPointerGetOffsetNumber(current); page = BufferGetPage(buf); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); itup = &btitem->bti_itup; + /* is the first item actually acceptable? 
*/ if (_bt_checkkeys(scan, itup, &keysok)) { + /* yes, return it */ res = FormRetrieveIndexResult(current, &(itup->t_tid)); - - /* remember which buffer we have pinned */ - so->btso_curbuf = buf; } - else if (keysok >= so->numberOfFirstKeys) + else if (keysok >= so->numberOfFirstKeys || + (keysok == ((Size) -1) && ScanDirectionIsBackward(dir))) { - so->btso_curbuf = buf; - return _bt_next(scan, dir); - } - else if (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)) - { - so->btso_curbuf = buf; - return _bt_next(scan, dir); + /* no, but there might be another one that is */ + res = _bt_next(scan, dir); } else { + /* no tuples in the index match this scan key */ +nomatches: ItemPointerSetInvalid(current); so->btso_curbuf = InvalidBuffer; _bt_relbuf(rel, buf, BT_READ); res = (RetrieveIndexResult) NULL; } + pfree(scankeys); + return res; } @@ -1047,276 +767,128 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) * _bt_step() -- Step one item in the requested direction in a scan on * the tree. * - * If no adjacent record exists in the requested direction, return - * false. Else, return true and set the currentItemData for the - * scan to the right thing. + * *bufP is the current buffer (read-locked and pinned). If we change + * pages, it's updated appropriately. + * + * If successful, update scan's currentItemData and return true. + * If no adjacent record exists in the requested direction, + * release buffer pin/locks and return false. 
*/ bool _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir) { + Relation rel = scan->relation; + ItemPointer current = &(scan->currentItemData); + BTScanOpaque so = (BTScanOpaque) scan->opaque; Page page; BTPageOpaque opaque; OffsetNumber offnum, maxoff; - OffsetNumber start; BlockNumber blkno; BlockNumber obknum; - BTScanOpaque so; - ItemPointer current; - Relation rel; - - rel = scan->relation; - current = &(scan->currentItemData); /* * Don't use ItemPointerGetOffsetNumber or you risk to get assertion * due to ability of ip_posid to be equal 0. */ offnum = current->ip_posid; + page = BufferGetPage(*bufP); opaque = (BTPageOpaque) PageGetSpecialPointer(page); - so = (BTScanOpaque) scan->opaque; maxoff = PageGetMaxOffsetNumber(page); - /* get the next tuple */ if (ScanDirectionIsForward(dir)) { if (!PageIsEmpty(page) && offnum < maxoff) offnum = OffsetNumberNext(offnum); else { - - /* if we're at end of scan, release the buffer and return */ - blkno = opaque->btpo_next; - if (P_RIGHTMOST(opaque)) + /* walk right to the next page with data */ + for (;;) { - _bt_relbuf(rel, *bufP, BT_READ); - ItemPointerSetInvalid(current); - *bufP = so->btso_curbuf = InvalidBuffer; - return false; - } - else - { - - /* walk right to the next page with data */ - _bt_relbuf(rel, *bufP, BT_READ); - for (;;) + /* if we're at end of scan, release the buffer and return */ + if (P_RIGHTMOST(opaque)) { - *bufP = _bt_getbuf(rel, blkno, BT_READ); - page = BufferGetPage(*bufP); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - maxoff = PageGetMaxOffsetNumber(page); - start = P_RIGHTMOST(opaque) ? 
P_HIKEY : P_FIRSTKEY; - - if (!PageIsEmpty(page) && start <= maxoff) - break; - else - { - blkno = opaque->btpo_next; - _bt_relbuf(rel, *bufP, BT_READ); - if (blkno == P_NONE) - { - *bufP = so->btso_curbuf = InvalidBuffer; - ItemPointerSetInvalid(current); - return false; - } - } + _bt_relbuf(rel, *bufP, BT_READ); + ItemPointerSetInvalid(current); + *bufP = so->btso_curbuf = InvalidBuffer; + return false; } - offnum = start; + /* step right one page */ + blkno = opaque->btpo_next; + _bt_relbuf(rel, *bufP, BT_READ); + *bufP = _bt_getbuf(rel, blkno, BT_READ); + page = BufferGetPage(*bufP); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + maxoff = PageGetMaxOffsetNumber(page); + /* done if it's not empty */ + offnum = P_FIRSTDATAKEY(opaque); + if (!PageIsEmpty(page) && offnum <= maxoff) + break; } } } - else if (ScanDirectionIsBackward(dir)) + else { - - /* remember that high key is item zero on non-rightmost pages */ - start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - - if (offnum > start) + if (offnum > P_FIRSTDATAKEY(opaque)) offnum = OffsetNumberPrev(offnum); else { - - /* if we're at end of scan, release the buffer and return */ - blkno = opaque->btpo_prev; - if (P_LEFTMOST(opaque)) + /* walk left to the next page with data */ + for (;;) { - _bt_relbuf(rel, *bufP, BT_READ); - *bufP = so->btso_curbuf = InvalidBuffer; - ItemPointerSetInvalid(current); - return false; - } - else - { - - obknum = BufferGetBlockNumber(*bufP); - - /* walk right to the next page with data */ - _bt_relbuf(rel, *bufP, BT_READ); - for (;;) + /* if we're at end of scan, release the buffer and return */ + if (P_LEFTMOST(opaque)) { + _bt_relbuf(rel, *bufP, BT_READ); + ItemPointerSetInvalid(current); + *bufP = so->btso_curbuf = InvalidBuffer; + return false; + } + /* step left */ + obknum = BufferGetBlockNumber(*bufP); + blkno = opaque->btpo_prev; + _bt_relbuf(rel, *bufP, BT_READ); + *bufP = _bt_getbuf(rel, blkno, BT_READ); + page = BufferGetPage(*bufP); + opaque = (BTPageOpaque) 
PageGetSpecialPointer(page); + /* + * If the adjacent page just split, then we have to walk + * right to find the block that's now adjacent to where + * we were. Because pages only split right, we don't have + * to worry about this failing to terminate. + */ + while (opaque->btpo_next != obknum) + { + blkno = opaque->btpo_next; + _bt_relbuf(rel, *bufP, BT_READ); *bufP = _bt_getbuf(rel, blkno, BT_READ); page = BufferGetPage(*bufP); opaque = (BTPageOpaque) PageGetSpecialPointer(page); - maxoff = PageGetMaxOffsetNumber(page); - - /* - * If the adjacent page just split, then we may have - * the wrong block. Handle this case. Because pages - * only split right, we don't have to worry about this - * failing to terminate. - */ - - while (opaque->btpo_next != obknum) - { - blkno = opaque->btpo_next; - _bt_relbuf(rel, *bufP, BT_READ); - *bufP = _bt_getbuf(rel, blkno, BT_READ); - page = BufferGetPage(*bufP); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - maxoff = PageGetMaxOffsetNumber(page); - } - - /* don't consider the high key */ - start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - - /* anything to look at here? */ - if (!PageIsEmpty(page) && maxoff >= start) - break; - else - { - blkno = opaque->btpo_prev; - obknum = BufferGetBlockNumber(*bufP); - _bt_relbuf(rel, *bufP, BT_READ); - if (blkno == P_NONE) - { - *bufP = so->btso_curbuf = InvalidBuffer; - ItemPointerSetInvalid(current); - return false; - } - } } - offnum = maxoff;/* XXX PageIsEmpty? */ + /* done if it's not empty */ + maxoff = PageGetMaxOffsetNumber(page); + offnum = maxoff; + if (!PageIsEmpty(page) && maxoff >= P_FIRSTDATAKEY(opaque)) + break; } } } - blkno = BufferGetBlockNumber(*bufP); + + /* Update scan state */ so->btso_curbuf = *bufP; + blkno = BufferGetBlockNumber(*bufP); ItemPointerSet(current, blkno, offnum); return true; } -/* - * _bt_twostep() -- Move to an adjacent record in a scan on the tree, - * if an adjacent record exists. 
- * - * This is like _bt_step, except that if no adjacent record exists - * it restores us to where we were before trying the step. This is - * only hairy when you cross page boundaries, since the page you cross - * from could have records inserted or deleted, or could even split. - * This is unlikely, but we try to handle it correctly here anyway. - * - * This routine contains the only case in which our changes to Lehman - * and Yao's algorithm. - * - * Like step, this routine leaves the scan's currentItemData in the - * proper state and acquires a lock and pin on *bufP. If the twostep - * succeeded, we return true; otherwise, we return false. - */ -static bool -_bt_twostep(IndexScanDesc scan, Buffer *bufP, ScanDirection dir) -{ - Page page; - BTPageOpaque opaque; - OffsetNumber offnum, - maxoff; - OffsetNumber start; - ItemPointer current; - ItemId itemid; - int itemsz; - BTItem btitem; - BTItem svitem; - BlockNumber blkno; - - blkno = BufferGetBlockNumber(*bufP); - page = BufferGetPage(*bufP); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - maxoff = PageGetMaxOffsetNumber(page); - current = &(scan->currentItemData); - offnum = ItemPointerGetOffsetNumber(current); - - start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - - /* if we're safe, just do it */ - if (ScanDirectionIsForward(dir) && offnum < maxoff) - { /* XXX PageIsEmpty? */ - ItemPointerSet(current, blkno, OffsetNumberNext(offnum)); - return true; - } - else if (ScanDirectionIsBackward(dir) && offnum > start) - { - ItemPointerSet(current, blkno, OffsetNumberPrev(offnum)); - return true; - } - - /* if we've hit end of scan we don't have to do any work */ - if (ScanDirectionIsForward(dir) && P_RIGHTMOST(opaque)) - return false; - else if (ScanDirectionIsBackward(dir) && P_LEFTMOST(opaque)) - return false; - - /* - * Okay, it's off the page; let _bt_step() do the hard work, and we'll - * try to remember where we were. 
This is not guaranteed to work; - * this is the only place in the code where concurrency can screw us - * up, and it's because we want to be able to move in two directions - * in the scan. - */ - - itemid = PageGetItemId(page, offnum); - itemsz = ItemIdGetLength(itemid); - btitem = (BTItem) PageGetItem(page, itemid); - svitem = (BTItem) palloc(itemsz); - memmove((char *) svitem, (char *) btitem, itemsz); - - if (_bt_step(scan, bufP, dir)) - { - pfree(svitem); - return true; - } - - /* try to find our place again */ - *bufP = _bt_getbuf(scan->relation, blkno, BT_READ); - page = BufferGetPage(*bufP); - maxoff = PageGetMaxOffsetNumber(page); - - while (offnum <= maxoff) - { - itemid = PageGetItemId(page, offnum); - btitem = (BTItem) PageGetItem(page, itemid); - if (BTItemSame(btitem, svitem)) - { - pfree(svitem); - ItemPointerSet(current, blkno, offnum); - return false; - } - } - - /* - * XXX crash and burn -- can't find our place. We can be a little - * smarter -- walk to the next page to the right, for example, since - * that's the only direction that splits happen in. Deletions screw - * us up less often since they're only done by the vacuum daemon. - */ - - elog(ERROR, "btree synchronization error: concurrent update botched scan"); - - return false; -} - /* * _bt_endpoint() -- Find the first or last key in the index. + * + * This is used by _bt_first() to set up a scan when we've determined + * that the scan must start at the beginning or end of the index (for + * a forward or backward scan respectively). 
*/ static RetrieveIndexResult _bt_endpoint(IndexScanDesc scan, ScanDirection dir) @@ -1328,7 +900,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) ItemPointer current; OffsetNumber offnum, maxoff; - OffsetNumber start = 0; + OffsetNumber start; BlockNumber blkno; BTItem btitem; IndexTuple itup; @@ -1340,38 +912,50 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) current = &(scan->currentItemData); so = (BTScanOpaque) scan->opaque; + /* + * Scan down to the leftmost or rightmost leaf page. This is a + * simplified version of _bt_search(). We don't maintain a stack + * since we know we won't need it. + */ buf = _bt_getroot(rel, BT_READ); + + if (! BufferIsValid(buf)) + { + /* empty index... */ + ItemPointerSetInvalid(current); + so->btso_curbuf = InvalidBuffer; + return (RetrieveIndexResult) NULL; + } + blkno = BufferGetBlockNumber(buf); page = BufferGetPage(buf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); for (;;) { - if (opaque->btpo_flags & BTP_LEAF) + if (P_ISLEAF(opaque)) break; if (ScanDirectionIsForward(dir)) - offnum = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; + offnum = P_FIRSTDATAKEY(opaque); else offnum = PageGetMaxOffsetNumber(page); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); itup = &(btitem->bti_itup); - blkno = ItemPointerGetBlockNumber(&(itup->t_tid)); _bt_relbuf(rel, buf, BT_READ); buf = _bt_getbuf(rel, blkno, BT_READ); + page = BufferGetPage(buf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); /* - * Race condition: If the child page we just stepped onto is in - * the process of being split, we need to make sure we're all the - * way at the right edge of the tree. See the paper by Lehman and - * Yao. + * Race condition: If the child page we just stepped onto was just + * split, we need to make sure we're all the way at the right edge + * of the tree. See the paper by Lehman and Yao. 
*/ - if (ScanDirectionIsBackward(dir) && !P_RIGHTMOST(opaque)) { do @@ -1390,101 +974,39 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) if (ScanDirectionIsForward(dir)) { - if (!P_LEFTMOST(opaque))/* non-leftmost page ? */ - elog(ERROR, "_bt_endpoint: leftmost page (%u) has not leftmost flag", blkno); - start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; + Assert(P_LEFTMOST(opaque)); - /* - * I don't understand this stuff! It doesn't work for - * non-rightmost pages with only one element (P_HIKEY) which we - * have after deletion itups by vacuum (it's case of start > - * maxoff). Scanning in BackwardScanDirection is not - * understandable at all. Well - new stuff. - vadim 12/06/96 - */ -#ifdef NOT_USED - if (PageIsEmpty(page) || start > maxoff) - { - ItemPointerSet(current, blkno, maxoff); - if (!_bt_step(scan, &buf, BackwardScanDirection)) - return (RetrieveIndexResult) NULL; - - start = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - } -#endif - if (PageIsEmpty(page)) - { - if (start != P_HIKEY) /* non-rightmost page */ - elog(ERROR, "_bt_endpoint: non-rightmost page (%u) is empty", blkno); - - /* - * It's left- & right- most page - root page, - and it's - * empty... - */ - _bt_relbuf(rel, buf, BT_READ); - ItemPointerSetInvalid(current); - so->btso_curbuf = InvalidBuffer; - return (RetrieveIndexResult) NULL; - } - if (start > maxoff) /* start == 2 && maxoff == 1 */ - { - ItemPointerSet(current, blkno, maxoff); - if (!_bt_step(scan, &buf, ForwardScanDirection)) - return (RetrieveIndexResult) NULL; - - start = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - } - /* new stuff ends here */ - else - ItemPointerSet(current, blkno, start); + start = P_FIRSTDATAKEY(opaque); } else if (ScanDirectionIsBackward(dir)) { + Assert(P_RIGHTMOST(opaque)); - /* - * I don't understand this stuff too! If RIGHT-most leaf page is - * empty why do scanning in ForwardScanDirection ??? Well - new - * stuff. 
- vadim 12/06/96 - */ -#ifdef NOT_USED - if (PageIsEmpty(page)) - { - ItemPointerSet(current, blkno, FirstOffsetNumber); - if (!_bt_step(scan, &buf, ForwardScanDirection)) - return (RetrieveIndexResult) NULL; - - start = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - } -#endif - if (PageIsEmpty(page)) - { - /* If it's leftmost page too - it's empty root page... */ - if (P_LEFTMOST(opaque)) - { - _bt_relbuf(rel, buf, BT_READ); - ItemPointerSetInvalid(current); - so->btso_curbuf = InvalidBuffer; - return (RetrieveIndexResult) NULL; - } - /* Go back ! */ - ItemPointerSet(current, blkno, FirstOffsetNumber); - if (!_bt_step(scan, &buf, BackwardScanDirection)) - return (RetrieveIndexResult) NULL; - - start = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - } - /* new stuff ends here */ - else - { - start = PageGetMaxOffsetNumber(page); - ItemPointerSet(current, blkno, start); - } + start = PageGetMaxOffsetNumber(page); + if (start < P_FIRSTDATAKEY(opaque)) /* watch out for empty page */ + start = P_FIRSTDATAKEY(opaque); } else + { elog(ERROR, "Illegal scan direction %d", dir); + start = 0; /* keep compiler quiet */ + } + + ItemPointerSet(current, blkno, start); + /* remember which buffer we have pinned */ + so->btso_curbuf = buf; + + /* + * Left/rightmost page could be empty due to deletions, + * if so step till we find a nonempty page. 
+ */ + if (start > maxoff) + { + if (!_bt_step(scan, &buf, dir)) + return (RetrieveIndexResult) NULL; + start = ItemPointerGetOffsetNumber(current); + page = BufferGetPage(buf); + } btitem = (BTItem) PageGetItem(page, PageGetItemId(page, start)); itup = &(btitem->bti_itup); @@ -1492,23 +1014,18 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) /* see if we picked a winner */ if (_bt_checkkeys(scan, itup, &keysok)) { + /* yes, return it */ res = FormRetrieveIndexResult(current, &(itup->t_tid)); - - /* remember which buffer we have pinned */ - so->btso_curbuf = buf; } - else if (keysok >= so->numberOfFirstKeys) + else if (keysok >= so->numberOfFirstKeys || + (keysok == ((Size) -1) && ScanDirectionIsBackward(dir))) { - so->btso_curbuf = buf; - return _bt_next(scan, dir); - } - else if (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)) - { - so->btso_curbuf = buf; - return _bt_next(scan, dir); + /* no, but there might be another one that is */ + res = _bt_next(scan, dir); } else { + /* no tuples in the index match this scan key */ ItemPointerSetInvalid(current); so->btso_curbuf = InvalidBuffer; _bt_relbuf(rel, buf, BT_READ); diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index 458abe7754..1981f55469 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -6,8 +6,12 @@ * * We use tuplesort.c to sort the given index tuples into order. * Then we scan the index tuples in order and build the btree pages - * for each level. When we have only one page on a level, it must be the - * root -- it can be attached to the btree metapage and we are done. + * for each level. We load source tuples into leaf-level pages. + * Whenever we fill a page at one level, we add a link to it to its + * parent level (starting a new parent level if necessary). When + * done, we write out each final page on each level, adding it to + * its parent level. 
When we have only one page on a level, it must be + * the root -- it can be attached to the btree metapage and we are done. * * this code is moderately slow (~10% slower) compared to the regular * btree (insertion) build code on sorted or well-clustered data. on @@ -23,12 +27,20 @@ * something like the standard 70% steady-state load factor for btrees * would probably be better. * + * Another limitation is that we currently load full copies of all keys + * into upper tree levels. The leftmost data key in each non-leaf node + * could be omitted as far as normal btree operations are concerned + * (see README for more info). However, because we build the tree from + * the bottom up, we need that data key to insert into the node's parent. + * This could be fixed by keeping a spare copy of the minimum key in the + * state stack, but I haven't time for that right now. + * * * Portions Copyright (c) 1996-2000, PostgreSQL, Inc * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION - * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.54 2000/06/15 04:09:36 momjian Exp $ + * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.55 2000/07/21 06:42:33 tgl Exp $ * *------------------------------------------------------------------------- */ @@ -57,6 +69,20 @@ struct BTSpool bool isunique; }; +/* + * Status record for a btree page being built. We have one of these + * for each active tree level. + */ +typedef struct BTPageState +{ + Buffer btps_buf; /* current buffer & page */ + Page btps_page; + OffsetNumber btps_lastoff; /* last item offset loaded */ + int btps_level; + struct BTPageState *btps_next; /* link to parent level, if any */ +} BTPageState; + + #define BTITEMSZ(btitem) \ ((btitem) ? 
\ (IndexTupleDSize((btitem)->bti_itup) + \ @@ -65,13 +91,11 @@ struct BTSpool static void _bt_load(Relation index, BTSpool *btspool); -static BTItem _bt_buildadd(Relation index, Size keysz, ScanKey scankey, - BTPageState *state, BTItem bti, int flags); +static void _bt_buildadd(Relation index, BTPageState *state, + BTItem bti, int flags); static BTItem _bt_minitem(Page opage, BlockNumber oblkno, int atend); -static BTPageState *_bt_pagestate(Relation index, int flags, - int level, bool doupper); -static void _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey, - BTPageState *state); +static BTPageState *_bt_pagestate(Relation index, int flags, int level); +static void _bt_uppershutdown(Relation index, BTPageState *state); /* @@ -159,9 +183,6 @@ _bt_blnewpage(Relation index, Buffer *buf, Page *page, int flags) BTPageOpaque opaque; *buf = _bt_getbuf(index, P_NEW, BT_WRITE); -#ifdef NOT_USED - printf("\tblk=%d\n", BufferGetBlockNumber(*buf)); -#endif *page = BufferGetPage(*buf); _bt_pageinit(*page, BufferGetPageSize(*buf)); opaque = (BTPageOpaque) PageGetSpecialPointer(*page); @@ -202,18 +223,15 @@ _bt_slideleft(Relation index, Buffer buf, Page page) * is suitable for immediate use by _bt_buildadd. */ static BTPageState * -_bt_pagestate(Relation index, int flags, int level, bool doupper) +_bt_pagestate(Relation index, int flags, int level) { BTPageState *state = (BTPageState *) palloc(sizeof(BTPageState)); MemSet((char *) state, 0, sizeof(BTPageState)); _bt_blnewpage(index, &(state->btps_buf), &(state->btps_page), flags); - state->btps_firstoff = InvalidOffsetNumber; state->btps_lastoff = P_HIKEY; - state->btps_lastbti = (BTItem) NULL; state->btps_next = (BTPageState *) NULL; state->btps_level = level; - state->btps_doupper = doupper; return state; } @@ -240,31 +258,27 @@ _bt_minitem(Page opage, BlockNumber oblkno, int atend) } /* - * add an item to a disk page from a merge tape block. + * add an item to a disk page from the sort output. 
* * we must be careful to observe the following restrictions, placed * upon us by the conventions in nbtsearch.c: * - rightmost pages start data items at P_HIKEY instead of at * P_FIRSTKEY. - * - duplicates cannot be split among pages unless the chain of - * duplicates starts at the first data item. * * a leaf page being built looks like: * * +----------------+---------------------------------+ * | PageHeaderData | linp0 linp1 linp2 ... | * +-----------+----+---------------------------------+ - * | ... linpN | ^ first | + * | ... linpN | | * +-----------+--------------------------------------+ * | ^ last | * | | - * | v last | * +-------------+------------------------------------+ * | | itemN ... | * +-------------+------------------+-----------------+ * | ... item3 item2 item1 | "special space" | * +--------------------------------+-----------------+ - * ^ first * * contrast this with the diagram in bufpage.h; note the mismatch * between linps and items. this is because we reserve linp0 as a @@ -272,30 +286,20 @@ _bt_minitem(Page opage, BlockNumber oblkno, int atend) * filled up the page, we will set linp0 to point to itemN and clear * linpN. * - * 'last' pointers indicate the last offset/item added to the page. - * 'first' pointers indicate the first offset/item that is part of a - * chain of duplicates extending from 'first' to 'last'. - * - * if all keys are unique, 'first' will always be the same as 'last'. + * 'last' pointer indicates the last offset added to the page. 
*/ -static BTItem -_bt_buildadd(Relation index, Size keysz, ScanKey scankey, - BTPageState *state, BTItem bti, int flags) +static void +_bt_buildadd(Relation index, BTPageState *state, BTItem bti, int flags) { Buffer nbuf; Page npage; - BTItem last_bti; - OffsetNumber first_off; OffsetNumber last_off; - OffsetNumber off; Size pgspc; Size btisz; nbuf = state->btps_buf; npage = state->btps_page; - first_off = state->btps_firstoff; last_off = state->btps_lastoff; - last_bti = state->btps_lastbti; pgspc = PageGetFreeSpace(npage); btisz = BTITEMSZ(bti); @@ -319,75 +323,55 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey, if (pgspc < btisz) { + /* + * Item won't fit on this page, so finish off the page and + * write it out. + */ Buffer obuf = nbuf; Page opage = npage; - OffsetNumber o, - n; ItemId ii; ItemId hii; + BTItem nbti; _bt_blnewpage(index, &nbuf, &npage, flags); /* - * if 'last' is part of a chain of duplicates that does not start - * at the beginning of the old page, the entire chain is copied to - * the new page; we delete all of the duplicates from the old page - * except the first, which becomes the high key item of the old - * page. + * We copy the last item on the page into the new page, and then + * rearrange the old page so that the 'last item' becomes its high + * key rather than a true data item. * - * if the chain starts at the beginning of the page or there is no - * chain ('first' == 'last'), we need only copy 'last' to the new - * page. again, 'first' (== 'last') becomes the high key of the - * old page. - * - * note that in either case, we copy at least one item to the new - * page, so 'last_bti' will always be valid. 'bti' will never be - * the first data item on the new page. + * note that since we always copy an item to the new page, + * 'bti' will never be the first data item on the new page. 
*/ - if (first_off == P_FIRSTKEY) - { - Assert(last_off != P_FIRSTKEY); - first_off = last_off; - } - for (o = first_off, n = P_FIRSTKEY; - o <= last_off; - o = OffsetNumberNext(o), n = OffsetNumberNext(n)) - { - ii = PageGetItemId(opage, o); - if (PageAddItem(npage, PageGetItem(opage, ii), - ii->lp_len, n, LP_USED) == InvalidOffsetNumber) - elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)"); + ii = PageGetItemId(opage, last_off); + if (PageAddItem(npage, PageGetItem(opage, ii), ii->lp_len, + P_FIRSTKEY, LP_USED) == InvalidOffsetNumber) + elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)"); #ifdef FASTBUILD_DEBUG - { - bool isnull; - BTItem tmpbti = - (BTItem) PageGetItem(npage, PageGetItemId(npage, n)); - Datum d = index_getattr(&(tmpbti->bti_itup), 1, - index->rd_att, &isnull); + { + bool isnull; + BTItem tmpbti = + (BTItem) PageGetItem(npage, PageGetItemId(npage, P_FIRSTKEY)); + Datum d = index_getattr(&(tmpbti->bti_itup), 1, + index->rd_att, &isnull); - printf("_bt_buildadd: moved <%x> to offset %d at level %d\n", - d, n, state->btps_level); - } -#endif + printf("_bt_buildadd: moved <%x> to offset %d at level %d\n", + d, P_FIRSTKEY, state->btps_level); } +#endif /* - * this loop is backward because PageIndexTupleDelete shuffles the - * tuples to fill holes in the page -- by starting at the end and - * working back, we won't create holes (and thereby avoid - * shuffling). + * Move 'last' into the high key position on opage */ - for (o = last_off; o > first_off; o = OffsetNumberPrev(o)) - PageIndexTupleDelete(opage, o); hii = PageGetItemId(opage, P_HIKEY); - ii = PageGetItemId(opage, first_off); *hii = *ii; ii->lp_flags &= ~LP_USED; ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData); - first_off = P_FIRSTKEY; + /* + * Reset last_off to point to new page + */ last_off = PageGetMaxOffsetNumber(npage); - last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, last_off)); /* * set the page (side link) pointers. 
@@ -399,32 +383,21 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey, oopaque->btpo_next = BufferGetBlockNumber(nbuf); nopaque->btpo_prev = BufferGetBlockNumber(obuf); nopaque->btpo_next = P_NONE; - - if (_bt_itemcmp(index, keysz, scankey, - (BTItem) PageGetItem(opage, PageGetItemId(opage, P_HIKEY)), - (BTItem) PageGetItem(opage, PageGetItemId(opage, P_FIRSTKEY)), - BTEqualStrategyNumber)) - oopaque->btpo_flags |= BTP_CHAIN; } /* - * copy the old buffer's minimum key to its parent. if we don't - * have a parent, we have to create one; this adds a new btree - * level. + * Link the old buffer into its parent, using its minimum key. + * If we don't have a parent, we have to create one; + * this adds a new btree level. */ - if (state->btps_doupper) + if (state->btps_next == (BTPageState *) NULL) { - BTItem nbti; - - if (state->btps_next == (BTPageState *) NULL) - { - state->btps_next = - _bt_pagestate(index, 0, state->btps_level + 1, true); - } - nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0); - _bt_buildadd(index, keysz, scankey, state->btps_next, nbti, 0); - pfree((void *) nbti); + state->btps_next = + _bt_pagestate(index, 0, state->btps_level + 1); } + nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0); + _bt_buildadd(index, state->btps_next, nbti, 0); + pfree((void *) nbti); /* * write out the old stuff. we never want to see it again, so we @@ -435,11 +408,11 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey, } /* - * if this item is different from the last item added, we start a new - * chain of duplicates. + * Add the new item into the current page. 
*/ - off = OffsetNumberNext(last_off); - if (PageAddItem(npage, (Item) bti, btisz, off, LP_USED) == InvalidOffsetNumber) + last_off = OffsetNumberNext(last_off); + if (PageAddItem(npage, (Item) bti, btisz, + last_off, LP_USED) == InvalidOffsetNumber) elog(FATAL, "btree: failed to add item to the page in _bt_sort (2)"); #ifdef FASTBUILD_DEBUG { @@ -447,65 +420,57 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey, Datum d = index_getattr(&(bti->bti_itup), 1, index->rd_att, &isnull); printf("_bt_buildadd: inserted <%x> at offset %d at level %d\n", - d, off, state->btps_level); + d, last_off, state->btps_level); } #endif - if (last_bti == (BTItem) NULL) - first_off = P_FIRSTKEY; - else if (!_bt_itemcmp(index, keysz, scankey, - bti, last_bti, BTEqualStrategyNumber)) - first_off = off; - last_off = off; - last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, off)); state->btps_buf = nbuf; state->btps_page = npage; - state->btps_lastbti = last_bti; state->btps_lastoff = last_off; - state->btps_firstoff = first_off; - - return last_bti; } +/* + * Finish writing out the completed btree. + */ static void -_bt_uppershutdown(Relation index, Size keysz, ScanKey scankey, - BTPageState *state) +_bt_uppershutdown(Relation index, BTPageState *state) { BTPageState *s; BlockNumber blkno; BTPageOpaque opaque; BTItem bti; + /* + * Each iteration of this loop completes one more level of the tree. + */ for (s = state; s != (BTPageState *) NULL; s = s->btps_next) { blkno = BufferGetBlockNumber(s->btps_buf); opaque = (BTPageOpaque) PageGetSpecialPointer(s->btps_page); /* - * if this is the root, attach it to the metapage. otherwise, - * stick the minimum key of the last page on this level (which has - * not been split, or else it wouldn't be the last page) into its - * parent. this may cause the last page of upper levels to split, - * but that's not a problem -- we haven't gotten to them yet. + * We have to link the last page on this level to somewhere. 
+ * + * If we're at the top, it's the root, so attach it to the metapage. + * Otherwise, add an entry for it to its parent using its minimum + * key. This may cause the last page of the parent level to split, + * but that's not a problem -- we haven't gotten to it yet. */ - if (s->btps_doupper) + if (s->btps_next == (BTPageState *) NULL) { - if (s->btps_next == (BTPageState *) NULL) - { - opaque->btpo_flags |= BTP_ROOT; - _bt_metaproot(index, blkno, s->btps_level + 1); - } - else - { - bti = _bt_minitem(s->btps_page, blkno, 0); - _bt_buildadd(index, keysz, scankey, s->btps_next, bti, 0); - pfree((void *) bti); - } + opaque->btpo_flags |= BTP_ROOT; + _bt_metaproot(index, blkno, s->btps_level + 1); + } + else + { + bti = _bt_minitem(s->btps_page, blkno, 0); + _bt_buildadd(index, s->btps_next, bti, 0); + pfree((void *) bti); } /* - * this is the rightmost page, so the ItemId array needs to be - * slid back one slot. + * This is the rightmost page, so the ItemId array needs to be + * slid back one slot. Then we can dump out the page. */ _bt_slideleft(index, s->btps_buf, s->btps_page); _bt_wrtbuf(index, s->btps_buf); @@ -519,32 +484,27 @@ _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey, static void _bt_load(Relation index, BTSpool *btspool) { - BTPageState *state; - ScanKey skey; - int natts; - BTItem bti; - bool should_free; - - /* - * initialize state needed for the merge into the btree leaf pages. 
- */ - state = _bt_pagestate(index, BTP_LEAF, 0, true); - - skey = _bt_mkscankey_nodata(index); - natts = RelationGetNumberOfAttributes(index); + BTPageState *state = NULL; for (;;) { + BTItem bti; + bool should_free; + bti = (BTItem) tuplesort_getindextuple(btspool->sortstate, true, &should_free); if (bti == (BTItem) NULL) break; - _bt_buildadd(index, natts, skey, state, bti, BTP_LEAF); + + /* When we see first tuple, create first index page */ + if (state == NULL) + state = _bt_pagestate(index, BTP_LEAF, 0); + + _bt_buildadd(index, state, bti, BTP_LEAF); if (should_free) pfree((void *) bti); } - _bt_uppershutdown(index, natts, skey, state); - - _bt_freeskey(skey); + if (state != NULL) + _bt_uppershutdown(index, state); } diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c index 5853267670..aabdf80900 100644 --- a/src/backend/access/nbtree/nbtutils.c +++ b/src/backend/access/nbtree/nbtutils.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.37 2000/05/30 04:24:33 tgl Exp $ + * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.38 2000/07/21 06:42:33 tgl Exp $ * *------------------------------------------------------------------------- */ @@ -20,16 +20,13 @@ #include "access/nbtree.h" #include "executor/execdebug.h" -extern int NIndexTupleProcessed; - /* * _bt_mkscankey * Build a scan key that contains comparison data from itup * as well as comparator routines appropriate to the key datatypes. * - * The result is intended for use with _bt_skeycmp() or _bt_compare(), - * although it could be used with _bt_itemcmp() or _bt_tuplecompare(). + * The result is intended for use with _bt_compare(). */ ScanKey _bt_mkscankey(Relation rel, IndexTuple itup) @@ -68,8 +65,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup) * Build a scan key that contains comparator routines appropriate to * the key datatypes, but no comparison data. 
* - * The result can be used with _bt_itemcmp() or _bt_tuplecompare(), - * but not with _bt_skeycmp() or _bt_compare(). + * The result cannot be used with _bt_compare(). Currently this + * routine is only called by utils/sort/tuplesort.c, which has its + * own comparison routine. */ ScanKey _bt_mkscankey_nodata(Relation rel) @@ -114,7 +112,6 @@ _bt_freestack(BTStack stack) { ostack = stack; stack = stack->bts_parent; - pfree(ostack->bts_btitem); pfree(ostack); } } @@ -331,55 +328,16 @@ _bt_formitem(IndexTuple itup) Size tuplen; extern Oid newoid(); - /* - * see comments in btbuild - * - * if (itup->t_info & INDEX_NULL_MASK) elog(ERROR, "btree indices cannot - * include null keys"); - */ - /* make a copy of the index tuple with room for the sequence number */ tuplen = IndexTupleSize(itup); nbytes_btitem = tuplen + (sizeof(BTItemData) - sizeof(IndexTupleData)); btitem = (BTItem) palloc(nbytes_btitem); - memmove((char *) &(btitem->bti_itup), (char *) itup, tuplen); + memcpy((char *) &(btitem->bti_itup), (char *) itup, tuplen); return btitem; } -#ifdef NOT_USED -bool -_bt_checkqual(IndexScanDesc scan, IndexTuple itup) -{ - BTScanOpaque so; - - so = (BTScanOpaque) scan->opaque; - if (so->numberOfKeys > 0) - return (index_keytest(itup, RelationGetDescr(scan->relation), - so->numberOfKeys, so->keyData)); - else - return true; -} - -#endif - -#ifdef NOT_USED -bool -_bt_checkforkeys(IndexScanDesc scan, IndexTuple itup, Size keysz) -{ - BTScanOpaque so; - - so = (BTScanOpaque) scan->opaque; - if (keysz > 0 && so->numberOfKeys >= keysz) - return (index_keytest(itup, RelationGetDescr(scan->relation), - keysz, so->keyData)); - else - return true; -} - -#endif - bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, Size *keysok) { diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index 43cabceba1..1a970a1375 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * 
$Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.30 2000/07/03 02:54:16 vadim Exp $ + * $Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.31 2000/07/21 06:42:33 tgl Exp $ * *------------------------------------------------------------------------- */ @@ -19,10 +19,10 @@ #include "storage/bufpage.h" + static void PageIndexTupleDeleteAdjustLinePointers(PageHeader phdr, char *location, Size size); -static bool PageManagerShuffle = true; /* default is shuffle mode */ /* ---------------------------------------------------------------- * Page support functions @@ -53,21 +53,17 @@ PageInit(Page page, Size pageSize, Size specialSize) /* ---------------- * PageAddItem * - * add an item to a page. + * Add an item to a page. Return value is offset at which it was + * inserted, or InvalidOffsetNumber if there's not room to insert. * - * !!! ELOG(ERROR) IS DISALLOWED HERE !!! - * - * Notes on interface: - * If offsetNumber is valid, shuffle ItemId's down to make room - * to use it, if PageManagerShuffle is true. If PageManagerShuffle is - * false, then overwrite the specified ItemId. (PageManagerShuffle is - * true by default, and is modified by calling PageManagerModeSet.) + * If offsetNumber is valid and <= current max offset in the page, + * insert item into the array at that position by shuffling ItemId's + * down to make room. * If offsetNumber is not valid, then assign one by finding the first * one that is both unused and deallocated. * - * NOTE: If offsetNumber is valid, and PageManagerShuffle is true, it - * is assumed that there is room on the page to shuffle the ItemId's - * down by one. + * !!! ELOG(ERROR) IS DISALLOWED HERE !!! 
+ * * ---------------- */ OffsetNumber @@ -82,11 +78,8 @@ PageAddItem(Page page, Offset lower; Offset upper; ItemId itemId; - ItemId fromitemId, - toitemId; OffsetNumber limit; - - bool shuffled = false; + bool needshuffle = false; /* * Find first unallocated offsetNumber @@ -96,31 +89,12 @@ PageAddItem(Page page, /* was offsetNumber passed in? */ if (OffsetNumberIsValid(offsetNumber)) { - if (PageManagerShuffle == true) - { - /* shuffle ItemId's (Do the PageManager Shuffle...) */ - for (i = (limit - 1); i >= offsetNumber; i--) - { - fromitemId = &((PageHeader) page)->pd_linp[i - 1]; - toitemId = &((PageHeader) page)->pd_linp[i]; - *toitemId = *fromitemId; - } - shuffled = true; /* need to increase "lower" */ - } - else - { /* overwrite mode */ - itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1]; - if (((*itemId).lp_flags & LP_USED) || - ((*itemId).lp_len != 0)) - { - elog(NOTICE, "PageAddItem: tried overwrite of used ItemId"); - return InvalidOffsetNumber; - } - } + needshuffle = true; /* need to increase "lower" */ + /* don't actually do the shuffle till we've checked free space! */ } else - { /* offsetNumber was not passed in, so find - * one */ + { + /* offsetNumber was not passed in, so find one */ /* look for "recyclable" (unused & deallocated) ItemId */ for (offsetNumber = 1; offsetNumber < limit; offsetNumber++) { @@ -130,9 +104,13 @@ PageAddItem(Page page, break; } } + + /* + * Compute new lower and upper pointers for page, see if it'll fit + */ if (offsetNumber > limit) lower = (Offset) (((char *) (&((PageHeader) page)->pd_linp[offsetNumber])) - ((char *) page)); - else if (offsetNumber == limit || shuffled == true) + else if (offsetNumber == limit || needshuffle) lower = ((PageHeader) page)->pd_lower + sizeof(ItemIdData); else lower = ((PageHeader) page)->pd_lower; @@ -144,6 +122,23 @@ PageAddItem(Page page, if (lower > upper) return InvalidOffsetNumber; + /* + * OK to insert the item. First, shuffle the existing pointers if needed. 
+ */ + if (needshuffle) + { + /* shuffle ItemId's (Do the PageManager Shuffle...) */ + for (i = (limit - 1); i >= offsetNumber; i--) + { + ItemId fromitemId, + toitemId; + + fromitemId = &((PageHeader) page)->pd_linp[i - 1]; + toitemId = &((PageHeader) page)->pd_linp[i]; + *toitemId = *fromitemId; + } + } + itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1]; (*itemId).lp_off = upper; (*itemId).lp_len = size; @@ -168,9 +163,7 @@ PageGetTempPage(Page page, Size specialSize) PageHeader thdr; pageSize = PageGetPageSize(page); - - if ((temp = (Page) palloc(pageSize)) == (Page) NULL) - elog(FATAL, "Cannot allocate %d bytes for temp page.", pageSize); + temp = (Page) palloc(pageSize); thdr = (PageHeader) temp; /* copy old page in */ @@ -327,23 +320,6 @@ PageGetFreeSpace(Page page) return space; } -/* - * PageManagerModeSet - * - * Sets mode to either: ShufflePageManagerMode (the default) or - * OverwritePageManagerMode. For use by access methods code - * for determining semantics of PageAddItem when the offsetNumber - * argument is passed in. - */ -void -PageManagerModeSet(PageManagerMode mode) -{ - if (mode == ShufflePageManagerMode) - PageManagerShuffle = true; - else if (mode == OverwritePageManagerMode) - PageManagerShuffle = false; -} - /* *---------------------------------------------------------------- * PageIndexTupleDelete diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 49d9dd07dc..3f8eebc3b3 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2000, PostgreSQL, Inc * Portions Copyright (c) 1994, Regents of the University of California * - * $Id: nbtree.h,v 1.38 2000/06/15 03:32:31 momjian Exp $ + * $Id: nbtree.h,v 1.39 2000/07/21 06:42:35 tgl Exp $ * *------------------------------------------------------------------------- */ @@ -24,14 +24,9 @@ * info. 
In addition, we need to know what sort of page this is * (leaf or internal), and whether the page is available for reuse. * - * Lehman and Yao's algorithm requires a ``high key'' on every page. - * The high key on a page is guaranteed to be greater than or equal - * to any key that appears on this page. Our insertion algorithm - * guarantees that we can use the initial least key on our right - * sibling as the high key. We allocate space for the line pointer - * to the high key in the opaque data at the end of the page. - * - * Rightmost pages in the tree have no high key. + * We also store a back-link to the parent page, but this cannot be trusted + * very far since it does not get updated when the parent is split. + * See backend/access/nbtree/README for details. */ typedef struct BTPageOpaqueData @@ -41,11 +36,11 @@ typedef struct BTPageOpaqueData BlockNumber btpo_parent; uint16 btpo_flags; -#define BTP_LEAF (1 << 0) -#define BTP_ROOT (1 << 1) -#define BTP_FREE (1 << 2) -#define BTP_META (1 << 3) -#define BTP_CHAIN (1 << 4) +/* Bits defined in btpo_flags */ +#define BTP_LEAF (1 << 0) /* It's a leaf page */ +#define BTP_ROOT (1 << 1) /* It's the root page (has no parent) */ +#define BTP_FREE (1 << 2) /* not currently used... */ +#define BTP_META (1 << 3) /* Set in the meta-page only */ } BTPageOpaqueData; @@ -84,21 +79,24 @@ typedef struct BTScanOpaqueData typedef BTScanOpaqueData *BTScanOpaque; /* - * BTItems are what we store in the btree. Each item has an index - * tuple, including key and pointer values. In addition, we must - * guarantee that all tuples in the index are unique, in order to - * satisfy some assumptions in Lehman and Yao. The way that we do - * this is by generating a new OID for every insertion that we do in - * the tree. This adds eight bytes to the size of btree index - * tuples. 
Note that we do not use the OID as part of a composite - * key; the OID only serves as a unique identifier for a given index - * tuple (logical position within a page). + * BTItems are what we store in the btree. Each item is an index tuple, + * including key and pointer values. (In some cases either the key or the + * pointer may go unused, see backend/access/nbtree/README for details.) + * + * Old comments: + * In addition, we must guarantee that all tuples in the index are unique, + * in order to satisfy some assumptions in Lehman and Yao. The way that we + * do this is by generating a new OID for every insertion that we do in the + * tree. This adds eight bytes to the size of btree index tuples. Note + * that we do not use the OID as part of a composite key; the OID only + * serves as a unique identifier for a given index tuple (logical position + * within a page). * * New comments: * actually, we must guarantee that all tuples in A LEVEL * are unique, not in ALL INDEX. So, we can use bti_itup->t_tid * as unique identifier for a given index tuple (logical position - * within a level). - vadim 04/09/97 + * within a level). 
- vadim 04/09/97 */ typedef struct BTItemData @@ -108,12 +106,13 @@ typedef struct BTItemData typedef BTItemData *BTItem; -#define BTItemSame(i1, i2) ( i1->bti_itup.t_tid.ip_blkid.bi_hi == \ - i2->bti_itup.t_tid.ip_blkid.bi_hi && \ - i1->bti_itup.t_tid.ip_blkid.bi_lo == \ - i2->bti_itup.t_tid.ip_blkid.bi_lo && \ - i1->bti_itup.t_tid.ip_posid == \ - i2->bti_itup.t_tid.ip_posid ) +/* Test whether items are the "same" per the above notes */ +#define BTItemSame(i1, i2) ( (i1)->bti_itup.t_tid.ip_blkid.bi_hi == \ + (i2)->bti_itup.t_tid.ip_blkid.bi_hi && \ + (i1)->bti_itup.t_tid.ip_blkid.bi_lo == \ + (i2)->bti_itup.t_tid.ip_blkid.bi_lo && \ + (i1)->bti_itup.t_tid.ip_posid == \ + (i2)->bti_itup.t_tid.ip_posid ) /* * BTStackData -- As we descend a tree, we push the (key, pointer) @@ -129,24 +128,12 @@ typedef struct BTStackData { BlockNumber bts_blkno; OffsetNumber bts_offset; - BTItem bts_btitem; + BTItemData bts_btitem; struct BTStackData *bts_parent; } BTStackData; typedef BTStackData *BTStack; -typedef struct BTPageState -{ - Buffer btps_buf; - Page btps_page; - BTItem btps_lastbti; - OffsetNumber btps_lastoff; - OffsetNumber btps_firstoff; - int btps_level; - bool btps_doupper; - struct BTPageState *btps_next; -} BTPageState; - /* * We need to be able to tell the difference between read and write * requests for pages, in order to do locking correctly. @@ -155,31 +142,49 @@ typedef struct BTPageState #define BT_READ BUFFER_LOCK_SHARE #define BT_WRITE BUFFER_LOCK_EXCLUSIVE -/* - * Similarly, the difference between insertion and non-insertion binary - * searches on a given page makes a difference when we're descending the - * tree. - */ - -#define BT_INSERTION 0 -#define BT_DESCENT 1 - /* * In general, the btree code tries to localize its knowledge about * page layout to a couple of routines. However, we need a special * value to indicate "no page number" in those places where we expect - * page numbers. + * page numbers. 
We can use zero for this because we never need to + * make a pointer to the metadata page. */ #define P_NONE 0 + +/* + * Macros to test whether a page is leftmost or rightmost on its tree level, + * as well as other state info kept in the opaque data. + */ #define P_LEFTMOST(opaque) ((opaque)->btpo_prev == P_NONE) #define P_RIGHTMOST(opaque) ((opaque)->btpo_next == P_NONE) +#define P_ISLEAF(opaque) ((opaque)->btpo_flags & BTP_LEAF) +#define P_ISROOT(opaque) ((opaque)->btpo_flags & BTP_ROOT) + +/* + * Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost + * page. The high key is not a data key, but gives info about what range of + * keys is supposed to be on this page. The high key on a page is required + * to be greater than or equal to any data key that appears on the page. + * If we find ourselves trying to insert a key > high key, we know we need + * to move right (this should only happen if the page was split since we + * examined the parent page). + * + * Our insertion algorithm guarantees that we can use the initial least key + * on our right sibling as the high key. Once a page is created, its high + * key changes only if the page is split. + * + * On a non-rightmost page, the high key lives in item 1 and data items + * start in item 2. Rightmost pages have no high key, so we store data + * items beginning in item 1. + */ #define P_HIKEY ((OffsetNumber) 1) #define P_FIRSTKEY ((OffsetNumber) 2) +#define P_FIRSTDATAKEY(opaque) (P_RIGHTMOST(opaque) ? 
P_HIKEY : P_FIRSTKEY) /* - * Strategy numbers -- ordering of these is <, <=, =, >=, > + * Operator strategy numbers -- ordering of these is <, <=, =, >=, > */ #define BTLessStrategyNumber 1 @@ -200,29 +205,7 @@ typedef struct BTPageState #define BTORDER_PROC 1 /* - * prototypes for functions in nbtinsert.c - */ -extern InsertIndexResult _bt_doinsert(Relation rel, BTItem btitem, - bool index_is_unique, Relation heapRel); -extern bool _bt_itemcmp(Relation rel, Size keysz, ScanKey scankey, - BTItem item1, BTItem item2, StrategyNumber strat); - -/* - * prototypes for functions in nbtpage.c - */ -extern void _bt_metapinit(Relation rel); -extern Buffer _bt_getroot(Relation rel, int access); -extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access); -extern void _bt_relbuf(Relation rel, Buffer buf, int access); -extern void _bt_wrtbuf(Relation rel, Buffer buf); -extern void _bt_wrtnorelbuf(Relation rel, Buffer buf); -extern void _bt_pageinit(Page page, Size size); -extern void _bt_metaproot(Relation rel, BlockNumber rootbknum, int level); -extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access); -extern void _bt_pagedel(Relation rel, ItemPointer tid); - -/* - * prototypes for functions in nbtree.c + * prototypes for functions in nbtree.c (external entry points for btree) */ extern bool BuildingBtree; /* in nbtree.c */ @@ -237,6 +220,25 @@ extern Datum btmarkpos(PG_FUNCTION_ARGS); extern Datum btrestrpos(PG_FUNCTION_ARGS); extern Datum btdelete(PG_FUNCTION_ARGS); +/* + * prototypes for functions in nbtinsert.c + */ +extern InsertIndexResult _bt_doinsert(Relation rel, BTItem btitem, + bool index_is_unique, Relation heapRel); + +/* + * prototypes for functions in nbtpage.c + */ +extern void _bt_metapinit(Relation rel); +extern Buffer _bt_getroot(Relation rel, int access); +extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access); +extern void _bt_relbuf(Relation rel, Buffer buf, int access); +extern void _bt_wrtbuf(Relation rel, 
Buffer buf); +extern void _bt_wrtnorelbuf(Relation rel, Buffer buf); +extern void _bt_pageinit(Page page, Size size); +extern void _bt_metaproot(Relation rel, BlockNumber rootbknum, int level); +extern void _bt_pagedel(Relation rel, ItemPointer tid); + /* * prototypes for functions in nbtscan.c */ @@ -249,13 +251,13 @@ extern void AtEOXact_nbtree(void); * prototypes for functions in nbtsearch.c */ extern BTStack _bt_search(Relation rel, int keysz, ScanKey scankey, - Buffer *bufP); + Buffer *bufP, int access); extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz, ScanKey scankey, int access); -extern bool _bt_skeycmp(Relation rel, Size keysz, ScanKey scankey, - Page page, ItemId itemid, StrategyNumber strat); extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz, - ScanKey scankey, int srchtype); + ScanKey scankey); +extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey, + Page page, OffsetNumber offnum); extern RetrieveIndexResult _bt_next(IndexScanDesc scan, ScanDirection dir); extern RetrieveIndexResult _bt_first(IndexScanDesc scan, ScanDirection dir); extern bool _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir); diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h index 30b5a93ad6..8498c783a1 100644 --- a/src/include/storage/bufpage.h +++ b/src/include/storage/bufpage.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2000, PostgreSQL, Inc * Portions Copyright (c) 1994, Regents of the University of California * - * $Id: bufpage.h,v 1.30 2000/07/03 02:54:21 vadim Exp $ + * $Id: bufpage.h,v 1.31 2000/07/21 06:42:39 tgl Exp $ * *------------------------------------------------------------------------- */ @@ -309,7 +309,6 @@ extern Page PageGetTempPage(Page page, Size specialSize); extern void PageRestoreTempPage(Page tempPage, Page oldPage); extern void PageRepairFragmentation(Page page); extern Size PageGetFreeSpace(Page page); -extern void PageManagerModeSet(PageManagerMode mode); extern void 
PageIndexTupleDelete(Page page, OffsetNumber offset);
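
Note on the PageAddItem change in bufpage.c above: the patch stops shuffling ItemIds as soon as a valid offsetNumber is passed in. It only records that a shuffle is needed, computes the new lower/upper pointers, and moves the line pointers once it knows the item fits. The following is a minimal standalone sketch of that "check space, then shuffle" ordering. It is not the PostgreSQL code: ToyPage, toy_page_add_item, TOY_MAX_SLOTS and TOY_INVALID_OFFSET are hypothetical names invented for illustration, and a plain int array stands in for the ItemIdData line pointers.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SLOTS 8
#define TOY_INVALID_OFFSET 0	/* offsets are 1-based, like OffsetNumber */

/* Hypothetical stand-in for a page and its line pointer array. */
typedef struct ToyPage
{
	int			nslots;			/* slots currently in use */
	size_t		free_bytes;		/* space left for slots plus item data */
	int			slots[TOY_MAX_SLOTS];	/* stand-in for ItemIdData entries */
} ToyPage;

/*
 * Insert 'value' (occupying 'size' bytes of item data) at 1-based position
 * 'offset', moving later slots down by one.  Returns the offset used, or
 * TOY_INVALID_OFFSET if the item does not fit.  Because the space check
 * comes before the shuffle, a failed call leaves the page untouched.
 */
static int
toy_page_add_item(ToyPage *page, int value, size_t size, int offset)
{
	size_t		needed = size + sizeof(int);	/* item data plus one slot */
	int			i;

	assert(page->nslots >= 0 && page->nslots <= TOY_MAX_SLOTS);

	/* Reject bad offsets and full pages before modifying anything. */
	if (offset < 1 || offset > page->nslots + 1)
		return TOY_INVALID_OFFSET;
	if (page->nslots >= TOY_MAX_SLOTS || needed > page->free_bytes)
		return TOY_INVALID_OFFSET;

	/* OK to insert: shuffle the existing slots to open a gap. */
	for (i = page->nslots; i >= offset; i--)
		page->slots[i] = page->slots[i - 1];

	page->slots[offset - 1] = value;
	page->nslots++;
	page->free_bytes -= needed;
	return offset;
}
```

The point of the ordering is that a caller which gets the invalid offset back can still rely on the existing slot array, which would not hold if the shuffle had already run before the free-space test.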