Major overhaul of btree index code. Eliminate special BTP_CHAIN logic for
duplicate keys by letting search go to the left rather than right when an
equal key is seen at an upper tree level. Fix poor choice of page split
point (leading to insertion failures) that was forced by chaining logic.
Don't store leftmost key in non-leaf pages, since it's not necessary.
Don't create root page until something is first stored in the index, so an
unused index is now 8K not 16K. (Doesn't seem to be as easy to get rid of
the metadata page, unfortunately.) Massive cleanup of unreadable code, fix
poor, obsolete, and just plain wrong documentation and comments. See
src/backend/access/nbtree/README for the gory details.
parent c9537ca88f
commit 9e85183bfc
@@ -1,68 +1,175 @@
|
|||||||
$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.1.1.1 1996/07/09 06:21:12 scrappy Exp $
|
$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.2 2000/07/21 06:42:32 tgl Exp $
|
||||||
|
|
||||||
This directory contains a correct implementation of Lehman and Yao's
|
This directory contains a correct implementation of Lehman and Yao's
|
||||||
btree management algorithm that supports concurrent access for Postgres.
|
high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
|
||||||
|
Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
|
||||||
|
on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).
|
||||||
|
|
||||||
We have made the following changes in order to incorporate their algorithm
|
We have made the following changes in order to incorporate their algorithm
|
||||||
into Postgres:
|
into Postgres:
|
||||||
|
|
||||||
+ The requirement that all btree keys be unique is too onerous,
|
+ The requirement that all btree keys be unique is too onerous,
|
||||||
but the algorithm won't work correctly without it. As a result,
|
but the algorithm won't work correctly without it. Fortunately, it is
|
||||||
this implementation adds an OID (guaranteed to be unique) to
|
only necessary that keys be unique on a single tree level, because L&Y
|
||||||
every key in the index. This guarantees uniqueness within a set
|
only use the assumption of key uniqueness when re-finding a key in a
|
||||||
of duplicates. Space overhead is four bytes.
|
parent node (to determine where to insert the key for a split page).
|
||||||
|
Therefore, we can use the link field to disambiguate multiple
|
||||||
|
occurrences of the same user key: only one entry in the parent level
|
||||||
|
will be pointing at the page we had split. (Indeed we need not look at
|
||||||
|
the real "key" at all, just at the link field.) We can distinguish
|
||||||
|
items at the leaf level in the same way, by examining their links to
|
||||||
|
heap tuples; we'd never have two items for the same heap tuple.
|
||||||
|
|
||||||
For this reason, when we're passed an index tuple to store by the
|
+ Lehman and Yao assume that the key range for a subtree S is described
|
||||||
common access method code, we allocate a larger one and copy the
|
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
|
||||||
supplied tuple into it. No Postgres code outside of the btree
|
node. This does not work for nonunique keys (for example, if we have
|
||||||
access method knows about this xid or sequence number.
|
enough equal keys to spread across several leaf pages, there *must* be
|
||||||
|
some equal bounding keys in the first level up). Therefore we assume
|
||||||
|
Ki <= v <= Ki+1 instead. A search that finds exact equality to a
|
||||||
|
bounding key in an upper tree level must descend to the left of that
|
||||||
|
key to ensure it finds any equal keys in the preceding page. An
|
||||||
|
insertion that sees the high key of its target page is equal to the key
|
||||||
|
to be inserted has a choice whether or not to move right, since the new
|
||||||
|
key could go on either page. (Currently, we try to find a page where
|
||||||
|
there is room for the new key without a split.)
|
||||||
|
|
||||||
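The descend-left rule in the new README text above can be sketched in C. This is a minimal illustration, not the nbtree code: it assumes a hypothetical in-memory model of a non-leaf page in which keys[i] is the lower bound for child i and keys[0] stands for minus infinity. The name choose_child and the plain int keys are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a non-leaf page: keys[i] is the lower bound for
 * child i.  keys[0] is never examined; it plays the role of minus infinity.
 * Returns the index of the child to descend into when searching for v.
 */
size_t
choose_child(const int *keys, size_t nitems, int v)
{
    size_t lo = 1;          /* search window is [lo, hi) */
    size_t hi = nitems;

    assert(nitems >= 1);

    /* Find the first index in [1, nitems) whose key is >= v. */
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (keys[mid] < v)
            lo = mid + 1;
        else
            hi = mid;
    }

    /*
     * Descend to the left of that key.  On exact equality this picks the
     * preceding child, so equal keys stored on the earlier page are found.
     */
    return lo - 1;
}
```

With bounding keys (-inf, 10, 20, 20, 30), a search for 20 descends into child 1, whose range is 10 <= v <= 20. Under the original L&Y rule Ki < v <= Ki+1 the same choice would be made, but it would be the only page that could hold the key. Here the search must continue to the right at the leaf level to collect any further equal keys.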
+ Lehman and Yao don't require read locks, but assume that in-
|
+ Lehman and Yao don't require read locks, but assume that in-memory
|
||||||
memory copies of tree nodes are unshared. Postgres shares
|
copies of tree nodes are unshared. Postgres shares in-memory buffers
|
||||||
in-memory buffers among backends. As a result, we do page-
|
among backends. As a result, we do page-level read locking on btree
|
||||||
level read locking on btree nodes in order to guarantee that
|
nodes in order to guarantee that no record is modified while we are
|
||||||
no record is modified while we are examining it. This reduces
|
examining it. This reduces concurrency but guaranteees correct
|
||||||
concurrency but guaranteees correct behavior.
|
behavior. An advantage is that when trading in a read lock for a
|
||||||
|
write lock, we need not re-read the page after getting the write lock.
|
||||||
|
Since we're also holding a pin on the shared buffer containing the
|
||||||
|
page, we know that buffer still contains the page and is up-to-date.
|
||||||
|
|
||||||
+ Read locks on a page are held for as long as a scan has a pointer
|
+ We support the notion of an ordered "scan" of an index as well as
|
||||||
to the page. However, locks are always surrendered before the
|
insertions, deletions, and simple lookups. A scan in the forward
|
||||||
sibling page lock is acquired (for readers), so we remain deadlock-
|
direction is no problem, we just use the right-sibling pointers that
|
||||||
free. I will do a formal proof if I get bored anytime soon.
|
L&Y require anyway. (Thus, once we have descended the tree to the
|
||||||
|
correct start point for the scan, the scan looks only at leaf pages
|
||||||
|
and never at higher tree levels.) To support scans in the backward
|
||||||
|
direction, we also store a "left sibling" link much like the "right
|
||||||
|
sibling". (This adds an extra step to the L&Y split algorithm: while
|
||||||
|
holding the write lock on the page being split, we also lock its former
|
||||||
|
right sibling to update that page's left-link. This is safe since no
|
||||||
|
writer of that page can be interested in acquiring a write lock on our
|
||||||
|
page.) A backwards scan has one additional bit of complexity: after
|
||||||
|
following the left-link we must account for the possibility that the
|
||||||
|
left sibling page got split before we could read it. So, we have to
|
||||||
|
move right until we find a page whose right-link matches the page we
|
||||||
|
came from.
|
||||||
|
|
||||||
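The move-right step of a backward scan, as described in the new README text above, might look roughly like the following. This is a sketch under stated assumptions: pages are modeled as an array of sibling links indexed by page number, and the names PageLinks, NO_PAGE and step_left are hypothetical. Locking and buffer pins are left out entirely.

```c
#include <assert.h>

#define NO_PAGE (-1)

/* Hypothetical stand-in for the sibling links kept in the opaque area. */
typedef struct
{
    int left;               /* left-sibling link */
    int right;              /* right-sibling link */
} PageLinks;

/*
 * Step left from page "from", given the left-link value read from it
 * earlier.  If the left sibling was split after that link was read, walk
 * right until reaching the page whose right-link points back at "from".
 * Returns NO_PAGE if there is no left sibling or the chain ends first.
 */
int
step_left(const PageLinks *pages, int from, int leftlink)
{
    int cur = leftlink;

    assert(from != NO_PAGE);

    while (cur != NO_PAGE && pages[cur].right != from)
        cur = pages[cur].right;

    return cur;
}
```

For example, if page 1 was split into pages 1 and 3 after the scan read page 2's left-link (still 1), step_left starts at page 1, sees that its right-link is now 3 rather than 2, moves right, and stops at page 3.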
|
+ Read locks on a page are held for as long as a scan has a pointer
|
||||||
|
to the page. However, locks are always surrendered before the
|
||||||
|
sibling page lock is acquired (for readers), so we remain deadlock-
|
||||||
|
free. I will do a formal proof if I get bored anytime soon.
|
||||||
|
NOTE: nbtree.c arranges to drop the read lock, but not the buffer pin,
|
||||||
|
on the current page of a scan before control leaves nbtree. When we
|
||||||
|
come back to resume the scan, we have to re-grab the read lock and
|
||||||
|
then move right if the current item moved (see _bt_restscan()).
|
||||||
|
|
||||||
|
+ Lehman and Yao fail to discuss what must happen when the root page
|
||||||
|
becomes full and must be split. Our implementation is to split the
|
||||||
|
root in the same way that any other page would be split, then construct
|
||||||
|
a new root page holding pointers to both of the resulting pages (which
|
||||||
|
now become siblings on level 2 of the tree). The new root page is then
|
||||||
|
installed by altering the root pointer in the meta-data page (see
|
||||||
|
below). This works because the root is not treated specially in any
|
||||||
|
other way --- in particular, searches will move right using its link
|
||||||
|
pointer if the link is set. Therefore, searches will find the data
|
||||||
|
that's been moved into the right sibling even if they read the metadata
|
||||||
|
page before it got updated. This is the same reasoning that makes a
|
||||||
|
split of a non-root page safe. The locking considerations are similar too.
|
||||||
|
|
||||||
|
+ Lehman and Yao assume fixed-size keys, but we must deal with
|
||||||
|
variable-size keys. Therefore there is not a fixed maximum number of
|
||||||
|
keys per page; we just stuff in as many as will fit. When we split a
|
||||||
|
page, we try to equalize the number of bytes, not items, assigned to
|
||||||
|
each of the resulting pages. Note we must include the incoming item in
|
||||||
|
this calculation, otherwise it is possible to find that the incoming
|
||||||
|
item doesn't fit on the split page where it needs to go!
|
||||||
|
|
||||||
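To see why the incoming item has to be counted when choosing the split point, here is a small C sketch. It is not the split code from this commit: the function choose_split, its parameters, and the use of a plain array of item sizes are all assumptions for illustration. It balances bytes rather than item counts and rejects any split that would overflow either page.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decide how many of the n+1 items (the n existing items plus the incoming
 * one, which would sit at position newpos) go to the left page.  Returns 0
 * if no split leaves both pages within "capacity" bytes.
 */
size_t
choose_split(const size_t *sizes, size_t n, size_t newsize,
             size_t newpos, size_t capacity)
{
    size_t total = newsize;
    size_t left = 0;
    size_t best = 0;
    size_t bestdiff = 0;
    size_t j;

    assert(newpos <= n);

    for (j = 0; j < n; j++)
        total += sizes[j];

    /* j indexes the merged sequence; at least one item must stay right. */
    for (j = 0; j < n; j++)
    {
        size_t sz;
        size_t right;
        size_t diff;

        if (j < newpos)
            sz = sizes[j];
        else if (j == newpos)
            sz = newsize;
        else
            sz = sizes[j - 1];

        left += sz;
        right = total - left;
        if (left > capacity || right > capacity)
            continue;       /* one side would not fit */

        diff = (left > right) ? left - right : right - left;
        if (best == 0 || diff < bestdiff)
        {
            best = j + 1;
            bestdiff = diff;
        }
    }
    return best;
}
```

Take four 100-byte items on a 500-byte page and a 400-byte item arriving at the end. Balancing only the existing items would split 200/200 and then try to add 400 bytes to the right half, which cannot fit. Counting the incoming item gives a 400/400 split instead, so choose_split returns 4.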
In addition, the following things are handy to know:
|
In addition, the following things are handy to know:
|
||||||
|
|
||||||
+ Page zero of every btree is a meta-data page. This page stores
|
+ Page zero of every btree is a meta-data page. This page stores
|
||||||
the location of the root page, a pointer to a list of free
|
the location of the root page, a pointer to a list of free
|
||||||
pages, and other stuff that's handy to know.
|
pages, and other stuff that's handy to know. (Currently, we
|
||||||
|
never shrink btree indexes so there are never any free pages.)
|
||||||
|
|
||||||
+ This algorithm doesn't really work, since it requires ordered
|
+ The algorithm assumes we can fit at least three items per page
|
||||||
writes, and UNIX doesn't support ordered writes.
|
(a "high key" and two real data items). Therefore it's unsafe
|
||||||
|
to accept items larger than 1/3rd page size. Larger items would
|
||||||
|
work sometimes, but could cause failures later on depending on
|
||||||
|
what else gets put on their page.
|
||||||
|
|
||||||
+ There's one other case where we may screw up in this
|
+ This algorithm doesn't guarantee btree consistency after a kernel crash
|
||||||
implementation. When we start a scan, we descend the tree
|
or hardware failure. To do that, we'd need ordered writes, and UNIX
|
||||||
to the key nearest the one in the qual, and once we get there,
|
doesn't support ordered writes (short of fsync'ing every update, which
|
||||||
position ourselves correctly for the qual type (eg, <, >=, etc).
|
is too high a price). Rebuilding corrupted indexes during restart
|
||||||
If we happen to step off a page, decide we want to get back to
|
seems more attractive.
|
||||||
it, and fetch the page again, and if some bad person has split
|
|
||||||
the page and moved the last tuple we saw off of it, then the
|
+ On deletions, we need to adjust the position of active scans on
|
||||||
code complains about botched concurrency in an elog(WARN, ...)
|
the index. The code in nbtscan.c handles this. We don't need to
|
||||||
and gives up the ghost. This is the ONLY violation of Lehman
|
do this for insertions or splits because _bt_restscan can find the
|
||||||
and Yao's guarantee of correct behavior that I am aware of in
|
new position of the previously-found item. NOTE that nbtscan.c
|
||||||
this code.
|
only copes with deletions issued by the current backend. This
|
||||||
|
essentially means that concurrent deletions are not supported, but
|
||||||
|
that's true already in the Lehman and Yao algorithm. nbtscan.c
|
||||||
|
exists only to support VACUUM and allow it to delete items while
|
||||||
|
it's scanning the index.
|
||||||
|
|
||||||
|
Notes about data representation:
|
||||||
|
|
||||||
|
+ The right-sibling link required by L&Y is kept in the page "opaque
|
||||||
|
data" area, as is the left-sibling link and some flags.
|
||||||
|
|
||||||
|
+ We also keep a parent link in the opaque data, but this link is not
|
||||||
|
very trustworthy because it is not updated when the parent page splits.
|
||||||
|
Thus, it points to some page on the parent level, but possibly a page
|
||||||
|
well to the left of the page's actual current parent. In most cases
|
||||||
|
we do not need this link at all. Normally we return to a parent page
|
||||||
|
using a stack of entries that are made as we descend the tree, as in L&Y.
|
||||||
|
There is exactly one case where the stack will not help: concurrent
|
||||||
|
root splits. If an inserter process needs to split what had been the
|
||||||
|
root when it started its descent, but finds that that page is no longer
|
||||||
|
the root (because someone else split it meanwhile), then it uses the
|
||||||
|
parent link to move up to the next level. This is OK because we do fix
|
||||||
|
the parent link in a former root page when splitting it. This logic
|
||||||
|
will work even if the root is split multiple times (even up to creation
|
||||||
|
of multiple new levels) before an inserter returns to it. The same
|
||||||
|
could not be said of finding the new root via the metapage, since that
|
||||||
|
would work only for a single level of added root.
|
||||||
|
|
||||||
|
+ The Postgres disk block data format (an array of items) doesn't fit
|
||||||
|
Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
|
||||||
|
so we have to play some games.
|
||||||
|
|
||||||
|
+ On a page that is not rightmost in its tree level, the "high key" is
|
||||||
|
kept in the page's first item, and real data items start at item 2.
|
||||||
|
The link portion of the "high key" item goes unused. A page that is
|
||||||
|
rightmost has no "high key", so data items start with the first item.
|
||||||
|
Putting the high key at the left, rather than the right, may seem odd,
|
||||||
|
but it avoids moving the high key as we add data items.
|
||||||
|
|
||||||
|
+ On a leaf page, the data items are simply links to (TIDs of) tuples
|
||||||
|
in the relation being indexed, with the associated key values.
|
||||||
|
|
||||||
|
+ On a non-leaf page, the data items are down-links to child pages with
|
||||||
|
bounding keys. The key in each data item is the *lower* bound for
|
||||||
|
keys on that child page, so logically the key is to the left of that
|
||||||
|
downlink. The high key (if present) is the upper bound for the last
|
||||||
|
downlink. The first data item on each such page has no lower bound
|
||||||
|
--- or lower bound of minus infinity, if you prefer. The comparison
|
||||||
|
routines must treat it accordingly. The actual key stored in the
|
||||||
|
item is irrelevant, and need not be stored at all. This arrangement
|
||||||
|
corresponds to the fact that an L&Y non-leaf page has one more pointer
|
||||||
|
than key.
|
||||||
|
|
||||||
Notes to operator class implementors:
|
Notes to operator class implementors:
|
||||||
|
|
||||||
With this implementation, we require the user to supply us with
|
+ With this implementation, we require the user to supply us with
|
||||||
a procedure for pg_amproc. This procedure should take two keys
|
a procedure for pg_amproc. This procedure should take two keys
|
||||||
A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
|
A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
|
||||||
respectively. See the contents of that relation for the btree
|
respectively. See the contents of that relation for the btree
|
||||||
access method for some samples.
|
access method for some samples.
|
||||||
|
|
||||||
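The contract for the comparison procedure described above (return < 0, 0, or > 0) can be illustrated with a small C function. The name example_int4_cmp and its pointer-based signature are hypothetical. This is not one of the procedures registered in pg_amproc, only a sketch of the required return convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Three-way comparison of two 4-byte integer keys.
 * Returns a negative value if *a < *b, zero if equal, positive if *a > *b.
 */
int
example_int4_cmp(const int32_t *a, const int32_t *b)
{
    assert(a != NULL && b != NULL);

    /* Avoid "*a - *b", which can overflow for operands of opposite sign. */
    return (*a > *b) - (*a < *b);
}
```

Only the sign of the result matters to the caller, so returning -1, 0, or 1 is sufficient.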
Notes to mao for implementation document:
|
|
||||||
|
|
||||||
On deletions, we need to adjust the position of active scans on
|
|
||||||
the index. The code in nbtscan.c handles this. We don't need to
|
|
||||||
do this for splits because of the way splits are handled; if they
|
|
||||||
happen behind us, we'll automatically go to the next page, and if
|
|
||||||
they happen in front of us, we're not affected by them. For
|
|
||||||
insertions, if we inserted a tuple behind the current scan location
|
|
||||||
on the current scan page, we move one space ahead.
|
|
||||||
|
File diff suppressed because it is too large
@@ -9,7 +9,7 @@
|
|||||||
*
|
*
|
||||||
*
|
*
|
||||||
* IDENTIFICATION
|
* IDENTIFICATION
|
||||||
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.36 2000/04/12 17:14:49 momjian Exp $
|
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.37 2000/07/21 06:42:32 tgl Exp $
|
||||||
*
|
*
|
||||||
* NOTES
|
* NOTES
|
||||||
* Postgres btree pages look like ordinary relation pages. The opaque
|
* Postgres btree pages look like ordinary relation pages. The opaque
|
||||||
@@ -90,7 +90,7 @@ _bt_metapinit(Relation rel)
|
|||||||
metad.btm_version = BTREE_VERSION;
|
metad.btm_version = BTREE_VERSION;
|
||||||
metad.btm_root = P_NONE;
|
metad.btm_root = P_NONE;
|
||||||
metad.btm_level = 0;
|
metad.btm_level = 0;
|
||||||
memmove((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad));
|
memcpy((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad));
|
||||||
|
|
||||||
op = (BTPageOpaque) PageGetSpecialPointer(pg);
|
op = (BTPageOpaque) PageGetSpecialPointer(pg);
|
||||||
op->btpo_flags = BTP_META;
|
op->btpo_flags = BTP_META;
|
||||||
@@ -102,52 +102,6 @@ _bt_metapinit(Relation rel)
|
|||||||
UnlockRelation(rel, AccessExclusiveLock);
|
UnlockRelation(rel, AccessExclusiveLock);
|
||||||
}
|
}
|
||||||
|
|
||||||
#ifdef NOT_USED
|
|
||||||
/*
|
|
||||||
* _bt_checkmeta() -- Verify that the metadata stored in a btree are
|
|
||||||
* reasonable.
|
|
||||||
*/
|
|
||||||
void
|
|
||||||
_bt_checkmeta(Relation rel)
|
|
||||||
{
|
|
||||||
Buffer metabuf;
|
|
||||||
Page metap;
|
|
||||||
BTMetaPageData *metad;
|
|
||||||
BTPageOpaque op;
|
|
||||||
int nblocks;
|
|
||||||
|
|
||||||
/* if the relation is empty, this is init time; don't complain */
|
|
||||||
if ((nblocks = RelationGetNumberOfBlocks(rel)) == 0)
|
|
||||||
return;
|
|
||||||
|
|
||||||
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
|
|
||||||
metap = BufferGetPage(metabuf);
|
|
||||||
op = (BTPageOpaque) PageGetSpecialPointer(metap);
|
|
||||||
if (!(op->btpo_flags & BTP_META))
|
|
||||||
{
|
|
||||||
elog(ERROR, "Invalid metapage for index %s",
|
|
||||||
RelationGetRelationName(rel));
|
|
||||||
}
|
|
||||||
metad = BTPageGetMeta(metap);
|
|
||||||
|
|
||||||
if (metad->btm_magic != BTREE_MAGIC)
|
|
||||||
{
|
|
||||||
elog(ERROR, "Index %s is not a btree",
|
|
||||||
RelationGetRelationName(rel));
|
|
||||||
}
|
|
||||||
|
|
||||||
if (metad->btm_version != BTREE_VERSION)
|
|
||||||
{
|
|
||||||
elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
|
|
||||||
RelationGetRelationName(rel),
|
|
||||||
metad->btm_version, BTREE_VERSION);
|
|
||||||
}
|
|
||||||
|
|
||||||
_bt_relbuf(rel, metabuf, BT_READ);
|
|
||||||
}
|
|
||||||
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* _bt_getroot() -- Get the root page of the btree.
|
* _bt_getroot() -- Get the root page of the btree.
|
||||||
*
|
*
|
||||||
@@ -157,11 +111,15 @@ _bt_checkmeta(Relation rel)
|
|||||||
* standard class of race conditions exists here; I think I covered
|
* standard class of race conditions exists here; I think I covered
|
||||||
* them all in the Hopi Indian rain dance of lock requests below.
|
* them all in the Hopi Indian rain dance of lock requests below.
|
||||||
*
|
*
|
||||||
* We pass in the access type (BT_READ or BT_WRITE), and return the
|
* The access type parameter (BT_READ or BT_WRITE) controls whether
|
||||||
* root page's buffer with the appropriate lock type set. Reference
|
* a new root page will be created or not. If access = BT_READ,
|
||||||
* count on the root page gets bumped by ReadBuffer. The metadata
|
* and no root page exists, we just return InvalidBuffer. For
|
||||||
* page is unlocked and unreferenced by this process when this routine
|
* BT_WRITE, we try to create the root page if it doesn't exist.
|
||||||
* returns.
|
* NOTE that the returned root page will have only a read lock set
|
||||||
|
* on it even if access = BT_WRITE!
|
||||||
|
*
|
||||||
|
* On successful return, the root page is pinned and read-locked.
|
||||||
|
* The metadata page is not locked or pinned on exit.
|
||||||
*/
|
*/
|
||||||
Buffer
|
Buffer
|
||||||
_bt_getroot(Relation rel, int access)
|
_bt_getroot(Relation rel, int access)
|
||||||
@@ -178,78 +136,71 @@ _bt_getroot(Relation rel, int access)
|
|||||||
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
|
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
|
||||||
metapg = BufferGetPage(metabuf);
|
metapg = BufferGetPage(metabuf);
|
||||||
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
|
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
|
||||||
Assert(metaopaque->btpo_flags & BTP_META);
|
|
||||||
metad = BTPageGetMeta(metapg);
|
metad = BTPageGetMeta(metapg);
|
||||||
|
|
||||||
if (metad->btm_magic != BTREE_MAGIC)
|
if (!(metaopaque->btpo_flags & BTP_META) ||
|
||||||
{
|
metad->btm_magic != BTREE_MAGIC)
|
||||||
elog(ERROR, "Index %s is not a btree",
|
elog(ERROR, "Index %s is not a btree",
|
||||||
RelationGetRelationName(rel));
|
RelationGetRelationName(rel));
|
||||||
}
|
|
||||||
|
|
||||||
if (metad->btm_version != BTREE_VERSION)
|
if (metad->btm_version != BTREE_VERSION)
|
||||||
{
|
elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
|
||||||
elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
|
|
||||||
RelationGetRelationName(rel),
|
RelationGetRelationName(rel),
|
||||||
metad->btm_version, BTREE_VERSION);
|
metad->btm_version, BTREE_VERSION);
|
||||||
}
|
|
||||||
|
|
||||||
/* if no root page initialized yet, do it */
|
/* if no root page initialized yet, do it */
|
||||||
if (metad->btm_root == P_NONE)
|
if (metad->btm_root == P_NONE)
|
||||||
{
|
{
|
||||||
|
/* If access = BT_READ, caller doesn't want us to create root yet */
|
||||||
|
if (access == BT_READ)
|
||||||
|
{
|
||||||
|
_bt_relbuf(rel, metabuf, BT_READ);
|
||||||
|
return InvalidBuffer;
|
||||||
|
}
|
||||||
|
|
||||||
/* turn our read lock in for a write lock */
|
/* trade in our read lock for a write lock */
|
||||||
_bt_relbuf(rel, metabuf, BT_READ);
|
LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
|
||||||
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
|
LockBuffer(metabuf, BT_WRITE);
|
||||||
metapg = BufferGetPage(metabuf);
|
|
||||||
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
|
|
||||||
Assert(metaopaque->btpo_flags & BTP_META);
|
|
||||||
metad = BTPageGetMeta(metapg);
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Race condition: if someone else initialized the metadata
|
* Race condition: if someone else initialized the metadata
|
||||||
* between the time we released the read lock and acquired the
|
* between the time we released the read lock and acquired the
|
||||||
* write lock, above, we want to avoid doing it again.
|
* write lock, above, we must avoid doing it again.
|
||||||
*/
|
*/
|
||||||
|
|
||||||
if (metad->btm_root == P_NONE)
|
if (metad->btm_root == P_NONE)
|
||||||
{
|
{
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Get, initialize, write, and leave a lock of the appropriate
|
* Get, initialize, write, and leave a lock of the appropriate
|
||||||
* type on the new root page. Since this is the first page in
|
* type on the new root page. Since this is the first page in
|
||||||
* the tree, it's a leaf.
|
* the tree, it's a leaf as well as the root.
|
||||||
*/
|
*/
|
||||||
|
|
||||||
rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
|
rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
|
||||||
rootblkno = BufferGetBlockNumber(rootbuf);
|
rootblkno = BufferGetBlockNumber(rootbuf);
|
||||||
rootpg = BufferGetPage(rootbuf);
|
rootpg = BufferGetPage(rootbuf);
|
||||||
|
|
||||||
metad->btm_root = rootblkno;
|
metad->btm_root = rootblkno;
|
||||||
metad->btm_level = 1;
|
metad->btm_level = 1;
|
||||||
|
|
||||||
_bt_pageinit(rootpg, BufferGetPageSize(rootbuf));
|
_bt_pageinit(rootpg, BufferGetPageSize(rootbuf));
|
||||||
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
|
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
|
||||||
rootopaque->btpo_flags |= (BTP_LEAF | BTP_ROOT);
|
rootopaque->btpo_flags |= (BTP_LEAF | BTP_ROOT);
|
||||||
_bt_wrtnorelbuf(rel, rootbuf);
|
_bt_wrtnorelbuf(rel, rootbuf);
|
||||||
|
|
||||||
/* swap write lock for read lock, if appropriate */
|
/* swap write lock for read lock */
|
||||||
if (access != BT_WRITE)
|
LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK);
|
||||||
{
|
LockBuffer(rootbuf, BT_READ);
|
||||||
LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK);
|
|
||||||
LockBuffer(rootbuf, BT_READ);
|
|
||||||
}
|
|
||||||
|
|
||||||
/* okay, metadata is correct */
|
/* okay, metadata is correct, write and release it */
|
||||||
_bt_wrtbuf(rel, metabuf);
|
_bt_wrtbuf(rel, metabuf);
|
||||||
}
|
}
|
||||||
else
|
else
|
||||||
{
|
{
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Metadata initialized by someone else. In order to
|
* Metadata initialized by someone else. In order to
|
||||||
* guarantee no deadlocks, we have to release the metadata
|
* guarantee no deadlocks, we have to release the metadata
|
||||||
* page and start all over again.
|
* page and start all over again.
|
||||||
*/
|
*/
|
||||||
|
|
||||||
_bt_relbuf(rel, metabuf, BT_WRITE);
|
_bt_relbuf(rel, metabuf, BT_WRITE);
|
||||||
return _bt_getroot(rel, access);
|
return _bt_getroot(rel, access);
|
||||||
}
|
}
|
||||||
@@ -259,22 +210,21 @@ _bt_getroot(Relation rel, int access)
|
|||||||
rootblkno = metad->btm_root;
|
rootblkno = metad->btm_root;
|
||||||
_bt_relbuf(rel, metabuf, BT_READ); /* done with the meta page */
|
_bt_relbuf(rel, metabuf, BT_READ); /* done with the meta page */
|
||||||
|
|
||||||
rootbuf = _bt_getbuf(rel, rootblkno, access);
|
rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
|
||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Race condition: If the root page split between the time we looked
|
* Race condition: If the root page split between the time we looked
|
||||||
* at the metadata page and got the root buffer, then we got the wrong
|
* at the metadata page and got the root buffer, then we got the wrong
|
||||||
* buffer.
|
* buffer. Release it and try again.
|
||||||
*/
|
*/
|
||||||
|
|
||||||
rootpg = BufferGetPage(rootbuf);
|
rootpg = BufferGetPage(rootbuf);
|
||||||
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
|
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
|
||||||
if (!(rootopaque->btpo_flags & BTP_ROOT))
|
|
||||||
{
|
|
||||||
|
|
||||||
|
if (! P_ISROOT(rootopaque))
|
||||||
|
{
|
||||||
/* it happened, try again */
|
/* it happened, try again */
|
||||||
_bt_relbuf(rel, rootbuf, access);
|
_bt_relbuf(rel, rootbuf, BT_READ);
|
||||||
return _bt_getroot(rel, access);
|
return _bt_getroot(rel, access);
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -283,7 +233,6 @@ _bt_getroot(Relation rel, int access)
|
|||||||
* count is correct, and we have no lock set on the metadata page.
|
* count is correct, and we have no lock set on the metadata page.
|
||||||
* Return the root block.
|
* Return the root block.
|
||||||
*/
|
*/
|
||||||
|
|
||||||
return rootbuf;
|
return rootbuf;
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -291,33 +240,38 @@ _bt_getroot(Relation rel, int access)
|
|||||||
* _bt_getbuf() -- Get a buffer by block number for read or write.
|
* _bt_getbuf() -- Get a buffer by block number for read or write.
|
||||||
*
|
*
|
||||||
* When this routine returns, the appropriate lock is set on the
|
* When this routine returns, the appropriate lock is set on the
|
||||||
* requested buffer its reference count is correct.
|
* requested buffer and its reference count has been incremented
|
||||||
|
* (ie, the buffer is "locked and pinned").
|
||||||
*/
|
*/
|
||||||
Buffer
|
Buffer
|
||||||
_bt_getbuf(Relation rel, BlockNumber blkno, int access)
|
_bt_getbuf(Relation rel, BlockNumber blkno, int access)
|
||||||
{
|
{
|
||||||
Buffer buf;
|
Buffer buf;
|
||||||
Page page;
|
|
||||||
|
|
||||||
if (blkno != P_NEW)
|
if (blkno != P_NEW)
|
||||||
{
|
{
|
||||||
|
/* Read an existing block of the relation */
|
||||||
buf = ReadBuffer(rel, blkno);
|
buf = ReadBuffer(rel, blkno);
|
||||||
LockBuffer(buf, access);
|
LockBuffer(buf, access);
|
||||||
}
|
}
|
||||||
else
|
else
|
||||||
{
|
{
|
||||||
|
Page page;
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Extend bufmgr code is unclean and so we have to use locking
|
* Extend the relation by one page.
|
||||||
|
*
|
||||||
|
* Extend bufmgr code is unclean and so we have to use extra locking
|
||||||
* here.
|
* here.
|
||||||
*/
|
*/
|
||||||
LockPage(rel, 0, ExclusiveLock);
|
LockPage(rel, 0, ExclusiveLock);
|
||||||
buf = ReadBuffer(rel, blkno);
|
buf = ReadBuffer(rel, blkno);
|
||||||
|
LockBuffer(buf, access);
|
||||||
UnlockPage(rel, 0, ExclusiveLock);
|
UnlockPage(rel, 0, ExclusiveLock);
|
||||||
blkno = BufferGetBlockNumber(buf);
|
|
||||||
|
/* Initialize the new page before returning it */
|
||||||
page = BufferGetPage(buf);
|
page = BufferGetPage(buf);
|
||||||
_bt_pageinit(page, BufferGetPageSize(buf));
|
_bt_pageinit(page, BufferGetPageSize(buf));
|
||||||
LockBuffer(buf, access);
|
|
||||||
}
|
}
|
||||||
|
|
||||||
/* ref count and lock type are correct */
|
/* ref count and lock type are correct */
|
||||||
@@ -326,6 +280,8 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
|
|||||||
|
|
||||||
/*
|
/*
|
||||||
* _bt_relbuf() -- release a locked buffer.
|
* _bt_relbuf() -- release a locked buffer.
|
||||||
|
*
|
||||||
|
* Lock and pin (refcount) are both dropped.
|
||||||
*/
|
*/
|
||||||
void
|
void
|
||||||
_bt_relbuf(Relation rel, Buffer buf, int access)
|
_bt_relbuf(Relation rel, Buffer buf, int access)
|
||||||
@@ -337,9 +293,15 @@ _bt_relbuf(Relation rel, Buffer buf, int access)
|
|||||||
/*
|
/*
|
||||||
* _bt_wrtbuf() -- write a btree page to disk.
|
* _bt_wrtbuf() -- write a btree page to disk.
|
||||||
*
|
*
|
||||||
* This routine releases the lock held on the buffer and our reference
|
* This routine releases the lock held on the buffer and our refcount
|
||||||
* to it. It is an error to call _bt_wrtbuf() without a write lock
|
* for it. It is an error to call _bt_wrtbuf() without a write lock
|
||||||
* or a reference to the buffer.
|
* and a pin on the buffer.
|
||||||
|
*
|
||||||
|
* NOTE: actually, the buffer manager just marks the shared buffer page
|
||||||
|
* dirty here, the real I/O happens later. Since we can't persuade the
|
||||||
|
* Unix kernel to schedule disk writes in a particular order, there's not
|
||||||
|
* much point in worrying about this. The most we can say is that all the
|
||||||
|
* writes will occur before commit.
|
||||||
*/
|
*/
|
||||||
void
|
void
|
||||||
_bt_wrtbuf(Relation rel, Buffer buf)
|
_bt_wrtbuf(Relation rel, Buffer buf)
|
||||||
@@ -353,7 +315,9 @@ _bt_wrtbuf(Relation rel, Buffer buf)
|
|||||||
* our reference or lock.
|
* our reference or lock.
|
||||||
*
|
*
|
||||||
* It is an error to call _bt_wrtnorelbuf() without a write lock
|
* It is an error to call _bt_wrtnorelbuf() without a write lock
|
||||||
* or a reference to the buffer.
|
* and a pin on the buffer.
|
||||||
|
*
|
||||||
|
* See above NOTE.
|
||||||
*/
|
*/
|
||||||
void
|
void
|
||||||
_bt_wrtnorelbuf(Relation rel, Buffer buf)
|
_bt_wrtnorelbuf(Relation rel, Buffer buf)
|
||||||
@@ -389,10 +353,10 @@ _bt_pageinit(Page page, Size size)
|
|||||||
* we split the root page, we record the new parent in the metadata page
|
* we split the root page, we record the new parent in the metadata page
|
||||||
* for the relation. This routine does the work.
|
* for the relation. This routine does the work.
|
||||||
*
|
*
|
||||||
* No direct preconditions, but if you don't have the a write lock on
|
* No direct preconditions, but if you don't have the write lock on
|
||||||
* at least the old root page when you call this, you're making a big
|
* at least the old root page when you call this, you're making a big
|
||||||
* mistake. On exit, metapage data is correct and we no longer have
|
* mistake. On exit, metapage data is correct and we no longer have
|
||||||
* a reference to or lock on the metapage.
|
* a pin or lock on the metapage.
|
||||||
*/
|
*/
|
||||||
void
|
void
|
||||||
_bt_metaproot(Relation rel, BlockNumber rootbknum, int level)
|
_bt_metaproot(Relation rel, BlockNumber rootbknum, int level)
|
||||||
@ -416,127 +380,8 @@ _bt_metaproot(Relation rel, BlockNumber rootbknum, int level)
|
|||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* _bt_getstackbuf() -- Walk back up the tree one step, and find the item
|
* Delete an item from a btree. It had better be a leaf item...
|
||||||
* we last looked at in the parent.
|
|
||||||
*
|
|
||||||
* This is possible because we save a bit image of the last item
|
|
||||||
* we looked at in the parent, and the update algorithm guarantees
|
|
||||||
* that if items above us in the tree move, they only move right.
|
|
||||||
*
|
|
||||||
* Also, re-set bts_blkno & bts_offset if changed and
|
|
||||||
* bts_btitem (it may be changed - see _bt_insertonpg).
|
|
||||||
*/
|
*/
|
||||||
Buffer
|
|
||||||
_bt_getstackbuf(Relation rel, BTStack stack, int access)
|
|
||||||
{
|
|
||||||
Buffer buf;
|
|
||||||
BlockNumber blkno;
|
|
||||||
OffsetNumber start,
|
|
||||||
offnum,
|
|
||||||
maxoff;
|
|
||||||
OffsetNumber i;
|
|
||||||
Page page;
|
|
||||||
ItemId itemid;
|
|
||||||
BTItem item;
|
|
||||||
BTPageOpaque opaque;
|
|
||||||
BTItem item_save;
|
|
||||||
int item_nbytes;
|
|
||||||
|
|
||||||
blkno = stack->bts_blkno;
|
|
||||||
buf = _bt_getbuf(rel, blkno, access);
|
|
||||||
page = BufferGetPage(buf);
|
|
||||||
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
|
|
||||||
maxoff = PageGetMaxOffsetNumber(page);
|
|
||||||
|
|
||||||
if (stack->bts_offset == InvalidOffsetNumber ||
|
|
||||||
maxoff >= stack->bts_offset)
|
|
||||||
{
|
|
||||||
|
|
||||||
/*
|
|
||||||
* _bt_insertonpg set bts_offset to InvalidOffsetNumber in the
|
|
||||||
* case of concurrent ROOT page split
|
|
||||||
*/
|
|
||||||
if (stack->bts_offset == InvalidOffsetNumber)
|
|
||||||
i = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
|
|
||||||
else
|
|
||||||
{
|
|
||||||
itemid = PageGetItemId(page, stack->bts_offset);
|
|
||||||
item = (BTItem) PageGetItem(page, itemid);
|
|
||||||
|
|
||||||
/* if the item is where we left it, we're done */
|
|
||||||
if (BTItemSame(item, stack->bts_btitem))
|
|
||||||
{
|
|
||||||
pfree(stack->bts_btitem);
|
|
||||||
item_nbytes = ItemIdGetLength(itemid);
|
|
||||||
item_save = (BTItem) palloc(item_nbytes);
|
|
||||||
memmove((char *) item_save, (char *) item, item_nbytes);
|
|
||||||
stack->bts_btitem = item_save;
|
|
||||||
return buf;
|
|
||||||
}
|
|
||||||
i = OffsetNumberNext(stack->bts_offset);
|
|
||||||
}
|
|
||||||
|
|
||||||
/* if the item has just moved right on this page, we're done */
|
|
||||||
for (;
|
|
||||||
i <= maxoff;
|
|
||||||
i = OffsetNumberNext(i))
|
|
||||||
{
|
|
||||||
itemid = PageGetItemId(page, i);
|
|
||||||
item = (BTItem) PageGetItem(page, itemid);
|
|
||||||
|
|
||||||
/* if the item is where we left it, we're done */
|
|
||||||
if (BTItemSame(item, stack->bts_btitem))
|
|
||||||
{
|
|
||||||
stack->bts_offset = i;
|
|
||||||
pfree(stack->bts_btitem);
|
|
||||||
item_nbytes = ItemIdGetLength(itemid);
|
|
||||||
item_save = (BTItem) palloc(item_nbytes);
|
|
||||||
memmove((char *) item_save, (char *) item, item_nbytes);
|
|
||||||
stack->bts_btitem = item_save;
|
|
||||||
return buf;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
/* by here, the item we're looking for moved right at least one page */
|
|
||||||
for (;;)
|
|
||||||
{
|
|
||||||
blkno = opaque->btpo_next;
|
|
||||||
if (P_RIGHTMOST(opaque))
|
|
||||||
elog(FATAL, "my bits moved right off the end of the world!\
|
|
||||||
\n\tRecreate index %s.", RelationGetRelationName(rel));
|
|
||||||
|
|
||||||
_bt_relbuf(rel, buf, access);
|
|
||||||
buf = _bt_getbuf(rel, blkno, access);
|
|
||||||
page = BufferGetPage(buf);
|
|
||||||
maxoff = PageGetMaxOffsetNumber(page);
|
|
||||||
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
|
|
||||||
|
|
||||||
/* if we have a right sibling, step over the high key */
|
|
||||||
start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
|
|
||||||
|
|
||||||
/* see if it's on this page */
|
|
||||||
for (offnum = start;
|
|
||||||
offnum <= maxoff;
|
|
||||||
offnum = OffsetNumberNext(offnum))
|
|
||||||
{
|
|
||||||
itemid = PageGetItemId(page, offnum);
|
|
||||||
item = (BTItem) PageGetItem(page, itemid);
|
|
||||||
if (BTItemSame(item, stack->bts_btitem))
|
|
||||||
{
|
|
||||||
stack->bts_offset = offnum;
|
|
||||||
stack->bts_blkno = blkno;
|
|
||||||
pfree(stack->bts_btitem);
|
|
||||||
item_nbytes = ItemIdGetLength(itemid);
|
|
||||||
item_save = (BTItem) palloc(item_nbytes);
|
|
||||||
memmove((char *) item_save, (char *) item, item_nbytes);
|
|
||||||
stack->bts_btitem = item_save;
|
|
||||||
return buf;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
void
|
void
|
||||||
_bt_pagedel(Relation rel, ItemPointer tid)
|
_bt_pagedel(Relation rel, ItemPointer tid)
|
||||||
{
|
{
|
||||||
|
@ -12,7 +12,7 @@
|
|||||||
* Portions Copyright (c) 1994, Regents of the University of California
|
* Portions Copyright (c) 1994, Regents of the University of California
|
||||||
*
|
*
|
||||||
* IDENTIFICATION
|
* IDENTIFICATION
|
||||||
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.61 2000/07/14 22:17:33 tgl Exp $
|
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.62 2000/07/21 06:42:32 tgl Exp $
|
||||||
*
|
*
|
||||||
*-------------------------------------------------------------------------
|
*-------------------------------------------------------------------------
|
||||||
*/
|
*/
|
||||||
@ -26,6 +26,7 @@
|
|||||||
#include "executor/executor.h"
|
#include "executor/executor.h"
|
||||||
#include "miscadmin.h"
|
#include "miscadmin.h"
|
||||||
|
|
||||||
|
|
||||||
bool BuildingBtree = false; /* see comment in btbuild() */
|
bool BuildingBtree = false; /* see comment in btbuild() */
|
||||||
bool FastBuild = true; /* use sort/build instead of insertion
|
bool FastBuild = true; /* use sort/build instead of insertion
|
||||||
* build */
|
* build */
|
||||||
@ -206,8 +207,8 @@ btbuild(PG_FUNCTION_ARGS)
|
|||||||
* btree pages - NULLs greater NOT_NULLs and NULL = NULL is TRUE.
|
* btree pages - NULLs greater NOT_NULLs and NULL = NULL is TRUE.
|
||||||
* Sure, it's just rule for placing/finding items and no more -
|
* Sure, it's just rule for placing/finding items and no more -
|
||||||
* keytest'll return FALSE for a = 5 for items having 'a' isNULL.
|
* keytest'll return FALSE for a = 5 for items having 'a' isNULL.
|
||||||
* Look at _bt_skeycmp, _bt_compare and _bt_itemcmp for how it
|
* Look at _bt_compare for how it works.
|
||||||
* works. - vadim 03/23/97
|
* - vadim 03/23/97
|
||||||
*
|
*
|
||||||
* if (itup->t_info & INDEX_NULL_MASK) { pfree(itup); continue; }
|
* if (itup->t_info & INDEX_NULL_MASK) { pfree(itup); continue; }
|
||||||
*/
|
*/
|
||||||
@ -321,14 +322,6 @@ btinsert(PG_FUNCTION_ARGS)
|
|||||||
/* generate an index tuple */
|
/* generate an index tuple */
|
||||||
itup = index_formtuple(RelationGetDescr(rel), datum, nulls);
|
itup = index_formtuple(RelationGetDescr(rel), datum, nulls);
|
||||||
itup->t_tid = *ht_ctid;
|
itup->t_tid = *ht_ctid;
|
||||||
|
|
||||||
/*
|
|
||||||
* See comments in btbuild.
|
|
||||||
*
|
|
||||||
* if (itup->t_info & INDEX_NULL_MASK)
|
|
||||||
* PG_RETURN_POINTER((InsertIndexResult) NULL);
|
|
||||||
*/
|
|
||||||
|
|
||||||
btitem = _bt_formitem(itup);
|
btitem = _bt_formitem(itup);
|
||||||
|
|
||||||
res = _bt_doinsert(rel, btitem, rel->rd_uniqueindex, heapRel);
|
res = _bt_doinsert(rel, btitem, rel->rd_uniqueindex, heapRel);
|
||||||
@ -357,10 +350,10 @@ btgettuple(PG_FUNCTION_ARGS)
|
|||||||
|
|
||||||
if (ItemPointerIsValid(&(scan->currentItemData)))
|
if (ItemPointerIsValid(&(scan->currentItemData)))
|
||||||
{
|
{
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Restore scan position using heap TID returned by previous call
|
* Restore scan position using heap TID returned by previous call
|
||||||
* to btgettuple(). _bt_restscan() locks buffer.
|
* to btgettuple(). _bt_restscan() re-grabs the read lock on
|
||||||
|
* the buffer, too.
|
||||||
*/
|
*/
|
||||||
_bt_restscan(scan);
|
_bt_restscan(scan);
|
||||||
res = _bt_next(scan, dir);
|
res = _bt_next(scan, dir);
|
||||||
@ -369,8 +362,9 @@ btgettuple(PG_FUNCTION_ARGS)
|
|||||||
res = _bt_first(scan, dir);
|
res = _bt_first(scan, dir);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Save heap TID to use it in _bt_restscan. Unlock buffer before
|
* Save heap TID to use it in _bt_restscan. Then release the read
|
||||||
* leaving index !
|
* lock on the buffer so that we aren't blocking other backends.
|
||||||
|
* NOTE: we do keep the pin on the buffer!
|
||||||
*/
|
*/
|
||||||
if (res)
|
if (res)
|
||||||
{
|
{
|
||||||
@ -419,22 +413,6 @@ btrescan(PG_FUNCTION_ARGS)
|
|||||||
|
|
||||||
so = (BTScanOpaque) scan->opaque;
|
so = (BTScanOpaque) scan->opaque;
|
||||||
|
|
||||||
/* we don't hold a read lock on the current page in the scan */
|
|
||||||
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
|
|
||||||
{
|
|
||||||
ReleaseBuffer(so->btso_curbuf);
|
|
||||||
so->btso_curbuf = InvalidBuffer;
|
|
||||||
ItemPointerSetInvalid(iptr);
|
|
||||||
}
|
|
||||||
|
|
||||||
/* and we don't hold a read lock on the last marked item in the scan */
|
|
||||||
if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
|
|
||||||
{
|
|
||||||
ReleaseBuffer(so->btso_mrkbuf);
|
|
||||||
so->btso_mrkbuf = InvalidBuffer;
|
|
||||||
ItemPointerSetInvalid(iptr);
|
|
||||||
}
|
|
||||||
|
|
||||||
if (so == NULL) /* if called from btbeginscan */
|
if (so == NULL) /* if called from btbeginscan */
|
||||||
{
|
{
|
||||||
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
|
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
|
||||||
@ -446,6 +424,21 @@ btrescan(PG_FUNCTION_ARGS)
|
|||||||
scan->flags = 0x0;
|
scan->flags = 0x0;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/* we aren't holding any read locks, but gotta drop the pins */
|
||||||
|
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
|
||||||
|
{
|
||||||
|
ReleaseBuffer(so->btso_curbuf);
|
||||||
|
so->btso_curbuf = InvalidBuffer;
|
||||||
|
ItemPointerSetInvalid(iptr);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
|
||||||
|
{
|
||||||
|
ReleaseBuffer(so->btso_mrkbuf);
|
||||||
|
so->btso_mrkbuf = InvalidBuffer;
|
||||||
|
ItemPointerSetInvalid(iptr);
|
||||||
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Reset the scan keys. Note that keys ordering stuff moved to
|
* Reset the scan keys. Note that keys ordering stuff moved to
|
||||||
* _bt_first. - vadim 05/05/97
|
* _bt_first. - vadim 05/05/97
|
||||||
@ -472,7 +465,7 @@ btmovescan(IndexScanDesc scan, Datum v)
|
|||||||
|
|
||||||
so = (BTScanOpaque) scan->opaque;
|
so = (BTScanOpaque) scan->opaque;
|
||||||
|
|
||||||
/* we don't hold a read lock on the current page in the scan */
|
/* we aren't holding any read locks, but gotta drop the pin */
|
||||||
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
|
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
|
||||||
{
|
{
|
||||||
ReleaseBuffer(so->btso_curbuf);
|
ReleaseBuffer(so->btso_curbuf);
|
||||||
@ -480,7 +473,6 @@ btmovescan(IndexScanDesc scan, Datum v)
|
|||||||
ItemPointerSetInvalid(iptr);
|
ItemPointerSetInvalid(iptr);
|
||||||
}
|
}
|
||||||
|
|
||||||
/* scan->keyData[0].sk_argument = v; */
|
|
||||||
so->keyData[0].sk_argument = v;
|
so->keyData[0].sk_argument = v;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -496,7 +488,7 @@ btendscan(PG_FUNCTION_ARGS)
|
|||||||
|
|
||||||
so = (BTScanOpaque) scan->opaque;
|
so = (BTScanOpaque) scan->opaque;
|
||||||
|
|
||||||
/* we don't hold any read locks */
|
/* we aren't holding any read locks, but gotta drop the pins */
|
||||||
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
|
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
|
||||||
{
|
{
|
||||||
if (BufferIsValid(so->btso_curbuf))
|
if (BufferIsValid(so->btso_curbuf))
|
||||||
@ -534,7 +526,7 @@ btmarkpos(PG_FUNCTION_ARGS)
|
|||||||
|
|
||||||
so = (BTScanOpaque) scan->opaque;
|
so = (BTScanOpaque) scan->opaque;
|
||||||
|
|
||||||
/* we don't hold any read locks */
|
/* we aren't holding any read locks, but gotta drop the pin */
|
||||||
if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
|
if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
|
||||||
{
|
{
|
||||||
ReleaseBuffer(so->btso_mrkbuf);
|
ReleaseBuffer(so->btso_mrkbuf);
|
||||||
@ -542,7 +534,7 @@ btmarkpos(PG_FUNCTION_ARGS)
|
|||||||
ItemPointerSetInvalid(iptr);
|
ItemPointerSetInvalid(iptr);
|
||||||
}
|
}
|
||||||
|
|
||||||
/* bump pin on current buffer */
|
/* bump pin on current buffer for assignment to mark buffer */
|
||||||
if (ItemPointerIsValid(&(scan->currentItemData)))
|
if (ItemPointerIsValid(&(scan->currentItemData)))
|
||||||
{
|
{
|
||||||
so->btso_mrkbuf = ReadBuffer(scan->relation,
|
so->btso_mrkbuf = ReadBuffer(scan->relation,
|
||||||
@ -566,7 +558,7 @@ btrestrpos(PG_FUNCTION_ARGS)
|
|||||||
|
|
||||||
so = (BTScanOpaque) scan->opaque;
|
so = (BTScanOpaque) scan->opaque;
|
||||||
|
|
||||||
/* we don't hold any read locks */
|
/* we aren't holding any read locks, but gotta drop the pin */
|
||||||
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
|
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
|
||||||
{
|
{
|
||||||
ReleaseBuffer(so->btso_curbuf);
|
ReleaseBuffer(so->btso_curbuf);
|
||||||
@ -579,7 +571,6 @@ btrestrpos(PG_FUNCTION_ARGS)
|
|||||||
{
|
{
|
||||||
so->btso_curbuf = ReadBuffer(scan->relation,
|
so->btso_curbuf = ReadBuffer(scan->relation,
|
||||||
BufferGetBlockNumber(so->btso_mrkbuf));
|
BufferGetBlockNumber(so->btso_mrkbuf));
|
||||||
|
|
||||||
scan->currentItemData = scan->currentMarkData;
|
scan->currentItemData = scan->currentMarkData;
|
||||||
so->curHeapIptr = so->mrkHeapIptr;
|
so->curHeapIptr = so->mrkHeapIptr;
|
||||||
}
|
}
|
||||||
@ -603,6 +594,9 @@ btdelete(PG_FUNCTION_ARGS)
|
|||||||
PG_RETURN_VOID();
|
PG_RETURN_VOID();
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Restore scan position when btgettuple is called to continue a scan.
|
||||||
|
*/
|
||||||
static void
|
static void
|
||||||
_bt_restscan(IndexScanDesc scan)
|
_bt_restscan(IndexScanDesc scan)
|
||||||
{
|
{
|
||||||
@ -618,7 +612,12 @@ _bt_restscan(IndexScanDesc scan)
|
|||||||
BTItem item;
|
BTItem item;
|
||||||
BlockNumber blkno;
|
BlockNumber blkno;
|
||||||
|
|
||||||
LockBuffer(buf, BT_READ); /* lock buffer first! */
|
/*
|
||||||
|
* Get back the read lock we were holding on the buffer.
|
||||||
|
* (We still have a reference-count pin on it, though.)
|
||||||
|
*/
|
||||||
|
LockBuffer(buf, BT_READ);
|
||||||
|
|
||||||
page = BufferGetPage(buf);
|
page = BufferGetPage(buf);
|
||||||
maxoff = PageGetMaxOffsetNumber(page);
|
maxoff = PageGetMaxOffsetNumber(page);
|
||||||
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
|
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
|
||||||
@ -631,43 +630,40 @@ _bt_restscan(IndexScanDesc scan)
|
|||||||
*/
|
*/
|
||||||
if (!ItemPointerIsValid(&target))
|
if (!ItemPointerIsValid(&target))
|
||||||
{
|
{
|
||||||
ItemPointerSetOffsetNumber(&(scan->currentItemData),
|
ItemPointerSetOffsetNumber(current,
|
||||||
OffsetNumberPrev(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY));
|
OffsetNumberPrev(P_FIRSTDATAKEY(opaque)));
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
|
||||||
if (maxoff >= offnum)
|
/*
|
||||||
|
* The item we were on may have moved right due to insertions.
|
||||||
|
* Find it again.
|
||||||
|
*/
|
||||||
|
for (;;)
|
||||||
{
|
{
|
||||||
|
/* Check for item on this page */
|
||||||
/*
|
|
||||||
* if the item is where we left it or has just moved right on this
|
|
||||||
* page, we're done
|
|
||||||
*/
|
|
||||||
for (;
|
for (;
|
||||||
offnum <= maxoff;
|
offnum <= maxoff;
|
||||||
offnum = OffsetNumberNext(offnum))
|
offnum = OffsetNumberNext(offnum))
|
||||||
{
|
{
|
||||||
item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
|
item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
|
||||||
if (item->bti_itup.t_tid.ip_blkid.bi_hi == \
|
if (item->bti_itup.t_tid.ip_blkid.bi_hi ==
|
||||||
target.ip_blkid.bi_hi && \
|
target.ip_blkid.bi_hi &&
|
||||||
item->bti_itup.t_tid.ip_blkid.bi_lo == \
|
item->bti_itup.t_tid.ip_blkid.bi_lo ==
|
||||||
target.ip_blkid.bi_lo && \
|
target.ip_blkid.bi_lo &&
|
||||||
item->bti_itup.t_tid.ip_posid == target.ip_posid)
|
item->bti_itup.t_tid.ip_posid == target.ip_posid)
|
||||||
{
|
{
|
||||||
current->ip_posid = offnum;
|
current->ip_posid = offnum;
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* By here, the item we're looking for moved right at least one page
|
* By here, the item we're looking for moved right at least one page
|
||||||
*/
|
*/
|
||||||
for (;;)
|
|
||||||
{
|
|
||||||
if (P_RIGHTMOST(opaque))
|
if (P_RIGHTMOST(opaque))
|
||||||
elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!\
|
elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!"
|
||||||
\n\tRecreate index %s.", RelationGetRelationName(rel));
|
"\n\tRecreate index %s.", RelationGetRelationName(rel));
|
||||||
|
|
||||||
blkno = opaque->btpo_next;
|
blkno = opaque->btpo_next;
|
||||||
_bt_relbuf(rel, buf, BT_READ);
|
_bt_relbuf(rel, buf, BT_READ);
|
||||||
@ -675,23 +671,8 @@ _bt_restscan(IndexScanDesc scan)
|
|||||||
page = BufferGetPage(buf);
|
page = BufferGetPage(buf);
|
||||||
maxoff = PageGetMaxOffsetNumber(page);
|
maxoff = PageGetMaxOffsetNumber(page);
|
||||||
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
|
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
|
||||||
|
offnum = P_FIRSTDATAKEY(opaque);
|
||||||
/* see if it's on this page */
|
ItemPointerSet(current, blkno, offnum);
|
||||||
for (offnum = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
|
so->btso_curbuf = buf;
|
||||||
offnum <= maxoff;
|
|
||||||
offnum = OffsetNumberNext(offnum))
|
|
||||||
{
|
|
||||||
item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
|
|
||||||
if (item->bti_itup.t_tid.ip_blkid.bi_hi == \
|
|
||||||
target.ip_blkid.bi_hi && \
|
|
||||||
item->bti_itup.t_tid.ip_blkid.bi_lo == \
|
|
||||||
target.ip_blkid.bi_lo && \
|
|
||||||
item->bti_itup.t_tid.ip_posid == target.ip_posid)
|
|
||||||
{
|
|
||||||
ItemPointerSet(current, blkno, offnum);
|
|
||||||
so->btso_curbuf = buf;
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
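The rewritten _bt_restscan() loop above searches the current page from the remembered offset and, if the item is not there, follows the right-sibling link and restarts at the first data key of the next page. A minimal sketch of that "items only move right" search follows, assuming toy in-memory pages. ToyPage, toy_refind and the integer TIDs are hypothetical names for illustration, not part of the nbtree code, and no buffer locking or pinning is modeled.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ITEMS 8
#define TOY_NO_PAGE   (-1)

/* Hypothetical stand-in for a btree leaf page: heap TIDs reduced to ints. */
typedef struct ToyPage
{
    int tids[TOY_MAX_ITEMS];
    int nitems;
    int next;               /* index of right sibling, or TOY_NO_PAGE */
} ToyPage;

/*
 * Re-find `target`, starting at (*pageno, *offset) and moving only right.
 * Returns 1 and updates the position on success, 0 if we ran off the
 * rightmost page (the code in the diff raises a FATAL error at that point).
 */
int
toy_refind(const ToyPage *pages, int *pageno, int *offset, int target)
{
    int p;
    int off;

    assert(pages != NULL && pageno != NULL && offset != NULL);
    p = *pageno;
    off = *offset;

    for (;;)
    {
        /* Check for the item on this page, from `off` rightwards. */
        for (; off < pages[p].nitems; off++)
        {
            if (pages[p].tids[off] == target)
            {
                *pageno = p;
                *offset = off;
                return 1;
            }
        }

        /* Not here, so it moved right at least one page. */
        if (pages[p].next == TOY_NO_PAGE)
            return 0;
        p = pages[p].next;
        off = 0;            /* restart at the first data item */
    }
}
```

Using a single outer loop for both the starting page and every later page is the same simplification the new code makes: the old version duplicated the per-page search once for the current page and once for the siblings.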
|
@ -8,22 +8,25 @@
|
|||||||
*
|
*
|
||||||
*
|
*
|
||||||
* IDENTIFICATION
|
* IDENTIFICATION
|
||||||
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.31 2000/04/12 17:14:49 momjian Exp $
|
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.32 2000/07/21 06:42:32 tgl Exp $
|
||||||
*
|
*
|
||||||
*
|
*
|
||||||
* NOTES
|
* NOTES
|
||||||
* Because we can be doing an index scan on a relation while we update
|
* Because we can be doing an index scan on a relation while we update
|
||||||
* it, we need to avoid missing data that moves around in the index.
|
* it, we need to avoid missing data that moves around in the index.
|
||||||
* The routines and global variables in this file guarantee that all
|
* Insertions and page splits are no problem because _bt_restscan()
|
||||||
* scans in the local address space stay correctly positioned. This
|
* can figure out where the current item moved to, but if a deletion
|
||||||
* is all we need to worry about, since write locking guarantees that
|
* happens at or before the current scan position, we'd better do
|
||||||
* no one else will be on the same page at the same time as we are.
|
* something to stay in sync.
|
||||||
|
*
|
||||||
|
* The routines in this file handle the problem for deletions issued
|
||||||
|
* by the current backend. Currently, that's all we need, since
|
||||||
|
* deletions are only done by VACUUM and it gets an exclusive lock.
|
||||||
*
|
*
|
||||||
* The scheme is to manage a list of active scans in the current backend.
|
* The scheme is to manage a list of active scans in the current backend.
|
||||||
* Whenever we add or remove records from an index, or whenever we
|
* Whenever we remove a record from an index, we check the list of active
|
||||||
* split a leaf page, we check the list of active scans to see if any
|
* scans to see if any has been affected. A scan is affected only if it
|
||||||
* has been affected. A scan is affected only if it is on the same
|
* is on the same relation, and the same page, as the update.
|
||||||
* relation, and the same page, as the update.
|
|
||||||
*
|
*
|
||||||
*-------------------------------------------------------------------------
|
*-------------------------------------------------------------------------
|
||||||
*/
|
*/
|
||||||
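The NOTES comment above describes keeping a list of active scans in the current backend and checking it whenever a record is removed from an index. A rough sketch of that bookkeeping is given below, under simplifying assumptions: ToyScan and toy_adjscans are hypothetical names, a scan position is reduced to one 0-based slot, and stepping back is a plain decrement rather than a call to _bt_step().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified scan position: one index page slot. */
typedef struct ToyScan
{
    int relid;              /* which index the scan is on */
    int blkno;              /* page the scan is positioned on */
    int offset;             /* slot on that page; -1 means "before first" */
    struct ToyScan *next;   /* next active scan in this backend */
} ToyScan;

/*
 * Compensate every active scan for the deletion of slot `offno` on page
 * `blkno` of index `relid`.  Returns the number of scans adjusted.
 */
int
toy_adjscans(ToyScan *scans, int relid, int blkno, int offno)
{
    int adjusted = 0;
    ToyScan *s;

    assert(offno >= 0);

    for (s = scans; s != NULL; s = s->next)
    {
        /* Only scans on the same relation and page can be affected. */
        if (s->relid != relid || s->blkno != blkno)
            continue;
        /* Slots to the left of the deleted one do not move. */
        if (s->offset < offno)
            continue;
        /* Step back one slot: items at or after offno slid left by one. */
        s->offset--;
        adjusted++;
    }
    return adjusted;
}
```

The check against `offno` corresponds to the `ItemPointerGetOffsetNumber(current) >= offno` condition visible in _bt_scandel(); everything else here is illustrative only.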
@ -111,7 +114,7 @@ _bt_dropscan(IndexScanDesc scan)
|
|||||||
|
|
||||||
/*
|
/*
|
||||||
* _bt_adjscans() -- adjust all scans in the scan list to compensate
|
* _bt_adjscans() -- adjust all scans in the scan list to compensate
|
||||||
* for a given deletion or insertion
|
* for a given deletion
|
||||||
*/
|
*/
|
||||||
void
|
void
|
||||||
_bt_adjscans(Relation rel, ItemPointer tid)
|
_bt_adjscans(Relation rel, ItemPointer tid)
|
||||||
@ -153,7 +156,7 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
|
|||||||
{
|
{
|
||||||
page = BufferGetPage(buf);
|
page = BufferGetPage(buf);
|
||||||
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
|
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
|
||||||
start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
|
start = P_FIRSTDATAKEY(opaque);
|
||||||
if (ItemPointerGetOffsetNumber(current) == start)
|
if (ItemPointerGetOffsetNumber(current) == start)
|
||||||
ItemPointerSetInvalid(&(so->curHeapIptr));
|
ItemPointerSetInvalid(&(so->curHeapIptr));
|
||||||
else
|
else
|
||||||
@ -165,7 +168,6 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
|
|||||||
*/
|
*/
|
||||||
LockBuffer(buf, BT_READ);
|
LockBuffer(buf, BT_READ);
|
||||||
_bt_step(scan, &buf, BackwardScanDirection);
|
_bt_step(scan, &buf, BackwardScanDirection);
|
||||||
so->btso_curbuf = buf;
|
|
||||||
if (ItemPointerIsValid(current))
|
if (ItemPointerIsValid(current))
|
||||||
{
|
{
|
||||||
Page pg = BufferGetPage(buf);
|
Page pg = BufferGetPage(buf);
|
||||||
@ -183,10 +185,9 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
|
|||||||
&& ItemPointerGetBlockNumber(current) == blkno
|
&& ItemPointerGetBlockNumber(current) == blkno
|
||||||
&& ItemPointerGetOffsetNumber(current) >= offno)
|
&& ItemPointerGetOffsetNumber(current) >= offno)
|
||||||
{
|
{
|
||||||
|
|
||||||
page = BufferGetPage(so->btso_mrkbuf);
|
page = BufferGetPage(so->btso_mrkbuf);
|
||||||
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
|
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
|
||||||
start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
|
start = P_FIRSTDATAKEY(opaque);
|
||||||
|
|
||||||
if (ItemPointerGetOffsetNumber(current) == start)
|
if (ItemPointerGetOffsetNumber(current) == start)
|
||||||
ItemPointerSetInvalid(&(so->mrkHeapIptr));
|
ItemPointerSetInvalid(&(so->mrkHeapIptr));
|
||||||
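Several hunks above replace the expression `P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY` with `P_FIRSTDATAKEY(opaque)`. The sketch below shows what such a macro could look like, assuming toy definitions: the TOY_ names, the numeric values, and the ToyOpaque struct are placeholders chosen for illustration, and the real definitions live in the nbtree headers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy values and names, assumed for the sketch. */
#define TOY_P_NONE      0   /* "no sibling" block number */
#define TOY_P_HIKEY     1   /* first line pointer slot */
#define TOY_P_FIRSTKEY  2   /* slot right after the high key */

typedef struct ToyOpaque
{
    unsigned next;          /* right sibling block, or TOY_P_NONE */
} ToyOpaque;

/* A page with no right sibling is the rightmost page on its level. */
#define TOY_P_RIGHTMOST(o)    ((o)->next == TOY_P_NONE)

/*
 * Rightmost pages store no high key, so their data items start in the
 * slot that would otherwise hold it.
 */
#define TOY_P_FIRSTDATAKEY(o) \
    (TOY_P_RIGHTMOST(o) ? TOY_P_HIKEY : TOY_P_FIRSTKEY)

int
toy_first_data_offset(const ToyOpaque *opaque)
{
    assert(opaque != NULL);
    return TOY_P_FIRSTDATAKEY(opaque);
}
```

Naming the expression once keeps the "rightmost pages start data items at P_HIKEY" convention in a single place instead of repeating the conditional at every call site.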
|
File diff suppressed because it is too large
@ -6,8 +6,12 @@
|
|||||||
*
|
*
|
||||||
* We use tuplesort.c to sort the given index tuples into order.
|
* We use tuplesort.c to sort the given index tuples into order.
|
||||||
* Then we scan the index tuples in order and build the btree pages
|
* Then we scan the index tuples in order and build the btree pages
|
||||||
* for each level. When we have only one page on a level, it must be the
|
* for each level. We load source tuples into leaf-level pages.
|
||||||
* root -- it can be attached to the btree metapage and we are done.
|
* Whenever we fill a page at one level, we add a link to it to its
|
||||||
|
* parent level (starting a new parent level if necessary). When
|
||||||
|
* done, we write out each final page on each level, adding it to
|
||||||
|
* its parent level. When we have only one page on a level, it must be
|
||||||
|
* the root -- it can be attached to the btree metapage and we are done.
|
||||||
*
|
*
|
||||||
* this code is moderately slow (~10% slower) compared to the regular
|
* this code is moderately slow (~10% slower) compared to the regular
|
||||||
* btree (insertion) build code on sorted or well-clustered data. on
|
* btree (insertion) build code on sorted or well-clustered data. on
|
||||||
@ -23,12 +27,20 @@
|
|||||||
* something like the standard 70% steady-state load factor for btrees
|
* something like the standard 70% steady-state load factor for btrees
|
||||||
* would probably be better.
|
* would probably be better.
|
||||||
*
|
*
|
||||||
|
* Another limitation is that we currently load full copies of all keys
|
||||||
|
* into upper tree levels. The leftmost data key in each non-leaf node
|
||||||
|
* could be omitted as far as normal btree operations are concerned
|
||||||
|
* (see README for more info). However, because we build the tree from
|
||||||
|
* the bottom up, we need that data key to insert into the node's parent.
|
||||||
|
* This could be fixed by keeping a spare copy of the minimum key in the
|
||||||
|
* state stack, but I haven't time for that right now.
|
||||||
|
*
|
||||||
*
|
*
|
||||||
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc
|
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc
|
||||||
* Portions Copyright (c) 1994, Regents of the University of California
|
* Portions Copyright (c) 1994, Regents of the University of California
|
||||||
*
|
*
|
||||||
* IDENTIFICATION
|
* IDENTIFICATION
|
||||||
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.54 2000/06/15 04:09:36 momjian Exp $
|
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.55 2000/07/21 06:42:33 tgl Exp $
|
||||||
*
|
*
|
||||||
*-------------------------------------------------------------------------
|
*-------------------------------------------------------------------------
|
||||||
*/
|
*/
|
||||||
@ -57,6 +69,20 @@ struct BTSpool
|
|||||||
bool isunique;
|
bool isunique;
|
||||||
};
|
};
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Status record for a btree page being built. We have one of these
|
||||||
|
* for each active tree level.
|
||||||
|
*/
|
||||||
|
typedef struct BTPageState
|
||||||
|
{
|
||||||
|
Buffer btps_buf; /* current buffer & page */
|
||||||
|
Page btps_page;
|
||||||
|
OffsetNumber btps_lastoff; /* last item offset loaded */
|
||||||
|
int btps_level;
|
||||||
|
struct BTPageState *btps_next; /* link to parent level, if any */
|
||||||
|
} BTPageState;
|
||||||
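The nbtsort.c header comment describes a bottom-up build: tuples go into leaf pages, each filled page adds a link to its parent level (starting a new parent level if necessary), and at the end the final page on each level is linked upward until a level with a single page becomes the root. The following is a minimal sketch of that control flow only, assuming a fixed number of items per page and counting items instead of storing keys. ToyLevel and the toy_ functions are hypothetical and merely echo the shape of BTPageState, _bt_buildadd and _bt_uppershutdown.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_CAPACITY 4      /* items per page; arbitrary for the sketch */

/* Hypothetical per-level build state, loosely modeled on BTPageState. */
typedef struct ToyLevel
{
    int nitems;             /* items on the page currently being filled */
    int npages;             /* pages already completed on this level */
    int level;              /* 0 = leaf */
    struct ToyLevel *parent;    /* link to parent level, if any */
} ToyLevel;

ToyLevel *
toy_newlevel(int level)
{
    ToyLevel *l = calloc(1, sizeof(ToyLevel));

    assert(l != NULL);
    l->level = level;
    return l;
}

/* Add one item to a level, linking a full page into the parent level. */
void
toy_buildadd(ToyLevel *l)
{
    if (l->nitems == TOY_CAPACITY)
    {
        /* Page is full: start a parent level if necessary, link the page. */
        if (l->parent == NULL)
            l->parent = toy_newlevel(l->level + 1);
        toy_buildadd(l->parent);
        l->npages++;
        l->nitems = 0;      /* begin a fresh page on this level */
    }
    l->nitems++;
}

/*
 * Finish the build: link the final page of each level into its parent.
 * A level with no parent holds the single root page.  Frees the state
 * and returns the tree height (number of levels).
 */
int
toy_uppershutdown(ToyLevel *leaf)
{
    int height = 0;
    ToyLevel *l = leaf;

    while (l != NULL)
    {
        ToyLevel *parent = l->parent;

        if (parent != NULL)
            toy_buildadd(parent);   /* link the final page upward */
        height++;
        free(l);
        l = parent;
    }
    return height;
}

/* Load `nitems` items bottom-up and report the resulting tree height. */
int
toy_build_height(int nitems)
{
    ToyLevel *leaf = toy_newlevel(0);
    int i;

    for (i = 0; i < nitems; i++)
        toy_buildadd(leaf);
    return toy_uppershutdown(leaf);
}
```

With four items per page, four items fit in a single root leaf, while a fifth forces a second leaf page and therefore a parent level. The real loader decides fullness by free space in bytes and must also carry a key for each link, which this sketch deliberately leaves out.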
|
|
||||||
|
|
||||||
#define BTITEMSZ(btitem) \
|
#define BTITEMSZ(btitem) \
|
||||||
((btitem) ? \
|
((btitem) ? \
|
||||||
(IndexTupleDSize((btitem)->bti_itup) + \
|
(IndexTupleDSize((btitem)->bti_itup) + \
|
||||||
@ -65,13 +91,11 @@ struct BTSpool
|
|||||||
|
|
||||||
|
|
||||||
static void _bt_load(Relation index, BTSpool *btspool);
|
static void _bt_load(Relation index, BTSpool *btspool);
|
||||||
static BTItem _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
|
static void _bt_buildadd(Relation index, BTPageState *state,
|
||||||
BTPageState *state, BTItem bti, int flags);
|
BTItem bti, int flags);
|
||||||
static BTItem _bt_minitem(Page opage, BlockNumber oblkno, int atend);
|
static BTItem _bt_minitem(Page opage, BlockNumber oblkno, int atend);
|
||||||
static BTPageState *_bt_pagestate(Relation index, int flags,
|
static BTPageState *_bt_pagestate(Relation index, int flags, int level);
|
||||||
int level, bool doupper);
|
static void _bt_uppershutdown(Relation index, BTPageState *state);
|
||||||
static void _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
|
|
||||||
BTPageState *state);
|
|
||||||
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
@ -159,9 +183,6 @@ _bt_blnewpage(Relation index, Buffer *buf, Page *page, int flags)
|
|||||||
BTPageOpaque opaque;
|
BTPageOpaque opaque;
|
||||||
|
|
||||||
*buf = _bt_getbuf(index, P_NEW, BT_WRITE);
|
*buf = _bt_getbuf(index, P_NEW, BT_WRITE);
|
||||||
#ifdef NOT_USED
|
|
||||||
printf("\tblk=%d\n", BufferGetBlockNumber(*buf));
|
|
||||||
#endif
|
|
||||||
*page = BufferGetPage(*buf);
|
*page = BufferGetPage(*buf);
|
||||||
_bt_pageinit(*page, BufferGetPageSize(*buf));
|
_bt_pageinit(*page, BufferGetPageSize(*buf));
|
||||||
opaque = (BTPageOpaque) PageGetSpecialPointer(*page);
|
opaque = (BTPageOpaque) PageGetSpecialPointer(*page);
|
||||||
@ -202,18 +223,15 @@ _bt_slideleft(Relation index, Buffer buf, Page page)
|
|||||||
* is suitable for immediate use by _bt_buildadd.
|
* is suitable for immediate use by _bt_buildadd.
|
||||||
*/
|
*/
|
||||||
static BTPageState *
|
static BTPageState *
|
||||||
_bt_pagestate(Relation index, int flags, int level, bool doupper)
|
_bt_pagestate(Relation index, int flags, int level)
|
||||||
{
|
{
|
||||||
BTPageState *state = (BTPageState *) palloc(sizeof(BTPageState));
|
BTPageState *state = (BTPageState *) palloc(sizeof(BTPageState));
|
||||||
|
|
||||||
MemSet((char *) state, 0, sizeof(BTPageState));
|
MemSet((char *) state, 0, sizeof(BTPageState));
|
||||||
_bt_blnewpage(index, &(state->btps_buf), &(state->btps_page), flags);
|
_bt_blnewpage(index, &(state->btps_buf), &(state->btps_page), flags);
|
||||||
state->btps_firstoff = InvalidOffsetNumber;
|
|
||||||
state->btps_lastoff = P_HIKEY;
|
state->btps_lastoff = P_HIKEY;
|
||||||
state->btps_lastbti = (BTItem) NULL;
|
|
||||||
state->btps_next = (BTPageState *) NULL;
|
state->btps_next = (BTPageState *) NULL;
|
||||||
state->btps_level = level;
|
state->btps_level = level;
|
||||||
state->btps_doupper = doupper;
|
|
||||||
|
|
||||||
return state;
|
return state;
|
||||||
}
|
}
|
||||||
@ -240,31 +258,27 @@ _bt_minitem(Page opage, BlockNumber oblkno, int atend)
|
|||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* add an item to a disk page from a merge tape block.
|
* add an item to a disk page from the sort output.
|
||||||
*
|
*
|
||||||
* we must be careful to observe the following restrictions, placed
|
* we must be careful to observe the following restrictions, placed
|
||||||
* upon us by the conventions in nbtsearch.c:
|
* upon us by the conventions in nbtsearch.c:
|
||||||
* - rightmost pages start data items at P_HIKEY instead of at
|
* - rightmost pages start data items at P_HIKEY instead of at
|
||||||
* P_FIRSTKEY.
|
* P_FIRSTKEY.
|
||||||
* - duplicates cannot be split among pages unless the chain of
|
|
||||||
* duplicates starts at the first data item.
|
|
||||||
*
|
*
|
||||||
* a leaf page being built looks like:
|
* a leaf page being built looks like:
|
||||||
*
|
*
|
||||||
* +----------------+---------------------------------+
|
* +----------------+---------------------------------+
|
||||||
* | PageHeaderData | linp0 linp1 linp2 ... |
|
* | PageHeaderData | linp0 linp1 linp2 ... |
|
||||||
* +-----------+----+---------------------------------+
|
* +-----------+----+---------------------------------+
|
||||||
* | ... linpN | ^ first |
|
* | ... linpN | |
|
||||||
* +-----------+--------------------------------------+
|
* +-----------+--------------------------------------+
|
||||||
* | ^ last |
|
* | ^ last |
|
||||||
* | |
|
* | |
|
||||||
* | v last |
|
|
||||||
* +-------------+------------------------------------+
|
* +-------------+------------------------------------+
|
||||||
* | | itemN ... |
|
* | | itemN ... |
|
||||||
* +-------------+------------------+-----------------+
|
* +-------------+------------------+-----------------+
|
||||||
* | ... item3 item2 item1 | "special space" |
|
* | ... item3 item2 item1 | "special space" |
|
||||||
* +--------------------------------+-----------------+
|
* +--------------------------------+-----------------+
|
||||||
* ^ first
|
|
||||||
*
|
*
|
||||||
* contrast this with the diagram in bufpage.h; note the mismatch
|
* contrast this with the diagram in bufpage.h; note the mismatch
|
||||||
* between linps and items. this is because we reserve linp0 as a
|
* between linps and items. this is because we reserve linp0 as a
|
||||||
@ -272,30 +286,20 @@ _bt_minitem(Page opage, BlockNumber oblkno, int atend)
|
|||||||
* filled up the page, we will set linp0 to point to itemN and clear
|
* filled up the page, we will set linp0 to point to itemN and clear
|
||||||
* linpN.
|
* linpN.
|
||||||
*
|
*
|
||||||
* 'last' pointers indicate the last offset/item added to the page.
|
* 'last' pointer indicates the last offset added to the page.
|
||||||
* 'first' pointers indicate the first offset/item that is part of a
|
|
||||||
* chain of duplicates extending from 'first' to 'last'.
|
|
||||||
*
|
|
||||||
* if all keys are unique, 'first' will always be the same as 'last'.
|
|
||||||
*/
|
*/
|
||||||
static BTItem
|
static void
|
||||||
_bt_buildadd(Relation index, Size keysz, ScanKey scankey,
|
_bt_buildadd(Relation index, BTPageState *state, BTItem bti, int flags)
|
||||||
BTPageState *state, BTItem bti, int flags)
|
|
||||||
{
|
{
|
||||||
Buffer nbuf;
|
Buffer nbuf;
|
||||||
Page npage;
|
Page npage;
|
||||||
BTItem last_bti;
|
|
||||||
OffsetNumber first_off;
|
|
||||||
OffsetNumber last_off;
|
OffsetNumber last_off;
|
||||||
OffsetNumber off;
|
|
||||||
Size pgspc;
|
Size pgspc;
|
||||||
Size btisz;
|
Size btisz;
|
||||||
|
|
||||||
nbuf = state->btps_buf;
|
nbuf = state->btps_buf;
|
||||||
npage = state->btps_page;
|
npage = state->btps_page;
|
||||||
first_off = state->btps_firstoff;
|
|
||||||
last_off = state->btps_lastoff;
|
last_off = state->btps_lastoff;
|
||||||
last_bti = state->btps_lastbti;
|
|
||||||
|
|
||||||
pgspc = PageGetFreeSpace(npage);
|
pgspc = PageGetFreeSpace(npage);
|
||||||
btisz = BTITEMSZ(bti);
|
btisz = BTITEMSZ(bti);
|
||||||
@ -319,75 +323,55 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
|
|||||||
|
|
||||||
if (pgspc < btisz)
|
if (pgspc < btisz)
|
||||||
{
|
{
|
||||||
|
/*
|
||||||
|
* Item won't fit on this page, so finish off the page and
|
||||||
|
* write it out.
|
||||||
|
*/
|
||||||
Buffer obuf = nbuf;
|
Buffer obuf = nbuf;
|
||||||
Page opage = npage;
|
Page opage = npage;
|
||||||
OffsetNumber o,
|
|
||||||
n;
|
|
||||||
ItemId ii;
|
ItemId ii;
|
||||||
ItemId hii;
|
ItemId hii;
|
||||||
|
BTItem nbti;
|
||||||
|
|
||||||
_bt_blnewpage(index, &nbuf, &npage, flags);
|
_bt_blnewpage(index, &nbuf, &npage, flags);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* if 'last' is part of a chain of duplicates that does not start
|
* We copy the last item on the page into the new page, and then
|
||||||
* at the beginning of the old page, the entire chain is copied to
|
* rearrange the old page so that the 'last item' becomes its high
|
||||||
* the new page; we delete all of the duplicates from the old page
|
* key rather than a true data item.
|
||||||
* except the first, which becomes the high key item of the old
|
|
||||||
* page.
|
|
||||||
*
|
*
|
||||||
* if the chain starts at the beginning of the page or there is no
|
* note that since we always copy an item to the new page,
|
||||||
* chain ('first' == 'last'), we need only copy 'last' to the new
|
* 'bti' will never be the first data item on the new page.
|
||||||
* page. again, 'first' (== 'last') becomes the high key of the
|
|
||||||
* old page.
|
|
||||||
*
|
|
||||||
* note that in either case, we copy at least one item to the new
|
|
||||||
* page, so 'last_bti' will always be valid. 'bti' will never be
|
|
||||||
* the first data item on the new page.
|
|
||||||
*/
|
*/
|
||||||
if (first_off == P_FIRSTKEY)
|
ii = PageGetItemId(opage, last_off);
|
||||||
{
|
if (PageAddItem(npage, PageGetItem(opage, ii), ii->lp_len,
|
||||||
Assert(last_off != P_FIRSTKEY);
|
P_FIRSTKEY, LP_USED) == InvalidOffsetNumber)
|
||||||
first_off = last_off;
|
elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
|
||||||
}
|
|
||||||
for (o = first_off, n = P_FIRSTKEY;
|
|
||||||
o <= last_off;
|
|
||||||
o = OffsetNumberNext(o), n = OffsetNumberNext(n))
|
|
||||||
{
|
|
||||||
ii = PageGetItemId(opage, o);
|
|
||||||
if (PageAddItem(npage, PageGetItem(opage, ii),
|
|
||||||
ii->lp_len, n, LP_USED) == InvalidOffsetNumber)
|
|
||||||
elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
|
|
||||||
#ifdef FASTBUILD_DEBUG
|
#ifdef FASTBUILD_DEBUG
|
||||||
{
|
{
|
||||||
bool isnull;
|
bool isnull;
|
||||||
BTItem tmpbti =
|
BTItem tmpbti =
|
||||||
(BTItem) PageGetItem(npage, PageGetItemId(npage, n));
|
(BTItem) PageGetItem(npage, PageGetItemId(npage, P_FIRSTKEY));
|
||||||
Datum d = index_getattr(&(tmpbti->bti_itup), 1,
|
Datum d = index_getattr(&(tmpbti->bti_itup), 1,
|
||||||
index->rd_att, &isnull);
|
index->rd_att, &isnull);
|
||||||
|
|
||||||
printf("_bt_buildadd: moved <%x> to offset %d at level %d\n",
|
printf("_bt_buildadd: moved <%x> to offset %d at level %d\n",
|
||||||
d, n, state->btps_level);
|
d, P_FIRSTKEY, state->btps_level);
|
||||||
}
|
|
||||||
#endif
|
|
||||||
}
|
}
|
||||||
|
#endif
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* this loop is backward because PageIndexTupleDelete shuffles the
|
* Move 'last' into the high key position on opage
|
||||||
* tuples to fill holes in the page -- by starting at the end and
|
|
||||||
* working back, we won't create holes (and thereby avoid
|
|
||||||
* shuffling).
|
|
||||||
*/
|
*/
|
||||||
for (o = last_off; o > first_off; o = OffsetNumberPrev(o))
|
|
||||||
PageIndexTupleDelete(opage, o);
|
|
||||||
hii = PageGetItemId(opage, P_HIKEY);
|
hii = PageGetItemId(opage, P_HIKEY);
|
||||||
ii = PageGetItemId(opage, first_off);
|
|
||||||
*hii = *ii;
|
*hii = *ii;
|
||||||
ii->lp_flags &= ~LP_USED;
|
ii->lp_flags &= ~LP_USED;
|
||||||
((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
|
((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
|
||||||
|
|
||||||
first_off = P_FIRSTKEY;
|
/*
|
||||||
|
* Reset last_off to point to new page
|
||||||
|
*/
|
||||||
last_off = PageGetMaxOffsetNumber(npage);
|
last_off = PageGetMaxOffsetNumber(npage);
|
||||||
last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, last_off));
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* set the page (side link) pointers.
|
* set the page (side link) pointers.
|
||||||
@ -399,32 +383,21 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
|
|||||||
oopaque->btpo_next = BufferGetBlockNumber(nbuf);
|
oopaque->btpo_next = BufferGetBlockNumber(nbuf);
|
||||||
nopaque->btpo_prev = BufferGetBlockNumber(obuf);
|
nopaque->btpo_prev = BufferGetBlockNumber(obuf);
|
||||||
nopaque->btpo_next = P_NONE;
|
nopaque->btpo_next = P_NONE;
|
||||||
|
|
||||||
if (_bt_itemcmp(index, keysz, scankey,
|
|
||||||
(BTItem) PageGetItem(opage, PageGetItemId(opage, P_HIKEY)),
|
|
||||||
(BTItem) PageGetItem(opage, PageGetItemId(opage, P_FIRSTKEY)),
|
|
||||||
BTEqualStrategyNumber))
|
|
||||||
oopaque->btpo_flags |= BTP_CHAIN;
|
|
||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* copy the old buffer's minimum key to its parent. if we don't
|
* Link the old buffer into its parent, using its minimum key.
|
||||||
* have a parent, we have to create one; this adds a new btree
|
* If we don't have a parent, we have to create one;
|
||||||
* level.
|
* this adds a new btree level.
|
||||||
*/
|
*/
|
||||||
if (state->btps_doupper)
|
if (state->btps_next == (BTPageState *) NULL)
|
||||||
{
|
{
|
||||||
BTItem nbti;
|
state->btps_next =
|
||||||
|
_bt_pagestate(index, 0, state->btps_level + 1);
|
||||||
if (state->btps_next == (BTPageState *) NULL)
|
|
||||||
{
|
|
||||||
state->btps_next =
|
|
||||||
_bt_pagestate(index, 0, state->btps_level + 1, true);
|
|
||||||
}
|
|
||||||
nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0);
|
|
||||||
_bt_buildadd(index, keysz, scankey, state->btps_next, nbti, 0);
|
|
||||||
pfree((void *) nbti);
|
|
||||||
}
|
}
|
||||||
|
nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0);
|
||||||
|
_bt_buildadd(index, state->btps_next, nbti, 0);
|
||||||
|
pfree((void *) nbti);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* write out the old stuff. we never want to see it again, so we
|
* write out the old stuff. we never want to see it again, so we
|
||||||
@ -435,11 +408,11 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
|
|||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* if this item is different from the last item added, we start a new
|
* Add the new item into the current page.
|
||||||
* chain of duplicates.
|
|
||||||
*/
|
*/
|
||||||
off = OffsetNumberNext(last_off);
|
last_off = OffsetNumberNext(last_off);
|
||||||
if (PageAddItem(npage, (Item) bti, btisz, off, LP_USED) == InvalidOffsetNumber)
|
if (PageAddItem(npage, (Item) bti, btisz,
|
||||||
|
last_off, LP_USED) == InvalidOffsetNumber)
|
||||||
elog(FATAL, "btree: failed to add item to the page in _bt_sort (2)");
|
elog(FATAL, "btree: failed to add item to the page in _bt_sort (2)");
|
||||||
#ifdef FASTBUILD_DEBUG
|
#ifdef FASTBUILD_DEBUG
|
||||||
{
|
{
|
||||||
@ -447,65 +420,57 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
|
|||||||
Datum d = index_getattr(&(bti->bti_itup), 1, index->rd_att, &isnull);
|
Datum d = index_getattr(&(bti->bti_itup), 1, index->rd_att, &isnull);
|
||||||
|
|
||||||
printf("_bt_buildadd: inserted <%x> at offset %d at level %d\n",
|
printf("_bt_buildadd: inserted <%x> at offset %d at level %d\n",
|
||||||
d, off, state->btps_level);
|
d, last_off, state->btps_level);
|
||||||
}
|
}
|
||||||
#endif
|
#endif
|
||||||
if (last_bti == (BTItem) NULL)
|
|
||||||
first_off = P_FIRSTKEY;
|
|
||||||
else if (!_bt_itemcmp(index, keysz, scankey,
|
|
||||||
bti, last_bti, BTEqualStrategyNumber))
|
|
||||||
first_off = off;
|
|
||||||
last_off = off;
|
|
||||||
last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, off));
|
|
||||||
|
|
||||||
state->btps_buf = nbuf;
|
state->btps_buf = nbuf;
|
||||||
state->btps_page = npage;
|
state->btps_page = npage;
|
||||||
state->btps_lastbti = last_bti;
|
|
||||||
state->btps_lastoff = last_off;
|
state->btps_lastoff = last_off;
|
||||||
state->btps_firstoff = first_off;
|
|
||||||
|
|
||||||
return last_bti;
|
|
||||||
}
|
}
|
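Note on the _bt_buildadd change above: with the duplicate-chain logic gone, finishing a full page comes down to two steps. The last item on the old page is copied to the first data slot of the new page, and that same item then becomes the old page's high key. The sketch below models this with a toy page of integer keys. ToyPage, TOY_CAPACITY, toy_finish_page and toy_buildadd are hypothetical names made up for illustration, and a fixed item count stands in for the real free-space check. It is not the PostgreSQL code.

```c
#include <assert.h>

#define TOY_CAPACITY 4          /* data items per toy page (assumption) */

typedef struct ToyPage
{
    int hikey;                  /* stands in for the P_HIKEY slot */
    int has_hikey;              /* 0 while the page is still rightmost */
    int items[TOY_CAPACITY];    /* data items, in key order */
    int nitems;
} ToyPage;

/*
 * Finish 'old' and start 'fresh': the last item of 'old' is copied to
 * 'fresh' as its first data item, then demoted on 'old' from a true data
 * item to the high key.
 */
void
toy_finish_page(ToyPage *old, ToyPage *fresh)
{
    int last = old->items[old->nitems - 1];

    fresh->has_hikey = 0;
    fresh->nitems = 0;
    fresh->items[fresh->nitems++] = last;

    old->hikey = last;
    old->has_hikey = 1;
    old->nitems--;
}

/*
 * Add 'key' to the page being built.  Returns 1 if 'cur' was full and
 * 'fresh' was started (the caller would then continue with 'fresh'),
 * 0 if the key simply went onto 'cur'.
 */
int
toy_buildadd(ToyPage *cur, ToyPage *fresh, int key)
{
    if (cur->nitems == TOY_CAPACITY)
    {
        toy_finish_page(cur, fresh);
        /* 'key' is never the first data item on the new page */
        fresh->items[fresh->nitems++] = key;
        return 1;
    }
    cur->items[cur->nitems++] = key;
    return 0;
}
```

Because one item is always carried over, the incoming key lands in the second data slot of the new page, which matches the comment in the new code that 'bti' will never be the first data item there.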
||||||
|
|
||||||
|
/*
|
||||||
|
* Finish writing out the completed btree.
|
||||||
|
*/
|
||||||
static void
|
static void
|
||||||
_bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
|
_bt_uppershutdown(Relation index, BTPageState *state)
|
||||||
BTPageState *state)
|
|
||||||
{
|
{
|
||||||
BTPageState *s;
|
BTPageState *s;
|
||||||
BlockNumber blkno;
|
BlockNumber blkno;
|
||||||
BTPageOpaque opaque;
|
BTPageOpaque opaque;
|
||||||
BTItem bti;
|
BTItem bti;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Each iteration of this loop completes one more level of the tree.
|
||||||
|
*/
|
||||||
for (s = state; s != (BTPageState *) NULL; s = s->btps_next)
|
for (s = state; s != (BTPageState *) NULL; s = s->btps_next)
|
||||||
{
|
{
|
||||||
blkno = BufferGetBlockNumber(s->btps_buf);
|
blkno = BufferGetBlockNumber(s->btps_buf);
|
||||||
opaque = (BTPageOpaque) PageGetSpecialPointer(s->btps_page);
|
opaque = (BTPageOpaque) PageGetSpecialPointer(s->btps_page);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* if this is the root, attach it to the metapage. otherwise,
|
* We have to link the last page on this level to somewhere.
|
||||||
* stick the minimum key of the last page on this level (which has
|
*
|
||||||
* not been split, or else it wouldn't be the last page) into its
|
* If we're at the top, it's the root, so attach it to the metapage.
|
||||||
* parent. this may cause the last page of upper levels to split,
|
* Otherwise, add an entry for it to its parent using its minimum
|
||||||
* but that's not a problem -- we haven't gotten to them yet.
|
* key. This may cause the last page of the parent level to split,
|
||||||
|
* but that's not a problem -- we haven't gotten to it yet.
|
||||||
*/
|
*/
|
||||||
if (s->btps_doupper)
|
if (s->btps_next == (BTPageState *) NULL)
|
||||||
{
|
{
|
||||||
if (s->btps_next == (BTPageState *) NULL)
|
opaque->btpo_flags |= BTP_ROOT;
|
||||||
{
|
_bt_metaproot(index, blkno, s->btps_level + 1);
|
||||||
opaque->btpo_flags |= BTP_ROOT;
|
}
|
||||||
_bt_metaproot(index, blkno, s->btps_level + 1);
|
else
|
||||||
}
|
{
|
||||||
else
|
bti = _bt_minitem(s->btps_page, blkno, 0);
|
||||||
{
|
_bt_buildadd(index, s->btps_next, bti, 0);
|
||||||
bti = _bt_minitem(s->btps_page, blkno, 0);
|
pfree((void *) bti);
|
||||||
_bt_buildadd(index, keysz, scankey, s->btps_next, bti, 0);
|
|
||||||
pfree((void *) bti);
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* this is the rightmost page, so the ItemId array needs to be
|
* This is the rightmost page, so the ItemId array needs to be
|
||||||
* slid back one slot.
|
* slid back one slot. Then we can dump out the page.
|
||||||
*/
|
*/
|
||||||
_bt_slideleft(index, s->btps_buf, s->btps_page);
|
_bt_slideleft(index, s->btps_buf, s->btps_page);
|
||||||
_bt_wrtbuf(index, s->btps_buf);
|
_bt_wrtbuf(index, s->btps_buf);
|
||||||
@ -519,32 +484,27 @@ _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
|
|||||||
static void
|
static void
|
||||||
_bt_load(Relation index, BTSpool *btspool)
|
_bt_load(Relation index, BTSpool *btspool)
|
||||||
{
|
{
|
||||||
BTPageState *state;
|
BTPageState *state = NULL;
|
||||||
ScanKey skey;
|
|
||||||
int natts;
|
|
||||||
BTItem bti;
|
|
||||||
bool should_free;
|
|
||||||
|
|
||||||
/*
|
|
||||||
* initialize state needed for the merge into the btree leaf pages.
|
|
||||||
*/
|
|
||||||
state = _bt_pagestate(index, BTP_LEAF, 0, true);
|
|
||||||
|
|
||||||
skey = _bt_mkscankey_nodata(index);
|
|
||||||
natts = RelationGetNumberOfAttributes(index);
|
|
||||||
|
|
||||||
for (;;)
|
for (;;)
|
||||||
{
|
{
|
||||||
|
BTItem bti;
|
||||||
|
bool should_free;
|
||||||
|
|
||||||
bti = (BTItem) tuplesort_getindextuple(btspool->sortstate, true,
|
bti = (BTItem) tuplesort_getindextuple(btspool->sortstate, true,
|
||||||
&should_free);
|
&should_free);
|
||||||
if (bti == (BTItem) NULL)
|
if (bti == (BTItem) NULL)
|
||||||
break;
|
break;
|
||||||
_bt_buildadd(index, natts, skey, state, bti, BTP_LEAF);
|
|
||||||
|
/* When we see first tuple, create first index page */
|
||||||
|
if (state == NULL)
|
||||||
|
state = _bt_pagestate(index, BTP_LEAF, 0);
|
||||||
|
|
||||||
|
_bt_buildadd(index, state, bti, BTP_LEAF);
|
||||||
if (should_free)
|
if (should_free)
|
||||||
pfree((void *) bti);
|
pfree((void *) bti);
|
||||||
}
|
}
|
||||||
|
|
||||||
_bt_uppershutdown(index, natts, skey, state);
|
if (state != NULL)
|
||||||
|
_bt_uppershutdown(index, state);
|
||||||
_bt_freeskey(skey);
|
|
||||||
}
|
}
|
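Note on the _bt_load change above: the build state is now created only when the first tuple arrives, and the shutdown step is skipped when no tuple was ever seen, so an empty index gets no leaf or root page. A minimal sketch of that lazy-initialization pattern follows. ToyBuild and toy_load are hypothetical, and a counter stands in for real page allocation.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ToyBuild
{
    int pages_allocated;        /* pages created so far (toy counter) */
    int last_tuple;             /* most recent tuple added */
} ToyBuild;

/*
 * Consume 'ntuples' sorted tuples and return the number of pages that
 * had to be allocated, or -1 on allocation failure.  No state (and so no
 * page) is created until the first tuple is seen.
 */
int
toy_load(const int *tuples, int ntuples)
{
    ToyBuild *state = NULL;
    int pages = 0;
    int i;

    for (i = 0; i < ntuples; i++)
    {
        /* when we see the first tuple, create the first page */
        if (state == NULL)
        {
            state = calloc(1, sizeof(*state));
            if (state == NULL)
                return -1;
            state->pages_allocated = 1;
        }
        state->last_tuple = tuples[i];
    }

    /* only finish the tree if something was actually stored */
    if (state != NULL)
    {
        pages = state->pages_allocated;
        free(state);
    }
    return pages;
}
```

With zero input tuples the function allocates nothing. This mirrors the commit message's remark that an unused index no longer carries a root page.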
||||||
|
@ -8,7 +8,7 @@
|
|||||||
*
|
*
|
||||||
*
|
*
|
||||||
* IDENTIFICATION
|
* IDENTIFICATION
|
||||||
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.37 2000/05/30 04:24:33 tgl Exp $
|
* $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.38 2000/07/21 06:42:33 tgl Exp $
|
||||||
*
|
*
|
||||||
*-------------------------------------------------------------------------
|
*-------------------------------------------------------------------------
|
||||||
*/
|
*/
|
||||||
@ -20,16 +20,13 @@
|
|||||||
#include "access/nbtree.h"
|
#include "access/nbtree.h"
|
||||||
#include "executor/execdebug.h"
|
#include "executor/execdebug.h"
|
||||||
|
|
||||||
extern int NIndexTupleProcessed;
|
|
||||||
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* _bt_mkscankey
|
* _bt_mkscankey
|
||||||
* Build a scan key that contains comparison data from itup
|
* Build a scan key that contains comparison data from itup
|
||||||
* as well as comparator routines appropriate to the key datatypes.
|
* as well as comparator routines appropriate to the key datatypes.
|
||||||
*
|
*
|
||||||
* The result is intended for use with _bt_skeycmp() or _bt_compare(),
|
* The result is intended for use with _bt_compare().
|
||||||
* although it could be used with _bt_itemcmp() or _bt_tuplecompare().
|
|
||||||
*/
|
*/
|
||||||
ScanKey
|
ScanKey
|
||||||
_bt_mkscankey(Relation rel, IndexTuple itup)
|
_bt_mkscankey(Relation rel, IndexTuple itup)
|
||||||
@ -68,8 +65,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
|
|||||||
* Build a scan key that contains comparator routines appropriate to
|
* Build a scan key that contains comparator routines appropriate to
|
||||||
* the key datatypes, but no comparison data.
|
* the key datatypes, but no comparison data.
|
||||||
*
|
*
|
||||||
* The result can be used with _bt_itemcmp() or _bt_tuplecompare(),
|
* The result cannot be used with _bt_compare(). Currently this
|
||||||
* but not with _bt_skeycmp() or _bt_compare().
|
* routine is only called by utils/sort/tuplesort.c, which has its
|
||||||
|
* own comparison routine.
|
||||||
*/
|
*/
|
||||||
ScanKey
|
ScanKey
|
||||||
_bt_mkscankey_nodata(Relation rel)
|
_bt_mkscankey_nodata(Relation rel)
|
||||||
@ -114,7 +112,6 @@ _bt_freestack(BTStack stack)
|
|||||||
{
|
{
|
||||||
ostack = stack;
|
ostack = stack;
|
||||||
stack = stack->bts_parent;
|
stack = stack->bts_parent;
|
||||||
pfree(ostack->bts_btitem);
|
|
||||||
pfree(ostack);
|
pfree(ostack);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@ -331,55 +328,16 @@ _bt_formitem(IndexTuple itup)
|
|||||||
Size tuplen;
|
Size tuplen;
|
||||||
extern Oid newoid();
|
extern Oid newoid();
|
||||||
|
|
||||||
/*
|
|
||||||
* see comments in btbuild
|
|
||||||
*
|
|
||||||
* if (itup->t_info & INDEX_NULL_MASK) elog(ERROR, "btree indices cannot
|
|
||||||
* include null keys");
|
|
||||||
*/
|
|
||||||
|
|
||||||
/* make a copy of the index tuple with room for the sequence number */
|
/* make a copy of the index tuple with room for the sequence number */
|
||||||
tuplen = IndexTupleSize(itup);
|
tuplen = IndexTupleSize(itup);
|
||||||
nbytes_btitem = tuplen + (sizeof(BTItemData) - sizeof(IndexTupleData));
|
nbytes_btitem = tuplen + (sizeof(BTItemData) - sizeof(IndexTupleData));
|
||||||
|
|
||||||
btitem = (BTItem) palloc(nbytes_btitem);
|
btitem = (BTItem) palloc(nbytes_btitem);
|
||||||
memmove((char *) &(btitem->bti_itup), (char *) itup, tuplen);
|
memcpy((char *) &(btitem->bti_itup), (char *) itup, tuplen);
|
||||||
|
|
||||||
return btitem;
|
return btitem;
|
||||||
}
|
}
|
||||||
|
|
||||||
#ifdef NOT_USED
|
|
||||||
bool
|
|
||||||
_bt_checkqual(IndexScanDesc scan, IndexTuple itup)
|
|
||||||
{
|
|
||||||
BTScanOpaque so;
|
|
||||||
|
|
||||||
so = (BTScanOpaque) scan->opaque;
|
|
||||||
if (so->numberOfKeys > 0)
|
|
||||||
return (index_keytest(itup, RelationGetDescr(scan->relation),
|
|
||||||
so->numberOfKeys, so->keyData));
|
|
||||||
else
|
|
||||||
return true;
|
|
||||||
}
|
|
||||||
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#ifdef NOT_USED
|
|
||||||
bool
|
|
||||||
_bt_checkforkeys(IndexScanDesc scan, IndexTuple itup, Size keysz)
|
|
||||||
{
|
|
||||||
BTScanOpaque so;
|
|
||||||
|
|
||||||
so = (BTScanOpaque) scan->opaque;
|
|
||||||
if (keysz > 0 && so->numberOfKeys >= keysz)
|
|
||||||
return (index_keytest(itup, RelationGetDescr(scan->relation),
|
|
||||||
keysz, so->keyData));
|
|
||||||
else
|
|
||||||
return true;
|
|
||||||
}
|
|
||||||
|
|
||||||
#endif
|
|
||||||
|
|
||||||
bool
|
bool
|
||||||
_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, Size *keysok)
|
_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, Size *keysok)
|
||||||
{
|
{
|
||||||
|
@ -8,7 +8,7 @@
|
|||||||
*
|
*
|
||||||
*
|
*
|
||||||
* IDENTIFICATION
|
* IDENTIFICATION
|
||||||
* $Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.30 2000/07/03 02:54:16 vadim Exp $
|
* $Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.31 2000/07/21 06:42:33 tgl Exp $
|
||||||
*
|
*
|
||||||
*-------------------------------------------------------------------------
|
*-------------------------------------------------------------------------
|
||||||
*/
|
*/
|
||||||
@ -19,10 +19,10 @@
|
|||||||
|
|
||||||
#include "storage/bufpage.h"
|
#include "storage/bufpage.h"
|
||||||
|
|
||||||
|
|
||||||
static void PageIndexTupleDeleteAdjustLinePointers(PageHeader phdr,
|
static void PageIndexTupleDeleteAdjustLinePointers(PageHeader phdr,
|
||||||
char *location, Size size);
|
char *location, Size size);
|
||||||
|
|
||||||
static bool PageManagerShuffle = true; /* default is shuffle mode */
|
|
||||||
|
|
||||||
/* ----------------------------------------------------------------
|
/* ----------------------------------------------------------------
|
||||||
* Page support functions
|
* Page support functions
|
||||||
@ -53,21 +53,17 @@ PageInit(Page page, Size pageSize, Size specialSize)
|
|||||||
/* ----------------
|
/* ----------------
|
||||||
* PageAddItem
|
* PageAddItem
|
||||||
*
|
*
|
||||||
* add an item to a page.
|
* Add an item to a page. Return value is offset at which it was
|
||||||
|
* inserted, or InvalidOffsetNumber if there's not room to insert.
|
||||||
*
|
*
|
||||||
* !!! ELOG(ERROR) IS DISALLOWED HERE !!!
|
* If offsetNumber is valid and <= current max offset in the page,
|
||||||
*
|
* insert item into the array at that position by shuffling ItemId's
|
||||||
* Notes on interface:
|
* down to make room.
|
||||||
* If offsetNumber is valid, shuffle ItemId's down to make room
|
|
||||||
* to use it, if PageManagerShuffle is true. If PageManagerShuffle is
|
|
||||||
* false, then overwrite the specified ItemId. (PageManagerShuffle is
|
|
||||||
* true by default, and is modified by calling PageManagerModeSet.)
|
|
||||||
* If offsetNumber is not valid, then assign one by finding the first
|
* If offsetNumber is not valid, then assign one by finding the first
|
||||||
* one that is both unused and deallocated.
|
* one that is both unused and deallocated.
|
||||||
*
|
*
|
||||||
* NOTE: If offsetNumber is valid, and PageManagerShuffle is true, it
|
* !!! ELOG(ERROR) IS DISALLOWED HERE !!!
|
||||||
* is assumed that there is room on the page to shuffle the ItemId's
|
*
|
||||||
* down by one.
|
|
||||||
* ----------------
|
* ----------------
|
||||||
*/
|
*/
|
||||||
OffsetNumber
|
OffsetNumber
|
||||||
@ -82,11 +78,8 @@ PageAddItem(Page page,
|
|||||||
Offset lower;
|
Offset lower;
|
||||||
Offset upper;
|
Offset upper;
|
||||||
ItemId itemId;
|
ItemId itemId;
|
||||||
ItemId fromitemId,
|
|
||||||
toitemId;
|
|
||||||
OffsetNumber limit;
|
OffsetNumber limit;
|
||||||
|
bool needshuffle = false;
|
||||||
bool shuffled = false;
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Find first unallocated offsetNumber
|
* Find first unallocated offsetNumber
|
||||||
@ -96,31 +89,12 @@ PageAddItem(Page page,
|
|||||||
/* was offsetNumber passed in? */
|
/* was offsetNumber passed in? */
|
||||||
if (OffsetNumberIsValid(offsetNumber))
|
if (OffsetNumberIsValid(offsetNumber))
|
||||||
{
|
{
|
||||||
if (PageManagerShuffle == true)
|
needshuffle = true; /* need to increase "lower" */
|
||||||
{
|
/* don't actually do the shuffle till we've checked free space! */
|
||||||
/* shuffle ItemId's (Do the PageManager Shuffle...) */
|
|
||||||
for (i = (limit - 1); i >= offsetNumber; i--)
|
|
||||||
{
|
|
||||||
fromitemId = &((PageHeader) page)->pd_linp[i - 1];
|
|
||||||
toitemId = &((PageHeader) page)->pd_linp[i];
|
|
||||||
*toitemId = *fromitemId;
|
|
||||||
}
|
|
||||||
shuffled = true; /* need to increase "lower" */
|
|
||||||
}
|
|
||||||
else
|
|
||||||
{ /* overwrite mode */
|
|
||||||
itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1];
|
|
||||||
if (((*itemId).lp_flags & LP_USED) ||
|
|
||||||
((*itemId).lp_len != 0))
|
|
||||||
{
|
|
||||||
elog(NOTICE, "PageAddItem: tried overwrite of used ItemId");
|
|
||||||
return InvalidOffsetNumber;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
else
|
else
|
||||||
{ /* offsetNumber was not passed in, so find
|
{
|
||||||
* one */
|
/* offsetNumber was not passed in, so find one */
|
||||||
/* look for "recyclable" (unused & deallocated) ItemId */
|
/* look for "recyclable" (unused & deallocated) ItemId */
|
||||||
for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
|
for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
|
||||||
{
|
{
|
||||||
@ -130,9 +104,13 @@ PageAddItem(Page page,
|
|||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Compute new lower and upper pointers for page, see if it'll fit
|
||||||
|
*/
|
||||||
if (offsetNumber > limit)
|
if (offsetNumber > limit)
|
||||||
lower = (Offset) (((char *) (&((PageHeader) page)->pd_linp[offsetNumber])) - ((char *) page));
|
lower = (Offset) (((char *) (&((PageHeader) page)->pd_linp[offsetNumber])) - ((char *) page));
|
||||||
else if (offsetNumber == limit || shuffled == true)
|
else if (offsetNumber == limit || needshuffle)
|
||||||
lower = ((PageHeader) page)->pd_lower + sizeof(ItemIdData);
|
lower = ((PageHeader) page)->pd_lower + sizeof(ItemIdData);
|
||||||
else
|
else
|
||||||
lower = ((PageHeader) page)->pd_lower;
|
lower = ((PageHeader) page)->pd_lower;
|
||||||
@ -144,6 +122,23 @@ PageAddItem(Page page,
|
|||||||
if (lower > upper)
|
if (lower > upper)
|
||||||
return InvalidOffsetNumber;
|
return InvalidOffsetNumber;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* OK to insert the item. First, shuffle the existing pointers if needed.
|
||||||
|
*/
|
||||||
|
if (needshuffle)
|
||||||
|
{
|
||||||
|
/* shuffle ItemId's (Do the PageManager Shuffle...) */
|
||||||
|
for (i = (limit - 1); i >= offsetNumber; i--)
|
||||||
|
{
|
||||||
|
ItemId fromitemId,
|
||||||
|
toitemId;
|
||||||
|
|
||||||
|
fromitemId = &((PageHeader) page)->pd_linp[i - 1];
|
||||||
|
toitemId = &((PageHeader) page)->pd_linp[i];
|
||||||
|
*toitemId = *fromitemId;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1];
|
itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1];
|
||||||
(*itemId).lp_off = upper;
|
(*itemId).lp_off = upper;
|
||||||
(*itemId).lp_len = size;
|
(*itemId).lp_len = size;
|
||||||
@ -168,9 +163,7 @@ PageGetTempPage(Page page, Size specialSize)
|
|||||||
PageHeader thdr;
|
PageHeader thdr;
|
||||||
|
|
||||||
pageSize = PageGetPageSize(page);
|
pageSize = PageGetPageSize(page);
|
||||||
|
temp = (Page) palloc(pageSize);
|
||||||
if ((temp = (Page) palloc(pageSize)) == (Page) NULL)
|
|
||||||
elog(FATAL, "Cannot allocate %d bytes for temp page.", pageSize);
|
|
||||||
thdr = (PageHeader) temp;
|
thdr = (PageHeader) temp;
|
||||||
|
|
||||||
/* copy old page in */
|
/* copy old page in */
|
||||||
@ -327,23 +320,6 @@ PageGetFreeSpace(Page page)
|
|||||||
return space;
|
return space;
|
||||||
}
|
}
|
||||||
|
|
||||||
/*
|
|
||||||
* PageManagerModeSet
|
|
||||||
*
|
|
||||||
* Sets mode to either: ShufflePageManagerMode (the default) or
|
|
||||||
* OverwritePageManagerMode. For use by access methods code
|
|
||||||
* for determining semantics of PageAddItem when the offsetNumber
|
|
||||||
* argument is passed in.
|
|
||||||
*/
|
|
||||||
void
|
|
||||||
PageManagerModeSet(PageManagerMode mode)
|
|
||||||
{
|
|
||||||
if (mode == ShufflePageManagerMode)
|
|
||||||
PageManagerShuffle = true;
|
|
||||||
else if (mode == OverwritePageManagerMode)
|
|
||||||
PageManagerShuffle = false;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
*----------------------------------------------------------------
|
*----------------------------------------------------------------
|
||||||
* PageIndexTupleDelete
|
* PageIndexTupleDelete
|
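Note on the PageAddItem change above: the line-pointer shuffle has moved to after the free-space check, so a failed insert returns InvalidOffsetNumber with the ItemId array left untouched. The sketch below shows the same ordering on a toy slot array. ToySlots, TOY_MAX_SLOTS, TOY_INVALID_OFFSET and toy_add_item are hypothetical, and a slot count replaces the real lower/upper pointer arithmetic.

```c
#include <assert.h>

#define TOY_MAX_SLOTS       4   /* capacity of the toy slot array (assumption) */
#define TOY_INVALID_OFFSET  0   /* stands in for InvalidOffsetNumber */

typedef struct ToySlots
{
    int slot[TOY_MAX_SLOTS];
    int nused;
} ToySlots;

/*
 * Insert 'value' at 1-based offset 'off', shuffling later slots down by
 * one.  Returns the offset used, or TOY_INVALID_OFFSET if there is no
 * room, in which case the array is left exactly as it was.
 */
int
toy_add_item(ToySlots *p, int off, int value)
{
    int i;

    if (off < 1 || off > p->nused + 1)
        return TOY_INVALID_OFFSET;

    /* check for room first; don't shuffle until we know it fits */
    if (p->nused >= TOY_MAX_SLOTS)
        return TOY_INVALID_OFFSET;

    /* OK to insert: shuffle existing slots from the end backwards */
    for (i = p->nused; i >= off; i--)
        p->slot[i] = p->slot[i - 1];

    p->slot[off - 1] = value;
    p->nused++;
    return off;
}
```

In the pre-patch ordering the shuffle ran before the space check, so a failure could leave the pointers shifted even though nothing was inserted.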
||||||
|
@ -7,7 +7,7 @@
|
|||||||
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc
|
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc
|
||||||
* Portions Copyright (c) 1994, Regents of the University of California
|
* Portions Copyright (c) 1994, Regents of the University of California
|
||||||
*
|
*
|
||||||
* $Id: nbtree.h,v 1.38 2000/06/15 03:32:31 momjian Exp $
|
* $Id: nbtree.h,v 1.39 2000/07/21 06:42:35 tgl Exp $
|
||||||
*
|
*
|
||||||
*-------------------------------------------------------------------------
|
*-------------------------------------------------------------------------
|
||||||
*/
|
*/
|
||||||
@ -24,14 +24,9 @@
|
|||||||
* info. In addition, we need to know what sort of page this is
|
* info. In addition, we need to know what sort of page this is
|
||||||
* (leaf or internal), and whether the page is available for reuse.
|
* (leaf or internal), and whether the page is available for reuse.
|
||||||
*
|
*
|
||||||
* Lehman and Yao's algorithm requires a ``high key'' on every page.
|
* We also store a back-link to the parent page, but this cannot be trusted
|
||||||
* The high key on a page is guaranteed to be greater than or equal
|
* very far since it does not get updated when the parent is split.
|
||||||
* to any key that appears on this page. Our insertion algorithm
|
* See backend/access/nbtree/README for details.
|
||||||
* guarantees that we can use the initial least key on our right
|
|
||||||
* sibling as the high key. We allocate space for the line pointer
|
|
||||||
* to the high key in the opaque data at the end of the page.
|
|
||||||
*
|
|
||||||
* Rightmost pages in the tree have no high key.
|
|
||||||
*/
|
*/
|
||||||
|
|
||||||
typedef struct BTPageOpaqueData
|
typedef struct BTPageOpaqueData
|
||||||
@ -41,11 +36,11 @@ typedef struct BTPageOpaqueData
|
|||||||
BlockNumber btpo_parent;
|
BlockNumber btpo_parent;
|
||||||
uint16 btpo_flags;
|
uint16 btpo_flags;
|
||||||
|
|
||||||
#define BTP_LEAF (1 << 0)
|
/* Bits defined in btpo_flags */
|
||||||
#define BTP_ROOT (1 << 1)
|
#define BTP_LEAF (1 << 0) /* It's a leaf page */
|
||||||
#define BTP_FREE (1 << 2)
|
#define BTP_ROOT (1 << 1) /* It's the root page (has no parent) */
|
||||||
#define BTP_META (1 << 3)
|
#define BTP_FREE (1 << 2) /* not currently used... */
|
||||||
#define BTP_CHAIN (1 << 4)
|
#define BTP_META (1 << 3) /* Set in the meta-page only */
|
||||||
|
|
||||||
} BTPageOpaqueData;
|
} BTPageOpaqueData;
|
||||||
|
|
||||||
@ -84,21 +79,24 @@ typedef struct BTScanOpaqueData
|
|||||||
typedef BTScanOpaqueData *BTScanOpaque;
|
typedef BTScanOpaqueData *BTScanOpaque;
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* BTItems are what we store in the btree. Each item has an index
|
* BTItems are what we store in the btree. Each item is an index tuple,
|
||||||
* tuple, including key and pointer values. In addition, we must
|
* including key and pointer values. (In some cases either the key or the
|
||||||
* guarantee that all tuples in the index are unique, in order to
|
* pointer may go unused, see backend/access/nbtree/README for details.)
|
||||||
* satisfy some assumptions in Lehman and Yao. The way that we do
|
*
|
||||||
* this is by generating a new OID for every insertion that we do in
|
* Old comments:
|
||||||
* the tree. This adds eight bytes to the size of btree index
|
* In addition, we must guarantee that all tuples in the index are unique,
|
||||||
* tuples. Note that we do not use the OID as part of a composite
|
* in order to satisfy some assumptions in Lehman and Yao. The way that we
|
||||||
* key; the OID only serves as a unique identifier for a given index
|
* do this is by generating a new OID for every insertion that we do in the
|
||||||
* tuple (logical position within a page).
|
* tree. This adds eight bytes to the size of btree index tuples. Note
|
||||||
|
* that we do not use the OID as part of a composite key; the OID only
|
||||||
|
* serves as a unique identifier for a given index tuple (logical position
|
||||||
|
* within a page).
|
||||||
*
|
*
|
||||||
* New comments:
|
* New comments:
|
||||||
* actually, we must guarantee that all tuples in A LEVEL
|
* actually, we must guarantee that all tuples in A LEVEL
|
||||||
* are unique, not in ALL INDEX. So, we can use bti_itup->t_tid
|
* are unique, not in ALL INDEX. So, we can use bti_itup->t_tid
|
||||||
* as unique identifier for a given index tuple (logical position
|
* as unique identifier for a given index tuple (logical position
|
||||||
* within a level). - vadim 04/09/97
|
* within a level). - vadim 04/09/97
|
||||||
*/
|
*/
|
||||||
|
|
||||||
typedef struct BTItemData
|
typedef struct BTItemData
|
||||||
@ -108,12 +106,13 @@ typedef struct BTItemData
|
|||||||
|
|
||||||
typedef BTItemData *BTItem;
|
typedef BTItemData *BTItem;
|
||||||
|
|
||||||
#define BTItemSame(i1, i2) ( i1->bti_itup.t_tid.ip_blkid.bi_hi == \
|
/* Test whether items are the "same" per the above notes */
|
||||||
i2->bti_itup.t_tid.ip_blkid.bi_hi && \
|
#define BTItemSame(i1, i2) ( (i1)->bti_itup.t_tid.ip_blkid.bi_hi == \
|
||||||
i1->bti_itup.t_tid.ip_blkid.bi_lo == \
|
(i2)->bti_itup.t_tid.ip_blkid.bi_hi && \
|
||||||
i2->bti_itup.t_tid.ip_blkid.bi_lo && \
|
(i1)->bti_itup.t_tid.ip_blkid.bi_lo == \
|
||||||
i1->bti_itup.t_tid.ip_posid == \
|
(i2)->bti_itup.t_tid.ip_blkid.bi_lo && \
|
||||||
i2->bti_itup.t_tid.ip_posid )
|
(i1)->bti_itup.t_tid.ip_posid == \
|
||||||
|
(i2)->bti_itup.t_tid.ip_posid )
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* BTStackData -- As we descend a tree, we push the (key, pointer)
|
* BTStackData -- As we descend a tree, we push the (key, pointer)
|
||||||
@ -129,24 +128,12 @@ typedef struct BTStackData
|
|||||||
{
|
{
|
||||||
BlockNumber bts_blkno;
|
BlockNumber bts_blkno;
|
||||||
OffsetNumber bts_offset;
|
OffsetNumber bts_offset;
|
||||||
BTItem bts_btitem;
|
BTItemData bts_btitem;
|
||||||
struct BTStackData *bts_parent;
|
struct BTStackData *bts_parent;
|
||||||
} BTStackData;
|
} BTStackData;
|
||||||
|
|
||||||
typedef BTStackData *BTStack;
|
typedef BTStackData *BTStack;
|
||||||
|
|
||||||
typedef struct BTPageState
|
|
||||||
{
|
|
||||||
Buffer btps_buf;
|
|
||||||
Page btps_page;
|
|
||||||
BTItem btps_lastbti;
|
|
||||||
OffsetNumber btps_lastoff;
|
|
||||||
OffsetNumber btps_firstoff;
|
|
||||||
int btps_level;
|
|
||||||
bool btps_doupper;
|
|
||||||
struct BTPageState *btps_next;
|
|
||||||
} BTPageState;
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* We need to be able to tell the difference between read and write
|
* We need to be able to tell the difference between read and write
|
||||||
* requests for pages, in order to do locking correctly.
|
* requests for pages, in order to do locking correctly.
|
||||||
@ -155,31 +142,49 @@ typedef struct BTPageState
|
|||||||
#define BT_READ BUFFER_LOCK_SHARE
|
#define BT_READ BUFFER_LOCK_SHARE
|
||||||
#define BT_WRITE BUFFER_LOCK_EXCLUSIVE
|
#define BT_WRITE BUFFER_LOCK_EXCLUSIVE
|
||||||
|
|
||||||
/*
|
|
||||||
* Similarly, the difference between insertion and non-insertion binary
|
|
||||||
* searches on a given page makes a difference when we're descending the
|
|
||||||
* tree.
|
|
||||||
*/
|
|
||||||
|
|
||||||
#define BT_INSERTION 0
|
|
||||||
#define BT_DESCENT 1
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* In general, the btree code tries to localize its knowledge about
|
* In general, the btree code tries to localize its knowledge about
|
||||||
* page layout to a couple of routines. However, we need a special
|
* page layout to a couple of routines. However, we need a special
|
||||||
* value to indicate "no page number" in those places where we expect
|
* value to indicate "no page number" in those places where we expect
|
||||||
* page numbers.
|
* page numbers. We can use zero for this because we never need to
|
||||||
|
* make a pointer to the metadata page.
|
||||||
*/
|
*/
|
||||||
|
|
||||||
#define P_NONE 0
|
#define P_NONE 0
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Macros to test whether a page is leftmost or rightmost on its tree level,
|
||||||
|
* as well as other state info kept in the opaque data.
|
||||||
|
*/
|
||||||
#define P_LEFTMOST(opaque) ((opaque)->btpo_prev == P_NONE)
|
#define P_LEFTMOST(opaque) ((opaque)->btpo_prev == P_NONE)
|
||||||
#define P_RIGHTMOST(opaque) ((opaque)->btpo_next == P_NONE)
|
#define P_RIGHTMOST(opaque) ((opaque)->btpo_next == P_NONE)
|
||||||
|
#define P_ISLEAF(opaque) ((opaque)->btpo_flags & BTP_LEAF)
|
||||||
|
#define P_ISROOT(opaque) ((opaque)->btpo_flags & BTP_ROOT)
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
|
||||||
|
* page. The high key is not a data key, but gives info about what range of
|
||||||
|
* keys is supposed to be on this page. The high key on a page is required
|
||||||
|
* to be greater than or equal to any data key that appears on the page.
|
||||||
|
* If we find ourselves trying to insert a key > high key, we know we need
|
||||||
|
* to move right (this should only happen if the page was split since we
|
||||||
|
* examined the parent page).
|
||||||
|
*
|
||||||
|
* Our insertion algorithm guarantees that we can use the initial least key
|
||||||
|
* on our right sibling as the high key. Once a page is created, its high
|
||||||
|
* key changes only if the page is split.
|
||||||
|
*
|
||||||
|
* On a non-rightmost page, the high key lives in item 1 and data items
|
||||||
|
* start in item 2. Rightmost pages have no high key, so we store data
|
||||||
|
* items beginning in item 1.
|
||||||
|
*/
|
||||||
|
|
||||||
#define P_HIKEY ((OffsetNumber) 1)
|
#define P_HIKEY ((OffsetNumber) 1)
|
||||||
#define P_FIRSTKEY ((OffsetNumber) 2)
|
#define P_FIRSTKEY ((OffsetNumber) 2)
|
||||||
|
#define P_FIRSTDATAKEY(opaque) (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Strategy numbers -- ordering of these is <, <=, =, >=, >
|
* Operator strategy numbers -- ordering of these is <, <=, =, >=, >
|
||||||
*/
|
*/
|
||||||
|
|
||||||
#define BTLessStrategyNumber 1
|
#define BTLessStrategyNumber 1
|
||||||
@ -200,29 +205,7 @@ typedef struct BTPageState
|
|||||||
#define BTORDER_PROC 1
|
#define BTORDER_PROC 1
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* prototypes for functions in nbtinsert.c
|
* prototypes for functions in nbtree.c (external entry points for btree)
|
||||||
*/
|
|
||||||
extern InsertIndexResult _bt_doinsert(Relation rel, BTItem btitem,
|
|
||||||
bool index_is_unique, Relation heapRel);
|
|
||||||
extern bool _bt_itemcmp(Relation rel, Size keysz, ScanKey scankey,
|
|
||||||
BTItem item1, BTItem item2, StrategyNumber strat);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* prototypes for functions in nbtpage.c
|
|
||||||
*/
|
|
||||||
extern void _bt_metapinit(Relation rel);
|
|
||||||
extern Buffer _bt_getroot(Relation rel, int access);
|
|
||||||
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
|
|
||||||
extern void _bt_relbuf(Relation rel, Buffer buf, int access);
|
|
||||||
extern void _bt_wrtbuf(Relation rel, Buffer buf);
|
|
||||||
extern void _bt_wrtnorelbuf(Relation rel, Buffer buf);
|
|
||||||
extern void _bt_pageinit(Page page, Size size);
|
|
||||||
extern void _bt_metaproot(Relation rel, BlockNumber rootbknum, int level);
|
|
||||||
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
|
|
||||||
extern void _bt_pagedel(Relation rel, ItemPointer tid);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* prototypes for functions in nbtree.c
|
|
||||||
*/
|
*/
|
||||||
extern bool BuildingBtree; /* in nbtree.c */
|
extern bool BuildingBtree; /* in nbtree.c */
|
||||||
|
|
||||||
@ -237,6 +220,25 @@ extern Datum btmarkpos(PG_FUNCTION_ARGS);
|
|||||||
extern Datum btrestrpos(PG_FUNCTION_ARGS);
|
extern Datum btrestrpos(PG_FUNCTION_ARGS);
|
||||||
extern Datum btdelete(PG_FUNCTION_ARGS);
|
extern Datum btdelete(PG_FUNCTION_ARGS);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* prototypes for functions in nbtinsert.c
|
||||||
|
*/
|
||||||
|
extern InsertIndexResult _bt_doinsert(Relation rel, BTItem btitem,
|
||||||
|
bool index_is_unique, Relation heapRel);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* prototypes for functions in nbtpage.c
|
||||||
|
*/
|
||||||
|
extern void _bt_metapinit(Relation rel);
|
||||||
|
extern Buffer _bt_getroot(Relation rel, int access);
|
||||||
|
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
|
||||||
|
extern void _bt_relbuf(Relation rel, Buffer buf, int access);
|
||||||
|
extern void _bt_wrtbuf(Relation rel, Buffer buf);
|
||||||
|
extern void _bt_wrtnorelbuf(Relation rel, Buffer buf);
|
||||||
|
extern void _bt_pageinit(Page page, Size size);
|
||||||
|
extern void _bt_metaproot(Relation rel, BlockNumber rootbknum, int level);
|
||||||
|
extern void _bt_pagedel(Relation rel, ItemPointer tid);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* prototypes for functions in nbtscan.c
|
* prototypes for functions in nbtscan.c
|
||||||
*/
|
*/
|
||||||
@ -249,13 +251,13 @@ extern void AtEOXact_nbtree(void);
|
|||||||
* prototypes for functions in nbtsearch.c
|
* prototypes for functions in nbtsearch.c
|
||||||
*/
|
*/
|
||||||
extern BTStack _bt_search(Relation rel, int keysz, ScanKey scankey,
|
extern BTStack _bt_search(Relation rel, int keysz, ScanKey scankey,
|
||||||
Buffer *bufP);
|
Buffer *bufP, int access);
|
||||||
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
|
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
|
||||||
ScanKey scankey, int access);
|
ScanKey scankey, int access);
|
||||||
extern bool _bt_skeycmp(Relation rel, Size keysz, ScanKey scankey,
|
|
||||||
Page page, ItemId itemid, StrategyNumber strat);
|
|
||||||
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
|
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
|
||||||
ScanKey scankey, int srchtype);
|
ScanKey scankey);
|
||||||
|
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
|
||||||
|
Page page, OffsetNumber offnum);
|
||||||
extern RetrieveIndexResult _bt_next(IndexScanDesc scan, ScanDirection dir);
|
extern RetrieveIndexResult _bt_next(IndexScanDesc scan, ScanDirection dir);
|
||||||
extern RetrieveIndexResult _bt_first(IndexScanDesc scan, ScanDirection dir);
|
extern RetrieveIndexResult _bt_first(IndexScanDesc scan, ScanDirection dir);
|
||||||
extern bool _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
|
extern bool _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
|
||||||
|
@ -7,7 +7,7 @@
|
|||||||
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc
|
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc
|
||||||
* Portions Copyright (c) 1994, Regents of the University of California
|
* Portions Copyright (c) 1994, Regents of the University of California
|
||||||
*
|
*
|
||||||
* $Id: bufpage.h,v 1.30 2000/07/03 02:54:21 vadim Exp $
|
* $Id: bufpage.h,v 1.31 2000/07/21 06:42:39 tgl Exp $
|
||||||
*
|
*
|
||||||
*-------------------------------------------------------------------------
|
*-------------------------------------------------------------------------
|
||||||
*/
|
*/
|
||||||
@ -309,7 +309,6 @@ extern Page PageGetTempPage(Page page, Size specialSize);
|
|||||||
extern void PageRestoreTempPage(Page tempPage, Page oldPage);
|
extern void PageRestoreTempPage(Page tempPage, Page oldPage);
|
||||||
extern void PageRepairFragmentation(Page page);
|
extern void PageRepairFragmentation(Page page);
|
||||||
extern Size PageGetFreeSpace(Page page);
|
extern Size PageGetFreeSpace(Page page);
|
||||||
extern void PageManagerModeSet(PageManagerMode mode);
|
|
||||||
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
|
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
|
||||||
|
|
||||||
|
|
||||||
|