Editorial improvements for GIN documentation.
This commit is contained in:
parent
a88ec7b4bb
commit
08fa6a6851
@ -1,4 +1,4 @@
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.6 2006/11/30 20:50:44 petere Exp $ -->
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.7 2006/12/01 23:46:46 tgl Exp $ -->
|
||||
|
||||
<chapter id="GIN">
|
||||
<title>GIN Indexes</title>
|
||||
@ -14,8 +14,9 @@
|
||||
<para>
|
||||
<acronym>GIN</acronym> stands for Generalized Inverted Index. It is
|
||||
an index structure storing a set of (key, posting list) pairs, where
|
||||
'posting list' is a set of rows in which the key occurs. Each
|
||||
row may contain many keys.
|
||||
a <quote>posting list</> is a set of rows in which the key occurs. Each
|
||||
indexed value may contain many keys, so the same row ID may appear in
|
||||
multiple posting lists.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
@ -45,7 +46,7 @@
|
||||
|
||||
<para>
|
||||
The <acronym>GIN</acronym> interface has a high level of abstraction,
|
||||
requiring the access method implementer to only implement the semantics of
|
||||
requiring the access method implementer only to implement the semantics of
|
||||
the data type being accessed. The <acronym>GIN</acronym> layer itself
|
||||
takes care of concurrency, logging and searching the tree structure.
|
||||
</para>
|
||||
@ -53,26 +54,14 @@
|
||||
<para>
|
||||
All it takes to get a <acronym>GIN</acronym> access method working
|
||||
is to implement four user-defined methods, which define the behavior of
|
||||
keys in the tree. In short, <acronym>GIN</acronym> combines extensibility
|
||||
along with generality, code reuse, and a clean interface.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="gin-implementation">
|
||||
<title>Implementation</title>
|
||||
|
||||
<para>
|
||||
Internally, <acronym>GIN</acronym> consists of a B-tree index constructed
|
||||
over keys, where each key is an element of the indexed value
|
||||
(element of array, for example) and where each tuple in a leaf page is
|
||||
either a pointer to a B-tree over heap pointers (PT, posting tree), or a
|
||||
list of heap pointers (PL, posting list) if the tuple is small enough.
|
||||
keys in the tree and the relationships between keys, indexed values,
|
||||
and indexable queries. In short, <acronym>GIN</acronym> combines
|
||||
extensibility with generality, code reuse, and a clean interface.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
There are four methods that an index operator class for
|
||||
<acronym>GIN</acronym> must provide (prototypes are in pseudocode):
|
||||
The four methods that an index operator class for
|
||||
<acronym>GIN</acronym> must provide are:
|
||||
</para>
|
||||
|
||||
<variablelist>
|
||||
@ -80,9 +69,9 @@
|
||||
<term>int compare(Datum a, Datum b)</term>
|
||||
<listitem>
|
||||
<para>
|
||||
Compares keys (not indexed values!) and returns an integer less than
|
||||
zero, zero, or greater than zero, indicating whether the first key is
|
||||
less than, equal to, or greater than the second.
|
||||
Compares keys (not indexed values!) and returns an integer less than
|
||||
zero, zero, or greater than zero, indicating whether the first key is
|
||||
less than, equal to, or greater than the second.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
@ -91,21 +80,26 @@
|
||||
<term>Datum* extractValue(Datum inputValue, uint32 *nkeys)</term>
|
||||
<listitem>
|
||||
<para>
|
||||
Returns an array of keys of value to be indexed, nkeys should
|
||||
contain the number of returned keys.
|
||||
Returns an array of keys given a value to be indexed. The
|
||||
number of returned keys must be stored into <literal>*nkeys</>.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term>Datum* extractQuery(Datum query, uint32 nkeys,
|
||||
StrategyNumber n)</term>
|
||||
<term>Datum* extractQuery(Datum query, uint32 *nkeys,
|
||||
StrategyNumber n)</term>
|
||||
<listitem>
|
||||
<para>
|
||||
Returns an array of keys of the query to be executed. n contains the
|
||||
strategy number of the operation (see <xref
|
||||
linkend="xindex-strategies">). Depending on n, query may be
|
||||
different type.
|
||||
Returns an array of keys given a value to be queried; that is,
|
||||
<literal>query</> is the value on the right-hand side of an
|
||||
indexable operator whose left-hand side is the indexed column.
|
||||
<literal>n</> is the strategy number of the operator within the
|
||||
operator class (see <xref linkend="xindex-strategies">).
|
||||
Often, <function>extractQuery</> will need
|
||||
to consult <literal>n</> to determine the data type of
|
||||
<literal>query</> and the key values that need to be extracted.
|
||||
The number of returned keys must be stored into <literal>*nkeys</>.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
@ -114,11 +108,16 @@
|
||||
<term>bool consistent(bool check[], StrategyNumber n, Datum query)</term>
|
||||
<listitem>
|
||||
<para>
|
||||
Returns TRUE if the indexed value satisfies the query qualifier with
|
||||
strategy n (or may satisfy in case of RECHECK mark in operator class).
|
||||
Each element of the check array is TRUE if the indexed value has a
|
||||
corresponding key in the query: if (check[i] == TRUE) the i-th key of
|
||||
the query is present in the indexed value.
|
||||
Returns TRUE if the indexed value satisfies the query operator with
|
||||
strategy number <literal>n</> (or may satisfy, if the operator is
|
||||
marked RECHECK in the operator class). The <literal>check</> array has
|
||||
the same length as the number of keys previously returned by
|
||||
<function>extractQuery</> for this query. Each element of the
|
||||
<literal>check</> array is TRUE if the indexed value contains the
|
||||
corresponding query key, ie, if (check[i] == TRUE) the i-th key of the
|
||||
<function>extractQuery</> result array is present in the indexed value.
|
||||
The original <literal>query</> datum (not the extracted key array!) is
|
||||
passed in case the <function>consistent</> method needs to consult it.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
@ -127,6 +126,19 @@
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="gin-implementation">
|
||||
<title>Implementation</title>
|
||||
|
||||
<para>
|
||||
Internally, a <acronym>GIN</acronym> index contains a B-tree index
|
||||
constructed over keys, where each key is an element of the indexed value
|
||||
(a member of an array, for example) and where each tuple in a leaf page is
|
||||
either a pointer to a B-tree over heap pointers (PT, posting tree), or a
|
||||
list of heap pointers (PL, posting list) if the list is small enough.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="gin-tips">
|
||||
<title>GIN tips and tricks</title>
|
||||
|
||||
@ -134,44 +146,43 @@
|
||||
<varlistentry>
|
||||
<term>Create vs insert</term>
|
||||
<listitem>
|
||||
<para>
|
||||
In most cases, insertion into a <acronym>GIN</acronym> index is slow
|
||||
due to the likelihood of many keys being inserted for each value.
|
||||
So, for bulk insertions into a table it is advisable to to drop the GIN
|
||||
index and recreate it after finishing bulk insertion.
|
||||
</para>
|
||||
<para>
|
||||
In most cases, insertion into a <acronym>GIN</acronym> index is slow
|
||||
due to the likelihood of many keys being inserted for each value.
|
||||
So, for bulk insertions into a table it is advisable to drop the GIN
|
||||
index and recreate it after finishing bulk insertion.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term>gin_fuzzy_search_limit</term>
|
||||
<term><xref linkend="guc-gin-fuzzy-search-limit"></term>
|
||||
<listitem>
|
||||
<para>
|
||||
The primary goal of developing <acronym>GIN</acronym> indices was
|
||||
support for highly scalable, full-text search in
|
||||
<productname>PostgreSQL</productname> and there are often situations when
|
||||
a full-text search returns a very large set of results. Since reading
|
||||
tuples from the disk and sorting them could take a lot of time, this is
|
||||
unacceptable for production. (Note that the index search itself is very
|
||||
fast.)
|
||||
<para>
|
||||
The primary goal of developing <acronym>GIN</acronym> indexes was
|
||||
to create support for highly scalable, full-text search in
|
||||
<productname>PostgreSQL</productname>, and there are often situations when
|
||||
a full-text search returns a very large set of results. Moreover, this
|
||||
often happens when the query contains very frequent words, so that the
|
||||
large result set is not even useful. Since reading many
|
||||
tuples from the disk and sorting them could take a lot of time, this is
|
||||
unacceptable for production. (Note that the index search itself is very
|
||||
fast.)
|
||||
</para>
|
||||
<para>
|
||||
To facilitate controlled execution of such queries
|
||||
<acronym>GIN</acronym> has a configurable soft upper limit on the size
|
||||
of the returned set, the
|
||||
<varname>gin_fuzzy_search_limit</varname> configuration parameter.
|
||||
It is set to 0 (meaning no limit) by default.
|
||||
If a non-zero limit is set, then the returned set is a subset of
|
||||
the whole result set, chosen at random.
|
||||
</para>
|
||||
<para>
|
||||
<quote>Soft</quote> means that the actual number of returned results
|
||||
could differ slightly from the specified limit, depending on the query
|
||||
and the quality of the system's random number generator.
|
||||
</para>
|
||||
<para>
|
||||
Such queries usually contain very frequent words, so the results are not
|
||||
very helpful. To facilitate execution of such queries
|
||||
<acronym>GIN</acronym> has a configurable soft upper limit of the size
|
||||
of the returned set, determined by the
|
||||
<varname>gin_fuzzy_search_limit</varname> GUC variable. It is set to 0 by
|
||||
default (no limit).
|
||||
</para>
|
||||
<para>
|
||||
If a non-zero search limit is set, then the returned set is a subset of
|
||||
the whole result set, chosen at random.
|
||||
</para>
|
||||
<para>
|
||||
<quote>Soft</quote> means that the actual number of returned results
|
||||
could slightly differ from the specified limit, depending on the query
|
||||
and the quality of the system's random number generator.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
@ -182,21 +193,30 @@
|
||||
<title>Limitations</title>
|
||||
|
||||
<para>
|
||||
<acronym>GIN</acronym> doesn't support full index scans due to their
|
||||
extreme inefficiency: because there are often many keys per value,
|
||||
each heap pointer will be returned several times.
|
||||
<acronym>GIN</acronym> doesn't support full index scans: because there are
|
||||
often many keys per value, each heap pointer would be returned many times,
|
||||
and there is no easy way to prevent this.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
When <function>extractQuery</function> returns zero keys,
|
||||
<acronym>GIN</acronym> will emit an error: for different opclasses and
|
||||
strategies the semantic meaning of a void query may be different (for
|
||||
example, any array contains the void array, but they don't overlap the
|
||||
void array), and <acronym>GIN</acronym> can't suggest a reasonable answer.
|
||||
<acronym>GIN</acronym> will emit an error. Depending on the operator,
|
||||
a void query might match all, some, or none of the indexed values (for
|
||||
example, every array contains the empty array, but does not overlap the
|
||||
empty array), and <acronym>GIN</acronym> can't determine the correct
|
||||
answer, nor produce a full-index-scan result if it could determine that
|
||||
that was correct.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
<acronym>GIN</acronym> searches keys only by equality matching. This may
|
||||
It is not an error for <function>extractValue</> to return zero keys,
|
||||
but in this case the indexed value will be unrepresented in the index.
|
||||
This is another reason why full index scan is not useful — it would
|
||||
miss such rows.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
<acronym>GIN</acronym> searches keys only by equality matching. This may
|
||||
be improved in future.
|
||||
</para>
|
||||
</sect1>
|
||||
@ -206,12 +226,12 @@
|
||||
|
||||
<para>
|
||||
The <productname>PostgreSQL</productname> source distribution includes
|
||||
<acronym>GIN</acronym> classes for one-dimensional arrays of all internal
|
||||
<acronym>GIN</acronym> classes for one-dimensional arrays of all internal
|
||||
types. The following
|
||||
<filename>contrib</> modules also contain <acronym>GIN</acronym>
|
||||
operator classes:
|
||||
operator classes:
|
||||
</para>
|
||||
|
||||
|
||||
<variablelist>
|
||||
<varlistentry>
|
||||
<term>intarray</term>
|
||||
|
@ -1,4 +1,4 @@
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.65 2006/10/23 18:10:31 petere Exp $ -->
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.66 2006/12/01 23:46:46 tgl Exp $ -->
|
||||
|
||||
<chapter id="indexes">
|
||||
<title id="indexes-title">Indexes</title>
|
||||
@ -116,7 +116,7 @@ CREATE INDEX test1_id_index ON test1 (id);
|
||||
|
||||
<para>
|
||||
<productname>PostgreSQL</productname> provides several index types:
|
||||
B-tree, Hash, GIN and GiST. Each index type uses a different
|
||||
B-tree, Hash, GiST and GIN. Each index type uses a different
|
||||
algorithm that is best suited to different types of queries.
|
||||
By default, the <command>CREATE INDEX</command> command will create a
|
||||
B-tree index, which fits the most common situations.
|
||||
@ -247,8 +247,8 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
|
||||
<primary>GIN</primary>
|
||||
<see>index</see>
|
||||
</indexterm>
|
||||
GIN is a inverted index and it's usable for values which have more
|
||||
than one key, arrays for example. Like GiST, GIN may support
|
||||
GIN indexes are inverted indexes which can handle values that contain more
|
||||
than one key, arrays for example. Like GiST, GIN can support
|
||||
many different user-defined indexing strategies and the particular
|
||||
operators with which a GIN index can be used vary depending on the
|
||||
indexing strategy.
|
||||
@ -267,7 +267,8 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
|
||||
(See <xref linkend="functions-array"> for the meaning of
|
||||
these operators.)
|
||||
Other GIN operator classes are available in the <literal>contrib</>
|
||||
<literal>tsearch2</literal> and <literal>intarray</literal> modules. For more information see <xref linkend="GIN">.
|
||||
<literal>tsearch2</literal> and <literal>intarray</literal> modules.
|
||||
For more information see <xref linkend="GIN">.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.52 2006/10/23 18:10:32 petere Exp $ -->
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.53 2006/12/01 23:46:46 tgl Exp $ -->
|
||||
|
||||
<sect1 id="xindex">
|
||||
<title>Interfacing Extensions To Indexes</title>
|
||||
@ -243,15 +243,16 @@
|
||||
</table>
|
||||
|
||||
<para>
|
||||
GIN indexes are similar to GiST's in flexibility: they don't have a fixed
|
||||
et of strategies. Instead, the <quote>consistency</> support routine
|
||||
interprets the strategy numbers accordingly with operator class
|
||||
definition. As an example, strategies of operator class over arrays
|
||||
is shown in <xref linkend="xindex-gin-array-strat-table">.
|
||||
GIN indexes are similar to GiST indexes in flexibility: they don't have a
|
||||
fixed set of strategies. Instead the support routines of each operator
|
||||
class interpret the strategy numbers according to the operator class's
|
||||
definition. As an example, the strategy numbers used by the built-in
|
||||
operator classes for arrays are
|
||||
shown in <xref linkend="xindex-gin-array-strat-table">.
|
||||
</para>
|
||||
|
||||
<table tocentry="1" id="xindex-gin-array-strat-table">
|
||||
<title>GIN Array's Strategies</title>
|
||||
<title>GIN Array Strategies</title>
|
||||
<tgroup cols="2">
|
||||
<thead>
|
||||
<row>
|
||||
@ -388,36 +389,35 @@
|
||||
<tbody>
|
||||
<row>
|
||||
<entry>consistent - determine whether key satisfies the
|
||||
query qualifier</entry>
|
||||
query qualifier</entry>
|
||||
<entry>1</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>union - compute union of of a set of given keys</entry>
|
||||
<entry>union - compute union of a set of keys</entry>
|
||||
<entry>2</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>compress - computes a compressed representation of a key or value
|
||||
to be indexed</entry>
|
||||
<entry>compress - compute a compressed representation of a key or value
|
||||
to be indexed</entry>
|
||||
<entry>3</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>decompress - computes a decompressed representation of a
|
||||
compressed key </entry>
|
||||
<entry>decompress - compute a decompressed representation of a
|
||||
compressed key</entry>
|
||||
<entry>4</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>penalty - compute penalty for inserting new key into subtree
|
||||
with given subtree's key</entry>
|
||||
with given subtree's key</entry>
|
||||
<entry>5</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>picksplit - determine which entries of a page are to be moved
|
||||
to the new page and compute the union keys for resulting pages </entry>
|
||||
to the new page and compute the union keys for resulting pages</entry>
|
||||
<entry>6</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>equal - compare two keys and returns true if they are equal
|
||||
</entry>
|
||||
<entry>equal - compare two keys and return true if they are equal</entry>
|
||||
<entry>7</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
@ -441,23 +441,22 @@
|
||||
<tbody>
|
||||
<row>
|
||||
<entry>
|
||||
compare - Compare two keys and return an integer less than zero, zero, or
|
||||
greater than zero, indicating whether the first key is less than, equal to,
|
||||
or greater than the second.
|
||||
</entry>
|
||||
compare - compare two keys and return an integer less than zero, zero,
|
||||
or greater than zero, indicating whether the first key is less than,
|
||||
equal to, or greater than the second
|
||||
</entry>
|
||||
<entry>1</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>extractValue - extract keys from value to be indexed</entry>
|
||||
<entry>extractValue - extract keys from a value to be indexed</entry>
|
||||
<entry>2</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>extractQuery - extract keys from query</entry>
|
||||
<entry>extractQuery - extract keys from a query condition</entry>
|
||||
<entry>3</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>consistent - determine whether value matches by the
|
||||
query</entry>
|
||||
<entry>consistent - determine whether value matches query condition</entry>
|
||||
<entry>4</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
@ -822,12 +821,16 @@ CREATE OPERATOR CLASS polygon_ops
|
||||
STORAGE box;
|
||||
</programlisting>
|
||||
|
||||
At present, only the GiST and GIN index method supports a
|
||||
At present, only the GiST and GIN index methods support a
|
||||
<literal>STORAGE</> type that's different from the column data type.
|
||||
The GiST <literal>compress</> and <literal>decompress</> support
|
||||
The GiST <function>compress</> and <function>decompress</> support
|
||||
routines must deal with data-type conversion when <literal>STORAGE</>
|
||||
is used. Functions named <literal>extractValue</> and <literal>extractQuery</>
|
||||
do conversation into internally used types for GIN.
|
||||
is used. In GIN, the <literal>STORAGE</> type identifies the type of
|
||||
the <quote>key</> values, which normally is different from the type
|
||||
of the indexed column — for example, an operator class for
|
||||
integer array columns might have keys that are just integers. The
|
||||
GIN <function>extractValue</> and <function>extractQuery</> support
|
||||
routines are responsible for extracting keys from indexed values.
|
||||
</para>
|
||||
</sect2>
|
||||
|
||||
|
Loading…
x
Reference in New Issue
Block a user