Sync examples of psql \dF output with current CVS HEAD behavior.

Random other wordsmithing.
Tom Lane 2007-09-04 03:46:36 +00:00
parent 6d871a2538
commit fcc6756341


@ -1,7 +1,15 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.16 2007/09/04 03:46:36 tgl Exp $ -->
<chapter id="textsearch"> <chapter id="textsearch">
<title id="textsearch-title">Full Text Search</title>
<title>Full Text Search</title> <indexterm zone="textsearch">
<primary>full text search</primary>
</indexterm>
<indexterm zone="textsearch">
<primary>text search</primary>
</indexterm>
<sect1 id="textsearch-intro"> <sect1 id="textsearch-intro">
<title>Introduction</title> <title>Introduction</title>
@ -67,43 +75,52 @@
<listitem> <listitem>
<para> <para>
<emphasis>Parsing documents into <firstterm>lexemes</></emphasis>. It is <emphasis>Parsing documents into <firstterm>lexemes</></emphasis>. It is
useful to identify various lexemes, e.g. digits, words, complex words, useful to identify various classes of lexemes, e.g. digits, words,
email addresses, so they can be processed differently. In principle complex words, email addresses, so that they can be processed
lexemes depend on the specific application but for an ordinary search it differently. In principle lexeme classes depend on the specific
is useful to have a predefined list of lexemes. <!-- add list of lexemes. application but for an ordinary search it is useful to have a predefined
--> set of classes.
<productname>PostgreSQL</productname> uses a <firstterm>parser</> to
perform this step. A standard parser is provided, and custom parsers
can be created for specific needs.
</para> </para>
</listitem> </listitem>
<listitem> <listitem>
<para> <para>
<emphasis>Dictionaries</emphasis> allow the conversion of lexemes into <emphasis>Converting lexemes into <firstterm>normalized
a <emphasis>normalized form</emphasis> so it is not necessary to enter form</></emphasis>. This allows searches to find variant forms of the
search words in a specific form. same word, without tediously entering all the possible variants.
Also, this step typically eliminates <firstterm>stop words</>, which
are words that are so common that they are useless for searching.
<productname>PostgreSQL</productname> uses <firstterm>dictionaries</> to
perform this step. Various standard dictionaries are provided, and
custom ones can be created for specific needs.
</para> </para>
</listitem> </listitem>
<listitem> <listitem>
<para> <para>
<emphasis>Store</emphasis> preprocessed documents optimized for <emphasis>Storing preprocessed documents optimized for
searching. For example, represent each document as a sorted array searching</emphasis>. For example, each document can be represented
of lexemes. Along with lexemes it is desirable to store positional as a sorted array of normalized lexemes. Along with the lexemes it is
information to use for <varname>proximity ranking</varname>, so that desirable to store positional information to use for <firstterm>proximity
a document which contains a more "dense" region of query words is ranking</firstterm>, so that a document which contains a more
<quote>dense</> region of query words is
assigned a higher rank than one with scattered query words. assigned a higher rank than one with scattered query words.
</para> </para>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
<para> <para>
Dictionaries allow fine-grained control over how lexemes are created. With Dictionaries allow fine-grained control over how lexemes are normalized.
dictionaries you can: With dictionaries you can:
</para> </para>
<itemizedlist spacing="compact" mark="bullet"> <itemizedlist spacing="compact" mark="bullet">
<listitem> <listitem>
<para> <para>
Define "stop words" that should not be indexed. Define stop words that should not be indexed.
</para> </para>
</listitem> </listitem>
@ -135,13 +152,12 @@
</itemizedlist> </itemizedlist>
<para> <para>
A data type (<xref linkend="datatype-textsearch">), <type>tsvector</type> A data type <type>tsvector</type> is provided for storing preprocessed
is provided, for storing preprocessed documents, documents, along with a type <type>tsquery</type> for representing processed
along with a type <type>tsquery</type> for representing textual queries (<xref linkend="datatype-textsearch">). Also, a full text search
queries. Also, a full text search operator <literal>@@</literal> is defined operator <literal>@@</literal> is defined for these data types (<xref
for these data types (<xref linkend="textsearch-searches">). Full text linkend="textsearch-searches">). Full text searches can be accelerated
searches can be accelerated using indexes (<xref using indexes (<xref linkend="textsearch-indexes">).
linkend="textsearch-indexes">).
</para> </para>
@ -154,20 +170,20 @@
</indexterm> </indexterm>
<para> <para>
A document can be a simple text file stored in the file system. The full A <firstterm>document</> is the unit of searching in a full text search
text indexing engine can parse text files and store associations of lexemes system; for example, a magazine article or email message. The text search
(words) with their parent document. Later, these associations are used to engine must be able to parse documents and store associations of lexemes
search for documents which contain query words. In this case, the database (key words) with their parent document. Later, these associations are
can be used to store the full text index and for executing searches, and used to search for documents which contain query words.
some unique identifier can be used to retrieve the document from the file
system.
</para> </para>
<para> <para>
A document can also be any textual database attribute or a combination For searches within <productname>PostgreSQL</productname>,
(concatenation), which in turn can be stored in various tables or obtained a document is normally a textual field within a row of a database table,
dynamically. In other words, a document can be constructed from different or possibly a combination (concatenation) of such fields, perhaps stored
parts for indexing and it might not exist as a whole. For example: in several tables or obtained dynamically. In other words, a document can
be constructed from different parts for indexing and it might not be
stored anywhere as a whole. For example:
<programlisting> <programlisting>
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
@ -184,10 +200,20 @@ WHERE mid = did AND mid = 12;
<para> <para>
Actually, in the previous example queries, <literal>COALESCE</literal> Actually, in the previous example queries, <literal>COALESCE</literal>
<!-- TODO make this a link? --> <!-- TODO make this a link? -->
should be used to prevent a <literal>NULL</literal> attribute from causing should be used to prevent a single <literal>NULL</literal> attribute from
a <literal>NULL</literal> result. causing a <literal>NULL</literal> result for the whole document.
</para> </para>
</note> </note>
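The note above can be illustrated with a minimal sketch. It assumes the single-table form of the example query; the table name <literal>messages</literal> and the nullable text columns are assumptions made for illustration, since the visible context shows only the select list.

```sql
-- Sketch only: "messages" is a hypothetical table with nullable text columns.
-- Without coalesce(), a single NULL column would make the whole
-- concatenation NULL, so the document would be lost entirely.
SELECT coalesce(title, '')    || ' ' ||
       coalesce(author, '')   || ' ' ||
       coalesce(abstract, '') || ' ' ||
       coalesce(body, '') AS document
FROM messages
WHERE mid = 12;
```

Replacing each <literal>NULL</literal> with an empty string keeps the remaining fields searchable, at the cost of a few extra spaces in the concatenated text.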
<para>
Another possibility is to store the documents as simple text files in the
file system. In this case, the database can be used to store the full text
index and to execute searches, and some unique identifier can be used to
retrieve the document from the file system. However, retrieving files
from outside the database requires superuser permissions or special
function support, so this is usually less convenient than keeping all
the data inside <productname>PostgreSQL</productname>.
</para>
</sect2> </sect2>
<sect2 id="textsearch-searches"> <sect2 id="textsearch-searches">
@ -261,8 +287,9 @@ SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::t
<xref linkend="guc-default-text-search-config"> was set accordingly <xref linkend="guc-default-text-search-config"> was set accordingly
in <filename>postgresql.conf</>. If you are using the same text search in <filename>postgresql.conf</>. If you are using the same text search
configuration for the entire cluster you can use the value in configuration for the entire cluster you can use the value in
<filename>postgresql.conf</>. If using different configurations but <filename>postgresql.conf</>. If using different configurations
the same text search configuration for an entire database, throughout the cluster but
the same text search configuration for any one database,
use <command>ALTER DATABASE ... SET</>. If not, you must set <varname> use <command>ALTER DATABASE ... SET</>. If not, you must set <varname>
default_text_search_config</varname> in each session. Many functions default_text_search_config</varname> in each session. Many functions
also take an optional configuration name. also take an optional configuration name.
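The three levels described above might look as follows. This is a sketch, not a recommended setup: the database name <literal>mydb</literal> is hypothetical, and <literal>pg_catalog.english</literal> is used only as an example configuration.

```sql
-- Sketch only: "mydb" is a hypothetical database name.
-- Per-database default, picked up by sessions started after the change:
ALTER DATABASE mydb SET default_text_search_config = 'pg_catalog.english';

-- Per-session setting, needed when neither the cluster-wide nor the
-- per-database default is appropriate:
SET default_text_search_config = 'pg_catalog.english';

-- Explicit configuration name passed to a function, independent of the default:
SELECT to_tsvector('english', 'a fat cat sat on a mat');
```

Passing the configuration name explicitly is the most predictable choice for queries that must behave the same regardless of session settings.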
@ -555,7 +582,7 @@ UPDATE tt SET ti=
<term> <term>
<synopsis> <synopsis>
ts_parse(<replaceable class="PARAMETER">parser</replaceable>, <replaceable class="PARAMETER">document</replaceable> TEXT) returns SETOF <type>tokenout</type> ts_parse(<replaceable class="PARAMETER">parser</replaceable>, <replaceable class="PARAMETER">document</replaceable> text, OUT <replaceable class="PARAMETER">tokid</> integer, OUT <replaceable class="PARAMETER">token</> text) returns SETOF RECORD
</synopsis> </synopsis>
</term> </term>
@ -588,7 +615,7 @@ SELECT * FROM ts_parse('default','123 - a number');
<term> <term>
<synopsis> <synopsis>
ts_token_type(<replaceable class="PARAMETER">parser</replaceable> ) returns SETOF <type>tokentype</type> ts_token_type(<replaceable class="PARAMETER">parser</>, OUT <replaceable class="PARAMETER">tokid</> integer, OUT <replaceable class="PARAMETER">alias</> text, OUT <replaceable class="PARAMETER">description</> text) returns SETOF RECORD
</synopsis> </synopsis>
</term> </term>
@ -1107,20 +1134,20 @@ SELECT ts_lexize('english_stem', 'stars');
(1 row) (1 row)
</programlisting> </programlisting>
Also, the <function>ts_debug</function> function (<xref linkend="textsearch-debugging">) Also, the <function>ts_debug</function> function (<xref
can be used for this. linkend="textsearch-debugging">) is helpful for testing.
</para> </para>
<sect2 id="textsearch-stopwords"> <sect2 id="textsearch-stopwords">
<title>Stop Words</title> <title>Stop Words</title>
<para> <para>
Stop words are words which are very common, appear in almost Stop words are words which are very common, appear in almost every
every document, and have no discrimination value. Therefore, they can be ignored document, and have no discrimination value. Therefore, they can be ignored
in the context of full text searching. For example, every English text contains in the context of full text searching. For example, every English text
words like <literal>a</literal> although it is useless to store them in an index. contains words like <literal>a</literal> and <literal>the</>, so it is
However, stop words do affect the positions in <type>tsvector</type>, useless to store them in an index. However, stop words do affect the
which in turn, do affect ranking: positions in <type>tsvector</type>, which in turn affect ranking:
<programlisting> <programlisting>
SELECT to_tsvector('english','in the list of stop words'); SELECT to_tsvector('english','in the list of stop words');
@ -1542,11 +1569,15 @@ SELECT ts_lexize('norwegian_ispell','sjokoladefabrikk');
<para> <para>
The <application>Snowball</> dictionary template is based on the project The <application>Snowball</> dictionary template is based on the project
of Martin Porter, inventor of the popular Porter's stemming algorithm of Martin Porter, inventor of the popular Porter's stemming algorithm
for the English language and now supported in many languages (see the <ulink for the English language. Snowball now provides stemming algorithms for
url="http://snowball.tartarus.org">Snowball site</ulink> for more many languages (see the <ulink url="http://snowball.tartarus.org">Snowball
information). The Snowball project supplies a large number of stemmers for site</ulink> for more information). Each algorithm understands how to
many languages. A Snowball dictionary requires a language parameter to reduce common variant forms of words to a base, or stem, spelling within
identify which stemmer to use, and optionally can specify a stopword file name. its language. A Snowball dictionary requires a language parameter to
identify which stemmer to use, and optionally can specify a stopword file
name that gives a list of words to eliminate.
(<productname>PostgreSQL</productname>'s standard stopword lists are also
provided by the Snowball project.)
For example, there is a built-in definition equivalent to For example, there is a built-in definition equivalent to
<programlisting> <programlisting>
@ -1782,7 +1813,7 @@ version of our software: PostgreSQL 8.3.
<programlisting> <programlisting>
=&gt; \dF =&gt; \dF
List of fulltext configurations List of text search configurations
Schema | Name | Description Schema | Name | Description
---------+------+------------- ---------+------+-------------
public | pg | public | pg |
@ -2053,24 +2084,24 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
<para> <para>
Information about full text searching objects can be obtained Information about full text searching objects can be obtained
in <literal>psql</literal> using a set of commands: in <application>psql</application> using a set of commands:
<synopsis> <synopsis>
\dF{,d,p}<optional>+</optional> <optional>PATTERN</optional> \dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional>
</synopsis> </synopsis>
An optional <literal>+</literal> produces more details. An optional <literal>+</literal> produces more details.
</para> </para>
<para> <para>
The optional parameter <literal>PATTERN</literal> should be the name of The optional parameter <literal>PATTERN</literal> should be the name of
a full text searching object, optionally schema-qualified. If a text searching object, optionally schema-qualified. If
<literal>PATTERN</literal> is not specified then information about all <literal>PATTERN</literal> is not specified then information about all
visible objects will be displayed. <literal>PATTERN</literal> can be a visible objects will be displayed. <literal>PATTERN</literal> can be a
regular expression and can apply <emphasis>separately</emphasis> to schema regular expression and can provide <emphasis>separate</emphasis> patterns
names and object names. The following examples illustrate this: for the schema and object names. The following examples illustrate this:
<programlisting> <programlisting>
=&gt; \dF *fulltext* =&gt; \dF *fulltext*
List of fulltext configurations List of text search configurations
Schema | Name | Description Schema | Name | Description
--------+--------------+------------- --------+--------------+-------------
public | fulltext_cfg | public | fulltext_cfg |
@ -2078,7 +2109,7 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
<programlisting> <programlisting>
=&gt; \dF *.fulltext* =&gt; \dF *.fulltext*
List of fulltext configurations List of text search configurations
Schema | Name | Description Schema | Name | Description
----------+---------------------------- ----------+----------------------------
fulltext | fulltext_cfg | fulltext | fulltext_cfg |
@ -2093,46 +2124,42 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
<listitem> <listitem>
<para> <para>
List full text searching configurations (add "+" for more detail) List text searching configurations (add <literal>+</> for more detail).
</para>
<para>
By default (without <literal>PATTERN</literal>), information about
all <emphasis>visible</emphasis> full text configurations will be
displayed.
</para> </para>
<para> <para>
<programlisting> <programlisting>
=&gt; \dF russian =&gt; \dF russian
List of fulltext configurations List of text search configurations
Schema | Name | Description Schema | Name | Description
------------+---------+----------------------------------- ------------+---------+------------------------------------
pg_catalog | russian | default configuration for Russian pg_catalog | russian | configuration for russian language
=&gt; \dF+ russian =&gt; \dF+ russian
Configuration "pg_catalog.russian" Text search configuration "pg_catalog.russian"
Parser name: "pg_catalog.default" Parser: "pg_catalog.default"
Token | Dictionaries Token | Dictionaries
--------------+------------------------- --------------+--------------
email | pg_catalog.simple email | simple
file | pg_catalog.simple file | simple
float | pg_catalog.simple float | simple
host | pg_catalog.simple host | simple
hword | pg_catalog.russian_stem hword | russian_stem
int | pg_catalog.simple int | simple
lhword | public.tz_simple lhword | english_stem
lpart_hword | public.tz_simple lpart_hword | english_stem
lword | public.tz_simple lword | english_stem
nlhword | pg_catalog.russian_stem nlhword | russian_stem
nlpart_hword | pg_catalog.russian_stem nlpart_hword | russian_stem
nlword | pg_catalog.russian_stem nlword | russian_stem
part_hword | pg_catalog.simple part_hword | russian_stem
sfloat | pg_catalog.simple sfloat | simple
uint | pg_catalog.simple uint | simple
uri | pg_catalog.simple uri | simple
url | pg_catalog.simple url | simple
version | pg_catalog.simple version | simple
word | pg_catalog.russian_stem word | russian_stem
</programlisting> </programlisting>
</para> </para>
</listitem> </listitem>
@ -2142,35 +2169,31 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
<term>\dFd[+] [PATTERN]</term> <term>\dFd[+] [PATTERN]</term>
<listitem> <listitem>
<para> <para>
List full text dictionaries (add "+" for more detail). List text search dictionaries (add <literal>+</> for more detail).
</para>
<para>
By default (without <literal>PATTERN</literal>), information about
all <emphasis>visible</emphasis> dictionaries will be displayed.
</para> </para>
<para> <para>
<programlisting> <programlisting>
=&gt; \dFd =&gt; \dFd
List of fulltext dictionaries List of text search dictionaries
Schema | Name | Description Schema | Name | Description
------------+------------+----------------------------------------------------------- ------------+-----------------+-----------------------------------------------------------
pg_catalog | danish | Snowball stemmer for danish language pg_catalog | danish_stem | snowball stemmer for danish language
pg_catalog | dutch | Snowball stemmer for dutch language pg_catalog | dutch_stem | snowball stemmer for dutch language
pg_catalog | english | Snowball stemmer for english language pg_catalog | english_stem | snowball stemmer for english language
pg_catalog | finnish | Snowball stemmer for finnish language pg_catalog | finnish_stem | snowball stemmer for finnish language
pg_catalog | french | Snowball stemmer for french language pg_catalog | french_stem | snowball stemmer for french language
pg_catalog | german | Snowball stemmer for german language pg_catalog | german_stem | snowball stemmer for german language
pg_catalog | hungarian | Snowball stemmer for hungarian language pg_catalog | hungarian_stem | snowball stemmer for hungarian language
pg_catalog | italian | Snowball stemmer for italian language pg_catalog | italian_stem | snowball stemmer for italian language
pg_catalog | norwegian | Snowball stemmer for norwegian language pg_catalog | norwegian_stem | snowball stemmer for norwegian language
pg_catalog | portuguese | Snowball stemmer for portuguese language pg_catalog | portuguese_stem | snowball stemmer for portuguese language
pg_catalog | romanian | Snowball stemmer for romanian language pg_catalog | romanian_stem | snowball stemmer for romanian language
pg_catalog | russian | Snowball stemmer for russian language pg_catalog | russian_stem | snowball stemmer for russian language
pg_catalog | simple | simple dictionary: just lower case and check for stopword pg_catalog | simple | simple dictionary: just lower case and check for stopword
pg_catalog | spanish | Snowball stemmer for spanish language pg_catalog | spanish_stem | snowball stemmer for spanish language
pg_catalog | swedish | Snowball stemmer for swedish language pg_catalog | swedish_stem | snowball stemmer for swedish language
pg_catalog | turkish | Snowball stemmer for turkish language pg_catalog | turkish_stem | snowball stemmer for turkish language
</programlisting> </programlisting>
</para> </para>
</listitem> </listitem>
@ -2181,32 +2204,28 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
<term>\dFp[+] [PATTERN]</term> <term>\dFp[+] [PATTERN]</term>
<listitem> <listitem>
<para> <para>
List full text parsers (add "+" for more detail) List text search parsers (add <literal>+</> for more detail).
</para>
<para>
By default (without <literal>PATTERN</literal>), information about
all <emphasis>visible</emphasis> full text parsers will be displayed.
</para> </para>
<para> <para>
<programlisting> <programlisting>
=&gt; \dFp =&gt; \dFp
List of fulltext parsers List of text search parsers
Schema | Name | Description Schema | Name | Description
------------+---------+--------------------- ------------+---------+---------------------
pg_catalog | default | default word parser pg_catalog | default | default word parser
(1 row)
=&gt; \dFp+ =&gt; \dFp+
Fulltext parser "pg_catalog.default" Text search parser "pg_catalog.default"
Method | Function | Description Method | Function | Description
-------------------+---------------------------+------------- ------------------+----------------+-------------
Start parse | pg_catalog.prsd_start | Start parse | prsd_start |
Get next token | pg_catalog.prsd_nexttoken | Get next token | prsd_nexttoken |
End parse | pg_catalog.prsd_end | End parse | prsd_end |
Get headline | pg_catalog.prsd_headline | Get headline | prsd_headline |
Get lexeme's type | pg_catalog.prsd_lextype | Get lexeme types | prsd_lextype |
Token's types for parser "pg_catalog.default" Token types for parser "pg_catalog.default"
Token name | Description Token name | Description
--------------+----------------------------------- --------------+-----------------------------------
blank | Space symbols blank | Space symbols
email | Email email | Email
@ -2237,6 +2256,30 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry>
<term>\dFt[+] [PATTERN]</term>
<listitem>
<para>
List text search templates (add <literal>+</> for more detail).
</para>
<para>
<programlisting>
=&gt; \dFt
List of text search templates
Schema | Name | Description
------------+-----------+-----------------------------------------------------------
pg_catalog | ispell | ispell dictionary
pg_catalog | simple | simple dictionary: just lower case and check for stopword
pg_catalog | snowball | snowball stemmer
pg_catalog | synonym | synonym dictionary: replace word by its synonym
pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution
</programlisting>
</para>
</listitem>
</varlistentry>
</variablelist> </variablelist>
</sect1> </sect1>
@ -2261,7 +2304,7 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
</para> </para>
<para> <para>
<replaceable class="PARAMETER">ts_debug</replaceable> type defined as: <replaceable class="PARAMETER">ts_debug</replaceable>'s result type is defined as:
<programlisting> <programlisting>
CREATE TYPE ts_debug AS ( CREATE TYPE ts_debug AS (
@ -2297,7 +2340,7 @@ ALTER TEXT SEARCH CONFIGURATION public.english
<programlisting> <programlisting>
SELECT * FROM ts_debug('public.english','The Brightest supernovaes'); SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
Alias | Description | Token | Dicts list | Lexized token Alias | Description | Token | Dictionaries | Lexized token
-------+---------------+-------------+---------------------------------------+--------------------------------- -------+---------------+-------------+---------------------------------------+---------------------------------
lword | Latin word | The | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {} lword | Latin word | The | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {}
blank | Space symbols | | | blank | Space symbols | | |