Add description of new features
parent
7e63445d59
commit
bf028fa8a6
@ -427,9 +427,9 @@ concatenation also works with NULL fields.</strong></p>
|
|||||||
<p>We need to create the index on the column idxFTI. Keep in mind
|
<p>We need to create the index on the column idxFTI. Keep in mind
|
||||||
that the database will update the index when some action is taken.
|
that the database will update the index when some action is taken.
|
||||||
In this case we _need_ the index (The whole point of Full Text
|
In this case we _need_ the index (The whole point of Full Text
|
||||||
INDEXINGi ;-)), so don't worry about any indexing overhead. We will
|
INDEXING ;-)), so don't worry about any indexing overhead. We will
|
||||||
create an index based on the gist function. GiST is an index
|
create an index based on the gist or gin function. GiST is an index
|
||||||
structure for Generalized Search Tree.</p>
|
structure for Generalized Search Tree; GIN is an inverted index (see <a href="tsearch2-ref.html#indexes">The tsearch2 Reference: Indexes</a>).</p>
|
||||||
<pre>
|
<pre>
|
||||||
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
|
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
|
||||||
VACUUM FULL ANALYZE;
|
VACUUM FULL ANALYZE;
|
||||||
|
@ -1,7 +1,6 @@
|
|||||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
||||||
<html>
|
<html>
|
||||||
<head>
|
<head>
|
||||||
<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
|
|
||||||
<title>tsearch2 guide</title>
|
<title>tsearch2 guide</title>
|
||||||
</head>
|
</head>
|
||||||
<body>
|
<body>
|
||||||
@ -9,16 +8,13 @@
|
|||||||
|
|
||||||
<p align=center>
|
<p align=center>
|
||||||
Brandon Craig Rhodes<br>30 June 2003
|
Brandon Craig Rhodes<br>30 June 2003
|
||||||
|
<br>Updated for the 8.2 release by Oleg Bartunov, October 2006
|
||||||
<p>
|
<p>
|
||||||
This Guide introduces the reader to the PostgreSQL tsearch2 module,
|
This Guide introduces the reader to the PostgreSQL tsearch2 module,
|
||||||
version 2.
|
version 2.
|
||||||
More formal descriptions of the module's types and functions
|
More formal descriptions of the module's types and functions
|
||||||
are provided in the <a href="tsearch2-ref.html">tsearch2 Reference</a>,
|
are provided in the <a href="tsearch2-ref.html">tsearch2 Reference</a>,
|
||||||
which is a companion to this document.
|
which is a companion to this document.
|
||||||
You can retrieve a beta copy of the tsearch2 module from the
|
|
||||||
<a href="http://www.sai.msu.su/~megera/postgres/gist/">GiST for PostgreSQL</a>
|
|
||||||
page — look under the section entitled <i>Development History</i>
|
|
||||||
for the current version.
|
|
||||||
<p>
|
<p>
|
||||||
First we will examine the <tt>tsvector</tt> and <tt>tsquery</tt> types
|
First we will examine the <tt>tsvector</tt> and <tt>tsquery</tt> types
|
||||||
and how they are used to search documents;
|
and how they are used to search documents;
|
||||||
@ -32,15 +28,40 @@ you should be able to run the examples here exactly as they are typed.
|
|||||||
<hr>
|
<hr>
|
||||||
<h2>Table of Contents</h2>
|
<h2>Table of Contents</h2>
|
||||||
<blockquote>
|
<blockquote>
|
||||||
|
<a href="#intro">Introduction to FTS with tsearch2</a><br>
|
||||||
<a href="#vectors_queries">Vectors and Queries</a><br>
|
<a href="#vectors_queries">Vectors and Queries</a><br>
|
||||||
<a href="#simple_search">A Simple Search Engine</a><br>
|
<a href="#simple_search">A Simple Search Engine</a><br>
|
||||||
<a href="#weights">Ranking and Position Weights</a><br>
|
<a href="#weights">Ranking and Position Weights</a><br>
|
||||||
<a href="#casting">Casting Vectors and Queries</a><br>
|
<a href="#casting">Casting Vectors and Queries</a><br>
|
||||||
<a href="#parsing_lexing">Parsing and Lexing</a><br>
|
<a href="#parsing_lexing">Parsing and Lexing</a><br>
|
||||||
|
<a href="#ref">Additional information</a>
|
||||||
</blockquote>
|
</blockquote>
|
||||||
|
|
||||||
<hr>
|
<hr>
|
||||||
|
|
||||||
|
|
||||||
|
<h2><a name="intro">Introduction to FTS with tsearch2</a></h2>
|
||||||
|
The purpose of FTS is to
|
||||||
|
find <b>documents</b> that satisfy a <b>query</b> and optionally return
|
||||||
|
them in some <b>order</b>.
|
||||||
|
The most common case is to find documents containing all query terms and return them in order
|
||||||
|
of their similarity to the query. A document in a database can be
|
||||||
|
any text attribute, or combination of text attributes from one or many tables
|
||||||
|
(using joins).
|
||||||
|
Text search operators have existed for years; in PostgreSQL they are
|
||||||
|
<tt><b>~, ~*, LIKE, ILIKE</b></tt>, but they lack linguistic support,
|
||||||
|
tend to be slow, and have no relevance ranking. The idea behind tsearch2
|
||||||
|
is rather simple: preprocess the document at index time to save time at the search stage.
|
||||||
|
Preprocessing includes
|
||||||
|
<ul>
|
||||||
|
<li>parsing the document into words
|
||||||
|
<li>linguistic processing: normalizing words to obtain lexemes
|
||||||
|
<li>storing the document in a form optimized for searching
|
||||||
|
</ul>
|
||||||
|
Tsearch2, in a nutshell, provides an FTS operator (contains) for two new data types,
|
||||||
|
which represent the document and the query: <tt>tsquery @@ tsvector</tt>.
|
||||||
|
|
||||||
|
<P>
|
||||||
<h2><a name=vectors_queries>Vectors and Queries</a></h2>
|
<h2><a name=vectors_queries>Vectors and Queries</a></h2>
|
||||||
|
|
||||||
<blockquote>
|
<blockquote>
|
||||||
@ -79,6 +100,8 @@ Preparing your document index involves two steps:
|
|||||||
on the <tt>tsvector</tt> column of a table,
|
on the <tt>tsvector</tt> column of a table,
|
||||||
which implements a form of the Berkeley
|
which implements a form of the Berkeley
|
||||||
<a href="http://gist.cs.berkeley.edu/"><i>Generalized Search Tree</i></a>.
|
<a href="http://gist.cs.berkeley.edu/"><i>Generalized Search Tree</i></a>.
|
||||||
|
Since PostgreSQL 8.2, tsearch2 also supports the <a href="http://www.sigaev.ru/gin/">GIN</a> index,
|
||||||
|
which is an inverted index, commonly used in search engines. It adds scalability to tsearch2.
|
||||||
</ul>
|
</ul>
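<p>
As a minimal sketch of the indexing step, assuming a table <tt>tblMessages</tt> with a <tt>tsvector</tt> column <tt>idxFTI</tt> (the table, column, and index names are illustrative and not part of the module), the two index types could be created as follows. In practice you would normally choose one of them.
<pre>
-- GiST index on the tsvector column
CREATE INDEX idxFTI_gist_idx ON tblMessages USING gist(idxFTI);
-- GIN (inverted) index on the same column; requires PostgreSQL 8.2 or later
CREATE INDEX idxFTI_gin_idx ON tblMessages USING gin(idxFTI);
</pre>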
|
||||||
Once your documents are indexed,
|
Once your documents are indexed,
|
||||||
performing a search involves:
|
performing a search involves:
|
||||||
@ -251,7 +274,7 @@ and give you an error to prevent this mistake:
|
|||||||
|
|
||||||
<pre>
|
<pre>
|
||||||
=# <b>SELECT to_tsquery('the')</b>
|
=# <b>SELECT to_tsquery('the')</b>
|
||||||
NOTICE: Query contains only stopword(s) or doesn't contain lexeme(s), ignored
|
NOTICE: Query contains only stopword(s) or doesn't contain lexem(s), ignored
|
||||||
to_tsquery
|
to_tsquery
|
||||||
------------
|
------------
|
||||||
|
|
||||||
@ -483,8 +506,8 @@ The <tt>rank()</tt> function existed in older versions of OpenFTS,
|
|||||||
and has the feature that you can assign different weights
|
and has the feature that you can assign different weights
|
||||||
to words from different sections of your document.
|
to words from different sections of your document.
|
||||||
The <tt>rank_cd()</tt> uses a recent technique for weighting results
|
The <tt>rank_cd()</tt> uses a recent technique for weighting results
|
||||||
but does not allow different weight to be given
|
and also allows different weights to be given
|
||||||
to different sections of your document.
|
to different sections of your document (since 8.2).
|
||||||
<p>
|
<p>
|
||||||
Both ranking functions allow you to specify,
|
Both ranking functions allow you to specify,
|
||||||
as an optional last argument,
|
as an optional last argument,
|
||||||
@ -511,9 +534,6 @@ for details
|
|||||||
see the <a href="tsearch2-ref.html#ranking">section on ranking</a>
|
see the <a href="tsearch2-ref.html#ranking">section on ranking</a>
|
||||||
in the Reference.
|
in the Reference.
|
||||||
<p>
|
<p>
|
||||||
The <tt>rank()</tt> function offers more flexibility
|
|
||||||
because it pays attention to the <i>weights</i>
|
|
||||||
with which you have labelled lexeme positions.
|
|
||||||
Currently tsearch2 supports four different weight labels:
|
Currently tsearch2 supports four different weight labels:
|
||||||
<tt>'D'</tt>, the default weight;
|
<tt>'D'</tt>, the default weight;
|
||||||
and <tt>'A'</tt>, <tt>'B'</tt>, and <tt>'C'</tt>.
|
and <tt>'A'</tt>, <tt>'B'</tt>, and <tt>'C'</tt>.
|
||||||
@ -730,7 +750,7 @@ The main problem is that the apostrophe and backslash
|
|||||||
are important <i>both</i> to PostgreSQL when it is interpreting a string,
|
are important <i>both</i> to PostgreSQL when it is interpreting a string,
|
||||||
<i>and</i> to the <tt>tsvector</tt> conversion function.
|
<i>and</i> to the <tt>tsvector</tt> conversion function.
|
||||||
You may want to review section
|
You may want to review section
|
||||||
<a href="http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=sql-syntax.html#SQL-SYNTAX-STRINGS">1.1.2.1,
|
<a href="http://www.postgresql.org/docs/current/static/sql-syntax.html#SQL-SYNTAX-STRINGS">
|
||||||
“String Constants”</a>
|
“String Constants”</a>
|
||||||
in the PostgreSQL documentation before proceeding.
|
in the PostgreSQL documentation before proceeding.
|
||||||
<p>
|
<p>
|
||||||
@ -1051,6 +1071,14 @@ using the same scheme to determine the dictionary for each token,
|
|||||||
with the difference that the query parser recognizes as special
|
with the difference that the query parser recognizes as special
|
||||||
the boolean operators that separate query words.
|
the boolean operators that separate query words.
|
||||||
|
|
||||||
|
|
||||||
|
<h2><a name="ref">Additional information</a></h2>
|
||||||
|
More information about tsearch2 is available from the
|
||||||
|
<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2">tsearch2</a> page.
|
||||||
|
It is also worth checking the
|
||||||
|
<a href="http://www.sai.msu.su/~megera/wiki/Tsearch2">tsearch2 wiki</a> pages.
|
||||||
|
|
||||||
|
|
||||||
</body>
|
</body>
|
||||||
</html>
|
</html>
|
||||||
|
|
||||||
|
@ -1,53 +1,74 @@
|
|||||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
||||||
<link type="text/css" rel="stylesheet" href="tsearch2-ref_files/tsearch.txt"><title>tsearch2 reference</title></head>
|
<html><head>
|
||||||
|
|
||||||
|
<title>tsearch2 reference</title></head>
|
||||||
|
|
||||||
<body>
|
<body>
|
||||||
<h1 align="center">The tsearch2 Reference</h1>
|
<h1 align="center">The tsearch2 Reference</h1>
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
Brandon Craig Rhodes<br>30 June 2003 (edited by Oleg Bartunov, 2 Aug 2003).
|
Brandon Craig Rhodes<br>30 June 2003 (edited by Oleg Bartunov, 2 Aug 2003).
|
||||||
</p><p>
|
<br>Massive update for 8.2 release by Oleg Bartunov, October 2006
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
This Reference documents the user types and functions
|
This Reference documents the user types and functions
|
||||||
of the tsearch2 module for PostgreSQL.
|
of the tsearch2 module for PostgreSQL.
|
||||||
An introduction to the module is provided
|
An introduction to the module is provided
|
||||||
by the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html">tsearch2 Guide</a>,
|
by the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html">tsearch2 Guide</a>,
|
||||||
a companion document to this one.
|
a companion document to this one.
|
||||||
You can retrieve a beta copy of the tsearch2 module from the
|
</p>
|
||||||
<a href="http://www.sai.msu.su/%7Emegera/postgres/gist/">GiST for PostgreSQL</a>
|
|
||||||
page -- look under the section entitled <i>Development History</i>
|
|
||||||
for the current version.
|
|
||||||
|
|
||||||
</p><h2><a name="vq">Vectors and Queries</a></h2>
|
<h2>Table of Contents</h2>
|
||||||
|
<blockquote>
|
||||||
|
<a href="#vq">Vectors and Queries</a><br>
|
||||||
|
<a href="#vqo">Vector Operations</a><br>
|
||||||
|
<a href="#qo">Query Operations</a><br>
|
||||||
|
<a href="#fts">Full Text Search Operator</a><br>
|
||||||
|
<a href="#configurations">Configurations</a><br>
|
||||||
|
<a href="#testing">Testing</a><br>
|
||||||
|
<a href="#parsers">Parsers</a><br>
|
||||||
|
<a href="#dictionaries">Dictionaries</a><br>
|
||||||
|
<a href="#ranking">Ranking</a><br>
|
||||||
|
<a href="#headlines">Headlines</a><br>
|
||||||
|
<a href="#indexes">Indexes</a><br>
|
||||||
|
<a href="#tz">Thesaurus dictionary</a><br>
|
||||||
|
</blockquote>
|
||||||
|
|
||||||
<a name="vq">Vectors and queries both store lexemes,
|
|
||||||
|
|
||||||
|
|
||||||
|
<h2><a name="vq">Vectors and Queries</a></h2>
|
||||||
|
|
||||||
|
Vectors and queries both store lexemes,
|
||||||
but for different purposes.
|
but for different purposes.
|
||||||
A <tt>tsvector</tt> stores the lexemes
|
A <tt>tsvector</tt> stores the lexemes
|
||||||
of the words that are parsed out of a document,
|
of the words that are parsed out of a document,
|
||||||
and can also remember the position of each word.
|
and can also remember the position of each word.
|
||||||
A <tt>tsquery</tt> specifies a boolean condition among lexemes.
|
A <tt>tsquery</tt> specifies a boolean condition among lexemes.
|
||||||
</a><p>
|
<p>
|
||||||
<a name="vq">Any of the following functions with a <tt><i>configuration</i></tt> argument
|
Any of the following functions with a <tt><i>configuration</i></tt> argument
|
||||||
can use either an integer <tt>id</tt> or textual <tt>ts_name</tt>
|
can use either an integer <tt>id</tt> or textual <tt>ts_name</tt>
|
||||||
to select a configuration;
|
to select a configuration;
|
||||||
if the option is omitted, then the current configuration is used.
|
if the option is omitted, then the current configuration is used.
|
||||||
For more information on the current configuration,
|
For more information on the current configuration,
|
||||||
read the next section on Configurations.
|
read the next section on Configurations.
|
||||||
|
</p>
|
||||||
|
|
||||||
</a></p><h3><a name="vq">Vector Operations</a></h3>
|
<h3><a name="vqo">Vector Operations</a></h3>
|
||||||
|
|
||||||
<dl><dt>
|
<dl><dt>
|
||||||
<a name="vq"> <tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em>
|
<tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em>
|
||||||
<i>document</i> TEXT) RETURNS tsvector</tt>
|
<i>document</i> TEXT) RETURNS TSVECTOR</tt>
|
||||||
</a></dt><dd>
|
</dt><dd>
|
||||||
<a name="vq"> Parses a document into tokens,
|
Parses a document into tokens,
|
||||||
reduces the tokens to lexemes,
|
reduces the tokens to lexemes,
|
||||||
and returns a <tt>tsvector</tt> which lists the lexemes
|
and returns a <tt>tsvector</tt> which lists the lexemes
|
||||||
together with their positions in the document.
|
together with their positions in the document.
|
||||||
For the best description of this process,
|
For the best description of this process,
|
||||||
see the section on </a><a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#ps">Parsing and Stemming</a>
|
see the section on <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#ps">Parsing and Stemming</a>
|
||||||
in the accompanying tsearch2 Guide.
|
in the accompanying tsearch2 Guide.
|
||||||
</dd><dt>
|
</dd><dt>
|
||||||
<tt>strip(<i>vector</i> tsvector) RETURNS tsvector</tt>
|
<tt>strip(<i>vector</i> TSVECTOR) RETURNS TSVECTOR</tt>
|
||||||
</dt><dd>
|
</dt><dd>
|
||||||
Return a vector which lists the same lexemes
|
Return a vector which lists the same lexemes
|
||||||
as the given <tt><i>vector</i></tt>,
|
as the given <tt><i>vector</i></tt>,
|
||||||
@ -56,10 +77,10 @@ read the next section on Configurations.
|
|||||||
While the returned vector is thus useless for relevance ranking,
|
While the returned vector is thus useless for relevance ranking,
|
||||||
it will usually be much smaller.
|
it will usually be much smaller.
|
||||||
</dd><dt>
|
</dd><dt>
|
||||||
<tt>setweight(<i>vector</i> tsvector, <i>letter</i>) RETURNS tsvector</tt>
|
<tt>setweight(<i>vector</i> TSVECTOR, <i>letter</i>) RETURNS TSVECTOR</tt>
|
||||||
</dt><dd>
|
</dt><dd>
|
||||||
This function returns a copy of the input vector
|
This function returns a copy of the input vector
|
||||||
in which every location has been labelled
|
in which every location has been labeled
|
||||||
with either the <tt><i>letter</i></tt>
|
with either the <tt><i>letter</i></tt>
|
||||||
<tt>'A'</tt>, <tt>'B'</tt>, or <tt>'C'</tt>,
|
<tt>'A'</tt>, <tt>'B'</tt>, or <tt>'C'</tt>,
|
||||||
or the default label <tt>'D'</tt>
|
or the default label <tt>'D'</tt>
|
||||||
@ -68,11 +89,11 @@ read the next section on Configurations.
|
|||||||
These labels are retained when vectors are concatenated,
|
These labels are retained when vectors are concatenated,
|
||||||
allowing words from different parts of a document
|
allowing words from different parts of a document
|
||||||
to be weighted differently by ranking functions.
|
to be weighted differently by ranking functions.
|
||||||
</dd><dt>
|
</dd>
|
||||||
<tt><i>vector1</i> || <i>vector2</i></tt>
|
<dt>
|
||||||
</dt><dt class="br">
|
<tt><i>vector1</i> || <i>vector2</i></tt><BR>
|
||||||
<tt>concat(<i>vector1</i> tsvector, <i>vector2</i> tsvector)
|
<tt>concat(<i>vector1</i> TSVECTOR, <i>vector2</i> TSVECTOR)
|
||||||
RETURNS tsvector</tt>
|
RETURNS TSVECTOR</tt>
|
||||||
</dt><dd>
|
</dt><dd>
|
||||||
Returns a vector which combines the lexemes and position information
|
Returns a vector which combines the lexemes and position information
|
||||||
in the two vectors given as arguments.
|
in the two vectors given as arguments.
|
||||||
@ -95,27 +116,81 @@ read the next section on Configurations.
|
|||||||
to the <tt>rank()</tt> function
|
to the <tt>rank()</tt> function
|
||||||
that assigns different weights to positions with different labels.
|
that assigns different weights to positions with different labels.
|
||||||
</dd><dt>
|
</dd><dt>
|
||||||
<tt>tsvector_size(<i>vector</i> tsvector) RETURNS INT4</tt>
|
<tt>length(<i>vector</i> TSVECTOR) RETURNS INT4</tt>
|
||||||
</dt><dd>
|
</dt><dd>
|
||||||
Returns the number of lexemes stored in the vector.
|
Returns the number of lexemes stored in the vector.
|
||||||
</dd><dt>
|
</dd><dt>
|
||||||
<tt><i>text</i>::tsvector RETURNS tsvector</tt>
|
<tt><i>text</i>::TSVECTOR RETURNS TSVECTOR</tt>
|
||||||
</dt><dd>
|
</dt><dd>
|
||||||
Directly casting text to a <tt>tsvector</tt>
|
Directly casting text to a <tt>tsvector</tt>
|
||||||
allows you to directly inject lexemes into a vector,
|
allows you to directly inject lexemes into a vector,
|
||||||
with whatever positions and position weights you choose to specify.
|
with whatever positions and position weights you choose to specify.
|
||||||
The <tt><i>text</i></tt> should be formatted
|
The <tt><i>text</i></tt> should be formatted
|
||||||
like the vector would be printed by the output of a <tt>SELECT</tt>.
|
like the vector would be printed by the output of a <tt>SELECT</tt>.
|
||||||
See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
|
See the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
|
||||||
section in the Guide for details.
|
section in the Guide for details.
|
||||||
</dd></dl>
|
</dd><dt>
|
||||||
|
<tt>tsearch2(<i>vector_column_name</i>[, (<i>my_filter_name</i> | <i>text_column_name1</i>) [...] ], <i>text_column_nameN</i>)</tt>
|
||||||
<h3>Query Operations</h3>
|
|
||||||
|
|
||||||
<dl><dt>
|
|
||||||
<tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
|
|
||||||
<i>querytext</i> text) RETURNS tsvector</tt>
|
|
||||||
</dt><dd>
|
</dt><dd>
|
||||||
|
The <tt>tsearch2()</tt> trigger is used to automatically update <i>vector_column_name</i>; <i>my_filter_name</i>
|
||||||
|
is the name of a function used to preprocess <i>text_column_name</i>. There can be many
|
||||||
|
functions and text columns specified in the <tt>tsearch2()</tt> trigger.
|
||||||
|
The following rule is used:
|
||||||
|
a function is applied to all subsequent text columns until the next function occurs.
|
||||||
|
For example, the function <tt>dropatsymbol</tt> replaces all occurrences of the <tt>@</tt>
|
||||||
|
sign with a space.
|
||||||
|
<pre>
|
||||||
|
CREATE FUNCTION dropatsymbol(text) RETURNS text
|
||||||
|
AS 'select replace($1, ''@'', '' '');'
|
||||||
|
LANGUAGE SQL;
|
||||||
|
|
||||||
|
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT
|
||||||
|
ON tblMessages FOR EACH ROW EXECUTE PROCEDURE
|
||||||
|
tsearch2(tsvector_column,dropatsymbol, strMessage);
|
||||||
|
</pre>
|
||||||
|
</dd>
|
||||||
|
|
||||||
|
<dt>
|
||||||
|
<tt>stat(<i>sqlquery</i> text [, <i>weight</i> text]) RETURNS SETOF statinfo</tt>
|
||||||
|
</dt><dd>
|
||||||
|
Here <tt>statinfo</tt> is a type, defined as
|
||||||
|
<tt>
|
||||||
|
CREATE TYPE statinfo as (<i>word</i> text, <i>ndoc</i> int4, <i>nentry</i> int4)
|
||||||
|
</tt> and <i>sqlquery</i> is a query which returns a <tt>tsvector</tt> column.
|
||||||
|
<P>
|
||||||
|
This returns statistics (the number of documents <i>ndoc</i> and total number <i>nentry</i> of <i>word</i>
|
||||||
|
in the collection) about the <tt>tsvector</tt> column <i>vector</i>.
|
||||||
|
This is useful to check how good your configuration is and
|
||||||
|
to find stop-word candidates. For example, to find the 10 most frequent words:
|
||||||
|
<pre>
|
||||||
|
=# select * from stat('select vector from apod') order by ndoc desc, nentry desc,word limit 10;
|
||||||
|
</pre>
|
||||||
|
Optionally, one can specify <i>weight</i> to obtain statistics about words with specific weight.
|
||||||
|
<pre>
|
||||||
|
=# select * from stat('select vector from apod','a') order by ndoc desc, nentry desc,word limit 10;
|
||||||
|
</pre>
|
||||||
|
|
||||||
|
</dd>
|
||||||
|
<dt>
|
||||||
|
<tt>TSVECTOR < TSVECTOR</tt><BR>
|
||||||
|
<tt>TSVECTOR <= TSVECTOR</tt><BR>
|
||||||
|
<tt>TSVECTOR = TSVECTOR</tt><BR>
|
||||||
|
<tt>TSVECTOR >= TSVECTOR</tt><BR>
|
||||||
|
<tt>TSVECTOR > TSVECTOR</tt>
|
||||||
|
</dt><dd>
|
||||||
|
All btree operations are defined for the <tt>tsvector</tt> type. <tt>tsvectors</tt> are compared
|
||||||
|
with each other using lexicographical order.
|
||||||
|
</dd>
|
||||||
|
</dl>
|
||||||
|
|
||||||
|
<h3><a name="qo">Query Operations</a></h3>
|
||||||
|
|
||||||
|
<dl>
|
||||||
|
<dt>
|
||||||
|
<tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
|
||||||
|
<i>querytext</i> text) RETURNS TSQUERY</tt>
|
||||||
|
</dt>
|
||||||
|
<dd>
|
||||||
Parses a query,
|
Parses a query,
|
||||||
which should be single words separated by the boolean operators
|
which should be single words separated by the boolean operators
|
||||||
"<tt>&</tt>" and,
|
"<tt>&</tt>" and,
|
||||||
@ -124,13 +199,26 @@ read the next section on Configurations.
|
|||||||
which can be grouped using parenthesis.
|
which can be grouped using parenthesis.
|
||||||
Each word is reduced to a lexeme using the current
|
Each word is reduced to a lexeme using the current
|
||||||
or specified configuration.
|
or specified configuration.
|
||||||
|
A weight class can be assigned to each lexeme entry
|
||||||
|
to restrict the search region
|
||||||
|
(see <tt>setweight</tt> for explanation), for example
|
||||||
|
"<tt>fat:a & rats</tt>".
|
||||||
|
</dd>
|
||||||
|
<dt>
|
||||||
|
<tt>plainto_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
|
||||||
|
<i>querytext</i> text) RETURNS TSQUERY</tt>
|
||||||
|
</dt>
|
||||||
|
<dd>
|
||||||
|
Transforms unformatted text to tsquery. It is the same as to_tsquery,
|
||||||
|
but assumes the "<tt>&</tt>" boolean operator between words and doesn't
|
||||||
|
recognize weight classes.
|
||||||
|
</dd><dt>
|
||||||
|
|
||||||
</dd><dt>
|
<tt>querytree(<i>query</i> TSQUERY) RETURNS text</tt>
|
||||||
<tt>querytree(<i>query</i> tsquery) RETURNS text</tt>
|
|
||||||
</dt><dd>
|
</dt><dd>
|
||||||
This might return a textual representation of the given query.
|
This returns the query that is actually used when searching in the GiST index.
|
||||||
</dd><dt>
|
</dd><dt>
|
||||||
<tt><i>text</i>::tsquery RETURNS tsquery</tt>
|
<tt><i>text</i>::TSQUERY RETURNS TSQUERY</tt>
|
||||||
</dt><dd>
|
</dt><dd>
|
||||||
Directly casting text to a <tt>tsquery</tt>
|
Directly casting text to a <tt>tsquery</tt>
|
||||||
allows you to directly inject lexemes into a query,
|
allows you to directly inject lexemes into a query,
|
||||||
@ -139,7 +227,117 @@ read the next section on Configurations.
|
|||||||
like the query would be printed by the output of a <tt>SELECT</tt>.
|
like the query would be printed by the output of a <tt>SELECT</tt>.
|
||||||
See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
|
See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
|
||||||
section in the Guide for details.
|
section in the Guide for details.
|
||||||
</dd></dl>
|
</dd>
|
||||||
|
<dt>
|
||||||
|
<tt>numnode(<i>query</i> TSQUERY) RETURNS INTEGER</tt>
|
||||||
|
</dt><dd>
|
||||||
|
This returns the number of nodes in the query tree.
|
||||||
|
</dd><dt>
|
||||||
|
<tt>TSQUERY && TSQUERY RETURNS TSQUERY</tt>
|
||||||
|
</dt><dd>
|
||||||
|
AND-ed TSQUERY
|
||||||
|
</dd><dt>
|
||||||
|
<tt>TSQUERY || TSQUERY RETURNS TSQUERY</tt>
|
||||||
|
</dt> <dd>
|
||||||
|
OR-ed TSQUERY
|
||||||
|
</dd><dt>
|
||||||
|
<tt>!! TSQUERY RETURNS TSQUERY</tt>
|
||||||
|
</dt> <dd>
|
||||||
|
negation of TSQUERY
|
||||||
|
</dd>
|
||||||
|
<dt>
|
||||||
|
<tt>TSQUERY < TSQUERY</tt><BR>
|
||||||
|
<tt>TSQUERY <= TSQUERY</tt><BR>
|
||||||
|
<tt>TSQUERY = TSQUERY</tt><BR>
|
||||||
|
<tt>TSQUERY >= TSQUERY</tt><BR>
|
||||||
|
<tt>TSQUERY > TSQUERY</tt>
|
||||||
|
</dt><dd>
|
||||||
|
All btree operations are defined for the <tt>tsquery</tt> type. <tt>tsqueries</tt> are compared
|
||||||
|
with each other using lexicographical order.
|
||||||
|
</dd>
|
||||||
|
</dl>
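<p>
The query operations above can be sketched as follows. The query strings are illustrative only, and the exact lexemes produced by <tt>plainto_tsquery</tt> depend on the current configuration.
<pre>
-- Plain text to a query; the words are joined with the & operator
SELECT plainto_tsquery('fat rats');
-- Number of nodes (lexemes plus operators) in the query tree
SELECT numnode('(fat & rat) | cat'::tsquery);
-- Combine existing queries with AND, OR, and negation
SELECT 'fat | rat'::tsquery && 'cat'::tsquery;
SELECT 'fat | rat'::tsquery || 'cat'::tsquery;
SELECT !! 'cat'::tsquery;
</pre>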
|
||||||
|
|
||||||
|
<h3>Query rewriting</h3>
|
||||||
|
Query rewriting is a set of functions and operators for the tsquery type.
|
||||||
|
It allows you to control the search at query time without reindexing (unlike a thesaurus), for example, to
|
||||||
|
expand the search using synonyms (new york, big apple, nyc, gotham).
|
||||||
|
<P>
|
||||||
|
The <tt><b>rewrite()</b></tt> function changes the original <i>query</i> by replacing <i>target</i> with <i>sample</i>.
|
||||||
|
There are three ways to use the <tt>rewrite()</tt> function. Note that the arguments of the <tt>rewrite()</tt>
|
||||||
|
function can be column names of type <tt>tsquery</tt>.
|
||||||
|
<pre>
|
||||||
|
create table rw (q TSQUERY, t TSQUERY, s TSQUERY);
|
||||||
|
insert into rw values('a & b','a', 'c');
|
||||||
|
</pre>
|
||||||
|
<dl>
|
||||||
|
<dt> <tt>rewrite (<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY) RETURNS TSQUERY</tt>
|
||||||
|
</dt>
|
||||||
|
<dd>
|
||||||
|
<pre>
|
||||||
|
=# select rewrite('a & b'::TSQUERY, 'a'::TSQUERY, 'c'::TSQUERY);
|
||||||
|
rewrite
|
||||||
|
-----------
|
||||||
|
'c' & 'b'
|
||||||
|
</pre>
|
||||||
|
</dd>
|
||||||
|
<dt> <tt>rewrite (ARRAY[<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY]) RETURNS TSQUERY</tt>
|
||||||
|
</dt>
|
||||||
|
<dd>
|
||||||
|
<pre>
|
||||||
|
=# select rewrite(ARRAY['a & b'::TSQUERY, t,s]) from rw;
|
||||||
|
rewrite
|
||||||
|
-----------
|
||||||
|
'c' & 'b'
|
||||||
|
</pre>
|
||||||
|
</dd>
|
||||||
|
<dt> <tt>rewrite (<i>query</i> TSQUERY,'select <i>target</i> ,<i>sample</i> from test'::text) RETURNS TSQUERY</tt>
|
||||||
|
</dt>
|
||||||
|
<dd>
|
||||||
|
<pre>
|
||||||
|
=# select rewrite('a & b'::TSQUERY, 'select t,s from rw'::text);
|
||||||
|
rewrite
|
||||||
|
-----------
|
||||||
|
'c' & 'b'
|
||||||
|
</pre>
|
||||||
|
</dd>
|
||||||
|
</dl>
|
||||||
|
Two operators are defined for the <tt>tsquery</tt> type:
|
||||||
|
<dl>
|
||||||
|
<dt><tt>TSQUERY @ TSQUERY</tt></dt>
|
||||||
|
<dd>
|
||||||
|
Returns <tt>TRUE</tt> if the right argument might be contained in the left argument.
|
||||||
|
</dd>
|
||||||
|
<dt><tt>TSQUERY ~ TSQUERY</tt></dt>
|
||||||
|
<dd>
|
||||||
|
Returns <tt>TRUE</tt> if the left argument might be contained in the right argument.
|
||||||
|
</dd>
|
||||||
|
</dl>
|
||||||
|
To speed up these operators, one can use a GiST index with the <tt>gist_tp_tsquery_ops</tt> opclass.
|
||||||
|
<pre>
|
||||||
|
create index qq on test_tsquery using gist (keyword gist_tp_tsquery_ops);
|
||||||
|
</pre>
|
||||||
|
|
||||||
|
<h2><a name="fts">Full Text Search operator</a></h2>
|
||||||
|
|
||||||
|
<dl><dt>
|
||||||
|
<tt>TSQUERY @@ TSVECTOR</tt><br>
|
||||||
|
<tt>TSVECTOR @@ TSQUERY</tt>
|
||||||
|
</dt>
|
||||||
|
<dd>
|
||||||
|
Returns <tt>TRUE</tt> if <tt>TSQUERY</tt> is contained in <tt>TSVECTOR</tt> and
|
||||||
|
<tt>FALSE</tt> otherwise.
|
||||||
|
<pre>
|
||||||
|
=# select 'cat & rat':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
|
||||||
|
?column?
|
||||||
|
----------
|
||||||
|
t
|
||||||
|
=# select 'fat & cow':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
|
||||||
|
?column?
|
||||||
|
----------
|
||||||
|
f
|
||||||
|
</pre>
|
||||||
|
</dd>
|
||||||
|
</dl>
|
||||||
|
|
||||||
<h2><a name="configurations">Configurations</a></h2>
|
<h2><a name="configurations">Configurations</a></h2>
|
||||||
|
|
||||||
@ -147,7 +345,7 @@ A configuration specifies all of the equipment necessary
|
|||||||
to transform a document into a <tt>tsvector</tt>:
|
to transform a document into a <tt>tsvector</tt>:
|
||||||
the parser that breaks its text into tokens,
|
the parser that breaks its text into tokens,
|
||||||
and the dictionaries which then transform each token into a lexeme.
|
and the dictionaries which then transform each token into a lexeme.
|
||||||
Every call to <tt>to_tsvector()</tt> (described above)
|
Every call to <tt>to_tsvector(), to_tsquery()</tt> (described above)
|
||||||
uses a configuration to perform its processing.
|
uses a configuration to perform its processing.
|
||||||
Three configurations come with tsearch2:
|
Three configurations come with tsearch2:
|
||||||
|
|
||||||
@ -157,7 +355,10 @@ Three configurations come with tsearch2:
|
|||||||
and the <i>simple</i> dictionary for all others.
|
and the <i>simple</i> dictionary for all others.
|
||||||
</li><li><b>default_russian</b> -- Indexes words and numbers,
|
</li><li><b>default_russian</b> -- Indexes words and numbers,
|
||||||
using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
|
using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
|
||||||
and the <i>ru_stem</i> Russian Snowball dictionary for all others.
|
and the <i>ru_stem</i> Russian Snowball dictionary for all others. It is the default
|
||||||
|
for the <tt>ru_RU.KOI8-R</tt> locale.
|
||||||
|
</li><li><b>utf8_russian</b> -- the same as <b>default_russian</b> but
|
||||||
|
for the <tt>ru_RU.UTF-8</tt> locale.
|
||||||
</li><li><b>simple</b> -- Processes both words and numbers
|
</li><li><b>simple</b> -- Processes both words and numbers
|
||||||
with the <i>simple</i> dictionary,
|
with the <i>simple</i> dictionary,
|
||||||
which neither discards any stop words nor alters them.
|
which neither discards any stop words nor alters them.
|
||||||
@ -239,7 +440,8 @@ Here:
|
|||||||
</li><li>description - human readable name of tok_type
|
</li><li>description - human readable name of tok_type
|
||||||
</li><li>token - parser's token
|
</li><li>token - parser's token
|
||||||
</li><li>dict_name - dictionary used for the token
|
</li><li>dict_name - dictionary used for the token
|
||||||
</li><li>tsvector - final result</li></ul>
|
</li><li>tsvector - final result</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
|
||||||
<h2><a name="parsers">Parsers</a></h2>
|
<h2><a name="parsers">Parsers</a></h2>
|
||||||
@ -300,20 +502,40 @@ the current parser is used when this argument is omitted.
|
|||||||
|
|
||||||
<h2><a name="dictionaries">Dictionaries</a></h2>
|
<h2><a name="dictionaries">Dictionaries</a></h2>
|
||||||
|
|
||||||
Dictionaries take textual tokens as input,
|
A dictionary is a program that accepts lexeme(s), usually those produced by a parser,
|
||||||
usually those produced by a parser,
|
as input, and returns:
|
||||||
and return lexemes which are usually some reduced form of the token.
|
<ul>
|
||||||
|
<li>an array of lexeme(s) if the input lexeme is known to the dictionary
|
||||||
|
<li>an empty array if the dictionary knows the lexeme, but it is a stop word
|
||||||
|
<li>NULL if the dictionary does not recognize the input lexeme
|
||||||
|
</ul>
|
||||||
|
Usually, dictionaries are used for normalization of words (ispell and stemmer dictionaries),
|
||||||
|
but see, for example, the <tt>intdict</tt> dictionary (available from the
|
||||||
|
<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">Tsearch2</a> home page),
|
||||||
|
which controls indexing of integers.
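<p>
The first two results can be observed with the <tt>lexize()</tt> function.
This is a minimal sketch, assuming the stock <tt>en_stem</tt> dictionary with its usual stop-word file;
the sample words are only illustrative, and the exact output depends on the installed dictionary.
A stemmer accepts any word, so it is not a good candidate for demonstrating the NULL case.
</p>
<pre>
-- a word known to the dictionary: a non-empty array of lexemes is expected
=# select lexize('en_stem', 'stars');
-- a stop word: an empty array is expected
=# select lexize('en_stem', 'the');
</pre>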
|
||||||
|
|
||||||
|
<P>
|
||||||
Among the dictionaries which come installed with tsearch2 are:
|
Among the dictionaries which come installed with tsearch2 are:
|
||||||
|
|
||||||
<ul>
|
<ul>
|
||||||
<li><b>simple</b> simply folds uppercase letters to lowercase
|
<li><b>simple</b> simply folds uppercase letters to lowercase
|
||||||
before returning the word.
|
before returning the word.
|
||||||
</li><li><b>en_stem</b> runs an English Snowball stemmer on each word
|
</li>
|
||||||
|
<li><b>ispell_template</b> - template for ispell dictionaries.
|
||||||
|
</li>
|
||||||
|
<li><b>en_stem</b> runs an English Snowball stemmer on each word
|
||||||
that attempts to reduce the various forms of a verb or noun
|
that attempts to reduce the various forms of a verb or noun
|
||||||
to a single recognizable form.
|
to a single recognizable form.
|
||||||
</li><li><b>ru_stem</b> runs a Russian Snowball stemmer on each word.
|
</li><li><b>ru_stem_koi8</b>, <b>ru_stem_utf8</b> run a Russian Snowball stemmer on each word.
|
||||||
</li></ul>
|
</li>
|
||||||
|
<li><b>synonym</b> - simple lexeme-to-lexeme replacement
|
||||||
|
</li>
|
||||||
|
<li><b>thesaurus_template</b> - template for <a href="#tz">thesaurus dictionary</a>. It performs
|
||||||
|
phrase-to-phrase replacement
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<P>
|
||||||
Each dictionary is defined by an entry in the <tt>pg_ts_dict</tt> table:
|
Each dictionary is defined by an entry in the <tt>pg_ts_dict</tt> table:
|
||||||
|
|
||||||
<pre>CREATE TABLE pg_ts_dict (
|
<pre>CREATE TABLE pg_ts_dict (
|
||||||
@ -332,6 +554,12 @@ it specifies a file from which stop words should be read.
|
|||||||
The <tt>dict_comment</tt> is a human-readable description of the dictionary.
|
The <tt>dict_comment</tt> is a human-readable description of the dictionary.
|
||||||
The other fields are internal function identifiers
|
The other fields are internal function identifiers
|
||||||
useful only to developers trying to implement their own dictionaries.
|
useful only to developers trying to implement their own dictionaries.
|
||||||
|
|
||||||
|
<blockquote>
|
||||||
|
<b>WARNING:</b> Data files used by dictionaries should be in the <tt>server_encoding</tt> to
|
||||||
|
avoid possible problems!
|
||||||
|
</blockquote>
|
||||||
|
|
||||||
<p>
|
<p>
|
||||||
The argument named <tt><i>dictionary</i></tt>
|
The argument named <tt><i>dictionary</i></tt>
|
||||||
in each of the following functions
|
in each of the following functions
|
||||||
@ -355,6 +583,27 @@ if omitted then the current dictionary is used.
|
|||||||
from which an inflected form could arise.
|
from which an inflected form could arise.
|
||||||
</dd></dl>
|
</dd></dl>
|
||||||
|
|
||||||
|
<h3>Using dictionary templates</h3>
|
||||||
|
Templates are used to define new dictionaries, for example:
|
||||||
|
<pre>
|
||||||
|
INSERT INTO pg_ts_dict
|
||||||
|
(SELECT 'en_ispell', dict_init,
|
||||||
|
'DictFile="/usr/local/share/dicts/ispell/english.dict",'
|
||||||
|
'AffFile="/usr/local/share/dicts/ispell/english.aff",'
|
||||||
|
'StopFile="/usr/local/share/dicts/english.stop"',
|
||||||
|
dict_lexize
|
||||||
|
FROM pg_ts_dict
|
||||||
|
WHERE dict_name = 'ispell_template');
|
||||||
|
</pre>
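<p>
Once defined, the new dictionary still has to be attached to a configuration before it is used.
A sketch, assuming the <tt>en_ispell</tt> dictionary created above and the <tt>default</tt> configuration;
the choice of token types is only an example:
</p>
<pre>
-- consult en_ispell first; en_stem handles the words en_ispell does not recognize
update pg_ts_cfgmap set dict_name='{en_ispell,en_stem}' where ts_name = 'default' and
tok_alias in ('lhword', 'lword', 'lpart_hword');
</pre>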
|
||||||
|
|
||||||
|
<h3>Working with stop words</h3>
|
||||||
|
Ispell and snowball stemmers treat stop words differently:
|
||||||
|
<ul>
|
||||||
|
<li>ispell - normalizes the word and then looks up the normalized form in the stop-word file
|
||||||
|
<li>snowball stemmer - first looks up the word in the stop-word file and then does its job.
|
||||||
|
The reason is to minimize possible 'noise'.
|
||||||
|
</ul>
|
||||||
|
|
||||||
<h2><a name="ranking">Ranking</a></h2>
|
<h2><a name="ranking">Ranking</a></h2>
|
||||||
|
|
||||||
Ranking attempts to measure how relevant documents are to particular queries
|
Ranking attempts to measure how relevant documents are to particular queries
|
||||||
@ -364,26 +613,18 @@ Note that this information is only available in unstripped vectors --
|
|||||||
ranking functions will only return a useful result
|
ranking functions will only return a useful result
|
||||||
for a <tt>tsvector</tt> which still has position information!
|
for a <tt>tsvector</tt> which still has position information!
|
||||||
<p>
|
<p>
|
||||||
Both of these ranking functions
|
Note that the supplied ranking functions are just examples and
|
||||||
take an integer <i>normalization</i> option
|
do not belong to the tsearch2 core; you can
|
||||||
that specifies whether a document's length should impact its rank.
|
write your own ranking function and/or combine additional
|
||||||
This is often desirable,
|
factors to fit your specific needs.
|
||||||
since a hundred-word document with five instances of a search word
|
</p>
|
||||||
is probably more relevant than a thousand-word document with five instances.
|
|
||||||
The option can have the values:
|
|
||||||
|
|
||||||
</p><ul>
|
|
||||||
<li><tt>0</tt> (the default) ignores document length.
|
|
||||||
</li><li><tt>1</tt> divides the rank by the logarithm of the length.
|
|
||||||
</li><li><tt>2</tt> divides the rank by the length itself.
|
|
||||||
</li></ul>
|
|
||||||
|
|
||||||
The two ranking functions currently available are:
|
The two ranking functions currently available are:
|
||||||
|
|
||||||
<dl><dt>
|
<dl><dt>
|
||||||
<tt>CREATE FUNCTION rank(<br>
|
<tt>CREATE FUNCTION rank(<br>
|
||||||
<em>[</em> <i>weights</i> float4[], <em>]</em>
|
<em>[</em> <i>weights</i> float4[], <em>]</em>
|
||||||
<i>vector</i> tsvector, <i>query</i> tsquery,
|
<i>vector</i> TSVECTOR, <i>query</i> TSQUERY,
|
||||||
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
|
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
|
||||||
) RETURNS float4</tt>
|
) RETURNS float4</tt>
|
||||||
</dt><dd>
|
</dt><dd>
|
||||||
@ -399,8 +640,8 @@ The two ranking functions currently available are:
|
|||||||
and make them more or less important than words in the document body.
|
and make them more or less important than words in the document body.
|
||||||
</dd><dt>
|
</dd><dt>
|
||||||
<tt>CREATE FUNCTION rank_cd(<br>
|
<tt>CREATE FUNCTION rank_cd(<br>
|
||||||
<em>[</em> <i>K</i> int4, <em>]</em>
|
<em>[</em> <i>weights</i> float4[], <em>]</em>
|
||||||
<i>vector</i> tsvector, <i>query</i> tsquery,
|
<i>vector</i> TSVECTOR, <i>query</i> TSQUERY,
|
||||||
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
|
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
|
||||||
) RETURNS float4</tt>
|
) RETURNS float4</tt>
|
||||||
</dt><dd>
|
</dt><dd>
|
||||||
@ -409,20 +650,51 @@ The two ranking functions currently available are:
|
|||||||
as described in Clarke, Cormack, and Tudhope's
|
as described in Clarke, Cormack, and Tudhope's
|
||||||
"<a href="http://citeseer.nj.nec.com/clarke00relevance.html">Relevance Ranking for One to Three Term Queries</a>"
|
"<a href="http://citeseer.nj.nec.com/clarke00relevance.html">Relevance Ranking for One to Three Term Queries</a>"
|
||||||
in the 1999 <i>Information Processing and Management</i>.
|
in the 1999 <i>Information Processing and Management</i>.
|
||||||
The value <i>K</i> is one of the values from their formula,
|
</dd>
|
||||||
and defaults to <i>K</i>=4.
|
<dt>
|
||||||
The examples in their paper <i>K</i>=16;
|
<tt>CREATE FUNCTION get_covers(vector TSVECTOR, query TSQUERY) RETURNS text</tt>
|
||||||
we can roughly describe the term
|
</dt>
|
||||||
as stating how far apart two search terms can fall
|
<dd>
|
||||||
before the formula begins penalizing them for lack of proximity.
|
Returns <tt>extents</tt>, which are the shortest non-nested sequences of words that satisfy a query.
|
||||||
</dd></dl>
|
Extents (covers) are used in the <tt>rank_cd</tt> algorithm for fast calculation of proximity ranking.
|
||||||
|
In the example below there are two extents: <tt><b>{1</b>...<b>}1</b> and <b>{2</b> ...<b>}2</b></tt>.
|
||||||
|
<pre>
|
||||||
|
=# select get_covers('1:1,2,10 2:4'::tsvector,'1& 2');
|
||||||
|
get_covers
|
||||||
|
----------------------
|
||||||
|
1 {1 1 {2 2 }1 1 }2
|
||||||
|
</pre>
|
||||||
|
</dd>
|
||||||
|
|
||||||
|
</dl>
|
||||||
|
|
||||||
|
<p>
|
||||||
|
Both ranking functions (<tt>rank()</tt> and <tt>rank_cd()</tt>)
|
||||||
|
take an integer <i>normalization</i> option
|
||||||
|
that specifies whether a document's length should impact its rank.
|
||||||
|
This is often desirable,
|
||||||
|
since a hundred-word document with five instances of a search word
|
||||||
|
is probably more relevant than a thousand-word document with five instances.
|
||||||
|
The option can take the following values, which can be combined using "|" (for example, <tt>2|4</tt>) to
|
||||||
|
take into account several factors:
|
||||||
|
|
||||||
|
</p>
|
||||||
|
<ul>
|
||||||
|
<li><tt>0</tt> (the default) ignores document length.</li>
|
||||||
|
<li><tt>1</tt> divides the rank by 1 + the logarithm of the length.</li>
|
||||||
|
<li><tt>2</tt> divides the rank by the length itself.</li>
|
||||||
|
<li><tt>4</tt> divides the rank by the mean harmonic distance between extents.</li>
|
||||||
|
<li><tt>8</tt> divides the rank by the number of unique words in the document.</li>
|
||||||
|
<li><tt>16</tt> divides the rank by 1 + the logarithm of the number of unique words in the document.
|
||||||
|
</li>
|
||||||
|
</ul>
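<p>
For example, to take both document length and the distance between extents into account,
the values 2 and 4 can be combined. A sketch with made-up sample text, assuming the <tt>default</tt> configuration:
</p>
<pre>
-- 2|4 is evaluated by SQL as the integer 6
=# select rank_cd(to_tsvector('default', 'a fat cat sat on a mat and ate a fat rat'),
                  to_tsquery('default', 'cat & rat'), 2|4);
</pre>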
|
||||||
|
|
||||||
<h2><a name="headlines">Headlines</a></h2>
|
<h2><a name="headlines">Headlines</a></h2>
|
||||||
|
|
||||||
<dl><dt>
|
<dl><dt>
|
||||||
<tt>CREATE FUNCTION headline(<br>
|
<tt>CREATE FUNCTION headline(<br>
|
||||||
<em>[</em> <i>id</i> int4, <em>|</em> <i>ts_name</i> text, <em>]</em>
|
<em>[</em> <i>id</i> int4, <em>|</em> <i>ts_name</i> text, <em>]</em>
|
||||||
<i>document</i> text, <i>query</i> tsquery,
|
<i>document</i> text, <i>query</i> TSQUERY,
|
||||||
<em>[</em> <i>options</i> text <em>]</em><br>
|
<em>[</em> <i>options</i> text <em>]</em><br>
|
||||||
) RETURNS text</tt>
|
) RETURNS text</tt>
|
||||||
</dt><dd>
|
</dt><dd>
|
||||||
@ -448,10 +720,123 @@ The two ranking functions currently available are:
|
|||||||
with a word which has this many characters or less.
|
with a word which has this many characters or less.
|
||||||
The default value of <tt>3</tt> should eliminate most English
|
The default value of <tt>3</tt> should eliminate most English
|
||||||
conjunctions and articles.
|
conjunctions and articles.
|
||||||
|
</li><li><tt>HighlightAll</tt> --
|
||||||
|
boolean flag; if TRUE, then the whole document will be highlighted.
|
||||||
</li></ul>
|
</li></ul>
|
||||||
Any unspecified options receive these defaults:
|
Any unspecified options receive these defaults:
|
||||||
<pre>StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3
|
<pre>StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
|
||||||
</pre>
|
</pre>
|
||||||
</dd></dl>
|
</dd></dl>
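<p>
As an illustration of the options string, the following sketch overrides only the selection markers and the word limits,
leaving the other options at their defaults. The sample text and the <tt>default</tt> configuration are assumptions:
</p>
<pre>
=# select headline('default',
                   'Supernovae stars are the brightest phenomena in galaxies',
                   to_tsquery('default', 'supernovae'),
                   'StartSel=[, StopSel=], MaxWords=10, MinWords=5');
</pre>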
|
||||||
|
|
||||||
|
|
||||||
|
<h2><a name="indexes">Indexes</a></h2>
|
||||||
|
Tsearch2 supports indexed access to tsvector in order to further speed up FTS. Note that indexes are not mandatory for FTS!
|
||||||
|
<ul>
|
||||||
|
<li> RD-Tree (Russian Doll Tree, matryoshka), based on GiST (Generalized Search Tree)
|
||||||
|
<pre>
|
||||||
|
=# create index fts_idx on apod using gist(fts);
|
||||||
|
</pre>
|
||||||
|
<li>GIN - Generalized Inverted Index
|
||||||
|
<pre>
|
||||||
|
=# create index fts_idx on apod using gin(fts);
|
||||||
|
</pre>
|
||||||
|
</ul>
|
||||||
|
The <b>GiST</b> index is very good for online updates, but is not as scalable as the <b>GIN</b> index,
|
||||||
|
which, in turn, isn't good for updates. Both indexes support concurrency and recovery.
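<p>
Queries do not need to change when an index is added. A sketch, reusing the <tt>apod</tt> table and <tt>fts</tt> column
from the index definitions above and assuming the <tt>default</tt> configuration:
</p>
<pre>
-- the @@ match operator can use either a GiST or a GIN index on fts
=# explain select count(*) from apod where fts @@ to_tsquery('default', 'supernovae');
</pre>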
|
||||||
|
|
||||||
|
<h2><a name="tz">Thesaurus dictionary</a></h2>
|
||||||
|
|
||||||
|
<P>
|
||||||
|
A thesaurus is a collection of words that includes information about the relationships between words and phrases,
|
||||||
|
i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.</p>
|
||||||
|
<p>Basically, a thesaurus dictionary replaces all non-preferred terms with one preferred term and, optionally,
|
||||||
|
preserves them for indexing. The thesaurus is used when indexing, so any change in the thesaurus requires reindexing.
|
||||||
|
Tsearch2's <tt>thesaurus</tt> dictionary (TZ) is an extension of the <tt>synonym</tt> dictionary
|
||||||
|
with <b>phrase</b> support. A thesaurus is a plain file in the following format:
|
||||||
|
<pre>
|
||||||
|
# this is a comment
|
||||||
|
sample word(s) : indexed word(s)
|
||||||
|
...............................
|
||||||
|
</pre>
|
||||||
|
<ul>
|
||||||
|
<li>The <strong>colon</strong> (:) symbol is used as a delimiter.</li>
|
||||||
|
<li>Use an asterisk (<b>*</b>) at the beginning of an <tt>indexed word</tt> to skip the subdictionary.
|
||||||
|
It is still required that the <tt>sample words</tt> be known.</li>
|
||||||
|
<li>The thesaurus dictionary looks for the longest match.</li></ul>
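<p>
A short file illustrating these rules might look as follows; the entries are hypothetical:
</p>
<pre>
# a phrase replaced by a single term, which is normalized by the subdictionary
supernovae stars : sn
# the asterisk keeps the indexed word from being passed to the subdictionary
crab nebulae : *crab
</pre>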
|
||||||
|
<P>
|
||||||
|
TZ uses a <strong>subdictionary</strong> (which should be defined in the tsearch2 configuration)
|
||||||
|
to normalize the thesaurus text. Only <strong>one dictionary</strong> can be defined.
|
||||||
|
Note that the subdictionary produces an error if it cannot recognize a word.
|
||||||
|
In that case, you should remove the definition line containing this word or teach the subdictionary to know it.
|
||||||
|
</p>
|
||||||
|
<p>Stop-words recognized by the subdictionary are replaced by a 'stop-word placeholder', i.e.,
|
||||||
|
only their position is important.
|
||||||
|
To break possible ties, the thesaurus applies the last definition. For example, consider
|
||||||
|
thesaurus rules (with the simple subdictionary) with the pattern 'swsw'
|
||||||
|
('s' designates a stop-word and 'w' a known word): </p>
|
||||||
|
<pre>
|
||||||
|
a one the two : swsw
|
||||||
|
the one a two : swsw2
|
||||||
|
</pre>
|
||||||
|
<p>The words 'a' and 'the' are stop-words defined in the configuration of the subdictionary.
|
||||||
|
The thesaurus considers the texts 'the one the two' and 'that one then two' to be equal and will use the definition
|
||||||
|
'swsw2'.</p>
|
||||||
|
<p>Like a normal dictionary, it should be assigned to specific lexeme types.
|
||||||
|
Since TZ can recognize phrases, it must remember its state and interact with the parser.
|
||||||
|
TZ uses these assignments to check whether it should handle the next word or stop accumulation.
|
||||||
|
The compiler of a TZ should take care to configure it properly to avoid confusion.
|
||||||
|
For example, if TZ is assigned to handle only the <tt>lword</tt> lexeme type, then a TZ definition like
|
||||||
|
' one 1:11' will not work, since the lexeme type <tt>digit</tt> is not assigned to the TZ.</p>
|
||||||
|
|
||||||
|
<h3>Configuration</h3>
|
||||||
|
|
||||||
|
<dl><dt>tsearch2</dt><dd></dd></dl><p>tsearch2 comes with a thesaurus template, which can be used to define a new dictionary: </p>
|
||||||
|
<pre class="real">INSERT INTO pg_ts_dict
|
||||||
|
(SELECT 'tz_simple', dict_init,
|
||||||
|
'DictFile="/path/to/tz_simple.txt",'
|
||||||
|
'Dictionary="en_stem"',
|
||||||
|
dict_lexize
|
||||||
|
FROM pg_ts_dict
|
||||||
|
WHERE dict_name = 'thesaurus_template');
|
||||||
|
|
||||||
|
</pre>
|
||||||
|
<p>Here: </p>
|
||||||
|
<ul>
|
||||||
|
<li><tt>tz_simple</tt> - is the dictionary name</li>
|
||||||
|
<li><tt>DictFile="/path/to/tz_simple.txt"</tt> - is the location of the thesaurus file</li>
|
||||||
|
<li><tt>Dictionary="en_stem"</tt> defines the dictionary (Snowball English stemmer) to use for thesaurus normalization. Note that the <em>en_stem</em> dictionary has its own configuration (stop-words, for example).</li>
|
||||||
|
</ul>
|
||||||
|
<p>Now, it's possible to use <tt>tz_simple</tt> in pg_ts_cfgmap, for example: </p>
|
||||||
|
<pre>
|
||||||
|
update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}' where ts_name = 'default_russian' and
|
||||||
|
tok_alias in ('lhword', 'lword', 'lpart_hword');
|
||||||
|
</pre>
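<p>
The resulting assignment can be checked by reading the same table back, for example:
</p>
<pre>
=# select tok_alias, dict_name from pg_ts_cfgmap where ts_name = 'default_russian' and
tok_alias in ('lhword', 'lword', 'lpart_hword');
</pre>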
|
||||||
|
<h3>Examples</h3>
|
||||||
|
<p>tz_simple: </p>
|
||||||
|
<pre>
|
||||||
|
one : 1
|
||||||
|
two : 2
|
||||||
|
one two : 12
|
||||||
|
the one : 1
|
||||||
|
one 1 : 11
|
||||||
|
</pre>
|
||||||
|
<p>To see how the thesaurus works, one can use the <tt>to_tsvector</tt>, <tt>to_tsquery</tt> or <tt>plainto_tsquery</tt> functions: </p><pre class="real">=# select plainto_tsquery('default_russian',' one day is oneday');
|
||||||
|
plainto_tsquery
|
||||||
|
------------------------
|
||||||
|
'1' & 'day' & 'oneday'
|
||||||
|
|
||||||
|
=# select plainto_tsquery('default_russian','one two day is oneday');
|
||||||
|
plainto_tsquery
|
||||||
|
-------------------------
|
||||||
|
'12' & 'day' & 'oneday'
|
||||||
|
|
||||||
|
=# select plainto_tsquery('default_russian','the one');
|
||||||
|
NOTICE: Thesaurus: word 'the' is recognized as stop-word, assign any stop-word (rule 3)
|
||||||
|
plainto_tsquery
|
||||||
|
-----------------
|
||||||
|
'1'
|
||||||
|
</pre>
|
||||||
|
|
||||||
|
Additional information about the thesaurus dictionary is available from the
|
||||||
|
<a href="http://www.sai.msu.su/~megera/wiki/Thesaurus_dictionary">Wiki</a> page.
|
||||||
</body></html>
|
</body></html>
|
||||||
|