From bf028fa8a653d6379a3257176ce43873f5163798 Mon Sep 17 00:00:00 2001
From: Teodor Sigaev <teodor@sigaev.ru>
Date: Tue, 31 Oct 2006 16:23:05 +0000
Subject: [PATCH] Add description of new features

---
 contrib/tsearch2/docs/tsearch-V2-intro.html |   6 +-
 contrib/tsearch2/docs/tsearch2-guide.html   |  52 +-
 contrib/tsearch2/docs/tsearch2-ref.html     | 533 +++++++++++++++++---
 3 files changed, 502 insertions(+), 89 deletions(-)

diff --git a/contrib/tsearch2/docs/tsearch-V2-intro.html b/contrib/tsearch2/docs/tsearch-V2-intro.html
index b9cb80574e..8b2514e5be 100644
--- a/contrib/tsearch2/docs/tsearch-V2-intro.html
+++ b/contrib/tsearch2/docs/tsearch-V2-intro.html
@@ -427,9 +427,9 @@ concatenation also works with NULL fields.</strong></p>
 <p>We need to create the index on the column idxFTI. Keep in mind
 that the database will update the index when some action is taken.
 In this case we _need_ the index (The whole point of Full Text
-INDEXINGi ;-)), so don't worry about any indexing overhead. We will
-create an index based on the gist function. GiST is an index
-structure for Generalized Search Tree.</p>
+INDEXING ;-)), so don't worry about any indexing overhead. We will
+create an index based on the gist or gin function. GiST is an index
+structure for Generalized Search Tree, GIN is an inverted index (see <a href="tsearch2-ref.html#indexes">The tsearch2 Reference: Indexes</a>).</p>
 <pre>
 CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
 VACUUM FULL ANALYZE;
diff --git a/contrib/tsearch2/docs/tsearch2-guide.html b/contrib/tsearch2/docs/tsearch2-guide.html
index 5540e5d323..d2d764580c 100644
--- a/contrib/tsearch2/docs/tsearch2-guide.html
+++ b/contrib/tsearch2/docs/tsearch2-guide.html
@@ -1,7 +1,6 @@
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
 <head>
-<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
 <title>tsearch2 guide</title>
 </head>
 <body>
@@ -9,16 +8,13 @@
 <p align=center>
 Brandon Craig Rhodes<br>30 June 2003
+<br>Updated for the 8.2 release by Oleg Bartunov, October 2006
 
 <p>
 This Guide introduces the reader to the PostgreSQL tsearch2 module,
 version 2.
 More formal descriptions of the module's types and functions
 are provided in the <a href="tsearch2-ref.html">tsearch2 Reference</a>,
 which is a companion to this document.
-You can retrieve a beta copy of the tsearch2 module from the
-<a href="http://www.sai.msu.su/~megera/postgres/gist/">GiST for PostgreSQL</a>
-page — look under the section entitled <i>Development History</i>
-for the current version.
 
 <p>
 First we will examine the <tt>tsvector</tt> and <tt>tsquery</tt> types
 and how they are used to search documents;
@@ -32,15 +28,40 @@ you should be able to run the examples here exactly as they are typed.
 <hr>
 <h2>Table of Contents</h2>
 <blockquote>
+<a href="#intro">Introduction to FTS with tsearch2</a><br>
 <a href="#vectors_queries">Vectors and Queries</a><br>
 <a href="#simple_search">A Simple Search Engine</a><br>
 <a href="#weights">Ranking and Position Weights</a><br>
 <a href="#casting">Casting Vectors and Queries</a><br>
 <a href="#parsing_lexing">Parsing and Lexing</a><br>
+<a href="#ref">Additional information</a>
</blockquote>
 <hr>
+
+<h2><a name="intro">Introduction to FTS with tsearch2</a></h2>
+The purpose of FTS is to find <b>documents</b> that satisfy a <b>query</b> and,
+optionally, to return them in some <b>order</b>.
+The most common case: find documents containing all query terms and return them
+in order of their similarity to the query.
+A document in the database can be any text attribute, or a combination of text
+attributes from one or many tables (using joins).
+Text search operators have existed in PostgreSQL for years
+(<tt><b>~, ~*, LIKE, ILIKE</b></tt>), but they lack linguistic support,
+tend to be slow, and provide no relevance ranking. The idea behind tsearch2
+is rather simple: preprocess the document at index time to save time at search time.
+Preprocessing includes:
+<ul>
+<li>parsing the document into words
+<li>linguistic processing - normalizing words to obtain lexemes
+<li>storing the document in a form optimized for searching
+</ul>
+Tsearch2, in a nutshell, provides a FTS operator (contains) for two new data types,
+which represent the document and the query - <tt>tsquery @@ tsvector</tt>.
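+<P>
+As a minimal illustration (the sample text is made up; the <tt>to_tsvector</tt> and
+<tt>to_tsquery</tt> functions and the <tt>@@</tt> operator used here are described in
+the sections that follow):
+<pre>
+=# SELECT to_tsvector('a fat cat sat on a mat') @@ to_tsquery('cat & rat');
+ ?column?
+----------
+ f
+
+=# SELECT to_tsvector('a fat cat sat on a mat and ate a fat rat') @@ to_tsquery('cat & rat');
+ ?column?
+----------
+ t
+</pre>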
+
+<P>
 <h2><a name=vectors_queries>Vectors and Queries</a></h2>
 <blockquote>
@@ -79,6 +100,8 @@ Preparing your document index involves two steps:
     on the <tt>tsvector</tt> column of a table,
     which implements a form of the Berkeley
     <a href="http://gist.cs.berkeley.edu/"><i>Generalized Search Tree</i></a>.
+    Since PostgreSQL 8.2, tsearch2 also supports the <a href="http://www.sigaev.ru/gin/">Gin</a> index,
+    which is an inverted index commonly used in search engines. It adds scalability to tsearch2.
 </ul>
 Once your documents are indexed, performing a search involves:
@@ -251,7 +274,7 @@ and give you an error to prevent this mistake:
 <pre>
 =# <b>SELECT to_tsquery('the')</b>
-NOTICE: Query contains only stopword(s) or doesn't contain lexeme(s), ignored
+NOTICE: Query contains only stopword(s) or doesn't contain lexem(s), ignored
  to_tsquery
 ------------
 
@@ -483,8 +506,8 @@ The <tt>rank()</tt> function existed in older versions of OpenFTS,
 and has the feature that you can assign different weights
 to words from different sections of your document.
 The <tt>rank_cd()</tt> uses a recent technique for weighting results
-but does not allow different weight to be given
-to different sections of your document.
+and also allows different weights to be given
+to different sections of your document (since 8.2).
 <p>
 Both ranking functions allow you to specify,
 as an optional last argument,
@@ -511,9 +534,6 @@ for details see the <a href="tsearch2-ref.html#ranking">section on ranking</a>
 in the Reference.
 <p>
-The <tt>rank()</tt> function offers more flexibility
-because it pays attention to the <i>weights</i>
-with which you have labelled lexeme positions.
 Currently tsearch2 supports four different weight labels:
 <tt>'D'</tt>, the default weight;
 and <tt>'A'</tt>, <tt>'B'</tt>, and <tt>'C'</tt>.
@@ -730,7 +750,7 @@ The main problem is that the apostrophe and backslash are important
 <i>both</i> to PostgreSQL when it is interpreting a string,
 <i>and</i> to the <tt>tsvector</tt> conversion function.
 You may want to review section
-<a href="http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=sql-syntax.html#SQL-SYNTAX-STRINGS">1.1.2.1,
+<a href="http://www.postgresql.org/docs/current/static/sql-syntax.html#SQL-SYNTAX-STRINGS">
 “String Constants”</a> in the PostgreSQL documentation
 before proceeding.
 <p>
@@ -1051,6 +1071,14 @@ using the same scheme to determine the dictionary for each token,
 with the difference that the query parser recognizes as special
 the boolean operators that separate query words.
+
+<h2><a name="ref">Additional information</a></h2>
+More information about tsearch2 is available from the
+<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2">tsearch2</a> home page.
+Also, it's worth checking the
+<a href="http://www.sai.msu.su/~megera/wiki/Tsearch2">tsearch2 wiki</a> pages.
+
+
 </body>
 </html>
diff --git a/contrib/tsearch2/docs/tsearch2-ref.html b/contrib/tsearch2/docs/tsearch2-ref.html
index 85401e83e7..7edcc55a9b 100644
--- a/contrib/tsearch2/docs/tsearch2-ref.html
+++ b/contrib/tsearch2/docs/tsearch2-ref.html
@@ -1,53 +1,74 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
-<link type="text/css" rel="stylesheet" href="tsearch2-ref_files/tsearch.txt"><title>tsearch2 reference</title></head>
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+<html><head>
+
+<title>tsearch2 reference</title></head>
 <body>
 <h1 align="center">The tsearch2 Reference</h1>
 <p align="center">
 Brandon Craig Rhodes<br>30 June 2003 (edited by Oleg Bartunov, 2 Aug 2003).
-</p><p>
+<br>Massive update for the 8.2 release by Oleg Bartunov, October 2006
+</p>
+<p>
 This Reference documents the user types and functions
 of the tsearch2 module for PostgreSQL.
 An introduction to the module is provided
-by the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html">tsearch2 Guide</a>,
+by the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html">tsearch2 Guide</a>,
 a companion document to this one.
-You can retrieve a beta copy of the tsearch2 module from the
-<a href="http://www.sai.msu.su/%7Emegera/postgres/gist/">GiST for PostgreSQL</a>
-page -- look under the section entitled <i>Development History</i>
-for the current version.
+</p>
 
-</p><h2><a name="vq">Vectors and Queries</a></h2>
+<h2>Table of Contents</h2>
+<blockquote>
+<a href="#vq">Vectors and Queries</a><br>
+<a href="#vqo">Vector Operations</a><br>
+<a href="#qo">Query Operations</a><br>
+<a href="#fts">Full Text Search Operator</a><br>
+<a href="#configurations">Configurations</a><br>
+<a href="#testing">Testing</a><br>
+<a href="#parsers">Parsers</a><br>
+<a href="#dictionaries">Dictionaries</a><br>
+<a href="#ranking">Ranking</a><br>
+<a href="#headlines">Headlines</a><br>
+<a href="#indexes">Indexes</a><br>
+<a href="#tz">Thesaurus dictionary</a><br>
+</blockquote>
 
-<a name="vq">Vectors and queries both store lexemes,
+
+
+
+<h2><a name="vq">Vectors and Queries</a></h2>
+
+Vectors and queries both store lexemes,
 but for different purposes.
 A <tt>tsvector</tt> stores the lexemes
 of the words that are parsed out of a document,
 and can also remember the position of each word.
 A <tt>tsquery</tt> specifies a boolean condition among lexemes.
 
-</a><p>
-<a name="vq">Any of the following functions with a <tt><i>configuration</i></tt> argument
+<p>
+Any of the following functions with a <tt><i>configuration</i></tt> argument
 can use either an integer <tt>id</tt> or textual <tt>ts_name</tt>
 to select a configuration;
 if the option is omitted, then the current configuration is used.
 For more information on the current configuration,
 read the next section on Configurations.
+</p> -</a></p><h3><a name="vq">Vector Operations</a></h3> +<h3><a name="vqo">Vector Operations</a></h3> <dl><dt> -<a name="vq"> <tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em> - <i>document</i> TEXT) RETURNS tsvector</tt> -</a></dt><dd> -<a name="vq"> Parses a document into tokens, +<tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em> + <i>document</i> TEXT) RETURNS TSVECTOR</tt> +</dt><dd> + Parses a document into tokens, reduces the tokens to lexemes, and returns a <tt>tsvector</tt> which lists the lexemes together with their positions in the document. For the best description of this process, - see the section on </a><a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#ps">Parsing and Stemming</a> + see the section on <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#ps">Parsing and Stemming</a> in the accompanying tsearch2 Guide. </dd><dt> - <tt>strip(<i>vector</i> tsvector) RETURNS tsvector</tt> + <tt>strip(<i>vector</i> TSVECTOR) RETURNS TSVECTOR</tt> </dt><dd> Return a vector which lists the same lexemes as the given <tt><i>vector</i></tt>, @@ -56,10 +77,10 @@ read the next section on Configurations. While the returned vector is thus useless for relevance ranking, it will usually be much smaller. </dd><dt> - <tt>setweight(<i>vector</i> tsvector, <i>letter</i>) RETURNS tsvector</tt> + <tt>setweight(<i>vector</i> TSVECTOR, <i>letter</i>) RETURNS TSVECTOR</tt> </dt><dd> This function returns a copy of the input vector - in which every location has been labelled + in which every location has been labeled with either the <tt><i>letter</i></tt> <tt>'A'</tt>, <tt>'B'</tt>, or <tt>'C'</tt>, or the default label <tt>'D'</tt> @@ -68,11 +89,11 @@ read the next section on Configurations. These labels are retained when vectors are concatenated, allowing words from different parts of a document to be weighted differently by ranking functions. -</dd><dt> - <tt><i>vector1</i> || <i>vector2</i></tt> -</dt><dt class="br"> - <tt>concat(<i>vector1</i> tsvector, <i>vector2</i> tsvector) - RETURNS tsvector</tt> +</dd> +<dt> + <tt><i>vector1</i> || <i>vector2</i></tt><BR> + <tt>concat(<i>vector1</i> TSVECTOR, <i>vector2</i> TSVECTOR) + RETURNS TSVECTOR</tt> </dt><dd> Returns a vector which combines the lexemes and position information in the two vectors given as arguments. @@ -95,27 +116,81 @@ read the next section on Configurations. to the <tt>rank()</tt> function that assigns different weights to positions with different labels. </dd><dt> - <tt>tsvector_size(<i>vector</i> tsvector) RETURNS INT4</tt> + <tt>length(<i>vector</i> TSVECTOR) RETURNS INT4</tt> </dt><dd> Returns the number of lexemes stored in the vector. </dd><dt> - <tt><i>text</i>::tsvector RETURNS tsvector</tt> + <tt><i>text</i>::TSVECTOR RETURNS TSVECTOR</tt> </dt><dd> Directly casting text to a <tt>tsvector</tt> allows you to directly inject lexemes into a vector, with whatever positions and position weights you choose to specify. The <tt><i>text</i></tt> should be formatted like the vector would be printed by the output of a <tt>SELECT</tt>. - See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a> + See the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a> section in the Guide for details. -</dd></dl> +</dd><dt> + <tt>tsearch2(<i>vector_column_name</i>[, (<i>my_filter_name</i> | <i>text_column_name1</i>) [...] 
], <i>text_column_nameN</i>)</tt>
+	</dt><dd>
+The <tt>tsearch2()</tt> trigger is used to automatically update <i>vector_column_name</i>;
+<i>my_filter_name</i> is the name of a function used to preprocess <i>text_column_name</i>.
+There can be many functions and text columns specified in a <tt>tsearch2()</tt> trigger.
+The following rule applies: a function is applied to all subsequent text columns until
+the next function name occurs.
+For example, the function <tt>dropatsymbol</tt> replaces all occurrences of the <tt>@</tt>
+sign with a space.
+<pre>
+CREATE FUNCTION dropatsymbol(text) RETURNS text
+AS 'select replace($1, ''@'', '' '');'
+LANGUAGE SQL;
+
+CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT
+ON tblMessages FOR EACH ROW EXECUTE PROCEDURE
+tsearch2(tsvector_column,dropatsymbol, strMessage);
+</pre>
+</dd>

-<h3>Query Operations</h3>

-<dl><dt>
-	<tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
-	<i>querytext</i> text) RETURNS tsvector</tt>
+<dt>
+<tt>stat(<i>sqlquery</i> text [, <i>weight</i> text]) RETURNS SETOF statinfo</tt>
 </dt><dd>
+Here <tt>statinfo</tt> is a type, defined as
+<tt>
+CREATE TYPE statinfo as (<i>word</i> text, <i>ndoc</i> int4, <i>nentry</i> int4)
+</tt>,
+and <i>sqlquery</i> is a query which returns a <tt>tsvector</tt> column.
+<P>
+This returns statistics (the number of documents <i>ndoc</i> and the total number of entries
+<i>nentry</i> of each <i>word</i> in the collection) for the <tt>tsvector</tt> column returned
+by <i>sqlquery</i>.
+It is useful for checking how good your configuration is and for finding stop-word candidates.
+For example, to find the top 10 most frequent words:
+<pre>
+=# select * from stat('select vector from apod') order by ndoc desc, nentry desc,word limit 10;
+</pre>
+Optionally, one can specify <i>weight</i> to obtain statistics about words with a specific weight.
+<pre>
+=# select * from stat('select vector from apod','a') order by ndoc desc, nentry desc,word limit 10;
+</pre>
+
+</dd>
+<dt>
+<tt>TSVECTOR < TSVECTOR</tt><BR>
+<tt>TSVECTOR <= TSVECTOR</tt><BR>
+<tt>TSVECTOR = TSVECTOR</tt><BR>
+<tt>TSVECTOR >= TSVECTOR</tt><BR>
+<tt>TSVECTOR > TSVECTOR</tt>
+</dt><dd>
+All btree operations are defined for the <tt>tsvector</tt> type; <tt>tsvector</tt> values are
+compared with each other in lexicographical order.
+</dd>
+</dl>
+
+<h3><a name="qo">Query Operations</a></h3>
+
+<dl>
+<dt>
+	<tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
+	<i>querytext</i> text) RETURNS TSQUERY</tt>
+</dt>
+<dd>
 	Parses a query,
 	which should be single words separated
 	by the boolean operators
 	"<tt>&</tt>" and,
 	"<tt>|</tt>" or,
 	and "<tt>!</tt>" not,
 	which can be grouped using parenthesis.
 	Each word is reduced to a lexeme using the current
-	or specified configuration.
+	or specified configuration.
+	A weight class can be assigned to each lexeme entry
+	to restrict the search region
+	(see <tt>setweight</tt> for an explanation), for example
+	"<tt>fat:a & rats</tt>".
+</dd>
+<dt>
+	<tt>plainto_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
+	<i>querytext</i> text) RETURNS TSQUERY</tt>
+</dt>
+<dd>
+Transforms unformatted text to a tsquery. It is the same as to_tsquery,
+but assumes an implicit "<tt>&</tt>" boolean operator between words and doesn't
+recognize weight classes.
+</dd><dt>

-	<tt>querytree(<i>query</i> tsquery) RETURNS text</tt>
+	<tt>querytree(<i>query</i> TSQUERY) RETURNS text</tt>
 </dt><dd>
-	This might return a textual representation of the given query.
+This returns the portion of the query that is actually used when searching a GiST index.
 </dd><dt>
-	<tt><i>text</i>::tsquery RETURNS tsquery</tt>
+	<tt><i>text</i>::TSQUERY RETURNS TSQUERY</tt>
 </dt><dd>
 	Directly casting text to a <tt>tsquery</tt>
 	allows you to directly inject lexemes into a query,
 	with whatever positions and position weights you choose to specify.
 	The <tt><i>text</i></tt> should be formatted
 	like the query would be printed by the output of a <tt>SELECT</tt>.
 	See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
 	section in the Guide for details.
-</dd></dl>
+</dd>
+<dt>
+	<tt>numnode(<i>query</i> TSQUERY) RETURNS INTEGER</tt>
+</dt><dd>
+This returns the number of nodes in the query tree.
+</dd><dt>
+	<tt>TSQUERY && TSQUERY RETURNS TSQUERY</tt>
+</dt><dd>
+Returns the AND of the two TSQUERY arguments.
+</dd><dt>
+	<tt>TSQUERY || TSQUERY RETURNS TSQUERY</tt>
+</dt><dd>
+Returns the OR of the two TSQUERY arguments.
+</dd><dt>
+	<tt>!! TSQUERY RETURNS TSQUERY</tt>
+</dt><dd>
+Returns the negation of the TSQUERY argument.
+</dd>
+<dt>
+<tt>TSQUERY < TSQUERY</tt><BR>
+<tt>TSQUERY <= TSQUERY</tt><BR>
+<tt>TSQUERY = TSQUERY</tt><BR>
+<tt>TSQUERY >= TSQUERY</tt><BR>
+<tt>TSQUERY > TSQUERY</tt>
+</dt><dd>
+All btree operations are defined for the <tt>tsquery</tt> type; <tt>tsquery</tt> values are
+compared with each other in lexicographical order.
+</dd>
+</dl>
+
+<h3>Query rewriting</h3>
+Query rewriting is a set of functions and operators for the tsquery type.
+It allows search to be controlled at query time without reindexing (in contrast to the thesaurus
+dictionary), for example, to expand the search using synonyms (new york, big apple, nyc, gotham).
+<P>
+The <tt><b>rewrite()</b></tt> function changes the original <i>query</i> by replacing <i>target</i>
+with <i>sample</i>. There are three ways to use the <tt>rewrite()</tt> function. Notice that the
+arguments of <tt>rewrite()</tt> can be column names of type <tt>tsquery</tt>.
+<pre>
+create table rw (q TSQUERY, t TSQUERY, s TSQUERY);
+insert into rw values('a & b','a', 'c');
+</pre>
+<dl>
+<dt> <tt>rewrite (<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY) RETURNS TSQUERY</tt>
+</dt>
+<dd>
+<pre>
+=# select rewrite('a & b'::TSQUERY, 'a'::TSQUERY, 'c'::TSQUERY);
+  rewrite
+  -----------
+  'c' & 'b'
+</pre>
+</dd>
+<dt> <tt>rewrite (ARRAY[<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY]) RETURNS TSQUERY</tt>
+</dt>
+<dd>
+<pre>
+=# select rewrite(ARRAY['a & b'::TSQUERY, t,s]) from rw;
+  rewrite
+  -----------
+  'c' & 'b'
+</pre>
+</dd>
+<dt> <tt>rewrite (<i>query</i> TSQUERY,'select <i>target</i> ,<i>sample</i> from test'::text) RETURNS TSQUERY</tt>
+</dt>
+<dd>
+<pre>
+=# select rewrite('a & b'::TSQUERY, 'select t,s from rw'::text);
+  rewrite
+  -----------
+  'c' & 'b'
+</pre>
+</dd>
+</dl>
+Two operators are defined for the <tt>tsquery</tt> type:
+<dl>
+<dt><tt>TSQUERY @ TSQUERY</tt></dt>
+<dd>
+  Returns <tt>TRUE</tt> if the right argument might be contained in the left argument.
+ </dd>
+ <dt><tt>TSQUERY ~ TSQUERY</tt></dt>
+ <dd>
+ Returns <tt>TRUE</tt> if the left argument might be contained in the right argument.
+ </dd>
+</dl>
+To speed up these operators, one can use a GiST index with the <tt>gist_tp_tsquery_ops</tt> opclass.
+<pre>
+create index qq on test_tsquery using gist (keyword gist_tp_tsquery_ops);
+</pre>
+
+<h2><a name="fts">Full Text Search operator</a></h2>
+
+<dl><dt>
+<tt>TSQUERY @@ TSVECTOR</tt><br>
+<tt>TSVECTOR @@ TSQUERY</tt>
+</dt>
+<dd>
+Returns <tt>TRUE</tt> if <tt>TSQUERY</tt> is contained in <tt>TSVECTOR</tt> and
+<tt>FALSE</tt> otherwise.
+<pre>
+=# select 'cat & rat':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
+ ?column?
+ ----------
+ t
+=# select 'fat & cow':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
+ ?column?
+ ----------
+ f
+</pre>
+</dd>
+</dl>
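+<P>
+A sketch of typical usage (the table and column names here are hypothetical:
+a table <tt>apod</tt> with a text column <tt>title</tt> and a <tt>tsvector</tt> column <tt>fts</tt>);
+the operator is normally used in a <tt>WHERE</tt> clause:
+<pre>
+=# select title from apod where fts @@ to_tsquery('supernova & star');
+</pre>
+Such queries can take advantage of a GiST or GIN index on the <tt>fts</tt> column
+(see <a href="#indexes">Indexes</a>).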
 
 <h2><a name="configurations">Configurations</a></h2>
 
@@ -147,7 +345,7 @@ A configuration specifies all of the equipment necessary
 to transform a document into a <tt>tsvector</tt>:
 the parser that breaks its text into tokens,
 and the dictionaries which then transform each token into a lexeme.
-Every call to <tt>to_tsvector()</tt> (described above)
+Every call to <tt>to_tsvector()</tt> or <tt>to_tsquery()</tt> (described above)
 uses a configuration to perform its processing.
 Three configurations come with tsearch2:
 
@@ -157,7 +355,10 @@ Three configurations come with tsearch2:
     and the <i>simple</i> dictionary for all others.
 </li><li><b>default_russian</b> -- Indexes words and numbers,
     using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
-    and the <i>ru_stem</i> Russian Snowball dictionary for all others.
+    and the <i>ru_stem</i> Russian Snowball dictionary for all others. It's the default
+    for the <tt>ru_RU.KOI8-R</tt> locale.
+</li><li><b>utf8_russian</b> -- the same as <b>default_russian</b>, but
+for the <tt>ru_RU.UTF-8</tt> locale.
 </li><li><b>simple</b> -- Processes both words and numbers
     with the <i>simple</i> dictionary,
     which neither discards any stop words nor alters them.
@@ -239,7 +440,8 @@ Here:
 </li><li>description - human readable name of tok_type
 </li><li>token - parser's token
 </li><li>dict_name - dictionary used for the token
-</li><li>tsvector - final result</li></ul>
+</li><li>tsvector - final result</li>
+</ul>
 
 <h2><a name="parsers">Parsers</a></h2>
 
@@ -300,20 +502,40 @@ the current parser is used when this argument is omitted.
 
 <h2><a name="dictionaries">Dictionaries</a></h2>
 
-Dictionaries take textual tokens as input,
-usually those produced by a parser,
-and return lexemes which are usually some reduced form of the token.
+A dictionary is a program which accepts a lexeme (usually one produced by a parser)
+on input and returns:
+<ul>
+<li>an array of lexemes, if the input lexeme is known to the dictionary
+<li>a void array, if the dictionary knows the lexeme but it is a stop word
+<li>NULL, if the dictionary doesn't recognize the input lexeme
+</ul>
+Usually, dictionaries are used for normalization of words (ispell and stemmer dictionaries),
+but see, for example, the <tt>intdict</tt> dictionary (available from the
+<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">Tsearch2</a> home page),
+which controls the indexing of integers.
+
+<P>
 Among the dictionaries which come installed with tsearch2 are:
 
 <ul>
 <li><b>simple</b> simply folds uppercase letters to lowercase
     before returning the word.
-</li><li><b>en_stem</b> runs an English Snowball stemmer on each word
+</li>
+<li><b>ispell_template</b> - template for ispell dictionaries.
+</li>
+<li><b>en_stem</b> runs an English Snowball stemmer on each word
    that attempts to reduce the various forms of a verb or noun
    to a single recognizable form.
-</li><li><b>ru_stem</b> runs a Russian Snowball stemmer on each word.
-</li></ul>
+</li><li><b>ru_stem_koi8</b>, <b>ru_stem_utf8</b> run a Russian Snowball stemmer on each word.
+</li>
+<li><b>synonym</b> - simple lexeme-to-lexeme replacement
+</li>
+<li><b>thesaurus_template</b> - template for the <a href="#tz">thesaurus dictionary</a>, which does
It's +phrase-to-phrase replacement +</li> +</ul> +<P> Each dictionary is defined by an entry in the <tt>pg_ts_dict</tt> table: <pre>CREATE TABLE pg_ts_dict ( @@ -332,6 +554,12 @@ it specifies a file from which stop words should be read. The <tt>dict_comment</tt> is a human-readable description of the dictionary. The other fields are internal function identifiers useful only to developers trying to implement their own dictionaries. + +<blockquote> +<b>WARNING:</b> Data files, used by dictionaries, should be in <tt>server_encoding</tt> to +avoid possible problems ! +</blockquote> + <p> The argument named <tt><i>dictionary</i></tt> in each of the following functions @@ -355,6 +583,27 @@ if omitted then the current dictionary is used. from which an inflected form could arise. </dd></dl> +<h3>Using dictionaries template</h3> +Templates used to define new dictionaries, for example, +<pre> +INSERT INTO pg_ts_dict + (SELECT 'en_ispell', dict_init, + 'DictFile="/usr/local/share/dicts/ispell/english.dict",' + 'AffFile="/usr/local/share/dicts/ispell/english.aff",' + 'StopFile="/usr/local/share/dicts/english.stop"', + dict_lexize + FROM pg_ts_dict + WHERE dict_name = 'ispell_template'); +</pre> + +<h3>Working with stop words</h3> +Ispell and snowball stemmers treat stop words differently: +<ul> +<li>ispell - normalize word and then lookups normalized form in stop-word file +<li>snowball stemmer - first, it lookups word in stop-word file and then does it job. +The reason - to minimize possible 'noise'. +</ul> + <h2><a name="ranking">Ranking</a></h2> Ranking attempts to measure how relevant documents are to particular queries @@ -364,26 +613,18 @@ Note that this information is only available in unstripped vectors -- ranking functions will only return a useful result for a <tt>tsvector</tt> which still has position information! <p> -Both of these ranking functions -take an integer <i>normalization</i> option -that specifies whether a document's length should impact its rank. -This is often desirable, -since a hundred-word document with five instances of a search word -is probably more relevant than a thousand-word document with five instances. -The option can have the values: - -</p><ul> -<li><tt>0</tt> (the default) ignores document length. -</li><li><tt>1</tt> divides the rank by the logarithm of the length. -</li><li><tt>2</tt> divides the rank by the length itself. -</li></ul> +Notice, that ranking functions supplied are just an examples and +doesn't belong to the tsearch2 core, you can +write your very own ranking function and/or combine additional +factors to fit your specific interest. +</p> The two ranking functions currently available are: <dl><dt> <tt>CREATE FUNCTION rank(<br> <em>[</em> <i>weights</i> float4[], <em>]</em> - <i>vector</i> tsvector, <i>query</i> tsquery, + <i>vector</i> TSVECTOR, <i>query</i> TSQUERY, <em>[</em> <i>normalization</i> int4 <em>]</em><br> ) RETURNS float4</tt> </dt><dd> @@ -399,8 +640,8 @@ The two ranking functions currently available are: and make them more or less important than words in the document body. 
 </dd><dt>
 	<tt>CREATE FUNCTION rank_cd(<br>
-	<em>[</em> <i>K</i> int4, <em>]</em>
-	<i>vector</i> tsvector, <i>query</i> tsquery,
+	<em>[</em> <i>weights</i> float4[], <em>]</em>
+	<i>vector</i> TSVECTOR, <i>query</i> TSQUERY,
 	<em>[</em> <i>normalization</i> int4 <em>]</em><br>
 	) RETURNS float4</tt>
 </dt><dd>
 	as described in Clarke, Cormack, and Tudhope's
 	"<a href="http://citeseer.nj.nec.com/clarke00relevance.html">Relevance Ranking for One to Three Term Queries</a>"
 	in the 1999 <i>Information Processing and Management</i>.
-	The value <i>K</i> is one of the values from their formula,
-	and defaults to <i>K</i>=4.
-	The examples in their paper <i>K</i>=16;
-	we can roughly describe the term
-	as stating how far apart two search terms can fall
-	before the formula begins penalizing them for lack of proximity.
-</dd></dl>
+</dd>
+<dt>
+	<tt>CREATE FUNCTION get_covers(vector TSVECTOR, query TSQUERY) RETURNS text</tt>
+	</dt>
+	<dd>
+	Returns <tt>extents</tt>, which are the shortest non-nested sequences of words that satisfy the query.
+	Extents (covers) are used in the <tt>rank_cd</tt> algorithm for fast calculation of proximity ranking.
+	In the example below there are two extents - <tt><b>{1</b>...<b>}1</b> and <b>{2</b> ...<b>}2</b></tt>.
+	<pre>
+=# select get_covers('1:1,2,10 2:4'::tsvector,'1& 2');
+get_covers
+----------------------
+1 {1 1 {2 2 }1 1 }2
+</pre>
+	</dd>
+
+</dl>
+
+<p>
+Both of these ranking functions (<tt>rank()</tt> and <tt>rank_cd()</tt>)
+take an integer <i>normalization</i> option
+that specifies whether a document's length should impact its rank.
+This is often desirable,
+since a hundred-word document with five instances of a search word
+is probably more relevant than a thousand-word document with five instances.
+The option can take the following values, which can be combined using "|" (e.g. 2|4) to
+take several factors into account:
+
+</p>
+<ul>
+<li><tt>0</tt> (the default) ignores document length.</li>
+<li><tt>1</tt> divides the rank by 1 + the logarithm of the document length.</li>
+<li><tt>2</tt> divides the rank by the length itself.</li>
+<li><tt>4</tt> divides the rank by the mean harmonic distance between extents.</li>
+<li><tt>8</tt> divides the rank by the number of unique words in the document.</li>
+<li><tt>16</tt> divides the rank by 1 + the logarithm of the number of unique words in the document.
+</li>
+</ul>
 
 <h2><a name="headlines">Headlines</a></h2>
 
 <dl><dt>
 	<tt>CREATE FUNCTION headline(<br>
 	<em>[</em> <i>id</i> int4, <em>|</em> <i>ts_name</i> text, <em>]</em>
-	<i>document</i> text, <i>query</i> tsquery,
+	<i>document</i> text, <i>query</i> TSQUERY,
 	<em>[</em> <i>options</i> text <em>]</em><br>
 	) RETURNS text</tt>
 </dt><dd>
@@ -448,10 +720,123 @@
 	with a word which has this many characters or less.
 	The default value of <tt>3</tt>
 	should eliminate most English conjunctions and articles.
+	</li><li><tt>HighlightAll</tt> --
+	a boolean flag; if TRUE, the whole document will be highlighted.
 	</li></ul>
 
 	Any unspecified options receive these defaults:
 
-	<pre>StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3
+	<pre>StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
 	</pre>
 </dd></dl>
+
+<h2><a name="indexes">Indexes</a></h2>
+Tsearch2 supports indexed access to <tt>tsvector</tt> columns in order to further speed up FTS.
+Notice that indexes are not mandatory for FTS!
+<ul>
+<li> RD-Tree (Russian Doll Tree, matryoshka), based on GiST (Generalized Search Tree)
+<pre>
+  =# create index fts_idx on apod using gist(fts);
+</pre>
+<li>GIN - Generalized Inverted Index
+<pre>
+  =# create index fts_idx on apod using gin(fts);
+</pre>
+</ul>
+A <b>GiST</b> index is very good for online updates, but is not as scalable as a <b>GIN</b> index,
+which, in turn, isn't as good for updates. Both indexes support concurrency and recovery.
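+<P>
+For example, a ranked search over an indexed column might look like this (a sketch only,
+reusing the hypothetical <tt>apod</tt> table from the index examples above, with a text
+column <tt>title</tt> and a <tt>tsvector</tt> column <tt>fts</tt>):
+<pre>
+=# select title, rank_cd(fts, q) as rank
+   from apod, to_tsquery('supernova & star') q
+   where fts @@ q
+   order by rank desc limit 10;
+</pre>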
+
+<h2><a name="tz">Thesaurus dictionary</a></h2>
+
+<P>
+A thesaurus is a collection of words with information about the relationships of words and phrases,
+i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.</p>
+<p>Basically, a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally,
+preserves them for indexing. The thesaurus is used during indexing, so any change in the thesaurus requires reindexing.
+Tsearch2's <tt>thesaurus</tt> dictionary (TZ) is an extension of the <tt>synonym</tt> dictionary
+with <b>phrase</b> support. A thesaurus is a plain file of the following format:
+<pre>
+# this is a comment
+sample word(s) : indexed word(s)
+...............................
+</pre>
+<ul>
+<li>The <strong>colon</strong> (:) symbol is used as a delimiter.</li>
+<li>Use an asterisk (<b>*</b>) at the beginning of an <tt>indexed word</tt> to skip the subdictionary.
+It is still required that the <tt>sample words</tt> are known.</li>
+<li>The thesaurus dictionary looks for the longest match.</li></ul>
+<P>
+TZ uses a <strong>subdictionary</strong> (which should be defined in the tsearch2 configuration)
+to normalize the thesaurus text. It's possible to define only <strong>one dictionary</strong>.
+Notice that the subdictionary produces an error if it couldn't recognize a word.
+In that case, you should remove the definition line with this word or teach the subdictionary to know it.
+</p>
+<p>Stop-words recognized by the subdictionary are replaced by a 'stop-word placeholder', i.e.,
+only their position is important.
+To break possible ties the thesaurus applies the last matching definition. For example, consider
+thesaurus (with a simple subdictionary) rules with the pattern 'swsw'
+('s' designates a stop-word and 'w' - a known word):</p>
+<pre>
+a one the two : swsw
+the one a two : swsw2
+</pre>
+<p>Words 'a' and 'the' are stop-words defined in the configuration of the subdictionary.
+The thesaurus considers the texts 'the one the two' and 'that one then two' as equal and will use the definition
+'swsw2'.</p>
+<p>Like a normal dictionary, TZ should be assigned to specific lexeme types.
+Since TZ has the capability to recognize phrases, it must remember its state and interact with the parser.
+TZ uses these assignments to check whether it should handle the next word or stop accumulation.
+Whoever compiles the TZ file should take care to set up a proper configuration to avoid confusion.
+For example, if TZ is assigned to handle only the <tt>lword</tt> lexeme type, then a TZ definition like
+' one 1:11' will not work, since the lexeme type <tt>digit</tt> isn't assigned to the TZ.</p>
+
+<h3>Configuration</h3>
+
+<p>tsearch2 comes with a thesaurus template, which can be used to define a new dictionary: </p>
+<pre class="real">INSERT INTO pg_ts_dict
+               (SELECT 'tz_simple', dict_init,
+                       'DictFile="/path/to/tz_simple.txt",'
+                       'Dictionary="en_stem"',
+                       dict_lexize
+                FROM pg_ts_dict
+                WHERE dict_name = 'thesaurus_template');
+
+</pre>
+<p>Here: </p>
+<ul>
+<li><tt>tz_simple</tt> - the dictionary name</li>
+<li><tt>DictFile="/path/to/tz_simple.txt"</tt> - the location of the thesaurus file</li>
+<li><tt>Dictionary="en_stem"</tt> defines the dictionary (the English Snowball stemmer) to use for
+thesaurus normalization. Notice that the <em>en_stem</em> dictionary has its own configuration
+(stop words, for example).</li>
+</ul>
+<p>Now it's possible to use <tt>tz_simple</tt> in pg_ts_cfgmap, for example: </p>
+<pre>
+update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}' where ts_name = 'default_russian' and
+tok_alias in ('lhword', 'lword', 'lpart_hword');
+</pre>
+<h3>Examples</h3>
+<p>tz_simple: </p>
+<pre>
+one : 1
+two : 2
+one two : 12
+the one : 1
+one 1 : 11
+</pre>
+<p>To see how the thesaurus works, one can use the <tt>to_tsvector</tt>, <tt>to_tsquery</tt> or
+<tt>plainto_tsquery</tt> functions: </p>
+<pre class="real">=# select plainto_tsquery('default_russian',' one day is oneday');
+    plainto_tsquery
+------------------------
+ '1' & 'day' & 'oneday'
+
+=# select plainto_tsquery('default_russian','one two day is oneday');
+     plainto_tsquery
+-------------------------
+ '12' & 'day' & 'oneday'
+
+=# select plainto_tsquery('default_russian','the one');
+NOTICE: Thesaurus: word 'the' is recognized as stop-word, assign any stop-word (rule 3)
+ plainto_tsquery
+-----------------
+ '1'
+</pre>
+
+Additional information about the thesaurus dictionary is available from the
+<a href="http://www.sai.msu.su/~megera/wiki/Thesaurus_dictionary">Wiki</a> page.
 </body></html>