postgres/contrib/tsearch
Bruce Momjian cdfe4bb64f Pleas apply it for 7.2.1 and current CVS.
Patch fixes using lc.lang instead of lc.lc_ctype.

Teodor Sigaev
2002-03-11 16:54:27 +00:00
..
data txtidx datatype for full text indexing with GiST. 2001-10-12 23:19:09 +00:00
dict Please, apply attached patch for contrib/tsearch to 7.2.1 and current 2002-03-05 06:10:28 +00:00
expected Updated regression test for tsearch, from Teodor Sigaev. 2001-10-15 17:41:33 +00:00
makedict txtidx datatype for full text indexing with GiST. 2001-10-12 23:19:09 +00:00
sql Updated regression test for tsearch, from Teodor Sigaev. 2001-10-15 17:41:33 +00:00
crc32.c pgindent run on all C files. Java run to follow. initdb/regression 2001-10-25 05:50:21 +00:00
crc32.h Another pgindent run. Fixes enum indenting, and improves #endif 2001-10-28 06:26:15 +00:00
deflex.h Another pgindent run. Fixes enum indenting, and improves #endif 2001-10-28 06:26:15 +00:00
dict.h txtidx datatype for full text indexing with GiST. 2001-10-12 23:19:09 +00:00
gistidx.c Repair some problems in GIST-index contrib modules. Patch from 2002-02-07 22:11:43 +00:00
gistidx.h New pgindent run with fixes suggested by Tom. Patch manually reviewed, 2001-11-05 17:46:40 +00:00
Makefile txtidx datatype for full text indexing with GiST. 2001-10-12 23:19:09 +00:00
morph.c Pleas apply it for 7.2.1 and current CVS. 2002-03-11 16:54:27 +00:00
morph.h Another pgindent run. Fixes enum indenting, and improves #endif 2001-10-28 06:26:15 +00:00
parser.h Another pgindent run. Fixes enum indenting, and improves #endif 2001-10-28 06:26:15 +00:00
parser.l txtidx datatype for full text indexing with GiST. 2001-10-12 23:19:09 +00:00
query.c Change made to elog: 2002-03-06 06:10:59 +00:00
query.h New pgindent run with fixes suggested by Tom. Patch manually reviewed, 2001-11-05 17:46:40 +00:00
README.tsearch Repair some problems in GIST-index contrib modules. Patch from 2002-02-07 22:11:43 +00:00
rewrite.c New pgindent run with fixes suggested by Tom. Patch manually reviewed, 2001-11-05 17:46:40 +00:00
rewrite.h Another pgindent run. Fixes enum indenting, and improves #endif 2001-10-28 06:26:15 +00:00
tsearch.sql.in Repair some problems in GIST-index contrib modules. Patch from 2002-02-07 22:11:43 +00:00
txtidx.c Change made to elog: 2002-03-06 06:10:59 +00:00
txtidx.h New pgindent run with fixes suggested by Tom. Patch manually reviewed, 2001-11-05 17:46:40 +00:00

Tsearch contrib module contains implementation of new data type txtidx - 
a searchable data type (textual) with indexed access.

All work was done by Teodor Sigaev (teodor@stack.net) and Oleg Bartunov
(oleg@sai.msu.su).

IMPORTANT NOTICE:

This is a first step of our work on integration of OpenFTS
full text search engine (http://openfts.sourceforge.net/) into 
PostgreSQL. It's based on our recent development of GiST 
(Generalized Search Tree) for PostgreSQL 7.2 (see our GiST page
at http://www.sai.msu.su/~megera/postgres/gist/ for info about GiST)
and will works only for PostgreSQL version 7.2 and later.

We didn't try to implement a full-featured search engine with 
stable interfaces but rather experiment with various approaches.
There are many issues remains (most of them just not documented or
implemented) but we'd like to present a working prototype
of full text search engine fully integrated into PostgreSQL to 
collect user's feedback and recommendations. 

INSTALLATION:

   cd contrib/tsearch
   gmake
   gmake install

REGRESSION TEST:

   gmake installcheck

USAGE:

 psql DATABASE < tsearch.sql (from contrib/tsearch)
  

INTRODUCTION:

This module provides an implementation of a new data type 'txtidx' which is
a string of a space separated "words". "Words" with spaces 
should be enclosed in apostrophes and apostrophes inside a "word" should be 
escaped by backslash.

This is quite different from OpenFTS approach which uses array
of integers (ID of lexems) and requires storing of lexem-id pairs in database. 
One of the prominent benefit of this new approach is that  it's possible now
to perform full text search in a 'natural' way. 

Some examples:

  create table foo (
	titleidx  txtidx
  );

 2 regular words:
   insert into foo values ( 'the are' );
 Word with space:
   insert into foo values ( 'the\\ are' );
 Words with apostrophe:
   insert into foo values ( 'value\'s this' );
 Complex word with apostrophe:
   insert into foo values ( 'value\'s this we \'PostgreSQL site\'' );

   select * from foo where titleidx @@ '\'PostgreSQL site\' | this';
   select * from foo where titleidx @@ 'value\'s | this';
   select * from foo where titleidx @@ '(the|this)&!we';

test=# select 'two words'::txtidx;
    txtidx     
---------------
 'two' 'words'
(1 row)

test=# select 'single\\ word'::txtidx;
    txtidx     
---------------
 'single word'
(1 row)


FULL TEXT SEARCH:

The basic idea of this data type is to use it for full text search inside
database. If you have a 'text' column title and corresponding column 
titleidx of type 'txtidx', which contains the same information from 
text column, then search on title could be replaced by
searching on titleidx which would be fast because of indexed access.

As a real life example consider database with table 'titles' containing
titles of mailing list postings in column 'title':

  create table titles (
        title  text
 
 );

Suppose, you already have a lot of titles and want to do full text search
on them.

First, you need to install contrib/tsearch module (see INSTALLATION and USAGE).
Add column 'titleidx' of type txtidx, containing space separated words from 
title. It's possible to use function txt2txtidx(title) to fill 'titleidx' 
column (see notice 1):
  
  -- add titleidx column of type txtidx 
  alter table titles add titleidx txtidx;
  update titles set titleidx=txt2txtidx(title);

Create index on titleidx:
  create index t_idx on titles using gist(titleidx);

and now you can search all titles with words 'patch' and 'gist':
  select title from titles where titleidx ## 'patch&gist';

Here, ## is a new operation defined for type 'txtidx' which could use index 
(if exists) built on titleidx. This operator uses morphology to
expand query, i.e. 
  ## 'patches&gist' will find titles with 'patch' and 'gist' also.
If you want to provide query as is, use operator @@ instead:
  select title from titles where titleidx @@ 'patch&gist';
but remember, that function txt2txtidx does uses morphology, so you need
to fill column 'titleidx' using some another way. We hope in future releases
provide more consistent and convenient interfaces.

Query could contains boolean operators &,|,!,() with their usual meaning,
for example: 'patch&gist&!cvs', 'patch|cvs'.
Each operation ( ##, @@ ) requires appropriate query type - 
 txtidx ## mquery_txt
 txtidx @@ query_txt

To see what query actually will be used :

test=# select 'patches&gist'::mquery_txt;
    mquery_txt    
------------------
 'patch' & 'gist'
(1 row)

test=# select 'patches&gist'::query_txt;
     query_txt      
--------------------
 'patches' & 'gist'
(1 row)

Notice the difference !

You could use trigger to be sure column 'titleidx' is consistent
with any changes in column 'title':

  create trigger txtidxupdate before update or insert on titles
  for each row execute procedure tsearch(titleidx, title);

This trigger uses the same parser, dictionaries as function
txt2txtidx (see notice 1).
Current syntax allows creating trigger for several columns
you want to be searchable:

  create trigger txtidxupdate before update or insert on titles
  for each row execute procedure tsearch(titleidx, title1, title2,... );

Use function txtidxsize(titleidx) to get the number of "words" in column
titleidx. To get total number of words in table titles:

test=# select sum(txtidxsize(titleidx)) from titles;
   sum   
---------
 1917182
(1 row)

NOTICES:

1.
function txt2txtidx and trigger use parser, dictionaries coming with 
this contrib module on default. Parser is mostly the same as in OpenFTS and 
dictionaries are simple stemmers (sort of Lovin's stemmer which uses a 
longest match algorithm.) for english and russian languages. There is a perl
script makedict/makedict.pl, which could be used to create specific
dictionaries from files with endings and stop-words. 
Example files for english and russian languages are available 
from http://www.sai.msu.su/~megera/postgres/gist/tsearch/.
Run script without parameters to see information about arguments and options. 

Example:
cd makedict
./makedict.pl -l LOCALNAME -e FILEENDINGS -s FILESTOPWORD \
              -o ../dict/YOURDICT.dct

Another options of makedict.pl:
-f do not execute tolower for any char
-a function of checking stopword will be work after lemmatize, 
   default is before

You need to edit dict.h to use your dictionary and, probably, 
morph.c to change mapdict array.

Don't forget to do 
  make clean; make; make install

2.
txtidx doesn't preserve words ordering (this is not critical for searching)
for performance reason, for example:

test=# select 'page two'::txtidx;
    txtidx    
--------------
 'two' 'page'
(1 row)

3. 
Indexed access provided by txtidx data type isn't always good
because of internal data structure we use (RD-Tree). Particularly,
queries like '!gist' will be  slower than just a sequential scan,
because for such queries RD-Tree doesn't provides selectivity on internal 
nodes and all checks should be processed at leaf nodes, i.t. scan of
full index. You may play with function query_tree to see how effective
will be index usage:

test=# select querytree( 'patch&gist'::query_txt );
    querytree     
------------------
 'patch' & 'gist'
(1 row)

This is an example of "good" query - index will effective for both words.

test=# select querytree( 'patch&!gist'::query_txt );
 querytree 
-----------
 'patch'
(1 row)

This means that index is effective only to search word 'patch' and resulted
rows will be checked against '!gist'.

test=# select querytree( 'patch|!gist'::query_txt );
 querytree 
-----------
 T
(1 row)

test=# select querytree( '!gist'::query_txt );
 querytree 
-----------
 T
(1 row)

These two queries will be processed by scanning of full index !
Very slow !

4.
Following selects produce the same result

  select title from titles where titleidx @@ 'patch&gist';
  select title from titles where titleidx @@ 'patch' and titleidx @@ 'gist';

but the former will be more effective, because of internal optimization
of query executor.


TODO:

Better configurability (as in OpenFTS)
User's interfaces to parser, dictionaries ...
Write documentation


BENCHMARKS:

We use test collection in our experiments which contains 377905
titles from various mailing lists stored in our mailware
project. 

All runs were performed on IBM ThinkPad T21 notebook with 
PIII 733 Mhz, 256 RAM, 20 Gb HDD, Linux 2.2.19, postgresql 7.2.dev
We didn't do extensive benchmarking and all
numbers provide for illustration. Actual performance
is strongly depends on many factors (query, collection, dictionaries 
and hardware).

Collection is available for download from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/ 
as mw_titles.gz (about 3Mb).

0. install contrib/tsearch module
1. createdb test
2. psql test < tsearch.sql (from contrib/tsearch)
3. zcat mw_titles.gz | psql test 
   (it will creates table, copy test data and creates index)

Database contains one table:

test=# \d titles
                Table "titles"
  Column  |          Type          | Modifiers 
----------+------------------------+-----------
 title    | character varying(256) | 
 titleidx | txtidx                 | 
Indexes: t_idx

Index was created as:
   create index t_idx on titles using gist(titleidx);
   (notice: this operation takes about 14 minutes on my notebook)

Typical select looks like:
   select title from titles where titleidx @@ 'patch&gist';

Total number of lexems in collection : 1917182

1. We trust index  - we consider index is exact and no
   checking against tuples is necessary.

   update pg_amop set amopreqcheck = false where amopclaid = 
   (select oid from pg_opclass where opcname = 'gist_txtidx_ops');

using gist indices
1: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.054s 0.00%
2: titleidx @@ 'patch&gist' 0.020u 0.000s 0m0.045s 44.82%
3: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.044s 0.00%
using gist indices (morph)
1: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.62%
2: titleidx ## 'patch&gist' 0.010u 0.010s 0m0.046s 43.47%
3: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
disable gist index
1: titleidx @@ 'patch&gist' 0.000u 0.010s 0m1.601s 0.62%
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m1.607s 0.00%
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m1.607s 0.62%
traditional like
1: title ~* 'gist' and title ~* 'patch'  0.010u 0.000s 0m9.206s 0.10%
2: title ~* 'gist' and title ~* 'patch'  0.000u 0.010s 0m9.205s 0.10%
3: title ~* 'gist' and title ~* 'patch'  0.010u 0.000s 0m9.208s 0.10%

2. Need to check results against tuples to avoid possible hash collision.

   update pg_amop set amopreqcheck = true where amopclaid = 
   (select oid from pg_opclass where opcname = 'gist_txtidx_ops');

using gist indices
1: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.052s 19.26%
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.045s 0.00%
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.045s 22.39%
using gist indices (morph)
1: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
2: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.75%
3: titleidx ## 'patch&gist' 0.020u 0.000s 0m0.047s 42.13%

There are no visible difference between these 2 cases but your
mileage may vary.