b57705673d
From Oleg Bartunov and Teodor Sigaev.
366 lines
12 KiB
Plaintext
366 lines
12 KiB
Plaintext
Tsearch contrib module contains implementation of new data type txtidx -
|
|
a searchable data type (textual) with indexed access.
|
|
|
|
All work was done by Teodor Sigaev (teodor@stack.net) and Oleg Bartunov
|
|
(oleg@sai.msu.su).
|
|
|
|
IMPORTANT NOTICE:
|
|
|
|
This is a first step of our work on integration of OpenFTS
|
|
full text search engine (http://openfts.sourceforge.net/) into
|
|
PostgreSQL. It's based on our recent development of GiST
|
|
(Generalized Search Tree) for PostgreSQL 7.2 (see our GiST page
|
|
at http://www.sai.msu.su/~megera/postgres/gist/ for info about GiST)
|
|
and will works only for PostgreSQL version 7.2 and later.
|
|
|
|
We didn't try to implement a full-featured search engine with
|
|
stable interfaces but rather experiment with various approaches.
|
|
There are many issues remains (most of them just not documented or
|
|
implemented) but we'd like to present a working prototype
|
|
of full text search engine fully integrated into PostgreSQL to
|
|
collect user's feedback and recommendations.
|
|
|
|
INSTALLATION:
|
|
|
|
cd contrib/tsearch
|
|
gmake
|
|
gmake install
|
|
|
|
REGRESSION TEST:
|
|
|
|
gmake installcheck
|
|
|
|
USAGE:
|
|
|
|
psql DATABASE < tsearch.sql (from contrib/tsearch)
|
|
|
|
|
|
INTRODUCTION:
|
|
|
|
This module provides an implementation of a new data type 'txtidx' which is
|
|
a string of a space separated "words". "Words" with spaces
|
|
should be enclosed in apostrophes and apostrophes inside a "word" should be
|
|
escaped by backslash.
|
|
|
|
This is quite different from OpenFTS approach which uses array
|
|
of integers (ID of lexems) and requires storing of lexem-id pairs in database.
|
|
One of the prominent benefit of this new approach is that it's possible now
|
|
to perform full text search in a 'natural' way.
|
|
|
|
Some examples:
|
|
|
|
create table foo (
|
|
titleidx txtidx
|
|
);
|
|
|
|
2 regular words:
|
|
insert into foo values ( 'the are' );
|
|
Word with space:
|
|
insert into foo values ( 'the\\ are' );
|
|
Words with apostrophe:
|
|
insert into foo values ( 'value\'s this' );
|
|
Complex word with apostrophe:
|
|
insert into foo values ( 'value\'s this we \'PostgreSQL site\'' );
|
|
|
|
select * from foo where titleidx @@ '\'PostgreSQL site\' | this';
|
|
select * from foo where titleidx @@ 'value\'s | this';
|
|
select * from foo where titleidx @@ '(the|this)&!we';
|
|
|
|
test=# select 'two words'::txtidx;
|
|
txtidx
|
|
---------------
|
|
'two' 'words'
|
|
(1 row)
|
|
|
|
test=# select 'single\\ word'::txtidx;
|
|
txtidx
|
|
---------------
|
|
'single word'
|
|
(1 row)
|
|
|
|
|
|
FULL TEXT SEARCH:
|
|
|
|
The basic idea of this data type is to use it for full text search inside
|
|
database. If you have a 'text' column title and corresponding column
|
|
titleidx of type 'txtidx', which contains the same information from
|
|
text column, then search on title could be replaced by
|
|
searching on titleidx which would be fast because of indexed access.
|
|
|
|
As a real life example consider database with table 'titles' containing
|
|
titles of mailing list postings in column 'title':
|
|
|
|
create table titles (
|
|
title text
|
|
|
|
);
|
|
|
|
Suppose, you already have a lot of titles and want to do full text search
|
|
on them.
|
|
|
|
First, you need to install contrib/tsearch module (see INSTALLATION and USAGE).
|
|
Add column 'titleidx' of type txtidx, containing space separated words from
|
|
title. It's possible to use function txt2txtidx(title) to fill 'titleidx'
|
|
column (see notice 1):
|
|
|
|
-- add titleidx column of type txtidx
|
|
alter table titles add titleidx txtidx;
|
|
update titles set titleidx=txt2txtidx(title);
|
|
|
|
Create index on titleidx:
|
|
create index t_idx on titles using gist(titleidx);
|
|
|
|
and now you can search all titles with words 'patch' and 'gist':
|
|
select title from titles where titleidx ## 'patch&gist';
|
|
|
|
Here, ## is a new operation defined for type 'txtidx' which could use index
|
|
(if exists) built on titleidx. This operator uses morphology to
|
|
expand query, i.e.
|
|
## 'patches&gist' will find titles with 'patch' and 'gist' also.
|
|
If you want to provide query as is, use operator @@ instead:
|
|
select title from titles where titleidx @@ 'patch&gist';
|
|
but remember, that function txt2txtidx does uses morphology, so you need
|
|
to fill column 'titleidx' using some another way. We hope in future releases
|
|
provide more consistent and convenient interfaces.
|
|
|
|
Query could contains boolean operators &,|,!,() with their usual meaning,
|
|
for example: 'patch&gist&!cvs', 'patch|cvs'.
|
|
Each operation ( ##, @@ ) requires appropriate query type -
|
|
txtidx ## mquery_txt
|
|
txtidx @@ query_txt
|
|
|
|
To see what query actually will be used :
|
|
|
|
test=# select 'patches&gist'::mquery_txt;
|
|
mquery_txt
|
|
------------------
|
|
'patch' & 'gist'
|
|
(1 row)
|
|
|
|
test=# select 'patches&gist'::query_txt;
|
|
query_txt
|
|
--------------------
|
|
'patches' & 'gist'
|
|
(1 row)
|
|
|
|
Notice the difference !
|
|
|
|
You could use trigger to be sure column 'titleidx' is consistent
|
|
with any changes in column 'title':
|
|
|
|
create trigger txtidxupdate before update or insert on titles
|
|
for each row execute procedure tsearch(titleidx, title);
|
|
|
|
This trigger uses the same parser, dictionaries as function
|
|
txt2txtidx (see notice 1).
|
|
Current syntax allows creating trigger for several columns
|
|
you want to be searchable:
|
|
|
|
create trigger txtidxupdate before update or insert on titles
|
|
for each row execute procedure tsearch(titleidx, title1, title2,... );
|
|
|
|
Use function txtidxsize(titleidx) to get the number of "words" in column
|
|
titleidx. To get total number of words in table titles:
|
|
|
|
test=# select sum(txtidxsize(titleidx)) from titles;
|
|
sum
|
|
---------
|
|
1917182
|
|
(1 row)
|
|
|
|
NOTICES:
|
|
|
|
1.
|
|
function txt2txtidx and trigger use parser, dictionaries coming with
|
|
this contrib module on default. Parser is mostly the same as in OpenFTS and
|
|
dictionaries are simple stemmers (sort of Lovin's stemmer which uses a
|
|
longest match algorithm.) for english and russian languages. There is a perl
|
|
script makedict/makedict.pl, which could be used to create specific
|
|
dictionaries from files with endings and stop-words.
|
|
Example files for english and russian languages are available
|
|
from http://www.sai.msu.su/~megera/postgres/gist/tsearch/.
|
|
Run script without parameters to see information about arguments and options.
|
|
|
|
Example:
|
|
cd makedict
|
|
./makedict.pl -l LOCALNAME -e FILEENDINGS -s FILESTOPWORD \
|
|
-o ../dict/YOURDICT.dct
|
|
|
|
Another options of makedict.pl:
|
|
-f do not execute tolower for any char
|
|
-a function of checking stopword will be work after lemmatize,
|
|
default is before
|
|
|
|
You need to edit dict.h to use your dictionary and, probably,
|
|
morph.c to change mapdict array.
|
|
|
|
Don't forget to do
|
|
make clean; make; make install
|
|
|
|
2.
|
|
As it was mentioned above we don't use explicitly ID of lexems
|
|
as in OpenFTS but use hash function (crc32) instead to map lexem to
|
|
integer. Our experiments show that probability of collision is quite small:
|
|
for english text it's about 10**(-6) and 10**(-5) for russian collection.
|
|
Default installation doesn't check for collisions but if your application
|
|
does need to guarantee an exact (no collisions) search, you need
|
|
to update system table to mark index islossy:
|
|
|
|
update pg_amop set amopreqcheck = true where amopclaid =
|
|
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
|
|
|
|
If you don't bother about collisions :
|
|
|
|
update pg_amop set amopreqcheck = false where amopclaid =
|
|
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
|
|
|
|
3.
|
|
txtidx doesn't preserve words ordering (this is not critical for searching)
|
|
for performance reason, for example:
|
|
|
|
test=# select 'page two'::txtidx;
|
|
txtidx
|
|
--------------
|
|
'two' 'page'
|
|
(1 row)
|
|
|
|
4.
|
|
Indexed access provided by txtidx data type isn't always good
|
|
because of internal data structure we use (RD-Tree). Particularly,
|
|
queries like '!gist' will be slower than just a sequential scan,
|
|
because for such queries RD-Tree doesn't provides selectivity on internal
|
|
nodes and all checks should be processed at leaf nodes, i.t. scan of
|
|
full index. You may play with function query_tree to see how effective
|
|
will be index usage:
|
|
|
|
test=# select querytree( 'patch&gist'::query_txt );
|
|
querytree
|
|
------------------
|
|
'patch' & 'gist'
|
|
(1 row)
|
|
|
|
This is an example of "good" query - index will effective for both words.
|
|
|
|
test=# select querytree( 'patch&!gist'::query_txt );
|
|
querytree
|
|
-----------
|
|
'patch'
|
|
(1 row)
|
|
|
|
This means that index is effective only to search word 'patch' and resulted
|
|
rows will be checked against '!gist'.
|
|
|
|
test=# select querytree( 'patch|!gist'::query_txt );
|
|
querytree
|
|
-----------
|
|
T
|
|
(1 row)
|
|
|
|
test=# select querytree( '!gist'::query_txt );
|
|
querytree
|
|
-----------
|
|
T
|
|
(1 row)
|
|
|
|
These two queries will be processed by scanning of full index !
|
|
Very slow !
|
|
|
|
5.
|
|
Following selects produce the same result
|
|
|
|
select title from titles where titleidx @@ 'patch&gist';
|
|
select title from titles where titleidx @@ 'patch' and titleidx @@ 'gist';
|
|
|
|
but the former will be more effective, because of internal optimization
|
|
of query executor.
|
|
|
|
|
|
TODO:
|
|
|
|
Better configurability (as in OpenFTS)
|
|
User's interfaces to parser, dictionaries ...
|
|
Write documentation
|
|
|
|
|
|
BENCHMARKS:
|
|
|
|
We use test collection in our experiments which contains 377905
|
|
titles from various mailing lists stored in our mailware
|
|
project.
|
|
|
|
All runs were performed on IBM ThinkPad T21 notebook with
|
|
PIII 733 Mhz, 256 RAM, 20 Gb HDD, Linux 2.2.19, postgresql 7.2.dev
|
|
We didn't do extensive benchmarking and all
|
|
numbers provide for illustration. Actual performance
|
|
is strongly depends on many factors (query, collection, dictionaries
|
|
and hardware).
|
|
|
|
Collection is available for download from
|
|
http://www.sai.msu.su/~megera/postgres/gist/tsearch/
|
|
as mw_titles.gz (about 3Mb).
|
|
|
|
0. install contrib/tsearch module
|
|
1. createdb test
|
|
2. psql test < tsearch.sql (from contrib/tsearch)
|
|
3. zcat mw_titles.gz | psql test
|
|
(it will creates table, copy test data and creates index)
|
|
|
|
Database contains one table:
|
|
|
|
test=# \d titles
|
|
Table "titles"
|
|
Column | Type | Modifiers
|
|
----------+------------------------+-----------
|
|
title | character varying(256) |
|
|
titleidx | txtidx |
|
|
Indexes: t_idx
|
|
|
|
Index was created as:
|
|
create index t_idx on titles using gist(titleidx);
|
|
(notice: this operation takes about 14 minutes on my notebook)
|
|
|
|
Typical select looks like:
|
|
select title from titles where titleidx @@ 'patch&gist';
|
|
|
|
Total number of lexems in collection : 1917182
|
|
|
|
1. We trust index - we consider index is exact and no
|
|
checking against tuples is necessary.
|
|
|
|
update pg_amop set amopreqcheck = false where amopclaid =
|
|
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
|
|
|
|
using gist indices
|
|
1: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.054s 0.00%
|
|
2: titleidx @@ 'patch&gist' 0.020u 0.000s 0m0.045s 44.82%
|
|
3: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.044s 0.00%
|
|
using gist indices (morph)
|
|
1: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.62%
|
|
2: titleidx ## 'patch&gist' 0.010u 0.010s 0m0.046s 43.47%
|
|
3: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
|
|
disable gist index
|
|
1: titleidx @@ 'patch&gist' 0.000u 0.010s 0m1.601s 0.62%
|
|
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m1.607s 0.00%
|
|
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m1.607s 0.62%
|
|
traditional like
|
|
1: title ~* 'gist' and title ~* 'patch' 0.010u 0.000s 0m9.206s 0.10%
|
|
2: title ~* 'gist' and title ~* 'patch' 0.000u 0.010s 0m9.205s 0.10%
|
|
3: title ~* 'gist' and title ~* 'patch' 0.010u 0.000s 0m9.208s 0.10%
|
|
|
|
2. Need to check results against tuples to avoid possible hash collision.
|
|
|
|
update pg_amop set amopreqcheck = true where amopclaid =
|
|
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
|
|
|
|
using gist indices
|
|
1: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.052s 19.26%
|
|
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.045s 0.00%
|
|
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.045s 22.39%
|
|
using gist indices (morph)
|
|
1: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
|
|
2: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.75%
|
|
3: titleidx ## 'patch&gist' 0.020u 0.000s 0m0.047s 42.13%
|
|
|
|
There are no visible difference between these 2 cases but your
|
|
mileage may vary.
|