c9483cb6e4
Thanks. Teodor Sigaev |
||
---|---|---|
.. | ||
data | ||
dict | ||
expected | ||
makedict | ||
sql | ||
crc32.c | ||
crc32.h | ||
deflex.h | ||
dict.h | ||
gistidx.c | ||
gistidx.h | ||
Makefile | ||
morph.c | ||
morph.h | ||
parser.h | ||
parser.l | ||
query.c | ||
query.h | ||
README.tsearch | ||
rewrite.c | ||
rewrite.h | ||
tsearch.sql.in | ||
txtidx.c | ||
txtidx.h |
Tsearch contrib module contains implementation of new data type txtidx - a searchable data type (textual) with indexed access. All work was done by Teodor Sigaev (teodor@stack.net) and Oleg Bartunov (oleg@sai.msu.su). IMPORTANT NOTICE: This is a first step of our work on integration of OpenFTS full text search engine (http://openfts.sourceforge.net/) into PostgreSQL. It's based on our recent development of GiST (Generalized Search Tree) for PostgreSQL 7.2 (see our GiST page at http://www.sai.msu.su/~megera/postgres/gist/ for info about GiST) and will works only for PostgreSQL version 7.2 and later. We didn't try to implement a full-featured search engine with stable interfaces but rather experiment with various approaches. There are many issues remains (most of them just not documented or implemented) but we'd like to present a working prototype of full text search engine fully integrated into PostgreSQL to collect user's feedback and recommendations. INSTALLATION: cd contrib/tsearch gmake gmake install REGRESSION TEST: gmake installcheck USAGE: psql DATABASE < tsearch.sql (from contrib/tsearch) INTRODUCTION: This module provides an implementation of a new data type 'txtidx' which is a string of a space separated "words". "Words" with spaces should be enclosed in apostrophes and apostrophes inside a "word" should be escaped by backslash. This is quite different from OpenFTS approach which uses array of integers (ID of lexems) and requires storing of lexem-id pairs in database. One of the prominent benefit of this new approach is that it's possible now to perform full text search in a 'natural' way. Some examples: create table foo ( titleidx txtidx ); 2 regular words: insert into foo values ( 'the are' ); Word with space: insert into foo values ( 'the\\ are' ); Words with apostrophe: insert into foo values ( 'value\'s this' ); Complex word with apostrophe: insert into foo values ( 'value\'s this we \'PostgreSQL site\'' ); select * from foo where titleidx @@ '\'PostgreSQL site\' | this'; select * from foo where titleidx @@ 'value\'s | this'; select * from foo where titleidx @@ '(the|this)&!we'; test=# select 'two words'::txtidx; txtidx --------------- 'two' 'words' (1 row) test=# select 'single\\ word'::txtidx; txtidx --------------- 'single word' (1 row) FULL TEXT SEARCH: The basic idea of this data type is to use it for full text search inside database. If you have a 'text' column title and corresponding column titleidx of type 'txtidx', which contains the same information from text column, then search on title could be replaced by searching on titleidx which would be fast because of indexed access. As a real life example consider database with table 'titles' containing titles of mailing list postings in column 'title': create table titles ( title text ); Suppose, you already have a lot of titles and want to do full text search on them. First, you need to install contrib/tsearch module (see INSTALLATION and USAGE). Add column 'titleidx' of type txtidx, containing space separated words from title. It's possible to use function txt2txtidx(title) to fill 'titleidx' column (see notice 1): -- add titleidx column of type txtidx alter table titles add titleidx txtidx; update titles set titleidx=txt2txtidx(title); Create index on titleidx: create index t_idx on titles using gist(titleidx); and now you can search all titles with words 'patch' and 'gist': select title from titles where titleidx ## 'patch&gist'; Here, ## is a new operation defined for type 'txtidx' which could use index (if exists) built on titleidx. This operator uses morphology to expand query, i.e. ## 'patches&gist' will find titles with 'patch' and 'gist' also. If you want to provide query as is, use operator @@ instead: select title from titles where titleidx @@ 'patch&gist'; but remember, that function txt2txtidx does uses morphology, so you need to fill column 'titleidx' using some another way. We hope in future releases provide more consistent and convenient interfaces. Query could contains boolean operators &,|,!,() with their usual meaning, for example: 'patch&gist&!cvs', 'patch|cvs'. Each operation ( ##, @@ ) requires appropriate query type - txtidx ## mquery_txt txtidx @@ query_txt To see what query actually will be used : test=# select 'patches&gist'::mquery_txt; mquery_txt ------------------ 'patch' & 'gist' (1 row) test=# select 'patches&gist'::query_txt; query_txt -------------------- 'patches' & 'gist' (1 row) Notice the difference ! You could use trigger to be sure column 'titleidx' is consistent with any changes in column 'title': create trigger txtidxupdate before update or insert on titles for each row execute procedure tsearch(titleidx, title); This trigger uses the same parser, dictionaries as function txt2txtidx (see notice 1). Current syntax allows creating trigger for several columns you want to be searchable: create trigger txtidxupdate before update or insert on titles for each row execute procedure tsearch(titleidx, title1, title2,... ); Use function txtidxsize(titleidx) to get the number of "words" in column titleidx. To get total number of words in table titles: test=# select sum(txtidxsize(titleidx)) from titles; sum --------- 1917182 (1 row) NOTICES: 1. function txt2txtidx and trigger use parser, dictionaries coming with this contrib module on default. Parser is mostly the same as in OpenFTS and dictionaries are simple stemmers (sort of Lovin's stemmer which uses a longest match algorithm.) for english and russian languages. There is a perl script makedict/makedict.pl, which could be used to create specific dictionaries from files with endings and stop-words. Example files for english and russian languages are available from http://www.sai.msu.su/~megera/postgres/gist/tsearch/. Run script without parameters to see information about arguments and options. Example: cd makedict ./makedict.pl -l LOCALNAME -e FILEENDINGS -s FILESTOPWORD \ -o ../dict/YOURDICT.dct Another options of makedict.pl: -f do not execute tolower for any char -a function of checking stopword will be work after lemmatize, default is before You need to edit dict.h to use your dictionary and, probably, morph.c to change mapdict array. Don't forget to do make clean; make; make install 2. As it was mentioned above we don't use explicitly ID of lexems as in OpenFTS but use hash function (crc32) instead to map lexem to integer. Our experiments show that probability of collision is quite small: for english text it's about 10**(-6) and 10**(-5) for russian collection. Default installation doesn't check for collisions but if your application does need to guarantee an exact (no collisions) search, you need to update system table to mark index islossy: update pg_amop set amopreqcheck = true where amopclaid = (select oid from pg_opclass where opcname = 'gist_txtidx_ops'); If you don't bother about collisions : update pg_amop set amopreqcheck = false where amopclaid = (select oid from pg_opclass where opcname = 'gist_txtidx_ops'); 3. txtidx doesn't preserve words ordering (this is not critical for searching) for performance reason, for example: test=# select 'page two'::txtidx; txtidx -------------- 'two' 'page' (1 row) 4. Indexed access provided by txtidx data type isn't always good because of internal data structure we use (RD-Tree). Particularly, queries like '!gist' will be slower than just a sequential scan, because for such queries RD-Tree doesn't provides selectivity on internal nodes and all checks should be processed at leaf nodes, i.t. scan of full index. You may play with function query_tree to see how effective will be index usage: test=# select querytree( 'patch&gist'::query_txt ); querytree ------------------ 'patch' & 'gist' (1 row) This is an example of "good" query - index will effective for both words. test=# select querytree( 'patch&!gist'::query_txt ); querytree ----------- 'patch' (1 row) This means that index is effective only to search word 'patch' and resulted rows will be checked against '!gist'. test=# select querytree( 'patch|!gist'::query_txt ); querytree ----------- T (1 row) test=# select querytree( '!gist'::query_txt ); querytree ----------- T (1 row) These two queries will be processed by scanning of full index ! Very slow ! 5. Following selects produce the same result select title from titles where titleidx @@ 'patch&gist'; select title from titles where titleidx @@ 'patch' and titleidx @@ 'gist'; but the former will be more effective, because of internal optimization of query executor. TODO: Better configurability (as in OpenFTS) User's interfaces to parser, dictionaries ... Write documentation BENCHMARKS: We use test collection in our experiments which contains 377905 titles from various mailing lists stored in our mailware project. All runs were performed on IBM ThinkPad T21 notebook with PIII 733 Mhz, 256 RAM, 20 Gb HDD, Linux 2.2.19, postgresql 7.2.dev We didn't do extensive benchmarking and all numbers provide for illustration. Actual performance is strongly depends on many factors (query, collection, dictionaries and hardware). Collection is available for download from http://www.sai.msu.su/~megera/postgres/gist/tsearch/ as mw_titles.gz (about 3Mb). 0. install contrib/tsearch module 1. createdb test 2. psql test < tsearch.sql (from contrib/tsearch) 3. zcat mw_titles.gz | psql test (it will creates table, copy test data and creates index) Database contains one table: test=# \d titles Table "titles" Column | Type | Modifiers ----------+------------------------+----------- title | character varying(256) | titleidx | txtidx | Indexes: t_idx Index was created as: create index t_idx on titles using gist(titleidx); (notice: this operation takes about 14 minutes on my notebook) Typical select looks like: select title from titles where titleidx @@ 'patch&gist'; Total number of lexems in collection : 1917182 1. We trust index - we consider index is exact and no checking against tuples is necessary. update pg_amop set amopreqcheck = false where amopclaid = (select oid from pg_opclass where opcname = 'gist_txtidx_ops'); using gist indices 1: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.054s 0.00% 2: titleidx @@ 'patch&gist' 0.020u 0.000s 0m0.045s 44.82% 3: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.044s 0.00% using gist indices (morph) 1: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.62% 2: titleidx ## 'patch&gist' 0.010u 0.010s 0m0.046s 43.47% 3: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00% disable gist index 1: titleidx @@ 'patch&gist' 0.000u 0.010s 0m1.601s 0.62% 2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m1.607s 0.00% 3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m1.607s 0.62% traditional like 1: title ~* 'gist' and title ~* 'patch' 0.010u 0.000s 0m9.206s 0.10% 2: title ~* 'gist' and title ~* 'patch' 0.000u 0.010s 0m9.205s 0.10% 3: title ~* 'gist' and title ~* 'patch' 0.010u 0.000s 0m9.208s 0.10% 2. Need to check results against tuples to avoid possible hash collision. update pg_amop set amopreqcheck = true where amopclaid = (select oid from pg_opclass where opcname = 'gist_txtidx_ops'); using gist indices 1: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.052s 19.26% 2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.045s 0.00% 3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.045s 22.39% using gist indices (morph) 1: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00% 2: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.75% 3: titleidx ## 'patch&gist' 0.020u 0.000s 0m0.047s 42.13% There are no visible difference between these 2 cases but your mileage may vary.