postgres/contrib/tsearch2/gendict
Teodor Sigaev 324300bc7c improve support of agglutinative languages (query with compound words).
regression=# select to_tsquery( '\'fotballklubber\'');
                   to_tsquery
------------------------------------------------
 'fotball' & 'klubb' | 'fot' & 'ball' & 'klubb'
(1 row)

So, the interface to dictionaries changed: the lexize method of a dictionary
should return a pointer to an array of TSLexeme structs instead of char**.
The last element should have TSLexeme->lexeme == NULL.

typedef struct {
        /* number of the split variant of a word; for example, the
                word 'fotballklubber' (Norwegian) can be split two ways:
                ( fotball, klubb ) and ( fot, ball, klubb ). So, the dictionary
                should return:
                nvariant        lexeme
                1               fotball
                1               klubb
                2               fot
                2               ball
                2               klubb

        */
        uint16  nvariant;

        /* currently unused */
        uint16  flags;

        /* C-string */
        char    *lexeme;
} TSLexeme;
2005-01-25 15:24:38 +00:00
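
A minimal sketch of the array layout described in the commit message above,
assuming a stand-alone build outside PostgreSQL. The TSLexeme type is redeclared
locally with uint16_t in place of uint16, and example_split and count_variant
are hypothetical helpers, not part of tsearch2. A real lexize method would
allocate its result with palloc rather than return static storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for the struct shown in the commit message. */
typedef struct {
    uint16_t nvariant;  /* which split variant this lexeme belongs to */
    uint16_t flags;     /* currently unused */
    char    *lexeme;    /* C-string; NULL marks the end of the array */
} TSLexeme;

/* Hypothetical helper: the array a dictionary might return for
   'fotballklubber', matching the nvariant/lexeme table above. */
TSLexeme *
example_split(void)
{
    static TSLexeme res[] = {
        {1, 0, "fotball"},
        {1, 0, "klubb"},
        {2, 0, "fot"},
        {2, 0, "ball"},
        {2, 0, "klubb"},
        {0, 0, NULL}        /* terminator: lexeme == NULL */
    };
    return res;
}

/* Hypothetical helper: count the lexemes of one variant by walking
   the array until the NULL-lexeme terminator. */
int
count_variant(const TSLexeme *arr, uint16_t variant)
{
    int n = 0;

    assert(arr != NULL);
    for (; arr->lexeme != NULL; arr++)
        if (arr->nvariant == variant)
            n++;
    return n;
}
```

A caller never needs a separate length: it walks the array until it reaches
the element whose lexeme is NULL, and groups lexemes by nvariant to rebuild
each alternative split.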
config.sh
dict_snowball.c.IN
dict_tmpl.c.IN
Makefile.IN
README.gendict
sql.IN

Gendict - generate dictionary templates for contrib/tsearch2 module.

This utility helps you create a dictionary for the contrib/tsearch2
module. In particular, it has built-in support for Snowball stemmers.

The programming API for tsearch2 dictionaries is described in the tsearch2
documentation.


Prerequisites:

* PostgreSQL 7.3 or later.

* The tsearch2 module sources, already compiled.

* Rights to install contrib modules.

Usage:

    Run config.sh without parameters to see its options and arguments:

Usage:
./config.sh -n DICTNAME ( [ -s [ -p PREFIX ] ] | [ -c CFILES ] [ -h HFILES ] [ -i ] ) [ -v ] [ -d DIR ] [ -C COMMENT ]
    -v - be verbose
    -d DIR - name of directory in PGSQL_SRC/contrib (default dict_DICTNAME)
    -C COMMENT - dictionary comment
Generate Snowball stemmer:
./config.sh -n DICTNAME -s [ -p PREFIX ] [ -v ] [ -d DIR ] [ -C COMMENT ]
    -s - generate Snowball wrapper
    -p - prefix of Snowball's function, (default DICTNAME)
Generate template dictionary:
./config.sh -n DICTNAME [ -c CFILES ] [ -h HFILES ] [ -i ] [ -v ] [ -d DIR ] [ -C COMMENT ]
    -c CFILES - source files, must be placed in contrib/tsearch2/gendict directory.
                These files will be used in Makefile.
    -h HFILES - header files, must be placed in contrib/tsearch2/gendict directory.
                These files will be used in Makefile and subinclude.h
    -i - dictionary has init method


Example 1:

   Create Portuguese stemmer
 
   0. cd PGSQL_SRC/contrib/tsearch2/gendict

   1. Obtain stem.{c,h} files for Portuguese

      wget http://snowball.tartarus.org/portuguese/stem.c
      wget http://snowball.tartarus.org/portuguese/stem.h
   
   2. Create template files for Portuguese

      ./config.sh -n pt -s -p portuguese -v -C'Snowball stemmer for Portuguese'

      Note that the argument for the -p option must be *the same* as the name
      of the stemming function in stem.c (without _stem).

      A set of files will be generated and placed in the
      PGSQL_SRC/contrib/dict_pt directory.

   3. Compile and install dictionary

	cd PGSQL_SRC/contrib/dict_pt
	make
	make install

   4. Test it 

	Sample Portuguese words with their stemmed forms are available
	from http://snowball.tartarus.org/portuguese/stemmer.html

	createdb testdict
	psql testdict < /usr/local/pgsql/share/contrib/tsearch2.sql
	psql testdict < /usr/local/pgsql/share/contrib/dict_pt.sql
	psql -d testdict -c "select lexize('pt','bobagem');"
	 lexize  
	---------
	 {bobag}
	(1 row)

	Here is the corresponding entry in the pg_ts_dict table:

	psql -d testdict -c "select * from pg_ts_dict where dict_name='pt';"
	 dict_name | dict_init | dict_initoption | dict_lexize |          dict_comment           
	-----------+-----------+-----------------+-------------+---------------------------------
	 pt        |   7177806 |                 |     7159330 | Snowball stemmer for Portuguese
	(1 row)

 
	Note that the dictionary is now installed, together with the
	corresponding entry in the tsearch configuration, and you may modify
	it using plain SQL commands, for example to specify stop words.

Example 2:

      a) Simple template dictionary with an init method:

	./config.sh -n wow -v -i -C WOW

      b) Simple template dictionary without an init method:

	./config.sh -n wow -v -C WOW

	The same as above, but the dictionary will not have an init method.

       The dictionaries obtained in a) and b) are fully working and ready
       for use:
	  a) lowercases the input word and removes it if it is a stop word
	  b) recognizes any word
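
The behaviour described for variant a) can be sketched in plain C as follows.
This is only an illustration under stated assumptions, not the generated
dict_tmpl.c code: the TSLexeme type is redeclared locally, the stop list is a
hard-coded hypothetical one, memory comes from malloc instead of palloc, and
returning an array that holds only the terminator for a stop word is an
assumption about the convention. sketch_lexize and is_stop_word are made-up
names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in for the tsearch2 struct. */
typedef struct {
    uint16_t nvariant;
    uint16_t flags;
    char    *lexeme;    /* NULL marks the end of the array */
} TSLexeme;

/* Hypothetical stop list; a real dictionary would load its own
   list in the init method. */
static const char *const stop_words[] = {"a", "of", "the", NULL};

/* Return 1 if w is in the hard-coded stop list, 0 otherwise. */
int
is_stop_word(const char *w)
{
    size_t i;

    for (i = 0; stop_words[i] != NULL; i++)
        if (strcmp(stop_words[i], w) == 0)
            return 1;
    return 0;
}

/* Lowercase the word; return an array holding the lowercased form,
   or only the terminator when the word is a stop word. */
TSLexeme *
sketch_lexize(const char *word)
{
    size_t    len = strlen(word);
    size_t    i;
    char     *lower = malloc(len + 1);
    TSLexeme *res = calloc(2, sizeof(TSLexeme));

    assert(lower != NULL && res != NULL);
    for (i = 0; i < len; i++)
        lower[i] = (char) tolower((unsigned char) word[i]);
    lower[len] = '\0';

    res[0].lexeme = NULL;
    res[1].lexeme = NULL;       /* terminator */
    if (is_stop_word(lower))
        free(lower);            /* stop word: nothing to index */
    else
        res[0].lexeme = lower;
    return res;
}
```

Under these assumptions, sketch_lexize("Wow") yields one lexeme, "wow",
followed by the terminator, while sketch_lexize("The") yields only the
terminator.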

      c) Simple template dictionary with source files (with an init method):

	./config.sh -n wow -v -i -c a.c -h a.h -C WOW

	Source files ( a.c ) must be placed in the contrib/tsearch2/gendict
	directory. These files will be used in the Makefile.

	Header files ( a.h ) must be placed in the contrib/tsearch2/gendict
	directory. These files will be used in the Makefile and subinclude.h.

      d) Simple template dictionary with source files (without an init method):

	./config.sh -n wow -v -c a.c -h a.h -C WOW

	The same as above, but the dictionary will not have an init method.

       After that, the sources are in PGSQL_SRC/contrib/dict_wow and
       you may edit them to create the actual dictionary.

  Please check the Tsearch2 home page (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/)
  for additional information on dictionaries, including the "Gendict tutorial".