postgres/contrib/tsearch2
Teodor Sigaev 1f7ef548ec Changes
* new split algorithm (as proposed in http://archives.postgresql.org/pgsql-hackers/2006-06/msg00254.php)
  * pickSplit() may now be called for the second and subsequent columns
  * add spl_(l|r)datum_exists to GIST_SPLITVEC -
    pickSplit should check these flags and, when they are set, use the
    already defined spl_(l|r)datum for splitting. pickSplit should then set
    spl_(l|r)datum_exists to 'false' (if they were 'true') to
    signal to the caller that spl_(l|r)datum was used.
  * support for old pickSplit(): not optimal,
    but a correct split
* remove 'bytes' field from GISTENTRY: the size of a
  value is in any case defined by its type.
* split GIST_SPLITVEC into two structures: one for use in picksplit
  and a second for internal use.
* some code refactoring
* support of subsplit to rtree opclasses

TODO: add support of subsplit to contrib modules
2006-06-28 12:00:14 +00:00
data tsearch2 module 2003-07-21 10:27:44 +00:00
docs This patch makes the error message strings throughout the backend 2006-03-01 06:30:32 +00:00
expected Add thesaurus dictionary which can replace N>0 lexemes by M>0 lexemes. 2006-05-31 14:05:31 +00:00
gendict Add CVS tag lines to files that were lacking them. 2006-03-11 04:38:42 +00:00
ispell Now ispell dictionary can eat dictionaries in MySpell format, 2006-06-09 13:25:59 +00:00
my2ispell Utility for convert myspell dictionaries to ispell, full README will be later 2003-11-26 14:06:16 +00:00
snowball Add CVS tag lines to files that were lacking them. 2006-03-11 04:38:42 +00:00
sql GIN: Generalized Inverted iNdex. 2006-05-02 11:28:56 +00:00
stopword Snowball multibyte. It's a pity, but snowball sources is very diferent for multibyte and 2006-01-27 16:32:31 +00:00
wordparser Add CVS tag lines to files that were lacking them. 2006-03-11 04:38:42 +00:00
common.c Add thesaurus dictionary which can replace N>0 lexemes by M>0 lexemes. 2006-05-31 14:05:31 +00:00
common.h Add thesaurus dictionary which can replace N>0 lexemes by M>0 lexemes. 2006-05-31 14:05:31 +00:00
crc32.c Add CVS tag lines to files that were lacking them. 2006-03-11 04:38:42 +00:00
crc32.h Add CVS tag lines to files that were lacking them. 2006-03-11 04:38:42 +00:00
dict_ex.c Add CVS tag lines to files that were lacking them. 2006-03-11 04:38:42 +00:00
dict_ispell.c Add CVS tag lines to files that were lacking them. 2006-03-11 04:38:42 +00:00
dict_snowball.c Add CVS tag lines to files that were lacking them. 2006-03-11 04:38:42 +00:00
dict_syn.c Add CVS tag lines to files that were lacking them. 2006-03-11 04:38:42 +00:00
dict_thesaurus.c Allow do not lexize words in substitution. 2006-06-06 16:25:55 +00:00
dict.c Add thesaurus dictionary which can replace N>0 lexemes by M>0 lexemes. 2006-05-31 14:05:31 +00:00
dict.h Add thesaurus dictionary which can replace N>0 lexemes by M>0 lexemes. 2006-05-31 14:05:31 +00:00
ginidx.c GIN: Generalized Inverted iNdex. 2006-05-02 11:28:56 +00:00
gistidx.c Changes 2006-06-28 12:00:14 +00:00
gistidx.h Add CVS tag lines to files that were lacking them. 2006-03-11 04:38:42 +00:00
Makefile Add thesaurus dictionary which can replace N>0 lexemes by M>0 lexemes. 2006-05-31 14:05:31 +00:00
prs_dcfg.c This patch makes the error message strings throughout the backend 2006-03-01 06:30:32 +00:00
query_cleanup.c This patch makes the error message strings throughout the backend 2006-03-01 06:30:32 +00:00
query_cleanup.h New features for tsearch2: 2005-11-08 17:08:46 +00:00
query_gist.c Changes 2006-06-28 12:00:14 +00:00
query_rewrite.c fix comparison with SPI_processed 2006-05-31 14:53:41 +00:00
query_support.c Re-run pgindent, fixing a problem where comment lines after a blank 2005-11-22 18:17:34 +00:00
query_util.c Re-run pgindent, fixing a problem where comment lines after a blank 2005-11-22 18:17:34 +00:00
query_util.h Re-run pgindent, fixing a problem where comment lines after a blank 2005-11-22 18:17:34 +00:00
query.c Back out \' change for tsearch2, broke regression tests. 2006-05-19 04:39:47 +00:00
query.h Improve support of multibyte encoding: 2005-12-12 11:10:12 +00:00
rank.c Fix stupid mistake in rank_cd_def cleanup 2006-04-10 09:56:52 +00:00
README.tsearch2 Label CVS tip as 8.0devel instead of 7.5devel. Adjust various comments 2004-08-04 21:34:35 +00:00
snmap.c For some reason access/tupmacs.h has been #including utils/memutils.h, 2005-05-06 17:24:55 +00:00
snmap.h 1 add namespaces as Tom suggest http://www.pgsql.ru/db/mw/msg.html?mid=1987703 2004-05-31 16:51:56 +00:00
stopword.c Add thesaurus dictionary which can replace N>0 lexemes by M>0 lexemes. 2006-05-31 14:05:31 +00:00
thesaurus Allow do not lexize words in substitution. 2006-06-06 16:25:55 +00:00
ts_cfg.c Add thesaurus dictionary which can replace N>0 lexemes by M>0 lexemes. 2006-05-31 14:05:31 +00:00
ts_cfg.h improve support of agglutinative languages (query with compound words). 2005-01-25 15:24:38 +00:00
ts_lexize.c Add thesaurus dictionary which can replace N>0 lexemes by M>0 lexemes. 2006-05-31 14:05:31 +00:00
ts_locale.c Clean up some signedness warnings. 2006-02-10 15:57:58 +00:00
ts_locale.h Clean up some signedness warnings. 2006-02-10 15:57:58 +00:00
ts_stat.c This patch makes the error message strings throughout the backend 2006-03-01 06:30:32 +00:00
ts_stat.h Add parentheses to macros when args are used in computations. Without 2005-05-25 21:40:43 +00:00
tsearch.sql.in Add thesaurus dictionary which can replace N>0 lexemes by M>0 lexemes. 2006-05-31 14:05:31 +00:00
tsvector_op.c Improve support of multibyte encoding: 2005-12-12 11:10:12 +00:00
tsvector.c Back out \' change for tsearch2, broke regression tests. 2006-05-19 04:39:47 +00:00
tsvector.h Standard pgindent run for 8.1. 2005-10-15 02:49:52 +00:00
untsearch.sql.in Add thesaurus dictionary which can replace N>0 lexemes by M>0 lexemes. 2006-05-31 14:05:31 +00:00
wparser_def.c Re-run pgindent, fixing a problem where comment lines after a blank 2005-11-22 18:17:34 +00:00
wparser.c This patch makes the error message strings throughout the backend 2006-03-01 06:30:32 +00:00
wparser.h pgindent run. 2003-08-04 00:43:34 +00:00

Tsearch2 - full text search extension for PostgreSQL

   [10][Online version] of this document is available
   
   This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
   
   Notice: This version is fully incompatible with the old tsearch (V1),
   which is considered deprecated in the upcoming 7.4 release and
   obsolete in 8.0.
   
   The Tsearch2 contrib module contains an implementation of a new data
   type tsvector - a searchable data type with indexed access. In a
   nutshell, tsvector is a set of unique words along with their
   positional information in the document, organized in a special
   structure optimized for fast access and lookup. Each word entry,
   besides its position in the document, can also have a weight
   attribute describing the importance of the word at that specific
   position in the document. A set of fixed-length bit-signatures
   representing tsvectors is stored in a search tree (developed using
   PostgreSQL GiST), which provides online update of the full text index
   and fast query lookup. The module provides indexed access methods,
   queries, operations and supporting routines for the tsvector data type
   and easy conversion of text data to tsvector. Table-driven
   configuration allows creation of custom configurations optimized for
   specific searches using standard SQL commands.
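
   As a rough illustration of the description above, the following Python
   sketch builds a tsvector-like mapping from each unique word to its
   positions and weights. The function name build_vector and the default
   weight label "D" are assumptions made for this example; the module
   itself does parsing, normalization and storage in C and is
   considerably more involved.

```python
from collections import defaultdict


def build_vector(text, weight="D"):
    """Map each unique lowercase word to a list of (position, weight) pairs.

    Positions are 1-based word offsets. The weight label is a hypothetical
    stand-in for the per-position weight attribute described in the text.
    """
    vector = defaultdict(list)
    for position, word in enumerate(text.lower().split(), start=1):
        vector[word].append((position, weight))
    # Keep the words sorted so that a lookup could use binary search.
    return dict(sorted(vector.items()))


vec = build_vector("a fat cat sat on a mat")
```

   Here the word "a" occurs twice, so it is stored once with two
   positions, which is what makes the structure a set of unique words.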
   
   Configuration allows you to:
     * specify the type of lexemes to be indexed and the way they are
       processed.
     * specify dictionaries to be used along with stop-word recognition.
     * specify the parser used to process a document.
       
   See [11]Documentation Roadmap for links to documentation.

OpenFTS vs Tsearch2

    OpenFTS is middleware between the application and the database: it uses
    tsearch2 for storage, while the database engine is used as the query
    executor (searching). Everything else (parsing of documents, query
    processing, linguistics) is carried out on the client side. That is why
    OpenFTS has its own configuration table (fts_conf) and works with its own
    set of dictionaries. OpenFTS is more flexible, because it can be used in
    a multi-server architecture with separate machines for the document
    repository (documents can be stored in the file system), the database
    and the query engine.

Authors

     * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
     * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd., Russia
       
Contributors

     * Robert John Shepherd and Andrew J. Kopciuch submitted
       "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
       v2)
     * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
       Reference" and proposed new naming convention for tsearch V2
       
New features

     * Relevance ranking of search results
     * Table driven configuration
     * Morphology support (ispell dictionaries, snowball stemmers)
     * Headline support (text fragments with highlighted search terms)
     * Ability to plug in custom dictionaries and parsers
     * Synonym dictionary
     * Generator of templates for dictionaries (built-in snowball stemmer
       support)
     * Statistics on indexed words are available
       
Limitations

     * A lexeme should be no longer than 2048 bytes
     * The number of lexemes is limited to 2^32. Note that the actual
       capacity of a tsvector depends on whether positional information
       is stored or not.
     * tsvector - the size is limited to approximately 2^20 bytes.
     * tsquery - the number of entries (lexemes and operations) < 32768
     * Positional information
          + maximal position of lexeme < 2^14 (16384)
          + a lexeme can have at most 256 positions
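
   The positional limits above can be made concrete with a small Python
   sketch. It assumes that a position shares one 16-bit slot with a
   2-bit weight, which would explain the 2^14 bound. That layout, the
   constant names and the helper functions are assumptions of this
   example and are not stated in this README.

```python
MAX_POSITION = 1 << 14            # positions must be < 2^14 (16384)
MAX_POSITIONS_PER_LEXEME = 256    # at most 256 positions per lexeme
MAX_LEXEME_BYTES = 2048           # lexeme length limit in bytes


def pack_position(position, weight=0):
    """Pack a position (< 2^14) and a 2-bit weight into one 16-bit integer.

    The bit layout is hypothetical: weight in the top 2 bits, position in
    the low 14 bits.
    """
    if not 0 <= position < MAX_POSITION:
        raise ValueError("position out of range")
    if not 0 <= weight < 4:
        raise ValueError("weight out of range")
    return (weight << 14) | position


def unpack_position(packed):
    """Return the (position, weight) pair stored in a packed 16-bit value."""
    return packed & (MAX_POSITION - 1), packed >> 14


def fits_limits(lexeme, positions):
    """Return True if a lexeme and its positions respect the listed limits."""
    return (
        len(lexeme.encode("utf-8")) <= MAX_LEXEME_BYTES
        and len(positions) <= MAX_POSITIONS_PER_LEXEME
        and all(0 <= p < MAX_POSITION for p in positions)
    )
```

   How the module reacts when a limit is exceeded is not described here,
   so the sketch only reports whether the input fits.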
       
References

     * GiST development site -
       [12]http://www.sai.msu.su/~megera/postgres/gist
     * OpenFTS home page - [13]http://openfts.sourceforge.net/
     * Mailing list -
       [14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
       
   [15]Documentation Roadmap
   
Documentation Roadmap

     * Several docs are available from the docs/ subdirectory
          + "Tsearch V2 Introduction" by Andrew Kopciuch
          + "Tsearch2 Guide" by Brandon Rhodes
          + "Tsearch2 Reference" by Brandon Rhodes
     * Readme.gendict in the gendict/ subdirectory
          + [16][Gendict tutorial]
       
   The online version of the documentation is always available from the
   Tsearch V2 home page -
   [17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
   
Support

   The authors strongly recommend using the [18][openfts-general] or
   [19][pgsql-general] mailing lists for questions and discussions.
   
Caution

   Although full text searching with our tsearch module appears easy
   (the authors hope it is), any serious search engine requires careful
   study of various aspects, such as stop words, dictionaries and special
   parsers. The tsearch module was designed to facilitate both cases.
   
Development History

   Pre-tsearch era
          Development of OpenFTS began in 2000 after realizing that we
          needed a search engine optimized for online updates and able to
          access metadata from the database. This is essential for online
          news agencies, web portals, digital libraries, etc. Most search
          engines available utilize an inverted index which is very fast
          for searching but very slow for online updates. Incremental
          updates of an inverted index are a complex engineering task,
          while we needed something light, free and with the ability to
          access metadata from the database. The last requirement is very
          important because in a real life application a search engine
          should always consult metadata (topic, permissions, date
          range, version, etc.). We extensively use PostgreSQL as a
          database backend and have no intention to move from it, so the
          problem was to find a data structure and a fast way to access
          it. PostgreSQL has a rather unique data type for storing sets
          (think about words) - arrays, but lacks index access to them. A
          document is parsed into lexemes, which are identified in
          various ways (e.g. stemming, morphology, dictionary), and as a
          result is reduced to an array of integer numbers. During our
          research we found a paper by Joseph Hellerstein which
          introduced an interesting data structure suitable for sets -
          RD-tree (Russian Doll tree). It looked very attractive, but
          implementing it in PostgreSQL seemed difficult because of our
          ignorance of database internals. Further research led us to
          the idea to use GiST for implementing RD-tree, but at that time
          the GiST code had for a long while remained untouched and
          contained several bugs. After work on improving GiST for
          version 7.0.3 of PostgreSQL was done, we were able to implement
          RD-Tree and use it for index access to arrays of integers. This
          implementation was ideally suited for small arrays and
          eliminated complex joins, but was practically useless for
          indexing large arrays. The next improvement came from an idea
          to represent a document by a single bit-signature, a so-called
          superimposed signature (see "Index Structures for Databases
          Containing Data Items with Set-valued Attributes", 1997, Sven
          Helmer for details). We developed the contrib/intarray module
          and used it for full text indexing.
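
          The superimposed signature idea mentioned above can be sketched
          in a few lines of Python. Each word is hashed to one bit of a
          fixed-length bit string and the bits of all words in a document
          are ORed together. The 64-bit signature length, the use of
          CRC32 as the hash and the function names are assumptions of
          this sketch, not a description of the module's actual code.

```python
import zlib

SIGNATURE_BITS = 64  # fixed signature length; an arbitrary choice here


def word_bit(word):
    """Hash a word to one bit position (CRC32 is an assumed stand-in hash)."""
    return zlib.crc32(word.encode("utf-8")) % SIGNATURE_BITS


def signature(words):
    """Superimpose (bitwise OR) the bits of all words into one integer."""
    sig = 0
    for word in words:
        sig |= 1 << word_bit(word)
    return sig


def may_contain(doc_sig, query_words):
    """Return True if every query bit is set in the document signature.

    True can be a false positive, because different words may share a bit,
    so a match has to be rechecked against the document itself. False is
    definitive: at least one query word is certainly absent.
    """
    query_sig = signature(query_words)
    return (doc_sig & query_sig) == query_sig


doc_sig = signature(["cat", "sat", "mat"])
```

          Because a signature has a fixed size regardless of document
          length, it can be kept in a search tree node, at the price of
          rechecking candidate matches.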
          
   tsearch v1
          It was inconvenient to use integer id's instead of words, so we
          introduced a new data type called 'txtidx' - a searchable data
          type (textual) with indexed access. This was the first step of
          our work on an implementation of a built-in PostgreSQL full
          text search engine. Even though tsearch v1 had many features of
          a search engine it lacked configuration support and relevance
          ranking. People were encouraged to use OpenFTS, which provided
          relevance ranking based on coordinate information and flexible
          configuration. OpenFTS v.0.34 is the last version based on
          tsearch v1.
          
   tsearch V2
          People recognized tsearch as a powerful tool for full text
          searching and insisted on adding ranking support, better
          configurability, etc. We already thought about moving most of
          the features of OpenFTS to tsearch, and in early 2003 we
          decided to work on a new version of tsearch - tsearch v2. We've
          abandoned auxiliary index tables which were used by OpenFTS to
          store coordinate information and modified the txtidx type to
          store them internally. Also, we've added table-driven
          configuration, support of ispell dictionaries, snowball
          stemmers and the ability to specify which types of lexemes to
          index. Also, it's now possible to generate headlines of
          documents with highlighted search terms. These changes make
          tsearch more user friendly and turn it into a really powerful
          full text search engine. After announcing the alpha version, we
          received a proposal from Brandon Rhodes to rename tsearch
          functions to be more consistent. So we renamed the txtidx
          type to tsvector, and renamed other things as well.
          
   To allow users of tsearch v1 a smooth upgrade, we named the module
   tsearch2.
   
   A future release of OpenFTS (v.0.35) will be based on tsearch2. Brave
   people can download it from OpenFTS CVS (see the link from the
   [20][OpenFTS page]).

References

  10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
  11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap
  12. http://www.sai.msu.su/~megera/postgres/gist
  13. http://openfts.sourceforge.net/
  14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
  15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap
  16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict
  17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
  18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
  19. http://archives.postgresql.org/pgsql-general/
  20. http://openfts.sourceforge.net/