211 lines
9.8 KiB
Plaintext
211 lines
9.8 KiB
Plaintext
Tsearch2 - full text search extension for PostgreSQL
|
|
|
|
[1]Online version of this document is available
|
|
|
|
Tsearch2 - is the full text engine, fully integrated into PostgreSQL
|
|
RDBMS.
|
|
|
|
Main features
|
|
|
|
* Full online update
|
|
* Supports multiple table driven configurations
|
|
* flexible and rich linguistic support (dictionaries, stop words),
|
|
thesaurus
|
|
* full multibyte (UTF-8) support
|
|
* Sophisticated ranking functions with support of proximity and
|
|
structure information (rank, rank_cd)
|
|
* Index support (GiST and Gin) with concurrency and recovery support
|
|
* Rich query language with query rewriting support
|
|
* Headline support (text fragments with highlighted search terms)
|
|
* Ability to plug-in custom dictionaries and parsers
|
|
* Template generator for tsearch2 dictionaries with [2]snowball
|
|
stemmer support
|
|
* It is mature (5 years of development)
|
|
|
|
Tsearch2, in a nutshell, provides FTS operator (contains) for the new
|
|
data types, representing document (tsvector) and query (tsquery).
|
|
Table driven configuration allows creation of custom searches using
|
|
standard SQL commands.
|
|
|
|
tsvector is a searchable data type, representing document. It is a set
|
|
of unique words along with their positional information in the
|
|
document, organized in a special structure optimized for fast access
|
|
and lookup. Each entry could be labelled to reflect its importance in
|
|
document.
|
|
|
|
tsquery is a data type for textual queries with support of boolean
|
|
operators. It consists of lexemes (optionally labelled) with boolean
|
|
operators between.
|
|
|
|
Table driven configuration allows to specify:
|
|
* parser, which used to break document onto lexemes
|
|
* what lexemes to index and the way they are processed
|
|
* dictionaries to be used along with stop words recognition.
|
|
|
|
OpenFTS vs Tsearch2
|
|
|
|
[3]OpenFTS is a middleware between application and database. OpenFTS
|
|
uses tsearch2 as a storage and database engine as a query executor
|
|
(searching). Everything else, i.e. parsing of documents, query
|
|
processing, linguistics, carry outs on client side. That's why OpenFTS
|
|
has its own configuration table (fts_conf) and works with its own set
|
|
of dictionaries. OpenFTS is more flexible, because it could be used in
|
|
multi-server architecture with separate machines for repository of
|
|
documents (documents could be stored in filesystem), database and
|
|
query engine.
|
|
|
|
See [4]Documentation Roadmap for links to documentation.
|
|
|
|
Authors
|
|
|
|
* Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
|
|
* Teodor Sigaev <teodor@sigaev.ru>, Moscow,Moscow University,Russia
|
|
|
|
Contributors
|
|
|
|
* Robert John Shepherd and Andrew J. Kopciuch submitted
|
|
"Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
|
|
v2)
|
|
* Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
|
|
Reference" and proposed new naming convention for tsearch V2
|
|
|
|
Sponsors
|
|
|
|
* ABC Startsiden - compound words support
|
|
* University of Mannheim for UTF-8 support (in 8.2)
|
|
* jfg:networks ([5]http:www.jfg-networks.com/) for Gin - Generalized
|
|
Inverted index (in 8.2)
|
|
* Georgia Public Library Service and LibLime, Inc. for Thesaurus
|
|
dictionary
|
|
* PostGIS community - GiST Concurrency and Recovery
|
|
|
|
The authors are grateful to the Russian Foundation for Basic Research
|
|
and Delta-Soft Ltd., Moscow, Russia for support.
|
|
|
|
Limitations
|
|
|
|
* Length of lexeme < 2K
|
|
* Length of tsvector (lexemes + positions) < 1Mb
|
|
* The number of lexemes < 4^32
|
|
* 0< Positional information < 16383
|
|
* No more than 256 positions per lexeme
|
|
* The number of nodes ( lexemes + operations) in tsquery < 32768
|
|
|
|
References
|
|
|
|
* GiST development site -
|
|
[6]http://www.sai.msu.su/~megera/postgres/gist
|
|
* GiN development - [7]http://www.sigaev.ru/gin/
|
|
* OpenFTS home page - [8]http://openfts.sourceforge.net/
|
|
* Mailing list -
|
|
[9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gene
|
|
ral
|
|
|
|
Documentation Roadmap
|
|
|
|
* Several docs are available from docs/ subdirectory
|
|
+ "Tsearch V2 Introduction" by Andrew Kopciuch
|
|
+ "Tsearch2 Guide" by Brandon Rhodes
|
|
+ "Tsearch2 Reference" by Brandon Rhodes
|
|
* Readme.gendict in gendict/ subdirectory
|
|
+ Also, check [10]Gendict tutorial
|
|
* Check [11]tsearch2 Wiki pages for various documentation
|
|
|
|
Support
|
|
|
|
Authors urgently recommend people to use [12]openfts-general or
|
|
[13]pgsql-general mailing lists for questions and discussions.
|
|
|
|
Development History
|
|
|
|
Latest news
|
|
|
|
To the PostgreSQL 8.2 release we added:
|
|
* multibyte (UTF-8) support
|
|
* Thesaurus dictionary
|
|
* Query rewriting
|
|
* rank_cd relevation function now support different weights of
|
|
lexemes
|
|
* GiN support adds scalability of tsearch2
|
|
|
|
Pre-tsearch era
|
|
Development of OpenFTS began in 2000 after realizing that we
|
|
need a search engine optimized for online updates with access
|
|
to metadata from the database. This is essential for online
|
|
news agencies, web portals, digital libraries, etc. Most search
|
|
engines available utilize an inverted index which is very fast
|
|
for searching but very slow for online updates. Incremental
|
|
updates of an inverted index is a complex engineering task
|
|
while we needed something light, free and with the ability to
|
|
access metadata from the database. The last requirement was
|
|
very important because in a real life application search engine
|
|
should always consult metadata ( topic, permissions, date
|
|
range, version, etc.). We extensively use PostgreSQL as a
|
|
database backend and have no intention to move from it, so the
|
|
problem was to find a data structure and a fast way to access
|
|
it. PostgreSQL has rather unique data type for storing sets
|
|
(think about words) - arrays, but lacks index access to them.
|
|
During our research we found a paper of Joseph Hellerstein, who
|
|
introduced an interesting data structure suitable for sets -
|
|
RD-tree (Russian Doll tree). Further research lead us to the
|
|
idea to use GiST for implementing RD-tree, but at that time the
|
|
GiST code was untouched for a long time and contained several
|
|
bugs. After work on improving GiST for version 7.0.3 of
|
|
PostgreSQL was done, we were able to implement RD-Tree and use
|
|
it for index access to arrays of integers. This implementation
|
|
was ideally suited for small arrays and eliminated complex
|
|
joins, but was practically useless for indexing large arrays.
|
|
The next improvement came from an idea to represent a document
|
|
by a single bit-signature, a so-called superimposed signature
|
|
(see "Index Structures for Databases Containing Data Items with
|
|
Set-valued Attributes", 1997, Sven Helmer for details). We
|
|
developed the contrib/intarray module and used it for full
|
|
text indexing.
|
|
|
|
tsearch v1
|
|
It was inconvenient to use integer id's instead of words, so we
|
|
introduced a new data type called 'txtidx' - a searchable data
|
|
type (textual) with indexed access. This was a first step of
|
|
our work on an implementation of a built-in PostgreSQL full
|
|
text search engine. Even though tsearch v1 had many features of
|
|
a search engine it lacked configuration support and relevance
|
|
ranking. People were encouraged to use OpenFTS, which provided
|
|
relevance ranking based on positional information and flexible
|
|
configuration. OpenFTS v.0.34 is the last version based on
|
|
tsearch v1.
|
|
|
|
tsearch V2
|
|
People recognized tsearch as a powerful tool for full text
|
|
searching and insisted on adding ranking support, better
|
|
configurability, etc. We already thought about moving most of
|
|
the features of OpenFTS to tsearch, and in the early 2003 we
|
|
decided to work on a new version of tsearch. We abandoned
|
|
auxiliary index tables which were used by OpenFTS to store
|
|
positional information and modified the txtidx type to store
|
|
them internally. We added table-driven configuration, support
|
|
of ispell dictionaries, snowball stemmers and the ability to
|
|
specify which types of lexemes to index. Now, it's possible to
|
|
generate headlines of documents with highlighted search terms.
|
|
These changes make tsearch more user friendly and turn it into
|
|
a really powerful full text search engine. Brandon Rhodes
|
|
proposed to rename tsearch functions for consistency and we
|
|
renamed txtidx type to tsvector and other things as well. To
|
|
allow users of tsearch v1 smooth upgrade, we named the module
|
|
as tsearch2. Since version 0.35 OpenFTS uses tsearch2.
|
|
|
|
References
|
|
|
|
1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
|
|
2. http://snowball.tartarus.org/
|
|
3. http://openfts.sourceforge.net/
|
|
4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm
|
|
5. http:www.jfg-networks.com/
|
|
6. http://www.sai.msu.su/~megera/postgres/gist
|
|
7. http://www.sigaev.ru/gin/
|
|
8. http://openfts.sourceforge.net/
|
|
9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
|
|
10. http://www.sai.msu.su/~megera/wiki/Gendict
|
|
11. http://www.sai.msu.su/~megera/wiki/Tsearch2
|
|
12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
|
|
13. http://archives.postgresql.org/pgsql-general/
|