mirror of https://github.com/postgres/postgres
ISpell info updated
This commit is contained in:
parent ef38ca9b3d
commit 38e2bf6283
@@ -1,17 +1,13 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title>tsearch-v2-intro</title>

<link type="text/css" rel="stylesheet" href="tsearch-V2-intro_files/tsearch.txt"></head>

<html>
<head>
<title>tsearch-v2-intro</title>
<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
</head>

<body>
<div class="content">
<h2>Tsearch2 - Introduction</h2>

<p><a href=
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
<p><a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
[Online version]</a> of this document is available.</p>

<p>The tsearch2 module is available to add as an extension to
@@ -38,13 +34,11 @@

<p>The README.tsearch2 file included in the contrib/tsearch2
directory contains a brief overview and history behind tsearch.
This can also be found online <a href=
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[right
This can also be found online <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/">[right
here]</a>.</p>

<p>Further in depth documentation such as a full function
reference, and user guide can be found online at the <a href=
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/">[tsearch
reference, and user guide can be found online at the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/">[tsearch
documentation home]</a>.</p>

<h3>ACKNOWLEDGEMENTS</h3>
@@ -105,11 +99,9 @@

<p>Step one is to download the tsearch V2 module :</p>

<p><a href=
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
<p><a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
(check Development History for latest stable version !)</p>
<pre>
tar -zxvf tsearch-v2.tar.gz
<pre> tar -zxvf tsearch-v2.tar.gz
mv tsearch2 PGSQL_SRC/contrib/
cd PGSQL_SRC/contrib/tsearch2
</pre>
@@ -121,18 +113,15 @@

<p>Then continue with the regular building and installation
process</p>
<pre>
gmake
<pre> gmake
gmake install
gmake installcheck
</pre>

<p>That is pretty much all you have to do, unless of course you
get errors. However if you get those, you better go check with
the mailing lists over at <a href=
"http://www.postgresql.org">http://www.postgresql.org</a> or
<a href=
"http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>
the mailing lists over at <a href="http://www.postgresql.org/">http://www.postgresql.org</a> or
<a href="http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>
since its never failed for me.</p>

<p>The directory in the contib/ and the directory from the
@@ -151,15 +140,13 @@
<p>We should create a database to use as an example for the
remainder of this file. We can call the database "ftstest". You
can create it from the command line like this:</p>
<pre>
#createdb ftstest
<pre> #createdb ftstest
</pre>

<p>If you thought installation was easy, this next bit is even
easier. Change to the PGSQL_SRC/contrib/tsearch2 directory and
type:</p>
<pre>
psql ftstest < tsearch2.sql
<pre> psql ftstest < tsearch2.sql
</pre>

<p>The file "tsearch2.sql" holds all the wonderful little
@@ -170,8 +157,7 @@
pg_ts_cfgmap are added.</p>

<p>You can check out the tables if you like:</p>
<pre>
#psql ftstest
<pre> #psql ftstest
ftstest=# \d
List of relations
Schema | Name | Type | Owner
@@ -188,8 +174,7 @@
<p>The first thing we can do is try out some of the types that
are provided for us. Lets look at the tsvector type provided
for us:</p>
<pre>
SELECT 'Our first string used today'::tsvector;
<pre> SELECT 'Our first string used today'::tsvector;
tsvector
---------------------------------------
'Our' 'used' 'first' 'today' 'string'
@@ -199,8 +184,7 @@
<p>The results are the words used within our string. Notice
they are not in any particular order. The tsvector type returns
a string of space separated words.</p>
<pre>
SELECT 'Our first string used today first string'::tsvector;
<pre> SELECT 'Our first string used today first string'::tsvector;
tsvector
-----------------------------------------------
'Our' 'used' 'again' 'first' 'today' 'string'
@@ -217,8 +201,7 @@
by the tsearch2 module.</p>

<p>The function to_tsvector has 3 possible signatures:</p>
<pre>
to_tsvector(oid, text);
<pre> to_tsvector(oid, text);
to_tsvector(text, text);
to_tsvector(text);
</pre>
@@ -228,8 +211,7 @@
the searchable text is broken up into words (Stemming process).
Right now we will specify the 'default' configuration. See the
section on TSEARCH2 CONFIGURATION to learn more about this.</p>
<pre>
SELECT to_tsvector('default',
<pre> SELECT to_tsvector('default',
'Our first string used today first string');
to_tsvector
--------------------------------------------
@@ -259,8 +241,7 @@
<p>If you want to view the output of the tsvector fields
without their positions, you can do so with the function
"strip(tsvector)".</p>
<pre>
SELECT strip(to_tsvector('default',
<pre> SELECT strip(to_tsvector('default',
'Our first string used today first string'));
strip
--------------------------------
@@ -270,8 +251,7 @@
<p>If you wish to know the number of unique words returned in
the tsvector you can do so by using the function
"length(tsvector)"</p>
<pre>
SELECT length(to_tsvector('default',
<pre> SELECT length(to_tsvector('default',
'Our first string used today first string'));
length
--------
@@ -282,15 +262,13 @@
<p>Lets take a look at the function to_tsquery. It also has 3
signatures which follow the same rational as the to_tsvector
function:</p>
<pre>
to_tsquery(oid, text);
<pre> to_tsquery(oid, text);
to_tsquery(text, text);
to_tsquery(text);
</pre>

<p>Lets try using the function with a single word :</p>
<pre>
SELECT to_tsquery('default', 'word');
<pre> SELECT to_tsquery('default', 'word');
to_tsquery
-----------
'word'
@@ -303,8 +281,7 @@

<p>Lets attempt to use the function with a string of multiple
words:</p>
<pre>
SELECT to_tsquery('default', 'this is many words');
<pre> SELECT to_tsquery('default', 'this is many words');
ERROR: Syntax error
</pre>

@@ -313,8 +290,7 @@
"tsquery" used for searching a tsvector field. What we need to
do is search for one to many words with some kind of logic (for
now simple boolean).</p>
<pre>
SELECT to_tsquery('default', 'searching|sentence');
<pre> SELECT to_tsquery('default', 'searching|sentence');
to_tsquery
----------------------
'search' | 'sentenc'
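<p>The same function understands the & (AND) and ! (NOT)
operators as well. As a rough illustration (the exact lexems you
get back depend on the dictionary in your configuration), a query
along these lines combines a stemmed word with a negated one:</p>
<pre> SELECT to_tsquery('default', 'searching & !quickly');
</pre>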
@@ -328,8 +304,7 @@
<p>You can not use words defined as being a stop word in your
configuration. The function will not fail ... you will just get
no result, and a NOTICE like this:</p>
<pre>
SELECT to_tsquery('default', 'a|is&not|!the');
<pre> SELECT to_tsquery('default', 'a|is&not|!the');
NOTICE: Query contains only stopword(s)
or doesn't contain lexem(s), ignored
to_tsquery
@@ -348,8 +323,7 @@
<p>The next stage is to add a full text index to an existing
table. In this example we already have a table defined as
follows:</p>
<pre>
CREATE TABLE tblMessages
<pre> CREATE TABLE tblMessages
(
intIndex int4,
strTopic varchar(100),
@@ -362,8 +336,7 @@
test strings for a topic, and a message. here is some test data
I inserted. (yes I know it's completely useless stuff ;-) but
it will serve our purpose right now).</p>
<pre>
INSERT INTO tblMessages
<pre> INSERT INTO tblMessages
VALUES ('1', 'Testing Topic', 'Testing message data input');
INSERT INTO tblMessages
VALUES ('2', 'Movie', 'Breakfast at Tiffany\'s');
@@ -400,8 +373,7 @@
<p>The next stage is to create a special text index which we
will use for FTI, so we can search our table of messages for
words or a phrase. We do this using the SQL command:</p>
<pre>
ALTER TABLE tblMessages ADD idxFTI tsvector;
<pre> ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
</pre>

<p>Note that unlike traditional indexes, this is actually a new
@@ -411,8 +383,7 @@

<p>The general rule for the initial insertion of data will
follow four steps:</p>
<pre>
1. update table
<pre> 1. update table
2. vacuum full analyze
3. create index
4. vacuum full analyze
@@ -426,8 +397,7 @@
the index has been created on the table, vacuum full analyze is
run again to update postgres's statistics (ie having the index
take effect).</p>
<pre>
UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
<pre> UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
VACUUM FULL ANALYZE;
</pre>

@@ -436,8 +406,7 @@
information stored, you should instead do the following, which
effectively concatenates the two fields into one before being
inserted into the table:</p>
<pre>
UPDATE tblMessages
<pre> UPDATE tblMessages
SET idxFTI=to_tsvector('default',coalesce(strTopic,'') ||' '|| coalesce(strMessage,''));
VACUUM FULL ANALYZE;
</pre>
@@ -451,8 +420,7 @@
Full Text INDEXINGi ;-)), so don't worry about any indexing
overhead. We will create an index based on the gist function.
GiST is an index structure for Generalized Search Tree.</p>
<pre>
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
<pre> CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
VACUUM FULL ANALYZE;
</pre>

@@ -464,15 +432,13 @@
<p>The last thing to do is set up a trigger so every time a row
in this table is changed, the text index is automatically
updated. This is easily done using:</p>
<pre>
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
<pre> CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, strMessage);
</pre>

<p>Or if you are indexing both strMessage and strTopic you
should instead do:</p>
<pre>
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
<pre> CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE
tsearch2(idxFTI, strTopic, strMessage);
</pre>
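<p>If you want to convince yourself the trigger is firing, an
insert followed by a quick select along these lines should show
idxFTI being filled in automatically (the values used here are
just an example):</p>
<pre> INSERT INTO tblMessages
   VALUES ('3', 'Trigger check', 'Checking the tsearch2 trigger');
 SELECT idxFTI FROM tblMessages WHERE intIndex = 3;
</pre>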
@@ -490,15 +456,13 @@
the tsearch2 function. Lets say we want to create a function to
remove certain characters (like the @ symbol from all
text).</p>
<pre>
CREATE FUNCTION dropatsymbol(text)
<pre> CREATE FUNCTION dropatsymbol(text)
RETURNS text AS 'select replace($1, \'@\', \' \');' LANGUAGE SQL;
</pre>

<p>Now we can use this function within the tsearch2 function on
the trigger.</p>
<pre>
DROP TRIGGER tsvectorupdate ON tblmessages;
<pre> DROP TRIGGER tsvectorupdate ON tblmessages;
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, dropatsymbol, strMessage);
INSERT INTO tblmessages VALUES (69, 'Attempt for dropatsymbol', 'Test@test.com');
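<p>You can also call the function on its own to check that the @
symbol really is replaced (assuming the dropatsymbol function
created above):</p>
<pre> SELECT dropatsymbol('Test@test.com');
  dropatsymbol
 ---------------
  Test test.com
 (1 row)
</pre>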
@@ -513,8 +477,7 @@
locale of the server. All you have to do is change your default
configuration, or add a new one for your specific locale. See
the section on TSEARCH2 CONFIGURATION.</p>
<pre class="real">
SELECT * FROM tblmessages WHERE intindex = 69;
<pre class="real"> SELECT * FROM tblmessages WHERE intindex = 69;

intindex | strtopic | strmessage | idxfti
----------+--------------------------+---------------+-----------------------
@@ -540,8 +503,7 @@ in the tsvector column.
<p>Lets search the indexed data for the word "Test". I indexed
based on the the concatenation of the strTopic, and the
strMessage:</p>
<pre>
SELECT intindex, strtopic FROM tblmessages
<pre> SELECT intindex, strtopic FROM tblmessages
WHERE idxfti @@ 'test'::tsquery;
intindex | strtopic
----------+---------------
@@ -553,8 +515,7 @@ in the tsvector column.
"Testing Topic". Notice that the word I search for was all
lowercase. Let's see what happens when I query for uppercase
"Test".</p>
<pre>
SELECT intindex, strtopic FROM tblmessages
<pre> SELECT intindex, strtopic FROM tblmessages
WHERE idxfti @@ 'Test'::tsquery;
intindex | strtopic
----------+----------
@@ -570,8 +531,7 @@ in the tsvector column.
<p>Most likely the best way to query the field is to use the
to_tsquery function on the right hand side of the @@ operator
like this:</p>
<pre>
SELECT intindex, strtopic FROM tblmessages
<pre> SELECT intindex, strtopic FROM tblmessages
WHERE idxfti @@ to_tsquery('default', 'Test | Zeppelin');
intindex | strtopic
----------+--------------------
@@ -592,8 +552,7 @@ in the tsvector column.
a way around which doesn't appear to have a significant impact
on query time, and that is to use a query such as the
following:</p>
<pre>
SELECT intindex, strTopic FROM tblmessages
<pre> SELECT intindex, strTopic FROM tblmessages
WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
AND strMessage ~* '.*men are created equal.*';
intindex | strtopic
@@ -626,8 +585,7 @@ in the tsvector column.
english stemming. We could edit the file
:'/usr/local/pgsql/share/english.stop' and add a word to the
list. I edited mine to exclude my name from indexing:</p>
<pre>
- Edit /usr/local/pgsql/share/english.stop
<pre> - Edit /usr/local/pgsql/share/english.stop
- Add 'andy' to the list
- Save the file.
</pre>
@@ -638,16 +596,14 @@ in the tsvector column.
connected to the DB while editing the stop words, you will need
to end the current session and re-connect. When you re-connect
to the database, 'andy' is no longer indexed:</p>
<pre>
SELECT to_tsvector('default', 'Andy');
<pre> SELECT to_tsvector('default', 'Andy');
to_tsvector
------------
(1 row)
</pre>

<p>Originally I would get the result :</p>
<pre>
SELECT to_tsvector('default', 'Andy');
<pre> SELECT to_tsvector('default', 'Andy');
to_tsvector
------------
'andi':1
@@ -660,8 +616,7 @@ in the tsvector column.
'simple', the results would be different. There are no stop
words for the simple dictionary. It will just convert to lower
case, and index every unique word.</p>
<pre>
SELECT to_tsvector('simple', 'Andy andy The the in out');
<pre> SELECT to_tsvector('simple', 'Andy andy The the in out');
to_tsvector
-------------------------------------
'in':5 'out':6 'the':3,4 'andy':1,2
@@ -672,8 +627,7 @@ in the tsvector column.
into the actual configuration of tsearch2. In the examples in
this document the configuration has always been specified when
using the tsearch2 functions:</p>
<pre>
SELECT to_tsvector('default', 'Testing the default config');
<pre> SELECT to_tsvector('default', 'Testing the default config');
SELECT to_tsvector('simple', 'Example of simple Config');
</pre>

@@ -682,8 +636,7 @@ in the tsvector column.
contains both the 'default' configurations based on the 'C'
locale. And the 'simple' configuration which is not based on
any locale.</p>
<pre>
SELECT * from pg_ts_cfg;
<pre> SELECT * from pg_ts_cfg;
ts_name | prs_name | locale
-----------------+----------+--------------
default | default | C
@@ -706,8 +659,7 @@ in the tsvector column.
configuration or just use one that already exists. If I do not
specify which configuration to use in the to_tsvector function,
I receive the following error.</p>
<pre>
SELECT to_tsvector('learning tsearch is like going to school');
<pre> SELECT to_tsvector('learning tsearch is like going to school');
ERROR: Can't find tsearch config by locale
</pre>

@@ -716,8 +668,7 @@ in the tsvector column.
into the pg_ts_cfg table. We will call the configuration
'default_english', with the default parser and use the locale
'en_US'.</p>
<pre>
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
<pre> INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
VALUES ('default_english', 'default', 'en_US');
</pre>

@@ -732,15 +683,14 @@ in the tsvector column.
tsearch2.sql</p>

<p>Lets take a first look at the pg_ts_dict table</p>
<pre>
ftstest=# \d pg_ts_dict
<pre> ftstest=# \d pg_ts_dict
Table "public.pg_ts_dict"
Column | Type | Modifiers
-----------------+---------+-----------
dict_name | text | not null
dict_init | oid |
dict_initoption | text |
dict_lemmatize | oid | not null
dict_lexize | oid | not null
dict_comment | text |
Indexes: pg_ts_dict_idx unique btree (dict_name)
</pre>
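<p>If you want to see which dictionaries are already registered,
a query along these lines will list them (only dict_name and
dict_comment are selected here to keep the output readable):</p>
<pre> SELECT dict_name, dict_comment FROM pg_ts_dict;
</pre>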
@@ -763,28 +713,57 @@ in the tsvector column.
ISpell. We will assume you have ISpell installed on you
machine. (in /usr/local/lib)</p>

<p>First lets register the dictionary(ies) to use from ISpell.
We will use the english dictionary from ISpell. We insert the
paths to the relevant ISpell dictionary (*.hash) and affixes
(*.aff) files. There seems to be some question as to which
ISpell files are to be used. I installed ISpell from the latest
sources on my computer. The installation installed the
dictionary files with an extension of *.hash. Some
installations install with an extension of *.dict As far as I
know the two extensions are equivilant. So *.hash ==
*.dict.</p>
<p>There has been some confusion in the past as to which files
are used from ISpell. ISpell operates using a hash file. This
is a binary file created by the ISpell command line utility
"buildhash". This utility accepts a file containing the words
from the dictionary, and the affixes file and the output is the
hash file. The default installation of ISPell installs the
english hash file english.hash, which is the exact same file as
american.hash. ISpell uses this as the fallback dictionary to
use.</p>

<p>We will also continue to use the english word stop file that
<p>This hash file is not what tsearch2 requires as the ISpell
interface. The file(s) needed are those used to create the
hash. Tsearch uses the dictionary words for morphology, so the
listing is needed not spellchecking. Regardless, these files
are included in the ISpell sources, and you can use them to
integrate into tsearch2. This is not complicated, but is not
very obvious to begin with. The tsearch2 ISpell interface needs
only the listing of dictionary words, it will parse and load
those words, and use the ISpell dictionary for lexem
processing.</p>

<p>I found the ISPell make system to be very finicky. Their
documentation actually states this to be the case. So I just
did things the command line way. In the ISpell source tree
under langauges/english there are several files in this
directory. For a complete description, please read the ISpell
README. Basically for the english dictionary there is the
option to create the small, medium, large and extra large
dictionaries. The medium dictionary is recommended. If the make
system is configured correctly, it would build and install the
english.has file from the medium size dictionary. Since we are
only concerned with the dictionary word listing ... it can be
created from the /languages/english directory with the
following command:</p>
<pre> sort -u -t/ +0f -1 +0 -T /usr/tmp -o english.med english.0 english.1
</pre>

<p>This will create a file called english.med. You can copy
this file to whever you like. I place mine in /usr/local/lib so
it coincides with the ISpell hash files. You can now add the
tsearch2 configuration entry for the ISpell english dictionary.
We will also continue to use the english word stop file that
was installed for the en_stem dictionary. You could use a
different one if you like. The ISpell configuration is based on
the "ispell_template" dictionary installed by default with
tsearch2. We will use the OIDs to the stored procedures from
the row where the dict_name = 'ispell_template'.</p>
<pre>
INSERT INTO pg_ts_dict
<pre> INSERT INTO pg_ts_dict
(SELECT 'en_ispell',
dict_init,
'DictFile="/usr/local/lib/english.hash",'
'DictFile="/usr/local/lib/english.med",'
'AffFile="/usr/local/lib/english.aff",'
'StopFile="/usr/local/pgsql/share/english.stop"',
dict_lexize
@@ -792,6 +771,50 @@ in the tsvector column.
WHERE dict_name = 'ispell_template');
</pre>

<p>Now that we have a dictionary we can specify it's use in a
query to get a lexem. For this we will use the lexize function.
The lexize function takes the name of the dictionary to use as
an argument. Just as the other tsearch2 functions operate.</p>
<pre> SELECT lexize('en_ispell', 'program');
lexize
-----------
{program}
(1 row)
</pre>

<p>If you wanted to always use the ISpell english dictionary
you have installed, you can configure tsearch2 to always use a
specific dictionary.</p>
<pre> SELCECT set_curdict('en_ispell');
</pre>

<p>Lexize is meant to turn a word into a lexem. It is possible
to receive more than one lexem returned for a single word.</p>
<pre> SELECT lexize('en_ispell', 'conditionally');
lexize
-----------------------------
{conditionally,conditional}
(1 row)
</pre>

<p>The lexize function is not meant to take a full string as an
argument to return lexems for. If you passed in an entire
sentence, it attempts to find that entire sentence in the
dictionary. SInce the dictionary contains only words, you will
receive an empty result set back.</p>
<pre> SELECT lexize('en_ispell', 'This is a senctece to lexize');
lexize
--------

(1 row)

If you parse a lexem from a word not in the dictionary, then you
will receive an empty result. This makes sense because the word
"tsearch" is not int the english dictionary. You can create your
own additions to the dictionary if you like. This may be useful
for scientific or technical glossaries that need to be indexed.
 SELECT lexize('en_ispell', 'tsearch');
 lexize
--------

(1 row)
</pre>

<p>This is not to say that tsearch will be ignored when adding
text information to the the tsvector index column. This will be
explained in greater detail with the table pg_ts_cfgmap.</p>

<p>Next we need to set up the configuration for mapping the
dictionay use to the lexxem parsings. This will be done by
altering the pg_ts_cfgmap table. We will insert several rows,
@@ -799,8 +822,7 @@ in the tsvector column.
configured for use within tsearch2. There are several type of
lexims we would be concerned with forcing the use of the ISpell
dictionary.</p>
<pre>
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
<pre> INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lhword', '{en_ispell,en_stem}');
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lpart_hword', '{en_ispell,en_stem}');
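<p>To double check the mappings inserted so far, you can query
the pg_ts_cfgmap table for the new configuration (assuming the
'default_english' configuration used above):</p>
<pre> SELECT * FROM pg_ts_cfgmap WHERE ts_name = 'default_english';
</pre>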
@@ -818,8 +840,7 @@ in the tsvector column.
<p>There are several other lexem types used that we do not need
to specify as using the ISpell dictionary. We can simply insert
values using the 'simple' stemming process dictionary.</p>
<pre>
INSERT INTO pg_ts_cfgmap
<pre> INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'url', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'host', '{simple}');
@@ -857,8 +878,7 @@ in the tsvector column.
complete. We have successfully created a new tsearch2
configuration. At the same time we have also set the new
configuration to be our default for en_US locale.</p>
<pre>
SELECT to_tsvector('default_english',
<pre> SELECT to_tsvector('default_english',
'learning tsearch is like going to school');
to_tsvector
--------------------------------------------------
@@ -870,12 +890,37 @@ in the tsvector column.
(1 row)
</pre>

<p>Notice here that words like "tsearch" are still parsed and
indexed in the tsvector column. There is a lexem returned for
the word becuase in the configuration mapping table, we specify
words to be used from the 'en_ispell' dictionary first, but as
a fallback to use the 'en_stem' dictionary. Therefore a lexem
is not returned from en_ispell, but is returned from en_stem,
and added to the tsvector.</p>
<pre> SELECT to_tsvector('learning tsearch is like going to computer school');
to_tsvector
---------------------------------------------------------------------------
'go':5 'like':4 'learn':1 'school':8 'compute':7 'tsearch':2 'computer':7
(1 row)
</pre>

<p>Notice in this last example I added the word "computer" to
the text to be converted into a tsvector. Because we have setup
our default configuration to use the ISpell english dictionary,
the words are lexized, and computer returns 2 lexems at the
same position. 'compute':7 and 'computer':7 are now both
indexed for the word computer.</p>

<p>You can create additional dictionarynlists, or use the extra
large dictionary from ISpell. You can read through the ISpell
documents, and source tree to make modifications as you see
fit.</p>

<p>In the case that you already have a configuration set for
the locale, and you are changing it to your new dictionary
configuration. You will have to set the old locale to NULL. If
we are using the 'C' locale then we would do this:</p>
<pre>
UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
<pre> UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
</pre>

<p>That about wraps up the configuration of tsearch2. There is
@@ -917,38 +962,32 @@ in the tsvector column.
<p>1) Backup any global database objects such as users and
groups (this step is usually only necessary when you will be
restoring to a virgin system)</p>
<pre>
pg_dumpall -g > GLOBALobjects.sql
<pre> pg_dumpall -g > GLOBALobjects.sql
</pre>

<p>2) Backup the full database schema using pg_dump</p>
<pre>
pg_dump -s DATABASE > DATABASEschema.sql
<pre> pg_dump -s DATABASE > DATABASEschema.sql
</pre>

<p>3) Backup the full database using pg_dump</p>
<pre>
pg_dump -Fc DATABASE > DATABASEdata.tar
<pre> pg_dump -Fc DATABASE > DATABASEdata.tar
</pre>

<p>To Restore a PostgreSQL database that uses the tsearch2
module:</p>

<p>1) Create the blank database</p>
<pre>
createdb DATABASE
<pre> createdb DATABASE
</pre>

<p>2) Restore any global database objects such as users and
groups (this step is usually only necessary when you will be
restoring to a virgin system)</p>
<pre>
psql DATABASE < GLOBALobjects.sql
<pre> psql DATABASE < GLOBALobjects.sql
</pre>

<p>3) Create the tsearch2 objects, functions and operators</p>
<pre>
psql DATABASE < tsearch2.sql
<pre> psql DATABASE < tsearch2.sql
</pre>

<p>4) Edit the backed up database schema and delete all SQL
@@ -957,13 +996,11 @@ in the tsvector column.
tsvector types. If your not sure what these are, they are the
ones listed in tsearch2.sql. Then restore the edited schema to
the database</p>
<pre>
psql DATABASE < DATABASEschema.sql
<pre> psql DATABASE < DATABASEschema.sql
</pre>

<p>5) Restore the data for the database</p>
<pre>
pg_restore -N -a -d DATABASE DATABASEdata.tar
<pre> pg_restore -N -a -d DATABASE DATABASEdata.tar
</pre>

<p>If you get any errors in step 4, it will most likely be
@@ -971,5 +1008,4 @@ in the tsvector column.
tsearch2.sql. Any errors in step 5 will mean the database
schema was probably restored wrongly.</p>
</div>
</body>
</html>
</body></html>