Some more tsearch docs work --- sync names with CVS-tip reality, some
minor rewording, some markup fixups. Lots left to do here ...
This commit is contained in:
parent a13cefafb1
commit 52a0830c40
@@ -210,9 +210,9 @@ SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::ts
 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
 </programlisting>

-Each lexeme position also can be labeled as <literal>'A'</literal>,
-<literal>'B'</literal>, <literal>'C'</literal>, <literal>'D'</literal>,
-where <literal>'D'</literal> is the default. These labels can be used to group
+Each lexeme position also can be labeled as <literal>A</literal>,
+<literal>B</literal>, <literal>C</literal>, <literal>D</literal>,
+where <literal>D</literal> is the default. These labels can be used to group
 lexemes into different <emphasis>importance</emphasis> or
 <emphasis>rankings</emphasis>, for example to reflect document structure.
 Actual values can be assigned at search time and used during the calculation
@@ -668,9 +668,9 @@ setweight(<replaceable class="PARAMETER">vector</replaceable> TSVECTOR, <replace
 <listitem>
 <para>
 This function returns a copy of the input vector in which every location
-has been labeled with either the letter <literal>'A'</literal>,
-<literal>'B'</literal>, or <literal>'C'</literal>, or the default label
-<literal>'D'</literal> (which is the default for new vectors
+has been labeled with either the letter <literal>A</literal>,
+<literal>B</literal>, or <literal>C</literal>, or the default label
+<literal>D</literal> (which is the default for new vectors
 and as such is usually not displayed). These labels are retained
 when vectors are concatenated, allowing words from different parts of a
 document to be weighted differently by ranking functions.
@@ -807,13 +807,12 @@ to be made.

 <varlistentry>
 <indexterm zone="textsearch-tsvector">
-<primary>stat</primary>
+<primary>ts_stat</primary>
 </indexterm>

 <term>
 <synopsis>
-stat(<optional><replaceable class="PARAMETER">sqlquery</replaceable> text </optional>, <optional>weight text </optional>) returns SETOF statinfo
-<!-- TODO I guess that not both of the arguments are optional? -->
+ts_stat(<replaceable class="PARAMETER">sqlquery</replaceable> text <optional>, <replaceable class="PARAMETER">weights</replaceable> text </optional>) returns SETOF statinfo
 </synopsis>
 </term>

@@ -821,27 +820,27 @@ stat(<optional><replaceable class="PARAMETER">sqlquery</replaceable> text </opti
 <para>
 Here <type>statinfo</type> is a type, defined as:
 <programlisting>
-CREATE TYPE statinfo AS (word text, ndoc int4, nentry int4);
+CREATE TYPE statinfo AS (word text, ndoc integer, nentry integer);
 </programlisting>
-and <replaceable>sqlquery</replaceable> is a query which returns a
-<type>tsvector</type> column's contents. <function>stat</> returns
-statistics about a <type>tsvector</type> column, i.e., the number of
-documents, <literal>ndoc</>, and the total number of words in the
-collection, <literal>nentry</>. It is useful for checking your
-configuration and to find stop word candidates. For example, to find
-the ten most frequent words:
+and <replaceable>sqlquery</replaceable> is a text value containing a SQL query
+which returns a single <type>tsvector</type> column. <function>ts_stat</>
+executes the query and returns statistics about the resulting
+<type>tsvector</type> data, i.e., the number of documents, <literal>ndoc</>,
+and the total number of words in the collection, <literal>nentry</>. It is
+useful for checking your configuration and to find stop word candidates. For
+example, to find the ten most frequent words:

 <programlisting>
-SELECT * FROM stat('SELECT vector from apod')
+SELECT * FROM ts_stat('SELECT vector from apod')
 ORDER BY ndoc DESC, nentry DESC, word
 LIMIT 10;
 </programlisting>

-Optionally, one can specify <replaceable>weight</replaceable> to obtain
+Optionally, one can specify <replaceable>weights</replaceable> to obtain
 statistics about words with a specific <replaceable>weight</replaceable>:

 <programlisting>
-SELECT * FROM stat('SELECT vector FROM apod','a')
+SELECT * FROM ts_stat('SELECT vector FROM apod','a')
 ORDER BY ndoc DESC, nentry DESC, word
 LIMIT 10;
 </programlisting>
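The renamed <function>ts_stat</> can be tried end to end. The sketch below is not part of the commit: the `apod` table the examples query is never defined in this chapter, so a minimal hypothetical stand-in with a `tsvector` column named `vector` is created here first.

```sql
-- Hypothetical stand-in for the "apod" table used by the examples above;
-- only the column name "vector" is taken from the documentation.
CREATE TABLE apod (id serial PRIMARY KEY, body text, vector tsvector);

INSERT INTO apod (body, vector)
SELECT d, to_tsvector(d)
FROM (VALUES ('a fat cat sat on a mat'),
             ('a fat rat ate a fat rat')) AS v(d);

-- ts_stat executes the query text it is given and aggregates
-- per-word statistics (ndoc, nentry) over the returned tsvectors.
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY ndoc DESC, nentry DESC, word
LIMIT 10;
```

Words appearing in both rows (e.g. `fat`) should sort first, since they have the highest `ndoc`.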
@@ -1146,9 +1145,9 @@ topic.
 </para>

 <para>
-The <function>rewrite()</function> function changes the original query by
+The <function>ts_rewrite()</function> function changes the original query by
 replacing part of the query with some other string of type <type>tsquery</type>,
-as defined by the rewrite rule. Arguments to <function>rewrite()</function>
+as defined by the rewrite rule. Arguments to <function>ts_rewrite()</function>
 can be names of columns of type <type>tsquery</type>.
 </para>

@@ -1161,20 +1160,20 @@ INSERT INTO aliases VALUES('a', 'c');
 <varlistentry>

 <indexterm zone="textsearch-tsquery">
-<primary>rewrite - 1</primary>
+<primary>ts_rewrite</primary>
 </indexterm>

 <term>
 <synopsis>
-rewrite (<replaceable class="PARAMETER">query</replaceable> TSQUERY, <replaceable class="PARAMETER">target</replaceable> TSQUERY, <replaceable class="PARAMETER">sample</replaceable> TSQUERY) returns TSQUERY
+ts_rewrite (<replaceable class="PARAMETER">query</replaceable> TSQUERY, <replaceable class="PARAMETER">target</replaceable> TSQUERY, <replaceable class="PARAMETER">sample</replaceable> TSQUERY) returns TSQUERY
 </synopsis>
 </term>

 <listitem>
 <para>
 <programlisting>
-SELECT rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
-  rewrite
+SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
+ ts_rewrite
 -----------
  'b' & 'c'
 </programlisting>
@@ -1184,21 +1183,17 @@ SELECT rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);

 <varlistentry>

-<indexterm zone="textsearch-tsquery">
-<primary>rewrite - 2</primary>
-</indexterm>
-
 <term>
 <synopsis>
-rewrite(ARRAY[<replaceable class="PARAMETER">query</replaceable> TSQUERY, <replaceable class="PARAMETER">target</replaceable> TSQUERY, <replaceable class="PARAMETER">sample</replaceable> TSQUERY]) returns TSQUERY
+ts_rewrite(ARRAY[<replaceable class="PARAMETER">query</replaceable> TSQUERY, <replaceable class="PARAMETER">target</replaceable> TSQUERY, <replaceable class="PARAMETER">sample</replaceable> TSQUERY]) returns TSQUERY
 </synopsis>
 </term>

 <listitem>
 <para>
 <programlisting>
-SELECT rewrite(ARRAY['a & b'::tsquery, t,s]) FROM aliases;
-  rewrite
+SELECT ts_rewrite(ARRAY['a & b'::tsquery, t,s]) FROM aliases;
+ ts_rewrite
 -----------
  'b' & 'c'
 </programlisting>
@@ -1208,21 +1203,17 @@ SELECT rewrite(ARRAY['a & b'::tsquery, t,s]) FROM aliases;

 <varlistentry>

-<indexterm zone="textsearch-tsquery">
-<primary>rewrite - 3</primary>
-</indexterm>
-
 <term>
 <synopsis>
-rewrite (<replaceable class="PARAMETER">query</> TSQUERY,<literal>'SELECT target ,sample FROM test'</literal>::text) returns TSQUERY
+ts_rewrite (<replaceable class="PARAMETER">query</> TSQUERY,<literal>'SELECT target ,sample FROM test'</literal>::text) returns TSQUERY
 </synopsis>
 </term>

 <listitem>
 <para>
 <programlisting>
-SELECT rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
-  rewrite
+SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
+ ts_rewrite
 -----------
  'b' & 'c'
 </programlisting>
@@ -1246,12 +1237,12 @@ SELECT * FROM aliases;
 </programlisting>
 This ambiguity can be resolved by specifying a sort order:
 <programlisting>
-SELECT rewrite('a & b', 'SELECT t, s FROM aliases ORDER BY t DESC');
- rewrite
+SELECT ts_rewrite('a & b', 'SELECT t, s FROM aliases ORDER BY t DESC');
+ ts_rewrite
 ---------
  'cc'
-SELECT rewrite('a & b', 'SELECT t, s FROM aliases ORDER BY t ASC');
-  rewrite
+SELECT ts_rewrite('a & b', 'SELECT t, s FROM aliases ORDER BY t ASC');
+ ts_rewrite
 -----------
  'b' & 'c'
 </programlisting>
@@ -1263,7 +1254,7 @@ Let's consider a real-life astronomical example. We'll expand query
 <programlisting>
 CREATE TABLE aliases (t tsquery primary key, s tsquery);
 INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));
-SELECT rewrite(to_tsquery('supernovae'), 'SELECT * FROM aliases') && to_tsquery('crab');
+SELECT ts_rewrite(to_tsquery('supernovae'), 'SELECT * FROM aliases') && to_tsquery('crab');
             ?column?
 ---------------------------------
  ( 'supernova' | 'sn' ) & 'crab'
@@ -1271,7 +1262,7 @@ SELECT rewrite(to_tsquery('supernovae'), 'SELECT * FROM aliases') && to
 Notice, that we can change the rewriting rule online<!-- TODO maybe use another word for "online"? -->:
 <programlisting>
 UPDATE aliases SET s=to_tsquery('supernovae|sn & !nebulae') WHERE t=to_tsquery('supernovae');
-SELECT rewrite(to_tsquery('supernovae'), 'SELECT * FROM aliases') && to_tsquery('crab');
+SELECT ts_rewrite(to_tsquery('supernovae'), 'SELECT * FROM aliases') && to_tsquery('crab');
                   ?column?
 ---------------------------------------------
  ( 'supernova' | 'sn' & !'nebula' ) & 'crab'
@@ -1288,10 +1279,10 @@ for a possible hit. To filter out obvious non-candidate rules there are containm
 operators for the <type>tsquery</type> type. In the example below, we select only those
 rules which might contain the original query:
 <programlisting>
-SELECT rewrite(ARRAY['a & b'::tsquery, t,s])
+SELECT ts_rewrite(ARRAY['a & b'::tsquery, t,s])
 FROM aliases
 WHERE 'a & b' @> t;
-  rewrite
+ ts_rewrite
 -----------
  'b' & 'c'
 </programlisting>
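The three `ts_rewrite` forms in this commit can all be exercised against the two-column `aliases` rule table that the hunk context (`INSERT INTO aliases VALUES('a', 'c');`) presumes. The sketch below reconstructs that setup; it is illustrative scaffolding, not part of the commit:

```sql
-- Rule table presumed by the rewrite examples: each row maps a target
-- tsquery (t) to its replacement (s).
CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
INSERT INTO aliases VALUES ('a', 'c');

-- Direct three-argument form: rewrite 'a' to 'c' inside 'a & b'.
SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);

-- Table-driven form, pre-filtered with the @> containment operator so
-- only rules that might match the query are aggregated.
SELECT ts_rewrite(ARRAY['a & b'::tsquery, t, s])
FROM aliases
WHERE 'a & b' @> t;
```

Both queries should produce the rewritten query `'b' & 'c'`, matching the output shown in the diff.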
@@ -1525,7 +1516,7 @@ SELECT * FROM ts_parse('default','123 - a number');

 <varlistentry>
 <indexterm zone="textsearch-parser">
-<primary>token_type</primary>
+<primary>ts_token_type</primary>
 </indexterm>

 <term>
@@ -1894,11 +1885,13 @@ configuration <replaceable>config_name</replaceable><!-- TODO I don't get this -
 <title>Dictionaries</title>

 <para>
-Dictionaries are used to specify words that should not be considered in
-a search and for the normalization of words to allow the user to use any
-derived form of a word in a query. Also, normalization can reduce the size of
-<type>tsvector</type>. Normalization does not always have linguistic
-meaning and usually depends on application semantics.
+Dictionaries are used to eliminate words that should not be considered in a
+search (<firstterm>stop words</>), and to <firstterm>normalize</> words so
+that different derived forms of the same word will match. Aside from
+improving search quality, normalization and removal of stop words reduce the
+size of the <type>tsvector</type> representation of a document, thereby
+improving performance. Normalization does not always have linguistic meaning
+and usually depends on application semantics.
 </para>

 <para>
@@ -1954,10 +1947,6 @@ a void array if the dictionary knows the lexeme, but it is a stop word
 <literal>NULL</literal> if the dictionary does not recognize the input lexeme
 </para></listitem>
 </itemizedlist>
-
-<emphasis>WARNING:</emphasis>
-Data files used by dictionaries should be in the <varname>server_encoding</varname>
-so all encodings are consistent across databases.
 </para>

 <para>
@@ -1987,7 +1976,8 @@ recognizes everything. For example, for an astronomy-specific search
 terms, a general English dictionary and a <application>snowball</> English
 stemmer:
 <programlisting>
-ALTER TEXT SEARCH CONFIGURATION astro_en ADD MAPPING FOR lword WITH astrosyn, en_ispell, en_stem;
+ALTER TEXT SEARCH CONFIGURATION astro_en
+    ADD MAPPING FOR lword WITH astrosyn, english_ispell, english_stem;
 </programlisting>
 </para>

@@ -1995,7 +1985,7 @@ ALTER TEXT SEARCH CONFIGURATION astro_en ADD MAPPING FOR lword WITH astrosyn, en
 Function <function>ts_lexize</function> can be used to test dictionaries,
 for example:
 <programlisting>
-SELECT ts_lexize('en_stem', 'stars');
+SELECT ts_lexize('english_stem', 'stars');
  ts_lexize
 -----------
  {star}
@@ -2068,6 +2058,15 @@ SELECT ts_lexize('public.simple_dict','The');
 </programlisting>
 </para>

+<caution>
+<para>
+Most types of dictionaries rely on configuration files, such as files of stop
+words. These files <emphasis>must</> be stored in UTF-8 encoding. They will
+be translated to the actual database encoding, if that is different, when they
+are read into the server.
+</para>
+</caution>
+
 </sect2>

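The hunk context above tests a dictionary named `public.simple_dict`, whose definition falls outside this diff. A plausible definition, sketched here under the assumption that it uses the built-in `simple` template with the stock English stop-word file, would be:

```sql
-- Sketch of a dictionary like the public.simple_dict tested above;
-- the TEMPLATE and StopWords values are assumptions, not from the commit.
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    StopWords = english
);

-- 'The' is an English stop word, so an empty array is expected;
-- a non-stop word is simply lowercased.
SELECT ts_lexize('public.simple_dict', 'The');
SELECT ts_lexize('public.simple_dict', 'Stars');
```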
@@ -2080,23 +2079,25 @@ word with a synonym. Phrases are not supported (use the thesaurus
 dictionary (<xref linkend="textsearch-thesaurus">) for that). A synonym
 dictionary can be used to overcome linguistic problems, for example, to
 prevent an English stemmer dictionary from reducing the word 'Paris' to
-'pari'. In that case, it is enough to have a <literal>Paris
-paris</literal> line in the synonym dictionary and put it before the
-<literal>en_stem</> dictionary:
+'pari'. It is enough to have a <literal>Paris paris</literal> line in the
+synonym dictionary and put it before the <literal>english_stem</> dictionary:
 <programlisting>
 SELECT * FROM ts_debug('english','Paris');
- Alias | Description | Token | Dictionaries |  Lexized token
--------+-------------+-------+--------------+-----------------
- lword | Latin word  | Paris | {english}    | english: {pari}
+ Alias | Description | Token |  Dictionaries  |    Lexized token
+-------+-------------+-------+----------------+----------------------
+ lword | Latin word  | Paris | {english_stem} | english_stem: {pari}
 (1 row)

+CREATE TEXT SEARCH DICTIONARY synonym
+    (TEMPLATE = synonym, SYNONYMS = my_synonyms);
+
 ALTER TEXT SEARCH CONFIGURATION english
-    ADD MAPPING FOR lword WITH synonym, en_stem;
+    ALTER MAPPING FOR lword WITH synonym, english_stem;

 SELECT * FROM ts_debug('english','Paris');
- Alias | Description | Token |   Dictionaries    |  Lexized token
--------+-------------+-------+-------------------+------------------
- lword | Latin word  | Paris | {synonym,en_stem} | synonym: {paris}
+ Alias | Description | Token |      Dictionaries      |  Lexized token
+-------+-------------+-------+------------------------+------------------
+ lword | Latin word  | Paris | {synonym,english_stem} | synonym: {paris}
 (1 row)
 </programlisting>
 </para>
@@ -2119,25 +2120,27 @@ preferred term and, optionally, preserves them for indexing. Thesauruses
 are used during indexing so any change in the thesaurus <emphasis>requires</emphasis>
 reindexing. The current implementation of the thesaurus
 dictionary is an extension of the synonym dictionary with added
-<emphasis>phrase</emphasis> support. A thesaurus is a plain file of the
-following format:
+<emphasis>phrase</emphasis> support. A thesaurus dictionary requires
+a configuration file of the following format:
 <programlisting>
 # this is a comment
 sample word(s) : indexed word(s)
 ...............................
 more sample word(s) : more indexed word(s)
 ...
 </programlisting>
-where the colon (<symbol>:</symbol>) symbol acts as a delimiter.
+where the colon (<symbol>:</symbol>) symbol acts as a delimiter between
+a phrase and its replacement.
 </para>

 <para>
 A thesaurus dictionary uses a <emphasis>subdictionary</emphasis> (which
-should be defined in the full text configuration) to normalize the
-thesaurus text. It is only possible to define one dictionary. Notice that
-the <emphasis>subdictionary</emphasis> will produce an error if it can
-not recognize a word. In that case, you should remove the definition of
-the word or teach the <emphasis>subdictionary</emphasis> to about it.
-Use an asterisk (<symbol>*</symbol>) at the beginning of an indexed word to
-skip the subdictionary. It is still required that sample words are known.
+is defined in the dictionary's configuration) to normalize the input text
+before checking for phrase matches. It is only possible to select one
+subdictionary. An error is reported if the subdictionary fails to
+recognize a word. In that case, you should remove the use of the word or teach
+the subdictionary about it. Use an asterisk (<symbol>*</symbol>) at the
+beginning of an indexed word to skip the subdictionary. It is still required
+that sample words are known.
 </para>

 <para>
@@ -2149,16 +2152,16 @@ Stop words recognized by the subdictionary are replaced by a 'stop word
 placeholder' to record their position. To break possible ties the thesaurus
 uses the last definition. To illustrate this, consider a thesaurus (with
 a <parameter>simple</parameter> subdictionary) with pattern
-<literal>'swsw'</>, where <literal>'s'</> designates any stop word and
-<literal>'w'</>, any known word:
+<replaceable>swsw</>, where <replaceable>s</> designates any stop word and
+<replaceable>w</>, any known word:
 <programlisting>
 a one the two : swsw
 the one a two : swsw2
 </programlisting>
-Words <literal>'a'</> and <literal>'the'</> are stop words defined in the
-configuration of a subdictionary. The thesaurus considers <literal>'the
-one the two'</literal> and <literal>'that one then two'</literal> as equal
-and will use definition 'swsw2'.
+Words <literal>a</> and <literal>the</> are stop words defined in the
+configuration of a subdictionary. The thesaurus considers <literal>the
+one the two</literal> and <literal>that one then two</literal> as equal
+and will use definition <replaceable>swsw2</>.
 </para>

 <para>
@@ -2186,7 +2189,7 @@ For example:
 CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
     TEMPLATE = thesaurus,
     DictFile = mythesaurus,
-    Dictionary = pg_catalog.en_stem
+    Dictionary = pg_catalog.english_stem
 );
 </programlisting>
 Here:
@@ -2201,10 +2204,10 @@ where <literal>$SHAREDIR</> means the installation shared-data directory,
 often <filename>/usr/local/share</>).
 </para></listitem>
 <listitem><para>
-<literal>pg_catalog.en_stem</literal> is the dictionary (snowball
-English stemmer) to use for thesaurus normalization. Notice that the
-<literal>en_stem</> dictionary has its own configuration (for example,
-stop words).
+<literal>pg_catalog.english_stem</literal> is the dictionary (Snowball
+English stemmer) to use for thesaurus normalization. Notice that the
+<literal>english_stem</> dictionary has its own configuration (for example,
+stop words), which is not shown here.
 </para></listitem>
 </itemizedlist>

@@ -2235,10 +2238,10 @@ an astronomical thesaurus and english stemmer:
 CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
     TEMPLATE = thesaurus,
     DictFile = thesaurus_astro,
-    Dictionary = en_stem
+    Dictionary = english_stem
 );
 ALTER TEXT SEARCH CONFIGURATION russian
-    ADD MAPPING FOR lword, lhword, lpart_hword WITH thesaurus_astro, en_stem;
+    ADD MAPPING FOR lword, lhword, lpart_hword WITH thesaurus_astro, english_stem;
 </programlisting>
 Now we can see how it works. Note that <function>ts_lexize</function> cannot
 be used for testing the thesaurus (see description of
@@ -2266,7 +2269,7 @@ SELECT to_tsquery('''supernova star''');
 </programlisting>
 Notice that <literal>supernova star</literal> matches <literal>supernovae
 stars</literal> in <literal>thesaurus_astro</literal> because we specified the
-<literal>en_stem</literal> stemmer in the thesaurus definition.
+<literal>english_stem</literal> stemmer in the thesaurus definition.
 </para>
 <para>
 To keep an original phrase in full text indexing just add it to the right part
@@ -2308,15 +2311,15 @@ conjugations of the search term <literal>bank</literal>, e.g.
 <literal>banking</>, <literal>banked</>, <literal>banks</>,
 <literal>banks'</>, and <literal>bank's</>.
 <programlisting>
-SELECT ts_lexize('en_ispell','banking');
+SELECT ts_lexize('english_ispell','banking');
  ts_lexize
 -----------
  {bank}
-SELECT ts_lexize('en_ispell','bank''s');
+SELECT ts_lexize('english_ispell','bank''s');
  ts_lexize
 -----------
  {bank}
-SELECT ts_lexize('en_ispell','banked');
+SELECT ts_lexize('english_ispell','banked');
  ts_lexize
 -----------
  {bank}
@@ -2330,7 +2333,7 @@ To create an ispell dictionary one should use the built-in
 parameters.
 </para>
 <programlisting>
-CREATE TEXT SEARCH DICTIONARY en_ispell (
+CREATE TEXT SEARCH DICTIONARY english_ispell (
     TEMPLATE = ispell,
     DictFile = english,
     AffFile = english,
@@ -2386,13 +2389,13 @@ The <application>Snowball</> dictionary template is based on the project
 of Martin Porter, inventor of the popular Porter's stemming algorithm
 for the English language and now supported in many languages (see the <ulink
 url="http://snowball.tartarus.org">Snowball site</ulink> for more
-information). Full text searching contains a large number of stemmers for
+information). The Snowball project supplies a large number of stemmers for
 many languages. A Snowball dictionary requires a language parameter to
 identify which stemmer to use, and optionally can specify a stopword file name.
-For example,
+For example, there is a built-in definition equivalent to
 <programlisting>
-ALTER TEXT SEARCH DICTIONARY en_stem (
-    StopWords = english-utf8, Language = english
+CREATE TEXT SEARCH DICTIONARY english_stem (
+    TEMPLATE = snowball, Language = english, StopWords = english
 );
 </programlisting>
 </para>
@@ -2400,7 +2403,8 @@ ALTER TEXT SEARCH DICTIONARY en_stem (
 <para>
 The <application>Snowball</> dictionary recognizes everything, so it is best
 to place it at the end of the dictionary stack. It it useless to have it
-before any other dictionary because a lexeme will not pass through its stemmer.
+before any other dictionary because a lexeme will never pass through it to
+the next dictionary.
 </para>

 </sect2>
@@ -2420,7 +2424,7 @@ The <function>ts_lexize</> function facilitates dictionary testing:

 <term>
 <synopsis>
-ts_lexize(<optional> <replaceable class="PARAMETER">dict_name</replaceable> text</optional>, <replaceable class="PARAMETER">lexeme</replaceable> text) returns text[]
+ts_lexize(<replaceable class="PARAMETER">dict_name</replaceable> text, <replaceable class="PARAMETER">lexeme</replaceable> text) returns text[]
 </synopsis>
 </term>

@@ -2432,11 +2436,11 @@ array if the lexeme is known to the dictionary but it is a stop word, or
 <literal>NULL</literal> if it is an unknown word.
 </para>
 <programlisting>
-SELECT ts_lexize('en_stem', 'stars');
+SELECT ts_lexize('english_stem', 'stars');
  ts_lexize
 -----------
  {star}
-SELECT ts_lexize('en_stem', 'a');
+SELECT ts_lexize('english_stem', 'a');
  ts_lexize
 -----------
  {}
@@ -2457,9 +2461,9 @@ SELECT ts_lexize('thesaurus_astro','supernovae stars') is null;
 ----------
  t
 </programlisting>
-Thesaurus dictionary <literal>thesaurus_astro</literal> does know
-<literal>supernovae stars</literal>, but ts_lexize fails since it does not
-parse the input text and considers it as a single lexeme. Use
+The thesaurus dictionary <literal>thesaurus_astro</literal> does know
+<literal>supernovae stars</literal>, but <function>ts_lexize</> fails since it
+does not parse the input text and considers it as a single lexeme. Use
 <function>plainto_tsquery</> and <function>to_tsvector</> to test thesaurus
 dictionaries:
 <programlisting>
@@ -2541,25 +2545,14 @@ CREATE TEXT SEARCH DICTIONARY pg_dict (

 <para>
 Then register the <productname>ispell</> dictionary
-<literal>en_ispell</literal> using the <literal>ispell</literal> template:
+<literal>english_ispell</literal> using the <literal>ispell</literal> template:

 <programlisting>
-CREATE TEXT SEARCH DICTIONARY en_ispell (
+CREATE TEXT SEARCH DICTIONARY english_ispell (
     TEMPLATE = ispell,
-    DictFile = english-utf8,
-    AffFile = english-utf8,
-    StopWords = english-utf8
+    DictFile = english,
+    AffFile = english,
+    StopWords = english
 );
 </programlisting>
 </para>
-
-<para>
-We can use the same stop word list for the <application>Snowball</> stemmer
-<literal>en_stem</literal>, which is available by default:
-
-<programlisting>
-ALTER TEXT SEARCH DICTIONARY en_stem (
-    StopWords = english-utf8
-);
-</programlisting>
-</para>
@@ -2570,7 +2563,7 @@ Now modify mappings for Latin words for configuration <literal>pg</>:
 <programlisting>
 ALTER TEXT SEARCH CONFIGURATION pg
     ALTER MAPPING FOR lword, lhword, lpart_hword
-    WITH pg_dict, en_ispell, en_stem;
+    WITH pg_dict, english_ispell, english_stem;
 </programlisting>
 </para>

@@ -2759,10 +2752,10 @@ the transitive containment relation <!-- huh --> is realized by
 superimposed coding (Knuth, 1973) of signatures, i.e., a parent is the
 result of 'OR'-ing the bit-strings of all children. This is a second
 factor of lossiness. It is clear that parents tend to be full of
-<literal>'1'</>s (degenerates) and become quite useless because of the
+<literal>1</>s (degenerates) and become quite useless because of the
 limited selectivity. Searching is performed as a bit comparison of a
 signature representing the query and an <literal>RD-tree</literal> entry.
-If all <literal>'1'</>s of both signatures are in the same position we
+If all <literal>1</>s of both signatures are in the same position we
 say that this branch probably matches the query, but if there is even one
 discrepancy we can definitely reject this branch.
 </para>
@@ -2870,13 +2863,15 @@ The current limitations of Full Text Searching are:

 <para>
 For comparison, the <productname>PostgreSQL</productname> 8.1 documentation
-consists of 10,441 unique words, a total of 335,420 words, and the most frequent word
-'postgresql' is mentioned 6,127 times in 655 documents.
+contained 10,441 unique words, a total of 335,420 words, and the most frequent
+word <quote>postgresql</> was mentioned 6,127 times in 655 documents.
 </para>

+<!-- TODO we need to put a date on these numbers? -->
 <para>
-Another example - the <productname>PostgreSQL</productname> mailing list archives
-consists of 910,989 unique words with 57,491,343 lexemes in 461,020 messages.
+Another example — the <productname>PostgreSQL</productname> mailing list
+archives contained 910,989 unique words with 57,491,343 lexemes in 461,020
+messages.
 </para>

 </sect1>
@@ -2942,28 +2937,27 @@ names and object names. The following examples illustrate this:
 => \dF+ russian
     Configuration "pg_catalog.russian"
 Parser name: "pg_catalog.default"
-Locale: 'ru_RU.UTF-8' (default)
     Token     |      Dictionaries
 --------------+-------------------------
  email        | pg_catalog.simple
  file         | pg_catalog.simple
  float        | pg_catalog.simple
  host         | pg_catalog.simple
- hword        | pg_catalog.ru_stem_utf8
+ hword        | pg_catalog.russian_stem
  int          | pg_catalog.simple
  lhword       | public.tz_simple
  lpart_hword  | public.tz_simple
  lword        | public.tz_simple
- nlhword      | pg_catalog.ru_stem_utf8
- nlpart_hword | pg_catalog.ru_stem_utf8
- nlword       | pg_catalog.ru_stem_utf8
+ nlhword      | pg_catalog.russian_stem
+ nlpart_hword | pg_catalog.russian_stem
+ nlword       | pg_catalog.russian_stem
 part_hword   | pg_catalog.simple
 sfloat       | pg_catalog.simple
 uint         | pg_catalog.simple
 uri          | pg_catalog.simple
 url          | pg_catalog.simple
 version      | pg_catalog.simple
- word         | pg_catalog.ru_stem_utf8
+ word         | pg_catalog.russian_stem
 </programlisting>
 </para>
 </listitem>
@@ -3112,43 +3106,43 @@ play with the standard <literal>english</literal> configuration.
 <programlisting>
 CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );

-CREATE TEXT SEARCH DICTIONARY en_ispell (
+CREATE TEXT SEARCH DICTIONARY english_ispell (
     TEMPLATE = ispell,
-    DictFile = english-utf8,
-    AffFile = english-utf8,
+    DictFile = english,
+    AffFile = english,
     StopWords = english
 );

 ALTER TEXT SEARCH CONFIGURATION public.english
-    ALTER MAPPING FOR lword WITH en_ispell, en_stem;
+    ALTER MAPPING FOR lword WITH english_ispell, english_stem;
 </programlisting>

 <programlisting>
 SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
  Alias |  Description  |    Token    |              Dicts list               |          Lexized token
 -------+---------------+-------------+---------------------------------------+---------------------------------
- lword | Latin word    | The         | {public.en_ispell,pg_catalog.en_stem} | public.en_ispell: {}
+ lword | Latin word    | The         | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {}
  blank | Space symbols |             |                                       |
- lword | Latin word    | Brightest   | {public.en_ispell,pg_catalog.en_stem} | public.en_ispell: {bright}
+ lword | Latin word    | Brightest   | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {bright}
  blank | Space symbols |             |                                       |
- lword | Latin word    | supernovaes | {public.en_ispell,pg_catalog.en_stem} | pg_catalog.en_stem: {supernova}
+ lword | Latin word    | supernovaes | {public.english_ispell,pg_catalog.english_stem} | pg_catalog.english_stem: {supernova}
 (5 rows)
 </programlisting>
 <para>
-In this example, the word <literal>'Brightest'</> was recognized by a
+In this example, the word <literal>Brightest</> was recognized by a
 parser as a <literal>Latin word</literal> (alias <literal>lword</literal>)
-and came through the dictionaries <literal>public.en_ispell</> and
-<literal>pg_catalog.en_stem</literal>. It was recognized by
-<literal>public.en_ispell</literal>, which reduced it to the noun
+and came through the dictionaries <literal>public.english_ispell</> and
+<literal>pg_catalog.english_stem</literal>. It was recognized by
+<literal>public.english_ispell</literal>, which reduced it to the noun
 <literal>bright</literal>. The word <literal>supernovaes</literal> is unknown
-by the <literal>public.en_ispell</literal> dictionary so it was passed to
+by the <literal>public.english_ispell</literal> dictionary so it was passed to
 the next dictionary, and, fortunately, was recognized (in fact,
-<literal>public.en_stem</literal> is a stemming dictionary and recognizes
+<literal>public.english_stem</literal> is a stemming dictionary and recognizes
 everything; that is why it was placed at the end of the dictionary stack).
 </para>

 <para>
-The word <literal>The</literal> was recognized by <literal>public.en_ispell</literal>
+The word <literal>The</literal> was recognized by <literal>public.english_ispell</literal>
 dictionary as a stop word (<xref linkend="textsearch-stopwords">) and will not be indexed.
 </para>

@@ -3159,11 +3153,11 @@ SELECT "Alias", "Token", "Lexized token"
 FROM ts_debug('public.english','The Brightest supernovaes');
  Alias |    Token    |          Lexized token
 -------+-------------+---------------------------------
- lword | The         | public.en_ispell: {}
+ lword | The         | public.english_ispell: {}
  blank |             |
- lword | Brightest   | public.en_ispell: {bright}
+ lword | Brightest   | public.english_ispell: {bright}
  blank |             |
- lword | supernovaes | pg_catalog.en_stem: {supernova}
+ lword | supernovaes | pg_catalog.english_stem: {supernova}
 (5 rows)
 </programlisting>
 </para>