Remove spellfix virtual table documentation from the source tree.
Reference the separate documentation on the website instead. FossilOrigin-Name: adcf78909ff9064b6e3c4dd15ccd3245c8cf270b
This commit is contained in:
parent
c5797545de
commit
015db9c859
@ -1,114 +0,0 @@
|
|||||||
<title>The editdist3 algorithm</title>
|
|
||||||
|
|
||||||
The editdist3 algorithm is a function that computes the minimum edit distance
|
|
||||||
(a.k.a. the Levenshtein distance) between two input strings. Features of
|
|
||||||
editdist3 include:
|
|
||||||
|
|
||||||
* It works with Unicode (UTF-8) text.
|
|
||||||
|
|
||||||
* A table of insertion, deletion, and substitution costs can be
|
|
||||||
provided by the application.
|
|
||||||
|
|
||||||
* Multi-character insertions, deletions, and substitutions can be
|
|
||||||
enumerated in the cost table.
|
|
||||||
|
|
||||||
<h2>The COST table</h2>
|
|
||||||
|
|
||||||
To program the costs of editdist3, create a table such as the following:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
CREATE TABLE editcost(
|
|
||||||
iLang INT, -- The language ID
|
|
||||||
cFrom TEXT, -- Convert text from this
|
|
||||||
cTo TEXT, -- Convert text into this
|
|
||||||
iCost INT -- The cost of doing the conversion
|
|
||||||
);
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
The cost table can be named anything you want - it does not have to be called
|
|
||||||
"editcost". And the table can contain additional columns. However, the
|
|
||||||
table must contain the four columns shown above, with exactly the names shown.
|
|
||||||
|
|
||||||
The iLang column is a non-negative integer that identifies a set of costs
|
|
||||||
appropriate for a particular language. The editdist3 function will only use
|
|
||||||
a single iLang value for any given edit-distance computation. The default
|
|
||||||
value is 0. It is recommended that applications that only need to use a
|
|
||||||
single language always use iLang==0 for all entries.
|
|
||||||
|
|
||||||
The iCost column is the numeric cost of transforming cFrom into cTo. This
|
|
||||||
value should be a non-negative integer, and should probably be less than 100.
|
|
||||||
The default single-character insertion and deletion costs are 100 and the
|
|
||||||
default single-character to single-character substitution cost is 150. A
|
|
||||||
cost of 10000 or more is considered "infinite" and causes the rule to be
|
|
||||||
ignored.
|
|
||||||
|
|
||||||
The cFrom and cTo columns show edit transformation strings. Either or both
|
|
||||||
columns may contain more than one character. Or either column (but not both)
|
|
||||||
may hold an empty string. When cFrom is empty, that is the cost of inserting
|
|
||||||
cTo. When cTo is empty, that is the cost of deleting cFrom.
|
|
||||||
|
|
||||||
In the spellfix1 algorithm, cFrom is the text as the user entered it and
|
|
||||||
cTo is the correctly spelled text as it exists in the database. The goal
|
|
||||||
of the editdist3 algorithm is to determine how close the user-entered text is
|
|
||||||
to the dictionary text.
|
|
||||||
|
|
||||||
There are three special-case entries in the cost table:
|
|
||||||
|
|
||||||
<table border=1>
|
|
||||||
<tr><th>cFrom</th><th>cTo</th><th>Meaning</th></tr>
|
|
||||||
<tr><td>''</td><td>'?'</td><td>The default insertion cost</td></tr>
|
|
||||||
<tr><td>'?'</td><td>''</td><td>The default deletion cost</td></tr>
|
|
||||||
<tr><td>'?'</td><td>'?'</td><td>The default substitution cost</td></tr>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
If any of the special-case entries shown above are omitted, then the
|
|
||||||
value of 100 is used for insertion and deletion and 150 is used for
|
|
||||||
substitution. To disable the default insertion, deletion, and/or substitution
|
|
||||||
set their respective cost to 10000 or more.
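
The fallback behaviour described above can be sketched in Python with the standard sqlite3 module. This is only an illustration of how the three special-case rows might be resolved, not the code used by editdist3. The helper name load_defaults and its return convention (None for a disabled default) are assumptions made for this example.

```python
import sqlite3

DEFAULT_INSERT = 100      # used when the ('', '?') row is absent
DEFAULT_DELETE = 100      # used when the ('?', '') row is absent
DEFAULT_SUBSTITUTE = 150  # used when the ('?', '?') row is absent
INFINITE = 10000          # costs at or above this disable the rule


def load_defaults(db, table="editcost", lang=0):
    """Return (insert, delete, substitute) default costs; None means disabled."""
    costs = {("", "?"): DEFAULT_INSERT,
             ("?", ""): DEFAULT_DELETE,
             ("?", "?"): DEFAULT_SUBSTITUTE}
    # A table name cannot be bound as a parameter, so it is interpolated;
    # only pass trusted names here.
    rows = db.execute(
        f"SELECT cFrom, cTo, iCost FROM {table} WHERE iLang = ?", (lang,))
    for c_from, c_to, cost in rows:
        if (c_from, c_to) in costs:
            costs[(c_from, c_to)] = cost
    ordered = (costs[("", "?")], costs[("?", "")], costs[("?", "?")])
    return tuple(None if c >= INFINITE else c for c in ordered)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE editcost(iLang INT, cFrom TEXT, cTo TEXT, iCost INT)")
db.execute("INSERT INTO editcost VALUES (0, '', '?', 80)")      # cheaper insertion
db.execute("INSERT INTO editcost VALUES (0, '?', '?', 10000)")  # no substitution
print(load_defaults(db))  # prints (80, 100, None)
```

A language id with no special-case rows at all falls back to (100, 100, 150) in this sketch.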
|
|
||||||
|
|
||||||
Other entries in the cost table specify transforms for particular characters.
|
|
||||||
The cost of specific transforms should be less than the default costs, or else
|
|
||||||
the default costs will take precedence and the specific transforms will never
|
|
||||||
be used.
|
|
||||||
|
|
||||||
Some example cost table entries:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
INSERT INTO editcost(iLang, cFrom, cTo, iCost)
|
|
||||||
VALUES(0, 'a', 'ä', 5);
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
The rule above says that the letter "a" in user input can be matched against
|
|
||||||
the letter "ä" in the dictionary with a penalty of 5.
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
INSERT INTO editcost(iLang, cFrom, cTo, iCost)
|
|
||||||
VALUES(0, 'ss', 'ß', 8);
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
The number of characters in cFrom and cTo does not need to be the same. The
|
|
||||||
rule above says that "ss" on user input will match "ß" with a penalty of 8.
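
To make the effect of such rules concrete, here is a minimal Python sketch of a weighted edit distance that honours multi-character rules. It is a plain dynamic-programming illustration under the defaults stated above (100 for insertion and deletion, 150 for substitution), not the editdist3 implementation. The function name weighted_distance and the dictionary form of the rules are assumptions of this example.

```python
INFINITE = 10000  # rules with a cost this high are ignored, per the text


def weighted_distance(src, dst, rules, ins=100, dele=100, sub=150):
    """Cheapest cost of rewriting src into dst.

    rules maps (cFrom, cTo) -> iCost; either side may be multi-character
    or empty (but not both).
    """
    usable = [(f, t, c) for (f, t), c in rules.items()
              if c < INFINITE and (f or t)]
    n, m = len(src), len(dst)
    inf = float("inf")
    # best[i][j] = cheapest known cost of turning src[:i] into dst[:j]
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0

    def relax(i, j, cost):
        if i <= n and j <= m and cost < best[i][j]:
            best[i][j] = cost

    # Every step moves forward in src, dst, or both, so a row-major sweep
    # sees each cell only after all of its predecessors are final.
    for i in range(n + 1):
        for j in range(m + 1):
            here = best[i][j]
            if here == inf:
                continue
            if i < n and j < m and src[i] == dst[j]:
                relax(i + 1, j + 1, here)      # exact match is free
            relax(i + 1, j, here + dele)       # default deletion
            relax(i, j + 1, here + ins)        # default insertion
            relax(i + 1, j + 1, here + sub)    # default substitution
            for c_from, c_to, cost in usable:  # table-driven rules
                if src.startswith(c_from, i) and dst.startswith(c_to, j):
                    relax(i + len(c_from), j + len(c_to), here + cost)
    return best[n][m]


rules = {("a", "ä"): 5, ("ss", "ß"): 8}
print(weighted_distance("strasse", "straße", rules))  # prints 8
```

Because the sketch always takes the cheapest path, a specific rule that costs more than the default operations it replaces is never chosen, which matches the advice above to keep specific costs below the defaults.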
|
|
||||||
|
|
||||||
<h2>Experimenting with the editdist3() function</h2>
|
|
||||||
|
|
||||||
The [./spellfix1.wiki | spellfix1 virtual table]
|
|
||||||
uses editdist3 if the "edit_cost_table=TABLE" option
|
|
||||||
is specified as an argument when the spellfix1 virtual table is created.
|
|
||||||
But editdist3 can also be tested directly using the built-in "editdist3()"
|
|
||||||
SQL function. The editdist3() SQL function has 3 forms:
|
|
||||||
|
|
||||||
1. editdist3('TABLENAME');
|
|
||||||
2. editdist3('string1', 'string2');
|
|
||||||
3. editdist3('string1', 'string2', langid);
|
|
||||||
|
|
||||||
The first form loads the edit distance coefficients from a table called
|
|
||||||
'TABLENAME'. Any prior coefficients are discarded. So when experimenting
|
|
||||||
with weights and the weight table changes, simply rerun the single-argument
|
|
||||||
form of editdist3() to reload revised coefficients. Note that the
|
|
||||||
edit distance
|
|
||||||
weights used by the editdist3() SQL function are independent from the
|
|
||||||
weights used by the spellfix1 virtual table.
|
|
||||||
|
|
||||||
The second and third forms return the computed edit distance between strings
|
|
||||||
'string1' and 'string2'. In the second form, a language id of 0 is used.
|
|
||||||
The language id is specified in the third form.
|
|
@ -12,7 +12,7 @@
|
|||||||
**
|
**
|
||||||
** This module implements the spellfix1 VIRTUAL TABLE that can be used
|
** This module implements the spellfix1 VIRTUAL TABLE that can be used
|
||||||
** to search a large vocabulary for close matches. See separate
|
** to search a large vocabulary for close matches. See separate
|
||||||
** documentation files (spellfix1.wiki and editdist3.wiki) for details.
|
** documentation (http://www.sqlite.org/spellfix1.html) for details.
|
||||||
*/
|
*/
|
||||||
#include "sqlite3ext.h"
|
#include "sqlite3ext.h"
|
||||||
SQLITE_EXTENSION_INIT1
|
SQLITE_EXTENSION_INIT1
|
||||||
|
@ -1,464 +0,0 @@
|
|||||||
<title>The Spellfix1 Virtual Table</title>
|
|
||||||
|
|
||||||
The spellfix1 virtual table is used to search
|
|
||||||
a large vocabulary for close matches. For example, spellfix1
|
|
||||||
can be used to suggest corrections to misspelled words. Or,
|
|
||||||
it could be used with FTS4 to do full-text search using potentially
|
|
||||||
misspelled words.
|
|
||||||
|
|
||||||
Create an instance of the spellfix1 virtual table like this:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
CREATE VIRTUAL TABLE demo USING spellfix1;
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
The "spellfix1" term is the name of this module and must be entered as
|
|
||||||
shown. The "demo" term is the
|
|
||||||
name of the virtual table you will be creating and can be altered
|
|
||||||
to suit the needs of your application. The virtual table is initially
|
|
||||||
empty. In order for the virtual table to be useful, you will need to
|
|
||||||
populate it with your vocabulary. Suppose you
|
|
||||||
have a list of words in a table named "big_vocabulary". Then do this:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
INSERT INTO demo(word) SELECT word FROM big_vocabulary;
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
If you intend to use this virtual table in cooperation with an FTS4
|
|
||||||
table (for spelling correction of search terms) then you might extract
|
|
||||||
the vocabulary using an fts4aux table:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
You can also provide the virtual table with a "rank" for each word.
|
|
||||||
The "rank" is an estimate of how common the word is. Larger numbers
|
|
||||||
mean the word is more common. If you omit the rank when populating
|
|
||||||
the table, then a rank of 1 is assumed. But if you have rank
|
|
||||||
information, you can supply it and the virtual table will show a
|
|
||||||
slight preference for selecting more commonly used terms. To
|
|
||||||
populate the rank from an fts4aux table "search_aux" do something
|
|
||||||
like this:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
INSERT INTO demo(word,rank)
|
|
||||||
SELECT term, documents FROM search_aux WHERE col='*';
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
To query the virtual table, include a MATCH operator in the WHERE
|
|
||||||
clause. For example:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
SELECT word FROM demo WHERE word MATCH 'kennasaw';
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
Using a dataset of American place names (derived from
|
|
||||||
[http://geonames.usgs.gov/domestic/download_data.htm]) the query above
|
|
||||||
returns 20 results beginning with:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
kennesaw
|
|
||||||
kenosha
|
|
||||||
kenesaw
|
|
||||||
kenaga
|
|
||||||
keanak
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
If you append the character '*' to the end of the pattern, then
|
|
||||||
a prefix search is performed. For example:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
SELECT word FROM demo WHERE word MATCH 'kennes*';
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
Yields 20 results beginning with:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
kennesaw
|
|
||||||
kennestone
|
|
||||||
kenneson
|
|
||||||
kenneys
|
|
||||||
keanes
|
|
||||||
keenes
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
<h2>Search Refinements</h2>
|
|
||||||
|
|
||||||
By default, the spellfix1 table returns no more than 20 results.
|
|
||||||
(It might return fewer than 20 if there were fewer good matches.)
|
|
||||||
You can change the upper bound on the number of returned rows by
|
|
||||||
adding a "top=N" term to the WHERE clause of your query, where N
|
|
||||||
is the new maximum. For example, to see the 5 best matches:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
SELECT word FROM demo WHERE word MATCH 'kennes*' AND top=5;
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
Each entry in the spellfix1 virtual table is associated with a
|
|
||||||
particular language, identified by the integer "langid" column.
|
|
||||||
The default langid is 0 and if no other actions are taken, the
|
|
||||||
entire vocabulary is a part of the 0 language. But if your application
|
|
||||||
needs to operate in multiple languages, then you can specify different
|
|
||||||
vocabulary items for each language by specifying the langid field
|
|
||||||
when populating the table. For example:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
INSERT INTO demo(word,langid) SELECT word, 0 FROM en_vocabulary;
|
|
||||||
INSERT INTO demo(word,langid) SELECT word, 1 FROM de_vocabulary;
|
|
||||||
INSERT INTO demo(word,langid) SELECT word, 2 FROM fr_vocabulary;
|
|
||||||
INSERT INTO demo(word,langid) SELECT word, 3 FROM ru_vocabulary;
|
|
||||||
INSERT INTO demo(word,langid) SELECT word, 4 FROM cn_vocabulary;
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
After the virtual table has been populated with items from multiple
|
|
||||||
languages, specify the language of interest using a "langid=N" term
|
|
||||||
in the WHERE clause of the query:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
SELECT word FROM demo WHERE word MATCH 'hildes*' AND langid=1;
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
Note that if you do not include the "langid=N" term in the WHERE clause,
|
|
||||||
the search will be against language 0 (English in the example above).
|
|
||||||
All spellfix1 searches are against a single language id. There is no
|
|
||||||
way to search all languages at once.
|
|
||||||
|
|
||||||
|
|
||||||
<h2>Virtual Table Details</h2>
|
|
||||||
|
|
||||||
The virtual table actually has a unique rowid with seven columns plus five
|
|
||||||
extra hidden columns. The columns are as follows:
|
|
||||||
|
|
||||||
<blockquote><dl>
|
|
||||||
<dt><b>rowid</b><dd>
|
|
||||||
A unique integer number associated with each
|
|
||||||
vocabulary item in the table. This can be used
|
|
||||||
as a foreign key on other tables in the database.
|
|
||||||
|
|
||||||
<dt><b>word</b><dd>
|
|
||||||
The text of the word that matches the pattern.
|
|
||||||
Both word and pattern can contain Unicode characters
|
|
||||||
and can be mixed case.
|
|
||||||
|
|
||||||
<dt><b>rank</b><dd>
|
|
||||||
This is the rank of the word, as specified in the
|
|
||||||
original INSERT statement.
|
|
||||||
|
|
||||||
|
|
||||||
<dt><b>distance</b><dd>
|
|
||||||
This is an edit distance or Levenshtein distance going
|
|
||||||
from the pattern to the word.
|
|
||||||
|
|
||||||
<dt><b>langid</b><dd>
|
|
||||||
This is the language-id of the word. All queries are
|
|
||||||
against a single language-id, which defaults to 0.
|
|
||||||
For any given query this value is the same on all rows.
|
|
||||||
|
|
||||||
<dt><b>score</b><dd>
|
|
||||||
The score is a combination of rank and distance. The
|
|
||||||
idea is that a lower score is better. The virtual table
|
|
||||||
attempts to find words with the lowest score and
|
|
||||||
by default (unless overridden by ORDER BY) returns
|
|
||||||
results in order of increasing score.
|
|
||||||
|
|
||||||
<dt><b>matchlen</b><dd>
|
|
||||||
In a prefix search, the matchlen is the number of characters in
|
|
||||||
the string that match against the prefix. For a non-prefix search,
|
|
||||||
this is the same as length(word).
|
|
||||||
|
|
||||||
<dt><b>phonehash</b><dd>
|
|
||||||
This column shows the phonetic hash prefix that was used to restrict
|
|
||||||
the search. For any given query, this column should be the same for
|
|
||||||
every row. This information is available for diagnostic purposes and
|
|
||||||
is not normally considered useful in real applications.
|
|
||||||
|
|
||||||
<dt><b>top</b><dd>
|
|
||||||
(HIDDEN) For any query, this value is the same on all
|
|
||||||
rows. It is an integer which is the maximum number of
|
|
||||||
rows that will be output. The actual number of rows
|
|
||||||
output might be less than this number, but it will never
|
|
||||||
be greater. The default value for top is 20, but that
|
|
||||||
can be changed for each query by including a term of
|
|
||||||
the form "top=N" in the WHERE clause of the query.
|
|
||||||
|
|
||||||
<dt><b>scope</b><dd>
|
|
||||||
(HIDDEN) For any query, this value is the same on all
|
|
||||||
rows. The scope is a measure of how widely the virtual
|
|
||||||
table looks for matching words. Smaller values of
|
|
||||||
scope cause a broader search. The scope is normally
|
|
||||||
chosen automatically and is capped at 4. Applications
|
|
||||||
can change the scope by including a term of the form
|
|
||||||
"scope=N" in the WHERE clause of the query. Increasing
|
|
||||||
the scope will make the query run faster, but will reduce
|
|
||||||
the possible corrections.
|
|
||||||
|
|
||||||
<dt><b>srchcnt</b><dd>
|
|
||||||
(HIDDEN) For any query, this value is the same on all
|
|
||||||
rows. This value is an integer which is the number of
|
|
||||||
words examined using the edit-distance algorithm to
|
|
||||||
find the top matches that are ultimately displayed. This
|
|
||||||
value is for diagnostic use only.
|
|
||||||
|
|
||||||
<dt><b>soundslike</b><dd>
|
|
||||||
(HIDDEN) When inserting vocabulary entries, this field
|
|
||||||
can be set to a spelling that matches what the word
|
|
||||||
sounds like. See the DEALING WITH UNUSUAL AND DIFFICULT
|
|
||||||
SPELLINGS section below for details.
|
|
||||||
|
|
||||||
<dt><b>command</b><dd>
|
|
||||||
(HIDDEN) The value of the "command" column is always NULL. However,
|
|
||||||
applications can insert special strings into the "command" column in order
|
|
||||||
to provoke certain behaviors in the spellfix1 virtual table.
|
|
||||||
For example, inserting the string 'reset' into the "command" column
|
|
||||||
will cause the virtual table to reread its edit distance weights
|
|
||||||
(if there are any).
|
|
||||||
</dl></blockquote>
|
|
||||||
|
|
||||||
<h2>Algorithm</h2>
|
|
||||||
|
|
||||||
The spellfix1 virtual table creates a single
|
|
||||||
shadow table named "%_vocab" (where the % is replaced by the name of
|
|
||||||
the virtual table; Ex: "demo_vocab" for the "demo" virtual table).
|
|
||||||
The shadow table contains the following columns:
|
|
||||||
|
|
||||||
<blockquote><dl>
|
|
||||||
<dt><b>id</b><dd>
|
|
||||||
The unique id (INTEGER PRIMARY KEY)
|
|
||||||
|
|
||||||
<dt><b>rank</b><dd>
|
|
||||||
The rank of word.
|
|
||||||
|
|
||||||
<dt><b>langid</b><dd>
|
|
||||||
The language id for this entry.
|
|
||||||
|
|
||||||
<dt><b>word</b><dd>
|
|
||||||
The original UTF8 text of the vocabulary word
|
|
||||||
|
|
||||||
<dt><b>k1</b><dd>
|
|
||||||
The word transliterated into lower-case ASCII.
|
|
||||||
There is a standard table of mappings from non-ASCII
|
|
||||||
characters into ASCII. Examples: "æ" -> "ae",
|
|
||||||
"þ" -> "th", "ß" -> "ss", "á" -> "a", ... The
|
|
||||||
accessory function spellfix1_translit(X) will do
|
|
||||||
the non-ASCII to ASCII mapping. The built-in lower(X)
|
|
||||||
function will convert to lower-case. Thus:
|
|
||||||
k1 = lower(spellfix1_translit(word)).
|
|
||||||
|
|
||||||
<dt><b>k2</b><dd>
|
|
||||||
This field holds a phonetic code derived from k1. Letters
|
|
||||||
that have similar sounds are mapped into the same symbol.
|
|
||||||
For example, all vowels and vowel clusters become the
|
|
||||||
single symbol "A". And the letters "p", "b", "f", and
|
|
||||||
"v" all become "B". All nasal sounds are represented
|
|
||||||
as "N". And so forth. The mapping is based on
|
|
||||||
ideas found in Soundex, Metaphone, and other
|
|
||||||
long-standing phonetic matching systems. This key can
|
|
||||||
be generated by the function spellfix1_phonehash(X).
|
|
||||||
Hence: k2 = spellfix1_phonehash(k1)
|
|
||||||
</dl></blockquote>
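
The k1 derivation described above can be sketched in Python as follows. The table holds only the four mappings quoted in the text, so it is a toy stand-in for spellfix1_translit(X), whose real mapping table is much larger. The names TOY_TRANSLIT, toy_translit and k1 are hypothetical.

```python
# Only the four mappings quoted in the text; the real table is far larger.
TOY_TRANSLIT = {"æ": "ae", "þ": "th", "ß": "ss", "á": "a"}


def toy_translit(text):
    """Replace known non-ASCII letters with ASCII spellings, keeping case."""
    out = []
    for ch in text:
        rep = TOY_TRANSLIT.get(ch.lower())
        if rep is None:
            out.append(ch)  # unknown characters pass through unchanged
        else:
            out.append(rep if ch.islower() else rep.capitalize())
    return "".join(out)


def k1(word):
    """Mirror k1 = lower(spellfix1_translit(word)) using the toy table."""
    return toy_translit(word).lower()


print(k1("Straße"))  # prints strasse
```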
|
|
||||||
|
|
||||||
There is also a function for computing the Wagner edit distance or the
|
|
||||||
Levenshtein distance between a pattern and a word. This function
|
|
||||||
is exposed as spellfix1_editdist(X,Y). The edit distance function
|
|
||||||
returns the "cost" of converting X into Y. Some transformations
|
|
||||||
cost more than others. Changing one vowel into a different vowel,
|
|
||||||
for example, is relatively cheap, as is doubling a consonant, or
|
|
||||||
omitting the second character of a double consonant. Other transformations
|
|
||||||
are more expensive. The idea is that the edit distance function returns
|
|
||||||
a low cost for words that are similar and a higher cost for words
|
|
||||||
that are further apart. In this implementation, the maximum cost
|
|
||||||
of any single-character edit (delete, insert, or substitute) is 100,
|
|
||||||
with lower costs for some edits (such as transforming vowels).
|
|
||||||
|
|
||||||
The "score" for a comparison is the edit distance between the pattern
|
|
||||||
and the word, adjusted down by the base-2 logarithm of the word rank.
|
|
||||||
For example, a match with distance 100 but rank 1000 would have a
|
|
||||||
score of 122 (= 100 - log2(1000) + 32) whereas a match with distance
|
|
||||||
100 with a rank of 1 would have a score of 131 (100 - log2(1) + 32).
|
|
||||||
(NB: The constant 32 is added to each score to keep it from going
|
|
||||||
negative in case the edit distance is zero.) In this way, frequently
|
|
||||||
used words get a slightly lower cost which tends to move them toward
|
|
||||||
the top of the list of alternative spellings.
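
The scoring rule can be sketched as a small Python function. One assumption is needed to reproduce both worked figures: the logarithm is taken here as the bit length of the rank (1 gives 1, 1000 gives 10). With an exact log2, a rank of 1 would give 132 rather than 131, so treat this as one plausible reading of the text rather than a statement about the implementation.

```python
def score(distance, rank):
    """Edit distance adjusted down by an integer log of rank, plus 32."""
    # Assumption: the logarithm is the bit length of rank (1 -> 1,
    # 1000 -> 10), which is what makes the rank-1 example come out to 131.
    return distance + 32 - max(rank, 0).bit_length()


print(score(100, 1000))  # prints 122
print(score(100, 1))     # prints 131
```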
|
|
||||||
|
|
||||||
A straightforward implementation of a spelling corrector would be
|
|
||||||
to compare the search term against every word in the vocabulary
|
|
||||||
and select the 20 with the lowest scores. However, there will
|
|
||||||
typically be hundreds of thousands or millions of words in the
|
|
||||||
vocabulary, and so this approach is not fast enough.
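
For comparison, the straightforward approach might look like the Python sketch below. It uses a plain unit-cost Levenshtein distance instead of the weighted Wagner function and ignores rank, so only the shape of the computation carries over: one distance evaluation per vocabulary word on every query. The function names are hypothetical.

```python
import heapq


def levenshtein(a, b):
    """Unit-cost edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def brute_force(term, vocabulary, top=20):
    """Score every word, then keep the `top` closest ones."""
    return heapq.nsmallest(top, vocabulary, key=lambda w: levenshtein(term, w))


print(brute_force("kennasaw", ["atlanta", "kenosha", "kennesaw"], top=2))
# prints ['kennesaw', 'kenosha']
```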
|
|
||||||
|
|
||||||
Suppose the term that is being spell-corrected is X. To limit
|
|
||||||
the search space, X is converted to a k2-like key using the
|
|
||||||
equivalent of:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
key = spellfix1_phonehash(lower(spellfix1_translit(X)))
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
This key is then limited to "scope" characters. The default scope
|
|
||||||
value is 4, but an alternative scope can be specified using the
|
|
||||||
"scope=N" term in the WHERE clause. After the key has been truncated,
|
|
||||||
the edit distance is run against every term in the vocabulary that
|
|
||||||
has a k2 value that begins with the abbreviated key.
|
|
||||||
|
|
||||||
For example, suppose the input word is "Paskagula". The phonetic
|
|
||||||
key is "BACACALA" which is then truncated to 4 characters "BACA".
|
|
||||||
The edit distance is then run on the 4980 entries (out of
|
|
||||||
272,597 entries total) of the vocabulary whose k2 values begin with
|
|
||||||
BACA, yielding "Pascagoula" as the best match.
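
The narrowing step can be sketched in Python as below. The phonetic hash is a toy: it uses only the letter classes stated in the text (vowels, p/b/f/v, nasals) and those implied by the "Paskagula" example (s, k and g to "C", l to "L"). Mapping "c" to "C" and collapsing adjacent identical codes are assumptions of this sketch. It is not spellfix1_phonehash(X), and the sample vocabulary is invented.

```python
# Letter classes from the text plus those implied by "Paskagula" ->
# "BACACALA"; including "c" in the "C" class is an assumption.
CLASSES = {}
for letters, code in (("aeiou", "A"), ("pbfv", "B"), ("cgks", "C"),
                      ("l", "L"), ("mn", "N")):
    for letter in letters:
        CLASSES[letter] = code


def toy_phonehash(ascii_word):
    """Map letters to class codes, collapsing runs of the same code."""
    out = []
    for ch in ascii_word.lower():
        code = CLASSES.get(ch, ch.upper())  # unknown letters map to themselves
        if not out or out[-1] != code:
            out.append(code)
    return "".join(out)


def candidates(term, vocabulary, scope=4):
    """Keep only words whose phonetic key starts with the truncated key."""
    key = toy_phonehash(term)[:scope]
    return [w for w in vocabulary if toy_phonehash(w).startswith(key)]


print(toy_phonehash("Paskagula"))  # prints BACACALA
print(candidates("Paskagula", ["pascagoula", "biloxi", "pasadena"]))
# prints ['pascagoula', 'pasadena']
```

Only the surviving candidates would then be passed to the edit-distance function, which is where the saving over the brute-force scan comes from.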
|
|
||||||
|
|
||||||
Only terms of the vocabulary with a matching langid are searched.
|
|
||||||
Hence, the same table can contain entries from multiple languages
|
|
||||||
and only the requested language will be used. The default langid
|
|
||||||
is 0.
|
|
||||||
|
|
||||||
<h2>Configurable Edit Distance</h2>
|
|
||||||
|
|
||||||
The built-in Wagner edit-distance function with fixed weights can be
|
|
||||||
replaced by the [./editdist3.wiki | editdist3()] edit-distance function
|
|
||||||
with application-defined weights and support for unicode, by specifying
|
|
||||||
the "edit_cost_table=<i>TABLENAME</i>" parameter to the spellfix1 module
|
|
||||||
when the virtual table is created.
|
|
||||||
For example:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
CREATE VIRTUAL TABLE demo2 USING spellfix1(edit_cost_table=APPCOST);
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
In the example above, the APPCOST table would be interrogated to find
|
|
||||||
the edit distance coefficients. It is the presence of the "edit_cost_table="
|
|
||||||
parameter to the spellfix1 module name that causes editdist3() to be used
|
|
||||||
in place of the built-in edit distance function.
|
|
||||||
|
|
||||||
The edit distance coefficients are normally read from the APPCOST table
|
|
||||||
once and thereafter stored in memory. Hence, run-time changes to the
|
|
||||||
APPCOST table will not normally affect the edit distance results.
|
|
||||||
However, inserting the special string 'reset' into the "command" column of the
|
|
||||||
virtual table causes the edit distance coefficients to be reread from the
|
|
||||||
APPCOST table. Hence, applications should run a SQL statement similar
|
|
||||||
to the following when changes to the APPCOST table occur:
|
|
||||||
|
|
||||||
<blockquote>
|
|
||||||
INSERT INTO demo2(command) VALUES('reset');
|
|
||||||
</blockquote>
|
|
||||||
|
|
||||||
The tables used for edit distance costs can be changed using a command
|
|
||||||
like the following:
|
|
||||||
|
|
||||||
<blockquote>
|
|
||||||
INSERT INTO demo2(command) VALUES('edit_cost_table=APPCOST2');
|
|
||||||
</blockquote>
|
|
||||||
|
|
||||||
In the example above, any prior edit distance costs would be discarded and
|
|
||||||
all future queries would use the costs found in the APPCOST2 table. If the
|
|
||||||
name of the table specified by the "edit_cost_table" command is "NULL", then
|
|
||||||
the built-in Wagner edit-distance function will be used instead of the
|
|
||||||
editdist3() function in all future queries.
|
|
||||||
|
|
||||||
<h2>Dealing With Unusual And Difficult Spellings</h2>
|
|
||||||
|
|
||||||
The algorithm above works quite well for most cases, but there are
|
|
||||||
exceptions. These exceptions can be dealt with by making additional
|
|
||||||
entries in the virtual table using the "soundslike" column.
|
|
||||||
|
|
||||||
For example, many words of Greek origin begin with letters "ps" where
|
|
||||||
the "p" is silent. Ex: psalm, pseudonym, psoriasis, psyche. In
|
|
||||||
another example, many Scottish surnames can be spelled with an
|
|
||||||
initial "Mac" or "Mc". Thus, "MacKay" and "McKay" are both pronounced
|
|
||||||
the same.
|
|
||||||
|
|
||||||
Accommodation can be made for words that are not spelled as they
|
|
||||||
sound by making additional entries into the virtual table for the
|
|
||||||
same word, but adding an alternative spelling in the "soundslike"
|
|
||||||
column. For example, the canonical entry for "psalm" would be this:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
INSERT INTO demo(word) VALUES('psalm');
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
To enhance the ability to correct the spelling of "salm" into
|
|
||||||
"psalm", make an additional entry like this:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
INSERT INTO demo(word,soundslike) VALUES('psalm','salm');
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
It is ok to make multiple entries for the same word as long as
|
|
||||||
each entry has a different soundslike value. Note that if no
|
|
||||||
soundslike value is specified, the soundslike defaults to the word
|
|
||||||
itself.
|
|
||||||
|
|
||||||
Listed below are some cases where it might make sense to add additional
|
|
||||||
soundslike entries. The specific entries will depend on the application
|
|
||||||
and the target language.
|
|
||||||
|
|
||||||
* Silent "p" in words beginning with "ps": psalm, psyche
|
|
||||||
|
|
||||||
* Silent "p" in words beginning with "pn": pneumonia, pneumatic
|
|
||||||
|
|
||||||
* Silent "p" in words beginning with "pt": pterodactyl, ptolemaic
|
|
||||||
|
|
||||||
* Silent "d" in words beginning with "dj": djinn, Djikarta
|
|
||||||
|
|
||||||
* Silent "k" in words beginning with "kn": knight, Knuthson
|
|
||||||
|
|
||||||
* Silent "g" in words beginning with "gn": gnarly, gnome, gnat
|
|
||||||
|
|
||||||
* "Mac" versus "Mc" beginning Scottish surnames
|
|
||||||
|
|
||||||
* "Tch" sounds in Slavic words: Tchaikovsky vs. Chaykovsky
|
|
||||||
|
|
||||||
* The letter "j" pronounced like "h" in Spanish: LaJolla
|
|
||||||
|
|
||||||
* Words beginning with "wr" versus "r": write vs. rite
|
|
||||||
|
|
||||||
* Miscellaneous problem words such as "debt", "tsetse",
|
|
||||||
"Nguyen", "Van Nuyes".
|
|
||||||
|
|
||||||
<h2>Auxiliary Functions</h2>
|
|
||||||
|
|
||||||
The source code module that implements the spellfix1 virtual table also
|
|
||||||
implements several SQL functions that might be useful to applications
|
|
||||||
that employ spellfix1 or for testing or diagnostic work while developing
|
|
||||||
applications that use spellfix1. The following auxiliary functions are
|
|
||||||
available:
|
|
||||||
|
|
||||||
<blockquote><dl>
|
|
||||||
<dt><b>editdist3(P,W)<br>editdist3(P,W,L)<br>editdist3(T)</b><dd>
|
|
||||||
These routines provide direct access to the version of the Wagner
|
|
||||||
edit-distance function that allows for application-defined weights
|
|
||||||
on edit operations. The first two forms of this function compare
|
|
||||||
pattern P against word W and return the edit distance. In the first
|
|
||||||
function, the langid is assumed to be 0 and in the second, the
|
|
||||||
langid is given by the L parameter. The third form of this function
|
|
||||||
reloads edit distance coefficients from the table named by T.
|
|
||||||
|
|
||||||
<dt><b>spellfix1_editdist(P,W)</b><dd>
|
|
||||||
This routine provides access to the built-in Wagner edit-distance
|
|
||||||
function that uses default, fixed costs. The value returned is
|
|
||||||
the edit distance needed to transform W into P.
|
|
||||||
|
|
||||||
<dt><b>spellfix1_phonehash(X)</b><dd>
|
|
||||||
This routine constructs a phonetic hash of the pure ASCII input word X
|
|
||||||
and returns that hash. This routine is used internally by spellfix1 in
|
|
||||||
order to transform the K1 column of the shadow table into the K2
|
|
||||||
column.
|
|
||||||
|
|
||||||
<dt><b>spellfix1_scriptcode(X)</b><dd>
|
|
||||||
Given an input string X, this routine attempts to determine the dominant
|
|
||||||
script of that input and returns the ISO-15924 numeric code for that
|
|
||||||
script. The current implementation understands the following scripts:
|
|
||||||
<ul>
|
|
||||||
<li> 215 - Latin
|
|
||||||
<li> 220 - Cyrillic
|
|
||||||
<li> 200 - Greek
|
|
||||||
</ul>
|
|
||||||
Additional language codes might be added in future releases.
|
|
||||||
|
|
||||||
<dt><b>spellfix1_translit(X)</b><dd>
|
|
||||||
This routine transliterates Unicode text into pure ASCII, returning
|
|
||||||
the pure ASCII representation of the input text X. This is the function
|
|
||||||
that is used internally to transform vocabulary words into the K1
|
|
||||||
column of the shadow table.
|
|
||||||
|
|
||||||
</dl></blockquote>
|
|
14
manifest
@ -1,5 +1,5 @@
|
|||||||
C Untested\sfix\sfor\sbuilding\son\sVxWorks.
|
C Remove\sspellfix\svirtual\stable\sdocumentation\sfrom\sthe\ssource\stree.\nReference\sthe\sseparate\sdocumentation\son\sthe\swebsite\sinstead.
|
||||||
D 2013-04-27T12:13:29.526
|
D 2013-04-27T18:06:40.561
|
||||||
F Makefile.arm-wince-mingw32ce-gcc d6df77f1f48d690bd73162294bbba7f59507c72f
|
F Makefile.arm-wince-mingw32ce-gcc d6df77f1f48d690bd73162294bbba7f59507c72f
|
||||||
F Makefile.in ce81671efd6223d19d4c8c6b88ac2c4134427111
|
F Makefile.in ce81671efd6223d19d4c8c6b88ac2c4134427111
|
||||||
F Makefile.linux-gcc 91d710bdc4998cb015f39edf3cb314ec4f4d7e23
|
F Makefile.linux-gcc 91d710bdc4998cb015f39edf3cb314ec4f4d7e23
|
||||||
@ -85,13 +85,11 @@ F ext/icu/icu.c eb9ae1d79046bd7871aa97ee6da51eb770134b5a
|
|||||||
F ext/icu/sqliteicu.h 728867a802baa5a96de7495e9689a8e01715ef37
|
F ext/icu/sqliteicu.h 728867a802baa5a96de7495e9689a8e01715ef37
|
||||||
F ext/misc/amatch.c 3369b2b544066e620d986f0085d039c77d1ef17f
|
F ext/misc/amatch.c 3369b2b544066e620d986f0085d039c77d1ef17f
|
||||||
F ext/misc/closure.c fec0c8537c69843e0b7631d500a14c0527962cd6
|
F ext/misc/closure.c fec0c8537c69843e0b7631d500a14c0527962cd6
|
||||||
F ext/misc/editdist3.wiki 06100a0c558921a563cbc40e0d0151902b1eef6d
|
|
||||||
F ext/misc/fuzzer.c fb64a15af978ae73fa9075b9b1dfbe82b8defc6f
|
F ext/misc/fuzzer.c fb64a15af978ae73fa9075b9b1dfbe82b8defc6f
|
||||||
F ext/misc/ieee754.c 2565ce373d842977efe0922dc50b8a41b3289556
|
F ext/misc/ieee754.c 2565ce373d842977efe0922dc50b8a41b3289556
|
||||||
F ext/misc/nextchar.c 1131e2b36116ffc6fe6b2e3464bfdace27978b1e
|
F ext/misc/nextchar.c 1131e2b36116ffc6fe6b2e3464bfdace27978b1e
|
||||||
F ext/misc/regexp.c c25c65fe775f5d9801fb8573e36ebe73f2c0c2e0
|
F ext/misc/regexp.c c25c65fe775f5d9801fb8573e36ebe73f2c0c2e0
|
||||||
F ext/misc/spellfix.c e323eebb877d735bc64404c16a6d758ab17a0b7a
|
F ext/misc/spellfix.c f9d24a2b2617cee143b7841b453e4e1fd8f189cc
|
||||||
F ext/misc/spellfix1.wiki dd1830444c14cf0f54dd680cc044df2ace2e9d09
|
|
||||||
F ext/misc/wholenumber.c ce362368b9381ea48cbd951ade8df867eeeab014
|
F ext/misc/wholenumber.c ce362368b9381ea48cbd951ade8df867eeeab014
|
||||||
F ext/rtree/README 6315c0d73ebf0ec40dedb5aa0e942bc8b54e3761
|
F ext/rtree/README 6315c0d73ebf0ec40dedb5aa0e942bc8b54e3761
|
||||||
F ext/rtree/rtree.c 757abea591d4ff67c0ff4e8f9776aeda86b18c14
|
F ext/rtree/rtree.c 757abea591d4ff67c0ff4e8f9776aeda86b18c14
|
||||||
@ -1062,7 +1060,7 @@ F tool/vdbe-compress.tcl f12c884766bd14277f4fcedcae07078011717381
|
|||||||
F tool/warnings-clang.sh f6aa929dc20ef1f856af04a730772f59283631d4
|
F tool/warnings-clang.sh f6aa929dc20ef1f856af04a730772f59283631d4
|
||||||
F tool/warnings.sh fbc018d67fd7395f440c28f33ef0f94420226381
|
F tool/warnings.sh fbc018d67fd7395f440c28f33ef0f94420226381
|
||||||
F tool/win/sqlite.vsix 97894c2790eda7b5bce3cc79cb2a8ec2fde9b3ac
|
F tool/win/sqlite.vsix 97894c2790eda7b5bce3cc79cb2a8ec2fde9b3ac
|
||||||
P 7a97226ffe174349e7113340f5354c4e44bd9738
|
P f14d55cf358b0392d3b8cd61dc85f43a610a8edf
|
||||||
R 45f16684eb908e3d81d1d0d938ab9d11
|
R 1d8237b86dffdd5327eb650dcb291120
|
||||||
U drh
|
U drh
|
||||||
Z 3089ee3240913ca0c76196e0917eb228
|
Z 369cfa66c298e6ca3c71b08895115463
|
||||||
|
@ -1 +1 @@
|
|||||||
f14d55cf358b0392d3b8cd61dc85f43a610a8edf
|
adcf78909ff9064b6e3c4dd15ccd3245c8cf270b
|