26 Commits

Author SHA1 Message Date
dan
b651084713 Add tests to restore coverage of fts5_tokenizer.c.
FossilOrigin-Name: 8f9257361b05e368bf433e56d0698923b0f97d12e7c0ad7760aaab6746c0e467
2024-08-17 17:22:49 +00:00
dan
ec8962869a Update mkunicode.tcl to match the change erroneously made to machine generated file fts5_unicode2.c in [b7b7bde9].
FossilOrigin-Name: 326d579d777fdede6bc64f9525248767f4730de4e50260b0387e614a9d006416
2020-11-26 20:13:54 +00:00
mistachkin
065f3bf4f2 Fix various harmless compiler warnings seen with MSVC.
FossilOrigin-Name: 1c0fe5b5763fe5cbace9773dcdab742e126d0bd035ab13d61f9d134afa0afc0c
2019-03-20 05:45:03 +00:00
drh
8fc4a11c94 Fix harmless compiler warnings in the unicode2 logic of FTS3 and FTS5.
FossilOrigin-Name: 703029ac6d24860230a8c30fcbf5e7e1da619e84f1cc9b9e65ebc74879a184d2
2019-01-02 23:49:47 +00:00
dan
b163b57212 Fix problems in fts5 found by ASAN.
FossilOrigin-Name: c564bf870106faef297594a51995619c80311d06bd5f8a0c7644f666f22ba576
2018-12-28 07:37:22 +00:00
drh
f8c2fea195 Remove the unused sqlite3Fts5UnicodeNCat() function.
FossilOrigin-Name: 7149dacf1d440a19f62808b4591c3fa8da202b2ec742d5490a63f2ec005ff9e7
2018-12-03 17:40:46 +00:00
dan
e89feee5c3 Add the "remove_diacritics=2" option to the unicode61 tokenizer in both FTS5 and FTS3/4.
FossilOrigin-Name: 06177f3f114b5d804b84c27ac843740282e2176fdf0f7a999feda0e1b624adec
2018-12-03 16:14:49 +00:00
dan
b80bb6ce88 Add the "categories" option to the unicode61 tokenizer in fts5.
FossilOrigin-Name: 80d2b9e635e3100f90cffdcffa5b5038da6fbbfccc9f5777c59a4ae760d4cb62
2018-07-13 19:52:43 +00:00
dan
920c83f18f Fix some problems in fts3 found by address-sanitizer.
FossilOrigin-Name: 16a8e84fa7f67a467f824bdd7f72cbd6a6e95dab8cc7aa1e0e751720b98f3e31
2017-03-20 18:53:32 +00:00
dan
53ff9c2972 Fix a potential buffer overread provoked by invalid utf-8 in fts5.
FossilOrigin-Name: a049fbbde5da2e43d41aa8c2b41f9eb21507ac76
2016-02-12 18:48:09 +00:00
dan
3f09beda45 Remove "#ifdef SQLITE_ENABLE_FTS5" from individual fts5 source files. Add a single "#if !defined(SQLITE_CORE) || defined(SQLITE_ENABLE_FTS5)" to fts5.c.
FossilOrigin-Name: 7819002ed85497bbd0f9cf4d39df641573324436
2015-07-02 15:52:21 +00:00
dan
2e7d35e2fe Avoid making redundant copies of position-lists within the fts5 code.
FossilOrigin-Name: 5165de548b84825cb000d33e5d3de12b0ef112c0
2015-05-23 15:43:05 +00:00
dan
21b7d2a9b8 Improve test coverage of fts5_unicode2.c.
FossilOrigin-Name: fea8a4db9d8c7b9a946017a0dc984cbca6ce240e
2015-05-22 06:08:25 +00:00
dan
57fec54b53 Fix some problems with building fts5 and fts3 together using the amalgamation.
FossilOrigin-Name: fb10bbb9f9c4481e6043d323a3018a4ec68eb0ff
2015-02-02 11:32:20 +00:00
dan
37db72f1f7 Merge latest trunk changes with this branch.
FossilOrigin-Name: 4b3651677e7132c4c45605bc1f216fc08ef31198
2015-01-01 18:03:49 +00:00
dan
6024772ba2 Add a version of the unicode61 tokenizer to fts5.
FossilOrigin-Name: d09f7800cf14f73ea86d037107ef80295b2c173a
2015-01-01 16:46:10 +00:00
drh
858b638d1f A couple more harmless compiler warnings eliminated.
FossilOrigin-Name: bcf6d775f90f4d1ba018a1b965f2f710df130f01
2014-08-06 18:50:51 +00:00
drh
e8f2c9dc71 Fix two more harmless compiler warnings. Make sure the fts3_unicode2.c file is in sync with mkunicode.tcl.
FossilOrigin-Name: a2a60307ea68a3230952a56cb65369ba0a208967
2014-08-06 17:49:13 +00:00
dan
f2c9229f73 Up until now the fts4 "unicode61" tokenizer has treated all private use codepoints except the first and last of each of the three ranges as alphanumeric (eligible to be part of tokens). This commit fixes this so that all private use codepoints are considered alphanumeric. In other words, it fixes the handling of codepoints 0xE000, 0xF8FF, 0xF0000, 0xFFFFD, 0x100000 and 0x10FFFD.
FossilOrigin-Name: 6cfd9af5250029c0d275be027b4208c48954a8a1
2013-06-05 16:17:21 +00:00
dan
754d3adf7c Have the FTS unicode61 tokenizer strip out diacritics when tokenizing text. This can be disabled by specifying the tokenizer option "remove_diacritics=0".
FossilOrigin-Name: 790f76a5898dad1a955d40edddf11f7b0fec0ccd
2012-06-06 19:30:38 +00:00
drh
a9cfaba95a Omit the fts3 unicode character class routines from the build if fts3/4 is disabled.
FossilOrigin-Name: c00bb5d4601efc15933f222349e96a043b610a19
2012-05-28 12:22:00 +00:00
dan
7946c53009 If SQLITE_DISABLE_FTS3_UNICODE is defined, do not build the "unicode61" tokenizer.
FossilOrigin-Name: e71495a817b479bc23c5403d99255e3f098eb054
2012-05-26 18:28:14 +00:00
dan
501c74d3e1 Change the format of the tables used by sqlite3FtsUnicodeTolower() to make them a little smaller.
FossilOrigin-Name: b89d3834f6690073fca0fc22c18afa1fb280ea7d
2012-05-26 17:57:02 +00:00
dan
1c7016c9a5 Add special fast paths to sqlite3FtsUnicodeTolower() and Isalnum() for codepoints in the ASCII range.
FossilOrigin-Name: cf7b25d47687635a04f4347d45f135c686b9d758
2012-05-25 19:50:12 +00:00
dan
80ed5a56a5 Fix comments in generated file fts3_unicode2.c.
FossilOrigin-Name: 3dc567ef4702d9a63d78d11ff705cb7f7359f7a6
2012-05-25 18:48:48 +00:00
dan
3d403c71a8 Add an experimental tokenizer to fts4 - "unicode". This tokenizer works in the same way except that it understands unicode "simple case folding" and recognizes all characters not classified as "Letters" or "Numbers" by unicode as token separators.
FossilOrigin-Name: 0c13570ec78c6887103dc99b81b470829fa28385
2012-05-25 17:50:19 +00:00