dan
31afee9372
Add extra test for handling of embedded nul characters in the fts4 unicode61 tokenizer.
FossilOrigin-Name: c2c2c7e945f5d5700d91b8e779117e70e388ffc613912a434885ae27f5fe4e22
2021-01-04 18:28:29 +00:00
dan
d26d2c7de2
Omit a test of codepoint 0x202F (narrow no-break space) from the fts3 ICU
tests. Different versions of ICU apparently handle this obscure codepoint
slightly differently.
FossilOrigin-Name: 69ae688982d6cb9f859f5643c315a1dc5ba76ad35553ecea8329a75ee70a87b1
2017-05-30 18:14:47 +00:00
drh
07d694c750
Adjust ICU tests to account for recent changes in the official
Unicode definition of whitespace.
FossilOrigin-Name: 0816525386ac51454b7b09a507e45b6a2cb8bf6e
2015-06-15 16:40:38 +00:00
drh
f6b1a8e1a5
Make sure errors encountered while initializing extensions such as FTS4
get reported out from sqlite3_open(). This fixes a bug introduced by
check-in [9d347f547e7ba9]. Also remove lots of forgotten "breakpoint"
commands left in test scripts over the years.
FossilOrigin-Name: ca3fdfd41961d8d3d1e39d20dc628e8a95dabb2f
2013-12-19 16:26:05 +00:00
mistachkin
549bc3db1f
Fix Unicode character encoding issues on Windows.
FossilOrigin-Name: c9310c9a2bad11f1d033a57b33ea7aed43a8238d
2013-10-12 00:56:21 +00:00
mistachkin
cbc53fec75
Fix test numbering.
FossilOrigin-Name: cef39f6933dcfec4b4a087a05dbb4e7766003fb7
2013-10-11 22:17:39 +00:00
dan
6284d02160
Test that the unicode61 tokenchars= and separators= options work with the fts3tokenize virtual table.
FossilOrigin-Name: ed24051462c09220ebfb82a347b4a2b5c820ef63
2013-09-18 11:16:32 +00:00
dan
f1d2670d40
Add tests for the fts4 unicode61 tokenchars and separators options.
FossilOrigin-Name: 9ce6f40dfb54b35cecba3cc9c1ec0d111f6e9f11
2013-09-13 12:10:09 +00:00
dan
43398081a8
Add a test for fts4 unicode61 option remove_diacritics=0.
FossilOrigin-Name: 6bf7ae6ff6b18712544ddeafb6848b3b27ff22d2
2013-08-30 13:29:51 +00:00
dan
f2c9229f73
Up until now the fts4 "unicode61" tokenizer has treated all private use codepoints except the first and last of each of the three ranges as alphanumeric (eligible to be part of tokens). This commit fixes this so that all private use codepoints are considered alphanumeric. In other words, it fixes the handling of codepoints 0xE000, 0xF8FF, 0xF0000, 0xFFFFD, 0x100000 and 0x10FFFD.
FossilOrigin-Name: 6cfd9af5250029c0d275be027b4208c48954a8a1
2013-06-05 16:17:21 +00:00
drh
7c37e2f674
Add a single test case to fts4unicode.test to verify that title-case
maps to lower case.
FossilOrigin-Name: 955a9459dabad231aa8d6282676975ab7fba244e
2013-01-26 19:31:42 +00:00
dan
3aaa4cd9ed
Add tests to check that the "unicode61" and "icu" tokenizers both identify white-space codepoints outside the ASCII range.
FossilOrigin-Name: bfb2d4730cbbe18fb940e72f4fde9122d550734e
2012-06-19 06:35:39 +00:00
dan
25cdf46ae4
Add the "tokenchars=" and "separators=" options, for customizing the set of characters considered to be token separators, to the unicode61 tokenizer.
FossilOrigin-Name: e56fb462aa1f11bb23303ae0dc62815c21e26a52
2012-06-07 15:53:48 +00:00
dan
754d3adf7c
Have the FTS unicode61 tokenizer strip out diacritics when tokenizing text. This can be disabled by specifying the tokenizer option "remove_diacritics=0".
FossilOrigin-Name: 790f76a5898dad1a955d40edddf11f7b0fec0ccd
2012-06-06 19:30:38 +00:00
dan
7946c53009
If SQLITE_DISABLE_FTS3_UNICODE is defined, do not build the "unicode61" tokenizer.
FossilOrigin-Name: e71495a817b479bc23c5403d99255e3f098eb054
2012-05-26 18:28:14 +00:00
dan
7a796731db
Add coverage tests for fts3_unicode.c.
FossilOrigin-Name: 07d3ea8a3cb179fab6c48934fc6751f53b507d36
2012-05-26 16:22:56 +00:00
dan
ab322bd21e
Change the name of the "unicode" tokenizer to "unicode61" to emphasize that the case folding and separator-character identification routines are based on unicode version 6.1.
FossilOrigin-Name: 8f3e60aa2253f21bcee5d03982cfdd7f16c00060
2012-05-26 14:54:50 +00:00
dan
3d403c71a8
Add an experimental tokenizer to fts4 - "unicode". This tokenizer works in the same way as the "simple" tokenizer, except that it understands unicode "simple case folding" and recognizes all characters not classified as "Letters" or "Numbers" by unicode as token separators.
FossilOrigin-Name: 0c13570ec78c6887103dc99b81b470829fa28385
2012-05-25 17:50:19 +00:00