4401 Commits

Author SHA1 Message Date
shess
87f1d16bdb Replace the DocList and DocListReader structures. The new structures
distinguish reading from a static buffer from writing to a dynamic
buffer.  This allows n-way doclist merging, and in-place merging of
segment leaf nodes, which together cut segment merge times in half. (CVS 3486)

FossilOrigin-Name: af5bfb986e39248abbfc6fff2e13c6f9e634a751
2006-10-25 21:00:09 +00:00
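
As a rough illustration of what n-way doclist merging means, the sketch below merges several docid-sorted lists in a single pass. It is a Python model, not the fts2 C code: doclists are represented as lists of (docid, positions) tuples, and the name merge_doclists and the rule that the newest list wins on a duplicate docid are assumptions made for the example.

```python
import heapq

def merge_doclists(doclists):
    """Merge docid-sorted doclists in one pass.

    doclists: list of lists of (docid, positions), ordered oldest first.
    Assumption: when a docid appears in several lists, the entry from the
    newest list (highest index) is kept.
    """
    # Tag each entry with its source index so that entries sharing a docid
    # come out of the merge ordered oldest to newest.
    streams = [
        [(docid, idx, positions) for docid, positions in dl]
        for idx, dl in enumerate(doclists)
    ]
    merged = []
    for docid, _idx, positions in heapq.merge(*streams):
        if merged and merged[-1][0] == docid:
            merged[-1] = (docid, positions)  # newer entry replaces older
        else:
            merged.append((docid, positions))
    return merged
```

Merging all inputs at once avoids the repeated intermediate copies that a chain of pairwise merges would produce.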
shess
9289cba076 Don't store empty segments. When inserting empty strings, the code
was writing out a segment made up of a single leaf node containing the
\0 header.  LeafReader assumed that leaf nodes always contained at
least one term, so assertions would fail.

While it would be possible to support reading and merging empty
segments, there's no reason to do so.  While this change could have
been done in writeZeroSegment(), I put it in leafWriterFlush() so that
it would work right if segmentMerge() created an empty segment, which
could happen with future changes to how deleted documents are handled. (CVS 3484)

FossilOrigin-Name: fed79beec7da24a26ae94494bdc0c98dd102bc06
2006-10-25 05:21:55 +00:00
drh
d9033a6569 Removing debugging printf from the porter stemmer code. Ticket #2016. (CVS 3475)
FossilOrigin-Name: 7a08c6272f76d53b13313019b4f9da3c8f02b650
2006-10-13 11:55:39 +00:00
shess
8a235d4d3b Convert fts2 to store data in a way which allows for much faster
updates.  Groups of documents form segments which are encoded in a
btree layered over a table of blocks, with various tricks to make
merges fast.  This performs 20x-25x faster than fts1 when loading the
Enron corpus, and is only slightly slower for queries. (CVS 3474)

FossilOrigin-Name: 85272b2f5394e37916afb1d509e7296810d976f5
2006-10-12 23:15:24 +00:00
shess
0d6e29b832 Fix leaky symbols. With this change, fts1 and fts2 can both be
statically linked. (CVS 3472)

FossilOrigin-Name: 5e8bbb85c1493e3ab2d807d24c68294f26838e49
2006-10-10 23:22:40 +00:00
shess
2670a173ed Copy fts1/ to fts2/, changing references from fts1 to fts2. For future
reference, the source versions copied were:

README.txt r1.1
fts1.c r1.37
fts1.h r1.2
fts1_hash.c r1.1
fts1_hash.h r1.1
fts1_porter.c r1.1
fts1_tokenizer.h r1.4
fts1_tokenizer1.c r1.6 (CVS 3471)

FossilOrigin-Name: d0d1e7cdcc1dd085f1e359ce35c441699d517b02
2006-10-10 17:37:14 +00:00
shess
9f4683cd42 Fix incorrect doclist initialization in term_select_all().
docListRestrictColumn() generates a DL_POSITIONS doclist, which means
that after the first doclist is processed, the second doclist is
initialized as DL_POSITIONS, but with DL_POSITIONS_OFFSETS data.
(Note that DL_DEFAULT is now DL_POSITIONS, which masks this bug.) (CVS 3467)

FossilOrigin-Name: 144e3f11e22c6efd6f2d960599ab2d93542db406
2006-10-05 21:48:56 +00:00
drh
53c36d5444 The snippet generator adds ellipsis between text from different columns. (CVS 3465)
FossilOrigin-Name: 6cf1fb9f801dc1b2865c0d1f9afb1b2076d4246e
2006-10-04 17:35:28 +00:00
drh
b1b6d4a929 Make DL_POSITIONS the default mode in FTS1. Remove the need to compile
with SQLITE_CORE when SQLITE_ENABLE_FTS1 is used. (CVS 3462)

FossilOrigin-Name: df1a4b4834fdc88056371bcc767c5dfde2eaab72
2006-10-03 19:37:37 +00:00
drh
d75e03df2b Add the option to omit offset information from posting lists in FTS1. (CVS 3456)
FossilOrigin-Name: fdcea7b1ffd821f3f2b6d30997d3957f705a6d0c
2006-10-03 11:42:28 +00:00
drh
6da40bcd79 Add a Porter stemmer option to the FTS1 module. (CVS 3452)
FossilOrigin-Name: 936b06aaa8133e83104de87e03dc94e286a31f86
2006-10-01 18:41:19 +00:00
drh
7cf43fa64e Fix a bug in the handling of the OR operator in FTS1. Test cases added to
prevent a repeat. (CVS 3450)

FossilOrigin-Name: 8cdf1d6ae018dfc93f8f0962b2530e31aa0bebff
2006-09-28 19:43:31 +00:00
drh
07aa67c14a More snippet generator improvements and test cases. (CVS 3449)
FossilOrigin-Name: 0934d220b33c52024f42c89fa13326bd52333f39
2006-09-28 18:57:59 +00:00
drh
1e7423e57f Bug fix in the FTS1 snippet generator. Improvements in the way the snippet
generator handles whitespace. (CVS 3448)

FossilOrigin-Name: d3f4ae827582bd0aac54ae3211d272a1429b6523
2006-09-28 18:37:15 +00:00
drh
361e2bdeb5 Avoid segfaults when inserting NULL values into FTS1. (CVS 3447)
FossilOrigin-Name: 165645d30115f3171fc45489823f85639fe2bfcd
2006-09-28 11:41:41 +00:00
adamd
adf52ce14b Implemented UPDATE for full-text tables.
We handle an UPDATE to a row by performing an UPDATE on the content table and by building new position lists for each term which appears in either the old or new versions of the row.  We write these position lists all at once; this is presumably more efficient than a delete followed by an insert (which would first write empty position lists, then new position lists). (CVS 3434)

FossilOrigin-Name: 757fa22400b363212b4d5f648bdc9fcbd9a7f152
2006-09-22 00:06:39 +00:00
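
The UPDATE strategy described in this commit can be modeled roughly as follows. This is a hedged Python sketch, not the fts1 implementation: the function name update_position_lists, the whitespace tokenization, and the lowercasing are all assumptions. It shows only the core idea, which is that one pass produces a position list for every term found in either version of the row, with an empty list for terms that were removed.

```python
def update_position_lists(old_text, new_text):
    """Return {term: [token positions in new_text]} for every term that
    appears in either version of the row.

    Terms present only in old_text map to an empty list, which stands in
    for "this row no longer contains the term".
    """
    new_positions = {}
    for pos, term in enumerate(new_text.lower().split()):
        new_positions.setdefault(term, []).append(pos)
    # Every term from the old or the new text needs its list rewritten.
    affected = set(old_text.lower().split()) | set(new_positions)
    return {term: new_positions.get(term, []) for term in sorted(affected)}
```

Each affected term is written once, whereas a delete followed by an insert would write terms common to both versions twice.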
adamd
f40a504164 When gathering a doclist for querying, don't discard empty position lists until the end; this allows empty position lists to override non-empty lists encountered later in the gathering process. This fixes #1982, which was caused by the fact that for all-column queries we weren't discarding empty position lists at all. (CVS 3433)
FossilOrigin-Name: 111ca616713dd89b5d1e114de29c83256731c482
2006-09-21 20:56:52 +00:00
drh
8b62817797 Implementation of the snippet() function for FTS1. Includes a few
simple test cases but more testing is needed. (CVS 3431)

FossilOrigin-Name: c7ee60d00976efab25a830e7416538010c734129
2006-09-21 02:03:08 +00:00
adamd
d47522807e Fixed a build problem in sqlite3_extension_init(). (CVS 3430)
FossilOrigin-Name: bb2e1871cb10b470f96c793bb137c043ef30e1da
2006-09-18 21:14:40 +00:00
drh
a70034de7c Convert all names to lower case before sending them to the xFindFunction
method of a virtual table.  In FTS1, use strcmp instead of strcasecmp.
Ticket #1981. (CVS 3428)

FossilOrigin-Name: efa8fb32a596c7232bb1754b3231e4f2421df75b
2006-09-18 20:24:02 +00:00
drh
b08249ced3 Modify FTS1 so that the "magic" column has the same name as the virtual
table.  Offsets are retrieved using a special "offsets" function whose
first argument is the magic column.  Snippets will ultimately be retrieved
in the same way. (CVS 3427)

FossilOrigin-Name: 5e35dc1ffadfe7fa47673d052501ee79903eead9
2006-09-18 02:12:47 +00:00
drh
b7481e70c5 Add the sqlite3_overload_function() API - part of the virtual table
interface. (CVS 3426)

FossilOrigin-Name: aa7728f9f5b80dbb1b3db124f84b9166bf72bdd3
2006-09-16 21:45:14 +00:00
drh
ae2f2048df Fix an initialization problem in FTS1. Ticket #1977. (CVS 3424)
FossilOrigin-Name: 5a18dd88498ca35ca1333d88c4635868d0b61073
2006-09-15 16:08:59 +00:00
drh
f800e3e63a The FTS1 tables have a new automatic column named "offset" that returns
a string containing byte offset information for all matching terms.
Also added a large test case based on SQLite mailing list entries. (CVS 3417)

FossilOrigin-Name: f25cfa1aec0e4c1fe07176039a1b7f4e6a2c66ec
2006-09-14 01:17:30 +00:00
drh
8f116cc15c In FTS1: Retain the Query structure as part of the cursor. It will be used
later as part of snippet generation. (CVS 3414)

FossilOrigin-Name: 607d928ce91f3efa9c7019fc789a9cd3c41cfc92
2006-09-13 19:18:29 +00:00
shess
c48f2a10aa Earlier refactoring changed name in fts1.c but not fts1.h. (CVS 3413)
FossilOrigin-Name: d4edb8035c8abbdb301893557934dd644ef3c950
2006-09-13 18:40:25 +00:00
drh
1de6154d39 Minor code cleanup in FTS1. (CVS 3412)
FossilOrigin-Name: fca592816767de397fbaf22cccdf1028fc5dfc91
2006-09-13 17:17:48 +00:00
drh
a3baa963bc Implementation of "column:" modifiers in FTS1 queries. (CVS 3411)
FossilOrigin-Name: 820634f71e3a3499994f82b56b784d22a7e3cdcf
2006-09-13 16:02:43 +00:00
drh
cbaac514bc Module spec parser enhancements for FTS1. Now able to cope with column
names in the spec that are SQL keywords or have special characters, etc.
Also added support for additional control lines.  Column names can be
followed by a type specifier (which is ignored.) (CVS 3410)

FossilOrigin-Name: adb780e0dc8bc7dcd1102efbfa4bc17eefdf968e
2006-09-13 15:20:13 +00:00
drh
a6be0dc938 Fix the FTS1 test cases and add new tests. Comments added to the FTS1 code. (CVS 3409)
FossilOrigin-Name: 528036c828c93c78ca879bf89a52131b72e24067
2006-09-13 12:36:08 +00:00
adamd
4f1a424e72 Allow virtual tables to contain multiple full-text-indexed columns. Added a magic column "_all" which can be used for querying all columns in a table at once.
For now, each posting list stores position/offset information for multiple columns.  We may implement separate posting lists for separate columns at some future point. (CVS 3408)

FossilOrigin-Name: 366a70b086c817bddecd83053472ec76ef20f309
2006-09-13 02:18:20 +00:00
adamd
341d60838c Answer queries for a particular rowid in a full-text table by looking up
that rowid directly rather than by performing a table scan. (CVS 3407)

FossilOrigin-Name: 877d5558b1a6f65201b1825336935b146583bffa
2006-09-12 23:36:45 +00:00
shess
4240240f12 Re-use deleted rowids for new segments. This has a somewhat
surprising impact on performance, I believe because it keeps the index
smaller (by keeping rowids smaller), and also because it improves
locality in the table (deleting a row means we've already touched the
pages leading to that rowid). (CVS 3405)

FossilOrigin-Name: 2f5f6290c9ef99c7b060aecc4d996c976c50c9d7
2006-09-11 21:39:21 +00:00
drh
e410296021 Add a rudimentary tokenizer and parser to FTS1 for parsing the module
arguments during initialization.   Recognized arguments include a
tokenizer selector and a list of virtual table columns. (CVS 3403)

FossilOrigin-Name: 227dc3feb537e6efd5b0c1d2dad40193db07d5aa
2006-09-11 00:34:22 +00:00
drh
4ca8aac2b4 Add pzErr parameters to the xConnect and xCreate methods of virtual tables
in order to provide better error reporting.  This is an interface change
for virtual tables.  Prior virtual table implementations will need to be
modified and recompiled. (CVS 3402)

FossilOrigin-Name: f44b8bae97b6872524580009c96d07391578c388
2006-09-10 17:31:58 +00:00
drh
a2a9d18869 Add some simple test cases for the OR and NOT logic of the fts1 module.
Fix lots of bugs discovered while developing these test cases. (CVS 3400)

FossilOrigin-Name: 70bcff024b44d1b40afac6eba959fa89fb993147
2006-09-10 03:34:06 +00:00
drh
a7e98f2a54 Add support for OR and NOT terms in fts1. (CVS 3399)
FossilOrigin-Name: ae50265791d1a7500aa3c405a78a9bca8ff0cc08
2006-09-09 23:11:51 +00:00
shess
fb6794360d Write doclists using a segmented technique to amortize costs better.
New items for a term are merged with the term's segment 0 doclist,
until that doclist exceeds CHUNK_MAX.  Then the segments are merged in
exponential fashion, so that segment 1 contains approximately
2*CHUNK_MAX data, segment 2 4*CHUNK_MAX, and so on. (CVS 3398)

FossilOrigin-Name: b6b93a3325d3e728ca36255c0ff6e1f63e03b0ac
2006-09-08 17:00:17 +00:00
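
The exponential merge scheme in this commit can be pictured with a small model. The Python sketch below is an illustration under stated assumptions, not the fts1 code: segments are plain lists, the capacity of segment i is taken to be chunk_max * 2**i, an overflowing segment is merged wholesale into the next one, and the tiny CHUNK_MAX value is hypothetical.

```python
CHUNK_MAX = 4  # hypothetical tiny limit so the behavior is easy to see

def add_item(segments, item, chunk_max=CHUNK_MAX):
    """Append item to segment 0, cascading merges upward as levels overflow.

    segments: list of lists; segments[i] holds up to roughly
    chunk_max * 2**i items (an assumed capacity rule).
    """
    if not segments:
        segments.append([])
    segments[0].append(item)
    level = 0
    while len(segments[level]) > chunk_max * 2 ** level:
        if level + 1 == len(segments):
            segments.append([])
        # Merge the overflowing segment into the next level and empty it.
        segments[level + 1] = sorted(segments[level + 1] + segments[level])
        segments[level] = []
        level += 1
    return segments
```

Because most inserts touch only the small segment 0, and larger segments are rewritten progressively less often, the cost of rewriting data is amortized across many inserts.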
adamd
338565ad4b A minor change to fts1.c to fix broken build. (CVS 3393)
FossilOrigin-Name: 55a03b96251515a4817a0eefb197219a460640e7
2006-09-05 18:21:31 +00:00
drh
fb52cc95ff Add a TRACE macro to the FTS1 module for troubleshooting. Turned off by
default. (CVS 3388)

FossilOrigin-Name: d4923e98c66ae03d899f633e5e309471f5695abb
2006-09-02 20:58:25 +00:00
drh
7c2d87cd71 Convert static variables into constants in the FTS module. (CVS 3385)
FossilOrigin-Name: 098cbafcd6dcf57142b0417e796d27ffddcc0920
2006-09-02 14:16:59 +00:00
adamd
9eb3997b02 Miscellaneous restructuring and cleanup based on suggestions from shess. (CVS 3382)
FossilOrigin-Name: e98b0cf292f6dc9deb6ae9b773c52b16867f7556
2006-09-02 00:23:01 +00:00
shess
b2f4d0173a Make fts1.c not rely on nul-terminated strings. Mostly a matter of
making sure we always pass around ptr/len, but there were a few places
where we actually relied on nul-termination.

An earlier change had additionally changed appropriate
sqlite3_bind_text() calls to sqlite3_bind_blob().  I've found that
this changes what's actually stored in the database, so backed those
changes out.  Also (and this is weird), I found that I could no longer
do straightforward = queries against %_term.term at the command line. (CVS 3379)

FossilOrigin-Name: 5844db1aa9c23a005c88104b084f68afb21891c7
2006-09-01 00:33:44 +00:00
shess
c0beb14f23 Make tokenizer not rely on nul-terminated text. Instead of using
strcspn() and a nul-terminated delimiter list, I just flagged
delimiters in an array and wrote things inline.  Submitting this for
review separately because it's pretty standalone. (CVS 3378)

FossilOrigin-Name: 2631ceaeefaca3aa837e3b439399f13c51456914
2006-09-01 00:05:17 +00:00
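
A delimiter lookup table of the kind this commit describes might look like the sketch below. It is a Python model over a bytes buffer with an explicit length, not the C tokenizer itself; the names DELIM and tokenize are made up for the example, and the choice of delimiters (every non-alphanumeric ASCII byte, with bytes >= 0x80 treated as token bytes) is an assumption.

```python
# 256-entry lookup table: DELIM[b] is True when byte b separates tokens.
# Assumption: every non-alphanumeric ASCII byte is a delimiter, and bytes
# with the high bit set are treated as part of a token.
DELIM = [b < 0x80 and not chr(b).isalnum() for b in range(256)]

def tokenize(buf, n):
    """Yield (start, length) for each token in the first n bytes of buf.

    Nothing at or past index n is read, so buf needs no terminator.
    """
    i = 0
    while i < n:
        # Skip any run of delimiter bytes.
        while i < n and DELIM[buf[i]]:
            i += 1
        start = i
        # Consume the token bytes that follow.
        while i < n and not DELIM[buf[i]]:
            i += 1
        if i > start:
            yield start, i - start
```

Because the scan is bounded by n rather than by a terminator, an embedded NUL byte is just another delimiter.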
drh
5db455e7b5 Refactor the FTS1 module so that its name is "fts1" instead of "fulltext",
so that all symbols with external linkage begin with "sqlite3Fts1", and
so that all filenames begin with "fts1". (CVS 3377)

FossilOrigin-Name: e1891f0dc58e5498a8845d8b9b5b092d7f9c7003
2006-08-31 15:07:14 +00:00
shess
2b85d5f46e Just don't run tolower() on hi-bit characters. This shouldn't cause
us to break any UTF-8 code points, unless they were already broken in
the input. (CVS 3376)

FossilOrigin-Name: 6c77c2d5e15e9d3efed3e274bc93cd5a4868f574
2006-08-30 21:40:30 +00:00
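
The reasoning in this commit is that every byte of a multi-byte UTF-8 sequence has its high bit set, so case folding that touches only ASCII letters cannot split or corrupt a code point. A minimal Python sketch of that idea follows; the helper name ascii_lower is hypothetical and the code is a model, not the fts1 source.

```python
def ascii_lower(data):
    """Lowercase ASCII letters in a bytes object; leave bytes >= 0x80 alone."""
    # 0x41..0x5A is 'A'..'Z'; adding 32 maps each to its lowercase form.
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)
```

Non-ASCII letters keep their original case under this rule, so the approach preserves valid UTF-8 but does not perform case folding for those letters.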
shess
c9e0a9057e Make static some symbols which shouldn't have been exported. (CVS 3371)
FossilOrigin-Name: 58006e38af760b53cf72bf127d7c7b8a619a1282
2006-08-28 23:46:01 +00:00
shess
4f4897e80d Make hi-bit characters delimiters. This is a stopgap until the tokenizer
and fulltext.c recognize UTF-8 correctly. (CVS 3370)

FossilOrigin-Name: ca850d3d80f67672172d11392fcdf60bfbb94c02
2006-08-28 20:08:56 +00:00
shess
0de250e46f Fix gcc gripe about parens in a ||/&& in mergePosList().
Drop unused pBlob/nBlob in index_insert_term().
Fix NULL deref in an assertion in docListUpdate() delete case.
Minor code tightening in docListUpdate(). (CVS 3367)

FossilOrigin-Name: a6fcf9101a831bf5f129c6045eabf30376d365dc
2006-08-25 19:20:26 +00:00
adamd
1717edd157 A first implementation of a full-text search module for SQLite. (CVS 3363)
FossilOrigin-Name: b0d8e0d314d6f77b7d4b5dd00c694a1323f7a8e4
2006-08-23 23:58:50 +00:00
drh
fa9b4b1499 Add the ext/fts1 subdirectory for holding the first full-text search
extension. (CVS 3360)

FossilOrigin-Name: 7f152f9f3a647d30874f2da46ce93a1e31ea7cf3
2006-08-22 14:45:37 +00:00