Commit Graph

71 Commits

Author SHA1 Message Date
shess
6b6ab13353 Fix crash in delete when existing row has null fields. Previous code
assumed that the row had values in all columns, sigh.  Fixes bug
http://www.sqlite.org/cvstrac/tktview?tn=2289 . (CVS 3833)

FossilOrigin-Name: 81be7290a4db7b74a533aaf95c7389eb4bde6a88
2007-04-09 20:45:40 +00:00
shess
06c69d2ed6 Buffer updates per-transaction rather than per-update. If lots of
updates happen within a single transaction, there was a lot of wasted
encode/decode overhead due to segment merges.  This code buffers
updates in memory and writes out larger level-0 segments.  It only
works when documents are presented in ascending order by docid.
Comparing a test set running 100 documents per transaction, the total
runtime is cut almost in half. (CVS 3751)

FossilOrigin-Name: 0229cba69698ab4b44f8583ef50a87c49422f8ec
2007-03-29 18:41:03 +00:00
shess
194f8972d5 Don't call ctype functions on hi-bit chars. Some platforms raise
assertions when this occurs, and it's almost certainly not the right
thing to do in the first place. (CVS 3746)

FossilOrigin-Name: f6c3abdc6c5e916e5366ba28fb1cd06ca3554303
2007-03-29 16:30:38 +00:00
shess
13ee81fe96 Refactor PLWriter to remove owned buffer. DLCollector (Document List
Collector) now handles the case where PLWriter (Position List Writer)
needed a local buffer.  Change to using the associated DLWriter
(Document List Writer) buffer, which reduces the number of memory
copies needed in doclist processing, and brings PLWriter operation in
line with DLWriter operation. (CVS 3707)

FossilOrigin-Name: d04fa3a13a84f49074c673b8ee2fb6541da061b5
2007-03-22 00:14:28 +00:00
shess
4607fc06f6 Refactor PLWriter in preparation for buffered-document change.
Currently, PLWriter (Position List Writer) creates a locally-owned
DataBuffer to write into.  This is necessary to support doclist
collection during tokenization, where there is no obvious buffer to
write output to, but is not necessary for the other users of PLWriter.
 This change adds a DLCollector (Doc List Collector) structure to
handle the tokenization case.

Also fix a potential memory leak in writeZeroSegment().  In case of
error from leafWriterStep(), the DataBuffer dl was being leaked. (CVS 3706)

FossilOrigin-Name: 1b9918e20767aebc9c1e7523027139e5fbc12688
2007-03-20 23:52:37 +00:00
shess
0d9f55a177 Out-of-memory cleanup in tokenizers. Handle NULL return from
malloc/calloc/realloc appropriately, and use sizeof(var) instead of
sizeof(type) to make certain that we don't get a mismatch between
them as the code rots. (CVS 3693)

FossilOrigin-Name: fbc53da8c645935c74e49af2ab2cf447dc72ba4e
2007-03-16 18:30:54 +00:00
shess
3438ea3b9e http://www.sqlite.org/cvstrac/tktview?tn=2219
When creating fts tables in an attached database, the backing tables
are created in database 'main'.  This change propagates the
appropriate database name to the routines which build sql statements.

Note that I propagate the database name and table name separately.  I
briefly considered just making the table name be "db.table", but it
didn't fit so well in the model used to store the table name and other
information, and having the db name passed separately seemed a bit
more transparent. (CVS 3631)

FossilOrigin-Name: 283385d20724f0144f38de89bd179715ee5e738b
2007-02-07 01:01:17 +00:00
shess
3ad202dd17 http://www.sqlite.org/cvstrac/tktview?tn=2166,35
Calling UPDATE against an fts table in a UTF-16 database inserts
corrupted data into the database.  The UTF-8 data is being inserted
directly.  This appears to happen because sqlite3_ value_text()
destructively coerces a value to UTF-8, and it's never converted back
when updating the table. This works around the problem by rearranging
things so that the update happens before the coercion. (CVS 3596)

FossilOrigin-Name: 4f2ab4b6320ffc621900049b41f50bc30d76d7f5
2007-01-19 22:59:56 +00:00
shess
f7912aff8a Drop a couple variables which are no longer used anywhere. (CVS 3524)
FossilOrigin-Name: 08c2cc0e0782cfaca89947a01b7ea4474dbe71aa
2006-11-29 23:41:10 +00:00
shess
5c327dbb46 http://www.sqlite.org/cvstrac/tktview?tn=2046
The virtual table interface allows for a cursor to field multiple
xFilter() calls.  For instance, if a join is done with a virtual
table, there could be a call for each row which potentially matches.
Unfortunately, fulltextFilter() assumes that it has a fresh cursor,
and overwrites a prepared statement and a malloc'ed pointer, resulting
in unfinalized statements and a memory leak.

This change hacks the code to manually clean up offending items in
fulltextFilter(), emphasis on "hacks", since it's a fragile fix
insofar as future additions to fulltext_cursor could continue to have
the problem. (CVS 3521)

FossilOrigin-Name: 18142fdb6d1f5bfdbb1155274502b9a602885fcb
2006-11-29 05:17:28 +00:00
shess
7e3d0c2d2f Delta-encode terms in interior nodes. While experiments have shown
that this is of marginal utility when encoding terms resulting from
regular English text, it turns out to be very useful when encoding
inputs with very large terms. (CVS 3520)

FossilOrigin-Name: c8151a998ec2423b417566823dc9957c7d5d782c
2006-11-29 01:02:03 +00:00
shess
f72442be68 Store minimal terms in interior nodes. Whenever there's a break
between leaf nodes, instead of storing the entire leftmost term of the
rightmost child, store only that portion of the leftmost term
necessary to distinguish it from the rightmost term of the leftmost
child. (CVS 3513)

FossilOrigin-Name: f6e0b080dcfaf554b2c05df5e7d4db69d012fba3
2006-11-18 00:12:44 +00:00
shess
9e6a561554 Refactoring groundwork for coming work on interior nodes. Change
LeafWriter to use empty data buffer (instead of empty term) to detect
an empty block.  Code to validate interior nodes.  Moderate revisions
to leaf-node and doclist validation.  Recast leafWriterStep() in terms
of LeafWriterStepMerge(). (CVS 3512)

FossilOrigin-Name: f30771d5c7ef2b502af95d81a18796b75271ada4
2006-11-17 21:12:15 +00:00
shess
de163af26e Delta-encode docids. This is good for around 22% reduction in index
size with DL_POSITIONS.  It improves performance about 5%-6%. (CVS 3511)

FossilOrigin-Name: 9b6d413d751d962b67cb4e3a208efe61581cb822
2006-11-13 21:09:24 +00:00
shess
debbcdfead Require a minimum fanout for interior nodes. This prevents cases
where excessively large terms keep the tree from finding a single
root.  A downside is that this could result in large interior nodes in
the presence of large terms, which may be prone to fragmentation,
though if the nodes were smaller that would translate into more levels
in the tree, which would also have that problem. (CVS 3510)

FossilOrigin-Name: 64b7e3406134ac4891113b9bb432ad97504268bb
2006-11-13 21:00:54 +00:00
shess
545311eeca Allow backing tables to be missing on dropping fts table. Fixes
http://www.sqlite.org/cvstrac/tktview?tn=1992,35 . (CVS 3509)

FossilOrigin-Name: 9628a61a6f33b7bec3455086534b76437d2622b4
2006-11-13 20:15:27 +00:00
shess
aedbce0376 Fix a pair of memory leaks. These were turned up by running valgrind
memcheck with various 10k doc insert, update, delete, and query tests. (CVS 3497)

FossilOrigin-Name: 3cd9b64b96018f69163ad0be0b5c07dd1be6abc6
2006-10-31 18:13:42 +00:00
shess
93d2a81401 Empty queries should get no results. My recent change
( http://www.sqlite.org/cvstrac/chngview?cn=3486 ) broke test fts2a-5.3.
This change should make the expected result more obvious. (CVS 3489)

FossilOrigin-Name: cde383eb467de0d752e94a22cd2f890c2dc599cc
2006-10-26 00:41:51 +00:00
shess
9d5586fc9f Make memset() uses less error-prone.
http://www.sqlite.org/cvstrac/tktview?tn=2036,35 describes some cases
where we were passing memset() a length which was the sizeof a
pointer, rather than the structure pointed to.  Instead, wrap this
idiom up in CLEAR() and SCRAMBLE() macros. (CVS 3488)

FossilOrigin-Name: 5878add0839f9c5bec77caae2361ec20cb60b48b
2006-10-26 00:04:31 +00:00
shess
627a74c48c Remove unreferenced local variable. (CVS 3487)
FossilOrigin-Name: 2d3b22197c7c06488b789cce333b34b6d1ae39aa
2006-10-25 23:22:03 +00:00
shess
87f1d16bdb Replace the DocList and DocListReader structures. The new structures
distinguish reading from a static buffer from writing to a dynamic
buffer.  This allows n-way doclist merging, and in-place merging of
segment leaf nodes, which together cut segment merge times in half. (CVS 3486)

FossilOrigin-Name: af5bfb986e39248abbfc6fff2e13c6f9e634a751
2006-10-25 21:00:09 +00:00
shess
9289cba076 Don't store empty segments. When inserting empty strings, the code
was writing out a segment made up of a single leaf node containing the
\0 header.  LeafReader assumed that leaf nodes always contained at
least one term, so assertions would fail.

While it would be possible to support reading and merging empty
segments, there's no reason to do so.  While this change could have
been done in writeZeroSegment(), I put it in leafWriterFlush() so that
it would work right if segmentMerge() created an empty segment, which
could happen with future changes to how deleted documents are handled. (CVS 3484)

FossilOrigin-Name: fed79beec7da24a26ae94494bdc0c98dd102bc06
2006-10-25 05:21:55 +00:00
drh
d9033a6569 Removing debugging printf from the porter stemmer code. Ticket #2016. (CVS 3475)
FossilOrigin-Name: 7a08c6272f76d53b13313019b4f9da3c8f02b650
2006-10-13 11:55:39 +00:00
shess
8a235d4d3b Convert fts2 to store data in a way which allows for much faster
updates.  Groups of documents form segments which are encoded in a
btree layered over a table of blocks, with various tricks to make
merges fast.  This performs 20x-25x faster than fts1 when loading the
Enron corpus, and is only slightly slower for queries. (CVS 3474)

FossilOrigin-Name: 85272b2f5394e37916afb1d509e7296810d976f5
2006-10-12 23:15:24 +00:00
shess
0d6e29b832 Fix leaky symbols. With this change, fts1 and fts2 can both be
statically linked. (CVS 3472)

FossilOrigin-Name: 5e8bbb85c1493e3ab2d807d24c68294f26838e49
2006-10-10 23:22:40 +00:00
shess
2670a173ed Copy fts1/ to fts2/, changing reference from fts1 to fts2. For future
reference, the source versions copied were:

README.txt r1.1
fts1.c r1.37
fts1.h r1.2
fts1_hash.c r1.1
fts1_hash.h r1.1
fts1_porter.c r1.1
fts1_tokenizer.h r1.4
fts1_tokenizer1.c r1.6 (CVS 3471)

FossilOrigin-Name: d0d1e7cdcc1dd085f1e359ce35c441699d517b02
2006-10-10 17:37:14 +00:00
shess
9f4683cd42 Fix incorrect doclist initialization in term_select_all().
docListRestrictColumn() generates a DL_POSITIONS doclist, which means
that after the first doclist is processed, the second doclist is
initialized as DL_POSITIONS, but with DL_POSITIONS_OFFSETS data.
(Note that DL_DEFAULT is now DL_POSITIONS, which masks this bug.) (CVS 3467)

FossilOrigin-Name: 144e3f11e22c6efd6f2d960599ab2d93542db406
2006-10-05 21:48:56 +00:00
drh
53c36d5444 The snippet generator adds ellipsis between text from different columns. (CVS 3465)
FossilOrigin-Name: 6cf1fb9f801dc1b2865c0d1f9afb1b2076d4246e
2006-10-04 17:35:28 +00:00
drh
b1b6d4a929 Make DL_POSITION the default mode in FTS1. Remove the need to compile
with SQLITE_CORE when SQLITE_ENABLE_FTS1 is used. (CVS 3462)

FossilOrigin-Name: df1a4b4834fdc88056371bcc767c5dfde2eaab72
2006-10-03 19:37:37 +00:00
drh
d75e03df2b Add the option to omit offset information from posting lists in FTS1. (CVS 3456)
FossilOrigin-Name: fdcea7b1ffd821f3f2b6d30997d3957f705a6d0c
2006-10-03 11:42:28 +00:00
drh
6da40bcd79 Add a Porter stemmer option to the FTS1 module. (CVS 3452)
FossilOrigin-Name: 936b06aaa8133e83104de87e03dc94e286a31f86
2006-10-01 18:41:19 +00:00
drh
7cf43fa64e Fix a bug in the handling of the OR operator in FTS1. Test cases added to
prevent a repeat. (CVS 3450)

FossilOrigin-Name: 8cdf1d6ae018dfc93f8f0962b2530e31aa0bebff
2006-09-28 19:43:31 +00:00
drh
07aa67c14a More snippet generator improvements and test cases. (CVS 3449)
FossilOrigin-Name: 0934d220b33c52024f42c89fa13326bd52333f39
2006-09-28 18:57:59 +00:00
drh
1e7423e57f Bug fix in the FTS1 snippet generator. Improvements in the way the snippet
generator handles whitespace. (CVS 3448)

FossilOrigin-Name: d3f4ae827582bd0aac54ae3211d272a1429b6523
2006-09-28 18:37:15 +00:00
drh
361e2bdeb5 Avoid segfaults when inserted NULL values into FTS1. (CVS 3447)
FossilOrigin-Name: 165645d30115f3171fc45489823f85639fe2bfcd
2006-09-28 11:41:41 +00:00
adamd
adf52ce14b Implemented UPDATE for full-text tables.
We handle an UPDATE to a row by performing an UPDATE on the content table and by building new position lists for each term which appears in either the old or new versions of the row.  We write these position lists all at once; this is presumably more efficient than a delete followed by an insert (which would first write empty position lists, then new position lists). (CVS 3434)

FossilOrigin-Name: 757fa22400b363212b4d5f648bdc9fcbd9a7f152
2006-09-22 00:06:39 +00:00
adamd
f40a504164 When gathering a doclist for querying, don't discard empty position lists until the end; this allows empty position lists to override non-empty lists encountered later in the gathering process. This fixes #1982, which was caused by the fact that for all-column queries we weren't discarding empty position lists at all. (CVS 3433)
FossilOrigin-Name: 111ca616713dd89b5d1e114de29c83256731c482
2006-09-21 20:56:52 +00:00
drh
8b62817797 Implementation of the snippet() function for FTS1. Includes a few
simple test cases but more testing is needed. (CVS 3431)

FossilOrigin-Name: c7ee60d00976efab25a830e7416538010c734129
2006-09-21 02:03:08 +00:00
adamd
d47522807e Fixed a build problem in sqlite3_extension_init(). (CVS 3430)
FossilOrigin-Name: bb2e1871cb10b470f96c793bb137c043ef30e1da
2006-09-18 21:14:40 +00:00
drh
a70034de7c Convert all names to lower case before sending them to the xFindFunction
method of a virtual table.  In FTS1, use strcmp instead of strcasecmp.
Ticket #1981. (CVS 3428)

FossilOrigin-Name: efa8fb32a596c7232bb1754b3231e4f2421df75b
2006-09-18 20:24:02 +00:00
drh
b08249ced3 Modify FTS1 so that the "magic" column has the same name as the virtual
table.  Offsets are retrieved using a special "offsets" function whose
first argument is the magic column.  Snippets will ultimately be retrieved
in the same way. (CVS 3427)

FossilOrigin-Name: 5e35dc1ffadfe7fa47673d052501ee79903eead9
2006-09-18 02:12:47 +00:00
drh
b7481e70c5 Add the sqlite3_overload_function() API - part of the virtual table
interface. (CVS 3426)

FossilOrigin-Name: aa7728f9f5b80dbb1b3db124f84b9166bf72bdd3
2006-09-16 21:45:14 +00:00
drh
ae2f2048df Fix an initialization problem in FTS1. Ticket #1977. (CVS 3424)
FossilOrigin-Name: 5a18dd88498ca35ca1333d88c4635868d0b61073
2006-09-15 16:08:59 +00:00
drh
f800e3e63a The FTS1 tables have a new automatic column named "offset" that returns
a string containing byte offset information for all matching terms.
Also added a large test case based on SQLite mailing list entries. (CVS 3417)

FossilOrigin-Name: f25cfa1aec0e4c1fe07176039a1b7f4e6a2c66ec
2006-09-14 01:17:30 +00:00
drh
8f116cc15c In FTS1: Retain the Query structure as part of the cursor. It will be used
laster as part of snippet generation. (CVS 3414)

FossilOrigin-Name: 607d928ce91f3efa9c7019fc789a9cd3c41cfc92
2006-09-13 19:18:29 +00:00
shess
c48f2a10aa Earlier refactoring changed name in fts1.c but not fts1.h. (CVS 3413)
FossilOrigin-Name: d4edb8035c8abbdb301893557934dd644ef3c950
2006-09-13 18:40:25 +00:00
drh
1de6154d39 Minor code cleanup in FTS1. (CVS 3412)
FossilOrigin-Name: fca592816767de397fbaf22cccdf1028fc5dfc91
2006-09-13 17:17:48 +00:00
drh
a3baa963bc Implementation of "column:" modifiers in FTS1 queries. (CVS 3411)
FossilOrigin-Name: 820634f71e3a3499994f82b56b784d22a7e3cdcf
2006-09-13 16:02:43 +00:00
drh
cbaac514bc Module spec parser enhancements for FTS1. Now able to cope with column
names in the spec that are SQL keywords or have special characters, etc.
Also added support for additional control lines.  Column names can be
followed by a type specifier (which is ignored.) (CVS 3410)

FossilOrigin-Name: adb780e0dc8bc7dcd1102efbfa4bc17eefdf968e
2006-09-13 15:20:13 +00:00
drh
a6be0dc938 Fix the FTS1 test cases and add new tests. Comments added to the FTS1 code. (CVS 3409)
FossilOrigin-Name: 528036c828c93c78ca879bf89a52131b72e24067
2006-09-13 12:36:08 +00:00