being able to write test scripts which verify that fts3 is internally
building indices in the expected way. Both new functions are only
defined if fts3.c is compiled with SQLITE_TEST defined, as when
building testfixture. These functions are not intended to be part of
the exposed fts3 API.
dump_terms() generates a TEXT result of all the terms in the index (or
a specified segment), sorted and joined with spaces.
dump_doclist() generates a TEXT representation of the doclist
associated with a given term in the index (or a specified segment). (CVS 5340)
FossilOrigin-Name: a48e3d95f7a656285e959cef595cbe6d53428ad9
as appropriate, in case the comments are ever again read by a pedantic
grammarian. Ticket #2840. (CVS 4629)
FossilOrigin-Name: 4e91a267febda572e7239f0f1cc66b3102558c36
linearly merged the doclists, so as the accumulated list got large,
things got slow (the M term, a function of the number of documents in
the index). This change does pairwise merges until a single doclist
remains. A test search of 't*' against a database of RFC text
improves from 1m16s to 4.75s. (CVS 4599)
FossilOrigin-Name: feef1b15d645d638b4a05742f214b0445fa7e176
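The difference between linear accumulation and pairwise merging can be
sketched as follows. This is only an illustration using plain sorted
arrays of docids rather than the real encoded doclists; IdList,
idListUnion() and mergePairwise() are hypothetical names, not functions
from the fts source.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a doclist: a sorted array of docids. */
typedef struct IdList {
  long long *a;   /* Sorted, duplicate-free docids */
  int n;          /* Number of entries in a[] */
} IdList;

/* Return the union of two sorted lists as a newly allocated list. */
static IdList idListUnion(const IdList *pLeft, const IdList *pRight){
  IdList out;
  int i = 0, j = 0;
  out.n = 0;
  out.a = malloc(sizeof(*out.a) * (size_t)(pLeft->n + pRight->n + 1));
  assert( out.a!=0 );
  while( i<pLeft->n || j<pRight->n ){
    long long v;
    if( j>=pRight->n || (i<pLeft->n && pLeft->a[i]<pRight->a[j]) ){
      v = pLeft->a[i++];          /* Left is smaller, or right is done */
    }else if( i>=pLeft->n || pRight->a[j]<pLeft->a[i] ){
      v = pRight->a[j++];         /* Right is smaller, or left is done */
    }else{
      v = pLeft->a[i++];          /* Same docid in both lists */
      j++;
    }
    out.a[out.n++] = v;
  }
  return out;
}

/* Merge aList[0..nList-1] pairwise until a single list remains.
** Takes ownership of the input arrays and returns the merged list.
** Each pass halves the number of lists, so no single large list is
** re-copied once per input as in a linear accumulation. */
static IdList mergePairwise(IdList *aList, int nList){
  assert( nList>0 );
  while( nList>1 ){
    int iOut = 0;
    int i;
    for(i=0; i+1<nList; i+=2){
      IdList merged = idListUnion(&aList[i], &aList[i+1]);
      free(aList[i].a);
      free(aList[i+1].a);
      aList[iOut++] = merged;
    }
    if( i<nList ) aList[iOut++] = aList[i];  /* Odd list carries over */
    nList = iOut;
  }
  return aList[0];
}
```

With k input lists, each docid is copied roughly log2(k) times here,
instead of up to k times when every list is merged into one growing
accumulator.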
rowid, similar to how things work in SQLite tables with INTEGER
PRIMARY KEY. Add tests to verify operation. (CVS 4426)
FossilOrigin-Name: c8d2345200f9ece1af712543982097d0b6f348c7
alias to fix the rowid for documents, %_segments.blockid is an alias
to fix the rowid for segment blocks. Unit test for the problem. (CVS 4280)
FossilOrigin-Name: 6eb2d74a8cfce322930f05c97d4ec255f3711efb
errors around SQLITE_SCHEMA handling. This also allows
sql_step_statement() and sql_step_leaf_statement() to be replaced with
sqlite3_step().
Also fix a logic error in flushPendingTerms() which was clearing the
term table in case of error. This was wrong in the face of
SQLITE_SCHEMA. Even though the change to sqlite3_prepare_v2() should
cause us not to see SQLITE_SCHEMA any longer, it was still a logic
error... (CVS 4205)
FossilOrigin-Name: 16730cb137eaf576b87cdc17913564c9c5c0ed82
unportable and highly deprecated <malloc.h> header on all platforms
except Apple Mac OS X. <malloc.h> is never actually required on any OS
with an at least partly POSIX-conforming API, as malloc(3) and friends
have officially lived in <stdlib.h> for over 10 years. On some
platforms, such as FreeBSD, including <malloc.h> has for a few years
even triggered an "#error" and thus a build failure. So just get rid of
the bad <malloc.h> usage in the FTS1 and FTS2 extensions entirely and
stick with <stdlib.h> only. (CVS 4191)
FossilOrigin-Name: 3f9a666143a8aafa0b1a5d56ec68f69f2b3d6a21
modified fts2:
Modify handling of SQLITE_SCHEMA in fts2 code. An SQLITE_SCHEMA error
may cause SQLite to reload the internal schema, deleting and
recreating v-table objects. So the sqlite3_vtab structure can be
deleted out from under a v-table implementation. (CVS 4183)
FossilOrigin-Name: f9020cffda02923ef45979bb447ec2e232086ad5
is omitted. Add the SQLite blessing to the header comments on all FTS2
source files. (CVS 4120)
FossilOrigin-Name: c795e6fd8f01bcbc1967062632c13d4952abf4d8
docids are ascending if there was a prior docid set for the doclist,
ignore the initial docid of 0. (CVS 4026)
FossilOrigin-Name: ed3a131f1d3fe51d1e79bdfe1bfafa55f825afa9
character immediately after the end of a term is '*', that term is
marked for prefix matching. Modify term comparison in
snippetOffsetsOfColumn() to respect isPrefix. fts2n.test runs prefix
searching through some obvious test cases. (CVS 3893)
FossilOrigin-Name: 7c4c65924035d9f260f6b64eb92c5c6cf6c04b7b
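A term comparison that respects a prefix flag might look like the sketch
below. The function name termMatches() and its signature are
hypothetical; this is not the code from snippetOffsetsOfColumn(), only
an illustration of the isPrefix idea using ptr/len strings.

```c
#include <assert.h>
#include <string.h>

/* Return non-zero if the document term zTerm[0..nTerm-1] matches the
** query term zQuery[0..nQuery-1].  When isPrefix is set (the query
** term was followed by '*'), zTerm only needs to start with zQuery.
** Otherwise the two terms must be identical. */
static int termMatches(
  const char *zQuery, int nQuery, int isPrefix,
  const char *zTerm, int nTerm
){
  assert( nQuery>=0 && nTerm>=0 );
  if( isPrefix ){
    if( nTerm<nQuery ) return 0;
  }else{
    if( nTerm!=nQuery ) return 0;
  }
  return memcmp(zQuery, zTerm, (size_t)nQuery)==0;
}
```

For example, a query for "run*" would call this with zQuery "run" and
isPrefix set, matching "run", "runs" and "running" but not "ran".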
The new function docListUnion() is used to accumulate a union of the
hits for the matching terms, which will be merged across segments
using docListMerge(). (CVS 3891)
FossilOrigin-Name: 72c796307338c2751a91c30f6fb16989afbf3816
Also implement correct prefix-handling for traversal of interior nodes
of segment tree. A given prefix can span multiple children of an
interior node, and from there the branches need to be followed in
parallel. (CVS 3889)
FossilOrigin-Name: cae844a01a1d87ffb00bba8b4e7b62a92e633aa9
search. Doclists from multiple prefix matches will need a union merge
function, which will have to logically happen across a segment before
doclists are merged between segments. (CVS 3887)
FossilOrigin-Name: 7ddb82668906e33e2d6a796f2da1795032e036d5
Previously, the code looped until the block was a leaf node as
indicated by a leading NUL. Now the code loops until it finds a block
in the range of leaf nodes for this segment, then reads it using
LeavesReader. This will make it easier to traverse a range of leaves
when doing a prefix search. (CVS 3884)
FossilOrigin-Name: 9466367d65f43d58020e709428268dc2ff98aa35
Prefix-searching will want to accumulate data across multiple leaves
in the segment; using LeavesReader instead of LeafReader is the first
step in that direction. (CVS 3881)
FossilOrigin-Name: 22ffdae4b6f3d0ea584dafa5268af7aa6fdcdc6e
the other, the code potentially tries to read past the end of the
doclist. http://www.sqlite.org/cvstrac/tktview?tn=2309 (CVS 3862)
FossilOrigin-Name: dfac6082e8ffc52a85c4906107a7fc0e1aa9df82
assumed that the row had values in all columns, sigh. Fixes bug
http://www.sqlite.org/cvstrac/tktview?tn=2289 . (CVS 3833)
FossilOrigin-Name: 81be7290a4db7b74a533aaf95c7389eb4bde6a88
updates happen within a single transaction, there was a lot of wasted
encode/decode overhead due to segment merges. This code buffers
updates in memory and writes out larger level-0 segments. It only
works when documents are presented in ascending order by docid.
Comparing a test set running 100 documents per transaction, the total
runtime is cut almost in half. (CVS 3751)
FossilOrigin-Name: 0229cba69698ab4b44f8583ef50a87c49422f8ec
assertions when this occurs, and it's almost certainly not the right
thing to do in the first place. (CVS 3746)
FossilOrigin-Name: f6c3abdc6c5e916e5366ba28fb1cd06ca3554303
Collector) now handles the case where PLWriter (Position List Writer)
needed a local buffer. Change to using the associated DLWriter
(Document List Writer) buffer, which reduces the number of memory
copies needed in doclist processing, and brings PLWriter operation in
line with DLWriter operation. (CVS 3707)
FossilOrigin-Name: d04fa3a13a84f49074c673b8ee2fb6541da061b5
Currently, PLWriter (Position List Writer) creates a locally-owned
DataBuffer to write into. This is necessary to support doclist
collection during tokenization, where there is no obvious buffer to
write output to, but is not necessary for the other users of PLWriter.
This change adds a DLCollector (Doc List Collector) structure to
handle the tokenization case.
Also fix a potential memory leak in writeZeroSegment(). In case of
error from leafWriterStep(), the DataBuffer dl was being leaked. (CVS 3706)
FossilOrigin-Name: 1b9918e20767aebc9c1e7523027139e5fbc12688
malloc/calloc/realloc appropriately, and use sizeof(var) instead of
sizeof(type) to make certain that we don't get a mismatch between
them as the code rots. (CVS 3693)
FossilOrigin-Name: fbc53da8c645935c74e49af2ab2cf447dc72ba4e
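The sizeof(var) idiom can be illustrated with a short sketch. The
Posting structure and postingArrayNew() are hypothetical and exist only
for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical structure, used only for illustration. */
typedef struct Posting {
  long long iDocid;
  int nPos;
} Posting;

/* Allocate a zeroed array of n Posting objects.  sizeof(*p) follows
** the declared type of p, so the allocation size stays correct even
** if that declaration is later changed to some other type. */
static Posting *postingArrayNew(int n){
  Posting *p;
  assert( n>0 );
  p = calloc((size_t)n, sizeof(*p));
  return p;
}
```

Had the call been written with sizeof(Posting), changing the type of p
would silently leave the size expression behind.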
When creating fts tables in an attached database, the backing tables
are created in database 'main'. This change propagates the
appropriate database name to the routines which build sql statements.
Note that I propagate the database name and table name separately. I
briefly considered just making the table name be "db.table", but it
didn't fit so well in the model used to store the table name and other
information, and having the db name passed separately seemed a bit
more transparent. (CVS 3631)
FossilOrigin-Name: 283385d20724f0144f38de89bd179715ee5e738b
Calling UPDATE against an fts table in a UTF-16 database inserts
corrupted data into the database. The UTF-8 data is being inserted
directly. This appears to happen because sqlite3_value_text()
destructively coerces a value to UTF-8, and it's never converted back
when updating the table. This works around the problem by rearranging
things so that the update happens before the coercion. (CVS 3596)
FossilOrigin-Name: 4f2ab4b6320ffc621900049b41f50bc30d76d7f5
The virtual table interface allows for a cursor to field multiple
xFilter() calls. For instance, if a join is done with a virtual
table, there could be a call for each row which potentially matches.
Unfortunately, fulltextFilter() assumes that it has a fresh cursor,
and overwrites a prepared statement and a malloc'ed pointer, resulting
in unfinalized statements and a memory leak.
This change hacks the code to manually clean up offending items in
fulltextFilter(), emphasis on "hacks", since it's a fragile fix
insofar as future additions to fulltext_cursor could continue to have
the problem. (CVS 3521)
FossilOrigin-Name: 18142fdb6d1f5bfdbb1155274502b9a602885fcb
that this is of marginal utility when encoding terms resulting from
regular English text, it turns out to be very useful when encoding
inputs with very large terms. (CVS 3520)
FossilOrigin-Name: c8151a998ec2423b417566823dc9957c7d5d782c
between leaf nodes, instead of storing the entire leftmost term of the
rightmost child, store only that portion of the leftmost term
necessary to distinguish it from the rightmost term of the leftmost
child. (CVS 3513)
FossilOrigin-Name: f6e0b080dcfaf554b2c05df5e7d4db69d012fba3
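Computing how much of the term needs to be stored can be sketched as
below. The function name distinguishingPrefixLen() is hypothetical, and
the sketch assumes terms are compared bytewise and that the left term
sorts strictly before the right term.

```c
#include <assert.h>

/* zLeft is the rightmost term of the left child, zRight the leftmost
** term of the right child, with zLeft < zRight in memcmp() order.
** Return the number of leading bytes of zRight that must be kept so
** that the stored prefix still sorts after zLeft and not after zRight. */
static int distinguishingPrefixLen(
  const char *zLeft, int nLeft,
  const char *zRight, int nRight
){
  int n = 0;
  /* Skip the bytes the two terms have in common. */
  while( n<nLeft && n<nRight && zLeft[n]==zRight[n] ) n++;
  /* zRight must extend past the common part, else zLeft>=zRight. */
  assert( n<nRight );
  /* One byte beyond the common part is enough to tell them apart. */
  return n+1;
}
```

For "apple" and "apricot" only "apr" would need to be stored, which is
where the savings for very large terms come from.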
LeafWriter to use empty data buffer (instead of empty term) to detect
an empty block. Code to validate interior nodes. Moderate revisions
to leaf-node and doclist validation. Recast leafWriterStep() in terms
of LeafWriterStepMerge(). (CVS 3512)
FossilOrigin-Name: f30771d5c7ef2b502af95d81a18796b75271ada4
where excessively large terms keep the tree from finding a single
root. A downside is that this could result in large interior nodes in
the presence of large terms, which may be prone to fragmentation,
though if the nodes were smaller that would translate into more levels
in the tree, which would also have that problem. (CVS 3510)
FossilOrigin-Name: 64b7e3406134ac4891113b9bb432ad97504268bb
( http://www.sqlite.org/cvstrac/chngview?cn=3486 ) broke test fts2a-5.3.
This change should make the expected result more obvious. (CVS 3489)
FossilOrigin-Name: cde383eb467de0d752e94a22cd2f890c2dc599cc
http://www.sqlite.org/cvstrac/tktview?tn=2036,35 describes some cases
where we were passing memset() a length which was the sizeof a
pointer, rather than the structure pointed to. Instead, wrap this
idiom up in CLEAR() and SCRAMBLE() macros. (CVS 3488)
FossilOrigin-Name: 5878add0839f9c5bec77caae2361ec20cb60b48b
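Macros of this kind might look roughly like the sketch below; the exact
definitions in the source may differ (for instance, the fill value and
any debug-only conditional are assumptions here). The Reader structure
and readerReset() are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Both macros take a pointer and operate on the whole object it points
** to: the size is always sizeof(*(b)), never sizeof(b). */
#define CLEAR(b)    memset((b), 0, sizeof(*(b)))
#define SCRAMBLE(b) memset((b), 0x55, sizeof(*(b)))

/* Hypothetical structure for illustration. */
typedef struct Reader {
  const char *pData;
  int nData;
  char aPad[32];
} Reader;

static void readerReset(Reader *pReader){
  assert( pReader!=0 );
  /* The buggy idiom was memset(pReader, 0, sizeof(pReader)), which
  ** clears only sizeof(Reader*) bytes of the structure. */
  CLEAR(pReader);
}
```

Wrapping the idiom in a macro means the sizeof expression is written
once, so individual call sites cannot get it wrong.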
distinguish reading from a static buffer from writing to a dynamic
buffer. This allows n-way doclist merging, and in-place merging of
segment leaf nodes, which together cut segment merge times in half. (CVS 3486)
FossilOrigin-Name: af5bfb986e39248abbfc6fff2e13c6f9e634a751
was writing out a segment made up of a single leaf node containing the
\0 header. LeafReader assumed that leaf nodes always contained at
least one term, so assertions would fail.
While it would be possible to support reading and merging empty
segments, there's no reason to do so. While this change could have
been done in writeZeroSegment(), I put it in leafWriterFlush() so that
it would work right if segmentMerge() created an empty segment, which
could happen with future changes to how deleted documents are handled. (CVS 3484)
FossilOrigin-Name: fed79beec7da24a26ae94494bdc0c98dd102bc06
updates. Groups of documents form segments which are encoded in a
btree layered over a table of blocks, with various tricks to make
merges fast. This performs 20x-25x faster than fts1 when loading the
Enron corpus, and is only slightly slower for queries. (CVS 3474)
FossilOrigin-Name: 85272b2f5394e37916afb1d509e7296810d976f5
docListRestrictColumn() generates a DL_POSITIONS doclist, which means
that after the first doclist is processed, the second doclist is
initialized as DL_POSITIONS, but with DL_POSITIONS_OFFSETS data.
(Note that DL_DEFAULT is now DL_POSITIONS, which masks this bug.) (CVS 3467)
FossilOrigin-Name: 144e3f11e22c6efd6f2d960599ab2d93542db406
We handle an UPDATE to a row by performing an UPDATE on the content
table and by building new position lists for each term which appears
in either the old or new versions of the row. We write these position
lists all at once; this is presumably more efficient than a delete
followed by an insert (which would first write empty position lists,
then new position lists). (CVS 3434)
FossilOrigin-Name: 757fa22400b363212b4d5f648bdc9fcbd9a7f152
method of a virtual table. In FTS1, use strcmp instead of strcasecmp.
Ticket #1981. (CVS 3428)
FossilOrigin-Name: efa8fb32a596c7232bb1754b3231e4f2421df75b
table. Offsets are retrieved using a special "offsets" function whose
first argument is the magic column. Snippets will ultimately be retrieved
in the same way. (CVS 3427)
FossilOrigin-Name: 5e35dc1ffadfe7fa47673d052501ee79903eead9
a string containing byte offset information for all matching terms.
Also added a large test case based on SQLite mailing list entries. (CVS 3417)
FossilOrigin-Name: f25cfa1aec0e4c1fe07176039a1b7f4e6a2c66ec
names in the spec that are SQL keywords or have special characters, etc.
Also added support for additional control lines. Column names can be
followed by a type specifier (which is ignored.) (CVS 3410)
FossilOrigin-Name: adb780e0dc8bc7dcd1102efbfa4bc17eefdf968e
For now, each posting list stores position/offset information for
multiple columns. We may implement separate posting lists for separate
columns at some future point. (CVS 3408)
FossilOrigin-Name: 366a70b086c817bddecd83053472ec76ef20f309
surprising impact on performance, I believe because it keeps the index
smaller (by keeping rowids smaller), and also because it improves
locality in the table (deleting a row means we've already touched the
pages leading to that rowid). (CVS 3405)
FossilOrigin-Name: 2f5f6290c9ef99c7b060aecc4d996c976c50c9d7
arguments during initialization. Recognized arguments include a
tokenizer selector and a list of virtual table columns. (CVS 3403)
FossilOrigin-Name: 227dc3feb537e6efd5b0c1d2dad40193db07d5aa
in order to provide better error reporting. This is an interface change
for virtual tables. Prior virtual table implementations will need to be
modified and recompiled. (CVS 3402)
FossilOrigin-Name: f44b8bae97b6872524580009c96d07391578c388
New items for a term are merged with the term's segment 0 doclist,
until that doclist exceeds CHUNK_MAX. Then the segments are merged in
exponential fashion, so that segment 1 contains approximately
2*CHUNK_MAX data, segment 2 4*CHUNK_MAX, and so on. (CVS 3398)
FossilOrigin-Name: b6b93a3325d3e728ca36255c0ff6e1f63e03b0ac
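One plausible reading of the exponential merge scheme is sketched below,
tracking only the size of each segment rather than real doclist data.
The value of CHUNK_MAX, the MAX_SEGMENT cap and the segmentAdd()
function are assumptions made for illustration and are not taken from
the source.

```c
#include <assert.h>

#define CHUNK_MAX 256    /* Assumed value, for illustration only */
#define MAX_SEGMENT 16   /* Hypothetical cap on the number of segments */

/* aSize[i] is the number of bytes of doclist data held in segment i.
** Add nNew bytes to segment 0, then fold any segment that has grown
** past its limit into the next one.  The limit doubles per segment. */
static void segmentAdd(long long aSize[MAX_SEGMENT], long long nNew){
  int i = 0;
  long long nLimit = CHUNK_MAX;
  assert( nNew>=0 );
  aSize[0] += nNew;
  while( i+1<MAX_SEGMENT && aSize[i]>nLimit ){
    aSize[i+1] += aSize[i];   /* Merge segment i into segment i+1 */
    aSize[i] = 0;
    nLimit *= 2;
    i++;
  }
}
```

Under this reading most inserts touch only the small segment 0, and the
larger segments are rewritten progressively less often.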
making sure we always pass around ptr/len, but there were a few places
where we actually relied on nul-termination.
An earlier change had additionally changed appropriate
sqlite3_bind_text() calls to sqlite3_bind_blob(). I've found that
this changes what's actually stored in the database, so backed those
changes out. Also (and this is weird), I found that I could no longer
do straightforward = queries against %_term.term at a command-line. (CVS 3379)
FossilOrigin-Name: 5844db1aa9c23a005c88104b084f68afb21891c7
strcspn() and a nul-terminated delimiter list, I just flagged
delimiters in an array and wrote things inline. Submitting this for
review separately because it's pretty standalone. (CVS 3378)
FossilOrigin-Name: 2631ceaeefaca3aa837e3b439399f13c51456914
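Flagging delimiters in an array, as opposed to calling strcspn() with a
nul-terminated delimiter list, could be sketched like this. The names
delimInit() and tokenSpan() are hypothetical and this is not the
tokenizer's actual code.

```c
#include <assert.h>

/* Mark each byte of zDelim[0..nDelim-1] as a delimiter in aDelim[]. */
static void delimInit(
  unsigned char aDelim[256], const char *zDelim, int nDelim
){
  int i;
  assert( nDelim>=0 );
  for(i=0; i<256; i++) aDelim[i] = 0;
  for(i=0; i<nDelim; i++) aDelim[(unsigned char)zDelim[i]] = 1;
}

/* Return the length of the leading run of non-delimiter bytes in
** z[0..n-1].  Unlike strcspn(), neither the input nor the delimiter
** set needs to be nul-terminated, and each input byte costs a single
** table lookup. */
static int tokenSpan(
  const unsigned char aDelim[256], const char *z, int n
){
  int i = 0;
  while( i<n && !aDelim[(unsigned char)z[i]] ) i++;
  return i;
}
```

Because both functions take ptr/len pairs, they fit code that passes
around strings without relying on nul-termination.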
so that all symbols with external linkage begin with "sqlite3Fts1", and
so that all filenames begin with "fts1". (CVS 3377)
FossilOrigin-Name: e1891f0dc58e5498a8845d8b9b5b092d7f9c7003
us to break any UTF-8 code points, unless they were already broken in
the input. (CVS 3376)
FossilOrigin-Name: 6c77c2d5e15e9d3efed3e274bc93cd5a4868f574
Drop unused pBlob/nBlob in index_insert_term().
Fix NULL deref in an assertion in docListUpdate() delete case.
Minor code tightening in docListUpdate(). (CVS 3367)
FossilOrigin-Name: a6fcf9101a831bf5f129c6045eabf30376d365dc