I finally got the time to put together some stuff for fti for
inclusion in pgsql. I have included a README which should be enough to start using it, plus a BENCH file that describes some timings I have done. Please have a look at it, and if you think everything is OK, I would like it seen included in the contrib-section of pgsql. I don't think I will do any more work in this, but maybe it inspires somebody else to improve on it. Maarten Boekhold
This commit is contained in:
parent
aac163336f
commit
8f0ca623ff
97
contrib/fulltextindex/BENCH
Normal file
97
contrib/fulltextindex/BENCH
Normal file
@ -0,0 +1,97 @@
|
||||
The following data was generated by the 'timings.sh' script included
|
||||
in this directory. It uses a very large table with music-related
|
||||
articles as a source for the fti-table. The tables used are:
|
||||
|
||||
product : contains product information : 540.429 rows
|
||||
artist_fti : fti table for product : 4.501.321 rows
|
||||
clustered : same as above, only clustered : 4.501.321 rows
|
||||
|
||||
A sequential scan of the artist_fti table (and thus also the clustered table)
|
||||
takes around 6:16 minutes....
|
||||
|
||||
Unfortunately I cannot probide anybody else with this test-date, since I
|
||||
am not allowed to redistribute the data (it's a database being sold by
|
||||
a couple of wholesale companies). Anyways, it's megabytes, so you probably
|
||||
wouldn't want it in this distribution anyways.
|
||||
|
||||
I haven't tested this with less data.
|
||||
|
||||
The test-machine is a Pentium 133, 64 MB, Linux 2.0.32 with the database
|
||||
on a 'QUANTUM BIGFOOT_CY4320A, 4134MB w/67kB Cache, CHS=8960/15/63'. This
|
||||
is a very slow disk.
|
||||
|
||||
The postmaster was running with:
|
||||
|
||||
postmaster -i -b /usr/local/pgsql/bin/postgres -S 1024 -B 256 \
|
||||
-o -o /usr/local/pgsql/debug-output -F -d 1
|
||||
|
||||
('trashing' means a 'select count(*) from artist_fti' to completely trash
|
||||
any disk-caches and buffers....)
|
||||
|
||||
TESTING ON UNCLUSTERED FTI
|
||||
trashing
|
||||
1: ^lapton and ^ric : 0.050u 0.000s 5m37.484s 0.01%
|
||||
2: ^lapton and ^ric : 0.050u 0.030s 5m32.447s 0.02%
|
||||
3: ^lapton and ^ric : 0.030u 0.020s 5m28.822s 0.01%
|
||||
trashing
|
||||
1: ^lling and ^tones : 0.020u 0.030s 0m54.313s 0.09%
|
||||
2: ^lling and ^tones : 0.040u 0.030s 0m5.057s 1.38%
|
||||
3: ^lling and ^tones : 0.010u 0.050s 0m2.072s 2.89%
|
||||
trashing
|
||||
1: ^aughan and ^evie : 0.020u 0.030s 0m26.241s 0.19%
|
||||
2: ^aughan and ^evie : 0.050u 0.010s 0m1.316s 4.55%
|
||||
3: ^aughan and ^evie : 0.030u 0.020s 0m1.029s 4.85%
|
||||
trashing
|
||||
1: ^lling : 0.040u 0.010s 0m55.104s 0.09%
|
||||
2: ^lling : 0.030u 0.030s 0m4.716s 1.27%
|
||||
3: ^lling : 0.040u 0.010s 0m2.157s 2.31%
|
||||
trashing
|
||||
1: ^stev and ^ray and ^vaugh : 0.040u 0.000s 1m5.630s 0.06%
|
||||
2: ^stev and ^ray and ^vaugh : 0.050u 0.020s 1m3.561s 0.11%
|
||||
3: ^stev and ^ray and ^vaugh : 0.050u 0.010s 1m5.923s 0.09%
|
||||
trashing
|
||||
1: ^lling (no join) : 0.050u 0.020s 0m24.139s 0.28%
|
||||
2: ^lling (no join) : 0.040u 0.040s 0m1.087s 7.35%
|
||||
3: ^lling (no join) : 0.020u 0.030s 0m0.772s 6.48%
|
||||
trashing
|
||||
1: ^vaughan (no join) : 0.040u 0.030s 0m9.075s 0.77%
|
||||
2: ^vaughan (no join) : 0.030u 0.010s 0m0.609s 6.56%
|
||||
3: ^vaughan (no join) : 0.040u 0.010s 0m0.503s 9.94%
|
||||
trashing
|
||||
1: ^rol (no join) : 0.020u 0.030s 0m49.898s 0.10%
|
||||
2: ^rol (no join) : 0.030u 0.020s 0m3.136s 1.59%
|
||||
3: ^rol (no join) : 0.030u 0.020s 0m1.231s 4.06%
|
||||
|
||||
TESTING ON CLUSTERED FTI
|
||||
trashing
|
||||
1: ^lapton and ^ric : 0.020u 0.020s 2m17.120s 0.02%
|
||||
2: ^lapton and ^ric : 0.030u 0.020s 2m11.767s 0.03%
|
||||
3: ^lapton and ^ric : 0.040u 0.010s 2m8.128s 0.03%
|
||||
trashing
|
||||
1: ^lling and ^tones : 0.020u 0.030s 0m18.179s 0.27%
|
||||
2: ^lling and ^tones : 0.030u 0.010s 0m1.897s 2.10%
|
||||
3: ^lling and ^tones : 0.040u 0.010s 0m1.619s 3.08%
|
||||
trashing
|
||||
1: ^aughan and ^evie : 0.070u 0.010s 0m11.765s 0.67%
|
||||
2: ^aughan and ^evie : 0.040u 0.010s 0m1.198s 4.17%
|
||||
3: ^aughan and ^evie : 0.030u 0.020s 0m0.872s 5.73%
|
||||
trashing
|
||||
1: ^lling : 0.040u 0.000s 0m28.623s 0.13%
|
||||
2: ^lling : 0.030u 0.010s 0m2.339s 1.70%
|
||||
3: ^lling : 0.030u 0.010s 0m1.975s 2.02%
|
||||
trashing
|
||||
1: ^stev and ^ray and ^vaugh : 0.020u 0.010s 0m17.667s 0.16%
|
||||
2: ^stev and ^ray and ^vaugh : 0.030u 0.010s 0m3.745s 1.06%
|
||||
3: ^stev and ^ray and ^vaugh : 0.030u 0.020s 0m3.439s 1.45%
|
||||
trashing
|
||||
1: ^lling (no join) : 0.020u 0.040s 0m2.218s 2.70%
|
||||
2: ^lling (no join) : 0.020u 0.020s 0m0.506s 7.90%
|
||||
3: ^lling (no join) : 0.030u 0.030s 0m0.510s 11.76%
|
||||
trashing
|
||||
1: ^vaughan (no join) : 0.040u 0.050s 0m2.048s 4.39%
|
||||
2: ^vaughan (no join) : 0.030u 0.020s 0m0.332s 15.04%
|
||||
3: ^vaughan (no join) : 0.040u 0.010s 0m0.318s 15.72%
|
||||
trashing
|
||||
1: ^rol (no join) : 0.020u 0.030s 0m2.384s 2.09%
|
||||
2: ^rol (no join) : 0.020u 0.030s 0m0.676s 7.39%
|
||||
3: ^rol (no join) : 0.020u 0.030s 0m0.697s 7.17%
|
24
contrib/fulltextindex/Makefile
Normal file
24
contrib/fulltextindex/Makefile
Normal file
@ -0,0 +1,24 @@
|
||||
SRCDIR= ../../src
|
||||
|
||||
include $(SRCDIR)/Makefile.global
|
||||
|
||||
CONTRIBDIR=$(LIBDIR)/contrib
|
||||
|
||||
CFLAGS+= $(CFLAGS_SL) -I$(SRCDIR)/include
|
||||
|
||||
TARGETS= fti$(DLSUFFIX)
|
||||
CLEANFILES+= $(TARGETS)
|
||||
CURDIR=`pwd`
|
||||
|
||||
all:: $(TARGETS)
|
||||
|
||||
%.sql: %.source
|
||||
rm -f $@; \
|
||||
sed -e "s:_CURRENTDIR_:$(CURDIR):g" \
|
||||
-e "s:_DLSUFFIX_:$(DLSUFFIX):g" < $< > $@
|
||||
|
||||
clean:
|
||||
rm -f $(TARGETS) *.o
|
||||
|
||||
dist:
|
||||
tar cf fti.tar README BENCH Makefile fti.c timings.sh
|
95
contrib/fulltextindex/README
Normal file
95
contrib/fulltextindex/README
Normal file
@ -0,0 +1,95 @@
|
||||
An attempt at some sort of Full Text Indexing for PostgreSQL.
|
||||
|
||||
The included software is an attempt to add some sort of Full Text Indexing
|
||||
support to PostgreSQL. I mean by this that we can ask questions like:
|
||||
|
||||
Give me all rows that have 'still' and 'nash' in the 'artist' field.
|
||||
|
||||
Ofcourse we can write this as:
|
||||
|
||||
select * from cds where artist ~* 'stills' and artist ~* 'nash';
|
||||
|
||||
But this does not use any indices, and therefore, if your database
|
||||
gets very large, it will not have very high performance (the above query
|
||||
requires at least one sequential scan, it probably takes 2 due to the
|
||||
self-join).
|
||||
|
||||
The approach used by this add-on is to define a trigger on the table and
|
||||
column you want to do this queries on. On every insert in the table, it
|
||||
takes the value in the specified column, breaks the text in this column
|
||||
up into pieces, and stores all sub-strings into another table, together
|
||||
with a reference to the row in the original table that contained this
|
||||
sub-string (it uses the oid of that row).
|
||||
|
||||
By now creating an index over the 'fti-table', we can search for
|
||||
substrings that occur in the original table. By making a join between
|
||||
the fti-table and the orig-table, we can get the actual rows we want
|
||||
(this can also be done by using subselects, and maybe there're other
|
||||
ways too).
|
||||
|
||||
As an example we take the previous query, where we assume we have all
|
||||
sub-strings in the table 'cds-fti':
|
||||
|
||||
select c.*
|
||||
from cds c, cds-fti f1, cds-fti f2
|
||||
where f1.string ~ '^stills' and
|
||||
f2.string ~ '^nash' and
|
||||
f1.id = c.oid and
|
||||
f2.id = c.oid ;
|
||||
|
||||
We can use the ~ (case-sensitive regular expression) here, because of
|
||||
the way sub-strings are built: from right to left, ie. house -> 'se' +
|
||||
'use' + 'ouse' + 'house'. If a ~ search starts with a ^ (match start of
|
||||
string), btree indices can be used by PostgreSQL.
|
||||
|
||||
Now, how do we create the trigger that maintains the fti-table? First: the
|
||||
fti-table should have the following schema:
|
||||
|
||||
create cds-fti ( string varchar(N), id oid );
|
||||
|
||||
Don't change the *names* of the columns, the varchar() can in fact also
|
||||
be of text-type. If you do use varchar, make sure the largest possible
|
||||
sub-string will fit.
|
||||
|
||||
The create the function that contains the trigger::
|
||||
|
||||
create function fti() returns opaque as '/path/to/fti.so' language 'C';
|
||||
|
||||
And finally define the trigger on the 'cds' table:
|
||||
|
||||
create trigger cds-fti-trigger after update or insert or delete on cds
|
||||
for each row execute procedure fti(cds-fti, artist);
|
||||
|
||||
Here, the trigger will be defined on table 'cds', it will create
|
||||
sub-strings from the field 'artist', and it will place those sub-strings
|
||||
in the table 'cds-fti'.
|
||||
|
||||
Now populate the table 'cds'. This will also populate the table 'cds-fti'.
|
||||
It's fastest to populate the table *before* you create the indices.
|
||||
|
||||
Before you start using the system, you should at least have the following
|
||||
indices:
|
||||
|
||||
create index cds-fti-idx on cds-fti (string, id);
|
||||
create index cds-oid-idx on cds (oid);
|
||||
|
||||
To get the most performance out of this, you should have 'cds-fti'
|
||||
clustered on disk, ie. all rows with the same sub-strings should be
|
||||
close to each other. There are 3 ways of doing this:
|
||||
|
||||
1. After you have created the indices, execute 'cluster cds-fti-idx on cds-fti'.
|
||||
2. Do a 'select * into tmp-table from cds-fti order by string' *before*
|
||||
you create the indices, then 'drop table cds-fti' and
|
||||
'alter table tmp-table rename to cds-fti'
|
||||
3. *Before* creating indices, dump the contents of the cds-fti table using
|
||||
'pg_dump -a -t cds-fti dbase-name', remove the \connect
|
||||
from the beginning and the \. from the end, and sort it using the
|
||||
UNIX 'sort' program, and reload the data.
|
||||
|
||||
Method 1 is very slow, 2 a lot faster, and for very large tables, 3 is
|
||||
preferred.
|
||||
|
||||
|
||||
|
||||
|
||||
|
381
contrib/fulltextindex/fti.c
Normal file
381
contrib/fulltextindex/fti.c
Normal file
@ -0,0 +1,381 @@
|
||||
#include "executor/spi.h"
|
||||
#include "commands/trigger.h"
|
||||
#include "c.h" /* endof() macro */
|
||||
#include <ctype.h> /* tolower */
|
||||
#include <stdio.h> /* debugging */
|
||||
|
||||
/*
|
||||
* Trigger function takes 2 arguments:
|
||||
1. relation in which to store the substrings
|
||||
2. field to extract substrings from
|
||||
|
||||
The relation in which to insert *must* have the following layout:
|
||||
|
||||
string varchar(#)
|
||||
id oid
|
||||
|
||||
Example:
|
||||
|
||||
create function fti() returns opaque as
|
||||
'/home/boekhold/src/postgresql-6.2/contrib/fti/fti.so' language 'c';
|
||||
|
||||
create table title_fti (string varchar(25), id oid);
|
||||
create index title_fti_idx on title_fti (string);
|
||||
|
||||
create trigger title_fti_trigger after update or insert or delete on product
|
||||
for each row execute procedure fti(title_fti, title);
|
||||
^^^^^^^^^
|
||||
where to store index in
|
||||
^^^^^
|
||||
which column to index
|
||||
|
||||
ofcourse don't forget to create an index on title_idx, column string, else
|
||||
you won't notice much speedup :)
|
||||
|
||||
After populating 'product', try something like:
|
||||
|
||||
select p.* from product p, title_fti f1, title_fti f2 where
|
||||
f1.string='slippery' and f2.string='wet' and f1.id=f2.id and p.oid=f1.id;
|
||||
*/
|
||||
|
||||
/*
|
||||
march 4 1998 Changed breakup() to return less substrings. Only breakup
|
||||
in word parts which are in turn shortened from the start
|
||||
of the word (ie. word, ord, rd)
|
||||
Did allocation of substring buffer outside of breakup()
|
||||
oct. 5 1997, fixed a bug in string breakup (where there are more nonalpha
|
||||
characters between words then 1).
|
||||
|
||||
oct 4-5 1997 implemented the thing, at least the basic functionallity
|
||||
of it all....
|
||||
*/
|
||||
|
||||
/* IMPROVEMENTS:
|
||||
|
||||
save a plan for deletes
|
||||
create a function that will make the index *after* we have populated
|
||||
the main table (probably first delete all contents to be sure there's
|
||||
nothing in it, then re-populate the fti-table)
|
||||
|
||||
can we do something with operator overloading or a seperate function
|
||||
that can build the final query automatigally?
|
||||
*/
|
||||
|
||||
HeapTuple fti(void);
|
||||
char *breakup(char*, char*);
|
||||
bool is_stopword(char*);
|
||||
|
||||
bool new_tuple = false;
|
||||
|
||||
|
||||
/* THIS LIST MUST BE IN SORTED ORDER, A BINARY SEARCH IS USED!!!! */
|
||||
char *StopWords[] = { /* list of words to skip in indexing */
|
||||
"no"
|
||||
"the",
|
||||
"yes",
|
||||
};
|
||||
|
||||
/* stuff for caching query-plans, stolen from contrib/spi/\*.c */
|
||||
typedef struct
|
||||
{
|
||||
char *ident;
|
||||
int nplans;
|
||||
void **splan;
|
||||
} EPlan;
|
||||
|
||||
static EPlan *InsertPlans = NULL;
|
||||
static EPlan *DeletePlans = NULL;
|
||||
static int nInsertPlans = 0;
|
||||
static int nDeletePlans = 0;
|
||||
|
||||
static EPlan *find_plan(char *ident, EPlan ** eplan, int *nplans);
|
||||
|
||||
/***********************************************************************/
|
||||
HeapTuple
|
||||
fti()
|
||||
{
|
||||
Trigger *trigger; /* to get trigger name */
|
||||
int nargs; /* # of arguments */
|
||||
char **args; /* arguments */
|
||||
char *relname; /* triggered relation name */
|
||||
Relation rel; /* triggered relation */
|
||||
char *indexname; /* name of table for substrings */
|
||||
HeapTuple rettuple = NULL;
|
||||
TupleDesc tupdesc; /* tuple description */
|
||||
bool isinsert=false;
|
||||
bool isdelete=false;
|
||||
int ret;
|
||||
char query[8192];
|
||||
Oid oid;
|
||||
/*
|
||||
FILE *debug;
|
||||
*/
|
||||
|
||||
/*
|
||||
debug = fopen("/dev/xconsole", "w");
|
||||
fprintf(debug, "FTI: entered function\n");
|
||||
fflush(debug);
|
||||
*/
|
||||
|
||||
if (!CurrentTriggerData)
|
||||
elog(ERROR, "Full Text Indexing: triggers are not initialized");
|
||||
if (TRIGGER_FIRED_FOR_STATEMENT(CurrentTriggerData->tg_event))
|
||||
elog(ERROR, "Full Text Indexing: can't process STATEMENT events");
|
||||
if (TRIGGER_FIRED_BEFORE(CurrentTriggerData->tg_event))
|
||||
elog(ERROR, "Full Text Indexing: must be fired AFTER event");
|
||||
|
||||
if (TRIGGER_FIRED_BY_INSERT(CurrentTriggerData->tg_event))
|
||||
isinsert=true;
|
||||
if (TRIGGER_FIRED_BY_UPDATE(CurrentTriggerData->tg_event))
|
||||
{ isdelete=true;isinsert=true;}
|
||||
if (TRIGGER_FIRED_BY_DELETE(CurrentTriggerData->tg_event))
|
||||
isdelete=true;
|
||||
|
||||
trigger = CurrentTriggerData->tg_trigger;
|
||||
rel = CurrentTriggerData->tg_relation;
|
||||
relname = SPI_getrelname(rel);
|
||||
rettuple=CurrentTriggerData->tg_trigtuple;
|
||||
if (isdelete&&isinsert) /* is an UPDATE */
|
||||
rettuple=CurrentTriggerData->tg_newtuple;
|
||||
|
||||
CurrentTriggerData = NULL; /* invalidate 'normal' calls to this function */
|
||||
|
||||
if ((ret = SPI_connect()) <0)
|
||||
elog(ERROR,"Full Text Indexing: SPI_connect failed, returned %d\n",ret);
|
||||
|
||||
nargs = trigger->tgnargs;
|
||||
if (nargs != 2)
|
||||
elog(ERROR, "Full Text Indexing: trigger can only have 2 arguments");
|
||||
|
||||
args = trigger->tgargs;
|
||||
indexname = args[0];
|
||||
tupdesc = rel->rd_att; /* what the tuple looks like (?) */
|
||||
|
||||
/* get oid of current tuple, needed by all, so place here */
|
||||
oid = rettuple->t_oid;
|
||||
if (!OidIsValid(oid))
|
||||
elog(ERROR,"Full Text Indexing: oid of current tuple is NULL");
|
||||
|
||||
if (isdelete) {
|
||||
void *pplan;
|
||||
Oid *argtypes;
|
||||
Datum values[1];
|
||||
EPlan *plan;
|
||||
|
||||
sprintf(query, "D%s$%s", args[0], args[1]);
|
||||
plan = find_plan(query, &DeletePlans, &nDeletePlans);
|
||||
if (plan->nplans <= 0) {
|
||||
argtypes = (Oid *)palloc(sizeof(Oid));
|
||||
|
||||
argtypes[0] = OIDOID;
|
||||
|
||||
sprintf(query, "DELETE FROM %s WHERE id = $1", indexname);
|
||||
pplan = SPI_prepare(query, 1, argtypes);
|
||||
if (!pplan)
|
||||
elog(ERROR, "Full Text Indexing: SPI_prepare returned NULL "
|
||||
"in delete");
|
||||
pplan = SPI_saveplan(pplan);
|
||||
if (pplan == NULL)
|
||||
elog(ERROR, "Full Text Indexing: SPI_saveplan returned NULL "
|
||||
"in delete");
|
||||
|
||||
plan->splan = (void **)malloc(sizeof(void*));
|
||||
*(plan->splan) = pplan;
|
||||
plan->nplans = 1;
|
||||
}
|
||||
|
||||
values[0] = oid;
|
||||
|
||||
ret = SPI_execp(*(plan->splan), values, NULL, 0);
|
||||
if (ret != SPI_OK_DELETE)
|
||||
elog(ERROR, "Full Text Indexing: error executing plan in delete");
|
||||
}
|
||||
|
||||
if (isinsert) {
|
||||
char *substring, *column;
|
||||
void *pplan;
|
||||
Oid *argtypes;
|
||||
Datum values[2];
|
||||
int colnum;
|
||||
struct varlena *data;
|
||||
EPlan *plan;
|
||||
|
||||
sprintf(query, "I%s$%s", args[0], args[1]);
|
||||
plan = find_plan(query, &InsertPlans, &nInsertPlans);
|
||||
|
||||
/* no plan yet, so allocate mem for argtypes */
|
||||
if (plan->nplans <= 0) {
|
||||
argtypes = (Oid *)palloc(2*sizeof(Oid));
|
||||
|
||||
argtypes[0] = VARCHAROID; /*create table t_name
|
||||
(string varchar, */
|
||||
argtypes[1] = OIDOID; /* id oid); */
|
||||
|
||||
/* prepare plan to gain speed */
|
||||
sprintf(query, "INSERT INTO %s (string, id) VALUES ($1, $2)",
|
||||
indexname);
|
||||
pplan = SPI_prepare(query, 2, argtypes);
|
||||
if (!pplan)
|
||||
elog(ERROR, "Full Text Indexing: SPI_prepare returned NULL "
|
||||
"in insert");
|
||||
|
||||
pplan = SPI_saveplan(pplan);
|
||||
if (pplan == NULL)
|
||||
elog(ERROR, "Full Text Indexing: SPI_saveplan returned NULL"
|
||||
" in insert");
|
||||
|
||||
plan->splan = (void **)malloc(sizeof(void*));
|
||||
*(plan->splan) = pplan;
|
||||
plan->nplans = 1;
|
||||
}
|
||||
|
||||
|
||||
/* prepare plan for query */
|
||||
colnum=SPI_fnumber(tupdesc, args[1]);
|
||||
if (colnum == SPI_ERROR_NOATTRIBUTE)
|
||||
elog(ERROR, "Full Text Indexing: column '%s' of '%s' not found",
|
||||
args[1], args[0]);
|
||||
|
||||
/* Get the char* representation of the column with name args[1] */
|
||||
column = SPI_getvalue(rettuple, tupdesc, colnum);
|
||||
|
||||
if (column) { /* make sure we don't try to index NULL's */
|
||||
char *buff;
|
||||
char *string = column;
|
||||
|
||||
while(*string != '\0') { /* placed 'really' inline. */
|
||||
*string = tolower(*string); /* some compilers will choke */
|
||||
string++; /* on 'inline' keyword */
|
||||
}
|
||||
|
||||
data = (struct varlena*)palloc(sizeof(int32)+strlen(column)+1);
|
||||
buff = palloc(strlen(column) + 1);
|
||||
/* saves lots of calls in while-loop and in breakup()*/
|
||||
|
||||
new_tuple=true;
|
||||
while ((substring = breakup(column, buff))) {
|
||||
int l;
|
||||
|
||||
l = strlen(substring);
|
||||
|
||||
data->vl_len = l+sizeof(int32);
|
||||
memcpy(VARDATA(data), substring, l);
|
||||
values[0] = PointerGetDatum(data);
|
||||
values[1] = oid;
|
||||
|
||||
ret = SPI_execp(*(plan->splan), values, NULL, 0);
|
||||
if (ret != SPI_OK_INSERT)
|
||||
elog(ERROR, "Full Text Indexing: error executing plan "
|
||||
"in insert");
|
||||
}
|
||||
pfree(buff);
|
||||
pfree(data);
|
||||
}
|
||||
}
|
||||
|
||||
SPI_finish();
|
||||
return (rettuple);
|
||||
}
|
||||
|
||||
char *breakup(char *string, char *substring)
|
||||
{
|
||||
static char *last_start;
|
||||
static char *cur_pos;
|
||||
|
||||
if (new_tuple)
|
||||
{
|
||||
cur_pos=last_start=&string[strlen(string)-1];
|
||||
new_tuple=false; /* don't initialize this next time */
|
||||
}
|
||||
|
||||
while (cur_pos > string) /* don't read before start of 'string' */
|
||||
{
|
||||
/* skip pieces at the end of a string that are not
|
||||
alfa-numeric (ie. 'string$%^&', last_start first points to
|
||||
'&', and after this to 'g' */
|
||||
if (!isalnum((int)*last_start)) {
|
||||
while (!isalnum((int)*last_start) &&
|
||||
last_start > string)
|
||||
last_start--;
|
||||
cur_pos=last_start;
|
||||
}
|
||||
|
||||
cur_pos--; /* substrings are at minimum 2 characters long */
|
||||
|
||||
if (isalnum((int)*cur_pos))
|
||||
{
|
||||
/* Houston, we have a substring! :) */
|
||||
memcpy(substring, cur_pos, last_start - cur_pos + 1);
|
||||
substring[last_start-cur_pos+1]='\0';
|
||||
if (!is_stopword(substring)) return substring;
|
||||
}
|
||||
else
|
||||
{
|
||||
last_start=cur_pos-1;
|
||||
cur_pos = last_start;
|
||||
}
|
||||
}
|
||||
|
||||
return NULL; /* we've processed all of 'string' */
|
||||
}
|
||||
|
||||
/* copied from src/backend/parser/keywords.c and adjusted for our situation*/
|
||||
bool
|
||||
is_stopword(char *text)
|
||||
{
|
||||
char **StopLow; /* for list of stop-words */
|
||||
char **StopHigh;
|
||||
char **StopMiddle;
|
||||
unsigned int difference;
|
||||
|
||||
StopLow = &StopWords[0]; /* initialize stuff for binary search */
|
||||
StopHigh = endof(StopWords);
|
||||
|
||||
while (StopLow <= StopHigh)
|
||||
{
|
||||
StopMiddle = StopLow + (StopHigh - StopLow) / 2;
|
||||
difference = strcmp(*StopMiddle, text);
|
||||
if (difference == 0)
|
||||
return (true);
|
||||
else if (difference < 0)
|
||||
StopLow = StopMiddle + 1;
|
||||
else
|
||||
StopHigh = StopMiddle - 1;
|
||||
}
|
||||
|
||||
return (false);
|
||||
}
|
||||
|
||||
/* for caching of query plans, stolen from contrib/spi/\*.c */
|
||||
static EPlan *
|
||||
find_plan(char *ident, EPlan ** eplan, int *nplans)
|
||||
{
|
||||
EPlan *newp;
|
||||
int i;
|
||||
|
||||
if (*nplans > 0)
|
||||
{
|
||||
for (i = 0; i < *nplans; i++)
|
||||
{
|
||||
if (strcmp((*eplan)[i].ident, ident) == 0)
|
||||
break;
|
||||
}
|
||||
if (i != *nplans)
|
||||
return (*eplan + i);
|
||||
*eplan = (EPlan *) realloc(*eplan, (i + 1) * sizeof(EPlan));
|
||||
newp = *eplan + i;
|
||||
}
|
||||
else
|
||||
{
|
||||
newp = *eplan = (EPlan *) malloc(sizeof(EPlan));
|
||||
(*nplans) = i = 0;
|
||||
}
|
||||
|
||||
newp->ident = (char *) malloc(strlen(ident) + 1);
|
||||
strcpy(newp->ident, ident);
|
||||
newp->nplans = 0;
|
||||
newp->splan = NULL;
|
||||
(*nplans)++;
|
||||
|
||||
return (newp);
|
||||
}
|
350
contrib/fulltextindex/timings.sh
Executable file
350
contrib/fulltextindex/timings.sh
Executable file
@ -0,0 +1,350 @@
|
||||
#!/bin/sh
|
||||
|
||||
PATH=${PATH}:/usr/local/pgsql/bin
|
||||
TIMEFORMAT="%3Uu %3Ss %lR %P%%"
|
||||
export PATH TIMEFORMAT
|
||||
|
||||
case "$1" in
|
||||
-n)
|
||||
trashing=0
|
||||
;;
|
||||
*)
|
||||
trashing=1
|
||||
;;
|
||||
esac
|
||||
|
||||
echo "TESTING ON UNCLUSTERED FTI"
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(p.oid) from product p, artist_fti f1, artist_fti f2
|
||||
where
|
||||
f1.string ~ '^lapton' and f2.string ~ '^ric' and
|
||||
f1.id=p.oid and f2.id=p.oid;"
|
||||
|
||||
echo -n "1: ^lapton and ^ric : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^lapton and ^ric : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^lapton and ^ric : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(p.oid) from product p, artist_fti f1, artist_fti f2
|
||||
where
|
||||
f1.string ~ '^lling' and f2.string ~ '^tones' and
|
||||
f1.id=p.oid and f2.id=p.oid;"
|
||||
|
||||
echo -n "1: ^lling and ^tones : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^lling and ^tones : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^lling and ^tones : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(p.oid) from product p, artist_fti f1, artist_fti f2
|
||||
where
|
||||
f1.string ~ '^aughan' and f2.string ~ '^evie' and
|
||||
f1.id=p.oid and f2.id=p.oid;"
|
||||
|
||||
echo -n "1: ^aughan and ^evie : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^aughan and ^evie : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^aughan and ^evie : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(p.oid) from product p, artist_fti f1
|
||||
where
|
||||
f1.string ~ '^lling' and
|
||||
p.oid=f1.id;"
|
||||
|
||||
echo -n "1: ^lling : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^lling : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^lling : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(p.oid) from product p, artist_fti f1, artist_fti f2, artist_fti f3
|
||||
where
|
||||
f1.string ~ '^stev' and
|
||||
f2.string ~ '^ray' and
|
||||
f3.string ~ '^vaugh' and
|
||||
p.oid=f1.id and p.oid=f2.id and p.oid=f3.id;"
|
||||
|
||||
echo -n "1: ^stev and ^ray and ^vaugh : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^stev and ^ray and ^vaugh : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^stev and ^ray and ^vaugh : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(*) from artist_fti where string ~ '^lling';"
|
||||
|
||||
echo -n "1: ^lling (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^lling (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^lling (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(*) from artist_fti where string ~ '^vaughan';"
|
||||
|
||||
echo -n "1: ^vaughan (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^vaughan (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^vaughan (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(*) from artist_fti where string ~ '^rol';"
|
||||
|
||||
echo -n "1: ^rol (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^rol (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^rol (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
echo
|
||||
echo "TESTING ON CLUSTERED FTI"
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(p.oid) from product p, clustered f1, clustered f2
|
||||
where
|
||||
f1.string ~ '^lapton' and f2.string ~ '^ric' and
|
||||
f1.id=p.oid and f2.id=p.oid;"
|
||||
|
||||
echo -n "1: ^lapton and ^ric : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^lapton and ^ric : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^lapton and ^ric : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(p.oid) from product p, clustered f1, clustered f2
|
||||
where
|
||||
f1.string ~ '^lling' and f2.string ~ '^tones' and
|
||||
f1.id=p.oid and f2.id=p.oid;"
|
||||
|
||||
echo -n "1: ^lling and ^tones : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^lling and ^tones : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^lling and ^tones : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(p.oid) from product p, clustered f1, clustered f2
|
||||
where
|
||||
f1.string ~ '^aughan' and f2.string ~ '^evie' and
|
||||
f1.id=p.oid and f2.id=p.oid;"
|
||||
|
||||
echo -n "1: ^aughan and ^evie : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^aughan and ^evie : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^aughan and ^evie : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(p.oid) from product p, clustered f1
|
||||
where
|
||||
f1.string ~ '^lling' and
|
||||
p.oid=f1.id;"
|
||||
|
||||
echo -n "1: ^lling : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^lling : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^lling : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(p.oid) from product p, clustered f1, clustered f2, clustered f3
|
||||
where
|
||||
f1.string ~ '^stev' and
|
||||
f2.string ~ '^ray' and
|
||||
f3.string ~ '^vaugh' and
|
||||
p.oid=f1.id and p.oid=f2.id and p.oid=f3.id;"
|
||||
|
||||
echo -n "1: ^stev and ^ray and ^vaugh : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^stev and ^ray and ^vaugh : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^stev and ^ray and ^vaugh : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(*) from clustered where string ~ '^lling';"
|
||||
|
||||
echo -n "1: ^lling (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^lling (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^lling (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(*) from clustered where string ~ '^vaughan';"
|
||||
|
||||
echo -n "1: ^vaughan (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^vaughan (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^vaughan (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
# trash disk
|
||||
if [ $trashing = 1 ]
|
||||
then
|
||||
echo "trashing"
|
||||
psql -q -n -o /dev/null -c "select count(*) from product;" test
|
||||
else
|
||||
echo
|
||||
fi
|
||||
|
||||
Q="select count(*) from clustered where string ~ '^rol';"
|
||||
|
||||
echo -n "1: ^rol (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "2: ^rol (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
echo -n "3: ^rol (no join) : "
|
||||
time psql -q -n -o /dev/null -c "$Q" test
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
Loading…
x
Reference in New Issue
Block a user