mirror of https://github.com/postgres/postgres
483 lines
17 KiB
Plaintext
483 lines
17 KiB
Plaintext
contrib/ltree module
|
|
|
|
ltree - is a PostgreSQL contrib module which contains implementation of data
|
|
types, indexed access methods and queries for data organized as a tree-like
|
|
structures.
|
|
This module will works for PostgreSQL version 7.3.
|
|
(version for 7.2 version is available from http://www.sai.msu.su/~megera/postgres/gist/ltree/ltree-7.2.tar.gz)
|
|
-------------------------------------------------------------------------------
|
|
All work was done by Teodor Sigaev (teodor@stack.net) and Oleg Bartunov
|
|
(oleg@sai.msu.su). See http://www.sai.msu.su/~megera/postgres/gist for
|
|
additional information. Authors would like to thank Eugeny Rodichev for helpful
|
|
discussions. Comments and bug reports are welcome.
|
|
-------------------------------------------------------------------------------
|
|
|
|
LEGAL NOTICES: This module is released under BSD license (as PostgreSQL
|
|
itself). This work was done in framework of Russian Scientific Network and
|
|
partially supported by Russian Foundation for Basic Research and Stack Group.
|
|
-------------------------------------------------------------------------------
|
|
|
|
MOTIVATION
|
|
|
|
This is a placeholder for introduction to the problem. Hope, people reading
|
|
this document doesn't need it too much :-)
|
|
|
|
DEFINITIONS
|
|
|
|
A label of a node is a sequence of one or more words separated by blank
|
|
character '_' and containing letters and digits ( for example, [a-zA-Z0-9] for
|
|
C locale). The length of a label is limited by 256 bytes.
|
|
|
|
Example: 'Countries', 'Personal_Services'
|
|
|
|
A label path of a node is a sequence of one or more dot-separated labels
|
|
l1.l2...ln, represents path from root to the node. The length of a label path
|
|
is limited by 65Kb, but size <= 2Kb is preferrable. We consider it's not a
|
|
strict limitation ( maximal size of label path for DMOZ catalogue - http://
|
|
www.dmoz.org, is about 240 bytes !)
|
|
|
|
Example: 'Top.Countries.Europe.Russia'
|
|
|
|
We introduce several datatypes:
|
|
|
|
ltree
|
|
- is a datatype for label path.
|
|
|
|
ltree[]
|
|
- is a datatype for arrays of ltree.
|
|
|
|
lquery
|
|
- is a path expression that has regular expression in the label path and
|
|
used for ltree matching. Star symbol (*) is used to specify any number of
|
|
labels (levels) and could be used at the beginning and the end of lquery,
|
|
for example, '*.Europe.*'.
|
|
|
|
The following quantifiers are recognized for '*' (like in Perl):
|
|
|
|
{n} Match exactly n levels
|
|
{n,} Match at least n levels
|
|
{n,m} Match at least n but not more than m levels
|
|
{,m} Match at maximum m levels (eq. to {0,m})
|
|
|
|
It is possible to use several modifiers at the end of a label:
|
|
|
|
|
|
@ Do case-insensitive label matching
|
|
* Do prefix matching for a label
|
|
% Don't account word separator '_' in label matching, that is
|
|
'Russian%' would match 'Russian_nations', but not 'Russian'
|
|
|
|
lquery could contains logical '!' (NOT) at the beginning of the label and '
|
|
|' (OR) to specify possible alternatives for label matching.
|
|
|
|
Example of lquery:
|
|
|
|
|
|
Top.*{0,2}.sport*@.!football|tennis.Russ*|Spain
|
|
a) b) c) d) e)
|
|
|
|
A label path should
|
|
+ a) begins from a node with label 'Top'
|
|
+ b) and following zero or 2 labels until
|
|
+ c) a node with label beginning from case-insensitive prefix 'sport'
|
|
+ d) following node with label not matched 'football' or 'tennis' and
|
|
+ e) ends on node with label beginning from 'Russ' or strictly matched
|
|
'Spain'.
|
|
|
|
ltxtquery
|
|
- is a datatype for label searching (like type 'query' for full text
|
|
searching, see contrib/tsearch). It's possible to use modifiers @,%,* at
|
|
the end of word. The meaning of modifiers are the same as for lquery.
|
|
|
|
Example: 'Europe & Russia*@ & !Transportation'
|
|
|
|
Search paths contain words 'Europe' and 'Russia*' (case-insensitive) and
|
|
not 'Transportation'. Notice, the order of words as they appear in label
|
|
path is not important !
|
|
|
|
OPERATIONS
|
|
|
|
The following operations are defined for type ltree:
|
|
|
|
<,>,<=,>=,=, <>
|
|
- have their usual meanings. Comparison is doing in the order of direct
|
|
tree traversing, children of a node are sorted lexicographic.
|
|
ltree @> ltree
|
|
- returns TRUE if left argument is an ancestor of right argument (or
|
|
equal).
|
|
ltree <@ ltree
|
|
- returns TRUE if left argument is a descendant of right argument (or
|
|
equal).
|
|
ltree ~ lquery, lquery ~ ltree
|
|
- return TRUE if node represented by ltree satisfies lquery.
|
|
ltree @ ltxtquery, ltxtquery @ ltree
|
|
- return TRUE if node represented by ltree satisfies ltxtquery.
|
|
ltree || ltree, ltree || text, text || ltree
|
|
- return concatenated ltree.
|
|
|
|
Operations for arrays of ltree (ltree[]):
|
|
|
|
ltree[] @> ltree, ltree <@ ltree[]
|
|
- returns TRUE if array ltree[] contains an ancestor of ltree.
|
|
ltree @> ltree[], ltree[] <@ ltree
|
|
- returns TRUE if array ltree[] contains a descendant of ltree.
|
|
ltree[] ~ lquery, lquery ~ ltree[]
|
|
- returns TRUE if array ltree[] contains label paths matched lquery.
|
|
ltree[] @ ltxtquery, ltxtquery @ ltree[]
|
|
- returns TRUE if array ltree[] contains label paths matched ltxtquery
|
|
(full text search).
|
|
ltree[] ?@> ltree, ltree ?<@ ltree[], ltree[] ?~ lquery, ltree[] ?@ ltxtquery
|
|
- returns first element of array ltree[] satisfies corresponding condition
|
|
and NULL in vice versa.
|
|
|
|
REMARK
|
|
|
|
Operations <@, @>, @ and ~ have analogues - ^<@, ^@>, ^@, ^~, which doesn't use
|
|
indices !
|
|
|
|
INDICES
|
|
|
|
Various indices could be created to speed up execution of operations:
|
|
|
|
* B-tree index over ltree:
|
|
<, <=, =, =>, >
|
|
* GiST index over ltree:
|
|
<, <=, =, =>, >, @>, <@, @, ~
|
|
Example:
|
|
create index path_gist_idx on test using gist_ltree_ops (path);
|
|
* GiST index over ltree[]:
|
|
ltree[]<@ ltree, ltree @> ltree[], @, ~.
|
|
Example:
|
|
create index path_gist_idx on test using gist__ltree_ops (array_path);
|
|
Notices: This index is lossy.
|
|
|
|
FUNCTIONS
|
|
|
|
ltree subltree
|
|
ltree subltree(ltree, start, end)
|
|
returns subpath of ltree from start (inclusive) until the end.
|
|
# select subltree('Top.Child1.Child2',1,2);
|
|
subltree
|
|
--------
|
|
Child1
|
|
ltree subpath
|
|
ltree subpath(ltree, OFFSET,LEN)
|
|
ltree subpath(ltree, OFFSET)
|
|
returns subpath of ltree from OFFSET (inclusive) with length LEN.
|
|
If OFFSET is negative returns subpath starts that far from the end
|
|
of the path. If LENGTH is omitted, returns everything to the end
|
|
of the path. If LENGTH is negative, leaves that many labels off
|
|
the end of the path.
|
|
# select subpath('Top.Child1.Child2',1,2);
|
|
subpath
|
|
-------
|
|
Child1.Child2
|
|
|
|
# select subpath('Top.Child1.Child2',-2,1);
|
|
subpath
|
|
---------
|
|
Child1
|
|
int4 nlevel
|
|
|
|
int4 nlevel(ltree) - returns level of the node.
|
|
# select nlevel('Top.Child1.Child2');
|
|
nlevel
|
|
--------
|
|
3
|
|
Note, that arguments start, end, OFFSET, LEN have meaning of level of the
|
|
node !
|
|
|
|
ltree lca(ltree,ltree,...) (up to 8 arguments)
|
|
ltree lca(ltree[])
|
|
Returns Lowest Common Ancestor (lca)
|
|
# select lca('1.2.2.3','1.2.3.4.5.6');
|
|
lca
|
|
-----
|
|
1.2
|
|
# select lca('{la.2.3,1.2.3.4.5.6}') is null;
|
|
?column?
|
|
----------
|
|
f
|
|
|
|
|
|
INSTALLATION
|
|
|
|
cd contrib/ltree
|
|
make
|
|
make install
|
|
make installcheck
|
|
|
|
EXAMPLE OF USAGE
|
|
|
|
createdb ltreetest
|
|
psql ltreetest < /usr/local/pgsql/share/contrib/ltree.sql
|
|
psql ltreetest < ltreetest.sql
|
|
|
|
Now, we have a database ltreetest populated with a data describing hierarchy
|
|
shown below:
|
|
|
|
|
|
TOP
|
|
/ | \
|
|
Science Hobbies Collections
|
|
/ | \
|
|
Astronomy Amateurs_Astronomy Pictures
|
|
/ \ |
|
|
Astrophysics Cosmology Astronomy
|
|
/ | \
|
|
Galaxies Stars Astronauts
|
|
|
|
Inheritance:
|
|
|
|
ltreetest=# select path from test where path <@ 'Top.Science';
|
|
path
|
|
------------------------------------
|
|
Top.Science
|
|
Top.Science.Astronomy
|
|
Top.Science.Astronomy.Astrophysics
|
|
Top.Science.Astronomy.Cosmology
|
|
(4 rows)
|
|
|
|
Matching:
|
|
|
|
ltreetest=# select path from test where path ~ '*.Astronomy.*';
|
|
path
|
|
-----------------------------------------------
|
|
Top.Science.Astronomy
|
|
Top.Science.Astronomy.Astrophysics
|
|
Top.Science.Astronomy.Cosmology
|
|
Top.Collections.Pictures.Astronomy
|
|
Top.Collections.Pictures.Astronomy.Stars
|
|
Top.Collections.Pictures.Astronomy.Galaxies
|
|
Top.Collections.Pictures.Astronomy.Astronauts
|
|
(7 rows)
|
|
ltreetest=# select path from test where path ~ '*.!pictures@.*.Astronomy.*';
|
|
path
|
|
------------------------------------
|
|
Top.Science.Astronomy
|
|
Top.Science.Astronomy.Astrophysics
|
|
Top.Science.Astronomy.Cosmology
|
|
(3 rows)
|
|
|
|
Full text search:
|
|
|
|
ltreetest=# select path from test where path @ 'Astro*% & !pictures@';
|
|
path
|
|
------------------------------------
|
|
Top.Science.Astronomy
|
|
Top.Science.Astronomy.Astrophysics
|
|
Top.Science.Astronomy.Cosmology
|
|
Top.Hobbies.Amateurs_Astronomy
|
|
(4 rows)
|
|
|
|
ltreetest=# select path from test where path @ 'Astro* & !pictures@';
|
|
path
|
|
------------------------------------
|
|
Top.Science.Astronomy
|
|
Top.Science.Astronomy.Astrophysics
|
|
Top.Science.Astronomy.Cosmology
|
|
(3 rows)
|
|
|
|
Using Functions:
|
|
|
|
ltreetest=# select subpath(path,0,2)||'Space'||subpath(path,2) from test where path <@ 'Top.Science.Astronomy';
|
|
?column?
|
|
------------------------------------------
|
|
Top.Science.Space.Astronomy
|
|
Top.Science.Space.Astronomy.Astrophysics
|
|
Top.Science.Space.Astronomy.Cosmology
|
|
(3 rows)
|
|
We could create SQL-function:
|
|
CREATE FUNCTION ins_label(ltree, int4, text) RETURNS ltree
|
|
AS 'select subpath($1,0,$2) || $3 || subpath($1,$2);'
|
|
LANGUAGE SQL WITH (ISCACHABLE);
|
|
|
|
and previous select could be rewritten as:
|
|
|
|
ltreetest=# select ins_label(path,2,'Space') from test where path <@ 'Top.Science.Astronomy';
|
|
ins_label
|
|
------------------------------------------
|
|
Top.Science.Space.Astronomy
|
|
Top.Science.Space.Astronomy.Astrophysics
|
|
Top.Science.Space.Astronomy.Cosmology
|
|
(3 rows)
|
|
|
|
Or with another arguments:
|
|
|
|
CREATE FUNCTION ins_label(ltree, ltree, text) RETURNS ltree
|
|
AS 'select subpath($1,0,nlevel($2)) || $3 || subpath($1,nlevel($2));'
|
|
LANGUAGE SQL WITH (ISCACHABLE);
|
|
|
|
ltreetest=# select ins_label(path,'Top.Science'::ltree,'Space') from test where path <@ 'Top.Science.Astronomy';
|
|
ins_label
|
|
------------------------------------------
|
|
Top.Science.Space.Astronomy
|
|
Top.Science.Space.Astronomy.Astrophysics
|
|
Top.Science.Space.Astronomy.Cosmology
|
|
(3 rows)
|
|
|
|
ADDITIONAL DATA
|
|
|
|
To get more feeling from our ltree module you could download
|
|
dmozltree-eng.sql.gz (about 3Mb tar.gz archive containing 300,274 nodes),
|
|
available from http://www.sai.msu.su/~megera/postgres/gist/ltree/
|
|
dmozltree-eng.sql.gz, which is DMOZ catalogue, prepared for use with ltree.
|
|
Setup your test database (dmoz), load ltree module and issue command:
|
|
|
|
zcat dmozltree-eng.sql.gz| psql dmoz
|
|
|
|
Data will be loaded into database dmoz and all indices will be created.
|
|
|
|
BENCHMARKS
|
|
|
|
All runs were performed on my IBM ThinkPad T21 (256 MB RAM, 750Mhz) using DMOZ
|
|
data, containing 300,274 nodes (see above for download link). We used some
|
|
basic queries typical for walking through catalog.
|
|
|
|
QUERIES
|
|
|
|
* Q0: Count all rows (sort of base time for comparison)
|
|
select count(*) from dmoz;
|
|
count
|
|
--------
|
|
300274
|
|
(1 row)
|
|
* Q1: Get direct children (without inheritance)
|
|
select path from dmoz where path ~ 'Top.Adult.Arts.Animation.*{1}';
|
|
path
|
|
-----------------------------------
|
|
Top.Adult.Arts.Animation.Cartoons
|
|
Top.Adult.Arts.Animation.Anime
|
|
(2 rows)
|
|
* Q2: The same as Q1 but with counting of successors
|
|
select path as parentpath , (select count(*)-1 from dmoz where path <@
|
|
p.path) as count from dmoz p where path ~ 'Top.Adult.Arts.Animation.*{1}';
|
|
parentpath | count
|
|
-----------------------------------+-------
|
|
Top.Adult.Arts.Animation.Cartoons | 2
|
|
Top.Adult.Arts.Animation.Anime | 61
|
|
(2 rows)
|
|
* Q3: Get all parents
|
|
select path from dmoz where path @> 'Top.Adult.Arts.Animation' order by
|
|
path asc;
|
|
path
|
|
--------------------------
|
|
Top
|
|
Top.Adult
|
|
Top.Adult.Arts
|
|
Top.Adult.Arts.Animation
|
|
(4 rows)
|
|
* Q4: Get all parents with counting of children
|
|
select path, (select count(*)-1 from dmoz where path <@ p.path) as count
|
|
from dmoz p where path @> 'Top.Adult.Arts.Animation' order by path asc;
|
|
path | count
|
|
--------------------------+--------
|
|
Top | 300273
|
|
Top.Adult | 4913
|
|
Top.Adult.Arts | 339
|
|
Top.Adult.Arts.Animation | 65
|
|
(4 rows)
|
|
* Q5: Get all children with levels
|
|
select path, nlevel(path) - nlevel('Top.Adult.Arts.Animation') as level
|
|
from dmoz where path ~ 'Top.Adult.Arts.Animation.*{1,2}' order by path asc;
|
|
path | level
|
|
------------------------------------------------+-------
|
|
Top.Adult.Arts.Animation.Anime | 1
|
|
Top.Adult.Arts.Animation.Anime.Fan_Works | 2
|
|
Top.Adult.Arts.Animation.Anime.Games | 2
|
|
Top.Adult.Arts.Animation.Anime.Genres | 2
|
|
Top.Adult.Arts.Animation.Anime.Image_Galleries | 2
|
|
Top.Adult.Arts.Animation.Anime.Multimedia | 2
|
|
Top.Adult.Arts.Animation.Anime.Resources | 2
|
|
Top.Adult.Arts.Animation.Anime.Titles | 2
|
|
Top.Adult.Arts.Animation.Cartoons | 1
|
|
Top.Adult.Arts.Animation.Cartoons.AVS | 2
|
|
Top.Adult.Arts.Animation.Cartoons.Members | 2
|
|
(11 rows)
|
|
|
|
Timings
|
|
|
|
+---------------------------------------------+
|
|
|Query|Rows|Time (ms) index|Time (ms) no index|
|
|
|-----+----+---------------+------------------|
|
|
| Q0| 1| NA| 1453.44|
|
|
|-----+----+---------------+------------------|
|
|
| Q1| 2| 0.49| 1001.54|
|
|
|-----+----+---------------+------------------|
|
|
| Q2| 2| 1.48| 3009.39|
|
|
|-----+----+---------------+------------------|
|
|
| Q3| 4| 0.55| 906.98|
|
|
|-----+----+---------------+------------------|
|
|
| Q4| 4| 24385.07| 4951.91|
|
|
|-----+----+---------------+------------------|
|
|
| Q5| 11| 0.85| 1003.23|
|
|
+---------------------------------------------+
|
|
Timings without indices were obtained using operations which doesn't use
|
|
indices (see above)
|
|
|
|
Remarks
|
|
|
|
We didn't run full-scale tests, also we didn't present (yet) data for
|
|
operations with arrays of ltree (ltree[]) and full text searching. We'll
|
|
appreciate your input. So far, below some (rather obvious) results:
|
|
|
|
* Indices does help execution of queries
|
|
* Q4 performs bad because one needs to read almost all data from the HDD
|
|
|
|
CHANGES
|
|
|
|
Aug 9, 2002
|
|
Fixed very stupid but important bug :-)
|
|
July 31, 2002
|
|
Now works on 64-bit platforms.
|
|
Added function lca - lowest common ancestor
|
|
Version for 7.2 is distributed as separate package -
|
|
http://www.sai.msu.su/~megera/postgres/gist/ltree/ltree-7.2.tar.gz
|
|
July 13, 2002
|
|
Initial release.
|
|
|
|
TODO
|
|
|
|
* Testing on 64-bit platforms. There are several known problems with byte
|
|
alignment; -- RESOLVED
|
|
* Better documentation;
|
|
* We plan (probably) to improve regular expressions processing using
|
|
non-deterministic automata;
|
|
* Some sort of XML support;
|
|
* Better full text searching;
|
|
|
|
SOME BACKGROUNDS
|
|
|
|
The approach we use for ltree is much like one we used in our other GiST based
|
|
contrib modules (intarray, tsearch, tree, btree_gist, rtree_gist). Theoretical
|
|
background is available in papers referenced from our GiST development page
|
|
(http://www.sai.msu.su/~megera/postgres/gist).
|
|
|
|
A hierarchical data structure (tree) is a set of nodes. Each node has a
|
|
signature (LPS) of a fixed size, which is a hashed label path of that node.
|
|
Traversing a tree we could *certainly* prune branches if
|
|
|
|
LQS (bitwise AND) LPS != LQS
|
|
|
|
where LQS is a signature of lquery or ltxtquery, obtained in the same way as
|
|
LPS.
|
|
|
|
ltree[]:
|
|
For array of ltree LPS is a bitwise OR-ed signatures of *ALL* children
|
|
reachable from that node. Signatures are stored in RD-tree, implemented using
|
|
GiST, which provides indexed access.
|
|
|
|
ltree:
|
|
For ltree we store LPS in a B-tree, implemented using GiST. Each node entry is
|
|
represented by (left_bound, signature, right_bound), so that we could speedup
|
|
operations <, <=, =, =>, > using left_bound, right_bound and prune branches of
|
|
a tree using signature.
|
|
-------------------------------------------------------------------------------
|
|
We ask people who find the module useful to send us a postcards to:
|
|
Moscow, 119899, Universitetski pr.13, Moscow State University, Sternberg
|
|
Astronomical Institute, Russia
|
|
For: Bartunov O.S.
|
|
and
|
|
Moscow, Bratislavskaya str.23, appt. 18, Russia
|
|
For: Sigaev F.G.
|