sqlite/ext/icu
drh 6695f47e93 Change the REGEXP function in ICU to require exactly 2 arguments.
FossilOrigin-Name: c34cf23efb92d9bbb7cf4245d372e47182771f63
2009-12-17 02:13:39 +00:00
..
icu.c Change the REGEXP function in ICU to require exactly 2 arguments. 2009-12-17 02:13:39 +00:00
README.txt Extend fts2 so that user defined tokenizers may be added. Add a tokenizer that uses the ICU library if available. Documentation and tests to come. (CVS 4108) 2007-06-22 15:21:15 +00:00
sqliteicu.h Add header file sqliteicu.h to the ICU extension. This is analogous to the rtree.h and fts3.h headers used by other extensions to declare their entry points. Fix for ticket #3361. (CVS 5680) 2008-09-08 08:08:09 +00:00

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This directory contains source code for the SQLite "ICU" extension, an
integration of the "International Components for Unicode" library with
SQLite. Documentation follows.

    1. Features
    
        1.1  SQL Scalars upper() and lower()
        1.2  Unicode Aware LIKE Operator
        1.3  ICU Collation Sequences
        1.4  SQL REGEXP Operator
    
    2. Compilation and Usage
    
    3. Bugs, Problems and Security Issues
    
        3.1  The "case_sensitive_like" Pragma
        3.2  The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
        3.3  Collation Sequence Security Issue


1. FEATURES

  1.1  SQL Scalars upper() and lower()

    SQLite's built-in implementations of these two functions only 
    provide case mapping for the 26 letters used in the English
    language. The ICU based functions provided by this extension
    provide case mapping, where defined, for the full range of 
    unicode characters.

    ICU provides two types of case mapping, "general" case mapping and
    "language specific". Refer to ICU documentation for the differences
    between the two. Specifically:

       http://www.icu-project.org/userguide/caseMappings.html
       http://www.icu-project.org/userguide/posix.html#case_mappings

    To utilise "general" case mapping, the upper() or lower() scalar 
    functions are invoked with one argument:

        upper('ABC') -> 'abc'
        lower('abc') -> 'ABC'

    To access ICU "language specific" case mapping, upper() or lower()
    should be invoked with two arguments. The second argument is the name
    of the locale to use. Passing an empty string ("") or SQL NULL value
    as the second argument is the same as invoking the 1 argument version
    of upper() or lower():

        lower('I', 'en_us') -> 'i'
        lower('I', 'tr_tr') -> 'ı' (small dotless i)

  1.2  Unicode Aware LIKE Operator

    Similarly to the upper() and lower() functions, the built-in SQLite LIKE
    operator understands case equivalence for the 26 letters of the English
    language alphabet. The implementation of LIKE included in this
    extension uses the ICU function u_foldCase() to provide case
    independent comparisons for the full range of unicode characters.  

    The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the
    dotless 'I' character used in the Turkish language is considered
    to be in the same equivalence class as the dotted 'I' character
    used by many languages (including English).

  1.3  ICU Collation Sequences

    A special SQL scalar function, icu_load_collation() is provided that 
    may be used to register ICU collation sequences with SQLite. It
    is always called with exactly two arguments, the ICU locale 
    identifying the collation sequence to ICU, and the name of the
    SQLite collation sequence to create. For example, to create an
    SQLite collation sequence named "turkish" using Turkish language
    sorting rules, the SQL statement:

        SELECT icu_load_collation('tr_TR', 'turkish');

    Or, for Australian English:

        SELECT icu_load_collation('en_AU', 'australian');

    The identifiers "turkish" and "australian" may then be used
    as collation sequence identifiers in SQL statements:

        CREATE TABLE aust_turkish_penpals(
          australian_penpal_name TEXT COLLATE australian,
          turkish_penpal_name    TEXT COLLATE turkish
        );
  
  1.4 SQL REGEXP Operator

    This extension provides an implementation of the SQL binary
    comparision operator "REGEXP", based on the regular expression functions
    provided by the ICU library. The syntax of the operator is as described
    in SQLite documentation:

        <string> REGEXP <re-pattern>

    This extension uses the ICU defaults for regular expression matching
    behaviour. Specifically, this means that:

        * Matching is case-sensitive,
        * Regular expression comments are not allowed within patterns, and
        * The '^' and '$' characters match the beginning and end of the
          <string> argument, not the beginning and end of lines within
          the <string> argument.

    Even more specifically, the value passed to the "flags" parameter
    of ICU C function uregex_open() is 0.


2  COMPILATION AND USAGE

  The easiest way to compile and use the ICU extension is to build
  and use it as a dynamically loadable SQLite extension. To do this
  using gcc on *nix:

    gcc -shared icu.c `icu-config --ldflags` -o libSqliteIcu.so

  You may need to add "-I" flags so that gcc can find sqlite3ext.h
  and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be
  loaded into sqlite in the same way as any other dynamically loadable
  extension.


3 BUGS, PROBLEMS AND SECURITY ISSUES

  3.1 The "case_sensitive_like" Pragma

    This extension does not work well with the "case_sensitive_like"
    pragma. If this pragma is used before the ICU extension is loaded,
    then the pragma has no effect. If the pragma is used after the ICU
    extension is loaded, then SQLite ignores the ICU implementation and
    always uses the built-in LIKE operator.

    The ICU extension LIKE operator is always case insensitive.

  3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro

    Passing very long patterns to the built-in SQLite LIKE operator can
    cause a stack overflow. To curb this problem, SQLite defines the
    SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a
    pattern in bytes (irrespective of encoding). The default value is
    defined in internal header file "limits.h".
    
    The ICU extension LIKE implementation suffers from the same 
    problem and uses the same solution. However, since the ICU extension
    code does not include the SQLite file "limits.h", modifying
    the default value therein does not affect the ICU extension.
    The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by
    the ICU extension LIKE operator is 50000, defined in source 
    file "icu.c".

  3.3 Collation Sequence Security Issue

    Internally, SQLite assumes that indices stored in database files
    are sorted according to the collation sequence indicated by the
    SQL schema. Changing the definition of a collation sequence after
    an index has been built is therefore equivalent to database
    corruption. The SQLite library is not very well tested under
    these conditions, and may contain potential buffer overruns
    or other programming errors that could be exploited by a malicious
    programmer.

    If the ICU extension is used in an environment where potentially
    malicious users may execute arbitrary SQL (i.e. gears), they
    should be prevented from invoking the icu_load_collation() function,
    possibly using the authorisation callback.