Home | History | Annotate | Download | only in icu
      1 
      2 This directory contains source code for the SQLite "ICU" extension, an
      3 integration of the "International Components for Unicode" library with
      4 SQLite. Documentation follows.
      5 
      6     1. Features
      7     
      8         1.1  SQL Scalars upper() and lower()
      9         1.2  Unicode Aware LIKE Operator
     10         1.3  ICU Collation Sequences
     11         1.4  SQL REGEXP Operator
     12     
     13     2. Compilation and Usage
     14     
     15     3. Bugs, Problems and Security Issues
     16     
     17         3.1  The "case_sensitive_like" Pragma
     18         3.2  The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
     19         3.3  Collation Sequence Security Issue
     20 
     21 
     22 1. FEATURES
     23 
     24   1.1  SQL Scalars upper() and lower()
     25 
     26     SQLite's built-in implementations of these two functions only 
     27     provide case mapping for the 26 letters used in the English
     28     language. The ICU based functions provided by this extension
     29     provide case mapping, where defined, for the full range of 
     30     unicode characters.
     31 
     32     ICU provides two types of case mapping, "general" case mapping and
     33     "language specific". Refer to ICU documentation for the differences
     34     between the two. Specifically:
     35 
     36        http://www.icu-project.org/userguide/caseMappings.html
     37        http://www.icu-project.org/userguide/posix.html#case_mappings
     38 
     39     To utilise "general" case mapping, the upper() or lower() scalar 
     40     functions are invoked with one argument:
     41 
     42         upper('ABC') -> 'abc'
     43         lower('abc') -> 'ABC'
     44 
     45     To access ICU "language specific" case mapping, upper() or lower()
     46     should be invoked with two arguments. The second argument is the name
     47     of the locale to use. Passing an empty string ("") or SQL NULL value
     48     as the second argument is the same as invoking the 1 argument version
     49     of upper() or lower():
     50 
     51         lower('I', 'en_us') -> 'i'
     52         lower('I', 'tr_tr') -> '' (small dotless i)
     53 
     54   1.2  Unicode Aware LIKE Operator
     55 
     56     Similarly to the upper() and lower() functions, the built-in SQLite LIKE
     57     operator understands case equivalence for the 26 letters of the English
     58     language alphabet. The implementation of LIKE included in this
     59     extension uses the ICU function u_foldCase() to provide case
     60     independent comparisons for the full range of unicode characters.  
     61 
     62     The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the
     63     dotless 'I' character used in the Turkish language is considered
     64     to be in the same equivalence class as the dotted 'I' character
     65     used by many languages (including English).
     66 
     67   1.3  ICU Collation Sequences
     68 
     69     A special SQL scalar function, icu_load_collation() is provided that 
     70     may be used to register ICU collation sequences with SQLite. It
     71     is always called with exactly two arguments, the ICU locale 
     72     identifying the collation sequence to ICU, and the name of the
     73     SQLite collation sequence to create. For example, to create an
     74     SQLite collation sequence named "turkish" using Turkish language
     75     sorting rules, the SQL statement:
     76 
     77         SELECT icu_load_collation('tr_TR', 'turkish');
     78 
     79     Or, for Australian English:
     80 
     81         SELECT icu_load_collation('en_AU', 'australian');
     82 
     83     The identifiers "turkish" and "australian" may then be used
     84     as collation sequence identifiers in SQL statements:
     85 
     86         CREATE TABLE aust_turkish_penpals(
     87           australian_penpal_name TEXT COLLATE australian,
     88           turkish_penpal_name    TEXT COLLATE turkish
     89         );
     90   
     91   1.4 SQL REGEXP Operator
     92 
     93     This extension provides an implementation of the SQL binary
     94     comparision operator "REGEXP", based on the regular expression functions
     95     provided by the ICU library. The syntax of the operator is as described
     96     in SQLite documentation:
     97 
     98         <string> REGEXP <re-pattern>
     99 
    100     This extension uses the ICU defaults for regular expression matching
    101     behaviour. Specifically, this means that:
    102 
    103         * Matching is case-sensitive,
    104         * Regular expression comments are not allowed within patterns, and
    105         * The '^' and '$' characters match the beginning and end of the
    106           <string> argument, not the beginning and end of lines within
    107           the <string> argument.
    108 
    109     Even more specifically, the value passed to the "flags" parameter
    110     of ICU C function uregex_open() is 0.
    111 
    112 
    113 2  COMPILATION AND USAGE
    114 
    115   The easiest way to compile and use the ICU extension is to build
    116   and use it as a dynamically loadable SQLite extension. To do this
    117   using gcc on *nix:
    118 
    119     gcc -shared icu.c `icu-config --ldflags` -o libSqliteIcu.so
    120 
    121   You may need to add "-I" flags so that gcc can find sqlite3ext.h
    122   and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be
    123   loaded into sqlite in the same way as any other dynamically loadable
    124   extension.
    125 
    126 
    127 3 BUGS, PROBLEMS AND SECURITY ISSUES
    128 
    129   3.1 The "case_sensitive_like" Pragma
    130 
    131     This extension does not work well with the "case_sensitive_like"
    132     pragma. If this pragma is used before the ICU extension is loaded,
    133     then the pragma has no effect. If the pragma is used after the ICU
    134     extension is loaded, then SQLite ignores the ICU implementation and
    135     always uses the built-in LIKE operator.
    136 
    137     The ICU extension LIKE operator is always case insensitive.
    138 
    139   3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
    140 
    141     Passing very long patterns to the built-in SQLite LIKE operator can
    142     cause excessive CPU usage. To curb this problem, SQLite defines the
    143     SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a
    144     pattern in bytes (irrespective of encoding). The default value is
    145     defined in internal header file "limits.h".
    146     
    147     The ICU extension LIKE implementation suffers from the same 
    148     problem and uses the same solution. However, since the ICU extension
    149     code does not include the SQLite file "limits.h", modifying
    150     the default value therein does not affect the ICU extension.
    151     The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by
    152     the ICU extension LIKE operator is 50000, defined in source 
    153     file "icu.c".
    154 
    155   3.3 Collation Sequence Security Issue
    156 
    157     Internally, SQLite assumes that indices stored in database files
    158     are sorted according to the collation sequence indicated by the
    159     SQL schema. Changing the definition of a collation sequence after
    160     an index has been built is therefore equivalent to database
    161     corruption. The SQLite library is not very well tested under
    162     these conditions, and may contain potential buffer overruns
    163     or other programming errors that could be exploited by a malicious
    164     programmer.
    165 
    166     If the ICU extension is used in an environment where potentially
    167     malicious users may execute arbitrary SQL (i.e. gears), they
    168     should be prevented from invoking the icu_load_collation() function,
    169     possibly using the authorisation callback.
    170