1 2 This directory contains source code for the SQLite "ICU" extension, an 3 integration of the "International Components for Unicode" library with 4 SQLite. Documentation follows. 5 6 1. Features 7 8 1.1 SQL Scalars upper() and lower() 9 1.2 Unicode Aware LIKE Operator 10 1.3 ICU Collation Sequences 11 1.4 SQL REGEXP Operator 12 13 2. Compilation and Usage 14 15 3. Bugs, Problems and Security Issues 16 17 3.1 The "case_sensitive_like" Pragma 18 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro 19 3.3 Collation Sequence Security Issue 20 21 22 1. FEATURES 23 24 1.1 SQL Scalars upper() and lower() 25 26 SQLite's built-in implementations of these two functions only 27 provide case mapping for the 26 letters used in the English 28 language. The ICU based functions provided by this extension 29 provide case mapping, where defined, for the full range of 30 unicode characters. 31 32 ICU provides two types of case mapping, "general" case mapping and 33 "language specific". Refer to ICU documentation for the differences 34 between the two. Specifically: 35 36 http://www.icu-project.org/userguide/caseMappings.html 37 http://www.icu-project.org/userguide/posix.html#case_mappings 38 39 To utilise "general" case mapping, the upper() or lower() scalar 40 functions are invoked with one argument: 41 42 upper('ABC') -> 'abc' 43 lower('abc') -> 'ABC' 44 45 To access ICU "language specific" case mapping, upper() or lower() 46 should be invoked with two arguments. The second argument is the name 47 of the locale to use. Passing an empty string ("") or SQL NULL value 48 as the second argument is the same as invoking the 1 argument version 49 of upper() or lower(): 50 51 lower('I', 'en_us') -> 'i' 52 lower('I', 'tr_tr') -> '' (small dotless i) 53 54 1.2 Unicode Aware LIKE Operator 55 56 Similarly to the upper() and lower() functions, the built-in SQLite LIKE 57 operator understands case equivalence for the 26 letters of the English 58 language alphabet. The implementation of LIKE included in this 59 extension uses the ICU function u_foldCase() to provide case 60 independent comparisons for the full range of unicode characters. 61 62 The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the 63 dotless 'I' character used in the Turkish language is considered 64 to be in the same equivalence class as the dotted 'I' character 65 used by many languages (including English). 66 67 1.3 ICU Collation Sequences 68 69 A special SQL scalar function, icu_load_collation() is provided that 70 may be used to register ICU collation sequences with SQLite. It 71 is always called with exactly two arguments, the ICU locale 72 identifying the collation sequence to ICU, and the name of the 73 SQLite collation sequence to create. For example, to create an 74 SQLite collation sequence named "turkish" using Turkish language 75 sorting rules, the SQL statement: 76 77 SELECT icu_load_collation('tr_TR', 'turkish'); 78 79 Or, for Australian English: 80 81 SELECT icu_load_collation('en_AU', 'australian'); 82 83 The identifiers "turkish" and "australian" may then be used 84 as collation sequence identifiers in SQL statements: 85 86 CREATE TABLE aust_turkish_penpals( 87 australian_penpal_name TEXT COLLATE australian, 88 turkish_penpal_name TEXT COLLATE turkish 89 ); 90 91 1.4 SQL REGEXP Operator 92 93 This extension provides an implementation of the SQL binary 94 comparision operator "REGEXP", based on the regular expression functions 95 provided by the ICU library. The syntax of the operator is as described 96 in SQLite documentation: 97 98 <string> REGEXP <re-pattern> 99 100 This extension uses the ICU defaults for regular expression matching 101 behaviour. Specifically, this means that: 102 103 * Matching is case-sensitive, 104 * Regular expression comments are not allowed within patterns, and 105 * The '^' and '$' characters match the beginning and end of the 106 <string> argument, not the beginning and end of lines within 107 the <string> argument. 108 109 Even more specifically, the value passed to the "flags" parameter 110 of ICU C function uregex_open() is 0. 111 112 113 2 COMPILATION AND USAGE 114 115 The easiest way to compile and use the ICU extension is to build 116 and use it as a dynamically loadable SQLite extension. To do this 117 using gcc on *nix: 118 119 gcc -shared icu.c `icu-config --ldflags` -o libSqliteIcu.so 120 121 You may need to add "-I" flags so that gcc can find sqlite3ext.h 122 and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be 123 loaded into sqlite in the same way as any other dynamically loadable 124 extension. 125 126 127 3 BUGS, PROBLEMS AND SECURITY ISSUES 128 129 3.1 The "case_sensitive_like" Pragma 130 131 This extension does not work well with the "case_sensitive_like" 132 pragma. If this pragma is used before the ICU extension is loaded, 133 then the pragma has no effect. If the pragma is used after the ICU 134 extension is loaded, then SQLite ignores the ICU implementation and 135 always uses the built-in LIKE operator. 136 137 The ICU extension LIKE operator is always case insensitive. 138 139 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro 140 141 Passing very long patterns to the built-in SQLite LIKE operator can 142 cause excessive CPU usage. To curb this problem, SQLite defines the 143 SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a 144 pattern in bytes (irrespective of encoding). The default value is 145 defined in internal header file "limits.h". 146 147 The ICU extension LIKE implementation suffers from the same 148 problem and uses the same solution. However, since the ICU extension 149 code does not include the SQLite file "limits.h", modifying 150 the default value therein does not affect the ICU extension. 151 The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by 152 the ICU extension LIKE operator is 50000, defined in source 153 file "icu.c". 154 155 3.3 Collation Sequence Security Issue 156 157 Internally, SQLite assumes that indices stored in database files 158 are sorted according to the collation sequence indicated by the 159 SQL schema. Changing the definition of a collation sequence after 160 an index has been built is therefore equivalent to database 161 corruption. The SQLite library is not very well tested under 162 these conditions, and may contain potential buffer overruns 163 or other programming errors that could be exploited by a malicious 164 programmer. 165 166 If the ICU extension is used in an environment where potentially 167 malicious users may execute arbitrary SQL (i.e. gears), they 168 should be prevented from invoking the icu_load_collation() function, 169 possibly using the authorisation callback. 170