1. FTS2 Tokenizers

  When creating a new full-text table, FTS2 allows the user to select
  the text tokenizer implementation to be used when indexing text
  by specifying a "tokenizer" clause as part of the CREATE VIRTUAL TABLE
  statement:

    CREATE VIRTUAL TABLE <table-name> USING fts2(
      <columns ...> [, tokenizer <tokenizer-name> [<tokenizer-args>]]
    );

  The built-in tokenizers (valid values to pass as <tokenizer-name>) are
  "simple" and "porter".

  <tokenizer-args> should consist of zero or more white-space separated
  arguments to pass to the selected tokenizer implementation. The
  interpretation of the arguments, if any, depends on the individual
  tokenizer.

2. Custom Tokenizers

  FTS2 allows users to provide custom tokenizer implementations. The
  interface used to create a new tokenizer is defined and described in
  the fts2_tokenizer.h source file.

  Registering a new FTS2 tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a
  structure containing pointers to various callback functions that
  make up the implementation of the new tokenizer type. For tokenizers,
  the structure (defined in fts2_tokenizer.h) is called
  "sqlite3_tokenizer_module".

  FTS2 does not expose a C-function that users call to register new
  tokenizer types with a database handle. Instead, the pointer must
  be encoded as an SQL blob value and passed to FTS2 through the SQL
  engine by evaluating a special scalar function, "fts2_tokenizer()".
  The fts2_tokenizer() function may be called with one or two arguments,
  as follows:

    SELECT fts2_tokenizer(<tokenizer-name>);
    SELECT fts2_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);

  Where <tokenizer-name> is a string identifying the tokenizer and
  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
  structure encoded as an SQL blob.
  If the second argument is present,
  it is registered as tokenizer <tokenizer-name> and a copy of it is
  returned. If only one argument is passed, a pointer to the tokenizer
  implementation currently registered as <tokenizer-name> is returned,
  encoded as a blob. If no such tokenizer exists, an SQL exception
  (error) is raised.

  SECURITY: If the fts2 extension is used in an environment where potentially
  malicious users may execute arbitrary SQL (e.g. Gears), they should be
  prevented from invoking the fts2_tokenizer() function, possibly using the
  authorisation callback.

  See "Sample code" below for an example of calling the fts2_tokenizer()
  function from C code.

3. ICU Library Tokenizers

  If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
  symbol defined, then there exists a built-in tokenizer named "icu"
  implemented using the ICU library. The first argument passed to the
  xCreate() method (see fts2_tokenizer.h) of this tokenizer may be
  an ICU locale identifier, such as "tr_TR" for Turkish as used
  in Turkey, or "en_AU" for English as used in Australia. For example:

    "CREATE VIRTUAL TABLE thai_text USING fts2(text, tokenizer icu th_TH)"

  The ICU tokenizer implementation is very simple. It splits the input
  text according to the ICU rules for finding word boundaries and discards
  any tokens that consist entirely of white-space. This may be suitable
  for some applications in some locales, but not all. If more complex
  processing is required, for example to implement stemming or
  discard punctuation, this can be done by creating a tokenizer
  implementation that uses the ICU tokenizer as part of its implementation.

  When using the ICU tokenizer this way, it is safe to overwrite the
  contents of the strings returned by the xNext() method (see
  fts2_tokenizer.h).

4. Sample code

  The following two code samples illustrate the way C code should invoke
  the fts2_tokenizer() scalar function:

    #include <string.h>
    #include <sqlite3.h>
    #include "fts2_tokenizer.h"

    /* Register module p under the name zName with database handle db. */
    int registerTokenizer(
      sqlite3 *db,
      const char *zName,
      const sqlite3_tokenizer_module *p
    ){
      int rc;
      sqlite3_stmt *pStmt;
      const char zSql[] = "SELECT fts2_tokenizer(?, ?)";

      rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
      if( rc!=SQLITE_OK ){
        return rc;
      }

      /* The blob holds the bytes of the pointer itself, not the structure. */
      sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
      sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
      sqlite3_step(pStmt);

      /* Any error raised by sqlite3_step() is reported by sqlite3_finalize(). */
      return sqlite3_finalize(pStmt);
    }

    /* Look up the module registered as zName. *pp is left set to 0 if
    ** no usable pointer is returned. */
    int queryTokenizer(
      sqlite3 *db,
      const char *zName,
      const sqlite3_tokenizer_module **pp
    ){
      int rc;
      sqlite3_stmt *pStmt;
      const char zSql[] = "SELECT fts2_tokenizer(?)";

      *pp = 0;
      rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
      if( rc!=SQLITE_OK ){
        return rc;
      }

      sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
      if( SQLITE_ROW==sqlite3_step(pStmt) ){
        /* Only copy the result if it is a blob of exactly pointer size. */
        if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB
         && sqlite3_column_bytes(pStmt, 0)==(int)sizeof(*pp)
        ){
          memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp));
        }
      }

      return sqlite3_finalize(pStmt);
    }
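
  Both samples rely on the same idea: the blob carries the bytes of the
  pointer itself, not a copy of the structure it points to. The following
  is a minimal sketch of that round trip on its own, without SQLite. The
  names demo_module, encodeModulePtr() and decodeModulePtr() are
  hypothetical and exist only for this illustration; demo_module stands in
  for sqlite3_tokenizer_module, which is defined in fts2_tokenizer.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for sqlite3_tokenizer_module (illustration only). */
typedef struct demo_module demo_module;
struct demo_module {
  int iVersion;
};

/*
** Copy the bytes of pointer p into the buffer aBlob, which has room for
** nBlob bytes. Returns the number of bytes written, or 0 if the buffer
** is too small to hold a pointer.
*/
static size_t encodeModulePtr(
  const demo_module *p,
  unsigned char *aBlob,
  size_t nBlob
){
  assert( aBlob!=0 );
  if( nBlob<sizeof(p) ) return 0;
  memcpy(aBlob, &p, sizeof(p));
  return sizeof(p);
}

/*
** Recover a pointer from a blob of nBlob bytes. Returns 0 if the blob is
** not exactly the size of a pointer.
*/
static const demo_module *decodeModulePtr(
  const unsigned char *aBlob,
  size_t nBlob
){
  const demo_module *p = 0;
  assert( aBlob!=0 );
  if( nBlob!=sizeof(p) ) return 0;
  memcpy(&p, aBlob, sizeof(p));
  return p;
}
```

  Because only the address is transferred, the structure must remain valid
  for as long as the decoded pointer may be used, and the blob is only
  meaningful inside the process that produced it.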