README.syntax
1
2 1. OVERVIEW
3
4 This README file describes the syntax of the arguments that may be passed to
5 the FTS3 MATCH operator used for full-text queries. For example, if table
6 "t1" is an Fts3 virtual table, the following SQL query:
7
8 SELECT * FROM t1 WHERE <col> MATCH <full-text query>
9
10 may be used to retrieve all rows that match a specified for full-text query.
11 The text "<col>" should be replaced by either the name of the fts3 table
12 (in this case "t1"), or by the name of one of the columns of the fts3
13 table. <full-text-query> should be replaced by an SQL expression that
14 computes to a string containing an Fts3 query.
15
16 If the left-hand-side of the MATCH operator is set to the name of the
17 fts3 table, then by default the query may be matched against any column
18 of the table. If it is set to a column name, then by default the query
19 may only match the specified column. In both cases this may be overriden
20 as part of the query text (see sections 2 and 3 below).
21
22 As of SQLite version 3.6.8, Fts3 supports two slightly different query
23 formats; the standard syntax, which is used by default, and the enhanced
24 query syntax which can be selected by compiling with the pre-processor
25 symbol SQLITE_ENABLE_FTS3_PARENTHESIS defined.
26
27 -DSQLITE_ENABLE_FTS3_PARENTHESIS
28
29 2. STANDARD QUERY SYNTAX
30
31 When using the standard Fts3 query syntax, a query usually consists of a
32 list of terms (words) separated by white-space characters. To match a
33 query, a row (or column) of an Fts3 table must contain each of the specified
34 terms. For example, the following query:
35
36 <col> MATCH 'hello world'
37
38 matches rows (or columns, if <col> is the name of a column name) that
39 contain at least one instance of the token "hello", and at least one
40 instance of the token "world". Tokens may be grouped into phrases using
41 quotation marks. In this case, a matching row or column must contain each
42 of the tokens in the phrase in the order specified, with no intervening
43 tokens. For example, the query:
44
45 <col> MATCH '"hello world" joe"
46
47 matches the first of the following two documents, but not the second or
48 third:
49
50 "'Hello world', said Joe."
51 "One should always greet the world with a cheery hello, thought Joe."
52 "How many hello world programs could their be?"
53
54 As well as grouping tokens together by phrase, the binary NEAR operator
55 may be used to search for rows that contain two or more specified tokens
56 or phrases within a specified proximity of each other. The NEAR operator
57 must always be specified in upper case. The word "near" in lower or mixed
58 case is treated as an ordinary token. For example, the following query:
59
60 <col> MATCH 'engineering NEAR consultancy'
61
62 matches rows that contain both the "engineering" and "consultancy" tokens
63 in the same column with not more than 10 other words between them. It does
64 not matter which of the two terms occurs first in the document, only that
65 they be seperated by only 10 tokens or less. The user may also specify
66 a different required proximity by adding "/N" immediately after the NEAR
67 operator, where N is an integer. For example:
68
69 <col> MATCH 'engineering NEAR/5 consultancy'
70
71 searches for a row containing an instance of each specified token seperated
72 by not more than 5 other tokens. More than one NEAR operator can be used
73 in as sequence. For example this query:
74
75 <col> MATCH 'reliable NEAR/2 engineering NEAR/5 consultancy'
76
77 searches for a row that contains an instance of the token "reliable"
78 seperated by not more than two tokens from an instance of "engineering",
79 which is in turn separated by not more than 5 other tokens from an
80 instance of the term "consultancy". Phrases enclosed in quotes may
81 also be used as arguments to the NEAR operator.
82
83 Similar to the NEAR operator, one or more tokens or phrases may be
84 separated by OR operators. In this case, only one of the specified tokens
85 or phrases must appear in the document. For example, the query:
86
87 <col> MATCH 'hello OR world'
88
89 matches rows that contain either the term "hello", or the term "world",
90 or both. Note that unlike in many programming languages, the OR operator
91 has a higher precedence than the AND operators implied between white-space
92 separated tokens. The following query matches documents that contain the
93 term 'sqlite' and at least one of the terms 'fantastic' or 'impressive',
94 not those that contain both 'sqlite' and 'fantastic' or 'impressive':
95
96 <col> MATCH 'sqlite fantastic OR impressive'
97
98 Any token that is part of an Fts3 query expression, whether or not it is
99 part of a phrase enclosed in quotes, may have a '*' character appended to
100 it. In this case, the token matches all terms that begin with the characters
101 of the token, not just those that exactly match it. For example, the
102 following query:
103
104 <col> MATCH 'sql*'
105
106 matches all rows that contain the term "SQLite", as well as those that
107 contain "SQL".
108
109 A token that is not part of a quoted phrase may be preceded by a '-'
110 character, which indicates that matching rows must not contain the
111 specified term. For example, the following:
112
113 <col> MATCH '"database engine" -sqlite'
114
115 matches rows that contain the phrase "database engine" but do not contain
116 the term "sqlite". If the '-' character occurs inside a quoted phrase,
117 it is ignored. It is possible to use both the '-' prefix and the '*' postfix
118 on a single term. At this time, all Fts3 queries must contain at least
119 one term or phrase that is not preceded by the '-' prefix.
120
121 Regardless of whether or not a table name or column name is used on the
122 left hand side of the MATCH operator, a specific column of the fts3 table
123 may be associated with each token in a query by preceding a token with
124 a column name followed by a ':' character. For example, regardless of what
125 is specified for <col>, the following query requires that column "col1"
126 of the table contains the term "hello", and that column "col2" of the
127 table contains the term "world". If the table does not contain columns
128 named "col1" and "col2", then an error is returned and the query is
129 not run.
130
131 <col> MATCH 'col1:hello col2:world'
132
133 It is not possible to associate a specific table column with a quoted
134 phrase or a term preceded by a '-' operator. A '*' character may be
135 appended to a term associated with a specific column for prefix matching.
136
137 3. ENHANCED QUERY SYNTAX
138
139 The enhanced query syntax is quite similar to the standard query syntax,
140 with the following four differences:
141
142 1) Parenthesis are supported. When using the enhanced query syntax,
143 parenthesis may be used to overcome the built-in precedence of the
144 supplied binary operators. For example, the following query:
145
146 <col> MATCH '(hello world) OR (simple example)'
147
148 matches documents that contain both "hello" and "world", and documents
149 that contain both "simple" and "example". It is not possible to forumlate
150 such a query using the standard syntax.
151
152 2) Instead of separating tokens and phrases by whitespace, an AND operator
153 may be explicitly specified. This does not change query processing at
154 all, but may be used to improve readability. For example, the following
155 query is handled identically to the one above:
156
157 <col> MATCH '(hello AND world) OR (simple AND example)'
158
159 As with the OR and NEAR operators, the AND operator must be specified
160 in upper case. The word "and" specified in lower or mixed case is
161 handled as a regular token.
162
163 3) The '-' token prefix is not supported. Instead, a new binary operator,
164 NOT, is included. The NOT operator requires that the query specified
165 as its left-hand operator matches, but that the query specified as the
166 right-hand operator does not. For example, to query for all rows that
167 contain the term "example" but not the term "simple", the following
168 query could be used:
169
170 <col> MATCH 'example NOT simple'
171
172 As for all other operators, the NOT operator must be specified in
173 upper case. Otherwise it will be treated as a regular token.
174
175 4) Unlike in the standard syntax, where the OR operator has a higher
176 precedence than the implicit AND operator, when using the enhanced
177 syntax implicit and explict AND operators have a higher precedence
178 than OR operators. Using the enhanced syntax, the following two
179 queries are equivalent:
180
181 <col> MATCH 'sqlite fantastic OR impressive'
182 <col> MATCH '(sqlite AND fantastic) OR impressive'
183
184 however, when using the standard syntax, the query:
185
186 <col> MATCH 'sqlite fantastic OR impressive'
187
188 is equivalent to the enhanced syntax query:
189
190 <col> MATCH 'sqlite AND (fantastic OR impressive)'
191
192 The precedence of all enhanced syntax operators, in order from highest
193 to lowest, is:
194
195 NEAR (highest precedence, tightest grouping)
196 NOT
197 AND
198 OR (lowest precedence, loosest grouping)
199
200 Using the advanced syntax, it is possible to specify expressions enclosed
201 in parenthesis as operands to the NOT, AND and OR operators. However both
202 the left and right hand side operands of NEAR operators must be either
203 tokens or phrases. Attempting the following query will return an error:
204
205 <col> MATCH 'sqlite NEAR (fantastic OR impressive)'
206
207 Queries of this form must be re-written as:
208
209 <col> MATCH 'sqlite NEAR fantastic OR sqlite NEAR impressive'
210
README.tokenizers
1
2 1. FTS3 Tokenizers
3
4 When creating a new full-text table, FTS3 allows the user to select
5 the text tokenizer implementation to be used when indexing text
6 by specifying a "tokenize" clause as part of the CREATE VIRTUAL TABLE
7 statement:
8
9 CREATE VIRTUAL TABLE <table-name> USING fts3(
10 <columns ...> [, tokenize <tokenizer-name> [<tokenizer-args>]]
11 );
12
13 The built-in tokenizers (valid values to pass as <tokenizer name>) are
14 "simple" and "porter".
15
16 <tokenizer-args> should consist of zero or more white-space separated
17 arguments to pass to the selected tokenizer implementation. The
18 interpretation of the arguments, if any, depends on the individual
19 tokenizer.
20
21 2. Custom Tokenizers
22
23 FTS3 allows users to provide custom tokenizer implementations. The
24 interface used to create a new tokenizer is defined and described in
25 the fts3_tokenizer.h source file.
26
27 Registering a new FTS3 tokenizer is similar to registering a new
28 virtual table module with SQLite. The user passes a pointer to a
29 structure containing pointers to various callback functions that
30 make up the implementation of the new tokenizer type. For tokenizers,
31 the structure (defined in fts3_tokenizer.h) is called
32 "sqlite3_tokenizer_module".
33
34 FTS3 does not expose a C-function that users call to register new
35 tokenizer types with a database handle. Instead, the pointer must
36 be encoded as an SQL blob value and passed to FTS3 through the SQL
37 engine by evaluating a special scalar function, "fts3_tokenizer()".
38 The fts3_tokenizer() function may be called with one or two arguments,
39 as follows:
40
41 SELECT fts3_tokenizer(<tokenizer-name>);
42 SELECT fts3_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);
43
44 Where <tokenizer-name> is a string identifying the tokenizer and
45 <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
46 structure encoded as an SQL blob. If the second argument is present,
47 it is registered as tokenizer <tokenizer-name> and a copy of it
48 returned. If only one argument is passed, a pointer to the tokenizer
49 implementation currently registered as <tokenizer-name> is returned,
50 encoded as a blob. Or, if no such tokenizer exists, an SQL exception
51 (error) is raised.
52
53 SECURITY: If the fts3 extension is used in an environment where potentially
54 malicious users may execute arbitrary SQL (i.e. gears), they should be
55 prevented from invoking the fts3_tokenizer() function, possibly using the
56 authorisation callback.
57
58 See "Sample code" below for an example of calling the fts3_tokenizer()
59 function from C code.
60
61 3. ICU Library Tokenizers
62
63 If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
64 symbol defined, then there exists a built-in tokenizer named "icu"
65 implemented using the ICU library. The first argument passed to the
66 xCreate() method (see fts3_tokenizer.h) of this tokenizer may be
67 an ICU locale identifier. For example "tr_TR" for Turkish as used
68 in Turkey, or "en_AU" for English as used in Australia. For example:
69
70 "CREATE VIRTUAL TABLE thai_text USING fts3(text, tokenizer icu th_TH)"
71
72 The ICU tokenizer implementation is very simple. It splits the input
73 text according to the ICU rules for finding word boundaries and discards
74 any tokens that consist entirely of white-space. This may be suitable
75 for some applications in some locales, but not all. If more complex
76 processing is required, for example to implement stemming or
77 discard punctuation, this can be done by creating a tokenizer
78 implementation that uses the ICU tokenizer as part of its implementation.
79
80 When using the ICU tokenizer this way, it is safe to overwrite the
81 contents of the strings returned by the xNext() method (see
82 fts3_tokenizer.h).
83
84 4. Sample code.
85
86 The following two code samples illustrate the way C code should invoke
87 the fts3_tokenizer() scalar function:
88
89 int registerTokenizer(
90 sqlite3 *db,
91 char *zName,
92 const sqlite3_tokenizer_module *p
93 ){
94 int rc;
95 sqlite3_stmt *pStmt;
96 const char zSql[] = "SELECT fts3_tokenizer(?, ?)";
97
98 rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
99 if( rc!=SQLITE_OK ){
100 return rc;
101 }
102
103 sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
104 sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
105 sqlite3_step(pStmt);
106
107 return sqlite3_finalize(pStmt);
108 }
109
110 int queryTokenizer(
111 sqlite3 *db,
112 char *zName,
113 const sqlite3_tokenizer_module **pp
114 ){
115 int rc;
116 sqlite3_stmt *pStmt;
117 const char zSql[] = "SELECT fts3_tokenizer(?)";
118
119 *pp = 0;
120 rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
121 if( rc!=SQLITE_OK ){
122 return rc;
123 }
124
125 sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
126 if( SQLITE_ROW==sqlite3_step(pStmt) ){
127 if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){
128 memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp));
129 }
130 }
131
132 return sqlite3_finalize(pStmt);
133 }
134
README.txt
1 This folder contains source code to the second full-text search
2 extension for SQLite. While the API is the same, this version uses a
3 substantially different storage schema from fts1, so tables will need
4 to be rebuilt.
5