Cross Reference: /external/pcre/dist/doc/html/pcrepattern.html

Lines Matching full:character
18 <li><a name="TOC3" href="#SEC3">EBCDIC CHARACTER CODES</a>
24 <li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a>
25 <li><a name="TOC10" href="#SEC10">POSIX CHARACTER CLASSES</a>
91 extra library that supports 16-bit and UTF-16 character strings, and a
92 third library that supports 32-bit and UTF-32 character strings. To use these
122 such as \d and \w to use Unicode properties to determine character types,
153 strings: a single CR (carriage return) character, a single LF (linefeed)
154 character, the two-character sequence CRLF, any of the three preceding, or any
214 <br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
216 PCRE can be compiled to run in an environment that uses EBCDIC as its character
218 sections below, character code values are ASCII or Unicode; in an EBCDIC
252   \      general escape character with several uses
255   .      match any character except newline (by default)
256   [      start character class definition
268 Part of a pattern that is in square brackets is called a "character class". In
269 a character class the only metacharacters are:
271   \      general escape character
272   ^      negate the class, but only if the first character
273   -      indicates character range
274   [      POSIX character class (only if followed by POSIX syntax)
275   ]      terminates the character class
281 The backslash character has several uses. Firstly, if it is followed by a
282 character that is not a number or a letter, it takes away any special meaning
283 that character may have. This use of backslash as an escape character applies
284 both inside and outside character classes.
287 For example, if you want to match a * character, you write \* in the pattern.
288 This escaping action applies whether or not the following character would
300 pattern (other than in a character class), and characters between a # outside a
301 character class and the next newline, inclusive, are ignored. An escaping
302 backslash can be used to include a white space or # character as part of the
317 The \Q...\E sequence is recognized both inside and outside character classes.
321 a character class, this causes an error, because the character class is not
332 one of the following escape sequences than the binary character it represents.
335   \a        alarm, that is, the BEL character (hex 07)
336   \cx       "control-x", where x is any ASCII character
342   \0dd      character with octal code 0dd
343   \ddd      character with octal code ddd, or back reference
344   \o{ddd..} character with octal code ddd..
345   \xhh      character with hex code hh
346   \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
347   \uhhhh    character with hex code hhhh (JavaScript mode only)
350 case letter, it is converted to upper case. Then bit 6 of the character (hex
361 other character provokes a compile-time error. The sequence \@ encodes
362 character code 0; the letters (in either case) encode characters 1-26 (hex 01
367 Thus, apart from \?, these escapes generate the same character code values as
374 because 127 is not a control character in EBCDIC, Perl makes it generate the
375 APC character. Unfortunately, there are several variants of EBCDIC. In most of
376 them the APC character has the value 255 (hex FF), but in the one Perl calls
383 specifies two binary zeros followed by a CR character (code value 13). Make
384 sure you supply two digits after the initial zero if the pattern character that
390 addition to Perl; it provides a way of specifying character code points as octal
396 digit greater than zero. Instead, use \o{} or \x{} to specify character
403 character class, PCRE reads the digit and any following digits as a decimal
412 Inside a character class, or if the decimal number following \ is greater than
415 octal digits following the backslash, using them to generate a data character.
423   \0113  is a tab followed by the character "3"
424   \113   might be a back reference, otherwise the character with octal code 113
435 hexadecimal digits may appear between \x{ and }. If a character other than
442 Otherwise, it matches a literal "x" character. In JavaScript mode, support for
444 four hexadecimal digits; otherwise it matches a literal "u" character.
453 Constraints on character values
470 Escape sequences in character classes
473 All the sequences that define a single character value can be used both inside
474 and outside character classes. In addition, inside a character class, \b is
475 interpreted as the backspace character (hex 08).
478 \N is not allowed in a character class. \B, \R, and \X are not special
479 inside a character class. Like other unrecognized escape sequences, they are
481 error if the PCRE_EXTRA option is set. Outside a character class, these
491 option is set, \U matches a "U" character, and \u can be used to define a
492 character by code point, as described in the previous section.
519 Generic character types
522 Another use of backslash is for specifying generic character types:
525   \D     any character that is not a decimal digit
526   \h     any horizontal white space character
527   \H     any character that is not a horizontal white space character
528   \s     any white space character
529   \S     any character that is not a white space character
530   \v     any vertical white space character
531   \V     any character that is not a vertical white space character
532   \w     any "word" character
533   \W     any "non-word" character
535 There is also the single sequence \N, which matches a non-newline character.
543 of characters into two disjoint sets. Any given character matches one, and only
544 one, of each pair. The sequences can appear both inside and outside character
545 classes. They each match one character of the appropriate type. If the current
547 there is no character to match.
550 For compatibility with Perl, \s did not use to match the VT character (code
556 "non-breaking space" character (\xA0) is recognized as white space, and in
557 others the VT character is not.
560 A "word" character is an underscore or any character that is a letter or digit.
562 low-valued character tables, and may vary if locale-specific matching is taking
568 or "french" in Windows, some character codes greater than 127 are used for
579 changed so that Unicode properties are used to determine character types, as
582   \d  any character that matches \p{Nd} (decimal digit)
583   \s  any character that matches \p{Z} or \h or \v
584   \w  any character that matches \p{L} or \p{N}, plus underscore
635 Outside a character class, by default, the escape sequence \R matches any
643 This particular group matches either the two-character sequence CR followed by
646 line, U+0085). The two-character sequence is treated as a single unit that
652 Unicode character property support is not needed for these characters to be
677 (*UCP) special sequences. Inside a character class, \R is treated as an
682 Unicode character properties
685 When PCRE is built with Unicode character property support, three additional
691   \p{<i>xx</i>}   a character with the <i>xx</i> property
692   \P{<i>xx</i>}   a character without the <i>xx</i> property
697 character (including newline), and some special PCRE properties (described
706 character from one of these sets can be matched using a script name. For
843 Each character has exactly one Unicode general category property, specified by
903 The special property L& is also supported: it matches a character that has
922 No character that is in the Unicode table has the Cn (unassigned) property.
933 multistage table lookup in order to find a character's property. That is why
950 That is, it matched a character without the "mark" property, followed by zero
952 property are typically non-spacing accents that affect the preceding character.
956 kinds of composite character by giving each character a grapheme breaking
962 \X always matches at least one character. Then it decides whether to add
969 2. Do not end between CR and LF; otherwise end after any control character.
973 are of five types: L, V, T, LV, and LVT. An L character may be followed by an
974 L, V, LV, or LVT character; an LV or V character may be followed by a V or T
975 character; an LVT or T character may be followed only by a T character.
997   Xan   Any alphanumeric character
998   Xps   Any POSIX space character
999   Xsp   Any Perl space character
1000   Xwd   Any Perl "word" character
1004 carriage return, and any other character that has the Z (separator) property.
1010 There is another non-standard property, Xuc, which matches any character that
1011 can be represented by a Universal Character Name in C++ and other programming
1015 excluded. (Universal Character Names are of the form \uHHHH or \UHHHHHHHH
1067 Inside a character class, \b has a different meaning; it matches the backspace
1068 character. If any other of these assertions appears in a character class, by
1069 default it matches the corresponding literal character (for example, \B
1074 A word boundary is a position in the subject string where the current character
1075 and the previous character do not both match \w or \W (i.e. one matches
1077 first or last character matches \w, respectively. In a UTF mode, the meanings
1123 Outside a character class, in the default matching mode, the circumflex
1124 character is an assertion that is true only if the current matching point is at
1127 option is unset. Inside a character class, circumflex has an entirely different
1132 Circumflex need not be the first character of the pattern if a number of
1141 The dollar character is an assertion that is true only if the current matching
1144 match the newline. Dollar need not be the last character of the pattern if a
1146 branch in which it appears. Dollar has no special meaning in a character class.
1159 PCRE_MULTILINE is set. When newline is specified as the two-character
1177 Outside a character class, a dot in the pattern matches any one character in
1178 the subject string except (by default) a character that signifies the end of a
1182 When a line ending is defined as a single character, dot never matches that
1183 character; when the two-character sequence CRLF is used, dot does not match CR
1191 option is set, a dot matches any one character, without exception. If the
1192 two-character sequence CRLF is present in the subject string, it takes two dots
1198 special meaning in a character class.
1202 the PCRE_DOTALL option. In other words, it matches any character except one
1208 Outside a character class, the escape sequence \C matches any one data unit,
1216 malformed UTF character. This has undefined results, because PCRE assumes that
1230 lookahead to check the length of the next character, as in this pattern, which
1242 character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1243 character's individual bytes are then captured by the appropriate number of
1246 <br><a name="SEC9" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
1248 An opening square bracket introduces a character class, terminated by a closing
1252 a member of the class, it should be the first data character in the class
1256 A character class matches a single character in the subject. In a UTF mode, the
1257 character may be more than one data unit long. A matched character must be in
1258 the set of characters defined by the class, unless the first character in the
1259 class definition is a circumflex, in which case the subject character must not
1261 member of the class, ensure it is not the first character, or escape it with a
1265 For example, the character class [aeiou] matches any lower case vowel, while
1266 [^aeiou] matches any character that is not a lower case vowel. Note that a
1269 circumflex is not an assertion; it still consumes a character from the subject
1292 when matching character classes, whatever line-ending sequence is in use, and
1297 The minus (hyphen) character can be used to specify a range of characters in a
1298 character class. For example, [d-m] matches any letter between d and m,
1299 inclusive. If a minus character is required in a class, it must be escaped with
1301 indicating a range, typically as the first or last character in the class, or
1303 to d, a hyphen character, or z.
1306 It is not possible to have the literal character "]" as the end character of a
1315 An error is generated if a POSIX character class (see below) or an escape
1316 sequence other than one that defines a single character appears at a point
1317 where a range ending character is expected. For example, [z-\xff] is valid,
1321 Ranges operate in the collating sequence of character values. They can also be
1328 [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
1335 The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
1336 \V, \w, and \W may appear in a character class, and add the characters that
1340 character class, as described in the section entitled
1341 <a href="#genericchartypes">"Generic character types"</a>
1342 above. The escape sequence \b has a different meaning inside a character
1343 class; it matches the backspace character. The sequences \B, \N, \R, and \X
1344 are not special inside a character class. Like any other unrecognized escape
1349 A circumflex can conveniently be used with the upper case character types to
1352 whereas [\w] includes underscore. A positive character class should be read as
1357 The only metacharacters that are recognized in character classes are backslash,
1364 <br><a name="SEC10" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
1366 Perl supports the POSIX notation for character classes. This uses names
1372 matches "0", "1", any alphabetic character, or "%". The supported class names
1377   ascii    character codes 0 - 127
1399 5.8. Another Perl extension is negation, which is indicated by a ^ character
1410 POSIX character classes. However, if the PCRE_UCP option is passed to
1411 <b>pcre_compile()</b>, some of the classes are changed so that Unicode character
1463 Only these exact character sequences are recognized. A sequence such as
1469 above), and in a Perl-style pattern the preceding or following character
1747   a literal data character
1752   an escape such as \d or \pL that matches a single character
1753   a character class
1766 character. If the second number is omitted, but the comma is present, there is
1778 quantifier, is taken as a literal character. For example, {,6} is not a
1799 For convenience, the three most common quantifiers have single-character
1865 character position in the subject string, so there is no point in retrying the
1882 If the subject is "xyz123abc123" the match point is the fourth character. For
1959 additional + character following a quantifier. Using this notation, the
2004 than a single character at the end, because both PCRE and Perl have an
2005 optimization that allows for fast failure when a single character is used. They
2006 remember the last single character that is required for a match, and fail early
2016 Outside a character class, a backslash followed by a digit greater than 0 (and
2033 interpreted as a character defined in octal. See the subsection entitled
2111 If the pattern continues with a digit character, some delimiter must be used to
2129 the subpattern, the back reference matches the character string corresponding
2285 there is no following "a"), it backtracks to match all but the last character,
2383 character is present, sets it as the first captured substring. The second part
2485 PCRE. In both cases, the start of the comment must not be in a character class,
2493 option is set, an unescaped # character also introduces a comment, which in
2494 this case continues to immediately after the next newline character or
2495 character sequence in the pattern. Which characters are interpreted as newlines
2506 On encountering the # character, <b>pcre_compile()</b> skips along, looking for
2508 it does not terminate the comment. Only an actual character with the code value
2648 The idea is that it either matches a single character, or two identical
2654 At the top level, the first character is matched, but as it is not at the end
2657 matches the next character ("b"). (Note that the beginning and end of line
2661 Back at the top level, the next character ("c") is compared with what
2684 deeper recursion has matched a single character, it cannot be entered again in
2875 minimum length of matching subject, or that a particular character must be
3046 This disables the optimization that skips along to the first character. The
3055 starting character then happens. Backtracking can occur as usual to the left of
3071 pattern is unanchored, the "bumpalong" advance is not to the next character,
3079 the first character in the string), the starting point skips on to start the
3082 first match attempt, the second attempt would start at the second character
3120 A subpattern that does not contain a | character is just a part of the
3142 alternatives, because only one is ever used. In other words, the | character in
3150 character "b" is matched, but "c" is not. At this point, matching does not
3152 character. The conditional subpattern is part of the single alternative that
3160 starting position, but allowing an advance to the next character (for an
3162 than one character. (*COMMIT) is the strongest, causing the entire match to
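
A few of the behaviours quoted in the matches above (source lines 287, 475, 1191-1192, 1265-1266 and 1298) can be tried out with a short sketch. It uses Python's re module as a stand-in for PCRE, which is an assumption of this example and not something the listing itself uses; for these basic constructs on ASCII input the two engines should agree, but the sketch does not exercise the PCRE library or any of its PCRE_* options.

```python
import re

# Assumption: Python's re module stands in for PCRE here. Only constructs
# whose behaviour is shared by both engines are used.

# Line 287: to match a literal "*", write \* in the pattern.
stars = re.findall(r"\*", "a*b*c")

# Lines 1265-1266: [aeiou] matches any lower case vowel, while
# [^aeiou] matches any character that is not a lower case vowel.
vowels = re.findall(r"[aeiou]", "character")
non_vowels = re.findall(r"[^aeiou]", "character")

# Line 1298: [d-m] matches any letter between d and m, inclusive.
in_range = re.findall(r"[d-m]", "abcdefmnz")

# Line 475: inside a character class, \b is the backspace character (hex 08).
backspace = re.findall(r"[\b]", "x\x08y")

# Lines 1191-1192: with the "dot matches everything" option set, the
# two-character sequence CRLF needs two dots to be matched.
# re.DOTALL plays the role of PCRE_DOTALL in this sketch.
crlf_one_dot = re.fullmatch(r".", "\r\n", re.DOTALL) is not None
crlf_two_dots = re.fullmatch(r"..", "\r\n", re.DOTALL) is not None

if __name__ == "__main__":
    print(stars, vowels, non_vowels, in_range, backspace)
    print(crlf_one_dot, crlf_two_dots)
```

Constructs that are specific to PCRE, such as \R, \X, \o{ddd..} or the (*UCP) special sequence, are deliberately left out, because Python's re module does not support them and the results would not carry over.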