<html>
<head>
<title>pcre2syntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2syntax man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE2 are described in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
<P>
This table applies to ASCII and Unicode environments.
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII printing character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \0dd       character with octal code 0dd
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
  \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
  \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</pre>
Note that \0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section
<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation, where details of escape processing in EBCDIC environments are
also given.
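</P>
<P>
As a rough illustration of the escapes in the table above, the following sketch
uses Python's re module rather than PCRE2 itself. That substitution is an
assumption made for convenience: Python happens to accept \a, \f, \n, \r, \t,
\xhh, \0dd, and three-digit \ddd with the same meaning, but it does not support
\e, \o{...}, or \x{...}, so those are left out.

```python
import re

# Each pair is (pattern, subject). The pattern uses an escape from the table
# above; the subject is the single character it is expected to match.
# Python's re is a stand-in here, not PCRE2.
cases = [
    (r"\a", "\x07"),    # alarm (BEL, hex 07)
    (r"\f", "\x0c"),    # form feed (hex 0C)
    (r"\n", "\x0a"),    # newline (hex 0A)
    (r"\r", "\x0d"),    # carriage return (hex 0D)
    (r"\t", "\x09"),    # tab (hex 09)
    (r"\x41", "A"),     # \xhh: hex code 41
    (r"\060", "0"),     # \0dd: octal code 060 (decimal 48)
    (r"\101", "A"),     # \ddd: three octal digits, octal 101 (decimal 65)
]

# True for every case where the escape matches exactly its character.
results = [re.fullmatch(pattern, subject) is not None
           for pattern, subject in cases]
print(all(results))
```

The raw-string prefix keeps Python from interpreting the backslashes before the
regular expression engine sees them.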
</P>
<P>
When \x is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
it matches a literal "u".
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one code unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
</pre>
\C is dangerous because it may leave the current matching point in the middle
of a UTF-8 or UTF-16 character. The application can lock out the use of \C by
setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
with the use of \C permanently disabled.
</P>
<P>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
sequences is changed to use Unicode properties and they match many more
characters.
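</P>
<P>
The difference between ASCII-only and Unicode-property matching for \d and \w
can be sketched with Python's re module. This is only an analogy, not PCRE2:
Python's re.ASCII flag is assumed here to stand in for the PCRE2 default, and
Python's default Unicode matching for the behaviour with PCRE2_UCP set. The
exact character sets matched by the two libraries may differ.

```python
import re

# "7" is an ASCII digit; "\u0663" is ARABIC-INDIC DIGIT THREE (category Nd).
digits = "7\u0663"
# "e" is an ASCII letter; "\u00e9" is LATIN SMALL LETTER E WITH ACUTE.
letters = "e\u00e9"

# ASCII-only matching (stand-in for the PCRE2 default).
ascii_only_digits = re.findall(r"\d", digits, flags=re.ASCII)
ascii_only_word = re.findall(r"\w", letters, flags=re.ASCII)

# Unicode-property matching (stand-in for PCRE2_UCP).
unicode_digits = re.findall(r"\d", digits)
unicode_word = re.findall(r"\w", letters)

# Print match counts rather than the characters themselves.
print(len(ascii_only_digits), len(unicode_digits))
print(len(ascii_only_word), len(unicode_word))
```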
</P>
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&amp;         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
</pre>
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18.
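</P>
<P>
The two-letter general category codes listed above are the standard Unicode
ones, so they can be inspected with Python's unicodedata module. The sketch
below also models Xan and Xwd directly from their definitions in the table. The
helper names is_xan and is_xwd are hypothetical, and Python's Unicode database
may be a different version from the one PCRE2 was built with.

```python
import unicodedata

def is_xan(ch):
    # Xan: union of properties L (Letter) and N (Number).
    # The first letter of the general category is the major class.
    return unicodedata.category(ch)[0] in ("L", "N")

def is_xwd(ch):
    # Xwd: property Xan or underscore.
    return is_xan(ch) or ch == "_"

# Print the general category of a few ASCII characters.
for ch in ["A", "a", "5", "_", " ", "$", "+", "-"]:
    print(repr(ch), unicodedata.category(ch), is_xan(ch), is_xwd(ch))
```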
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Ahom,
Anatolian_Hieroglyphs,
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bassa_Vah,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Caucasian_Albanian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Grantha,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hatran,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Khojki,
Khudawadi,
Lao,
Latin,
Lepcha,
Limbu,
Linear_A,
Linear_B,
Lisu,
Lycian,
Lydian,
Mahajani,
Malayalam,
Mandaic,
Manichaean,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Modi,
Mongolian,
Mro,
Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
Nko,
Ogham,
Ol_Chiki,
Old_Hungarian,
Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Oriya,
Osmanya,
Pahawh_Hmong,
Palmyrene,
Pau_Cin_Hau,
Phags_Pa,
Phoenician,
Psalter_Pahlavi,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Siddham,
SignWriting,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Tirhuta,
Ugaritic,
Vai,
Warang_Citi,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE2, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE2_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
                also after an internal newline in multiline mode
                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
  \A          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \Z          end of subject
                also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</pre>
\K is honoured in positive assertions, but ignored in negative ones.
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&lt;name&gt;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&lt;name&gt;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&gt;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the very start of a pattern or after one
of the newline or \R options with similar syntax. More than one of them may
appear.
<pre>
  (*LIMIT_MATCH=d)      set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number)
  (*NOTEMPTY)           set PCRE2_NOTEMPTY when matching
  (*NOTEMPTY_ATSTART)   set PCRE2_NOTEMPTY_ATSTART when matching
  (*NO_AUTO_POSSESS)    no auto-possessification (PCRE2_NO_AUTO_POSSESS)
  (*NO_DOTSTAR_ANCHOR)  no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
  (*NO_JIT)             disable JIT optimization
  (*NO_START_OPT)       no start-match optimization (PCRE2_NO_START_OPTIMIZE)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_match(), not increase them. The application
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
PCRE2_NEVER_UCP options, respectively, at compile time.
</P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&lt;=...)        positive look behind
  (?&lt;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
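</P>
<P>
The four assertion forms above can be tried out with the sketch below. It uses
Python's re module as a stand-in, not PCRE2 itself; Python spells these
assertions the same way. Note that Python is assumed to be stricter than the
rule stated above, since it requires the whole look behind to have one fixed
width. The helper name compiles and the sample string are hypothetical.

```python
import re

prices = "cost: $42, weight: 42kg, refund: $7"

# Positive look behind: digits immediately preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)
# Negative look behind: digit runs not preceded by a dollar sign or a digit.
not_after_dollar = re.findall(r"(?<![\$\d])\d+", prices)
# Positive look ahead: digits immediately followed by "kg".
before_kg = re.findall(r"\d+(?=kg)", prices)
# Negative look ahead: digit runs not followed by another digit or "k".
not_before_k = re.findall(r"\d+(?![\dk])", prices)

def compiles(pattern):
    # Report whether the regex engine accepts the pattern.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Both branches are two characters long, so the look behind is fixed length.
fixed_ok = compiles(r"(?<=ab|cd)x")
# A quantified branch has no fixed length and is rejected.
variable_ok = compiles(r"(?<=a+)x")

print(after_dollar, not_after_dollar, before_kg, not_before_k)
print(fixed_ok, variable_ok)
```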
</P>
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&lt;name&gt;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&amp;name)        call subpattern by name (Perl)
  (?P&gt;name)       call subpattern by name (Python)
  \g&lt;name&gt;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&lt;n&gt;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&lt;+n&gt;          call subpattern by relative number (PCRE2 extension)
  \g'+n'          call subpattern by relative number (PCRE2 extension)
  \g&lt;-n&gt;          call subpattern by relative number (PCRE2 extension)
  \g'-n'          call subpattern by relative number (PCRE2 extension)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)               absolute reference condition
  (?(+n)              relative reference condition
  (?(-n)              relative reference condition
  (?(&lt;name&gt;)          named reference condition (Perl)
  (?('name')          named reference condition (Perl)
  (?(name)            named reference condition (PCRE2)
  (?(R)               overall recursion condition
  (?(Rn)              specific group recursion condition
  (?(R&amp;name)          specific recursion condition
  (?(DEFINE)          define subpattern for reference
  (?(VERSION[&gt;]=n.m)  test PCRE2 version
  (?(assert)          assertion condition
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act immediately they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)            callout (assumed number 0)
  (?Cn)           callout with numerical data n
  (?C"text")      callout with string data
</pre>
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 16 October 2015
<br>
Copyright © 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>