<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
<li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
<li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
<li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
<li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25"
href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains just a quick-reference summary of the
syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</PRE>
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .
             any character except newline;
               in dotall mode, any character whatsoever
  \C         one byte, even in UTF-8 mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal whitespace character
  \H         a character that is not a horizontal whitespace character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a whitespace character
  \S         a character that is not a whitespace character
  \v         a vertical whitespace character
  \V         a character that is not a vertical whitespace character
  \w         a "word" character
  \W         a "non-word" character
  \X         an extended Unicode sequence
</pre>
In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in UTF-8 mode. However, this can be changed by setting the
PCRE_UCP option.
</P>
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&amp;         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan
             Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, FF, CR
  Xwd        Perl word: property Xan or underscore
</PRE>
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Shavian,
Sinhala,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]
              negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?
              n or more, lazy
</PRE>
</P>
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</PRE>
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&lt;name&gt;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&lt;name&gt;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&gt;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)
                  unset option(s)
</pre>
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
<pre>
  (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)         set UTF-8 mode (PCRE_UTF8)
  (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
</PRE>
</P>
<br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&lt;=...)        positive look behind
  (?&lt;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
<br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&lt;name&gt;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&amp;name)        call subpattern by name (Perl)
  (?P&gt;name)       call subpattern by name (Python)
  \g&lt;name&gt;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&lt;n&gt;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&lt;+n&gt;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&lt;-n&gt;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC20"
href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&lt;name&gt;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&amp;name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act as soon as they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance start to current matching position
  (*THEN)         local failure, backtrack to next alternation
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) or (*UCP) option.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*...)
option that sets the newline convention or UTF-8 or UCP mode.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 21 November 2010
<br>
Copyright © 1997-2010 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>