:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re.py`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.

It is important to note that most regular expression operations are available as
module-level functions and methods on
:ref:`compiled regular expressions <re-objects>`.
The functions are shortcuts
that don't require you to compile a regex object first, but lack some
fine-tuning parameters.

.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contains low-precedence
operations, there are boundary conditions between *A* and *B*, or either has
numbered group references. Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here. For details of
the theory and implementation of regular expressions, consult the Friedl book
[Frie09]_, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.
(In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


The special characters are:

.. index:: single: . (dot); in regular expressions

``.``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

.. index:: single: ^ (caret); in regular expressions

``^``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

.. index:: single: $ (dollar); in regular expressions

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

.. index:: single: * (asterisk); in regular expressions

``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

.. index:: single: + (plus); in regular expressions

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

.. index:: single: ? (question mark); in regular expressions

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

.. index::
   single: *?; in regular expressions
   single: +?; in regular expressions
   single: ??; in regular expressions

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``'<a>'``.

.. index::
   single: {} (curly brackets); in regular expressions

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.
   Omitting *m* specifies a lower bound of zero, and omitting *n* specifies an
   infinite upper bound. As an example, ``a{4,}b`` will match ``'aaaab'`` or a
   thousand ``'a'`` characters followed by a ``'b'``, but not ``'aaab'``. The
   comma may not be omitted or the modifier would be confused with the
   previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

.. index:: single: \ (backslash); in regular expressions

``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be doubled. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

.. index::
   single: [] (square brackets); in regular expressions

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   .. index:: single: - (minus); in regular expressions

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   .. index:: single: \ (backslash); in regular expressions

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   .. index:: single: ^ (caret); in regular expressions

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a parenthesis.

   .. .. index:: single: --; in regular expressions
   .. .. index:: single: &&; in regular expressions
   .. .. index:: single: ~~; in regular expressions
   .. .. index:: single: ||; in regular expressions

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future. This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.

.. index:: single: | (vertical bar); in regular expressions

``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

.. index::
   single: () (parentheses); in regular expressions

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below.
   To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

.. index:: single: (?; in regular expressions

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
   for the entire regular expression.
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flags* argument to the
   :func:`re.compile` function. Flags should be used first in the
   expression string.

.. index:: single: (?:; in regular expressions

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
   one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags:
   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for the part of the expression.
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``. Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group. In Unicode patterns, ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default). In bytes patterns, ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.

.. index:: single: (?P<; in regular expressions

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

.. index:: single: (?P=; in regular expressions

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

.. index:: single: (?#; in regular expressions

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

.. index:: single: (?=; in regular expressions

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

.. index:: single: (?!; in regular expressions

``(?!...)``
   Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

.. index:: single: (?<=; in regular expressions

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position.
   This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

.. index:: single: (?<!; in regular expressions

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.


The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
character ``'$'``.

.. index:: single: \ (backslash); in regular expressions

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

.. index:: single: \A; in regular expressions

``\A``
   Matches only at the start of the string.

.. index:: single: \b; in regular expressions

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of word characters. Note that formally,
   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
   (or vice versa), or between ``\w`` and the beginning/end of the string.
   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

   By default Unicode alphanumerics are the ones used in Unicode patterns, but
   this can be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

.. index:: single: \B; in regular expressions

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end
   of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
   patterns are Unicode alphanumerics or the underscore, although this can
   be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.

.. index:: single: \d; in regular expressions

``\d``
   For Unicode (str) patterns:
      Matches any Unicode decimal digit (that is, any character in
      Unicode character category [Nd]). This includes ``[0-9]``, and
      also many other digit characters. If the :const:`ASCII` flag is
      used only ``[0-9]`` is matched.

   For 8-bit (bytes) patterns:
      Matches any decimal digit; this is equivalent to ``[0-9]``.

.. index:: single: \D; in regular expressions

``\D``
   Matches any character which is not a decimal digit. This is
   the opposite of ``\d``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^0-9]``.

.. index:: single: \s; in regular expressions

``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
      languages). If the :const:`ASCII` flag is used, only
      ``[ \t\n\r\f\v]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.

.. index:: single: \S; in regular expressions

``\S``
   Matches any character which is not a whitespace character. This is
   the opposite of ``\s``.
   If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^ \t\n\r\f\v]``.

.. index:: single: \w; in regular expressions

``\w``
   For Unicode (str) patterns:
      Matches Unicode word characters; this includes most characters
      that can be part of a word in any language, as well as numbers and
      the underscore. If the :const:`ASCII` flag is used, only
      ``[a-zA-Z0-9_]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
      this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
      used, matches characters considered alphanumeric in the current locale
      and the underscore.

.. index:: single: \W; in regular expressions

``\W``
   Matches any character which is not a word character. This is
   the opposite of ``\w``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
   used, matches characters which are neither alphanumeric in the current locale
   nor the underscore.

.. index:: single: \Z; in regular expressions

``\Z``
   Matches only at the end of the string.

.. index::
   single: \a; in regular expressions
   single: \b; in regular expressions
   single: \f; in regular expressions
   single: \n; in regular expressions
   single: \N; in regular expressions
   single: \r; in regular expressions
   single: \t; in regular expressions
   single: \u; in regular expressions
   single: \U; in regular expressions
   single: \v; in regular expressions
   single: \x; in regular expressions
   single: \\; in regular expressions

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \u      \U
   \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)
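
The two meanings of ``\b`` can be checked with a minimal sketch. The variable
names and sample strings below are illustrative assumptions, not part of the
module::

   import re

   # Outside a character class, \b is a zero-width word boundary.
   boundary = re.search(r'\bfoo\b', 'bar foo baz')
   print(boundary.span())                    # (4, 7)

   # There is no word boundary after 'foo' inside 'foobar', so nothing matches.
   print(re.search(r'\bfoo\b', 'foobar'))    # None

   # Inside a character class, \b is the backspace character (U+0008).
   backspace = re.fullmatch(r'[\b]', '\x08')
   print(backspace is not None)              # True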

``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
patterns. In bytes patterns they are errors. Unknown escapes of ASCII
letters are reserved for future use and treated as errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

.. versionchanged:: 3.3
   The ``'\u'`` and ``'\U'`` escape sequences have been added.

.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.



.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications always use the compiled
form.

.. versionchanged:: 3.6
   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
   :class:`enum.IntFlag`.

.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a :ref:`regular expression object
   <re-objects>`, which can be used for matching using its
   :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
   below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).
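
   As a minimal sketch of combining flags, the following compiles one pattern
   with both :const:`IGNORECASE` and :const:`MULTILINE`. The pattern and the
   sample text are illustrative assumptions only::

      import re

      # Combine two flags with bitwise OR.
      pattern = re.compile(r'^spam$', re.IGNORECASE | re.MULTILINE)

      # '^' and '$' now anchor at every line, and case is ignored.
      matches = pattern.findall('Spam\nSPAM\neggs')
      print(matches)   # ['Spam', 'SPAM']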

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.compile` and the module-level matching functions are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


.. data:: A
          ASCII

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching. This is only
   meaningful for Unicode patterns, and is ignored for byte patterns.
   Corresponds to the inline flag ``(?a)``.

   Note that for backward compatibility, the :const:`re.U` flag still
   exists (as well as its synonym :const:`re.UNICODE` and its embedded
   counterpart ``(?u)``), but these are redundant in Python 3 since
   matches are Unicode by default for strings (and Unicode matching
   isn't allowed for bytes).


.. data:: DEBUG

   Display debug information about compiled expression.
   No corresponding inline flag.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
   match lowercase letters. Full Unicode matching (such as ``Ü`` matching
   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
   non-ASCII matches. The current locale does not change the effect of this
   flag unless the :const:`re.LOCALE` flag is also used.
   Corresponds to the inline flag ``(?i)``.
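
   The interaction between :const:`IGNORECASE` and :const:`ASCII` can be
   sketched as follows. The variable names and sample strings are illustrative
   assumptions::

      import re

      # [A-Z] also matches lowercase letters under IGNORECASE.
      lower = re.fullmatch(r'[A-Z]+', 'hello', re.IGNORECASE)

      # Non-ASCII case folding works by default for str patterns.
      unicode_fold = re.fullmatch('ü', 'Ü', re.IGNORECASE)

      # Adding ASCII restricts case folding to ASCII letters.
      ascii_fold = re.fullmatch('ü', 'Ü', re.IGNORECASE | re.ASCII)

      print(lower is not None, unicode_fold is not None, ascii_fold is None)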
665 666 Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in 667 combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII 668 letters and 4 additional non-ASCII letters: '' (U+0130, Latin capital 669 letter I with dot above), '' (U+0131, Latin small letter dotless i), 670 '' (U+017F, Latin small letter long s) and '' (U+212A, Kelvin sign). 671 If the :const:`ASCII` flag is used, only letters 'a' to 'z' 672 and 'A' to 'Z' are matched. 673 674 .. data:: L 675 LOCALE 676 677 Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching 678 dependent on the current locale. This flag can be used only with bytes 679 patterns. The use of this flag is discouraged as the locale mechanism 680 is very unreliable, it only handles one "culture" at a time, and it only 681 works with 8-bit locales. Unicode matching is already enabled by default 682 in Python 3 for Unicode (str) patterns, and it is able to handle different 683 locales/languages. 684 Corresponds to the inline flag ``(?L)``. 685 686 .. versionchanged:: 3.6 687 :const:`re.LOCALE` can be used only with bytes patterns and is 688 not compatible with :const:`re.ASCII`. 689 690 .. versionchanged:: 3.7 691 Compiled regular expression objects with the :const:`re.LOCALE` flag no 692 longer depend on the locale at compile time. Only the locale at 693 matching time affects the result of matching. 694 695 696 .. data:: M 697 MULTILINE 698 699 When specified, the pattern character ``'^'`` matches at the beginning of the 700 string and at the beginning of each line (immediately following each newline); 701 and the pattern character ``'$'`` matches at the end of the string and at the 702 end of each line (immediately preceding each newline). By default, ``'^'`` 703 matches only at the beginning of the string, and ``'$'`` only at the end of the 704 string and immediately before the newline (if any) at the end of the string. 705 Corresponds to the inline flag ``(?m)``. 706 707 708 .. 
data:: S 709 DOTALL 710 711 Make the ``'.'`` special character match any character at all, including a 712 newline; without this flag, ``'.'`` will match anything *except* a newline. 713 Corresponds to the inline flag ``(?s)``. 714 715 716 .. data:: X 717 VERBOSE 718 719 .. index:: single: # (hash); in regular expressions 720 721 This flag allows you to write regular expressions that look nicer and are 722 more readable by allowing you to visually separate logical sections of the 723 pattern and add comments. Whitespace within the pattern is ignored, except 724 when in a character class, or when preceded by an unescaped backslash, 725 or within tokens like ``*?``, ``(?:`` or ``(?P<...>``. 726 When a line contains a ``#`` that is not in a character class and is not 727 preceded by an unescaped backslash, all characters from the leftmost such 728 ``#`` through the end of the line are ignored. 729 730 This means that the two following regular expression objects that match a 731 decimal number are functionally equal:: 732 733 a = re.compile(r"""\d + # the integral part 734 \. # the decimal point 735 \d * # some fractional digits""", re.X) 736 b = re.compile(r"\d+\.\d*") 737 738 Corresponds to the inline flag ``(?x)``. 739 740 741 .. function:: search(pattern, string, flags=0) 742 743 Scan through *string* looking for the first location where the regular expression 744 *pattern* produces a match, and return a corresponding :ref:`match object 745 <match-objects>`. Return ``None`` if no position in the string matches the 746 pattern; note that this is different from finding a zero-length match at some 747 point in the string. 748 749 750 .. function:: match(pattern, string, flags=0) 751 752 If zero or more characters at the beginning of *string* match the regular 753 expression *pattern*, return a corresponding :ref:`match object 754 <match-objects>`. Return ``None`` if the string does not match the pattern; 755 note that this is different from a zero-length match. 
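The difference between "no match" and a zero-length match can be seen in a small sketch. The patterns and sample strings here are chosen only for illustration.

```python
import re

# A pattern that can match the empty string: zero or more digits.
# "abc" does not start with a digit, so the match is empty, but it
# is still a match object rather than None.
zero_len = re.match(r"\d*", "abc")
print(zero_len is not None)   # True
print(repr(zero_len.group())) # ''
print(zero_len.span())        # (0, 0)

# A pattern that needs at least one digit cannot match at position 0.
no_match = re.match(r"\d+", "abc")
print(no_match)               # None

# match() only tries the beginning of the string, while search()
# scans forward for the first position where the pattern matches.
at_start = re.match(r"\d+", "abc123")
anywhere = re.search(r"\d+", "abc123")
print(at_start)               # None
print(anywhere.group())       # 123
```

Because a match object is always truthy, even when the matched text is empty, a plain ``if`` test on the result is enough to tell the two cases apart.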
756 757 Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match 758 at the beginning of the string and not at the beginning of each line. 759 760 If you want to locate a match anywhere in *string*, use :func:`search` 761 instead (see also :ref:`search-vs-match`). 762 763 764 .. function:: fullmatch(pattern, string, flags=0) 765 766 If the whole *string* matches the regular expression *pattern*, return a 767 corresponding :ref:`match object <match-objects>`. Return ``None`` if the 768 string does not match the pattern; note that this is different from a 769 zero-length match. 770 771 .. versionadded:: 3.4 772 773 774 .. function:: split(pattern, string, maxsplit=0, flags=0) 775 776 Split *string* by the occurrences of *pattern*. If capturing parentheses are 777 used in *pattern*, then the text of all groups in the pattern are also returned 778 as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit* 779 splits occur, and the remainder of the string is returned as the final element 780 of the list. :: 781 782 >>> re.split(r'\W+', 'Words, words, words.') 783 ['Words', 'words', 'words', ''] 784 >>> re.split(r'(\W+)', 'Words, words, words.') 785 ['Words', ', ', 'words', ', ', 'words', '.', ''] 786 >>> re.split(r'\W+', 'Words, words, words.', 1) 787 ['Words', 'words, words.'] 788 >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE) 789 ['0', '3', '9'] 790 791 If there are capturing groups in the separator and it matches at the start of 792 the string, the result will start with an empty string. The same holds for 793 the end of the string:: 794 795 >>> re.split(r'(\W+)', '...words, words...') 796 ['', '...', 'words', ', ', 'words', '...', ''] 797 798 That way, separator components are always found at the same relative 799 indices within the result list. 800 801 Empty matches for the pattern split the string only when not adjacent 802 to a previous empty match. 
803 804 >>> re.split(r'\b', 'Words, words, words.') 805 ['', 'Words', ', ', 'words', ', ', 'words', '.'] 806 >>> re.split(r'\W*', '...words...') 807 ['', '', 'w', 'o', 'r', 'd', 's', '', ''] 808 >>> re.split(r'(\W*)', '...words...') 809 ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', ''] 810 811 .. versionchanged:: 3.1 812 Added the optional flags argument. 813 814 .. versionchanged:: 3.7 815 Added support of splitting on a pattern that could match an empty string. 816 817 818 .. function:: findall(pattern, string, flags=0) 819 820 Return all non-overlapping matches of *pattern* in *string*, as a list of 821 strings. The *string* is scanned left-to-right, and matches are returned in 822 the order found. If one or more groups are present in the pattern, return a 823 list of groups; this will be a list of tuples if the pattern has more than 824 one group. Empty matches are included in the result. 825 826 .. versionchanged:: 3.7 827 Non-empty matches can now start just after a previous empty match. 828 829 830 .. function:: finditer(pattern, string, flags=0) 831 832 Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over 833 all non-overlapping matches for the RE *pattern* in *string*. The *string* 834 is scanned left-to-right, and matches are returned in the order found. Empty 835 matches are included in the result. 836 837 .. versionchanged:: 3.7 838 Non-empty matches can now start just after a previous empty match. 839 840 841 .. function:: sub(pattern, repl, string, count=0, flags=0) 842 843 Return the string obtained by replacing the leftmost non-overlapping occurrences 844 of *pattern* in *string* by the replacement *repl*. If the pattern isn't found, 845 *string* is returned unchanged. *repl* can be a string or a function; if it is 846 a string, any backslash escapes in it are processed. That is, ``\n`` is 847 converted to a single newline character, ``\r`` is converted to a carriage return, and 848 so forth. 
Unknown escapes of ASCII letters are reserved for future use and 849 treated as errors. Other unknown escapes such as ``\&`` are left alone. 850 Backreferences, such 851 as ``\6``, are replaced with the substring matched by group 6 in the pattern. 852 For example:: 853 854 >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):', 855 ... r'static PyObject*\npy_\1(void)\n{', 856 ... 'def myfunc():') 857 'static PyObject*\npy_myfunc(void)\n{' 858 859 If *repl* is a function, it is called for every non-overlapping occurrence of 860 *pattern*. The function takes a single :ref:`match object <match-objects>` 861 argument, and returns the replacement string. For example:: 862 863 >>> def dashrepl(matchobj): 864 ... if matchobj.group(0) == '-': return ' ' 865 ... else: return '-' 866 >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files') 867 'pro--gram files' 868 >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE) 869 'Baked Beans & Spam' 870 871 The pattern may be a string or a :ref:`pattern object <re-objects>`. 872 873 The optional argument *count* is the maximum number of pattern occurrences to be 874 replaced; *count* must be a non-negative integer. If omitted or zero, all 875 occurrences will be replaced. Empty matches for the pattern are replaced only 876 when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns 877 ``'-a-b--d-'``. 878 879 .. index:: single: \g; in regular expressions 880 881 In string-type *repl* arguments, in addition to the character escapes and 882 backreferences described above, 883 ``\g<name>`` will use the substring matched by the group named ``name``, as 884 defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding 885 group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous 886 in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a 887 reference to group 20, not a reference to group 2 followed by the literal 888 character ``'0'``. 
   The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      are now errors.

   .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      are now errors.

      Empty matches for the pattern are replaced when adjacent to a previous
      non-empty match.


.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.


.. function:: escape(pattern)

   Escape special characters in *pattern*.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it.  For example::

      >>> print(re.escape('python.exe'))
      python\.exe

      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print('[%s]+' % re.escape(legal_chars))
      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
      /|\-|\+|\*\*|\*

   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; only backslashes should be escaped.  For example::

      >>> digits_re = r'\d+'
      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
      /usr/sbin/sendmail - \d+ errors, \d+ warnings

   .. 
versionchanged:: 3.3 947 The ``'_'`` character is no longer escaped. 948 949 .. versionchanged:: 3.7 950 Only characters that can have special meaning in a regular expression 951 are escaped. 952 953 954 .. function:: purge() 955 956 Clear the regular expression cache. 957 958 959 .. exception:: error(msg, pattern=None, pos=None) 960 961 Exception raised when a string passed to one of the functions here is not a 962 valid regular expression (for example, it might contain unmatched parentheses) 963 or when some other error occurs during compilation or matching. It is never an 964 error if a string contains no match for a pattern. The error instance has 965 the following additional attributes: 966 967 .. attribute:: msg 968 969 The unformatted error message. 970 971 .. attribute:: pattern 972 973 The regular expression pattern. 974 975 .. attribute:: pos 976 977 The index in *pattern* where compilation failed (may be ``None``). 978 979 .. attribute:: lineno 980 981 The line corresponding to *pos* (may be ``None``). 982 983 .. attribute:: colno 984 985 The column corresponding to *pos* (may be ``None``). 986 987 .. versionchanged:: 3.5 988 Added additional attributes. 989 990 .. _re-objects: 991 992 Regular Expression Objects 993 -------------------------- 994 995 Compiled regular expression objects support the following methods and 996 attributes: 997 998 .. method:: Pattern.search(string[, pos[, endpos]]) 999 1000 Scan through *string* looking for the first location where this regular 1001 expression produces a match, and return a corresponding :ref:`match object 1002 <match-objects>`. Return ``None`` if no position in the string matches the 1003 pattern; note that this is different from finding a zero-length match at some 1004 point in the string. 1005 1006 The optional second parameter *pos* gives an index in the string where the 1007 search is to start; it defaults to ``0``. 
This is not completely equivalent to 1008 slicing the string; the ``'^'`` pattern character matches at the real beginning 1009 of the string and at positions just after a newline, but not necessarily at the 1010 index where the search is to start. 1011 1012 The optional parameter *endpos* limits how far the string will be searched; it 1013 will be as if the string is *endpos* characters long, so only the characters 1014 from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less 1015 than *pos*, no match will be found; otherwise, if *rx* is a compiled regular 1016 expression object, ``rx.search(string, 0, 50)`` is equivalent to 1017 ``rx.search(string[:50], 0)``. :: 1018 1019 >>> pattern = re.compile("d") 1020 >>> pattern.search("dog") # Match at index 0 1021 <re.Match object; span=(0, 1), match='d'> 1022 >>> pattern.search("dog", 1) # No match; search doesn't include the "d" 1023 1024 1025 .. method:: Pattern.match(string[, pos[, endpos]]) 1026 1027 If zero or more characters at the *beginning* of *string* match this regular 1028 expression, return a corresponding :ref:`match object <match-objects>`. 1029 Return ``None`` if the string does not match the pattern; note that this is 1030 different from a zero-length match. 1031 1032 The optional *pos* and *endpos* parameters have the same meaning as for the 1033 :meth:`~Pattern.search` method. :: 1034 1035 >>> pattern = re.compile("o") 1036 >>> pattern.match("dog") # No match as "o" is not at the start of "dog". 1037 >>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog". 1038 <re.Match object; span=(1, 2), match='o'> 1039 1040 If you want to locate a match anywhere in *string*, use 1041 :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`). 1042 1043 1044 .. method:: Pattern.fullmatch(string[, pos[, endpos]]) 1045 1046 If the whole *string* matches this regular expression, return a corresponding 1047 :ref:`match object <match-objects>`. 
Return ``None`` if the string does not 1048 match the pattern; note that this is different from a zero-length match. 1049 1050 The optional *pos* and *endpos* parameters have the same meaning as for the 1051 :meth:`~Pattern.search` method. :: 1052 1053 >>> pattern = re.compile("o[gh]") 1054 >>> pattern.fullmatch("dog") # No match as "o" is not at the start of "dog". 1055 >>> pattern.fullmatch("ogre") # No match as not the full string matches. 1056 >>> pattern.fullmatch("doggie", 1, 3) # Matches within given limits. 1057 <re.Match object; span=(1, 3), match='og'> 1058 1059 .. versionadded:: 3.4 1060 1061 1062 .. method:: Pattern.split(string, maxsplit=0) 1063 1064 Identical to the :func:`split` function, using the compiled pattern. 1065 1066 1067 .. method:: Pattern.findall(string[, pos[, endpos]]) 1068 1069 Similar to the :func:`findall` function, using the compiled pattern, but 1070 also accepts optional *pos* and *endpos* parameters that limit the search 1071 region like for :meth:`search`. 1072 1073 1074 .. method:: Pattern.finditer(string[, pos[, endpos]]) 1075 1076 Similar to the :func:`finditer` function, using the compiled pattern, but 1077 also accepts optional *pos* and *endpos* parameters that limit the search 1078 region like for :meth:`search`. 1079 1080 1081 .. method:: Pattern.sub(repl, string, count=0) 1082 1083 Identical to the :func:`sub` function, using the compiled pattern. 1084 1085 1086 .. method:: Pattern.subn(repl, string, count=0) 1087 1088 Identical to the :func:`subn` function, using the compiled pattern. 1089 1090 1091 .. attribute:: Pattern.flags 1092 1093 The regex matching flags. This is a combination of the flags given to 1094 :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit 1095 flags such as :data:`UNICODE` if the pattern is a Unicode string. 1096 1097 1098 .. attribute:: Pattern.groups 1099 1100 The number of capturing groups in the pattern. 1101 1102 1103 .. 
attribute:: Pattern.groupindex 1104 1105 A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group 1106 numbers. The dictionary is empty if no symbolic groups were used in the 1107 pattern. 1108 1109 1110 .. attribute:: Pattern.pattern 1111 1112 The pattern string from which the pattern object was compiled. 1113 1114 1115 .. versionchanged:: 3.7 1116 Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Compiled 1117 regular expression objects are considered atomic. 1118 1119 1120 .. _match-objects: 1121 1122 Match Objects 1123 ------------- 1124 1125 Match objects always have a boolean value of ``True``. 1126 Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None`` 1127 when there is no match, you can test whether there was a match with a simple 1128 ``if`` statement:: 1129 1130 match = re.search(pattern, string) 1131 if match: 1132 process(match) 1133 1134 Match objects support the following methods and attributes: 1135 1136 1137 .. method:: Match.expand(template) 1138 1139 Return the string obtained by doing backslash substitution on the template 1140 string *template*, as done by the :meth:`~Pattern.sub` method. 1141 Escapes such as ``\n`` are converted to the appropriate characters, 1142 and numeric backreferences (``\1``, ``\2``) and named backreferences 1143 (``\g<1>``, ``\g<name>``) are replaced by the contents of the 1144 corresponding group. 1145 1146 .. versionchanged:: 3.5 1147 Unmatched groups are replaced with an empty string. 1148 1149 .. method:: Match.group([group1, ...]) 1150 1151 Returns one or more subgroups of the match. If there is a single argument, the 1152 result is a single string; if there are multiple arguments, the result is a 1153 tuple with one item per argument. Without arguments, *group1* defaults to zero 1154 (the whole match is returned). 
If a *groupN* argument is zero, the corresponding 1155 return value is the entire matching string; if it is in the inclusive range 1156 [1..99], it is the string matching the corresponding parenthesized group. If a 1157 group number is negative or larger than the number of groups defined in the 1158 pattern, an :exc:`IndexError` exception is raised. If a group is contained in a 1159 part of the pattern that did not match, the corresponding result is ``None``. 1160 If a group is contained in a part of the pattern that matched multiple times, 1161 the last match is returned. :: 1162 1163 >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist") 1164 >>> m.group(0) # The entire match 1165 'Isaac Newton' 1166 >>> m.group(1) # The first parenthesized subgroup. 1167 'Isaac' 1168 >>> m.group(2) # The second parenthesized subgroup. 1169 'Newton' 1170 >>> m.group(1, 2) # Multiple arguments give us a tuple. 1171 ('Isaac', 'Newton') 1172 1173 If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN* 1174 arguments may also be strings identifying groups by their group name. If a 1175 string argument is not used as a group name in the pattern, an :exc:`IndexError` 1176 exception is raised. 1177 1178 A moderately complicated example:: 1179 1180 >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds") 1181 >>> m.group('first_name') 1182 'Malcolm' 1183 >>> m.group('last_name') 1184 'Reynolds' 1185 1186 Named groups can also be referred to by their index:: 1187 1188 >>> m.group(1) 1189 'Malcolm' 1190 >>> m.group(2) 1191 'Reynolds' 1192 1193 If a group matches multiple times, only the last match is accessible:: 1194 1195 >>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times. 1196 >>> m.group(1) # Returns only the last match. 1197 'c3' 1198 1199 1200 .. method:: Match.__getitem__(g) 1201 1202 This is identical to ``m.group(g)``. 
This allows easier access to 1203 an individual group from a match:: 1204 1205 >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist") 1206 >>> m[0] # The entire match 1207 'Isaac Newton' 1208 >>> m[1] # The first parenthesized subgroup. 1209 'Isaac' 1210 >>> m[2] # The second parenthesized subgroup. 1211 'Newton' 1212 1213 .. versionadded:: 3.6 1214 1215 1216 .. method:: Match.groups(default=None) 1217 1218 Return a tuple containing all the subgroups of the match, from 1 up to however 1219 many groups are in the pattern. The *default* argument is used for groups that 1220 did not participate in the match; it defaults to ``None``. 1221 1222 For example:: 1223 1224 >>> m = re.match(r"(\d+)\.(\d+)", "24.1632") 1225 >>> m.groups() 1226 ('24', '1632') 1227 1228 If we make the decimal place and everything after it optional, not all groups 1229 might participate in the match. These groups will default to ``None`` unless 1230 the *default* argument is given:: 1231 1232 >>> m = re.match(r"(\d+)\.?(\d+)?", "24") 1233 >>> m.groups() # Second group defaults to None. 1234 ('24', None) 1235 >>> m.groups('0') # Now, the second group defaults to '0'. 1236 ('24', '0') 1237 1238 1239 .. method:: Match.groupdict(default=None) 1240 1241 Return a dictionary containing all the *named* subgroups of the match, keyed by 1242 the subgroup name. The *default* argument is used for groups that did not 1243 participate in the match; it defaults to ``None``. For example:: 1244 1245 >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds") 1246 >>> m.groupdict() 1247 {'first_name': 'Malcolm', 'last_name': 'Reynolds'} 1248 1249 1250 .. method:: Match.start([group]) 1251 Match.end([group]) 1252 1253 Return the indices of the start and end of the substring matched by *group*; 1254 *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if 1255 *group* exists but did not contribute to the match. 
   For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

   An example that will remove *remove_this* from email addresses::

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


.. method:: Match.span([group])

   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   that if *group* did not contribute to the match, this is ``(-1, -1)``.
   *group* defaults to zero, the entire match.


.. attribute:: Match.pos

   The value of *pos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
   the index into the string at which the RE engine started looking for a match.


.. attribute:: Match.endpos

   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
   the index into the string beyond which the RE engine will not go.


.. attribute:: Match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


.. 
attribute:: Match.lastgroup 1305 1306 The name of the last matched capturing group, or ``None`` if the group didn't 1307 have a name, or if no group was matched at all. 1308 1309 1310 .. attribute:: Match.re 1311 1312 The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or 1313 :meth:`~Pattern.search` method produced this match instance. 1314 1315 1316 .. attribute:: Match.string 1317 1318 The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`. 1319 1320 1321 .. versionchanged:: 3.7 1322 Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Match objects 1323 are considered atomic. 1324 1325 1326 .. _re-examples: 1327 1328 Regular Expression Examples 1329 --------------------------- 1330 1331 1332 Checking for a Pair 1333 ^^^^^^^^^^^^^^^^^^^ 1334 1335 In this example, we'll use the following helper function to display match 1336 objects a little more gracefully: 1337 1338 .. testcode:: 1339 1340 def displaymatch(match): 1341 if match is None: 1342 return None 1343 return '<Match: %r, groups=%r>' % (match.group(), match.groups()) 1344 1345 Suppose you are writing a poker program where a player's hand is represented as 1346 a 5-character string with each character representing a card, "a" for ace, "k" 1347 for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9" 1348 representing the card with that value. 1349 1350 To see if a given string is a valid hand, one could do the following:: 1351 1352 >>> valid = re.compile(r"^[a2-9tjqk]{5}$") 1353 >>> displaymatch(valid.match("akt5q")) # Valid. 1354 "<Match: 'akt5q', groups=()>" 1355 >>> displaymatch(valid.match("akt5e")) # Invalid. 1356 >>> displaymatch(valid.match("akt")) # Invalid. 1357 >>> displaymatch(valid.match("727ak")) # Valid. 1358 "<Match: '727ak', groups=()>" 1359 1360 That last hand, ``"727ak"``, contained a pair, or two of the same valued cards. 
1361 To match this with a regular expression, one could use backreferences as such:: 1362 1363 >>> pair = re.compile(r".*(.).*\1") 1364 >>> displaymatch(pair.match("717ak")) # Pair of 7s. 1365 "<Match: '717', groups=('7',)>" 1366 >>> displaymatch(pair.match("718ak")) # No pairs. 1367 >>> displaymatch(pair.match("354aa")) # Pair of aces. 1368 "<Match: '354aa', groups=('a',)>" 1369 1370 To find out what card the pair consists of, one could use the 1371 :meth:`~Match.group` method of the match object in the following manner: 1372 1373 .. doctest:: 1374 1375 >>> pair.match("717ak").group(1) 1376 '7' 1377 1378 # Error because re.match() returns None, which doesn't have a group() method: 1379 >>> pair.match("718ak").group(1) 1380 Traceback (most recent call last): 1381 File "<pyshell#23>", line 1, in <module> 1382 re.match(r".*(.).*\1", "718ak").group(1) 1383 AttributeError: 'NoneType' object has no attribute 'group' 1384 1385 >>> pair.match("354aa").group(1) 1386 'a' 1387 1388 1389 Simulating scanf() 1390 ^^^^^^^^^^^^^^^^^^ 1391 1392 .. index:: single: scanf() 1393 1394 Python does not currently have an equivalent to :c:func:`scanf`. Regular 1395 expressions are generally more powerful, though also more verbose, than 1396 :c:func:`scanf` format strings. The table below offers some more-or-less 1397 equivalent mappings between :c:func:`scanf` format tokens and regular 1398 expressions. 
1399 1400 +--------------------------------+---------------------------------------------+ 1401 | :c:func:`scanf` Token | Regular Expression | 1402 +================================+=============================================+ 1403 | ``%c`` | ``.`` | 1404 +--------------------------------+---------------------------------------------+ 1405 | ``%5c`` | ``.{5}`` | 1406 +--------------------------------+---------------------------------------------+ 1407 | ``%d`` | ``[-+]?\d+`` | 1408 +--------------------------------+---------------------------------------------+ 1409 | ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` | 1410 +--------------------------------+---------------------------------------------+ 1411 | ``%i`` | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)`` | 1412 +--------------------------------+---------------------------------------------+ 1413 | ``%o`` | ``[-+]?[0-7]+`` | 1414 +--------------------------------+---------------------------------------------+ 1415 | ``%s`` | ``\S+`` | 1416 +--------------------------------+---------------------------------------------+ 1417 | ``%u`` | ``\d+`` | 1418 +--------------------------------+---------------------------------------------+ 1419 | ``%x``, ``%X`` | ``[-+]?(0[xX])?[\dA-Fa-f]+`` | 1420 +--------------------------------+---------------------------------------------+ 1421 1422 To extract the filename and numbers from a string like :: 1423 1424 /usr/sbin/sendmail - 0 errors, 4 warnings 1425 1426 you would use a :c:func:`scanf` format like :: 1427 1428 %s - %d errors, %d warnings 1429 1430 The equivalent regular expression would be :: 1431 1432 (\S+) - (\d+) errors, (\d+) warnings 1433 1434 1435 .. _search-vs-match: 1436 1437 search() vs. match() 1438 ^^^^^^^^^^^^^^^^^^^^ 1439 1440 .. sectionauthor:: Fred L. Drake, Jr. 
<fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <re.Match object; span=(0, 1), match='a'>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line. ::

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <re.Match object; span=(4, 5), match='X'>


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern.  The
method is invaluable for converting textual data into data structures that can be
easily read and modified by Python, as demonstrated in the following example that
creates a phonebook.

First, here is the input.  Normally it would come from a file; here we are using
triple-quoted string syntax::

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... 
Heather Albrecht: 548.326.4584 919 Park Place""" 1488 1489 The entries are separated by one or more newlines. Now we convert the string 1490 into a list with each nonempty line having its own entry: 1491 1492 .. doctest:: 1493 :options: +NORMALIZE_WHITESPACE 1494 1495 >>> entries = re.split("\n+", text) 1496 >>> entries 1497 ['Ross McFluff: 834.345.1254 155 Elm Street', 1498 'Ronald Heathmore: 892.345.3428 436 Finley Avenue', 1499 'Frank Burger: 925.541.7625 662 South Dogwood Way', 1500 'Heather Albrecht: 548.326.4584 919 Park Place'] 1501 1502 Finally, split each entry into a list with first name, last name, telephone 1503 number, and address. We use the ``maxsplit`` parameter of :func:`split` 1504 because the address has spaces, our splitting pattern, in it: 1505 1506 .. doctest:: 1507 :options: +NORMALIZE_WHITESPACE 1508 1509 >>> [re.split(":? ", entry, 3) for entry in entries] 1510 [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'], 1511 ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'], 1512 ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'], 1513 ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']] 1514 1515 The ``:?`` pattern matches the colon after the last name, so that it does not 1516 occur in the result list. With a ``maxsplit`` of ``4``, we could separate the 1517 house number from the street name: 1518 1519 .. doctest:: 1520 :options: +NORMALIZE_WHITESPACE 1521 1522 >>> [re.split(":? ", entry, 4) for entry in entries] 1523 [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'], 1524 ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'], 1525 ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'], 1526 ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']] 1527 1528 1529 Text Munging 1530 ^^^^^^^^^^^^ 1531 1532 :func:`sub` replaces every occurrence of a pattern with a string or the 1533 result of a function. 
This example demonstrates using :func:`sub` with 1534 a function to "munge" text, or randomize the order of all the characters 1535 in each word of a sentence except for the first and last characters:: 1536 1537 >>> def repl(m): 1538 ... inner_word = list(m.group(2)) 1539 ... random.shuffle(inner_word) 1540 ... return m.group(1) + "".join(inner_word) + m.group(3) 1541 >>> text = "Professor Abdolmalek, please report your absences promptly." 1542 >>> re.sub(r"(\w)(\w+)(\w)", repl, text) 1543 'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.' 1544 >>> re.sub(r"(\w)(\w+)(\w)", repl, text) 1545 'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.' 1546 1547 1548 Finding all Adverbs 1549 ^^^^^^^^^^^^^^^^^^^ 1550 1551 :func:`findall` matches *all* occurrences of a pattern, not just the first 1552 one as :func:`search` does. For example, if a writer wanted to 1553 find all of the adverbs in some text, they might use :func:`findall` in 1554 the following manner:: 1555 1556 >>> text = "He was carefully disguised but captured quickly by police." 1557 >>> re.findall(r"\w+ly", text) 1558 ['carefully', 'quickly'] 1559 1560 1561 Finding all Adverbs and their Positions 1562 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1563 1564 If one wants more information about all matches of a pattern than the matched 1565 text, :func:`finditer` is useful as it provides :ref:`match objects 1566 <match-objects>` instead of strings. Continuing with the previous example, if 1567 a writer wanted to find all of the adverbs *and their positions* in 1568 some text, they would use :func:`finditer` in the following manner:: 1569 1570 >>> text = "He was carefully disguised but captured quickly by police." 1571 >>> for m in re.finditer(r"\w+ly", text): 1572 ... 
print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0))) 1573 07-16: carefully 1574 40-47: quickly 1575 1576 1577 Raw String Notation 1578 ^^^^^^^^^^^^^^^^^^^ 1579 1580 Raw string notation (``r"text"``) keeps regular expressions sane. Without it, 1581 every backslash (``'\'``) in a regular expression would have to be prefixed with 1582 another one to escape it. For example, the two following lines of code are 1583 functionally identical:: 1584 1585 >>> re.match(r"\W(.)\1\W", " ff ") 1586 <re.Match object; span=(0, 4), match=' ff '> 1587 >>> re.match("\\W(.)\\1\\W", " ff ") 1588 <re.Match object; span=(0, 4), match=' ff '> 1589 1590 When one wants to match a literal backslash, it must be escaped in the regular 1591 expression. With raw string notation, this means ``r"\\"``. Without raw string 1592 notation, one must use ``"\\\\"``, making the following lines of code 1593 functionally identical:: 1594 1595 >>> re.match(r"\\", r"\\") 1596 <re.Match object; span=(0, 1), match='\\'> 1597 >>> re.match("\\\\", r"\\") 1598 <re.Match object; span=(0, 1), match='\\'> 1599 1600 1601 Writing a Tokenizer 1602 ^^^^^^^^^^^^^^^^^^^ 1603 1604 A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_ 1605 analyzes a string to categorize groups of characters. This is a useful first 1606 step in writing a compiler or interpreter. 1607 1608 The text categories are specified with regular expressions. 
The technique is 1609 to combine those into a single master regular expression and to loop over 1610 successive matches:: 1611 1612 import collections 1613 import re 1614 1615 Token = collections.namedtuple('Token', ['type', 'value', 'line', 'column']) 1616 1617 def tokenize(code): 1618 keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'} 1619 token_specification = [ 1620 ('NUMBER', r'\d+(\.\d*)?'), # Integer or decimal number 1621 ('ASSIGN', r':='), # Assignment operator 1622 ('END', r';'), # Statement terminator 1623 ('ID', r'[A-Za-z]+'), # Identifiers 1624 ('OP', r'[+\-*/]'), # Arithmetic operators 1625 ('NEWLINE', r'\n'), # Line endings 1626 ('SKIP', r'[ \t]+'), # Skip over spaces and tabs 1627 ('MISMATCH', r'.'), # Any other character 1628 ] 1629 tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification) 1630 line_num = 1 1631 line_start = 0 1632 for mo in re.finditer(tok_regex, code): 1633 kind = mo.lastgroup 1634 value = mo.group() 1635 column = mo.start() - line_start 1636 if kind == 'NUMBER': 1637 value = float(value) if '.' 
in value else int(value) 1638 elif kind == 'ID' and value in keywords: 1639 kind = value 1640 elif kind == 'NEWLINE': 1641 line_start = mo.end() 1642 line_num += 1 1643 continue 1644 elif kind == 'SKIP': 1645 continue 1646 elif kind == 'MISMATCH': 1647 raise RuntimeError(f'{value!r} unexpected on line {line_num}') 1648 yield Token(kind, value, line_num, column) 1649 1650 statements = ''' 1651 IF quantity THEN 1652 total := total + price * quantity; 1653 tax := price * 0.05; 1654 ENDIF; 1655 ''' 1656 1657 for token in tokenize(statements): 1658 print(token) 1659 1660 The tokenizer produces the following output:: 1661 1662 Token(type='IF', value='IF', line=2, column=4) 1663 Token(type='ID', value='quantity', line=2, column=7) 1664 Token(type='THEN', value='THEN', line=2, column=16) 1665 Token(type='ID', value='total', line=3, column=8) 1666 Token(type='ASSIGN', value=':=', line=3, column=14) 1667 Token(type='ID', value='total', line=3, column=17) 1668 Token(type='OP', value='+', line=3, column=23) 1669 Token(type='ID', value='price', line=3, column=25) 1670 Token(type='OP', value='*', line=3, column=31) 1671 Token(type='ID', value='quantity', line=3, column=33) 1672 Token(type='END', value=';', line=3, column=41) 1673 Token(type='ID', value='tax', line=4, column=8) 1674 Token(type='ASSIGN', value=':=', line=4, column=12) 1675 Token(type='ID', value='price', line=4, column=15) 1676 Token(type='OP', value='*', line=4, column=21) 1677 Token(type='NUMBER', value=0.05, line=4, column=23) 1678 Token(type='END', value=';', line=4, column=27) 1679 Token(type='ENDIF', value='ENDIF', line=5, column=4) 1680 Token(type='END', value=';', line=5, column=9) 1681 1682 1683 .. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly 1684 Media, 2009. The third edition of the book no longer covers Python at all, 1685 but the first edition covered writing good regular expression patterns in 1686 great detail. 1687
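The master-pattern technique used by the tokenizer example rests on one detail:
when every alternative is wrapped in its own named group, ``Match.lastgroup``
names the alternative that matched. The following minimal sketch isolates that
detail. The three-category ``spec`` list and the ``classify`` helper are
hypothetical names chosen for illustration and are not part of the :mod:`re`
module.

```python
import re

# Hypothetical specification, much smaller than the tokenizer's.
spec = [
    ('NUMBER', r'\d+'),        # Runs of digits
    ('WORD',   r'[A-Za-z]+'),  # Runs of ASCII letters
    ('SKIP',   r'\s+'),        # Whitespace, dropped below
]

# Wrap each category in a named group and join them into one alternation.
master = '|'.join('(?P<%s>%s)' % pair for pair in spec)


def classify(text):
    """Return (category, matched text) pairs, dropping whitespace."""
    return [(mo.lastgroup, mo.group())
            for mo in re.finditer(master, text)
            if mo.lastgroup != 'SKIP']


print(classify('order 66 now'))
# [('WORD', 'order'), ('NUMBER', '66'), ('WORD', 'now')]
```

Alternatives are tried from left to right, so the order of entries in ``spec``
decides which category wins when more than one could match at a position.
Unlike the tokenizer, this sketch has no ``MISMATCH`` category, so characters
that fit no category are silently skipped by :func:`finditer`.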