:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re.py`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings as well as
8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.

It is important to note that most regular expression operations are available as
module-level functions and methods on
:ref:`compiled regular expressions <re-objects>`.
The functions are shortcuts
that don't require you to compile a regex object first, but miss some
fine-tuning parameters.

.. seealso::

   The third-party `regex <https://pypi.python.org/pypi/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contain low precedence
operations; boundary conditions between *A* and *B*; or have numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
below, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.
(In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using a ``\number`` notation such as ``'\x00'``.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


The special characters are:

``'.'``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible.
   ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``<a> b <c>``, it will match the entire
   string, and not just ``<a>``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``<a>``.

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier.
   For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
     it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a parenthesis.

``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is.
   Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flag* argument to the
   :func:`re.compile` function. Flags should be used first in the
   expression string.

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?imsx-imsx:...)``
   (Zero or more letters from the set ``'i'``, ``'m'``, ``'s'``, ``'x'``,
   optionally followed by ``'-'`` followed by one or more letters from the
   same set.) The letters set or remove the corresponding flags:
   :const:`re.I` (ignore case), :const:`re.M` (multi-line), :const:`re.S`
   (dot matches all), and :const:`re.X` (verbose), for the part of the
   expression. (The flags are described in :ref:`contents-of-module-re`.)

   .. versionadded:: 3.6

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression.
   A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object ``m``    | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the ``repl``    | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`.
   ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.


The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character.
For example, ``\$`` matches the
character ``'$'``.

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of Unicode alphanumeric or underscore
   characters, so the end of a word is indicated by whitespace or a
   non-alphanumeric, non-underscore Unicode character. Note that formally,
   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
   (or vice versa), or between ``\w`` and the beginning/end of the string.
   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

   By default Unicode alphanumerics are the ones used, but this can be changed
   by using the :const:`ASCII` flag. Inside a character range, ``\b``
   represents the backspace character, for compatibility with Python's string
   literals.

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end
   of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so word characters are
   Unicode alphanumerics or the underscore, although this can be changed
   by using the :const:`ASCII` flag.
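
As a small illustration of the word-boundary sequences described above, the
following sketch checks ``\b`` and ``\B`` against a few sample strings. The
sample strings and variable names are made up for this example; only
:func:`re.findall` and :func:`re.search` come from the module itself.

```python
import re

# \b only matches at a word boundary, so the 'foo' inside 'foobar' or 'foo3'
# is skipped, while 'foo' and '(foo)' each contain a whole-word match.
whole_words = re.findall(r'\bfoo\b', 'foo foobar (foo) foo3')

# \B requires that the position right after 'py' is *not* a word boundary.
candidates = ['python', 'py3', 'py', 'py.', 'py!']  # hypothetical inputs
followed_by_word_char = [s for s in candidates if re.search(r'py\B', s)]

print(whole_words)            # ['foo', 'foo']
print(followed_by_word_char)  # ['python', 'py3']
```

Note that both patterns are written as raw strings; in an ordinary string
literal ``'\b'`` would be a backspace character rather than a word boundary.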

``\d``
   For Unicode (str) patterns:
      Matches any Unicode decimal digit (that is, any character in
      Unicode character category [Nd]). This includes ``[0-9]``, and
      also many other digit characters. If the :const:`ASCII` flag is
      used only ``[0-9]`` is matched (but the flag affects the entire
      regular expression, so in such cases using an explicit ``[0-9]``
      may be a better choice).
   For 8-bit (bytes) patterns:
      Matches any decimal digit; this is equivalent to ``[0-9]``.

``\D``
   Matches any character which is not a Unicode decimal digit. This is
   the opposite of ``\d``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^0-9]`` (but the flag affects the entire
   regular expression, so in such cases using an explicit ``[^0-9]`` may
   be a better choice).

``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
      languages). If the :const:`ASCII` flag is used, only
      ``[ \t\n\r\f\v]`` is matched (but the flag affects the entire
      regular expression, so in such cases using an explicit
      ``[ \t\n\r\f\v]`` may be a better choice).

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.

``\S``
   Matches any character which is not a Unicode whitespace character. This is
   the opposite of ``\s``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^ \t\n\r\f\v]`` (but the flag affects the entire
   regular expression, so in such cases using an explicit ``[^ \t\n\r\f\v]`` may
   be a better choice).
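
The difference between Unicode and ASCII-only matching for ``\d`` and ``\s``
can be sketched as follows. The sample text (Arabic-Indic digits and a
no-break space) is an assumption chosen for illustration; any other
characters from the same Unicode categories should behave the same way.

```python
import re

# Hypothetical sample data: ASCII digits followed by the Arabic-Indic digits
# U+0661..U+0663, and two words separated by a no-break space (U+00A0).
text = '42 \u0661\u0662\u0663'
spaced = 'spam\u00a0eggs'

# Default (Unicode) matching for str patterns: every [Nd] character is a digit.
unicode_digits = re.findall(r'\d+', text)
# With re.ASCII, \d is restricted to [0-9].
ascii_digits = re.findall(r'\d+', text, re.ASCII)

# The no-break space counts as whitespace only under Unicode matching.
unicode_split = re.split(r'\s', spaced)
ascii_split = re.split(r'\s', spaced, flags=re.ASCII)

print(unicode_digits, ascii_digits)
print(unicode_split, ascii_split)
```

Because the flag applies to the whole expression, an explicit class such as
``[0-9]`` remains the more targeted choice when only one part of a pattern
should be ASCII-only.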

``\w``
   For Unicode (str) patterns:
      Matches Unicode word characters; this includes most characters
      that can be part of a word in any language, as well as numbers and
      the underscore. If the :const:`ASCII` flag is used, only
      ``[a-zA-Z0-9_]`` is matched (but the flag affects the entire
      regular expression, so in such cases using an explicit
      ``[a-zA-Z0-9_]`` may be a better choice).
   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
      this is equivalent to ``[a-zA-Z0-9_]``.

``\W``
   Matches any character which is not a Unicode word character. This is
   the opposite of ``\w``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^a-zA-Z0-9_]`` (but the flag affects the
   entire regular expression, so in such cases using an explicit
   ``[^a-zA-Z0-9_]`` may be a better choice).

``\Z``
   Matches only at the end of the string.

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \u      \U
   \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
patterns. In bytes patterns they are not treated specially.

Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

.. versionchanged:: 3.3
   The ``'\u'`` and ``'\U'`` escape sequences have been added.

.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.


.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.



.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications use the compiled
form.

.. versionchanged:: 3.6
   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
   :class:`enum.IntFlag`.

.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a regular expression object, which
   can be used for matching using its :func:`~regex.match` and
   :func:`~regex.search` methods, described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.compile` and the module-level matching functions are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


.. data:: A
          ASCII

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching. This is only
   meaningful for Unicode patterns, and is ignored for byte patterns.

   Note that for backward compatibility, the :const:`re.U` flag still
   exists (as well as its synonym :const:`re.UNICODE` and its embedded
   counterpart ``(?u)``), but these are redundant in Python 3 since
   matches are Unicode by default for strings (and Unicode matching
   isn't allowed for bytes).


.. data:: DEBUG

   Display debug information about compiled expression.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too. This is not affected by the current locale
   and works for Unicode characters as expected.


.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
   current locale. The use of this flag is discouraged as the locale mechanism
   is very unreliable, and it only handles one "culture" at a time anyway;
   you should use Unicode matching instead, which is the default in Python 3
   for Unicode (str) patterns. This flag can be used only with bytes patterns.

   .. versionchanged:: 3.6
      :const:`re.LOCALE` can be used only with bytes patterns and is
      not compatible with :const:`re.ASCII`.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.
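
The effect of :const:`MULTILINE` on ``'^'`` and ``'$'``, and the way flags are
combined with the ``|`` operator, can be sketched as below. The three-line
sample text and the variable names are assumptions made for this example.

```python
import re

text = 'one\nTwo\nthree'  # hypothetical three-line input

# Default mode: '^' and '$' only anchor at the start and end of the whole
# string, so no single "word" spans the entire input.
default_mode = re.findall(r'^\w+$', text)

# MULTILINE: '^' and '$' also anchor at the start and end of each line.
multiline = re.findall(r'^\w+$', text, re.MULTILINE)

# Flags are combined with bitwise OR, here adding case-insensitive matching.
combined = re.findall(r'^t\w+$', text, re.MULTILINE | re.IGNORECASE)

print(default_mode)  # []
print(multiline)     # ['one', 'Two', 'three']
print(combined)      # ['Two', 'three']
```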


.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class or when preceded by an unescaped backslash.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")




.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if the string does not match the pattern;
   note that this is different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.
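
The contrast between :func:`match` and :func:`search`, including the
:const:`MULTILINE` caveat noted above, might look like this in practice. The
two-line sample string and the variable names are hypothetical.

```python
import re

text = 'first line\nsecond line'  # hypothetical two-line input

# match() only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r'first', text)
not_at_start = re.match(r'second', text)

# Even with MULTILINE, match() does not try the beginning of later lines ...
multiline_match = re.match(r'^second', text, re.MULTILINE)
# ... whereas search() scans forward and finds the line start after '\n'.
multiline_search = re.search(r'^second', text, re.MULTILINE)

print(at_start is not None)      # True
print(not_at_start)              # None
print(multiline_match)           # None
print(multiline_search.start())  # 11
```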

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


.. function:: fullmatch(pattern, string, flags=0)

   If the whole *string* matches the regular expression *pattern*, return a
   corresponding :ref:`match object <match-objects>`. Return ``None`` if the
   string does not match the pattern; note that this is different from a
   zero-length match.

   .. versionadded:: 3.4


.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string. The same holds for
   the end of the string:

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list.

   .. note::

      :func:`split` doesn't currently split a string on an empty pattern match.
      For example:

         >>> re.split('x*', 'axbc')
         ['a', 'bc']

      Even though ``'x*'`` also matches 0 'x' before 'a', between 'b' and 'c',
      and after 'c', currently these matches are ignored. The correct behavior
      (i.e.
      splitting on empty matches too and returning ``['', 'a', 'b', 'c', '']``)
      will be implemented in future versions of Python, but since this
      is a backward incompatible change, a :exc:`FutureWarning` will be raised
      in the meanwhile.

      Patterns that can only match empty strings currently never split the
      string. Since this doesn't match the expected behavior, a
      :exc:`ValueError` will be raised starting from Python 3.5::

         >>> re.split("^$", "foo\n\nbar\n", flags=re.M)
         Traceback (most recent call last):
           File "<stdin>", line 1, in <module>
           ...
         ValueError: split() requires a non-empty pattern match.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Splitting on a pattern that could match an empty string now raises
      a warning. Patterns that can only match empty strings are now rejected.

.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. The *string* is scanned left-to-right, and matches are returned in
   the order found. If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group. Empty matches are included in the result unless they touch the
   beginning of another match.


.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
   all non-overlapping matches for the RE *pattern* in *string*. The *string*
   is scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result unless they touch the beginning of another
   match.


.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*.
   If the pattern isn't found, *string* is returned unchanged. *repl* can be
   a string or a function; if it is a string, any backslash escapes in it are
   processed. That is, ``\n`` is converted to a single newline character,
   ``\r`` is converted to a carriage return, and so forth. Unknown escapes
   such as ``\&`` are left alone. Backreferences, such as ``\6``, are replaced
   with the substring matched by group 6 in the pattern.
   For example:

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single match object argument, and returns the
   replacement string. For example:

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or an RE object.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
   ``'-a-b-c-'``.

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``.
   ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      are now errors.

   .. deprecated-removed:: 3.5 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter now raise
      a deprecation warning and will be forbidden in Python 3.7.


.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.


.. function:: escape(string)

   Escape all the characters in *string* except ASCII letters, numbers and ``'_'``.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it.

   .. versionchanged:: 3.3
      The ``'_'`` character is no longer escaped.


.. function:: purge()

   Clear the regular expression cache.


.. exception:: error(msg, pattern=None, pos=None)

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern. The error instance has
   the following additional attributes:

   .. attribute:: msg

      The unformatted error message.

   .. attribute:: pattern

      The regular expression pattern.

   .. attribute:: pos

      The index in *pattern* where compilation failed.

   .. attribute:: lineno

      The line corresponding to *pos*.

   .. attribute:: colno

      The column corresponding to *pos*.

   .. versionchanged:: 3.5
      Added additional attributes.

.. _re-objects:

Regular Expression Objects
--------------------------

Compiled regular expression objects support the following methods and
attributes:

.. method:: regex.search(string[, pos[, endpos]])

   Scan through *string* looking for the first location where this regular
   expression produces a match, and return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.

   The optional second parameter *pos* gives an index in the string where the
   search is to start; it defaults to ``0``. This is not completely equivalent to
   slicing the string; the ``'^'`` pattern character matches at the real beginning
   of the string and at positions just after a newline, but not necessarily at the
   index where the search is to start.

   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
   expression object, ``rx.search(string, 0, 50)`` is equivalent to
   ``rx.search(string[:50], 0)``.

   >>> pattern = re.compile("d")
   >>> pattern.search("dog")     # Match at index 0
   <_sre.SRE_Match object; span=(0, 1), match='d'>
   >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"


.. method:: regex.match(string[, pos[, endpos]])

   If zero or more characters at the *beginning* of *string* match this regular
   expression, return a corresponding :ref:`match object <match-objects>`.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~regex.search` method.

   >>> pattern = re.compile("o")
   >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
   >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
   <_sre.SRE_Match object; span=(1, 2), match='o'>

   If you want to locate a match anywhere in *string*, use
   :meth:`~regex.search` instead (see also :ref:`search-vs-match`).


.. method:: regex.fullmatch(string[, pos[, endpos]])

   If the whole *string* matches this regular expression, return a corresponding
   :ref:`match object <match-objects>`. Return ``None`` if the string does not
   match the pattern; note that this is different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~regex.search` method.

   >>> pattern = re.compile("o[gh]")
   >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
   >>> pattern.fullmatch("ogre")     # No match as not the full string matches.
   >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
   <_sre.SRE_Match object; span=(1, 3), match='og'>

   .. versionadded:: 3.4


.. method:: regex.split(string, maxsplit=0)

   Identical to the :func:`split` function, using the compiled pattern.


.. method:: regex.findall(string[, pos[, endpos]])

   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region like for :meth:`match`.
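
As a minimal sketch of how *pos* and *endpos* bound the scanned region, the
snippet below calls ``findall`` on a compiled pattern with and without limits.
The sample text and the variable names are illustrative assumptions only. The
last two calls show that passing *pos* is not the same as slicing the string,
because ``'^'`` still refers to the real beginning of the string.

```python
import re

# Illustrative data; none of these names come from the module itself.
word = re.compile(r"[a-z]+")
text = "alpha beta gamma delta"

# No limits: the whole string is scanned.
everything = word.findall(text)
# Start scanning at index 6, the "b" of "beta".
from_beta = word.findall(text, 6)
# Scan only indices 6 to 15, that is "beta gamma".
window = word.findall(text, 6, 16)

# '^' matches at the real start of the string, not at pos.
anchored = re.compile(r"^b")
with_pos = anchored.findall(text, 6)     # nothing: index 6 is not the start
with_slice = anchored.findall(text[6:])  # the slice really starts with "b"

print(everything)
print(from_beta)
print(window)
print(with_pos, with_slice)
```

Slicing creates a new string whose first character is at index 0, which is why
the anchored pattern only succeeds on the sliced input.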


.. method:: regex.finditer(string[, pos[, endpos]])

   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region like for :meth:`match`.


.. method:: regex.sub(repl, string, count=0)

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: regex.subn(repl, string, count=0)

   Identical to the :func:`subn` function, using the compiled pattern.


.. attribute:: regex.flags

   The regex matching flags. This is a combination of the flags given to
   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
   flags such as :data:`UNICODE` if the pattern is a Unicode string.


.. attribute:: regex.groups

   The number of capturing groups in the pattern.


.. attribute:: regex.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers. The dictionary is empty if no symbolic groups were used in the
   pattern.


.. attribute:: regex.pattern

   The pattern string from which the RE object was compiled.


.. _match-objects:

Match Objects
-------------

Match objects always have a boolean value of ``True``.
Since :meth:`~regex.match` and :meth:`~regex.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

   match = re.search(pattern, string)
   if match:
       process(match)

Match objects support the following methods and attributes:


.. method:: match.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`~regex.sub` method.
   Escapes such as ``\n`` are converted to the appropriate characters,
   and numeric backreferences (``\1``, ``\2``) and named backreferences
   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
   corresponding group.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

.. method:: match.group([group1, ...])

   Returns one or more subgroups of the match. If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group. If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned.

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name. If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example:

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

   Named groups can also be referred to by their index:

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

   If a group matches multiple times, only the last match is accessible:

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


.. method:: match.__getitem__(g)

   This is identical to ``m.group(g)``. This allows easier access to
   an individual group from a match:

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m[0]       # The entire match
      'Isaac Newton'
      >>> m[1]       # The first parenthesized subgroup.
      'Isaac'
      >>> m[2]       # The second parenthesized subgroup.
      'Newton'

   .. versionadded:: 3.6


.. method:: match.groups(default=None)

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern. The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``.

   For example:

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

   If we make the decimal place and everything after it optional, not all groups
   might participate in the match. These groups will default to ``None`` unless
   the *default* argument is given:

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


.. method:: match.groupdict(default=None)

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name. The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``. For example:

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


.. method:: match.start([group])
            match.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match. For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string. For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

   An example that will remove *remove_this* from email addresses:

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


.. method:: match.span([group])

   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   that if *group* did not contribute to the match, this is ``(-1, -1)``.
   *group* defaults to zero, the entire match.


.. attribute:: match.pos

   The value of *pos* which was passed to the :meth:`~regex.search` or
   :meth:`~regex.match` method of a :ref:`regex object <re-objects>`.
   This is
   the index into the string at which the RE engine started looking for a match.


.. attribute:: match.endpos

   The value of *endpos* which was passed to the :meth:`~regex.search` or
   :meth:`~regex.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string beyond which the RE engine will not go.


.. attribute:: match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


.. attribute:: match.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.


.. attribute:: match.re

   The regular expression object whose :meth:`~regex.match` or
   :meth:`~regex.search` method produced this match instance.


.. attribute:: match.string

   The string passed to :meth:`~regex.match` or :meth:`~regex.search`.


.. _re-examples:

Regular Expression Examples
---------------------------


Checking for a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully:

.. testcode::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following:

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such:

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~match.group` method of the match object in the following manner:

.. doctest::

   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'


Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings


.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <_sre.SRE_Match object; span=(2, 3), match='c'>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <_sre.SRE_Match object; span=(0, 1), match='a'>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line.

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <_sre.SRE_Match object; span=(4, 5), match='X'>


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern. The
method is invaluable for converting textual data into data structures that can be
easily read and modified by Python as demonstrated in the following example that
creates a phonebook.

First, here is the input. Normally it may come from a file; here we are using
triple-quoted string syntax:

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]


Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.
This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'


Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does. For example, a writer who wanted to find all of
the adverbs in some text might use :func:`findall` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings. Continuing with the previous example, a
writer who wanted to find all of the adverbs *and their positions* in some
text would use :func:`finditer` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly


Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it. For example, the two following lines of code are
functionally identical:

   >>> re.match(r"\W(.)\1\W", " ff ")
   <_sre.SRE_Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <_sre.SRE_Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical:

   >>> re.match(r"\\", r"\\")
   <_sre.SRE_Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <_sre.SRE_Match object; span=(0, 1), match='\\'>


Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters. This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions.
The technique is
to combine those into a single master regular expression and to loop over
successive matches::

    import collections
    import re

    Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',  r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',  r':='),           # Assignment operator
            ('END',     r';'),            # Statement terminator
            ('ID',      r'[A-Za-z]+'),    # Identifiers
            ('OP',      r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE', r'\n'),           # Line endings
            ('SKIP',    r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH',r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group(kind)
            if kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
            elif kind == 'SKIP':
                pass
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            else:
                if kind == 'ID' and value in keywords:
                    kind = value
                column = mo.start() - line_start
                yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)

The tokenizer produces the following output::

    Token(typ='IF', value='IF', line=2, column=4)
    Token(typ='ID', value='quantity', line=2, column=7)
    Token(typ='THEN', value='THEN', line=2, column=16)
    Token(typ='ID', value='total', line=3, column=8)
    Token(typ='ASSIGN', value=':=', line=3, column=14)
    Token(typ='ID', value='total', line=3, column=17)
    Token(typ='OP', value='+', line=3, column=23)
    Token(typ='ID', value='price', line=3, column=25)
    Token(typ='OP', value='*', line=3, column=31)
    Token(typ='ID', value='quantity', line=3, column=33)
    Token(typ='END', value=';', line=3, column=41)
    Token(typ='ID', value='tax', line=4, column=8)
    Token(typ='ASSIGN', value=':=', line=4, column=12)
    Token(typ='ID', value='price', line=4, column=15)
    Token(typ='OP', value='*', line=4, column=21)
    Token(typ='NUMBER', value='0.05', line=4, column=23)
    Token(typ='END', value=';', line=4, column=27)
    Token(typ='ENDIF', value='ENDIF', line=5, column=4)
    Token(typ='END', value=';', line=5, column=9)
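
The core of the technique, one named group per category joined with ``|`` and
``lastgroup`` used to tell which alternative matched, can be seen in isolation
in the reduced sketch below. The three-rule ``spec`` list and the sample input
are hypothetical and far smaller than the specification above; this is an
illustration of the idea, not a replacement for the tokenizer.

```python
import re

# Hypothetical three-rule specification, for illustration only.
spec = [
    ("NUMBER", r"\d+"),        # Unsigned integers
    ("WORD",   r"[A-Za-z]+"),  # Runs of ASCII letters
    ("SKIP",   r"\s+"),        # Whitespace, dropped below
]

# One alternative per rule, each wrapped in a named group.
master = "|".join("(?P<%s>%s)" % pair for pair in spec)

# lastgroup is the name of the alternative that produced each match.
kinds = [
    (mo.lastgroup, mo.group())
    for mo in re.finditer(master, "abc 42")
    if mo.lastgroup != "SKIP"
]
print(kinds)
```

Because the alternatives are tried left to right, the order of the rules in
the list decides which category wins when more than one could match at the
same position.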