:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


This module provides regular expression matching operations similar to
those found in Perl. Both patterns and strings to be searched can be
Unicode strings as well as 8-bit strings.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.

Most regular expression operations are available both as module-level
functions and as :class:`RegexObject` methods. The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contains
low-precedence operations, boundary conditions between *A* and *B*, or
numbered group references. Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here. For details of
the theory and implementation of regular expressions, consult the Friedl book
referenced below, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write RE's in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.
Regular expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


The special characters are:

``'.'``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

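
As a minimal sketch, the three qualifiers ``'*'``, ``'+'`` and ``'?'`` described
above can be compared on the same inputs. The variable names ``star``, ``plus``
and ``optional`` are illustrative only, and a trailing ``$`` is added to each
pattern (an assumption of this sketch, not part of the qualifiers themselves) so
that the whole string has to match:

```python
import re

samples = ('a', 'ab', 'abbb')

# '*' allows zero or more 'b' characters after the 'a'.
star = [bool(re.match(r'ab*$', s)) for s in samples]

# '+' requires at least one 'b', so a bare 'a' is rejected.
plus = [bool(re.match(r'ab+$', s)) for s in samples]

# '?' allows at most one 'b', so 'abbb' is rejected.
optional = [bool(re.match(r'ab?$', s)) for s in samples]

print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # [True, True, False]
```

Without the ``$`` anchor, :func:`re.match` would accept every sample for all
three patterns, because each pattern can match a prefix of the string.
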

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``<a> b <c>``, it will match the entire
   string, and not just ``<a>``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``<a>``.

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be written twice. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
     it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`LOCALE` or :const:`UNICODE` mode is in force.

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a parenthesis.

``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.)
   The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flag* argument to the
   :func:`re.compile` function.

   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
   used first in the expression string, or after one or more whitespace characters.
   If there are non-whitespace characters before the flag, the results are
   undefined.

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e.
   matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object ``m``    | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the ``repl``    | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Group
   references are not supported even if they match strings of some fixed length.
   Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length and shouldn't contain group references.
   Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
   can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
   matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.

   .. versionadded:: 2.4

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not on the list, then the resulting RE will match
the second character. For example, ``\$`` matches the character ``'$'``.

``\number``
   Matches the contents of the group of the same number.
   Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word. A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
   a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
   of the string, so the precise set of characters deemed to be alphanumeric
   depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
   For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end of a
   word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
   but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so is also subject to the settings
   of ``LOCALE`` and ``UNICODE``.

``\d``
   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
   is equivalent to the set ``[0-9]``.
   With :const:`UNICODE`, it will match
   whatever is classified as a decimal digit in the Unicode character properties
   database.

``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
   will match anything other than characters marked as digits in the Unicode
   character properties database.

``\s``
   When the :const:`UNICODE` flag is not specified, it matches any whitespace
   character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on matching of the space.
   If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
   plus whatever is classified as space in the Unicode character properties
   database.

``\S``
   When the :const:`UNICODE` flag is not specified, matches any non-whitespace
   character; this is equivalent to the set ``[^ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on non-whitespace match. If
   :const:`UNICODE` is set, then any character not marked as space in the
   Unicode character properties database is matched.


``\w``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any alphanumeric character and the underscore; this is equivalent to the set
   ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
   whatever characters are defined as alphanumeric for the current locale. If
   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
   is classified as alphanumeric in the Unicode character properties database.

``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match anything other than ``[0-9_]`` plus characters classified as
   not alphanumeric in the Unicode character properties database.

``\Z``
   Matches only at the end of the string.

If both the :const:`LOCALE` and :const:`UNICODE` flags are included for a
particular sequence, then the :const:`LOCALE` flag takes effect first, followed
by :const:`UNICODE`.

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \v      \x
   \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

Octal escapes are included in a limited form: If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.



.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications always use the compiled
form.


.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a regular expression object, which
   can be used for matching using its :func:`~RegexObject.match` and
   :func:`~RegexObject.search` methods, described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


.. data:: DEBUG

   Display debug information about the compiled expression.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too. This is not affected by the current locale.


.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
   current locale.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline).
   By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.


.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.


.. data:: U
          UNICODE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
   on the Unicode character properties database.

   .. versionadded:: 2.0


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class or when preceded by an unescaped backslash.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`MatchObject`
   instance. Return ``None`` if no position in the string matches the pattern; note
   that this is different from finding a zero-length match at some point in the
   string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`MatchObject` instance.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored. This has been fixed in later releases.)

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string.
   The same holds for
   the end of the string:

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list (e.g., if there's one capturing group
   in the separator, the 0th, the 2nd and so forth).

   Note that *split* will never split a string on an empty pattern match.
   For example:

      >>> re.split('x*', 'foo')
      ['foo']
      >>> re.split("(?m)^$", "foo\n\nbar\n")
      ['foo\n\nbar\n']

   .. versionchanged:: 2.7
      Added the optional flags argument.


.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. The *string* is scanned left-to-right, and matches are returned in
   the order found. If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group. Empty matches are included in the result unless they touch the
   beginning of another match.

   .. versionadded:: 1.5.2

   .. versionchanged:: 2.4
      Added the optional flags argument.


.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
   non-overlapping matches for the RE *pattern* in *string*. The *string* is
   scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result unless they touch the beginning of another
   match.

   .. versionadded:: 2.2

   .. versionchanged:: 2.4
      Added the optional flags argument.


.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged.
   *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example:

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single match object argument, and returns the
   replacement string. For example:

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or an RE object.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
   ``'-a-b-c-'``.

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``.
   ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 2.7
      Added the optional flags argument.


.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 2.7
      Added the optional flags argument.


.. function:: escape(string)

   Return *string* with all non-alphanumerics backslashed; this is useful if you
   want to match an arbitrary literal string that may have regular expression
   metacharacters in it.


.. function:: purge()

   Clear the regular expression cache.


.. exception:: error

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern.


.. _re-objects:

Regular Expression Objects
--------------------------

.. class:: RegexObject

   The :class:`RegexObject` class supports the following methods and attributes:

   .. method:: RegexObject.search(string[, pos[, endpos]])

      Scan through *string* looking for a location where this regular expression
      produces a match, and return a corresponding :class:`MatchObject` instance.
      Return ``None`` if no position in the string matches the pattern; note that this
      is different from finding a zero-length match at some point in the string.

      The optional second parameter *pos* gives an index in the string where the
      search is to start; it defaults to ``0``.
      This is not completely equivalent to
      slicing the string; the ``'^'`` pattern character matches at the real beginning
      of the string and at positions just after a newline, but not necessarily at the
      index where the search is to start.

      The optional parameter *endpos* limits how far the string will be searched; it
      will be as if the string is *endpos* characters long, so only the characters
      from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
      than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
      expression object, ``rx.search(string, 0, 50)`` is equivalent to
      ``rx.search(string[:50], 0)``.

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <_sre.SRE_Match object at ...>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"


   .. method:: RegexObject.match(string[, pos[, endpos]])

      If zero or more characters at the *beginning* of *string* match this regular
      expression, return a corresponding :class:`MatchObject` instance. Return
      ``None`` if the string does not match the pattern; note that this is different
      from a zero-length match.

      The optional *pos* and *endpos* parameters have the same meaning as for the
      :meth:`~RegexObject.search` method.

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <_sre.SRE_Match object at ...>

      If you want to locate a match anywhere in *string*, use
      :meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).


   .. method:: RegexObject.split(string, maxsplit=0)

      Identical to the :func:`split` function, using the compiled pattern.


   .. method:: RegexObject.findall(string[, pos[, endpos]])

      Similar to the :func:`findall` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region in the same way as for :meth:`~RegexObject.match`.


   .. method:: RegexObject.finditer(string[, pos[, endpos]])

      Similar to the :func:`finditer` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region in the same way as for :meth:`~RegexObject.match`.


   .. method:: RegexObject.sub(repl, string, count=0)

      Identical to the :func:`sub` function, using the compiled pattern.


   .. method:: RegexObject.subn(repl, string, count=0)

      Identical to the :func:`subn` function, using the compiled pattern.


   .. attribute:: RegexObject.flags

      The regex matching flags. This is a combination of the flags given to
      :func:`.compile` and any ``(?...)`` inline flags in the pattern.


   .. attribute:: RegexObject.groups

      The number of capturing groups in the pattern.


   .. attribute:: RegexObject.groupindex

      A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
      numbers. The dictionary is empty if no symbolic groups were used in the
      pattern.


   .. attribute:: RegexObject.pattern

      The pattern string from which the RE object was compiled.


.. _match-objects:

Match Objects
-------------

.. class:: MatchObject

   Match objects always have a boolean value of ``True``.
   Since :meth:`~RegexObject.match` and :meth:`~RegexObject.search` return ``None``
   when there is no match, you can test whether there was a match with a simple
   ``if`` statement::

      match = re.search(pattern, string)
      if match:
          process(match)

   Match objects support the following methods and attributes:


   .. method:: MatchObject.expand(template)

      Return the string obtained by doing backslash substitution on the template
      string *template*, as done by the :meth:`~RegexObject.sub` method. Escapes
      such as ``\n`` are converted to the appropriate characters, and numeric
      backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
      ``\g<name>``) are replaced by the contents of the corresponding group.


   .. method:: MatchObject.group([group1, ...])

      Returns one or more subgroups of the match. If there is a single argument, the
      result is a single string; if there are multiple arguments, the result is a
      tuple with one item per argument. Without arguments, *group1* defaults to zero
      (the whole match is returned). If a *groupN* argument is zero, the corresponding
      return value is the entire matching string; if it is in the inclusive range
      [1..99], it is the string matching the corresponding parenthesized group. If a
      group number is negative or larger than the number of groups defined in the
      pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
      part of the pattern that did not match, the corresponding result is ``None``.
      If a group is contained in a part of the pattern that matched multiple times,
      the last match is returned.

         >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
         >>> m.group(0)       # The entire match
         'Isaac Newton'
         >>> m.group(1)       # The first parenthesized subgroup.
         'Isaac'
         >>> m.group(2)       # The second parenthesized subgroup.
         'Newton'
         >>> m.group(1, 2)    # Multiple arguments give us a tuple.
         ('Isaac', 'Newton')

      If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
      arguments may also be strings identifying groups by their group name. If a
      string argument is not used as a group name in the pattern, an :exc:`IndexError`
      exception is raised.
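The :exc:`IndexError` behaviour described above can be sketched as follows.
This is only an illustration: the helper ``safe_group`` and the group names
``first``, ``last`` and ``middle`` are hypothetical and are not part of the
:mod:`re` module.

```python
import re

# A pattern with two named groups; "middle" is deliberately NOT defined.
m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")


def safe_group(match, name, default=None):
    """Hypothetical helper: return match.group(name), or *default* when the
    pattern defines no group with that name or number."""
    try:
        return match.group(name)
    except IndexError:
        # Raised for unknown group names and for out-of-range group numbers.
        return default


first = safe_group(m, "first")                        # group exists
missing = safe_group(m, "middle", "<no such group>")  # unknown name
too_high = safe_group(m, 3, "<no such group>")        # only 2 groups exist
```

Catching :exc:`IndexError` is distinct from checking for ``None``: a group that
exists but did not participate in the match returns ``None`` rather than
raising.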

      A moderately complicated example:

         >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
         >>> m.group('first_name')
         'Malcolm'
         >>> m.group('last_name')
         'Reynolds'

      Named groups can also be referred to by their index:

         >>> m.group(1)
         'Malcolm'
         >>> m.group(2)
         'Reynolds'

      If a group matches multiple times, only the last match is accessible:

         >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
         >>> m.group(1)                        # Returns only the last match.
         'c3'


   .. method:: MatchObject.groups([default])

      Return a tuple containing all the subgroups of the match, from 1 up to however
      many groups are in the pattern. The *default* argument is used for groups that
      did not participate in the match; it defaults to ``None``. (Incompatibility
      note: in the original Python 1.5 release, if the tuple was one element long, a
      string would be returned instead. In later versions (from 1.5.1 on), a
      singleton tuple is returned in such cases.)

      For example:

         >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
         >>> m.groups()
         ('24', '1632')

      If we make the decimal place and everything after it optional, not all groups
      might participate in the match. These groups will default to ``None`` unless
      the *default* argument is given:

         >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
         >>> m.groups()      # Second group defaults to None.
         ('24', None)
         >>> m.groups('0')   # Now, the second group defaults to '0'.
         ('24', '0')


   .. method:: MatchObject.groupdict([default])

      Return a dictionary containing all the *named* subgroups of the match, keyed by
      the subgroup name. The *default* argument is used for groups that did not
      participate in the match; it defaults to ``None``.
      For example:

         >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
         >>> m.groupdict()
         {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


   .. method:: MatchObject.start([group])
               MatchObject.end([group])

      Return the indices of the start and end of the substring matched by *group*;
      *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
      *group* exists but did not contribute to the match. For a match object *m*, and
      a group *g* that did contribute to the match, the substring matched by group *g*
      (equivalent to ``m.group(g)``) is ::

         m.string[m.start(g):m.end(g)]

      Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
      null string. For example, after ``m = re.search('b(c?)', 'cba')``,
      ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
      2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

      An example that will remove *remove_this* from email addresses:

         >>> email = "tony@tiremove_thisger.net"
         >>> m = re.search("remove_this", email)
         >>> email[:m.start()] + email[m.end():]
         'tony@tiger.net'


   .. method:: MatchObject.span([group])

      For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
      m.end(group))``. Note that if *group* did not contribute to the match, this is
      ``(-1, -1)``. *group* defaults to zero, the entire match.


   .. attribute:: MatchObject.pos

      The value of *pos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
      index into the string at which the RE engine started looking for a match.


   .. attribute:: MatchObject.endpos

      The value of *endpos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`.
      This is the
      index into the string beyond which the RE engine will not go.


   .. attribute:: MatchObject.lastindex

      The integer index of the last matched capturing group, or ``None`` if no group
      was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
      ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
      the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
      string.


   .. attribute:: MatchObject.lastgroup

      The name of the last matched capturing group, or ``None`` if the group didn't
      have a name, or if no group was matched at all.


   .. attribute:: MatchObject.re

      The regular expression object whose :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search` method produced this :class:`MatchObject`
      instance.


   .. attribute:: MatchObject.string

      The string passed to :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search`.


Examples
--------


Checking For a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully:

.. testcode::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following:

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such:

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
manner:

.. doctest::

   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'


Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings


.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")   # No match
   >>> re.search("c", "abcdef")  # Match
   <_sre.SRE_Match object at ...>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <_sre.SRE_Match object at ...>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line.

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)    # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <_sre.SRE_Match object at ...>


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern. The
method is invaluable for converting textual data into data structures that can be
easily read and modified by Python, as demonstrated in the following example that
creates a phonebook.

First, here is the input. Normally it would come from a file; here we use
triple-quoted string syntax:

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines.
Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]


Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.
This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'


Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does. For example, a writer who wanted to find all of
the adverbs in some text might use :func:`findall` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides instances of
:class:`MatchObject` instead of strings. Continuing with the previous example,
a writer who wanted to find all of the adverbs *and their positions* in some
text would use :func:`finditer` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
   07-16: carefully
   40-47: quickly


Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it. For example, the two following lines of code are
functionally identical:

   >>> re.match(r"\W(.)\1\W", " ff ")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <_sre.SRE_Match object at ...>

When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical:

   >>> re.match(r"\\", r"\\")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\\\", r"\\")
   <_sre.SRE_Match object at ...>
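
The equivalence of the two spellings can also be checked directly, without
running a match at all. The following is a small sketch; the variable names
are illustrative only.

```python
import re

raw_pattern = r"\\"      # raw literal: two characters, both backslashes
plain_pattern = "\\\\"   # ordinary literal: each backslash written twice

# Both literals produce exactly the same two-character string, so the
# regular expression engine sees the same pattern either way.
same_pattern = (raw_pattern == plain_pattern)
pattern_length = len(raw_pattern)

# As a regular expression, that pattern matches ONE literal backslash.
# The subject string "\\" below is a single backslash character.
matched = re.match(raw_pattern, "\\").group(0)
matched_length = len(matched)
```

Here ``same_pattern`` is true and ``pattern_length`` is 2, while
``matched_length`` is 1: the pattern contains two backslash characters but
consumes only one from the string being searched.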