      1 
:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
      9 
     10 
     11 This module provides regular expression matching operations similar to
     12 those found in Perl. Both patterns and strings to be searched can be
     13 Unicode strings as well as 8-bit strings.
     14 
     15 Regular expressions use the backslash character (``'\'``) to indicate
     16 special forms or to allow special characters to be used without invoking
     17 their special meaning.  This collides with Python's usage of the same
     18 character for the same purpose in string literals; for example, to match
     19 a literal backslash, one might have to write ``'\\\\'`` as the pattern
     20 string, because the regular expression must be ``\\``, and each
     21 backslash must be expressed as ``\\`` inside a regular Python string
     22 literal.
     23 
     24 The solution is to use Python's raw string notation for regular expression
     25 patterns; backslashes are not handled in any special way in a string literal
     26 prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
     27 ``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
     28 newline.  Usually patterns will be expressed in Python code using this raw
     29 string notation.
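
The difference between the two notations can be checked directly. The following is a minimal sketch; the variable names and the sample path are illustrative and do not come from the module itself:

```python
import re

# A raw string keeps the backslash, so it holds two characters, not one.
raw_len = len(r"\n")      # backslash followed by 'n'
plain_len = len("\n")     # a single newline character

# Both literals below produce the same two-character pattern string,
# which the regular expression engine reads as "one literal backslash".
plain_pattern = "\\\\"
raw_pattern = r"\\"

sample = "C:\\temp"       # illustrative input: the 7-character string C:\temp
match = re.search(raw_pattern, sample)
backslash_index = match.start()   # position of the backslash in the sample
```

The raw form is easier to read because every backslash in the source is a backslash in the pattern.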
     30 
Most regular expression operations are available both as module-level
functions and as :class:`RegexObject` methods.  The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.
     35 
     36 
     37 .. _re-syntax:
     38 
     39 Regular Expression Syntax
     40 -------------------------
     41 
     42 A regular expression (or RE) specifies a set of strings that matches it; the
     43 functions in this module let you check if a particular string matches a given
     44 regular expression (or if a given regular expression matches a particular
     45 string, which comes down to the same thing).
     46 
Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contains low-precedence
operations, there are boundary conditions between *A* and *B*, or either
contains numbered group references.  Thus, complex expressions can easily be
constructed from simpler primitive expressions like the ones described here.
For details of the theory and implementation of regular expressions, consult
the Friedl book referenced below, or almost any textbook about compiler
construction.
     56 
     57 A brief explanation of the format of regular expressions follows.  For further
     58 information and a gentler presentation, consult the :ref:`regex-howto`.
     59 
Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves.  You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)
     66 
     67 Some characters, like ``'|'`` or ``'('``, are special. Special
     68 characters either stand for classes of ordinary characters, or affect
     69 how the regular expressions around them are interpreted. Regular
     70 expression pattern strings may not contain null bytes, but can specify
     71 the null byte using the ``\number`` notation, e.g., ``'\x00'``.
     72 
Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
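
As a small sketch of the rule above: compiling a directly nested qualifier fails, while the parenthesised form is accepted. The trailing ``$`` is an addition made here so that the whole test string must be consumed, and the variable names are illustrative:

```python
import re

# Directly nesting qualifiers is rejected when the pattern is compiled.
try:
    re.compile(r"a{6}*")
    nested_ok = True
except re.error:
    nested_ok = False

# Wrapping the inner repetition in a non-capturing group makes it legal.
# The trailing $ (added for this example) forces the whole string to match.
multiples_of_six = re.compile(r"(?:a{6})*$")

twelve = multiples_of_six.match("a" * 12)   # a match object
seven = multiples_of_six.match("a" * 7)     # None: 7 is not a multiple of 6
```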
     78 
     79 
     80 The special characters are:
     81 
     82 ``'.'``
     83    (Dot.)  In the default mode, this matches any character except a newline.  If
     84    the :const:`DOTALL` flag has been specified, this matches any character
     85    including a newline.
     86 
     87 ``'^'``
     88    (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
     89    matches immediately after each newline.
     90 
     91 ``'$'``
     92    Matches the end of the string or just before the newline at the end of the
     93    string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
     94    matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
     95    only 'foo'.  More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
     96    matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
     97    a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
     98    the newline, and one at the end of the string.
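
The cases described for ``'$'`` can be reproduced with a short sketch (the variable names are illustrative):

```python
import re

text = "foo1\nfoo2\n"

# Default mode: $ matches at the very end, or just before a final newline.
default_hit = re.search(r"foo.$", text).group(0)                  # 'foo2'

# MULTILINE mode: $ also matches before every newline.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group(0)  # 'foo1'

# A bare $ finds two empty matches in 'foo\n':
# one just before the newline and one at the end of the string.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]  # [3, 4]
```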
     99 
    100 ``'*'``
    101    Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
    102    many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
    103    by any number of 'b's.
    104 
    105 ``'+'``
    106    Causes the resulting RE to match 1 or more repetitions of the preceding RE.
    107    ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
    108    match just 'a'.
    109 
    110 ``'?'``
    111    Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
    112    ``ab?`` will match either 'a' or 'ab'.
    113 
    114 ``*?``, ``+?``, ``??``
    115    The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
    116    as much text as possible.  Sometimes this behaviour isn't desired; if the RE
    117    ``<.*>`` is matched against ``<a> b <c>``, it will match the entire
    118    string, and not just ``<a>``.  Adding ``?`` after the qualifier makes it
    119    perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
    120    characters as possible will be matched.  Using the RE ``<.*?>`` will match
    121    only ``<a>``.
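
A minimal sketch of the greedy and non-greedy behaviour described above, using the same sample string:

```python
import re

text = "<a> b <c>"

# Greedy: .* runs to the last '>' in the string.
greedy = re.match(r"<.*>", text).group(0)     # '<a> b <c>'

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
minimal = re.match(r"<.*?>", text).group(0)   # '<a>'
```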
    122 
    123 ``{m}``
    124    Specifies that exactly *m* copies of the previous RE should be matched; fewer
    125    matches cause the entire RE not to match.  For example, ``a{6}`` will match
    126    exactly six ``'a'`` characters, but not five.
    127 
    128 ``{m,n}``
    129    Causes the resulting RE to match from *m* to *n* repetitions of the preceding
    130    RE, attempting to match as many repetitions as possible.  For example,
    131    ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
    132    lower bound of zero,  and omitting *n* specifies an infinite upper bound.  As an
    133    example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
    134    followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
    135    modifier would be confused with the previously described form.
    136 
    137 ``{m,n}?``
    138    Causes the resulting RE to match from *m* to *n* repetitions of the preceding
    139    RE, attempting to match as *few* repetitions as possible.  This is the
    140    non-greedy version of the previous qualifier.  For example, on the
    141    6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
    142    while ``a{3,5}?`` will only match 3 characters.
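
The bounded qualifiers can be compared side by side. This is an illustrative sketch; the variable names are not part of the module:

```python
import re

six_a = "aaaaaa"

# Greedy form takes as many repetitions as the upper bound allows.
greedy = re.match(r"a{3,5}", six_a).group(0)     # 'aaaaa'

# Non-greedy form stops at the lower bound when that is enough.
minimal = re.match(r"a{3,5}?", six_a).group(0)   # 'aaa'

# An omitted upper bound means "no limit"; the lower bound still applies.
long_run = re.match(r"a{4,}b", "a" * 1000 + "b")  # a match object
too_short = re.match(r"a{4,}b", "aaab")           # None: only three 'a's
```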
    143 
    144 ``'\'``
    145    Either escapes special characters (permitting you to match characters like
    146    ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
    147    sequences are discussed below.
    148 
    149    If you're not using a raw string to express the pattern, remember that Python
    150    also uses the backslash as an escape sequence in string literals; if the escape
    151    sequence isn't recognized by Python's parser, the backslash and subsequent
    152    character are included in the resulting string.  However, if Python would
    153    recognize the resulting sequence, the backslash should be repeated twice.  This
    154    is complicated and hard to understand, so it's highly recommended that you use
    155    raw strings for all but the simplest expressions.
    156 
    157 ``[]``
    158    Used to indicate a set of characters.  In a set:
    159 
    160    * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
    161      ``'m'``, or ``'k'``.
    162 
   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``; for example, ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
     it will match a literal ``'-'``.
    169 
    170    * Special characters lose their special meaning inside sets.  For example,
    171      ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
    172      ``'*'``, or ``')'``.
    173 
   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`LOCALE` or :const:`UNICODE` mode is in force.
    177 
    178    * Characters that are not within a range can be matched by :dfn:`complementing`
    179      the set.  If the first character of the set is ``'^'``, all the characters
    180      that are *not* in the set will be matched.  For example, ``[^5]`` will match
    181      any character except ``'5'``, and ``[^^]`` will match any character except
    182      ``'^'``.  ``^`` has no special meaning if it's not the first character in
    183      the set.
    184 
   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set.  For example, ``[()[\]{}]`` and
     ``[]()[{}]`` will both match a parenthesis.
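
The set rules listed above can be exercised with :func:`findall`. The sample strings below are made up for illustration:

```python
import re

# An escaped '-' is a literal hyphen, not a range operator.
hyphens = re.findall(r"[a\-z]", "a-m-z")        # ['a', '-', '-', 'z']

# Special characters are ordinary inside a set.
operators = re.findall(r"[(+*)]", "(1+2)*3")    # ['(', '+', ')', '*']

# A leading '^' complements the set.
not_five = re.findall(r"[^5]", "1525")          # ['1', '2']

# A ']' placed first in the set is taken literally.
brackets = re.findall(r"[]()[{}]", "f(x)[0]")   # ['(', ')', '[', ']']
```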
    188 
    189 ``'|'``
    190    ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
    191    will match either A or B.  An arbitrary number of REs can be separated by the
    192    ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
    193    the target string is scanned, REs separated by ``'|'`` are tried from left to
    194    right. When one pattern completely matches, that branch is accepted. This means
    195    that once ``A`` matches, ``B`` will not be tested further, even if it would
    196    produce a longer overall match.  In other words, the ``'|'`` operator is never
    197    greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
    198    character class, as in ``[|]``.
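
A short sketch of the left-to-right behaviour of ``'|'`` (the patterns and names are illustrative):

```python
import re

# 'a' succeeds first, so the longer alternative 'ab' is never tried.
first_wins = re.match(r"a|ab", "ab").group(0)     # 'a'

# Reordering the alternatives changes the result.
longer_first = re.match(r"ab|a", "ab").group(0)   # 'ab'

# Two ways to match a literal '|'.
escaped = re.findall(r"\|", "a|b")     # ['|']
in_class = re.findall(r"[|]", "a|b")   # ['|']
```

When alternatives share a prefix, listing the longer one first is one way to get the longer match.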
    199 
    200 ``(...)``
    201    Matches whatever regular expression is inside the parentheses, and indicates the
    202    start and end of a group; the contents of a group can be retrieved after a match
    203    has been performed, and can be matched later in the string with the ``\number``
    204    special sequence, described below.  To match the literals ``'('`` or ``')'``,
    205    use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
    206 
    207 ``(?...)``
    208    This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
    209    otherwise).  The first character after the ``'?'`` determines what the meaning
    210    and further syntax of the construct is. Extensions usually do not create a new
    211    group; ``(?P<name>...)`` is the only exception to this rule. Following are the
    212    currently supported extensions.
    213 
    214 ``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.)  The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flags* argument to the
   :func:`re.compile` function.
    225 
    226    Note that the ``(?x)`` flag changes how the expression is parsed. It should be
    227    used first in the expression string, or after one or more whitespace characters.
    228    If there are non-whitespace characters before the flag, the results are
    229    undefined.
    230 
    231 ``(?:...)``
    232    A non-capturing version of regular parentheses.  Matches whatever regular
    233    expression is inside the parentheses, but the substring matched by the group
    234    *cannot* be retrieved after performing a match or referenced later in the
    235    pattern.
    236 
    237 ``(?P<name>...)``
    238    Similar to regular parentheses, but the substring matched by the group is
    239    accessible via the symbolic group name *name*.  Group names must be valid
    240    Python identifiers, and each group name must be defined only once within a
    241    regular expression.  A symbolic group is also a numbered group, just as if
    242    the group were not named.
    243 
    244    Named groups can be referenced in three contexts.  If the pattern is
    245    ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
    246    single or double quotes):
    247 
    248    +---------------------------------------+----------------------------------+
    249    | Context of reference to group "quote" | Ways to reference it             |
    250    +=======================================+==================================+
    251    | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
    252    |                                       | * ``\1``                         |
    253    +---------------------------------------+----------------------------------+
    254    | when processing match object ``m``    | * ``m.group('quote')``           |
    255    |                                       | * ``m.end('quote')`` (etc.)      |
    256    +---------------------------------------+----------------------------------+
    257    | in a string passed to the ``repl``    | * ``\g<quote>``                  |
    258    | argument of ``re.sub()``              | * ``\g<1>``                      |
    259    |                                       | * ``\1``                         |
    260    +---------------------------------------+----------------------------------+
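
The three contexts in the table can be tried with the same pattern. The sample text and the replacement string below are illustrative assumptions:

```python
import re

# The pattern from the text above: a string quoted with ' or ".
pattern = r"""(?P<quote>['"]).*?(?P=quote)"""
text = 'say "hi" now'           # illustrative sample input

# Context 1: (?P=quote) inside the pattern itself.
m = re.search(pattern, text)
whole = m.group(0)              # the quoted string, quotes included

# Context 2: accessing the group on the match object.
by_name = m.group("quote")      # the quote character that was used
by_number = m.group(1)          # a named group is also a numbered group
end_of_quote = m.end("quote")   # index just past the opening quote

# Context 3: referencing the group in a replacement string.
replaced = re.sub(pattern, r"\g<quote>...\g<quote>", text)
```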
    261 
    262 ``(?P=name)``
    263    A backreference to a named group; it matches whatever text was matched by the
    264    earlier group named *name*.
    265 
    266 ``(?#...)``
    267    A comment; the contents of the parentheses are simply ignored.
    268 
    269 ``(?=...)``
    270    Matches if ``...`` matches next, but doesn't consume any of the string.  This is
    271    called a lookahead assertion.  For example, ``Isaac (?=Asimov)`` will match
    272    ``'Isaac '`` only if it's followed by ``'Asimov'``.
    273 
    274 ``(?!...)``
    275    Matches if ``...`` doesn't match next.  This is a negative lookahead assertion.
    276    For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
    277    followed by ``'Asimov'``.
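
Both lookahead forms can be checked with a brief sketch. The second sample string, ``'Isaac Newton'``, is an illustrative addition:

```python
import re

followed = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
# The lookahead is not consumed: only 'Isaac ' is part of the match.
consumed = followed.group(0)

# Negative lookahead: fails when 'Asimov' comes next, succeeds otherwise.
rejected = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")   # None
accepted = re.search(r"Isaac (?!Asimov)", "Isaac Newton")   # a match object
```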
    278 
    279 ``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position.  This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Group
   references are not supported even if they match strings of some fixed length.
   Note that patterns which start with positive lookbehind assertions will not
   match at the beginning of the string being searched; you will most likely want
   to use the :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search(r'(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'
    302 
    303 ``(?<!...)``
    304    Matches if the current position in the string is not preceded by a match for
    305    ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
    306    positive lookbehind assertions, the contained pattern must only match strings of
    307    some fixed length and shouldn't contain group references.
    308    Patterns which start with negative lookbehind assertions may
    309    match at the beginning of the string being searched.
    310 
    311 ``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
   can be omitted. For example,  ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
   matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.
    317 
    318    .. versionadded:: 2.4
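
The conditional pattern above can be tried with :func:`match`, which anchors the attempt at the start of the string. This is a sketch with illustrative variable names:

```python
import re

email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)")

bracketed = email.match("<user@host.com>")   # matches; group 1 is '<'
bare = email.match("user@host.com")          # matches; group 1 is None
unbalanced = email.match("<user@host.com")   # None: '>' is required after '<'
```

Note that :func:`search` would still find ``'user@host.com'`` inside the unbalanced string, because it may start the attempt after the ``'<'``.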
    319 
    320 The special sequences consist of ``'\'`` and a character from the list below.
    321 If the ordinary character is not on the list, then the resulting RE will match
    322 the second character.  For example, ``\$`` matches the character ``'$'``.
    323 
    324 ``\number``
    325    Matches the contents of the group of the same number.  Groups are numbered
    326    starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
    327    but not ``'thethe'`` (note the space after the group).  This special sequence
    328    can only be used to match one of the first 99 groups.  If the first digit of
    329    *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
    330    a group match, but as the character with octal value *number*. Inside the
    331    ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
    332    characters.
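
A minimal sketch of a numbered backreference, using the strings from the description above:

```python
import re

doubled = re.compile(r"(.+) \1")

word_pair = doubled.match("the the")     # matches: group 1 is 'the'
number_pair = doubled.match("55 55")     # matches: group 1 is '55'
no_space = doubled.match("thethe")       # None: the pattern requires a space
```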
    333 
    334 ``\A``
    335    Matches only at the start of the string.
    336 
    337 ``\b``
    338    Matches the empty string, but only at the beginning or end of a word.  A word is
    339    defined as a sequence of alphanumeric or underscore characters, so the end of a
    340    word is indicated by whitespace or a non-alphanumeric, non-underscore character.
    341    Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
    342    a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
    343    of the string, so the precise set of characters deemed to be alphanumeric
    344    depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
    345    For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
    346    ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
    347    Inside a character range, ``\b`` represents the backspace character, for
    348    compatibility with Python's string literals.
    349 
    350 ``\B``
    351    Matches the empty string, but only when it is *not* at the beginning or end of a
    352    word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
    353    but not ``'py'``, ``'py.'``, or ``'py!'``.
    354    ``\B`` is just the opposite of ``\b``, so is also subject to the settings
    355    of ``LOCALE`` and ``UNICODE``.
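
The ``\b`` and ``\B`` examples above can be verified with a short sketch (list and variable names are illustrative):

```python
import re

word = re.compile(r"\bfoo\b")
candidates = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
# Keep only the strings in which 'foo' appears as a whole word.
hits = [s for s in candidates if word.search(s)]

inside = re.compile(r"py\B")
names = ["python", "py3", "py2", "py", "py.", "py!"]
# Keep only the strings in which 'py' is followed by another word character.
inner = [s for s in names if inside.search(s)]

# Inside a character class, \b is the backspace character.
backspace = re.findall(r"[\b]", "a\bb")
```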
    356 
    357 ``\d``
    358    When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
    359    is equivalent to the set ``[0-9]``.  With :const:`UNICODE`, it will match
    360    whatever is classified as a decimal digit in the Unicode character properties
    361    database.
    362 
    363 ``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``.  With :const:`UNICODE`, it
   will match any character not marked as a digit in the Unicode character
   properties database.
    368 
    369 ``\s``
   When the :const:`UNICODE` flag is not specified, matches any whitespace
   character; this is equivalent to the set ``[ \t\n\r\f\v]``.  The
   :const:`LOCALE` flag has no extra effect on whitespace matching.
   If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
   plus whatever is classified as space in the Unicode character properties
   database.
    376 
    377 ``\S``
   When the :const:`UNICODE` flag is not specified, matches any non-whitespace
   character; this is equivalent to the set ``[^ \t\n\r\f\v]``.  The
   :const:`LOCALE` flag has no extra effect on non-whitespace matching.  If
   :const:`UNICODE` is set, then any character not marked as space in the
   Unicode character properties database is matched.
    383 
    384 
    385 ``\w``
    386    When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
    387    any alphanumeric character and the underscore; this is equivalent to the set
    388    ``[a-zA-Z0-9_]``.  With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
    389    whatever characters are defined as alphanumeric for the current locale.  If
    390    :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
    391    is classified as alphanumeric in the Unicode character properties database.
    392 
    393 ``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character that is neither in the set
   ``[0-9_]`` nor defined as alphanumeric for the current locale.  If
   :const:`UNICODE` is set, this will match any character that is neither in
   ``[0-9_]`` nor classified as alphanumeric in the Unicode character properties
   database.
    400 
    401 ``\Z``
    402    Matches only at the end of the string.
    403 
If both the :const:`LOCALE` and :const:`UNICODE` flags are included for a
particular sequence, then the :const:`LOCALE` flag takes effect first, followed
by the :const:`UNICODE` flag.
    407 
    408 Most of the standard escapes supported by Python string literals are also
    409 accepted by the regular expression parser::
    410 
    411    \a      \b      \f      \n
    412    \r      \t      \v      \x
    413    \\
    414 
    415 (Note that ``\b`` is used to represent word boundaries, and means "backspace"
    416 only inside character classes.)
    417 
    418 Octal escapes are included in a limited form: If the first digit is a 0, or if
    419 there are three octal digits, it is considered an octal escape. Otherwise, it is
    420 a group reference.  As for string literals, octal escapes are always at most
    421 three digits in length.
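
The distinction between octal escapes and group references can be sketched as follows (the patterns are illustrative):

```python
import re

# Three octal digits: the character with octal value 101, i.e. 'A'.
octal_three = re.match(r"\101", "A")

# A leading zero also marks an octal escape (here the null character).
octal_zero = re.match(r"\0", "\x00")

# Otherwise the number is a group reference: \1 repeats what group 1 matched.
backref = re.match(r"(a)\1", "aa")
not_octal = re.match(r"(a)\1", "a\x01")   # None: \1 is not the character 0x01
```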
    422 
    423 .. seealso::
    424 
    425    Mastering Regular Expressions
    426       Book on regular expressions by Jeffrey Friedl, published by O'Reilly.  The
    427       second edition of the book no longer covers Python at all, but the first
    428       edition covered writing good regular expression patterns in great detail.
    429 
    430 
    431 
    432 .. _contents-of-module-re:
    433 
    434 Module Contents
    435 ---------------
    436 
The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled
form.
    441 
    442 
    443 .. function:: compile(pattern, flags=0)
    444 
    445    Compile a regular expression pattern into a regular expression object, which
    446    can be used for matching using its :func:`~RegexObject.match` and
    447    :func:`~RegexObject.search` methods, described below.
    448 
    449    The expression's behaviour can be modified by specifying a *flags* value.
    450    Values can be any of the following variables, combined using bitwise OR (the
    451    ``|`` operator).
    452 
   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)
    461 
    462    but using :func:`re.compile` and saving the resulting regular expression
    463    object for reuse is more efficient when the expression will be used several
    464    times in a single program.
    465 
    466    .. note::
    467 
    468       The compiled versions of the most recent patterns passed to
    469       :func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
    470       programs that use only a few regular expressions at a time needn't worry
    471       about compiling regular expressions.
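
As a sketch of combining flags and reusing a compiled object, the example below uses a made-up pattern and made-up sample text:

```python
import re

# Flags are combined with bitwise OR.
line_start_b = re.compile(r"^b\w+", re.IGNORECASE | re.MULTILINE)

text = "Apple\nbanana\nBerry"          # illustrative sample input
words = line_start_b.findall(text)     # words starting a line with 'b' or 'B'

# The same compiled object can be reused on other input without recompiling.
first = line_start_b.search("cherry\nBlueberry").group(0)
```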
    472 
    473 
    474 .. data:: DEBUG
    475 
   Display debug information about the compiled expression.
    477 
    478 
    479 .. data:: I
    480           IGNORECASE
    481 
    482    Perform case-insensitive matching; expressions like ``[A-Z]`` will match
    483    lowercase letters, too.  This is not affected by the current locale.
    484 
    485 
    486 .. data:: L
    487           LOCALE
    488 
    489    Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
    490    current locale.
    491 
    492 
    493 .. data:: M
    494           MULTILINE
    495 
    496    When specified, the pattern character ``'^'`` matches at the beginning of the
    497    string and at the beginning of each line (immediately following each newline);
    498    and the pattern character ``'$'`` matches at the end of the string and at the
    499    end of each line (immediately preceding each newline).  By default, ``'^'``
    500    matches only at the beginning of the string, and ``'$'`` only at the end of the
    501    string and immediately before the newline (if any) at the end of the string.
    502 
    503 
    504 .. data:: S
    505           DOTALL
    506 
    507    Make the ``'.'`` special character match any character at all, including a
    508    newline; without this flag, ``'.'`` will match anything *except* a newline.
    509 
    510 
    511 .. data:: U
    512           UNICODE
    513 
    514    Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
    515    on the Unicode character properties database.
    516 
    517    .. versionadded:: 2.0
    518 
    519 
    520 .. data:: X
    521           VERBOSE
    522 
    523    This flag allows you to write regular expressions that look nicer and are
    524    more readable by allowing you to visually separate logical sections of the
    525    pattern and add comments. Whitespace within the pattern is ignored, except
    526    when in a character class or when preceded by an unescaped backslash.
    527    When a line contains a ``#`` that is not in a character class and is not
    528    preceded by an unescaped backslash, all characters from the leftmost such
    529    ``#`` through the end of the line are ignored.
    530 
   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")
    538 
    539 
    540 .. function:: search(pattern, string, flags=0)
    541 
    542    Scan through *string* looking for the first location where the regular expression
    543    *pattern* produces a match, and return a corresponding :class:`MatchObject`
    544    instance. Return ``None`` if no position in the string matches the pattern; note
    545    that this is different from finding a zero-length match at some point in the
    546    string.
    547 
    548 
    549 .. function:: match(pattern, string, flags=0)
    550 
    551    If zero or more characters at the beginning of *string* match the regular
    552    expression *pattern*, return a corresponding :class:`MatchObject` instance.
    553    Return ``None`` if the string does not match the pattern; note that this is
    554    different from a zero-length match.
    555 
    556    Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
    557    at the beginning of the string and not at the beginning of each line.
    558 
    559    If you want to locate a match anywhere in *string*, use :func:`search`
    560    instead (see also :ref:`search-vs-match`).
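
The difference between :func:`match` and :func:`search` can be seen in a short sketch (sample strings and names are illustrative):

```python
import re

no_match = re.match(r"c", "abcdef")    # None: 'c' is not at the start
found = re.search(r"c", "abcdef")      # matches at index 2

text = "A\nB\nX"
# MULTILINE does not let match() begin at a later line...
still_none = re.match(r"X", text, re.MULTILINE)
# ...but search() with '^' can anchor at the start of any line.
line_hit = re.search(r"^X", text, re.MULTILINE)
```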
    561 
    562 
    563 .. function:: split(pattern, string, maxsplit=0, flags=0)
    564 
   Split *string* by the occurrences of *pattern*.  If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list.  (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored.  This has been fixed in later releases.)

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string.  The same holds for
   the end of the string:

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list (e.g., if there's one capturing group
   in the separator, at indices 1, 3 and so forth).

   Note that :func:`split` will never split a string on an empty pattern match.
   For example:

      >>> re.split('x*', 'foo')
      ['foo']
      >>> re.split("(?m)^$", "foo\n\nbar\n")
      ['foo\n\nbar\n']
    599 
    600    .. versionchanged:: 2.7
    601       Added the optional flags argument.
    602 
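As a small sketch of how the fixed positions of separator components in the
result of :func:`split` can be used, the snippet below slices the result of a
split with one capturing group.  The names ``parts``, ``fields`` and
``separators`` are illustrative only:

```python
import re

# One capturing group: the result alternates field, separator, field, ...
parts = re.split(r'(\W+)', '...words, words...')

fields = parts[0::2]       # text between separators (even indices)
separators = parts[1::2]   # text matched by the group (odd indices)
```

Because the result starts and ends with an empty string when the separator
matches at the edges, the two slices stay aligned whatever the input is.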

.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings.  The *string* is scanned left-to-right, and matches are returned in
   the order found.  If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group.  Empty matches are included in the result unless they touch the
   beginning of another match.

   .. versionadded:: 1.5.2

   .. versionchanged:: 2.4
      Added the optional flags argument.

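The effect of groups on the return value of :func:`findall` may be easier to
see side by side.  This is a minimal sketch; the sample text and variable
names are made up for illustration:

```python
import re

text = 'set width=20 and height=10'

# No groups: each item is the whole match.
whole = re.findall(r'\w+=\d+', text)

# One group: each item is the text of that group.
one_group = re.findall(r'\w+=(\d+)', text)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)=(\d+)', text)
```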

.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
   non-overlapping matches for the RE *pattern* in *string*.  The *string* is
   scanned left-to-right, and matches are returned in the order found.  Empty
   matches are included in the result unless they touch the beginning of another
   match.

   .. versionadded:: 2.2

   .. versionchanged:: 2.4
      Added the optional flags argument.


.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
   *string* is returned unchanged.  *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed.  That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth.  Unknown escapes such as ``\j`` are left alone.  Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example:

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*.  The function takes a single match object argument, and returns the
   replacement string.  For example:

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or an RE object.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer.  If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
   ``'-a-b-c-'``.

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 2.7
      Added the optional flags argument.

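The ``\g<...>`` forms accepted by :func:`sub` can be sketched as follows.  The
patterns and sample strings are illustrative, and the last case assumes that
a reference to an undefined group is reported as :exc:`re.error`:

```python
import re

# Named groups referenced from the replacement template.
swapped = re.sub(r'(?P<first>\w+) (?P<last>\w+)', r'\g<last>, \g<first>',
                 'Isaac Newton')

# \g<2>0 means "group 2, then a literal 0".
unambiguous = re.sub(r'(a)(b)', r'\g<2>0', 'ab')

# \20 is read as group 20, which this two-group pattern does not define.
try:
    re.sub(r'(a)(b)', r'\20', 'ab')
    group_20_rejected = False
except re.error:
    group_20_rejected = True
```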

.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 2.7
      Added the optional flags argument.

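A short sketch of :func:`subn`, useful when the caller needs to know whether
anything was replaced.  The sample string is made up for illustration:

```python
import re

# Collapse runs of whitespace and count how many runs were rewritten.
new_text, replacements = re.subn(r'\s+', ' ', 'too   many    spaces')
```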

.. function:: escape(string)

   Return *string* with all non-alphanumerics backslashed; this is useful if you
   want to match an arbitrary literal string that may have regular expression
   metacharacters in it.

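The following sketch shows why :func:`escape` matters when the text to look
for is not under the programmer's control.  The literal and the sample strings
are illustrative only:

```python
import re

literal = '1.5 (approx)'

# Unescaped, "." matches any character and "(...)" is a group, so the
# pattern accepts text that does not contain the literal string at all.
loose = re.search(literal, '125 approx')

# Escaped, only the exact literal text matches.
strict_miss = re.search(re.escape(literal), '125 approx')
strict_hit = re.search(re.escape(literal), 'about 1.5 (approx) long')
```

The exact escaped string is deliberately not shown, since which characters
get a backslash can differ between Python versions.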

.. function:: purge()

   Clear the regular expression cache.


.. exception:: error

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching.  It is never an
   error if a string contains no match for a pattern.

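One common use of :exc:`error` is validating a pattern that comes from user
input.  The helper name ``is_valid_pattern`` below is hypothetical, not part
of the module:

```python
import re

def is_valid_pattern(pattern):
    """Return True if *pattern* compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

unbalanced = is_valid_pattern('(unclosed')   # unmatched parenthesis
well_formed = is_valid_pattern(r'\d+')
```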

.. _re-objects:

Regular Expression Objects
--------------------------

.. class:: RegexObject

   The :class:`RegexObject` class supports the following methods and attributes:

   .. method:: RegexObject.search(string[, pos[, endpos]])

      Scan through *string* looking for a location where this regular expression
      produces a match, and return a corresponding :class:`MatchObject` instance.
      Return ``None`` if no position in the string matches the pattern; note that this
      is different from finding a zero-length match at some point in the string.

      The optional second parameter *pos* gives an index in the string where the
      search is to start; it defaults to ``0``.  This is not completely equivalent to
      slicing the string; the ``'^'`` pattern character matches at the real beginning
      of the string and at positions just after a newline, but not necessarily at the
      index where the search is to start.

      The optional parameter *endpos* limits how far the string will be searched; it
      will be as if the string is *endpos* characters long, so only the characters
      from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
      than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
      expression object, ``rx.search(string, 0, 50)`` is equivalent to
      ``rx.search(string[:50], 0)``.

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <_sre.SRE_Match object at ...>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"


   .. method:: RegexObject.match(string[, pos[, endpos]])

      If zero or more characters at the *beginning* of *string* match this regular
      expression, return a corresponding :class:`MatchObject` instance.  Return
      ``None`` if the string does not match the pattern; note that this is different
      from a zero-length match.

      The optional *pos* and *endpos* parameters have the same meaning as for the
      :meth:`~RegexObject.search` method.

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <_sre.SRE_Match object at ...>

      If you want to locate a match anywhere in *string*, use
      :meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).


   .. method:: RegexObject.split(string, maxsplit=0)

      Identical to the :func:`split` function, using the compiled pattern.


   .. method:: RegexObject.findall(string[, pos[, endpos]])

      Similar to the :func:`findall` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region like for :meth:`match`.


   .. method:: RegexObject.finditer(string[, pos[, endpos]])

      Similar to the :func:`finditer` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region like for :meth:`match`.


   .. method:: RegexObject.sub(repl, string, count=0)

      Identical to the :func:`sub` function, using the compiled pattern.


   .. method:: RegexObject.subn(repl, string, count=0)

      Identical to the :func:`subn` function, using the compiled pattern.


   .. attribute:: RegexObject.flags

      The regex matching flags.  This is a combination of the flags given to
      :func:`.compile` and any ``(?...)`` inline flags in the pattern.


   .. attribute:: RegexObject.groups

      The number of capturing groups in the pattern.


   .. attribute:: RegexObject.groupindex

      A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
      numbers.  The dictionary is empty if no symbolic groups were used in the
      pattern.


   .. attribute:: RegexObject.pattern

      The pattern string from which the RE object was compiled.

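The difference between passing *pos* or *endpos* to a compiled pattern and
slicing the string first can be sketched as follows.  The patterns and
variable names are illustrative only:

```python
import re

rx = re.compile(r'^b')

sliced = rx.search('abc'[1:])     # '^' matches at the start of 'bc'
offset = rx.search('abc', 1)      # '^' still refers to the real start

d = re.compile('d')
limited = d.search('abcd', 0, 3)         # only 'abc' is examined
same_as_slice = d.search('abcd'[:3], 0)  # the documented equivalent
```

Only the sliced call finds a match for ``^b``.  The two *endpos* calls agree
with each other, and neither finds the ``d`` that lies beyond the limit.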

.. _match-objects:

Match Objects
-------------

.. class:: MatchObject

   Match objects always have a boolean value of ``True``.
   Since :meth:`~RegexObject.match` and :meth:`~RegexObject.search` return ``None``
   when there is no match, you can test whether there was a match with a simple
   ``if`` statement::

      match = re.search(pattern, string)
      if match:
          process(match)

   Match objects support the following methods and attributes:


   .. method:: MatchObject.expand(template)

      Return the string obtained by doing backslash substitution on the template
      string *template*, as done by the :meth:`~RegexObject.sub` method.  Escapes
      such as ``\n`` are converted to the appropriate characters, and numeric
      backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
      ``\g<name>``) are replaced by the contents of the corresponding group.


   .. method:: MatchObject.group([group1, ...])

      Returns one or more subgroups of the match.  If there is a single argument, the
      result is a single string; if there are multiple arguments, the result is a
      tuple with one item per argument. Without arguments, *group1* defaults to zero
      (the whole match is returned). If a *groupN* argument is zero, the corresponding
      return value is the entire matching string; if it is in the inclusive range
      [1..99], it is the string matching the corresponding parenthesized group.  If a
      group number is negative or larger than the number of groups defined in the
      pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
      part of the pattern that did not match, the corresponding result is ``None``.
      If a group is contained in a part of the pattern that matched multiple times,
      the last match is returned.

         >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
         >>> m.group(0)       # The entire match
         'Isaac Newton'
         >>> m.group(1)       # The first parenthesized subgroup.
         'Isaac'
         >>> m.group(2)       # The second parenthesized subgroup.
         'Newton'
         >>> m.group(1, 2)    # Multiple arguments give us a tuple.
         ('Isaac', 'Newton')

      If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
      arguments may also be strings identifying groups by their group name.  If a
      string argument is not used as a group name in the pattern, an :exc:`IndexError`
      exception is raised.

      A moderately complicated example:

         >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
         >>> m.group('first_name')
         'Malcolm'
         >>> m.group('last_name')
         'Reynolds'

      Named groups can also be referred to by their index:

         >>> m.group(1)
         'Malcolm'
         >>> m.group(2)
         'Reynolds'

      If a group matches multiple times, only the last match is accessible:

         >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
         >>> m.group(1)                        # Returns only the last match.
         'c3'


   .. method:: MatchObject.groups([default])

      Return a tuple containing all the subgroups of the match, from 1 up to however
      many groups are in the pattern.  The *default* argument is used for groups that
      did not participate in the match; it defaults to ``None``.  (Incompatibility
      note: in the original Python 1.5 release, if the tuple was one element long, a
      string would be returned instead.  In later versions (from 1.5.1 on), a
      singleton tuple is returned in such cases.)

      For example:

         >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
         >>> m.groups()
         ('24', '1632')

      If we make the decimal place and everything after it optional, not all groups
      might participate in the match.  These groups will default to ``None`` unless
      the *default* argument is given:

         >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
         >>> m.groups()      # Second group defaults to None.
         ('24', None)
         >>> m.groups('0')   # Now, the second group defaults to '0'.
         ('24', '0')


   .. method:: MatchObject.groupdict([default])

      Return a dictionary containing all the *named* subgroups of the match, keyed by
      the subgroup name.  The *default* argument is used for groups that did not
      participate in the match; it defaults to ``None``.  For example:

         >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
         >>> m.groupdict()
         {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


   .. method:: MatchObject.start([group])
               MatchObject.end([group])

      Return the indices of the start and end of the substring matched by *group*;
      *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
      *group* exists but did not contribute to the match.  For a match object *m*, and
      a group *g* that did contribute to the match, the substring matched by group *g*
      (equivalent to ``m.group(g)``) is ::

         m.string[m.start(g):m.end(g)]

      Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
      null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
      ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
      2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

      An example that will remove *remove_this* from email addresses:

         >>> email = "tony@tiremove_thisger.net"
         >>> m = re.search("remove_this", email)
         >>> email[:m.start()] + email[m.end():]
         'tony@tiger.net'


   .. method:: MatchObject.span([group])

      For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
      m.end(group))``. Note that if *group* did not contribute to the match, this is
      ``(-1, -1)``.  *group* defaults to zero, the entire match.


   .. attribute:: MatchObject.pos

      The value of *pos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`.  This is the
      index into the string at which the RE engine started looking for a match.


   .. attribute:: MatchObject.endpos

      The value of *endpos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`.  This is the
      index into the string beyond which the RE engine will not go.


   .. attribute:: MatchObject.lastindex

      The integer index of the last matched capturing group, or ``None`` if no group
      was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
      ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
      the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
      string.


   .. attribute:: MatchObject.lastgroup

      The name of the last matched capturing group, or ``None`` if the group didn't
      have a name, or if no group was matched at all.


   .. attribute:: MatchObject.re

      The regular expression object whose :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search` method produced this :class:`MatchObject`
      instance.


   .. attribute:: MatchObject.string

      The string passed to :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search`.

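The position-related methods and attributes of match objects can be tried
together on a match in which one group does not participate.  The pattern,
group names and sample string below are illustrative only:

```python
import re

# The optional "value" group does not participate: "=" is followed by a space.
m = re.search(r'(?P<key>\w+)=(?P<value>\d+)?', 'size= rest')

# Slicing the original string with start()/end() gives back the group text.
key_text = m.string[m.start('key'):m.end('key')]

key_span = m.span('key')        # where the "key" group matched
value_span = m.span('value')    # (-1, -1) for a non-participating group
filled = m.groupdict('0')       # substitute a default for missing groups
```

Here ``m.lastindex`` and ``m.lastgroup`` refer to the ``key`` group, since it
is the last group that actually matched.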

Examples
--------


Checking For a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully:

.. testcode::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following:

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such:

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
manner:

.. doctest::

   >>> pair.match("717ak").group(1)
   '7'

   # Error because match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'

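To avoid the :exc:`AttributeError` shown above, the result of the match can be
tested before calling :meth:`~MatchObject.group`.  A minimal sketch, in which
the helper name ``pair_card`` is hypothetical:

```python
import re

pair = re.compile(r'.*(.).*\1')

def pair_card(hand):
    """Return the paired card in *hand*, or None if there is no pair."""
    match = pair.match(hand)
    if match is None:
        return None
    return match.group(1)

sevens = pair_card('717ak')
nothing = pair_card('718ak')
aces = pair_card('354aa')
```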

Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`.  Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings.  The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings

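Applied to the sample line, the expression above can be used as follows.
Unlike :c:func:`scanf`, the groups come back as strings, so the numeric fields
are converted explicitly; the variable names are illustrative only:

```python
import re

line = '/usr/sbin/sendmail - 0 errors, 4 warnings'
m = re.match(r'(\S+) - (\d+) errors, (\d+) warnings', line)

filename = m.group(1)
errors = int(m.group(2))     # groups are strings; convert as needed
warnings = int(m.group(3))
```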

.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <_sre.SRE_Match object at ...>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <_sre.SRE_Match object at ...>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line.

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <_sre.SRE_Match object at ...>


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern.  The
method is invaluable for converting textual data into data structures that can be
easily read and modified by Python as demonstrated in the following example that
creates a phonebook.

First, here is the input.  Normally it might come from a file; here we use
triple-quoted string syntax:

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address.  We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

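The split entries can then be collected into a lookup structure.  The sketch
below is one possible layout, not the only one; the dictionary shape and the
``(last, first)`` key are assumptions made for illustration:

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r'\n+', text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(r':? ', entry, maxsplit=3)
    phonebook[(last, first)] = {'phone': phone, 'address': address}
```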

Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.  This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

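Because the munged output is random, it is awkward to test directly.  One
option, sketched below, is to pass a seeded generator so that runs are
repeatable and then check the properties that must hold for every word.  The
``munge`` wrapper and its *seed* parameter are hypothetical additions:

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of each word, using a private RNG."""
    rng = random.Random(seed)

    def repl(m):
        inner = list(m.group(2))
        rng.shuffle(inner)
        return m.group(1) + ''.join(inner) + m.group(3)

    return re.sub(r'(\w)(\w+)(\w)', repl, text)

original = 'Professor Abdolmalek, please report your absences promptly.'
munged = munge(original, seed=0)
```

Each munged word keeps its first and last characters and contains the same
letters as the original word, only in a different order.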

Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does.  For example, if one was a writer and wanted to
find all of the adverbs in some text, he or she might use :func:`findall` in
the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides instances of
:class:`MatchObject` instead of strings.  Continuing with the previous example,
if one was a writer who wanted to find all of the adverbs *and their positions*
in some text, he or she would use :func:`finditer` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
   07-16: carefully
   40-47: quickly

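When the positions are needed as data rather than printed output, the same
loop can collect them into a list.  A minimal sketch; the name ``adverbs`` is
illustrative only:

```python
import re

text = 'He was carefully disguised but captured quickly by police.'

# One (start, end, matched text) tuple per adverb, in order of appearance.
adverbs = [(m.start(), m.end(), m.group(0))
           for m in re.finditer(r'\w+ly', text)]
```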

Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it.  For example, the two following lines of code are
functionally identical:

   >>> re.match(r"\W(.)\1\W", " ff ")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <_sre.SRE_Match object at ...>

When one wants to match a literal backslash, it must be escaped in the regular
expression.  With raw string notation, this means ``r"\\"``.  Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical:

   >>> re.match(r"\\", r"\\")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\\\", r"\\")
   <_sre.SRE_Match object at ...>

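The equivalence between the two notations can be checked by comparing the
strings themselves, before any regular expression is involved.  A small
sketch with illustrative variable names:

```python
import re

raw_newline = r'\n'        # two characters: a backslash and "n"
cooked_newline = '\n'      # one character: a newline

raw_pattern = r'\\'        # pattern text: two backslashes
plain_pattern = '\\\\'     # the same two backslashes, written without r""

# Either spelling matches a single literal backslash.
hit = re.match(raw_pattern, '\\')
```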