      1 :mod:`re` --- Regular expression operations
      2 ===========================================
      3 
      4 .. module:: re
      5    :synopsis: Regular expression operations.
      6 
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
      9 
     10 **Source code:** :source:`Lib/re.py`
     11 
     12 --------------
     13 
     14 This module provides regular expression matching operations similar to
     15 those found in Perl.
     16 
     17 Both patterns and strings to be searched can be Unicode strings as well as
     18 8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
     19 that is, you cannot match a Unicode string with a byte pattern or
     20 vice-versa; similarly, when asking for a substitution, the replacement
     21 string must be of the same type as both the pattern and the search string.
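The type rule above can be checked with a short sketch. The sample text and the variable names (``str_hit``, ``bytes_hit``, ``mixed_error``) are illustrative assumptions.

```python
import re

# A str pattern searches a str; a bytes pattern searches bytes.
str_hit = re.search(r"\d+", "abc 123").group()
bytes_hit = re.search(rb"\d+", b"abc 123").group()

# Mixing the two types is rejected with a TypeError.
try:
    re.search(rb"\d+", "abc 123")
    mixed_error = None
except TypeError as exc:
    mixed_error = type(exc).__name__
```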
     22 
     23 Regular expressions use the backslash character (``'\'``) to indicate
     24 special forms or to allow special characters to be used without invoking
     25 their special meaning.  This collides with Python's usage of the same
     26 character for the same purpose in string literals; for example, to match
     27 a literal backslash, one might have to write ``'\\\\'`` as the pattern
     28 string, because the regular expression must be ``\\``, and each
     29 backslash must be expressed as ``\\`` inside a regular Python string
     30 literal.
     31 
     32 The solution is to use Python's raw string notation for regular expression
     33 patterns; backslashes are not handled in any special way in a string literal
     34 prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
     35 ``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
     36 newline.  Usually patterns will be expressed in Python code using this raw
     37 string notation.
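As a minimal sketch of the point above, the following compares the raw and non-raw spellings of the same pattern. The sample path and variable names are made up for illustration.

```python
import re

# A raw string keeps the backslash; a normal literal turns "\n" into a newline.
raw_length = len(r"\n")
plain_length = len("\n")

# Both spellings produce the same two-character pattern, which the regex
# engine reads as "one literal backslash".
plain_pattern = "\\\\"
raw_pattern = r"\\"
same_pattern = plain_pattern == raw_pattern

path = "C:\\temp"  # the text C:\temp, containing a single backslash
backslash_index = re.search(raw_pattern, path).start()
```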
     38 
Most regular expression operations are available both as module-level
functions and as methods on
:ref:`compiled regular expressions <re-objects>`.  The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.
     44 
     45 .. seealso::
     46 
   The third-party `regex <https://pypi.python.org/pypi/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.
     50 
     51 
     52 .. _re-syntax:
     53 
     54 Regular Expression Syntax
     55 -------------------------
     56 
     57 A regular expression (or RE) specifies a set of strings that matches it; the
     58 functions in this module let you check if a particular string matches a given
     59 regular expression (or if a given regular expression matches a particular
     60 string, which comes down to the same thing).
     61 
Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contains low-precedence
operations, there are boundary conditions between *A* and *B*, or either has
numbered group references.  Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here.  For details of
the theory and implementation of regular expressions, consult the Friedl book
referenced below, or almost any textbook about compiler construction.
     71 
     72 A brief explanation of the format of regular expressions follows.  For further
     73 information and a gentler presentation, consult the :ref:`regex-howto`.
     74 
     75 Regular expressions can contain both special and ordinary characters. Most
     76 ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
     77 expressions; they simply match themselves.  You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)
     81 
     82 Some characters, like ``'|'`` or ``'('``, are special. Special
     83 characters either stand for classes of ordinary characters, or affect
     84 how the regular expressions around them are interpreted. Regular
     85 expression pattern strings may not contain null bytes, but can specify
     86 the null byte using a ``\number`` notation such as ``'\x00'``.
     87 
Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
     89 directly nested. This avoids ambiguity with the non-greedy modifier suffix
     90 ``?``, and with other modifiers in other implementations. To apply a second
     91 repetition to an inner repetition, parentheses may be used. For example,
     92 the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
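A brief sketch of the nesting rule above: directly stacking two qualifiers is rejected, while wrapping the inner repetition in a non-capturing group works. Variable names are illustrative.

```python
import re

# Stacking a second qualifier directly on the first is a pattern error.
try:
    re.compile(r"a{6}*")
    nested_error = None
except re.error as exc:
    nested_error = str(exc)

# Wrapping the inner repetition in (?:...) makes the nesting legal.
multiple_of_six = re.compile(r"(?:a{6})*")
twelve_ok = multiple_of_six.fullmatch("a" * 12) is not None
seven_ok = multiple_of_six.fullmatch("a" * 7) is not None

# match() is anchored only at the start, so it consumes the longest
# multiple of six available and stops there.
consumed = multiple_of_six.match("a" * 15).end()
```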
     93 
     94 
     95 The special characters are:
     96 
     97 ``'.'``
     98    (Dot.)  In the default mode, this matches any character except a newline.  If
     99    the :const:`DOTALL` flag has been specified, this matches any character
    100    including a newline.
    101 
    102 ``'^'``
    103    (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
    104    matches immediately after each newline.
    105 
    106 ``'$'``
    107    Matches the end of the string or just before the newline at the end of the
    108    string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
    109    matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
    110    only 'foo'.  More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
    111    matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
    112    a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
    113    the newline, and one at the end of the string.
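The ``$`` behaviour described above can be reproduced with the same sample strings; only the variable names are added here.

```python
import re

text = "foo1\nfoo2\n"

# Default mode: '$' matches only at the end, or just before a final newline.
default_hit = re.search(r"foo.$", text).group()

# MULTILINE mode: '$' also matches before every newline.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A lone '$' finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]
```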
    114 
    115 ``'*'``
    116    Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
    117    many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
    118    by any number of 'b's.
    119 
    120 ``'+'``
    121    Causes the resulting RE to match 1 or more repetitions of the preceding RE.
    122    ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
    123    match just 'a'.
    124 
    125 ``'?'``
    126    Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
    127    ``ab?`` will match either 'a' or 'ab'.
    128 
    129 ``*?``, ``+?``, ``??``
    130    The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
    131    as much text as possible.  Sometimes this behaviour isn't desired; if the RE
    132    ``<.*>`` is matched against ``<a> b <c>``, it will match the entire
    133    string, and not just ``<a>``.  Adding ``?`` after the qualifier makes it
    134    perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
    135    characters as possible will be matched.  Using the RE ``<.*?>`` will match
    136    only ``<a>``.
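For illustration, the greedy and non-greedy forms above behave as follows on the same input (variable names are assumptions):

```python
import re

text = "<a> b <c>"

# Greedy: '.*' grabs as much as it can, so the match runs to the last '>'.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: '.*?' stops at the first '>' that lets the pattern succeed.
minimal = re.match(r"<.*?>", text).group()
```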
    137 
    138 ``{m}``
    139    Specifies that exactly *m* copies of the previous RE should be matched; fewer
    140    matches cause the entire RE not to match.  For example, ``a{6}`` will match
    141    exactly six ``'a'`` characters, but not five.
    142 
    143 ``{m,n}``
    144    Causes the resulting RE to match from *m* to *n* repetitions of the preceding
    145    RE, attempting to match as many repetitions as possible.  For example,
    146    ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound.  As an
    148    example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
    149    followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
    150    modifier would be confused with the previously described form.
    151 
    152 ``{m,n}?``
    153    Causes the resulting RE to match from *m* to *n* repetitions of the preceding
    154    RE, attempting to match as *few* repetitions as possible.  This is the
    155    non-greedy version of the previous qualifier.  For example, on the
    156    6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
    157    while ``a{3,5}?`` will only match 3 characters.
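A short sketch of the counted qualifiers above, using the same example patterns; the variable names are illustrative.

```python
import re

text = "aaaaaa"

# Greedy {m,n} takes as many repetitions as allowed; {m,n}? takes as few.
greedy_len = len(re.match(r"a{3,5}", text).group())
minimal_len = len(re.match(r"a{3,5}?", text).group())

# Omitting n leaves the upper bound open.
four_ok = re.fullmatch(r"a{4,}b", "aaaab") is not None
many_ok = re.fullmatch(r"a{4,}b", "a" * 1000 + "b") is not None
three_ok = re.fullmatch(r"a{4,}b", "aaab") is not None
```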
    158 
    159 ``'\'``
    160    Either escapes special characters (permitting you to match characters like
    161    ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
    162    sequences are discussed below.
    163 
    164    If you're not using a raw string to express the pattern, remember that Python
    165    also uses the backslash as an escape sequence in string literals; if the escape
    166    sequence isn't recognized by Python's parser, the backslash and subsequent
    167    character are included in the resulting string.  However, if Python would
   recognize the resulting sequence, the backslash must be written twice.  This
    169    is complicated and hard to understand, so it's highly recommended that you use
    170    raw strings for all but the simplest expressions.
    171 
    172 ``[]``
    173    Used to indicate a set of characters.  In a set:
    174 
    175    * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
    176      ``'m'``, or ``'k'``.
    177 
    178    * Ranges of characters can be indicated by giving two characters and separating
    179      them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
    181      ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
    182      ``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
    183      it will match a literal ``'-'``.
    184 
    185    * Special characters lose their special meaning inside sets.  For example,
    186      ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
    187      ``'*'``, or ``')'``.
    188 
    189    * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
    191      :const:`ASCII` or :const:`LOCALE` mode is in force.
    192 
    193    * Characters that are not within a range can be matched by :dfn:`complementing`
    194      the set.  If the first character of the set is ``'^'``, all the characters
    195      that are *not* in the set will be matched.  For example, ``[^5]`` will match
    196      any character except ``'5'``, and ``[^^]`` will match any character except
    197      ``'^'``.  ``^`` has no special meaning if it's not the first character in
    198      the set.
    199 
    200    * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set.  For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a parenthesis, bracket, or brace.
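The set rules listed above can be sketched as follows. The sample strings and variable names are made up for illustration.

```python
import re

# Ranges: [0-5][0-9] accepts 00 through 59.
minute_ok = re.fullmatch(r"[0-5][0-9]", "59") is not None
minute_bad = re.fullmatch(r"[0-5][0-9]", "60") is not None

# Special characters are literal inside a set.
specials = re.findall(r"[(+*)]", "f(x) = x*2+1")

# A leading '^' complements the set.
not_five = re.findall(r"[^5]", "1525")

# A ']' placed first, or escaped, is a literal member of the set.
first_form = re.findall(r"[]()[{}]", "a[0](b){c}")
escaped_form = re.findall(r"[()[\]{}]", "a[0](b){c}")
```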
    203 
    204 ``'|'``
    205    ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
    206    will match either A or B.  An arbitrary number of REs can be separated by the
    207    ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
    208    the target string is scanned, REs separated by ``'|'`` are tried from left to
    209    right. When one pattern completely matches, that branch is accepted. This means
    210    that once ``A`` matches, ``B`` will not be tested further, even if it would
    211    produce a longer overall match.  In other words, the ``'|'`` operator is never
    212    greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
    213    character class, as in ``[|]``.
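As a small illustration of the left-to-right rule above, the order of the alternatives decides which one wins (variable names are assumptions):

```python
import re

# The first alternative that matches is accepted, even if a later one
# would have matched more text.
short_first = re.match(r"a|ab", "ab").group()
long_first = re.match(r"ab|a", "ab").group()

# A literal '|' can be written inside a character class or escaped.
bars_in_class = re.findall(r"[|]", "x|y|z")
bars_escaped = re.findall(r"\|", "x|y|z")
```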
    214 
    215 ``(...)``
    216    Matches whatever regular expression is inside the parentheses, and indicates the
    217    start and end of a group; the contents of a group can be retrieved after a match
    218    has been performed, and can be matched later in the string with the ``\number``
    219    special sequence, described below.  To match the literals ``'('`` or ``')'``,
    220    use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
    221 
    222 ``(?...)``
    223    This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
    224    otherwise).  The first character after the ``'?'`` determines what the meaning
    225    and further syntax of the construct is. Extensions usually do not create a new
    226    group; ``(?P<name>...)`` is the only exception to this rule. Following are the
    227    currently supported extensions.
    228 
    229 ``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.)  The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose), for the
   entire regular expression. (The flags are described in
   :ref:`contents-of-module-re`.) This is useful if you wish to include the
   flags as part of the regular expression, instead of passing a *flags*
   argument to the :func:`re.compile` function.  Flags should be used first
   in the expression string.
    241 
    242 ``(?:...)``
    243    A non-capturing version of regular parentheses.  Matches whatever regular
    244    expression is inside the parentheses, but the substring matched by the group
    245    *cannot* be retrieved after performing a match or referenced later in the
    246    pattern.
    247 
    248 ``(?imsx-imsx:...)``
   (Zero or more letters from the set ``'i'``, ``'m'``, ``'s'``, ``'x'``,
   optionally followed by ``'-'`` followed by one or more letters from the
   same set.)  The letters set or remove the corresponding flags:
   :const:`re.I` (ignore case), :const:`re.M` (multi-line), :const:`re.S`
   (dot matches all), and :const:`re.X` (verbose), for the part of the
   expression enclosed in the group.  (The flags are described in
   :ref:`contents-of-module-re`.)
    255 
    256    .. versionadded:: 3.6
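A hedged sketch contrasting a whole-pattern inline flag with a scoped one. It assumes Python 3.6 or later for the scoped form; the sample words are arbitrary.

```python
import re

# (?i) at the start applies IGNORECASE to the entire pattern.
whole_pattern = re.fullmatch(r"(?i)python", "PyThOn") is not None

# (?i:...) applies it only inside the group, so 'thon' stays case-sensitive.
scoped_hit = re.fullmatch(r"(?i:py)thon", "PYthon") is not None
scoped_miss = re.fullmatch(r"(?i:py)thon", "PYTHON") is not None

# (?-i:...) switches the flag off again for one part of the pattern.
switched_off = re.fullmatch(r"(?i)py(?-i:thon)", "PYTHON") is not None
```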
    257 
    258 ``(?P<name>...)``
    259    Similar to regular parentheses, but the substring matched by the group is
    260    accessible via the symbolic group name *name*.  Group names must be valid
    261    Python identifiers, and each group name must be defined only once within a
    262    regular expression.  A symbolic group is also a numbered group, just as if
    263    the group were not named.
    264 
    265    Named groups can be referenced in three contexts.  If the pattern is
    266    ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
    267    single or double quotes):
    268 
    269    +---------------------------------------+----------------------------------+
    270    | Context of reference to group "quote" | Ways to reference it             |
    271    +=======================================+==================================+
    272    | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
    273    |                                       | * ``\1``                         |
    274    +---------------------------------------+----------------------------------+
    275    | when processing match object ``m``    | * ``m.group('quote')``           |
    276    |                                       | * ``m.end('quote')`` (etc.)      |
    277    +---------------------------------------+----------------------------------+
    278    | in a string passed to the ``repl``    | * ``\g<quote>``                  |
    279    | argument of ``re.sub()``              | * ``\g<1>``                      |
    280    |                                       | * ``\1``                         |
    281    +---------------------------------------+----------------------------------+
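The three reference contexts in the table can be exercised with the same ``quote`` pattern. The input strings and variable names below are illustrative assumptions.

```python
import re

quoted = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")

# Inside the pattern, (?P=quote) forces the closing quote to equal the opener.
m = quoted.search('say "hello" now')
whole = m.group(0)

# On the match object, the group is reachable by name or by number.
quote_char = m.group("quote")
same_as_numbered = m.group(1) == quote_char
quote_end = m.end("quote")

# In a replacement string, \g<quote> inserts the text the group matched.
replaced = quoted.sub(r"\g<quote>X\g<quote>", "say 'hi'")
```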
    282 
    283 ``(?P=name)``
    284    A backreference to a named group; it matches whatever text was matched by the
    285    earlier group named *name*.
    286 
    287 ``(?#...)``
    288    A comment; the contents of the parentheses are simply ignored.
    289 
    290 ``(?=...)``
    291    Matches if ``...`` matches next, but doesn't consume any of the string.  This is
    292    called a lookahead assertion.  For example, ``Isaac (?=Asimov)`` will match
    293    ``'Isaac '`` only if it's followed by ``'Asimov'``.
    294 
    295 ``(?!...)``
    296    Matches if ``...`` doesn't match next.  This is a negative lookahead assertion.
    297    For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
    298    followed by ``'Asimov'``.
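Both lookahead forms above can be checked with a short sketch; the second surname is an arbitrary choice for the example.

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not part of the match.
positive = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
positive_text = positive.group()
positive_end = positive.end()

# Negative lookahead: the match succeeds only when 'Asimov' does not follow.
negative_miss = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
negative_hit = re.search(r"Isaac (?!Asimov)", "Isaac Newton").group()
```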
    299 
    300 ``(?<=...)``
    301    Matches if the current position in the string is preceded by a match for ``...``
    302    that ends at the current position.  This is called a :dfn:`positive lookbehind
    303    assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
    304    lookbehind will back up 3 characters and check if the contained pattern matches.
    305    The contained pattern must only match strings of some fixed length, meaning that
    306    ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Note that
    307    patterns which start with positive lookbehind assertions will not match at the
    308    beginning of the string being searched; you will most likely want to use the
    309    :func:`search` function rather than the :func:`match` function:
    310 
    311       >>> import re
    312       >>> m = re.search('(?<=abc)def', 'abcdef')
    313       >>> m.group(0)
    314       'def'
    315 
    316    This example looks for a word following a hyphen:
    317 
      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
    319       >>> m.group(0)
    320       'egg'
    321 
    322    .. versionchanged:: 3.5
    323       Added support for group references of fixed length.
    324 
    325 ``(?<!...)``
    326    Matches if the current position in the string is not preceded by a match for
    327    ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
    328    positive lookbehind assertions, the contained pattern must only match strings of
    329    some fixed length.  Patterns which start with negative lookbehind assertions may
    330    match at the beginning of the string being searched.
    331 
    332 ``(?(id/name)yes-pattern|no-pattern)``
    333    Will try to match with ``yes-pattern`` if the group with given *id* or
    334    *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
    335    optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.
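As a sketch, the conditional pattern above can be run against the four candidate strings it mentions. The ``results`` dictionary is an illustrative helper.

```python
import re

# Group 1 is the optional '<'. If it took part in the match, a closing '>'
# is required; otherwise the address must run to the end of the string.
email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = [
    "<user@host.com>",
    "user@host.com",
    "<user@host.com",
    "user@host.com>",
]
results = {text: email.match(text) is not None for text in candidates}
```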
    339 
    340 
    341 The special sequences consist of ``'\'`` and a character from the list below.
    342 If the ordinary character is not an ASCII digit or an ASCII letter, then the
    343 resulting RE will match the second character.  For example, ``\$`` matches the
    344 character ``'$'``.
    345 
    346 ``\number``
    347    Matches the contents of the group of the same number.  Groups are numbered
    348    starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
    349    but not ``'thethe'`` (note the space after the group).  This special sequence
    350    can only be used to match one of the first 99 groups.  If the first digit of
    351    *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
    352    a group match, but as the character with octal value *number*. Inside the
    353    ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
    354    characters.
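A minimal sketch of a numbered backreference, using the sample strings from the entry above:

```python
import re

repeated = re.compile(r"(.+) \1")

# \1 must match exactly the text captured by group 1, after a space.
words_hit = repeated.fullmatch("the the") is not None
digits_hit = repeated.fullmatch("55 55") is not None

# Without the space between the two copies the pattern cannot match.
no_space = repeated.search("thethe")
```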
    355 
    356 ``\A``
    357    Matches only at the start of the string.
    358 
    359 ``\b``
    360    Matches the empty string, but only at the beginning or end of a word.
    361    A word is defined as a sequence of Unicode alphanumeric or underscore
    362    characters, so the end of a word is indicated by whitespace or a
    363    non-alphanumeric, non-underscore Unicode character.  Note that formally,
    364    ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
    365    (or vice versa), or between ``\w`` and the beginning/end of the string.
    366    This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
    367    ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
    368 
   By default Unicode alphanumerics are the ones used, but this can be changed
    370    by using the :const:`ASCII` flag.  Inside a character range, ``\b``
    371    represents the backspace character, for compatibility with Python's string
    372    literals.
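The word-boundary examples above can be verified with the following sketch; the list and variable names are illustrative.

```python
import re

word = re.compile(r"\bfoo\b")
samples = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]

# Only the samples where 'foo' stands alone as a word are kept.
matched = [s for s in samples if word.search(s)]

# Inside a character class, \b means the backspace character instead.
backspace_in_class = re.fullmatch(r"[\b]", "\x08") is not None
```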
    373 
    374 ``\B``
    375    Matches the empty string, but only when it is *not* at the beginning or end
    376    of a word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
    377    ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
    378    ``\B`` is just the opposite of ``\b``, so word characters are
    379    Unicode alphanumerics or the underscore, although this can be changed
    380    by using the :const:`ASCII` flag.
    381 
    382 ``\d``
    383    For Unicode (str) patterns:
    384       Matches any Unicode decimal digit (that is, any character in
    385       Unicode character category [Nd]).  This includes ``[0-9]``, and
    386       also many other digit characters.  If the :const:`ASCII` flag is
    387       used only ``[0-9]`` is matched (but the flag affects the entire
    388       regular expression, so in such cases using an explicit ``[0-9]``
    389       may be a better choice).
    390    For 8-bit (bytes) patterns:
    391       Matches any decimal digit; this is equivalent to ``[0-9]``.
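To illustrate the difference between Unicode and ASCII digit matching, this sketch uses U+0663 (ARABIC-INDIC DIGIT THREE) as one example of a non-ASCII character in category Nd. The choice of character is an assumption made for the example.

```python
import re

arabic_indic_three = "\u0663"

# In a str pattern, \d accepts any Unicode decimal digit by default.
unicode_hit = re.fullmatch(r"\d", arabic_indic_three) is not None

# With the ASCII flag, \d is restricted to [0-9].
ascii_hit = re.fullmatch(r"\d", arabic_indic_three, re.ASCII) is not None

# In a bytes pattern, \d is always equivalent to [0-9].
bytes_hit = re.fullmatch(rb"\d", b"7") is not None
```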
    392 
    393 ``\D``
    394    Matches any character which is not a Unicode decimal digit. This is
    395    the opposite of ``\d``. If the :const:`ASCII` flag is used this
    396    becomes the equivalent of ``[^0-9]`` (but the flag affects the entire
    397    regular expression, so in such cases using an explicit ``[^0-9]`` may
    398    be a better choice).
    399 
    400 ``\s``
    401    For Unicode (str) patterns:
    402       Matches Unicode whitespace characters (which includes
    403       ``[ \t\n\r\f\v]``, and also many other characters, for example the
    404       non-breaking spaces mandated by typography rules in many
    405       languages). If the :const:`ASCII` flag is used, only
    406       ``[ \t\n\r\f\v]`` is matched (but the flag affects the entire
    407       regular expression, so in such cases using an explicit
    408       ``[ \t\n\r\f\v]`` may be a better choice).
    409 
    410    For 8-bit (bytes) patterns:
    411       Matches characters considered whitespace in the ASCII character set;
    412       this is equivalent to ``[ \t\n\r\f\v]``.
    413 
    414 ``\S``
    415    Matches any character which is not a Unicode whitespace character. This is
    416    the opposite of ``\s``. If the :const:`ASCII` flag is used this
    417    becomes the equivalent of ``[^ \t\n\r\f\v]`` (but the flag affects the entire
    418    regular expression, so in such cases using an explicit ``[^ \t\n\r\f\v]`` may
    419    be a better choice).
    420 
    421 ``\w``
    422    For Unicode (str) patterns:
    423       Matches Unicode word characters; this includes most characters
    424       that can be part of a word in any language, as well as numbers and
    425       the underscore. If the :const:`ASCII` flag is used, only
    426       ``[a-zA-Z0-9_]`` is matched (but the flag affects the entire
    427       regular expression, so in such cases using an explicit
    428       ``[a-zA-Z0-9_]`` may be a better choice).
    429    For 8-bit (bytes) patterns:
    430       Matches characters considered alphanumeric in the ASCII character set;
    431       this is equivalent to ``[a-zA-Z0-9_]``.
    432 
    433 ``\W``
    434    Matches any character which is not a Unicode word character. This is
    435    the opposite of ``\w``. If the :const:`ASCII` flag is used this
    436    becomes the equivalent of ``[^a-zA-Z0-9_]`` (but the flag affects the
    437    entire regular expression, so in such cases using an explicit
    438    ``[^a-zA-Z0-9_]`` may be a better choice).
    439 
    440 ``\Z``
    441    Matches only at the end of the string.
    442 
    443 Most of the standard escapes supported by Python string literals are also
    444 accepted by the regular expression parser::
    445 
    446    \a      \b      \f      \n
    447    \r      \t      \u      \U
    448    \v      \x      \\
    449 
    450 (Note that ``\b`` is used to represent word boundaries, and means "backspace"
    451 only inside character classes.)
    452 
    453 ``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
    454 patterns.  In bytes patterns they are not treated specially.
    455 
    456 Octal escapes are included in a limited form.  If the first digit is a 0, or if
    457 there are three octal digits, it is considered an octal escape. Otherwise, it is
    458 a group reference.  As for string literals, octal escapes are always at most
    459 three digits in length.
    460 
    461 .. versionchanged:: 3.3
    462    The ``'\u'`` and ``'\U'`` escape sequences have been added.
    463 
    464 .. versionchanged:: 3.6
    465    Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.
    466 
    467 
    468 .. seealso::
    469 
    470    Mastering Regular Expressions
    471       Book on regular expressions by Jeffrey Friedl, published by O'Reilly.  The
    472       second edition of the book no longer covers Python at all, but the first
    473       edition covered writing good regular expression patterns in great detail.
    474 
    475 
    476 
    477 .. _contents-of-module-re:
    478 
    479 Module Contents
    480 ---------------
    481 
    482 The module defines several functions, constants, and an exception. Some of the
    483 functions are simplified versions of the full featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled
form.
    486 
    487 .. versionchanged:: 3.6
    488    Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
    489    :class:`enum.IntFlag`.
    490 
    491 .. function:: compile(pattern, flags=0)
    492 
    493    Compile a regular expression pattern into a regular expression object, which
    494    can be used for matching using its :func:`~regex.match` and
    495    :func:`~regex.search` methods, described below.
    496 
    497    The expression's behaviour can be modified by specifying a *flags* value.
    498    Values can be any of the following variables, combined using bitwise OR (the
    499    ``|`` operator).
    500 
    501    The sequence ::
    502 
    503       prog = re.compile(pattern)
    504       result = prog.match(string)
    505 
    506    is equivalent to ::
    507 
    508       result = re.match(pattern, string)
    509 
    510    but using :func:`re.compile` and saving the resulting regular expression
    511    object for reuse is more efficient when the expression will be used several
    512    times in a single program.
    513 
    514    .. note::
    515 
    516       The compiled versions of the most recent patterns passed to
    517       :func:`re.compile` and the module-level matching functions are cached, so
    518       programs that use only a few regular expressions at a time needn't worry
    519       about compiling regular expressions.
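A sketch of the compile-and-reuse workflow described above. The patterns, sample lines, and variable names are invented for illustration; the optional start position passed to ``prog.match`` is one of the fine-tuning parameters available only on compiled objects.

```python
import re

pattern = r"(\d+)-(\d+)"
lines = ["10-20", "x 30-40", "50-60 trailing"]

# Compile once and reuse the object for every line.
prog = re.compile(pattern)
compiled_hits = [bool(prog.match(line)) for line in lines]

# The module-level shortcut gives the same answers.
shortcut_hits = [bool(re.match(pattern, line)) for line in lines]

# Flags are combined with bitwise OR.
totals = re.compile(r"^total: (\d+)$", re.IGNORECASE | re.MULTILINE)
found = totals.findall("Total: 3\ntotal: 14\n")

# A compiled object's match() accepts a start position.
skipped_prefix = prog.match("x 30-40", 2).group()
```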
    520 
    521 
    522 .. data:: A
    523           ASCII
    524 
    525    Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
    526    perform ASCII-only matching instead of full Unicode matching.  This is only
    527    meaningful for Unicode patterns, and is ignored for byte patterns.
    528 
    529    Note that for backward compatibility, the :const:`re.U` flag still
    530    exists (as well as its synonym :const:`re.UNICODE` and its embedded
    531    counterpart ``(?u)``), but these are redundant in Python 3 since
    532    matches are Unicode by default for strings (and Unicode matching
    533    isn't allowed for bytes).
    534 
    535 
    536 .. data:: DEBUG
    537 
   Display debug information about the compiled expression.
    539 
    540 
    541 .. data:: I
    542           IGNORECASE
    543 
    544    Perform case-insensitive matching; expressions like ``[A-Z]`` will match
    545    lowercase letters, too.  This is not affected by the current locale
    546    and works for Unicode characters as expected.
    547 
    548 
    549 .. data:: L
    550           LOCALE
    551 
    552    Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
    553    current locale. The use of this flag is discouraged as the locale mechanism
    554    is very unreliable, and it only handles one "culture" at a time anyway;
    555    you should use Unicode matching instead, which is the default in Python 3
    556    for Unicode (str) patterns. This flag can be used only with bytes patterns.
    557 
    558    .. versionchanged:: 3.6
    559       :const:`re.LOCALE` can be used only with bytes patterns and is
    560       not compatible with :const:`re.ASCII`.
    561 
    562 
    563 .. data:: M
    564           MULTILINE
    565 
    566    When specified, the pattern character ``'^'`` matches at the beginning of the
    567    string and at the beginning of each line (immediately following each newline);
    568    and the pattern character ``'$'`` matches at the end of the string and at the
    569    end of each line (immediately preceding each newline).  By default, ``'^'``
    570    matches only at the beginning of the string, and ``'$'`` only at the end of the
    571    string and immediately before the newline (if any) at the end of the string.
    572 
    573 
    574 .. data:: S
    575           DOTALL
    576 
    577    Make the ``'.'`` special character match any character at all, including a
    578    newline; without this flag, ``'.'`` will match anything *except* a newline.
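The effect of the :const:`MULTILINE` and :const:`DOTALL` flags can be seen in a short sketch with made-up sample text:

```python
import re

text = "one\ntwo"

# Without MULTILINE, '^' matches only at the very start of the string.
first_only = re.findall(r"^\w+", text)
every_line = re.findall(r"^\w+", text, re.MULTILINE)

# Without DOTALL, '.' refuses to match the newline.
dot_default = re.fullmatch(r"one.two", text)
dot_all = re.fullmatch(r"one.two", text, re.DOTALL) is not None
```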
    579 
    580 
    581 .. data:: X
    582           VERBOSE
    583 
    584    This flag allows you to write regular expressions that look nicer and are
    585    more readable by allowing you to visually separate logical sections of the
    586    pattern and add comments. Whitespace within the pattern is ignored, except
    587    when in a character class or when preceded by an unescaped backslash.
    588    When a line contains a ``#`` that is not in a character class and is not
    589    preceded by an unescaped backslash, all characters from the leftmost such
    590    ``#`` through the end of the line are ignored.
    591 
    592    This means that the two following regular expression objects that match a
    593    decimal number are functionally equal::
    594 
    595       a = re.compile(r"""\d +  # the integral part
    596                          \.    # the decimal point
    597                          \d *  # some fractional digits""", re.X)
    598       b = re.compile(r"\d+\.\d*")
    599 
    600 
    601 
    602 
    603 .. function:: search(pattern, string, flags=0)
    604 
    605    Scan through *string* looking for the first location where the regular expression
    606    *pattern* produces a match, and return a corresponding :ref:`match object
    607    <match-objects>`.  Return ``None`` if no position in the string matches the
    608    pattern; note that this is different from finding a zero-length match at some
    609    point in the string.
    610 
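The distinction between "no match" and a zero-length match can be sketched as
follows; the sample strings are illustrative only:

```python
import re

# No position in "abc" can match one or more digits, so the result is None.
no_match = re.search(r"\d+", "abc")

# Zero or more digits matches the empty string at index 0, so a match
# object is returned even though no characters were consumed.
empty_match = re.search(r"\d*", "abc")

print(no_match)
print(empty_match.span(), repr(empty_match.group()))
```

Because match objects are always true, ``if empty_match:`` succeeds here even
though the matched text is empty.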
    611 
    612 .. function:: match(pattern, string, flags=0)
    613 
    614    If zero or more characters at the beginning of *string* match the regular
    615    expression *pattern*, return a corresponding :ref:`match object
    616    <match-objects>`.  Return ``None`` if the string does not match the pattern;
    617    note that this is different from a zero-length match.
    618 
    619    Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
    620    at the beginning of the string and not at the beginning of each line.
    621 
    622    If you want to locate a match anywhere in *string*, use :func:`search`
    623    instead (see also :ref:`search-vs-match`).
    624 
    625 
    626 .. function:: fullmatch(pattern, string, flags=0)
    627 
    628    If the whole *string* matches the regular expression *pattern*, return a
    629    corresponding :ref:`match object <match-objects>`.  Return ``None`` if the
    630    string does not match the pattern; note that this is different from a
    631    zero-length match.
    632 
    633    .. versionadded:: 3.4
    634 
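A short sketch contrasting :func:`fullmatch` with :func:`match`, using
illustrative input strings:

```python
import re

# The whole string consists of digits, so fullmatch() succeeds.
whole = re.fullmatch(r"\d+", "2024")
# Trailing text is left over, so fullmatch() returns None ...
partial = re.fullmatch(r"\d+", "2024-01")
# ... whereas match() is satisfied with a match at the beginning.
prefix = re.match(r"\d+", "2024-01")

print(whole.group(), partial, prefix.group())
```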
    635 
    636 .. function:: split(pattern, string, maxsplit=0, flags=0)
    637 
    638    Split *string* by the occurrences of *pattern*.  If capturing parentheses are
    639    used in *pattern*, then the text of all groups in the pattern is also returned
    640    as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
    641    splits occur, and the remainder of the string is returned as the final element
    642    of the list. ::
    643 
    644       >>> re.split(r'\W+', 'Words, words, words.')
    645       ['Words', 'words', 'words', '']
    646       >>> re.split(r'(\W+)', 'Words, words, words.')
    647       ['Words', ', ', 'words', ', ', 'words', '.', '']
    648       >>> re.split(r'\W+', 'Words, words, words.', 1)
    649       ['Words', 'words, words.']
    650       >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
    651       ['0', '3', '9']
    652 
    653    If there are capturing groups in the separator and it matches at the start of
    654    the string, the result will start with an empty string.  The same holds for
    655    the end of the string:
    656 
    657       >>> re.split(r'(\W+)', '...words, words...')
    658       ['', '...', 'words', ', ', 'words', '...', '']
    659 
    660    That way, separator components are always found at the same relative
    661    indices within the result list.
    662 
    663    .. note::
    664 
    665       :func:`split` doesn't currently split a string on an empty pattern match.
    666       For example:
    667 
    668          >>> re.split('x*', 'axbc')
    669          ['a', 'bc']
    670 
    671       Even though ``'x*'`` also matches 0 'x' before 'a', between 'b' and 'c',
    672       and after 'c', currently these matches are ignored.  The correct behavior
    673       (i.e. splitting on empty matches too and returning ``['', 'a', 'b', 'c',
    674       '']``) will be implemented in future versions of Python, but since this
    675       is a backward incompatible change, a :exc:`FutureWarning` will be raised
    676       in the meanwhile.
    677 
    678       Patterns that can only match empty strings currently never split the
    679       string.  Since this doesn't match the expected behavior, a
    680       :exc:`ValueError` will be raised starting from Python 3.5::
    681 
    682          >>> re.split("^$", "foo\n\nbar\n", flags=re.M)
    683          Traceback (most recent call last):
    684            File "<stdin>", line 1, in <module>
    685            ...
    686          ValueError: split() requires a non-empty pattern match.
    687 
    688    .. versionchanged:: 3.1
    689       Added the optional flags argument.
    690 
    691    .. versionchanged:: 3.5
    692       Splitting on a pattern that could match an empty string now raises
    693       a warning.  Patterns that can only match empty strings are now rejected.
    694 
    695 .. function:: findall(pattern, string, flags=0)
    696 
    697    Return all non-overlapping matches of *pattern* in *string*, as a list of
    698    strings.  The *string* is scanned left-to-right, and matches are returned in
    699    the order found.  If one or more groups are present in the pattern, return a
    700    list of groups; this will be a list of tuples if the pattern has more than
    701    one group.  Empty matches are included in the result unless they touch the
    702    beginning of another match.
    703 
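How the shape of the :func:`findall` result depends on the number of groups can
be sketched as follows; the sample ``text`` is an assumption made for
illustration:

```python
import re

text = "width=10, height=20"

# No groups: a list of the full matches.
no_groups = re.findall(r"\w+=\d+", text)
# One group: a list of that group's text.
one_group = re.findall(r"\w+=(\d+)", text)
# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)=(\d+)", text)

print(no_groups)
print(one_group)
print(two_groups)
```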
    704 
    705 .. function:: finditer(pattern, string, flags=0)
    706 
    707    Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
    708    all non-overlapping matches for the RE *pattern* in *string*.  The *string*
    709    is scanned left-to-right, and matches are returned in the order found.  Empty
    710    matches are included in the result unless they touch the beginning of another
    711    match.
    712 
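Since :func:`finditer` yields match objects rather than strings, positions are
available as well as text. A minimal sketch with an illustrative input:

```python
import re

text = "cat bat rat"

# Collect the matched text together with its (start, end) span.
found = [(m.group(), m.span()) for m in re.finditer(r"[a-z]at", text)]

print(found)
```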
    713 
    714 .. function:: sub(pattern, repl, string, count=0, flags=0)
    715 
    716    Return the string obtained by replacing the leftmost non-overlapping occurrences
    717    of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
    718    *string* is returned unchanged.  *repl* can be a string or a function; if it is
    719    a string, any backslash escapes in it are processed.  That is, ``\n`` is
    720    converted to a single newline character, ``\r`` is converted to a carriage return, and
    721    so forth.  Unknown escapes such as ``\&`` are left alone.  Backreferences, such
    722    as ``\6``, are replaced with the substring matched by group 6 in the pattern.
    723    For example:
    724 
    725       >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
    726       ...        r'static PyObject*\npy_\1(void)\n{',
    727       ...        'def myfunc():')
    728       'static PyObject*\npy_myfunc(void)\n{'
    729 
    730    If *repl* is a function, it is called for every non-overlapping occurrence of
    731    *pattern*.  The function takes a single match object argument, and returns the
    732    replacement string.  For example:
    733 
    734       >>> def dashrepl(matchobj):
    735       ...     if matchobj.group(0) == '-': return ' '
    736       ...     else: return '-'
    737       >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
    738       'pro--gram files'
    739       >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
    740       'Baked Beans & Spam'
    741 
    742    The pattern may be a string or an RE object.
    743 
    744    The optional argument *count* is the maximum number of pattern occurrences to be
    745    replaced; *count* must be a non-negative integer.  If omitted or zero, all
    746    occurrences will be replaced. Empty matches for the pattern are replaced only
    747    when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
    748    ``'-a-b-c-'``.
    749 
    750    In string-type *repl* arguments, in addition to the character escapes and
    751    backreferences described above,
    752    ``\g<name>`` will use the substring matched by the group named ``name``, as
    753    defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
    754    group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
    755    in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
    756    reference to group 20, not a reference to group 2 followed by the literal
    757    character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
    758    substring matched by the RE.
    759 
    760    .. versionchanged:: 3.1
    761       Added the optional flags argument.
    762 
    763    .. versionchanged:: 3.5
    764       Unmatched groups are replaced with an empty string.
    765 
    766    .. versionchanged:: 3.6
    767       Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
    768       now are errors.
    769 
    770    .. deprecated-removed:: 3.5 3.7
    771       Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter now raise
    772       a deprecation warning and will be forbidden in Python 3.7.
    773 
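The ``\g<name>`` and ``\g<number>`` replacement forms of :func:`sub` can be
sketched as follows; the date format and group names are illustrative
assumptions:

```python
import re

date = "2024-01-31"
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

# \g<name> inserts the text matched by the named group.
by_name = re.sub(pattern, r"\g<day>/\g<month>/\g<year>", date)

# \g<1>0 means "group 1 followed by a literal 0"; writing \10 instead
# would be read as a reference to group 10.
by_number = re.sub(r"(\d)", r"\g<1>0", "a1b2")

print(by_name)
print(by_number)
```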
    774 
    775 .. function:: subn(pattern, repl, string, count=0, flags=0)
    776 
    777    Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
    778    number_of_subs_made)``.
    779 
    780    .. versionchanged:: 3.1
    781       Added the optional flags argument.
    782 
    783    .. versionchanged:: 3.5
    784       Unmatched groups are replaced with an empty string.
    785 
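A brief sketch of the tuple returned by :func:`subn`, using an illustrative
input string:

```python
import re

# Collapse each run of whitespace into a single space and count the
# replacements that were made.
new_text, count = re.subn(r"\s+", " ", "too   many    spaces")

print(new_text, count)
```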
    786 
    787 .. function:: escape(string)
    788 
    789    Escape all the characters in *string* except ASCII letters, numbers and ``'_'``.
    790    This is useful if you want to match an arbitrary literal string that may
    791    have regular expression metacharacters in it.
    792 
    793    .. versionchanged:: 3.3
    794       The ``'_'`` character is no longer escaped.
    795 
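A minimal sketch of why escaping matters, assuming an illustrative ``literal``
that happens to contain metacharacters:

```python
import re

literal = "1+1=2 (approx.)"  # illustrative text containing metacharacters

# Used directly as a pattern, '+', '(', ')' and '.' keep their special
# meaning, so the text does not even match itself.
unescaped = re.search(literal, literal)

# After escaping, every character is matched literally.
escaped = re.search(re.escape(literal), "we know that 1+1=2 (approx.) holds")

print(unescaped)
print(escaped.group())
```

The exact escaped form is not shown here because the set of characters that
gets a backslash can differ between Python versions.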
    796 
    797 .. function:: purge()
    798 
    799    Clear the regular expression cache.
    800 
    801 
    802 .. exception:: error(msg, pattern=None, pos=None)
    803 
    804    Exception raised when a string passed to one of the functions here is not a
    805    valid regular expression (for example, it might contain unmatched parentheses)
    806    or when some other error occurs during compilation or matching.  It is never an
    807    error if a string contains no match for a pattern.  The error instance has
    808    the following additional attributes:
    809 
    810    .. attribute:: msg
    811 
    812       The unformatted error message.
    813 
    814    .. attribute:: pattern
    815 
    816       The regular expression pattern.
    817 
    818    .. attribute:: pos
    819 
    820       The index of *pattern* where compilation failed.
    821 
    822    .. attribute:: lineno
    823 
    824       The line corresponding to *pos*.
    825 
    826    .. attribute:: colno
    827 
    828       The column corresponding to *pos*.
    829 
    830    .. versionchanged:: 3.5
    831       Added additional attributes.
    832 
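The attributes of the exception can be inspected as in the following sketch,
which assumes a deliberately invalid pattern (``bad_pattern`` is a hypothetical
name) and Python 3.5 or later:

```python
import re

bad_pattern = r"(unclosed"  # hypothetical invalid pattern: missing ')'

try:
    re.compile(bad_pattern)
except re.error as exc:
    # Copy the attributes out, because 'exc' is unbound after the block.
    details = {
        "msg": exc.msg,
        "pattern": exc.pattern,
        "pos": exc.pos,
        "lineno": exc.lineno,
        "colno": exc.colno,
    }
else:
    details = None

print(details)
```
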
    833 .. _re-objects:
    834 
    835 Regular Expression Objects
    836 --------------------------
    837 
    838 Compiled regular expression objects support the following methods and
    839 attributes:
    840 
    841 .. method:: regex.search(string[, pos[, endpos]])
    842 
    843    Scan through *string* looking for the first location where this regular
    844    expression produces a match, and return a corresponding :ref:`match object
    845    <match-objects>`.  Return ``None`` if no position in the string matches the
    846    pattern; note that this is different from finding a zero-length match at some
    847    point in the string.
    848 
    849    The optional second parameter *pos* gives an index in the string where the
    850    search is to start; it defaults to ``0``.  This is not completely equivalent to
    851    slicing the string; the ``'^'`` pattern character matches at the real beginning
    852    of the string and at positions just after a newline, but not necessarily at the
    853    index where the search is to start.
    854 
    855    The optional parameter *endpos* limits how far the string will be searched; it
    856    will be as if the string is *endpos* characters long, so only the characters
    857    from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
    858    than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
    859    expression object, ``rx.search(string, 0, 50)`` is equivalent to
    860    ``rx.search(string[:50], 0)``.
    861 
    862    >>> pattern = re.compile("d")
    863    >>> pattern.search("dog")     # Match at index 0
    864    <_sre.SRE_Match object; span=(0, 1), match='d'>
    865    >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
    866 
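The difference between passing *pos* and slicing, and the effect of *endpos*,
can be sketched as follows; the pattern names are illustrative:

```python
import re

starts_with_b = re.compile(r"^b")
text = "abc"

# Slicing creates a new string whose real beginning is "b", so '^' matches.
sliced = starts_with_b.search(text[1:])
# With pos=1 the search starts at "b", but '^' still refers to index 0.
offset = starts_with_b.search(text, 1)

ends_with_c = re.compile(r"c$")
# endpos=3 makes the string behave as if it were "abc", so '$' matches there.
limited = ends_with_c.search("abcabc", 0, 3)

print(sliced, offset, limited)
```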
    867 
    868 .. method:: regex.match(string[, pos[, endpos]])
    869 
    870    If zero or more characters at the *beginning* of *string* match this regular
    871    expression, return a corresponding :ref:`match object <match-objects>`.
    872    Return ``None`` if the string does not match the pattern; note that this is
    873    different from a zero-length match.
    874 
    875    The optional *pos* and *endpos* parameters have the same meaning as for the
    876    :meth:`~regex.search` method.
    877 
    878    >>> pattern = re.compile("o")
    879    >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
    880    >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
    881    <_sre.SRE_Match object; span=(1, 2), match='o'>
    882 
    883    If you want to locate a match anywhere in *string*, use
    884    :meth:`~regex.search` instead (see also :ref:`search-vs-match`).
    885 
    886 
    887 .. method:: regex.fullmatch(string[, pos[, endpos]])
    888 
    889    If the whole *string* matches this regular expression, return a corresponding
    890    :ref:`match object <match-objects>`.  Return ``None`` if the string does not
    891    match the pattern; note that this is different from a zero-length match.
    892 
    893    The optional *pos* and *endpos* parameters have the same meaning as for the
    894    :meth:`~regex.search` method.
    895 
    896    >>> pattern = re.compile("o[gh]")
    897    >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
    898    >>> pattern.fullmatch("ogre")     # No match as the full string does not match.
    899    >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
    900    <_sre.SRE_Match object; span=(1, 3), match='og'>
    901 
    902    .. versionadded:: 3.4
    903 
    904 
    905 .. method:: regex.split(string, maxsplit=0)
    906 
    907    Identical to the :func:`split` function, using the compiled pattern.
    908 
    909 
    910 .. method:: regex.findall(string[, pos[, endpos]])
    911 
    912    Similar to the :func:`findall` function, using the compiled pattern, but
    913    also accepts optional *pos* and *endpos* parameters that limit the search
    914    region in the same way as for :meth:`~regex.search`.
    915 
    916 
    917 .. method:: regex.finditer(string[, pos[, endpos]])
    918 
    919    Similar to the :func:`finditer` function, using the compiled pattern, but
    920    also accepts optional *pos* and *endpos* parameters that limit the search
    921    region in the same way as for :meth:`~regex.search`.
    922 
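A short sketch of how *pos* and *endpos* narrow the region scanned by
:meth:`~regex.findall` (the same applies to :meth:`~regex.finditer`); the input
string is illustrative:

```python
import re

digits = re.compile(r"\d")
text = "a1b2c3d4"

everything = digits.findall(text)
# Start scanning at index 2, skipping "a1".
from_index_2 = digits.findall(text, 2)
# Scan only indices 2 through 5, that is "b2c3".
window = digits.findall(text, 2, 6)

print(everything, from_index_2, window)
```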
    923 
    924 .. method:: regex.sub(repl, string, count=0)
    925 
    926    Identical to the :func:`sub` function, using the compiled pattern.
    927 
    928 
    929 .. method:: regex.subn(repl, string, count=0)
    930 
    931    Identical to the :func:`subn` function, using the compiled pattern.
    932 
    933 
    934 .. attribute:: regex.flags
    935 
    936    The regex matching flags.  This is a combination of the flags given to
    937    :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
    938    flags such as :data:`UNICODE` if the pattern is a Unicode string.
    939 
    940 
    941 .. attribute:: regex.groups
    942 
    943    The number of capturing groups in the pattern.
    944 
    945 
    946 .. attribute:: regex.groupindex
    947 
    948    A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
    949    numbers.  The dictionary is empty if no symbolic groups were used in the
    950    pattern.
    951 
    952 
    953 .. attribute:: regex.pattern
    954 
    955    The pattern string from which the RE object was compiled.
    956 
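The attributes above can be inspected as in this sketch, which assumes an
illustrative key/value pattern:

```python
import re

rx = re.compile(r"(?P<key>\w+)=(?P<value>\w+)(;)?", re.IGNORECASE)

group_count = rx.groups              # all capturing groups, named or not
name_to_index = dict(rx.groupindex)  # only the named groups
ignores_case = bool(rx.flags & re.IGNORECASE)  # flag passed to compile()
is_unicode = bool(rx.flags & re.UNICODE)       # implicit for str patterns
source = rx.pattern                  # the original pattern string

print(group_count, name_to_index, ignores_case, is_unicode)
```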
    957 
    958 .. _match-objects:
    959 
    960 Match Objects
    961 -------------
    962 
    963 Match objects always have a boolean value of ``True``.
    964 Since :meth:`~regex.match` and :meth:`~regex.search` return ``None``
    965 when there is no match, you can test whether there was a match with a simple
    966 ``if`` statement::
    967 
    968    match = re.search(pattern, string)
    969    if match:
    970        process(match)
    971 
    972 Match objects support the following methods and attributes:
    973 
    974 
    975 .. method:: match.expand(template)
    976 
    977    Return the string obtained by doing backslash substitution on the template
    978    string *template*, as done by the :meth:`~regex.sub` method.
    979    Escapes such as ``\n`` are converted to the appropriate characters,
    980    and numeric backreferences (``\1``, ``\2``) and named backreferences
    981    (``\g<1>``, ``\g<name>``) are replaced by the contents of the
    982    corresponding group.
    983 
    984    .. versionchanged:: 3.5
    985       Unmatched groups are replaced with an empty string.
    986 
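A minimal sketch of :meth:`~match.expand`, mixing a named and a numeric
backreference in one template; the group names are illustrative:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# \g<last> refers to the named group, \1 to the first group.
reordered = m.expand(r"\g<last>, \1")

print(reordered)
```
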
    987 .. method:: match.group([group1, ...])
    988 
    989    Returns one or more subgroups of the match.  If there is a single argument, the
    990    result is a single string; if there are multiple arguments, the result is a
    991    tuple with one item per argument. Without arguments, *group1* defaults to zero
    992    (the whole match is returned). If a *groupN* argument is zero, the corresponding
    993    return value is the entire matching string; if it is in the inclusive range
    994    [1..99], it is the string matching the corresponding parenthesized group.  If a
    995    group number is negative or larger than the number of groups defined in the
    996    pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
    997    part of the pattern that did not match, the corresponding result is ``None``.
    998    If a group is contained in a part of the pattern that matched multiple times,
    999    the last match is returned.
   1000 
   1001       >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
   1002       >>> m.group(0)       # The entire match
   1003       'Isaac Newton'
   1004       >>> m.group(1)       # The first parenthesized subgroup.
   1005       'Isaac'
   1006       >>> m.group(2)       # The second parenthesized subgroup.
   1007       'Newton'
   1008       >>> m.group(1, 2)    # Multiple arguments give us a tuple.
   1009       ('Isaac', 'Newton')
   1010 
   1011    If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   1012    arguments may also be strings identifying groups by their group name.  If a
   1013    string argument is not used as a group name in the pattern, an :exc:`IndexError`
   1014    exception is raised.
   1015 
   1016    A moderately complicated example:
   1017 
   1018       >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
   1019       >>> m.group('first_name')
   1020       'Malcolm'
   1021       >>> m.group('last_name')
   1022       'Reynolds'
   1023 
   1024    Named groups can also be referred to by their index:
   1025 
   1026       >>> m.group(1)
   1027       'Malcolm'
   1028       >>> m.group(2)
   1029       'Reynolds'
   1030 
   1031    If a group matches multiple times, only the last match is accessible:
   1032 
   1033       >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
   1034       >>> m.group(1)                        # Returns only the last match.
   1035       'c3'
   1036 
   1037 
   1038 .. method:: match.__getitem__(g)
   1039 
   1040    This is identical to ``m.group(g)``.  This allows easier access to
   1041    an individual group from a match:
   1042 
   1043       >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
   1044       >>> m[0]       # The entire match
   1045       'Isaac Newton'
   1046       >>> m[1]       # The first parenthesized subgroup.
   1047       'Isaac'
   1048       >>> m[2]       # The second parenthesized subgroup.
   1049       'Newton'
   1050 
   1051    .. versionadded:: 3.6
   1052 
   1053 
   1054 .. method:: match.groups(default=None)
   1055 
   1056    Return a tuple containing all the subgroups of the match, from 1 up to however
   1057    many groups are in the pattern.  The *default* argument is used for groups that
   1058    did not participate in the match; it defaults to ``None``.
   1059 
   1060    For example:
   1061 
   1062       >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
   1063       >>> m.groups()
   1064       ('24', '1632')
   1065 
   1066    If we make the decimal place and everything after it optional, not all groups
   1067    might participate in the match.  These groups will default to ``None`` unless
   1068    the *default* argument is given:
   1069 
   1070       >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
   1071       >>> m.groups()      # Second group defaults to None.
   1072       ('24', None)
   1073       >>> m.groups('0')   # Now, the second group defaults to '0'.
   1074       ('24', '0')
   1075 
   1076 
   1077 .. method:: match.groupdict(default=None)
   1078 
   1079    Return a dictionary containing all the *named* subgroups of the match, keyed by
   1080    the subgroup name.  The *default* argument is used for groups that did not
   1081    participate in the match; it defaults to ``None``.  For example:
   1082 
   1083       >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
   1084       >>> m.groupdict()
   1085       {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
   1086 
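The effect of the *default* argument can be sketched with an optional named
group; the group names ``whole`` and ``frac`` are assumptions made for
illustration:

```python
import re

m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

# The optional 'frac' group did not participate in the match.
without_default = m.groupdict()
# Supplying a default replaces None for such groups.
with_default = m.groupdict("0")

print(without_default)
print(with_default)
```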
   1087 
   1088 .. method:: match.start([group])
   1089             match.end([group])
   1090 
   1091    Return the indices of the start and end of the substring matched by *group*;
   1092    *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   1093    *group* exists but did not contribute to the match.  For a match object *m*, and
   1094    a group *g* that did contribute to the match, the substring matched by group *g*
   1095    (equivalent to ``m.group(g)``) is ::
   1096 
   1097       m.string[m.start(g):m.end(g)]
   1098 
   1099    Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   1100    null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
   1101    ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   1102    2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
   1103 
   1104    An example that will remove *remove_this* from email addresses:
   1105 
   1106       >>> email = "tony@tiremove_thisger.net"
   1107       >>> m = re.search("remove_this", email)
   1108       >>> email[:m.start()] + email[m.end():]
   1109       'tony@tiger.net'
   1110 
   1111 
   1112 .. method:: match.span([group])
   1113 
   1114    For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   1115    that if *group* did not contribute to the match, this is ``(-1, -1)``.
   1116    *group* defaults to zero, the entire match.
   1117 
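A small sketch of the ``(-1, -1)`` case, using an alternation in which only one
branch can take part in the match:

```python
import re

m = re.match(r"(a)|(b)", "a")

whole = m.span()          # the entire match
first = m.span(1)         # group 1 matched "a"
second = m.span(2)        # group 2 exists but did not contribute
second_start = m.start(2)

print(whole, first, second, second_start)
```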
   1118 
   1119 .. attribute:: match.pos
   1120 
   1121    The value of *pos* which was passed to the :meth:`~regex.search` or
   1122    :meth:`~regex.match` method of a :ref:`regex object <re-objects>`.  This is
   1123    the index into the string at which the RE engine started looking for a match.
   1124 
   1125 
   1126 .. attribute:: match.endpos
   1127 
   1128    The value of *endpos* which was passed to the :meth:`~regex.search` or
   1129    :meth:`~regex.match` method of a :ref:`regex object <re-objects>`.  This is
   1130    the index into the string beyond which the RE engine will not go.
   1131 
   1132 
   1133 .. attribute:: match.lastindex
   1134 
   1135    The integer index of the last matched capturing group, or ``None`` if no group
   1136    was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   1137    ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   1138    the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   1139    string.
   1140 
   1141 
   1142 .. attribute:: match.lastgroup
   1143 
   1144    The name of the last matched capturing group, or ``None`` if the group didn't
   1145    have a name, or if no group was matched at all.
   1146 
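The behaviour of :attr:`~match.lastindex` and :attr:`~match.lastgroup` can be
sketched as follows; the patterns and group names are illustrative:

```python
import re

# Two sibling groups: the last one to match is group 2.
siblings = re.match(r"(a)(b)", "ab")
# Nested groups: the outer group 1 closes last, so lastindex is 1.
nested = re.match(r"((a)(b))", "ab")
# Named alternatives: lastgroup reports the name of the branch that matched.
named = re.match(r"(?P<number>\d+)|(?P<word>[a-z]+)", "hello")
# An unnamed group has an index but no name.
unnamed = re.match(r"([a-z]+)", "hello")

print(siblings.lastindex, nested.lastindex)
print(named.lastgroup, unnamed.lastgroup)
```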
   1147 
   1148 .. attribute:: match.re
   1149 
   1150    The regular expression object whose :meth:`~regex.match` or
   1151    :meth:`~regex.search` method produced this match instance.
   1152 
   1153 
   1154 .. attribute:: match.string
   1155 
   1156    The string passed to :meth:`~regex.match` or :meth:`~regex.search`.
   1157 
   1158 
   1159 .. _re-examples:
   1160 
   1161 Regular Expression Examples
   1162 ---------------------------
   1163 
   1164 
   1165 Checking for a Pair
   1166 ^^^^^^^^^^^^^^^^^^^
   1167 
   1168 In this example, we'll use the following helper function to display match
   1169 objects a little more gracefully:
   1170 
   1171 .. testcode::
   1172 
   1173    def displaymatch(match):
   1174        if match is None:
   1175            return None
   1176        return '<Match: %r, groups=%r>' % (match.group(), match.groups())
   1177 
   1178 Suppose you are writing a poker program where a player's hand is represented as
   1179 a 5-character string with each character representing a card, "a" for ace, "k"
   1180 for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
   1181 representing the card with that value.
   1182 
   1183 To see if a given string is a valid hand, one could do the following:
   1184 
   1185    >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   1186    >>> displaymatch(valid.match("akt5q"))  # Valid.
   1187    "<Match: 'akt5q', groups=()>"
   1188    >>> displaymatch(valid.match("akt5e"))  # Invalid.
   1189    >>> displaymatch(valid.match("akt"))    # Invalid.
   1190    >>> displaymatch(valid.match("727ak"))  # Valid.
   1191    "<Match: '727ak', groups=()>"
   1192 
   1193 That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
   1194 To match this with a regular expression, one could use backreferences as such:
   1195 
   1196    >>> pair = re.compile(r".*(.).*\1")
   1197    >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   1198    "<Match: '717', groups=('7',)>"
   1199    >>> displaymatch(pair.match("718ak"))     # No pairs.
   1200    >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   1201    "<Match: '354aa', groups=('a',)>"
   1202 
   1203 To find out what card the pair consists of, one could use the
   1204 :meth:`~match.group` method of the match object in the following manner:
   1205 
   1206 .. doctest::
   1207 
   1208    >>> pair.match("717ak").group(1)
   1209    '7'
   1210 
   1211    # Error because pair.match() returns None, which doesn't have a group() method:
   1212    >>> pair.match("718ak").group(1)
   1213    Traceback (most recent call last):
   1214      File "<pyshell#23>", line 1, in <module>
   1215        pair.match("718ak").group(1)
   1216    AttributeError: 'NoneType' object has no attribute 'group'
   1217 
   1218    >>> pair.match("354aa").group(1)
   1219    'a'
   1220 
   1221 
   1222 Simulating scanf()
   1223 ^^^^^^^^^^^^^^^^^^
   1224 
   1225 .. index:: single: scanf()
   1226 
   1227 Python does not currently have an equivalent to :c:func:`scanf`.  Regular
   1228 expressions are generally more powerful, though also more verbose, than
   1229 :c:func:`scanf` format strings.  The table below offers some more-or-less
   1230 equivalent mappings between :c:func:`scanf` format tokens and regular
   1231 expressions.
   1232 
   1233 +--------------------------------+---------------------------------------------+
   1234 | :c:func:`scanf` Token          | Regular Expression                          |
   1235 +================================+=============================================+
   1236 | ``%c``                         | ``.``                                       |
   1237 +--------------------------------+---------------------------------------------+
   1238 | ``%5c``                        | ``.{5}``                                    |
   1239 +--------------------------------+---------------------------------------------+
   1240 | ``%d``                         | ``[-+]?\d+``                                |
   1241 +--------------------------------+---------------------------------------------+
   1242 | ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
   1243 +--------------------------------+---------------------------------------------+
   1244 | ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
   1245 +--------------------------------+---------------------------------------------+
   1246 | ``%o``                         | ``[-+]?[0-7]+``                             |
   1247 +--------------------------------+---------------------------------------------+
   1248 | ``%s``                         | ``\S+``                                     |
   1249 +--------------------------------+---------------------------------------------+
   1250 | ``%u``                         | ``\d+``                                     |
   1251 +--------------------------------+---------------------------------------------+
   1252 | ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
   1253 +--------------------------------+---------------------------------------------+
   1254 
   1255 To extract the filename and numbers from a string like ::
   1256 
   1257    /usr/sbin/sendmail - 0 errors, 4 warnings
   1258 
   1259 you would use a :c:func:`scanf` format like ::
   1260 
   1261    %s - %d errors, %d warnings
   1262 
   1263 The equivalent regular expression would be ::
   1264 
   1265    (\S+) - (\d+) errors, (\d+) warnings
   1266 
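Applied to the sample line, the regular expression above could be used as in
this sketch; the variable names are illustrative, and the numeric fields are
converted with :func:`int` because groups are always returned as strings:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"

m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

filename = m.group(1)
error_count = int(m.group(2))
warning_count = int(m.group(3))

print(filename, error_count, warning_count)
```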
   1267 
   1268 .. _search-vs-match:
   1269 
   1270 search() vs. match()
   1271 ^^^^^^^^^^^^^^^^^^^^
   1272 
   1273 .. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
   1274 
   1275 Python offers two different primitive operations based on regular expressions:
   1276 :func:`re.match` checks for a match only at the beginning of the string, while
   1277 :func:`re.search` checks for a match anywhere in the string (this is what Perl
   1278 does by default).
   1279 
   1280 For example::
   1281 
   1282    >>> re.match("c", "abcdef")    # No match
   1283    >>> re.search("c", "abcdef")   # Match
   1284    <_sre.SRE_Match object; span=(2, 3), match='c'>
   1285 
   1286 Regular expressions beginning with ``'^'`` can be used with :func:`search` to
   1287 restrict the match to the beginning of the string::
   1288 
   1289    >>> re.match("c", "abcdef")    # No match
   1290    >>> re.search("^c", "abcdef")  # No match
   1291    >>> re.search("^a", "abcdef")  # Match
   1292    <_sre.SRE_Match object; span=(0, 1), match='a'>
   1293 
   1294 Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
   1295 beginning of the string, whereas using :func:`search` with a regular expression
   1296 beginning with ``'^'`` will match at the beginning of each line.
   1297 
   1298    >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   1299    >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   1300    <_sre.SRE_Match object; span=(4, 5), match='X'>
   1301 
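The following script is a small sketch of the same distinction, adding the
case where ``'^'`` is used *without* :const:`MULTILINE`.  The variable names
are illustrative assumptions only.

```python
import re

text = "A\nB\nX"

# match() is anchored at the start of the string, even in MULTILINE mode.
at_start = re.match("X", text, re.MULTILINE)

# search() with '^' can succeed at the start of any line in MULTILINE mode.
at_line_start = re.search("^X", text, re.MULTILINE)

# Without MULTILINE, '^' only matches at the very start of the string.
no_flag = re.search("^X", text)

print(at_start, at_line_start.span(), no_flag)
```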

Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list of substrings at each occurrence of
the passed pattern.  The function is invaluable for converting textual data into
data structures that can be easily read and modified by Python, as demonstrated
in the following example that creates a phonebook.

First, here is the input.  Normally it would come from a file; here we use
triple-quoted string syntax:

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address.  We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

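One possible way to turn the split entries into a lookup structure is sketched
below.  The dictionary layout and the ``phonebook`` name are assumptions made
for illustration; the original example stops at the nested lists.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    # Key each record by full name (an illustrative choice).
    phonebook[f"{first} {last}"] = {"phone": phone, "address": address}

print(phonebook["Frank Burger"]["phone"])
```

Unpacking into four names doubles as a light sanity check: an entry that does
not split into exactly four fields raises :exc:`ValueError`.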

Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.  This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   ...
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

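Because the shuffle is random, the output above differs from run to run.  A
reproducible variant might look like the following sketch; the ``munge``
wrapper and its ``seed`` parameter are hypothetical additions, not part of the
:mod:`re` API.

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of each word (seed is a hypothetical knob)."""
    rng = random.Random(seed)  # private generator; global state is untouched

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text, seed=42)
print(munged)
```

Calling ``munge`` twice with the same seed yields the same string, which can be
convenient in tests.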

Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does.  For example, a writer who wanted to find all of
the adverbs in some text might use :func:`findall` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings.  Continuing with the previous example, a
writer who wanted to find all of the adverbs *and their positions* in some
text would use :func:`finditer` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly

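If the positions are needed later rather than printed immediately, they can be
collected from the match objects.  This is only a sketch; the ``positions``
list is an illustrative name.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Pair each matched word with its (start, end) span in the text.
positions = [(m.group(0), m.span()) for m in re.finditer(r"\w+ly", text)]

for word, (start, end) in positions:
    # Slicing the text with the span recovers the matched word.
    assert text[start:end] == word
    print(f"{start:02d}-{end:02d}: {word}")
```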

Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it.  For example, the following two lines of code are
functionally identical:

   >>> re.match(r"\W(.)\1\W", " ff ")
   <_sre.SRE_Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <_sre.SRE_Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression.  With raw string notation, this means ``r"\\"``.  Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical:

   >>> re.match(r"\\", r"\\")
   <_sre.SRE_Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <_sre.SRE_Match object; span=(0, 1), match='\\'>

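The equivalence can also be checked directly on the string objects, as in this
small sketch (variable names are illustrative):

```python
import re

raw = r"\\"        # two characters: backslash, backslash
escaped = "\\\\"   # the same two characters without raw notation

same_pattern = raw == escaped
pattern_length = len(raw)

# The subject r"\\" also holds two backslashes; the pattern, which means
# "one literal backslash" to the regex engine, consumes only the first.
m = re.match(raw, r"\\")
print(same_pattern, pattern_length, m.span())
```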

Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters.  This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions.  The technique is
to combine those into a single master regular expression and to loop over
successive matches::

    import collections
    import re

    Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',  r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',  r':='),           # Assignment operator
            ('END',     r';'),            # Statement terminator
            ('ID',      r'[A-Za-z]+'),    # Identifiers
            ('OP',      r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE', r'\n'),           # Line endings
            ('SKIP',    r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH',r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group(kind)
            if kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
            elif kind == 'SKIP':
                pass
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            else:
                if kind == 'ID' and value in keywords:
                    kind = value
                column = mo.start() - line_start
                yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)

The tokenizer produces the following output::

    Token(typ='IF', value='IF', line=2, column=4)
    Token(typ='ID', value='quantity', line=2, column=7)
    Token(typ='THEN', value='THEN', line=2, column=16)
    Token(typ='ID', value='total', line=3, column=8)
    Token(typ='ASSIGN', value=':=', line=3, column=14)
    Token(typ='ID', value='total', line=3, column=17)
    Token(typ='OP', value='+', line=3, column=23)
    Token(typ='ID', value='price', line=3, column=25)
    Token(typ='OP', value='*', line=3, column=31)
    Token(typ='ID', value='quantity', line=3, column=33)
    Token(typ='END', value=';', line=3, column=41)
    Token(typ='ID', value='tax', line=4, column=8)
    Token(typ='ASSIGN', value=':=', line=4, column=12)
    Token(typ='ID', value='price', line=4, column=15)
    Token(typ='OP', value='*', line=4, column=21)
    Token(typ='NUMBER', value='0.05', line=4, column=23)
    Token(typ='END', value=';', line=4, column=27)
    Token(typ='ENDIF', value='ENDIF', line=5, column=4)
    Token(typ='END', value=';', line=5, column=9)
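
The core of the technique is that every alternative is a named group, so
``lastgroup`` on each match reports which category matched.  The reduced
specification below is a hypothetical sketch for illustration (the ``spec``,
``master``, and ``kinds`` names are assumptions); it also shows the
``MISMATCH`` catch-all raising an error.

```python
import re

# A reduced, hypothetical specification.  Order matters: at each position
# the first alternative that matches wins, so MISMATCH must come last.
spec = [
    ("NUMBER", r"\d+"),
    ("ID", r"[A-Za-z]+"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
master = "|".join(f"(?P<{name}>{pattern})" for name, pattern in spec)

def kinds(code):
    """Return (kind, text) pairs, raising on unexpected characters."""
    result = []
    for mo in re.finditer(master, code):
        kind = mo.lastgroup  # name of the alternative that matched
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise RuntimeError(f"{mo.group()!r} unexpected")
        result.append((kind, mo.group()))
    return result

print(kinds("x1 42"))
```

Because ``ID`` here accepts letters only, the input ``"x1"`` is split into an
identifier followed by a number; a real grammar would choose its patterns to
suit the language being scanned.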
   1512