      1 :mod:`re` --- Regular expression operations
      2 ===========================================
      3 
      4 .. module:: re
      5    :synopsis: Regular expression operations.
      6 
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
      9 
     10 **Source code:** :source:`Lib/re.py`
     11 
     12 --------------
     13 
     14 This module provides regular expression matching operations similar to
     15 those found in Perl.
     16 
     17 Both patterns and strings to be searched can be Unicode strings (:class:`str`)
     18 as well as 8-bit strings (:class:`bytes`).
     19 However, Unicode strings and 8-bit strings cannot be mixed:
     20 that is, you cannot match a Unicode string with a byte pattern or
     21 vice-versa; similarly, when asking for a substitution, the replacement
     22 string must be of the same type as both the pattern and the search string.
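
As a minimal sketch of the type rule above (the sample strings are made up for illustration), matching works when pattern and subject share a type, and mixing them raises :exc:`TypeError`:

```python
import re

# str pattern against str data, bytes pattern against bytes data: both fine.
str_match = re.search("spam", "spam and eggs")
bytes_match = re.search(b"spam", b"spam and eggs")

# Mixing a bytes pattern with a str subject is rejected.
try:
    re.search(b"spam", "spam and eggs")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```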
     23 
     24 Regular expressions use the backslash character (``'\'``) to indicate
     25 special forms or to allow special characters to be used without invoking
     26 their special meaning.  This collides with Python's usage of the same
     27 character for the same purpose in string literals; for example, to match
     28 a literal backslash, one might have to write ``'\\\\'`` as the pattern
     29 string, because the regular expression must be ``\\``, and each
     30 backslash must be expressed as ``\\`` inside a regular Python string
     31 literal.
     32 
     33 The solution is to use Python's raw string notation for regular expression
     34 patterns; backslashes are not handled in any special way in a string literal
     35 prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
     36 ``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
     37 newline.  Usually patterns will be expressed in Python code using this raw
     38 string notation.
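
The difference between the two notations can be checked directly. This is a small illustrative sketch; the searched path string is an assumption chosen for the example:

```python
import re

raw = r"\n"      # two characters: a backslash and 'n'
cooked = "\n"    # one character: a newline

# Both spellings denote the same two-character pattern, which matches
# one literal backslash in the searched text.
same_pattern = "\\\\" == r"\\"
found = re.search(r"\\", r"C:\temp")
```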
     39 
Most regular expression operations are available both as module-level
functions and as methods on
:ref:`compiled regular expressions <re-objects>`.  The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.
     45 
     46 .. seealso::
     47 
     48    The third-party `regex <https://pypi.org/project/regex/>`_ module,
     49    which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.
     51 
     52 
     53 .. _re-syntax:
     54 
     55 Regular Expression Syntax
     56 -------------------------
     57 
     58 A regular expression (or RE) specifies a set of strings that matches it; the
     59 functions in this module let you check if a particular string matches a given
     60 regular expression (or if a given regular expression matches a particular
     61 string, which comes down to the same thing).
     62 
     63 Regular expressions can be concatenated to form new regular expressions; if *A*
     64 and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contain low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references.  Thus, complex expressions can easily be constructed from simpler
     69 primitive expressions like the ones described here.  For details of the theory
     70 and implementation of regular expressions, consult the Friedl book [Frie09]_,
     71 or almost any textbook about compiler construction.
     72 
     73 A brief explanation of the format of regular expressions follows.  For further
     74 information and a gentler presentation, consult the :ref:`regex-howto`.
     75 
     76 Regular expressions can contain both special and ordinary characters. Most
     77 ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
     78 expressions; they simply match themselves.  You can concatenate ordinary
     79 characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
     81 strings to be matched ``'in single quotes'``.)
     82 
     83 Some characters, like ``'|'`` or ``'('``, are special. Special
     84 characters either stand for classes of ordinary characters, or affect
     85 how the regular expressions around them are interpreted.
     86 
Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
     88 directly nested. This avoids ambiguity with the non-greedy modifier suffix
     89 ``?``, and with other modifiers in other implementations. To apply a second
     90 repetition to an inner repetition, parentheses may be used. For example,
     91 the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
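
A short sketch of the rule above, assuming made-up input strings: the parenthesized form compiles and matches multiples of six, while stacking one qualifier directly on another is rejected at compile time.

```python
import re

six_as = re.compile(r"(?:a{6})*")
twelve = six_as.fullmatch("a" * 12)   # a multiple of six: matches
seven = six_as.fullmatch("a" * 7)     # not a multiple of six: None

# Nesting the qualifiers directly raises re.error when compiling.
try:
    re.compile(r"a{6}*")
    nested_ok = True
except re.error:
    nested_ok = False
```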
     92 
     93 
     94 The special characters are:
     95 
     96 .. index:: single: . (dot); in regular expressions
     97 
     98 ``.``
     99    (Dot.)  In the default mode, this matches any character except a newline.  If
    100    the :const:`DOTALL` flag has been specified, this matches any character
    101    including a newline.
    102 
    103 .. index:: single: ^ (caret); in regular expressions
    104 
    105 ``^``
    106    (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
    107    matches immediately after each newline.
    108 
    109 .. index:: single: $ (dollar); in regular expressions
    110 
    111 ``$``
    112    Matches the end of the string or just before the newline at the end of the
    113    string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
    114    matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
    115    only 'foo'.  More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
    116    matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
    117    a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
    118    the newline, and one at the end of the string.
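
The ``$`` cases described above can be reproduced as follows; this is only an illustrative sketch using the same sample strings:

```python
import re

text = "foo1\nfoo2\n"

# Without MULTILINE, $ only matches at (or just before a newline at) the end.
default_hit = re.search(r"foo.$", text).group()

# With MULTILINE, $ also matches before each newline.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A lone $ finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]
```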
    119 
    120 .. index:: single: * (asterisk); in regular expressions
    121 
    122 ``*``
    123    Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
    124    many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
    125    by any number of 'b's.
    126 
    127 .. index:: single: + (plus); in regular expressions
    128 
    129 ``+``
    130    Causes the resulting RE to match 1 or more repetitions of the preceding RE.
    131    ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
    132    match just 'a'.
    133 
    134 .. index:: single: ? (question mark); in regular expressions
    135 
    136 ``?``
    137    Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
    138    ``ab?`` will match either 'a' or 'ab'.
    139 
    140 .. index::
    141    single: *?; in regular expressions
    142    single: +?; in regular expressions
    143    single: ??; in regular expressions
    144 
    145 ``*?``, ``+?``, ``??``
    146    The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
    147    as much text as possible.  Sometimes this behaviour isn't desired; if the RE
    148    ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
    149    string, and not just ``'<a>'``.  Adding ``?`` after the qualifier makes it
    150    perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
    151    characters as possible will be matched.  Using the RE ``<.*?>`` will match
    152    only ``'<a>'``.
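
A minimal sketch comparing the greedy and non-greedy forms on the sample string used above:

```python
import re

text = "<a> b <c>"

greedy = re.search(r"<.*>", text).group()    # consumes as much as possible
minimal = re.search(r"<.*?>", text).group()  # stops at the first '>'
```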
    153 
    154 .. index::
    155    single: {} (curly brackets); in regular expressions
    156 
    157 ``{m}``
    158    Specifies that exactly *m* copies of the previous RE should be matched; fewer
    159    matches cause the entire RE not to match.  For example, ``a{6}`` will match
    160    exactly six ``'a'`` characters, but not five.
    161 
    162 ``{m,n}``
    163    Causes the resulting RE to match from *m* to *n* repetitions of the preceding
    164    RE, attempting to match as many repetitions as possible.  For example,
    165    ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
    166    lower bound of zero,  and omitting *n* specifies an infinite upper bound.  As an
    167    example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
    168    followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
    169    modifier would be confused with the previously described form.
    170 
    171 ``{m,n}?``
    172    Causes the resulting RE to match from *m* to *n* repetitions of the preceding
    173    RE, attempting to match as *few* repetitions as possible.  This is the
    174    non-greedy version of the previous qualifier.  For example, on the
    175    6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
    176    while ``a{3,5}?`` will only match 3 characters.
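
The bounded qualifiers can be sketched like this, using the same sample strings as the descriptions above:

```python
import re

# Greedy and non-greedy bounded repetition on a six-character string.
greedy = re.match(r"a{3,5}", "aaaaaa").group()
minimal = re.match(r"a{3,5}?", "aaaaaa").group()

# An omitted upper bound means "no limit"; the lower bound still applies.
too_few = re.fullmatch(r"a{4,}b", "aaab")
just_enough = re.fullmatch(r"a{4,}b", "aaaab")
many = re.fullmatch(r"a{4,}b", "a" * 1000 + "b")
```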
    177 
    178 .. index:: single: \ (backslash); in regular expressions
    179 
    180 ``\``
    181    Either escapes special characters (permitting you to match characters like
    182    ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
    183    sequences are discussed below.
    184 
    185    If you're not using a raw string to express the pattern, remember that Python
    186    also uses the backslash as an escape sequence in string literals; if the escape
    187    sequence isn't recognized by Python's parser, the backslash and subsequent
    188    character are included in the resulting string.  However, if Python would
    189    recognize the resulting sequence, the backslash should be repeated twice.  This
    190    is complicated and hard to understand, so it's highly recommended that you use
    191    raw strings for all but the simplest expressions.
    192 
    193 .. index::
    194    single: [] (square brackets); in regular expressions
    195 
    196 ``[]``
    197    Used to indicate a set of characters.  In a set:
    198 
    199    * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
    200      ``'m'``, or ``'k'``.
    201 
    202    .. index:: single: - (minus); in regular expressions
    203 
    204    * Ranges of characters can be indicated by giving two characters and separating
    205      them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
    207      ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
    208      ``[a\-z]``) or if it's placed as the first or last character
    209      (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.
    210 
    211    * Special characters lose their special meaning inside sets.  For example,
    212      ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
    213      ``'*'``, or ``')'``.
    214 
    215    .. index:: single: \ (backslash); in regular expressions
    216 
    217    * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
    219      :const:`ASCII` or :const:`LOCALE` mode is in force.
    220 
    221    .. index:: single: ^ (caret); in regular expressions
    222 
    223    * Characters that are not within a range can be matched by :dfn:`complementing`
    224      the set.  If the first character of the set is ``'^'``, all the characters
    225      that are *not* in the set will be matched.  For example, ``[^5]`` will match
    226      any character except ``'5'``, and ``[^^]`` will match any character except
    227      ``'^'``.  ``^`` has no special meaning if it's not the first character in
    228      the set.
    229 
    230    * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set.  For example, ``[()[\]{}]`` and
     ``[]()[{}]`` will both match a parenthesis.
    233 
    234    .. .. index:: single: --; in regular expressions
    235    .. .. index:: single: &&; in regular expressions
    236    .. .. index:: single: ~~; in regular expressions
    237    .. .. index:: single: ||; in regular expressions
    238 
    239    * Support of nested sets and set operations as in `Unicode Technical
    240      Standard #18`_ might be added in the future.  This would change the
    241      syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
    242      in ambiguous cases for the time being.
    243      That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``.  To
     avoid a warning, escape them with a backslash.
    246 
    247    .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/
    248 
    249    .. versionchanged:: 3.7
    250       :exc:`FutureWarning` is raised if a character set contains constructs
    251       that will change semantically in the future.
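
A few of the set rules above, sketched with made-up input strings:

```python
import re

# A range per position: two-digit numbers from 00 to 59.
minutes = re.findall(r"[0-5][0-9]", "07 59 60 99")

# Special characters are literal inside a set.
operators = re.findall(r"[(+*)]", "f(x) = a*b + c")

# A leading '^' complements the set.
not_five = re.findall(r"[^5]", "1525")

# A '-' placed first (or last) is a literal hyphen.
with_hyphen = re.findall(r"[-a]", "a-b")
```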
    252 
    253 .. index:: single: | (vertical bar); in regular expressions
    254 
    255 ``|``
    256    ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
    257    will match either *A* or *B*.  An arbitrary number of REs can be separated by the
    258    ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
    259    the target string is scanned, REs separated by ``'|'`` are tried from left to
    260    right. When one pattern completely matches, that branch is accepted. This means
    261    that once *A* matches, *B* will not be tested further, even if it would
    262    produce a longer overall match.  In other words, the ``'|'`` operator is never
    263    greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
    264    character class, as in ``[|]``.
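
The left-to-right behaviour of ``'|'`` can be seen in a small sketch (the inputs are illustrative):

```python
import re

# The first alternative that matches wins, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Reordering the alternatives changes the result.
reordered = re.match(r"ab|a", "ab").group()

# A literal '|' inside a character class.
bars = re.findall(r"[|]", "a|b")
```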
    265 
    266 .. index::
    267    single: () (parentheses); in regular expressions
    268 
    269 ``(...)``
    270    Matches whatever regular expression is inside the parentheses, and indicates the
    271    start and end of a group; the contents of a group can be retrieved after a match
    272    has been performed, and can be matched later in the string with the ``\number``
    273    special sequence, described below.  To match the literals ``'('`` or ``')'``,
    274    use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.
    275 
    276 .. index:: single: (?; in regular expressions
    277 
    278 ``(?...)``
    279    This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
    280    otherwise).  The first character after the ``'?'`` determines what the meaning
    281    and further syntax of the construct is. Extensions usually do not create a new
    282    group; ``(?P<name>...)`` is the only exception to this rule. Following are the
    283    currently supported extensions.
    284 
    285 ``(?aiLmsux)``
    286    (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
    287    ``'s'``, ``'u'``, ``'x'``.)  The group matches the empty string; the
    288    letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
    289    :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
    290    :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
    291    :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
    292    for the entire regular expression.
    293    (The flags are described in :ref:`contents-of-module-re`.)
    294    This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flags* argument to the
    296    :func:`re.compile` function.  Flags should be used first in the
    297    expression string.
    298 
    299 .. index:: single: (?:; in regular expressions
    300 
    301 ``(?:...)``
    302    A non-capturing version of regular parentheses.  Matches whatever regular
    303    expression is inside the parentheses, but the substring matched by the group
    304    *cannot* be retrieved after performing a match or referenced later in the
    305    pattern.
    306 
    307 ``(?aiLmsux-imsx:...)``
    308    (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
    309    ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
    310    one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
    311    The letters set or remove the corresponding flags:
    312    :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
    313    :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
    314    :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
    315    and :const:`re.X` (verbose), for the part of the expression.
    316    (The flags are described in :ref:`contents-of-module-re`.)
    317 
    318    The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
    319    as inline flags, so they can't be combined or follow ``'-'``.  Instead,
    320    when one of them appears in an inline group, it overrides the matching mode
    321    in the enclosing group.  In Unicode patterns ``(?a:...)`` switches to
    322    ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default).  In bytes patterns ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
    325    This override is only in effect for the narrow inline group, and the
    326    original matching mode is restored outside of the group.
    327 
    328    .. versionadded:: 3.6
    329 
    330    .. versionchanged:: 3.7
    331       The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
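
As a hedged sketch of the inline flag forms (the sample strings are assumptions; the scoped form needs Python 3.6 or later, as noted above):

```python
import re

# (?i) at the start applies to the whole expression.
whole = re.match(r"(?i)hello", "HELLO")

# (?i:...) applies only inside the group.
scoped_hit = re.match(r"(?i:foo)bar", "FOObar")
scoped_miss = re.match(r"(?i:foo)bar", "FOOBAR")

# (?-i:...) switches a flag off for part of the expression.
off_hit = re.match(r"(?i)a(?-i:b)c", "AbC")
off_miss = re.match(r"(?i)a(?-i:b)c", "ABC")
```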
    332 
    333 .. index:: single: (?P<; in regular expressions
    334 
    335 ``(?P<name>...)``
    336    Similar to regular parentheses, but the substring matched by the group is
    337    accessible via the symbolic group name *name*.  Group names must be valid
    338    Python identifiers, and each group name must be defined only once within a
    339    regular expression.  A symbolic group is also a numbered group, just as if
    340    the group were not named.
    341 
    342    Named groups can be referenced in three contexts.  If the pattern is
    343    ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
    344    single or double quotes):
    345 
    346    +---------------------------------------+----------------------------------+
    347    | Context of reference to group "quote" | Ways to reference it             |
    348    +=======================================+==================================+
    349    | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
    350    |                                       | * ``\1``                         |
    351    +---------------------------------------+----------------------------------+
    352    | when processing match object *m*      | * ``m.group('quote')``           |
    353    |                                       | * ``m.end('quote')`` (etc.)      |
    354    +---------------------------------------+----------------------------------+
    355    | in a string passed to the *repl*      | * ``\g<quote>``                  |
    356    | argument of ``re.sub()``              | * ``\g<1>``                      |
    357    |                                       | * ``\1``                         |
    358    +---------------------------------------+----------------------------------+
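
The three contexts in the table can be sketched as follows. The sample text and the replacement string are made up for illustration:

```python
import re

pattern = r"""(?P<quote>['"]).*?(?P=quote)"""
text = "He said \"hello\" and 'bye'."

# Inside the pattern, (?P=quote) requires the same closing quote.
m = re.search(pattern, text)
whole = m.group(0)

# On the match object, the group is available by name.
quote_char = m.group("quote")
quote_end = m.end("quote")

# In a replacement string, \g<quote> refers to the same group.
masked = re.sub(pattern, r"\g<quote>...\g<quote>", text)
```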
    359 
    360 .. index:: single: (?P=; in regular expressions
    361 
    362 ``(?P=name)``
    363    A backreference to a named group; it matches whatever text was matched by the
    364    earlier group named *name*.
    365 
    366 .. index:: single: (?#; in regular expressions
    367 
    368 ``(?#...)``
    369    A comment; the contents of the parentheses are simply ignored.
    370 
    371 .. index:: single: (?=; in regular expressions
    372 
    373 ``(?=...)``
    374    Matches if ``...`` matches next, but doesn't consume any of the string.  This is
    375    called a :dfn:`lookahead assertion`.  For example, ``Isaac (?=Asimov)`` will match
    376    ``'Isaac '`` only if it's followed by ``'Asimov'``.
    377 
    378 .. index:: single: (?!; in regular expressions
    379 
    380 ``(?!...)``
    381    Matches if ``...`` doesn't match next.  This is a :dfn:`negative lookahead assertion`.
    382    For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
    383    followed by ``'Asimov'``.
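
Both lookahead forms, sketched with the example names used above (``'Isaac Newton'`` is an extra sample input added for contrast):

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not consumed.
pos = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
pos_text = pos.group()
pos_end = pos.end()

# Negative lookahead: 'Asimov' must not follow.
neg_miss = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
neg_hit = re.search(r"Isaac (?!Asimov)", "Isaac Newton")
```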
    384 
    385 .. index:: single: (?<=; in regular expressions
    386 
    387 ``(?<=...)``
    388    Matches if the current position in the string is preceded by a match for ``...``
    389    that ends at the current position.  This is called a :dfn:`positive lookbehind
    390    assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
    391    lookbehind will back up 3 characters and check if the contained pattern matches.
    392    The contained pattern must only match strings of some fixed length, meaning that
    393    ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Note that
    394    patterns which start with positive lookbehind assertions will not match at the
    395    beginning of the string being searched; you will most likely want to use the
    396    :func:`search` function rather than the :func:`match` function:
    397 
    398       >>> import re
    399       >>> m = re.search('(?<=abc)def', 'abcdef')
    400       >>> m.group(0)
    401       'def'
    402 
    403    This example looks for a word following a hyphen:
    404 
    405       >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
    406       >>> m.group(0)
    407       'egg'
    408 
    409    .. versionchanged:: 3.5
    410       Added support for group references of fixed length.
    411 
    412 .. index:: single: (?<!; in regular expressions
    413 
    414 ``(?<!...)``
    415    Matches if the current position in the string is not preceded by a match for
    416    ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
    417    positive lookbehind assertions, the contained pattern must only match strings of
    418    some fixed length.  Patterns which start with negative lookbehind assertions may
    419    match at the beginning of the string being searched.
    420 
    421 ``(?(id/name)yes-pattern|no-pattern)``
    422    Will try to match with ``yes-pattern`` if the group with given *id* or
    423    *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
    424    optional and can be omitted. For example,
    425    ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.
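
The conditional pattern above can be exercised with :func:`re.match`. This is a sketch rather than a recommended way to validate addresses:

```python
import re

email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

# Group 1 matched '<', so a closing '>' is required.
bracketed = email.match("<user@host.com>")

# Group 1 did not match, so the address must end the string.
bare = email.match("user@host.com")

# Unbalanced brackets fail in both directions.
missing_close = email.match("<user@host.com")
stray_close = email.match("user@host.com>")
```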
    428 
    429 
    430 The special sequences consist of ``'\'`` and a character from the list below.
    431 If the ordinary character is not an ASCII digit or an ASCII letter, then the
    432 resulting RE will match the second character.  For example, ``\$`` matches the
    433 character ``'$'``.
    434 
    435 .. index:: single: \ (backslash); in regular expressions
    436 
    437 ``\number``
    438    Matches the contents of the group of the same number.  Groups are numbered
    439    starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
    440    but not ``'thethe'`` (note the space after the group).  This special sequence
    441    can only be used to match one of the first 99 groups.  If the first digit of
    442    *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
    443    a group match, but as the character with octal value *number*. Inside the
    444    ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
    445    characters.
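
A brief sketch of numbered backreferences and the octal interpretation, using the sample strings from the description:

```python
import re

# \1 must repeat exactly what group 1 matched, after a space.
words = re.search(r"(.+) \1", "the the").group()
digits = re.search(r"(.+) \1", "55 55").group()
no_space = re.search(r"(.+) \1", "thethe")

# Three octal digits are read as a character code, not a group number.
octal_a = re.match(r"\101", "A")
```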
    446 
    447 .. index:: single: \A; in regular expressions
    448 
    449 ``\A``
    450    Matches only at the start of the string.
    451 
    452 .. index:: single: \b; in regular expressions
    453 
    454 ``\b``
    455    Matches the empty string, but only at the beginning or end of a word.
    456    A word is defined as a sequence of word characters.  Note that formally,
    457    ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
    458    (or vice versa), or between ``\w`` and the beginning/end of the string.
    459    This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
    460    ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
    461 
    462    By default Unicode alphanumerics are the ones used in Unicode patterns, but
    463    this can be changed by using the :const:`ASCII` flag.  Word boundaries are
    464    determined by the current locale if the :const:`LOCALE` flag is used.
    465    Inside a character range, ``\b`` represents the backspace character, for
    466    compatibility with Python's string literals.
    467 
    468 .. index:: single: \B; in regular expressions
    469 
    470 ``\B``
    471    Matches the empty string, but only when it is *not* at the beginning or end
    472    of a word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
    473    ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
    474    ``\B`` is just the opposite of ``\b``, so word characters in Unicode
    475    patterns are Unicode alphanumerics or the underscore, although this can
    476    be changed by using the :const:`ASCII` flag.  Word boundaries are
    477    determined by the current locale if the :const:`LOCALE` flag is used.
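
The ``\b`` and ``\B`` examples above can be checked with a short sketch:

```python
import re

word = re.compile(r"\bfoo\b")
boundary_hits = [bool(word.search(s))
                 for s in ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]]

inner = re.compile(r"py\B")
inner_hits = [bool(inner.search(s))
              for s in ["python", "py3", "py2", "py", "py.", "py!"]]

# Inside a character class, \b means the backspace character.
backspace = re.search(r"[\b]", "a\bc")
```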
    478 
    479 .. index:: single: \d; in regular expressions
    480 
    481 ``\d``
    482    For Unicode (str) patterns:
    483       Matches any Unicode decimal digit (that is, any character in
    484       Unicode character category [Nd]).  This includes ``[0-9]``, and
    485       also many other digit characters.  If the :const:`ASCII` flag is
    486       used only ``[0-9]`` is matched.
    487 
    488    For 8-bit (bytes) patterns:
    489       Matches any decimal digit; this is equivalent to ``[0-9]``.
    490 
    491 .. index:: single: \D; in regular expressions
    492 
    493 ``\D``
    494    Matches any character which is not a decimal digit. This is
    495    the opposite of ``\d``. If the :const:`ASCII` flag is used this
    496    becomes the equivalent of ``[^0-9]``.
    497 
    498 .. index:: single: \s; in regular expressions
    499 
    500 ``\s``
    501    For Unicode (str) patterns:
    502       Matches Unicode whitespace characters (which includes
    503       ``[ \t\n\r\f\v]``, and also many other characters, for example the
    504       non-breaking spaces mandated by typography rules in many
    505       languages). If the :const:`ASCII` flag is used, only
    506       ``[ \t\n\r\f\v]`` is matched.
    507 
    508    For 8-bit (bytes) patterns:
    509       Matches characters considered whitespace in the ASCII character set;
    510       this is equivalent to ``[ \t\n\r\f\v]``.
    511 
    512 .. index:: single: \S; in regular expressions
    513 
    514 ``\S``
    515    Matches any character which is not a whitespace character. This is
    516    the opposite of ``\s``. If the :const:`ASCII` flag is used this
    517    becomes the equivalent of ``[^ \t\n\r\f\v]``.
    518 
    519 .. index:: single: \w; in regular expressions
    520 
    521 ``\w``
    522    For Unicode (str) patterns:
    523       Matches Unicode word characters; this includes most characters
    524       that can be part of a word in any language, as well as numbers and
    525       the underscore. If the :const:`ASCII` flag is used, only
    526       ``[a-zA-Z0-9_]`` is matched.
    527 
    528    For 8-bit (bytes) patterns:
    529       Matches characters considered alphanumeric in the ASCII character set;
    530       this is equivalent to ``[a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
    531       used, matches characters considered alphanumeric in the current locale
    532       and the underscore.
    533 
    534 .. index:: single: \W; in regular expressions
    535 
    536 ``\W``
    537    Matches any character which is not a word character. This is
    538    the opposite of ``\w``. If the :const:`ASCII` flag is used this
    539    becomes the equivalent of ``[^a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
   used, matches characters which are neither alphanumeric in the current locale
   nor the underscore.
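
The effect of the :const:`ASCII` flag on ``\d``, ``\s`` and ``\w`` can be sketched as follows. The sample strings are assumptions: they contain an Arabic-Indic digit (U+0662), a no-break space (U+00A0) and two accented letters.

```python
import re

# \d: any Unicode decimal digit by default, only [0-9] with ASCII.
uni_digits = re.findall(r"\d", "1\u06623")
ascii_digits = re.findall(r"\d", "1\u06623", re.ASCII)

# \s: the no-break space counts as whitespace only in Unicode mode.
uni_space = re.findall(r"\s", "a\u00a0b")
ascii_space = re.findall(r"\s", "a\u00a0b", re.ASCII)

# \w: accented letters are word characters only in Unicode mode.
uni_words = re.findall(r"\w+", "na\xefve caf\xe9")
ascii_words = re.findall(r"\w+", "na\xefve caf\xe9", re.ASCII)

# Bytes patterns always use ASCII rules for \d.
byte_digits = re.findall(rb"\d", b"a1b2")
```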
    542 
    543 .. index:: single: \Z; in regular expressions
    544 
    545 ``\Z``
    546    Matches only at the end of the string.
    547 
    548 .. index::
    549    single: \a; in regular expressions
    550    single: \b; in regular expressions
    551    single: \f; in regular expressions
    552    single: \n; in regular expressions
    553    single: \N; in regular expressions
    554    single: \r; in regular expressions
    555    single: \t; in regular expressions
    556    single: \u; in regular expressions
    557    single: \U; in regular expressions
    558    single: \v; in regular expressions
    559    single: \x; in regular expressions
    560    single: \\; in regular expressions
    561 
    562 Most of the standard escapes supported by Python string literals are also
    563 accepted by the regular expression parser::
    564 
    565    \a      \b      \f      \n
    566    \r      \t      \u      \U
    567    \v      \x      \\
    568 
    569 (Note that ``\b`` is used to represent word boundaries, and means "backspace"
    570 only inside character classes.)
    571 
    572 ``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
    573 patterns.  In bytes patterns they are errors.  Unknown escapes of ASCII
    574 letters are reserved for future use and treated as errors.
    575 
    576 Octal escapes are included in a limited form.  If the first digit is a 0, or if
    577 there are three octal digits, it is considered an octal escape. Otherwise, it is
    578 a group reference.  As for string literals, octal escapes are always at most
    579 three digits in length.
    580 
    581 .. versionchanged:: 3.3
    582    The ``'\u'`` and ``'\U'`` escape sequences have been added.
    583 
    584 .. versionchanged:: 3.6
    585    Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.
    586 
    587 
    588 
    589 .. _contents-of-module-re:
    590 
    591 Module Contents
    592 ---------------
    593 
    594 The module defines several functions, constants, and an exception. Some of the
    595 functions are simplified versions of the full featured methods for compiled
    596 regular expressions.  Most non-trivial applications always use the compiled
    597 form.
    598 
    599 .. versionchanged:: 3.6
    600    Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
    601    :class:`enum.IntFlag`.
    602 
    603 .. function:: compile(pattern, flags=0)
    604 
    605    Compile a regular expression pattern into a :ref:`regular expression object
    606    <re-objects>`, which can be used for matching using its
    607    :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
    608    below.
    609 
    610    The expression's behaviour can be modified by specifying a *flags* value.
    611    Values can be any of the following variables, combined using bitwise OR (the
    612    ``|`` operator).
    613 
    614    The sequence ::
    615 
    616       prog = re.compile(pattern)
    617       result = prog.match(string)
    618 
    619    is equivalent to ::
    620 
    621       result = re.match(pattern, string)
    622 
    623    but using :func:`re.compile` and saving the resulting regular expression
    624    object for reuse is more efficient when the expression will be used several
    625    times in a single program.
    626 
    627    .. note::
    628 
    629       The compiled versions of the most recent patterns passed to
    630       :func:`re.compile` and the module-level matching functions are cached, so
    631       programs that use only a few regular expressions at a time needn't worry
    632       about compiling regular expressions.
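
   As a minimal sketch of combining flags and reusing a compiled object (the
   pattern and sample text are illustrative assumptions)::

      import re

      # Combine IGNORECASE and MULTILINE with the bitwise OR operator.
      prog = re.compile(r'^spam$', re.IGNORECASE | re.MULTILINE)

      # The same compiled object serves several calls.
      hits = prog.findall('Spam\neggs\nSPAM')    # ['Spam', 'SPAM']
      first = prog.search('eggs\nspam')          # match with span (5, 9)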
    633 
    634 
    635 .. data:: A
    636           ASCII
    637 
    638    Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
    639    perform ASCII-only matching instead of full Unicode matching.  This is only
    640    meaningful for Unicode patterns, and is ignored for byte patterns.
    641    Corresponds to the inline flag ``(?a)``.
    642 
    643    Note that for backward compatibility, the :const:`re.U` flag still
    644    exists (as well as its synonym :const:`re.UNICODE` and its embedded
    645    counterpart ``(?u)``), but these are redundant in Python 3 since
    646    matches are Unicode by default for strings (and Unicode matching
    647    isn't allowed for bytes).
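
   A short sketch of the difference, using sample strings chosen here for
   illustration::

      import re

      word = 'caf\u00e9'    # 'café'; U+00E9 is a non-ASCII letter

      unicode_words = re.findall(r'\w+', word)            # ['café'] (default)
      ascii_words = re.findall(r'\w+', word, re.ASCII)    # ['caf']
      inline_words = re.findall(r'(?a)\w+', word)         # ['caf'], inline flag

      # \d also accepts non-ASCII decimal digits unless ASCII is set.
      arabic_three = '\u0663'    # ARABIC-INDIC DIGIT THREE
      as_unicode = re.fullmatch(r'\d', arabic_three)            # a match object
      as_ascii = re.fullmatch(r'\d', arabic_three, re.ASCII)    # None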
    648 
    649 
    650 .. data:: DEBUG
    651 
   Display debug information about the compiled expression.
    653    No corresponding inline flag.
    654 
    655 
    656 .. data:: I
    657           IGNORECASE
    658 
   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
   match lowercase letters.  Full Unicode matching (such as ``Ü`` matching
   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
   non-ASCII matches.  The current locale does not change the effect of this
   flag unless the :const:`re.LOCALE` flag is also used.
   Corresponds to the inline flag ``(?i)``.
    665 
   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
   and 'A' to 'Z' are matched.
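
   The behaviour described above can be sketched as follows; the characters
   are written as escapes only to keep the example ASCII-safe::

      import re

      # Non-ASCII case folding works by default: U+00DC matches U+00FC.
      umlaut = re.fullmatch('\u00dc', '\u00fc', re.IGNORECASE)    # a match object

      # The Kelvin sign is one of the four extra letters matched by [a-z].
      kelvin = '\u212a'
      extra = re.fullmatch('[a-z]', kelvin, re.IGNORECASE)        # a match object

      # Adding ASCII restricts the class to the 52 ASCII letters.
      strict = re.fullmatch('[a-z]', kelvin, re.IGNORECASE | re.ASCII)    # None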
    673 
    674 .. data:: L
    675           LOCALE
    676 
   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
   dependent on the current locale.  This flag can be used only with bytes
   patterns.  The use of this flag is discouraged because the locale mechanism
   is very unreliable: it handles only one "culture" at a time and works only
   with 8-bit locales.  Unicode matching is already enabled by default
   in Python 3 for Unicode (str) patterns, and it is able to handle different
   locales/languages.
   Corresponds to the inline flag ``(?L)``.
    685 
    686    .. versionchanged:: 3.6
    687       :const:`re.LOCALE` can be used only with bytes patterns and is
    688       not compatible with :const:`re.ASCII`.
    689 
    690    .. versionchanged:: 3.7
    691       Compiled regular expression objects with the :const:`re.LOCALE` flag no
    692       longer depend on the locale at compile time.  Only the locale at
    693       matching time affects the result of matching.
    694 
    695 
    696 .. data:: M
    697           MULTILINE
    698 
    699    When specified, the pattern character ``'^'`` matches at the beginning of the
    700    string and at the beginning of each line (immediately following each newline);
    701    and the pattern character ``'$'`` matches at the end of the string and at the
    702    end of each line (immediately preceding each newline).  By default, ``'^'``
    703    matches only at the beginning of the string, and ``'$'`` only at the end of the
    704    string and immediately before the newline (if any) at the end of the string.
    705    Corresponds to the inline flag ``(?m)``.
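
   A minimal sketch of the effect on ``'^'`` and ``'$'``, using a sample
   string assumed here for illustration::

      import re

      text = 'one\ntwo\nthree\n'

      starts = re.findall(r'^\w+', text)                     # ['one']
      line_starts = re.findall(r'^\w+', text, re.MULTILINE)  # ['one', 'two', 'three']

      # By default '$' still matches just before the final newline.
      ends = re.findall(r'\w+$', text)                       # ['three']
      line_ends = re.findall(r'\w+$', text, re.MULTILINE)    # ['one', 'two', 'three']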
    706 
    707 
    708 .. data:: S
    709           DOTALL
    710 
    711    Make the ``'.'`` special character match any character at all, including a
    712    newline; without this flag, ``'.'`` will match anything *except* a newline.
    713    Corresponds to the inline flag ``(?s)``.
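
   For instance, with a two-line sample string (an illustrative assumption)::

      import re

      text = 'first\nsecond'

      one_line = re.match(r'.+', text).group()                # 'first'
      everything = re.match(r'.+', text, re.DOTALL).group()   # 'first\nsecond'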
    714 
    715 
    716 .. data:: X
    717           VERBOSE
    718 
    719    .. index:: single: # (hash); in regular expressions
    720 
    721    This flag allows you to write regular expressions that look nicer and are
    722    more readable by allowing you to visually separate logical sections of the
    723    pattern and add comments. Whitespace within the pattern is ignored, except
    724    when in a character class, or when preceded by an unescaped backslash,
    725    or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
    726    When a line contains a ``#`` that is not in a character class and is not
    727    preceded by an unescaped backslash, all characters from the leftmost such
    728    ``#`` through the end of the line are ignored.
    729 
    730    This means that the two following regular expression objects that match a
    731    decimal number are functionally equal::
    732 
    733       a = re.compile(r"""\d +  # the integral part
    734                          \.    # the decimal point
    735                          \d *  # some fractional digits""", re.X)
    736       b = re.compile(r"\d+\.\d*")
    737 
    738    Corresponds to the inline flag ``(?x)``.
    739 
    740 
    741 .. function:: search(pattern, string, flags=0)
    742 
    743    Scan through *string* looking for the first location where the regular expression
    744    *pattern* produces a match, and return a corresponding :ref:`match object
    745    <match-objects>`.  Return ``None`` if no position in the string matches the
    746    pattern; note that this is different from finding a zero-length match at some
    747    point in the string.
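
   The distinction between ``None`` and a zero-length match can be sketched
   as follows (the patterns are illustrative)::

      import re

      found = re.search(r'b+', 'aabbcc')     # first location: span (2, 4)
      no_match = re.search(r'x+', 'abc')     # None: 'x+' needs at least one 'x'
      empty = re.search(r'x*', 'abc')        # a match object for '' at index 0

      # A zero-length match is still a match object, and therefore true.
      if empty:
          print(empty.span())                # prints (0, 0)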
    748 
    749 
    750 .. function:: match(pattern, string, flags=0)
    751 
    752    If zero or more characters at the beginning of *string* match the regular
    753    expression *pattern*, return a corresponding :ref:`match object
    754    <match-objects>`.  Return ``None`` if the string does not match the pattern;
    755    note that this is different from a zero-length match.
    756 
    757    Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
    758    at the beginning of the string and not at the beginning of each line.
    759 
    760    If you want to locate a match anywhere in *string*, use :func:`search`
    761    instead (see also :ref:`search-vs-match`).
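
   A brief sketch contrasting the two functions on a sample string chosen
   for illustration::

      import re

      text = 'one\ntwo'

      at_start = re.match(r'one', text)                     # matches at index 0
      second_line = re.match(r'^two', text, re.MULTILINE)   # None, even with MULTILINE
      searched = re.search(r'^two', text, re.MULTILINE)     # match with span (4, 7)
      zero_length = re.match(r'x*', text)                   # match with span (0, 0)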
    762 
    763 
    764 .. function:: fullmatch(pattern, string, flags=0)
    765 
    766    If the whole *string* matches the regular expression *pattern*, return a
    767    corresponding :ref:`match object <match-objects>`.  Return ``None`` if the
    768    string does not match the pattern; note that this is different from a
    769    zero-length match.
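
   As an illustrative sketch of how this differs from :func:`match`::

      import re

      prefix = re.match(r'\d+', '123abc')        # matches '123'
      partial = re.fullmatch(r'\d+', '123abc')   # None: 'abc' is left over
      whole = re.fullmatch(r'\d+', '123')        # matches '123'
      empty = re.fullmatch(r'\d*', '')           # zero-length match, not None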
    770 
    771    .. versionadded:: 3.4
    772 
    773 
    774 .. function:: split(pattern, string, maxsplit=0, flags=0)
    775 
    776    Split *string* by the occurrences of *pattern*.  If capturing parentheses are
    777    used in *pattern*, then the text of all groups in the pattern are also returned
    778    as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
    779    splits occur, and the remainder of the string is returned as the final element
    780    of the list. ::
    781 
    782       >>> re.split(r'\W+', 'Words, words, words.')
    783       ['Words', 'words', 'words', '']
    784       >>> re.split(r'(\W+)', 'Words, words, words.')
    785       ['Words', ', ', 'words', ', ', 'words', '.', '']
    786       >>> re.split(r'\W+', 'Words, words, words.', 1)
    787       ['Words', 'words, words.']
    788       >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
    789       ['0', '3', '9']
    790 
    791    If there are capturing groups in the separator and it matches at the start of
    792    the string, the result will start with an empty string.  The same holds for
    793    the end of the string::
    794 
    795       >>> re.split(r'(\W+)', '...words, words...')
    796       ['', '...', 'words', ', ', 'words', '...', '']
    797 
    798    That way, separator components are always found at the same relative
    799    indices within the result list.
    800 
    801    Empty matches for the pattern split the string only when not adjacent
    802    to a previous empty match.
    803 
    804       >>> re.split(r'\b', 'Words, words, words.')
    805       ['', 'Words', ', ', 'words', ', ', 'words', '.']
    806       >>> re.split(r'\W*', '...words...')
    807       ['', '', 'w', 'o', 'r', 'd', 's', '', '']
    808       >>> re.split(r'(\W*)', '...words...')
    809       ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
    810 
    811    .. versionchanged:: 3.1
    812       Added the optional flags argument.
    813 
    814    .. versionchanged:: 3.7
    815       Added support of splitting on a pattern that could match an empty string.
    816 
    817 
    818 .. function:: findall(pattern, string, flags=0)
    819 
    820    Return all non-overlapping matches of *pattern* in *string*, as a list of
    821    strings.  The *string* is scanned left-to-right, and matches are returned in
    822    the order found.  If one or more groups are present in the pattern, return a
    823    list of groups; this will be a list of tuples if the pattern has more than
    824    one group.  Empty matches are included in the result.
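
   The shape of the result for zero, one and two groups can be sketched like
   this (the sample text is an assumption made for illustration)::

      import re

      text = 'width=80, height=24'

      numbers = re.findall(r'\d+', text)          # ['80', '24']
      names = re.findall(r'(\w+)=\d+', text)      # ['width', 'height']
      pairs = re.findall(r'(\w+)=(\d+)', text)    # [('width', '80'), ('height', '24')]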
    825 
    826    .. versionchanged:: 3.7
    827       Non-empty matches can now start just after a previous empty match.
    828 
    829 
    830 .. function:: finditer(pattern, string, flags=0)
    831 
    832    Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
    833    all non-overlapping matches for the RE *pattern* in *string*.  The *string*
    834    is scanned left-to-right, and matches are returned in the order found.  Empty
    835    matches are included in the result.
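
   Because each item is a match object, positions are available as well as
   text.  A minimal sketch with an illustrative sample string::

      import re

      text = 'a1 b22 c333'

      found = [(m.span(), m.group()) for m in re.finditer(r'\d+', text)]
      # found == [((1, 2), '1'), ((4, 6), '22'), ((8, 11), '333')]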
    836 
    837    .. versionchanged:: 3.7
    838       Non-empty matches can now start just after a previous empty match.
    839 
    840 
    841 .. function:: sub(pattern, repl, string, count=0, flags=0)
    842 
    843    Return the string obtained by replacing the leftmost non-overlapping occurrences
    844    of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
    845    *string* is returned unchanged.  *repl* can be a string or a function; if it is
    846    a string, any backslash escapes in it are processed.  That is, ``\n`` is
    847    converted to a single newline character, ``\r`` is converted to a carriage return, and
    848    so forth.  Unknown escapes of ASCII letters are reserved for future use and
    849    treated as errors.  Other unknown escapes such as ``\&`` are left alone.
    850    Backreferences, such
    851    as ``\6``, are replaced with the substring matched by group 6 in the pattern.
    852    For example::
    853 
    854       >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
    855       ...        r'static PyObject*\npy_\1(void)\n{',
    856       ...        'def myfunc():')
    857       'static PyObject*\npy_myfunc(void)\n{'
    858 
    859    If *repl* is a function, it is called for every non-overlapping occurrence of
    860    *pattern*.  The function takes a single :ref:`match object <match-objects>`
    861    argument, and returns the replacement string.  For example::
    862 
    863       >>> def dashrepl(matchobj):
    864       ...     if matchobj.group(0) == '-': return ' '
    865       ...     else: return '-'
    866       >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
    867       'pro--gram files'
    868       >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
    869       'Baked Beans & Spam'
    870 
    871    The pattern may be a string or a :ref:`pattern object <re-objects>`.
    872 
    873    The optional argument *count* is the maximum number of pattern occurrences to be
    874    replaced; *count* must be a non-negative integer.  If omitted or zero, all
    875    occurrences will be replaced. Empty matches for the pattern are replaced only
    876    when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
    877    ``'-a-b--d-'``.
    878 
    879    .. index:: single: \g; in regular expressions
    880 
    881    In string-type *repl* arguments, in addition to the character escapes and
    882    backreferences described above,
    883    ``\g<name>`` will use the substring matched by the group named ``name``, as
    884    defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
    885    group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
    886    in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
    887    reference to group 20, not a reference to group 2 followed by the literal
    888    character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
    889    substring matched by the RE.
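
   The ``\g<...>`` forms described above can be sketched as follows; the
   group names ``first`` and ``last`` are hypothetical::

      import re

      swapped = re.sub(r'(?P<first>\w+) (?P<last>\w+)',
                       r'\g<last>, \g<first>', 'Isaac Newton')
      # swapped == 'Newton, Isaac'

      # \g<1>0 means group 1 followed by a literal '0'.
      padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')        # 'a10b20'

      # \g<0> stands for the whole match.
      bracketed = re.sub(r'\d', r'[\g<0>]', 'a1b2')      # 'a[1]b[2]'

      # \10 is read as group 10, which does not exist in this pattern.
      try:
          re.sub(r'(\d)', r'\10', 'a1b2')
      except re.error as exc:
          print('invalid replacement:', exc)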
    890 
    891    .. versionchanged:: 3.1
    892       Added the optional flags argument.
    893 
    894    .. versionchanged:: 3.5
    895       Unmatched groups are replaced with an empty string.
    896 
    897    .. versionchanged:: 3.6
    898       Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
    899       now are errors.
    900 
    901    .. versionchanged:: 3.7
    902       Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
    903       now are errors.
    904 
    905       Empty matches for the pattern are replaced when adjacent to a previous
    906       non-empty match.
    907 
    908 
    909 .. function:: subn(pattern, repl, string, count=0, flags=0)
    910 
    911    Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
    912    number_of_subs_made)``.
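
   For example, with an illustrative sample string::

      import re

      all_digits = re.subn(r'\d', '#', 'a1b2c3')             # ('a#b#c#', 3)
      first_two = re.subn(r'\d', '#', 'a1b2c3', count=2)     # ('a#b#c3', 2)
      untouched = re.subn(r'\d', '#', 'abc')                 # ('abc', 0)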
    913 
    914    .. versionchanged:: 3.1
    915       Added the optional flags argument.
    916 
    917    .. versionchanged:: 3.5
    918       Unmatched groups are replaced with an empty string.
    919 
    920 
    921 .. function:: escape(pattern)
    922 
    923    Escape special characters in *pattern*.
    924    This is useful if you want to match an arbitrary literal string that may
    925    have regular expression metacharacters in it.  For example::
    926 
    927       >>> print(re.escape('python.exe'))
    928       python\.exe
    929 
    930       >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
    931       >>> print('[%s]+' % re.escape(legal_chars))
    932       [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+
    933 
    934       >>> operators = ['+', '-', '*', '/', '**']
    935       >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
    936       /|\-|\+|\*\*|\*
    937 
   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; only backslashes should be escaped.  For example::
    940 
    941       >>> digits_re = r'\d+'
    942       >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
    943       >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
    944       /usr/sbin/sendmail - \d+ errors, \d+ warnings
    945 
    946    .. versionchanged:: 3.3
    947       The ``'_'`` character is no longer escaped.
    948 
    949    .. versionchanged:: 3.7
    950       Only characters that can have special meaning in a regular expression
    951       are escaped.
    952 
    953 
    954 .. function:: purge()
    955 
    956    Clear the regular expression cache.
    957 
    958 
    959 .. exception:: error(msg, pattern=None, pos=None)
    960 
    961    Exception raised when a string passed to one of the functions here is not a
    962    valid regular expression (for example, it might contain unmatched parentheses)
    963    or when some other error occurs during compilation or matching.  It is never an
    964    error if a string contains no match for a pattern.  The error instance has
    965    the following additional attributes:
    966 
    967    .. attribute:: msg
    968 
    969       The unformatted error message.
    970 
    971    .. attribute:: pattern
    972 
    973       The regular expression pattern.
    974 
    975    .. attribute:: pos
    976 
    977       The index in *pattern* where compilation failed (may be ``None``).
    978 
    979    .. attribute:: lineno
    980 
    981       The line corresponding to *pos* (may be ``None``).
    982 
    983    .. attribute:: colno
    984 
    985       The column corresponding to *pos* (may be ``None``).
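
   A minimal sketch of inspecting these attributes; ``bad_pattern`` is a
   hypothetical invalid pattern, and the exact message text is not shown
   because it may vary between versions::

      import re

      bad_pattern = r'(unclosed'    # hypothetical: unmatched parenthesis

      try:
          re.compile(bad_pattern)
      except re.error as exc:
          print(exc.msg)                          # the unformatted message
          print(exc.pattern)                      # the offending pattern
          print(exc.pos, exc.lineno, exc.colno)   # location, each may be None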
    986 
    987    .. versionchanged:: 3.5
    988       Added additional attributes.
    989 
    990 .. _re-objects:
    991 
    992 Regular Expression Objects
    993 --------------------------
    994 
    995 Compiled regular expression objects support the following methods and
    996 attributes:
    997 
    998 .. method:: Pattern.search(string[, pos[, endpos]])
    999 
   1000    Scan through *string* looking for the first location where this regular
   1001    expression produces a match, and return a corresponding :ref:`match object
   1002    <match-objects>`.  Return ``None`` if no position in the string matches the
   1003    pattern; note that this is different from finding a zero-length match at some
   1004    point in the string.
   1005 
   1006    The optional second parameter *pos* gives an index in the string where the
   1007    search is to start; it defaults to ``0``.  This is not completely equivalent to
   1008    slicing the string; the ``'^'`` pattern character matches at the real beginning
   1009    of the string and at positions just after a newline, but not necessarily at the
   1010    index where the search is to start.
   1011 
   1012    The optional parameter *endpos* limits how far the string will be searched; it
   1013    will be as if the string is *endpos* characters long, so only the characters
   1014    from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
   1015    than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
   1016    expression object, ``rx.search(string, 0, 50)`` is equivalent to
   1017    ``rx.search(string[:50], 0)``. ::
   1018 
   1019       >>> pattern = re.compile("d")
   1020       >>> pattern.search("dog")     # Match at index 0
   1021       <re.Match object; span=(0, 1), match='d'>
   1022       >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
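
   The difference between *pos* and slicing, and the effect of *endpos*, can
   be sketched as follows (patterns and strings are illustrative)::

      import re

      starts_with_b = re.compile(r'^b')
      sliced = starts_with_b.search('ab'[1:])    # matches: 'b' begins the slice
      offset = starts_with_b.search('ab', 1)     # None: index 1 is not the real start

      ends_with_b = re.compile(r'b$')
      full = ends_with_b.search('abc')           # None
      limited = ends_with_b.search('abc', 0, 2)  # span (1, 2), as if the string were 'ab'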
   1023 
   1024 
   1025 .. method:: Pattern.match(string[, pos[, endpos]])
   1026 
   1027    If zero or more characters at the *beginning* of *string* match this regular
   1028    expression, return a corresponding :ref:`match object <match-objects>`.
   1029    Return ``None`` if the string does not match the pattern; note that this is
   1030    different from a zero-length match.
   1031 
   1032    The optional *pos* and *endpos* parameters have the same meaning as for the
   1033    :meth:`~Pattern.search` method. ::
   1034 
   1035       >>> pattern = re.compile("o")
   1036       >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
   1037       >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
   1038       <re.Match object; span=(1, 2), match='o'>
   1039 
   1040    If you want to locate a match anywhere in *string*, use
   1041    :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).
   1042 
   1043 
   1044 .. method:: Pattern.fullmatch(string[, pos[, endpos]])
   1045 
   1046    If the whole *string* matches this regular expression, return a corresponding
   1047    :ref:`match object <match-objects>`.  Return ``None`` if the string does not
   1048    match the pattern; note that this is different from a zero-length match.
   1049 
   1050    The optional *pos* and *endpos* parameters have the same meaning as for the
   1051    :meth:`~Pattern.search` method. ::
   1052 
   1053       >>> pattern = re.compile("o[gh]")
   1054       >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.fullmatch("ogre")     # No match as the full string does not match.
   1056       >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
   1057       <re.Match object; span=(1, 3), match='og'>
   1058 
   1059    .. versionadded:: 3.4
   1060 
   1061 
   1062 .. method:: Pattern.split(string, maxsplit=0)
   1063 
   1064    Identical to the :func:`split` function, using the compiled pattern.
   1065 
   1066 
   1067 .. method:: Pattern.findall(string[, pos[, endpos]])
   1068 
   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region in the same way as for :meth:`~Pattern.search`.
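
   A short sketch of limiting the region, with an illustrative string::

      import re

      digit = re.compile(r'\d')

      middle = digit.findall('0123456789', 2, 5)    # ['2', '3', '4']
      tail = digit.findall('0123456789', 7)         # ['7', '8', '9']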
   1072 
   1073 
   1074 .. method:: Pattern.finditer(string[, pos[, endpos]])
   1075 
   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region in the same way as for :meth:`~Pattern.search`.
   1079 
   1080 
   1081 .. method:: Pattern.sub(repl, string, count=0)
   1082 
   1083    Identical to the :func:`sub` function, using the compiled pattern.
   1084 
   1085 
   1086 .. method:: Pattern.subn(repl, string, count=0)
   1087 
   1088    Identical to the :func:`subn` function, using the compiled pattern.
   1089 
   1090 
   1091 .. attribute:: Pattern.flags
   1092 
   1093    The regex matching flags.  This is a combination of the flags given to
   1094    :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
   1095    flags such as :data:`UNICODE` if the pattern is a Unicode string.
   1096 
   1097 
   1098 .. attribute:: Pattern.groups
   1099 
   1100    The number of capturing groups in the pattern.
   1101 
   1102 
   1103 .. attribute:: Pattern.groupindex
   1104 
   1105    A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   1106    numbers.  The dictionary is empty if no symbolic groups were used in the
   1107    pattern.
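
   As a sketch, with hypothetical group names ``year`` and ``month``::

      import re

      date = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})')

      count = date.groups                  # 3 capturing groups in total
      names = dict(date.groupindex)        # {'year': 1, 'month': 2}
      unnamed = dict(re.compile(r'(\d+)').groupindex)    # {}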
   1108 
   1109 
   1110 .. attribute:: Pattern.pattern
   1111 
   1112    The pattern string from which the pattern object was compiled.
   1113 
   1114 
   1115 .. versionchanged:: 3.7
   1116    Added support of :func:`copy.copy` and :func:`copy.deepcopy`.  Compiled
   1117    regular expression objects are considered atomic.
   1118 
   1119 
   1120 .. _match-objects:
   1121 
   1122 Match Objects
   1123 -------------
   1124 
   1125 Match objects always have a boolean value of ``True``.
   1126 Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
   1127 when there is no match, you can test whether there was a match with a simple
   1128 ``if`` statement::
   1129 
   1130    match = re.search(pattern, string)
   1131    if match:
   1132        process(match)
   1133 
   1134 Match objects support the following methods and attributes:
   1135 
   1136 
   1137 .. method:: Match.expand(template)
   1138 
   1139    Return the string obtained by doing backslash substitution on the template
   1140    string *template*, as done by the :meth:`~Pattern.sub` method.
   1141    Escapes such as ``\n`` are converted to the appropriate characters,
   1142    and numeric backreferences (``\1``, ``\2``) and named backreferences
   1143    (``\g<1>``, ``\g<name>``) are replaced by the contents of the
   1144    corresponding group.
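
   For instance, assuming a pattern with the hypothetical group names
   ``first`` and ``last``::

      import re

      m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Isaac Newton')

      by_name = m.expand(r'\g<last>, \g<first>')    # 'Newton, Isaac'
      by_number = m.expand(r'\2\n\1')               # 'Newton\nIsaac'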
   1145 
   1146    .. versionchanged:: 3.5
   1147       Unmatched groups are replaced with an empty string.
   1148 
   1149 .. method:: Match.group([group1, ...])
   1150 
   Return one or more subgroups of the match.  If there is a single argument, the
   1152    result is a single string; if there are multiple arguments, the result is a
   1153    tuple with one item per argument. Without arguments, *group1* defaults to zero
   1154    (the whole match is returned). If a *groupN* argument is zero, the corresponding
   1155    return value is the entire matching string; if it is in the inclusive range
   1156    [1..99], it is the string matching the corresponding parenthesized group.  If a
   1157    group number is negative or larger than the number of groups defined in the
   1158    pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   1159    part of the pattern that did not match, the corresponding result is ``None``.
   1160    If a group is contained in a part of the pattern that matched multiple times,
   1161    the last match is returned. ::
   1162 
   1163       >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
   1164       >>> m.group(0)       # The entire match
   1165       'Isaac Newton'
   1166       >>> m.group(1)       # The first parenthesized subgroup.
   1167       'Isaac'
   1168       >>> m.group(2)       # The second parenthesized subgroup.
   1169       'Newton'
   1170       >>> m.group(1, 2)    # Multiple arguments give us a tuple.
   1171       ('Isaac', 'Newton')
   1172 
   1173    If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   1174    arguments may also be strings identifying groups by their group name.  If a
   1175    string argument is not used as a group name in the pattern, an :exc:`IndexError`
   1176    exception is raised.
   1177 
   1178    A moderately complicated example::
   1179 
   1180       >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
   1181       >>> m.group('first_name')
   1182       'Malcolm'
   1183       >>> m.group('last_name')
   1184       'Reynolds'
   1185 
   1186    Named groups can also be referred to by their index::
   1187 
   1188       >>> m.group(1)
   1189       'Malcolm'
   1190       >>> m.group(2)
   1191       'Reynolds'
   1192 
   1193    If a group matches multiple times, only the last match is accessible::
   1194 
   1195       >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
   1196       >>> m.group(1)                        # Returns only the last match.
   1197       'c3'
   1198 
   1199 
   1200 .. method:: Match.__getitem__(g)
   1201 
   1202    This is identical to ``m.group(g)``.  This allows easier access to
   1203    an individual group from a match::
   1204 
   1205       >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
   1206       >>> m[0]       # The entire match
   1207       'Isaac Newton'
   1208       >>> m[1]       # The first parenthesized subgroup.
   1209       'Isaac'
   1210       >>> m[2]       # The second parenthesized subgroup.
   1211       'Newton'
   1212 
   1213    .. versionadded:: 3.6
   1214 
   1215 
   1216 .. method:: Match.groups(default=None)
   1217 
   1218    Return a tuple containing all the subgroups of the match, from 1 up to however
   1219    many groups are in the pattern.  The *default* argument is used for groups that
   1220    did not participate in the match; it defaults to ``None``.
   1221 
   1222    For example::
   1223 
   1224       >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
   1225       >>> m.groups()
   1226       ('24', '1632')
   1227 
   1228    If we make the decimal place and everything after it optional, not all groups
   1229    might participate in the match.  These groups will default to ``None`` unless
   1230    the *default* argument is given::
   1231 
   1232       >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
   1233       >>> m.groups()      # Second group defaults to None.
   1234       ('24', None)
   1235       >>> m.groups('0')   # Now, the second group defaults to '0'.
   1236       ('24', '0')
   1237 
   1238 
   1239 .. method:: Match.groupdict(default=None)
   1240 
   1241    Return a dictionary containing all the *named* subgroups of the match, keyed by
   1242    the subgroup name.  The *default* argument is used for groups that did not
   1243    participate in the match; it defaults to ``None``.  For example::
   1244 
   1245       >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
   1246       >>> m.groupdict()
   1247       {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
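
   The *default* argument only matters when a named group did not
   participate.  A minimal sketch with hypothetical group names ``whole``
   and ``frac``::

      import re

      m = re.match(r'(?P<whole>\d+)(?:\.(?P<frac>\d+))?', '24')

      plain = m.groupdict()         # {'whole': '24', 'frac': None}
      filled = m.groupdict('0')     # {'whole': '24', 'frac': '0'}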
   1248 
   1249 
   1250 .. method:: Match.start([group])
   1251             Match.end([group])
   1252 
   1253    Return the indices of the start and end of the substring matched by *group*;
   1254    *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   1255    *group* exists but did not contribute to the match.  For a match object *m*, and
   1256    a group *g* that did contribute to the match, the substring matched by group *g*
   1257    (equivalent to ``m.group(g)``) is ::
   1258 
   1259       m.string[m.start(g):m.end(g)]
   1260 
   1261    Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   1262    null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
   1263    ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   1264    2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
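
   The cases above, including the ``-1`` result, can be sketched as::

      import re

      m = re.search('b(c?)', 'cba')
      whole = (m.start(0), m.end(0))    # (1, 2)
      group1 = (m.start(1), m.end(1))   # (2, 2): group 1 matched the empty string
      try:
          m.start(2)
      except IndexError:
          print('the pattern has no group 2')

      # A group that exists but did not take part in the match gives -1.
      m = re.match(r'(a)|(b)', 'b')
      unused = (m.start(1), m.end(1))   # (-1, -1)
      used = (m.start(2), m.end(2))     # (0, 1)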
   1265 
   1266    An example that will remove *remove_this* from email addresses::
   1267 
      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'
   1272 
   1273 
   1274 .. method:: Match.span([group])
   1275 
   1276    For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   1277    that if *group* did not contribute to the match, this is ``(-1, -1)``.
   1278    *group* defaults to zero, the entire match.
   1279 
   1280 
   1281 .. attribute:: Match.pos
   1282 
   1283    The value of *pos* which was passed to the :meth:`~Pattern.search` or
   1284    :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
   1285    the index into the string at which the RE engine started looking for a match.
   1286 
   1287 
   1288 .. attribute:: Match.endpos
   1289 
   1290    The value of *endpos* which was passed to the :meth:`~Pattern.search` or
   1291    :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
   1292    the index into the string beyond which the RE engine will not go.
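
   A small sketch showing both :attr:`~Match.pos` and :attr:`~Match.endpos`
   on one match (the string and limits are illustrative)::

      import re

      pattern = re.compile('o')
      m = pattern.search('dog food', 2, 6)

      limits = (m.pos, m.endpos)    # (2, 6): the values passed to search()
      where = m.span()              # (5, 6): the first 'o' inside those limits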
   1293 
   1294 
.. attribute:: Match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2`` if applied to the same
   string.


.. attribute:: Match.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.


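The behaviour of ``lastindex`` and ``lastgroup`` can be checked with a short
sketch. The first two patterns are taken from the description above; the
named-group pattern and the variable names are assumptions made for
illustration.

```python
import re

# Nested groups: the outer group closes last, so it is the "last matched" one.
nested = re.match(r"((a)(b))", "ab")

# Sibling groups: group 2 is the last one to match.
siblings = re.match(r"(a)(b)", "ab")

# Named groups; the optional second group does not take part in this match.
named = re.match(r"(?P<first>a)(?P<second>b)?", "a")

# A pattern with no capturing groups at all.
plain = re.match(r"a", "a")

print(nested.lastindex, siblings.lastindex)
print(named.lastindex, named.lastgroup)
print(plain.lastindex, plain.lastgroup)
```
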
.. attribute:: Match.re

   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
   :meth:`~Pattern.search` method produced this match instance.


.. attribute:: Match.string

   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`.  Match objects
   are considered atomic.


.. _re-examples:

Regular Expression Examples
---------------------------


Checking for a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully:

.. testcode::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following::

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contains a pair, that is, two cards of the same
value.  To match this with a regular expression, one could use a backreference::

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~Match.group` method of the match object in the following manner:

.. doctest::

   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'


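One way to avoid the :exc:`AttributeError` shown above is to test the result of
:meth:`~Pattern.match` before calling :meth:`~Match.group`.  The sketch below
assumes a hypothetical helper named ``pair_card``; it is not part of the
:mod:`re` module.

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that forms a pair in *hand*, or None if there is no pair."""
    m = pair.match(hand)
    # match() returns None on failure, so test it before calling group().
    return m.group(1) if m is not None else None

print(pair_card("717ak"))
print(pair_card("718ak"))
print(pair_card("354aa"))
```
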
Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`.  Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings.  The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings


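A minimal sketch of applying that expression from Python follows.  The variable
names are illustrative assumptions; the point is that, unlike :c:func:`scanf`,
every captured group is a string, so numeric fields need an explicit
conversion.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
scanf_like = re.compile(r"(\S+) - (\d+) errors, (\d+) warnings")

m = scanf_like.match(line)
if m is None:
    raise ValueError(f"unrecognized line: {line!r}")

# Groups are always strings; convert the numeric fields explicitly.
filename = m.group(1)
n_errors = int(m.group(2))
n_warnings = int(m.group(3))

print(filename, n_errors, n_warnings)
```
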
.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <re.Match object; span=(0, 1), match='a'>

Note, however, that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line. ::

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <re.Match object; span=(4, 5), match='X'>


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern.  It
is invaluable for converting textual data into data structures that can be
easily read and modified by Python, as demonstrated in the following example
that creates a phonebook.

First, here is the input.  Normally it would come from a file; here we use
triple-quoted string syntax::

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address.  We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]


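To end up with an actual lookup structure rather than a list of lists, the two
splitting steps can be combined.  The following is only a sketch: keying the
dictionary on the full name and the ``"phone"``/``"address"`` field names are
assumptions made for illustration, and ``maxsplit`` is passed by keyword.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # At most three splits, so spaces inside the address are preserved.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[f"{first} {last}"] = {"phone": phone, "address": address}

print(phonebook["Frank Burger"]["phone"])
```
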
Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.  This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'


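Because the shuffle is random, the output above differs from run to run.  If
repeatable output is wanted (for tests, say), one possible variation is to
shuffle with a seeded :class:`random.Random` instance.  The ``munge`` wrapper
and its ``seed`` parameter are hypothetical names introduced for this sketch.

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of each word; a given seed gives repeatable output."""
    rng = random.Random(seed)  # private generator; global random state is untouched

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text, seed=0)
print(munged)
```

Whatever the seed, each word keeps its first and last characters and its set of
letters; only the order of the inner letters changes.
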
Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does.  For example, if a writer wanted to
find all of the adverbs in some text, they might use :func:`findall` in
the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings.  Continuing with the previous example, if
a writer wanted to find all of the adverbs *and their positions* in
some text, they would use :func:`finditer` in the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly


Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it.  For example, the two following lines of code are
functionally identical::

   >>> re.match(r"\W(.)\1\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression.  With raw string notation, this means ``r"\\"``.  Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical::

   >>> re.match(r"\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>


Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters.  This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions.  The technique is
to combine those into a single master regular expression and to loop over
successive matches::

    import collections
    import re

    Token = collections.namedtuple('Token', ['type', 'value', 'line', 'column'])

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',   r':='),           # Assignment operator
            ('END',      r';'),            # Statement terminator
            ('ID',       r'[A-Za-z]+'),    # Identifiers
            ('OP',       r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE',  r'\n'),           # Line endings
            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH', r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group()
            column = mo.start() - line_start
            if kind == 'NUMBER':
                value = float(value) if '.' in value else int(value)
            elif kind == 'ID' and value in keywords:
                kind = value
            elif kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
                continue
            elif kind == 'SKIP':
                continue
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)

The tokenizer produces the following output::

    Token(type='IF', value='IF', line=2, column=4)
    Token(type='ID', value='quantity', line=2, column=7)
    Token(type='THEN', value='THEN', line=2, column=16)
    Token(type='ID', value='total', line=3, column=8)
    Token(type='ASSIGN', value=':=', line=3, column=14)
    Token(type='ID', value='total', line=3, column=17)
    Token(type='OP', value='+', line=3, column=23)
    Token(type='ID', value='price', line=3, column=25)
    Token(type='OP', value='*', line=3, column=31)
    Token(type='ID', value='quantity', line=3, column=33)
    Token(type='END', value=';', line=3, column=41)
    Token(type='ID', value='tax', line=4, column=8)
    Token(type='ASSIGN', value=':=', line=4, column=12)
    Token(type='ID', value='price', line=4, column=15)
    Token(type='OP', value='*', line=4, column=21)
    Token(type='NUMBER', value=0.05, line=4, column=23)
    Token(type='END', value=';', line=4, column=27)
    Token(type='ENDIF', value='ENDIF', line=5, column=4)
    Token(type='END', value=';', line=5, column=9)


.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
   Media, 2009. The third edition of the book no longer covers Python at all,
   but the first edition covered writing good regular expression patterns in
   great detail.