#
# Secret Labs' Regular Expression Engine
#
# re-compatible interface for the sre matching engine
#
# Copyright (c) 1998-2001 by Secret Labs AB.  All rights reserved.
#
# This version of the SRE library can be redistributed under CNRI's
# Python 1.6 license.  For any other use, please contact Secret Labs
# AB (info@pythonware.com).
#
# Portions of this engine have been developed in cooperation with
# CNRI.  Hewlett-Packard provided funding for 1.6 integration and
# other compatibility work.
#

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.

The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.

This module exports the following functions:
    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a match object for each match.
    compile   Compile a pattern into a pattern object.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.

Some of the functions in this module take flags as optional parameters:
    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                   match the corresponding ASCII character categories
                   (rather than the whole Unicode categories, which is the
                   default).
                   For bytes patterns, this flag is the only available
                   behaviour and needn't be specified.
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     For compatibility only. Ignored for string patterns (it
                   is the default), and forbidden for bytes patterns.

This module also defines an exception 'error'.

"""

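# Illustrative sketch, not part of the original module: a few of the
# special characters described in the docstring above, shown on small
# inputs.  The helper name _example_syntax is hypothetical, and the
# import is kept local so that nothing runs when this module is loaded.
def _example_syntax():
    import re
    text = "<a><b>"
    # "*" is greedy and takes as much as it can; "*?" takes as little
    greedy = re.match(r"<.*>", text).group()       # "<a><b>"
    lazy = re.match(r"<.*?>", text).group()        # "<a>"
    # (?P<name>...) captures by name, (?P=name) matches the same text again
    named = re.match(r"(?P<key>\w+)=(?P=key)", "x=x").group("key")  # "x"
    # (?=...) checks what follows without consuming it
    lookahead = re.findall(r"\w+(?=:)", "a: 1, b: 2")  # ["a", "b"]
    return greedy, lazy, named, lookahead
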
import enum
import sre_compile
import sre_parse
import functools
try:
    import _locale
except ImportError:
    _locale = None

# public symbols
__all__ = [
    "match", "fullmatch", "search", "sub", "subn", "split",
    "findall", "finditer", "compile", "purge", "template", "escape",
    "error", "A", "I", "L", "M", "S", "X", "U",
    "ASCII", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE",
]

__version__ = "2.2.1"

class RegexFlag(enum.IntFlag):
    ASCII = sre_compile.SRE_FLAG_ASCII # assume ascii "locale"
    IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
    LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
    UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode "locale"
    MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
    DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
    VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments
    A = ASCII
    I = IGNORECASE
    L = LOCALE
    U = UNICODE
    M = MULTILINE
    S = DOTALL
    X = VERBOSE
    # sre extensions (experimental, don't rely on these)
    TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE # disable backtracking
    T = TEMPLATE
    DEBUG = sre_compile.SRE_FLAG_DEBUG # dump pattern after compilation
globals().update(RegexFlag.__members__)

# sre exception
error = sre_compile.error

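# Illustrative sketch, not part of the original module: how the flags
# defined above combine with "|" and how the same flags can be set inline
# with (?im).  The helper name _example_flags is hypothetical.
def _example_flags():
    import re
    text = "first\nSecond"
    # without flags, "^" only matches at the very start and "s" is case-sensitive
    plain = re.findall(r"^s\w+", text)                                 # []
    # IGNORECASE and MULTILINE together let "^s" match the line "Second"
    combined = re.findall(r"^s\w+", text, re.IGNORECASE | re.MULTILINE)  # ["Second"]
    # the same two flags written inside the pattern
    inline = re.findall(r"(?im)^s\w+", text)                           # ["Second"]
    # "." stops at a newline unless DOTALL is given
    dot_default = re.match(r".+", text).group()                        # "first"
    dot_all = re.match(r".+", text, re.DOTALL).group()                 # "first\nSecond"
    return plain, combined, inline, dot_default, dot_all
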
# --------------------------------------------------------------------
# public interface

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    """Try to apply the pattern to all of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl.  number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the match object and must
    return a replacement string to be used."""
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern is also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    """Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a match object.

    Empty matches are included in the result."""
    return _compile(pattern, flags).finditer(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

def purge():
    "Clear the regular expression caches."
    _cache.clear()
    _compile_repl.cache_clear()

def template(pattern, flags=0):
    "Compile a template pattern, returning a pattern object."
    return _compile(pattern, flags|T)

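# Illustrative sketch, not part of the original module: how the public
# functions above differ on small inputs.  The helper name
# _example_public_interface is hypothetical.
def _example_public_interface():
    import re
    return {
        # match only looks at the start of the string
        "match": re.match(r"\d+", "abc123"),                        # None
        # search scans forward until the pattern fits somewhere
        "search": re.search(r"\d+", "abc123").group(),              # "123"
        # fullmatch requires the whole string to fit
        "fullmatch": re.fullmatch(r"\d+", "123abc"),                # None
        # count limits how many occurrences are replaced
        "sub": re.sub(r"\d", "#", "a1b22", count=1),                # "a#b22"
        # subn also reports how many replacements were made
        "subn": re.subn(r"\d", "#", "a1b22"),                       # ("a#b##", 3)
        # a capturing group in the separator is kept in the result
        "split": re.split(r"(,)", "a,b"),                           # ["a", ",", "b"]
        # two groups give a list of tuples
        "findall": re.findall(r"(\w)=(\d)", "a=1 b=2"),             # [("a", "1"), ("b", "2")]
    }
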
_alphanum_str = frozenset(
    "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890")
_alphanum_bytes = frozenset(
    b"_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890")

def escape(pattern):
    """
    Escape all the characters in pattern except ASCII letters, numbers and '_'.
    """
    if isinstance(pattern, str):
        alphanum = _alphanum_str
        s = list(pattern)
        for i, c in enumerate(pattern):
            if c not in alphanum:
                if c == "\000":
                    s[i] = "\\000"
                else:
                    s[i] = "\\" + c
        return "".join(s)
    else:
        alphanum = _alphanum_bytes
        s = []
        esc = ord(b"\\")
        for c in pattern:
            if c in alphanum:
                s.append(c)
            else:
                if c == 0:
                    s.extend(b"\\000")
                else:
                    s.append(esc)
                    s.append(c)
        return bytes(s)

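# Illustrative sketch, not part of the original module: why escape() is
# useful when a literal string is used as a pattern.  The helper name
# _example_escape is hypothetical.  Only the matching behaviour is shown,
# since exactly which characters get a backslash is a detail of this
# implementation.
def _example_escape():
    import re
    haystack = "abc a.c"
    # unescaped, "." is a wildcard, so "a.c" also matches "abc"
    unescaped = re.findall("a.c", haystack)             # ["abc", "a.c"]
    # escaped, "." only matches a literal dot
    escaped = re.findall(re.escape("a.c"), haystack)    # ["a.c"]
    # an escaped string matches itself exactly
    literal = "1+1=2 (approx.)"
    round_trip = re.fullmatch(re.escape(literal), literal) is not None
    return unescaped, escaped, round_trip
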
# --------------------------------------------------------------------
# internals

_cache = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        # a cached LOCALE pattern is only reused while the locale is unchanged
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            # cache is full: drop every entry rather than evict selectively
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                # the locale cannot be recorded, so skip caching
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p

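# Illustrative sketch, not part of the original module: behaviour of the
# compile path above as seen from the public API.  The helper name
# _example_compile is hypothetical.  Getting the same object back for
# the same pattern text is a consequence of the cache and should be
# treated as an implementation detail, so it is returned but not relied on.
def _example_compile():
    import re
    first = re.compile(r"\d+")
    second = re.compile(r"\d+")
    cached = first is second          # typically True while the entry is cached
    # an already compiled pattern is handed back unchanged
    passthrough = re.compile(first) is first
    # flags cannot be added to an already compiled pattern
    try:
        re.compile(first, re.IGNORECASE)
    except ValueError:
        rejected = True
    else:
        rejected = False
    return cached, passthrough, rejected
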
@functools.lru_cache(_MAXCACHE)
def _compile_repl(repl, pattern):
    # internal: compile replacement pattern
    return sre_parse.parse_template(repl, pattern)

def _expand(pattern, match, template):
    # internal: match.expand implementation hook
    template = sre_parse.parse_template(template, pattern)
    return sre_parse.expand_template(template, match)

def _subx(pattern, template):
    # internal: pattern.sub/subn implementation helper
    template = _compile_repl(template, pattern)
    if not template[0] and len(template[1]) == 1:
        # literal replacement
        return template[1][0]
    def filter(match, template=template):
        return sre_parse.expand_template(template, match)
    return filter

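# Illustrative sketch, not part of the original module: the three kinds
# of replacement that sub() accepts, which the helpers above support.
# The helper name _example_replacement is hypothetical.
def _example_replacement():
    import re
    # a replacement with no group references is used as a plain literal
    literal = re.sub(r"\s+", "-", "a  b c")                       # "a-b-c"
    # \1 and \2 in the replacement refer to the captured groups
    template = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@host")   # "host at user"
    # a callable receives each match object and returns the new text
    doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2),
                     "3 apples, 10 pears")                        # "6 apples, 20 pears"
    return literal, template, doubled
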
# register myself for pickling

import copyreg

def _pickle(p):
    return _compile, (p.pattern, p.flags)

copyreg.pickle(_pattern_type, _pickle, _compile)

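# Illustrative sketch, not part of the original module: the copyreg
# registration above means a compiled pattern can be pickled, and it is
# rebuilt from its pattern text and flags when loaded.  The helper name
# _example_pickle is hypothetical.
def _example_pickle():
    import pickle
    import re
    original = re.compile(r"ab+c", re.IGNORECASE)
    restored = pickle.loads(pickle.dumps(original))
    return (restored.pattern == original.pattern,
            restored.flags == original.flags,
            restored.match("ABBC") is not None)
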
# --------------------------------------------------------------------
# experimental stuff (see python-dev discussions for details)

class Scanner:
    def __init__(self, lexicon, flags=0):
        from sre_constants import BRANCH, SUBPATTERN
        self.lexicon = lexicon
        # combine phrases into a compound pattern
        p = []
        s = sre_parse.Pattern()
        s.flags = flags
        for phrase, action in lexicon:
            gid = s.opengroup()
            p.append(sre_parse.SubPattern(s, [
                (SUBPATTERN, (gid, 0, 0, sre_parse.parse(phrase, flags))),
                ]))
            s.closegroup(gid, p[-1])
        p = sre_parse.SubPattern(s, [(BRANCH, (None, p))])
        self.scanner = sre_compile.compile(p)
    def scan(self, string):
        result = []
        append = result.append
        match = self.scanner.scanner(string).match
        i = 0
        while True:
            m = match()
            if not m:
                break
            j = m.end()
            if i == j:
                # stop on an empty match so the loop cannot run forever
                break
            # the group that matched selects the lexicon entry
            action = self.lexicon[m.lastindex-1][1]
            if callable(action):
                self.match = m
                action = action(self, m.group())
            if action is not None:
                append(action)
            i = j
        # return the collected results and the unscanned remainder
        return result, string[i:]

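# Illustrative sketch, not part of the original module: a small tokenizer
# built on the experimental Scanner class above.  The helper name
# _example_scanner and the token labels are hypothetical.  Each lexicon
# entry pairs a pattern with an action; a callable action receives the
# scanner and the matched text, and None discards the match.
def _example_scanner(source="x = 42 + y"):
    import re
    scanner = re.Scanner([
        (r"\d+", lambda s, tok: ("NUM", int(tok))),
        (r"[A-Za-z_]\w*", lambda s, tok: ("NAME", tok)),
        (r"[=+]", lambda s, tok: ("OP", tok)),
        (r"\s+", None),  # skip whitespace
    ])
    # scan() returns the collected tokens and whatever text it could not match
    return scanner.scan(source)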
    382