:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.
.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operators` and :ref:`delimiters`
tokens are returned using the generic :data:`token.OP` token type.  The exact
operator or delimiter can be determined by checking the second field of the
tuple returned from :func:`tokenize.generate_tokens`, which contains the
actual token string matched.

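For instance, a minimal sketch (the snippet being tokenized is illustrative)
that prints only the ``OP`` tokens, distinguishing them by their token
string::

   from StringIO import StringIO
   import tokenize
   import token

   source = "x = f(1, 2)\n"
   for toknum, tokval, _, _, _ in tokenize.generate_tokens(StringIO(source).readline):
       if toknum == token.OP:
           # '=', '(', ',' and ')' all arrive as token.OP; the token
           # string (second field) tells them apart.
           print repr(tokval)
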
The primary entry point is a :term:`generator`:

.. function:: generate_tokens(readline)

   The :func:`generate_tokens` generator requires one argument, *readline*,
   which must be a callable object which provides the same interface as the
   :meth:`~file.readline` method of built-in file objects (see section
   :ref:`bltin-file-objects`).  Each call to the function should return one
   line of input as a string.  Alternatively, *readline* may be a callable
   object that signals completion by raising :exc:`StopIteration`.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found.  The line passed (the last tuple
   item) is the *physical* line.

   .. versionadded:: 2.2

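As a sketch of how these 5-tuples might be consumed (the example source is
illustrative)::

   from StringIO import StringIO
   from tokenize import generate_tokens, tok_name

   source = "total = 1 + \\\n    2\n"
   g = generate_tokens(StringIO(source).readline)
   for toknum, tokval, (srow, scol), (erow, ecol), line in g:
       # tok_name maps numeric token types back to readable names.
       print "%-10s %-10r starts at %d,%d ends at %d,%d" % (
           tok_name[toknum], tokval, srow, scol, erow, ecol)
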
An older entry point is retained for backward compatibility:


.. function:: tokenize(readline[, tokeneater])

   The :func:`.tokenize` function accepts two parameters: one representing the
   input stream, and one providing an output mechanism for :func:`.tokenize`.

   The first parameter, *readline*, must be a callable object which provides
   the same interface as the :meth:`~file.readline` method of built-in file
   objects (see section :ref:`bltin-file-objects`).  Each call to the function
   should return one line of input as a string.  Alternatively, *readline* may
   be a callable object that signals completion by raising
   :exc:`StopIteration`.

   .. versionchanged:: 2.5
      Added :exc:`StopIteration` support.

   The second parameter, *tokeneater*, must also be a callable object.  It is
   called once for each token, with five arguments, corresponding to the
   tuples generated by :func:`generate_tokens`.

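A sketch of this older interface (the callback name ``report`` is arbitrary)::

   from StringIO import StringIO
   import tokenize

   def report(toknum, tokval, start, end, line):
       # Receives the same five values that generate_tokens() would
       # yield as a single tuple.
       print tokenize.tok_name[toknum], repr(tokval), start, end

   tokenize.tokenize(StringIO("x = 1  # set x\n").readline, report)
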
All constants from the :mod:`token` module are also exported from
:mod:`tokenize`, as are two additional token type values that might be passed
to the *tokeneater* function by :func:`.tokenize`:


.. data:: COMMENT

   Token value used to indicate a comment.


.. data:: NL

   Token value used to indicate a non-terminating newline.  The NEWLINE token
   indicates the end of a logical line of Python code; NL tokens are generated
   when a logical line of code is continued over multiple physical lines.

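For example, the following sketch reports which of these tokens appear in a
snippet whose logical line spans two physical lines (the snippet is
illustrative)::

   from StringIO import StringIO
   import tokenize

   source = "# a comment\nx = (1 +\n     2)\n"
   for toknum, tokval, _, _, _ in tokenize.generate_tokens(StringIO(source).readline):
       if toknum == tokenize.COMMENT:
           print "COMMENT:", repr(tokval)
       elif toknum == tokenize.NL:
           print "NL      (non-terminating newline)"
       elif toknum == tokenize.NEWLINE:
           print "NEWLINE (end of the logical line)"
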
Another function is provided to reverse the tokenization process.  This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code.  The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single string.  The result is
   guaranteed to tokenize back to match the input, so that the conversion is
   lossless and round-trips are assured.  The guarantee applies only to the
   token type and token string, as the spacing between tokens (column
   positions) may change.

   .. versionadded:: 2.5

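A minimal round-trip sketch using only the 2-element form (the snippet is
illustrative; the spacing of the result may differ from the original source)::

   from StringIO import StringIO
   from tokenize import generate_tokens, untokenize

   code = "x=[1 ,2]\n"
   pairs = [(toknum, tokval)
            for toknum, tokval, _, _, _ in generate_tokens(StringIO(code).readline)]
   newcode = untokenize(pairs)
   # Tokenizing newcode again yields the same (type, string) sequence as
   # tokenizing the original code, even though the spacing may differ.
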
.. exception:: TokenError

   Raised when either a docstring or expression that may be split over several
   lines is not completed anywhere in the file, for example::

      """Beginning of
      docstring

   or::

      [1,
       2,
       3

Note that unclosed single-quoted strings do not cause an error to be raised.
They are tokenized as ``ERRORTOKEN``, followed by the tokenization of their
contents.

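A sketch of catching the error for the unfinished list shown above::

   from StringIO import StringIO
   from tokenize import generate_tokens, TokenError

   try:
       # The list is never closed, so the tokenizer reaches end-of-file in
       # the middle of a multi-line expression and raises TokenError.
       list(generate_tokens(StringIO("[1,\n 2,\n 3\n").readline))
   except TokenError as err:
       print err
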
Example of a script re-writer that transforms float literals into Decimal
objects::

   from StringIO import StringIO
   from tokenize import generate_tokens, untokenize, NUMBER, STRING, NAME, OP

   def decistmt(s):
       """Substitute Decimals for floats in a string of statements.

       >>> from decimal import Decimal
       >>> s = 'print +21.3e-5*-.1234/81.7'
       >>> decistmt(s)
       "print +Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7')"

       >>> exec(s)
       -3.21716034272e-007
       >>> exec(decistmt(s))
       -3.217160342717258261933904529E-7

       """
       result = []
       g = generate_tokens(StringIO(s).readline)   # tokenize the string
       for toknum, tokval, _, _, _ in g:
           if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
               result.extend([
                   (NAME, 'Decimal'),
                   (OP, '('),
                   (STRING, repr(tokval)),
                   (OP, ')')
               ])
           else:
               result.append((toknum, tokval))
       return untokenize(result)