:mod:`shlex` --- Simple lexical analysis
========================================

.. module:: shlex
   :synopsis: Simple lexical analysis for Unix shell-like languages.
.. moduleauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. moduleauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>
.. sectionauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. sectionauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>


.. versionadded:: 1.5.2

**Source code:** :source:`Lib/shlex.py`

--------------


The :class:`~shlex.shlex` class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the Unix shell.  This is often useful for
writing minilanguages (for example, in run control files for Python
applications) or for parsing quoted strings.

Prior to Python 2.7.3, this module did not support Unicode input.

The :mod:`shlex` module defines the following functions:


.. function:: split(s[, comments[, posix]])

   Split the string *s* using shell-like syntax. If *comments* is :const:`False`
   (the default), the parsing of comments in the given string is disabled
   (by setting the :attr:`~shlex.commenters` attribute of the
   :class:`~shlex.shlex` instance to the empty string).  This function operates
   in POSIX mode by default, but uses non-POSIX mode if the *posix* argument is
   false.

   .. versionadded:: 2.3

   .. versionchanged:: 2.6
      Added the *posix* parameter.

   .. note::

      Since the :func:`split` function instantiates a :class:`~shlex.shlex`
      instance, passing ``None`` for *s* will read the string to split from
      standard input.

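As a minimal sketch of how the *comments* and *posix* arguments change the
result of :func:`split`, consider the following. The command line is a made-up
example, and the expected results in the comments follow from the rules
described in this document:

```python
import shlex

# A hypothetical command line used only for illustration.
command = 'cp "my file.txt" /tmp  # copy it'

# Default: POSIX mode, comment parsing disabled, so '#' is ordinary text.
default_tokens = shlex.split(command)
# -> ['cp', 'my file.txt', '/tmp', '#', 'copy', 'it']

# comments=True drops everything from '#' to the end of the line.
no_comment_tokens = shlex.split(command, comments=True)
# -> ['cp', 'my file.txt', '/tmp']

# posix=False keeps the quote characters as part of the token.
compat_tokens = shlex.split(command, comments=True, posix=False)
# -> ['cp', '"my file.txt"', '/tmp']
```
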
The :mod:`shlex` module defines the following class:


.. class:: shlex([instream[, infile[, posix]]])

   A :class:`~shlex.shlex` instance or subclass instance is a lexical analyzer
   object.  The initialization argument, if present, specifies where to read
   characters from. It must be a file-/stream-like object with
   :meth:`~io.TextIOBase.read` and :meth:`~io.TextIOBase.readline` methods, or
   a string (strings are accepted since Python 2.3).  If no argument is given,
   input will be taken from ``sys.stdin``.  The second optional argument is a
   filename string, which sets the initial value of the :attr:`~shlex.infile`
   attribute.  If the *instream* argument is omitted or equal to ``sys.stdin``,
   this second argument defaults to "stdin".  The *posix* argument was
   introduced in Python 2.3, and defines the operational mode.  When *posix* is
   not true (default), the :class:`~shlex.shlex` instance will operate in
   compatibility mode.  When operating in POSIX mode, :class:`~shlex.shlex`
   will try to be as close as possible to the POSIX shell parsing rules.


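The following sketch illustrates the three constructor arguments by building
lexers from a string. The input text and the file name ``settings.rc`` are
hypothetical and chosen only for illustration:

```python
import shlex

text = 'x = "hello world"'  # hypothetical input

# Default (compatibility) mode: quotes stay in the token.
compat_tokens = list(shlex.shlex(text))
# -> ['x', '=', '"hello world"']

# POSIX mode: quotes are stripped from the token.
posix_tokens = list(shlex.shlex(text, posix=True))
# -> ['x', '=', 'hello world']

# The second argument only names the input; it is stored in infile.
named = shlex.shlex(text, "settings.rc")
input_name = named.infile
# -> 'settings.rc'
```
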
.. seealso::

   Module :mod:`ConfigParser`
      Parser for configuration files similar to the Windows :file:`.ini` files.


.. _shlex-objects:

shlex Objects
-------------

A :class:`~shlex.shlex` instance has the following methods:


.. method:: shlex.get_token()

   Return a token.  If tokens have been stacked using :meth:`push_token`, pop a
   token off the stack.  Otherwise, read one from the input stream.  If reading
   encounters an immediate end-of-file, :attr:`eof` is returned (the empty
   string (``''``) in non-POSIX mode, and ``None`` in POSIX mode).


.. method:: shlex.push_token(str)

   Push the argument onto the token stack.


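A short sketch of how :meth:`get_token` and :meth:`push_token` interact, using
a made-up two-word input. A pushed token is returned by the next call before
any further input is read:

```python
import shlex

lexer = shlex.shlex("alpha beta")   # hypothetical input

first = lexer.get_token()     # 'alpha', read from the input
lexer.push_token(first)       # put it back for a later read
again = lexer.get_token()     # 'alpha' again, popped from the stack
second = lexer.get_token()    # 'beta', read from the input
at_end = lexer.get_token()    # '' (the eof value in non-POSIX mode)
```
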
.. method:: shlex.read_token()

   Read a raw token.  Ignore the pushback stack, and do not interpret source
   requests.  (This is not ordinarily a useful entry point, and is documented here
   only for the sake of completeness.)


.. method:: shlex.sourcehook(filename)

   When :class:`~shlex.shlex` detects a source request (see :attr:`source`
   below), this method is given the following token as its argument and is
   expected to return a tuple consisting of a filename and an open file-like
   object.

   Normally, this method first strips any quotes off the argument.  If the result
   is an absolute pathname, or there was no previous source request in effect, or
   the previous source was a stream (such as ``sys.stdin``), the result is left
   alone.  Otherwise, if the result is a relative pathname, the directory part of
   the name of the file immediately before it on the source inclusion stack is
   prepended (this behavior is like the way the C preprocessor handles ``#include
   "file.h"``).

   The result of the manipulations is treated as a filename, and returned as the
   first component of the tuple, with :func:`open` called on it to yield the second
   component. (Note: this is the reverse of the order of arguments in instance
   initialization!)

   This hook is exposed so that you can use it to implement directory search paths,
   addition of file extensions, and other namespace hacks. There is no
   corresponding 'close' hook, but a shlex instance will call the
   :meth:`~io.IOBase.close` method of the sourced input stream when it returns
   EOF.

   For more explicit control of source stacking, use the :meth:`push_source` and
   :meth:`pop_source` methods.


.. method:: shlex.push_source(stream[, filename])

   Push an input source stream onto the input stack.  If the filename argument is
   specified it will later be available for use in error messages.  This is the
   same method used internally by the :meth:`sourcehook` method.

   .. versionadded:: 2.1


.. method:: shlex.pop_source()

   Pop the last-pushed input source from the input stack. This is the same method
   used internally when the lexer reaches EOF on a stacked input stream.

   .. versionadded:: 2.1


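The sketch below stacks a second input stream with :meth:`push_source` and
lets the lexer pop it again when that stream is exhausted. The stream contents
and the name ``inner.cfg`` are invented for illustration, and ``io.StringIO``
is assumed as the in-memory stream type:

```python
import io
import shlex

lexer = shlex.shlex("outer1 outer2")      # hypothetical outer input
tokens = [lexer.get_token()]              # 'outer1'

# Stack a second stream; "inner.cfg" is a made-up name for error messages.
lexer.push_source(io.StringIO(u"inner1 inner2"), "inner.cfg")
name_while_stacked = lexer.infile         # 'inner.cfg'

# Drain the inner stream first, then the rest of the outer one.
tokens.extend(lexer)
# tokens -> ['outer1', 'inner1', 'inner2', 'outer2']
```

The lexer pops the stacked stream by itself at EOF, so an explicit
:meth:`pop_source` call is only needed to abandon a stream early.
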
.. method:: shlex.error_leader([file[, line]])

   This method generates an error message leader in the format of a Unix C compiler
   error label; the format is ``'"%s", line %d: '``, where the ``%s`` is replaced
   with the name of the current source file and the ``%d`` with the current input
   line number (the optional arguments can be used to override these).

   This convenience is provided to encourage :mod:`shlex` users to generate error
   messages in the standard, parseable format understood by Emacs and other Unix
   tools.

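As an illustration of the leader format, the sketch below builds an error
message for a lexer reading a single-line input. The file names and the
message text are hypothetical:

```python
import shlex

lexer = shlex.shlex("first second", "settings.rc")  # made-up input and name
token = lexer.get_token()                            # 'first'

# Uses the lexer's current infile and lineno.
leader = lexer.error_leader()
# -> '"settings.rc", line 1: '

# Both parts can be overridden explicitly.
overridden = lexer.error_leader("other.rc", 7)
# -> '"other.rc", line 7: '

message = leader + "unexpected token %r" % token
```
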
Instances of :class:`~shlex.shlex` subclasses have some public instance
variables which either control lexical analysis or can be used for debugging:


.. attribute:: shlex.commenters

   The string of characters that are recognized as comment beginners. All
   characters from the comment beginner to end of line are ignored. Includes just
   ``'#'`` by default.


.. attribute:: shlex.wordchars

   The string of characters that will accumulate into multi-character tokens.  By
   default, includes all ASCII alphanumerics and underscore.


.. attribute:: shlex.whitespace

   Characters that will be considered whitespace and skipped.  Whitespace bounds
   tokens.  By default, includes space, tab, linefeed and carriage-return.


.. attribute:: shlex.escape

   Characters that will be considered as escape characters. This is only used
   in POSIX mode, and includes just ``'\'`` by default.

   .. versionadded:: 2.3


.. attribute:: shlex.quotes

   Characters that will be considered string quotes.  The token accumulates until
   the same quote is encountered again (thus, different quote types protect each
   other as in the shell).  By default, includes ASCII single and double quotes.


.. attribute:: shlex.escapedquotes

   Characters in :attr:`quotes` that will interpret escape characters defined in
   :attr:`escape`.  This is only used in POSIX mode, and includes just ``'"'`` by
   default.

   .. versionadded:: 2.3


.. attribute:: shlex.whitespace_split

   If ``True``, tokens will only be split on whitespace. This is useful, for
   example, for parsing command lines with :class:`~shlex.shlex`, getting
   tokens in a similar way to shell arguments.

   .. versionadded:: 2.3


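To illustrate how :attr:`wordchars` and :attr:`whitespace_split` affect
tokenization, the sketch below runs the same made-up input through three
differently configured lexers. The set of extra word characters is an
assumption chosen to fit this particular input:

```python
import shlex

text = "--opt=value ~/dir/file.txt"   # hypothetical command-line fragment

# Defaults: characters outside wordchars come back one at a time.
default_tokens = list(shlex.shlex(text))
# -> ['-', '-', 'opt', '=', 'value', '~', '/', 'dir', '/', 'file', '.', 'txt']

# Option 1: extend wordchars with the extra characters we expect.
extended = shlex.shlex(text)
extended.wordchars += "-=~/."
extended_tokens = list(extended)
# -> ['--opt=value', '~/dir/file.txt']

# Option 2: split on whitespace only.
ws_only = shlex.shlex(text)
ws_only.whitespace_split = True
ws_only_tokens = list(ws_only)
# -> ['--opt=value', '~/dir/file.txt']
```
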
.. attribute:: shlex.infile

   The name of the current input file, as initially set at class instantiation time
   or stacked by later source requests.  It may be useful to examine this when
   constructing error messages.


.. attribute:: shlex.instream

   The input stream from which this :class:`~shlex.shlex` instance is reading
   characters.


.. attribute:: shlex.source

   This attribute is ``None`` by default.  If you assign a string to it, that
   string will be recognized as a lexical-level inclusion request similar to the
   ``source`` keyword in various shells.  That is, the immediately following token
   will be opened as a filename and input will be taken from that stream until
   EOF, at which point the :meth:`~io.IOBase.close` method of that stream will be
   called and the input source will again become the original input stream.
   Source requests may be stacked any number of levels deep.


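A sketch of a source request in action follows. To keep the example
self-contained, it overrides :meth:`sourcehook` in a hypothetical subclass
(``IncludingLexer``) so that the included "file" comes from an in-memory
dictionary (``FAKE_FILES``) instead of the file system. Both names and the
``include`` keyword are assumptions made for illustration:

```python
import io
import shlex

# Stand-in for files on disk; the name and contents are made up.
FAKE_FILES = {"extra.cfg": u"x y"}


class IncludingLexer(shlex.shlex):
    """Hypothetical lexer that resolves inclusions from FAKE_FILES."""

    def sourcehook(self, newfile):
        # Strip the quotes, then return (filename, open file-like object).
        name = newfile.strip('"')
        return name, io.StringIO(FAKE_FILES[name])


lexer = IncludingLexer('a include "extra.cfg" b')
lexer.source = "include"   # the token that triggers an inclusion
tokens = list(lexer)
# -> ['a', 'x', 'y', 'b']
```
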
.. attribute:: shlex.debug

   If this attribute is numeric and ``1`` or more, a :class:`~shlex.shlex`
   instance will print verbose progress output on its behavior.  If you need
   to use this, you can read the module source code to learn the details.


.. attribute:: shlex.lineno

   Source line number (count of newlines seen so far plus one).


.. attribute:: shlex.token

   The token buffer.  It may be useful to examine this when catching exceptions.


.. attribute:: shlex.eof

   Token used to determine end of file. This will be set to the empty string
   (``''``) in non-POSIX mode, and to ``None`` in POSIX mode.

   .. versionadded:: 2.3


.. _shlex-parsing-rules:

Parsing Rules
-------------

When operating in non-POSIX mode, :class:`~shlex.shlex` will try to obey the
following rules.

* Quote characters are not recognized within words (``Do"Not"Separate`` is
  parsed as the single word ``Do"Not"Separate``);

* Escape characters are not recognized;

* Enclosing characters in quotes preserve the literal value of all characters
  within the quotes;

* Closing quotes separate words (``"Do"Separate`` is parsed as ``"Do"`` and
  ``Separate``);

* If :attr:`~shlex.whitespace_split` is ``False``, any character not
  declared to be a word character, whitespace, or a quote will be returned as
  a single-character token. If it is ``True``, :class:`~shlex.shlex` will only
  split words on whitespace;

* EOF is signaled with an empty string (``''``);

* It's not possible to parse empty strings, even if quoted.

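Some of the non-POSIX rules above can be checked with a short sketch. The
helper ``tokens`` is a hypothetical convenience function, not part of the
module:

```python
import shlex


def tokens(text):
    """Tokenize *text* with a default (non-POSIX) lexer."""
    return list(shlex.shlex(text))


inner_quotes = tokens('Do"Not"Separate')   # ['Do"Not"Separate']
closing_quote = tokens('"Do"Separate')     # ['"Do"', 'Separate']
no_escapes = tokens(r'a\ b')               # ['a', '\\', 'b']
punctuation = tokens('a+b')                # ['a', '+', 'b']
```
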
When operating in POSIX mode, :class:`~shlex.shlex` will try to obey the
following parsing rules.

* Quotes are stripped out, and do not separate words (``"Do"Not"Separate"`` is
  parsed as the single word ``DoNotSeparate``);

* Non-quoted escape characters (e.g. ``'\'``) preserve the literal value of the
  next character that follows;

* Enclosing characters in quotes which are not part of
  :attr:`~shlex.escapedquotes` (e.g. ``"'"``) preserve the literal value
  of all characters within the quotes;

* Enclosing characters in quotes which are part of
  :attr:`~shlex.escapedquotes` (e.g. ``'"'``) preserve the literal value
  of all characters within the quotes, with the exception of the characters
  mentioned in :attr:`~shlex.escape`.  An escape character retains its
  special meaning only when followed by the quote in use, or the escape
  character itself. Otherwise the escape character will be considered a
  normal character;

* EOF is signaled with a :const:`None` value;

* Quoted empty strings (``''``) are allowed.
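
The POSIX rules above can be illustrated with :func:`split`, which operates in
POSIX mode by default. This is only a sketch with made-up inputs. Raw string
literals are used so that the backslashes reach the lexer unchanged:

```python
import shlex

stripped = shlex.split('"Do"Not"Separate"')   # ['DoNotSeparate']
escaped_space = shlex.split(r'a\ b')          # ['a b']

# Single quotes are not in escapedquotes: the backslash stays literal.
single_quoted = shlex.split(r"'a\"b'")        # ['a\\"b']

# Double quotes are in escapedquotes: a backslash before the quote in use
# is consumed, but before any other character it is kept.
double_quoted = shlex.split(r'"a\"b"')        # ['a"b']
other_escape = shlex.split(r'"a\nb"')         # ['a\\nb']

# A quoted empty string yields an empty token.
empty_string = shlex.split("a '' b")          # ['a', '', 'b']
```
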
    319