:mod:`shlex` --- Simple lexical analysis
========================================

.. module:: shlex
   :synopsis: Simple lexical analysis for Unix shell-like languages.

.. moduleauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. moduleauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>
.. sectionauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. sectionauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>

**Source code:** :source:`Lib/shlex.py`

--------------

The :class:`~shlex.shlex` class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the Unix shell.  This will often be useful
for writing minilanguages (for example, in run control files for Python
applications) or for parsing quoted strings.

The :mod:`shlex` module defines the following functions:


.. function:: split(s, comments=False, posix=True)

   Split the string *s* using shell-like syntax. If *comments* is :const:`False`
   (the default), the parsing of comments in the given string will be disabled
   (setting the :attr:`~shlex.commenters` attribute of the
   :class:`~shlex.shlex` instance to the empty string).  This function operates
   in POSIX mode by default, but uses non-POSIX mode if the *posix* argument is
   false.

   .. note::

      Since the :func:`split` function instantiates a :class:`~shlex.shlex`
      instance, passing ``None`` for *s* will read the string to split from
      standard input.

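As a minimal sketch of how the *comments* and *posix* arguments of
:func:`split` interact, the calls below tokenize one command line three ways.
The command string is made up for illustration only.

```python
import shlex

# Hypothetical command line, used only for illustration.
line = 'cp "my file.txt" /tmp # backup copy'

# Default: POSIX mode with comment parsing disabled, so '#' is an ordinary token.
default_tokens = shlex.split(line)
# ['cp', 'my file.txt', '/tmp', '#', 'backup', 'copy']

# comments=True: everything from '#' to the end of the line is dropped.
no_comment_tokens = shlex.split(line, comments=True)
# ['cp', 'my file.txt', '/tmp']

# posix=False: the quotes stay part of the token.
compat_tokens = shlex.split(line, comments=True, posix=False)
# ['cp', '"my file.txt"', '/tmp']
```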

.. function:: quote(s)

   Return a shell-escaped version of the string *s*.  The returned value is a
   string that can safely be used as one token in a shell command line, for
   cases where you cannot use a list.

   This idiom would be unsafe::

      >>> filename = 'somefile; rm -rf ~'
      >>> command = 'ls -l {}'.format(filename)
      >>> print(command)  # executed by a shell: boom!
      ls -l somefile; rm -rf ~

   :func:`quote` lets you plug the security hole::

      >>> from shlex import quote
      >>> command = 'ls -l {}'.format(quote(filename))
      >>> print(command)
      ls -l 'somefile; rm -rf ~'
      >>> remote_command = 'ssh home {}'.format(quote(command))
      >>> print(remote_command)
      ssh home 'ls -l '"'"'somefile; rm -rf ~'"'"''

   The quoting is compatible with UNIX shells and with :func:`split`:

      >>> from shlex import split
      >>> remote_command = split(remote_command)
      >>> remote_command
      ['ssh', 'home', "ls -l 'somefile; rm -rf ~'"]
      >>> command = split(remote_command[-1])
      >>> command
      ['ls', '-l', 'somefile; rm -rf ~']

   .. versionadded:: 3.3

The :mod:`shlex` module defines the following class:


.. class:: shlex(instream=None, infile=None, posix=False, punctuation_chars=False)

   A :class:`~shlex.shlex` instance or subclass instance is a lexical analyzer
   object.  The *instream* argument, if present, specifies where to read
   characters from.  It must be a file-/stream-like object with
   :meth:`~io.TextIOBase.read` and :meth:`~io.TextIOBase.readline` methods, or
   a string.  If no argument is given, input will be taken from ``sys.stdin``.

   The second optional argument, *infile*, is a filename string, which sets the
   initial value of the :attr:`~shlex.infile` attribute.  If the *instream*
   argument is omitted or equal to ``sys.stdin``, this second argument
   defaults to "stdin".

   The *posix* argument defines the operational mode: when *posix* is false
   (the default), the :class:`~shlex.shlex` instance will operate in
   compatibility mode.  When operating in POSIX mode, :class:`~shlex.shlex`
   will try to be as close as possible to the POSIX shell parsing rules.

   The *punctuation_chars* argument provides a way to make the behaviour even
   closer to how real shells parse.  This can take a number of values: the
   default value, ``False``, preserves the behaviour seen under Python 3.5 and
   earlier.  If set to ``True``, then parsing of the characters ``();<>|&`` is
   changed: any run of these characters (considered punctuation characters) is
   returned as a single token.  If set to a non-empty string of characters,
   those characters will be used as the punctuation characters.  Any characters
   in the :attr:`wordchars` attribute that appear in *punctuation_chars* will
   be removed from :attr:`wordchars`.  See :ref:`improved-shell-compatibility`
   for more information.

   .. versionchanged:: 3.6
      The *punctuation_chars* parameter was added.

.. seealso::

   Module :mod:`configparser`
      Parser for configuration files similar to the Windows :file:`.ini` files.

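The constructor arguments of :class:`~shlex.shlex` can be tried out with a
short sketch like the one below.  The input text and the name ``demo.cfg`` are
hypothetical; nothing is read from disk.

```python
import io
import shlex

# A string is accepted directly; a file-like object works the same way.
from_string = shlex.shlex('x = 1')
from_stream = shlex.shlex(io.StringIO('x = 1'), infile='demo.cfg')  # hypothetical name

string_tokens = list(from_string)    # ['x', '=', '1']
stream_tokens = list(from_stream)    # ['x', '=', '1']

# infile is only recorded, for example for use in error messages.
recorded_name = from_stream.infile   # 'demo.cfg'

# posix selects the operational mode and, with it, the end-of-file marker.
compat_eof = shlex.shlex('').eof              # ''
posix_eof = shlex.shlex('', posix=True).eof   # None
```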

.. _shlex-objects:

shlex Objects
-------------

A :class:`~shlex.shlex` instance has the following methods:


.. method:: shlex.get_token()

   Return a token.  If tokens have been stacked using :meth:`push_token`, pop a
   token off the stack.  Otherwise, read one from the input stream.  If reading
   encounters an immediate end-of-file, :attr:`eof` is returned (the empty
   string (``''``) in non-POSIX mode, and ``None`` in POSIX mode).


.. method:: shlex.push_token(str)

   Push the argument onto the token stack.

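A common pattern built on :meth:`get_token` and :meth:`push_token` is a
one-token lookahead followed by a read loop that stops at :attr:`eof`.  The
following is a minimal sketch with an arbitrary input string:

```python
import shlex

lexer = shlex.shlex('alpha beta', posix=True)

# Peek at the next token, then push it back so it is returned again.
peeked = lexer.get_token()      # 'alpha'
lexer.push_token(peeked)

tokens = []
while True:
    token = lexer.get_token()
    if token == lexer.eof:      # None in POSIX mode, '' in non-POSIX mode
        break
    tokens.append(token)
# tokens is now ['alpha', 'beta']
```

Comparing against ``lexer.eof`` rather than a hard-coded value keeps the loop
correct in both operational modes.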

.. method:: shlex.read_token()

   Read a raw token.  Ignore the pushback stack, and do not interpret source
   requests.  (This is not ordinarily a useful entry point, and is documented here
   only for the sake of completeness.)


.. method:: shlex.sourcehook(filename)

   When :class:`~shlex.shlex` detects a source request (see :attr:`source`
   below), this method is given the following token as argument, and is expected
   to return a tuple consisting of a filename and an open file-like object.

   Normally, this method first strips any quotes off the argument.  If the result
   is an absolute pathname, or there was no previous source request in effect, or
   the previous source was a stream (such as ``sys.stdin``), the result is left
   alone.  Otherwise, if the result is a relative pathname, the directory part of
   the name of the file immediately before it on the source inclusion stack is
   prepended (this behavior is like the way the C preprocessor handles ``#include
   "file.h"``).

   The result of the manipulations is treated as a filename, and returned as the
   first component of the tuple, with :func:`open` called on it to yield the second
   component. (Note: this is the reverse of the order of arguments in instance
   initialization!)

   This hook is exposed so that you can use it to implement directory search paths,
   addition of file extensions, and other namespace hacks. There is no
   corresponding 'close' hook, but a :class:`~shlex.shlex` instance will call the
   :meth:`~io.IOBase.close` method of the sourced input stream when it returns
   EOF.

   For more explicit control of source stacking, use the :meth:`push_source` and
   :meth:`pop_source` methods.

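One way to use :meth:`sourcehook` is to override it in a subclass.  The sketch
below resolves source requests from an in-memory dictionary instead of the
file system.  The names ``MemoryLexer`` and ``SNIPPETS`` and the ``include``
keyword are hypothetical and chosen only for this example.

```python
import io
import shlex

# Hypothetical in-memory "files", keyed by name.
SNIPPETS = {'common.rc': 'verbose on'}

class MemoryLexer(shlex.shlex):
    """Lexer whose source requests are served from SNIPPETS."""

    def sourcehook(self, newfile):
        # In non-POSIX mode the token still carries its quotes.
        name = newfile.strip('"')
        # Return (filename, open stream), in that order.
        return name, io.StringIO(SNIPPETS[name])

lexer = MemoryLexer('include "common.rc" color red')
lexer.source = 'include'   # the keyword that triggers a source request
tokens = list(lexer)
# ['verbose', 'on', 'color', 'red']
```

The tokens of the included text appear in place of the request, after which
lexing resumes in the original input.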

.. method:: shlex.push_source(newstream, newfile=None)

   Push an input source stream onto the input stack.  If the *newfile* argument
   is specified, it will later be available for use in error messages.  This is
   the same method used internally by the :meth:`sourcehook` method.


.. method:: shlex.pop_source()

   Pop the last-pushed input source from the input stack. This is the same method
   used internally when the lexer reaches EOF on a stacked input stream.

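The effect of stacking a source explicitly with :meth:`push_source` can be
observed through :attr:`infile`.  This is a small sketch; the stream contents
and the names ``outer.rc`` and ``inner.rc`` are made up.

```python
import io
import shlex

lexer = shlex.shlex('outer1 outer2', infile='outer.rc')
first = lexer.get_token()                 # 'outer1'

# Stack a second stream; tokens now come from it.
lexer.push_source(io.StringIO('inner'), 'inner.rc')
name_while_stacked = lexer.infile         # 'inner.rc'
second = lexer.get_token()                # 'inner'

# Reaching EOF on the stacked stream pops it and resumes the outer one.
third = lexer.get_token()                 # 'outer2'
name_after_pop = lexer.infile             # 'outer.rc'
```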

.. method:: shlex.error_leader(infile=None, lineno=None)

   This method generates an error message leader in the format of a Unix C compiler
   error label; the format is ``'"%s", line %d: '``, where the ``%s`` is replaced
   with the name of the current source file and the ``%d`` with the current input
   line number (the optional arguments can be used to override these).

   This convenience is provided to encourage :mod:`shlex` users to generate error
   messages in the standard, parseable format understood by Emacs and other Unix
   tools.

Instances of :class:`~shlex.shlex` subclasses have some public instance
variables which either control lexical analysis or can be used for debugging:


.. attribute:: shlex.commenters

   The string of characters that are recognized as comment beginners. All
   characters from the comment beginner to end of line are ignored. Includes just
   ``'#'`` by default.


.. attribute:: shlex.wordchars

   The string of characters that will accumulate into multi-character tokens.  By
   default, includes all ASCII alphanumerics and underscore.  In POSIX mode, the
   accented characters in the Latin-1 set are also included.  If
   :attr:`punctuation_chars` is not empty, the characters ``~-./*?=``, which can
   appear in filename specifications and command line parameters, will also be
   included in this attribute, and any characters which appear in
   ``punctuation_chars`` will be removed from ``wordchars`` if they are present
   there.


.. attribute:: shlex.whitespace

   Characters that will be considered whitespace and skipped.  Whitespace bounds
   tokens.  By default, includes space, tab, linefeed and carriage-return.


.. attribute:: shlex.escape

   Characters that will be considered as escape. This is only used in POSIX
   mode, and includes just ``'\'`` by default.


.. attribute:: shlex.quotes

   Characters that will be considered string quotes.  The token accumulates until
   the same quote is encountered again (thus, different quote types protect each
   other as in the shell).  By default, includes ASCII single and double quotes.


.. attribute:: shlex.escapedquotes

   Characters in :attr:`quotes` that will interpret escape characters defined in
   :attr:`escape`.  This is only used in POSIX mode, and includes just ``'"'`` by
   default.


.. attribute:: shlex.whitespace_split

   If ``True``, tokens will only be split on whitespace.  This is useful, for
   example, for parsing command lines with :class:`~shlex.shlex`, getting
   tokens in a similar way to shell arguments.  If this attribute is ``True``,
   :attr:`punctuation_chars` will have no effect, and splitting will happen
   only on whitespace.  When using :attr:`punctuation_chars`, which is
   intended to provide parsing closer to that implemented by shells, it is
   advisable to leave ``whitespace_split`` as ``False`` (the default value).

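The attributes :attr:`wordchars`, :attr:`whitespace_split` and
:attr:`commenters` can be adjusted on an instance before any tokens are read.
The sketch below, using arbitrary input strings, shows how each one changes the
result:

```python
import shlex

text = 'ls -l /usr/local/bin'

# Default wordchars: '-' and '/' are not word characters, so each one
# is returned as a single-character token.
default_tokens = list(shlex.shlex(text))
# ['ls', '-', 'l', '/', 'usr', '/', 'local', '/', 'bin']

# Option 1: extend wordchars so that '-' and '/' accumulate into words.
lexer = shlex.shlex(text)
lexer.wordchars += '-/'
wordchar_tokens = list(lexer)        # ['ls', '-l', '/usr/local/bin']

# Option 2: split on whitespace only.
lexer = shlex.shlex(text)
lexer.whitespace_split = True
whitespace_tokens = list(lexer)      # ['ls', '-l', '/usr/local/bin']

# commenters can be replaced as well, here with an INI-style ';'.
lexer = shlex.shlex('key = value ; trailing note')
lexer.commenters = ';'
ini_tokens = list(lexer)             # ['key', '=', 'value']
```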

.. attribute:: shlex.infile

   The name of the current input file, as initially set at class instantiation time
   or stacked by later source requests.  It may be useful to examine this when
   constructing error messages.

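As a sketch of how :attr:`infile` feeds into error reporting, the following
catches the :exc:`ValueError` raised for an unterminated quote and prefixes it
with the output of :meth:`error_leader`.  The file names are hypothetical and
no file is opened.

```python
import shlex

# 'settings.rc' is a made-up name; the text contains an unterminated quote.
lexer = shlex.shlex('name = "unterminated', infile='settings.rc')

try:
    list(lexer)
except ValueError as exc:
    # error_leader() uses the current infile and lineno by default.
    message = lexer.error_leader() + str(exc)
# message is '"settings.rc", line 1: No closing quotation'

# Both values can be overridden explicitly.
custom_leader = lexer.error_leader('other.rc', 12)
# '"other.rc", line 12: '
```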

.. attribute:: shlex.instream

   The input stream from which this :class:`~shlex.shlex` instance is reading
   characters.


.. attribute:: shlex.source

   This attribute is ``None`` by default.  If you assign a string to it, that
   string will be recognized as a lexical-level inclusion request similar to the
   ``source`` keyword in various shells.  That is, the immediately following token
   will be opened as a filename and input will be taken from that stream until
   EOF, at which point the :meth:`~io.IOBase.close` method of that stream will be
   called and the input source will again become the original input stream.  Source
   requests may be stacked any number of levels deep.


.. attribute:: shlex.debug

   If this attribute is numeric and ``1`` or more, a :class:`~shlex.shlex`
   instance will print verbose progress output on its behavior.  If you need
   to use this, you can read the module source code to learn the details.


.. attribute:: shlex.lineno

   Source line number (count of newlines seen so far plus one).


.. attribute:: shlex.token

   The token buffer.  It may be useful to examine this when catching exceptions.


.. attribute:: shlex.eof

   Token used to determine end of file. This will be set to the empty string
   (``''``) in non-POSIX mode, and to ``None`` in POSIX mode.


.. attribute:: shlex.punctuation_chars

   Characters that will be considered punctuation. Runs of punctuation
   characters will be returned as a single token. However, note that no
   semantic validity checking will be performed: for example, ``'>>>'`` could be
   returned as a token, even though it may not be recognised as such by shells.

   .. versionadded:: 3.6


.. _shlex-parsing-rules:

Parsing Rules
-------------

When operating in non-POSIX mode, :class:`~shlex.shlex` will try to obey the
following rules.

* Quote characters are not recognized within words (``Do"Not"Separate`` is
  parsed as the single word ``Do"Not"Separate``);

* Escape characters are not recognized;

* Enclosing characters in quotes preserve the literal value of all characters
  within the quotes;

* Closing quotes separate words (``"Do"Separate`` is parsed as ``"Do"`` and
  ``Separate``);

* If :attr:`~shlex.whitespace_split` is ``False``, any character not
  declared to be a word character, whitespace, or a quote will be returned as
  a single-character token. If it is ``True``, :class:`~shlex.shlex` will only
  split words on whitespace;

* EOF is signaled with an empty string (``''``);

* It's not possible to parse empty strings, even if quoted.

When operating in POSIX mode, :class:`~shlex.shlex` will try to obey the
following parsing rules.

* Quotes are stripped out, and do not separate words (``"Do"Not"Separate"`` is
  parsed as the single word ``DoNotSeparate``);

* Non-quoted escape characters (e.g. ``'\'``) preserve the literal value of the
  next character that follows;

* Enclosing characters in quotes which are not part of
  :attr:`~shlex.escapedquotes` (e.g. ``"'"``) preserve the literal value
  of all characters within the quotes;

* Enclosing characters in quotes which are part of
  :attr:`~shlex.escapedquotes` (e.g. ``'"'``) preserve the literal value
  of all characters within the quotes, with the exception of the characters
  mentioned in :attr:`~shlex.escape`.  The escape characters retain their
  special meaning only when followed by the quote in use, or the escape
  character itself. Otherwise the escape character will be considered a
  normal character.

* EOF is signaled with a :const:`None` value;

* Quoted empty strings (``''``) are allowed.

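The two rule sets above can be compared side by side with a short sketch.  The
helper name ``tokens`` is introduced only for this example.

```python
import shlex

def tokens(text, posix):
    """Return all tokens of *text* in the requested mode."""
    return list(shlex.shlex(text, posix=posix))

# Quotes inside and after words.
in_word = tokens('Do"Not"Separate', posix=False)       # ['Do"Not"Separate']
closing = tokens('"Do"Separate', posix=False)          # ['"Do"', 'Separate']
stripped = tokens('"Do"Not"Separate"', posix=True)     # ['DoNotSeparate']

# Quoted empty strings: the quotes stay in the token in non-POSIX mode,
# while POSIX mode yields a real empty string.
empty_compat = tokens("a '' b", posix=False)           # ['a', "''", 'b']
empty_posix = tokens("a '' b", posix=True)             # ['a', '', 'b']

# Escapes in POSIX mode: outside quotes the next character is kept
# literally; inside double quotes only the quote and the escape
# character itself can be escaped.
escaped_space = tokens(r'a\ b', posix=True)            # ['a b']
in_quotes = tokens(r'"say \"hi\" \n"', posix=True)     # ['say "hi" \\n']
```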

.. _improved-shell-compatibility:

Improved Compatibility with Shells
----------------------------------

.. versionadded:: 3.6

The :class:`~shlex.shlex` class provides compatibility with the parsing
performed by common Unix shells like ``bash``, ``dash``, and ``sh``.  To take
advantage of this compatibility, specify the ``punctuation_chars`` argument in
the constructor.  This defaults to ``False``, which preserves pre-3.6 behaviour.
However, if it is set to ``True``, then parsing of the characters ``();<>|&``
is changed: any run of these characters is returned as a single token.  While
this is short of a full parser for shells (which would be out of scope for the
standard library, given the multiplicity of shells out there), it does allow
you to perform processing of command lines more easily than you could
otherwise.  To illustrate, you can see the difference in the following snippet:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

    >>> import shlex
    >>> text = "a && b; c && d || e; f >'abc'; (def \"ghi\")"
    >>> list(shlex.shlex(text))
    ['a', '&', '&', 'b', ';', 'c', '&', '&', 'd', '|', '|', 'e', ';', 'f', '>',
    "'abc'", ';', '(', 'def', '"ghi"', ')']
    >>> list(shlex.shlex(text, punctuation_chars=True))
    ['a', '&&', 'b', ';', 'c', '&&', 'd', '||', 'e', ';', 'f', '>', "'abc'",
    ';', '(', 'def', '"ghi"', ')']

Of course, tokens will be returned which are not valid for shells, and you'll
need to implement your own error checks on the returned tokens.

Instead of passing ``True`` as the value for the *punctuation_chars* parameter,
you can pass a string with specific characters, which will be used to determine
which characters constitute punctuation. For example::

    >>> import shlex
    >>> s = shlex.shlex("a && b || c", punctuation_chars="|")
    >>> list(s)
    ['a', '&', '&', 'b', '||', 'c']

.. note:: When ``punctuation_chars`` is specified, the :attr:`~shlex.wordchars`
   attribute is augmented with the characters ``~-./*?=``.  That is because these
   characters can appear in file names (including wildcards) and command-line
   arguments (e.g. ``--color=auto``). Hence::

      >>> import shlex
      >>> s = shlex.shlex('~/a && b-c --color=auto || d *.py?',
      ...                 punctuation_chars=True)
      >>> list(s)
      ['~/a', '&&', 'b-c', '--color=auto', '||', 'd', '*.py?']

For best effect, ``punctuation_chars`` should be set in conjunction with
``posix=True``. (Note that ``posix=False`` is the default for
:class:`~shlex.shlex`.)

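To see why combining ``punctuation_chars`` with ``posix=True`` is recommended,
the following sketch tokenizes the same made-up command line in both modes:

```python
import shlex

text = "a && b > 'out file.txt'"

# Non-POSIX (the default): operators are grouped, but quotes are kept.
compat_tokens = list(shlex.shlex(text, punctuation_chars=True))
# ['a', '&&', 'b', '>', "'out file.txt'"]

# POSIX mode: operators are grouped and quotes are stripped, as a shell would.
posix_tokens = list(shlex.shlex(text, posix=True, punctuation_chars=True))
# ['a', '&&', 'b', '>', 'out file.txt']
```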