:mod:`email.parser`: Parsing email messages
-------------------------------------------

.. module:: email.parser
   :synopsis: Parse flat text email messages to produce a message object structure.


Message object structures can be created in one of two ways: they can be created
from whole cloth by instantiating :class:`~email.message.Message` objects and
stringing them together via :meth:`~email.message.Message.attach` and
:meth:`~email.message.Message.set_payload` calls, or they
can be created by parsing a flat text representation of the email message.

The :mod:`email` package provides a standard parser that understands most email
document structures, including MIME documents.  You can pass the parser a string
or a file object, and the parser will return to you the root
:class:`~email.message.Message` instance of the object structure.  For simple,
non-MIME messages the payload of this root object will likely be a string
containing the text of the message.  For MIME messages, the root object will
return ``True`` from its :meth:`~email.message.Message.is_multipart` method, and
the subparts can be accessed via the :meth:`~email.message.Message.get_payload`
and :meth:`~email.message.Message.walk` methods.
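
As a rough illustration of the two cases just described, the following sketch
parses one non-MIME message and one MIME message from strings.  The addresses,
subjects and the ``XYZ`` boundary are made-up sample data, not anything defined
by the :mod:`email` package.

```python
import email

# A simple, non-MIME message: headers, a blank line, then the body.
plain_text = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Just a short note.\n"
)
plain = email.message_from_string(plain_text)
plain_body = plain.get_payload()          # the body text, as a string

# A MIME message with two subparts (sample data).
mime_text = (
    "From: alice@example.com\n"
    "Subject: Report\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "See the attachment.\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>See the attachment.</p>\n"
    "--XYZ--\n"
)
mime = email.message_from_string(mime_text)
subparts = mime.get_payload()             # a list of sub-message objects

# walk() visits the root object first, then each subpart in order.
part_types = [part.get_content_type() for part in mime.walk()]
```

For the plain message, ``is_multipart()`` is false and ``get_payload()`` returns
the body text.  For the MIME message, ``is_multipart()`` is true and
``get_payload()`` returns the list of subparts.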

Two parser interfaces are available: the classic :class:`Parser` API and the
incremental :class:`FeedParser` API.  The classic :class:`Parser` API is fine if
you have the entire text of the message in memory as a string, or if the entire
message lives in a file on the file system.  :class:`FeedParser` is more
appropriate when you're reading the message from a stream which might block
waiting for more input (e.g. reading an email message from a socket).  The
:class:`FeedParser` can consume and parse the message incrementally, and only
returns the root object when you close the parser [#]_.

Note that the parser can be extended in limited ways, and of course you can
implement your own parser completely from scratch.  There is no magical
connection between the :mod:`email` package's bundled parser and the
:class:`~email.message.Message` class, so your custom parser can create message
object trees any way it finds necessary.
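
One of the limited extension points is the message factory accepted by both
parser classes (documented below).  A minimal sketch, assuming a hypothetical
``TaggedMessage`` subclass and ``summary()`` helper that exist only for this
illustration:

```python
from email.message import Message
from email.parser import Parser


class TaggedMessage(Message):
    """Hypothetical Message subclass, defined only for this example."""

    def summary(self):
        # Combine the content type with the Subject header, if any.
        return "%s | %s" % (self.get_content_type(),
                            self.get("Subject", "(no subject)"))


# Sample message text; the boundary string is arbitrary.
text = (
    "Subject: Factory demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "Body text.\n"
    "--XYZ--\n"
)

# The factory is used for the root message and for every subpart.
msg = Parser(_class=TaggedMessage).parsestr(text)
all_tagged = all(isinstance(part, TaggedMessage) for part in msg.walk())
root_summary = msg.summary()
```

Because the factory creates every object in the tree, the helper method is
available on the root message and on each subpart alike.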


FeedParser API
^^^^^^^^^^^^^^

.. versionadded:: 2.4

The :class:`FeedParser`, imported from the :mod:`email.feedparser` module,
provides an API that is conducive to incremental parsing of email messages, such
as would be necessary when reading the text of an email message from a source
that can block (e.g. a socket).  The :class:`FeedParser` can of course be used
to parse an email message fully contained in a string or a file, but the classic
:class:`Parser` API may be more convenient for such use cases.  The semantics
and results of the two parser APIs are identical.

The :class:`FeedParser`'s API is simple; you create an instance, feed it a bunch
of text until there's no more to feed it, then close the parser to retrieve the
root message object.  The :class:`FeedParser` is extremely accurate when parsing
standards-compliant messages, and it does a very good job of parsing
non-compliant messages, providing information about how a message was deemed
broken.  It will populate a message object's *defects* attribute with a list of
any problems it found in a message.  See the :mod:`email.errors` module for the
list of defects that it can find.

Here is the API for the :class:`FeedParser`:


.. class:: FeedParser([_factory])

   Create a :class:`FeedParser` instance.  Optional *_factory* is a no-argument
   callable that will be called whenever a new message object is needed.  It
   defaults to the :class:`email.message.Message` class.


   .. method:: feed(data)

      Feed the :class:`FeedParser` some more data.  *data* should be a string
      containing one or more lines.  The lines can be partial and the
      :class:`FeedParser` will stitch such partial lines together properly.  The
      lines in the string can have any of the three common line endings:
      carriage return, newline, or carriage return and newline (they can even be
      mixed).


   .. method:: close()

      Closing a :class:`FeedParser` completes the parsing of all previously fed
      data, and returns the root message object.  It is undefined what happens
      if you feed more data to a closed :class:`FeedParser`.
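
The feed-then-close cycle can be sketched as follows.  The ``chunks`` list
stands in for data arriving from a blocking source such as a socket; its
contents and split points are arbitrary sample data chosen to show partial
lines and mixed line endings.

```python
from email.feedparser import FeedParser

# Sample chunks: lines are split mid-way and use both "\r\n" and "\n".
chunks = [
    "From: alice@exam",
    "ple.com\r\nSubject: Incre",
    "mental parsing\n\nFirst line of the bo",
    "dy.\r\nSecond line.\n",
]

parser = FeedParser()
for chunk in chunks:
    # Partial lines are buffered until the rest of the line arrives.
    parser.feed(chunk)

# close() finishes parsing and returns the root message object.
msg = parser.close()
sender = msg["From"]
subject = msg["Subject"]
body_lines = msg.get_payload().splitlines()
```

The root object only becomes available from ``close()``; nothing is returned by
the individual ``feed()`` calls.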


Parser class API
^^^^^^^^^^^^^^^^

The :class:`Parser` class, imported from the :mod:`email.parser` module,
provides an API that can be used to parse a message when the complete contents
of the message are available in a string or file.  The :mod:`email.parser`
module also provides a second class, called :class:`HeaderParser`, which can be
used if you're only interested in the headers of the message.
:class:`HeaderParser` can be much faster in these situations, since it does not
attempt to parse the message body, instead setting the payload to the raw body
as a string. :class:`HeaderParser` has the same API as the :class:`Parser`
class.


.. class:: Parser([_class])

   The constructor for the :class:`Parser` class takes an optional argument
   *_class*.  This must be a callable factory (such as a function or a class), and
   it is used whenever a sub-message object needs to be created.  It defaults to
   :class:`~email.message.Message` (see :mod:`email.message`).  The factory will
   be called without arguments.

   The optional *strict* flag is ignored.

   .. deprecated:: 2.4
      Because the :class:`Parser` class is a backward compatible API wrapper
      around the new-in-Python 2.4 :class:`FeedParser`, *all* parsing is
      effectively non-strict.  You should simply stop passing a *strict* flag to
      the :class:`Parser` constructor.

   .. versionchanged:: 2.2.2
      The *strict* flag was added.

   .. versionchanged:: 2.4
      The *strict* flag was deprecated.

   The other public :class:`Parser` methods are:


   .. method:: parse(fp[, headersonly])

      Read all the data from the file-like object *fp*, parse the resulting
      text, and return the root message object.  *fp* must support both the
      :meth:`~io.TextIOBase.readline` and the :meth:`~io.TextIOBase.read`
      methods on file-like objects.

      The text contained in *fp* must be formatted as a block of :rfc:`2822`
      style headers and header continuation lines, optionally preceded by an
      envelope header.  The header block is terminated either by the end of the
      data or by a blank line.  Following the header block is the body of the
      message (which may contain MIME-encoded subparts).

      Optional *headersonly* is a flag specifying whether to stop parsing after
      reading the headers or not.  The default is ``False``, meaning it parses
      the entire contents of the file.

      .. versionchanged:: 2.2.2
         The *headersonly* flag was added.


   .. method:: parsestr(text[, headersonly])

      Similar to the :meth:`parse` method, except it takes a string object
      instead of a file-like object.  Calling this method on a string is exactly
      equivalent to wrapping *text* in a :class:`~StringIO.StringIO` instance first and
      calling :meth:`parse`.

      Optional *headersonly* is as with the :meth:`parse` method.

      .. versionchanged:: 2.2.2
         The *headersonly* flag was added.
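
The difference between a full parse and a headers-only parse can be sketched
like this.  The message text is made-up sample data; the variable names are
chosen only for this illustration.

```python
from email.parser import HeaderParser, Parser

# Sample multipart message text.
text = (
    "From: alice@example.com\n"
    "Subject: Status\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "All systems nominal.\n"
    "--XYZ--\n"
)

# Full parse: the body is split into sub-message objects.
full = Parser().parsestr(text)

# Headers-only parse: the body is left as one raw string payload.
headers_only = HeaderParser().parsestr(text)

# Passing headersonly=True to Parser gives the same kind of result.
same_via_flag = Parser().parsestr(text, headersonly=True)
```

All three objects expose the same headers.  Only ``full`` has a list payload;
the other two keep the unparsed body, boundary lines included, as a string.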

Since creating a message object structure from a string or a file object is such
a common task, two functions are provided as a convenience.  They are available
in the top-level :mod:`email` package namespace.

.. currentmodule:: email

.. function:: message_from_string(s[, _class[, strict]])

   Return a message object structure from a string.  This is exactly equivalent to
   ``Parser().parsestr(s)``.  Optional *_class* and *strict* are interpreted as
   with the :class:`~email.parser.Parser` class constructor.

   .. versionchanged:: 2.2.2
      The *strict* flag was added.


.. function:: message_from_file(fp[, _class[, strict]])

   Return a message object structure tree from an open file object.  This is
   exactly equivalent to ``Parser().parse(fp)``.  Optional *_class* and *strict*
   are interpreted as with the :class:`~email.parser.Parser` class constructor.

   .. versionchanged:: 2.2.2
      The *strict* flag was added.

Here's an example of how you might use :func:`message_from_string` at an
interactive Python prompt, where ``myString`` already holds the text of a
message::

   >>> import email
   >>> msg = email.message_from_string(myString)
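
The file-based convenience function works the same way.  In this sketch a
temporary file stands in for a message stored on disk; the file contents and
the ``.eml`` suffix are sample choices, not requirements of the API.

```python
import email
import os
import tempfile

text = "From: alice@example.com\nSubject: From a file\n\nStored on disk.\n"

# Write the sample message to a temporary file.
fd, path = tempfile.mkstemp(suffix=".eml")
try:
    with os.fdopen(fd, "w") as fp:
        fp.write(text)

    # Parse the message back from an open file object.
    with open(path) as fp:
        msg = email.message_from_file(fp)
finally:
    # Clean up the temporary file whether or not parsing succeeded.
    os.remove(path)
```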


Additional notes
^^^^^^^^^^^^^^^^

Here are some notes on the parsing semantics:

* Most non-\ :mimetype:`multipart` type messages are parsed as a single message
  object with a string payload.  These objects will return ``False`` for
  :meth:`~email.message.Message.is_multipart`.  Their
  :meth:`~email.message.Message.get_payload` method will return a string object.

* All :mimetype:`multipart` type messages will be parsed as a container message
  object with a list of sub-message objects for their payload.  The outer
  container message will return ``True`` for
  :meth:`~email.message.Message.is_multipart` and its
  :meth:`~email.message.Message.get_payload` method will return the list of
  :class:`~email.message.Message` subparts.

* Most messages with a content type of :mimetype:`message/\*` (e.g.
  :mimetype:`message/delivery-status` and :mimetype:`message/rfc822`) will also be
  parsed as a container object containing a list payload of length 1.  Their
  :meth:`~email.message.Message.is_multipart` method will return ``True``.
  The single element in the list payload will be a sub-message object.

* Some non-standards-compliant messages may not be internally consistent about
  their :mimetype:`multipart`\ -edness.  Such messages may have a
  :mailheader:`Content-Type` header of type :mimetype:`multipart`, but their
  :meth:`~email.message.Message.is_multipart` method may return ``False``.
  If such messages were parsed with the :class:`~email.parser.FeedParser`,
  they will have an instance of the
  :class:`~email.errors.MultipartInvariantViolationDefect` class in their
  *defects* attribute list.  See :mod:`email.errors` for details.
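
The last case in the list above can be reproduced with a small, deliberately
broken message.  This is only a sketch: the message text is sample data, and the
defect names are collected rather than assumed, since the exact set recorded
depends on how the message is broken.

```python
import email
from email import errors

# Claims to be multipart/mixed but supplies no boundary parameter and
# contains no MIME boundaries.
broken_text = (
    "From: alice@example.com\n"
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "This body has no MIME boundaries at all.\n"
)
broken = email.message_from_string(broken_text)

# The header says multipart, but the payload stayed a plain string.
claims_multipart = broken.get_content_maintype() == "multipart"
really_multipart = broken.is_multipart()

# Names of the problems the parser recorded for this message.
defect_names = [type(defect).__name__ for defect in broken.defects]
has_invariant_defect = any(
    isinstance(defect, errors.MultipartInvariantViolationDefect)
    for defect in broken.defects
)
```

Checking *defects* after parsing is a cheap way to detect such messages before
code that expects a list payload tries to iterate over a string.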

.. rubric:: Footnotes

.. [#] As of email package version 3.0, introduced in Python 2.4, the classic
   :class:`~email.parser.Parser` was re-implemented in terms of the
   :class:`~email.parser.FeedParser`, so the semantics and results are
   identical between the two parsers.
    231