:mod:`email.charset`: Representing character sets
-------------------------------------------------

.. module:: email.charset
   :synopsis: Character Sets


This module provides a class :class:`Charset` for representing character sets
and character set conversions in email messages, as well as a character set
registry and several convenience functions for manipulating this registry.
Instances of :class:`Charset` are used in several other modules within the
:mod:`email` package.

Import this class from the :mod:`email.charset` module.
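
A minimal sketch of that import, assuming the documented default character set
of ``us-ascii`` (the variable name ``default_charset`` is only illustrative):

```python
from email.charset import Charset

# With no argument, the instance represents the default character set.
default_charset = Charset()
print(default_charset.input_charset)
```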

.. versionadded:: 2.2.2


.. class:: Charset([input_charset])

   Map character sets to their email properties.

   This class provides information about the requirements imposed on email for a
   specific character set.  It also provides convenience routines for converting
   between character sets, given the availability of the applicable codecs.  Given
   a character set, it will do its best to provide information on how to use that
   character set in an email message in an RFC-compliant way.

   Certain character sets must be encoded with quoted-printable or base64 when used
   in email headers or bodies.  Other character sets are not allowed in email at
   all and must be converted outright.

   Optional *input_charset* is as described below; it is always coerced to lower
   case.  After alias normalization, it is also used as a lookup key into the
   registry of character sets to find out the header encoding, body encoding, and
   output conversion codec to be used for the character set.  For example, if
   *input_charset* is ``iso-8859-1``, then headers and bodies will be encoded using
   quoted-printable and no output conversion codec is necessary.  If
   *input_charset* is ``euc-jp``, then headers will be encoded with base64, bodies
   will not be encoded, but output text will be converted from the ``euc-jp``
   character set to the ``iso-2022-jp`` character set.

   :class:`Charset` instances have the following data attributes:


   .. attribute:: input_charset

      The initial character set specified.  Common aliases are converted to
      their *official* email names (e.g. ``latin_1`` is converted to
      ``iso-8859-1``).  Defaults to 7-bit ``us-ascii``.


   .. attribute:: header_encoding

      If the character set must be encoded before it can be used in an email
      header, this attribute will be set to ``Charset.QP`` (for
      quoted-printable), ``Charset.BASE64`` (for base64 encoding), or
      ``Charset.SHORTEST`` for the shortest of QP or BASE64 encoding. Otherwise,
      it will be ``None``.


   .. attribute:: body_encoding

      Same as *header_encoding*, but describes the encoding for the mail
      message's body, which may differ from the header encoding.
      ``Charset.SHORTEST`` is not allowed for *body_encoding*.


   .. attribute:: output_charset

      Some character sets must be converted before they can be used in email
      headers or bodies.  If the *input_charset* is one of them, this attribute
      will contain the name of the character set that output will be converted
      to.  Otherwise, it will be ``None``.


   .. attribute:: input_codec

      The name of the Python codec used to convert the *input_charset* to
      Unicode.  If no conversion codec is necessary, this attribute will be
      ``None``.


   .. attribute:: output_codec

      The name of the Python codec used to convert Unicode to the
      *output_charset*.  If no conversion codec is necessary, this attribute
      will have the same value as the *input_codec*.

   :class:`Charset` instances also have the following methods:


   .. method:: get_body_encoding()

      Return the content transfer encoding used for body encoding.

      This is either the string ``quoted-printable`` or ``base64`` depending on
      the encoding used, or it is a function, in which case you should call the
      function with a single argument, the Message object being encoded.  The
      function should then set the :mailheader:`Content-Transfer-Encoding`
      header itself to whatever is appropriate.

      Returns the string ``quoted-printable`` if *body_encoding* is ``QP``,
      returns the string ``base64`` if *body_encoding* is ``BASE64``, and
      returns the string ``7bit`` otherwise.


   .. method:: convert(s)

      Convert the string *s* from the *input_codec* to the *output_codec*.


   .. method:: to_splittable(s)

      Convert a possibly multibyte string to a safely splittable format. *s* is
      the string to split.

      Uses the *input_codec* to try to convert the string to Unicode, so it can
      be safely split on character boundaries (even for multibyte characters).

      Returns the string as-is if it isn't known how to convert *s* to Unicode
      with the *input_charset*.

      Characters that could not be converted to Unicode will be replaced with
      the Unicode replacement character (``U+FFFD``).


   .. method:: from_splittable(ustr[, to_output])

      Convert a splittable string back into an encoded string.  *ustr* is a
      Unicode string to "unsplit".

      This method uses the proper codec to try to convert the string from
      Unicode back into an encoded format.  Returns the string as-is if it is
      not Unicode, or if it could not be converted from Unicode.

      Characters that could not be converted from Unicode will be replaced with
      an appropriate character (usually ``'?'``).

      If *to_output* is ``True`` (the default), uses *output_codec* to convert
      to an encoded format.  If *to_output* is ``False``, it uses *input_codec*.


   .. method:: get_output_charset()

      Return the output character set.

      This is the *output_charset* attribute if that is not ``None``, otherwise
      it is *input_charset*.


   .. method:: encoded_header_len(s)

      Return the length of the encoded header string for *s*, properly
      calculating for quoted-printable or base64 encoding.


   .. method:: header_encode(s[, convert])

      Header-encode the string *s*.

      If *convert* is ``True``, the string will be converted from the input
      charset to the output charset automatically.  This is not useful for
      multibyte character sets, which have line length issues (multibyte
      characters must be split on a character boundary, not a byte boundary);
      use the higher-level :class:`~email.header.Header` class to deal with
      these issues (see :mod:`email.header`).  *convert* defaults to ``False``.

      The type of encoding (base64 or quoted-printable) will be based on the
      *header_encoding* attribute.


   .. method:: body_encode(s[, convert])

      Body-encode the string *s*.

      If *convert* is ``True`` (the default), the string will be converted from
      the input charset to the output charset automatically. Unlike
      :meth:`header_encode`, there are no issues with byte boundaries and
      multibyte charsets in email bodies, so this is usually safe.

      The type of encoding (base64 or quoted-printable) will be based on the
      *body_encoding* attribute.

   The :class:`Charset` class also provides a number of methods to support
   standard operations and built-in functions.


   .. method:: __str__()

      Returns *input_charset* as a string coerced to lower case.
      :meth:`__repr__` is an alias for :meth:`__str__`.


   .. method:: __eq__(other)

      This method allows you to compare two :class:`Charset` instances for
      equality.


   .. method:: __ne__(other)

      This method allows you to compare two :class:`Charset` instances for
      inequality.
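
As a rough illustration of the :class:`Charset` attributes and methods
described above, the following sketch looks up the two character sets used as
examples in this section.  It assumes the registry entries stated here
(``iso-8859-1`` uses quoted-printable, ``euc-jp`` is converted to
``iso-2022-jp``) and reads the encoding constants from the module namespace.
The helper ``describe`` and the ``ENCODING_LABELS`` table are hypothetical
names invented for this example, and exact results may differ between Python
versions.

```python
import email.charset
from email.charset import Charset

# Readable labels for the module-level encoding constants.
ENCODING_LABELS = {
    email.charset.QP: "quoted-printable",
    email.charset.BASE64: "base64",
    email.charset.SHORTEST: "shortest",
    None: "none",
}


def describe(name):
    """Summarize one character set (hypothetical helper, not in the module)."""
    cs = Charset(name)
    return {
        "input_charset": cs.input_charset,
        "header_encoding": ENCODING_LABELS[cs.header_encoding],
        "body_encoding": ENCODING_LABELS[cs.body_encoding],
        "output_charset": cs.get_output_charset(),
    }


latin1 = describe("LATIN_1")  # an alias, deliberately in upper case
eucjp = describe("euc-jp")
print(latin1)
print(eucjp)

cs = Charset("iso-8859-1")
transfer_encoding = cs.get_body_encoding()  # transfer encoding for the body
encoded_header = cs.header_encode("hello")  # an RFC 2047 encoded word
print(transfer_encoding)
print(encoded_header)

# Two instances compare equal when they name the same character set.
same = Charset("latin_1") == cs
print(same)
```

The alias and the upper-case spelling both resolve to ``iso-8859-1`` because
the input is lower-cased and alias-normalized before the registry lookup.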

The :mod:`email.charset` module also provides the following functions for adding
new entries to the global character set, alias, and codec registries:


.. function:: add_charset(charset[, header_enc[, body_enc[, output_charset]]])

   Add character set properties to the global registry.

   *charset* is the input character set, and must be the canonical name of a
   character set.

   Optional *header_enc* and *body_enc* are each either ``Charset.QP`` for
   quoted-printable, ``Charset.BASE64`` for base64 encoding,
   ``Charset.SHORTEST`` for the shortest of quoted-printable or base64 encoding,
   or ``None`` for no encoding.  ``SHORTEST`` is only valid for
   *header_enc*. The default is ``None`` for no encoding.

   Optional *output_charset* is the character set that the output should be in.
   Conversions will proceed from input charset, to Unicode, to the output charset
   when the method :meth:`Charset.convert` is called.  The default is to output in
   the same character set as the input.

   Both *charset* and *output_charset* must have Unicode codec entries in the
   module's character set-to-codec mapping; use :func:`add_codec` to add codecs the
   module does not know about.  See the :mod:`codecs` module's documentation for
   more information.

   The global character set registry is kept in the module global dictionary
   ``CHARSETS``.
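
A small sketch of registering a character set follows.  The names ``x-demo``
and ``x-demo-2`` are made up for illustration, the encoding constants are read
from the module namespace, and the ``ValueError`` for a ``SHORTEST`` body
encoding is how the implementations known to the editor enforce the rule
above; treat that detail as an assumption.

```python
import email.charset
from email.charset import Charset, add_charset

# "x-demo" is a hypothetical charset name used only for this example.
add_charset("x-demo", email.charset.QP, email.charset.BASE64, "utf-8")

# New Charset instances pick up the registered properties.
cs = Charset("x-demo")
print(cs.header_encoding == email.charset.QP)
print(cs.get_body_encoding())
print(cs.get_output_charset())

# SHORTEST is only valid for header_enc, so this registration is refused.
rejected = False
try:
    add_charset("x-demo-2", body_enc=email.charset.SHORTEST)
except ValueError:
    rejected = True
print(rejected)
```

Because the registry is a module global, the new entry stays visible to every
later :class:`Charset` created in the same process.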


.. function:: add_alias(alias, canonical)

   Add a character set alias.  *alias* is the alias name, e.g. ``latin-1``.
   *canonical* is the character set's canonical name, e.g. ``iso-8859-1``.

   The global charset alias registry is kept in the module global dictionary
   ``ALIASES``.
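
For instance, a new alias could be registered as sketched here.  The alias
``my-latin`` is a hypothetical name chosen for the example.

```python
from email.charset import ALIASES, Charset, add_alias

# "my-latin" is a made-up alias; "iso-8859-1" is the canonical name.
add_alias("my-latin", "iso-8859-1")

# The input is lower-cased before the alias lookup, so case does not matter.
cs = Charset("MY-LATIN")
print(cs.input_charset)
```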


.. function:: add_codec(charset, codecname)

   Add a codec that maps characters in the given character set to and from
   Unicode.

   *charset* is the canonical name of a character set. *codecname* is the name of a
   Python codec, as appropriate for the second argument to the :func:`unicode`
   built-in, or to the :meth:`~unicode.encode` method of a Unicode string.
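
The following sketch maps a character set name to an existing Python codec.
The name ``x-demo-latin`` is hypothetical; ``latin_1`` is a codec name that
Python's :mod:`codecs` machinery already knows.

```python
from email.charset import Charset, add_codec

# Tell the module which Python codec handles the made-up charset name.
add_codec("x-demo-latin", "latin_1")

# The codec name now shows up on new Charset instances.
cs = Charset("x-demo-latin")
print(cs.input_codec)
print(cs.output_codec)
```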
    254