symbian-qemu-0.9.1-12/python-2.6.1/Doc/library/email.charset.rst
changeset 1 2fb8b9db1c86
0:ffa851df0825 1:2fb8b9db1c86
       
     1 :mod:`email`: Representing character sets
       
     2 -----------------------------------------
       
     3 
       
     4 .. module:: email.charset
       
     5    :synopsis: Character Sets
       
     6 
       
     7 
       
     8 This module provides a class :class:`Charset` for representing character sets
       
     9 and character set conversions in email messages, as well as a character set
       
    10 registry and several convenience methods for manipulating this registry.
       
    11 Instances of :class:`Charset` are used in several other modules within the
       
    12 :mod:`email` package.
       
    13 
       
    14 Import this class from the :mod:`email.charset` module.
       
    15 
       
    16 .. versionadded:: 2.2.2
       
    17 
       
    18 
       
    19 .. class:: Charset([input_charset])
       
    20 
       
    21    Map character sets to their email properties.
       
    22 
       
    23    This class provides information about the requirements imposed on email for a
       
    24    specific character set.  It also provides convenience routines for converting
       
    25    between character sets, given the availability of the applicable codecs.  Given
       
    26    a character set, it will do its best to provide information on how to use that
       
    27    character set in an email message in an RFC-compliant way.
       
    28 
       
    29    Certain character sets must be encoded with quoted-printable or base64 when used
       
    30    in email headers or bodies.  Certain character sets must be converted outright,
       
    31    and are not allowed in email.
       
    32 
       
    33    Optional *input_charset* is as described below; it is always coerced to lower
       
    34    case.  After being alias normalized it is also used as a lookup into the
       
    35    registry of character sets to find out the header encoding, body encoding, and
       
    36    output conversion codec to be used for the character set.  For example, if
       
    37    *input_charset* is ``iso-8859-1``, then headers and bodies will be encoded using
       
    38    quoted-printable and no output conversion codec is necessary.  If
       
    39    *input_charset* is ``euc-jp``, then headers will be encoded with base64, bodies
       
    40    will not be encoded, but output text will be converted from the ``euc-jp``
       
    41    character set to the ``iso-2022-jp`` character set.
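
The two cases described above can be checked directly. This is a minimal sketch, assuming the constants ``QP`` and ``BASE64`` can be imported from the :mod:`email.charset` module namespace alongside :class:`Charset`; the variable names are illustrative only.

```python
from email.charset import Charset, QP, BASE64

latin1 = Charset('iso-8859-1')
eucjp = Charset('euc-jp')

# iso-8859-1: quoted-printable for both headers and bodies.
latin1_uses_qp = (latin1.header_encoding == QP and
                  latin1.body_encoding == QP)

# euc-jp: base64 headers, unencoded bodies, output converted to iso-2022-jp.
eucjp_header_is_base64 = (eucjp.header_encoding == BASE64)
eucjp_body_unencoded = (eucjp.body_encoding is None)
eucjp_output = eucjp.output_charset

# The name is lower-cased first, then aliases are normalized.
normalized = Charset('LATIN_1').input_charset
```
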
       
    42 
       
    43    :class:`Charset` instances have the following data attributes:
       
    44 
       
    45 
       
    46    .. attribute:: input_charset
       
    47 
       
    48       The initial character set specified.  Common aliases are converted to
       
    49       their *official* email names (e.g. ``latin_1`` is converted to
       
    50       ``iso-8859-1``).  Defaults to 7-bit ``us-ascii``.
       
    51 
       
    52 
       
    53    .. attribute:: header_encoding
       
    54 
       
    55       If the character set must be encoded before it can be used in an email
       
    56       header, this attribute will be set to ``Charset.QP`` (for
       
    57       quoted-printable), ``Charset.BASE64`` (for base64 encoding), or
       
    58       ``Charset.SHORTEST`` for the shortest of QP or BASE64 encoding. Otherwise,
       
    59       it will be ``None``.
       
    60 
       
    61 
       
    62    .. attribute:: body_encoding
       
    63 
       
    64       Same as *header_encoding*, but describes the encoding for the mail
       
    65       message's body, which indeed may be different than the header encoding.
       
    66       ``Charset.SHORTEST`` is not allowed for *body_encoding*.
       
    67 
       
    68 
       
    69    .. attribute:: output_charset
       
    70 
       
    71       Some character sets must be converted before they can be used in email headers
       
    72       or bodies.  If the *input_charset* is one of them, this attribute will
       
    73       contain the name of the character set output will be converted to.  Otherwise, it will
       
    74       be ``None``.
       
    75 
       
    76 
       
    77    .. attribute:: input_codec
       
    78 
       
    79       The name of the Python codec used to convert the *input_charset* to
       
    80       Unicode.  If no conversion codec is necessary, this attribute will be
       
    81       ``None``.
       
    82 
       
    83 
       
    84    .. attribute:: output_codec
       
    85 
       
    86       The name of the Python codec used to convert Unicode to the
       
    87       *output_charset*.  If no conversion codec is necessary, this attribute
       
    88       will have the same value as the *input_codec*.
       
    89 
       
    90    :class:`Charset` instances also have the following methods:
       
    91 
       
    92 
       
    93    .. method:: get_body_encoding()
       
    94 
       
    95       Return the content transfer encoding used for body encoding.
       
    96 
       
    97       This is either the string ``quoted-printable`` or ``base64`` depending on
       
    98       the encoding used, or it is a function, in which case you should call the
       
    99       function with a single argument, the Message object being encoded.  The
       
   100       function should then set the :mailheader:`Content-Transfer-Encoding`
       
   101       header itself to whatever is appropriate.
       
   102 
       
   103       Returns the string ``quoted-printable`` if *body_encoding* is ``QP``,
       
   104       returns the string ``base64`` if *body_encoding* is ``BASE64``, and
       
   105       returns the string ``7bit`` otherwise.
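
Because the return value may be either a string or a function, callers generally need to handle both. The helper below is a hypothetical sketch (``describe_body_encoding`` is not part of the module) that reports which kind of value came back without assuming one or the other.

```python
from email.charset import Charset

def describe_body_encoding(name):
    """Return the transfer-encoding string, or 'function' if a callable
    was returned instead (hypothetical helper for illustration)."""
    encoding = Charset(name).get_body_encoding()
    if callable(encoding):
        # The caller would invoke encoding(msg) with the Message object.
        return 'function'
    return encoding

qp_name = describe_body_encoding('iso-8859-1')
b64_name = describe_body_encoding('utf-8')
```
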
       
   106 
       
   107 
       
   108    .. method:: convert(s)
       
   109 
       
   110       Convert the string *s* from the *input_codec* to the *output_codec*.
       
   111 
       
   112 
       
   113    .. method:: to_splittable(s)
       
   114 
       
   115       Convert a possibly multibyte string to a safely splittable format. *s* is
       
   116       the string to split.
       
   117 
       
   118       Uses the *input_codec* to try and convert the string to Unicode, so it can
       
   119       be safely split on character boundaries (even for multibyte characters).
       
   120 
       
   121       Returns the string as-is if it isn't known how to convert *s* to Unicode
       
   122       with the *input_charset*.
       
   123 
       
   124       Characters that could not be converted to Unicode will be replaced with
       
   125       the Unicode replacement character ``'U+FFFD'``.
       
   126 
       
   127 
       
   128    .. method:: from_splittable(ustr[, to_output])
       
   129 
       
   130       Convert a splittable string back into an encoded string.  *ustr* is a
       
   131       Unicode string to "unsplit".
       
   132 
       
   133       This method uses the proper codec to try and convert the string from
       
   134       Unicode back into an encoded format.  Return the string as-is if it is not
       
   135       Unicode, or if it could not be converted from Unicode.
       
   136 
       
   137       Characters that could not be converted from Unicode will be replaced with
       
   138       an appropriate character (usually ``'?'``).
       
   139 
       
   140       If *to_output* is ``True`` (the default), uses *output_codec* to convert
       
   141       to an encoded format.  If *to_output* is ``False``, it uses *input_codec*.
       
   142 
       
   143 
       
   144    .. method:: get_output_charset()
       
   145 
       
   146       Return the output character set.
       
   147 
       
   148       This is the *output_charset* attribute if that is not ``None``, otherwise
       
   149       it is *input_charset*.
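
As a brief illustration of the fallback, a character set with an output conversion reports the converted name, while one without reports its own name:

```python
from email.charset import Charset

# euc-jp is converted on output, so the output charset differs.
converted = Charset('euc-jp').get_output_charset()

# iso-8859-1 needs no conversion, so the input charset is reported.
unconverted = Charset('iso-8859-1').get_output_charset()
```
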
       
   150 
       
   151 
       
    152    .. method:: encoded_header_len(s)
       
    153 
       
    154       Return the length of the string *s* once header-encoded, accounting for
       
    155       quoted-printable or base64 encoding.
       
   156 
       
   157 
       
   158    .. method:: header_encode(s[, convert])
       
   159 
       
   160       Header-encode the string *s*.
       
   161 
       
   162       If *convert* is ``True``, the string will be converted from the input
       
   163       charset to the output charset automatically.  This is not useful for
       
   164       multibyte character sets, which have line length issues (multibyte
       
   165       characters must be split on a character, not a byte boundary); use the
       
   166       higher-level :class:`Header` class to deal with these issues (see
       
   167       :mod:`email.header`).  *convert* defaults to ``False``.
       
   168 
       
   169       The type of encoding (base64 or quoted-printable) will be based on the
       
   170       *header_encoding* attribute.
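
A minimal sketch of header encoding for a single-byte character set, leaving *convert* at its default. The sample strings are illustrative. ``'\xe9'`` is the iso-8859-1 code for "e with acute accent", and the exact encoded text may vary between versions of the package.

```python
from email.charset import Charset

# header_encoding for iso-8859-1 is quoted-printable, so the result is
# wrapped as an encoded word naming the charset and the 'q' encoding.
encoded = Charset('iso-8859-1').header_encode('Caf\xe9')

# us-ascii has no header encoding, so the string is returned unchanged.
plain = Charset('us-ascii').header_encode('hello')
```
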
       
   171 
       
   172 
       
   173    .. method:: body_encode(s[, convert])
       
   174 
       
   175       Body-encode the string *s*.
       
   176 
       
   177       If *convert* is ``True`` (the default), the string will be converted from
       
   178       the input charset to output charset automatically. Unlike
       
   179       :meth:`header_encode`, there are no issues with byte boundaries and
       
   180       multibyte charsets in email bodies, so this is usually pretty safe.
       
   181 
       
   182       The type of encoding (base64 or quoted-printable) will be based on the
       
   183       *body_encoding* attribute.
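
A short sketch of body encoding, assuming ``utf-8`` is registered with base64 as its body encoding. The result is decoded again with the standard :mod:`base64` module to show the round trip.

```python
import base64
from email.charset import Charset

# body_encoding for utf-8 is assumed to be base64 here.
encoded = Charset('utf-8').body_encode('hello')

# Decoding the body recovers the original bytes.
decoded = base64.b64decode(encoded)
```
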
       
   184 
       
   185    The :class:`Charset` class also provides a number of methods to support
       
   186    standard operations and built-in functions.
       
   187 
       
   188 
       
   189    .. method:: __str__()
       
   190 
       
   191       Returns *input_charset* as a string coerced to lower
       
   192       case. :meth:`__repr__` is an alias for :meth:`__str__`.
       
   193 
       
   194 
       
   195    .. method:: __eq__(other)
       
   196 
       
   197       This method allows you to compare two :class:`Charset` instances for
       
   198       equality.
       
   199 
       
   200 
       
   201    .. method:: __ne__(other)
       
   202 
       
   203       This method allows you to compare two :class:`Charset` instances for
       
   204       inequality.
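
For illustration, instances that differ only in the case of the name they were created with have the same string form and compare equal:

```python
from email.charset import Charset

upper = Charset('ISO-8859-1')
lower = Charset('iso-8859-1')

as_text = str(upper)                      # lower-cased charset name
same = (upper == lower)                   # equality ignores original case
different = (upper != Charset('utf-8'))   # distinct charsets are unequal
```
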
       
   205 
       
   206 The :mod:`email.charset` module also provides the following functions for adding
       
   207 new entries to the global character set, alias, and codec registries:
       
   208 
       
   209 
       
   210 .. function:: add_charset(charset[, header_enc[, body_enc[, output_charset]]])
       
   211 
       
   212    Add character properties to the global registry.
       
   213 
       
   214    *charset* is the input character set, and must be the canonical name of a
       
   215    character set.
       
   216 
       
    217    Optional *header_enc* and *body_enc* are each either ``Charset.QP`` for
       
   218    quoted-printable, ``Charset.BASE64`` for base64 encoding,
       
   219    ``Charset.SHORTEST`` for the shortest of quoted-printable or base64 encoding,
       
   220    or ``None`` for no encoding.  ``SHORTEST`` is only valid for
       
   221    *header_enc*. The default is ``None`` for no encoding.
       
   222 
       
   223    Optional *output_charset* is the character set that the output should be in.
       
   224    Conversions will proceed from input charset, to Unicode, to the output charset
       
   225    when the method :meth:`Charset.convert` is called.  The default is to output in
       
   226    the same character set as the input.
       
   227 
       
    228    Both *charset* and *output_charset* must have Unicode codec entries in the
       
   229    module's character set-to-codec mapping; use :func:`add_codec` to add codecs the
       
   230    module does not know about.  See the :mod:`codecs` module's documentation for
       
   231    more information.
       
   232 
       
   233    The global character set registry is kept in the module global dictionary
       
   234    ``CHARSETS``.
       
   235 
       
   236 
       
   237 .. function:: add_alias(alias, canonical)
       
   238 
       
   239    Add a character set alias.  *alias* is the alias name, e.g. ``latin-1``.
       
   240    *canonical* is the character set's canonical name, e.g. ``iso-8859-1``.
       
   241 
       
   242    The global charset alias registry is kept in the module global dictionary
       
   243    ``ALIASES``.
       
   244 
       
   245 
       
   246 .. function:: add_codec(charset, codecname)
       
   247 
       
    248    Add a codec that maps characters in the given character set to and from Unicode.
       
   249 
       
   250    *charset* is the canonical name of a character set. *codecname* is the name of a
       
   251    Python codec, as appropriate for the second argument to the :func:`unicode`
       
   252    built-in, or to the :meth:`encode` method of a Unicode string.
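
The three registry functions can be combined as in the sketch below. The names ``x-example-charset`` and ``x-example-alias`` are hypothetical and chosen only for this example, and the constants ``QP`` and ``BASE64`` are assumed to be available in the module namespace. The calls modify the module-global registries, so the entries remain for the life of the process.

```python
import email.charset as charset_registry
from email.charset import Charset

# Hypothetical names, used only for this example.
NAME = 'x-example-charset'
ALIAS = 'x-example-alias'

# Register encodings: quoted-printable headers, base64 bodies.
charset_registry.add_charset(NAME, charset_registry.QP,
                             charset_registry.BASE64)
# Let the alias resolve to the canonical name.
charset_registry.add_alias(ALIAS, NAME)
# Tell the module which Python codec handles the new charset.
charset_registry.add_codec(NAME, 'latin-1')

# Looking up the alias now yields the registered properties.
example = Charset(ALIAS)
```
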
       
   253