*****************
  Unicode HOWTO
*****************

:Release: 1.02

This HOWTO discusses Python's support for Unicode, and explains various problems
that people commonly encounter when trying to work with Unicode.

Introduction to Unicode
=======================

History of Character Codes
--------------------------

In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized.  ASCII defined numeric codes for various
characters, with the numeric values running from 0 to
127.  For example, the lowercase letter 'a' is assigned 97 as its code
value.

ASCII was an American-developed standard, so it only defined unaccented
characters.  There was an 'e', but no 'é' or 'Í'.  This meant that languages
which required accented characters couldn't be faithfully represented in ASCII.
(Actually the missing accents matter for English, too, which contains words such
as 'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)

For a while people just wrote programs that didn't display accents.  I remember
looking at Apple ][ BASIC programs, published in French-language publications in
the mid-1980s, that had lines like these::

	PRINT "FICHIER EST COMPLETE."
	PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents, and they just look wrong to someone who
can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
hold values ranging from 0 to 255.  ASCII codes only went up to 127, so some
machines assigned values between 128 and 255 to accented characters.  Different
machines had different codes, however, which led to problems exchanging files.
Eventually various commonly used sets of values for the 128-255 range emerged.
Some were true standards, defined by the International Organization for
Standardization, and some were **de facto** conventions that were invented by
one company or another and managed to catch on.

256 characters aren't very many.  For example, you can't fit both the accented
characters used in Western Europe and the Cyrillic alphabet used for Russian
into the 128-255 range because there are more than 127 such characters.

You could write files using different codes (all your Russian files in a coding
system called KOI8, all your French files in a different coding system called
Latin1), but what if you wanted to write a French document that quotes some
Russian text?  In the 1980s people began to want to solve this problem, and the
Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters.  16
bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
base-16).

There's a related ISO standard, ISO 10646.  Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.

(This discussion of Unicode's history is highly simplified.  I don't think the
average Python programmer needs to worry about the historical details; consult
the Unicode consortium site listed in the References for more information.)


Definitions
-----------

A **character** is the smallest possible component of a text.  'A', 'B', 'C',
etc., are all different characters.  So are 'È' and 'Í'.  Characters are
abstractions, and vary depending on the language or context you're talking
about.  For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.

The Unicode standard describes how characters are represented by **code
points**.  A code point is an integer value, usually denoted in base 16.  In the
standard, a code point is written using the notation U+12ca to mean the
character with value 0x12ca (4810 decimal).  The Unicode standard contains a lot
of tables listing characters and their corresponding code points::

	0061    'a'; LATIN SMALL LETTER A
	0062    'b'; LATIN SMALL LETTER B
	0063    'c'; LATIN SMALL LETTER C
	...
	007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'.  U+12ca is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.  In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**.  The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used.  Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 to 0x10ffff.  This sequence needs to be
represented as a set of bytes (meaning, values from 0-255) in memory.  The rules
for translating a Unicode string into a sequence of bytes are called an
**encoding**.

The first encoding you might think of is an array of 32-bit integers.  In this
representation, the string "Python" would look like this::

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space.  In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by zero
   bytes.  The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation.  Increased RAM usage doesn't matter too much (desktop
   computers have megabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and can't
   handle content with embedded zero bytes.

Generally people don't use this encoding, instead choosing other encodings that
are more efficient and convenient.
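
For the curious, you can reproduce the layout above yourself; Python 2.6 ships
with a 'utf-32-le' codec, and the little-endian byte order shown here is an
assumption of this sketch::

    >>> u'Python'.encode('utf-32-le')
    'P\x00\x00\x00y\x00\x00\x00t\x00\x00\x00h\x00\x00\x00o\x00\x00\x00n\x00\x00\x00'
    >>> len(u'Python'.encode('utf-32-le'))
    24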

Encodings don't have to handle every possible Unicode character, and most
encodings don't.  For example, Python's default encoding is the 'ascii'
encoding.  The rules for converting a Unicode string into the ASCII encoding are
simple; for each code point:

1. If the code point is < 128, it's represented by a single byte with the same
   value as the code point.

2. If the code point is 128 or greater, the Unicode string can't be represented
   in this encoding.  (Python raises a :exc:`UnicodeEncodeError` exception in
   this case.)

Latin-1, also known as ISO-8859-1, is a similar encoding.  Unicode code points
0-255 are identical to the Latin-1 values, so converting to this encoding simply
requires converting code points to byte values; if a code point larger than 255
is encountered, the string can't be encoded into Latin-1.
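
Both behaviors are easy to see interactively; a small sketch, run under Python
2 where these codecs are available by these names::

    >>> u'abc'.encode('ascii')        # every code point is < 128
    'abc'
    >>> u'caf\xe9'.encode('ascii')    # 0xe9 is >= 128
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
    >>> u'caf\xe9'.encode('latin-1')  # but it fits in Latin-1's 0-255 range
    'caf\xe9'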

Encodings don't have to be simple one-to-one mappings like Latin-1.  Consider
IBM's EBCDIC, which was used on IBM mainframes.  Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
through 153.  If you wanted to use EBCDIC as an encoding, you'd probably use
some sort of lookup table to perform the conversion, but this is largely an
internal detail.

UTF-8 is one of the most commonly used encodings.  UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding.  (There's also a UTF-16 encoding, but it's less frequently used than
UTF-8.)  UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
   between 128 and 255.
3. Code points > 0x7ff are turned into three- or four-byte sequences, where each
   byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded zero
   bytes.  This avoids byte-ordering issues, and means UTF-8 strings can be
   processed by C functions such as ``strcpy()`` and sent through protocols that
   can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of code points are turned into two
   bytes, and values less than 128 occupy only a single byte.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize.  It's also unlikely that
   random 8-bit data will look like valid UTF-8.
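
You can watch these rules at work by checking how many bytes each kind of code
point needs; a small sketch::

    >>> len(u'a'.encode('utf-8'))       # < 128: one byte
    1
    >>> len(u'\xe9'.encode('utf-8'))    # 128-0x7ff: two bytes
    2
    >>> len(u'\u20ac'.encode('utf-8'))  # > 0x7ff: three bytes
    3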


References
----------

The Unicode Consortium site at <http://www.unicode.org> has character charts, a
glossary, and PDF versions of the Unicode specification.  Be prepared for some
difficult reading.  <http://www.unicode.org/history/> is a chronology of the
origin and development of Unicode.

To help understand the standard, Jukka Korpela has written an introductory guide
to reading the Unicode character tables, available at
<http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.

Two other good introductory articles were written by Joel Spolsky
<http://www.joelonsoftware.com/articles/Unicode.html> and Jason Orendorff
<http://www.jorendorff.com/articles/unicode/>.  If this introduction didn't make
things clear to you, you should try reading one of these alternate articles
before continuing.

Wikipedia entries are often helpful; see the entries for "character encoding"
<http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
<http://en.wikipedia.org/wiki/UTF-8>, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.


The Unicode Type
----------------

Unicode strings are expressed as instances of the :class:`unicode` type, one of
Python's repertoire of built-in types.  It derives from an abstract type called
:class:`basestring`, which is also an ancestor of the :class:`str` type; you can
therefore check if a value is a string type with ``isinstance(value,
basestring)``.  Under the hood, Python represents Unicode strings as either 16-
or 32-bit integers, depending on how the Python interpreter was compiled.
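
You can check which width your interpreter uses by looking at
``sys.maxunicode``; the value shown below is from a wide (32-bit) build and
would be 65535 on a narrow (16-bit) build::

    >>> import sys
    >>> sys.maxunicode
    1114111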

The :func:`unicode` constructor has the signature ``unicode(string[, encoding,
errors])``.  All of its arguments should be 8-bit strings.  The first argument
is converted to Unicode using the specified encoding; if you leave off the
``encoding`` argument, the ASCII encoding is used for the conversion, so
characters greater than 127 will be treated as errors::

    >>> unicode('abcdef')
    u'abcdef'
    >>> s = unicode('abcdef')
    >>> type(s)
    <type 'unicode'>
    >>> unicode('abcdef' + chr(255))
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
                        ordinal not in range(128)

The ``errors`` argument specifies the response when the input string can't be
converted according to the encoding's rules.  Legal values for this argument are
'strict' (raise a ``UnicodeDecodeError`` exception), 'replace' (add U+FFFD,
'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the
Unicode result).  The following examples show the differences::

    >>> unicode('\x80abc', errors='strict')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                        ordinal not in range(128)
    >>> unicode('\x80abc', errors='replace')
    u'\ufffdabc'
    >>> unicode('\x80abc', errors='ignore')
    u'abc'

Encodings are specified as strings containing the encoding's name.  Python 2.4
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list.  Some encodings
have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all
synonyms for the same encoding.
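
You can confirm that two names refer to the same codec by asking the
:mod:`codecs` module to look them up; a quick sketch::

    >>> import codecs
    >>> codecs.lookup('latin-1').name
    'iso8859-1'
    >>> codecs.lookup('8859').name
    'iso8859-1'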

One-character Unicode strings can also be created with the :func:`unichr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point.  The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

    >>> unichr(40960)
    u'\ua000'
    >>> ord(u'\ua000')
    40960

Instances of the :class:`unicode` type have many of the same methods as the
8-bit string type for operations such as searching and formatting::

    >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
    >>> s.count('e')
    5
    >>> s.find('feather')
    9
    >>> s.find('bird')
    -1
    >>> s.replace('feather', 'sand')
    u'Was ever sand so lightly blown to and fro as this multitude?'
    >>> s.upper()
    u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'

Note that the arguments to these methods can be Unicode strings or 8-bit
strings.  8-bit strings will be converted to Unicode before carrying out the
operation; Python's default ASCII encoding will be used, so characters greater
than 127 will cause an exception::

    >>> s.find('Was\x9f')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
    >>> s.find(u'Was\x9f')
    -1

Much Python code that operates on strings will therefore work with Unicode
strings without requiring any changes to the code.  (Input and output code needs
more updating for Unicode; more on this later.)

Another important method is ``.encode([encoding], [errors='strict'])``, which
returns an 8-bit string version of the Unicode string, encoded in the requested
encoding.  The ``errors`` parameter is the same as the parameter of the
``unicode()`` constructor, with one additional possibility; as well as 'strict',
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's
character references.  The following example shows the different results::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)
    >>> u.encode('utf-8')
    '\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    'abcd'
    >>> u.encode('ascii', 'replace')
    '?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    '&#40960;abcd&#1972;'

Python's 8-bit strings have a ``.decode([encoding], [errors])`` method that
interprets the string using the given encoding::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
    >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
    >>> type(utf8_version), utf8_version
    (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
    >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
    >>> u == u2                                      # The two strings match
    True

The low-level routines for registering and accessing the available encodings are
found in the :mod:`codecs` module.  However, the encoding and decoding functions
returned by this module are usually more low-level than is comfortable, so I'm
not going to describe the :mod:`codecs` module here.  If you need to implement a
completely new encoding, you'll need to learn about the :mod:`codecs` module
interfaces, but implementing encodings is a specialized task that also won't be
covered here.  Consult the Python documentation to learn more about this module.

The most commonly used part of the :mod:`codecs` module is the
:func:`codecs.open` function which will be discussed in the section on input and
output.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, Unicode literals are written as strings prefixed with the
'u' or 'U' character: ``u'abcdefghijk'``.  Specific code points can be written
using the ``\u`` escape sequence, which is followed by four hex digits giving
the code point.  The ``\U`` escape sequence is similar, but expects 8 hex
digits, not 4.

Unicode literals can also use the same escape sequences as 8-bit strings,
including ``\x``, but ``\x`` only takes two hex digits so it can't express an
arbitrary code point.  Octal escapes can go up to U+01ff, which is octal 777.

::

    >>> s = u"a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
                   ^^^^^^ four-digit Unicode escape
                               ^^^^^^^^^^ eight-digit Unicode escape
    >>> for c in s:  print ord(c),
    ...
    97 172 4660 8364 32768

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language.  You
can also assemble strings using the :func:`unichr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding.  You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have to
declare the encoding being used.  This is done by including a special comment as
either the first or second line of the source file::

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-

    u = u'abcdé'
    print ord(u[-1])

The syntax is inspired by Emacs's notation for specifying variables local to a
file.  Emacs supports many different variables, but Python only supports
'coding'.  The ``-*-`` symbols indicate that the comment is special; within
them, you must supply the name ``coding`` and the name of your chosen encoding,
separated by ``':'``.

If you don't include such a comment, the default encoding used will be ASCII.
Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default
encoding for string literals; in Python 2.4, characters greater than 127 still
work but result in a warning.  For example, the following program has no
encoding declaration::

    #!/usr/bin/env python
    u = u'abcdé'
    print ord(u[-1])

When you run it with Python 2.4, it will output the following warning::

    amk:~$ python p263.py
    sys:1: DeprecationWarning: Non-ASCII character '\xe9'
         in file p263.py on line 2, but no encoding declared;
         see http://www.python.org/peps/pep-0263.html for details


Unicode Properties
------------------

The Unicode specification includes a database of information about code points.
For each code point that's defined, the information includes the character's
name, its category, the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and
four-fifths).  There are also properties related to the code point's use in
bidirectional text and other display-related properties.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

    import unicodedata

    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

    for i, c in enumerate(u):
        print i, '%04x' % ord(c), unicodedata.category(c),
        print unicodedata.name(c)

    # Get numeric value of second character
    print unicodedata.numeric(u[1])

When run, this prints::

    0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
    1 0bf2 No TAMIL NUMBER ONE THOUSAND
    2 0f84 Mn TIBETAN MARK HALANTA
    3 1770 Lo TAGBANWA LETTER SA
    4 33af So SQUARE RAD OVER S SQUARED
    1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories.  To take the codes
from the above output, ``'Ll'`` means "Letter, lowercase", ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other".  See
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values> for a
list of category codes.

References
----------

The Unicode and 8-bit string types are described in the Python library reference
at :ref:`typesseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
Unicode".  A PDF version of his slides is available at
<http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an
excellent overview of the design of Python's Unicode features.


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output.  How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively.  XML parsers often return Unicode
data, for example.  Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket.  It's possible to do all the work
yourself: open a file, read an 8-bit string from it, and convert the string with
``unicode(str, encoding)``.  However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes.  If you want to read the file in arbitrary-sized
chunks (say, 1K or 4K), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk.  One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GB file, you need 2 GB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
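
You can see the problem in miniature by splitting a multi-byte sequence; here
the three UTF-8 bytes of U+20AC land on either side of a chunk boundary (a
sketch; the exact error message varies between Python versions)::

    >>> u'\u20ac'.encode('utf-8')   # three bytes in UTF-8
    '\xe2\x82\xac'
    >>> '\xe2\x82'.decode('utf-8')  # the chunk ended mid-character
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: unexpected end of data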

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences.  The work of implementing this has already been
done for you: the :mod:`codecs` module includes a version of the :func:`open`
function that returns a file-like object that assumes the file's contents are in
a specified encoding and accepts Unicode parameters for methods such as
``.read()`` and ``.write()``.

The function's parameters are ``open(filename, mode='rb', encoding=None,
errors='strict', buffering=1)``.  ``mode`` can be ``'r'``, ``'w'``, or ``'a'``,
just like the corresponding parameter to the regular built-in ``open()``
function; add a ``'+'`` to update the file.  ``buffering`` is similarly parallel
to the standard function's parameter.  ``encoding`` is a string giving the
encoding to use; if it's left as ``None``, a regular Python file object that
accepts 8-bit strings is returned.  Otherwise, a wrapper object is returned, and
data written to or read from the wrapper object will be converted as needed.
``errors`` specifies the action for encoding errors and can be one of the usual
values of 'strict', 'ignore', and 'replace'.

Reading Unicode from a file is therefore simple::

    import codecs
    f = codecs.open('unicode.rst', encoding='utf-8')
    for line in f:
        print repr(line)

It's also possible to open files in update mode, allowing both reading and
writing::

    f = codecs.open('test', encoding='utf-8', mode='w+')
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])
    f.close()

Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering.  Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read.  There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.
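
You can watch the BOM appear by encoding a short string directly; the byte
order shown in this sketch assumes a little-endian machine::

    >>> u'abc'.encode('utf-16')     # BOM '\xff\xfe' is prepended automatically
    '\xff\xfea\x00b\x00c\x00'
    >>> u'abc'.encode('utf-16-le')  # explicit byte order, no BOM
    'a\x00b\x00c\x00'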


Unicode filenames
-----------------

Most of the operating systems in common use today support filenames that contain
arbitrary Unicode characters.  Usually this is implemented by converting the
Unicode string into some encoding that varies depending on the system.  For
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
Windows, Python uses the name "mbcs" to refer to whatever the currently
configured encoding is.  On Unix systems, there will only be a filesystem
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is ASCII.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother.  When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

    filename = u'filename\u4500abc'
    f = open(filename, 'w')
    f.write('blah\n')
    f.close()

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

:func:`os.listdir`, which returns filenames, raises an issue: should it return
the Unicode version of filenames, or should it return 8-bit strings containing
the encoded versions?  :func:`os.listdir` will do both, depending on whether you
provided the directory path as an 8-bit string or a Unicode string.  If you pass
a Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing an 8-bit
path will return the 8-bit versions of the filenames.  For example, assuming the
default filesystem encoding is UTF-8, running the following program::

	fn = u'filename\u4500abc'
	f = open(fn, 'w')
	f.close()

	import os
	print os.listdir('.')
	print os.listdir(u'.')

will produce the following output::

	amk:~$ python t.py
	['.svn', 'filename\xe4\x94\x80abc', ...]
	[u'.svn', u'filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

    Software should only work with Unicode strings internally, converting to a
    particular encoding on output.

If you attempt to write processing functions that accept both Unicode and 8-bit
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings.  Python's default encoding is ASCII, so whenever
a character with a value > 127 is in the input data, you'll get a
:exc:`UnicodeDecodeError` because that character can't be handled by the ASCII
encoding.
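
A tiny example of such a lurking bug: combining the two string types forces an
implicit decode with the ASCII codec (a sketch)::

    >>> u'caf\xe9' + 'caf\xe9'    # the 8-bit string is implicitly decoded as ASCII
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)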

It's easy to miss such problems if you only test your software with data that
doesn't contain any accents; everything will seem to work, but there's actually
a bug in your program waiting for the first user who attempts to use characters
> 127.  A second tip, therefore, is:

    Include characters > 127 and, even better, characters > 255 in your test
    data.

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database.  If you're doing
this, be careful to check the string once it's in the form that will be used or
stored; it's possible for encodings to be used to disguise characters.  This is
especially true if the input data also specifies the encoding; many encodings
leave the commonly checked-for characters alone, but Python includes some
encodings such as ``'base64'`` that modify every single character.

For example, let's say you have a content management system that takes a Unicode
filename, and you want to disallow paths with a '/' character.  You might write
this code::

    def read_file(filename, encoding):
        if '/' in filename:
            raise ValueError("'/' not allowed in filenames")
        unicode_name = filename.decode(encoding)
        f = open(unicode_name, 'r')
        # ... return contents of file ...

However, if an attacker could specify the ``'base64'`` encoding, they could pass
``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
``'/etc/passwd'``, to read a system file.  The above code looks for ``'/'``
characters in the encoded form and misses the dangerous character in the
resulting decoded form.
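
A safer sketch of the same function (``read_file`` and its arguments are the
hypothetical names from the example above) decodes first and then checks the
string in the form that will actually be used::

    def read_file(filename, encoding):
        unicode_name = filename.decode(encoding)  # decode *before* checking
        if u'/' in unicode_name:
            raise ValueError("'/' not allowed in filenames")
        f = open(unicode_name, 'r')
        # ... return contents of file ...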

References
----------

The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python" are available at
<http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to internationalize
and localize an application.


Revision History and Acknowledgements
=====================================

Thanks to the following people who have noted errors or offered suggestions on
this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
Marc-André Lemburg, Martin von Löwis, Chad Whitacre.

Version 1.0: posted August 5 2005.

Version 1.01: posted August 7 2005.  Corrects factual and markup errors; adds
several links.

Version 1.02: posted August 16 2005.  Corrects factual errors.


.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter

.. comment
   Original outline:

   - [ ] Unicode introduction
       - [ ] ASCII
       - [ ] Terms
           - [ ] Character
           - [ ] Code point
       - [ ] Encodings
           - [ ] Common encodings: ASCII, Latin-1, UTF-8
       - [ ] Unicode Python type
           - [ ] Writing unicode literals
               - [ ] Obscurity: -U switch
           - [ ] Built-ins
               - [ ] unichr()
               - [ ] ord()
               - [ ] unicode() constructor
           - [ ] Unicode type
               - [ ] encode(), decode() methods
       - [ ] Unicodedata module for character properties
       - [ ] I/O
           - [ ] Reading/writing Unicode data into files
               - [ ] Byte-order marks
           - [ ] Unicode filenames
       - [ ] Writing Unicode programs
           - [ ] Do everything in Unicode
           - [ ] Declaring source code encodings (PEP 263)
       - [ ] Other issues
           - [ ] Building Python (UCS2, UCS4)