
:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database, which defines
character properties for all Unicode characters. The data in this database is
based on the :file:`UnicodeData.txt` file version 5.1.0, which is publicly
available from ftp://ftp.unicode.org/.

The module uses the same names and symbols as defined by the UnicodeData File
Format 5.1.0 (see http://www.unicode.org/Public/5.1.0/ucd/UCD.html). It defines
the following functions:


.. function:: lookup(name)

   Look up a character by name. If a character with the given name is found,
   return the corresponding Unicode character. If not found, :exc:`KeyError` is
   raised.


.. function:: name(unichr[, default])

   Returns the name assigned to the Unicode character *unichr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.
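
As a minimal sketch of how :func:`lookup` and :func:`name` fit together, the
following round-trips a character through its name and uses the *default*
argument for a character that has no name. The helper ``safe_lookup`` is
hypothetical and not part of the module.

```python
import unicodedata

# Round trip: name -> character -> name.
ch = unicodedata.lookup('LATIN CAPITAL LETTER C WITH CEDILLA')
round_trip = unicodedata.name(ch)  # 'LATIN CAPITAL LETTER C WITH CEDILLA'

# Control characters such as U+0000 have no name in the database;
# passing a default avoids the ValueError.
label = unicodedata.name(u'\u0000', 'UNNAMED')  # 'UNNAMED'


def safe_lookup(name, default=None):
    """Hypothetical helper: like lookup(), but return *default* on KeyError."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return default


missing = safe_lookup('NO SUCH CHARACTER NAME')  # None
```

Returning a default instead of raising is only one possible design; code that
must distinguish a missing name from a valid result may prefer the exception.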


.. function:: decimal(unichr[, default])

   Returns the decimal value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(unichr[, default])

   Returns the digit value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(unichr[, default])

   Returns the numeric value assigned to the Unicode character *unichr* as a
   float. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.
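
The three functions above differ in which characters they accept. The sketch
below collects all three values for a character, passing ``None`` as *default*
so that undefined values do not raise. The helper name ``numeric_properties``
is an assumption made for illustration.

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for *ch*; None marks an undefined value."""
    return (unicodedata.decimal(ch, None),
            unicodedata.digit(ch, None),
            unicodedata.numeric(ch, None))


nine = numeric_properties(u'9')          # DIGIT NINE: (9, 9, 9.0)
squared = numeric_properties(u'\u00b2')  # SUPERSCRIPT TWO: (None, 2, 2.0)
half = numeric_properties(u'\u00bd')     # VULGAR FRACTION ONE HALF: (None, None, 0.5)
letter = numeric_properties(u'a')        # no numeric properties: (None, None, None)
```

In these samples every character with a decimal value also has a digit value,
and every character with a digit value also has a numeric value, but not the
other way around.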


.. function:: category(unichr)

   Returns the general category assigned to the Unicode character *unichr* as a
   string.
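
One possible use of :func:`category` is to tally the kinds of characters in a
string. This is only an illustrative sketch; ``category_counts`` is a
hypothetical helper, not a function of this module.

```python
import unicodedata


def category_counts(text):
    """Map each general category code found in *text* to its number of occurrences."""
    counts = {}
    for ch in text:
        cat = unicodedata.category(ch)
        counts[cat] = counts.get(cat, 0) + 1
    return counts


counts = category_counts(u'Hello, World 42!')
# Lu (uppercase letters): 2, Ll (lowercase letters): 8,
# Po (other punctuation): 2, Zs (space separators): 2, Nd (decimal digits): 2
```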


.. function:: bidirectional(unichr)

   Returns the bidirectional category assigned to the Unicode character *unichr*
   as a string. If no such value is defined, an empty string is returned.


.. function:: combining(unichr)

   Returns the canonical combining class assigned to the Unicode character
   *unichr* as an integer. Returns ``0`` if no combining class is defined.
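
To illustrate :func:`combining`, the sketch below queries a few characters and
counts the characters in a string that have a non-zero combining class. The
helper ``count_combining_marks`` is an assumption introduced for this example.

```python
import unicodedata

cedilla = unicodedata.combining(u'\u0327')  # COMBINING CEDILLA: 202
acute = unicodedata.combining(u'\u0301')    # COMBINING ACUTE ACCENT: 230
base = unicodedata.combining(u'C')          # ordinary letter: 0


def count_combining_marks(text):
    """Count the characters in *text* whose canonical combining class is non-zero."""
    return sum(1 for ch in text if unicodedata.combining(ch) != 0)


marks = count_combining_marks(u'C\u0327a\u0301')  # 2
```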


.. function:: east_asian_width(unichr)

   Returns the East Asian width assigned to the Unicode character *unichr* as a
   string.

   .. versionadded:: 2.4
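
A common use of :func:`east_asian_width` is estimating how many terminal
columns a string occupies. The following is a rough sketch under the
assumption that wide (``'W'``) and fullwidth (``'F'``) characters take two
columns and everything else takes one; it deliberately ignores combining and
control characters. ``display_width`` is a hypothetical helper.

```python
import unicodedata


def display_width(text):
    """Rough column count: 'W' and 'F' characters count as 2, all others as 1."""
    width = 0
    for ch in text:
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            width += 2
        else:
            width += 1
    return width


ascii_width = display_width(u'abc')          # three narrow characters: 3
cjk_width = display_width(u'\u4e00\uff21')   # one wide + one fullwidth: 4
```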


.. function:: mirrored(unichr)

   Returns the mirrored property assigned to the Unicode character *unichr* as an
   integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.
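
As a small illustration of :func:`mirrored`, this sketch lists the characters
of a string that carry the mirrored property. ``mirrored_chars`` is a
hypothetical helper used only here.

```python
import unicodedata


def mirrored_chars(text):
    """Return the characters of *text* whose mirrored property is 1."""
    return [ch for ch in text if unicodedata.mirrored(ch)]


found = mirrored_chars(u'f(x) < [y]')  # the brackets and the less-than sign
```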


.. function:: decomposition(unichr)

   Returns the character decomposition mapping assigned to the Unicode character
   *unichr* as a string. An empty string is returned if no such mapping is
   defined.
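
The string returned by :func:`decomposition` follows the
:file:`UnicodeData.txt` field layout: an optional tag in angle brackets
followed by hexadecimal code points. The sketch below splits such a string
into its parts; ``parse_decomposition`` is a hypothetical helper and its
return shape is an assumption chosen for illustration.

```python
import unicodedata


def parse_decomposition(ch):
    """Split the mapping of *ch* into (tag, [code points]).

    The tag is None for a canonical mapping; (None, []) means no mapping.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)
    return tag, [int(field, 16) for field in fields]


canonical = parse_decomposition(u'\u00c7')  # (None, [0x43, 0x327])
compat = parse_decomposition(u'\u2160')     # ('<compat>', [0x49])
plain = parse_decomposition(u'C')           # (None, [])
```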


.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values
   for *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility
   equivalence. In Unicode, several characters can be expressed in various ways.
   For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can
   also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327
   (COMBINING CEDILLA).

   For each character, there are two normal forms: normal form C and normal form
   D. Normal form D (NFD) is also known as canonical decomposition, and
   translates each character into its decomposed form. Normal form C (NFC) first
   applies a canonical decomposition, then composes pre-combined characters
   again.

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form
   KC (NFKC) first applies the compatibility decomposition, followed by the
   canonical composition.

   Even if two Unicode strings are normalized and look the same to
   a human reader, if one has combining characters and the other
   doesn't, they may not compare equal.

   .. versionadded:: 2.3
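
The four normal forms described for :func:`normalize` can be tried on the two
characters used as examples in its description. This is a minimal sketch; the
helper ``equivalent`` is hypothetical, and comparing after normalizing both
sides to the same form is one common approach rather than the only one.

```python
import unicodedata

composed = u'\u00c7'     # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u'C\u0327'  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# The two spellings are different strings until they are normalized.
equal_raw = composed == decomposed                       # False
nfd = unicodedata.normalize('NFD', composed)             # equals decomposed
nfc = unicodedata.normalize('NFC', decomposed)           # equals composed

# Canonical forms keep compatibility characters; the K forms replace them.
roman_one = u'\u2160'    # ROMAN NUMERAL ONE
nfc_roman = unicodedata.normalize('NFC', roman_one)      # unchanged
nfkc_roman = unicodedata.normalize('NFKC', roman_one)    # LATIN CAPITAL LETTER I


def equivalent(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


same = equivalent(composed, decomposed)                  # True
```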

In addition, the module exposes the following constants:


.. data:: unidata_version

   The version of the Unicode database used in this module.

   .. versionadded:: 2.3


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

   .. versionadded:: 2.5
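
The sketch below shows how the two constants might be used together: the
module-level functions and the methods of ``ucd_3_2_0`` take the same
arguments but consult different database versions. ``names_in_both`` is a
hypothetical helper, and the choice of U+0221 (a character added to Unicode
after version 3.2) is an assumption made for illustration.

```python
import unicodedata

current_version = unicodedata.unidata_version           # depends on the Python build
legacy_version = unicodedata.ucd_3_2_0.unidata_version  # '3.2.0'


def names_in_both(ch):
    """Return (name in current database, name in 3.2 database); None if unnamed."""
    return (unicodedata.name(ch, None),
            unicodedata.ucd_3_2_0.name(ch, None))


# A long-established character is named identically in both databases.
solidus = names_in_both(u'/')

# A character added after Unicode 3.2 may be known only to the current database.
later = names_in_both(u'\u0221')
```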

Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   u'{'
   >>> unicodedata.name(u'/')
   'SOLIDUS'
   >>> unicodedata.decimal(u'9')
   9
   >>> unicodedata.decimal(u'a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   ValueError: not a decimal
   >>> unicodedata.category(u'A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional(u'\u0660') # 'A'rabic, 'N'umber
   'AN'
