
:mod:`xml.dom.minidom` --- Lightweight DOM implementation
=========================================================

.. module:: xml.dom.minidom
   :synopsis: Lightweight Document Object Model (DOM) implementation.
.. moduleauthor:: Paul Prescod <paul@prescod.net>
.. sectionauthor:: Paul Prescod <paul@prescod.net>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. versionadded:: 2.0


:mod:`xml.dom.minidom` is a lightweight implementation of the Document Object
Model interface. It is intended to be simpler than the full DOM and also
significantly smaller.


DOM applications typically start by parsing some XML into a DOM. With
:mod:`xml.dom.minidom`, this is done through the parse functions::

   from xml.dom.minidom import parse, parseString

   dom1 = parse('c:\\temp\\mydata.xml')  # parse an XML file by name

   datasource = open('c:\\temp\\mydata.xml')
   dom2 = parse(datasource)  # parse an open file

   dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')

The :func:`parse` function can take either a filename or an open file object.



.. function:: parse(filename_or_file[, parser])

   Return a :class:`Document` from the given input. *filename_or_file* may be
   either a file name or a file-like object. *parser*, if given, must be a SAX2
   parser object. This function will change the document handler of the parser and
   activate namespace support; other parser configuration (like setting an entity
   resolver) must have been done in advance.

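The two ways of calling :func:`parse` can be sketched as follows. This is a
minimal illustration that assumes Python 3 and an in-memory stream; the sample
document and the names ``XML_BYTES``, ``dom_default`` and ``dom_sax`` are made
up for the example.

```python
import io
import xml.sax
from xml.dom.minidom import parse

# Hypothetical sample document, held in memory instead of on disk.
XML_BYTES = b"<config><item name='a'/><item name='b'/></config>"

# A file-like object can be passed wherever a file name is accepted.
dom_default = parse(io.BytesIO(XML_BYTES))

# Alternatively, supply a SAX2 parser. Any extra configuration (for
# example an entity resolver) has to be applied before calling parse().
sax_parser = xml.sax.make_parser()
dom_sax = parse(io.BytesIO(XML_BYTES), sax_parser)

# Both calls yield a Document; read the attribute values from one of them.
item_names = [item.getAttribute("name")
              for item in dom_sax.getElementsByTagName("item")]
```

Either call should produce an equivalent tree here; passing a parser mainly
matters when the parser needs to be configured beforehand.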

If you have XML in a string, you can use the :func:`parseString` function
instead:


.. function:: parseString(string[, parser])

   Return a :class:`Document` that represents the *string*. This function creates a
   :class:`StringIO` object for the string and passes that on to :func:`parse`.

Both functions return a :class:`Document` object representing the content of the
document.

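As a small sketch of what the returned :class:`Document` contains, the
following parses the sample string used earlier and inspects the children of
its root element. The variable names are illustrative only.

```python
from xml.dom.minidom import parseString

dom = parseString("<myxml>Some data<empty/> some more data</myxml>")
root = dom.documentElement

# The root element has three children: text, an empty element, text.
child_types = [child.nodeType for child in root.childNodes]

# Text nodes expose their content through the ``data`` attribute.
first_text = root.firstChild.data
```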

What the :func:`parse` and :func:`parseString` functions do is connect an XML
parser with a "DOM builder" that can accept parse events from any SAX parser and
convert them into a DOM tree. The names of the functions are perhaps misleading,
but are easy to grasp when learning the interfaces. The parsing of the document
will be completed before these functions return; it's simply that these
functions do not provide a parser implementation themselves.


You can also create a :class:`Document` by calling a method on a "DOM
Implementation" object. You can get this object by calling the
:func:`getDOMImplementation` function in either the :mod:`xml.dom` package or the
:mod:`xml.dom.minidom` module. Using the implementation from the
:mod:`xml.dom.minidom` module will always return a :class:`Document` instance
from the minidom implementation, while the version from :mod:`xml.dom` may
provide an alternate implementation (this is likely if you have the `PyXML
package <http://pyxml.sourceforge.net/>`_ installed). Once you have a
:class:`Document`, you can add child nodes to it to populate the DOM::

   from xml.dom.minidom import getDOMImplementation

   impl = getDOMImplementation()

   newdoc = impl.createDocument(None, "some_tag", None)
   top_element = newdoc.documentElement
   text = newdoc.createTextNode('Some textual content.')
   top_element.appendChild(text)

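Populating a document usually involves elements and attributes as well as
text. The sketch below extends the same idea using the standard DOM creator
methods ``createElement`` and ``setAttribute``; the tag names, the ``sku``
attribute and the values are made-up illustration data.

```python
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()
doc = impl.createDocument(None, "inventory", None)
root = doc.documentElement

# "item", "sku" and the values below are hypothetical sample data.
for sku, label in [("A1", "widget"), ("B2", "gadget")]:
    item = doc.createElement("item")            # new element owned by doc
    item.setAttribute("sku", sku)               # attach an attribute
    item.appendChild(doc.createTextNode(label)) # add text content
    root.appendChild(item)                      # link it into the tree

# Serialize only the root element (no XML declaration).
serialized = root.toxml()
```

Nodes are always created through the :class:`Document` object and then linked
into the tree with ``appendChild``.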

Once you have a DOM document object, you can access the parts of your XML
document through its properties and methods. These properties are defined in
the DOM specification. The main property of the document object is the
:attr:`documentElement` property. It gives you the main element in the XML
document: the one that holds all others. Here is an example program::

   dom3 = parseString("<myxml>Some data</myxml>")
   assert dom3.documentElement.tagName == "myxml"

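Starting from ``documentElement``, the rest of the tree is reachable through
standard DOM methods such as ``getElementsByTagName`` and ``getAttribute``.
The following is a minimal sketch; the sample document and the helper
``text_of`` are hypothetical and not part of the module.

```python
from xml.dom.minidom import parseString

dom = parseString(
    "<library><book id='1'><title>Dune</title></book>"
    "<book id='2'><title>Emma</title></book></library>"
)


def text_of(element):
    """Concatenate the data of the element's direct text children."""
    return "".join(child.data for child in element.childNodes
                   if child.nodeType == child.TEXT_NODE)


# Map each book's id attribute to the text of its first <title> child.
titles = {
    book.getAttribute("id"): text_of(book.getElementsByTagName("title")[0])
    for book in dom.documentElement.getElementsByTagName("book")
}
```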

When you are finished with a DOM, you should clean it up. This is necessary
because some versions of Python do not support garbage collection of objects
that refer to each other in a cycle. Until this restriction is removed from all
versions of Python, it is safest to write your code as if cycles would not be
cleaned up.

The way to clean up a DOM is to call its :meth:`unlink` method::

   dom1.unlink()
   dom2.unlink()
   dom3.unlink()

:meth:`unlink` is a :mod:`xml.dom.minidom`\ -specific extension to the DOM API.
After calling :meth:`unlink` on a node, the node and its descendants are
essentially useless.

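One way to make sure :meth:`unlink` is always called is a ``try``/``finally``
block that copies out plain Python values before the tree is torn down. This
is a sketch of that pattern, not a prescribed idiom; the helper ``root_tag``
is a hypothetical name.

```python
from xml.dom.minidom import parseString


def root_tag(xml_text):
    """Return the root tag name, unlinking the DOM before returning."""
    dom = parseString(xml_text)
    try:
        # Copy out plain Python values while the tree is still intact.
        return dom.documentElement.tagName
    finally:
        dom.unlink()  # break reference cycles even if an error occurred


tag = root_tag("<myxml>Some data</myxml>")

# After unlink(), the document no longer holds its children
# (observed behaviour of the CPython implementation).
leftover = parseString("<a><b/></a>")
leftover.unlink()
remaining_children = len(leftover.childNodes)
```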


.. seealso::

   `Document Object Model (DOM) Level 1 Specification <http://www.w3.org/TR/REC-DOM-Level-1/>`_
      The W3C recommendation for the DOM supported by :mod:`xml.dom.minidom`.



.. _minidom-objects:

DOM Objects
-----------

The definition of the DOM API for Python is given as part of the :mod:`xml.dom`
module documentation. This section lists the differences between the API and
:mod:`xml.dom.minidom`.



.. method:: Node.unlink()

   Break internal references within the DOM so that it will be garbage collected on
   versions of Python without cyclic GC. Even when cyclic GC is available, using
   this can make large amounts of memory available sooner, so calling this on DOM
   objects as soon as they are no longer needed is good practice. This only needs
   to be called on the :class:`Document` object, but may be called on child nodes
   to discard children of that node.



.. method:: Node.writexml(writer[, indent=""[, addindent=""[, newl=""[, encoding=""]]]])

   Write XML to the writer object. The writer should have a :meth:`write` method
   which matches that of the file object interface. The *indent* parameter is the
   indentation of the current node. The *addindent* parameter is the incremental
   indentation to use for subnodes of the current one. The *newl* parameter
   specifies the string to use to terminate lines.

   .. versionchanged:: 2.1
      The optional keyword parameters *indent*, *addindent*, and *newl* were added to
      support pretty output.

   .. versionchanged:: 2.3
      For the :class:`Document` node, an additional keyword argument
      *encoding* can be used to specify the encoding field of the XML header.

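A short sketch of :meth:`writexml` with an in-memory writer follows. It
assumes Python 3, where ``io.StringIO`` provides the required ``write``
method; the exact whitespace of the output can differ between Python versions,
so the example does not rely on it.

```python
import io
from xml.dom.minidom import parseString

dom = parseString("<root><item>one</item><item>two</item></root>")

buffer = io.StringIO()  # any object with a suitable write() method works
# Start at column zero, indent each nesting level by two spaces,
# and end every line with a newline character.
dom.writexml(buffer, indent="", addindent="  ", newl="\n")
output = buffer.getvalue()
dom.unlink()
```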


.. method:: Node.toxml([encoding])

   Return the XML that the DOM represents as a string.

   With no argument, the XML header does not specify an encoding, and the result is
   a Unicode string if the default encoding cannot represent all characters in the
   document. Encoding this string in an encoding other than UTF-8 is likely
   incorrect, since UTF-8 is the default encoding of XML.

   With an explicit *encoding* [#]_ argument, the result is a byte string in the
   specified encoding. It is recommended that this argument always be specified. To
   avoid :exc:`UnicodeError` exceptions in case of unrepresentable text data, the
   encoding argument should be specified as "utf-8".

   .. versionchanged:: 2.3
      the *encoding* argument was introduced; see :meth:`writexml`.

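The difference between the two call styles can be sketched as follows. The
example assumes Python 3, where the text result is a ``str`` and the encoded
result is a ``bytes`` object; the sample document is made up.

```python
from xml.dom.minidom import parseString

# Sample document containing non-ASCII characters.
dom = parseString("<greeting>Gr\u00fc\u00dfe</greeting>")

as_text = dom.toxml()          # no encoding: text, header names no encoding
as_bytes = dom.toxml("utf-8")  # explicit encoding: encoded byte string

# With an explicit encoding, the XML header declares it.
header_mentions_encoding = b'encoding="utf-8"' in as_bytes
dom.unlink()
```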


.. method:: Node.toprettyxml([indent="\t"[, newl="\n"[, encoding=""]]])

   Return a pretty-printed version of the document. *indent* specifies the
   indentation string and defaults to a tab character; *newl* specifies the string
   emitted at the end of each line and defaults to ``\n``.

   .. versionadded:: 2.1

   .. versionchanged:: 2.3
      the encoding argument was introduced; see :meth:`writexml`.

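As an illustration, the sketch below pretty-prints a small document with a
two-space indent instead of the default tab. The sample document uses only
empty child elements so that the layout is easy to predict.

```python
from xml.dom.minidom import parseString

dom = parseString("<root><child/><child/></root>")

# Two spaces per nesting level, one element per line.
pretty = dom.toprettyxml(indent="  ", newl="\n")
lines = pretty.splitlines()
dom.unlink()
```

The first line holds the XML declaration; the remaining lines show the root
element with each child indented by one level.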

The following standard DOM methods have special considerations with
:mod:`xml.dom.minidom`:


.. method:: Node.cloneNode(deep)

   Although this method was present in the version of :mod:`xml.dom.minidom`
   packaged with Python 2.0, it was seriously broken. This has been corrected for
   subsequent releases.

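For reference, the standard DOM semantics of ``cloneNode`` on an element can
be sketched as follows: a deep clone copies the whole subtree, while a shallow
clone copies only the element and its attributes. The sample document is
illustrative.

```python
from xml.dom.minidom import parseString

dom = parseString("<list id='1'><entry>a</entry><entry>b</entry></list>")
original = dom.documentElement

deep_copy = original.cloneNode(True)      # element, attributes and subtree
shallow_copy = original.cloneNode(False)  # element and attributes only

# Clones are detached: they have no parent until inserted somewhere.
deep_is_detached = deep_copy.parentNode is None
```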


.. _dom-example:

DOM Example
-----------

This example is a fairly realistic instance of a simple program. In this
particular case, we do not take much advantage of the flexibility of the DOM.

.. literalinclude:: ../includes/minidom-example.py



.. _minidom-and-dom:

minidom and the DOM standard
----------------------------

The :mod:`xml.dom.minidom` module is essentially a DOM 1.0-compatible DOM with
some DOM 2 features (primarily namespace features).

Usage of the DOM interface in Python is straightforward. The following mapping
rules apply:


* Interfaces are accessed through instance objects. Applications should not
  instantiate the classes themselves; they should use the creator functions
  available on the :class:`Document` object. Derived interfaces support all
  operations (and attributes) from the base interfaces, plus any new operations.

* Operations are used as methods. Since the DOM uses only :keyword:`in`
  parameters, the arguments are passed in normal order (from left to right).
  There are no optional arguments. ``void`` operations return ``None``.

* IDL attributes map to instance attributes. For compatibility with the OMG IDL
  language mapping for Python, an attribute ``foo`` can also be accessed through
  accessor methods :meth:`_get_foo` and :meth:`_set_foo`. ``readonly``
  attributes must not be changed; this is not enforced at runtime.

* The types ``short int``, ``unsigned int``, ``unsigned long long``, and
  ``boolean`` all map to Python integer objects.

* The type ``DOMString`` maps to Python strings. :mod:`xml.dom.minidom` supports
  either byte or Unicode strings, but will normally produce Unicode strings.
  Values of type ``DOMString`` may also be ``None`` where allowed to have the IDL
  ``null`` value by the DOM specification from the W3C.

* ``const`` declarations map to variables in their respective scope (e.g.
  ``xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE``); they must not be changed.

* ``DOMException`` is currently not supported in :mod:`xml.dom.minidom`.
  Instead, :mod:`xml.dom.minidom` uses standard Python exceptions such as
  :exc:`TypeError` and :exc:`AttributeError`.

* :class:`NodeList` objects are implemented using Python's built-in list type.
  Starting with Python 2.2, these objects provide the interface defined in the DOM
  specification, but with earlier versions of Python they do not support the
  official API. They are, however, much more "Pythonic" than the interface
  defined in the W3C recommendations.

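A few of the mapping rules above can be sketched in code. The example assumes
a current Python 3 interpreter and uses a made-up sample document; it shows
``const`` names, the two ways of walking a :class:`NodeList`, and a ``void``
operation returning ``None``.

```python
from xml.dom.minidom import Node, parseString

dom = parseString("<root><a/><b/></root>")
root = dom.documentElement

# const declarations are class-level names; compare against them.
is_element = root.nodeType == Node.ELEMENT_NODE

# NodeList is list-based, so the DOM API and list idioms both work.
children = root.childNodes
dom_style = [children.item(i).tagName for i in range(children.length)]
python_style = [child.tagName for child in children]

# normalize() is declared void in the DOM IDL, so it returns None.
result = root.normalize()
```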

The following interfaces have no implementation in :mod:`xml.dom.minidom`:

* :class:`DOMTimeStamp`

* :class:`DocumentType` (added in Python 2.1)

* :class:`DOMImplementation` (added in Python 2.1)

* :class:`CharacterData`

* :class:`CDATASection`

* :class:`Notation`

* :class:`Entity`

* :class:`EntityReference`

* :class:`DocumentFragment`

Most of these reflect information in the XML document that is not of general
utility to most DOM users.


.. rubric:: Footnotes

.. [#] The encoding string included in XML output should conform to the
   appropriate standards. For example, "UTF-8" is valid, but "UTF8" is
   not. See http://www.w3.org/TR/2006/REC-xml11-20060816/#NT-EncodingDecl
   and http://www.iana.org/assignments/character-sets .