:mod:`HTMLParser` --- Simple HTML and XHTML parser
==================================================

.. module:: HTMLParser
   :synopsis: A simple parser that can handle HTML and XHTML.

.. note::

   The :mod:`HTMLParser` module has been renamed to :mod:`html.parser` in Python
   3.0.  The :term:`2to3` tool will automatically adapt imports when converting
   your sources to 3.0.


.. versionadded:: 2.2

.. index::
   single: HTML
   single: XHTML

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
in :mod:`sgmllib`.


.. class:: HTMLParser()

   The :class:`HTMLParser` class is instantiated without arguments.

   An :class:`HTMLParser` instance is fed HTML data and calls handler methods
   when tags begin and end.  The :class:`HTMLParser` class is meant to be
   subclassed, with the handler methods overridden to provide the desired
   behavior.

   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
   match start tags or call the end-tag handler for elements which are closed
   implicitly by closing an outer element.
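
   The following is a minimal sketch of that last point, not part of the module
   itself.  The ``TagLogger`` subclass and its ``events`` list are hypothetical
   names chosen for illustration, and the fallback import is only there so the
   same code also runs where the module is called :mod:`html.parser`::

      try:
          from HTMLParser import HTMLParser   # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser  # renamed module in Python 3

      class TagLogger(HTMLParser):
          """Hypothetical subclass that records every tag handler call."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.events = []

          def handle_starttag(self, tag, attrs):
              self.events.append(('start', tag))

          def handle_endtag(self, tag):
              self.events.append(('end', tag))

      logger = TagLogger()
      logger.feed('<ul><li>one</ul>')
      logger.close()

      # <li> is closed only implicitly by </ul>, so no ('end', 'li') is reported.
      print(logger.events)
      # [('start', 'ul'), ('start', 'li'), ('end', 'ul')]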

An exception is defined as well:


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing.  This exception provides three attributes: :attr:`msg` is a brief
   message explaining the error, :attr:`lineno` is the number of the line on which
   the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.
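
   As a rough illustration of the buffering, the sketch below splits a start tag
   across two :meth:`feed` calls.  ``StartTags`` and
   ``seen_after_first_chunk`` are hypothetical names, and the fallback import
   only keeps the example runnable under the module's Python 3 name::

      try:
          from HTMLParser import HTMLParser   # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser  # renamed module in Python 3

      class StartTags(HTMLParser):
          """Hypothetical subclass that remembers the start tags it has seen."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.tags = []

          def handle_starttag(self, tag, attrs):
              self.tags.append(tag)

      parser = StartTags()

      # The first chunk ends in the middle of a tag, so it is only buffered.
      parser.feed('<p cla')
      seen_after_first_chunk = list(parser.tags)

      # The second chunk completes the tag and the handler is called.
      parser.feed('ss="intro">Hi')
      parser.close()

      print(seen_after_first_chunk)   # []
      print(parser.tags)              # ['p']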


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.
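
   One possible shape for such an override is sketched below.  The
   ``TextCollector`` class and its ``chunks`` and ``text`` attributes are
   hypothetical; the only point is that the base class method is called before
   the extra end-of-input work::

      try:
          from HTMLParser import HTMLParser   # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser  # renamed module in Python 3

      class TextCollector(HTMLParser):
          """Hypothetical subclass that joins all text once input has ended."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.chunks = []
              self.text = None

          def handle_data(self, data):
              self.chunks.append(data)

          def close(self):
              # Let the base class flush any buffered data first.
              HTMLParser.close(self)
              # Additional end-of-input processing.
              self.text = ''.join(self.chunks)

      collector = TextCollector()
      collector.feed('<b>Hello</b>, world')
      collector.close()
      print(collector.text)   # Hello, world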


.. method:: HTMLParser.getpos()

   Return current line number and offset.
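
   A small sketch of calling :meth:`getpos` from inside a handler follows.  The
   ``Positions`` class is hypothetical.  The values in the comments reflect how
   the CPython implementation behaves (lines counted from 1, offsets from 0),
   which the description above does not itself promise::

      try:
          from HTMLParser import HTMLParser   # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser  # renamed module in Python 3

      class Positions(HTMLParser):
          """Hypothetical subclass that records where each start tag begins."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.positions = []

          def handle_starttag(self, tag, attrs):
              # getpos() returns a (lineno, offset) pair.
              self.positions.append((tag, self.getpos()))

      parser = Positions()
      parser.feed('<html>\n  <body>\n')
      parser.close()

      print(parser.positions)
      # [('html', (1, 0)), ('body', (2, 2))]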


.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).
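
   The sketch below shows the difference from the normalized arguments passed to
   the handler: the original case and spacing of the tag survive.
   ``RawStartTags`` is a hypothetical subclass used only for illustration::

      try:
          from HTMLParser import HTMLParser   # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser  # renamed module in Python 3

      class RawStartTags(HTMLParser):
          """Hypothetical subclass that keeps start tags exactly as written."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.raw = []

          def handle_starttag(self, tag, attrs):
              # tag is lower-cased here, but the raw text is left untouched.
              self.raw.append(self.get_starttag_text())

      parser = RawStartTags()
      parser.feed('<A   HREF="http://www.cwi.nl/">link</A>')
      parser.close()

      print(parser.raw)
      # ['<A   HREF="http://www.cwi.nl/">']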


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.

   The *tag* argument is the name of the tag converted to lower case.  The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  Each *name* is translated to lower case,
   quotes around the *value* are removed, and character and entity references
   in the *value* are replaced.  For instance, for the tag ``<A
   HREF="http://www.cwi.nl/">``, this method would be called as
   ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   .. versionchanged:: 2.6
      All entity references from :mod:`htmlentitydefs` are now replaced in the attribute
      values.
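
   A brief sketch of the arguments a handler receives, assuming a hypothetical
   ``AttrCollector`` subclass.  The ``TITLE`` attribute is added to the tag from
   the text above purely to show an entity reference being replaced::

      try:
          from HTMLParser import HTMLParser   # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser  # renamed module in Python 3

      class AttrCollector(HTMLParser):
          """Hypothetical subclass that stores (tag, attrs) for each start tag."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.seen = []

          def handle_starttag(self, tag, attrs):
              self.seen.append((tag, attrs))

      parser = AttrCollector()
      parser.feed('<A HREF="http://www.cwi.nl/" TITLE="R&amp;D">')
      parser.close()

      tag, attrs = parser.seen[0]
      print(tag)           # a
      print(dict(attrs))   # names lower-cased, quotes gone, &amp; replaced by &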


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<a .../>``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
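
   The two behaviours can be compared side by side in the sketch below.  Both
   classes, ``Events`` and ``EmptyAware``, are hypothetical::

      try:
          from HTMLParser import HTMLParser   # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser  # renamed module in Python 3

      class Events(HTMLParser):
          """Hypothetical subclass relying on the default handle_startendtag."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.events = []

          def handle_starttag(self, tag, attrs):
              self.events.append(('start', tag))

          def handle_endtag(self, tag):
              self.events.append(('end', tag))

      class EmptyAware(Events):
          """Hypothetical subclass that treats empty tags as their own event."""

          def handle_startendtag(self, tag, attrs):
              self.events.append(('startend', tag))

      plain = Events()
      plain.feed('<br/>')
      plain.close()

      aware = EmptyAware()
      aware.feed('<br/>')
      aware.close()

      print(plain.events)   # [('start', 'br'), ('end', 'br')]
      print(aware.events)   # [('startend', 'br')]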


.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.  The
   *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: HTMLParser.handle_charref(name)

   This method is called to process a character reference of the form ``&#ref;``.
   It is intended to be overridden by a derived class; the base class
   implementation does nothing.


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a general entity reference of the form
   ``&name;`` where *name* is the name of the entity.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered.  The *data* argument is
   a string containing the text between the ``--`` and ``--`` delimiters, but not
   the delimiters themselves.  For example, the comment ``<!--text-->`` will cause
   this method to be called with the argument ``'text'``.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.
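
   A short sketch, assuming a hypothetical ``Comments`` subclass, showing that
   the delimiters are stripped while whitespace inside the comment is kept::

      try:
          from HTMLParser import HTMLParser   # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser  # renamed module in Python 3

      class Comments(HTMLParser):
          """Hypothetical subclass that collects the text of every comment."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.comments = []

          def handle_comment(self, data):
              self.comments.append(data)

      parser = Comments()
      parser.feed('<p>visible</p><!--text--><!-- spaced -->')
      parser.close()

      print(parser.comments)   # ['text', ' spaced ']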


.. method:: HTMLParser.handle_decl(decl)

   Method called when an SGML declaration is read by the parser.  The *decl*
   parameter will be the entire contents of the declaration inside the ``<!``...\
   ``>`` markup.  It is intended to be overridden by a derived class; the base
   class implementation does nothing.
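
   For a document type declaration this might look as follows.  The
   ``Declarations`` class is hypothetical, and the short ``DOCTYPE`` used as
   input is just a convenient sample::

      try:
          from HTMLParser import HTMLParser   # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser  # renamed module in Python 3

      class Declarations(HTMLParser):
          """Hypothetical subclass that collects declarations."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.decls = []

          def handle_decl(self, decl):
              self.decls.append(decl)

      parser = Declarations()
      parser.feed('<!DOCTYPE html><html></html>')
      parser.close()

      # Only the text between "<!" and ">" is passed to the handler.
      print(parser.decls)   # ['DOCTYPE html']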


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.
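
   Both cases can be seen in the sketch below, which assumes a hypothetical
   ``Instructions`` subclass.  The XML declaration is used here only as a
   familiar example of an instruction written with a trailing ``'?'``::

      try:
          from HTMLParser import HTMLParser   # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser  # renamed module in Python 3

      class Instructions(HTMLParser):
          """Hypothetical subclass that collects processing instructions."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.instructions = []

          def handle_pi(self, data):
              self.instructions.append(data)

      parser = Instructions()
      parser.feed("<?proc color='red'>")
      parser.feed('<?xml version="1.0"?>')
      parser.close()

      # SGML style: data stops before ">".
      print(parser.instructions[0])   # proc color='red'
      # XHTML style: the trailing "?" is part of data.
      print(parser.instructions[1])   # xml version="1.0"?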


.. _htmlparser-example:

Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out tags as they are encountered::

   from HTMLParser import HTMLParser

   class MyHTMLParser(HTMLParser):

       # Called once for every start tag, e.g. <html>.
       def handle_starttag(self, tag, attrs):
           print "Encountered the beginning of a %s tag" % tag

       # Called once for every end tag, e.g. </html>.
       def handle_endtag(self, tag):
           print "Encountered the end of a %s tag" % tag

   # Feed a small document to see the handlers fire.
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head></html>')
   parser.close()