:mod:`htmllib` --- A parser for HTML documents
==============================================

.. module:: htmllib
   :synopsis: A parser for HTML documents.
   :deprecated:

.. deprecated:: 2.6
   The :mod:`htmllib` module has been removed in Python 3.0.


.. index::
   single: HTML
   single: hypertext

.. index::
   module: sgmllib
   module: formatter
   single: SGMLParser (in module sgmllib)

This module defines a class which can serve as a base for parsing text files
formatted in the HyperText Mark-up Language (HTML). The class is not directly
concerned with I/O --- it must be provided with input in string form via a
method, and makes calls to methods of a "formatter" object in order to produce
output. The :class:`HTMLParser` class is designed to be used as a base class
for other classes in order to add functionality, and allows most of its methods
to be extended or overridden. In turn, this class is derived from and extends
the :class:`SGMLParser` class defined in module :mod:`sgmllib`. The
:class:`HTMLParser` implementation supports the HTML 2.0 language as described
in :rfc:`1866`. Two implementations of formatter objects are provided in the
:mod:`formatter` module; refer to the documentation for that module for
information on the formatter interface.

The following is a summary of the interface defined by
:class:`sgmllib.SGMLParser`:

* The interface to feed data to an instance is through the :meth:`feed` method,
  which takes a string argument. This can be called with as little or as much
  text at a time as desired; ``p.feed(a); p.feed(b)`` has the same effect as
  ``p.feed(a+b)``. When the data contains complete HTML markup constructs, these
  are processed immediately; incomplete constructs are saved in a buffer. To
  force processing of all unprocessed data, call the :meth:`close` method.

  For example, to parse the entire contents of a file, use::

     # parser is an existing HTMLParser instance
     with open('myfile.html') as f:
         parser.feed(f.read())
     parser.close()

* The interface to define semantics for HTML tags is very simple: derive a class
  and define methods called :meth:`start_tag`, :meth:`end_tag`, or :meth:`do_tag`
  (with ``tag`` replaced by the element name, as in :meth:`start_h1`).
  The parser will call these at appropriate moments: :meth:`start_tag` or
  :meth:`do_tag` is called when an opening tag of the form ``<tag ...>`` is
  encountered; :meth:`end_tag` is called when a closing tag of the form ``</tag>``
  is encountered. If an opening tag requires a corresponding closing tag, like
  ``<H1>`` ... ``</H1>``, the class should define the :meth:`start_tag` method; if
  a tag requires no closing tag, like ``<P>``, the class should define the
  :meth:`do_tag` method.

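The two conventions summarized above, incremental feeding and name-based tag
methods, can be illustrated with a small self-contained sketch. The classes
``TinyTagParser`` and ``HeadingParser`` and the helper ``run`` are hypothetical
names invented for this illustration; this is not the :mod:`sgmllib`
implementation, it omits attribute parsing, entities and comments, and it only
approximates the documented behaviour.

```python
import re


class TinyTagParser(object):
    """Toy stand-in for the SGMLParser-style interface (not the real class)."""

    # One complete tag such as <h1>, </h1> or <p class="x">.
    _tag = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>')

    def __init__(self):
        self.rawdata = ''   # input not yet processed, kept between feed() calls
        self.calls = []     # names of the tag methods invoked, in order
        self.text = []      # character data seen so far

    def feed(self, data):
        self.rawdata += data
        self._process(final=False)

    def close(self):
        self._process(final=True)

    def handle_data(self, data):
        self.text.append(data)

    def _process(self, final):
        buf, pos = self.rawdata, 0
        while pos < len(buf):
            lt = buf.find('<', pos)
            if lt != pos:
                # Plain text up to the next '<' (or to the end of the buffer).
                end = len(buf) if lt < 0 else lt
                self.handle_data(buf[pos:end])
                pos = end
                continue
            match = self._tag.match(buf, pos)
            if match is None:
                if final:
                    # close() forces out whatever is left, as plain text.
                    self.handle_data(buf[pos:])
                    pos = len(buf)
                break  # incomplete construct: leave it in the buffer
            self._dispatch(match.group(1) == '/', match.group(2).lower())
            pos = match.end()
        self.rawdata = buf[pos:]

    def _dispatch(self, closing, tag):
        if closing:
            method = getattr(self, 'end_' + tag, None)
            if method is not None:
                method()
            return
        method = (getattr(self, 'start_' + tag, None)
                  or getattr(self, 'do_' + tag, None))
        if method is not None:
            method([])  # attribute parsing is omitted in this sketch


class HeadingParser(TinyTagParser):
    def start_h1(self, attrs):      # <H1> needs a matching </H1>
        self.calls.append('start_h1')

    def end_h1(self):
        self.calls.append('end_h1')

    def do_p(self, attrs):          # <P> needs no closing tag
        self.calls.append('do_p')


def run(chunks):
    parser = HeadingParser()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.calls, ''.join(parser.text)


whole = run(['<H1>Title</H1><P>Body text'])
split = run(['<H1>Tit', 'le</H', '1><P>Body text'])
```

In this sketch ``whole`` and ``split`` compare equal: the chunk boundary that
falls inside ``</H1>`` leaves the partial tag in the buffer until the next call
to ``feed()`` completes it.
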

The module defines a parser class and an exception:


.. class:: HTMLParser(formatter)

   This is the basic HTML parser class. It supports all entity names required by
   the XHTML 1.0 Recommendation (http://www.w3.org/TR/xhtml1). It also defines
   handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing.

   .. versionadded:: 2.4


.. seealso::

   Module :mod:`formatter`
      Interface definition for transforming an abstract flow of formatting events into
      specific output events on writer objects.

   Module :mod:`HTMLParser`
      Alternate HTML parser that offers a slightly lower-level view of the input, but
      is designed to work with XHTML, and does not implement some of the SGML syntax
      that is not used in "HTML as deployed" and is not legal in XHTML.

   Module :mod:`htmlentitydefs`
      Definition of replacement text for XHTML 1.0 entities.

   Module :mod:`sgmllib`
      Base class for :class:`HTMLParser`.


.. _html-parser-objects:

HTMLParser Objects
------------------

Besides the tag methods, the :class:`HTMLParser` class provides the following
methods and instance variables for use within tag methods.


.. attribute:: HTMLParser.formatter

   This is the formatter instance associated with the parser.


.. attribute:: HTMLParser.nofill

   Boolean flag which should be true when whitespace should not be collapsed, or
   false when it should be. In general, this should only be true when character
   data is to be treated as "preformatted" text, as within a ``<PRE>`` element.
   The default value is false. This affects the operation of :meth:`handle_data`
   and :meth:`save_end`.


.. method:: HTMLParser.anchor_bgn(href, name, type)

   This method is called at the start of an anchor region. The arguments
   correspond to the attributes of the ``<A>`` tag with the same names. The
   default implementation maintains a list of hyperlinks (defined by the ``HREF``
   attribute for ``<A>`` tags) within the document. The list of hyperlinks is
   available as the data attribute :attr:`anchorlist`.


.. method:: HTMLParser.anchor_end()

   This method is called at the end of an anchor region. The default
   implementation adds a textual footnote marker using an index into the list of
   hyperlinks created by :meth:`anchor_bgn`.


.. method:: HTMLParser.handle_image(source, alt[, ismap[, align[, width[, height]]]])

   This method is called to handle images. The default implementation simply
   passes the *alt* value to the :meth:`handle_data` method.


.. method:: HTMLParser.save_bgn()

   Begins saving character data in a buffer instead of sending it to the formatter
   object. Retrieve the stored data via :meth:`save_end`. Calls to
   :meth:`save_bgn` and :meth:`save_end` may not be nested.


.. method:: HTMLParser.save_end()

   Ends buffering character data and returns all data saved since the preceding
   call to :meth:`save_bgn`. If the :attr:`nofill` flag is false, whitespace is
   collapsed to single spaces. A call to this method without a preceding call to
   :meth:`save_bgn` will raise a :exc:`TypeError` exception.


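The buffering contract of :meth:`save_bgn` and :meth:`save_end` can be sketched
as follows. ``SavingParser`` is a hypothetical stand-in written for this
illustration, with a plain list in place of the formatter; it is not the
:mod:`htmllib` source, and the explicit :exc:`TypeError` is simply one way of
matching the behaviour described above.

```python
class SavingParser(object):
    """Hypothetical stand-in showing the save_bgn()/save_end() contract."""

    def __init__(self):
        self.nofill = False    # collapse whitespace unless set to true
        self.savedata = None   # None means "not currently saving"
        self.output = []       # stands in for the formatter object

    def handle_data(self, data):
        if self.savedata is not None:
            self.savedata += data      # buffering: hold the text back
        else:
            self.output.append(data)   # normal path: pass the text on

    def save_bgn(self):
        self.savedata = ''

    def save_end(self):
        if self.savedata is None:
            raise TypeError('save_end() called without save_bgn()')
        data, self.savedata = self.savedata, None
        if not self.nofill:
            data = ' '.join(data.split())   # collapse runs of whitespace
        return data


p = SavingParser()
p.save_bgn()
p.handle_data('A  title\n')
p.handle_data('  in pieces ')
title = p.save_end()
```

While saving is active nothing reaches ``p.output``; the collected text comes
back from ``save_end()`` with whitespace collapsed because ``nofill`` is false.
A pair of tag methods for one element would typically call ``save_bgn()`` in
the start method and ``save_end()`` in the end method.
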

:mod:`htmlentitydefs` --- Definitions of HTML general entities
==============================================================

.. module:: htmlentitydefs
   :synopsis: Definitions of HTML general entities.
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

.. note::

   The :mod:`htmlentitydefs` module has been renamed to :mod:`html.entities` in
   Python 3.0. The :term:`2to3` tool will automatically adapt imports when
   converting your sources to 3.0.


This module defines three dictionaries, ``name2codepoint``, ``codepoint2name``,
and ``entitydefs``. ``entitydefs`` is used by the :mod:`htmllib` module to
provide the :attr:`entitydefs` member of the :class:`HTMLParser` class. The
definition provided here contains all the entities defined by XHTML 1.0 that
can be handled using simple textual substitution in the Latin-1 character set
(ISO-8859-1).


.. data:: entitydefs

   A dictionary mapping XHTML 1.0 entity definitions to their replacement text in
   ISO Latin-1.


.. data:: name2codepoint

   A dictionary that maps HTML entity names to Unicode code points.

   .. versionadded:: 2.3


.. data:: codepoint2name

   A dictionary that maps Unicode code points to HTML entity names.

   .. versionadded:: 2.3

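A short sketch of how the three dictionaries might be used. The helper
functions ``entity_to_codepoint`` and ``codepoint_to_reference`` are
hypothetical names introduced for this example, and the fallback import
assumes the Python 3 module name mentioned in the note above.

```python
try:
    # Python 2 name, as documented in this section.
    from htmlentitydefs import codepoint2name, entitydefs, name2codepoint
except ImportError:
    # Python 3 name of the same module.
    from html.entities import codepoint2name, entitydefs, name2codepoint


def entity_to_codepoint(name):
    """Return the Unicode code point for an entity name such as 'amp'."""
    return name2codepoint[name]


def codepoint_to_reference(codepoint):
    """Return '&name;' if the code point has a named entity, else '&#N;'."""
    name = codepoint2name.get(codepoint)
    if name is None:
        return '&#%d;' % codepoint
    return '&%s;' % name


amp_codepoint = entity_to_codepoint('amp')    # 38
amp_reference = codepoint_to_reference(38)    # '&amp;'
less_than = entitydefs['lt']                  # '<'
```

Code points without a named entity, such as 65 for ``A``, fall back to a
numeric character reference in this sketch.
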