Commit | Line | Data |
---|---|---|
920dae64 AT |
1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
2 | <html> | |
3 | <head> | |
4 | <link rel="STYLESHEET" href="lib.css" type='text/css' /> | |
5 | <link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" /> | |
6 | <link rel='start' href='../index.html' title='Python Documentation Index' /> | |
7 | <link rel="first" href="lib.html" title='Python Library Reference' /> | |
8 | <link rel='contents' href='contents.html' title="Contents" /> | |
9 | <link rel='index' href='genindex.html' title='Index' /> | |
10 | <link rel='last' href='about.html' title='About this document...' /> | |
11 | <link rel='help' href='about.html' title='About this document...' /> | |
12 | <link rel="next" href="module-sgmllib.html" /> | |
13 | <link rel="prev" href="markup.html" /> | |
14 | <link rel="parent" href="markup.html" /> | |
15 | <link rel="next" href="htmlparser-example.html" /> | |
16 | <meta name='aesop' content='information' /> | |
17 | <title>13.1 HTMLParser -- Simple HTML and XHTML parser</title> | |
18 | </head> | |
19 | <body> | |
20 | <DIV CLASS="navigation"> | |
21 | <div id='top-navigation-panel' xml:id='top-navigation-panel'> | |
22 | <table align="center" width="100%" cellpadding="0" cellspacing="2"> | |
23 | <tr> | |
24 | <td class='online-navigation'><a rel="prev" title="13. Structured Markup Processing" | |
25 | href="markup.html"><img src='../icons/previous.png' | |
26 | border='0' height='32' alt='Previous Page' width='32' /></A></td> | |
27 | <td class='online-navigation'><a rel="parent" title="13. Structured Markup Processing" | |
28 | href="markup.html"><img src='../icons/up.png' | |
29 | border='0' height='32' alt='Up One Level' width='32' /></A></td> | |
30 | <td class='online-navigation'><a rel="next" title="13.1.1 Example HTML Parser" | |
31 | href="htmlparser-example.html"><img src='../icons/next.png' | |
32 | border='0' height='32' alt='Next Page' width='32' /></A></td> | |
33 | <td align="center" width="100%">Python Library Reference</td> | |
34 | <td class='online-navigation'><a rel="contents" title="Table of Contents" | |
35 | href="contents.html"><img src='../icons/contents.png' | |
36 | border='0' height='32' alt='Contents' width='32' /></A></td> | |
37 | <td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png' | |
38 | border='0' height='32' alt='Module Index' width='32' /></a></td> | |
39 | <td class='online-navigation'><a rel="index" title="Index" | |
40 | href="genindex.html"><img src='../icons/index.png' | |
41 | border='0' height='32' alt='Index' width='32' /></A></td> | |
42 | </tr></table> | |
43 | <div class='online-navigation'> | |
44 | <b class="navlabel">Previous:</b> | |
45 | <a class="sectref" rel="prev" href="markup.html">13. Structured Markup Processing</A> | |
46 | <b class="navlabel">Up:</b> | |
47 | <a class="sectref" rel="parent" href="markup.html">13. Structured Markup Processing</A> | |
48 | <b class="navlabel">Next:</b> | |
49 | <a class="sectref" rel="next" href="htmlparser-example.html">13.1.1 Example HTML Parser</A> | |
50 | </div> | |
51 | <hr /></div> | |
52 | </DIV> | |
53 | <!--End of Navigation Panel--> | |
54 | ||
55 | <H1><A NAME="SECTION0015100000000000000000"> | |
56 | 13.1 <tt class="module">HTMLParser</tt> -- | |
57 | Simple HTML and XHTML parser</A> | |
58 | </H1> | |
59 | ||
60 | <P> | |
61 | <A NAME="module-HTMLParser"></A> | |
62 | ||
63 | <P> | |
64 | ||
65 | <span class="versionnote">New in version 2.2.</span> | |
66 | ||
67 | <P> | |
68 | This module defines a class <tt class="class">HTMLParser</tt> which serves as the | |
69 | basis for parsing text files formatted in HTML<a id='l2h-4257' xml:id='l2h-4257'></a> (HyperText | |
70 | Mark-up Language) and XHTML.<a id='l2h-4258' xml:id='l2h-4258'></a> Unlike the parser in | |
71 | <tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser is not based on the SGML parser in | |
72 | <tt class="module"><a href="module-sgmllib.html">sgmllib</a></tt>. | |
73 | ||
74 | <P> | |
75 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
76 | <td><nobr><b><span class="typelabel">class</span> <tt id='l2h-4241' xml:id='l2h-4241' class="class">HTMLParser</tt></b>(</nobr></td> | |
77 | <td><var></var>)</td></tr></table></dt> | |
78 | <dd> | |
79 | The <tt class="class">HTMLParser</tt> class is instantiated without arguments. | |
80 | ||
81 | <P> | |
82 | An HTMLParser instance is fed HTML data and calls handler functions | |
83 | when tags begin and end. The <tt class="class">HTMLParser</tt> class is meant to be | |
84 | overridden by the user to provide a desired behavior. | |
85 | ||
86 | <P> | |
87 | Unlike the parser in <tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser does not check | |
88 | that end tags match start tags or call the end-tag handler for | |
89 | elements which are closed implicitly by closing an outer element. | |
90 | </dl> | |
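<P>
The subclassing pattern described above can be sketched as follows. This is a minimal illustration rather than part of the module: the names <code>EventRecorder</code> and <code>record_events</code> are hypothetical, and the import assumes Python 3, where this module is called <code>html.parser</code> (on Python 2 the import would be <code>from HTMLParser import HTMLParser</code>).

```python
from html.parser import HTMLParser  # assumption: Python 3 name of this module


class EventRecorder(HTMLParser):
    """Hypothetical subclass that records start-tag and end-tag events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def record_events(html):
    """Feed a complete document to a fresh parser and return its events."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


# The <li> elements are closed implicitly, so no ("end", "li") event appears.
events = record_events("<ul><li>one<li>two</ul>")
```

Because the parser does not track element nesting, the only end-tag event in this sketch is the one for the explicit <code>&lt;/ul&gt;</code>.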
91 | ||
92 | <P> | |
93 | An exception is defined as well: | |
94 | ||
95 | <P> | |
96 | <dl><dt><b><span class="typelabel">exception</span> <tt id='l2h-4242' xml:id='l2h-4242' class="exception">HTMLParseError</tt></b></dt> | |
97 | <dd> | |
98 | Exception raised by the <tt class="class">HTMLParser</tt> class when it encounters an | |
99 | error while parsing. This exception provides three attributes: | |
100 | <tt class="member">msg</tt> is a brief message explaining the error, <tt class="member">lineno</tt> | |
101 | is the number of the line on which the broken construct was detected, | |
102 | and <tt class="member">offset</tt> is the number of characters into the line at which | |
103 | the construct starts. | |
104 | </dd></dl> | |
105 | ||
106 | <P> | |
107 | <tt class="class">HTMLParser</tt> instances have the following methods: | |
108 | ||
109 | <P> | |
110 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
111 | <td><nobr><b><tt id='l2h-4243' xml:id='l2h-4243' class="method">reset</tt></b>(</nobr></td> | |
112 | <td><var></var>)</td></tr></table></dt> | |
113 | <dd> | |
114 | Reset the instance. Loses all unprocessed data. This is called | |
115 | implicitly at instantiation time. | |
116 | </dl> | |
117 | ||
118 | <P> | |
119 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
120 | <td><nobr><b><tt id='l2h-4244' xml:id='l2h-4244' class="method">feed</tt></b>(</nobr></td> | |
121 | <td><var>data</var>)</td></tr></table></dt> | |
122 | <dd> | |
123 | Feed some text to the parser. It is processed insofar as it consists | |
124 | of complete elements; incomplete data is buffered until more data is | |
125 | fed or <tt class="method">close()</tt> is called. | |
126 | </dl> | |
127 | ||
128 | <P> | |
129 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
130 | <td><nobr><b><tt id='l2h-4245' xml:id='l2h-4245' class="method">close</tt></b>(</nobr></td> | |
131 | <td><var></var>)</td></tr></table></dt> | |
132 | <dd> | |
133 | Force processing of all buffered data as if it were followed by an | |
134 | end-of-file mark. This method may be redefined by a derived class to | |
135 | define additional processing at the end of the input, but the | |
136 | redefined version should always call the <tt class="class">HTMLParser</tt> base class | |
137 | method <tt class="method">close()</tt>. | |
138 | </dl> | |
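<P>
The buffering done by <tt class="method">feed()</tt> and the advice about redefining <tt class="method">close()</tt> might look like this in practice. The sketch assumes the Python 3 module name <code>html.parser</code>; the class <code>TagLogger</code> and its <code>finished</code> attribute are hypothetical.

```python
from html.parser import HTMLParser  # assumption: Python 3 name of this module


class TagLogger(HTMLParser):
    """Hypothetical subclass that logs start tags and notes the end of input."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.finished = False

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))

    def close(self):
        # Always delegate to the base class so buffered data is processed.
        super().close()
        self.finished = True


parser = TagLogger()
parser.feed('<a hr')          # incomplete start tag: buffered, no handler call
tags_after_first_feed = list(parser.tags)
parser.feed('ef="x">')        # the tag is now complete and is processed
tags_after_second_feed = list(parser.tags)
parser.close()
```

The first call leaves the partial tag in the buffer; the handler runs only once the closing bracket has arrived.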
139 | ||
140 | <P> | |
141 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
142 | <td><nobr><b><tt id='l2h-4246' xml:id='l2h-4246' class="method">getpos</tt></b>(</nobr></td> | |
143 | <td><var></var>)</td></tr></table></dt> | |
144 | <dd> | |
145 | Return current line number and offset. | |
146 | </dl> | |
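<P>
One possible use of <tt class="method">getpos()</tt> is to record where each start tag was found. The sketch below assumes the Python 3 module name <code>html.parser</code>; <code>PositionRecorder</code> is a hypothetical name.

```python
from html.parser import HTMLParser  # assumption: Python 3 name of this module


class PositionRecorder(HTMLParser):
    """Hypothetical subclass that stores the position of each start tag."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) pair.
        self.positions.append((tag, self.getpos()))


parser = PositionRecorder()
parser.feed("<p>\n  <b>bold</b></p>")
parser.close()
positions = parser.positions
```

In the Python 3 implementation, line numbers count from 1 and offsets from 0, so the second tag in this input is reported at line 2, offset 2.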
147 | ||
148 | <P> | |
149 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
150 | <td><nobr><b><tt id='l2h-4247' xml:id='l2h-4247' class="method">get_starttag_text</tt></b>(</nobr></td> | |
151 | <td><var></var>)</td></tr></table></dt> | |
152 | <dd> | |
153 | Return the text of the most recently opened start tag. This should | |
154 | not normally be needed for structured processing, but may be useful in | |
155 | dealing with HTML ``as deployed'' or for re-generating input with | |
156 | minimal changes (whitespace between attributes can be preserved, | |
157 | etc.). | |
158 | </dl> | |
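<P>
As a sketch of how <tt class="method">get_starttag_text()</tt> preserves the original spelling of a tag, the hypothetical <code>RawTagRecorder</code> below stores the raw text of every start tag it sees. The import assumes the Python 3 module name <code>html.parser</code>.

```python
from html.parser import HTMLParser  # assumption: Python 3 name of this module


class RawTagRecorder(HTMLParser):
    """Hypothetical subclass that keeps the source text of each start tag."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # The raw text keeps the original case and inter-attribute whitespace.
        self.raw_tags.append(self.get_starttag_text())


parser = RawTagRecorder()
parser.feed('<A   HREF="x"  CLASS=y>link</A>')
parser.close()
raw_tags = parser.raw_tags
```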
159 | ||
160 | <P> | |
161 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
162 | <td><nobr><b><tt id='l2h-4248' xml:id='l2h-4248' class="method">handle_starttag</tt></b>(</nobr></td> | |
163 | <td><var>tag, attrs</var>)</td></tr></table></dt> | |
164 | <dd> | |
165 | This method is called to handle the start of a tag. It is intended to | |
166 | be overridden by a derived class; the base class implementation does | |
167 | nothing. | |
168 | ||
169 | <P> | |
170 | The <var>tag</var> argument is the name of the tag converted to | |
171 | lower case. The <var>attrs</var> argument is a list of <code>(<var>name</var>, | |
172 | <var>value</var>)</code> pairs containing the attributes found inside the tag's | |
173 | <code>&lt;&gt;</code> brackets. The <var>name</var> is translated to lower case, | |
174 | and double quotes and backslashes in the <var>value</var> are | |
175 | interpreted. For instance, for the tag <code>&lt;A | |
176 | HREF="http://www.cwi.nl/"&gt;</code>, this method would be called as | |
177 | "<tt class="samp">handle_starttag('a', [('href', 'http://www.cwi.nl/')])</tt>". | |
178 | </dl> | |
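<P>
A typical override of <tt class="method">handle_starttag()</tt> walks the <var>attrs</var> list. The following sketch collects link targets; <code>LinkCollector</code> and <code>collect_links</code> are hypothetical names, and the import assumes the Python 3 module name <code>html.parser</code>.

```python
from html.parser import HTMLParser  # assumption: Python 3 name of this module


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive already converted to lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def collect_links(html):
    """Return the href values found in a complete document."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


links = collect_links('<A HREF="http://www.cwi.nl/">CWI</A><a name="top">')
```

The upper-case tag and attribute from the example above are matched by the lower-case comparisons, and the anchor without an <code>href</code> contributes nothing.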
179 | ||
180 | <P> | |
181 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
182 | <td><nobr><b><tt id='l2h-4249' xml:id='l2h-4249' class="method">handle_startendtag</tt></b>(</nobr></td> | |
183 | <td><var>tag, attrs</var>)</td></tr></table></dt> | |
184 | <dd> | |
185 | Similar to <tt class="method">handle_starttag()</tt>, but called when the parser | |
186 | encounters an XHTML-style empty tag (<code>&lt;a .../&gt;</code>). This method | |
187 | may be overridden by subclasses which require this particular lexical | |
188 | information; the default implementation simply calls | |
189 | <tt class="method">handle_starttag()</tt> and <tt class="method">handle_endtag()</tt>. | |
190 | </dl> | |
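<P>
The difference between the default behaviour and an override of <tt class="method">handle_startendtag()</tt> can be sketched as below. The class names and the helper <code>events_for</code> are hypothetical, and the import assumes the Python 3 module name <code>html.parser</code>.

```python
from html.parser import HTMLParser  # assumption: Python 3 name of this module


class DefaultRecorder(HTMLParser):
    """Hypothetical subclass relying on the default handle_startendtag()."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class EmptyTagRecorder(DefaultRecorder):
    """Hypothetical subclass that treats XHTML-style empty tags specially."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


def events_for(parser_class, html):
    """Parse a complete document with parser_class and return its events."""
    parser = parser_class()
    parser.feed(html)
    parser.close()
    return parser.events


default_events = events_for(DefaultRecorder, "<br/>")
override_events = events_for(EmptyTagRecorder, "<br/>")
```

Without the override, the empty tag is reported as a start tag immediately followed by an end tag; with it, the two handlers are bypassed.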
191 | ||
192 | <P> | |
193 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
194 | <td><nobr><b><tt id='l2h-4250' xml:id='l2h-4250' class="method">handle_endtag</tt></b>(</nobr></td> | |
195 | <td><var>tag</var>)</td></tr></table></dt> | |
196 | <dd> | |
197 | This method is called to handle the end tag of an element. It is | |
198 | intended to be overridden by a derived class; the base class | |
199 | implementation does nothing. The <var>tag</var> argument is the name of | |
200 | the tag converted to lower case. | |
201 | </dl> | |
202 | ||
203 | <P> | |
204 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
205 | <td><nobr><b><tt id='l2h-4251' xml:id='l2h-4251' class="method">handle_data</tt></b>(</nobr></td> | |
206 | <td><var>data</var>)</td></tr></table></dt> | |
207 | <dd> | |
208 | This method is called to process arbitrary data. It is intended to be | |
209 | overridden by a derived class; the base class implementation does | |
210 | nothing. | |
211 | </dl> | |
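<P>
Since text may reach <tt class="method">handle_data()</tt> in several pieces, a common approach is to collect the pieces and join them afterwards. The sketch below assumes the Python 3 module name <code>html.parser</code>; <code>TextExtractor</code> and <code>extract_text</code> are hypothetical names.

```python
from html.parser import HTMLParser  # assumption: Python 3 name of this module


class TextExtractor(HTMLParser):
    """Hypothetical subclass that accumulates the text outside of tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(html):
    """Return the concatenated character data of a complete document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still held in the buffer
    return "".join(parser.chunks)


text = extract_text("<p>Hello, <b>world</b>!</p>")
```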
212 | ||
213 | <P> | |
214 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
215 | <td><nobr><b><tt id='l2h-4252' xml:id='l2h-4252' class="method">handle_charref</tt></b>(</nobr></td> | |
216 | <td><var>name</var>)</td></tr></table></dt> | |
217 | <dd> This method is called to | |
218 | process a character reference of the form "<tt class="samp">&amp;#<var>ref</var>;</tt>", where | |
219 | <var>ref</var> is passed as the <var>name</var> argument. It is intended to be overridden by a derived class; the base class | |
220 | implementation does nothing. | |
221 | </dl> | |
222 | ||
223 | <P> | |
224 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
225 | <td><nobr><b><tt id='l2h-4253' xml:id='l2h-4253' class="method">handle_entityref</tt></b>(</nobr></td> | |
226 | <td><var>name</var>)</td></tr></table></dt> | |
227 | <dd> | |
228 | This method is called to process a general entity reference of the | |
229 | form "<tt class="samp">&amp;<var>name</var>;</tt>", where <var>name</var> is the name of a general | |
230 | entity. It is intended to be overridden by a derived class; the | |
231 | base class implementation does nothing. | |
232 | </dl> | |
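<P>
The two reference handlers can be exercised with a sketch like the one below. It assumes Python 3, where the module is named <code>html.parser</code> and where these handlers are only called if the parser is created with <code>convert_charrefs=False</code>; that keyword is a Python 3 addition and not part of the interface documented here. <code>ReferenceRecorder</code> is a hypothetical name.

```python
from html.parser import HTMLParser  # assumption: Python 3 name of this module


class ReferenceRecorder(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        # Python 3 only: keep references unconverted so the handlers are called.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_charref(self, name):
        # name is the text between "&#" and ";".
        self.refs.append(("charref", name))

    def handle_entityref(self, name):
        # name is the text between "&" and ";".
        self.refs.append(("entityref", name))


parser = ReferenceRecorder()
parser.feed("caf&#233; &amp; bar")
parser.close()
refs = parser.refs
```

Both handlers receive the reference as a string, so converting it to a character is left to the subclass.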
233 | ||
234 | <P> | |
235 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
236 | <td><nobr><b><tt id='l2h-4254' xml:id='l2h-4254' class="method">handle_comment</tt></b>(</nobr></td> | |
237 | <td><var>data</var>)</td></tr></table></dt> | |
238 | <dd> | |
239 | This method is called when a comment is encountered. The | |
240 | <var>data</var> argument is a string containing the text between the | |
241 | "<tt class="samp">--</tt>" and "<tt class="samp">--</tt>" delimiters, but not the delimiters | |
242 | themselves. For example, the comment "<tt class="samp">&lt;!--text--&gt;</tt>" will | |
243 | cause this method to be called with the argument <code>'text'</code>. It is | |
244 | intended to be overridden by a derived class; the base class | |
245 | implementation does nothing. | |
246 | </dl> | |
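<P>
A small sketch of a <tt class="method">handle_comment()</tt> override, using the comment from the example above. The import assumes the Python 3 module name <code>html.parser</code>, and <code>CommentCollector</code> is a hypothetical name.

```python
from html.parser import HTMLParser  # assumption: Python 3 name of this module


class CommentCollector(HTMLParser):
    """Hypothetical subclass that stores the body of each comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data excludes the comment delimiters themselves.
        self.comments.append(data)


parser = CommentCollector()
parser.feed("<p><!--text--></p>")
parser.close()
comments = parser.comments
```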
247 | ||
248 | <P> | |
249 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
250 | <td><nobr><b><tt id='l2h-4255' xml:id='l2h-4255' class="method">handle_decl</tt></b>(</nobr></td> | |
251 | <td><var>decl</var>)</td></tr></table></dt> | |
252 | <dd> | |
253 | Method called when an SGML declaration is read by the parser. The | |
254 | <var>decl</var> parameter will be the entire contents of the declaration | |
255 | inside the <code>&lt;!</code>...<code>&gt;</code> markup. It is intended to be overridden | |
256 | by a derived class; the base class implementation does nothing. | |
257 | </dl> | |
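<P>
To illustrate what <tt class="method">handle_decl()</tt> receives, the sketch below parses a document type declaration. The import assumes the Python 3 module name <code>html.parser</code>, and <code>DeclCollector</code> is a hypothetical name.

```python
from html.parser import HTMLParser  # assumption: Python 3 name of this module


class DeclCollector(HTMLParser):
    """Hypothetical subclass that stores each declaration it encounters."""

    def __init__(self):
        super().__init__()
        self.decls = []

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">".
        self.decls.append(decl)


parser = DeclCollector()
parser.feed('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">'
            '<html></html>')
parser.close()
decls = parser.decls
```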
258 | ||
259 | <P> | |
260 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
261 | <td><nobr><b><tt id='l2h-4256' xml:id='l2h-4256' class="method">handle_pi</tt></b>(</nobr></td> | |
262 | <td><var>data</var>)</td></tr></table></dt> | |
263 | <dd> | |
264 | Method called when a processing instruction is encountered. The | |
265 | <var>data</var> parameter will contain the entire processing instruction. | |
266 | For example, for the processing instruction <code>&lt;?proc color='red'&gt;</code>, | |
267 | this method would be called as <code>handle_pi("proc color='red'")</code>. It | |
268 | is intended to be overridden by a derived class; the base class | |
269 | implementation does nothing. | |
270 | ||
271 | <P> | |
272 | <span class="note"><b class="label">Note:</b> | |
273 | The <tt class="class">HTMLParser</tt> class uses the SGML syntactic rules for | |
274 | processing instructions. An XHTML processing instruction using the | |
275 | trailing "<tt class="character">?</tt>" will cause the "<tt class="character">?</tt>" to be included in | |
276 | <var>data</var>.</span> | |
277 | </dl> | |
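<P>
The note above about the trailing question mark can be seen in a short sketch. The import assumes the Python 3 module name <code>html.parser</code>, and <code>PICollector</code> is a hypothetical name.

```python
from html.parser import HTMLParser  # assumption: Python 3 name of this module


class PICollector(HTMLParser):
    """Hypothetical subclass that stores each processing instruction."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PICollector()
parser.feed("<?proc color='red'>")       # SGML-style processing instruction
parser.feed('<?xml version="1.0"?>')     # XHTML-style, with a trailing "?"
parser.close()
instructions = parser.instructions
```

A subclass that needs the XHTML form without the trailing character has to strip it itself.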
278 | ||
279 | <P> | |
280 | ||
281 | <p><br /></p><hr class='online-navigation' /> | |
282 | <div class='online-navigation'> | |
283 | <!--Table of Child-Links--> | |
284 | <A NAME="CHILD_LINKS"><STRONG>Subsections</STRONG></a> | |
285 | ||
286 | <UL CLASS="ChildLinks"> | |
287 | <LI><A href="htmlparser-example.html">13.1.1 Example HTML Parser Application</a> | |
288 | </ul> | |
289 | <!--End of Table of Child-Links--> | |
290 | </div> | |
291 | ||
292 | <DIV CLASS="navigation"> | |
325 | <hr /> | |
326 | <span class="release-info">Release 2.4.2, documentation updated on 28 September 2005.</span> | |
327 | </DIV> | |
328 | <!--End of Navigation Panel--> | |
329 | <ADDRESS> | |
330 | See <i><a href="about.html">About this document...</a></i> for information on suggesting changes. | |
331 | </ADDRESS> | |
332 | </BODY> | |
333 | </HTML> |