<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<link rel="STYLESHEET" href="lib.css" type='text/css' />
<link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" />
<link rel='start' href='../index.html' title='Python Documentation Index' />
<link rel="first" href="lib.html" title='Python Library Reference' />
<link rel='contents' href='contents.html' title="Contents" />
<link rel='index' href='genindex.html' title='Index' />
<link rel='last' href='about.html' title='About this document...' />
<link rel='help' href='about.html' title='About this document...' />
<link rel="next" href="module-sgmllib.html" />
<link rel="prev" href="markup.html" />
<link rel="parent" href="markup.html" />
<link rel="next" href="htmlparser-example.html" />
<meta name='aesop' content='information' />
<title>13.1 HTMLParser -- Simple HTML and XHTML parser</title>
</head>
<body>
<DIV CLASS="navigation">
<div id='top-navigation-panel' xml:id='top-navigation-panel'>
<table align="center" width="100%" cellpadding="0" cellspacing="2">
<tr>
<td class='online-navigation'><a rel="prev" title="13. Structured Markup Processing"
 href="markup.html"><img src='../icons/previous.png'
 border='0' height='32' alt='Previous Page' width='32' /></A></td>
<td class='online-navigation'><a rel="parent" title="13. Structured Markup Processing"
 href="markup.html"><img src='../icons/up.png'
 border='0' height='32' alt='Up One Level' width='32' /></A></td>
<td class='online-navigation'><a rel="next" title="13.1.1 Example HTML Parser"
 href="htmlparser-example.html"><img src='../icons/next.png'
 border='0' height='32' alt='Next Page' width='32' /></A></td>
<td align="center" width="100%">Python Library Reference</td>
<td class='online-navigation'><a rel="contents" title="Table of Contents"
 href="contents.html"><img src='../icons/contents.png'
 border='0' height='32' alt='Contents' width='32' /></A></td>
<td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png'
 border='0' height='32' alt='Module Index' width='32' /></a></td>
<td class='online-navigation'><a rel="index" title="Index"
 href="genindex.html"><img src='../icons/index.png'
 border='0' height='32' alt='Index' width='32' /></A></td>
</tr></table>
<div class='online-navigation'>
<b class="navlabel">Previous:</b>
<a class="sectref" rel="prev" href="markup.html">13. Structured Markup Processing</A>
<b class="navlabel">Up:</b>
<a class="sectref" rel="parent" href="markup.html">13. Structured Markup Processing</A>
<b class="navlabel">Next:</b>
<a class="sectref" rel="next" href="htmlparser-example.html">13.1.1 Example HTML Parser</A>
</div>
<hr /></div>
</DIV>
<!--End of Navigation Panel-->

<H1><A NAME="SECTION0015100000000000000000">
13.1 <tt class="module">HTMLParser</tt> --
 Simple HTML and XHTML parser</A>
</H1>

<P>
<A NAME="module-HTMLParser"></A>

<P>

<span class="versionnote">New in version 2.2.</span>

<P>
This module defines a class <tt class="class">HTMLParser</tt> which serves as the
basis for parsing text files formatted in HTML<a id='l2h-4257' xml:id='l2h-4257'></a> (HyperText
Markup Language) and XHTML.<a id='l2h-4258' xml:id='l2h-4258'></a> Unlike the parser in
<tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser is not based on the SGML parser in
<tt class="module"><a href="module-sgmllib.html">sgmllib</a></tt>.

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><span class="typelabel">class</span>&nbsp;<tt id='l2h-4241' xml:id='l2h-4241' class="class">HTMLParser</tt></b>(</nobr></td>
 <td><var></var>)</td></tr></table></dt>
<dd>
The <tt class="class">HTMLParser</tt> class is instantiated without arguments.

<P>
An HTMLParser instance is fed HTML data and calls handler functions
when tags begin and end. The <tt class="class">HTMLParser</tt> class is meant to be
overridden by the user to provide a desired behavior.

<P>
Unlike the parser in <tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser does not check
that end tags match start tags or call the end-tag handler for
elements which are closed implicitly by closing an outer element.
</dl>
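
<P>
As an illustration of the points above, the following sketch subclasses
<tt class="class">HTMLParser</tt> and records which handlers fire for input whose
<code>&lt;b&gt;</code> element is never closed explicitly. The class name
<code>EventLogger</code> and its <code>events</code> list are hypothetical, and the
fallback import assumes that under Python 3 the same class is available from
<code>html.parser</code>.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class EventLogger(HTMLParser):
    """Hypothetical subclass that records start and end tag events."""

    def __init__(self):
        HTMLParser.__init__(self)  # the base class takes no arguments here
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = EventLogger()
parser.feed("<div><b>bold</div>")  # <b> is closed only implicitly
parser.close()
print(parser.events)
```

<P>
No end-tag event for <code>b</code> should appear in the recorded list, since the
parser neither checks nesting nor calls the end-tag handler for implicitly
closed elements.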

<P>
An exception is defined as well:

<P>
<dl><dt><b><span class="typelabel">exception</span>&nbsp;<tt id='l2h-4242' xml:id='l2h-4242' class="exception">HTMLParseError</tt></b></dt>
<dd>
Exception raised by the <tt class="class">HTMLParser</tt> class when it encounters an
error while parsing. This exception provides three attributes:
<tt class="member">msg</tt> is a brief message explaining the error, <tt class="member">lineno</tt>
is the number of the line on which the broken construct was detected,
and <tt class="member">offset</tt> is the number of characters into the line at which
the construct starts.
</dd></dl>

<P>
<tt class="class">HTMLParser</tt> instances have the following methods:

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4243' xml:id='l2h-4243' class="method">reset</tt></b>(</nobr></td>
 <td><var></var>)</td></tr></table></dt>
<dd>
Reset the instance. Any unprocessed data is lost. This is called
implicitly at instantiation time.
</dl>
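
<P>
A small sketch of the effect of <tt class="method">reset()</tt> on buffered input
follows. The <code>TagLogger</code> class and its <code>tags</code> list are
hypothetical; note that attributes added by a subclass are not cleared by the
base class method.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class TagLogger(HTMLParser):
    """Hypothetical subclass that records the names of start tags."""

    def __init__(self):
        HTMLParser.__init__(self)  # calls reset() implicitly
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


parser = TagLogger()
parser.feed("<a hr")   # incomplete start tag: kept in the internal buffer
parser.reset()         # the buffered "<a hr" is discarded
parser.feed("<b>")
parser.close()
print(parser.tags)
```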

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4244' xml:id='l2h-4244' class="method">feed</tt></b>(</nobr></td>
 <td><var>data</var>)</td></tr></table></dt>
<dd>
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or <tt class="method">close()</tt> is called.
</dl>
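
<P>
The buffering behaviour can be sketched by splitting a start tag across two
calls to <tt class="method">feed()</tt>. The <code>EventLogger</code> class and the
variable <code>seen_after_first_chunk</code> are hypothetical names used only for
this illustration.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class EventLogger(HTMLParser):
    """Hypothetical subclass that records tag and data events in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = EventLogger()
parser.feed("<a hr")                         # tag is incomplete: buffered
seen_after_first_chunk = list(parser.events)
parser.feed('ef="x.html">link</a>')          # completes the tag and element
parser.close()
print(seen_after_first_chunk)
print(parser.events)
```

<P>
Nothing is reported until the second chunk completes the start tag.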

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4245' xml:id='l2h-4245' class="method">close</tt></b>(</nobr></td>
 <td><var></var>)</td></tr></table></dt>
<dd>
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the <tt class="class">HTMLParser</tt> base class
method <tt class="method">close()</tt>.
</dl>
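
<P>
One way to redefine <tt class="method">close()</tt> while still calling the base
class method is sketched below. The <code>TextCollector</code> class and its
<code>chunks</code> and <code>text</code> attributes are hypothetical.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class TextCollector(HTMLParser):
    """Hypothetical subclass that joins all character data on close()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.text = None

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Call the base class first so any buffered data is processed,
        # then do the additional end-of-input work.
        HTMLParser.close(self)
        self.text = "".join(self.chunks)


collector = TextCollector()
collector.feed("<p>Hello, ")
collector.feed("world")
collector.close()
print(collector.text)
```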

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4246' xml:id='l2h-4246' class="method">getpos</tt></b>(</nobr></td>
 <td><var></var>)</td></tr></table></dt>
<dd>
Return current line number and offset.
</dl>
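
<P>
A sketch of calling <tt class="method">getpos()</tt> from inside a handler follows.
The <code>PositionLogger</code> class is hypothetical. In this example the value
is a <code>(lineno, offset)</code> pair describing where the tag being handled
starts, with lines counted from 1 and offsets from 0.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class PositionLogger(HTMLParser):
    """Hypothetical subclass that records where each start tag begins."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        self.positions.append((tag, self.getpos()))


parser = PositionLogger()
parser.feed("<html>\n  <body>\n")
parser.close()
print(parser.positions)
```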

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4247' xml:id='l2h-4247' class="method">get_starttag_text</tt></b>(</nobr></td>
 <td><var></var>)</td></tr></table></dt>
<dd>
Return the text of the most recently opened start tag. This should
not normally be needed for structured processing, but may be useful in
dealing with HTML "as deployed" or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
</dl>
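
<P>
The following sketch shows the original spelling and spacing of a start tag
being recovered with <tt class="method">get_starttag_text()</tt>. The
<code>RawTagEcho</code> class and its <code>raw</code> list are hypothetical.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class RawTagEcho(HTMLParser):
    """Hypothetical subclass that keeps the source text of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw = []

    def handle_starttag(self, tag, attrs):
        # tag and attrs are normalized; the raw text keeps case and spacing.
        self.raw.append(self.get_starttag_text())


parser = RawTagEcho()
parser.feed('<A   HREF="x.html"  >link</A>')
parser.close()
print(parser.raw)
```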

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4248' xml:id='l2h-4248' class="method">handle_starttag</tt></b>(</nobr></td>
 <td><var>tag, attrs</var>)</td></tr></table></dt>
<dd>
This method is called to handle the start of a tag. It is intended to
be overridden by a derived class; the base class implementation does
nothing.

<P>
The <var>tag</var> argument is the name of the tag converted to
lower case. The <var>attrs</var> argument is a list of <code>(<var>name</var>,
<var>value</var>)</code> pairs containing the attributes found inside the tag's
<code>&lt;&gt;</code> brackets. The <var>name</var> is translated to lower case,
and double quotes and backslashes in the <var>value</var> have been
interpreted. For instance, for the tag <code>&lt;A
HREF="http://www.cwi.nl/"&gt;</code>, this method would be called as
"<tt class="samp">handle_starttag('a', [('href', 'http://www.cwi.nl/')])</tt>".
</dl>
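
<P>
A common use of <tt class="method">handle_starttag()</tt> is to pick out one
attribute from the <var>attrs</var> list. The sketch below collects
<code>href</code> values from anchors; the <code>LinkCollector</code> class and its
<code>links</code> list are hypothetical.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers href values of <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":          # tag names arrive in lower case
            return
        for name, value in attrs:
            if name == "href":  # attribute names arrive in lower case
                self.links.append(value)


parser = LinkCollector()
parser.feed('<A HREF="http://www.cwi.nl/">CWI</A> <a name="top">anchor</a>')
parser.close()
print(parser.links)
```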

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4249' xml:id='l2h-4249' class="method">handle_startendtag</tt></b>(</nobr></td>
 <td><var>tag, attrs</var>)</td></tr></table></dt>
<dd>
Similar to <tt class="method">handle_starttag()</tt>, but called when the parser
encounters an XHTML-style empty tag (<code>&lt;a .../&gt;</code>). This method
may be overridden by subclasses which require this particular lexical
information; the default implementation simply calls
<tt class="method">handle_starttag()</tt> and <tt class="method">handle_endtag()</tt>.
</dl>
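
<P>
The difference between the default behaviour and an overridden
<tt class="method">handle_startendtag()</tt> can be sketched with two small
subclasses. Both class names and the <code>events</code> list are hypothetical.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class DefaultLogger(HTMLParser):
    """Hypothetical subclass relying on the default handle_startendtag()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class EmptyTagLogger(DefaultLogger):
    """Hypothetical subclass that treats empty tags as a separate event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


default = DefaultLogger()
default.feed("<br/>")
default.close()

custom = EmptyTagLogger()
custom.feed("<br/>")
custom.close()

print(default.events)
print(custom.events)
```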

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4250' xml:id='l2h-4250' class="method">handle_endtag</tt></b>(</nobr></td>
 <td><var>tag</var>)</td></tr></table></dt>
<dd>
This method is called to handle the end tag of an element. It is
intended to be overridden by a derived class; the base class
implementation does nothing. The <var>tag</var> argument is the name of
the tag converted to lower case.
</dl>

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4251' xml:id='l2h-4251' class="method">handle_data</tt></b>(</nobr></td>
 <td><var>data</var>)</td></tr></table></dt>
<dd>
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
</dl>

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4252' xml:id='l2h-4252' class="method">handle_charref</tt></b>(</nobr></td>
 <td><var>name</var>)</td></tr></table></dt>
<dd>
This method is called to process a character reference of the form
"<tt class="samp">&amp;#<var>name</var>;</tt>". It
is intended to be overridden by a derived class; the base class
implementation does nothing.
</dl>

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4253' xml:id='l2h-4253' class="method">handle_entityref</tt></b>(</nobr></td>
 <td><var>name</var>)</td></tr></table></dt>
<dd>
This method is called to process a general entity reference of the
form "<tt class="samp">&amp;<var>name</var>;</tt>" where <var>name</var> is the name of a
general entity. It is intended to be overridden by a derived class; the
base class implementation does nothing.
</dl>
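
<P>
The two reference handlers can be sketched together. The
<code>RefLogger</code> class and its <code>refs</code> list are hypothetical. The
<code>convert_charrefs=False</code> argument is not part of the interface
described here; it is an assumption about later Python 3 releases, where
references are otherwise decoded and passed to
<tt class="method">handle_data()</tt> instead.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class RefLogger(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        try:
            # Assumption: later Python 3 releases need this flag so that
            # the reference handlers are still called.
            HTMLParser.__init__(self, convert_charrefs=False)
        except TypeError:
            HTMLParser.__init__(self)  # as documented: no arguments
        self.refs = []

    def handle_charref(self, name):
        self.refs.append(("charref", name))

    def handle_entityref(self, name):
        self.refs.append(("entityref", name))


parser = RefLogger()
parser.feed("a &gt; b &#62; c")
parser.close()
print(parser.refs)
```

<P>
In both cases the handler receives only the text between the leading
<code>&amp;</code> (or <code>&amp;#</code>) and the closing <code>;</code>.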

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4254' xml:id='l2h-4254' class="method">handle_comment</tt></b>(</nobr></td>
 <td><var>data</var>)</td></tr></table></dt>
<dd>
This method is called when a comment is encountered. The
<var>data</var> argument is a string containing the text between the
"<tt class="samp">-&#45;</tt>" and "<tt class="samp">-&#45;</tt>" delimiters, but not the delimiters
themselves. For example, the comment "<tt class="samp">&lt;!-&#45;text-&#45;&gt;</tt>" will
cause this method to be called with the argument <code>'text'</code>. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
</dl>
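
<P>
A brief sketch of collecting comment text follows; the
<code>CommentCollector</code> class and its <code>comments</code> list are
hypothetical.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class CommentCollector(HTMLParser):
    """Hypothetical subclass that gathers the text of every comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        # data excludes the comment delimiters; inner whitespace is kept.
        self.comments.append(data)


parser = CommentCollector()
parser.feed("<p>x</p><!--text--><!-- spaced -->")
parser.close()
print(parser.comments)
```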

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4255' xml:id='l2h-4255' class="method">handle_decl</tt></b>(</nobr></td>
 <td><var>decl</var>)</td></tr></table></dt>
<dd>
Method called when an SGML declaration is read by the parser. The
<var>decl</var> parameter will be the entire contents of the declaration
inside the <code>&lt;!</code>...<code>&gt;</code> markup. It is intended to be overridden
by a derived class; the base class implementation does nothing.
</dl>
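
<P>
For a document type declaration, the value passed to
<tt class="method">handle_decl()</tt> can be sketched as follows. The
<code>DeclCollector</code> class and its <code>decls</code> list are hypothetical.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class DeclCollector(HTMLParser):
    """Hypothetical subclass that records SGML declarations."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.decls = []

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">".
        self.decls.append(decl)


parser = DeclCollector()
parser.feed('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">')
parser.close()
print(parser.decls)
```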

<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
 <td><nobr><b><tt id='l2h-4256' xml:id='l2h-4256' class="method">handle_pi</tt></b>(</nobr></td>
 <td><var>data</var>)</td></tr></table></dt>
<dd>
Method called when a processing instruction is encountered. The
<var>data</var> parameter will contain the entire processing instruction.
For example, for the processing instruction <code>&lt;?proc color='red'&gt;</code>,
this method would be called as <code>handle_pi("proc color='red'")</code>. It
is intended to be overridden by a derived class; the base class
implementation does nothing.

<P>
<span class="note"><b class="label">Note:</b>
The <tt class="class">HTMLParser</tt> class uses the SGML syntactic rules for
processing instructions. An XHTML processing instruction using the
trailing "<tt class="character">?</tt>" will cause the "<tt class="character">?</tt>" to be included in
<var>data</var>.</span>
</dl>
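
<P>
The note above can be illustrated by feeding one SGML-style and one
XHTML-style processing instruction. The <code>PICollector</code> class and its
<code>instructions</code> list are hypothetical.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this document
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class PICollector(HTMLParser):
    """Hypothetical subclass that records processing instructions."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PICollector()
parser.feed("<?proc color='red'>")       # SGML style: ends at ">"
parser.feed('<?xml version="1.0"?>')     # XHTML style: trailing "?" is kept
parser.close()
print(parser.instructions)
```

<P>
A subclass that needs XHTML semantics could strip a trailing
<code>?</code> from <var>data</var> itself.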

<P>

<p><br /></p><hr class='online-navigation' />
<div class='online-navigation'>
<!--Table of Child-Links-->
<A NAME="CHILD_LINKS"><STRONG>Subsections</STRONG></a>

<UL CLASS="ChildLinks">
<LI><A href="htmlparser-example.html">13.1.1 Example HTML Parser Application</a>
</ul>
<!--End of Table of Child-Links-->
</div>

<DIV CLASS="navigation">
<hr />
<span class="release-info">Release 2.4.2, documentation updated on 28 September 2005.</span>
</DIV>
<!--End of Navigation Panel-->
<ADDRESS>
See <i><a href="about.html">About this document...</a></i> for information on suggesting changes.
</ADDRESS>
</BODY>
</HTML>