13.1 HTMLParser -- Simple HTML and XHTML parser
<H1><A NAME="SECTION0015100000000000000000">
13.1 <tt class="module">HTMLParser</tt> --
Simple HTML and XHTML parser</A>
<A NAME="module-HTMLParser"></A>
<span class="versionnote">New in version 2.2.</span>
This module defines a class <tt class="class">HTMLParser</tt> which serves as the
basis for parsing text files formatted in HTML<a id='l2h-4257' xml:id='l2h-4257'></a> (HyperText
Mark-up Language) and XHTML.<a id='l2h-4258' xml:id='l2h-4258'></a> Unlike the parser in
<tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser is not based on the SGML parser in
<tt class="module"><a href="module-sgmllib.html">sgmllib</a></tt>.
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><span class="typelabel">class</span>&nbsp;<tt id='l2h-4241' xml:id='l2h-4241' class="class">HTMLParser</tt></b>(</nobr></td>
The <tt class="class">HTMLParser</tt> class is instantiated without arguments.
An <tt class="class">HTMLParser</tt> instance is fed HTML data and calls handler
methods when tags begin and end. The <tt class="class">HTMLParser</tt> class is
meant to be subclassed, with the handler methods overridden to provide the desired behavior.
Unlike the parser in <tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser does not check
that end tags match start tags or call the end-tag handler for
elements which are closed implicitly by closing an outer element.
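As a minimal sketch of this subclassing pattern, the following records each handler call. The class name <tt class="class">EventLogger</tt> and its <tt class="member">events</tt> list are hypothetical, and the fallback import is an assumption added so that the sketch also runs on Python 3, where the class lives in <tt class="module">html.parser</tt>.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module documented here
except ImportError:
    from html.parser import HTMLParser  # Python 3 home of the same class


class EventLogger(HTMLParser):
    """Hypothetical subclass that records every handler call."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


parser = EventLogger()
parser.feed('<p><b>text</p>')  # <b> is never closed explicitly
parser.close()
# The parser reports only what it sees: no ('end', 'b') event is generated.
```

The recorded events contain an end tag for <code>p</code> only, which illustrates that implicitly closed elements do not trigger the end-tag handler.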
An exception is defined as well:
<dl><dt><b><span class="typelabel">exception</span>&nbsp;<tt id='l2h-4242' xml:id='l2h-4242' class="exception">HTMLParseError</tt></b></dt>
Exception raised by the <tt class="class">HTMLParser</tt> class when it encounters an
error while parsing. This exception provides three attributes:
<tt class="member">msg</tt> is a brief message explaining the error, <tt class="member">lineno</tt>
is the number of the line on which the broken construct was detected,
and <tt class="member">offset</tt> is the number of characters into the line at which
the construct starts.
<tt class="class">HTMLParser</tt> instances have the following methods:
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4243' xml:id='l2h-4243' class="method">reset</tt></b>(</nobr></td>
Reset the instance. Loses all unprocessed data. This is called
implicitly at instantiation time.
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4244' xml:id='l2h-4244' class="method">feed</tt></b>(</nobr></td>
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or <tt class="method">close()</tt> is called.
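The buffering behavior can be sketched as follows, assuming a hypothetical <tt class="class">TagRecorder</tt> subclass: a start tag split across two <tt class="method">feed()</tt> calls is reported only once it is complete. The fallback import is an assumption for running under Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class TagRecorder(HTMLParser):
    """Hypothetical subclass that stores start tags with their attributes."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


parser = TagRecorder()
parser.feed('<a hr')                # incomplete start tag: buffered
tags_after_first_feed = list(parser.tags)
parser.feed('ef="x">')              # the tag is now complete and is dispatched
tags_after_second_feed = list(parser.tags)
parser.close()
```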
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4245' xml:id='l2h-4245' class="method">close</tt></b>(</nobr></td>
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the <tt class="class">HTMLParser</tt> base class
method <tt class="method">close()</tt>.
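A derived <tt class="method">close()</tt> might look like the sketch below. The class name <tt class="class">CountingParser</tt> and the <tt class="member">finished</tt> flag are hypothetical; the point is only that the base class method is called before the extra end-of-input processing. The fallback import is an assumption for Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class CountingParser(HTMLParser):
    """Hypothetical subclass that counts start tags and collects text."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tag_count = 0
        self.text = []
        self.finished = False

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1

    def handle_data(self, data):
        self.text.append(data)

    def close(self):
        # Always let the base class flush any buffered data first.
        HTMLParser.close(self)
        # Additional end-of-input processing goes here.
        self.finished = True


parser = CountingParser()
parser.feed('<p>tail')
parser.close()
```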
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4246' xml:id='l2h-4246' class="method">getpos</tt></b>(</nobr></td>
Return the current line number and offset as a <code>(<var>lineno</var>, <var>offset</var>)</code> tuple.
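One possible use, sketched here with a hypothetical <tt class="class">PositionRecorder</tt> subclass, is to note where each start tag begins. Line numbers start at 1 and offsets at 0. The fallback import is an assumption for Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class PositionRecorder(HTMLParser):
    """Hypothetical subclass that maps tag names to their source position."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = {}

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (lineno, offset) tuple for the current construct.
        self.positions[tag] = self.getpos()


parser = PositionRecorder()
parser.feed('<p>\n  <b>')
parser.close()
```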
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4247' xml:id='l2h-4247' class="method">get_starttag_text</tt></b>(</nobr></td>
Return the text of the most recently opened start tag. This should
not normally be needed for structured processing, but may be useful in
dealing with HTML "as deployed" or for re-generating input with
minimal changes (whitespace between attributes can be preserved, etc.).
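A short sketch, assuming a hypothetical <tt class="class">RawTagKeeper</tt> subclass, shows that the returned text keeps the original case and spacing while the <var>tag</var> argument is lower-cased. The fallback import is an assumption for Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class RawTagKeeper(HTMLParser):
    """Hypothetical subclass that keeps the raw source of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw_tags = []
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)                         # normalized name
        self.raw_tags.append(self.get_starttag_text())  # text as written


parser = RawTagKeeper()
parser.feed('<A   HREF="x">')
parser.close()
```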
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4248' xml:id='l2h-4248' class="method">handle_starttag</tt></b>(</nobr></td>
<td><var>tag, attrs</var>)</td></tr></table></dt>
This method is called to handle the start of a tag. It is intended to
be overridden by a derived class; the base class implementation does
nothing.
The <var>tag</var> argument is the name of the tag converted to
lower case. The <var>attrs</var> argument is a list of <code>(<var>name</var>,
<var>value</var>)</code> pairs containing the attributes found inside the tag's
<code>&lt;&gt;</code> brackets. The <var>name</var> is translated to lower case,
and double quotes and backslashes in the <var>value</var> have been
interpreted. For instance, for the tag <code>&lt;A
HREF=""&gt;</code>, this method would be called as
"<tt class="samp">handle_starttag('a', [('href', '')])</tt>".
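The shape of the <var>attrs</var> argument can be seen in the following sketch. The <tt class="class">AttrCollector</tt> class and the sample markup are hypothetical, and the fallback import is an assumption for Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class AttrCollector(HTMLParser):
    """Hypothetical subclass that stores (tag, attrs) for each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.calls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs in source order.
        self.calls.append((tag, attrs))


parser = AttrCollector()
parser.feed('<A HREF="page.html" TITLE="Home">')
parser.close()

# Attribute pairs convert naturally to a dict when order does not matter.
attributes = dict(parser.calls[0][1])
```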
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4249' xml:id='l2h-4249' class="method">handle_startendtag</tt></b>(</nobr></td>
<td><var>tag, attrs</var>)</td></tr></table></dt>
Similar to <tt class="method">handle_starttag()</tt>, but called when the parser
encounters an XHTML-style empty tag (<code>&lt;a .../&gt;</code>). This method
may be overridden by subclasses which require this particular lexical
information; the default implementation simply calls
<tt class="method">handle_starttag()</tt> and <tt class="method">handle_endtag()</tt>.
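The default behavior can be observed with a subclass that does not override <tt class="method">handle_startendtag()</tt>. In this sketch the <tt class="class">EmptyTagLogger</tt> class is hypothetical, and the fallback import is an assumption for Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class EmptyTagLogger(HTMLParser):
    """Hypothetical subclass relying on the default handle_startendtag()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


parser = EmptyTagLogger()
parser.feed('<br/>')  # XHTML-style empty tag
parser.close()
```

A subclass that needs to tell <code>&lt;br/&gt;</code> apart from <code>&lt;br&gt;&lt;/br&gt;</code> would override <tt class="method">handle_startendtag()</tt> instead.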
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4250' xml:id='l2h-4250' class="method">handle_endtag</tt></b>(</nobr></td>
This method is called to handle the end tag of an element. It is
intended to be overridden by a derived class; the base class
implementation does nothing. The <var>tag</var> argument is the name of
the tag converted to lower case.
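Because the parser does not match end tags against start tags, any nesting bookkeeping is left to the subclass. The sketch below tracks depth with a hypothetical <tt class="class">DepthTracker</tt> class; the fallback import is an assumption for Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class DepthTracker(HTMLParser):
    """Hypothetical subclass that tracks element nesting depth."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        # tag arrives lower-cased, e.g. 'div' for </DIV>.
        self.depth -= 1


parser = DepthTracker()
parser.feed('<DIV><p>x</p></DIV>')
parser.close()
```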
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4251' xml:id='l2h-4251' class="method">handle_data</tt></b>(</nobr></td>
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
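A common use is collecting the text content of a document. The sketch below assumes a hypothetical <tt class="class">TextExtractor</tt> subclass; note that text may arrive in several chunks, so the pieces are joined at the end. The fallback import is an assumption for Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class TextExtractor(HTMLParser):
    """Hypothetical subclass that gathers character data."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = TextExtractor()
parser.feed('<p>Hello <b>world</b></p>')
parser.close()  # flush anything still buffered
text = ''.join(parser.chunks)
```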
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4252' xml:id='l2h-4252' class="method">handle_charref</tt></b>(</nobr></td>
<dd> This method is called to
process a character reference of the form "<tt class="samp">&amp;#<var>ref</var>;</tt>". It
is intended to be overridden by a derived class; the base class
implementation does nothing.
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4253' xml:id='l2h-4253' class="method">handle_entityref</tt></b>(</nobr></td>
This method is called to process a general entity reference of the
form "<tt class="samp">&amp;<var>name</var>;</tt>" where <var>name</var> is the name of a general entity
reference. It is intended to be overridden by a derived class; the
base class implementation does nothing.
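The two reference handlers can be sketched together as follows. The <tt class="class">RefCollector</tt> class is hypothetical. The <code>convert_charrefs=False</code> argument is an assumption that applies only to the Python 3 version of the class, which otherwise folds references into <tt class="method">handle_data()</tt>; the Python 2 class documented here takes no arguments, hence the fallback.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class RefCollector(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        try:
            # Python 3 only: ask for references to be reported separately.
            HTMLParser.__init__(self, convert_charrefs=False)
        except TypeError:
            # Python 2: the constructor takes no arguments.
            HTMLParser.__init__(self)
        self.refs = []

    def handle_charref(self, name):
        self.refs.append(('char', name))      # '38' for &#38;

    def handle_entityref(self, name):
        self.refs.append(('entity', name))    # 'amp' for &amp;


parser = RefCollector()
parser.feed('a &amp; b &#38; c')
parser.close()
```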
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4254' xml:id='l2h-4254' class="method">handle_comment</tt></b>(</nobr></td>
This method is called when a comment is encountered. The
<var>comment</var> argument is a string containing the text between the
"<tt class="samp">-&#45;</tt>" and "<tt class="samp">-&#45;</tt>" delimiters, but not the delimiters
themselves. For example, the comment "<tt class="samp">&lt;!-&#45;text-&#45;&gt;</tt>" will
cause this method to be called with the argument <code>'text'</code>. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
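The example in the text can be reproduced with a small sketch. The <tt class="class">CommentCollector</tt> class is hypothetical, and the fallback import is an assumption for Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class CommentCollector(HTMLParser):
    """Hypothetical subclass that stores the text of each comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        # data excludes the surrounding <!-- and --> markers.
        self.comments.append(data)


parser = CommentCollector()
parser.feed('<!--text-->')
parser.close()
```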
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4255' xml:id='l2h-4255' class="method">handle_decl</tt></b>(</nobr></td>
Method called when an SGML declaration is read by the parser. The
<var>decl</var> parameter will be the entire contents of the declaration
inside the <code>&lt;!</code>...<code>&gt;</code> markup. It is intended to be overridden
by a derived class; the base class implementation does nothing.
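For a document type declaration, the handler receives the text between <code>&lt;!</code> and <code>&gt;</code>. The sketch below uses a hypothetical <tt class="class">DeclCollector</tt> class and a sample declaration chosen for illustration; the fallback import is an assumption for Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class DeclCollector(HTMLParser):
    """Hypothetical subclass that stores each declaration."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.decls = []

    def handle_decl(self, decl):
        self.decls.append(decl)


parser = DeclCollector()
parser.feed('<!DOCTYPE html>')
parser.close()
```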
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4256' xml:id='l2h-4256' class="method">handle_pi</tt></b>(</nobr></td>
Method called when a processing instruction is encountered. The
<var>data</var> parameter will contain the entire processing instruction.
For example, for the processing instruction <code>&lt;?proc color='red'&gt;</code>,
this method would be called as <code>handle_pi("proc color='red'")</code>. It
is intended to be overridden by a derived class; the base class
implementation does nothing.
<span class="note"><b class="label">Note:</b>
The <tt class="class">HTMLParser</tt> class uses the SGML syntactic rules for
processing instructions. An XHTML processing instruction using the
trailing "<tt class="character">?</tt>" will cause the "<tt class="character">?</tt>" to be included in
<var>data</var>.</span>
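Both cases, the SGML form from the text and an XHTML-style instruction with a trailing "<tt class="character">?</tt>", are shown in this sketch. The <tt class="class">PICollector</tt> class and the XML declaration used as input are hypothetical, and the fallback import is an assumption for Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class PICollector(HTMLParser):
    """Hypothetical subclass that stores each processing instruction."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PICollector()
parser.feed("<?proc color='red'>")      # SGML style: ends at '>'
parser.feed('<?xml version="1.0"?>')    # XHTML style: trailing '?' is kept
parser.close()
```

A subclass that expects XHTML input may want to strip a trailing "<tt class="character">?</tt>" from <var>data</var> itself.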
