<!DOCTYPE html PUBLIC
"-//W3C//DTD HTML 4.0 Transitional//EN">
<link rel=
"STYLESHEET" href=
"lib.css" type='text/css'
/>
<link rel=
"SHORTCUT ICON" href=
"../icons/pyfav.png" type=
"image/png" />
<link rel='start' href='../index.html' title='Python Documentation Index'
/>
<link rel=
"first" href=
"lib.html" title='Python Library Reference'
/>
<link rel='contents' href='contents.html'
title=
"Contents" />
<link rel='index' href='genindex.html' title='Index'
/>
<link rel='last' href='about.html' title='About this document...'
/>
<link rel='help' href='about.html' title='About this document...'
/>
<link rel=
"next" href=
"module-sgmllib.html" />
<link rel=
"prev" href=
"markup.html" />
<link rel=
"parent" href=
"markup.html" />
<link rel=
"next" href=
"htmlparser-example.html" />
<meta name='aesop' content='information'
/>
<title>13.1 HTMLParser -- Simple HTML and XHTML parser
</title>
<div id='top-navigation-panel' xml:id='top-navigation-panel'
>
<table align=
"center" width=
"100%" cellpadding=
"0" cellspacing=
"2">
<td class='online-navigation'
><a rel=
"prev" title=
"13. Structured Markup Processing"
href=
"markup.html"><img src='../icons/previous.png'
border='
0' height='
32' alt='Previous Page' width='
32'
/></A></td>
<td class='online-navigation'
><a rel=
"parent" title=
"13. Structured Markup Processing"
href=
"markup.html"><img src='../icons/up.png'
border='
0' height='
32' alt='Up One Level' width='
32'
/></A></td>
<td class='online-navigation'
><a rel=
"next" title=
"13.1.1 Example HTML Parser"
href=
"htmlparser-example.html"><img src='../icons/next.png'
border='
0' height='
32' alt='Next Page' width='
32'
/></A></td>
<td align=
"center" width=
"100%">Python Library Reference
</td>
<td class='online-navigation'
><a rel=
"contents" title=
"Table of Contents"
href=
"contents.html"><img src='../icons/contents.png'
border='
0' height='
32' alt='Contents' width='
32'
/></A></td>
<td class='online-navigation'
><a href=
"modindex.html" title=
"Module Index"><img src='../icons/modules.png'
border='
0' height='
32' alt='Module Index' width='
32'
/></a></td>
<td class='online-navigation'
><a rel=
"index" title=
"Index"
href=
"genindex.html"><img src='../icons/index.png'
border='
0' height='
32' alt='Index' width='
32'
/></A></td>
<div class='online-navigation'
>
<b class=
"navlabel">Previous:
</b>
<a class=
"sectref" rel=
"prev" href=
"markup.html">13. Structured Markup Processing
</A>
<b class=
"navlabel">Up:
</b>
<a class=
"sectref" rel=
"parent" href=
"markup.html">13. Structured Markup Processing
</A>
<b class=
"navlabel">Next:
</b>
<a class=
"sectref" rel=
"next" href=
"htmlparser-example.html">13.1.1 Example HTML Parser
</A>
<!--End of Navigation Panel-->
<H1><A NAME=
"SECTION0015100000000000000000">
13.1 <tt class=
"module">HTMLParser
</tt> --
Simple HTML and XHTML parser
</A>
<A NAME=
"module-HTMLParser"></A>
<span class=
"versionnote">New in version
2.2.
</span>
This module defines a class
<tt class=
"class">HTMLParser
</tt> which serves as the
basis for parsing text files formatted in HTML
<a id='l2h-
4257' xml:id='l2h-
4257'
></a> (HyperText
Mark-up Language) and XHTML.
<a id='l2h-
4258' xml:id='l2h-
4258'
></a> Unlike the parser in
<tt class=
"module"><a href=
"module-htmllib.html">htmllib
</a></tt>, this parser is not based on the SGML parser in
<tt class=
"module"><a href=
"module-sgmllib.html">sgmllib
</a></tt>.
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><span class=
"typelabel">class
</span> <tt id='l2h-
4241' xml:id='l2h-
4241'
class=
"class">HTMLParser
</tt></b>(
</nobr></td>
<td><var></var>)
</td></tr></table></dt>
The
<tt class=
"class">HTMLParser
</tt> class is instantiated without arguments.
An HTMLParser instance is fed HTML data and calls handler functions
when tags begin and end. The
<tt class=
"class">HTMLParser
</tt> class is meant to be
overridden by the user to provide a desired behavior.
Unlike the parser in
<tt class=
"module"><a href=
"module-htmllib.html">htmllib
</a></tt>, this parser does not check
that end tags match start tags or call the end-tag handler for
elements which are closed implicitly by closing an outer element.
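<p>As a rough illustration of the handler model described above, the following sketch subclasses <tt class="class">HTMLParser</tt> and records tag events. The class name <tt>TagLogger</tt> and its <tt>events</tt> attribute are hypothetical, and the import fallback assumes the module may be exposed as <tt>html.parser</tt> on newer Python versions.</p>

```python
try:
    from html.parser import HTMLParser  # assumed location on newer Pythons
except ImportError:
    from HTMLParser import HTMLParser   # module name used in this document


class TagLogger(HTMLParser):
    """Hypothetical subclass that records start and end tag events."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = TagLogger()
# The <li> elements are closed implicitly by </ul>.
parser.feed("<ul><li>one<li>two</ul>")
parser.close()
print(parser.events)
```

<p>No <tt>("end", "li")</tt> event is recorded here, since the parser does not synthesize end-tag calls for implicitly closed elements.</p>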
An exception is defined as well:
<dl><dt><b><span class=
"typelabel">exception
</span> <tt id='l2h-
4242' xml:id='l2h-
4242'
class=
"exception">HTMLParseError
</tt></b></dt>
Exception raised by the
<tt class=
"class">HTMLParser
</tt> class when it encounters an
error while parsing. This exception provides three attributes:
<tt class=
"member">msg
</tt> is a brief message explaining the error,
<tt class=
"member">lineno
</tt>
is the number of the line on which the broken construct was detected,
and
<tt class=
"member">offset
</tt> is the number of characters into the line at which the construct starts.
<tt class=
"class">HTMLParser
</tt> instances have the following methods:
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4243' xml:id='l2h-
4243'
class=
"method">reset
</tt></b>(
</nobr></td>
<td><var></var>)
</td></tr></table></dt>
Reset the instance, discarding all unprocessed data. This is called
implicitly at instantiation time.
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4244' xml:id='l2h-
4244'
class=
"method">feed
</tt></b>(
</nobr></td>
<td><var>data
</var>)
</td></tr></table></dt>
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or
<tt class=
"method">close()
</tt> is called.
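<p>The buffering behaviour can be sketched by splitting a start tag across two calls to <tt class="method">feed()</tt>. The class name <tt>Collector</tt> and the variable names are hypothetical; the import fallback assumes the module may be exposed as <tt>html.parser</tt> on newer Python versions.</p>

```python
try:
    from html.parser import HTMLParser  # assumed location on newer Pythons
except ImportError:
    from HTMLParser import HTMLParser   # module name used in this document


class Collector(HTMLParser):
    """Hypothetical subclass that remembers the start tags it has seen."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


parser = Collector()
parser.feed("<p cla")             # incomplete tag: buffered, no handler call yet
after_first = list(parser.tags)
parser.feed('ss="x">')            # tag is now complete and gets processed
after_second = list(parser.tags)
parser.close()
print(after_first, after_second)
```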
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4245' xml:id='l2h-
4245'
class=
"method">close
</tt></b>(
</nobr></td>
<td><var></var>)
</td></tr></table></dt>
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the
<tt class=
"class">HTMLParser
</tt> base class
method
<tt class=
"method">close()
</tt>.
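<p>A minimal sketch of a redefined <tt class="method">close()</tt> follows. The class <tt>TextCollector</tt> and its attributes are hypothetical; the point is only that the override calls the base class method before doing its own end-of-input work. The import fallback assumes the module may be exposed as <tt>html.parser</tt> on newer Python versions.</p>

```python
try:
    from html.parser import HTMLParser  # assumed location on newer Pythons
except ImportError:
    from HTMLParser import HTMLParser   # module name used in this document


class TextCollector(HTMLParser):
    """Hypothetical subclass that joins all character data at end of input."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.text = None

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Let the base class flush any buffered data first.
        HTMLParser.close(self)
        # Additional end-of-input processing specific to this subclass.
        self.text = "".join(self.chunks)


parser = TextCollector()
parser.feed("<p>Hello, ")
parser.feed("world</p>")
parser.close()
print(parser.text)
```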
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4246' xml:id='l2h-
4246'
class=
"method">getpos
</tt></b>(
</nobr></td>
<td><var></var>)
</td></tr></table></dt>
Return current line number and offset.
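<p>The sketch below calls <tt class="method">getpos()</tt> from inside a handler to record where each start tag begins. The class name <tt>PositionLogger</tt> is hypothetical. The comments assume that lines are counted from 1 and offsets from 0, which the text above does not spell out. The import fallback assumes the module may be exposed as <tt>html.parser</tt> on newer Python versions.</p>

```python
try:
    from html.parser import HTMLParser  # assumed location on newer Pythons
except ImportError:
    from HTMLParser import HTMLParser   # module name used in this document


class PositionLogger(HTMLParser):
    """Hypothetical subclass that records (tag, (lineno, offset)) pairs."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # Assumption: lineno is 1-based and offset is 0-based.
        self.positions.append((tag, self.getpos()))


parser = PositionLogger()
parser.feed("<html>\n  <body>\n")
parser.close()
print(parser.positions)
```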
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4247' xml:id='l2h-
4247'
class=
"method">get_starttag_text
</tt></b>(
</nobr></td>
<td><var></var>)
</td></tr></table></dt>
Return the text of the most recently opened start tag. This should
not normally be needed for structured processing, but may be useful in
dealing with HTML "as deployed" or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
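<p>To make the difference between the raw tag text and the parsed form concrete, this sketch records both for a single start tag. The class name <tt>RawTagLogger</tt> and the variable <tt>source</tt> are hypothetical; the import fallback assumes the module may be exposed as <tt>html.parser</tt> on newer Python versions.</p>

```python
try:
    from html.parser import HTMLParser  # assumed location on newer Pythons
except ImportError:
    from HTMLParser import HTMLParser   # module name used in this document


class RawTagLogger(HTMLParser):
    """Hypothetical subclass that keeps parsed and raw forms side by side."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it appeared in the input.
        self.seen.append((tag, attrs, self.get_starttag_text()))


source = '<A   HREF="http://www.cwi.nl/"  >'
parser = RawTagLogger()
parser.feed(source)
parser.close()

tag, attrs, raw = parser.seen[0]
print(tag, attrs)   # normalized name and attribute list
print(raw)          # original text, with case and spacing preserved
```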
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4248' xml:id='l2h-
4248'
class=
"method">handle_starttag
</tt></b>(
</nobr></td>
<td><var>tag, attrs
</var>)
</td></tr></table></dt>
This method is called to handle the start of a tag. It is intended to
be overridden by a derived class; the base class implementation does
nothing. The
<var>tag
</var> argument is the name of the tag converted to
lower case. The
<var>attrs
</var> argument is a list of
<code>(
<var>name
</var>,
<var>value
</var>)
</code> pairs containing the attributes found inside the tag's
<code>&lt;&gt;</code> brackets. The
<var>name
</var> is converted to lower case,
and double quotes and backslashes in the
<var>value
</var> are
interpreted. For instance, for the tag
<code>&lt;A
HREF=
"http://www.cwi.nl/"&gt;</code>, this method would be called as
"<tt class="samp
">handle_starttag('a', [('href', 'http://www.cwi.nl/')])</tt>".
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4249' xml:id='l2h-
4249'
class=
"method">handle_startendtag
</tt></b>(
</nobr></td>
<td><var>tag, attrs
</var>)
</td></tr></table></dt>
Similar to
<tt class=
"method">handle_starttag()
</tt>, but called when the parser
encounters an XHTML-style empty tag (
<code>&lt;a .../
&gt;</code>). This method
may be overridden by subclasses which require this particular lexical
information; the default implementation simply calls
<tt class=
"method">handle_starttag()
</tt> and
<tt class=
"method">handle_endtag()
</tt>.
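<p>The following sketch contrasts the default dispatch with a subclass that overrides <tt class="method">handle_startendtag()</tt>. Both class names and the <tt>"empty"</tt> event label are hypothetical; the import fallback assumes the module may be exposed as <tt>html.parser</tt> on newer Python versions.</p>

```python
try:
    from html.parser import HTMLParser  # assumed location on newer Pythons
except ImportError:
    from HTMLParser import HTMLParser   # module name used in this document


class EmptyTagLogger(HTMLParser):
    """Hypothetical subclass relying on the default handle_startendtag()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class EmptyTagAware(EmptyTagLogger):
    """Hypothetical subclass that treats XHTML-style empty tags specially."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("empty", tag))


default = EmptyTagLogger()
default.feed("<p>line<br/></p>")
default.close()

aware = EmptyTagAware()
aware.feed("<p>line<br/></p>")
aware.close()

print(default.events)
print(aware.events)
```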
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4250' xml:id='l2h-
4250'
class=
"method">handle_endtag
</tt></b>(
</nobr></td>
<td><var>tag
</var>)
</td></tr></table></dt>
This method is called to handle the end tag of an element. It is
intended to be overridden by a derived class; the base class
implementation does nothing. The
<var>tag
</var> argument is the name of
the tag converted to lower case.
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4251' xml:id='l2h-
4251'
class=
"method">handle_data
</tt></b>(
</nobr></td>
<td><var>data
</var>)
</td></tr></table></dt>
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4252' xml:id='l2h-
4252'
class=
"method">handle_charref
</tt></b>(
</nobr></td>
<td><var>name
</var>)
</td></tr></table></dt>
<dd> This method is called to
process a character reference of the form
"<tt class="samp
">&amp;#<var>ref</var>;</tt>". It
is intended to be overridden by a derived class; the base class
implementation does nothing.
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4253' xml:id='l2h-
4253'
class=
"method">handle_entityref
</tt></b>(
</nobr></td>
<td><var>name
</var>)
</td></tr></table></dt>
This method is called to process a general entity reference of the
form
"<tt class="samp
">&amp;<var>name</var>;</tt>" where
<var>name
</var> is a general entity
reference. It is intended to be overridden by a derived class; the
base class implementation does nothing.
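<p>A small sketch of the two reference handlers follows. The class name <tt>RefLogger</tt> is hypothetical. The <tt>convert_charrefs=False</tt> argument is an assumption about newer Python versions, where references are otherwise decoded into the data before these handlers can see them; the fallback covers versions whose constructor takes no arguments, as documented here.</p>

```python
try:
    from html.parser import HTMLParser  # assumed location on newer Pythons
except ImportError:
    from HTMLParser import HTMLParser   # module name used in this document


class RefLogger(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        try:
            # Assumption: newer versions accept this keyword and need it
            # set to False for the reference handlers to be called.
            HTMLParser.__init__(self, convert_charrefs=False)
        except TypeError:
            # The constructor documented here takes no arguments.
            HTMLParser.__init__(self)
        self.refs = []

    def handle_charref(self, name):
        self.refs.append(("charref", name))

    def handle_entityref(self, name):
        self.refs.append(("entityref", name))


parser = RefLogger()
parser.feed("caf&#233; &amp; bar")
parser.close()
print(parser.refs)
```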
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4254' xml:id='l2h-
4254'
class=
"method">handle_comment
</tt></b>(
</nobr></td>
<td><var>data
</var>)
</td></tr></table></dt>
This method is called when a comment is encountered. The
<var>data
</var> argument is a string containing the text between the
"<tt class="samp
">--</tt>" and
"<tt class="samp
">--</tt>" delimiters, but not the delimiters
themselves. For example, the comment
"<tt class="samp
">&lt;!--text--&gt;</tt>" will
cause this method to be called with the argument
<code>'text'
</code>. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4255' xml:id='l2h-
4255'
class=
"method">handle_decl
</tt></b>(
</nobr></td>
<td><var>decl
</var>)
</td></tr></table></dt>
Method called when an SGML declaration is read by the parser. The
<var>decl
</var> parameter will be the entire contents of the declaration
inside the
<code>&lt;!
</code>...
<code>&gt;</code> markup. It is intended to be overridden
by a derived class; the base class implementation does nothing.
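<p>The comment and declaration handlers can be exercised together, as in this sketch. The class name <tt>MarkupLogger</tt> and its attributes are hypothetical; the import fallback assumes the module may be exposed as <tt>html.parser</tt> on newer Python versions.</p>

```python
try:
    from html.parser import HTMLParser  # assumed location on newer Pythons
except ImportError:
    from HTMLParser import HTMLParser   # module name used in this document


class MarkupLogger(HTMLParser):
    """Hypothetical subclass that records declarations and comments."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.decls = []
        self.comments = []

    def handle_decl(self, decl):
        # decl holds what appears between "<!" and ">".
        self.decls.append(decl)

    def handle_comment(self, data):
        # data holds the comment text without its delimiters.
        self.comments.append(data)


parser = MarkupLogger()
parser.feed("<!DOCTYPE html><!--text-->")
parser.close()
print(parser.decls, parser.comments)
```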
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4256' xml:id='l2h-
4256'
class=
"method">handle_pi
</tt></b>(
</nobr></td>
<td><var>data
</var>)
</td></tr></table></dt>
Method called when a processing instruction is encountered. The
<var>data
</var> parameter will contain the entire processing instruction.
For example, for the processing instruction
<code>&lt;?proc color='red'
&gt;</code>,
this method would be called as
<code>handle_pi(
"proc color='red'")
</code>. It
is intended to be overridden by a derived class; the base class
implementation does nothing.
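<p>The sketch below feeds one SGML-style and one XHTML-style processing instruction to a subclass that records what <tt class="method">handle_pi()</tt> receives. The class name <tt>PILogger</tt> and the stylesheet instruction are hypothetical; the import fallback assumes the module may be exposed as <tt>html.parser</tt> on newer Python versions.</p>

```python
try:
    from html.parser import HTMLParser  # assumed location on newer Pythons
except ImportError:
    from HTMLParser import HTMLParser   # module name used in this document


class PILogger(HTMLParser):
    """Hypothetical subclass that records processing instructions."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PILogger()
parser.feed("<?proc color='red'>")              # SGML-style, closed by ">"
parser.feed("<?xml-stylesheet href='s.css'?>")  # XHTML-style, closed by "?>"
parser.close()
print(parser.instructions)
```

<p>For the XHTML-style instruction, the trailing <tt>?</tt> is part of the string passed to the handler, so a subclass that cares about it has to strip it itself.</p>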
<span class=
"note"><b class=
"label">Note:
</b>
The
<tt class=
"class">HTMLParser
</tt> class uses the SGML syntactic rules for
processing instructions. An XHTML processing instruction using the
trailing
"<tt class="character
">?</tt>" will cause the
"<tt class="character
">?</tt>" to be included in
<var>data</var>.
</span>
<p><br /></p><hr class='online-navigation'
/>
<div class='online-navigation'
>
<!--Table of Child-Links-->
<A NAME=
"CHILD_LINKS"><STRONG>Subsections
</STRONG></a>
<LI><A href=
"htmlparser-example.html">13.1.1 Example HTML Parser Application
</a>
<!--End of Table of Child-Links-->
<span class=
"release-info">Release
2.4.2, documentation updated on
28 September
2005.
</span>
<!--End of Navigation Panel-->
See
<i><a href=
"about.html">About this document...
</a></i> for information on suggesting changes.