<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<link rel="STYLESHEET" href="lib.css" type='text/css' />
<link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" />
<link rel='start' href='../index.html' title='Python Documentation Index' />
<link rel="first" href="lib.html" title='Python Library Reference' />
<link rel='contents' href='contents.html' title="Contents" />
<link rel='index' href='genindex.html' title='Index' />
<link rel='last' href='about.html' title='About this document...' />
<link rel='help' href='about.html' title='About this document...' />
<link rel="next" href="module-sgmllib.html" />
<link rel="prev" href="markup.html" />
<link rel="parent" href="markup.html" />
<link rel="next" href="htmlparser-example.html" />
<meta name='aesop' content='information' />
<title>13.1 HTMLParser -- Simple HTML and XHTML parser</title>
</head>
<body>
<DIV CLASS="navigation">
<div id='top-navigation-panel' xml:id='top-navigation-panel'>
<table align="center" width="100%" cellpadding="0" cellspacing="2">
<tr>
<td class='online-navigation'><a rel="prev" title="13. Structured Markup Processing"
href="markup.html"><img src='../icons/previous.png'
border='0' height='32' alt='Previous Page' width='32' /></A></td>
<td class='online-navigation'><a rel="parent" title="13. Structured Markup Processing"
href="markup.html"><img src='../icons/up.png'
border='0' height='32' alt='Up One Level' width='32' /></A></td>
<td class='online-navigation'><a rel="next" title="13.1.1 Example HTML Parser"
href="htmlparser-example.html"><img src='../icons/next.png'
border='0' height='32' alt='Next Page' width='32' /></A></td>
<td align="center" width="100%">Python Library Reference</td>
<td class='online-navigation'><a rel="contents" title="Table of Contents"
href="contents.html"><img src='../icons/contents.png'
border='0' height='32' alt='Contents' width='32' /></A></td>
<td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png'
border='0' height='32' alt='Module Index' width='32' /></a></td>
<td class='online-navigation'><a rel="index" title="Index"
href="genindex.html"><img src='../icons/index.png'
border='0' height='32' alt='Index' width='32' /></A></td>
</tr></table>
<div class='online-navigation'>
<b class="navlabel">Previous:</b>
<a class="sectref" rel="prev" href="markup.html">13. Structured Markup Processing</A>
<b class="navlabel">Up:</b>
<a class="sectref" rel="parent" href="markup.html">13. Structured Markup Processing</A>
<b class="navlabel">Next:</b>
<a class="sectref" rel="next" href="htmlparser-example.html">13.1.1 Example HTML Parser</A>
</div>
<hr /></div>
</DIV>
<!--End of Navigation Panel-->
<H1><A NAME="SECTION0015100000000000000000">
13.1 <tt class="module">HTMLParser</tt> --
Simple HTML and XHTML parser</A>
</H1>
<P>
<A NAME="module-HTMLParser"></A>
<P>
<span class="versionnote">New in version 2.2.</span>
<P>
This module defines a class <tt class="class">HTMLParser</tt> which serves as the
basis for parsing text files formatted in HTML<a id='l2h-4257' xml:id='l2h-4257'></a> (HyperText
Mark-up Language) and XHTML.<a id='l2h-4258' xml:id='l2h-4258'></a> Unlike the parser in
<tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser is not based on the SGML parser in
<tt class="module"><a href="module-sgmllib.html">sgmllib</a></tt>.
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><span class="typelabel">class</span>&nbsp;<tt id='l2h-4241' xml:id='l2h-4241' class="class">HTMLParser</tt></b>(</nobr></td>
<td><var></var>)</td></tr></table></dt>
<dd>
The <tt class="class">HTMLParser</tt> class is instantiated without arguments.
<P>
An HTMLParser instance is fed HTML data and calls handler functions
when tags begin and end. The <tt class="class">HTMLParser</tt> class is meant to be
overridden by the user to provide a desired behavior.
<P>
Unlike the parser in <tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser does not check
that end tags match start tags or call the end-tag handler for
elements which are closed implicitly by closing an outer element.
</dl>
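<P>
As a minimal sketch of the behavior described above, the hypothetical
subclass below records tag events. The class name and the sample markup are
illustrative assumptions, and the fallback import is added only so that the
sketch also runs on Python 3, where the module is named <tt class="module">html.parser</tt>.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name, as documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 module name


class EventLogger(HTMLParser):
    """Hypothetical subclass that records start and end tag events."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = EventLogger()
# Neither <li> element has an explicit end tag in this input.
parser.feed("<ul><li>one<li>two</ul>")
parser.close()
print(parser.events)
```

Only the explicit <code>&lt;/ul&gt;</code> produces an end-tag event; no event is
synthesized for the implicitly closed <code>&lt;li&gt;</code> elements.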
<P>
An exception is defined as well:
<P>
<dl><dt><b><span class="typelabel">exception</span>&nbsp;<tt id='l2h-4242' xml:id='l2h-4242' class="exception">HTMLParseError</tt></b></dt>
<dd>
Exception raised by the <tt class="class">HTMLParser</tt> class when it encounters an
error while parsing. This exception provides three attributes:
<tt class="member">msg</tt> is a brief message explaining the error, <tt class="member">lineno</tt>
is the number of the line on which the broken construct was detected,
and <tt class="member">offset</tt> is the number of characters into the line at which
the construct starts.
</dd></dl>
<P>
<tt class="class">HTMLParser</tt> instances have the following methods:
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4243' xml:id='l2h-4243' class="method">reset</tt></b>(</nobr></td>
<td><var></var>)</td></tr></table></dt>
<dd>
Reset the instance, discarding all unprocessed data. This is called
implicitly at instantiation time.
</dl>
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4244' xml:id='l2h-4244' class="method">feed</tt></b>(</nobr></td>
<td><var>data</var>)</td></tr></table></dt>
<dd>
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or <tt class="method">close()</tt> is called.
</dl>
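<P>
The buffering of incomplete data can be illustrated with a short sketch.
The subclass name and the way the input is split are assumptions chosen for
illustration; the fallback import only lets the sketch run on Python 3 as well.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name, as documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 module name


class TagCollector(HTMLParser):
    """Hypothetical subclass that remembers the start tags it has seen."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


parser = TagCollector()
parser.feed("<p cla")                  # incomplete start tag: buffered, no handler call
seen_after_first = list(parser.tags)
parser.feed('ss="x">text')             # the rest of the tag arrives
seen_after_second = list(parser.tags)
parser.close()
```

After the first call no handler has run, because the start tag is still
incomplete; the second call completes the tag and triggers
<tt class="method">handle_starttag()</tt>.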
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4245' xml:id='l2h-4245' class="method">close</tt></b>(</nobr></td>
<td><var></var>)</td></tr></table></dt>
<dd>
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the <tt class="class">HTMLParser</tt> base class
method <tt class="method">close()</tt>.
</dl>
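<P>
A derived class that adds end-of-input processing might look like the
following sketch. The class name and the <code>text</code> attribute are
hypothetical; the point is only that the redefined method calls the base
class <tt class="method">close()</tt> first.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name, as documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 module name


class TextCollector(HTMLParser):
    """Hypothetical subclass that joins all character data at end of input."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.text = None

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Let the base class process any buffered data first.
        HTMLParser.close(self)
        # Additional end-of-input processing for this subclass.
        self.text = "".join(self.chunks)


parser = TextCollector()
parser.feed("<p>Hello, ")
parser.feed("world</p>")
parser.close()
print(parser.text)
```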
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4246' xml:id='l2h-4246' class="method">getpos</tt></b>(</nobr></td>
<td><var></var>)</td></tr></table></dt>
<dd>
Return the current line number and offset as a <code>(lineno, offset)</code> tuple.
</dl>
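<P>
The sketch below calls <tt class="method">getpos()</tt> from inside a handler to
note where each start tag was found. The subclass name is hypothetical. In
the CPython implementation, line numbers start at 1 and offsets at 0; treat
that as an observation about the implementation rather than a documented
guarantee.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name, as documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 module name


class PositionLogger(HTMLParser):
    """Hypothetical subclass that records the position of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports where the construct being handled begins.
        self.positions.append((tag, self.getpos()))


parser = PositionLogger()
parser.feed("<html>\n  <body>\n")
parser.close()
print(parser.positions)
```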
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4247' xml:id='l2h-4247' class="method">get_starttag_text</tt></b>(</nobr></td>
<td><var></var>)</td></tr></table></dt>
<dd>
Return the text of the most recently opened start tag. This should
not normally be needed for structured processing, but may be useful in
dealing with HTML ``as deployed'' or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
</dl>
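<P>
As an illustration of preserving the original spelling of a tag, the
hypothetical subclass below stores the raw start tag text alongside the
normalized tag name. The sample markup is an assumption.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name, as documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 module name


class RawTagLogger(HTMLParser):
    """Hypothetical subclass that keeps the source text of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased; get_starttag_text() returns the text as written.
        self.raw.append((tag, self.get_starttag_text()))


parser = RawTagLogger()
parser.feed('<A   HREF="x.html"  Class=big>link</A>')
parser.close()
print(parser.raw)
```

The stored text keeps the original case and the whitespace between
attributes, which the <var>tag</var> and <var>attrs</var> arguments do not.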
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4248' xml:id='l2h-4248' class="method">handle_starttag</tt></b>(</nobr></td>
<td><var>tag, attrs</var>)</td></tr></table></dt>
<dd>
This method is called to handle the start of a tag. It is intended to
be overridden by a derived class; the base class implementation does
nothing.
<P>
The <var>tag</var> argument is the name of the tag converted to
lower case. The <var>attrs</var> argument is a list of <code>(<var>name</var>,
<var>value</var>)</code> pairs containing the attributes found inside the tag's
<code>&lt;&gt;</code> brackets. The <var>name</var> is translated to lower case,
and double quotes and backslashes in the <var>value</var> are
interpreted. For instance, for the tag <code>&lt;A
HREF="http://www.cwi.nl/"&gt;</code>, this method would be called as
"<tt class="samp">handle_starttag('a', [('href', 'http://www.cwi.nl/')])</tt>".
</dl>
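<P>
A typical use of the <var>attrs</var> list is to pick out one attribute. The
following sketch collects link targets; the class name and the second anchor
in the sample input are illustrative assumptions.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name, as documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 module name


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href values of <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs with lower-cased names.
        for name, value in attrs:
            if name == "href":
                self.links.append(value)


parser = LinkCollector()
parser.feed('<A HREF="http://www.cwi.nl/">CWI</A> <a name="top">top</a>')
parser.close()
print(parser.links)
```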
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4249' xml:id='l2h-4249' class="method">handle_startendtag</tt></b>(</nobr></td>
<td><var>tag, attrs</var>)</td></tr></table></dt>
<dd>
Similar to <tt class="method">handle_starttag()</tt>, but called when the parser
encounters an XHTML-style empty tag (<code>&lt;a .../&gt;</code>). This method
may be overridden by subclasses which require this particular lexical
information; the default implementation simply calls
<tt class="method">handle_starttag()</tt> and <tt class="method">handle_endtag()</tt>.
</dl>
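<P>
The difference between the default behavior and an overriding subclass can
be sketched as follows. Both class names are hypothetical.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name, as documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 module name


class DefaultLogger(HTMLParser):
    """Hypothetical subclass relying on the default handle_startendtag()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class EmptyTagLogger(DefaultLogger):
    """Hypothetical subclass that records an empty tag as a single event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


default_parser = DefaultLogger()
default_parser.feed("<br />")
default_parser.close()

empty_parser = EmptyTagLogger()
empty_parser.feed("<br />")
empty_parser.close()

print(default_parser.events)
print(empty_parser.events)
```

With the default implementation the empty tag is reported as a start event
followed by an end event; the override sees it as one lexical unit.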
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4250' xml:id='l2h-4250' class="method">handle_endtag</tt></b>(</nobr></td>
<td><var>tag</var>)</td></tr></table></dt>
<dd>
This method is called to handle the end tag of an element. It is
intended to be overridden by a derived class; the base class
implementation does nothing. The <var>tag</var> argument is the name of
the tag converted to lower case.
</dl>
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4251' xml:id='l2h-4251' class="method">handle_data</tt></b>(</nobr></td>
<td><var>data</var>)</td></tr></table></dt>
<dd>
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
</dl>
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4252' xml:id='l2h-4252' class="method">handle_charref</tt></b>(</nobr></td>
<td><var>name</var>)</td></tr></table></dt>
<dd>
This method is called to process a character reference of the form
"<tt class="samp">&amp;#<var>name</var>;</tt>". It
is intended to be overridden by a derived class; the base class
implementation does nothing.
</dl>
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4253' xml:id='l2h-4253' class="method">handle_entityref</tt></b>(</nobr></td>
<td><var>name</var>)</td></tr></table></dt>
<dd>
This method is called to process a general entity reference of the
form "<tt class="samp">&amp;<var>name</var>;</tt>" where <var>name</var> is the name of a general
entity. It is intended to be overridden by a derived class; the
base class implementation does nothing.
</dl>
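<P>
The two reference handlers can be exercised together, as in the sketch
below. The class and the helper <code>make_ref_logger()</code> are hypothetical.
The helper reflects an assumption about newer interpreters: later Python 3
releases accept a <code>convert_charrefs</code> argument and, by default, convert
references to text instead of calling these handlers, so the sketch turns
that conversion off where the argument exists.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name, as documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 module name


class RefLogger(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self, **kwargs):
        HTMLParser.__init__(self, **kwargs)
        self.refs = []

    def handle_charref(self, name):
        self.refs.append(("charref", name))

    def handle_entityref(self, name):
        self.refs.append(("entityref", name))


def make_ref_logger():
    """Hypothetical helper: build a RefLogger on old and new interpreters."""
    try:
        # Newer Python 3: keep references unconverted so the handlers run.
        return RefLogger(convert_charrefs=False)
    except TypeError:
        # The class as documented here is instantiated without arguments.
        return RefLogger()


parser = make_ref_logger()
parser.feed("caf&#233; &amp; cr&egrave;me")
parser.close()
print(parser.refs)
```

Each handler receives only the text between the leading
"<tt class="samp">&amp;#</tt>" or "<tt class="samp">&amp;</tt>" and the closing semicolon.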
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4254' xml:id='l2h-4254' class="method">handle_comment</tt></b>(</nobr></td>
<td><var>data</var>)</td></tr></table></dt>
<dd>
This method is called when a comment is encountered. The
<var>data</var> argument is a string containing the text between the
"<tt class="samp">-&#45;</tt>" and "<tt class="samp">-&#45;</tt>" delimiters, but not the delimiters
themselves. For example, the comment "<tt class="samp">&lt;!-&#45;text-&#45;&gt;</tt>" will
cause this method to be called with the argument <code>'text'</code>. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
</dl>
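<P>
A short sketch of comment handling follows; the class name and the sample
input are assumptions.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name, as documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 module name


class CommentCollector(HTMLParser):
    """Hypothetical subclass that gathers the text of every comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        # data excludes the surrounding comment delimiters.
        self.comments.append(data)


parser = CommentCollector()
parser.feed("<p>visible</p><!--text--><!-- spaced -->")
parser.close()
print(parser.comments)
```

Whitespace inside the delimiters is passed through unchanged, so a caller
that wants trimmed text has to strip it.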
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4255' xml:id='l2h-4255' class="method">handle_decl</tt></b>(</nobr></td>
<td><var>decl</var>)</td></tr></table></dt>
<dd>
Method called when an SGML declaration is read by the parser. The
<var>decl</var> parameter will be the entire contents of the declaration
inside the <code>&lt;!</code>...<code>&gt;</code> markup. It is intended to be overridden
by a derived class; the base class implementation does nothing.
</dl>
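<P>
For a document type declaration, the handler receives everything between
<code>&lt;!</code> and <code>&gt;</code>. The sketch below uses a hypothetical
subclass and an HTML 4.0 Transitional declaration as sample input.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name, as documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 module name


class DeclCollector(HTMLParser):
    """Hypothetical subclass that records declarations."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.decls = []

    def handle_decl(self, decl):
        self.decls.append(decl)


parser = DeclCollector()
parser.feed('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><html>')
parser.close()
print(parser.decls)
```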
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
<td><nobr><b><tt id='l2h-4256' xml:id='l2h-4256' class="method">handle_pi</tt></b>(</nobr></td>
<td><var>data</var>)</td></tr></table></dt>
<dd>
Method called when a processing instruction is encountered. The
<var>data</var> parameter will contain the entire processing instruction.
For example, for the processing instruction <code>&lt;?proc color='red'&gt;</code>,
this method would be called as <code>handle_pi("proc color='red'")</code>. It
is intended to be overridden by a derived class; the base class
implementation does nothing.
<P>
<span class="note"><b class="label">Note:</b>
The <tt class="class">HTMLParser</tt> class uses the SGML syntactic rules for
processing instructions. An XHTML processing instruction using the
trailing "<tt class="character">?</tt>" will cause the "<tt class="character">?</tt>" to be included in
<var>data</var>.</span>
</dl>
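<P>
The effect described in the note can be seen by feeding both styles of
processing instruction to a hypothetical subclass. The stylesheet
instruction is an illustrative assumption.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name, as documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 module name


class PICollector(HTMLParser):
    """Hypothetical subclass that records processing instructions."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PICollector()
parser.feed("<?proc color='red'>")              # SGML-style instruction
parser.feed("<?xml-stylesheet href='a.css'?>")  # XHTML-style, trailing "?"
parser.close()
print(parser.instructions)
```

A caller that handles XHTML input may want to strip a trailing
"<tt class="character">?</tt>" from <var>data</var> itself.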
<P>
<p><br /></p><hr class='online-navigation' />
<div class='online-navigation'>
<!--Table of Child-Links-->
<A NAME="CHILD_LINKS"><STRONG>Subsections</STRONG></a>
<UL CLASS="ChildLinks">
<LI><A href="htmlparser-example.html">13.1.1 Example HTML Parser Application</a>
</ul>
<!--End of Table of Child-Links-->
</div>
<DIV CLASS="navigation">
<hr />
<span class="release-info">Release 2.4.2, documentation updated on 28 September 2005.</span>
</DIV>
<!--End of Navigation Panel-->
<ADDRESS>
See <i><a href="about.html">About this document...</a></i> for information on suggesting changes.
</ADDRESS>
</BODY>
</HTML>