| 1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
| 2 | <html> |
| 3 | <head> |
| 4 | <link rel="STYLESHEET" href="lib.css" type='text/css' /> |
| 5 | <link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" /> |
| 6 | <link rel='start' href='../index.html' title='Python Documentation Index' /> |
| 7 | <link rel="first" href="lib.html" title='Python Library Reference' /> |
| 8 | <link rel='contents' href='contents.html' title="Contents" /> |
| 9 | <link rel='index' href='genindex.html' title='Index' /> |
| 10 | <link rel='last' href='about.html' title='About this document...' /> |
| 11 | <link rel='help' href='about.html' title='About this document...' /> |
| 12 | <link rel="next" href="module-sgmllib.html" /> |
| 13 | <link rel="prev" href="markup.html" /> |
| 14 | <link rel="parent" href="markup.html" /> |
| 15 | <link rel="next" href="htmlparser-example.html" /> |
| 16 | <meta name='aesop' content='information' /> |
| 17 | <title>13.1 HTMLParser -- Simple HTML and XHTML parser</title> |
| 18 | </head> |
| 19 | <body> |
| 20 | <DIV CLASS="navigation"> |
| 21 | <div id='top-navigation-panel' xml:id='top-navigation-panel'> |
| 22 | <table align="center" width="100%" cellpadding="0" cellspacing="2"> |
| 23 | <tr> |
| 24 | <td class='online-navigation'><a rel="prev" title="13. Structured Markup Processing" |
| 25 | href="markup.html"><img src='../icons/previous.png' |
| 26 | border='0' height='32' alt='Previous Page' width='32' /></A></td> |
| 27 | <td class='online-navigation'><a rel="parent" title="13. Structured Markup Processing" |
| 28 | href="markup.html"><img src='../icons/up.png' |
| 29 | border='0' height='32' alt='Up One Level' width='32' /></A></td> |
| 30 | <td class='online-navigation'><a rel="next" title="13.1.1 Example HTML Parser" |
| 31 | href="htmlparser-example.html"><img src='../icons/next.png' |
| 32 | border='0' height='32' alt='Next Page' width='32' /></A></td> |
| 33 | <td align="center" width="100%">Python Library Reference</td> |
| 34 | <td class='online-navigation'><a rel="contents" title="Table of Contents" |
| 35 | href="contents.html"><img src='../icons/contents.png' |
| 36 | border='0' height='32' alt='Contents' width='32' /></A></td> |
| 37 | <td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png' |
| 38 | border='0' height='32' alt='Module Index' width='32' /></a></td> |
| 39 | <td class='online-navigation'><a rel="index" title="Index" |
| 40 | href="genindex.html"><img src='../icons/index.png' |
| 41 | border='0' height='32' alt='Index' width='32' /></A></td> |
| 42 | </tr></table> |
| 43 | <div class='online-navigation'> |
| 44 | <b class="navlabel">Previous:</b> |
| 45 | <a class="sectref" rel="prev" href="markup.html">13. Structured Markup Processing</A> |
| 46 | <b class="navlabel">Up:</b> |
| 47 | <a class="sectref" rel="parent" href="markup.html">13. Structured Markup Processing</A> |
| 48 | <b class="navlabel">Next:</b> |
| 49 | <a class="sectref" rel="next" href="htmlparser-example.html">13.1.1 Example HTML Parser</A> |
| 50 | </div> |
| 51 | <hr /></div> |
| 52 | </DIV> |
| 53 | <!--End of Navigation Panel--> |
| 54 | |
| 55 | <H1><A NAME="SECTION0015100000000000000000"> |
| 56 | 13.1 <tt class="module">HTMLParser</tt> -- |
| 57 | Simple HTML and XHTML parser</A> |
| 58 | </H1> |
| 59 | |
| 60 | <P> |
| 61 | <A NAME="module-HTMLParser"></A> |
| 62 | |
| 63 | <P> |
| 64 | |
| 65 | <span class="versionnote">New in version 2.2.</span> |
| 66 | |
| 67 | <P> |
| 68 | This module defines a class <tt class="class">HTMLParser</tt> which serves as the |
| 69 | basis for parsing text files formatted in HTML<a id='l2h-4257' xml:id='l2h-4257'></a> (HyperText |
| 70 | Mark-up Language) and XHTML.<a id='l2h-4258' xml:id='l2h-4258'></a> Unlike the parser in |
| 71 | <tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser is not based on the SGML parser in |
| 72 | <tt class="module"><a href="module-sgmllib.html">sgmllib</a></tt>. |
| 73 | |
| 74 | <P> |
| 75 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 76 | <td><nobr><b><span class="typelabel">class</span> <tt id='l2h-4241' xml:id='l2h-4241' class="class">HTMLParser</tt></b>(</nobr></td> |
| 77 | <td><var></var>)</td></tr></table></dt> |
| 78 | <dd> |
| 79 | The <tt class="class">HTMLParser</tt> class is instantiated without arguments. |
| 80 | |
| 81 | <P> |
| 82 | An HTMLParser instance is fed HTML data and calls handler functions |
| 83 | when tags begin and end. The <tt class="class">HTMLParser</tt> class is meant to be |
| 84 | overridden by the user to provide a desired behavior. |
| 85 | |
| 86 | <P> |
| 87 | Unlike the parser in <tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser does not check |
| 88 | that end tags match start tags or call the end-tag handler for |
| 89 | elements which are closed implicitly by closing an outer element. |
| 90 | </dl> |
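<P>
The behaviour described above can be sketched as follows. This is only an illustration: the <tt class="class">EventLogger</tt> class and the <tt class="function">log_events()</tt> helper are hypothetical names, and the fallback import assumes that on Python 3 the same class is available from <tt class="module">html.parser</tt>.

```python
try:
    from HTMLParser import HTMLParser   # module name documented here (Python 2)
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location of the class


class EventLogger(HTMLParser):
    """Hypothetical subclass that records each handler call."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(html):
    """Feed one string to a fresh parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


# The <li> elements are closed only implicitly by </ul>, so the parser
# reports no end-tag event for them.
events = log_events("<ul><li>one<li>two</ul>")
```

The recorded events contain an end tag for <code>ul</code> only, which is the lack of end-tag matching that the paragraph above describes.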
| 91 | |
| 92 | <P> |
| 93 | An exception is defined as well: |
| 94 | |
| 95 | <P> |
| 96 | <dl><dt><b><span class="typelabel">exception</span> <tt id='l2h-4242' xml:id='l2h-4242' class="exception">HTMLParseError</tt></b></dt> |
| 97 | <dd> |
| 98 | Exception raised by the <tt class="class">HTMLParser</tt> class when it encounters an |
| 99 | error while parsing. This exception provides three attributes: |
| 100 | <tt class="member">msg</tt> is a brief message explaining the error, <tt class="member">lineno</tt> |
| 101 | is the number of the line on which the broken construct was detected, |
| 102 | and <tt class="member">offset</tt> is the number of characters into the line at which |
| 103 | the construct starts. |
| 104 | </dd></dl> |
| 105 | |
| 106 | <P> |
| 107 | <tt class="class">HTMLParser</tt> instances have the following methods: |
| 108 | |
| 109 | <P> |
| 110 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 111 | <td><nobr><b><tt id='l2h-4243' xml:id='l2h-4243' class="method">reset</tt></b>(</nobr></td> |
| 112 | <td><var></var>)</td></tr></table></dt> |
| 113 | <dd> |
| 114 | Reset the instance. Loses all unprocessed data. This is called |
| 115 | implicitly at instantiation time. |
| 116 | </dl> |
| 117 | |
| 118 | <P> |
| 119 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 120 | <td><nobr><b><tt id='l2h-4244' xml:id='l2h-4244' class="method">feed</tt></b>(</nobr></td> |
| 121 | <td><var>data</var>)</td></tr></table></dt> |
| 122 | <dd> |
| 123 | Feed some text to the parser. It is processed insofar as it consists |
| 124 | of complete elements; incomplete data is buffered until more data is |
| 125 | fed or <tt class="method">close()</tt> is called. |
| 126 | </dl> |
| 127 | |
| 128 | <P> |
| 129 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 130 | <td><nobr><b><tt id='l2h-4245' xml:id='l2h-4245' class="method">close</tt></b>(</nobr></td> |
| 131 | <td><var></var>)</td></tr></table></dt> |
| 132 | <dd> |
| 133 | Force processing of all buffered data as if it were followed by an |
| 134 | end-of-file mark. This method may be redefined by a derived class to |
| 135 | define additional processing at the end of the input, but the |
| 136 | redefined version should always call the <tt class="class">HTMLParser</tt> base class |
| 137 | method <tt class="method">close()</tt>. |
| 138 | </dl> |
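<P>
A small sketch of how <tt class="method">feed()</tt> buffering and a redefined <tt class="method">close()</tt> might fit together. The <tt class="class">TagCollector</tt> class and its <tt class="member">finished</tt> attribute are hypothetical, and the fallback import assumes the Python 3 module name <tt class="module">html.parser</tt>.

```python
try:
    from HTMLParser import HTMLParser   # module name documented here (Python 2)
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location of the class


class TagCollector(HTMLParser):
    """Hypothetical subclass that collects start tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []
        self.finished = False

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))

    def close(self):
        # A redefined close() should always call the base class method.
        HTMLParser.close(self)
        self.finished = True


parser = TagCollector()
parser.feed('<a hre')                 # incomplete start tag: buffered, no handler call
tags_after_first_feed = list(parser.tags)
parser.feed('f="x.html">')            # the rest of the tag arrives
parser.close()
tags_after_close = list(parser.tags)
```

Splitting the input inside a tag makes the buffering visible: nothing is reported until the closing <code>&gt;</code> has been fed.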
| 139 | |
| 140 | <P> |
| 141 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 142 | <td><nobr><b><tt id='l2h-4246' xml:id='l2h-4246' class="method">getpos</tt></b>(</nobr></td> |
| 143 | <td><var></var>)</td></tr></table></dt> |
| 144 | <dd> |
| 145 | Return the current line number and offset as a <code>(<var>lineno</var>, <var>offset</var>)</code> tuple. |
| 146 | </dl> |
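<P>
As an illustration, <tt class="method">getpos()</tt> can be called from inside a handler to record where each construct starts. The <tt class="class">PositionLogger</tt> class is hypothetical, and the fallback import assumes the Python 3 module name <tt class="module">html.parser</tt>.

```python
try:
    from HTMLParser import HTMLParser   # module name documented here (Python 2)
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location of the class


class PositionLogger(HTMLParser):
    """Hypothetical subclass that notes the position of every start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() is queried while the tag is being handled.
        self.positions.append((tag, self.getpos()))


parser = PositionLogger()
parser.feed("<html>\n  <body>")
parser.close()
positions = parser.positions
```

In this sketch the line number counts from 1 and the offset counts from 0, so the indented <code>body</code> tag is reported at line 2, offset 2.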
| 147 | |
| 148 | <P> |
| 149 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 150 | <td><nobr><b><tt id='l2h-4247' xml:id='l2h-4247' class="method">get_starttag_text</tt></b>(</nobr></td> |
| 151 | <td><var></var>)</td></tr></table></dt> |
| 152 | <dd> |
| 153 | Return the text of the most recently opened start tag. This should |
| 154 | not normally be needed for structured processing, but may be useful in |
| 155 | dealing with HTML ``as deployed'' or for re-generating input with |
| 156 | minimal changes (whitespace between attributes can be preserved, |
| 157 | etc.). |
| 158 | </dl> |
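<P>
A minimal sketch of retrieving the original tag text from within a handler. The <tt class="class">RawTagLogger</tt> class is a hypothetical name, and the fallback import assumes the Python 3 module name <tt class="module">html.parser</tt>.

```python
try:
    from HTMLParser import HTMLParser   # module name documented here (Python 2)
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location of the class


class RawTagLogger(HTMLParser):
    """Hypothetical subclass that keeps the source text of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw = []

    def handle_starttag(self, tag, attrs):
        # tag and attrs are normalized; the raw text keeps case and spacing.
        self.raw.append(self.get_starttag_text())


parser = RawTagLogger()
parser.feed('<A   HREF="index.html">')
parser.close()
raw = parser.raw
```

The stored string keeps the upper-case names and the extra whitespace between attributes, which is what makes it usable for re-generating input with minimal changes.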
| 159 | |
| 160 | <P> |
| 161 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 162 | <td><nobr><b><tt id='l2h-4248' xml:id='l2h-4248' class="method">handle_starttag</tt></b>(</nobr></td> |
| 163 | <td><var>tag, attrs</var>)</td></tr></table></dt> |
| 164 | <dd> |
| 165 | This method is called to handle the start of a tag. It is intended to |
| 166 | be overridden by a derived class; the base class implementation does |
| 167 | nothing. |
| 168 | |
| 169 | <P> |
| 170 | The <var>tag</var> argument is the name of the tag converted to |
| 171 | lower case. The <var>attrs</var> argument is a list of <code>(<var>name</var>, |
| 172 | <var>value</var>)</code> pairs containing the attributes found inside the tag's |
| 173 | <code>&lt;&gt;</code> brackets. The <var>name</var> is translated to lower case, |
| 174 | and double quotes and backslashes in the <var>value</var> are |
| 175 | interpreted. For instance, for the tag <code>&lt;A |
| 176 | HREF="http://www.cwi.nl/"&gt;</code>, this method would be called as |
| 177 | "<tt class="samp">handle_starttag('a', [('href', 'http://www.cwi.nl/')])</tt>". |
| 178 | </dl> |
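<P>
The call described above can be observed with a short sketch. The <tt class="class">LinkCollector</tt> class is hypothetical, and the fallback import assumes the Python 3 module name <tt class="module">html.parser</tt>.

```python
try:
    from HTMLParser import HTMLParser   # module name documented here (Python 2)
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location of the class


class LinkCollector(HTMLParser):
    """Hypothetical subclass that records the arguments passed for <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.calls = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.calls.append((tag, attrs))
            # attrs is a list of (name, value) pairs, so dict() works on it.
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


parser = LinkCollector()
parser.feed('<A HREF="http://www.cwi.nl/">')
parser.close()
calls = parser.calls
hrefs = parser.hrefs
```

Converting <var>attrs</var> with <code>dict()</code> is a convenience chosen for this sketch; it keeps only the last value when an attribute name is repeated.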
| 179 | |
| 180 | <P> |
| 181 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 182 | <td><nobr><b><tt id='l2h-4249' xml:id='l2h-4249' class="method">handle_startendtag</tt></b>(</nobr></td> |
| 183 | <td><var>tag, attrs</var>)</td></tr></table></dt> |
| 184 | <dd> |
| 185 | Similar to <tt class="method">handle_starttag()</tt>, but called when the parser |
| 186 | encounters an XHTML-style empty tag (<code>&lt;a .../&gt;</code>). This method |
| 187 | may be overridden by subclasses which require this particular lexical |
| 188 | information; the default implementation simply calls |
| 189 | <tt class="method">handle_starttag()</tt> and <tt class="method">handle_endtag()</tt>. |
| 190 | </dl> |
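<P>
The difference between the default behaviour and an overriding subclass might look like this. Both classes are hypothetical, and the fallback import assumes the Python 3 module name <tt class="module">html.parser</tt>.

```python
try:
    from HTMLParser import HTMLParser   # module name documented here (Python 2)
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location of the class


class DefaultEmptyTags(HTMLParser):
    """Hypothetical subclass that leaves handle_startendtag() alone."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class ExplicitEmptyTags(DefaultEmptyTags):
    """Hypothetical subclass that wants to know about the empty-tag syntax."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


default_parser = DefaultEmptyTags()
default_parser.feed("<br/>")
default_parser.close()

explicit_parser = ExplicitEmptyTags()
explicit_parser.feed('<img src="x.png" />')
explicit_parser.close()
```

With the default implementation the empty tag shows up as a start event followed by an end event; the override replaces both with a single event.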
| 191 | |
| 192 | <P> |
| 193 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 194 | <td><nobr><b><tt id='l2h-4250' xml:id='l2h-4250' class="method">handle_endtag</tt></b>(</nobr></td> |
| 195 | <td><var>tag</var>)</td></tr></table></dt> |
| 196 | <dd> |
| 197 | This method is called to handle the end tag of an element. It is |
| 198 | intended to be overridden by a derived class; the base class |
| 199 | implementation does nothing. The <var>tag</var> argument is the name of |
| 200 | the tag converted to lower case. |
| 201 | </dl> |
| 202 | |
| 203 | <P> |
| 204 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 205 | <td><nobr><b><tt id='l2h-4251' xml:id='l2h-4251' class="method">handle_data</tt></b>(</nobr></td> |
| 206 | <td><var>data</var>)</td></tr></table></dt> |
| 207 | <dd> |
| 208 | This method is called to process arbitrary data. It is intended to be |
| 209 | overridden by a derived class; the base class implementation does |
| 210 | nothing. |
| 211 | </dl> |
| 212 | |
| 213 | <P> |
| 214 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 215 | <td><nobr><b><tt id='l2h-4252' xml:id='l2h-4252' class="method">handle_charref</tt></b>(</nobr></td> |
| 216 | <td><var>name</var>)</td></tr></table></dt> |
| 217 | <dd> This method is called to |
| 218 | process a character reference of the form "<tt class="samp">&amp;#<var>ref</var>;</tt>". It |
| 219 | is intended to be overridden by a derived class; the base class |
| 220 | implementation does nothing. |
| 221 | </dl> |
| 222 | |
| 223 | <P> |
| 224 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 225 | <td><nobr><b><tt id='l2h-4253' xml:id='l2h-4253' class="method">handle_entityref</tt></b>(</nobr></td> |
| 226 | <td><var>name</var>)</td></tr></table></dt> |
| 227 | <dd> |
| 228 | This method is called to process a general entity reference of the |
| 229 | form "<tt class="samp">&amp;<var>name</var>;</tt>" where <var>name</var> is a general entity |
| 230 | reference. It is intended to be overridden by a derived class; the |
| 231 | base class implementation does nothing. |
| 232 | </dl> |
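<P>
A sketch covering both reference handlers. The <tt class="class">RefCollector</tt> class is hypothetical. Two assumptions apply on Python 3 only: the class is imported from <tt class="module">html.parser</tt>, and the <tt class="member">convert_charrefs</tt> attribute must be switched off because otherwise references are converted to text and passed to <tt class="method">handle_data()</tt> instead. On the version documented here the extra attribute has no effect.

```python
try:
    from HTMLParser import HTMLParser   # module name documented here (Python 2)
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location of the class


class RefCollector(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        HTMLParser.__init__(self)
        # Assumption: only needed on Python 3, where references are
        # otherwise folded into the text given to handle_data().
        self.convert_charrefs = False
        self.charrefs = []
        self.entityrefs = []

    def handle_charref(self, name):
        # name is the text between "&#" and ";".
        self.charrefs.append(name)

    def handle_entityref(self, name):
        # name is the text between "&" and ";".
        self.entityrefs.append(name)


parser = RefCollector()
parser.feed("a &#62; b &gt; c")
parser.close()
charrefs = parser.charrefs
entityrefs = parser.entityrefs
```

Neither handler receives the surrounding <code>&amp;</code>, <code>#</code> or <code>;</code> characters; translating the reference into a character is left to the subclass.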
| 233 | |
| 234 | <P> |
| 235 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 236 | <td><nobr><b><tt id='l2h-4254' xml:id='l2h-4254' class="method">handle_comment</tt></b>(</nobr></td> |
| 237 | <td><var>data</var>)</td></tr></table></dt> |
| 238 | <dd> |
| 239 | This method is called when a comment is encountered. The |
| 240 | <var>data</var> argument is a string containing the text between the |
| 241 | "<tt class="samp">--</tt>" and "<tt class="samp">--</tt>" delimiters, but not the delimiters |
| 242 | themselves. For example, the comment "<tt class="samp">&lt;!--text--&gt;</tt>" will |
| 243 | cause this method to be called with the argument <code>'text'</code>. It is |
| 244 | intended to be overridden by a derived class; the base class |
| 245 | implementation does nothing. |
| 246 | </dl> |
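<P>
The example from the description can be reproduced with a short sketch. The <tt class="class">CommentCollector</tt> class is hypothetical, and the fallback import assumes the Python 3 module name <tt class="module">html.parser</tt>.

```python
try:
    from HTMLParser import HTMLParser   # module name documented here (Python 2)
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location of the class


class CommentCollector(HTMLParser):
    """Hypothetical subclass that gathers the text of every comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        # data excludes the comment delimiters.
        self.comments.append(data)


parser = CommentCollector()
parser.feed("<p>before<!--text-->after</p>")
parser.close()
comments = parser.comments
```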
| 247 | |
| 248 | <P> |
| 249 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 250 | <td><nobr><b><tt id='l2h-4255' xml:id='l2h-4255' class="method">handle_decl</tt></b>(</nobr></td> |
| 251 | <td><var>decl</var>)</td></tr></table></dt> |
| 252 | <dd> |
| 253 | Method called when an SGML declaration is read by the parser. The |
| 254 | <var>decl</var> parameter will be the entire contents of the declaration |
| 255 | inside the <code>&lt;!</code>...<code>&gt;</code> markup. It is intended to be overridden |
| 256 | by a derived class; the base class implementation does nothing. |
| 257 | </dl> |
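<P>
A minimal sketch using a document type declaration as the input. The <tt class="class">DeclCollector</tt> class is hypothetical, and the fallback import assumes the Python 3 module name <tt class="module">html.parser</tt>.

```python
try:
    from HTMLParser import HTMLParser   # module name documented here (Python 2)
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location of the class


class DeclCollector(HTMLParser):
    """Hypothetical subclass that records declarations."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.decls = []

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">".
        self.decls.append(decl)


parser = DeclCollector()
parser.feed("<!DOCTYPE html><html></html>")
parser.close()
decls = parser.decls
```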
| 258 | |
| 259 | <P> |
| 260 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 261 | <td><nobr><b><tt id='l2h-4256' xml:id='l2h-4256' class="method">handle_pi</tt></b>(</nobr></td> |
| 262 | <td><var>data</var>)</td></tr></table></dt> |
| 263 | <dd> |
| 264 | Method called when a processing instruction is encountered. The |
| 265 | <var>data</var> parameter will contain the entire processing instruction. |
| 266 | For example, for the processing instruction <code>&lt;?proc color='red'&gt;</code>, |
| 267 | this method would be called as <code>handle_pi("proc color='red'")</code>. It |
| 268 | is intended to be overridden by a derived class; the base class |
| 269 | implementation does nothing. |
| 270 | |
| 271 | <P> |
| 272 | <span class="note"><b class="label">Note:</b> |
| 273 | The <tt class="class">HTMLParser</tt> class uses the SGML syntactic rules for |
| 274 | processing instructions. An XHTML processing instruction using the |
| 275 | trailing "<tt class="character">?</tt>" will cause the "<tt class="character">?</tt>" to be included in |
| 276 | <var>data</var>.</span> |
| 277 | </dl> |
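<P>
Both the example above and the effect mentioned in the note can be seen in a short sketch. The <tt class="class">PICollector</tt> class is hypothetical, and the fallback import assumes the Python 3 module name <tt class="module">html.parser</tt>.

```python
try:
    from HTMLParser import HTMLParser   # module name documented here (Python 2)
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location of the class


class PICollector(HTMLParser):
    """Hypothetical subclass that records processing instructions."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PICollector()
parser.feed("<?proc color='red'>")        # SGML-style processing instruction
parser.feed('<?xml version="1.0"?>')      # XHTML-style, with a trailing "?"
parser.close()
instructions = parser.instructions
```

A subclass that expects XHTML input may want to strip the trailing <code>?</code> from <var>data</var> itself.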
| 278 | |
| 279 | <P> |
| 280 | |
| 281 | <p><br /></p><hr class='online-navigation' /> |
| 282 | <div class='online-navigation'> |
| 283 | <!--Table of Child-Links--> |
| 284 | <A NAME="CHILD_LINKS"><STRONG>Subsections</STRONG></a> |
| 285 | |
| 286 | <UL CLASS="ChildLinks"> |
| 287 | <LI><A href="htmlparser-example.html">13.1.1 Example HTML Parser Application</a> |
| 288 | </ul> |
| 289 | <!--End of Table of Child-Links--> |
| 290 | </div> |
| 291 | |
| 292 | <DIV CLASS="navigation"> |
| 325 | <hr /> |
| 326 | <span class="release-info">Release 2.4.2, documentation updated on 28 September 2005.</span> |
| 327 | </DIV> |
| 328 | <!--End of Navigation Panel--> |
| 329 | <ADDRESS> |
| 330 | See <i><a href="about.html">About this document...</a></i> for information on suggesting changes. |
| 331 | </ADDRESS> |
| 332 | </BODY> |
| 333 | </HTML> |