| 1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
| 2 | <html> |
| 3 | <head> |
| 4 | <link rel="STYLESHEET" href="lib.css" type='text/css' /> |
| 5 | <link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" /> |
| 6 | <link rel='start' href='../index.html' title='Python Documentation Index' /> |
| 7 | <link rel="first" href="lib.html" title='Python Library Reference' /> |
| 8 | <link rel='contents' href='contents.html' title="Contents" /> |
| 9 | <link rel='index' href='genindex.html' title='Index' /> |
| 10 | <link rel='last' href='about.html' title='About this document...' /> |
| 11 | <link rel='help' href='about.html' title='About this document...' /> |
| 12 | <link rel="next" href="module-csv.html" /> |
| 13 | <link rel="prev" href="module-netrc.html" /> |
| 14 | <link rel="parent" href="netdata.html" /> |
| 16 | <meta name='aesop' content='information' /> |
| 17 | <title>12.19 robotparser -- Parser for robots.txt</title> |
| 18 | </head> |
| 19 | <body> |
| 20 | <DIV CLASS="navigation"> |
| 21 | <div id='top-navigation-panel' xml:id='top-navigation-panel'> |
| 22 | <table align="center" width="100%" cellpadding="0" cellspacing="2"> |
| 23 | <tr> |
| 24 | <td class='online-navigation'><a rel="prev" title="12.18.1 netrc Objects" |
| 25 | href="netrc-objects.html"><img src='../icons/previous.png' |
| 26 | border='0' height='32' alt='Previous Page' width='32' /></A></td> |
| 27 | <td class='online-navigation'><a rel="parent" title="12. Internet Data Handling" |
| 28 | href="netdata.html"><img src='../icons/up.png' |
| 29 | border='0' height='32' alt='Up One Level' width='32' /></A></td> |
| 30 | <td class='online-navigation'><a rel="next" title="12.20 csv " |
| 31 | href="module-csv.html"><img src='../icons/next.png' |
| 32 | border='0' height='32' alt='Next Page' width='32' /></A></td> |
| 33 | <td align="center" width="100%">Python Library Reference</td> |
| 34 | <td class='online-navigation'><a rel="contents" title="Table of Contents" |
| 35 | href="contents.html"><img src='../icons/contents.png' |
| 36 | border='0' height='32' alt='Contents' width='32' /></A></td> |
| 37 | <td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png' |
| 38 | border='0' height='32' alt='Module Index' width='32' /></a></td> |
| 39 | <td class='online-navigation'><a rel="index" title="Index" |
| 40 | href="genindex.html"><img src='../icons/index.png' |
| 41 | border='0' height='32' alt='Index' width='32' /></A></td> |
| 42 | </tr></table> |
| 43 | <div class='online-navigation'> |
| 44 | <b class="navlabel">Previous:</b> |
| 45 | <a class="sectref" rel="prev" href="netrc-objects.html">12.18.1 netrc Objects</A> |
| 46 | <b class="navlabel">Up:</b> |
| 47 | <a class="sectref" rel="parent" href="netdata.html">12. Internet Data Handling</A> |
| 48 | <b class="navlabel">Next:</b> |
| 49 | <a class="sectref" rel="next" href="module-csv.html">12.20 csv </A> |
| 50 | </div> |
| 51 | <hr /></div> |
| 52 | </DIV> |
| 53 | <!--End of Navigation Panel--> |
| 54 | |
| 55 | <H1><A NAME="SECTION00141900000000000000000"> |
| 56 | 12.19 <tt class="module">robotparser</tt> -- |
| 57 | Parser for robots.txt</A> |
| 58 | </H1> |
| 59 | |
| 60 | <P> |
| 61 | <A NAME="module-robotparser"></A> |
| 62 | |
| 63 | <P> |
| 64 | <a id='l2h-4209' xml:id='l2h-4209'></a> |
| 65 | |
| 66 | <P> |
| 67 | This module provides a single class, <tt class="class">RobotFileParser</tt>, which answers |
| 68 | questions about whether or not a particular user agent can fetch a URL on |
| 69 | the Web site that published the <span class="file">robots.txt</span> file. For more details on |
| 70 | the structure of <span class="file">robots.txt</span> files, see |
| 71 | <a class="url" href="http://www.robotstxt.org/wc/norobots.html">http://www.robotstxt.org/wc/norobots.html</a>. |
| 72 | |
| 73 | <P> |
| 74 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 75 | <td><nobr><b><span class="typelabel">class</span> <tt id='l2h-4202' xml:id='l2h-4202' class="class">RobotFileParser</tt></b>(</nobr></td> |
| 76 | <td><var></var>)</td></tr></table></dt> |
| 77 | <dd> |
| 78 | |
| 79 | <P> |
| 80 | This class provides a set of methods to read, parse and answer questions |
| 81 | about a single <span class="file">robots.txt</span> file. |
| 82 | |
| 83 | <P> |
| 84 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 85 | <td><nobr><b><tt id='l2h-4203' xml:id='l2h-4203' class="method">set_url</tt></b>(</nobr></td> |
| 86 | <td><var>url</var>)</td></tr></table></dt> |
| 87 | <dd> |
| 88 | Sets the URL referring to a <span class="file">robots.txt</span> file. |
| 89 | </dl> |
| 90 | |
| 91 | <P> |
| 92 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 93 | <td><nobr><b><tt id='l2h-4204' xml:id='l2h-4204' class="method">read</tt></b>(</nobr></td> |
| 94 | <td><var></var>)</td></tr></table></dt> |
| 95 | <dd> |
| 96 | Reads the <span class="file">robots.txt</span> URL and feeds it to the parser. |
| 97 | </dl> |
| 98 | |
| 99 | <P> |
| 100 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 101 | <td><nobr><b><tt id='l2h-4205' xml:id='l2h-4205' class="method">parse</tt></b>(</nobr></td> |
| 102 | <td><var>lines</var>)</td></tr></table></dt> |
| 103 | <dd> |
| 104 | Parses <var>lines</var>, a sequence of lines from a <span class="file">robots.txt</span> file. |
| 105 | </dl> |
| 106 | |
| 107 | <P> |
| 108 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 109 | <td><nobr><b><tt id='l2h-4206' xml:id='l2h-4206' class="method">can_fetch</tt></b>(</nobr></td> |
| 110 | <td><var>useragent, url</var>)</td></tr></table></dt> |
| 111 | <dd> |
| 112 | Returns <code>True</code> if the <var>useragent</var> is allowed to fetch the <var>url</var> |
| 113 | according to the rules contained in the parsed <span class="file">robots.txt</span> file. |
| 114 | </dl> |
| 115 | |
| 116 | <P> |
| 117 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 118 | <td><nobr><b><tt id='l2h-4207' xml:id='l2h-4207' class="method">mtime</tt></b>(</nobr></td> |
| 119 | <td><var></var>)</td></tr></table></dt> |
| 120 | <dd> |
| 121 | Returns the time the <code>robots.txt</code> file was last fetched, as recorded by |
| 122 | <tt class="method">modified()</tt>. This is useful for long-running web spiders that need to |
| 123 | check for new <code>robots.txt</code> files periodically. |
| 124 | </dl> |
| 125 | |
| 126 | <P> |
| 127 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> |
| 128 | <td><nobr><b><tt id='l2h-4208' xml:id='l2h-4208' class="method">modified</tt></b>(</nobr></td> |
| 129 | <td><var></var>)</td></tr></table></dt> |
| 130 | <dd> |
| 131 | Sets the time the <code>robots.txt</code> file was last fetched to the current |
| 132 | time. |
| 133 | </dl> |
| 134 | |
| 135 | <P> |
| 136 | </dl> |
| 137 | |
| 138 | <P> |
| 139 | The following example demonstrates basic use of the <tt class="class">RobotFileParser</tt> class. |
| 140 | |
| 141 | <P> |
| 142 | <div class="verbatim"><pre> |
| 143 | >>> import robotparser |
| 144 | >>> rp = robotparser.RobotFileParser() |
| 145 | >>> rp.set_url("http://www.musi-cal.com/robots.txt") |
| 146 | >>> rp.read() |
| 147 | >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco") |
| 148 | False |
| 149 | >>> rp.can_fetch("*", "http://www.musi-cal.com/") |
| 150 | True |
| 151 | </pre></div> |
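<P>
The rules can also be supplied directly through <tt class="method">parse()</tt>, which avoids any network access. The following is a minimal sketch under stated assumptions, not a definitive recipe: the <span class="file">robots.txt</span> content and the <code>www.example.com</code> URLs are made up for illustration, and the fallback import assumes a Python 3 interpreter, where this module lives at <code>urllib.robotparser</code>.

```python
try:
    import robotparser                  # module name documented in this section
except ImportError:
    from urllib import robotparser      # assumption: Python 3, where the module moved

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
# parse() takes a sequence of lines, so split the text first.
rp.parse(ROBOTS_TXT.splitlines())

# The site root matches no Disallow rule; the search URL falls under /cgi-bin/.
allowed_home = rp.can_fetch("*", "http://www.example.com/")
allowed_search = rp.can_fetch(
    "*", "http://www.example.com/cgi-bin/search?city=San+Francisco")

print(allowed_home)
print(allowed_search)
```

<P>
Feeding the parser from a string in this way is mainly useful for testing a set of rules before publishing them, or when the file has already been downloaded by other means.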
| 152 | |
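<P>
For the long-running spiders mentioned under <tt class="method">mtime()</tt>, one possible way to decide when to re-read the file is sketched below. The helper <code>needs_refresh</code> and the constant <code>MAX_AGE_SECONDS</code> are hypothetical names introduced for this example and are not part of the module. The sketch relies on <tt class="method">mtime()</tt> returning 0 until <tt class="method">modified()</tt> has been called, and on the Python 3 fallback import described in the comment.

```python
import time

try:
    import robotparser                  # module name documented in this section
except ImportError:
    from urllib import robotparser      # assumption: Python 3, where the module moved

MAX_AGE_SECONDS = 3600  # hypothetical refresh interval chosen for illustration


def needs_refresh(rp, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if rp has no recorded fetch time or it is older than max_age."""
    if now is None:
        now = time.time()
    # mtime() is 0 for a parser on which modified() was never called,
    # so a brand-new parser always counts as stale.
    return (now - rp.mtime()) > max_age


rp = robotparser.RobotFileParser()
stale_at_start = needs_refresh(rp)

# A real spider would call rp.set_url(...) and rp.read() here; this sketch
# skips the network and only records the fetch time.
rp.modified()
stale_after_mark = needs_refresh(rp)

print(stale_at_start)
print(stale_after_mark)
```

<P>
Passing <var>now</var> explicitly keeps the helper easy to exercise without waiting for the interval to elapse.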
| 153 | <DIV CLASS="navigation"> |
| 186 | <hr /> |
| 187 | <span class="release-info">Release 2.4.2, documentation updated on 28 September 2005.</span> |
| 188 | </DIV> |
| 189 | <!--End of Navigation Panel--> |
| 190 | <ADDRESS> |
| 191 | See <i><a href="about.html">About this document...</a></i> for information on suggesting changes. |
| 192 | </ADDRESS> |
| 193 | </BODY> |
| 194 | </HTML> |