<!DOCTYPE html PUBLIC
"-//W3C//DTD HTML 4.0 Transitional//EN">
<link rel=
"STYLESHEET" href=
"lib.css" type='text/css'
/>
<link rel=
"SHORTCUT ICON" href=
"../icons/pyfav.png" type=
"image/png" />
<link rel='start' href='../index.html' title='Python Documentation Index'
/>
<link rel=
"first" href=
"lib.html" title='Python Library Reference'
/>
<link rel='contents' href='contents.html'
title=
"Contents" />
<link rel='index' href='genindex.html' title='Index'
/>
<link rel='last' href='about.html' title='About this document...'
/>
<link rel='help' href='about.html' title='About this document...'
/>
<link rel=
"next" href=
"module-csv.html" />
<link rel=
"prev" href=
"module-netrc.html" />
<link rel=
"parent" href=
"netdata.html" />
<meta name='aesop' content='information'
/>
<title>12.19 robotparser -- Parser for robots.txt
</title>
<div id='top-navigation-panel' xml:id='top-navigation-panel'
>
<table align=
"center" width=
"100%" cellpadding=
"0" cellspacing=
"2">
<td class='online-navigation'
><a rel=
"prev" title=
"12.18.1 netrc Objects"
href=
"netrc-objects.html"><img src='../icons/previous.png'
border='
0' height='
32' alt='Previous Page' width='
32'
/></A></td>
<td class='online-navigation'
><a rel=
"parent" title=
"12. Internet Data Handling"
href=
"netdata.html"><img src='../icons/up.png'
border='
0' height='
32' alt='Up One Level' width='
32'
/></A></td>
<td class='online-navigation'
><a rel=
"next" title=
"12.20 csv "
href=
"module-csv.html"><img src='../icons/next.png'
border='
0' height='
32' alt='Next Page' width='
32'
/></A></td>
<td align=
"center" width=
"100%">Python Library Reference
</td>
<td class='online-navigation'
><a rel=
"contents" title=
"Table of Contents"
href=
"contents.html"><img src='../icons/contents.png'
border='
0' height='
32' alt='Contents' width='
32'
/></A></td>
<td class='online-navigation'
><a href=
"modindex.html" title=
"Module Index"><img src='../icons/modules.png'
border='
0' height='
32' alt='Module Index' width='
32'
/></a></td>
<td class='online-navigation'
><a rel=
"index" title=
"Index"
href=
"genindex.html"><img src='../icons/index.png'
border='
0' height='
32' alt='Index' width='
32'
/></A></td>
<div class='online-navigation'
>
<b class=
"navlabel">Previous:
</b>
<a class=
"sectref" rel=
"prev" href=
"netrc-objects.html">12.18.1 netrc Objects
</A>
<b class=
"navlabel">Up:
</b>
<a class=
"sectref" rel=
"parent" href=
"netdata.html">12. Internet Data Handling
</A>
<b class=
"navlabel">Next:
</b>
<a class=
"sectref" rel=
"next" href=
"module-csv.html">12.20 csv
</A>
<!--End of Navigation Panel-->
<H1><A NAME=
"SECTION00141900000000000000000">
12.19 <tt class=
"module">robotparser
</tt> --
Parser for robots.txt
</A></H1>
<A NAME=
"module-robotparser"></A>
<a id='l2h-
4209' xml:id='l2h-
4209'
></a>
This module provides a single class,
<tt class=
"class">RobotFileParser
</tt>, which answers
questions about whether or not a particular user agent can fetch a URL on
the Web site that published the
<span class=
"file">robots.txt
</span> file. For more details on
the structure of
<span class=
"file">robots.txt
</span> files, see
<a class=
"url" href=
"http://www.robotstxt.org/wc/norobots.html">http://www.robotstxt.org/wc/norobots.html
</a>.
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><span class=
"typelabel">class
</span> <tt id='l2h-
4202' xml:id='l2h-
4202'
class=
"class">RobotFileParser
</tt></b>(
</nobr></td>
<td><var></var>)
</td></tr></table></dt>
This class provides a set of methods to read, parse and answer questions
about a single
<span class=
"file">robots.txt
</span> file.
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4203' xml:id='l2h-
4203'
class=
"method">set_url
</tt></b>(
</nobr></td>
<td><var>url
</var>)
</td></tr></table></dt>
Sets the URL referring to a
<span class=
"file">robots.txt
</span> file.
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4204' xml:id='l2h-
4204'
class=
"method">read
</tt></b>(
</nobr></td>
<td><var></var>)
</td></tr></table></dt>
Reads the
<span class=
"file">robots.txt
</span> URL and feeds it to the parser.
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4205' xml:id='l2h-
4205'
class=
"method">parse
</tt></b>(
</nobr></td>
<td><var>lines
</var>)
</td></tr></table></dt>
Parses the <var>lines</var> argument, a sequence of lines taken from a
<span class="file">robots.txt</span> file.
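<p>As a minimal sketch, rules that are already in memory can be handed to
<tt class="method">parse</tt> directly, without fetching anything. The rule
lines and the <code>www.example.com</code> URLs below are hypothetical, and
the fallback import assumes that newer Python versions expose the same class
as <code>urllib.robotparser</code>.</p>

```python
try:
    import robotparser  # module name documented in this section
except ImportError:
    # Assumption: on Python 3 the same class lives in urllib.robotparser.
    from urllib import robotparser

# Hypothetical rules, supplied as a list of lines instead of a fetched file.
lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)

# Query the parsed rules for two hypothetical URLs.
allowed_public = rp.can_fetch("*", "http://www.example.com/index.html")
allowed_private = rp.can_fetch("*", "http://www.example.com/private/data.html")
```

<p>This is convenient when the file was obtained some other way, for example
from a cache, and only the rule evaluation is needed.</p>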
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4206' xml:id='l2h-
4206'
class=
"method">can_fetch
</tt></b>(
</nobr></td>
<td><var>useragent, url
</var>)
</td></tr></table></dt>
Returns
<code>True
</code> if the
<var>useragent
</var> is allowed to fetch the
<var>url
</var>
according to the rules contained in the parsed
<span class=
"file">robots.txt
</span> file.
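<p>The following sketch illustrates how the answer can depend on the
<var>useragent</var> argument. The robot names <code>ExampleBot</code> and
<code>OtherBot</code>, the rules, and the URLs are made up for illustration,
and the fallback import assumes the class is also available as
<code>urllib.robotparser</code> on newer Python versions.</p>

```python
try:
    import robotparser  # module name documented in this section
except ImportError:
    # Assumption: on Python 3 the same class lives in urllib.robotparser.
    from urllib import robotparser

# Hypothetical rules: one robot is banned entirely, all others
# are only kept out of /cgi-bin/.
lines = [
    "User-agent: ExampleBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /cgi-bin/",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)

banned_home = rp.can_fetch("ExampleBot", "http://www.example.com/")
other_home = rp.can_fetch("OtherBot", "http://www.example.com/")
other_cgi = rp.can_fetch("OtherBot", "http://www.example.com/cgi-bin/search")
```

<p>Here <code>ExampleBot</code> matches its own record, while
<code>OtherBot</code> falls back to the <code>*</code> record.</p>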
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4207' xml:id='l2h-
4207'
class=
"method">mtime
</tt></b>(
</nobr></td>
<td><var></var>)
</td></tr></table></dt>
Returns the time the
<code>robots.txt
</code> file was last fetched. This is
useful for long-running web spiders that need to check for new
<code>robots.txt
</code> files periodically.
<dl><dt><table cellpadding=
"0" cellspacing=
"0"><tr valign=
"baseline">
<td><nobr><b><tt id='l2h-
4208' xml:id='l2h-
4208'
class=
"method">modified
</tt></b>(
</nobr></td>
<td><var></var>)
</td></tr></table></dt>
Sets the time the
<code>robots.txt
</code> file was last fetched to the current time.
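<p>A long-running spider might combine <tt class="method">mtime</tt> and
<tt class="method">modified</tt> roughly as sketched below. The helper
<code>needs_refresh</code> and the one-hour limit are hypothetical. The sketch
also assumes that <tt class="method">mtime</tt> returns seconds since the
epoch, as <code>time.time()</code> does, and that newer Python versions expose
the class as <code>urllib.robotparser</code>.</p>

```python
import time

try:
    import robotparser  # module name documented in this section
except ImportError:
    # Assumption: on Python 3 the same class lives in urllib.robotparser.
    from urllib import robotparser


def needs_refresh(rp, max_age, now=None):
    """Hypothetical helper: True if the rules are older than max_age seconds.

    Assumes rp.mtime() is a timestamp comparable with time.time().
    """
    if now is None:
        now = time.time()
    return now - rp.mtime() > max_age


rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow:"])  # stand-in for a real fetch
rp.modified()  # record "now" as the time the rules were last fetched

fresh = needs_refresh(rp, 3600)  # checked immediately after the fetch
stale = needs_refresh(rp, 3600, now=rp.mtime() + 7200)  # two hours later
```

<p>When the helper reports stale rules, the spider would fetch the file again
and call <tt class="method">modified</tt> once more.</p>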
The following example demonstrates basic use of the RobotFileParser class.
<div class=
"verbatim"><pre>
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()  # fetch and parse the file; without this no rules are loaded
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
</pre></div>
<span class=
"release-info">Release
2.4.2, documentation updated on
28 September
2005.
</span>
<!--End of Navigation Panel-->
See
<i><a href=
"about.html">About this document...
</a></i> for information on suggesting changes.