Commit | Line | Data |
---|---|---|
920dae64 AT |
1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
2 | <html> | |
3 | <head> | |
4 | <link rel="STYLESHEET" href="lib.css" type='text/css' /> | |
5 | <link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" /> | |
6 | <link rel='start' href='../index.html' title='Python Documentation Index' /> | |
7 | <link rel="first" href="lib.html" title='Python Library Reference' /> | |
8 | <link rel='contents' href='contents.html' title="Contents" /> | |
9 | <link rel='index' href='genindex.html' title='Index' /> | |
10 | <link rel='last' href='about.html' title='About this document...' /> | |
11 | <link rel='help' href='about.html' title='About this document...' /> | |
12 | <link rel="next" href="module-csv.html" /> | |
13 | <link rel="prev" href="module-netrc.html" /> | |
14 | <link rel="parent" href="netdata.html" /> | |
16 | <meta name='aesop' content='information' /> | |
17 | <title>12.19 robotparser -- Parser for robots.txt</title> | |
18 | </head> | |
19 | <body> | |
20 | <DIV CLASS="navigation"> | |
21 | <div id='top-navigation-panel' xml:id='top-navigation-panel'> | |
22 | <table align="center" width="100%" cellpadding="0" cellspacing="2"> | |
23 | <tr> | |
24 | <td class='online-navigation'><a rel="prev" title="12.18.1 netrc Objects" | |
25 | href="netrc-objects.html"><img src='../icons/previous.png' | |
26 | border='0' height='32' alt='Previous Page' width='32' /></A></td> | |
27 | <td class='online-navigation'><a rel="parent" title="12. Internet Data Handling" | |
28 | href="netdata.html"><img src='../icons/up.png' | |
29 | border='0' height='32' alt='Up One Level' width='32' /></A></td> | |
30 | <td class='online-navigation'><a rel="next" title="12.20 csv " | |
31 | href="module-csv.html"><img src='../icons/next.png' | |
32 | border='0' height='32' alt='Next Page' width='32' /></A></td> | |
33 | <td align="center" width="100%">Python Library Reference</td> | |
34 | <td class='online-navigation'><a rel="contents" title="Table of Contents" | |
35 | href="contents.html"><img src='../icons/contents.png' | |
36 | border='0' height='32' alt='Contents' width='32' /></A></td> | |
37 | <td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png' | |
38 | border='0' height='32' alt='Module Index' width='32' /></a></td> | |
39 | <td class='online-navigation'><a rel="index" title="Index" | |
40 | href="genindex.html"><img src='../icons/index.png' | |
41 | border='0' height='32' alt='Index' width='32' /></A></td> | |
42 | </tr></table> | |
43 | <div class='online-navigation'> | |
44 | <b class="navlabel">Previous:</b> | |
45 | <a class="sectref" rel="prev" href="netrc-objects.html">12.18.1 netrc Objects</A> | |
46 | <b class="navlabel">Up:</b> | |
47 | <a class="sectref" rel="parent" href="netdata.html">12. Internet Data Handling</A> | |
48 | <b class="navlabel">Next:</b> | |
49 | <a class="sectref" rel="next" href="module-csv.html">12.20 csv </A> | |
50 | </div> | |
51 | <hr /></div> | |
52 | </DIV> | |
53 | <!--End of Navigation Panel--> | |
54 | ||
55 | <H1><A NAME="SECTION00141900000000000000000"> | |
56 | 12.19 <tt class="module">robotparser</tt> -- | |
57 | Parser for robots.txt</A> | |
58 | </H1> | |
59 | ||
60 | <P> | |
61 | <A NAME="module-robotparser"></A> | |
62 | ||
63 | <P> | |
64 | <a id='l2h-4209' xml:id='l2h-4209'></a> | |
65 | ||
66 | <P> | |
67 | This module provides a single class, <tt class="class">RobotFileParser</tt>, which answers | |
68 | questions about whether a particular user agent may fetch a URL on | |
69 | the Web site that published the <span class="file">robots.txt</span> file. For more details on | |
70 | the structure of <span class="file">robots.txt</span> files, see | |
71 | <a class="url" href="http://www.robotstxt.org/wc/norobots.html">http://www.robotstxt.org/wc/norobots.html</a>. | |
72 | ||
73 | <P> | |
74 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
75 | <td><nobr><b><span class="typelabel">class</span> <tt id='l2h-4202' xml:id='l2h-4202' class="class">RobotFileParser</tt></b>(</nobr></td> | |
76 | <td><var></var>)</td></tr></table></dt> | |
77 | <dd> | |
78 | ||
79 | <P> | |
80 | This class provides a set of methods to read, parse and answer questions | |
81 | about a single <span class="file">robots.txt</span> file. | |
82 | ||
83 | <P> | |
84 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
85 | <td><nobr><b><tt id='l2h-4203' xml:id='l2h-4203' class="method">set_url</tt></b>(</nobr></td> | |
86 | <td><var>url</var>)</td></tr></table></dt> | |
87 | <dd> | |
88 | Sets the URL referring to a <span class="file">robots.txt</span> file. | |
89 | </dl> | |
90 | ||
91 | <P> | |
92 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
93 | <td><nobr><b><tt id='l2h-4204' xml:id='l2h-4204' class="method">read</tt></b>(</nobr></td> | |
94 | <td><var></var>)</td></tr></table></dt> | |
95 | <dd> | |
96 | Reads the <span class="file">robots.txt</span> URL and feeds it to the parser. | |
97 | </dl> | |
98 | ||
99 | <P> | |
100 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
101 | <td><nobr><b><tt id='l2h-4205' xml:id='l2h-4205' class="method">parse</tt></b>(</nobr></td> | |
102 | <td><var>lines</var>)</td></tr></table></dt> | |
103 | <dd> | |
104 | Parses the <var>lines</var> argument, a sequence of lines from a <span class="file">robots.txt</span> file. | |
105 | </dl> | |
106 | ||
107 | <P> | |
108 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
109 | <td><nobr><b><tt id='l2h-4206' xml:id='l2h-4206' class="method">can_fetch</tt></b>(</nobr></td> | |
110 | <td><var>useragent, url</var>)</td></tr></table></dt> | |
111 | <dd> | |
112 | Returns <code>True</code> if the <var>useragent</var> is allowed to fetch the <var>url</var> | |
113 | according to the rules contained in the parsed <span class="file">robots.txt</span> file. | |
114 | </dl> | |
115 | ||
116 | <P> | |
117 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
118 | <td><nobr><b><tt id='l2h-4207' xml:id='l2h-4207' class="method">mtime</tt></b>(</nobr></td> | |
119 | <td><var></var>)</td></tr></table></dt> | |
120 | <dd> | |
121 | Returns the time the <span class="file">robots.txt</span> file was last fetched. This is | |
122 | useful for long-running web spiders that need to check for new | |
123 | <span class="file">robots.txt</span> files periodically. | |
124 | </dl> | |
125 | ||
126 | <P> | |
127 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
128 | <td><nobr><b><tt id='l2h-4208' xml:id='l2h-4208' class="method">modified</tt></b>(</nobr></td> | |
129 | <td><var></var>)</td></tr></table></dt> | |
130 | <dd> | |
131 | Sets the time the <span class="file">robots.txt</span> file was last fetched to the current | |
132 | time. | |
133 | </dl> | |
134 | ||
135 | <P> | |
136 | </dl> | |
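<P>
As a rough sketch of how <tt class="method">mtime</tt> and <tt class="method">modified</tt> might be combined in a long-running spider, the snippet below checks whether the parsed rules are older than a chosen age. The <code>MAX_AGE</code> constant and the <code>is_stale</code> helper are hypothetical names introduced for this illustration, not part of the module. The rules are supplied in memory through <tt class="method">parse</tt> so that the sketch needs no network access; a real spider would call <tt class="method">read</tt> at that point. The fallback import is an assumption for newer Python versions, where the class lives in <code>urllib.robotparser</code>.

```python
import time

try:
    import robotparser  # module name documented on this page
except ImportError:
    from urllib import robotparser  # assumed location on Python 3

MAX_AGE = 3600  # hypothetical refresh interval, in seconds


def is_stale(rp, now=None):
    """Hypothetical helper: True if rp was last marked fresh over MAX_AGE seconds ago."""
    if now is None:
        now = time.time()
    return now - rp.mtime() > MAX_AGE


rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])  # made-up rules
rp.modified()  # record the current time as the last-fetched time

stale_now = is_stale(rp)  # just refreshed, so not stale
stale_later = is_stale(rp, now=rp.mtime() + MAX_AGE + 1)  # simulated later check
```

When <code>is_stale</code> returns a true value, the spider would fetch the file again and call <tt class="method">modified</tt> once more.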
137 | ||
138 | <P> | |
139 | The following example demonstrates basic use of the <tt class="class">RobotFileParser</tt> class. | |
140 | ||
141 | <P> | |
142 | <div class="verbatim"><pre> | |
143 | >>> import robotparser | |
144 | >>> rp = robotparser.RobotFileParser() | |
145 | >>> rp.set_url("http://www.musi-cal.com/robots.txt") | |
146 | >>> rp.read() | |
147 | >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco") | |
148 | False | |
149 | >>> rp.can_fetch("*", "http://www.musi-cal.com/") | |
150 | True | |
151 | </pre></div> | |
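<P>
The same questions can be answered without fetching anything by passing the rules directly to <tt class="method">parse</tt>. The sketch below is illustrative only: the rules and the <code>www.example.com</code> URLs are made up, and the fallback import is an assumption for newer Python versions, where the class lives in <code>urllib.robotparser</code>.

```python
try:
    import robotparser  # module name documented on this page
except ImportError:
    from urllib import robotparser  # assumed location on Python 3

# Hypothetical rules, written inline instead of being fetched from a site.
RULES = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
]

rp = robotparser.RobotFileParser()
rp.parse(RULES)  # feed the lines to the parser; no network access involved

# Under the rules above, anything below /cgi-bin/ is off limits to every agent.
blocked = rp.can_fetch("*", "http://www.example.com/cgi-bin/search?city=San+Francisco")
allowed = rp.can_fetch("*", "http://www.example.com/")
```

Here <code>blocked</code> is <code>False</code> and <code>allowed</code> is <code>True</code>, mirroring the results of the interactive session above.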
152 | ||
153 | <DIV CLASS="navigation"> | |
186 | <hr /> | |
187 | <span class="release-info">Release 2.4.2, documentation updated on 28 September 2005.</span> | |
188 | </DIV> | |
189 | <!--End of Navigation Panel--> | |
190 | <ADDRESS> | |
191 | See <i><a href="about.html">About this document...</a></i> for information on suggesting changes. | |
192 | </ADDRESS> | |
193 | </BODY> | |
194 | </HTML> |