Commit | Line | Data |
---|---|---|
86530b38 AT |
1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
2 | <html> | |
3 | <head> | |
4 | <link rel="STYLESHEET" href="lib.css" type='text/css' /> | |
5 | <link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" /> | |
6 | <link rel='start' href='../index.html' title='Python Documentation Index' /> | |
7 | <link rel="first" href="lib.html" title='Python Library Reference' /> | |
8 | <link rel='contents' href='contents.html' title="Contents" /> | |
9 | <link rel='index' href='genindex.html' title='Index' /> | |
10 | <link rel='last' href='about.html' title='About this document...' /> | |
11 | <link rel='help' href='about.html' title='About this document...' /> | |
12 | <link rel="next" href="module-urllib2.html" /> | |
13 | <link rel="prev" href="module-cgitb.html" /> | |
14 | <link rel="parent" href="internet.html" /> | |
15 | <link rel="next" href="urlopener-objs.html" /> | |
16 | <meta name='aesop' content='information' /> | |
17 | <title>11.4 urllib -- Open arbitrary resources by URL</title> | |
18 | </head> | |
19 | <body> | |
20 | <DIV CLASS="navigation"> | |
21 | <div id='top-navigation-panel' xml:id='top-navigation-panel'> | |
22 | <table align="center" width="100%" cellpadding="0" cellspacing="2"> | |
23 | <tr> | |
24 | <td class='online-navigation'><a rel="prev" title="11.3 cgitb " | |
25 | href="module-cgitb.html"><img src='../icons/previous.png' | |
26 | border='0' height='32' alt='Previous Page' width='32' /></A></td> | |
27 | <td class='online-navigation'><a rel="parent" title="11. Internet Protocols and" | |
28 | href="internet.html"><img src='../icons/up.png' | |
29 | border='0' height='32' alt='Up One Level' width='32' /></A></td> | |
30 | <td class='online-navigation'><a rel="next" title="11.4.1 URLopener Objects" | |
31 | href="urlopener-objs.html"><img src='../icons/next.png' | |
32 | border='0' height='32' alt='Next Page' width='32' /></A></td> | |
33 | <td align="center" width="100%">Python Library Reference</td> | |
34 | <td class='online-navigation'><a rel="contents" title="Table of Contents" | |
35 | href="contents.html"><img src='../icons/contents.png' | |
36 | border='0' height='32' alt='Contents' width='32' /></A></td> | |
37 | <td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png' | |
38 | border='0' height='32' alt='Module Index' width='32' /></a></td> | |
39 | <td class='online-navigation'><a rel="index" title="Index" | |
40 | href="genindex.html"><img src='../icons/index.png' | |
41 | border='0' height='32' alt='Index' width='32' /></A></td> | |
42 | </tr></table> | |
43 | <div class='online-navigation'> | |
44 | <b class="navlabel">Previous:</b> | |
45 | <a class="sectref" rel="prev" href="module-cgitb.html">11.3 cgitb </A> | |
46 | <b class="navlabel">Up:</b> | |
47 | <a class="sectref" rel="parent" href="internet.html">11. Internet Protocols and</A> | |
48 | <b class="navlabel">Next:</b> | |
49 | <a class="sectref" rel="next" href="urlopener-objs.html">11.4.1 URLopener Objects</A> | |
50 | </div> | |
51 | <hr /></div> | |
52 | </DIV> | |
53 | <!--End of Navigation Panel--> | |
54 | ||
55 | <H1><A NAME="SECTION0013400000000000000000"> | |
56 | 11.4 <tt class="module">urllib</tt> -- | |
57 | Open arbitrary resources by URL</A> | |
58 | </H1> | |
59 | ||
60 | <P> | |
61 | <A NAME="module-urllib"></A> | |
62 | ||
63 | <P> | |
64 | <a id='l2h-3202' xml:id='l2h-3202'></a> | |
65 | ||
66 | <P> | |
67 | This module provides a high-level interface for fetching data across | |
68 | the World Wide Web. In particular, the <tt class="function">urlopen()</tt> function | |
69 | is similar to the built-in function <tt class="function">open()</tt>, but accepts | |
70 | Uniform Resource Locators (URLs) instead of filenames. Some | |
71 | restrictions apply -- it can only open URLs for reading, and no seek | |
72 | operations are available. | |
73 | ||
74 | <P> | |
75 | It defines the following public functions: | |
76 | ||
77 | <P> | |
78 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
79 | <td><nobr><b><tt id='l2h-3184' xml:id='l2h-3184' class="function">urlopen</tt></b>(</nobr></td> | |
80 | <td><var>url</var><big>[</big><var>, data</var><big>[</big><var>, proxies</var><big>]</big><var></var><big>]</big><var></var>)</td></tr></table></dt> | |
81 | <dd> | |
82 | Open a network object denoted by a URL for reading. If the URL does | |
83 | not have a scheme identifier, or if it has <span class="file">file:</span> as its scheme | |
84 | identifier, this opens a local file (without universal newlines); | |
85 | otherwise it opens a socket to a server somewhere on the network. If | |
86 | the connection cannot be made, or if the server returns an error code, | |
87 | the <tt class="exception">IOError</tt> exception is raised. If all went well, a | |
88 | file-like object is returned. This supports the following methods: | |
89 | <tt class="method">read()</tt>, <tt class="method">readline()</tt>, <tt class="method">readlines()</tt>, <tt class="method">fileno()</tt>, | |
90 | <tt class="method">close()</tt>, <tt class="method">info()</tt> and <tt class="method">geturl()</tt>. It also has | |
91 | proper support for the iterator protocol. | |
92 | One caveat: the <tt class="method">read()</tt> method, if the size argument is | |
93 | omitted or negative, may not read until the end of the data stream; | |
94 | there is no good way to determine that the entire stream from a socket | |
95 | has been read in the general case. | |
96 | ||
97 | <P> | |
98 | Except for the <tt class="method">info()</tt> and <tt class="method">geturl()</tt> methods, | |
99 | these methods have the same interface as for | |
100 | file objects -- see section <A href="bltin-file-objects.html#bltin-file-objects">2.3.9</A> in this | |
101 | manual. (It is not a built-in file object, however, so it can't be | |
102 | used at those few places where a true built-in file object is | |
103 | required.) | |
104 | ||
105 | <P> | |
106 | The <tt class="method">info()</tt> method returns an instance of the class | |
107 | <tt class="class">mimetools.Message</tt> containing meta-information associated | |
108 | with the URL. When the method is HTTP, these headers are those | |
109 | returned by the server at the head of the retrieved HTML page | |
110 | (including Content-Length and Content-Type). When the method is FTP, | |
111 | a Content-Length header will be present if (as is now usual) the | |
112 | server passed back a file length in response to the FTP retrieval | |
113 | request. A Content-Type header will be present if the MIME type can | |
114 | be guessed. When the method is local-file, returned headers will include | |
115 | a Date representing the file's last-modified time, a Content-Length | |
116 | giving file size, and a Content-Type containing a guess at the file's | |
117 | type. See also the description of the | |
118 | <tt class="module"><a href="module-mimetools.html">mimetools</a></tt><a id='l2h-3203' xml:id='l2h-3203'></a> module. | |
119 | ||
120 | <P> | |
121 | The <tt class="method">geturl()</tt> method returns the real URL of the page. In | |
122 | some cases, the HTTP server redirects a client to another URL. The | |
123 | <tt class="function">urlopen()</tt> function handles this transparently, but in some | |
124 | cases the caller needs to know which URL the client was redirected | |
125 | to. The <tt class="method">geturl()</tt> method can be used to get at this | |
126 | redirected URL. | |
127 | ||
128 | <P> | |
129 | If the <var>url</var> uses the <span class="file">http:</span> scheme identifier, the optional | |
130 | <var>data</var> argument may be given to specify a <code>POST</code> request | |
131 | (normally the request type is <code>GET</code>). The <var>data</var> argument | |
132 | must be in standard <span class="mimetype">application/x-www-form-urlencoded</span> format; | |
133 | see the <tt class="function">urlencode()</tt> function below. | |
134 | ||
135 | <P> | |
136 | The <tt class="function">urlopen()</tt> function works transparently with proxies | |
137 | which do not require authentication. In a <span class="Unix">Unix</span> or Windows | |
138 | environment, set the <a class="envvar" id='l2h-3204' xml:id='l2h-3204'>http_proxy</a>, <a class="envvar" id='l2h-3205' xml:id='l2h-3205'>ftp_proxy</a> or | |
139 | <a class="envvar" id='l2h-3206' xml:id='l2h-3206'>gopher_proxy</a> environment variables to a URL that identifies | |
140 | the proxy server before starting the Python interpreter. For example | |
141 | (the "<tt class="character">%</tt>" is the command prompt): | |
142 | ||
143 | <P> | |
144 | <div class="verbatim"><pre> | |
145 | % http_proxy="http://www.someproxy.com:3128" | |
146 | % export http_proxy | |
147 | % python | |
148 | ... | |
149 | </pre></div> | |
150 | ||
151 | <P> | |
152 | In a Windows environment, if no proxy environment variables are set, | |
153 | proxy settings are obtained from the registry's Internet Settings | |
154 | section. | |
155 | ||
156 | <P> | |
157 | In a Macintosh environment, <tt class="function">urlopen()</tt> will retrieve proxy | |
158 | information from Internet<a id='l2h-3207' xml:id='l2h-3207'></a> Config. | |
159 | ||
160 | <P> | |
161 | Alternatively, the optional <var>proxies</var> argument may be used to | |
162 | explicitly specify proxies. It must be a dictionary mapping scheme | |
163 | names to proxy URLs, where an empty dictionary causes no proxies to be | |
164 | used, and <code>None</code> (the default value) causes environmental proxy | |
165 | settings to be used as discussed above. For example: | |
166 | ||
167 | <P> | |
168 | <div class="verbatim"><pre> | |
169 | # Use http://www.someproxy.com:3128 for http proxying | |
170 | proxies = {'http': 'http://www.someproxy.com:3128'} | |
171 | filehandle = urllib.urlopen(some_url, proxies=proxies) | |
172 | # Don't use any proxies | |
173 | filehandle = urllib.urlopen(some_url, proxies={}) | |
174 | # Use proxies from environment - both versions are equivalent | |
175 | filehandle = urllib.urlopen(some_url, proxies=None) | |
176 | filehandle = urllib.urlopen(some_url) | |
177 | </pre></div> | |
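<P>
As a rough illustration of the dictionary shape involved, the sketch below sets <code>http_proxy</code> in the process environment and reads it back with <tt class="function">getproxies()</tt>, a helper that exists in <tt class="module">urllib</tt> but is not described in this section. The fallback import is an assumption added so the snippet also runs on Python 3, where the function lives in <code>urllib.request</code>; the proxy URL is the placeholder used above.

```python
import os

try:
    from urllib import getproxies          # module layout documented here
except ImportError:
    from urllib.request import getproxies  # assumption: Python 3 layout

# Remember any existing value so the environment can be restored afterwards.
saved = os.environ.get('http_proxy')
os.environ['http_proxy'] = 'http://www.someproxy.com:3128'
try:
    env_proxies = getproxies()
finally:
    if saved is None:
        del os.environ['http_proxy']
    else:
        os.environ['http_proxy'] = saved

# env_proxies maps scheme names to proxy URLs, the same shape that the
# proxies argument expects.
print(env_proxies.get('http'))
```

<P>
A dictionary of this shape, or an edited copy of it, is the kind of value that can be passed as <var>proxies</var>.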
178 | ||
179 | <P> | |
180 | Before version 2.3, the <tt class="function">urlopen()</tt> function did not support explicit proxy | |
181 | specification. In those versions, overriding environmental proxy settings | |
182 | requires <tt class="class">URLopener</tt>, or a subclass such as <tt class="class">FancyURLopener</tt>. | |
183 | ||
184 | <P> | |
185 | Proxies which require authentication for use are not currently | |
186 | supported; this is considered an implementation limitation. | |
187 | ||
188 | <P> | |
189 | ||
190 | <span class="versionnote">Changed in version 2.3: | |
191 | Added the <var>proxies</var> support.</span> | |
192 | ||
193 | </dl> | |
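<P>
The file-like object described for <tt class="function">urlopen()</tt> can be tried without network access by pointing it at a local file. The following is only a sketch: the temporary file and its contents are made up for the example, and the fallback import is an assumption so that the same code also runs on Python 3, where these functions live in <code>urllib.request</code>.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url          # module layout documented here
except ImportError:
    from urllib.request import urlopen, pathname2url  # assumption: Python 3 layout

# Create a small local file to stand in for a network resource.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, 'hello\n'.encode('ascii'))
os.close(fd)

response = urlopen('file:' + pathname2url(path))
try:
    data = response.read()                      # the whole body
    length = response.info()['Content-Length']  # header value, as a string
    final_url = response.geturl()               # URL that was actually opened
finally:
    response.close()
    os.remove(path)

print(len(data), length, final_url)
```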
194 | ||
195 | <P> | |
196 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
197 | <td><nobr><b><tt id='l2h-3185' xml:id='l2h-3185' class="function">urlretrieve</tt></b>(</nobr></td> | |
198 | <td><var>url</var><big>[</big><var>, filename</var><big>[</big><var>, | |
199 | reporthook</var><big>[</big><var>, data</var><big>]</big><var></var><big>]</big><var></var><big>]</big><var></var>)</td></tr></table></dt> | |
200 | <dd> | |
201 | Copy a network object denoted by a URL to a local file, if necessary. | |
202 | If the URL points to a local file, or a valid cached copy of the | |
203 | object exists, the object is not copied. Return a tuple | |
204 | <code>(<var>filename</var>, <var>headers</var>)</code> where <var>filename</var> is the | |
205 | local file name under which the object can be found, and <var>headers</var> | |
206 | is whatever the <tt class="method">info()</tt> method of the object returned by | |
207 | <tt class="function">urlopen()</tt> returned (for a remote object, possibly cached). | |
208 | Exceptions are the same as for <tt class="function">urlopen()</tt>. | |
209 | ||
210 | <P> | |
211 | The second argument, if present, specifies the file location to copy | |
212 | to (if absent, the location will be a tempfile with a generated name). | |
213 | The third argument, if present, is a hook function that will be called | |
214 | once on establishment of the network connection and once after each | |
215 | block read thereafter. The hook will be passed three arguments: a | |
216 | count of blocks transferred so far, a block size in bytes, and the | |
217 | total size of the file. The total size may be <code>-1</code> on older | |
218 | FTP servers which do not return a file size in response to a retrieval | |
219 | request. | |
220 | ||
221 | <P> | |
222 | If the <var>url</var> uses the <span class="file">http:</span> scheme identifier, the optional | |
223 | <var>data</var> argument may be given to specify a <code>POST</code> request | |
224 | (normally the request type is <code>GET</code>). The <var>data</var> argument | |
225 | must be in standard <span class="mimetype">application/x-www-form-urlencoded</span> format; | |
226 | see the <tt class="function">urlencode()</tt> function below. | |
227 | </dl> | |
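<P>
The hook arguments of <tt class="function">urlretrieve()</tt> are easiest to see by recording them. The sketch below copies a local file so that it runs offline; the file names, the 1000-byte payload and the <code>report</code> function are hypothetical, and the fallback import is an assumption for Python 3. An explicit target file name is given because, as far as the implementation goes, a local source is otherwise returned without copying and the hook is never called.

```python
import os
import shutil
import tempfile

try:
    from urllib import urlretrieve, pathname2url          # module layout documented here
except ImportError:
    from urllib.request import urlretrieve, pathname2url  # assumption: Python 3 layout

progress = []

def report(block_count, block_size, total_size):
    # Record every call so the sequence can be inspected afterwards.
    progress.append((block_count, block_size, total_size))

workdir = tempfile.mkdtemp()
try:
    source = os.path.join(workdir, 'source.bin')
    target = os.path.join(workdir, 'copy.bin')
    f = open(source, 'wb')
    f.write('x'.encode('ascii') * 1000)
    f.close()

    # Passing a target file name requests an actual copy.
    filename, headers = urlretrieve('file:' + pathname2url(source), target, report)
    copied_size = os.path.getsize(filename)
finally:
    shutil.rmtree(workdir)

print(copied_size, progress[0])
```

<P>
The first recorded call has a block count of zero, which corresponds to the call made when the connection is established.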
228 | ||
229 | <P> | |
230 | <dl><dt><b><tt id='l2h-3186' xml:id='l2h-3186'>_urlopener</tt></b></dt> | |
231 | <dd> | |
232 | The public functions <tt class="function">urlopen()</tt> and | |
233 | <tt class="function">urlretrieve()</tt> create an instance of the | |
234 | <tt class="class">FancyURLopener</tt> class and use it to perform their requested | |
235 | actions. To override this functionality, programmers can create a | |
236 | subclass of <tt class="class">URLopener</tt> or <tt class="class">FancyURLopener</tt>, then assign | |
237 | an instance of that class to the | |
238 | <code>urllib._urlopener</code> variable before calling the desired function. | |
239 | For example, applications may want to specify a different | |
240 | <span class="mailheader">User-Agent:</span> header than <tt class="class">URLopener</tt> defines. This | |
241 | can be accomplished with the following code: | |
242 | ||
243 | <P> | |
244 | <div class="verbatim"><pre> | |
245 | import urllib | |
246 | ||
247 | class AppURLopener(urllib.FancyURLopener): | |
248 | version = "App/1.7" | |
249 | ||
250 | urllib._urlopener = AppURLopener() | |
251 | </pre></div> | |
252 | </dd></dl> | |
253 | ||
254 | <P> | |
255 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
256 | <td><nobr><b><tt id='l2h-3187' xml:id='l2h-3187' class="function">urlcleanup</tt></b>(</nobr></td> | |
257 | <td><var></var>)</td></tr></table></dt> | |
258 | <dd> | |
259 | Clear the cache that may have been built up by previous calls to | |
260 | <tt class="function">urlretrieve()</tt>. | |
261 | </dl> | |
262 | ||
263 | <P> | |
264 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
265 | <td><nobr><b><tt id='l2h-3188' xml:id='l2h-3188' class="function">quote</tt></b>(</nobr></td> | |
266 | <td><var>string</var><big>[</big><var>, safe</var><big>]</big><var></var>)</td></tr></table></dt> | |
267 | <dd> | |
268 | Replace special characters in <var>string</var> using the "<tt class="samp">%xx</tt>" escape. | |
269 | Letters, digits, and the characters "<tt class="character">_.-</tt>" are never quoted. | |
270 | The optional <var>safe</var> parameter specifies additional characters | |
271 | that should not be quoted -- its default value is <code>'/'</code>. | |
272 | ||
273 | <P> | |
274 | Example: <code>quote('/~connolly/')</code> yields <code>'/%7Econnolly/'</code>. | |
275 | </dl> | |
276 | ||
277 | <P> | |
278 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
279 | <td><nobr><b><tt id='l2h-3189' xml:id='l2h-3189' class="function">quote_plus</tt></b>(</nobr></td> | |
280 | <td><var>string</var><big>[</big><var>, safe</var><big>]</big><var></var>)</td></tr></table></dt> | |
281 | <dd> | |
282 | Like <tt class="function">quote()</tt>, but also replaces spaces by plus signs, as | |
283 | required for quoting HTML form values. Plus signs in the original | |
284 | string are escaped unless they are included in <var>safe</var>. It also | |
285 | does not have <var>safe</var> default to <code>'/'</code>. | |
286 | </dl> | |
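<P>
The difference between <tt class="function">quote()</tt> and <tt class="function">quote_plus()</tt>, and the effect of <var>safe</var>, can be compared side by side. This is a minimal sketch; the sample string is made up, and the fallback import is an assumption so the snippet also runs on Python 3, where the functions live in <code>urllib.parse</code>.

```python
try:
    from urllib import quote, quote_plus        # module layout documented here
except ImportError:
    from urllib.parse import quote, quote_plus  # assumption: Python 3 layout

text = '/my docs/a+b'

as_path = quote(text)           # '/' is kept; space and '+' are escaped
as_form = quote_plus(text)      # space becomes '+'; '/' and '+' are escaped
no_safe = quote(text, safe='')  # with an empty safe set, '/' is escaped too

print(as_path)
print(as_form)
print(no_safe)
```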
287 | ||
288 | <P> | |
289 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
290 | <td><nobr><b><tt id='l2h-3190' xml:id='l2h-3190' class="function">unquote</tt></b>(</nobr></td> | |
291 | <td><var>string</var>)</td></tr></table></dt> | |
292 | <dd> | |
293 | Replace "<tt class="samp">%xx</tt>" escapes by their single-character equivalent. | |
294 | ||
295 | <P> | |
296 | Example: <code>unquote('/%7Econnolly/')</code> yields <code>'/~connolly/'</code>. | |
297 | </dl> | |
298 | ||
299 | <P> | |
300 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
301 | <td><nobr><b><tt id='l2h-3191' xml:id='l2h-3191' class="function">unquote_plus</tt></b>(</nobr></td> | |
302 | <td><var>string</var>)</td></tr></table></dt> | |
303 | <dd> | |
304 | Like <tt class="function">unquote()</tt>, but also replaces plus signs by spaces, as | |
305 | required for unquoting HTML form values. | |
306 | </dl> | |
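<P>
A short sketch of how the two unquoting functions treat a plus sign. The input strings are illustrative only, and the fallback import is an assumption for Python 3.

```python
try:
    from urllib import unquote, unquote_plus        # module layout documented here
except ImportError:
    from urllib.parse import unquote, unquote_plus  # assumption: Python 3 layout

plain = unquote('/%7Econnolly/')      # the escape is replaced by '~'
literal_plus = unquote('a+b%20c')     # '+' is left untouched
form_value = unquote_plus('a+b%20c')  # '+' is turned into a space

print(plain)
print(literal_plus)
print(form_value)
```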
307 | ||
308 | <P> | |
309 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
310 | <td><nobr><b><tt id='l2h-3192' xml:id='l2h-3192' class="function">urlencode</tt></b>(</nobr></td> | |
311 | <td><var>query</var><big>[</big><var>, doseq</var><big>]</big><var></var>)</td></tr></table></dt> | |
312 | <dd> | |
313 | Convert a mapping object or a sequence of two-element tuples to a | |
314 | ``url-encoded'' string, suitable to pass to | |
315 | <tt class="function">urlopen()</tt> above as the optional <var>data</var> argument. This | |
316 | is useful to pass a dictionary of form fields to a <code>POST</code> | |
317 | request. The resulting string is a series of | |
318 | <code><var>key</var>=<var>value</var></code> pairs separated by "<tt class="character">&</tt>" | |
319 | characters, where both <var>key</var> and <var>value</var> are quoted using | |
320 | <tt class="function">quote_plus()</tt> above. If the optional parameter <var>doseq</var> is | |
321 | present and evaluates to true, individual <code><var>key</var>=<var>value</var></code> pairs | |
322 | are generated for each element of the sequence. | |
323 | When a sequence of two-element tuples is used as the <var>query</var> argument, | |
324 | the first element of each tuple is a key and the second is a value. The | |
325 | order of parameters in the encoded string will match the order of parameter | |
326 | tuples in the sequence. | |
327 | The <tt class="module"><a href="module-cgi.html">cgi</a></tt> module provides the functions | |
328 | <tt class="function">parse_qs()</tt> and <tt class="function">parse_qsl()</tt> which are used to | |
329 | parse query strings into Python data structures. | |
330 | </dl> | |
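<P>
The effect of <var>doseq</var> on <tt class="function">urlencode()</tt> is easiest to see with a value that is itself a sequence. The field names and values below are made up, and the fallback import is an assumption for Python 3, where the function lives in <code>urllib.parse</code>.

```python
try:
    from urllib import urlencode        # module layout documented here
except ImportError:
    from urllib.parse import urlencode  # assumption: Python 3 layout

# A sequence of two-element tuples keeps its order in the result.
pairs = [('name', 'Ada Lovelace'), ('lang', 'python')]
encoded = urlencode(pairs)

# With doseq true, each element of a sequence value gets its own pair.
multi = urlencode([('tag', ['a', 'b'])], doseq=True)

# Without doseq, the sequence is quoted as one value.
single = urlencode([('tag', ['a', 'b'])])

print(encoded)
print(multi)
print(single)
```

<P>
The string in <code>encoded</code> is in the form expected for the <var>data</var> argument of a <code>POST</code> request.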
331 | ||
332 | <P> | |
333 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
334 | <td><nobr><b><tt id='l2h-3193' xml:id='l2h-3193' class="function">pathname2url</tt></b>(</nobr></td> | |
335 | <td><var>path</var>)</td></tr></table></dt> | |
336 | <dd> | |
337 | Convert the pathname <var>path</var> from the local syntax for a path to | |
338 | the form used in the path component of a URL. This does not produce a | |
339 | complete URL. The return value will already be quoted using the | |
340 | <tt class="function">quote()</tt> function. | |
341 | </dl> | |
342 | ||
343 | <P> | |
344 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
345 | <td><nobr><b><tt id='l2h-3194' xml:id='l2h-3194' class="function">url2pathname</tt></b>(</nobr></td> | |
346 | <td><var>path</var>)</td></tr></table></dt> | |
347 | <dd> | |
348 | Convert the path component <var>path</var> from an encoded URL to the local | |
349 | syntax for a path. This does not accept a complete URL. This | |
350 | function uses <tt class="function">unquote()</tt> to decode <var>path</var>. | |
351 | </dl> | |
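<P>
The two path conversion functions can be checked against each other with a round trip. This sketch uses a made-up relative path so that the result does not depend on drive letters or a leading slash; the fallback import is an assumption for Python 3, where both functions live in <code>urllib.request</code>.

```python
import os

try:
    from urllib import pathname2url, url2pathname          # module layout documented here
except ImportError:
    from urllib.request import pathname2url, url2pathname  # assumption: Python 3 layout

# Build the path with the local separator so the example is portable.
local_path = os.path.join('data', 'my file.txt')

url_path = pathname2url(local_path)  # path component only, already quoted
round_trip = url2pathname(url_path)  # back to the local syntax

print(url_path)
print(round_trip)
```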
352 | ||
353 | <P> | |
354 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
355 | <td><nobr><b><span class="typelabel">class</span> <tt id='l2h-3195' xml:id='l2h-3195' class="class">URLopener</tt></b>(</nobr></td> | |
356 | <td><var></var><big>[</big><var>proxies</var><big>[</big><var>, **x509</var><big>]</big><var></var><big>]</big><var></var>)</td></tr></table></dt> | |
357 | <dd> | |
358 | Base class for opening and reading URLs. Unless you need to support | |
359 | opening objects using schemes other than <span class="file">http:</span>, <span class="file">ftp:</span>, | |
360 | <span class="file">gopher:</span> or <span class="file">file:</span>, you probably want to use | |
361 | <tt class="class">FancyURLopener</tt>. | |
362 | ||
363 | <P> | |
364 | By default, the <tt class="class">URLopener</tt> class sends a | |
365 | <span class="mailheader">User-Agent:</span> header of "<tt class="samp">urllib/<var>VVV</var></tt>", where | |
366 | <var>VVV</var> is the <tt class="module">urllib</tt> version number. Applications can | |
367 | define their own <span class="mailheader">User-Agent:</span> header by subclassing | |
368 | <tt class="class">URLopener</tt> or <tt class="class">FancyURLopener</tt> and setting the class | |
369 | attribute <tt class="member">version</tt> to an appropriate string value in the | |
370 | subclass definition. | |
371 | ||
372 | <P> | |
373 | The optional <var>proxies</var> parameter should be a dictionary mapping | |
374 | scheme names to proxy URLs, where an empty dictionary turns proxies | |
375 | off completely. Its default value is <code>None</code>, in which case | |
376 | environmental proxy settings will be used if present, as discussed in | |
377 | the definition of <tt class="function">urlopen()</tt>, above. | |
378 | ||
379 | <P> | |
380 | Additional keyword parameters, collected in <var>x509</var>, are used for | |
381 | authentication with the <span class="file">https:</span> scheme. The keywords | |
382 | <var>key_file</var> and <var>cert_file</var> are supported; both are needed to | |
383 | actually retrieve a resource at an <span class="file">https:</span> URL. | |
384 | </dl> | |
385 | ||
386 | <P> | |
387 | <dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> | |
388 | <td><nobr><b><span class="typelabel">class</span> <tt id='l2h-3196' xml:id='l2h-3196' class="class">FancyURLopener</tt></b>(</nobr></td> | |
389 | <td><var>...</var>)</td></tr></table></dt> | |
390 | <dd> | |
391 | <tt class="class">FancyURLopener</tt> subclasses <tt class="class">URLopener</tt> providing default | |
392 | handling for the following HTTP response codes: 301, 302, 303, 307 and | |
393 | 401. For the 30x response codes listed above, the | |
394 | <span class="mailheader">Location:</span> header is used to fetch the actual URL. For 401 | |
395 | response codes (authentication required), basic HTTP authentication is | |
396 | performed. For the 30x response codes, recursion is bounded by the | |
397 | value of the <var>maxtries</var> attribute, which defaults to 10. | |
398 | ||
399 | <P> | |
400 | <span class="note"><b class="label">Note:</b> | |
401 | According to the letter of <a class="rfc" id='rfcref-89922' xml:id='rfcref-89922' | |
402 | href="http://www.faqs.org/rfcs/rfc2616.html">RFC 2616</a>, 301 and 302 responses to | |
403 | POST requests must not be automatically redirected without | |
404 | confirmation by the user. In reality, browsers do allow automatic | |
405 | redirection of these responses, changing the POST to a GET, and | |
406 | <tt class="module">urllib</tt> reproduces this behaviour.</span> | |
407 | ||
408 | <P> | |
409 | The parameters to the constructor are the same as those for | |
410 | <tt class="class">URLopener</tt>. | |
411 | ||
412 | <P> | |
413 | <span class="note"><b class="label">Note:</b> | |
414 | When performing basic authentication, a | |
415 | <tt class="class">FancyURLopener</tt> instance calls its | |
416 | <tt class="method">prompt_user_passwd()</tt> method. The default implementation asks | |
417 | the user for the required information on the controlling terminal. A | |
418 | subclass may override this method to support more appropriate behavior | |
419 | if needed.</span> | |
420 | </dl> | |
421 | ||
422 | <P> | |
423 | Restrictions: | |
424 | ||
425 | <P> | |
426 | ||
427 | <UL> | |
428 | <LI>Currently, only the following protocols are supported: HTTP (versions | |
429 | 0.9 and 1.0), Gopher (but not Gopher-+), FTP, and local files. | |
430 | <a id='l2h-3197' xml:id='l2h-3197'></a><a id='l2h-3198' xml:id='l2h-3198'></a><a id='l2h-3199' xml:id='l2h-3199'></a> | |
431 | <P> | |
432 | </LI> | |
433 | <LI>The caching feature of <tt class="function">urlretrieve()</tt> is currently | |
434 | disabled, pending proper processing of expiration-time | |
435 | headers. | |
436 | ||
437 | <P> | |
438 | </LI> | |
439 | <LI>There should be a function to query whether a particular URL is in | |
440 | the cache. | |
441 | ||
442 | <P> | |
443 | </LI> | |
444 | <LI>For backward compatibility, if a URL appears to point to a local file | |
445 | but the file can't be opened, the URL is re-interpreted using the FTP | |
446 | protocol. This can sometimes cause confusing error messages. | |
447 | ||
448 | <P> | |
449 | </LI> | |
450 | <LI>The <tt class="function">urlopen()</tt> and <tt class="function">urlretrieve()</tt> functions can | |
451 | cause arbitrarily long delays while waiting for a network connection | |
452 | to be set up. This means that it is difficult to build an interactive | |
453 | Web client using these functions without using threads. | |
454 | ||
455 | <P> | |
456 | </LI> | |
457 | <LI>The data returned by <tt class="function">urlopen()</tt> or <tt class="function">urlretrieve()</tt> | |
458 | is the raw data returned by the server. This may be binary data | |
459 | (e.g. an image), plain text or (for example) HTML<a id='l2h-3208' xml:id='l2h-3208'></a>. The | |
460 | HTTP<a id='l2h-3200' xml:id='l2h-3200'></a> protocol provides type information in the | |
461 | reply header, which can be inspected by looking at the | |
462 | <span class="mailheader">Content-Type:</span> header. For the | |
463 | Gopher<a id='l2h-3201' xml:id='l2h-3201'></a> protocol, type information is encoded | |
464 | in the URL; there is currently no easy way to extract it. If the | |
465 | returned data is HTML, you can use the module | |
466 | <tt class="module"><a href="module-htmllib.html">htmllib</a></tt><a id='l2h-3209' xml:id='l2h-3209'></a> to parse it. | |
467 | ||
468 | <P> | |
469 | </LI> | |
470 | <LI>The code handling the FTP<a id='l2h-3210' xml:id='l2h-3210'></a> protocol cannot differentiate | |
471 | between a file and a directory. This can lead to unexpected behavior | |
472 | when attempting to read a URL that points to a file that is not | |
473 | accessible. If the URL ends in a <code>/</code>, it is assumed to refer to | |
474 | a directory and will be handled accordingly. But if an attempt to | |
475 | read a file leads to a 550 error (meaning the URL cannot be found or | |
476 | is not accessible, often for permission reasons), then the path is | |
477 | treated as a directory in order to handle the case when a directory is | |
478 | specified by a URL but the trailing <code>/</code> has been left off. This can | |
479 | cause misleading results when you try to fetch a file whose read | |
480 | permissions make it inaccessible; the FTP code will try to read it, | |
481 | fail with a 550 error, and then perform a directory listing for the | |
482 | unreadable file. If fine-grained control is needed, consider using the | |
483 | <tt class="module">ftplib</tt> module, subclassing <tt class="class">FancyURLopener</tt>, or changing | |
484 | <var>_urlopener</var> to meet your needs. | |
485 | ||
486 | <P> | |
487 | </LI> | |
488 | <LI>This module does not support the use of proxies which require | |
489 | authentication. This may be implemented in the future. | |
490 | ||
491 | <P> | |
492 | </LI> | |
493 | <LI>Although the <tt class="module">urllib</tt> module contains (undocumented) routines | |
494 | to parse and unparse URL strings, the recommended interface for URL | |
495 | manipulation is in module <tt class="module"><a href="module-urlparse.html">urlparse</a></tt><a id='l2h-3211' xml:id='l2h-3211'></a>. | |
496 | ||
497 | <P> | |
498 | </LI> | |
499 | </UL> | |
500 | ||
501 | <P> | |
502 | ||
503 | <p><br /></p><hr class='online-navigation' /> | |
504 | <div class='online-navigation'> | |
505 | <!--Table of Child-Links--> | |
506 | <A NAME="CHILD_LINKS"><STRONG>Subsections</STRONG></a> | |
507 | ||
508 | <UL CLASS="ChildLinks"> | |
509 | <LI><A href="urlopener-objs.html">11.4.1 URLopener Objects</a> | |
510 | <LI><A href="node483.html">11.4.2 Examples</a> | |
511 | </ul> | |
512 | <!--End of Table of Child-Links--> | |
513 | </div> | |
514 | ||
515 | <DIV CLASS="navigation"> | |
516 | <div class='online-navigation'> | |
517 | <p></p><hr /> | |
547 | </div> | |
548 | <hr /> | |
549 | <span class="release-info">Release 2.4.2, documentation updated on 28 September 2005.</span> | |
550 | </DIV> | |
551 | <!--End of Navigation Panel--> | |
552 | <ADDRESS> | |
553 | See <i><a href="about.html">About this document...</a></i> for information on suggesting changes. | |
554 | </ADDRESS> | |
555 | </BODY> | |
556 | </HTML> |