:mod:`urllib` --- Open arbitrary resources by URL
=================================================

.. module:: urllib
   :synopsis: Open an arbitrary network resource by URL (requires sockets).

.. note::
   The :mod:`urllib` module has been split into parts and renamed in
   Python 3 to :mod:`urllib.request`, :mod:`urllib.parse`,
   and :mod:`urllib.error`. The :term:`2to3` tool will automatically adapt
   imports when converting your sources to Python 3.
   Also note that the :func:`urllib.request.urlopen` function in Python 3 is
   equivalent to :func:`urllib2.urlopen` and that :func:`urllib.urlopen` has
   been removed.

.. index::
   single: WWW
   single: World Wide Web
   single: URL

This module provides a high-level interface for fetching data across the World
Wide Web.  In particular, the :func:`urlopen` function is similar to the
built-in function :func:`open`, but accepts Universal Resource Locators (URLs)
instead of filenames.  Some restrictions apply --- it can only open URLs for
reading, and no seek operations are available.

.. seealso::

   The `Requests package <http://docs.python-requests.org/>`_
   is recommended for a higher-level HTTP client interface.

.. versionchanged:: 2.7.9

   For HTTPS URIs, :mod:`urllib` performs all the necessary certificate and
   hostname checks by default.

.. warning::

   For Python versions earlier than 2.7.9, urllib does not attempt to validate
   the server certificates of HTTPS URIs. Use at your own risk!


High-level interface
--------------------

.. function:: urlopen(url[, data[, proxies[, context]]])

   Open a network object denoted by a URL for reading.  If the URL does not
   have a scheme identifier, or if it has :file:`file:` as its scheme
   identifier, this opens a local file (without :term:`universal newlines`);
   otherwise it opens a socket to a server somewhere on the network.
   If the connection cannot be made the :exc:`IOError` exception is raised.  If all
   went well, a file-like object is returned.  This supports the following
   methods: :meth:`read`, :meth:`readline`, :meth:`readlines`, :meth:`fileno`,
   :meth:`close`, :meth:`info`, :meth:`getcode` and :meth:`geturl`.  It also
   has proper support for the :term:`iterator` protocol. One caveat: the
   :meth:`read` method, if the size argument is omitted or negative, may not
   read until the end of the data stream; there is no good way to determine
   that the entire stream from a socket has been read in the general case.

   Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods,
   these methods have the same interface as for file objects --- see section
   :ref:`bltin-file-objects` in this manual.  (It is not a built-in file object,
   however, so it can't be used at those few places where a true built-in file
   object is required.)

   .. index:: module: mimetools

   The :meth:`info` method returns an instance of the class
   :class:`mimetools.Message` containing meta-information associated with the
   URL.  When the method is HTTP, these headers are those returned by the server
   at the head of the retrieved HTML page (including Content-Length and
   Content-Type).  When the method is FTP, a Content-Length header will be
   present if (as is now usual) the server passed back a file length in response
   to the FTP retrieval request. A Content-Type header will be present if the
   MIME type can be guessed.  When the method is local-file, returned headers
   will include a Date representing the file's last-modified time, a
   Content-Length giving file size, and a Content-Type containing a guess at the
   file's type. See also the description of the :mod:`mimetools` module.

   The :meth:`geturl` method returns the real URL of the page.  In some cases, the
   HTTP server redirects a client to another URL.
   The :func:`urlopen` function
   handles this transparently, but in some cases the caller needs to know which URL
   the client was redirected to.  The :meth:`geturl` method can be used to get at
   this redirected URL.

   The :meth:`getcode` method returns the HTTP status code that was sent with the
   response, or ``None`` if the URL is not an HTTP URL.

   If the *url* uses the :file:`http:` scheme identifier, the optional *data*
   argument may be given to specify a ``POST`` request (normally the request type
   is ``GET``).  The *data* argument must be in standard
   :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
   function below.

   The :func:`urlopen` function works transparently with proxies which do not
   require authentication.  In a Unix or Windows environment, set the
   :envvar:`http_proxy` or :envvar:`ftp_proxy` environment variable to a URL that
   identifies the proxy server before starting the Python interpreter.  For example
   (the ``'%'`` is the command prompt)::

      % http_proxy="http://www.someproxy.com:3128"
      % export http_proxy
      % python
      ...

   The :envvar:`no_proxy` environment variable can be used to specify hosts which
   shouldn't be reached via proxy; if set, it should be a comma-separated list
   of hostname suffixes, optionally with ``:port`` appended, for example
   ``cern.ch,ncsa.uiuc.edu,some.host:8080``.

   In a Windows environment, if no proxy environment variables are set, proxy
   settings are obtained from the registry's Internet Settings section.

   .. index:: single: Internet Config

   In a Mac OS X environment, :func:`urlopen` will retrieve proxy information
   from the OS X System Configuration Framework, which can be managed with
   the Network System Preferences panel.


   Alternatively, the optional *proxies* argument may be used to explicitly specify
   proxies.
   It must be a dictionary mapping scheme names to proxy URLs, where an
   empty dictionary causes no proxies to be used, and ``None`` (the default value)
   causes environmental proxy settings to be used as discussed above.  For
   example::

      # Use http://www.someproxy.com:3128 for HTTP proxying
      proxies = {'http': 'http://www.someproxy.com:3128'}
      filehandle = urllib.urlopen(some_url, proxies=proxies)
      # Don't use any proxies
      filehandle = urllib.urlopen(some_url, proxies={})
      # Use proxies from environment - both versions are equivalent
      filehandle = urllib.urlopen(some_url, proxies=None)
      filehandle = urllib.urlopen(some_url)

   Proxies which require authentication for use are not currently supported;
   this is considered an implementation limitation.

   The *context* parameter may be set to a :class:`ssl.SSLContext` instance to
   configure the SSL settings that are used if :func:`urlopen` makes an HTTPS
   connection.

   .. versionchanged:: 2.3
      Added the *proxies* support.

   .. versionchanged:: 2.6
      Added :meth:`getcode` to returned object and support for the
      :envvar:`no_proxy` environment variable.

   .. versionchanged:: 2.7.9
      The *context* parameter was added. All the necessary certificate and
      hostname checks are done by default.

   .. deprecated:: 2.6
      The :func:`urlopen` function has been removed in Python 3 in favor
      of :func:`urllib2.urlopen`.


.. function:: urlretrieve(url[, filename[, reporthook[, data]]])

   Copy a network object denoted by a URL to a local file, if necessary. If the URL
   points to a local file, or a valid cached copy of the object exists, the object
   is not copied.  Return a tuple ``(filename, headers)`` where *filename* is the
   local file name under which the object can be found, and *headers* is whatever
   the :meth:`info` method of the object returned by :func:`urlopen` returned (for
   a remote object, possibly cached).
   Exceptions are the same as for
   :func:`urlopen`.

   The second argument, if present, specifies the file location to copy to (if
   absent, the location will be a tempfile with a generated name). The third
   argument, if present, is a callable that will be called once on
   establishment of the network connection and once after each block read
   thereafter.  The callable will be passed three arguments: a count of blocks
   transferred so far, a block size in bytes, and the total size of the file.  The
   third argument may be ``-1`` on older FTP servers which do not return a file
   size in response to a retrieval request.

   If the *url* uses the :file:`http:` scheme identifier, the optional *data*
   argument may be given to specify a ``POST`` request (normally the request type
   is ``GET``).  The *data* argument must be in standard
   :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
   function below.

   .. versionchanged:: 2.5
      :func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
      the amount of data available was less than the expected amount (which is the
      size reported by a *Content-Length* header). This can occur, for example, when
      the download is interrupted.

      The *Content-Length* is treated as a lower bound: if there's more data to read,
      :func:`urlretrieve` reads more data, but if less data is available, it raises
      the exception.

      You can still retrieve the downloaded data in this case; it is stored in the
      :attr:`content` attribute of the exception instance.

      If no *Content-Length* header was supplied, :func:`urlretrieve` cannot check
      the size of the data it has downloaded, and just returns it.  In this case you
      just have to assume that the download was successful.


.. data:: _urlopener

   The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
   of the :class:`FancyURLopener` class and use it to perform their requested
   actions.  To override this functionality, programmers can create a subclass of
   :class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
   class to the ``urllib._urlopener`` variable before calling the desired function.
   For example, applications may want to specify a different
   :mailheader:`User-Agent` header than :class:`URLopener` defines.  This can be
   accomplished with the following code::

      import urllib

      class AppURLopener(urllib.FancyURLopener):
          version = "App/1.7"

      urllib._urlopener = AppURLopener()


.. function:: urlcleanup()

   Clear the cache that may have been built up by previous calls to
   :func:`urlretrieve`.


Utility functions
-----------------

.. function:: quote(string[, safe])

   Replace special characters in *string* using the ``%xx`` escape. Letters,
   digits, and the characters ``'_.-'`` are never quoted. By default, this
   function is intended for quoting the path section of the URL. The optional
   *safe* parameter specifies additional characters that should not be quoted
   --- its default value is ``'/'``.

   Example: ``quote('/~connolly/')`` yields ``'/%7Econnolly/'``.


.. function:: quote_plus(string[, safe])

   Like :func:`quote`, but also replaces spaces by plus signs, as required for
   quoting HTML form values when building up a query string to go into a URL.
   Plus signs in the original string are escaped unless they are included in
   *safe*.  It also does not have *safe* default to ``'/'``.


.. function:: unquote(string)

   Replace ``%xx`` escapes by their single-character equivalent.

   Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.


.. function:: unquote_plus(string)

   Like :func:`unquote`, but also replaces plus signs by spaces, as required for
   unquoting HTML form values.


.. function:: urlencode(query[, doseq])

   Convert a mapping object or a sequence of two-element tuples to a
   "percent-encoded" string, suitable to pass to :func:`urlopen` above as the
   optional *data* argument.  This is useful to pass a dictionary of form
   fields to a ``POST`` request.  The resulting string is a series of
   ``key=value`` pairs separated by ``'&'`` characters, where both *key* and
   *value* are quoted using :func:`quote_plus` above.  When a sequence of
   two-element tuples is used as the *query* argument, the first element of
   each tuple is a key and the second is a value. The value element itself
   can be a sequence; in that case, if the optional parameter *doseq*
   evaluates to ``True``, individual ``key=value`` pairs separated by ``'&'`` are
   generated for each element of the value sequence for the key.  The order of
   parameters in the encoded string will match the order of parameter tuples in
   the sequence. The :mod:`urlparse` module provides the functions
   :func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
   into Python data structures.


.. function:: pathname2url(path)

   Convert the pathname *path* from the local syntax for a path to the form used in
   the path component of a URL.  This does not produce a complete URL.  The return
   value will already be quoted using the :func:`quote` function.


.. function:: url2pathname(path)

   Convert the path component *path* from a percent-encoded URL to the local syntax for a
   path.  This does not accept a complete URL.  This function uses :func:`unquote`
   to decode *path*.


.. function:: getproxies()

   This helper function returns a dictionary of scheme to proxy server URL
   mappings.
   It scans the environment for variables named ``<scheme>_proxy``,
   in a case-insensitive way, on all operating systems first. When it cannot
   find any, it looks for proxy information in the Mac OS X System Configuration
   on Mac OS X and in the Windows Systems Registry on Windows.
   If both lowercase and uppercase environment variables exist (and disagree),
   lowercase is preferred.

   .. note::

      If the environment variable ``REQUEST_METHOD`` is set, which usually
      indicates your script is running in a CGI environment, the environment
      variable ``HTTP_PROXY`` (uppercase ``_PROXY``) will be ignored. This is
      because that variable can be injected by a client using the "Proxy:" HTTP
      header. If you need to use an HTTP proxy in a CGI environment, either use
      ``ProxyHandler`` explicitly, or make sure the variable name is in
      lowercase (or at least the ``_proxy`` suffix).

.. note::
   urllib also exposes certain utility functions like splittype, splithost and
   others that parse a URL into its components. It is recommended to use
   :mod:`urlparse` for parsing URLs rather than using these functions directly.
   Python 3 does not expose these helper functions from the :mod:`urllib.parse`
   module.


URL Opener objects
------------------

.. class:: URLopener([proxies[, context[, **x509]]])

   Base class for opening and reading URLs.  Unless you need to support opening
   objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
   you probably want to use :class:`FancyURLopener`.

   By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
   of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number.
   Applications can define their own :mailheader:`User-Agent` header by subclassing
   :class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
   :attr:`version` to an appropriate string value in the subclass definition.

   The optional *proxies* parameter should be a dictionary mapping scheme names to
   proxy URLs, where an empty dictionary turns proxies off completely.  Its default
   value is ``None``, in which case environmental proxy settings will be used if
   present, as discussed in the definition of :func:`urlopen`, above.

   The *context* parameter may be a :class:`ssl.SSLContext` instance.  If given,
   it defines the SSL settings the opener uses to make HTTPS connections.

   Additional keyword parameters, collected in *x509*, may be used for
   authentication of the client when using the :file:`https:` scheme.  The keywords
   *key_file* and *cert_file* are supported to provide an SSL key and certificate;
   both are needed to support client authentication.

   :class:`URLopener` objects will raise an :exc:`IOError` exception if the server
   returns an error code.

   .. method:: open(fullurl[, data])

      Open *fullurl* using the appropriate protocol.  This method sets up cache and
      proxy information, then calls the appropriate open method with its input
      arguments.  If the scheme is not recognized, :meth:`open_unknown` is called.
      The *data* argument has the same meaning as the *data* argument of
      :func:`urlopen`.


   .. method:: open_unknown(fullurl[, data])

      Overridable interface to open unknown URL types.


   .. method:: retrieve(url[, filename[, reporthook[, data]]])

      Retrieves the contents of *url* and places it in *filename*.  The return value
      is a tuple consisting of a local filename and either a
      :class:`mimetools.Message` object containing the response headers (for remote
      URLs) or ``None`` (for local URLs).  The caller must then open and read the
      contents of *filename*.  If *filename* is not given and the URL refers to a
      local file, the input filename is returned.
      If the URL is non-local and
      *filename* is not given, the filename is the output of :func:`tempfile.mktemp`
      with a suffix that matches the suffix of the last path component of the input
      URL.  If *reporthook* is given, it must be a function accepting three numeric
      parameters.  It will be called after each chunk of data is read from the
      network.  *reporthook* is ignored for local URLs.

      If the *url* uses the :file:`http:` scheme identifier, the optional *data*
      argument may be given to specify a ``POST`` request (normally the request type
      is ``GET``).  The *data* argument must be in standard
      :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
      function below.


   .. attribute:: version

      Variable that specifies the user agent of the opener object.  To get
      :mod:`urllib` to tell servers that it is a particular user agent, set this in a
      subclass as a class variable or in the constructor before calling the base
      constructor.


.. class:: FancyURLopener(...)

   :class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
   for the following HTTP response codes: 301, 302, 303, 307 and 401.  For the 30x
   response codes listed above, the :mailheader:`Location` header is used to fetch
   the actual URL.  For 401 response codes (authentication required), basic HTTP
   authentication is performed.  For the 30x response codes, recursion is bounded
   by the value of the *maxtries* attribute, which defaults to 10.

   For all other response codes, the method :meth:`http_error_default` is called
   which you can override in subclasses to handle the error appropriately.

   .. note::

      According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
      must not be automatically redirected without confirmation by the user.
      In
      reality, browsers do allow automatic redirection of these responses, changing
      the POST to a GET, and :mod:`urllib` reproduces this behaviour.

   The parameters to the constructor are the same as those for :class:`URLopener`.

   .. note::

      When performing basic authentication, a :class:`FancyURLopener` instance calls
      its :meth:`prompt_user_passwd` method.  The default implementation asks the
      users for the required information on the controlling terminal.  A subclass may
      override this method to support more appropriate behavior if needed.

   The :class:`FancyURLopener` class offers one additional method that should be
   overloaded to provide the appropriate behavior:

   .. method:: prompt_user_passwd(host, realm)

      Return information needed to authenticate the user at the given host in the
      specified security realm.  The return value should be a tuple, ``(user,
      password)``, which can be used for basic authentication.

      The implementation prompts for this information on the terminal; an application
      should override this method to use an appropriate interaction model in the local
      environment.

.. exception:: ContentTooShortError(msg[, content])

   This exception is raised when the :func:`urlretrieve` function detects that the
   amount of the downloaded data is less than the expected amount (given by the
   *Content-Length* header). The :attr:`content` attribute stores the downloaded
   (and supposedly truncated) data.

   .. versionadded:: 2.5


:mod:`urllib` Restrictions
--------------------------

.. index::
   pair: HTTP; protocol
   pair: FTP; protocol

* Currently, only the following protocols are supported: HTTP (versions 0.9 and
  1.0), FTP, and local files.

* The caching feature of :func:`urlretrieve` has been disabled until I find the
  time to hack proper processing of Expiration time headers.

* There should be a function to query whether a particular URL is in the cache.

* For backward compatibility, if a URL appears to point to a local file but the
  file can't be opened, the URL is re-interpreted using the FTP protocol.  This
  can sometimes cause confusing error messages.

* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
  long delays while waiting for a network connection to be set up.  This means
  that it is difficult to build an interactive Web client using these functions
  without using threads.

  .. index::
     single: HTML
     pair: HTTP; protocol
     module: htmllib

* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
  returned by the server.  This may be binary data (such as an image), plain text
  or (for example) HTML.  The HTTP protocol provides type information in the reply
  header, which can be inspected by looking at the :mailheader:`Content-Type`
  header.  If the returned data is HTML, you can use the module :mod:`htmllib` to
  parse it.

  .. index:: single: FTP

* The code handling the FTP protocol cannot differentiate between a file and a
  directory.  This can lead to unexpected behavior when attempting to read a URL
  that points to a file that is not accessible.  If the URL ends in a ``/``, it is
  assumed to refer to a directory and will be handled accordingly.  But if an
  attempt to read a file leads to a 550 error (meaning the URL cannot be found or
  is not accessible, often for permission reasons), then the path is treated as a
  directory in order to handle the case when a directory is specified by a URL but
  the trailing ``/`` has been left off.  This can cause misleading results when
  you try to fetch a file whose read permissions make it inaccessible; the FTP
  code will try to read it, fail with a 550 error, and then perform a directory
  listing for the unreadable file.
  If fine-grained control is needed, consider
  using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
  *_urlopener* to meet your needs.

* This module does not support the use of proxies which require authentication.
  This may be implemented in the future.

  .. index:: module: urlparse

* Although the :mod:`urllib` module contains (undocumented) routines to parse
  and unparse URL strings, the recommended interface for URL manipulation is in
  module :mod:`urlparse`.


.. _urllib-examples:

Examples
--------

Here is an example session that uses the ``GET`` method to retrieve a URL
containing parameters::

   >>> import urllib
   >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
   >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
   >>> print f.read()

The following example uses the ``POST`` method instead::

   >>> import urllib
   >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
   >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
   >>> print f.read()

The following example uses an explicitly specified HTTP proxy, overriding
environment settings::

   >>> import urllib
   >>> proxies = {'http': 'http://proxy.example.com:8080/'}
   >>> opener = urllib.FancyURLopener(proxies)
   >>> f = opener.open("http://www.python.org")
   >>> f.read()

The following example uses no proxies at all, overriding environment settings::

   >>> import urllib
   >>> opener = urllib.FancyURLopener({})
   >>> f = opener.open("http://www.python.org/")
   >>> f.read()
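
The quoting helpers from the Utility functions section (:func:`quote`,
:func:`quote_plus`, :func:`unquote_plus` and :func:`urlencode`) can be tried
without a network connection.  The following is a minimal sketch rather than
an official example: the ``try``/``except`` import is an assumption added so
that the same lines also run on Python 3, where these functions live in
:mod:`urllib.parse`, and the sample strings are made up for illustration.

```python
# Sketch only: the fallback import is an assumption so the snippet runs on
# both Python 2 (urllib) and Python 3 (urllib.parse).
try:
    from urllib import quote, quote_plus, unquote_plus, urlencode
except ImportError:
    from urllib.parse import quote, quote_plus, unquote_plus, urlencode

# quote() is meant for the path part of a URL: '/' is safe by default,
# while a space becomes %20.
path = quote('/cgi bin/query')

# quote_plus() is meant for form values: spaces become '+', and '/' is
# escaped because safe does not default to '/'.
form_value = quote_plus('spam & eggs/ham')

# A sequence of two-element tuples keeps its order.  With doseq=True the
# list value is expanded into one key=value pair per element.
query = urlencode([('spam', 1), ('toppings', ['eggs', 'bacon'])], doseq=True)

# unquote_plus() reverses quote_plus().
round_trip = unquote_plus(form_value)
```

Passing a sequence of tuples instead of a dictionary is a deliberate choice
here, since the order of the encoded parameters then follows the order of the
tuples.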