.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.



Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

      A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ``":"`` in the URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    with urllib.request.urlopen('http://python.org/') as response:
        html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary
location, you can do so via the :func:`shutil.copyfileobj` and
:func:`tempfile.NamedTemporaryFile` functions::

    import shutil
    import tempfile
    import urllib.request

    with urllib.request.urlopen('http://python.org/') as response:
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            shutil.copyfileobj(response, tmp_file)

    with open(tmp_file.name) as html:
        pass

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::

    import urllib.request

    req = urllib.request.Request('http://www.voidspace.org.uk')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()
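Bear in mind that ``read()`` gives you :class:`bytes`, not text. To get a
string you must decode it yourself, ideally using the character set the server
declared. A minimal sketch (the fallback to UTF-8 when no charset is declared
is an assumption of this example, not something the server guarantees)::

    import urllib.request

    with urllib.request.urlopen('http://www.voidspace.org.uk') as response:
        raw = response.read()   # bytes, not str
        # response.headers is an http.client.HTTPMessage; the declared
        # charset may be missing, so fall back to UTF-8
        charset = response.headers.get_content_charset() or 'utf-8'
        text = raw.decode(charset)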
Note that urllib.request makes use of the same Request interface to handle all URL
schemes.  For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server.  Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers".  Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the :mod:`urllib.parse`
library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')  # data should be bytes
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.
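One detail worth knowing: if a field has several values, ``urlencode`` treats a
sequence as a single value unless you pass ``doseq=True``, which expands it into
repeated parameters. A quick sketch (the field name is just an example)::

    >>> import urllib.parse
    >>> urllib.parse.urlencode({'language': ['Python', 'C']}, doseq=True)
    'language=Python&language=C'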
Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python' }
    headers = {'User-Agent': user_agent}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')
    req = urllib.request.Request(url, data, headers)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.


Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...     print(e.reason)      #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.
Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }
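In practice you rarely need to reproduce this table yourself; the standard
library ships a ready-made lookup of reason phrases. A minimal sketch using
``http.client.responses``, a plain mapping of status code to phrase::

    >>> import http.client
    >>> http.client.responses[404]
    'Not Found'
    >>> http.client.responses[503]
    'Service Unavailable'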
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response for
the page returned. This means that as well as the code attribute, it also has
read, geturl, and info methods, as returned by the ``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
      ...
      <title>Page Not Found</title>\n
      ...

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        # everything is fine
        pass


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
    else:
        # everything is fine
        pass


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods :meth:`info` and :meth:`geturl` and is defined in the module
:mod:`urllib.response`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://jkorpela.fi/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call. ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
URLs in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
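As an example, here is a minimal sketch of an opener with cookie handling
installed, built with ``build_opener`` (the URL is a placeholder; the relevant
stdlib pieces are ``http.cookiejar.CookieJar`` and
``urllib.request.HTTPCookieProcessor``)::

    import http.cookiejar
    import urllib.request

    # a CookieJar stores cookies sent by the server between requests
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))

    # use the opener directly ...
    with opener.open('http://www.example.com/') as response:
        the_page = response.read()

    # ... or install it so that plain calls to urlopen use it too
    urllib.request.install_opener(opener)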
Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works -- see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication. This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.

e.g.

.. code-block:: none

    WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use
an ``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. These will be supplied
unless you provide an alternative combination for a specific realm. We indicate
this by providing ``None`` as the realm argument to the ``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to .add_password() will also match. ::

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
    environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. ``"example.com"`` or
``"example.com:8080"``. The authority, if present, must NOT contain the
"userinfo" component - for example ``"joe:password@example.com"`` is
not correct.
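To make the two accepted forms concrete, a short sketch (the username and
password are hypothetical placeholders)::

    # a full URL: scheme, hostname and (optionally) a port
    password_mgr.add_password(None, "http://example.com:8080/", username, password)

    # a bare authority: hostname and (optionally) a port; no scheme, no userinfo
    password_mgr.add_password(None, "example.com:8080", username, password)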
Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected. Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to disable automatic proxy handling
is to set up our own ``ProxyHandler``, with no proxies defined. This is done
using similar steps to setting up a `Basic Authentication`_ handler: ::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

.. note::

    Currently ``urllib.request`` *does not* support fetching of ``https`` locations
    through a proxy. However, this can be enabled by extending urllib.request as
    shown in the recipe [#]_.

.. note::

    ``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
    the documentation on :func:`~urllib.request.getproxies`.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. You can pass
a *timeout* argument to ``urlopen`` for an individual request; you can also set
the default timeout globally for all sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)
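The per-request form looks like this; a minimal sketch (the URL and the
ten-second value are only examples)::

    import urllib.request

    # this timeout applies to this call only; the global default is untouched
    with urllib.request.urlopen('http://www.voidspace.org.uk',
                                timeout=10) as response:
        the_page = response.read()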
-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <https://code.activestate.com/recipes/456195/>`_.