:mod:`urllib` --- Open arbitrary resources by URL
=================================================

.. module:: urllib
   :synopsis: Open an arbitrary network resource by URL (requires sockets).

.. note::
    The :mod:`urllib` module has been split into parts and renamed in
    Python 3 to :mod:`urllib.request`, :mod:`urllib.parse`,
    and :mod:`urllib.error`. The :term:`2to3` tool will automatically adapt
    imports when converting your sources to Python 3.
    Also note that the :func:`urllib.request.urlopen` function in Python 3 is
    equivalent to :func:`urllib2.urlopen` and that :func:`urllib.urlopen` has
    been removed.

.. index::
   single: WWW
   single: World Wide Web
   single: URL

This module provides a high-level interface for fetching data across the World
Wide Web.  In particular, the :func:`urlopen` function is similar to the
built-in function :func:`open`, but accepts Uniform Resource Locators (URLs)
instead of filenames.  Some restrictions apply --- it can only open URLs for
reading, and no seek operations are available.

.. seealso::

    The `Requests package <http://docs.python-requests.org/>`_
    is recommended for a higher-level HTTP client interface.

.. versionchanged:: 2.7.9

    For HTTPS URIs, :mod:`urllib` performs all the necessary certificate and
    hostname checks by default.

.. warning::

    For Python versions earlier than 2.7.9, urllib does not attempt to validate
    the server certificates of HTTPS URIs. Use at your own risk!


High-level interface
--------------------

.. function:: urlopen(url[, data[, proxies[, context]]])

   Open a network object denoted by a URL for reading.  If the URL does not
   have a scheme identifier, or if it has :file:`file:` as its scheme
   identifier, this opens a local file (without :term:`universal newlines`);
   otherwise it opens a socket to a server somewhere on the network.  If the
   connection cannot be made, the :exc:`IOError` exception is raised.  If all
   went well, a file-like object is returned.  This supports the following
   methods: :meth:`read`, :meth:`readline`, :meth:`readlines`, :meth:`fileno`,
   :meth:`close`, :meth:`info`, :meth:`getcode` and :meth:`geturl`.  It also
   has proper support for the :term:`iterator` protocol. One caveat: the
   :meth:`read` method, if the size argument is omitted or negative, may not
   read until the end of the data stream; there is no good way to determine
   that the entire stream from a socket has been read in the general case.

   Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods,
   these methods have the same interface as for file objects --- see section
   :ref:`bltin-file-objects` in this manual.  (It is not a built-in file object,
   however, so it can't be used at those few places where a true built-in file
   object is required.)

   .. index:: module: mimetools

   The :meth:`info` method returns an instance of the class
   :class:`mimetools.Message` containing meta-information associated with the
   URL.  When the scheme is HTTP, these headers are those returned by the server
   at the head of the retrieved HTML page (including Content-Length and
   Content-Type).  When the scheme is FTP, a Content-Length header will be
   present if (as is now usual) the server passed back a file length in response
   to the FTP retrieval request. A Content-Type header will be present if the
   MIME type can be guessed.  For a local file, returned headers
   will include a Date representing the file's last-modified time, a
   Content-Length giving file size, and a Content-Type containing a guess at the
   file's type. See also the description of the :mod:`mimetools` module.

   The :meth:`geturl` method returns the real URL of the page.  In some cases, the
   HTTP server redirects a client to another URL.  The :func:`urlopen` function
   handles this transparently, but in some cases the caller needs to know which URL
   the client was redirected to.  The :meth:`geturl` method can be used to get at
   this redirected URL.

   The :meth:`getcode` method returns the HTTP status code that was sent with the
   response, or ``None`` if the URL is not an HTTP URL.

   If the *url* uses the :file:`http:` scheme identifier, the optional *data*
   argument may be given to specify a ``POST`` request (normally the request type
   is ``GET``).  The *data* argument must be in standard
   :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
   function below.

   The :func:`urlopen` function works transparently with proxies which do not
   require authentication.  In a Unix or Windows environment, set the
   :envvar:`http_proxy` or :envvar:`ftp_proxy` environment variables to a URL that
   identifies the proxy server before starting the Python interpreter.  For example
   (the ``'%'`` is the command prompt)::

      % http_proxy="http://www.someproxy.com:3128"
      % export http_proxy
      % python
      ...

   The :envvar:`no_proxy` environment variable can be used to specify hosts which
   shouldn't be reached via proxy; if set, it should be a comma-separated list
   of hostname suffixes, optionally with ``:port`` appended, for example
   ``cern.ch,ncsa.uiuc.edu,some.host:8080``.

   In a Windows environment, if no proxy environment variables are set, proxy
   settings are obtained from the registry's Internet Settings section.

   .. index:: single: Internet Config

   In a Mac OS X environment, :func:`urlopen` will retrieve proxy information
   from the OS X System Configuration Framework, which can be managed with the
   Network System Preferences panel.

   Alternatively, the optional *proxies* argument may be used to explicitly specify
   proxies.  It must be a dictionary mapping scheme names to proxy URLs, where an
   empty dictionary causes no proxies to be used, and ``None`` (the default value)
   causes environmental proxy settings to be used as discussed above.  For
   example::

      # Use http://www.someproxy.com:3128 for HTTP proxying
      proxies = {'http': 'http://www.someproxy.com:3128'}
      filehandle = urllib.urlopen(some_url, proxies=proxies)
      # Don't use any proxies
      filehandle = urllib.urlopen(some_url, proxies={})
      # Use proxies from environment - both versions are equivalent
      filehandle = urllib.urlopen(some_url, proxies=None)
      filehandle = urllib.urlopen(some_url)

   Proxies which require authentication for use are not currently supported;
   this is considered an implementation limitation.

   The *context* parameter may be set to a :class:`ssl.SSLContext` instance to
   configure the SSL settings that are used if :func:`urlopen` makes an HTTPS
   connection.

   .. versionchanged:: 2.3
      Added the *proxies* support.

   .. versionchanged:: 2.6
      Added :meth:`getcode` to returned object and support for the
      :envvar:`no_proxy` environment variable.

   .. versionchanged:: 2.7.9
      The *context* parameter was added. All the necessary certificate and
      hostname checks are done by default.

   .. deprecated:: 2.6
      The :func:`urlopen` function has been removed in Python 3 in favor
      of :func:`urllib2.urlopen`.

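As a minimal sketch of the object returned by :func:`urlopen`, the snippet
below opens a local file through a ``file:`` URL and calls the extra methods
described above. The temporary file and its contents are made up for
illustration, and the ``try``/``except`` import is an assumption added so that
the same code also runs on Python 3, where these functions live in
:mod:`urllib.request`.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url  # Python 2 (the module documented here)
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3 location

# Hypothetical sample file, so the sketch needs no network access.
fd, path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"first line\nsecond line\n")
os.close(fd)

f = urlopen("file:" + pathname2url(path))
try:
    headers = f.info()      # meta-information such as Content-Length
    final_url = f.geturl()  # URL that was actually opened
    status = f.getcode()    # HTTP status code, or None for a non-HTTP URL
    lines = f.readlines()   # same interface as a file object
finally:
    f.close()
    os.remove(path)
```

For a local file, ``status`` is ``None`` and ``headers["Content-Length"]``
holds the file size as a string.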

.. function:: urlretrieve(url[, filename[, reporthook[, data]]])

   Copy a network object denoted by a URL to a local file, if necessary. If the URL
   points to a local file, or a valid cached copy of the object exists, the object
   is not copied.  Return a tuple ``(filename, headers)`` where *filename* is the
   local file name under which the object can be found, and *headers* is whatever
   the :meth:`info` method of the object returned by :func:`urlopen` returned (for
   a remote object, possibly cached). Exceptions are the same as for
   :func:`urlopen`.

   The second argument, if present, specifies the file location to copy to (if
   absent, the location will be a tempfile with a generated name). The third
   argument, if present, is a callable that will be called once on
   establishment of the network connection and once after each block read
   thereafter.  The callable will be passed three arguments: a count of blocks
   transferred so far, a block size in bytes, and the total size of the file.  The
   total size may be ``-1`` on older FTP servers which do not return a file
   size in response to a retrieval request.

   If the *url* uses the :file:`http:` scheme identifier, the optional *data*
   argument may be given to specify a ``POST`` request (normally the request type
   is ``GET``).  The *data* argument must be in standard
   :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
   function below.

   .. versionchanged:: 2.5
      :func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
      the amount of data available was less than the expected amount (which is the
      size reported by a *Content-Length* header). This can occur, for example, when
      the download is interrupted.

      The *Content-Length* is treated as a lower bound: if there's more data to read,
      :func:`urlretrieve` reads more data, but if less data is available, it raises
      the exception.

      You can still retrieve the downloaded data in this case; it is stored in the
      :attr:`content` attribute of the exception instance.

      If no *Content-Length* header was supplied, :func:`urlretrieve` cannot check
      the size of the data it has downloaded, and just returns it.  In this case you
      just have to assume that the download was successful.

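The *reporthook* protocol described above can be sketched as follows. The
hook name ``report`` and the temporary files are hypothetical, a local
``file:`` URL stands in for a remote resource, and the import fallback is an
assumption added so the sketch also runs on Python 3.

```python
import os
import tempfile

try:
    from urllib import urlretrieve, pathname2url  # Python 2
except ImportError:
    from urllib.request import urlretrieve, pathname2url  # Python 3

progress = []  # one (block_count, block_size, total_size) tuple per call


def report(block_count, block_size, total_size):
    """Hypothetical progress hook: remember each call for later inspection."""
    progress.append((block_count, block_size, total_size))


# Source file standing in for a remote resource (no network needed).
fd, source = tempfile.mkstemp()
os.write(fd, b"x" * 20000)
os.close(fd)

# Destination file that urlretrieve() will overwrite.
fd, target = tempfile.mkstemp()
os.close(fd)

filename, headers = urlretrieve("file:" + pathname2url(source), target, report)
copied_size = os.path.getsize(target)

os.remove(source)
os.remove(target)
```

The first call to the hook has a block count of ``0``. A real progress
display would typically compute ``block_count * block_size`` and compare it
with ``total_size``, taking care of the ``-1`` case.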

.. data:: _urlopener

   The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
   of the :class:`FancyURLopener` class and use it to perform their requested
   actions.  To override this functionality, programmers can create a subclass of
   :class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
   class to the ``urllib._urlopener`` variable before calling the desired function.
   For example, applications may want to specify a different
   :mailheader:`User-Agent` header than :class:`URLopener` defines.  This can be
   accomplished with the following code::

      import urllib

      class AppURLopener(urllib.FancyURLopener):
          version = "App/1.7"

      urllib._urlopener = AppURLopener()


.. function:: urlcleanup()

   Clear the cache that may have been built up by previous calls to
   :func:`urlretrieve`.


Utility functions
-----------------

.. function:: quote(string[, safe])

   Replace special characters in *string* using the ``%xx`` escape. Letters,
   digits, and the characters ``'_.-'`` are never quoted. By default, this
   function is intended for quoting the path section of the URL. The optional
   *safe* parameter specifies additional characters that should not be quoted
   --- its default value is ``'/'``.

   Example: ``quote('/~connolly/')`` yields ``'/%7Econnolly/'``.

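A short sketch of how the *safe* parameter of :func:`quote` changes the
result. The sample strings are invented, and the import fallback is an
assumption so the snippet also runs on Python 3, where the function lives in
:mod:`urllib.parse`.

```python
try:
    from urllib import quote  # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3

# '/' is safe by default, so path separators survive.
path = quote("/docs/annual report.txt")

# With an empty safe string, '/' is escaped as well, which suits a
# single path component or a value that must not introduce new segments.
component = quote("a/b c", safe="")
```

Here ``path`` is ``'/docs/annual%20report.txt'`` and ``component`` is
``'a%2Fb%20c'``.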

.. function:: quote_plus(string[, safe])

   Like :func:`quote`, but also replaces spaces by plus signs, as required for
   quoting HTML form values when building up a query string to go into a URL.
   Plus signs in the original string are escaped unless they are included in
   *safe*.  It also does not have *safe* default to ``'/'``.


.. function:: unquote(string)

   Replace ``%xx`` escapes by their single-character equivalent.

   Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.


.. function:: unquote_plus(string)

   Like :func:`unquote`, but also replaces plus signs by spaces, as required for
   unquoting HTML form values.

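The difference between the plain and the ``_plus`` variants can be seen in a
small round trip. The form value below is made up, and the import fallback
is an assumption for Python 3 compatibility.

```python
try:
    from urllib import quote_plus, unquote, unquote_plus  # Python 2
except ImportError:
    from urllib.parse import quote_plus, unquote, unquote_plus  # Python 3

value = "fish & chips/1+1"

# Spaces become '+', while '&', '/' and a literal '+' are percent-escaped.
encoded = quote_plus(value)

# unquote_plus() reverses both steps and restores the original value.
decoded = unquote_plus(encoded)

# unquote() alone decodes %xx escapes but leaves '+' untouched.
literal_plus = unquote("a+b%20c")
```

Here ``encoded`` is ``'fish+%26+chips%2F1%2B1'`` and ``literal_plus`` is
``'a+b c'``.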

.. function:: urlencode(query[, doseq])

   Convert a mapping object or a sequence of two-element tuples to a
   "percent-encoded" string, suitable to pass to :func:`urlopen` above as the
   optional *data* argument.  This is useful to pass a dictionary of form
   fields to a ``POST`` request.  The resulting string is a series of
   ``key=value`` pairs separated by ``'&'`` characters, where both *key* and
   *value* are quoted using :func:`quote_plus` above.  When a sequence of
   two-element tuples is used as the *query* argument, the first element of
   each tuple is a key and the second is a value. The value element can itself
   be a sequence; in that case, if the optional parameter *doseq*
   evaluates to ``True``, individual ``key=value`` pairs separated by ``'&'`` are
   generated for each element of the value sequence for the key.  The order of
   parameters in the encoded string will match the order of parameter tuples in
   the sequence. The :mod:`urlparse` module provides the functions
   :func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
   into Python data structures.

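The effect of *doseq* is easiest to see side by side. This is a minimal
sketch with made-up field names; the import fallback is an assumption so that
it also runs on Python 3, where both functions are in :mod:`urllib.parse`.

```python
try:
    from urllib import urlencode  # Python 2
    from urlparse import parse_qs
except ImportError:
    from urllib.parse import urlencode, parse_qs  # Python 3

# A sequence of tuples keeps the parameters in a predictable order.
fields = [("spam", 1), ("eggs", [2, 3]), ("note", "fish & chips")]

# Without doseq, the list is converted with str() and quoted as one value.
plain = urlencode(fields)

# With doseq, each element of the list gets its own key=value pair.
expanded = urlencode(fields, doseq=True)

# parse_qs() turns the query string back into a dictionary of lists.
decoded = parse_qs(expanded)
```

Here ``plain`` is ``'spam=1&eggs=%5B2%2C+3%5D&note=fish+%26+chips'`` and
``expanded`` is ``'spam=1&eggs=2&eggs=3&note=fish+%26+chips'``.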

.. function:: pathname2url(path)

   Convert the pathname *path* from the local syntax for a path to the form used in
   the path component of a URL.  This does not produce a complete URL.  The return
   value will already be quoted using the :func:`quote` function.


.. function:: url2pathname(path)

   Convert the path component *path* from a percent-encoded URL to the local syntax for a
   path.  This does not accept a complete URL.  This function uses :func:`unquote`
   to decode *path*.

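The two conversions are intended to be inverses of each other, which the
following sketch checks with a hypothetical file name containing a space. The
exact form of ``url_path`` depends on the platform and Python version, so the
sketch only relies on the round trip. The import fallback is an assumption
for Python 3, where both functions are in :mod:`urllib.request`.

```python
import os
import tempfile

try:
    from urllib import pathname2url, url2pathname  # Python 2
except ImportError:
    from urllib.request import pathname2url, url2pathname  # Python 3

# Hypothetical local path; the file does not need to exist.
local_path = os.path.join(tempfile.gettempdir(), "my report.txt")

url_path = pathname2url(local_path)   # quoted, URL-style path component
round_trip = url2pathname(url_path)   # back to the local path syntax
```

The space in the file name appears as ``%20`` in ``url_path`` and is restored
in ``round_trip``.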

.. function:: getproxies()

   This helper function returns a dictionary of scheme to proxy server URL
   mappings. On all operating systems it first scans the environment for
   variables named ``<scheme>_proxy``, in a case-insensitive way. If it cannot
   find any, it looks for proxy information in the Mac OS X System
   Configuration on Mac OS X and in the Windows registry on Windows.
   If both lowercase and uppercase environment variables exist (and disagree),
   lowercase is preferred.

   .. note::

      If the environment variable ``REQUEST_METHOD`` is set, which usually
      indicates your script is running in a CGI environment, the environment
      variable ``HTTP_PROXY`` (uppercase ``_PROXY``) will be ignored. This is
      because that variable can be injected by a client using the "Proxy:" HTTP
      header. If you need to use an HTTP proxy in a CGI environment, either use
      ``ProxyHandler`` explicitly, or make sure the variable name is in
      lowercase (or at least the ``_proxy`` suffix).

.. note::
    urllib also exposes certain utility functions, such as splittype and
    splithost, for parsing a URL into its components. It is recommended to use
    :mod:`urlparse` for parsing URLs rather than using these functions directly.
    Python 3 does not expose these helper functions from the :mod:`urllib.parse`
    module.

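The lowercase-wins rule of :func:`getproxies` can be sketched on a POSIX
system, where environment variable names are case-sensitive. The proxy host
names are hypothetical, the environment is restored afterwards, and the
import fallback is an assumption for Python 3.

```python
import os

try:
    from urllib import getproxies  # Python 2
except ImportError:
    from urllib.request import getproxies  # Python 3

names = ("http_proxy", "HTTP_PROXY")
saved = dict((name, os.environ.get(name)) for name in names)
try:
    # Hypothetical proxy hosts that deliberately disagree.
    os.environ["http_proxy"] = "http://lower.example.com:3128"
    os.environ["HTTP_PROXY"] = "http://upper.example.com:3128"
    proxies = getproxies()
finally:
    # Put the environment back the way it was.
    for name, value in saved.items():
        if value is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = value
```

With both variables set, ``proxies["http"]`` is expected to be the lowercase
value.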

URL Opener objects
------------------

.. class:: URLopener([proxies[, context[, **x509]]])

   Base class for opening and reading URLs.  Unless you need to support opening
   objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
   you probably want to use :class:`FancyURLopener`.

   By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
   of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number.
   Applications can define their own :mailheader:`User-Agent` header by subclassing
   :class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
   :attr:`version` to an appropriate string value in the subclass definition.

   The optional *proxies* parameter should be a dictionary mapping scheme names to
   proxy URLs, where an empty dictionary turns proxies off completely.  Its default
   value is ``None``, in which case environmental proxy settings will be used if
   present, as discussed in the definition of :func:`urlopen`, above.

   The *context* parameter may be a :class:`ssl.SSLContext` instance.  If given,
   it defines the SSL settings the opener uses to make HTTPS connections.

   Additional keyword parameters, collected in *x509*, may be used for
   authentication of the client when using the :file:`https:` scheme.  The keywords
   *key_file* and *cert_file* are supported to provide an SSL key and certificate;
   both are needed to support client authentication.

   :class:`URLopener` objects will raise an :exc:`IOError` exception if the server
   returns an error code.

   .. method:: open(fullurl[, data])

      Open *fullurl* using the appropriate protocol.  This method sets up cache and
      proxy information, then calls the appropriate open method with its input
      arguments.  If the scheme is not recognized, :meth:`open_unknown` is called.
      The *data* argument has the same meaning as the *data* argument of
      :func:`urlopen`.


   .. method:: open_unknown(fullurl[, data])

      Overridable interface to open unknown URL types.


   .. method:: retrieve(url[, filename[, reporthook[, data]]])

      Retrieves the contents of *url* and places it in *filename*.  The return value
      is a tuple consisting of a local filename and either a
      :class:`mimetools.Message` object containing the response headers (for remote
      URLs) or ``None`` (for local URLs).  The caller must then open and read the
      contents of *filename*.  If *filename* is not given and the URL refers to a
      local file, the input filename is returned.  If the URL is non-local and
      *filename* is not given, the filename is the output of :func:`tempfile.mktemp`
      with a suffix that matches the suffix of the last path component of the input
      URL.  If *reporthook* is given, it must be a function accepting three numeric
      parameters.  It will be called after each chunk of data is read from the
      network.  *reporthook* is ignored for local URLs.

      If the *url* uses the :file:`http:` scheme identifier, the optional *data*
      argument may be given to specify a ``POST`` request (normally the request type
      is ``GET``).  The *data* argument must be in standard
      :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
      function below.


   .. attribute:: version

      Variable that specifies the user agent of the opener object.  To get
      :mod:`urllib` to tell servers that it is a particular user agent, set this in a
      subclass as a class variable or in the constructor before calling the base
      constructor.


.. class:: FancyURLopener(...)

   :class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
   for the following HTTP response codes: 301, 302, 303, 307 and 401.  For the 30x
   response codes listed above, the :mailheader:`Location` header is used to fetch
   the actual URL.  For 401 response codes (authentication required), basic HTTP
   authentication is performed.  For the 30x response codes, recursion is bounded
   by the value of the *maxtries* attribute, which defaults to 10.

   For all other response codes, the method :meth:`http_error_default` is called
   which you can override in subclasses to handle the error appropriately.

   .. note::

      According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
      must not be automatically redirected without confirmation by the user.  In
      reality, browsers do allow automatic redirection of these responses, changing
      the POST to a GET, and :mod:`urllib` reproduces this behaviour.

   The parameters to the constructor are the same as those for :class:`URLopener`.

   .. note::

      When performing basic authentication, a :class:`FancyURLopener` instance calls
      its :meth:`prompt_user_passwd` method.  The default implementation asks the
      user for the required information on the controlling terminal.  A subclass may
      override this method to support more appropriate behavior if needed.

   The :class:`FancyURLopener` class offers one additional method that should be
   overridden to provide the appropriate behavior:

   .. method:: prompt_user_passwd(host, realm)

      Return information needed to authenticate the user at the given host in the
      specified security realm.  The return value should be a tuple, ``(user,
      password)``, which can be used for basic authentication.

      The default implementation prompts for this information on the terminal; an
      application should override this method to use an appropriate interaction
      model in the local environment.

.. exception:: ContentTooShortError(msg[, content])

   This exception is raised when the :func:`urlretrieve` function detects that the
   amount of the downloaded data is less than the expected amount (given by the
   *Content-Length* header). The :attr:`content` attribute stores the downloaded
   (and supposedly truncated) data.

   .. versionadded:: 2.5


:mod:`urllib` Restrictions
--------------------------

  .. index::
     pair: HTTP; protocol
     pair: FTP; protocol

* Currently, only the following protocols are supported: HTTP (versions 0.9 and
  1.0), FTP, and local files.

* The caching feature of :func:`urlretrieve` has been disabled until proper
  processing of expiration time headers is implemented.

* There should be a function to query whether a particular URL is in the cache.

* For backward compatibility, if a URL appears to point to a local file but the
  file can't be opened, the URL is re-interpreted using the FTP protocol.  This
  can sometimes cause confusing error messages.

* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
  long delays while waiting for a network connection to be set up.  This means
  that it is difficult to build an interactive Web client using these functions
  without using threads.

  .. index::
     single: HTML
     pair: HTTP; protocol
     module: htmllib

* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
  returned by the server.  This may be binary data (such as an image), plain text
  or (for example) HTML.  The HTTP protocol provides type information in the reply
  header, which can be inspected by looking at the :mailheader:`Content-Type`
  header.  If the returned data is HTML, you can use the module :mod:`htmllib` to
  parse it.

  .. index:: single: FTP

* The code handling the FTP protocol cannot differentiate between a file and a
  directory.  This can lead to unexpected behavior when attempting to read a URL
  that points to a file that is not accessible.  If the URL ends in a ``/``, it is
  assumed to refer to a directory and will be handled accordingly.  But if an
  attempt to read a file leads to a 550 error (meaning the URL cannot be found or
  is not accessible, often for permission reasons), then the path is treated as a
  directory in order to handle the case when a directory is specified by a URL but
  the trailing ``/`` has been left off.  This can cause misleading results when
  you try to fetch a file whose read permissions make it inaccessible; the FTP
  code will try to read it, fail with a 550 error, and then perform a directory
  listing for the unreadable file. If fine-grained control is needed, consider
  using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
  *_urlopener* to meet your needs.

* This module does not support the use of proxies which require authentication.
  This may be implemented in the future.

  .. index:: module: urlparse

* Although the :mod:`urllib` module contains (undocumented) routines to parse
  and unparse URL strings, the recommended interface for URL manipulation is in
  module :mod:`urlparse`.

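Because the returned data is raw, a caller usually inspects the
:mailheader:`Content-Type` header before deciding how to treat it, as noted
in the list above. The sketch below does this for a hypothetical local HTML
file, whose type is guessed from the file extension. The import fallback is
an assumption so the code also runs on Python 3.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url  # Python 2
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3

# Hypothetical HTML file, so the sketch needs no network access.
fd, path = tempfile.mkstemp(suffix=".html")
os.write(fd, b"<html><body>hello</body></html>")
os.close(fd)

f = urlopen("file:" + pathname2url(path))
try:
    content_type = f.info()["Content-Type"]
    raw = f.read()  # raw data exactly as delivered; no decoding is applied
finally:
    f.close()
    os.remove(path)

# Ignore any parameters such as "; charset=..." when comparing the type.
is_html = content_type.split(";")[0].strip().lower() == "text/html"
```

Only when ``is_html`` is true would it make sense to hand ``raw`` to an HTML
parser.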
    507 
    508 .. _urllib-examples:
    509 
    510 Examples
    511 --------
    512 
    513 Here is an example session that uses the ``GET`` method to retrieve a URL
    514 containing parameters::
    515 
    516    >>> import urllib
    517    >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
    518    >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
    519    >>> print f.read()
    520 
    521 The following example uses the ``POST`` method instead::
    522 
    523    >>> import urllib
    524    >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
    525    >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
    526    >>> print f.read()
    527 
    528 The following example uses an explicitly specified HTTP proxy, overriding
    529 environment settings::
    530 
    531    >>> import urllib
    532    >>> proxies = {'http': 'http://proxy.example.com:8080/'}
    533    >>> opener = urllib.FancyURLopener(proxies)
    534    >>> f = opener.open("http://www.python.org")
    535    >>> f.read()
    536 
    537 The following example uses no proxies at all, overriding environment settings::
    538 
    539    >>> import urllib
    540    >>> opener = urllib.FancyURLopener({})
    541    >>> f = opener.open("http://www.python.org/")
    542    >>> f.read()
    543 
    544