:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.

**Source code:** :source:`Lib/urlparse.py`

--------------

This module defines a standard interface to break Uniform Resource Locator (URL)
strings up into components (addressing scheme, network location, path, etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL".

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators (:rfc:`1808`). It supports the following URL schemes: ``file``,
``ftp``, ``gopher``, ``hdl``, ``http``, ``https``, ``imap``, ``mailto``, ``mms``,
``news``, ``nntp``, ``prospero``, ``rsync``, ``rtsp``, ``rtspu``, ``sftp``,
``shttp``, ``sip``, ``sips``, ``snews``, ``svn``, ``svn+ssh``, ``telnet``,
``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple.  This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and
   % escapes are not expanded. The delimiters as shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present.  For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'


   Following the syntax specifications in :rfc:`1808`, :func:`urlparse` recognizes
   a netloc only if it is properly introduced by ``//``.  Otherwise the
   input is presumed to be a relative URL and thus to start with
   a path component.

      >>> from urlparse import urlparse
      >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('www.cwi.nl/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('help/Python.html')
      ParseResult(scheme='', netloc='', path='help/Python.html', params='',
                  query='', fragment='')

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one.  The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   recognized; they are instead parsed as part of the preceding component, even if
   the URL's addressing scheme normally supports them.  The default value for this
   argument is :const:`True`.

   The return value is actually an instance of a subclass of :class:`tuple`.  This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | *scheme* parameter   |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 2.5
      Added attributes to return value.

   .. versionchanged:: 2.7
      Added IPv6 URL parsing capabilities.

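The optional arguments and convenience attributes of :func:`urlparse` described
above can be sketched as follows. This is only an illustration: the credentials
in the last URL are made up, and the ``try``/``except`` import is an assumption
added so that the snippet also runs on Python 3, where the same function lives
in :mod:`urllib.parse`.

```python
try:
    from urlparse import urlparse          # Python 2
except ImportError:
    from urllib.parse import urlparse      # Python 3 name (assumption for portability)

# A URL without a scheme picks up the default given as the second argument.
defaulted = urlparse('//www.cwi.nl:80/%7Eguido/Python.html', 'http')

# With allow_fragments false, '#intro' is not split off; it stays in the path.
no_frag = urlparse('http://www.cwi.nl/doc/index.html#intro', '', False)

# The convenience attributes break the network location down further.
# The user name and password below are invented for this example.
auth = urlparse('http://guido:secret@WWW.CWI.NL:8080/%7Eguido/')

print(defaulted.scheme)      # http
print(no_frag.path)          # /doc/index.html#intro
print(auth.hostname)         # www.cwi.nl  (lower-cased)
print(auth.port)             # 8080
```

Note that ``netloc`` itself is left untouched; only ``hostname`` is lower-cased.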

.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a
   dictionary.  The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings.  A true
   value indicates that blanks should be retained as blank strings.  The default
   false value indicates that blank values are to be ignored and treated as if they
   were not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function (with a true *doseq* argument, so
   that each value in a list is emitted separately) to convert such dictionaries
   into query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.

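A minimal sketch of how the two flags of :func:`parse_qs` change the result.
The query strings are hypothetical, and the fallback import from
:mod:`urllib.parse` is an assumption added so the snippet also runs on Python 3.

```python
try:
    from urlparse import parse_qs          # Python 2
except ImportError:
    from urllib.parse import parse_qs      # Python 3 name (assumption for portability)

query = 'lang=en&lang=fr&debug='           # hypothetical query string

# By default the blank 'debug' value is ignored entirely.
default = parse_qs(query)

# With keep_blank_values true it is kept as an empty string.
kept = parse_qs(query, keep_blank_values=True)

# A field without '=' is a parsing error; strict_parsing turns it into ValueError.
try:
    parse_qs('lang=en&broken', strict_parsing=True)
    strict_failed = False
except ValueError:
    strict_failed = True

print(default)         # {'lang': ['en', 'fr']}
print(kept['debug'])   # ['']
print(strict_failed)   # True
```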

.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings.  A true
   value indicates that blanks should be retained as blank strings.  The default
   false value indicates that blank values are to be ignored and treated as if they
   were not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such lists of pairs into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.

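The round trip between :func:`parse_qsl` and ``urlencode`` might look like the
sketch below. The query string is made up for the example, and the second import
branch is an assumption covering Python 3, where both functions are found in
:mod:`urllib.parse`.

```python
try:
    from urlparse import parse_qsl         # Python 2
    from urllib import urlencode
except ImportError:
    from urllib.parse import parse_qsl, urlencode   # Python 3 (assumption for portability)

# Unlike parse_qs(), repeated names stay as separate pairs, in their original order.
pairs = parse_qsl('name=Guido+van+Rossum&lang=py&lang=c')

# urlencode() accepts the list of pairs directly and re-quotes the values.
rebuilt = urlencode(pairs)

print(pairs)     # [('name', 'Guido van Rossum'), ('lang', 'py'), ('lang', 'c')]
print(rebuilt)   # name=Guido+van+Rossum&lang=py&lang=c
```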

.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by :func:`urlparse`. The *parts*
   argument can be any six-item iterable. This may result in a slightly different,
   but equivalent URL, if the URL that was parsed originally had unnecessary
   delimiters (for example, a ``?`` with an empty query; the RFC states that these
   are equivalent).

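As an illustration of the "slightly different, but equivalent" result of
:func:`urlunparse`, the sketch below parses a URL that ends in a bare ``?`` and
reassembles it, then builds a URL from a plain list. The list contents are
invented for the example; the fallback import is an assumption for Python 3.

```python
try:
    from urlparse import urlparse, urlunparse        # Python 2
except ImportError:
    from urllib.parse import urlparse, urlunparse    # Python 3 (assumption for portability)

# The trailing '?' introduces an empty query, so it is dropped on reassembly.
parts = urlparse('http://www.cwi.nl/%7Eguido/Python.html?')
rebuilt = urlunparse(parts)

# Any six-item iterable works, not only a urlparse() result.
from_list = urlunparse(['http', 'www.cwi.nl', '/%7Eguido/Python.html', '', 'q=1', 'top'])

print(rebuilt)     # http://www.cwi.nl/%7Eguido/Python.html
print(from_list)   # http://www.cwi.nl/%7Eguido/Python.html?q=1#top
```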

.. function:: urlsplit(urlstring[, scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted.  A separate function is needed to
   separate the path segments and parameters.  This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`.  This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | *scheme* parameter   |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.

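The difference between :func:`urlsplit` and :func:`urlparse` shows up when path
segments carry ``;parameters``. The sketch below compares the two on a made-up
URL and adds ``split_segment_params``, a hypothetical helper (not part of the
module) standing in for the "separate function" mentioned above. The fallback
import is an assumption for Python 3.

```python
try:
    from urlparse import urlparse, urlsplit        # Python 2
except ImportError:
    from urllib.parse import urlparse, urlsplit    # Python 3 (assumption for portability)

url = 'http://www.cwi.nl/dir;type=a/file.html;v=2?x=1#top'   # invented example URL

parsed = urlparse(url)   # 6-tuple: params of the last segment are split off
split = urlsplit(url)    # 5-tuple: the path keeps every ';...' untouched


def split_segment_params(path):
    """Hypothetical helper: return a (segment, params) pair per path segment."""
    pairs = []
    for segment in path.split('/'):
        name, _, params = segment.partition(';')
        pairs.append((name, params))
    return pairs


print(parsed.path, parsed.params)         # /dir;type=a/file.html v=2
print(split.path)                         # /dir;type=a/file.html;v=2
print(split_segment_params(split.path))   # [('', ''), ('dir', 'type=a'), ('file.html', 'v=2')]
```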

.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string. The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ``?`` with an empty query;
   the RFC states that these are equivalent).

   .. versionadded:: 2.2

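Because :func:`urlunsplit` accepts any five-item iterable, one way to edit a
single component is to copy the split result into a list, change the item, and
reassemble. A small sketch under that assumption, using an invented URL and a
fallback import for Python 3:

```python
try:
    from urlparse import urlsplit, urlunsplit        # Python 2
except ImportError:
    from urllib.parse import urlsplit, urlunsplit    # Python 3 (assumption for portability)

# Copy the 5-tuple into a mutable list so that one component can be replaced.
parts = list(urlsplit('http://www.cwi.nl/search?page=1#results'))
parts[3] = 'page=2'          # index 3 is the query component

next_page = urlunsplit(parts)
print(next_page)             # http://www.cwi.nl/search?page=2#results
```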

.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*).  Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL.  For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result.  For example:

   .. doctest::

      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
      ...         '//www.python.org/%7Eguido')
      'http://www.python.org/%7Eguido'

   If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
   :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.

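The preprocessing suggested for :func:`urljoin` could be sketched as below.
``join_same_host`` is a hypothetical helper name, not part of the module, and
the fallback import is an assumption so the snippet also runs on Python 3.

```python
try:
    from urlparse import urljoin, urlsplit, urlunsplit        # Python 2
except ImportError:
    from urllib.parse import urljoin, urlsplit, urlunsplit    # Python 3 (assumption)


def join_same_host(base, url):
    """Hypothetical helper: join *url* onto *base*, ignoring url's scheme and netloc."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Rebuild the URL with the scheme and netloc blanked out.
    relative = urlunsplit(('', '', path, query, fragment))
    return urljoin(base, relative)


base = 'http://www.cwi.nl/%7Eguido/Python.html'

plain = urljoin(base, '//www.python.org/%7Eguido')          # host comes from *url*
pinned = join_same_host(base, '//www.python.org/%7Eguido')  # host stays that of *base*

print(plain)    # http://www.python.org/%7Eguido
print(pinned)   # http://www.cwi.nl/%7Eguido
```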

.. function:: urldefrag(url)

   If *url* contains a fragment identifier, return a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate string.
   If there is no fragment identifier in *url*, return *url* unmodified and an
   empty string.

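A short sketch of both cases of :func:`urldefrag`, unpacking the pair it
returns. The fragment name is invented, and the fallback import is an assumption
for Python 3.

```python
try:
    from urlparse import urldefrag          # Python 2
except ImportError:
    from urllib.parse import urldefrag      # Python 3 (assumption for portability)

# With a fragment: the URL is returned without it, and the fragment separately.
url, fragment = urldefrag('http://www.cwi.nl/%7Eguido/Python.html#intro')

# Without a fragment: the URL comes back unchanged, with an empty string.
same_url, empty = urldefrag('http://www.cwi.nl/%7Eguido/Python.html')

print(url)        # http://www.cwi.nl/%7Eguido/Python.html
print(fragment)   # intro
```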

.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD 66). Any changes to the :mod:`urlparse`
      module should conform to this. Certain deviations could be observed, which
      are mostly for backward compatibility purposes and for certain de-facto
      parsing requirements as commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's.
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme.
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type.  These subclasses add the attributes
described in those functions, as well as provide an additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixed point: passing it back through the
   original parsing function and calling :meth:`geturl` again returns the same
   string:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

The following classes provide the implementations of the parse results:


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results.

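Since the result classes are tuple subclasses, they can be indexed, unpacked and
constructed directly. The sketch below illustrates this; the component values
passed to :class:`ParseResult` are made up, and the fallback import is an
assumption so the snippet also runs on Python 3.

```python
try:
    from urlparse import ParseResult, urlsplit        # Python 2
except ImportError:
    from urllib.parse import ParseResult, urlsplit    # Python 3 (assumption for portability)

result = urlsplit('HTTP://www.Python.org/doc/#')

# Tuple behaviour: positional access and unpacking both work.
scheme, netloc, path, query, fragment = result
first = result[0]                    # same value as result.scheme

# A result object can also be built by hand and turned back into a URL.
built = ParseResult('http', 'www.python.org', '/doc/', '', 'v=1', '')

print(isinstance(result, tuple))     # True
print(first, netloc)                 # http www.Python.org
print(built.geturl())                # http://www.python.org/doc/?v=1
```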