:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.

**Source code:** :source:`Lib/urlparse.py`

--------------

This module defines a standard interface to break Uniform Resource Locator (URL)
strings up into components (addressing scheme, network location, path etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators. It supports the following URL schemes: ``file``, ``ftp``,
``gopher``, ``hdl``, ``http``, ``https``, ``imap``, ``mailto``, ``mms``,
``news``, ``nntp``, ``prospero``, ``rsync``, ``rtsp``, ``rtspu``, ``sftp``,
``shttp``, ``sip``, ``sips``, ``snews``, ``svn``, ``svn+ssh``, ``telnet``,
``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple.  This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up into
   smaller parts (for example, the network location is a single string), and %
   escapes are not expanded. The delimiters as shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present.
   For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'


   Following the syntax specifications in :rfc:`1808`, :func:`urlparse` recognizes
   a netloc only if it is properly introduced by ``//``.  Otherwise the
   input is presumed to be a relative URL and thus to start with
   a path component.

       >>> from urlparse import urlparse
       >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
       ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
       >>> urlparse('www.cwi.nl/%7Eguido/Python.html')
       ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
                  params='', query='', fragment='')
       >>> urlparse('help/Python.html')
       ParseResult(scheme='', netloc='', path='help/Python.html', params='',
                  query='', fragment='')

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one.  The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   recognized and are instead parsed as part of the preceding component, even if
   the URL's addressing scheme normally does support them.  The default value for
   this argument is :const:`True`.

   The return value is actually an instance of a subclass of :class:`tuple`.
   This class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | *scheme* parameter   |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 2.5
      Added attributes to return value.

   .. versionchanged:: 2.7
      Added IPv6 URL parsing capabilities.


.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a
   dictionary.  The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings.   A true value
   indicates that blanks should be retained as blank strings.  The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such dictionaries into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.


.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings.   A true value
   indicates that blanks should be retained as blank strings.  The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.
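
   The effect of the two flags can be sketched as follows.  The query string is
   made up for illustration, and the fallback import from :mod:`urllib.parse` is
   an assumption added only so that the snippet also runs on Python 3:

   ```python
   try:
       from urlparse import parse_qsl      # Python 2
   except ImportError:
       from urllib.parse import parse_qsl  # Python 3 name of the same module

   query = 'name=guido&lang=&lang=python'  # illustrative query string

   # By default, fields with a blank value are dropped from the result.
   default_pairs = parse_qsl(query)

   # With keep_blank_values true, they are kept as empty strings.
   all_pairs = parse_qsl(query, keep_blank_values=True)

   print(default_pairs)
   print(all_pairs)

   # With strict_parsing true, a field without "=" raises ValueError
   # instead of being skipped silently.
   try:
       parse_qsl('name=guido&broken', strict_parsing=True)
   except ValueError as exc:
       print('rejected: %s' % exc)
   ```

   Repeated names such as ``lang`` stay as separate pairs here, whereas
   :func:`parse_qs` would collect them into one list per name.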

   Use the :func:`urllib.urlencode` function to convert such lists of pairs into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.


.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by ``urlparse()``. The *parts* argument
   can be any six-item iterable. This may result in a slightly different, but
   equivalent URL, if the URL that was parsed originally had unnecessary delimiters
   (for example, a ? with an empty query; the RFC states that these are
   equivalent).


.. function:: urlsplit(urlstring[, scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted.  A separate function is needed to
   separate the path segments and parameters.  This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`.
   This class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | *scheme* parameter   |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.


.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string.
   The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ? with an empty query; the
   RFC states that these are equivalent).

   .. versionadded:: 2.2


.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*).  Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL.  For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result.  For example:

   .. doctest::

      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
      ...         '//www.python.org/%7Eguido')
      'http://www.python.org/%7Eguido'

   If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
   :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.


.. function:: urldefrag(url)

   If *url* contains a fragment identifier, returns a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate string.
   If there is no fragment identifier in *url*, returns *url* unmodified and an
   empty string.


.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the :mod:`urlparse`
      module should conform to this.
      Certain deviations may be observed, which are
      mostly for backward compatibility purposes and for certain de-facto
      parsing requirements as commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's.
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme.
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type.  These subclasses add the attributes
described in those functions, as well as provide an additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixpoint if passed back through the original
   parsing function:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

The following classes provide the implementations of the parse results:


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results.
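
Because a :class:`SplitResult` is a tuple and :func:`urlunsplit` accepts any
five-item iterable, a result can be sliced, edited and reassembled.  The sketch
below uses this to drop the scheme and network location from a reference before
calling :func:`urljoin`, which is the preprocessing suggested in the note for
that function.  The helper name ``join_same_host`` is hypothetical and not part
of the module, and the fallback import from :mod:`urllib.parse` is an assumption
added only so that the snippet also runs on Python 3:

```python
try:
    from urlparse import urljoin, urlsplit, urlunsplit      # Python 2
except ImportError:
    from urllib.parse import urljoin, urlsplit, urlunsplit  # Python 3


def join_same_host(base, url):
    """Join *url* onto *base*, ignoring any scheme or host given in *url*.

    Hypothetical helper for illustration; not provided by the module.
    """
    parts = urlsplit(url)
    # Keep path, query and fragment (indexes 2-4) and blank out the
    # scheme and netloc (indexes 0-1) before reassembling the reference.
    relative = urlunsplit(('', '') + tuple(parts)[2:])
    return urljoin(base, relative)


base = 'http://www.cwi.nl/%7Eguido/Python.html'
print(join_same_host(base, '//www.python.org/%7Eguido'))
print(join_same_host(base, 'FAQ.html'))
```

With this preprocessing, the first call resolves against the host of *base*
rather than ``www.python.org``, while a plain relative reference such as
``FAQ.html`` is joined exactly as :func:`urljoin` would join it on its own.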