:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.0.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to 3.0.


This module defines a standard interface to break Uniform Resource Locator (URL)
strings up into components (addressing scheme, network location, path, etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators (and discovered a bug in an earlier draft!). It supports the
following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``,
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, default_scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple. This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and
   ``%`` escapes are not expanded. The delimiters as shown above are not part of
   the result, except for a leading slash in the *path* component, which is
   retained if present. For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'

   If the *default_scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one. The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   allowed, even if the URL's addressing scheme normally does support them. The
   default value for this argument is :const:`True`.

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 2.5
      Added attributes to return value.

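The optional arguments and the convenience attributes of :func:`urlparse`
described above can be illustrated with a minimal sketch. The ``example.com``
URLs and the credentials are made up for illustration, and the ``try``/``except``
import is an assumption added only so that the snippet also runs where the
module is available under its Python 3 name:

```python
try:
    from urlparse import urlparse          # Python 2 module name
except ImportError:
    from urllib.parse import urlparse      # Python 3 name (assumption, see above)

# default_scheme applies only because the URL itself names no scheme.
no_scheme = urlparse('//www.example.com/index.html', 'http')

# With allow_fragments false, '#top' is not split off as a fragment.
no_frag = urlparse('http://www.example.com/doc#top', '', False)

# username, password, hostname and port are derived from the netloc.
o = urlparse('http://guido:secret@WWW.Example.com:8080/doc')
user, host, port = o.username, o.hostname, o.port
```

Here ``no_scheme.scheme`` is filled in from the default, ``no_frag.fragment``
stays empty, and ``host`` comes back in lower case while ``o.netloc`` keeps the
original spelling.
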
.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a
   dictionary. The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in URL encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such dictionaries into
   query strings.


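The effect of the two flags of :func:`parse_qs` can be seen in a short sketch.
The query string is hypothetical, and the fallback import is an assumption that
lets the snippet run under the module's Python 3 name as well:

```python
try:
    from urlparse import parse_qs          # Python 2 module name
except ImportError:
    from urllib.parse import parse_qs      # Python 3 name (assumption)

query = 'lang=en&tag=a&tag=b&empty='       # hypothetical query string

# By default the blank value of 'empty' is dropped entirely.
default = parse_qs(query)

# With keep_blank_values true it is kept as an empty string.
kept = parse_qs(query, keep_blank_values=True)

# A field without '=' is a parsing error; strict_parsing turns it
# into a ValueError instead of silently skipping it.
try:
    parse_qs('lang=en&broken', strict_parsing=True)
    strict_error = False
except ValueError:
    strict_error = True
```

Repeated names such as ``tag`` end up as a single key whose list holds every
value in order of appearance.
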
.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in URL encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such lists of pairs into
   query strings.

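A round trip through :func:`parse_qsl` and :func:`urllib.urlencode` might look
like the following sketch. The query string is invented for the example, and
the fallback import (where ``urlencode`` lives in ``urllib.parse``) is an
assumption for running the snippet outside Python 2:

```python
try:
    from urlparse import parse_qsl         # Python 2 locations
    from urllib import urlencode
except ImportError:
    from urllib.parse import parse_qsl, urlencode   # Python 3 (assumption)

# Order and repeated names are preserved; '+' is decoded to a space.
pairs = parse_qsl('tag=a&tag=b&q=hello+world')

# urlencode accepts the list of pairs and re-encodes the space as '+'.
rebuilt = urlencode(pairs)
```

Because the result is a list rather than a dictionary, the order of the fields
survives the round trip.
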
.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by :func:`urlparse`. The *parts*
   argument can be any six-item iterable. This may result in a slightly
   different, but equivalent URL, if the URL that was parsed originally had
   unnecessary delimiters (for example, a ``?`` with an empty query; the RFC
   states that these are equivalent).


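As a small sketch of :func:`urlunparse`, the snippet below assembles a URL from
a plain six-item tuple and then passes a parsed result straight back in. The
component values are hypothetical, and the fallback import is an assumption for
running under the Python 3 module name:

```python
try:
    from urlparse import urlparse, urlunparse       # Python 2 module name
except ImportError:
    from urllib.parse import urlparse, urlunparse   # Python 3 name (assumption)

# (scheme, netloc, path, params, query, fragment); empty items are omitted.
parts = ('http', 'www.example.com', '/doc/index.html', '', 'page=2', 'top')
url = urlunparse(parts)

# Any six-item iterable works, including a list or a ParseResult.
from_list = urlunparse(list(parts))
round_trip = urlunparse(urlparse(url))
```
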
.. function:: urlsplit(urlstring[, default_scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the
   URL. It should generally be used instead of :func:`urlparse` when the more
   recent URL syntax, which allows parameters to be applied to each segment of
   the *path* portion of the URL (see :rfc:`2396`), is wanted. A separate
   function is then needed to separate the path segments and parameters. This
   function returns a 5-tuple: (addressing scheme, network location, path,
   query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.


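The difference between :func:`urlsplit` and :func:`urlparse` shows up when a
path carries parameters on more than one segment. The sketch below uses a
made-up URL, and its fallback import is an assumption for running under the
Python 3 module name:

```python
try:
    from urlparse import urlparse, urlsplit        # Python 2 module name
except ImportError:
    from urllib.parse import urlparse, urlsplit    # Python 3 name (assumption)

url = 'http://www.example.com/a;x=1/b;y=2?q=1'     # hypothetical URL

# urlparse splits off only the parameters of the last path segment.
parsed = urlparse(url)

# urlsplit leaves every ';parameter' inside the path component.
split = urlsplit(url)
```

``parsed`` has six items with ``y=2`` in ``params``, while ``split`` has five
items and keeps the whole ``/a;x=1/b;y=2`` string as its path.
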
.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string. The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ``?`` with an empty
   query; the RFC states that these are equivalent).

   .. versionadded:: 2.2


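One common use of :func:`urlunsplit` is to change a single component of a URL
obtained from :func:`urlsplit`. The following is only a sketch with a
hypothetical URL and query, and the fallback import is an assumption for
running under the Python 3 module name:

```python
try:
    from urlparse import urlsplit, urlunsplit       # Python 2 module name
except ImportError:
    from urllib.parse import urlsplit, urlunsplit   # Python 3 name (assumption)

original = 'https://www.example.com/search?q=url#results'
parts = urlsplit(original)

# Rebuild the URL with a different query; the other components are reused.
changed = urlunsplit((parts.scheme, parts.netloc, parts.path,
                      'q=other', parts.fragment))

# A plain list of five strings is accepted too.
built = urlunsplit(['https', 'www.example.com', '/search', 'q=url', ''])
```
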
.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*). Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL. For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result. For example:

   .. doctest::

      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
      ...         '//www.python.org/%7Eguido')
      'http://www.python.org/%7Eguido'

   If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
   :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.


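The preprocessing suggested above for :func:`urljoin` could be sketched as
follows. The helper name ``join_path_only`` is hypothetical and not part of the
module, and the fallback import is an assumption for running under the Python 3
module name:

```python
try:
    from urlparse import urljoin, urlsplit, urlunsplit       # Python 2
except ImportError:
    from urllib.parse import urljoin, urlsplit, urlunsplit   # Python 3 (assumption)


def join_path_only(base, url):
    """Join *url* to *base*, ignoring any scheme or netloc in *url*.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Blank out scheme and netloc, keep path, query and fragment.
    relative = urlunsplit(('', '', parts.path, parts.query, parts.fragment))
    return urljoin(base, relative)


base = 'http://www.cwi.nl/%7Eguido/Python.html'
plain = urljoin(base, '//www.python.org/%7Eguido')
stripped = join_path_only(base, '//www.python.org/%7Eguido')
```

With the plain call the host of *url* wins, whereas the helper keeps the host
of *base* and takes only the path from *url*.
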
.. function:: urldefrag(url)

   If *url* contains a fragment identifier, return a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate string.
   If there is no fragment identifier in *url*, return *url* unmodified and an
   empty string.


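A brief sketch of both cases of :func:`urldefrag`, using a made-up URL. The
fallback import is an assumption for running under the Python 3 module name;
the two-item result is unpacked directly:

```python
try:
    from urlparse import urldefrag         # Python 2 module name
except ImportError:
    from urllib.parse import urldefrag     # Python 3 name (assumption)

# With a fragment: the URL without it, plus the fragment itself.
bare, fragment = urldefrag('http://www.example.com/doc/#intro')

# Without a fragment: the URL unchanged, plus an empty string.
same, nothing = urldefrag('http://www.example.com/doc/')
```
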
.. seealso::

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type. These subclasses add the attributes
described in those functions and provide one additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixpoint if passed back through the original
   parsing function:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

The following classes provide the implementations of the parse results:


.. class:: BaseResult

   Base class for the concrete result classes. This provides most of the attribute
   definitions. It does not provide a :meth:`geturl` method. It is derived from
   :class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
   methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments is passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments is passed.

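Because the result classes are tuples, they can be unpacked, compared and
constructed directly. The sketch below assumes a hypothetical URL and uses a
fallback import for the Python 3 module name. It relies on a wrong argument
count being reported as :exc:`TypeError`, which this section does not spell
out:

```python
try:
    from urlparse import urlsplit, SplitResult       # Python 2 module name
except ImportError:
    from urllib.parse import urlsplit, SplitResult   # Python 3 name (assumption)

result = urlsplit('http://www.example.com/doc/?page=2')

# A result unpacks like any 5-tuple.
scheme, netloc, path, query, fragment = result

# Constructing the class by hand gives an equal tuple with geturl().
manual = SplitResult('http', 'www.example.com', '/doc/', 'page=2', '')

# Passing the wrong number of arguments is rejected.
try:
    SplitResult('http', 'www.example.com')
    wrong_count_rejected = False
except TypeError:
    wrong_count_rejected = True
```
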