symbian-qemu-0.9.1-12/python-2.6.1/Doc/library/urllib.rst
changeset 1 2fb8b9db1c86
       
     1 :mod:`urllib` --- Open arbitrary resources by URL
       
     2 =================================================
       
     3 
       
     4 .. module:: urllib
       
     5    :synopsis: Open an arbitrary network resource by URL (requires sockets).
       
     6 
       
     7 .. note::
       
     8     The :mod:`urllib` module has been split into parts and renamed in
       
     9     Python 3.0 to :mod:`urllib.request`, :mod:`urllib.parse`,
       
    10     and :mod:`urllib.error`. The :term:`2to3` tool will automatically adapt
       
    11     imports when converting your sources to 3.0.
       
    12     Also note that the :func:`urllib.urlopen` function has been removed in
       
    13     Python 3.0 in favor of :func:`urllib2.urlopen`.
       
    14 
       
    15 .. index::
       
    16    single: WWW
       
    17    single: World Wide Web
       
    18    single: URL
       
    19 
       
    20 This module provides a high-level interface for fetching data across the World
       
    21 Wide Web.  In particular, the :func:`urlopen` function is similar to the
       
    22 built-in function :func:`open`, but accepts Uniform Resource Locators (URLs)
       
    23 instead of filenames.  Some restrictions apply --- it can only open URLs for
       
    24 reading, and no seek operations are available.
       
    25 
       
    26 High-level interface
       
    27 --------------------
       
    28 
       
    29 .. function:: urlopen(url[, data[, proxies]])
       
    30 
       
    31    Open a network object denoted by a URL for reading.  If the URL does not have a
       
    32    scheme identifier, or if it has :file:`file:` as its scheme identifier, this
       
    33    opens a local file (without universal newlines); otherwise it opens a socket to
       
    34    a server somewhere on the network.  If the connection cannot be made, the
       
    35    :exc:`IOError` exception is raised.  If all went well, a file-like object is
       
    36    returned.  This supports the following methods: :meth:`read`, :meth:`readline`,
       
    37    :meth:`readlines`, :meth:`fileno`, :meth:`close`, :meth:`info`, :meth:`getcode` and
       
    38    :meth:`geturl`.  It also has proper support for the :term:`iterator` protocol. One
       
    39    caveat: the :meth:`read` method, if the size argument is omitted or negative,
       
    40    may not read until the end of the data stream; there is no good way to determine
       
    41    that the entire stream from a socket has been read in the general case.
       
    42 
       
    43    Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods,
       
    44    these methods have the same interface as for file objects --- see section
       
    45    :ref:`bltin-file-objects` in this manual.  (It is not a built-in file object,
       
    46    however, so it can't be used at those few places where a true built-in file
       
    47    object is required.)
       
    48 
       
    49    .. index:: module: mimetools
       
    50 
       
    51    The :meth:`info` method returns an instance of the class
       
    52    :class:`mimetools.Message` containing meta-information associated with the
       
    53    URL.  When the method is HTTP, these headers are those returned by the server
       
    54    at the head of the retrieved HTML page (including Content-Length and
       
    55    Content-Type).  When the method is FTP, a Content-Length header will be
       
    56    present if (as is now usual) the server passed back a file length in response
       
    57    to the FTP retrieval request. A Content-Type header will be present if the
       
    58    MIME type can be guessed.  When the method is local-file, returned headers
       
    59    will include a Date representing the file's last-modified time, a
       
    60    Content-Length giving file size, and a Content-Type containing a guess at the
       
    61    file's type. See also the description of the :mod:`mimetools` module.
       
    62 
       
    63    The :meth:`geturl` method returns the real URL of the page.  In some cases, the
       
    64    HTTP server redirects a client to another URL.  The :func:`urlopen` function
       
    65    handles this transparently, but in some cases the caller needs to know which URL
       
    66    the client was redirected to.  The :meth:`geturl` method can be used to get at
       
    67    this redirected URL.
       
    68 
       
    69    The :meth:`getcode` method returns the HTTP status code that was sent with the
       
    70    response, or ``None`` if the URL is not an HTTP URL.
       
    71 
       
    72    If the *url* uses the :file:`http:` scheme identifier, the optional *data*
       
    73    argument may be given to specify a ``POST`` request (normally the request type
       
    74    is ``GET``).  The *data* argument must be in standard
       
    75    :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
       
    76    function below.
       
    77 
       
    78    The :func:`urlopen` function works transparently with proxies which do not
       
    79    require authentication.  In a Unix or Windows environment, set the
       
    80    :envvar:`http_proxy` or :envvar:`ftp_proxy` environment variables to a URL that
       
    81    identifies the proxy server before starting the Python interpreter.  For example
       
    82    (the ``'%'`` is the command prompt)::
       
    83 
       
    84       % http_proxy="http://www.someproxy.com:3128"
       
    85       % export http_proxy
       
    86       % python
       
    87       ...
       
    88 
       
    89    The :envvar:`no_proxy` environment variable can be used to specify hosts which
       
    90    shouldn't be reached via proxy; if set, it should be a comma-separated list
       
    91    of hostname suffixes, optionally with ``:port`` appended, for example
       
    92    ``cern.ch,ncsa.uiuc.edu,some.host:8080``.
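The suffix matching described above can be sketched as follows. This is a simplified illustration under the assumption of a plain suffix comparison, not the module's own code; the function name ``bypasses_proxy`` and its arguments are hypothetical.

```python
def bypasses_proxy(host, no_proxy):
    """Return True if *host* (optionally 'name:port') matches a no_proxy entry.

    Illustrative only: a plain suffix comparison, as the text describes.
    """
    hostonly = host.split(':', 1)[0]  # drop an optional ':port'
    for suffix in no_proxy.split(','):
        suffix = suffix.strip()
        # An entry matches either the bare hostname or the full 'host:port'.
        if suffix and (hostonly.endswith(suffix) or host.endswith(suffix)):
            return True
    return False


no_proxy = 'cern.ch,ncsa.uiuc.edu,some.host:8080'
print(bypasses_proxy('www.cern.ch', no_proxy))      # suffix 'cern.ch' matches
print(bypasses_proxy('some.host:8080', no_proxy))   # host and port both match
print(bypasses_proxy('www.python.org', no_proxy))   # no entry matches
```

An entry that carries a port, such as ``some.host:8080``, only matches when the request names that same port.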
       
    93 
       
    94    In a Windows environment, if no proxy environment variables are set, proxy
       
    95    settings are obtained from the registry's Internet Settings section.
       
    96 
       
    97    .. index:: single: Internet Config
       
    98 
       
    99    In a Macintosh environment, :func:`urlopen` will retrieve proxy information from
       
   100    Internet Config.
       
   101 
       
   102    Alternatively, the optional *proxies* argument may be used to explicitly specify
       
   103    proxies.  It must be a dictionary mapping scheme names to proxy URLs, where an
       
   104    empty dictionary causes no proxies to be used, and ``None`` (the default value)
       
   105    causes environmental proxy settings to be used as discussed above.  For
       
   106    example::
       
   107 
       
   108       # Use http://www.someproxy.com:3128 for http proxying
       
   109       proxies = {'http': 'http://www.someproxy.com:3128'}
       
   110       filehandle = urllib.urlopen(some_url, proxies=proxies)
       
   111       # Don't use any proxies
       
   112       filehandle = urllib.urlopen(some_url, proxies={})
       
   113       # Use proxies from environment - both versions are equivalent
       
   114       filehandle = urllib.urlopen(some_url, proxies=None)
       
   115       filehandle = urllib.urlopen(some_url)
       
   116 
       
   117    Proxies which require authentication for use are not currently supported; this
       
   118    is considered an implementation limitation.
       
   119 
       
   120    .. versionchanged:: 2.3
       
   121       Added the *proxies* support.
       
   122 
       
   123    .. versionchanged:: 2.6
       
   124       Added :meth:`getcode` to returned object and support for the
       
   125       :envvar:`no_proxy` environment variable.
       
   126       
       
   127    .. deprecated:: 2.6
       
   128       The :func:`urlopen` function has been removed in Python 3.0 in favor
       
   129       of :func:`urllib2.urlopen`.
       
   130 
       
   131 
       
   132 .. function:: urlretrieve(url[, filename[, reporthook[, data]]])
       
   133 
       
   134    Copy a network object denoted by a URL to a local file, if necessary. If the URL
       
   135    points to a local file, or a valid cached copy of the object exists, the object
       
   136    is not copied.  Return a tuple ``(filename, headers)`` where *filename* is the
       
   137    local file name under which the object can be found, and *headers* is whatever
       
   138    the :meth:`info` method of the object returned by :func:`urlopen` returned (for
       
   139    a remote object, possibly cached). Exceptions are the same as for
       
   140    :func:`urlopen`.
       
   141 
       
   142    The second argument, if present, specifies the file location to copy to (if
       
   143    absent, the location will be a tempfile with a generated name). The third
       
   144    argument, if present, is a hook function that will be called once on
       
   145    establishment of the network connection and once after each block read
       
   146    thereafter.  The hook will be passed three arguments: a count of blocks
       
   147    transferred so far, a block size in bytes, and the total size of the file.  The
       
   148    total size may be ``-1`` on older FTP servers which do not return a file
       
   149    size in response to a retrieval request.
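A hook with this three-argument signature might look like the sketch below. The names ``percent_done`` and ``report`` are hypothetical, and the URL in the commented-out call is a placeholder. The sketch only shows how the three values can be combined into a progress figure, including the case where the size is unknown.

```python
def percent_done(block_count, block_size, total_size):
    """Return the completed percentage (0-100), or None if the size is unknown."""
    if total_size <= 0:
        # -1 means no size was reported; 0 would cause a division by zero.
        return None
    transferred = block_count * block_size
    # The last block is usually short, so the product can overshoot the total.
    return min(100, transferred * 100 // total_size)


def report(block_count, block_size, total_size):
    """Hook with the three-argument signature described above."""
    percent = percent_done(block_count, block_size, total_size)
    if percent is not None:
        print('%d%% done' % percent)


# Hypothetical call; the URL and file name are placeholders:
# urllib.urlretrieve('http://www.example.com/big.file', 'big.file', report)
```

Keeping the arithmetic in its own function makes it possible to check the progress calculation without a network connection.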
       
   150 
       
   151    If the *url* uses the :file:`http:` scheme identifier, the optional *data*
       
   152    argument may be given to specify a ``POST`` request (normally the request type
       
   153    is ``GET``).  The *data* argument must be in standard
       
   154    :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
       
   155    function below.
       
   156 
       
   157    .. versionchanged:: 2.5
       
   158       :func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
       
   159       the amount of data available was less than the expected amount (which is the
       
   160       size reported by a *Content-Length* header). This can occur, for example, when
       
   161       the download is interrupted.
       
   162 
       
   163       The *Content-Length* is treated as a lower bound: if there's more data to read,
       
   164       urlretrieve reads more data, but if less data is available, it raises the
       
   165       exception.
       
   166 
       
   167       You can still retrieve the downloaded data in this case; it is stored in the
       
   168       :attr:`content` attribute of the exception instance.
       
   169 
       
   170       If no *Content-Length* header was supplied, urlretrieve cannot check the size
       
   171       of the data it has downloaded, and just returns it.  In this case you just have
       
   172       to assume that the download was successful.
       
   173 
       
   174 
       
   175 .. data:: _urlopener
       
   176 
       
   177    The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
       
   178    of the :class:`FancyURLopener` class and use it to perform their requested
       
   179    actions.  To override this functionality, programmers can create a subclass of
       
   180    :class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
       
   181    class to the ``urllib._urlopener`` variable before calling the desired function.
       
   182    For example, applications may want to specify a different
       
   183    :mailheader:`User-Agent` header than :class:`URLopener` defines.  This can be
       
   184    accomplished with the following code::
       
   185 
       
   186       import urllib
       
   187 
       
   188       class AppURLopener(urllib.FancyURLopener):
       
   189           version = "App/1.7"
       
   190 
       
   191       urllib._urlopener = AppURLopener()
       
   192 
       
   193 
       
   194 .. function:: urlcleanup()
       
   195 
       
   196    Clear the cache that may have been built up by previous calls to
       
   197    :func:`urlretrieve`.
       
   198 
       
   199 
       
   200 Utility functions
       
   201 -----------------
       
   202 
       
   203 .. function:: quote(string[, safe])
       
   204 
       
   205    Replace special characters in *string* using the ``%xx`` escape. Letters,
       
   206    digits, and the characters ``'_.-'`` are never quoted. The optional *safe*
       
   207    parameter specifies additional characters that should not be quoted --- its
       
   208    default value is ``'/'``.
       
   209 
       
   210    Example: ``quote('/~connolly/')`` yields ``'/%7Econnolly/'``.
       
   211 
       
   212 
       
   213 .. function:: quote_plus(string[, safe])
       
   214 
       
   215    Like :func:`quote`, but also replaces spaces by plus signs, as required for
       
   216    quoting HTML form values.  Plus signs in the original string are escaped unless
       
   217    they are included in *safe*.  Unlike :func:`quote`, *safe* does not default to ``'/'``.
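The difference between the two quoting functions can be seen in a short sketch. The sample strings are arbitrary, and the ``try``/``except`` import is an assumption added so the snippet also runs on Python 3, where these functions live in :mod:`urllib.parse`.

```python
try:
    from urllib import quote, quote_plus        # Python 2, as documented here
except ImportError:
    from urllib.parse import quote, quote_plus  # Python 3 location

print(quote('/docs/my file.txt'))        # '/' is left alone by default
print(quote('/docs/my file.txt', ''))    # empty safe: '/' is escaped as well
print(quote_plus('fish & chips+peas'))   # spaces become '+', a literal '+' is escaped
```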
       
   218 
       
   219 
       
   220 .. function:: unquote(string)
       
   221 
       
   222    Replace ``%xx`` escapes by their single-character equivalent.
       
   223 
       
   224    Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.
       
   225 
       
   226 
       
   227 .. function:: unquote_plus(string)
       
   228 
       
   229    Like :func:`unquote`, but also replaces plus signs by spaces, as required for
       
   230    unquoting HTML form values.
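A brief sketch of how the two unquoting functions treat a plus sign. The sample strings are arbitrary, and the fallback import is an assumption added so the snippet also runs on Python 3.

```python
try:
    from urllib import quote_plus, unquote, unquote_plus        # Python 2
except ImportError:
    from urllib.parse import quote_plus, unquote, unquote_plus  # Python 3

print(unquote('/docs/my%20file.txt'))  # %20 becomes a space
print(unquote('a+b%2Bc'))              # '+' is kept as it is
print(unquote_plus('a+b%2Bc'))         # '+' becomes a space, %2B becomes '+'

# quote_plus and unquote_plus undo each other.
original = 'fish & chips+peas'
print(unquote_plus(quote_plus(original)) == original)
```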
       
   231 
       
   232 
       
   233 .. function:: urlencode(query[, doseq])
       
   234 
       
   235    Convert a mapping object or a sequence of two-element tuples to a "url-encoded"
       
   236    string, suitable to pass to :func:`urlopen` above as the optional *data*
       
   237    argument.  This is useful to pass a dictionary of form fields to a ``POST``
       
   238    request.  The resulting string is a series of ``key=value`` pairs separated by
       
   239    ``'&'`` characters, where both *key* and *value* are quoted using
       
   240    :func:`quote_plus` above.  If the optional parameter *doseq* is present and
       
   241    evaluates to true, individual ``key=value`` pairs are generated for each element
       
   242    of the sequence. When a sequence of two-element tuples is used as the *query*
       
   243    argument, the first element of each tuple is a key and the second is a value.
       
   244    The order of parameters in the encoded string will match the order of parameter
       
   245    tuples in the sequence. The :mod:`urlparse` module provides the functions
       
   246    :func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
       
   247    into Python data structures.
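The behaviour described above can be tried with a sequence of tuples, which keeps the parameter order predictable. The field names and values are placeholders, and the fallback import is an assumption added so the snippet also runs on Python 3.

```python
try:
    from urllib import urlencode            # Python 2, as documented here
    from urlparse import parse_qsl
except ImportError:
    from urllib.parse import urlencode, parse_qsl  # Python 3 location

# A sequence of pairs is encoded in the order given.
pairs = [('spam', 1), ('eggs', 'two words'), ('bacon', 'a&b')]
query = urlencode(pairs)
print(query)

# With doseq true, each element of a sequence value gets its own pair.
tags = urlencode([('tag', ['a', 'b'])], True)
print(tags)

# parse_qsl turns the query string back into (key, value) string pairs.
print(parse_qsl(query))
```

Note that values come back from :func:`parse_qsl` as strings, so the integer ``1`` is returned as ``'1'``.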
       
   248 
       
   249 
       
   250 .. function:: pathname2url(path)
       
   251 
       
   252    Convert the pathname *path* from the local syntax for a path to the form used in
       
   253    the path component of a URL.  This does not produce a complete URL.  The return
       
   254    value will already be quoted using the :func:`quote` function.
       
   255 
       
   256 
       
   257 .. function:: url2pathname(path)
       
   258 
       
   259    Convert the path component *path* from an encoded URL to the local syntax for a
       
   260    path.  This does not accept a complete URL.  This function uses :func:`unquote`
       
   261    to decode *path*.
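The two conversion functions are meant to undo each other, as the following sketch suggests. The path is a placeholder, the exact form of the converted path depends on the platform and Python version, and the fallback import is an assumption added so the snippet also runs on Python 3.

```python
import os
try:
    from urllib import pathname2url, url2pathname          # Python 2
except ImportError:
    from urllib.request import pathname2url, url2pathname  # Python 3

# Build a path in the local syntax; the space has to be quoted in a URL.
path = os.path.join(os.sep, 'tmp', 'my file.txt')
url_path = pathname2url(path)
print(url_path)

# Converting back yields the original local path.
print(url2pathname(url_path) == path)
```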
       
   262 
       
   263 
       
   264 URL Opener objects
       
   265 ------------------
       
   266 
       
   267 .. class:: URLopener([proxies[, **x509]])
       
   268 
       
   269    Base class for opening and reading URLs.  Unless you need to support opening
       
   270    objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
       
   271    you probably want to use :class:`FancyURLopener`.
       
   272 
       
   273    By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
       
   274    of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number.
       
   275    Applications can define their own :mailheader:`User-Agent` header by subclassing
       
   276    :class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
       
   277    :attr:`version` to an appropriate string value in the subclass definition.
       
   278 
       
   279    The optional *proxies* parameter should be a dictionary mapping scheme names to
       
   280    proxy URLs, where an empty dictionary turns proxies off completely.  Its default
       
   281    value is ``None``, in which case environmental proxy settings will be used if
       
   282    present, as discussed in the definition of :func:`urlopen`, above.
       
   283 
       
   284    Additional keyword parameters, collected in *x509*, may be used for
       
   285    authentication of the client when using the :file:`https:` scheme.  The keywords
       
   286    *key_file* and *cert_file* are supported to provide an SSL key and certificate;
       
   287    both are needed to support client authentication.
       
   288 
       
   289    :class:`URLopener` objects will raise an :exc:`IOError` exception if the server
       
   290    returns an error code.
       
   291 
       
   292     .. method:: open(fullurl[, data])
       
   293 
       
   294        Open *fullurl* using the appropriate protocol.  This method sets up cache and
       
   295        proxy information, then calls the appropriate open method with its input
       
   296        arguments.  If the scheme is not recognized, :meth:`open_unknown` is called.
       
   297        The *data* argument has the same meaning as the *data* argument of
       
   298        :func:`urlopen`.
       
   299 
       
   300 
       
   301     .. method:: open_unknown(fullurl[, data])
       
   302 
       
   303        Overridable interface to open unknown URL types.
       
   304 
       
   305 
       
   306     .. method:: retrieve(url[, filename[, reporthook[, data]]])
       
   307 
       
   308        Retrieves the contents of *url* and places it in *filename*.  The return value
       
   309        is a tuple consisting of a local filename and either a
       
   310        :class:`mimetools.Message` object containing the response headers (for remote
       
   311        URLs) or ``None`` (for local URLs).  The caller must then open and read the
       
   312        contents of *filename*.  If *filename* is not given and the URL refers to a
       
   313        local file, the input filename is returned.  If the URL is non-local and
       
   314        *filename* is not given, the filename is the output of :func:`tempfile.mktemp`
       
   315        with a suffix that matches the suffix of the last path component of the input
       
   316        URL.  If *reporthook* is given, it must be a function accepting three numeric
       
   317        parameters.  It will be called after each chunk of data is read from the
       
   318        network.  *reporthook* is ignored for local URLs.
       
   319 
       
   320        If the *url* uses the :file:`http:` scheme identifier, the optional *data*
       
   321        argument may be given to specify a ``POST`` request (normally the request type
       
   322        is ``GET``).  The *data* argument must be in standard
       
   323        :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
       
   324        function below.
       
   325 
       
   326 
       
   327     .. attribute:: version
       
   328 
       
   329        Variable that specifies the user agent of the opener object.  To get
       
   330        :mod:`urllib` to tell servers that it is a particular user agent, set this in a
       
   331        subclass as a class variable or in the constructor before calling the base
       
   332        constructor.
       
   333 
       
   334 
       
   335 .. class:: FancyURLopener(...)
       
   336 
       
   337    :class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
       
   338    for the following HTTP response codes: 301, 302, 303, 307 and 401.  For the 30x
       
   339    response codes listed above, the :mailheader:`Location` header is used to fetch
       
   340    the actual URL.  For 401 response codes (authentication required), basic HTTP
       
   341    authentication is performed.  For the 30x response codes, recursion is bounded
       
   342    by the value of the *maxtries* attribute, which defaults to 10.
       
   343 
       
   344    For all other response codes, the method :meth:`http_error_default` is called,
       
   345    which you can override in subclasses to handle the error appropriately.
       
   346 
       
   347    .. note::
       
   348 
       
   349       According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
       
   350       must not be automatically redirected without confirmation by the user.  In
       
   351       reality, browsers do allow automatic redirection of these responses, changing
       
   352       the POST to a GET, and :mod:`urllib` reproduces this behaviour.
       
   353 
       
   354    The parameters to the constructor are the same as those for :class:`URLopener`.
       
   355 
       
   356    .. note::
       
   357 
       
   358       When performing basic authentication, a :class:`FancyURLopener` instance calls
       
   359       its :meth:`prompt_user_passwd` method.  The default implementation asks the
       
   360       users for the required information on the controlling terminal.  A subclass may
       
   361       override this method to support more appropriate behavior if needed.
       
   362 
       
   363     The :class:`FancyURLopener` class offers one additional method that should be
       
   364     overridden to provide the appropriate behavior:
       
   365 
       
   366     .. method:: prompt_user_passwd(host, realm)
       
   367 
       
   368        Return information needed to authenticate the user at the given host in the
       
   369        specified security realm.  The return value should be a tuple, ``(user,
       
   370        password)``, which can be used for basic authentication.
       
   371 
       
   372        The implementation prompts for this information on the terminal; an application
       
   373        should override this method to use an appropriate interaction model in the local
       
   374        environment.
       
   375 
       
   376 .. exception:: ContentTooShortError(msg[, content])
       
   377 
       
   378    This exception is raised when the :func:`urlretrieve` function detects that the
       
   379    amount of the downloaded data is less than the expected amount (given by the
       
   380    *Content-Length* header). The :attr:`content` attribute stores the downloaded
       
   381    (and supposedly truncated) data.
       
   382 
       
   383    .. versionadded:: 2.5
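One way to handle a short download is sketched below. The helper ``retrieve_or_partial`` and the stand-in ``interrupted`` are hypothetical names used only for illustration; the stand-in raises the exception directly, with placeholder content, so that no network access is needed. The fallback import is an assumption added so the snippet also runs on Python 3.

```python
try:
    from urllib import ContentTooShortError        # Python 2, as documented here
except ImportError:
    from urllib.error import ContentTooShortError  # Python 3 location


def retrieve_or_partial(retrieve, url):
    """Call retrieve(url); on a short download return the partial result.

    *retrieve* is any callable that behaves like urlretrieve.
    Returns a (complete, result) pair.
    """
    try:
        return True, retrieve(url)
    except ContentTooShortError as exc:
        # What was obtained before the interruption is kept on the exception.
        return False, exc.content


def interrupted(url):
    """Stand-in for urlretrieve that simulates an interrupted download."""
    raise ContentTooShortError('retrieval incomplete', 'partial data')


print(retrieve_or_partial(interrupted, 'http://www.example.com/'))
```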
       
   384 
       
   385 
       
   386 :mod:`urllib` Restrictions
       
   387 --------------------------
       
   388 
       
   389   .. index::
       
   390      pair: HTTP; protocol
       
   391      pair: FTP; protocol
       
   392 
       
   393 * Currently, only the following protocols are supported: HTTP (versions 0.9 and
       
   394   1.0), FTP, and local files.
       
   395 
       
   396 * The caching feature of :func:`urlretrieve` has been disabled until proper
       
   397   processing of expiration time headers is implemented.
       
   398 
       
   399 * There should be a function to query whether a particular URL is in the cache.
       
   400 
       
   401 * For backward compatibility, if a URL appears to point to a local file but the
       
   402   file can't be opened, the URL is re-interpreted using the FTP protocol.  This
       
   403   can sometimes cause confusing error messages.
       
   404 
       
   405 * The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
       
   406   long delays while waiting for a network connection to be set up.  This means
       
   407   that it is difficult to build an interactive Web client using these functions
       
   408   without using threads.
       
   409 
       
   410   .. index::
       
   411      single: HTML
       
   412      pair: HTTP; protocol
       
   413      module: htmllib
       
   414 
       
   415 * The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
       
   416   returned by the server.  This may be binary data (such as an image), plain text
       
   417   or (for example) HTML.  The HTTP protocol provides type information in the reply
       
   418   header, which can be inspected by looking at the :mailheader:`Content-Type`
       
   419   header.  If the returned data is HTML, you can use the module :mod:`htmllib` to
       
   420   parse it.
       
   421 
       
   422   .. index:: single: FTP
       
   423 
       
   424 * The code handling the FTP protocol cannot differentiate between a file and a
       
   425   directory.  This can lead to unexpected behavior when attempting to read a URL
       
   426   that points to a file that is not accessible.  If the URL ends in a ``/``, it is
       
   427   assumed to refer to a directory and will be handled accordingly.  But if an
       
   428   attempt to read a file leads to a 550 error (meaning the URL cannot be found or
       
   429   is not accessible, often for permission reasons), then the path is treated as a
       
   430   directory in order to handle the case when a directory is specified by a URL but
       
   431   the trailing ``/`` has been left off.  This can cause misleading results when
       
   432   you try to fetch a file whose read permissions make it inaccessible; the FTP
       
   433   code will try to read it, fail with a 550 error, and then perform a directory
       
   434   listing for the unreadable file. If fine-grained control is needed, consider
       
   435   using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
       
   436   *_urlopener* to meet your needs.
       
   437 
       
   438 * This module does not support the use of proxies which require authentication.
       
   439   This may be implemented in the future.
       
   440 
       
   441   .. index:: module: urlparse
       
   442 
       
   443 * Although the :mod:`urllib` module contains (undocumented) routines to parse
       
   444   and unparse URL strings, the recommended interface for URL manipulation is in
       
   445   module :mod:`urlparse`.
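One of the restrictions above notes that :func:`urlopen` can block for a long time, and that threads are the usual way around it. A minimal sketch of that approach follows. The ``fetch`` function is a hypothetical name, and a temporary local file stands in for a slow remote resource so that the snippet needs no network. The fallback imports are an assumption added so the snippet also runs on Python 3.

```python
import os
import tempfile
import threading
try:
    import Queue as queue                             # Python 2
    from urllib import urlopen, pathname2url
except ImportError:
    import queue                                      # Python 3
    from urllib.request import urlopen, pathname2url


def fetch(url, results):
    """Worker: read *url* completely and hand the outcome to *results*."""
    try:
        f = urlopen(url)
        try:
            results.put((url, f.read()))
        finally:
            f.close()
    except IOError as exc:
        # Pass the failure to the main thread instead of losing it.
        results.put((url, exc))


# A local file stands in for a slow remote resource in this sketch.
fd, path = tempfile.mkstemp()
os.write(fd, b'payload')
os.close(fd)

results = queue.Queue()
worker = threading.Thread(target=fetch,
                          args=('file:' + pathname2url(path), results))
worker.start()
# The main thread stays free here; it only waits when it needs the result.
url, body = results.get(timeout=10)
worker.join()
os.remove(path)
print(len(body))
```

Passing exceptions through the queue lets the caller decide how to report a failed fetch.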
       
   446 
       
   447 
       
   448 .. _urllib-examples:
       
   449 
       
   450 Examples
       
   451 --------
       
   452 
       
   453 Here is an example session that uses the ``GET`` method to retrieve a URL
       
   454 containing parameters::
       
   455 
       
   456    >>> import urllib
       
   457    >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
       
   458    >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
       
   459    >>> print f.read()
       
   460 
       
   461 The following example uses the ``POST`` method instead::
       
   462 
       
   463    >>> import urllib
       
   464    >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
       
   465    >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
       
   466    >>> print f.read()
       
   467 
       
   468 The following example uses an explicitly specified HTTP proxy, overriding
       
   469 environment settings::
       
   470 
       
   471    >>> import urllib
       
   472    >>> proxies = {'http': 'http://proxy.example.com:8080/'}
       
   473    >>> opener = urllib.FancyURLopener(proxies)
       
   474    >>> f = opener.open("http://www.python.org")
       
   475    >>> f.read()
       
   476 
       
   477 The following example uses no proxies at all, overriding environment settings::
       
   478 
       
   479    >>> import urllib
       
   480    >>> opener = urllib.FancyURLopener({})
       
   481    >>> f = opener.open("http://www.python.org/")
       
   482    >>> f.read()
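The following sketch reads a local file through a ``file:`` URL, which needs no network access, and inspects the headers returned by :meth:`info`. The temporary file and its content are placeholders, and the fallback import is an assumption added so the snippet also runs on Python 3.

```python
import os
import tempfile
try:
    from urllib import urlopen, pathname2url          # Python 2, as documented here
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3 location

# Create a small temporary file to read back (placeholder content).
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'hello world')
os.close(fd)

f = urlopen('file:' + pathname2url(path))
try:
    data = f.read()
    # For local files the headers include a Content-Length giving the file size.
    length = f.info()['Content-Length']
finally:
    f.close()
    os.remove(path)

print(len(data))
print(length)
```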
       
   483