20.5. urllib — Open arbitrary resources by URL

Note

In Python 3, the urllib module has been split into urllib.request, urllib.parse, and urllib.error. The 2to3 tool will automatically adapt imports when converting your sources to Python 3. Note that the Python 3 urllib.request.urlopen() function is equivalent to urllib2.urlopen(), and that urllib.urlopen() has been removed in Python 3.

This module provides a high-level interface for fetching data across the World Wide Web. In particular, the urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of filenames. Some restrictions apply: it can only open URLs for reading, and no seek operations are available.

Warning

When opening HTTPS URLs, it does not attempt to validate the server certificate. Use at your own risk!

20.5.1. High-level interface

urllib.urlopen(url[, data[, proxies]])

Open a network object denoted by a URL for reading. If the URL does not have a scheme identifier, or if it has file: as its scheme identifier, this opens a local file (without universal newlines); otherwise it opens a socket to a server somewhere on the network. If the connection cannot be made, an IOError exception is raised. If all goes well, a file-like object is returned. This supports the following methods: read(), readline(), readlines(), fileno(), close(), info(), getcode() and geturl(). It also has proper support for the iterator protocol. One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.
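Because a single read() on a socket-backed object may return early, a common pattern is to loop over fixed-size chunks until an empty result signals EOF. The sketch below is hypothetical; io.BytesIO stands in for the object urlopen() would return, so it runs without network access.

```python
import io

def read_all(fobj, chunk_size=8192):
    # Keep reading until read() returns an empty string/bytes (EOF);
    # a single read() call on a socket may return fewer bytes than exist.
    chunks = []
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)

# In real use: f = urllib.urlopen("http://www.example.com/"); data = read_all(f)
data = read_all(io.BytesIO(b"x" * 20000))
```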

info()getcode()geturl()之外,其他的函数与file对象的同名函数有相同的interface——参见本手册的File Objects部分。(它并不是内置对象,所以它不能被用于极少数必须使用真正的内置对象的地方。)

The info() method returns an instance of the class mimetools.Message containing meta-information associated with the URL. When the method is HTTP, these headers are those returned by the server at the head of the retrieved HTML page (including Content-Length and Content-Type). When the method is FTP, a Content-Length header will be present if (as is now usual) the server passed back a file length in response to the FTP retrieval request. A Content-Type header will be present if the MIME type can be guessed. When the method is local-file, returned headers will include a Date representing the file's last-modified time, a Content-Length giving file size, and a Content-Type containing a guess at the file's type. See also the description of the mimetools module.

The geturl() method returns the real URL of the page. In some cases, the HTTP server redirects a client to another URL. The urlopen() function handles this transparently, but sometimes the caller needs to know which URL the client was redirected to. The geturl() method can be used to get at this redirected URL.

The getcode() method returns None if the URL is not an HTTP URL, and otherwise returns the HTTP status code that was sent with the response.

If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see the urlencode() function below.

The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy or ftp_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter. For example (the '%' is the command prompt):

% http_proxy="http://www.someproxy.com:3128"
% export http_proxy
% python
...

The no_proxy environment variable can be used to specify hosts which should not be reached via proxy; if set, it should be a comma-separated list of hostname suffixes, optionally with :port appended, for example cern.ch,ncsa.uiuc.edu,some.host:8080.

In a Windows environment, if no proxy environment variables are set, proxy settings are obtained from the registry's Internet Settings section.

In a Mac OS X environment, urlopen() will retrieve proxy information from the OS X System Configuration Framework, which can be managed with the Network System Preferences panel.

Alternatively, the optional proxies argument may be used to explicitly specify proxies. It must be a dictionary mapping scheme names to proxy URLs, where an empty dictionary causes no proxies to be used, and None (the default value) causes environmental proxy settings to be used as discussed above. For example:

# Use http://www.someproxy.com:3128 for http proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)
# Don't use any proxies
filehandle = urllib.urlopen(some_url, proxies={})
# Use proxies from environment - both versions are equivalent
filehandle = urllib.urlopen(some_url, proxies=None)
filehandle = urllib.urlopen(some_url)

Proxies which require authentication for use are not currently supported; this is considered an implementation limitation.

Changed in version 2.3: Added the proxies support.

Changed in version 2.6: Added getcode() to the returned object and support for the no_proxy environment variable.

Deprecated since version 2.6: The urlopen() function has been removed in Python 3 in favor of urllib2.urlopen().

urllib.urlretrieve(url[, filename[, reporthook[, data]]])

Copy a network object denoted by a URL to a local file, if necessary. If the URL points to a local file, or a valid cached copy of the object exists, the object is not copied. Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a possibly cached remote object). Exceptions are the same as for urlopen().

The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name). The third argument, if present, is a hook function that will be called once on establishment of the network connection and once after each block read thereafter. The hook will be passed three arguments: a count of blocks transferred so far, a block size in bytes, and the total size of the file. The third argument may be -1 on older FTP servers which do not return a file size in response to a retrieval request.
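A minimal sketch of such a hook, assuming the three-argument calling convention described above. The helper name and the return value are illustrative (urlretrieve() ignores the hook's return value); the urlretrieve() call itself is commented out to avoid network access.

```python
def report_progress(block_count, block_size, total_size):
    """Called once when the connection is made and once per block read."""
    if total_size > 0:
        # Cap at 100: the final block is usually only partially filled.
        return min(100, block_count * block_size * 100 // total_size)
    # Older FTP servers may report -1 when the total size is unknown;
    # fall back to the number of bytes seen so far.
    return block_count * block_size

# In real use (Python 2):
# filename, headers = urllib.urlretrieve(url, "out.dat", report_progress)
```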

If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see the urlencode() function below.

Changed in version 2.5: urlretrieve() will raise ContentTooShortError when it detects that the amount of data available was less than the expected amount (which is the size reported by a Content-Length header). This can occur, for example, when the download is interrupted.

The Content-Length is treated as a lower bound: if there is more data to read, urlretrieve() reads more data, but if less data is available, it raises the exception.

You can still retrieve the downloaded data in this case; it is stored in the content attribute of the exception instance.

If no Content-Length header was supplied, urlretrieve() can not check the size of the data it has downloaded, and just returns it. In this case you just have to assume that the download was successful.
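Recovering the partial data might look like the sketch below. The import falls back so the snippet runs on both Python 2 (urllib) and Python 3 (urllib.error); the exception is raised by hand here, standing in for an interrupted urlretrieve() call, so no network access is needed.

```python
try:
    from urllib.error import ContentTooShortError  # Python 3
except ImportError:
    from urllib import ContentTooShortError        # Python 2

try:
    # In real use this would be:
    # urllib.urlretrieve("http://www.example.com/big.bin", "big.bin")
    raise ContentTooShortError("retrieval incomplete", b"partial data")
except ContentTooShortError as e:
    # Whatever was downloaded before the error is kept on e.content.
    recovered = e.content
```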

urllib._urlopener

The public functions urlopen() and urlretrieve() create an instance of the FancyURLopener class and use it to perform their requested actions. In order to override this functionality, programmers can create a subclass of URLopener or FancyURLopener, then assign an instance of that class to the urllib._urlopener variable before calling the desired function. For example, applications may want to specify a different User-Agent header than URLopener defines. This can be accomplished with the following code:

import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "App/1.7"

urllib._urlopener = AppURLopener()

urllib.urlcleanup()

Clean up the cache that may have been built up by previous calls to urlretrieve().

20.5.2. Utility functions

urllib.quote(string[, safe])

Replace special characters in string using the %xx escape. Letters, digits, and the characters '_.-' are never quoted. By default, this function is intended for quoting the path section of a URL. The optional safe parameter specifies additional characters that should not be quoted; its default value is '/'.

Example: quote('/~connolly/') yields '/%7econnolly/'.

urllib.quote_plus(string[, safe])

quote()一样,也可以通过加号替换空格,这是在构建查询字符串以进入URL时引用HTML表单值所需的。除了包含在safe中,原始字符串中的加号将被转义。它也没有safe默认为'/'

urllib.unquote(string)

Replace %xx escapes by their single-character equivalent.

Example: unquote('/%7Econnolly/') yields '/~connolly/'.

urllib.unquote_plus(string)

unquote()一样,也可以根据需要取代HTML表单值的空格替换加号。

urllib.urlencode(query[, doseq])

Convert a mapping object or a sequence of two-element tuples to a "percent-encoded" string, suitable to pass to urlopen() above as the optional data argument. This is useful to pass a dictionary of form fields to a POST request. The resulting string is a series of key=value pairs separated by '&' characters, where both key and value are quoted using quote_plus() above. When a sequence of two-element tuples is used as the query argument, the first element of each tuple is a key and the second is a value. The value element in itself can be a sequence and in that case, if the optional parameter doseq evaluates to True, individual key=value pairs separated by '&' are generated for each element of the value sequence for the key. The order of parameters in the encoded string will match the order of parameter tuples in the sequence. The urlparse module provides the functions parse_qs() and parse_qsl() which are used to parse query strings into Python data structures.
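The doseq behaviour can be seen in a short sketch. A list of tuples is used instead of a dict so the parameter order is deterministic; the compat import makes the snippet runnable on both Python 2 and 3.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

# Ordinary key/value pairs: order follows the sequence.
flat = urlencode([('spam', 1), ('eggs', 2)])

# A sequence value with doseq=True expands into one pair per element.
multi = urlencode([('tag', ['a', 'b'])], doseq=True)
```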

urllib.pathname2url(path)

Convert the pathname path from the local syntax for a path to the form used in the path component of a URL. This does not produce a complete URL. The return value will already be quoted using the quote() function.

urllib.url2pathname(path)

Convert the path component path from a percent-encoded URL to the local syntax for a path. This does not accept a complete URL. This function uses unquote() to decode path.
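The two functions are inverses of each other, as the sketch below shows. The import location differs between versions (urllib.request on Python 3), and the result is platform dependent; this example assumes a POSIX-style path.

```python
try:
    from urllib.request import pathname2url, url2pathname  # Python 3
except ImportError:
    from urllib import pathname2url, url2pathname          # Python 2

# On POSIX this simply percent-quotes the path; it is not a complete URL.
url_path = pathname2url('/tmp/some file.txt')

# url2pathname() undoes the quoting.
assert url2pathname(url_path) == '/tmp/some file.txt'
```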

urllib.getproxies()

This helper function returns a dictionary of scheme to proxy server URL mappings. It scans the environment for variables named <scheme>_proxy, in a case-insensitive way, for all operating systems first; when it cannot find one, it looks for proxy information in the Mac OS X System Configuration framework (on Mac OS X) or the Windows Systems Registry (on Windows).
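The environment scan can be demonstrated directly. The proxy URL below is a placeholder, and the compat import covers the Python 3 move to urllib.request; environment variables take precedence over any platform-specific sources.

```python
import os

try:
    from urllib.request import getproxies  # Python 3
except ImportError:
    from urllib import getproxies          # Python 2

# Hypothetical proxy, set only for this demonstration.
os.environ['http_proxy'] = 'http://www.someproxy.com:3128'
proxies = getproxies()
```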

Note

urllib also exposes certain utility functions like splittype, splithost and others for parsing URLs into various components. But it is recommended to use urlparse for parsing URLs rather than using these functions directly. Python 3 does not expose these helper functions from the urllib.parse module.

20.5.3. URL Opener objects

class urllib.URLopener([proxies[, **x509]])

Base class for opening and reading URLs. Unless you need to support opening objects using schemes other than http:, ftp:, or file:, you probably want to use FancyURLopener.

By default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number. Applications can define their own User-Agent header by subclassing URLopener or FancyURLopener and setting the class attribute version to an appropriate string value in the subclass definition.

The optional proxies parameter should be a dictionary mapping scheme names to proxy URLs, where an empty dictionary turns proxies off completely. Its default value is None, in which case environmental proxy settings will be used if present, as discussed in the definition of urlopen(), above.

Additional keyword parameters, collected in x509, may be used for authentication of the client when using the https: scheme. The keywords key_file and cert_file are supported to provide an SSL key and certificate; both are needed to support client authentication.

URLopener objects will raise an IOError exception if the server returns an error code.

open(fullurl[, data])

Open fullurl using the appropriate protocol. This method sets up cache and proxy information, then calls the appropriate open method with its input arguments. If the scheme is not recognized, open_unknown() is called. The data argument has the same meaning as the data argument of urlopen().

open_unknown(fullurl[, data])

Overridable interface to open unknown URL types.

retrieve(url[, filename[, reporthook[, data]]])

Retrieves the contents of url and places them in filename. The return value is a tuple consisting of a local filename and either a mimetools.Message object containing the response headers (for remote URLs) or None (for local URLs). The caller must then open and read the contents of filename. If filename is not given and the URL refers to a local file, the input filename is returned. If the URL is non-local and filename is not given, the filename is the output of tempfile.mktemp() with a suffix that matches the suffix of the last path component of the input URL. If reporthook is given, it must be a function accepting three numeric parameters. It will be called after each chunk of data is read from the network. reporthook is ignored for local URLs.

If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see the urlencode() function below.

version

Variable that specifies the user agent of the opener object. To get urllib to tell servers that it is a particular user agent, set this in a subclass as a class variable or in the constructor before calling the base constructor.

class urllib.FancyURLopener(...)

FancyURLopener subclasses URLopener providing default handling for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x response codes listed above, the Location header is used to fetch the actual URL. For 401 response codes (authentication required), basic HTTP authentication is performed. For the 30x response codes, recursion is bounded by the value of the maxtries attribute, which defaults to 10.

For all other response codes, the method http_error_default() is called which you can override in subclasses to handle the error appropriately.

Note

According to the letter of RFC 2616, 301 and 302 responses to POST requests must not be automatically redirected without confirmation by the user. In reality, browsers do allow automatic redirection of these responses, changing the POST to a GET, and urllib reproduces this behaviour.

The parameters to the constructor are the same as those for URLopener.

Note

When performing basic authentication, a FancyURLopener instance calls its prompt_user_passwd() method. The default implementation asks the users for the required information on the controlling terminal. A subclass may override this method to support more appropriate behavior if needed.

The FancyURLopener class offers one additional method that should be overloaded to provide the appropriate behavior:

prompt_user_passwd(host, realm)

Return information needed to authenticate the user at the given host in the specified security realm. The return value should be a tuple, (user, password), which can be used for basic authentication.

The implementation prompts for this information on the terminal; an application should override this method to use an appropriate interaction model in the local environment.

exception urllib.ContentTooShortError(msg[, content])

This exception is raised when the urlretrieve() function detects that the amount of the downloaded data is less than the expected amount (given by the Content-Length header). The content attribute stores the downloaded (and supposedly truncated) data.

New in version 2.5.

20.5.4. urllib Restrictions

  • Currently, only the following protocols are supported: HTTP (versions 0.9 and 1.0), FTP, and local files.

  • The caching feature of urlretrieve() has been disabled until I find the time to hack proper processing of Expiration time headers.

  • There should be a function to query whether a particular URL is in the cache.

  • For backward compatibility, if a URL appears to point to a local file but the file can’t be opened, the URL is re-interpreted using the FTP protocol. This can sometimes cause confusing error messages.

  • The urlopen() and urlretrieve() functions can cause arbitrarily long delays while waiting for a network connection to be set up. This means that it is difficult to build an interactive Web client using these functions without using threads.

  • The data returned by urlopen() or urlretrieve() is the raw data returned by the server. This may be binary data (such as an image), plain text or (for example) HTML. The HTTP protocol provides type information in the reply header, which can be inspected by looking at the Content-Type header. If the returned data is HTML, you can use the module htmllib to parse it.

  • The code handling the FTP protocol cannot differentiate between a file and a directory. This can lead to unexpected behavior when attempting to read a URL that points to a file that is not accessible. If the URL ends in a /, it is assumed to refer to a directory and will be handled accordingly. But if an attempt to read a file leads to a 550 error (meaning the URL cannot be found or is not accessible, often for permission reasons), then the path is treated as a directory in order to handle the case when a directory is specified by a URL but the trailing / has been left off. This can cause misleading results when you try to fetch a file whose read permissions make it inaccessible; the FTP code will try to read it, fail with a 550 error, and then perform a directory listing for the unreadable file. If fine-grained control is needed, consider using the ftplib module, subclassing FancyURLopener, or changing _urlopener to meet your needs.

  • This module does not support the use of proxies which require authentication. This may be implemented in the future.

  • Although the urllib module contains (undocumented) routines to parse and unparse URL strings, the recommended interface for URL manipulation is in module urlparse.

20.5.5. Examples

Here is an example session that uses the GET method to retrieve a URL containing parameters:

>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
>>> print f.read()

The following example uses the POST method instead:

>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
>>> print f.read()

The following example uses an explicitly specified HTTP proxy, overriding environment settings:

>>> import urllib
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.FancyURLopener(proxies)
>>> f = opener.open("http://www.python.org")
>>> f.read()

The following example uses no proxies at all, overriding environment settings:

>>> import urllib
>>> opener = urllib.FancyURLopener({})
>>> f = opener.open("http://www.python.org/")
>>> f.read()