Downloading and processing files and images

Scrapy provides reusable item pipelines for downloading files attached to a particular item (for example, when you scrape products and also want to download their images locally). These pipelines share a bit of functionality and structure (we refer to them as media pipelines), but typically you'll either use the Files Pipeline or the Images Pipeline.

Both pipelines implement these features:

  • Avoid re-downloading media that was downloaded recently
  • Specifying where to store the media (filesystem directory, Amazon S3 bucket)

The Images Pipeline has a few extra functions for processing images:

  • Convert all downloaded images to a common format (JPG) and mode (RGB)
  • Thumbnail generation
  • Check images width/height to make sure they meet a minimum constraint

The pipelines also keep an internal queue of those media URLs which are currently being scheduled for download, and connect the responses that arrive containing the same media to that queue. This avoids downloading the same media more than once when it is shared by several items.

Using the Files Pipeline

The typical workflow, when using the FilesPipeline, goes like this:

  1. In a Spider, you scrape an item and put the URLs of the files you want to download into a file_urls field.
  2. The item is returned from the spider and goes to the item pipeline.
  3. When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains "locked" at that particular pipeline stage until the files have finished downloading (or failed for some reason).
  4. When the files are downloaded, another field (files) will be populated with the results. This field will contain a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped url (taken from the file_urls field), and the file checksum. The files in the files field will retain the same order as the original file_urls field. If some file failed downloading, an error will be logged and the file won't be present in the files field.
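For illustration, here is a minimal spider sketch that feeds such items to the pipeline. The spider name, start URL and CSS selector are assumptions for the example, not part of Scrapy:

import scrapy

class PdfSpider(scrapy.Spider):
    name = 'pdf_example'  # hypothetical name for this sketch
    start_urls = ['http://www.example.com/files/']  # hypothetical URL

    def parse(self, response):
        # Put the absolute URLs of the files to download into the
        # file_urls field; the FilesPipeline picks them up from there.
        yield {
            'file_urls': [response.urljoin(href) for href in
                          response.css('a[href$=".pdf"]::attr(href)').extract()],
        }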

Using the Images Pipeline

Using the ImagesPipeline is a lot like using the FilesPipeline, except that the default field names used are different: you use image_urls for the image URLs of an item, and it will populate an images field with the information about the downloaded images.

The advantage of using the ImagesPipeline for image files is that you can configure some extra functions like generating thumbnails and filtering the images based on their size.

The Images Pipeline uses Pillow for thumbnailing and normalizing images to JPEG/RGB format, so you need to install this library in order to use it. The Python Imaging Library (PIL) should also work in most cases, but it is known to cause troubles in some setups, so we recommend using Pillow instead of PIL.

Enabling your Media Pipeline

To enable your media pipeline you must first add it to your project ITEM_PIPELINES setting.

For Images Pipeline, use:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

For Files Pipeline, use:

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}

Note

You can also use both the Files and Images Pipeline at the same time.
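For example, to enable both at once (the values determine the order in which the pipelines run; lower values run first):

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'scrapy.pipelines.images.ImagesPipeline': 2,
}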

Then, configure the target storage setting to a valid value that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.

For the Files Pipeline, set the FILES_STORE setting:

FILES_STORE = '/path/to/valid/dir'

For the Images Pipeline, set the IMAGES_STORE setting:

IMAGES_STORE = '/path/to/valid/dir'

Supported Storage

File system is currently the only officially supported storage, but there is also support for storing files in Amazon S3.

File system storage

The files are stored using a SHA1 hash of their URLs for the file names.

For example, the following image URL:

http://www.example.com/image.jpg

Whose SHA1 hash is:

3afec3b4765f8f0a07b78f98c07b83f013567a0a

Will be downloaded and stored in the following file:

<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg

Where:

  • <IMAGES_STORE> is the directory defined in the IMAGES_STORE setting for the Images Pipeline.
  • full is a sub-directory to separate full images from thumbnails (if used). For more info see Thumbnail generation for images.
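You can reproduce the hashing step with a few lines of Python. This is a sketch of the documented behaviour (SHA1 hash of the URL); the pipeline itself also takes care of keeping the proper file extension:

import hashlib

url = 'http://www.example.com/image.jpg'
# The SHA1 hash of the URL bytes becomes the stored file name
print(hashlib.sha1(url.encode('utf8')).hexdigest())
# -> 3afec3b4765f8f0a07b78f98c07b83f013567a0a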

Amazon S3 storage

FILES_STORE and IMAGES_STORE can represent an Amazon S3 bucket. Scrapy will automatically upload the files to the bucket.

For example, this is a valid IMAGES_STORE value:

IMAGES_STORE = 's3://bucket/images'

You can modify the Access Control List (ACL) policy used for the stored files, which is defined by the FILES_STORE_S3_ACL and IMAGES_STORE_S3_ACL settings. By default, the ACL is set to private. To make the files publicly available use the public-read policy:

IMAGES_STORE_S3_ACL = 'public-read'

For more information, see canned ACLs in the Amazon S3 Developer Guide.
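Note that storing to S3 also requires credentials. Scrapy reads them from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY settings; the values below are placeholders:

AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'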

Usage example

In order to use a media pipeline, first enable it.

Then, if a spider returns a dict with the URLs key (file_urls or image_urls, for the Files or Images Pipeline respectively), the pipeline will put the results under the respective key (files or images).
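For example, a spider callback could simply yield a plain dict (the CSS selector is an illustrative assumption):

def parse(self, response):
    # A plain dict works; no Item class is required.
    yield {
        'image_urls': [response.urljoin(src) for src in
                       response.css('img::attr(src)').extract()],
    }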

If you prefer to use Items, define a custom item with the necessary fields, like in this example for the Images Pipeline:

import scrapy

class MyItem(scrapy.Item):

    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()

If you want to use another field name for the URLs key or for the results key, it is also possible to override them.

For the Files Pipeline, set the FILES_URLS_FIELD and/or FILES_RESULT_FIELD settings:

FILES_URLS_FIELD = 'field_name_for_your_files_urls'
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'

For the Images Pipeline, set the IMAGES_URLS_FIELD and/or IMAGES_RESULT_FIELD settings:

IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
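As a sketch, with hypothetical field names photo_urls and photos, the settings and the matching item would look like this:

# settings.py
IMAGES_URLS_FIELD = 'photo_urls'
IMAGES_RESULT_FIELD = 'photos'

# items.py
import scrapy

class ProductItem(scrapy.Item):
    photo_urls = scrapy.Field()  # read by the Images Pipeline
    photos = scrapy.Field()      # populated with the download results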

If you need something more complex and want to override the custom pipeline behaviour, see Extending the Media Pipelines.

If you have multiple image pipelines inheriting from ImagesPipeline and you want to have different settings in different pipelines, you can set setting keys preceded by the uppercase name of the pipeline class. E.g. if your pipeline is called MyPipeline and you want to have a custom IMAGES_URLS_FIELD, you define the setting MYPIPELINE_IMAGES_URLS_FIELD and your custom settings will be used.
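For instance (MyPipeline being the hypothetical subclass mentioned above):

MYPIPELINE_IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'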

Additional features

File expiration

Both pipelines avoid downloading files that were downloaded recently. To adjust this retention delay use the FILES_EXPIRES setting (or IMAGES_EXPIRES, in the case of the Images Pipeline), which specifies the delay in number of days:

# 120 days of delay for files expiration
FILES_EXPIRES = 120

# 30 days of delay for images expiration
IMAGES_EXPIRES = 30

The default value for both settings is 90 days.

If you have a pipeline that subclasses FilesPipeline and you'd like to have a different setting for it, you can set setting keys preceded by the uppercase class name. E.g. given a pipeline class called MyPipeline you can set the setting key:

MYPIPELINE_FILES_EXPIRES = 180

and the pipeline class MyPipeline will have its expiration time set to 180 days.

Thumbnail generation for images

The Images Pipeline can automatically create thumbnails of the downloaded images.

In order to use this feature, you must set IMAGES_THUMBS to a dictionary where the keys are the thumbnail names and the values are their dimensions.

For example:

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

When you use this feature, the Images Pipeline will create thumbnails of each specified size with this format:

<IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg

Where:

  • <size_name> is the one specified in the IMAGES_THUMBS dictionary keys (small, big, etc)
  • <image_id> is the SHA1 hash of the image url

Example of image files stored using small and big thumbnail names:

<IMAGES_STORE>/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg

The first one is the full image, as downloaded from the site.

Filtering out small images

When using the Images Pipeline, you can drop images which are too small, by specifying the minimum allowed size in the IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH settings.

For example:

IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

Note

The size constraints don’t affect thumbnail generation at all.

It is possible to set just one size constraint or both. When setting both of them, only images that satisfy both minimum sizes will be saved. For the above example, images of sizes (105 x 105) or (105 x 200) or (200 x 105) will all be dropped because at least one dimension is shorter than the constraint.

By default, there are no size constraints, so all images are processed.

Extending the Media Pipelines

See here the methods that you can override in your custom Files Pipeline:

class scrapy.pipelines.files.FilesPipeline
get_media_requests(item, info)

As seen in the workflow, the pipeline will get the URLs of the files to download from the item. In order to do this, you can override the get_media_requests() method and return a Request for each file URL:

def get_media_requests(self, item, info):
    for file_url in item['file_urls']:
        yield scrapy.Request(file_url)

Those requests will be processed by the pipeline and, when they have finished downloading, the results will be sent to the item_completed() method, as a list of 2-element tuples. Each tuple will contain (success, file_info_or_error) where:

  • success is a boolean which is True if the file was downloaded successfully or False if it failed for some reason
  • file_info_or_error is a dict containing the following keys (if success is True) or a Twisted Failure if there was a problem.
    • url - the url where the file was downloaded from. This is the url of the request returned from the get_media_requests() method.
    • path - the path (relative to FILES_STORE) where the file was stored
    • checksum - an MD5 hash of the file contents

The list of tuples received by item_completed() is guaranteed to retain the same order of the requests returned from the get_media_requests() method.

Here’s a typical value of the results argument:

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]

By default, the get_media_requests() method returns a Request for each URL found in the item's file_urls field, matching the workflow described above.

item_completed(results, item, info)

The FilesPipeline.item_completed() method is called when all file requests for a single item have completed (either finished downloading, or failed for some reason).

The item_completed() method must return the output that will be sent to subsequent item pipeline stages, so you must return (or drop) the item, as you would in any pipeline.

Here is an example of the item_completed() method where we store the downloaded file paths (passed in results) in the file_paths item field, and we drop the item if it doesn't contain any files:

from scrapy.exceptions import DropItem

def item_completed(self, results, item, info):
    file_paths = [x['path'] for ok, x in results if ok]
    if not file_paths:
        raise DropItem("Item contains no files")
    item['file_paths'] = file_paths
    return item

By default, the item_completed() method returns the item.

See here the methods that you can override in your custom Images Pipeline:

class scrapy.pipelines.images.ImagesPipeline
The ImagesPipeline is an extension of the FilesPipeline, customizing the field names and adding custom behavior for images.
get_media_requests(item, info)

FilesPipeline.get_media_requests()方法的工作方式相同,但对图片网址使用不同的字段名称。

必须为每个图片网址返回一个请求。

item_completed(results, item, info)

The ImagesPipeline.item_completed() method is called when all image requests for a single item have completed (either finished downloading, or failed for some reason).

Works the same way as the FilesPipeline.item_completed() method, but using different field names for storing image downloading results.

By default, the item_completed() method returns the item.

Custom Images pipeline example

Here is a full example of the Images Pipeline whose methods are exemplified above:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
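To use this subclass instead of the built-in pipeline, enable it in ITEM_PIPELINES (the module path myproject.pipelines is an assumption about your project layout):

ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 300}

IMAGES_STORE must still be set to a valid value, as described in Enabling your Media Pipeline.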