Extensions

The extensions framework provides a mechanism for inserting your own custom functionality into Scrapy.

Extensions are just regular classes that are instantiated at Scrapy startup, when extensions are initialized.

Extension settings

Extensions use the Scrapy settings to manage their settings, just like any other Scrapy code.

It is customary for extensions to prefix their settings with their own name, to avoid collision with existing (and future) extensions. For example, a hypothetical extension to handle Google Sitemaps would use settings like GOOGLESITEMAP_ENABLED, GOOGLESITEMAP_DEPTH, and so on.
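As a sketch of how such prefixed settings might be read through the Crawler object, consider the following; the GOOGLESITEMAP_* names and the class itself are purely illustrative, not part of Scrapy:

from scrapy.exceptions import NotConfigured

class GoogleSitemapExtension:

    def __init__(self, depth):
        self.depth = depth

    @classmethod
    def from_crawler(cls, crawler):
        # Settings are read from crawler.settings; the names below are
        # hypothetical and only follow the prefix convention described above.
        if not crawler.settings.getbool('GOOGLESITEMAP_ENABLED'):
            raise NotConfigured
        depth = crawler.settings.getint('GOOGLESITEMAP_DEPTH', 3)
        return cls(depth)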

Loading & activating extensions

Extensions are loaded and activated at startup by instantiating a single instance of the extension class. Therefore, all the extension initialization code must be performed in the class constructor (__init__ method).

To make an extension available, add it to the EXTENSIONS setting in your Scrapy settings. In EXTENSIONS, each extension is represented by a string: the full Python path to the extension’s class name. For example:

EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': 500,
    'scrapy.extensions.telnet.TelnetConsole': 500,
}

As you can see, the EXTENSIONS setting is a dict where the keys are the extension paths and the values are the orders, which define the extension loading order. The EXTENSIONS setting is merged with (not overridden by) the EXTENSIONS_BASE setting defined in Scrapy, and then sorted by order to get the final list of enabled extensions.

As extensions typically do not depend on each other, their loading order is irrelevant in most cases. This is why the EXTENSIONS_BASE setting defines all extensions with the same order (0). However, this feature can be exploited if you need to add an extension that depends on other extensions already being loaded.

Available, enabled and disabled extensions

Not all available extensions will be enabled. Some of them usually depend on a particular setting. For example, the HTTP Cache extension is available by default but disabled unless the HTTPCACHE_ENABLED setting is set.
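For example, such an extension is switched on from the project's settings.py (a minimal sketch; turning the setting on is all that is required):

# settings.py
# The HTTP Cache extension is available by default, but stays
# disabled until this setting is turned on.
HTTPCACHE_ENABLED = True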

Disabling an extension

In order to disable an extension that comes enabled by default (i.e. those included in the EXTENSIONS_BASE setting), you must set its order to None. For example:

EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': None,
}

Writing your own extension

Each extension is a Python class. The main entry point for a Scrapy extension (this also includes middlewares and pipelines) is the from_crawler class method which receives a Crawler instance. Through the Crawler object you can access settings, signals, stats, and control the crawling behaviour.

Typically, extensions connect to signals and perform tasks triggered by them.

Finally, if the from_crawler method raises the NotConfigured exception, the extension will be disabled. Otherwise, the extension will be enabled.

Sample extension

Here we will implement a simple extension to illustrate the concepts described in the previous section. This extension will log a message every time:

  • a spider is opened
  • a spider is closed
  • a specific number of items are scraped

The extension will be enabled through the MYEXT_ENABLED setting and the number of items will be specified through the MYEXT_ITEMCOUNT setting.

Here is the code of such extension:

import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)

class SpiderOpenCloseLogging(object):

    def __init__(self, item_count):
        self.item_count = item_count
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # get the number of items from settings
        item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)

        # instantiate the extension object
        ext = cls(item_count)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("closed spider %s", spider.name)

    def item_scraped(self, item, spider):
        self.items_scraped += 1
        if self.items_scraped % self.item_count == 0:
            logger.info("scraped %d items", self.items_scraped)
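To try the extension out, it has to be added to EXTENSIONS and enabled through its own settings, for instance in the project's settings.py (a minimal sketch; the module path myproject.extensions and the item count are assumptions for illustration):

# settings.py
EXTENSIONS = {
    # Path to wherever the SpiderOpenCloseLogging class lives (assumed module path).
    'myproject.extensions.SpiderOpenCloseLogging': 500,
}

MYEXT_ENABLED = True     # without this, from_crawler raises NotConfigured
MYEXT_ITEMCOUNT = 100    # log a message every 100 scraped items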

Built-in extensions reference

General purpose extensions

Log Stats extension

class scrapy.extensions.logstats.LogStats

Log basic stats like crawled pages and scraped items.

Core Stats extension

class scrapy.extensions.corestats.CoreStats

Enable the collection of core statistics, provided the stats collection is enabled (see Stats Collection).

Telnet console extension

class scrapy.extensions.telnet.TelnetConsole

Provides a telnet console for getting into a Python interpreter inside the currently running Scrapy process, which can be very useful for debugging.

The telnet console must be enabled by the TELNETCONSOLE_ENABLED setting, and the server will listen on the port specified in TELNETCONSOLE_PORT.
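The corresponding settings can be adjusted in settings.py, for example (a minimal sketch; the port range shown is illustrative):

# settings.py
TELNETCONSOLE_ENABLED = True
# The console will listen on the first available port in this range
# (illustrative values).
TELNETCONSOLE_PORT = [6023, 6073]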

Memory usage extension

class scrapy.extensions.memusage.MemoryUsage

Note

This extension does not work on Windows.

Monitors the memory used by the Scrapy process that runs the spider and:

  1. sends a notification e-mail when it exceeds a certain value
  2. closes the spider when it exceeds a certain value

When the memory used reaches the value specified by the MEMUSAGE_WARNING_MB setting, a warning e-mail is sent. When the memory used reaches the value specified by the MEMUSAGE_LIMIT_MB setting, a notification e-mail is sent, the spider is closed and the Scrapy process is terminated.

This extension is enabled by the MEMUSAGE_ENABLED setting and can be configured with the MEMUSAGE_* settings, such as the MEMUSAGE_WARNING_MB and MEMUSAGE_LIMIT_MB thresholds described above.
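A possible configuration in settings.py could look like this (a minimal sketch; the threshold values are illustrative assumptions):

# settings.py
MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 512    # send a warning e-mail past this value
MEMUSAGE_LIMIT_MB = 1024     # close the spider and exit past this value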

Memory debugger extension

class scrapy.extensions.memdebug.MemoryDebugger

An extension for debugging memory usage. It collects information about:

  • objects uncollected by the Python garbage collector
  • objects left alive that shouldn't be (see Debugging memory leaks with trackref)

To enable this extension, turn on the MEMDEBUG_ENABLED setting. The info will be stored in the stats.
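Enabling it is a one-line change in settings.py:

# settings.py
MEMDEBUG_ENABLED = True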

Close spider extension

class scrapy.extensions.closespider.CloseSpider

Closes a spider automatically when certain conditions are met, using a specific closing reason for each condition.

The conditions for closing a spider can be configured through the following settings:

CLOSESPIDER_TIMEOUT

Default: 0

An integer which specifies a number of seconds. If the spider remains open for more than that number of seconds, it will be automatically closed with the reason closespider_timeout. If zero (or not set), spiders won't be closed by timeout.

CLOSESPIDER_ITEMCOUNT

Default: 0

An integer which specifies a number of items. If the spider scrapes more than that number of items and those items are passed through the item pipeline, the spider will be closed with the reason closespider_itemcount. If zero (or not set), spiders won't be closed by number of passed items.

CLOSESPIDER_PAGECOUNT

New in version 0.11.

Default: 0

An integer which specifies the maximum number of responses to crawl. If the spider crawls more than that, the spider will be closed with the reason closespider_pagecount. If zero (or not set), spiders won't be closed by number of crawled responses.

CLOSESPIDER_ERRORCOUNT

New in version 0.11.

Default: 0

An integer which specifies the maximum number of errors to receive before closing the spider. If the spider generates more than that number of errors, it will be closed with the reason closespider_errorcount. If zero (or not set), spiders won't be closed by number of errors.
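These conditions can be combined in settings.py, for example (a minimal sketch; the values are illustrative assumptions):

# settings.py
CLOSESPIDER_TIMEOUT = 3600      # close after one hour of crawling
CLOSESPIDER_ITEMCOUNT = 10000   # ... or after 10000 items have passed the pipeline
CLOSESPIDER_PAGECOUNT = 0       # no limit on crawled responses
CLOSESPIDER_ERRORCOUNT = 50     # ... or after 50 errors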

StatsMailer extension

class scrapy.extensions.statsmailer.StatsMailer

This simple extension can be used to send a notification e-mail every time a domain has finished scraping, including the Scrapy stats collected. The email will be sent to all recipients specified in the STATSMAILER_RCPTS setting.
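The recipient list is configured like any other Scrapy setting (a minimal sketch; the address is a placeholder):

# settings.py
STATSMAILER_RCPTS = ['scrapy@example.com']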

Debugging extensions

Stack trace dump extension

class scrapy.extensions.debug.StackTraceDump

Dumps information about the running spider process when a SIGQUIT or SIGUSR2 signal is received. The information dumped is the following:

  1. engine status (using scrapy.utils.engine.get_engine_status())
  2. live references (see Debugging memory leaks with trackref)
  3. stack trace of all threads

After the stack trace and engine status is dumped, the Scrapy process continues running normally.

This extension only works on POSIX-compliant platforms (i.e. not Windows), because the SIGQUIT and SIGUSR2 signals are not available on Windows.

There are at least two ways of sending Scrapy the SIGQUIT signal:

  1. By pressing Ctrl-\ while a Scrapy process is running (Linux only?)

  2. By running this command (assuming <pid> is the process id of the Scrapy process):

    kill -QUIT <pid>
    

Debugger extension

class scrapy.extensions.debug.Debugger

Invokes a Python debugger inside a running Scrapy process when a SIGUSR2 signal is received. After the debugger is exited, the Scrapy process continues running normally.

For more info see Debugging in Python.

This extension only works on POSIX-compliant platforms (i.e. it won't work on Windows).