Extensions¶
The extensions framework provides a mechanism for inserting your own custom functionality into Scrapy.
Extensions are just regular classes that are instantiated at Scrapy startup.
Extension settings¶
Extensions use the Scrapy settings to manage their settings, just like any other Scrapy code.
It is customary for extensions to prefix their settings with their own name, to avoid collision with existing (and future) extensions. For example, a hypothetical extension to handle Google Sitemaps would use settings like GOOGLESITEMAP_ENABLED, GOOGLESITEMAP_DEPTH, and so on.
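Such prefixed settings would appear in a project's settings.py like this (the GOOGLESITEMAP_* names are purely hypothetical, continuing the example above):

```python
# settings.py -- hypothetical settings for an imagined Google Sitemaps extension;
# the prefix keeps them from colliding with other extensions' settings
GOOGLESITEMAP_ENABLED = True
GOOGLESITEMAP_DEPTH = 3
```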
Loading & activating extensions¶
Extensions are loaded and activated at startup by instantiating a single instance of the extension class. Therefore, all the extension initialization code must be performed in the class constructor (__init__ method).
To make an extension available, add it to the EXTENSIONS setting in your Scrapy settings. In EXTENSIONS, each extension is represented by a string: the full Python path to the extension’s class name. For example:
EXTENSIONS = {
'scrapy.extensions.corestats.CoreStats': 500,
'scrapy.extensions.telnet.TelnetConsole': 500,
}
As you can see, the EXTENSIONS setting is a dict in which the keys are the extension paths and their values are the orders, which define the extension loading order. The EXTENSIONS setting is merged with the EXTENSIONS_BASE setting defined in Scrapy (and not meant to be overridden) and then sorted by order to get the final sorted list of enabled extensions.
As extensions typically do not depend on each other, their loading order is irrelevant in most cases. This is why the EXTENSIONS_BASE setting defines all extensions with the same order (0). However, this feature can be exploited if you need to add an extension which depends on other extensions already being loaded.
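The merge-and-sort behaviour described above can be sketched in plain Python (a minimal illustration of the semantics, not Scrapy's actual implementation; the extension paths other than the built-in ones are made up):

```python
# A minimal sketch (not Scrapy's actual code) of how the user EXTENSIONS
# dict is merged with EXTENSIONS_BASE and turned into a load order.
def build_extension_list(base, user):
    merged = dict(base)
    merged.update(user)  # user settings extend (and may override) the base
    # drop disabled extensions (order set to None), then sort by order value
    enabled = {path: order for path, order in merged.items() if order is not None}
    return sorted(enabled, key=lambda path: enabled[path])

EXTENSIONS_BASE = {
    'scrapy.extensions.corestats.CoreStats': 0,
    'scrapy.extensions.telnet.TelnetConsole': 0,
}
EXTENSIONS = {
    'myproject.extensions.LateExtension': 500,          # hypothetical path
    'scrapy.extensions.telnet.TelnetConsole': None,     # disabled by the user
}

print(build_extension_list(EXTENSIONS_BASE, EXTENSIONS))
# CoreStats (order 0) loads before LateExtension (order 500);
# TelnetConsole is dropped because its order was set to None
```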
Available, enabled and disabled extensions¶
Not all available extensions will be enabled. Some of them usually depend on a particular setting. For example, the HTTP Cache extension is available by default but disabled unless the HTTPCACHE_ENABLED setting is set.
Disabling an extension¶
In order to disable an extension that comes enabled by default (i.e. those included in the EXTENSIONS_BASE setting), you must set its order to None. For example:
EXTENSIONS = {
'scrapy.extensions.corestats.CoreStats': None,
}
Writing your own extension¶
Each extension is a Python class. The main entry point for a Scrapy extension (this also includes middlewares and pipelines) is the from_crawler class method which receives a Crawler instance. Through the Crawler object you can access settings, signals, stats, and also control the crawling behaviour.
Typically, extensions connect to signals and perform tasks triggered by them.
Finally, if the from_crawler method raises the NotConfigured exception, the extension will be disabled. Otherwise, the extension will be enabled.
Sample extension¶
Here we will implement a simple extension to illustrate the concepts described in the previous section. This extension will log a message every time:
- a spider is opened
- a spider is closed
- a specific number of items are scraped
The extension will be enabled through the MYEXT_ENABLED setting and the number of items will be specified through the MYEXT_ITEMCOUNT setting.
Here is the code of such an extension:
import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)

class SpiderOpenCloseLogging(object):

    def __init__(self, item_count):
        self.item_count = item_count
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # get the number of items from settings
        item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)

        # instantiate the extension object
        ext = cls(item_count)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("closed spider %s", spider.name)

    def item_scraped(self, item, spider):
        self.items_scraped += 1
        if self.items_scraped % self.item_count == 0:
            logger.info("scraped %d items", self.items_scraped)
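To actually use the sample extension you would enable it in your project settings; the module path 'myproject.extensions' below is an assumption about where the class lives in your project:

```python
# settings.py -- enabling the sample extension; the module path
# 'myproject.extensions' is an assumed project layout
EXTENSIONS = {
    'myproject.extensions.SpiderOpenCloseLogging': 500,
}
MYEXT_ENABLED = True     # without this, from_crawler raises NotConfigured
MYEXT_ITEMCOUNT = 100    # log a message every 100 scraped items
```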
Built-in extensions reference¶
General purpose extensions¶
Log Stats extension¶
class scrapy.extensions.logstats.LogStats¶
Log basic stats like crawled pages and scraped items.
Core Stats extension¶
class scrapy.extensions.corestats.CoreStats¶
Enable the collection of core statistics, provided the stats collection is enabled (see Stats Collection).
Telnet console extension¶
class scrapy.extensions.telnet.TelnetConsole¶
Provides a telnet console for getting into a Python interpreter inside the currently running Scrapy process, which can be very useful for debugging.
The telnet console must be enabled by the TELNETCONSOLE_ENABLED setting, and the server will listen on the port specified in TELNETCONSOLE_PORT.
Memory usage extension¶
class scrapy.extensions.memusage.MemoryUsage¶
Note
This extension does not work on Windows.
Monitors the memory used by the Scrapy process that runs the spider and:
- sends a notification e-mail when it exceeds a certain value
- closes the spider when it exceeds a certain value
When the memory usage reaches the value specified by MEMUSAGE_WARNING_MB, a warning e-mail is sent. When it reaches the value specified by MEMUSAGE_LIMIT_MB, a warning e-mail is sent, the spider is closed and the Scrapy process is terminated.
This extension is enabled by the MEMUSAGE_ENABLED setting and can be configured with the following settings:
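A sample configuration could look like the following (the threshold values are illustrative; MEMUSAGE_NOTIFY_MAIL is the setting for warning recipients):

```python
# settings.py -- illustrative memory-usage thresholds
MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 512    # send a warning e-mail above this usage
MEMUSAGE_LIMIT_MB = 1024     # close the spider and exit above this usage
MEMUSAGE_NOTIFY_MAIL = ['you@example.com']  # hypothetical recipient
```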
Memory debugger extension¶
class scrapy.extensions.memdebug.MemoryDebugger¶
An extension for debugging memory usage. It collects information about:
- objects uncollected by the Python garbage collector
- objects left alive that shouldn’t be. For more info, see Debugging memory leaks with trackref
To enable this extension, turn on the MEMDEBUG_ENABLED setting. The info will be stored in the stats.
Close spider extension¶
class scrapy.extensions.closespider.CloseSpider¶
Closes a spider automatically when some conditions are met, using a specific closing reason for each condition.
The conditions for closing a spider can be configured through the following settings:
CLOSESPIDER_TIMEOUT¶
Default: 0
An integer which specifies a number of seconds. If the spider remains open for more than that number of seconds, it will be automatically closed with the reason closespider_timeout. If zero (or not set), spiders won’t be closed by timeout.
CLOSESPIDER_ITEMCOUNT¶
Default: 0
An integer which specifies a number of items. If the spider scrapes more than that number of items and those items are passed by the item pipeline, the spider will be closed with the reason closespider_itemcount. If zero (or not set), spiders won’t be closed by number of passed items.
CLOSESPIDER_PAGECOUNT¶
New in version 0.11.
Default: 0
An integer which specifies the maximum number of responses to crawl. If the spider crawls more than that, the spider will be closed with the reason closespider_pagecount. If zero (or not set), spiders won’t be closed by number of crawled responses.
CLOSESPIDER_ERRORCOUNT¶
New in version 0.11.
Default: 0
An integer which specifies the maximum number of errors to receive before closing the spider. If the spider generates more than that number of errors, it will be closed with the reason closespider_errorcount. If zero (or not set), spiders won’t be closed by number of errors.
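Putting the settings above together, a crawl could be bounded like this (the threshold values are illustrative, not recommendations):

```python
# settings.py -- illustrative CloseSpider thresholds; any condition
# that fires first closes the spider with its own closing reason
CLOSESPIDER_TIMEOUT = 3600      # closespider_timeout after one hour
CLOSESPIDER_ITEMCOUNT = 10000   # closespider_itemcount after 10k items
CLOSESPIDER_ERRORCOUNT = 50     # closespider_errorcount after 50 errors
```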
StatsMailer extension¶
- class
scrapy.extensions.statsmailer.
StatsMailer
¶
This simple extension can be used to send a notification e-mail every time a domain has finished scraping, including the Scrapy stats collected. The e-mail will be sent to all recipients specified in the STATSMAILER_RCPTS setting.
Debugging extensions¶
Stack trace dump extension¶
class scrapy.extensions.debug.StackTraceDump¶
Dumps information about the running process when a SIGQUIT or SIGUSR2 signal is received. The information dumped is the following:
- engine status (using scrapy.utils.engine.get_engine_status())
- live references (see Debugging memory leaks with trackref)
- stack trace of all threads
After the stack trace and engine status is dumped, the Scrapy process continues running normally.
This extension only works on POSIX-compliant platforms (i.e. not Windows), because the SIGQUIT and SIGUSR2 signals are not available on Windows.
There are at least two ways to send Scrapy the SIGQUIT signal:
By pressing Ctrl-\ while a Scrapy process is running (Linux only?)
By running this command (assuming <pid> is the process id of the Scrapy process):
kill -QUIT <pid>
Debugger extension¶
class scrapy.extensions.debug.Debugger¶
Invokes a Python debugger inside a running Scrapy process when a SIGUSR2 signal is received. After the debugger is exited, the Scrapy process continues running normally.
For more info see Debugging in Python.
This extension only works on POSIX-compliant platforms (i.e. not Windows).