Stats Collection

Scrapy提供了方便的收集数据的机制,数据以key/value方式存储,值大多是计数值。该机制叫做数据收集器,可以通过Crawler API的属性 stats来使用,在下面的章节常见数据收集器使用方法将给出例子来说明。

However, the Stats Collector is always available, so you can always import it in your module and use its API (to increment or set new stat keys), regardless of whether the stats collection is enabled or not. If it’s disabled, the API will still work but it won’t collect anything. This is aimed at simplifying the stats collector usage: you should spend no more than one line of code for collecting stats in your spider, Scrapy extension, or whatever code you’re using the Stats Collector from.

Another feature of the Stats Collector is that it’s very efficient (when enabled) and extremely efficient (almost unnoticeable) when disabled.

The Stats Collector keeps a stats table per open spider which is automatically opened when the spider is opened, and closed when the spider is closed.

常见数据收集器使用方法

Access the stats collector through the stats attribute. Here is an example of an extension that access stats:

class ExtensionThatAccessStats(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

设置数据:

stats.set_value('hostname', socket.gethostname())

Increment stat value:

stats.inc_value('custom_count')

Set stat value only if greater than previous:

stats.max_value('max_items_scraped', value)

当新的值比原来的值小时设置数据:

stats.min_value('min_free_memory_percent', value)

Get stat value:

>>> stats.get_value('custom_count')
1

Get all stats:

>>> stats.get_stats()
{'custom_count': 1, 'start_time': datetime.datetime(2009, 7, 14, 21, 47, 28, 977139)}

可用的数据收集器

除了基本的 StatsCollector,Scrapy也提供了基于 StatsCollector 的数据收集器。You can select which Stats Collector to use through the STATS_CLASS setting. The default Stats Collector used is the MemoryStatsCollector.

MemoryStatsCollector

class scrapy.statscollectors.MemoryStatsCollector

一个简单的数据收集器,其在spider运行完毕后将其数据保存在内存中。数据可以通过 spider_stats属性访问,该属性是一个以spider名字为键(key)的字典。

This is the default Stats Collector used in Scrapy.

spider_stats

A dict of dicts (keyed by spider name) containing the stats of the last scraping run for each spider.

DummyStatsCollector

class scrapy.statscollectors.DummyStatsCollector

A Stats collector which does nothing but is very efficient (because it does nothing). This stats collector can be set via the STATS_CLASS setting, to disable stats collect in order to improve performance. However, the performance penalty of stats collection is usually marginal compared to other Scrapy workload like parsing pages.