Core API

New in version 0.15.

This section documents the Scrapy core API, and it is intended for developers of extensions and middlewares.

Crawler API

The main entry point to the Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it is the only way for extensions to access those components and hook their functionality into Scrapy.
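As a minimal sketch, an extension receives the crawler through from_crawler like this (MyExtension and the MYEXT_ENABLED setting are hypothetical names used only for illustration):

from scrapy.exceptions import NotConfigured


class MyExtension:
    """Hypothetical extension illustrating the from_crawler entry point."""

    def __init__(self, crawler):
        # Keep the crawler around to reach settings, signals, stats, etc.
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # MYEXT_ENABLED is a made-up setting for this sketch.
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured
        return cls(crawler)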

The Extension Manager is responsible for loading and keeping track of the installed extensions. It is configured through the EXTENSIONS setting, which contains a dictionary of all available extensions and their orders, similar to how you configure the downloader middlewares.
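For example, enabling the hypothetical MyExtension from the sketch above with an explicit order value might look like this in the project settings (both the import path and the order are illustrative):

EXTENSIONS = {
    'myproject.extensions.MyExtension': 500,
}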

class scrapy.crawler.Crawler(spidercls, settings)

The Crawler object must be instantiated with a scrapy.spiders.Spider subclass and a scrapy.settings.Settings object.

settings

The settings manager of this crawler.

This is used by extensions & middlewares to access the Scrapy settings of this crawler.

For an introduction on Scrapy settings see Settings.

For the API see Settings class.
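For instance, an extension holding a crawler reference could read settings through the typed accessors (the keys shown are standard Scrapy settings, but any key works):

# Typed accessors convert the raw setting value to the expected type.
delay = crawler.settings.getfloat('DOWNLOAD_DELAY')
cookies = crawler.settings.getbool('COOKIES_ENABLED')
modules = crawler.settings.getlist('SPIDER_MODULES')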

signals

The signals manager of this crawler.

This is used by extensions & middlewares to hook themselves into Scrapy functionality.

For an introduction on signals see Signals.

For the API see SignalManager class.
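As a sketch, a handler is usually connected inside from_crawler; the handler below is illustrative, while signals.spider_closed and the connect() call are the real API:

from scrapy import signals

def spider_closed(spider, reason):
    spider.logger.info('spider %s closed: %s', spider.name, reason)

crawler.signals.connect(spider_closed, signal=signals.spider_closed)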

stats

The stats collector of this crawler.

This is used from extensions & middlewares to record stats of their behaviour, or access stats collected by other extensions.

For an introduction on stats collection see Stats Collection.

For the API see StatsCollector class.
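For example, an extension could record its own counters like this (the stats key is made up for the sketch):

# inc_value() creates the key on first use, starting from 0 by default.
crawler.stats.inc_value('myext/items_seen')
seen = crawler.stats.get_value('myext/items_seen', default=0)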

extensions

The extension manager that keeps track of enabled extensions.

Most extensions won’t need to access this attribute.

For an introduction on extensions and a list of available extensions on Scrapy see Extensions.

engine

The execution engine, which coordinates the core crawling logic between the scheduler, downloader and spiders.

Some extensions may want to access the Scrapy engine, to inspect or modify the downloader and scheduler behaviour, although this is an advanced use and this API is not yet stable.

spider

Spider currently being crawled. This is an instance of the spider class provided while constructing the crawler, and it is instantiated with the arguments given in the crawl() method.

crawl(*args, **kwargs)

Starts the crawler by instantiating its spider class with the given args and kwargs arguments, while setting the execution engine in motion.

Returns a deferred that is fired when the crawl is finished.
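A minimal sketch of driving a crawl this way, assuming a MySpider class defined elsewhere (higher-level helpers such as CrawlerProcess normally manage the reactor for you):

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

crawler = Crawler(MySpider, get_project_settings())
d = crawler.crawl(category='electronics')  # args/kwargs go to the spider
d.addBoth(lambda _: reactor.stop())        # stop the reactor on success or failure
reactor.run()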

Settings API

scrapy.settings.SETTINGS_PRIORITIES

Dictionary that sets the key name and priority level of the default settings priorities used in Scrapy.

Each item defines a settings entry point, giving it a code name for identification and an integer priority. Greater priorities take more precedence over lesser ones when setting and retrieving values in the Settings class.

SETTINGS_PRIORITIES = {
    'default': 0,
    'command': 10,
    'project': 20,
    'spider': 30,
    'cmdline': 40,
}

For a detailed explanation of each settings source, see: Settings.
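For instance, when the same key is set at two priorities, the higher one wins on retrieval; a small sketch with the Settings class:

from scrapy.settings import Settings

settings = Settings()
settings.set('CONCURRENT_REQUESTS', 32, priority='project')  # priority 20
settings.set('CONCURRENT_REQUESTS', 8, priority='cmdline')   # priority 40
settings.getint('CONCURRENT_REQUESTS')  # -> 8, since 40 > 20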

SpiderLoader API

class scrapy.spiderloader.SpiderLoader

This class is in charge of retrieving and handling the spider classes defined across the project.

Custom spider loaders can be employed by specifying their path in the SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee an errorless execution.

from_settings(settings)

This class method is used by Scrapy to create an instance of the class. It’s called with the current project settings, and it loads the spiders found in the modules of the SPIDER_MODULES setting.

Parameters: settings (Settings instance) – project settings
load(spider_name)

Get the Spider class with the given name. It will look into the previously loaded spiders for a spider class with name spider_name, and will raise a KeyError if not found.

Parameters: spider_name (str) – spider class name
list()

Get the names of the available spiders in the project.

find_by_request(request)

List the spiders’ names that can handle the given request. Will try to match the request’s url against the domains of the spiders.

Parameters: request (Request instance) – queried request
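Putting these methods together, a sketch of using the spider loader with the project settings (the 'example' spider name is a placeholder):

from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

loader = SpiderLoader.from_settings(get_project_settings())
loader.list()                        # names of all spiders in the project
spider_cls = loader.load('example')  # raises KeyError if no such spider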

Signals API
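The signals manager exposed as the signals attribute above is an instance of scrapy.signalmanager.SignalManager. As a minimal sketch, its two most common calls are connecting a receiver and sending a signal with error trapping (the custom signal object and handler are hypothetical):

from scrapy.signalmanager import SignalManager

manager = SignalManager()
item_seen = object()  # hypothetical custom signal object

def on_item_seen(item):
    print('seen:', item)

manager.connect(on_item_seen, signal=item_seen)
manager.send_catch_log(signal=item_seen, item={'id': 1})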

Stats Collector API

There are several Stats Collectors available under the scrapy.statscollectors module and they all implement the Stats Collector API defined by the StatsCollector class (which they all inherit from).

class scrapy.statscollectors.StatsCollector
get_value(key, default=None)

Return the value for the given stats key or default if it doesn’t exist.

get_stats()

Get all stats from the currently running spider as a dict.

set_value(key, value)

Set the given value for the given stats key.

set_stats(stats)

Override the current stats with the dict passed in stats argument.

inc_value(key, count=1, start=0)

Increment the value of the given stats key by the given count, assuming the given start value (when the key is not yet set).

max_value(key, value)

Set the given value for the given key only if current value for the same key is lower than value. If there is no current value for the given key, the value is always set.

min_value(key, value)

Set the given value for the given key only if current value for the same key is greater than value. If there is no current value for the given key, the value is always set.

clear_stats()

Clear all stats.

The following methods are not part of the stats collection API; they are instead used when implementing custom stats collectors:

open_spider(spider)

Open the given spider for stats collection.

close_spider(spider)

Close the given spider. After this is called, no more specific stats can be accessed or collected.
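As an illustration of the latter, a sketch of a custom collector that logs its stats when each spider closes (the class name and the logging behaviour are illustrative; a custom collector is enabled through the STATS_CLASS setting):

import logging

from scrapy.statscollectors import StatsCollector

logger = logging.getLogger(__name__)


class LoggingStatsCollector(StatsCollector):
    """Illustrative collector that dumps its stats on spider close."""

    def close_spider(self, spider, reason=None):
        # Recent Scrapy versions also pass a close reason here.
        logger.info('stats for %s: %s', spider.name, self.get_stats())
        super().close_spider(spider, reason)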