Spiders

Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:

  1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

    The first requests to perform are obtained by calling the start_requests() method, which (by default) generates Request for the URLs specified in start_urls and the parse method as callback function for those Requests.

  2. In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

  3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

  4. Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.

Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled into Scrapy for different purposes. We will talk about those types here.

scrapy.Spider

class scrapy.spiders.Spider

This is the simplest spider, and the one from which every other spider must inherit (including spiders that come bundled with Scrapy, as well as spiders that you write yourself). It doesn't provide any special functionality. It simply provides a default start_requests() implementation which sends requests from the start_urls spider attribute and calls the spider's method parse for each of the resulting responses.

name

A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it’s required.

If the spider scrapes a single domain, a common practice is to name the spider after the domain, with or without the TLD. So, for example, a spider that crawls mywebsite.com would often be called mywebsite.

allowed_domains

An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won't be followed if OffsiteMiddleware is enabled.
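
For instance, a minimal sketch of a spider restricted to a single domain (the spider name and URLs here are placeholders):

import scrapy

class MySpider(scrapy.Spider):
    name = 'restricted'
    # Only example.com and its subdomains (e.g. shop.example.com) are crawled;
    # requests to any other domain are filtered out by OffsiteMiddleware.
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # Off-site links yielded here are silently dropped by the middleware
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)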

start_urls

A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.

custom_settings

A dictionary of settings that will be overridden from the project-wide configuration when running this spider. It must be defined as a class attribute since the settings are updated before instantiation.

For a list of available built-in settings see: Built-in settings reference.
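
For instance, a hypothetical spider could throttle itself by overriding a couple of built-in settings as a class attribute (the values shown are arbitrary):

import scrapy

class MySpider(scrapy.Spider):
    name = 'polite'
    # Per-spider overrides of the project-wide configuration; this must be a
    # class attribute because the settings are applied before instantiation.
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
    }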

crawler

This attribute is set by the from_crawler() class method after initializing the class, and links to the Crawler object to which this spider instance is bound.

Crawlers encapsulate a lot of components in the project for their single entry access (such as extensions, middlewares, signals managers, etc). See the Crawler API for more details.

settings

The configuration for running this spider. This is a Settings instance; see the Settings topic for a detailed introduction on this subject.
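
For example, a callback could consult the active configuration through this attribute; a small sketch using one of the built-in settings:

    def parse(self, response):
        # self.settings is a Settings instance built from the project configuration
        if self.settings.getbool('HTTPCACHE_ENABLED'):
            self.logger.debug('Responses may be served from the HTTP cache')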

logger

Python logger created with the Spider's name. You can use it to send log messages through it, as described on Logging from Spiders.

from_crawler(crawler, *args, **kwargs)

This is the class method used by Scrapy to create your spiders.

You probably won't need to override this directly because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

Nonetheless, this method sets the crawler and settings attributes in the new instance so they can be accessed later inside the spider's code.

Parameters:
  • crawler (Crawler instance) - crawler to which the spider will be bound
  • args (list) - arguments passed to the __init__() method
  • kwargs (dict) - keyword arguments passed to the __init__() method
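
If you do override it, remember to call the parent implementation so these attributes are still set; a small sketch, where MYSPIDER_PAGE_LIMIT is a hypothetical setting name used only for illustration:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Let the default implementation build the spider (and set the crawler
        # and settings attributes), then read a hypothetical project setting.
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        spider.page_limit = crawler.settings.getint('MYSPIDER_PAGE_LIMIT', 10)
        return spider
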
start_requests()

This method must return an iterable with the first Requests to crawl for this spider.

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the Requests. This method is also called only once from Scrapy, so it's safe to implement it as a generator.

The default implementation uses make_requests_from_url() to generate Requests for each url in start_urls.

If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in using a POST request, you could do:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

make_requests_from_url(url)

A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to convert urls to requests.

Unless overridden, this method returns Requests with the parse() method as their callback function, and with the dont_filter parameter enabled (see the Request class for more info).
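
For instance, a sketch of an override that keeps the default behaviour but attaches an extra header to every start request (the header value is just a placeholder):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com/']

    def make_requests_from_url(self, url):
        # Keep the default behaviour (parse callback, dont_filter=True)
        # but attach an extra header to every start request.
        return scrapy.Request(url, callback=self.parse, dont_filter=True,
                              headers={'Referer': 'http://www.example.com/'})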

parse(response)

This is the default callback used by Scrapy to process downloaded responses, when their requests don't specify a callback.

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Request callbacks have the same requirements as the Spider class.

This method, as well as any other Request callback, must return an iterable of Request and/or Item objects.

Parameters: response (Response) - the response to parse

log(message[, level, component])

Wrapper that sends a log message through the Spider's logger, kept for backwards compatibility. For more information see Logging from Spiders.

closed(reason)

Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.
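
For example, a minimal sketch that just logs why the spider stopped (reason is a short string such as 'finished' for a normal run):

    def closed(self, reason):
        # reason describes why the spider was closed,
        # e.g. 'finished' when the crawl completed normally
        self.logger.info('Spider closed (%s)', reason)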

Let's see an example:

import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

Return multiple Requests and items from a single callback:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

Instead of start_urls you can use start_requests() directly; to give data more structure you can use Items:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

Spider arguments

Spiders can receive arguments that modify their behaviour. Some common uses for spider arguments are to define the start URLs or to restrict the crawl to certain sections of the site, but they can be used to configure any functionality of the spider.

Spider arguments are passed through the crawl command using the -a option. For example:

scrapy crawl myspider -a category=electronics

Spiders receive arguments in their constructors:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...

Spider arguments can also be passed through the Scrapyd schedule.json API. See the Scrapyd documentation.
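
For instance, assuming a Scrapyd instance running locally on the default port 6800 with myproject already deployed, the same category argument could be passed like this:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider -d category=electronics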

Generic Spiders

Scrapy comes with some useful generic spiders that you can use to subclass your spiders from. Their aim is to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from Sitemaps, or parsing an XML/CSV feed.

For the examples used in the following spiders, we'll assume you have a project with a TestItem declared in a myproject.items module:

import scrapy

class TestItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()

CrawlSpider

class scrapy.spiders.CrawlSpider

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular websites or project, but it's generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:

rules

Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rule objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.

This spider also exposes an overrideable method:

parse_start_url(response)

This method is called for the start_urls responses. It allows parsing the initial responses and must return either an Item object, a Request object, or an iterable containing any of them.

Crawling rules

class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

link_extractor is a Link Extractor object which defines how links will be extracted from each crawled page.

callback is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link_extractor. This callback receives a response as its first argument and must return a list containing Item and/or Request objects (or any subclass of them).

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.

process_links is a callable, or a string (in which case a method from the spider object with that name will be used), which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.

process_request is a callable, or a string (in which case a method from the spider object with that name will be used), which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
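
As a sketch of these last two hooks (the method names drop_previews and tag_request are made up for illustration), a rule could filter the extracted links and adjust each request before it is scheduled:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'item\.php', )),
             callback='parse_item',
             process_links='drop_previews',    # called with each list of links
             process_request='tag_request'),   # called with each request
    )

    def drop_previews(self, links):
        # Hypothetical filter: skip extracted links whose URL mentions 'preview'
        return [link for link in links if 'preview' not in link.url]

    def tag_request(self, request):
        # Returning None here would drop the request entirely
        return request.replace(headers={'X-Crawled-By': 'rules-demo'})

    def parse_item(self, response):
        self.logger.info('Item page: %s', response.url)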

CrawlSpider example

Let’s now take a look at an example CrawlSpider with rules:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from myproject.items import TestItem

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = TestItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

This spider would start crawling example.com's home page, collecting category links, and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.

XMLFeedSpider

class scrapy.spiders.XMLFeedSpider

XMLFeedSpider is designed for parsing XML feeds by iterating through them by a certain node name. The iterator can be chosen from: iternodes, xml, and html. It's recommended to use the iternodes iterator for performance reasons, since the xml and html iterators generate the whole DOM at once in order to parse it. However, using html as the iterator may be useful when parsing XML with bad markup.

To set the iterator and the tag name, you must define the following class attributes:

iterator

A string which defines the iterator to use. It can be either:

  • 'iternodes' - a fast iterator based on regular expressions
  • 'html' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all the DOM in memory, which could be a problem for big feeds
  • 'xml' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all the DOM in memory, which could be a problem for big feeds

It defaults to: 'iternodes'.

itertag

A string with the name of the node (or element) to iterate in. Example:

itertag = 'product'

namespaces

A list of (prefix, uri) tuples which define the namespaces available in the document that will be processed with this spider. The prefix and uri will be used to automatically register namespaces using the register_namespace() method.

You can then specify nodes with namespaces in the itertag attribute.

Example:

class YourSpider(XMLFeedSpider):

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    # ...

Apart from these new attributes, this spider has the following overrideable methods too:

adapt_response(response)

A method that receives the response as soon as it arrives from the spider middleware, before the spider starts parsing it. It can be used to modify the response body before parsing it. This method receives a response and also returns a response (it could be the same or another one).
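
For example, a sketch of an override that strips a UTF-8 byte order mark from the feed body before parsing starts:

    def adapt_response(self, response):
        # Drop a leading BOM, if any, and hand back a (possibly new) response
        body = response.body
        if body.startswith(b'\xef\xbb\xbf'):
            response = response.replace(body=body[3:])
        return response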

parse_node(response, selector)

This method is called for the nodes matching the provided tag name (itertag). It receives the response and a Selector for each node. Overriding this method is mandatory. Otherwise, your spider won't work. This method must return either an Item object, a Request object, or an iterable containing any of them.

process_results(response, results)

This method is called for each result (item or request) returned by the spider, and it's intended to perform any last-time processing required before returning the results to the framework core, for example setting the item IDs. It receives a list of results and the response which originated those results. It must return a list of results (Items or Requests).
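
As a small sketch of such a last pass (assuming the items expose an 'id' field, as the TestItem used in the example below does), missing IDs could be filled in from the feed URL:

    def process_results(self, response, results):
        # Fill in a missing 'id' on each item before it reaches the engine;
        # any Request objects in the results are passed through untouched.
        processed = []
        for result in results:
            if hasattr(result, 'fields') and not result.get('id'):
                result['id'] = response.url
            processed.append(result)
        return processed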

XMLFeedSpider example

These spiders are pretty easy to use, let's have a look at one example:

from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item

Basically what we did up there was to create a spider that downloads a feed from the given start_urls, then iterates through each of its item tags, prints them out, and stores some random data in an Item.

CSVFeedSpider

class scrapy.spiders.CSVFeedSpider

This spider is very similar to the XMLFeedSpider, except that it iterates over rows instead of nodes. The method that gets called on each iteration is parse_row().

delimiter

A string with the separator character for each field in the CSV file. Defaults to ',' (comma).

quotechar

A string with the enclosure character for each field in the CSV file. Defaults to '"' (quotation mark).

headers

A list of the column names in the CSV feed, which will be used to extract fields from each row.

parse_row(response, row)

This method receives a response and a dict (representing each row) with a key for each provided (or detected) header of the CSV file. This spider also gives the opportunity to override the adapt_response and process_results methods for pre- and post-processing purposes.

CSVFeedSpider example

Let's see an example similar to the previous one, but using a CSVFeedSpider:

from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem

class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item

SitemapSpider

class scrapy.spiders.SitemapSpider

SitemapSpider allows you to crawl a site by discovering the URLs using Sitemaps.

It supports nested sitemaps and discovering sitemap urls from robots.txt.

sitemap_urls

A list of urls pointing to the sitemaps whose urls you want to crawl.

You can also point to a robots.txt and it will be parsed to extract sitemap urls from it.

sitemap_rules

A list of tuples (regex, callback) where:

  • regex is a regular expression to match urls extracted from sitemaps. regex can be either a str or a compiled regex object.
  • callback is the callback to use for processing the urls that match the regular expression. callback can be a string (indicating the name of a spider method) or a callable.

For example:

sitemap_rules = [('/product/', 'parse_product')]

Rules are applied in order, and only the first one that matches will be used.

If you omit this attribute, all urls found in sitemaps will be processed with the parse callback.

sitemap_follow

A list of regexes matching sitemap urls that should be followed. This is only for sites that use Sitemap index files that point to other sitemap files.

By default, all sitemaps are followed.

sitemap_alternate_links

Specifies if alternate links for one url should be followed. These are links for the same website in another language passed within the same url block.

For example:

<url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>

With sitemap_alternate_links set, this would retrieve both URLs. With sitemap_alternate_links disabled, only http://example.com/ would be retrieved.

By default, sitemap_alternate_links is disabled.
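
To opt in, set the attribute on your spider class; a minimal sketch:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # Also crawl the hreflang alternates listed inside each <url> block
    sitemap_alternate_links = True

    def parse(self, response):
        pass  # ... scrape item here ...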

SitemapSpider examples

Simplest example: process all urls discovered through sitemaps using the parse callback:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass # ... scrape item here ...

Process some urls with certain callback and other urls with a different callback:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass # ... scrape product ...

    def parse_category(self, response):
        pass # ... scrape category ...

Follow sitemaps defined in the robots.txt file and only follow sitemaps whose url contains /sitemap_shop:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass # ... scrape shop here ...

Combine SitemapSpider with other sources of urls:

import scrapy
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]

    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass # ... scrape shop here ...

    def parse_other(self, response):
        pass # ... scrape other here ...