Spiders
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:
1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests. The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request objects for the URLs specified in start_urls, with the parse method as their callback function.

2. In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

4. Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled into Scrapy for different purposes. We will talk about those types here.
scrapy.Spider
class scrapy.spiders.Spider
This is the simplest spider, and the one from which every other spider must inherit (including spiders that come bundled with Scrapy, as well as spiders that you write yourself). It doesn't provide any special functionality. It just provides a default start_requests() implementation that sends requests for the URLs listed in the start_urls attribute and calls the spider's parse method for each of the resulting responses.
name
A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it's required.
If the spider scrapes a single domain, a common practice is to name the spider after the domain, with or without the TLD. So, for example, a spider that crawls mywebsite.com would often be called mywebsite.
allowed_domains
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won't be followed if OffsiteMiddleware is enabled.
start_urls
A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.
custom_settings
A dictionary of settings that will be overridden from the project-wide configuration when running this spider. It must be defined as a class attribute since the settings are updated before instantiation.
For a list of available built-in settings see: Built-in settings reference.
crawler
This attribute is set by the from_crawler() class method after the class is initialized, and links to the Crawler object this spider instance is bound to.
Crawlers encapsulate a lot of components in the project for their single entry access (such as extensions, middlewares, signals managers, etc). See the Crawler API for details.
logger
Python logger created with the Spider's name. You can use it to send log messages through it, as described in Logging from Spiders.
from_crawler(crawler, *args, **kwargs)
This is the class method used by Scrapy to create your spiders.
You probably won't need to override this directly, since the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.
Nonetheless, this method sets the crawler and settings attributes in the new instance, so they can be accessed later inside the spider's code.
Parameters:
- crawler (Crawler instance) – crawler to which the spider will be bound
- args (list) – arguments passed to the __init__() method
- kwargs (dict) – keyword arguments passed to the __init__() method
start_requests()
This method must return an iterable with the first Requests to crawl for this spider.
This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the Requests. This method is also called only once from Scrapy, so it's safe to implement it as a generator.
The default implementation uses make_requests_from_url() to generate Requests for each url in start_urls.
If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in using a POST request, you could do:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
make_requests_from_url(url)
A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to convert urls to requests.
Unless overridden, this method returns Requests with the parse() method as their callback function, and with the dont_filter parameter enabled (see the Request class for more info).
parse(response)
This is the default callback used by Scrapy to process downloaded responses, when their requests don't specify a callback.
The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Requests callbacks have the same requirements as the Spider class.
This method, as well as any other Request callback, must return an iterable of Request and/or Item objects.
Parameters: response (Response) – the response to parse
log(message[, level, component])
Wrapper that sends a log message through the Spider's logger, kept for backwards compatibility. For more information see Logging from Spiders.
closed(reason)
Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.
Let's see an example:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
Return multiple Requests and items from a single callback:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Instead of start_urls you can use start_requests() directly; to give data more structure you can use Items:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Spider arguments
Spiders can receive arguments that modify their behaviour. Some common uses for spider arguments are to define the start URLs or to restrict the crawl to certain sections of the site, but they can be used to configure any functionality of the spider.
Spider arguments are passed through the crawl command using the -a option. For example:

scrapy crawl myspider -a category=electronics
Spiders receive arguments in their constructors:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...
Spider arguments can also be passed through the Scrapyd schedule.json API. See the Scrapyd documentation.
Generic Spiders
Scrapy comes with some useful generic spiders that you can use to subclass your spiders from. Their aim is to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from Sitemaps, or parsing an XML/CSV feed.
For the examples used in the following spiders, we'll assume you have a project with a TestItem declared in a myproject.items module:
import scrapy

class TestItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
CrawlSpider
class scrapy.spiders.CrawlSpider
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it's generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.
Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:
rules
A list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rule objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.
This spider also exposes an overrideable method:

parse_start_url(response)
This method is called for the start_urls responses. It allows parsing the initial responses and must return either an Item object, a Request object, or an iterable containing any of them.
Crawling rules
class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
link_extractor is a Link Extractor object which defines how links will be extracted from each crawled page.
callback is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link_extractor. This callback receives a response as its first argument and must return a list containing Item and/or Request objects (or any subclass of them).

Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.
follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
process_links is a callable, or a string (in which case a method from the spider object with that name will be used), which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
process_request is a callable, or a string (in which case a method from the spider object with that name will be used), which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
CrawlSpider example
Let’s now take a look at an example CrawlSpider with rules:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
This spider would start crawling example.com's home page, collecting category links and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.
XMLFeedSpider
class scrapy.spiders.XMLFeedSpider
XMLFeedSpider is designed for parsing XML feeds by iterating through them by a certain node name. The iterator can be chosen from: iternodes, xml, and html. It's recommended to use the iternodes iterator for performance reasons, since the xml and html iterators generate the whole DOM at once in order to parse it. However, using html as the iterator may be useful when parsing XML with bad markup.
To set the iterator and the tag name, you must define the following class attributes:
iterator
A string which defines the iterator to use. It can be 'iternodes', 'xml', or 'html'.
It defaults to 'iternodes'.
itertag
A string with the name of the node (or element) to iterate in. Example:

itertag = 'product'
namespaces
A list of (prefix, uri) tuples which define the namespaces available in documents processed with this spider. prefix and uri will be used to automatically register namespaces using the register_namespace() method.
You can then specify nodes with namespaces in the itertag attribute.
Example:

class YourSpider(XMLFeedSpider):

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    # ...
Apart from these new attributes, this spider has the following overrideable methods too:
adapt_response(response)
A method that receives the response as soon as it arrives from the spider middleware, before the spider starts parsing it. It can be used to modify the response body before parsing it. This method receives a response and also returns a response (it could be the same or another one).
parse_node(response, selector)
This method is called for the nodes matching the provided tag name (itertag). Receives the response and a Selector for each node. Overriding this method is mandatory. Otherwise, your spider won't work. This method must return either an Item object, a Request object, or an iterable containing any of them.
process_results(response, results)
This method is called for each result (item or request) returned by the spider, and it's intended to perform any last-time processing required before returning the results to the framework core, for example setting the item IDs. It receives a list of results and the response which originated those results. It must return a list of results (Items or Requests).
XMLFeedSpider example
These spiders are pretty easy to use, let's have a look at one example:
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item
Basically what we did up there was to create a spider that downloads a feed from the given start_urls, then iterates through each of its item tags, prints them out, and stores some random data in an Item.
CSVFeedSpider
class scrapy.spiders.CSVFeedSpider
This spider is very similar to the XMLFeedSpider, except that it iterates over rows, instead of nodes. The method that gets called in each iteration is parse_row().
。-
delimiter
¶ 一个字符串,带有CSV文件中每个字段的分隔符。默认为
','
(逗号)。
quotechar
A string with the enclosure character for each field in the CSV file. Defaults to '"' (quotation mark).
headers
A list of the rows contained in the file CSV feed which will be used to extract fields from it.
parse_row(response, row)
Receives a response and a dict (representing each row) with a key for each provided (or detected) header of the CSV file. This spider also gives the opportunity to override the adapt_response and process_results methods for pre- and post-processing purposes.
CSVFeedSpider example
Let's see an example similar to the previous one, but using a CSVFeedSpider:
from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem

class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item
SitemapSpider
class scrapy.spiders.SitemapSpider
SitemapSpider allows you to crawl a site by discovering the URLs using Sitemaps.
It supports nested sitemaps and discovering sitemap urls from robots.txt.
sitemap_urls
A list of urls pointing to the sitemaps whose urls you want to crawl.
You can also point to a robots.txt and it will be parsed to extract sitemap urls from it.
sitemap_rules
A list of tuples (regex, callback) where:
- regex is a regular expression to match urls extracted from sitemaps. regex can be either a str or a compiled regex object.
- callback is the callback to use for processing the urls that match the regular expression. callback can be a string (indicating the name of a spider method) or a callable.
For example:

sitemap_rules = [('/product/', 'parse_product')]

Rules are applied in order, and only the first one that matches will be used.
If you omit this attribute, all urls found in sitemaps will be processed with the parse callback.
sitemap_follow
A list of regexes of sitemaps that should be followed. This is only for sites that use Sitemap index files that point to other sitemap files.
By default, all sitemaps are followed.
sitemap_alternate_links
Specifies if alternate links for one url should be followed. These are links for the same website in another language passed within the same url block.
For example:

<url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>

With sitemap_alternate_links set, this would retrieve both URLs. With sitemap_alternate_links disabled, only http://example.com/ would be retrieved.
Default is sitemap_alternate_links disabled.
SitemapSpider examples
Simplest example: process all urls discovered through sitemaps using the parse callback:
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass # ... scrape item here ...
Process some urls with certain callback and other urls with a different callback:
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass # ... scrape product ...

    def parse_category(self, response):
        pass # ... scrape category ...
Follow sitemaps defined in the robots.txt file and only follow sitemaps whose url contains /sitemap_shop:
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass # ... scrape shop here ...
Combine SitemapSpider with other sources of urls:
import scrapy
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]

    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass # ... scrape shop here ...

    def parse_other(self, response):
        pass # ... scrape other here ...