Broad Crawls

Scrapy defaults are optimized for crawling specific sites. These sites are often handled by a single Scrapy spider, although this is not necessary or required (for example, there are generic spiders that handle any given site thrown at them).

In addition to this “focused crawl”, there is another common type of crawling which covers a large (potentially unlimited) number of domains, and is only limited by time or some other arbitrary constraint, rather than stopping when the domain has been crawled to completion or when there are no more requests to perform. These are called “broad crawls” and are the typical crawls performed by search engines.

These are some common properties often found in broad crawls:

  • they crawl many domains (often, unbounded) instead of a specific set of sites
  • they don’t necessarily crawl domains to completion, because it would be impractical (or impossible) to do so, and instead limit the crawl by time or number of pages crawled
  • they are simpler in logic (as opposed to very complex spiders with many extraction rules) because data is often post-processed in a separate stage
  • they crawl many domains concurrently, which allows them to achieve faster crawl speeds by not being limited by any particular site constraint (each site is crawled slowly to respect politeness, but many sites are crawled in parallel)

As said above, Scrapy default settings are optimized for focused crawls, not broad crawls. However, due to its asynchronous architecture, Scrapy is very well suited for performing fast broad crawls. This page summarizes some things you need to keep in mind when using Scrapy for doing broad crawls, along with concrete suggestions of Scrapy settings to tune in order to achieve an efficient broad crawl.

Increase concurrency

Concurrency is the number of requests that are processed in parallel. There is a global limit and a per-domain limit.

The default global concurrency limit in Scrapy is not suitable for crawling many different domains in parallel, so you will want to increase it. How much to increase it will depend on how much CPU your crawler will have available. A good starting point is 100, but the best way to find out is by doing some trials and identifying at what concurrency your Scrapy process gets CPU bounded. For optimum performance, you should pick a concurrency where CPU usage is at 80-90%.

To increase the global concurrency use:

CONCURRENT_REQUESTS = 100
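
The setting above raises only the global limit; the per-domain limit mentioned earlier is controlled by a separate setting. A minimal sketch, assuming the standard CONCURRENT_REQUESTS_PER_DOMAIN setting (the value here is illustrative, not a recommendation from this page):

CONCURRENT_REQUESTS_PER_DOMAIN = 8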

Increase Twisted IO thread pool maximum size

Currently Scrapy does DNS resolution in a blocking way with the use of a thread pool. With higher concurrency levels the crawling could be slow or even fail, hitting DNS resolver timeouts. A possible solution is to increase the number of threads handling DNS queries. The DNS queue will then be processed faster, speeding up the establishing of connections and the crawl overall.

To increase the maximum thread pool size use:

REACTOR_THREADPOOL_MAXSIZE = 20

Setup your own DNS

If you have multiple crawling processes and a single central DNS, it can act like a DoS attack on the DNS server, resulting in slowing down of the entire network or even blocking your machines. To avoid this, set up your own DNS server, with a local cache and upstream to some large DNS such as OpenDNS or Verizon.

Reduce log level

When doing broad crawls you are often only interested in the crawl rates you get and any errors found. These stats are reported by Scrapy when using the INFO log level. In order to save CPU (and log storage requirements) you should not use the DEBUG log level when performing large broad crawls in production. Using the DEBUG level when developing your (broad) crawler may be fine though.

To set the log level use:

LOG_LEVEL = 'INFO'

Disable cookies

Disable cookies unless you really need them. Cookies are often not needed when doing broad crawls (search engine crawlers ignore them), and disabling them improves performance by saving some CPU cycles and reducing the memory footprint of your Scrapy crawler.

To disable cookies use:

COOKIES_ENABLED = False

Disable retries

Retrying failed HTTP requests can slow down the crawl substantially, especially when sites are very slow (or fail) to respond, thus causing a timeout error which gets retried many times, unnecessarily, preventing crawler capacity from being reused for other domains.

To disable retries use:

RETRY_ENABLED = False

Reduce download timeout

Unless you are crawling from very slow connections (which shouldn't be the case for broad crawls), reduce the download timeout so that stuck requests are discarded quickly and free up capacity to process the next ones.

To reduce the download timeout use:

DOWNLOAD_TIMEOUT = 15

Disable redirects

Consider disabling redirects, unless you are interested in following them. When doing broad crawls it's common to save the redirect targets and resolve them in a later crawl. This also helps to keep the number of requests constant per crawl batch, otherwise redirect loops may cause the crawler to dedicate too many resources to any specific domain.

To disable redirects use:

REDIRECT_ENABLED = False
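
As a sketch of the "save the redirect targets and resolve them in a later crawl" approach mentioned above (not part of this page's recommendations): with RedirectMiddleware disabled, 3xx responses reach the spider, so a callback can record the Location header instead of following it. handle_httpstatus_list is used so that HttpErrorMiddleware does not drop the 3xx responses; store_for_later_crawl is a hypothetical helper.

import scrapy

class BroadSpider(scrapy.Spider):
    name = 'broad'
    # Let 3xx responses reach parse() instead of being filtered out
    handle_httpstatus_list = [301, 302, 303, 307]

    def parse(self, response):
        if 300 <= response.status < 400:
            # REDIRECT_ENABLED is False, so record the target for a later batch
            location = response.headers.get('Location')
            self.store_for_later_crawl(response.url, location)  # hypothetical helper
            return
        # ... regular parsing of 2xx responses goes here ...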

Enable crawling of “Ajax Crawlable Pages”

Some pages (up to 1%, based on empirical data from year 2013) declare themselves as ajax crawlable. This means they provide a plain HTML version of content that is usually available only via AJAX. Pages can indicate it in two ways:

  1. by using #! in their URLs - this is the default way;
  2. by using a special meta tag - this way is used on “main”, “index” website pages.

Scrapy handles (1) automatically; to handle (2) enable AjaxCrawlMiddleware:

AJAXCRAWL_ENABLED = True

When doing broad crawls it's common to crawl a lot of “index” web pages; AjaxCrawlMiddleware helps to crawl them correctly. It is turned OFF by default because it has some performance overhead, and enabling it for focused crawls doesn't make much sense.
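
For convenience, the settings suggested throughout this page could be collected in one place, e.g. in your project's settings.py; a sketch using the same names and values shown above (tune them based on your own trials):

CONCURRENT_REQUESTS = 100
REACTOR_THREADPOOL_MAXSIZE = 20
LOG_LEVEL = 'INFO'
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 15
REDIRECT_ENABLED = False
AJAXCRAWL_ENABLED = True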