Scrapy defaults are optimized for crawling specific sites. These sites are often handled by a single Scrapy spider, although this is not necessary or required (for example, there are generic spiders that handle any given site thrown at them).
In addition to this “focused crawl”, there is another common type of crawling which covers a large (potentially unlimited) number of domains, and is only limited by time or another arbitrary constraint, rather than stopping when the domain has been crawled to completion or when there are no more requests to perform. These are called “broad crawls” and are the typical crawls employed by search engines. These are some common properties often found in broad crawls:
- they crawl many domains (often, unbounded) instead of a specific set of sites
- they don’t necessarily crawl domains to completion, because it would be impractical (or impossible) to do so, and instead limit the crawl by time or number of pages crawled
- they are simpler in logic (as opposed to very complex spiders with many extraction rules) because data is often post-processed in a separate stage
- they crawl many domains concurrently, which allows them to achieve faster crawl speeds by not being limited by any particular site constraint (each site is crawled slowly to respect politeness, but many sites are crawled in parallel)
As said above, Scrapy default settings are optimized for focused crawls, not broad crawls. However, due to its asynchronous architecture, Scrapy is very well suited for performing fast broad crawls. This page summarizes some things you need to keep in mind when using Scrapy for doing broad crawls, along with concrete suggestions of Scrapy settings to tune in order to achieve an efficient broad crawl.
Concurrency is the number of requests that are processed in parallel. There is a global limit and a per-domain limit.
The default global concurrency limit in Scrapy is not suitable for crawling many different domains in parallel, so you will want to increase it. How much to increase it will depend on how much CPU your crawler will have available. A good starting point is 100, but the best way to find out is by doing some trials and identifying at what concurrency your Scrapy process gets CPU-bound. For optimum performance, pick a concurrency at which CPU usage is around 80-90%.
To increase the global concurrency use:
CONCURRENT_REQUESTS = 100
REACTOR_THREADPOOL_MAXSIZE = 20
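Note that the global limit works together with the per-domain limit (CONCURRENT_REQUESTS_PER_DOMAIN, whose Scrapy default is 8): the global value caps the whole crawler, while the per-domain value is what keeps any single site from being hit too hard. As a sketch, with illustrative values rather than recommendations:
CONCURRENT_REQUESTS = 100
# Keep each individual domain polite while many domains run in parallel
# (4 is an illustrative value; the Scrapy default is 8).
CONCURRENT_REQUESTS_PER_DOMAIN = 4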
When doing broad crawls you are often only interested in the crawl rates you get and any errors found; these stats are reported by Scrapy when using the INFO log level. In order to save CPU (and log storage requirements) you should not use the DEBUG log level when performing large broad crawls in production. Using DEBUG level while developing your (broad) crawler may be fine, though.
To set the log level use:
LOG_LEVEL = 'INFO'
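If you want to make the production log level explicit in code rather than relying on the project-wide setting, one option is the spider's custom_settings attribute; a minimal sketch, with a hypothetical spider name:
import scrapy

class BroadCrawlSpider(scrapy.Spider):
    name = "broadcrawl"  # hypothetical name
    custom_settings = {
        "LOG_LEVEL": "INFO",  # switch to "DEBUG" locally while developing
    }

    def parse(self, response):
        ...  # extraction logic goes here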
Retrying failed HTTP requests can slow down the crawl substantially, especially when sites are very slow (or fail) to respond, causing a timeout error that gets retried many times unnecessarily and preventing crawler capacity from being reused for other domains.
To disable retries use:
RETRY_ENABLED = False
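If disabling retries outright feels too aggressive for your sources, a middle ground is to keep them enabled but cap the retry count below the Scrapy default of 2; the value here is an assumption to tune:
RETRY_ENABLED = True
RETRY_TIMES = 1  # at most one retry per failed request (Scrapy default is 2)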
Unless you are crawling from a very slow connection (which shouldn't be the case for broad crawls), reduce the download timeout so that stuck requests are discarded quickly, freeing up capacity to process the next ones.
To reduce the download timeout use:
DOWNLOAD_TIMEOUT = 15
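DOWNLOAD_TIMEOUT applies globally, but individual requests can override it through the download_timeout meta key, which is read by the built-in DownloadTimeoutMiddleware. A sketch, with a hypothetical spider and URL:
import scrapy

class TimeoutOverrideSpider(scrapy.Spider):
    name = "timeout_override"  # hypothetical name

    def start_requests(self):
        # Give one known-slow host a larger budget than the global 15 seconds.
        yield scrapy.Request(
            "https://example.com/slow-endpoint",  # hypothetical URL
            meta={"download_timeout": 30},
        )

    def parse(self, response):
        ...  # handle the response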
Consider disabling redirects, unless you are interested in following them. When doing broad crawls it is common to save the redirect targets and resolve them when revisiting the site in a later crawl. This also helps keep the number of requests per crawl batch constant; otherwise redirect loops may cause the crawler to dedicate too many resources to any particular domain.
To disable redirects use:
REDIRECT_ENABLED = False
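With REDIRECT_ENABLED = False, 3xx responses are no longer followed, and by default HttpErrorMiddleware filters them out before they reach the spider; to save them for a later crawl as described above, the spider has to opt in via handle_httpstatus_list. A minimal sketch, where the spider name and the item fields are assumptions:
import scrapy

class BroadSpider(scrapy.Spider):
    name = "broad"  # hypothetical name
    # Let redirect responses reach the callback instead of being dropped.
    handle_httpstatus_list = [301, 302, 303, 307, 308]

    def parse(self, response):
        if 300 <= response.status < 400:
            # Record the redirect target so it can be resolved in a later batch.
            yield {
                "from_url": response.url,
                "to_url": response.headers.get("Location", b"").decode(),
            }
            return
        ...  # normal extraction for non-redirect responses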
Enable crawling of “Ajax Crawlable Pages”
Some pages (up to 1%, based on empirical data from the year 2013) declare themselves as ajax crawlable. This means they provide a plain HTML version of content that is usually available only via AJAX. Pages can indicate it in two ways:
1. by using #! in the URL - this is the default way;
2. by using a special meta tag - this way is used on “main” and “index” website pages.
Scrapy handles (1) automatically; to handle (2) enable AjaxCrawlMiddleware:
AJAXCRAWL_ENABLED = True
Broad crawls often crawl a large number of “index” pages; AjaxCrawlMiddleware helps to crawl them correctly. It is turned off by default because it has some performance overhead, and enabling it for focused crawls doesn't make much sense.
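Putting the suggestions from this page together, a broad-crawl settings module could start from the values shown above (they are starting points to tune, not universal recommendations):
# settings.py - broad crawl starting point, using the values from this page
CONCURRENT_REQUESTS = 100
REACTOR_THREADPOOL_MAXSIZE = 20
LOG_LEVEL = 'INFO'
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 15
REDIRECT_ENABLED = False
AJAXCRAWL_ENABLED = True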