Common Practices

This section documents common practices when using Scrapy. These are things that cover many topics and don’t often fall into any other specific section.

Run Scrapy from a script

You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via scrapy crawl.

Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor.

The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess. This class will start a Twisted reactor for you, configuring the logging and setting shutdown handlers. This class is the one used by all Scrapy commands.

Here's an example showing how to run a single spider with it.

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

Make sure to check the CrawlerProcess documentation to get acquainted with its usage details.

If you are inside a Scrapy project there are some additional helpers you can use to import those components within the project. You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.

What follows is a working example of how to do that, using the testspiders project as an example.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished
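
Keyword arguments passed to crawl() (such as domain above) are forwarded to the spider's constructor. As an illustrative sketch only (the real followall spider lives in the testspiders project and may differ), a spider consuming the domain argument might look like this:

import scrapy

class FollowAllSpider(scrapy.Spider):
    # Hypothetical sketch of the testspiders 'followall' spider.
    name = 'followall'

    def __init__(self, domain=None, **kwargs):
        super().__init__(**kwargs)
        # Keyword arguments from crawl() arrive here.
        self.start_urls = ['http://%s/' % domain]

    def parse(self, response):
        # Follow links within the domain, extract data, etc.
        ...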

There's another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won't start or interfere with existing reactors in any way.

Using this class the reactor should be explicitly run after scheduling your spiders. It's recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.

Note that you will also have to shut down the Twisted reactor yourself after the spider is finished. This can be achieved by adding callbacks to the deferred returned by the CrawlerRunner.crawl method.

Here's an example of its usage, along with a callback to manually stop the reactor after MySpider has finished running.

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

Running multiple spiders in the same process

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

Here's an example that runs multiple spiders simultaneously:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

Here's the same example using CrawlerRunner:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished

Same example but running the spiders sequentially by chaining the deferreds:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

Distributed crawls

Scrapy doesn’t provide any built-in facility for running crawls in a distributed (multi-server) manner. However, there are some ways to distribute crawls, which vary depending on how you plan to distribute them.

If you have many spiders, the obvious way to distribute the load is to set up many Scrapyd instances and distribute spider runs among those.

If you instead want to run a single (big) spider through many machines, what you usually do is partition the urls to crawl and send them to each separate spider. Here is a concrete example:

First, you prepare the list of urls to crawl and put them into separate files/urls:

http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list

Then you fire a spider run on 3 different Scrapyd servers. The spider would receive a (spider) argument part with the number of the partition to crawl:

curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
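
On the spider side, the part argument arrives as a constructor keyword argument and can be used to pick which partition file to fetch. Here is a minimal sketch, assuming the partition files follow the URL pattern above (the class and callback names are illustrative):

import scrapy

class MySpider(scrapy.Spider):
    name = 'spider1'

    def __init__(self, part=None, **kwargs):
        super().__init__(**kwargs)
        self.part = part

    def start_requests(self):
        # Fetch this partition's url list first, then crawl every url in it.
        url = 'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % self.part
        yield scrapy.Request(url, callback=self.parse_url_list)

    def parse_url_list(self, response):
        for url in response.text.splitlines():
            if url.strip():
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        # Your parsing logic for the crawled pages.
        ...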

Avoiding getting banned

Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.

Here are some tips to keep in mind when dealing with these kinds of sites:

  • rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
  • disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
  • use download delays (2 or higher). See the DOWNLOAD_DELAY setting and the settings sketch after this list.
  • if possible, use Google cache to fetch pages, instead of hitting the sites directly
  • use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
  • use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
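
Several of these tips map directly onto Scrapy settings plus a small downloader middleware. Below is a minimal sketch, assuming a project named myproject (the user agent strings are placeholders to be replaced with a real pool):

# settings.py (fragment)
COOKIES_ENABLED = False  # some sites use cookies to spot bot behaviour
DOWNLOAD_DELAY = 2       # seconds of delay between requests

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in user agent middleware and use the rotating one.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
}

# middlewares.py (fragment)
import random

USER_AGENTS = [
    # Fill in with a pool of well-known browser user agents.
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
]

class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        # Pick a different user agent for every outgoing request.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)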

If you are still unable to prevent your bot getting banned, consider contacting commercial support.