Frequently Asked Questions

How does Scrapy compare to BeautifulSoup or lxml?

BeautifulSoup and lxml are libraries for parsing HTML and XML. Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them.

Scrapy provides a built-in mechanism for extracting data (called selectors), but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them. After all, they’re just parsing libraries which can be imported and used from any Python code.

In other words, comparing Scrapy to BeautifulSoup (or lxml) is like comparing jinja2 to Django.

Can I use Scrapy with BeautifulSoup?

Yes, you can. As mentioned above, BeautifulSoup can be used for parsing HTML responses in Scrapy callbacks. You just have to feed the response’s body into a BeautifulSoup object and extract whatever data you need from it.

Here’s an example spider using the BeautifulSoup API, with lxml as the HTML parser:

from bs4 import BeautifulSoup
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # use lxml to get decent HTML parsing speed
        soup = BeautifulSoup(response.text, 'lxml')
        yield {
            "url": response.url,
            "title": soup.h1.string
        }

Note

BeautifulSoup supports several HTML/XML parsers. See BeautifulSoup’s official documentation on which ones are available.

What Python versions does Scrapy support?

Scrapy is supported under Python 2.7 and Python 3.3+ only. Python 2.6 support was dropped starting with Scrapy 0.20. Python 3 support was added in Scrapy 1.1.

Python 3 is not yet supported on Windows.

Did Scrapy “steal” X from Django?

Probably, but we don’t like that word. We think Django is a great open source project and an example to follow, so we’ve used it as an inspiration for Scrapy.

We believe that, if something is already done well, there’s no need to reinvent it. This concept, besides being one of the foundations for open source and free software, not only applies to software but also to documentation, procedures, policies, etc. So, instead of going through each problem ourselves, we choose to copy ideas from those projects that have already solved them properly, and focus on the real problems we need to solve.

We’d be proud if Scrapy serves as an inspiration for other projects. Feel free to steal from us!

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

How can I scrape an item with attributes spread across different pages?

See Passing additional data to callback functions.
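
A minimal sketch of that pattern (the URLs and field names here are purely illustrative): the first callback stores a partially-filled item in the request’s meta dict, and a second callback completes it with data from the other page:

import scrapy


class MultiPageSpider(scrapy.Spider):
    name = "multipage_example"
    start_urls = ['http://www.example.com/items']

    def parse(self, response):
        item = {'title': response.css('h1::text').extract_first()}
        request = scrapy.Request('http://www.example.com/details',
                                 callback=self.parse_details)
        request.meta['item'] = item  # carry the partial item to the next page
        yield request

    def parse_details(self, response):
        item = response.meta['item']
        item['description'] = response.css('p::text').extract_first()
        yield item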

Scrapy crashes with: ImportError: No module named win32api

You need to install pywin32 because of this Twisted bug.
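
For example, if you use pip, it can typically be installed with:

pip install pywin32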

How can I simulate a user login in my spider?

See Using FormRequest.from_response() to simulate a user login.
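
As a hedged sketch (the URL, form field names and credentials are made up), a login typically looks like this:

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login_example'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # from_response() pre-fills the form fields found in the page
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return
        # continue crawling as an authenticated user from here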

Does Scrapy crawl in breadth-first or depth-first order?

By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

My Scrapy crawler has memory leaks. What can I do?

See Debugging memory leaks.

Also, Python has a builtin memory leak issue which is described in Leaks without leaks.

How can I make Scrapy consume less memory?

See previous question.

Can I use Basic HTTP Authentication in my spiders?

Yes, see HttpAuthMiddleware.
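
As a quick sketch (credentials and domain are hypothetical), HttpAuthMiddleware picks up the http_user and http_pass attributes defined on the spider:

import scrapy


class IntranetSpider(scrapy.Spider):
    name = 'intranet_example'
    http_user = 'someuser'
    http_pass = 'somepass'
    start_urls = ['http://intranet.example.com/']

    def parse(self, response):
        pass  # extract data as usual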

Why does Scrapy download pages in English instead of my native language?

Try changing the default Accept-Language request header by overriding the DEFAULT_REQUEST_HEADERS setting.
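
For example, in your project’s settings.py (the Spanish language code is just an illustration):

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'es',
}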

Where can I find some example Scrapy projects?

See Examples.

Can I run a spider without creating a project?

Yes. You can use the runspider command. For example, if you have a spider written in a my_spider.py file you can run it with:

scrapy runspider my_spider.py

See runspider command for more info.

I get “Filtered offsite request” messages. How can I fix them?

Those messages (logged with DEBUG level) don’t necessarily mean there is a problem, so you may not need to fix them.

Those messages are thrown by the Offsite Spider Middleware, which is a spider middleware (enabled by default) whose purpose is to filter out requests to domains outside the ones covered by the spider.

For more info see: OffsiteMiddleware.
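
For context, the filtering is driven by the allowed_domains attribute of your spider; a minimal sketch (the domain is illustrative):

import scrapy


class OnsiteOnlySpider(scrapy.Spider):
    name = 'offsite_example'
    # requests to any other domain will be logged as "Filtered offsite request"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        pass  # extract data and follow links as usual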

Is it OK to use JSON for large exports?

It’ll depend on how large your output is. See this warning in JsonItemExporter documentation.

Can I return (Twisted) deferreds from signal handlers?

Some signals support returning deferreds from their handlers, others don’t. See the Built-in signals reference to know which ones.

What does the response status code 999 mean?

999 is a custom response status code used by Yahoo sites to throttle requests. Try slowing down the crawling speed by using a download delay of 2 (or higher) in your spider:

from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):

    name = 'myspider'

    download_delay = 2

    # [ ... rest of the spider code ... ]

Or by setting a global download delay in your project with the DOWNLOAD_DELAY setting.

Can I call pdb.set_trace() from my spiders to debug them?

Yes, but you can also use the Scrapy shell, which allows you to quickly analyze (and even modify) the response being processed by your spider. This is, quite often, more useful than plain old pdb.set_trace().

For more info see Invoking the shell from spiders to inspect responses.
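
As a quick sketch (the URL is illustrative), you can drop into the shell from a callback with inspect_response:

import scrapy
from scrapy.shell import inspect_response


class DebugSpider(scrapy.Spider):
    name = 'debug_example'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # opens an interactive shell for this response; exit the shell
        # (Ctrl-D, or Ctrl-Z on Windows) to let the crawl continue
        inspect_response(response, self)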

What’s the simplest way to dump all my scraped items into a JSON/CSV/XML file?

To dump into a JSON file:

scrapy crawl myspider -o items.json

To dump into a CSV file:

scrapy crawl myspider -o items.csv

To dump into an XML file:

scrapy crawl myspider -o items.xml

For more information see Feed exports.

What’s this huge cryptic __VIEWSTATE parameter used in some forms?

The __VIEWSTATE parameter is used in sites built with ASP.NET/VB.NET. For more info on how it works see this page. Also, here’s an example spider which scrapes one of these sites.
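
In practice you rarely need to handle __VIEWSTATE by hand: FormRequest.from_response() copies the form’s hidden fields (including __VIEWSTATE) from the page, so you only supply the fields you want to change. A hedged sketch, with a made-up URL and field name:

import scrapy


class AspNetFormSpider(scrapy.Spider):
    name = 'viewstate_example'
    start_urls = ['http://www.example.com/search.aspx']

    def parse(self, response):
        # hidden fields such as __VIEWSTATE are carried over automatically
        return scrapy.FormRequest.from_response(
            response,
            formdata={'txtSearch': 'scrapy'},
            callback=self.parse_results
        )

    def parse_results(self, response):
        pass  # extract the search results here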

What’s the best way to parse big XML/CSV data feeds?

Parsing big feeds with XPath selectors can be problematic since they need to build the DOM of the entire feed in memory, and this can be quite slow and consume a lot of memory.

In order to avoid parsing the entire feed at once in memory, you can use the functions xmliter and csviter from the scrapy.utils.iterators module. In fact, this is what the feed spiders (see Spiders) use under the cover.
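
A minimal sketch of xmliter (the feed URL and the "product" node name are hypothetical): it yields one selector per node, so the whole document never has to be built as a single DOM:

import scrapy
from scrapy.utils.iterators import xmliter


class BigFeedSpider(scrapy.Spider):
    name = 'bigfeed_example'
    start_urls = ['http://www.example.com/products.xml']

    def parse(self, response):
        for node in xmliter(response, 'product'):
            yield {
                'name': node.xpath('name/text()').extract_first(),
                'price': node.xpath('price/text()').extract_first(),
            }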

Does Scrapy manage cookies automatically?

Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, like any regular web browser does.

For more info see Requests and Responses and CookiesMiddleware.

How can I see the cookies being sent and received by Scrapy?

Enable the COOKIES_DEBUG setting.
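
For example, in your project’s settings.py:

COOKIES_DEBUG = True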

How can I instruct a spider to stop itself?

Raise the CloseSpider exception from a callback. For more info see: CloseSpider.
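
A minimal sketch (the URL and the stop condition are illustrative):

import scrapy
from scrapy.exceptions import CloseSpider


class StoppableSpider(scrapy.Spider):
    name = 'stoppable_example'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        if 'no more results' in response.text:  # illustrative stop condition
            raise CloseSpider('finished')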

How can I prevent my Scrapy bot from getting banned?

See Avoiding getting banned.

Should I use spider arguments or settings to configure my spider?

Both spider arguments and settings can be used to configure your spider. There is no strict rule that mandates using one or the other, but settings are better suited for parameters that, once set, don’t change much, while spider arguments are meant to change more often, even on each spider run, and are sometimes required for the spider to run at all (for example, to set the start url of a spider).

To illustrate with an example, assuming you have a spider that needs to log into a site to scrape data, and you only want to scrape data from a certain section of the site (which varies each time). In that case, the credentials to log in would be settings, while the url of the section to scrape would be a spider argument.
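
A hedged sketch of that scenario (the setting names, form fields and login flow are hypothetical): the credentials live in settings, while the section URL arrives as a spider argument, e.g. scrapy crawl section_example -a section_url=http://www.example.com/some-section

import scrapy


class SectionSpider(scrapy.Spider):
    name = 'section_example'

    def __init__(self, section_url=None, *args, **kwargs):
        super(SectionSpider, self).__init__(*args, **kwargs)
        # spider argument: changes on every run
        self.start_urls = [section_url]

    def parse(self, response):
        # settings: configured once in settings.py
        user = self.settings.get('SITE_USER')
        password = self.settings.get('SITE_PASSWORD')
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': user, 'password': password},
            callback=self.parse_section
        )

    def parse_section(self, response):
        pass  # scrape the chosen section here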

I’m scraping an XML document and my XPath selector doesn’t return any items

You may need to remove namespaces. See Removing namespaces.
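
A minimal sketch (the feed URL and XPath are illustrative): call remove_namespaces() on the response’s selector before selecting:

import scrapy


class NamespacedFeedSpider(scrapy.Spider):
    name = 'xmlns_example'
    start_urls = ['http://www.example.com/feed.xml']

    def parse(self, response):
        response.selector.remove_namespaces()
        for title in response.xpath('//title/text()').extract():
            yield {'title': title}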