Scrapy provides a built-in mechanism for extracting data (called selectors), but you can use BeautifulSoup (or lxml) instead if you find them more comfortable to work with. After all, they're just parsing libraries which can be imported and used from any Python code.

In other words, comparing Scrapy to BeautifulSoup (or lxml) is like comparing jinja2 to Django.


Can I use Scrapy with BeautifulSoup?

Yes, you can. As mentioned above, BeautifulSoup can be used for parsing HTML responses in Scrapy callbacks. You just have to feed the response's body into a BeautifulSoup object and extract whatever data you need from it.

Here's an example spider using the BeautifulSoup API, with lxml as the HTML parser:

from bs4 import BeautifulSoup
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # use lxml to get decent HTML parsing speed
        soup = BeautifulSoup(response.text, 'lxml')
        yield {
            "url": response.url,
            "title": soup.h1.string,
        }


BeautifulSoup supports several HTML/XML parsers. See BeautifulSoup's official documentation on which ones are available.


What Python versions does Scrapy support?

Scrapy is supported under Python 2.7 and Python 3.3+. Python 2.6 support was dropped starting at Scrapy 0.20, and Python 3 support was added in Scrapy 1.1.

Python 3 is not yet supported on Windows.


Did Scrapy "steal" X from Django?

Probably, but we don’t like that word. We think Django is a great open source project and an example to follow, so we've used it as an inspiration for Scrapy.

We believe that, if something is already done well, there’s no need to reinvent it. This concept, besides being one of the foundations for open source and free software, not only applies to software but also to documentation, procedures, policies, etc. So, instead of going through each problem ourselves, we choose to copy ideas from those projects that have already solved them properly, and focus on the real problems we need to solve.

We’d be proud if Scrapy serves as an inspiration for other projects. Feel free to steal from us!


Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.


How can I scrape an item with attributes in different pages?

See Passing additional data to callback functions.

Scrapy crashes with: ImportError: No module named win32api

You need to install pywin32 because of a known Twisted bug on Windows.



How can I simulate a user login in my spider?

See Using FormRequest.from_response() to simulate a user login.


Does Scrapy crawl in breadth-first or depth-first order?

By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:

SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
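
To see why the queue type determines the crawl order, here is a minimal standard-library sketch (plain Python, not Scrapy code) contrasting LIFO and FIFO popping over the same pending requests:

```python
from collections import deque

# Three requests discovered in order A, B, C
pending = ["A", "B", "C"]

# LIFO (a stack) pops the newest request first -> depth-first order
stack = list(pending)
dfo = [stack.pop() for _ in range(len(stack))]
print(dfo)  # ['C', 'B', 'A']

# FIFO (a queue) pops the oldest request first -> breadth-first order
queue = deque(pending)
bfo = [queue.popleft() for _ in range(len(queue))]
print(bfo)  # ['A', 'B', 'C']
```

Swapping the scheduler queues from LIFO to FIFO changes the crawl order in exactly this way.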


My Scrapy crawler has memory leaks. What can I do?

See Debugging memory leaks.

Also, Python has a built-in memory leak issue, which is described in Leaks without leaks.


How can I make Scrapy consume less memory?

See previous question.


Can I use Basic HTTP Authentication in my spiders?

Yes, see HttpAuthMiddleware.


Why does Scrapy download pages in English instead of my native language?

Try changing the default Accept-Language request header by overriding the DEFAULT_REQUEST_HEADERS setting.
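
For example, a project that prefers English pages might override the header in its settings.py like this (the exact header value is up to you):

```python
# settings.py -- override the default request headers project-wide
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
```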


Where can I find some example Scrapy projects?

See Examples.


Can I run a spider without creating a project?

Yes. You can use the runspider command. For example, if you have a spider written in a file named my_spider.py, you can run it with:

scrapy runspider my_spider.py

See runspider command for more info.

I get “Filtered offsite request” messages. How can I fix them?

Those messages (logged with DEBUG level) don’t necessarily mean there is a problem, so you may not need to fix them.

Those messages are thrown by the Offsite Spider Middleware, which is a spider middleware (enabled by default) whose purpose is to filter out requests to domains outside the ones covered by the spider.

For more info see: OffsiteMiddleware.
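
As a rough illustration of what "offsite" means here (a simplified sketch, not the middleware's actual implementation), a request is filtered when its host neither equals nor is a subdomain of any allowed domain:

```python
from urllib.parse import urlparse

allowed_domains = ["example.com"]

def is_offsite(url):
    """Return True if url's host is outside all allowed domains."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

print(is_offsite("http://example.com/page"))      # False: exact match
print(is_offsite("http://sub.example.com/page"))  # False: subdomain
print(is_offsite("http://other.com/page"))        # True: filtered out
```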

Can I use JSON for large exports?

It'll depend on how large your output is. See this warning in the JsonItemExporter documentation.
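
One common workaround for large exports is the JSON Lines format (one JSON object per line), which can be written and read incrementally instead of holding every item in memory. A minimal standard-library sketch:

```python
import json
import os
import tempfile

# items can be any iterable, including a generator that never
# materializes the full result set in memory
items = ({"id": i, "name": "item %d" % i} for i in range(3))

path = os.path.join(tempfile.gettempdir(), "items.jl")
with open(path, "w") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")  # one object per line

# each line can be parsed independently when reading back
with open(path) as f:
    first = json.loads(next(f))
print(first)  # {'id': 0, 'name': 'item 0'}
```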

Can I return (Twisted) deferreds from signal handlers?

Some signals support returning deferreds from their handlers, others don’t. See the Built-in signals reference to know which ones.


What does the response status code 999 mean?

999 is a custom response status code used by Yahoo sites to throttle requests. Try slowing down the crawling speed by using a download delay of 2 (or higher) in your spider:

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):

    name = 'myspider'

    download_delay = 2

    # [ ... rest of the spider code ... ]

Or by setting a global download delay in your project with the DOWNLOAD_DELAY setting.


Can I call pdb.set_trace() from my spiders to debug them?

Yes, but you can also use the Scrapy shell, which allows you to quickly analyze (and even modify) the response being processed by your spider. This is often more useful than plain old pdb.set_trace().

For more info see Invoking the shell from spiders to inspect responses.


What's the simplest way to dump all my scraped items into a JSON/CSV/XML file?

To dump into a JSON file:

scrapy crawl myspider -o items.json

To dump into a CSV file:

scrapy crawl myspider -o items.csv

To dump into an XML file:

scrapy crawl myspider -o items.xml

For more information see Feed exports.

What's this huge cryptic __VIEWSTATE parameter used in some forms?

The __VIEWSTATE parameter is used in sites built with ASP.NET/VB.NET. For more info on how it works see this page. Also, here’s an example spider which scrapes one of these sites.
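
If you build form requests by hand, remember that hidden fields like __VIEWSTATE must be sent back along with the visible form values. Below is a standard-library sketch of collecting them (the field value shown is made up; in Scrapy, FormRequest.from_response does this kind of work for you):

```python
from html.parser import HTMLParser

class HiddenFieldCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a.get("name")] = a.get("value", "")

html = '<form><input type="hidden" name="__VIEWSTATE" value="dDwtMTA3"/></form>'
collector = HiddenFieldCollector()
collector.feed(html)
print(collector.fields)  # {'__VIEWSTATE': 'dDwtMTA3'}
```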


What's the best way to parse big XML/CSV data feeds?

Parsing big feeds with XPath selectors can be problematic, since they need to build the DOM of the entire feed in memory, which can be quite slow and consume a lot of memory.

To avoid parsing the entire feed at once in memory, you can use the functions xmliter and csviter from the scrapy.utils.iterators module. In fact, this is what the feed spiders (see Spiders) use under the covers.


Does Scrapy manage cookies automatically?

Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, like any regular web browser does.

For more info see Requests and Responses and CookiesMiddleware.


How can I see the cookies being sent and received from Scrapy?

Enable the COOKIES_DEBUG setting.


How can I instruct a spider to stop itself?

Raise the CloseSpider exception from a callback. For more info see: CloseSpider.


How can I prevent my Scrapy bot from getting banned?

See Avoiding getting banned.


Should I use spider arguments or settings to configure my spider?

Both spider arguments and settings can be used to configure your spider. There is no strict rule that mandates using one or the other, but settings are more suited for parameters that, once set, don't change much, while spider arguments are meant to change more often, even on each spider run, and are sometimes required for the spider to run at all (for example, to set the start url of a spider).

To illustrate with an example, assume you have a spider that needs to log into a site to scrape data, and you only want to scrape data from a certain section of the site (which varies each time). In that case, the credentials to log in would be settings, while the url of the section to scrape would be a spider argument.


I'm scraping an XML document and my XPath selector doesn't return any items

You may need to remove namespaces. See Removing namespaces.