Debugging memory leaks

In Scrapy, objects such as Requests, Responses and Items have a finite lifetime: they are created, used for a while, and finally destroyed.

From all those objects, the Request is probably the one with the longest lifetime, as it stays waiting in the Scheduler queue until it’s time to process it. For more info see Architecture overview.

As these Scrapy objects have a (rather long) lifetime, there is always the risk of accumulating them in memory without releasing them properly and thus causing what is known as a “memory leak”.

To help debug memory leaks, Scrapy provides a built-in mechanism for tracking object references called trackref, and you can also use a third-party library called Guppy for more advanced memory debugging (see below for more info). Both mechanisms must be used from the Telnet Console.

Common causes of memory leaks

It happens quite often (sometimes by accident, sometimes on purpose) that the Scrapy developer passes objects referenced in Requests (for example, using the meta attribute or the request callback function), which effectively binds the lifetime of those referenced objects to the lifetime of the Request. This is, by far, the most common cause of memory leaks in Scrapy projects, and a quite difficult one to debug for newcomers.

In big projects, the spiders are typically written by different people, and some of those spiders could be "leaking", thus affecting the rest of the (well-written) spiders when they run concurrently, which, in turn, affects the whole crawling process.

The leak could also come from a custom middleware, pipeline or extension that you have written, if you are not releasing the (previously allocated) resources properly. For example, allocating resources on spider_opened but not releasing them on spider_closed may cause problems if you're running multiple spiders per process; a sketch of this pattern follows.
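
As a rough sketch of this pattern (the extension and resource names below are purely hypothetical), the resource allocated on spider_opened must be released again on spider_closed; dropping the second handler would leave one leaked object behind per spider run:

from scrapy import signals

class ResourceExtension:
    """Hypothetical extension that keeps one resource per running spider."""

    def __init__(self):
        self.resources = {}

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        # allocate the resource when the spider starts
        self.resources[spider.name] = open("%s.log" % spider.name, "a")

    def spider_closed(self, spider):
        # release it when the spider finishes; forgetting this handler is what
        # causes the leak when running multiple spiders per process
        self.resources.pop(spider.name).close()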

Too many requests?

By default Scrapy keeps the request queue in memory; it includes Request objects and all objects referenced in Request attributes (e.g. in meta). While not necessarily a leak, this can take a lot of memory. Enabling the persistent job queue could help keep memory usage in control.
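
For example, the persistent job queue is enabled by setting JOBDIR (the spider name and directory below are just placeholders), which makes Scrapy keep the pending requests on disk instead of in memory:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1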

Debugging memory leaks with trackref

trackref is a module provided by Scrapy to debug the most common cases of memory leaks. It basically tracks the references to all live Request, Response, Item and Selector objects.

You can enter the telnet console and inspect how many objects (of the classes mentioned above) are currently alive using the prefs() function, which is an alias of the print_live_refs() function:

telnet localhost 6023

>>> prefs()
Live References

ExampleSpider                       1   oldest: 15s ago
HtmlResponse                       10   oldest: 1s ago
Selector                            2   oldest: 0s ago
FormRequest                       878   oldest: 7s ago

As you can see, that report also shows the "age" of the oldest object in each class. If you're running multiple spiders per process, chances are you can figure out which spider is leaking by looking at the oldest request or response. You can get the oldest object of each class using the get_oldest() function (from the telnet console).

Which objects are tracked?

The objects tracked by trackref are all from these classes (and all their subclasses): Request, Response, Item, Selector and Spider.

A real example

Let's see a concrete example of a hypothetical case of memory leaks. Suppose we have a spider with a line similar to this one:

return Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
    callback=self.parse, meta={referer: response})

That line is passing a response reference inside a request, which effectively ties the response's lifetime to the request's, and that would definitely cause memory leaks.

Let's see how we can discover the cause (without knowing it a priori, of course) by using the trackref tool.

After the crawler has been running for a while and we notice its memory usage has grown a lot, we can enter its telnet console and check the live references:

>>> prefs()
Live References

SomenastySpider                     1   oldest: 15s ago
HtmlResponse                     3890   oldest: 265s ago
Selector                            2   oldest: 0s ago
Request                          3878   oldest: 250s ago

The fact that there are so many live responses (and that they're so old) is definitely suspicious, as responses should have a relatively short lifetime compared to Requests. The number of responses is similar to the number of requests, so it looks like they are tied together in some way. We can now go and check the code of the spider to discover the nasty line that is generating the leaks (passing response references inside requests).

Sometimes extra info about live objects can be helpful. Let's check the oldest response:

>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('HtmlResponse')
>>> r.url
'http://www.somenastyspider.com/product.php?pid=123'

If you want to iterate over all objects, instead of getting the oldest one, you can use the scrapy.utils.trackref.iter_all() function:

>>> from scrapy.utils.trackref import iter_all
>>> [r.url for r in iter_all('HtmlResponse')]
['http://www.somenastyspider.com/product.php?pid=123',
 'http://www.somenastyspider.com/product.php?pid=584',
...
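
Once the offending line is found, one possible fix for this hypothetical spider (a sketch, not the only option) is to pass only the piece of data that is actually needed, such as the referring URL, instead of the whole response object:

return Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
    callback=self.parse, meta={"referer": response.url})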

Too many spiders?

If your project has too many spiders executed in parallel, the output of prefs() can be difficult to read. For this reason, that function has an ignore argument which can be used to ignore a particular class (and all its subclasses). For example, this won't show any live references to spiders:

>>> from scrapy.spiders import Spider
>>> prefs(ignore=Spider)

scrapy.utils.trackref module

Here are the functions available in the trackref module.

class scrapy.utils.trackref.object_ref

Inherit from this class (instead of object) if you want to track live instances with the trackref module.

scrapy.utils.trackref.print_live_refs(class_name, ignore=NoneType)

Print a report of live references, grouped by class name.

Parameters: ignore (class or tuple of classes) – if given, all objects from the specified class (or tuple of classes) will be ignored.

scrapy.utils.trackref.get_oldest(class_name)

Return the oldest object alive with the given class name, or None if none is found. Use print_live_refs() first to get a list of all tracked live objects per class name.

scrapy.utils.trackref.iter_all(class_name)

Return an iterator over all objects alive with the given class name, or None if none is found. Use print_live_refs() first to get a list of all tracked live objects per class name.
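
As a minimal sketch of how this tracking can be used for your own classes (the CachedPage class below is hypothetical), inherit from object_ref and the live instances become visible to the functions above and to prefs():

>>> from scrapy.utils.trackref import object_ref, get_oldest, iter_all
>>> class CachedPage(object_ref):  # hypothetical class, for illustration only
...     pass
...
>>> pages = [CachedPage() for _ in range(3)]
>>> len(list(iter_all('CachedPage')))
3
>>> get_oldest('CachedPage') is pages[0]
True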

Debugging memory leaks with Guppy

trackref provides a very convenient mechanism for tracking down memory leaks, but it only keeps track of the objects that are more likely to cause memory leaks (Requests, Responses, Items, and Selectors). However, there are other cases where the memory leaks could come from other (more or less obscure) objects. If this is your case, and you can't find your leaks using trackref, you still have another resource: the Guppy library.

If you use pip, you can install Guppy with the following command:

pip install guppy

The telnet console also comes with a built-in shortcut (hpy) for accessing Guppy heap objects. Here’s an example to view all Python objects available in the heap using Guppy:

>>> x = hpy.heap()
>>> x.bytype
Partition of a set of 297033 objects. Total size = 52587824 bytes.
 Index  Count   %     Size   % Cumulative  % Type
     0  22307   8 16423880  31  16423880  31 dict
     1 122285  41 12441544  24  28865424  55 str
     2  68346  23  5966696  11  34832120  66 tuple
     3    227   0  5836528  11  40668648  77 unicode
     4   2461   1  2222272   4  42890920  82 type
     5  16870   6  2024400   4  44915320  85 function
     6  13949   5  1673880   3  46589200  89 types.CodeType
     7  13422   5  1653104   3  48242304  92 list
     8   3735   1  1173680   2  49415984  94 _sre.SRE_Pattern
     9   1209   0   456936   1  49872920  95 scrapy.http.headers.Headers
<1676 more rows. Type e.g. '_.more' to view.>

You can see that most space is used by dicts. Then, if you want to see from which attribute those dicts are referenced, you could do:

>>> x.bytype[0].byvia
Partition of a set of 22307 objects. Total size = 16423880 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0  10982  49  9416336  57   9416336  57 '.__dict__'
     1   1820   8  2681504  16  12097840  74 '.__dict__', '.func_globals'
     2   3097  14  1122904   7  13220744  80
     3    990   4   277200   2  13497944  82 "['cookies']"
     4    987   4   276360   2  13774304  84 "['cache']"
     5    985   4   275800   2  14050104  86 "['meta']"
     6    897   4   251160   2  14301264  87 '[2]'
     7      1   0   196888   1  14498152  88 "['moduleDict']", "['modules']"
     8    672   3   188160   1  14686312  89 "['cb_kwargs']"
     9     27   0   155016   1  14841328  90 '[1]'
<333 more rows. Type e.g. '_.more' to view.>

As you can see, the Guppy module is very powerful but also requires some deep knowledge about Python internals. For more info about Guppy, refer to the Guppy documentation.

Leaks without leaks

Sometimes, you may notice that the memory usage of your Scrapy process will only increase, but never decrease. Unfortunately, this could happen even though neither Scrapy nor your project are leaking memory. This is due to a (not so well) known problem of Python, which, in some cases, may not return released memory to the operating system. For more information on this issue, see the paper quoted below.

The improvements proposed by Evan Jones, detailed in this paper, got merged in Python 2.5, but this only reduces the problem; it doesn't fix it completely. To quote the paper:

Unfortunately, this patch can only free an arena if there are no more objects allocated in it anymore. This means that fragmentation is a large issue. An application could have many megabytes of free memory, scattered throughout all the arenas, but it will be unable to free any of it. This is a problem experienced by all memory allocators. The only way to solve it is to move to a compacting garbage collector, which is able to move objects in memory. This would require significant changes to the Python interpreter.

To keep memory consumption reasonable, you can split the job into several smaller jobs, or enable the persistent job queue and stop/start the spider from time to time.