Using Firefox for scraping

Here is a list of tips and advice on using Firefox for scraping, along with a list of useful Firefox add-ons to ease the scraping process.

Caveats with inspecting the live browser DOM

Since Firefox add-ons operate on a live browser DOM, what you’ll actually see when inspecting the page source is not the original HTML, but a modified version produced after the browser has cleaned up the markup and executed JavaScript code. Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so an XPath expression that includes <tbody> will not extract any data when the original markup lacks that element.
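The <tbody> mismatch can be reproduced without a browser. The following is a minimal sketch, assuming a small hypothetical table (the markup, the "prices" id and the variable names are made up for illustration) and using Python's standard xml.etree.ElementTree as a stand-in for Scrapy's selectors, which likewise work on the markup exactly as it was sent:

```python
import xml.etree.ElementTree as ET

# Hypothetical raw markup as a server might send it:
# the rows sit directly under <table>, with no <tbody>.
RAW_HTML = """
<html><body>
<table id="prices">
  <tr><td>apple</td><td>1.20</td></tr>
  <tr><td>pear</td><td>0.90</td></tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_HTML.strip())

# Path copied from the browser's DOM view, which shows an inserted <tbody>.
with_tbody = root.findall(".//table/tbody/tr")

# Path that matches the markup as it was actually sent.
without_tbody = root.findall(".//table/tr")

print(len(with_tbody))     # 0
print(len(without_tbody))  # 2
```

The first expression finds no rows because the <tbody> element only exists in the browser's cleaned-up DOM, not in the downloaded HTML.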

Therefore, you should keep in mind the following things when working with Firefox and XPath:

  • Disable JavaScript in Firefox while inspecting the DOM to look for XPaths to be used in Scrapy
  • Never use full XPath paths. Use relative paths based on attributes (such as id, class, width, etc.) or on other identifying features, such as contains(@href, 'image')
  • Never include <tbody> elements in your XPath expressions unless you really know what you’re doing
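To illustrate why full paths are fragile, here is a minimal sketch that compares a step-by-step path with an attribute-based one on a hypothetical page before and after a small layout change. The markup, ids, class names and the hrefs helper are assumptions made for this example. It uses the standard xml.etree.ElementTree module, which supports only a subset of XPath and has no contains() function, so the attribute-based path uses an exact class match instead:

```python
import xml.etree.ElementTree as ET

# Hypothetical page markup.
ORIGINAL = """<html><body>
<div id="header">Shop</div>
<div id="content"><ul class="links">
<li><a href="/image/1.png">one</a></li>
<li><a href="/image/2.png">two</a></li>
</ul></div>
</body></html>"""

# The same page after a redesign: a banner <div> now precedes the content.
REDESIGNED = ORIGINAL.replace(
    '<div id="content">',
    '<div id="banner">Sale</div><div id="content">',
)

# Full path that depends on the position of every ancestor.
FULL_PATH = "./body/div[2]/ul/li/a"
# Relative path anchored on a distinguishing attribute.
ATTR_PATH = ".//ul[@class='links']/li/a"


def hrefs(markup, path):
    """Return the href of every <a> element matched by path."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(path)]


print(hrefs(ORIGINAL, FULL_PATH))    # ['/image/1.png', '/image/2.png']
print(hrefs(REDESIGNED, FULL_PATH))  # []
print(hrefs(REDESIGNED, ATTR_PATH))  # ['/image/1.png', '/image/2.png']
```

The full path stops matching as soon as one extra element shifts the position of the content <div>, while the attribute-based path keeps working on both versions.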

Useful Firefox add-ons for scraping

Firebug

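Firebug is a widely known tool among web developers and it’s also very useful for scraping. In particular, its Inspect Element feature comes in very handy when building the XPaths for extracting data, because it lets you view the HTML source of each page element as you move your mouse over it.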

See Using Firebug for scraping for a detailed guide on how to use Firebug with Scrapy.

XPather

XPather allows you to test XPath expressions directly on the page.

XPath Checker

XPath Checker is another Firefox add-on for testing XPath expressions.

Tamper Data

Tamper Data is a Firefox add-on that allows you to view and modify the HTTP headers sent by Firefox. Firebug also allows you to view HTTP headers, but not to modify them.

Firecookie

Firecookie makes it easy to view and manage cookies. You can use this extension to create a new cookie, delete existing cookies, see a list of cookies for the current site, manage cookie permissions and a lot more.