19.1. HTMLParser - Simple HTML and XHTML parser

Note

The HTMLParser module has been renamed to html.parser in Python 3. The 2to3 tool will automatically adapt imports when converting your sources to Python 3.

New in version 2.2.

Source code: Lib/HTMLParser.py

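This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. Unlike the parser in htmllib, this parser is not based on the SGML parser in sgmllib.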

class HTMLParser.HTMLParser

An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered. The user should subclass HTMLParser and override its methods to implement the desired behavior.

The HTMLParser class is instantiated without arguments.

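Unlike the parser in htmllib, this parser does not check that end tags match start tags, nor does it call the end-tag handler for elements which are closed implicitly by closing an outer element.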
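Because the parser does no matching of its own, any balance checking has to live in the subclass. The following is a minimal sketch of one way to do it with a stack; the class name TagBalanceChecker and its attributes are hypothetical and not part of the module. The import falls back to html.parser so the sketch also runs on Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records end tags that do not close the innermost open tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.open_tags = []   # stack of start tags seen but not yet closed
        self.mismatches = []  # (expected, found) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Compare the end tag against the most recently opened tag.
        expected = self.open_tags.pop() if self.open_tags else None
        if expected != tag:
            self.mismatches.append((expected, tag))


checker = TagBalanceChecker()
checker.feed('<p><a>tag soup</p></a>')
checker.close()
print(checker.mismatches)
```

This sketch treats every start tag as needing an end tag, so void elements such as <br> would be reported as unclosed; a real checker would need a list of such elements.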

An exception is defined as well:

exception HTMLParser.HTMLParseError

HTMLParser is able to handle broken markup, but in some cases it might raise this exception when it encounters an error while parsing. This exception provides three attributes: msg is a brief message explaining the error, lineno is the number of the line on which the broken construct was detected, and offset is the number of characters into the line at which the construct starts.

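19.1.1. Example HTML Parser Application

As a basic example, below is a simple HTML parser that uses the HTMLParser class to print out start tags, end tags and data as they are encountered: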

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

The output will then be:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

19.1.2. HTMLParser Methods

HTMLParser instances have the following methods:

HTMLParser.feed(data)

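Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called. data can be either unicode or str, but passing unicode is advised.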

HTMLParser.close()

Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the HTMLParser base class method close().
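As an illustration of redefining close(), the sketch below collects text while input is fed in pieces and only assembles the final string once close() is called. The class name TextCollector and its attributes parts and text are hypothetical; the only behavior taken from this section is that the override calls the base class close() first.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class TextCollector(HTMLParser):
    """Hypothetical helper: joins all text once the input is complete."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []
        self.text = None  # set by close()

    def handle_data(self, data):
        self.parts.append(data)

    def close(self):
        # Always let the base class flush its buffer first.
        HTMLParser.close(self)
        # Additional end-of-input processing.
        self.text = ''.join(self.parts)


text_collector = TextCollector()
for chunk in ['<p>Hello, ', '<b>wor', 'ld</b></p>']:
    text_collector.feed(chunk)
text_collector.close()
print(text_collector.text)
```

The text may reach handle_data() in several pieces, which is why the sketch joins them rather than relying on a single call.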

HTMLParser.reset()

Reset the instance. Loses all unprocessed data. This is called implicitly at instantiation time.

HTMLParser.getpos()

Return current line number and offset.
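A typical use of getpos() is to attach a source location to each event, for example when reporting problems in the input. The sketch below records the position at which each start tag begins; PositionRecorder is a hypothetical name used only for illustration.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class PositionRecorder(HTMLParser):
    """Hypothetical helper: remembers where each start tag was found."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []  # (tag, (lineno, offset)) pairs

    def handle_starttag(self, tag, attrs):
        self.positions.append((tag, self.getpos()))


recorder = PositionRecorder()
recorder.feed('<p>\n  <b>bold</b></p>')
recorder.close()
for tag, (lineno, offset) in recorder.positions:
    print("%s at line %d, offset %d" % (tag, lineno, offset))
```

In this sketch line numbers start at 1 and offsets start at 0, so the <b> tag on the second line, after two spaces, is reported at line 2, offset 2.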

HTMLParser.get_starttag_text()

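Return the text of the most recently opened start tag. This should not normally be needed for structured processing, but may be useful in dealing with HTML "as deployed" or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).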
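To make the difference between the parsed tag and its original text concrete, the sketch below stores both for each start tag. RawTagEcho is a hypothetical name; the input deliberately uses mixed case and irregular spacing.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class RawTagEcho(HTMLParser):
    """Hypothetical helper: keeps the source text of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw_tags = []  # (normalized tag name, original text) pairs

    def handle_starttag(self, tag, attrs):
        self.raw_tags.append((tag, self.get_starttag_text()))


echo = RawTagEcho()
echo.feed('<A   HREF="index.html"  Class=nav>home</A>')
echo.close()
print(echo.raw_tags)
```

The tag name arrives lowercased as 'a', while get_starttag_text() returns the tag exactly as written, including its case and spacing.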

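The following methods are called when data or markup elements are encountered and they are meant to be overridden in a subclass. The base class implementations do nothing (except for handle_startendtag()):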

HTMLParser.handle_starttag(tag, attrs)

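This method is called to handle the start of a tag (e.g. <div id="main">).

The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name will be translated to lower case, quotes in the value have been removed, and character and entity references have been replaced.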

For instance, for the tag <A HREF="http://www.cwi.nl/">, this method would be called as handle_starttag('a', [('href', 'http://www.cwi.nl/')]).

Changed in version 2.6: All entity references from htmlentitydefs are now replaced in the attribute values.
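Since attrs is a list of pairs, a common pattern is to convert it to a dict before looking up a particular attribute. The sketch below collects the href values of anchor tags; LinkCollector is a hypothetical name, and converting to a dict assumes that keeping only the last value of a repeated attribute is acceptable.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':  # tag names arrive lowercased
            return
        attributes = dict(attrs)  # attribute names are lowercased too
        href = attributes.get('href')
        if href is not None:
            self.links.append(href)


link_collector = LinkCollector()
link_collector.feed('<a href="a.html">A</a>'
                    '<A HREF="b.html?x=1&amp;y=2">B</A>'
                    '<a name="anchor">no link</a>')
link_collector.close()
print(link_collector.links)
```

The second link shows the entity replacement described above: the handler receives 'b.html?x=1&y=2', with &amp; already converted.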

HTMLParser.handle_endtag(tag)

This method is called to handle the end tag of an element (e.g. </div>).

The tag argument is the name of the tag converted to lower case.

HTMLParser.handle_startendtag(tag, attrs)

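Similar to handle_starttag(), but called when the parser encounters an XHTML-style empty tag (<img ... />). This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls handle_starttag() and handle_endtag().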
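The sketch below shows what overriding this method changes. EmptyTagSpotter is a hypothetical name; because its override does not call the base implementation, handle_starttag() and handle_endtag() are not invoked for XHTML-style empty tags.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class EmptyTagSpotter(HTMLParser):
    """Hypothetical helper: tells <br> apart from <br/>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_startendtag(self, tag, attrs):
        # Deliberately not calling the base method, so the two
        # handlers above do not fire for tags written as <tag/>.
        self.events.append(('startend', tag))


spotter = EmptyTagSpotter()
spotter.feed('<br><br/><img src="logo.png" />')
spotter.close()
print(spotter.events)
```

A subclass that wants both behaviors can record the lexical detail and then call HTMLParser.handle_startendtag(self, tag, attrs) to keep the default dispatch.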

HTMLParser.handle_data(data)

This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>).
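Because script and style content arrives through the same handler as ordinary text, a subclass that wants only visible text has to track which element it is in. The sketch below is one possible approach; VisibleTextExtractor and its attributes are hypothetical names.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: keeps text that is not inside <script> or <style>."""

    SKIPPED = ('script', 'style')

    def __init__(self):
        HTMLParser.__init__(self)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


extractor = VisibleTextExtractor()
extractor.feed('<p>Hello</p><script>var x = 1;</script><p> world</p>')
extractor.close()
visible_text = ''.join(extractor.chunks)
print(visible_text)
```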

HTMLParser.handle_entityref(name)

This method is called to process a named character reference of the form &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt').

HTMLParser.handle_charref(name)

This method is called to process decimal and hexadecimal numeric character references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method will receive '62' or 'x3E'.

HTMLParser.handle_comment(data)

This method is called when a comment is encountered (e.g. <!--comment-->).

For example, the comment <!-- comment --> will cause this method to be called with the argument ' comment '.

The content of Internet Explorer conditional comments (condcoms) will also be sent to this method, so, for <!--[if IE 9]>IE9-specific content<![endif]-->, this method will receive '[if IE 9]>IE9-specific content<![endif]'.

HTMLParser.handle_decl(decl)

This method is called to handle an HTML doctype declaration (e.g. <!DOCTYPE html>).

The decl parameter will be the entire contents of the declaration inside the <!...> markup (e.g. 'DOCTYPE html').

HTMLParser.handle_pi(data)

This method is called when a processing instruction is encountered. The data parameter will contain the entire processing instruction. For example, for the processing instruction <?proc color='red'>, this method would be called as handle_pi("proc color='red'").

Note

The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' will cause the '?' to be included in data.
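The effect described in the note can be seen by feeding both forms of processing instruction to a small collecting subclass. PICollector is a hypothetical name, and the xml-stylesheet instruction is only sample input.

```python
try:
    from HTMLParser import HTMLParser  # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class PICollector(HTMLParser):
    """Hypothetical helper: stores the data passed to handle_pi()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


pi_collector = PICollector()
pi_collector.feed("<?proc color='red'>"                    # SGML style
                  "<?xml-stylesheet href='site.css'?>")   # XHTML style
pi_collector.close()
print(pi_collector.instructions)
```

The second entry ends with '?', so code that handles XHTML input may want to strip a trailing '?' from data before interpreting it.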

HTMLParser.unknown_decl(data)

This method is called when an unrecognized declaration is read by the parser.

The data parameter will be the entire contents of the declaration inside the <![...]> markup. Overriding this method in a derived class is sometimes useful.

19.1.3. Examples

The following class implements a parser that will be used to illustrate more examples:

from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Start tag:", tag
        for attr in attrs:
            print "     attr:", attr
    def handle_endtag(self, tag):
        print "End tag  :", tag
    def handle_data(self, data):
        print "Data     :", data
    def handle_comment(self, data):
        print "Comment  :", data
    def handle_entityref(self, name):
        c = unichr(name2codepoint[name])
        print "Named ent:", c
    def handle_charref(self, name):
        if name.startswith('x'):
            c = unichr(int(name[1:], 16))
        else:
            c = unichr(int(name))
        print "Num ent  :", c
    def handle_decl(self, data):
        print "Decl     :", data

parser = MyHTMLParser()

Parsing a doctype:

>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
...             '"http://www.w3.org/TR/html4/strict.dtd">')
Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title:

>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
Start tag: img
     attr: ('src', 'python-logo.png')
     attr: ('alt', 'The Python logo')
>>>
>>> parser.feed('<h1>Python</h1>')
Start tag: h1
Data     : Python
End tag  : h1

The content of script and style elements is returned as is, without further parsing:

>>> parser.feed('<style type="text/css">#python { color: green }</style>')
Start tag: style
     attr: ('type', 'text/css')
Data     : #python { color: green }
End tag  : style
>>>
>>> parser.feed('<script type="text/javascript">'
...             'alert("<strong>hello!</strong>");</script>')
Start tag: script
     attr: ('type', 'text/javascript')
Data     : alert("<strong>hello!</strong>");
End tag  : script

Parsing comments:

>>> parser.feed('<!-- a comment -->'
...             '<!--[if IE 9]>IE-specific content<![endif]-->')
Comment  :  a comment
Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the correct char (note: these 3 references are all equivalent to '>'):

>>> parser.feed('&gt;&#62;&#x3E;')
Named ent: >
Num ent  : >
Num ent  : >

Feeding incomplete chunks to feed() works, but handle_data() might be called more than once:

>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
...     parser.feed(chunk)
...
Start tag: span
Data     : buff
Data     : ered
Data     : text
End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works:

>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
Start tag: p
Start tag: a
     attr: ('class', 'link')
     attr: ('href', '#main')
Data     : tag soup
End tag  : p
End tag  : a