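Parsing HTML and XHTML

The HTMLParser module
The usual way to parse an HTML file with the html.parser module is to define a
subclass of HTMLParser and override the handler methods for the markup you care
about: handle_starttag(), handle_data(), handle_endtag(), and so on. The parser
calls these methods as it meets start tags, text, and end tags in the input.
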
Example,
Extract the text inside the <head> tag of htmlsample.html,
<-- htmlsample.html --> file content,
'
<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.12.2</center>
</body>
</html>
'
from html.parser import HTMLParser

class ParsingHeadT(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headtag = ''
        self.parsesemaphore = False   # True while the parser is inside <head>

    def handle_starttag(self, tag, attrs):   # entering <head>: raise the flag
        if tag == 'head':
            self.parsesemaphore = True

    def handle_data(self, data):   # keep the last text seen inside <head>
        if self.parsesemaphore:
            self.headtag = data

    def handle_endtag(self, tag):   # leaving <head>: lower the flag
        if tag == 'head':
            self.parsesemaphore = False

    def getheadtag(self):
        return self.headtag

if __name__ == "__main__":
    with open('htmlsample.html') as fh:
        pht = ParsingHeadT()
        pht.feed(fh.read())   # feed() invokes the overridden handle_starttag(),
                              # handle_data() and handle_endtag()
        print("Head Tag : %s" % pht.getheadtag())

Output,
Head Tag : 404 Not Found

The example above is a small, well-formed HTML document. Real-world input raises
further cases that have to be handled, such as character references in the markup:
&copy; (the copyright sign ©), &amp; (the ampersand &), and so on.
Previously, handling them meant overriding the parent's handle_entityref() method,
HTMLParser.handle_entityref(name)
This method is called to process a named character reference of the form
&name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt').
This method is never called if convert_charrefs is True.

Numeric character references, in decimal or hexadecimal form, need the same care.
HTMLParser.handle_charref(name)
This method is called to process decimal and hexadecimal numeric character
references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent
for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method
will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.
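
As a minimal sketch of the two handlers described above: the parser below is
created with convert_charrefs=False so that handle_entityref() and
handle_charref() are actually called, and it records what each one receives.
RefCollector and its refs list are hypothetical names used only for illustration.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefCollector(HTMLParser):
    """Hypothetical helper: records every character reference it sees."""

    def __init__(self):
        # Without convert_charrefs=False the two handlers below never run.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_entityref(self, name):
        # name is the text between '&' and ';', e.g. 'gt' or 'copy'.
        self.refs.append(("entity", name, chr(name2codepoint[name])))

    def handle_charref(self, name):
        # name is '62' for &#62; and 'x3E' for &#x3E;.
        if name.startswith(("x", "X")):
            char = chr(int(name[1:], 16))
        else:
            char = chr(int(name))
        self.refs.append(("charref", name, char))


collector = RefCollector()
collector.feed("<p>&gt; &#62; &#x3E; &copy;</p>")
collector.close()
for kind, name, char in collector.refs:
    print(kind, name, char)
```

Each entry pairs the raw name passed to the handler with the character it
stands for, so 'gt', '62' and 'x3E' all map to '>'. An unknown entity name
would raise KeyError in this sketch; real code would need a fallback.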
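
Note,
Fortunately, Python 3 now handles these cases for us: convert_charrefs defaults to
True since Python 3.5, so character references are decoded before handle_data() is
called. Reusing the example above, add a few character references inside the
<head> tag of htmlsample.html and see what happens.
<-- htmlsample.html -->
<html>
<head><title>&gt; &#62; 404 &copy; Not &#x3E; Found &amp; </title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.12.2</center>
</body>
</html>
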
Output of the example above,
Head Tag : > > 404 © Not > Found &
The result shows that on Python 3 the example decodes the character references correctly.
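
The decoding is controlled by the convert_charrefs argument of HTMLParser. The
sketch below, which assumes an in-memory string instead of htmlsample.html and a
hypothetical HeadingText helper, parses the same markup with both settings to
show what handle_data() receives in each case.

```python
from html.parser import HTMLParser


class HeadingText(HTMLParser):
    """Hypothetical helper: collects the text chunks seen inside <h1>."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.in_heading = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.chunks.append(data)


def heading_chunks(markup, **kwargs):
    """Parse markup and return the chunks passed to handle_data()."""
    parser = HeadingText(**kwargs)
    parser.feed(markup)
    parser.close()
    return parser.chunks


markup = "<h1>404 &copy; Not &#62; Found</h1>"
decoded = heading_chunks(markup)                      # convert_charrefs=True (default)
raw = heading_chunks(markup, convert_charrefs=False)  # references go to their handlers

print(decoded)  # ['404 © Not > Found']
print(raw)      # ['404 ', ' Not ', ' Found']
```

With convert_charrefs=False the references are routed to handle_entityref() and
handle_charref(), which this sketch does not override, so they simply disappear
from the collected text.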

However, HTML also contains 'asymmetric' tags, such as <p> and <li>, whose end
tag is often omitted. The example above does not parse such tags correctly, so the
code has to be extended to handle them.
First extend htmlsample.html, taking the <li> tag as the example,
<-- htmlsample.html -->
<html>
<head><title>&gt; &#62; 404 &copy; Not &#x3E; Found &amp;</title>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.12.2</center>
<ul>
<li> First Reason
<li> Second Reason
</body>
</html>

A browser can still render htmlsample.html, even though its <head> and <ul> tags
have no matching end tag and <li> is used as an asymmetric tag. Now add some logic
to the previous example to handle these problems.

Example,
from html.parser import HTMLParser

class Parser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.taglevels = []   # start tags seen so far, in document order
        self.tags = ['head', 'ul', 'li']   # tags whose text is collected
        self.parsesemaphore = False   # True while text should be collected
        self.data = ''

    def handle_starttag(self, tag, attrs):
        # A start tag that repeats the previous start tag (e.g. <li> right
        # after <li>) implicitly closes the previous one.
        if self.taglevels and self.taglevels[-1] == tag:
            self.handle_endtag(tag)
        self.taglevels.append(tag)

        if tag in self.tags:
            self.parsesemaphore = True

    def handle_data(self, data):   # collect text while the flag is raised
        if self.parsesemaphore:
            self.data += data

    def handle_endtag(self, tag):   # any end tag stops the collection
        self.parsesemaphore = False

    def gettag(self):
        return self.data

if __name__ == "__main__":
    with open('htmlsample.html') as fh:
        pht = Parser()
        pht.feed(fh.read())   # feed() invokes the overridden handle_starttag(),
                              # handle_data() and handle_endtag()
        print("Head Tag : %s" % pht.gettag())

Output,
Head Tag : > > 404 © Not > Found &
First Reason
Second Reason
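
The parser above never removes entries from taglevels and treats every end tag as
a stop signal, which is enough for this sample. A slightly stricter sketch for the
<li> case, assuming flat (non-nested) lists and using the hypothetical names
ListItems, items and current, closes an open item when the next <li> starts, when
an enclosing element ends, or when the input ends.

```python
from html.parser import HTMLParser


class ListItems(HTMLParser):
    """Hypothetical helper: collects the text of each <li>, closed or not."""

    def __init__(self):
        super().__init__()
        self.items = []       # finished item texts
        self.current = None   # text chunks of the open <li>, or None

    def _close_item(self):
        if self.current is not None:
            self.items.append("".join(self.current).strip())
            self.current = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._close_item()   # a new <li> implicitly ends the previous one
            self.current = []

    def handle_endtag(self, tag):
        # </li> ends the item, and so does the end of an enclosing element.
        if tag in ("li", "ul", "ol", "body", "html"):
            self._close_item()

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def close(self):
        super().close()
        self._close_item()   # flush an item still open at end of input


parser = ListItems()
parser.feed("<ul>\n<li> First Reason\n<li> Second Reason\n</body>\n</html>")
parser.close()
print(parser.items)  # ['First Reason', 'Second Reason']
```

Nested lists would need a real stack of open items; this sketch deliberately
leaves that out.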

Reference,
https://docs.python.org/3.6/library/html.parser.html?highlight=htmlparse#html.parser.HTMLParser.handle_entityref

Appendix,
The example given by the Python documentation,
from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

    def handle_endtag(self, tag):
        print("End tag :", tag)

    def handle_data(self, data):
        print("Data :", data)

    def handle_comment(self, data):
        print("Comment :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent :", c)

    def handle_decl(self, data):
        print("Decl :", data)

# convert_charrefs defaults to True since Python 3.5, in which case
# handle_entityref() and handle_charref() are never called; pass False
# to get the "Named ent" and "Num ent" lines of the sessions in this appendix.
parser = MyHTMLParser(convert_charrefs=False)

Output,
Parsing a doctype:

>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
...             '"http://www.w3.org/TR/html4/strict.dtd">')
Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title:

>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
Start tag: img
 attr: ('src', 'python-logo.png')
 attr: ('alt', 'The Python logo')

>>> parser.feed('<h1>Python</h1>')
Start tag: h1
Data : Python
End tag : h1

The content of script and style elements is returned as is, without further parsing:

>>> parser.feed('<style type="text/css">#python { color: green }</style>')
Start tag: style
 attr: ('type', 'text/css')
Data : #python { color: green }
End tag : style

>>> parser.feed('<script type="text/javascript">'
...             'alert("<strong>hello!</strong>");</script>')
Start tag: script
 attr: ('type', 'text/javascript')
Data : alert("<strong>hello!</strong>");
End tag : script

Parsing comments:

>>> parser.feed('<!-- a comment -->'
...             '<!--[if IE 9]>IE-specific content<![endif]-->')
Comment : a comment
Comment : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the correct
char (note: these 3 references are all equivalent to '>'):

>>> parser.feed('&gt;&#62;&#x3E;')
Named ent: >
Num ent : >
Num ent : >

Feeding incomplete chunks to feed() works, but handle_data() might be called more
than once (unless convert_charrefs is set to True):

>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
...     parser.feed(chunk)
Start tag: span
Data : buff
Data : ered
Data : text
End tag : span

Parsing invalid HTML (e.g. unquoted attributes) also works:

>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
Start tag: p
Start tag: a
 attr: ('class', 'link')
 attr: ('href', '#main')
Data : tag soup
End tag : p
End tag : a
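
As the sessions above show, handle_starttag() receives attrs as a list of
(name, value) tuples, for quoted and unquoted values alike. A small sketch of
putting that to use, with LinkCollector and links as hypothetical names, turns the
list into a dict and keeps the href of every <a> tag.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)   # [('class', 'link'), ('href', '#main')] -> dict
        href = attr_map.get("href")
        if href is not None:
            self.links.append(href)


collector = LinkCollector()
collector.feed('<p><a class=link href=#main>tag soup</p ></a>')
collector.feed('<a href="https://www.python.org/">Python</a>')
collector.close()
print(collector.links)  # ['#main', 'https://www.python.org/']
```

If an attribute is repeated on one tag, dict() keeps only its last value; code
that needs every occurrence should iterate over attrs directly.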