xpath
Chrome extension XPath Helper: toggle with Ctrl + Shift + X
Find elements on a page in Chrome: Ctrl + F
bytes -> decode('utf-8') -> str
str -> encode('utf-8') -> bytes
Copy part of a page's source: Elements panel -> right-click the tag -> Copy -> Copy outerHTML
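The bytes/str conversion noted above can be checked with a short round trip. This is only a minimal sketch; the sample string and the names `sample`, `raw`, and `back` are made up for illustration.

```python
# Round trip between str and bytes using UTF-8.
sample = "第一项 first item"        # hypothetical sample text with non-ASCII characters

raw = sample.encode("utf-8")       # str -> bytes
back = raw.decode("utf-8")         # bytes -> str

print(type(raw).__name__)          # the encoded value is a bytes object
print(type(back).__name__)         # the decoded value is a str again
print(back == sample)              # the round trip preserves the text
```

Decoding with the wrong codec name (for example a misspelled `urf-8`) raises `LookupError`, so the codec string has to be spelled exactly.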
from lxml import etree
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> # note: a closing </li> tag is missing here
</ul>
</div>
'''
html = etree.HTML(text)  # returns an element such as <Element html at 0x1d515915c08>; etree.HTML() uses the HTML parser, which repairs malformed HTML automatically
html = etree.tostring(html, encoding='utf-8')  # serialize the repaired HTML as bytes; without encoding='utf-8', non-ASCII characters are not written out as readable text
print(html.decode('utf-8'))  # decode the bytes to str
Output:
<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> # note: a closing </li> tag is missing here
</ul>
</div>
</body></html>
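Once the text has been parsed, the tree can be queried with XPath. The following is a minimal sketch rather than part of the original notes: the shortened markup and the names `sample`, `tree`, `hrefs`, `item0_texts`, and `li_count` are assumptions chosen for illustration.

```python
from lxml import etree

# Hypothetical sample markup; the last <li> is deliberately left unclosed.
sample = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

tree = etree.HTML(sample)  # the HTML parser closes the dangling <li>

# Attribute values of every <a> that is a direct child of an <li>
hrefs = tree.xpath('//li/a/@href')

# Link text of the <li> elements whose class is exactly "item-0"
item0_texts = tree.xpath('//li[@class="item-0"]/a/text()')

# Number of <li> elements in the repaired tree
li_count = len(tree.xpath('//li'))

print(hrefs)
print(item0_texts)
print(li_count)
```

The same expressions can be tried in the XPath Helper extension first and then pasted into `xpath()`.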
from lxml import etree
parser = etree.HTMLParser(encoding='utf-8')  # create an HTML parser, which repairs malformed markup
html = etree.parse('./test.html', parser=parser)  # pass in the HTML parser; without it, etree.parse() uses the XML parser
html = etree.tostring(html, encoding='utf-8')  # serialize as UTF-8 bytes so non-ASCII characters stay readable
print(html.decode('utf-8'))  # decode the bytes to str
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> # note: a closing </li> tag is missing here
</ul>
</div></body></html>
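The effect of the `encoding` argument to `etree.tostring()` when parsing from a file can be seen in a small experiment. This is a sketch under stated assumptions: the temporary file, its contents, and the names `markup`, `ascii_out`, and `utf8_out` are hypothetical and stand in for the real `test.html`.

```python
import os
import tempfile

from lxml import etree

# Hypothetical file contents containing non-ASCII text.
markup = '<div><p>注意 note</p></div>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(markup)

    # Parse the file with an explicit HTML parser instead of the default XML parser.
    parser = etree.HTMLParser(encoding='utf-8')
    tree = etree.parse(path, parser=parser)

# Default serialization is ASCII: non-ASCII characters become numeric character references.
ascii_out = etree.tostring(tree).decode('ascii')

# With encoding='utf-8' the characters are written out directly.
utf8_out = etree.tostring(tree, encoding='utf-8').decode('utf-8')

print(ascii_out)
print(utf8_out)
```

Both strings describe the same document; only the way non-ASCII characters are written differs.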
Check the local machine's IP address: run ipconfig in cmd
posted on 2018-06-14 19:22 shiqinying