xpath

Chrome extension XPath Helper: shortcut Ctrl + Shift + X

Find elements on a page in Chrome: Ctrl + F

bytes >> decode('utf-8') >> str

str >> encode('utf-8') >> bytes
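
A minimal sketch of the round trip between the two types noted above. The sample string is an arbitrary assumption chosen for illustration; any text works the same way.

```python
# Round trip between str and bytes using UTF-8.
# The sample text is an illustrative assumption, not from the original notes.
text = "xpath 示例"

encoded = text.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(type(encoded).__name__, encoded)
print(type(decoded).__name__, decoded)
```

Each Chinese character takes three bytes in UTF-8, which is why the bytes object is longer than the string.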

Copy part of a page's source code: Elements panel >> right-click the tag >> Copy >> Copy outerHTML

 

 

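from lxml import etree

# The last <li> in the sample carries a note in Chinese meaning
# "note: a closing </li> tag is missing here".
# The literal </li> inside that note is itself parsed as the closing tag,
# so the visible repair in the output is the added <html> and <body> wrapper.
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> # 注意,此处缺少一个 </li> 闭合标签
</ul>
</div>
'''
html = etree.HTML(text)  # Returns an Element, e.g. <Element html at 0x1d515915c08>; etree.HTML() uses the HTML parser, which automatically repairs malformed HTML
html = etree.tostring(html, encoding='utf-8')  # Serialize the repaired HTML as bytes; encoding='utf-8' keeps the Chinese text from coming out garbled
print(html.decode('utf-8'))  # Decode the bytes to str for printing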

<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> # 注意,此处缺少一个 </li> 闭合标签
</ul>
</div>
</body></html>
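
Once the text is parsed, the resulting element can be queried with XPath. The following is a minimal sketch, assuming a shortened copy of the sample list; the variable names (`root`, `link_texts`, `item1_hrefs`, `li_count`) are illustrative and not part of the original notes.

```python
from lxml import etree

# Shortened copy of the sample list (an assumption made to keep the sketch short).
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
</div>
'''
root = etree.HTML(text)

# Text of every <a> that is a direct child of an <li>
link_texts = root.xpath('//li/a/text()')
# href of links whose parent <li> has class "item-1"
item1_hrefs = root.xpath('//li[@class="item-1"]/a/@href')
# Number of <li> elements in the document
li_count = len(root.xpath('//li'))

print(link_texts)
print(item1_hrefs)
print(li_count)
```

The same expressions can be tried interactively in XPath Helper before being put into code.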

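from lxml import etree
parser = etree.HTMLParser(encoding='utf-8')  # Create an HTML parser, which repairs malformed markup
html = etree.parse('./test.html', parser=parser)  # Pass the HTML parser explicitly; without it, etree.parse() uses the XML parser
html = etree.tostring(html)  # No encoding given, so non-ASCII characters are serialized as numeric character references
print(html.decode('utf-8'))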

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>&#13;
<ul>&#13;
<li class="item-0"><a href="link1.html">first item</a></li>&#13;
<li class="item-1"><a href="link2.html">second item</a></li>&#13;
<li class="item-inactive"><a href="link3.html">third item</a></li>&#13;
<li class="item-1"><a href="link4.html">fourth item</a></li>&#13;
<li class="item-0"><a href="link5.html">fifth item</a> # &#230;&#179;&#168;&#230;&#132;&#143;&#239;&#188;&#140;&#230;&#173;&#164;&#229;&#164;&#132;&#231;&#188;&#186;&#229;&#176;&#145;&#228;&#184;&#128;&#228;&#184;&#170; </li> &#233;&#151;&#173;&#229;&#144;&#136;&#230;&#160;&#135;&#231;&#173;&#190;&#13;
</ul>&#13;
</div></body></html>
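
The numeric character references in the output above come from serializing without an explicit encoding. The sketch below contrasts the two modes. It assumes an in-memory document in place of ./test.html so that it runs without a file; `source`, `ascii_out`, and `utf8_out` are illustrative names.

```python
from io import BytesIO
from lxml import etree

# In-memory stand-in for ./test.html (an assumption so the sketch runs without a file).
source = '<div><p>中文 text</p></div>'.encode('utf-8')

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(BytesIO(source), parser=parser)

# Default serialization is ASCII, so non-ASCII characters become numeric references.
ascii_out = etree.tostring(tree).decode('utf-8')
# Passing encoding='utf-8' keeps the characters readable.
utf8_out = etree.tostring(tree, encoding='utf-8').decode('utf-8')

print(ascii_out)
print(utf8_out)
```

Passing `encoding='utf-8'` to `etree.tostring()`, as in the first example, is usually the simpler choice when the page contains Chinese text.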

Check the local IP address: open cmd and run ipconfig

 

posted on 2018-06-14 19:22  shiqinying
