Python Web Scraping #004: XPath
# Sample data as a str, assigned to books
books = '''
<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book category="cooking">
<title lang ="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang ="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang ="en">XQuery Kick Start</title>
<author>James</author>
<year>2001</year>
<price>40</price>
</book>
<book category="web1" cover="paperback">
<title lang ="en">Liu</title>
<author>Feng Feng</author>
<year>1998</year>
<price>99.99</price>
</book>
<book category="web2" cover="paperback">
<title lang ="en">wen</title>
<author>Feng</author>
<year>0820</year>
<price>99</price>
</book>
</bookstore>
'''
4.1-Converting to an HTML tree
from lxml import etree
# Use etree.HTML to parse books into an HTML element tree and assign it to html (any variable name works)
html = etree.HTML(books)
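# tostring serializes the tree back to HTML markup. Tip: the printed output differs from books:
# the parser wraps the content in <html><body>, drops the XML declaration, and normalizes lang ="en" to lang="en"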
print(etree.tostring(html,encoding='utf-8').decode('utf-8'))
<html><body><bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James</author>
<year>2001</year>
<price>40</price>
</book>
<book category="web1" cover="paperback">
<title lang="en">Liu</title>
<author>Feng Feng</author>
<year>1998</year>
<price>99.99</price>
</book>
<book category="web2" cover="paperback">
<title lang="en">wen</title>
<author>Feng</author>
<year>0820</year>
<price>99</price>
</book>
</bookstore>
</body></html>
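The extra `<html><body>` wrapper above comes from the HTML parser. As a minimal sketch for comparison, the same kind of markup can also be parsed with lxml's XML parser, in which case `bookstore` itself is the root and absolute paths start there. The shortened `sample` string and the variable names below are assumptions made for illustration, not part of the original data.

```python
from lxml import etree

# Shortened, illustrative sample (no XML declaration, so str input is fine)
sample = '''
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<price>30.00</price>
</book>
</bookstore>
'''

as_html = etree.HTML(sample)        # lenient HTML parser, adds <html><body>
as_xml = etree.fromstring(sample)   # strict XML parser, keeps the markup as is

print(as_html.tag)  # root element produced by the HTML parser
print(as_xml.tag)   # root element produced by the XML parser

# The absolute path depends on which parser built the tree
html_titles = as_html.xpath('/html/body/bookstore/book/title/text()')
xml_titles = as_xml.xpath('/bookstore/book/title/text()')
print(html_titles, xml_titles)
```

Which parser to use depends on the input: real web pages are rarely well-formed XML, so the HTML parser is usually the safer choice when scraping.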
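4.2-Finding tag content by node path
Method 1: follow the exact parent-child path
# Walk the full parent-child path down to title; text() selects the text content
# The result is a list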
result = html.xpath('/html/body/bookstore/book/title/text()')
print(result)
>>>:['Everyday Italian', 'Harry Potter', 'XQuery Kick Start', 'Liu', 'wen']
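Writing a full path by hand is error-prone on deep documents. As a rough sketch, lxml can report the absolute path of an element that was found some other way, via `getpath` on the element tree. The two-book `sample` here is a shortened, illustrative copy of the data.

```python
from lxml import etree

# Shortened, illustrative sample
sample = '''
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
</book>
</bookstore>
'''
html = etree.HTML(sample)

# Pick one element with a loose query, then ask lxml for its absolute path
title = html.xpath('//title')[0]
path = html.getroottree().getpath(title)
print(path)

# Feeding the generated path back into xpath() reaches the same element
print(html.xpath(path + '/text()'))
```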
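Method 2: skip the exact parent-child path
# No need to spell out every ancestor
# A leading // searches the whole document at any depth, so this matches
# every title that is a direct child of a book, wherever that book sits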
result = html.xpath('//book/title/text()')
print(result)
>>>:['Everyday Italian', 'Harry Potter', 'XQuery Kick Start', 'Liu', 'wen']
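One detail worth knowing: a leading `//` always starts from the document root, even when `xpath()` is called on a sub-element. To search only below the element at hand, start the expression with `.//`. A minimal sketch, using a shortened sample and illustrative variable names:

```python
from lxml import etree

# Shortened, illustrative sample
sample = '''
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
</book>
</bookstore>
'''
html = etree.HTML(sample)

second_book = html.xpath('//book')[1]

# Leading // ignores the context element and searches the whole document
from_root = second_book.xpath('//title/text()')
# Leading .// searches only inside second_book
from_here = second_book.xpath('.//title/text()')

print(from_root)
print(from_here)
```

This matters when looping over items on a page: inside the loop, `.//` keeps each query scoped to the current item.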
4.3-Finding tag content by condition
Method 1: positional index
# XPath positions start at 1, not 0
result = html.xpath('//book[1]/title/text()')
print(result)
>>>:['Everyday Italian']
# Positions in [1, 3), i.e. the first and second book
result = html.xpath('//book[position() < 3]/title/text()')
print(result)
>>>: ['Everyday Italian', 'Harry Potter']
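A positional predicate such as `[1]` counts among siblings under the same parent, not across the whole document. In the data above every book shares one parent, so the difference is invisible. The sketch below uses a hypothetical `<shelf>` wrapper (not in the original data) to make it visible, and also shows `last()`.

```python
from lxml import etree

# Hypothetical layout: books grouped under two <shelf> elements
sample = '''
<bookstore>
<shelf>
<book><title>Everyday Italian</title></book>
<book><title>Harry Potter</title></book>
</shelf>
<shelf>
<book><title>XQuery Kick Start</title></book>
</shelf>
</bookstore>
'''
html = etree.HTML(sample)

# First book under each parent: one hit per shelf
first_per_parent = html.xpath('//book[1]/title/text()')
# Parentheses build the full node set first, then take its first member
first_overall = html.xpath('(//book)[1]/title/text()')

# last() works the same way: last book per shelf vs. last book overall
last_per_parent = html.xpath('//book[last()]/title/text()')
last_overall = html.xpath('(//book)[last()]/title/text()')

print(first_per_parent, first_overall)
print(last_per_parent, last_overall)
```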
Method 2: by tag attributes
# Exact match on several attributes at once; note how the attributes of books 3, 4, and 5 differ
result = html.xpath('//book[@category="web1"][@cover="paperback"]/title/text()')
print(result)
>>> :['Liu']
# contains() is a substring (fuzzy) match: select books whose category contains "web" and return the category values
result = html.xpath('//book[contains(@category,"web")]/@category')
print(result)
>>>: ['web', 'web1', 'web2']
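Attribute predicates can be combined in other ways too. The following sketch shows `starts-with()`, `not()`, and `and` from XPath 1.0. The fourth book, with category "cobweb", is a made-up entry added only to show where `contains()` and `starts-with()` disagree.

```python
from lxml import etree

# Shortened sample; the "cobweb" book is hypothetical
sample = '''
<bookstore>
<book category="cooking"><title>Everyday Italian</title></book>
<book category="web"><title>XQuery Kick Start</title></book>
<book category="web1" cover="paperback"><title>Liu</title></book>
<book category="cobweb"><title>Spiders</title></book>
</bookstore>
'''
html = etree.HTML(sample)

# contains() matches "web" anywhere in the value
anywhere = html.xpath('//book[contains(@category, "web")]/@category')
# starts-with() matches only at the beginning of the value
prefix = html.xpath('//book[starts-with(@category, "web")]/@category')

# not(@cover) keeps books that have no cover attribute at all
no_cover = html.xpath('//book[not(@cover)]/title/text()')

# "and" inside one predicate is equivalent to chaining two predicates
both = html.xpath('//book[@category="web1" and @cover="paperback"]/title/text()')

print(anywhere, prefix)
print(no_cover, both)
```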
# Filter on the value of the price child element and return the titles of the matching books
result = html.xpath('//book[price < 50]/title/text()')
print(result)
>>>: ['Everyday Italian', 'Harry Potter', 'XQuery Kick Start']
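In a scraper the matched nodes usually end up as structured records rather than a flat list of strings. One possible way to do that, sketched here with a shortened sample and an illustrative `records` list, is to select the book elements first and then read each field relative to the book.

```python
from lxml import etree

# Shortened, illustrative sample
sample = '''
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<price>30.00</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<price>40</price>
</book>
<book category="web1" cover="paperback">
<title lang="en">Liu</title>
<price>99.99</price>
</book>
</bookstore>
'''
html = etree.HTML(sample)

records = []
# XPath converts the price text to a number for the < comparison
for book in html.xpath('//book[price < 50]'):
    records.append({
        'category': book.get('category'),          # attribute value
        'title': book.findtext('title'),           # text of the title child
        'price': float(book.findtext('price')),    # convert to float in Python
    })

print(records)
```

Selecting the parent element once and reading fields from it keeps the title and price of each book together, which separate flat queries do not guarantee when a field is missing.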
This article is from cnblogs (博客园), author: {枫_Null}. When reposting, please cite the original link: https://www.cnblogs.com/fengNull/articles/15488755.html
