Python Web Scraping #004: XPath

# A block of str data, assigned to books
books = '''
<?xml version="1.0" encoding="utf-8"?>
<bookstore>
    <book category="cooking">
        <title lang ="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
    </book>
    <book category="children">
        <title lang ="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="web">
        <title lang ="en">XQuery Kick Start</title>
        <author>James</author>
        <year>2001</year>
        <price>40</price>
    </book>
    <book category="web1" cover="paperback">
        <title lang ="en">Liu</title>
        <author>Feng Feng</author>
        <year>1998</year>
        <price>99.99</price>
    </book>
    <book category="web2" cover="paperback">
        <title lang ="en">wen</title>
        <author>Feng</author>
        <year>0820</year>
        <price>99</price>
    </book>
</bookstore>
'''

4.1-Converting to HTML

from lxml import etree

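# Use etree.HTML to parse books into an HTML element tree and assign it to html (the variable name is arbitrary)
html = etree.HTML(books)
# tostring serializes the tree back to markup. Tip: the printed content differs from books:
# the parser adds <html> and <body> wrappers and drops the XML declaration
print(etree.tostring(html, encoding='utf-8').decode('utf-8'))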
<html><body><bookstore>
    <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
    </book>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="web">
        <title lang="en">XQuery Kick Start</title>
        <author>James</author>
        <year>2001</year>
        <price>40</price>
    </book>
    <book category="web1" cover="paperback">
        <title lang="en">Liu</title>
        <author>Feng Feng</author>
        <year>1998</year>
        <price>99.99</price>
    </book>
    <book category="web2" cover="paperback">
        <title lang="en">wen</title>
        <author>Feng</author>
        <year>0820</year>
        <price>99</price>
    </book>
</bookstore>
</body></html>
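
To make the difference between the input and the printed tree concrete, here is a minimal sketch. The `snippet` string is a hypothetical, shortened stand-in for `books`, and `encoding='unicode'` is offered as an alternative to the `.decode('utf-8')` call above, not as what the original code does.

```python
from lxml import etree

# Hypothetical, shortened sample in the same shape as `books`.
snippet = '<bookstore><book category="web"><title lang="en">Liu</title></book></bookstore>'

root = etree.HTML(snippet)

# The HTML parser adds the <html> and <body> wrappers on its own,
# which is why absolute paths must start with /html/body/...
print(root.tag)        # html
print(root[0].tag)     # body
print(root[0][0].tag)  # bookstore

# encoding='unicode' makes tostring return a str directly,
# so no separate .decode() step is needed.
serialized = etree.tostring(root, encoding='unicode')
print(type(serialized).__name__)  # str
```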

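4.2-Finding tag content by node path

Method 1: strict parent-child path

# Look up title by following the full parent-child path; text() selects the text content
# The result is a list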
result = html.xpath('/html/body/bookstore/book/title/text()')
print(result)
>>>:['Everyday Italian', 'Harry Potter', 'XQuery Kick Start', 'Liu', 'wen']

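Method 2: without a strict parent-child path

# No need to spell out the full parent-child path
# A leading // matches nodes at any depth, starting from the document root (here the <html> root node),
# so this finds every title that is a direct child of a book anywhere in the document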
result = html.xpath('//book/title/text()')
print(result)
>>>:['Everyday Italian', 'Harry Potter', 'XQuery Kick Start', 'Liu', 'wen']
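
One thing worth checking here is what `//` does when `xpath` is called on a sub-element rather than on the root. The sketch below uses a hypothetical two-book `snippet`; the names `first_book`, `from_root`, and `from_here` are illustrative only.

```python
from lxml import etree

# Hypothetical, shortened sample in the same shape as `books`.
snippet = '''
<bookstore>
    <book><title>Everyday Italian</title></book>
    <book><title>Harry Potter</title></book>
</bookstore>
'''
html = etree.HTML(snippet)
first_book = html.xpath('//book')[0]

# A leading // starts at the document root, even when called on a sub-element.
from_root = first_book.xpath('//title/text()')
# A leading .// starts at the element the call is made on.
from_here = first_book.xpath('.//title/text()')

print(from_root)  # ['Everyday Italian', 'Harry Potter']
print(from_here)  # ['Everyday Italian']
```

When looping over elements and querying inside each one, `.//` is usually the form you want.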

4.3-Finding tag content by condition

Method 1: positional index

# XPath indexes start at 1
result = html.xpath('//book[1]/title/text()')
print(result)
>>>:['Everyday Italian']

# position() < 3 selects positions 1 and 2, i.e. the half-open range [1, 3)
result = html.xpath('//book[position() < 3]/title/text()')
print(result)

>>>: ['Everyday Italian', 'Harry Potter']
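
A positional predicate counts among siblings under the same parent, not across the whole result. With a single `<bookstore>` parent the two coincide, so the sketch below uses a hypothetical document with two `<shelf>` parents (not part of the original data) to show the difference, along with `last()` and the equivalent 0-based Python index.

```python
from lxml import etree

# Hypothetical sample with two parents, to show how positions are counted.
snippet = '''
<store>
    <shelf>
        <book><title>A</title></book>
        <book><title>B</title></book>
    </shelf>
    <shelf>
        <book><title>C</title></book>
    </shelf>
</store>
'''
html = etree.HTML(snippet)

# book[1] means "first book child of its parent", so each shelf contributes one.
first_per_parent = html.xpath('//book[1]/title/text()')
# Parentheses apply the index to the whole result set instead.
first_overall = html.xpath('(//book)[1]/title/text()')
# last() selects the final book under each parent.
last_per_parent = html.xpath('//book[last()]/title/text()')
# Python lists are 0-based, so [0] here corresponds to (//book)[1].
python_first = html.xpath('//book')[0].findtext('title')

print(first_per_parent)  # ['A', 'C']
print(first_overall)     # ['A']
print(last_per_parent)   # ['B', 'C']
print(python_first)      # A
```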

Method 2: by tag attributes

# Exact match on multiple conditions; note how the attributes of books 3, 4, and 5 differ
result = html.xpath('//book[@category="web1"][@cover="paperback"]/title/text()')
print(result)

>>>: ['Liu']

# contains() does a substring (fuzzy) match: select books whose category attribute contains "web" and return that attribute's value
result = html.xpath('//book[contains(@category,"web")]/@category')
print(result)

>>>: ['web', 'web1', 'web2']
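
Two variations on the attribute predicates may be useful. Chained predicates can also be written with `and`, and `starts-with()` is stricter than `contains()` when only a prefix should match. In the sketch below the `cobweb` category and the `hardcover` value are made up for illustration and do not appear in the original data.

```python
from lxml import etree

# Hypothetical sample; "cobweb" and "hardcover" are invented for the demo.
snippet = '''
<bookstore>
    <book category="web" cover="hardcover"><title>XQuery Kick Start</title></book>
    <book category="web1" cover="paperback"><title>Liu</title></book>
    <book category="cobweb" cover="paperback"><title>Spiders</title></book>
</bookstore>
'''
html = etree.HTML(snippet)

# Chained predicates and a single predicate joined with `and` select the same nodes.
chained = html.xpath('//book[@category="web1"][@cover="paperback"]/title/text()')
with_and = html.xpath('//book[@category="web1" and @cover="paperback"]/title/text()')

# contains() matches the substring anywhere in the attribute value...
anywhere = html.xpath('//book[contains(@category, "web")]/@category')
# ...while starts-with() only matches it at the beginning.
prefix_only = html.xpath('//book[starts-with(@category, "web")]/@category')

print(chained)      # ['Liu']
print(with_and)     # ['Liu']
print(anywhere)     # ['web', 'web1', 'cobweb']
print(prefix_only)  # ['web', 'web1']
```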

# Filter on the value of the price child element and return the titles of the matching books
result = html.xpath('//book[price < 50]/title/text()')
print(result)

>>>: ['Everyday Italian', 'Harry Potter', 'XQuery Kick Start']
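
In a predicate like `price < 50`, XPath converts the element's text to a number before comparing, and text that is not numeric becomes NaN, which never satisfies the comparison. The sketch below illustrates this with a hypothetical `TBD` price, combines two child-element conditions with `and`, and then turns the matched books into Python records; `records` and the other variable names are illustrative only.

```python
from lxml import etree

# Hypothetical sample; the "No Price Yet" book with price "TBD" is invented.
snippet = '''
<bookstore>
    <book><title>Harry Potter</title><year>2005</year><price>29.99</price></book>
    <book><title>XQuery Kick Start</title><year>2001</year><price>40</price></book>
    <book><title>Liu</title><year>1998</year><price>99.99</price></book>
    <book><title>No Price Yet</title><year>2024</year><price>TBD</price></book>
</bookstore>
'''
html = etree.HTML(snippet)

# "TBD" converts to NaN, so that book is silently skipped.
cheap = html.xpath('//book[price < 50]/title/text()')

# Conditions on different child elements can be joined with `and`.
cheap_and_recent = html.xpath('//book[price < 50 and year > 2002]/title/text()')

# Select the book elements themselves, then read their children in Python.
records = [
    {'title': book.findtext('title'), 'price': float(book.findtext('price'))}
    for book in html.xpath('//book[price < 50]')
]

print(cheap)             # ['Harry Potter', 'XQuery Kick Start']
print(cheap_and_recent)  # ['Harry Potter']
print(records)
```

Selecting the `<book>` elements first and reading fields from each one keeps the title and price of the same book together, which separate `//title` and `//price` queries would not guarantee if some books lacked a field.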

posted @ 2023-06-28 22:54  枫_Null