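Learning Python Web Scraping Step by Step (Parsing with XPath)
Parsing with XPath
XPath is a language for locating information in XML documents; it can traverse both elements and attributes. HTML is not strictly a subset of XML, but it has the same kind of tree structure, and lxml can parse an HTML page into the same element tree it builds for XML. XPath can therefore be used to find content in HTML as well.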
I. Installing the lxml module
pip install lxml
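To confirm that the installation worked, a quick check along these lines can be run. It is only a sketch: it imports the module and prints the version constants that lxml exposes.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
lxml_version = etree.LXML_VERSION
# LIBXML_VERSION is the version of the underlying libxml2 C library
libxml_version = etree.LIBXML_VERSION

print("lxml:", lxml_version)
print("libxml2:", libxml_version)
```

If the import fails, the package was most likely installed into a different Python environment than the one running the script.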
Usage:
1. Build an etree object from the HTML content to be parsed.
2. Call the etree object's xpath method with an XPath expression to extract the data.
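For step 1, lxml offers more than one constructor, and the choice matters. The sketch below uses a made-up snippet (not from the original article) to show the difference: etree.HTML uses a forgiving HTML parser that repairs unclosed tags, while etree.XML insists on well-formed input and raises XMLSyntaxError otherwise.

```python
from lxml import etree

# Hypothetical snippet: the <li> tags are never closed, as often happens in real pages
snippet = '<ul><li>apple<li>banana</ul>'

# Step 1: build the tree; the HTML parser repairs the markup and wraps it in <html><body>
html_tree = etree.HTML(snippet)

# Step 2: extract data with an XPath expression
items = html_tree.xpath('//li/text()')
print(items)

# The strict XML parser rejects the same input
try:
    etree.XML(snippet)
    xml_ok = True
except etree.XMLSyntaxError as exc:
    xml_ok = False
    print('XML parser error:', exc)
```

For pages fetched from the web, etree.HTML is usually the safer choice; etree.XML suits input that is known to be well-formed.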
A simple example:
from lxml import etree
xml='''
<book>
<id>1</id>
<name>野花遍地香</name>
<price>1.23</price>
<nick>臭豆腐</nick>
<author>
<nick id="10086">周大强</nick>
<nick id="10010">周芷若</nick>
<nick class="joy">周杰伦</nick>
<nick class="jolin">蔡依林</nick>
<div>
<nick>热了</nick>
</div>
<span>
<nick>热了哦</nick>
</span>
</author>
<partner>
<nick id="ppc">胖胖陈</nick>
<nick id="ppbc">胖胖不陈</nick>
</partner>
</book>
'''
tree = etree.XML(xml)
res = tree.xpath('/book/name/text()')  # text() extracts the text content
print(res)
# ['野花遍地香']
res = tree.xpath('/book/author/nick/text()')  # / selects direct children only
print(res)
# ['周大强', '周芷若', '周杰伦', '蔡依林']
res = tree.xpath('/book/author//nick/text()')  # // selects descendants at any depth
print(res)
# ['周大强', '周芷若', '周杰伦', '蔡依林', '热了', '热了哦']
res = tree.xpath('/book/author/*/nick/text()')  # * matches any single element
print(res)
# ['热了', '热了哦']
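The sample XML also carries id and class attributes that the queries above never use. As a small sketch on a trimmed copy of the same data, attribute predicates, the last() function, and @attribute selection could be applied like this:

```python
from lxml import etree

# Trimmed copy of the sample document, keeping only the nodes used here
xml = '''
<book>
<author>
<nick id="10086">周大强</nick>
<nick id="10010">周芷若</nick>
<nick class="joy">周杰伦</nick>
<nick class="jolin">蔡依林</nick>
</author>
<partner>
<nick id="ppc">胖胖陈</nick>
</partner>
</book>
'''
tree = etree.XML(xml)

# [@id="..."] keeps only nodes whose id attribute equals the given value
by_id = tree.xpath('//nick[@id="10086"]/text()')
# [@class] keeps nodes that have a class attribute, whatever its value
with_class = tree.xpath('/book/author/nick[@class]/text()')
# [last()] selects the last nick child of author
last_nick = tree.xpath('/book/author/nick[last()]/text()')
# @id returns attribute values instead of text
partner_ids = tree.xpath('/book/partner/nick/@id')

print(by_id, with_class, last_nick, partner_ids)
```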
Example 2:
Suppose we have an HTML file named 1.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>Title</title>
</head>
<body>
<ul>
<li><a href="http://www.baidu.com">百度</a></li>
<li><a href="http://www.google.com">谷歌</a></li>
<li><a href="http://www.sogou.com">搜狗</a></li>
</ul>
<ol>
<li><a href="feiji">飞机</a></li>
<li><a href="dapao">大炮</a></li>
<li><a href="huoche">火车</a></li>
</ol>
<div class="job">李嘉诚</div>
<div class="common">胡辣汤</div>
</body>
</html>
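Parse it as follows. Note that etree.parse uses the XML parser by default, so it works here only because 1.html is well-formed (for example, the meta tag is self-closed); for less tidy HTML, pass etree.HTMLParser() as the parser argument: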
from lxml import etree
tree = etree.parse('1.html')
result = tree.xpath('/html/body/ul/li/a/text()')
print(result)
# ['百度', '谷歌', '搜狗']
result = tree.xpath('/html/body/ul/li[2]/a/text()')  # XPath positions start at 1, not 0
print(result)
# ['谷歌']
result = tree.xpath('/html/body/ol/li/a[@href="dapao"]/text()')  # [@attr="value"] filters by attribute
print(result)
# ['大炮']
ol_li_list = tree.xpath('/html/body/ol/li')
for li in ol_li_list:
    res = li.xpath('./a/text()')  # keep searching inside li (relative path)
    print(res)
    res2 = li.xpath('./a/@href')  # @attr returns the attribute value
    print(res2)
# Output, one pair per li:
# ['飞机']
# ['feiji']
# ['大炮']
# ['dapao']
# ['火车']
# ['huoche']
print(tree.xpath('/html/body/ul/li/a/@href'))
# ['http://www.baidu.com', 'http://www.google.com', 'http://www.sogou.com']
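In practice the link text and the href usually need to stay paired. The following is a minimal sketch on an inline, shortened copy of the page (so it runs without 1.html); it collects each link into a dict and also queries the div elements by class, which the example above does not touch.

```python
from lxml import etree

# Shortened inline copy of 1.html so the sketch is self-contained
html_text = '''
<html><body>
<ul>
<li><a href="http://www.baidu.com">百度</a></li>
<li><a href="http://www.sogou.com">搜狗</a></li>
</ul>
<div class="job">李嘉诚</div>
<div class="common">胡辣汤</div>
</body></html>
'''
tree = etree.HTML(html_text)

# xpath() returns element objects here, so .text and .get() can be used directly
links = []
for a in tree.xpath('//ul/li/a'):
    links.append({'text': a.text, 'href': a.get('href')})
print(links)

# Filter by an exact attribute value
job = tree.xpath('//div[@class="job"]/text()')
# contains() matches part of an attribute value
sogou = tree.xpath('//a[contains(@href, "sogou")]/text()')
print(job, sogou)
```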
Example 3: Scraping information from zbj.com (猪八戒网)
import requests
from lxml import etree
url = 'https://beijing.zbj.com/search/f/?type=new&kw=前端开发'
resp = requests.get(url)
# Parse the response
html = etree.HTML(resp.text)
# Absolute path tied to the page layout; it will need updating if the site changes
divs = html.xpath('/html/body/div[6]/div/div/div[2]/div[4]/div[1]/div')
# Each div holds one service provider's information
for div in divs:
    price = div.xpath("./div/div/a/div[2]/div[1]/span[1]/text()")
    title = div.xpath("./div/div/a/div[2]/div[2]/p/text()")
    print(price, title)
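In the loop above, each xpath call returns a list, which may be empty when a listing lacks a field. One possible way to handle that is sketched below. The helper names first_text and parse_items, the sample markup, and its class names are all assumptions made for illustration; they are not part of the site or of lxml.

```python
from lxml import etree


def first_text(node, path, default=''):
    """Return the first XPath text result, stripped, or a default if none matched."""
    results = node.xpath(path)
    return results[0].strip() if results else default


def parse_items(page_source, item_path, price_path, title_path):
    """Parse HTML and return one (price, title) tuple per matched item."""
    html = etree.HTML(page_source)
    return [
        (first_text(item, price_path), first_text(item, title_path))
        for item in html.xpath(item_path)
    ]


# Made-up markup standing in for a real response body (resp.text)
sample = '''
<div class="list">
<div class="item"><span class="price">¥100</span><p>Build a landing page</p></div>
<div class="item"><p>No price listed</p></div>
</div>
'''
rows = parse_items(
    sample,
    '//div[@class="item"]',
    './span[@class="price"]/text()',
    './p/text()',
)
print(rows)
```

With a real page, resp.text and the site-specific paths would be passed in place of the sample values.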
