Introduction to pyquery, a handy tool for web scraping
1. Installation and import
Installation:
pip install pyquery
Import:
from pyquery import PyQuery as pq
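2. Usage
2.1 Initialization
from pyquery import PyQuery as pq

html = "<div><p>hello</p></div>"  # placeholder: any HTML string
doc = pq(html)  # parse an HTML string
print(doc)

url = "http://news.baidu.com/guonei"
doc = pq(url=url)  # fetch and parse a web page
print(doc)

# parse a local HTML file; a bare pq("./a.html") would treat the path itself as markup
doc = pq(filename="./a.html")
print(doc)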
2.2 Basic usage
from pyquery import PyQuery as pq

html = """
<html lang="en">
    <div class="py_divc" id="py_divi">
        <ul class="container">
            <li class="object-1" href="www.aaa.com">hello Python</li>
            <li class="object-2" href="www.bbb.com">大法</li>
            <li class="object-3" href="www.ccc.com">好</li>
        </ul>
    </div>
</html>
"""
doc = pq(html)
print(doc("#py_divi .container li"))
Output:
<li class="object-1" href="www.aaa.com">hello Python</li>
<li class="object-2" href="www.bbb.com">大法</li>
<li class="object-3" href="www.ccc.com">好</li>
In the selector, # matches an id, . matches a class, li matches <li> tags, and a space between parts means "descendant of".
2.3 Finding child elements
doc = pq(html)
items = doc("#py_divi")
print(items)
child = items.find('.container')
print(child)
child = items.children()
print(child)
Output:
<div class="py_divc" id="py_divi">
    <ul class="container">
        <li class="object-1" href="www.aaa.com">hello Python</li>
        <li class="object-2" href="www.bbb.com">大法</li>
        <li class="object-3" href="www.ccc.com">好</li>
    </ul>
</div>
<ul class="container">
    <li class="object-1" href="www.aaa.com">hello Python</li>
    <li class="object-2" href="www.bbb.com">大法</li>
    <li class="object-3" href="www.ccc.com">好</li>
</ul>
<ul class="container">
    <li class="object-1" href="www.aaa.com">hello Python</li>
    <li class="object-2" href="www.bbb.com">大法</li>
    <li class="object-3" href="www.ccc.com">好</li>
</ul>
Both find() and children() return nested elements. find() searches all descendants for the given selector, while children() returns only direct children (optionally filtered by a selector).
2.4 Finding the parent element
doc = pq(html)
items = doc("#py_divi")
print(items)
parent_href = items.parent()
print(parent_href)
Output:
<div class="py_divc" id="py_divi">
    <ul class="container">
        <li class="object-1" href="www.aaa.com">hello Python</li>
        <li class="object-2" href="www.bbb.com">大法</li>
        <li class="object-3" href="www.ccc.com">好</li>
    </ul>
</div>
<html lang="en">
    <div class="py_divc" id="py_divi">
        <ul class="container">
            <li class="object-1" href="www.aaa.com">hello Python</li>
            <li class="object-2" href="www.bbb.com">大法</li>
            <li class="object-3" href="www.ccc.com">好</li>
        </ul>
    </div>
</html>
2.5 Finding sibling nodes
doc = pq(html)
items = doc(".object-1")
print(items)
bro = items.siblings()
print(bro)
Output:
<li class="object-1" href="www.aaa.com">hello Python</li>
<li class="object-2" href="www.bbb.com">大法</li>
<li class="object-3" href="www.ccc.com">好</li>
2.6 Iterating over query results
doc = pq(html)
for dl in doc("li").items():
    print(dl)
Output:
<li class="object-1" href="www.aaa.com">hello Python</li>
<li class="object-2" href="www.bbb.com">大法</li>
<li class="object-3" href="www.ccc.com">好</li>
2.7 Getting attributes
doc = pq(html)
for dl in doc("li").items():
    print(dl.attr('href'))  # method-call form
    print(dl.attr.href)     # attribute-access form, same result
Output:
www.aaa.com
www.aaa.com
www.bbb.com
www.bbb.com
www.ccc.com
www.ccc.com
2.8 Getting text
doc = pq(html)
for dl in doc('li').items():
    print(dl.text())
Output:
hello Python
大法
好
3. Pseudo-class selectors
doc = pq(html)
its = doc("li:first-child")
print('First tag: %s' % its)
its = doc("li:last-child")
print('Last tag: %s' % its)
its = doc("li:nth-child(2)")
print("Second tag: %s" % its)
its = doc("li:gt(0)")
print("Tags after index 0: %s" % its)
its = doc("li:contains('hello')")
print("Tags whose text contains 'hello': %s" % its)
Output:
First tag: <li class="object-1" href="www.aaa.com">hello Python</li>
Last tag: <li class="object-3" href="www.ccc.com">好</li>
Second tag: <li class="object-2" href="www.bbb.com">大法</li>
Tags after index 0: <li class="object-2" href="www.bbb.com">大法</li>
<li class="object-3" href="www.ccc.com">好</li>
Tags whose text contains 'hello': <li class="object-1" href="www.aaa.com">hello Python</li>