An introduction to pyquery, a handy tool for web scraping

1. Installation and import

Installation:

pip install pyquery

Import:

from pyquery import PyQuery as pq

 

2. Usage

2.1 Initialization

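from pyquery import PyQuery as pq

html = "<div><p>hello</p></div>"
doc = pq(html)  # parse an HTML string
print(doc)
url = "http://news.baidu.com/guonei"
doc = pq(url=url)  # fetch and parse a web page (url= keyword)
print(doc)
doc = pq(filename="./a.html")  # parse a local HTML file (filename= keyword)
print(doc)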

2.2 Basic usage

from pyquery import PyQuery as pq

html = """
<html lang="en">
    <div class ="py_divc" id="py_divi">
        <ul class="container">
            <li class="object-1" href="www.aaa.com">hello Python</li>
            <li class="object-2" href="www.bbb.com">大法</li>
            <li class="object-3" href="www.ccc.com">好</li>
        </ul>
    </div>
</html>
"""

doc = pq(html)
print(doc("#py_divi .container li"))

Output:

<li class="object-1" href="www.aaa.com">hello Python</li>
<li class="object-2" href="www.bbb.com">大法</li>
<li class="object-3" href="www.ccc.com">好</li>

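In the selector, # matches an element by id, . matches by class, a bare name such as li matches by tag, and a space between parts means the right-hand part is nested inside the left-hand one.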

2.3 Finding child elements

 

doc = pq(html)
items = doc("#py_divi")
print(items)
child = items.find('.container')
print(child)
child = items.children()
print(child)

Output:

<div class="py_divc" id="py_divi">
        <ul class="container">
            <li class="object-1" href="www.aaa.com">hello Python</li>
            <li class="object-2" href="www.bbb.com">大法</li>
            <li class="object-3" href="www.ccc.com">好</li>
        </ul>
 </div>

<ul class="container">
            <li class="object-1" href="www.aaa.com">hello Python</li>
            <li class="object-2" href="www.bbb.com">大法</li>
            <li class="object-3" href="www.ccc.com">好</li>
        </ul>
    
<ul class="container">
            <li class="object-1" href="www.aaa.com">hello Python</li>
            <li class="object-2" href="www.bbb.com">大法</li>
            <li class="object-3" href="www.ccc.com">好</li>
        </ul>

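Both find() and children() return nested elements. find() searches all descendants, while children() returns only direct children.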

2.4 Finding the parent element

doc = pq(html)
items = doc("#py_divi")
print(items)
parent = items.parent()  # the element that directly contains the div
print(parent)

Output:

<div class="py_divc" id="py_divi">
        <ul class="container">
            <li class="object-1" href="www.aaa.com">hello Python</li>
            <li class="object-2" href="www.bbb.com">大法</li>
            <li class="object-3" href="www.ccc.com">好</li>
        </ul>
    </div>

<html lang="en">
    <div class="py_divc" id="py_divi">
        <ul class="container">
            <li class="object-1" href="www.aaa.com">hello Python</li>
            <li class="object-2" href="www.bbb.com">大法</li>
            <li class="object-3" href="www.ccc.com">好</li>
        </ul>
    </div>
</html>

2.5 Finding sibling nodes

doc = pq(html)
items = doc(".object-1")
print(items)
bro = items.siblings()
print(bro)

Output:

<li class="object-1" href="www.aaa.com">hello Python</li>
            
<li class="object-2" href="www.bbb.com">大法</li>
<li class="object-3" href="www.ccc.com">好</li>

2.6 Iterating over query results

doc = pq(html)
for dl in doc("li").items():
    print(dl)

Output:

<li class="object-1" href="www.aaa.com">hello Python</li>
            
<li class="object-2" href="www.bbb.com">大法</li>
            
<li class="object-3" href="www.ccc.com">好</li>

2.7 Getting attributes

doc = pq(html)
for dl in doc("li").items():
    print(dl.attr('href'))
    print(dl.attr.href)

Output:

www.aaa.com
www.aaa.com
www.bbb.com
www.bbb.com
www.ccc.com
www.ccc.com

2.8 Getting text

doc = pq(html)
for dl in doc('li').items():
    print(dl.text())

Output:

hello Python
大法
好

3. Pseudo-class selectors

doc = pq(html)
its = doc("li:first-child")
print("First item: %s" % its)
its = doc("li:last-child")
print("Last item: %s" % its)
its = doc("li:nth-child(2)")
print("Second item: %s" % its)
its = doc("li:gt(0)")
print("Items after index 0: %s" % its)
its = doc("li:contains('hello')")
print("Items whose text contains hello: %s" % its)

Output:

First item: <li class="object-1" href="www.aaa.com">hello Python</li>

Last item: <li class="object-3" href="www.ccc.com">好</li>

Second item: <li class="object-2" href="www.bbb.com">大法</li>

Items after index 0: <li class="object-2" href="www.bbb.com">大法</li>
            <li class="object-3" href="www.ccc.com">好</li>

Items whose text contains hello: <li class="object-1" href="www.aaa.com">hello Python</li>

 

Posted 2020-02-29 11:16 by 风墓