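Python Modules Explained | pyquery
Introduction
pyquery is a powerful HTML parsing library with a jQuery-like API. With it we can parse the structure of DOM nodes directly and extract content quickly through the attributes of those nodes.
Official documentation: http://pyquery.readthedocs.io/
Installation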
pip3 install pyquery
Initialization
🍑 Initializing from a string
html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)    # parse the HTML string
print(doc('li'))  # select every <li> node
🥭 Initializing from a URL
from pyquery import PyQuery as pq
doc = pq(url='https://baidu.com')  # fetch the page over the network, then parse it
print(doc('title'))
🍉 Initializing from a file
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')  # read and parse a local HTML file
print(doc('li'))
Basic CSS selectors
🍎 id selector (#)
🍊 class selector (.)
html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
# every <li> inside class "list", inside the node with id "container"
print(doc('#container .list li'))
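Finding nodes
🍎 Child nodes
- find(): searches all descendant nodes that match the selector
- children(): returns direct child nodes only
from pyquery import PyQuery as pq
doc = pq(html)  # html is the sample document from the CSS selector section
items = doc('.list')
# find(): all descendant <li> nodes
lis = items.find('li')
print(lis)
# children(): direct children only
lis = items.children()
print(lis)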
🍊 Parent nodes
- parents(): returns the ancestor nodes
🥝 Sibling nodes
- siblings(): returns the nodes that share the same parent
Note: all of the methods above accept a CSS selector.
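Traversal
🚴♀️ Single node
from pyquery import PyQuery as pq
doc = pq(html)  # html is the sample document from the CSS selector section
# '.item-0.active' matches exactly one <li>; plain '.active' would match two
li = doc('.item-0.active')
print(li)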
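🚴 Multiple nodes (iteration)
- items()
from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()  # generator of PyQuery objects, one per matched node
for li in lis:
    print(li, type(li))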
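Getting information
🚗 Attributes
- attr()
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('a')
# value of the href attribute; when several nodes match, only the first node's value is returned
print(a.attr('href'))
# or, equivalently
print(a.attr.href)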
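🚄 Text
- text(): returns the plain-text content inside the tag
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('a')
print(a.text())
- html(): returns the HTML markup inside the tag
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li')
print(li.html())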
https://chenxuefan.cn