Common Tools for Python Web Scraping: PyQuery

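PyQuery is a commonly used library for parsing pages. It gives Python a jQuery-like API, built on top of lxml.
Below is some code for parsing a basic page. I will add more complex or practical usage later as I run into it.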

from pyquery import PyQuery as pq


# Case 1: the argument is an HTML string
html_str = "<html></html>"

# Case 2: the argument is a URL (must include the http:// scheme)
your_url = "http://www.baidu.com"

# Case 3: the argument is a file
path_to_html_file = "hello123.html"

# Pass one of the above to pq() to get the parsed page
# d = pq(html_str)
# d = pq(etree.fromstring(html_str))  # needs: from lxml import etree
# d = pq(url=your_url)
# d = pq(url=your_url,
#        opener=lambda url, **kw: urlopen(url).read())  # needs urlopen imported
d = pq(filename=path_to_html_file)

# 'd' now plays the role of jQuery's '$': a selector function that
# picks elements by tag, id, class, and so on

# Select by id
table = d("#my_table")

# Select by tag
head = d("head")

# Select by class; several selectors can be written together,
# separated by commas, to match elements that satisfy any of them
p = d(".p_font")

# Get the text inside the element
text = p.text()
print(text)

# Get an attribute value of the element
t_class = table.attr('class')
print(t_class)

# Iterate over the matched elements
# Print the text of the td cells in the table
for item in table.items():
    # This loop runs only once; item is still a PyQuery object
    print(item.text())

for item in table('td'):
    # This loop runs once per cell; item is an lxml HTML element
    print(item.text)

The HTML used for testing:

<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Parse me!</h1>
        <img src = "" />
        <p>A paragraph.</p>
        <p class = "p_font">A paragraph with class.</p>
        <!-- comment -->
        <div>
            <p>A paragraph in div.</p>
        </div>
        <table id = "my_table" class = "test-table">
        <thead>
        </thead>
        <tbody>
            <tr>
                <td>Month</td>
                <td>Savings</td>
            </tr>
            <tr>
                <td>January</td>
                <td>$100</td>
            </tr>
        </tbody>
        </table>
    </body>
</html>

Parsing this HTML produces the following output:

A paragraph with class.
test-table
Month Savings January $100
Month
Savings
January
$100

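Because this code runs on Python 2, some pages fetched directly with requests cause encoding problems when passed straight to PyQuery. Converting the content with unicode() first fixes this. The relevant part of the code: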

import requests
from pyquery import PyQuery as pq

r = requests.get('http://www.baidu.com')
# d = pq(r.content)
u = unicode(r.content, 'utf-8')
d = pq(u)
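On Python 3 there is no `unicode()`; the equivalent step is to decode the response bytes into `str` before handing them to `pq()`. The sketch below assumes the page body arrives as bytes (like `r.content`) and that its encoding is one of a short candidate list. The helper name `decode_html` and the default candidates `utf-8` and `gbk` are assumptions for illustration, not something the original code uses.

```python
def decode_html(raw, encodings=("utf-8", "gbk")):
    """Decode raw page bytes, trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate, replacing bad bytes.
    return raw.decode(encodings[0], errors="replace")


# Simulated response body encoded as GBK (stands in for r.content).
raw_body = "<p>你好</p>".encode("gbk")
page = decode_html(raw_body)
print(page)
```

The decoded string can then be passed to `pq()` in the same way as `u` above. requests also exposes an already decoded body as `r.text`, which may be enough when the server declares its encoding correctly.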

Posted 2017-03-27 16:41 by Masako
