Using regular expressions with XPath in Scrapy

Method 1:

 

Example: here I crawl the site "http://www.simple-style.com/page/1".

$ scrapy shell http://www.simple-style.com/page/1

Once inside the interactive shell, I want to find every src attribute on the current page:

>>> response.xpath('//@src').extract()
['http://www.simple-style.com/wp-includes/js/jquery/jquery.js?ver=1.12.4',
 'http://www.simple-style.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.4.1',
 'http://www.simple-style.com/wp-content/plugins/to-top/public/js/to-top-public.js?ver=1.0',
 'http://www.simple-style.com/wp-content/uploads/2017/03/simple-logo.gif',
 '//v.qq.com/iframe/player.html?vid=e0386mjreck&tiny=0&auto=0',
 'http://www.simple-style.com/wp-content/uploads/2017/03/END_OF_LOVE_MICHAL_NAROZNY_001.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/the_warehouse_hotel_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/02/ahndraya_parlato_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/inner_self_04.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/Yuanghua-Chen-01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/01-alicephoebelou.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/06/02-Tim_Gao_Photography_Invisible_Theatre_17.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/05/4.png',
 'http://www.simple-style.com/wp-content/uploads/2016/05/01-Remona.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/05/Nbr-h000-1.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/04/0501.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/04/01.jpg',
 'http://www.simple-style.com/wp-content/plugins/smartideo/static/smartideo.js?ver=2.2.5',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/skip-link-focus-fix.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/navigation.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/global.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/jquery.scrollTo.js?ver=2.1.2',
 'http://www.simple-style.com/wp-includes/js/wp-embed.min.js?ver=4.7.3']

That returns a lot of src values. To keep only the JPG images uploaded under "/2017/03", use a regular expression.

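Call .re() directly on the result of xpath(), without extract(): .re() already returns a list of strings. Calling extract() first and then .re() raises an error, because extract() returns a plain Python list, which has no .re() method.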

>>> response.xpath('//@src').re(r'.*/2017/03/.*\.jpg')
['http://www.simple-style.com/wp-content/uploads/2017/03/END_OF_LOVE_MICHAL_NAROZNY_001.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/the_warehouse_hotel_01.jpg']
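
If the values have already been pulled out with extract(), the same filter can still be applied afterwards with Python's standard re module. The following is a minimal sketch, not part of the original session: the short srcs list is an assumption that stands in for the full extract() output, and march_jpgs is just an illustrative name.

```python
import re

# Three values copied from the extract() output, standing in for the full list.
srcs = [
    'http://www.simple-style.com/wp-content/uploads/2017/03/simple-logo.gif',
    'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
    'http://www.simple-style.com/wp-content/uploads/2017/02/ahndraya_parlato_01.jpg',
]

# Same pattern as in the .re() call; the raw string keeps "\." a literal dot.
pattern = re.compile(r'.*/2017/03/.*\.jpg')

# Keep only the strings in which the pattern matches.
march_jpgs = [src for src in srcs if pattern.search(src)]
print(march_jpgs)
```

The .gif under /2017/03 and the .jpg under /2017/02 are both dropped, so only the xiaoxuan_01.jpg URL remains.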

 

Method 2:

# Regex selector: re:test() used inside an XPath predicate
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <li class="item-"><a href="link.html">first item</a></li>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</body>
</html>
"""

# Build a response from the HTML string so no network request is needed.
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

# re:test(@class, pattern) is true when the class attribute matches the regex.
# "item-\d*" also matches "item-" with no digits, so all three <li> qualify.
ret = Selector(response=response).xpath(r'//li[re:test(@class, "item-\d*")]//@href').extract()
print(ret)  # ['link.html', 'link1.html', 'link2.html']

 

posted @ 2017-04-12 00:52  Garvey