Scrapy CrawlSpider: a URL-crawling example

1. Create a new project (scrapy startproject)

scrapy startproject wdzurlSpider
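
This generates the standard Scrapy project skeleton; the files edited in the following steps all live here (layout from recent Scrapy versions, shown for orientation):

wdzurlSpider/
    scrapy.cfg            # deploy configuration
    wdzurlSpider/
        __init__.py
        items.py          # step 2: item definitions
        pipelines.py      # step 4: item pipeline
        settings.py       # step 4: project settings
        spiders/          # step 3: the spider goes here
            __init__.py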

 

2. Define the items (wdzurlSpider/items.py)

import scrapy

class WdzurlspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()    # post author
    title = scrapy.Field()   # post title
    zan = scrapy.Field()     # like count

 

3. Build the spider (spiders/wdzurl.py)

1. Generate a CrawlSpider template: scrapy genspider -t crawl wdzurl "waduanzi.com"

Testing the LinkExtractor in the scrapy shell:

scrapy shell "http://www.waduanzi.com/"

from scrapy.linkextractors import LinkExtractor

# matches any URL that contains archives/<digits>
a = LinkExtractor(allow=(r'archives/\d+',))
a.extract_links(response)

# the trailing $ keeps only URLs that end right after the digits
b = LinkExtractor(allow=(r'archives/\d+$',))
b.extract_links(response)
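
extract_links() returns a list of Link objects rather than raw strings; in the shell their url and text attributes make it easy to eyeball whether the pattern matched what you expected (a quick inspection sketch):

for link in b.extract_links(response):
    print(link.url, link.text)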

2. Open wdzurl.py in the wdzurlSpider/spiders directory; the code is as follows:

# -*- coding: utf-8 -*-

# Import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# Import LinkExtractor, which extracts the links that match a rule
from scrapy.linkextractors import LinkExtractor
from ..items import WdzurlspiderItem


class WdzurlSpider(CrawlSpider):
    name = 'wdzurl'
    allowed_domains = ['waduanzi.com']
    start_urls = ['http://waduanzi.com/']

    # Extraction rule applied to each response; yields the list of matching
    # links. This only picks up the matching links on the first page.
    pagelink = LinkExtractor(allow=(r'archives/\d+$',))

    rules = [
        # Request each extracted link in turn. With follow=True the spider
        # would keep extracting links from those responses as well; the
        # detail pages contain no further links we need, so follow=False.
        Rule(pagelink, callback="parseWdz", follow=False)
    ]

    # The callback named in the Rule
    def parseWdz(self, response):
        item = WdzurlspiderItem()
        item['name'] = response.xpath("//div[@class='post-author']/a/text()").extract()[0]

        # title is sometimes empty, hence the check below
        title = response.xpath("//div[@class='item-title']/h1/text()").extract()
        if len(title) == 0:
            item['title'] = ''
        else:
            item['title'] = title[0]
        item['zan'] = response.xpath("//div[@class='item-toolbar']/ul/li[1]/a/text()").extract()[0]

        yield item

    # def parse(self, response):  # do not use this method name:
    #     pass                    # CrawlSpider implements parse() itself
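
Because follow=False, the crawl stops at the detail pages reachable from the first listing page. To walk the site's pagination as well, a second Rule with follow=True would do it. A minimal sketch, replacing the rules list above and assuming a hypothetical listing-page pattern like page/\d+ (verify the real pattern in the shell first):

    rules = [
        # hypothetical pagination pattern -- no callback, just keep following
        Rule(LinkExtractor(allow=(r'page/\d+',)), follow=True),
        # detail pages: parse them, do not follow further
        Rule(pagelink, callback="parseWdz", follow=False),
    ]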

 

4. Store the content (pipelines.py)

Modify the following items in settings.py (this also sends the log output to a file):

# File name for the log output
LOG_FILE = 'wdzurllog.log'
# Log level; messages at this severity or higher are recorded
LOG_LEVEL = 'DEBUG'

DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'wdzurlSpider.pipelines.WdzurlspiderPipeline': 300,
}
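
For the User-Agent alone, Scrapy also has a dedicated USER_AGENT setting; putting the header in DEFAULT_REQUEST_HEADERS as above works, but this is the more conventional spot for it:

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36"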

Write pipelines.py:

import json

class WdzurlspiderPipeline(object):
    def process_item(self, item, spider):
        # Serialize each item to JSON and append it to the output file
        jsontext = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        with open('wdz.json', 'a', encoding='utf-8') as f:
            f.write(jsontext)
        return item
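
Reopening wdz.json for every item works but is wasteful on larger crawls. A minimal alternative sketch using the pipeline's open_spider/close_spider hooks so the file stays open for the whole run:

import json

class WdzurlspiderPipeline(object):
    def open_spider(self, spider):
        # open the output file once when the crawl starts
        self.f = open('wdz.json', 'a', encoding='utf-8')

    def close_spider(self, spider):
        # close it once when the crawl ends
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + ',\n')
        return item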

Run the crawl:
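
From the project root (the spider name comes from the name attribute defined above):

scrapy crawl wdzurl

Each scraped item is appended to wdz.json, and the log goes to wdzurllog.log as configured in settings.py.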

