When browsing certain websites you will notice that they periodically publish a new batch of data on top of what is already there. When a crawler runs into this situation, it should filter out the data that has already been crawled and fetch only the newly added data, so as to avoid a large amount of unnecessary work.
- Concept
Use a crawler program to monitor a website for data updates, so that whatever new data the site publishes can be crawled.
- How to crawl incrementally
  - Before sending a request, check whether its URL has been crawled before
  - After parsing the content, check whether that content has been crawled before
  - When writing to the storage medium, check whether the content already exists there
- Analysis
It is not hard to see that the core of incremental crawling is deduplication; as for which step the dedup check should run in, each option has its pros and cons. In my view, you should pick one of the first two approaches (or possibly use both) according to the actual situation.
The first approach suits sites that keep producing new pages, such as new chapters of a novel or each day's latest news.
The second approach suits sites whose existing pages are updated in place.
The third approach serves as the last line of defence and pushes deduplication as far as it can go (see the sketch below).
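A purely illustrative sketch of the three checkpoints, with in-memory sets standing in for the real store (the Redis-backed versions appear in the examples below); `fetch` and `parse` are hypothetical callables supplied by the caller:

```python
seen_urls = set()          # URLs already requested (check 1)
seen_fingerprints = set()  # fingerprints of content already parsed (check 2)
stored_records = []        # what has actually been persisted (check 3)

def crawl(url, fetch, parse):
    # Check 1: before sending the request, skip URLs crawled earlier.
    if url in seen_urls:
        return
    seen_urls.add(url)

    record, fingerprint = parse(fetch(url))

    # Check 2: after parsing, skip content whose fingerprint we have seen.
    if fingerprint in seen_fingerprints:
        return
    seen_fingerprints.add(fingerprint)

    # Check 3: last line of defence right before writing to storage.
    if record not in stored_records:
        stored_records.append(record)
```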
- Deduplication methods
  - Store the URLs produced during crawling in a Redis set. On the next crawl, before issuing a request, check whether its URL is already in that set: if it is, skip the request; if not, send it.
  - Derive a unique fingerprint for every crawled record and store that fingerprint in a Redis set. On the next crawl, before persisting a record, check whether its fingerprint already exists in the set and only then decide whether to store it. Both checks reduce to the same Redis call; see the sketch right after this list.
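`sadd` returns 1 when the member is new to the set and 0 when it is already there, so the return value itself is the dedup decision. A minimal sketch, assuming a local Redis on the default port (the key names are illustrative):

```python
import hashlib

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

def is_new_url(url):
    # sadd returns 1 if the URL was not yet in the set (never requested), else 0.
    return conn.sadd('crawled_urls', url) == 1

def is_new_record(author, content):
    # Fingerprint the record and dedup on the digest instead of the raw text.
    fingerprint = hashlib.sha256((author + content).encode()).hexdigest()
    return conn.sadd('record_fingerprints', fingerprint) == 1
```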
- Code example: URL-based dedup, crawling movie details from the 4567tv site
Spider file:

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from increment1_Pro.items import Increment1ProItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/id/7.html']

    rules = (
        Rule(LinkExtractor(allow=r'/index.php/vod/show/id/7/page/\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        conn = Redis(host='127.0.0.1', port=6379)
        # Prefix each relative detail link with the site domain.
        detail_url_list = ['https://www.4567tv.tv' + href for href in
                           response.xpath('//li[@class="col-md-6 col-sm-4 col-xs-3"]/div/a/@href').extract()]
        for url in detail_url_list:
            # ex == 1: the URL was not yet in the set, i.e. it has not been crawled before.
            ex = conn.sadd('movies_url', url)
            if ex == 1:
                yield scrapy.Request(url=url, callback=self.parse_detail)
            else:
                print('No updated data on the site, nothing new to crawl for this URL!')

    def parse_detail(self, response):
        item = Increment1ProItem()
        item['name'] = response.xpath('/html/body/div[1]/div/div/div/div[2]/h1/text()').extract_first()
        item['actor'] = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()

        yield item
```
items.py :

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Increment1ProItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    actor = scrapy.Field()
```
pipelines.py (persistent storage):

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

from redis import Redis


class Increment1ProPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        print('New data crawled, writing to Redis......')
        # Redis values must be strings/bytes/numbers, so serialize the item first.
        self.conn.lpush('movie_data', json.dumps(dict(item)))
        return item
```
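As the boilerplate comment above notes, the pipeline only runs if it is registered in the project's settings.py. A minimal excerpt, assuming the increment1_Pro project layout implied by the imports (the priority 300 is just the conventional default):

```python
# settings.py (excerpt): register the Redis pipeline so process_item gets called
ITEM_PIPELINES = {
    'increment1_Pro.pipelines.Increment1ProPipeline': 300,
}
```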
- Code example: content-based dedup, crawling Qiushibaike jokes and their authors
Spider file:

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from increment2_Pro.items import Increment2ProItem
from redis import Redis
import hashlib


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    rules = (
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@class="article block untagged mb15 typs_hot"]')
        conn = Redis(host='127.0.0.1', port=6379)
        for div in div_list:
            item = Increment2ProItem()
            item['content'] = div.xpath('.//div[@class="content"]/span//text()').extract()
            item['content'] = ''.join(item['content'])
            # extract_first may return None, so fall back to an empty string.
            item['author'] = div.xpath('./div/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first() or ''
            # A home-made data fingerprint built from author + content.
            source = item['author'] + item['content']
            hashValue = hashlib.sha256(source.encode()).hexdigest()

            # sadd returns 1 only if this fingerprint has never been seen before.
            ex = conn.sadd('qiubai_hash', hashValue)
            if ex == 1:
                yield item
            else:
                print('No new data to crawl!!!')
```
items.py :

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Increment2ProItem(scrapy.Item):
    # define the fields for your item here like:
    content = scrapy.Field()
    author = scrapy.Field()
```
pipelines.py

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

from redis import Redis


class Increment2ProPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # Redis values must be strings/bytes/numbers, so serialize the dict first.
        self.conn.lpush('qiubaiData', json.dumps(dic))
        print('Crawled one record, writing to Redis......')
        return item
```
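(As in the first example, this pipeline also needs to be registered in ITEM_PIPELINES.) To verify the incremental behaviour, you can inspect the two Redis keys after a first run and again after a re-run; a small check script, assuming the same local Redis instance and the key names used above:

```python
from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

# Number of distinct fingerprints recorded so far; a re-run that finds no new
# posts should leave this count unchanged.
print(conn.scard('qiubai_hash'))

# Newest stored records first (lpush prepends to the list).
print(conn.lrange('qiubaiData', 0, 4))
```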