When browsing some websites you will notice that they periodically add a new batch of records on top of the existing data. When a crawler runs against such a site, we want to filter out the data we have already crawled and fetch only the newly published records, so that we avoid a large amount of unnecessary work.

  • Concept

     Use a crawler program to monitor a site for data updates, so that newly published data on that site can be crawled.

  • How to crawl incrementally
    • Before sending a request, check whether the url has been crawled before
    • After parsing the response, check whether this piece of content has been crawled before
    • When writing to the storage backend, check whether the content already exists
  • Analysis

    It is not hard to see that the core of incremental crawling is deduplication; as for which step the dedup check should happen in, each choice has its trade-offs. In my view, you pick one of the first two approaches based on the actual situation (or possibly use both).

    The first approach suits sites where new pages keep appearing, such as new chapters of a novel or the latest daily news;

    the second approach suits sites whose existing pages get their content updated.

    The third approach acts as the last line of defense and pushes deduplication as far as it can go (a sketch of such a pipeline appears at the end of this post).

  • Dedup methods
    • Store every url produced during crawling in a Redis set. The next time the crawler runs, check each url you are about to request against that set: if it is already there, skip the request; if not, send it.
    • Derive a unique fingerprint for each crawled record and store it in a Redis set. The next time a page is crawled, check whether the record's fingerprint already exists in the set before persisting it, and only store it if it does not. (A minimal sketch of the underlying Redis check follows right below.)
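
Both methods rely on the same Redis primitive: sadd returns 1 when the member was not yet in the set (new data) and 0 when it was already there, so a single call both checks and records. A minimal sketch of that check, assuming a local Redis on the default port (the key names and the example url are illustrative):

import hashlib

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

def is_new_url(url):
    # sadd returns 1 if the url was not in the set yet (new), 0 if it was already recorded
    return conn.sadd('seen_urls', url) == 1

def is_new_record(author, content):
    # Fingerprint the record itself, then run it through the same set-based check
    fingerprint = hashlib.sha256((author + content).encode()).hexdigest()
    return conn.sadd('seen_fingerprints', fingerprint) == 1

if is_new_url('https://www.example.com/detail/1.html'):
    print('new url, worth requesting')
else:
    print('already crawled, skip it')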

 

  • Code example: url-based dedup, crawling movie details from the 4567tv site

Spider file:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from increment1_Pro.items import Increment1ProItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/id/7.html']

    rules = (
        Rule(LinkExtractor(allow=r'/index.php/vod/show/id/7/page/\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        conn = Redis(host='127.0.0.1', port=6379)
        # Build absolute detail-page urls from the relative hrefs on the list page
        detail_url_list = ['https://www.4567tv.tv' + href for href in
                           response.xpath('//li[@class="col-md-6 col-sm-4 col-xs-3"]/div/a/@href').extract()]
        for url in detail_url_list:
            # ex == 1: the url is not in the set yet, i.e. it has not been crawled
            ex = conn.sadd('movies_url', url)
            if ex == 1:
                yield scrapy.Request(url=url, callback=self.parse_detail)
            else:
                print('The site has no new data; nothing new to crawl!')

    def parse_detail(self, response):
        item = Increment1ProItem()
        item['name'] = response.xpath('/html/body/div[1]/div/div/div/div[2]/h1/text()').extract_first()
        item['actor'] = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()

        yield item

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Increment1ProItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    actor = scrapy.Field()

pipelines.py (persistent storage):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

from redis import Redis


class Increment1ProPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'name': item['name'],
            'actor': item['actor']
        }
        print('New data crawled, writing to the database......')
        # Redis list values must be strings/bytes, so serialize the record before pushing it
        self.conn.lpush('movie_data', json.dumps(dic, ensure_ascii=False))
        return item
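After a run you can check what ended up in Redis to confirm the dedup is working. A quick check, assuming the same local Redis and the key names used above:

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379, decode_responses=True)

# Number of distinct detail-page urls the spider has recorded
print(conn.scard('movies_url'))

# The most recently stored records (newest first, because the pipeline uses lpush)
for record in conn.lrange('movie_data', 0, 4):
    print(record)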

 

  • Code example: content-based dedup, crawling Qiushibaike jokes and their authors

Spider file:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from increment2_Pro.items import Increment2ProItem
from redis import Redis
import hashlib


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    rules = (
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@class="article block untagged mb15 typs_hot"]')
        conn = Redis(host='127.0.0.1', port=6379)
        for div in div_list:
            item = Increment2ProItem()
            item['content'] = div.xpath('.//div[@class="content"]/span//text()').extract()
            item['content'] = ''.join(item['content'])
            item['author'] = div.xpath('./div/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first()
            # Guard against a missing author so the concatenation below cannot fail
            source = (item['author'] or '') + item['content']
            # Build our own data fingerprint for the record
            hashValue = hashlib.sha256(source.encode()).hexdigest()

            # sadd returns 1 only if this fingerprint has not been seen before
            ex = conn.sadd('qiubai_hash', hashValue)
            if ex == 1:
                yield item
            else:
                print('No new data to crawl!!!')

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Increment2ProItem(scrapy.Item):
    # define the fields for your item here like:
    content = scrapy.Field()
    author = scrapy.Field()

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

from redis import Redis


class Increment2ProPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # Serialize before storing; Redis list values must be strings/bytes
        self.conn.lpush('qiubaiData', json.dumps(dic, ensure_ascii=False))
        print('One record crawled, writing to the database......')
        return item
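
The third strategy from the analysis above, deduplicating at write time as a last line of defense, is not shown in either example, but it follows the same pattern moved into a pipeline. A minimal sketch, assuming a local Redis; the class name, key names and the fields used for the fingerprint are illustrative:

import hashlib
import json

from redis import Redis
from scrapy.exceptions import DropItem


class DedupAtStoragePipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # Fingerprint the record itself rather than the url it came from
        source = (item.get('author') or '') + (item.get('content') or '')
        fingerprint = hashlib.sha256(source.encode()).hexdigest()
        # sadd returns 0 when the fingerprint already exists -> duplicate, do not store it again
        if self.conn.sadd('stored_fingerprints', fingerprint) == 0:
            raise DropItem('duplicate record, skipping storage')
        self.conn.lpush('deduped_data', json.dumps(dict(item), ensure_ascii=False))
        return item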