网络爬虫（4）-Scrapy增量爬虫

1.增量爬虫概念

通过爬虫程序监测某网站数据更新的情况，以便可以爬取到该网站更新出的新数据。

2.增量爬虫方法

在发送请求之前判断这个URL是不是之前爬取过
在解析内容后判断这部分内容是不是之前爬取过

写入存储介质时判断内容是不是已经在介质中存在

- 分析：不难发现，其实增量爬取的核心是去重，至于去重的操作在哪个步骤起作用，只能说各有利弊。在我看来，前两种思路需要根据实际情况取一个（也可能都用）。第一种思路适合不断有新页面出现的网站，比如说小说的新章节，每天的最新新闻等等；第二种思路则适合页面内容会更新的网站。第三个思路是相当于是最后的一道防线。这样做可以最大程度上达到去重的目的。
去重方法
- 将爬取过程中产生的url进行存储，存储在redis的set中。当下次进行数据爬取时，首先对即将要发起的请求对应的url在存储的url的set中做判断，如果存在则不进行请求，否则才进行请求。
- 对爬取到的网页内容进行唯一标识的制定，然后将该唯一表示存储至redis的set中。当下次爬取到网页数据的时候，在进行持久化存储之前，首先可以先判断该数据的唯一标识在redis的set中是否存在，在决定是否进行持久化存储。

3.爬虫案例

目标：爬取4567tv网站中所有的电影详情数据

1）爬虫文件

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from redis import Redis
from incrementPro.items import IncrementproItem
class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.4567tv.tv/frim/index7-11.html']

    rules = (
        Rule(LinkExtractor(allow=r'/frim/index7-\d+\.html'), callback='parse_item', follow=True),
    )
    #创建redis链接对象
    conn = Redis(host='127.0.0.1',port=6379)
    def parse_item(self, response):
        li_list = response.xpath('//li[@class="p1 m1"]')
        for li in li_list:
            #获取详情页的url
            detail_url = 'http://www.4567tv.tv'+li.xpath('./a/@href').extract_first()
            #将详情页的url存入redis的set中
            ex = self.conn.sadd('urls',detail_url)
            if ex == 1:
                print('该url没有被爬取过，可以进行数据的爬取')
                yield scrapy.Request(url=detail_url,callback=self.parst_detail)
            else:
                print('数据还没有更新，暂无新数据可爬取！')

    #解析详情页中的电影名称和类型，进行持久化存储
    def parst_detail(self,response):
        item = IncrementproItem()
        item['name'] = response.xpath('//dt[@class="name"]/text()').extract_first()
        item['kind'] = response.xpath('//div[@class="ct-c"]/dl/dt[4]//text()').extract()
        item['kind'] = ''.join(item['kind'])
        yield item

2）管道文件

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from redis import Redis
class IncrementproPipeline(object):
    conn = None
    def open_spider(self,spider):
        self.conn = Redis(host='127.0.0.1',port=6379)
    def process_item(self, item, spider):
        dic = {
            'name':item['name'],
            'kind':item['kind']
        }
        print(dic)
        self.conn.lpush('movieData',dic)
        return item

posted @ 2019-08-01 21:44 麦小秋阅读(326) 评论(0) 收藏举报

刷新页面返回顶部

麦小秋

记录学习历程！个人Q群：870467632（Python学习交流群）欢迎Python爱好者加入，一起学习，共同进步！

网络爬虫（4）-Scrapy增量爬虫

1.增量爬虫概念

2.增量爬虫方法

3.爬虫案例

公告