Web Crawlers -- Incremental Crawlers

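Incremental crawler:

--Concept: monitor a website for updates and crawl only the data that has been published since the last run

--Analysis:

  --Specify the start URL

  --Use CrawlSpider to obtain the links to the other pages

  --Use a Rule to send requests to those page links

  --Parse the detail-page URLs out of the source of each listing page

  --Check whether each detail-page URL has already been crawled: skip it if it has, request it if it has not

    --Store the URLs of the detail pages that have been crawled

      --Store them in a Redis set (Redis is used because it is a lightweight store)

  --Persistent storage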
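The duplicate check in the steps above depends on one property of a Redis set: adding a member reports whether it was new (1) or already present (0). The following is a minimal sketch of that logic only. It uses an in-memory stand-in instead of a real Redis server, and the names `InMemorySetStore`, `filter_new_urls`, and the example.com URLs are hypothetical, chosen for illustration.

```python
class InMemorySetStore:
    """Hypothetical stand-in for a Redis connection; it only mimics sadd()."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, value):
        # Like SADD with a single member: 1 if newly added, 0 if already present.
        members = self._sets.setdefault(key, set())
        if value in members:
            return 0
        members.add(value)
        return 1


def filter_new_urls(conn, urls, key='urls'):
    """Record every URL and return only those not seen before."""
    return [url for url in urls if conn.sadd(key, url) == 1]


conn = InMemorySetStore()
first_run = filter_new_urls(conn, ['https://example.com/a', 'https://example.com/b'])
second_run = filter_new_urls(conn, ['https://example.com/b', 'https://example.com/c'])
print(first_run)   # ['https://example.com/a', 'https://example.com/b']
print(second_run)  # ['https://example.com/c']
```

On the second run only the URL that was not recorded earlier comes back, which is the behaviour an incremental crawl relies on.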

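# Spider file (.py)

import scrapy
from redis import Redis
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import DetailItem  # assumes DetailItem is defined in the project's items.py


class BlueSpider(CrawlSpider):
    name = 'blue'
    # allowed_domains = ['www.xxx.com']  # domain restriction: only follow links under this domain
    start_urls = ['https://www.fuzokuu.com/category/fuzokuuguide-thailand/page/1']
    # Follow the other pagination links and hand each listing page to parse_item.
    # The original post omits the rules; this allow pattern is an assumption
    # inferred from start_urls.
    rules = (
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_item', follow=True),
    )
    # Create the Redis connection object (the pipeline reuses it)
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        # Parse the listing page from the response
        # Note: the XPath expressions must not contain tags such as tbody or iframe
        article_list = response.xpath('//article')
        for article in article_list:
            article_title = article.xpath('./div[2]/header/h2/a/@title').extract_first()
            article_url = article.xpath('./div[2]/header/h2/a/@href').extract_first()
            if not article_url:
                continue  # no detail link found in this entry
            # sadd returns 1 if the URL was newly added to the set, 0 if it was already there
            ex = self.conn.sadd('urls', article_url)
            if ex == 1:
                print('This URL is newly added; crawling its data')
                yield scrapy.Request(url=article_url, callback=self.parse_detail)
            else:
                print('No update yet; no new data to crawl!')

    # Parse the article detail page
    def parse_detail(self, response):
        content = response.xpath('//article//p/text()').extract()
        article_id = response.xpath('//article/@id').extract_first()
        content = ''.join(content)
        # Keep the part after the last '-'; stays None if the id attribute is missing
        if article_id:
            article_id = article_id.split('-')[-1]
        item = DetailItem()
        item['content'] = content
        item['article_id'] = article_id
        yield item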
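
parse_detail keeps only the part of the article's id attribute that follows the last hyphen. The sketch below isolates that step so it can be tried without running the spider. The sample value 'post-1024' is an assumption about what the id attribute looks like on the site, and `extract_article_id` is a hypothetical helper, not part of the original spider.

```python
def extract_article_id(raw_id):
    """Return the text after the last '-' in raw_id, or None if raw_id is missing."""
    if not raw_id:
        return None
    return raw_id.split('-')[-1]


print(extract_article_id('post-1024'))  # 1024
print(extract_article_id('1024'))       # 1024
print(extract_article_id(None))         # None
```

Handling the missing-attribute case explicitly avoids an AttributeError when a page has no matching article element.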

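# Pipeline file pipelines.py

import json


class BluespiderPipeline:
    conn = None

    def open_spider(self, spider):
        # Reuse the Redis connection created on the spider
        self.conn = spider.conn

    def process_item(self, item, spider):
        dic = {
            'article_id': item['article_id'],
            'content': item['content']
        }
        # redis-py 3.0+ rejects dict values (it accepts bytes, str, int, float),
        # so serialize the dict to a JSON string before pushing it
        self.conn.lpush('articleData', json.dumps(dic, ensure_ascii=False))
        return item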
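
A Redis list holds strings or bytes, so a dict-shaped item has to be serialized before lpush and parsed again when it is read back. The following is a minimal sketch of that round trip using JSON. A plain Python list stands in for the Redis list 'articleData', and the helper names and sample items are made up for illustration.

```python
import json

stored = []  # stand-in for the Redis list 'articleData'


def push_item(item):
    # Insert at the head of the list, as LPUSH does.
    stored.insert(0, json.dumps(item, ensure_ascii=False))


def read_items():
    # Decode every stored JSON string back into a dict.
    return [json.loads(raw) for raw in stored]


push_item({'article_id': '1', 'content': 'first'})
push_item({'article_id': '2', 'content': 'second'})
items = read_items()
print(items[0]['article_id'])  # 2
```

Passing ensure_ascii=False keeps non-ASCII article text readable in the stored string instead of escaping it.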

posted @ 2022-06-02 16:02 EricYJChung
