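Persistent storage in the Scrapy framework

Data parsing in Scrapy

Usage: response.xpath('XPath expression')

Difference between the xpath built into Scrapy and the xpath in etree:

With etree, an XPath expression ending in text() or @attr returns strings directly. Scrapy's xpath returns a list of Selector objects instead. The matched text or attribute value is stored in each Selector's data attribute, and you call extract() or extract_first() to get it out as a string.

Scrapy's xpath returns Selector objects: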

import scrapy


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xx.com']
    start_urls = ["https://duanzixing.com/"]

    def parse(self, response):
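        # Parse the joke title and the joke content
        article_list = response.xpath('/html/body/section/div/div/article')
        for article in article_list:
            # The values parsed below are not strings, so this xpath behaves differently from the one in etree
            # The list returned by xpath holds Selector objects; the string we want is stored in each object's data attribute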
            title = article.xpath('./header/h2/a/text()')[0]
            content = article.xpath('./p[2]/text()')[0]
            print(title)
            print(content)
            break

Output:

<Selector xpath='./header/h2/a/text()' data='舒儿'>
<Selector xpath='./p[2]/text()' data='天下女的这么多，我觉得以后就去卖卫生巾，肯定能赚钱。自己创一个品牌，就叫“...'>

Using extract() and extract_first() on Selector objects

import scrapy


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xx.com']
    start_urls = ["https://duanzixing.com/"]

    def parse(self, response):
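        # Parse the joke title and the joke content
        article_list = response.xpath('/html/body/section/div/div/article')
        for article in article_list:
            # The values parsed below are not strings, so this xpath behaves differently from the one in etree
            # The list returned by xpath holds Selector objects; the string we want is stored in each object's data attribute

            # Take the value out of the Selector's data attribute
            # extract() on a single Selector returns its data value (rarely used)
            # title = article.xpath('./header/h2/a/text()')[0].extract()
            # content = article.xpath('./p[2]/text()')[0].extract()

            # extract_first() returns the data value of the first Selector in the list (commonly used)
            # title = article.xpath('./header/h2/a/text()').extract_first()
            # content = article.xpath('./p[2]/text()').extract_first()

            # extract() on the list returns a list of strings, for example ['舒儿'] (commonly used)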
            title = article.xpath('./header/h2/a/text()').extract()
            content = article.xpath('./p[2]/text()').extract()
            print(title)
            print(content)
            break

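Persistent storage

Method 1: persistent storage via a terminal command

Requirement: this method can only store the return value of the parse method, and only in a local text file with one of the supported extensions (for example .json, .csv or .xml).

Example: command that saves to a file ending in .csv: scrapy crawl spiderName -o filePath.csv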

import scrapy


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xx.com']
    start_urls = ["https://duanzixing.com/"]

    # Persist the parsed data using the terminal-command approach
    def parse(self, response):
        all_data = []
        # Parse the joke title and the joke content
        article_list = response.xpath('/html/body/section/div/div/article')
        for article in article_list:
            # extract_first() returns the data value of the first Selector in the list (commonly used)
            title = article.xpath('./header/h2/a/text()').extract_first()
            content = article.xpath('./p[2]/text()').extract_first()
            dic = {
                "title": title,
                "content": content
            }
            all_data.append(dic)
        return all_data

File generated locally: filePath.csv
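
To picture what ends up in the file, here is a rough standard-library sketch that turns a list of dicts shaped like all_data into CSV text. It is not Scrapy's exporter, only an approximation of the result; rows_to_csv and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical rows shaped like the dicts returned by parse().
all_data = [
    {"title": "title-1", "content": "content-1"},
    {"title": "title-2", "content": "content-2"},
]


def rows_to_csv(rows):
    """Serialize a list of same-keyed dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(all_data)
print(csv_text)
```

The dict keys become the header row and each dict becomes one line, which is why every dict returned from parse should carry the same keys.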

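Pipeline-based persistent storage (key topic)

Workflow:

1. Parse the data in the spider file.

2. Define the corresponding attributes in items.py: one attribute for each field parsed in step 1.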

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WangziproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # An attribute defined with Field() acts as a general-purpose field that can hold any type
    title = scrapy.Field()  # the spider file parses two fields: title and content
    content = scrapy.Field()

3. In the spider file, store the parsed data in an object of the item type.

Spider file

import scrapy

from wangziPro.items import WangziproItem


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xx.com']
    start_urls = ["https://duanzixing.com/"]


    # Pipeline-based persistent storage
    def parse(self, response):
        # Parse the joke title and the joke content
        article_list = response.xpath('/html/body/section/div/div/article')
        for article in article_list:
            # extract_first() returns the data value of the first Selector in the list (commonly used)
            title = article.xpath('./header/h2/a/text()').extract_first()
            content = article.xpath('./p[2]/text()').extract_first()
            # Instantiate an item object and store the parsed data in it
            item = WangziproItem()
            # Fields cannot be set with dot notation; use item['field'] instead
            item['title'] = title
            item['content'] = content
            yield item

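4. Submit the item object to the pipeline. This is the yield item at the end of the loop in parse.

import scrapy

from wangziPro.items import WangziproItem


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xx.com']
    start_urls = ["https://duanzixing.com/"]

    # Pipeline-based persistent storage
    def parse(self, response):
        for article in response.xpath('/html/body/section/div/div/article'):
            item = WangziproItem()
            item['title'] = article.xpath('./header/h2/a/text()').extract_first()
            item['content'] = article.xpath('./p[2]/text()').extract_first()
            # Submit the item object to the pipeline
            yield item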

5. In the pipeline file (pipelines.py), receive the item objects submitted by the spider file and persist them in whatever form you need.

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class WangziproPipeline:
    # Two optional hooks; Scrapy calls them if they are defined (the class has no parent class)
    def open_spider(self, spider):
        print("open_spider(): I run only once, when the spider starts!")
        self.fp = open("duanzi.txt", "w", encoding="utf-8")

    def close_spider(self, spider):
        print("close_spider(): I run only once, when the spider finishes")
        self.fp.close()

    # Receives item objects one at a time, so this method is called once per item
    # item: the item object that was received
    def process_item(self, item, spider):
        # print(item)  # item is a dict-like Item object
        # Write the item to a local file
        self.fp.write(item['title'] + ":" + item['content'] + "\n")
        return item
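
The call order of the three methods can be tried out without running a crawl. The sketch below is a simplified stand-in for what the engine does with one pipeline; run_once, FilePipeline and the sample items are hypothetical names, and the output path is a temporary file instead of duanzi.txt.

```python
import os
import tempfile


class FilePipeline:
    """Same shape as WangziproPipeline, with the output path as a parameter."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(item["title"] + ":" + item["content"] + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()


def run_once(pipeline, items, spider=None):
    """Open once, process every item, close once (a simplified model of the engine)."""
    pipeline.open_spider(spider)
    try:
        for item in items:
            pipeline.process_item(item, spider)
    finally:
        pipeline.close_spider(spider)


path = os.path.join(tempfile.mkdtemp(), "duanzi.txt")
run_once(FilePipeline(path), [
    {"title": "t1", "content": "c1"},
    {"title": "t2", "content": "c2"},
])

with open(path, encoding="utf-8") as fp:
    written = fp.read()
print(written)
```

Opening the file in open_spider rather than in process_item means the file is opened and truncated once per crawl instead of once per item.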

6. Enable the pipeline in the settings file.

settings.py

# Enable the pipeline
# 300 is the priority of the pipeline class; the smaller the number, the higher the priority
ITEM_PIPELINES = {
   'wangziPro.pipelines.WangziproPipeline': 300,
}

Run command: scrapy crawl duanzi

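Pipeline-based data backup

Requirement: store the scraped data in several different storage backends.

Implementation: store the scraped data in both MySQL and Redis.

What set of operations does one pipeline class in the pipeline file represent?

One pipeline class corresponds to one form of persistent storage. To store the data in different backends, define one pipeline class per backend.

Three pipeline classes are defined here, writing the data to three backends: a local file, MySQL and Redis.

Is the item from the spider file submitted to each of the three pipeline classes in turn?

No. The item from the spider file is submitted only to the pipeline class with the highest priority.

That class must return item from process_item, which passes the item on to the next pipeline class to be executed. Each later pipeline class should return the item as well so that the one after it receives it.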
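
The hand-off can be modelled in a few lines of plain Python. This is a simplified sketch of the behaviour described above, not Scrapy's actual implementation; RecordingPipeline and run_chain are hypothetical names, and the middle pipeline deliberately forgets to return the item.

```python
class RecordingPipeline:
    """Hypothetical pipeline that records what it receives."""

    def __init__(self, name, log, forward=True):
        self.name = name
        self.log = log
        self.forward = forward

    def process_item(self, item, spider):
        self.log.append((self.name, item))
        # Returning the item is what hands it to the next pipeline.
        return item if self.forward else None


def run_chain(pipelines_with_priority, item, spider=None):
    """Call pipelines in ascending priority order, feeding each the previous return value."""
    ordered = sorted(pipelines_with_priority, key=lambda pair: pair[1])
    for pipeline, _priority in ordered:
        item = pipeline.process_item(item, spider)
    return item


log = []
chain = [
    (RecordingPipeline("redis", log), 302),
    (RecordingPipeline("file", log), 300),
    (RecordingPipeline("mysql", log, forward=False), 301),  # forgets to return item
]
run_chain(chain, {"title": "t"})

order = [name for name, _ in log]
received_by_redis = log[-1][1]
print(order, received_by_redis)
```

In this model the pipeline with the smallest number runs first regardless of the order in which the classes are listed, and a pipeline that does not return the item leaves the next one with None to work on.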

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

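import json

import pymysql
from redis import Redis


class WangziproPipeline:
    # Two optional hooks; Scrapy calls them if they are defined (the class has no parent class)
    def open_spider(self, spider):
        print("open_spider(): I run only once, when the spider starts!")
        self.fp = open("duanzi.txt", "w", encoding="utf-8")

    def close_spider(self, spider):
        print("close_spider(): I run only once, when the spider finishes")
        self.fp.close()

    # Receives item objects one at a time, so this method is called once per item
    # item: the item object that was received
    def process_item(self, item, spider):
        # print(item)  # item is a dict-like Item object
        # Write the item to a local file
        self.fp.write(item['title'] + ":" + item['content'] + "\n")
        return item  # passed on to MysqlPileLine


# Store the data in MySQL
class MysqlPileLine:
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(
            host="127.0.0.1",
            port=3306,
            user="tian",
            password="luff",
            db="spider",
            charset="utf8"
        )
        # Create the cursor once so that close_spider can always close it
        self.cursor = self.conn.cursor()
        print(self.conn)

    def process_item(self, item, spider):
        # Placeholders let the driver escape the values, so quotes in the text cannot break the SQL
        sql = 'insert into duanzhiwang values (%s, %s)'
        # Run the SQL inside a transaction
        try:
            self.cursor.execute(sql, (item['title'], item['content']))
            self.conn.commit()  # commit the transaction
        except Exception as er:
            print(er)
            self.conn.rollback()  # roll back the transaction on error
        return item  # passed on to RedisPileLine

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


# Write the data to Redis
class RedisPileLine:
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(
            host="127.0.0.1",
            port=6379,
            password=123456,
            encoding="utf-8"
        )

    def process_item(self, item, spider):
        # Pushing the item object itself raises an error with newer redis modules
        # (the original workaround was to pin the redis module to 2.10.6),
        # so serialize the item to a JSON string before pushing it onto the Redis list
        data = json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False)
        self.conn.lpush("duanziData", data)
        return item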

settings.py

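# Enable the pipelines
# The number is the priority of the pipeline class; the smaller the number, the higher the priority
# A pipeline class with higher priority is executed earlier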
ITEM_PIPELINES = {
   'wangziPro.pipelines.WangziproPipeline': 300,
   'wangziPro.pipelines.MysqlPileLine': 301,
   'wangziPro.pipelines.RedisPileLine': 302,
}
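
For the MySQL step, passing the values as query parameters instead of formatting them into the SQL string keeps quotes in the scraped text from breaking the statement. Below is a minimal sketch of that pattern together with commit and rollback. It uses sqlite3 from the standard library as a stand-in for MySQL so that it runs without a server, which means the placeholder is ? rather than the %s used by pymysql. The table name duanzi, the save_items helper and the sample rows are assumptions made for illustration.

```python
import sqlite3


def save_items(conn, items):
    """Insert (title, content) rows with placeholders; roll back a row that fails."""
    cursor = conn.cursor()
    saved = 0
    for item in items:
        try:
            cursor.execute(
                "INSERT INTO duanzi (title, content) VALUES (?, ?)",
                (item["title"], item["content"]),
            )
            conn.commit()
            saved += 1
        except sqlite3.Error as err:
            print(err)
            conn.rollback()
    cursor.close()
    return saved


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE duanzi (title TEXT NOT NULL, content TEXT NOT NULL)")

items = [
    {"title": 'He said "hi"', "content": "quotes stay intact"},
    {"title": "no content", "content": None},  # violates NOT NULL and is rolled back
]
saved = save_items(conn, items)
rows = conn.execute("SELECT title, content FROM duanzi").fetchall()
conn.close()
print(saved, rows)
```

The first row contains double quotes that would have ended the string literal early under "%s"-style formatting, yet it is stored unchanged; the second row fails and is rolled back without stopping the loop.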

posted on 2021-07-20 21:08 赛兔子