Python Web Scraping: Persistent Storage in Scrapy

Scrapy offers two ways to persist scraped data: through a terminal command, or through item pipelines.

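Terminal command

Limitations:

Only what the parse method returns can be stored, and only to a local file.
The file format must be one of: json, jsonlines, jl, csv, xml, marshal, pickle.

scrapy crawl <spider name> -o <output path>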
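The same export can also be declared in the project settings instead of on the command line. The fragment below is a minimal sketch, assuming Scrapy 2.1 or newer (where the FEEDS setting is available); the output path gushi.csv is a made-up example, not something from the original project.

```python
# settings.py (fragment) -- assumes Scrapy 2.1+, where the FEEDS setting exists
FEEDS = {
    # hypothetical output path; each key names a file to write
    "gushi.csv": {
        "format": "csv",     # one of the formats listed above
        "encoding": "utf8",  # write the file as UTF-8 text
    },
}
```

With a setting like this in place, running scrapy crawl with just the spider name should write the file without needing the -o option.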

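Pipelines

Workflow:

Parse the data.
Define the fields to be stored in the item class.
Store the parsed data in an item object.
Hand the item object to the pipeline for persistent storage.
In the pipeline class, process_item receives each item object and writes the data out.
Enable the pipeline in the settings.py configuration file.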

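The item object

The project contains a file named items.py that holds a single class; we instantiate items from this class.
The class is empty at first, so we have to define its fields ourselves.
Suppose the data we want to store is an author and a piece of text; the item then needs a matching field for each.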

import scrapy


class StudyScrapy02Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

The spider file changes accordingly.
The IDE may flag the import as unresolved (underlined in red), but the code still runs.

import scrapy
from study_scrapy02.items import StudyScrapy02Item


class GushiSpider(scrapy.Spider):
    name = 'gushi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://so.gushiwen.cn/mingjus/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="html"]/body/div[2]/div[1]/div[2]/div')
        for div in div_list:
            # extract() pulls the stored data out of a Selector object as a string
            content = div.xpath('./a[1]/text()')[0].extract()
            author = div.xpath('./a[2]/text()')[0].extract()
            # instantiate the item
            item = StudyScrapy02Item()
            # put the data into the item
            item['author'] = author
            item['content'] = content
            # hand the item over to the pipeline
            yield item

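The pipeline

The project also contains a Python file named pipelines.py. It too holds a class, which defines a process_item method.
We use this method to process item objects.
If the data needs to go to more than one destination, create one pipeline class per destination.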

class StudyScrapy02Pipeline:
    fp = None

    # called only once, when the spider starts
    def open_spider(self, spider):
        print('spider started ====================')
        self.fp = open('./gushi.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(content+' ——  '+author+'\n')
        # returning the item passes it on to the next pipeline class, if any
        return item

    # called only once, when the spider finishes
    def close_spider(self, spider):
        self.fp.close()
        print('spider finished ====================')
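To illustrate the point about creating several pipeline classes, here is a minimal sketch of a second one that could sit next to the class above in pipelines.py. The class name JsonLinesPipeline and the output file gushi.jl are hypothetical, chosen only for this example. It uses the same three hooks and writes each item as one JSON object per line.

```python
import json


class JsonLinesPipeline:
    """Hypothetical second pipeline: one JSON object per line."""

    fp = None

    def open_spider(self, spider):
        # called once when the spider starts
        self.fp = open('./gushi.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and for plain dicts
        line = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(line + '\n')
        # return the item so that any later pipeline still receives it
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.fp.close()
```

A class like this only takes effect once it is also listed in ITEM_PIPELINES in the settings file.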

The pipeline also has to be enabled in the settings.py configuration file.

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 300 is the priority; the smaller the number, the higher the priority
    'study_scrapy02.pipelines.StudyScrapy02Pipeline': 300,
}
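When more than one pipeline class is enabled, the priority numbers decide the order in which each item passes through them. The snippet below is a rough sketch of that ordering in plain Python; the second entry, JsonLinesPipeline, is a hypothetical class added only to have two entries to compare.

```python
# Hypothetical settings with two pipeline classes enabled
ITEM_PIPELINES = {
    'study_scrapy02.pipelines.JsonLinesPipeline': 400,  # assumed class
    'study_scrapy02.pipelines.StudyScrapy02Pipeline': 300,
}

# Items go through the pipelines in ascending order of the number,
# so the entry with 300 runs before the entry with 400.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

This is also why each process_item should return the item: whatever the pipeline with the smaller number returns is what the next one receives.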

posted on 2022-03-23 20:56 S++
