[Python Crawler]: Scrapy Data Persistence

To persist the data we scrape, Scrapy offers two approaches:

1. Persistence via a terminal command

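Requirement: only the return value of the parse method can be stored this way.
Note: the output file type is limited to formats such as csv, json, and xml; txt and excel are not supported.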

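Usage:

scrapy crawl xxx -o xxx.csv

Here the first xxx is the spider's name (the name attribute defined in the spider file) and xxx.csv is the name of the output file.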

Pros: concise, efficient, and convenient.
Cons: fairly limited, since the data can only be saved in one of the supported formats.
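To make the shape of the exported file concrete, here is a rough standard-library sketch of how a list of dicts returned by parse maps onto CSV rows. It is only an approximation for illustration, not Scrapy's own exporter, and the records and the export_csv helper are hypothetical.

```python
import csv
import io

# Hypothetical records, shaped like what parse() might return:
# a list of dicts, one dict per scraped post.
records = [
    {"author": "alice", "content": "first post"},
    {"author": "bob", "content": "second post"},
]


def export_csv(rows):
    """Serialize a list of dicts to CSV text; the header comes from the first row's keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = export_csv(records)
print(csv_text)
```

Each dict becomes one row and each key becomes one column, which is why this approach works only with structured formats.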

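2. Pipeline-based persistence:

The pipeline persistence workflow:

Coding steps:

1. Parse the data

2. In the item class, define the fields for the data we want to store

3. Wrap the parsed data in an Item object

4. Hand the Item object to the pipeline, which performs the persistence

5. Write the persistence code in the process_item method of the pipeline class

6. Enable the pipeline in the settings file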

Parsing and crawling the data:

import scrapy
from firstBlood.items import FirstbloodItem


class FirstSpider(scrapy.Spider):
    # Spider name: a unique identifier for this spider
    name = 'first'
    # Allowed domains: restricts which URLs in start_urls may be requested
    # allowed_domains = ['www.baidu.com']
    # Start URL list: Scrapy automatically sends requests to the URLs listed here
    start_urls = ['https://www.qiushibaike.com/text/']

    # Parses the data; response is the response object of a successful request
    def parse(self, response):
        # This method is called once for each URL in start_urls
        authors = response.xpath('//div[@id="content"]//h2/text()').extract()
        contents = response.xpath('//div[@id="content"]//div[@class="content"]/span/text()').extract()
        # An item holds one record at a time; it cannot take all the data at once,
        # so this example stores only the first author and the first content
        item = FirstbloodItem()
        for author in authors:
            print(author)
            item['author'] = author
            break
        for content in contents:
            print(content)
            item['content'] = content
            break

        yield item

Here yield submits the resulting item to the pipeline once.
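The spider above keeps only the first author and the first content. If one record per post is wanted, the two extracted lists could be paired up, as in the sketch below. It assumes the two lists line up one-to-one, which the page structure may not guarantee; iter_posts and the sample values are hypothetical stand-ins for the two extract() results.

```python
def iter_posts(authors, contents):
    """Yield one record per post by pairing the two extracted lists."""
    # zip stops at the shorter list, so an unmatched trailing entry is dropped.
    for author, content in zip(authors, contents):
        yield {"author": author.strip(), "content": content.strip()}


# Hypothetical extracted values, including the surrounding whitespace
# that text() nodes often carry.
authors = ["\nalice\n", "\nbob\n"]
contents = ["\nfirst post\n", "\nsecond post\n"]

posts = list(iter_posts(authors, contents))
print(posts)
```

Inside parse, each pair would be copied into a fresh FirstbloodItem and yielded, so that process_item is called once per post.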

So we first write items.py, declaring the fields we want to store. The items.py file looks like this:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class FirstbloodItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()
    content = scrapy.Field()

In pipelines.py, override the open_spider and close_spider methods, and in process_item pull the data out of the item and write it to a txt file. The code:

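class FirstbloodPipeline:
    fp = None

    # Called once when the spider starts
    def open_spider(self, spider):
        print("Spider started")
        self.fp = open('./result.txt', 'w', encoding='utf-8')

    # Called once for every item received
    def process_item(self, item, spider):
        # Pull the data out of the item
        author = item['author']
        content = item['content']

        self.fp.write(author + ": " + content + "\n")

        # Return the item so that any later pipeline can process it too
        return item

    # Called once when the spider finishes
    def close_spider(self, spider):
        print("Spider finished!")
        self.fp.close()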
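The order in which these three hooks run can be tried out without starting a crawl. The sketch below drives a plain-Python stand-in for the pipeline by hand: open_spider once, process_item once per item, close_spider once. FilePipeline, its path argument, and the sample items are assumptions made for this demo; in a real project Scrapy makes these calls itself.

```python
import os
import tempfile


class FilePipeline:
    """Plain-Python stand-in for FirstbloodPipeline (the path argument is added for the demo)."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Open the output file once, before any item arrives.
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Write one line per item, then pass the item along.
        self.fp.write(item["author"] + ": " + item["content"] + "\n")
        return item

    def close_spider(self, spider):
        # Close the file once, after the last item.
        self.fp.close()


path = os.path.join(tempfile.mkdtemp(), "result.txt")
pipeline = FilePipeline(path)

pipeline.open_spider(spider=None)
for item in [
    {"author": "alice", "content": "first post"},
    {"author": "bob", "content": "second post"},
]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Opening the file in open_spider rather than in process_item means the file is opened only once, however many items pass through.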

Enable the pipeline in the settings file, using the default priority of 300. The smaller the value, the higher the priority:

ITEM_PIPELINES = {
    # 300 is the priority; the smaller the value, the higher the priority
    'firstBlood.pipelines.FirstbloodPipeline': 300,
}
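When several pipelines are enabled, items pass through them from the lowest value to the highest. The small sketch below shows that ordering by sorting the dict. The CleanupPipeline entry is a made-up name for illustration, and the sort only mimics the ordering rule rather than reproducing Scrapy's internals.

```python
# Hypothetical settings: CleanupPipeline does not exist in the project above.
ITEM_PIPELINES = {
    "firstBlood.pipelines.FirstbloodPipeline": 300,
    "firstBlood.pipelines.CleanupPipeline": 100,
}

# Lower numbers run first, so sort the pipeline paths by their value.
run_order = [
    name for name, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```

With these values an item would reach CleanupPipeline first and FirstbloodPipeline second, which is why process_item should return the item.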

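Final result: after the crawl finishes, ./result.txt contains a line of the form author: content for the item that was yielded.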

posted @ 2021-02-08 07:08 Geeksongs
