Persistent Storage in Scrapy

Posted on 2019-03-22 15:25 by TigerAt

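[Categories]

1. Persistent storage via terminal commands

2. Persistent storage via item pipelines

[Implementation]

1. Persistent storage via terminal commands

1) Make sure the spider's parse method returns an iterable object (typically a list of dicts).

2) Use a terminal command to write the data to a file on disk:

scrapy crawl <spider name> -o <output file>.<extension>

scrapy crawl <spider name> -o xxx.json
scrapy crawl <spider name> -o xxx.xml
scrapy crawl <spider name> -o xxx.csv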

Example:

    def parse(self, response):
        # List that collects the author and text of every joke
        data_list = []
        content_list = response.xpath('//div[@id="content-left"]/div')
        for content in content_list:
            # 1. xpath() wraps the matched content in Selector objects.
            # 2. extract() pulls the data out of the Selector objects as a list of strings.
            # 3. extract_first() returns the same value as extract()[0], except that it
            #    returns None instead of raising IndexError when nothing matches.
            author = content.xpath('./div/a[2]/h2/text()').extract_first()
            content_detail = content.xpath('.//div[@class="content"]/span/text()').extract()[0]
            # Dict holding the author and text of one joke
            content_dict = {
                "author": author,
                "content": content_detail,
            }
            # Append the dict to the list
            data_list.append(content_dict)
            # print(author + ":" + content_detail + "\n\n\n")
        # parse must return an iterable holding the parsed page content
        # print(data_list)
        print("Content collected")
        return data_list

Storage via terminal command
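As a rough illustration of what the -o option does with the list returned by parse, the sketch below serializes a list of dicts to JSON and CSV using only the standard library. It is not Scrapy's actual exporter code; the function names export_json and export_csv and the sample data are made up for this example.

```python
import csv
import io
import json


def export_json(rows):
    """Serialize a list of dicts to a JSON array string (non-ASCII kept readable)."""
    return json.dumps(rows, ensure_ascii=False)


def export_csv(rows):
    """Serialize a list of dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample shaped like the list built in parse()
sample = [
    {"author": "Alice", "content": "first joke"},
    {"author": "Bob", "content": "second joke"},
]
json_text = export_json(sample)
csv_text = export_csv(sample)
```

Each dict becomes one JSON object or one CSV row, which is why parse has to return something that can be iterated item by item.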

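Note: if the exported file shows garbled characters when you open it, open the CSV in Notepad++ and change its encoding; that resolves the problem.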
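One common cause of garbled CSV files is that spreadsheet programs such as Excel do not assume UTF-8 unless the file starts with a byte order mark (BOM). Assuming the exported file is UTF-8 encoded, the following sketch re-saves it with a BOM from Python instead of an editor. The function name add_utf8_bom and the file paths are hypothetical.

```python
def add_utf8_bom(src_path, dst_path):
    """Copy a UTF-8 text file to dst_path so that it starts with exactly one BOM."""
    # "utf-8-sig" strips an existing BOM on read, if there is one
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
        text = src.read()
    # "utf-8-sig" writes a BOM at the start of the file on write
    with open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        dst.write(text)
```

For example, add_utf8_bom("xxx.csv", "xxx_bom.csv") would leave the original export untouched and write a copy that spreadsheet programs should detect as UTF-8.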

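2. Persistent storage via item pipelines

    items.py: the data structure template file. It defines the data fields.
    pipelines.py: the pipeline file. It receives the data (items) and persists it.

    Workflow:
    1. After the spider has scraped the data, wrap the data in an item object.
    2. Use the yield keyword to submit the item object to the pipeline for persistence.
    3. The process_item method in the pipeline file receives each item object submitted by the spider; write the persistence code there to store the data held in the item.
    4. Enable the pipeline in the settings.py configuration file.

    To store the data in several places, such as a local file and a database, proceed as follows (see the MySQL-based storage section for details):
    1. Write one pipeline class per storage target in the pipeline file.
    2. Register each custom pipeline class in the settings file so that it takes effect.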
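The workflow above can be pictured with a small simulation that needs no Scrapy installation. The run_pipeline helper below is a made-up stand-in for the Scrapy engine, and CollectPipeline is a hypothetical pipeline; only the method names open_spider, process_item, and close_spider come from Scrapy's pipeline interface.

```python
class CollectPipeline:
    """Hypothetical pipeline that records the order of the calls it receives."""

    def open_spider(self, spider):
        self.events = ["open"]              # called once, before any item

    def process_item(self, item, spider):
        self.events.append(item["author"])  # called once per submitted item
        return item                         # hand the item on to the next pipeline

    def close_spider(self, spider):
        self.events.append("close")         # called once, after the last item


def run_pipeline(pipeline, items, spider=None):
    """Made-up driver that calls the pipeline hooks in the order described above."""
    pipeline.open_spider(spider)
    results = [pipeline.process_item(item, spider) for item in items]
    pipeline.close_spider(spider)
    return results


pipeline = CollectPipeline()
run_pipeline(pipeline, [{"author": "Alice"}, {"author": "Bob"}])
print(pipeline.events)  # ['open', 'Alice', 'Bob', 'close']
```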

a. Storing to a local file

Define the fields that the item object will hold

import scrapy


class QiushibaikeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content_detail = scrapy.Field()

items.py with the fields defined

Store the data in an item object and hand it to the pipeline

# -*- coding: utf-8 -*-
import scrapy
from qiushibaike.items import QiushibaikeItem


class QiushiTextSpider(scrapy.Spider):
    # Spider name: used on the command line to select which spider to run
    name = 'qiushi_text'
    # Allowed domains: only pages under these domains are crawled. Links followed
    # during the crawl may lead to URLs outside the domain, so unless you have a
    # specific need this line can stay commented out. Entries should be bare
    # domain names, not full URLs.
    # allowed_domains = ['www.qiushibaike.com']
    # Start URLs: the pages this project begins crawling from
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        """Parse the requested page and extract the content we want.

        response: the response object returned after the request built from
        a start URL succeeds.
        Notes:
        1. parse must return an iterable (commonly a list of dicts or a
           generator of items) or None.
        2. XPath is recommended for parsing; the framework has a built-in
           XPath interface.
        """
        content_list = response.xpath('//div[@id="content-left"]/div')
        for content in content_list:
            # xpath() wraps the matches in Selector objects; extract() returns their
            # data as a list of strings; extract_first() returns the first string.
            author = content.xpath('./div/a[2]/h2/text()').extract_first()
            content_detail = content.xpath('.//div[@class="content"]/span/text()').extract()[0]
            # Store the parsed values in an item object
            item = QiushibaikeItem()
            item['author'] = author
            item['content_detail'] = content_detail

            # Submit the item object to the pipeline
            yield item

qiuke_text.py

Note: the key steps are:

1) Import the item class: from qiushibaike.items import QiushibaikeItem

2) Store the data in an item object:

item = QiushibaikeItem()
item['author'] = author
item['content_detail'] = content_detail

3) Hand the item object to the pipeline: yield item

Write the storage code in the pipelines.py pipeline file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class QiushibaikePipeline(object):
    def process_item(self, item, spider):
        '''
        Receives each item object submitted by the spider and persists the
        page data stored in it.
        :param item: the item object that was received
        This method runs once for every item the spider submits to the pipeline.
        '''
        return item

Initial contents of pipelines.py

Add the persistence code

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class QiushibaikePipeline(object):
    fp = None

    # Called only once during the whole crawl, when the spider starts
    def open_spider(self, spider):
        print("Spider started")
        self.fp = open('./qiushi_text.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        '''
        Receives each item object submitted by the spider and persists the
        page data stored in it.
        :param item: the item object that was received
        This method runs once for every item the spider submits to the pipeline.
        '''
        # Read the values stored in the item
        author = item['author']
        content = item['content_detail']
        # Persist them
        self.fp.write(author + ":" + content + "\n\n\n")
        return item

    # Called only once, when the spider finishes
    def close_spider(self, spider):
        print("Spider finished")
        self.fp.close()

pipelines.py with persistence code

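Note: pay attention to how def open_spider(self, spider) and def close_spider(self, spider) are used. Each runs exactly once per crawl, so the file is opened once at the start and closed once at the end instead of being reopened for every item.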
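The same open_spider / close_spider pattern works for other file formats. As a sketch, the hypothetical JsonLinesPipeline below writes each item as one JSON object per line. The class name and the default output file name items.jl are assumptions and not part of the original project.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes one JSON object per line."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Open the output file once, when the crawl starts
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) accepts plain dicts as well as scrapy.Item objects
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Close the file once, when the crawl ends
        self.fp.close()
```

One object per line keeps the file easy to append to and to read back line by line, which a single large JSON array does not allow.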

Enable the pipeline in settings.py

The pipeline setting is commented out by default; just uncomment it.

ITEM_PIPELINES = {
    'qiushibaike.pipelines.QiushibaikePipeline': 300,
}

Run

scrapy crawl qiushi_text --nolog

b. MySQL-based storage
Add a new class to the existing pipelines.py; its methods must take the same parameters as those of the original class.

import pymysql


class QiushibaikeMysql(object):
    conn = None
    cursor = None

    # Called only once during the whole crawl, when the spider starts
    def open_spider(self, spider):
        print("MySQL pipeline: spider started")
        # Connect to the database and create one cursor for the whole crawl
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='hz123456', db='replite')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized SQL: the driver escapes the values, so quotes in the
        # scraped text cannot break or alter the statement
        sql = 'insert into qiushi_text (author,content) values(%s,%s)'
        try:
            self.cursor.execute(sql, (item['author'], item['content_detail']))
            # Commit the transaction
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print("MySQL pipeline: spider finished")
        self.cursor.close()
        self.conn.close()

MySQL persistent storage
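Building an SQL string with % formatting breaks as soon as a joke contains a double quote, and it leaves the statement open to SQL injection. The sketch below shows the parameterized alternative. To stay runnable without a MySQL server it uses the standard library's sqlite3 module as a stand-in, which uses ? placeholders where pymysql uses %s. The table layout mirrors qiushi_text and the sample row is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL database
conn.execute("create table qiushi_text (author text, content text)")

# Invented sample whose content contains quotes that would break string formatting
item = {"author": "Alice", "content_detail": 'He said "hi" and left'}

# The values travel separately from the SQL text, so the driver handles quoting
conn.execute(
    "insert into qiushi_text (author, content) values (?, ?)",
    (item["author"], item["content_detail"]),
)
conn.commit()

rows = conn.execute("select author, content from qiushi_text").fetchall()
print(rows)
conn.close()
```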

Configure the pipelines in settings.py

ITEM_PIPELINES = {
    'qiushibaike.pipelines.QiushibaikePipeline': 300,  # existing entry
    'qiushibaike.pipelines.QiushibaikeMysql': 200,  # new MySQL entry
}
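The numbers in ITEM_PIPELINES set the order in which items pass through the pipelines: lower values run first. A quick way to see the resulting order for the configuration above:

```python
ITEM_PIPELINES = {
    'qiushibaike.pipelines.QiushibaikePipeline': 300,
    'qiushibaike.pipelines.QiushibaikeMysql': 200,
}

# Sort the class paths by their priority value, lowest first
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order)
```

Here the MySQL pipeline (200) receives each item first. Because its process_item returns the item, the file pipeline (300) receives it next.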

Run

scrapy crawl qiushi_text --nolog

Supplement: basic MySQL operations:

Create a database: create database replite charset utf8;
Drop a database: drop database replite;
List existing databases: show databases;
Switch to a database: use replite;
Create a table: create table qiushi_text(author CHAR(255),content TEXT);
Drop a table: drop table qiushi_text;
List existing tables: show tables;

Change a table's character set: ALTER TABLE qiushi_text CONVERT TO CHARACTER SET utf8mb4; (If you get the error (1366, "Incorrect string value: '\\xF0\\x9F\\x8C\\x82\\xE7\\xA9...' for column 'content' at row 1"), it is an encoding problem: the text contains 4-byte characters such as emoji, which MySQL's utf8 character set cannot store. Converting the table to utf8mb4 fixes it.)
Query the table: select * from qiushi_text;
Insert a row: insert into qiushi_text (author,content) values("some content","some content")
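c. Redis-based storage

Add a new class to the existing pipelines.py; its methods must take the same parameters as those of the original class.

import json

import redis


class QiushibaikeRedis(object):
    conn = None

    # Called only once during the whole crawl, when the spider starts
    def open_spider(self, spider):
        print("Redis pipeline: spider started")
        # Connect to the database
        self.conn = redis.Redis(host='127.0.0.1', port=6379, decode_responses=True)

    def process_item(self, item, spider):
        dict_content = {
            'author': item['author'],
            'content': item['content_detail']
        }
        # Write the data to Redis. redis-py 3.0+ accepts only bytes, str, int,
        # or float values, so the dict is serialized to a JSON string first.
        self.conn.lpush('data', json.dumps(dict_content, ensure_ascii=False))
        return item

    def close_spider(self, spider):
        print("Redis pipeline: spider finished")

Redis storage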

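Note: with redis-py 3.0 or later, passing a dict directly to lpush raises this error: redis.exceptions.DataError: Invalid input of type: 'dict'. Convert to a byte, string or number first.

Fix: the error comes from an update to the Python redis package, which changed the input types its commands accept. Either serialize the dict to a string before writing it, or install the older version with pip install redis==2.10.6.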
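If you would rather keep a current redis package, one option is to serialize the dict to a JSON string before calling lpush and to decode it again when reading. A minimal sketch of the round trip, using only the standard library (the sample data is invented):

```python
import json

dict_content = {"author": "Alice", "content": "a short joke"}

# A str is one of the value types redis-py accepts
payload = json.dumps(dict_content, ensure_ascii=False)

# After reading the value back (for example with lrange), decode it into a dict
restored = json.loads(payload)
```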

ITEM_PIPELINES = {
    'qiushibaike.pipelines.QiushibaikePipeline': 300,
    'qiushibaike.pipelines.QiushibaikeMysql': 200,
    'qiushibaike.pipelines.QiushibaikeRedis': 400,
}

Run

scrapy crawl qiushi_text --nolog

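Supplementary notes:

Installing and using Redis:

[About Redis]
Redis is a high-performance, in-memory key-value database.
[Downloading Redis]
Download location:
GitHub download page: https://github.com/MicrosoftArchive/redis/releases

[Installing Redis]

Unzip the downloaded archive and it is ready to use.

[File listing]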

Redis-x64-3.2.100/
dump.rdb
EventLog.dll
Redis on Windows Release Notes.docx
Redis on Windows.docx
redis-benchmark.exe
redis-benchmark.pdb
redis-check-aof.exe
redis-check-aof.pdb
redis-cli.exe #launches the Redis client
redis-cli.pdb
redis-server.exe #launches the Redis server
redis-server.pdb
redis.windows-service.conf
redis.windows.conf #configuration settings for the Redis server
Windows Service Documentation.docx

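[Basic operations]

Start the server:

From the command line, change into the Redis-x64-3.2.100 directory and run:

redis-server.exe redis.windows.conf

Note: this command starts redis-server with the settings in redis.windows.conf. The redis.windows.conf argument is optional; if you omit it, the server starts with its default configuration.

Start the client:

redis-cli.exe --raw

Note: --raw makes the client display stored values such as Chinese text as-is; without it, Chinese characters are shown as hexadecimal escape sequences. If the encoding of the command-line window does not match, the output may still look garbled, which is normal.

Get the contents of a list:

lrange data 0 -1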
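The same list can be read from Python. The sketch below assumes that each entry in the data list was stored as a JSON string. load_items is a hypothetical helper, and conn is expected to be a redis.Redis client (any object with a compatible lrange method works).

```python
import json


def load_items(conn, key="data"):
    """Return every entry of the Redis list `key`, decoded from JSON."""
    # lrange(key, 0, -1) returns the whole list, like the CLI command above
    return [json.loads(raw) for raw in conn.lrange(key, 0, -1)]
```

With a running server this could be called as load_items(redis.Redis(host='127.0.0.1', port=6379, decode_responses=True)).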

For other operations, and for downloading and installing on Linux or macOS, see the Redis websites: http://www.redis.cn/ (Chinese site) or https://redis.io/ (official site).

Note: the content above is for study and reference only; please do not use it for other purposes.
