The Scrapy Framework

Part 1: Scrapy Introduction, Installation, and Basic Usage

-What is a framework?

  -A project template that integrates many features and is highly general-purpose.

 

-How do you learn a framework?

  -Focus on learning the detailed usage of each feature the framework encapsulates.

 

-What is Scrapy?

  -A star framework built for web crawling. Features: high-performance persistent storage, asynchronous data downloading, high-performance data parsing, and distributed crawling.

 

-Basic usage of the Scrapy framework

  -Installing the environment:

    -macOS or Linux: pip install scrapy

    -Windows:

      -Download Twisted from:

        http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

      -Install Twisted:

        pip install Twisted-17.1.0-cp36-cp36m-win_amd64.whl

      -pip install pywin32

      -pip install scrapy

      Test: type the scrapy command in a terminal; if no error is reported, the installation succeeded.

 

-Create a project: scrapy startproject xxxPro

-cd xxxPro

-Create a spider file in the spiders subdirectory (a sketch of the generated file follows this list):

  -scrapy genspider spiderName www.xxx.com

-Run the project:

  -scrapy crawl spiderName
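For reference, the spider file produced by scrapy genspider looks roughly like this; the class name and attribute values are derived from the spiderName and www.xxx.com arguments, so this is a sketch of the template rather than project-specific code:

import scrapy


class SpidernameSpider(scrapy.Spider):
    name = 'spiderName'                   # unique name used by scrapy crawl
    allowed_domains = ['www.xxx.com']     # optional filter on the crawled domains
    start_urls = ['http://www.xxx.com/']  # requests sent automatically at startup

    def parse(self, response):
        # called with the response of each URL in start_urls
        pass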

 

Modifying Scrapy's configuration file:


 

After creating a Scrapy spider file in the spiders folder, the necessary settings in the configuration file are as follows (UA spoofing):
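A sketch of the relevant lines in settings.py (the same settings appear in the full settings.py further below): set a browser User-Agent, ignore robots.txt, and show only ERROR-level logs.

# settings.py (excerpt)

# UA spoofing: identify as a regular browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False

# Only print ERROR-level log output
LOG_LEVEL = 'ERROR'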

 

Part 2: Data Parsing with Scrapy

-The basic directory structure of a newly created Scrapy project is shown in the sketch below:
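A sketch of the layout produced by scrapy startproject xxxPro (the role of each file is noted as a comment):

xxxPro/
├── scrapy.cfg            # deployment configuration
└── xxxPro/
    ├── __init__.py
    ├── items.py          # item field definitions
    ├── middlewares.py    # spider / downloader middlewares
    ├── pipelines.py      # item pipelines (persistent storage)
    ├── settings.py       # project configuration
    └── spiders/          # spider files are created here
        └── __init__.py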

 

 qiubai.py

import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Parse: the author's name + the joke's content
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        # print(len(div_list))
        num = 0
        for div in div_list:
            # xpath returns a list, and the list elements are always Selector objects
            # extract pulls out the string stored in the Selector object's data attribute
            author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # Calling extract on the list itself extracts the data string from every Selector in the list
            content = div.xpath('./a[1]/div/span//text()').extract()
            # Join the list into a single string
            content = ''.join(content)
            print("++++++++++++++++++++++++++++++++++++")
            print(num, author, content)
            print("++++++++++++++++++++++++++++++++++++")
            num += 1

Note:

 

 

 

-extract_first() is meant for the case where the list you operate on has exactly one element (it extracts the data of that single Selector).

Summary:

UA spoofing, the robots protocol, and log-level settings are all configured in settings.py, the project's configuration file;

Scrapy spiders must be created under the spiders subdirectory;

The project is run with: scrapy crawl spiderName;

Calling xpath on the response object of a Scrapy spider does not return etree objects but Selector objects; to get the data they hold, simply call extract on the Selector object (see the sketch below);
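A minimal sketch of extract versus extract_first; the //h2/text() expression here is only an illustrative example:

# inside a parse method (or the scrapy shell)
selector_list = response.xpath('//h2/text()')  # a list of Selector objects

first = selector_list[0].extract()      # the data string of the first Selector
first = selector_list.extract_first()   # same result; returns None if the list is empty
strings = selector_list.extract()       # a list of data strings, one per Selector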

 

Part 3: Persistent Storage with Scrapy

-Terminal-command based (a niche approach, rarely used):

  -Requirement: only the return value of the parse method can be stored, and only into a local text file

  -Note: the file types supported by this kind of persistent storage are limited to:

    'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'

  -Command: scrapy crawl xxx -o filePath

  -Pros: simple, efficient, and convenient

  -Cons: quite limited (data can only be stored into files with the extensions listed above)
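  -Example (with the qiubai spider below; the output filename here is just an example, but its extension must be one of the types above): scrapy crawl qiubai -o ./qiubai.csv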

 qiubai.py

import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Parse: the author's name + the joke's content
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        # print(len(div_list))
        all_data = []  # holds all parsed data
        for div in div_list:
            # xpath returns a list, and the list elements are always Selector objects
            # extract pulls out the string stored in the Selector object's data attribute
            author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # Calling extract on the list itself extracts the data string from every Selector in the list
            content = div.xpath('./a[1]/div/span//text()').extract()
            # Join the list into a single string
            content = ''.join(content)
            dic = {
                'author': author,
                'content': content
            }
            all_data.append(dic)

        return all_data


 

 

-Pipeline-based:

-Workflow:

  -Parse the data

  -Define the corresponding fields in the item class (items.py)

  -Package the parsed data into an item-type object

  -Submit the item-type object to the pipeline for persistent storage

  -In the pipeline class's process_item method, persist the data carried by the received item object (pipelines.py)

  -Enable the pipeline in the configuration file

    

-Pros:

  -Highly general-purpose.

 

Why create an item object? Because pipeline-based persistent storage relies on the process_item method: the parsed data is placed into an item object, and the item object is then submitted to the pipeline for storage.

 

 

 

-Interview question: how do you store one copy of the scraped data locally and another copy in a database?

  -In the pipeline file, each pipeline class corresponds to storing the data to one platform

  -The item submitted by the spider file is received only by the pipeline class that is executed first

  -return item in process_item passes the item on to the next pipeline class to be executed

The qiubai.py code is as follows:

 

import scrapy
from qiubaiPro.items import QiubaiproItem


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    # (the earlier parse that returned all_data for the terminal command is replaced by this version)
    def parse(self, response):
        # Parse: the author's name + the joke's content
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        # print(len(div_list))
        for div in div_list:
            # xpath returns a list, and the list elements are always Selector objects
            # extract pulls out the string stored in the Selector object's data attribute
            author = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span/h2/text()')[0].extract()
            # Calling extract on the list itself extracts the data string from every Selector in the list
            content = div.xpath('./a[1]/div/span//text()').extract()
            # Join the list into a single string
            content = ''.join(content)

            item = QiubaiproItem()
            item['author'] = author
            item['content'] = content

            yield item  # submit the item to the pipeline

 

 

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

import pymysql


class QiubaiproPipeline:
    fp = None

    # Override of the parent-class method: called only once, when the spider starts
    def open_spider(self, spider):
        print('Spider started...')
        self.fp = open('./qiubai.txt', 'w', encoding='utf8')

    # Dedicated to handling item-type objects
    # Receives the item objects submitted by the spider file
    # Called once for every item it receives
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content + '\n')

        # return item passes the item on to the next pipeline class to be executed,
        # so always remember to return item
        return item

    def close_spider(self, spider):
        print('Spider finished...')
        self.fp.close()


# In the pipeline file, one pipeline class corresponds to storing the data to one platform or medium
class mysqlPileLine(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='123456', db='qiubai', charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            # parameterized query, so quotes inside the scraped text do not break the SQL
            self.cursor.execute('insert into qiubai values(%s, %s)',
                                (item['author'], item['content']))
            self.conn.commit()  # commit the transaction
        except Exception as e:
            print(e)
            self.conn.rollback()  # roll back the transaction

        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


# Which pipeline class receives the item submitted by the spider file first?
#   The pipeline class with the highest priority (smallest number in ITEM_PIPELINES) runs first.
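mysqlPileLine assumes that a local MySQL database named qiubai, containing a two-column table qiubai, already exists. A minimal one-off sketch of creating it with pymysql; the column names and types here are assumptions, not part of the original project:

import pymysql

# One-off setup: create the table that mysqlPileLine inserts into.
# Database name, credentials, and column types are assumptions.
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', db='qiubai', charset='utf8')
cursor = conn.cursor()
cursor.execute('create table if not exists qiubai(author varchar(100), content text)')
conn.commit()
cursor.close()
conn.close()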

 

settings.py

# Scrapy settings for qiubaiPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'qiubaiPro'

SPIDER_MODULES = ['qiubaiPro.spiders']
NEWSPIDER_MODULE = 'qiubaiPro.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

LOG_LEVEL = 'ERROR'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'qiubaiPro.middlewares.QiubaiproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'qiubaiPro.middlewares.QiubaiproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # the number is the priority: the smaller the value, the higher the priority
    'qiubaiPro.pipelines.QiubaiproPipeline': 300,
    'qiubaiPro.pipelines.mysqlPileLine': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()