The Scrapy Framework

Part 1: Scrapy Introduction, Installation, and Basic Usage

-What is a framework?

  -A project template that integrates many features and is highly general-purpose.

 

-How do you learn a framework?

  -Focus on learning the detailed usage of each feature the framework encapsulates.

 

-What is Scrapy?

  -A star framework built for web crawling. Features: high-performance persistent storage, asynchronous data downloading, high-performance data parsing, and distributed crawling.

 

-Basic usage of the Scrapy framework

  -Installing the environment:

    -macOS or Linux: pip install scrapy

    -Windows:

      -Download Twisted from:

        http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

      -Install Twisted:

        pip install Twisted-17.1.0-cp36-cp36m-win_amd64.whl

      -pip install pywin32

      -pip install scrapy

      Test: type the scrapy command in a terminal; if no error is reported, the installation succeeded.

 

-Create a project: scrapy startproject xxxPro

-cd xxxPro

-Create a spider file in the spiders subdirectory (a sketch of the generated file follows this list):

  -scrapy genspider spiderName www.xxx.com

-Run the project:

  -scrapy crawl spiderName
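For reference, the spider file produced by scrapy genspider looks roughly like this; the class name and attribute values are derived from the spiderName and www.xxx.com arguments, so this is a sketch of the template rather than project-specific code:

import scrapy


class SpidernameSpider(scrapy.Spider):
    name = 'spiderName'                   # unique name used by scrapy crawl
    allowed_domains = ['www.xxx.com']     # optional filter on the crawled domains
    start_urls = ['http://www.xxx.com/']  # requests sent automatically at startup

    def parse(self, response):
        # called with the response of each URL in start_urls
        pass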

 

Modifying Scrapy's configuration file:


 

After creating a Scrapy spider file in the spiders folder, the necessary settings in the configuration file are as follows (UA spoofing):
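A sketch of the relevant lines in settings.py (the same settings appear in the full settings.py further below): set a browser User-Agent, ignore robots.txt, and show only ERROR-level logs.

# settings.py (excerpt)

# UA spoofing: identify as a regular browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False

# Only print ERROR-level log output
LOG_LEVEL = 'ERROR'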

 

Part 2: Data Parsing with Scrapy

-The basic directory structure of a newly created Scrapy project is shown in the sketch below:
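A sketch of the layout produced by scrapy startproject xxxPro (the role of each file is noted as a comment):

xxxPro/
├── scrapy.cfg            # deployment configuration
└── xxxPro/
    ├── __init__.py
    ├── items.py          # item field definitions
    ├── middlewares.py    # spider / downloader middlewares
    ├── pipelines.py      # item pipelines (persistent storage)
    ├── settings.py       # project configuration
    └── spiders/          # spider files are created here
        └── __init__.py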

 

 qiubai.py

import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Parse: the author's name + the joke's content
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        # print(len(div_list))
        num = 0
        for div in div_list:
            # xpath returns a list, and the list elements are always Selector objects
            # extract pulls out the string stored in the Selector object's data attribute
            author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # Calling extract on the list itself extracts the data string from every Selector in the list
            content = div.xpath('./a[1]/div/span//text()').extract()
            # Join the list into a single string
            content = ''.join(content)
            print("++++++++++++++++++++++++++++++++++++")
            print(num, author, content)
            print("++++++++++++++++++++++++++++++++++++")
            num += 1

Note:

 

 

 

-extract_first() is meant for the case where the list you operate on has exactly one element (it extracts the data of that single Selector).

Summary:

UA spoofing, the robots protocol, and log-level settings are all configured in settings.py, the project's configuration file;

Scrapy spiders must be created under the spiders subdirectory;

The project is run with: scrapy crawl spiderName;

Calling xpath on the response object of a Scrapy spider does not return etree objects but Selector objects; to get the data they hold, simply call extract on the Selector object (see the sketch below);
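A minimal sketch of extract versus extract_first; the //h2/text() expression here is only an illustrative example:

# inside a parse method (or the scrapy shell)
selector_list = response.xpath('//h2/text()')  # a list of Selector objects

first = selector_list[0].extract()      # the data string of the first Selector
first = selector_list.extract_first()   # same result; returns None if the list is empty
strings = selector_list.extract()       # a list of data strings, one per Selector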

 

Part 3: Persistent Storage with Scrapy

-Terminal-command based (a niche approach, rarely used):

  -Requirement: only the return value of the parse method can be stored, and only into a local text file

  -Note: the file types supported by this kind of persistent storage are limited to:

    'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'

  -Command: scrapy crawl xxx -o filePath

  -Pros: simple, efficient, and convenient

  -Cons: quite limited (data can only be stored into files with the extensions listed above)
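  -Example (with the qiubai spider below; the output filename here is just an example, but its extension must be one of the types above): scrapy crawl qiubai -o ./qiubai.csv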

 qiubai.py

import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Parse: the author's name + the joke's content
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        # print(len(div_list))
        all_data = []  # holds all parsed data
        for div in div_list:
            # xpath returns a list, and the list elements are always Selector objects
            # extract pulls out the string stored in the Selector object's data attribute
            author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # Calling extract on the list itself extracts the data string from every Selector in the list
            content = div.xpath('./a[1]/div/span//text()').extract()
            # Join the list into a single string
            content = ''.join(content)
            dic = {
                'author': author,
                'content': content
            }
            all_data.append(dic)

        return all_data


 

 

-Pipeline-based:

-Workflow:

  -Parse the data

  -Define the corresponding fields in the item class (items.py)

  -Package the parsed data into an item-type object

  -Submit the item-type object to the pipeline for persistent storage

  -In the pipeline class's process_item method, persist the data carried by the received item object (pipelines.py)

  -Enable the pipeline in the configuration file

    

-Pros:

  -Highly general-purpose.

 

Why create an item object? Because pipeline-based persistent storage relies on the process_item method: the parsed data is placed into an item object, and the item object is then submitted to the pipeline for storage.

 

 

 

-Interview question: how do you store one copy of the scraped data locally and another copy in a database?

  -In the pipeline file, each pipeline class corresponds to storing the data to one platform

  -The item submitted by the spider file is received only by the pipeline class that is executed first

  -return item in process_item passes the item on to the next pipeline class to be executed

The qiubai.py code is as follows:

 

import scrapy
from qiubaiPro.items import QiubaiproItem


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    # (the earlier parse that returned all_data for the terminal command is replaced by this version)
    def parse(self, response):
        # Parse: the author's name + the joke's content
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        # print(len(div_list))
        for div in div_list:
            # xpath returns a list, and the list elements are always Selector objects
            # extract pulls out the string stored in the Selector object's data attribute
            author = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span/h2/text()')[0].extract()
            # Calling extract on the list itself extracts the data string from every Selector in the list
            content = div.xpath('./a[1]/div/span//text()').extract()
            # Join the list into a single string
            content = ''.join(content)

            item = QiubaiproItem()
            item['author'] = author
            item['content'] = content

            yield item  # submit the item to the pipeline

 

 

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

import pymysql


class QiubaiproPipeline:
    fp = None

    # Override of the parent-class method: called only once, when the spider starts
    def open_spider(self, spider):
        print('Spider started...')
        self.fp = open('./qiubai.txt', 'w', encoding='utf8')

    # Dedicated to handling item-type objects
    # Receives the item objects submitted by the spider file
    # Called once for every item it receives
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content + '\n')

        # return item passes the item on to the next pipeline class to be executed,
        # so always remember to return item
        return item

    def close_spider(self, spider):
        print('Spider finished...')
        self.fp.close()


# In the pipeline file, one pipeline class corresponds to storing the data to one platform or medium
class mysqlPileLine(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='123456', db='qiubai', charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            # parameterized query, so quotes inside the scraped text do not break the SQL
            self.cursor.execute('insert into qiubai values(%s, %s)',
                                (item['author'], item['content']))
            self.conn.commit()  # commit the transaction
        except Exception as e:
            print(e)
            self.conn.rollback()  # roll back the transaction

        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


# Which pipeline class receives the item submitted by the spider file first?
#   The pipeline class with the highest priority (smallest number in ITEM_PIPELINES) runs first.
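mysqlPileLine assumes that a local MySQL database named qiubai, containing a two-column table qiubai, already exists. A minimal one-off sketch of creating it with pymysql; the column names and types here are assumptions, not part of the original project:

import pymysql

# One-off setup: create the table that mysqlPileLine inserts into.
# Database name, credentials, and column types are assumptions.
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', db='qiubai', charset='utf8')
cursor = conn.cursor()
cursor.execute('create table if not exists qiubai(author varchar(100), content text)')
conn.commit()
cursor.close()
conn.close()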

 

settings.py

# Scrapy settings for qiubaiPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'qiubaiPro'

SPIDER_MODULES = ['qiubaiPro.spiders']
NEWSPIDER_MODULE = 'qiubaiPro.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

LOG_LEVEL = 'ERROR'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'qiubaiPro.middlewares.QiubaiproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'qiubaiPro.middlewares.QiubaiproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # the number is the priority: the smaller the value, the higher the priority
    'qiubaiPro.pipelines.QiubaiproPipeline': 300,
    'qiubaiPro.pipelines.mysqlPileLine': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()