Python Crawler #017: The Scrapy Crawler Framework
Scrapy is an application framework written in Python for crawling websites and extracting structured data.
1. Installing Scrapy
- Windows
  Install Scrapy:
  pip install scrapy
  Install pypiwin32:
  pip install pypiwin32
- Ubuntu
  Install the non-Python dependencies:
  sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
  Install Scrapy:
  sudo pip3 install scrapy
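To confirm the installation succeeded, a quick check of the installed version (my addition; running scrapy version on the command line works as well):
```python
import scrapy

print(scrapy.__version__)  # prints the installed Scrapy version
```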
2. Creating a Project
2.1 Setting up the project
- Create the project (run in cmd): change into the directory where the project should live, then run
  scrapy startproject [project name]
- Create the spider (run in cmd): change into the project directory and run:
  scrapy genspider [spider name] [spider domain]
- Note: the domain is not the full URL. A handy domain lookup site is https://tool.chinaz.com/ ; enter a URL and it returns the domain.
  Also, the spider name must not be the same as the project name.
- Example: the URL to crawl is http://xiaohua.zol.com.cn//

2.2 Project structure
Creating the project generates a number of files:
scrapy.cfg: the project's configuration file.
items.py: holds the models for the data the spiders scrape.
middlewares.py: holds the various middlewares.
pipelines.py: used to store the item models to local disk.
settings.py: configuration for this crawler (request headers, how often requests are sent, IP proxy pool, and so on).
spiders folder: all spiders go in here.
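For reference, the directory layout that scrapy startproject generates typically looks like this (shown here for a project named joke; the spider file only appears after scrapy genspider is run):
```
joke/
    scrapy.cfg
    joke/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            xiaohua.py   (created later by scrapy genspider)
```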

2.3 How Scrapy works

- Scrapy Engine: handles the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
- Scheduler: receives the Request objects sent over by the engine, organizes and enqueues them in a certain order, and hands them back to the engine when it asks for them.
- Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the engine, which passes them on to the Spider for processing.
- Spider: processes all Responses, analyzes and extracts data from them, fills the fields the Item needs, and submits any follow-up URLs to the engine, which feeds them back into the Scheduler.
- Item Pipeline: where the Items produced by the Spider are post-processed (detailed analysis, filtering, storage, and so on).
- Downloader Middlewares: think of them as components you can customize to extend the download functionality.
- Spider Middlewares: components that let you customize and hook into the communication between the engine and the Spider, e.g. the Responses going into the Spider and the Requests coming out of it.
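To make the cycle concrete, here is a minimal sketch (not from the original post) of a spider: the dicts it yields flow through the engine to the Item Pipeline, and the Request it yields goes back through the engine to the Scheduler. The spider name and both XPath selectors are assumptions for illustration.
```python
import scrapy


class DemoSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the data flow described above
    name = "demo"
    start_urls = ["http://xiaohua.zol.com.cn/"]

    def parse(self, response):
        # Extracted data -> engine -> Item Pipeline
        for a in response.xpath('//a[@title]'):
            yield {"title": a.xpath('./@title').get()}

        # Follow-up URL -> engine -> Scheduler -> Downloader -> back into parse()
        next_page = response.xpath('//a[@rel="next"]/@href').get()  # assumed selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```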
2.4 Running the project
- From the command line: mind the working directory, and note that scrapy crawl takes the spider's name (e.g. JokeSpider), not the name of the spider class inside the file (e.g. JokespiderSpider).
- From a script: create a start.py file in the same directory as scrapy.cfg and add the code below; the crawler can then be run directly with the IDE's run shortcut.
```python
# Now the crawler can be started straight from the IDE
from scrapy import cmdline

cmdline.execute(["scrapy", "crawl", "spider name"])
```
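As a side note (not covered in the original post), Scrapy also exposes a CrawlerProcess API for driving a spider from a plain Python script; a minimal sketch, assuming it is run from inside the project so the settings can be located:
```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Loads the current project's settings.py
process = CrawlerProcess(get_project_settings())
process.crawl("xiaohua")   # spider name, equivalent to `scrapy crawl xiaohua`
process.start()            # blocks until the crawl is finished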
3. Typical Scrapy Workflow
- Create the project: start a new crawler project.
- Edit settings.py: change the robots.txt compliance flag (ROBOTSTXT_OBEY) to False and add request headers, otherwise the content cannot be crawled.
  Other settings.py parameters (annotated reference):
```python
# ==> Part 1: basic configuration <==
# 1. Project name; the default USER_AGENT is built from it, and it is also used as the logger name
BOT_NAME = 'Amazon'

# 2. Spider module paths
SPIDER_MODULES = ['Amazon.spiders']
NEWSPIDER_MODULE = 'Amazon.spiders'

# 3. Client User-Agent request header
USER_AGENT = 'Amazon (+http://www.yourdomain.com)'

# 4. Whether to obey robots.txt rules
ROBOTSTXT_OBEY = False

# 5. Whether to support cookies (handled via cookiejar); enabled by default
#COOKIES_ENABLED = False

# 6. Telnet console for inspecting and controlling the running crawler: telnet ip port, then issue commands
#TELNETCONSOLE_ENABLED = False
#TELNETCONSOLE_HOST = '127.0.0.1'
#TELNETCONSOLE_PORT = [6023,]

# 7. Default request headers Scrapy uses when sending HTTP requests
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}

# ==> Part 2: concurrency and delays <==
# 1. Maximum total number of concurrent requests the downloader handles; default 16
#CONCURRENT_REQUESTS = 32

# 2. Maximum number of concurrent requests per domain; default 8
#CONCURRENT_REQUESTS_PER_DOMAIN = 16

# 3. Maximum number of concurrent requests per IP; default 0, meaning no limit. Two caveats:
#    I.  If non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is ignored: concurrency is limited per IP, not per domain
#    II. It also affects DOWNLOAD_DELAY: if non-zero, the download delay is enforced per IP rather than per domain
#CONCURRENT_REQUESTS_PER_IP = 16

# 4. If auto-throttling is not enabled, this is a fixed delay (in seconds) between requests to the same site
#DOWNLOAD_DELAY = 3

# ==> Part 3: smart rate limiting / AutoThrottle extension <==
# I. Introduction
#    from scrapy.contrib.throttle import AutoThrottle
#    http://scrapy.readthedocs.io/en/latest/topics/autothrottle.html#topics-autothrottle
#    Goals:
#    1. Be nicer to sites than the fixed default download delay
#    2. Automatically adjust Scrapy to the optimal crawling speed, so the user does not have to tune the
#       download delay by hand; the user only sets the maximum allowed concurrency and the extension does the rest
# II. How does it work?
#    Scrapy measures download latency as the time between establishing the TCP connection and receiving the
#    HTTP headers. Because Scrapy may be busy running spider callbacks or unable to download, measuring these
#    latencies accurately in a cooperative multitasking environment is hard. Still, they are a reasonable proxy
#    for how busy Scrapy (and the server) is, and the extension is written on that premise.
# III. Throttling algorithm: the download delay is adjusted according to these rules:
#    1. Spiders start with a download delay of AUTOTHROTTLE_START_DELAY
#    2. When a response is received, the target delay = response latency / AUTOTHROTTLE_TARGET_CONCURRENCY
#    3. The next download delay is set to the average of the target delay and the previous delay
#    4. The delay is not allowed to decrease until 200 responses have been received
#    5. The delay can never go below DOWNLOAD_DELAY or above AUTOTHROTTLE_MAX_DELAY
# IV. Configuration
# Enable with True; default False
AUTOTHROTTLE_ENABLED = True
# Initial delay
AUTOTHROTTLE_START_DELAY = 5
# Minimum delay
DOWNLOAD_DELAY = 3
# Maximum delay
AUTOTHROTTLE_MAX_DELAY = 10
# Average number of concurrent requests per second; must not exceed CONCURRENT_REQUESTS_PER_DOMAIN or
# CONCURRENT_REQUESTS_PER_IP. Raising it increases throughput and hammers the target site harder;
# lowering it is more "polite" to the target site.
# At any given moment the actual concurrency may be above or below this value; it is a target the crawler
# tries to reach, not a hard limit.
AUTOTHROTTLE_TARGET_CONCURRENCY = 16.0
# Debug output
AUTOTHROTTLE_DEBUG = True
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

# ==> Part 4: crawl depth and crawl order <==
# 1. Maximum depth allowed; the current depth can be inspected via meta; 0 means unlimited
# DEPTH_LIMIT = 3

# 2. Crawl order: 0 = depth-first, LIFO (default); 1 = breadth-first, FIFO
# Last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# First in, first out: breadth-first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 3. Scheduler queue
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler

# 4. URL deduplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

# ==> Part 5: middlewares, pipelines, extensions <==
# 1. Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Amazon.middlewares.AmazonSpiderMiddleware': 543,
#}

# 2. Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'Amazon.middlewares.DownMiddleware1': 543,
}

# 3. Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# 4. Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'Amazon.pipelines.CustomPipeline': 200,
}

# ==> Part 6: HTTP cache <==
"""
1. Enabling the cache
   Caches the requests/responses that have already been sent so they can be reused later.
   from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
   from scrapy.extensions.httpcache import DummyPolicy
   from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# Whether to enable the cache
# HTTPCACHE_ENABLED = True
# Cache policy: cache every request; subsequent identical requests are served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# Cache policy: cache according to HTTP response headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
# Cache expiration time
# HTTPCACHE_EXPIRATION_SECS = 0
# Cache directory
# HTTPCACHE_DIR = 'httpcache'
# HTTP status codes to exclude from caching
# HTTPCACHE_IGNORE_HTTP_CODES = []
# Cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
- Define the target (edit items.py): be explicit about what you want to scrape.
```python
import scrapy


class BmwItem(scrapy.Item):
    # Define the fields to be scraped
    title = scrapy.Field()
    urls = scrapy.Field()
```
- Write the spider (spider.py): request the URLs according to the configuration in settings.py, parse the response content, and hand the data over to the item (see the sketch after this list).
```python
# Change start_urls to the first URL(s) you want to crawl
start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]
```
- Store the content (pipelines.py): the content held by the item is passed to the pipelines, where you write the methods that store the data.
```python
# 1. Import the item class in the spider: from ..items import BmwItem
#    Build the item object and hand the data to the pipeline:
#    item = BmwItem(title=title, urls=urls)
#    yield item
# 2. Enable the pipeline in settings.py (the ITEM_PIPELINES block, around line 68 of the generated file)
# 3. Write the storage logic in the pipeline
```
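Putting the three steps together, a minimal sketch of how a spider yields the BmwItem defined above and how a pipeline receives it. The file name, spider name, XPath selectors, and BmwPipeline class are assumptions for illustration, not taken from the original post.
```python
# spiders/bmw.py -- hypothetical file; assumes it lives inside the project's spiders package
import scrapy
from ..items import BmwItem   # the item class defined in items.py above


class BmwSpider(scrapy.Spider):
    name = "bmw"              # the name used by `scrapy crawl bmw`
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]

    def parse(self, response):
        # The XPath below is an assumption, not verified against the target page
        for img in response.xpath('//img'):
            item = BmwItem()
            item['title'] = img.xpath('./@alt').get()
            item['urls'] = img.xpath('./@src').get()
            yield item        # each yielded item is handed to the enabled pipelines


# pipelines.py -- a minimal pipeline that receives every yielded item once enabled in ITEM_PIPELINES
class BmwPipeline:
    def process_item(self, item, spider):
        print(item['title'], item['urls'])
        return item           # pass the item on to the next pipeline, if any
```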
4. Hands-on Example
4.1 Crawling the xiaohua.zol.com.cn jokes site
- Create the project: the project name and spider name are shown in the screenshot below:

- Edit settings.py; the settings usually changed are:
  - USER_AGENT: set the request header.
  - DEFAULT_REQUEST_HEADERS: default request headers; a user agent can also be added here.
  - ROBOTSTXT_OBEY: change robots.txt compliance to False.
  - ITEM_PIPELINES: configure the pipelines; adding several pipelines lets the same data be stored in different ways at the same time.
  - DOWNLOAD_DELAY: set the download delay; crawling too fast is easy to detect.
```python
# -*- coding: utf-8 -*-

# Scrapy settings for joke project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'joke'

SPIDER_MODULES = ['joke.spiders']
NEWSPIDER_MODULE = 'joke.spiders'

FEED_EXPORT_ENCODING = 'utf-8'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-cn',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3754.400 QQBrowser/10.5.3991.400',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'joke.middlewares.JokeSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'joke.middlewares.JokeDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Priority values range from 0 to 1000; the smaller the value, the earlier the pipeline runs
ITEM_PIPELINES = {
    'joke.pipelines.JokePipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
- Define the target: scrape the joke titles and their links (clicking a title navigates to the joke page, which means the element carries a link attribute).

  Write the items.py code:
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class JokeItem(scrapy.Item):
    # Title
    title = scrapy.Field()
    # Hyperlink
    href = scrapy.Field()
```
- Write the spider file xiaohua.py and parse the content:
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.response.html import HtmlResponse
# Import the class from items.py; "." is the current package, ".." is one level up
from ..items import JokeItem


class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    # Restrict the crawl to this domain (a domain, not a URL)
    allowed_domains = ['xiaohua.zol.com.cn']
    # URLs to start crawling from
    start_urls = ['http://xiaohua.zol.com.cn//']

    def parse(self, response):
        # response is a scrapy.http.response.html.HtmlResponse object, so XPath and CSS selectors work on it
        # (hence the HtmlResponse import above)
        jokes = response.xpath('//ul[@class="news-list video-list"]/li/a')
        print('---------', jokes)
        for joke in jokes:
            title = joke.xpath('./@title')[0].extract()
            # The xpath below only returns the relative path, so prepend the domain
            href = 'http://xiaohua.zol.com.cn/' + joke.xpath('./@href').get()
            print('----------', title, href)
            item = JokeItem()
            item['title'] = title
            item['href'] = href
            # Hand the data over via the item
            yield item
```
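To crawl more than the first page, the parse method above could also queue follow-up pages. A small fragment to drop at the end of parse(); the "next page" XPath is an assumption rather than something verified against the site:
```python
# At the end of XiaohuaSpider.parse(), after yielding the items on the current page:
next_href = response.xpath('//a[contains(@class, "next")]/@href').get()  # assumed selector
if next_href:
    # response.follow resolves relative URLs and hands the new Request back to the Scheduler
    yield response.follow(next_href, callback=self.parse)
```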
- Store the content: write pipelines.py and design the pipeline that stores what was scraped.
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json  # import whatever libraries you need yourself


class JokePipeline(object):
    def __init__(self):
        # Set the storage path
        self.fp = open(r'D:\joke.txt', mode='a', encoding='utf-8')

    def open_spider(self, spider):
        print('Crawl started -----------')

    # Process the item's data
    def process_item(self, item, spider):
        # Convert the incoming item to a dict, then serialize it to JSON (writing the dict directly raises an error)
        # ensure_ascii=False avoids mojibake for non-ASCII text
        item_json = json.dumps(dict(item), ensure_ascii=False)
        print(item_json)
        self.fp.write(item_json + '\n')
        # If JSON is not wanted, the json import is unnecessary:
        # self.fp.write('%s%s' % (item['title'], item['href'] + '\n'))
        # After processing, pass the item on to the next pipeline
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('Crawl finished -----------')
```
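As mentioned above, several pipelines can be enabled at once with different priorities, so the same data can be stored in more than one way. A sketch of a second, hypothetical pipeline that writes the same items with Scrapy's built-in JsonItemExporter; the class name, file path, and priority value are assumptions:
```python
# pipelines.py (an additional class alongside JokePipeline)
from scrapy.exporters import JsonItemExporter


class JokeJsonExportPipeline(object):     # hypothetical second pipeline
    def open_spider(self, spider):
        self.file = open('joke_export.json', 'wb')  # the exporter needs a binary file
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()


# settings.py: both pipelines receive every item; lower numbers run first
# ITEM_PIPELINES = {
#     'joke.pipelines.JokePipeline': 300,
#     'joke.pipelines.JokeJsonExportPipeline': 400,
# }
```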
4.2 Troubleshooting
- Scrapy's error messages can sometimes be rather cryptic; it is best to add plenty of output statements so you can confirm the data flow is behaving.
- Error:
  return (yield download_func(request=request, spider=spider))
  twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'unsafe legacy renegotiation disabled')]>]
  (1) The request headers may not be set correctly, so access is being refused.
  (2) It may be a cryptography version problem. Check the installed version from the command line,

  then uninstall the old version and install a pinned one:
  pip uninstall cryptography
  pip install cryptography==36.0.2
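A quick way to check which cryptography version is installed (my addition; pip show cryptography works as well):
```python
import cryptography

print(cryptography.__version__)  # should report 36.0.2 after the reinstall
```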
This article is from cnblogs (博客园), author: {枫_Null}. When reposting, please credit the original link: https://www.cnblogs.com/fengNull/articles/16658473.html
