Downloader Middleware

Downloader middleware is a framework of hooks into Scrapy's request/response processing: a light, low-level system for globally modifying Scrapy's requests and responses.

That description sounds convoluted, but in plain terms it means: swapping proxy IPs, swapping cookies, swapping the User-Agent, and automatic retries.

1 Write the middleware in middlewares.py (the class name can be anything you like)

2 Activate the downloader middleware

Enable it in settings.py so the configuration takes effect:
SPIDER_MIDDLEWARES = {
    'cnblogs_crawl.middlewares.CnblogsCrawlSpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'cnblogs_crawl.middlewares.CnblogsCrawlDownloaderMiddleware': 543,
}
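The numbers are priorities: within a given middleware type, a lower number sits closer to the engine, so its process_request runs earlier and its process_response runs later (setting a value to None disables that middleware). A tiny sketch of how priority maps to call order, in plain Python with illustrative names:

```python
# Sketch: how middleware priority translates into call order.
# Lower priority number = closer to the engine: its process_request
# runs first on the way out, and its process_response runs last on
# the way back.  (The names here are illustrative, not a real project.)
middlewares = {
    'UserAgentMiddleware': 500,
    'RetryMiddleware': 550,
    'ProxyMiddleware': 750,
}

request_order = sorted(middlewares, key=middlewares.get)
response_order = list(reversed(request_order))

print(request_order)   # process_request order (ascending priority)
print(response_order)  # process_response order (descending priority)
```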


Uses of downloader middleware

1. In process_request, implement a custom download instead of using Scrapy's downloader
2. Post-process the request, for example:
   - set request headers
   - set cookies
   - add a proxy

Scrapy's built-in proxy component:

from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from urllib.request import getproxies
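HttpProxyMiddleware is enabled by default and picks up proxies from the standard *_proxy environment variables through urllib's helpers. A quick way to inspect what it would see (using getproxies_environment, which reads only environment variables, so the result is deterministic; the proxy address below is a placeholder):

```python
import os
from urllib.request import getproxies_environment

# Simulate the environment the proxy helpers would read.
# The address is a placeholder, not a real proxy server.
os.environ['http_proxy'] = 'http://127.0.0.1:8888'

proxies = getproxies_environment()
print(proxies.get('http'))  # http://127.0.0.1:8888
```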

# Called for every request that needs to be downloaded, by each downloader middleware's process_request in turn
-process_request: (request going out)
    # return None: keep processing this request and move on to the next middleware
    # return Response: this request is finished; the Response is handed to the engine (you can fetch the page yourself and wrap it in a Response)
    # return Request: the Request is handed back to the engine, which reschedules it
    # raise an exception: process_exception runs


# Called on the way back, with the response returned by the downloader (or by a middleware)
-process_response: (response coming back)
    # return a Response object: keep processing this Response through the remaining middlewares
    # return a Request object: hand it back to the engine for rescheduling
    # or raise IgnoreRequest: process_exception runs


# Called when the download handler or a downloader middleware's process_request() raises an exception
-process_exception: (on exception)
    # return None: continue processing this exception
    # return a Response object: stops the process_exception() chain; the Response goes to the engine (and on to the spider)
    # return a Request object: stops the process_exception() chain; the Request goes to the engine (rescheduled)
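The dispatch rules above can be sketched without Scrapy. The toy engine below (hypothetical names, not Scrapy's real engine code) walks a request through a middleware chain and branches on each return value exactly as described:

```python
# Minimal simulation of the downloader-middleware contract.
# Request/Response are stand-ins for scrapy.Request / scrapy.http.Response.
class Request:
    def __init__(self, url):
        self.url = url

class Response:
    def __init__(self, url, body):
        self.url, self.body = url, body

class BlockLocalhost:
    """Returns its own Response for localhost URLs, short-circuiting the download."""
    def process_request(self, request):
        if 'localhost' in request.url:
            return Response(request.url, b'served from middleware')
        return None  # None: fall through to the next middleware

def download(request, middlewares):
    for mw in middlewares:
        result = mw.process_request(request)
        if isinstance(result, Response):
            return result                         # Response: request ends here
        if isinstance(result, Request):
            return download(result, middlewares)  # Request: rescheduled
    return Response(request.url, b'downloaded normally')

print(download(Request('http://localhost/x'), [BlockLocalhost()]).body)
print(download(Request('http://example.com'), [BlockLocalhost()]).body)
```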

Adding a proxy and changing the UA in middleware

class TttDownloaderMiddleware(object):

    def get_proxy(self):
        # Fetch a proxy address from a locally running proxy-pool service
        import requests
        res = requests.get('http://0.0.0.0:5010/get').json()['proxy']
        print(res)
        return res

    def process_request(self, request, spider):
        # 1. Set cookies (request.cookies holds the cookies sent to the site)
        # request.cookies = {'name': 'value'}

        # 2. Add a proxy
        request.meta['proxy'] = self.get_proxy()

        # 3. Change the UA
        from fake_useragent import UserAgent
        ua = UserAgent(verify_ssl=False)
        request.headers['User-Agent'] = ua.random

        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass
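The example above depends on a locally running proxy pool (port 5010) and the fake_useragent package. If all you need is random User-Agent rotation, a dependency-free sketch (the UA strings are just samples) could look like this:

```python
import random

# A small hand-picked pool; in practice you would maintain a longer list.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

class RandomUADownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Overwrite the User-Agent header on every outgoing request.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None

# Quick check with a stand-in request object (not a real scrapy.Request):
class _FakeRequest:
    def __init__(self):
        self.headers = {}

req = _FakeRequest()
RandomUADownloaderMiddleware().process_request(req, spider=None)
print(req.headers['User-Agent'])
```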

Spider Middleware

Spider middleware methods

```
from scrapy import signals

class SpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        # spider_opened fires when the current spider starts running
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        # spider.logger.info('I am egon\'s spider 1: %s' % spider.name)
        print('I am egon\'s spider 1: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        # Must return only requests (not items).
        print('start_requests1')
        for r in start_requests:
            yield r

    def process_spider_input(self, response, spider):
        # Called for each response passing through the spider middleware
        # on its way into the spider.

        # Return value: should return None or raise an exception.
        # 1. None: continue with the other middlewares' process_spider_input
        # 2. Raise an exception:
        #    the remaining process_spider_input methods are skipped,
        #    and the errback bound to the request is triggered;
        #    the errback's return value is passed back through the
        #    middlewares' process_spider_output in reverse order;
        #    if no errback is found, the middlewares' process_spider_exception
        #    run in reverse order instead

        print("input1")
        return None

    def process_spider_output(self, response, result, spider):
        # Must return an iterable of Request, dict or Item objects.
        print('output1')

        # Yielding multiple times is equivalent to returning once.
        # If you are not fluent with generators (a function containing yield
        # returns a generator and does not run immediately), the generator
        # form can mislead you about the middleware execution order.
        # for i in result:
        #     yield i
        return result

    def process_spider_exception(self, response, exception, spider):
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        print('exception1')
```

When the spider starts and when the initial requests are generated

```
from scrapy import signals

class SpiderMiddleware1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        # spider_opened fires when the current spider starts running
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        print('I am spider 1: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        # Must return only requests (not items).
        print('start_requests1')
        for r in start_requests:
            yield r


class SpiderMiddleware2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        print('I am spider 2: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        print('start_requests2')
        for r in start_requests:
            yield r
```

#Step 3: analyze the output
#1. As soon as the spider starts:

I am spider 1: baidu
I am spider 2: baidu

#2. Then an initial request is generated and passes through spider middlewares 1 and 2 in turn:
start_requests1
start_requests2

When process_spider_input returns None

```
from scrapy import signals

class SpiderMiddleware1(object):

    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')


class SpiderMiddleware2(object):

    def process_spider_input(self, response, spider):
        print("input2")
        return None

    def process_spider_output(self, response, result, spider):
        print('output2')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')
```

#Step 3: analyze the output

#1. The response passes through spider middlewares 1 and 2 in turn:
input1
input2

#2. After the spider finishes, the result passes back through middlewares 2 and 1:
output2
output1
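That symmetric in/out order can be reproduced with a small chain simulation (plain Python with hypothetical names, no Scrapy required):

```python
# process_spider_input runs in declared order; process_spider_output
# runs in reverse, mirroring the trace above.
calls = []

class MW:
    def __init__(self, n):
        self.n = n
    def process_spider_input(self, response):
        calls.append('input%d' % self.n)
    def process_spider_output(self, response, result):
        calls.append('output%d' % self.n)
        return result

def pass_through(response, middlewares):
    for mw in middlewares:                # inbound: 1, 2
        mw.process_spider_input(response)
    result = ['item']                     # what the spider callback returned
    for mw in reversed(middlewares):      # outbound: 2, 1
        result = mw.process_spider_output(response, result)
    return result

pass_through('response', [MW(1), MW(2)])
print(calls)  # ['input1', 'input2', 'output2', 'output1']
```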

When process_spider_input raises an exception

```
from scrapy import signals

class SpiderMiddleware1(object):

    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')


class SpiderMiddleware2(object):

    def process_spider_input(self, response, spider):
        print("input2")
        raise TypeError('input2 raised an exception')

    def process_spider_output(self, response, result, spider):
        print('output2')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')


class SpiderMiddleware3(object):

    def process_spider_input(self, response, spider):
        print("input3")
        return None

    def process_spider_output(self, response, result, spider):
        print('output3')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception3')
```

#Output
input1
input2
exception3
exception2
exception1

#Analysis:
#1. The response passes middleware 1's process_spider_input, which returns None, so it continues to middleware 2's process_spider_input
#2. Middleware 2's process_spider_input raises an exception, so the remaining process_spider_input calls are skipped and the exception is handed to the errback bound to this request in the spider
#3. No errback is found, so the response is handled by neither the spider's normal callback nor an errback; the spider does nothing at all, and process_spider_exception then runs in reverse order
#4. If a process_spider_exception returns None, it declines responsibility: it does not handle the exception but passes it to the next process_spider_exception; if they all return None, the exception is finally raised by the Engine
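This fallback path can be simulated too: when an input hook raises and no errback exists, the exception hooks run in reverse, and if they all return None the exception finally escapes to the caller (standing in for the Engine here). A sketch under those assumptions, with hypothetical names:

```python
calls = []

class MW:
    def __init__(self, n, raises=False):
        self.n, self.raises = n, raises
    def process_spider_input(self, response):
        calls.append('input%d' % self.n)
        if self.raises:
            raise TypeError('input%d raised' % self.n)
    def process_spider_exception(self, response, exception):
        calls.append('exception%d' % self.n)
        return None  # None = decline to handle, pass it on

def pass_through(response, middlewares, errback=None):
    try:
        for mw in middlewares:
            mw.process_spider_input(response)
    except Exception as exc:
        if errback is not None:
            return errback(exc)
        for mw in reversed(middlewares):  # exception chain runs in reverse
            result = mw.process_spider_exception(response, exc)
            if result is not None:
                return result
        raise  # everyone returned None: the "Engine" re-raises

try:
    pass_through('response', [MW(1), MW(2, raises=True), MW(3)])
except TypeError:
    pass

print(calls)  # ['input1', 'input2', 'exception3', 'exception2', 'exception1']
```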

Specifying an errback

```
#Step 1: spider.py
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def start_requests(self):
        yield scrapy.Request(url='http://www.baidu.com/',
                             callback=self.parse,
                             errback=self.parse_err,
                             )

    def parse(self, response):
        pass

    def parse_err(self, res):
        # res is the failure info; the exception has now been handled by this
        # function, so it is no longer propagated, and process_spider_output runs next.
        # Extract the useful data from the failure and return it as an iterable,
        # which waits in the pipeline to be consumed by process_spider_output.
        return [1, 2, 3, 4, 5]
```



#Step 2:
'''
Uncomment in settings.py:
SPIDER_MIDDLEWARES = {
   'Baidu.middlewares.SpiderMiddleware1': 200,
   'Baidu.middlewares.SpiderMiddleware2': 300,
   'Baidu.middlewares.SpiderMiddleware3': 400,
}

'''

#Step 3: middlewares.py

```
from scrapy import signals

class SpiderMiddleware1(object):

    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1', list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')


class SpiderMiddleware2(object):

    def process_spider_input(self, response, spider):
        print("input2")
        raise TypeError('input2 raised an exception')

    def process_spider_output(self, response, result, spider):
        print('output2', list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')


class SpiderMiddleware3(object):

    def process_spider_input(self, response, spider):
        print("input3")
        return None

    def process_spider_output(self, response, result, spider):
        print('output3', list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception3')
```



#Step 4: analyze the output
input1
input2
output3 [1, 2, 3, 4, 5]  #parse_err's return value goes into the pipeline and can only be consumed once; inside output3 you could build a new request from the failure info
output2 []
output1 []
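The empty lists for output2 and output1 are ordinary generator exhaustion: list(result) inside output3 drains the iterable, so the middlewares after it see nothing. A standalone demonstration:

```python
# A generator (like the result passed to process_spider_output)
# can only be consumed once.
def items():
    yield from [1, 2, 3, 4, 5]

result = items()
print(list(result))  # [1, 2, 3, 4, 5]  -- the first consumer drains it
print(list(result))  # []               -- later consumers see nothing
```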
posted on 2020-04-11 21:34 by Rannie`