Learning Python: Python Crawling with the Scrapy Framework (4): Scrapy Middleware

Scrapy Framework

  This post covers Scrapy's middleware system and how to write your own middleware:

  1. A custom proxy middleware (registered as a downloader middleware);

  2. Downloader middleware;

  3. Spider middleware;

12 Scrapy Request Proxies

    Setting an HTTP proxy for a request really comes down to setting two request attributes: the meta parameter (the proxy URL) and the headers parameter (the proxy credentials).

    Register the proxy middleware in settings.py:

Example:

DOWNLOADER_MIDDLEWARES = {

   # 'Pro_scrapy.middlewares.ProScrapyDownloaderMiddleware': 543,

   'Pro_scrapy.middlewares.ProxyMiddleware':500,

}

 

     The custom proxy middleware itself:

Example:

import random
import base64


# Custom proxy middleware
class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # Proxy pool; add more entries as needed
        PROXYS = [
            {'ip_port': '125.108.89.175:9000', 'user_pass': ''},
        ]
        # Pick a proxy at random
        proxy = random.choice(PROXYS)
        if proxy['user_pass'] != '':
            print('user:', proxy['ip_port'])
            # Proxy requires a username and password
            request.meta['proxy'] = "http://%s" % proxy["ip_port"]
            # HTTP Basic auth: "Basic " + base64("user:password");
            # b64encode (unlike encodebytes) adds no trailing newline
            encoded_info = base64.b64encode(proxy['user_pass'].encode('utf-8'))
            request.headers['Proxy-Authorization'] = b'Basic ' + encoded_info
        else:
            print('no user:', proxy['ip_port'])
            # Proxy without credentials
            request.meta['proxy'] = "http://%s" % proxy["ip_port"]
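One subtle point above is the Proxy-Authorization value: it must carry the `Basic ` scheme prefix followed by the base64 of `user:password`, and the encoding must not append a newline (the `base64.encodebytes` often seen in older examples does, which corrupts the header). A minimal framework-free sketch of building that value:

```python
import base64

def proxy_auth_header(user_pass: str) -> bytes:
    """Build an HTTP Basic Proxy-Authorization value from a raw
    "user:password" string. b64encode adds no trailing newline."""
    return b'Basic ' + base64.b64encode(user_pass.encode('utf-8'))

print(proxy_auth_header('user:pass'))  # b'Basic dXNlcjpwYXNz'
```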

 

13 Downloader Middleware

Register the downloader middlewares in settings.py:

Example:

DOWNLOADER_MIDDLEWARES = {

   'Pro_scrapy.middlewares.ProScrapyDownloaderMiddleware': 400,

   'Pro_scrapy.middlewares.ProxyMiddleware':500,

}

The downloader middleware skeleton (as generated by the project template):

Example:

from scrapy import signals


class ProScrapyDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

 

  A downloader middleware first runs process_request. If no process_request produces a response itself, the download happens and the result goes to process_response; process_response then returns the response, which is passed through the process_response of any remaining middlewares, then through the spider middleware, and finally to the spider's parse method.

Return values of process_request:

Return None: continue on to the remaining middlewares and the downloader;

Return a Response: skip the remaining process_request calls (and the download itself) and go straight to process_response;

Return a Request: stop the middleware chain and hand the new request back to the scheduler;

Raise IgnoreRequest: stop process_request and invoke process_exception;
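The dispatch rules above can be sketched without Scrapy itself, using stand-in Request/Response classes (hypothetical names, not Scrapy's real objects) to show how a None return falls through while a Response short-circuits the download:

```python
class Request:            # stand-in for scrapy.Request
    def __init__(self, url):
        self.url = url

class Response:           # stand-in for scrapy.http.Response
    def __init__(self, url, body=b''):
        self.url, self.body = url, body

def run_process_request_chain(middlewares, request, download):
    """Walk process_request across middlewares, honouring return values."""
    for mw in middlewares:
        result = mw.process_request(request)
        if result is None:
            continue                       # None: fall through to next middleware
        if isinstance(result, Response):
            return result                  # Response: skip the download entirely
        if isinstance(result, Request):
            return ('reschedule', result)  # Request: hand back to the scheduler
    return download(request)               # no middleware intervened: download

class NoOp:
    def process_request(self, request):
        return None

class Cache:
    def process_request(self, request):
        return Response(request.url, b'cached')  # short-circuits the chain

req = Request('http://example.com')
resp = run_process_request_chain([NoOp(), Cache()], req,
                                 lambda r: Response(r.url, b'downloaded'))
print(resp.body)   # b'cached' -- the downloader is never called
```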

 

    Return values of process_response:

    Return a response: passed to the process_response of any remaining middlewares; if there are none, handed to the spider's parse method;

    Return a request: stop the middleware chain and put the request back into the scheduler queue;

    Raise IgnoreRequest: stop process_response and invoke process_exception;

    Return values of process_exception:

    Return None: hand the exception on to the remaining middlewares;

    Return a response: stop the remaining process_exception calls.
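As a concrete case of the process_exception rules, a retry-style middleware can swallow a download error by returning a fresh Request, or return None to let later middlewares (or the error callback) deal with it. A framework-free sketch with a stand-in Request class (not Scrapy's API):

```python
class Request:            # stand-in for scrapy.Request
    def __init__(self, url, retries=0):
        self.url, self.retries = url, retries

def process_exception(request, exception, max_retries=2):
    """Return a Request to reschedule a retry; return None to pass
    the exception on down the middleware chain."""
    if request.retries < max_retries:
        return Request(request.url, request.retries + 1)   # retry
    return None                                            # give up

r = Request('http://example.com')
retry = process_exception(r, TimeoutError())
print(retry.retries)   # 1
```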

 

14 Spider Middleware

Register it in settings.py:

SPIDER_MIDDLEWARES = {

   'Pro_scrapy.middlewares.ProScrapySpiderMiddleware': 543,

}

The spider middleware skeleton:

Example:

from scrapy import signals


class ProScrapySpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

process_spider_input is called with each response coming out of the downloader middlewares, before it reaches the spider's parse method;

process_spider_output is called with whatever parse returns, i.e. the follow-up requests and the items headed for the pipelines.
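process_spider_output is therefore a natural place to filter or post-process what parse yields before it reaches the pipelines. A sketch of such a filter, using plain dicts in place of Scrapy items (the method signature matches the skeleton above, but the surrounding class is omitted):

```python
def process_spider_output(response, result, spider=None):
    """Drop items missing a 'title' field; pass everything else on."""
    for item in result:
        if isinstance(item, dict) and not item.get('title'):
            continue          # filtered out before it reaches the pipelines
        yield item

parsed = [{'title': 'ok', 'url': 'http://a'}, {'title': '', 'url': 'http://b'}]
kept = list(process_spider_output(None, parsed))
print(len(kept))   # 1
```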

 

posted @ 2021-01-07 20:40  渔歌晚唱